Cloud Cost Optimization - Kubernetes Focused

Sep 22, 2024


A hands-on guide to optimizing your cloud infrastructure costs, with a special emphasis on Kubernetes.

Why the post?

With all the recent chatter (#awsexodus) about moving away from the cloud and back to self-hosting because of rising costs, I thought it would be nice to share how we, a payment gateway in Malaysia, manage our assets in the cloud, with a special emphasis on a tool that everyone is using these days: Kubernetes.

This post specifically focuses on Alibaba Cloud, but the concepts discussed here are applicable to all cloud providers.

Although cloud services can be expensive, they offer remarkable convenience and are particularly advantageous for businesses that must handle high traffic, such as eCommerce, while also fulfilling compliance needs. So, without further ado, let’s get started!

Disclaimer

The views expressed here are my own and do not reflect those of the organization I work for.

1 - Know your business & requirements

Having worked with clients in banking and eCommerce, I’ve often been frustrated to see infrastructure engineers overlook what their application teams actually need. This often results in oversized servers being set up for simple single-page applications. To make matters worse, people hesitate to decommission servers because no one really knows what is running on them, and shutting one down might disrupt other essential services.

Beyond the server costs themselves, there are additional expenses like server snapshots, logs, database backups, and database logs. These “hidden costs” often get overlooked by engineers. It’s important to ask ourselves whether we’re managing these effectively. For instance, do you really need to store server logs in the LogStore (or AWS CloudWatch Logs)? If so, for how long?

Pricing for logs and storage classes in the cloud can be quite complex, as different classes come with varying costs and performance levels. Customers are charged based on several factors, including data ingestion, storage size, and even the cost of reading logs.

If your business doesn’t require active log searches, or at least doesn’t need to search logs from the past three months, you should consider sending your logs to a storage bucket.

AliCloud Log Pricing

At Revenue Monster, we are required to report to regulatory bodies when an investigation calls for it. To meet our audit requirements at a justifiable cost, we store our logs permanently in a Deep Cold Archive bucket. It would cost us thousands of dollars if we kept them in hot log storage or a standard bucket.

Keep an eye on the retention period, especially if your business doesn’t need to hold onto that data for long.

AliCloud Storage Classes
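To make the trade-off concrete, here is a rough sketch of the arithmetic in Python. The per-GB prices are made-up placeholders, not AliCloud's actual rates, and the function names are mine; plug in the numbers from your provider's pricing page.

```python
from __future__ import annotations


def monthly_storage_cost(size_gb: float, price_per_gb_month: float) -> float:
    """Monthly cost of keeping `size_gb` of logs in one storage class."""
    return size_gb * price_per_gb_month


# Hypothetical prices in USD per GB per month (placeholders, NOT real rates).
HYPOTHETICAL_PRICES = {
    "hot_log_storage": 0.10,
    "standard_bucket": 0.02,
    "deep_cold_archive": 0.001,
}


def compare_classes(size_gb: float, prices: dict[str, float]) -> dict[str, float]:
    """Return the monthly cost per storage class, cheapest first."""
    costs = {name: monthly_storage_cost(size_gb, p) for name, p in prices.items()}
    return dict(sorted(costs.items(), key=lambda item: item[1]))


if __name__ == "__main__":
    # 50,000 GB is an arbitrary example volume.
    for name, cost in compare_classes(50_000, HYPOTHETICAL_PRICES).items():
        print(f"{name}: ${cost:,.2f}/month")
```

This only covers storage at rest. Ingestion, retrieval, and minimum-retention charges vary by class and are deliberately left out, so treat the output as a first-pass comparison.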

2 - Know your traffic

Running applications in the cloud is all about managing traffic, both incoming (ingress) and outgoing (egress), and that comes with its own costs.

AliCloud Traffic Model

Ingress traffic is actually free in most clouds, and this is something we can take advantage of.

For instance, in a three-tier application, you have a web app, a backend, and a database, all of which need to communicate with each other. One effective way to cut costs and reduce latency is to make HTTP calls internally, avoiding the need to go out to the internet. Applications within a Kubernetes cluster can reach each other through a Service, which gives each application a stable in-cluster address. From the web application’s point of view, calling the backend feels almost like calling localhost.
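As a minimal sketch of what "calling internally" looks like, the helper below builds the in-cluster DNS name of a Service. The `service_url` function and the "backend" and "payments" names are hypothetical, and `cluster.local` is the default cluster domain, which your cluster may override.

```python
def service_url(service: str, namespace: str, port: int, path: str = "/",
                cluster_domain: str = "cluster.local") -> str:
    """Build the in-cluster URL for a Kubernetes Service.

    Cluster DNS resolves <service>.<namespace>.svc.<cluster_domain> to the
    Service, so the request stays on the cluster network.
    """
    if not path.startswith("/"):
        path = "/" + path
    return f"http://{service}.{namespace}.svc.{cluster_domain}:{port}{path}"


# Hypothetical example: a "backend" Service in the "payments" namespace.
BACKEND_ORDERS_URL = service_url("backend", "payments", 8080, "api/orders")
```

The web app would then call `BACKEND_ORDERS_URL` instead of the public domain, so the request skips the load balancer and NAT Gateway.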

2.1 - CDN to the rescue

CDNs offer competitive pricing packages across regions, and you get Anti-DDoS protection. How cool is that?!

One of our Next.js projects was misconfigured and ended up costing us quite a bit. The project involved a lot of images, and one of the engineers responsible for the deployment didn’t think it through and included all the images in the deployment. As a result, the images were served through the load balancer, which incurred unnecessary costs.

If this sounds familiar, check the Network tab in your browser’s developer tools to see where your assets are coming from.
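If you have a long list of asset URLs (copied out of the browser, for example), a few lines of Python can do the same check in bulk. This is only a sketch: the function names and the example hostnames are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def count_assets_by_host(asset_urls):
    """Count how many asset requests go to each hostname."""
    return Counter(urlparse(url).hostname for url in asset_urls)


def assets_bypassing_cdn(asset_urls, cdn_hosts):
    """Return the asset URLs that are NOT served from any host in `cdn_hosts`."""
    return [url for url in asset_urls if urlparse(url).hostname not in cdn_hosts]


if __name__ == "__main__":
    # Hypothetical URLs; replace with the ones your pages actually load.
    urls = [
        "https://cdn.example.com/img/hero.png",
        "https://app.example.com/_next/static/media/banner.png",
    ]
    for url in assets_bypassing_cdn(urls, {"cdn.example.com"}):
        print("served by the origin, not the CDN:", url)
```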

2.2 - Endpoint Gateway

Managed cloud services such as OSS (the equivalent of an S3 bucket), ACR, and RDS usually come with a VPC endpoint by default. Using these endpoints lowers costs because the traffic is not routed through the NAT Gateway.

2.3 - Cloud Registry

If your application stores Docker images in a cloud registry, make sure to push and pull them via the internal VPC endpoint to save cost. This lets you read from and write to ACR/ECR without going through the NAT Gateway, which further reduces the NAT Gateway bill.
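One way to catch images that still point at the public endpoint is to rewrite the references before they reach your manifests. The sketch below assumes ACR hostnames follow the `registry.<region>.aliyuncs.com` (public) and `registry-vpc.<region>.aliyuncs.com` (VPC) pattern; that is an assumption, so check the exact endpoints shown in your registry console. The function name and the image path are hypothetical.

```python
def to_vpc_registry(image: str) -> str:
    """Rewrite a public ACR image reference to its VPC endpoint.

    Assumes the hostname pattern registry.<region>.aliyuncs.com (public)
    and registry-vpc.<region>.aliyuncs.com (VPC). Anything that does not
    match the public pattern is returned unchanged.
    """
    host, sep, rest = image.partition("/")
    if sep and host.startswith("registry.") and host.endswith(".aliyuncs.com"):
        host = "registry-vpc." + host[len("registry."):]
    return host + sep + rest


if __name__ == "__main__":
    # Hypothetical image reference.
    print(to_vpc_registry("registry.ap-southeast-3.aliyuncs.com/acme/api:1.2.0"))
```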

3 - Design thoughtfully & carefully

Cloud providers often enable certain features by default to make it easier for new customers to get started. However, this convenience can come at a cost.

For example, Alibaba Cloud Container Service for Kubernetes (ACK) automatically enables options to create a NAT Gateway and a Classic Load Balancer (CLB) when setting up a Kubernetes cluster. The downside is that if the person creating the cluster isn’t familiar enough with the cloud, they might end up with unnecessary assets that could have been provisioned better.

AliCloud Traffic Model

4 - Unobserved systems lead to unknown cost

Building observability stacks isn’t cheap, as it involves tools like Prometheus, Loki, and Elasticsearch, all of which consume log, storage, and compute resources. The infrastructure team can easily find themselves overwhelmed by a sea of metrics and noise. It’s definitely not a trivial or inexpensive task for growing organizations, as it requires both expertise and investment in the right tools.

If you can’t measure it, you can’t manage it.

So, big bosses, don’t skimp on observability stacks, okay?

5 - Start small and scale diligently

Dear readers, you might be surprised by the server specs that enterprise clients use to host their applications. I’ve seen a customer running an ecs.g7.16xlarge machine with 64 cores and 256 GB of memory just to host a PHP application!

Modern applications typically don’t require hefty servers. It’s best to start with minimal specifications and scale as needed. If you keep hitting bottlenecks, or the specs you need become excessive, it’s time to revisit the application architecture instead.

If you are using Kubernetes, you can scale the application at both the node and the pod level, horizontally and vertically. Please let me know if you need a consultation.
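For horizontal pod scaling, the core rule documented for the Kubernetes Horizontal Pod Autoscaler is desired = ceil(currentReplicas × currentMetric / targetMetric). Here is a simplified sketch of that rule. The function name and the default bounds are my own, and the real autoscaler also applies a tolerance band and stabilization windows that are ignored here.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Replica count suggested by the basic HPA formula.

    desired = ceil(current_replicas * current_metric / target_metric),
    clamped to [min_replicas, max_replicas]. Tolerance and stabilization
    are intentionally left out of this sketch.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 2 pods averaging 90% CPU against a 60% target.
    print(desired_replicas(2, 90, 60))
```

Starting with a low `min_replicas` and letting the metric drive the count upward is the "start small" idea applied at the pod level.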

6 - Serverless Architecture

Once you’ve understood your business applications and how they behave in different environments, it may be a good time to rethink the deployment architecture.

If any of your running microservices does not require long-running processes, it may be cost-effective to run them in a serverless setup such as AliCloud ASK.

A great example is the GitLab runner. We don’t need the runner to be active 24/7; it’s only necessary during deployments. Performance isn’t a major concern, and it’s inherently scalable.

However, there are some scenarios where a serverless setup isn’t ideal:

  • Long-running processes
  • Custom hardening requirements on the OS images
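A quick way to sanity-check whether a workload belongs on serverless is to work out the break-even point: how many busy hours per month it takes before an always-on node becomes the cheaper option. The sketch below uses hypothetical hourly prices (not AliCloud rates), assumes simple per-hour billing on both sides, and the function name is mine.

```python
def break_even_hours(always_on_hourly: float, serverless_hourly: float,
                     hours_in_month: float = 730) -> float:
    """Busy hours per month above which the always-on node is cheaper.

    The always-on node is billed for every hour of the month; the
    serverless option is billed only for the hours it actually runs.
    """
    if serverless_hourly <= 0:
        raise ValueError("serverless_hourly must be positive")
    return always_on_hourly * hours_in_month / serverless_hourly


if __name__ == "__main__":
    # Hypothetical prices in USD per hour (placeholders, NOT real rates).
    hours = break_even_hours(always_on_hourly=0.10, serverless_hourly=0.25)
    print(f"Serverless is cheaper below about {hours:.0f} busy hours per month")
```

A CI runner that is busy for a few dozen hours a month sits well under a threshold like this, which is why it makes a good serverless candidate.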

7 - Resource Group

Things can get quite complicated as we provision more resources and invest further in the cloud. By categorizing assets into resource groups, you can gain a clearer, holistic view of how much each business group is spending in the cloud. This can then be translated into technical operational expenses (OpEx).
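Once assets are tagged with a resource group, rolling up spend per group is straightforward. Here is a minimal sketch that assumes you have exported bill items as dictionaries with `resource_group` and `cost` keys; those field names, the group names, and the function are hypothetical, not a provider's billing API.

```python
from collections import defaultdict


def cost_by_resource_group(bill_items):
    """Sum cost per resource group, highest spender first.

    Items without a resource group land in "ungrouped", which is usually
    the first bucket worth investigating.
    """
    totals = defaultdict(float)
    for item in bill_items:
        group = item.get("resource_group") or "ungrouped"
        totals[group] += item["cost"]
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


if __name__ == "__main__":
    # Hypothetical bill items.
    items = [
        {"resource_group": "payments", "cost": 120.0},
        {"resource_group": "merchant-portal", "cost": 45.5},
        {"resource_group": "payments", "cost": 30.0},
        {"resource_group": None, "cost": 12.0},
    ]
    for group, total in cost_by_resource_group(items).items():
        print(f"{group}: {total:.2f}")
```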

8 - FinOps

As technical leaders, our job is to get the most value out of the cloud to drive efficient growth.

8.1 - Review the expenditure and clean up periodically

Part of my role involves keeping things lean and minimizing operational overhead in the cloud. I review cloud expenses every month to identify any unintended expenditures. As daunting as this may seem, regularly doing this helps me spot and restructure oversized assets.

8.2 - Cloud analysis tools

Cloud providers now offer financial management tools that help us detect cost anomalies. Be sure to check these tools from time to time, as they can reveal stale workloads (unintended expenses) and help us set the budget more diligently.

Google FinOps
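If your provider's tooling is not available, or you want a second opinion during the monthly review, even a naive month-over-month comparison helps. The sketch below is not how any provider's anomaly detection works; the function name, the service names, and the 20% threshold are arbitrary assumptions.

```python
def flag_cost_increases(previous, current, threshold=0.2):
    """Flag services whose spend grew by more than `threshold` month over month.

    Returns {service: growth_ratio}. Services with no spend last month are
    flagged with None because there is no baseline to compare against.
    """
    flagged = {}
    for service, cost in current.items():
        before = previous.get(service, 0.0)
        if before == 0.0:
            if cost > 0:
                flagged[service] = None  # new spend, no baseline
            continue
        growth = (cost - before) / before
        if growth > threshold:
            flagged[service] = growth
    return flagged


if __name__ == "__main__":
    # Hypothetical monthly totals per service.
    last_month = {"ecs": 100.0, "oss": 50.0}
    this_month = {"ecs": 150.0, "oss": 52.0, "nat": 30.0}
    print(flag_cost_increases(last_month, this_month))
```

Anything flagged is a prompt to go and look, not a verdict; a planned launch will trip the same check as a forgotten test cluster.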

Not the end

This is not exactly a short post, and I applaud you for making it this far! I hope my insights can help my fellow engineers manage their cloud resources more effectively.

If you are facing any difficulties managing your infrastructure, or simply need a chat, please do not hesitate to reach out to me! 😃

Allow me some shameless plug 🗣

We are expanding our team, especially under Tech at Revenue Monster Sdn Bhd. We are hiring developers who speak frontend, backend, DevOps, and machine learning. Do reach out to my HR personnel via hr@revenuemonster.my if you’re interested in exploring the FinTech world!

Further reading sources