DevOps from Zero to Hero: Cost Optimization and What Comes Next
Introduction
Welcome to article twenty, the final article of the DevOps from Zero to Hero series. Over the past nineteen articles we built an entire DevOps practice from scratch. We wrote a TypeScript API, learned version control, set up CI/CD pipelines, deployed to AWS, mastered Kubernetes, automated everything with GitOps, and added observability so we could actually see what was happening in production.
But there is one topic we have not covered yet, and it might be the one that gets you the most attention from leadership: cost. Cloud bills have a way of growing quietly in the background until someone notices a five-figure monthly invoice and starts asking hard questions. Cost optimization is not about being cheap. It is about spending intentionally and getting maximum value from every dollar.
In this article we will cover how to understand your AWS bill, identify common cost traps, right-size your resources, use Spot instances and Savings Plans, optimize Kubernetes costs, build a tagging strategy, set up cost monitoring, and manage dev/staging environments efficiently. Then we will wrap up the entire series with a full recap of everything we learned and talk about where to go from here.
Let's get into it.
Why cost matters: the rise of FinOps
When you are learning cloud in a personal account, costs feel manageable. A small EKS cluster, a few EC2 instances, and an RDS database might cost $100-300 per month. But in a real organization, those numbers multiply fast. Teams spin up resources and forget about them. Someone creates a NAT Gateway for testing and leaves it running for six months. A developer provisions an m5.4xlarge instance for a service that barely uses 10% of its CPU.
The cloud makes it incredibly easy to spend money. That is by design. There is no procurement process, no hardware to order, no six-week wait. You click a button and resources appear. This is powerful for speed, but dangerous for budgets.
This is where FinOps comes in. FinOps (Financial Operations) is a practice that brings financial accountability to cloud spending. It is not about cutting costs blindly. It is about making informed decisions about what to spend and why.
The core principles of FinOps are:
- Teams need to own their cloud costs: Just like DevOps made teams responsible for running their software, FinOps makes teams responsible for the cost of running it. If you deploy it, you should know what it costs.
- Decisions are driven by business value: Not every cost reduction is a good idea. Cutting your monitoring stack to save $500/month might cost you $50,000 when you miss an outage. Cost optimization is about value, not just spending less.
- Cloud is a variable cost model: Unlike on-premise where you buy servers and depreciate them over years, cloud costs change monthly. This means you need to review and optimize continuously, not just once a year.
Think of FinOps as the financial pillar of DevOps. You would not deploy code without testing it. You should not deploy infrastructure without understanding what it costs.
AWS Cost Explorer: understanding your bill
The first step in cost optimization is understanding where your money is going. AWS Cost Explorer is the primary tool for this. The console is free to use and built into every AWS account.
To access it, go to the AWS Billing Console and click on Cost Explorer. The first time you enable it, it takes about 24 hours to populate historical data. After that, you get up to 12 months of spending history.
Here are the views you should use regularly:
Monthly cost by service
This is your starting point. Group by "Service" and set the time range to the last 3 months. You will immediately see which services are costing the most. In a typical Kubernetes-based setup, your top costs will usually be:
- EC2 (including EKS worker nodes): Compute is almost always the biggest line item
- RDS: Database instances, especially if you run Multi-AZ
- NAT Gateway: Data transfer through NAT Gateways is surprisingly expensive
- EBS: Persistent volumes, snapshots, and unattached volumes
- S3: Storage and request costs
- Data Transfer: Cross-AZ and internet egress charges
Cost by tag
If you have a proper tagging strategy (we will cover this later), you can group costs by tag. This lets you answer questions like "How much does the staging environment cost?" or "What is team-alpha spending per month?" To use this view, you first need to activate your cost allocation tags in the Billing Console under Cost Allocation Tags.
Daily cost trends
Switch to daily granularity and look for spikes. A sudden jump in EC2 costs might mean someone launched a bunch of instances for a load test and forgot to terminate them. A spike in data transfer costs might indicate a misconfigured service that is pulling data across regions.
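You can also use the AWS CLI to query cost data programmatically. Unlike the console, the Cost Explorer API is billed at $0.01 per request, so avoid polling it in a tight loop: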
# Get last month's cost grouped by service
aws ce get-cost-and-usage \
--time-period Start=2026-05-01,End=2026-06-01 \
--granularity MONTHLY \
--metrics "BlendedCost" \
--group-by Type=DIMENSION,Key=SERVICE
# Get daily costs for the current month
aws ce get-cost-and-usage \
--time-period Start=2026-06-01,End=2026-06-17 \
--granularity DAILY \
--metrics "BlendedCost"
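If you save the JSON output of the first command to a file, a few lines of Python can rank services by spend. This is a minimal sketch: the `top_services` helper and the sample numbers are made up for illustration, and only the response shape (`ResultsByTime`, `Groups`, `Keys`, `Metrics`) comes from the Cost Explorer API.

```python
from typing import Any


def top_services(response: dict[str, Any], metric: str = "BlendedCost", n: int = 5) -> list[tuple[str, float]]:
    """Sum each service's cost across all periods and return the n most expensive."""
    totals: dict[str, float] = {}
    for period in response.get("ResultsByTime", []):
        for group in period.get("Groups", []):
            service = group["Keys"][0]
            amount = float(group["Metrics"][metric]["Amount"])
            totals[service] = totals.get(service, 0.0) + amount
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


# Made-up sample in the shape returned by `aws ce get-cost-and-usage`
sample = {
    "ResultsByTime": [
        {
            "Groups": [
                {"Keys": ["Amazon Elastic Compute Cloud - Compute"],
                 "Metrics": {"BlendedCost": {"Amount": "1200.50", "Unit": "USD"}}},
                {"Keys": ["Amazon Relational Database Service"],
                 "Metrics": {"BlendedCost": {"Amount": "430.10", "Unit": "USD"}}},
                {"Keys": ["EC2 - Other"],
                 "Metrics": {"BlendedCost": {"Amount": "310.00", "Unit": "USD"}}},
            ]
        }
    ]
}

for service, cost in top_services(sample, n=2):
    print(f"{service}: ${cost:,.2f}")
```

In practice you would replace `sample` with `json.load(open("costs.json"))` after redirecting the CLI output to that file.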
Common cost traps
Every cloud environment has hidden costs waiting to surprise you. Here are the most common ones and how to find them.
Forgotten resources
These are resources that were created for a purpose but are no longer needed. They quietly accumulate charges every month.
- Unattached EBS volumes: When you terminate an EC2 instance, its EBS volumes might not be deleted automatically (depends on the DeleteOnTermination flag). These orphaned volumes cost money even when nothing is using them.
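- Old EBS snapshots: Snapshots pile up over time. A daily snapshot policy on a 500GB volume creates 365 snapshots per year. Snapshots are incremental, so you pay only for changed blocks rather than 365 full copies, but at $0.05/GB-month the retained data still adds up.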
- Idle load balancers: A load balancer with no healthy targets still costs about $16-22/month. If you have abandoned ALBs from old projects, find them and delete them.
- NAT Gateways: Each NAT Gateway costs about $32/month just to exist, plus $0.045 per GB of data processed. If you have NAT Gateways in multiple AZs across multiple VPCs, that is hundreds of dollars per month doing nothing if those VPCs are inactive.
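- Elastic IPs: Since February 2024, AWS charges $0.005 per hour (about $3.65/month) for every public IPv4 address, including Elastic IPs attached to running instances. An Elastic IP that is not attached to anything is pure waste. Small, but they add up.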
- Unused ECR images: Container images in ECR cost $0.10/GB-month. If your CI pipeline pushes a new image on every commit and you never clean up old ones, storage costs grow linearly.
Find forgotten resources with these commands:
# Find unattached EBS volumes
aws ec2 describe-volumes \
--filters Name=status,Values=available \
--query 'Volumes[*].{ID:VolumeId,Size:Size,Created:CreateTime}' \
--output table
# Find Elastic IPs not associated with anything
aws ec2 describe-addresses \
--query 'Addresses[?AssociationId==`null`].{IP:PublicIp,AllocID:AllocationId}' \
--output table
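# Find target groups with no registered targets
# (usually a sign of an idle load balancer in front of them)
for tg in $(aws elbv2 describe-target-groups \
--query 'TargetGroups[*].TargetGroupArn' --output text); do
count=$(aws elbv2 describe-target-health --target-group-arn "$tg" \
--query 'length(TargetHealthDescriptions)' --output text)
[ "$count" -eq 0 ] && echo "No targets: $tg"
done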
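Once you have counted what is lying around, it helps to put a rough dollar figure on it. The sketch below uses only the monthly prices quoted in this section; the `monthly_waste` function is a hypothetical helper, and real prices vary by region, so treat the result as an estimate.

```python
# Monthly prices quoted in this article; check current AWS pricing for your region.
SNAPSHOT_PER_GB = 0.05
NAT_GATEWAY_FIXED = 32.0   # hourly charge only, excludes per-GB processing
ELASTIC_IP = 3.65
IDLE_ALB = 16.0            # low end of the $16-22 range
ECR_PER_GB = 0.10


def monthly_waste(snapshot_gb: float = 0.0, nat_gateways: int = 0,
                  elastic_ips: int = 0, idle_albs: int = 0,
                  ecr_gb: float = 0.0) -> float:
    """Estimate the monthly cost of forgotten resources in USD."""
    total = (
        snapshot_gb * SNAPSHOT_PER_GB
        + nat_gateways * NAT_GATEWAY_FIXED
        + elastic_ips * ELASTIC_IP
        + idle_albs * IDLE_ALB
        + ecr_gb * ECR_PER_GB
    )
    return round(total, 2)


estimate = monthly_waste(snapshot_gb=1000, nat_gateways=3, elastic_ips=4,
                         idle_albs=2, ecr_gb=200)
print(f"Estimated waste: ${estimate}/month")
```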
Oversized instances
This is the most common cost trap. Teams pick an instance type when they first deploy a service and never revisit it. That m5.xlarge you chose "just in case" might be running at 5% CPU utilization. You could be on a t3.medium and save 75%.
Idle dev/staging environments
Your staging environment runs 24/7 but your team works 8 hours a day, 5 days a week. That means staging is idle 76% of the time. If staging costs $2,000/month, you are wasting about $1,500/month on compute that nobody is using.
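The idle percentage is simple arithmetic over the 168 hours in a week. Here is a small sketch so you can plug in your own team's schedule; the function names are illustrative.

```python
HOURS_PER_WEEK = 24 * 7  # 168


def idle_fraction(hours_per_day: float, days_per_week: int) -> float:
    """Share of the week in which nobody is using the environment."""
    return 1 - (hours_per_day * days_per_week) / HOURS_PER_WEEK


def idle_spend(monthly_cost: float, hours_per_day: float, days_per_week: int) -> float:
    """Portion of an always-on monthly bill that pays for idle hours."""
    return monthly_cost * idle_fraction(hours_per_day, days_per_week)


print(f"Idle: {idle_fraction(8, 5):.0%}")
print(f"Idle spend on a $2,000 bill: ${idle_spend(2000, 8, 5):,.0f}")
```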
Cross-AZ data transfer
Data transfer between Availability Zones costs $0.01/GB in each direction ($0.02/GB round trip). This sounds tiny, but a chatty microservice architecture with services spread across AZs can generate terabytes of cross-AZ traffic. This is often the most surprising line item on an AWS bill.
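To see how quickly this grows, here is a rough estimate based on the $0.01/GB per-direction price above. The traffic volumes are made-up inputs, not measurements.

```python
CROSS_AZ_PER_GB_EACH_DIRECTION = 0.01  # USD, as quoted above


def cross_az_monthly_cost(gb_per_month: float) -> float:
    """Cost of cross-AZ traffic, charged on both the sending and receiving side."""
    return gb_per_month * CROSS_AZ_PER_GB_EACH_DIRECTION * 2


for gb in (500, 5_000, 50_000):
    print(f"{gb:>6} GB/month -> ${cross_az_monthly_cost(gb):,.2f}")
```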
Right-sizing: matching resources to actual usage
Right-sizing means adjusting your compute resources to match what your workload actually needs. It is the highest-impact cost optimization you can do because compute is usually your biggest expense.
Step 1: Gather metrics
Before you can right-size anything, you need data. Use CloudWatch to understand your actual resource utilization:
# Get average CPU utilization for an instance over the last 7 days
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-0abc123def456789 \
--start-time 2026-06-10T00:00:00Z \
--end-time 2026-06-17T00:00:00Z \
--period 3600 \
--statistics Average Maximum \
--output table
Look at both the average and the maximum. If your average CPU is 10% and your max is 25%, you have significant room to downsize. If your average is 10% but your max spikes to 95%, you might need that capacity for peak loads (or you might need to investigate what causes those spikes).
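That rule of thumb can be written down as a tiny classifier. The thresholds below (20% average, 40% and 80% maximum) are assumptions chosen to match the examples above, not official guidance, so tune them for your workloads.

```python
def rightsizing_hint(avg_cpu: float, max_cpu: float,
                     low_avg: float = 20.0,
                     safe_max: float = 40.0,
                     high_max: float = 80.0) -> str:
    """Classify an instance from its average and maximum CPU utilization (percent)."""
    if avg_cpu < low_avg and max_cpu < safe_max:
        # Low average and low peaks: clear room to downsize
        return "downsize"
    if avg_cpu < low_avg and max_cpu >= high_max:
        # Low average but heavy peaks: understand the spikes first
        return "investigate-spikes"
    return "keep"


print(rightsizing_hint(10, 25))
print(rightsizing_hint(10, 95))
```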
Step 2: Use AWS Compute Optimizer
AWS Compute Optimizer analyzes your CloudWatch metrics and recommends instance types that would better fit your workload. Enable it in the AWS Console under Compute Optimizer. It is free for basic recommendations.
It will tell you things like: "This m5.xlarge instance averages 8% CPU utilization. A t3.medium would save 75% while still providing sufficient capacity." These recommendations are a great starting point, but always validate them against your application's actual requirements. Memory-intensive applications might need more RAM than CPU, for example.
Step 3: Right-size gradually
Do not downsize everything at once. Pick your most over-provisioned instances, downsize them one at a time, and monitor for a week. If performance is fine, move to the next one. If you see issues, scale back up. Right-sizing is iterative, not a one-time event.
# Change instance type (requires stop/start)
aws ec2 stop-instances --instance-ids i-0abc123def456789
aws ec2 modify-instance-attribute \
--instance-id i-0abc123def456789 \
--instance-type '{"Value":"t3.medium"}'
aws ec2 start-instances --instance-ids i-0abc123def456789
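For EKS worker nodes in a managed node group, the instance type cannot be changed in place: update-nodegroup-config only adjusts scaling, labels, and taints. If the node group uses a custom launch template, publish a new template version with the smaller instance type and roll it out. Otherwise, create a new node group with the desired instance type and drain the old one:
# Roll a managed node group to a new launch template version
# (my-template and version 2 are placeholder values)
aws eks update-nodegroup-version \
--cluster-name my-cluster \
--nodegroup-name my-nodegroup \
--launch-template name=my-template,version=2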
Spot instances and Karpenter
Spot instances let you use unused EC2 capacity at up to 90% discount compared to on-demand prices. The trade-off is that AWS can reclaim them with a 2-minute warning when it needs the capacity back. This sounds scary, but with the right architecture, Spot is one of the most effective cost optimization strategies available.
How Spot works
When AWS has unused capacity in a particular instance type and AZ, it makes that capacity available as Spot instances at a reduced price. The price fluctuates based on supply and demand but is typically 60-90% cheaper than on-demand. When AWS needs that capacity back (a "Spot interruption"), your instance gets a 2-minute warning and then is terminated.
When to use Spot
- Stateless workloads: Web servers, API servers, and workers that do not store data locally are perfect for Spot. If an instance gets interrupted, the load balancer routes traffic to other instances.
- Batch processing: Jobs that can be checkpointed and restarted work well on Spot.
- CI/CD runners: Build agents are short-lived by nature and can tolerate interruptions.
- Development and staging environments: These do not need the same reliability guarantees as production.
When NOT to use Spot
- Databases: Losing a database instance mid-transaction is a bad day.
- Stateful workloads without replication: If losing an instance means losing data, do not put it on Spot.
- Single-instance workloads: If you only have one instance and it gets interrupted, your service is down.
Mixing on-demand and Spot
The best practice is to run a baseline of on-demand instances that can handle your minimum expected load, and use Spot for everything above that. For example, if your API needs at least 3 instances to handle normal traffic but scales to 10 during peak hours, run 3 on-demand and let the remaining 7 be Spot.
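The split is easy to express in code. This is only a sketch of the arithmetic; in a real cluster the autoscaler or Karpenter makes this decision for you.

```python
def capacity_split(baseline: int, desired: int) -> dict[str, int]:
    """Split a desired instance count into an on-demand baseline and Spot on top."""
    if baseline < 0 or desired < 0:
        raise ValueError("instance counts must be non-negative")
    on_demand = min(baseline, desired)
    return {"on_demand": on_demand, "spot": desired - on_demand}


print(capacity_split(baseline=3, desired=10))
```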
Karpenter for Kubernetes
If you are running EKS, Karpenter is the best way to use Spot instances with Kubernetes. Karpenter is an open-source node provisioning tool that automatically selects the right instance types and purchase options (on-demand vs Spot) based on your pod requirements.
Here is a basic Karpenter NodePool configuration that mixes on-demand and Spot:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - m5.large
            - m5.xlarge
            - m5a.large
            - m5a.xlarge
            - m6i.large
            - m6i.xlarge
        - key: topology.kubernetes.io/zone
          operator: In
          values:
            - us-east-1a
            - us-east-1b
            - us-east-1c
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "100"
    memory: 400Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
Karpenter will automatically diversify across multiple instance types and AZs to reduce the chance of simultaneous Spot interruptions. The disruption block tells Karpenter to consolidate underutilized nodes, which saves money by packing pods more efficiently.
Handling Spot interruptions
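For graceful handling of Spot interruptions in Kubernetes, make sure your pods handle SIGTERM properly and have appropriate terminationGracePeriodSeconds. Karpenter handles interruptions natively: when it is configured with an SQS interruption queue, it receives the Spot interruption notice and cordons and drains the node before it is reclaimed, so a separate AWS Node Termination Handler is not needed.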
Reserved Instances and Savings Plans
If you know you will need a certain amount of compute for the next 1-3 years, Reserved Instances (RIs) and Savings Plans offer significant discounts (up to 72%) in exchange for a commitment.
Savings Plans vs Reserved Instances
- Compute Savings Plans: You commit to a specific dollar amount of compute per hour (e.g., $10/hour) for 1 or 3 years. The discount applies across EC2, Fargate, and Lambda. This is the most flexible option.
- EC2 Instance Savings Plans: You commit to a specific instance family in a specific region (e.g., m5 in us-east-1). Higher discount than Compute Savings Plans but less flexible.
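- Reserved Instances: You commit to a specific instance type, platform, and tenancy in a region (or in a single AZ if you also want a capacity reservation). Discounts are similar to EC2 Instance Savings Plans, but this is the least flexible option. These are the legacy option and Savings Plans are generally recommended instead.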
When commitments make sense
- Stable, predictable workloads: If your production database has been running on an r5.2xlarge for a year and will continue to do so, a Savings Plan is a no-brainer.
- Baseline compute: Commit to your minimum required compute. Use on-demand and Spot for anything above the baseline.
- After right-sizing: Always right-size first, then commit. There is nothing worse than committing to an oversized instance for 3 years.
When to avoid commitments
- New workloads: Wait until you understand the actual resource requirements (at least 2-3 months of data).
- Rapidly changing architectures: If you are migrating from EC2 to containers or from x86 to ARM, locking into commitments can backfire.
- Small amounts: The administrative overhead of managing RIs for a $50/month saving is not worth it.
A practical approach is to cover 60-70% of your steady-state compute with Savings Plans, handle the next 20% with on-demand, and use Spot for the remaining 10-20% that handles peak loads.
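As a quick illustration of that split, the sketch below divides a steady-state hourly spend across the three purchase options. The default shares (65% and 20%) are assumptions picked from the ranges above, and `coverage_plan` is a hypothetical helper.

```python
def coverage_plan(steady_state_hourly: float,
                  savings_plan_share: float = 0.65,
                  on_demand_share: float = 0.20) -> dict[str, float]:
    """Split hourly compute spend (USD) across Savings Plans, on-demand, and Spot."""
    if savings_plan_share < 0 or on_demand_share < 0:
        raise ValueError("shares must be non-negative")
    if savings_plan_share + on_demand_share > 1:
        raise ValueError("shares must not add up to more than 100%")
    # Whatever is left after the first two shares goes to Spot
    spot_share = 1 - savings_plan_share - on_demand_share
    return {
        "savings_plan": round(steady_state_hourly * savings_plan_share, 2),
        "on_demand": round(steady_state_hourly * on_demand_share, 2),
        "spot": round(steady_state_hourly * spot_share, 2),
    }


print(coverage_plan(10.0))
```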
Kubernetes cost optimization
Kubernetes adds its own layer of cost complexity. Pods request resources, nodes provide them, and the gap between requested and actually used resources is wasted money.
Resource requests and limits
Every pod should have resource requests and limits defined. Requests tell the scheduler how much CPU and memory a pod needs. Limits cap how much it can use. The gap between what you request and what you actually use is waste.
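apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  # apps/v1 requires a selector that matches the pod labels
  # (the app: api label is an illustrative choice)
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: my-api:latest
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"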
The most common mistake is setting requests too high "just to be safe." If your API container uses 50m CPU on average but you request 500m, each pod wastes 450m of CPU. With 10 replicas, you are wasting 4.5 vCPUs, which could be an entire node worth of compute.
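Here is that calculation as a small function, so you can run it against your own deployments. The numbers in the example are the ones from the paragraph above.

```python
def wasted_vcpus(request_millicores: int, usage_millicores: int, replicas: int) -> float:
    """vCPUs reserved by the scheduler but not actually used, across all replicas."""
    per_pod = max(request_millicores - usage_millicores, 0)
    return per_pod * replicas / 1000


print(wasted_vcpus(request_millicores=500, usage_millicores=50, replicas=10))
```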
To find the right values, check actual usage with kubectl top:
# Check actual resource usage per pod
kubectl top pods -n my-namespace
# Check node-level resource utilization
kubectl top nodes
# Detailed resource allocation per node
kubectl describe node <node-name> | grep -A 5 "Allocated resources"
Set requests based on the P95 usage (what the pod actually uses 95% of the time) and limits at roughly 2x the request to handle bursts. Review and adjust these values every month.
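If you export CPU usage samples (for example from Prometheus), the P95 rule can be sketched like this. It uses a simple nearest-rank percentile, which is one of several common definitions, and the helper names are made up for this example.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[rank - 1]


def suggest_resources(cpu_millicores: list[float], limit_factor: float = 2.0) -> dict[str, str]:
    """Suggest a CPU request at P95 usage and a limit at limit_factor times the request."""
    request = math.ceil(percentile(cpu_millicores, 95))
    limit = math.ceil(request * limit_factor)
    return {"request": f"{request}m", "limit": f"{limit}m"}


# Hypothetical usage samples in millicores: 1m, 2m, ..., 100m
print(suggest_resources([float(v) for v in range(1, 101)]))
```

With very few samples the P95 is simply the largest value, so collect at least a week of data before trusting the result.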
Namespace resource quotas
Resource quotas prevent any single team or namespace from consuming more than its fair share of cluster resources. Without quotas, one team's runaway deployment can starve everyone else and force unnecessary cluster scaling.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-alpha-quota
  namespace: team-alpha
spec:
  hard:
    requests.cpu: "8"
    requests.memory: "16Gi"
    limits.cpu: "16"
    limits.memory: "32Gi"
    pods: "50"
    persistentvolumeclaims: "10"
Cluster Autoscaler and Karpenter
Both Cluster Autoscaler and Karpenter scale your node count based on pending pods, but they approach it differently:
- Cluster Autoscaler: Works with AWS Auto Scaling Groups. You predefine node group configurations (instance types, sizes). The autoscaler adds or removes nodes from these predefined groups. Simpler to set up but less flexible.
- Karpenter: Evaluates pending pods and provisions the optimal instance type on the fly. It can choose from a wide range of instance types and automatically bin-pack pods efficiently. More flexible and generally more cost-effective, but requires more initial configuration.
Whichever you use, make sure scale-down is enabled and tuned. By default, Cluster Autoscaler waits 10 minutes before removing an underutilized node. In a bursty environment, this delay means you are paying for idle nodes for 10 minutes after every traffic spike.
Horizontal Pod Autoscaler (HPA)
HPA scales your pod count based on metrics like CPU or custom metrics. This lets you run fewer pods during low-traffic periods and scale up during peaks, instead of running peak capacity 24/7.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
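Under the hood, HPA computes the desired replica count as the current count scaled by the ratio of observed to target utilization, rounded up. The sketch below mirrors that core formula with the values from the manifest above. It deliberately ignores the tolerance band and stabilization windows that the real controller applies.

```python
import math


def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float = 60,
                     min_replicas: int = 2, max_replicas: int = 10) -> int:
    """Core HPA formula: ceil(current * observed / target), clamped to min/max."""
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(current_replicas=4, current_utilization=90))
```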
Tagging strategy: tag everything
Tags are the foundation of cost visibility. Without tags, your AWS bill is one big number. With tags, you can answer "How much does each environment cost?", "Which team is spending the most?", and "What is the cost per customer?"
Minimum required tags
Every resource in your AWS account should have at least these tags:
- Environment: production, staging, development
- Team: The team that owns the resource
- Service: The application or service name
- CostCenter: For chargeback or showback to business units
- ManagedBy: terraform, manual, karpenter, etc.
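A check like the following can run in CI or in a scheduled audit script to flag resources that break the rules above. It is a minimal sketch: the tag dictionary would come from whatever inventory you already have, and `tag_problems` is a hypothetical helper.

```python
REQUIRED_TAGS = ("Environment", "Team", "Service", "CostCenter", "ManagedBy")
ALLOWED_ENVIRONMENTS = {"production", "staging", "development"}


def tag_problems(tags: dict[str, str]) -> list[str]:
    """Return a list of tagging problems for one resource (empty list means compliant)."""
    problems = [f"missing tag: {key}" for key in REQUIRED_TAGS if not tags.get(key)]
    environment = tags.get("Environment")
    if environment and environment not in ALLOWED_ENVIRONMENTS:
        problems.append(f"invalid Environment: {environment}")
    return problems


print(tag_problems({"Environment": "prod", "Team": "backend"}))
```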
Enforce tags with policies
Tags only work if they are applied consistently. Use AWS Organizations tag policies or Terraform validation to enforce tagging:
# Terraform: baseline tags merged into every resource
# (merge applies defaults; it does not reject missing or empty values)
variable "required_tags" {
  type = map(string)
  default = {
    Environment = ""
    Team        = ""
    Service     = ""
    ManagedBy   = "terraform"
  }
}
resource "aws_instance" "api" {
  ami           = "ami-0abc123def456789"
  instance_type = "t3.medium"
  tags = merge(var.required_tags, {
    Name        = "api-server"
    Environment = "production"
    Team        = "backend"
    Service     = "user-api"
  })
}
For a more robust approach, use an AWS Organizations tag policy:
{
  "tags": {
    "Environment": {
      "tag_key": {
        "@@assign": "Environment"
      },
      "tag_value": {
        "@@assign": [
          "production",
          "staging",
          "development"
        ]
      },
      "enforced_for": {
        "@@assign": [
          "ec2:instance",
          "rds:db",
          "s3:bucket",
          "elasticloadbalancing:loadbalancer"
        ]
      }
    }
  }
}
Activate cost allocation tags
Creating tags is not enough. You also need to activate them as cost allocation tags in the Billing Console. Only activated tags appear in Cost Explorer for grouping and filtering. Go to Billing, then Cost Allocation Tags, find your tags, and click Activate. It takes up to 24 hours for activated tags to appear in Cost Explorer.
Cost monitoring: budgets and alerts
Setting up cost monitoring is like setting up application monitoring. You do not wait for users to report outages. You set up alerts. You should not wait for finance to report cost overruns either.
AWS Budgets
Create budgets for your total account spend and for each major service or environment:
# Create a monthly budget with email alerts
aws budgets create-budget \
--account-id 123456789012 \
--budget '{
"BudgetName": "monthly-total",
"BudgetLimit": {
"Amount": "5000",
"Unit": "USD"
},
"TimeUnit": "MONTHLY",
"BudgetType": "COST"
}' \
--notifications-with-subscribers '[
{
"Notification": {
"NotificationType": "ACTUAL",
"ComparisonOperator": "GREATER_THAN",
"Threshold": 80,
"ThresholdType": "PERCENTAGE"
},
"Subscribers": [
{
"SubscriptionType": "EMAIL",
"Address": "alerts@example.com"
}
]
},
{
"Notification": {
"NotificationType": "FORECASTED",
"ComparisonOperator": "GREATER_THAN",
"Threshold": 100,
"ThresholdType": "PERCENTAGE"
},
"Subscribers": [
{
"SubscriptionType": "EMAIL",
"Address": "alerts@example.com"
}
]
}
]'
This creates a $5,000/month budget with two alerts: one when actual spend hits 80% of the budget, and another when the forecasted spend is projected to exceed the budget. The forecast alert is especially useful because it gives you time to act before you actually overspend.
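To build intuition for how the two alerts differ, here is a small sketch of the logic. The forecast is a naive linear projection from spend so far, which is an assumption for illustration only. AWS Budgets uses its own forecasting model.

```python
def budget_alerts(budget: float, actual_to_date: float,
                  day_of_month: int, days_in_month: int,
                  actual_threshold_pct: float = 80,
                  forecast_threshold_pct: float = 100) -> list[str]:
    """Return which alert types would fire, using a naive linear forecast."""
    if day_of_month < 1:
        raise ValueError("day_of_month must be at least 1")
    forecast = actual_to_date / day_of_month * days_in_month
    alerts = []
    if actual_to_date > budget * actual_threshold_pct / 100:
        alerts.append("ACTUAL")
    if forecast > budget * forecast_threshold_pct / 100:
        alerts.append("FORECASTED")
    return alerts


# Halfway through the month with $3,000 spent against a $5,000 budget
print(budget_alerts(budget=5000, actual_to_date=3000, day_of_month=15, days_in_month=30))
```

In this example only the forecast alert fires, which is exactly the early warning you want.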
Weekly cost reviews
Set up a weekly ritual where someone on the team reviews costs. It does not need to be a long meeting. A 15-minute check of Cost Explorer once a week is enough. Look for:
- Unexpected spikes: Anything that jumped significantly from the previous week
- New services: Any service that appeared in your bill that was not there before
- Trend lines: Is overall spending trending up? If so, is it proportional to growth?
- Idle resources: Any resources with zero or near-zero utilization
The person doing the review should rotate across the team. This builds cost awareness across the entire team, not just one designated cost watcher.
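Part of the review can be automated. The sketch below compares two weeks of per-service costs and flags new services and large jumps. The 25% spike threshold and the sample figures are assumptions, so adjust them to what counts as "significant" for your team.

```python
def review_costs(previous: dict[str, float], current: dict[str, float],
                 spike_pct: float = 25.0) -> list[str]:
    """Compare per-service weekly costs and list new services and spikes."""
    findings = []
    for service, cost in sorted(current.items()):
        before = previous.get(service)
        if before is None:
            findings.append(f"new service: {service} (${cost:.2f})")
        elif before > 0:
            change_pct = (cost - before) / before * 100
            if change_pct >= spike_pct:
                findings.append(f"spike: {service} up {change_pct:.0f}%")
    return findings


# Made-up weekly totals in USD
last_week = {"EC2": 400.0, "RDS": 100.0}
this_week = {"EC2": 600.0, "RDS": 105.0, "SageMaker": 50.0}
for finding in review_costs(last_week, this_week):
    print(finding)
```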
Dev/staging environment strategies
Development and staging environments are often the easiest place to cut costs because they do not need to be available 24/7 and they do not need production-grade resources.
Scale down at night and on weekends
If your team works 9am to 6pm on weekdays, your dev and staging environments are idle 73% of the time. Use scheduled scaling to shut them down outside working hours:
# Scale down EKS node group at night (run via cron or Lambda)
aws eks update-nodegroup-config \
--cluster-name dev-cluster \
--nodegroup-name dev-nodes \
--scaling-config minSize=0,maxSize=3,desiredSize=0
# Scale up in the morning
aws eks update-nodegroup-config \
--cluster-name dev-cluster \
--nodegroup-name dev-nodes \
--scaling-config minSize=1,maxSize=3,desiredSize=2
You can automate this with a Lambda function triggered by EventBridge on a schedule:
{
"schedule_expression": "cron(0 22 ? * MON-FRI *)",
"description": "Scale down dev cluster at 10 PM UTC (EventBridge cron expressions are evaluated in UTC)",
"action": "scale-down"
}
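Inside the Lambda function, the core decision is just "are we inside working hours?". Here is a minimal sketch of that logic, assuming a 9am to 6pm weekday schedule and two daytime nodes. The function name and defaults are illustrative, and the actual call to update the node group is left out.

```python
from datetime import datetime


def desired_dev_nodes(now: datetime, start_hour: int = 9, end_hour: int = 18,
                      daytime_nodes: int = 2) -> int:
    """Return how many dev nodes should be running at the given local time."""
    is_weekday = now.weekday() < 5  # Monday=0 ... Friday=4
    in_working_hours = start_hour <= now.hour < end_hour
    return daytime_nodes if is_weekday and in_working_hours else 0


print(desired_dev_nodes(datetime.now()))
```

Pass in a timezone-adjusted time if your team does not work in the Lambda function's default timezone.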
Use smaller instances for non-production
If production runs on m5.xlarge, staging can probably run on t3.medium. Dev can run on t3.small. The goal is not identical environments. It is environments that are similar enough to catch bugs but small enough to be affordable.
Ephemeral environments
Instead of running a persistent staging environment, consider spinning up short-lived environments for each pull request. The environment gets created when the PR is opened, runs integration tests, and gets destroyed when the PR is merged or closed. You only pay for the time someone is actively testing. Tools like Argo CD ApplicationSets or Terraform workspaces can automate this pattern.
Single-node dev clusters
For development, consider running a single-node Kubernetes cluster or using a local tool like kind or minikube. This avoids the EKS control plane cost ($73/month) and multi-node compute costs entirely for local development.
Putting it all together: a cost optimization checklist
Here is a practical checklist you can work through to optimize your cloud costs:
- Week 1: Enable Cost Explorer, activate cost allocation tags, create a basic budget with alerts
- Week 2: Audit for forgotten resources (unattached volumes, idle load balancers, unused Elastic IPs). Delete anything not needed
- Week 3: Analyze compute utilization with CloudWatch and Compute Optimizer. Identify right-sizing candidates
- Week 4: Right-size your most over-provisioned instances. Start with non-production
- Month 2: Implement tagging policies, set up scheduled scaling for dev/staging, evaluate Spot for stateless workloads
- Month 3: Review Kubernetes resource requests/limits, implement HPA, consider Karpenter. Evaluate Savings Plans for stable production workloads
- Ongoing: Weekly cost reviews, monthly optimization passes, quarterly Savings Plan evaluation
The complete series recap
We have covered a lot of ground in this series. Let's take a moment to look back at every article and what we learned in each one. If you missed any or want to revisit a topic, the links below will take you there.
- Article 1: What It Actually Means - We started from the very beginning. What DevOps is, where it came from, the DORA metrics that measure it, and how DevOps relates to SRE and Platform Engineering.
- Article 2: Your First TypeScript API - We built a real application with Express and Docker. This gave us something concrete to deploy throughout the rest of the series.
- Article 3: Version Control for Teams - We learned Git workflows, branching strategies, pull requests, and code review. The collaboration foundation for everything that followed.
- Article 4: Automated Testing - We wrote unit tests, integration tests, and learned the testing pyramid. No CI pipeline works without good tests.
- Article 5: Your First CI Pipeline - We set up GitHub Actions to automatically lint, test, and build our code on every push. Our first taste of automation.
- Article 6: AWS from Scratch - We created an AWS account, set up IAM users and roles, understood regions and AZs, and got comfortable with the AWS CLI.
- Article 7: Infrastructure as Code with Terraform - We stopped clicking around in the console and started defining infrastructure as code. VPCs, subnets, security groups, all in Terraform.
- Article 8: Deploying to ECS with Fargate - We deployed our API to AWS for the first time using ECS and Fargate. Real cloud infrastructure running our real application.
- Article 9: Secrets and Config Management - We learned how to manage secrets safely with AWS Secrets Manager and SSM Parameter Store. No more hardcoded passwords.
- Article 10: DNS, TLS, and Networking - We made our app reachable with a real domain, set up TLS certificates with ACM, and understood how networking ties everything together.
- Article 11: Kubernetes Fundamentals - We learned pods, deployments, services, and namespaces. The building blocks of container orchestration.
- Article 12: Helm Charts - We packaged our Kubernetes application with Helm, making it reusable and configurable across environments.
- Article 13: EKS, Running Kubernetes on AWS - We set up a production-grade EKS cluster with Terraform, including managed node groups, IAM integration, and networking.
- Article 14: GitOps with ArgoCD - We implemented GitOps so that git became the single source of truth for our deployments. Push to git and ArgoCD handles the rest.
- Article 15: Observability in Kubernetes - We set up Prometheus, Grafana, and structured logging. We learned about the three pillars: logs, metrics, and traces.
- Article 16: CI/CD, The Complete Pipeline - We stitched everything together into a complete pipeline from pull request to production, with staging gates and manual approvals.
- Article 17: Security and Compliance - We covered container image scanning, RBAC policies, network policies, and how to bake security into every stage of the pipeline.
- Article 18: Disaster Recovery and High Availability - We learned multi-AZ deployments, backup strategies, RTO/RPO targets, and how to plan for the worst so your systems stay up.
- Article 19: Advanced Deployment Strategies - We explored canary deployments, blue/green deployments, feature flags, and progressive delivery patterns for zero-downtime releases.
- Article 20: Cost Optimization and What Comes Next (this article) - We learned how to understand, monitor, and optimize cloud costs, then wrapped up the entire series.
That is twenty articles, and if you followed along, you went from knowing nothing about DevOps to having a complete, production-grade pipeline with automated testing, infrastructure as code, Kubernetes, GitOps, observability, security, and cost optimization. That is a serious achievement.
What comes next
Finishing this series does not mean you are done learning. In many ways, you are just getting started. You now have a solid foundation, and there are several paths forward depending on your interests and career goals.
Site Reliability Engineering (SRE)
If you enjoyed the observability, monitoring, and reliability aspects of this series, SRE is a natural next step. SRE takes the DevOps principles we covered and adds rigorous engineering practices around reliability: SLIs, SLOs, error budgets, incident management, chaos engineering, and capacity planning.
We have an entire SRE series on this blog that picks up where this one leaves off. Start with SRE: SLIs, SLOs, and Automations That Actually Help and work through all fourteen articles.
Platform Engineering
If you found yourself thinking "I wish developers did not have to know all of this just to deploy their apps," Platform Engineering is for you. Platform teams build internal developer platforms that abstract away infrastructure complexity. You would build golden paths, self-service portals, and developer tooling that makes it easy for any developer to deploy, observe, and manage their applications without needing to understand every underlying component.
Developer Experience (DX)
Related to Platform Engineering, Developer Experience focuses on making developers productive and happy. Fast CI pipelines, great local development setups, clear documentation, easy onboarding. If you care about how people experience the tools and processes you build, DX is worth exploring.
Certifications
If you want to formalize your knowledge and signal your skills to employers, consider these certifications:
- AWS Solutions Architect Associate (SAA-C03): Covers the core AWS services we used throughout this series. If you followed along and built everything, you already know about 70% of what is on this exam.
- Certified Kubernetes Administrator (CKA): Validates your Kubernetes skills. The articles on Kubernetes fundamentals, Helm, and EKS gave you a strong head start.
- HashiCorp Terraform Associate: Covers the Terraform concepts we used for infrastructure as code. Probably the easiest of the three if you have been writing Terraform along with the series.
None of these certifications are required. Hands-on experience matters more than certificates. But they can be helpful for landing interviews, especially early in your career.
Communities and resources
Learning does not happen in isolation. Here are some communities and resources worth checking out:
- CNCF (Cloud Native Computing Foundation): The organization behind Kubernetes, Prometheus, ArgoCD, and many other tools we used. Their landscape page gives you a map of the entire cloud native ecosystem.
- DevOps subreddits and forums: r/devops, r/kubernetes, and r/aws are active communities where people share experiences and help each other.
- KubeCon talks: The recorded talks from KubeCon are freely available on YouTube and cover everything from beginner to advanced topics.
- The SRE Book: Google's "Site Reliability Engineering" book is available free online at sre.google. It is the foundational text for SRE practices.
- "Accelerate" by Forsgren, Humble, and Kim: The book behind the DORA metrics. If you want to understand the research that proves DevOps practices work, this is the one.
Closing notes
This is the end of the DevOps from Zero to Hero series, and if you made it all the way here, I want to say something sincerely: well done. Twenty articles is a lot. Building all of this from scratch takes real commitment, and the fact that you stuck with it says a lot about you.
When we started this series, we talked about what DevOps actually means. Not the buzzword, not the job title, but the real idea: that the people who build software and the people who run it should work together, share responsibility, and use automation to move faster without sacrificing stability. Every article since then has been a practical expression of that idea. Automated tests, CI pipelines, infrastructure as code, Kubernetes, GitOps, observability, security, and now cost optimization. Each piece reinforces the others. Together, they form a complete practice.
But the most important thing you built is not a pipeline or a cluster. It is a way of thinking. You now approach problems differently. When you see a manual process, you think about automating it. When you see a deployment that requires SSH and prayer, you think about CI/CD. When someone says "it works on my machine," you think about containers. That mindset is more valuable than any specific tool, and it will serve you well no matter where your career takes you.
The cloud ecosystem will keep evolving. New tools will appear, some of what we covered will become outdated, and best practices will shift. That is fine. The fundamentals we covered (version control, testing, automation, infrastructure as code, observability, security, cost awareness) are timeless. The specific tools change, but the principles do not.
So go build something. Take what you learned here and apply it at work, on a side project, or in an open source contribution. The best way to solidify knowledge is to use it. And when you get stuck, remember that every expert you admire was once exactly where you are now.
Thank you for reading this series. I genuinely hope it helped you, and I hope you had as much fun following along as I had writing it. Until the next series!
Errata
If you spot any error or have any suggestion, please send me a message so it gets fixed.
Also, you can check the source code and changes in the sources here