DevOps from Zero to Hero: CI/CD, The Complete Pipeline

2026-06-05 | Gabriel Garrido | 21 min read

Introduction#

Welcome to article sixteen of the DevOps from Zero to Hero series. Over the past fifteen articles we have covered everything from writing a TypeScript API, to version control, testing, CI, infrastructure as code, Kubernetes, Helm, secrets, and more. Each piece solved a specific problem, but we have not yet stitched them all together into one cohesive, end-to-end pipeline.

That changes now. In this article we are going to build a complete CI/CD pipeline that takes your code from a pull request all the way to production. Not a toy example. A real, multi-job GitHub Actions workflow that lints, tests, builds, deploys to staging, runs smoke tests, waits for manual approval, and then promotes to production. We will also cover deployment strategies, rollback procedures, and best practices for keeping your pipeline fast and reliable.

If you have been following the series, think of this article as the glue that connects everything. If you are jumping in fresh, do not worry. We will explain each piece as we go.

Let's get into it.

The pipeline philosophy#

Before we write a single line of YAML, let's establish the principles that drive a good CI/CD pipeline:

Every commit to main should be deployable: If something is in main, it has been linted, tested, and built. It is ready to ship. If it is not ready, it should not be in main.

Environments are gates, not destinations: Staging exists to validate, not to accumulate. Code should flow through staging quickly, not sit there for weeks. Production is the destination.

Fail fast, fail loud: If something is broken, you want to know in seconds, not minutes. Put the cheapest checks first (lint, format) and the expensive ones later (integration tests, builds).

Automation over manual processes: Every manual step is a step that can be forgotten, done wrong, or skipped under pressure. Automate everything except the final production approval.

Reproducibility: Your pipeline should produce the same result whether you run it today or three months from now. Pin your versions, cache your dependencies, and use immutable artifacts.

These are not abstract ideals. They are engineering decisions that prevent outages, reduce toil, and let you ship with confidence. Every design choice in the pipeline we are about to build traces back to one of these principles.

Pipeline stages overview#

Our pipeline will have seven stages, organized into three phases:

flowchart TD
    subgraph P1["Phase 1: Validate (every PR and push to main)"]
        Lint["Lint<br/>ESLint, Prettier, type checking"]
        Test["Test<br/>unit, integration, coverage"]
    end
    subgraph P2["Phase 2: Build and Deploy to Staging (push to main only)"]
        Build["Build<br/>Docker image and push to registry"] --> Deploy["Deploy to staging via ArgoCD"]
        Deploy --> Smoke["Smoke Test<br/>health check and API tests"]
    end
    subgraph P3["Phase 3: Promote to Production (manual)"]
        Approve["Approve<br/>manual gate via GitHub Environments"] --> Prod["Deploy to production namespace"]
    end
    P1 --> P2
    P2 --> P3

Phase 1 runs on every pull request and every push to main. It is your safety net. Phase 2 only runs on pushes to main (merged PRs) because you do not want to deploy feature branches to staging. Phase 3 requires a human to click "Approve" before code reaches production. This is the one manual step we keep on purpose, because deploying to production should be a conscious decision.
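To make the trigger rules explicit, here is a small TypeScript sketch of the same decision. The `phasesFor` function and the event shape are hypothetical names used only for illustration; in the real pipeline this logic lives in the workflow's `on:` triggers, `if:` conditions, and environment protection rules.

```typescript
// Hypothetical model of the trigger rules; not part of the workflow itself.
type PipelineEvent = {
  eventName: "push" | "pull_request";
  ref: string; // for example "refs/heads/main"
};

type Phase = "validate" | "build-and-stage" | "promote";

function phasesFor(event: PipelineEvent): Phase[] {
  // Phase 1 runs for every pull request and every push to main.
  const phases: Phase[] = ["validate"];

  // Phases 2 and 3 are only reachable from a push to main.
  const isPushToMain =
    event.eventName === "push" && event.ref === "refs/heads/main";
  if (isPushToMain) {
    // "promote" is reachable here, but it still waits for a human approval.
    phases.push("build-and-stage", "promote");
  }
  return phases;
}

console.log(phasesFor({ eventName: "pull_request", ref: "refs/pull/42/merge" }));
console.log(phasesFor({ eventName: "push", ref: "refs/heads/main" }));
```

A pull request never gets past validation, which is exactly why feature branches cannot end up in staging by accident.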

GitHub Actions environments#

GitHub Actions has a feature called Environments that gives you exactly what we need: environment-specific secrets, protection rules, and deployment history. Let's set them up.

Go to your repository on GitHub, then Settings, then Environments. Create two environments:

staging: No protection rules needed. Deployments here should be automatic after the build passes.

production: Add a "Required reviewers" protection rule. Pick one or more team members who must approve before a deployment can proceed.

You can also add a "Wait timer" to production if you want a mandatory cooldown period between staging and production deploys. Some teams set this to 15 minutes to give smoke tests extra time to surface issues.

Environment-specific secrets and variables#

Each environment can have its own secrets and variables. This is how you handle the fact that staging and production use different clusters, namespaces, databases, and API keys without littering your workflow with if conditionals.

Here is what you would typically configure:

Repository secrets (shared):
  REGISTRY_USERNAME    -> your container registry username
  REGISTRY_PASSWORD    -> your container registry token

Staging environment secrets:
  KUBE_CONFIG          -> kubeconfig for your staging cluster
  DATABASE_URL         -> staging database connection string
  ARGOCD_AUTH_TOKEN    -> ArgoCD token for staging

Staging environment variables:
  KUBE_NAMESPACE       -> staging
  APP_URL              -> https://staging.myapp.example.com
  ARGOCD_SERVER        -> hostname of the ArgoCD server that manages staging

Production environment secrets:
  KUBE_CONFIG          -> kubeconfig for your production cluster
  DATABASE_URL         -> production database connection string
  ARGOCD_AUTH_TOKEN    -> ArgoCD token for production

Production environment variables:
  KUBE_NAMESPACE       -> production
  APP_URL              -> https://myapp.example.com
  ARGOCD_SERVER        -> hostname of the ArgoCD server that manages production

When a job specifies environment: staging, it gets the staging secrets and variables on top of the repository-level ones, and it cannot read anything defined for production. When it specifies environment: production, it gets the production ones instead. This isolation prevents the worst kind of mistake: accidentally running a staging migration against the production database, or vice versa.

To configure these, go to Settings, then Environments, click on the environment, and add your secrets and variables there. They work exactly like repository-level secrets but are scoped to the environment.
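If the scoping is hard to picture, the following TypeScript sketch models it. Everything here is hypothetical: `secretsForJob`, the maps, and the placeholder values are only an illustration of the lookup rules, not how GitHub implements them.

```typescript
// Hypothetical model of how a job sees secrets; values are placeholders.
type SecretMap = Record<string, string>;

const repositorySecrets: SecretMap = {
  REGISTRY_USERNAME: "registry-user",
};

const environmentSecrets: Record<string, SecretMap> = {
  staging: { DATABASE_URL: "postgres://staging-db/app" },
  production: { DATABASE_URL: "postgres://production-db/app" },
};

// A job sees repository secrets plus the secrets of the one environment it
// declares. Environment values win when a name exists at both levels.
function secretsForJob(environment?: string): SecretMap {
  const scoped = environment ? (environmentSecrets[environment] ?? {}) : {};
  return { ...repositorySecrets, ...scoped };
}

console.log(secretsForJob("staging").DATABASE_URL);
console.log(secretsForJob().DATABASE_URL); // undefined: no environment declared
```

The point of the model is the second call: a job without an environment simply has no DATABASE_URL, so it cannot touch either database.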

The complete workflow#

Here is the full pipeline. We will go through each job in detail after, but first, see the big picture:

name: CI/CD Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

permissions:
  contents: read
  packages: write

jobs:
  # Phase 1: Validate
  lint:
    name: Lint
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: "22"
          cache: "npm"

      - run: npm ci

      - name: Run ESLint
        run: npx eslint .

      - name: Check formatting
        run: npx prettier --check .

      - name: Type check
        run: npx tsc --noEmit

  test:
    name: Test
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_USER: test
          POSTGRES_PASSWORD: test
          POSTGRES_DB: myapp_test
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    env:
      DATABASE_URL: postgres://test:test@localhost:5432/myapp_test
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: "22"
          cache: "npm"

      - run: npm ci

      - name: Run tests with coverage
        run: npm test -- --coverage

      - name: Upload coverage
        if: github.event_name == 'push'
        uses: actions/upload-artifact@v4
        with:
          name: coverage-report
          path: coverage/
          retention-days: 14

  # Phase 2: Build and Deploy to Staging
  build:
    name: Build and Push Image
    runs-on: ubuntu-latest
    needs: [lint, test]
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    outputs:
      image-tag: ${{ steps.meta.outputs.tags }}
      image-digest: ${{ steps.build.outputs.digest }}
    steps:
      - uses: actions/checkout@v4

      - uses: docker/setup-buildx-action@v3

      - name: Log in to registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
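          # format=long emits the full 40-character SHA, so the tag matches
          # the github.sha value that the deploy jobs pass as image.tag
          tags: |
            type=sha,prefix=,format=long
            type=raw,value=latest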

      - name: Build and push
        id: build
        uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy-staging:
    name: Deploy to Staging
    runs-on: ubuntu-latest
    needs: [build]
    environment: staging
    steps:
      - uses: actions/checkout@v4

      - name: Install ArgoCD CLI
        run: |
          curl -sSL -o argocd https://github.com/argoproj/argo-cd/releases/latest/download/argocd-linux-amd64
          chmod +x argocd
          sudo mv argocd /usr/local/bin/

      - name: Deploy to staging
        env:
          ARGOCD_SERVER: ${{ vars.ARGOCD_SERVER }}
          ARGOCD_AUTH_TOKEN: ${{ secrets.ARGOCD_AUTH_TOKEN }}
        run: |
          argocd app set myapp-staging \
            --parameter image.tag=${{ github.sha }} \
            --grpc-web

          argocd app sync myapp-staging \
            --grpc-web \
            --timeout 300

          argocd app wait myapp-staging \
            --grpc-web \
            --timeout 300 \
            --health

  smoke-test:
    name: Smoke Tests
    runs-on: ubuntu-latest
    needs: [deploy-staging]
    environment: staging
    steps:
      - uses: actions/checkout@v4

      - name: Wait for deployment to stabilize
        run: sleep 30

      - name: Health check
        run: |
          for i in $(seq 1 10); do
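            # "|| true" keeps the loop retrying when curl itself fails (for
            # example connection refused), because run steps use bash -e
            STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
              "${{ vars.APP_URL }}/health" || true)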
            if [ "$STATUS" = "200" ]; then
              echo "Health check passed on attempt $i"
              exit 0
            fi
            echo "Attempt $i: got $STATUS, retrying in 10s..."
            sleep 10
          done
          echo "Health check failed after 10 attempts"
          exit 1

      - name: API smoke test
        run: |
          RESPONSE=$(curl -s -w "\n%{http_code}" \
            "${{ vars.APP_URL }}/api/v1/status")
          BODY=$(echo "$RESPONSE" | head -n -1)
          STATUS=$(echo "$RESPONSE" | tail -n 1)

          echo "Status: $STATUS"
          echo "Body: $BODY"

          if [ "$STATUS" != "200" ]; then
            echo "API smoke test failed with status $STATUS"
            exit 1
          fi

          echo "API smoke test passed"

      - name: Run E2E tests against staging
        env:
          BASE_URL: ${{ vars.APP_URL }}
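        run: |
          npm ci
          # The npm package does not ship browsers; install them on the runner
          npx playwright install --with-deps
          npx playwright test tests/e2e/smoke.spec.ts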

  # Phase 3: Promote to Production
  deploy-production:
    name: Deploy to Production
    runs-on: ubuntu-latest
    needs: [smoke-test]
    environment: production
    steps:
      - uses: actions/checkout@v4

      - name: Install ArgoCD CLI
        run: |
          curl -sSL -o argocd https://github.com/argoproj/argo-cd/releases/latest/download/argocd-linux-amd64
          chmod +x argocd
          sudo mv argocd /usr/local/bin/

      - name: Deploy to production
        env:
          ARGOCD_SERVER: ${{ vars.ARGOCD_SERVER }}
          ARGOCD_AUTH_TOKEN: ${{ secrets.ARGOCD_AUTH_TOKEN }}
        run: |
          argocd app set myapp-production \
            --parameter image.tag=${{ github.sha }} \
            --grpc-web

          argocd app sync myapp-production \
            --grpc-web \
            --timeout 300

          argocd app wait myapp-production \
            --grpc-web \
            --timeout 300 \
            --health

      - name: Verify production deployment
        run: |
          for i in $(seq 1 10); do
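            # "|| true" keeps the loop retrying when curl itself fails (for
            # example connection refused), because run steps use bash -e
            STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
              "${{ vars.APP_URL }}/health" || true)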
            if [ "$STATUS" = "200" ]; then
              echo "Production health check passed"
              exit 0
            fi
            echo "Attempt $i: got $STATUS, retrying in 10s..."
            sleep 10
          done
          echo "Production health check failed"
          exit 1

That is a lot of YAML, so let's break it down piece by piece.

Phase 1: Validate#

The lint and test jobs run in parallel on every push and pull request. They are the cheapest and fastest checks, so they go first.

The lint job runs three checks: ESLint for code quality, Prettier for formatting, and the TypeScript compiler for type safety. If any of these fail, the pipeline stops. There is no point building a Docker image for code that does not compile.

The test job spins up a PostgreSQL service container. GitHub Actions lets you define services alongside your job, and they are available on localhost just like a local database. The tests run with coverage enabled, and on pushes to main the coverage report is uploaded as an artifact for later review.

Notice that lint and test have no dependency on each other. They run in parallel by default, which means the validate phase takes as long as the slower of the two, not the sum of both.

Phase 2: Build and deploy to staging#

The build job only runs on pushes to main (not on pull requests) and only after both lint and test pass. This is controlled by the needs: [lint, test] dependency and the if conditional.

We use Docker Buildx with GitHub Actions cache (cache-from: type=gha). This means subsequent builds reuse cached layers, which can cut build time from minutes to seconds. The image is tagged with the Git SHA and pushed to GitHub Container Registry (GHCR).
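Since the deploy jobs refer to the image by commit SHA, it helps to be precise about what the full image reference looks like. The sketch below is a hypothetical helper (`imageRef` is not part of the pipeline) that builds the reference from the same registry and repository values the workflow uses, assuming the tag is the full 40-character commit SHA.

```typescript
// Hypothetical helper; mirrors the REGISTRY and IMAGE_NAME values from the workflow.
function imageRef(registry: string, repository: string, commitSha: string): string {
  if (!/^[0-9a-f]{40}$/.test(commitSha)) {
    throw new Error("expected a full 40-character commit SHA");
  }
  // Registries require lowercase repository names, while GitHub repository
  // names may contain uppercase letters.
  return `${registry}/${repository.toLowerCase()}:${commitSha}`;
}

const exampleSha = "0123456789abcdef0123456789abcdef01234567"; // made-up SHA
console.log(imageRef("ghcr.io", "MyOrg/MyApp", exampleSha));
```

Whatever produces the tag at build time and whatever consumes it at deploy time have to agree on this exact string, otherwise the cluster cannot pull the image.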

The deploy-staging job uses the ArgoCD CLI to update the image tag and sync the application. ArgoCD then handles the actual Kubernetes deployment: it updates the deployment manifest, waits for the new pods to be healthy, and reports back. The argocd app wait command blocks until the deployment is fully rolled out and healthy, so the pipeline knows whether the deploy succeeded or failed.

If you are not using ArgoCD, you can replace this with kubectl commands:

      - name: Deploy to staging with kubectl
        run: |
          echo "${{ secrets.KUBE_CONFIG }}" | base64 -d > kubeconfig
          export KUBECONFIG=kubeconfig

          kubectl set image deployment/myapp \
            myapp=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
            -n staging

          kubectl rollout status deployment/myapp \
            -n staging \
            --timeout=300s

          rm kubeconfig

The key point is the same: update the image, then wait for the rollout to finish before moving on.

Smoke tests in detail#

The smoke test job is the gatekeeper between staging and production. It answers one question: is the thing we just deployed actually working?

We run three levels of smoke tests:

Health check: A simple HTTP request to /health. If the server is not responding, everything else is irrelevant. We retry up to 10 times with 10-second intervals because deployments can take a moment to stabilize.

API smoke test: A request to a real API endpoint. This validates that the application is not just running but actually serving requests correctly. The workflow step fails on any status other than 200 and prints the response body for debugging; asserting on the body itself is left to the Playwright test below.

E2E smoke test: A Playwright test that loads the application in a browser and performs a few critical user flows. This catches issues that API-level tests miss, like broken JavaScript bundles or misconfigured CDN paths.

You do not need all three levels on day one. Start with just the health check. Add the API test when you have an API. Add the E2E test when you have Playwright set up. The important thing is to have something that validates the deployment before you promote to production.
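The retry logic in the health check is worth understanding on its own, so here is the same idea as a TypeScript sketch. The `waitForHealthy` helper and the injected `check` function are hypothetical; injecting the check lets the example run without a real server. Against a real endpoint on Node 18 or newer you could pass something like `() => fetch(url).then((r) => r.status)`.

```typescript
// Hypothetical helper: `check` is injected so the sketch runs without a server.
type HealthCheck = () => Promise<number>; // resolves to an HTTP status code

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function waitForHealthy(
  check: HealthCheck,
  attempts = 10,
  delayMs = 10_000,
): Promise<boolean> {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      const status = await check();
      if (status === 200) return true;
      console.log(`Attempt ${attempt}: got ${status}`);
    } catch {
      // A refused connection counts as a failed attempt, not a crash.
      console.log(`Attempt ${attempt}: request failed`);
    }
    // Do not sleep after the final attempt.
    if (attempt < attempts) await sleep(delayMs);
  }
  return false;
}

// Example: a fake endpoint that returns 503 twice before becoming healthy.
const responses = [503, 503, 200];
const fakeCheck: HealthCheck = async () => responses.shift() ?? 200;

waitForHealthy(fakeCheck, 10, 10).then((healthy) => {
  console.log(healthy ? "Health check passed" : "Health check failed");
});
```

Note that a thrown error is treated like a bad status code. A deployment that is still starting often refuses connections outright, and that should trigger a retry rather than end the check.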

Here is a minimal Playwright smoke test:

import { test, expect } from "@playwright/test";

const BASE_URL = process.env.BASE_URL || "http://localhost:3000";

test.describe("Smoke Tests", () => {
  test("homepage loads successfully", async ({ page }) => {
    const response = await page.goto(BASE_URL);
    expect(response?.status()).toBe(200);
    await expect(page.locator("h1")).toBeVisible();
  });

  test("API returns valid response", async ({ request }) => {
    const response = await request.get(`${BASE_URL}/api/v1/status`);
    expect(response.status()).toBe(200);

    const body = await response.json();
    expect(body).toHaveProperty("status", "ok");
  });

  test("login page renders", async ({ page }) => {
    await page.goto(`${BASE_URL}/login`);
    await expect(page.locator('input[name="email"]')).toBeVisible();
    await expect(page.locator('button[type="submit"]')).toBeVisible();
  });
});

Keep smoke tests fast. They should run in under a minute. If you need comprehensive E2E coverage, run that in a separate workflow. Smoke tests are about confidence, not completeness.

Production promotion and manual approval#

The deploy-production job has environment: production, which triggers the protection rules you configured earlier. When the pipeline reaches this job, it pauses and shows a "Review deployments" button in the GitHub Actions UI. The required reviewers you configured get a notification, and the pipeline waits until one of them clicks "Approve."

This is intentional. Production deployments should be a deliberate decision. The approval step gives your team a moment to ask: did the smoke tests look good? Are there any known issues? Is this a good time to deploy (not Friday afternoon)?

Once approved, the production deploy follows the same pattern as staging: update the image tag, sync with ArgoCD, wait for the rollout, and verify with a health check.

You might be wondering why we do not run the full smoke test suite against production. Some teams do, and that is fine. But there is a tradeoff: running tests against production means your tests can fail due to production-specific issues (rate limiting, real data edge cases), and a test failure after deploy can cause confusion about whether the deploy itself failed. A simple health check is usually enough for the production verification step.

Deployment strategies#

The pipeline we built uses the default Kubernetes deployment strategy: rolling update. But it is worth understanding the alternatives and when to use them.

Rolling update (default)

This is what Kubernetes does out of the box. It gradually replaces old pods with new pods, one at a time (or in batches). At any point during the rollout, some pods are running the old version and some are running the new version.

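apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  replicas: 3
  # apps/v1 requires a selector that matches the pod template labels
  # (the app: myapp label is an example; use your own)
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: ghcr.io/myorg/myapp:abc123
          readinessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10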

maxSurge: 1 means Kubernetes can create one extra pod above the desired replica count during the rollout.

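maxUnavailable: 0 means no old pod is removed until its replacement is ready, so capacity never drops below the desired replica count. Together with the readiness probe, this is what keeps the rollout zero-downtime.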

readinessProbe tells Kubernetes when a new pod is ready to receive traffic. Without this, Kubernetes might send requests to a pod that is still starting up.

Rolling updates are the right choice for most applications. They are simple, zero-downtime, and well-supported by every Kubernetes distribution.
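To see how maxSurge and maxUnavailable interact, here is a simplified TypeScript simulation. It is a sketch under a strong assumption: every new pod started in one step is ready by the next step. The real Deployment controller reacts to readiness probes instead, and `simulateRollingUpdate` is a hypothetical name.

```typescript
type RolloutStep = { oldPods: number; newPods: number };

// Simplified model: new pods started in one step are ready by the next step.
function simulateRollingUpdate(
  replicas: number,
  maxSurge: number,
  maxUnavailable: number,
): RolloutStep[] {
  if (maxSurge === 0 && maxUnavailable === 0) {
    throw new Error("maxSurge and maxUnavailable cannot both be 0");
  }
  let oldPods = replicas;
  let newPods = 0;
  const steps: RolloutStep[] = [{ oldPods, newPods }];

  while (oldPods > 0 || newPods < replicas) {
    // Scale down old pods as far as the availability floor allows.
    const minAvailable = replicas - maxUnavailable;
    const removable = Math.min(
      oldPods,
      Math.max(0, oldPods + newPods - minAvailable),
    );
    oldPods -= removable;

    // Scale up new pods as far as the surge ceiling allows.
    const room = replicas + maxSurge - (oldPods + newPods);
    newPods += Math.min(room, replicas - newPods);

    steps.push({ oldPods, newPods });
  }
  return steps;
}

// The settings from the manifest: 3 replicas, maxSurge 1, maxUnavailable 0.
for (const step of simulateRollingUpdate(3, 1, 0)) {
  console.log(`old=${step.oldPods} new=${step.newPods}`);
}
```

With these settings the total pod count stays between 3 and 4 at every step, which is the "one extra pod, never one fewer" behavior described above.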

Blue-green deployment

In a blue-green deployment, you run two identical environments: blue (current production) and green (the new version). Traffic goes to blue while green is being deployed and tested. Once green is verified, you switch traffic from blue to green in one shot.

The advantage is that the switch is instantaneous and you can roll back by switching back to blue. The disadvantage is that you need double the resources during the deployment. In Kubernetes, you can implement blue-green by maintaining two deployments and switching the service selector:

# Deploy the new version as "green"
kubectl set image deployment/myapp-green \
  myapp=ghcr.io/myorg/myapp:new-version -n production

kubectl rollout status deployment/myapp-green -n production

# Switch traffic from blue to green
kubectl patch service myapp \
  -p '{"spec":{"selector":{"version":"green"}}}' -n production

Canary deployment

A canary deployment routes a small percentage of traffic (say 5%) to the new version while the majority continues hitting the old version. You monitor error rates and latency for the canary, and if everything looks good, you gradually increase the traffic split until 100% goes to the new version.

Canary deployments are powerful but require a service mesh (like Istio or Linkerd) or an ingress controller that supports traffic splitting. They are more complex to set up but give you the safest possible production rollout for high-traffic applications.
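The promotion logic behind a canary is simple even if the infrastructure is not. The TypeScript sketch below shows one possible policy; the weight steps, the 1% error threshold, and the `nextCanaryStep` name are all assumptions for illustration, and a real setup would get the error rate from your metrics system and apply the weight through the mesh or ingress controller.

```typescript
// Hypothetical promotion policy; the weights and threshold are illustrative.
const CANARY_WEIGHTS = [5, 25, 50, 100]; // percent of traffic on the new version

type CanaryDecision =
  | { action: "promote"; weight: number }
  | { action: "rollback"; weight: 0 }
  | { action: "done"; weight: 100 };

function nextCanaryStep(
  currentWeight: number,
  canaryErrorRate: number,
  maxErrorRate = 0.01,
): CanaryDecision {
  // Any step with too many errors sends all traffic back to the old version.
  if (canaryErrorRate > maxErrorRate) {
    return { action: "rollback", weight: 0 };
  }
  // Otherwise move to the next larger traffic share, if there is one.
  const next = CANARY_WEIGHTS.find((weight) => weight > currentWeight);
  if (next === undefined) {
    return { action: "done", weight: 100 };
  }
  return { action: "promote", weight: next };
}

console.log(nextCanaryStep(5, 0.002));  // healthy canary: move to 25%
console.log(nextCanaryStep(25, 0.05));  // error rate too high: roll back
console.log(nextCanaryStep(100, 0.001)); // already at 100%: done
```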

For this series, we will stick with rolling updates. They cover the vast majority of use cases, and you can always adopt blue-green or canary later when your needs grow.

Rollback strategies#

Things go wrong. A deploy passes all tests but a subtle bug appears under real traffic. You need to get back to a known-good state fast. Here are your options:

Option 1: Git revert and push

This is the simplest and most reliable approach. You revert the commit that caused the problem and push to main. The pipeline then builds and deploys the reverted code like any other change, including the production approval step.

# Find the commit that caused the issue
git log --oneline -5

# Revert it (HEAD assumes the bad commit is the most recent one;
# otherwise pass its SHA instead)
git revert HEAD

# Push to main, which triggers the pipeline
git push origin main

The advantage of this approach is that it goes through the full pipeline: lint, test, build, staging, smoke test, production. You know the reverted version works because it was validated at every stage. The downside is that it takes as long as a normal deployment (5-15 minutes depending on your pipeline).

Option 2: ArgoCD rollback

If you are using ArgoCD, you can roll back to a previous sync directly:

# List the sync history
argocd app history myapp-production

# Roll back to a specific revision
argocd app rollback myapp-production <revision-number>

This is faster than a git revert because it skips the build step. ArgoCD simply redeploys the previous manifests. However, it creates drift between your Git state and what is running in the cluster, and ArgoCD refuses to roll back an application that has automated sync enabled, so you may need to disable auto-sync first. You should still create a git revert afterwards to keep Git as the source of truth.

Option 3: kubectl rollout undo

Kubernetes keeps a history of deployments, and you can roll back with a single command:

# Roll back to the previous version
kubectl rollout undo deployment/myapp -n production

# Or roll back to a specific revision
kubectl rollout history deployment/myapp -n production
kubectl rollout undo deployment/myapp -n production --to-revision=3

Like the ArgoCD rollback, this is fast but creates drift from Git. Use it for emergencies, then follow up with a proper git revert.

The recommendation is: for planned rollbacks, use git revert. For emergencies, use kubectl rollout undo or ArgoCD rollback, then git revert as a follow-up. Either way, Git should always reflect what is actually running in production.

Pipeline best practices#

Now that you have a working pipeline, here are the practices that keep it fast, reliable, and maintainable over time:

Fail early

Order your jobs from fastest to slowest. Lint takes seconds, tests take a minute, Docker builds take several minutes. If the code does not pass lint, there is no point waiting for a Docker build to finish. The needs keyword enforces this ordering.

Parallelize where possible

Lint and test do not depend on each other. Run them in parallel. If you have multiple test suites (unit, integration, E2E), split them into separate jobs that run simultaneously. Every minute you shave off the pipeline is a minute your team gets back on every single commit.
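You can estimate what parallelism buys you by computing the critical path through the job graph. The TypeScript sketch below does that for a cut-down version of our pipeline. The durations are made-up numbers, `finishTime` is a hypothetical helper, and the code assumes the `needs` graph has no cycles.

```typescript
type Job = { durationSec: number; needs: string[] };

// Durations are made-up numbers for illustration, not measurements.
const jobs: Record<string, Job> = {
  lint: { durationSec: 30, needs: [] },
  test: { durationSec: 90, needs: [] },
  build: { durationSec: 240, needs: ["lint", "test"] },
};

// Earliest finish time of a job = its own duration + the slowest job it needs.
function finishTime(
  name: string,
  graph: Record<string, Job>,
  memo = new Map<string, number>(),
): number {
  const cached = memo.get(name);
  if (cached !== undefined) return cached;

  const job = graph[name];
  if (!job) throw new Error(`Unknown job: ${name}`);

  const start = Math.max(
    0,
    ...job.needs.map((dep) => finishTime(dep, graph, memo)),
  );
  const finish = start + job.durationSec;
  memo.set(name, finish);
  return finish;
}

const parallel = finishTime("build", jobs);
const sequential = Object.values(jobs).reduce(
  (sum, job) => sum + job.durationSec,
  0,
);
console.log({ parallel, sequential });
```

With these numbers the build finishes after 330 seconds instead of 360, because lint runs entirely in the shadow of the slower test job.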

Cache aggressively

Cache everything that does not change between builds:

npm dependencies: Use actions/setup-node with cache: "npm". This caches npm's download cache directory (not node_modules) and keys it on a hash of package-lock.json, so the cache is reused until your dependencies change.

Docker layers: Use BuildKit with cache-from: type=gha and cache-to: type=gha,mode=max. This stores and restores layer caches using GitHub's cache backend.

Test fixtures: If your tests download large fixtures, cache them with actions/cache.

Without caching, a typical pipeline takes 8-12 minutes. With caching, it can drop to 3-5 minutes.
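All of these caches rely on the same idea: derive the cache key from the content that determines the result. Here is a minimal TypeScript sketch of that idea for the npm case. The key format is an assumption made for illustration and is not the exact key that actions/setup-node generates.

```typescript
import { createHash } from "node:crypto";

// Hypothetical key format; the real key layout used by actions/setup-node may differ.
function cacheKey(runnerOs: string, lockfileContents: string): string {
  const digest = createHash("sha256").update(lockfileContents).digest("hex");
  return `${runnerOs}-npm-${digest}`;
}

// Made-up lockfile contents, just to show when the key changes.
const before = cacheKey("Linux", '{"lockfileVersion":3,"packages":{}}');
const same = cacheKey("Linux", '{"lockfileVersion":3,"packages":{}}');
const after = cacheKey("Linux", '{"lockfileVersion":3,"packages":{"left-pad":{}}}');

console.log(before === same);  // unchanged lockfile: same key, cache hit
console.log(before === after); // changed lockfile: new key, cache miss
```

Source changes alone never invalidate the dependency cache, because the key only depends on the lockfile.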

Keep it DRY

If you have multiple repositories with similar pipelines, extract common steps into reusable workflows or composite actions:

# .github/workflows/reusable-deploy.yml
name: Deploy
on:
  workflow_call:
    inputs:
      environment:
        required: true
        type: string
      argocd-app:
        required: true
        type: string
    secrets:
      ARGOCD_AUTH_TOKEN:
        required: true

jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: ${{ inputs.environment }}
    steps:
      - name: Install ArgoCD CLI
        run: |
          curl -sSL -o argocd https://github.com/argoproj/argo-cd/releases/latest/download/argocd-linux-amd64
          chmod +x argocd
          sudo mv argocd /usr/local/bin/

      - name: Deploy
        env:
          ARGOCD_SERVER: ${{ vars.ARGOCD_SERVER }}
          ARGOCD_AUTH_TOKEN: ${{ secrets.ARGOCD_AUTH_TOKEN }}
        run: |
          argocd app set ${{ inputs.argocd-app }} \
            --parameter image.tag=${{ github.sha }} \
            --grpc-web
          argocd app sync ${{ inputs.argocd-app }} \
            --grpc-web --timeout 300
          argocd app wait ${{ inputs.argocd-app }} \
            --grpc-web --timeout 300 --health

Then call it from your main pipeline:

  deploy-staging:
    needs: [build]
    uses: ./.github/workflows/reusable-deploy.yml
    with:
      environment: staging
      argocd-app: myapp-staging
    secrets:
      ARGOCD_AUTH_TOKEN: ${{ secrets.ARGOCD_AUTH_TOKEN }}

This avoids duplicating deployment logic across staging and production jobs. When you need to change how deployments work, you change it in one place.

Pin your action versions

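Always use specific versions (or commit SHAs) for actions, not @main or @latest. Third-party actions can change without warning, and a broken action version can break your pipeline across all repositories at once. A tag like @v4 can still be moved by the action's maintainers, so a full commit SHA is the only truly immutable reference. The same principle applies to tools the workflow downloads: the ArgoCD CLI install step shown earlier fetches the latest release, and pinning it to a specific release keeps the pipeline reproducible. For actions, the difference looks like this:

# Good: pinned to a major version tag (a full commit SHA is stricter still)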
- uses: actions/checkout@v4
- uses: docker/build-push-action@v6

# Bad: unpinned, can break without warning
- uses: actions/checkout@main
- uses: some-org/some-action@latest

Monitoring your pipeline#

A pipeline is only useful if you know how it is performing. GitHub Actions gives you several ways to monitor pipeline health:

Workflow run history: Go to the Actions tab in your repository. You can see every run, filter by workflow, branch, or status, and drill into individual jobs and steps.

Build time trends: Track how long your pipeline takes over time. If builds are getting slower, it usually means your test suite is growing without corresponding optimization, or your Docker cache is not working correctly.

Failure rate: If your pipeline fails more than 10% of the time on legitimate code changes, something is flaky. Common culprits are network-dependent tests, race conditions, and service container startup timing.

Status badges: Add a workflow status badge to your README so the team can see pipeline health at a glance.

You can add a status badge to your README with this markdown:

![CI/CD](https://github.com/myorg/myapp/actions/workflows/ci-cd.yml/badge.svg)

For more advanced monitoring, consider integrating with tools like Datadog CI Visibility or Grafana with the GitHub Actions exporter. These give you dashboards with build time percentiles, failure breakdowns by job, and alerts when build times exceed a threshold.
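If you want to start small before adopting a dedicated tool, the two most useful numbers are easy to compute yourself. The TypeScript sketch below works on made-up sample data; the `WorkflowRun` shape is an assumption, and in practice you would fill it from your workflow run history.

```typescript
type WorkflowRun = { conclusion: "success" | "failure"; durationSec: number };

function failureRate(runs: WorkflowRun[]): number {
  if (runs.length === 0) return 0;
  const failures = runs.filter((run) => run.conclusion === "failure").length;
  return failures / runs.length;
}

// Nearest-rank percentile: the smallest value with at least p% of values at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no values");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Made-up sample data: ten runs, two failures, one very slow build.
const durations = [300, 310, 320, 330, 340, 350, 360, 370, 380, 900];
const runs: WorkflowRun[] = durations.map((durationSec, index) => ({
  conclusion: index === 3 || index === 9 ? "failure" : "success",
  durationSec,
}));

const rate = failureRate(runs);
console.log(`Failure rate: ${rate * 100}%`, rate > 0.1 ? "(investigate)" : "");
console.log(`p50: ${percentile(durations, 50)}s, p95: ${percentile(durations, 95)}s`);
```

Percentiles matter more than averages here: a single 900-second outlier barely moves the median but dominates the p95, and the p95 is what your team actually feels on a bad day.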

Putting it all together#

Let's recap what happens when a developer pushes a change through this pipeline:

Developer opens a PR: Lint and test run automatically. The PR gets a green checkmark or a red X. Code review happens in parallel.

PR is merged to main: Lint and test run again on the merged code. Then the build job creates a Docker image tagged with the commit SHA and pushes it to GHCR.

Staging deploy: ArgoCD updates the staging deployment with the new image tag. The pipeline waits until the rollout is healthy.

Smoke tests: Health check, API test, and E2E test run against staging. If any fail, the pipeline stops and the team is notified.

Manual approval: A reviewer checks the staging deployment, confirms it looks good, and clicks "Approve" in the GitHub Actions UI.

Production deploy: ArgoCD updates the production deployment. A final health check confirms the deployment is live.

The entire process, from merge to production, takes about 10-15 minutes, not counting the time spent waiting for approval. Most of that time is in the build and test stages. The actual deployment steps take less than a minute each.

If anything goes wrong, the pipeline stops at the failed step. No code reaches production unless it has passed every gate. And if something slips through, you can roll back in under a minute with kubectl rollout undo or an ArgoCD rollback, then follow up with a git revert that goes through the full pipeline.

What comes next#

We now have a complete, end-to-end CI/CD pipeline that takes code from a pull request to production with automated validation at every stage. In the next article, we will look at monitoring and observability: how to know what your application is doing once it is running in production.

Hope you found this useful and enjoyed reading it, until next time!

Debugging mystery: intermittent 503s after a deploy#

Your shiny new pipeline shipped cleanly, the rollout went green... and now some requests return 503. Not all, just some. Diagnose it before you reach for rollback.
