SRE: GitOps with ArgoCD
Introduction
Throughout this SRE series we have covered SLIs and SLOs, incident management, observability, chaos engineering, and capacity planning. All of those practices assume that when you change something, the change is tracked, reviewed, auditable, and easy to roll back. That is exactly what GitOps gives you.
If you have been deploying to Kubernetes with kubectl apply or CI pipelines that push directly to the cluster,
you probably know the pain: someone applies a hotfix manually, another person runs a different version of a
manifest, and before you know it the cluster state has drifted from what is in your repository. Nobody knows
what is actually running. GitOps solves this by making Git the single source of truth and using a controller
to continuously reconcile the cluster state with what is declared in your repository.
Let’s get into it.
What is GitOps?
GitOps is an operational model where the desired state of your infrastructure and applications is declared in Git. A controller running in your cluster watches the Git repository and ensures the live state matches the declared state. If something drifts, the controller corrects it automatically.
The core principles are:
- Declarative configuration: Everything is described as YAML or JSON manifests in Git. No imperative scripts, no manual steps.
- Git as the single source of truth: The Git repository is the only place where changes are made. What is in Git is what runs in the cluster.
- Pull-based reconciliation: Instead of CI pushing to the cluster, a controller inside the cluster pulls the desired state from Git. This is more secure because the cluster credentials never leave the cluster.
- Continuous reconciliation: The controller does not just apply changes once. It continuously compares the live state with the desired state and corrects any drift.
This is fundamentally different from traditional push-based CI/CD where a pipeline runs kubectl apply after
a build. With push-based CD, if someone changes something in the cluster manually, your CI does not know about
it. With GitOps, the controller detects the drift and fixes it.
# Push-based CI/CD (traditional):
# Developer → Git push → CI builds → CI runs kubectl apply → Cluster
# (CI needs cluster credentials)
# (drift goes undetected)
#
# Pull-based GitOps:
# Developer → Git push → Controller detects change → Controller applies → Cluster
# (controller lives in cluster, watches Git continuously)
# (drift is detected and corrected automatically)
ArgoCD architecture
ArgoCD is the most popular GitOps controller for Kubernetes. It is a CNCF graduated project with a well-defined architecture.
- API Server: The gRPC/REST server that powers the web UI, CLI, and external integrations. Handles authentication, RBAC, and serves the application state.
- Repository Server: Clones Git repositories and generates Kubernetes manifests. Supports plain YAML, Kustomize, Helm, Jsonnet, and custom plugins.
- Application Controller: The brain of ArgoCD. Watches Application resources, compares desired state (from Git) with live state (from the cluster), and performs sync operations when they differ.
- Redis: Caching layer for the repository server and application controller.
- ApplicationSet Controller: Manages ApplicationSet resources that generate multiple Applications from a single definition.
# ArgoCD reconciliation loop (runs every 3 minutes by default):
# 1. Application Controller reads the Application CRD
# 2. Asks Repo Server to fetch and render manifests from Git
# 3. Controller compares rendered manifests with live cluster state
# 4. If they differ:
# - With auto-sync: Controller applies the changes
# - Without auto-sync: Controller marks the app as OutOfSync
# 5. Controller updates Application status, loop repeats
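The decision at step 4 can be sketched in a few lines of Python. This is a hypothetical simplification: real ArgoCD diffs rendered manifests object by object through the Kubernetes API rather than comparing whole dictionaries, but the shape of the decision is the same.

```python
# Sketch of ArgoCD's sync decision (simplified: real ArgoCD diffs
# individual Kubernetes objects, not whole state dicts).

def reconcile(desired: dict, live: dict, auto_sync: bool) -> tuple[str, dict]:
    """Compare desired state (from Git) with live state and decide what to do."""
    if desired == live:
        return "Synced", live
    if auto_sync:
        # With auto-sync the controller applies the desired state.
        return "Synced", desired
    # Without auto-sync the app is only marked OutOfSync.
    return "OutOfSync", live

desired = {"replicas": 3, "image": "my-app:v1.2.3"}
live = {"replicas": 3, "image": "my-app:v1.2.2"}  # drifted

status, new_live = reconcile(desired, live, auto_sync=True)
print(status, new_live["image"])   # Synced my-app:v1.2.3

status, _ = reconcile(desired, live, auto_sync=False)
print(status)                      # OutOfSync
```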
Installing ArgoCD
A convenient and repeatable way to install ArgoCD is the community Helm chart. Create the namespace and add the chart repository:
kubectl create namespace argocd
helm repo add argo https://argoproj.github.io/argo-helm
helm repo update
Here is a production-ready values file:
# argocd-values.yaml
configs:
  params:
    server.insecure: true
    timeout.reconciliation: 180s
  cm:
    statusbadge.enabled: "true"
    kustomize.buildOptions: "--enable-helm"
server:
  replicas: 2
  ingress:
    enabled: true
    ingressClassName: nginx
    hostname: argocd.example.com
    tls: true
controller:
  resources:
    requests:
      cpu: 250m
      memory: 512Mi
    limits:
      memory: 1Gi
repoServer:
  replicas: 2
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      memory: 512Mi
helm install argocd argo/argo-cd \
--namespace argocd \
--values argocd-values.yaml \
--wait
# Get the initial admin password
kubectl -n argocd get secret argocd-initial-admin-secret \
-o jsonpath="{.data.password}" | base64 -d
# Login and change password
argocd login argocd.example.com --username admin --password <your-password>
argocd account update-password
Application CRDs
The Application CRD is the fundamental building block. It defines what to deploy, where, and how to keep it in sync:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  source:
    repoURL: https://github.com/kainlite/my-app-manifests
    targetRevision: main
    path: overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: my-app
  syncPolicy:
    automated:
      prune: true     # Delete resources no longer in Git
      selfHeal: true  # Revert manual changes
    syncOptions:
      - CreateNamespace=true
      - PruneLast=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas  # Ignore if HPA manages replicas
- source: Where to find the manifests. repoURL, targetRevision (branch/tag), and path (directory within the repo).
- destination: Where to deploy. server is the Kubernetes API endpoint, namespace is the target namespace.
- syncPolicy: How to keep things in sync. automated enables auto-sync, prune deletes removed resources, selfHeal reverts manual changes.
- ignoreDifferences: Fields to ignore when comparing desired vs. live state, useful for fields set dynamically by the cluster.
The App of Apps pattern
When you have many applications, managing each Application resource individually becomes tedious. The App of Apps pattern creates a parent Application that manages child Application manifests.
# Repository structure
gitops-repo/
├── apps/ # Parent app points here
│ ├── my-app.yaml # Child Application manifests
│ ├── monitoring.yaml
│ ├── cert-manager.yaml
│ └── ingress-nginx.yaml
├── my-app/
│ ├── base/
│ └── overlays/
│ ├── staging/
│ └── production/
└── monitoring/
The parent Application:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app-of-apps
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/kainlite/gitops-repo
    targetRevision: main
    path: apps
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
Child applications use sync-wave annotations to control deployment order. Infrastructure components get
wave 0, application workloads get wave 2:
# apps/cert-manager.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cert-manager
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "0"  # Deploy infrastructure first
spec:
  project: default
  source:
    repoURL: https://charts.jetstack.io
    chart: cert-manager
    targetRevision: v1.16.3
    helm:
      releaseName: cert-manager
      values: |
        installCRDs: true
  destination:
    server: https://kubernetes.default.svc
    namespace: cert-manager
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
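The wave ordering can be illustrated with a short Python sketch. This is a hypothetical simplification: ArgoCD sorts resources by the sync-wave annotation (treating a missing annotation as wave 0) and waits for each wave to become healthy before starting the next.

```python
# Sketch of sync-wave ordering (simplified: real ArgoCD also waits for
# health between waves and orders by resource kind within a wave).

SYNC_WAVE = "argocd.argoproj.io/sync-wave"

apps = [
    {"name": "my-app", "annotations": {SYNC_WAVE: "2"}},
    {"name": "cert-manager", "annotations": {SYNC_WAVE: "0"}},
    {"name": "ingress-nginx", "annotations": {SYNC_WAVE: "0"}},
    {"name": "monitoring", "annotations": {}},  # no annotation -> wave 0
]

def wave(app: dict) -> int:
    """Read the sync-wave annotation, defaulting to 0."""
    return int(app["annotations"].get(SYNC_WAVE, "0"))

for app in sorted(apps, key=wave):
    print(wave(app), app["name"])
```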
Sync strategies
ArgoCD gives you fine-grained control over how and when syncs happen. A common pattern is auto-sync for staging and manual for production:
# Auto-sync: changes applied automatically
syncPolicy:
  automated:
    prune: true     # Delete resources removed from Git
    selfHeal: true  # Revert manual cluster changes

# Manual sync: omit the automated section
syncPolicy:
  syncOptions:
    - CreateNamespace=true
Retry policies handle transient failures:
syncPolicy:
  retry:
    limit: 5
    backoff:
      duration: 5s
      factor: 2
      maxDuration: 3m
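Assuming the usual exponential-backoff semantics (each delay is the previous one multiplied by factor, capped at maxDuration), the schedule above works out like this:

```python
# Compute the retry delay schedule for an exponential backoff policy.
# Assumption: delays grow geometrically by `factor` and are capped at
# `max_duration`, matching common exponential-backoff semantics.

def backoff_delays(duration: float, factor: float,
                   max_duration: float, limit: int) -> list[float]:
    delay = duration
    delays = []
    for _ in range(limit):
        delays.append(min(delay, max_duration))
        delay *= factor
    return delays

# duration: 5s, factor: 2, maxDuration: 3m (180s), limit: 5
print(backoff_delays(5, 2, 180, 5))  # [5, 10, 20, 40, 80]
```

With limit 5 the cap is never reached; raise the limit to 7 and the last attempts would be clamped to 180 seconds.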
Sync windows restrict when ArgoCD can sync, useful for change freezes:
# In the AppProject spec
spec:
  syncWindows:
    - kind: allow
      schedule: "0 9 * * 1-5"  # Mon-Fri at 9am
      duration: 8h
      applications: ["*"]
    - kind: deny
      schedule: "0 0 20 12 *"  # Holiday freeze
      duration: 336h
      applications: ["*"]
      clusters: ["production"]
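A simplified sketch of how such windows are evaluated. Assumptions: the real implementation parses the cron schedule to find window start times, which are given directly here, and deny windows take precedence over allow windows.

```python
# Simplified sync-window evaluation: a sync is allowed only when `now`
# falls inside an allow window and outside every deny window.
from datetime import datetime, timedelta

def sync_allowed(now, allow_windows, deny_windows) -> bool:
    def inside(window):
        start, duration = window
        return start <= now < start + duration
    if any(inside(w) for w in deny_windows):
        return False  # deny wins over allow
    return any(inside(w) for w in allow_windows)

allow = [(datetime(2025, 1, 6, 9, 0), timedelta(hours=8))]  # Mon 9am, 8h
deny = [(datetime(2025, 12, 20), timedelta(hours=336))]     # holiday freeze

print(sync_allowed(datetime(2025, 1, 6, 10, 0), allow, deny))    # True
print(sync_allowed(datetime(2025, 1, 6, 20, 0), allow, deny))    # False
print(sync_allowed(datetime(2025, 12, 25, 10, 0), allow, deny))  # False
```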
Health checks and custom health
ArgoCD has built-in health checks for standard Kubernetes resources. For custom resources (CRDs), you can write Lua health check scripts:
- Healthy: The resource is operating correctly
- Progressing: Not yet healthy but making progress
- Degraded: The resource has an error
- Suspended: The resource is paused
- Missing: The resource does not exist
# Custom health check for cert-manager Certificate (in argocd-cm ConfigMap)
resource.customizations.health.cert-manager.io_Certificate: |
  hs = {}
  if obj.status ~= nil then
    if obj.status.conditions ~= nil then
      for i, condition in ipairs(obj.status.conditions) do
        if condition.type == "Ready" and condition.status == "False" then
          hs.status = "Degraded"
          hs.message = condition.message
          return hs
        end
        if condition.type == "Ready" and condition.status == "True" then
          hs.status = "Healthy"
          hs.message = condition.message
          return hs
        end
      end
    end
  end
  hs.status = "Progressing"
  hs.message = "Waiting for certificate"
  return hs
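For readers less familiar with Lua, the same logic reads like this in Python (illustrative only; ArgoCD evaluates the Lua script, not Python):

```python
# Python mirror of the Lua health check: scan status.conditions for the
# Ready condition and map it to an ArgoCD health status.

def certificate_health(obj: dict) -> tuple[str, str]:
    for condition in obj.get("status", {}).get("conditions", []):
        if condition.get("type") == "Ready":
            if condition.get("status") == "False":
                return "Degraded", condition.get("message", "")
            if condition.get("status") == "True":
                return "Healthy", condition.get("message", "")
    # No Ready condition yet: still being issued.
    return "Progressing", "Waiting for certificate"

ready = {"status": {"conditions": [
    {"type": "Ready", "status": "True", "message": "Certificate is up to date"}]}}
print(certificate_health(ready))  # ('Healthy', 'Certificate is up to date')
```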
Rollback patterns
One of the biggest advantages of GitOps is that rollback is just a git revert. ArgoCD also provides
its own rollback mechanisms for emergencies.
# The GitOps way: revert the commit in Git
git revert HEAD --no-edit
git push
# ArgoCD detects the change and syncs automatically
# ArgoCD history-based rollback
argocd app history my-app
argocd app rollback my-app 2
# Note: this does not revert Git, so auto-sync will eventually re-apply
# Disable auto-sync first or also revert in Git
Multi-cluster management with ApplicationSets
ApplicationSets generate multiple Applications from a template using generators. Instead of manually creating an Application for each cluster, you define a template and a generator that produces the variations.
List generator: Provide explicit parameter sets:
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: my-app
  namespace: argocd
spec:
  generators:
    - list:
        elements:
          - cluster: staging
            url: https://staging-api.example.com
          - cluster: production
            url: https://production-api.example.com
  template:
    metadata:
      name: "my-app-{{cluster}}"
    spec:
      project: default
      source:
        repoURL: https://github.com/kainlite/gitops-repo
        targetRevision: main
        path: "my-app/overlays/{{cluster}}"
      destination:
        server: "{{url}}"
        namespace: my-app
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
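Under the hood the generator is doing parameter substitution: each element's fields are spliced into the template to produce one Application. A hypothetical sketch (plain string replacement standing in for ArgoCD's template engine):

```python
# Sketch of list-generator expansion: one rendered Application per element.
# Assumption: simple {{key}} string substitution approximates the real
# templating, which supports more features.

elements = [
    {"cluster": "staging", "url": "https://staging-api.example.com"},
    {"cluster": "production", "url": "https://production-api.example.com"},
]

def render(template: str, params: dict) -> str:
    for key, value in params.items():
        template = template.replace("{{" + key + "}}", value)
    return template

for el in elements:
    name = render("my-app-{{cluster}}", el)
    path = render("my-app/overlays/{{cluster}}", el)
    print(name, "->", el["url"], path)
```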
Cluster generator: Automatically creates Applications for every matching cluster:
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: monitoring-stack
  namespace: argocd
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            environment: production
  template:
    metadata:
      name: "monitoring-{{name}}"
    spec:
      project: monitoring
      source:
        repoURL: https://github.com/kainlite/gitops-repo
        targetRevision: main
        path: monitoring
      destination:
        server: "{{server}}"
        namespace: monitoring
Git generator: Creates Applications based on directory structure or config files:
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: team-apps
  namespace: argocd
spec:
  generators:
    - git:
        repoURL: https://github.com/kainlite/gitops-repo
        revision: main
        directories:
          - path: "teams/*/apps/*"
  template:
    metadata:
      name: "{{path.basename}}"
    spec:
      project: default
      source:
        repoURL: https://github.com/kainlite/gitops-repo
        targetRevision: main
        path: "{{path}}"
      destination:
        server: https://kubernetes.default.svc
        namespace: "{{path.basename}}"
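A sketch of the parameters the directory generator derives from each matching path: {{path}} is the directory itself and {{path.basename}} its last segment. Python's fnmatch stands in for ArgoCD's glob matcher here, whose semantics differ slightly, so treat this as illustrative.

```python
# Sketch of git directory-generator matching: directories matching the
# glob each yield a parameter set for the template.
from fnmatch import fnmatch
from pathlib import PurePosixPath

repo_dirs = [
    "teams/payments/apps/checkout",
    "teams/payments/apps/ledger",
    "teams/platform/apps/gateway",
    "teams/platform/docs",  # does not match the pattern
]

for d in repo_dirs:
    if fnmatch(d, "teams/*/apps/*"):
        print({"path": d, "path.basename": PurePosixPath(d).name})
```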
Kustomize and Helm integration
ArgoCD natively supports both Kustomize and Helm. It renders manifests at sync time, so you do not need to run these tools in your CI pipeline.
For Kustomize, just point the Application source to the overlay directory:
# overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - target:
      kind: Deployment
      name: my-app
    patch: |
      - op: replace
        path: /spec/replicas
        value: 3
images:
  - name: kainlite/my-app
    newTag: v1.2.3
For Helm charts from a chart repository:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prometheus
  namespace: argocd
spec:
  source:
    repoURL: https://prometheus-community.github.io/helm-charts
    chart: kube-prometheus-stack
    targetRevision: 67.9.0
    helm:
      releaseName: prometheus
      values: |
        prometheus:
          prometheusSpec:
            retention: 30d
            storageSpec:
              volumeClaimTemplate:
                spec:
                  resources:
                    requests:
                      storage: 50Gi
        grafana:
          enabled: true
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  syncPolicy:
    syncOptions:
      - ServerSideApply=true
RBAC and SSO
Projects are the primary mechanism for restricting access. Each project defines which repositories, clusters, and namespaces an application can use:
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: payments-team
  namespace: argocd
spec:
  description: "Payments team project"
  sourceRepos:
    - "https://github.com/kainlite/payments-*"
  destinations:
    - server: https://kubernetes.default.svc
      namespace: "payments-*"
  namespaceResourceWhitelist:
    - group: "apps"
      kind: Deployment
    - group: ""
      kind: Service
    - group: ""
      kind: ConfigMap
    - group: ""
      kind: Secret
  roles:
    - name: developer
      policies:
        - p, proj:payments-team:developer, applications, get, payments-team/*, allow
        - p, proj:payments-team:developer, applications, sync, payments-team/*, allow
      groups:
        - payments-developers
    - name: admin
      policies:
        - p, proj:payments-team:admin, applications, *, payments-team/*, allow
      groups:
        - payments-admins
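Conceptually, each policy line is a tuple matched against the requested action and object. A simplified sketch (real ArgoCD delegates this to casbin, whose glob semantics differ slightly):

```python
# Sketch of RBAC policy matching: a request is allowed if any policy
# tuple for the role matches the action and object globs.
from fnmatch import fnmatch

policies = [
    # (role, resource, action pattern, object pattern)
    ("proj:payments-team:developer", "applications", "get", "payments-team/*"),
    ("proj:payments-team:developer", "applications", "sync", "payments-team/*"),
    ("proj:payments-team:admin", "applications", "*", "payments-team/*"),
]

def allowed(role: str, resource: str, action: str, obj: str) -> bool:
    return any(
        role == p_role and resource == p_res
        and fnmatch(action, p_act) and fnmatch(obj, p_obj)
        for p_role, p_res, p_act, p_obj in policies
    )

print(allowed("proj:payments-team:developer", "applications", "sync",
              "payments-team/my-app"))   # True
print(allowed("proj:payments-team:developer", "applications", "delete",
              "payments-team/my-app"))   # False
```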
SSO with OIDC:
# In argocd-cm ConfigMap
oidc.config: |
  name: Keycloak
  issuer: https://keycloak.example.com/realms/engineering
  clientID: argocd
  clientSecret: $oidc.keycloak.clientSecret
  requestedScopes: ["openid", "profile", "email", "groups"]

# In argocd-rbac-cm ConfigMap
policy.default: role:readonly
policy.csv: |
  g, platform-admins, role:admin
  g, payments-developers, proj:payments-team:developer
  p, role:readonly, applications, get, */*, allow
Notifications
ArgoCD Notifications sends alerts on sync events. It has shipped bundled with ArgoCD (and its Helm chart) since v2.3, so no separate install is needed:
# argocd-notifications-cm ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
  namespace: argocd
data:
  service.slack: |
    token: $slack-token
  template.app-sync-succeeded: |
    slack:
      attachments: |
        [{"color": "#18be52", "title": "{{.app.metadata.name}} synced successfully",
          "fields": [
            {"title": "Revision", "value": "{{.app.status.sync.revision}}", "short": true},
            {"title": "Namespace", "value": "{{.app.spec.destination.namespace}}", "short": true}
          ]}]
  template.app-sync-failed: |
    slack:
      attachments: |
        [{"color": "#E96D76", "title": "{{.app.metadata.name}} sync FAILED",
          "fields": [
            {"title": "Error", "value": "{{range .app.status.conditions}}{{.message}}{{end}}"}
          ]}]
  trigger.on-sync-succeeded: |
    - when: app.status.operationState.phase in ['Succeeded']
      send: [app-sync-succeeded]
  trigger.on-sync-failed: |
    - when: app.status.operationState.phase in ['Error', 'Failed']
      send: [app-sync-failed]
Subscribe applications to notifications with annotations:
metadata:
  annotations:
    notifications.argoproj.io/subscribe.on-sync-succeeded.slack: deployments
    notifications.argoproj.io/subscribe.on-sync-failed.slack: deployments-alerts
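Putting triggers and subscriptions together, the evaluation can be sketched like this (simplified: real triggers are expressions evaluated against the full Application object, not just its phase):

```python
# Sketch of notification dispatch: a trigger whose condition matches fires
# its templates to the channel the app subscribed with.

triggers = {
    "on-sync-succeeded": (lambda app: app["phase"] in ["Succeeded"],
                          ["app-sync-succeeded"]),
    "on-sync-failed": (lambda app: app["phase"] in ["Error", "Failed"],
                       ["app-sync-failed"]),
}

subscriptions = {  # from the annotations above: trigger -> Slack channel
    "on-sync-succeeded": "deployments",
    "on-sync-failed": "deployments-alerts",
}

def notifications_for(app: dict) -> list[tuple[str, str]]:
    sent = []
    for name, (condition, templates) in triggers.items():
        if condition(app) and name in subscriptions:
            sent.extend((t, subscriptions[name]) for t in templates)
    return sent

print(notifications_for({"phase": "Failed"}))
# [('app-sync-failed', 'deployments-alerts')]
```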
Monitoring ArgoCD itself
ArgoCD exposes Prometheus metrics out of the box. Here are the key metrics to watch:
- argocd_app_info: Gauge with sync status and health per application
- argocd_app_sync_total: Counter of sync operations (track deployment frequency)
- argocd_app_reconcile_bucket: Histogram of reconciliation duration
- argocd_git_request_total: Counter of Git requests (failures mean ArgoCD cannot reach your repos)
- argocd_cluster_api_resource_objects: Gauge of tracked objects per cluster (memory planning)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: argocd-alerts
  namespace: monitoring
spec:
  groups:
    - name: argocd.rules
      rules:
        - alert: ArgoCDAppOutOfSync
          expr: argocd_app_info{sync_status="OutOfSync"} == 1
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "ArgoCD app {{ $labels.name }} out of sync for 30m+"
        - alert: ArgoCDAppUnhealthy
          expr: argocd_app_info{health_status!~"Healthy|Progressing"} == 1
          for: 15m
          labels:
            severity: critical
          annotations:
            summary: "ArgoCD app {{ $labels.name }} is {{ $labels.health_status }}"
        - alert: ArgoCDSyncFailing
          expr: increase(argocd_app_sync_total{phase!="Succeeded"}[1h]) > 3
          labels:
            severity: critical
          annotations:
            summary: "More than 3 failed syncs in 1h for {{ $labels.name }}"
        - alert: ArgoCDGitFetchErrors
          expr: increase(argocd_git_request_total{request_type="fetch", result="error"}[10m]) > 5
          labels:
            severity: warning
          annotations:
            summary: "ArgoCD cannot fetch from Git repositories"
Make sure Prometheus scrapes ArgoCD metrics:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: argocd-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: argocd
  namespaceSelector:
    matchNames: [argocd]
  endpoints:
    - port: metrics
      interval: 30s
Closing notes
GitOps with ArgoCD gives you a deployment workflow that is auditable, repeatable, and self-healing. By treating Git as the single source of truth and letting a controller handle reconciliation, you eliminate an entire class of problems related to configuration drift and manual deployments. The combination of Application CRDs, the App of Apps pattern, ApplicationSets, and proper RBAC gives you a solid foundation for managing anything from a single cluster to a fleet of clusters across multiple environments.
This article continues the SRE series where we have been building up the practices and tools needed to run reliable systems. GitOps is the glue that ties everything together, because all the SLO definitions, monitoring configurations, and infrastructure changes we covered in previous articles should flow through Git and be reconciled by ArgoCD.
Hope you found this useful and enjoyed reading it, until next time!
Errata
If you spot any error or have any suggestion, please send me a message so it gets fixed.
Also, you can check the source code and changes in the sources here