SRE: Security as Code

2026-03-29 | Gabriel Garrido | 23 min read

Introduction

In the previous articles we covered SLIs and SLOs, incident management, observability, chaos engineering, capacity planning, GitOps, secrets management, cost optimization, dependency management, database reliability, and release engineering. All of those topics assume that your cluster and workloads are secure, but security is often treated as an afterthought or someone else’s problem.


That stops today. Security is an SRE concern because a security incident is just another type of incident that burns your error budget, erodes user trust, and creates operational chaos. The shift-left approach means we define security policies as code, enforce them automatically, and treat security violations the same way we treat SLO breaches: with measurable indicators, automated responses, and continuous improvement.


In this article we are going to cover the full security-as-code stack for Kubernetes: admission control with OPA Gatekeeper, Pod Security Standards, network policies, image scanning in CI, RBAC hardening, audit logging, runtime security with Falco, and supply chain security with Cosign and Kyverno. All as code, all automated.


Let’s get into it.


OPA and Gatekeeper policies

Open Policy Agent (OPA) is a general-purpose policy engine, and Gatekeeper is the Kubernetes-native way to use it. Gatekeeper acts as an admission controller that intercepts every request to the Kubernetes API server and evaluates it against your policies before allowing or denying it.


The beauty of this approach is that your security policies become code that lives in Git, gets reviewed in PRs, and is enforced automatically. No more hoping that developers remember to add the right labels or avoid privileged containers.
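Because templates and constraints are plain YAML, you can also evaluate them offline in the same PR pipeline with the gator CLI before they ever reach a cluster. A minimal sketch, assuming the directory layout used in this article plus a `test-manifests/` directory of sample resources (that directory name is just an example):

```shell
# Evaluate sample manifests against the templates and constraints locally;
# gator prints any violations without needing a running cluster.
gator test \
  -f policies/templates/ \
  -f policies/constraints/ \
  -f test-manifests/
```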


Installing Gatekeeper

Getting Gatekeeper into your cluster is straightforward with Helm:


# Install Gatekeeper via Helm
helm repo add gatekeeper https://open-policy-agent.github.io/gatekeeper/charts
helm repo update

helm install gatekeeper gatekeeper/gatekeeper \
  --namespace gatekeeper-system \
  --create-namespace \
  --set replicas=3 \
  --set audit.replicas=2 \
  --set audit.logLevel=INFO

Or if you prefer a declarative ArgoCD approach:


# argocd/gatekeeper-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gatekeeper
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://open-policy-agent.github.io/gatekeeper/charts
    chart: gatekeeper
    targetRevision: 3.15.0
    helm:
      values: |
        replicas: 3
        audit:
          replicas: 2
          logLevel: INFO
  destination:
    server: https://kubernetes.default.svc
    namespace: gatekeeper-system
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

ConstraintTemplate: Require labels

Gatekeeper uses two resources: ConstraintTemplates (the policy logic in Rego) and Constraints (how to apply them). Here is a template that requires specific labels on all resources:


# policies/templates/require-labels.yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        violation[{"msg": msg, "details": {"missing_labels": missing}}] {
          provided := {label | input.review.object.metadata.labels[label]}
          required := {label | label := input.parameters.labels[_]}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("Resource is missing required labels: %v", [missing])
        }

And the constraint that applies it to all namespaces:


# policies/constraints/require-labels.yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: all-must-have-owner
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Namespace"]
      - apiGroups: ["apps"]
        kinds: ["Deployment", "StatefulSet"]
  parameters:
    labels:
      - "app.kubernetes.io/name"
      - "app.kubernetes.io/managed-by"
      - "team"

ConstraintTemplate: Block privileged pods

This one is critical. Privileged containers have full access to the host, which means a container escape gives an attacker root on the node:


# policies/templates/block-privileged.yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sblockprivileged
spec:
  crd:
    spec:
      names:
        kind: K8sBlockPrivileged
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sblockprivileged

        # Collect containers from both bare Pod specs and workload pod
        # templates; without the template.spec variants, Deployments and
        # StatefulSets would silently pass this check.
        input_containers[c] {
          c := input.review.object.spec.containers[_]
        }
        input_containers[c] {
          c := input.review.object.spec.initContainers[_]
        }
        input_containers[c] {
          c := input.review.object.spec.template.spec.containers[_]
        }
        input_containers[c] {
          c := input.review.object.spec.template.spec.initContainers[_]
        }

        violation[{"msg": msg}] {
          container := input_containers[_]
          container.securityContext.privileged == true
          msg := sprintf("Privileged containers are not allowed: %v", [container.name])
        }

# policies/constraints/block-privileged.yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sBlockPrivileged
metadata:
  name: no-privileged-containers
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
      - apiGroups: ["apps"]
        kinds: ["Deployment", "StatefulSet", "DaemonSet"]
    excludedNamespaces:
      - kube-system
      - gatekeeper-system

ConstraintTemplate: Enforce image registry

You probably do not want random Docker Hub images running in production. This policy restricts images to your trusted registries:


# policies/templates/allowed-registries.yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sallowedregistries
spec:
  crd:
    spec:
      names:
        kind: K8sAllowedRegistries
      validation:
        openAPIV3Schema:
          type: object
          properties:
            registries:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sallowedregistries

        # Cover bare Pods as well as the pod templates inside Deployments,
        # StatefulSets, and DaemonSets.
        input_containers[c] {
          c := input.review.object.spec.containers[_]
        }
        input_containers[c] {
          c := input.review.object.spec.initContainers[_]
        }
        input_containers[c] {
          c := input.review.object.spec.template.spec.containers[_]
        }
        input_containers[c] {
          c := input.review.object.spec.template.spec.initContainers[_]
        }

        violation[{"msg": msg}] {
          container := input_containers[_]
          not registry_allowed(container.image)
          msg := sprintf("Image '%v' is from an untrusted registry. Allowed registries: %v",
            [container.image, input.parameters.registries])
        }

        registry_allowed(image) {
          registry := input.parameters.registries[_]
          startswith(image, registry)
        }

# policies/constraints/allowed-registries.yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRegistries
metadata:
  name: trusted-registries-only
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
      - apiGroups: ["apps"]
        kinds: ["Deployment", "StatefulSet", "DaemonSet"]
    excludedNamespaces:
      - kube-system
  parameters:
    registries:
      - "ghcr.io/kainlite/"
      - "docker.io/kainlite/"
      - "registry.k8s.io/"
      - "quay.io/"

With these three policies alone you already have a strong foundation: every resource needs ownership labels, no one can run privileged containers, and only images from trusted registries are allowed.
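You can sanity-check the setup without touching real workloads: a server-side dry-run goes through admission, so Gatekeeper evaluates the request but nothing is persisted. A sketch (the pod name is arbitrary):

```shell
# Try to create a privileged pod; expect a denial from the
# validation.gatekeeper.sh admission webhook.
kubectl run privileged-test -n production \
  --image=ghcr.io/kainlite/tr:latest \
  --dry-run=server \
  --overrides='{"apiVersion":"v1","spec":{"containers":[{"name":"privileged-test","image":"ghcr.io/kainlite/tr:latest","securityContext":{"privileged":true}}]}}'
```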


Pod Security Standards

Kubernetes ships with built-in Pod Security Standards (PSS) that provide three levels of security profiles. These work at the namespace level and do not require any external controller like Gatekeeper. They are a great starting point if you want something simple that covers the basics.


The three profiles are:


  • Privileged: Unrestricted. Allows everything. Used for system-level workloads like CNI plugins and monitoring agents.
  • Baseline: Prevents known privilege escalations. Blocks hostNetwork, hostPID, privileged containers, and most dangerous capabilities. Good default for most workloads.
  • Restricted: Heavily restricted. Requires non-root, drops all capabilities, disallows privilege escalation. The gold standard for application workloads.

Namespace-level enforcement

You apply PSS profiles using labels on namespaces. There are three modes:


  • enforce: Rejects pods that violate the policy
  • audit: Allows pods but logs violations
  • warn: Allows pods but shows a warning to the user

A good rollout strategy is to start with warn and audit, review violations, fix them, and then switch to enforce:


# namespaces/production.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/audit-version: latest
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/warn-version: latest

# namespaces/staging.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: staging
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/audit-version: latest
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/warn-version: latest

# namespaces/monitoring.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/audit: baseline
    pod-security.kubernetes.io/audit-version: latest
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/warn-version: latest
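Before tightening a namespace's enforce level, you can preview the impact with a server-side dry-run of the label change; the API server emits a warning for every existing pod that would be rejected:

```shell
# Preview which running pods in staging would violate "restricted"
# before actually flipping the enforce label.
kubectl label --dry-run=server --overwrite ns staging \
  pod-security.kubernetes.io/enforce=restricted
```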

Making your pods compliant

For the restricted profile, your pods need to meet several requirements. Here is what a compliant pod spec looks like:


# deployments/tr-web.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tr-web
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/name: tr-web
  template:
    metadata:
      labels:
        app.kubernetes.io/name: tr-web
        app.kubernetes.io/managed-by: argocd
        team: platform
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: tr-web
          image: ghcr.io/kainlite/tr:latest
          ports:
            - containerPort: 4000
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop:
                - ALL
          volumeMounts:
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: tmp
          emptyDir: {}

The key security settings are: runAsNonRoot, allowPrivilegeEscalation: false, dropping all capabilities, read-only root filesystem, and a seccomp profile. If any of those are missing, the restricted profile will reject the pod.


Network policies

By default, every pod in Kubernetes can talk to every other pod. That is terrible for security. If an attacker compromises one pod, they can freely move laterally to every other service in the cluster. Network policies fix this by defining which traffic is allowed.


Default deny everything

The first thing you should do is create a default deny policy for every namespace. This blocks all traffic that is not explicitly allowed:


# network-policies/default-deny.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress

Now nothing can talk to anything. Time to allow the traffic you actually need.


Allow specific traffic

Here is a policy that allows the web frontend to receive traffic from the ingress controller and talk to the database:


# network-policies/tr-web.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-tr-web
  namespace: production
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: tr-web
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
          podSelector:
            matchLabels:
              app.kubernetes.io/name: ingress-nginx
      ports:
        - protocol: TCP
          port: 4000
  egress:
    # Allow DNS
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    # Allow database access
    - to:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: postgresql
      ports:
        - protocol: TCP
          port: 5432

Cilium network policies

If you are using Cilium as your CNI, you get access to more powerful network policies that can filter at L7 (HTTP, gRPC, DNS):


# cilium-policies/tr-web-l7.yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: tr-web-l7-policy
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app.kubernetes.io/name: tr-web
  ingress:
    - fromEndpoints:
        - matchLabels:
            app.kubernetes.io/name: ingress-nginx
            io.kubernetes.pod.namespace: ingress-nginx
      toPorts:
        - ports:
            - port: "4000"
              protocol: TCP
          rules:
            http:
              - method: GET
              - method: POST
                path: "/api/.*"
              - method: HEAD
  egress:
    - toEndpoints:
        - matchLabels:
            app.kubernetes.io/name: postgresql
      toPorts:
        - ports:
            - port: "5432"
              protocol: TCP
    # DNS policy
    - toEndpoints:
        - matchLabels:
            k8s-app: kube-dns
            io.kubernetes.pod.namespace: kube-system
      toPorts:
        - ports:
            - port: "53"
              protocol: ANY
          rules:
            dns:
              - matchPattern: "*.production.svc.cluster.local"
              - matchPattern: "*.kube-system.svc.cluster.local"

The L7 filtering is incredibly powerful. You can restrict not just which pods can talk to each other but also which HTTP methods and paths are allowed. This means even if an attacker compromises the web pod, they can only make the exact API calls that the web pod is supposed to make.
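If you run Hubble alongside Cilium, you can watch policy verdicts live, which makes debugging these rules much less painful. A sketch using the standard Hubble CLI:

```shell
# Show traffic dropped by policy in the production namespace
hubble observe --namespace production --verdict DROPPED

# Follow L7 HTTP flows to confirm the method/path rules behave as expected
hubble observe --namespace production --protocol http --follow
```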


Image scanning in CI

Catching vulnerabilities before they reach your cluster is much better than detecting them at runtime. Trivy is an excellent open-source scanner that checks container images for known CVEs, misconfigurations, and exposed secrets.
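Before wiring it into CI, you can run the same scan locally, assuming the Trivy CLI is installed:

```shell
# Scan a local image for HIGH/CRITICAL vulnerabilities, exiting non-zero
# if any fixable ones are found -- mirrors what the CI job below does.
trivy image \
  --severity HIGH,CRITICAL \
  --ignore-unfixed \
  --exit-code 1 \
  ghcr.io/kainlite/tr:latest
```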


Trivy in GitHub Actions

Here is a complete CI workflow that scans your images and blocks the deployment if high-severity vulnerabilities are found:


# .github/workflows/security-scan.yaml
name: Security Scan

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  trivy-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Build image
        run: |
          docker build -t ghcr.io/kainlite/tr:${{ github.sha }} .

      - name: Run Trivy vulnerability scanner
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ghcr.io/kainlite/tr:${{ github.sha }}
          format: table
          exit-code: 1
          ignore-unfixed: true
          vuln-type: os,library
          severity: CRITICAL,HIGH
          output: trivy-results.txt

      - name: Run Trivy for SARIF output
        uses: aquasecurity/trivy-action@master
        if: always()
        with:
          image-ref: ghcr.io/kainlite/tr:${{ github.sha }}
          format: sarif
          output: trivy-results.sarif
          ignore-unfixed: true
          vuln-type: os,library
          severity: CRITICAL,HIGH

      - name: Upload Trivy scan results to GitHub Security tab
        uses: github/codeql-action/upload-sarif@v3
        if: always()
        with:
          sarif_file: trivy-results.sarif

      - name: Scan Kubernetes manifests
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: config
          scan-ref: ./k8s/
          format: table
          exit-code: 1
          severity: CRITICAL,HIGH

The key parts are: exit-code: 1 makes the pipeline fail when vulnerabilities are found, ignore-unfixed: true skips CVEs that do not have a fix yet (so you do not block on things you cannot fix), and the SARIF upload pushes results to the GitHub Security tab for visibility.
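When a finding is a confirmed false positive or an accepted risk, Trivy honors a `.trivyignore` file in the repository root, so exceptions are also code that gets reviewed. A sketch (the CVE ID is a placeholder):

```text
# .trivyignore
# Accepted risk: only exploitable via a feature we do not enable
CVE-2024-0000
```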


Scanning Helm charts and IaC

Trivy can also scan your Kubernetes manifests, Helm charts, and Terraform files for misconfigurations:


# .github/workflows/iac-scan.yaml
name: IaC Security Scan

on:
  pull_request:
    paths:
      - 'k8s/**'
      - 'terraform/**'
      - 'charts/**'

jobs:
  trivy-config-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Scan Kubernetes manifests
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: config
          scan-ref: ./k8s/
          format: table
          exit-code: 1
          severity: CRITICAL,HIGH

      - name: Scan Terraform
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: config
          scan-ref: ./terraform/
          format: table
          exit-code: 1
          severity: CRITICAL,HIGH

      - name: Scan Helm charts
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: config
          scan-ref: ./charts/
          format: table
          exit-code: 0
          severity: CRITICAL,HIGH,MEDIUM

This catches issues like containers running as root, missing resource limits, missing network policies, and overly permissive RBAC before they ever get merged.


RBAC best practices

Role-Based Access Control (RBAC) is how you control who can do what in your Kubernetes cluster. The principle of least privilege is simple: give every user, service account, and automation only the permissions they actually need and nothing more.


ClusterRole vs Role

The first rule: prefer Role over ClusterRole whenever possible. A Role is scoped to a namespace, so a compromised service account can only affect that namespace. A ClusterRole applies cluster-wide.


# rbac/tr-web-role.yaml
# Namespace-scoped role for the application
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: tr-web
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["tr-web-config"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: tr-web
  namespace: production
subjects:
  - kind: ServiceAccount
    name: tr-web
    namespace: production
roleRef:
  kind: Role
  name: tr-web
  apiGroup: rbac.authorization.k8s.io

Service account hardening

Every pod should have its own service account with only the permissions it needs. The default service account in each namespace should have no permissions and automount should be disabled:


# rbac/default-sa-lockdown.yaml
# Disable automounting for the default service account
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default
  namespace: production
automountServiceAccountToken: false
---
# Create a dedicated service account for the app
apiVersion: v1
kind: ServiceAccount
metadata:
  name: tr-web
  namespace: production
  labels:
    app.kubernetes.io/name: tr-web
    team: platform
automountServiceAccountToken: true

Aggregated ClusterRoles for team access

For human access to the cluster, use aggregated ClusterRoles that compose permissions from multiple smaller roles. This makes it easy to add new permissions without editing a monolithic role:


# rbac/team-roles.yaml
# Base read-only role for all team members
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: team-readonly
  labels:
    rbac.kainlite.com/aggregate-to-developer: "true"
    rbac.kainlite.com/aggregate-to-sre: "true"
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps", "events"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets", "daemonsets", "replicasets"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["batch"]
    resources: ["jobs", "cronjobs"]
    verbs: ["get", "list", "watch"]
---
# Additional permissions for developers
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: developer-extra
  labels:
    rbac.kainlite.com/aggregate-to-developer: "true"
rules:
  - apiGroups: [""]
    resources: ["pods/log", "pods/portforward"]
    verbs: ["get", "create"]
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["create"]
---
# Additional permissions for SREs
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: sre-extra
  labels:
    rbac.kainlite.com/aggregate-to-sre: "true"
rules:
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets"]
    verbs: ["patch", "update"]
  # rollbacks (kubectl rollout undo) are patches to the Deployment, already
  # covered above; the deployments/rollback subresource no longer exists in apps/v1
  - apiGroups: [""]
    resources: ["nodes"]
    # cordon/uncordon are kubectl conveniences, not API verbs; they work by
    # patching node.spec.unschedulable, so patch/update is what grants them
    verbs: ["get", "list", "watch", "patch", "update"]
---
# Aggregated role for developers
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: developer
aggregationRule:
  clusterRoleSelectors:
    - matchLabels:
        rbac.kainlite.com/aggregate-to-developer: "true"
rules: []
---
# Aggregated role for SREs
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: sre
aggregationRule:
  clusterRoleSelectors:
    - matchLabels:
        rbac.kainlite.com/aggregate-to-sre: "true"
rules: []

The aggregation pattern means you can add a new ClusterRole with the right label and it automatically gets included in the aggregated role. No need to edit the parent role, which means fewer merge conflicts and cleaner Git history.
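You can verify the aggregated roles behave as intended with `kubectl auth can-i` and impersonation. This assumes ClusterRoleBindings mapping `developer` and `sre` groups to the aggregated roles (the bindings and user names here are examples, not shown above):

```shell
# Developers can read and debug, but not mutate workloads
kubectl auth can-i list pods --as=jane --as-group=developer -n production                    # expect: yes
kubectl auth can-i create pods --subresource=exec --as=jane --as-group=developer -n production # expect: yes
kubectl auth can-i patch deployments --as=jane --as-group=developer -n production            # expect: no

# SREs can roll deployments forward and back
kubectl auth can-i patch deployments --as=sam --as-group=sre -n production                   # expect: yes
```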


Audit logging

Kubernetes audit logging records every request to the API server. This is essential for security investigations, compliance requirements, and understanding who did what and when. Without audit logs, a security incident turns into guesswork.


Audit policy

You need an audit policy that defines what to log and at what level. Here is a practical policy that captures the important events without drowning you in noise:


# audit/audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Do not log requests to certain non-resource URL paths
  - level: None
    nonResourceURLs:
      - /healthz*
      - /readyz*
      - /livez*
      - /metrics

  # Do not log watch requests (too noisy)
  - level: None
    verbs: ["watch"]

  # Do not log kube-proxy and system:nodes
  - level: None
    users:
      - system:kube-proxy
    verbs: ["get", "list"]

  # Log secret access at Metadata level (do not log the secret values)
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets"]

  # Log all changes to pods and deployments at RequestResponse level
  - level: RequestResponse
    verbs: ["create", "update", "patch", "delete"]
    resources:
      - group: ""
        resources: ["pods", "pods/exec", "pods/portforward"]
      - group: "apps"
        resources: ["deployments", "statefulsets", "daemonsets"]

  # Log RBAC changes at RequestResponse level
  - level: RequestResponse
    verbs: ["create", "update", "patch", "delete"]
    resources:
      - group: "rbac.authorization.k8s.io"
        resources: ["clusterroles", "clusterrolebindings", "roles", "rolebindings"]

  # Log namespace changes
  - level: RequestResponse
    verbs: ["create", "update", "patch", "delete"]
    resources:
      - group: ""
        resources: ["namespaces"]

  # Log everything else at Metadata level
  - level: Metadata
    omitStages:
      - RequestReceived
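The policy file does nothing on its own; the API server has to be started with audit flags pointing at it. On a kubeadm cluster that means editing the kube-apiserver static pod manifest (the paths below are examples and must be hostPath-mounted into the pod):

```shell
# Flags for kube-apiserver (e.g. /etc/kubernetes/manifests/kube-apiserver.yaml)
--audit-policy-file=/etc/kubernetes/audit/audit-policy.yaml
--audit-log-path=/var/log/kubernetes/audit/audit.log
--audit-log-maxage=30      # days to retain old log files
--audit-log-maxbackup=10   # number of rotated files to keep
--audit-log-maxsize=100    # megabytes before rotation
```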

Sending audit logs to your observability stack

The audit logs need to go somewhere useful. If you are using the Loki stack from the observability article, you can configure the API server to write audit logs to a file and have Promtail ship them to Loki:


# audit/promtail-audit-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: promtail-audit-config
  namespace: monitoring
data:
  promtail.yaml: |
    server:
      http_listen_port: 3101

    positions:
      filename: /tmp/positions.yaml

    clients:
      - url: http://loki:3100/loki/api/v1/push

    scrape_configs:
      - job_name: kubernetes-audit
        static_configs:
          - targets:
              - localhost
            labels:
              job: kubernetes-audit
              __path__: /var/log/kubernetes/audit/*.log
        pipeline_stages:
          - json:
              expressions:
                level: level
                verb: verb
                user: user.username
                resource: objectRef.resource
                namespace: objectRef.namespace
                name: objectRef.name
                responseCode: responseStatus.code
          - labels:
              level:
              verb:
              user:
              resource:
              namespace:
          - timestamp:
              source: stageTimestamp
              format: RFC3339Nano

With audit logs in Loki, you can create Grafana dashboards that show who is accessing your cluster, what changes are being made, and alert on suspicious activity like someone creating a ClusterRoleBinding or exec-ing into a production pod.
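For example, assuming the Promtail pipeline above, a couple of LogQL queries worth alerting on (the label names match the extracted fields):

```logql
# Anyone creating cluster-wide RBAC bindings
{job="kubernetes-audit", resource="clusterrolebindings", verb="create"}

# Exec sessions into production pods (subresource is not a label, so parse it)
{job="kubernetes-audit", namespace="production"} | json | objectRef_subresource="exec"
```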


Falco for runtime security

Gatekeeper and PSS prevent bad configurations from entering the cluster, but what about runtime attacks? That is where Falco comes in. Falco monitors system calls at the kernel level and alerts when it detects suspicious behavior like a shell being spawned in a container, sensitive files being read, or unexpected network connections.


Installing Falco

Falco can be installed as a DaemonSet using Helm:


# Install Falco with Helm
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm repo update

helm install falco falcosecurity/falco \
  --namespace falco \
  --create-namespace \
  --set falcosidekick.enabled=true \
  --set falcosidekick.config.slack.webhookurl="https://hooks.slack.com/services/XXX" \
  --set driver.kind=ebpf \
  --set collectors.kubernetes.enabled=true

Or as an ArgoCD application:


# argocd/falco-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: falco
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://falcosecurity.github.io/charts
    chart: falco
    targetRevision: 4.2.0
    helm:
      values: |
        driver:
          kind: ebpf
        falcosidekick:
          enabled: true
          config:
            slack:
              webhookurl: "https://hooks.slack.com/services/XXX"
              minimumpriority: warning
            prometheus:
              extralabels: "source:falco"
        collectors:
          kubernetes:
            enabled: true
  destination:
    server: https://kubernetes.default.svc
    namespace: falco
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

Custom Falco rules

Falco ships with a comprehensive set of default rules, but you should add custom rules specific to your environment. Here are some practical examples:


# falco/custom-rules.yaml
# Detect exec into production pods
- rule: Exec into production pod
  desc: Detect when someone execs into a pod in the production namespace
  condition: >
    spawned_process
    and container
    and k8s.ns.name = "production"
    and proc.pname = "runc:[2:INIT]"
  output: >
    Shell spawned in production pod
    (user=%user.name pod=%k8s.pod.name ns=%k8s.ns.name
     container=%container.name command=%proc.cmdline)
  priority: WARNING
  tags: [security, shell, production]

# Detect reading sensitive files
- rule: Read sensitive file in container
  desc: Detect read of sensitive files like /etc/shadow or private keys
  condition: >
    open_read
    and container
    and (fd.name startswith /etc/shadow
      or fd.name startswith /etc/gshadow
      or fd.name contains id_rsa
      or fd.name contains id_ed25519
      or fd.name endswith .pem
      or fd.name endswith .key)
  output: >
    Sensitive file read in container
    (user=%user.name file=%fd.name pod=%k8s.pod.name
     ns=%k8s.ns.name container=%container.name)
  priority: WARNING
  tags: [security, filesystem, sensitive]

# Detect unexpected outbound connections
- rule: Unexpected outbound connection from production
  desc: Detect outbound connections to IPs not in the allowed list
  condition: >
    outbound
    and container
    and k8s.ns.name = "production"
    and not (fd.snet in (allowed_outbound_ips))
    and not (fd.sport in (53, 443, 5432))
  output: >
    Unexpected outbound connection from production
    (pod=%k8s.pod.name ns=%k8s.ns.name ip=%fd.sip port=%fd.sport
     command=%proc.cmdline container=%container.name)
  priority: NOTICE
  tags: [security, network, production]

# Detect container drift (new executables written and executed)
- rule: Container drift detected
  desc: Detect when new executables are written to a container filesystem and then executed
  condition: >
    spawned_process
    and container
    and proc.is_exe_upper_layer = true
  output: >
    Drift detected: new executable run in container
    (user=%user.name command=%proc.cmdline pod=%k8s.pod.name
     ns=%k8s.ns.name container=%container.name image=%container.image.repository)
  priority: ERROR
  tags: [security, drift]

# Detect crypto mining
- rule: Detect crypto mining activity
  desc: Detect processes known to be associated with cryptocurrency mining
  condition: >
    spawned_process
    and container
    and (proc.name in (xmrig, minerd, cpuminer, cryptonight)
      or proc.cmdline contains "stratum+tcp"
      or proc.cmdline contains "pool.minexmr")
  output: >
    Possible crypto mining detected
    (pod=%k8s.pod.name ns=%k8s.ns.name process=%proc.name
     command=%proc.cmdline container=%container.name)
  priority: CRITICAL
  tags: [security, crypto, mining]

Loading custom rules

You can deploy custom rules as a ConfigMap and tell Falco to load them:


# falco/custom-rules-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: falco-custom-rules
  namespace: falco
data:
  custom-rules.yaml: |
    - list: allowed_outbound_ips
      items: ["10.0.0.0/8", "172.16.0.0/12"]

    - rule: Exec into production pod
      desc: Detect when someone execs into a pod in the production namespace
      condition: >
        spawned_process
        and container
        and k8s.ns.name = "production"
        and proc.pname = "runc:[2:INIT]"
      output: >
        Shell spawned in production pod
         (user=%user.name pod=%k8s.pod.name ns=%k8s.ns.name
         container=%container.name command=%proc.cmdline)
      priority: WARNING
      tags: [security, shell, production]
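
If you deploy Falco with the official falcosecurity Helm chart, you do not even need to hand-write the ConfigMap: the chart renders custom rules for you from its customRules value. A minimal sketch, assuming the upstream chart (key names may vary between chart versions):

# falco/custom-rules-values.yaml
# Rendered by the falcosecurity/falco Helm chart into a rules
# ConfigMap that Falco loads alongside its default ruleset
customRules:
  custom-rules.yaml: |-
    - list: allowed_outbound_ips
      items: ["10.0.0.0/8", "172.16.0.0/12"]

# Apply it with:
#   helm upgrade --install falco falcosecurity/falco \
#     -n falco -f falco/custom-rules-values.yaml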

Falco gives you visibility into what is actually happening inside your containers at the system call level. Combined with network policies (which control what traffic is allowed) and Gatekeeper (which controls what configurations are allowed), you have defense in depth covering configuration time, network layer, and runtime.


Supply chain security

Your container images are only as trustworthy as the process that built them. Supply chain attacks, where an attacker compromises a dependency or build pipeline to inject malicious code, have become increasingly common. The solution is to sign your images and verify those signatures before allowing them to run.


Signing images with Cosign

Cosign from the Sigstore project makes it easy to sign and verify container images. Here is how to integrate it into your CI pipeline:


# .github/workflows/build-and-sign.yaml
name: Build, Sign, and Push

on:
  push:
    branches: [main]

permissions:
  contents: read
  packages: write
  id-token: write  # Required for keyless signing

jobs:
  build-sign-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install Cosign
        uses: sigstore/cosign-installer@main

      - name: Login to GHCR
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build and push image
        id: build
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ghcr.io/kainlite/tr:${{ github.sha }}

      - name: Sign the image with Cosign (keyless)
        env:
          COSIGN_EXPERIMENTAL: "true"
        run: |
          cosign sign --yes \
            ghcr.io/kainlite/tr@${{ steps.build.outputs.digest }}

      - name: Generate SBOM
        uses: anchore/sbom-action@v0
        with:
          image: ghcr.io/kainlite/tr:${{ github.sha }}
          format: spdx-json
          output-file: sbom.spdx.json

      - name: Attest SBOM
        run: |
          # Store the SBOM as a signed in-toto attestation with predicate
          # type https://spdx.dev/Document, which is what Kyverno can
          # verify at admission time
          cosign attest --yes \
            --predicate sbom.spdx.json \
            --type spdx \
            ghcr.io/kainlite/tr@${{ steps.build.outputs.digest }}

      - name: Upload SBOM as artifact
        uses: actions/upload-artifact@v4
        with:
          name: sbom
          path: sbom.spdx.json

This is keyless signing: Cosign exchanges the GitHub Actions OIDC token for a short-lived certificate from Sigstore’s Fulcio CA tied to your workflow identity, and records the signature in the Rekor transparency log. There are no long-lived keys to manage or rotate. The --yes flag simply skips the interactive confirmation prompt, which would otherwise hang a CI run. The COSIGN_EXPERIMENTAL variable is only needed on Cosign 1.x; keyless signing is the default in 2.x.
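
You can also check a signature by hand with cosign verify. The identity regexp and issuer below mirror the workflow above; adjust them to your own repository, and substitute a real digest:

# Verify the keyless signature; the certificate identity must match
# the GitHub Actions workflow that signed the image
cosign verify \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com \
  --certificate-identity-regexp 'https://github.com/kainlite/tr/.github/workflows/.*' \
  ghcr.io/kainlite/tr@sha256:<digest>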


SBOM generation

A Software Bill of Materials (SBOM) is a list of every component in your image. It is essential for tracking which of your images are affected when a new CVE is published. The workflow above generates an SPDX-format SBOM and attaches it to the image in the registry.
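
Once the SBOM lives in the registry, "are we affected by this CVE?" becomes a query instead of a scramble. A hedged sketch using cosign and jq (the package name openssl is just an example):

# Pull the SBOM back down from the registry
cosign download sbom ghcr.io/kainlite/tr:<tag> > sbom.spdx.json

# SPDX JSON lists components under .packages; grep for the package
# named in the CVE advisory
jq -r '.packages[] | "\(.name) \(.versionInfo)"' sbom.spdx.json | grep -i openssl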


Verifying signatures with Kyverno

Now that your images are signed, you need to enforce that only signed images can run in the cluster. Kyverno is a Kubernetes policy engine that can verify Cosign signatures at admission time:


# kyverno/verify-image-signature.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-image-signatures
  annotations:
    policies.kyverno.io/title: Verify Image Signatures
    policies.kyverno.io/description: >
      Verify that all container images are signed with Cosign
      using keyless signing from our GitHub Actions workflows.
spec:
  validationFailureAction: Enforce
  background: false
  webhookTimeoutSeconds: 30
  rules:
    - name: verify-signature
      match:
        any:
          - resources:
              kinds:
                - Pod
              namespaces:
                - production
                - staging
      verifyImages:
        - imageReferences:
            - "ghcr.io/kainlite/*"
          attestors:
            - entries:
                - keyless:
                    subject: "https://github.com/kainlite/tr/.github/workflows/*"
                    issuer: "https://token.actions.githubusercontent.com"
                    rekor:
                      url: https://rekor.sigstore.dev
          mutateDigest: true
          required: true

# kyverno/require-sbom.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-sbom-attestation
spec:
  validationFailureAction: Audit
  background: false
  rules:
    - name: check-sbom
      match:
        any:
          - resources:
              kinds:
                - Pod
              namespaces:
                - production
      verifyImages:
        - imageReferences:
            - "ghcr.io/kainlite/*"
          attestations:
            - type: https://spdx.dev/Document
              attestors:
                - entries:
                    - keyless:
                        subject: "https://github.com/kainlite/tr/.github/workflows/*"
                        issuer: "https://token.actions.githubusercontent.com"
              conditions:
                - all:
                    - key: "{{ creationInfo.created }}"
                      operator: NotEquals
                      value: ""

With this setup, the full supply chain flow is: GitHub Actions builds the image, signs it with Cosign using keyless signing, generates and attaches an SBOM, and Kyverno verifies the signature before allowing the image to run in the cluster. If someone pushes an unsigned image or an image that was not built by your CI pipeline, Kyverno rejects it.
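
A quick way to confirm the policy actually bites is to try running an image your pipeline never signed. This is a smoke test; the image name below is made up:

# This should be rejected by the verify-image-signatures ClusterPolicy,
# since no Cosign signature exists for the image
kubectl -n production run unsigned-smoke-test \
  --image=ghcr.io/kainlite/does-not-exist:latest \
  --restart=Never

If Kyverno is healthy you should see an admission error referencing the policy rather than a created pod; if the pod is admitted, the policy is still in Audit mode or the webhook failed open.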


Security SLOs

If you have been following the SRE series, you know that if you cannot measure it, you cannot improve it. Security is no different. Just like you track availability and latency SLOs, you should track security metrics as SLIs.


Vulnerability remediation time

How long does it take your team to patch a critical CVE after it is discovered? This is one of the most important security metrics:


# prometheus-rules/security-slis.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: security-slis
  namespace: monitoring
spec:
  groups:
    - name: security.slis
      interval: 1h
      rules:
        # Track critical vulnerability count over time.
        # Metric names here assume exporters such as trivy-operator and
        # falco-exporter; adjust them to the metrics your setup exposes.
        - record: security:critical_cves:total
          expr: |
            sum(trivy_vulnerability_count{severity="CRITICAL"})

        # Track high vulnerability count
        - record: security:high_cves:total
          expr: |
            sum(trivy_vulnerability_count{severity="HIGH"})

        # Track time since oldest unpatched critical CVE
        - record: security:oldest_critical_cve_age_days
          expr: |
            (time() - min(trivy_vulnerability_first_seen{severity="CRITICAL"})) / 86400

        # Policy violations detected by Gatekeeper audit
        - record: security:policy_violations:total
          expr: |
            sum(gatekeeper_violations)

        # Falco alerts rate
        - record: security:falco_alerts:rate1h
          expr: |
            sum(rate(falco_events_total{priority=~"WARNING|ERROR|CRITICAL"}[1h]))

Security SLOs definition

Define concrete SLOs for your security posture:


# security-slos.yaml
security_slos:
  vulnerability_remediation:
    description: "Critical CVEs must be patched within 7 days"
    sli: security:oldest_critical_cve_age_days
    objective: 7
    measurement: "Days since oldest unpatched critical CVE"

  policy_compliance:
    description: "Zero Gatekeeper policy violations in production"
    sli: security:policy_violations:total
    objective: 0
    measurement: "Total active policy violations"

  runtime_security:
    description: "Zero critical Falco alerts in production"
    sli: security:falco_alerts:rate1h
    objective: 0
    measurement: "Critical and error Falco alerts per hour"

  image_signing:
    description: "100% of production images must be signed"
    sli: kyverno:policy_violations:image_signature
    objective: 0
    measurement: "Unsigned images blocked or running"

Alerting on security SLOs

Set up alerts that fire when your security SLOs are at risk:


# prometheus-rules/security-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: security-alerts
  namespace: monitoring
spec:
  groups:
    - name: security.alerts
      rules:
        - alert: CriticalCVEUnpatchedTooLong
          expr: security:oldest_critical_cve_age_days > 5
          for: 1h
          labels:
            severity: warning
            team: platform
          annotations:
            summary: "Critical CVE has been unpatched for more than 5 days"
            description: "Oldest unpatched critical CVE is {{ $value }} days old. SLO target is 7 days."
            runbook: "https://runbooks.example.com/patch-critical-cve"

        - alert: GatekeeperPolicyViolations
          expr: security:policy_violations:total > 0
          for: 5m
          labels:
            severity: warning
            team: platform
          annotations:
            summary: "Gatekeeper policy violations detected"
            description: "{{ $value }} policy violations found in the cluster."

        - alert: FalcoCriticalAlert
          expr: security:falco_alerts:rate1h > 0
          for: 0m
          labels:
            severity: critical
            team: platform
          annotations:
            summary: "Falco detected critical security event"
            description: "Falco is reporting {{ $value }} critical/error events per hour."

Treating security metrics as SLIs gives you the same benefits as reliability SLOs: you can measure progress, set targets, alert when things drift, and make data-driven decisions about where to invest your security efforts.


Putting it all together

Here is a summary of the full security-as-code stack we built:


  1. OPA Gatekeeper: Admission control policies that enforce labels, block privileged containers, and restrict image registries
  2. Pod Security Standards: Built-in namespace-level security profiles (Privileged, Baseline, Restricted)
  3. Network policies: Default deny with explicit allow rules, L7 filtering with Cilium
  4. Image scanning with Trivy: CI pipeline that blocks deployments with critical vulnerabilities
  5. RBAC hardening: Least privilege roles, service account isolation, aggregated ClusterRoles
  6. Audit logging: Recording API server activity and shipping to your observability stack
  7. Falco runtime security: Detecting suspicious behavior at the system call level
  8. Supply chain security: Image signing with Cosign, SBOM generation, verification with Kyverno
  9. Security SLOs: Measuring and alerting on vulnerability remediation time and compliance metrics

Each layer covers a different phase of the attack surface: Gatekeeper and PSS prevent bad configurations, network policies limit blast radius, Trivy catches known vulnerabilities, RBAC restricts access, audit logs provide forensic evidence, Falco detects runtime attacks, and supply chain security ensures image integrity.


No single layer is perfect, but together they create defense in depth that makes it significantly harder for an attacker to succeed and much easier for you to detect and respond when something does go wrong.


Closing notes

Security as code is not about buying expensive tools or achieving perfect compliance scores. It is about applying the same engineering discipline we use for reliability to security: define policies as code, enforce them automatically, measure compliance, and continuously improve.


Start small. If you do nothing else, add a default deny network policy to your production namespace and enable the restricted Pod Security Standard. Those two changes alone will significantly reduce your attack surface. Then layer on Gatekeeper policies, image scanning, Falco, and supply chain security as your team’s maturity grows.
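
Those two starting points are a tiny amount of YAML. A sketch (the namespace name is an example; the PSS label assumes Kubernetes 1.25+, where Pod Security Admission is GA):

# default-deny.yaml -- block all ingress and egress unless explicitly allowed
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
# namespace.yaml -- enforce the restricted Pod Security Standard
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest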


The important thing is to make security a continuous process, not a one-time audit. Treat CVE remediation time like you treat latency SLOs. Track it, alert on it, and invest in improving it. Your future self during the next security incident will thank you.


Hope you found this useful and enjoyed reading it, until next time!


Errata

If you spot any error or have any suggestion, please send me a message so it gets fixed.

Also, you can check the source code and changes in the sources here


