SRE: Security as Code
Introduction
In the previous articles we covered SLIs and SLOs, incident management, observability, chaos engineering, capacity planning, GitOps, secrets management, cost optimization, dependency management, database reliability, and release engineering. All of those topics assume that your cluster and workloads are secure, but security is often treated as an afterthought or someone else’s problem.
That stops today. Security is an SRE concern because a security incident is just another type of incident that burns your error budget, erodes user trust, and creates operational chaos. The shift-left approach means we define security policies as code, enforce them automatically, and treat security violations the same way we treat SLO breaches: with measurable indicators, automated responses, and continuous improvement.
In this article we are going to cover the full security-as-code stack for Kubernetes: admission control with OPA Gatekeeper, Pod Security Standards, network policies, image scanning in CI, RBAC hardening, audit logging, runtime security with Falco, and supply chain security with Cosign and Kyverno. All as code, all automated.
Let’s get into it.
OPA and Gatekeeper policies
Open Policy Agent (OPA) is a general-purpose policy engine, and Gatekeeper is the Kubernetes-native way to use it. Gatekeeper acts as an admission controller that intercepts every request to the Kubernetes API server and evaluates it against your policies before allowing or denying it.
The beauty of this approach is that your security policies become code that lives in Git, gets reviewed in PRs, and is enforced automatically. No more hoping that developers remember to add the right labels or avoid privileged containers.
Installing Gatekeeper
Getting Gatekeeper into your cluster is straightforward with Helm:
# Install Gatekeeper via Helm
helm repo add gatekeeper https://open-policy-agent.github.io/gatekeeper/charts
helm repo update
helm install gatekeeper gatekeeper/gatekeeper \
--namespace gatekeeper-system \
--create-namespace \
--set replicas=3 \
--set audit.replicas=2 \
--set audit.logLevel=INFO
Or if you prefer a declarative ArgoCD approach:
# argocd/gatekeeper-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: gatekeeper
namespace: argocd
spec:
project: default
source:
repoURL: https://open-policy-agent.github.io/gatekeeper/charts
chart: gatekeeper
targetRevision: 3.15.0
helm:
values: |
replicas: 3
audit:
replicas: 2
logLevel: INFO
destination:
server: https://kubernetes.default.svc
namespace: gatekeeper-system
syncPolicy:
automated:
prune: true
selfHeal: true
syncOptions:
- CreateNamespace=true
ConstraintTemplate: Require labels
Gatekeeper uses two resources: ConstraintTemplates (the policy logic in Rego) and Constraints (how to apply them). Here is a template that requires specific labels on all resources:
# policies/templates/require-labels.yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
name: k8srequiredlabels
spec:
crd:
spec:
names:
kind: K8sRequiredLabels
validation:
openAPIV3Schema:
type: object
properties:
labels:
type: array
items:
type: string
targets:
- target: admission.k8s.gatekeeper.sh
rego: |
package k8srequiredlabels
violation[{"msg": msg, "details": {"missing_labels": missing}}] {
provided := {label | input.review.object.metadata.labels[label]}
required := {label | label := input.parameters.labels[_]}
missing := required - provided
count(missing) > 0
msg := sprintf("Resource is missing required labels: %v", [missing])
}
And the constraint that applies it to all namespaces:
# policies/constraints/require-labels.yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
name: all-must-have-owner
spec:
match:
kinds:
- apiGroups: [""]
kinds: ["Namespace"]
- apiGroups: ["apps"]
kinds: ["Deployment", "StatefulSet"]
parameters:
labels:
- "app.kubernetes.io/name"
- "app.kubernetes.io/managed-by"
- "team"
ConstraintTemplate: Block privileged pods
This one is critical. Privileged containers have full access to the host, which means a container escape gives an attacker root on the node:
# policies/templates/block-privileged.yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
name: k8sblockprivileged
spec:
crd:
spec:
names:
kind: K8sBlockPrivileged
targets:
- target: admission.k8s.gatekeeper.sh
rego: |
  package k8sblockprivileged

  # Gather containers from plain Pods and from workload pod templates, so the
  # same policy covers Pods as well as Deployments, StatefulSets and DaemonSets
  # (the constraint below matches all of them)
  input_containers[c] {
    c := input.review.object.spec.containers[_]
  }
  input_containers[c] {
    c := input.review.object.spec.initContainers[_]
  }
  input_containers[c] {
    c := input.review.object.spec.template.spec.containers[_]
  }
  input_containers[c] {
    c := input.review.object.spec.template.spec.initContainers[_]
  }

  violation[{"msg": msg}] {
    container := input_containers[_]
    container.securityContext.privileged == true
    msg := sprintf("Privileged containers are not allowed: %v", [container.name])
  }
# policies/constraints/block-privileged.yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sBlockPrivileged
metadata:
name: no-privileged-containers
spec:
match:
kinds:
- apiGroups: [""]
kinds: ["Pod"]
- apiGroups: ["apps"]
kinds: ["Deployment", "StatefulSet", "DaemonSet"]
excludedNamespaces:
- kube-system
- gatekeeper-system
ConstraintTemplate: Enforce image registry
You probably do not want random Docker Hub images running in production. This policy restricts images to your trusted registries:
# policies/templates/allowed-registries.yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
name: k8sallowedregistries
spec:
crd:
spec:
names:
kind: K8sAllowedRegistries
validation:
openAPIV3Schema:
type: object
properties:
registries:
type: array
items:
type: string
targets:
- target: admission.k8s.gatekeeper.sh
rego: |
  package k8sallowedregistries

  # Gather containers from plain Pods and from workload pod templates, so the
  # constraint can match Pods and Deployments/StatefulSets/DaemonSets alike
  input_containers[c] {
    c := input.review.object.spec.containers[_]
  }
  input_containers[c] {
    c := input.review.object.spec.initContainers[_]
  }
  input_containers[c] {
    c := input.review.object.spec.template.spec.containers[_]
  }
  input_containers[c] {
    c := input.review.object.spec.template.spec.initContainers[_]
  }

  violation[{"msg": msg}] {
    container := input_containers[_]
    not registry_allowed(container.image)
    msg := sprintf("Image '%v' is from an untrusted registry. Allowed registries: %v",
      [container.image, input.parameters.registries])
  }

  registry_allowed(image) {
    registry := input.parameters.registries[_]
    startswith(image, registry)
  }
# policies/constraints/allowed-registries.yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRegistries
metadata:
name: trusted-registries-only
spec:
match:
kinds:
- apiGroups: [""]
kinds: ["Pod"]
- apiGroups: ["apps"]
kinds: ["Deployment", "StatefulSet", "DaemonSet"]
excludedNamespaces:
- kube-system
parameters:
registries:
- "ghcr.io/kainlite/"
- "docker.io/kainlite/"
- "registry.k8s.io/"
- "quay.io/"
With these three policies alone you already have a strong foundation: namespaces and workloads need ownership labels, no one can run privileged containers, and only images from trusted registries are allowed.
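Since the policies are just files in Git, you can also evaluate them in CI before they ever reach the cluster, using Gatekeeper's gator CLI (the paths here are illustrative and assume the repo layout used above):

```shell
# Evaluate templates + constraints against your manifests locally / in CI;
# gator prints any violations and exits non-zero if there are any
gator test \
  -f policies/templates/ \
  -f policies/constraints/ \
  -f k8s/
```

Run this in a PR check and a manifest that violates a policy never gets merged, let alone deployed.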
Pod Security Standards
Kubernetes ships with built-in Pod Security Standards (PSS) that provide three levels of security profiles. These work at the namespace level and do not require any external controller like Gatekeeper. They are a great starting point if you want something simple that covers the basics.
The three profiles are:
- Privileged: Unrestricted. Allows everything. Used for system-level workloads like CNI plugins and monitoring agents.
- Baseline: Prevents known privilege escalations. Blocks hostNetwork, hostPID, privileged containers, and most dangerous capabilities. Good default for most workloads.
- Restricted: Heavily restricted. Requires non-root, drops all capabilities, disallows privilege escalation. The gold standard for application workloads.
Namespace-level enforcement
You apply PSS profiles using labels on namespaces. There are three modes:
- enforce: Rejects pods that violate the policy
- audit: Allows pods but logs violations
- warn: Allows pods but shows a warning to the user
A good rollout strategy is to start with warn and audit, review violations, fix them, and then switch to enforce:
# namespaces/production.yaml
apiVersion: v1
kind: Namespace
metadata:
name: production
labels:
pod-security.kubernetes.io/enforce: restricted
pod-security.kubernetes.io/enforce-version: latest
pod-security.kubernetes.io/audit: restricted
pod-security.kubernetes.io/audit-version: latest
pod-security.kubernetes.io/warn: restricted
pod-security.kubernetes.io/warn-version: latest
# namespaces/staging.yaml
apiVersion: v1
kind: Namespace
metadata:
name: staging
labels:
pod-security.kubernetes.io/enforce: baseline
pod-security.kubernetes.io/enforce-version: latest
pod-security.kubernetes.io/audit: restricted
pod-security.kubernetes.io/audit-version: latest
pod-security.kubernetes.io/warn: restricted
pod-security.kubernetes.io/warn-version: latest
# namespaces/monitoring.yaml
apiVersion: v1
kind: Namespace
metadata:
name: monitoring
labels:
pod-security.kubernetes.io/enforce: baseline
pod-security.kubernetes.io/enforce-version: latest
pod-security.kubernetes.io/audit: baseline
pod-security.kubernetes.io/audit-version: latest
pod-security.kubernetes.io/warn: restricted
pod-security.kubernetes.io/warn-version: latest
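Before flipping a namespace to enforce, you can preview the damage with a server-side dry run: the API server prints a warning for every existing pod that would violate the profile, without changing anything:

```shell
# Dry-run the enforce label change; warnings list the pods that would be
# rejected under the restricted profile
kubectl label --dry-run=server --overwrite namespace staging \
  pod-security.kubernetes.io/enforce=restricted
```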
Making your pods compliant
For the restricted profile, your pods need to meet several requirements. Here is what a compliant pod spec looks like:
# deployments/tr-web.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: tr-web
namespace: production
spec:
replicas: 3
selector:
matchLabels:
app.kubernetes.io/name: tr-web
template:
metadata:
labels:
app.kubernetes.io/name: tr-web
app.kubernetes.io/managed-by: argocd
team: platform
spec:
securityContext:
runAsNonRoot: true
runAsUser: 1000
runAsGroup: 1000
fsGroup: 1000
seccompProfile:
type: RuntimeDefault
containers:
- name: tr-web
image: ghcr.io/kainlite/tr:latest
ports:
- containerPort: 4000
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
volumeMounts:
- name: tmp
mountPath: /tmp
volumes:
- name: tmp
emptyDir: {}
The key security settings are: runAsNonRoot, allowPrivilegeEscalation: false, dropping all capabilities,
read-only root filesystem, and a seccomp profile. If any of those are missing, the restricted profile will reject
the pod.
Network policies
By default, every pod in Kubernetes can talk to every other pod. That is terrible for security. If an attacker compromises one pod, they can freely move laterally to every other service in the cluster. Network policies fix this by defining which traffic is allowed.
Default deny everything
The first thing you should do is create a default deny policy for every namespace. This blocks all traffic that is not explicitly allowed:
# network-policies/default-deny.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
namespace: production
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
Now nothing can talk to anything. Time to allow the traffic you actually need.
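A quick way to prove the default deny is working is to launch a throwaway pod and try to reach a service (assuming the tr-web service from earlier in the series; even DNS resolution should fail, because egress to kube-dns is not allowed yet):

```shell
# This should fail to connect (or even resolve) under default deny
kubectl run netpol-test --rm -it --restart=Never \
  --image=curlimages/curl -n production -- \
  curl -m 5 http://tr-web:4000/
```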
Allow specific traffic
Here is a policy that allows the web frontend to receive traffic from the ingress controller and talk to the database:
# network-policies/tr-web.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-tr-web
namespace: production
spec:
podSelector:
matchLabels:
app.kubernetes.io/name: tr-web
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: ingress-nginx
podSelector:
matchLabels:
app.kubernetes.io/name: ingress-nginx
ports:
- protocol: TCP
port: 4000
egress:
# Allow DNS
- to:
- namespaceSelector: {}
podSelector:
matchLabels:
k8s-app: kube-dns
ports:
- protocol: UDP
port: 53
- protocol: TCP
port: 53
# Allow database access
- to:
- podSelector:
matchLabels:
app.kubernetes.io/name: postgresql
ports:
- protocol: TCP
port: 5432
Cilium network policies
If you are using Cilium as your CNI, you get access to more powerful network policies that can filter at L7 (HTTP, gRPC, DNS):
# cilium-policies/tr-web-l7.yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
name: tr-web-l7-policy
namespace: production
spec:
endpointSelector:
matchLabels:
app.kubernetes.io/name: tr-web
ingress:
- fromEndpoints:
- matchLabels:
app.kubernetes.io/name: ingress-nginx
io.kubernetes.pod.namespace: ingress-nginx
toPorts:
- ports:
- port: "4000"
protocol: TCP
rules:
http:
- method: GET
- method: POST
path: "/api/.*"
- method: HEAD
egress:
- toEndpoints:
- matchLabels:
app.kubernetes.io/name: postgresql
toPorts:
- ports:
- port: "5432"
protocol: TCP
# DNS policy
- toEndpoints:
- matchLabels:
k8s-app: kube-dns
io.kubernetes.pod.namespace: kube-system
toPorts:
- ports:
- port: "53"
protocol: ANY
rules:
dns:
- matchPattern: "*.production.svc.cluster.local"
- matchPattern: "*.kube-system.svc.cluster.local"
The L7 filtering is incredibly powerful. You can restrict not just which pods can talk to each other but also which HTTP methods and paths are allowed. This means even if an attacker compromises the web pod, they can only make the exact API calls that the web pod is supposed to make.
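To see the policy in action, Hubble (Cilium's observability CLI) can show you exactly what is being dropped and what L7 traffic is flowing:

```shell
# Watch dropped traffic live while tuning the policy
hubble observe --namespace production --verdict DROPPED --follow

# Inspect HTTP flows to confirm which methods and paths are actually used
hubble observe --namespace production --protocol http
```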
Image scanning in CI
Catching vulnerabilities before they reach your cluster is much better than detecting them at runtime. Trivy is an excellent open-source scanner that checks container images for known CVEs, misconfigurations, and exposed secrets.
Trivy in GitHub Actions
Here is a complete CI workflow that scans your images and blocks the deployment if high-severity vulnerabilities are found:
# .github/workflows/security-scan.yaml
name: Security Scan
on:
push:
branches: [main]
pull_request:
branches: [main]
jobs:
trivy-scan:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Build image
run: |
docker build -t ghcr.io/kainlite/tr:${{ github.sha }} .
- name: Run Trivy vulnerability scanner
uses: aquasecurity/trivy-action@master
with:
image-ref: ghcr.io/kainlite/tr:${{ github.sha }}
format: table
exit-code: 1
ignore-unfixed: true
vuln-type: os,library
severity: CRITICAL,HIGH
output: trivy-results.txt
- name: Run Trivy for SARIF output
uses: aquasecurity/trivy-action@master
if: always()
with:
image-ref: ghcr.io/kainlite/tr:${{ github.sha }}
format: sarif
output: trivy-results.sarif
ignore-unfixed: true
vuln-type: os,library
severity: CRITICAL,HIGH
- name: Upload Trivy scan results to GitHub Security tab
uses: github/codeql-action/upload-sarif@v3
if: always()
with:
sarif_file: trivy-results.sarif
- name: Scan Kubernetes manifests
uses: aquasecurity/trivy-action@master
with:
scan-type: config
scan-ref: ./k8s/
format: table
exit-code: 1
severity: CRITICAL,HIGH
The key parts are: exit-code: 1 makes the pipeline fail when vulnerabilities are found, ignore-unfixed: true
skips CVEs that do not have a fix yet (so you do not block on things you cannot fix), and the SARIF upload pushes
results to the GitHub Security tab for visibility.
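You can run the exact same scan locally before pushing, so a CI failure is never a surprise:

```shell
# Same severity filter and exit behavior as the CI job
trivy image \
  --severity CRITICAL,HIGH \
  --ignore-unfixed \
  --exit-code 1 \
  ghcr.io/kainlite/tr:latest
```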
Scanning Helm charts and IaC
Trivy can also scan your Kubernetes manifests, Helm charts, and Terraform files for misconfigurations:
# .github/workflows/iac-scan.yaml
name: IaC Security Scan
on:
pull_request:
paths:
- 'k8s/**'
- 'terraform/**'
- 'charts/**'
jobs:
trivy-config-scan:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Scan Kubernetes manifests
uses: aquasecurity/trivy-action@master
with:
scan-type: config
scan-ref: ./k8s/
format: table
exit-code: 1
severity: CRITICAL,HIGH
- name: Scan Terraform
uses: aquasecurity/trivy-action@master
with:
scan-type: config
scan-ref: ./terraform/
format: table
exit-code: 1
severity: CRITICAL,HIGH
- name: Scan Helm charts
uses: aquasecurity/trivy-action@master
with:
scan-type: config
scan-ref: ./charts/
format: table
exit-code: 0
severity: CRITICAL,HIGH,MEDIUM
This catches issues like containers running as root, missing resource limits, missing network policies, and overly permissive RBAC before they ever get merged.
RBAC best practices
Role-Based Access Control (RBAC) is how you control who can do what in your Kubernetes cluster. The principle of least privilege is simple: give every user, service account, and automation only the permissions they actually need and nothing more.
ClusterRole vs Role
The first rule: prefer Role over ClusterRole whenever possible. A Role is scoped to a namespace, so a compromised service account can only affect that namespace. A ClusterRole applies cluster-wide.
# rbac/tr-web-role.yaml
# Namespace-scoped role for the application
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: tr-web
namespace: production
rules:
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources: ["secrets"]
resourceNames: ["tr-web-config"]
verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: tr-web
namespace: production
subjects:
- kind: ServiceAccount
name: tr-web
namespace: production
roleRef:
kind: Role
name: tr-web
apiGroup: rbac.authorization.k8s.io
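You can verify the least-privilege setup by impersonating the service account with kubectl auth can-i. Note that because the role uses resourceNames, you have to ask about the specific secret:

```shell
# Can the app read its own config secret? (should be yes)
kubectl auth can-i get secrets/tr-web-config \
  --as=system:serviceaccount:production:tr-web -n production

# Can it read arbitrary secrets or delete pods? (should be no)
kubectl auth can-i get secrets \
  --as=system:serviceaccount:production:tr-web -n production
kubectl auth can-i delete pods \
  --as=system:serviceaccount:production:tr-web -n production
```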
Service account hardening
Every pod should have its own service account with only the permissions it needs. The default service account in each namespace should have no permissions and automount should be disabled:
# rbac/default-sa-lockdown.yaml
# Disable automounting for the default service account
apiVersion: v1
kind: ServiceAccount
metadata:
name: default
namespace: production
automountServiceAccountToken: false
---
# Create a dedicated service account for the app
apiVersion: v1
kind: ServiceAccount
metadata:
name: tr-web
namespace: production
labels:
app.kubernetes.io/name: tr-web
team: platform
automountServiceAccountToken: true
Aggregated ClusterRoles for team access
For human access to the cluster, use aggregated ClusterRoles that compose permissions from multiple smaller roles. This makes it easy to add new permissions without editing a monolithic role:
# rbac/team-roles.yaml
# Base read-only role for all team members
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: team-readonly
labels:
rbac.kainlite.com/aggregate-to-developer: "true"
rbac.kainlite.com/aggregate-to-sre: "true"
rules:
- apiGroups: [""]
resources: ["pods", "services", "configmaps", "events"]
verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
resources: ["deployments", "statefulsets", "daemonsets", "replicasets"]
verbs: ["get", "list", "watch"]
- apiGroups: ["batch"]
resources: ["jobs", "cronjobs"]
verbs: ["get", "list", "watch"]
---
# Additional permissions for developers
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: developer-extra
labels:
rbac.kainlite.com/aggregate-to-developer: "true"
rules:
- apiGroups: [""]
resources: ["pods/log", "pods/portforward"]
verbs: ["get", "create"]
- apiGroups: [""]
resources: ["pods/exec"]
verbs: ["create"]
---
# Additional permissions for SREs
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: sre-extra
labels:
rbac.kainlite.com/aggregate-to-sre: "true"
rules:
- apiGroups: ["apps"]
resources: ["deployments", "statefulsets"]
verbs: ["patch", "update"]
- apiGroups: ["apps"]
resources: ["deployments/rollback"]
verbs: ["create"]
- apiGroups: [""]
resources: ["nodes"]
verbs: ["get", "list", "watch", "cordon", "uncordon"]
---
# Aggregated role for developers
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: developer
aggregationRule:
clusterRoleSelectors:
- matchLabels:
rbac.kainlite.com/aggregate-to-developer: "true"
rules: []
---
# Aggregated role for SREs
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: sre
aggregationRule:
clusterRoleSelectors:
- matchLabels:
rbac.kainlite.com/aggregate-to-sre: "true"
rules: []
The aggregation pattern means you can add a new ClusterRole with the right label and it automatically gets included in the aggregated role. No need to edit the parent role, which means fewer merge conflicts and cleaner Git history.
Audit logging
Kubernetes audit logging records every request to the API server. This is essential for security investigations, compliance requirements, and understanding who did what and when. Without audit logs, a security incident turns into guesswork.
Audit policy
You need an audit policy that defines what to log and at what level. Here is a practical policy that captures the important events without drowning you in noise:
# audit/audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
# Do not log requests to certain non-resource URL paths
- level: None
nonResourceURLs:
- /healthz*
- /readyz*
- /livez*
- /metrics
# Do not log watch requests (too noisy)
- level: None
verbs: ["watch"]
# Do not log kube-proxy and system:nodes
- level: None
users:
- system:kube-proxy
verbs: ["get", "list"]
# Log secret access at Metadata level (do not log the secret values)
- level: Metadata
resources:
- group: ""
resources: ["secrets"]
# Log all changes to pods and deployments at RequestResponse level
- level: RequestResponse
verbs: ["create", "update", "patch", "delete"]
resources:
- group: ""
resources: ["pods", "pods/exec", "pods/portforward"]
- group: "apps"
resources: ["deployments", "statefulsets", "daemonsets"]
# Log RBAC changes at RequestResponse level
- level: RequestResponse
verbs: ["create", "update", "patch", "delete"]
resources:
- group: "rbac.authorization.k8s.io"
resources: ["clusterroles", "clusterrolebindings", "roles", "rolebindings"]
# Log namespace changes
- level: RequestResponse
verbs: ["create", "update", "patch", "delete"]
resources:
- group: ""
resources: ["namespaces"]
# Log everything else at Metadata level
- level: Metadata
omitStages:
- RequestReceived
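The policy by itself does nothing until the API server is told to use it. On a kubeadm-style cluster that means mounting the policy file into the kube-apiserver static pod and adding the audit flags (managed clusters like EKS or GKE expose audit logs through their own logging configuration instead, so this excerpt is a self-hosted sketch):

```yaml
# /etc/kubernetes/manifests/kube-apiserver.yaml (excerpt)
spec:
  containers:
  - command:
    - kube-apiserver
    # ... existing flags ...
    - --audit-policy-file=/etc/kubernetes/audit/audit-policy.yaml
    - --audit-log-path=/var/log/kubernetes/audit/audit.log
    - --audit-log-maxage=30      # days to retain rotated log files
    - --audit-log-maxbackup=10   # number of rotated files to keep
    - --audit-log-maxsize=100    # megabytes per file before rotation
```

Remember to also mount /etc/kubernetes/audit and /var/log/kubernetes/audit into the kube-apiserver pod via volumes and volumeMounts.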
Sending audit logs to your observability stack
The audit logs need to go somewhere useful. If you are using the Loki stack from the observability article, you can configure the API server to write audit logs to a file and have Promtail ship them to Loki:
# audit/promtail-audit-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: promtail-audit-config
namespace: monitoring
data:
promtail.yaml: |
server:
http_listen_port: 3101
positions:
filename: /tmp/positions.yaml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: kubernetes-audit
static_configs:
- targets:
- localhost
labels:
job: kubernetes-audit
__path__: /var/log/kubernetes/audit/*.log
pipeline_stages:
- json:
expressions:
level: level
verb: verb
user: user.username
resource: objectRef.resource
namespace: objectRef.namespace
name: objectRef.name
responseCode: responseStatus.code
- labels:
level:
verb:
user:
resource:
namespace:
- timestamp:
source: stageTimestamp
format: RFC3339Nano
With audit logs in Loki, you can create Grafana dashboards that show who is accessing your cluster, what changes are being made, and alert on suspicious activity like someone creating a ClusterRoleBinding or exec-ing into a production pod.
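For example, two LogQL queries worth alerting on, using the labels extracted by the pipeline above (the exec subresource is not extracted as a label, so it is matched in the raw JSON line):

```logql
# RBAC changes: who created or modified role bindings?
{job="kubernetes-audit", resource=~"clusterrolebindings|rolebindings", verb=~"create|update|patch"}

# Who exec-ed into a production pod?
{job="kubernetes-audit", resource="pods", namespace="production"} |= `"subresource":"exec"`
```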
Falco for runtime security
Gatekeeper and PSS prevent bad configurations from entering the cluster, but what about runtime attacks? That is where Falco comes in. Falco monitors system calls at the kernel level and alerts when it detects suspicious behavior like a shell being spawned in a container, sensitive files being read, or unexpected network connections.
Installing Falco
Falco can be installed as a DaemonSet using Helm:
# Install Falco with Helm
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm repo update
helm install falco falcosecurity/falco \
--namespace falco \
--create-namespace \
--set falcosidekick.enabled=true \
--set falcosidekick.config.slack.webhookurl="https://hooks.slack.com/services/XXX" \
--set driver.kind=ebpf \
--set collectors.kubernetes.enabled=true
Or as an ArgoCD application:
# argocd/falco-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: falco
namespace: argocd
spec:
project: default
source:
repoURL: https://falcosecurity.github.io/charts
chart: falco
targetRevision: 4.2.0
helm:
values: |
driver:
kind: ebpf
falcosidekick:
enabled: true
config:
slack:
webhookurl: "https://hooks.slack.com/services/XXX"
minimumpriority: warning
prometheus:
extralabels: "source:falco"
collectors:
kubernetes:
enabled: true
destination:
server: https://kubernetes.default.svc
namespace: falco
syncPolicy:
automated:
prune: true
selfHeal: true
syncOptions:
- CreateNamespace=true
Custom Falco rules
Falco ships with a comprehensive set of default rules, but you should add custom rules specific to your environment. Here are some practical examples:
# falco/custom-rules.yaml
# Detect exec into production pods
- rule: Exec into production pod
desc: Detect when someone execs into a pod in the production namespace
condition: >
spawned_process
and container
and k8s.ns.name = "production"
and proc.pname = "runc:[2:INIT]"
output: >
Shell spawned in production pod
(user=%user.name pod=%k8s.pod.name ns=%k8s.ns.name
container=%container.name command=%proc.cmdline)
priority: WARNING
tags: [security, shell, production]
# Detect reading sensitive files
- rule: Read sensitive file in container
desc: Detect read of sensitive files like /etc/shadow or private keys
condition: >
open_read
and container
and (fd.name startswith /etc/shadow
or fd.name startswith /etc/gshadow
or fd.name contains id_rsa
or fd.name contains id_ed25519
or fd.name endswith .pem
or fd.name endswith .key)
output: >
Sensitive file read in container
(user=%user.name file=%fd.name pod=%k8s.pod.name
ns=%k8s.ns.name container=%container.name)
priority: WARNING
tags: [security, filesystem, sensitive]
# Detect unexpected outbound connections
- rule: Unexpected outbound connection from production
desc: Detect outbound connections to IPs not in the allowed list
condition: >
outbound
and container
and k8s.ns.name = "production"
and not (fd.sip in (allowed_outbound_ips))
and not (fd.sport in (53, 443, 5432))
output: >
Unexpected outbound connection from production
(pod=%k8s.pod.name ns=%k8s.ns.name ip=%fd.sip port=%fd.sport
command=%proc.cmdline container=%container.name)
priority: NOTICE
tags: [security, network, production]
# Detect container drift (new executables written and executed)
- rule: Container drift detected
desc: Detect when new executables are written to a container filesystem and then executed
condition: >
spawned_process
and container
and proc.is_exe_upper_layer = true
output: >
Drift detected: new executable run in container
(user=%user.name command=%proc.cmdline pod=%k8s.pod.name
ns=%k8s.ns.name container=%container.name image=%container.image.repository)
priority: ERROR
tags: [security, drift]
# Detect crypto mining
- rule: Detect crypto mining activity
desc: Detect processes known to be associated with cryptocurrency mining
condition: >
spawned_process
and container
and (proc.name in (xmrig, minerd, cpuminer, cryptonight)
or proc.cmdline contains "stratum+tcp"
or proc.cmdline contains "pool.minexmr")
output: >
Possible crypto mining detected
(pod=%k8s.pod.name ns=%k8s.ns.name process=%proc.name
command=%proc.cmdline container=%container.name)
priority: CRITICAL
tags: [security, crypto, mining]
Loading custom rules
You can deploy custom rules as a ConfigMap and tell Falco to load them:
# falco/custom-rules-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: falco-custom-rules
namespace: falco
data:
custom-rules.yaml: |
- list: allowed_outbound_ips
items: ["10.0.0.0/8", "172.16.0.0/12"]
- rule: Exec into production pod
desc: Detect when someone execs into a pod in the production namespace
condition: >
spawned_process
and container
and k8s.ns.name = "production"
and proc.pname = "runc:[2:INIT]"
output: >
Shell spawned in production pod
(user=%user.name pod=%k8s.pod.name ns=%k8s.ns.name
container=%container.name command=%proc.cmdline)
priority: WARNING
tags: [security, shell, production]
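To confirm the rule is loaded and firing, trigger it on purpose and check the Falco logs (assuming the tr-web deployment from earlier; the label selector may differ slightly between chart versions):

```shell
# Trigger the rule: exec into a production pod
kubectl exec -it deploy/tr-web -n production -- sh -c 'id'

# Check that Falco saw it
kubectl logs -n falco -l app.kubernetes.io/name=falco --since=5m \
  | grep "Shell spawned in production pod"
```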
Falco gives you visibility into what is actually happening inside your containers at the system call level. Combined with network policies (which control what traffic is allowed) and Gatekeeper (which controls what configurations are allowed), you have defense in depth covering configuration time, network layer, and runtime.
Supply chain security
Your container images are only as trustworthy as the process that built them. Supply chain attacks, where an attacker compromises a dependency or build pipeline to inject malicious code, have become increasingly common. The solution is to sign your images and verify those signatures before allowing them to run.
Signing images with Cosign
Cosign from the Sigstore project makes it easy to sign and verify container images. Here is how to integrate it into your CI pipeline:
# .github/workflows/build-and-sign.yaml
name: Build, Sign, and Push
on:
push:
branches: [main]
permissions:
contents: read
packages: write
id-token: write # Required for keyless signing
jobs:
build-sign-push:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Install Cosign
uses: sigstore/cosign-installer@main
- name: Login to GHCR
uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Build and push image
id: build
uses: docker/build-push-action@v5
with:
context: .
push: true
tags: ghcr.io/kainlite/tr:${{ github.sha }}
- name: Sign the image with Cosign (keyless)
env:
COSIGN_EXPERIMENTAL: "true"
run: |
cosign sign --yes \
ghcr.io/kainlite/tr@${{ steps.build.outputs.digest }}
- name: Generate SBOM
uses: anchore/sbom-action@v0
with:
image: ghcr.io/kainlite/tr:${{ github.sha }}
format: spdx-json
output-file: sbom.spdx.json
- name: Attach SBOM to image
run: |
cosign attach sbom \
--sbom sbom.spdx.json \
ghcr.io/kainlite/tr@${{ steps.build.outputs.digest }}
- name: Upload SBOM as artifact
uses: actions/upload-artifact@v4
with:
name: sbom
path: sbom.spdx.json
Because the sign command is given no key, Cosign falls back to keyless signing: it requests a short-lived certificate from Sigstore's Fulcio CA tied to your GitHub Actions OIDC identity, so there are no long-lived keys to manage or rotate. The --yes flag just skips the interactive confirmation prompts, and COSIGN_EXPERIMENTAL is only required by Cosign 1.x (keyless is the default in 2.x).
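Anyone can then verify the signature against the expected CI identity (the identity flags here mirror the workflow above; replace the tag placeholder with a real tag):

```shell
# Verify the keyless signature: the certificate identity must match
# the repo's workflow and GitHub's OIDC issuer
cosign verify \
  --certificate-identity-regexp "https://github.com/kainlite/tr/.github/workflows/.*" \
  --certificate-oidc-issuer "https://token.actions.githubusercontent.com" \
  ghcr.io/kainlite/tr:<tag>
```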
SBOM generation
A Software Bill of Materials (SBOM) is a list of every component in your image. It is essential for tracking which of your images are affected when a new CVE is published. The workflow above generates an SPDX-format SBOM and attaches it to the image in the registry.
Verifying signatures with Kyverno
Now that your images are signed, you need to enforce that only signed images can run in the cluster. Kyverno is a Kubernetes policy engine that can verify Cosign signatures at admission time:
# kyverno/verify-image-signature.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: verify-image-signatures
annotations:
policies.kyverno.io/title: Verify Image Signatures
policies.kyverno.io/description: >
Verify that all container images are signed with Cosign
using keyless signing from our GitHub Actions workflows.
spec:
validationFailureAction: Enforce
background: false
webhookTimeoutSeconds: 30
rules:
- name: verify-signature
match:
any:
- resources:
kinds:
- Pod
namespaces:
- production
- staging
verifyImages:
- imageReferences:
- "ghcr.io/kainlite/*"
attestors:
- entries:
- keyless:
subject: "https://github.com/kainlite/tr/.github/workflows/*"
issuer: "https://token.actions.githubusercontent.com"
rekor:
url: https://rekor.sigstore.dev
mutateDigest: true
required: true
# kyverno/require-sbom.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-sbom-attestation
spec:
validationFailureAction: Audit
background: false
rules:
- name: check-sbom
match:
any:
- resources:
kinds:
- Pod
namespaces:
- production
verifyImages:
- imageReferences:
- "ghcr.io/kainlite/*"
attestations:
- type: https://spdx.dev/Document
attestors:
- entries:
- keyless:
subject: "https://github.com/kainlite/tr/.github/workflows/*"
issuer: "https://token.actions.githubusercontent.com"
conditions:
- all:
- key: "{{ creationInfo.created }}"
operator: NotEquals
value: ""
With this setup, the full supply chain flow is: GitHub Actions builds the image, signs it with Cosign using keyless signing, generates and attaches an SBOM, and Kyverno verifies the signature before allowing the image to run in the cluster. If someone pushes an unsigned image or an image that was not built by your CI pipeline, Kyverno rejects it.
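An end-to-end test is straightforward: try to run an image from a covered registry that was never signed by your pipeline (the image name here is just a stand-in):

```shell
# Kyverno should deny this at admission with a signature verification failure
kubectl run unsigned-test -n production \
  --image=ghcr.io/kainlite/not-signed:latest
```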
Security SLOs
If you have been following the SRE series, you know that if you cannot measure it, you cannot improve it. Security is no different. Just like you track availability and latency SLOs, you should track security metrics as SLIs.
Vulnerability remediation time
How long does it take your team to patch a critical CVE after it is discovered? This is one of the most important security metrics:
# prometheus-rules/security-slis.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: security-slis
  namespace: monitoring
spec:
  groups:
    - name: security.slis
      interval: 1h
      rules:
        # Track critical vulnerability count over time
        - record: security:critical_cves:total
          expr: |
            sum(trivy_vulnerability_count{severity="CRITICAL"})
        # Track high vulnerability count
        - record: security:high_cves:total
          expr: |
            sum(trivy_vulnerability_count{severity="HIGH"})
        # Track time since oldest unpatched critical CVE
        - record: security:oldest_critical_cve_age_days
          expr: |
            (time() - min(trivy_vulnerability_first_seen{severity="CRITICAL"})) / 86400
        # Policy violations detected by Gatekeeper audit
        - record: security:policy_violations:total
          expr: |
            sum(gatekeeper_violations)
        # Falco alerts rate
        - record: security:falco_alerts:rate1h
          expr: |
            sum(rate(falco_events_total{priority=~"WARNING|ERROR|CRITICAL"}[1h]))
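Recording rules deserve the same testing discipline as application code. Assuming the `groups:` section above has been extracted into a plain rules file (here called `security-slis-rules.yaml`, since `promtool` reads rule files rather than PrometheusRule custom resources), a unit test might look like this:

```yaml
# tests/security-slis-test.yaml
# Run with: promtool test rules tests/security-slis-test.yaml
rule_files:
  - ../security-slis-rules.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Three critical CVEs reported for one image, constant over 3 samples
      - series: 'trivy_vulnerability_count{severity="CRITICAL", image="app"}'
        values: '3 3 3'
    promql_expr_test:
      - expr: security:critical_cves:total
        eval_time: 2m
        exp_samples:
          - labels: 'security:critical_cves:total'
            value: 3
```

This catches silent breakage, such as a renamed source metric, before it makes your security dashboards go blank.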
Defining security SLOs
Define concrete SLOs for your security posture:
# security-slos.yaml
security_slos:
  vulnerability_remediation:
    description: "Critical CVEs must be patched within 7 days"
    sli: security:oldest_critical_cve_age_days
    objective: 7
    measurement: "Days since oldest unpatched critical CVE"
  policy_compliance:
    description: "Zero Gatekeeper policy violations in production"
    sli: security:policy_violations:total
    objective: 0
    measurement: "Total active policy violations"
  runtime_security:
    description: "Zero warning-or-higher Falco alerts in production"
    sli: security:falco_alerts:rate1h
    objective: 0
    measurement: "Warning, error, and critical Falco events per hour"
  image_signing:
    description: "100% of production images must be signed"
    sli: kyverno:policy_violations:image_signature
    objective: 0
    measurement: "Unsigned images blocked or running"
Alerting on security SLOs
Set up alerts that fire when your security SLOs are at risk:
# prometheus-rules/security-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: security-alerts
  namespace: monitoring
spec:
  groups:
    - name: security.alerts
      rules:
        - alert: CriticalCVEUnpatchedTooLong
          expr: security:oldest_critical_cve_age_days > 5
          for: 1h
          labels:
            severity: warning
            team: platform
          annotations:
            summary: "Critical CVE has been unpatched for more than 5 days"
            description: "Oldest unpatched critical CVE is {{ $value }} days old. SLO target is 7 days."
            runbook: "https://runbooks.example.com/patch-critical-cve"
        - alert: GatekeeperPolicyViolations
          expr: security:policy_violations:total > 0
          for: 5m
          labels:
            severity: warning
            team: platform
          annotations:
            summary: "Gatekeeper policy violations detected"
            description: "{{ $value }} policy violations found in the cluster."
        - alert: FalcoSecurityAlert
          expr: security:falco_alerts:rate1h > 0
          for: 0m
          labels:
            severity: critical
            team: platform
          annotations:
            summary: "Falco detected a security event at warning level or above"
            description: "Falco is reporting {{ $value }} warning-or-higher events per hour."
Treating security metrics as SLIs gives you the same benefits as reliability SLOs: you can measure progress, set targets, alert when things drift, and make data-driven decisions about where to invest your security efforts.
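To close the loop, the `team: platform` and `severity` labels on these alerts can drive routing in Alertmanager so that a Falco event pages someone immediately while CVE-age warnings go to a ticket queue. A minimal sketch (the receiver names and the PagerDuty routing key are placeholders, not from the actual setup):

```yaml
# alertmanager.yaml (illustrative routing fragment)
route:
  receiver: default
  routes:
    # Critical security alerts page the on-call engineer
    - matchers:
        - team="platform"
        - severity="critical"
      receiver: security-pager
    # Warning-level security alerts go to a Slack channel
    - matchers:
        - team="platform"
        - severity="warning"
      receiver: security-slack

receivers:
  - name: default
  - name: security-pager
    pagerduty_configs:
      - routing_key: "<pagerduty-routing-key>"
  - name: security-slack
    slack_configs:
      - channel: "#security-alerts"
        api_url: "<slack-webhook-url>"
```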
Putting it all together
Here is a summary of the full security-as-code stack we built:
- OPA Gatekeeper: Admission control policies that enforce labels, block privileged containers, and restrict image registries
- Pod Security Standards: Built-in namespace-level security profiles (Privileged, Baseline, Restricted)
- Network policies: Default deny with explicit allow rules, L7 filtering with Cilium
- Image scanning with Trivy: CI pipeline that blocks deployments with critical vulnerabilities
- RBAC hardening: Least privilege roles, service account isolation, aggregated ClusterRoles
- Audit logging: Recording API server activity and shipping to your observability stack
- Falco runtime security: Detecting suspicious behavior at the system call level
- Supply chain security: Image signing with Cosign, SBOM generation, verification with Kyverno
- Security SLOs: Measuring and alerting on vulnerability remediation time and compliance metrics
Each layer covers a different phase of the attack surface: Gatekeeper and PSS prevent bad configurations, network policies limit blast radius, Trivy catches known vulnerabilities, RBAC restricts access, audit logs provide forensic evidence, Falco detects runtime attacks, and supply chain security ensures image integrity.
No single layer is perfect, but together they create defense in depth that makes it significantly harder for an attacker to succeed and much easier for you to detect and respond when something does go wrong.
Closing notes
Security as code is not about buying expensive tools or achieving perfect compliance scores. It is about applying the same engineering discipline we use for reliability to security: define policies as code, enforce them automatically, measure compliance, and continuously improve.
Start small. If you do nothing else, add a default deny network policy to your production namespace and enable the restricted Pod Security Standard. Those two changes alone will significantly reduce your attack surface. Then layer on Gatekeeper policies, image scanning, Falco, and supply chain security as your team’s maturity grows.
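Those two starting points amount to only a few lines of YAML. A minimal sketch, using `production` as an example namespace name:

```yaml
# Default deny: with an empty podSelector, this policy selects every pod
# in the namespace and, by allowing nothing, blocks all ingress and egress.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
# Enforce the restricted Pod Security Standard at the namespace level.
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
```

After applying the default deny, add explicit allow policies for the traffic your workloads actually need (including DNS egress, which the deny-all blocks too).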
The important thing is to make security a continuous process, not a one-time audit. Treat CVE remediation time like you treat latency SLOs. Track it, alert on it, and invest in improving it. Your future self during the next security incident will thank you.
Hope you found this useful and enjoyed reading it, until next time!
Errata
If you spot any error or have any suggestion, please send me a message so it gets fixed.
Also, you can check the source code and changes here.