I listen to a lot of folks talk about their Kubernetes strategy as a means of apportioning a finite, limited resource (compute) among a wide and varied set of people, usually application developers and operations nerds, with an eye toward isolation.
I have bad news for you.
Kubernetes isn’t about isolation, not in the security sense of the word anyway.
If you reduce containers down to their base essence (and I’m going to take a few liberties here, so bear with me), it’s about processes. Processes. Program binaries executing code in a virtually unique memory address space. Same kernel. Same user/group space. Sometimes, same filesystem and same PID space.
It’s all an elaborate set of carefully constructed smoke and mirrors that lets the Linux kernel provide different views of shared resources to different processes.
This has been most handy when paired with the OCI image standard, and some best practices from the Docker ecosystem – every container gets its own PID namespace; every container brings its own root filesystem with it, etc.
But you don’t have to abide by those rules if you don’t want to.
To wit: kubectl r00t:
This little gem blows up the security charade of containers. Let’s unpack this, piece by piece.
#!/bin/bash
exec kubectl run r00t -it --rm \
  --restart=Never \
  --image nah r00t \
  --overrides '{"spec":{"hostPID": true, "containers":[{"name":"x","image":"alpine","command":["nsenter","--mount=/proc/1/ns/mnt","--","/bin/bash"],"stdin": true,"tty":true,"securityContext":{"privileged":true}}]}}' "$@"
The kubectl run r00t -it --rm --restart=Never bits tell Kubernetes that we want to execute a single Pod (no Deployment here, thank you very much), and when that Pod exits, we’re done. Think of it as an analog to docker run -it --rm.
The next bits, --image nah and --overrides ..., let us modify the generated YAML of the Pod resource. The kubectl run command requires that we specify an image to run and a name for the pod, but we’re just going to override those values with --overrides, so you can put (quite literally) anything you want here.
That brings us to the JSON overrides. For sanity’s sake, let’s reformat that blob of JSON to be a bit more readable, and turn it back into YAML via spruce:
spec:
  hostPID: true
  containers:
    - image: alpine
      name: x
      stdin: true
      tty: true
      securityContext:
        privileged: true
      command: [nsenter, --mount=/proc/1/ns/mnt, --, /bin/bash]
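If hand-editing a one-line JSON string feels error-prone, the same overrides can be built as a data structure and serialized. This is only a sketch: the function name build_overrides and its parameters are made up for illustration, and the values are copied from the script above.

```python
import json


def build_overrides(image="alpine", name="x"):
    """Return the Pod overrides used by the r00t script as a plain dict."""
    return {
        "spec": {
            "hostPID": True,
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "command": ["nsenter", "--mount=/proc/1/ns/mnt", "--", "/bin/bash"],
                    "stdin": True,
                    "tty": True,
                    "securityContext": {"privileged": True},
                }
            ],
        }
    }


if __name__ == "__main__":
    # Compact separators keep the blob on one line, ready for --overrides.
    print(json.dumps(build_overrides(), separators=(",", ":")))
```

The printed string can be pasted between the single quotes after --overrides.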
The first thing we do is pop this pod (and all of its containers) into the Kubernetes node’s PID namespace, via hostPID. By default, containers get new process ID namespaces inside the kernel – the first process executed becomes “PID 1”, and gets all the benefits that PID 1 normally gets – automatic inheritance of orphaned child processes, special signal delivery, etc. A side-effect of being in the Kubernetes node’s PID namespace is that /proc/1 refers to the actual init process of the VM / physical host – this will become exceedingly important in just a bit.
Next up, we start modifying the (only) container in the pod. We choose the alpine image, because it is small and likely to be present in the image cache already. We pick an arbitrary name for the container (x), turn on standard input attachment (stdin: true) and teletype terminal emulation (tty: true) so that we can run an interactive shell.
Then, we set the security context of the running container to be privileged – this provides us all of the normal Linux capabilities you’d come to expect from being the root user on a Linux box.
Finally, the coup de grâce: the command we want this container to execute is nsenter, a handy (and flexible!) little utility for munging and modifying our current Linux namespaces, which (combined with cgroups) are the foundation on which all this containerization stuff is built. We’re already in the host’s process ID namespace, but we are jailed inside of our own filesystem namespace. To get out, we can take advantage of the fact that /proc/1 is the real Linux init (systemd) process, so /proc/1/ns/mnt is the outermost mount namespace, i.e. the real root filesystem!
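The kernel exposes each process’s namespaces as symbolic links under /proc/<pid>/ns, with targets that look like mnt:[4026531840]. Two processes share a namespace when the kind and the number in brackets both match. Here is a rough Python sketch of that comparison; the helper names are invented for this post, and the inode numbers in the usage note are just sample values.

```python
import os
import re

# Link targets look like "mnt:[4026531840]" or "pid:[4026531836]".
_NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target):
    """Split a namespace link target into (kind, inode)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group("kind"), int(match.group("inode"))


def same_namespace(link_a, link_b):
    """Two link targets name the same namespace when kind and inode match."""
    return parse_ns_link(link_a) == parse_ns_link(link_b)


def read_ns_link(pid, kind):
    """Read /proc/<pid>/ns/<kind>. Linux only; other processes need privilege."""
    return os.readlink(f"/proc/{pid}/ns/{kind}")
```

Inside the hostPID pod, comparing read_ns_link("self", "mnt") with read_ns_link(1, "mnt") should report different namespaces before nsenter runs and the same one afterwards, which is the whole trick in one comparison.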
Let’s give it a go:
$ kubectl r00t
If you don't see a command prompt, try pressing enter.
node/f8f9b380-f4a6-4b17-84d0-996962f7b106:/#
node/f8f9b380-f4a6-4b17-84d0-996962f7b106:/# ps -ef | grep ku[b]elet
root 6788 1 2 Feb12 ? 14:15:41 kubelet --config=...
There you have it. On my EKS cluster, this is the easiest and best way to pop a root shell and go snooping through kubelet configurations, changing things as I need to. Handy for me, but probably not something that would make the cluster operator sleep well at night.
Are you that cluster operator?
This exploit works for several cooperating reasons:
- I was able to create a privileged: true Pod
- I was able to create a Pod in the hostPID namespace
- I was able to run a Pod as the root user (UID/GID of 0/0)
- I was able to run a Pod with stdin attached, and a controlling terminal.
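Those four ingredients are easy to spot mechanically. The sketch below walks a Pod manifest (already loaded into a Python dict) and reports which of them are present. The function name risky_settings and the wording of the findings are my own invention, and the root check is deliberately crude: it only looks at runAsUser and runAsNonRoot, not at the image’s default user.

```python
def risky_settings(pod):
    """List the settings in a Pod manifest that the r00t trick relies on."""
    spec = pod.get("spec", {})
    pod_ctx = spec.get("securityContext") or {}
    findings = []
    if spec.get("hostPID"):
        findings.append("hostPID")
    for container in spec.get("containers", []):
        ctx = container.get("securityContext") or {}
        label = container.get("name", "?")
        if ctx.get("privileged"):
            findings.append(f"{label}: privileged")
        # Container-level settings take precedence over pod-level ones.
        uid = ctx.get("runAsUser", pod_ctx.get("runAsUser"))
        non_root = ctx.get("runAsNonRoot", pod_ctx.get("runAsNonRoot", False))
        if uid == 0 or (uid is None and not non_root):
            findings.append(f"{label}: may run as root")
        if container.get("stdin") and container.get("tty"):
            findings.append(f"{label}: interactive stdin/tty")
    return findings


R00T_POD = {
    "spec": {
        "hostPID": True,
        "containers": [
            {
                "name": "x",
                "image": "alpine",
                "stdin": True,
                "tty": True,
                "securityContext": {"privileged": True},
            }
        ],
    }
}
```

Calling risky_settings(R00T_POD) flags all four ingredients; a Pod that sets a non-zero runAsUser and leaves the rest alone comes back clean.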
If you take away any of those capabilities, the above attack vector stops working. Let’s take away as many of those capabilities as we can, using Pod Security Policies.
A Pod Security Policy lets you prohibit or allow certain types of workloads. Policies work with the Kubernetes role-based access control (RBAC) system to give you flexibility in what you allow and whom you allow to do it.
In the rest of this post, we’re going to create a namespace and a service account that can deploy to it. We’ll verify that the service account can do bad things first, before we implement a security policy that prohibits such shenanigans.
Here are the YAML bits for creating our demo namespace and service account:
---
apiVersion: v1
kind: Namespace
metadata:
  name: psp-demo
---
apiVersion: v1
kind: ServiceAccount
metadata:
  namespace: psp-demo
  name: psp-demo-sa
This gives us a namespace named psp-demo, and a service account in that namespace, named psp-demo-sa. We will be impersonating that service account later, when we attempt to live under the constraints of our security policy.
Next up, we need to set up some basic RBAC access to allow psp-demo-sa to deploy Pods. This is only because we want to demo Pod creation as the service account!
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: psp-sa
  namespace: psp-demo
rules:
  - apiGroups: ['']
    resources: [pods]
    verbs: [get, list, watch, create, update, patch, delete]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: psp-sa
  namespace: psp-demo
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: psp-sa
subjects:
  - kind: ServiceAccount
    name: psp-demo-sa
    # ServiceAccount subjects must name the namespace they live in.
    namespace: psp-demo
The new (namespace-bound) role psp-sa is bound to the psp-demo-sa service account and allows it to do pretty much anything with Pods. Note: this does preclude us from creating Deployments, StatefulSets, and the like. That’s solely by virtue of the role assignments, and has nothing to do with our Pod Security Policies.
Our first security policy is called privileged, and it encodes the most lax security we can specify. This will be reserved for people we trust with our lives (and our cluster!), and serves to show what happens when a user or service account can’t use a policy that exists.
---
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: privileged
spec:
  privileged: true
  allowPrivilegeEscalation: true
  allowedCapabilities: ['*']
  volumes: ['*']
  hostNetwork: true
  hostIPC: true
  hostPID: true
  hostPorts: [{ min: 0, max: 65535 }]
  runAsUser: { rule: RunAsAny }
  seLinux: { rule: RunAsAny }
  supplementalGroups: { rule: RunAsAny }
  fsGroup: { rule: RunAsAny }
The next policy is much more restricted. It’s even named restricted! It locks down almost everything we can:
---
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities: [ALL]
  readOnlyRootFilesystem: false
  hostNetwork: false
  hostIPC: false
  hostPID: false
  runAsUser:
    # Require the container to run without root privileges.
    rule: MustRunAsNonRoot
  seLinux:
    # Assume nodes are using AppArmor rather than SELinux.
    rule: RunAsAny
  supplementalGroups:
    rule: MustRunAs
    ranges: [{ min: 1, max: 65535 }]
  fsGroup:
    rule: MustRunAs
    ranges: [{ min: 1, max: 65535 }]
  # Allow core volume types.
  volumes:
    - configMap
    - emptyDir
    - projected
    - secret
    - downwardAPI
    - persistentVolumeClaim
That’s worth reading over a few times to make sure you’ve got it all. The salient bits (insofar as our attack vector is concerned) are thus:
- We disallow hostPID Pods / containers
- We don’t allow directories on the host to be bind-mounted into containers. (There’s no hostPath listed in the allowed volume types list)
- Pods must specify users to run as, and those UIDs cannot be 0. No root!
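To make those rules concrete, here is a toy Python model of the check, covering only host namespaces, privileged mode, volume types, and an explicit UID of 0. It is not how the real admission controller is implemented (that one also handles capabilities, groups, defaults, and more), and the names violations, RESTRICTED, and PRIVILEGED are hypothetical. The two policy dicts carry only the fields from the YAML above that the sketch inspects.

```python
RESTRICTED = {"spec": {
    "privileged": False, "hostPID": False, "hostIPC": False, "hostNetwork": False,
    "runAsUser": {"rule": "MustRunAsNonRoot"},
    "volumes": ["configMap", "emptyDir", "projected", "secret",
                "downwardAPI", "persistentVolumeClaim"],
}}
PRIVILEGED = {"spec": {
    "privileged": True, "hostPID": True, "hostIPC": True, "hostNetwork": True,
    "runAsUser": {"rule": "RunAsAny"},
    "volumes": ["*"],
}}


def violations(policy, pod):
    """Return reasons why `pod` does not fit `policy` (simplified model)."""
    rules, spec = policy["spec"], pod.get("spec", {})
    problems = []
    for field in ("hostPID", "hostIPC", "hostNetwork"):
        if spec.get(field) and not rules.get(field, False):
            problems.append(f"{field} is not allowed")
    allowed = set(rules.get("volumes", []))
    for volume in spec.get("volumes", []):
        # Every key other than "name" identifies the volume type.
        for kind in (key for key in volume if key != "name"):
            if "*" not in allowed and kind not in allowed:
                problems.append(f"volume type {kind} is not allowed")
    non_root = rules.get("runAsUser", {}).get("rule") == "MustRunAsNonRoot"
    pod_ctx = spec.get("securityContext") or {}
    for container in spec.get("containers", []):
        ctx = container.get("securityContext") or {}
        name = container.get("name", "?")
        if ctx.get("privileged") and not rules.get("privileged", False):
            problems.append(f"{name}: privileged containers are not allowed")
        if non_root and ctx.get("runAsUser", pod_ctx.get("runAsUser")) == 0:
            problems.append(f"{name}: running as UID 0 is not allowed")
    return problems
```

Feeding the r00t Pod spec through violations(RESTRICTED, ...) trips both the hostPID and the privileged checks, while the same Pod sails through PRIVILEGED untouched.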
With those YAMLs applied to the cluster, we can list our policies:
$ kubectl get psp
NAME         PRIV    CAPS   SELINUX    RUNASUSER          FSGROUP     SUPGROUP    READONLYROOTFS   VOLUMES
restricted   false          RunAsAny   MustRunAsNonRoot   MustRunAs   MustRunAs   false            configMap,emptyDir,projected,secret,downwardAPI,persistentVolumeClaim
privileged   true    *      RunAsAny   RunAsAny           RunAsAny    RunAsAny    false            *
Right now, these policies are inert. No one is allowed to use them, which means that no one will be able to create any Pods. To activate these policies, we need to grant users and service accounts the use verb against the policy resources. For that, we’ll use a new Cluster Role and a Cluster Role Binding.
First, the Cluster Role:
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: default-psp
rules:
  - apiGroups: [policy]
    resources: [podsecuritypolicies]
    resourceNames: []
    verbs: [list, get]
  - apiGroups: [policy]
    resources: [podsecuritypolicies]
    resourceNames: [restricted]
    verbs: [use]
This role is allowed to list and get all security policies, but only use the restricted policy.
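The way resourceNames narrows a rule is worth seeing in action. Below is a small Python sketch of RBAC rule matching as I understand it: a request is allowed if any single rule covers its API group, resource, and verb, and an empty resourceNames list means “any name”. The function names are made up, and real RBAC has more wrinkles (subresources, non-resource URLs, aggregation) that this ignores.

```python
# The two rules from the default-psp Cluster Role above.
DEFAULT_PSP_RULES = [
    {"apiGroups": ["policy"], "resources": ["podsecuritypolicies"],
     "resourceNames": [], "verbs": ["list", "get"]},
    {"apiGroups": ["policy"], "resources": ["podsecuritypolicies"],
     "resourceNames": ["restricted"], "verbs": ["use"]},
]


def rule_allows(rule, api_group, resource, verb, name=None):
    """True when a single RBAC rule covers the request."""
    def covers(values, wanted):
        return "*" in values or wanted in values

    if not covers(rule.get("apiGroups", []), api_group):
        return False
    if not covers(rule.get("resources", []), resource):
        return False
    if not covers(rule.get("verbs", []), verb):
        return False
    names = rule.get("resourceNames", [])
    # An empty resourceNames list places no restriction on the name.
    return not names or name in names


def role_allows(rules, api_group, resource, verb, name=None):
    """RBAC is additive: any one matching rule grants access."""
    return any(rule_allows(r, api_group, resource, verb, name) for r in rules)
```

With these rules, role_allows(DEFAULT_PSP_RULES, "policy", "podsecuritypolicies", "use", "restricted") is granted, while the same request for "privileged" is refused, even though get works for both.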
Next, we bind the Cluster Role to all users (via the system:authenticated group) and all service accounts (via the system:serviceaccounts group):
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: default-psp
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: default-psp
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: system:authenticated # All authenticated users
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: system:serviceaccounts
Now, we need to impersonate our demo service account. For that, we can use the --as flag to kubectl:
$ kubectl --as=system:serviceaccount:psp-demo:psp-demo-sa get pods
No resources found in psp-demo namespace.
I hate typing. I hate making other people type. We’re going to alias that big --as flag as ku (which is way easier on the keyboard):
$ alias ku='kubectl --as=system:serviceaccount:psp-demo:psp-demo-sa'
$ ku get pods
No resources found in psp-demo namespace.
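That long --as value isn’t arbitrary: service accounts show up to the API server under the username system:serviceaccount:<namespace>:<name>, and they belong to the groups system:serviceaccounts and system:serviceaccounts:<namespace> (the first of which is what our Cluster Role Binding targets). A tiny Python sketch, with invented helper names, for building those strings:

```python
def service_account_username(namespace, name):
    """Username the API server uses for a service account."""
    return f"system:serviceaccount:{namespace}:{name}"


def service_account_groups(namespace):
    """Service-account groups: the cluster-wide one and the per-namespace one."""
    return ["system:serviceaccounts", f"system:serviceaccounts:{namespace}"]


def impersonating_kubectl(namespace, name, *args):
    """Build a kubectl argument list that impersonates the service account."""
    return ["kubectl", f"--as={service_account_username(namespace, name)}", *args]


if __name__ == "__main__":
    print(" ".join(impersonating_kubectl("psp-demo", "psp-demo-sa", "get", "pods")))
```

Run as a script, it prints the same command line that the ku alias expands to for ku get pods.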
Now, we can explore with kubectl auth can-i:
$ ku auth can-i create pods
yes
$ ku get psp -o custom-columns=NAME:.metadata.name
NAME
privileged
restricted
$ ku auth can-i use psp/privileged
no
$ ku auth can-i use psp/restricted
yes
Note: if you get warnings like "Warning: resource 'podsecuritypolicies' is not namespace scoped in group 'policy'", don’t worry. I get them too, and from what I’ve been able to tell from random Internet searches, they aren’t anything to worry about.
This tells us that we are able to use the restricted policy, but not the privileged policy; so our attempts at breaking in should no longer bear fruit:
$ kubectl r00t --as=system:serviceaccount:psp-demo:psp-demo-sa
Error from server (Forbidden): pods "r00t" is forbidden: unable to validate against any pod security policy: [spec.securityContext.hostPID: Invalid value: true: Host PID is not allowed to be used spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed]
Success!
Where To From Here?
Armed with your newfound expertise in Pod Security Policies, go forth and secure your Kubernetes clusters! A few things to try from here include:
- Letting actual cluster admins create privileged pods
- Allowing some capabilities to certain specific service accounts
- Auditing all of your service accounts and what they can do under your PSPs
Happy Hacking!