The Capable Kernel: An Introduction to Linux Capabilities

February 6, 2020

Traditionally, Linux separates users and their processes into two different groups: root (user ID 0) and everyone else. Back in 1999, with the 2.2 Linux kernel release, kernel developers started breaking up the privileges of the root user into distinct capabilities, allowing processes to inherit subsets of root’s privilege, without giving away too much. Fast-forward to late 2019 (Linux is up to version 5.4.x, by the way), and we have over three dozen different capabilities to hand out.

Here are some interesting ones:

CAP_KILL – Enables a process to send signals to other processes, regardless of their effective UIDs.
CAP_IPC_LOCK – Allows a process to lock memory, to ensure that it doesn’t get swapped out to disk. Security-conscious tools that handle credentials in RAM rely on this.
CAP_MKNOD – Lets you create special files (like block devices). Think of it! Your very own /dev/null!
CAP_NET_ADMIN – Enables most network management, including interface configuration, packet filtering and more. This is a fairly broad scope.
CAP_NET_BIND_SERVICE – Allows binding of so-called “privileged” ports (those below 1024).

So how do we find out what capabilities we have? As with most things process-oriented, we can look in the lovely /proc filesystem.

We can look at our own capability set using /proc/self:

→  grep Cap /proc/self/status
CapInh:	0000000000000000
CapPrm:	0000000000000000
CapEff:	0000000000000000
CapBnd:	0000003fffffffff
CapAmb:	0000000000000000

These hex strings are bitmasks, one bit per capability, and each line describes a different set: inheritable, permitted, effective, bounding, and ambient. We’re mostly going to focus on CapPrm, the “permitted set”. Since I ran grep as myself, I have no capabilities, and CapPrm is all zeroes.
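If you would rather poke at these values from code, here is a minimal Python sketch that pulls the Cap* lines out of status text and turns them into integers. The function name parse_cap_sets is my own invention for illustration, and the sample text is the root output shown in this post, so the snippet runs anywhere (not just on Linux).

```python
def parse_cap_sets(status_text):
    """Return a dict like {'CapPrm': 0x3fffffffff, ...} from /proc/<pid>/status text."""
    sets = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        # Only the capability lines start with "Cap"; skip everything else.
        if sep and key.startswith("Cap"):
            sets[key] = int(value.strip(), 16)
    return sets


# Sample taken from the "sudo grep Cap /proc/self/status" output in this post.
SAMPLE = (
    "CapInh:\t0000000000000000\n"
    "CapPrm:\t0000003fffffffff\n"
    "CapEff:\t0000003fffffffff\n"
    "CapBnd:\t0000003fffffffff\n"
    "CapAmb:\t0000000000000000\n"
)

caps = parse_cap_sets(SAMPLE)
print(f"{caps['CapPrm']:016x}")  # prints 0000003fffffffff
```

On a real Linux box you could feed it open("/proc/self/status").read() instead of the sample.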

Let’s run it as root and see what happens:

→  sudo grep Cap /proc/self/status
CapInh:	0000000000000000
CapPrm:	0000003fffffffff
CapEff:	0000003fffffffff
CapBnd:	0000003fffffffff
CapAmb:	0000000000000000

Hey, look at that. All the capabilities!

If you don’t immediately grok that long string of hexadecimal, fret not! I wrote a small utility, which you can find on GitHub, that prints out human-friendly names and descriptions. The easiest way to run this is inside of Docker, using my huntprod/caps image:

→  docker run --rm huntprod/caps 0000003fffffffff
0000003fffffffff (38 capabilities):
  chown                  0 (0x00000000000001)  Make arbitrary changes to file UIDs and GIDs
  dac_override           1 (0x00000000000002)  Bypass file read, write, and execute permission checks.
  dac_read_search        2 (0x00000000000004)  Bypass file read permission checks and directory read and execute permission checks.
  fowner                 3 (0x00000000000008)  Bypass file ownership / process owner equality permission checks.
  fsetid                 4 (0x00000000000010)  Don't clear set-user-ID and set-group-ID mode bits when a file is modified
  kill                   5 (0x00000000000020)  Bypass permission checks for sending signals.
  setgid                 6 (0x00000000000040)  Make arbitrary manipulations of process GIDs and supplementary GID list.
  setuid                 7 (0x00000000000080)  Make arbitrary manipulations of process UIDs.
  ... etc ...
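
The decoding itself is just bit twiddling. The sketch below is not the source of my utility; it is a rough Python illustration of the idea, assuming the capability numbering from linux/capability.h as of the 5.4-era kernels (bits 0 through 37). The names CAP_NAMES and decode_caps are made up for this example.

```python
# Capability names indexed by bit number (assumed from linux/capability.h, 5.4 era).
CAP_NAMES = [
    "chown", "dac_override", "dac_read_search", "fowner", "fsetid",
    "kill", "setgid", "setuid", "setpcap", "linux_immutable",
    "net_bind_service", "net_broadcast", "net_admin", "net_raw", "ipc_lock",
    "ipc_owner", "sys_module", "sys_rawio", "sys_chroot", "sys_ptrace",
    "sys_pacct", "sys_admin", "sys_boot", "sys_nice", "sys_resource",
    "sys_time", "sys_tty_config", "mknod", "lease", "audit_write",
    "audit_control", "setfcap", "mac_override", "mac_admin", "syslog",
    "wake_alarm", "block_suspend", "audit_read",
]


def decode_caps(mask):
    """List the capability names whose bits are set in mask, lowest bit first."""
    return [name for bit, name in enumerate(CAP_NAMES) if (mask >> bit) & 1]


names = decode_caps(int("0000003fffffffff", 16))
print(len(names))  # prints 38
```

Newer kernels define capabilities beyond bit 37; this sketch simply ignores any bits it has no name for.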

Looking at Subsets of Capabilities

If we want to look at some middle ground, we need a way of dropping permitted capabilities. There are two ways to do this: via filesystem attributes using the setcap program, and via containers. Frankly, containers are a lot easier, since both Docker and Kubernetes have first-class support for explicitly specifying the permitted set of capabilities.

Let’s start with Docker.

If we run the huntprod/caps image with no arguments, it searches through /proc/self/status and grabs the permitted capability set, and then displays that.

→  docker run --rm huntprod/caps
(via /proc/self/status)
00000000a80425fb (14 capabilities):
  chown                  0 (0x00000000000001)  Make arbitrary changes to file UIDs and GIDs
  dac_override           1 (0x00000000000002)  Bypass file read, write, and execute permission checks.
  fowner                 3 (0x00000000000008)  Bypass file ownership / process owner equality permission checks.
  fsetid                 4 (0x00000000000010)  Don't clear set-user-ID and set-group-ID mode bits when a file is modified
  kill                   5 (0x00000000000020)  Bypass permission checks for sending signals.
  setgid                 6 (0x00000000000040)  Make arbitrary manipulations of process GIDs and supplementary GID list.
  setuid                 7 (0x00000000000080)  Make arbitrary manipulations of process UIDs.
  setpcap                8 (0x00000000000100)  Manage capability sets (from bounded / inherited set).
  net_bind_service      10 (0x00000000000400)  Bind a socket to Internet domain privileged ports.
  net_raw               13 (0x00000000002000)  Use RAW and PACKET sockets.
  sys_chroot            18 (0x00000000040000)  Use chroot(2) and manage kernel namespaces.
  mknod                 27 (0x00000008000000)  Create special files using mknod(2).
  audit_write           29 (0x00000020000000)  Write records to kernel auditing log.
  setfcap               31 (0x00000080000000)  Set arbitrary capabilities on a file.

Voila!

Without specifying anything, my Docker container was restricted to a subset of capabilities, a80425fb.

Unsurprisingly, if we are a privileged container, we get the full capability set:

→  docker run --privileged --rm huntprod/caps | head -n2
(via /proc/self/status)
0000003fffffffff (38 capabilities):

Let’s try explicitly asking for a capability:

→  docker run --rm --cap-add ipc_lock huntprod/caps
(via /proc/self/status)
00000000a80465fb (15 capabilities):
  chown                  0 (0x00000000000001)  Make arbitrary changes to file UIDs and GIDs
  dac_override           1 (0x00000000000002)  Bypass file read, write, and execute permission checks.
  fowner                 3 (0x00000000000008)  Bypass file ownership / process owner equality permission checks.
  fsetid                 4 (0x00000000000010)  Don't clear set-user-ID and set-group-ID mode bits when a file is modified
  kill                   5 (0x00000000000020)  Bypass permission checks for sending signals.
  setgid                 6 (0x00000000000040)  Make arbitrary manipulations of process GIDs and supplementary GID list.
  setuid                 7 (0x00000000000080)  Make arbitrary manipulations of process UIDs.
  setpcap                8 (0x00000000000100)  Manage capability sets (from bounded / inherited set).
  net_bind_service      10 (0x00000000000400)  Bind a socket to Internet domain privileged ports.
  net_raw               13 (0x00000000002000)  Use RAW and PACKET sockets.
  ipc_lock              14 (0x00000000004000)  Lock memory, via mlock(2) and friends.
  sys_chroot            18 (0x00000000040000)  Use chroot(2) and manage kernel namespaces.
  mknod                 27 (0x00000008000000)  Create special files using mknod(2).
  audit_write           29 (0x00000020000000)  Write records to kernel auditing log.
  setfcap               31 (0x00000080000000)  Set arbitrary capabilities on a file.

What if we only want the IPC_LOCK capability, and none of the others?

→  docker run --rm --cap-drop all --cap-add ipc_lock huntprod/caps
(via /proc/self/status)
0000000000004000 (1 capability):
  ipc_lock              14 (0x00000000004000)  Lock memory, via mlock(2) and friends.
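
The arithmetic behind those masks is easy to check by hand. Here is a small Python model of it, assuming drops are applied before adds (which matches the results above) and using the default mask a80425fb from my own run. It is a toy model of the bit math, not Docker's actual implementation, and apply_cap_flags is a hypothetical helper name.

```python
CAP_BITS = {"ipc_lock": 14}   # only the capability this example needs
ALL_CAPS = 0x3FFFFFFFFF       # all 38 bits, as seen in the privileged run
DOCKER_DEFAULT = 0xA80425FB   # default set observed in this post


def apply_cap_flags(base, add=(), drop=()):
    """Model --cap-drop / --cap-add as bit operations on a capability mask."""
    mask = base
    for name in drop:
        # "all" clears everything; otherwise clear just that one bit.
        mask = 0 if name == "all" else mask & ~(1 << CAP_BITS[name])
    for name in add:
        # "all" sets everything; otherwise set just that one bit.
        mask = ALL_CAPS if name == "all" else mask | (1 << CAP_BITS[name])
    return mask


# --cap-add ipc_lock
print(f"{apply_cap_flags(DOCKER_DEFAULT, add=['ipc_lock']):016x}")
# prints 00000000a80465fb

# --cap-drop all --cap-add ipc_lock
print(f"{apply_cap_flags(DOCKER_DEFAULT, add=['ipc_lock'], drop=['all']):016x}")
# prints 0000000000004000
```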

Doing It For Real: docker-compose

Docker Compose supports the same facilities as Docker itself for managing your set of capabilities, using the cap_add and cap_drop lists in your docker-compose.yml specification:

version: '2'
services:
  pause:
    image: starkandwayne/pause
    cap_drop:
      - ALL
    cap_add:
      - NET_ADMIN
      - SYS_ADMIN

See the Compose file reference documentation for more information.

Doing It For Real: Kubernetes

In Kubernetes, you can do something similar, but it has to be configured in a security context, like this:

apiVersion: v1
kind: Pod
metadata:
  name: linux-capabilities
spec:
  containers:
  - name: pause
    image: starkandwayne/pause:latest
    securityContext:
      capabilities:
        add:
          - NET_ADMIN
          - SYS_ADMIN

See the Kubernetes Pod SecurityContext documentation for the full story.

Capabilities in the Wild

Not all Linux capabilities are created equal. Some of them you will rarely need, but others will pop up quite frequently. Here’s our top three all-time faves!

IPC_LOCK – Being able to lock memory via mlock(2) and friends comes up far more frequently than you would think, especially when you’re running security-minded infrastructure software.
LINUX_IMMUTABLE – Sometimes, you know you’re never going to rewrite a file, for auditing purposes or whatnot, and the ability to mark it append-only or immutable (the inode flags that chattr(1) sets via ioctl(2)) is a definite boon.
NET_BIND_SERVICE – Not all software is cloud-native, and some of it even has hard-coded “privileged” ports. This capability allows that particular shenanigan without widening the attack surface needlessly.
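
If your software depends on any of these, it can be handy to fail fast at startup when they are missing. Below is a hedged Python sketch of such a check. The helper names (read_effective_mask, missing_caps) are hypothetical, the bit numbers are assumed from linux/capability.h, and the example mask is the Docker default set from earlier in this post.

```python
# Bit numbers for the three capabilities above (assumed from linux/capability.h).
CAP_BITS = {"linux_immutable": 9, "net_bind_service": 10, "ipc_lock": 14}


def read_effective_mask(path="/proc/self/status"):
    """Read this process's effective capability mask (Linux only)."""
    with open(path) as status:
        for line in status:
            if line.startswith("CapEff:"):
                return int(line.split()[1], 16)
    raise RuntimeError("no CapEff line found")


def missing_caps(effective_mask, required):
    """Return the required capability names whose bits are not set, sorted."""
    return sorted(
        name for name in required
        if not (effective_mask >> CAP_BITS[name]) & 1
    )


# With the default Docker set from this post, two of our three faves are absent.
print(missing_caps(0xA80425FB, ["ipc_lock", "linux_immutable", "net_bind_service"]))
# prints ['ipc_lock', 'linux_immutable']
```

In a real service you would pass read_effective_mask() as the first argument and bail out with a clear error if the list comes back non-empty.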

Hopefully, you now have a better understanding and a greater appreciation for the power and utility of Linux capabilities, and are gearing up to incorporate them into your next Docker or Kubernetes deployment.

Happy Hacking!

Written by:
Ashley Gerwitz

Marketing Manager at Stark & Wayne and Qarik