This week I spent a late afternoon dissecting a rather cryptic Python script shared by a colleague (original code here, but IT’S DANGEROUS! DON’T RUN IT!). It turned out to be a local privilege escalation exploit that leverages a vulnerability in the Linux kernel (originally reported by xint.io).

After some forensic analysis and a good deal of research, I understood what it was doing. Honestly, it is quite brilliant and also scary. This post aims to dissect the attack surface and strategy, revealing effects that go way beyond the initial privilege escalation.

Important disclaimer

What follows is potentially dangerous. Don’t try any of these commands unless you know exactly what you are doing.

This content is educational! Even though we can be amazed by the ingenious ways these attacks are performed, always remember that trying to compromise systems is a criminal activity!

In my work, I have to guarantee that our own code and the code entrusted to us run safely, which includes understanding potential attack vectors and how to neutralize them. This can only be done by understanding the “criminal mind” and how these exploits work.

Testing the exploit

Of course, I didn’t run any of this before fully understanding the depth of the attack. That would have been irresponsible. I executed the harmful commands intentionally, only after understanding exactly what they did, and always in completely isolated VMs with no access to any other systems.

The tests reported below were executed using a Debian Trixie VM on GCP.

The result was immediate. I was granted a highly privileged root shell, without being asked for a password. What is even more fascinating is that subsequent calls to su also granted me root access without needing to run the script again. The system was compromised in a persistent way, at least until the next reboot.

Security comes from understanding. Let’s break down exactly what this exploit is doing, step by step, and why modern container runtimes are our best defense against it.

The page cache poisoning

At its core, this attack abuses the Linux kernel’s AF_ALG cryptographic subsystem in combination with os.splice(), a zero-copy mechanism for moving data between file descriptors. Together they let an unprivileged attacker corrupt the kernel’s page cache and overwrite the cached contents of read-only SUID binaries (like /usr/bin/su) directly in memory.

Stage 1: The Cryptographic Socket

First, the script opens a connection to the kernel’s cryptographic API via AF_ALG (family 38):

import socket
a = socket.socket(38, 5, 0) # AF_ALG, SOCK_SEQPACKET

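Stage 2: Algorithm Setup

Next, it configures an authenticated encryption algorithm by setting a key and an authentication tag size. The setsockopt call for ALG_SET_AEAD_AUTHSIZE passes None as the value together with an explicit length of 4. At first glance this looks like a deliberate NULL-pointer fault, but in Python this is the documented way to call setsockopt with optval=NULL, and for this option the kernel takes the tag size from the length argument. The call therefore selects a 4-byte tag. It does not corrupt anything by itself; it prepares the AEAD transform that the later stages abuse.

v = a.setsockopt
v(279, 1, bytes.fromhex('0800010000000010' + '0'*64)) # SOL_ALG, ALG_SET_KEY: set authenc key
v(279, 5, None, 4) # ALG_SET_AEAD_AUTHSIZE: optval=NULL, tag size comes from optlen (4)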

Stage 3: The Paused Payload

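The script creates an operation socket and uses sendmsg to push the next 4-byte payload chunk, prefixed with four filler bytes. The MSG_MORE flag (32768, or 0x8000) is critical here; it instructs the kernel to buffer the data and pause the state machine without finalizing the cryptographic operation. The flags value in the snippet, 32769, is MSG_MORE combined with MSG_OOB (1).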

u, _ = a.accept() # Operation socket
# ...
u.sendmsg(
    [b'A'*4 + c],  # Payload chunk
    [(279, 3, i*4), (279, 2, b'\x10' + i*19), (279, 4, b'\x08' + i*3)],
    32769 # MSG_MORE | MSG_OOB
)
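The bare integers in these snippets are hard to read, so here is a small reading aid that maps them back to their Linux header names. It is only a sketch: the values are the Linux UAPI constants as I understand them, the helper name describe_flags is made up for illustration, and nothing here opens a socket.

```python
# Linux constant values, written out explicitly (assumption: Linux UAPI headers).
AF_ALG = 38
SOCK_SEQPACKET = 5
SOL_ALG = 279

# Option numbers used with setsockopt() and as control-message types.
ALG_OPTIONS = {
    1: "ALG_SET_KEY",
    2: "ALG_SET_IV",
    3: "ALG_SET_OP",
    4: "ALG_SET_AEAD_ASSOCLEN",
    5: "ALG_SET_AEAD_AUTHSIZE",
}

# Only the two sendmsg() flags that appear in the snippets above.
MSG_FLAGS = {0x1: "MSG_OOB", 0x8000: "MSG_MORE"}


def describe_flags(value):
    """Hypothetical helper: split a flags integer into known flag names."""
    names = [name for bit, name in sorted(MSG_FLAGS.items()) if value & bit]
    unknown = value & ~sum(MSG_FLAGS)  # bits this table does not cover
    if unknown:
        names.append(hex(unknown))
    return names


print(describe_flags(32769))                 # ['MSG_OOB', 'MSG_MORE']
print([ALG_OPTIONS[t] for t in (3, 2, 4)])   # the three control messages, in order
```

Read this way, the three control messages in the sendmsg call select the operation, the IV, and the length of the associated data.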

Stage 4: Hijacking the Cache

This is where the magic happens. The attacker uses os.splice() to move references to the cached pages of the target read-only binary (/usr/bin/su) into a pipe, and then splices the pipe directly into the paused AF_ALG socket. No data is copied along the way, so the kernel ends up using the page cache pages of /usr/bin/su as the buffer for the cryptographic operation.

r, w = os.pipe()
n = os.splice
f = os.open('/usr/bin/su', 0) # Open target SUID binary read-only
n(f, w, o, offset_src=0)
n(r, u.fileno(), o)

Stage 5: The Overwrite

By calling recv, the attacker finalizes the request. The kernel’s crypto engine performs an in-place “decryption” directly onto the buffer. Because the buffer is physically mapped to the page cache of /usr/bin/su, the kernel forcefully overwrites the read-only memory pages with the attacker’s output!

try:
    u.recv(8 + t)
except:
    0

The script decompresses an embedded ELF payload and loops over it, injecting 4 bytes per iteration. Once the loop completes, the cached copy of /usr/bin/su has been overwritten in memory with the malicious payload.

Memory is the battlefield

Once the script finishes overwriting the cache, it simply calls os.system("su"). But it is no longer executing the legitimate su binary. It is executing our malicious payload, which the kernel happily serves directly from the corrupted page cache.

This explains the persistence. Any subsequent call to su by any user on the system will execute the poisoned cache. The original file on disk remains untouched.

A simple reboot neutralizes the attack. The poisoned memory is cleared, and the kernel will load the pristine executable from disk the next time it is requested. But in a production environment, the window before the next reboot is more than enough time for an attacker to use the compromised su and establish a permanent foothold.
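Since the file on disk stays intact while the cached copy changes, one way to look for this kind of tampering is to hash the file twice: once through a normal read, which is served from the page cache, and once with O_DIRECT, which asks the kernel to bypass the cache. What follows is a sketch under assumptions, not a vetted detection tool. The function names are mine, O_DIRECT is not supported on every filesystem, and I have not validated it against a poisoned system.

```python
import hashlib
import mmap
import os

CHUNK = 1 << 20  # 1 MiB: a multiple of common block sizes, as O_DIRECT expects


def digest_cached(path):
    """SHA-256 of the file as served by ordinary reads (page cache)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            h.update(block)
    return h.hexdigest()


def digest_direct(path):
    """SHA-256 of the file read with O_DIRECT, bypassing the page cache.

    Raises OSError where O_DIRECT is unavailable or unsupported.
    """
    flag = getattr(os, "O_DIRECT", None)
    if flag is None:
        raise OSError("O_DIRECT is not available on this platform")
    h = hashlib.sha256()
    fd = os.open(path, os.O_RDONLY | flag)
    buf = mmap.mmap(-1, CHUNK)  # anonymous mapping gives a page-aligned buffer
    try:
        size = os.fstat(fd).st_size
        done = 0
        while done < size:  # stop at EOF so we never read from an unaligned offset
            n = os.readv(fd, [buf])
            if n == 0:
                break
            h.update(buf[:n])
            done += n
    finally:
        buf.close()
        os.close(fd)
    return h.hexdigest()


def cache_matches_disk(path):
    """True if both views agree, False if they differ, None if undetermined."""
    try:
        return digest_cached(path) == digest_direct(path)
    except OSError:
        return None


if __name__ == "__main__":
    print("/usr/bin/su cache matches disk:", cache_matches_disk("/usr/bin/su"))
```

A result of False would mean the cached bytes differ from what is stored on disk, which should never happen for a read-only system binary. None only means the check could not be performed.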

The level 10 force field

I always run production workloads in containers. Naturally, my next test was to run this fresh exploit from within a containerized environment.

The good news is the attack fails completely.

Whether using standard runc or a sandboxed runtime like gVisor, the script is stopped dead in its tracks. The error looks like this:

$ python3 exploit.py
Traceback (most recent call last):
  ...
OSError: [Errno 97] Address family not supported by protocol

This is where defense-in-depth shines.

Default Seccomp profiles in Docker/runc act as a strict allowlist. They block the AF_ALG (38) socket family entirely, returning EAFNOSUPPORT (Error 97) and safely neutralizing the attack surface before any kernel state corruption can occur.

Similarly, gVisor functions as a user-space “guest kernel.” It implements only necessary application syscalls and simply does not support the AF_ALG subsystem. The exploit’s syscalls never reach the potentially vulnerable host Linux kernel.

Just like a level 10 force field in Star Trek, it stops the threat before it can do any damage.
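If you want to know whether a given environment exposes this attack surface at all, a harmless probe is enough: try to create an AF_ALG socket and close it right away. This is a minimal sketch. The function name probe_af_alg is my own, and a refusal only tells you that the socket family is blocked or unsupported, not which layer refused it.

```python
import errno
import socket

# 38 is the Linux value, used as a fallback where the constant is not exported.
AF_ALG = getattr(socket, "AF_ALG", 38)


def probe_af_alg():
    """Try to create (and immediately close) an AF_ALG socket.

    Returns (True, None) if the family is reachable, otherwise
    (False, errno_name) describing the refusal.
    """
    try:
        sock = socket.socket(AF_ALG, socket.SOCK_SEQPACKET, 0)
    except OSError as exc:
        return False, errno.errorcode.get(exc.errno, str(exc.errno))
    sock.close()
    return True, None


reachable, reason = probe_af_alg()
if reachable:
    print("AF_ALG sockets are reachable from this environment")
else:
    print("AF_ALG sockets are not reachable:", reason)
```

In the containers described above, I would expect the second branch with EAFNOSUPPORT, matching the traceback.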

The Blast Radius of Shared Cache

While containers provide strong isolation, it is worth noting a terrifying quirk of how container runtimes work.

Most runtimes use OverlayFS. Multiple containers running from the same base image share the underlying read-only layers in memory (the page cache). If an attacker successfully executed this exploit inside a container that lacked the Seccomp protections, they wouldn’t just compromise their own container.

Because the exploit modifies the shared page cache for the inode corresponding to /usr/bin/su in the base image layer, every other container running on the same host that shares that exact image layer would instantly have a compromised /usr/bin/su binary.

To see just how scary this is, consider this highly plausible scenario. I built a simple, unprivileged Docker image:

FROM debian:latest
RUN useradd foobar && apt-get update && apt-get install --assume-yes python3
USER foobar
CMD ["tail", "--follow", "/dev/null"]

I spun up a few standard containers from this image. They are running securely as the unprivileged user foobar, with the default restricted Seccomp profile. This is how it should be, right? We are safe!

Well, then I started a single, ephemeral container using the same image, but passed the --privileged flag. You know, this kind of thing happens in production, whether through carelessness or during a desperate debugging session:

$ docker run --rm --interactive --tty --privileged test-copy-fail:latest bash

Because of the --privileged flag, runc does not apply the Seccomp filters, and the AF_ALG sockets are fully exposed. I executed the Python exploit inside this ephemeral container and exited.
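A quick way to confirm whether a process is running under a seccomp filter is the Seccomp field in /proc/<pid>/status, where 0 means disabled, 1 strict mode, and 2 filter mode. The helper names below are hypothetical and this is just a sketch.

```python
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def parse_seccomp_mode(status_text):
    """Extract the seccomp mode from the text of a /proc/<pid>/status file."""
    for line in status_text.splitlines():
        # The trailing colon keeps the separate "Seccomp_filters:" line out.
        if line.startswith("Seccomp:"):
            value = line.split(":", 1)[1].strip()
            if value.isdigit():
                return SECCOMP_MODES.get(int(value), "unknown")
    return "unknown"


def current_seccomp_mode(pid="self"):
    """Seccomp mode of a process, or "unknown" if /proc is not readable."""
    try:
        with open(f"/proc/{pid}/status") as f:
            return parse_seccomp_mode(f.read())
    except OSError:
        return "unknown"


print("seccomp mode:", current_seccomp_mode())
```

Inside the privileged container I would expect this to report "disabled", and "filter" inside the regular ones.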

Next, I attached a shell to one of the previously running unprivileged containers and simply typed su.

Instantly, I was dropped into a root shell.

What makes this even more devastating is that it did not matter if those unprivileged containers were running under a hardened sandbox like gVisor. Because the attack happened elsewhere (in the privileged container sharing the same underlying OverlayFS layer), the memory was already poisoned. The sandbox prevents the attack from being performed, but it cannot protect the container from consuming an already compromised page cache served by the host.

Even when following good practices for your long-running workloads, a single ephemeral run of a privileged container can silently compromise every container on the node that shares the same image layer!
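To get a feeling for the blast radius on a node, it helps to know which containers share read-only layers. The sketch below groups containers by layer directory. It assumes the overlay2 storage driver, where each container’s colon-separated lower directories can be read with docker inspect --format '{{.GraphDriver.Data.LowerDir}}'. The function names, container names, and paths in the example are all made up.

```python
from collections import defaultdict


def split_lower_dirs(lower_dir_field):
    """Split a colon-separated LowerDir string into individual layer paths."""
    return [path for path in lower_dir_field.split(":") if path]


def containers_sharing_layers(lower_dirs_by_container):
    """Map each layer used by more than one container to those containers."""
    by_layer = defaultdict(set)
    for container, field in lower_dirs_by_container.items():
        for layer in split_lower_dirs(field):
            by_layer[layer].add(container)
    return {
        layer: sorted(names)
        for layer, names in by_layer.items()
        if len(names) > 1
    }


# Hypothetical data, shaped like the LowerDir field of three containers.
example = {
    "web-1": "/var/lib/docker/overlay2/aaa-init/diff:/var/lib/docker/overlay2/base/diff",
    "web-2": "/var/lib/docker/overlay2/bbb-init/diff:/var/lib/docker/overlay2/base/diff",
    "debug": "/var/lib/docker/overlay2/ccc-init/diff:/var/lib/docker/overlay2/base/diff",
}

for layer, names in containers_sharing_layers(example).items():
    print(layer, "is shared by", ", ".join(names))
```

Every group in the output is a set of containers that would see the same poisoned file if one of them, or a privileged sibling, tampered with the cache.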

Patch early and often

Using state-of-the-art runtimes such as gVisor proved again that they can save many headaches. They provide a critical layer of isolation that mitigates entire classes of kernel vulnerabilities.

But sandboxes are not an excuse to ignore the underlying issue. Keep your systems up-to-date. Keep your kernels patched. Don’t run containers with unnecessary capabilities or with --privileged. Use minimal base images (like Distroless or Alpine) that don’t even include SUID binaries like su in the first place.
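To check what an image actually ships, you can walk its filesystem and list every regular file with the set-user-ID or set-group-ID bit. This is a minimal sketch; the function name is hypothetical, and you would typically run it against the root of an unpacked image or inside a throwaway container.

```python
import os
import stat


def find_setid_files(root):
    """Return sorted paths of regular files under root with SUID or SGID set."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            is_setid = st.st_mode & (stat.S_ISUID | stat.S_ISGID)
            if stat.S_ISREG(st.st_mode) and is_setid:
                found.append(path)
    return sorted(found)


if __name__ == "__main__":
    for path in find_setid_files("/usr/bin"):
        print(path)
```

An empty result for the whole image means there is no SUID binary like su to target in the first place.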

Understanding these attacks reminds us that the boundary between user space and the kernel is complex and constantly tested. It is our job to make sure the shields hold.