6.1810 2025 Lecture 22: Kernel extensibility via BPF

Note: demos from this lecture are available at
  https://pdos.csail.mit.edu/6.1810/2025/lec/bpf/

Paper: "The BSD Packet Filter" by McCanne, Jacobson

Motivation: monitoring raw network packets
  Userspace processes need to check incoming network packets
    Not a specific TCP connection / UDP port
  Multiple apps: tcpdump, arpwatch, rarpd, ..
    Each process wants different types of packets
    Kernel developer doesn't know upfront what packets to pass along

[ demo: tcpdump ]
  # tcpdump -n -i wlan0 udp
  # tcpdump -n -i wlan0 udp and port 123

Challenge: performance
  Expensive to send each packet to userspace
    Need to make a copy of packet in kernel
    Need to send packet to userspace app
  App will discard packets it doesn't care about
    Wasted work copying those packets

Goal: specify what packets kernel should copy or discard
  Check in-kernel, based on specified rules
  Avoid cost of copying if app doesn't want packet
  Paper: efficient and general way of specifying such filters

Paper's idea: specify filter as a program in a simple bytecode language ("BPF")
  Application can translate its requirements into filter program
  Bytecode language flexible to specify many different kinds of packets
  Kernel can efficiently execute filter program
    Could run filter in the interrupt handler when receiving packet

[ demo: tcpdump bpf ]
  # tcpdump -n -i wlan0 -d udp

[ Diagram: tcpdump translates expr into bpf, sends to kernel, kernel runs bpf ]

Turns out to be a powerful approach for extending the kernel
  Flexible, safe, low-overhead way for kernel to run user-supplied code
  Different isolation model from the usual process-level isolation
    Sometimes called "software fault isolation"
    Techniques in common with systems like WebAssembly, even JavaScript
  Some restrictions compared to processes
    Code must be in BPF language
    Code must return right away (no sleeping, no threads, ..)
    No virtual memory
    No standard system calls
  Big win from the restrictions: performance, enables new use cases

[ demo: seccomp ]
  (install https://aur.archlinux.org/packages/seccomp-tools)
  $ cd bpf
  $ make seccomp.bpf
  $ seccomp-tools disasm seccomp.bpf
  $ bwrap --bind / / --seccomp 5 5f

  (gdb) x/20i runner->f

Need to have a plan for what registers will contain what state.
  On x86-64, first function call arg goes into %rdi.
    Let's use that to store pointer to packet.
  On x86-64, return value from function goes into %eax.
    Will put the value from return opcode there.
  On x86-64, the %eax register is caller-saved (i.e., function can modify it).
    So, let's use that to store BPF's "A" register.

How do we figure out the instruction encodings?
  [ demo: assembly snippets, in another terminal on the side ]
    $ vi snippets.S
    $ objdump -d snippets.o | less

What does the JIT need to do?
  Still need the same safety checks as our fast interpreter.
  For each instruction, need to fill in the corresponding x86 code.
    Need to encode specific constants and offsets into the instruction.
  Need to know where to jump.
    BPF code specifies the BPF instruction number.
    Generated x86 code needs x86 code offset.
  Two passes to fill in the right jump offsets.
    First, generate the x86 code, perhaps with wrong jumps, but right size.
      Remember the x86 instruction offset for each BPF instruction.
    Second, generate the x86 code again, this time with correct jumps.
  Need to fill in initial x86 code to zero out the %eax register.
    Otherwise, BPF code (and generated x86 code) could read uninitialized %eax.

After JIT setup, running filter is just jumping to the generated code.
  bpf_run.

[ demo: bpf jit perf ]
  $ time ./run_jit ../seccomp.bpf
  bpf_run: 0x7fff0000
  real 0m0.380s
  user 0m0.377s
  sys  0m0.002s

Big win over interpreter: 3.8 nsec instead of 16 nsec!

Is there more room for optimizations?
  Could use more efficient instruction encodings (short jumps).
  Could eliminate the clearing of %eax.
    Need to check that BPF program loads before use, on every execution path.
    Different kind of analysis pass: depends on branch choices.

Linux supports a more sophisticated version of BPF called eBPF.
  Richer opcodes for performance (multiple registers).
  Allows some loops (requires a more sophisticated validator).
  Allows access to some shared state (like a hash table).
  Allows calls into some kernel code (again, sophisticated validator).
  Translated into native code with a JIT for performance.

[ demo: bpftrace ]
  # pacman -S bpftrace
  # bpftrace -l
  # bpftrace -vl tracepoint:syscalls:sys_enter_mkdir
  # bpftrace -e 'tracepoint:syscalls:sys_enter_mkdir { printf("PID %d calling mkdir(%s)\n", pid, str(args.pathname)); }'
  $ mkdir x

  # bpftrace -vl tracepoint:irq:irq_handler_entry
  # bpftrace -e 'tracepoint:irq:irq_handler_entry { printf("intr %d: %s\n", args.irq, str(args.name)); }'
  # bpftrace -e 'tracepoint:irq:irq_handler_entry { if (args.irq == 1) { printf("intr %d: %s\n", args.irq, str(args.name)); } }'
  # bpftrace -d dis -e 'tracepoint:irq:irq_handler_entry { if (args.irq == 1) { printf("intr %d: %s\n", args.irq, str(args.name)); } }'

  More examples: https://bpftrace.org/tutorial-one-liners

[ demo: xdp ]
  # pacman -S bpf
  $ cd bpf
  $ cat drop_icmp.c
  $ make
  # ip link set wlan0 xdp object drop_icmp.o
  # ping -n 8.8.8.8
  # bpftool prog show
  # bpftool prog dump jited id NNN
  # ip link set wlan0 xdp off

  $ cat set_ttl.c
  # ping -n 8.8.8.8
  # ip link set wlan0 xdp object set_ttl.o
  # ip link set wlan0 xdp off

  (comment out data_end check from set_ttl.c)
  $ make
  # ip link set wlan0 xdp object set_ttl.o

eBPF validator is highly sophisticated, active area of research + kernel development.
  [ demo: kernel/bpf/verifier.c ]

Other references:
  https://github.com/zoidyzoidzoid/awesome-ebpf
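
Sketch: what the data_end check in the xdp demo guards against.
  This is a plain userspace C analogue, not the lecture's set_ttl.c.
  The name set_ttl_checked, the header-length constants, and the 0/1 return
  convention are assumptions made for illustration.
  An XDP program sees each packet as the byte range [data, data_end) and has to
  compare against data_end before touching packet bytes; the eBPF validator
  rejects a program that skips the comparison. The sketch below does the same
  comparison by hand and leaves the IPv4 checksum alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative layout constants (assumed: untagged Ethernet, IPv4). */
#define ETH_HDR_LEN    14      /* dst MAC + src MAC + EtherType */
#define ETH_TYPE_OFF   12      /* offset of the 16-bit EtherType */
#define ETHERTYPE_IPV4 0x0800
#define IP_MIN_HDR_LEN 20      /* IPv4 header without options */
#define IP_TTL_OFF     8       /* offset of the TTL byte in the IPv4 header */

/*
 * Hypothetical helper: overwrite the TTL of the IPv4 packet stored in
 * [data, data_end).  Returns 1 if the TTL was written, 0 if the packet
 * was left untouched.  The IPv4 header checksum is NOT recomputed here.
 */
int set_ttl_checked(uint8_t *data, uint8_t *data_end, uint8_t ttl)
{
    assert(data <= data_end);   /* caller must pass a valid range */

    /* The bounds check: both headers must fit before any byte is read. */
    size_t len = (size_t)(data_end - data);
    if (len < ETH_HDR_LEN + IP_MIN_HDR_LEN)
        return 0;

    /* EtherType is big-endian on the wire. */
    uint16_t ethertype =
        (uint16_t)((data[ETH_TYPE_OFF] << 8) | data[ETH_TYPE_OFF + 1]);
    if (ethertype != ETHERTYPE_IPV4)
        return 0;

    /* Safe: offset 14 + 8 = 22 is below the 34 bytes checked above. */
    data[ETH_HDR_LEN + IP_TTL_OFF] = ttl;
    return 1;
}
```

  Without the length check, a 20-byte packet would lead to a write at offset 22,
  past the end of the buffer. In the XDP setting, that is the class of access
  the validator is there to rule out before the program ever runs.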