Recent Linux kernel releases ship with a powerful framework for kernel instrumentation and Linux monitoring. It has its roots in what was historically known as BPF.
What is BPF?
BPF (Berkeley Packet Filter) is a very efficient network packet filtering mechanism aimed at avoiding unnecessary copies and allocations in user space. It operates on network packet data directly in kernel land. The most familiar application of BPF is the filter expressions used by the tcpdump tool. Under the hood, the expressions get compiled and transparently translated to BPF bytecode. This bytecode is then loaded into the kernel and applied to the raw network packet flow, so only the packets that meet the filtering criteria are passed to user space.
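To make the idea of "bytecode applied to every packet" more tangible, here is a toy interpreter sketched in Go. It is only an illustration: the opcode names, the Insn layout and the tcpOverIPv4 program are hypothetical and far simpler than the real classic BPF instruction encoding, but the load / compare / return flow is the same in spirit.

```go
package main

import "fmt"

// Op is a hypothetical opcode for this toy filter machine. The real classic
// BPF instruction set is encoded differently and is much richer.
type Op int

const (
	LoadHalf Op = iota // A = 16-bit big-endian value at pkt[K]
	LoadByte           // A = byte at pkt[K]
	JumpEq             // if A == K skip JT instructions, otherwise skip JF
	Ret                // stop and return K (0 means "drop the packet")
)

// Insn is one instruction of the toy machine.
type Insn struct {
	Op     Op
	K      uint32
	JT, JF uint8
}

// run executes the filter against one packet and returns how many bytes
// should be handed to user space (0 = drop). Out-of-bounds loads drop.
func run(prog []Insn, pkt []byte) uint32 {
	var a uint32 // the accumulator register
	for pc := 0; pc < len(prog); pc++ {
		in := prog[pc]
		switch in.Op {
		case LoadHalf:
			if int(in.K)+2 > len(pkt) {
				return 0
			}
			a = uint32(pkt[in.K])<<8 | uint32(pkt[in.K+1])
		case LoadByte:
			if int(in.K) >= len(pkt) {
				return 0
			}
			a = uint32(pkt[in.K])
		case JumpEq:
			if a == in.K {
				pc += int(in.JT)
			} else {
				pc += int(in.JF)
			}
		case Ret:
			return in.K
		}
	}
	return 0
}

// tcpOverIPv4 is roughly what an expression like "ip and tcp" boils down to:
// check the EtherType, then the protocol field of the IPv4 header.
var tcpOverIPv4 = []Insn{
	{Op: LoadHalf, K: 12},                 // EtherType of the Ethernet frame
	{Op: JumpEq, K: 0x0800, JT: 0, JF: 3}, // not IPv4: jump to the drop
	{Op: LoadByte, K: 23},                 // IPv4 protocol field
	{Op: JumpEq, K: 6, JT: 0, JF: 1},      // not TCP: jump to the drop
	{Op: Ret, K: 65535},                   // accept the packet
	{Op: Ret, K: 0},                       // drop the packet
}

// frame builds a zeroed Ethernet + IPv4 header with the two fields we test.
func frame(etherType uint16, proto byte) []byte {
	pkt := make([]byte, 34) // 14-byte Ethernet header + 20-byte IPv4 header
	pkt[12], pkt[13] = byte(etherType>>8), byte(etherType)
	pkt[23] = proto
	return pkt
}

func main() {
	fmt.Println("tcp:", run(tcpOverIPv4, frame(0x0800, 6)))
	fmt.Println("udp:", run(tcpOverIPv4, frame(0x0800, 17)))
}
```

In the kernel, the equivalent of run is executed for every packet, and only the packets for which the program returns a non-zero length are copied to the listening user space process.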
What is eBPF?
eBPF is an extended and enhanced version of BPF, turned into a Linux observability system. Think of it as BPF on steroids. With eBPF one can attach custom sandboxed bytecode to virtually every function exported via the kernel symbol table without the fear of breaking the kernel. In fact, eBPF emphasizes safety when crossing user space boundaries. The in-kernel verifier will refuse to load any eBPF program if it detects invalid pointer dereferences or if the maximum stack size limit is exceeded. Loops are not allowed (except loops with constant upper bounds known at compile time), and only a small subset of eBPF helper functions may be called from the generated bytecode. eBPF programs are guaranteed to terminate and never exhaust system resources, which is not the case with kernel modules that can cause system instability or lead to terrifying kernel panics. Conversely, some might find eBPF too restrictive when compared to the “freedom” kernel modules offer, but the tradeoffs are likely to favor eBPF over “module-oriented” instrumentation, mainly because of the guarantee that eBPF programs cannot harm the kernel. However, that’s not the only benefit.
Why use eBPF for Linux monitoring?
Being a core part of the Linux kernel, eBPF doesn’t depend on any third-party modules or external dependencies. It provides a stable ABI (Application Binary Interface), so programs built on older kernels can run on newer kernel versions. The performance overhead induced by eBPF is often negligible, making it a great fit for application monitoring and for tracing heavily loaded systems. Windows users don’t have eBPF, but they can use Event Tracing for Windows.
eBPF is very flexible and capable of tracing almost any aspect of all major Linux subsystems, including the CPU scheduler, memory manager, networking, system calls, block device requests, and so on. The sky’s the limit.
You can find the full list of traceable symbols by running this command from your terminal:
$ cat /proc/kallsyms
The above command produces a huge amount of output. If we are only interested in instrumenting the syscall interface, a bit of grep magic helps filter out unwanted symbol names:
$ cat /proc/kallsyms | grep -w -E "sys.*"
ffffffffb502dd20 T sys_arch_prctl
ffffffffb502e660 T sys_rt_sigreturn
ffffffffb5031100 T sys_ioperm
ffffffffb50313b0 T sys_iopl
ffffffffb50329b0 T sys_modify_ldt
ffffffffb5033850 T sys_mmap
ffffffffb503d6e0 T sys_set_thread_area
ffffffffb503d7a0 T sys_get_thread_area
ffffffffb5080670 T sys_set_tid_address
ffffffffb5080b70 T sys_fork
ffffffffb5080ba0 T sys_vfork
ffffffffb5080bd0 T sys_clone
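If you would rather do the same filtering from a program, the sketch below parses /proc/kallsyms in Go. The Symbol type and the helper names are ours, and the sys_ prefix simply follows the listing above, so it may need adjusting on kernels that name their syscall entry points differently.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// Symbol is one parsed line of /proc/kallsyms.
type Symbol struct {
	Addr string // hex address (may read as zeros without sufficient privileges)
	Type string // nm-style type letter, e.g. "T" for a global text symbol
	Name string
}

// parseKallsymsLine splits an "address type name [module]" line into fields.
func parseKallsymsLine(line string) (Symbol, bool) {
	f := strings.Fields(line)
	if len(f) < 3 {
		return Symbol{}, false
	}
	return Symbol{Addr: f[0], Type: f[1], Name: f[2]}, true
}

// isSyscallEntry reports whether s is a text symbol whose name starts with
// "sys_". The prefix is an assumption taken from the listing in the article.
func isSyscallEntry(s Symbol) bool {
	return (s.Type == "T" || s.Type == "t") && strings.HasPrefix(s.Name, "sys_")
}

func main() {
	f, err := os.Open("/proc/kallsyms")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if s, ok := parseKallsymsLine(sc.Text()); ok && isSyscallEntry(s) {
			fmt.Println(s.Addr, s.Name)
		}
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```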
Different types of hook points are responsible for reacting to events triggered inside the kernel. The execution of a kernel routine at a specific memory address, the arrival of a network packet, and the invocation of user space code are all examples of events that can be trapped by attaching eBPF programs to kprobes, XDP programs to packet ingress paths, and uprobes to user space processes, respectively.
Here, at Sematext, we are very excited about eBPF and are exploring ways to leverage its power in the context of server monitoring and container visibility. And yes, we are also looking for smart and happy people to work with eBPF, but keep reading.
Anatomy of a Linux eBPF Program
Before we explain the structure of an eBPF program, it’s worth mentioning BCC (BPF Compiler Collection), a toolkit that abstracts bytecode loading and provides bindings for Python and Lua to interoperate with the underlying eBPF infrastructure. It also contains a lot of useful tools that give you an overview of what’s possible to achieve through eBPF instrumentation.
In the past, BPF programs were crafted by hand, producing the bytecode directly from raw BPF instruction set directives. Fortunately, the clang compiler (the C front end of LLVM) can translate C to eBPF bytecode and spare us the juggling with BPF instructions. At the time of writing, it’s the only compiler that can emit eBPF bytecode, although it’s possible to produce eBPF bytecode from Rust too.
Once an eBPF program is successfully compiled and the object file is generated, we are ready to inject it into the kernel. For that purpose, a dedicated bpf system call was introduced. This seemingly simple syscall does a lot more than load eBPF bytecode. It also creates and manipulates in-kernel maps (more about maps later), one of the most compelling features of the eBPF infrastructure. You can learn a lot more by reading the bpf manual page (man 2 bpf).
When a user space process pushes eBPF bytecode by invoking the bpf syscall, the kernel first verifies it and then JIT-compiles the instructions, translating them to the native instruction set of the target architecture. The resulting code is quite fast! If for any reason the JIT compiler is not available, the kernel falls back to an interpreter that doesn’t enjoy the aforementioned bare-metal performance.
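Whether the JIT compiler is active on a given machine can usually be checked through the net.core.bpf_jit_enable sysctl. The Go sketch below reads it from /proc; the describeJIT helper is our own name, and the snippet assumes the sysctl is exposed on your kernel.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// describeJIT interprets the content of /proc/sys/net/core/bpf_jit_enable.
func describeJIT(raw string) string {
	switch strings.TrimSpace(raw) {
	case "0":
		return "disabled (programs run in the interpreter)"
	case "1":
		return "enabled"
	case "2":
		return "enabled, with debug output in the kernel log"
	default:
		return "unknown"
	}
}

func main() {
	raw, err := os.ReadFile("/proc/sys/net/core/bpf_jit_enable")
	if err != nil {
		// The file is missing when the kernel was built without BPF JIT support.
		fmt.Fprintln(os.Stderr, "cannot read JIT setting:", err)
		os.Exit(1)
	}
	fmt.Println("eBPF JIT is", describeJIT(string(raw)))
}
```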
Linux eBPF Example
Let’s see an example of a Linux eBPF program now. Our goal is to trap invocations of the setns system call. A process calls setns when it wishes to join an existing isolated namespace, typically one that was created together with a child process (the child can control which namespaces it detaches from the parent by specifying a bit mask of flags in the clone syscall argument). This system call is very often used to give processes a segregated view of system resources such as the TCP stack, mount points, PID number space, etc.
#include <linux/kconfig.h>
#include <linux/sched.h>
#include <linux/version.h>
#include <linux/bpf.h>

#ifndef SEC
#define SEC(NAME) __attribute__((section(NAME), used))
#endif

SEC("kprobe/sys_setns")
int kprobe__sys_setns(struct pt_regs *ctx) {
    return 0;
}

char _license[] SEC("license") = "GPL";
__u32 _version SEC("version") = 0xFFFFFFFE;
The above is the bare minimum eBPF program. It consists of several parts. First, we include various kernel header files that contain definitions for multiple data types. We also declare the SEC macro, which is used to generate sections inside the object file that are later interpreted by the ELF BPF loader. The loader will complain if it can’t find the license and version sections, so we need to provide both of them.
Now comes the most interesting part of our eBPF program: the actual hook point for the setns syscall. By starting the function name with the kprobe__ prefix and placing it in the corresponding section through the SEC macro, we tell the loader to attach an instrumentation callback to the sys_setns symbol. The callback triggers our eBPF program and executes the code inside the function’s body each time the syscall is dispatched. Every eBPF program has a context. In the case of kernel probes, that’s the current state of the processor’s registers (the pt_regs structure), which contain the function arguments as placed by libc upon the transition from user to kernel space.
To compile the program (llvm and clang should be installed and properly configured) we can use the following command. Please note you’ll need to specify the path to the kernel headers through the LINUX_HEADERS environment variable. Here clang emits an intermediate LLVM representation of our program and llc, the LLVM static compiler, produces the final eBPF bytecode:
$ clang -D__KERNEL__ -D__ASM_SYSREG_H \
    -Wunused -Wall \
    -Wno-compare-distinct-pointer-types -Wno-pointer-sign \
    -O2 -S -emit-llvm ns.c \
    -I $LINUX_HEADERS/source/include \
    -I $LINUX_HEADERS/source/include/generated/uapi \
    -I $LINUX_HEADERS/source/arch/x86/include \
    -I $LINUX_HEADERS/build/include \
    -I $LINUX_HEADERS/build/arch/x86/include \
    -I $LINUX_HEADERS/build/include/uapi \
    -I $LINUX_HEADERS/build/include/generated/uapi \
    -I $LINUX_HEADERS/build/arch/x86/include/generated \
    -o - | llc -march=bpf -filetype=obj -o ns.o
You can use the readelf tool to inspect the ELF sections and the symbol table of the object file:
$ readelf -a -S ns.o
…
2: 0000000000000000 0 NOTYPE GLOBAL DEFAULT 4 _license
3: 0000000000000000 0 NOTYPE GLOBAL DEFAULT 5 _version
4: 0000000000000000 0 NOTYPE GLOBAL DEFAULT 3 kprobe__sys_setns
The above output confirms that the symbol table is built as expected. We have a valid eBPF object file, and now it’s time to load it into the kernel and see the magic happen.
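The same inspection can be done from Go with the standard debug/elf package, which is handy if you want to sanity-check an object file before shipping it. This is only a sketch: probeTarget is a hypothetical helper of ours, and the kprobe/ and kretprobe/ prefixes are assumed to follow the section naming used in this article.

```go
package main

import (
	"debug/elf"
	"fmt"
	"os"
	"strings"
)

// probeTarget extracts the probe kind and kernel symbol from a section name
// such as "kprobe/sys_setns". Other section names are reported as not ok.
func probeTarget(section string) (kind, symbol string, ok bool) {
	for _, k := range []string{"kprobe", "kretprobe"} {
		if strings.HasPrefix(section, k+"/") {
			return k, strings.TrimPrefix(section, k+"/"), true
		}
	}
	return "", "", false
}

func main() {
	path := "ns.o" // the object file produced by the compile step above
	if len(os.Args) > 1 {
		path = os.Args[1]
	}
	f, err := elf.Open(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	for _, s := range f.Sections {
		if kind, sym, ok := probeTarget(s.Name); ok {
			fmt.Printf("%s on %s (%d bytes of bytecode)\n", kind, sym, s.Size)
		} else if s.Name == "license" || s.Name == "version" {
			fmt.Printf("metadata section %q\n", s.Name)
		}
	}
}
```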
Attaching eBPF Programs with Go
We already mentioned BCC and how it does the heavy lifting while offering an ergonomic interface to the eBPF machinery. In order to build and run eBPF programs BCC requires installing LLVM and kernel headers on the target node and sometimes we might not have the luxury to make that tradeoff. In such scenarios it would be ideal if we could ship the resulting ELF object baked in the data segment of our binary and maximize the portability across machines.
Apart from providing bindings for libbcc, the gobpf package can load eBPF programs from precompiled bytecode. If we combine it with a tool such as packr, which can embed blobs in a Go app, we have all the ingredients needed to distribute our binary with zero runtime dependencies.
We’ll slightly modify the eBPF program so that it prints to the kernel tracing pipe when the kprobe is triggered. For brevity, we won’t include the definition of the printt macro or the other eBPF helper functions, but you can find them in this header file.
SEC("kprobe/sys_setns")
int kprobe__sys_setns(struct pt_regs *ctx) {
    int fd = (int)PT_REGS_PARM1(ctx);
    int pid = bpf_get_current_pid_tgid() >> 32;
    printt("process with pid %d joined ns through fd %d", pid, fd);
    return 0;
}
Now we can start writing Go code that handles eBPF bytecode loading. We will implement a tiny abstraction (KprobeTracer) atop gobpf:
package main

import (
	"bytes"
	"errors"
	"fmt"

	bpflib "github.com/iovisor/gobpf/elf"
)

type KprobeTracer struct {
	// bytecode is the byte stream with embedded eBPF program
	bytecode []byte

	// eBPF module associated with this tracer. The module is a collection of maps, probes, etc.
	mod *bpflib.Module
}

func NewKprobeTracer(bytecode []byte) (*KprobeTracer, error) {
	mod := bpflib.NewModuleFromReader(bytes.NewReader(bytecode))
	if mod == nil {
		return nil, errors.New("ebpf is not supported")
	}
	// Return a pointer so the value matches the declared *KprobeTracer result.
	return &KprobeTracer{mod: mod, bytecode: bytecode}, nil
}

// EnableAllKprobes enables all kprobes/kretprobes in the module. The argument
// determines the maximum number of instances of the probed functions that can
// be handled simultaneously.
func (t *KprobeTracer) EnableAllKprobes(maxActive int) error {
	params := make(map[string]*bpflib.PerfMap)

	err := t.mod.Load(params)
	if err != nil {
		return fmt.Errorf("unable to load module: %v", err)
	}

	err = t.mod.EnableKprobes(maxActive)
	if err != nil {
		return fmt.Errorf("cannot initialize kprobes: %v", err)
	}
	return nil
}
We are ready to bootstrap the kernel probe tracer:
package main

import (
	"log"
	"os"
	"os/signal"

	"github.com/gobuffalo/packr"
)

func main() {
	box := packr.NewBox("/directory/to/your/object/files")
	bytecode, err := box.Find("ns.o")
	if err != nil {
		log.Fatal(err)
	}

	ktracer, err := NewKprobeTracer(bytecode)
	if err != nil {
		log.Fatal(err)
	}

	if err := ktracer.EnableAllKprobes(10); err != nil {
		log.Fatal(err)
	}

	// Keep the process alive until Ctrl-C; the eBPF program stops running
	// once the process that loaded it exits.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, os.Interrupt)
	<-sig
}
Use sudo cat /sys/kernel/debug/tracing/trace_pipe to read the debug info pushed to the pipe. The easiest way to test the eBPF program is to start a shell inside a running Docker container:
$ docker exec -it nginx /bin/bash
Behind the scenes, the container runtime re-associates the bash process with the namespaces of the nginx container. The first argument, which we captured via the PT_REGS_PARM1 macro, is the file descriptor of the namespace, which is represented by a symbolic link inside the /proc/<pid>/ns directory. Yay! So we can monitor each time a process joins a namespace. It might not be something super useful, but it illustrates how easy it is to trap a syscall’s execution and access its arguments.
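To see what those symbolic links look like, the Go sketch below reads the namespace links of the current process and splits them into a type and an inode number. The parseNsLink helper is our own, and the list of namespace names is just a sample of what a typical kernel exposes.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseNsLink splits a namespace link target such as "net:[4026531992]"
// into the namespace type and its inode number.
func parseNsLink(target string) (nsType string, inode uint64, err error) {
	i := strings.Index(target, ":[")
	if i < 0 || !strings.HasSuffix(target, "]") {
		return "", 0, fmt.Errorf("unexpected namespace link %q", target)
	}
	inode, err = strconv.ParseUint(target[i+2:len(target)-1], 10, 64)
	if err != nil {
		return "", 0, err
	}
	return target[:i], inode, nil
}

func main() {
	// Two processes share a namespace when these inode numbers match.
	for _, ns := range []string{"mnt", "net", "pid", "uts", "ipc"} {
		target, err := os.Readlink("/proc/self/ns/" + ns)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		t, ino, err := parseNsLink(target)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		fmt.Printf("%-4s inode %d\n", t, ino)
	}
}
```

Comparing these inode numbers before and after a setns call is one simple way to confirm which namespace a process ended up in.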
Using eBPF Maps
Writing results to the tracing pipe is good for debugging purposes, but for production environments we definitely need a more sophisticated mechanism for sharing state between user and kernel space. That’s where eBPF maps come to the rescue. They are very efficient in-kernel key/value stores for data aggregation and can be accessed asynchronously from user space. There are many types of eBPF maps, but for this particular use case we will rely on the BPF_MAP_TYPE_PERF_EVENT_ARRAY map. It can carry custom structures that are pushed through the perf event ring buffer and delivered to the user space process.
gobpf supports creating perf maps and streaming events to a provided Go channel. We can add the following code to receive C structures in our program.
rxChan := make(chan []byte)
lostChan := make(chan uint64)
pmap, err := bpflib.InitPerfMap(
	t.mod,
	mapName,
	rxChan,
	lostChan,
)
if err != nil {
	return quit, err
}

if _, found := t.maps[mapName]; !found {
	t.maps[mapName] = pmap
}

go func() {
	for {
		select {
		case pe := <-rxChan:
			// pe is a []byte with the raw event; reinterpret it as the C struct.
			nsJoin := (*C.struct_ns_evt_t)(unsafe.Pointer(&pe[0]))
			log.Info(nsJoin)
		case l := <-lostChan:
			if lost != nil {
				lost(l)
			}
		}
	}
}()

pmap.PollStart()
We initialize the receiver and lost-event channels and pass them to the InitPerfMap function along with the module reference and the name of the perf map we want to consume events from. Each time a new event is pushed to the receiver channel, we cast the raw pointer to the C struct (ns_evt_t) defined in our eBPF program. We also need to declare a perf map and emit the structure through the bpf_perf_event_output helper:
struct bpf_map_def SEC("maps/ns") ns_map = {
    /* bpf_perf_event_output requires a perf event array map. */
    .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
    .key_size = sizeof(int),
    .value_size = sizeof(u32),
    .max_entries = 1024,
    .pinning = 0,
    .namespace = "",
};

/* Inside the body of the kprobe function: */
struct ns_evt_t evt = {};
/* Initialize structure fields. */
u32 cpu = bpf_get_smp_processor_id();
bpf_perf_event_output(ctx, &ns_map, cpu, &evt, sizeof(evt));
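If you prefer to avoid cgo and unsafe casts on the Go side, the raw bytes received on the channel can also be decoded with encoding/binary. The sketch below is only an illustration: the article does not show the definition of ns_evt_t, so the two-field layout used here (a pid followed by a file descriptor) is purely hypothetical, and little-endian byte order is assumed.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// nsEvent mirrors a HYPOTHETICAL layout of struct ns_evt_t:
//
//	struct ns_evt_t { u32 pid; s32 fd; };
//
// Adjust the fields so they match the struct in your own eBPF program.
type nsEvent struct {
	Pid uint32
	Fd  int32
}

// decodeNsEvent converts the raw perf event payload into an nsEvent.
// Little-endian byte order is assumed (as on x86); extra trailing bytes
// in the payload are ignored.
func decodeNsEvent(raw []byte) (nsEvent, error) {
	var evt nsEvent
	err := binary.Read(bytes.NewReader(raw), binary.LittleEndian, &evt)
	return evt, err
}

func main() {
	// A hand-made payload standing in for what would arrive on rxChan.
	raw := []byte{0x39, 0x30, 0x00, 0x00, 0x05, 0x00, 0x00, 0x00}
	evt, err := decodeNsEvent(raw)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Printf("pid=%d fd=%d\n", evt.Pid, evt.Fd)
}
```

The explicit decode costs a copy per event, but it fails cleanly on a short buffer instead of reading past the end of the slice.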
Conclusion
eBPF is constantly evolving and gaining wider adoption. Each kernel release adds new features and improvements. Low overhead and native programmability make it very attractive for a variety of use cases. For example, the Suricata intrusion detection system uses it to implement advanced socket load balancing strategies and packet filtering at a very early stage in the Linux network stack. Cilium relies heavily on eBPF to deliver sophisticated security policies for containers. Sematext Agent leverages eBPF to pinpoint interesting events such as kill signal broadcasts or OOM notifications for Docker monitoring and Kubernetes monitoring, as well as regular server monitoring. It also provides an efficient network tracer for network monitoring that uses eBPF to capture TCP/UDP traffic statistics. It seems eBPF is aiming to become a de facto standard for Linux monitoring via Linux kernel instrumentation.
Next Steps
If you find this stuff exciting, we’re looking for people like you – we work on performance monitoring, log management, transaction tracing, and other forms of observability and utilize things like Go, Kotlin, Node.js, Kubernetes, Kafka, Elasticsearch, Akka, eBPF of course, etc.
Looking for a Full Stack Observability platform? We suggest checking out Sematext Cloud, which brings together logs, metrics, user monitoring and tracing. Want only a tracing solution instead? Sematext Tracing provides end-to-end visibility into your distributed applications so you can find bottlenecks quickly and resolve production issues faster and with less effort. Interested in trying it out? Sign up today for a free exclusive beta invite by clicking here.
We would love to hear your opinions, experiments or anything else you would like to share regarding the eBPF ecosystem. Before you go, don’t forget to download your OpenTracing eBook: Distributed Tracing’s Emerging Industry Standard.