Brief Introduction to BPF and eBPF

Hello, Habr! We would like to let you know that we are preparing the release of the book "Linux Observability with BPF".


As the BPF virtual machine continues to evolve and is being actively used in practice, we have translated for you an article that describes its main features and current state.

In recent years, programming tools and techniques designed to work around the limitations of the Linux kernel in high-performance packet processing have been gaining popularity. One of the most popular techniques of this kind is called kernel bypass: the kernel's networking layer is skipped entirely, and all packet processing is performed from user space. Bypassing the kernel also means managing the network card from user space. In other words, when working with a network card, we rely on a user-space driver.

By handing full control over the network card to a user-space program, we reduce the overhead imposed by the kernel (context switching, network-layer processing, interrupts, etc.), which matters when operating at speeds of 10 Gb/s or higher. Kernel bypass, combined with other features (batch processing) and careful performance tuning (NUMA awareness, CPU isolation, etc.), forms the foundation of high-performance networking in user space. Perhaps the model example of this new approach to packet processing is Intel's DPDK (Data Plane Development Kit), although there are other well-known tools and techniques, including Cisco's VPP (Vector Packet Processing), Netmap and, of course, Snabb.

The organization of network interactions in user space has a number of disadvantages:

  • The OS kernel is an abstraction layer for hardware resources. Because user-space programs have to manage their resources directly, they also have to manage their hardware. This often means programming their own drivers.
  • Since the kernel is skipped entirely, so is all the networking functionality it provides. User-space programs have to reimplement functionality that the kernel or the OS might already provide.
  • Such programs work as sandboxes, which severely limits their interaction with other parts of the OS.

In essence, user-space networking achieves its performance gains by moving packet processing out of the kernel and into user space. XDP does exactly the opposite: it moves network programs (filters, translators, routing, etc.) from user space into the kernel. XDP allows us to run a network function as soon as a packet arrives at the network interface, before it begins to move up through the kernel's networking subsystem. As a result, packet processing speed increases significantly. But how does the kernel let the user execute their programs in kernel space? Before answering this question, let's look at what BPF is.
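
To make the idea concrete, here is a minimal XDP sketch. It is not from the original article: the program name and the choice to drop IPv6 traffic are made up for illustration, and it assumes clang with the -target bpf flag plus the libbpf headers.

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int xdp_drop_ipv6(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;

    /* The in-kernel verifier rejects the program without this bounds check. */
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;

    /* Drop IPv6 frames before the kernel network stack ever sees them. */
    if (eth->h_proto == bpf_htons(ETH_P_IPV6))
        return XDP_DROP;

    /* Everything else proceeds up the normal network stack. */
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";

Such an object file can then be attached to an interface, for example with ip link set dev eth0 xdp obj prog.o.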

BPF and eBPF

Despite its somewhat confusing name, BPF (Berkeley Packet Filter) is, in fact, a virtual machine model. This virtual machine was originally designed to handle packet filtering, hence the name.

One of the best-known tools that uses BPF is tcpdump. When capturing packets with tcpdump, the user can specify an expression to filter packets. Only packets matching this expression will be captured. For example, the expression "tcp dst port 80" matches all TCP packets arriving at port 80. This expression can be compiled down to BPF bytecode, which tcpdump can print with the -d flag:

$ sudo tcpdump -d "tcp dst port 80"
(000) ldh [12]
(001) jeq #0x86dd jt 2 jf 6
(002) ldb [20]
(003) jeq #0x6 jt 4 jf 15
(004) ldh [56]
(005) jeq #0x50 jt 14 jf 15
(006) jeq #0x800 jt 7 jf 15
(007) ldb [23]
(008) jeq #0x6 jt 9 jf 15
(009) ldh [20]
(010) jset #0x1fff jt 15 jf 11
(011) ldxb 4*([14]&0xf)
(012) ldh [x + 16]
(013) jeq #0x50 jt 14 jf 15
(014) ret #262144
(015) ret #0




  • Instruction (000): loads the half-word (16 bits) at offset 12 into the accumulator. Offset 12 corresponds to the packet's ethertype.
  • Instruction (001): compares the accumulator with 0x86dd, the ethertype for IPv6. If the result is true, execution jumps to instruction (002); otherwise, to (006).
  • Instruction (006): compares the accumulator with 0x800 (the ethertype for IPv4). If the result is true, execution jumps to (007); otherwise, to (015).

And so on, until the packet-filtering program returns a result. The result is typically a boolean. Returning a non-zero value (instruction (014)) means the packet matched; returning zero (instruction (015)) means it did not.
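
The kernel exposes this classic BPF machinery to ordinary programs as well. As a hypothetical sketch (not part of the article), the very same 16 instructions, written in the {opcode, jt, jf, k} form printed by tcpdump -dd, can be attached to a raw socket with setsockopt(SO_ATTACH_FILTER), after which the kernel delivers only matching packets (running it requires root):

#include <stdio.h>
#include <sys/socket.h>
#include <linux/if_ether.h>
#include <linux/filter.h>
#include <arpa/inet.h>

int main(void)
{
    /* The 16 instructions shown above, as emitted by `tcpdump -dd`.
     * Note that jt/jf here are offsets relative to the next instruction,
     * not the absolute targets printed by `tcpdump -d`. */
    struct sock_filter code[] = {
        { 0x28, 0,  0, 0x0000000c },  /* (000) ldh  [12]         */
        { 0x15, 0,  4, 0x000086dd },  /* (001) jeq  #0x86dd      */
        { 0x30, 0,  0, 0x00000014 },  /* (002) ldb  [20]         */
        { 0x15, 0, 11, 0x00000006 },  /* (003) jeq  #0x6         */
        { 0x28, 0,  0, 0x00000038 },  /* (004) ldh  [56]         */
        { 0x15, 8,  9, 0x00000050 },  /* (005) jeq  #0x50        */
        { 0x15, 0,  8, 0x00000800 },  /* (006) jeq  #0x800       */
        { 0x30, 0,  0, 0x00000017 },  /* (007) ldb  [23]         */
        { 0x15, 0,  6, 0x00000006 },  /* (008) jeq  #0x6         */
        { 0x28, 0,  0, 0x00000014 },  /* (009) ldh  [20]         */
        { 0x45, 4,  0, 0x00001fff },  /* (010) jset #0x1fff      */
        { 0xb1, 0,  0, 0x0000000e },  /* (011) ldxb 4*([14]&0xf) */
        { 0x48, 0,  0, 0x00000010 },  /* (012) ldh  [x + 16]     */
        { 0x15, 0,  1, 0x00000050 },  /* (013) jeq  #0x50        */
        { 0x06, 0,  0, 0x00040000 },  /* (014) ret  #262144      */
        { 0x06, 0,  0, 0x00000000 },  /* (015) ret  #0           */
    };
    struct sock_fprog bpf = {
        .len = sizeof(code) / sizeof(code[0]),
        .filter = code,
    };

    int sock = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (sock < 0) { perror("socket"); return 1; }

    /* From now on, the kernel drops every packet that does not match. */
    if (setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf)) < 0) {
        perror("setsockopt(SO_ATTACH_FILTER)");
        return 1;
    }

    char buf[2048];
    ssize_t n = recv(sock, buf, sizeof(buf), 0);
    printf("received %zd filtered bytes\n", n);
    return 0;
}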

The BPF virtual machine and its bytecode were proposed by Steven McCanne and Van Jacobson in late 1992, in their paper "The BSD Packet Filter: A New Architecture for User-level Packet Capture", first presented at the USENIX Winter Conference in 1993.

Because BPF is a virtual machine, it defines the environment in which programs run. In addition to the bytecode, it also defines a packet-based memory model (load instructions implicitly operate on the packet being processed), registers (A and X, the accumulator and the index register), scratch memory, and an implicit program counter. Interestingly, the BPF bytecode was modeled after the Motorola 6502 ISA. As Steven McCanne recalled in his Sharkfest '11 keynote, he had been familiar with 6502 assembly since his high-school days programming on the Apple II, and this knowledge influenced his design of the BPF bytecode.

BPF support was added to the Linux kernel in version v2.5, mainly through the efforts of Jay Schulist. The BPF code then remained largely untouched until 2011, when Eric Dumazet turned the BPF interpreter into a JIT compiler (source: A JIT for packet filters). Instead of interpreting BPF bytecode, the kernel could now translate BPF programs directly to a target architecture: x86, ARM, MIPS, etc.

Later, in 2014, Alexei Starovoitov introduced a new BPF JIT. In fact, this new JIT became a whole new architecture based on BPF, called eBPF. I believe both virtual machines coexisted for some time, but nowadays packet filtering is implemented on top of eBPF. In much of the modern documentation, BPF is understood to mean eBPF, while classic BPF is now known as cBPF.

eBPF extends the classic BPF virtual machine in several ways:

  • It takes advantage of modern 64-bit architectures. eBPF uses 64-bit registers and increases the number of available registers from two (accumulator and X) to ten. eBPF also provides additional opcodes (BPF_MOV, BPF_JNE, BPF_CALL...).
  • It is decoupled from the networking subsystem. Classic BPF was tied to the networking subsystem; eBPF programs can also be attached, for example, to a tracepoint or a kprobe. This opens eBPF up to instrumentation, performance analysis, and many other uses inside other kernel subsystems. The eBPF code now lives at its own path: kernel/bpf.
  • Global data stores called maps. Maps are key-value stores that enable data exchange between user space and kernel space. eBPF provides several types of maps.
  • Helper functions, such as packet rewriting, checksum calculation or packet cloning. Unlike user-space programming, these functions are executed inside the kernel. In addition, it is possible to execute system calls from eBPF programs.
  • Tail calls. eBPF programs are limited to 4096 instructions, but the tail-call feature allows one eBPF program to hand control over to another eBPF program (a chain of up to 32 calls).
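
Maps deserve a quick illustration. The following is a minimal user-space sketch, not from the article, that creates a hash map and reads a value back through the raw bpf(2) syscall; the libbpf library wraps these calls more conveniently, and running this requires sufficient privileges.

#include <linux/bpf.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

/* Thin wrapper around the bpf(2) syscall. */
static int sys_bpf(int cmd, union bpf_attr *attr)
{
    return syscall(__NR_bpf, cmd, attr, sizeof(*attr));
}

int main(void)
{
    union bpf_attr attr;

    /* Create a hash map with long keys and long values. */
    memset(&attr, 0, sizeof(attr));
    attr.map_type    = BPF_MAP_TYPE_HASH;
    attr.key_size    = sizeof(long);
    attr.value_size  = sizeof(long);
    attr.max_entries = 16;
    int map_fd = sys_bpf(BPF_MAP_CREATE, &attr);
    if (map_fd < 0) { perror("BPF_MAP_CREATE"); return 1; }

    /* Insert one key/value pair. */
    long key = 42, value = 1234;
    memset(&attr, 0, sizeof(attr));
    attr.map_fd = map_fd;
    attr.key    = (unsigned long)&key;
    attr.value  = (unsigned long)&value;
    attr.flags  = BPF_ANY;
    if (sys_bpf(BPF_MAP_UPDATE_ELEM, &attr) < 0) { perror("update"); return 1; }

    /* Read it back. */
    long out = 0;
    memset(&attr, 0, sizeof(attr));
    attr.map_fd = map_fd;
    attr.key    = (unsigned long)&key;
    attr.value  = (unsigned long)&out;
    if (sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr) < 0) { perror("lookup"); return 1; }

    printf("key %ld -> value %ld\n", key, out);
    return 0;
}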

eBPF: An Example

There are several eBPF examples in the Linux kernel sources, available under samples/bpf/. To compile these examples, simply type:

$ sudo make samples/bpf/

I will not write a new eBPF example myself; instead I will use one of the samples available in samples/bpf/. I will look at some sections of the code and explain how it works. As an example, I chose the tracex4 program.

In general, each of the examples in samples/bpf/ consists of two files. In this case:

  • tracex4_kern.c, contains the source code that must be executed in the kernel as eBPF bytecode.
  • tracex4_user.c, contains a program from user space.

In this case, we need to compile tracex4_kern.c into eBPF bytecode. At the moment, gcc has no backend for eBPF. Fortunately, clang can emit eBPF bytecode. The Makefile uses clang to compile tracex4_kern.c into an object file.
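
If you want to compile such a file by hand, the invocation looks roughly like this (the exact include paths for the kernel headers are omitted here and vary between kernel versions):

$ clang -O2 -target bpf -c tracex4_kern.c -o tracex4_kern.o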

I mentioned above that one of the most interesting features of eBPF is maps. tracex4_kern defines one map:

struct pair {
    u64 val;
    u64 ip;
};  

struct bpf_map_def SEC("maps") my_map = {
    .type = BPF_MAP_TYPE_HASH,
    .key_size = sizeof(long),
    .value_size = sizeof(struct pair),
    .max_entries = 1000000,
};

BPF_MAP_TYPE_HASH is one of the many map types offered by eBPF. In this case, it is simply a hash. You may also have noticed the SEC("maps") declaration. SEC is a macro used to create a new section in the binary file. In fact, the tracex4_kern example defines two more sections:
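
For reference, in the samples' bpf_helpers.h the SEC macro is just a thin wrapper over a compiler section attribute, which is how functions and maps end up in their own ELF sections:

#define SEC(NAME) __attribute__((section(NAME), used))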

SEC("kprobe/kmem_cache_free")
int bpf_prog1(struct pt_regs *ctx)
{   
    long ptr = PT_REGS_PARM2(ctx);

    bpf_map_delete_elem(&my_map, &ptr); 
    return 0;
}
    
SEC("kretprobe/kmem_cache_alloc_node") 
int bpf_prog2(struct pt_regs *ctx)
{
    long ptr = PT_REGS_RC(ctx);
    long ip = 0;

    // get the ip address of the kmem_cache_alloc_node() caller
    BPF_KRETPROBE_READ_RET_IP(ip, ctx);

    struct pair v = {
        .val = bpf_ktime_get_ns(),
        .ip = ip,
    };
    
    bpf_map_update_elem(&my_map, &ptr, &v, BPF_ANY);
    return 0;
}   

These two functions allow us to delete an entry from the map (kprobe/kmem_cache_free) and to add a new entry to it (kretprobe/kmem_cache_alloc_node). All function names written in capital letters are macros defined in bpf_helpers.h.
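
For example, PT_REGS_PARM2 and PT_REGS_RC simply read the registers that hold function arguments and return values on the given architecture. A simplified sketch of their x86_64 variants (the actual definitions live in the kernel headers):

#define PT_REGS_PARM1(x) ((x)->di)  /* first function argument  */
#define PT_REGS_PARM2(x) ((x)->si)  /* second function argument */
#define PT_REGS_RC(x)    ((x)->ax)  /* function return value    */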

If I output a dump of the sections of the object file, I should see that these new sections are defined:

$ objdump -h tracex4_kern.o

tracex4_kern.o: file format elf64-little

Sections:
Idx Name Size VMA LMA File off Algn
0 .text 00000000 0000000000000000 0000000000000000 00000040 2**2
CONTENTS, ALLOC, LOAD, READONLY, CODE
1 kprobe/kmem_cache_free 00000048 0000000000000000 0000000000000000 00000040 2**3
CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
2 kretprobe/kmem_cache_alloc_node 000000c0 0000000000000000 0000000000000000 00000088 2**3
CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
3 maps 0000001c 0000000000000000 0000000000000000 00000148 2**2
CONTENTS, ALLOC, LOAD, DATA
4 license 00000004 0000000000000000 0000000000000000 00000164 2**0
CONTENTS, ALLOC, LOAD, DATA
5 version 00000004 0000000000000000 0000000000000000 00000168 2**2
CONTENTS, ALLOC, LOAD, DATA
6 .eh_frame 00000050 0000000000000000 0000000000000000 00000170 2**3
CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA


There is also tracex4_user.c, the main program. Basically, this program listens for kmem_cache_alloc_node events. When such an event occurs, the corresponding eBPF code is executed. The code stores the IP attribute of the object in the map, and the object is then printed in a loop by the main program. Example:

$ sudo ./tracex4
obj 0xffff8d6430f60a00 is 2sec old was allocated at ip ffffffff9891ad90
obj 0xffff8d6062ca5e00 is 23sec old was allocated at ip ffffffff98090e8f
obj 0xffff8d5f80161780 is 6sec old was allocated at ip ffffffff98090e8f


How are the user-space program and the eBPF program related? At initialization, tracex4_user.c loads the tracex4_kern.o object file using the load_bpf_file function.

int main(int ac, char **argv)
{
    struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
    char filename[256];
    int i;

    snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);

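    /* eBPF maps are accounted against RLIMIT_MEMLOCK, so raise the limit */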
    if (setrlimit(RLIMIT_MEMLOCK, &r)) {
        perror("setrlimit(RLIMIT_MEMLOCK, RLIM_INFINITY)");
        return 1;
    }

    if (load_bpf_file(filename)) {
        printf("%s", bpf_log_buf);
        return 1;
    }

    for (i = 0; ; i++) {
        print_old_objects(map_fd[1]);
        sleep(1);
    }

    return 0;
}

When load_bpf_file is executed, the probes defined in the eBPF file are added to /sys/kernel/debug/tracing/kprobe_events. We are now listening for these events, and our program can do something when they occur:

$ sudo cat /sys/kernel/debug/tracing/kprobe_events
p:kprobes/kmem_cache_free kmem_cache_free
r:kprobes/kmem_cache_alloc_node kmem_cache_alloc_node

All the other programs in samples/bpf/ are structured in a similar way. They always consist of two files:

  • XXX_kern.c: eBPF program.
  • XXX_user.c: main program.

The eBPF program defines maps and functions bound to sections. When the kernel throws an event of a certain type (for example, a tracepoint), the bound functions are executed. Maps provide data exchange between the kernel program and the user-space program.
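
To illustrate that exchange, here is a hypothetical sketch (dump_map is not part of the sample) of how a user-space program can walk a hash map such as my_map using the libbpf wrappers bpf_map_get_next_key and bpf_map_lookup_elem; this is similar in spirit to what print_old_objects does in tracex4_user.c:

#include <stdio.h>
#include <bpf/bpf.h>  /* bpf_map_get_next_key, bpf_map_lookup_elem */

struct pair {
    unsigned long long val;  /* timestamp stored by the eBPF program */
    unsigned long long ip;   /* caller address stored by the eBPF program */
};

/* Print every entry of the map identified by map_fd. */
static void dump_map(int map_fd)
{
    long key = -1;  /* an invalid key makes the kernel return the first entry */
    long next_key;
    struct pair v;

    while (bpf_map_get_next_key(map_fd, &key, &next_key) == 0) {
        if (bpf_map_lookup_elem(map_fd, &next_key, &v) == 0)
            printf("obj 0x%lx allocated at ip 0x%llx (ts %llu ns)\n",
                   next_key, v.ip, v.val);
        key = next_key;
    }
}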

Conclusion

This article has given a brief overview of BPF and eBPF. There is a lot of information and plenty of resources about eBPF available today, which I recommend for further study.

