BPF for the Little Ones, Part Zero: Classic BPF

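The Berkeley Packet Filter (BPF) is a Linux kernel technology that has been on the front pages of English-language technical publications for several years now. Conferences are packed with talks on using and developing BPF. David Miller, maintainer of the Linux networking subsystem, titled his Linux Plumbers 2018 talk "This talk is not about XDP" (XDP is one of the ways BPF is used). Brendan Gregg gives talks titled Linux BPF Superpowers. Toke Høiland-Jørgensen remarks that the kernel is now a microkernel. Thomas Graf promotes the idea that BPF is JavaScript for the kernel.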


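There is still no systematic description of BPF on Habr, so in this series of articles I will try to cover the history of the technology, describe its architecture and development tools, and outline the application areas and practice of using BPF. This article, part zero of the series, covers the history and architecture of classic BPF and reveals how tcpdump, seccomp, strace and other tools work under the hood.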


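BPF development is controlled by the Linux networking community, and the main existing BPF applications are network-related, so, with @eucariot's permission, I named the series "BPF for the Little Ones" in honor of the series "Networks for the Little Ones".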


A brief history of BPF(c)


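Modern BPF is an improved and extended version of an older technology of the same name, which is now called classic BPF to avoid confusion. Classic BPF is the basis of the well-known utility tcpdump, the seccomp mechanism, the lesser-known xt_bpf module for iptables, and the cls_bpf classifier. In modern Linux, classic BPF programs are automatically translated into the new form, but from the user's point of view the API has stayed in place, and new applications of classic BPF, as we will see in this article, are still being found. For this reason, and because following the history of classic BPF in Linux makes it clearer how and why it evolved into its modern form, I decided to start with an article about classic BPF.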


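In the late 1980s, engineers at the famous Lawrence Berkeley Laboratory became interested in how to filter network packets properly on the hardware that was modern at the time. The basic idea of filtering, originally implemented in CSPF (CMU/Stanford Packet Filter), was to filter out unwanted packets as early as possible, that is, in kernel space, since this avoids copying unneeded data into user space. To make it safe to run user code in kernel space, a sandboxed virtual machine was used.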


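However, the virtual machines of the existing filters were designed for stack-based machines and did not run as efficiently on the newer RISC machines. As a result, engineers at Berkeley Labs developed a new technology, BPF (Berkeley Packet Filters), whose virtual machine architecture was modeled on the Motorola 6502 processor, the workhorse of such well-known products as the Apple II and the NES. The new virtual machine made filters many times faster than the existing solutions.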


The BPF machine architecture


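We will get to know BPF hands-on, by working through examples. First, though, note that the machine has two user-accessible 32-bit registers, the accumulator A and the index register X, 64 bytes of memory (16 words) that can be written and then read back, and a small instruction set for working with these objects. Programs can also use jump instructions to implement conditionals, but to guarantee that a program terminates in time, jumps can only go forward, which in particular means that loops are forbidden.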


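The general scheme for running the machine is as follows. The user writes a program for the BPF architecture and, through some kernel mechanism (for example, a system call), loads it and attaches it to some event generator in the kernel (for example, an event is the arrival of the next packet on a network card). When the event occurs, the kernel runs the program (for example, in an interpreter), and the machine's memory corresponds to some region of kernel memory (for example, the data of the incoming packet).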


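This is enough to start looking at examples: we will introduce the rest of the system and the instruction format as needed. If you want to study the instruction set of the virtual machine and all of its capabilities right away, read the original paper, The BSD Packet Filter, and/or the first half of the file Documentation/networking/filter.txt in the kernel documentation. You can also study the presentation libpcap: An Architecture and Optimization Methodology for Packet Capture, in which McCanne, one of the authors of BPF, talks about the history of libpcap.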


We now turn to all the significant uses of classic BPF in Linux: tcpdump (libpcap), seccomp, xt_bpf, cls_bpf.


tcpdump


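BPF was developed in parallel with its packet filtering frontend, the well-known utility tcpdump. Since this is the oldest and best-known example of using classic BPF, and it is available on many operating systems, we start our study of the technology with it.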


(All examples in this article were run on Linux 5.6.0-rc6. The output of some commands has been edited for readability.)


Example: watching IPv6 packets


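Suppose we want to look at all IPv6 packets on the interface eth0. To do this we can run tcpdump with the simplest filter, ip6: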


$ sudo tcpdump -i eth0 ip6

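Here tcpdump compiles the ip6 filter into BPF bytecode and sends it to the kernel (a later section describes how tcpdump loads filters). The loaded filter is run for every packet passing through eth0. If the filter returns a non-zero value n, up to n bytes of the packet are copied to user space, and we see them in the tcpdump output.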



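It turns out we can easily find out exactly which bytecode tcpdump sends to the kernel by asking tcpdump itself, running it with the -d option: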


$ sudo tcpdump -i eth0 -d ip6
(000) ldh      [12]
(001) jeq      #0x86dd          jt 2    jf 3
(002) ret      #262144
(003) ret      #0

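On line 0 we run the instruction ldh [12], which stands for "load into register A the half-word (16 bits) located at address 12". The only question is what kind of memory we are addressing. The answer is that address x is where byte (x+1) of the packet under inspection starts. We are reading packets from the Ethernet interface eth0, so the packet looks as follows (for simplicity, we assume there are no VLAN tags in the packet):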


       6              6          2
|Destination MAC|Source MAC|Ether Type|...|

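So after ldh [12] executes, register A holds the Ether Type field, the type of the packet carried in this Ethernet frame. On line 1 we compare the contents of A (the packet type) with 0x86dd, which is the type we are interested in, IPv6. Besides the comparison instruction, line 1 contains two more constants, jt 2 and jf 3: the lines to jump to when the comparison succeeds (A == 0x86dd) or fails. So in the successful case (IPv6) we go to line 2, and otherwise to line 3. On line 3 the program exits with code 0 (do not copy the packet); on line 2 it exits with code 262144 (copy at most 256 kilobytes of the packet).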
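The four instructions above can be read as a small C function. The following is only a hand translation for illustration, not code generated by tcpdump or used by the kernel; the names load_half and ip6_filter are made up for this sketch, and the length check reflects the classic BPF rule that a load beyond the end of the packet makes the filter return 0.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a big-endian 16-bit value, the way the BPF "ldh" instruction does. */
uint16_t load_half(const uint8_t *pkt, size_t off)
{
    return (uint16_t)((pkt[off] << 8) | pkt[off + 1]);
}

/* Hypothetical hand translation of the "ip6" program.
 * Returns the number of bytes to copy to user space: 262144 or 0. */
uint32_t ip6_filter(const uint8_t *pkt, size_t len)
{
    uint16_t a;

    if (len < 14)
        return 0;               /* load out of bounds: packet is rejected */

    a = load_half(pkt, 12);     /* (000) ldh [12]                    */
    if (a == 0x86dd)            /* (001) jeq #0x86dd  jt 2  jf 3     */
        return 262144;          /* (002) ret #262144                 */
    return 0;                   /* (003) ret #0                      */
}
```

Calling ip6_filter on a frame whose bytes 12 and 13 are 0x86 and 0xdd yields 262144; any other Ether Type yields 0.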


A more complex example: TCP packets by destination port


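Let us see what a filter looks like that copies all TCP packets with destination port 666. We consider the IPv4 case, since the IPv6 case is simpler. After studying this example, you can explore the IPv6 filter (ip6 and tcp dst port 666) and the filter for the general case (tcp dst port 666) on your own as an exercise. The filter we are interested in looks like this: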


$ sudo tcpdump -i eth0 -d ip and tcp dst port 666
(000) ldh      [12]
(001) jeq      #0x800           jt 2    jf 10
(002) ldb      [23]
(003) jeq      #0x6             jt 4    jf 10
(004) ldh      [20]
(005) jset     #0x1fff          jt 10   jf 6
(006) ldxb     4*([14]&0xf)
(007) ldh      [x + 16]
(008) jeq      #0x29a           jt 9    jf 10
(009) ret      #262144
(010) ret      #0

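We already know what lines 0 and 1 do. By line 2 we have verified that this is an IPv4 packet (Ether Type = 0x800), and we load the 24th byte of the packet into register A. Our packet looks like this: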


       14            8      1     1
|ethernet header|ip fields|ttl|protocol|...|

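That is, we load into register A the Protocol field of the IP header, which makes sense, since we only want to copy TCP packets. Line 3 compares Protocol with 0x6 (IPPROTO_TCP).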


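On lines 4 and 5 we load the half-word at address 20 and test it with jset, which jumps if any bit of its mask is set in A. The mask 0x1fff has the three high bits (the IP flags) cleared, so it selects the 13-bit Fragment Offset field. A non-zero offset means the packet is a later fragment of a fragmented IP packet; such a fragment carries no TCP header, so we cannot check its port, and the packet is dropped.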


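Line 6 is the most interesting one in this listing. The expression ldxb 4*([14]&0xf) means that we load into register X the low four bits of the fifteenth byte of the packet, multiplied by 4. The low four bits of the fifteenth byte are the Internet Header Length field of the IPv4 header, which stores the header length in 32-bit words, so it has to be multiplied by 4. Interestingly, 4*([14]&0xf) is a notation for a special addressing mode that can be used only in this form and only with register X: we can write neither ldb 4*([14]&0xf) nor ldxb 5*([14]&0xf) (only the offset can be changed, for example ldxb 4*([16]&0xf)). This addressing mode was evidently added to BPF precisely to load the IPv4 header length into X (the index register).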


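So on line 7 we load the half-word at address (X+16). Recalling that the Ethernet header takes 14 bytes and that X holds the length of the IPv4 header, we see that A receives the TCP destination port: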


       14           X           2             2
|ethernet header|ip header|source port|destination port|

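Finally, line 8 compares the destination port with the desired value (0x29a is 666), and lines 9 and 10 return the result: copy the packet or not.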
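The whole eleven-instruction program can be sketched the same way in C. This is an illustrative hand translation under the assumptions used in the text (untagged Ethernet frames, IPv4); the function names are invented for this sketch, and the explicit bounds checks stand in for the interpreter's rule that an out-of-range load returns 0.

```c
#include <stddef.h>
#include <stdint.h>

/* Big-endian 16-bit load, like the BPF "ldh" instruction. */
uint16_t ldh(const uint8_t *pkt, size_t off)
{
    return (uint16_t)((pkt[off] << 8) | pkt[off + 1]);
}

/* Hypothetical hand translation of "ip and tcp dst port 666". */
uint32_t tcp_dport_666_filter(const uint8_t *pkt, size_t len)
{
    uint32_t x;

    if (len < 24)
        return 0;
    if (ldh(pkt, 12) != 0x0800)      /* (000)-(001) Ether Type is IPv4?    */
        return 0;
    if (pkt[23] != 0x6)              /* (002)-(003) Protocol is TCP?       */
        return 0;
    if (ldh(pkt, 20) & 0x1fff)       /* (004)-(005) fragment offset != 0   */
        return 0;
    x = 4u * (pkt[14] & 0x0f);       /* (006) X = IPv4 header length       */
    if ((size_t)x + 16 + 2 > len)    /* the next load must stay in bounds  */
        return 0;
    if (ldh(pkt, x + 16) != 0x29a)   /* (007)-(008) destination port 666?  */
        return 0;
    return 262144;                   /* (009) ret #262144                  */
}
```

Because X is computed from the header itself, the same code finds the port both for a 20-byte IPv4 header and for a header with options.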


Tcpdump: loading


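In the previous examples we deliberately did not dwell on how exactly the BPF bytecode is loaded into the kernel for packet filtering. Generally speaking, tcpdump has been ported to many systems, and it works with filters through the libpcap library. In short, to put a filter on an interface with libpcap, a program opens a capture handle, compiles the filter expression, and attaches the result with pcap_setfilter.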



To see how the pcap_setfilter function is implemented on Linux, we use strace (some lines have been removed):


$ sudo strace -f -e trace=%network tcpdump -p -i eth0 ip
socket(AF_PACKET, SOCK_RAW, 768)        = 3
bind(3, {sa_family=AF_PACKET, sll_protocol=htons(ETH_P_ALL), sll_ifindex=if_nametoindex("eth0"), sll_hatype=ARPHRD_NETROM, sll_pkttype=PACKET_HOST, sll_halen=0}, 20) = 0
setsockopt(3, SOL_SOCKET, SO_ATTACH_FILTER, {len=4, filter=0xb00bb00bb00b}, 16) = 0
...

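The first two lines of the output create a raw socket for reading all Ethernet frames and bind it to the interface eth0. The ip filter, like the ip6 filter from our first example, consists of four BPF instructions, and on the third line we see the SO_ATTACH_FILTER option of the setsockopt system call being used to load and attach a filter of length 4. This is our filter.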


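Note that in classic BPF loading and attaching a filter always happen as one atomic operation, whereas in the new version of BPF loading a program and binding it to an event generator are separate steps.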


The full version of the output looks like this:


$ sudo strace -f -e trace=%network tcpdump -p -i eth0 ip
socket(AF_PACKET, SOCK_RAW, 768)        = 3
bind(3, {sa_family=AF_PACKET, sll_protocol=htons(ETH_P_ALL), sll_ifindex=if_nametoindex("eth0"), sll_hatype=ARPHRD_NETROM, sll_pkttype=PACKET_HOST, sll_halen=0}, 20) = 0
setsockopt(3, SOL_SOCKET, SO_ATTACH_FILTER, {len=1, filter=0xbeefbeefbeef}, 16) = 0
recvfrom(3, 0x7ffcad394257, 1, MSG_TRUNC, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
setsockopt(3, SOL_SOCKET, SO_ATTACH_FILTER, {len=4, filter=0xb00bb00bb00b}, 16) = 0
...

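As mentioned above, our filter is loaded and attached to the socket on line 5, but what happens on lines 3 and 4? It turns out that this is libpcap taking care of us: so that the output does not include packets that do not match our filter, the library first attaches a dummy filter ret #0 (drop all packets), switches the socket to non-blocking mode, and tries to drain any packets that may have been queued earlier.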


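To sum up, filtering packets on Linux with classic BPF requires a filter in the form of a struct sock_fprog and an open socket, after which the filter can be attached to the socket with the setsockopt system call.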


, , raw. , , UDP . ( , .)


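More details on using setsockopt to attach filters can be found in socket(7); writing your own struct sock_fprog filters without the help of tcpdump is discussed in a later section on BPF programming.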


Classic BPF in the 21st century


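BPF was included in Linux in 1997 and for a long time remained a workhorse for libpcap without any particular changes (Linux-specific changes did happen, of course, but they did not alter the overall picture). The first serious signs that BPF would evolve appeared in 2011, when Eric Dumazet proposed a patch adding a Just In Time Compiler to the kernel: a translator from BPF bytecode to native x86_64 code.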


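The JIT compiler was the first in a chain of changes: in 2012 it became possible to write seccomp filters in BPF, in 2013 the xt_bpf module appeared, which allows writing iptables rules in BPF, and also in 2013 the cls_bpf module was added, which allows writing traffic classifiers in BPF.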


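We will look at all of these examples in more detail shortly, but first it is useful to learn how to write and compile arbitrary BPF programs, since the capabilities provided by libpcap are limited (a simple example: a filter generated by libpcap can return only two values, 0 or 0x40000) or, as in the case of seccomp, not applicable at all.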


Programming BPF


Let us get acquainted with the structure of BPF instructions; it is very simple:


   16    8    8     32
| code | jt | jf |  k  |

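Each instruction takes 64 bits: the first 16 bits are the instruction code, followed by two eight-bit jump offsets, jt and jf, and 32 bits for the argument K, whose meaning depends on the instruction. For example, the ret instruction, which terminates the program, has code 6, and its return value is taken from the constant K. In C, a single BPF instruction is represented by struct sock_filter, and a whole program by struct sock_fprog: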


struct sock_filter {
        __u16   code;
        __u8    jt;
        __u8    jf;
        __u32   k;
};


struct sock_fprog {
        unsigned short len;
        struct sock_filter *filter;
};

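So we can already write programs (for example, we know the instruction codes from [1]). This is what the ip6 filter looks like: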


struct sock_filter code[] = {
        { 0x28, 0, 0, 0x0000000c },
        { 0x15, 0, 1, 0x000086dd },
        { 0x06, 0, 0, 0x00040000 },
        { 0x06, 0, 0, 0x00000000 },
};
struct sock_fprog prog = {
        .len = ARRAY_SIZE(code),
        .filter = code,
};

The program prog can then be used in the call


setsockopt(sk, SOL_SOCKET, SO_ATTACH_FILTER, &prog, sizeof(prog))

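Writing programs as raw machine code is not very convenient, but sometimes it is necessary (for example, for debugging, writing unit tests, and so on). For convenience, the file <linux/filter.h> defines helper macros; the same example as above could be rewritten as: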


struct sock_filter code[] = {
        BPF_STMT(BPF_LD|BPF_H|BPF_ABS, 12),
        BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, ETH_P_IPV6, 0, 1),
        BPF_STMT(BPF_RET|BPF_K, 0x00040000),
        BPF_STMT(BPF_RET|BPF_K, 0),
};
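To convince ourselves that the macros and the raw numbers describe the same program, the two encodings can be compared field by field. The sketch below assumes the Linux UAPI headers <linux/filter.h> and <linux/if_ether.h> are installed; the array names and the function ip6_encodings_match are invented for this illustration.

```c
#include <stddef.h>
#include <linux/filter.h>    /* struct sock_filter, BPF_STMT, BPF_JUMP */
#include <linux/if_ether.h>  /* ETH_P_IPV6 */

/* The ip6 filter written as raw numbers. */
const struct sock_filter ip6_raw[] = {
    { 0x28, 0, 0, 0x0000000c },
    { 0x15, 0, 1, 0x000086dd },
    { 0x06, 0, 0, 0x00040000 },
    { 0x06, 0, 0, 0x00000000 },
};

/* The same filter written with the helper macros. */
const struct sock_filter ip6_macros[] = {
    BPF_STMT(BPF_LD | BPF_H | BPF_ABS, 12),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, ETH_P_IPV6, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, 0x00040000),
    BPF_STMT(BPF_RET | BPF_K, 0),
};

/* Returns 1 if both arrays hold identical instructions, 0 otherwise. */
int ip6_encodings_match(void)
{
    size_t n = sizeof(ip6_raw) / sizeof(ip6_raw[0]);

    if (n != sizeof(ip6_macros) / sizeof(ip6_macros[0]))
        return 0;
    for (size_t i = 0; i < n; i++) {
        if (ip6_raw[i].code != ip6_macros[i].code ||
            ip6_raw[i].jt != ip6_macros[i].jt ||
            ip6_raw[i].jf != ip6_macros[i].jf ||
            ip6_raw[i].k != ip6_macros[i].k)
            return 0;
    }
    return 1;
}
```

For instance, BPF_LD | BPF_H | BPF_ABS is 0x00 | 0x08 | 0x20, which gives the 0x28 seen in the raw listing.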

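However, this approach is not very convenient either. The Linux kernel developers reached the same conclusion, which is why the tools/bpf directory of the kernel tree contains an assembler and a debugger for classic BPF.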


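The assembly language is very similar to the debug output of tcpdump, but in addition it lets us use symbolic labels. For example, here is a program that drops all packets except TCP/IPv4: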


$ cat /tmp/tcp-over-ipv4.bpf
ldh [12]
jne #0x800, drop
ldb [23]
jneq #6, drop
ret #-1
drop: ret #0

By default the assembler generates code in the format <number of instructions>,<code1> <jt1> <jf1> <k1>,...; for our TCP example we get:


$ tools/bpf/bpf_asm /tmp/tcp-over-ipv4.bpf
6,40 0 0 12,21 0 3 2048,48 0 0 23,21 0 1 6,6 0 0 4294967295,6 0 0 0,
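The text format above is easy to turn back into an array of instructions. The following parser is a minimal sketch written for this article, not part of tools/bpf: struct insn mirrors the layout of struct sock_filter, and the name parse_bpf_text is hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Same layout as struct sock_filter: code, jt, jf, k. */
struct insn {
    uint16_t code;
    uint8_t  jt;
    uint8_t  jf;
    uint32_t k;
};

/* Parse "<len>,<code> <jt> <jf> <k>,..." into out[].
 * Returns the number of instructions, or -1 on malformed input. */
int parse_bpf_text(const char *s, struct insn *out, int max)
{
    int n = 0, used = 0;

    if (sscanf(s, "%d%n", &n, &used) != 1 || n < 0 || n > max)
        return -1;
    s += used;

    for (int i = 0; i < n; i++) {
        unsigned int code, jt, jf, k;

        if (*s++ != ',')            /* each instruction follows a comma */
            return -1;
        if (sscanf(s, "%u %u %u %u%n", &code, &jt, &jf, &k, &used) != 4)
            return -1;
        if (code > 0xffff || jt > 0xff || jf > 0xff)
            return -1;
        out[i].code = (uint16_t)code;
        out[i].jt = (uint8_t)jt;
        out[i].jf = (uint8_t)jf;
        out[i].k = k;
        s += used;
    }
    return n;
}
```

Feeding it the line printed by bpf_asm above yields six instructions, the first being code 40 (0x28, ldh) with k = 12.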

For the convenience of C programmers, a different output format is available:


$ tools/bpf/bpf_asm -c /tmp/tcp-over-ipv4.bpf
{ 0x28,  0,  0, 0x0000000c },
{ 0x15,  0,  3, 0x00000800 },
{ 0x30,  0,  0, 0x00000017 },
{ 0x15,  0,  1, 0x00000006 },
{ 0x06,  0,  0, 0xffffffff },
{ 0x06,  0,  0, 0000000000 },

This text can be copied into a definition of an array of struct sock_filter, as we did at the beginning of this section.


Linux extensions and netsniff-ng


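Besides standard BPF, Linux and tools/bpf/bpf_asm support a non-standard set of extensions. Most of these instructions give access to fields of struct sk_buff, the structure that describes a network packet in the kernel. There are also other kinds of helper instructions; for example, ldw cpu loads into register A the result of the kernel function raw_smp_processor_id(). (In the new version of BPF these non-standard extensions were generalized into kernel helpers, which give programs access to memory, structures, and event generation.) Here is an interesting example of a filter that copies only the packet headers to user space, using the poff (payload offset) extension: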


ld poff
ret a

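BPF extensions cannot be used in tcpdump, but this is a good reason to get acquainted with the netsniff-ng toolkit, which, among other things, contains the advanced netsniff-ng program for filtering with BPF, an efficient traffic generator, and bpfc, a BPF assembler more advanced than tools/bpf/bpf_asm. The package comes with fairly detailed documentation; see also the links at the end of the article.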


seccomp


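So, we already know how to write BPF programs of arbitrary complexity and are ready to look at new examples. The first of them is seccomp, a technology that uses BPF filters to control the set of system calls, and their arguments, available to a given process and its descendants.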


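The first version of seccomp was added to the kernel in 2005 and was not very popular, since it offered a single option: to limit the set of system calls available to a process to read, write, exit and sigreturn, with any process that broke the rule being killed with SIGKILL. In 2012, however, seccomp gained the ability to use BPF filters, which make it possible to define the set of permitted system calls and even to check their arguments. (Interestingly, Chrome was one of the first users of this functionality, and people from Chrome are currently developing the KRSI mechanism, which is based on the new version of BPF and allows customizing Linux Security Modules.) Links to additional documentation can be found at the end of the article.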


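Note that Habr already has articles about using seccomp, and some readers may want to read them before (or instead of) the following subsections. The article "Containers and security: seccomp" gives examples of using seccomp, both the 2007 version and the BPF-based version (the filters are generated with libseccomp), describes how seccomp relates to Docker, and provides many useful links. The article about isolating daemons with systemd ("you don't need Docker for this!") explains, in particular, how to add black or white lists of system calls for daemons managed by systemd.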


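Next we will see how to write and load seccomp filters in plain C and with the libseccomp library, what the pros and cons of each option are, and, finally, how seccomp is used by strace.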


Writing and loading seccomp filters


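We already know how to write BPF programs, so let us first look at the seccomp programming interface. A filter is installed at the process level, and all child processes inherit the restrictions. This is done with the seccomp(2) system call: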


seccomp(SECCOMP_SET_MODE_FILTER, flags, &filter)

Here &filter is a pointer to the struct sock_fprog structure we already know, that is, a BPF program.


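How do programs for seccomp differ from programs for sockets? In the context passed to them. For sockets we were given a memory region containing the packet, whereas for seccomp we are given a structure of the form: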


struct seccomp_data {
    int   nr;
    __u32 arch;
    __u64 instruction_pointer;
    __u64 args[6];
};

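Here nr is the number of the system call being made, arch is the current architecture (more on this below), args holds up to six system call arguments, and instruction_pointer is a pointer to the user-space instruction that made the system call. So, for example, to load the system call number into register A we write: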


ldw [0]

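seccomp programs have other peculiarities as well: for example, the context can only be accessed at 32-bit aligned offsets, and half-words or single bytes cannot be loaded; an attempt to load a filter containing ldh [0] makes the seccomp system call fail with EINVAL. The kernel function seccomp_check_filter() checks the filters being loaded. (Curiously, this function does not allow the mod instruction (remainder of division), so it is unavailable to seccomp BPF programs, and adding it now would break the ABI.)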


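Basically, we already know everything we need to write and read seccomp programs. Usually the program logic is arranged as a whitelist or a blacklist of system calls, for example: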


ld [0]
jeq #304, bad
jeq #176, bad
jeq #239, bad
jeq #279, bad
good: ret #0x7fff0000 /* SECCOMP_RET_ALLOW */
bad: ret #0

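This program checks a blacklist of four system calls numbered 304, 176, 239 and 279. Which system calls are these? We cannot say for sure, since we do not know which architecture the program was written for. For this reason, the authors of seccomp suggest starting every program with an architecture check (the current architecture is given in the context as the arch field of struct seccomp_data). With the architecture check, the beginning of the example would look like this: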


ld [4]
jne #0xc000003e, bad_arch ; SCMP_ARCH_X86_64

After this check, our system call numbers have a definite meaning.
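The addressing rules and the blacklist logic can be illustrated in user space. The sketch below does not install any filter; it only mimics what the BPF program computes. struct seccomp_ctx copies the field layout of struct seccomp_data shown above, and ctx_load, blacklist_filter, and the choice to return 0 on an unexpected architecture (the source does not show what bad_arch does) are assumptions of this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same field layout as struct seccomp_data in the text. */
struct seccomp_ctx {
    int      nr;
    uint32_t arch;
    uint64_t instruction_pointer;
    uint64_t args[6];
};

/* Mimics "ld [off]": only 32-bit loads at 4-byte aligned offsets
 * inside the structure are allowed. Returns 0 on success, -1 otherwise. */
int ctx_load(const struct seccomp_ctx *c, size_t off, uint32_t *a)
{
    if (off % 4 != 0 || off + 4 > sizeof(*c))
        return -1;
    memcpy(a, (const unsigned char *)c + off, 4);
    return 0;
}

/* Architecture check followed by the four-entry blacklist. */
uint32_t blacklist_filter(const struct seccomp_ctx *c)
{
    uint32_t a;

    /* ld [4]; jne #0xc000003e, bad_arch */
    if (ctx_load(c, offsetof(struct seccomp_ctx, arch), &a) != 0 ||
        a != 0xc000003eu)
        return 0;                    /* assumed: bad_arch kills the process */

    /* ld [0]; jeq #304, bad; jeq #176, bad; ... */
    if (ctx_load(c, offsetof(struct seccomp_ctx, nr), &a) != 0)
        return 0;
    if (a == 304 || a == 176 || a == 239 || a == 279)
        return 0;                    /* bad: ret #0 */
    return 0x7fff0000u;              /* good: SECCOMP_RET_ALLOW */
}
```

The offsets 0 and 4 used in ld [0] and ld [4] are simply the positions of nr and arch inside the structure.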


Writing and loading seccomp filters with libseccomp


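Writing filters in machine code or in BPF assembly gives full control over the result, but it is sometimes preferable to have portable and/or readable code. The libseccomp library helps with this by providing a standard interface for writing black or white lists.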


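For example, let us write a program that runs a binary of the user's choice after installing a blacklist of system calls taken from the Habr article mentioned above (the program has been simplified for readability):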


#include <seccomp.h>
#include <unistd.h>
#include <err.h>

static int sys_numbers[] = {
        __NR_mount,
        __NR_umount2,
        // ... about 40 more system calls ...
        __NR_vmsplice,
        __NR_perf_event_open,
};

int main(int argc, char **argv)
{
        scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);

        for (size_t i = 0; i < sizeof(sys_numbers)/sizeof(sys_numbers[0]); i++)
                seccomp_rule_add(ctx, SCMP_ACT_TRAP, sys_numbers[i], 0);

        seccomp_load(ctx);

        execvp(argv[1], &argv[1]);
        err(1, "execvp: %s", argv[1]);
}

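First we define the array sys_numbers with 40+ system call numbers to block. Then we initialize the context ctx and tell the library what we want to allow (SCMP_ACT_ALLOW): all system calls by default (blacklists are easier to build). Then, one by one, we add all the system calls from the blacklist. As the reaction to a blacklisted system call we request SCMP_ACT_TRAP, in which case seccomp sends the process a SIGSYS signal describing which system call broke the rules. Finally, we load the program into the kernel with seccomp_load, which compiles the program and attaches it to the process using the seccomp(2) system call.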


To build successfully, the program must be linked with the libseccomp library, for example:


cc -std=c17 -Wall -Wextra -c -o seccomp_lib.o seccomp_lib.c
cc -o seccomp_lib seccomp_lib.o -lseccomp

An example of a successful run:


$ ./seccomp_lib echo ok
ok

An example of a blocked system call:


$ sudo ./seccomp_lib mount -t bpf bpf /tmp
Bad system call

We use strace to see the details:


$ sudo strace -e seccomp ./seccomp_lib mount -t bpf bpf /tmp
seccomp(SECCOMP_SET_MODE_FILTER, 0, {len=50, filter=0x55d8e78428e0}) = 0
--- SIGSYS {si_signo=SIGSYS, si_code=SYS_SECCOMP, si_call_addr=0xboobdeadbeef, si_syscall=__NR_mount, si_arch=AUDIT_ARCH_X86_64} ---
+++ killed by SIGSYS (core dumped) +++
Bad system call

This shows that the program was terminated for making the forbidden system call mount(2).


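So, we wrote a filter with libseccomp, fitting non-trivial code into four lines. In the example above, with its large number of system calls, the checking time can grow noticeably, since the check is just a list of comparisons. To optimize this, libseccomp recently gained a patch that adds support for the filter attribute SCMP_FLTATR_CTL_OPTIMIZE. Setting this attribute to 2 converts the filter into a binary search program.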


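To see how filters with binary search are arranged, take a look at a simple script that generates such programs in BPF assembly from a list of system call numbers, for example: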


$ echo 1 3 6 8 13 | ./generate_bin_search_bpf.py
ld [0]
jeq #6, bad
jgt #6, check8
jeq #1, bad
jeq #3, bad
ret #0x7fff0000
check8:
jeq #8, bad
jeq #13, bad
ret #0x7fff0000
bad: ret #0

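Nothing significantly faster can be written, since BPF programs cannot make indirect jumps (we cannot write, for example, jmp A or jmp [label+X]), so all jump targets are static.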


seccomp and strace


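Everyone knows that strace is an indispensable tool for studying the behavior of processes on Linux. However, many have also heard about the performance problems that come with it. The point is that strace is implemented with ptrace(2), and this mechanism does not let us specify which system calls should stop the process. Compare, for example, the following two runs: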


$ time strace du /usr/share/ >/dev/null 2>&1

real    0m3.081s
user    0m0.531s
sys     0m2.073s


$ time strace -e open du /usr/share/ >/dev/null 2>&1

real    0m2.404s
user    0m0.193s
sys     0m1.800s

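Both runs take roughly the same time, even though in the second case we want to trace only a single system call.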


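The new option --seccomp-bpf, added in strace 5.3, speeds the process up many times over, and the run time when tracing a single system call becomes comparable to that of a normal run: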


$ time strace --seccomp-bpf -e open du /usr/share/ >/dev/null 2>&1

real    0m0.148s
user    0m0.017s
sys     0m0.131s

$ time du /usr/share/ >/dev/null 2>&1

real    0m0.140s
user    0m0.024s
sys     0m0.116s

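(There is a small cheat here, of course: we are not tracing the main system call of this command. If we traced, for example, newfsstat, strace would be just as slow as without --seccomp-bpf.)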


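How does this option work? Without it, strace attaches to the process and starts it with PTRACE_SYSCALL. When the traced process makes (any) system call, control passes to strace, which looks at the system call arguments and resumes the process with PTRACE_SYSCALL. Some time later the process completes the system call, and on exit from it control again passes to strace, which looks at the return values and resumes the process with PTRACE_SYSCALL, and so on.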



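With seccomp, however, this process can be optimized exactly the way we would like. Namely, if we want to look only at system call X, we can write a BPF filter that returns SECCOMP_RET_TRACE for X and SECCOMP_RET_ALLOW for the system calls we are not interested in: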


ld [0]
jneq #X, ignore
trace: ret #0x7ff00000
ignore: ret #0x7fff0000

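In this case strace initially starts the process with PTRACE_CONT. Our filter runs for every system call: if the system call is not X, the process simply keeps running, and if it is X, seccomp passes control to strace, which looks at the arguments and resumes the process with PTRACE_SYSCALL (since seccomp has no way to run a program on exit from a system call). When the system call returns, strace restarts the process with PTRACE_CONT and waits for new messages from seccomp.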



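The --seccomp-bpf option has two limitations. First, it cannot be used to attach to an already running process (the -p option of strace), since seccomp does not support this. Second, it is not possible to avoid following child processes, since seccomp filters are inherited by all child processes and this cannot be turned off.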


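A little more detail on how exactly strace works with seccomp can be found in a recent talk. For us, the most interesting fact is that classic BPF, in the form of seccomp, is still finding new applications today.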


xt_bpf


Let us now go back to the world of networking.


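Background: a long time ago, in 2007, the xt_u32 module for netfilter was added to the kernel. It was written by analogy with the even older traffic classifier cls_u32 and allows writing arbitrary binary rules for iptables using the following simple operations: load 32 bits from the packet and perform a set of arithmetic operations on them. For example: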


sudo iptables -A INPUT -m u32 --u32 "6&0xFF=1" -j LOG --log-prefix "seen-by-xt_u32"

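This rule loads 32 bits of the IP header starting at offset 6 and applies the mask 0xFF to them (takes the low byte). That byte is the protocol field of the IP header, and we compare it with 1 (ICMP). Several checks can be combined in one rule, and there is also the @ operator, which moves the current position X bytes to the right. For example: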


iptables -m u32 --u32 "6&0xFF=0x6 && 0>>22&0x3C@4=0x29"

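This rule checks whether the TCP Sequence Number equals 0x29. I will not go into further detail, since it is already clear that writing such rules by hand is not very convenient. The article BPF - the forgotten bytecode contains several links with examples of using and generating rules for xt_u32. See also the links at the end of this article.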
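The arithmetic behind the two u32 expressions can be spelled out in C. This is only an illustration of what the expressions compute on a buffer that starts at the IPv4 header; it is not how xt_u32 is implemented, and the function names are invented for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Big-endian 32-bit load at the given offset. */
uint32_t be32(const uint8_t *p, size_t off)
{
    return ((uint32_t)p[off] << 24) | ((uint32_t)p[off + 1] << 16) |
           ((uint32_t)p[off + 2] << 8) | (uint32_t)p[off + 3];
}

/* "6&0xFF=1": bytes 6..9 of the IP header, low byte is Protocol. */
int u32_is_icmp(const uint8_t *ip, size_t len)
{
    if (len < 10)
        return 0;
    return (be32(ip, 6) & 0xFF) == 1;
}

/* "6&0xFF=0x6 && 0>>22&0x3C@4=0x29" */
int u32_tcp_seq_is_0x29(const uint8_t *ip, size_t len)
{
    size_t hdr;

    if (len < 20)
        return 0;
    if ((be32(ip, 6) & 0xFF) != 0x6)        /* Protocol is TCP           */
        return 0;
    hdr = (be32(ip, 0) >> 22) & 0x3C;       /* IHL * 4, header length    */
    if (hdr + 8 > len)
        return 0;
    return be32(ip, hdr + 4) == 0x29;       /* "@4": TCP Sequence Number */
}
```

Shifting the first word right by 22 bits and masking with 0x3C leaves the four IHL bits already multiplied by 4, which is the same trick as ldxb 4*([14]&0xf) in BPF.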


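Since 2013, the BPF-based module xt_bpf can be used instead of xt_u32. Anyone who has read this far should already see how it works: BPF bytecode is run as an iptables rule. A new rule can be created, for example, like this: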


iptables -A INPUT -m bpf --bytecode <bytecode> -j LOG

where <bytecode> is the code in the default output format of the bpf_asm assembler, for example:


$ cat /tmp/test.bpf
ldb [9]
jneq #17, ignore
ret #1
ignore: ret #0

$ bpf_asm /tmp/test.bpf
4,48 0 0 9,21 0 1 17,6 0 0 1,6 0 0 0,

# iptables -A INPUT -m bpf --bytecode "$(bpf_asm /tmp/test.bpf)" -j LOG

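In this example we match all UDP packets. The context for a BPF program in the xt_bpf module points, naturally, to the packet data, which in the case of iptables means the start of the IPv4 header. The return value of the BPF program is boolean, where false means that the packet did not match.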


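Clearly, the xt_bpf module supports more complex filters than the example above. Let us look at real examples from Cloudflare, who until recently used xt_bpf to protect against DDoS attacks. In the article Introducing the BPF Tools they explain how (and why) they generate BPF filters and publish links to a set of utilities for creating such filters. For example, the bpfgen utility can create a BPF program that matches a DNS query for the name habr.com: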


$ ./bpfgen --assembly dns -- habr.com
ldx 4*([0]&0xf)
ld #20
add x
tax

lb_0:
    ld [x + 0]
    jneq #0x04686162, lb_1
    ld [x + 4]
    jneq #0x7203636f, lb_1
    ldh [x + 8]
    jneq #0x6d00, lb_1
    ret #65535

lb_1:
    ret #0

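The program first loads into register X the address of the start of the string \x04habr\x03com\x00 inside the UDP datagram and then checks the query: 0x04686162 <-> "\x04hab", and so on.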
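The constants in the generated program are just the DNS wire encoding of the name, read as big-endian words. The following sketch shows the encoding step; it is written for this article and is not taken from bpfgen, and the function name dns_encode is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encode a dotted name such as "habr.com" as length-prefixed DNS labels
 * terminated by a zero byte. Returns the encoded length, or 0 on error. */
size_t dns_encode(const char *name, uint8_t *out, size_t cap)
{
    size_t n = 0;

    while (*name) {
        const char *dot = strchr(name, '.');
        size_t len = dot ? (size_t)(dot - name) : strlen(name);

        /* labels are 1..63 bytes; keep room for the final zero byte */
        if (len == 0 || len > 63 || n + len + 2 > cap)
            return 0;
        out[n++] = (uint8_t)len;
        memcpy(out + n, name, len);
        n += len;
        name += len;
        if (*name == '.')
            name++;
    }
    if (n + 1 > cap)
        return 0;
    out[n++] = 0;
    return n;
}
```

For "habr.com" the result is the ten bytes 04 68 61 62 72 03 63 6f 6d 00, which the BPF program compares as the words 0x04686162 and 0x7203636f followed by the half-word 0x6d00.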


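A little later Cloudflare published the code of the p0f -> BPF compiler. In the article Introducing the p0f BPF compiler they explain what p0f is and how to convert p0f signatures to BPF: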


$ ./bpfgen p0f -- 4:64:0:0:*,0::ack+:0
39,0 0 0 0,48 0 0 8,37 35 0 64,37 0 34 29,48 0 0 0,
84 0 0 15,21 0 31 5,48 0 0 9,21 0 29 6,40 0 0 6,
...

Cloudflare no longer uses xt_bpf, having moved to XDP, one of the ways of using the new version of BPF; see L4Drop: XDP DDoS Mitigations.


cls_bpf


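The last example of using classic BPF in the kernel is the cls_bpf classifier for the Linux traffic control subsystem, added to Linux in 2013 and conceptually replacing the ancient cls_u32.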


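We will not, however, describe how cls_bpf works right now, since from the point of view of classic BPF it would teach us nothing new: we are already familiar with all the functionality. In addition, we will meet this classifier more than once in later articles about Extended BPF.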


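Another reason not to discuss classic BPF with cls_bpf is that, compared to Extended BPF, its range of applicability here is radically narrower: classic programs cannot change the contents of packets and cannot keep state between invocations.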


So it is time to say goodbye to classic BPF and look to the future.


Farewell to classic BPF


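We have seen how BPF, developed in the early nineties, lived successfully for a quarter of a century and kept finding new applications until the very end. However, just as the move from stack machines to RISC prompted the development of classic BPF, the move from 32-bit to 64-bit machines in the 2000s meant that classic BPF began to become obsolete. In addition, the capabilities of classic BPF are very limited: besides the outdated architecture, there is no way to keep state between invocations of BPF programs, no direct interaction with the user, and no interaction with the kernel apart from reading a limited number of sk_buff fields and calling the simplest helper functions.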


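In fact, all that remains of classic BPF in Linux today is the API; inside the kernel all classic programs, be they socket filters or seccomp filters, are automatically translated into the new format, Extended BPF. (We will describe exactly how this happens in the next article.)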


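The transition to the new architecture began in 2013, when Alexei Starovoitov proposed a scheme for updating BPF. In 2014 the corresponding patches began to appear in the kernel. As far as I understand, the original plan was only to optimize the architecture and the JIT compiler for more efficient operation on 64-bit machines, but in the end these optimizations marked the beginning of a new chapter in Linux development.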


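The following articles in this series will cover the architecture and applications of the new technology, originally known as internal BPF, then as extended BPF, and now simply as BPF.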


References
  1. Steven McCanne and Van Jacobson, "The BSD Packet Filter: A New Architecture for User-level Packet Capture", https://www.tcpdump.org/papers/bpf-usenix93.pdf
  2. Steven McCanne, "libpcap: An Architecture and Optimization Methodology for Packet Capture", https://sharkfestus.wireshark.org/sharkfest.11/presentations/McCanne-Sharkfest'11_Keynote_Address.pdf
  3. tcpdump, libpcap: https://www.tcpdump.org/
  4. IPtable U32 Match Tutorial.
  5. BPF — the forgotten bytecode: https://blog.cloudflare.com/bpf-the-forgotten-bytecode/
  6. Introducing the BPF Tool: https://blog.cloudflare.com/introducing-the-bpf-tools/
  7. bpf_cls: http://man7.org/linux/man-pages/man8/tc-bpf.8.html
  8. A seccomp overview: https://lwn.net/Articles/656307/
  9. https://github.com/torvalds/linux/blob/master/Documentation/userspace-api/seccomp_filter.rst
  10. habr: Containers and security: seccomp
  11. habr: Isolating daemons with systemd, or "you don't need Docker for this!"
  12. Paul Chaignon, "strace --seccomp-bpf: a look under the hood", https://fosdem.org/2020/schedule/event/debugging_strace_bpf/
  13. netsniff-ng: http://netsniff-ng.org/
