eBPF

eBPF basic

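The eBPF virtual machine has ten general-purpose registers, R0 to R9. R0 holds the function return value, and R1 to R5 carry function arguments.
R10 is read-only: it is the stack frame pointer, used to access the stack.
The context argument to an eBPF program is loaded into R1.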
There is no access to heap space and data must instead be written to maps.
There are only 512 bytes of stack (or 256 bytes if we are using tail calls).
Fentry and fexit attachment points were designed to be more efficient than kprobes, but there’s another advantage when you want to generate an event at the end of a function: the fexit hook has access to the function's input parameters, which a kretprobe does not.
struct bpf_insn represents a single BPF instruction.

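Every instruction uses a fixed-length 64-bit encoding. The 8-bit opcode is subdivided into three parts, and the split depends on the instruction class. For arithmetic, jump, and byte-swap instructions, the opcode consists of a 4-bit operation code (high bits), a 1-bit source flag, and a 3-bit instruction class (low bits).

The source flag indicates whether the source operand is an immediate (0, BPF_K) or a register (1, BPF_X).

For load and store instructions, the opcode consists of a 3-bit mode (high bits), a 2-bit size, and a 3-bit instruction class (low bits). The complete instruction is described by struct bpf_insn: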
```
struct bpf_insn {
    __u8  code;         /* opcode */
    __u8  dst_reg:4;    /* dest register */
    __u8  src_reg:4;    /* source register */
    __s16 off;          /* signed offset */
    __s32 imm;          /* signed immediate constant */
};
```
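
As a rough illustration of how the fields of struct bpf_insn map onto the 8 bytes of an instruction, the sketch below packs the fields by hand. The helper name bpf_encode is made up for these notes, and it assumes the little-endian layout (dst_reg in the low nibble of the second byte); it is not how the kernel or libbpf builds instructions.

```c
#include <stdint.h>

/* Hypothetical helper: pack one BPF instruction into its 8-byte form.
 * Field order follows struct bpf_insn: code, dst/src nibbles, off, imm.
 * Assumes the little-endian encoding. */
static void bpf_encode(uint8_t out[8], uint8_t code, uint8_t dst_reg,
                       uint8_t src_reg, int16_t off, int32_t imm)
{
    uint16_t uoff = (uint16_t)off;
    uint32_t uimm = (uint32_t)imm;

    out[0] = code;                                         /* opcode */
    out[1] = (uint8_t)(((src_reg & 0x0f) << 4) | (dst_reg & 0x0f));
    out[2] = (uint8_t)(uoff & 0xff);                       /* off, low byte */
    out[3] = (uint8_t)(uoff >> 8);                         /* off, high byte */
    out[4] = (uint8_t)(uimm & 0xff);                       /* imm, byte 0 */
    out[5] = (uint8_t)((uimm >> 8) & 0xff);
    out[6] = (uint8_t)((uimm >> 16) & 0xff);
    out[7] = (uint8_t)(uimm >> 24);                        /* imm, byte 3 */
}
```

Calling bpf_encode(buf, 0xb7, 2, 0, 0, 15) fills buf with the bytes b7 02 00 00 0f 00 00 00.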
BPF instructions fall into a few broad categories:
- Loading a value into a register (either an immediate value or a value read from memory or from another register)
- Storing a value from a register into memory
- Performing arithmetic operations such as adding a value to the contents of a register
- Jumping to a different instruction if a particular condition is satisfied
Here is how one instruction decodes:
```
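b7 02 00 00 0f 00 00 00 r2 = 15

0xb7 is the opcode. Its low 3 bits are 0x07, the BPF_ALU64 instruction class. Its high 4 bits are 0xb, i.e. BPF_MOV, which moves src into dst. Bit 3 (the source flag) is 0, so the source operand is the 32-bit immediate.

0x02 holds the registers: the low 4 bits give dst = R2, and the high 4 bits give src = 0 (unused here)

00 00 is the 16-bit offset (unused here)

0f 00 00 00 is the 32-bit immediate, stored little-endian: 0x0000000f = 15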
```
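
The decoding steps above can be sketched in C. This is a minimal user-space illustration: struct decoded_insn and bpf_decode are hypothetical names, the opcode split shown applies only to the arithmetic and jump classes, and the byte order assumed is little-endian.

```c
#include <stdint.h>

/* Hypothetical decoded view of one 8-byte BPF instruction. */
struct decoded_insn {
    uint8_t cls;      /* opcode bits 0-2: instruction class */
    uint8_t source;   /* opcode bit 3: 0 = immediate, 1 = register */
    uint8_t op;       /* opcode bits 4-7, left in place (e.g. 0xb0 for MOV) */
    uint8_t dst_reg;  /* low nibble of byte 1 */
    uint8_t src_reg;  /* high nibble of byte 1 */
    int16_t off;      /* bytes 2-3, little-endian */
    int32_t imm;      /* bytes 4-7, little-endian */
};

static struct decoded_insn bpf_decode(const uint8_t b[8])
{
    struct decoded_insn d;
    uint32_t uimm;

    d.cls     = b[0] & 0x07;
    d.source  = (b[0] >> 3) & 0x01;
    d.op      = b[0] & 0xf0;
    d.dst_reg = b[1] & 0x0f;
    d.src_reg = b[1] >> 4;
    d.off     = (int16_t)(uint16_t)(b[2] | (b[3] << 8));

    uimm = (uint32_t)b[4] | ((uint32_t)b[5] << 8) |
           ((uint32_t)b[6] << 16) | ((uint32_t)b[7] << 24);
    d.imm = (int32_t)uimm;
    return d;
}
```

For the bytes b7 02 00 00 0f 00 00 00 this gives cls = 0x07, source = 0, op = 0xb0, dst_reg = 2 and imm = 15.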
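How does a 64-bit instruction carry a 64-bit immediate?

**imm64 = (next_imm << 32) | imm.** An instruction with a 64-bit immediate is encoded as two consecutive 64-bit slots. The first slot is a normal instruction whose 32-bit imm field holds the low 32 bits. The slot that immediately follows is a pseudo-instruction: its opcode, src_reg, dst_reg, and off fields are all 0, and only its imm field is used, holding the high 32 bits. Combining the two yields the final 64-bit immediate.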
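
A small sketch of that combination, assuming the two imm fields have already been read out of the two slots. The function name bpf_imm64 is hypothetical. The casts through uint32_t matter: imm is a signed 32-bit field, and without them a negative low half would sign-extend and overwrite the high half.

```c
#include <stdint.h>

/* Hypothetical helper: build the 64-bit immediate of a two-slot instruction.
 * imm      - imm field of the first slot (low 32 bits)
 * next_imm - imm field of the following pseudo-instruction (high 32 bits) */
static uint64_t bpf_imm64(int32_t imm, int32_t next_imm)
{
    /* Go through uint32_t first so neither half is sign-extended. */
    return ((uint64_t)(uint32_t)next_imm << 32) | (uint64_t)(uint32_t)imm;
}
```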
BPF ring buffer

The BPF ring buffer solves the memory efficiency and event re-ordering problems of the BPF perf buffer. It provides a perfbuf-compatible API for easy migration, and also a new reserve/commit API with better usability. Both synthetic and real-world benchmarks show that in almost all cases the ring buffer outperforms the perf buffer, so consider making it the default choice for sending data from a BPF program to user space.

It is a multi-producer, single-consumer (MPSC) queue and can be safely shared across multiple CPUs simultaneously.

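The ring buffer overcomes perfbuf's two main drawbacks, memory overhead and event re-ordering, so it is the recommended replacement for perfbuf. At its core, perfbuf is a set of per-CPU circular buffers. The ring buffer also supports variable-length records, efficient data exchange between kernel and user space, and both epoll and busy-loop notification.

Because perfbuf allocates one circular buffer per CPU, total buffer memory grows with the CPU count and much of it sits idle. The ring buffer is a single large buffer shared globally, so its size is independent of the number of cores.

With perfbuf, an event must first be copied into a per-CPU array (the BPF stack is small, so larger events cannot be built on the stack) and then copied again into perfbuf. If perfbuf has no room and that second copy fails, the first copy was wasted work. The ring buffer avoids this with a two-phase reserve/commit scheme: reserve first to guarantee space, then write the data directly into the ring buffer.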

```
/* Define the ring buffer */
struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 256 * 1024);	/* buffer size in bytes */
} rb SEC(".maps");
```

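```
/* Use the ring buffer (inside a BPF program) */
struct event *e;
/* reserve sample from BPF ringbuf */
e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
if (!e)
	return 0;	/* no space left: drop this event */
/* ... fill in the fields of *e here ... */
/* successfully submit it to user-space for post-processing */
bpf_ringbuf_submit(e, 0);
```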

```
/* User space: set up ring buffer polling */
rb = ring_buffer__new(bpf_map__fd(skel->maps.rb), handle_event, NULL, NULL);
if (!rb) {
	err = -1;
	fprintf(stderr, "Failed to create ring buffer\n");
	goto cleanup;
}
```
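
The handle_event callback passed to ring_buffer__new is not shown in these notes. A minimal sketch of what it might look like follows. The struct event layout and the use of ctx as an optional counter are assumptions made for illustration; the real layout must match whatever the BPF side writes into the reserved sample. The three-argument shape follows libbpf's ring buffer sample callback.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical sample layout; must mirror the struct used on the BPF side. */
struct event {
	int pid;
	char comm[16];
};

/* Called once per sample consumed from the ring buffer.
 * ctx is the pointer given to ring_buffer__new (NULL in the notes above);
 * here it is treated as an optional counter of handled events. */
static int handle_event(void *ctx, void *data, size_t data_sz)
{
	const struct event *e = data;
	unsigned long *handled = ctx;

	if (data_sz < sizeof(*e))
		return 0;	/* sample too short for struct event: skip it */

	/* comm may lack a trailing NUL, so bound the print by the array size */
	printf("pid=%d comm=%.*s\n", e->pid, (int)sizeof(e->comm), e->comm);

	if (handled)
		(*handled)++;
	return 0;
}
```

As far as I know, a negative return value from the callback makes libbpf stop consuming and return that error from ring_buffer__poll, so returning 0 keeps the polling loop going.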

```
/* Poll the ring buffer for data */
while (!exiting) {
	err = ring_buffer__poll(rb, 100 /* timeout, ms */);
	/* Ctrl-C will cause -EINTR */
	if (err == -EINTR) {
		err = 0;
		break;
	}
	if (err < 0) {
		printf("Error polling ring buffer: %d\n", err);
		break;
	}
}
```

```
/* Clean up */
ring_buffer__free(rb);
```

BPF links provide a layer of abstraction between an eBPF program and the event it’s attached to.
BPF spin locks cannot be used in tracing program types.
BPF Prog Type
- BPF_PROG_TYPE_KPROBE
BPF Map
- BPF_MAP_TYPE_HASH
- BPF_MAP_TYPE_PERF_EVENT_ARRAY