A bit about relocations in the Linux kernel

We will solve a simple problem: allocate a block of memory in Linux kernel space, put some binary code into it, and execute it. To do this, we write a kernel module that defines a function foo, which plays the role of the binary code we need. Then we allocate a memory block with module_alloc, copy the whole function into it with memcpy, and transfer control to it.

Here's what it looks like:

static noinline int foo(int ret)
{
	return (ret + 2);
}

static int exe_init(void)
{
	int ret = 0;
	int (*new_foo)(int);

	ret = foo(0);
	printk(KERN_INFO "ret=%d\n", ret);

	new_foo = module_alloc(PAGE_SIZE);
	if (!new_foo)
		return -ENOMEM;	/* allocation can fail */
	set_memory_x((unsigned long)new_foo, 1);

	printk(KERN_INFO "foo=%lx new_foo=%lx\n",
		(unsigned long)foo, (unsigned long)new_foo);

	memcpy((void *)new_foo, (const void *)foo, PAGE_SIZE);

	ret = new_foo(1);
	printk(KERN_INFO "ret=%d\n", ret);

	vfree(new_foo);
	return 0;
}

The exe_init function is called when the module is loaded. Here is the result in the kernel log:

[ 6972.522422] ret=2
[ 6972.522443] foo=ffffffffc0000000 new_foo=ffffffffc0007000
[ 6972.522457] ret=3

Everything works correctly. Now let's add a printk call to foo to print its argument:

static noinline int foo(int ret)
{
	printk(KERN_INFO "ret=%d\n", ret);
	return (ret + 2);
}

and dump the first 25 bytes of new_foo before passing control to it:

	memcpy((void *)new_foo, (const void *)foo, PAGE_SIZE);
	dump((unsigned long)new_foo);

dump is defined as

static inline void dump(unsigned long x)
{
	int i;

	for (i = 0; i < 25; i++)
		pr_cont("%.2x ", *((unsigned char *)(x) + i) & 0xFF);
	pr_cont("\n");
}

We load the module and get a crash with the following message in the log:

[ 8482.806092] ret=0
[ 8482.806092] ret=2
[ 8482.806111] foo=ffffffffc0000000 new_foo=ffffffffc0007000
[ 8482.806113] 53 89 fe 89 fb 48 c7 c7 24 10 00 c0 e8 e8 3d 0b c1 8d 43 02 5b c3 66 2e 0f 
[ 8482.806135] invalid opcode: 0000 [#1] SMP NOPTI
[ 8482.806639] CPU: 0 PID: 5081 Comm: insmod Tainted: G           O      5.4.27 #12
[ 8482.807669] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 8482.808560] RIP: 0010:irq_create_direct_mapping+0x79/0x90

Somehow we ended up in the irq_create_direct_mapping function, although we expected to call printk. Let's figure out what happened.

First, look at the disassembly of the foo function, obtained with the objdump -d command:

Disassembly of section .text:

0000000000000000 <foo>:
   0:	53                   	push   %rbx
   1:	89 fe                	mov    %edi,%esi
   3:	89 fb                	mov    %edi,%ebx
   5:	48 c7 c7 00 00 00 00 	mov    $0x0,%rdi
   c:	e8 00 00 00 00       	callq  11 <foo+0x11>
  11:	8d 43 02             	lea    0x2(%rbx),%eax
  14:	5b                   	pop    %rbx
  15:	c3                   	retq   
  16:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
  1d:	00 00 00

The foo function is located at the beginning of the .text section. At offset 0xC is the opcode e8 of the near call instruction. It is "near" because it executes within the current code segment and the selector value does not change. The next 4 bytes are the offset, relative to the value in the RIP register, to which control is transferred, i.e. RIP = RIP + offset. According to the Intel documentation (Intel 64 and IA-32 Architectures Software Developer's Manual, Instruction Set Reference, A-Z):

A relative offset (rel16 or rel32) is generally specified as a label in assembly code. But at the machine code level, it is encoded as a signed, 16- or 32-bit immediate value.
This value is added to the value in the EIP (RIP) register. In 64-bit mode the relative offset is always a 32-bit immediate value which is sign extended to 64-bits before it is added to the value in the RIP register for the target calculation.

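We know the address of foo: 0xffffffffc0000000. By the time the call executes, RIP already points at the next instruction, so RIP = 0xffffffffc0000000 + 0xc + 0x5 = 0xffffffffc0000011 (0xc is the offset of the e8 instruction, plus 1 opcode byte and 4 offset bytes). We also know the offset itself, because we dumped the function body: the four bytes e8 3d 0b c1 after the opcode are the little-endian value 0xc10b3de8, which sign-extends to 0xffffffffc10b3de8. Let's calculate where the call inside foo transfers control (the carry out of 64 bits is discarded):

0xffffffffc0000011 + 0xffffffffc10b3de8 = 0xffffffff810b3df9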

This is the address of the printk function:

# cat /proc/kallsyms | grep ffffffff810b3df9  
ffffffff810b3df9 T printk

Now the same calculation for new_foo, whose address is 0xffffffffc0007000:

0xffffffffc0007011 + 0xffffffffc10b3de8 = 0xffffffff810badf9

There is no such address in kallsyms, but the RIP line of the crash log reports an offset of 0x79 into the function, and 0xffffffff810badf9 - 0x79 = 0xffffffff810bad80 is there:

# cat /proc/kallsyms | grep ffffffff810bad80
ffffffff810bad80 T irq_create_direct_mapping

This is the very function in which the crash happened.

To prevent the crash, it is enough to recalculate the offset, knowing the address of new_foo:
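The arithmetic above can be replayed in user space. The following is only a sketch: read_rel32 and call_target are hypothetical helper names introduced here for illustration, and the byte array passed to them is assumed to be the dump printed by the module.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read the little-endian rel32 operand that follows an e8 opcode. */
int32_t read_rel32(const uint8_t *code, size_t call_off)
{
	const uint8_t *p = code + call_off + 1;
	uint32_t raw;

	assert(code[call_off] == 0xe8);	/* must be a near call */
	raw = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	      ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
	return (int32_t)raw;
}

/* Target of the call: address of the next instruction plus rel32. */
uint64_t call_target(uint64_t base, size_t call_off, int32_t rel32)
{
	/* 1 opcode byte + 4 operand bytes */
	uint64_t next_insn = base + call_off + 5;

	/* sign-extend to 64 bits, then add with wrap-around */
	return next_insn + (uint64_t)(int64_t)rel32;
}
```

With the dumped bytes and call_off = 0xc, a base of 0xffffffffc0000000 gives the address of printk, while a base of 0xffffffffc0007000 gives the address inside irq_create_direct_mapping: the same rel32 lands in a different place once the code is moved.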

memcpy((void *)new_foo, (const void *)foo, PAGE_SIZE);
unsigned int delta = (unsigned long)printk - (unsigned long)new_foo - 0x11;
*(unsigned int *)((void *)new_foo + 0xD) = delta;

After this fix there is no crash: new_foo executes successfully and returns control.

The problem is solved. It only remains to understand why the offset after the e8 opcode is zero in the disassembly listing but non-zero in the function dump. To do that, let's look at what relocations are and how the kernel handles them. But first, a little about the ELF format.
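The same fix can be tried on a plain byte buffer outside the kernel. This is a sketch under assumptions: retarget_call is a hypothetical helper, not a kernel function, and it treats the buffer as a copy of the code that will run at new_base.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Rewrite the rel32 operand of the e8 instruction at code[call_off] so that,
 * when the code runs at new_base, the call reaches target.
 */
void retarget_call(uint8_t *code, size_t call_off,
		   uint64_t new_base, uint64_t target)
{
	uint64_t next_insn = new_base + call_off + 5;
	int64_t disp = (int64_t)(target - next_insn);
	uint32_t rel32;
	int i;

	assert(code[call_off] == 0xe8);			/* near call expected */
	assert(disp >= INT32_MIN && disp <= INT32_MAX);	/* must fit in rel32 */

	rel32 = (uint32_t)disp;
	for (i = 0; i < 4; i++)				/* little-endian store */
		code[call_off + 1 + i] = (uint8_t)(rel32 >> (8 * i));
}
```

For the copy at 0xffffffffc0007000 and printk at 0xffffffff810b3df9 the operand becomes 0xc10acde8, which is the original 0xc10b3de8 minus the 0x7000 bytes by which the code was moved.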

ELF stands for Executable and Linkable Format. An ELF file is a collection of sections. A section stores a set of objects the linker needs to build an executable image: instructions, data, symbol tables, relocation records, and so on. Each section is described by a header. All headers are collected in the section header table, which is essentially an array in which each element has an index. A section header contains the offset to the beginning of the section and other service information, such as links to other sections expressed as indexes into the header table.

When compiling our test case, the compiler does not know the address of printk, so it fills the operand of the call with zero and, through a relocation record, tells the kernel that this position must be filled with a valid value. A relocation record contains the offset of the position to be changed (the relocation position), the relocation type, and the index in the symbol table of the symbol whose address must be substituted at that offset. What the relocation type is for, we will see below. The header of a relocation section refers, by index, to the header of the section holding the symbol table and to the header of the section relative to whose beginning the relocation offsets are given.

You can view relocation records with the objdump utility and its -r switch.
From the disassembly we know that the address of printk has to be written at offset 0xD, so we look for that position in the objdump output:

000000000000000d R_X86_64_PC32     printk-0x0000000000000004

So we have the relocation record we need: it gives the position at offset 0xD and the name of the symbol whose address should be written there.

The value (-4) that is added to the address of printk is called the addend, and it is taken into account when computing the final result of the relocation. It is -4 here because the processor counts the offset from the end of the 4-byte operand, not from its beginning.

Now look at the printk symbol:

$ objdump -t exe.ko | grep printk
0000000000000000         *UND*	0000000000000000 printk

The symbol exists but is undefined inside the module (*UND*), so it has to be looked up in the kernel.

It is more informative to look at the relocation and symbol records in binary form. This can be done with Wireshark, which can parse the ELF format. Here is our relocation entry (copied from Wireshark, least significant byte on the left):

  0d 00 00 00 00 00 00 00  02 00 00 00 22 00 00 00  fc ff ff ff ff ff ff ff
  |                     |  |         | |         |  |                     |
  +------ offset -------+  +- type --+ +- index -+  +------ addend -------+

Compare this entry with the definition of the corresponding structure from <linux/elf.h>:

typedef struct elf64_rela {
  Elf64_Addr r_offset;	/* Location at which to apply the action */
  Elf64_Xword r_info;	/* index and type of relocation */
  Elf64_Sxword r_addend;	/* Constant addend used to compute value */
} Elf64_Rela;

Here we have an 8-byte offset 0x000000000000000d, a 4-byte type 0x00000002, a 4-byte index into the symbol table 0x00000022 (34 in decimal), and an 8-byte addend of -4.

And here is entry number 34 of the symbol table:
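The field split can be checked mechanically. Below is a minimal sketch that decodes the 24 raw bytes by hand instead of relying on a system header; struct rela_fields and decode_rela are names made up for this illustration. The symbol index and type are extracted the same way the ELF64_R_SYM and ELF64_R_TYPE macros do it: upper and lower 32 bits of r_info.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decoded view of one Elf64_Rela record (hypothetical helper type). */
struct rela_fields {
	uint64_t offset;	/* r_offset */
	uint32_t type;		/* low 32 bits of r_info */
	uint32_t sym;		/* high 32 bits of r_info */
	int64_t addend;		/* r_addend */
};

/* Assemble a 64-bit little-endian value from 8 bytes. */
static uint64_t rela_le64(const uint8_t *p)
{
	uint64_t v = 0;
	int i;

	for (i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

struct rela_fields decode_rela(const uint8_t *raw, size_t len)
{
	struct rela_fields f;
	uint64_t info;

	assert(len >= 24);	/* sizeof(Elf64_Rela) */
	info = rela_le64(raw + 8);
	f.offset = rela_le64(raw);
	f.type = (uint32_t)(info & 0xffffffff);
	f.sym = (uint32_t)(info >> 32);
	f.addend = (int64_t)rela_le64(raw + 16);
	return f;
}
```

Feeding it the bytes shown above yields offset 0xd, type 2 (R_X86_64_PC32), symbol index 34 and addend -4.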

  01 01 00 00 10 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00

and the corresponding structure:

typedef struct elf64_sym {
  Elf64_Word st_name;		/* Symbol name, index in string tbl */
  unsigned char	st_info;	/* Type and binding attributes */
  unsigned char	st_other;	/* No defined meaning, 0 */
  Elf64_Half st_shndx;		/* Associated section index */
  Elf64_Addr st_value;		/* Value of the symbol */
  Elf64_Xword st_size;		/* Associated symbol size */
} Elf64_Sym;

The first 4 bytes, 0x00000101, are the index in the string table .strtab of this symbol's name, i.e. printk. The st_info field defines the symbol's type (function, data object, etc.) and binding; see the ELF specification for details. We skip the st_other field, which is of no interest here, and look at the last three fields: st_shndx, st_value and st_size. st_shndx is the index of the header of the section in which the symbol is defined. It is zero here because the symbol is not defined inside the module and is not in any of its sections.
Accordingly, st_value and st_size are zero as well. The kernel fills in st_value when it loads the module.

For comparison, look at the symbol foo, which is defined in the module:

  08 00 00 00 02 00 02 00  00 00 00 00 00 00 00 00  16 00 00 00 00 00 00 00

The symbol describes a function located in the .text section at offset 0x00000000 from the start of the section, i.e. at its very beginning, as we saw in the disassembly. The function size is 22 bytes.

objdump shows the same information:

$ objdump -t exe.ko | grep foo
0000000000000000 l     F .text	0000000000000016 foo
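The two symbol entries can be decoded the same way. This is an illustrative sketch: struct sym_fields and decode_sym are hypothetical names, and the binding and type are split out of st_info as its upper and lower four bits, following the ELF specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decoded view of one Elf64_Sym record (hypothetical helper type). */
struct sym_fields {
	uint32_t name;		/* st_name: offset into .strtab */
	uint8_t bind;		/* st_info >> 4: 0 = local, 1 = global */
	uint8_t type;		/* st_info & 0xf: 0 = no type, 2 = function */
	uint16_t shndx;		/* st_shndx: 0 = undefined */
	uint64_t value;		/* st_value */
	uint64_t size;		/* st_size */
};

/* Assemble an n-byte little-endian value (n <= 8). */
static uint64_t sym_le(const uint8_t *p, int n)
{
	uint64_t v = 0;
	int i;

	for (i = n - 1; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

struct sym_fields decode_sym(const uint8_t *raw, size_t len)
{
	struct sym_fields s;

	assert(len >= 24);	/* sizeof(Elf64_Sym) */
	s.name = (uint32_t)sym_le(raw, 4);
	s.bind = raw[4] >> 4;
	s.type = raw[4] & 0xf;
	/* raw[5] is st_other, ignored here */
	s.shndx = (uint16_t)sym_le(raw + 6, 2);
	s.value = sym_le(raw + 8, 8);
	s.size = sym_le(raw + 16, 8);
	return s;
}
```

For the foo entry this gives a local function in section 2 with value 0 and size 0x16; for the printk entry it gives a global symbol without a type, with section index 0 and zero value and size.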

When the kernel loads the module, it finds all undefined symbols and fills their st_value field with a valid address. This is done in the simplify_symbols function in kernel/module.c:

/* Change all symbols so that st_value encodes the pointer directly. */
static int simplify_symbols(struct module *mod, const struct load_info *info)
{
...

The function receives a load_info structure of the following form:

struct load_info {
	const char *name;
	/* pointer to module in temporary copy, freed at end of load_module() */
	struct module *mod;
	Elf_Ehdr *hdr;
	unsigned long len;
	Elf_Shdr *sechdrs;
	char *secstrings, *strtab;
	unsigned long symoffs, stroffs, init_typeoffs, core_typeoffs;
	struct _ddebug *debug;
	unsigned int num_debug;
	bool sig_ok;
#ifdef CONFIG_KALLSYMS
	unsigned long mod_kallsyms_init_off;
#endif
	struct {
		unsigned int sym, str, mod, vers, info, pcpu;
	} index;
};

The following fields are of interest to us:
- hdr - ELF file header
- sechdrs - pointer to the section header table
- strtab - symbol name table, a set of null-terminated strings
- index.sym - index of the section header containing the symbol table

First of all, the function gets access to the section holding the symbol table. The symbol table is an array of elements of type Elf64_Sym:

Elf64_Shdr *symsec = &info->sechdrs[info->index.sym];
Elf64_Sym *sym = (void *)symsec->sh_addr;

Next, in a loop, we walk through all the symbols in the table and determine the name of each:

for (i = 1; i < symsec->sh_size / sizeof(Elf_Sym); i++) {
	const char *name = info->strtab + sym[i].st_name;

The st_shndx field contains the index of the header of the section in which the symbol is defined. If it is zero (our case), the symbol is not inside the module and has to be looked up in the kernel:

	switch (sym[i].st_shndx) {
	.....
	case SHN_UNDEF: /* 0 */
		ksym = resolve_symbol_wait(mod, info, name);
		/* Ok if resolved.  */
		if (ksym && !IS_ERR(ksym)) {
			sym[i].st_value = kernel_symbol_value(ksym);
			break;
		}

Next come the relocations, in the apply_relocations function:
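The idea of this step can be modelled in user space. The sketch below is not the kernel code: struct mini_sym, struct mini_ksym and resolve_undefined are hypothetical names, the symbol record is trimmed to three fields, and a plain array stands in for the kernel's table of exported symbols. Symbols defined inside the module are skipped, whereas the real function handles them in other branches of the switch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Trimmed-down stand-in for Elf64_Sym. */
struct mini_sym {
	uint32_t st_name;	/* offset of the name in strtab */
	uint16_t st_shndx;	/* 0 means undefined in the module */
	uint64_t st_value;
};

/* Stand-in for one exported kernel symbol. */
struct mini_ksym {
	const char *name;
	uint64_t addr;
};

/* Resolve every undefined symbol; return how many were not found. */
size_t resolve_undefined(struct mini_sym *syms, size_t nsyms,
			 const char *strtab,
			 const struct mini_ksym *ktab, size_t nksyms)
{
	size_t i, k, unresolved = 0;

	assert(nsyms >= 1);	/* entry 0 is the reserved null symbol */

	for (i = 1; i < nsyms; i++) {
		const char *name = strtab + syms[i].st_name;

		if (syms[i].st_shndx != 0)
			continue;	/* defined in the module: skipped here */

		for (k = 0; k < nksyms; k++) {
			if (strcmp(ktab[k].name, name) == 0) {
				/* st_value now holds the address directly */
				syms[i].st_value = ktab[k].addr;
				break;
			}
		}
		if (k == nksyms)
			unresolved++;
	}
	return unresolved;
}
```

After the call, an undefined printk entry carries the kernel address in st_value, which is exactly what the relocation step later reads.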

static int apply_relocations(struct module *mod, const struct load_info *info)
{
	unsigned int i;
	int err = 0;

	/* Now do relocations. */
	for (i = 1; i < info->hdr->e_shnum; i++) {
	.....

In the loop we look for relocation sections and process the records of each one found in the apply_relocate_add function:

if (info->sechdrs[i].sh_type == SHT_RELA)
	err = apply_relocate_add(info->sechdrs, info->strtab,
				info->index.sym, i, mod);

apply_relocate_add receives a pointer to the section header table, a pointer to the symbol name table, the index of the symbol table's section header, and the index of the relocation section's header:

int apply_relocate_add(Elf64_Shdr *sechdrs,
	   const char *strtab,
	   unsigned int symindex,
	   unsigned int relsec,
	   struct module *me)
{

First we get the address of the relocation section:

Elf64_Rela *rel = (void *)sechdrs[relsec].sh_addr;

Then we iterate over its array of entries in a loop:

for (i = 0; i < sechdrs[relsec].sh_size / sizeof(*rel); i++) {

We find the section being relocated and the position in it, i.e. where the change has to be made. The sh_info field of the relocation section header is the index of the header of the section being relocated, and the r_offset field of the relocation record is the offset of the position inside that section:

/* This is where to make the change */
loc = (void *)sechdrs[sechdrs[relsec].sh_info].sh_addr + rel[i].r_offset;

Next we compute the address of the symbol to be substituted at this position, taking the addend into account. The r_info field of the relocation entry contains the index of this symbol in the symbol table:

	/* This is the symbol it is referring to.  Note that all
	   undefined symbols have been resolved.  */
	sym = (Elf64_Sym *)sechdrs[symindex].sh_addr
		+ ELF64_R_SYM(rel[i].r_info);

	val = sym->st_value + rel[i].r_addend;

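The relocation type determines the final result of the calculation. In our example it is R_X86_64_PC32 (type 2 in the relocation record), which the kernel handles in the same branch as R_X86_64_PLT32:

	switch (ELF64_R_TYPE(rel[i].r_info)) {
	......
	case R_X86_64_PC32:
	case R_X86_64_PLT32:
		if (*(u32 *)loc != 0)
			goto invalid_relocation;
		val -= (u64)loc;	/* make the value relative to the relocation position */
		*(u32 *)loc = val;	/* store the low 32 bits at that position */
		break;
	.....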

Now we can compute the final val ourselves, knowing that sym->st_value is the address of printk, 0xffffffff810b3df9, r_addend is (-4), and the relocation position is at offset 0xd from the beginning of the module's .text section (which is also the beginning of foo), i.e. at 0xffffffffc000000d. Substituting these values, we get:

val = (u32)(0xffffffff810b3df9 - 0x4 - 0xffffffffc000000d) = 0xc10b3de8

Let's look at the dump of the copied foo function that we obtained at the very beginning:

53 89 fe 89 fb 48 c7 c7 24 10 00 c0 e8 e8 3d 0b c1 8d 43 02 5b c3 66 2e 0f

At offset 0xD, the value 0xc10b3de8 is found, which is identical to the one we calculated.
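The whole calculation can be reproduced on the unrelocated bytes from the objdump listing. The following user-space sketch mirrors the symbol-plus-addend-minus-position computation from the kernel excerpt; apply_pc32 is a hypothetical name, and where the kernel jumps to invalid_relocation the sketch simply asserts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Apply a 32-bit PC-relative relocation to a section image.
 * sec      - bytes of the section being relocated
 * sec_addr - address at which that section is loaded
 */
void apply_pc32(uint8_t *sec, uint64_t sec_addr, uint64_t r_offset,
		uint64_t sym_value, int64_t r_addend)
{
	uint64_t loc = sec_addr + r_offset;		/* where to patch */
	uint64_t val = sym_value + (uint64_t)r_addend;	/* symbol + addend */
	uint8_t *p = sec + r_offset;
	int i;

	/* the position is expected to be zero before relocation */
	assert(p[0] == 0 && p[1] == 0 && p[2] == 0 && p[3] == 0);

	val -= loc;					/* relative to position */
	for (i = 0; i < 4; i++)				/* store low 32 bits */
		p[i] = (uint8_t)(val >> (8 * i));
}
```

Applied to the first 22 bytes of foo with the section loaded at 0xffffffffc0000000, r_offset 0xd, the printk address and an addend of -4, it writes e8 3d 0b c1 after the opcode, matching the dump.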

This is how the kernel processes relocations and obtains the offset needed by the near call instruction.

This article was prepared using Linux kernel version 5.4.27.
