Thursday, March 21, 2019

pagetable adventures in linux x86_64

i'm teaching CS 149 (Operating Systems) this semester. it is one of my favorite classes! we are currently covering virtual memory and page tables. i'm using the book operating systems: three easy pieces, which covers the topic well. after my lecture on it, though, i felt that the students needed a way to see page tables in action, so i wanted to let them look at the page tables of a process in real time. that would let them see the data structures involved, walk through the resolution, and get the final mapping. it turns out that doing it was a bit harder than i anticipated.

accessing the top level page table

we are using x86_64 linux, which has a 4-level page table. philipp oppermann has an amazing explanation of x86_64 page tables. i highly recommend checking it out!

the first step to accessing the top level page table is reading the CR3 CPU register. unfortunately, reading CR3 is a privileged operation. fortunately, allan cruse from the university of san francisco wrote a kernel module for exposing CR3 through /proc/cr3. it needed a bit of adapting to make it work with x86_64 and the new /proc interface, but i got it implemented: https://github.com/breed/virt2phys/blob/master/kernel-module/cr3.c.

with the information from CR3 we can get the address in physical memory of the top page table for the current process. the top page table is 4K in size and contains 512 8-byte entries, each holding the address of a next-level page table. our next task is reading these tables from physical memory.
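as a minimal sketch of that arithmetic (not code from the kernel module): the names PT_PAGE_BYTES, PT_ENTRIES, and cr3_table_base are made up for illustration, and the sketch assumes, as this post does, that only the low 12 bits of CR3 carry non-address bits.

```c
#include <assert.h>
#include <stdint.h>

#define PT_PAGE_BYTES 4096u                               /* one page table fills a 4K page */
#define PT_ENTRIES    (PT_PAGE_BYTES / sizeof(uint64_t))  /* 4096 / 8 */

/* every table holds 512 entries because each entry is 8 bytes */
static_assert(PT_ENTRIES == 512, "a 4K table holds 512 64-bit entries");

/* physical address of the top-level table: CR3 with its low 12 bits cleared
 * (assumption: nothing else needs masking, as in the run shown in this post) */
static uint64_t cr3_table_base(uint64_t cr3)
{
    return cr3 & ~(uint64_t)0xFFF;
}
```

with the CR3 value from the example run later in this post, cr3_table_base(0x6F1D0006) gives 0x6F1D0000.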

give me a physical page!


in the good old days, there was this intriguing file in /dev called /dev/mem. when i first started using linux and before i completely understood virtual memory, that file remained a half understood mystery. i did learn that you could sometimes recover emails and editing sessions that you prematurely canceled by grepping it, but i never actually had a need to use it in a program.

it turns out /dev/mem isn't mysterious at all! it allows you to access physical memory as if it was a file. (technically it is a character device, but UNIX allows character devices to be interacted with as if they were files :) ) you simply open() /dev/mem, lseek() to the offset in the physical memory that you want to access, and then access the physical memory with read() or write().
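the open(), lseek(), read() sequence can be sketched roughly like this. read_page and PAGE_BYTES are hypothetical names of my own, and the path is a parameter so the same helper could point at /dev/mem or at any other file that exposes physical memory by byte offset. it treats a short read as a failure, which is a simplification.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_BYTES 4096

/* read one 4K page starting at byte offset `phys` from the file at `path`.
 * returns 0 on success, -1 on any failure (including a short read). */
static int read_page(const char *path, uint64_t phys, uint8_t buf[PAGE_BYTES])
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    int rc = -1;
    /* seek to the physical address, then pull in exactly one page */
    if (lseek(fd, (off_t)phys, SEEK_SET) != (off_t)-1) {
        ssize_t n = read(fd, buf, PAGE_BYTES);
        if (n == PAGE_BYTES)
            rc = 0;
    }
    close(fd);
    return rc;
}
```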

/dev/mem is perfect for what we need to do! tragically, /dev/mem has effectively been disabled in recent kernels. it is a huge security hole! you can recompile the kernel to enable it, but i didn't want to require students to do that to examine page tables.

so i went the more difficult route of expanding the kernel module to also expose /proc/page_reader. this file allows you to lseek() to the physical page you want to read and then read its contents.

putting it all together


now that we have access to CR3 and physical pages, we can chop up the virtual address into its 5 components: the four 9-bit indexes into the 4 levels of page tables and the 12-bit offset into the 4K page.
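the chopping itself is just shifts and masks. here is a small sketch; pt_index and page_offset are illustrative names of mine, not necessarily what pagetables.c uses. level 3 is the top-level table and level 0 is the last one.

```c
#include <assert.h>
#include <stdint.h>

/* 9-bit index into the page table at `level`.
 * level 3 is the top-level table (bits 47..39), level 0 the last one (bits 20..12). */
static unsigned pt_index(uint64_t vaddr, unsigned level)
{
    assert(level < 4);
    return (unsigned)((vaddr >> (12 + 9 * level)) & 0x1FF);
}

/* 12-bit byte offset within the final 4K page */
static unsigned page_offset(uint64_t vaddr)
{
    return (unsigned)(vaddr & 0xFFF);
}
```

for the address 0x15010D00D000 this gives the indexes 0x2A, 0x04, 0x68, 0x0D and the offset 0x000.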

here is an example run of the pagetable program in https://github.com/breed/virt2phys/blob/master/pagetables.c. (we run it with sudo and we insmod the cr3.ko module before we run it.)

CR3 is 6F1D0006
data(): addr 000015010D00D000 -> 02A 004 068 00D 000
Need to resolved entry 02A in 000000006F1D0000
PAGE TABLE for 000000006F1D0000 (non zero entries):
  02A 8000000078173067
  0AB 800000007915a067
  0FE 800000006f228067
  0FF 800000006f7b0067
  136 0000000075f60067
  170 000000007d144067
  1B0 00000000702e2067
  1F6 000000007f73d067
  1FC 000000007ff3a067
  1FE 0000000075d1c067
  1FF 000000007580e067
Got PTE 8000000078173067
Need to resolved entry 004 in 0000000078173000
PAGE TABLE for 0000000078173000 (non zero entries):
  004 000000006f6ff067
Got PTE 000000006F6FF067
Need to resolved entry 068 in 000000006F6FF000
PAGE TABLE for 000000006F6FF000 (non zero entries):
  068 0000000076762067
Got PTE 0000000076762067
Need to resolved entry 00D in 0000000076762000
PAGE TABLE for 0000000076762000 (non zero entries):
  00D 80000000479ec867
Got PTE 80000000479EC867
data(): virt 000015010D00D000 -> phys 00000000479EC000
------------------
here we see the address 0x15010d00d000 breaks up into four 9-bit indexes: 0x2a, 0x4, 0x68, and 0xd. CR3 is pointing at 6f1d0000 (the low 12 bits are used for flags), so our top level page table is stored at the physical address 6f1d0000. we grab the 4K of data stored at 6f1d0000. now we need to find entry 0x2a (0-based) in that 4K of data. each page table entry is a 64-bit integer, so we can cast the page table data to an int64_t *pte and then look at pte[0x2a], which is 8000000078173067.
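a sketch of that lookup, under a couple of assumptions: pte_at is a made-up helper name, the page has already been read into a byte buffer, and the code runs on the same little-endian x86_64 machine the table came from. it uses memcpy instead of a pointer cast to stay clear of alignment and aliasing trouble, and an unsigned type so the top bit does not read as a sign.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PT_ENTRIES 512

/* return entry `idx` of a 4K page table held in `page`.
 * the bytes are taken in host order, which matches the table layout on x86_64. */
static uint64_t pte_at(const uint8_t page[4096], unsigned idx)
{
    assert(idx < PT_ENTRIES);
    uint64_t pte;
    memcpy(&pte, page + (size_t)idx * sizeof pte, sizeof pte);
    return pte;
}
```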

the top bit of 8000000078173067 (the leading 8) is the NX bit; it means that we are mapping memory that does not contain executable code. (there is that security again!) the page table entry's bottom 12 bits are flags, so we need to mask those off, along with the NX bit at the top, to get the physical address of the 2nd level page table, which is 78173000. we are going to do this page retrieval and indexed lookup three more times until we finally get the physical address of the page that holds the data. we then use the 12-bit offset, which is 0 in this case, as the offset into that page to find the exact bytes that we are looking for.
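putting the masking and the four lookups together, the whole walk can be sketched as a short loop. this is an illustration and not the code in pagetables.c. walk, read_pte_fn, and replay_read_pte are hypothetical names. the sketch assumes plain 4K pages (it ignores huge pages), and instead of touching physical memory it replays the four entries from the example run above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PTE_PRESENT   (1ULL << 0)            /* bit 0: entry is valid */
#define PTE_NX        (1ULL << 63)           /* bit 63: no-execute */
#define PTE_ADDR_MASK 0x000FFFFFFFFFF000ULL  /* bits 51..12: physical address */

/* hypothetical callback: fetch entry `idx` of the table at physical address `table` */
typedef uint64_t (*read_pte_fn)(uint64_t table, unsigned idx);

/* translate `vaddr` starting from the top-level table at `top`.
 * returns 0 and stores the result in *phys, or -1 if an entry is not present.
 * assumption: only 4K pages, so the huge page bit is never checked. */
static int walk(uint64_t top, uint64_t vaddr, read_pte_fn read_pte, uint64_t *phys)
{
    assert(read_pte != NULL && phys != NULL);
    uint64_t table = top;
    for (int level = 3; level >= 0; level--) {
        unsigned idx = (unsigned)((vaddr >> (12 + 9 * level)) & 0x1FF);
        uint64_t pte = read_pte(table, idx);
        if (!(pte & PTE_PRESENT))
            return -1;
        table = pte & PTE_ADDR_MASK;  /* drops the low flag bits and the NX bit */
    }
    *phys = table | (vaddr & 0xFFF);  /* add the offset into the final page */
    return 0;
}

/* stand-in for reading physical memory: replays the four entries from the run above */
static uint64_t replay_read_pte(uint64_t table, unsigned idx)
{
    if (table == 0x6F1D0000ULL && idx == 0x02A) return 0x8000000078173067ULL;
    if (table == 0x78173000ULL && idx == 0x004) return 0x000000006F6FF067ULL;
    if (table == 0x6F6FF000ULL && idx == 0x068) return 0x0000000076762067ULL;
    if (table == 0x76762000ULL && idx == 0x00D) return 0x80000000479EC867ULL;
    return 0;  /* everything else: not present */
}
```

calling walk(0x6F1D0000, 0x15010D00D000, replay_read_pte, &phys) returns 0 and leaves 0x479EC000 in phys, matching the last line of the run. taking the table reader as a callback keeps the walk logic separate from how the physical pages are actually fetched.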

conclusion

virtual memory and page table resolution are a fascinating bit of black magic that makes our lives as programmers pretty awesome! peeking behind the curtains can help you understand what is really happening when you run your code. in the next post i'll delve deep into the real magic involving COWs and demand paging.