Monday, August 3, 2009

Dumping kernel page tables

Sometimes when debugging kernel issues, you come across kernel addresses that are hard to map to a particular region of the kernel address space, e.g., vmalloc, vmemmap, low/high kernel mapping, kernel text, etc. On x86, Arjan van de Ven has written an interface that dumps the kernel page tables and shows how the various memory areas in the kernel are laid out and mapped.

# cat /debug/kernel_page_tables
---[ User Space ]---
0x0000000000000000-0xffff800000000000 16777088T pgd
---[ Kernel Space ]---
0xffff800000000000-0xffff880000000000 8T pgd
---[ Low Kernel Mapping ]---
0xffff880000000000-0xffff880000200000 2M RW GLB x pte
0xffff880000200000-0xffff880040000000 1022M RW PSE GLB x pmd
0xffff880040000000-0xffff8800cfe00000 2302M RW PSE GLB NX pmd
...
---[ vmalloc() Area ]---
0xffffc20000000000-0xffffc20000001000 4K RW PCD GLB NX pte
0xffffc20000001000-0xffffc20000004000 12K pte
0xffffc20000004000-0xffffc20000005000 4K RW PCD GLB NX pte
0xffffc20000005000-0xffffc20000008000 12K pte
0xffffc20000008000-0xffffc2000000d000 20K RW PCD GLB NX pte
0xffffc2000000d000-0xffffc20000010000 12K pte
0xffffc20000010000-0xffffc20000011000 4K RW PCD GLB NX pte
....
---[ Vmemmap ]---
0xffffe20000000000-0xffffe20007c00000 124M RW PSE GLB NX pmd
0xffffe20007c00000-0xffffe20040000000 900M pmd
0xffffe20040000000-0xffffe28000000000 511G pud
0xffffe28000000000-0xffffff8000000000 29T pgd
0xffffff8000000000-0xffffffff80000000 510G pud
---[ High Kernel Mapping ]---
0xffffffff80000000-0xffffffff80200000 2M pmd
0xffffffff80200000-0xffffffff80a00000 8M RW PSE GLB x pmd
0xffffffff80a00000-0xffffffffa0000000 502M pmd
---[ Modules ]---
0xffffffffa0000000-0xffffffffa000a000 40K RW GLB x pte
0xffffffffa000a000-0xffffffffa000f000 20K pte
0xffffffffa000f000-0xffffffffa0016000 28K RW GLB x pte
0xffffffffa0016000-0xffffffffa001b000 20K pte
....
...

Understanding the above output:

o The first field is the virtual address range covered by the entry. The "---[ ... ]---" headers name the area the following ranges fall in (for example, user space, vmalloc area, kernel space, etc.)
o The second field is the size of the address range in K, M, G or T units
o The fields following the size are the page attribute flags. A range with no flags listed has no mapping present:
    USR - the page is accessible from user mode
    RW - the page is read/write. If not RW, the output is 'ro' to indicate a read-only page
    PCD - Page Cache Disable - the page is mapped with caching disabled
    PWT - Page Write-Through - the page is mapped with write-through caching
    PSE - Page Size Extension - the entry maps a large page directly (for example, 2M at the pmd level) instead of pointing to a lower-level table
    GLB - Page Global flag - the TLB entry for the page is not flushed on an address space switch (CR3 reload), which suits kernel mappings shared by all processes
    NX - the page is non-executable; otherwise it is marked as 'x'
o The last field is the page table level (pgd, pud, pmd or pte) at which the range is described
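
The field layout described above can be sketched as a small parser. This is only an illustration and not part of the kernel interface: the function name parse_entry, the regular expression and the dictionary keys are hypothetical, and the sketch assumes lines look exactly like the sample output shown earlier.

```python
import re

# Size suffixes as they appear in the sample dump (binary multiples).
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30, "T": 1 << 40}
LEVELS = {"pgd", "pud", "pmd", "pte"}

# start-end, size with unit, then optional flags and the level.
ENTRY_RE = re.compile(
    r"^0x([0-9a-fA-F]+)-0x([0-9a-fA-F]+)\s+(\d+)([KMGT])\s*(.*)$"
)


def parse_entry(line):
    """Split one range line into its fields.

    Returns None for lines that are not range entries, such as the
    '---[ ... ]---' section headers or the '...' elisions.
    """
    m = ENTRY_RE.match(line.strip())
    if m is None:
        return None
    start = int(m.group(1), 16)
    end = int(m.group(2), 16)
    size = int(m.group(3)) * UNITS[m.group(4)]
    tokens = m.group(5).split()
    # The last token is the page table level; anything before it is a flag.
    level = tokens[-1] if tokens and tokens[-1] in LEVELS else None
    flags = tokens[:-1] if level else tokens
    return {"start": start, "end": end, "size": size,
            "flags": flags, "level": level}


entry = parse_entry(
    "0xffff880000200000-0xffff880040000000 1022M RW PSE GLB x pmd"
)
# The printed size should equal the length of the address range.
print(entry["end"] - entry["start"] == entry["size"])
print(entry["flags"], entry["level"])
```

A range with an empty flags list, such as the 12K holes in the vmalloc() area above, comes back with "flags" set to an empty list, which makes unmapped ranges easy to filter out.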

To use this interface, enable the CONFIG_X86_PTDUMP configuration option along with debugfs, and mount debugfs (the example above has it mounted at /debug). The corresponding kernel code for the interface can be found in arch/x86/mm/dump_pagetables.c.
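
For the debugging case this post started with, mapping a stray kernel address to its area, a rough lookup over the dump might look like the sketch below. The helper names find_address and find_in_file are hypothetical. The default path assumes debugfs is mounted at /debug as in the example above, and treating each range as half-open (start included, end excluded) is an assumption based on consecutive lines sharing an endpoint.

```python
import re

RANGE_RE = re.compile(r"^0x([0-9a-fA-F]+)-0x([0-9a-fA-F]+)\s+(.*)$")
HEADER_RE = re.compile(r"^---\[\s*(.*?)\s*\]---$")


def find_address(dump_lines, addr):
    """Return (section name, rest of the matching line) for addr, or None."""
    section = None
    for raw in dump_lines:
        line = raw.strip()
        header = HEADER_RE.match(line)
        if header:
            # Remember the most recent '---[ ... ]---' header.
            section = header.group(1)
            continue
        rng = RANGE_RE.match(line)
        if rng:
            start = int(rng.group(1), 16)
            end = int(rng.group(2), 16)
            if start <= addr < end:
                return section, rng.group(3)
    return None


def find_in_file(addr, path="/debug/kernel_page_tables"):
    """Look addr up in the live dump; needs root and CONFIG_X86_PTDUMP."""
    with open(path) as f:
        return find_address(f, addr)


# Offline usage with a few lines copied from the sample output above.
sample = [
    "---[ High Kernel Mapping ]---",
    "0xffffffff80200000-0xffffffff80a00000 8M RW PSE GLB x pmd",
    "---[ Modules ]---",
    "0xffffffffa0000000-0xffffffffa000a000 40K RW GLB x pte",
]
print(find_address(sample, 0xffffffffa0001234))
```

With the sample lines, the address 0xffffffffa0001234 resolves to the Modules area and its 40K read/write, executable pte range.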