
Thursday, September 17, 2009
Real-Time Linux Kernel Scheduler

Monday, August 3, 2009
Dumping kernel page tables
# cat /debug/kernel_page_tables
---[ User Space ]---
0x0000000000000000-0xffff800000000000 16777088T pgd
---[ Kernel Space ]---
0xffff800000000000-0xffff880000000000 8T pgd
---[ Low Kernel Mapping ]---
0xffff880000000000-0xffff880000200000 2M RW GLB x pte
0xffff880000200000-0xffff880040000000 1022M RW PSE GLB x pmd
0xffff880040000000-0xffff8800cfe00000 2302M RW PSE GLB NX pmd
...
---[ vmalloc() Area ]---
0xffffc20000000000-0xffffc20000001000 4K RW PCD GLB NX pte
0xffffc20000001000-0xffffc20000004000 12K pte
0xffffc20000004000-0xffffc20000005000 4K RW PCD GLB NX pte
0xffffc20000005000-0xffffc20000008000 12K pte
0xffffc20000008000-0xffffc2000000d000 20K RW PCD GLB NX pte
0xffffc2000000d000-0xffffc20000010000 12K pte
0xffffc20000010000-0xffffc20000011000 4K RW PCD GLB NX pte
....
---[ Vmemmap ]---
0xffffe20000000000-0xffffe20007c00000 124M RW PSE GLB NX pmd
0xffffe20007c00000-0xffffe20040000000 900M pmd
0xffffe20040000000-0xffffe28000000000 511G pud
0xffffe28000000000-0xffffff8000000000 29T pgd
0xffffff8000000000-0xffffffff80000000 510G pud
---[ High Kernel Mapping ]---
0xffffffff80000000-0xffffffff80200000 2M pmd
0xffffffff80200000-0xffffffff80a00000 8M RW PSE GLB x pmd
0xffffffff80a00000-0xffffffffa0000000 502M pmd
---[ Modules ]---
0xffffffffa0000000-0xffffffffa000a000 40K RW GLB x pte
0xffffffffa000a000-0xffffffffa000f000 20K pte
0xffffffffa000f000-0xffffffffa0016000 28K RW GLB x pte
0xffffffffa0016000-0xffffffffa001b000 20K pte
....
...
Understanding the above output:
o First field indicates the address range of a particular type of area (for example, user space, vmalloc area, kernel space, etc)
o The second field indicates the size of the address range in K,M,G,T units
o The fields following the size of the range have the following meaning:
USR - whether the page being mapped is a user page or not
RW - whether the page is read/write. If not RW, the output would be 'ro' to indicate a read-only page
PCD - Page Cache Disabled - maps a page with caching disabled
PWT - page with Page Write-Through set
PSE - Extended paging enabled - allows large linear contiguous address ranges to be mapped
GLB - Page Global flag - The global flag is set for a page that is frequently used and prevents it from being flushed from the TLB
NX - Page is non-executable, else marked as 'x'
o The last entry indicates the particular level of the page table - pgd, pud, pmd or pte that the region corresponds to
Enable the CONFIG_X86_PTDUMP configuration option, along with enabling debugfs. The corresponding kernel code for the interface can be found under arch/x86/mm/dump_pagetables.c

Thursday, July 23, 2009
Using gdb for debugging kernel modules
BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
IP: [<ffffffff8023262c>] mmput+0x11/0xb0
PGD 22bcf6067 PUD 211839067 PMD 0
Oops: 0002 [#1] SMP
last sysfs file: /sys/block/sdb/size
CPU 6
Modules linked in: <a routine in my module> ehci_hcd uhci_hcd [last unloaded: <my module>]
Pid: 26830, comm: <my kernel thread> Not tainted 2.6.28-anki #13
RIP: 0010:[<ffffffff8023262c>] [<ffffffff8023262c>] mmput+0x11/0xb0
RSP: 0000:ffff8802279fbed0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000045
RDX: ffff88022e150300 RSI: 0000000000000282 RDI: 0000000000000000
RBP: ffff8802279fbee0 R08: 0000000000000018 R09: ffff88022e150300
R10: ffff88022e150300 R11: 00007fffe6dfefff R12: ffff88022e52f860
R13: ffff88022e150300 R14: ffff88022d99aed0 R15: ffff88022d99b110
FS: 0000000000000000(0000) GS:ffff88022f818280(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000050 CR3: 0000000211829000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process memref (pid: 26830, threadinfo ffff8802279fa000, task
ffff88022d4ef1b0)
Stack:
ffff88022e150300 0000000000000000 ffff8802279fbf20 ffffffffa003314b
ffff8802279fbf20 ffff88022d99aed0 ffffffffa00330f2 ffff880210cf1c68
ffff88022f855ee0 ffff88022f855f00 ffff8802279fbf40 ffffffff802468e3
Call Trace:
[<ffffffffa003314b>] <routine in my module>+0x59/0xb2 [<my kernel thread>]
[<ffffffffa00330f2>] ? <my modulee>+0x0/0xb2 [<my kernel thread>]
[<ffffffff802468e3>] kthread+0x49/0x76
[<ffffffff8020cce9>] child_rip+0xa/0x11
[<ffffffff802277a5>] ? dequeue_task+0xbf/0xca
[<ffffffff8024689a>] ? kthread+0x0/0x76
[<ffffffff8020ccdf>] ? child_rip+0x0/0x11
Code: df e8 88 c7 fd ff 48 8b 3d ba 28 6b 00 48 89 de e8 b1 af 05 00 41 59 5b
c9 c3 55 48 89 e5 53 48 89 fb 48 83 ec 08 e8 6a fa 39 00 <f0> ff 4b 50 0f 94
c0 84 c0 0f 84 8b 00 00 00 48 89 df e8 8b 76
RIP [<ffffffff8023262c>] mmput+0x11/0xb0
RSP <ffff8802279fbed0>
CR2: 0000000000000050
Now, to find out from where exactly in my module the panic happened, I simply use gdb, as below:
#gdb <path to my module.o file from the kernel source root dir> [for example, #gdb drivers/misc/ankita.o]
....
(gdb) list *(the routine)+0x59
The above gives a source listing, pointing to the lines which correspond to 0x59 address. This serves as a good starting point to debug the issue. Hope this helps some folks atleast :-)

Thursday, July 16, 2009
Looking at thread state inside gdb
(gdb) info threads
12 Thread 4398053392560 (LWP 7234) 0x0000000010005740 in xxxx ()
11 Thread 4398054441136 (LWP 7235) 0x0000000010005740 in xxxx ()
10 Thread 4398055489712 (LWP 7237) 0x0000000010005748 in xxxx ()
9 Thread 4398063878320 (LWP 8263) 0x0000000010000768 in yyy ()
8 Thread 4398062829744 (LWP 8267) 0x0000000010003624 in zzz ()
7 Thread 4398061781168 (LWP 8270) 0x0000000010003624 in zzz ()
6 Thread 4398060732592 (LWP 8273) 0x0000000010001ce4 in ddd ()
5 Thread 4398059684016 (LWP 8276) 0x0000000010000750 in fff ()
4 Thread 4398058635440 (LWP 8277) 0x000000001001e128 in ttt ()
3 Thread 4398057586864 (LWP 8278) 0x0000000010003624 in zzz ()
2 Thread 4398056538288 (LWP 8281) 0x000000001001e140 in rrrr ()
1 Thread 269578240 (LWP 28151) 0x000000001000cba4 in www ()
She complained that in another proprietary debugger she has worked with in the past, she could easily also get the thread status information as well. Found that on gdb, we need to use the following commands to get the status information of the threads:
(gdb) help info proc
Show /proc process information about any running process.
Specify any process id, or use the program being debugged by default.
Specify any of the following keywords for detailed info:
mappings -- list of mapped memory regions.
stat -- list a bunch of random process info.
status -- list a different bunch of random process info.
all -- list all available /proc info.
(gdb) info proc 7234 stat
process 7234
cmdline = '/usr/bin/xyz'
warning: unable to read link '/proc/7234/cwd'
warning: unable to read link '/proc/7234/exe'
Process: 7234
Exec file: xyz
State: S
Parent process: 3230
Process group: 7234
Session id: 7234
TTY: 0
TTY owner process group: -1
Flags: 0x402040
Minor faults (no memory page): 96
Minor faults, children: 0
Major faults (memory page faults): 0
Major faults, children: 0
utime: 0
stime: 0
utime, children: 0
stime, children: 0
jiffies remaining in current time slice: 20
'nice' value: 0
jiffies until next timeout: 1
jiffies until next SIGALRM: 0
start time (jiffies since system boot): 4371
Virtual memory size: ...
Resident set size: ....
rlim: ......
Start of text: 0xb8013000
End of text: 0xb8025ca4
Start of stack: 0xbf925630
(gdb) info proc 7234 status
process 7234
cmdline = '/usr/bin/xyz'
warning: unable to read link '/proc/7234/cwd'
warning: unable to read link '/proc/7234/exe'
Name: xyz
State: S (sleeping)
Tgid: 7234
Pid: 7234
PPid: 3230
TracerPid: 0
Uid: 1000 1000 1000 1000
Gid: 1000 1000 109 1000
FDSize: 32
Groups: 4 20 24 46 106 117 118 123 124 1000
VmPeak: 4784 kB
VmSize: 4784 kB
VmLck: 0 kB
VmHWM: 636 kB
VmRSS: 636 kB
VmData: 380 kB
VmStk: 84 kB
VmExe: 76 kB
VmLib: 4020 kB
VmPTE: 20 kB
Threads: 1
SigQ: 0/16382
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000001002
SigCgt: 0000000180004001
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: ffffffffffffffff
Cpus_allowed: 00000000,00000003
Cpus_allowed_list: 0-1
Mems_allowed: 1
Mems_allowed_list: 0
voluntary_ctxt_switches: 627
nonvoluntary_ctxt_switches: 0
So as we can see, we can get a whole lot of information about the threads and not just the state. There are other options supported by the info proc command as shown by the help. gdb picks up the above information from the /proc/<pid> interface. It helps to be able to look at all the pieces of debug information from the same interface itself.
Friday, July 3, 2009
Fake NUMA nodes in Linux
- Following config options need to be turned on: CONFIG_NUMA=y, CONFIG_NUMA_EMULATION=y
- Build the kernel with the above config options set
- The kernel commandline could be any one of the following, depending on your requirement:
- numa=fake=4 : Split the entire memory into 4 equal nodes
- numa=fake=8*1024 : Split the memory into 8 equal chunks of 1024MB (ie 1G) (note, the number is considered to be in MB) [If system has more memory, the last node will be assigned remaining memory]
- numa=fake=2*512,2*1024 : Split the memory into 2 nodes of 512MB each and 2 more nodes of 1GB each (and so on)
You can play around with more options :) The userspace numa utilities like numactl and numastat would then show the numa environment that has been setup. Details of the cpumap and per-node meminfo can be obtained from the sysfs file /sys/devices/system/node/node<0|1|2..>.
Fake NUMA has one flaw however and that is the CPU mapping to nodes. There would exist nodes that do not show up as having any CPUs (unde the cpumap file in the node dir of the above mentioned sysfs file). As per the semantics, a CPU must unquely belong to a NUMA node. However, inside the kernel, the CPU is mapped to all the fake nodes.
Fake NUMA nodes can be created even on a real NUMA system. In this case, the fake nodes are aligned within a real node. The distances between two fake nodes across two real nodes is maintained. Could cover internal implementation details in a separate post. Have fun playing around with NUMA !
Sunday, June 28, 2009
Build your kernel faster
Run the script with the arguement being your architecture's Kconfig file and save the output
- # ./streamline_config.pl arch/x86/Kconfig > config_stream
Note: If you already have a .config file that has some of your custom config options set and you want to streamline that, no worries as streamline_config.pl will work on that .config itself (provided its present in the kernel src dir). You might still want to take a backup of your .config [;-)]
Sunday, June 21, 2009
Using large pages
An application can request large pages using the shmget API:
#include <sys/ipc.h>
#include <sys/shm.h>
int shmget(key_t key, size_t size, int shmflg);
The SHM_HUGETLB flag part of shmflg field specifies creation of large pages.
Linux kernel provides an interface using which large pages can be requested.
#echo 1000 > /proc/sys/vm/nr_hugepages
The above causes 1000 large pages to be allocated by the kernel. More information on large pages can be obtained from the /proc fs:
#cat /proc/meminfo
MemTotal: 8114308 kB
MemFree: 5867312 kB
Buffers: 8412 kB
Cached: 107304 kB
SwapCached: 0 kB
Active: 48000 kB
Inactive: 87592 kB
Active(anon): 22704 kB
Inactive(anon): 0 kB
Active(file): 25296 kB
Inactive(file): 87592 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 4883752 kB
SwapFree: 4883752 kB
Dirty: 48 kB
Writeback: 36 kB
AnonPages: 20212 kB
Mapped: 10948 kB
Slab: 25988 kB
SReclaimable: 12916 kB
SUnreclaim: 13072 kB
PageTables: 2400 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 7916904 kB
Committed_AS: 46040 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 43496 kB
VmallocChunk: 34359693843 kB
HugePages_Total: 1000
HugePages_Free: 1000
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 3824 kB
DirectMap2M: 8384512 kB
On a NUMA system, the kernel splits large page allocations equally across the different nodes. For example, if the system has 2 nodes, a request of 1000 large pages would get split into 500 pages from each node. Per node large page information can be obtained from the /sys interface:
# cat /sys/devices/system/node/node0/meminfo
Node 1 MemTotal: 4194304 kB
Node 1 MemFree: 40004 kB
Node 1 MemUsed: 4154300 kB
Node 1 Active: 2166524 kB
Node 1 Inactive: 810704 kB
Node 1 Active(anon): 2127084 kB
Node 1 Inactive(anon): 8360 kB
Node 1 Active(file): 39440 kB
Node 1 Inactive(file): 802344 kB
Node 1 Unevictable: 0 kB
Node 1 Mlocked: 0 kB
Node 1 Dirty: 0 kB
Node 1 Writeback: 0 kB
Node 1 FilePages: 841792 kB
Node 1 Mapped: 11008 kB
Node 1 AnonPages: 2135884 kB
Node 1 PageTables: 5136 kB
Node 1 NFS_Unstable: 0 kB
Node 1 Bounce: 0 kB
Node 1 WritebackTmp: 0 kB
Node 1 Slab: 33704 kB
Node 1 SReclaimable: 30708 kB
Node 1 SUnreclaim: 2996 kB
Node 1 HugePages_Total: 500
Node 1 HugePages_Free: 498
Node 1 HugePages_Surp: 0
Recently, in one of the benchmarks (JAVA benchmark) I was running, I was seeing a huge performance degradation of about 6-8%. After some debugging, the issue turned out to be that the application was not able to utilize the large pages allocated (thanks to some weird environment I had ;-) ). To find out the number of large pages being utilized by the app, besides the above meminfo output, you can also use numa_maps. For example,
# cat /proc/<process pid>/numa_maps
00001000 default anon=1 dirty=1 N0=1
00400000 default file=<....library file info..> mapped=10 mapmax=3 N0=10
0050b000 default file=<....library file info..> anon=1 dirty=1 N0=1
0050c000 default heap anon=213 dirty=213 N0=213
00600000 default file=/SYSV00000000\040(deleted) huge dirty=472 N0=472
40600000 default
40601000 default anon=2 dirty=2 N0=2
40641000 default
40642000 default anon=4 dirty=4 N0=4
40682000 default
40683000 default anon=2 dirty=2 N0=2
4090f000 default
40910000 default anon=3 dirty=3 N0=3
40a68000 default
40a69000 default anon=4 dirty=4 N0=4
40a70000 default
40a71000 default anon=2 dirty=2 N0=2
40ab1000 default
40ab2000 default anon=2 dirty=2 N0=2
.....
41fc9000 default anon=10 dirty=10 N0=10
427c9000 default anon=535 dirty=535 N0=535
2aaaaac00000 default file=/SYSV00000000\040(deleted) huge dirty=1 N0=1
7f6024000000 default anon=5578 dirty=5578 N0=5578
7f6027398000 default
7f602a402000 default anon=821 dirty=821 N0=821


