Thursday, July 23, 2009

Using gdb for debugging kernel modules

This is a very simple, perhaps even trivial, post and nothing novel, but it describes something that has helped me a lot in debugging issues with my kernel modules and kernel panics in general. When you get a panic originating in a kernel module, it is normally reported in the following format:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
IP: [<ffffffff8023262c>] mmput+0x11/0xb0
PGD 22bcf6067 PUD 211839067 PMD 0
Oops: 0002 [#1] SMP
last sysfs file: /sys/block/sdb/size
CPU 6
Modules linked in: <my module> ehci_hcd uhci_hcd [last unloaded: <my module>]
Pid: 26830, comm: <my kernel thread> Not tainted 2.6.28-anki #13
RIP: 0010:[<ffffffff8023262c>] [<ffffffff8023262c>] mmput+0x11/0xb0
RSP: 0000:ffff8802279fbed0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000045
RDX: ffff88022e150300 RSI: 0000000000000282 RDI: 0000000000000000
RBP: ffff8802279fbee0 R08: 0000000000000018 R09: ffff88022e150300
R10: ffff88022e150300 R11: 00007fffe6dfefff R12: ffff88022e52f860
R13: ffff88022e150300 R14: ffff88022d99aed0 R15: ffff88022d99b110
FS: 0000000000000000(0000) GS:ffff88022f818280(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000050 CR3: 0000000211829000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process memref (pid: 26830, threadinfo ffff8802279fa000, task
ffff88022d4ef1b0)
Stack:
ffff88022e150300 0000000000000000 ffff8802279fbf20 ffffffffa003314b
ffff8802279fbf20 ffff88022d99aed0 ffffffffa00330f2 ffff880210cf1c68
ffff88022f855ee0 ffff88022f855f00 ffff8802279fbf40 ffffffff802468e3
Call Trace:
[<ffffffffa003314b>] <routine in my module>+0x59/0xb2 [<my module>]
[<ffffffffa00330f2>] ? <routine in my module>+0x0/0xb2 [<my module>]
[<ffffffff802468e3>] kthread+0x49/0x76
[<ffffffff8020cce9>] child_rip+0xa/0x11
[<ffffffff802277a5>] ? dequeue_task+0xbf/0xca
[<ffffffff8024689a>] ? kthread+0x0/0x76
[<ffffffff8020ccdf>] ? child_rip+0x0/0x11
Code: df e8 88 c7 fd ff 48 8b 3d ba 28 6b 00 48 89 de e8 b1 af 05 00 41 59 5b
c9 c3 55 48 89 e5 53 48 89 fb 48 83 ec 08 e8 6a fa 39 00 <f0> ff 4b 50 0f 94
c0 84 c0 0f 84 8b 00 00 00 48 89 df e8 8b 76
RIP [<ffffffff8023262c>] mmput+0x11/0xb0
RSP <ffff8802279fbed0>
CR2: 0000000000000050

Now, to find out exactly where in my module the panic happened, I simply use gdb, as below:

#gdb <path to my module's .o file, relative to the kernel source root dir> [for example, #gdb drivers/misc/ankita.o]
....
(gdb) list *(<the routine>+0x59)

The above gives a source listing pointing to the lines that correspond to offset 0x59 into the routine, which serves as a good starting point for debugging the issue. Hope this helps some folks at least :-)
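For instance, with the trace above (the module path and routine name here are just placeholders), and assuming the module was built with debug information (-g, typically via CONFIG_DEBUG_INFO), the session would look roughly like this:

#gdb drivers/misc/ankita.o
....
(gdb) list *(my_module_routine+0x59)

gdb then replies with the source file and line number that offset 0x59 into the routine corresponds to, and prints the surrounding lines of code.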

Thursday, July 16, 2009

Looking at thread state inside gdb

I was recently helping a friend debug a multi-threaded application hang on Linux using gdb. She was having difficulty pinpointing exactly which threads were in a blocked state. Getting thread info only tells us the thread ids and the routine each thread is executing:

(gdb) info threads
12 Thread 4398053392560 (LWP 7234) 0x0000000010005740 in xxxx ()
11 Thread 4398054441136 (LWP 7235) 0x0000000010005740 in xxxx ()
10 Thread 4398055489712 (LWP 7237) 0x0000000010005748 in xxxx ()
9 Thread 4398063878320 (LWP 8263) 0x0000000010000768 in yyy ()
8 Thread 4398062829744 (LWP 8267) 0x0000000010003624 in zzz ()
7 Thread 4398061781168 (LWP 8270) 0x0000000010003624 in zzz ()
6 Thread 4398060732592 (LWP 8273) 0x0000000010001ce4 in ddd ()
5 Thread 4398059684016 (LWP 8276) 0x0000000010000750 in fff ()
4 Thread 4398058635440 (LWP 8277) 0x000000001001e128 in ttt ()
3 Thread 4398057586864 (LWP 8278) 0x0000000010003624 in zzz ()
2 Thread 4398056538288 (LWP 8281) 0x000000001001e140 in rrrr ()
1 Thread 269578240 (LWP 28151) 0x000000001000cba4 in www ()

She remarked that in another, proprietary debugger she had worked with in the past, she could easily get the thread status information as well. It turns out that in gdb we need to use the following commands to get the status information of the threads:

(gdb) help info proc
Show /proc process information about any running process.
Specify any process id, or use the program being debugged by default.
Specify any of the following keywords for detailed info:
mappings -- list of mapped memory regions.
stat -- list a bunch of random process info.
status -- list a different bunch of random process info.
all -- list all available /proc info.

(gdb) info proc 7234 stat
process 7234
cmdline = '/usr/bin/xyz'
warning: unable to read link '/proc/7234/cwd'
warning: unable to read link '/proc/7234/exe'
Process: 7234
Exec file: xyz
State: S
Parent process: 3230
Process group: 7234
Session id: 7234
TTY: 0
TTY owner process group: -1
Flags: 0x402040
Minor faults (no memory page): 96
Minor faults, children: 0
Major faults (memory page faults): 0
Major faults, children: 0
utime: 0
stime: 0
utime, children: 0
stime, children: 0
jiffies remaining in current time slice: 20
'nice' value: 0
jiffies until next timeout: 1
jiffies until next SIGALRM: 0
start time (jiffies since system boot): 4371
Virtual memory size: ...
Resident set size: ....
rlim: ......
Start of text: 0xb8013000
End of text: 0xb8025ca4
Start of stack: 0xbf925630

(gdb) info proc 7234 status
process 7234
cmdline = '/usr/bin/xyz'
warning: unable to read link '/proc/7234/cwd'
warning: unable to read link '/proc/7234/exe'
Name: xyz
State: S (sleeping)
Tgid: 7234
Pid: 7234
PPid: 3230
TracerPid: 0
Uid: 1000 1000 1000 1000
Gid: 1000 1000 109 1000
FDSize: 32
Groups: 4 20 24 46 106 117 118 123 124 1000
VmPeak: 4784 kB
VmSize: 4784 kB
VmLck: 0 kB
VmHWM: 636 kB
VmRSS: 636 kB
VmData: 380 kB
VmStk: 84 kB
VmExe: 76 kB
VmLib: 4020 kB
VmPTE: 20 kB
Threads: 1
SigQ: 0/16382
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000001002
SigCgt: 0000000180004001
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: ffffffffffffffff
Cpus_allowed: 00000000,00000003
Cpus_allowed_list: 0-1
Mems_allowed: 1
Mems_allowed_list: 0
voluntary_ctxt_switches: 627
nonvoluntary_ctxt_switches: 0

So, as we can see, we can get a whole lot of information about the threads, not just their state. The info proc command supports other options too, as shown by the help above. gdb picks up this information from the /proc/<pid> interface, and it helps to be able to look at all the pieces of debug information from the same interface.
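Since all of this comes from /proc anyway, a quick way to eyeball the state of every thread of the process at once (outside gdb) is a small shell loop over the per-thread status files; the pid below is just the one from the example above:

for t in /proc/7234/task/*; do
  echo -n "$(basename $t): "
  grep '^State:' $t/status
done

Each line of output then shows the thread id followed by its state, e.g. 'S (sleeping)' or 'D (disk sleep)'.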

Friday, July 3, 2009

Fake NUMA nodes in Linux

While NUMA systems are becoming commonplace, many times we do not have access to such systems when writing new code, learning about the NUMA architecture, conducting experiments, or debugging existing code. For such cases, the Linux kernel provides a very neat feature called 'fake NUMA nodes'. One can create fake NUMA nodes on a non-NUMA machine simply by passing a command-line parameter to the kernel. Below are the steps for x86 systems:

  1. The following config options need to be turned on: CONFIG_NUMA=y and CONFIG_NUMA_EMULATION=y
  2. Build the kernel with the above config options set
  3. The kernel command line could be any one of the following, depending on your requirement:
  • numa=fake=4 : Split the entire memory into 4 equal nodes
  • numa=fake=8*1024 : Split the memory into 8 equal chunks of 1024 MB (i.e. 1 GB) each (the number is taken to be in MB). [If the system has more memory, the remaining memory is assigned to the last node.]
  • numa=fake=2*512,2*1024 : Split the memory into 2 nodes of 512MB each and 2 more nodes of 1GB each (and so on)
Note: On ppc, the required nodes are specified as a cumulative, comma-separated list. For example, to create 4 nodes of 2 GB each, the parameter would be: numa=fake=2G,4G,6G,8G

You can play around with more options :) The userspace NUMA utilities like numactl and numastat will then show the NUMA environment that has been set up. Details of the cpumap and per-node meminfo can be obtained from the sysfs directory /sys/devices/system/node/node<0|1|2..>.
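As a quick sanity check (a minimal sketch; the exact numbers will of course differ on your machine), after booting with, say, numa=fake=4 appended to the kernel line in the bootloader config, something along these lines should be visible:

#numactl --hardware
available: 4 nodes (0-3)
....
#ls /sys/devices/system/node/
node0  node1  node2  node3
#cat /sys/devices/system/node/node0/meminfo
....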

Fake NUMA has one flaw, however, and that is the mapping of CPUs to nodes. There will be nodes that do not show up as having any CPUs (under the cpumap file in the node directory of the sysfs path mentioned above). As per NUMA semantics, a CPU must uniquely belong to one node; however, inside the kernel the CPU is mapped to all the fake nodes.

Fake NUMA nodes can be created even on a real NUMA system. In this case, the fake nodes are aligned within the real nodes, and the distance between two fake nodes that fall on different real nodes is maintained. I could cover the internal implementation details in a separate post. Have fun playing around with NUMA!