Wednesday, December 10, 2008

Collecting trace data from Linux kernel using klog

Many mechanisms exist to trace through the linux kernel code, both dynamically and statically. For example, for static tracing, one could use the markers or ftrace, and for dynamic tracing, there is kprobe, systemtap (which is built on top of kprobes). However, I was looking for a way to just dump some data from the kernel into a buffer and dump that data to the user at a later point in time. An example scenario being, while I am in the timer interrupt or in some other non-preemptible code section, I only wish to capture some timestamps, I cannot afford to do a printk from there. Also for instance, I want to instrument the scheduler code to capture timestamp information, I will be into trouble.. as, scheduler code being executed so very often, I will be bombarded with data and the depending on how much data I am collecting, the system could also become unusable.

I came across the klog kernel patch very recently. It was written by Tom Zanussi in 2005. Recently, Vaidyanathan Srinivasan made a few changes to klog, by adding the function ktrace, which makes it even more simpler to use it. Below, I describe how one could make use of this mechanism. Before we begin, note that klog is a static tracer. So, everytime you want to trace new code, the kernel will need to be recompiled.

Apply the klog patch to your kernel. This includes the changes made by Vaidy for ktrace. To trace through any code (ie, kernel routine), ktrace provides the following infrastructure:

1) Trace events (defined in include/linux/ktrace.h):

enum KTRACE_EVENT_ID {
KT_EVENT_FUNC_ENTER,
KT_EVENT_FUNC_EXIT,
KT_EVENT_INFO1,
KT_EVENT_INFO2,
KT_EVENT_INFO3,
KT_EVENT_INFO4,
KT_EVENT_ERROR,
};


2) Enumerator for the functions to be traced (include/linux/ktrace.h):

enum KTRACE_FUNC_ID {
KT_FUNC_tick_nohz_stop_sched_tick, /* 0 */
KT_FUNC_tick_nohz_restart_sched_tick, /* 1 */
};



3) Logging routines (lib/ktrace.c):

void ktrace_log2(unsigned char func, unsigned char event, uint64_t u1, uint64_t u2);
void ktrace_log4(unsigned char func, unsigned char event, uint32_t u1, uint32_t u2,
uint32_t u3, uint32_t u4);

So now, if you want to capture timestamps between the entry and exit of a routine x (assuming the routine executes in preempt disable mode), the following changes would be needed:

a) Add that function into the KTRACE_FUNC_ID,

enum KTRACE_FUNC_ID {
KT_FUNC_tick_nohz_stop_sched_tick, /* 0 */
KT_FUNC_tick_nohz_restart_sched_tick, /* 1 */
+ KT_FUNC_my_func_to_trace,
};


b) At the entry of the routine, capture the timestamp:

ktime_to_ns(start);

At the exit of the routine, make a call to ktrace_log2,

ktrace_log2(KT_FUNC_my_func_to_trace, KT_FUNC_EXIT,ktime_to_ns(now)-start, 0);

Now, whenever the routine is executed, the ktrace_log2 call will dump the information into the klog buffer. Compile and reboot into the modified kernel. Mount debugfs:

# mkdir /debug
# mount -t debugfs nodev /debug


Now run the klog.c program

# ./klog [-b subbuf-size -n n_subbufs]


The resultant trace data will be stored in ./cpu0..cpu[x] files. Remember that this data is in binary format and needs to be processed to convert into ascii. For that, a simple python program can be used. A sample python script could be as follows: (adopted from this script)

import os
import sys
import struct

# Globals
nr_cpus = 4

tracedata = []

cookedtrace = open("trace.txt", 'w')
for cpu in range(nr_cpus):
tracefile = open("cpu%d" % cpu)
while(1):
tracerecord = tracefile.read(24)
if not len(tracerecord):
break
tracerecordfields = struct.unpack("IBBBxQQ", tracerecord)
tracedata.append(tracerecordfields)
cookedtrace.write("U%-10d C%d F%X E%x %10d %10d\n" % tracerecordfields)
cookedtrace.close()

The above will generate trace.txt file, with the data in the different columsn. Further processing could be done based on the particular event type and function id, to capture more information. From the information captured, one could easily plot graphs usig gnuplot.