Friday, July 3, 2009

Fake NUMA nodes in Linux

While NUMA systems are becoming commonplace, we often do not have access to one when writing new code, learning the NUMA architecture, running experiments or debugging existing code. For such cases, the Linux kernel provides a very neat feature called 'fake NUMA nodes'. You can create fake NUMA nodes on a non-NUMA machine simply by passing a command-line parameter to the kernel. Below are the steps for x86 systems:

  1. The following config options need to be turned on: CONFIG_NUMA=y, CONFIG_NUMA_EMULATION=y
  2. Build the kernel with the above config options set
  3. Pass one of the following on the kernel command line, depending on your requirement:
  • numa=fake=4 : Split the entire memory into 4 equal nodes
  • numa=fake=8*1024 : Split the memory into 8 nodes of 1024MB (i.e. 1GB) each. The number is interpreted as MB. If the system has more memory, the last node is assigned the remainder
  • numa=fake=2*512,2*1024 : Split the memory into 2 nodes of 512MB each and 2 more nodes of 1GB each (and so on)
Note: On ppc, the required nodes are specified as a cumulative, comma-separated list. For example, to create 4 nodes of 2GB each the parameter would be: "numa=fake=2G,4G,6G,8G"
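
To make the x86 syntax concrete, here is a small Python sketch that models how a numa=fake string maps to node sizes, following only the rules listed above. The function name fake_node_sizes and the total_mb parameter are made up for illustration, and accepting a bare size without a count is an assumption; this is not the kernel's actual algorithm.

```python
def fake_node_sizes(spec, total_mb):
    """Model the x86 numa=fake=<spec> layouts described in this post.

    Returns a list of node sizes in MB. Illustration only: this follows
    the rules in the post, not the kernel's real implementation.
    """
    if spec.isdigit():
        # numa=fake=N: split all memory into N equal nodes.
        count = int(spec)
        return [total_mb // count] * count

    sizes = []
    for part in spec.split(","):
        # Each part is "<count>*<size in MB>"; a bare "<size in MB>" is
        # treated as a single node (an assumption of this sketch).
        count, _, size = part.rpartition("*")
        sizes.extend([int(size)] * (int(count) if count else 1))

    if sum(sizes) > total_mb:
        raise ValueError("requested nodes exceed available memory")
    # Any memory left over goes to the last node.
    sizes[-1] += total_mb - sum(sizes)
    return sizes


if __name__ == "__main__":
    print(fake_node_sizes("4", 8192))
    print(fake_node_sizes("8*1024", 9000))
    print(fake_node_sizes("2*512,2*1024", 3072))
```

For a hypothetical machine with 9000MB, "8*1024" yields seven 1024MB nodes and a final node that also absorbs the leftover memory.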

You can play around with more options :) Userspace NUMA utilities such as numactl and numastat will then show the NUMA environment that has been set up. The cpumap and per-node meminfo can be found under the sysfs directory /sys/devices/system/node/node<0|1|2..>.
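
As a rough sketch of reading that sysfs directory from a script, the Python below collects the CPU list of every node. It assumes each node directory exposes a cpulist file (the human-readable counterpart of cpumap); the function name node_cpulists and the root parameter are illustrative only.

```python
import glob
import os
import re


def node_cpulists(root="/sys/devices/system/node"):
    """Return {node_id: cpulist string} for every node<N> directory under root."""
    result = {}
    for path in glob.glob(os.path.join(root, "node[0-9]*")):
        match = re.fullmatch(r"node(\d+)", os.path.basename(path))
        if not match:
            continue
        # cpulist holds ranges such as "0-3"; it is empty for a CPU-less node.
        with open(os.path.join(path, "cpulist")) as f:
            result[int(match.group(1))] = f.read().strip()
    return result


if __name__ == "__main__":
    for node, cpus in sorted(node_cpulists().items()):
        print(f"node{node}: cpus {cpus or '(none)'}")
```

On a machine booted with fake nodes, this is a quick way to spot the nodes that show up without any CPUs.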

Fake NUMA has one flaw, however: the mapping of CPUs to nodes. Some nodes show up as having no CPUs at all (see the cpumap file in the node directory under the sysfs path mentioned above). By NUMA semantics, a CPU must belong to exactly one node. Inside the kernel, however, each CPU is mapped to all the fake nodes.

Fake NUMA nodes can be created even on a real NUMA system. In this case, the fake nodes are aligned within a real node, and the distances between fake nodes that sit on different real nodes are preserved. I may cover the internal implementation details in a separate post. Have fun playing around with NUMA!

Sunday, June 28, 2009

Build your kernel faster

Normally, when building custom kernels for our laptops or desktops, we tend to start from the kernel config file used by the particular distro. However, distro config files tend to be huge, with loads of modules turned on, even those that are not needed on our particular machine. This is because distro kernels need to cater to a wide range of system configurations. As a result, the kernel takes ages to compile! If you want to build your kernel fast by turning off all the modules/drivers that your system does not need, while ensuring that the kernel still has everything necessary, the streamline_config.pl script by Steven Rostedt is what you need. Here is the thread where Steven explains how this script can be used. In brief,

Run the script with your architecture's Kconfig file as the argument and save the output
  • # ./streamline_config.pl arch/x86/Kconfig > config_stream
Copy config_stream over your .config and run 'make oldconfig', or 'make menuconfig' if you want to continue configuring the kernel. Your build will now take much less time!

Sunday, June 21, 2009

Using large pages

Linux has had support for large pages (also called huge pages) for a long time now. The size of the large pages supported depends on the platform. For example, on Intel it has mostly been 2MB. With large pages, each TLB entry covers more memory, so fewer entries are needed to map the same working set and there are fewer TLB misses. However, large pages can also lead to more wasted memory and fragmentation. Many applications use large pages for specific, designated purposes. For example, if large pages are supported and the required number is available, the JVM heap can be composed of large pages.

An application can request large pages using the shmget API:

#include <sys/ipc.h>

#include <sys/shm.h>

int shmget(key_t key, size_t size, int shmflg);

Setting the SHM_HUGETLB flag in the shmflg argument requests that the segment be backed by large pages.

The Linux kernel provides an interface through which a pool of large pages can be reserved:

# echo 1000 > /proc/sys/vm/nr_hugepages

The above asks the kernel to allocate 1000 large pages; it may allocate fewer if it cannot find enough contiguous free memory. More information on large pages can be obtained from /proc:

# cat /proc/meminfo

MemTotal: 8114308 kB
MemFree: 5867312 kB
Buffers: 8412 kB
Cached: 107304 kB
SwapCached: 0 kB
Active: 48000 kB
Inactive: 87592 kB
Active(anon): 22704 kB
Inactive(anon): 0 kB
Active(file): 25296 kB
Inactive(file): 87592 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 4883752 kB
SwapFree: 4883752 kB
Dirty: 48 kB
Writeback: 36 kB
AnonPages: 20212 kB
Mapped: 10948 kB
Slab: 25988 kB
SReclaimable: 12916 kB
SUnreclaim: 13072 kB
PageTables: 2400 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 7916904 kB
Committed_AS: 46040 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 43496 kB
VmallocChunk: 34359693843 kB
HugePages_Total: 1000
HugePages_Free: 1000
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 3824 kB
DirectMap2M: 8384512 kB
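
The huge page lines are easy to miss in that wall of output, so here is a minimal Python sketch that pulls them out and works out the size of the pool. The function name hugepage_summary and the keys pool_kB and in_use are invented for this example; it assumes the HugePages_* and Hugepagesize fields are present, as in the output above.

```python
def hugepage_summary(meminfo_text):
    """Extract the huge page fields from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower().startswith("hugepage"):
            # Counts are plain numbers; Hugepagesize carries a "kB" suffix.
            fields[key.strip()] = int(value.split()[0])
    # Memory set aside for the pool, and pages currently not free.
    pool_kb = fields["HugePages_Total"] * fields["Hugepagesize"]
    in_use = fields["HugePages_Total"] - fields["HugePages_Free"]
    return {"pool_kB": pool_kb, "in_use": in_use, **fields}


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        print(hugepage_summary(f.read()))
```

For the output above, 1000 pages of 2048 kB each means 2048000 kB is set aside for the pool, with no page in use yet.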

On a NUMA system, the kernel splits large page allocations equally across the nodes. For example, if the system has 2 nodes, a request for 1000 large pages is split into 500 pages from each node. Per-node large page information can be obtained from the /sys interface:

# cat /sys/devices/system/node/node1/meminfo

Node 1 MemTotal: 4194304 kB
Node 1 MemFree: 40004 kB
Node 1 MemUsed: 4154300 kB
Node 1 Active: 2166524 kB
Node 1 Inactive: 810704 kB
Node 1 Active(anon): 2127084 kB
Node 1 Inactive(anon): 8360 kB
Node 1 Active(file): 39440 kB
Node 1 Inactive(file): 802344 kB
Node 1 Unevictable: 0 kB
Node 1 Mlocked: 0 kB
Node 1 Dirty: 0 kB
Node 1 Writeback: 0 kB
Node 1 FilePages: 841792 kB
Node 1 Mapped: 11008 kB
Node 1 AnonPages: 2135884 kB
Node 1 PageTables: 5136 kB
Node 1 NFS_Unstable: 0 kB
Node 1 Bounce: 0 kB
Node 1 WritebackTmp: 0 kB
Node 1 Slab: 33704 kB
Node 1 SReclaimable: 30708 kB
Node 1 SUnreclaim: 2996 kB
Node 1 HugePages_Total: 500
Node 1 HugePages_Free: 498
Node 1 HugePages_Surp: 0
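
A similar sketch works for the per-node files. The Python below extracts just the huge page counters from per-node meminfo text; node_hugepages is a hypothetical helper name, and the text passed in can be the concatenated contents of several node<N>/meminfo files.

```python
import re

# Matches lines such as "Node 1 HugePages_Total: 500"
_NODE_LINE = re.compile(r"Node\s+(\d+)\s+(\w+):\s+(\d+)")


def node_hugepages(node_meminfo_text):
    """Return {node_id: {"total": n, "free": n, "used": n}} of huge pages."""
    nodes = {}
    for match in _NODE_LINE.finditer(node_meminfo_text):
        node, key, value = int(match.group(1)), match.group(2), int(match.group(3))
        if key == "HugePages_Total":
            nodes.setdefault(node, {})["total"] = value
        elif key == "HugePages_Free":
            nodes.setdefault(node, {})["free"] = value
    for counts in nodes.values():
        counts["used"] = counts["total"] - counts["free"]
    return nodes


if __name__ == "__main__":
    import glob

    text = ""
    for path in glob.glob("/sys/devices/system/node/node*/meminfo"):
        with open(path) as f:
            text += f.read()
    print(node_hugepages(text))
```

For the output above, node 1 has 500 large pages in total, of which 498 are free and 2 are in use.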

Recently, in one of the benchmarks I was running (a Java benchmark), I was seeing a performance degradation of about 6-8%. After some debugging, the issue turned out to be that the application was not able to use the large pages that had been allocated (thanks to some weird environment I had ;-) ). To find out the number of large pages being used by the app, besides the above meminfo output, you can also use numa_maps. For example,

# cat /proc/<process pid>/numa_maps

00001000 default anon=1 dirty=1 N0=1
00400000 default file=<....library file info..> mapped=10 mapmax=3 N0=10
0050b000 default file=<....library file info..> anon=1 dirty=1 N0=1
0050c000 default heap anon=213 dirty=213 N0=213
00600000 default file=/SYSV00000000\040(deleted) huge dirty=472 N0=472
40600000 default
40601000 default anon=2 dirty=2 N0=2
40641000 default
40642000 default anon=4 dirty=4 N0=4
40682000 default
40683000 default anon=2 dirty=2 N0=2
4090f000 default
40910000 default anon=3 dirty=3 N0=3
40a68000 default
40a69000 default anon=4 dirty=4 N0=4
40a70000 default
40a71000 default anon=2 dirty=2 N0=2
40ab1000 default
40ab2000 default anon=2 dirty=2 N0=2
.....

41fc9000 default anon=10 dirty=10 N0=10
427c9000 default anon=535 dirty=535 N0=535
2aaaaac00000 default file=/SYSV00000000\040(deleted) huge dirty=1 N0=1
7f6024000000 default anon=5578 dirty=5578 N0=5578
7f6027398000 default
7f602a402000 default anon=821 dirty=821 N0=821
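
To total up the large pages an application has actually touched, the mappings marked "huge" can be summed per node. Below is a minimal Python sketch along those lines; the function name huge_pages_per_node is made up, and it assumes the N<node>=<count> fields on "huge" lines count large pages, which is how the output above is being read.

```python
import re


def huge_pages_per_node(numa_maps_text):
    """Sum the N<node>=<pages> counts over mappings marked 'huge'."""
    per_node = {}
    for line in numa_maps_text.splitlines():
        tokens = line.split()
        if "huge" not in tokens:
            continue  # not a large page mapping
        for token in tokens:
            match = re.fullmatch(r"N(\d+)=(\d+)", token)
            if match:
                node = int(match.group(1))
                per_node[node] = per_node.get(node, 0) + int(match.group(2))
    return per_node


if __name__ == "__main__":
    import sys

    # Usage (hypothetical script name): python huge_maps.py <pid>
    with open(f"/proc/{sys.argv[1]}/numa_maps") as f:
        print(huge_pages_per_node(f.read()))
```

For the lines shown above, the two "huge" mappings add up to 473 pages on node 0.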




Friday, June 19, 2009

Useful staps to track task movement across CPUs

Quite some time back, I was faced with a situation where I needed to track instances when a particular task was being migrated away from a CPU. It was in the context of a real-time system, where a real-time task was facing huge context switch delays. The obvious suspect being the scheduler, I used systemtap, besides other debugging, to infer a few things:
  • To find out if the task was being migrated away to some other CPU, I used the following trivial stap script:
/* Filename: migrate.stp
* Author: Ankita Garg <ankita@in.ibm.com>
* Description: Captures information on the migration of threads
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* © Copyright IBM Corp. 2009. All Rights Reserved.
*
*/

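/* Run as: stap migrate.stp <pid>. A pid of 0 disables the output. */
probe kernel.function("__migrate_task")
{
    /* Compare against the task being migrated ($p) rather than tid():
     * this function typically runs in the migration thread, not in the
     * task itself. */
    if (($1 != 0) && ($p->pid == $1)) {
        printf("thread %d (%s) is migrating from %d to %d\n", $p->pid,
               kernel_string($p->comm), $src_cpu, $dest_cpu);
    }
}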


  • Below is a script that tracks all the CPUs that a particular task ran on. Please note that it does not track every context switch.
/* Filename: chng_cpu.stp
* Author: Ankita Garg <ankita@in.ibm.com>
* Description: Captures information on the number of times java thread
* switches cpu
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* © Copyright IBM Corp. 2009. All Rights Reserved.
*
*/

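/* Last CPU seen for each thread id */
global threads

/* Run as: stap chng_cpu.stp <execname> */
probe kernel.function("finish_task_switch")
{
    /* Report only when the thread resumes on a different CPU than last time */
    if ((threads[tid()] != cpu()) && (tid() != 0) && (execname() == @1)) {
        printf("thread %d (%s) context switched on %d\n",
               tid(), execname(), cpu());
        printf("state: %d\n", task_state(task_current()))
        print_stack(backtrace())
    }
    threads[tid()] = cpu();
}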


These techniques are a bit dated, since the new tracepoints infrastructure can now do these things. But on older kernels, the above would still be useful. Expect more posts on kernel RAS features in due time.

Thursday, June 18, 2009

Importing .ics into Lotus Notes 8

I often get calendar invites for meetings on my non-Lotus Notes email IDs. The invites are normally in the .ics format. One can easily import them into Lotus Notes. Here is how:
  1. Compose a mail inside Notes
  2. Attach the .ics file to it
  3. Right-click the attachment and click on "View"
  4. The calendar view opens with the meeting details. Now accept/decline the invite, save and exit
Voila, the entry gets saved to your calendar. Of course, there might be other ways to achieve this in Notes :-)

Free up that memory

I recently came across this cool interface in the Linux kernel. Typically, memory is over-provisioned on a system. Instead of wasting it, the kernel normally uses a lot of it for the page cache, dentry cache and inode cache. These caches speed up I/O operations and improve performance. However, there are cases when a large amount of memory is actually needed by the apps. While most of the cache pages can be easily reclaimed, there is obviously some overhead involved (the pages could be dirty and might have to be written back to disk, thus incurring disk write latency). Linux has a neat kernel: while it uses its smarts to utilize the memory well, it also provides a method for people to indicate that they do not want the kernel to use its smarts ;-)

To free memory, just do the following:

# echo 1 > /proc/sys/vm/drop_caches

(the above frees only page cache)

# echo 2 > /proc/sys/vm/drop_caches

(for freeing dentry caches and inodes)

# echo 3 > /proc/sys/vm/drop_caches

(for freeing all of the above)

It is advisable to run 'sync' before dropping the caches. Dropping caches only frees clean pages, so writing the dirty pages back to disk first lets more of them be freed.
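
The sync-then-drop sequence can be wrapped in a few lines of Python. This is only a sketch: the function name drop_caches and its path parameter are illustrative, and writing to the real file requires root.

```python
import os


def drop_caches(level, path="/proc/sys/vm/drop_caches"):
    """Flush dirty pages, then ask the kernel to drop clean caches.

    level: 1 = page cache, 2 = dentries and inodes, 3 = both.
    """
    if level not in (1, 2, 3):
        raise ValueError("level must be 1, 2 or 3")
    os.sync()  # write dirty pages back first so that more can be freed
    with open(path, "w") as f:
        f.write(f"{level}\n")
```

Calling drop_caches(3) as root has the same effect as 'sync' followed by 'echo 3 > /proc/sys/vm/drop_caches'.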

Saturday, May 30, 2009

Simplifying GCC

GCC is the GNU Compiler Collection, which provides compilers for C, C++ etc. These compilers are the default on most *nix systems.

Here I describe a few simple command-line options that can prove to be quite useful.

  1. The simplest way to use GCC to compile a C source file is

    $ gcc -o test test1.c test2.c

    gcc is the C compiler, test1.c and test2.c are the input C source files, and -o lets us specify the name of the output file, here "test". Without the -o option, the executable is named "a.out" by default.

  2. The preprocessor:

    $ gcc -E test.c > test.out

    This option ensures the compilation process stops after the preprocessor has run. This helps us figure out issues/problems in macros.


  3. The Compiler:

    $ gcc -c test.c -o test.o

    This option ensures the compilation step completes but doesn't invoke the linker. This is useful if you just want to find and fix compilation warnings and errors, or to produce an object file that is linked later.


  4. Header Files:

    $ gcc -c test.c -I /location/of/header/files -o test.o

    Often the header files you want to use are located in some other directory. A "bad" practice is to hard-code the full path of the header files in the C source file.
    Instead, use this option. It tells the compiler which directories to search for the header files. The -I option can be used multiple times, once for each directory where header files are located.

  5. Library Files:

    $ gcc test.c -o test -L /usr/lib -lpthread


    Another frequent requirement is linking against standard libraries (NPTL threads etc.) or non-standard ones (expat etc.). The '-l' option tells the linker which library to use, while '-L' adds a directory in which to look for it. These options only matter at link time, so the example does not use -c. In the above example, the linker searches for the pthread library (libpthread) in the directory /usr/lib, in addition to the standard locations.

  6. Warnings, Errors, etc:

    $ gcc test.c -o test -Wall -Werror


    The -Wall option enables a large set of useful warnings that are not shown during regular compilation (despite the name, it does not enable every warning GCC has). These warnings are usually easy to fix, e.g. "unused variable", "implicit function declaration" etc. The -Werror option tells the compiler to treat all warnings as errors, so the build fails on any warning.
    Sometimes -Werror can be too strict for our purpose. Instead, you can treat only certain warnings as errors.
    e.g. -Werror-implicit-function-declaration (written -Werror=implicit-function-declaration in newer GCC versions): Treat only implicit function declaration warnings as errors. For more such options check the gcc man pages.

  7. Debugging:

    $ gcc -g test.c -o test

    This option includes debugging symbols in the output. This is required if one plans to use gdb for debugging (which is mostly the case).

  8. Optimizations:
    $ gcc -O2 test.c -o test

    This option lets the compiler optimize the code. -O takes an optimization level: 0 (no optimization, the default), 1, 2 or 3, while -Os optimizes for size.
    More info is available in the man pages of gcc.
These are the options that are used most frequently. Obviously, there are many more available. Use them as per your needs and refer to the man pages for the exhaustive list of options.