Sunday, June 21, 2009

Using large pages

Linux has supported large pages (also called huge pages) for a long time now. The size of a large page depends on the platform; on Intel, for example, it has mostly been 2MB. Because each page covers more memory, large pages need fewer TLB entries to map the same region and therefore cause fewer TLB misses. The downside is that they can waste more memory and increase fragmentation. Many applications use large pages for specific, designated purposes. For example, if large pages are supported and the required number is available, the JVM heap can be backed by large pages.

An application can request large pages using the shmget API:

#include <sys/ipc.h>

#include <sys/shm.h>

int shmget(key_t key, size_t size, int shmflg);

The SHM_HUGETLB flag, passed as part of the shmflg argument, requests that the segment be backed by large pages.
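To make the call concrete, here is a minimal sketch of creating and attaching a segment backed by large pages. The helper names (round_up_to_huge_pages, alloc_huge_shm) and the fixed 2MB HUGE_PAGE_SIZE are assumptions made for illustration; check Hugepagesize in /proc/meminfo for the real value on your system. Rounding the size up to a whole number of large pages is a conservative choice, not something the post requires.

```c
#include <stdio.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000 /* Linux value, in case the libc headers do not expose it */
#endif

/* Assumed large page size (2MB); verify against Hugepagesize in /proc/meminfo. */
#define HUGE_PAGE_SIZE (2UL * 1024 * 1024)

/* Round a byte count up to a whole number of large pages. */
size_t round_up_to_huge_pages(size_t bytes)
{
    return (bytes + HUGE_PAGE_SIZE - 1) / HUGE_PAGE_SIZE * HUGE_PAGE_SIZE;
}

/* Create and attach a private segment backed by large pages.
 * Returns the mapped address, or NULL on failure (errno is set). */
void *alloc_huge_shm(size_t bytes, int *shmid_out)
{
    size_t size = round_up_to_huge_pages(bytes);
    int shmid = shmget(IPC_PRIVATE, size, SHM_HUGETLB | IPC_CREAT | 0600);
    if (shmid < 0) {
        perror("shmget");
        return NULL;
    }

    void *addr = shmat(shmid, NULL, 0);
    if (addr == (void *)-1) {
        perror("shmat");
        shmctl(shmid, IPC_RMID, NULL);
        return NULL;
    }

    /* Mark the segment for removal; it is destroyed after the last detach. */
    shmctl(shmid, IPC_RMID, NULL);
    *shmid_out = shmid;
    return addr;
}
```

A caller would do something like `int id; char *p = alloc_huge_shm(64UL << 20, &id);` and later release the memory with `shmdt(p)`. The shmget call fails if the large page pool does not have enough free pages or the caller lacks the necessary permission, so a real application should fall back to normal pages in that case.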

The Linux kernel provides a /proc interface through which a pool of large pages can be reserved:

# echo 1000 > /proc/sys/vm/nr_hugepages

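The above asks the kernel to allocate 1000 large pages. If not enough contiguous free memory is available, fewer pages may actually be allocated; the HugePages_Total field shows how many you got. More information on large pages can be obtained from the /proc fs: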

# cat /proc/meminfo

MemTotal: 8114308 kB
MemFree: 5867312 kB
Buffers: 8412 kB
Cached: 107304 kB
SwapCached: 0 kB
Active: 48000 kB
Inactive: 87592 kB
Active(anon): 22704 kB
Inactive(anon): 0 kB
Active(file): 25296 kB
Inactive(file): 87592 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 4883752 kB
SwapFree: 4883752 kB
Dirty: 48 kB
Writeback: 36 kB
AnonPages: 20212 kB
Mapped: 10948 kB
Slab: 25988 kB
SReclaimable: 12916 kB
SUnreclaim: 13072 kB
PageTables: 2400 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 7916904 kB
Committed_AS: 46040 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 43496 kB
VmallocChunk: 34359693843 kB
HugePages_Total: 1000
HugePages_Free: 1000
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 3824 kB
DirectMap2M: 8384512 kB
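
If you want to check these counters from a program rather than by eye, the fields can be picked out of meminfo-style text with a few lines of C. This is only a sketch: meminfo_value and huge_pages_in_use are names made up for this example, and the text is assumed to have been read from /proc/meminfo into a string already. Note that "in use" here is simply total minus free; pages counted under HugePages_Rsvd are still reported as free.

```c
#include <stdlib.h>
#include <string.h>

/* Return the number following `key` (e.g. "HugePages_Free:") in
 * meminfo-style text, or -1 if no line starts with that key. */
long meminfo_value(const char *text, const char *key)
{
    size_t klen = strlen(key);
    const char *line = text;

    while (line && *line) {
        if (strncmp(line, key, klen) == 0)
            return strtol(line + klen, NULL, 10); /* strtol skips the padding spaces */
        line = strchr(line, '\n');
        if (line)
            line++; /* move to the start of the next line */
    }
    return -1;
}

/* Large pages currently handed out: HugePages_Total minus HugePages_Free.
 * Returns -1 if either field is missing. */
long huge_pages_in_use(const char *text)
{
    long total = meminfo_value(text, "HugePages_Total:");
    long free_pages = meminfo_value(text, "HugePages_Free:");

    if (total < 0 || free_pages < 0)
        return -1;
    return total - free_pages;
}
```

Because keys are matched at the start of a line, this works on /proc/meminfo as is; per-node files that prefix every line with "Node N" would need the prefix stripped first.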

On a NUMA system, the kernel splits large page allocations equally across the nodes. For example, on a system with 2 nodes, a request for 1000 large pages is split into 500 pages from each node. Per-node large page information is available through the /sys interface:

# cat /sys/devices/system/node/node1/meminfo

Node 1 MemTotal: 4194304 kB
Node 1 MemFree: 40004 kB
Node 1 MemUsed: 4154300 kB
Node 1 Active: 2166524 kB
Node 1 Inactive: 810704 kB
Node 1 Active(anon): 2127084 kB
Node 1 Inactive(anon): 8360 kB
Node 1 Active(file): 39440 kB
Node 1 Inactive(file): 802344 kB
Node 1 Unevictable: 0 kB
Node 1 Mlocked: 0 kB
Node 1 Dirty: 0 kB
Node 1 Writeback: 0 kB
Node 1 FilePages: 841792 kB
Node 1 Mapped: 11008 kB
Node 1 AnonPages: 2135884 kB
Node 1 PageTables: 5136 kB
Node 1 NFS_Unstable: 0 kB
Node 1 Bounce: 0 kB
Node 1 WritebackTmp: 0 kB
Node 1 Slab: 33704 kB
Node 1 SReclaimable: 30708 kB
Node 1 SUnreclaim: 2996 kB
Node 1 HugePages_Total: 500
Node 1 HugePages_Free: 498
Node 1 HugePages_Surp: 0

Recently, in one of the benchmarks (a Java benchmark) I was running, I was seeing a significant performance degradation of about 6-8%. After some debugging, the issue turned out to be that the application was not able to use the large pages that had been allocated (thanks to some weird environment I had ;-) ). To find out how many large pages an application is actually using, besides the meminfo output above, you can also look at numa_maps, where mappings backed by large pages are marked with the huge keyword. For example:

# cat /proc/<process pid>/numa_maps

00001000 default anon=1 dirty=1 N0=1
00400000 default file=<....library file info..> mapped=10 mapmax=3 N0=10
0050b000 default file=<....library file info..> anon=1 dirty=1 N0=1
0050c000 default heap anon=213 dirty=213 N0=213
00600000 default file=/SYSV00000000\040(deleted) huge dirty=472 N0=472
40600000 default
40601000 default anon=2 dirty=2 N0=2
40641000 default
40642000 default anon=4 dirty=4 N0=4
40682000 default
40683000 default anon=2 dirty=2 N0=2
4090f000 default
40910000 default anon=3 dirty=3 N0=3
40a68000 default
40a69000 default anon=4 dirty=4 N0=4
40a70000 default
40a71000 default anon=2 dirty=2 N0=2
40ab1000 default
40ab2000 default anon=2 dirty=2 N0=2

41fc9000 default anon=10 dirty=10 N0=10
427c9000 default anon=535 dirty=535 N0=535
2aaaaac00000 default file=/SYSV00000000\040(deleted) huge dirty=1 N0=1
7f6024000000 default anon=5578 dirty=5578 N0=5578
7f6027398000 default
7f602a402000 default anon=821 dirty=821 N0=821
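
Adding up the huge lines by hand gets tedious for a large process, so here is a rough sketch that totals them. It assumes the numa_maps text has already been read into a string, that fields are separated by spaces, and that the N<node>=<count> values on huge lines are counted in large pages; count_huge_pages and token_is are names invented for this example.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Return 1 if the token [tok, tok+len) is exactly the string s. */
static int token_is(const char *tok, size_t len, const char *s)
{
    return strlen(s) == len && strncmp(tok, s, len) == 0;
}

/* Sum the per-node page counts (N0=..., N1=...) over all lines of
 * numa_maps-style text that carry the "huge" keyword. */
long count_huge_pages(const char *text)
{
    long total = 0;
    const char *p = text;

    while (*p) {
        long line_pages = 0;
        int is_huge = 0;

        /* Walk the space-separated tokens of one line. */
        while (*p && *p != '\n') {
            while (*p == ' ')
                p++;
            const char *tok = p;
            while (*p && *p != ' ' && *p != '\n')
                p++;
            size_t len = (size_t)(p - tok);

            if (token_is(tok, len, "huge"))
                is_huge = 1;

            /* Per-node counts look like N0=472. */
            if (len > 2 && tok[0] == 'N' && isdigit((unsigned char)tok[1])) {
                const char *eq = memchr(tok, '=', len);
                if (eq)
                    line_pages += strtol(eq + 1, NULL, 10);
            }
        }

        if (is_huge)
            total += line_pages;
        if (*p == '\n')
            p++;
    }
    return total;
}
```

For the output above, the two huge mappings (472 and 1) add up to 473 large pages in use by the process.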


Jitesh Shah said...

How can the pages allocated by "echo 1000 > /proc/sys/vm/nr_hugepages" be used? why is such an interface provided in the first place?

Also, in what particular conditions can a user-land app need a large page?
(In the kernel, for performance reasons, et al, large pages can be used.. but how relevant is it in the user space?)

Anki said...

Hi Jitu,

So the proc interface is there to request a pool of large pages from the kernel. For example, if you know that your application needs about 1GB in large pages (with a page size of, say, 2MB), you first have to request/reserve that many pages from the kernel. You do this using the /proc/sys/vm/nr_hugepages interface. Your application can then use the shmget API to actually make use of them.
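
The number to write into nr_hugepages is just the required size divided by the large page size, rounded up. A small sketch of that arithmetic, with huge_pages_needed being a name made up for this example and the page size given in kB as reported by Hugepagesize:

```c
/* Number of large pages needed to cover `bytes`, rounding up.
 * page_kb is the large page size in kB (e.g. 2048 for 2MB pages). */
unsigned long huge_pages_needed(unsigned long long bytes, unsigned long page_kb)
{
    unsigned long long page_bytes = (unsigned long long)page_kb * 1024ULL;

    return (unsigned long)((bytes + page_bytes - 1) / page_bytes);
}
```

So for 1GB with 2MB pages, huge_pages_needed(1ULL << 30, 2048) gives 512, which is the value you would echo into nr_hugepages.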

As for applications that use large pages, some examples I can give are:
1) Databases use them
2) The JVM uses them for its heap

And as you mentioned, apps use large pages for the performance improvement itself. I can give you an example where the performance of a Java benchmark dropped by a significant 6-8% when not using large pages. Besides the fewer TLB entries I mentioned in the post, also note that large pages are locked in memory and are not subject to swapping, which further adds to the performance improvement.

However, one has to be careful when using large pages. Memory reserved as large pages is used only for such mappings (shared memory here) and not for other allocations. So if you reserve too much, the rest of the system could run out of memory and start swapping.

Hope that answers your questions !

Devi said...

Hi Anki,

How will I know how much memory I should reserve for large pages to improve my application's performance?


Glen Newton said...

I've collected a large number of Huge Page resources for various things (e.g. MySQL, Oracle, Linux, Java) on this page. I hope others find this useful...