Hugepages are a hardware feature designed to reduce pressure on the translation lookaside buffer (TLB) for applications that operate on large contiguous memory regions.
Take a program that operates on a large 2MB internal structure as an example. If the program accesses that space in such a way that one byte in each regular 4KB page is touched, 2MB/4KB = 512 TLB entries are needed. Each TLB miss requires an expensive page-table walk (and, if the page is not present, a fault into the kernel) to resolve. However, if the allocation is backed by a 2MB hugepage (by `mmap()`ing with `MAP_HUGETLB`), only 1 TLB entry is required.
On x86_64, there are two hugepage sizes: 2MB and 1GB. 1GB hugepages are also called gigantic pages. 1GB hugepages must be enabled on the kernel boot line with `hugepagesz=1G`. Hugepages, especially 1GB ones, should be allocated early, before memory fragments (i.e. at or near boot time), to increase the likelihood that they can be allocated successfully with minimal memory migration (i.e. defragmentation) required.
One way for an application to consume hugepages is through hugetlbfs, a pseudo-filesystem whose files are backed by hugepages:

mount -t hugetlbfs \
  -o uid=<value>,gid=<value>,mode=<value>,pagesize=<value>,size=<value>,\
  min_size=<value>,nr_inodes=<value> none /mnt/huge
The application then `open()`s a file on the mountpoint and `mmap()`s it; the resulting mapping is backed by hugepages.
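As a rough illustration, here is a minimal sketch of that pattern in C (the mount point `/mnt/huge` matches the mount example above; the file name `example` and the 2MB size are arbitrary assumptions):

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define LENGTH (2UL * 1024 * 1024) /* one 2MB hugepage */

int main(void)
{
    /* /mnt/huge is assumed to be a hugetlbfs mount; the file name is arbitrary. */
    int fd = open("/mnt/huge/example", O_CREAT | O_RDWR, 0600);
    if (fd < 0) {
        perror("open");
        exit(1);
    }

    /* The mapping is hugepage-backed because the file lives on hugetlbfs. */
    void *addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        perror("mmap");
        close(fd);
        exit(1);
    }

    /* ... use the memory ... */

    munmap(addr, LENGTH);
    close(fd);
    return 0;
}
```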
Doing this in Kubernetes would involve:
- The container having a predetermined hugetlbfs mount point where the application stores the files it wants backed by hugepages
- The kubelet mounting hugetlbfs on the host, then bind mounting it into the predetermined hugetlbfs mount point in the container
- The kubelet also managing the permissions on the mount point so that the application can manipulate the hugetlbfs
The application can also `mmap()` a hugepage directly and use it as anonymous memory:
addr = mmap(NULL, 256UL * 1024 * 1024, PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB, -1, 0);
`mmap()` failure (returning `MAP_FAILED` with `errno` set to `ENOMEM`) reflects that no hugepages of the requested size are available for the mapping. Note that the length is automatically rounded up to a multiple of the underlying hugepage size, so the 256MB request above consumes a full 1GB page.
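A minimal sketch of checking for that failure (assuming a toolchain where `MAP_HUGE_1GB` is visible via `<sys/mman.h>` with `_GNU_SOURCE`; on older systems it lives in `<linux/mman.h>`):

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    void *addr = mmap(NULL, 1UL << 30, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                      -1, 0);
    if (addr == MAP_FAILED) {
        /* ENOMEM here typically means no free 1GB hugepages. */
        fprintf(stderr, "mmap: %s\n", strerror(errno));
        return 1;
    }
    munmap(addr, 1UL << 30);
    return 0;
}
```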
The application can also `shmget()` a hugepage-backed segment directly (by passing `SHM_HUGETLB`) and use it as a shared memory segment. The user running the application needs to be a member of the group specified in `/proc/sys/vm/hugetlb_shm_group` to do this.
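A minimal sketch of the `shmget()` path (the 256MB size is an arbitrary assumption):

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define LENGTH (256UL * 1024 * 1024)

int main(void)
{
    /* SHM_HUGETLB requests that the segment be backed by hugepages. */
    int shmid = shmget(IPC_PRIVATE, LENGTH,
                       SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W);
    if (shmid < 0) {
        perror("shmget");
        exit(1);
    }

    void *addr = shmat(shmid, NULL, 0);
    if (addr == (void *) -1) {
        perror("shmat");
        shmctl(shmid, IPC_RMID, NULL);
        exit(1);
    }

    /* ... use the memory ... */

    shmdt(addr);
    shmctl(shmid, IPC_RMID, NULL);
    return 0;
}
```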
There are two flavors of 2MB pages, transparent and persistent. All 1GB pages are persistent.
Transparent huge pages (THP) are dynamically managed by the kernel. An application need only hint a memory region to the kernel with `madvise()` and `MADV_HUGEPAGE` for the kernel to use THP for that region. THP doesn't really require any knowledge on the part of Kubernetes, assuming the nodes are configured with the default value of `madvise` in `/sys/kernel/mm/transparent_hugepage/enabled`. Therefore THP is outside the scope of hugepage support for Kubernetes; this document will focus on persistent (preallocated) hugepages.
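For contrast with the persistent-page examples above, a minimal sketch of the THP hint (no hugepage preallocation is involved; the kernel decides whether to use hugepages, and the 64MB size is arbitrary):

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define LENGTH (64UL * 1024 * 1024)

int main(void)
{
    /* An ordinary anonymous mapping... */
    void *addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (addr == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }

    /* ...hinted so the kernel may back it with transparent hugepages. */
    if (madvise(addr, LENGTH, MADV_HUGEPAGE) != 0)
        perror("madvise");

    /* ... use the memory ... */

    munmap(addr, LENGTH);
    return 0;
}
```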
Huge pages can be allocated at boot time via kernel parameters.
For example, `hugepages=20` will allocate 20 hugepages of the default size (2MB). You can allocate both sizes of hugepages at boot with `hugepages=20 hugepagesz=1G hugepages=2`; this allocates 20 2MB pages and 2 1GB pages. Note that the parameters are order-sensitive: a `hugepages=` count applies to the most recent preceding `hugepagesz=`, or to the default size if none precedes it.
Allocating huge pages at runtime should be done as close to boot as possible. Memory fragments the longer the system runs, and defragmentation efforts may not be able to secure enough contiguous memory to allocate huge pages.
Runtime allocation can be done in a number of ways. `echo 20 > /proc/sys/vm/nr_hugepages` will allocate 20 hugepages of the default size. Unless `default_hugepagesz=1G` is specified in the kernel boot parameters, the default hugepage size is 2MB. There is no way to allocate non-default hugepage sizes through procfs.
sysfs, by contrast, can allocate all supported/enabled hugepage sizes.
`/sys/devices/system/node/` contains a directory for each memory node on the system. Hugepages are NUMA-sensitive, so this directory structure exists to enable control over the node on which hugepages are allocated. `/sys/devices/system/node/nodeX/hugepages` contains a directory for each supported/enabled hugepage size.
For example:
$ pwd
/sys/devices/system/node/node0/hugepages
$ find .
.
./hugepages-2048kB
./hugepages-2048kB/nr_hugepages
./hugepages-2048kB/surplus_hugepages
./hugepages-2048kB/free_hugepages
The only writable file is `nr_hugepages`, which will dynamically allocate or free hugepages of a particular size to/from a particular node. For example, `echo 2 > ./hugepages-2048kB/nr_hugepages` (run from the directory above) sets the number of 2MB hugepages on node0 to 2.
Normally, when an application requests a hugepage but none are available, the request fails. However, if `/proc/sys/vm/nr_overcommit_hugepages` is set non-zero, the request will be attempted using an on-demand allocation, which may or may not succeed depending on the amount of free memory available and memory fragmentation. Preallocated hugepages are held in reserve even after the application frees them; surplus hugepages are returned to the kernel allocator upon freeing.
Surplus pages can only be enabled for the default page size.
The total number of preallocated hugepages, per supported hugepage size (2MB, 1GB), can be obtained from sysfs:
$ ls /sys/devices/system/node/*/hugepages/*/nr_hugepages
/sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
The number of free preallocated hugepages, per supported hugepage size (2MB, 1GB), can be obtained from sysfs:
$ ls /sys/devices/system/node/*/hugepages/*/free_hugepages
/sys/devices/system/node/node0/hugepages/hugepages-1048576kB/free_hugepages
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages
The `hugetlb` cgroup controller can be used to limit hugepage usage at the cgroup level. Note that the limits are in bytes, not pages.
$ ls /sys/fs/cgroup/hugetlb
cgroup.clone_children hugetlb.1GB.limit_in_bytes hugetlb.2MB.limit_in_bytes release_agent
cgroup.procs hugetlb.1GB.max_usage_in_bytes hugetlb.2MB.max_usage_in_bytes tasks
cgroup.sane_behavior hugetlb.1GB.usage_in_bytes hugetlb.2MB.usage_in_bytes
hugetlb.1GB.failcnt hugetlb.2MB.failcnt notify_on_release
Statistics-gathering programs (e.g. cAdvisor) can get the number of hugepages in use per container by reading `hugetlb.1GB.usage_in_bytes` and `hugetlb.2MB.usage_in_bytes` and dividing by the respective hugepage sizes.
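As an illustration, a minimal sketch of that computation in C (the per-container cgroup path `mycontainer` is hypothetical; real paths depend on how the runtime lays out the hugetlb hierarchy):

```c
#include <stdio.h>

/* Read a cgroup file containing a single integer value; returns -1 on failure. */
static long long read_cgroup_value(const char *path)
{
    long long value = -1;
    FILE *f = fopen(path, "r");
    if (f != NULL) {
        if (fscanf(f, "%lld", &value) != 1)
            value = -1;
        fclose(f);
    }
    return value;
}

int main(void)
{
    /* Hypothetical per-container cgroup path. */
    long long bytes = read_cgroup_value(
        "/sys/fs/cgroup/hugetlb/mycontainer/hugetlb.2MB.usage_in_bytes");
    if (bytes >= 0)
        printf("2MB hugepages in use: %lld\n", bytes / (2LL * 1024 * 1024));
    return 0;
}
```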