Hugepages are a hardware feature designed to reduce pressure on the translation lookaside buffer (TLB) for applications that operate on large contiguous memory regions.
Take a program that operates on a large 2MB internal structure as an example. If the program accesses that space in such a way that one byte in each regular 4KB page is touched, 2MB/4KB = 512 TLB entries are needed. Each TLB miss requires an expensive page-table walk (and, if the page is not present, a fault into the kernel) to resolve. However, if the allocation is backed by a 2MB hugepage (by `mmap()`ing with `MAP_HUGETLB`), only 1 TLB entry is required.
On x86_64, there are two hugepage sizes: 2MB and 1GB. 1GB hugepages are also called gigantic pages. 1GB hugepages must be enabled on the kernel boot line with `hugepagesz=1G`. Hugepages, especially 1GB ones, should be allocated early, before memory fragments (i.e. at or near boot time), to increase the likelihood that they can be allocated successfully with minimal memory migration (i.e. defragmentation) required.
One way for an application to consume hugepages is through hugetlbfs, a pseudo-filesystem whose files are backed by hugepages:

mount -t hugetlbfs \
  -o uid=<value>,gid=<value>,mode=<value>,pagesize=<value>,size=<value>,\
  min_size=<value>,nr_inodes=<value> none /mnt/huge
The application then `open()`s a file on the mountpoint and `mmap()`s it; the resulting mapping is backed by hugepages.
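As a rough illustration, here is a minimal sketch of that pattern in C (the mount point `/mnt/huge` matches the mount example above; the file name `example` and the 2MB size are arbitrary assumptions):

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define LENGTH (2UL * 1024 * 1024) /* one 2MB hugepage */

int main(void)
{
    /* /mnt/huge is assumed to be a hugetlbfs mount; the file name is arbitrary. */
    int fd = open("/mnt/huge/example", O_CREAT | O_RDWR, 0600);
    if (fd < 0) {
        perror("open");
        exit(1);
    }

    /* The mapping is hugepage-backed because the file lives on hugetlbfs. */
    void *addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        perror("mmap");
        close(fd);
        exit(1);
    }

    /* ... use the memory ... */

    munmap(addr, LENGTH);
    close(fd);
    return 0;
}
```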
Doing this in Kubernetes would involve:
- The container having a predetermined hugetlbfs mount point where the application stores the files it wants backed by hugepages
- The kubelet mounting hugetlbfs on the host, then bind mounting it into the predetermined hugetlbfs mount point in the container
- The kubelet also managing the permissions on the mount point so that the application can manipulate the hugetlbfs
The application can also `mmap()` a hugepage directly and use it as anonymous memory:
addr = mmap(NULL, 256UL * 1024 * 1024, PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB, -1, 0);
`mmap()` failure (returning `MAP_FAILED` with `errno` set to `ENOMEM`) reflects that no hugepages of the requested size are available for the mapping. Note that the length is automatically rounded up to a multiple of the underlying hugepage size, so the 256MB request above consumes a full 1GB page.
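A minimal sketch of checking for that failure (assuming a toolchain where `MAP_HUGE_1GB` is visible via `<sys/mman.h>` with `_GNU_SOURCE`; on older systems it lives in `<linux/mman.h>`):

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    void *addr = mmap(NULL, 1UL << 30, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                      -1, 0);
    if (addr == MAP_FAILED) {
        /* ENOMEM here typically means no free 1GB hugepages. */
        fprintf(stderr, "mmap: %s\n", strerror(errno));
        return 1;
    }
    munmap(addr, 1UL << 30);
    return 0;
}
```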
The application can also `shmget()` a hugepage-backed segment directly (by passing `SHM_HUGETLB`) and use it as a shared memory segment. The user running the application needs to be a member of the group specified in `/proc/sys/vm/hugetlb_shm_group` to do this.
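A minimal sketch of the `shmget()` path (the 256MB size is an arbitrary assumption):

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define LENGTH (256UL * 1024 * 1024)

int main(void)
{
    /* SHM_HUGETLB requests that the segment be backed by hugepages. */
    int shmid = shmget(IPC_PRIVATE, LENGTH,
                       SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W);
    if (shmid < 0) {
        perror("shmget");
        exit(1);
    }

    void *addr = shmat(shmid, NULL, 0);
    if (addr == (void *) -1) {
        perror("shmat");
        shmctl(shmid, IPC_RMID, NULL);
        exit(1);
    }

    /* ... use the memory ... */

    shmdt(addr);
    shmctl(shmid, IPC_RMID, NULL);
    return 0;
}
```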
There are two flavors of 2MB pages, transparent and persistent. All 1GB pages are persistent.
Transparent huge pages (THP) are dynamically managed by the kernel. An application need only hint a memory region to the kernel with `madvise()` and `MADV_HUGEPAGE` for the kernel to use THP for that region. THP doesn't really require any knowledge on the part of Kubernetes, assuming the nodes are configured with the default value of `madvise` in `/sys/kernel/mm/transparent_hugepage/enabled`. Therefore THP is outside the scope of hugepage support for Kubernetes; this document will focus on persistent (preallocated) hugepages.
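For contrast with the persistent-page examples above, a minimal sketch of the THP hint (no hugepage preallocation is involved; the kernel decides whether to use hugepages, and the 64MB size is arbitrary):

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define LENGTH (64UL * 1024 * 1024)

int main(void)
{
    /* An ordinary anonymous mapping... */
    void *addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (addr == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }

    /* ...hinted so the kernel may back it with transparent hugepages. */
    if (madvise(addr, LENGTH, MADV_HUGEPAGE) != 0)
        perror("madvise");

    /* ... use the memory ... */

    munmap(addr, LENGTH);
    return 0;
}
```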
Huge pages can be allocated at boot time via kernel parameters.
For example, `hugepages=20` will allocate 20 hugepages of the default size (2MB). You can allocate both sizes of hugepages at boot with `hugepages=20 hugepagesz=1G hugepages=2`; this allocates 20 2MB pages and 2 1GB pages. Note that the parameters are order-sensitive: a `hugepages=` count applies to the most recent preceding `hugepagesz=`, or to the default size if none precedes it.
Allocating huge pages at runtime should be done as close to boot as possible. Memory fragments the longer the system runs, and defragmentation efforts may not be able to secure enough contiguous memory to allocate huge pages.
Runtime allocation can be done in a number of ways. `echo 20 > /proc/sys/vm/nr_hugepages` will allocate 20 hugepages of the default size. Unless `default_hugepagesz=1G` is specified in the kernel boot parameters, the default hugepage size is 2MB. There is no way to allocate non-default hugepage sizes through procfs.
sysfs, by contrast, can allocate all supported/enabled hugepage sizes.
`/sys/devices/system/node/` contains a directory for each memory node on the system. Hugepages are NUMA-sensitive, so this directory structure exists to enable control over the node on which hugepages are allocated. `/sys/devices/system/node/nodeX/hugepages` contains a directory for each supported/enabled hugepage size.
For example:
$ pwd
/sys/devices/system/node/node0/hugepages
$ find .
.
./hugepages-2048kB
./hugepages-2048kB/nr_hugepages
./hugepages-2048kB/surplus_hugepages
./hugepages-2048kB/free_hugepages
The only writable file is `nr_hugepages`, which will dynamically allocate or free hugepages of a particular size to/from a particular node. For example, `echo 2 > ./hugepages-2048kB/nr_hugepages` (run from the directory above) sets the number of 2MB hugepages on node0 to 2.
Normally, when an application requests a hugepage but none are available, the request fails. However, if `/proc/sys/vm/nr_overcommit_hugepages` is set non-zero, the request will be attempted using an on-demand allocation, which may or may not succeed depending on the amount of free memory available and memory fragmentation. Preallocated hugepages are held in reserve even after the application frees them; surplus hugepages are returned to the kernel allocator upon freeing.
Surplus pages can only be enabled for the default page size.
The total number of preallocated hugepages, per supported hugepage size (2MB, 1GB), can be obtained from sysfs:
$ ls /sys/devices/system/node/*/hugepages/*/nr_hugepages
/sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
The number of free preallocated hugepages, per supported hugepage size (2MB, 1GB), can be obtained from sysfs:
$ ls /sys/devices/system/node/*/hugepages/*/free_hugepages
/sys/devices/system/node/node0/hugepages/hugepages-1048576kB/free_hugepages
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages
The `hugetlb` cgroup controller can be used to limit hugepage usage at the cgroup level. Note that the limits are in bytes, not pages.
$ ls /sys/fs/cgroup/hugetlb
cgroup.clone_children hugetlb.1GB.limit_in_bytes hugetlb.2MB.limit_in_bytes release_agent
cgroup.procs hugetlb.1GB.max_usage_in_bytes hugetlb.2MB.max_usage_in_bytes tasks
cgroup.sane_behavior hugetlb.1GB.usage_in_bytes hugetlb.2MB.usage_in_bytes
hugetlb.1GB.failcnt hugetlb.2MB.failcnt notify_on_release
Statistics-gathering programs (e.g. cAdvisor) can get the number of hugepages in use per container by reading `hugetlb.1GB.usage_in_bytes` and `hugetlb.2MB.usage_in_bytes` and dividing by the respective hugepage sizes.
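As an illustration, a minimal sketch of that computation in C (the per-container cgroup path `mycontainer` is hypothetical; real paths depend on how the runtime lays out the hugetlb hierarchy):

```c
#include <stdio.h>

/* Read a cgroup file containing a single integer value; returns -1 on failure. */
static long long read_cgroup_value(const char *path)
{
    long long value = -1;
    FILE *f = fopen(path, "r");
    if (f != NULL) {
        if (fscanf(f, "%lld", &value) != 1)
            value = -1;
        fclose(f);
    }
    return value;
}

int main(void)
{
    /* Hypothetical per-container cgroup path. */
    long long bytes = read_cgroup_value(
        "/sys/fs/cgroup/hugetlb/mycontainer/hugetlb.2MB.usage_in_bytes");
    if (bytes >= 0)
        printf("2MB hugepages in use: %lld\n", bytes / (2LL * 1024 * 1024));
    return 0;
}
```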