Memory utilization is a key area where resource control can make big efficiency improvements. In this section we’ll look in detail at the cgroup2 memory controller, and how to get started configuring its interface files for controlling system memory resources.
Like all cgroup controllers, the memory controller creates a set of interface files in its child cgroups whenever it’s enabled. You adjust the distribution of memory resources by modifying these interface files, often within a Chef recipe or container job description.
Here are some of the memory controller's core interface files. Amounts in these files are expressed in bytes.
Note: Be sure to see the canonical cgroup2 reference documentation for additional details.
|memory.current|Shows the total amount of memory currently being used by the cgroup and its descendants. It includes page cache, in-kernel data structures such as inodes, and network buffers.|
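Reading these files is a matter of reading plain text out of the cgroup's directory in the cgroup2 filesystem. Below is a minimal Python sketch that reads memory.current for a cgroup; the mount point and slice name are assumptions and may differ on your hosts.

```python
from pathlib import Path

# Assumes cgroup2 is mounted at /sys/fs/cgroup and the cgroup of interest
# is workload.slice; adjust both for your own hierarchy.
CGROUP = Path("/sys/fs/cgroup/workload.slice")

def memory_current(cgroup: Path) -> int:
    """Return the cgroup's total current memory usage in bytes (memory.current)."""
    return int((cgroup / "memory.current").read_text())

if __name__ == "__main__":
    used = memory_current(CGROUP)
    print(f"{CGROUP.name}: {used / 2**30:.2f} GiB in use")
```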
Memory contention between system binaries and the main workload was one of the most common causes of resource problems. The fbtax2 project team experimented with a few different memory controller configurations before resolving the issue.
Because a primary goal of the fbtax2 cgroup hierarchy was restricting memory used by the system binaries in
system.slice, the team first tried setting a memory limit for
system.slice in its
memory.high config file.
The problem was that restricting memory on these system binaries made them more prone to thrashing and OOMs. Since some of these
system.slice processes are critical to the main workload, if they fail, the main workload stops running.
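For reference, setting such a limit amounts to writing a byte value into system.slice's memory.high file. The sketch below illustrates the idea; the 2 GiB figure is an assumption for illustration, not the value the fbtax2 team used.

```python
from pathlib import Path

SYSTEM_SLICE = Path("/sys/fs/cgroup/system.slice")  # assumes the usual mount point

def set_memory_high(cgroup: Path, limit_bytes: int) -> None:
    """Throttle and reclaim the cgroup once usage exceeds limit_bytes (memory.high)."""
    (cgroup / "memory.high").write_text(str(limit_bytes))

# Illustrative only: cap system.slice at 2 GiB (the approach the team later abandoned).
set_memory_high(SYSTEM_SLICE, 2 * 2**30)
```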
To work around this problem, the team instead used memory.low to soft-guarantee memory to workload.slice and, to a lesser extent, hostcritical.slice, where system binaries required by the workload are grouped.
On this host, the total memory available was around 32G. The team committed 19G to
workload.slice and 192M (0.2G) to
hostcritical.slice. In addition,
hhvm uses 6.95G for hugepages, which is locked memory and therefore hard consumption.
This adds up to approximately 26G, leaving just under 6G up for grabs. These
memory.low protections leave sufficient memory to prevent
system.slice failures, and reduce the likelihood of OOM kills by providing the kernel plenty of leeway when the whole system is under stress.
In addition, any memory that's guaranteed but unused can be allocated to other processes, further optimizing memory utilization.
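A sketch of what those protections look like when written directly to the interface files follows; the mount point is an assumption, and in practice the values would typically be set through systemd or Chef rather than by hand.

```python
from pathlib import Path

CGROUP_ROOT = Path("/sys/fs/cgroup")  # assumes cgroup2 is mounted here

def set_memory_low(slice_name: str, protected_bytes: int) -> None:
    """Soft-guarantee protected_bytes to the named cgroup via memory.low."""
    (CGROUP_ROOT / slice_name / "memory.low").write_text(str(protected_bytes))

# The protections described above: 19G for the main workload,
# 192M for the host-critical services it depends on.
set_memory_low("workload.slice", 19 * 2**30)
set_memory_low("hostcritical.slice", 192 * 2**20)
```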
To find optimal settings for
memory.low, the team first had to determine a working set size for memory. To get a baseline picture of the system's memory use, they queried memory.current.
To get an accurate result, it's necessary to read
memory.current when the system is under some memory pressure, due to the way the kernel hoards resources when it's idle. The fbtax2 team applied moderate memory pressure, then read
memory.current, repeating the process on a number of machines and averaging the result. This provided the baseline for
memory.low, which they optimized through further experimentation.
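The sample values below are hypothetical, but they show the shape of that calculation: collect memory.current readings taken under moderate pressure on several machines, then average them to get a starting point for memory.low.

```python
import statistics

# Hypothetical memory.current readings (bytes), each taken while the host
# was under moderate memory pressure; real numbers come from your own fleet.
samples_bytes = [
    20_100_000_000,
    19_600_000_000,
    20_800_000_000,
    19_900_000_000,
]

# The averaged, pressured reading approximates the working set and serves
# as the baseline from which memory.low is tuned further.
baseline = statistics.mean(samples_bytes)
print(f"working-set baseline: {baseline / 2**30:.1f} GiB")
```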
The 192MB in
hostcritical.slice is low, but sufficient to keep the processes there running. Note that the team set this amount for
hostcritical.slice in both
memory.low and memory.min: memory.low because some kernels still lack memory.min, but ultimately the hard guarantee offered by memory.min is required to protect hostcritical.slice.
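A minimal sketch of that dual setting, assuming the usual /sys/fs/cgroup mount point:

```python
from pathlib import Path

HOSTCRITICAL = Path("/sys/fs/cgroup/hostcritical.slice")
PROTECTION = 192 * 2**20  # the 192M protection described above

# Write the same amount to both knobs: memory.low covers kernels that
# predate memory.min; memory.min provides the hard guarantee where it exists.
for knob in ("memory.low", "memory.min"):
    try:
        (HOSTCRITICAL / knob).write_text(str(PROTECTION))
    except FileNotFoundError:
        pass  # older kernel without memory.min
```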
Swap historically has a bad reputation, especially on rotational disks; enabling swap unchecked can lead to thrashing and system lockups. The team discovered, however, that using swap in conjunction with other tools in this setup provided a number of benefits.
Because swap is slow on the hard disk drives used on many of the hosts in this tier, the team disabled swap for the main workload in the
workload-tw.slice cgroup, setting
memory.swap.max to 0, as shown above. This allowed less latency-sensitive processes in
system.slice to benefit from swap, while avoiding the slowdowns swap would cause for the main workload.
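Disabling swap for one cgroup while leaving it available elsewhere is a one-line write; the sketch below assumes workload-tw.slice sits under workload.slice at the default mount point.

```python
from pathlib import Path

WORKLOAD_TW = Path("/sys/fs/cgroup/workload.slice/workload-tw.slice")

# Writing 0 to memory.swap.max forbids swap for the main workload; cgroups
# that keep the default ("max") can still swap as needed.
(WORKLOAD_TW / "memory.swap.max").write_text("0")
```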
oomd is the key that makes the use of swap possible in this case. In the next section of this case study, we'll take a closer look at oomd, how it uses PSI metrics to resolve OOM situations more gracefully, and how it enables the use of swap.
In addition to
memory.current, the interface files below help you monitor memory use, see the results of changes, and take specific actions based on their settings.
|memory.pressure|A file containing the cgroup's memory pressure, a Pressure Stall Information (PSI) metric that gauges general memory health by measuring the CPU time lost due to lack of memory. Applications can monitor this pressure and use thresholds to trigger various actions, e.g., load shedding or killing processes when pressure spikes. See the PSI pressure metrics page for additional details.|
|memory.events|A file that shows the number of times certain memory events have occurred in the cgroup (for example, how often usage has crossed the low, high, or max boundaries, or an OOM or OOM kill has occurred). Generates file-modified events that allow applications to track and monitor changes.|
|memory.stat|A file that breaks down the cgroup's memory footprint into different types of memory (e.g., kernel stack, slab, sock, etc.) and provides additional info on the state and past events of the memory management system. Includes memory consumption of the cgroup's entire subtree.|
|memory.oom.group|Allows all processes of an entire cgroup to be handled as a single memory consumer, enabling the kernel OOM killer to compare the total memory consumption of the cgroup with other memory consumers (including other cgroups with memory.oom.group set). If the cgroup is selected, the OOM killer kills all of its processes together rather than picking individual processes.|
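As a simple illustration of acting on these files, the sketch below polls memory.pressure and flags spikes in the 10-second "some" average; the cgroup path, threshold, and polling interval are all assumptions to adapt to your environment.

```python
import time
from pathlib import Path

PRESSURE_FILE = Path("/sys/fs/cgroup/workload.slice/memory.pressure")
THRESHOLD = 10.0   # hypothetical: react when the "some" avg10 exceeds 10%
INTERVAL_S = 5

def some_avg10(path: Path) -> float:
    """Parse the 'some avg10=' field from a PSI pressure file."""
    for line in path.read_text().splitlines():
        if line.startswith("some"):
            fields = dict(kv.split("=") for kv in line.split()[1:])
            return float(fields["avg10"])
    return 0.0

while True:
    if some_avg10(PRESSURE_FILE) > THRESHOLD:
        print("memory pressure spike: shed load or alert here")
    time.sleep(INTERVAL_S)
```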