Facebook believes it has developed a faster and more reliable solution for resolving common out-of-memory (OOM) situations. The company announced it is open sourcing oomd, a userspace OOM killer.
The solution was designed with two key features: pre-OOM hooks and a custom plugin system. According to the company, pre-OOM hooks provide visibility into an OOM situation before the workload is threatened, while the plugin system lets the company set custom policies for handling the workloads running on each host.
According to Facebook, oomd responds faster, is less rigid, and is more reliable than the Linux kernel OOM killer. In tests the company has performed, oomd proved to be an effective replacement for the default Linux kernel OOM killer.
The oomd solution was based on two key developments: Pressure Stall Information (PSI) and Cgroup2.
PSI is a Facebook-developed Linux kernel feature that tracks CPU, memory and I/O, and provides a canonical view of how the usage of those resources changes over time. When deployed in production, PSI acts as a barometer of upcoming resource shortages. “Thanks to this new ability to monitor key system resource indicators, oomd is able to take corrective action in userspace before a system-wide OOM occurs,” Facebook wrote in a post.
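To give a sense of what PSI looks like to a userspace program, here is a minimal Python sketch that parses the memory pressure report the Linux kernel exposes at /proc/pressure/memory on PSI-enabled systems. It is not oomd's own code; the function names, the sample values and the 10 percent threshold are assumptions made for illustration.

```python
from pathlib import Path

# Sample in the format the kernel uses for /proc/pressure/memory.
# The numbers are made up for illustration.
SAMPLE = (
    "some avg10=12.50 avg60=4.20 avg300=1.00 total=123456\n"
    "full avg10=3.00 avg60=1.10 avg300=0.25 total=6543\n"
)


def parse_psi(text):
    """Parse PSI text into {"some": {"avg10": 12.5, ...}, "full": {...}}."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        values = {}
        for field in fields:
            key, _, raw = field.partition("=")
            # "total" is a cumulative stall time in microseconds;
            # the avg* fields are percentages over 10s, 60s and 300s windows.
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result


def read_memory_pressure(path="/proc/pressure/memory"):
    """Read live memory pressure; requires a kernel with PSI enabled."""
    return parse_psi(Path(path).read_text())


def under_pressure(psi, threshold=10.0):
    """Hypothetical policy: flag when some tasks stalled >= threshold % of the last 10s."""
    return psi["some"]["avg10"] >= threshold


if __name__ == "__main__":
    psi = parse_psi(SAMPLE)
    print(psi["some"]["avg10"], under_pressure(psi))
```

A monitor built along these lines would poll read_memory_pressure() and act once the short-window average climbs, which is the kind of early warning the article describes.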
Cgroup2 organizes processes hierarchically and allocates system resources along that hierarchy in a controlled way.
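The hierarchy is visible as a directory tree, conventionally mounted at /sys/fs/cgroup, where each non-root cgroup with the memory controller enabled has a memory.current file reporting its usage in bytes. The Python sketch below walks such a tree and collects those values. It is an illustration only, not how oomd is implemented, and the function name and default mount point are assumptions.

```python
from pathlib import Path


def memory_usage_by_cgroup(root="/sys/fs/cgroup"):
    """Map each cgroup path (relative to root) to its memory.current value in bytes."""
    root = Path(root)
    usage = {}
    for stat_file in root.rglob("memory.current"):
        try:
            value = int(stat_file.read_text().strip())
        except (OSError, ValueError):
            # The cgroup may have vanished or be unreadable; skip it.
            continue
        usage[stat_file.parent.relative_to(root).as_posix()] = value
    return usage


if __name__ == "__main__":
    # Print cgroups from largest to smallest memory usage.
    usage = memory_usage_by_cgroup()
    for name, used in sorted(usage.items(), key=lambda item: item[1], reverse=True):
        print(f"{used:>15}  {name}")
```

Because accounting follows the hierarchy, a parent cgroup's figure includes the memory charged to its descendants, so a policy choosing which workload to act on would normally compare cgroups at the same level of the tree.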
“We have developed and deployed oomd in production at Facebook, and we’ve found that it has allowed us to decrease the frequency of livelocks on workloads ranging from build servers to rack switches to shared compute resources,” the company wrote.