Abstract
The unrelenting growth in the memory demands of datacenter applications, combined with DRAM’s volatile prices and ever-increasing costs, has made DRAM a major infrastructure expense. Alternative technologies, such as NVMe SSDs and emerging NVM devices, offer higher capacity than DRAM at a fraction of the cost and power. A promising approach is to transparently offload colder memory to cheaper memory technologies via kernel or hypervisor techniques. The key challenge, however, is to develop a datacenter-scale solution that is robust in dealing with the diverse workloads and large performance variance of different offload devices, such as compressed memory, SSD, and NVM. This paper presents TMO, Meta’s transparent memory offloading solution for heterogeneous datacenter environments. TMO introduces a new Linux kernel mechanism that directly measures, in real time, process stalls due to resource shortages across CPU, memory, and I/O. Guided by this information and without any prior application knowledge, TMO automatically adjusts how much memory to offload to heterogeneous devices (for example, compressed memory or SSD) according to the device’s performance characteristics and the application’s sensitivity to memory-access slowdown. To maximize memory savings, TMO targets both anonymous memory and file cache, balancing the swap-in rate of anonymous memory and the reload rate of file pages that were recently evicted from the file cache. Moreover, it identifies offloading opportunities not only from the application containers but also from the sidecar containers that provide infrastructure-level functions. TMO has been in production since 2021, saving 20%–32% of total memory across millions of servers in our hyperscale datacenter fleet. We have successfully upstreamed TMO into the Linux kernel.
1. Introduction
The massive growth in memory needs of emerging applications such as machine learning (ML), coupled with the slowdown of dynamic random access memory (DRAM) device scaling13 and large fluctuations of the DRAM costs, has made DRAM prohibitively expensive as the sole memory-capacity solution.
In recent years, a variety of cheaper, non-DRAM memory technologies, such as non-volatile memory express (NVMe) solid-state drives (SSDs)7 and non-volatile memory (NVM),17,21,23 have been successfully deployed in datacenters, or are on their way. Moreover, emerging non-double-data rate (DDR) memory bus technologies such as compute express link (CXL)5 provide memory-like access semantics and close-to-DDR performance. The confluence of these trends enables new opportunities for memory tiering that were not possible before.2,3,8,9,11,14,18,19,20,22,24
With memory tiering, less-frequently accessed data is migrated to slower memory. The migration process can be driven by the application, a userspace library,4,16 the kernel, or the hypervisor. This paper focuses on kernel-driven migration, or swapping, as it can be transparently applied to unmodified applications.
Despite its conceptual simplicity, the only known large-scale adoption of kernel-driven swapping for latency-sensitive datacenter applications is Google’s deployment12 of zswap,25 referred to as g-swap in this paper. As a pioneer, g-swap advanced the state of the art but still has several major limitations.
First, g-swap supports only a single slow memory tier—a compressed memory pool. While this simplicity avoids the complexities of handling heterogeneous memory tiers with varying performance (for example, NVMe SSDs and NVM devices), it limits memory cost savings. As shown in §2.1, NVMe SSDs offer significantly better cost and power efficiency than compressed memory. Additionally, some data, such as quantized ML models, are inherently difficult to compress.
Second, to determine how much memory to offload, g-swap relies on extensive offline profiling and uses a static target page-promotion rate. This approach is inadequate because the promotion rate does not directly reflect an application’s sensitivity to memory-access slowdowns or the performance characteristics of the offloading device. Our evaluation in §4.3 demonstrates that, for a large Meta application, g-swap’s assumption of limiting offloading to a fixed promotion rate to maintain performance is flawed. In fact, with faster offloading devices, higher promotion rates can improve application performance.
To address these limitations, we built TMO, a transparent memory offloading solution for containerized environments. Fundamentally, TMO needs to answer two questions: how much memory to offload and what memory to offload. To answer the first question, TMO introduces a new kernel mechanism called pressure stall information (PSI), which directly measures in real time process stalls due to resource shortages across CPU, memory, and I/O. PSI is reported on a per-process and per-container basis. Unlike g-swap’s promotion-rate metric, PSI accounts for both the performance characteristics of the slow memory tier and the application’s sensitivity to memory-access slowdown. A userspace agent called Senpai uses the PSI metrics to dynamically decide how much memory to offload without prior application knowledge, while taking into account hardware heterogeneity in datacenters.
To answer the question of what memory to offload, we had to address several challenges. First, the Linux kernel attempted to balance memory reclamation between file cache and swap-backed anonymous memory, but it skewed heavily toward file cache through several heuristics. This relegated swap to being used only as an emergency overflow for memory, which is not suitable for offloading operations. To offload file and anonymous pages more evenly, we modified the kernel to balance the swap-in rate of anonymous memory and the reload rate of file pages that were recently evicted from the file cache.
Second, TMO holistically identifies offloading opportunities not only from the application containers but also from the sidecar containers, which provide infrastructure-level functions such as service discovery and configuration management. Finally, as memory is distributed across complex container hierarchies and containers may have different priorities, TMO accurately monitors each container’s memory needs and considers the hierarchies and properties of containers when making offloading decisions.
Currently, TMO enables transparent memory offloading across millions of servers in our datacenters, resulting in memory savings of 20%–32%. Of this, 7%–19% is from the application containers, while approximately 13% is from the sidecar containers.
The contributions of this paper are as follows:
We introduce PSI, a Linux kernel component that directly measures, in real time, process stalls due to resource shortages across CPU, memory, and I/O. This is the first solution that can directly measure an application’s sensitivity to memory-access slowdown without resorting to fragile low-level metrics, such as the page-promotion rate.
We introduce Senpai, a userspace agent that applies mild memory pressure to effectively offload memory across diverse applications and heterogeneous hardware with minimal impact on application performance. Compared to g-swap, Senpai offers key advantages: it does not require offline application profiling and supports both SSDs and zswap as slow memory tiers.
TMO performs memory offloading to swap at subliminal memory-pressure levels, with swap turnover proportional to that of the file cache. This is in contrast to the historical behavior of swapping only as an emergency overflow under severe memory pressure.
We report our experience of deploying TMO in production to millions of servers.
We have upstreamed PSI into the Linux kernel and also made Senpai open source.
2. Memory Offloading Opportunities and Challenges in Datacenters
This section begins with an analysis of DRAM and SSD cost trends from Meta’s datacenters, which host millions of servers. We then explore memory offloading opportunities across a diverse range of applications in our fleet, including datacenter and microservice memory tax, supported by a fleet-wide characterization of offloading potential. Next, we emphasize the importance of accounting for the complexities of the memory-allocation subsystem in designing an offloading system. Finally, we examine SSD heterogeneity across datacenters and the challenges it presents for heterogeneous memory offloading.
2.1 Memory and SSD cost trends.
Figure 1 shows the relative cost of DRAM, compressed memory, and SSD storage as a fraction of server cost in our datacenters. The x-axis shows different hardware generations. The compressed-memory cost is estimated based on a 3x compression ratio, representative of the average for our production workloads. Gen-1 hardware is near its end of life, while Gen-5 and Gen-6 are expected to be deployed in the near future. The cost of DRAM, as a fraction of server cost, is expected to grow and reach 33%. While not shown in the figure, DRAM power consumption follows a similar trend and is expected to reach 38% of our server infrastructure's power.
Using compressed memory can reduce the cost significantly, but it is still insufficient; we need alternative memory technologies such as NVMe SSDs to further drive down the cost more aggressively. NVMe SSDs provide a much larger memory footprint per server than DRAM, at a substantially cheaper cost and lower power per-byte. We equip all our production servers with a very capable NVMe SSD. At the system level, NVMe SSDs contribute to less than 3% of server cost (about 3x lower than compressed memory in our current generation of servers). Moreover, Figure 1 shows that, iso-capacity to DRAM, SSD remains less than 1% of server cost across generations (about 10x lower than compressed memory in cost-per-byte). These trends make NVMe SSDs much more cost effective compared to compressed memory for our fleet.
2.2 Cold memory as offloading opportunity.
Datacenter applications exhibit drastic differences in their memory behavior. To quantify the opportunity of memory offloading we characterize the memory coldness of seven large applications at Meta.
Figure 2 shows the amount of memory touched in the last N minutes, where N is one, two, or five, as well as the memory that remains untouched after five minutes. For example, for Feed, starting from the bottom, 50% of the memory is used in the last minute, an additional 8% in the last two minutes, and an additional 12% in the last five minutes. The remaining 30% stays cold past the five-minute interval. The memory coldness of applications varies drastically. For example, 81% of memory for Cache B is active within the last five minutes. By contrast, only 38% of memory for Web is actively used in the last five minutes. Overall, the memory-offloading opportunity (that is, the fraction of cold memory) averages about 35% but varies widely across applications, in a range of 19%–62%, which emphasizes the importance of an offloading method that is robust to applications' diverse memory behaviors.
2.3 Memory tax.
To ease the operation of applications in datacenters, a significant amount of memory is used to enable microservices and provide infrastructure-level functions.10 We define as datacenter memory tax the memory required for software packages, profiling, logging, and other supporting functions related to the deployment of applications in datacenters. We further define as microservice memory tax all the memory required by applications due to their disaggregation into microservices—for example, to support routing and proxy—and it is applicable uniquely to microservice architectures.
Figure 3 shows the average memory tax as a percentage of the total server memory across all workloads at Meta. Both datacenter and microservice tax account for a significant percentage of memory usage. On average, the memory tax accounts for 20% of the total memory capacity. Datacenter memory tax is 13% and it is uniform across all workloads. Microservice memory tax accounts for 7% on average and can vary depending on application characteristics. Notably, the performance SLA for most of the memory tax is more relaxed than that of memory directly consumed by applications. As a result, the memory tax was a prime target for memory offloading during our first production launch of TMO.
2.4 Anonymous and file-backed memory.
Memory is separated into two main categories: anonymous memory and file-backed memory. Anonymous memory is allocated by applications and is not backed by a file or a device. File-backed memory is allocated in relation to a file and is cached in the kernel's page cache.
Figure 4 shows the breakdown of anonymous and file-backed memory for several large applications, datacenter memory tax, and microservice memory tax. The breakdown varies wildly across applications and memory taxes.
Overall, we need to consider offloading opportunities for both anonymous and file-backed memory to maximize the savings.
2.5 Hardware heterogeneity of offload backend.
We define a memory offload backend as the slow-memory tier that holds offloaded memory. In our current production fleet this consists of NVMe SSDs and a compressed memory pool. In the future, we expect this to include NVM and CXL devices.
NVMe SSD device heterogeneity is a significant challenge in datacenter environments. Multiple factors unavoidably create heterogeneous hardware in large-scale datacenters, including datacenter turn-ups, hardware refresh, and the need to maintain a diverse supply chain from different vendors.
Figure 5 shows, on a log scale, the endurance, read and write IOPS, and p99 latency of the major SSD types across Meta's fleet. Newer devices are toward the right of the figure. Notably, although SSD endurance has improved over SSD generations, it remains a limited resource and should be used judiciously by a memory-offloading system. Furthermore, although IOPS are relatively stable across generations, read and write latency varies significantly, ranging from 470μs to 9.3ms.
Besides SSDs, we also use compressed memory as an offload backend. The p90 latency of a 4KB read from compressed memory is about 40μs. Compared with SSDs, compressed memory is an order of magnitude faster and exhibits much smaller latency variance. Moreover, compressed memory avoids the endurance limits of SSDs. Overall, a memory-offloading system needs to effectively handle a heterogeneous fleet despite large differences among offload backends.
3. TMO Design
The goal of TMO is to transparently offload memory to heterogeneous back ends that offer cost-effective but slower memory accesses. TMO’s workload-transparent design allows for a seamless deployment across a diverse set of applications and heterogeneous infrastructure. Fundamentally, TMO addresses the questions of how much memory to offload and what memory to offload.
3.1 Transparent memory offloading architecture.
Figure 6 provides an overview of the TMO architecture (left) and its memory and storage layout (right). TMO integrates components across userspace and the kernel. Unmodified workloads run in containers, interacting with the kernel memory-management subsystem via system calls and paging. The userspace component, Senpai, manages memory offloading by determining how much memory to offload from each workload. It relies on pressure metrics from the kernel's PSI module, detailed in §3.2.
Using PSI feedback, Senpai triggers memory reclamation by writing to cgroup control files, activating the kernel's reclamation logic to decide which memory to offload. Modifications to the kernel's reclaim algorithms are discussed in §3.4. The memory-management subsystem collects pressure data, performs read/write operations to the offload back ends, and interfaces with the regular file system. TMO supports both compressed memory pools via zswap and storage devices through swap. The right side of Figure 6 illustrates the memory and storage layout, highlighting TMO's ability to target workload memory as well as the datacenter/microservice memory tax. TMO offloads memory to the back ends through compression, swapping, or discarding of page cache, and reverses the process to restore memory when needed. Supported back ends include zswap in DRAM and swap/filesystem on SSDs.
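To make the control-file interaction concrete, here is a minimal sketch (in Python, not Senpai's actual code) of how a userspace agent can ask the kernel to reclaim memory from one container's cgroup. It assumes a cgroup v2 hierarchy and a kernel that exposes the memory.reclaim interface; the cgroup path is hypothetical, and agents without this interface can approximate the effect by temporarily lowering memory.high.

```python
# Hedged sketch: request proactive reclaim from a cgroup v2 container.
# Assumes the kernel exposes memory.reclaim; the cgroup path is illustrative.
import os

CGROUP = "/sys/fs/cgroup/workload.slice/app.scope"  # hypothetical container cgroup

def current_usage(cgroup=CGROUP):
    """Return the cgroup's current memory footprint in bytes (memory.current)."""
    with open(os.path.join(cgroup, "memory.current")) as f:
        return int(f.read())

def reclaim(nr_bytes, cgroup=CGROUP):
    """Ask the kernel's reclaim algorithm to offload nr_bytes from this cgroup."""
    with open(os.path.join(cgroup, "memory.reclaim"), "w") as f:
        f.write(str(nr_bytes))

if __name__ == "__main__":
    # Example: try to offload 1% of the container's current footprint.
    reclaim(current_usage() // 100)
```

Which pages are actually evicted, compressed, or swapped remains the kernel reclaim algorithm's decision, as described in §3.4.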
3.2 Defining resource pressure.
Determining how much memory to offload requires measuring its specific impact on application performance, which is challenging due to workload diversity and the need to isolate memory-related issues from other factors.
The OS kernel provides event counters, such as major page-fault counts, which have been commonly used to approximate the performance impact of memory offloading. However, elevated fault counts may result from workload startup or working-set transitions rather than memory shortages. Furthermore, in heterogeneous memory systems, a given fault rate might be problematic on slow storage but negligible on faster devices.
Understanding whether specific kernel events represent functional issues for a workload is inherently complex. We demonstrate in the evaluation (§4.3) that metrics such as the major-fault rate or the promotion rate have limitations, especially when considering heterogeneous offloading back ends with significant performance diversity.
3.2.1 PSI metrics.
Fundamentally, PSI exposes a metric that represents the amount of lost work due to the lack of a resource. It can be measured for a single process, a container, or machine-wide. PSI calculates pressure metrics by considering only non-idle processes. For each non-idle process, PSI further distinguishes between periods of time when a process is exclusively either runnable or stalled due to insufficient resources. Furthermore, it defines the compute potential as the number of non-idle processes, capped at the number of CPUs. PSI is the proportion of compute potential that is unproductive due to resource stalls, often represented as a percentage. For containers and whole-system domains, PSI introduces two pressure indicators for each resource, called some and full. The some metric tracks the percentage of time in which at least one process within the domain is stalled waiting for the resource. The full metric tracks the percentage of time in which all processes are delayed simultaneously.
Consider the example in Figure 7, which shows the execution time of two processes, A and B, as well as their stall time (dotted boxes). The some stall time is shown with a blue arrow, while full stall time is shown with a green arrow. The execution time is normalized to 100% and partitioned into four quarters. During the first quarter, only one process, either A or B, stalls at a time; hence 12.5% of the time is accounted for by the some metric. In the second quarter, both processes stall concurrently for 6.25% of the time, which is accounted for by the full metric; in addition, 18.75% of stall time is accounted for by the some metric. The next two quarters show different variations of stalls and how some and full account for them. Overall, some aims to capture the added latency that individual processes experience due to lack of a resource, while full indicates the amount of completely unproductive time in the container or system.
3.2.2 PSI across system resources.
To track memory pressure, PSI records time spent on events that occur exclusively when there is a shortage of memory. Currently, this includes three occasions. The first is when a process tries to allocate new pages while memory is full and has to reclaim pages itself. The second is when a process must wait for IO to service a refault, that is, a major fault against a page that was recently evicted from the file cache. The third is when a process blocks on reading a page back in from the swap device.
Block IO stalls are more difficult to account for accurately because existing hardware provides little insight into device contention. In particular, we cannot tell whether a portion of a block IO stall is due to the device being oversubscribed or is simply the expected latency of a device access. We therefore treat any process waiting on block IO completion as stalled due to lack of IO. This has worked well for us in production across diverse workloads.
CPU stalls are accounted for as the periods of time when a process is runnable but needs to wait for an idle CPU to become available. CPU full pressure is only possible within a container; it occurs when none of the container's processes can execute, either due to outside competition or due to configured limits on the cgroup's CPU cycles.
3.2.3 PSI comparison to other metrics and cost.
One existing mechanism is the resident set size (RSS), which tracks the amount of main memory that belongs to a process. The main limitation of RSS is that by itself it does not capture the impact of memory, or a lack thereof, on application performance. Other metrics, like promotion rate, account for the number of swap-ins per second. A drawback of the promotion rate is that it does not take into account the performance characteristics of the offloading back end. Furthermore, it fails to capture application performance improvements as more memory becomes available due to offloading.
Instead, PSI directly captures the impact of memory-access slowdown to an application and further incorporates the performance and utilization characteristics of the offloading back end. The main cost of PSI is scheduling latency since some logic needs to be performed on a context switch. In real applications in our fleet, the overhead is negligible. Beyond datacenters, PSI is enabled by default on all major Linux distributions, including Android. We further compare PSI to other metrics in Section 4.
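For comparison, a promotion-rate-style metric has to be assembled from raw kernel counters, as in the sketch below, which samples the pswpin field of /proc/vmstat (a standard counter; the sampling window is arbitrary). The resulting swap-ins-per-second figure carries no information about how fast the offload back end is or how much the stalls actually hurt the application, which is exactly the gap PSI closes.

```python
# Sketch: derive a promotion rate (swap-ins per second) from /proc/vmstat.
# The pswpin counter is a standard kernel statistic; the 10-second sampling
# window below is illustrative.
import time

def read_pswpin():
    with open("/proc/vmstat") as f:
        for line in f:
            name, value = line.split()
            if name == "pswpin":
                return int(value)
    return 0

def promotion_rate(interval=10.0):
    before = read_pswpin()
    time.sleep(interval)
    after = read_pswpin()
    return (after - before) / interval  # pages swapped in per second

if __name__ == "__main__":
    print("swap-ins/sec: %.1f" % promotion_rate())
```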
3.2.4 PSI use cases.
By aggregating process state times across all CPUs and breaking them down per container, PSI provides intuitive insights into resource provisioning quality and helps diagnose performance issues or SLO violations, such as missed response deadlines. Pressure metrics are updated in real time, available at microsecond resolution, and as running averages (10s, 1m, 5m). This granularity supports effective resource management across the pressure spectrum.
High full pressure indicates critical productivity losses requiring immediate action. These can result from overlapping workload peaks, application bugs, or misconfigurations. Early intervention prevents service disruptions and server health issues. For example, userspace out-of-memory (OOM) killers can monitor full metrics to implement proactive policies, mitigating delays that breach SLOs before the kernel's OOM killer activates. Some pressure captures latency impacts from resource shortages and detects delays below performance-degrading thresholds. TMO relies on some metrics to balance resource contention, keeping values low but non-zero to avoid idleness without disrupting workloads. This ensures workloads operate with minimal yet sufficient resources.

Before PSI, assessing resource health relied on indirect metrics, such as kernel time, throughput variations, or reclaim activity, and required expertise in hardware and kernel behavior. For example, file read-ahead could mask cache re-reads. PSI, by contrast, directly measures productivity losses from qualifying stall events, accounting for hardware differences, memory-management efficiency, and workload concurrency (some vs. full stalls).
3.3 Determining memory requirements of containers.
Senpai is a userspace tool responsible for driving memory offload, using PSI metrics to determine how much memory can be moved out.
Estimating a workload’s memory requirements is challenging. Applications often allocate or cache memory that is rarely or only once used, but the kernel only reclaims such cold pages when memory is scarce. As a result, memory footprints tend to exceed what is truly necessary for normal operation. While developers quickly notice when applications run out of memory, overprovisioning often goes unnoticed. Senpai continuously engages the kernel’s reclaim algorithm, using PSI metrics to gauge workload health and dynamically adjust reclaim aggressiveness. This identifies the essential share of allocated memory while offloading excess, ensuring optimal use of available capacity. Over time, this proactive approach generates an accurate working set profile, enabling developers to provision memory more precisely for their workloads.
Figure 8 shows a high-level overview of Senpai’s operations. Once every few seconds, Senpai calculates for each cgroup the amount of memory to reclaim as follows:
mem_offload = mem_current × reclaim_ratio × max(0, 1 − psi_some / psi_threshold)

Here, psi_some is the cgroup's some PSI metric over the reclaim period, psi_threshold and reclaim_ratio are configurable parameters, and mem_current is the current memory footprint of the cgroup. No memory is reclaimed when psi_some is above psi_threshold. Otherwise, Senpai asks the kernel to reclaim mem_offload from the cgroup. As psi_some approaches psi_threshold, Senpai gradually reclaims less memory in order to achieve a mild steady-state memory pressure.

We studied the performance sensitivity of applications related to file cache and anonymous memory, and iteratively arrived at the current configuration used in production for all applications, specifically, reclaim_ratio = 0.0005 and psi_threshold = 0.1%. In production, reclaim is performed every six seconds. We set this value empirically to leave enough time to measure the delayed impact (refaults) of reclaimed memory. The step size, that is, how much memory is reclaimed (mem_offload), depends on how far the observed pressure (psi_some) is from the target threshold (psi_threshold), and is capped at 1% of the total workload size in each reclaim period. As a result, reaction time to extreme contraction tends to be minutes; adaptation to workload expansion, on the other hand, is immediate. Senpai's effectiveness is largely insensitive to parameter variations, though certain workloads, like batch jobs with relaxed SLOs, can tolerate higher memory pressure, offering additional offloading opportunities. Future work includes exploring automated parameter tuning to further enhance memory savings.
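Putting the formula and parameters together, below is a minimal sketch of a Senpai-style control loop for a single cgroup, assuming the PSI and memory.reclaim interfaces sketched earlier and using the avg10 running average as the pressure signal. The cgroup path is hypothetical; the production agent additionally walks the container hierarchy, caps the per-period step size, and handles many corner cases.

```python
# Minimal Senpai-style loop for one cgroup (a sketch, not the production agent).
# Each period it reads "some" memory pressure and requests
#   mem_offload = mem_current * RECLAIM_RATIO * max(0, 1 - psi_some / PSI_THRESHOLD)
import os
import time

CGROUP = "/sys/fs/cgroup/workload.slice/app.scope"  # hypothetical cgroup path
RECLAIM_RATIO = 0.0005   # configuration reported in the text
PSI_THRESHOLD = 0.1      # percent; configuration reported in the text
PERIOD = 6               # seconds between reclaim attempts

def psi_some(cgroup):
    with open(os.path.join(cgroup, "memory.pressure")) as f:
        for line in f:
            if line.startswith("some"):
                fields = dict(kv.split("=") for kv in line.split()[1:])
                return float(fields["avg10"])   # percent of time stalled
    return 0.0

def mem_current(cgroup):
    with open(os.path.join(cgroup, "memory.current")) as f:
        return int(f.read())

while True:
    pressure = psi_some(CGROUP)
    if pressure < PSI_THRESHOLD:
        offload = int(mem_current(CGROUP) * RECLAIM_RATIO *
                      (1 - pressure / PSI_THRESHOLD))
        if offload > 0:
            try:
                with open(os.path.join(CGROUP, "memory.reclaim"), "w") as f:
                    f.write(str(offload))
            except OSError:
                pass  # the kernel may reclaim less than requested; ignored in this sketch
    time.sleep(PERIOD)
```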
3.4 Kernel optimizations for memory offloading.
Senpai relies on the kernel to reclaim cold memory pages. Instead of using expensive full page-table scans to determine which pages are cold, Senpai lets the kernel's reclaim algorithm choose the pages to offload. The original kernel reclaim process avoided swapping but caused file-cache thrashing, even when there was an abundance of cold anonymous memory available. Fundamentally, at the time the algorithm was conceived, the kernel lacked insight into whether file-cache reclamation forced working-set file pages to be reloaded from storage. We addressed this by adding non-resident cache tracking. Specifically, we added refault detection to the Linux kernel, which can determine whether a faulting page was recently resident. We then modified the reclamation algorithm to reclaim only from the file cache as long as it is not refaulting; once refaults occur, they are balanced across the file cache and swap. This leads the kernel to balance reclaim better and results in more memory offloading with no impact on application performance.
4. Evaluation
TMO has been running in production since 2021 for more than a year, leading to significant memory savings across Meta’s fleet. Our evaluation focuses on showcasing different aspects of TMO. Specifically, it answers the following questions:
How much memory can TMO save?
How does TMO impact memory-bound applications?
Are PSI metrics more effective than the promotion-rate metric?
4.1 Fleet-wide memory savings.
We break TMO's memory savings into savings from applications, from the datacenter memory tax, and from the microservice memory tax.
Application savings. Figure 9 shows the relative memory savings achieved by TMO for eight representative applications using different offload back ends: either compressed memory or SSDs. With a compressed-memory back end, TMO saves 7%–12% of resident memory across five applications.
Several applications' data compress poorly, so their memory is offloaded more effectively to an SSD back end. For those applications, Figure 9 shows that TMO achieves significant savings of 10%–19% with an SSD back end. Such savings, which do not rely on compression, would be unattainable with previous approaches.12 Specifically, ML models used for ads prediction commonly use quantized, byte-encoded values that exhibit a compression ratio of only 1.3–1.4x, leading to poor memory savings through compressed-memory offloading. For such applications, SSD offloading provides a more cost-effective solution. Overall, across compressed-memory and SSD back ends, TMO achieves significant savings of 7%–19% of total memory, without any noticeable application performance degradation.
Datacenter and microservice memory tax savings. Beyond regular workload memory, TMO further targets the datacenter and microservice memory tax.
Figure 10 shows the relative memory savings from offloading the memory tax across Meta's fleet. For the datacenter tax, TMO saves on average 9% of the total memory within a server. Microservice tax savings account for another 4%. Overall, TMO achieves on average 13% of memory-tax savings, in addition to the workload savings above. This is a significant amount of memory at the scale of Meta's fleet.
4.2 Performance impact on memory-bound applications.
In this section, we demonstrate how TMO enhances the performance of memory-bound applications, focusing on the Web application, one of the largest workloads at Meta. Using our high-fidelity production load-testing framework, we conduct A/B tests to guide hardware and software optimizations across Meta’s fleet. The experiments use production Skylake 64GB hosts, representative of our deployed infrastructure. Each test involves a sufficient number of machines (typically tens of hosts) to ensure statistically reliable results. While this section highlights the Web due to its robust testing framework, the findings are broadly applicable to other large-scale applications. Performance is measured in requests per second (RPS) at a predefined target tail latency, with servers automatically throttling RPS to maintain latency targets.
The Web application's memory profile begins by loading the entire file-system cache into memory, followed by lazily loading anonymous memory as requests arrive. As memory usage approaches the limit, servers self-regulate by throttling RPS to avoid running out of memory. In the baseline tier (Figure 11a), this results in an RPS drop of more than 20% within two hours as the server becomes memory-bound.
Figure 11b shows the resident memory size when offloading to SSD devices (second phase, in the middle) and to a compressed memory pool (third phase, on the right), compared with a baseline without offloading (first phase, on the left). The figure shows that once TMO is enabled, it offloads a significant fraction of system memory to the heterogeneous back ends and the RPS drop is eliminated over time. This leads to 20% capacity savings for Web. Because the Web application's data has a high compression ratio of 4x, compressed-memory offloading is effective and saves about 13% of Web memory at peak. By contrast, SSD offloading saves only about 4% of memory in the best case, because Web is sensitive to memory-access slowdown.
4.3 Comparing PSI and promotion rate.
TMO relies on PSI to report the lost work due to lack of resources, such as CPU, memory, and IO. Compared with counting page-promotion events,12 PSI naturally factors in an application’s sensitivity to memory-access slowdown as well as the performance and utilization aspects of a given offload backend. These cannot be achieved by low-level metrics, such as promotion rate.
A slower back end directly increases PSI memory and IO pressure, as page faults take longer to resolve, which in turn reduces memory savings. Figure 12 compares TMO using two SSD back ends, a "fast SSD" and a "slow SSD" (SSD C and SSD B from Figure 5), under Web application load tests with Senpai configured to keep pressures below a predefined threshold. Figure 12a shows significantly worse P90 read latency for the slow SSD compared to the fast SSD. Figure 12b indicates that TMO with the fast SSD performs more aggressive swapping, with higher swap sizes and lower resident memory sizes. The 1.6GB–3GB difference in offloaded anonymous memory represents a 10%–15% reduction in resident set size.
Figure 12c reveals a higher promotion rate (swap-ins per second) for the fast SSD, and Figure 12d shows that this correlates with higher requests per second. However, using promotion rate as a proxy for performance is flawed. It neglects backend performance variations and ignores that some applications benefit from increased memory availability due to aggressive offloading. These results highlight the limitations of promotion rate as a metric, especially in environments with heterogeneous or variable backend performance, as seen with the fast vs. slow SSDs. In contrast, PSI memory and IO pressures in Figures 12e and 12f adapt dynamically, maintaining application performance by offloading more memory with better-performing backends while keeping pressures within target thresholds. Overall, Senpai effectively adjusts its behavior in real time based on backend and application characteristics.
5. Discussion
5.1 Production deployment experience in the past two years.
Since the publication of the paper in ASPLOS’22, the deployment of TMO has continued to expand. In this section, we summarize our production experience since then.
TMO is widely deployed across Meta's infrastructure, with most services using it by default. A minority opt out due to specific challenges, such as sensitivity to page-fault latencies or memory-access patterns that make Linux's LRU algorithm less effective at distinguishing hot and cold pages. The Senpai configuration detailed in this paper has remained largely unchanged. However, tuning Senpai settings such as the PSI threshold involves extensive trial and error to balance memory savings and performance. To accommodate diverse needs, we offer a few TMO "profiles," including more aggressive options for services that prioritize memory savings at a potential performance cost. We are also exploring profiling techniques like DAMON6 to better understand memory-access patterns for future tuning.
A key focus since publication has been enhancing Linux’s ability to treat offload back ends as a hierarchy. Currently, service owners manually select between zswap and SSD-backed swap. Progress has been made toward enabling Linux to use both hierarchically, with zswap handling warmer pages and SSD targeting colder ones.
We are also investigating techniques for managing CXL-tiered memory with TPP.15 Unlike traditional offload back ends, CXL accesses do not induce page faults, necessitating new methods for measuring page-access frequency. Furthermore, PSI does not fully capture the performance costs of slower CXL memory, prompting exploration of alternative approaches.
5.2 Hardware for memory offloading.
Hardware support can significantly enhance memory offloading efficiency. For example, zswap could leverage hardware-assisted memory compression and decompression. Currently, maintaining LRU ordering for memory reclaim relies on software sampling, with overhead scaling to the paging rate. Emerging technologies such as CXL, which offer memory-like access semantics, could provide hardware-assisted estimation of both cold and warm memory, enabling TMO-like techniques to operate more effectively.
6. Related Work
Previous work from Google12 focuses on swapping cold pages into a compressed in-memory pool, guided by a target rate for swapping in pages. The key distinction of our approach lies in how we assess application performance degradation. Rather than relying on heuristics derived from offline profiling of low-level performance indicators, our method leverages high-level, real-time pressure metrics. These metrics inherently capture memory access patterns, sensitivity to page faults, and hardware characteristics. Consequently, TMO is adaptable to a broader range of production environments. Other works1 have focused on cold page detection based on LRU, page table entries, and estimated access latency based on access rate. Our use of pressure metrics takes a step further by directly measuring performance loss, which is more adaptive to different hardware and workloads.
7. Conclusion
This paper introduces TMO, Meta’s transparent memory offloading solution for heterogeneous datacenter environments. TMO consists of multiple components across userspace and the kernel, working in tandem to provide a holistic solution that leads to significant memory savings across the fleet.