Message ID: 20241105184333.2305744-1-jthoughton@google.com
Series: KVM: x86/mmu: Age sptes locklessly
On Tue, Nov 5, 2024 at 11:43 AM James Houghton <jthoughton@google.com> wrote:
>
> Andrew has queued patches to make MGLRU consult KVM when doing aging[8].
> Now, make aging lockless for the shadow MMU and the TDP MMU. This allows
> us to reduce the time/CPU it takes to do aging and the performance
> impact on the vCPUs while we are aging.
>
> The final patch in this series modifies access_tracking_stress_test to
> age using MGLRU. There is a mode (-p) where it will age while the vCPUs
> are faulting memory in. Here are some results with that mode:

Additional background in case I didn't provide it before:

At Google we keep track of the hotness/coldness of VM memory to identify
opportunities to demote cold memory to slower tiers of storage. This is
done in a controlled manner so that we benefit from the improved memory
efficiency of better bin-packing without violating customer SLOs.

However, the monitoring/tracking introduced two major overheads [1] for us:
1. the traditional (host) PFN + rmap data structures [2] used to
   locate host PTEs (containing the accessed bits).
2. the KVM MMU lock required to clear the accessed bits in
   secondary/shadow PTEs.

MGLRU provides the infrastructure for us to reach into page tables
directly from a list of mm_struct's, which allows us to bypass the first
problem above and reduce the CPU overhead by ~80% for our workloads
(90%+ mmaped memory). This series solves the second problem: by
supporting locklessly clearing the accessed bits in SPTEs, it would
reduce our current KVM MMU lock contention by >80% [3]. All other
existing mechanisms, e.g., Idle Page Tracking, DAMON, etc., can also
seamlessly benefit from this series when monitoring/tracking VM memory.

[1] https://lwn.net/Articles/787611/
[2] https://docs.kernel.org/admin-guide/mm/idle_page_tracking.html
[3] https://research.google/pubs/profiling-a-warehouse-scale-computer/
On Tue, Nov 5, 2024 at 12:21 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Tue, Nov 5, 2024 at 11:43 AM James Houghton <jthoughton@google.com> wrote:
> >
> > Andrew has queued patches to make MGLRU consult KVM when doing aging[8].
> > Now, make aging lockless for the shadow MMU and the TDP MMU. This allows
> > us to reduce the time/CPU it takes to do aging and the performance
> > impact on the vCPUs while we are aging.
> >
> > The final patch in this series modifies access_tracking_stress_test to
> > age using MGLRU. There is a mode (-p) where it will age while the vCPUs
> > are faulting memory in. Here are some results with that mode:
>
> Additional background in case I didn't provide it before:
>
> At Google we keep track of the hotness/coldness of VM memory to identify
> opportunities to demote cold memory to slower tiers of storage. This is
> done in a controlled manner so that we benefit from the improved memory
> efficiency of better bin-packing without violating customer SLOs.
>
> However, the monitoring/tracking introduced two major overheads [1] for us:
> 1. the traditional (host) PFN + rmap data structures [2] used to
>    locate host PTEs (containing the accessed bits).
> 2. the KVM MMU lock required to clear the accessed bits in
>    secondary/shadow PTEs.
>
> MGLRU provides the infrastructure for us to reach into page tables
> directly from a list of mm_struct's, which allows us to bypass the first
> problem above and reduce the CPU overhead by ~80% for our workloads
> (90%+ mmaped memory). This series solves the second problem: by
> supporting locklessly clearing the accessed bits in SPTEs, it would
> reduce our current KVM MMU lock contention by >80% [3]. All other
> existing mechanisms, e.g., Idle Page Tracking, DAMON, etc., can also
> seamlessly benefit from this series when monitoring/tracking VM memory.
>
> [1] https://lwn.net/Articles/787611/
> [2] https://docs.kernel.org/admin-guide/mm/idle_page_tracking.html
> [3] https://research.google/pubs/profiling-a-warehouse-scale-computer/

And we also ran an A/B experiment on a quarter million Chromebooks
running Android in VMs last year (with an older version of this series):
it reduced PSI by 10% at the 99th percentile and "janks" by 8% at the
95th percentile, which resulted in an overall improvement in user
engagement of 16% at the 75th percentile (all statistically significant).