Message ID | 20240813165619.748102-1-yuanchu@google.com
Series     | mm: workingset reporting
On Tue, 13 Aug 2024 09:56:11 -0700 Yuanchu Xie <yuanchu@google.com> wrote:

> This patch series provides workingset reporting of user pages in
> lruvecs, of which coldness can be tracked by accessed bits and fd
> references.

Very little reviewer interest. I wonder why. Will Google be the only
organization which finds this useful?

> Benchmarks
> ==========
> Ghait Ouled Amar Ben Cheikh has implemented a simple "reclaim everything
> colder than 10 seconds every 40 seconds" policy and ran Linux compile
> and redis from the phoronix test suite. The results are in his repo:
> https://github.com/miloudi98/WMO

I'd suggest at least summarizing these results here in the [0/N]. The
Linux kernel will probably outlive that URL!
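A minimal userspace sketch of that "reclaim everything colder than 10
seconds every 40 seconds" policy, for illustration only: it assumes the
per-memcg page age report from this series is readable as
memory.workingset.page_age in the "<interval_upper_bound_ms> anon=N
file=M" format shown later in this thread, that the counters are page
counts, and it drives reclaim through the existing cgroup v2
memory.reclaim file. The file name, units, and cgroup path are
assumptions, not the ABI of this series or of the WMO tool.

  # Illustrative sketch only: periodically reclaim everything colder than
  # COLD_MS from one cgroup, using an assumed page age report file plus
  # the existing cgroup v2 memory.reclaim interface.
  import time

  CGROUP = "/sys/fs/cgroup/workload"   # assumed cgroup path
  COLD_MS = 10_000                     # "colder than 10 seconds"
  PERIOD_S = 40                        # "every 40 seconds"
  PAGE_SIZE = 4096                     # assume 4 KiB pages

  def cold_pages(cgroup, cold_ms):
      """Sum anon+file counts for report buckets entirely older than cold_ms."""
      total, lower = 0, 0
      with open(f"{cgroup}/memory.workingset.page_age") as f:  # assumed name
          for line in f:
              fields = line.split()
              upper = int(fields[0])               # bucket upper bound in ms
              counts = dict(kv.split("=") for kv in fields[1:])
              if lower >= cold_ms:                 # whole bucket is cold
                  total += int(counts.get("anon", 0)) + int(counts.get("file", 0))
              lower = upper
      return total

  while True:
      nr_bytes = cold_pages(CGROUP, COLD_MS) * PAGE_SIZE
      if nr_bytes:
          with open(f"{CGROUP}/memory.reclaim", "w") as f:
              f.write(str(nr_bytes))               # best-effort reclaim request
      time.sleep(PERIOD_S)

A real policy would need error handling, rate limiting, and refault
awareness, but it shows that the age report plus memory.reclaim is
enough to express the policy whose results Andrew asks to have
summarized.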
On Tue, 13 Aug 2024, Andrew Morton wrote:

> On Tue, 13 Aug 2024 09:56:11 -0700 Yuanchu Xie <yuanchu@google.com> wrote:
>
> > This patch series provides workingset reporting of user pages in
> > lruvecs, of which coldness can be tracked by accessed bits and fd
> > references.
>
> Very little reviewer interest. I wonder why. Will Google be the only
> organization which finds this useful?
>

Although also from Google, I'm optimistic that others will find this
very useful. It's implemented in a way that is intended to be generally
useful for multiple use cases, including user-defined policy for
proactive reclaim. The cited sample userspace implementation is intended
to demonstrate how this insight can be put into practice.

Insight into the working set of applications, particularly on
multi-tenant systems, has yielded significant memory savings for Google
over the past decade. The introduction of MGLRU into the upstream kernel
has allowed this information to be derived in a much more efficient
manner, presented here, which should make upstreaming of this insight
much more palatable. This insight into the working set will only become
more critical going forward with memory-tiered systems. Nothing here is
specific to Google; in fact, we apply the insight into working set in
very different ways across our fleets.

> > Benchmarks
> > ==========
> > Ghait Ouled Amar Ben Cheikh has implemented a simple "reclaim everything
> > colder than 10 seconds every 40 seconds" policy and ran Linux compile
> > and redis from the phoronix test suite. The results are in his repo:
> > https://github.com/miloudi98/WMO
>
> I'd suggest at least summarizing these results here in the [0/N]. The
> Linux kernel will probably outlive that URL!
>

Fully agreed that this would be useful to include in the cover letter.
The results showing the impact of proactive reclaim driven by working
set insight are impressive for multi-tenant systems. Having very
comparable performance for kernbench with a fraction of the memory usage
shows the potential of proactive reclaim, without depending on direct
reclaim or throttling of the application itself.

This is one of several benchmarks that we are running, and we'll be
expanding upon this with cotenancy, user-defined latency sensitivity per
job, extensions for insight into memory re-access, and in-guest use
cases.
On Tue, Aug 13, 2024 at 09:56:11AM -0700, Yuanchu Xie wrote:
> This patch series provides workingset reporting of user pages in
> lruvecs, of which coldness can be tracked by accessed bits and fd
> references. However, the concept of workingset applies generically to
> all types of memory, which could be kernel slab caches, discardable
> userspace caches (databases), or CXL.mem. Therefore, data sources might
> come from slab shrinkers, device drivers, or the userspace. IMO, the
> kernel should provide a set of workingset interfaces that should be
> generic enough to accommodate the various use cases, and be extensible
> to potential future use cases. The current proposed interfaces are not
> sufficient in that regard, but I would like to start somewhere, solicit
> feedback, and iterate.
>
... snip ...
> Use cases
> ==========
> Promotion/Demotion
> If different mechanisms are used for promotion and demotion, workingset
> information can help connect the two and avoid pages being migrated back
> and forth.
> For example, given a promotion hot page threshold defined in reaccess
> distance of N seconds (promote pages accessed more often than every N
> seconds). The threshold N should be set so that ~80% (e.g.) of pages on
> the fast memory node pass the threshold. This calculation can be done
> with workingset reports.
> To be directly useful for promotion policies, the workingset report
> interfaces need to be extended to report hotness and gather hotness
> information from the devices[1].
>
> [1]
> https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-white-paper-pdf-1
>
> Sysfs and Cgroup Interfaces
> ==========
> The interfaces are detailed in the patches that introduce them. The main
> idea here is we break down the workingset per-node per-memcg into time
> intervals (ms), e.g.
>
> 1000 anon=137368 file=24530
> 20000 anon=34342 file=0
> 30000 anon=353232 file=333608
> 40000 anon=407198 file=206052
> 9223372036854775807 anon=4925624 file=892892
>
> I realize this does not generalize well to hotness information, but I
> lack the intuition for an abstraction that presents hotness in a useful
> way. Based on a recent proposal for move_phys_pages[2], it seems like
> userspace tiering software would like to move specific physical pages,
> instead of informing the kernel "move x number of hot pages to y
> device". Please advise.
>
> [2]
> https://lore.kernel.org/lkml/20240319172609.332900-1-gregory.price@memverge.com/
>

Just as a note on this work, this is really a testing interface. The
end goal is not to merge a user-facing interface like move_phys_pages,
but instead to have something like a triggered kernel task with a
directive of "Promote X pages from Device A".

This work is more of an open collaboration for prototyping, so that we
don't have to plumb it through the kernel from the start and can assess
the usefulness of the hardware hotness collection mechanism.

---

More generally on promotion, I have recently been considering a problem
with promoting unmapped pagecache pages, since they are not subject to
NUMA hint faults. I started looking at PG_accessed and PG_workingset as
a potential mechanism to trigger promotion - but I'm starting to see a
pattern of competing priorities between reclaim (LRU/MGLRU) logic and
promotion logic.

Reclaim is triggered largely under memory pressure - which means
co-opting reclaim logic for promotion is at best logically confusing,
and at worst likely to introduce regressions.
The LRU/MGLRU logic is written largely for reclaim, not promotion. This
makes hacking promotion in after the fact rather dubious - the design
choices don't match.

One example: if a page moves from inactive->active (or old->young), we
could treat this as a page "becoming hot" and mark it for promotion, but
this potentially punishes pages on the "active/younger" lists which are
themselves hotter.

I'm starting to think separate demotion/reclaim and promotion components
are warranted. This could take the form of a separate kernel worker that
occasionally gets scheduled to manage a promotion list, or even the
addition of a PG_promote flag to decouple reclaim and promotion logic
completely. Separating the structures entirely would be good to allow
both demotion/reclaim and promotion to occur concurrently (although this
seems problematic under memory pressure).

Would like to know your thoughts here. If we can decide to segregate
promotion and demotion logic, it might go a long way to simplify the
existing interfaces and formalize transactions between the two.

(also if you're going to LPC, might be worth a chat in person)

~Gregory
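As an aside on the threshold calculation quoted above ("set N so that
~80% of pages on the fast memory node pass"), the selection is
straightforward to express against the interval report format from the
cover letter. A minimal sketch, treating the anon=/file= counters as
page counts and the first column as a bucket upper bound in
milliseconds, both assumptions drawn only from the example output:

  # Illustrative sketch: choose the smallest reaccess-distance threshold
  # N (ms) such that roughly `target` of the node's pages are younger
  # than N, given report lines like "1000 anon=137368 file=24530".

  def pick_threshold(report_lines, target=0.80):
      buckets = []
      for line in report_lines:
          fields = line.split()
          counts = dict(kv.split("=") for kv in fields[1:])
          pages = int(counts.get("anon", 0)) + int(counts.get("file", 0))
          buckets.append((int(fields[0]), pages))   # (upper_bound_ms, pages)

      total = sum(pages for _, pages in buckets)
      seen = 0
      for upper_ms, pages in buckets:
          seen += pages
          if seen >= target * total:
              return upper_ms        # smallest bound covering ~target of pages
      return buckets[-1][0]

Fed the example report from the cover letter, most pages fall into the
final (coldest) bucket, so this returns the sentinel upper bound; on an
actual fast-tier node the age distribution, and hence N, would look
quite different.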
On Tue, Aug 20, 2024 at 6:00 AM Gregory Price <gourry@gourry.net> wrote:
>
> On Tue, Aug 13, 2024 at 09:56:11AM -0700, Yuanchu Xie wrote:
> > This patch series provides workingset reporting of user pages in
> > lruvecs, of which coldness can be tracked by accessed bits and fd
> > references. However, the concept of workingset applies generically to
> > all types of memory, which could be kernel slab caches, discardable
> > userspace caches (databases), or CXL.mem. Therefore, data sources might
> > come from slab shrinkers, device drivers, or the userspace. IMO, the
> > kernel should provide a set of workingset interfaces that should be
> > generic enough to accommodate the various use cases, and be extensible
> > to potential future use cases. The current proposed interfaces are not
> > sufficient in that regard, but I would like to start somewhere, solicit
> > feedback, and iterate.
> >
> ... snip ...
> > Use cases
> > ==========
> > Promotion/Demotion
> > If different mechanisms are used for promotion and demotion, workingset
> > information can help connect the two and avoid pages being migrated back
> > and forth.
> > For example, given a promotion hot page threshold defined in reaccess
> > distance of N seconds (promote pages accessed more often than every N
> > seconds). The threshold N should be set so that ~80% (e.g.) of pages on
> > the fast memory node pass the threshold. This calculation can be done
> > with workingset reports.
> > To be directly useful for promotion policies, the workingset report
> > interfaces need to be extended to report hotness and gather hotness
> > information from the devices[1].
> >
> > [1]
> > https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-white-paper-pdf-1
> >
> > Sysfs and Cgroup Interfaces
> > ==========
> > The interfaces are detailed in the patches that introduce them. The main
> > idea here is we break down the workingset per-node per-memcg into time
> > intervals (ms), e.g.
> >
> > 1000 anon=137368 file=24530
> > 20000 anon=34342 file=0
> > 30000 anon=353232 file=333608
> > 40000 anon=407198 file=206052
> > 9223372036854775807 anon=4925624 file=892892
> >
> > I realize this does not generalize well to hotness information, but I
> > lack the intuition for an abstraction that presents hotness in a useful
> > way. Based on a recent proposal for move_phys_pages[2], it seems like
> > userspace tiering software would like to move specific physical pages,
> > instead of informing the kernel "move x number of hot pages to y
> > device". Please advise.
> >
> > [2]
> > https://lore.kernel.org/lkml/20240319172609.332900-1-gregory.price@memverge.com/
> >
>
> Just as a note on this work, this is really a testing interface. The
> end goal is not to merge a user-facing interface like move_phys_pages,
> but instead to have something like a triggered kernel task with a
> directive of "Promote X pages from Device A".
>
> This work is more of an open collaboration for prototyping, so that we
> don't have to plumb it through the kernel from the start and can assess
> the usefulness of the hardware hotness collection mechanism.

Understood. I think we previously had this exchange and I forgot to
remove the mentions from the cover letter.

>
> ---
>
> More generally on promotion, I have recently been considering a problem
> with promoting unmapped pagecache pages, since they are not subject to
> NUMA hint faults.
> I started looking at PG_accessed and PG_workingset as a potential
> mechanism to trigger promotion - but I'm starting to see a pattern of
> competing priorities between reclaim (LRU/MGLRU) logic and promotion
> logic.

In this case, IMO hardware support would be good, as it could provide
the kernel with exactly which pages are hot, and it would not care
whether a page is mapped or not. I recall there being some CXL proposal
on this, but I'm not sure whether it has settled into a standard yet.

>
> Reclaim is triggered largely under memory pressure - which means co-opting
> reclaim logic for promotion is at best logically confusing, and at worst
> likely to introduce regressions. The LRU/MGLRU logic is written largely
> for reclaim, not promotion. This makes hacking promotion in after the
> fact rather dubious - the design choices don't match.
>
> One example: if a page moves from inactive->active (or old->young), we
> could treat this as a page "becoming hot" and mark it for promotion, but
> this potentially punishes pages on the "active/younger" lists which are
> themselves hotter.

To avoid punishing pages on the "young" list, one could insert the page
into a "less young" generation, but it would be difficult to have a
fixed policy for this in the kernel, so it may be best for this to be
configurable via BPF. One could insert the page in the middle of the
active/inactive list, but that would in effect create multiple
generations.

>
> I'm starting to think separate demotion/reclaim and promotion components
> are warranted. This could take the form of a separate kernel worker that
> occasionally gets scheduled to manage a promotion list, or even the
> addition of a PG_promote flag to decouple reclaim and promotion logic
> completely. Separating the structures entirely would be good to allow
> both demotion/reclaim and promotion to occur concurrently (although this
> seems problematic under memory pressure).
>
> Would like to know your thoughts here. If we can decide to segregate
> promotion and demotion logic, it might go a long way to simplify the
> existing interfaces and formalize transactions between the two.

The two systems still have to interact, so separating them would
essentially create a new policy that decides whether the
demotion/reclaim or the promotion policy is in effect. If promotion
could figure out where to insert the page in terms of generations,
wouldn't that be simpler?

>
> (also if you're going to LPC, might be worth a chat in person)

I cannot make it to LPC. :( Sadness

Yuanchu