Message ID | 20240813215358.2259750-1-shakeel.butt@linux.dev (mailing list archive) |
---|---|
State | New |
Series | [v2] memcg: use ratelimited stats flush in the reclaim |
On Tue, Aug 13, 2024 at 2:54 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> The Meta prod is seeing a large amount of stalls in memcg stats flush
> from the memcg reclaim code path. At the moment, this specific callsite
> is doing a synchronous memcg stats flush. The rstat flush is an
> expensive and time-consuming operation, so concurrent reclaimers will
> busy-wait on the lock, potentially for a long time. Actually, this issue
> is not unique to Meta and has been observed by Cloudflare [1] as well.
>
> [...]
>
> Link: https://lore.kernel.org/all/6ee2518b-81dd-4082-bdf5-322883895ffc@kernel.org [1]
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>

Just curious, does Jesper's patch help with this problem?

> ---
> Changes since v1:
> - Updated the commit message.
>
>  mm/vmscan.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> [...]
On Tue, Aug 13, 2024 at 02:58:51PM GMT, Yosry Ahmed wrote:
> On Tue, Aug 13, 2024 at 2:54 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > The Meta prod is seeing a large amount of stalls in memcg stats flush
> > from the memcg reclaim code path.
> > [...]
>
> Just curious, does Jesper's patch help with this problem?

If you are asking whether I have tested Jesper's patch in Meta's production,
then no, I have not tested it. I also have not taken a look at the latest
from Jesper, as I was stuck on some other issues.
On 14/08/2024 00.30, Shakeel Butt wrote:
> On Tue, Aug 13, 2024 at 02:58:51PM GMT, Yosry Ahmed wrote:
>> On Tue, Aug 13, 2024 at 2:54 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>>>
>>> The Meta prod is seeing a large amount of stalls in memcg stats flush
>>> from the memcg reclaim code path.
>>> [...]
>>
>> Just curious, does Jesper's patch help with this problem?
>
> If you are asking whether I have tested Jesper's patch in Meta's production,
> then no, I have not tested it. I also have not taken a look at the latest
> from Jesper, as I was stuck on some other issues.
>

I see this patch as a whac-a-mole approach, but it should be applied as a
stopgap, because my patches are still not ready to be merged.

My patch is more generic, but *only* solves the rstat lock contention part
of the issue. The remaining issue is that rstat is flushed too often, which
I address in my other patch [2], "cgroup/rstat: introduce ratelimited rstat
flushing". In [2], I explicitly excluded memcg, as Shakeel's patch
demonstrates that memcg already has a ratelimit API specific to memcg.

[2] https://lore.kernel.org/all/171328990014.3930751.10674097155895405137.stgit@firesoul/

I suspect the next whac-a-mole will be the rstat flush for the slab code
that kswapd also activates via shrink_slab(), which via
shrinker->count_objects() invokes count_shadow_nodes().

--Jesper
Ccing Nhat

On Wed, Aug 14, 2024 at 02:57:38PM GMT, Jesper Dangaard Brouer wrote:
> [...]
>
> I see this patch as a whac-a-mole approach, but it should be applied as a
> stopgap, because my patches are still not ready to be merged.
>
> My patch is more generic, but *only* solves the rstat lock contention part
> of the issue. The remaining issue is that rstat is flushed too often, which
> I address in my other patch [2], "cgroup/rstat: introduce ratelimited rstat
> flushing". In [2], I explicitly excluded memcg, as Shakeel's patch
> demonstrates that memcg already has a ratelimit API specific to memcg.
>
> [2] https://lore.kernel.org/all/171328990014.3930751.10674097155895405137.stgit@firesoul/
>
> I suspect the next whac-a-mole will be the rstat flush for the slab code
> that kswapd also activates via shrink_slab(), which via
> shrinker->count_objects() invokes count_shadow_nodes().
>

Actually, count_shadow_nodes() is already using the ratelimited version.
However, zswap_shrinker_count() is still using the sync version. Nhat is
modifying this code at the moment, and we can ask whether we really need the
most accurate values of MEMCG_ZSWAP_B and MEMCG_ZSWAPPED for the zswap
writeback heuristic.

> --Jesper
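The pattern Shakeel points at in count_shadow_nodes() looks roughly like the
sketch below. This is a paraphrase rather than verbatim kernel code; the
function name is illustrative, and the assumption is only that a memcg-aware
shrinker's ->count_objects() needs a coarse estimate, not freshly flushed
stats.

/*
 * Rough sketch (not verbatim kernel code): a memcg-aware shrinker's
 * ->count_objects() only needs a coarse estimate, so a ratelimited flush
 * (stats at most ~2s stale) is good enough before sampling the per-memcg
 * lruvec state.
 */
static unsigned long sketch_count_objects(struct shrinker *shrinker,
					  struct shrink_control *sc)
{
	struct lruvec *lruvec;
	unsigned long pages = 0;
	int i;

	if (!sc->memcg)
		return 0;

	/* Cheap, possibly stale flush instead of a synchronous rstat flush */
	mem_cgroup_flush_stats_ratelimited(sc->memcg);

	lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
	for (i = 0; i < NR_LRU_LISTS; i++)
		pages += lruvec_page_state_local(lruvec, NR_LRU_BASE + i);

	return pages;
}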
On Wed, Aug 14, 2024 at 9:32 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> Ccing Nhat
>
> On Wed, Aug 14, 2024 at 02:57:38PM GMT, Jesper Dangaard Brouer wrote:
> > I suspect the next whac-a-mole will be the rstat flush for the slab code
> > that kswapd also activates via shrink_slab(), which via
> > shrinker->count_objects() invokes count_shadow_nodes().
> >
>
> Actually, count_shadow_nodes() is already using the ratelimited version.
> However, zswap_shrinker_count() is still using the sync version. Nhat is
> modifying this code at the moment, and we can ask whether we really need the
> most accurate values of MEMCG_ZSWAP_B and MEMCG_ZSWAPPED for the zswap
> writeback heuristic.

You are referring to this, correct:

	mem_cgroup_flush_stats(memcg);
	nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
	nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);

It's already a bit less than accurate - as you pointed out in another
discussion, it takes into account the objects and sizes of the entire
subtree, rather than just the ones charged to the current (memcg, node)
combo. Feel free to optimize this away!

In fact, I should probably replace this with another (atomic?) counter in
the zswap_lruvec_state struct, which tracks the post-compression size. That
way, we'll have a better estimate of the compression factor - total
post-compression size / (length of LRU * page size) - and perhaps avoid the
whole stat flushing path altogether...
On Wed, Aug 14, 2024 at 04:03:13PM GMT, Nhat Pham wrote:
> [...]
>
> In fact, I should probably replace this with another (atomic?) counter in
> the zswap_lruvec_state struct, which tracks the post-compression size. That
> way, we'll have a better estimate of the compression factor - total
> post-compression size / (length of LRU * page size) - and perhaps avoid the
> whole stat flushing path altogether...
>

That sounds like a much better solution than relying on rstat for accurate
stats.
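A minimal sketch of the direction Nhat describes, under the assumption of a
new per-lruvec counter; the "compressed_bytes" field and all helper names
below are hypothetical, introduced only for illustration, not existing kernel
code.

/*
 * Hypothetical sketch: track post-compression bytes in zswap's per-lruvec
 * state so the shrinker can estimate backing pages and the compression
 * factor without any rstat flush. All names below are assumptions.
 */
struct zswap_lruvec_state_sketch {
	atomic_long_t compressed_bytes;		/* assumed new counter */
};

static void zswap_acct_store(struct zswap_lruvec_state_sketch *state,
			     size_t compressed_len)
{
	atomic_long_add(compressed_len, &state->compressed_bytes);
}

static void zswap_acct_free(struct zswap_lruvec_state_sketch *state,
			    size_t compressed_len)
{
	atomic_long_sub(compressed_len, &state->compressed_bytes);
}

/* Stands in for the MEMCG_ZSWAP_B read: backing pages currently used */
static unsigned long zswap_backing_pages(struct zswap_lruvec_state_sketch *state)
{
	return atomic_long_read(&state->compressed_bytes) >> PAGE_SHIFT;
}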
On Wed, Aug 14, 2024 at 4:42 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Wed, Aug 14, 2024 at 04:03:13PM GMT, Nhat Pham wrote:
> > [...]
> >
> > In fact, I should probably replace this with another (atomic?) counter in
> > the zswap_lruvec_state struct, which tracks the post-compression size.
> > That way, we'll have a better estimate of the compression factor - total
> > post-compression size / (length of LRU * page size) - and perhaps avoid
> > the whole stat flushing path altogether...
>
> That sounds like a much better solution than relying on rstat for accurate
> stats.

We can also use such atomic counters in obj_cgroup_may_zswap() and eliminate
the rstat flush there as well. Same for zswap_current_read(), probably.

Most in-kernel flushers really only need a few stats, so I am wondering
whether it's better to incrementally move these outside of the rstat
framework and completely eliminate in-kernel flushers. For instance, MGLRU
does not require the flush that reclaim does, as Shakeel pointed out.

This will solve so many scalability problems that all of us have observed at
some point or another and tried to optimize. I believe using rstat for
userspace reads was the original intention anyway.
On Wed, Aug 14, 2024 at 4:49 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> We can also use such atomic counters in obj_cgroup_may_zswap() and eliminate
> the rstat flush there as well. Same for zswap_current_read(), probably.

zswap/zswapped are subtree-cumulative counters. Would that be a problem?

> Most in-kernel flushers really only need a few stats, so I am wondering
> whether it's better to incrementally move these outside of the rstat
> framework and completely eliminate in-kernel flushers. For instance, MGLRU
> does not require the flush that reclaim does, as Shakeel pointed out.
>
> This will solve so many scalability problems that all of us have observed at
> some point or another and tried to optimize. I believe using rstat for
> userspace reads was the original intention anyway.

But yeah, the fewer in-kernel flushers we have, the fewer scalability/lock
contention issues there will be. Not an expert in this area, but sounds like
a worthwhile direction to pursue.
On Wed, Aug 14, 2024 at 5:19 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Wed, Aug 14, 2024 at 4:49 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > We can also use such atomic counters in obj_cgroup_may_zswap() and
> > eliminate the rstat flush there as well. Same for zswap_current_read(),
> > probably.
>
> zswap/zswapped are subtree-cumulative counters. Would that be a problem?

For obj_cgroup_may_zswap() we iterate the parents anyway, so I think it
should be fine. We will also iterate the nodes on each level, but this is
usually a small number and is probably cheaper than an rstat flush (but I can
easily be wrong).

For zswap_current_read() we need to iterate the children, not the parents. At
this point the rstat flush is probably much better, so we can leave this one
alone. It's a userspace read anyway, so it shouldn't be causing problems.

> [...]
>
> But yeah, the fewer in-kernel flushers we have, the fewer scalability/lock
> contention issues there will be. Not an expert in this area, but sounds like
> a worthwhile direction to pursue.
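A sketch of the parent walk described above, with per-(memcg, node) atomic
counters standing in for the flushed stats; zswap_bytes_of_node() is a
hypothetical helper, and the limit check is only loosely modeled on
obj_cgroup_may_zswap(), not copied from it.

/*
 * Loose sketch of the discussed direction: walk the memcg parents (as
 * obj_cgroup_may_zswap() already does) and sum a handful of hypothetical
 * per-node atomic counters at each level instead of flushing rstat.
 */
static bool may_zswap_sketch(struct mem_cgroup *start, unsigned long limit_pages)
{
	struct mem_cgroup *memcg;

	for (memcg = start; memcg; memcg = parent_mem_cgroup(memcg)) {
		unsigned long bytes = 0;
		int nid;

		for_each_node_state(nid, N_MEMORY)
			bytes += zswap_bytes_of_node(memcg, nid); /* hypothetical */

		if ((bytes >> PAGE_SHIFT) >= limit_pages)
			return false;
	}
	return true;
}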
On Wed, Aug 14, 2024 at 04:48:42PM GMT, Yosry Ahmed wrote:
> [...]
>
> Most in-kernel flushers really only need a few stats, so I am wondering
> whether it's better to incrementally move these outside of the rstat
> framework and completely eliminate in-kernel flushers. For instance, MGLRU
> does not require the flush that reclaim does, as Shakeel pointed out.
>
> This will solve so many scalability problems that all of us have observed at
> some point or another and tried to optimize. I believe using rstat for
> userspace reads was the original intention anyway.

I like this direction and I think zswap would be a good first target.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 008b62abf104..82318464cd5e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2282,10 +2282,11 @@ static void prepare_scan_control(pg_data_t *pgdat, struct scan_control *sc)
 	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
 
 	/*
-	 * Flush the memory cgroup stats, so that we read accurate per-memcg
-	 * lruvec stats for heuristics.
+	 * Flush the memory cgroup stats in rate-limited way as we don't need
+	 * most accurate stats here. We may switch to regular stats flushing
+	 * in the future once it is cheap enough.
 	 */
-	mem_cgroup_flush_stats(sc->target_mem_cgroup);
+	mem_cgroup_flush_stats_ratelimited(sc->target_mem_cgroup);
 
 	/*
 	 * Determine the scan balance between anon and file LRUs.
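For context on the "2 sec stale (at worst)" figure in the commit message
below: the ratelimited variant is essentially a time-gated wrapper around the
regular flush. A simplified sketch of the idea follows; it is not the exact
mm/memcontrol.c implementation, and flush_last_time/FLUSH_TIME here simply
stand for the periodic flusher's bookkeeping.

/*
 * Simplified sketch (not the exact mm/memcontrol.c code): only fall back to
 * a real flush when the periodic flusher appears to be a full cycle late,
 * bounding staleness to roughly two flush intervals (~2s) in the worst case.
 */
static void flush_stats_ratelimited_sketch(struct mem_cgroup *memcg)
{
	/* flush_last_time: jiffies_64 timestamp of the last completed flush */
	if (time_after64(get_jiffies_64(),
			 READ_ONCE(flush_last_time) + 2UL * FLUSH_TIME))
		mem_cgroup_flush_stats(memcg);
}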
The Meta prod is seeing a large amount of stalls in memcg stats flush from the
memcg reclaim code path. At the moment, this specific callsite is doing a
synchronous memcg stats flush. The rstat flush is an expensive and
time-consuming operation, so concurrent reclaimers will busy-wait on the lock,
potentially for a long time. Actually, this issue is not unique to Meta and
has been observed by Cloudflare [1] as well. For the Cloudflare case, the
stalls were due to contention between kswapd threads running on their 8 NUMA
node machines, which does not make sense as the rstat flush is global and a
flush from one kswapd thread should be sufficient for all. Simply replace the
synchronous flush with the ratelimited one.

One may raise a concern about potentially using 2 sec stale (at worst) stats
for heuristics like the desirable inactive:active ratio and preferring
inactive file pages over anon pages, but these specific heuristics do not
require very precise stats and are also ignored under severe memory pressure.

More specifically for this code path, the stats are needed for two specific
heuristics:

1. Deactivate LRUs
2. Cache trim mode

The deactivate LRUs heuristic is to maintain a desirable inactive:active ratio
of the LRUs. The specific stats needed are WORKINGSET_ACTIVATE* and the
hierarchical LRU size. WORKINGSET_ACTIVATE* is needed to check whether there
has been a refault since the last snapshot, and the LRU sizes are needed for
the desirable ratio between the inactive and active LRUs. See the table below
on how the desirable ratio is calculated.

/* total     target    max
 * memory    ratio     inactive
 * -------------------------------------
 *   10MB       1         5MB
 *  100MB       1        50MB
 *    1GB       3       250MB
 *   10GB      10       0.9GB
 *  100GB      31         3GB
 *    1TB     101        10GB
 *   10TB     320        32GB
 */

The desirable ratio only changes at the boundaries of 1 GiB, 10 GiB, 100 GiB,
1 TiB and 10 TiB. There is no need for precise and accurate LRU size
information to calculate this ratio. In addition, if deactivation is skipped
for some LRU, the kernel will force deactivation under severe memory pressure.

For the cache trim mode, the inactive file LRU size is read and the kernel
scales it down based on the reclaim iteration (file >> sc->priority) and only
checks whether it is zero or not. Again, precise information is not needed.

This patch has been running on the Meta fleet for several months and we have
not observed any issues. Please note that MGLRU is not impacted by this issue
at all, as it avoids rstat flushing completely.

Link: https://lore.kernel.org/all/6ee2518b-81dd-4082-bdf5-322883895ffc@kernel.org [1]
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
Changes since v1:
- Updated the commit message.

 mm/vmscan.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)
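The target ratio column in the table in the commit message above follows a
square-root scaling of the total LRU size, so it only moves at GiB-scale
boundaries. A sketch of the calculation, paraphrasing the inactive_is_low()
logic in mm/vmscan.c rather than copying it verbatim:

/*
 * Sketch of how the target inactive:active ratio is derived: the ratio is
 * sqrt(10 * LRU-size-in-GiB) (or 1 below 1 GiB), so it only changes when
 * the LRU size crosses a GiB-scale boundary -- which is why byte-accurate,
 * freshly flushed LRU stats are unnecessary for this heuristic.
 */
static bool inactive_lru_is_low_sketch(unsigned long inactive, unsigned long active)
{
	unsigned long gb = (inactive + active) >> (30 - PAGE_SHIFT);
	unsigned long inactive_ratio = gb ? int_sqrt(10 * gb) : 1;

	return inactive * inactive_ratio < active;
}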