Message ID | 20180831203450.2536-1-guro@fb.com (mailing list archive) |
---|---|
State | New, archived |
Series | mm: slowly shrink slabs with a relatively small number of objects |
On Fri, 2018-08-31 at 13:34 -0700, Roman Gushchin wrote:

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index fa2c150ab7b9..c910cf6bf606 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -476,6 +476,10 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  	delta = freeable >> priority;
>  	delta *= 4;
>  	do_div(delta, shrinker->seeks);
> +
> +	if (delta == 0 && freeable > 0)
> +		delta = min(freeable, batch_size);
> +
>  	total_scan += delta;
>  	if (total_scan < 0) {
>  		pr_err("shrink_slab: %pF negative objects to delete nr=%ld\n",

I agree that we need to shrink slabs with fewer than 4096 objects, but do
we want to put more pressure on a slab the moment it drops below 4096 than
we applied when it had just over 4096 objects on it?

With this patch, a slab with 5000 objects on it will get 1 item scanned,
while a slab with 4000 objects on it will see shrinker->batch or
SHRINK_BATCH objects scanned every time.

I don't know if this would cause any issues, just something to ponder.

If nobody thinks this is a problem, you can give the patch my:

Acked-by: Rik van Riel <riel@surriel.com>
On Fri, Aug 31, 2018 at 05:15:39PM -0400, Rik van Riel wrote:
> On Fri, 2018-08-31 at 13:34 -0700, Roman Gushchin wrote:
>
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index fa2c150ab7b9..c910cf6bf606 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -476,6 +476,10 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> >  	delta = freeable >> priority;
> >  	delta *= 4;
> >  	do_div(delta, shrinker->seeks);
> > +
> > +	if (delta == 0 && freeable > 0)
> > +		delta = min(freeable, batch_size);
> > +
> >  	total_scan += delta;
> >  	if (total_scan < 0) {
> >  		pr_err("shrink_slab: %pF negative objects to delete nr=%ld\n",
>
> I agree that we need to shrink slabs with fewer than 4096 objects, but
> do we want to put more pressure on a slab the moment it drops below
> 4096 than we applied when it had just over 4096 objects on it?
>
> With this patch, a slab with 5000 objects on it will get 1 item
> scanned, while a slab with 4000 objects on it will see shrinker->batch
> or SHRINK_BATCH objects scanned every time.
>
> I don't know if this would cause any issues, just something to ponder.

Hm, fair enough. So, basically we can always do

	delta = max(delta, min(freeable, batch_size));

Does it look better?

> If nobody thinks this is a problem, you can give the patch my:
>
> Acked-by: Rik van Riel <riel@surriel.com>

Thanks!
On Fri, 2018-08-31 at 14:31 -0700, Roman Gushchin wrote:
> On Fri, Aug 31, 2018 at 05:15:39PM -0400, Rik van Riel wrote:
> > On Fri, 2018-08-31 at 13:34 -0700, Roman Gushchin wrote:
> >
> > [...]
> >
> > I agree that we need to shrink slabs with fewer than 4096 objects,
> > but do we want to put more pressure on a slab the moment it drops
> > below 4096 than we applied when it had just over 4096 objects on it?
> >
> > With this patch, a slab with 5000 objects on it will get 1 item
> > scanned, while a slab with 4000 objects on it will see
> > shrinker->batch or SHRINK_BATCH objects scanned every time.
> >
> > I don't know if this would cause any issues, just something to
> > ponder.
>
> Hm, fair enough. So, basically we can always do
>
> 	delta = max(delta, min(freeable, batch_size));
>
> Does it look better?

Yeah, that looks fine to me.

That will lead to small cgroups having their small caches reclaimed
relatively more quickly than large caches get reclaimed, but small caches
should also be faster to refill once they are needed again, so it is
probably fine.
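[Editorial aside: a small standalone sketch, not part of the thread, of the numbers being discussed. It assumes priority == 12, shrinker->seeks == DEFAULT_SEEKS == 2 and batch_size == SHRINK_BATCH == 128, and uses plain division in place of the kernel's do_div(). It shows the discontinuity Rik describes and how the proposed max() variant smooths it.]

	#include <stdio.h>

	#define SHRINK_BATCH  128
	#define DEFAULT_SEEKS 2

	/* delta as computed by the v1 patch: bump only when it rounds down to 0 */
	static unsigned long delta_v1(unsigned long freeable, int priority)
	{
		unsigned long delta = (freeable >> priority) * 4 / DEFAULT_SEEKS;
		unsigned long floor = freeable < SHRINK_BATCH ? freeable : SHRINK_BATCH;

		return (delta == 0 && freeable > 0) ? floor : delta;
	}

	/* delta with the max() variant proposed in this subthread */
	static unsigned long delta_max(unsigned long freeable, int priority)
	{
		unsigned long delta = (freeable >> priority) * 4 / DEFAULT_SEEKS;
		unsigned long floor = freeable < SHRINK_BATCH ? freeable : SHRINK_BATCH;

		return delta > floor ? delta : floor;
	}

	int main(void)
	{
		unsigned long sizes[] = { 4000, 5000 };
		int i;

		for (i = 0; i < 2; i++)
			printf("freeable=%lu  v1=%lu  max-variant=%lu\n",
			       sizes[i], delta_v1(sizes[i], 12), delta_max(sizes[i], 12));
		/*
		 * Prints:
		 *   freeable=4000  v1=128  max-variant=128
		 *   freeable=5000  v1=2    max-variant=128
		 */
		return 0;
	}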
On Fri 31-08-18 14:31:41, Roman Gushchin wrote:
> On Fri, Aug 31, 2018 at 05:15:39PM -0400, Rik van Riel wrote:
> > On Fri, 2018-08-31 at 13:34 -0700, Roman Gushchin wrote:
> >
> > [...]
> >
> > I agree that we need to shrink slabs with fewer than 4096 objects,
> > but do we want to put more pressure on a slab the moment it drops
> > below 4096 than we applied when it had just over 4096 objects on it?
> >
> > With this patch, a slab with 5000 objects on it will get 1 item
> > scanned, while a slab with 4000 objects on it will see
> > shrinker->batch or SHRINK_BATCH objects scanned every time.
> >
> > I don't know if this would cause any issues, just something to
> > ponder.
>
> Hm, fair enough. So, basically we can always do
>
> 	delta = max(delta, min(freeable, batch_size));
>
> Does it look better?

Why don't you use the same heuristic we use for the normal LRU reclaim?

	/*
	 * If the cgroup's already been deleted, make sure to
	 * scrape out the remaining cache.
	 */
	if (!scan && !mem_cgroup_online(memcg))
		scan = min(size, SWAP_CLUSTER_MAX);
On Mon, Sep 03, 2018 at 08:29:56PM +0200, Michal Hocko wrote:
> On Fri 31-08-18 14:31:41, Roman Gushchin wrote:
> > On Fri, Aug 31, 2018 at 05:15:39PM -0400, Rik van Riel wrote:
> >
> > [...]
> >
> > Hm, fair enough. So, basically we can always do
> >
> > 	delta = max(delta, min(freeable, batch_size));
> >
> > Does it look better?
>
> Why don't you use the same heuristic we use for the normal LRU reclaim?

Because we do reparent kmem lru lists on offlining.
Take a look at memcg_offline_kmem().

Thanks!
On Mon 03-09-18 13:28:06, Roman Gushchin wrote:
> On Mon, Sep 03, 2018 at 08:29:56PM +0200, Michal Hocko wrote:
> >
> > [...]
> >
> > Why don't you use the same heuristic we use for the normal LRU reclaim?
>
> Because we do reparent kmem lru lists on offlining.
> Take a look at memcg_offline_kmem().

Then I must be missing something. Why are we growing the number of dead
cgroups then?
On Tue, Sep 04, 2018 at 09:00:05AM +0200, Michal Hocko wrote:
> On Mon 03-09-18 13:28:06, Roman Gushchin wrote:
> > On Mon, Sep 03, 2018 at 08:29:56PM +0200, Michal Hocko wrote:
> > >
> > > [...]
> > >
> > > Why don't you use the same heuristic we use for the normal LRU reclaim?
> >
> > Because we do reparent kmem lru lists on offlining.
> > Take a look at memcg_offline_kmem().
>
> Then I must be missing something. Why are we growing the number of dead
> cgroups then?

We do reparent LRU lists, but not objects. Objects (or, more precisely,
pages) are still holding a reference to the memcg.

Let's say we have systemd periodically restarting some service in
system.slice. Then, after the service's memcg is removed, all accounted
objects are placed onto system.slice's LRU. But under small/moderate
memory pressure we won't scan it at all, unless it gets really big. If
there are "only" a couple of thousand objects, we don't scan it, and can
easily end up with several hundred pinned dying cgroups.
On Tue 04-09-18 08:34:49, Roman Gushchin wrote:
> On Tue, Sep 04, 2018 at 09:00:05AM +0200, Michal Hocko wrote:
> >
> > [...]
> >
> > Then I must be missing something. Why are we growing the number of dead
> > cgroups then?
>
> We do reparent LRU lists, but not objects. Objects (or, more precisely,
> pages) are still holding a reference to the memcg.

OK, this is what I missed. I thought that the reparenting includes all
the pages as well. Is there any strong reason that we cannot do that?
Performance/Locking/etc.?

Or maybe do not reparent at all and rely on the same reclaim heuristic
we do for normal pages?

I am not opposing your patch but I am trying to figure out whether that
is the best approach.
On Tue, Sep 04, 2018 at 06:14:31PM +0200, Michal Hocko wrote:
> On Tue 04-09-18 08:34:49, Roman Gushchin wrote:
> >
> > [...]
> >
> > We do reparent LRU lists, but not objects. Objects (or, more precisely,
> > pages) are still holding a reference to the memcg.
>
> OK, this is what I missed. I thought that the reparenting includes all
> the pages as well. Is there any strong reason that we cannot do that?
> Performance/Locking/etc.?
>
> Or maybe do not reparent at all and rely on the same reclaim heuristic
> we do for normal pages?
>
> I am not opposing your patch but I am trying to figure out whether that
> is the best approach.

I don't think the current logic makes sense. Why should cgroups with
fewer than 4k kernel objects be excluded from being scanned?

Reparenting of all pages is definitely an option to consider, but it's
not free in any case, so if there is no problem, why should we? Let's
keep it as a last measure. In my case, the proposed patch works
perfectly: the number of dying cgroups fluctuates around 100, where it
grew steadily to 2k and more before.

I believe that reparenting of LRU lists is required to minimize the
number of LRU lists to scan, but I'm not sure.
On Tue 04-09-18 10:52:46, Roman Gushchin wrote:
> On Tue, Sep 04, 2018 at 06:14:31PM +0200, Michal Hocko wrote:
[...]
> > I am not opposing your patch but I am trying to figure out whether that
> > is the best approach.
>
> I don't think the current logic makes sense. Why should cgroups with
> fewer than 4k kernel objects be excluded from being scanned?

How is it any different from the LRU reclaim? Maybe it is just not that
visible because there are usually more pages there. But in principle it is
the same issue AFAICS.

> Reparenting of all pages is definitely an option to consider, but it's
> not free in any case, so if there is no problem, why should we? Let's
> keep it as a last measure. In my case, the proposed patch works
> perfectly: the number of dying cgroups fluctuates around 100, where it
> grew steadily to 2k and more before.

Let me emphasise that I am not opposing the patch. I just think that we
have made some decisions which are not ideal, and I would really like to
avoid building workarounds on top of them. If we have to reconsider some
of those decisions then let's do it. Maybe the priority scaling is just
too coarse, and what seems to work for normal LRUs doesn't work for
shrinkers.

> I believe that reparenting of LRU lists is required to minimize the
> number of LRU lists to scan, but I'm not sure.

Well, we do have more lists to scan for normal LRUs. It is true that
shrinkers add a multiplying factor to that, but in principle I guess we
really want to distinguish dead memcgs, because we do want to reclaim
those much more than the rest. Those objects are basically of no use,
just eating resources. The pagecache has at least some chance to be
reused, but I fail to see why we should keep kernel objects around. Sure,
some of them might be harder to reclaim due to different lifetime and
internal object management, but this doesn't change the fact that we
should try hard to reclaim those. So my gut feeling tells me that we
should have a way to distinguish them.

Btw. I do not see Vladimir on the CC list. Added (the thread starts here:
http://lkml.kernel.org/r/20180831203450.2536-1-guro@fb.com)
[now CC Vladimir for real]

On Tue 04-09-18 20:06:31, Michal Hocko wrote:
[...]
On Tue, Sep 04, 2018 at 10:52:46AM -0700, Roman Gushchin wrote:
> Reparenting of all pages is definitely an option to consider,

Reparenting pages would be great indeed, but I'm not sure we could do
that, because we'd have to walk over page lists of semi-active kmem
caches and do it consistently while some pages may be freed as we go.
Kmem caches are so optimized for performance that implementing such a
procedure without impacting any hot paths would be nearly impossible
IMHO. And there are two implementations (SLAB/SLUB), both of which we'd
have to support.

> but it's not free in any case, so if there is no problem, why should
> we? Let's keep it as a last measure. In my case, the proposed patch
> works perfectly: the number of dying cgroups fluctuates around 100,
> where it grew steadily to 2k and more before.
>
> I believe that reparenting of LRU lists is required to minimize the
> number of LRU lists to scan, but I'm not sure.

AFAIR the sole purpose of LRU reparenting is releasing the kmemcg_id as
soon as a cgroup directory is deleted. If we didn't do that, dead cgroups
would occupy slots in per-memcg arrays (list_lru, kmem_cache), so if we
had, say, 10K dead cgroups, we'd have to allocate 80 KB arrays to store
per-memcg data for each kmem_cache and list_lru. Back when kmem
accounting was introduced, we used kmalloc() for allocating those arrays,
so growing the size up to 80 KB would result in getting ENOMEM when
trying to create a cgroup too often. Now, we fall back on vmalloc(), so
maybe it wouldn't be a problem...

Alternatively, I guess we could "reparent" those dangling LRU objects not
to the parent cgroup's list_lru_memcg, but instead to a special
list_lru_memcg which wouldn't be assigned to any cgroup and which would
be reclaimed ASAP on both global and memcg pressure.
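[Editorial aside: the 80 KB figure above presumably corresponds to one pointer-sized slot per kmemcg_id; a trivial back-of-the-envelope check, my own arithmetic rather than the thread's:]

	#include <stdio.h>

	int main(void)
	{
		unsigned long dead_cgroups = 10000;        /* "say, 10K dead cgroups" */
		unsigned long slot_size = sizeof(void *);  /* one pointer per kmemcg_id, 8 bytes on 64-bit */

		/* ~78 KB, i.e. roughly the 80 KB per array mentioned above */
		printf("per-memcg array size: ~%lu KB\n",
		       dead_cgroups * slot_size / 1024);
		return 0;
	}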
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fa2c150ab7b9..c910cf6bf606 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -476,6 +476,10 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	delta = freeable >> priority;
 	delta *= 4;
 	do_div(delta, shrinker->seeks);
+
+	if (delta == 0 && freeable > 0)
+		delta = min(freeable, batch_size);
+
 	total_scan += delta;
 	if (total_scan < 0) {
 		pr_err("shrink_slab: %pF negative objects to delete nr=%ld\n",
Commit 9092c71bb724 ("mm: use sc->priority for slab shrink targets")
changed the way the target slab pressure is calculated and made it
priority-based:

	delta = freeable >> priority;
	delta *= 4;
	do_div(delta, shrinker->seeks);

The problem is that at the default priority (which is 12) no pressure is
applied at all if the number of potentially reclaimable objects is less
than 4096 (1 << 12).

This wouldn't be a big deal if these objects were not pinning the
corresponding dying memory cgroups. 4096 dentries/inodes/radix tree
nodes/... is a reasonable number, but 4096 dying cgroups is not.

If there are no big spikes in memory pressure, and new memory cgroups are
created and destroyed periodically, this causes the number of dying
cgroups to grow steadily, resulting in a slow-ish and hard-to-detect
memory "leak". It's not a real leak, as the memory can eventually be
reclaimed, but in real life that may never happen. I've seen hosts with a
steadily climbing number of dying cgroups, showing no signs of decline
for months, even though the hosts were loaded with a production workload.

It is an obvious waste of memory, and to prevent it, let's apply a
minimal pressure even on small shrinker lists. E.g. if there are freeable
objects, let's scan at least min(freeable, scan_batch) objects.

This fix significantly improves the chances of a dying cgroup being
reclaimed, and together with some previous patches stops the steady
growth in the number of dying cgroups on some of our hosts.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 mm/vmscan.c | 4 ++++
 1 file changed, 4 insertions(+)
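[Editorial aside: to make the 4096-object threshold in the changelog concrete, here is a small userspace sketch, not part of the submission, assuming the default reclaim priority of 12 and DEFAULT_SEEKS == 2, with plain division standing in for do_div(). Every cache below 4096 freeable objects gets delta == 0 before the fix.]

	#include <stdio.h>

	#define DEF_PRIORITY  12	/* default reclaim priority */
	#define DEFAULT_SEEKS 2

	int main(void)
	{
		unsigned long sizes[] = { 100, 1000, 4095, 4096, 8192 };
		int i;

		for (i = 0; i < 5; i++) {
			unsigned long delta = (sizes[i] >> DEF_PRIORITY) * 4 / DEFAULT_SEEKS;

			printf("freeable=%5lu -> delta=%lu\n", sizes[i], delta);
		}
		/*
		 * Prints 0 for 100, 1000 and 4095; 2 for 4096; 4 for 8192.
		 * The patch above adds a min(freeable, batch_size) floor so the
		 * zero cases still see some scanning.
		 */
		return 0;
	}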