diff mbox series

[v7,12/12] mm: vmscan: shrink deferred objects proportional to priority

Message ID 20210209174646.1310591-13-shy828301@gmail.com (mailing list archive)
State New, archived
Series Make shrinker's nr_deferred memcg aware

Commit Message

Yang Shi Feb. 9, 2021, 5:46 p.m. UTC
The number of deferred objects can wind up to an absurd number, which results
in the slab objects being clamped.  This is undesirable for sustaining the
working set.

So shrink the deferred objects in proportion to priority and cap nr_deferred at
twice the number of cache items.

The idea is borrowed from Dave Chinner's patch:
https://lore.kernel.org/linux-xfs/20191031234618.15403-13-david@fromorbit.com/

Tested with a kernel build and a VFS-metadata-heavy workload in our production
environment; no regression has been spotted so far.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/vmscan.c | 40 +++++-----------------------------------
 1 file changed, 5 insertions(+), 35 deletions(-)

Comments

Vlastimil Babka Feb. 11, 2021, 1:10 p.m. UTC | #1
On 2/9/21 6:46 PM, Yang Shi wrote:
> The number of deferred objects might get windup to an absurd number, and it
> results in clamp of slab objects.  It is undesirable for sustaining workingset.
> 
> So shrink deferred objects proportional to priority and cap nr_deferred to twice
> of cache items.

Makes sense to me. At a minimum it's simpler than the old code, and avoiding
absurd growth of nr_deferred should be a good thing, as should the
"proportional to priority" part.

I just suspect there's a bit of unnecessary bias in the implementation, as
explained below:

> The idea is borrowed from Dave Chinner's patch:
> https://lore.kernel.org/linux-xfs/20191031234618.15403-13-david@fromorbit.com/
> 
> Tested with kernel build and vfs metadata heavy workload in our production
> environment, no regression is spotted so far.
> 
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
>  mm/vmscan.c | 40 +++++-----------------------------------
>  1 file changed, 5 insertions(+), 35 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 66163082cc6f..d670b119d6bd 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -654,7 +654,6 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  	 */
>  	nr = count_nr_deferred(shrinker, shrinkctl);
>  
> -	total_scan = nr;
>  	if (shrinker->seeks) {
>  		delta = freeable >> priority;
>  		delta *= 4;
> @@ -668,37 +667,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  		delta = freeable / 2;
>  	}
>  
> +	total_scan = nr >> priority;
>  	total_scan += delta;

So, our scan goal consists of the part based on freeable objects (delta), plus a
part of the deferred objects (nr >> priority). Fine.
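
As a rough illustration of that scan goal, here is a minimal userspace sketch
(not kernel code). The function name scan_goal is made up for illustration, and
delta is simply passed in, since the hunk above does not show the whole delta
computation. The cap at 2 * freeable is the one the patch applies right after.

```c
/*
 * Hypothetical userspace model of the scan goal in the patch.
 * "delta" is taken as an input because its full computation is not
 * visible in the quoted hunk.
 */
static long scan_goal(long nr, long delta, long freeable, int priority)
{
	long total_scan;

	total_scan = nr >> priority;	/* share of the deferred backlog */
	total_scan += delta;		/* plus the new work for this round */

	/* Same cap as in the patch: never more than twice the cache size. */
	if (total_scan > 2 * freeable)
		total_scan = 2 * freeable;

	return total_scan;
}
```

With nr = 1000000, delta = 20 and freeable = 40960, priority 12 gives a goal of
264 (244 deferred plus 20 new), while priority 0 would ask for the whole backlog
and is capped at 81920.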

> -	if (total_scan < 0) {
> -		pr_err("shrink_slab: %pS negative objects to delete nr=%ld\n",
> -		       shrinker->scan_objects, total_scan);
> -		total_scan = freeable;
> -		next_deferred = nr;
> -	} else
> -		next_deferred = total_scan;
> -
> -	/*
> -	 * We need to avoid excessive windup on filesystem shrinkers
> -	 * due to large numbers of GFP_NOFS allocations causing the
> -	 * shrinkers to return -1 all the time. This results in a large
> -	 * nr being built up so when a shrink that can do some work
> -	 * comes along it empties the entire cache due to nr >>>
> -	 * freeable. This is bad for sustaining a working set in
> -	 * memory.
> -	 *
> -	 * Hence only allow the shrinker to scan the entire cache when
> -	 * a large delta change is calculated directly.
> -	 */
> -	if (delta < freeable / 4)
> -		total_scan = min(total_scan, freeable / 2);
> -
> -	/*
> -	 * Avoid risking looping forever due to too large nr value:
> -	 * never try to free more than twice the estimate number of
> -	 * freeable entries.
> -	 */
> -	if (total_scan > freeable * 2)
> -		total_scan = freeable * 2;
> +	total_scan = min(total_scan, (2 * freeable));

Probably unnecessary as we cap next_deferred below anyway? So total_scan cannot
grow without limits anymore. But can't hurt.

>  	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
>  				   freeable, delta, total_scan, priority);
> @@ -737,10 +708,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  		cond_resched();
>  	}
>  
> -	if (next_deferred >= scanned)
> -		next_deferred -= scanned;
> -	else
> -		next_deferred = 0;
> +	next_deferred = max_t(long, (nr - scanned), 0) + total_scan;

And here's the bias, I think. Suppose we scanned 0 due to e.g. GFP_NOFS. We count
as newly deferred both the "delta" part of total_scan, which is fine, and
the "nr >> priority" part. In that case we failed to do our share of the "reduce
nr_deferred" work, but I don't think that means we should also increase
nr_deferred by that amount of failed work.
OTOH, if we succeed and scan exactly the whole goal, we subtract from
nr_deferred both the "nr >> priority" part, which is correct, and delta,
which was new work, not deferred work, so that's incorrect IMHO as well.
So the calculation should probably be something like this?

	next_deferred = max_t(long, nr + delta - scanned, 0);
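
To make the two cases concrete, here is a small userspace model of both
formulas. It is only a sketch: the helper names are made up for illustration,
the 2 * freeable cap is left out, and "goal" stands for total_scan before the
scan loop, i.e. (nr >> priority) + delta.

```c
/* Userspace model of the two bookkeeping formulas; not kernel code. */
static long max_long(long a, long b)
{
	return a > b ? a : b;
}

/* v7 patch: the unscanned remainder of the goal is added to (nr - scanned). */
static long next_deferred_v7(long nr, long goal, long scanned)
{
	long remaining = goal - scanned;	/* total_scan after the scan loop */

	return max_long(nr - scanned, 0) + remaining;
}

/* Suggested: old backlog plus new work, minus whatever was scanned. */
static long next_deferred_suggested(long nr, long delta, long scanned)
{
	return max_long(nr + delta - scanned, 0);
}
```

With nr = 1000, delta = 100 and priority = 2, the goal is 350. If nothing is
scanned, the v7 formula gives 1350 while the suggested one gives 1100. If the
whole goal is scanned, v7 gives 650 and the suggested formula gives 750.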

Thanks,
Vlastimil

> +	next_deferred = min(next_deferred, (2 * freeable));
> +
>  	/*
>  	 * move the unused scan count back into the shrinker in a
>  	 * manner that handles concurrent updates.
>
Yang Shi Feb. 11, 2021, 5:29 p.m. UTC | #2
On Thu, Feb 11, 2021 at 5:10 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 2/9/21 6:46 PM, Yang Shi wrote:
> > The number of deferred objects might get windup to an absurd number, and it
> > results in clamp of slab objects.  It is undesirable for sustaining workingset.
> >
> > So shrink deferred objects proportional to priority and cap nr_deferred to twice
> > of cache items.
>
> Makes sense to me, minimally it's simpler than the old code and avoiding absurd
> growth of nr_deferred should be a good thing, as well as the "proportional to
> priority" part.

Thanks.

>
> I just suspect there's a bit of unnecessary bias in the implementation, as
> explained below:
>
> > The idea is borrowed from Dave Chinner's patch:
> > https://lore.kernel.org/linux-xfs/20191031234618.15403-13-david@fromorbit.com/
> >
> > Tested with kernel build and vfs metadata heavy workload in our production
> > environment, no regression is spotted so far.
> >
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  mm/vmscan.c | 40 +++++-----------------------------------
> >  1 file changed, 5 insertions(+), 35 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 66163082cc6f..d670b119d6bd 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -654,7 +654,6 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> >        */
> >       nr = count_nr_deferred(shrinker, shrinkctl);
> >
> > -     total_scan = nr;
> >       if (shrinker->seeks) {
> >               delta = freeable >> priority;
> >               delta *= 4;
> > @@ -668,37 +667,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> >               delta = freeable / 2;
> >       }
> >
> > +     total_scan = nr >> priority;
> >       total_scan += delta;
>
> So, our scan goal consists of the part based on freeable objects (delta), plus a
> part of the defferred objects (nr >> priority). Fine.
>
> > -     if (total_scan < 0) {
> > -             pr_err("shrink_slab: %pS negative objects to delete nr=%ld\n",
> > -                    shrinker->scan_objects, total_scan);
> > -             total_scan = freeable;
> > -             next_deferred = nr;
> > -     } else
> > -             next_deferred = total_scan;
> > -
> > -     /*
> > -      * We need to avoid excessive windup on filesystem shrinkers
> > -      * due to large numbers of GFP_NOFS allocations causing the
> > -      * shrinkers to return -1 all the time. This results in a large
> > -      * nr being built up so when a shrink that can do some work
> > -      * comes along it empties the entire cache due to nr >>>
> > -      * freeable. This is bad for sustaining a working set in
> > -      * memory.
> > -      *
> > -      * Hence only allow the shrinker to scan the entire cache when
> > -      * a large delta change is calculated directly.
> > -      */
> > -     if (delta < freeable / 4)
> > -             total_scan = min(total_scan, freeable / 2);
> > -
> > -     /*
> > -      * Avoid risking looping forever due to too large nr value:
> > -      * never try to free more than twice the estimate number of
> > -      * freeable entries.
> > -      */
> > -     if (total_scan > freeable * 2)
> > -             total_scan = freeable * 2;
> > +     total_scan = min(total_scan, (2 * freeable));
>
> Probably unnecessary as we cap next_deferred below anyway? So total_scan cannot
> grow without limits anymore. But can't hurt.
>
> >       trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
> >                                  freeable, delta, total_scan, priority);
> > @@ -737,10 +708,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> >               cond_resched();
> >       }
> >
> > -     if (next_deferred >= scanned)
> > -             next_deferred -= scanned;
> > -     else
> > -             next_deferred = 0;
> > +     next_deferred = max_t(long, (nr - scanned), 0) + total_scan;
>
> And here's the bias I think. Suppose we scanned 0 due to e.g. GFP_NOFS. We count
> as newly deferred both the "delta" part of total_scan, which is fine, but also
> the "nr >> priority" part, where we failed to our share of the "reduce
> nr_deferred" work, but I don't think it means we should also increase
> nr_deferred by that amount of failed work.

Here "nr" is the saved deferred work since the last scan, "scanned" is
the scanned work in this round, and total_scan is the *unscanned* work,
which is actually "total_scan - scanned" (total_scan is decreased by
scanned in each loop). So the logic is "subtract any scanned work
from deferred, then add the newly unscanned work to deferred". IIUC this is
what "deferred" meant even before this patch.

> OTOH if we succeed and scan exactly the whole goal, we are subtracting from
> nr_deferred both the "nr >> priority" part, which is correct, but also delta,
> which was new work, not deferred one, so that's incorrect IMHO as well.

I don't think so. The deferred work comes from new work, so why not
subtract new work from deferred?

And, the old code did:

	if (next_deferred >= scanned)
		next_deferred -= scanned;
	else
		next_deferred = 0;

IIUC, it also decreases the new work (the scanned count includes both the
last deferred work and the new delta).

> So the calculation should probably be something like this?
>
>         next_deferred = max_t(long, nr + delta - scanned, 0);
>
> Thanks,
> Vlastimil
>
> > +     next_deferred = min(next_deferred, (2 * freeable));
> > +
> >       /*
> >        * move the unused scan count back into the shrinker in a
> >        * manner that handles concurrent updates.
> >
>
Vlastimil Babka Feb. 11, 2021, 6:52 p.m. UTC | #3
On 2/11/21 6:29 PM, Yang Shi wrote:
> On Thu, Feb 11, 2021 at 5:10 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>> >       trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
>> >                                  freeable, delta, total_scan, priority);
>> > @@ -737,10 +708,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>> >               cond_resched();
>> >       }
>> >
>> > -     if (next_deferred >= scanned)
>> > -             next_deferred -= scanned;
>> > -     else
>> > -             next_deferred = 0;
>> > +     next_deferred = max_t(long, (nr - scanned), 0) + total_scan;
>>
>> And here's the bias I think. Suppose we scanned 0 due to e.g. GFP_NOFS. We count
>> as newly deferred both the "delta" part of total_scan, which is fine, but also
>> the "nr >> priority" part, where we failed to our share of the "reduce
>> nr_deferred" work, but I don't think it means we should also increase
>> nr_deferred by that amount of failed work.
> 
> Here "nr" is the saved deferred work since the last scan, "scanned" is
> the scanned work in this round, total_scan is the *unscanned" work
> which is actually "total_scan - scanned" (total_scan is decreased by
> scanned in each loop). So, the logic is "decrease any scanned work
> from deferred then add newly unscanned work to deferred". IIUC this is
> what "deferred" means even before this patch.

Hm, I thought the logic was "increase by any new work (delta) that wasn't done,
decrease by old deferred work that was done now". My examples with scanned = 0
and scanned = total_scan (total_scan before subtracting scanned from it) should
demonstrate that the logic is different with your patch.

>> OTOH if we succeed and scan exactly the whole goal, we are subtracting from
>> nr_deferred both the "nr >> priority" part, which is correct, but also delta,
>> which was new work, not deferred one, so that's incorrect IMHO as well.
> 
> I don't think so. The deferred comes from new work, why not dec new
> work from deferred?
> 
> And, the old code did:
> 
> if (next_deferred >= scanned)
>                 next_deferred -= scanned;
>         else
>                 next_deferred = 0;
> 
> IIUC, it also decreases the new work (the scanned includes both last
> deferred and new delata).

Yes, but in the old code, next_deferred starts as

nr = count_nr_deferred()...
total_scan = nr;
delta = ... // something based on freeable
total_scan += delta;
next_deferred = total_scan; // in the common case total_scan >= 0

... and that's "total_scan" before "scanned" is subtracted from it, so it
includes the new work ("delta"), and then it's OK to do "next_deferred -= scanned";

I still think your formula is (unintentionally) changing the logic. You can also
look at it from a different angle: it's effectively (without the max_t() part) "nr
- scanned + total_scan" where total_scan is actually "total_scan - scanned", as
you point out yourself. So "scanned" is subtracted twice? That can't be correct...
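
The difference can be sketched in userspace as follows. This is only a model of
the bookkeeping described above, with made-up function names; it covers the
common case (total_scan >= 0) and ignores the caps on total_scan.

```c
/* Old code, common case: next_deferred starts as nr + delta. */
static long old_next_deferred(long nr, long delta, long scanned)
{
	long next_deferred = nr + delta;	/* total_scan before the scan loop */

	if (next_deferred >= scanned)
		next_deferred -= scanned;
	else
		next_deferred = 0;
	return next_deferred;
}

/* v7 formula with the max_t() clamp dropped, as in the text above. */
static long v7_unclamped(long nr, long delta, int priority, long scanned)
{
	long total_scan = (nr >> priority) + delta;	/* the scan goal */

	total_scan -= scanned;			/* what the scan loop leaves behind */
	return nr - scanned + total_scan;	/* scanned is subtracted twice */
}
```

For nr = 1000, delta = 100, priority = 2 and scanned = 200, the old bookkeeping
yields 900, whereas the v7 expression yields 950, which is
nr + (nr >> priority) + delta - 2 * scanned.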

>> So the calculation should probably be something like this?
>>
>>         next_deferred = max_t(long, nr + delta - scanned, 0);
>>
>> Thanks,
>> Vlastimil
>>
>> > +     next_deferred = min(next_deferred, (2 * freeable));
>> > +
>> >       /*
>> >        * move the unused scan count back into the shrinker in a
>> >        * manner that handles concurrent updates.
>> >
>>
>
Yang Shi Feb. 11, 2021, 7:15 p.m. UTC | #4
On Thu, Feb 11, 2021 at 10:52 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 2/11/21 6:29 PM, Yang Shi wrote:
> > On Thu, Feb 11, 2021 at 5:10 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >> >       trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
> >> >                                  freeable, delta, total_scan, priority);
> >> > @@ -737,10 +708,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> >> >               cond_resched();
> >> >       }
> >> >
> >> > -     if (next_deferred >= scanned)
> >> > -             next_deferred -= scanned;
> >> > -     else
> >> > -             next_deferred = 0;
> >> > +     next_deferred = max_t(long, (nr - scanned), 0) + total_scan;
> >>
> >> And here's the bias I think. Suppose we scanned 0 due to e.g. GFP_NOFS. We count
> >> as newly deferred both the "delta" part of total_scan, which is fine, but also
> >> the "nr >> priority" part, where we failed to our share of the "reduce
> >> nr_deferred" work, but I don't think it means we should also increase
> >> nr_deferred by that amount of failed work.
> >
> > Here "nr" is the saved deferred work since the last scan, "scanned" is
> > the scanned work in this round, total_scan is the *unscanned" work
> > which is actually "total_scan - scanned" (total_scan is decreased by
> > scanned in each loop). So, the logic is "decrease any scanned work
> > from deferred then add newly unscanned work to deferred". IIUC this is
> > what "deferred" means even before this patch.
>
> Hm I thought the logic was "increase by any new work (delta) that wasn't done,
> decrease by old deferred work that was done now". My examples with scanned = 0
> and scanned = total_work (total_work before subtracting scanned from it) should
> demonstrate that the logic is different with your patch.

I think we are on the same page about the logic. But I agree the
formula implemented in the code is wrong.

>
> >> OTOH if we succeed and scan exactly the whole goal, we are subtracting from
> >> nr_deferred both the "nr >> priority" part, which is correct, but also delta,
> >> which was new work, not deferred one, so that's incorrect IMHO as well.
> >
> > I don't think so. The deferred comes from new work, why not dec new
> > work from deferred?
> >
> > And, the old code did:
> >
> > if (next_deferred >= scanned)
> >                 next_deferred -= scanned;
> >         else
> >                 next_deferred = 0;
> >
> > IIUC, it also decreases the new work (the scanned includes both last
> > deferred and new delata).
>
> Yes, but in the old code, next_deferred starts as
>
> nr = count_nr_deferred()...
> total_scan = nr;
> delta = ... // something based on freeable
> total_scan += delta;
> next_deferred = total_scan; // in the common case total_scan >= 0
>
> ... and that's "total_scan" before "scanned" is subtracted from it, so it
> includes the new_work ("delta"), so then it's OK to do "next_deferred -= scanned";
>
> I still think your formula is (unintentionally) changing the logic. You can also
> look at it from different angle, it's effectively (without the max_t() part) "nr
> - scanned + total_scan" where total_scan is actually "total_scan - scanned" as
> you point your yourself. So "scanned" is subtracted twice? That can't be correct...

Yes, I think you are right, it cannot be correct. What I actually wanted
was to add the unscanned delta part to next_deferred. But my formula
not only subtracts scanned twice but also adds the unscanned deferred
part back again. So the formula you suggested seems correct. I will
fix this in v8. Thanks a lot for helping me get out of the maze. I will
also add some notes right before the formula.
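
For illustration only, the suggested formula combined with the cap line from
this patch could be modelled in userspace like this. This is an assumption
about how the two pieces fit together, not the v8 code, and the helper names
are hypothetical. It shows that repeated rounds with nothing scanned (e.g.
GFP_NOFS) stop winding up once nr_deferred reaches 2 * freeable.

```c
static long min_long(long a, long b)
{
	return a < b ? a : b;
}

static long max_long(long a, long b)
{
	return a > b ? a : b;
}

/* One round of bookkeeping: suggested formula, then the cap from the patch. */
static long update_deferred(long nr, long delta, long scanned, long freeable)
{
	long next_deferred = max_long(nr + delta - scanned, 0);

	return min_long(next_deferred, 2 * freeable);
}

/* Repeat rounds in which nothing can be scanned, starting from no backlog. */
static long windup(long freeable, long delta, int rounds)
{
	long nr = 0;
	int i;

	for (i = 0; i < rounds; i++)
		nr = update_deferred(nr, delta, 0, freeable);
	return nr;
}
```

With freeable = 1000 and delta = 100, five failed rounds leave 500 deferred
objects, and fifty failed rounds leave 2000 rather than 5000.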

>
> >> So the calculation should probably be something like this?
> >>
> >>         next_deferred = max_t(long, nr + delta - scanned, 0);
> >>
> >> Thanks,
> >> Vlastimil
> >>
> >> > +     next_deferred = min(next_deferred, (2 * freeable));
> >> > +
> >> >       /*
> >> >        * move the unused scan count back into the shrinker in a
> >> >        * manner that handles concurrent updates.
> >> >
> >>
> >
>
Patch

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 66163082cc6f..d670b119d6bd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -654,7 +654,6 @@  static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	 */
 	nr = count_nr_deferred(shrinker, shrinkctl);
 
-	total_scan = nr;
 	if (shrinker->seeks) {
 		delta = freeable >> priority;
 		delta *= 4;
@@ -668,37 +667,9 @@  static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 		delta = freeable / 2;
 	}
 
+	total_scan = nr >> priority;
 	total_scan += delta;
-	if (total_scan < 0) {
-		pr_err("shrink_slab: %pS negative objects to delete nr=%ld\n",
-		       shrinker->scan_objects, total_scan);
-		total_scan = freeable;
-		next_deferred = nr;
-	} else
-		next_deferred = total_scan;
-
-	/*
-	 * We need to avoid excessive windup on filesystem shrinkers
-	 * due to large numbers of GFP_NOFS allocations causing the
-	 * shrinkers to return -1 all the time. This results in a large
-	 * nr being built up so when a shrink that can do some work
-	 * comes along it empties the entire cache due to nr >>>
-	 * freeable. This is bad for sustaining a working set in
-	 * memory.
-	 *
-	 * Hence only allow the shrinker to scan the entire cache when
-	 * a large delta change is calculated directly.
-	 */
-	if (delta < freeable / 4)
-		total_scan = min(total_scan, freeable / 2);
-
-	/*
-	 * Avoid risking looping forever due to too large nr value:
-	 * never try to free more than twice the estimate number of
-	 * freeable entries.
-	 */
-	if (total_scan > freeable * 2)
-		total_scan = freeable * 2;
+	total_scan = min(total_scan, (2 * freeable));
 
 	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
 				   freeable, delta, total_scan, priority);
@@ -737,10 +708,9 @@  static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 		cond_resched();
 	}
 
-	if (next_deferred >= scanned)
-		next_deferred -= scanned;
-	else
-		next_deferred = 0;
+	next_deferred = max_t(long, (nr - scanned), 0) + total_scan;
+	next_deferred = min(next_deferred, (2 * freeable));
+
 	/*
 	 * move the unused scan count back into the shrinker in a
 	 * manner that handles concurrent updates.