
mm,vmscan: fix divide by zero in get_scan_count

Message ID 20210826220149.058089c6@imladris.surriel.com (mailing list archive)
State New
Series mm,vmscan: fix divide by zero in get_scan_count

Commit Message

Rik van Riel Aug. 27, 2021, 2:01 a.m. UTC
Changeset f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to
proportional memory.low reclaim") introduced a divide by zero corner
case when oomd is being used in combination with cgroup memory.low
protection.

When oomd decides to kill a cgroup, it will force the cgroup memory
to be reclaimed after killing the tasks, by writing to the memory.max
file for that cgroup, forcing the remaining page cache and reclaimable
slab to be reclaimed down to zero.

Previously, on cgroups with some memory.low protection, that write would
result in the memory being reclaimed down to the memory.low limit, or
likely not at all, with the remaining page cache reclaimed
asynchronously later.

With f56ce412a59d the oomd write to memory.max tries to reclaim all the
way down to zero, which may race with another reclaimer, to the point of
ending up with the divide by zero below.

This patch implements the obvious fix.

Fixes: f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to proportional memory.low reclaim")
Signed-off-by: Rik van Riel <riel@surriel.com>
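
[For context: the proportional reclaim logic in get_scan_count() that this
patch modifies looks roughly like the sketch below. This is a simplified
paraphrase pieced together from the hunks and descriptions in this thread,
not a verbatim copy of mm/vmscan.c.]

	mem_cgroup_protection(sc->target_mem_cgroup, memcg, &min, &low);

	if (min || low) {
		unsigned long cgroup_size = mem_cgroup_size(memcg);
		unsigned long protection;

		/* memory.low scaling; remember to retry before OOMing */
		if (!sc->memcg_low_reclaim && low > min) {
			protection = low;
			sc->memcg_low_skipped = 1;
		} else {
			protection = min;
		}

		/* clamp a racing cgroup size to at least the protection */
		cgroup_size = max(cgroup_size, protection);

		/* divides by zero if protection == 0 and cgroup_size == 0 */
		scan = lruvec_size - lruvec_size * protection /
			cgroup_size;
	}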

Comments

Roman Gushchin Aug. 27, 2021, 4:28 p.m. UTC | #1
On Thu, Aug 26, 2021 at 10:01:49PM -0400, Rik van Riel wrote:
> Changeset f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to
> proportional memory.low reclaim") introduced a divide by zero corner
> case when oomd is being used in combination with cgroup memory.low
> protection.
> 
> When oomd decides to kill a cgroup, it will force the cgroup memory
> to be reclaimed after killing the tasks, by writing to the memory.max
> file for that cgroup, forcing the remaining page cache and reclaimable
> slab to be reclaimed down to zero.
> 
> Previously, on cgroups with some memory.low protection that would result
> in the memory being reclaimed down to the memory.low limit, or likely not
> at all, having the page cache reclaimed asynchronously later.
> 
> With f56ce412a59d the oomd write to memory.max tries to reclaim all the
> way down to zero, which may race with another reclaimer, to the point of
> ending up with the divide by zero below.
> 
> This patch implements the obvious fix.
> 
> Fixes: f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to proportional memory.low reclaim")
> Signed-off-by: Rik van Riel <riel@surriel.com>

It looks like we discover a new divide-by-zero corner case after every
functional change to the memory protection code :)

Acked-by: Roman Gushchin <guro@fb.com>

Thanks, Rik!
Michal Hocko Aug. 30, 2021, 11:33 a.m. UTC | #2
On Thu 26-08-21 22:01:49, Rik van Riel wrote:
> Changeset f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to
> proportional memory.low reclaim") introduced a divide by zero corner
> case when oomd is being used in combination with cgroup memory.low
> protection.
> 
> When oomd decides to kill a cgroup, it will force the cgroup memory
> to be reclaimed after killing the tasks, by writing to the memory.max
> file for that cgroup, forcing the remaining page cache and reclaimable
> slab to be reclaimed down to zero.
> 
> Previously, on cgroups with some memory.low protection that would result
> in the memory being reclaimed down to the memory.low limit, or likely not
> at all, having the page cache reclaimed asynchronously later.
> 
> With f56ce412a59d the oomd write to memory.max tries to reclaim all the
> way down to zero, which may race with another reclaimer, to the point of
> ending up with the divide by zero below.
> 
> This patch implements the obvious fix.

I must be missing something but how can cgroup_size be ever 0 when it is
max(cgroup_size, protection) and protection != 0?
Rik van Riel Aug. 30, 2021, 1:24 p.m. UTC | #3
On Mon, 2021-08-30 at 13:33 +0200, Michal Hocko wrote:
> On Thu 26-08-21 22:01:49, Rik van Riel wrote:
> > Changeset f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to
> > proportional memory.low reclaim") introduced a divide by zero
> > corner
> > case when oomd is being used in combination with cgroup memory.low
> > protection.
> > 
> > When oomd decides to kill a cgroup, it will force the cgroup memory
> > to be reclaimed after killing the tasks, by writing to the
> > memory.max
> > file for that cgroup, forcing the remaining page cache and
> > reclaimable
> > slab to be reclaimed down to zero.
> > 
> > Previously, on cgroups with some memory.low protection that would
> > result
> > in the memory being reclaimed down to the memory.low limit, or
> > likely not
> > at all, having the page cache reclaimed asynchronously later.
> > 
> > With f56ce412a59d the oomd write to memory.max tries to reclaim all
> > the
> > way down to zero, which may race with another reclaimer, to the
> > point of
> > ending up with the divide by zero below.
> > 
> > This patch implements the obvious fix.
> 
> I must be missing something but how can cgroup_size be ever 0 when it
> is
> max(cgroup_size, protection) and protection != 0?

Going into the condition we use if (low || min), where
it is possible for low > 0 && min == 0.

Inside the conditional, we can end up testing against
min.
Michal Hocko Aug. 30, 2021, 1:41 p.m. UTC | #4
On Mon 30-08-21 09:24:22, Rik van Riel wrote:
> On Mon, 2021-08-30 at 13:33 +0200, Michal Hocko wrote:
[...]
> > I must be missing something but how can cgroup_size be ever 0 when it
> > is
> > max(cgroup_size, protection) and protection != 0?
> 
> Going into the condition we use if (low || min), where
> it is possible for low > 0 && min == 0.
> 
> Inside the conditional, we can end up testing against
> min.

Dang, I was looking at the tree without f56ce412a59d7 applied. My bad!
Personally I would consider the following slightly easier to follow
			scan = lruvec_size - lruvec_size * protection /
				max(cgroup_size, 1);

The code is quite tricky already and if you asked me what effect
cgroup_size + 1 has there I would just shrug my shoulders...

Anyway your fix will prevent the reported problem and I cannot see any
obvious problem with it either so
Acked-by: Michal Hocko <mhocko@suse.com>
Johannes Weiner Aug. 30, 2021, 8:48 p.m. UTC | #5
On Thu, Aug 26, 2021 at 10:01:49PM -0400, Rik van Riel wrote:
> Changeset f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to
> proportional memory.low reclaim") introduced a divide by zero corner
> case when oomd is being used in combination with cgroup memory.low
> protection.
> 
> When oomd decides to kill a cgroup, it will force the cgroup memory
> to be reclaimed after killing the tasks, by writing to the memory.max
> file for that cgroup, forcing the remaining page cache and reclaimable
> slab to be reclaimed down to zero.
> 
> Previously, on cgroups with some memory.low protection that would result
> in the memory being reclaimed down to the memory.low limit, or likely not
> at all, having the page cache reclaimed asynchronously later.
> 
> With f56ce412a59d the oomd write to memory.max tries to reclaim all the
> way down to zero, which may race with another reclaimer, to the point of
> ending up with the divide by zero below.
> 
> This patch implements the obvious fix.
> 
> Fixes: f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to proportional memory.low reclaim")
> Signed-off-by: Rik van Riel <riel@surriel.com>

That took me a second.

Before the patch, that sc->memcg_low_reclaim test was outside of that
whole proportional reclaim branch. So if we were in low reclaim mode
we wouldn't even check if a low setting is in place; if min is zero,
we don't enter the proportional branch.

Now we enter if low is set but ignored, and then end up with
cgroup_size == min == 0 == divide by black hole.

Good catch.
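
[To make the failure mode concrete, here is a hypothetical set of values,
illustrative only, walked through the logic sketched above:]

	/*
	 * memory.low = 512M, memory.min = 0       -> low > 0, min == 0
	 * sc->memcg_low_reclaim == true           -> low is ignored
	 *
	 * if (min || low)                         -> taken, since low != 0
	 * protection = min                        -> 0
	 * cgroup_size = mem_cgroup_size(memcg)    -> 0 (a racing reclaimer
	 *                                            already emptied the group)
	 * cgroup_size = max(cgroup_size, 0)       -> 0
	 * scan = lruvec_size - lruvec_size * 0 / 0 -> divide by zero
	 */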

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eeae2f6bc532..f1782b816c98 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2592,7 +2592,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>  			cgroup_size = max(cgroup_size, protection);
>  
>  			scan = lruvec_size - lruvec_size * protection /
> -				cgroup_size;
> +				(cgroup_size + 1);

I have no overly strong preferences, but if Michal prefers max(), how about:

	cgroup_size = max3(cgroup_size, protection, 1);

Or go back to not taking the branch in the first place when there is
no protection in effect...

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6247f6f4469a..9c200bb3ae51 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2547,7 +2547,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 		mem_cgroup_protection(sc->target_mem_cgroup, memcg,
 				      &min, &low);
 
-		if (min || low) {
+		if (min || (!sc->memcg_low_reclaim && low)) {
 			/*
 			 * Scale a cgroup's reclaim pressure by proportioning
 			 * its current usage to its memory.low or memory.min
Michal Hocko Aug. 31, 2021, 9:59 a.m. UTC | #6
On Mon 30-08-21 16:48:03, Johannes Weiner wrote:
> On Thu, Aug 26, 2021 at 10:01:49PM -0400, Rik van Riel wrote:
[...]
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index eeae2f6bc532..f1782b816c98 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2592,7 +2592,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> >  			cgroup_size = max(cgroup_size, protection);
> >  
> >  			scan = lruvec_size - lruvec_size * protection /
> > -				cgroup_size;
> > +				(cgroup_size + 1);
> 
> I have no overly strong preferences, but if Michal prefers max(), how about:
> 
> 	cgroup_size = max3(cgroup_size, protection, 1);

Yes this is better.

> Or go back to not taking the branch in the first place when there is
> no protection in effect...
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 6247f6f4469a..9c200bb3ae51 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2547,7 +2547,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>  		mem_cgroup_protection(sc->target_mem_cgroup, memcg,
>  				      &min, &low);
>  
> -		if (min || low) {
> +		if (min || (!sc->memcg_low_reclaim && low)) {
>  			/*
>  			 * Scale a cgroup's reclaim pressure by proportioning
>  			 * its current usage to its memory.low or memory.min

This is slightly more complex to read but it is also better than +1
trick.

Either of the two work for me.
Chris Down Aug. 31, 2021, 12:58 p.m. UTC | #7
Rik van Riel writes:
>Changeset f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to
>proportional memory.low reclaim") introduced a divide by zero corner
>case when oomd is being used in combination with cgroup memory.low
>protection.
>
>When oomd decides to kill a cgroup, it will force the cgroup memory
>to be reclaimed after killing the tasks, by writing to the memory.max
>file for that cgroup, forcing the remaining page cache and reclaimable
>slab to be reclaimed down to zero.
>
>Previously, on cgroups with some memory.low protection that would result
>in the memory being reclaimed down to the memory.low limit, or likely not
>at all, having the page cache reclaimed asynchronously later.
>
>With f56ce412a59d the oomd write to memory.max tries to reclaim all the
>way down to zero, which may race with another reclaimer, to the point of
>ending up with the divide by zero below.
>
>This patch implements the obvious fix.
>
>Fixes: f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to proportional memory.low reclaim")
>Signed-off-by: Rik van Riel <riel@surriel.com>

Thanks, good catch. No strong preference on this vs. max3(), so feel free to 
keep my ack either way.

Acked-by: Chris Down <chris@chrisdown.name>

>
>diff --git a/mm/vmscan.c b/mm/vmscan.c
>index eeae2f6bc532..f1782b816c98 100644
>--- a/mm/vmscan.c
>+++ b/mm/vmscan.c
>@@ -2592,7 +2592,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> 			cgroup_size = max(cgroup_size, protection);
>
> 			scan = lruvec_size - lruvec_size * protection /
>-				cgroup_size;
>+				(cgroup_size + 1);
>
> 			/*
> 			 * Minimally target SWAP_CLUSTER_MAX pages to keep
>
>
Rik van Riel Aug. 31, 2021, 3:48 p.m. UTC | #8
On Tue, 2021-08-31 at 11:59 +0200, Michal Hocko wrote:
> On Mon 30-08-21 16:48:03, Johannes Weiner wrote:
> 
> 
> > Or go back to not taking the branch in the first place when there
> > is
> > no protection in effect...
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 6247f6f4469a..9c200bb3ae51 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2547,7 +2547,7 @@ static void get_scan_count(struct lruvec
> > *lruvec, struct scan_control *sc,
> >                 mem_cgroup_protection(sc->target_mem_cgroup, memcg,
> >                                       &min, &low);
> >  
> > -               if (min || low) {
> > +               if (min || (!sc->memcg_low_reclaim && low)) {
> >                         /*
> >                          * Scale a cgroup's reclaim pressure by
> > proportioning
> >                          * its current usage to its memory.low or
> > memory.min
> 
> This is slightly more complex to read but it is also better than +1
> trick.

We could also fold it into the helper function, with
mem_cgroup_protection deciding whether to use low or
min for the protection limit, and then we key the rest
of our decisions off that.

Wait a minute, that's pretty much what mem_cgroup_protection
looked like before f56ce412a59d ("mm: memcontrol: fix occasional
OOMs due to proportional memory.low reclaim")

Now I'm confused how that changeset works.

Before f56ce412a59d, mem_cgroup_protection would return
memcg->memory.emin if sc->memcg_low_reclaim is true, and
memcg->memory.elow when not.

After f56ce412a59d, we still do the same thing. We just
also set sc->memcg_low_skipped so we know to come back
if we could not hit our target without skipping groups
with memory.low protection...
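
[For reference, before f56ce412a59d the helper picked the protection value
itself; it looked roughly like the sketch below, a simplified paraphrase of
the description above rather than the exact code. After f56ce412a59d it
instead reports both values through &min and &low, and the caller in
vmscan.c picks one, which is what lets it also set sc->memcg_low_skipped.]

	/* Sketch of the pre-f56ce412a59d interface (simplified). */
	static inline unsigned long
	mem_cgroup_protection(struct mem_cgroup *memcg, bool in_low_reclaim)
	{
		if (mem_cgroup_disabled())
			return 0;

		/* when memory.low is being ignored, only memory.min protects */
		if (in_low_reclaim)
			return READ_ONCE(memcg->memory.emin);

		return READ_ONCE(memcg->memory.elow);
	}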
Johannes Weiner Sept. 1, 2021, 7:40 p.m. UTC | #9
On Tue, Aug 31, 2021 at 11:48:28AM -0400, Rik van Riel wrote:
> On Tue, 2021-08-31 at 11:59 +0200, Michal Hocko wrote:
> > On Mon 30-08-21 16:48:03, Johannes Weiner wrote:
> > 
> > 
> > > Or go back to not taking the branch in the first place when there
> > > is
> > > no protection in effect...
> > > 
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 6247f6f4469a..9c200bb3ae51 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2547,7 +2547,7 @@ static void get_scan_count(struct lruvec
> > > *lruvec, struct scan_control *sc,
> > >                 mem_cgroup_protection(sc->target_mem_cgroup, memcg,
> > >                                       &min, &low);
> > >  
> > > -               if (min || low) {
> > > +               if (min || (!sc->memcg_low_reclaim && low)) {
> > >                         /*
> > >                          * Scale a cgroup's reclaim pressure by
> > > proportioning
> > >                          * its current usage to its memory.low or
> > > memory.min
> > 
> > This is slightly more complex to read but it is also better than +1
> > trick.
> 
> We could also fold it into the helper function, with
> mem_cgroup_protection deciding whether to use low or
> min for the protection limit, and then we key the rest
> of our decisions off that.
> 
> Wait a minute, that's pretty much what mem_cgroup_protection
> looked like before f56ce412a59d ("mm: memcontrol: fix occasional
> OOMs due to proportional memory.low reclaim")
> 
> Now I'm confused how that changeset works.
> 
> Before f56ce412a59d, mem_cgroup_protection would return
> memcg->memory.emin if sc->memcg_low_reclaim is true, and
> memcg->memory.elow when not.
> 
> After f56ce412a59d, we still do the same thing. We just
> also set sc->memcg_low_skipped so we know to come back
> if we could not hit our target without skipping groups
> with memory.low protection...

Yeah, I just bubbled the sc->memcg_low_reclaim test up into vmscan.c
so we can modify sc->memcg_low_skipped based on it. scan_control is
vmscan.c-scoped, and I tried to retain that and avoid doing things
like passing &sc->memcg_low_skipped into memcg code.

Patch

diff --git a/mm/vmscan.c b/mm/vmscan.c
index eeae2f6bc532..f1782b816c98 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2592,7 +2592,7 @@  static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 			cgroup_size = max(cgroup_size, protection);
 
 			scan = lruvec_size - lruvec_size * protection /
-				cgroup_size;
+				(cgroup_size + 1);
 
 			/*
 			 * Minimally target SWAP_CLUSTER_MAX pages to keep