
mm: memcg: fix stale protection of reclaim target memcg

Message ID 20221122232721.2306102-1-yosryahmed@google.com (mailing list archive)
State New

Commit Message

Yosry Ahmed Nov. 22, 2022, 11:27 p.m. UTC
During reclaim, mem_cgroup_calculate_protection() is used to determine
the effective protection (emin and elow) values of a memcg. The
protection of the reclaim target is ignored, but we cannot set its
effective protection to 0 due to a limitation of the current
implementation (see comment in mem_cgroup_protection()). Instead,
we leave its effective protection values unchanged, and later ignore them
in mem_cgroup_protection().

However, mem_cgroup_protection() is called later in
shrink_lruvec()->get_scan_count(), which is after the
mem_cgroup_below_{min/low}() checks in shrink_node_memcgs(). As a
result, the stale effective protection values of the target memcg may
lead us to skip reclaiming from the target memcg entirely, before
calling shrink_lruvec(). This can be even worse with recursive
protection, where the stale target memcg protection can be higher than
its standalone protection.

An example where this can happen is as follows. Consider the following
hierarchy with memory_recursiveprot:
ROOT
 |
 A (memory.min = 50M)
 |
 B (memory.min = 10M, memory.high = 40M)

Consider the following scenario:
- B has memory.current = 35M.
- The system undergoes global reclaim (target memcg is NULL).
- B will have an effective min of 50M (all of A's unclaimed protection).
- B will not be reclaimed from.
- Now allocate 10M more memory in B, pushing it above its high limit.
- The system undergoes memcg reclaim from B (target memcg is B).
- In shrink_node_memcgs(), we call mem_cgroup_calculate_protection(),
  which immediately returns for B without doing anything, as B is the
  target memcg, relying on mem_cgroup_protection() to ignore B's stale
  effective min (still 50M).
- Directly after mem_cgroup_calculate_protection(), we will call
  mem_cgroup_below_min(), which will read the stale effective min for B
  and skip it (instead of ignoring its protection as intended). In this
  case, it's really bad because we are not just considering B's
  standalone protection (10M), but we are reading a much higher stale
  protection (50M) which will cause us to not reclaim from B at all.
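
The arithmetic of the scenario above can be replayed in a small userspace
model. This is only a sketch: struct memcg_model, below_min_stale() and
scenario_skips_b() are hypothetical names invented for illustration, not
kernel code, and the model keeps only the cached effective min and the
current usage of B.

```c
#include <assert.h>
#include <stdbool.h>

#define MB (1024UL * 1024UL)

/* Hypothetical userspace stand-in for the two values the check reads. */
struct memcg_model {
	unsigned long emin;	/* cached effective min, in bytes */
	unsigned long usage;	/* memory.current, in bytes */
};

/* Pre-patch style check: trusts whatever emin was cached last. */
static bool below_min_stale(const struct memcg_model *memcg)
{
	return memcg->emin >= memcg->usage;
}

/* Replays the scenario; returns true if reclaim would skip B. */
static bool scenario_skips_b(void)
{
	struct memcg_model b = { .emin = 0, .usage = 35 * MB };

	/* Global reclaim: B picks up all of A's unclaimed protection. */
	b.emin = 50 * MB;
	assert(below_min_stale(&b));	/* 50M >= 35M: B is protected */

	/* B allocates 10M more and goes above memory.high (40M). */
	b.usage += 10 * MB;

	/* Reclaim targeting B does not recalculate emin: still 50M. */
	return below_min_stale(&b);	/* 50M >= 45M: B is skipped */
}
```

With B's standalone memory.min of 10M the same comparison would be
10M >= 45M, which is false, so the skip comes entirely from the stale 50M.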

This is an artifact of commit 45c7f7e1ef17 ("mm, memcg: decouple
e{low,min} state mutations from protection checks") which made
mem_cgroup_calculate_protection() only change the state without
returning any value. Before that commit, we used to return
MEMCG_PROT_NONE for the target memcg, which would cause us to skip the
mem_cgroup_below_{min/low}() checks. After that commit we do not return
anything and we end up checking the min & low effective protections for
the target memcg, which are stale.

Add mem_cgroup_ignore_protection() that checks if we are reclaiming from
the target memcg, and call it in mem_cgroup_below_{min/low}() to ignore
the stale protection of the target memcg.
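
As a rough illustration of the effect of the new check, here is a
userspace sketch. The names struct memcg_model, ignore_protection(),
below_min() and b_is_skipped() are made up for this example, and the
mem_cgroup_supports_protection() test is deliberately left out to keep
the model small.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MB (1024UL * 1024UL)

/* Hypothetical userspace stand-in, not the kernel's struct mem_cgroup. */
struct memcg_model {
	unsigned long emin;	/* cached effective min, in bytes */
	unsigned long usage;	/* memory.current, in bytes */
};

/* The reclaim target's own protection is ignored. */
static bool ignore_protection(const struct memcg_model *target,
			      const struct memcg_model *memcg)
{
	return target == memcg;
}

/* Post-patch style check: the stale emin of the target is never read. */
static bool below_min(const struct memcg_model *target,
		      const struct memcg_model *memcg)
{
	if (ignore_protection(target, memcg))
		return false;

	return memcg->emin >= memcg->usage;
}

/* B with a stale emin of 50M and a usage of 45M, as in the example. */
static bool b_is_skipped(bool b_is_target)
{
	struct memcg_model b = { .emin = 50 * MB, .usage = 45 * MB };

	return below_min(b_is_target ? &b : NULL, &b);
}

static void check_example(void)
{
	assert(!b_is_skipped(true));	/* memcg reclaim from B: reclaimed */
	assert(b_is_skipped(false));	/* global reclaim: still protected */
}
```

Global reclaim passes a NULL target, so it never matches a real memcg and
protection is still honoured there.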

Fixes: 45c7f7e1ef17 ("mm, memcg: decouple e{low,min} state mutations from protection checks")
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 include/linux/memcontrol.h | 33 +++++++++++++++++++++++++++------
 mm/vmscan.c                | 11 ++++++-----
 2 files changed, 33 insertions(+), 11 deletions(-)

Comments

Yosry Ahmed Nov. 22, 2022, 11:31 p.m. UTC | #1
+David Rientjes

The attached test reproduces the problem on a cgroup v2 hierarchy
mounted with memory_recursiveprot, and fails without this patch.

On Tue, Nov 22, 2022 at 3:27 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> [...]
> ---
>  include/linux/memcontrol.h | 33 +++++++++++++++++++++++++++------
>  mm/vmscan.c                | 11 ++++++-----
>  2 files changed, 33 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index e1644a24009c..22c9c9f9c6b1 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -625,18 +625,32 @@ static inline bool mem_cgroup_supports_protection(struct mem_cgroup *memcg)
>
>  }
>
> -static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
> +static inline bool mem_cgroup_ignore_protection(struct mem_cgroup *target,
> +                                               struct mem_cgroup *memcg)
>  {
> -       if (!mem_cgroup_supports_protection(memcg))
> +       /*
> +        * The target memcg's protection is ignored, see
> +        * mem_cgroup_calculate_protection() and mem_cgroup_protection()
> +        */
> +       return target == memcg;
> +}
> +
> +static inline bool mem_cgroup_below_low(struct mem_cgroup *target,
> +                                       struct mem_cgroup *memcg)
> +{
> +       if (!mem_cgroup_supports_protection(memcg) ||
> +           mem_cgroup_ignore_protection(target, memcg))
>                 return false;
>
>         return READ_ONCE(memcg->memory.elow) >=
>                 page_counter_read(&memcg->memory);
>  }
>
> -static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
> +static inline bool mem_cgroup_below_min(struct mem_cgroup *target,
> +                                       struct mem_cgroup *memcg)
>  {
> -       if (!mem_cgroup_supports_protection(memcg))
> +       if (!mem_cgroup_supports_protection(memcg) ||
> +           mem_cgroup_ignore_protection(target, memcg))
>                 return false;
>
>         return READ_ONCE(memcg->memory.emin) >=
> @@ -1209,12 +1223,19 @@ static inline void mem_cgroup_calculate_protection(struct mem_cgroup *root,
>  {
>  }
>
> -static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
> +static inline bool mem_cgroup_ignore_protection(struct mem_cgroup *target,
> +                                               struct mem_cgroup *memcg)
> +{
> +       return false;
> +}
> +static inline bool mem_cgroup_below_low(struct mem_cgroup *target,
> +                                       struct mem_cgroup *memcg)
>  {
>         return false;
>  }
>
> -static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
> +static inline bool mem_cgroup_below_min(struct mem_cgroup *target,
> +                                       struct mem_cgroup *memcg)
>  {
>         return false;
>  }
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 04d8b88e5216..79ef0fe67518 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4486,7 +4486,7 @@ static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc, unsigned
>
>         mem_cgroup_calculate_protection(NULL, memcg);
>
> -       if (mem_cgroup_below_min(memcg))
> +       if (mem_cgroup_below_min(NULL, memcg))
>                 return false;
>
>         need_aging = should_run_aging(lruvec, max_seq, min_seq, sc, swappiness, &nr_to_scan);
> @@ -5047,8 +5047,9 @@ static unsigned long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *
>         DEFINE_MAX_SEQ(lruvec);
>         DEFINE_MIN_SEQ(lruvec);
>
> -       if (mem_cgroup_below_min(memcg) ||
> -           (mem_cgroup_below_low(memcg) && !sc->memcg_low_reclaim))
> +       if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg) ||
> +           (mem_cgroup_below_low(sc->target_mem_cgroup, memcg) &&
> +            !sc->memcg_low_reclaim))
>                 return 0;
>
>         *need_aging = should_run_aging(lruvec, max_seq, min_seq, sc, can_swap, &nr_to_scan);
> @@ -6048,13 +6049,13 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>
>                 mem_cgroup_calculate_protection(target_memcg, memcg);
>
> -               if (mem_cgroup_below_min(memcg)) {
> +               if (mem_cgroup_below_min(target_memcg, memcg)) {
>                         /*
>                          * Hard protection.
>                          * If there is no reclaimable memory, OOM.
>                          */
>                         continue;
> -               } else if (mem_cgroup_below_low(memcg)) {
> +               } else if (mem_cgroup_below_low(target_memcg, memcg)) {
>                         /*
>                          * Soft protection.
>                          * Respect the protection only as long as
> --
> 2.38.1.584.g0f3c55d4c2-goog
>
Roman Gushchin Nov. 23, 2022, 12:37 a.m. UTC | #2
On Tue, Nov 22, 2022 at 11:27:21PM +0000, Yosry Ahmed wrote:
> [...]

Great catch!
The fix looks good to me, only a couple of cosmetic suggestions.

> ---
>  include/linux/memcontrol.h | 33 +++++++++++++++++++++++++++------
>  mm/vmscan.c                | 11 ++++++-----
>  2 files changed, 33 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index e1644a24009c..22c9c9f9c6b1 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -625,18 +625,32 @@ static inline bool mem_cgroup_supports_protection(struct mem_cgroup *memcg)
>  
>  }
>  
> -static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
> +static inline bool mem_cgroup_ignore_protection(struct mem_cgroup *target,
> +						struct mem_cgroup *memcg)
>  {
> -	if (!mem_cgroup_supports_protection(memcg))

How about merging mem_cgroup_supports_protection() and your new helper into
something like mem_cgroup_possibly_protected()? It seems they are never used
separately and are unlikely ever to be.
Also, I'd swap target and memcg arguments.

Thank you!


PS If it's not too hard, please, consider adding a new kselftest to cover this case.
Thank you!
Yosry Ahmed Nov. 23, 2022, 12:45 a.m. UTC | #3
On Tue, Nov 22, 2022 at 4:37 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Tue, Nov 22, 2022 at 11:27:21PM +0000, Yosry Ahmed wrote:
> > [...]
>
> Great catch!
> The fix looks good to me, only a couple of cosmetic suggestions.
>
> > ---
> >  include/linux/memcontrol.h | 33 +++++++++++++++++++++++++++------
> >  mm/vmscan.c                | 11 ++++++-----
> >  2 files changed, 33 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index e1644a24009c..22c9c9f9c6b1 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -625,18 +625,32 @@ static inline bool mem_cgroup_supports_protection(struct mem_cgroup *memcg)
> >
> >  }
> >
> > -static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
> > +static inline bool mem_cgroup_ignore_protection(struct mem_cgroup *target,
> > +                                             struct mem_cgroup *memcg)
> >  {
> > -     if (!mem_cgroup_supports_protection(memcg))
>
> How about to merge mem_cgroup_supports_protection() and your new helper into
> something like mem_cgroup_possibly_protected()? It seems like they never used
> separately and unlikely ever will be used.

Sounds good! I am thinking maybe mem_cgroup_no_protection() which is
an inlining of !mem_cgroup_supports_protection() ||
mem_cgroup_ignore_protection().
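
Roughly, such a merged helper might look like the userspace sketch below.
Everything here is a stand-in: struct memcg_model, memcg_disabled and
no_protection() are hypothetical, and the "controller disabled or root
memcg" condition is my assumption of what
!mem_cgroup_supports_protection() amounts to, not something taken from
this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MB (1024UL * 1024UL)

/* Hypothetical userspace stand-in, not the kernel's struct mem_cgroup. */
struct memcg_model {
	bool is_root;
	unsigned long emin;	/* cached effective min, in bytes */
	unsigned long usage;	/* memory.current, in bytes */
};

/* Assumed stand-in for "the memory controller is disabled". */
static bool memcg_disabled;

/* Merged check: protection unsupported, or memcg is the reclaim target. */
static bool no_protection(const struct memcg_model *target,
			  const struct memcg_model *memcg)
{
	if (memcg_disabled || memcg->is_root)
		return true;

	return target == memcg;
}

static bool below_min(const struct memcg_model *target,
		      const struct memcg_model *memcg)
{
	if (no_protection(target, memcg))
		return false;

	return memcg->emin >= memcg->usage;
}

static void check_merged_helper(void)
{
	struct memcg_model root = { .is_root = true };
	struct memcg_model b = { .emin = 50 * MB, .usage = 45 * MB };

	assert(!below_min(NULL, &root));	/* root is never protected */
	assert(below_min(NULL, &b));		/* global reclaim honours emin */
	assert(!below_min(&b, &b));		/* target's emin is ignored */
}
```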

> Also, I'd swap target and memcg arguments.

Sounds good.

>
> Thank you!
>
>
> PS If it's not too hard, please, consider adding a new kselftest to cover this case.
> Thank you!

I will try to translate my bash test to something in test_memcontrol.
I don't plan to spend a lot of time on it, though, so I hope it's simple
enough.
Yosry Ahmed Nov. 23, 2022, 12:49 a.m. UTC | #4
On Tue, Nov 22, 2022 at 4:45 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Tue, Nov 22, 2022 at 4:37 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >
> > On Tue, Nov 22, 2022 at 11:27:21PM +0000, Yosry Ahmed wrote:
> > > [...]
> >
> > Great catch!
> > The fix looks good to me, only a couple of cosmetic suggestions.
> >
> > > ---
> > >  include/linux/memcontrol.h | 33 +++++++++++++++++++++++++++------
> > >  mm/vmscan.c                | 11 ++++++-----
> > >  2 files changed, 33 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > index e1644a24009c..22c9c9f9c6b1 100644
> > > --- a/include/linux/memcontrol.h
> > > +++ b/include/linux/memcontrol.h
> > > @@ -625,18 +625,32 @@ static inline bool mem_cgroup_supports_protection(struct mem_cgroup *memcg)
> > >
> > >  }
> > >
> > > -static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
> > > +static inline bool mem_cgroup_ignore_protection(struct mem_cgroup *target,
> > > +                                             struct mem_cgroup *memcg)
> > >  {
> > > -     if (!mem_cgroup_supports_protection(memcg))
> >
> > How about to merge mem_cgroup_supports_protection() and your new helper into
> > something like mem_cgroup_possibly_protected()? It seems like they never used
> > separately and unlikely ever will be used.
>
> Sounds good! I am thinking maybe mem_cgroup_no_protection() which is
> an inlining of !mem_cgroup_supports_protection() ||
> mem_cgroup_ignore_protection().
>
> > Also, I'd swap target and memcg arguments.
>
> Sounds good.

I just remembered, the reason I put "target" first is to match the
ordering of mem_cgroup_calculate_protection(), otherwise the code in
shrink_node_memcgs() may be confusing.

>
> >
> > Thank you!
> >
> >
> > PS If it's not too hard, please, consider adding a new kselftest to cover this case.
> > Thank you!
>
> I will try to translate my bash test to something in test_memcontrol,
> I don't plan to spend a lot of time on it though so I hope it's simple
> enough..
Roman Gushchin Nov. 23, 2022, 1:26 a.m. UTC | #5
On Tue, Nov 22, 2022 at 04:49:54PM -0800, Yosry Ahmed wrote:
> On Tue, Nov 22, 2022 at 4:45 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Tue, Nov 22, 2022 at 4:37 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > >
> > > On Tue, Nov 22, 2022 at 11:27:21PM +0000, Yosry Ahmed wrote:
> > > > [...]
> > >
> > > Great catch!
> > > The fix looks good to me, only a couple of cosmetic suggestions.
> > >
> > > > ---
> > > >  include/linux/memcontrol.h | 33 +++++++++++++++++++++++++++------
> > > >  mm/vmscan.c                | 11 ++++++-----
> > > >  2 files changed, 33 insertions(+), 11 deletions(-)
> > > >
> > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > > index e1644a24009c..22c9c9f9c6b1 100644
> > > > --- a/include/linux/memcontrol.h
> > > > +++ b/include/linux/memcontrol.h
> > > > @@ -625,18 +625,32 @@ static inline bool mem_cgroup_supports_protection(struct mem_cgroup *memcg)
> > > >
> > > >  }
> > > >
> > > > -static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
> > > > +static inline bool mem_cgroup_ignore_protection(struct mem_cgroup *target,
> > > > +                                             struct mem_cgroup *memcg)
> > > >  {
> > > > -     if (!mem_cgroup_supports_protection(memcg))
> > >
> > > How about to merge mem_cgroup_supports_protection() and your new helper into
> > > something like mem_cgroup_possibly_protected()? It seems like they never used
> > > separately and unlikely ever will be used.
> >
> > Sounds good! I am thinking maybe mem_cgroup_no_protection() which is
> > an inlining of !mem_cgroup_supports_protection() ||
> > mem_cgroup_ignore_protection().
> >
> > > Also, I'd swap target and memcg arguments.
> >
> > Sounds good.
> 
> I just remembered, the reason I put "target" first is to match the
> ordering of mem_cgroup_calculate_protection(), otherwise the code in
> shrink_node_memcgs() may be confusing.

Oh, I see...
Never mind, let's leave it the way it is now.
Thanks for checking it out!

Roman
Yosry Ahmed Nov. 23, 2022, 9:25 a.m. UTC | #6
On Tue, Nov 22, 2022 at 4:37 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Tue, Nov 22, 2022 at 11:27:21PM +0000, Yosry Ahmed wrote:
> > During reclaim, mem_cgroup_calculate_protection() is used to determine
> > the effective protection (emin and elow) values of a memcg. The
> > protection of the reclaim target is ignored, but we cannot set their
> > effective protection to 0 due to a limitation of the current
> > implementation (see comment in mem_cgroup_protection()). Instead,
> > we leave their effective protection values unchanged, and later ignore it
> > in mem_cgroup_protection().
> >
> > However, mem_cgroup_protection() is called later in
> > shrink_lruvec()->get_scan_count(), which is after the
> > mem_cgroup_below_{min/low}() checks in shrink_node_memcgs(). As a
> > result, the stale effective protection values of the target memcg may
> > lead us to skip reclaiming from the target memcg entirely, before
> > calling shrink_lruvec(). This can be even worse with recursive
> > protection, where the stale target memcg protection can be higher than
> > its standalone protection.
> >
> > An example where this can happen is as follows. Consider the following
> > hierarchy with memory_recursiveprot:
> > ROOT
> >  |
> >  A (memory.min = 50M)
> >  |
> >  B (memory.min = 10M, memory.high = 40M)
> >
> > Consider the following scenario:
> > - B has memory.current = 35M.
> > - The system undergoes global reclaim (target memcg is NULL).
> > - B will have an effective min of 50M (all of A's unclaimed protection).
> > - B will not be reclaimed from.
> > - Now allocate 10M more memory in B, pushing it above its high limit.
> > - The system undergoes memcg reclaim from B (target memcg is B).
> > - In shrink_node_memcgs(), we call mem_cgroup_calculate_protection(),
> >   which immediately returns for B without doing anything, as B is the
> >   target memcg, relying on mem_cgroup_protection() to ignore B's stale
> >   effective min (still 50M).
> > - Directly after mem_cgroup_calculate_protection(), we will call
> >   mem_cgroup_below_min(), which will read the stale effective min for B
> >   and skip it (instead of ignoring its protection as intended). In this
> >   case, it's really bad because we are not just considering B's
> >   standalone protection (10M), but we are reading a much higher stale
> >   protection (50M) which will cause us to not reclaim from B at all.
> >
> > This is an artifact of commit 45c7f7e1ef17 ("mm, memcg: decouple
> > e{low,min} state mutations from protection checks") which made
> > mem_cgroup_calculate_protection() only change the state without
> > returning any value. Before that commit, we used to return
> > MEMCG_PROT_NONE for the target memcg, which would cause us to skip the
> > mem_cgroup_below_{min/low}() checks. After that commit we do not return
> > anything and we end up checking the min & low effective protections for
> > the target memcg, which are stale.
> >
> > Add mem_cgroup_ignore_protection() that checks if we are reclaiming from
> > the target memcg, and call it in mem_cgroup_below_{min/low}() to ignore
> > the stale protection of the target memcg.
> >
> > Fixes: 45c7f7e1ef17 ("mm, memcg: decouple e{low,min} state mutations from protection checks")
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>
> Great catch!
> The fix looks good to me, only a couple of cosmetic suggestions.
>
> > ---
> >  include/linux/memcontrol.h | 33 +++++++++++++++++++++++++++------
> >  mm/vmscan.c                | 11 ++++++-----
> >  2 files changed, 33 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index e1644a24009c..22c9c9f9c6b1 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -625,18 +625,32 @@ static inline bool mem_cgroup_supports_protection(struct mem_cgroup *memcg)
> >
> >  }
> >
> > -static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
> > +static inline bool mem_cgroup_ignore_protection(struct mem_cgroup *target,
> > +                                             struct mem_cgroup *memcg)
> >  {
> > -     if (!mem_cgroup_supports_protection(memcg))
>
> How about merging mem_cgroup_supports_protection() and your new helper into
> something like mem_cgroup_possibly_protected()? It seems they are never used
> separately and are unlikely ever to be.
> Also, I'd swap target and memcg arguments.
>
> Thank you!
>
>
> PS If it's not too hard, please, consider adding a new kselftest to cover this case.
> Thank you!

Sent v2 with mem_cgroup_supports_protection() and
mem_cgroup_ignore_protection() merged into mem_cgroup_unprotected().

Also added a test case to test_memcontrol.c:test_memcg_protection.
Since the scenario in the bash test and the v1 commit log was too
complicated, I extended the existing test with a simpler scenario
based on proactive reclaim, and reused some functionality from
test_memcg_reclaim(). I also explained that simple proactive reclaim
scenario in the commit log of the fix.

Writing a test for the more complex scenario with recursive protection
would be more involved, so I think this should be enough for now :)
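
As a rough illustration of the more complex scenario from the v1 commit
log, the following userspace sketch replays its steps with the same numbers
(50M inherited effective min, B growing from 35M to 45M). It is only a
model under stated assumptions: struct memcg_model and all model_* functions
are hypothetical, the recursive protection calculation is not reproduced
(the computed value is simply passed in), and only the early return for the
target memcg is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MB(x) ((unsigned long)(x) << 20)

/* Hypothetical stand-in for the mem_cgroup fields used in this scenario. */
struct memcg_model {
	unsigned long emin;	/* effective min as last computed */
	unsigned long usage;	/* stand-in for memory.current */
};

/*
 * Models only the early return described in the commit log: when memcg is
 * the reclaim target, its effective min is left untouched. The caller
 * supplies the value the real calculation would have produced.
 */
static void model_calculate_protection(const struct memcg_model *target,
				       struct memcg_model *memcg,
				       unsigned long computed_emin)
{
	if (memcg == target)
		return;		/* the stale emin survives */

	memcg->emin = computed_emin;
}

/* Pre-fix check: has no notion of the reclaim target. */
static bool model_below_min_old(const struct memcg_model *memcg)
{
	return memcg->emin >= memcg->usage;
}

/* Post-fix check: the target's protection is ignored. */
static bool model_below_min_new(const struct memcg_model *target,
				const struct memcg_model *memcg)
{
	if (target == memcg)
		return false;

	return memcg->emin >= memcg->usage;
}

/* Returns true if the old check skips B while the new check does not. */
bool model_replay_scenario(void)
{
	struct memcg_model b = { .emin = 0, .usage = MB(35) };

	/* Global reclaim (target is NULL): B ends up with 50M effective min. */
	model_calculate_protection(NULL, &b, MB(50));
	assert(model_below_min_old(&b));

	/* B allocates 10M more, going above its 40M high limit. */
	b.usage += MB(10);

	/* Memcg reclaim targeting B: emin is not recomputed. */
	model_calculate_protection(&b, &b, 0);

	return b.emin == MB(50) &&
	       model_below_min_old(&b) &&
	       !model_below_min_new(&b, &b);
}
```

In this model the old check compares 45M of usage against the stale 50M and
skips B, while the target-aware check lets reclaim proceed, which mirrors
the behaviour the fix is meant to restore.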

Patch

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e1644a24009c..22c9c9f9c6b1 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -625,18 +625,32 @@  static inline bool mem_cgroup_supports_protection(struct mem_cgroup *memcg)
 
 }
 
-static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
+static inline bool mem_cgroup_ignore_protection(struct mem_cgroup *target,
+						struct mem_cgroup *memcg)
 {
-	if (!mem_cgroup_supports_protection(memcg))
+	/*
+	 * The target memcg's protection is ignored, see
+	 * mem_cgroup_calculate_protection() and mem_cgroup_protection()
+	 */
+	return target == memcg;
+}
+
+static inline bool mem_cgroup_below_low(struct mem_cgroup *target,
+					struct mem_cgroup *memcg)
+{
+	if (!mem_cgroup_supports_protection(memcg) ||
+	    mem_cgroup_ignore_protection(target, memcg))
 		return false;
 
 	return READ_ONCE(memcg->memory.elow) >=
 		page_counter_read(&memcg->memory);
 }
 
-static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
+static inline bool mem_cgroup_below_min(struct mem_cgroup *target,
+					struct mem_cgroup *memcg)
 {
-	if (!mem_cgroup_supports_protection(memcg))
+	if (!mem_cgroup_supports_protection(memcg) ||
+	    mem_cgroup_ignore_protection(target, memcg))
 		return false;
 
 	return READ_ONCE(memcg->memory.emin) >=
@@ -1209,12 +1223,19 @@  static inline void mem_cgroup_calculate_protection(struct mem_cgroup *root,
 {
 }
 
-static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
+static inline bool mem_cgroup_ignore_protection(struct mem_cgroup *target,
+						struct mem_cgroup *memcg)
+{
+	return false;
+}
+static inline bool mem_cgroup_below_low(struct mem_cgroup *target,
+					struct mem_cgroup *memcg)
 {
 	return false;
 }
 
-static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
+static inline bool mem_cgroup_below_min(struct mem_cgroup *target,
+					struct mem_cgroup *memcg)
 {
 	return false;
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 04d8b88e5216..79ef0fe67518 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4486,7 +4486,7 @@  static bool age_lruvec(struct lruvec *lruvec, struct scan_control *sc, unsigned
 
 	mem_cgroup_calculate_protection(NULL, memcg);
 
-	if (mem_cgroup_below_min(memcg))
+	if (mem_cgroup_below_min(NULL, memcg))
 		return false;
 
 	need_aging = should_run_aging(lruvec, max_seq, min_seq, sc, swappiness, &nr_to_scan);
@@ -5047,8 +5047,9 @@  static unsigned long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *
 	DEFINE_MAX_SEQ(lruvec);
 	DEFINE_MIN_SEQ(lruvec);
 
-	if (mem_cgroup_below_min(memcg) ||
-	    (mem_cgroup_below_low(memcg) && !sc->memcg_low_reclaim))
+	if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg) ||
+	    (mem_cgroup_below_low(sc->target_mem_cgroup, memcg) &&
+	     !sc->memcg_low_reclaim))
 		return 0;
 
 	*need_aging = should_run_aging(lruvec, max_seq, min_seq, sc, can_swap, &nr_to_scan);
@@ -6048,13 +6049,13 @@  static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 
 		mem_cgroup_calculate_protection(target_memcg, memcg);
 
-		if (mem_cgroup_below_min(memcg)) {
+		if (mem_cgroup_below_min(target_memcg, memcg)) {
 			/*
 			 * Hard protection.
 			 * If there is no reclaimable memory, OOM.
 			 */
 			continue;
-		} else if (mem_cgroup_below_low(memcg)) {
+		} else if (mem_cgroup_below_low(target_memcg, memcg)) {
 			/*
 			 * Soft protection.
 			 * Respect the protection only as long as