[v1,0/2] Ignore non-LRU-based reclaim in memcg reclaim

Message ID	20230228085002.2592473-1-yosryahmed@google.com (mailing list archive)
Headers	show Return-Path: <linux-xfs-owner@vger.kernel.org> Date: Tue, 28 Feb 2023 08:50:00 +0000 Mime-Version: 1.0 Message-ID: <20230228085002.2592473-1-yosryahmed@google.com> Subject: [PATCH v1 0/2] Ignore non-LRU-based reclaim in memcg reclaim From: Yosry Ahmed <yosryahmed@google.com> To: Alexander Viro <viro@zeniv.linux.org.uk>, "Darrick J. Wong" <djwong@kernel.org>, Christoph Lameter <cl@linux.com>, David Rientjes <rientjes@google.com>, Joonsoo Kim <iamjoonsoo.kim@lge.com>, Vlastimil Babka <vbabka@suse.cz>, Roman Gushchin <roman.gushchin@linux.dev>, Hyeonggon Yoo <42.hyeyoo@gmail.com>, "Matthew Wilcox (Oracle)" <willy@infradead.org>, Miaohe Lin <linmiaohe@huawei.com>, David Hildenbrand <david@redhat.com>, Johannes Weiner <hannes@cmpxchg.org>, Peter Xu <peterx@redhat.com>, NeilBrown <neilb@suse.de>, Shakeel Butt <shakeelb@google.com>, Michal Hocko <mhocko@kernel.org> Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-xfs@vger.kernel.org, linux-mm@kvack.org, Yosry Ahmed <yosryahmed@google.com> Content-Type: text/plain; charset="UTF-8" Precedence: bulk
Series	Ignore non-LRU-based reclaim in memcg reclaim \| expand [v1,0/2] Ignore non-LRU-based reclaim in memcg reclaim [v1,1/2] mm: vmscan: refactor updating reclaimed pages in reclaim_state [v1,2/2] mm: vmscan: ignore non-LRU-based reclaim in memcg reclaim

Yosry Ahmed Feb. 28, 2023, 8:50 a.m. UTC

Reclaimed pages through other means than LRU-based reclaim are tracked
through reclaim_state in struct scan_control, which is stashed in
current task_struct. These pages are added to the number of reclaimed
pages through LRUs. For memcg reclaim, these pages generally cannot be
linked to the memcg under reclaim and can cause an overestimated count
of reclaimed pages. This short series tries to address that.

Patch 1 is just refactoring updating reclaim_state into a helper
function, and renames reclaimed_slab to just reclaimed, with a comment
describing its true purpose.

Patch 2 ignores pages reclaimed outside of LRU reclaim in memcg reclaim.
The pages are uncharged anyway, so even if we end up under-reporting
reclaimed pages we will still succeed in making progress during
charging.

Do not let the diff stat trick you, patch 2 is a one-line change. All
the rest is moving a couple of functions around and a huge comment :)

RFC -> v1:
- Exported report_freed_pages in case XFS is built as a module (Matthew
  Wilcox).
- Renamed reclaimed_slab to reclaim in previously missed MGLRU code.
- Refactored using reclaim_state to update sc->nr_reclaimed into a
  helper and added an XL comment explaining why we ignore
  reclaim_state->reclaimed in memcg reclaim (Johannes Weiner).

Yosry Ahmed (2):
  mm: vmscan: refactor updating reclaimed pages in reclaim_state
  mm: vmscan: ignore non-LRU-based reclaim in memcg reclaim

 fs/inode.c           |  3 +-
 fs/xfs/xfs_buf.c     |  3 +-
 include/linux/swap.h |  5 ++-
 mm/slab.c            |  3 +-
 mm/slob.c            |  6 ++--
 mm/slub.c            |  5 ++-
 mm/vmscan.c          | 79 +++++++++++++++++++++++++++++++++++---------
 7 files changed, 74 insertions(+), 30 deletions(-)

Yosry Ahmed March 8, 2023, 6:54 a.m. UTC | #1

On Tue, Feb 28, 2023 at 12:50 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> Reclaimed pages through other means than LRU-based reclaim are tracked
> through reclaim_state in struct scan_control, which is stashed in
> current task_struct. These pages are added to the number of reclaimed
> pages through LRUs. For memcg reclaim, these pages generally cannot be
> linked to the memcg under reclaim and can cause an overestimated count
> of reclaimed pages. This short series tries to address that.
>
> Patch 1 is just refactoring updating reclaim_state into a helper
> function, and renames reclaimed_slab to just reclaimed, with a comment
> describing its true purpose.
>
> Patch 2 ignores pages reclaimed outside of LRU reclaim in memcg reclaim.
> The pages are uncharged anyway, so even if we end up under-reporting
> reclaimed pages we will still succeed in making progress during
> charging.
>
> Do not let the diff stat trick you, patch 2 is a one-line change. All
> the rest is moving a couple of functions around and a huge comment :)
>
> RFC -> v1:
> - Exported report_freed_pages in case XFS is built as a module (Matthew
>   Wilcox).
> - Renamed reclaimed_slab to reclaim in previously missed MGLRU code.
> - Refactored using reclaim_state to update sc->nr_reclaimed into a
>   helper and added an XL comment explaining why we ignore
>   reclaim_state->reclaimed in memcg reclaim (Johannes Weiner).
>
> Yosry Ahmed (2):
>   mm: vmscan: refactor updating reclaimed pages in reclaim_state
>   mm: vmscan: ignore non-LRU-based reclaim in memcg reclaim
>
>  fs/inode.c           |  3 +-
>  fs/xfs/xfs_buf.c     |  3 +-
>  include/linux/swap.h |  5 ++-
>  mm/slab.c            |  3 +-
>  mm/slob.c            |  6 ++--
>  mm/slub.c            |  5 ++-
>  mm/vmscan.c          | 79 +++++++++++++++++++++++++++++++++++---------
>  7 files changed, 74 insertions(+), 30 deletions(-)
>
> --
> 2.39.2.722.g9855ee24e9-goog
>

Friendly ping on this series, any comments or thoughts -- especially
on the memcg side?

Johannes Weiner March 8, 2023, 4 p.m. UTC | #2

Hello Yosry,

On Tue, Feb 28, 2023 at 08:50:00AM +0000, Yosry Ahmed wrote:
> Reclaimed pages through other means than LRU-based reclaim are tracked
> through reclaim_state in struct scan_control, which is stashed in
> current task_struct. These pages are added to the number of reclaimed
> pages through LRUs. For memcg reclaim, these pages generally cannot be
> linked to the memcg under reclaim and can cause an overestimated count
> of reclaimed pages. This short series tries to address that.

Could you please add more details on how this manifests as a problem
with real workloads?

> Patch 1 is just refactoring updating reclaim_state into a helper
> function, and renames reclaimed_slab to just reclaimed, with a comment
> describing its true purpose.

Looking through the code again, I don't think these helpers add value.

report_freed_pages() is fairly vague. Report to who? It abstracts only
two lines of code, and those two lines are more descriptive of what's
happening than the helper is. Just leave them open-coded.

add_non_vmanscan_reclaimed() may or may not add anything. But let's
take a step back. It only has two callsites because lrugen duplicates
the entire reclaim implementation, including the call to shrink_slab()
and the transfer of reclaim_state to sc->nr_reclaimed.

IMO the resulting code would overall be simpler, less duplicative and
easier to follow if you added a common shrink_slab_reclaim() that
takes sc, handles the transfer, and documents the memcg exception.

Yosry Ahmed March 8, 2023, 6:01 p.m. UTC | #3

On Wed, Mar 8, 2023 at 8:00 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Hello Yosry,
>
> On Tue, Feb 28, 2023 at 08:50:00AM +0000, Yosry Ahmed wrote:
> > Reclaimed pages through other means than LRU-based reclaim are tracked
> > through reclaim_state in struct scan_control, which is stashed in
> > current task_struct. These pages are added to the number of reclaimed
> > pages through LRUs. For memcg reclaim, these pages generally cannot be
> > linked to the memcg under reclaim and can cause an overestimated count
> > of reclaimed pages. This short series tries to address that.
>
> Could you please add more details on how this manifests as a problem
> with real workloads?

We haven't observed problems in production workloads, but we have
observed problems in testing using memory.reclaim when sometimes a
write to memory.reclaim would succeed when we didn't fully reclaim the
requested amount. This leads to tests flaking sometimes, and we have
to look into the failures to find out if there is a real problem or
not.

>
> > Patch 1 is just refactoring updating reclaim_state into a helper
> > function, and renames reclaimed_slab to just reclaimed, with a comment
> > describing its true purpose.
>
> Looking through the code again, I don't think these helpers add value.
>
> report_freed_pages() is fairly vague. Report to who? It abstracts only
> two lines of code, and those two lines are more descriptive of what's
> happening than the helper is. Just leave them open-coded.

I agree the name is not great, I am usually bad at naming things and
hope people would point that out (like you're doing now). The reason I
added it is to contain the logic within mm/vmscan.c such that future
changes do not have to add noisy diffs to a lot of unrelated files. If
you have a better name that makes more sense to you please let me
know, otherwise I'm fine dropping the helper as well, no strong
opinions here.

>
> add_non_vmanscan_reclaimed() may or may not add anything. But let's
> take a step back. It only has two callsites because lrugen duplicates
> the entire reclaim implementation, including the call to shrink_slab()
> and the transfer of reclaim_state to sc->nr_reclaimed.
>
> IMO the resulting code would overall be simpler, less duplicative and
> easier to follow if you added a common shrink_slab_reclaim() that
> takes sc, handles the transfer, and documents the memcg exception.

IIUC you mean something like:

void shrink_slab_reclaim(struct scan_control *sc, pg_data_t *pgdat,
struct mem_cgroup *memcg)
{
    shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);

    /* very long comment */
    if (current->reclaim_state && !cgroup_reclaim(sc)) {
        sc->nr_reclaimed += current->reclaim_state->reclaimed;
        current->reclaim_state->reclaimed = 0;
    }
}

The difference would be that today we handle the transfer once after
we scan all memcgs in classic lruvec, while we do the transfer once
per-memcg in lrugen. With this change, we would be doing the transfer
once per-memcg for both, but I guess that's not a big deal.

What I don't like about this is that it doubles down on associating
the counter in reclaim_state with slab, which is the opposite of what
patch 1 does (renaming reclaimed_slab to just reclaimed). If we do
this, maybe it's better from a consistency perspective to leave it as
reclaimed_slab, with a comment announcing that this is mainly used for
slab, but others are piggybacking on it. It seems like whatever we do
is not going to be ideal in this case, but we should at least be
consistent. Either add shrink_slab_reclaim(), leave it as
reclaimed_slab and call out other users as piggybackers -- or make it
generic and separate from slab completely, with a separate helper to
do the transfer.

I do not have a strong opinion here, so let me know what you prefer.

Johannes Weiner March 8, 2023, 8:16 p.m. UTC | #4

On Wed, Mar 08, 2023 at 10:01:24AM -0800, Yosry Ahmed wrote:
> On Wed, Mar 8, 2023 at 8:00 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > Hello Yosry,
> >
> > On Tue, Feb 28, 2023 at 08:50:00AM +0000, Yosry Ahmed wrote:
> > > Reclaimed pages through other means than LRU-based reclaim are tracked
> > > through reclaim_state in struct scan_control, which is stashed in
> > > current task_struct. These pages are added to the number of reclaimed
> > > pages through LRUs. For memcg reclaim, these pages generally cannot be
> > > linked to the memcg under reclaim and can cause an overestimated count
> > > of reclaimed pages. This short series tries to address that.
> >
> > Could you please add more details on how this manifests as a problem
> > with real workloads?
> 
> We haven't observed problems in production workloads, but we have
> observed problems in testing using memory.reclaim when sometimes a
> write to memory.reclaim would succeed when we didn't fully reclaim the
> requested amount. This leads to tests flaking sometimes, and we have
> to look into the failures to find out if there is a real problem or
> not.

Ah, that would be great to have in the cover letter. Thanks!

Have you also tested this patch against prod without memory.reclaim?
Just to make sure there are no problems with cgroup OOMs or
similar. There shouldn't be, but, you know...

> > > Patch 1 is just refactoring updating reclaim_state into a helper
> > > function, and renames reclaimed_slab to just reclaimed, with a comment
> > > describing its true purpose.
> >
> > Looking through the code again, I don't think these helpers add value.
> >
> > report_freed_pages() is fairly vague. Report to who? It abstracts only
> > two lines of code, and those two lines are more descriptive of what's
> > happening than the helper is. Just leave them open-coded.
> 
> I agree the name is not great, I am usually bad at naming things and
> hope people would point that out (like you're doing now). The reason I
> added it is to contain the logic within mm/vmscan.c such that future
> changes do not have to add noisy diffs to a lot of unrelated files. If
> you have a better name that makes more sense to you please let me
> know, otherwise I'm fine dropping the helper as well, no strong
> opinions here.

I tried to come up with something better, but wasn't happy with any of
the options, either. So I defaulted to just leaving it alone :-)

It's part of the shrinker API and the name hasn't changed since the
initial git import of the kernel tree. It should be fine, churn-wise.

> > add_non_vmanscan_reclaimed() may or may not add anything. But let's
> > take a step back. It only has two callsites because lrugen duplicates
> > the entire reclaim implementation, including the call to shrink_slab()
> > and the transfer of reclaim_state to sc->nr_reclaimed.
> >
> > IMO the resulting code would overall be simpler, less duplicative and
> > easier to follow if you added a common shrink_slab_reclaim() that
> > takes sc, handles the transfer, and documents the memcg exception.
> 
> IIUC you mean something like:
> 
> void shrink_slab_reclaim(struct scan_control *sc, pg_data_t *pgdat,
> struct mem_cgroup *memcg)
> {
>     shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
> 
>     /* very long comment */
>     if (current->reclaim_state && !cgroup_reclaim(sc)) {
>         sc->nr_reclaimed += current->reclaim_state->reclaimed;
>         current->reclaim_state->reclaimed = 0;
>     }
> }

Sorry, I screwed up, that doesn't actually work.

reclaim_state is used by buffer heads freed in shrink_folio_list() ->
filemap_release_folio(). So flushing the count cannot be shrink_slab()
specific. Bummer. Your patch had it right by making a helper for just
flushing the reclaim state. But add_non_vmscan_reclaimed() is then
also not a great name because these frees are directly from vmscan.

Maybe simply flush_reclaim_state()?

As far as the name reclaimed_slab, I agree it's not optimal, although
90% accurate ;-) I wouldn't mind a rename to just 'reclaimed'.

Yosry Ahmed March 8, 2023, 8:24 p.m. UTC | #5

On Wed, Mar 8, 2023 at 12:16 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Mar 08, 2023 at 10:01:24AM -0800, Yosry Ahmed wrote:
> > On Wed, Mar 8, 2023 at 8:00 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > Hello Yosry,
> > >
> > > On Tue, Feb 28, 2023 at 08:50:00AM +0000, Yosry Ahmed wrote:
> > > > Reclaimed pages through other means than LRU-based reclaim are tracked
> > > > through reclaim_state in struct scan_control, which is stashed in
> > > > current task_struct. These pages are added to the number of reclaimed
> > > > pages through LRUs. For memcg reclaim, these pages generally cannot be
> > > > linked to the memcg under reclaim and can cause an overestimated count
> > > > of reclaimed pages. This short series tries to address that.
> > >
> > > Could you please add more details on how this manifests as a problem
> > > with real workloads?
> >
> > We haven't observed problems in production workloads, but we have
> > observed problems in testing using memory.reclaim when sometimes a
> > write to memory.reclaim would succeed when we didn't fully reclaim the
> > requested amount. This leads to tests flaking sometimes, and we have
> > to look into the failures to find out if there is a real problem or
> > not.
>
> Ah, that would be great to have in the cover letter. Thanks!

Will do in the next version.

>
> Have you also tested this patch against prod without memory.reclaim?
> Just to make sure there are no problems with cgroup OOMs or
> similar. There shouldn't be, but, you know...

Honestly, no. I was debugging a test flake and I spotted that this was
the cause, came up with patches to address it, and sent it to the
mailing list for feedback. We did not want to merge it internally if
it's not going to land upstream -- with the rationale that making the
test more tolerant might be better than maintaining the patch
internally, although that is not ideal of course (as it can hide
actual failures from different sources).

>
> > > > Patch 1 is just refactoring updating reclaim_state into a helper
> > > > function, and renames reclaimed_slab to just reclaimed, with a comment
> > > > describing its true purpose.
> > >
> > > Looking through the code again, I don't think these helpers add value.
> > >
> > > report_freed_pages() is fairly vague. Report to who? It abstracts only
> > > two lines of code, and those two lines are more descriptive of what's
> > > happening than the helper is. Just leave them open-coded.
> >
> > I agree the name is not great, I am usually bad at naming things and
> > hope people would point that out (like you're doing now). The reason I
> > added it is to contain the logic within mm/vmscan.c such that future
> > changes do not have to add noisy diffs to a lot of unrelated files. If
> > you have a better name that makes more sense to you please let me
> > know, otherwise I'm fine dropping the helper as well, no strong
> > opinions here.
>
> I tried to come up with something better, but wasn't happy with any of
> the options, either. So I defaulted to just leaving it alone :-)
>
> It's part of the shrinker API and the name hasn't changed since the
> initial git import of the kernel tree. It should be fine, churn-wise.

Last attempt, just update_reclaim_state() (corresponding to
flush_reclaim_state() below). It doesn't tell a story, but neither
does incrementing a counter in current->reclaim_state. If that doesn't
make you happy I'll give up now and leave it as-is :)

>
> > > add_non_vmanscan_reclaimed() may or may not add anything. But let's
> > > take a step back. It only has two callsites because lrugen duplicates
> > > the entire reclaim implementation, including the call to shrink_slab()
> > > and the transfer of reclaim_state to sc->nr_reclaimed.
> > >
> > > IMO the resulting code would overall be simpler, less duplicative and
> > > easier to follow if you added a common shrink_slab_reclaim() that
> > > takes sc, handles the transfer, and documents the memcg exception.
> >
> > IIUC you mean something like:
> >
> > void shrink_slab_reclaim(struct scan_control *sc, pg_data_t *pgdat,
> > struct mem_cgroup *memcg)
> > {
> >     shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
> >
> >     /* very long comment */
> >     if (current->reclaim_state && !cgroup_reclaim(sc)) {
> >         sc->nr_reclaimed += current->reclaim_state->reclaimed;
> >         current->reclaim_state->reclaimed = 0;
> >     }
> > }
>
> Sorry, I screwed up, that doesn't actually work.
>
> reclaim_state is used by buffer heads freed in shrink_folio_list() ->
> filemap_release_folio(). So flushing the count cannot be shrink_slab()
> specific. Bummer. Your patch had it right by making a helper for just
> flushing the reclaim state. But add_non_vmscan_reclaimed() is then
> also not a great name because these frees are directly from vmscan.
>
> Maybe simply flush_reclaim_state()?

Sounds good and simple enough, I will use that for the next version.

>
> As far as the name reclaimed_slab, I agree it's not optimal, although
> 90% accurate ;-) I wouldn't mind a rename to just 'reclaimed'.

Got it.

Dave Chinner March 8, 2023, 9:25 p.m. UTC | #6

On Wed, Mar 08, 2023 at 12:24:08PM -0800, Yosry Ahmed wrote:
> > I tried to come up with something better, but wasn't happy with any of
> > the options, either. So I defaulted to just leaving it alone :-)
> >
> > It's part of the shrinker API and the name hasn't changed since the
> > initial git import of the kernel tree. It should be fine, churn-wise.
> 
> Last attempt, just update_reclaim_state() (corresponding to
> flush_reclaim_state() below). It doesn't tell a story, but neither
> does incrementing a counter in current->reclaim_state. If that doesn't
> make you happy I'll give up now and leave it as-is :)

This is used in different subsystem shrinkers outside mm/, so the
name needs to be correctly namespaced. Please prefix it with the
subsystem the function belongs to, at minimum.

mm_account_reclaimed_pages() is what is actually being done here.
It is self describing  and leaves behind no ambiguity as to what is
being accounted and why, nor which subsystem the accounting belongs
to.

It doesn't matter what the internal mm/vmscan structures are called,
all we care about is telling the mm infrastructure how many extra
pages were freed by the shrinker....

-Dave.

Yosry Ahmed March 8, 2023, 9:31 p.m. UTC | #7

On Wed, Mar 8, 2023 at 1:25 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Mar 08, 2023 at 12:24:08PM -0800, Yosry Ahmed wrote:
> > > I tried to come up with something better, but wasn't happy with any of
> > > the options, either. So I defaulted to just leaving it alone :-)
> > >
> > > It's part of the shrinker API and the name hasn't changed since the
> > > initial git import of the kernel tree. It should be fine, churn-wise.
> >
> > Last attempt, just update_reclaim_state() (corresponding to
> > flush_reclaim_state() below). It doesn't tell a story, but neither
> > does incrementing a counter in current->reclaim_state. If that doesn't
> > make you happy I'll give up now and leave it as-is :)
>
> This is used in different subsystem shrinkers outside mm/, so the
> name needs to be correctly namespaced. Please prefix it with the
> subsystem the function belongs to, at minimum.
>
> mm_account_reclaimed_pages() is what is actually being done here.
> It is self describing  and leaves behind no ambiguity as to what is
> being accounted and why, nor which subsystem the accounting belongs
> to.
>
> It doesn't matter what the internal mm/vmscan structures are called,
> all we care about is telling the mm infrastructure how many extra
> pages were freed by the shrinker....

mm_account_reclaimed_pages() sounds good to me. We can also do
something more specific if Johannes has any ideas. I do not have a
strong opinion here at all, I just prefer having a helper to leaving
it open-coded.

Thanks!

>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com

Johannes Weiner March 9, 2023, 4:08 a.m. UTC | #8

On Thu, Mar 09, 2023 at 08:25:29AM +1100, Dave Chinner wrote:
> On Wed, Mar 08, 2023 at 12:24:08PM -0800, Yosry Ahmed wrote:
> > > I tried to come up with something better, but wasn't happy with any of
> > > the options, either. So I defaulted to just leaving it alone :-)
> > >
> > > It's part of the shrinker API and the name hasn't changed since the
> > > initial git import of the kernel tree. It should be fine, churn-wise.
> > 
> > Last attempt, just update_reclaim_state() (corresponding to
> > flush_reclaim_state() below). It doesn't tell a story, but neither
> > does incrementing a counter in current->reclaim_state. If that doesn't
> > make you happy I'll give up now and leave it as-is :)
> 
> This is used in different subsystem shrinkers outside mm/, so the
> name needs to be correctly namespaced. Please prefix it with the
> subsystem the function belongs to, at minimum.
> 
> mm_account_reclaimed_pages() is what is actually being done here.
> It is self describing  and leaves behind no ambiguity as to what is
> being accounted and why, nor which subsystem the accounting belongs
> to.
> 
> It doesn't matter what the internal mm/vmscan structures are called,
> all we care about is telling the mm infrastructure how many extra
> pages were freed by the shrinker....

My first preference would still be to just leave it. IMO that one line
saved in a small handful of places isn't worth the indirection,
obscuring the `current' deref etc.

But mm_account_reclaimed_pages() works for me if you really want to
enapsulate it.

[v1,0/2] Ignore non-LRU-based reclaim in memcg reclaim

Message

Comments