
[v2,0/2] mm: memcg: fix tracking of pending stats updates values

Message ID: 20230922175741.635002-1-yosryahmed@google.com

Message

Yosry Ahmed Sept. 22, 2023, 5:57 p.m. UTC
While working on adjacent code [1], I realized that the values passed
into memcg_rstat_updated() to keep track of the magnitude of pending
updates are not consistent. They are mostly in pages, but sometimes they
can be in bytes or KBs. Fix that.

Patch 1 reworks memcg_page_state_unit() so that we can reuse it in patch
2 to check and normalize the units of state updates.
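
To make the intent concrete, here is a minimal userspace C model of the
normalization idea (a sketch, not the actual diff: it assumes 4KB pages,
and the helper name update_in_pages() is illustrative rather than the
one used in the patches):

#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE 4096UL

/*
 * Normalize an update's magnitude to pages before feeding it into the
 * flush heuristic. 'unit' is the unit of the stat item: 1 for
 * byte-sized items, 1024 for KB-sized ones, PAGE_SIZE for page-sized
 * ones.
 */
static unsigned long update_in_pages(long val, unsigned long unit)
{
	unsigned long bytes = (unsigned long)labs(val) * unit;

	if (!val)
		return 0;
	/* Round up: a nonzero sub-page update still counts as one page. */
	return (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
}

int main(void)
{
	printf("4096-byte slab update -> %lu page(s)\n", update_in_pages(4096, 1));
	printf("64-byte percpu update -> %lu page(s)\n", update_in_pages(64, 1));
	printf("8KB-sized update      -> %lu page(s)\n", update_in_pages(8, 1024));
	printf("3-page LRU update     -> %lu page(s)\n", update_in_pages(3, PAGE_SIZE));
	return 0;
}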

[1] https://lore.kernel.org/lkml/20230921081057.3440885-1-yosryahmed@google.com/

v1 -> v2:
- Rebased on top of mm-unstable.

Yosry Ahmed (2):
  mm: memcg: refactor page state unit helpers
  mm: memcg: normalize the value passed into memcg_rstat_updated()

 mm/memcontrol.c | 64 +++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 51 insertions(+), 13 deletions(-)

Comments

Michal Hocko Sept. 25, 2023, 1:50 p.m. UTC | #1
On Fri 22-09-23 17:57:38, Yosry Ahmed wrote:
> While working on adjacent code [1], I realized that the values passed
> into memcg_rstat_updated() to keep track of the magnitude of pending
> updates are not consistent. They are mostly in pages, but sometimes they
> can be in bytes or KBs. Fix that.

What kind of practical difference does this change make? Is it worth
additional code?
 
> Patch 1 reworks memcg_page_state_unit() so that we can reuse it in patch
> 2 to check and normalize the units of state updates.
> 
> [1] https://lore.kernel.org/lkml/20230921081057.3440885-1-yosryahmed@google.com/
> 
> v1 -> v2:
> - Rebased on top of mm-unstable.
> 
> Yosry Ahmed (2):
>   mm: memcg: refactor page state unit helpers
>   mm: memcg: normalize the value passed into memcg_rstat_updated()
> 
>  mm/memcontrol.c | 64 +++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 51 insertions(+), 13 deletions(-)
> 
> -- 
> 2.42.0.515.g380fc7ccd1-goog
Yosry Ahmed Sept. 25, 2023, 5:11 p.m. UTC | #2
On Mon, Sep 25, 2023 at 6:50 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 22-09-23 17:57:38, Yosry Ahmed wrote:
> > While working on adjacent code [1], I realized that the values passed
> > into memcg_rstat_updated() to keep track of the magnitude of pending
> > updates are not consistent. They are mostly in pages, but sometimes they
> > can be in bytes or KBs. Fix that.
>
> What kind of practical difference does this change make? Is it worth
> additional code?

As explained in patch 2's commit message, the value passed into
memcg_rstat_updated() is used for the "flush only if it's worth it"
heuristic: the flush is skipped unless the magnitude of pending updates
crosses a threshold. As we have discussed in different threads in the
past few weeks, unnecessary flushes can cause increased global lock
contention and/or latency.

Byte-sized paths (percpu, slab, zswap, ..) feed bytes into the
heuristic, but those are interpreted as pages, which means we will
flush earlier than we should. This was noticed by code inspection. How
much does this matter in practice? I would say it depends on the
workload: how many percpu/slab allocations are being made vs. how many
flushes are requested.

On a system with 100 CPUs, ~25MB worth of page-sized stat updates are
usually needed to trigger a flush, but only ~6KB of byte-sized
slab/percpu updates will (mistakenly) trigger one as well.
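
To spell out where those numbers come from, here is a rough standalone
model (assuming a 4KB page size and a per-CPU batch of 64 events, which
mirrors MEMCG_CHARGE_BATCH; the real threshold logic is more involved,
so treat this as back-of-the-envelope):

#include <stdio.h>

#define PAGE_SIZE 4096UL
#define BATCH 64UL /* per-CPU event batch, mirroring MEMCG_CHARGE_BATCH */

int main(void)
{
	unsigned long cpus = 100;
	/* Roughly BATCH events accumulate per CPU before a flush triggers. */
	unsigned long events_per_flush = BATCH * cpus;

	/* Page-sized paths: each event is one 4KB page. */
	printf("page-sized: %lu updates (~%lu MB) per flush\n",
	       events_per_flush, (events_per_flush * PAGE_SIZE) >> 20);

	/* Byte-sized paths fed unnormalized: each byte counts as an event. */
	printf("byte-sized: %lu updates (~%lu KB) per flush\n",
	       events_per_flush, events_per_flush >> 10);
	return 0;
}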

>
> > Patch 1 reworks memcg_page_state_unit() so that we can reuse it in patch
> > 2 to check and normalize the units of state updates.
> >
> > [1] https://lore.kernel.org/lkml/20230921081057.3440885-1-yosryahmed@google.com/
> >
> > v1 -> v2:
> > - Rebased on top of mm-unstable.
> >
> > Yosry Ahmed (2):
> >   mm: memcg: refactor page state unit helpers
> >   mm: memcg: normalize the value passed into memcg_rstat_updated()
> >
> >  mm/memcontrol.c | 64 +++++++++++++++++++++++++++++++++++++++----------
> >  1 file changed, 51 insertions(+), 13 deletions(-)
> >
> > --
> > 2.42.0.515.g380fc7ccd1-goog
>
> --
> Michal Hocko
> SUSE Labs
Michal Hocko Oct. 3, 2023, 7:57 a.m. UTC | #3
On Mon 25-09-23 10:11:05, Yosry Ahmed wrote:
> On Mon, Sep 25, 2023 at 6:50 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 22-09-23 17:57:38, Yosry Ahmed wrote:
> > > While working on adjacent code [1], I realized that the values passed
> > > into memcg_rstat_updated() to keep track of the magnitude of pending
> > > updates are not consistent. They are mostly in pages, but sometimes they
> > > can be in bytes or KBs. Fix that.
> >
> > What kind of practical difference does this change make? Is it worth
> > additional code?
> 
> As explained in patch 2's commit message, the value passed into
> memcg_rstat_updated() is used for the "flush only if it's worth it"
> heuristic: the flush is skipped unless the magnitude of pending updates
> crosses a threshold. As we have discussed in different threads in the
> past few weeks, unnecessary flushes can cause increased global lock
> contention and/or latency.
> 
> Byte-sized paths (percpu, slab, zswap, ..) feed bytes into the
> heuristic, but those are interpreted as pages, which means we will
> flush earlier than we should. This was noticed by code inspection. How
> much does this matter in practice? I would say it depends on the
> workload: how many percpu/slab allocations are being made vs. how many
> flushes are requested.
> 
> On a system with 100 CPUs, ~25MB worth of page-sized stat updates are
> usually needed to trigger a flush, but only ~6KB of byte-sized
> slab/percpu updates will (mistakenly) trigger one as well.

This surely depends on workload and that is understandable. But it would
be really nice to provide some numbers for typical workloads which
exercise slab heavily.
Yosry Ahmed Oct. 3, 2023, 8:03 a.m. UTC | #4
On Tue, Oct 3, 2023 at 12:57 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 25-09-23 10:11:05, Yosry Ahmed wrote:
> > On Mon, Sep 25, 2023 at 6:50 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Fri 22-09-23 17:57:38, Yosry Ahmed wrote:
> > > > While working on adjacent code [1], I realized that the values passed
> > > > into memcg_rstat_updated() to keep track of the magnitude of pending
> > > > updates are not consistent. They are mostly in pages, but sometimes they
> > > > can be in bytes or KBs. Fix that.
> > >
> > > What kind of practical difference does this change make? Is it worth
> > > additional code?
> >
> > As explained in patch 2's commit message, the value passed into
> > memcg_rstat_updated() is used for the "flush only if it's worth it"
> > heuristic: the flush is skipped unless the magnitude of pending updates
> > crosses a threshold. As we have discussed in different threads in the
> > past few weeks, unnecessary flushes can cause increased global lock
> > contention and/or latency.
> >
> > Byte-sized paths (percpu, slab, zswap, ..) feed bytes into the
> > heuristic, but those are interpreted as pages, which means we will
> > flush earlier than we should. This was noticed by code inspection. How
> > much does this matter in practice? I would say it depends on the
> > workload: how many percpu/slab allocations are being made vs. how many
> > flushes are requested.
> >
> > On a system with 100 CPUs, ~25MB worth of page-sized stat updates are
> > usually needed to trigger a flush, but only ~6KB of byte-sized
> > slab/percpu updates will (mistakenly) trigger one as well.
>
> This surely depends on workload and that is understandable. But it would
> be really nice to provide some numbers for typical workloads which
> exercise slab heavily.

If you have a workload in mind, I can run it and see how many flushes
we get with/without this patch. The first thing that pops into my head
is creating a bunch of empty files, but I don't know if that's the best
workload to get numbers from.

> --
> Michal Hocko
> SUSE Labs
Michal Hocko Oct. 3, 2023, 8:09 a.m. UTC | #5
On Tue 03-10-23 01:03:53, Yosry Ahmed wrote:
> On Tue, Oct 3, 2023 at 12:57 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 25-09-23 10:11:05, Yosry Ahmed wrote:
> > > On Mon, Sep 25, 2023 at 6:50 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Fri 22-09-23 17:57:38, Yosry Ahmed wrote:
> > > > > While working on adjacent code [1], I realized that the values passed
> > > > > into memcg_rstat_updated() to keep track of the magnitude of pending
> > > > > updates are not consistent. They are mostly in pages, but sometimes they
> > > > > can be in bytes or KBs. Fix that.
> > > >
> > > > What kind of practical difference does this change make? Is it worth
> > > > additional code?
> > >
> > > As explained in patch 2's commit message, the value passed into
> > > memcg_rstat_updated() is used for the "flush only if it's worth it"
> > > heuristic: the flush is skipped unless the magnitude of pending updates
> > > crosses a threshold. As we have discussed in different threads in the
> > > past few weeks, unnecessary flushes can cause increased global lock
> > > contention and/or latency.
> > >
> > > Byte-sized paths (percpu, slab, zswap, ..) feed bytes into the
> > > heuristic, but those are interpreted as pages, which means we will
> > > flush earlier than we should. This was noticed by code inspection. How
> > > much does this matter in practice? I would say it depends on the
> > > workload: how many percpu/slab allocations are being made vs. how many
> > > flushes are requested.
> > >
> > > On a system with 100 CPUs, ~25MB worth of page-sized stat updates are
> > > usually needed to trigger a flush, but only ~6KB of byte-sized
> > > slab/percpu updates will (mistakenly) trigger one as well.
> >
> > This surely depends on workload and that is understandable. But it would
> > be really nice to provide some numbers for typical workloads which
> > exercise slab heavily.
> 
> If you have a workload in mind, I can run it and see how many flushes
> we get with/without this patch. The first thing that pops into my head
> is creating a bunch of empty files, but I don't know if that's the best
> workload to get numbers from.

Let me remind you that you are proposing a performance optimization, and
such a change requires some numbers to actually show it is beneficial.
There are cases where the resulting code is clearly an improvement and
the performance benefit is just a nice side effect; I do not consider
this to be such a case. The whole thing is quite convoluted even without
the better precision you are proposing. And let me be clear: I am not
opposing your patch, but I would rather see it based on more than just
hand waving.
Yosry Ahmed Oct. 3, 2023, 8:49 a.m. UTC | #6
On Tue, Oct 3, 2023 at 1:09 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 03-10-23 01:03:53, Yosry Ahmed wrote:
> > On Tue, Oct 3, 2023 at 12:57 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 25-09-23 10:11:05, Yosry Ahmed wrote:
> > > > On Mon, Sep 25, 2023 at 6:50 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Fri 22-09-23 17:57:38, Yosry Ahmed wrote:
> > > > > > While working on adjacent code [1], I realized that the values passed
> > > > > > into memcg_rstat_updated() to keep track of the magnitude of pending
> > > > > > updates are not consistent. They are mostly in pages, but sometimes they
> > > > > > can be in bytes or KBs. Fix that.
> > > > >
> > > > > What kind of practical difference does this change make? Is it worth
> > > > > additional code?
> > > >
> > > > As explained in patch 2's commit message, the value passed into
> > > > memcg_rstat_updated() is used for the "flush only if it's worth it"
> > > > heuristic: the flush is skipped unless the magnitude of pending updates
> > > > crosses a threshold. As we have discussed in different threads in the
> > > > past few weeks, unnecessary flushes can cause increased global lock
> > > > contention and/or latency.
> > > >
> > > > Byte-sized paths (percpu, slab, zswap, ..) feed bytes into the
> > > > heuristic, but those are interpreted as pages, which means we will
> > > > flush earlier than we should. This was noticed by code inspection. How
> > > > much does this matter in practice? I would say it depends on the
> > > > workload: how many percpu/slab allocations are being made vs. how many
> > > > flushes are requested.
> > > >
> > > > On a system with 100 CPUs, ~25MB worth of page-sized stat updates are
> > > > usually needed to trigger a flush, but only ~6KB of byte-sized
> > > > slab/percpu updates will (mistakenly) trigger one as well.
> > >
> > > This surely depends on workload and that is understandable. But it would
> > > be really nice to provide some numbers for typical workloads which
> > > exercise slab heavily.
> >
> > If you have a workload in mind, I can run it and see how many flushes
> > we get with/without this patch. The first thing that pops into my head
> > is creating a bunch of empty files, but I don't know if that's the best
> > workload to get numbers from.
>
> Let me remind you that you are proposing a performance optimization, and
> such a change requires some numbers to actually show it is beneficial.
> There are cases where the resulting code is clearly an improvement and
> the performance benefit is just a nice side effect; I do not consider
> this to be such a case. The whole thing is quite convoluted even without
> the better precision you are proposing. And let me be clear: I am not
> opposing your patch, but I would rather see it based on more than just
> hand waving.

It is purely based on code inspection, and honestly I don't have
numbers to support it. I saw something wrong with the code and tried to
fix it; I was working on something else when I noticed it. That being
said, I acknowledge it's not making the code any prettier :)

Feel free to suggest improvements to make the code more bearable;
otherwise, if you don't like it, I will just leave it, to be honest.

Thanks for taking a look!

> --
> Michal Hocko
> SUSE Labs