
[2/2] memcg: use memcg flush tracepoint

Message ID 20241025002511.129899-3-inwardvessel@gmail.com (mailing list archive)
State New
Series: memcg: tracepoint for flushing stats

Commit Message

JP Kobryn Oct. 25, 2024, 12:25 a.m. UTC
Make use of the flush tracepoint within memcontrol.

Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
 mm/memcontrol.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Comments

Yosry Ahmed Oct. 25, 2024, 12:57 a.m. UTC | #1
On Thu, Oct 24, 2024 at 5:26 PM JP Kobryn <inwardvessel@gmail.com> wrote:
>
> Make use of the flush tracepoint within memcontrol.
>
> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>

Is the intention to use tools like bpftrace to analyze where we flush
the most? In this case, why can't we just attach to the fentry of
do_flush_stats() and use the stack trace to find the path?

We can also attach to mem_cgroup_flush_stats(), and the difference in
counts between the two will be the number of skipped flushes.

Are there other use cases for these tracepoints?
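
For illustration, a rough (untested) libbpf-style sketch of that approach:
attach to the fentry of do_flush_stats() and count hits per kernel stack.
bpftrace can express the same thing more compactly; this assumes
do_flush_stats() is visible in BTF and not inlined, and the map names and
sizes below are made up:

/* Hypothetical sketch, not part of this series: count do_flush_stats()
 * calls per kernel stack via an fentry program. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

#define MAX_STACK_DEPTH 32

/* stack id -> raw kernel stack */
struct {
	__uint(type, BPF_MAP_TYPE_STACK_TRACE);
	__uint(max_entries, 1024);
	__uint(key_size, sizeof(u32));
	__uint(value_size, MAX_STACK_DEPTH * sizeof(u64));
} stacks SEC(".maps");

/* stack id -> number of flushes reached through that path */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, int);
	__type(value, u64);
} flush_counts SEC(".maps");

SEC("fentry/do_flush_stats")
int BPF_PROG(count_flush, struct mem_cgroup *memcg)
{
	int stack_id = bpf_get_stackid(ctx, &stacks, 0);
	u64 one = 1, *cnt;

	if (stack_id < 0)
		return 0;

	cnt = bpf_map_lookup_elem(&flush_counts, &stack_id);
	if (cnt)
		__sync_fetch_and_add(cnt, 1);
	else
		bpf_map_update_elem(&flush_counts, &stack_id, &one, BPF_ANY);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";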

> ---
>  mm/memcontrol.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 18c3f513d766..f816737228fa 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -613,8 +613,11 @@ void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
>         if (!memcg)
>                 memcg = root_mem_cgroup;
>
> -       if (memcg_vmstats_needs_flush(memcg->vmstats))
> +       if (memcg_vmstats_needs_flush(memcg->vmstats)) {
> +               trace_memcg_flush_stats(memcg, TRACE_MEMCG_FLUSH_READER);
>                 do_flush_stats(memcg);
> +       } else
> +               trace_memcg_flush_stats(memcg, TRACE_MEMCG_FLUSH_READER_SKIP);
>  }
>
>  void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg)
> @@ -630,6 +633,7 @@ static void flush_memcg_stats_dwork(struct work_struct *w)
>          * Deliberately ignore memcg_vmstats_needs_flush() here so that flushing
>          * in latency-sensitive paths is as cheap as possible.
>          */
> +       trace_memcg_flush_stats(root_mem_cgroup, TRACE_MEMCG_FLUSH_PERIODIC);
>         do_flush_stats(root_mem_cgroup);
>         queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
>  }
> @@ -5285,6 +5289,7 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
>                  * mem_cgroup_flush_stats() ignores small changes. Use
>                  * do_flush_stats() directly to get accurate stats for charging.
>                  */
> +               trace_memcg_flush_stats(memcg, TRACE_MEMCG_FLUSH_ZSWAP);
>                 do_flush_stats(memcg);
>                 pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
>                 if (pages < max)
> --
> 2.47.0
>
Shakeel Butt Oct. 25, 2024, 1:15 a.m. UTC | #2
On Thu, Oct 24, 2024 at 05:57:25PM GMT, Yosry Ahmed wrote:
> On Thu, Oct 24, 2024 at 5:26 PM JP Kobryn <inwardvessel@gmail.com> wrote:
> >
> > Make use of the flush tracepoint within memcontrol.
> >
> > Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
> 
> Is the intention to use tools like bpftrace to analyze where we flush
> the most? In this case, why can't we just attach to the fentry of
> do_flush_stats() and use the stack trace to find the path?
> 
> We can also attach to mem_cgroup_flush_stats(), and the difference in
> counts between the two will be the number of skipped flushes.
> 

All these functions can get inlined, and then we can't really attach
easily. We could find the offset of the inlined code and attach a kprobe
there, but that becomes prohibitive when it has to be done for multiple
kernels built with fdo/bolt.

Please note that tracepoints are not really API, so we can remove them
in future if we see no usage for them.

Thanks for the review,
Shakeel
Yosry Ahmed Oct. 25, 2024, 7:40 a.m. UTC | #3
On Thu, Oct 24, 2024 at 6:16 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Thu, Oct 24, 2024 at 05:57:25PM GMT, Yosry Ahmed wrote:
> > On Thu, Oct 24, 2024 at 5:26 PM JP Kobryn <inwardvessel@gmail.com> wrote:
> > >
> > > Make use of the flush tracepoint within memcontrol.
> > >
> > > Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
> >
> > Is the intention to use tools like bpftrace to analyze where we flush
> > the most? In this case, why can't we just attach to the fentry of
> > do_flush_stats() and use the stack trace to find the path?
> >
> > We can also attach to mem_cgroup_flush_stats(), and the difference in
> > counts between the two will be the number of skipped flushes.
> >
>
> All these functions can get inlined, and then we can't really attach
> easily. We could find the offset of the inlined code and attach a kprobe
> there, but that becomes prohibitive when it has to be done for multiple
> kernels built with fdo/bolt.
>
> Please note that tracepoints are not really API, so we can remove them
> in future if we see no usage for them.

That's fair, but can we just add two tracepoints? This seems enough to
collect necessary data, and prevent proliferation of tracepoints and
the addition of the enum.

I am thinking one in mem_cgroup_flush_stats() and one in
do_flush_stats(), e.g. trace_mem_cgroup_flush_stats() and
trace_do_flush_stats(). Although the name of the latter is too
generic, maybe we should rename the function first to add mem_cgroup_*
or memcg_*.

WDYT?
JP Kobryn Oct. 25, 2024, 5:04 p.m. UTC | #4
On 10/25/24 12:40 AM, Yosry Ahmed wrote:
> On Thu, Oct 24, 2024 at 6:16 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>> On Thu, Oct 24, 2024 at 05:57:25PM GMT, Yosry Ahmed wrote:
>>> On Thu, Oct 24, 2024 at 5:26 PM JP Kobryn <inwardvessel@gmail.com> wrote:
>>>> Make use of the flush tracepoint within memcontrol.
>>>>
>>>> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
>>> Is the intention to use tools like bpftrace to analyze where we flush
>>> the most? In this case, why can't we just attach to the fentry of
>>> do_flush_stats() and use the stack trace to find the path?
>>>
>>> We can also attach to mem_cgroup_flush_stats(), and the difference in
>>> counts between the two will be the number of skipped flushes.
>>>
>> All these functions can get inlined, and then we can't really attach
>> easily. We could find the offset of the inlined code and attach a kprobe
>> there, but that becomes prohibitive when it has to be done for multiple
>> kernels built with fdo/bolt.
>>
>> Please note that tracepoints are not really API, so we can remove them
>> in future if we see no usage for them.
> That's fair, but can we just add two tracepoints? This seems enough to
> collect necessary data, and prevent proliferation of tracepoints and
> the addition of the enum.
>
> I am thinking one in mem_cgroup_flush_stats() and one in
> do_flush_stats(), e.g. trace_mem_cgroup_flush_stats() and
> trace_do_flush_stats(). Although the name of the latter is too
> generic, maybe we should rename the function first to add mem_cgroup_*
> or memcg_*.
>
> WDYT?

Hmmm, I think if we did that we wouldn't get accurate info on when the 
flush was skipped. Comparing the number of hits between 
mem_cgroup_flush_stats() and do_flush_stats() to determine the number of 
skips doesn't seem reliable because of the places where do_flush_stats() 
is called outside of mem_cgroup_flush_stats(). There would be situations 
where a skip occurs, but meanwhile each call to do_flush_stats() outside 
of mem_cgroup_flush_stats() would effectively subtract that skip, making 
it appear that a skip did not occur.

Maybe as a middle ground we could remove the trace calls for the zswap 
and periodic cases, since no skips can occur there. We could then just 
leave one trace call in mem_cgroup_flush_stats() and instead of an enum 
we can pass a bool saying skipped or not. Something like this:

mem_cgroup_flush_stats()
{
	bool needs_flush = memcg_vmstats_needs_flush(...);

	trace_memcg_flush_stats(memcg, needs_flush);

	if (needs_flush)
		do_flush_stats(...);
}


Yosry/Shakeel, do you have any thoughts on whether we should keep the 
trace calls for obj_cgroup_may_zswap() and periodic workqueue cases?
Yosry Ahmed Oct. 25, 2024, 5:53 p.m. UTC | #5
On Fri, Oct 25, 2024 at 10:05 AM JP Kobryn <inwardvessel@gmail.com> wrote:
>
>
> On 10/25/24 12:40 AM, Yosry Ahmed wrote:
> > On Thu, Oct 24, 2024 at 6:16 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >> On Thu, Oct 24, 2024 at 05:57:25PM GMT, Yosry Ahmed wrote:
> >>> On Thu, Oct 24, 2024 at 5:26 PM JP Kobryn <inwardvessel@gmail.com> wrote:
> >>>> Make use of the flush tracepoint within memcontrol.
> >>>>
> >>>> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
> >>> Is the intention to use tools like bpftrace to analyze where we flush
> >>> the most? In this case, why can't we just attach to the fentry of
> >>> do_flush_stats() and use the stack trace to find the path?
> >>>
> >>> We can also attach to mem_cgroup_flush_stats(), and the difference in
> >>> counts between the two will be the number of skipped flushes.
> >>>
> >> All these functions can get inlined, and then we can't really attach
> >> easily. We could find the offset of the inlined code and attach a kprobe
> >> there, but that becomes prohibitive when it has to be done for multiple
> >> kernels built with fdo/bolt.
> >>
> >> Please note that tracepoints are not really API, so we can remove them
> >> in future if we see no usage for them.
> > That's fair, but can we just add two tracepoints? This seems enough to
> > collect necessary data, and prevent proliferation of tracepoints and
> > the addition of the enum.
> >
> > I am thinking one in mem_cgroup_flush_stats() and one in
> > do_flush_stats(), e.g. trace_mem_cgroup_flush_stats() and
> > trace_do_flush_stats(). Although the name of the latter is too
> > generic, maybe we should rename the function first to add mem_cgroup_*
> > or memcg_*.
> >
> > WDYT?
>
> Hmmm, I think if we did that we wouldn't get accurate info on when the
> flush was skipped. Comparing the number of hits between
> mem_cgroup_flush_stats() and do_flush_stats() to determine the number of
> skips doesn't seem reliable because of the places where do_flush_stats()
> is called outside of mem_cgroup_flush_stats(). There would be situations
> where a skip occurs, but meanwhile each call to do_flush_stats() outside
> of mem_cgroup_flush_stats() would effectively subtract that skip, making
> it appear that a skip did not occur.

You're underestimating the power of BPF, my friend :) We can count the
number of flushes in task local storages, in which case we can get a
very accurate representation because the counters are per-task, so we
know exactly when we skipped, but..

>
> Maybe as a middle ground we could remove the trace calls for the zswap
> and periodic cases, since no skips can occur there. We could then just
> leave one trace call in mem_cgroup_flush_stats() and instead of an enum
> we can pass a bool saying skipped or not. Something like this:
>
> mem_cgroup_flush_stats()
>
> {
>
>      bool needs_flush = memcg_vmstats_needs_flush(...);
>
>      trace_memcg_flush_stats(memcg, needs_flush);
>
>      if (needs_flush)
>
>          do_flush_stats(...);
>
> }
>
>
> Yosry/Shakeel, do you have any thoughts on whether we should keep the
> trace calls for obj_cgroup_may_zswap() and periodic workqueue cases?

..with that being said, I do like having a single tracepoint. I think
with some refactoring we can end up with a single tracepoint and more
data. We can even capture the cases where we force a flush but we
don't really need to flush. We can even add vmstats->stats_updates to
the tracepoint to know exactly how many updates we have when we flush.

What about the following:

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7845c64a2c570..be0e7f52ad11a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -584,8 +584,14 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
        }
 }

-static void do_flush_stats(struct mem_cgroup *memcg)
+static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg, bool force)
 {
+       bool needs_flush = memcg_vmstats_needs_flush(memcg->vmstats);
+
+       trace_memcg_flush_stats(memcg, needs_flush, force, ...);
+       if (!force && !needs_flush)
+               return;
+
        if (mem_cgroup_is_root(memcg))
                WRITE_ONCE(flush_last_time, jiffies_64);

@@ -609,8 +615,7 @@ void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
        if (!memcg)
                memcg = root_mem_cgroup;

-       if (memcg_vmstats_needs_flush(memcg->vmstats))
-               do_flush_stats(memcg);
+       __mem_cgroup_flush_stats(memcg, false);
 }

 void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg)
@@ -626,7 +631,7 @@ static void flush_memcg_stats_dwork(struct work_struct *w)
         * Deliberately ignore memcg_vmstats_needs_flush() here so that flushing
         * in latency-sensitive paths is as cheap as possible.
         */
-       do_flush_stats(root_mem_cgroup);
+       __mem_cgroup_flush_stats(root_mem_cgroup, true);
        queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
 }

@@ -5272,11 +5277,8 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
                        break;
                }

-               /*
-                * mem_cgroup_flush_stats() ignores small changes. Use
-                * do_flush_stats() directly to get accurate stats for charging.
-                */
-               do_flush_stats(memcg);
+               /* Force a flush to get accurate stats for charging */
+               __mem_cgroup_flush_stats(memcg, true);
                pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
                if (pages < max)
                        continue;
JP Kobryn Oct. 25, 2024, 6:26 p.m. UTC | #6
On 10/25/24 10:53 AM, Yosry Ahmed wrote:
> On Fri, Oct 25, 2024 at 10:05 AM JP Kobryn <inwardvessel@gmail.com> wrote:
>>
>> On 10/25/24 12:40 AM, Yosry Ahmed wrote:
>>> On Thu, Oct 24, 2024 at 6:16 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>>>> On Thu, Oct 24, 2024 at 05:57:25PM GMT, Yosry Ahmed wrote:
>>>>> On Thu, Oct 24, 2024 at 5:26 PM JP Kobryn <inwardvessel@gmail.com> wrote:
>>>>>> Make use of the flush tracepoint within memcontrol.
>>>>>>
>>>>>> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
>>>>> Is the intention to use tools like bpftrace to analyze where we flush
>>>>> the most? In this case, why can't we just attach to the fentry of
>>>>> do_flush_stats() and use the stack trace to find the path?
>>>>>
>>>>> We can also attach to mem_cgroup_flush_stats(), and the difference in
>>>>> counts between the two will be the number of skipped flushes.
>>>>>
>>>> All these functions can get inlined, and then we can't really attach
>>>> easily. We could find the offset of the inlined code and attach a kprobe
>>>> there, but that becomes prohibitive when it has to be done for multiple
>>>> kernels built with fdo/bolt.
>>>>
>>>> Please note that tracepoints are not really API, so we can remove them
>>>> in future if we see no usage for them.
>>> That's fair, but can we just add two tracepoints? This seems enough to
>>> collect necessary data, and prevent proliferation of tracepoints and
>>> the addition of the enum.
>>>
>>> I am thinking one in mem_cgroup_flush_stats() and one in
>>> do_flush_stats(), e.g. trace_mem_cgroup_flush_stats() and
>>> trace_do_flush_stats(). Although the name of the latter is too
>>> generic, maybe we should rename the function first to add mem_cgroup_*
>>> or memcg_*.
>>>
>>> WDYT?
>> Hmmm, I think if we did that we wouldn't get accurate info on when the
>> flush was skipped. Comparing the number of hits between
>> mem_cgroup_flush_stats() and do_flush_stats() to determine the number of
>> skips doesn't seem reliable because of the places where do_flush_stats()
>> is called outside of mem_cgroup_flush_stats(). There would be situations
>> where a skip occurs, but meanwhile each call to do_flush_stats() outside
>> of mem_cgroup_flush_stats() would effectively subtract that skip, making
>> it appear that a skip did not occur.
> You're underestimating the power of BPF, my friend :) We can count the
> number of flushes in task local storages, in which case we can get a
> very accurate representation because the counters are per-task, so we
> know exactly when we skipped, but..
Interesting. Thanks for explaining.
>
>> Maybe as a middle ground we could remove the trace calls for the zswap
>> and periodic cases, since no skips can occur there. We could then just
>> leave one trace call in mem_cgroup_flush_stats() and instead of an enum
>> we can pass a bool saying skipped or not. Something like this:
>>
>> mem_cgroup_flush_stats()
>> {
>>         bool needs_flush = memcg_vmstats_needs_flush(...);
>>
>>         trace_memcg_flush_stats(memcg, needs_flush);
>>
>>         if (needs_flush)
>>                 do_flush_stats(...);
>> }
>>
>>
>> Yosry/Shakeel, do you have any thoughts on whether we should keep the
>> trace calls for obj_cgroup_may_zswap() and periodic workqueue cases?
> ..with that being said, I do like having a single tracepoint. I think
> with some refactoring we can end up with a single tracepoint and more
> data. We can even capture the cases where we force a flush but we
> don't really need to flush. We can even add vmstats->stats_updates to
> the tracepoint to know exactly how many updates we have when we flush.
>
> What about the following:
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7845c64a2c570..be0e7f52ad11a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -584,8 +584,14 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
>          }
>   }
>
> -static void do_flush_stats(struct mem_cgroup *memcg)
> +static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg, bool force)
>   {
> +       bool needs_flush = memcg_vmstats_needs_flush(memcg->vmstats);
> +
> +       trace_memcg_flush_stats(memcg, needs_flush, force, ...);
> +       if (!force && !needs_flush)
> +               return;
> +
>          if (mem_cgroup_is_root(memcg))
>                  WRITE_ONCE(flush_last_time, jiffies_64);
>
> @@ -609,8 +615,7 @@ void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
>          if (!memcg)
>                  memcg = root_mem_cgroup;
>
> -       if (memcg_vmstats_needs_flush(memcg->vmstats))
> -               do_flush_stats(memcg);
> +       __mem_cgroup_flush_stats(memcg, false);
>   }
>
>   void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg)
> @@ -626,7 +631,7 @@ static void flush_memcg_stats_dwork(struct work_struct *w)
>           * Deliberately ignore memcg_vmstats_needs_flush() here so that flushing
>           * in latency-sensitive paths is as cheap as possible.
>           */
> -       do_flush_stats(root_mem_cgroup);
> +       __mem_cgroup_flush_stats(root_mem_cgroup, true);
>          queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
>   }
>
> @@ -5272,11 +5277,8 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
>                          break;
>                  }
>
> -               /*
> -                * mem_cgroup_flush_stats() ignores small changes. Use
> -                * do_flush_stats() directly to get accurate stats for charging.
> -                */
> -               do_flush_stats(memcg);
> +               /* Force a flush to get accurate stats for charging */
> +               __mem_cgroup_flush_stats(memcg, true);
>                  pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
>                  if (pages < max)
>                          continue;
I like the additional info. I'll incorporate this into v2.

Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 18c3f513d766..f816737228fa 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -613,8 +613,11 @@  void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
 	if (!memcg)
 		memcg = root_mem_cgroup;
 
-	if (memcg_vmstats_needs_flush(memcg->vmstats))
+	if (memcg_vmstats_needs_flush(memcg->vmstats)) {
+		trace_memcg_flush_stats(memcg, TRACE_MEMCG_FLUSH_READER);
 		do_flush_stats(memcg);
+	} else
+		trace_memcg_flush_stats(memcg, TRACE_MEMCG_FLUSH_READER_SKIP);
 }
 
 void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg)
@@ -630,6 +633,7 @@  static void flush_memcg_stats_dwork(struct work_struct *w)
 	 * Deliberately ignore memcg_vmstats_needs_flush() here so that flushing
 	 * in latency-sensitive paths is as cheap as possible.
 	 */
+	trace_memcg_flush_stats(root_mem_cgroup, TRACE_MEMCG_FLUSH_PERIODIC);
 	do_flush_stats(root_mem_cgroup);
 	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
 }
@@ -5285,6 +5289,7 @@  bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
 		 * mem_cgroup_flush_stats() ignores small changes. Use
 		 * do_flush_stats() directly to get accurate stats for charging.
 		 */
+		trace_memcg_flush_stats(memcg, TRACE_MEMCG_FLUSH_ZSWAP);
 		do_flush_stats(memcg);
 		pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
 		if (pages < max)