[-next,v6,0/2] Make memory reclamation measurable

Message-Id: <20240105013607.2868-1-cuibixuan@vivo.com>
From: Bixuan Cui <cuibixuan@vivo.com>
To: akpm@linux-foundation.org, rostedt@goodmis.org, mhiramat@kernel.org, mathieu.desnoyers@efficios.com
Cc: linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org, linux-mm@kvack.org, opensource.kernel@vivo.com, cuibixuan@vivo.com
Subject: [PATCH -next v6 0/2] Make memory reclamation measurable
Date: Thu, 4 Jan 2024 17:36:05 -0800

Series: Make memory reclamation measurable
  [-next,v6,1/2] mm: shrinker: add new event to trace shrink count
  [-next,v6,2/2] mm: vmscan: add new event to trace shrink lru

Bixuan Cui Jan. 5, 2024, 1:36 a.m. UTC

When system memory is low, kswapd reclaims memory. The key steps of
memory reclamation are:
1. shrink_lruvec
  * shrink_active_list, moves folios from the active LRU to the inactive LRU
  * shrink_inactive_list, reclaims folios from the inactive LRU
2. shrink_slab
  * shrinker->count_objects(), calculates the freeable memory
  * shrinker->scan_objects(), reclaims the slab memory

The existing tracepoints in vmscan are as follows:

--do_try_to_free_pages
--shrink_zones
--trace_mm_vmscan_node_reclaim_begin (tracer)
--shrink_node
--shrink_node_memcgs
  --trace_mm_vmscan_memcg_shrink_begin (tracer)
  --shrink_lruvec
    --shrink_list
      --shrink_active_list
        --trace_mm_vmscan_lru_shrink_active (tracer)
      --shrink_inactive_list
        --trace_mm_vmscan_lru_shrink_inactive (tracer)
    --shrink_active_list
  --shrink_slab
    --do_shrink_slab
      --shrinker->count_objects()
      --trace_mm_shrink_slab_start (tracer)
      --shrinker->scan_objects()
      --trace_mm_shrink_slab_end (tracer)
  --trace_mm_vmscan_memcg_shrink_end (tracer)
--trace_mm_vmscan_node_reclaim_end (tracer)

If we can get the duration of each LRU and slab shrink, and the number
of pages it reclaims, then we can measure memory reclamation, as follows.

Measuring memory reclamation with bpf:
  LRU FILE:
    CPU  COMM     ShrinkActive(us)  ShrinkInactive(us)  Reclaim(page)
    7    kswapd0  26                51                  32
    7    kswapd0  52                47                  13
  SLAB:
    CPU  COMM     OBJ_NAME                 Count_Dur(us)  Freeable(page)  Scan_Dur(us)  Reclaim(page)
    1    kswapd0  super_cache_scan.cfi_jt  2              341             3225          128
    7    kswapd0  super_cache_scan.cfi_jt  0              2247            8524          1024
    7    kswapd0  super_cache_scan.cfi_jt  2367           0               0             0

For this, add new tracepoints to shrink_active_list(),
shrink_inactive_list() and shrinker->count_objects().
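
As a rough illustration of how the LRU FILE columns above could be derived, the sketch below pairs each start event with the matching end event on the same CPU and reports the elapsed time. It is not the bpf program used for the measurements. The record layout (timestamp_us, cpu, event_name) and the sample timestamps are assumptions made for the example, and the start event names are inferred from the v5 changelog entry; the end events are the existing mm_vmscan_lru_shrink_active/inactive tracepoints.

```python
# Sketch only: the record layout (timestamp_us, cpu, event_name) is hypothetical,
# and the *_start names are inferred from the changelog, not copied from the patch.
START_TO_END = {
    "mm_vmscan_lru_shrink_active_start": "mm_vmscan_lru_shrink_active",
    "mm_vmscan_lru_shrink_inactive_start": "mm_vmscan_lru_shrink_inactive",
}
END_EVENTS = set(START_TO_END.values())


def pair_durations(events):
    """Return (cpu, end_event, duration_us) for every start/end pair.

    A start event is matched with the next end event of the corresponding
    type seen on the same CPU.
    """
    pending = {}  # (cpu, expected_end_event) -> start timestamp
    durations = []
    for ts_us, cpu, name in sorted(events):
        if name in START_TO_END:
            pending[(cpu, START_TO_END[name])] = ts_us
        elif name in END_EVENTS:
            start = pending.pop((cpu, name), None)
            if start is not None:  # skip an end whose start was not captured
                durations.append((cpu, name, ts_us - start))
    return durations


# Made-up timestamps chosen to reproduce the first LRU FILE row.
sample = [
    (1000, 7, "mm_vmscan_lru_shrink_active_start"),
    (1026, 7, "mm_vmscan_lru_shrink_active"),
    (1030, 7, "mm_vmscan_lru_shrink_inactive_start"),
    (1081, 7, "mm_vmscan_lru_shrink_inactive"),
]
for cpu, name, dur in pair_durations(sample):
    print(f"cpu={cpu} {name} {dur}us")
```

Keying the pending starts by CPU keeps concurrent reclaimers on different CPUs from being paired with each other.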

Changes:
v6: * Add Reviewed-by from Steven Rostedt.
v5: * Use 'DECLARE_EVENT_CLASS(mm_vmscan_lru_shrink_start_template' to
      replace 'TRACE_EVENT(mm_vmscan_lru_shrink_inactive/active_start'
    * Add the explanation for adding new shrink lru events into 'mm: vmscan: add new event to trace shrink lru'
v4: Add Reviewed-by and Changelog to every patch.
v3: Swap the positions of 'nid' and 'freeable' to prevent a hole in the trace event.
v2: Modify trace_mm_vmscan_lru_shrink_inactive() in evict_folios() at the same time to fix a build error.

cuibixuan (2):
  mm: shrinker: add new event to trace shrink count
  mm: vmscan: add new event to trace shrink lru

 include/trace/events/vmscan.h | 80 ++++++++++++++++++++++++++++++++++-
 mm/shrinker.c                 |  4 ++
 mm/vmscan.c                   | 11 +++--
 3 files changed, 90 insertions(+), 5 deletions(-)

Bixuan Cui Jan. 15, 2024, 6:27 a.m. UTC | #1

ping~


Bixuan Cui Jan. 24, 2024, 2:41 a.m. UTC | #2

ping~


Bixuan Cui Feb. 21, 2024, 1:44 a.m. UTC | #3

ping~


Steven Rostedt Feb. 21, 2024, 2:22 a.m. UTC | #4

On Wed, 21 Feb 2024 09:44:32 +0800
Bixuan Cui <cuibixuan@vivo.com> wrote:

> ping~
> 

It's up to the memory management folks to decide on this.

-- Steve



Bixuan Cui Feb. 21, 2024, 3 a.m. UTC | #5

On 2024/2/21 10:22, Steven Rostedt wrote:
> It's up to the memory management folks to decide on this. -- Steve
Noted with thanks.

Bixuan Cui

Michal Hocko Feb. 21, 2024, 7:44 a.m. UTC | #6

On Wed 21-02-24 11:00:53, Bixuan Cui wrote:
> 
> 
> On 2024/2/21 10:22, Steven Rostedt wrote:
> > It's up to the memory management folks to decide on this. -- Steve
> Noted with thanks.

It would be really helpful to have more details on why we need those
tracepoints.

It is my understanding that you would like to have more fine-grained
numbers for the time spent in different parts of the reclaim process.
I can imagine this could be useful in some cases, but is it useful enough,
and for a wide enough variety of workloads? Is that worth dedicated static
tracepoints? Why are ad hoc dynamic tracepoints or BPF for a very
special situation not sufficient?

In other words, tell us more about the use cases and why this is
generally useful.

Thanks!

Bixuan Cui March 7, 2024, 7:40 a.m. UTC | #7

On 2024/2/21 15:44, Michal Hocko wrote:
> It would be really helpful to have more details on why we need those
> tracepoints. It is my understanding that you would like to have more
> fine-grained numbers for the time spent in different parts of the
> reclaim process. I can imagine this could be useful in some cases, but is
> it useful enough, and for a wide enough variety of workloads? Is that
> worth dedicated static tracepoints? Why are ad hoc dynamic tracepoints or
> BPF for a very special situation not sufficient? In other words, tell us
> more about the use cases and why this is generally useful.
Thank you for your reply. I'm sorry that I forgot to describe the
detailed reasons.

Memory reclamation usually occurs under high memory pressure (low free
memory) and is performed by kswapd. In embedded systems, CPU resources
are limited, and it is common for kswapd and critical processes (which
typically require a large amount of memory and trigger memory
reclamation) to compete for the CPU. This in turn affects the execution
of the critical process, increasing its execution time and causing lag,
such as dropped frames or slower startup times in mobile games.
Currently, with the help of kernel trace events or tools like Perfetto,
we can only see that kswapd is competing for the CPU and how often
memory reclamation is triggered. We do not have detailed information or
metrics about memory reclamation, such as the duration and amount of
each reclamation, or who is releasing memory (super_cache, f2fs, ext4),
etc. This makes it impossible to locate the above problems.

So far this patch has helped us solve two real performance problems
(kswapd preempting the CPU and causing game delays):
1. Increased memory allocation in the game (across different versions)
has degraded kswapd performance.
     This was found by calculating the total Reclaim(page) during the
game startup phase.

2. The adoption of a different file system in the new system version has
resulted in a slower reclamation rate.
     This was discovered through a change in OBJ_NAME, for example from
super_cache_scan to ext4_es_scan.

It is then also possible to calculate the memory reclamation rate in
order to compare the memory performance of different versions.
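
As a small worked example of such a rate, the sketch below divides the reclaimed pages by the scan time for the three SLAB rows in the cover letter. Defining the rate as pages per millisecond of scan time is an assumption made here for illustration; no formula was given in the thread.

```python
def reclaim_rate_pages_per_ms(samples):
    """samples: iterable of (scan_duration_us, reclaimed_pages) pairs."""
    samples = list(samples)
    total_us = sum(duration for duration, _ in samples)
    total_pages = sum(pages for _, pages in samples)
    if total_us == 0:
        return 0.0  # nothing was scanned, so no rate can be computed
    return total_pages * 1000 / total_us


# Scan_Dur(us) and Reclaim(page) from the SLAB table in the cover letter.
slab_rows = [(3225, 128), (8524, 1024), (0, 0)]
rate = reclaim_rate_pages_per_ms(slab_rows)
print(f"{rate:.2f} pages/ms")
```

Comparing this single number between two system versions is one way to spot the kind of regression described in point 2.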



The main reasons for adding static tracepoints are:
1. To subdivide the time spent in the shrinker->count_objects() and 
shrinker->scan_objects() functions within the do_shrink_slab function. 
Using BPF kprobe, we can only track the time spent in the do_shrink_slab 
function.
2. When tracing frequently called functions, static tracepoints (BPF
tp/tracepoint) have a lower performance impact than dynamic
tracepoints (BPF kprobe).
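
To make point 1 concrete, the sketch below splits one do_shrink_slab() call into a count phase and a scan phase from event timestamps. The names count_start and count_end are hypothetical placeholders for events around shrinker->count_objects(); mm_shrink_slab_start and mm_shrink_slab_end are the existing tracepoints. The timestamps are made up to match rows of the SLAB table.

```python
def split_shrink_slab(ts):
    """ts maps event name -> timestamp (us) for one do_shrink_slab() call.

    'count_start' and 'count_end' are hypothetical names for events placed
    around shrinker->count_objects().
    """
    count_us = ts["count_end"] - ts["count_start"]
    # do_shrink_slab() returns early when nothing is freeable, so the
    # slab start/end events may be missing for that call.
    if "mm_shrink_slab_start" in ts and "mm_shrink_slab_end" in ts:
        scan_us = ts["mm_shrink_slab_end"] - ts["mm_shrink_slab_start"]
    else:
        scan_us = 0
    return count_us, scan_us


with_scan = {
    "count_start": 100,
    "count_end": 102,
    "mm_shrink_slab_start": 103,
    "mm_shrink_slab_end": 3328,
}
count_only = {"count_start": 0, "count_end": 2367}
print(split_shrink_slab(with_scan))
print(split_shrink_slab(count_only))
```

A kprobe on do_shrink_slab() alone would only yield the sum of the two phases.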

Thanks
Bixuan Cui

Michal Hocko March 7, 2024, 9:26 a.m. UTC | #8

On Thu 07-03-24 15:40:29, Bixuan Cui wrote:
[...]
> Currently, with the help of kernel trace events or tools like Perfetto, we
> can only see that kswapd is competing for CPU and the frequency of memory
> reclamation triggers, but we do not have detailed information or metrics
> about memory reclamation, such as the duration and amount of each
> reclamation, or who is releasing memory (super_cache, f2fs, ext4), etc. This
> makes it impossible to locate the above problems.

I am not sure I agree with you here. We do provide insight into LRU and
shrinker reclaim. Why isn't that enough? In general I would advise you
to focus more on describing why the existing infrastructure is
insufficient (having examples would be really appreciated).

> Currently this patch helps us solve 2 actual performance problems (kswapd
> preempts the CPU causing game delay)
> 1. The increased memory allocation in the game (across different versions)
> has led to the degradation of kswapd.
>     This is found by calculating the total amount of Reclaim(page) during
> the game startup phase.
> 
> 2. The adoption of a different file system in the new system version has
> resulted in a slower reclamation rate.
>     This is discovered through the OBJ_NAME change. For example, OBJ_NAME
> changes from super_cache_scan to ext4_es_scan.
> 
> Subsequently, it is also possible to calculate the memory reclamation rate
> to evaluate the memory performance of different versions.

Why can't you achieve this with the existing tracing or /proc/vmstat
infrastructure?
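
For reference, a minimal sketch of the /proc/vmstat approach: take a snapshot before and after the phase of interest and diff the reclaim counters. pgscan_kswapd, pgsteal_kswapd and slabs_scanned are existing /proc/vmstat fields; the sample values are made up. The sketch assumes system-wide totals are enough, since /proc/vmstat has no per-shrinker breakdown.

```python
RECLAIM_KEYS = ("pgscan_kswapd", "pgsteal_kswapd", "slabs_scanned")


def parse_vmstat(text):
    """Parse 'name value' lines as found in /proc/vmstat."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[parts[0]] = int(parts[1])
    return counters


def vmstat_delta(before, after, keys=RECLAIM_KEYS):
    """Return how much each selected counter grew between two snapshots."""
    return {key: after.get(key, 0) - before.get(key, 0) for key in keys}


# Made-up snapshots; on a live system read open("/proc/vmstat").read() twice.
before = parse_vmstat("pgscan_kswapd 1000\npgsteal_kswapd 800\nslabs_scanned 50\n")
after = parse_vmstat("pgscan_kswapd 1600\npgsteal_kswapd 1245\nslabs_scanned 90\n")
print(vmstat_delta(before, after))
```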

> The main reasons for adding static tracepoints are:
> 1. To subdivide the time spent in the shrinker->count_objects() and
> shrinker->scan_objects() functions within the do_shrink_slab function. Using
> BPF kprobe, we can only track the time spent in the do_shrink_slab function.
> 2. When tracing frequently called functions, static tracepoints (BPF
> tp/tracepoint) have lower performance impact compared to dynamic tracepoints
> (BPF kprobe).

You can track the time a process has been preempted by other means, no? We
have context-switch tracepoints in place. Have you considered that
option?
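
One way to read this suggestion is sketched below: accumulate the time kswapd0 spends on a CPU from sched_switch records. prev_comm and next_comm are real sched_switch fields; the record layout (timestamp_us, cpu, prev_comm, next_comm) and the sample data are assumptions made for the example.

```python
def on_cpu_time_us(switches, comm="kswapd0"):
    """Sum the time `comm` spent running, from sched_switch records.

    Each record is (timestamp_us, cpu, prev_comm, next_comm); this layout
    is hypothetical and would come from whatever tool captured the trace.
    """
    running_since = {}  # cpu -> timestamp at which comm was switched in
    total = 0
    for ts_us, cpu, prev_comm, next_comm in sorted(switches):
        if prev_comm == comm and cpu in running_since:
            total += ts_us - running_since.pop(cpu)
        if next_comm == comm:
            running_since[cpu] = ts_us
    return total


sample_switches = [
    (100, 7, "game", "kswapd0"),
    (400, 7, "kswapd0", "game"),
    (900, 7, "game", "kswapd0"),
    (1000, 7, "kswapd0", "swapper/7"),
]
print(on_cpu_time_us(sample_switches))
```

The same loop with the roles of the two tasks swapped gives the time the critical process was kept off the CPU.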

Bixuan Cui March 8, 2024, 8:37 a.m. UTC | #9

On 2024/3/7 17:26, Michal Hocko wrote:
>> The main reasons for adding static tracepoints are:
>> 1. To subdivide the time spent in the shrinker->count_objects() and
>> shrinker->scan_objects() functions within the do_shrink_slab function. Using
>> BPF kprobe, we can only track the time spent in the do_shrink_slab function.
>> 2. When tracing frequently called functions, static tracepoints (BPF
>> tp/tracepoint) have lower performance impact compared to dynamic tracepoints
>> (BPF kprobe).
> You can track the time a process has been preempted by other means, no? We
> have context-switch tracepoints in place. Have you considered that
> option?
Let me think about it...

Thanks
Bixuan Cui
