diff mbox series

[1/4] mm: rearrange node_stat_item to put memcg stats at start

Message ID 20240423051826.791934-2-shakeel.butt@linux.dev (mailing list archive)
State New
Headers show
Series memcg: reduce memory consumption by memcg stats | expand

Commit Message

Shakeel Butt April 23, 2024, 5:18 a.m. UTC
At the moment the memcg stats are sized based on the size of enum
node_stat_item but not all fields in node_stat_item corresponds to memcg
stats. So, rearrage the contents of node_stat_item such that all the
memcg specific stats are at the top and then the later patches will make
sure that the memcg code will not waste space for non-memcg stats.

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 include/linux/mmzone.h | 25 +++++++++++++------------
 mm/vmstat.c            | 24 ++++++++++++------------
 2 files changed, 25 insertions(+), 24 deletions(-)

Comments

Johannes Weiner April 23, 2024, 1:58 p.m. UTC | #1
On Mon, Apr 22, 2024 at 10:18:23PM -0700, Shakeel Butt wrote:
> At the moment the memcg stats are sized based on the size of enum
> node_stat_item but not all fields in node_stat_item corresponds to memcg
> stats. So, rearrage the contents of node_stat_item such that all the
> memcg specific stats are at the top and then the later patches will make
> sure that the memcg code will not waste space for non-memcg stats.
> 
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>

This series is a great idea and the savings speak for themselves.

But rearranging and splitting vmstats along the memcg-nomemcg line
seems like an undue burden on the non-memcg codebase and interface.

- It messes with user-visible /proc/vmstat ordering, and sets things
  up to do so on an ongoing basis as stats are added to memcg.

- It also separates related stats (like the workingset ones) in
  /proc/vmstat when memcg only accounts a subset.

Would it make more sense to have a translation table inside memcg?
Like we have with memcg1_events.
Shakeel Butt April 23, 2024, 5:44 p.m. UTC | #2
On Tue, Apr 23, 2024 at 09:58:44AM -0400, Johannes Weiner wrote:
> On Mon, Apr 22, 2024 at 10:18:23PM -0700, Shakeel Butt wrote:
> > At the moment the memcg stats are sized based on the size of enum
> > node_stat_item but not all fields in node_stat_item corresponds to memcg
> > stats. So, rearrage the contents of node_stat_item such that all the
> > memcg specific stats are at the top and then the later patches will make
> > sure that the memcg code will not waste space for non-memcg stats.
> > 
> > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> 
> This series is a great idea and the savings speak for themselves.
> 
> But rearranging and splitting vmstats along the memcg-nomemcg line
> seems like an undue burden on the non-memcg codebase and interface.
> 
> - It messes with user-visible /proc/vmstat ordering, and sets things
>   up to do so on an ongoing basis as stats are added to memcg.
> 
> - It also separates related stats (like the workingset ones) in
>   /proc/vmstat when memcg only accounts a subset.
> 
> Would it make more sense to have a translation table inside memcg?
> Like we have with memcg1_events.

Thanks for taking a look. I will look into the translation table
approach. The reason I went with this approach was that I am in parallel
looking into rearranging fields of important MM structs and also enums
to improve cache locality. For example, the field NR_SWAPCACHE is always
accessed together with NR_FILE_PAGES, so it makes sense to have them on
same cacheline. So, is the rearrangement of vmstats a big NO or a little
bit here and there is fine unlike what I did with this patch?

thanks,
Shakeel
Johannes Weiner April 23, 2024, 6:30 p.m. UTC | #3
On Tue, Apr 23, 2024 at 10:44:07AM -0700, Shakeel Butt wrote:
> On Tue, Apr 23, 2024 at 09:58:44AM -0400, Johannes Weiner wrote:
> > On Mon, Apr 22, 2024 at 10:18:23PM -0700, Shakeel Butt wrote:
> > > At the moment the memcg stats are sized based on the size of enum
> > > node_stat_item but not all fields in node_stat_item corresponds to memcg
> > > stats. So, rearrage the contents of node_stat_item such that all the
> > > memcg specific stats are at the top and then the later patches will make
> > > sure that the memcg code will not waste space for non-memcg stats.
> > > 
> > > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> > 
> > This series is a great idea and the savings speak for themselves.
> > 
> > But rearranging and splitting vmstats along the memcg-nomemcg line
> > seems like an undue burden on the non-memcg codebase and interface.
> > 
> > - It messes with user-visible /proc/vmstat ordering, and sets things
> >   up to do so on an ongoing basis as stats are added to memcg.
> > 
> > - It also separates related stats (like the workingset ones) in
> >   /proc/vmstat when memcg only accounts a subset.
> > 
> > Would it make more sense to have a translation table inside memcg?
> > Like we have with memcg1_events.
> 
> Thanks for taking a look. I will look into the translation table
> approach. The reason I went with this approach was that I am in parallel
> looking into rearranging fields of important MM structs and also enums
> to improve cache locality. For example, the field NR_SWAPCACHE is always
> accessed together with NR_FILE_PAGES, so it makes sense to have them on
> same cacheline. So, is the rearrangement of vmstats a big NO or a little
> bit here and there is fine unlike what I did with this patch?

I'm curious what other folks think.

The cache optimization is a stronger argument, IMO, because it
directly benefits the users of /proc/vmstat. And it would be fairly
self contained inside the node_stat_item enum - "ordered for cache".

I was more hesitant about imposing a memcg requirement on the generic
vmstat ordering.
Chris Li April 25, 2024, 6:01 p.m. UTC | #4
On Mon, Apr 22, 2024 at 10:19 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> At the moment the memcg stats are sized based on the size of enum
> node_stat_item but not all fields in node_stat_item corresponds to memcg
> stats. So, rearrage the contents of node_stat_item such that all the
> memcg specific stats are at the top and then the later patches will make
> sure that the memcg code will not waste space for non-memcg stats.

Is this patch meant to have no functional change other than the output order?
It would be better to clarify it in the commit message if that is the case.

Chris

>
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> ---
>  include/linux/mmzone.h | 25 +++++++++++++------------
>  mm/vmstat.c            | 24 ++++++++++++------------
>  2 files changed, 25 insertions(+), 24 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 8f9c9590a42c..989ca97402c6 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -166,9 +166,6 @@ enum node_stat_item {
>         NR_UNEVICTABLE,         /*  "     "     "   "       "         */
>         NR_SLAB_RECLAIMABLE_B,
>         NR_SLAB_UNRECLAIMABLE_B,
> -       NR_ISOLATED_ANON,       /* Temporary isolated pages from anon lru */
> -       NR_ISOLATED_FILE,       /* Temporary isolated pages from file lru */
> -       WORKINGSET_NODES,
>         WORKINGSET_REFAULT_BASE,
>         WORKINGSET_REFAULT_ANON = WORKINGSET_REFAULT_BASE,
>         WORKINGSET_REFAULT_FILE,
> @@ -179,39 +176,43 @@ enum node_stat_item {
>         WORKINGSET_RESTORE_ANON = WORKINGSET_RESTORE_BASE,
>         WORKINGSET_RESTORE_FILE,
>         WORKINGSET_NODERECLAIM,
> +       NR_PAGETABLE,           /* used for pagetables */
> +       NR_SECONDARY_PAGETABLE, /* secondary pagetables, KVM & IOMMU */
> +       NR_KERNEL_STACK_KB,     /* measured in KiB */
>         NR_ANON_MAPPED, /* Mapped anonymous pages */
>         NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
>                            only modified from process context */
>         NR_FILE_PAGES,
> +#ifdef CONFIG_SWAP
> +       NR_SWAPCACHE,
> +#endif
>         NR_FILE_DIRTY,
>         NR_WRITEBACK,
> -       NR_WRITEBACK_TEMP,      /* Writeback using temporary buffers */
>         NR_SHMEM,               /* shmem pages (included tmpfs/GEM pages) */
>         NR_SHMEM_THPS,
> -       NR_SHMEM_PMDMAPPED,
>         NR_FILE_THPS,
> -       NR_FILE_PMDMAPPED,
>         NR_ANON_THPS,
> +       /* No memcg stats for the following fields. */
> +       NR_SHMEM_PMDMAPPED,
> +       NR_FILE_PMDMAPPED,
> +       NR_WRITEBACK_TEMP,      /* Writeback using temporary buffers */
>         NR_VMSCAN_WRITE,
>         NR_VMSCAN_IMMEDIATE,    /* Prioritise for reclaim when writeback ends */
> +       NR_ISOLATED_ANON,       /* Temporary isolated pages from anon lru */
> +       NR_ISOLATED_FILE,       /* Temporary isolated pages from file lru */
> +       WORKINGSET_NODES,
>         NR_DIRTIED,             /* page dirtyings since bootup */
>         NR_WRITTEN,             /* page writings since bootup */
>         NR_THROTTLED_WRITTEN,   /* NR_WRITTEN while reclaim throttled */
>         NR_KERNEL_MISC_RECLAIMABLE,     /* reclaimable non-slab kernel pages */
>         NR_FOLL_PIN_ACQUIRED,   /* via: pin_user_page(), gup flag: FOLL_PIN */
>         NR_FOLL_PIN_RELEASED,   /* pages returned via unpin_user_page() */
> -       NR_KERNEL_STACK_KB,     /* measured in KiB */
>  #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
>         NR_KERNEL_SCS_KB,       /* measured in KiB */
>  #endif
> -       NR_PAGETABLE,           /* used for pagetables */
> -       NR_SECONDARY_PAGETABLE, /* secondary pagetables, KVM & IOMMU */
>  #ifdef CONFIG_IOMMU_SUPPORT
>         NR_IOMMU_PAGES,         /* # of pages allocated by IOMMU */
>  #endif
> -#ifdef CONFIG_SWAP
> -       NR_SWAPCACHE,
> -#endif
>  #ifdef CONFIG_NUMA_BALANCING
>         PGPROMOTE_SUCCESS,      /* promote successfully */
>         PGPROMOTE_CANDIDATE,    /* candidate pages to promote */
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 8507c497218b..4eac2f6322a3 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1206,9 +1206,6 @@ const char * const vmstat_text[] = {
>         "nr_unevictable",
>         "nr_slab_reclaimable",
>         "nr_slab_unreclaimable",
> -       "nr_isolated_anon",
> -       "nr_isolated_file",
> -       "workingset_nodes",
>         "workingset_refault_anon",
>         "workingset_refault_file",
>         "workingset_activate_anon",
> @@ -1216,38 +1213,41 @@ const char * const vmstat_text[] = {
>         "workingset_restore_anon",
>         "workingset_restore_file",
>         "workingset_nodereclaim",
> +       "nr_page_table_pages",
> +       "nr_sec_page_table_pages",
> +       "nr_kernel_stack",
>         "nr_anon_pages",
>         "nr_mapped",
>         "nr_file_pages",
> +#ifdef CONFIG_SWAP
> +       "nr_swapcached",
> +#endif
>         "nr_dirty",
>         "nr_writeback",
> -       "nr_writeback_temp",
>         "nr_shmem",
>         "nr_shmem_hugepages",
> -       "nr_shmem_pmdmapped",
>         "nr_file_hugepages",
> -       "nr_file_pmdmapped",
>         "nr_anon_transparent_hugepages",
> +       "nr_shmem_pmdmapped",
> +       "nr_file_pmdmapped",
> +       "nr_writeback_temp",
>         "nr_vmscan_write",
>         "nr_vmscan_immediate_reclaim",
> +       "nr_isolated_anon",
> +       "nr_isolated_file",
> +       "workingset_nodes",
>         "nr_dirtied",
>         "nr_written",
>         "nr_throttled_written",
>         "nr_kernel_misc_reclaimable",
>         "nr_foll_pin_acquired",
>         "nr_foll_pin_released",
> -       "nr_kernel_stack",
>  #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
>         "nr_shadow_call_stack",
>  #endif
> -       "nr_page_table_pages",
> -       "nr_sec_page_table_pages",
>  #ifdef CONFIG_IOMMU_SUPPORT
>         "nr_iommu_pages",
>  #endif
> -#ifdef CONFIG_SWAP
> -       "nr_swapcached",
> -#endif
>  #ifdef CONFIG_NUMA_BALANCING
>         "pgpromote_success",
>         "pgpromote_candidate",
> --
> 2.43.0
>
>
Chris Li April 25, 2024, 10:54 p.m. UTC | #5
On Tue, Apr 23, 2024 at 11:30 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Apr 23, 2024 at 10:44:07AM -0700, Shakeel Butt wrote:
> > On Tue, Apr 23, 2024 at 09:58:44AM -0400, Johannes Weiner wrote:
> > > On Mon, Apr 22, 2024 at 10:18:23PM -0700, Shakeel Butt wrote:
> > > > At the moment the memcg stats are sized based on the size of enum
> > > > node_stat_item but not all fields in node_stat_item corresponds to memcg
> > > > stats. So, rearrage the contents of node_stat_item such that all the
> > > > memcg specific stats are at the top and then the later patches will make
> > > > sure that the memcg code will not waste space for non-memcg stats.
> > > >
> > > > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> > >
> > > This series is a great idea and the savings speak for themselves.
> > >
> > > But rearranging and splitting vmstats along the memcg-nomemcg line
> > > seems like an undue burden on the non-memcg codebase and interface.
> > >
> > > - It messes with user-visible /proc/vmstat ordering, and sets things
> > >   up to do so on an ongoing basis as stats are added to memcg.
> > >
> > > - It also separates related stats (like the workingset ones) in
> > >   /proc/vmstat when memcg only accounts a subset.
> > >
> > > Would it make more sense to have a translation table inside memcg?
> > > Like we have with memcg1_events.
> >
> > Thanks for taking a look. I will look into the translation table
> > approach. The reason I went with this approach was that I am in parallel
> > looking into rearranging fields of important MM structs and also enums
> > to improve cache locality. For example, the field NR_SWAPCACHE is always
> > accessed together with NR_FILE_PAGES, so it makes sense to have them on
> > same cacheline. So, is the rearrangement of vmstats a big NO or a little
> > bit here and there is fine unlike what I did with this patch?
>
> I'm curious what other folks think.
>
I slightly prefer not to change user visible ordering for no good reason.
It is not said the order is carved to stone. It depends on the ROI.

> The cache optimization is a stronger argument, IMO, because it
> directly benefits the users of /proc/vmstat. And it would be fairly
> self contained inside the node_stat_item enum - "ordered for cache".

Not sure how much of the cache optimization is measurable here.
I suspect it is going to be hard to measure a meaningful difference
just from the cache line order alone.

> I was more hesitant about imposing a memcg requirement on the generic
> vmstat ordering.

That is a valid reason.

Chris
diff mbox series

Patch

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8f9c9590a42c..989ca97402c6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -166,9 +166,6 @@  enum node_stat_item {
 	NR_UNEVICTABLE,		/*  "     "     "   "       "         */
 	NR_SLAB_RECLAIMABLE_B,
 	NR_SLAB_UNRECLAIMABLE_B,
-	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
-	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
-	WORKINGSET_NODES,
 	WORKINGSET_REFAULT_BASE,
 	WORKINGSET_REFAULT_ANON = WORKINGSET_REFAULT_BASE,
 	WORKINGSET_REFAULT_FILE,
@@ -179,39 +176,43 @@  enum node_stat_item {
 	WORKINGSET_RESTORE_ANON = WORKINGSET_RESTORE_BASE,
 	WORKINGSET_RESTORE_FILE,
 	WORKINGSET_NODERECLAIM,
+	NR_PAGETABLE,		/* used for pagetables */
+	NR_SECONDARY_PAGETABLE, /* secondary pagetables, KVM & IOMMU */
+	NR_KERNEL_STACK_KB,	/* measured in KiB */
 	NR_ANON_MAPPED,	/* Mapped anonymous pages */
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
 			   only modified from process context */
 	NR_FILE_PAGES,
+#ifdef CONFIG_SWAP
+	NR_SWAPCACHE,
+#endif
 	NR_FILE_DIRTY,
 	NR_WRITEBACK,
-	NR_WRITEBACK_TEMP,	/* Writeback using temporary buffers */
 	NR_SHMEM,		/* shmem pages (included tmpfs/GEM pages) */
 	NR_SHMEM_THPS,
-	NR_SHMEM_PMDMAPPED,
 	NR_FILE_THPS,
-	NR_FILE_PMDMAPPED,
 	NR_ANON_THPS,
+	/* No memcg stats for the following fields. */
+	NR_SHMEM_PMDMAPPED,
+	NR_FILE_PMDMAPPED,
+	NR_WRITEBACK_TEMP,	/* Writeback using temporary buffers */
 	NR_VMSCAN_WRITE,
 	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
+	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
+	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
+	WORKINGSET_NODES,
 	NR_DIRTIED,		/* page dirtyings since bootup */
 	NR_WRITTEN,		/* page writings since bootup */
 	NR_THROTTLED_WRITTEN,	/* NR_WRITTEN while reclaim throttled */
 	NR_KERNEL_MISC_RECLAIMABLE,	/* reclaimable non-slab kernel pages */
 	NR_FOLL_PIN_ACQUIRED,	/* via: pin_user_page(), gup flag: FOLL_PIN */
 	NR_FOLL_PIN_RELEASED,	/* pages returned via unpin_user_page() */
-	NR_KERNEL_STACK_KB,	/* measured in KiB */
 #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
 	NR_KERNEL_SCS_KB,	/* measured in KiB */
 #endif
-	NR_PAGETABLE,		/* used for pagetables */
-	NR_SECONDARY_PAGETABLE, /* secondary pagetables, KVM & IOMMU */
 #ifdef CONFIG_IOMMU_SUPPORT
 	NR_IOMMU_PAGES,		/* # of pages allocated by IOMMU */
 #endif
-#ifdef CONFIG_SWAP
-	NR_SWAPCACHE,
-#endif
 #ifdef CONFIG_NUMA_BALANCING
 	PGPROMOTE_SUCCESS,	/* promote successfully */
 	PGPROMOTE_CANDIDATE,	/* candidate pages to promote */
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 8507c497218b..4eac2f6322a3 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1206,9 +1206,6 @@  const char * const vmstat_text[] = {
 	"nr_unevictable",
 	"nr_slab_reclaimable",
 	"nr_slab_unreclaimable",
-	"nr_isolated_anon",
-	"nr_isolated_file",
-	"workingset_nodes",
 	"workingset_refault_anon",
 	"workingset_refault_file",
 	"workingset_activate_anon",
@@ -1216,38 +1213,41 @@  const char * const vmstat_text[] = {
 	"workingset_restore_anon",
 	"workingset_restore_file",
 	"workingset_nodereclaim",
+	"nr_page_table_pages",
+	"nr_sec_page_table_pages",
+	"nr_kernel_stack",
 	"nr_anon_pages",
 	"nr_mapped",
 	"nr_file_pages",
+#ifdef CONFIG_SWAP
+	"nr_swapcached",
+#endif
 	"nr_dirty",
 	"nr_writeback",
-	"nr_writeback_temp",
 	"nr_shmem",
 	"nr_shmem_hugepages",
-	"nr_shmem_pmdmapped",
 	"nr_file_hugepages",
-	"nr_file_pmdmapped",
 	"nr_anon_transparent_hugepages",
+	"nr_shmem_pmdmapped",
+	"nr_file_pmdmapped",
+	"nr_writeback_temp",
 	"nr_vmscan_write",
 	"nr_vmscan_immediate_reclaim",
+	"nr_isolated_anon",
+	"nr_isolated_file",
+	"workingset_nodes",
 	"nr_dirtied",
 	"nr_written",
 	"nr_throttled_written",
 	"nr_kernel_misc_reclaimable",
 	"nr_foll_pin_acquired",
 	"nr_foll_pin_released",
-	"nr_kernel_stack",
 #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
 	"nr_shadow_call_stack",
 #endif
-	"nr_page_table_pages",
-	"nr_sec_page_table_pages",
 #ifdef CONFIG_IOMMU_SUPPORT
 	"nr_iommu_pages",
 #endif
-#ifdef CONFIG_SWAP
-	"nr_swapcached",
-#endif
 #ifdef CONFIG_NUMA_BALANCING
 	"pgpromote_success",
 	"pgpromote_candidate",