Message ID | 20240423051826.791934-2-shakeel.butt@linux.dev
---|---
State | New
Series | memcg: reduce memory consumption by memcg stats
On Mon, Apr 22, 2024 at 10:18:23PM -0700, Shakeel Butt wrote:
> At the moment the memcg stats are sized based on the size of enum
> node_stat_item but not all fields in node_stat_item correspond to memcg
> stats. So, rearrange the contents of node_stat_item such that all the
> memcg specific stats are at the top and then the later patches will
> make sure that the memcg code will not waste space for non-memcg stats.
>
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>

This series is a great idea and the savings speak for themselves.

But rearranging and splitting vmstats along the memcg-nomemcg line
seems like an undue burden on the non-memcg codebase and interface.

- It messes with user-visible /proc/vmstat ordering, and sets things
  up to do so on an ongoing basis as stats are added to memcg.

- It also separates related stats (like the workingset ones) in
  /proc/vmstat when memcg only accounts a subset.

Would it make more sense to have a translation table inside memcg?
Like we have with memcg1_events.
On Tue, Apr 23, 2024 at 09:58:44AM -0400, Johannes Weiner wrote:
> This series is a great idea and the savings speak for themselves.
>
> But rearranging and splitting vmstats along the memcg-nomemcg line
> seems like an undue burden on the non-memcg codebase and interface.
>
> - It messes with user-visible /proc/vmstat ordering, and sets things
>   up to do so on an ongoing basis as stats are added to memcg.
>
> - It also separates related stats (like the workingset ones) in
>   /proc/vmstat when memcg only accounts a subset.
>
> Would it make more sense to have a translation table inside memcg?
> Like we have with memcg1_events.

Thanks for taking a look. I will look into the translation table
approach. The reason I went with this approach was that I am in
parallel looking into rearranging fields of important MM structs and
also enums to improve cache locality. For example, the field
NR_SWAPCACHE is always accessed together with NR_FILE_PAGES, so it
makes sense to have them on the same cacheline. So, is the
rearrangement of vmstats a big NO, or is a little bit here and there
fine, unlike what I did with this patch?

thanks,
Shakeel
On Tue, Apr 23, 2024 at 10:44:07AM -0700, Shakeel Butt wrote:
> Thanks for taking a look. I will look into the translation table
> approach. The reason I went with this approach was that I am in
> parallel looking into rearranging fields of important MM structs and
> also enums to improve cache locality. For example, the field
> NR_SWAPCACHE is always accessed together with NR_FILE_PAGES, so it
> makes sense to have them on the same cacheline. So, is the
> rearrangement of vmstats a big NO, or is a little bit here and there
> fine, unlike what I did with this patch?

I'm curious what other folks think.

The cache optimization is a stronger argument, IMO, because it
directly benefits the users of /proc/vmstat. And it would be fairly
self-contained inside the node_stat_item enum - "ordered for cache".

I was more hesitant about imposing a memcg requirement on the generic
vmstat ordering.
On Mon, Apr 22, 2024 at 10:19 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> At the moment the memcg stats are sized based on the size of enum
> node_stat_item but not all fields in node_stat_item correspond to memcg
> stats. So, rearrange the contents of node_stat_item such that all the
> memcg specific stats are at the top and then the later patches will
> make sure that the memcg code will not waste space for non-memcg stats.
>
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>

Is this patch meant to have no functional change other than the output
order? It would be better to clarify that in the commit message if it
is the case.

Chris
On Tue, Apr 23, 2024 at 11:30 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Apr 23, 2024 at 10:44:07AM -0700, Shakeel Butt wrote:
> > So, is the rearrangement of vmstats a big NO, or is a little bit
> > here and there fine, unlike what I did with this patch?
>
> I'm curious what other folks think.

I slightly prefer not to change user-visible ordering for no good
reason. It is not said the order is carved in stone. It depends on the
ROI.

> The cache optimization is a stronger argument, IMO, because it
> directly benefits the users of /proc/vmstat. And it would be fairly
> self-contained inside the node_stat_item enum - "ordered for cache".

Not sure how much of the cache optimization is measurable here. I
suspect it is going to be hard to measure a meaningful difference just
from the cache line order alone.

> I was more hesitant about imposing a memcg requirement on the generic
> vmstat ordering.

That is a valid reason.

Chris
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8f9c9590a42c..989ca97402c6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -166,9 +166,6 @@ enum node_stat_item {
 	NR_UNEVICTABLE,		/*  "     "     "   "       "         */
 	NR_SLAB_RECLAIMABLE_B,
 	NR_SLAB_UNRECLAIMABLE_B,
-	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
-	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
-	WORKINGSET_NODES,
 	WORKINGSET_REFAULT_BASE,
 	WORKINGSET_REFAULT_ANON = WORKINGSET_REFAULT_BASE,
 	WORKINGSET_REFAULT_FILE,
@@ -179,39 +176,43 @@ enum node_stat_item {
 	WORKINGSET_RESTORE_ANON = WORKINGSET_RESTORE_BASE,
 	WORKINGSET_RESTORE_FILE,
 	WORKINGSET_NODERECLAIM,
+	NR_PAGETABLE,		/* used for pagetables */
+	NR_SECONDARY_PAGETABLE,	/* secondary pagetables, KVM & IOMMU */
+	NR_KERNEL_STACK_KB,	/* measured in KiB */
 	NR_ANON_MAPPED,	/* Mapped anonymous pages */
 	NR_FILE_MAPPED,	/* pagecache pages mapped into pagetables.
			   only modified from process context */
 	NR_FILE_PAGES,
+#ifdef CONFIG_SWAP
+	NR_SWAPCACHE,
+#endif
 	NR_FILE_DIRTY,
 	NR_WRITEBACK,
-	NR_WRITEBACK_TEMP,	/* Writeback using temporary buffers */
 	NR_SHMEM,		/* shmem pages (included tmpfs/GEM pages) */
 	NR_SHMEM_THPS,
-	NR_SHMEM_PMDMAPPED,
 	NR_FILE_THPS,
-	NR_FILE_PMDMAPPED,
 	NR_ANON_THPS,
+	/* No memcg stats for the following fields. */
+	NR_SHMEM_PMDMAPPED,
+	NR_FILE_PMDMAPPED,
+	NR_WRITEBACK_TEMP,	/* Writeback using temporary buffers */
 	NR_VMSCAN_WRITE,
 	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
+	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
+	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
+	WORKINGSET_NODES,
 	NR_DIRTIED,		/* page dirtyings since bootup */
 	NR_WRITTEN,		/* page writings since bootup */
 	NR_THROTTLED_WRITTEN,	/* NR_WRITTEN while reclaim throttled */
 	NR_KERNEL_MISC_RECLAIMABLE,	/* reclaimable non-slab kernel pages */
 	NR_FOLL_PIN_ACQUIRED,	/* via: pin_user_page(), gup flag: FOLL_PIN */
 	NR_FOLL_PIN_RELEASED,	/* pages returned via unpin_user_page() */
-	NR_KERNEL_STACK_KB,	/* measured in KiB */
 #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
 	NR_KERNEL_SCS_KB,	/* measured in KiB */
 #endif
-	NR_PAGETABLE,		/* used for pagetables */
-	NR_SECONDARY_PAGETABLE,	/* secondary pagetables, KVM & IOMMU */
 #ifdef CONFIG_IOMMU_SUPPORT
 	NR_IOMMU_PAGES,		/* # of pages allocated by IOMMU */
 #endif
-#ifdef CONFIG_SWAP
-	NR_SWAPCACHE,
-#endif
 #ifdef CONFIG_NUMA_BALANCING
 	PGPROMOTE_SUCCESS,	/* promote successfully */
 	PGPROMOTE_CANDIDATE,	/* candidate pages to promote */
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 8507c497218b..4eac2f6322a3 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1206,9 +1206,6 @@ const char * const vmstat_text[] = {
 	"nr_unevictable",
 	"nr_slab_reclaimable",
 	"nr_slab_unreclaimable",
-	"nr_isolated_anon",
-	"nr_isolated_file",
-	"workingset_nodes",
 	"workingset_refault_anon",
 	"workingset_refault_file",
 	"workingset_activate_anon",
@@ -1216,38 +1213,41 @@ const char * const vmstat_text[] = {
 	"workingset_restore_anon",
 	"workingset_restore_file",
 	"workingset_nodereclaim",
+	"nr_page_table_pages",
+	"nr_sec_page_table_pages",
+	"nr_kernel_stack",
 	"nr_anon_pages",
 	"nr_mapped",
 	"nr_file_pages",
+#ifdef CONFIG_SWAP
+	"nr_swapcached",
+#endif
 	"nr_dirty",
 	"nr_writeback",
-	"nr_writeback_temp",
 	"nr_shmem",
 	"nr_shmem_hugepages",
-	"nr_shmem_pmdmapped",
 	"nr_file_hugepages",
-	"nr_file_pmdmapped",
 	"nr_anon_transparent_hugepages",
+	"nr_shmem_pmdmapped",
+	"nr_file_pmdmapped",
+	"nr_writeback_temp",
 	"nr_vmscan_write",
 	"nr_vmscan_immediate_reclaim",
+	"nr_isolated_anon",
+	"nr_isolated_file",
+	"workingset_nodes",
 	"nr_dirtied",
 	"nr_written",
 	"nr_throttled_written",
 	"nr_kernel_misc_reclaimable",
 	"nr_foll_pin_acquired",
 	"nr_foll_pin_released",
-	"nr_kernel_stack",
 #if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
 	"nr_shadow_call_stack",
 #endif
-	"nr_page_table_pages",
-	"nr_sec_page_table_pages",
 #ifdef CONFIG_IOMMU_SUPPORT
 	"nr_iommu_pages",
 #endif
-#ifdef CONFIG_SWAP
-	"nr_swapcached",
-#endif
 #ifdef CONFIG_NUMA_BALANCING
 	"pgpromote_success",
 	"pgpromote_candidate",
At the moment the memcg stats are sized based on the size of enum
node_stat_item but not all fields in node_stat_item correspond to memcg
stats. So, rearrange the contents of node_stat_item such that all the
memcg specific stats are at the top and then the later patches will
make sure that the memcg code will not waste space for non-memcg stats.

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 include/linux/mmzone.h | 25 +++++++++++++------------
 mm/vmstat.c            | 24 ++++++++++++------------
 2 files changed, 25 insertions(+), 24 deletions(-)