Message ID | 20231101230816.1459373-2-souravpanda@google.com (mailing list archive) |
---|---|
State | New |
Series | mm: report per-page metadata information |
On Wed, Nov 1, 2023 at 4:08 PM Sourav Panda <souravpanda@google.com> wrote:
>
> Adds a new per-node PageMetadata field to
> /sys/devices/system/node/nodeN/meminfo
> and a global PageMetadata field to /proc/meminfo. This information can
> be used by users to see how much memory is being used by per-page
> metadata, which can vary depending on build configuration, machine
> architecture, and system use.
>
> Per-page metadata is the amount of memory that Linux needs in order to
> manage memory at the page granularity. The majority of such memory is
> used by "struct page" and "page_ext" data structures. In contrast to
> most other memory consumption statistics, per-page metadata might not
> be included in MemTotal. For example, MemTotal does not include memblock
> allocations but includes buddy allocations. While on the other hand,
> per-page metadata would include both memblock and buddy allocations.

I expect that the new PageMetadata field in meminfo should help break
down the memory usage of a system (MemUsed, or MemTotal - MemFree),
similar to the other fields in meminfo.

However, given that PageMetadata includes per-page metadata allocated
from not only the buddy allocator, but also the memblock allocations,
and MemTotal doesn't include memory reserved by memblock allocations,
I wonder how a user can actually use this new PageMetadata to break
down the system memory usage. BTW, it is not robust to assume that
all memblock allocations are for per-page metadata.

Here are some ideas to address this problem:

- Only report the buddy allocations for per-page metadata in PageMetadata, or
- Report per-page metadata in two separate fields in meminfo, one for
  buddy allocations and another for memblock allocations, or
- Change MemTotal/MemUsed to include the memblock reserved memory as well.

Wei Xu

> This memory depends on build configurations, machine architectures, and
> the way system is used:
>
> Build configuration may include extra fields into "struct page",
> and enable / disable "page_ext"
> Machine architecture defines base page sizes. For example 4K x86,
> 8K SPARC, 64K ARM64 (optionally), etc. The per-page metadata
> overhead is smaller on machines with larger page sizes.
> System use can change per-page overhead by using vmemmap
> optimizations with hugetlb pages, and emulated pmem devdax pages.
> Also, boot parameters can determine whether page_ext is needed
> to be allocated. This memory can be part of MemTotal or be outside
> MemTotal depending on whether the memory was hot-plugged, booted with,
> or hugetlb memory was returned back to the system.
>
> Suggested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> Signed-off-by: Sourav Panda <souravpanda@google.com>
> ---
>  Documentation/filesystems/proc.rst |  3 +++
>  drivers/base/node.c                |  2 ++
>  fs/proc/meminfo.c                  |  7 +++++++
>  include/linux/mmzone.h             |  3 +++
>  include/linux/vmstat.h             |  4 ++++
>  mm/hugetlb.c                       | 11 ++++++++--
>  mm/hugetlb_vmemmap.c               | 12 +++++++++--
>  mm/mm_init.c                       |  3 +++
>  mm/page_alloc.c                    |  1 +
>  mm/page_ext.c                      | 32 +++++++++++++++++++++---------
>  mm/sparse-vmemmap.c                |  3 +++
>  mm/sparse.c                        |  7 ++++++-
>  mm/vmstat.c                        | 24 ++++++++++++++++++++++
>  13 files changed, 98 insertions(+), 14 deletions(-)
>
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index 2b59cff8be17..c121f2ef9432 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -987,6 +987,7 @@ Example output. You may not have all of these fields.
>      AnonPages: 4654780 kB
>      Mapped: 266244 kB
>      Shmem: 9976 kB
> +    PageMetadata: 513419 kB
>      KReclaimable: 517708 kB
>      Slab: 660044 kB
>      SReclaimable: 517708 kB
> @@ -1089,6 +1090,8 @@ Mapped
>                files which have been mmapped, such as libraries
>  Shmem
>                Total memory used by shared memory (shmem) and tmpfs
> +PageMetadata
> +              Memory used for per-page metadata
>  KReclaimable
>                Kernel allocations that the kernel will attempt to reclaim
>                under memory pressure. Includes SReclaimable (below), and other
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 493d533f8375..da728542265f 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -428,6 +428,7 @@ static ssize_t node_read_meminfo(struct device *dev,
>  			     "Node %d Mapped: %8lu kB\n"
>  			     "Node %d AnonPages: %8lu kB\n"
>  			     "Node %d Shmem: %8lu kB\n"
> +			     "Node %d PageMetadata: %8lu kB\n"
>  			     "Node %d KernelStack: %8lu kB\n"
>  #ifdef CONFIG_SHADOW_CALL_STACK
>  			     "Node %d ShadowCallStack:%8lu kB\n"
> @@ -458,6 +459,7 @@ static ssize_t node_read_meminfo(struct device *dev,
>  			     nid, K(node_page_state(pgdat, NR_FILE_MAPPED)),
>  			     nid, K(node_page_state(pgdat, NR_ANON_MAPPED)),
>  			     nid, K(i.sharedram),
> +			     nid, K(node_page_state(pgdat, NR_PAGE_METADATA)),
>  			     nid, node_page_state(pgdat, NR_KERNEL_STACK_KB),
>  #ifdef CONFIG_SHADOW_CALL_STACK
>  			     nid, node_page_state(pgdat, NR_KERNEL_SCS_KB),
> diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> index 45af9a989d40..f141bb2a550d 100644
> --- a/fs/proc/meminfo.c
> +++ b/fs/proc/meminfo.c
> @@ -39,7 +39,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
>  	long available;
>  	unsigned long pages[NR_LRU_LISTS];
>  	unsigned long sreclaimable, sunreclaim;
> +	unsigned long nr_page_metadata;
>  	int lru;
> +	int nid;
>
>  	si_meminfo(&i);
>  	si_swapinfo(&i);
> @@ -57,6 +59,10 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
>  	sreclaimable = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B);
>  	sunreclaim = global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B);
>
> +	nr_page_metadata = 0;
> +	for_each_online_node(nid)
> +		nr_page_metadata += node_page_state(NODE_DATA(nid), NR_PAGE_METADATA);
> +
>  	show_val_kb(m, "MemTotal: ", i.totalram);
>  	show_val_kb(m, "MemFree: ", i.freeram);
>  	show_val_kb(m, "MemAvailable: ", available);
> @@ -104,6 +110,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
>  	show_val_kb(m, "Mapped: ",
>  		    global_node_page_state(NR_FILE_MAPPED));
>  	show_val_kb(m, "Shmem: ", i.sharedram);
> +	show_val_kb(m, "PageMetadata: ", nr_page_metadata);
>  	show_val_kb(m, "KReclaimable: ", sreclaimable +
>  		    global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE));
>  	show_val_kb(m, "Slab: ", sreclaimable + sunreclaim);
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 4106fbc5b4b3..dda1ad522324 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -207,6 +207,9 @@ enum node_stat_item {
>  	PGPROMOTE_SUCCESS,	/* promote successfully */
>  	PGPROMOTE_CANDIDATE,	/* candidate pages to promote */
>  #endif
> +	NR_PAGE_METADATA,	/* Page metadata size (struct page and page_ext)
> +				 * in pages
> +				 */
>  	NR_VM_NODE_STAT_ITEMS
>  };
>
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index fed855bae6d8..af096a881f03 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -656,4 +656,8 @@ static inline void lruvec_stat_sub_folio(struct folio *folio,
>  {
>  	lruvec_stat_mod_folio(folio, idx, -folio_nr_pages(folio));
>  }
> +
> +void __init mod_node_early_perpage_metadata(int nid, long delta);
> +void __init store_early_perpage_metadata(void);
> +
>  #endif /* _LINUX_VMSTAT_H */
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 1301ba7b2c9a..1778e02ed583 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1790,6 +1790,9 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
>  		destroy_compound_gigantic_folio(folio, huge_page_order(h));
>  		free_gigantic_folio(folio, huge_page_order(h));
>  	} else {
> +#ifndef CONFIG_SPARSEMEM_VMEMMAP
> +		__node_stat_sub_folio(folio, NR_PAGE_METADATA);
> +#endif
>  		__free_pages(&folio->page, huge_page_order(h));
>  	}
>  }
> @@ -2125,6 +2128,7 @@ static struct folio *alloc_buddy_hugetlb_folio(struct hstate *h,
>  	struct page *page;
>  	bool alloc_try_hard = true;
>  	bool retry = true;
> +	struct folio *folio;
>
>  	/*
>  	 * By default we always try hard to allocate the page with
> @@ -2175,9 +2179,12 @@ static struct folio *alloc_buddy_hugetlb_folio(struct hstate *h,
>  		__count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
>  		return NULL;
>  	}
> -
> +	folio = page_folio(page);
> +#ifndef CONFIG_SPARSEMEM_VMEMMAP
> +	__node_stat_add_folio(folio, NR_PAGE_METADATA);
> +#endif
>  	__count_vm_event(HTLB_BUDDY_PGALLOC);
> -	return page_folio(page);
> +	return folio;
>  }
>
>  /*
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 4b9734777f69..f7ca5d4dd583 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -214,6 +214,7 @@ static inline void free_vmemmap_page(struct page *page)
>  		free_bootmem_page(page);
>  	else
>  		__free_page(page);
> +	__mod_node_page_state(page_pgdat(page), NR_PAGE_METADATA, -1);
>  }
>
>  /* Free a list of the vmemmap pages */
> @@ -335,6 +336,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
>  		copy_page(page_to_virt(walk.reuse_page),
>  			  (void *)walk.reuse_addr);
>  		list_add(&walk.reuse_page->lru, &vmemmap_pages);
> +		__mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA, 1);
>  	}
>
>  	/*
> @@ -384,14 +386,20 @@ static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
>  	unsigned long nr_pages = (end - start) >> PAGE_SHIFT;
>  	int nid = page_to_nid((struct page *)start);
>  	struct page *page, *next;
> +	int i;
>
> -	while (nr_pages--) {
> +	for (i = 0; i < nr_pages; i++) {
>  		page = alloc_pages_node(nid, gfp_mask, 0);
> -		if (!page)
> +		if (!page) {
> +			__mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA,
> +					      i);
>  			goto out;
> +		}
>  		list_add_tail(&page->lru, list);
>  	}
>
> +	__mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA, nr_pages);
> +
>  	return 0;
>  out:
>  	list_for_each_entry_safe(page, next, list, lru)
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 50f2f34745af..6997bf00945b 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -26,6 +26,7 @@
>  #include <linux/pgtable.h>
>  #include <linux/swap.h>
>  #include <linux/cma.h>
> +#include <linux/vmstat.h>
>  #include "internal.h"
>  #include "slab.h"
>  #include "shuffle.h"
> @@ -1656,6 +1657,8 @@ static void __init alloc_node_mem_map(struct pglist_data *pgdat)
>  			panic("Failed to allocate %ld bytes for node %d memory map\n",
>  			      size, pgdat->node_id);
>  		pgdat->node_mem_map = map + offset;
> +		mod_node_early_perpage_metadata(pgdat->node_id,
> +						DIV_ROUND_UP(size, PAGE_SIZE));
>  	}
>  	pr_debug("%s: node %d, pgdat %08lx, node_mem_map %08lx\n",
>  		 __func__, pgdat->node_id, (unsigned long)pgdat,
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 85741403948f..522dc0c52610 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5443,6 +5443,7 @@ void __init setup_per_cpu_pageset(void)
>  	for_each_online_pgdat(pgdat)
>  		pgdat->per_cpu_nodestats =
>  			alloc_percpu(struct per_cpu_nodestat);
> +	store_early_perpage_metadata();
>  }
>
>  __meminit void zone_pcp_init(struct zone *zone)
> diff --git a/mm/page_ext.c b/mm/page_ext.c
> index 4548fcc66d74..d8d6db9c3d75 100644
> --- a/mm/page_ext.c
> +++ b/mm/page_ext.c
> @@ -201,6 +201,8 @@ static int __init alloc_node_page_ext(int nid)
>  		return -ENOMEM;
>  	NODE_DATA(nid)->node_page_ext = base;
>  	total_usage += table_size;
> +	__mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA,
> +			      DIV_ROUND_UP(table_size, PAGE_SIZE));
>  	return 0;
>  }
>
> @@ -255,12 +257,15 @@ static void *__meminit alloc_page_ext(size_t size, int nid)
>  	void *addr = NULL;
>
>  	addr = alloc_pages_exact_nid(nid, size, flags);
> -	if (addr) {
> +	if (addr)
>  		kmemleak_alloc(addr, size, 1, flags);
> -		return addr;
> -	}
> +	else
> +		addr = vzalloc_node(size, nid);
>
> -	addr = vzalloc_node(size, nid);
> +	if (addr) {
> +		mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA,
> +				    DIV_ROUND_UP(size, PAGE_SIZE));
> +	}
>
>  	return addr;
>  }
> @@ -303,18 +308,27 @@ static int __meminit init_section_page_ext(unsigned long pfn, int nid)
>
>  static void free_page_ext(void *addr)
>  {
> +	size_t table_size;
> +	struct page *page;
> +	struct pglist_data *pgdat;
> +
> +	table_size = page_ext_size * PAGES_PER_SECTION;
> +
>  	if (is_vmalloc_addr(addr)) {
> +		page = vmalloc_to_page(addr);
> +		pgdat = page_pgdat(page);
>  		vfree(addr);
>  	} else {
> -		struct page *page = virt_to_page(addr);
> -		size_t table_size;
> -
> -		table_size = page_ext_size * PAGES_PER_SECTION;
> -
> +		page = virt_to_page(addr);
> +		pgdat = page_pgdat(page);
>  		BUG_ON(PageReserved(page));
>  		kmemleak_free(addr);
>  		free_pages_exact(addr, table_size);
>  	}
> +
> +	__mod_node_page_state(pgdat, NR_PAGE_METADATA,
> +			      -1L * (DIV_ROUND_UP(table_size, PAGE_SIZE)));
> +
>  }
>
>  static void __free_page_ext(unsigned long pfn)
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index a2cbe44c48e1..2bc67b2c2aa2 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -469,5 +469,8 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
>  	if (r < 0)
>  		return NULL;
>
> +	__mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA,
> +			      DIV_ROUND_UP(end - start, PAGE_SIZE));
> +
>  	return pfn_to_page(pfn);
>  }
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 77d91e565045..7f67b5486cd1 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -14,7 +14,7 @@
>  #include <linux/swap.h>
>  #include <linux/swapops.h>
>  #include <linux/bootmem_info.h>
> -
> +#include <linux/vmstat.h>
>  #include "internal.h"
>  #include <asm/dma.h>
>
> @@ -465,6 +465,9 @@ static void __init sparse_buffer_init(unsigned long size, int nid)
>  	 */
>  	sparsemap_buf = memmap_alloc(size, section_map_size(), addr, nid, true);
>  	sparsemap_buf_end = sparsemap_buf + size;
> +#ifndef CONFIG_SPARSEMEM_VMEMMAP
> +	mod_node_early_perpage_metadata(nid, DIV_ROUND_UP(size, PAGE_SIZE));
> +#endif
>  }
>
>  static void __init sparse_buffer_fini(void)
> @@ -641,6 +644,8 @@ static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages,
>  	unsigned long start = (unsigned long) pfn_to_page(pfn);
>  	unsigned long end = start + nr_pages * sizeof(struct page);
>
> +	__mod_node_page_state(page_pgdat(pfn_to_page(pfn)), NR_PAGE_METADATA,
> +			      -1L * (DIV_ROUND_UP(end - start, PAGE_SIZE)));
>  	vmemmap_free(start, end, altmap);
>  }
>  static void free_map_bootmem(struct page *memmap)
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 00e81e99c6ee..070d2b3d2bcc 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1245,6 +1245,7 @@ const char * const vmstat_text[] = {
>  	"pgpromote_success",
>  	"pgpromote_candidate",
>  #endif
> +	"nr_page_metadata",
>
>  	/* enum writeback_stat_item counters */
>  	"nr_dirty_threshold",
> @@ -2274,4 +2275,27 @@ static int __init extfrag_debug_init(void)
>  }
>
>  module_init(extfrag_debug_init);
> +
>  #endif
> +
> +/*
> + * Page metadata size (struct page and page_ext) in pages
> + */
> +static unsigned long early_perpage_metadata[MAX_NUMNODES] __initdata;
> +
> +void __init mod_node_early_perpage_metadata(int nid, long delta)
> +{
> +	early_perpage_metadata[nid] += delta;
> +}
> +
> +void __init store_early_perpage_metadata(void)
> +{
> +	int nid;
> +	struct pglist_data *pgdat;
> +
> +	for_each_online_pgdat(pgdat) {
> +		nid = pgdat->node_id;
> +		__mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA,
> +				      early_perpage_metadata[nid]);
> +	}
> +}
> --
> 2.42.0.820.g83a721a137-goog
>
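The commit message's remark that per-page metadata overhead shrinks as the base page size grows can be checked with a little arithmetic. The following is only a sketch: `memmap_bytes` is a hypothetical helper, it counts "struct page" alone (no page_ext), and the 64-byte structure size used in the note below is an assumption that depends on the build configuration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes of "struct page" needed to describe mem_bytes of memory.
 * struct_page_size is supplied by the caller because it varies with
 * the kernel configuration; 64 bytes is only an assumed typical value.
 */
uint64_t memmap_bytes(uint64_t mem_bytes, uint64_t page_size,
		      uint64_t struct_page_size)
{
	assert(page_size > 0);
	return (mem_bytes / page_size) * struct_page_size;
}
```

Under that 64-byte assumption, 1 GiB of memory needs 16 MiB of "struct page" with 4K pages, 8 MiB with 8K pages, and 1 MiB with 64K pages.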
On Wed, Nov 1, 2023 at 7:40 PM Wei Xu <weixugc@google.com> wrote:
>
> On Wed, Nov 1, 2023 at 4:08 PM Sourav Panda <souravpanda@google.com> wrote:
> >
> > Adds a new per-node PageMetadata field to
> > /sys/devices/system/node/nodeN/meminfo
> > and a global PageMetadata field to /proc/meminfo. This information can
> > be used by users to see how much memory is being used by per-page
> > metadata, which can vary depending on build configuration, machine
> > architecture, and system use.
> >
> > Per-page metadata is the amount of memory that Linux needs in order to
> > manage memory at the page granularity. The majority of such memory is
> > used by "struct page" and "page_ext" data structures. In contrast to
> > most other memory consumption statistics, per-page metadata might not
> > be included in MemTotal. For example, MemTotal does not include memblock
> > allocations but includes buddy allocations. While on the other hand,
> > per-page metadata would include both memblock and buddy allocations.
>
> I expect that the new PageMetadata field in meminfo should help break
> down the memory usage of a system (MemUsed, or MemTotal - MemFree),
> similar to the other fields in meminfo.
>
> However, given that PageMetadata includes per-page metadata allocated
> from not only the buddy allocator, but also the memblock allocations,
> and MemTotal doesn't include memory reserved by memblock allocations,
> I wonder how a user can actually use this new PageMetadata to break
> down the system memory usage. BTW, it is not robust to assume that
> all memblock allocations are for per-page metadata.
>

Hi Wei,

> Here are some ideas to address this problem:
>
> - Only report the buddy allocations for per-page metadata in PageMetadata, or

Making PageMetadata contain only some of the per-page memory rather
than all of it is confusing. Right after boot it would always be 0,
because all struct pages come from memblock during boot, yet we know
we have allocated tons of memory for struct pages.

> - Report per-page metadata in two separate fields in meminfo, one for
> buddy allocations and another for memblock allocations, or

This is also going to be confusing for users. Which allocator was
used to allocate struct pages is really an implementation detail, and
having two trackers is not going to improve things.

> - Change MemTotal/MemUsed to include the memblock reserved memory as well.

I think this is the right solution for an existing bug: MemTotal
should really include memblock reserved memory.

Pasha

>
> Wei Xu
>
> > This memory depends on build configurations, machine architectures, and
> > the way system is used:
> >
> > Build configuration may include extra fields into "struct page",
> > and enable / disable "page_ext"
> > Machine architecture defines base page sizes. For example 4K x86,
> > 8K SPARC, 64K ARM64 (optionally), etc. The per-page metadata
> > overhead is smaller on machines with larger page sizes.
> > System use can change per-page overhead by using vmemmap
> > optimizations with hugetlb pages, and emulated pmem devdax pages.
> > Also, boot parameters can determine whether page_ext is needed
> > to be allocated. This memory can be part of MemTotal or be outside
> > MemTotal depending on whether the memory was hot-plugged, booted with,
> > or hugetlb memory was returned back to the system.
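The accounting question raised in this exchange (how PageMetadata relates to MemTotal - MemFree) is easier to follow with a concrete reader of meminfo-style text. This is a minimal userspace sketch, not kernel code: `meminfo_field_kb` and `meminfo_used_kb` are hypothetical helpers, and a PageMetadata line would only exist on a kernel carrying the proposed patch.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return the value in kB of the field named key (for example "MemTotal")
 * in meminfo-formatted text, or -1 if the field is absent.
 */
long meminfo_field_kb(const char *text, const char *key)
{
	size_t klen = strlen(key);
	const char *line = text;

	while (line && *line) {
		long val;

		/* Match "key:" only at the start of a line. */
		if (strncmp(line, key, klen) == 0 && line[klen] == ':' &&
		    sscanf(line + klen + 1, "%ld", &val) == 1)
			return val;

		/* Advance to the character after the next newline, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/* Memory in use as seen through meminfo: MemTotal - MemFree, or -1. */
long meminfo_used_kb(const char *text)
{
	long total = meminfo_field_kb(text, "MemTotal");
	long free_kb = meminfo_field_kb(text, "MemFree");

	if (total < 0 || free_kb < 0)
		return -1;
	return total - free_kb;
}
```

Because the proposed counter also covers memblock allocations, which MemTotal excludes, the value returned for "PageMetadata" is not guaranteed to be a component of `meminfo_used_kb()`. That mismatch is the concern discussed above.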
On Wed, Nov 01, 2023 at 04:08:16PM -0700, Sourav Panda wrote:
> Adds a new per-node PageMetadata field to
> /sys/devices/system/node/nodeN/meminfo

No, this file is already an abuse of sysfs and we need to get rid of it
(it has multiple values in one file.) Please do not add to the
nightmare by adding new values.

Also, even if you did want to do this, you didn't document it properly
in Documentation/ABI/ :(

thanks,

greg k-h
On Wed, Nov 01, 2023 at 04:08:16PM -0700, Sourav Panda wrote:
> +void __init mod_node_early_perpage_metadata(int nid, long delta);
> +void __init store_early_perpage_metadata(void);

Section markers are useless with prototypes.
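The two functions quoted above implement a buffer-then-flush pattern. Boot-time allocations are counted in a static per-node array, and the totals are later applied to the per-node counters from setup_per_cpu_pageset(). The following is a userspace sketch of that pattern only. MAX_NODES, early_metadata[] and node_metadata[] are hypothetical stand-ins for MAX_NUMNODES, early_perpage_metadata[] and the NR_PAGE_METADATA node counters.

```c
#include <assert.h>

#define MAX_NODES 4 /* hypothetical stand-in for MAX_NUMNODES */

long early_metadata[MAX_NODES]; /* boot-time buffer, in pages */
long node_metadata[MAX_NODES];  /* stand-in for per-node NR_PAGE_METADATA */

/* Record an early allocation; the real counters are not touched yet. */
void mod_early_metadata(int nid, long delta)
{
    assert(nid >= 0 && nid < MAX_NODES);
    early_metadata[nid] += delta;
}

/* Apply the buffered totals to the per-node counters. */
void store_early_metadata(void)
{
    for (int nid = 0; nid < MAX_NODES; nid++) {
        node_metadata[nid] += early_metadata[nid];
        /* Unlike the patch, clear the buffer so a second flush is harmless. */
        early_metadata[nid] = 0;
    }
}
```

In the patch the buffer is __initdata and is discarded after boot, so it is flushed exactly once. The clearing step here is an addition for the sketch.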
On Thu, Nov 2, 2023 at 1:42 AM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Wed, Nov 01, 2023 at 04:08:16PM -0700, Sourav Panda wrote:
> > Adds a new per-node PageMetadata field to
> > /sys/devices/system/node/nodeN/meminfo
>
> No, this file is already an abuse of sysfs and we need to get rid of it
> (it has multiple values in one file.) Please do not add to the
> nightmare by adding new values.

Hi Greg,

Today, nodeN/meminfo is a counterpart of /proc/meminfo, they contain
almost identical fields, but show node-wide and system-wide views.

Since per-page metadata is added into /proc/meminfo, it is logical to
add into nodeN/meminfo, some nodes can have more or less struct page
data based on size of the node, and also the way memory is configured,
such as use of vmemmap optimization etc, therefore this information is
useful to users.

I am not aware of any example of where a system-wide field from
/proc/meminfo is represented as a separate sysfs file under node0/. If
nodeN/meminfo is ever broken down into separate files it will affect
all the fields in it the same way with or without per-page metadata.

> Also, even if you did want to do this, you didn't document it properly
> in Documentation/ABI/ :(

The documentation for the fields in nodeN/meminfo is only specified
in Documentation/filesystems/proc.rst, there is no separate sysfs
Documentation for the fields in this file, we could certainly add
that.

Thank you,
Pasha
On Thu, Nov 02, 2023 at 10:24:04AM -0400, Pasha Tatashin wrote:
> On Thu, Nov 2, 2023 at 1:42 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> >
> > On Wed, Nov 01, 2023 at 04:08:16PM -0700, Sourav Panda wrote:
> > > Adds a new per-node PageMetadata field to
> > > /sys/devices/system/node/nodeN/meminfo
> >
> > No, this file is already an abuse of sysfs and we need to get rid of it
> > (it has multiple values in one file.) Please do not add to the
> > nightmare by adding new values.
>
> Hi Greg,
>
> Today, nodeN/meminfo is a counterpart of /proc/meminfo, they contain
> almost identical fields, but show node-wide and system-wide views.

And that is wrong, and again, an abuse of sysfs, please do not continue
to add to it, that will only cause problems.

> Since per-page metadata is added into /proc/meminfo, it is logical to
> add into nodeN/meminfo, some nodes can have more or less struct page
> data based on size of the node, and also the way memory is configured,
> such as use of vmemmap optimization etc, therefore this information is
> useful to users.
>
> I am not aware of any example of where a system-wide field from
> /proc/meminfo is represented as a separate sysfs file under node0/. If
> nodeN/meminfo is ever broken down into separate files it will affect
> all the fields in it the same way with or without per-page metadata

All of the fields should be individual files, please start adding them
if you want to add new items, I do not want to see additional abuse here
as that will cause problems (as you are seeing with the proc file.)

> > Also, even if you did want to do this, you didn't document it properly
> > in Documentation/ABI/ :(
>
> The documentation for the fields in nodeN/meminfo is only specified
> in Documentation/filesystems/proc.rst, there is no separate sysfs
> Documentation for the fields in this file, we could certainly add
> that.

All sysfs files need to be documented in Documentation/ABI/ otherwise
you should get a warning when running our testing scripts.

thanks,

greg k-h
On Thu, Nov 2, 2023 at 10:29 AM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Thu, Nov 02, 2023 at 10:24:04AM -0400, Pasha Tatashin wrote:
> > On Thu, Nov 2, 2023 at 1:42 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> > >
> > > On Wed, Nov 01, 2023 at 04:08:16PM -0700, Sourav Panda wrote:
> > > > Adds a new per-node PageMetadata field to
> > > > /sys/devices/system/node/nodeN/meminfo
> > >
> > > No, this file is already an abuse of sysfs and we need to get rid of it
> > > (it has multiple values in one file.) Please do not add to the
> > > nightmare by adding new values.
> >
> > Hi Greg,
> >
> > Today, nodeN/meminfo is a counterpart of /proc/meminfo, they contain
> > almost identical fields, but show node-wide and system-wide views.
>
> And that is wrong, and again, an abuse of sysfs, please do not continue
> to add to it, that will only cause problems.
>
> > Since per-page metadata is added into /proc/meminfo, it is logical to
> > add into nodeN/meminfo, some nodes can have more or less struct page
> > data based on size of the node, and also the way memory is configured,
> > such as use of vmemmap optimization etc, therefore this information is
> > useful to users.
> >
> > I am not aware of any example of where a system-wide field from
> > /proc/meminfo is represented as a separate sysfs file under node0/. If
> > nodeN/meminfo is ever broken down into separate files it will affect
> > all the fields in it the same way with or without per-page metadata
>
> All of the fields should be individual files, please start adding them
> if you want to add new items, I do not want to see additional abuse here

Sounds good, in our next patch version we will create a new file under
nodeN/ to contain per-page metadata overhead, and add an ABI doc file
for it.

Thanks,
Pasha
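To illustrate what the one-value-per-file rule means for readers of these files, here is a rough userspace sketch contrasting the two formats. It assumes the "Node N Field: value kB" layout shown in the drivers/base/node.c hunk, and a hypothetical single-value file that holds just a decimal number. Both function names are made up for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Multi-value format: scan each line of a nodeN/meminfo-style buffer for
 * "Node <nid> <field>: <value> kB". Returns 0 on success, -1 otherwise.
 */
int parse_node_meminfo(const char *buf, const char *field, unsigned long *kb)
{
    size_t flen;
    const char *line = buf;

    assert(buf && field && kb);
    flen = strlen(field);

    while (line && *line) {
        int nid, off = 0;

        /* %n records how many characters "Node <nid> " consumed. */
        if (sscanf(line, "Node %d %n", &nid, &off) == 1 && off > 0 &&
            strncmp(line + off, field, flen) == 0 &&
            line[off + flen] == ':')
            return sscanf(line + off + flen + 1, "%lu", kb) == 1 ? 0 : -1;

        line = strchr(line, '\n');
        if (line)
            line++; /* step past the newline to the next line */
    }
    return -1;
}

/* Single-value format: the whole file is one decimal number. */
int parse_single_value(const char *buf, unsigned long *val)
{
    char *end;

    assert(buf && val);
    errno = 0;
    *val = strtoul(buf, &end, 10);
    if (end == buf || errno)
        return -1;
    return (*end == '\n' || *end == '\0') ? 0 : -1;
}
```

The single-value reader needs no knowledge of field names or line layout, which is the practical argument behind one value per sysfs file.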
On Wed, Nov 1, 2023 at 7:58 PM Pasha Tatashin <pasha.tatashin@soleen.com> wrote: > > On Wed, Nov 1, 2023 at 7:40 PM Wei Xu <weixugc@google.com> wrote: > > > > On Wed, Nov 1, 2023 at 4:08 PM Sourav Panda <souravpanda@google.com> wrote: > > > > > > Adds a new per-node PageMetadata field to > > > /sys/devices/system/node/nodeN/meminfo > > > and a global PageMetadata field to /proc/meminfo. This information can > > > be used by users to see how much memory is being used by per-page > > > metadata, which can vary depending on build configuration, machine > > > architecture, and system use. > > > > > > Per-page metadata is the amount of memory that Linux needs in order to > > > manage memory at the page granularity. The majority of such memory is > > > used by "struct page" and "page_ext" data structures. In contrast to > > > most other memory consumption statistics, per-page metadata might not > > > be included in MemTotal. For example, MemTotal does not include memblock > > > allocations but includes buddy allocations. While on the other hand, > > > per-page metadata would include both memblock and buddy allocations. > > > > I expect that the new PageMetadata field in meminfo should help break > > down the memory usage of a system (MemUsed, or MemTotal - MemFree), > > similar to the other fields in meminfo. > > > > However, given that PageMetadata includes per-page metadata allocated > > from not only the buddy allocator, but also the memblock allocations, > > and MemTotal doesn't include memory reserved by memblock allocations, > > I wonder how a user can actually use this new PageMetadata to break > > down the system memory usage. BTW, it is not robust to assume that > > all memblock allocations are for per-page metadata. 
> >
>
> Hi Wei,
>
> > Here are some ideas to address this problem:
> >
> > - Only report the buddy allocations for per-page metadata in PageMetadata, or
>
> Making PageMetadata not to contain all per-page memory but just some
> is confusing, especially right after boot it would always be 0, as all
> struct pages are all coming from memblock during boot, yet we know we
> have allocated tons of memory for struct pages.
>
> > - Report per-page metadata in two separate fields in meminfo, one for
> > buddy allocations and another for memblock allocations, or
>
> This is also going to be confusing for the users, it is really
> implementation detail which allocator was used to allocate struct
> pages, and having two trackers is not going to improve things.
>
> > - Change MemTotal/MemUsed to include the memblock reserved memory as well.
>
> I think this is the right solution for an existing bug: MemTotal
> should really include memblock reserved memory.

Adding reserved memory to MemTotal is a cleaner approach IMO as well.
But it changes the semantics of MemTotal, which may have compatibility
issues. I think the MemTotal change should be part of this patch
series, too. If it doesn't get accepted, then we need to take one of
the first two approaches (reporting only buddy allocations of per-page
metadata or reporting per-page metadata separately for buddy/memblock
allocations) at least for the Google use cases such that we can use the
new PageMetadata to improve the breakdown of runtime kernel memory
overheads (excluding the boot-time memblock allocations).

> Pasha
>
> >
> > Wei Xu
> >
> > > This memory depends on build configurations, machine architectures, and
> > > the way system is used:
> > >
> > > Build configuration may include extra fields into "struct page",
> > > and enable / disable "page_ext"
> > > Machine architecture defines base page sizes. For example 4K x86,
> > > 8K SPARC, 64K ARM64 (optionally), etc.
The per-page metadata > > > overhead is smaller on machines with larger page sizes. > > > System use can change per-page overhead by using vmemmap > > > optimizations with hugetlb pages, and emulated pmem devdax pages. > > > Also, boot parameters can determine whether page_ext is needed > > > to be allocated. This memory can be part of MemTotal or be outside > > > MemTotal depending on whether the memory was hot-plugged, booted with, > > > or hugetlb memory was returned back to the system. > > > > > > Suggested-by: Pasha Tatashin <pasha.tatashin@soleen.com> > > > Signed-off-by: Sourav Panda <souravpanda@google.com> > > > --- > > > Documentation/filesystems/proc.rst | 3 +++ > > > drivers/base/node.c | 2 ++ > > > fs/proc/meminfo.c | 7 +++++++ > > > include/linux/mmzone.h | 3 +++ > > > include/linux/vmstat.h | 4 ++++ > > > mm/hugetlb.c | 11 ++++++++-- > > > mm/hugetlb_vmemmap.c | 12 +++++++++-- > > > mm/mm_init.c | 3 +++ > > > mm/page_alloc.c | 1 + > > > mm/page_ext.c | 32 +++++++++++++++++++++--------- > > > mm/sparse-vmemmap.c | 3 +++ > > > mm/sparse.c | 7 ++++++- > > > mm/vmstat.c | 24 ++++++++++++++++++++++ > > > 13 files changed, 98 insertions(+), 14 deletions(-) > > > > > > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst > > > index 2b59cff8be17..c121f2ef9432 100644 > > > --- a/Documentation/filesystems/proc.rst > > > +++ b/Documentation/filesystems/proc.rst > > > @@ -987,6 +987,7 @@ Example output. You may not have all of these fields. 
> > > AnonPages: 4654780 kB > > > Mapped: 266244 kB > > > Shmem: 9976 kB > > > + PageMetadata: 513419 kB > > > KReclaimable: 517708 kB > > > Slab: 660044 kB > > > SReclaimable: 517708 kB > > > @@ -1089,6 +1090,8 @@ Mapped > > > files which have been mmapped, such as libraries > > > Shmem > > > Total memory used by shared memory (shmem) and tmpfs > > > +PageMetadata > > > + Memory used for per-page metadata > > > KReclaimable > > > Kernel allocations that the kernel will attempt to reclaim > > > under memory pressure. Includes SReclaimable (below), and other > > > diff --git a/drivers/base/node.c b/drivers/base/node.c > > > index 493d533f8375..da728542265f 100644 > > > --- a/drivers/base/node.c > > > +++ b/drivers/base/node.c > > > @@ -428,6 +428,7 @@ static ssize_t node_read_meminfo(struct device *dev, > > > "Node %d Mapped: %8lu kB\n" > > > "Node %d AnonPages: %8lu kB\n" > > > "Node %d Shmem: %8lu kB\n" > > > + "Node %d PageMetadata: %8lu kB\n" > > > "Node %d KernelStack: %8lu kB\n" > > > #ifdef CONFIG_SHADOW_CALL_STACK > > > "Node %d ShadowCallStack:%8lu kB\n" > > > @@ -458,6 +459,7 @@ static ssize_t node_read_meminfo(struct device *dev, > > > nid, K(node_page_state(pgdat, NR_FILE_MAPPED)), > > > nid, K(node_page_state(pgdat, NR_ANON_MAPPED)), > > > nid, K(i.sharedram), > > > + nid, K(node_page_state(pgdat, NR_PAGE_METADATA)), > > > nid, node_page_state(pgdat, NR_KERNEL_STACK_KB), > > > #ifdef CONFIG_SHADOW_CALL_STACK > > > nid, node_page_state(pgdat, NR_KERNEL_SCS_KB), > > > diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c > > > index 45af9a989d40..f141bb2a550d 100644 > > > --- a/fs/proc/meminfo.c > > > +++ b/fs/proc/meminfo.c > > > @@ -39,7 +39,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v) > > > long available; > > > unsigned long pages[NR_LRU_LISTS]; > > > unsigned long sreclaimable, sunreclaim; > > > + unsigned long nr_page_metadata; > > > int lru; > > > + int nid; > > > > > > si_meminfo(&i); > > > si_swapinfo(&i); > > > @@ -57,6 +59,10 @@ 
static int meminfo_proc_show(struct seq_file *m, void *v) > > > sreclaimable = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B); > > > sunreclaim = global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B); > > > > > > + nr_page_metadata = 0; > > > + for_each_online_node(nid) > > > + nr_page_metadata += node_page_state(NODE_DATA(nid), NR_PAGE_METADATA); > > > + > > > show_val_kb(m, "MemTotal: ", i.totalram); > > > show_val_kb(m, "MemFree: ", i.freeram); > > > show_val_kb(m, "MemAvailable: ", available); > > > @@ -104,6 +110,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v) > > > show_val_kb(m, "Mapped: ", > > > global_node_page_state(NR_FILE_MAPPED)); > > > show_val_kb(m, "Shmem: ", i.sharedram); > > > + show_val_kb(m, "PageMetadata: ", nr_page_metadata); > > > show_val_kb(m, "KReclaimable: ", sreclaimable + > > > global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE)); > > > show_val_kb(m, "Slab: ", sreclaimable + sunreclaim); > > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > > > index 4106fbc5b4b3..dda1ad522324 100644 > > > --- a/include/linux/mmzone.h > > > +++ b/include/linux/mmzone.h > > > @@ -207,6 +207,9 @@ enum node_stat_item { > > > PGPROMOTE_SUCCESS, /* promote successfully */ > > > PGPROMOTE_CANDIDATE, /* candidate pages to promote */ > > > #endif > > > + NR_PAGE_METADATA, /* Page metadata size (struct page and page_ext) > > > + * in pages > > > + */ > > > NR_VM_NODE_STAT_ITEMS > > > }; > > > > > > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h > > > index fed855bae6d8..af096a881f03 100644 > > > --- a/include/linux/vmstat.h > > > +++ b/include/linux/vmstat.h > > > @@ -656,4 +656,8 @@ static inline void lruvec_stat_sub_folio(struct folio *folio, > > > { > > > lruvec_stat_mod_folio(folio, idx, -folio_nr_pages(folio)); > > > } > > > + > > > +void __init mod_node_early_perpage_metadata(int nid, long delta); > > > +void __init store_early_perpage_metadata(void); > > > + > > > #endif /* _LINUX_VMSTAT_H */ > > > diff 
--git a/mm/hugetlb.c b/mm/hugetlb.c > > > index 1301ba7b2c9a..1778e02ed583 100644 > > > --- a/mm/hugetlb.c > > > +++ b/mm/hugetlb.c > > > @@ -1790,6 +1790,9 @@ static void __update_and_free_hugetlb_folio(struct hstate *h, > > > destroy_compound_gigantic_folio(folio, huge_page_order(h)); > > > free_gigantic_folio(folio, huge_page_order(h)); > > > } else { > > > +#ifndef CONFIG_SPARSEMEM_VMEMMAP > > > + __node_stat_sub_folio(folio, NR_PAGE_METADATA); > > > +#endif > > > __free_pages(&folio->page, huge_page_order(h)); > > > } > > > } > > > @@ -2125,6 +2128,7 @@ static struct folio *alloc_buddy_hugetlb_folio(struct hstate *h, > > > struct page *page; > > > bool alloc_try_hard = true; > > > bool retry = true; > > > + struct folio *folio; > > > > > > /* > > > * By default we always try hard to allocate the page with > > > @@ -2175,9 +2179,12 @@ static struct folio *alloc_buddy_hugetlb_folio(struct hstate *h, > > > __count_vm_event(HTLB_BUDDY_PGALLOC_FAIL); > > > return NULL; > > > } > > > - > > > + folio = page_folio(page); > > > +#ifndef CONFIG_SPARSEMEM_VMEMMAP > > > + __node_stat_add_folio(folio, NR_PAGE_METADATA); > > > +#endif > > > __count_vm_event(HTLB_BUDDY_PGALLOC); > > > - return page_folio(page); > > > + return folio; > > > } > > > > > > /* > > > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c > > > index 4b9734777f69..f7ca5d4dd583 100644 > > > --- a/mm/hugetlb_vmemmap.c > > > +++ b/mm/hugetlb_vmemmap.c > > > @@ -214,6 +214,7 @@ static inline void free_vmemmap_page(struct page *page) > > > free_bootmem_page(page); > > > else > > > __free_page(page); > > > + __mod_node_page_state(page_pgdat(page), NR_PAGE_METADATA, -1); > > > } > > > > > > /* Free a list of the vmemmap pages */ > > > @@ -335,6 +336,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end, > > > copy_page(page_to_virt(walk.reuse_page), > > > (void *)walk.reuse_addr); > > > list_add(&walk.reuse_page->lru, &vmemmap_pages); > > > + __mod_node_page_state(NODE_DATA(nid), 
NR_PAGE_METADATA, 1); > > > } > > > > > > /* > > > @@ -384,14 +386,20 @@ static int alloc_vmemmap_page_list(unsigned long start, unsigned long end, > > > unsigned long nr_pages = (end - start) >> PAGE_SHIFT; > > > int nid = page_to_nid((struct page *)start); > > > struct page *page, *next; > > > + int i; > > > > > > - while (nr_pages--) { > > > + for (i = 0; i < nr_pages; i++) { > > > page = alloc_pages_node(nid, gfp_mask, 0); > > > - if (!page) > > > + if (!page) { > > > + __mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA, > > > + i); > > > goto out; > > > + } > > > list_add_tail(&page->lru, list); > > > } > > > > > > + __mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA, nr_pages); > > > + > > > return 0; > > > out: > > > list_for_each_entry_safe(page, next, list, lru) > > > diff --git a/mm/mm_init.c b/mm/mm_init.c > > > index 50f2f34745af..6997bf00945b 100644 > > > --- a/mm/mm_init.c > > > +++ b/mm/mm_init.c > > > @@ -26,6 +26,7 @@ > > > #include <linux/pgtable.h> > > > #include <linux/swap.h> > > > #include <linux/cma.h> > > > +#include <linux/vmstat.h> > > > #include "internal.h" > > > #include "slab.h" > > > #include "shuffle.h" > > > @@ -1656,6 +1657,8 @@ static void __init alloc_node_mem_map(struct pglist_data *pgdat) > > > panic("Failed to allocate %ld bytes for node %d memory map\n", > > > size, pgdat->node_id); > > > pgdat->node_mem_map = map + offset; > > > + mod_node_early_perpage_metadata(pgdat->node_id, > > > + DIV_ROUND_UP(size, PAGE_SIZE)); > > > } > > > pr_debug("%s: node %d, pgdat %08lx, node_mem_map %08lx\n", > > > __func__, pgdat->node_id, (unsigned long)pgdat, > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > > index 85741403948f..522dc0c52610 100644 > > > --- a/mm/page_alloc.c > > > +++ b/mm/page_alloc.c > > > @@ -5443,6 +5443,7 @@ void __init setup_per_cpu_pageset(void) > > > for_each_online_pgdat(pgdat) > > > pgdat->per_cpu_nodestats = > > > alloc_percpu(struct per_cpu_nodestat); > > > + store_early_perpage_metadata(); > > > 
} > > > > > > __meminit void zone_pcp_init(struct zone *zone) > > > diff --git a/mm/page_ext.c b/mm/page_ext.c > > > index 4548fcc66d74..d8d6db9c3d75 100644 > > > --- a/mm/page_ext.c > > > +++ b/mm/page_ext.c > > > @@ -201,6 +201,8 @@ static int __init alloc_node_page_ext(int nid) > > > return -ENOMEM; > > > NODE_DATA(nid)->node_page_ext = base; > > > total_usage += table_size; > > > + __mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA, > > > + DIV_ROUND_UP(table_size, PAGE_SIZE)); > > > return 0; > > > } > > > > > > @@ -255,12 +257,15 @@ static void *__meminit alloc_page_ext(size_t size, int nid) > > > void *addr = NULL; > > > > > > addr = alloc_pages_exact_nid(nid, size, flags); > > > - if (addr) { > > > + if (addr) > > > kmemleak_alloc(addr, size, 1, flags); > > > - return addr; > > > - } > > > + else > > > + addr = vzalloc_node(size, nid); > > > > > > - addr = vzalloc_node(size, nid); > > > + if (addr) { > > > + mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA, > > > + DIV_ROUND_UP(size, PAGE_SIZE)); > > > + } > > > > > > return addr; > > > } > > > @@ -303,18 +308,27 @@ static int __meminit init_section_page_ext(unsigned long pfn, int nid) > > > > > > static void free_page_ext(void *addr) > > > { > > > + size_t table_size; > > > + struct page *page; > > > + struct pglist_data *pgdat; > > > + > > > + table_size = page_ext_size * PAGES_PER_SECTION; > > > + > > > if (is_vmalloc_addr(addr)) { > > > + page = vmalloc_to_page(addr); > > > + pgdat = page_pgdat(page); > > > vfree(addr); > > > } else { > > > - struct page *page = virt_to_page(addr); > > > - size_t table_size; > > > - > > > - table_size = page_ext_size * PAGES_PER_SECTION; > > > - > > > + page = virt_to_page(addr); > > > + pgdat = page_pgdat(page); > > > BUG_ON(PageReserved(page)); > > > kmemleak_free(addr); > > > free_pages_exact(addr, table_size); > > > } > > > + > > > + __mod_node_page_state(pgdat, NR_PAGE_METADATA, > > > + -1L * (DIV_ROUND_UP(table_size, PAGE_SIZE))); > > > + > > > } > > > > > 
> static void __free_page_ext(unsigned long pfn) > > > diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c > > > index a2cbe44c48e1..2bc67b2c2aa2 100644 > > > --- a/mm/sparse-vmemmap.c > > > +++ b/mm/sparse-vmemmap.c > > > @@ -469,5 +469,8 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn, > > > if (r < 0) > > > return NULL; > > > > > > + __mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA, > > > + DIV_ROUND_UP(end - start, PAGE_SIZE)); > > > + > > > return pfn_to_page(pfn); > > > } > > > diff --git a/mm/sparse.c b/mm/sparse.c > > > index 77d91e565045..7f67b5486cd1 100644 > > > --- a/mm/sparse.c > > > +++ b/mm/sparse.c > > > @@ -14,7 +14,7 @@ > > > #include <linux/swap.h> > > > #include <linux/swapops.h> > > > #include <linux/bootmem_info.h> > > > - > > > +#include <linux/vmstat.h> > > > #include "internal.h" > > > #include <asm/dma.h> > > > > > > @@ -465,6 +465,9 @@ static void __init sparse_buffer_init(unsigned long size, int nid) > > > */ > > > sparsemap_buf = memmap_alloc(size, section_map_size(), addr, nid, true); > > > sparsemap_buf_end = sparsemap_buf + size; > > > +#ifndef CONFIG_SPARSEMEM_VMEMMAP > > > + mod_node_early_perpage_metadata(nid, DIV_ROUND_UP(size, PAGE_SIZE)); > > > +#endif > > > } > > > > > > static void __init sparse_buffer_fini(void) > > > @@ -641,6 +644,8 @@ static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages, > > > unsigned long start = (unsigned long) pfn_to_page(pfn); > > > unsigned long end = start + nr_pages * sizeof(struct page); > > > > > > + __mod_node_page_state(page_pgdat(pfn_to_page(pfn)), NR_PAGE_METADATA, > > > + -1L * (DIV_ROUND_UP(end - start, PAGE_SIZE))); > > > vmemmap_free(start, end, altmap); > > > } > > > static void free_map_bootmem(struct page *memmap) > > > diff --git a/mm/vmstat.c b/mm/vmstat.c > > > index 00e81e99c6ee..070d2b3d2bcc 100644 > > > --- a/mm/vmstat.c > > > +++ b/mm/vmstat.c > > > @@ -1245,6 +1245,7 @@ const char * const vmstat_text[] = { > > > 
"pgpromote_success", > > > "pgpromote_candidate", > > > #endif > > > + "nr_page_metadata", > > > > > > /* enum writeback_stat_item counters */ > > > "nr_dirty_threshold", > > > @@ -2274,4 +2275,27 @@ static int __init extfrag_debug_init(void) > > > } > > > > > > module_init(extfrag_debug_init); > > > + > > > #endif > > > + > > > +/* > > > + * Page metadata size (struct page and page_ext) in pages > > > + */ > > > +static unsigned long early_perpage_metadata[MAX_NUMNODES] __initdata; > > > + > > > +void __init mod_node_early_perpage_metadata(int nid, long delta) > > > +{ > > > + early_perpage_metadata[nid] += delta; > > > +} > > > + > > > +void __init store_early_perpage_metadata(void) > > > +{ > > > + int nid; > > > + struct pglist_data *pgdat; > > > + > > > + for_each_online_pgdat(pgdat) { > > > + nid = pgdat->node_id; > > > + __mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA, > > > + early_perpage_metadata[nid]); > > > + } > > > +} > > > -- > > > 2.42.0.820.g83a721a137-goog > > >
On 02.11.23 16:43, Wei Xu wrote: > On Wed, Nov 1, 2023 at 7:58 PM Pasha Tatashin <pasha.tatashin@soleen.com> wrote: >> >> On Wed, Nov 1, 2023 at 7:40 PM Wei Xu <weixugc@google.com> wrote: >>> >>> On Wed, Nov 1, 2023 at 4:08 PM Sourav Panda <souravpanda@google.com> wrote: >>>> >>>> Adds a new per-node PageMetadata field to >>>> /sys/devices/system/node/nodeN/meminfo >>>> and a global PageMetadata field to /proc/meminfo. This information can >>>> be used by users to see how much memory is being used by per-page >>>> metadata, which can vary depending on build configuration, machine >>>> architecture, and system use. >>>> >>>> Per-page metadata is the amount of memory that Linux needs in order to >>>> manage memory at the page granularity. The majority of such memory is >>>> used by "struct page" and "page_ext" data structures. In contrast to >>>> most other memory consumption statistics, per-page metadata might not >>>> be included in MemTotal. For example, MemTotal does not include memblock >>>> allocations but includes buddy allocations. While on the other hand, >>>> per-page metadata would include both memblock and buddy allocations. >>> >>> I expect that the new PageMetadata field in meminfo should help break >>> down the memory usage of a system (MemUsed, or MemTotal - MemFree), >>> similar to the other fields in meminfo. >>> >>> However, given that PageMetadata includes per-page metadata allocated >>> from not only the buddy allocator, but also the memblock allocations, >>> and MemTotal doesn't include memory reserved by memblock allocations, >>> I wonder how a user can actually use this new PageMetadata to break >>> down the system memory usage. BTW, it is not robust to assume that >>> all memblock allocations are for per-page metadata. 
>>>
>>
>> Hi Wei,
>>
>>> Here are some ideas to address this problem:
>>>
>>> - Only report the buddy allocations for per-page metadata in PageMetadata, or
>>
>> Making PageMetadata not to contain all per-page memory but just some
>> is confusing, especially right after boot it would always be 0, as all
>> struct pages are all coming from memblock during boot, yet we know we
>> have allocated tons of memory for struct pages.
>>
>>> - Report per-page metadata in two separate fields in meminfo, one for
>>> buddy allocations and another for memblock allocations, or
>>
>> This is also going to be confusing for the users, it is really
>> implementation detail which allocator was used to allocate struct
>> pages, and having two trackers is not going to improve things.
>>
>>> - Change MemTotal/MemUsed to include the memblock reserved memory as well.
>>
>> I think this is the right solution for an existing bug: MemTotal
>> should really include memblock reserved memory.
>
> Adding reserved memory to MemTotal is a cleaner approach IMO as well.
> But it changes the semantics of MemTotal, which may have compatibility
> issues.

I object.
> > Adding reserved memory to MemTotal is a cleaner approach IMO as well.
> > But it changes the semantics of MemTotal, which may have compatibility
> > issues.
>
> I object.

Could you please elaborate what you object (and why): you object that
it will have compatibility issues, or you object to include memblock
reserves into MemTotal?

Thanks,
Pasha
On 02.11.23 16:50, Pasha Tatashin wrote:
>>> Adding reserved memory to MemTotal is a cleaner approach IMO as well.
>>> But it changes the semantics of MemTotal, which may have compatibility
>>> issues.
>>
>> I object.
>
> Could you please elaborate what you object (and why): you object that
> it will have compatibility issues, or you object to include memblock
> reserves into MemTotal?

Sorry, I object to changing the semantics of MemTotal. MemTotal is
traditionally the memory managed by the buddy, not all memory in the
system. I know people/scripts that are relying on that [although it's
been source of confusion a couple of times].
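If MemTotal covers only buddy-managed memory, as stated above, then Wei's breakdown concern can be shown with a small calculation. This is a hedged sketch with made-up numbers. The struct and its split of metadata into buddy and memblock parts are hypothetical and only mirror one of the options discussed in the thread.

```c
#include <assert.h>

/* Hypothetical snapshot; all values in kB. */
struct mem_snapshot {
    unsigned long mem_total;     /* MemTotal: buddy-managed memory only */
    unsigned long mem_free;      /* MemFree */
    unsigned long meta_buddy;    /* per-page metadata allocated from the buddy */
    unsigned long meta_memblock; /* per-page metadata allocated from memblock */
};

/* Used memory as scripts commonly derive it: MemTotal - MemFree. */
unsigned long mem_used(const struct mem_snapshot *s)
{
    assert(s->mem_free <= s->mem_total);
    return s->mem_total - s->mem_free;
}

/*
 * Used memory not explained by per-page metadata. Only the buddy-allocated
 * metadata is part of MemTotal, so only that part is subtracted.
 */
unsigned long used_excluding_metadata(const struct mem_snapshot *s)
{
    return mem_used(s) - s->meta_buddy;
}

/* What a script computes if it subtracts a combined PageMetadata value. */
unsigned long naive_used_excluding_metadata(const struct mem_snapshot *s)
{
    return mem_used(s) - (s->meta_buddy + s->meta_memblock);
}
```

With a combined counter, the naive result is too small by exactly the memblock-allocated share, because that memory was never counted in MemTotal in the first place.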
On Thu, Nov 2, 2023 at 11:53 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 02.11.23 16:50, Pasha Tatashin wrote:
> >>> Adding reserved memory to MemTotal is a cleaner approach IMO as well.
> >>> But it changes the semantics of MemTotal, which may have compatibility
> >>> issues.
> >>
> >> I object.
> >
> > Could you please elaborate what you object (and why): you object that
> > it will have compatibility issues, or you object to include memblock
> > reserves into MemTotal?
>
> Sorry, I object to changing the semantics of MemTotal. MemTotal is
> traditionally the memory managed by the buddy, not all memory in the
> system. I know people/scripts that are relying on that [although it's
> been source of confusion a couple of times].

What if one day we change so that struct pages are allocated from
buddy allocator (i.e. allocate deferred struct pages from buddy) will
it break those MemTotal scripts? What if the size of struct pages
changes significantly, but the overhead will come from other metadata
(i.e. memdesc) will that break those scripts? I feel like struct page
memory should really be included into MemTotal, otherwise we will have
this struggle in the future when we try to optimize struct page
memory.
On 02.11.23 17:02, Pasha Tatashin wrote:
> On Thu, Nov 2, 2023 at 11:53 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 02.11.23 16:50, Pasha Tatashin wrote:
>>>>> Adding reserved memory to MemTotal is a cleaner approach IMO as well.
>>>>> But it changes the semantics of MemTotal, which may have compatibility
>>>>> issues.
>>>>
>>>> I object.
>>>
>>> Could you please elaborate what you object (and why): you object that
>>> it will have compatibility issues, or you object to include memblock
>>> reserves into MemTotal?
>>
>> Sorry, I object to changing the semantics of MemTotal. MemTotal is
>> traditionally the memory managed by the buddy, not all memory in the
>> system. I know people/scripts that are relying on that [although it's
>> been source of confusion a couple of times].
>
> What if one day we change so that struct pages are allocated from
> buddy allocator (i.e. allocate deferred struct pages from buddy) will

It does on memory hotplug. But for things like crashkernel size
detection doesn't really care about that.

> it break those MemTotal scripts? What if the size of struct pages
> changes significantly, but the overhead will come from other metadata
> (i.e. memdesc) will that break those scripts? I feel like struct page

Probably; but ideally the metadata overhead will be smaller with
memdesc. And we'll talk about that once it gets real ;)

> memory should really be included into MemTotal, otherwise we will have
> this struggle in the future when we try to optimize struct page
> memory.

How far do we want to go, do we want to include crashkernel reserved
memory in MemTotal because it is system memory? Only metadata? what else
allocated using memblock?

Again, right now it's simple: MemTotal is memory managed by the buddy.

The spirit of this patch set is good, modifying existing counters needs
good justification.
On Thu, Nov 2, 2023 at 12:09 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 02.11.23 17:02, Pasha Tatashin wrote:
> > On Thu, Nov 2, 2023 at 11:53 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 02.11.23 16:50, Pasha Tatashin wrote:
> >>>>> Adding reserved memory to MemTotal is a cleaner approach IMO as well.
> >>>>> But it changes the semantics of MemTotal, which may have compatibility
> >>>>> issues.
> >>>>
> >>>> I object.
> >>>
> >>> Could you please elaborate what you object (and why): you object that
> >>> it will have compatibility issues, or you object to include memblock
> >>> reserves into MemTotal?
> >>
> >> Sorry, I object to changing the semantics of MemTotal. MemTotal is
> >> traditionally the memory managed by the buddy, not all memory in the
> >> system. I know people/scripts that are relying on that [although it's
> >> been source of confusion a couple of times].
> >
> > What if one day we change so that struct pages are allocated from
> > buddy allocator (i.e. allocate deferred struct pages from buddy) will
>
> It does on memory hotplug. But for things like crashkernel size
> detection doesn't really care about that.

"Crash kernel" is a different case: it is kernel external memory,
similar to limiting the amount of physical memory via mem=/memmap=, it
sets memory that cannot be used by this kernel, but only by the crash
kernel. Also, the crash kernel reserve is exposed in /proc/iomem via
"Crash kernel" range.

Page metadata memory on the other hand, is used by this kernel, and
also can be changed by this kernel depending on how the memory is
used: memdesc, hotplug, THP, emulated pmem etc.

> > it break those MemTotal scripts? What if the size of struct pages
> > changes significantly, but the overhead will come from other metadata
> > (i.e. memdesc) will that break those scripts? I feel like struct page
>
> Probably; but ideally the metadata overhead will be smaller with
> memdesc. And we'll talk about that once it gets real ;)

The size and allocation of struct pages change MemTotal today, during
runtime, even without memdesc, I just brought it up, to emphasize that
this is something that we should resolve now before it gets worse.

> > memory should really be included into MemTotal, otherwise we will have
> > this struggle in the future when we try to optimize struct page
> > memory.
> How far do we want to go, do we want to include crashkernel reserved
> memory in MemTotal because it is system memory? Only metadata? what else
> allocated using memblock?
>
> Again, right now it's simple: MemTotal is memory managed by the buddy.
>
> The spirit of this patch set is good, modifying existing counters needs
> good justification.

Wei noticed that all other fields in /proc/meminfo are part of
MemTotal, but this new field may be not (depending where struct pages
are allocated), so what would be the best way to export page metadata
without redefining MemTotal? Keep the new field in /proc/meminfo but
be ok that it is not part of MemTotal or do two counters? If we do two
counters, we will still need to keep one that is a buddy allocator in
/proc/meminfo and the other one somewhere outside?

Pasha
On 02.11.23 17:43, Pasha Tatashin wrote:
> On Thu, Nov 2, 2023 at 12:09 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 02.11.23 17:02, Pasha Tatashin wrote:
>>> On Thu, Nov 2, 2023 at 11:53 AM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 02.11.23 16:50, Pasha Tatashin wrote:
>>>>>>> Adding reserved memory to MemTotal is a cleaner approach IMO as well.
>>>>>>> But it changes the semantics of MemTotal, which may have compatibility
>>>>>>> issues.
>>>>>>
>>>>>> I object.
>>>>>
>>>>> Could you please elaborate what you object (and why): you object that
>>>>> it will have compatibility issues, or you object to include memblock
>>>>> reserves into MemTotal?
>>>>
>>>> Sorry, I object to changing the semantics of MemTotal. MemTotal is
>>>> traditionally the memory managed by the buddy, not all memory in the
>>>> system. I know people/scripts that are relying on that [although it's
>>>> been source of confusion a couple of times].
>>>
>>> What if one day we change so that struct pages are allocated from
>>> buddy allocator (i.e. allocate deferred struct pages from buddy) will
>>
>> It does on memory hotplug. But for things like crashkernel size
>> detection doesn't really care about that.
>
> "Crash kernel" is a different case: it is kernel external memory,
> similar to limiting the amount of physical memory via mem=/memmap=, it
> sets memory that cannot be used by this kernel, but only by the crash
> kernel. Also, the crash kernel reserve is exposed in /proc/iomem via
> "Crash kernel" range.

Agreed.

>
> Page metadata memory on the other hand, is used by this kernel, and
> also can be changed by this kernel depending on how the memory is
> used: memdec, hotplug, THP, emulated pmem etc.

And then, there is the "altmap" for dax, where the metadata is placed
on the dax memory itself. I mean, it's system RAM (or NVDIMM or
whatever) used for metadata, but not managed by the buddy.

There is now also the "memmap_on_memory" feature for memory hotplug,
where we do the same for ordinary hotplug memory (put some memory aside
for the memmap and not allocate it from the buddy).

We'd have to account that one as metadata as well, I think. I don't
think it would get accounted under MemTotal (because it is not managed
by the buddy) as of now.

>
>>> it break those MemTotal scripts? What if the size of struct pages
>>> changes significantly, but the overhead will come from other metadata
>>> (i.e. memdesc) will that break those scripts? I feel like struct page
>>
>> Probably; but ideally the metadata overhead will be smaller with
>> memdesc. And we'll talk about that once it gets real ;)
>
> The size and allocation of struct pages change MemTotal today, during
> runtime, even without memdesc, I just brought it up, to emphasize that
> this is something that we should resolve now before it gets worse.

I don't quite see the immediate need for action, but I get what you are
saying. It's a historical mess, but if we want to tackle it, we should
tackle it completely and not only sort out the metadata accounting.

>
>>> memory should really be included into MemTotal, otherwise we will have
>>> this struggle in the future when we try to optimize struct page
>>> memory.
>> How far do we want to go, do we want to include crashkernel reserved
>> memory in MemTotal because it is system memory? Only metadata? what else
>> allocated using memblock?
>>
>> Again, right now it's simple: MemTotal is memory managed by the buddy.
>>
>> The spirit of this patch set is good, modifying existing counters needs
>> good justification.
>
> Wei, noticed that all other fields in /proc/meminfo are part of
> MemTotal, but this new field may be not (depending where struct pages

I could have sworn that I pointed that out in a previous version and
requested to document that special case in the patch description. :)

> are allocated), so what would be the best way to export page metadata
> without redefining MemTotal? Keep the new field in /proc/meminfo but
> be ok that it is not part of MemTotal or do two counters? If we do two
> counters, we will still need to keep one that is a buddy allocator in
> /proc/meminfo and the other one somewhere outside?

IMHO, we should just leave MemTotal alone ("memory managed by the buddy
that could actually mostly get freed up and reused -- although that's
not completely true") and have a new counter that includes any system
memory (MemSystem? but as we learned, as separate files), including most
memblock allocations/reservations as well (metadata, early pagetables,
initrd, kernel, ...).

Then you would actually know how much memory the system is using
(excluding things like crashmem, mem=, ...).

That part is tricky, though -- I recall there are memblock reservations
that are similar to the crashkernel -- which is why the current state is
to account memory under MemTotal when it's handed to the buddy -- which
is straightforward and simple.

I'm happy to discuss this further, if that direction is worth exploring.
> > Wei, noticed that all other fields in /proc/meminfo are part of
> > MemTotal, but this new field may be not (depending where struct pages
>
> I could have sworn that I pointed that out in a previous version and
> requested to document that special case in the patch description. :)

Sounds good, we will document that parts of per-page metadata may not
be part of MemTotal.

> > are allocated), so what would be the best way to export page metadata
> > without redefining MemTotal? Keep the new field in /proc/meminfo but
> > be ok that it is not part of MemTotal or do two counters? If we do two
> > counters, we will still need to keep one that is a buddy allocator in
> > /proc/meminfo and the other one somewhere outside?
>
> IMHO, we should just leave MemTotal alone ("memory managed by the buddy
> that could actually mostly get freed up and reused -- although that's
> not completely true") and have a new counter that includes any system
> memory (MemSystem? but as we learned, as separate files), including most
> memblock allocations/reservations as well (metadata, early pagetables,
> initrd, kernel, ...).
>
> The you would actually know how much memory the system is using
> (exclusing things like crashmem, mem=, ...).
>
> That part is tricky, though -- I recall there are memblock reservations
> that are similar to the crashkernel -- which is why the current state is
> to account memory when it's handed to the buddy under MemTotal -- which
> is straight forward and simply.

It may be simplified if we define MemSystem as all the usable memory
provided by the firmware to the Linux kernel.

For BIOS it would be the "usable" ranges in the original e820 memory
list, before it is modified by the kernel based on the parameters.

For device-tree architectures, it would be the memory binding provided
by the original device tree from the firmware.

Pasha
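The MemSystem definition proposed above for BIOS systems can be illustrated with a minimal Python sketch. It assumes the firmware map is available as boot-log text in the usual `BIOS-e820: [mem 0x...-0x...] usable` form; the function name `usable_e820_bytes` is hypothetical, and nothing here is part of the patch under discussion.

```python
import re

# Matches boot-log lines such as:
#   BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
E820_LINE = re.compile(
    r"BIOS-e820: \[mem 0x([0-9a-fA-F]+)-0x([0-9a-fA-F]+)\] (.+?)\s*$"
)


def usable_e820_bytes(boot_log):
    """Sum the sizes of all 'usable' ranges in a firmware e820 listing.

    Hypothetical helper: it only illustrates the proposed MemSystem
    definition and does not correspond to any kernel interface.
    """
    total = 0
    for line in boot_log.splitlines():
        match = E820_LINE.search(line)
        if match is None:
            continue
        start = int(match[1], 16)
        end = int(match[2], 16)
        if match[3] == "usable":
            total += end - start + 1  # the printed end address is inclusive
    return total
```

Because the listing is taken from the firmware-provided map, ranges later removed by mem= or reserved for a crash kernel would still be counted, which is one of the corner cases raised elsewhere in this thread.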
On Thu, Nov 2, 2023 at 10:12 AM Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
>
> > > Wei, noticed that all other fields in /proc/meminfo are part of
> > > MemTotal, but this new field may be not (depending where struct pages
> >
> > I could have sworn that I pointed that out in a previous version and
> > requested to document that special case in the patch description. :)
>
> Sounds, good we will document that parts of per-page may not be part
> of MemTotal.

But this still doesn't answer how we can use the new PageMetadata
field to help break down the runtime kernel overhead within MemUsed
(MemTotal - MemFree).

> > > are allocated), so what would be the best way to export page metadata
> > > without redefining MemTotal? Keep the new field in /proc/meminfo but
> > > be ok that it is not part of MemTotal or do two counters? If we do two
> > > counters, we will still need to keep one that is a buddy allocator in
> > > /proc/meminfo and the other one somewhere outside?
> >

I think the simplest thing to do now is to only report the buddy
allocations of per-page metadata in meminfo. The meaning of the new
counter is then easier to understand and consistent with MemTotal and
the other fields in meminfo. Its implementation can also be greatly
simplified, and we don't need to handle the other special cases either,
e.g. page metadata allocated from DAX devices.

> > IMHO, we should just leave MemTotal alone ("memory managed by the buddy
> > that could actually mostly get freed up and reused -- although that's
> > not completely true") and have a new counter that includes any system
> > memory (MemSystem? but as we learned, as separate files), including most
> > memblock allocations/reservations as well (metadata, early pagetables,
> > initrd, kernel, ...).
> >
> > The you would actually know how much memory the system is using
> > (exclusing things like crashmem, mem=, ...).
> >
> > That part is tricky, though -- I recall there are memblock reservations
> > that are similar to the crashkernel -- which is why the current state is
> > to account memory when it's handed to the buddy under MemTotal -- which
> > is straight forward and simply.
>
> It may be simplified if we define MemSystem as all the usable memory
> provided by firmware to Linux kernel.
> For BIOS it would be the "usable" ranges in the original e820 memory
> list before it's been modified by the kernel based on the parameters.
>
> For device-tree architectures, it would be the memory binding provided
> by the original device tree from the firmware.
>
> Pasha
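The breakdown described above (a counter that is a component of MemUsed = MemTotal - MemFree) can be sketched in Python as follows. This is only an illustration under stated assumptions: `PageMetadata` is the field name proposed by the patch and is not guaranteed to exist on any given kernel, and the helper names are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value}.

    Size fields are reported in kB; the unit suffix is ignored.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or not parts:
            continue
        fields[name.strip()] = int(parts[0])
    return fields


def metadata_share_of_used(text, field="PageMetadata"):
    """Return `field` as a fraction of MemUsed (MemTotal - MemFree).

    The ratio is only meaningful if `field` counts buddy allocations
    alone, so that it is a component of MemTotal like the other
    meminfo fields. `PageMetadata` is the name proposed in this thread.
    """
    info = parse_meminfo(text)
    used_kb = info["MemTotal"] - info["MemFree"]
    return info[field] / used_kb
```

If the counter also included memblock allocations, the numerator would contain memory that is not part of MemTotal, and the ratio would no longer describe a share of MemUsed.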
> > > I could have sworn that I pointed that out in a previous version and
> > > requested to document that special case in the patch description. :)
> >
> > Sounds, good we will document that parts of per-page may not be part
> > of MemTotal.
>
> But this still doesn't answer how we can use the new PageMetadata
> field to help break down the runtime kernel overhead within MemUsed
> (MemTotal - MemFree).

I am not sure it matters to the end users: they look at PageMetadata
with or without Page Owner, page_table_check, or HugeTLB, and it shows
exactly how much the per-page overhead changed. Where the kernel
allocated that memory is not that important to the end user, as long as
that memory became available to them.

In addition, it is still possible to estimate the actual memblock part
of per-page metadata by looking at /proc/zoneinfo:

Memblock reserved per-page metadata: "present_pages - managed_pages"

If there is something big that we will allocate in that range, we
should probably also export it in some form.

If this field does not fit in /proc/meminfo due to not fully being
part of MemTotal, we could just keep it under nodeN/, as a separate
file, as suggested by Greg.

However, I think it is useful enough to have an easy system-wide view
of per-page metadata.

> > > > are allocated), so what would be the best way to export page metadata
> > > > without redefining MemTotal? Keep the new field in /proc/meminfo but
> > > > be ok that it is not part of MemTotal or do two counters? If we do two
> > > > counters, we will still need to keep one that is a buddy allocator in
> > > > /proc/meminfo and the other one somewhere outside?
> > >
>
> I think the simplest thing to do now is to only report the buddy
> allocations of per-page metadata in meminfo. The meaning of the new

This will cause PageMetadata to be 0 on 99% of the systems, and it will
essentially become useless to the vast majority of users.
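The /proc/zoneinfo estimate mentioned above can be sketched in Python. The sketch assumes the usual zoneinfo layout, where each `Node N, zone NAME` header is followed by indented `present` and `managed` page counts; the function name is hypothetical. The result is at best an upper bound on memblock-allocated per-page metadata, since the difference covers every page that was never handed to the buddy allocator.

```python
import re

ZONE_HEADER = re.compile(r"^Node\s+(\d+),\s+zone\s+(\S+)")
COUNTER = re.compile(r"^\s+(present|managed)\s+(\d+)\s*$")


def unmanaged_pages_per_zone(zoneinfo):
    """Return {(node, zone): present - managed} from /proc/zoneinfo text.

    Hypothetical helper for illustration: the difference counts all
    pages not managed by the buddy allocator, not only per-page
    metadata.
    """
    result = {}
    current = None
    counters = {}
    for line in zoneinfo.splitlines():
        header = ZONE_HEADER.match(line)
        if header:
            current = (int(header[1]), header[2])
            counters = {}
            continue
        field = COUNTER.match(line)
        if field and current is not None:
            counters[field[1]] = int(field[2])
            if len(counters) == 2:
                result[current] = counters["present"] - counters["managed"]
    return result
```

Summing the values gives the system-wide figure in pages; multiplying by the page size converts it to bytes.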
On Thu, Nov 2, 2023 at 11:34 AM Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
>
> > > > I could have sworn that I pointed that out in a previous version and
> > > > requested to document that special case in the patch description. :)
> > >
> > > Sounds, good we will document that parts of per-page may not be part
> > > of MemTotal.
> >
> > But this still doesn't answer how we can use the new PageMetadata
> > field to help break down the runtime kernel overhead within MemUsed
> > (MemTotal - MemFree).
>
> I am not sure it matters to the end users: they look at PageMetadata
> with or without Page Owner, page_table_check, HugeTLB and it shows
> exactly how much per-page overhead changed. Where the kernel allocated
> that memory is not that important to the end user as long as that
> memory became available to them.
>
> In addition, it is still possible to estimate the actual memblock part
> of Per-page metadata by looking at /proc/zoneinfo:
>
> Memblock reserved per-page metadata: "present_pages - managed_pages"

This assumes that all reserved memblocks are per-page metadata. As I
mentioned earlier, it is not a robust approach.

> If there is something big that we will allocate in that range, we
> should probably also export it in some form.
>
> If this field does not fit in /proc/meminfo due to not fully being
> part of MemTotal, we could just keep it under nodeN/, as a separate
> file, as suggested by Greg.
>
> However, I think it is useful enough to have an easy system wide view
> for Per-page metadata.

It is fine to have this as a separate, informational sysfs file under
nodeN/, outside of meminfo. I just don't think that the current
implementation (where PageMetadata is a mixture of buddy and memblock
allocations) can help with the use case that motivates this change,
i.e. improving the breakdown of the kernel overhead.

> > > > > are allocated), so what would be the best way to export page metadata
> > > > > without redefining MemTotal? Keep the new field in /proc/meminfo but
> > > > > be ok that it is not part of MemTotal or do two counters? If we do two
> > > > > counters, we will still need to keep one that is a buddy allocator in
> > > > > /proc/meminfo and the other one somewhere outside?
> > > >
> >
> > I think the simplest thing to do now is to only report the buddy
> > allocations of per-page metadata in meminfo. The meaning of the new
>
> This will cause PageMetadata to be 0 on 99% of the systems, and
> essentially become useless to the vast majority of users.

I don't think it is a major issue. There are other fields (e.g. Zswap)
in meminfo that remain 0 when the feature is not used.
On 02.11.23 18:11, Pasha Tatashin wrote:
>>> Wei, noticed that all other fields in /proc/meminfo are part of
>>> MemTotal, but this new field may be not (depending where struct pages
>>
>> I could have sworn that I pointed that out in a previous version and
>> requested to document that special case in the patch description. :)
>
> Sounds, good we will document that parts of per-page may not be part
> of MemTotal.
>
>>> are allocated), so what would be the best way to export page metadata
>>> without redefining MemTotal? Keep the new field in /proc/meminfo but
>>> be ok that it is not part of MemTotal or do two counters? If we do two
>>> counters, we will still need to keep one that is a buddy allocator in
>>> /proc/meminfo and the other one somewhere outside?
>>
>> IMHO, we should just leave MemTotal alone ("memory managed by the buddy
>> that could actually mostly get freed up and reused -- although that's
>> not completely true") and have a new counter that includes any system
>> memory (MemSystem? but as we learned, as separate files), including most
>> memblock allocations/reservations as well (metadata, early pagetables,
>> initrd, kernel, ...).
>>
>> The you would actually know how much memory the system is using
>> (exclusing things like crashmem, mem=, ...).
>>
>> That part is tricky, though -- I recall there are memblock reservations
>> that are similar to the crashkernel -- which is why the current state is
>> to account memory when it's handed to the buddy under MemTotal -- which
>> is straight forward and simply.
>
> It may be simplified if we define MemSystem as all the usable memory
> provided by firmware to Linux kernel.
> For BIOS it would be the "usable" ranges in the original e820 memory
> list before it's been modified by the kernel based on the parameters.

There are some cases to consider, like "mem=", crashkernel, and some
more odd things (I believe there are some on ppc, at least for hw
tracing buffers).

All the information should be in the memblock allocator; maybe we just
have to find some way to better teach it what an allocation is for
(e.g., memmap), and what the other reasons to exclude memory are (crash
kernel, mem=, ACPI tables, odd memory holes, ...).
On Thu, Nov 2, 2023 at 4:22 PM Wei Xu <weixugc@google.com> wrote:
>
> On Thu, Nov 2, 2023 at 11:34 AM Pasha Tatashin
> <pasha.tatashin@soleen.com> wrote:
> >
> > > > > I could have sworn that I pointed that out in a previous version and
> > > > > requested to document that special case in the patch description. :)
> > > >
> > > > Sounds, good we will document that parts of per-page may not be part
> > > > of MemTotal.
> > >
> > > But this still doesn't answer how we can use the new PageMetadata
> > > field to help break down the runtime kernel overhead within MemUsed
> > > (MemTotal - MemFree).
> >
> > I am not sure it matters to the end users: they look at PageMetadata
> > with or without Page Owner, page_table_check, HugeTLB and it shows
> > exactly how much per-page overhead changed. Where the kernel allocated
> > that memory is not that important to the end user as long as that
> > memory became available to them.
> >
> > In addition, it is still possible to estimate the actual memblock part
> > of Per-page metadata by looking at /proc/zoneinfo:
> >
> > Memblock reserved per-page metadata: "present_pages - managed_pages"
>
> This assumes that all reserved memblocks are per-page metadata. As I

Right after boot, when all per-page metadata is still from memblock,
we could determine what part of the zone's reserved memory is not
per-page metadata, and use it later in our calculations.

> mentioned earlier, it is not a robust approach.
> > If there is something big that we will allocate in that range, we
> > should probably also export it in some form.
> >
> > If this field does not fit in /proc/meminfo due to not fully being
> > part of MemTotal, we could just keep it under nodeN/, as a separate
> > file, as suggested by Greg.
> >
> > However, I think it is useful enough to have an easy system wide view
> > for Per-page metadata.
>
> It is fine to have this as a separate, informational sysfs file under
> nodeN/, outside of meminfo. I just don't think as in the current
> implementation (where PageMetadata is a mixture of buddy and memblock
> allocations), it can help with the use case that motivates this
> change, i.e. to improve the breakdown of the kernel overhead.
> > > > > > are allocated), so what would be the best way to export page metadata
> > > > > > without redefining MemTotal? Keep the new field in /proc/meminfo but
> > > > > > be ok that it is not part of MemTotal or do two counters? If we do two
> > > > > > counters, we will still need to keep one that is a buddy allocator in
> > > > > > /proc/meminfo and the other one somewhere outside?
> > > > >
> > >
> > > I think the simplest thing to do now is to only report the buddy
> > > allocations of per-page metadata in meminfo. The meaning of the new
> >
> > This will cause PageMetadata to be 0 on 99% of the systems, and
> > essentially become useless to the vast majority of users.
>
> I don't think it is a major issue. There are other fields (e.g. Zswap)
> in meminfo that remain 0 when the feature is not used.

Since we are going to use two independent interfaces, the PageMetadata
field in /proc/meminfo and nodeN/page_metadata (a separate file, as
requested by Greg), how about this: in /proc/meminfo we provide only
the buddy allocator part, and in nodeN/page_metadata we provide the
total per-page overhead in the given node, which includes both memblock
reserves and buddy allocator memory?

Pasha
On Thu, Nov 2, 2023 at 6:07 PM Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
>
> On Thu, Nov 2, 2023 at 4:22 PM Wei Xu <weixugc@google.com> wrote:
> >
> > On Thu, Nov 2, 2023 at 11:34 AM Pasha Tatashin
> > <pasha.tatashin@soleen.com> wrote:
> > >
> > > > > > I could have sworn that I pointed that out in a previous version and
> > > > > > requested to document that special case in the patch description. :)
> > > > >
> > > > > Sounds, good we will document that parts of per-page may not be part
> > > > > of MemTotal.
> > > >
> > > > But this still doesn't answer how we can use the new PageMetadata
> > > > field to help break down the runtime kernel overhead within MemUsed
> > > > (MemTotal - MemFree).
> > >
> > > I am not sure it matters to the end users: they look at PageMetadata
> > > with or without Page Owner, page_table_check, HugeTLB and it shows
> > > exactly how much per-page overhead changed. Where the kernel allocated
> > > that memory is not that important to the end user as long as that
> > > memory became available to them.
> > >
> > > In addition, it is still possible to estimate the actual memblock part
> > > of Per-page metadata by looking at /proc/zoneinfo:
> > >
> > > Memblock reserved per-page metadata: "present_pages - managed_pages"
> >
> > This assumes that all reserved memblocks are per-page metadata. As I
>
> Right after boot, when all Per-page metadata is still from memblocks,
> we could determine what part of the zone reserved memory is not
> per-page, and use it later in our calculations.
>
> > mentioned earlier, it is not a robust approach.
> > > If there is something big that we will allocate in that range, we
> > > should probably also export it in some form.
> > >
> > > If this field does not fit in /proc/meminfo due to not fully being
> > > part of MemTotal, we could just keep it under nodeN/, as a separate
> > > file, as suggested by Greg.
> > >
> > > However, I think it is useful enough to have an easy system wide view
> > > for Per-page metadata.
> >
> > It is fine to have this as a separate, informational sysfs file under
> > nodeN/, outside of meminfo. I just don't think as in the current
> > implementation (where PageMetadata is a mixture of buddy and memblock
> > allocations), it can help with the use case that motivates this
> > change, i.e. to improve the breakdown of the kernel overhead.
> > > > > > > are allocated), so what would be the best way to export page metadata
> > > > > > > without redefining MemTotal? Keep the new field in /proc/meminfo but
> > > > > > > be ok that it is not part of MemTotal or do two counters? If we do two
> > > > > > > counters, we will still need to keep one that is a buddy allocator in
> > > > > > > /proc/meminfo and the other one somewhere outside?
> > > > > >
> > > >
> > > > I think the simplest thing to do now is to only report the buddy
> > > > allocations of per-page metadata in meminfo. The meaning of the new
> > >
> > > This will cause PageMetadata to be 0 on 99% of the systems, and
> > > essentially become useless to the vast majority of users.
> >
> > I don't think it is a major issue. There are other fields (e.g. Zswap)
> > in meminfo that remain 0 when the feature is not used.
>
> Since we are going to use two independent interfaces
> /proc/meminfo/PageMetadata and nodeN/page_metadata (in a separate file
> as requested by Greg) How about if in /proc/meminfo we provide only
> the buddy allocator part, and in nodeN/page_metadata we provide the
> total per-page overhead in the given node that include memblock
> reserves, and buddy allocator memory?

What we want is the system-wide breakdown of kernel memory usage.
Having the new PageMetadata counter in /proc/meminfo report only
buddy-allocated per-page metadata works for this use case.

> Pasha
> > Since we are going to use two independent interfaces
> > /proc/meminfo/PageMetadata and nodeN/page_metadata (in a separate file
> > as requested by Greg) How about if in /proc/meminfo we provide only
> > the buddy allocator part, and in nodeN/page_metadata we provide the
> > total per-page overhead in the given node that include memblock
> > reserves, and buddy allocator memory?
>
> What we want is the system-wide breakdown of kernel memory usage. It
> works for this use case with the new PageMetadata counter in
> /proc/meminfo to report only buddy-allocated per-page metadata.

We want to report all PageMetadata, otherwise this effort is going to
be useless for the majority of users.

As you noted, /proc/meminfo allows us to report only the part of
per-page metadata that was allocated by the buddy allocator, because of
an existing MemTotal bug: it does not include memblock reserves.

However, we do not have this limitation when we create a new
nodeN/page_metadata interface, and we can document that in the sysfs
ABI documentation: sum(nodeN/page_metadata) contains all per-page
metadata and is a superset of the value in /proc/meminfo.

The only question is how to name PageMetadata in /proc/meminfo
appropriately, so that users can understand that not all page metadata
is included. (Of course, we will also document that only the MemTotal
part of page metadata is reported in /proc/meminfo.)

Pasha
hi, Sourav Panda,

we are not sure if this patch is NACKed since
https://lore.kernel.org/all/2023110205-enquirer-sponge-4f35@gregkh/

but seems you still have plan for next version
https://lore.kernel.org/all/CA+CK2bCFgwLXp=pUTKezWtRoCKiDC41DqGXx_kahg0UcB53sPw@mail.gmail.com/

so still send below report to you FYI about what we observed in our tests.


Hello,

kernel test robot noticed "WARNING:at_mm/vmstat.c:#__mod_node_page_state" on:

commit: 77348e22542ef30ac2e12e111fdbe2debe4c8bf7 ("[PATCH v5 1/1] mm: report per-page metadata information")
url: https://github.com/intel-lab-lkp/linux/commits/Sourav-Panda/mm-report-per-page-metadata-information/20231102-071047
base: https://git.kernel.org/cgit/linux/kernel/git/gregkh/driver-core.git effd7c70eaa0440688b60b9d419243695ede3c45
patch link: https://lore.kernel.org/all/20231101230816.1459373-2-souravpanda@google.com/
patch subject: [PATCH v5 1/1] mm: report per-page metadata information

in testcase: kernel-selftests
version: kernel-selftests-x86_64-60acb023-1_20230329
with following parameters:

	sc_nr_hugepages: 2
	group: mm

compiler: gcc-12
test machine: 36 threads 1 sockets Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz (Cascade Lake) with 32G memory

(please refer to attached dmesg/kmsg for entire log/backtrace)


If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202311171013.fb3e52d3-oliver.sang@intel.com


kern :warn : [ 625.944628] ------------[ cut here ]------------
kern :warn : [ 625.945623] WARNING: CPU: 30 PID: 16422 at mm/vmstat.c:393 __mod_node_page_state (mm/vmstat.c:393)
kern :warn : [ 625.946550] Modules linked in: test_hmm(+) netconsole openvswitch nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 intel_rapl_msr intel_rapl_common nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp btrfs blake2b_generic xor coretemp kvm_intel raid6_pq zstd_compress kvm libcrc32c irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 rapl intel_cstate nvme nvme_core ahci t10_pi ipmi_devintf libahci ipmi_msghandler wmi_bmof mxm_wmi intel_wmi_thunderbolt crc64_rocksoft_generic i2c_i801 crc64_rocksoft intel_uncore wdat_wdt crc64 libata mei_me i2c_smbus ioatdma mei dca wmi binfmt_misc fuse drm ip_tables
kern :warn : [ 625.951800] CPU: 30 PID: 16422 Comm: modprobe Not tainted 6.6.0-rc4-00022-g77348e22542e #1
kern :warn : [ 625.952689] Hardware name: Gigabyte Technology Co., Ltd. X299 UD4 Pro/X299 UD4 Pro-CF, BIOS F8a 04/27/2021
kern :warn : [ 625.953692] RIP: 0010:__mod_node_page_state (mm/vmstat.c:393)
kern :warn : [ 625.954310] Code: 1c 24 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f c3 65 8b 05 78 ad 77 7e a9 ff ff ff 7f 75 bb 65 8b 05 9e 79 76 7e 85 c0 74 b0 <0f> 0b eb ac 49 83 fd 2c 77 7b 4e 8d 34 ed c8 a5 02 00 be 08 00 00
All code
========
   0:   1c 24                   sbb    $0x24,%al
   2:   48 83 c4 08             add    $0x8,%rsp
   6:   5b                      pop    %rbx
   7:   5d                      pop    %rbp
   8:   41 5c                   pop    %r12
   a:   41 5d                   pop    %r13
   c:   41 5e                   pop    %r14
   e:   41 5f                   pop    %r15
  10:   c3                      retq
  11:   65 8b 05 78 ad 77 7e    mov    %gs:0x7e77ad78(%rip),%eax        # 0x7e77ad90
  18:   a9 ff ff ff 7f          test   $0x7fffffff,%eax
  1d:   75 bb                   jne    0xffffffffffffffda
  1f:   65 8b 05 9e 79 76 7e    mov    %gs:0x7e76799e(%rip),%eax        # 0x7e7679c4
  26:   85 c0                   test   %eax,%eax
  28:   74 b0                   je     0xffffffffffffffda
  2a:*  0f 0b                   ud2             <-- trapping instruction
  2c:   eb ac                   jmp    0xffffffffffffffda
  2e:   49 83 fd 2c             cmp    $0x2c,%r13
  32:   77 7b                   ja     0xaf
  34:   4e 8d 34 ed c8 a5 02    lea    0x2a5c8(,%r13,8),%r14
  3b:   00
  3c:   be                      .byte 0xbe
  3d:   08 00                   or     %al,(%rax)
        ...

Code starting with the faulting instruction
===========================================
   0:   0f 0b                   ud2
   2:   eb ac                   jmp    0xffffffffffffffb0
   4:   49 83 fd 2c             cmp    $0x2c,%r13
   8:   77 7b                   ja     0x85
   a:   4e 8d 34 ed c8 a5 02    lea    0x2a5c8(,%r13,8),%r14
  11:   00
  12:   be                      .byte 0xbe
  13:   08 00                   or     %al,(%rax)
        ...
kern :warn : [ 625.956115] RSP: 0018:ffffc90000d7f548 EFLAGS: 00010202
kern :warn : [ 625.956726] RAX: 0000000000000001 RBX: 00000003ffff8000 RCX: 1ffffffff0aeddef
kern :warn : [ 625.957526] RDX: 0000000000000000 RSI: 0000000000000026 RDI: ffff88889fffe5c0
kern :warn : [ 625.958414] RBP: ffff88889ffd4000 R08: 0000000000000007 R09: fffffbfff091ebd4
kern :warn : [ 625.959207] R10: ffffffff848f5ea3 R11: 0000000000000001 R12: 00000000000427ec
kern :warn : [ 625.960008] R13: 000000000000002b R14: 0000000000000200 R15: 00000000000427c0
kern :warn : [ 625.960786] FS: 00007fca350f5740(0000) GS:ffff88880f100000(0000) knlGS:0000000000000000
kern :warn : [ 625.961664] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kern :warn : [ 625.962342] CR2: 00007f643c75d000 CR3: 00000002c7c44003 CR4: 00000000003706e0
kern :warn : [ 625.963132] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kern :warn : [ 625.963923] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kern :warn : [ 625.964702] Call Trace:
kern :warn : [ 625.965089] <TASK>
kern :warn : [ 625.965436] ? __warn (kernel/panic.c:673)
kern :warn : [ 625.965898] ? __mod_node_page_state (mm/vmstat.c:393)
kern :warn : [ 625.966450] ? report_bug (lib/bug.c:180 lib/bug.c:219)
kern :warn : [ 625.966947] ? handle_bug (arch/x86/kernel/traps.c:237)
kern :warn : [ 625.967409] ? exc_invalid_op (arch/x86/kernel/traps.c:258 (discriminator 1))
kern :warn : [ 625.967914] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:568)
kern :warn : [ 625.968445] ? __mod_node_page_state (mm/vmstat.c:393)
kern :warn : [ 625.969014] __populate_section_memmap (mm/sparse-vmemmap.c:475)
kern :warn : [ 625.969591] ? kasan_set_track (mm/kasan/common.c:52)
kern :warn : [ 625.970103] sparse_add_section (mm/sparse.c:867 mm/sparse.c:907)
kern :warn : [ 625.970628] ? sparse_buffer_alloc (mm/sparse.c:897)
kern :warn : [ 625.971177] __add_pages (mm/memory_hotplug.c:403)
kern :warn : [ 625.971650] add_pages (arch/x86/mm/init_64.c:956)
kern :warn : [ 625.972113] pagemap_range (mm/memremap.c:250)
kern :warn : [ 625.972609] ? memremap_compat_align (mm/memremap.c:163)
kern :warn : [ 625.973162] ? percpu_ref_init (arch/x86/include/asm/atomic64_64.h:20 include/linux/atomic/atomic-arch-fallback.h:2602 include/linux/atomic/atomic-long.h:79 include/linux/atomic/atomic-instrumented.h:3196 lib/percpu-refcount.c:98)
kern :warn : [ 625.973678] memremap_pages (mm/memremap.c:367)
kern :warn : [ 625.974187] ? pagemap_range (mm/memremap.c:292)
kern :warn : [ 625.974697] ? kasan_set_track (mm/kasan/common.c:52)
kern :warn : [ 625.975209] ? __kmalloc_node_track_caller (include/trace/events/kmem.h:54 include/trace/events/kmem.h:54 mm/slab_common.c:1024 mm/slab_common.c:1043)
kern :warn : [ 625.975802] dmirror_allocate_chunk (include/linux/err.h:72 lib/test_hmm.c:552) test_hmm
kern :warn : [ 625.976483] hmm_dmirror_init (lib/test_hmm.c:267) test_hmm
kern :warn : [ 625.977092] ? 0xffffffffc14b1000
kern :warn : [ 625.977539] do_one_initcall (init/main.c:1232)
kern :warn : [ 625.978044] ? trace_event_raw_event_initcall_level (init/main.c:1223)
kern :warn : [ 625.978718] ? kasan_unpoison (mm/kasan/shadow.c:160 mm/kasan/shadow.c:194)
kern :warn : [ 625.979261] do_init_module (kernel/module/main.c:2530)
kern :warn : [ 625.979761] load_module (kernel/module/main.c:2981)
kern :warn : [ 625.980267] ? post_relocation (kernel/module/main.c:2830)
kern :warn : [ 625.980782] ? kernel_read_file (arch/x86/include/asm/atomic.h:53 include/linux/atomic/atomic-arch-fallback.h:979 include/linux/atomic/atomic-instrumented.h:436 include/linux/fs.h:2740 fs/kernel_read_file.c:122)
kern :warn : [ 625.981318] ? __x64_sys_fspick (fs/kernel_read_file.c:38)
kern :warn : [ 625.981858] ? init_module_from_file (kernel/module/main.c:3148)
kern :warn : [ 625.982408] init_module_from_file (kernel/module/main.c:3148)
kern :warn : [ 625.982959] ? __ia32_sys_init_module (kernel/module/main.c:3124)
kern :warn : [ 625.983508] ? __lock_release+0x111/0x440
kern :warn : [ 625.984078] ? idempotent_init_module (kernel/module/main.c:3094 kernel/module/main.c:3159)
kern :warn : [ 625.984743] ? idempotent_init_module (kernel/module/main.c:3094 kernel/module/main.c:3159)
kern :warn : [ 625.985347] ? do_raw_spin_unlock (arch/x86/include/asm/atomic.h:23 include/linux/atomic/atomic-arch-fallback.h:444 include/linux/atomic/atomic-instrumented.h:33 include/asm-generic/qspinlock.h:57 kernel/locking/spinlock_debug.c:100 kernel/locking/spinlock_debug.c:140)
kern :warn : [ 625.985895] idempotent_init_module (kernel/module/main.c:3165)
kern :warn : [ 625.986448] ? init_module_from_file (kernel/module/main.c:3152)
kern :warn : [ 625.987029] ? security_capable (security/security.c:946 (discriminator 13))
kern :warn : [ 625.987540] __x64_sys_finit_module (include/linux/file.h:45 kernel/module/main.c:3187 kernel/module/main.c:3169 kernel/module/main.c:3169)
kern :warn : [ 625.988090] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
kern :warn : [ 625.988576] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
kern :warn : [ 625.989174] RIP: 0033:0x7fca352005a9
kern :warn : [ 625.989645] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 27 08 0d 00 f7 d8 64 89 01 48
All code
========
   0:   08 89 e8 5b 5d c3       or     %cl,-0x3ca2a418(%rcx)
   6:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
   d:   00 00 00
  10:   90                      nop
  11:   48 89 f8                mov    %rdi,%rax
  14:   48 89 f7                mov    %rsi,%rdi
  17:   48 89 d6                mov    %rdx,%rsi
  1a:   48 89 ca                mov    %rcx,%rdx
  1d:   4d 89 c2                mov    %r8,%r10
  20:   4d 89 c8                mov    %r9,%r8
  23:   4c 8b 4c 24 08          mov    0x8(%rsp),%r9
  28:   0f 05                   syscall
  2a:*  48 3d 01 f0 ff ff       cmp    $0xfffffffffffff001,%rax         <-- trapping instruction
  30:   73 01                   jae    0x33
  32:   c3                      retq
  33:   48 8b 0d 27 08 0d 00    mov    0xd0827(%rip),%rcx        # 0xd0861
  3a:   f7 d8                   neg    %eax
  3c:   64 89 01                mov    %eax,%fs:(%rcx)
  3f:   48                      rex.W

Code starting with the faulting instruction
===========================================
   0:   48 3d 01 f0 ff ff       cmp    $0xfffffffffffff001,%rax
   6:   73 01                   jae    0x9
   8:   c3                      retq
   9:   48 8b 0d 27 08 0d 00    mov    0xd0827(%rip),%rcx        # 0xd0837
  10:   f7 d8                   neg    %eax
  12:   64 89 01                mov    %eax,%fs:(%rcx)
  15:   48                      rex.W


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231117/202311171013.fb3e52d3-oliver.sang@intel.com
On Thu, Nov 16, 2023 at 6:43 PM kernel test robot <oliver.sang@intel.com> wrote:
>
> hi, Sourav Panda,
>
> we are not sure if this patch is NACKed since
> https://lore.kernel.org/all/2023110205-enquirer-sponge-4f35@gregkh/
>
> but seems you still have plan for next version
>
> https://lore.kernel.org/all/CA+CK2bCFgwLXp=pUTKezWtRoCKiDC41DqGXx_kahg0UcB53sPw@mail.gmail.com/
>
> so still send below report to you FYI about what we observed in our tests.
>
>
> Hello,
>
> kernel test robot noticed "WARNING:at_mm/vmstat.c:#__mod_node_page_state"
> on:
>
> commit: 77348e22542ef30ac2e12e111fdbe2debe4c8bf7 ("[PATCH v5 1/1] mm:
> report per-page metadata information")
> url:
> https://github.com/intel-lab-lkp/linux/commits/Sourav-Panda/mm-report-per-page-metadata-information/20231102-071047
> base: https://git.kernel.org/cgit/linux/kernel/git/gregkh/driver-core.git
> effd7c70eaa0440688b60b9d419243695ede3c45
> patch link:
> https://lore.kernel.org/all/20231101230816.1459373-2-souravpanda@google.com/
> patch subject: [PATCH v5 1/1] mm: report per-page metadata information
>
> in testcase: kernel-selftests
> version: kernel-selftests-x86_64-60acb023-1_20230329
> with following parameters:
>
> sc_nr_hugepages: 2
> group: mm
>
>
>
> compiler: gcc-12
> test machine: 36 threads 1 sockets Intel(R) Core(TM) i9-10980XE CPU @
> 3.00GHz (Cascade Lake) with 32G memory
>
> (please refer to attached dmesg/kmsg for entire log/backtrace)
>
>
>
> If you fix the issue in a separate patch/commit (i.e. not just a new
> version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <oliver.sang@intel.com>
> | Closes:
> https://lore.kernel.org/oe-lkp/202311171013.fb3e52d3-oliver.sang@intel.com
>
>
> kern :warn : [ 625.944628] ------------[ cut here ]------------
> kern :warn : [ 625.945623] WARNING: CPU: 30 PID: 16422 at mm/vmstat.c:393
> __mod_node_page_state (mm/vmstat.c:393)

Thank you for pointing this out. This will be fixed with the next patch
along with the several interface changes proposed by the community.

Thank you again.

With regards,
Sourav Panda
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 2b59cff8be17..c121f2ef9432 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -987,6 +987,7 @@ Example output. You may not have all of these fields.
     AnonPages:       4654780 kB
     Mapped:           266244 kB
     Shmem:              9976 kB
+    PageMetadata:     513419 kB
     KReclaimable:     517708 kB
     Slab:             660044 kB
     SReclaimable:     517708 kB
@@ -1089,6 +1090,8 @@ Mapped
               files which have been mmapped, such as libraries
 Shmem
               Total memory used by shared memory (shmem) and tmpfs
+PageMetadata
+              Memory used for per-page metadata
 KReclaimable
               Kernel allocations that the kernel will attempt to reclaim
               under memory pressure. Includes SReclaimable (below), and other
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 493d533f8375..da728542265f 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -428,6 +428,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 			     "Node %d Mapped:         %8lu kB\n"
 			     "Node %d AnonPages:      %8lu kB\n"
 			     "Node %d Shmem:          %8lu kB\n"
+			     "Node %d PageMetadata:   %8lu kB\n"
 			     "Node %d KernelStack:    %8lu kB\n"
 #ifdef CONFIG_SHADOW_CALL_STACK
 			     "Node %d ShadowCallStack:%8lu kB\n"
@@ -458,6 +459,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 			     nid, K(node_page_state(pgdat, NR_FILE_MAPPED)),
 			     nid, K(node_page_state(pgdat, NR_ANON_MAPPED)),
 			     nid, K(i.sharedram),
+			     nid, K(node_page_state(pgdat, NR_PAGE_METADATA)),
 			     nid, node_page_state(pgdat, NR_KERNEL_STACK_KB),
 #ifdef CONFIG_SHADOW_CALL_STACK
 			     nid, node_page_state(pgdat, NR_KERNEL_SCS_KB),
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 45af9a989d40..f141bb2a550d 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -39,7 +39,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 	long available;
 	unsigned long pages[NR_LRU_LISTS];
 	unsigned long sreclaimable, sunreclaim;
+	unsigned long nr_page_metadata;
 	int lru;
+	int nid;
 
 	si_meminfo(&i);
 	si_swapinfo(&i);
@@ -57,6 +59,10 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 	sreclaimable = global_node_page_state_pages(NR_SLAB_RECLAIMABLE_B);
 	sunreclaim = global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B);
 
+	nr_page_metadata = 0;
+	for_each_online_node(nid)
+		nr_page_metadata += node_page_state(NODE_DATA(nid), NR_PAGE_METADATA);
+
 	show_val_kb(m, "MemTotal:       ", i.totalram);
 	show_val_kb(m, "MemFree:        ", i.freeram);
 	show_val_kb(m, "MemAvailable:   ", available);
@@ -104,6 +110,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 	show_val_kb(m, "Mapped:         ",
 		    global_node_page_state(NR_FILE_MAPPED));
 	show_val_kb(m, "Shmem:          ", i.sharedram);
+	show_val_kb(m, "PageMetadata:   ", nr_page_metadata);
 	show_val_kb(m, "KReclaimable:   ", sreclaimable +
 		    global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE));
 	show_val_kb(m, "Slab:           ", sreclaimable + sunreclaim);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4106fbc5b4b3..dda1ad522324 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -207,6 +207,9 @@ enum node_stat_item {
 	PGPROMOTE_SUCCESS,	/* promote successfully */
 	PGPROMOTE_CANDIDATE,	/* candidate pages to promote */
 #endif
+	NR_PAGE_METADATA,	/* Page metadata size (struct page and page_ext)
+				 * in pages
+				 */
 	NR_VM_NODE_STAT_ITEMS
 };
 
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index fed855bae6d8..af096a881f03 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -656,4 +656,8 @@ static inline void lruvec_stat_sub_folio(struct folio *folio,
 {
 	lruvec_stat_mod_folio(folio, idx, -folio_nr_pages(folio));
 }
+
+void __init mod_node_early_perpage_metadata(int nid, long delta);
+void __init store_early_perpage_metadata(void);
+
 #endif /* _LINUX_VMSTAT_H */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1301ba7b2c9a..1778e02ed583 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1790,6 +1790,9 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
 		destroy_compound_gigantic_folio(folio, huge_page_order(h));
 		free_gigantic_folio(folio, huge_page_order(h));
 	} else {
+#ifndef CONFIG_SPARSEMEM_VMEMMAP
+		__node_stat_sub_folio(folio, NR_PAGE_METADATA);
+#endif
 		__free_pages(&folio->page, huge_page_order(h));
 	}
 }
@@ -2125,6 +2128,7 @@ static struct folio *alloc_buddy_hugetlb_folio(struct hstate *h,
 	struct page *page;
 	bool alloc_try_hard = true;
 	bool retry = true;
+	struct folio *folio;
 
 	/*
 	 * By default we always try hard to allocate the page with
@@ -2175,9 +2179,12 @@ static struct folio *alloc_buddy_hugetlb_folio(struct hstate *h,
 		__count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
 		return NULL;
 	}
-
+	folio = page_folio(page);
+#ifndef CONFIG_SPARSEMEM_VMEMMAP
+	__node_stat_add_folio(folio, NR_PAGE_METADATA);
+#endif
 	__count_vm_event(HTLB_BUDDY_PGALLOC);
-	return page_folio(page);
+	return folio;
 }
 
 /*
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 4b9734777f69..f7ca5d4dd583 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -214,6 +214,7 @@ static inline void free_vmemmap_page(struct page *page)
 		free_bootmem_page(page);
 	else
 		__free_page(page);
+	__mod_node_page_state(page_pgdat(page), NR_PAGE_METADATA, -1);
 }
 
 /* Free a list of the vmemmap pages */
@@ -335,6 +336,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
 		copy_page(page_to_virt(walk.reuse_page),
 			  (void *)walk.reuse_addr);
 		list_add(&walk.reuse_page->lru, &vmemmap_pages);
+		__mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA, 1);
 	}
 
 	/*
@@ -384,14 +386,20 @@ static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
 	unsigned long nr_pages = (end - start) >> PAGE_SHIFT;
 	int nid = page_to_nid((struct page *)start);
 	struct page *page, *next;
+	int i;
 
-	while (nr_pages--) {
+	for (i = 0; i < nr_pages; i++) {
 		page = alloc_pages_node(nid, gfp_mask, 0);
-		if (!page)
+		if (!page) {
+			__mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA,
+					      i);
 			goto out;
+		}
 		list_add_tail(&page->lru, list);
 	}
 
+	__mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA, nr_pages);
+
 	return 0;
 out:
 	list_for_each_entry_safe(page, next, list, lru)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 50f2f34745af..6997bf00945b 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -26,6 +26,7 @@
 #include <linux/pgtable.h>
 #include <linux/swap.h>
 #include <linux/cma.h>
+#include <linux/vmstat.h>
 #include "internal.h"
 #include "slab.h"
 #include "shuffle.h"
@@ -1656,6 +1657,8 @@ static void __init alloc_node_mem_map(struct pglist_data *pgdat)
 			panic("Failed to allocate %ld bytes for node %d memory map\n",
 			      size, pgdat->node_id);
 		pgdat->node_mem_map = map + offset;
+		mod_node_early_perpage_metadata(pgdat->node_id,
+						DIV_ROUND_UP(size, PAGE_SIZE));
 	}
 	pr_debug("%s: node %d, pgdat %08lx, node_mem_map %08lx\n",
 		 __func__, pgdat->node_id, (unsigned long)pgdat,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 85741403948f..522dc0c52610 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5443,6 +5443,7 @@ void __init setup_per_cpu_pageset(void)
 	for_each_online_pgdat(pgdat)
 		pgdat->per_cpu_nodestats =
 			alloc_percpu(struct per_cpu_nodestat);
+	store_early_perpage_metadata();
 }
 
 __meminit void zone_pcp_init(struct zone *zone)
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 4548fcc66d74..d8d6db9c3d75 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -201,6 +201,8 @@ static int __init alloc_node_page_ext(int nid)
 		return -ENOMEM;
 	NODE_DATA(nid)->node_page_ext = base;
 	total_usage += table_size;
+	__mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA,
+			      DIV_ROUND_UP(table_size, PAGE_SIZE));
 	return 0;
 }
 
@@ -255,12 +257,15 @@ static void *__meminit alloc_page_ext(size_t size, int nid)
 	void *addr = NULL;
 
 	addr = alloc_pages_exact_nid(nid, size, flags);
-	if (addr) {
+	if (addr)
 		kmemleak_alloc(addr, size, 1, flags);
-		return addr;
-	}
+	else
+		addr = vzalloc_node(size, nid);
 
-	addr = vzalloc_node(size, nid);
+	if (addr) {
+		mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA,
+				    DIV_ROUND_UP(size, PAGE_SIZE));
+	}
 
 	return addr;
 }
@@ -303,18 +308,27 @@ static int __meminit init_section_page_ext(unsigned long pfn, int nid)
 
 static void free_page_ext(void *addr)
 {
+	size_t table_size;
+	struct page *page;
+	struct pglist_data *pgdat;
+
+	table_size = page_ext_size * PAGES_PER_SECTION;
+
 	if (is_vmalloc_addr(addr)) {
+		page = vmalloc_to_page(addr);
+		pgdat = page_pgdat(page);
 		vfree(addr);
 	} else {
-		struct page *page = virt_to_page(addr);
-		size_t table_size;
-
-		table_size = page_ext_size * PAGES_PER_SECTION;
-
+		page = virt_to_page(addr);
+		pgdat = page_pgdat(page);
 		BUG_ON(PageReserved(page));
 		kmemleak_free(addr);
 		free_pages_exact(addr, table_size);
 	}
+
+	__mod_node_page_state(pgdat, NR_PAGE_METADATA,
+			      -1L * (DIV_ROUND_UP(table_size, PAGE_SIZE)));
+
 }
 
 static void __free_page_ext(unsigned long pfn)
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index a2cbe44c48e1..2bc67b2c2aa2 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -469,5 +469,8 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
 	if (r < 0)
 		return NULL;
 
+	__mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA,
+			      DIV_ROUND_UP(end - start, PAGE_SIZE));
+
 	return pfn_to_page(pfn);
 }
diff --git a/mm/sparse.c b/mm/sparse.c
index 77d91e565045..7f67b5486cd1 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -14,7 +14,7 @@
 #include <linux/swap.h>
 #include <linux/swapops.h>
 #include <linux/bootmem_info.h>
-
+#include <linux/vmstat.h>
 #include "internal.h"
 #include <asm/dma.h>
 
@@ -465,6 +465,9 @@ static void __init sparse_buffer_init(unsigned long size, int nid)
 	 */
 	sparsemap_buf = memmap_alloc(size, section_map_size(), addr, nid, true);
 	sparsemap_buf_end = sparsemap_buf + size;
+#ifndef CONFIG_SPARSEMEM_VMEMMAP
+	mod_node_early_perpage_metadata(nid, DIV_ROUND_UP(size, PAGE_SIZE));
+#endif
 }
 
 static void __init sparse_buffer_fini(void)
@@ -641,6 +644,8 @@ static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages,
 	unsigned long start = (unsigned long) pfn_to_page(pfn);
 	unsigned long end = start + nr_pages * sizeof(struct page);
 
+	__mod_node_page_state(page_pgdat(pfn_to_page(pfn)), NR_PAGE_METADATA,
+			      -1L * (DIV_ROUND_UP(end - start, PAGE_SIZE)));
 	vmemmap_free(start, end, altmap);
 }
 static void free_map_bootmem(struct page *memmap)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 00e81e99c6ee..070d2b3d2bcc 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1245,6 +1245,7 @@ const char * const vmstat_text[] = {
 	"pgpromote_success",
 	"pgpromote_candidate",
 #endif
+	"nr_page_metadata",
 
 	/* enum writeback_stat_item counters */
 	"nr_dirty_threshold",
@@ -2274,4 +2275,27 @@ static int __init extfrag_debug_init(void)
 }
 
 module_init(extfrag_debug_init);
+
 #endif
+
+/*
+ * Page metadata size (struct page and page_ext) in pages
+ */
+static unsigned long early_perpage_metadata[MAX_NUMNODES] __initdata;
+
+void __init mod_node_early_perpage_metadata(int nid, long delta)
+{
+	early_perpage_metadata[nid] += delta;
+}
+
+void __init store_early_perpage_metadata(void)
+{
+	int nid;
+	struct pglist_data *pgdat;
+
+	for_each_online_pgdat(pgdat) {
+		nid = pgdat->node_id;
+		__mod_node_page_state(NODE_DATA(nid), NR_PAGE_METADATA,
+				      early_perpage_metadata[nid]);
+	}
+}
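The mm/vmstat.c hunk stages early-boot metadata sizes in a plain per-node array and folds them into the node counters only once setup_per_cpu_pageset() has allocated the per-CPU node stats. The following is a minimal userspace sketch of that two-phase pattern, not kernel code: MAX_NODES, the 4 KiB PAGE_SIZE, the node_stat[] array, and the function names mod_early() and store_early() are hypothetical stand-ins for MAX_NUMNODES, the kernel's PAGE_SIZE, the per-node vmstat counter, and the two helpers added by the patch.

```c
#include <assert.h>

#define MAX_NODES 4        /* hypothetical stand-in for MAX_NUMNODES */
#define PAGE_SIZE 4096UL   /* assumption: 4 KiB base pages */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Staging array, playing the role of early_perpage_metadata[]. */
static long early_pages[MAX_NODES];

/* Stand-in for the per-node NR_PAGE_METADATA counter. */
static long node_stat[MAX_NODES];

/* Phase 1: record a delta (in pages) before the real counters exist. */
static void mod_early(int nid, long delta)
{
	early_pages[nid] += delta;
}

/* Phase 2: fold every staged value into the real counter, once. */
static void store_early(void)
{
	int nid;

	for (nid = 0; nid < MAX_NODES; nid++)
		node_stat[nid] += early_pages[nid];
}
```

In this model, staging a 10000-byte allocation on node 0 with mod_early(0, DIV_ROUND_UP(10000UL, PAGE_SIZE)) records 3 pages, and node_stat[0] stays at 0 until store_early() runs. As in the patch, the staged values are not cleared after the flush, so the flush is meant to run exactly once.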
Adds a new per-node PageMetadata field to
/sys/devices/system/node/nodeN/meminfo and a global PageMetadata field
to /proc/meminfo. Users can read these fields to see how much memory is
being used by per-page metadata, which varies with build configuration,
machine architecture, and system use.

Per-page metadata is the memory that Linux needs in order to manage
memory at page granularity. Most of it is used by the "struct page" and
"page_ext" data structures. In contrast to most other memory consumption
statistics, per-page metadata might not be included in MemTotal. For
example, MemTotal does not include memblock allocations but does include
buddy allocations, whereas per-page metadata includes both memblock and
buddy allocations.

This memory depends on build configuration, machine architecture, and
the way the system is used:

- Build configuration may add extra fields to "struct page", and may
  enable or disable "page_ext".
- Machine architecture defines the base page size: for example, 4K on
  x86, 8K on SPARC, and (optionally) 64K on ARM64. The per-page metadata
  overhead is smaller on machines with larger page sizes.
- System use can change the per-page overhead through vmemmap
  optimizations with hugetlb pages and emulated pmem devdax pages. Boot
  parameters can also determine whether page_ext needs to be allocated.
  This memory can be part of MemTotal or outside MemTotal, depending on
  whether the memory was hot-plugged, was present at boot, or is hugetlb
  memory that was returned to the system.

Suggested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Sourav Panda <souravpanda@google.com>
---
 Documentation/filesystems/proc.rst |  3 +++
 drivers/base/node.c                |  2 ++
 fs/proc/meminfo.c                  |  7 +++++++
 include/linux/mmzone.h             |  3 +++
 include/linux/vmstat.h             |  4 ++++
 mm/hugetlb.c                       | 11 ++++++++--
 mm/hugetlb_vmemmap.c               | 12 +++++++++--
 mm/mm_init.c                       |  3 +++
 mm/page_alloc.c                    |  1 +
 mm/page_ext.c                      | 32 +++++++++++++++++++++---------
 mm/sparse-vmemmap.c                |  3 +++
 mm/sparse.c                        |  7 ++++++-
 mm/vmstat.c                        | 24 ++++++++++++++++++++++
 13 files changed, 98 insertions(+), 14 deletions(-)
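
The commit description notes that per-page metadata overhead shrinks as the base page size grows. A rough worked example of that relationship follows. It is only a sketch: the helper name metadata_pages() is hypothetical, and the 64-byte per-page cost is an assumption (a common sizeof(struct page) on 64-bit builds; the real value depends on configuration and excludes page_ext).

```c
#include <assert.h>

/*
 * Number of base pages needed to hold the metadata describing
 * mem_bytes of memory, given the base page size and an assumed
 * metadata cost per page. Illustration only, not a kernel interface.
 */
static unsigned long long metadata_pages(unsigned long long mem_bytes,
					 unsigned long long page_size,
					 unsigned long long meta_per_page)
{
	unsigned long long nr_pages = mem_bytes / page_size;
	unsigned long long meta_bytes = nr_pages * meta_per_page;

	/* Round up to whole pages, as DIV_ROUND_UP does in the patch. */
	return (meta_bytes + page_size - 1) / page_size;
}
```

Under these assumptions, 32 GiB of memory with 4 KiB pages needs 131072 pages (512 MiB, about 1.56% of memory) of metadata, while the same memory with 64 KiB pages needs 512 pages (32 MiB, about 0.1%).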