Message ID | 010001858025fc22-e619988e-c0a5-4545-bd93-783890b9ad14-000000@email.amazonses.com (mailing list archive) |
---|---|
State | Deferred, archived |
Headers | show |
Series | Pages not released from memblock to the buddy allocator | expand |
* Aaron Thompson <dev@aaront.org> wrote: > For example, on an Amazon EC2 t3.micro VM (1 GB) booting via EFI: > > v6.2-rc2: > # grep -E 'Node|spanned|present|managed' /proc/zoneinfo > Node 0, zone DMA > spanned 4095 > present 3999 > managed 3840 > Node 0, zone DMA32 > spanned 246652 > present 245868 > managed 178867 > > v6.2-rc2 + patch: > # grep -E 'Node|spanned|present|managed' /proc/zoneinfo > Node 0, zone DMA > spanned 4095 > present 3999 > managed 3840 > Node 0, zone DMA32 > spanned 246652 > present 245868 > managed 222816 # +43,949 pages [ Note the annotation I added to the output - might be useful in the changelog too. ] So this patch adds around +17% of RAM to this 1 GB virtual system? That looks rather significant ... Thanks, Ingo
On 2023-01-05 02:48, Ingo Molnar wrote: > * Aaron Thompson <dev@aaront.org> wrote: > >> For example, on an Amazon EC2 t3.micro VM (1 GB) booting via EFI: >> >> v6.2-rc2: >> # grep -E 'Node|spanned|present|managed' /proc/zoneinfo >> Node 0, zone DMA >> spanned 4095 >> present 3999 >> managed 3840 >> Node 0, zone DMA32 >> spanned 246652 >> present 245868 >> managed 178867 >> >> v6.2-rc2 + patch: >> # grep -E 'Node|spanned|present|managed' /proc/zoneinfo >> Node 0, zone DMA >> spanned 4095 >> present 3999 >> managed 3840 >> Node 0, zone DMA32 >> spanned 246652 >> present 245868 >> managed 222816 # +43,949 pages > > [ Note the annotation I added to the output - might be useful in the > changelog too. ] > > So this patch adds around +17% of RAM to this 1 GB virtual system? That > looks rather significant ... > > Thanks, > > Ingo It is significant, but I wouldn't describe it as being added. I would say that the system is currently losing 17% of RAM due to a bug, and this patch fixes that bug. The actual numbers depend on the mappings given by the EFI, so they're largely out of our control. As an example, similar VMs that I run with the OVMF EFI lose about 3%. I couldn't say for sure which is the outlier, but my point is that the specific values are not really the focus, this is just an example that shows that the issue can be encountered in the wild with real impact. I know I'll be happy to get that memory back, whether it is 3% or 17% :) Thanks, -- Aaron
* Aaron Thompson <dev@aaront.org> wrote: > > On 2023-01-05 02:48, Ingo Molnar wrote: > > * Aaron Thompson <dev@aaront.org> wrote: > > > > > For example, on an Amazon EC2 t3.micro VM (1 GB) booting via EFI: > > > > > > v6.2-rc2: > > > # grep -E 'Node|spanned|present|managed' /proc/zoneinfo > > > Node 0, zone DMA > > > spanned 4095 > > > present 3999 > > > managed 3840 > > > Node 0, zone DMA32 > > > spanned 246652 > > > present 245868 > > > managed 178867 > > > > > > v6.2-rc2 + patch: > > > # grep -E 'Node|spanned|present|managed' /proc/zoneinfo > > > Node 0, zone DMA > > > spanned 4095 > > > present 3999 > > > managed 3840 > > > Node 0, zone DMA32 > > > spanned 246652 > > > present 245868 > > > managed 222816 # +43,949 pages > > > > [ Note the annotation I added to the output - might be useful in the > > changelog too. ] > > > > So this patch adds around +17% of RAM to this 1 GB virtual system? That > > looks rather significant ... > > > > Thanks, > > > > Ingo > > It is significant, but I wouldn't describe it as being added. I would say > that the system is currently losing 17% of RAM due to a bug, and this patch > fixes that bug. To the end-user gaining +17% [or +3%] extra usable RAM compared to what they had before is what matters, and it's a big deal. :-) Thanks, Ingo
diff --git a/mm/memblock.c b/mm/memblock.c index 511d4783dcf1..fc3d8fbd2060 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -1640,7 +1640,13 @@ void __init memblock_free_late(phys_addr_t base, phys_addr_t size) end = PFN_DOWN(base + size); for (; cursor < end; cursor++) { - memblock_free_pages(pfn_to_page(cursor), cursor, 0); + /* + * Reserved pages are always initialized by the end of + * memblock_free_all() (by memmap_init() and, if deferred + * initialization is enabled, memmap_init_reserved_pages()), so + * these pages can be released directly to the buddy allocator. + */ + __free_pages_core(pfn_to_page(cursor), 0); totalram_pages_inc(); } } diff --git a/tools/testing/memblock/internal.h b/tools/testing/memblock/internal.h index fdb7f5db7308..85973e55489e 100644 --- a/tools/testing/memblock/internal.h +++ b/tools/testing/memblock/internal.h @@ -15,6 +15,10 @@ bool mirrored_kernelcore = false; struct page {}; +void __free_pages_core(struct page *page, unsigned int order) +{ +} + void memblock_free_pages(struct page *page, unsigned long pfn, unsigned int order) {
If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, memblock_free_pages() only releases pages to the buddy allocator if they are not in the deferred range. This is correct for free pages (as defined by for_each_free_mem_pfn_range_in_zone()) because free pages in the deferred range will be initialized and released as part of the deferred init process. memblock_free_pages() is called by memblock_free_late(), which is used to free reserved ranges after memblock_free_all() has run. All pages in reserved ranges have been initialized at that point, and accordingly, those pages are not touched by the deferred init process. This means that currently, if the pages that memblock_free_late() intends to release are in the deferred range, they will never be released to the buddy allocator. They will forever be reserved. In addition, memblock_free_pages() calls kmsan_memblock_free_pages(), which is also correct for free pages but is not correct for reserved pages. KMSAN metadata for reserved pages is initialized by kmsan_init_shadow(), which runs shortly before memblock_free_all(). For both of these reasons, memblock_free_pages() should only be called for free pages, and memblock_free_late() should call __free_pages_core() directly instead. One case where this issue can occur in the wild is EFI boot on x86_64. The x86 EFI code reserves all EFI boot services memory ranges via memblock_reserve() and frees them later via memblock_free_late() (efi_reserve_boot_services() and efi_free_boot_services(), respectively). If any of those ranges happen to fall within the deferred init range, the pages will not be released and that memory will be unavailable. For example, on an Amazon EC2 t3.micro VM (1 GB) booting via EFI: v6.2-rc2: # grep -E 'Node|spanned|present|managed' /proc/zoneinfo Node 0, zone DMA spanned 4095 present 3999 managed 3840 Node 0, zone DMA32 spanned 246652 present 245868 managed 178867 v6.2-rc2 + patch: # grep -E 'Node|spanned|present|managed' /proc/zoneinfo Node 0, zone DMA spanned 4095 present 3999 managed 3840 Node 0, zone DMA32 spanned 246652 present 245868 managed 222816 Fixes: 3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set") Signed-off-by: Aaron Thompson <dev@aaront.org> --- mm/memblock.c | 8 +++++++- tools/testing/memblock/internal.h | 4 ++++ 2 files changed, 11 insertions(+), 1 deletion(-)