Message ID | 20230512145737.985671-1-bjorn@kernel.org (mailing list archive) |
---|---|
Series | riscv: Memory Hot(Un)Plug support |
On 12.05.23 16:57, Björn Töpel wrote:
> From: Björn Töpel <bjorn@rivosinc.com>
>
> Memory Hot(Un)Plug support for the RISC-V port
> ==============================================
>
> Introduction
> ------------
>
> To quote "Documentation/admin-guide/mm/memory-hotplug.rst": "Memory
> hot(un)plug allows for increasing and decreasing the size of physical
> memory available to a machine at runtime."
>
> This series attempts to add memory hot(un)plug support for the RISC-V
> Linux port.
>
> I'm sending the series as a v1, but it's borderline RFC. It definitely
> needs more testing time, but it would be nice to get some early input.
>
> Implementation
> --------------
>
> From an arch perspective, a couple of callbacks need to be
> implemented to support hot plugging:
>
> arch_add_memory()
> This callback is responsible for updating the linear/direct map, and
> for calling into the generic memory hotplug code via __add_pages().
>
> arch_remove_memory()
> In this callback the linear/direct map is torn down.
>
> vmemmap_free()
> The function tears down the vmemmap mappings (if
> CONFIG_SPARSEMEM_VMEMMAP is in use), and also deallocates the backing
> vmemmap pages. Note that for persistent memory, an alternative
> allocator for the backing pages can be used -- the vmem_altmap. This
> means that when the backing pages are cleared, extra care is needed so
> that the correct deallocation method is used. Note that RISC-V
> populates the vmemmap using vmemmap_populate_basepages(), so currently
> no hugepages are used for the backing store.
>
> The page table unmap/teardown functions are heavily based on (copied
> from!) the x86 tree. The same remove_pgd_mapping() is used in both
> vmemmap_free() and arch_remove_memory(), but in the latter function
> the backing pages are not removed.
>
> On RISC-V, the PGD level kernel mappings need to be synchronized with
> all page-tables (e.g. via sync_kernel_mappings()). Synchronization
> involves special care, like locking. Instead, this patch series takes
> a different approach (introduced by Jörg Rödel in the x86 tree):
> pre-allocate the PGD-leaves (P4D, PUD, or PMD depending on the paging
> setup) at mem_init(), for vmemmap and the direct map.
>
> Pre-allocating the PGD-leaves wastes some memory, but is only enabled
> for CONFIG_MEMORY_HOTPLUG. The number of pages, potentially unused, is
> ~128 (4K each).
>
> Patch 1: Preparation for hotplugging support, by pre-allocating the
>          PGD leaves.
>
> Patch 2: Changes the __init attribute to __meminit, so that the
>          functions are not discarded after init. __meminit keeps the
>          functions after init, if memory hotplugging is enabled for
>          the build.
>
> Patch 3: Refactor the direct map setup, so it can be used for hot add.
>
> Patch 4: The actual add/remove code. Mostly a page-table-walk
>          exercise.
>
> Patch 5: Turn on the arch support in Kconfig
>
> Patch 6: Now that memory hotplugging is enabled, make virtio-mem
>          usable for RISC-V
>
> Patch 7: Pre-allocate vmalloc PGD-leaves as well, which removes the
>          need for vmalloc faulting.
>
> RFC
> ---
>
> * TLB flushes. The current series uses Big Hammer flush-it-all.
> * Pre-allocation vs explicit syncs
>
> Testing
> -------
>
> ACPI support is still in the making for RISC-V, so tests that involve
> CXL and similar fanciness are currently not possible. Virtio-mem,
> however, works without proper ACPI support. In order to try this out
> in Qemu, some additional patches for Qemu are needed:
>
> * Enable virtio-mem for RISC-V
> * Add proper hotplug support for virtio-mem
>
> The patch for Qemu is commit 5d90a7ef1bc0 ("hw/riscv/virt: Support
> for virtio-mem-pci"), and can be found here:
>
> https://github.com/bjoto/qemu/tree/riscv-virtio-mem
>
> I will try to upstream that work in parallel with this.
>
> Thanks to David Hildenbrand for valuable input for the Qemu side of
> things.
>
> The series is based on the RISC-V fixes tree
> https://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux.git/log/?h=fixes
>

Cool stuff! I'm fairly busy right now, so some high-level questions upfront:

What is the memory section size (which implies the memory block size)?
This implies the minimum DIMM granularity and the high-level
granularity in which virtio-mem adds memory.

What is the pageblock size, implying the minimum granularity that
virtio-mem can operate on?

On x86-64 and arm64 we currently use the ACPI SRAT to expose the maximum
physical address where we can see memory getting hotplugged. [1] From
that, we can derive the "max_possible_pfn" and prepare the kernel
virtual memory layout (especially, direct map).

Is something similar required on RISC-V? On s390x, I'm planning on
adding a paravirtualized mechanism to detect where memory devices might
be located. (I had a running RFC, but was distracted by all other kinds
of stuff)

[1] https://virtio-mem.gitlab.io/developer-guide.html

David Hildenbrand <david@redhat.com> writes:

> On 12.05.23 16:57, Björn Töpel wrote:
>> From: Björn Töpel <bjorn@rivosinc.com>
>>
>> Memory Hot(Un)Plug support for the RISC-V port
>> ==============================================

[...]

>
> Cool stuff! I'm fairly busy right now, so some high-level questions upfront:

No worries, and no rush! I'd say the v1 series was mainly for the RISC-V
folks, and I've got tons of (offline) comments from Alex -- and with
your comments below some more details to figure out.

> What is the memory section size (which implies the memory block size)?
> This implies the minimum DIMM granularity and the high-level
> granularity in which virtio-mem adds memory.

It's 128M (27 bits) -- (like arm64 and x86-64?).

> What is the pageblock size, implying the minimum granularity that
> virtio-mem can operate on?

Nothing special AFAIU; MAX_ORDER is 10, so PAGE_SIZE (4K) * 1024. Hmm, I
realize that I need to look into some more details of virtio-mem! :-)

> On x86-64 and arm64 we currently use the ACPI SRAT to expose the maximum
> physical address where we can see memory getting hotplugged. [1] From
> that, we can derive the "max_possible_pfn" and prepare the kernel
> virtual memory layout (especially, direct map).
>
> Is something similar required on RISC-V?

Yes! RISC-V is in the process of getting proper ACPI support.

Thanks for pointing me in these directions; food for thought that
I'll digest for the next version.


Cheers,
Björn

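The sizes mentioned in this reply follow from simple shifts. Below is a minimal sketch in plain userspace C (not kernel code) that takes the reply's figures at face value: 27 section-size bits, 4K pages, and a MAX_ORDER of 10. The macro and function names are made up for illustration and are not kernel identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* Figures quoted in the reply above; the names are illustrative only. */
#define EXAMPLE_SECTION_SIZE_BITS 27u /* "128M (27 bits)" */
#define EXAMPLE_PAGE_SHIFT        12u /* 4K pages */
#define EXAMPLE_MAX_ORDER         10u /* "MAX_ORDER is 10" */

/* Bytes covered by 2^shift; guards against an undefined 64-bit shift. */
static uint64_t example_bytes(unsigned int shift)
{
    assert(shift < 64);
    return UINT64_C(1) << shift;
}

/* Bytes in one memory section. */
static uint64_t example_section_bytes(void)
{
    return example_bytes(EXAMPLE_SECTION_SIZE_BITS);
}

/* Bytes in a block of 2^MAX_ORDER pages: PAGE_SIZE * 1024. */
static uint64_t example_max_order_bytes(void)
{
    return example_bytes(EXAMPLE_PAGE_SHIFT + EXAMPLE_MAX_ORDER);
}

/* How many such blocks make up one memory section. */
static uint64_t example_blocks_per_section(void)
{
    return example_section_bytes() / example_max_order_bytes();
}
```

With these figures a section is 128 MiB, a block of 2^10 pages is 4 MiB, and 32 such blocks fit in one section. Whether the pageblock size on a given RISC-V configuration actually equals the 2^10-page block is exactly the detail the reply says still needs checking.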
Hi David and Anshuman!

Björn Töpel <bjorn@kernel.org> writes:

> David Hildenbrand <david@redhat.com> writes:
>
>> On 12.05.23 16:57, Björn Töpel wrote:
>>> From: Björn Töpel <bjorn@rivosinc.com>
>>>
>>> Memory Hot(Un)Plug support for the RISC-V port
>>> ==============================================
>
> [...]
>
>>
>> Cool stuff! I'm fairly busy right now, so some high-level questions upfront:
>
> No worries, and no rush! I'd say the v1 series was mainly for the RISC-V
> folks, and I've got tons of (offline) comments from Alex -- and with
> your comments below some more details to figure out.

One of the major issues with my v1 patch is around init_mm page table
synchronization, and that'll be part of the v2.

I've noticed there's quite a difference between x86-64 and arm64 in
terms of locking, when updating (add/remove) the init_mm table. x86-64
uses the usual page table locking mechanisms (used by the generic
kernel functions), whereas arm64 does not.

How does arm64 manage to mix the "lock-less" updates (READ/WRITE_ONCE,
and fences in set_p?d+friends), with the generic kernel ones that use
the regular page locking mechanism?

I'm obviously missing something about the locking rules for memory hot
add/remove... I've been reading the arm64 memory hot add/remove
series, but am none the wiser! ;-)


Björn

On 21.05.23 11:15, Björn Töpel wrote:
> Hi David and Anshuman!
>
> Björn Töpel <bjorn@kernel.org> writes:
>
>> David Hildenbrand <david@redhat.com> writes:
>>
>>> On 12.05.23 16:57, Björn Töpel wrote:
>>>> From: Björn Töpel <bjorn@rivosinc.com>
>>>>
>>>> Memory Hot(Un)Plug support for the RISC-V port
>>>> ==============================================
>>
>> [...]
>>
>>>
>>> Cool stuff! I'm fairly busy right now, so some high-level questions upfront:
>>
>> No worries, and no rush! I'd say the v1 series was mainly for the RISC-V
>> folks, and I've got tons of (offline) comments from Alex -- and with
>> your comments below some more details to figure out.
>
> One of the major issues with my v1 patch is around init_mm page table
> synchronization, and that'll be part of the v2.
>
> I've noticed there's quite a difference between x86-64 and arm64 in
> terms of locking, when updating (add/remove) the init_mm table. x86-64
> uses the usual page table locking mechanisms (used by the generic
> kernel functions), whereas arm64 does not.
>
> How does arm64 manage to mix the "lock-less" updates (READ/WRITE_ONCE,
> and fences in set_p?d+friends), with the generic kernel ones that use
> the regular page locking mechanism?
>
> I'm obviously missing something about the locking rules for memory hot
> add/remove... I've been reading the arm64 memory hot add/remove
> series, but am none the wiser! ;-)

In general, memory hot(un)plug is serialized on a high level using the
mem_hotplug_lock. For example, in pagemap_range() or in
add_memory_resource(), we grab that lock in write mode.

So we'll never see memory getting added/removed concurrently from the
direct map.

From what I recall, the locking on the arch level is required for
concurrent (direct mapping) page table modifications that target
virtual address ranges adjacent to the ranges we hot(un)plug:
CONFIG_ARCH_HAS_SET_DIRECT_MAP and vmalloc come to mind.

For example, if a range is mapped using a large PUD, but we have to
unplug it partially (unplugging memory that is part of bootmem), we'd
have to replace the large PUD by a PMD table first. That change (which
could affect other concurrent page table walkers/operations) has to be
synchronized.

I guess to which degree this applies to riscv depends on the virtual
memory layout, direct mapping granularity and features (e.g.,
CONFIG_ARCH_HAS_SET_DIRECT_MAP).

One trick that arm64 implements is that it only allows hotunplugging
memory that was hotplugged (see prevent_bootmem_remove_notifier()).
That might just rule out the problematic cases that require locking
completely, and make the high-level mem_hotplug_lock sufficient.

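The "only unplug what was hotplugged" rule described above can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the arm64 implementation: the structure, its fields, and the function name are hypothetical, and the real check lives in a memory notifier rather than in a helper like this one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a memory block; not a kernel structure. */
struct example_mem_block {
    unsigned long start_pfn;
    unsigned long nr_pages;
    bool present_at_boot; /* true for boot memory, false if hotplugged */
};

/*
 * Returns true if no block overlapping [start_pfn, start_pfn + nr_pages)
 * is boot memory, i.e. the whole range may be removed under the policy
 * "only memory that was hotplugged can be hot-unplugged".
 */
static bool example_may_remove(const struct example_mem_block *blocks,
                               size_t n, unsigned long start_pfn,
                               unsigned long nr_pages)
{
    unsigned long end_pfn = start_pfn + nr_pages;

    assert(blocks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        unsigned long b_end = blocks[i].start_pfn + blocks[i].nr_pages;
        bool overlaps = blocks[i].start_pfn < end_pfn && start_pfn < b_end;

        /* Any overlap with boot memory vetoes the removal. */
        if (overlaps && blocks[i].present_at_boot)
            return false;
    }
    return true;
}
```

Under this policy a request that touches boot memory, even partially, is refused up front, so a large boot-time mapping never has to be split while other page table users may be running.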
From: Björn Töpel <bjorn@rivosinc.com>

Memory Hot(Un)Plug support for the RISC-V port
==============================================

Introduction
------------

To quote "Documentation/admin-guide/mm/memory-hotplug.rst": "Memory
hot(un)plug allows for increasing and decreasing the size of physical
memory available to a machine at runtime."

This series attempts to add memory hot(un)plug support for the RISC-V
Linux port.

I'm sending the series as a v1, but it's borderline RFC. It definitely
needs more testing time, but it would be nice to get some early input.

Implementation
--------------

From an arch perspective, a couple of callbacks need to be
implemented to support hot plugging:

arch_add_memory()
This callback is responsible for updating the linear/direct map, and
for calling into the generic memory hotplug code via __add_pages().

arch_remove_memory()
In this callback the linear/direct map is torn down.

vmemmap_free()
The function tears down the vmemmap mappings (if
CONFIG_SPARSEMEM_VMEMMAP is in use), and also deallocates the backing
vmemmap pages. Note that for persistent memory, an alternative
allocator for the backing pages can be used -- the vmem_altmap. This
means that when the backing pages are cleared, extra care is needed so
that the correct deallocation method is used. Note that RISC-V
populates the vmemmap using vmemmap_populate_basepages(), so currently
no hugepages are used for the backing store.

The page table unmap/teardown functions are heavily based on (copied
from!) the x86 tree. The same remove_pgd_mapping() is used in both
vmemmap_free() and arch_remove_memory(), but in the latter function
the backing pages are not removed.

On RISC-V, the PGD level kernel mappings need to be synchronized with
all page-tables (e.g. via sync_kernel_mappings()). Synchronization
involves special care, like locking. Instead, this patch series takes
a different approach (introduced by Jörg Rödel in the x86 tree):
pre-allocate the PGD-leaves (P4D, PUD, or PMD depending on the paging
setup) at mem_init(), for vmemmap and the direct map.

Pre-allocating the PGD-leaves wastes some memory, but is only enabled
for CONFIG_MEMORY_HOTPLUG. The number of pages, potentially unused, is
~128 (4K each).

Patch 1: Preparation for hotplugging support, by pre-allocating the
         PGD leaves.

Patch 2: Changes the __init attribute to __meminit, so that the
         functions are not discarded after init. __meminit keeps the
         functions after init, if memory hotplugging is enabled for
         the build.

Patch 3: Refactor the direct map setup, so it can be used for hot add.

Patch 4: The actual add/remove code. Mostly a page-table-walk
         exercise.

Patch 5: Turn on the arch support in Kconfig

Patch 6: Now that memory hotplugging is enabled, make virtio-mem
         usable for RISC-V

Patch 7: Pre-allocate vmalloc PGD-leaves as well, which removes the
         need for vmalloc faulting.

RFC
---

* TLB flushes. The current series uses Big Hammer flush-it-all.
* Pre-allocation vs explicit syncs

Testing
-------

ACPI support is still in the making for RISC-V, so tests that involve
CXL and similar fanciness are currently not possible. Virtio-mem,
however, works without proper ACPI support. In order to try this out
in Qemu, some additional patches for Qemu are needed:

* Enable virtio-mem for RISC-V
* Add proper hotplug support for virtio-mem

The patch for Qemu is commit 5d90a7ef1bc0 ("hw/riscv/virt: Support
for virtio-mem-pci"), and can be found here:

  https://github.com/bjoto/qemu/tree/riscv-virtio-mem

I will try to upstream that work in parallel with this.

Thanks to David Hildenbrand for valuable input for the Qemu side of
things.

The series is based on the RISC-V fixes tree
https://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux.git/log/?h=fixes


Thanks,
Björn

Björn Töpel (7):
  riscv: mm: Pre-allocate PGD leaves to avoid synchronization
  riscv: mm: Change attribute from __init to __meminit for page
    functions
  riscv: mm: Refactor create_linear_mapping_range() for hot add
  riscv: mm: Add memory hot add/remove support
  riscv: Enable memory hot add/remove arch kbuild support
  virtio-mem: Enable virtio-mem for RISC-V
  riscv: mm: Pre-allocate vmalloc PGD leaves

 arch/riscv/Kconfig               |   2 +
 arch/riscv/include/asm/kasan.h   |   4 +-
 arch/riscv/include/asm/mmu.h     |   2 +-
 arch/riscv/include/asm/pgtable.h |   2 +-
 arch/riscv/mm/fault.c            |   7 +-
 arch/riscv/mm/init.c             | 387 ++++++++++++++++++++++++++++---
 drivers/virtio/Kconfig           |   2 +-
 7 files changed, 364 insertions(+), 42 deletions(-)


base-commit: 3b90b09af5be42491a8a74a549318cfa265b3029
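
The memory cost of pre-allocating PGD leaves can be estimated with a little arithmetic. The sketch below is plain userspace C, not code from the series. It assumes one 4K page-table page per pre-allocated top-level entry, and it counts how many top-level entries a virtual range spans for a given shift. The function names are hypothetical, and the shift of 39 used in the note afterwards corresponds to a four-level Sv48 layout.

```c
#include <assert.h>
#include <stdint.h>

#define EXAMPLE_PAGE_SIZE 4096u /* 4K page-table pages (assumption) */

/*
 * Number of top-level (PGD) entries touched by the virtual address range
 * [start, end). pgdir_shift is the number of address bits covered by one
 * top-level entry.
 */
static uint64_t example_pgd_entries(uint64_t start, uint64_t end,
                                    unsigned int pgdir_shift)
{
    assert(pgdir_shift < 64);
    if (end <= start)
        return 0;
    return ((end - 1) >> pgdir_shift) - (start >> pgdir_shift) + 1;
}

/* Bytes spent if each entry gets one pre-allocated page-table page. */
static uint64_t example_prealloc_bytes(uint64_t entries)
{
    return entries * EXAMPLE_PAGE_SIZE;
}
```

For example, 128 pre-allocated entries cost 128 * 4096 = 524288 bytes (512 KiB), which matches the order of magnitude quoted in the cover letter. With a shift of 39, a range that crosses a single 512 GiB boundary touches two entries.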