[0/7] riscv: Memory Hot(Un)Plug support

Message ID 20230512145737.985671-1-bjorn@kernel.org (mailing list archive)
Series riscv: Memory Hot(Un)Plug support

Message

Björn Töpel May 12, 2023, 2:57 p.m. UTC
From: Björn Töpel <bjorn@rivosinc.com>

Memory Hot(Un)Plug support for the RISC-V port
==============================================

Introduction
------------

To quote "Documentation/admin-guide/mm/memory-hotplug.rst": "Memory
hot(un)plug allows for increasing and decreasing the size of physical
memory available to a machine at runtime."

This series attempts to add memory hot(un)plug support for the RISC-V
Linux port.

I'm sending the series as a v1, but it's borderline RFC. It definitely
needs more testing time, but some early input would be nice.

Implementation
--------------

From an arch perspective, a couple of callbacks need to be
implemented to support hot plugging:

arch_add_memory()
This callback is responsible for updating the linear/direct map, and
for calling into the generic memory hot plugging code via
__add_pages().

arch_remove_memory()
In this callback the linear/direct map is torn down.

vmemmap_free()
The function tears down the vmemmap mappings (if
CONFIG_SPARSEMEM_VMEMMAP is in use), and also deallocates the backing
vmemmap pages. Note that for persistent memory, an alternative
allocator for the backing pages can be used -- the vmem_altmap. This
means that when the backing pages are cleared, extra care is needed so
that the correct deallocation method is used. Note that RISC-V
populates the vmemmap using vmemmap_populate_basepages(), so currently
no hugepages are used for the backing store.
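
For orientation, a minimal sketch of how the first two callbacks fit
together is below. The signatures match include/linux/memory_hotplug.h,
but the mapping helpers (create_linear_mapping()/remove_linear_mapping())
are hypothetical placeholders, not the functions from this series:

  /* Sketch only -- not the actual patch contents. */
  int arch_add_memory(int nid, u64 start, u64 size,
                      struct mhp_params *params)
  {
          int ret;

          /* Map the new range in the kernel's linear/direct map... */
          create_linear_mapping(start, size, params->pgprot);

          /* ...then hand it to the generic hotplug code, which
           * allocates and initializes the memmap (struct pages).
           */
          ret = __add_pages(nid, PFN_DOWN(start), size >> PAGE_SHIFT,
                            params);
          if (ret)
                  remove_linear_mapping(start, size);
          return ret;
  }

  void arch_remove_memory(u64 start, u64 size,
                          struct vmem_altmap *altmap)
  {
          /* Generic teardown first, then the direct map. The backing
           * pages themselves are not freed here; the altmap (if any)
           * tells the generic code how the memmap pages were
           * allocated, so the right deallocator is used.
           */
          __remove_pages(PFN_DOWN(start), size >> PAGE_SHIFT, altmap);
          remove_linear_mapping(start, size);
  }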

The page table unmap/teardown functions are heavily based on (copied
from!) the x86 tree. The same remove_pgd_mapping() is used in both
vmemmap_free() and arch_remove_memory(), but in the latter function
the backing pages are not removed.

On RISC-V, the PGD level kernel mappings need to be synchronized with
all page-tables (e.g. via sync_kernel_mappings()). Synchronization
requires special care, like locking. Instead, this patch series takes
a different approach (introduced by Jörg Rödel in the x86-tree):
pre-allocate the PGD-leaves (P4D, PUD, or PMD depending on the paging
setup) at mem_init(), for vmemmap and the direct map.
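
A minimal sketch of the pre-allocation idea, modeled on x86's
preallocate_vmalloc_pages() (function name and details here are
illustrative, not the exact code from patch 1):

  static void __init preallocate_pgd_pages_range(unsigned long start,
                                                 unsigned long end)
  {
          unsigned long addr;

          /* Allocate the next-level table under every PGD entry
           * covering [start, end), so the PGD itself never changes
           * after boot and never needs to be synchronized into other
           * page-tables. Folded levels make p4d_alloc() allocate
           * whatever the actual PGD-leaf level is (P4D/PUD/PMD).
           */
          for (addr = start; addr < end;
               addr = ALIGN(addr + 1, PGDIR_SIZE)) {
                  pgd_t *pgd = pgd_offset_k(addr);

                  if (!p4d_alloc(&init_mm, pgd, addr))
                          panic("Failed to pre-allocate PGD leaves");
          }
  }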

Pre-allocating the PGD-leaves wastes some memory, but is only enabled
for CONFIG_MEMORY_HOTPLUG. The number of pages, potentially unused, is
~128 * 4K (i.e. ~512K).

Patch 1: Preparation for hotplugging support, by pre-allocating the
         PGD leaves.

Patch 2: Changes the __init attribute to __meminit, so that the
         functions are not discarded after init. __meminit keeps the
         functions after init, if memory hotplugging is enabled for
         the build.

Patch 3: Refactor the direct map setup, so it can be used for hot add.

Patch 4: The actual add/remove code. Mostly a page-table-walk
         exercise.

Patch 5: Turn on the arch support in Kconfig.

Patch 6: Now that memory hotplugging is enabled, make virtio-mem
         usable for RISC-V.

Patch 7: Pre-allocate vmalloc PGD-leaves as well, which removes the
         need for vmalloc faulting.
         
RFC
---

 * TLB flushes. The current series uses Big Hammer flush-it-all.
 * Pre-allocation vs explicit syncs

Testing
-------

ACPI support is still in the making for RISC-V, so tests that involve
CXL and similar fanciness are currently not possible. Virtio-mem,
however, works without proper ACPI support. In order to try this out
in Qemu, some additional patches for Qemu are needed:

 * Enable virtio-mem for RISC-V
 * Add proper hotplug support for virtio-mem
 
The Qemu patch is commit 5d90a7ef1bc0 ("hw/riscv/virt: Support for
virtio-mem-pci"), and can be found here:

  https://github.com/bjoto/qemu/tree/riscv-virtio-mem
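
With that Qemu, a virtio-mem device can be instantiated roughly along
these lines (an illustrative, untested invocation; the kernel image
path is a placeholder and flag availability depends on the tree above):

  qemu-system-riscv64 -M virt -nographic -smp 2 \
      -m 4G,maxmem=8G \
      -kernel Image \
      -object memory-backend-ram,id=vmem0,size=4G \
      -device virtio-mem-pci,id=vm0,memdev=vmem0,requested-size=1G

The plugged amount can then be resized at runtime from the Qemu
monitor, e.g. "qom-set vm0 requested-size 2G".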

I will try to upstream that work in parallel with this.
  
Thanks to David Hildenbrand for valuable input for the Qemu side of
things.

The series is based on the RISC-V fixes tree
  https://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux.git/log/?h=fixes


Thanks,
Björn


Björn Töpel (7):
  riscv: mm: Pre-allocate PGD leaves to avoid synchronization
  riscv: mm: Change attribute from __init to __meminit for page
    functions
  riscv: mm: Refactor create_linear_mapping_range() for hot add
  riscv: mm: Add memory hot add/remove support
  riscv: Enable memory hot add/remove arch kbuild support
  virtio-mem: Enable virtio-mem for RISC-V
  riscv: mm: Pre-allocate vmalloc PGD leaves

 arch/riscv/Kconfig               |   2 +
 arch/riscv/include/asm/kasan.h   |   4 +-
 arch/riscv/include/asm/mmu.h     |   2 +-
 arch/riscv/include/asm/pgtable.h |   2 +-
 arch/riscv/mm/fault.c            |   7 +-
 arch/riscv/mm/init.c             | 387 ++++++++++++++++++++++++++++---
 drivers/virtio/Kconfig           |   2 +-
 7 files changed, 364 insertions(+), 42 deletions(-)


base-commit: 3b90b09af5be42491a8a74a549318cfa265b3029

Comments

David Hildenbrand May 17, 2023, 1:49 p.m. UTC | #1
On 12.05.23 16:57, Björn Töpel wrote:
> From: Björn Töpel <bjorn@rivosinc.com>
> 
> Memory Hot(Un)Plug support for the RISC-V port
> ==============================================

[...]

Cool stuff! I'm fairly busy right now, so some high-level questions upfront:

What is the memory section size (which implies the memory block
size)? This implies the minimum DIMM granularity and the high-level
granularity in which virtio-mem adds memory.

What is the pageblock size, implying the minimum granularity that 
virtio-mem can operate on?

On x86-64 and arm64 we currently use the ACPI SRAT to expose the maximum 
physical address where we can see memory getting hotplugged. [1] From 
that, we can derive the "max_possible_pfn" and prepare the kernel 
virtual memory layout (especially, direct map).

Is something similar required on RISC-V? On s390x, I'm planning on 
adding a paravirtualized mechanism to detect where memory devices might 
be located. (I had a running RFC, but was distracted by all other kinds 
of stuff)


[1] https://virtio-mem.gitlab.io/developer-guide.html
Björn Töpel May 17, 2023, 6:53 p.m. UTC | #2
David Hildenbrand <david@redhat.com> writes:

> On 12.05.23 16:57, Björn Töpel wrote:
>> From: Björn Töpel <bjorn@rivosinc.com>
>> 
>> Memory Hot(Un)Plug support for the RISC-V port
>> ==============================================

[...]

>
> Cool stuff! I'm fairly busy right now, so some high-level questions upfront:

No worries, and no rush! I'd say the v1 series was mainly for the RISC-V
folks, and I've got tons of (offline) comments from Alex -- and with
your comments below some more details to figure out.

> What is the memory section size (which implies the memory block
> size)? This implies the minimum DIMM granularity and the high-level
> granularity in which virtio-mem adds memory.

It's 128M (27 bits) -- (like arm64 and x86-64?).
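
For reference, this follows from the arch's sparsemem definition (as
of this writing):

  /* arch/riscv/include/asm/sparsemem.h */
  #define SECTION_SIZE_BITS 27    /* 2^27 = 128M per memory section */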

> What is the pageblock size, implying the minimum granularity that 
> virtio-mem can operate on?

Nothing special AFAIU; MAX_ORDER is 10, so PAGE_SIZE (4K) * 1024 = 4M.
Hmm, I realize that I need to look into some more details of
virtio-mem! :-)

> On x86-64 and arm64 we currently use the ACPI SRAT to expose the maximum 
> physical address where we can see memory getting hotplugged. [1] From 
> that, we can derive the "max_possible_pfn" and prepare the kernel 
> virtual memory layout (especially, direct map).
>
> Is something similar required on RISC-V?

Yes! RISC-V is in the process of getting proper ACPI support. Thanks
for pointing me in these directions; food for thought that I'll
digest for the next version.


Cheers,
Björn
Björn Töpel May 21, 2023, 9:15 a.m. UTC | #3
Hi David and Anshuman!

Björn Töpel <bjorn@kernel.org> writes:

> David Hildenbrand <david@redhat.com> writes:
>
>> On 12.05.23 16:57, Björn Töpel wrote:
>>> From: Björn Töpel <bjorn@rivosinc.com>
>>> 
>>> Memory Hot(Un)Plug support for the RISC-V port
>>> ==============================================
>
> [...]
>
>>
>> Cool stuff! I'm fairly busy right now, so some high-level questions upfront:
>
> No worries, and no rush! I'd say the v1 series was mainly for the RISC-V
> folks, and I've got tons of (offline) comments from Alex -- and with
> your comments below some more details to figure out.

One of the major issues with my v1 patch is around init_mm page table
synchronization, and that'll be part of the v2.

I've noticed there's quite a difference between x86-64 and arm64 in
terms of locking when updating (add/remove) the init_mm table. x86-64
uses the usual page table locking mechanisms (used by the generic
kernel functions), whereas arm64 does not.

How does arm64 manage to mix the "lock-less" updates (READ/WRITE_ONCE,
and fences in set_p?d+friends) with the generic kernel ones that use
the regular page locking mechanism?
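
For reference, arm64's "lock-less" store is roughly the following
(simplified from arch/arm64/include/asm/pgtable.h; the real set_pmd()
carries additional checks):

  static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
  {
          WRITE_ONCE(*pmdp, pmd);
          dsb(ishst);     /* make the store visible to the MMU walker */
          isb();
  }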

I'm obviously missing something about the locking rules for memory hot
add/remove... I've been reading the arm64 memory hot add/remove
series, but I'm none the wiser! ;-)


Björn
David Hildenbrand May 22, 2023, 8:21 a.m. UTC | #4
On 21.05.23 11:15, Björn Töpel wrote:
> Hi David and Anshuman!
> 
> [...]
> 
> One of the major issues with my v1 patch is around init_mm page table
> synchronization, and that'll be part of the v2.
> 
> I've noticed there's quite a difference between x86-64 and arm64 in
> terms of locking when updating (add/remove) the init_mm table. x86-64
> uses the usual page table locking mechanisms (used by the generic
> kernel functions), whereas arm64 does not.
> 
> How does arm64 manage to mix the "lock-less" updates (READ/WRITE_ONCE,
> and fences in set_p?d+friends) with the generic kernel ones that use
> the regular page locking mechanism?
> 
> I'm obviously missing something about the locking rules for memory hot
> add/remove... I've been reading the arm64 memory hot add/remove
> series, but I'm none the wiser! ;-)

In general, memory hot(un)plug is serialized on a high level using the 
mem_hotplug_lock. For example, in pagemap_range() or in 
add_memory_resource(), we grab that lock in write mode. So we'll never 
see memory getting added/removed concurrently from the direct map.
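
Schematically (see mm/memory_hotplug.c; the helper names are real, the
surrounding code is condensed):

  mem_hotplug_begin();    /* percpu_down_write(&mem_hotplug_lock) */
  ret = arch_add_memory(nid, start, size, &params);
  /* memory block creation, memmap setup, etc. happen in between */
  mem_hotplug_done();     /* percpu_up_write(&mem_hotplug_lock) */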

From what I recall, the locking on the arch level is required for 
concurrent (direct mapping) page table modifications that target virtual 
address ranges adjacent to the ranges we hot(un)plug:
CONFIG_ARCH_HAS_SET_DIRECT_MAP and vmalloc come to mind.

For example, if a range would be mapped using a large PUD, but we have
to unplug it partially (unplugging memory that was part of bootmem),
we'd have to replace the large PUD by a PMD table first. That change
(which could affect other concurrent page table walkers/operations)
has to be synchronized.

I guess the degree to which this applies to riscv depends on the
virtual memory layout, direct mapping granularity and features (e.g.,
CONFIG_ARCH_HAS_SET_DIRECT_MAP).


One trick that arm64 implements is that it only allows hotunplugging
memory that was hotplugged (see prevent_bootmem_remove_notifier()). That
might just rule out such problematic cases that require locking
completely, making the high-level mem_hotplug_lock sufficient.
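
A condensed sketch of that trick (the real
prevent_bootmem_remove_notifier() in arch/arm64/mm/mmu.c carries a few
more checks):

  static int prevent_bootmem_remove_notifier(struct notifier_block *nb,
                                             unsigned long action,
                                             void *data)
  {
          struct memory_notify *arg = data;
          unsigned long pfn;

          if (action != MEM_GOING_OFFLINE)
                  return NOTIFY_OK;

          /* Reject offlining of any section present since boot. */
          for (pfn = arg->start_pfn;
               pfn < arg->start_pfn + arg->nr_pages; pfn++) {
                  if (early_section(__pfn_to_section(pfn)))
                          return NOTIFY_BAD;
          }
          return NOTIFY_OK;
  }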