[RFC,v4,00/13] virtio-mem: paravirtualized memory

Message ID 20191212171137.13872-1-david@redhat.com

Message

David Hildenbrand Dec. 12, 2019, 5:11 p.m. UTC
This series is based on latest linux-next. The patches are located at:
    https://github.com/davidhildenbrand/linux.git virtio-mem-rfc-v4

The basic idea of virtio-mem is to provide a flexible,
cross-architecture memory hot(un)plug solution that avoids many limitations
imposed by existing technologies, architectures, and interfaces. More
details can be found below and in linked material.

This RFC is limited to x86-64; however, it should theoretically work on any
architecture that supports virtio and implements memory hot(un)plug under
Linux - like s390x, powerpc64 and arm64. On x86-64, it is currently
possible to add/remove memory to/from the system in >= 4MB granularity.
Memory hotplug works very reliably. For memory unplug, there are no
guarantees on how much memory can actually get unplugged; it depends on the
setup (especially: fragmentation of (unmovable) memory). I have plans to
improve that in the future.

--------------------------------------------------------------------------
1. virtio-mem
--------------------------------------------------------------------------

The basic idea behind virtio-mem was presented at KVM Forum 2018. The
slides can be found at [1]. The previous RFC can be found at [2]. The
first RFC can be found at [3]. However, the concept evolved over time. The
KVM Forum slides roughly match the current design.

Patch #2 ("virtio-mem: Paravirtualized memory hotplug") contains quite a bit
of information, especially in "include/uapi/linux/virtio_mem.h":

  Each virtio-mem device manages a dedicated region in physical address
  space. Each device can belong to a single NUMA node; multiple devices
  for a single NUMA node are possible. A virtio-mem device is like a
  "resizable DIMM" consisting of small memory blocks that can be plugged
  or unplugged. The device driver is responsible for (un)plugging memory
  blocks on demand.

  Virtio-mem devices can only operate on their assigned memory region in
  order to (un)plug memory. A device cannot (un)plug memory belonging to
  other devices.

  The "region_size" corresponds to the maximum amount of memory that can
  be provided by a device. The "size" corresponds to the amount of memory
  that is currently plugged. "requested_size" corresponds to a request
  from the device to the device driver to (un)plug blocks. The
  device driver should try to (un)plug blocks in order to reach the
  "requested_size". It is impossible to plug more memory than requested.

  The "usable_region_size" represents the memory region that can actually
  be used to (un)plug memory. It is always at least as big as the
  "requested_size" and will grow dynamically. It will only shrink when
  explicitly triggered (VIRTIO_MEM_REQ_UNPLUG).

  Memory in the usable region can usually be read; however, there are no
  guarantees. It can happen that the device cannot process a request
  because it is busy. The device driver has to retry later.

  Usually, during system resets all memory will get unplugged, so the
  device driver can start with a clean state. However, in specific
  scenarios (if the device is busy) it can happen that the device still
  has memory plugged. The device driver can request to unplug all memory
  (VIRTIO_MEM_REQ_UNPLUG) - which might take a while to succeed if the
  device is busy.
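
For orientation, the device configuration described above maps to a config
space roughly like the following sketch. The authoritative layout, field
order, and types are in "include/uapi/linux/virtio_mem.h" from patch #2;
the comments restate the semantics from above:

  struct virtio_mem_config {
          /* Block size and alignment; cannot change. */
          __le64 block_size;
          /* Valid with VIRTIO_MEM_F_ACPI_PXM; cannot change. */
          __le16 node_id;
          __u8 padding[6];
          /* Start address of the device-managed memory region. */
          __le64 addr;
          /* Maximum amount of memory the device can provide. */
          __le64 region_size;
          /* Grows dynamically; only shrinks when explicitly triggered. */
          __le64 usable_region_size;
          /* Amount of memory currently plugged ("size" above). */
          __le64 plugged_size;
          /* Target size requested by the device ("requested_size"). */
          __le64 requested_size;
  };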

--------------------------------------------------------------------------
2. Linux Implementation
--------------------------------------------------------------------------

This RFC reuses quite a bit of existing MM infrastructure; however, it has
to expose some additional functionality.

Memory blocks (e.g., 128MB) are added/removed on demand. Within these
memory blocks, subblocks (e.g., 4MB) are plugged/unplugged. The sizes
depend on the target architecture, MAX_ORDER + pageblock_order, and
the block size of a virtio-mem device.
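
On x86-64 with 4k pages, this results in 4MB subblocks. A sketch of the
computation (a condensed form of what the driver does; "vm" and its fields
are illustrative here):

  /*
   * A subblock must span complete MAX_ORDER chunks and complete
   * pageblocks, so (un)plugging subblocks never splits up buddy or
   * pageblock units - and it cannot be smaller than the device
   * block size.
   */
  sb_size = max_t(uint64_t, MAX_ORDER_NR_PAGES,
                  pageblock_nr_pages) * PAGE_SIZE;
  sb_size = max_t(uint64_t, sb_size, vm->device_block_size);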

add_memory()/try_remove_memory() is used to add/remove memory blocks.
virtio-mem will not online memory blocks itself. This has to be done by
user space, or configured into the kernel
(CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE). virtio-mem will only unplug memory
that was onlined to ZONE_NORMAL. It is suggested to online memory to
ZONE_NORMAL for now.
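
As a sketch (helper names are illustrative; add_memory() and
memory_block_size_bytes() are the existing MM interfaces), adding a memory
block boils down to:

  /* Add one memory block of the device-managed region to Linux. */
  static int virtio_mem_mb_add(struct virtio_mem *vm, unsigned long mb_id)
  {
          const uint64_t addr = virtio_mem_mb_id_to_phys(mb_id);

          /* Onlining is left to user space / the configured default. */
          return add_memory(vm->nid, addr, memory_block_size_bytes());
  }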

The memory hotplug notifier is used to properly synchronize against
onlining/offlining of memory blocks and to track the states of memory
blocks (including the zone memory blocks are onlined to).

The set_online_page() callback is used to keep unplugged subblocks
of a memory block fake-offline when onlining the memory block.
generic_online_page() is used to fake-online plugged subblocks. This
handling is similar to the Hyper-V balloon driver.
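
A sketch of that callback (illustrative helpers; it is registered via
set_online_page_callback() while a memory block of the device is being
onlined):

  /*
   * Called for each contiguous range while a memory block is onlined:
   * hand plugged subblocks to the buddy, keep unplugged subblocks
   * fake-offline (referenced and marked PG_offline, see below).
   */
  static void virtio_mem_online_page_cb(struct page *page, unsigned int order)
  {
          const unsigned long pfn = page_to_pfn(page);

          if (virtio_mem_sb_plugged(pfn))         /* illustrative */
                  generic_online_page(page, order);
          else
                  virtio_mem_set_fake_offline(pfn, 1 << order);
  }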

PG_offline is used to mark unplugged subblocks as offline, so e.g.,
dumping tools (makedumpfile) will skip these pages. This is similar to
other balloon drivers like virtio-balloon and Hyper-V.

Memory offlining code is extended to allow drivers to drop their reference
to PG_offline pages on MEM_GOING_OFFLINE, so these pages can be skipped
when offlining memory blocks. This makes it possible to offline memory
blocks that have partially unplugged (allocated, e.g., via
alloc_contig_range()) subblocks - or are completely unplugged.
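
A sketch of the notifier side of this (illustrative helper names; struct
memory_notify provides the PFN range being offlined):

  static int virtio_mem_memory_notifier_cb(struct notifier_block *nb,
                                           unsigned long action, void *arg)
  {
          struct memory_notify *mhp = arg;

          switch (action) {
          case MEM_GOING_OFFLINE:
                  /* Drop the reference to fake-offline (PG_offline) pages. */
                  virtio_mem_fake_offline_going_offline(mhp->start_pfn,
                                                        mhp->nr_pages);
                  break;
          case MEM_CANCEL_OFFLINE:
                  /* Offlining failed after all: re-take the references. */
                  virtio_mem_fake_offline_cancel_offline(mhp->start_pfn,
                                                         mhp->nr_pages);
                  break;
          }
          return NOTIFY_OK;
  }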

alloc_contig_range()/free_contig_range() [now exposed] is used to
unplug/plug subblocks of memory blocks that are already exposed to Linux.
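
A sketch of unplugging such a subblock (illustrative; the real driver also
negotiates the unplug with the device and handles errors):

  static int virtio_mem_sb_unplug(unsigned long pfn, unsigned long nr_pages)
  {
          int rc;

          /* Migrate everything away and allocate the range ourselves. */
          rc = alloc_contig_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE,
                                  GFP_KERNEL);
          if (rc)
                  return rc;
          /* Mark the pages fake-offline and request unplug. */
          virtio_mem_set_fake_offline(pfn, nr_pages);
          return 0;
  }

Plugging a subblock back is the reverse: once the device granted the
request, free_contig_range(pfn, nr_pages) hands the range back to the
buddy.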

offline_and_remove_memory() [new] is used to offline a fully unplugged
memory block and remove it from Linux.
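
Usage is then simply (sketch, reusing the illustrative helpers from above):

  /* Offline a fully unplugged memory block and remove it from Linux. */
  rc = offline_and_remove_memory(vm->nid, virtio_mem_mb_id_to_phys(mb_id),
                                 memory_block_size_bytes());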


A lot of additional information can be found in the separate patches and
as comments in the code itself.

--------------------------------------------------------------------------
3. Changes RFC v2 -> v3
--------------------------------------------------------------------------

A lot of things changed, especially also on the QEMU + virtio side. The
biggest changes on the Linux driver side are:
- Onlining/offlining of subblocks is now emulated on top of memory blocks.
  set_online_page()+alloc_contig_range()+free_contig_range() is now used
  for that. Core MM does not have to be modified and will continue to
  online/offline full memory blocks.
- Onlining/offlining of memory blocks is no longer performed by virtio-mem.
- PG_offline is upstream and can be used. It is also used to allow
  offlining of partially unplugged memory blocks.
- Memory block states + subblocks are now tracked more space-efficiently.
- Proper kexec(), kdump(), driver unload, driver reload, ZONE_MOVABLE, ...
  handling.

--------------------------------------------------------------------------
4. Changes RFC v3 -> RFC v4
--------------------------------------------------------------------------

Only minor things changed - especially, nothing changed on the QEMU +
virtio side. Interesting changes on the Linux driver side are:
- "mm: Allow to offline unmovable PageOffline() pages via
   MEM_GOING_OFFLINE"
-- Reworked following Michal's suggestion (allow isolating all
   PageOffline() pages by skipping them in has_unmovable_pages(); fail
   offlining later if the pages cannot be offlined/migrated).
- "virtio-mem: Allow to offline partially unplugged memory blocks"
-- Adapted to Michal's suggestion on the core-mm part.
- "virtio-mem: Better retry handling"
-- Optimize retry intervals
- "virtio-mem: Drop slab objects when unplug continues to fail"
-- Call drop_slab()/drop_slab_node() when unplug keeps failing for a longer
   time.
- Multiple cleanups and fixes.

--------------------------------------------------------------------------
5. Future work
--------------------------------------------------------------------------

The separate patches contain a lot of future work items. One of the next
steps is to make memory unplug more likely to succeed - currently, there
are no guarantees on how much memory can get unplugged again. I have
various ideas on how to limit fragmentation of all memory blocks that
virtio-mem added.

Memory hotplug:
- Reduce the number of memory resources if that turns out to be an
  issue. Or try to speed up relevant code paths to deal with many
  resources.
- Allocate the vmemmap from the added memory. This makes hotplug more
  likely to succeed, stores the vmemmap on the same NUMA node, and ensures
  that this unmovable memory will not hinder unplug later.

Memory hotunplug:
- Performance improvements:
-- Sense (locklessly) whether it makes sense to try alloc_contig_range() at
   all before trying to isolate pages and take locks.
-- Try to unplug bigger chunks first if possible.
-- Identify free areas first that don't have to be evacuated.
- Make unplug more likely to succeed:
-- There are various ideas to limit fragmentation at memory block
   granularity (e.g., ZONE_PREFER_MOVABLE and smart balancing).
-- Allocate memmap from added memory. This way, less unmovable data can
   end up on the memory blocks.
- OOM handling, e.g., via an OOM handler.
- Defragmentation
-- Will require a new virtio-mem CMD to exchange plugged<->unplugged blocks

--------------------------------------------------------------------------
6. Example Usage
--------------------------------------------------------------------------

A very basic QEMU prototype (kept updated) is available at:
    https://github.com/davidhildenbrand/qemu.git virtio-mem

It lacks various features; however, it suffices to test the guest driver
side:
- No support for resizable memory regions / memory backends yet
- No protection of unplugged memory (esp., userfaultfd-wp) yet
- No dump/migration/XXX optimizations to skip unplugged memory (and avoid
  touching it)

Start QEMU with two virtio-mem devices (one per NUMA node):
 $ qemu-system-x86_64 -m 4G,maxmem=20G \
  -smp sockets=2,cores=2 \
  -numa node,nodeid=0,cpus=0-1 -numa node,nodeid=1,cpus=2-3 \
  [...]
  -object memory-backend-ram,id=mem0,size=8G \
  -device virtio-mem-pci,id=vm0,memdev=mem0,node=0,requested-size=128M \
  -object memory-backend-ram,id=mem1,size=8G \
  -device virtio-mem-pci,id=vm1,memdev=mem1,node=1,requested-size=80M

Query the configuration:
 QEMU 4.1.95 monitor - type 'help' for more information
 (qemu) info memory-devices
 Memory device [virtio-mem]: "vm0"
   memaddr: 0x140000000
   node: 0
   requested-size: 134217728
   size: 134217728
   max-size: 8589934592
   block-size: 2097152
   memdev: /objects/mem0
 Memory device [virtio-mem]: "vm1"
   memaddr: 0x340000000
   node: 1
   requested-size: 83886080
   size: 83886080
   max-size: 8589934592
   block-size: 2097152
   memdev: /objects/mem1

Add some memory to node 1:
 QEMU 4.1.95 monitor - type 'help' for more information
 (qemu) qom-set vm1 requested-size 1G

Remove some memory from node 0:
 QEMU 4.1.95 monitor - type 'help' for more information
 (qemu) qom-set vm0 requested-size 64M

Query the configuration again:
 (qemu) info memory-devices
 Memory device [virtio-mem]: "vm0"
   memaddr: 0x140000000
   node: 0
   requested-size: 67108864
   size: 67108864
   max-size: 8589934592
   block-size: 2097152
   memdev: /objects/mem0
 Memory device [virtio-mem]: "vm1"
   memaddr: 0x340000000
   node: 1
   requested-size: 1073741824
   size: 1073741824
   max-size: 8589934592
   block-size: 2097152
   memdev: /objects/mem1

--------------------------------------------------------------------------
7. Q/A
--------------------------------------------------------------------------

Q: Why add/remove parts ("subblocks") of memory blocks/sections?
A: Flexibility (section size depends on the architecture) - e.g., some
   architectures have a section size of 2GB. Also, the memory block size
   is variable (e.g., on x86-64). I want to avoid any such restrictions.
   Some use cases want to add/remove memory in smaller granularity to a
   VM (e.g., the Hyper-V balloon also implements this) - especially for
   smaller VMs such as those used for Kata Containers. Also, on memory
   unplug, it is more reliable to free up and unplug multiple small chunks
   instead of one big chunk. E.g., if one page of a DIMM is either unmovable
   or pinned, the DIMM can't get unplugged. This approach is basically a
   compromise between DIMM-based memory hot(un)plug and balloon
   inflation/deflation, which works mostly on page granularity.

Q: Why care about memory blocks?
A: They are the way to tell user space about new memory. This way,
   memory can get onlined/offlined by user space. Also, e.g., kdump
   relies on udev events to reload kexec when memory blocks are
   onlined/offlined. Memory blocks are the "real" memory hot(un)plug
   granularity. Everything that's smaller has to be emulated "on top".

Q: Won't memory unplug of subblocks fragment memory?
A: Yes and no. Unplugging, e.g., >=4MB subblocks on x86-64 will not really
   fragment memory the way unplugging random pages (as a balloon driver
   does) would.
   Buddy merging will not be limited. However, any allocation that requires
   bigger consecutive memory chunks (e.g., gigantic pages) might observe
   the fragmentation. Possible solutions: Allocate gigantic huge pages
   before unplugging memory, don't unplug memory, combine virtio-mem with
   DIMM based memory or bigger initial memory. Remember, a virtio-mem
   device will only unplug on the memory range it manages, not on other
   DIMMs. Unplug of single memory blocks will result in similar
   fragmentation with respect to gigantic huge pages. I have plans for a
   virtio-mem defragmentation feature in the future.

Q: How reliable is memory unplug?
A: There are no guarantees on how much memory can get unplugged
   again. However, it is more likely to find 4MB chunks to unplug than
   e.g., 128MB chunks. If memory is terribly fragmented, there is nothing
   we can do - for now. I consider memory hotplug the primary use case
   of virtio-mem. Memory unplug will usually work, but we want to improve
   the performance and the amount of memory we can actually unplug later.

Q: Why not unplug from the ZONE_MOVABLE?
A: Unplugged memory chunks are unmovable, and unmovable data must not end
   up in ZONE_MOVABLE - similar to gigantic pages, which will also never
   be allocated from ZONE_MOVABLE. virtio-mem added memory can be onlined
   to ZONE_MOVABLE, but subblocks will not get unplugged from it.

Q: How big should the initial (!virtio-mem) memory of a VM be?
A: virtio-mem memory will not go to the DMA zones. So to avoid running out
   of DMA memory, I suggest something like 2-3GB on x86-64. But many
   VMs can most probably deal with less DMA memory - it depends on the use
   case.

[1] https://events.linuxfoundation.org/wp-content/uploads/2017/12/virtio-mem-Paravirtualized-Memory-David-Hildenbrand-Red-Hat-1.pdf
[2] https://lkml.kernel.org/r/20190919142228.5483-1-david@redhat.com
[3] https://lkml.kernel.org/r/547865a9-d6c2-7140-47e2-5af01e7d761d@redhat.com

Cc: Sebastien Boeuf  <sebastien.boeuf@intel.com>
Cc: Samuel Ortiz <samuel.ortiz@intel.com>
Cc: Robert Bradford <robert.bradford@intel.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>

David Hildenbrand (13):
  ACPI: NUMA: export pxm_to_node
  virtio-mem: Paravirtualized memory hotplug
  virtio-mem: Paravirtualized memory hotunplug part 1
  mm: Export alloc_contig_range() / free_contig_range()
  virtio-mem: Paravirtualized memory hotunplug part 2
  mm: Allow to offline unmovable PageOffline() pages via
    MEM_GOING_OFFLINE
  virtio-mem: Allow to offline partially unplugged memory blocks
  mm/memory_hotplug: Introduce offline_and_remove_memory()
  virtio-mem: Offline and remove completely unplugged memory blocks
  virtio-mem: Better retry handling
  mm/vmscan: Move count_vm_event(DROP_SLAB) into drop_slab()
  mm/vmscan: Export drop_slab() and drop_slab_node()
  virtio-mem: Drop slab objects when unplug continues to fail

 drivers/acpi/numa/srat.c        |    1 +
 drivers/virtio/Kconfig          |   18 +
 drivers/virtio/Makefile         |    1 +
 drivers/virtio/virtio_mem.c     | 1939 +++++++++++++++++++++++++++++++
 fs/drop_caches.c                |    4 +-
 include/linux/memory_hotplug.h  |    1 +
 include/linux/mm.h              |    4 +-
 include/linux/page-flags.h      |   10 +
 include/uapi/linux/virtio_ids.h |    1 +
 include/uapi/linux/virtio_mem.h |  204 ++++
 mm/memory_hotplug.c             |   76 +-
 mm/page_alloc.c                 |   26 +
 mm/page_isolation.c             |    9 +
 mm/vmscan.c                     |    3 +
 14 files changed, 2282 insertions(+), 15 deletions(-)
 create mode 100644 drivers/virtio/virtio_mem.c
 create mode 100644 include/uapi/linux/virtio_mem.h

Comments

Konrad Rzeszutek Wilk Dec. 13, 2019, 8:15 p.m. UTC | #1
On Thu, Dec 12, 2019 at 06:11:24PM +0100, David Hildenbrand wrote:
> This series is based on latest linux-next. The patches are located at:
>     https://github.com/davidhildenbrand/linux.git virtio-mem-rfc-v4
Heya!

Would there be by any chance a virtio-spec git tree somewhere?

..snip..
> --------------------------------------------------------------------------
> 5. Future work
> --------------------------------------------------------------------------
> 
> The separate patches contain a lot of future work items. One of the next
> steps is to make memory unplug more likely to succeed - currently, there
> are no guarantees on how much memory can get unplugged again. I have


Or perhaps tell the caller why we can't and let them sort it out?
For example: "Application XYZ is mlocked. Can't offload."
David Hildenbrand Dec. 16, 2019, 11:03 a.m. UTC | #2
On 13.12.19 21:15, Konrad Rzeszutek Wilk wrote:
> On Thu, Dec 12, 2019 at 06:11:24PM +0100, David Hildenbrand wrote:
>> This series is based on latest linux-next. The patches are located at:
>>     https://github.com/davidhildenbrand/linux.git virtio-mem-rfc-v4
> Heya!

Hi Konrad!

> 
> Would there be by any chance a virtio-spec git tree somewhere?

I haven't started working on a spec yet - it's on my todo list but has
low priority (one-man-team). I'll focus on the QEMU pieces next, once
the kernel part is in an acceptable state.

The uapi file contains quite a bit of documentation - if somebody wants to
start hacking on an alternative hypervisor implementation, I'm happy to
answer questions until I have a spec ready.

> 
> ..snip..
>> --------------------------------------------------------------------------
>> 5. Future work
>> --------------------------------------------------------------------------
>>
>> The separate patches contain a lot of future work items. One of the next
>> steps is to make memory unplug more likely to succeed - currently, there
>> are no guarantees on how much memory can get unplugged again. I have
> 
> 
> Or perhaps tell the caller why we can't and let them sort it out?
> For example: "Application XYZ is mlocked. Can't offload."

Yes, it might in general be interesting for the guest to indicate
persistent errors, both when hotplugging and hotunplugging memory.
Indicating why unplugging cannot succeed at that level of detail is,
however, non-trivial.

The hypervisor sets the requested size and can watch the actual
size of a virtio-mem device. Right now, after it has updated the requested
size, it can wait some time (e.g., 1-5 minutes). If the requested size
was not reached after that time, it knows there is a persistent issue
limiting plug/unplug. In the future, this could be extended by a rough
or detailed root-cause indication. In the worst case, the guest crashed
and is no longer able to respond (not even with an error indication).

One interesting piece of the current hypervisor (QEMU) design is that
the maximum memory size a VM can consume is always known and QEMU will
send QMP events to upper layers whenever that size changes. This means
that you can e.g., reliably charge a customer how much memory a VM is
actually able to consume over time (independent of hotplug/unplug
errors). But yeah, the QEMU bits are still in a very early stage.
Hui Zhu Dec. 24, 2019, 6:58 a.m. UTC | #3
Hi David,

Thanks for your work.

I got the following build failure if X86_64_ACPI_NUMA is n with rfc3 and rfc4:
make -j8 bzImage
  GEN     Makefile
  DESCEND  objtool
  CALL    /home/teawater/kernel/linux-upstream3/scripts/atomic/check-atomics.sh
  CALL    /home/teawater/kernel/linux-upstream3/scripts/checksyscalls.sh
  CHK     include/generated/compile.h
  CC      drivers/virtio/virtio_mem.o
/home/teawater/kernel/linux-upstream3/drivers/virtio/virtio_mem.c: In function ‘virtio_mem_translate_node_id’:
/home/teawater/kernel/linux-upstream3/drivers/virtio/virtio_mem.c:478:10: error: implicit declaration of function ‘pxm_to_node’ [-Werror=implicit-function-declaration]
   node = pxm_to_node(node_id);
          ^~~~~~~~~~~
cc1: some warnings being treated as errors
/home/teawater/kernel/linux-upstream3/scripts/Makefile.build:265: recipe for target 'drivers/virtio/virtio_mem.o' failed
make[3]: *** [drivers/virtio/virtio_mem.o] Error 1
/home/teawater/kernel/linux-upstream3/scripts/Makefile.build:503: recipe for target 'drivers/virtio' failed
make[2]: *** [drivers/virtio] Error 2
/home/teawater/kernel/linux-upstream3/Makefile:1649: recipe for target 'drivers' failed
make[1]: *** [drivers] Error 2
/home/teawater/kernel/linux-upstream3/Makefile:179: recipe for target 'sub-make' failed
make: *** [sub-make] Error 2

Best,
Hui

> On Dec. 13, 2019, at 01:11, David Hildenbrand <david@redhat.com> wrote:
> 
> This series is based on latest linux-next. The patches are located at:
>    https://github.com/davidhildenbrand/linux.git virtio-mem-rfc-v4
> 
> ..snip..
David Hildenbrand Dec. 24, 2019, 9:28 a.m. UTC | #4
> On 24.12.2019 at 08:04, teawater <teawaterz@linux.alibaba.com> wrote:
> 
> Hi David,
> 
> Thanks for your work.
> 
> I got the following build failure if X86_64_ACPI_NUMA is n with rfc3 and rfc4:
> make -j8 bzImage
> ..snip..
>  CC      drivers/virtio/virtio_mem.o
> /home/teawater/kernel/linux-upstream3/drivers/virtio/virtio_mem.c: In function ‘virtio_mem_translate_node_id’:
> /home/teawater/kernel/linux-upstream3/drivers/virtio/virtio_mem.c:478:10: error: implicit declaration of function ‘pxm_to_node’ [-Werror=implicit-function-declaration]
>   node = pxm_to_node(node_id);
>          ^~~~~~~~~~~
> cc1: some warnings being treated as errors
> ..snip..
> 

Thanks Hui,

So it has to be wrapped in an ifdef, thanks!
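
Something along these lines (just a sketch) should do:

  static int virtio_mem_translate_node_id(struct virtio_mem *vm,
                                          uint16_t node_id)
  {
          int node = NUMA_NO_NODE;

  #if defined(CONFIG_ACPI_NUMA)
          if (virtio_mem_has_feature(vm, VIRTIO_MEM_F_ACPI_PXM))
                  node = pxm_to_node(node_id);
  #endif
          return node;
  }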

Cheers!
David Hildenbrand Jan. 9, 2020, 1:48 p.m. UTC | #5
On 12.12.19 18:11, David Hildenbrand wrote:
> This series is based on latest linux-next. The patches are located at:
>     https://github.com/davidhildenbrand/linux.git virtio-mem-rfc-v4
> 
> ..snip..

Ping,

I'd love to get some feedback on

a) The remaining MM bits from MM folks (especially, patch #6 and #8).
b) The general virtio infrastructure (esp. uapi in patch #2) from virtio
folks.

I'm planning to send a proper v1 (!RFC) once I have all necessary MM
acks. In the meanwhile, I will do more testing and minor reworks (e.g.,
fix !CONFIG_NUMA compilation).
David Hildenbrand Jan. 29, 2020, 9:41 a.m. UTC | #6
On 09.01.20 14:48, David Hildenbrand wrote:
> On 12.12.19 18:11, David Hildenbrand wrote:
>> [... full cover letter snipped ...]
> 
> Ping,
> 
> I'd love to get some feedback on
> 
> a) The remaining MM bits from MM folks (especially patches #6 and #8).

Friendly ping again:

Can I get some feedback on the two important MM changes in this series

"[PATCH RFC v4 06/13] mm: Allow to offline unmovable PageOffline() pages
via MEM_GOING_OFFLINE"

and

"[PATCH RFC v4 08/13] mm/memory_hotplug: Introduce
offline_and_remove_memory()"?
David Hildenbrand Feb. 25, 2020, 9:58 a.m. UTC | #7
On 29.01.20 10:41, David Hildenbrand wrote:
> On 09.01.20 14:48, David Hildenbrand wrote:
>> On 12.12.19 18:11, David Hildenbrand wrote:
>>> [... full cover letter snipped ...]
>>
>> Ping,
>>
>> I'd love to get some feedback on
>>
>> a) The remaining MM bits from MM folks (especially patches #6 and #8).
> 
> Friendly ping again:
> 
> Can I get some feedback on the two important MM changes in this series
> 
> "[PATCH RFC v4 06/13] mm: Allow to offline unmovable PageOffline() pages
> via MEM_GOING_OFFLINE"
> 
> and
> 
> "[PATCH RFC v4 08/13] mm/memory_hotplug: Introduce
> offline_and_remove_memory()"?
> 

Yet another ping.
Alex Shi June 5, 2020, 8:55 a.m. UTC | #8
On 2020/1/9 9:48 PM, David Hildenbrand wrote:
> Ping,
> 
> I'd love to get some feedback on
> 
> a) The remaining MM bits from MM folks (especially patches #6 and #8).
> b) The general virtio infrastructure (esp. uapi in patch #2) from virtio
> folks.
> 
> I'm planning to send a proper v1 (!RFC) once I have all necessary MM
> acks. In the meantime, I will do more testing and minor reworks (e.g.,
> fix !CONFIG_NUMA compilation).


Hi David,

Thanks for your work!

I am trying your https://github.com/davidhildenbrand/linux.git virtio-mem-v5
branch, which works fine for me, but a 'DMA error' happens when a VM starts
with less than 2GB of memory. Did I miss something?

Thanks
Alex


(qemu) qom-set vm0 requested-size 1g
(qemu) [   26.560026] virtio_mem virtio0: plugged size: 0x0
[   26.560648] virtio_mem virtio0: requested size: 0x40000000
[   26.561730] systemd-journald[167]: no db file to read /run/udev/data/+virtio:virtio0: No such file or directory
[   26.563138] systemd-journald[167]: no db file to read /run/udev/data/+virtio:virtio0: No such file or directory
[   26.569122] Built 1 zonelists, mobility grouping on.  Total pages: 513141
[   26.570039] Policy zone: Normal

(qemu) [   32.175838] e1000 0000:00:03.0: swiotlb buffer is full (sz: 81 bytes), total 0 (slots), used 0 (slots)
[   32.176922] e1000 0000:00:03.0: TX DMA map failed
[   32.177488] e1000 0000:00:03.0: swiotlb buffer is full (sz: 81 bytes), total 0 (slots), used 0 (slots)
[   32.178535] e1000 0000:00:03.0: TX DMA map failed

My qemu command is like this:
qemu-system-x86_64  --enable-kvm \
	-m 2G,maxmem=16G -kernel /root/linux-next/$1/arch/x86/boot/bzImage \
	-smp 4 \
	-append "earlyprintk=ttyS0 root=/dev/sda1 console=ttyS0 debug psi=1 nokaslr ignore_loglevel" \
	-hda /root/CentOS-7-x86_64-Azure-1703.qcow2 \
	-net user,hostfwd=tcp::2222-:22 -net nic -s \
  -object memory-backend-ram,id=mem0,size=3G \
  -device virtio-mem-pci,id=vm0,memdev=mem0,node=0,requested-size=0M \
	--nographic
David Hildenbrand June 5, 2020, 9:08 a.m. UTC | #9
On 05.06.20 10:55, Alex Shi wrote:
> 
> 
> On 2020/1/9 9:48 PM, David Hildenbrand wrote:
>> Ping,
>>
>> I'd love to get some feedback on
>>
>> a) The remaining MM bits from MM folks (especially patches #6 and #8).
>> b) The general virtio infrastructure (esp. uapi in patch #2) from virtio
>> folks.
>>
>> I'm planning to send a proper v1 (!RFC) once I have all necessary MM
>> acks. In the meantime, I will do more testing and minor reworks (e.g.,
>> fix !CONFIG_NUMA compilation).
> 
> 
> Hi David,
> 
> Thanks for your work!
> 
> I am trying your https://github.com/davidhildenbrand/linux.git virtio-mem-v5
> branch, which works fine for me, but a 'DMA error' happens when a VM starts
> with less than 2GB of memory. Did I miss something?

Please use the virtio-mem-v4 branch for now, v5 is still under
construction (and might be scrapped completely if v4 goes upstream as is).

Looks like a DMA issue. You're hotplugging 1GB, which should not really
eat too much memory. There was a similar issue reported by Hui in [1],
which boiled down to wrong usage of the swiotlb parameter.

In such cases you should always try to reproduce with hotplug of a
same-sized DIMM. E.g., hotplugging a 1GB DIMM should result in the same
issue.

What does your .config specify for CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE?

I'll try to reproduce with v4 briefly.

[1]
https://lkml.kernel.org/r/9708F43A-9BD2-4377-8EE8-7FB1D95C6F69@linux.alibaba.com

> [... logs and qemu command line snipped, quoted in full above ...]
David Hildenbrand June 5, 2020, 9:36 a.m. UTC | #10
On 05.06.20 11:08, David Hildenbrand wrote:
> On 05.06.20 10:55, Alex Shi wrote:
>> [... original report snipped ...]
> 
> Please use the virtio-mem-v4 branch for now, v5 is still under
> construction (and might be scrapped completely if v4 goes upstream as is).
> 
> Looks like a DMA issue. You're hotplugging 1GB, which should not really
> eat too much memory. There was a similar issue reported by Hui in [1],
> which boiled down to wrong usage of the swiotlb parameter.
> 
> In such cases you should always try to reproduce with hotplug of a
> same-sized DIMM. E.g., hotplugging a 1GB DIMM should result in the same
> issue.
> 
> What does your .config specify for CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE?
> 
> I'll try to reproduce with v4 briefly.

I guess I know what's happening here. In case we only have DMA memory
when booting, we don't reserve swiotlb buffers. Once we hotplug memory
and online ZONE_NORMAL, we don't have any swiotlb DMA bounce buffers to
map such PFNs (total 0 (slots), used 0 (slots)).

Can you try with "swiotlb=force" on the kernel cmdline?
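
E.g., with your reproducer that would be (untested, just to illustrate
where the parameter goes):

 -append "earlyprintk=ttyS0 root=/dev/sda1 console=ttyS0 debug psi=1 nokaslr ignore_loglevel swiotlb=force" \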
David Hildenbrand June 5, 2020, 10:05 a.m. UTC | #11
On 05.06.20 11:36, David Hildenbrand wrote:
> On 05.06.20 11:08, David Hildenbrand wrote:
>> [... earlier context snipped ...]
> 
> I guess I know what's happening here. In case we only have DMA memory
> when booting, we don't reserve swiotlb buffers. Once we hotplug memory
> and online ZONE_NORMAL, we don't have any swiotlb DMA bounce buffers to
> map such PFNs (total 0 (slots), used 0 (slots)).
> 
> Can you try with "swiotlb=force" on the kernel cmdline?

Alternatively, it looks like you can specify "-m 2G,maxmem=16G,slots=1" to
create proper ACPI tables that indicate hotpluggable memory. (I'll have
to look into QEMU to figure out how to always indicate hotpluggable memory
that way).
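
E.g., for your original invocation that would be (untested):

 qemu-system-x86_64 --enable-kvm \
   -m 2G,maxmem=16G,slots=1 \
   [...]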
Alex Shi June 5, 2020, 10:06 a.m. UTC | #12
On 2020/6/5 5:08 PM, David Hildenbrand wrote:
> Please use the virtio-mem-v4 branch for now, v5 is still under
> construction (and might be scrapped completely if v4 goes upstream as is).
> 
> Looks like a DMA issue. You're hotplugging 1GB, which should not really
> eat too much memory. There was a similar issue reported by Hui in [1],
> which boiled down to wrong usage of the swiotlb parameter.

I don't have swiotlb=noforce set, and sometimes no swiotlb error is reported, like:
(qemu) [   41.591308] e1000 0000:00:03.0: dma_direct_map_page: overflow 0x000000011fd470da+54 of device mask ffffffff
[   41.592431] e1000 0000:00:03.0: TX DMA map failed
[   41.593031] e1000 0000:00:03.0: dma_direct_map_page: overflow 0x000000011fd474da+54 of device mask ffffff
...
[   63.049464] ata_piix 0000:00:01.1: dma_direct_map_sg: overflow 0x0000000107db2000+4096 of device mask ffffffff
[   63.068297] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[   63.069057] ata1.00: failed command: READ DMA
[   63.069580] ata1.00: cmd c8/00:20:40:bd:d2/00:00:00:00:00/e0 tag 0 dma 16384 in
[   63.069580]          res 50/00:00:3f:30:80/00:00:00:00:00/a0 Emask 0x40 (internal error) 
> 
> In such cases you should always try to reproduce with hotplug of a
> same-sized DIMM. E.g., hotplugging a 1GB DIMM should result in the same
> issue.
> 
> What does your .config specify for CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE?

Yes, it's set. 

I had tried the v2/v4 versions, which have the same issue.
Is this related to the virtio-mem start address being too low?

Thanks a lot!
> 
> I'll try to reproduce with v4 briefly.
> 
> [1]
> https://lkml.kernel.org/r/9708F43A-9BD2-4377-8EE8-7FB1D95C6F69@linux.alibaba.com
Alex Shi June 5, 2020, 10:08 a.m. UTC | #13
On 2020/6/5 5:36 PM, David Hildenbrand wrote:
> I guess I know what's happening here. In case we only have DMA memory
> when booting, we don't reserve swiotlb buffers. Once we hotplug memory
> and online ZONE_NORMAL, we don't have any swiotlb DMA bounce buffers to
> map such PFNs (total 0 (slots), used 0 (slots)).
> 
> Can you try with "swiotlb=force" on the kernel cmdline?

Yes, it works fine with this cmdline; the problems are gone.
Alex Shi June 5, 2020, 10:46 a.m. UTC | #14
On 2020/6/5 6:05 PM, David Hildenbrand wrote:
>> I guess I know what's happening here. In case we only have DMA memory
>> when booting, we don't reserve swiotlb buffers. Once we hotplug memory
>> and online ZONE_NORMAL, we don't have any swiotlb DMA bounce buffers to
>> map such PFNs (total 0 (slots), used 0 (slots)).
>>
>> Can you try with "swiotlb=force" on the kernel cmdline?
> Alternatively, it looks like you can specify "-m 2G,maxmem=16G,slots=1" to
> create proper ACPI tables that indicate hotpluggable memory. (I'll have
> to look into QEMU to figure out how to always indicate hotpluggable memory
> that way).
> 


That works too. Yes, better resolved in qemu, maybe. :)

Thanks!
David Hildenbrand June 5, 2020, 12:18 p.m. UTC | #15
On 05.06.20 12:46, Alex Shi wrote:
> 
> 
> On 2020/6/5 6:05 PM, David Hildenbrand wrote:
>>> I guess I know what's happening here. In case we only have DMA memory
>>> when booting, we don't reserve swiotlb buffers. Once we hotplug memory
>>> and online ZONE_NORMAL, we don't have any swiotlb DMA bounce buffers to
>>> map such PFNs (total 0 (slots), used 0 (slots)).
>>>
>>> Can you try with "swiotlb=force" on the kernel cmdline?
>> Alternatively, it looks like you can specify "-m 2G,maxmem=16G,slots=1" to
>> create proper ACPI tables that indicate hotpluggable memory. (I'll have
>> to look into QEMU to figure out how to always indicate hotpluggable memory
>> that way).
>>
> 
> 
> That works too. Yes, better resolved in qemu, maybe. :)
> 

You can check out

git@github.com:davidhildenbrand/qemu.git virtio-mem-v4

(prone to change before being officially sent), which will create SRAT
tables even if no "slots" parameter is defined (and no -numa config is
specified).

Your original example should work with that.
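
E.g., via the HTTPS mirror of the same repository (untested, assuming
you prefer HTTPS over SSH):

 $ git clone https://github.com/davidhildenbrand/qemu.git
 $ cd qemu
 $ git checkout virtio-mem-v4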
Alex Shi June 9, 2020, 3:05 a.m. UTC | #16
On 2020/6/5 8:18 PM, David Hildenbrand wrote:
> On 05.06.20 12:46, Alex Shi wrote:
>>
>>
>> On 2020/6/5 6:05 PM, David Hildenbrand wrote:
>>>> I guess I know what's happening here. In case we only have DMA memory
>>>> when booting, we don't reserve swiotlb buffers. Once we hotplug memory
>>>> and online ZONE_NORMAL, we don't have any swiotlb DMA bounce buffers to
>>>> map such PFNs (total 0 (slots), used 0 (slots)).
>>>>
>>>> Can you try with "swiotlb=force" on the kernel cmdline?
>>> Alternatively, it looks like you can specify "-m 2G,maxmem=16G,slots=1" to
>>> create proper ACPI tables that indicate hotpluggable memory. (I'll have
>>> to look into QEMU to figure out how to always indicate hotpluggable memory
>>> that way).
>>>
>>
>>
>> That works too. Yes, better resolved in qemu, maybe. :)
>>
> 
> You can check out
> 
> git@github.com:davidhildenbrand/qemu.git virtio-mem-v4

Yes, it works for me. Thanks!

> 
> (prone to change before being officially sent), which will create SRAT
> tables even if no "slots" parameter is defined (and no -numa config is
> specified).
> 
> Your original example should work with that.
>