diff mbox series

[V2,01/11] machine: alloc-anon option

Message ID 1719776434-435013-2-git-send-email-steven.sistare@oracle.com (mailing list archive)
State New, archived
Headers show
Series Live update: cpr-exec | expand

Commit Message

Steven Sistare June 30, 2024, 7:40 p.m. UTC
Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
on the value of the anon-alloc machine property.  This affects
memory-backend-ram objects, guest RAM created with the global -m option
but without an associated memory-backend object and without the -mem-path
option, and various memory regions such as ROMs that are allocated when
devices are created.  This option does not affect memory-backend-file,
memory-backend-memfd, or memory-backend-epc objects.

The memfd option is intended to support new migration modes, in which the
memory region can be transferred in place to a new QEMU process, by sending
the memfd file descriptor to the process.  Memory contents are preserved,
and if the mode also transfers device descriptors, then pages that are
locked in memory for DMA remain locked.  This behavior is a pre-requisite
for supporting vfio, vdpa, and iommufd devices with the new modes.

To access the same memory in the old and new QEMU processes, the memory
must be mapped shared.  Therefore, the implementation always sets
RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the
user must explicitly specify the share option.  In lieu of defining a new
RAM flag, at the lowest level the implementation uses RAM_SHARED with fd=-1
as the condition for calling memfd_create.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/core/machine.c   | 24 ++++++++++++++++++++++++
 include/hw/boards.h |  1 +
 qapi/machine.json   | 14 ++++++++++++++
 qemu-options.hx     | 13 +++++++++++++
 system/memory.c     | 12 +++++++++---
 system/physmem.c    | 38 +++++++++++++++++++++++++++++++++++++-
 system/trace-events |  3 +++
 7 files changed, 101 insertions(+), 4 deletions(-)

Comments

Fabiano Rosas July 15, 2024, 5:52 p.m. UTC | #1
Steve Sistare <steven.sistare@oracle.com> writes:

> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
> on the value of the anon-alloc machine property.  This affects
> memory-backend-ram objects, guest RAM created with the global -m option
> but without an associated memory-backend object and without the -mem-path
> option, and various memory regions such as ROMs that are allocated when
> devices are created.  This option does not affect memory-backend-file,
> memory-backend-memfd, or memory-backend-epc objects.
>
> The memfd option is intended to support new migration modes, in which the
> memory region can be transferred in place to a new QEMU process, by sending
> the memfd file descriptor to the process.  Memory contents are preserved,
> and if the mode also transfers device descriptors, then pages that are
> locked in memory for DMA remain locked.  This behavior is a pre-requisite
> for supporting vfio, vdpa, and iommufd devices with the new modes.
>
> To access the same memory in the old and new QEMU processes, the memory
> must be mapped shared.  Therefore, the implementation always sets
> RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the
> user must explicitly specify the share option.  In lieu of defining a new
> RAM flag, at the lowest level the implementation uses RAM_SHARED with fd=-1
> as the condition for calling memfd_create.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

The commit message is inconsistent with alloc-anon, anon-alloc.

Reviewed-by: Fabiano Rosas <farosas@suse.de>
Igor Mammedov July 16, 2024, 9:19 a.m. UTC | #2
On Sun, 30 Jun 2024 12:40:24 -0700
Steve Sistare <steven.sistare@oracle.com> wrote:

> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
> on the value of the anon-alloc machine property.  This affects
> memory-backend-ram objects, guest RAM created with the global -m option
> but without an associated memory-backend object and without the -mem-path
> option
nowadays, all machines were converted to use memory backend for VM RAM.
so -m option implicitly creates memory-backend object,
which will be either MEMORY_BACKEND_FILE if -mem-path present
or MEMORY_BACKEND_RAM otherwise.


> To access the same memory in the old and new QEMU processes, the memory
> must be mapped shared.  Therefore, the implementation always sets

> RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the
> user must explicitly specify the share option.  In lieu of defining a new
so statement at the top that memory-backend-ram is affected is not
really valid? 

> RAM flag, at the lowest level the implementation uses RAM_SHARED with fd=-1
> as the condition for calling memfd_create.

In general I do dislike adding yet another option that will affect
guest RAM allocation (memory-backends  should be sufficient).

However I do see that you need memfd for device memory (vram, roms, ...).
Can we just use memfd/shared unconditionally for those and
avoid introducing a new confusing option?


> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  hw/core/machine.c   | 24 ++++++++++++++++++++++++
>  include/hw/boards.h |  1 +
>  qapi/machine.json   | 14 ++++++++++++++
>  qemu-options.hx     | 13 +++++++++++++
>  system/memory.c     | 12 +++++++++---
>  system/physmem.c    | 38 +++++++++++++++++++++++++++++++++++++-
>  system/trace-events |  3 +++
>  7 files changed, 101 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/core/machine.c b/hw/core/machine.c
> index 655d75c..7ca2ad0 100644
> --- a/hw/core/machine.c
> +++ b/hw/core/machine.c
> @@ -454,6 +454,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp)
>      ms->mem_merge = value;
>  }
>  
> +static int machine_get_anon_alloc(Object *obj, Error **errp)
> +{
> +    MachineState *ms = MACHINE(obj);
> +
> +    return ms->anon_alloc;
> +}
> +
> +static void machine_set_anon_alloc(Object *obj, int value, Error **errp)
> +{
> +    MachineState *ms = MACHINE(obj);
> +
> +    ms->anon_alloc = value;
> +}
> +
>  static bool machine_get_usb(Object *obj, Error **errp)
>  {
>      MachineState *ms = MACHINE(obj);
> @@ -1066,6 +1080,11 @@ static void machine_class_init(ObjectClass *oc, void *data)
>      object_class_property_set_description(oc, "mem-merge",
>          "Enable/disable memory merge support");
>  
> +    object_class_property_add_enum(oc, "anon-alloc", "AnonAllocOption",
> +                                   &AnonAllocOption_lookup,
> +                                   machine_get_anon_alloc,
> +                                   machine_set_anon_alloc);
> +
>      object_class_property_add_bool(oc, "usb",
>          machine_get_usb, machine_set_usb);
>      object_class_property_set_description(oc, "usb",
> @@ -1416,6 +1435,11 @@ static bool create_default_memdev(MachineState *ms, const char *path, Error **er
>      if (!object_property_set_int(obj, "size", ms->ram_size, errp)) {
>          goto out;
>      }
> +    if (!object_property_set_bool(obj, "share",
> +                                  ms->anon_alloc == ANON_ALLOC_OPTION_MEMFD,
> +                                  errp)) {
> +        goto out;
> +    }
>      object_property_add_child(object_get_objects_root(), mc->default_ram_id,
>                                obj);
>      /* Ensure backend's memory region name is equal to mc->default_ram_id */
> diff --git a/include/hw/boards.h b/include/hw/boards.h
> index 73ad319..77f16ad 100644
> --- a/include/hw/boards.h
> +++ b/include/hw/boards.h
> @@ -383,6 +383,7 @@ struct MachineState {
>      bool enable_graphics;
>      ConfidentialGuestSupport *cgs;
>      HostMemoryBackend *memdev;
> +    AnonAllocOption anon_alloc;
>      /*
>       * convenience alias to ram_memdev_id backend memory region
>       * or to numa container memory region
> diff --git a/qapi/machine.json b/qapi/machine.json
> index 2fd3e9c..9173953 100644
> --- a/qapi/machine.json
> +++ b/qapi/machine.json
> @@ -1881,3 +1881,17 @@
>  { 'command': 'x-query-interrupt-controllers',
>    'returns': 'HumanReadableText',
>    'features': [ 'unstable' ]}
> +
> +##
> +# @AnonAllocOption:
> +#
> +# An enumeration of the options for allocating anonymous guest memory.
> +#
> +# @mmap: allocate using mmap MAP_ANON
> +#
> +# @memfd: allocate using memfd_create
> +#
> +# Since: 9.1
> +##
> +{ 'enum': 'AnonAllocOption',
> +  'data': [ 'mmap', 'memfd' ] }
> diff --git a/qemu-options.hx b/qemu-options.hx
> index 8ca7f34..595b693 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -38,6 +38,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
>      "                nvdimm=on|off controls NVDIMM support (default=off)\n"
>      "                memory-encryption=@var{} memory encryption object to use (default=none)\n"
>      "                hmat=on|off controls ACPI HMAT support (default=off)\n"
> +    "                anon-alloc=mmap|memfd allocate anonymous guest RAM using mmap MAP_ANON or memfd_create (default: mmap)\n"
>      "                memory-backend='backend-id' specifies explicitly provided backend for main RAM (default=none)\n"
>      "                cxl-fmw.0.targets.0=firsttarget,cxl-fmw.0.targets.1=secondtarget,cxl-fmw.0.size=size[,cxl-fmw.0.interleave-granularity=granularity]\n",
>      QEMU_ARCH_ALL)
> @@ -101,6 +102,18 @@ SRST
>          Enables or disables ACPI Heterogeneous Memory Attribute Table
>          (HMAT) support. The default is off.
>  
> +    ``anon-alloc=mmap|memfd``
> +        Allocate anonymous guest RAM using mmap MAP_ANON (the default)
> +        or memfd_create.  This affects memory-backend-ram objects,
> +        RAM created with the global -m option but without an
> +        associated memory-backend object and without the -mem-path
> +        option, and various memory regions such as ROMs that are
> +        allocated when devices are created.  This option does not
> +        affect memory-backend-file, memory-backend-memfd, or
> +        memory-backend-epc objects.
> +
> +        Some migration modes require anon-alloc=memfd.
> +
>      ``memory-backend='id'``
>          An alternative to legacy ``-mem-path`` and ``mem-prealloc`` options.
>          Allows to use a memory backend as main RAM.
> diff --git a/system/memory.c b/system/memory.c
> index 2d69521..28a837d 100644
> --- a/system/memory.c
> +++ b/system/memory.c
> @@ -1552,8 +1552,10 @@ bool memory_region_init_ram_nomigrate(MemoryRegion *mr,
>                                        uint64_t size,
>                                        Error **errp)
>  {
> +    uint32_t flags = (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD) ?
> +                     RAM_SHARED : 0;
>      return memory_region_init_ram_flags_nomigrate(mr, owner, name,
> -                                                  size, 0, errp);
> +                                                  size, flags, errp);
>  }
>  
>  bool memory_region_init_ram_flags_nomigrate(MemoryRegion *mr,
> @@ -1713,8 +1715,10 @@ bool memory_region_init_rom_nomigrate(MemoryRegion *mr,
>                                        uint64_t size,
>                                        Error **errp)
>  {
> +    uint32_t flags = (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD) ?
> +                     RAM_SHARED : 0;
>      if (!memory_region_init_ram_flags_nomigrate(mr, owner, name,
> -                                                size, 0, errp)) {
> +                                                size, flags, errp)) {
>           return false;
>      }
>      mr->readonly = true;
> @@ -1731,6 +1735,8 @@ bool memory_region_init_rom_device_nomigrate(MemoryRegion *mr,
>                                               Error **errp)
>  {
>      Error *err = NULL;
> +    uint32_t flags = (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD) ?
> +                     RAM_SHARED : 0;
>      assert(ops);
>      memory_region_init(mr, owner, name, size);
>      mr->ops = ops;
> @@ -1738,7 +1744,7 @@ bool memory_region_init_rom_device_nomigrate(MemoryRegion *mr,
>      mr->terminates = true;
>      mr->rom_device = true;
>      mr->destructor = memory_region_destructor_ram;
> -    mr->ram_block = qemu_ram_alloc(size, 0, mr, &err);
> +    mr->ram_block = qemu_ram_alloc(size, flags, mr, &err);
>      if (err) {
>          mr->size = int128_zero();
>          object_unparent(OBJECT(mr));
> diff --git a/system/physmem.c b/system/physmem.c
> index 33d09f7..efe95ff 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -47,6 +47,7 @@
>  #include "qemu/qemu-print.h"
>  #include "qemu/log.h"
>  #include "qemu/memalign.h"
> +#include "qemu/memfd.h"
>  #include "exec/memory.h"
>  #include "exec/ioport.h"
>  #include "sysemu/dma.h"
> @@ -54,6 +55,7 @@
>  #include "sysemu/hw_accel.h"
>  #include "sysemu/xen-mapcache.h"
>  #include "trace/trace-root.h"
> +#include "trace.h"
>  
>  #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
>  #include <linux/falloc.h>
> @@ -69,6 +71,8 @@
>  
>  #include "qemu/pmem.h"
>  
> +#include "qapi/qapi-types-migration.h"
> +#include "migration/options.h"
>  #include "migration/vmstate.h"
>  
>  #include "qemu/range.h"
> @@ -1828,6 +1832,32 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>                  qemu_mutex_unlock_ramlist();
>                  return;
>              }
> +
> +        } else if (new_block->flags & RAM_SHARED) {
> +            size_t max_length = new_block->max_length;
> +            MemoryRegion *mr = new_block->mr;
> +            const char *name = memory_region_name(mr);
> +
> +            new_block->mr->align = QEMU_VMALLOC_ALIGN;
> +
> +            if (new_block->fd == -1) {
> +                new_block->fd = qemu_memfd_create(name, max_length + mr->align,
> +                                                  0, 0, 0, errp);
> +            }
> +
> +            if (new_block->fd >= 0) {
> +                int mfd = new_block->fd;
> +                qemu_set_cloexec(mfd);
> +                new_block->host = file_ram_alloc(new_block, max_length, mfd,
> +                                                 false, 0, errp);
> +            }
> +            if (!new_block->host) {
> +                qemu_mutex_unlock_ramlist();
> +                return;
> +            }
> +            memory_try_enable_merging(new_block->host, new_block->max_length);
> +            free_on_error = true;
> +
>          } else {
>              new_block->host = qemu_anon_ram_alloc(new_block->max_length,
>                                                    &new_block->mr->align,
> @@ -1911,6 +1941,9 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>          ram_block_notify_add(new_block->host, new_block->used_length,
>                               new_block->max_length);
>      }
> +    trace_ram_block_add(memory_region_name(new_block->mr), new_block->flags,
> +                        new_block->fd, new_block->used_length,
> +                        new_block->max_length);
>      return;
>  
>  out_free:
> @@ -2097,8 +2130,11 @@ RAMBlock *qemu_ram_alloc_resizeable(ram_addr_t size, ram_addr_t maxsz,
>                                                       void *host),
>                                       MemoryRegion *mr, Error **errp)
>  {
> +    uint32_t flags = (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD) ?
> +                     RAM_SHARED : 0;
> +    flags |= RAM_RESIZEABLE;
>      return qemu_ram_alloc_internal(size, maxsz, resized, NULL,
> -                                   RAM_RESIZEABLE, mr, errp);
> +                                   flags, mr, errp);
>  }
>  
>  static void reclaim_ramblock(RAMBlock *block)
> diff --git a/system/trace-events b/system/trace-events
> index 69c9044..f8ebf42 100644
> --- a/system/trace-events
> +++ b/system/trace-events
> @@ -38,3 +38,6 @@ dirtylimit_state_finalize(void)
>  dirtylimit_throttle_pct(int cpu_index, uint64_t pct, int64_t time_us) "CPU[%d] throttle percent: %" PRIu64 ", throttle adjust time %"PRIi64 " us"
>  dirtylimit_set_vcpu(int cpu_index, uint64_t quota) "CPU[%d] set dirty page rate limit %"PRIu64
>  dirtylimit_vcpu_execute(int cpu_index, int64_t sleep_time_us) "CPU[%d] sleep %"PRIi64 " us"
> +
> +#physmem.c
> +ram_block_add(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length) "%s, flags %u, fd %d, len %lu, maxlen %lu"
Peter Xu July 17, 2024, 7:24 p.m. UTC | #3
On Tue, Jul 16, 2024 at 11:19:55AM +0200, Igor Mammedov wrote:
> On Sun, 30 Jun 2024 12:40:24 -0700
> Steve Sistare <steven.sistare@oracle.com> wrote:
> 
> > Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
> > on the value of the anon-alloc machine property.  This affects
> > memory-backend-ram objects, guest RAM created with the global -m option
> > but without an associated memory-backend object and without the -mem-path
> > option
> nowadays, all machines were converted to use memory backend for VM RAM.
> so -m option implicitly creates memory-backend object,
> which will be either MEMORY_BACKEND_FILE if -mem-path present
> or MEMORY_BACKEND_RAM otherwise.
> 
> 
> > To access the same memory in the old and new QEMU processes, the memory
> > must be mapped shared.  Therefore, the implementation always sets
> 
> > RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the
> > user must explicitly specify the share option.  In lieu of defining a new
> so statement at the top that memory-backend-ram is affected is not
> really valid? 
> 
> > RAM flag, at the lowest level the implementation uses RAM_SHARED with fd=-1
> > as the condition for calling memfd_create.
> 
> In general I do dislike adding yet another option that will affect
> guest RAM allocation (memory-backends  should be sufficient).

I shared the same concern when reviewing the previous version, and I keep
having so.

> 
> However I do see that you need memfd for device memory (vram, roms, ...).
> Can we just use memfd/shared unconditionally for those and
> avoid introducing a new confusing option?

ROMs should be fine IIUC, as they shouldn't be large, and they can be
migrated normally (because they're not DMA target from VFIO assigned
devices).  IOW, per my understanding what must be shared via memfd is
writable memories that can be DMAed from a VFIO device.

I raised such question on whether / why vram can be a DMA target, but I
didn't get a response.  So I would like to redo this comment: I think we
should figure out what is missing when we switch all backends to use
-object, rather than adding this flag easily.  When added, we should be
crystal clear on which RAM region will be applicable by this flag.

PS to Steve: and I think I left tons of other comments in previous version
outside this patch too, but I don't think they're fully discussed when this
series was sent.  I can re-read the series again, but I don't think it'll
work out if we keep skipping discussions..

Thanks,
Steven Sistare July 18, 2024, 3:43 p.m. UTC | #4
On 7/17/2024 3:24 PM, Peter Xu wrote:
[...]
> 
> PS to Steve: and I think I left tons of other comments in previous version
> outside this patch too, but I don't think they're fully discussed when this
> series was sent.  I can re-read the series again, but I don't think it'll
> work out if we keep skipping discussions..

Hi Peter, let me address this part first, because I don't want you to think
that I ignored your questions and concerns.  This V2 series tries to address
them.  The change log was intended to be my response, rather than responding
to each open question individually, but let me try again here with more detail.
I apologize if I don't summarize your concerns correctly or completely.

issue: discomfort with exec. why is it needed?
response: exec is just a transport mechanism to send fd's to new qemu.
   I refactored to separate core patches from exec-specific patches, submitted
   cpr-transfer patches to illustrate a non-exec method, and provided reasons
   why one vs the other would be desirable in the commit messages and cover
   letter.

issue: why do we need to preserve the ramblock fields and make them available
   prior to object creation?
response.  we don't need to preserve all of them, and we only need fd prior
   to object creation, so I deleted the precreate, factory, and named object
   patches, and added CprState to preserve fd's. used_length arrives in the
   normal migration stream.  max_length is recovered from the mfd using lseek.

issue: the series is too large, with too much change.
response: in addition to the deletions mentioned above, I simplified the
   functionality and tossed out style patches and nice-to-haves, so we can
   focus on core functionality.  V2 is much smaller.

issue: memfd_create option is oddly expressed and hard to understand.
response: I redefined the option, deleted all the stylistic ramblock patches
   to lay its workings bare, and explicitly documented its affect on all types
   of memory in the commit messages and qapi documentation.

issue: no need to preserve blocks like ROM for DMA (with memfd_create).
   Blocks that must be preserved should be surfaced to the user as objects.
response: I disagree, and will continue that conversation in this email thread.

issue: how will vfio be handled?
response: I submitted the vfio patches (non-trivial, because first I had to
   rework them without using precreate vmstate).

issue: how will fd's be preserved for chardevs?
response: via cpr_save_fd, CprState, and cpr_load_fd at device creation time,
   in each device's creation function, just like vfio.  Those primitives are
   defined in this V2 series.

- Steve
Peter Xu July 18, 2024, 4:22 p.m. UTC | #5
On Thu, Jul 18, 2024 at 11:43:54AM -0400, Steven Sistare wrote:
> On 7/17/2024 3:24 PM, Peter Xu wrote:
> [...]
> > 
> > PS to Steve: and I think I left tons of other comments in previous version
> > outside this patch too, but I don't think they're fully discussed when this
> > series was sent.  I can re-read the series again, but I don't think it'll
> > work out if we keep skipping discussions..
> 
> Hi Peter, let me address this part first, because I don't want you to think
> that I ignored your questions and concerns.  This V2 series tries to address
> them.  The change log was intended to be my response, rather than responding
> to each open question individually, but let me try again here with more detail.
> I apologize if I don't summarize your concerns correctly or completely.
> 
> issue: discomfort with exec. why is it needed?
> response: exec is just a transport mechanism to send fd's to new qemu.
>   I refactored to separate core patches from exec-specific patches, submitted
>   cpr-transfer patches to illustrate a non-exec method, and provided reasons
>   why one vs the other would be desirable in the commit messages and cover
>   letter.
> 
> issue: why do we need to preserve the ramblock fields and make them available
>   prior to object creation?
> response.  we don't need to preserve all of them, and we only need fd prior
>   to object creation, so I deleted the precreate, factory, and named object
>   patches, and added CprState to preserve fd's. used_length arrives in the
>   normal migration stream.  max_length is recovered from the mfd using lseek.
> 
> issue: the series is too large, with too much change.
> response: in addition to the deletions mentioned above, I simplified the
>   functionality and tossed out style patches and nice-to-haves, so we can
>   focus on core functionality.  V2 is much smaller.
> 
> issue: memfd_create option is oddly expressed and hard to understand.
> response: I redefined the option, deleted all the stylistic ramblock patches
>   to lay its workings bare, and explicitly documented its affect on all types
>   of memory in the commit messages and qapi documentation.
> 
> issue: no need to preserve blocks like ROM for DMA (with memfd_create).
>   Blocks that must be preserved should be surfaced to the user as objects.
> response: I disagree, and will continue that conversation in this email thread.
> 
> issue: how will vfio be handled?
> response: I submitted the vfio patches (non-trivial, because first I had to
>   rework them without using precreate vmstate).
> 
> issue: how will fd's be preserved for chardevs?
> response: via cpr_save_fd, CprState, and cpr_load_fd at device creation time,
>   in each device's creation function, just like vfio.  Those primitives are
>   defined in this V2 series.

Thanks for the answers.  I think I'll need to read more into the patches in
the next few days; it looks like I'll get more answers from there.

I just sent an email probably when you're drafting this one.. it may has
some questions that may not be covered here.

I think a major issue with exec() is the (1-3) steps that I mentioned there
that needs to run sequentially, and IIUC all these steps can be completely
avoided in cpr-transfer, and it may matter a lot in huge VMs.  But maybe I
missed something.

Please have a look there.
Steven Sistare July 20, 2024, 8:28 p.m. UTC | #6
On 7/16/2024 5:19 AM, Igor Mammedov wrote:
> On Sun, 30 Jun 2024 12:40:24 -0700
> Steve Sistare <steven.sistare@oracle.com> wrote:
> 
>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>> on the value of the anon-alloc machine property.  This affects
>> memory-backend-ram objects, guest RAM created with the global -m option
>> but without an associated memory-backend object and without the -mem-path
>> option
> nowadays, all machines were converted to use memory backend for VM RAM.
> so -m option implicitly creates memory-backend object,
> which will be either MEMORY_BACKEND_FILE if -mem-path present
> or MEMORY_BACKEND_RAM otherwise.

Yes.  I dropped an an important adjective, "implicit".

   "guest RAM created with the global -m option but without an explicit associated
   memory-backend object and without the -mem-path option"

>> To access the same memory in the old and new QEMU processes, the memory
>> must be mapped shared.  Therefore, the implementation always sets
> 
>> RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the
>> user must explicitly specify the share option.  In lieu of defining a new
> so statement at the top that memory-backend-ram is affected is not
> really valid?

memory-backend-ram is affected by alloc-anon.  But in addition, the user must
explicitly add the "share" option.  I don't implicitly set share in this case,
because I would be overriding the user's specification of the memory object's property,
which would be private if omitted.

>> RAM flag, at the lowest level the implementation uses RAM_SHARED with fd=-1
>> as the condition for calling memfd_create.
> 
> In general I do dislike adding yet another option that will affect
> guest RAM allocation (memory-backends  should be sufficient).
> 
> However I do see that you need memfd for device memory (vram, roms, ...).
> Can we just use memfd/shared unconditionally for those and
> avoid introducing a new confusing option?

The Linux kernel has different tunables for backing memfd's with huge pages, so we
could hurt performance if we unconditionally change to memfd.  The user should have
a choice for any segment that is large enough for huge pages to improve performance,
which potentially is any memory-backend-object.  The non memory-backend objects are
small, and it would be OK to use memfd unconditionally for them.

- Steve
Steven Sistare July 20, 2024, 8:35 p.m. UTC | #7
On 7/17/2024 3:24 PM, Peter Xu wrote:
> On Tue, Jul 16, 2024 at 11:19:55AM +0200, Igor Mammedov wrote:
>> On Sun, 30 Jun 2024 12:40:24 -0700
>> Steve Sistare <steven.sistare@oracle.com> wrote:
>>
>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>> on the value of the anon-alloc machine property.  This affects
>>> memory-backend-ram objects, guest RAM created with the global -m option
>>> but without an associated memory-backend object and without the -mem-path
>>> option
>> nowadays, all machines were converted to use memory backend for VM RAM.
>> so -m option implicitly creates memory-backend object,
>> which will be either MEMORY_BACKEND_FILE if -mem-path present
>> or MEMORY_BACKEND_RAM otherwise.
>>
>>
>>> To access the same memory in the old and new QEMU processes, the memory
>>> must be mapped shared.  Therefore, the implementation always sets
>>
>>> RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the
>>> user must explicitly specify the share option.  In lieu of defining a new
>> so statement at the top that memory-backend-ram is affected is not
>> really valid?
>>
>>> RAM flag, at the lowest level the implementation uses RAM_SHARED with fd=-1
>>> as the condition for calling memfd_create.
>>
>> In general I do dislike adding yet another option that will affect
>> guest RAM allocation (memory-backends  should be sufficient).
> 
> I shared the same concern when reviewing the previous version, and I keep
> having so.
> 
>>
>> However I do see that you need memfd for device memory (vram, roms, ...).
>> Can we just use memfd/shared unconditionally for those and
>> avoid introducing a new confusing option?
> 
> ROMs should be fine IIUC, as they shouldn't be large, and they can be
> migrated normally (because they're not DMA target from VFIO assigned
> devices).  IOW, per my understanding what must be shared via memfd is
> writable memories that can be DMAed from a VFIO device.
> 
> I raised such question on whether / why vram can be a DMA target, but I
> didn't get a response.  So I would like to redo this comment: I think we
> should figure out what is missing when we switch all backends to use
> -object, rather than adding this flag easily.  When added, we should be
> crystal clear on which RAM region will be applicable by this flag.

All RAM regions that are mapped by the guest are registered for vfio DMA by
a memory listener and could potentially be DMA'd, either read or written.
That is defined by the architecture.  We are not allowed to make value
judgements and decide to not support the architecture for some segments
such as ROM.

Alex Williamson, any comment here?

- Steve
David Hildenbrand July 22, 2024, 9:10 a.m. UTC | #8
On 20.07.24 22:28, Steven Sistare wrote:
> On 7/16/2024 5:19 AM, Igor Mammedov wrote:
>> On Sun, 30 Jun 2024 12:40:24 -0700
>> Steve Sistare <steven.sistare@oracle.com> wrote:
>>
>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>> on the value of the anon-alloc machine property.  This affects
>>> memory-backend-ram objects, guest RAM created with the global -m option
>>> but without an associated memory-backend object and without the -mem-path
>>> option
>> nowadays, all machines were converted to use memory backend for VM RAM.
>> so -m option implicitly creates memory-backend object,
>> which will be either MEMORY_BACKEND_FILE if -mem-path present
>> or MEMORY_BACKEND_RAM otherwise.
> 
> Yes.  I dropped an an important adjective, "implicit".
> 
>     "guest RAM created with the global -m option but without an explicit associated
>     memory-backend object and without the -mem-path option"
> 
>>> To access the same memory in the old and new QEMU processes, the memory
>>> must be mapped shared.  Therefore, the implementation always sets
>>
>>> RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the
>>> user must explicitly specify the share option.  In lieu of defining a new
>> so statement at the top that memory-backend-ram is affected is not
>> really valid?
> 
> memory-backend-ram is affected by alloc-anon.  But in addition, the user must
> explicitly add the "share" option.  I don't implicitly set share in this case,
> because I would be overriding the user's specification of the memory object's property,
> which would be private if omitted.

Note that memory-backend-memfd uses "shared=on" as default, as using 
"shared=off" is something that shouldn't have ever been allowed. It can 
(and will) result in a double memory consumption.

One reason I also don't quite like this approach :/
Igor Mammedov July 29, 2024, 12:29 p.m. UTC | #9
On Sat, 20 Jul 2024 16:28:25 -0400
Steven Sistare <steven.sistare@oracle.com> wrote:

> On 7/16/2024 5:19 AM, Igor Mammedov wrote:
> > On Sun, 30 Jun 2024 12:40:24 -0700
> > Steve Sistare <steven.sistare@oracle.com> wrote:
> >   
> >> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
> >> on the value of the anon-alloc machine property.  This affects
> >> memory-backend-ram objects, guest RAM created with the global -m option
> >> but without an associated memory-backend object and without the -mem-path
> >> option  
> > nowadays, all machines were converted to use memory backend for VM RAM.
> > so -m option implicitly creates memory-backend object,
> > which will be either MEMORY_BACKEND_FILE if -mem-path present
> > or MEMORY_BACKEND_RAM otherwise.  
> 
> Yes.  I dropped an an important adjective, "implicit".
> 
>    "guest RAM created with the global -m option but without an explicit associated
>    memory-backend object and without the -mem-path option"
> 
> >> To access the same memory in the old and new QEMU processes, the memory
> >> must be mapped shared.  Therefore, the implementation always sets  
> >   
> >> RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the
> >> user must explicitly specify the share option.  In lieu of defining a new  
> > so statement at the top that memory-backend-ram is affected is not
> > really valid?  
> 
> memory-backend-ram is affected by alloc-anon.  But in addition, the user must
> explicitly add the "share" option.  I don't implicitly set share in this case,
> because I would be overriding the user's specification of the memory object's property,
> which would be private if omitted.

instead of touching implicit RAM (-m), it would be better to error out
and ask user to provide properly configured memory-backend explicitly.

> 
> >> RAM flag, at the lowest level the implementation uses RAM_SHARED with fd=-1
> >> as the condition for calling memfd_create.  
> > 
> > In general I do dislike adding yet another option that will affect
> > guest RAM allocation (memory-backends  should be sufficient).
> > 
> > However I do see that you need memfd for device memory (vram, roms, ...).
> > Can we just use memfd/shared unconditionally for those and
> > avoid introducing a new confusing option?  
> 
> The Linux kernel has different tunables for backing memfd's with huge pages, so we
> could hurt performance if we unconditionally change to memfd.  The user should have
> a choice for any segment that is large enough for huge pages to improve performance,
> which potentially is any memory-backend-object.  The non memory-backend objects are
> small, and it would be OK to use memfd unconditionally for them.
> 
> - Steve
>
Peter Xu Aug. 4, 2024, 4:20 p.m. UTC | #10
On Sat, Jul 20, 2024 at 04:35:58PM -0400, Steven Sistare wrote:
> On 7/17/2024 3:24 PM, Peter Xu wrote:
> > On Tue, Jul 16, 2024 at 11:19:55AM +0200, Igor Mammedov wrote:
> > > On Sun, 30 Jun 2024 12:40:24 -0700
> > > Steve Sistare <steven.sistare@oracle.com> wrote:
> > > 
> > > > Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
> > > > on the value of the anon-alloc machine property.  This affects
> > > > memory-backend-ram objects, guest RAM created with the global -m option
> > > > but without an associated memory-backend object and without the -mem-path
> > > > option
> > > nowadays, all machines were converted to use memory backend for VM RAM.
> > > so -m option implicitly creates memory-backend object,
> > > which will be either MEMORY_BACKEND_FILE if -mem-path present
> > > or MEMORY_BACKEND_RAM otherwise.
> > > 
> > > 
> > > > To access the same memory in the old and new QEMU processes, the memory
> > > > must be mapped shared.  Therefore, the implementation always sets
> > > 
> > > > RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the
> > > > user must explicitly specify the share option.  In lieu of defining a new
> > > so statement at the top that memory-backend-ram is affected is not
> > > really valid?
> > > 
> > > > RAM flag, at the lowest level the implementation uses RAM_SHARED with fd=-1
> > > > as the condition for calling memfd_create.
> > > 
> > > In general I do dislike adding yet another option that will affect
> > > guest RAM allocation (memory-backends  should be sufficient).
> > 
> > I shared the same concern when reviewing the previous version, and I keep
> > having so.
> > 
> > > 
> > > However I do see that you need memfd for device memory (vram, roms, ...).
> > > Can we just use memfd/shared unconditionally for those and
> > > avoid introducing a new confusing option?
> > 
> > ROMs should be fine IIUC, as they shouldn't be large, and they can be
> > migrated normally (because they're not DMA target from VFIO assigned
> > devices).  IOW, per my understanding what must be shared via memfd is
> > writable memories that can be DMAed from a VFIO device.
> > 
> > I raised such question on whether / why vram can be a DMA target, but I
> > didn't get a response.  So I would like to redo this comment: I think we
> > should figure out what is missing when we switch all backends to use
> > -object, rather than adding this flag easily.  When added, we should be
> > crystal clear on which RAM region will be applicable by this flag.
> 
> All RAM regions that are mapped by the guest are registered for vfio DMA by
> a memory listener and could potentially be DMA'd, either read or written.
> That is defined by the architecture.  We are not allowed to make value
> judgements and decide to not support the architecture for some segments
> such as ROM.

You're right.  However the problem is we have pretty good grasp of the
major DMA target here (guest mem), so what I feel like is some missing work
in this area, that we're not sure what this new parameter is applying to.

It's not the case where we know "OK we have a million use case of RAM, and
we're 100% sure we want to make them all fd-based, and we introduce this
flag simply because adding this to each 1-million will take years and
thousands LOC changes".

The new parameter is cheap to paper over the question being raised here,
but it definitely adds not only ambiguity (when it conflicts with -object)
and that we'll need to maintain its compatibility for all RAMs that we have
totally no idea what can be implied underneath for whatever QEMU cmdline
that can be specified.

IMHO that, OTOH, justifies that there may need some further study to
justify this parameter alone.  For example, if it's only the vRAM that is
missing, we may at least want to have a parameter nailing down to vRAM
behavior rather than affecting anything, so as to at least avoid collision
on -object parameters.

Thanks,
Steven Sistare Aug. 8, 2024, 6:32 p.m. UTC | #11
On 7/29/2024 8:29 AM, Igor Mammedov wrote:
> On Sat, 20 Jul 2024 16:28:25 -0400
> Steven Sistare <steven.sistare@oracle.com> wrote:
> 
>> On 7/16/2024 5:19 AM, Igor Mammedov wrote:
>>> On Sun, 30 Jun 2024 12:40:24 -0700
>>> Steve Sistare <steven.sistare@oracle.com> wrote:
>>>    
>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>> on the value of the anon-alloc machine property.  This affects
>>>> memory-backend-ram objects, guest RAM created with the global -m option
>>>> but without an associated memory-backend object and without the -mem-path
>>>> option
>>> nowadays, all machines were converted to use memory backend for VM RAM.
>>> so -m option implicitly creates memory-backend object,
>>> which will be either MEMORY_BACKEND_FILE if -mem-path present
>>> or MEMORY_BACKEND_RAM otherwise.
>>
>> Yes.  I dropped an an important adjective, "implicit".
>>
>>     "guest RAM created with the global -m option but without an explicit associated
>>     memory-backend object and without the -mem-path option"
>>
>>>> To access the same memory in the old and new QEMU processes, the memory
>>>> must be mapped shared.  Therefore, the implementation always sets
>>>    
>>>> RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the
>>>> user must explicitly specify the share option.  In lieu of defining a new
>>> so statement at the top that memory-backend-ram is affected is not
>>> really valid?
>>
>> memory-backend-ram is affected by alloc-anon.  But in addition, the user must
>> explicitly add the "share" option.  I don't implicitly set share in this case,
>> because I would be overriding the user's specification of the memory object's property,
>> which would be private if omitted.
> 
> instead of touching implicit RAM (-m), it would be better to error out
> and ask user to provide properly configured memory-backend explicitly.
> 
>>
>>>> RAM flag, at the lowest level the implementation uses RAM_SHARED with fd=-1
>>>> as the condition for calling memfd_create.
>>>
>>> In general I do dislike adding yet another option that will affect
>>> guest RAM allocation (memory-backends  should be sufficient).
>>>
>>> However I do see that you need memfd for device memory (vram, roms, ...).
>>> Can we just use memfd/shared unconditionally for those and
>>> avoid introducing a new confusing option?
>>
>> The Linux kernel has different tunables for backing memfd's with huge pages, so we
>> could hurt performance if we unconditionally change to memfd.  The user should have
>> a choice for any segment that is large enough for huge pages to improve performance,
>> which potentially is any memory-backend-object.  The non memory-backend objects are
>> small, and it would be OK to use memfd unconditionally for them.

Thanks everyone for your feedback.  The common theme is that you dislike that the
new option modifies the allocation of memory-backend-objects.  OK, accepted.  I propose
to remove that interaction, and document in the QAPI which backends work for CPR.
Specifically, memory-backend-memfd or memory-backend-file object is required,
with share=on (which is the default for memory-backend-memfd).  CPR will be blocked
otherwise.  The legacy -m option without an explicit memory-backend-object will not
support CPR.

Non memory-backend-objects (ramblocks not described on the qemu command line) will always
be allocated using memfd_create (on Linux only).  The alloc-anon option is deleted.
The logic in ram_block_add becomes:

     if (!new_block->host) {
         if (xen_enabled()) {
             ...
         } else if (!object_dynamic_cast(new_block->mr->parent_obj.parent,
                                         TYPE_MEMORY_BACKEND)) {
             qemu_memfd_create()
         } else {
             qemu_anon_ram_alloc()
         }

Is that acceptable to everyone?  Igor, Peter, Daniel?

- Steve
Steven Sistare Aug. 12, 2024, 6:37 p.m. UTC | #12
On 8/8/2024 2:32 PM, Steven Sistare wrote:
> On 7/29/2024 8:29 AM, Igor Mammedov wrote:
>> On Sat, 20 Jul 2024 16:28:25 -0400
>> Steven Sistare <steven.sistare@oracle.com> wrote:
>>
>>> On 7/16/2024 5:19 AM, Igor Mammedov wrote:
>>>> On Sun, 30 Jun 2024 12:40:24 -0700
>>>> Steve Sistare <steven.sistare@oracle.com> wrote:
>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>> on the value of the anon-alloc machine property.  This affects
>>>>> memory-backend-ram objects, guest RAM created with the global -m option
>>>>> but without an associated memory-backend object and without the -mem-path
>>>>> option
>>>> nowadays, all machines were converted to use memory backend for VM RAM.
>>>> so -m option implicitly creates memory-backend object,
>>>> which will be either MEMORY_BACKEND_FILE if -mem-path present
>>>> or MEMORY_BACKEND_RAM otherwise.
>>>
>>> Yes.  I dropped an an important adjective, "implicit".
>>>
>>>     "guest RAM created with the global -m option but without an explicit associated
>>>     memory-backend object and without the -mem-path option"
>>>
>>>>> To access the same memory in the old and new QEMU processes, the memory
>>>>> must be mapped shared.  Therefore, the implementation always sets
>>>>> RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the
>>>>> user must explicitly specify the share option.  In lieu of defining a new
>>>> so statement at the top that memory-backend-ram is affected is not
>>>> really valid?
>>>
>>> memory-backend-ram is affected by alloc-anon.  But in addition, the user must
>>> explicitly add the "share" option.  I don't implicitly set share in this case,
>>> because I would be overriding the user's specification of the memory object's property,
>>> which would be private if omitted.
>>
>> instead of touching implicit RAM (-m), it would be better to error out
>> and ask user to provide properly configured memory-backend explicitly.
>>
>>>
>>>>> RAM flag, at the lowest level the implementation uses RAM_SHARED with fd=-1
>>>>> as the condition for calling memfd_create.
>>>>
>>>> In general I do dislike adding yet another option that will affect
>>>> guest RAM allocation (memory-backends  should be sufficient).
>>>>
>>>> However I do see that you need memfd for device memory (vram, roms, ...).
>>>> Can we just use memfd/shared unconditionally for those and
>>>> avoid introducing a new confusing option?
>>>
>>> The Linux kernel has different tunables for backing memfd's with huge pages, so we
>>> could hurt performance if we unconditionally change to memfd.  The user should have
>>> a choice for any segment that is large enough for huge pages to improve performance,
>>> which potentially is any memory-backend-object.  The non memory-backend objects are
>>> small, and it would be OK to use memfd unconditionally for them.
> 
> Thanks everyone for your feedback.  The common theme is that you dislike that the
> new option modifies the allocation of memory-backend-objects.  OK, accepted.  I propose
> to remove that interaction, and document in the QAPI which backends work for CPR.
> Specifically, memory-backend-memfd or memory-backend-file object is required,
> with share=on (which is the default for memory-backend-memfd).  CPR will be blocked
> otherwise.  The legacy -m option without an explicit memory-backend-object will not
> support CPR.
> 
> Non memory-backend-objects (ramblocks not described on the qemu command line) will always
> be allocated using memfd_create (on Linux only).  The alloc-anon option is deleted.
> The logic in ram_block_add becomes:
> 
>      if (!new_block->host) {
>          if (xen_enabled()) {
>              ...
>          } else if (!object_dynamic_cast(new_block->mr->parent_obj.parent,
>                                          TYPE_MEMORY_BACKEND)) {
>              qemu_memfd_create()
>          } else {
>              qemu_anon_ram_alloc()
>          }
> 
> Is that acceptable to everyone?  Igor, Peter, Daniel?

In a simple test here are the NON-memory-backend-object ramblocks which
are allocated with memfd_create in my new proposal:

   memfd_create system.flash0 3653632 @ 0x7fffe1000000 2 rw
   memfd_create system.flash1 540672 @ 0x7fffe0c00000 2 rw
   memfd_create pc.rom 131072 @ 0x7fffe0800000 2 rw
   memfd_create vga.vram 16777216 @ 0x7fffcac00000 2 rw
   memfd_create vga.rom 65536 @ 0x7fffe0400000 2 rw
   memfd_create /rom@etc/acpi/tables 2097152 @ 0x7fffca400000 6 rw
   memfd_create /rom@etc/table-loader 65536 @ 0x7fffca000000 6 rw
   memfd_create /rom@etc/acpi/rsdp 4096 @ 0x7fffc9c00000 6 rw

Of those, only a subset are mapped for DMA, per the existing QEMU logic,
no changes from me:

   dma_map: pc.rom 131072 @ 0x7fffe0800000 ro
   dma_map: vga.vram 16777216 @ 0x7fffcac00000 rw
   dma_map: vga.rom 65536 @ 0x7fffe0400000 ro
   dma_map: 0000:3a:10.0 BAR 0 mmaps[0] 16384 @ 0x7ffff7fef000 rw
   dma_map: 0000:3a:10.0 BAR 3 mmaps[0] 12288 @ 0x7ffff7fec000 rw

system.flash0 is excluded by the vfio listener because it is a rom_device.
The rom@etc blocks are excluded because their MemoryRegions are not added to
any container region, so the flatmem traversal of the AS used by the listener
does not see them.

The BARs should not be mapped IMO, and I propose excluding them in the
iommufd series:
   https://lore.kernel.org/qemu-devel/1721502937-87102-3-git-send-email-steven.sistare@oracle.com/

Note that the old-QEMU contents of all ramblocks must be preserved, just like
in live migration.  Live migration copies the contents in the stream.  Live update
preserves the contents in place by preserving the memfd.  Thus memfd serves
two purposes: preserving old contents, and preserving DMA mapped pinned pages.

- Steve
Peter Xu Aug. 13, 2024, 3:35 p.m. UTC | #13
On Mon, Aug 12, 2024 at 02:37:59PM -0400, Steven Sistare wrote:
> On 8/8/2024 2:32 PM, Steven Sistare wrote:
> > On 7/29/2024 8:29 AM, Igor Mammedov wrote:
> > > On Sat, 20 Jul 2024 16:28:25 -0400
> > > Steven Sistare <steven.sistare@oracle.com> wrote:
> > > 
> > > > On 7/16/2024 5:19 AM, Igor Mammedov wrote:
> > > > > On Sun, 30 Jun 2024 12:40:24 -0700
> > > > > Steve Sistare <steven.sistare@oracle.com> wrote:
> > > > > > Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
> > > > > > on the value of the anon-alloc machine property.  This affects
> > > > > > memory-backend-ram objects, guest RAM created with the global -m option
> > > > > > but without an associated memory-backend object and without the -mem-path
> > > > > > option
> > > > > nowadays, all machines were converted to use memory backend for VM RAM.
> > > > > so -m option implicitly creates memory-backend object,
> > > > > which will be either MEMORY_BACKEND_FILE if -mem-path present
> > > > > or MEMORY_BACKEND_RAM otherwise.
> > > > 
> > > > Yes.  I dropped an an important adjective, "implicit".
> > > > 
> > > >     "guest RAM created with the global -m option but without an explicit associated
> > > >     memory-backend object and without the -mem-path option"
> > > > 
> > > > > > To access the same memory in the old and new QEMU processes, the memory
> > > > > > must be mapped shared.  Therefore, the implementation always sets
> > > > > > RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the
> > > > > > user must explicitly specify the share option.  In lieu of defining a new
> > > > > so statement at the top that memory-backend-ram is affected is not
> > > > > really valid?
> > > > 
> > > > memory-backend-ram is affected by alloc-anon.  But in addition, the user must
> > > > explicitly add the "share" option.  I don't implicitly set share in this case,
> > > > because I would be overriding the user's specification of the memory object's property,
> > > > which would be private if omitted.
> > > 
> > > instead of touching implicit RAM (-m), it would be better to error out
> > > and ask user to provide properly configured memory-backend explicitly.
> > > 
> > > > 
> > > > > > RAM flag, at the lowest level the implementation uses RAM_SHARED with fd=-1
> > > > > > as the condition for calling memfd_create.
> > > > > 
> > > > > In general I do dislike adding yet another option that will affect
> > > > > guest RAM allocation (memory-backends  should be sufficient).
> > > > > 
> > > > > However I do see that you need memfd for device memory (vram, roms, ...).
> > > > > Can we just use memfd/shared unconditionally for those and
> > > > > avoid introducing a new confusing option?
> > > > 
> > > > The Linux kernel has different tunables for backing memfd's with huge pages, so we
> > > > could hurt performance if we unconditionally change to memfd.  The user should have
> > > > a choice for any segment that is large enough for huge pages to improve performance,
> > > > which potentially is any memory-backend-object.  The non memory-backend objects are
> > > > small, and it would be OK to use memfd unconditionally for them.
> > 
> > Thanks everyone for your feedback.  The common theme is that you dislike that the
> > new option modifies the allocation of memory-backend-objects.  OK, accepted.  I propose
> > to remove that interaction, and document in the QAPI which backends work for CPR.
> > Specifically, memory-backend-memfd or memory-backend-file object is required,
> > with share=on (which is the default for memory-backend-memfd).  CPR will be blocked
> > otherwise.  The legacy -m option without an explicit memory-backend-object will not
> > support CPR.
> > 
> > Non memory-backend-objects (ramblocks not described on the qemu command line) will always
> > be allocated using memfd_create (on Linux only).  The alloc-anon option is deleted.
> > The logic in ram_block_add becomes:
> > 
> >      if (!new_block->host) {
> >          if (xen_enabled()) {
> >              ...
> >          } else if (!object_dynamic_cast(new_block->mr->parent_obj.parent,
> >                                          TYPE_MEMORY_BACKEND)) {
> >              qemu_memfd_create()
> >          } else {
> >              qemu_anon_ram_alloc()
> >          }
> > 
> > Is that acceptable to everyone?  Igor, Peter, Daniel?

Sorry for a late reply.

I think this may not work as David pointed out? Where AFAIU it will switch
many old anon use cases to use memfd, aka, shmem, and it might be
problematic when share=off: we have double memory consumption issue with
shmem with private mapping.

I assume that includes things like "-m", "memory-backend-ram", and maybe
more.  IIUC memory consumption of the VM will double with them.

> 
> In a simple test here are the NON-memory-backend-object ramblocks which
> are allocated with memfd_create in my new proposal:
> 
>   memfd_create system.flash0 3653632 @ 0x7fffe1000000 2 rw
>   memfd_create system.flash1 540672 @ 0x7fffe0c00000 2 rw
>   memfd_create pc.rom 131072 @ 0x7fffe0800000 2 rw
>   memfd_create vga.vram 16777216 @ 0x7fffcac00000 2 rw
>   memfd_create vga.rom 65536 @ 0x7fffe0400000 2 rw
>   memfd_create /rom@etc/acpi/tables 2097152 @ 0x7fffca400000 6 rw
>   memfd_create /rom@etc/table-loader 65536 @ 0x7fffca000000 6 rw
>   memfd_create /rom@etc/acpi/rsdp 4096 @ 0x7fffc9c00000 6 rw
> 
> Of those, only a subset are mapped for DMA, per the existing QEMU logic,
> no changes from me:
> 
>   dma_map: pc.rom 131072 @ 0x7fffe0800000 ro
>   dma_map: vga.vram 16777216 @ 0x7fffcac00000 rw
>   dma_map: vga.rom 65536 @ 0x7fffe0400000 ro

I wonder whether there's any case that the "rom"s can be DMA target at
all..  I understand it's logically possible to be READ from as ROMs, but I
am curious what happens if we don't map them at all when they're ROMs, or
whether there's any device that can (in real life) DMA from device ROMs,
and for what use.

>   dma_map: 0000:3a:10.0 BAR 0 mmaps[0] 16384 @ 0x7ffff7fef000 rw
>   dma_map: 0000:3a:10.0 BAR 3 mmaps[0] 12288 @ 0x7ffff7fec000 rw
> 
> system.flash0 is excluded by the vfio listener because it is a rom_device.
> The rom@etc blocks are excluded because their MemoryRegions are not added to
> any container region, so the flatmem traversal of the AS used by the listener
> does not see them.
> 
> The BARs should not be mapped IMO, and I propose excluding them in the
> iommufd series:
>   https://lore.kernel.org/qemu-devel/1721502937-87102-3-git-send-email-steven.sistare@oracle.com/

Looks like this is clear now that they should be there.

> 
> Note that the old-QEMU contents of all ramblocks must be preserved, just like
> in live migration.  Live migration copies the contents in the stream.  Live update
> preserves the contents in place by preserving the memfd.  Thus memfd serves
> two purposes: preserving old contents, and preserving DMA mapped pinned pages.

IMHO the 1st purpose is a fake one.  IOW:

  - Preserving content will be important on large RAM/ROM regions.  When
    it's small, it shouldn't matter a huge deal, IMHO, because this is
    about "how fast we can migrate / live upgrade'.  IOW, this is not a
    functional requirement.

  - DMA mapped pinned pages: instead this is a hard requirement that we
    must make sure these pages are fd-based, because only a fd-based
    mapping can persist the pages (via page cache).

IMHO we shouldn't mangle them, and we should start with sticking with the
2nd goal here.  To be explicit, if we can find a good replacement for
-alloc-anon, IMHO we could still migrate the ramblocks only fall into the
1st purpose category, e.g. device ROMs, hopefully even if they're pinned,
they should never be DMAed to/from.

Thanks,
Alex Williamson Aug. 13, 2024, 5 p.m. UTC | #14
On Tue, 13 Aug 2024 11:35:15 -0400
Peter Xu <peterx@redhat.com> wrote:

> On Mon, Aug 12, 2024 at 02:37:59PM -0400, Steven Sistare wrote:
> > On 8/8/2024 2:32 PM, Steven Sistare wrote:  
> > > On 7/29/2024 8:29 AM, Igor Mammedov wrote:  
> > > > On Sat, 20 Jul 2024 16:28:25 -0400
> > > > Steven Sistare <steven.sistare@oracle.com> wrote:
> > > >   
> > > > > On 7/16/2024 5:19 AM, Igor Mammedov wrote:  
> > > > > > On Sun, 30 Jun 2024 12:40:24 -0700
> > > > > > Steve Sistare <steven.sistare@oracle.com> wrote:  
> > > > > > > Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
> > > > > > > on the value of the anon-alloc machine property.  This affects
> > > > > > > memory-backend-ram objects, guest RAM created with the global -m option
> > > > > > > but without an associated memory-backend object and without the -mem-path
> > > > > > > option  
> > > > > > nowadays, all machines were converted to use memory backend for VM RAM.
> > > > > > so -m option implicitly creates memory-backend object,
> > > > > > which will be either MEMORY_BACKEND_FILE if -mem-path present
> > > > > > or MEMORY_BACKEND_RAM otherwise.  
> > > > > 
> > > > > Yes.  I dropped an an important adjective, "implicit".
> > > > > 
> > > > >     "guest RAM created with the global -m option but without an explicit associated
> > > > >     memory-backend object and without the -mem-path option"
> > > > >   
> > > > > > > To access the same memory in the old and new QEMU processes, the memory
> > > > > > > must be mapped shared.  Therefore, the implementation always sets
> > > > > > > RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the
> > > > > > > user must explicitly specify the share option.  In lieu of defining a new  
> > > > > > so statement at the top that memory-backend-ram is affected is not
> > > > > > really valid?  
> > > > > 
> > > > > memory-backend-ram is affected by alloc-anon.  But in addition, the user must
> > > > > explicitly add the "share" option.  I don't implicitly set share in this case,
> > > > > because I would be overriding the user's specification of the memory object's property,
> > > > > which would be private if omitted.  
> > > > 
> > > > instead of touching implicit RAM (-m), it would be better to error out
> > > > and ask user to provide properly configured memory-backend explicitly.
> > > >   
> > > > >   
> > > > > > > RAM flag, at the lowest level the implementation uses RAM_SHARED with fd=-1
> > > > > > > as the condition for calling memfd_create.  
> > > > > > 
> > > > > > In general I do dislike adding yet another option that will affect
> > > > > > guest RAM allocation (memory-backends  should be sufficient).
> > > > > > 
> > > > > > However I do see that you need memfd for device memory (vram, roms, ...).
> > > > > > Can we just use memfd/shared unconditionally for those and
> > > > > > avoid introducing a new confusing option?  
> > > > > 
> > > > > The Linux kernel has different tunables for backing memfd's with huge pages, so we
> > > > > could hurt performance if we unconditionally change to memfd.  The user should have
> > > > > a choice for any segment that is large enough for huge pages to improve performance,
> > > > > which potentially is any memory-backend-object.  The non memory-backend objects are
> > > > > small, and it would be OK to use memfd unconditionally for them.  
> > > 
> > > Thanks everyone for your feedback.  The common theme is that you dislike that the
> > > new option modifies the allocation of memory-backend-objects.  OK, accepted.  I propose
> > > to remove that interaction, and document in the QAPI which backends work for CPR.
> > > Specifically, memory-backend-memfd or memory-backend-file object is required,
> > > with share=on (which is the default for memory-backend-memfd).  CPR will be blocked
> > > otherwise.  The legacy -m option without an explicit memory-backend-object will not
> > > support CPR.
> > > 
> > > Non memory-backend-objects (ramblocks not described on the qemu command line) will always
> > > be allocated using memfd_create (on Linux only).  The alloc-anon option is deleted.
> > > The logic in ram_block_add becomes:
> > > 
> > >      if (!new_block->host) {
> > >          if (xen_enabled()) {
> > >              ...
> > >          } else if (!object_dynamic_cast(new_block->mr->parent_obj.parent,
> > >                                          TYPE_MEMORY_BACKEND)) {
> > >              qemu_memfd_create()
> > >          } else {
> > >              qemu_anon_ram_alloc()
> > >          }
> > > 
> > > Is that acceptable to everyone?  Igor, Peter, Daniel?  
> 
> Sorry for a late reply.
> 
> I think this may not work as David pointed out? Where AFAIU it will switch
> many old anon use cases to use memfd, aka, shmem, and it might be
> problematic when share=off: we have double memory consumption issue with
> shmem with private mapping.
> 
> I assume that includes things like "-m", "memory-backend-ram", and maybe
> more.  IIUC memory consumption of the VM will double with them.
> 
> > 
> > In a simple test here are the NON-memory-backend-object ramblocks which
> > are allocated with memfd_create in my new proposal:
> > 
> >   memfd_create system.flash0 3653632 @ 0x7fffe1000000 2 rw
> >   memfd_create system.flash1 540672 @ 0x7fffe0c00000 2 rw
> >   memfd_create pc.rom 131072 @ 0x7fffe0800000 2 rw
> >   memfd_create vga.vram 16777216 @ 0x7fffcac00000 2 rw
> >   memfd_create vga.rom 65536 @ 0x7fffe0400000 2 rw
> >   memfd_create /rom@etc/acpi/tables 2097152 @ 0x7fffca400000 6 rw
> >   memfd_create /rom@etc/table-loader 65536 @ 0x7fffca000000 6 rw
> >   memfd_create /rom@etc/acpi/rsdp 4096 @ 0x7fffc9c00000 6 rw
> > 
> > Of those, only a subset are mapped for DMA, per the existing QEMU logic,
> > no changes from me:
> > 
> >   dma_map: pc.rom 131072 @ 0x7fffe0800000 ro
> >   dma_map: vga.vram 16777216 @ 0x7fffcac00000 rw
> >   dma_map: vga.rom 65536 @ 0x7fffe0400000 ro  
> 
> I wonder whether there's any case that the "rom"s can be DMA target at
> all..  I understand it's logically possible to be READ from as ROMs, but I
> am curious what happens if we don't map them at all when they're ROMs, or
> whether there's any device that can (in real life) DMA from device ROMs,
> and for what use.
> 
> >   dma_map: 0000:3a:10.0 BAR 0 mmaps[0] 16384 @ 0x7ffff7fef000 rw
> >   dma_map: 0000:3a:10.0 BAR 3 mmaps[0] 12288 @ 0x7ffff7fec000 rw
> > 
> > system.flash0 is excluded by the vfio listener because it is a rom_device.
> > The rom@etc blocks are excluded because their MemoryRegions are not added to
> > any container region, so the flatmem traversal of the AS used by the listener
> > does not see them.
> > 
> > The BARs should not be mapped IMO, and I propose excluding them in the
> > iommufd series:
> >   https://lore.kernel.org/qemu-devel/1721502937-87102-3-git-send-email-steven.sistare@oracle.com/  
> 
> Looks like this is clear now that they should be there.
> 
> > 
> > Note that the old-QEMU contents of all ramblocks must be preserved, just like
> > in live migration.  Live migration copies the contents in the stream.  Live update
> > preserves the contents in place by preserving the memfd.  Thus memfd serves
> > two purposes: preserving old contents, and preserving DMA mapped pinned pages.  
> 
> IMHO the 1st purpose is a fake one.  IOW:
> 
>   - Preserving content will be important on large RAM/ROM regions.  When
>     it's small, it shouldn't matter a huge deal, IMHO, because this is
>     about "how fast we can migrate / live upgrade'.  IOW, this is not a
>     functional requirement.

Regardless of the size of a ROM region, how would it ever be faster to
migrate ROMs rather that reload them from stable media on the target?
Furthermore, what mechanism other than migrating the ROM do we have to
guarantee the contents of the ROM are identical?

I have a hard time accepting that ROMs are only migrated for
performance and there isn't some aspect of migrating them to ensure the
contents remain identical, and by that token CPR would also need to
preserve the contents to provide the same guarantee.  Thanks,

Alex

>   - DMA mapped pinned pages: instead this is a hard requirement that we
>     must make sure these pages are fd-based, because only a fd-based
>     mapping can persist the pages (via page cache).
> 
> IMHO we shouldn't mangle them, and we should start with sticking with the
> 2nd goal here.  To be explicit, if we can find a good replacement for
> -alloc-anon, IMHO we could still migrate the ramblocks only fall into the
> 1st purpose category, e.g. device ROMs, hopefully even if they're pinned,
> they should never be DMAed to/from.
> 
> Thanks,
>
Steven Sistare Aug. 13, 2024, 5:34 p.m. UTC | #15
On 8/13/2024 11:35 AM, Peter Xu wrote:
> On Mon, Aug 12, 2024 at 02:37:59PM -0400, Steven Sistare wrote:
>> On 8/8/2024 2:32 PM, Steven Sistare wrote:
>>> On 7/29/2024 8:29 AM, Igor Mammedov wrote:
>>>> On Sat, 20 Jul 2024 16:28:25 -0400
>>>> Steven Sistare <steven.sistare@oracle.com> wrote:
>>>>
>>>>> On 7/16/2024 5:19 AM, Igor Mammedov wrote:
>>>>>> On Sun, 30 Jun 2024 12:40:24 -0700
>>>>>> Steve Sistare <steven.sistare@oracle.com> wrote:
>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>> on the value of the anon-alloc machine property.  This affects
>>>>>>> memory-backend-ram objects, guest RAM created with the global -m option
>>>>>>> but without an associated memory-backend object and without the -mem-path
>>>>>>> option
>>>>>> nowadays, all machines were converted to use memory backend for VM RAM.
>>>>>> so -m option implicitly creates memory-backend object,
>>>>>> which will be either MEMORY_BACKEND_FILE if -mem-path present
>>>>>> or MEMORY_BACKEND_RAM otherwise.
>>>>>
>>>>> Yes.  I dropped an an important adjective, "implicit".
>>>>>
>>>>>      "guest RAM created with the global -m option but without an explicit associated
>>>>>      memory-backend object and without the -mem-path option"
>>>>>
>>>>>>> To access the same memory in the old and new QEMU processes, the memory
>>>>>>> must be mapped shared.  Therefore, the implementation always sets
>>>>>>> RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the
>>>>>>> user must explicitly specify the share option.  In lieu of defining a new
>>>>>> so statement at the top that memory-backend-ram is affected is not
>>>>>> really valid?
>>>>>
>>>>> memory-backend-ram is affected by alloc-anon.  But in addition, the user must
>>>>> explicitly add the "share" option.  I don't implicitly set share in this case,
>>>>> because I would be overriding the user's specification of the memory object's property,
>>>>> which would be private if omitted.
>>>>
>>>> instead of touching implicit RAM (-m), it would be better to error out
>>>> and ask user to provide properly configured memory-backend explicitly.
>>>>
>>>>>
>>>>>>> RAM flag, at the lowest level the implementation uses RAM_SHARED with fd=-1
>>>>>>> as the condition for calling memfd_create.
>>>>>>
>>>>>> In general I do dislike adding yet another option that will affect
>>>>>> guest RAM allocation (memory-backends  should be sufficient).
>>>>>>
>>>>>> However I do see that you need memfd for device memory (vram, roms, ...).
>>>>>> Can we just use memfd/shared unconditionally for those and
>>>>>> avoid introducing a new confusing option?
>>>>>
>>>>> The Linux kernel has different tunables for backing memfd's with huge pages, so we
>>>>> could hurt performance if we unconditionally change to memfd.  The user should have
>>>>> a choice for any segment that is large enough for huge pages to improve performance,
>>>>> which potentially is any memory-backend-object.  The non memory-backend objects are
>>>>> small, and it would be OK to use memfd unconditionally for them.
>>>
>>> Thanks everyone for your feedback.  The common theme is that you dislike that the
>>> new option modifies the allocation of memory-backend-objects.  OK, accepted.  I propose
>>> to remove that interaction, and document in the QAPI which backends work for CPR.
>>> Specifically, memory-backend-memfd or memory-backend-file object is required,
>>> with share=on (which is the default for memory-backend-memfd).  CPR will be blocked
>>> otherwise.  The legacy -m option without an explicit memory-backend-object will not
>>> support CPR.
>>>
>>> Non memory-backend-objects (ramblocks not described on the qemu command line) will always
>>> be allocated using memfd_create (on Linux only).  The alloc-anon option is deleted.
>>> The logic in ram_block_add becomes:
>>>
>>>       if (!new_block->host) {
>>>           if (xen_enabled()) {
>>>               ...
>>>           } else if (!object_dynamic_cast(new_block->mr->parent_obj.parent,
>>>                                           TYPE_MEMORY_BACKEND)) {
>>>               qemu_memfd_create()
>>>           } else {
>>>               qemu_anon_ram_alloc()
>>>           }
>>>
>>> Is that acceptable to everyone?  Igor, Peter, Daniel?
> 
> Sorry for a late reply.
> 
> I think this may not work as David pointed out? Where AFAIU it will switch
> many old anon use cases to use memfd, aka, shmem, and it might be
> problematic when share=off: we have double memory consumption issue with
> shmem with private mapping.
> 
> I assume that includes things like "-m", "memory-backend-ram", and maybe
> more.  IIUC memory consumption of the VM will double with them.

The new proposal only affects anon allocations that are not described on
the command line, and their memfd will be shared.  There is no
command line option which would set share=off for these blocks.

"-m" and "memory-backend-ram" are not affected.
They will not work with CPR.

I will respond to your other comments separately.

- Steve
Peter Xu Aug. 13, 2024, 6:45 p.m. UTC | #16
On Tue, Aug 13, 2024 at 11:00:37AM -0600, Alex Williamson wrote:
> > > Note that the old-QEMU contents of all ramblocks must be preserved, just like
> > > in live migration.  Live migration copies the contents in the stream.  Live update
> > > preserves the contents in place by preserving the memfd.  Thus memfd serves
> > > two purposes: preserving old contents, and preserving DMA mapped pinned pages.  
> > 
> > IMHO the 1st purpose is a fake one.  IOW:
> > 
> >   - Preserving content will be important on large RAM/ROM regions.  When
> >     it's small, it shouldn't matter a huge deal, IMHO, because this is
> >     about "how fast we can migrate / live upgrade'.  IOW, this is not a
> >     functional requirement.
> 
> Regardless of the size of a ROM region, how would it ever be faster to
> migrate ROMs rather that reload them from stable media on the target?
> Furthermore, what mechanism other than migrating the ROM do we have to
> guarantee the contents of the ROM are identical?

IIRC we need to migrate ROMs in some form because they can be different on
src/dst, e.g., ROM files can upgrade after QEMU upgrades.  Here either
putting them onto migration stream, or making that fd-based should work.

Frankly I don't solidly know the details on why they can't be different.
My current understanding was that if one device boots with one version of
ROM/firmware, then it's possible the device keep interacting with the ROM
region in some way (in the form of referring addresses in this specific
version of ROM?), so that it may stop working if the ROM content changed.

IOW, if my understanding is correct, new ROM files won't get used after
migration automatically, but it requires one system reset.  When a system
reset triggered after VM migrated to destination host, it'll reload device
ROMs with the files specified (which will start to point to the upgraded
version of ROM files), and IIUC that's where the devices will boostrap with
the new ROM / BIOS files.

> 
> I have a hard time accepting that ROMs are only migrated for
> performance

AFAICT, it's never about performance or making it faster when putting ROM
data on wire.  Even in this context where Steve wanted to use fd backing
the ROMs, then putting that on wire is still slower than sharing fds.

Here my previous comment / point was that this should be a small region, so
it shouldn't matter a huge deal for ROMs to migrate either through the wire
or via the fd page cache.  I wanted to remove one more dependency that we
may even need the new -alloc-anon parameter as it doesn't sound required
for ROM migrations.

> and there isn't some aspect of migrating them to ensure the
> contents remain identical, and by that token CPR would also need to
> preserve the contents to provide the same guarantee.  Thanks,

Thanks,
Steven Sistare Aug. 13, 2024, 6:46 p.m. UTC | #17
On 8/13/2024 1:00 PM, Alex Williamson wrote:
> On Tue, 13 Aug 2024 11:35:15 -0400
> Peter Xu <peterx@redhat.com> wrote:
> 
>> On Mon, Aug 12, 2024 at 02:37:59PM -0400, Steven Sistare wrote:
>>> On 8/8/2024 2:32 PM, Steven Sistare wrote:
>>>> On 7/29/2024 8:29 AM, Igor Mammedov wrote:
>>>>> On Sat, 20 Jul 2024 16:28:25 -0400
>>>>> Steven Sistare <steven.sistare@oracle.com> wrote:
>>>>>    
>>>>>> On 7/16/2024 5:19 AM, Igor Mammedov wrote:
>>>>>>> On Sun, 30 Jun 2024 12:40:24 -0700
>>>>>>> Steve Sistare <steven.sistare@oracle.com> wrote:
>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>> on the value of the anon-alloc machine property.  This affects
>>>>>>>> memory-backend-ram objects, guest RAM created with the global -m option
>>>>>>>> but without an associated memory-backend object and without the -mem-path
>>>>>>>> option
>>>>>>> nowadays, all machines were converted to use memory backend for VM RAM.
>>>>>>> so -m option implicitly creates memory-backend object,
>>>>>>> which will be either MEMORY_BACKEND_FILE if -mem-path present
>>>>>>> or MEMORY_BACKEND_RAM otherwise.
>>>>>>
>>>>>> Yes.  I dropped an an important adjective, "implicit".
>>>>>>
>>>>>>      "guest RAM created with the global -m option but without an explicit associated
>>>>>>      memory-backend object and without the -mem-path option"
>>>>>>    
>>>>>>>> To access the same memory in the old and new QEMU processes, the memory
>>>>>>>> must be mapped shared.  Therefore, the implementation always sets
>>>>>>>> RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the
>>>>>>>> user must explicitly specify the share option.  In lieu of defining a new
>>>>>>> so statement at the top that memory-backend-ram is affected is not
>>>>>>> really valid?
>>>>>>
>>>>>> memory-backend-ram is affected by alloc-anon.  But in addition, the user must
>>>>>> explicitly add the "share" option.  I don't implicitly set share in this case,
>>>>>> because I would be overriding the user's specification of the memory object's property,
>>>>>> which would be private if omitted.
>>>>>
>>>>> instead of touching implicit RAM (-m), it would be better to error out
>>>>> and ask user to provide properly configured memory-backend explicitly.
>>>>>    
>>>>>>    
>>>>>>>> RAM flag, at the lowest level the implementation uses RAM_SHARED with fd=-1
>>>>>>>> as the condition for calling memfd_create.
>>>>>>>
>>>>>>> In general I do dislike adding yet another option that will affect
>>>>>>> guest RAM allocation (memory-backends  should be sufficient).
>>>>>>>
>>>>>>> However I do see that you need memfd for device memory (vram, roms, ...).
>>>>>>> Can we just use memfd/shared unconditionally for those and
>>>>>>> avoid introducing a new confusing option?
>>>>>>
>>>>>> The Linux kernel has different tunables for backing memfd's with huge pages, so we
>>>>>> could hurt performance if we unconditionally change to memfd.  The user should have
>>>>>> a choice for any segment that is large enough for huge pages to improve performance,
>>>>>> which potentially is any memory-backend-object.  The non memory-backend objects are
>>>>>> small, and it would be OK to use memfd unconditionally for them.
>>>>
>>>> Thanks everyone for your feedback.  The common theme is that you dislike that the
>>>> new option modifies the allocation of memory-backend-objects.  OK, accepted.  I propose
>>>> to remove that interaction, and document in the QAPI which backends work for CPR.
>>>> Specifically, memory-backend-memfd or memory-backend-file object is required,
>>>> with share=on (which is the default for memory-backend-memfd).  CPR will be blocked
>>>> otherwise.  The legacy -m option without an explicit memory-backend-object will not
>>>> support CPR.
>>>>
>>>> Non memory-backend-objects (ramblocks not described on the qemu command line) will always
>>>> be allocated using memfd_create (on Linux only).  The alloc-anon option is deleted.
>>>> The logic in ram_block_add becomes:
>>>>
>>>>       if (!new_block->host) {
>>>>           if (xen_enabled()) {
>>>>               ...
>>>>           } else if (!object_dynamic_cast(new_block->mr->parent_obj.parent,
>>>>                                           TYPE_MEMORY_BACKEND)) {
>>>>               qemu_memfd_create()
>>>>           } else {
>>>>               qemu_anon_ram_alloc()
>>>>           }
>>>>
>>>> Is that acceptable to everyone?  Igor, Peter, Daniel?
>>
>> Sorry for a late reply.
>>
>> I think this may not work as David pointed out? Where AFAIU it will switch
>> many old anon use cases to use memfd, aka, shmem, and it might be
>> problematic when share=off: we have double memory consumption issue with
>> shmem with private mapping.
>>
>> I assume that includes things like "-m", "memory-backend-ram", and maybe
>> more.  IIUC memory consumption of the VM will double with them.
>>
>>>
>>> In a simple test here are the NON-memory-backend-object ramblocks which
>>> are allocated with memfd_create in my new proposal:
>>>
>>>    memfd_create system.flash0 3653632 @ 0x7fffe1000000 2 rw
>>>    memfd_create system.flash1 540672 @ 0x7fffe0c00000 2 rw
>>>    memfd_create pc.rom 131072 @ 0x7fffe0800000 2 rw
>>>    memfd_create vga.vram 16777216 @ 0x7fffcac00000 2 rw
>>>    memfd_create vga.rom 65536 @ 0x7fffe0400000 2 rw
>>>    memfd_create /rom@etc/acpi/tables 2097152 @ 0x7fffca400000 6 rw
>>>    memfd_create /rom@etc/table-loader 65536 @ 0x7fffca000000 6 rw
>>>    memfd_create /rom@etc/acpi/rsdp 4096 @ 0x7fffc9c00000 6 rw
>>>
>>> Of those, only a subset are mapped for DMA, per the existing QEMU logic,
>>> no changes from me:
>>>
>>>    dma_map: pc.rom 131072 @ 0x7fffe0800000 ro
>>>    dma_map: vga.vram 16777216 @ 0x7fffcac00000 rw
>>>    dma_map: vga.rom 65536 @ 0x7fffe0400000 ro
>>
>> I wonder whether there's any case that the "rom"s can be DMA target at
>> all..  I understand it's logically possible to be READ from as ROMs, but I
>> am curious what happens if we don't map them at all when they're ROMs, or
>> whether there's any device that can (in real life) DMA from device ROMs,
>> and for what use.
>>
>>>    dma_map: 0000:3a:10.0 BAR 0 mmaps[0] 16384 @ 0x7ffff7fef000 rw
>>>    dma_map: 0000:3a:10.0 BAR 3 mmaps[0] 12288 @ 0x7ffff7fec000 rw
>>>
>>> system.flash0 is excluded by the vfio listener because it is a rom_device.
>>> The rom@etc blocks are excluded because their MemoryRegions are not added to
>>> any container region, so the flatmem traversal of the AS used by the listener
>>> does not see them.
>>>
>>> The BARs should not be mapped IMO, and I propose excluding them in the
>>> iommufd series:
>>>    https://lore.kernel.org/qemu-devel/1721502937-87102-3-git-send-email-steven.sistare@oracle.com/
>>
>> Looks like this is clear now that they should be there.
>>
>>>
>>> Note that the old-QEMU contents of all ramblocks must be preserved, just like
>>> in live migration.  Live migration copies the contents in the stream.  Live update
>>> preserves the contents in place by preserving the memfd.  Thus memfd serves
>>> two purposes: preserving old contents, and preserving DMA mapped pinned pages.
>>
>> IMHO the 1st purpose is a fake one.  IOW:
>>
>>    - Preserving content will be important on large RAM/ROM regions.  When
>>      it's small, it shouldn't matter a huge deal, IMHO, because this is
>>      about "how fast we can migrate / live upgrade'.  IOW, this is not a
>>      functional requirement.
> 
> Regardless of the size of a ROM region, how would it ever be faster to
> migrate ROMs rather that reload them from stable media on the target?
> Furthermore, what mechanism other than migrating the ROM do we have to
> guarantee the contents of the ROM are identical?
> 
> I have a hard time accepting that ROMs are only migrated for
> performance and there isn't some aspect of migrating them to ensure the
> contents remain identical, and by that token CPR would also need to
> preserve the contents to provide the same guarantee.  Thanks,

I agree.  Any ramblock may change if the contents are read from a file in
the QEMU distribution, or if the contents are composed by QEMU code.  Live
migration guards against this by sending the old ramblock contents in the
migration stream.

- Steve

>>    - DMA mapped pinned pages: instead this is a hard requirement that we
>>      must make sure these pages are fd-based, because only a fd-based
>>      mapping can persist the pages (via page cache).
>>
>> IMHO we shouldn't mangle them, and we should start with sticking with the
>> 2nd goal here.  To be explicit, if we can find a good replacement for
>> -alloc-anon, IMHO we could still migrate the ramblocks only fall into the
>> 1st purpose category, e.g. device ROMs, hopefully even if they're pinned,
>> they should never be DMAed to/from.
>>
>> Thanks,
>>
>
Steven Sistare Aug. 13, 2024, 6:49 p.m. UTC | #18
On 8/13/2024 2:46 PM, Steven Sistare wrote:
> On 8/13/2024 1:00 PM, Alex Williamson wrote:
>> On Tue, 13 Aug 2024 11:35:15 -0400
>> Peter Xu <peterx@redhat.com> wrote:
>>
>>> On Mon, Aug 12, 2024 at 02:37:59PM -0400, Steven Sistare wrote:
>>>> On 8/8/2024 2:32 PM, Steven Sistare wrote:
>>>>> On 7/29/2024 8:29 AM, Igor Mammedov wrote:
>>>>>> On Sat, 20 Jul 2024 16:28:25 -0400
>>>>>> Steven Sistare <steven.sistare@oracle.com> wrote:
>>>>>>> On 7/16/2024 5:19 AM, Igor Mammedov wrote:
>>>>>>>> On Sun, 30 Jun 2024 12:40:24 -0700
>>>>>>>> Steve Sistare <steven.sistare@oracle.com> wrote:
>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending
>>>>>>>>> on the value of the anon-alloc machine property.  This affects
>>>>>>>>> memory-backend-ram objects, guest RAM created with the global -m option
>>>>>>>>> but without an associated memory-backend object and without the -mem-path
>>>>>>>>> option
>>>>>>>> nowadays, all machines were converted to use memory backend for VM RAM.
>>>>>>>> so -m option implicitly creates memory-backend object,
>>>>>>>> which will be either MEMORY_BACKEND_FILE if -mem-path present
>>>>>>>> or MEMORY_BACKEND_RAM otherwise.
>>>>>>>
>>>>>>> Yes.  I dropped an an important adjective, "implicit".
>>>>>>>
>>>>>>>      "guest RAM created with the global -m option but without an explicit associated
>>>>>>>      memory-backend object and without the -mem-path option"
>>>>>>>>> To access the same memory in the old and new QEMU processes, the memory
>>>>>>>>> must be mapped shared.  Therefore, the implementation always sets
>>>>>>>>> RAM_SHARED if alloc-anon=memfd, except for memory-backend-ram, where the
>>>>>>>>> user must explicitly specify the share option.  In lieu of defining a new
>>>>>>>> so statement at the top that memory-backend-ram is affected is not
>>>>>>>> really valid?
>>>>>>>
>>>>>>> memory-backend-ram is affected by alloc-anon.  But in addition, the user must
>>>>>>> explicitly add the "share" option.  I don't implicitly set share in this case,
>>>>>>> because I would be overriding the user's specification of the memory object's property,
>>>>>>> which would be private if omitted.
>>>>>>
>>>>>> instead of touching implicit RAM (-m), it would be better to error out
>>>>>> and ask user to provide properly configured memory-backend explicitly.
>>>>>>>>> RAM flag, at the lowest level the implementation uses RAM_SHARED with fd=-1
>>>>>>>>> as the condition for calling memfd_create.
>>>>>>>>
>>>>>>>> In general I do dislike adding yet another option that will affect
>>>>>>>> guest RAM allocation (memory-backends  should be sufficient).
>>>>>>>>
>>>>>>>> However I do see that you need memfd for device memory (vram, roms, ...).
>>>>>>>> Can we just use memfd/shared unconditionally for those and
>>>>>>>> avoid introducing a new confusing option?
>>>>>>>
>>>>>>> The Linux kernel has different tunables for backing memfd's with huge pages, so we
>>>>>>> could hurt performance if we unconditionally change to memfd.  The user should have
>>>>>>> a choice for any segment that is large enough for huge pages to improve performance,
>>>>>>> which potentially is any memory-backend-object.  The non memory-backend objects are
>>>>>>> small, and it would be OK to use memfd unconditionally for them.
>>>>>
>>>>> Thanks everyone for your feedback.  The common theme is that you dislike that the
>>>>> new option modifies the allocation of memory-backend-objects.  OK, accepted.  I propose
>>>>> to remove that interaction, and document in the QAPI which backends work for CPR.
>>>>> Specifically, memory-backend-memfd or memory-backend-file object is required,
>>>>> with share=on (which is the default for memory-backend-memfd).  CPR will be blocked
>>>>> otherwise.  The legacy -m option without an explicit memory-backend-object will not
>>>>> support CPR.
>>>>>
>>>>> Non memory-backend-objects (ramblocks not described on the qemu command line) will always
>>>>> be allocated using memfd_create (on Linux only).  The alloc-anon option is deleted.
>>>>> The logic in ram_block_add becomes:
>>>>>
>>>>>       if (!new_block->host) {
>>>>>           if (xen_enabled()) {
>>>>>               ...
>>>>>           } else if (!object_dynamic_cast(new_block->mr->parent_obj.parent,
>>>>>                                           TYPE_MEMORY_BACKEND)) {
>>>>>               qemu_memfd_create()
>>>>>           } else {
>>>>>               qemu_anon_ram_alloc()
>>>>>           }
>>>>>
>>>>> Is that acceptable to everyone?  Igor, Peter, Daniel?
>>>
>>> Sorry for a late reply.
>>>
>>> I think this may not work as David pointed out? Where AFAIU it will switch
>>> many old anon use cases to use memfd, aka, shmem, and it might be
>>> problematic when share=off: we have double memory consumption issue with
>>> shmem with private mapping.
>>>
>>> I assume that includes things like "-m", "memory-backend-ram", and maybe
>>> more.  IIUC memory consumption of the VM will double with them.
>>>
>>>>
>>>> In a simple test here are the NON-memory-backend-object ramblocks which
>>>> are allocated with memfd_create in my new proposal:
>>>>
>>>>    memfd_create system.flash0 3653632 @ 0x7fffe1000000 2 rw
>>>>    memfd_create system.flash1 540672 @ 0x7fffe0c00000 2 rw
>>>>    memfd_create pc.rom 131072 @ 0x7fffe0800000 2 rw
>>>>    memfd_create vga.vram 16777216 @ 0x7fffcac00000 2 rw
>>>>    memfd_create vga.rom 65536 @ 0x7fffe0400000 2 rw
>>>>    memfd_create /rom@etc/acpi/tables 2097152 @ 0x7fffca400000 6 rw
>>>>    memfd_create /rom@etc/table-loader 65536 @ 0x7fffca000000 6 rw
>>>>    memfd_create /rom@etc/acpi/rsdp 4096 @ 0x7fffc9c00000 6 rw
>>>>
>>>> Of those, only a subset are mapped for DMA, per the existing QEMU logic,
>>>> no changes from me:
>>>>
>>>>    dma_map: pc.rom 131072 @ 0x7fffe0800000 ro
>>>>    dma_map: vga.vram 16777216 @ 0x7fffcac00000 rw
>>>>    dma_map: vga.rom 65536 @ 0x7fffe0400000 ro
>>>
>>> I wonder whether there's any case that the "rom"s can be DMA target at
>>> all..  I understand it's logically possible to be READ from as ROMs, but I
>>> am curious what happens if we don't map them at all when they're ROMs, or
>>> whether there's any device that can (in real life) DMA from device ROMs,
>>> and for what use.
>>>
>>>>    dma_map: 0000:3a:10.0 BAR 0 mmaps[0] 16384 @ 0x7ffff7fef000 rw
>>>>    dma_map: 0000:3a:10.0 BAR 3 mmaps[0] 12288 @ 0x7ffff7fec000 rw
>>>>
>>>> system.flash0 is excluded by the vfio listener because it is a rom_device.
>>>> The rom@etc blocks are excluded because their MemoryRegions are not added to
>>>> any container region, so the flatmem traversal of the AS used by the listener
>>>> does not see them.
>>>>
>>>> The BARs should not be mapped IMO, and I propose excluding them in the
>>>> iommufd series:
>>>>    https://lore.kernel.org/qemu-devel/1721502937-87102-3-git-send-email-steven.sistare@oracle.com/
>>>
>>> Looks like this is clear now that they should be there.
>>>
>>>>
>>>> Note that the old-QEMU contents of all ramblocks must be preserved, just like
>>>> in live migration.  Live migration copies the contents in the stream.  Live update
>>>> preserves the contents in place by preserving the memfd.  Thus memfd serves
>>>> two purposes: preserving old contents, and preserving DMA mapped pinned pages.
>>>
>>> IMHO the 1st purpose is a fake one.  IOW:
>>>
>>>    - Preserving content will be important on large RAM/ROM regions.  When
>>>      it's small, it shouldn't matter a huge deal, IMHO, because this is
>>>      about "how fast we can migrate / live upgrade'.  IOW, this is not a
>>>      functional requirement.
>>
>> Regardless of the size of a ROM region, how would it ever be faster to
>> migrate ROMs rather that reload them from stable media on the target?
>> Furthermore, what mechanism other than migrating the ROM do we have to
>> guarantee the contents of the ROM are identical?
>>
>> I have a hard time accepting that ROMs are only migrated for
>> performance and there isn't some aspect of migrating them to ensure the
>> contents remain identical, and by that token CPR would also need to
>> preserve the contents to provide the same guarantee.  Thanks,
> 
> I agree.  Any ramblock may change if the contents are read from a file in
> the QEMU distribution, or if the contents are composed by QEMU code.  Live
> migration guards against this by sending the old ramblock contents in the
> migration stream.

Our emails just crossed.  I will repost this to your new reply so keep a single
conversation thread.

- Steve

> 
>>>    - DMA mapped pinned pages: instead this is a hard requirement that we
>>>      must make sure these pages are fd-based, because only a fd-based
>>>      mapping can persist the pages (via page cache).
>>>
>>> IMHO we shouldn't mangle them, and we should start with sticking with the
>>> 2nd goal here.  To be explicit, if we can find a good replacement for
>>> -alloc-anon, IMHO we could still migrate the ramblocks only fall into the
>>> 1st purpose category, e.g. device ROMs, hopefully even if they're pinned,
>>> they should never be DMAed to/from.
>>>
>>> Thanks,
>>>
>>
Steven Sistare Aug. 13, 2024, 6:56 p.m. UTC | #19
On 8/13/2024 2:45 PM, Peter Xu wrote:
> On Tue, Aug 13, 2024 at 11:00:37AM -0600, Alex Williamson wrote:
>>>> Note that the old-QEMU contents of all ramblocks must be preserved, just like
>>>> in live migration.  Live migration copies the contents in the stream.  Live update
>>>> preserves the contents in place by preserving the memfd.  Thus memfd serves
>>>> two purposes: preserving old contents, and preserving DMA mapped pinned pages.
>>>
>>> IMHO the 1st purpose is a fake one.  IOW:
>>>
>>>    - Preserving content will be important on large RAM/ROM regions.  When
>>>      it's small, it shouldn't matter a huge deal, IMHO, because this is
>>>      about "how fast we can migrate / live upgrade'.  IOW, this is not a
>>>      functional requirement.
>>
>> Regardless of the size of a ROM region, how would it ever be faster to
>> migrate ROMs rather that reload them from stable media on the target?
>> Furthermore, what mechanism other than migrating the ROM do we have to
>> guarantee the contents of the ROM are identical?
> 
> IIRC we need to migrate ROMs in some form because they can be different on
> src/dst, e.g., ROM files can upgrade after QEMU upgrades.  Here either
> putting them onto migration stream, or making that fd-based should work.

Agreed.

> Frankly I don't solidly know the details on why they can't be different.

Any ramblock may change if the contents are read from a file in
the QEMU distribution, or if the contents are composed by QEMU code.

> My current understanding was that if one device boots with one version of
> ROM/firmware, then it's possible the device keep interacting with the ROM
> region in some way (in the form of referring addresses in this specific
> version of ROM?), so that it may stop working if the ROM content changed.

Yes, the guest may continue to read data or execute code from the block.

> IOW, if my understanding is correct, new ROM files won't get used after
> migration automatically, but it requires one system reset.  When a system
> reset triggered after VM migrated to destination host, it'll reload device
> ROMs with the files specified (which will start to point to the upgraded
> version of ROM files), and IIUC that's where the devices will boostrap with
> the new ROM / BIOS files.

Agreed.

>> I have a hard time accepting that ROMs are only migrated for
>> performance
> 
> AFAICT, it's never about performance or making it faster when putting ROM
> data on wire.  Even in this context where Steve wanted to use fd backing
> the ROMs, then putting that on wire is still slower than sharing fds.
> 
> Here my previous comment / point was that this should be a small region, so
> it shouldn't matter a huge deal for ROMs to migrate either through the wire
> or via the fd page cache.  I wanted to remove one more dependency that we
> may even need the new -alloc-anon parameter as it doesn't sound required
> for ROM migrations.

Agreed that performance is not the issue.  I use memfd for correctness.

Also note that my new proposal deletes the alloc-anon parameter.  I allocate
non-command-line ramblocks unconditionally using memfd_create.

- Steve

>> and there isn't some aspect of migrating them to ensure the
>> contents remain identical, and by that token CPR would also need to
>> preserve the contents to provide the same guarantee.  Thanks,
> 
> Thanks,
>
Peter Xu Aug. 13, 2024, 7:02 p.m. UTC | #20
On Tue, Aug 13, 2024 at 01:34:42PM -0400, Steven Sistare wrote:
> > > > Non memory-backend-objects (ramblocks not described on the qemu command line) will always
> > > > be allocated using memfd_create (on Linux only).  The alloc-anon option is deleted.
> > > > The logic in ram_block_add becomes:
> > > > 
> > > >       if (!new_block->host) {
> > > >           if (xen_enabled()) {
> > > >               ...
> > > >           } else if (!object_dynamic_cast(new_block->mr->parent_obj.parent,
> > > >                                           TYPE_MEMORY_BACKEND)) {
> > > >               qemu_memfd_create()
> > > >           } else {
> > > >               qemu_anon_ram_alloc()
> > > >           }
> > > > 
> > > > Is that acceptable to everyone?  Igor, Peter, Daniel?
> > 
> > Sorry for a late reply.
> > 
> > I think this may not work as David pointed out? Where AFAIU it will switch
> > many old anon use cases to use memfd, aka, shmem, and it might be
> > problematic when share=off: we have double memory consumption issue with
> > shmem with private mapping.
> > 
> > I assume that includes things like "-m", "memory-backend-ram", and maybe
> > more.  IIUC memory consumption of the VM will double with them.
> 
> The new proposal only affects anon allocations that are not described on
> the command line, and their memfd will be shared.  There is no
> command line option which would set share=off for these blocks.
> 
> "-m" and "memory-backend-ram" are not affected.
> They will not work with CPR.

Hmm yeah memory-backend-ram should be TYPE_MEMORY_BACKEND for sure.. and I
just noticed "-m" looks like the same.

Though this change still gives me the feeling that we don't yet know who's
the target of this change at all, and what purpose it services.

I'll see how others see this.  For me, at least in this case I think it'll
be nice the "else if" will not be used unless cpr is enabled in the first
place.  But that's still a bit hacky to me.

Thanks,
diff mbox series

Patch

diff --git a/hw/core/machine.c b/hw/core/machine.c
index 655d75c..7ca2ad0 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -454,6 +454,20 @@  static void machine_set_mem_merge(Object *obj, bool value, Error **errp)
     ms->mem_merge = value;
 }
 
+static int machine_get_anon_alloc(Object *obj, Error **errp)
+{
+    MachineState *ms = MACHINE(obj);
+
+    return ms->anon_alloc;
+}
+
+static void machine_set_anon_alloc(Object *obj, int value, Error **errp)
+{
+    MachineState *ms = MACHINE(obj);
+
+    ms->anon_alloc = value;
+}
+
 static bool machine_get_usb(Object *obj, Error **errp)
 {
     MachineState *ms = MACHINE(obj);
@@ -1066,6 +1080,11 @@  static void machine_class_init(ObjectClass *oc, void *data)
     object_class_property_set_description(oc, "mem-merge",
         "Enable/disable memory merge support");
 
+    object_class_property_add_enum(oc, "anon-alloc", "AnonAllocOption",
+                                   &AnonAllocOption_lookup,
+                                   machine_get_anon_alloc,
+                                   machine_set_anon_alloc);
+
     object_class_property_add_bool(oc, "usb",
         machine_get_usb, machine_set_usb);
     object_class_property_set_description(oc, "usb",
@@ -1416,6 +1435,11 @@  static bool create_default_memdev(MachineState *ms, const char *path, Error **er
     if (!object_property_set_int(obj, "size", ms->ram_size, errp)) {
         goto out;
     }
+    if (!object_property_set_bool(obj, "share",
+                                  ms->anon_alloc == ANON_ALLOC_OPTION_MEMFD,
+                                  errp)) {
+        goto out;
+    }
     object_property_add_child(object_get_objects_root(), mc->default_ram_id,
                               obj);
     /* Ensure backend's memory region name is equal to mc->default_ram_id */
diff --git a/include/hw/boards.h b/include/hw/boards.h
index 73ad319..77f16ad 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -383,6 +383,7 @@  struct MachineState {
     bool enable_graphics;
     ConfidentialGuestSupport *cgs;
     HostMemoryBackend *memdev;
+    AnonAllocOption anon_alloc;
     /*
      * convenience alias to ram_memdev_id backend memory region
      * or to numa container memory region
diff --git a/qapi/machine.json b/qapi/machine.json
index 2fd3e9c..9173953 100644
--- a/qapi/machine.json
+++ b/qapi/machine.json
@@ -1881,3 +1881,17 @@ 
 { 'command': 'x-query-interrupt-controllers',
   'returns': 'HumanReadableText',
   'features': [ 'unstable' ]}
+
+##
+# @AnonAllocOption:
+#
+# An enumeration of the options for allocating anonymous guest memory.
+#
+# @mmap: allocate using mmap MAP_ANON
+#
+# @memfd: allocate using memfd_create
+#
+# Since: 9.1
+##
+{ 'enum': 'AnonAllocOption',
+  'data': [ 'mmap', 'memfd' ] }
diff --git a/qemu-options.hx b/qemu-options.hx
index 8ca7f34..595b693 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -38,6 +38,7 @@  DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
     "                nvdimm=on|off controls NVDIMM support (default=off)\n"
     "                memory-encryption=@var{} memory encryption object to use (default=none)\n"
     "                hmat=on|off controls ACPI HMAT support (default=off)\n"
+    "                anon-alloc=mmap|memfd allocate anonymous guest RAM using mmap MAP_ANON or memfd_create (default: mmap)\n"
     "                memory-backend='backend-id' specifies explicitly provided backend for main RAM (default=none)\n"
     "                cxl-fmw.0.targets.0=firsttarget,cxl-fmw.0.targets.1=secondtarget,cxl-fmw.0.size=size[,cxl-fmw.0.interleave-granularity=granularity]\n",
     QEMU_ARCH_ALL)
@@ -101,6 +102,18 @@  SRST
         Enables or disables ACPI Heterogeneous Memory Attribute Table
         (HMAT) support. The default is off.
 
+    ``anon-alloc=mmap|memfd``
+        Allocate anonymous guest RAM using mmap MAP_ANON (the default)
+        or memfd_create.  This affects memory-backend-ram objects,
+        RAM created with the global -m option but without an
+        associated memory-backend object and without the -mem-path
+        option, and various memory regions such as ROMs that are
+        allocated when devices are created.  This option does not
+        affect memory-backend-file, memory-backend-memfd, or
+        memory-backend-epc objects.
+
+        Some migration modes require anon-alloc=memfd.
+
     ``memory-backend='id'``
         An alternative to legacy ``-mem-path`` and ``mem-prealloc`` options.
         Allows to use a memory backend as main RAM.
diff --git a/system/memory.c b/system/memory.c
index 2d69521..28a837d 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -1552,8 +1552,10 @@  bool memory_region_init_ram_nomigrate(MemoryRegion *mr,
                                       uint64_t size,
                                       Error **errp)
 {
+    uint32_t flags = (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD) ?
+                     RAM_SHARED : 0;
     return memory_region_init_ram_flags_nomigrate(mr, owner, name,
-                                                  size, 0, errp);
+                                                  size, flags, errp);
 }
 
 bool memory_region_init_ram_flags_nomigrate(MemoryRegion *mr,
@@ -1713,8 +1715,10 @@  bool memory_region_init_rom_nomigrate(MemoryRegion *mr,
                                       uint64_t size,
                                       Error **errp)
 {
+    uint32_t flags = (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD) ?
+                     RAM_SHARED : 0;
     if (!memory_region_init_ram_flags_nomigrate(mr, owner, name,
-                                                size, 0, errp)) {
+                                                size, flags, errp)) {
          return false;
     }
     mr->readonly = true;
@@ -1731,6 +1735,8 @@  bool memory_region_init_rom_device_nomigrate(MemoryRegion *mr,
                                              Error **errp)
 {
     Error *err = NULL;
+    uint32_t flags = (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD) ?
+                     RAM_SHARED : 0;
     assert(ops);
     memory_region_init(mr, owner, name, size);
     mr->ops = ops;
@@ -1738,7 +1744,7 @@  bool memory_region_init_rom_device_nomigrate(MemoryRegion *mr,
     mr->terminates = true;
     mr->rom_device = true;
     mr->destructor = memory_region_destructor_ram;
-    mr->ram_block = qemu_ram_alloc(size, 0, mr, &err);
+    mr->ram_block = qemu_ram_alloc(size, flags, mr, &err);
     if (err) {
         mr->size = int128_zero();
         object_unparent(OBJECT(mr));
diff --git a/system/physmem.c b/system/physmem.c
index 33d09f7..efe95ff 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -47,6 +47,7 @@ 
 #include "qemu/qemu-print.h"
 #include "qemu/log.h"
 #include "qemu/memalign.h"
+#include "qemu/memfd.h"
 #include "exec/memory.h"
 #include "exec/ioport.h"
 #include "sysemu/dma.h"
@@ -54,6 +55,7 @@ 
 #include "sysemu/hw_accel.h"
 #include "sysemu/xen-mapcache.h"
 #include "trace/trace-root.h"
+#include "trace.h"
 
 #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
 #include <linux/falloc.h>
@@ -69,6 +71,8 @@ 
 
 #include "qemu/pmem.h"
 
+#include "qapi/qapi-types-migration.h"
+#include "migration/options.h"
 #include "migration/vmstate.h"
 
 #include "qemu/range.h"
@@ -1828,6 +1832,32 @@  static void ram_block_add(RAMBlock *new_block, Error **errp)
                 qemu_mutex_unlock_ramlist();
                 return;
             }
+
+        } else if (new_block->flags & RAM_SHARED) {
+            size_t max_length = new_block->max_length;
+            MemoryRegion *mr = new_block->mr;
+            const char *name = memory_region_name(mr);
+
+            new_block->mr->align = QEMU_VMALLOC_ALIGN;
+
+            if (new_block->fd == -1) {
+                new_block->fd = qemu_memfd_create(name, max_length + mr->align,
+                                                  0, 0, 0, errp);
+            }
+
+            if (new_block->fd >= 0) {
+                int mfd = new_block->fd;
+                qemu_set_cloexec(mfd);
+                new_block->host = file_ram_alloc(new_block, max_length, mfd,
+                                                 false, 0, errp);
+            }
+            if (!new_block->host) {
+                qemu_mutex_unlock_ramlist();
+                return;
+            }
+            memory_try_enable_merging(new_block->host, new_block->max_length);
+            free_on_error = true;
+
         } else {
             new_block->host = qemu_anon_ram_alloc(new_block->max_length,
                                                   &new_block->mr->align,
@@ -1911,6 +1941,9 @@  static void ram_block_add(RAMBlock *new_block, Error **errp)
         ram_block_notify_add(new_block->host, new_block->used_length,
                              new_block->max_length);
     }
+    trace_ram_block_add(memory_region_name(new_block->mr), new_block->flags,
+                        new_block->fd, new_block->used_length,
+                        new_block->max_length);
     return;
 
 out_free:
@@ -2097,8 +2130,11 @@  RAMBlock *qemu_ram_alloc_resizeable(ram_addr_t size, ram_addr_t maxsz,
                                                      void *host),
                                      MemoryRegion *mr, Error **errp)
 {
+    uint32_t flags = (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD) ?
+                     RAM_SHARED : 0;
+    flags |= RAM_RESIZEABLE;
     return qemu_ram_alloc_internal(size, maxsz, resized, NULL,
-                                   RAM_RESIZEABLE, mr, errp);
+                                   flags, mr, errp);
 }
 
 static void reclaim_ramblock(RAMBlock *block)
diff --git a/system/trace-events b/system/trace-events
index 69c9044..f8ebf42 100644
--- a/system/trace-events
+++ b/system/trace-events
@@ -38,3 +38,6 @@  dirtylimit_state_finalize(void)
 dirtylimit_throttle_pct(int cpu_index, uint64_t pct, int64_t time_us) "CPU[%d] throttle percent: %" PRIu64 ", throttle adjust time %"PRIi64 " us"
 dirtylimit_set_vcpu(int cpu_index, uint64_t quota) "CPU[%d] set dirty page rate limit %"PRIu64
 dirtylimit_vcpu_execute(int cpu_index, int64_t sleep_time_us) "CPU[%d] sleep %"PRIi64 " us"
+
+#physmem.c
+ram_block_add(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length) "%s, flags %u, fd %d, len %lu, maxlen %lu"