Message ID | 1730468875-249970-2-git-send-email-steven.sistare@oracle.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | Live update: cpr-transfer | expand |
On Fri, Nov 01, 2024 at 06:47:40AM -0700, Steve Sistare wrote: > @@ -1849,6 +1852,35 @@ static void ram_block_add(RAMBlock *new_block, Error **errp) > qemu_mutex_unlock_ramlist(); > return; > } > + > + } else if (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD && > + !object_dynamic_cast(new_block->mr->parent_obj.parent, > + TYPE_MEMORY_BACKEND)) { Steve, I think I'll postpone a few days on reading the whole series for other things.. as I think this will miss 9.2 anyway (but we can see whether we can still get it in early 10.0). Said that, I want to mention this early that there was concern raised on using block->mr->parent_obj.parent here to detect mem backends, which can be error prone. Please have a look at the discussion in v2 for that. https://lore.kernel.org/qemu-devel/Zv7C7MeVP2X8bEJU@x1n/
On 01.11.24 14:47, Steve Sistare wrote: > Allocate anonymous memory using mmap MAP_ANON or memfd_create depending > on the value of the anon-alloc machine property. This option applies to > memory allocated as a side effect of creating various devices. It does > not apply to memory-backend-objects, whether explicitly specified on > the command line, or implicitly created by the -m command line option. > > The memfd option is intended to support new migration modes, in which the > memory region can be transferred in place to a new QEMU process, by sending > the memfd file descriptor to the process. Memory contents are preserved, > and if the mode also transfers device descriptors, then pages that are > locked in memory for DMA remain locked. This behavior is a pre-requisite > for supporting vfio, vdpa, and iommufd devices with the new modes. A more portable, non-Linux specific variant of this will be using shm, similar to backends/hostmem-shm.c. Likely we should be using that instead of memfd, or try hiding the details. See below. [...] > @@ -69,6 +70,8 @@ > > #include "qemu/pmem.h" > > +#include "qapi/qapi-types-migration.h" > +#include "migration/options.h" > #include "migration/vmstate.h" > > #include "qemu/range.h" > @@ -1849,6 +1852,35 @@ static void ram_block_add(RAMBlock *new_block, Error **errp) > qemu_mutex_unlock_ramlist(); > return; > } > + > + } else if (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD && > + !object_dynamic_cast(new_block->mr->parent_obj.parent, > + TYPE_MEMORY_BACKEND)) { This looks a bit and hackish, and I don't think ram_block_add() is the right place where this should be. It should likely happen in the caller. We already do have two ways of allocating "shared anonymous memory": (1) memory-backend-ram,share=on (2) memory-backend-shm (2) gives us an fd as it uses shm_open(), (1) doesn't give us an fd as it uses MAP_ANON|MAP_SHARED. (1) is really only a corner case use case [1]. [there is also Linux specific memfd, which gives us more flexibility with hugetlb etc, but for the purpose here shm should likely be sufficient?] So why not make (1) behave like (2) and move that handling into qemu_ram_alloc_internal(), from where we can easily enable it using a new RMA_SHARED flag? So as a first step, something like: From 4b7b760c6e54cf05addca6728edc19adbec1588a Mon Sep 17 00:00:00 2001 From: David Hildenbrand <david@redhat.com> Date: Mon, 4 Nov 2024 11:29:22 +0100 Subject: [PATCH] tmp Signed-off-by: David Hildenbrand <david@redhat.com> --- backends/hostmem-shm.c | 56 ++++---------------------------- system/physmem.c | 73 ++++++++++++++++++++++++++++++++++++++++-- 2 files changed, 76 insertions(+), 53 deletions(-) diff --git a/backends/hostmem-shm.c b/backends/hostmem-shm.c index 374edc3db8..0f33b35e9c 100644 --- a/backends/hostmem-shm.c +++ b/backends/hostmem-shm.c @@ -25,11 +25,8 @@ struct HostMemoryBackendShm { static bool shm_backend_memory_alloc(HostMemoryBackend *backend, Error **errp) { - g_autoptr(GString) shm_name = g_string_new(NULL); - g_autofree char *backend_name = NULL; + g_autofree char *name = NULL; uint32_t ram_flags; - int fd, oflag; - mode_t mode; if (!backend->size) { error_setg(errp, "can't create shm backend with size 0"); @@ -41,54 +38,13 @@ shm_backend_memory_alloc(HostMemoryBackend *backend, Error **errp) return false; } - /* - * Let's use `mode = 0` because we don't want other processes to open our - * memory unless we share the file descriptor with them. - */ - mode = 0; - oflag = O_RDWR | O_CREAT | O_EXCL; - backend_name = host_memory_backend_get_name(backend); - - /* - * Some operating systems allow creating anonymous POSIX shared memory - * objects (e.g. FreeBSD provides the SHM_ANON constant), but this is not - * defined by POSIX, so let's create a unique name. - * - * From Linux's shm_open(3) man-page: - * For portable use, a shared memory object should be identified - * by a name of the form /somename;" - */ - g_string_printf(shm_name, "/qemu-" FMT_pid "-shm-%s", getpid(), - backend_name); - - fd = shm_open(shm_name->str, oflag, mode); - if (fd < 0) { - error_setg_errno(errp, errno, - "failed to create POSIX shared memory"); - return false; - } - - /* - * We have the file descriptor, so we no longer need to expose the - * POSIX shared memory object. However it will remain allocated as long as - * there are file descriptors pointing to it. - */ - shm_unlink(shm_name->str); - - if (ftruncate(fd, backend->size) == -1) { - error_setg_errno(errp, errno, - "failed to resize POSIX shared memory to %" PRIu64, - backend->size); - close(fd); - return false; - } - + /* Let's do the same as memory-backend-ram,share=on would do. */ + name = host_memory_backend_get_name(backend); ram_flags = RAM_SHARED; ram_flags |= backend->reserve ? 0 : RAM_NORESERVE; - - return memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend), - backend_name, backend->size, - ram_flags, fd, 0, errp); + return memory_region_init_ram_flags_nomigrate(&backend->mr, OBJECT(backend), + name, backend->size, + ram_flags, errp); } static void diff --git a/system/physmem.c b/system/physmem.c index dc1db3a384..4d331b3828 100644 --- a/system/physmem.c +++ b/system/physmem.c @@ -2057,6 +2057,59 @@ RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr, } #endif +static int qemu_shm_alloc(size_t size, Error **errp) +{ + g_autoptr(GString) shm_name = g_string_new(NULL); + int fd, oflag, cur_sequence; + static int sequence; + mode_t mode; + + cur_sequence = qatomic_fetch_inc(&sequence); + + /* + * Let's use `mode = 0` because we don't want other processes to open our + * memory unless we share the file descriptor with them. + */ + mode = 0; + oflag = O_RDWR | O_CREAT | O_EXCL; + + /* + * Some operating systems allow creating anonymous POSIX shared memory + * objects (e.g. FreeBSD provides the SHM_ANON constant), but this is not + * defined by POSIX, so let's create a unique name. + * + * From Linux's shm_open(3) man-page: + * For portable use, a shared memory object should be identified + * by a name of the form /somename;" + */ + g_string_printf(shm_name, "/qemu-" FMT_pid "-shm-%d", getpid(), + cur_sequence); + + fd = shm_open(shm_name->str, oflag, mode); + if (fd < 0) { + error_setg_errno(errp, errno, + "failed to create POSIX shared memory"); + return false; + } + + /* + * We have the file descriptor, so we no longer need to expose the + * POSIX shared memory object. However it will remain allocated as long as + * there are file descriptors pointing to it. + */ + shm_unlink(shm_name->str); + + if (ftruncate(fd, size) == -1) { + error_setg_errno(errp, errno, + "failed to resize POSIX shared memory to %" PRIu64, + size); + close(fd); + return false; + } + + return fd; +} + static RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size, void (*resized)(const char*, @@ -2084,12 +2137,26 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size, new_block->used_length = size; new_block->max_length = max_size; assert(max_size >= size); - new_block->fd = -1; + new_block->guest_memfd = -1; new_block->page_size = qemu_real_host_page_size(); - new_block->host = host; new_block->flags = ram_flags; - ram_block_add(new_block, &local_err); + new_block->host = host; + + if ((ram_flags & RAM_PREALLOC) || !(ram_flags & RAM_SHARED)) { + new_block->fd = -1; + } else { + /* + * We want anonymous shared memory, similar to MAP_SHARED|MAP_ANON; but + * some users want the fd. So let's allocate shm explicitly, which will + * give us the fd. + */ + assert(!host); + new_block->fd = qemu_shm_alloc(new_block->max_length, &local_err); + } + if (!local_err) { + ram_block_add(new_block, &local_err); + } if (local_err) { g_free(new_block); error_propagate(errp, local_err);
On 04.11.24 11:39, David Hildenbrand wrote: > On 01.11.24 14:47, Steve Sistare wrote: >> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending >> on the value of the anon-alloc machine property. This option applies to >> memory allocated as a side effect of creating various devices. It does >> not apply to memory-backend-objects, whether explicitly specified on >> the command line, or implicitly created by the -m command line option. >> >> The memfd option is intended to support new migration modes, in which the >> memory region can be transferred in place to a new QEMU process, by sending >> the memfd file descriptor to the process. Memory contents are preserved, >> and if the mode also transfers device descriptors, then pages that are >> locked in memory for DMA remain locked. This behavior is a pre-requisite >> for supporting vfio, vdpa, and iommufd devices with the new modes. > > A more portable, non-Linux specific variant of this will be using shm, > similar to backends/hostmem-shm.c. > > Likely we should be using that instead of memfd, or try hiding the > details. See below. > > [...] > >> @@ -69,6 +70,8 @@ >> >> #include "qemu/pmem.h" >> >> +#include "qapi/qapi-types-migration.h" >> +#include "migration/options.h" >> #include "migration/vmstate.h" >> >> #include "qemu/range.h" >> @@ -1849,6 +1852,35 @@ static void ram_block_add(RAMBlock *new_block, Error **errp) >> qemu_mutex_unlock_ramlist(); >> return; >> } >> + >> + } else if (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD && >> + !object_dynamic_cast(new_block->mr->parent_obj.parent, >> + TYPE_MEMORY_BACKEND)) { > > This looks a bit and hackish, and I don't think ram_block_add() is the right > place where this should be. It should likely happen in the caller. > > We already do have two ways of allocating "shared anonymous memory": > > (1) memory-backend-ram,share=on > (2) memory-backend-shm > > (2) gives us an fd as it uses shm_open(), (1) doesn't give us an fd as it > uses MAP_ANON|MAP_SHARED. (1) is really only a corner case use case [1]. > > [there is also Linux specific memfd, which gives us more flexibility with > hugetlb etc, but for the purpose here shm should likely be sufficient?] > > So why not make (1) behave like (2) and move that handling into > qemu_ram_alloc_internal(), from where we can easily enable it using a > new RMA_SHARED flag? So as a first step, something like: > > From 4b7b760c6e54cf05addca6728edc19adbec1588a Mon Sep 17 00:00:00 2001 > From: David Hildenbrand <david@redhat.com> > Date: Mon, 4 Nov 2024 11:29:22 +0100 > Subject: [PATCH] tmp > > Signed-off-by: David Hildenbrand <david@redhat.com> > --- > backends/hostmem-shm.c | 56 ++++---------------------------- > system/physmem.c | 73 ++++++++++++++++++++++++++++++++++++++++-- > 2 files changed, 76 insertions(+), 53 deletions(-) > > diff --git a/backends/hostmem-shm.c b/backends/hostmem-shm.c > index 374edc3db8..0f33b35e9c 100644 > --- a/backends/hostmem-shm.c > +++ b/backends/hostmem-shm.c > @@ -25,11 +25,8 @@ struct HostMemoryBackendShm { > static bool > shm_backend_memory_alloc(HostMemoryBackend *backend, Error **errp) > { > - g_autoptr(GString) shm_name = g_string_new(NULL); > - g_autofree char *backend_name = NULL; > + g_autofree char *name = NULL; > uint32_t ram_flags; > - int fd, oflag; > - mode_t mode; > > if (!backend->size) { > error_setg(errp, "can't create shm backend with size 0"); > @@ -41,54 +38,13 @@ shm_backend_memory_alloc(HostMemoryBackend *backend, Error **errp) > return false; > } > > - /* > - * Let's use `mode = 0` because we don't want other processes to open our > - * memory unless we share the file descriptor with them. > - */ > - mode = 0; > - oflag = O_RDWR | O_CREAT | O_EXCL; > - backend_name = host_memory_backend_get_name(backend); > - > - /* > - * Some operating systems allow creating anonymous POSIX shared memory > - * objects (e.g. FreeBSD provides the SHM_ANON constant), but this is not > - * defined by POSIX, so let's create a unique name. > - * > - * From Linux's shm_open(3) man-page: > - * For portable use, a shared memory object should be identified > - * by a name of the form /somename;" > - */ > - g_string_printf(shm_name, "/qemu-" FMT_pid "-shm-%s", getpid(), > - backend_name); > - > - fd = shm_open(shm_name->str, oflag, mode); > - if (fd < 0) { > - error_setg_errno(errp, errno, > - "failed to create POSIX shared memory"); > - return false; > - } > - > - /* > - * We have the file descriptor, so we no longer need to expose the > - * POSIX shared memory object. However it will remain allocated as long as > - * there are file descriptors pointing to it. > - */ > - shm_unlink(shm_name->str); > - > - if (ftruncate(fd, backend->size) == -1) { > - error_setg_errno(errp, errno, > - "failed to resize POSIX shared memory to %" PRIu64, > - backend->size); > - close(fd); > - return false; > - } > - > + /* Let's do the same as memory-backend-ram,share=on would do. */ > + name = host_memory_backend_get_name(backend); > ram_flags = RAM_SHARED; > ram_flags |= backend->reserve ? 0 : RAM_NORESERVE; > - > - return memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend), > - backend_name, backend->size, > - ram_flags, fd, 0, errp); > + return memory_region_init_ram_flags_nomigrate(&backend->mr, OBJECT(backend), > + name, backend->size, > + ram_flags, errp); > } > > static void > diff --git a/system/physmem.c b/system/physmem.c > index dc1db3a384..4d331b3828 100644 > --- a/system/physmem.c > +++ b/system/physmem.c > @@ -2057,6 +2057,59 @@ RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr, > } > #endif > > +static int qemu_shm_alloc(size_t size, Error **errp) > +{ > + g_autoptr(GString) shm_name = g_string_new(NULL); > + int fd, oflag, cur_sequence; > + static int sequence; > + mode_t mode; > + > + cur_sequence = qatomic_fetch_inc(&sequence); > + > + /* > + * Let's use `mode = 0` because we don't want other processes to open our > + * memory unless we share the file descriptor with them. > + */ > + mode = 0; > + oflag = O_RDWR | O_CREAT | O_EXCL; > + > + /* > + * Some operating systems allow creating anonymous POSIX shared memory > + * objects (e.g. FreeBSD provides the SHM_ANON constant), but this is not > + * defined by POSIX, so let's create a unique name. > + * > + * From Linux's shm_open(3) man-page: > + * For portable use, a shared memory object should be identified > + * by a name of the form /somename;" > + */ > + g_string_printf(shm_name, "/qemu-" FMT_pid "-shm-%d", getpid(), > + cur_sequence); > + > + fd = shm_open(shm_name->str, oflag, mode); > + if (fd < 0) { > + error_setg_errno(errp, errno, > + "failed to create POSIX shared memory"); > + return false; > + } > + > + /* > + * We have the file descriptor, so we no longer need to expose the > + * POSIX shared memory object. However it will remain allocated as long as > + * there are file descriptors pointing to it. > + */ > + shm_unlink(shm_name->str); > + > + if (ftruncate(fd, size) == -1) { > + error_setg_errno(errp, errno, > + "failed to resize POSIX shared memory to %" PRIu64, > + size); > + close(fd); > + return false; > + } > + > + return fd; > +} > + > static > RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size, > void (*resized)(const char*, > @@ -2084,12 +2137,26 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size, > new_block->used_length = size; > new_block->max_length = max_size; > assert(max_size >= size); > - new_block->fd = -1; > + > new_block->guest_memfd = -1; > new_block->page_size = qemu_real_host_page_size(); > - new_block->host = host; > new_block->flags = ram_flags; > - ram_block_add(new_block, &local_err); > + new_block->host = host; > + > + if ((ram_flags & RAM_PREALLOC) || !(ram_flags & RAM_SHARED)) { > + new_block->fd = -1; > + } else { > + /* > + * We want anonymous shared memory, similar to MAP_SHARED|MAP_ANON; but > + * some users want the fd. So let's allocate shm explicitly, which will > + * give us the fd. > + */ > + assert(!host); > + new_block->fd = qemu_shm_alloc(new_block->max_length, &local_err); Note: completely untested. Likely a file_ram_alloc() call is missing here.
On 11/4/2024 5:39 AM, David Hildenbrand wrote: > On 01.11.24 14:47, Steve Sistare wrote: >> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending >> on the value of the anon-alloc machine property. This option applies to >> memory allocated as a side effect of creating various devices. It does >> not apply to memory-backend-objects, whether explicitly specified on >> the command line, or implicitly created by the -m command line option. >> >> The memfd option is intended to support new migration modes, in which the >> memory region can be transferred in place to a new QEMU process, by sending >> the memfd file descriptor to the process. Memory contents are preserved, >> and if the mode also transfers device descriptors, then pages that are >> locked in memory for DMA remain locked. This behavior is a pre-requisite >> for supporting vfio, vdpa, and iommufd devices with the new modes. > > A more portable, non-Linux specific variant of this will be using shm, > similar to backends/hostmem-shm.c. > > Likely we should be using that instead of memfd, or try hiding the > details. See below. For this series I would prefer to use memfd and hide the details. It's a concise (and well tested) solution albeit linux only. The code you supply for posix shm would be a good follow on patch to support other unices. We could drop -machine anon-alloc=mmap|memfd and define -machine anon-shared as you suggest at the end. > [...] > >> @@ -69,6 +70,8 @@ >> #include "qemu/pmem.h" >> +#include "qapi/qapi-types-migration.h" >> +#include "migration/options.h" >> #include "migration/vmstate.h" >> #include "qemu/range.h" >> @@ -1849,6 +1852,35 @@ static void ram_block_add(RAMBlock *new_block, Error **errp) >> qemu_mutex_unlock_ramlist(); >> return; >> } >> + >> + } else if (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD && >> + !object_dynamic_cast(new_block->mr->parent_obj.parent, >> + TYPE_MEMORY_BACKEND)) { > > This looks a bit and hackish, OK. I can revert parts of the previous version which passed in RAM_SHARED from various call sites to request anonymous shared memory: https://lore.kernel.org/qemu-devel/1714406135-451286-18-git-send-email-steven.sistare@oracle.com See the various sites that do uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0; Does that look OK to you? > and I don't think ram_block_add() is the right > place where this should be. It should likely happen in the caller. I agree, but I received no feedback when I proposed to refactor allocation vs ram_block_add, so I dropped them to simplify the live update review. These refactor but do not change functionality. Are you OK with something like this? Is this overkill? https://lore.kernel.org/qemu-devel/1714406135-451286-1-git-send-email-steven.sistare@oracle.com/ physmem: ram_block_create physmem: hoist guest_memfd creation physmem: hoist host memory allocation > We already do have two ways of allocating "shared anonymous memory": > > (1) memory-backend-ram,share=on > (2) memory-backend-shm > > (2) gives us an fd as it uses shm_open(), (1) doesn't give us an fd as it > uses MAP_ANON|MAP_SHARED. (1) is really only a corner case use case [1]. > > [there is also Linux specific memfd, which gives us more flexibility with > hugetlb etc, but for the purpose here shm should likely be sufficient?] > > So why not make (1) behave like (2) and move that handling into > qemu_ram_alloc_internal(), from where we can easily enable it using a > new RMA_SHARED flag? So as a first step, something like: I prefer that, and an earlier version did so, but only if anon-alloc==memfd. To be clear, do you propose that memory-backend-ram,shared=on unconditionally mmap fd-based shared memory, independently of the setting of anon-alloc? And drop the MAP_ANON|MAP_SHARED possibility? Or, do you propose that for memory-backend-ram,shared=on: if anon-shared mmap fd else MAP_ANON|MAP_SHARED The former is simpler from a user documentation point of view, but either works for me. I could stop listing memory-backend-ram as an exception in the docs, which currently state: # Memory-backend objects must have the share=on attribute, but # memory-backend-epc and memory-backend-ram are not supported. [...] > > Then, you only need a machine option to say "anon-shared", to make all > anonymous memory sharable between processes. All it would do is setting > the RAM_SHARED flag in qemu_ram_alloc_internal() when reasonable > (!(ram_flags & RAM_PREALLOC)). > > To handle "memory-backend-ram,share=off", can we find a way to bail out if > memory-backend-ram,share=off was used while the machine option "anon-shared" > would be active? In later patches I install migration blockers for various conditions, including when a ram block does not support CPR. > Or just document that the "anon-shared" will win? IMO a blocker is sufficient. I think you are also suggesting that an unadorned "memory-backend-ram" specification (with implicit shared=off), plus anon-shared, should cause shared anon to be allocated: "you only need a machine option to say "anon-shared", to make all anonymous memory sharable" I did that previously, and Peter objected, saying the explicit anon-shared should not override the implicit shared=off. But perhaps I misinterpret someone. - Steve > Alternatives might be a RAM_PFORCE_PRIVATE flag, set by the memory backend. > > > With above change, we could drop the "bool share" flag from, > qemu_anon_ram_alloc(), as it would be unused. > > [1] https://patchwork.kernel.org/project/qemu-devel/patch/20180201205511.19198-2-marcel@redhat.com/
On 04.11.24 18:38, Steven Sistare wrote: > On 11/4/2024 5:39 AM, David Hildenbrand wrote: >> On 01.11.24 14:47, Steve Sistare wrote: >>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending >>> on the value of the anon-alloc machine property. This option applies to >>> memory allocated as a side effect of creating various devices. It does >>> not apply to memory-backend-objects, whether explicitly specified on >>> the command line, or implicitly created by the -m command line option. >>> >>> The memfd option is intended to support new migration modes, in which the >>> memory region can be transferred in place to a new QEMU process, by sending >>> the memfd file descriptor to the process. Memory contents are preserved, >>> and if the mode also transfers device descriptors, then pages that are >>> locked in memory for DMA remain locked. This behavior is a pre-requisite >>> for supporting vfio, vdpa, and iommufd devices with the new modes. >> >> A more portable, non-Linux specific variant of this will be using shm, >> similar to backends/hostmem-shm.c. >> >> Likely we should be using that instead of memfd, or try hiding the >> details. See below. > > For this series I would prefer to use memfd and hide the details. It's a > concise (and well tested) solution albeit linux only. The code you supply > for posix shm would be a good follow on patch to support other unices. Unless there is reason to use memfd we should start with the more generic POSIX variant that is available even on systems without memfd. Factoring stuff out as I drafted does look quite compelling. I can help with the rework, and send it out separately, so you can focus on the "machine toggle" as part of this series. Of course, if we find out we need the memfd internally instead under Linux for whatever reason later, we can use that instead. But IIUC, the main selling point for memfd are additional features (hugetlb, memory sealing) that you aren't even using. > > We could drop > -machine anon-alloc=mmap|memfd Right, the memfd here might be an unnecessary detail. Especially, because all things here are mmap'ed ... so I don't quite like this interface :) > and define > -machine anon-shared > > as you suggest at the end. Likely we should remove the "anon" part from the interface as well ... hmm ... We want to instruct QEMU: "all guest RAM that is not explicitly specified should be sharable with another process". "internal-ram-share=true" "default-ram-share=true" Maybe we can come up with something even better. But getting rid of the "anon" would be great. I think I prefer the latter (below). > >> [...] >> >>> @@ -69,6 +70,8 @@ >>> #include "qemu/pmem.h" >>> +#include "qapi/qapi-types-migration.h" >>> +#include "migration/options.h" >>> #include "migration/vmstate.h" >>> #include "qemu/range.h" >>> @@ -1849,6 +1852,35 @@ static void ram_block_add(RAMBlock *new_block, Error **errp) >>> qemu_mutex_unlock_ramlist(); >>> return; >>> } >>> + >>> + } else if (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD && >>> + !object_dynamic_cast(new_block->mr->parent_obj.parent, >>> + TYPE_MEMORY_BACKEND)) { >> >> This looks a bit and hackish, > > OK. I can revert parts of the previous version which passed in RAM_SHARED from > various call sites to request anonymous shared memory: > https://lore.kernel.org/qemu-devel/1714406135-451286-18-git-send-email-steven.sistare@oracle.com > See the various sites that do > uint32_t flags = current_machine->memfd_alloc ? RAM_SHARED : 0; > Does that look OK to you? That's one option, or we just handle it in qemu_ram_alloc_internal() as I drafted below. Or we simply have another interface to allocate this "default RAM that does not come from a memory backend and is subject to the global toggle", and hide that detail (conditionally setting RAM_SHARED) in there. > >> and I don't think ram_block_add() is the right >> place where this should be. It should likely happen in the caller. > > I agree, but I received no feedback when I proposed to refactor allocation > vs ram_block_add, so I dropped them to simplify the live update review. > These refactor but do not change functionality. Are you OK with something > like this? Is this overkill? > Probably overkill :) > https://lore.kernel.org/qemu-devel/1714406135-451286-1-git-send-email-steven.sistare@oracle.com/ > physmem: ram_block_create > physmem: hoist guest_memfd creation > physmem: hoist host memory allocation > >> We already do have two ways of allocating "shared anonymous memory": >> >> (1) memory-backend-ram,share=on >> (2) memory-backend-shm >> >> (2) gives us an fd as it uses shm_open(), (1) doesn't give us an fd as it >> uses MAP_ANON|MAP_SHARED. (1) is really only a corner case use case [1]. >> >> [there is also Linux specific memfd, which gives us more flexibility with >> hugetlb etc, but for the purpose here shm should likely be sufficient?] >> >> So why not make (1) behave like (2) and move that handling into >> qemu_ram_alloc_internal(), from where we can easily enable it using a >> new RMA_SHARED flag? So as a first step, something like: > > I prefer that, and an earlier version did so, but only if anon-alloc==memfd. > > To be clear, do you propose that memory-backend-ram,shared=on unconditionally > mmap fd-based shared memory, independently of the setting of anon-alloc? > And drop the MAP_ANON|MAP_SHARED possibility? Yes, as done in my draft patch. MAP_ANON|MAP_SHARED was primarily a hack to make this RDMA thingy fly that could not deal with anonymous memory, and we didn't have memory-backend-ram,share=on was added via 06329ccecfa022494fdba288b3ab5bcb8dff4159 before memory-backend-memfd was added via dbb9e0f40d7d561dcffcf7e41ac9f6a5ec90e5b5 Both ended up in the same QEMU release. So likely memory-backend-ram,share=on could just have used memory-backend-memfd if it would have been available earlier, at least on Linux ... But, it looks like the use case for memory-backend-ram,share=on does no longer even exist, because commit 1dfd42c4264bbf47415a9e73f0d6b4e6a7cd7393 Author: Philippe Mathieu-Daudé <philmd@linaro.org> Date: Thu Mar 28 12:53:00 2024 +0100 hw/rdma: Remove deprecated pvrdma device and rdmacm-mux helper Removed that mremap() from the code base. So we can change how memory-backend-ram,share=on is implemented internally, as long as it keeps on working in a similar way. If "memory-backend-ram,share=on" will be the same as specifying "default-ram-share=on", that would actually be quite nice ... no need to bring in memfds at all as long as we only want some memory with an fd to share with other processes. > > Or, do you propose that for memory-backend-ram,shared=on: > if anon-shared > mmap fd > else > MAP_ANON|MAP_SHARED My suggestion would be to unconditionally use shm (which is available even on kernels without memfd support; if required use memfd first and fallback to shmem) as in the patch I drafted. > > The former is simpler from a user documentation point of view, but either > works for me. I could stop listing memory-backend-ram as an exception in > the docs, which currently state: > # Memory-backend objects must have the share=on attribute, but > # memory-backend-epc and memory-backend-ram are not supported. Likely that was never updated to document the memory-backend-ram use case. > > [...] >> >> Then, you only need a machine option to say "anon-shared", to make all >> anonymous memory sharable between processes. All it would do is setting >> the RAM_SHARED flag in qemu_ram_alloc_internal() when reasonable >> (!(ram_flags & RAM_PREALLOC)). >> >> To handle "memory-backend-ram,share=off", can we find a way to bail out if >> memory-backend-ram,share=off was used while the machine option "anon-shared" >> would be active? > > In later patches I install migration blockers for various conditions, including > when a ram block does not support CPR. Good! > >> Or just document that the "anon-shared" will win? > > IMO a blocker is sufficient. > > I think you are also suggesting that an unadorned "memory-backend-ram" > specification (with implicit shared=off), plus anon-shared, should cause > shared anon to be allocated: > "you only need a machine option to say "anon-shared", to make all anonymous > memory sharable" > > I did that previously, and Peter objected, saying the explicit anon-shared > should not override the implicit shared=off. Yes, it's better if we can detect that somehow. There should be easy ways to make that work, so I wouldn't worry about that.
On Mon, Nov 04, 2024 at 08:51:56PM +0100, David Hildenbrand wrote: > > I did that previously, and Peter objected, saying the explicit anon-shared > > should not override the implicit shared=off. > > Yes, it's better if we can detect that somehow. There should be easy ways to > make that work, so I wouldn't worry about that. I still think whenever the caller is capable of passing RAM_SHARED flag into ram_block_add(), we should always respect what's passed in from the caller, no matter it's a shared / private request. A major issue with that idea is when !RAM_SHARED, we don't easily know whether it's because the caller explicitly chose share=off, or if it's simply the type of ramblock that we don't care (e.g. ROMs). So besides what I used to suggest on monitoring the four call sites that can involve those, another simpler (and now I see it even cleaner..) way could be that we explicitly introduce RAM_PRIVATE. It means whenever we have things like below in the callers: int ram_flags = shared ? RAM_SHARED : 0; We start to switch to: int ram_flags = shared ? RAM_SHARED : RAM_PRIVATE; Then in ram_block_add(): if (!(ram_flags & (RAM_SHARED | RAM_PRIVATE))) { // these are the target ramblocks for cpr's whatever new machine // flag.. }
On 04.11.24 20:51, David Hildenbrand wrote: > On 04.11.24 18:38, Steven Sistare wrote: >> On 11/4/2024 5:39 AM, David Hildenbrand wrote: >>> On 01.11.24 14:47, Steve Sistare wrote: >>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending >>>> on the value of the anon-alloc machine property. This option applies to >>>> memory allocated as a side effect of creating various devices. It does >>>> not apply to memory-backend-objects, whether explicitly specified on >>>> the command line, or implicitly created by the -m command line option. >>>> >>>> The memfd option is intended to support new migration modes, in which the >>>> memory region can be transferred in place to a new QEMU process, by sending >>>> the memfd file descriptor to the process. Memory contents are preserved, >>>> and if the mode also transfers device descriptors, then pages that are >>>> locked in memory for DMA remain locked. This behavior is a pre-requisite >>>> for supporting vfio, vdpa, and iommufd devices with the new modes. >>> >>> A more portable, non-Linux specific variant of this will be using shm, >>> similar to backends/hostmem-shm.c. >>> >>> Likely we should be using that instead of memfd, or try hiding the >>> details. See below. >> >> For this series I would prefer to use memfd and hide the details. It's a >> concise (and well tested) solution albeit linux only. The code you supply >> for posix shm would be a good follow on patch to support other unices. > > Unless there is reason to use memfd we should start with the more > generic POSIX variant that is available even on systems without memfd. > Factoring stuff out as I drafted does look quite compelling. > > I can help with the rework, and send it out separately, so you can focus > on the "machine toggle" as part of this series. > > Of course, if we find out we need the memfd internally instead under > Linux for whatever reason later, we can use that instead. > > But IIUC, the main selling point for memfd are additional features > (hugetlb, memory sealing) that you aren't even using. FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount). Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open(). I'm hoping we can find a way where it just all is rather intuitive, like "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on" "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process. Thoughts?
On 04.11.24 21:14, Peter Xu wrote: > On Mon, Nov 04, 2024 at 08:51:56PM +0100, David Hildenbrand wrote: >>> I did that previously, and Peter objected, saying the explicit anon-shared >>> should not override the implicit shared=off. >> >> Yes, it's better if we can detect that somehow. There should be easy ways to >> make that work, so I wouldn't worry about that. > > I still think whenever the caller is capable of passing RAM_SHARED flag > into ram_block_add(), we should always respect what's passed in from the > caller, no matter it's a shared / private request. > > A major issue with that idea is when !RAM_SHARED, we don't easily know > whether it's because the caller explicitly chose share=off, or if it's > simply the type of ramblock that we don't care (e.g. ROMs). Agreed. But note that I think ram_block_add() is one level to deep to handle that. > > So besides what I used to suggest on monitoring the four call sites that > can involve those, another simpler (and now I see it even cleaner..) way > could be that we explicitly introduce RAM_PRIVATE. It means whenever we > have things like below in the callers: Yeah, I called it RAM_FORCE_PRIVATE, but it's the same idea. And simply calling it RAM_PRIVATE makes sense.
On Mon, Nov 04, 2024 at 09:17:30PM +0100, David Hildenbrand wrote: > On 04.11.24 21:14, Peter Xu wrote: > > On Mon, Nov 04, 2024 at 08:51:56PM +0100, David Hildenbrand wrote: > > > > I did that previously, and Peter objected, saying the explicit anon-shared > > > > should not override the implicit shared=off. > > > > > > Yes, it's better if we can detect that somehow. There should be easy ways to > > > make that work, so I wouldn't worry about that. > > > > I still think whenever the caller is capable of passing RAM_SHARED flag > > into ram_block_add(), we should always respect what's passed in from the > > caller, no matter it's a shared / private request. > > > > A major issue with that idea is when !RAM_SHARED, we don't easily know > > whether it's because the caller explicitly chose share=off, or if it's > > simply the type of ramblock that we don't care (e.g. ROMs). > > Agreed. But note that I think ram_block_add() is one level to deep to handle > that. True.. qemu_ram_alloc_internal() is probably the best place to do the trick. Thanks,
On 11/4/2024 3:15 PM, David Hildenbrand wrote: > On 04.11.24 20:51, David Hildenbrand wrote: >> On 04.11.24 18:38, Steven Sistare wrote: >>> On 11/4/2024 5:39 AM, David Hildenbrand wrote: >>>> On 01.11.24 14:47, Steve Sistare wrote: >>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending >>>>> on the value of the anon-alloc machine property. This option applies to >>>>> memory allocated as a side effect of creating various devices. It does >>>>> not apply to memory-backend-objects, whether explicitly specified on >>>>> the command line, or implicitly created by the -m command line option. >>>>> >>>>> The memfd option is intended to support new migration modes, in which the >>>>> memory region can be transferred in place to a new QEMU process, by sending >>>>> the memfd file descriptor to the process. Memory contents are preserved, >>>>> and if the mode also transfers device descriptors, then pages that are >>>>> locked in memory for DMA remain locked. This behavior is a pre-requisite >>>>> for supporting vfio, vdpa, and iommufd devices with the new modes. >>>> >>>> A more portable, non-Linux specific variant of this will be using shm, >>>> similar to backends/hostmem-shm.c. >>>> >>>> Likely we should be using that instead of memfd, or try hiding the >>>> details. See below. >>> >>> For this series I would prefer to use memfd and hide the details. It's a >>> concise (and well tested) solution albeit linux only. The code you supply >>> for posix shm would be a good follow on patch to support other unices. >> >> Unless there is reason to use memfd we should start with the more >> generic POSIX variant that is available even on systems without memfd. >> Factoring stuff out as I drafted does look quite compelling. >> >> I can help with the rework, and send it out separately, so you can focus >> on the "machine toggle" as part of this series. >> >> Of course, if we find out we need the memfd internally instead under >> Linux for whatever reason later, we can use that instead. >> >> But IIUC, the main selling point for memfd are additional features >> (hugetlb, memory sealing) that you aren't even using. > > FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount). Sizing is a non-trivial difference. One can by default allocate all memory using memfd_create. To do so using shm_open requires configuration on the mount. One step harder to use. This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM if memory-backend-ram has hogged all the memory. > Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open(). Yes, and if that is a good idea, then the same should be done for internal RAM -- memfd if available and fallback to shm_open. > I'm hoping we can find a way where it just all is rather intuitive, like > > "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on" > > "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process. > > Thoughts? Agreed, though I thought I had already landed at the intuitive specification in my patch. The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc controls everything else. Now we're just riffing on the details: memfd vs shm_open, spelling of options and words to describe them. - Steve
On 04.11.24 21:56, Steven Sistare wrote: > On 11/4/2024 3:15 PM, David Hildenbrand wrote: >> On 04.11.24 20:51, David Hildenbrand wrote: >>> On 04.11.24 18:38, Steven Sistare wrote: >>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote: >>>>> On 01.11.24 14:47, Steve Sistare wrote: >>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending >>>>>> on the value of the anon-alloc machine property. This option applies to >>>>>> memory allocated as a side effect of creating various devices. It does >>>>>> not apply to memory-backend-objects, whether explicitly specified on >>>>>> the command line, or implicitly created by the -m command line option. >>>>>> >>>>>> The memfd option is intended to support new migration modes, in which the >>>>>> memory region can be transferred in place to a new QEMU process, by sending >>>>>> the memfd file descriptor to the process. Memory contents are preserved, >>>>>> and if the mode also transfers device descriptors, then pages that are >>>>>> locked in memory for DMA remain locked. This behavior is a pre-requisite >>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes. >>>>> >>>>> A more portable, non-Linux specific variant of this will be using shm, >>>>> similar to backends/hostmem-shm.c. >>>>> >>>>> Likely we should be using that instead of memfd, or try hiding the >>>>> details. See below. >>>> >>>> For this series I would prefer to use memfd and hide the details. It's a >>>> concise (and well tested) solution albeit linux only. The code you supply >>>> for posix shm would be a good follow on patch to support other unices. >>> >>> Unless there is reason to use memfd we should start with the more >>> generic POSIX variant that is available even on systems without memfd. >>> Factoring stuff out as I drafted does look quite compelling. >>> >>> I can help with the rework, and send it out separately, so you can focus >>> on the "machine toggle" as part of this series. >>> >>> Of course, if we find out we need the memfd internally instead under >>> Linux for whatever reason later, we can use that instead. >>> >>> But IIUC, the main selling point for memfd are additional features >>> (hugetlb, memory sealing) that you aren't even using. >> >> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount). > > Sizing is a non-trivial difference. One can by default allocate all memory using memfd_create. > To do so using shm_open requires configuration on the mount. One step harder to use. Yes. > > This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM > if memory-backend-ram has hogged all the memory. > >> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open(). > > Yes, and if that is a good idea, then the same should be done for internal RAM > -- memfd if available and fallback to shm_open. Yes. > >> I'm hoping we can find a way where it just all is rather intuitive, like >> >> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on" >> >> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process. >> >> Thoughts? > > Agreed, though I thought I had already landed at the intuitive specification in my patch. > The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc > controls everything else. Now we're just riffing on the details: memfd vs shm_open, spelling > of options and words to describe them. Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on".
On 11/4/2024 4:36 PM, David Hildenbrand wrote: > On 04.11.24 21:56, Steven Sistare wrote: >> On 11/4/2024 3:15 PM, David Hildenbrand wrote: >>> On 04.11.24 20:51, David Hildenbrand wrote: >>>> On 04.11.24 18:38, Steven Sistare wrote: >>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote: >>>>>> On 01.11.24 14:47, Steve Sistare wrote: >>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending >>>>>>> on the value of the anon-alloc machine property. This option applies to >>>>>>> memory allocated as a side effect of creating various devices. It does >>>>>>> not apply to memory-backend-objects, whether explicitly specified on >>>>>>> the command line, or implicitly created by the -m command line option. >>>>>>> >>>>>>> The memfd option is intended to support new migration modes, in which the >>>>>>> memory region can be transferred in place to a new QEMU process, by sending >>>>>>> the memfd file descriptor to the process. Memory contents are preserved, >>>>>>> and if the mode also transfers device descriptors, then pages that are >>>>>>> locked in memory for DMA remain locked. This behavior is a pre-requisite >>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes. >>>>>> >>>>>> A more portable, non-Linux specific variant of this will be using shm, >>>>>> similar to backends/hostmem-shm.c. >>>>>> >>>>>> Likely we should be using that instead of memfd, or try hiding the >>>>>> details. See below. >>>>> >>>>> For this series I would prefer to use memfd and hide the details. It's a >>>>> concise (and well tested) solution albeit linux only. The code you supply >>>>> for posix shm would be a good follow on patch to support other unices. >>>> >>>> Unless there is reason to use memfd we should start with the more >>>> generic POSIX variant that is available even on systems without memfd. >>>> Factoring stuff out as I drafted does look quite compelling. >>>> >>>> I can help with the rework, and send it out separately, so you can focus >>>> on the "machine toggle" as part of this series. >>>> >>>> Of course, if we find out we need the memfd internally instead under >>>> Linux for whatever reason later, we can use that instead. >>>> >>>> But IIUC, the main selling point for memfd are additional features >>>> (hugetlb, memory sealing) that you aren't even using. >>> >>> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount). >> >> Sizing is a non-trivial difference. One can by default allocate all memory using memfd_create. >> To do so using shm_open requires configuration on the mount. One step harder to use. > > Yes. > >> >> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM >> if memory-backend-ram has hogged all the memory. >> >>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open(). >> >> Yes, and if that is a good idea, then the same should be done for internal RAM >> -- memfd if available and fallback to shm_open. > > Yes. > >> >>> I'm hoping we can find a way where it just all is rather intuitive, like >>> >>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on" >>> >>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process. >>> >>> Thoughts? >> >> Agreed, though I thought I had already landed at the intuitive specification in my patch. >> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc >> controls everything else. Now we're just riffing on the details: memfd vs shm_open, spelling >> of options and words to describe them. > > Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on". Hi David and Peter, I have implemented and tested the following, for both qemu_memfd_create and qemu_shm_alloc. This is pseudo-code, with error conditions omitted for simplicity. Any comments before I submit a complete patch? ---- qemu-options.hx: ``aux-ram-share=on|off`` Allocate auxiliary guest RAM as an anonymous file that is shareable with an external process. This option applies to memory allocated as a side effect of creating various devices. It does not apply to memory-backend-objects, whether explicitly specified on the command line, or implicitly created by the -m command line option. Some migration modes require aux-ram-share=on. qapi/migration.json: @cpr-transfer: ... Memory-backend objects must have the share=on attribute, but memory-backend-epc is not supported. The VM must be started with the '-machine aux-ram-share=on' option. Define RAM_PRIVATE Define qemu_shm_alloc(), from David's tmp patch ram_backend_memory_alloc() ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE; memory_region_init_ram_flags_nomigrate(ram_flags) qemu_ram_alloc_internal() ... if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share) new_block->flags |= RAM_SHARED; if (!host && (new_block->flags & RAM_SHARED)) { qemu_ram_alloc_shared(new_block); } else new_block->fd = -1; new_block->host = host; } ram_block_add(new_block); qemu_ram_alloc_shared() if qemu_memfd_check() new_block->fd = qemu_memfd_create() else new_block->fd = qemu_shm_alloc() new_block->host = file_ram_alloc(new_block->fd) ---- - Steve
On Wed, Nov 06, 2024 at 03:12:20PM -0500, Steven Sistare wrote: > > > On 11/4/2024 4:36 PM, David Hildenbrand wrote: > > On 04.11.24 21:56, Steven Sistare wrote: > > > On 11/4/2024 3:15 PM, David Hildenbrand wrote: > > > > On 04.11.24 20:51, David Hildenbrand wrote: > > > > > On 04.11.24 18:38, Steven Sistare wrote: > > > > > > On 11/4/2024 5:39 AM, David Hildenbrand wrote: > > > > > > > On 01.11.24 14:47, Steve Sistare wrote: > > > > > > > > Allocate anonymous memory using mmap MAP_ANON or memfd_create depending > > > > > > > > on the value of the anon-alloc machine property. This option applies to > > > > > > > > memory allocated as a side effect of creating various devices. It does > > > > > > > > not apply to memory-backend-objects, whether explicitly specified on > > > > > > > > the command line, or implicitly created by the -m command line option. > > > > > > > > > > > > > > > > The memfd option is intended to support new migration modes, in which the > > > > > > > > memory region can be transferred in place to a new QEMU process, by sending > > > > > > > > the memfd file descriptor to the process. Memory contents are preserved, > > > > > > > > and if the mode also transfers device descriptors, then pages that are > > > > > > > > locked in memory for DMA remain locked. This behavior is a pre-requisite > > > > > > > > for supporting vfio, vdpa, and iommufd devices with the new modes. > > > > > > > > > > > > > > A more portable, non-Linux specific variant of this will be using shm, > > > > > > > similar to backends/hostmem-shm.c. > > > > > > > > > > > > > > Likely we should be using that instead of memfd, or try hiding the > > > > > > > details. See below. > > > > > > > > > > > > For this series I would prefer to use memfd and hide the details. It's a > > > > > > concise (and well tested) solution albeit linux only. The code you supply > > > > > > for posix shm would be a good follow on patch to support other unices. > > > > > > > > > > Unless there is reason to use memfd we should start with the more > > > > > generic POSIX variant that is available even on systems without memfd. > > > > > Factoring stuff out as I drafted does look quite compelling. > > > > > > > > > > I can help with the rework, and send it out separately, so you can focus > > > > > on the "machine toggle" as part of this series. > > > > > > > > > > Of course, if we find out we need the memfd internally instead under > > > > > Linux for whatever reason later, we can use that instead. > > > > > > > > > > But IIUC, the main selling point for memfd are additional features > > > > > (hugetlb, memory sealing) that you aren't even using. > > > > > > > > FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount). > > > > > > Sizing is a non-trivial difference. One can by default allocate all memory using memfd_create. > > > To do so using shm_open requires configuration on the mount. One step harder to use. > > > > Yes. > > > > > > > > This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM > > > if memory-backend-ram has hogged all the memory. > > > > > > > Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open(). > > > > > > Yes, and if that is a good idea, then the same should be done for internal RAM > > > -- memfd if available and fallback to shm_open. > > > > Yes. > > > > > > > > > I'm hoping we can find a way where it just all is rather intuitive, like > > > > > > > > "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on" > > > > > > > > "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process. > > > > > > > > Thoughts? > > > > > > Agreed, though I thought I had already landed at the intuitive specification in my patch. > > > The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc > > > controls everything else. Now we're just riffing on the details: memfd vs shm_open, spelling > > > of options and words to describe them. > > > > Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on". > > Hi David and Peter, > > I have implemented and tested the following, for both qemu_memfd_create > and qemu_shm_alloc. This is pseudo-code, with error conditions omitted > for simplicity. I'm ok with either shm or memfd, as this feature only applies to Linux anyway. I'll leave that part to you and David to decide. > > Any comments before I submit a complete patch? > > ---- > qemu-options.hx: > ``aux-ram-share=on|off`` > Allocate auxiliary guest RAM as an anonymous file that is > shareable with an external process. This option applies to > memory allocated as a side effect of creating various devices. > It does not apply to memory-backend-objects, whether explicitly > specified on the command line, or implicitly created by the -m > command line option. > > Some migration modes require aux-ram-share=on. > > qapi/migration.json: > @cpr-transfer: > ... > Memory-backend objects must have the share=on attribute, but > memory-backend-epc is not supported. The VM must be started > with the '-machine aux-ram-share=on' option. > > Define RAM_PRIVATE > > Define qemu_shm_alloc(), from David's tmp patch > > ram_backend_memory_alloc() > ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE; > memory_region_init_ram_flags_nomigrate(ram_flags) Looks all good until here. > > qemu_ram_alloc_internal() > ... > if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share) Nitpick: could rely on flags-only, rather than testing "!host", AFAICT that's equal to RAM_PREALLOC. Meanwhile I slightly prefer we don't touch anything if SHARED|PRIVATE is set. All combined, it could be: if (!(ram_flags & (RAM_PREALLOC | RAM_PRIVATE | RAM_SHARED))) { // ramblock to be allocated, with no share/private request, aka, // aux memory chunk... } > new_block->flags |= RAM_SHARED; > > if (!host && (new_block->flags & RAM_SHARED)) { > qemu_ram_alloc_shared(new_block); I'm not sure whether this needs its own helper. Should we fallback to ram_block_add() below, just like a RAM_SHARED? IIUC, we could start to create RAM_SHARED in qemu_anon_ram_alloc() and always cache the fd (even if we don't do that before)? > } else > new_block->fd = -1; > new_block->host = host; > } > ram_block_add(new_block); > > qemu_ram_alloc_shared() > if qemu_memfd_check() > new_block->fd = qemu_memfd_create() > else > new_block->fd = qemu_shm_alloc() > new_block->host = file_ram_alloc(new_block->fd) > ---- > > - Steve >
On 11/6/2024 3:41 PM, Peter Xu wrote: > On Wed, Nov 06, 2024 at 03:12:20PM -0500, Steven Sistare wrote: >> On 11/4/2024 4:36 PM, David Hildenbrand wrote: >>> On 04.11.24 21:56, Steven Sistare wrote: >>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote: >>>>> On 04.11.24 20:51, David Hildenbrand wrote: >>>>>> On 04.11.24 18:38, Steven Sistare wrote: >>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote: >>>>>>>> On 01.11.24 14:47, Steve Sistare wrote: >>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending >>>>>>>>> on the value of the anon-alloc machine property. This option applies to >>>>>>>>> memory allocated as a side effect of creating various devices. It does >>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on >>>>>>>>> the command line, or implicitly created by the -m command line option. >>>>>>>>> >>>>>>>>> The memfd option is intended to support new migration modes, in which the >>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending >>>>>>>>> the memfd file descriptor to the process. Memory contents are preserved, >>>>>>>>> and if the mode also transfers device descriptors, then pages that are >>>>>>>>> locked in memory for DMA remain locked. This behavior is a pre-requisite >>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes. >>>>>>>> >>>>>>>> A more portable, non-Linux specific variant of this will be using shm, >>>>>>>> similar to backends/hostmem-shm.c. >>>>>>>> >>>>>>>> Likely we should be using that instead of memfd, or try hiding the >>>>>>>> details. See below. >>>>>>> >>>>>>> For this series I would prefer to use memfd and hide the details. It's a >>>>>>> concise (and well tested) solution albeit linux only. The code you supply >>>>>>> for posix shm would be a good follow on patch to support other unices. >>>>>> >>>>>> Unless there is reason to use memfd we should start with the more >>>>>> generic POSIX variant that is available even on systems without memfd. >>>>>> Factoring stuff out as I drafted does look quite compelling. >>>>>> >>>>>> I can help with the rework, and send it out separately, so you can focus >>>>>> on the "machine toggle" as part of this series. >>>>>> >>>>>> Of course, if we find out we need the memfd internally instead under >>>>>> Linux for whatever reason later, we can use that instead. >>>>>> >>>>>> But IIUC, the main selling point for memfd are additional features >>>>>> (hugetlb, memory sealing) that you aren't even using. >>>>> >>>>> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount). >>>> >>>> Sizing is a non-trivial difference. One can by default allocate all memory using memfd_create. >>>> To do so using shm_open requires configuration on the mount. One step harder to use. >>> >>> Yes. >>> >>>> >>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM >>>> if memory-backend-ram has hogged all the memory. >>>> >>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open(). >>>> >>>> Yes, and if that is a good idea, then the same should be done for internal RAM >>>> -- memfd if available and fallback to shm_open. >>> >>> Yes. >>> >>>> >>>>> I'm hoping we can find a way where it just all is rather intuitive, like >>>>> >>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on" >>>>> >>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process. >>>>> >>>>> Thoughts? >>>> >>>> Agreed, though I thought I had already landed at the intuitive specification in my patch. >>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc >>>> controls everything else. Now we're just riffing on the details: memfd vs shm_open, spelling >>>> of options and words to describe them. >>> >>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on". >> >> Hi David and Peter, >> >> I have implemented and tested the following, for both qemu_memfd_create >> and qemu_shm_alloc. This is pseudo-code, with error conditions omitted >> for simplicity. > > I'm ok with either shm or memfd, as this feature only applies to Linux > anyway. I'll leave that part to you and David to decide. > >> >> Any comments before I submit a complete patch? >> >> ---- >> qemu-options.hx: >> ``aux-ram-share=on|off`` >> Allocate auxiliary guest RAM as an anonymous file that is >> shareable with an external process. This option applies to >> memory allocated as a side effect of creating various devices. >> It does not apply to memory-backend-objects, whether explicitly >> specified on the command line, or implicitly created by the -m >> command line option. >> >> Some migration modes require aux-ram-share=on. >> >> qapi/migration.json: >> @cpr-transfer: >> ... >> Memory-backend objects must have the share=on attribute, but >> memory-backend-epc is not supported. The VM must be started >> with the '-machine aux-ram-share=on' option. >> >> Define RAM_PRIVATE >> >> Define qemu_shm_alloc(), from David's tmp patch >> >> ram_backend_memory_alloc() >> ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE; >> memory_region_init_ram_flags_nomigrate(ram_flags) > > Looks all good until here. > >> >> qemu_ram_alloc_internal() >> ... >> if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share) > > Nitpick: could rely on flags-only, rather than testing "!host", AFAICT > that's equal to RAM_PREALLOC. IMO testing host is clearer and more future proof, regardless of how flags are currently used. If the caller passes host, then we should not allocate memory here, full stop. > Meanwhile I slightly prefer we don't touch > anything if SHARED|PRIVATE is set. OK, if SHARED is already set I will not set it again. > All combined, it could be: > > if (!(ram_flags & (RAM_PREALLOC | RAM_PRIVATE | RAM_SHARED))) { > // ramblock to be allocated, with no share/private request, aka, > // aux memory chunk... > } > >> new_block->flags |= RAM_SHARED; >> >> if (!host && (new_block->flags & RAM_SHARED)) { >> qemu_ram_alloc_shared(new_block); > > I'm not sure whether this needs its own helper. Reserve judgement until you see the full patch. The helper is a non-trivial subroutine and IMO it improves readability. Also the cpr find/save hooks are confined to the subroutine. > Should we fallback to > ram_block_add() below, just like a RAM_SHARED? I thought we all discussed and agreed that the allocation should be performed above ram_block_add. David's suggested patch does it here also. - Steve > IIUC, we could start to create RAM_SHARED in qemu_anon_ram_alloc() and > always cache the fd (even if we don't do that before)? > >> } else >> new_block->fd = -1; >> new_block->host = host; >> } >> ram_block_add(new_block); >> >> qemu_ram_alloc_shared() >> if qemu_memfd_check() >> new_block->fd = qemu_memfd_create() >> else >> new_block->fd = qemu_shm_alloc() >> new_block->host = file_ram_alloc(new_block->fd) >> ---- >> >> - Steve >> >
On Wed, Nov 06, 2024 at 03:59:23PM -0500, Steven Sistare wrote: > On 11/6/2024 3:41 PM, Peter Xu wrote: > > On Wed, Nov 06, 2024 at 03:12:20PM -0500, Steven Sistare wrote: > > > On 11/4/2024 4:36 PM, David Hildenbrand wrote: > > > > On 04.11.24 21:56, Steven Sistare wrote: > > > > > On 11/4/2024 3:15 PM, David Hildenbrand wrote: > > > > > > On 04.11.24 20:51, David Hildenbrand wrote: > > > > > > > On 04.11.24 18:38, Steven Sistare wrote: > > > > > > > > On 11/4/2024 5:39 AM, David Hildenbrand wrote: > > > > > > > > > On 01.11.24 14:47, Steve Sistare wrote: > > > > > > > > > > Allocate anonymous memory using mmap MAP_ANON or memfd_create depending > > > > > > > > > > on the value of the anon-alloc machine property. This option applies to > > > > > > > > > > memory allocated as a side effect of creating various devices. It does > > > > > > > > > > not apply to memory-backend-objects, whether explicitly specified on > > > > > > > > > > the command line, or implicitly created by the -m command line option. > > > > > > > > > > > > > > > > > > > > The memfd option is intended to support new migration modes, in which the > > > > > > > > > > memory region can be transferred in place to a new QEMU process, by sending > > > > > > > > > > the memfd file descriptor to the process. Memory contents are preserved, > > > > > > > > > > and if the mode also transfers device descriptors, then pages that are > > > > > > > > > > locked in memory for DMA remain locked. This behavior is a pre-requisite > > > > > > > > > > for supporting vfio, vdpa, and iommufd devices with the new modes. > > > > > > > > > > > > > > > > > > A more portable, non-Linux specific variant of this will be using shm, > > > > > > > > > similar to backends/hostmem-shm.c. > > > > > > > > > > > > > > > > > > Likely we should be using that instead of memfd, or try hiding the > > > > > > > > > details. See below. > > > > > > > > > > > > > > > > For this series I would prefer to use memfd and hide the details. It's a > > > > > > > > concise (and well tested) solution albeit linux only. The code you supply > > > > > > > > for posix shm would be a good follow on patch to support other unices. > > > > > > > > > > > > > > Unless there is reason to use memfd we should start with the more > > > > > > > generic POSIX variant that is available even on systems without memfd. > > > > > > > Factoring stuff out as I drafted does look quite compelling. > > > > > > > > > > > > > > I can help with the rework, and send it out separately, so you can focus > > > > > > > on the "machine toggle" as part of this series. > > > > > > > > > > > > > > Of course, if we find out we need the memfd internally instead under > > > > > > > Linux for whatever reason later, we can use that instead. > > > > > > > > > > > > > > But IIUC, the main selling point for memfd are additional features > > > > > > > (hugetlb, memory sealing) that you aren't even using. > > > > > > > > > > > > FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount). > > > > > > > > > > Sizing is a non-trivial difference. One can by default allocate all memory using memfd_create. > > > > > To do so using shm_open requires configuration on the mount. One step harder to use. > > > > > > > > Yes. > > > > > > > > > > > > > > This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM > > > > > if memory-backend-ram has hogged all the memory. > > > > > > > > > > > Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open(). > > > > > > > > > > Yes, and if that is a good idea, then the same should be done for internal RAM > > > > > -- memfd if available and fallback to shm_open. > > > > > > > > Yes. > > > > > > > > > > > > > > > I'm hoping we can find a way where it just all is rather intuitive, like > > > > > > > > > > > > "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on" > > > > > > > > > > > > "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process. > > > > > > > > > > > > Thoughts? > > > > > > > > > > Agreed, though I thought I had already landed at the intuitive specification in my patch. > > > > > The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc > > > > > controls everything else. Now we're just riffing on the details: memfd vs shm_open, spelling > > > > > of options and words to describe them. > > > > > > > > Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on". > > > > > > Hi David and Peter, > > > > > > I have implemented and tested the following, for both qemu_memfd_create > > > and qemu_shm_alloc. This is pseudo-code, with error conditions omitted > > > for simplicity. > > > > I'm ok with either shm or memfd, as this feature only applies to Linux > > anyway. I'll leave that part to you and David to decide. > > > > > > > > Any comments before I submit a complete patch? > > > > > > ---- > > > qemu-options.hx: > > > ``aux-ram-share=on|off`` > > > Allocate auxiliary guest RAM as an anonymous file that is > > > shareable with an external process. This option applies to > > > memory allocated as a side effect of creating various devices. > > > It does not apply to memory-backend-objects, whether explicitly > > > specified on the command line, or implicitly created by the -m > > > command line option. > > > > > > Some migration modes require aux-ram-share=on. > > > > > > qapi/migration.json: > > > @cpr-transfer: > > > ... > > > Memory-backend objects must have the share=on attribute, but > > > memory-backend-epc is not supported. The VM must be started > > > with the '-machine aux-ram-share=on' option. > > > > > > Define RAM_PRIVATE > > > > > > Define qemu_shm_alloc(), from David's tmp patch > > > > > > ram_backend_memory_alloc() > > > ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE; > > > memory_region_init_ram_flags_nomigrate(ram_flags) > > > > Looks all good until here. > > > > > > > > qemu_ram_alloc_internal() > > > ... > > > if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share) > > > > Nitpick: could rely on flags-only, rather than testing "!host", AFAICT > > that's equal to RAM_PREALLOC. > > IMO testing host is clearer and more future proof, regardless of how flags > are currently used. If the caller passes host, then we should not allocate > memory here, full stop. > > > Meanwhile I slightly prefer we don't touch > > anything if SHARED|PRIVATE is set. > > OK, if SHARED is already set I will not set it again. > > > All combined, it could be: > > > > if (!(ram_flags & (RAM_PREALLOC | RAM_PRIVATE | RAM_SHARED))) { > > // ramblock to be allocated, with no share/private request, aka, > > // aux memory chunk... > > } > > > > > new_block->flags |= RAM_SHARED; > > > > > > if (!host && (new_block->flags & RAM_SHARED)) { > > > qemu_ram_alloc_shared(new_block); > > > > I'm not sure whether this needs its own helper. > > Reserve judgement until you see the full patch. The helper is a > non-trivial subroutine and IMO it improves readability. Also the > cpr find/save hooks are confined to the subroutine. I thought we can use the same code path to process "aux ramblock" and all kinds of other RAM_SHARED ramblocks. IIUC cpr find/save should apply there too, but maybe I missed something. > > > Should we fallback to > > ram_block_add() below, just like a RAM_SHARED? > > I thought we all discussed and agreed that the allocation should be performed > above ram_block_add. David's suggested patch does it here also. I was not closely followed all the discussions happened.. so I could have missed something indeed. One thing I want to double check is cpr will still make things like below work, right? -object memory-backend-ram,share=on [1] IIUC with the old code this won't create fd, so to make cpr work (and also what I was trying to say in the previous email..) is we could silently start to create memfds for these, which means we need to first teach qemu_anon_ram_alloc() on creating memfd for RAM_SHARED and cache these fds (which should hopefully keep the same behavior as before). Then for aux ramblocks like ROMs, as long as it sets RAM_SHARED properly in qemu_ram_alloc_internal() (but only when aux-share-mem=on, for sure..), then the rest code path in ram_block_add() should be the same as when user specified share=on in [1]. Anyway, if both of you agreed on it, I am happy to wait and read the whole patch. Side note: I'll still use a few days for other things, but I'll get back to read this whole series before next week.. btw, this series does not depend on precreate phase now, am I right? > > - Steve > > > IIUC, we could start to create RAM_SHARED in qemu_anon_ram_alloc() and > > always cache the fd (even if we don't do that before)? > > > > > } else > > > new_block->fd = -1; > > > new_block->host = host; > > > } > > > ram_block_add(new_block); > > > > > > qemu_ram_alloc_shared() > > > if qemu_memfd_check() > > > new_block->fd = qemu_memfd_create() > > > else > > > new_block->fd = qemu_shm_alloc() > > > new_block->host = file_ram_alloc(new_block->fd) > > > ---- > > > > > > - Steve > > > > > >
On 06.11.24 21:59, Steven Sistare wrote: > On 11/6/2024 3:41 PM, Peter Xu wrote: >> On Wed, Nov 06, 2024 at 03:12:20PM -0500, Steven Sistare wrote: >>> On 11/4/2024 4:36 PM, David Hildenbrand wrote: >>>> On 04.11.24 21:56, Steven Sistare wrote: >>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote: >>>>>> On 04.11.24 20:51, David Hildenbrand wrote: >>>>>>> On 04.11.24 18:38, Steven Sistare wrote: >>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote: >>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote: >>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending >>>>>>>>>> on the value of the anon-alloc machine property. This option applies to >>>>>>>>>> memory allocated as a side effect of creating various devices. It does >>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on >>>>>>>>>> the command line, or implicitly created by the -m command line option. >>>>>>>>>> >>>>>>>>>> The memfd option is intended to support new migration modes, in which the >>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending >>>>>>>>>> the memfd file descriptor to the process. Memory contents are preserved, >>>>>>>>>> and if the mode also transfers device descriptors, then pages that are >>>>>>>>>> locked in memory for DMA remain locked. This behavior is a pre-requisite >>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes. >>>>>>>>> >>>>>>>>> A more portable, non-Linux specific variant of this will be using shm, >>>>>>>>> similar to backends/hostmem-shm.c. >>>>>>>>> >>>>>>>>> Likely we should be using that instead of memfd, or try hiding the >>>>>>>>> details. See below. >>>>>>>> >>>>>>>> For this series I would prefer to use memfd and hide the details. It's a >>>>>>>> concise (and well tested) solution albeit linux only. The code you supply >>>>>>>> for posix shm would be a good follow on patch to support other unices. >>>>>>> >>>>>>> Unless there is reason to use memfd we should start with the more >>>>>>> generic POSIX variant that is available even on systems without memfd. >>>>>>> Factoring stuff out as I drafted does look quite compelling. >>>>>>> >>>>>>> I can help with the rework, and send it out separately, so you can focus >>>>>>> on the "machine toggle" as part of this series. >>>>>>> >>>>>>> Of course, if we find out we need the memfd internally instead under >>>>>>> Linux for whatever reason later, we can use that instead. >>>>>>> >>>>>>> But IIUC, the main selling point for memfd are additional features >>>>>>> (hugetlb, memory sealing) that you aren't even using. >>>>>> >>>>>> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount). >>>>> >>>>> Sizing is a non-trivial difference. One can by default allocate all memory using memfd_create. >>>>> To do so using shm_open requires configuration on the mount. One step harder to use. >>>> >>>> Yes. >>>> >>>>> >>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM >>>>> if memory-backend-ram has hogged all the memory. >>>>> >>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open(). >>>>> >>>>> Yes, and if that is a good idea, then the same should be done for internal RAM >>>>> -- memfd if available and fallback to shm_open. >>>> >>>> Yes. >>>> >>>>> >>>>>> I'm hoping we can find a way where it just all is rather intuitive, like >>>>>> >>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on" >>>>>> >>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process. >>>>>> >>>>>> Thoughts? >>>>> >>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch. >>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc >>>>> controls everything else. Now we're just riffing on the details: memfd vs shm_open, spelling >>>>> of options and words to describe them. >>>> >>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on". >>> >>> Hi David and Peter, >>> >>> I have implemented and tested the following, for both qemu_memfd_create >>> and qemu_shm_alloc. This is pseudo-code, with error conditions omitted >>> for simplicity. >> >> I'm ok with either shm or memfd, as this feature only applies to Linux >> anyway. I'll leave that part to you and David to decide. >> >>> >>> Any comments before I submit a complete patch? >>> >>> ---- >>> qemu-options.hx: >>> ``aux-ram-share=on|off`` >>> Allocate auxiliary guest RAM as an anonymous file that is >>> shareable with an external process. This option applies to >>> memory allocated as a side effect of creating various devices. >>> It does not apply to memory-backend-objects, whether explicitly >>> specified on the command line, or implicitly created by the -m >>> command line option. >>> >>> Some migration modes require aux-ram-share=on. >>> >>> qapi/migration.json: >>> @cpr-transfer: >>> ... >>> Memory-backend objects must have the share=on attribute, but >>> memory-backend-epc is not supported. The VM must be started >>> with the '-machine aux-ram-share=on' option. >>> >>> Define RAM_PRIVATE >>> >>> Define qemu_shm_alloc(), from David's tmp patch >>> >>> ram_backend_memory_alloc() >>> ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE; >>> memory_region_init_ram_flags_nomigrate(ram_flags) >> >> Looks all good until here. >> >>> >>> qemu_ram_alloc_internal() >>> ... >>> if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share) >> >> Nitpick: could rely on flags-only, rather than testing "!host", AFAICT >> that's equal to RAM_PREALLOC. > > IMO testing host is clearer and more future proof, regardless of how flags > are currently used. If the caller passes host, then we should not allocate > memory here, full stop. > >> Meanwhile I slightly prefer we don't touch >> anything if SHARED|PRIVATE is set. > > OK, if SHARED is already set I will not set it again. We only have to make sure that stuff like qemu_ram_is_shared() will continue working as expected. What I think we should do: We should probably assert that nobody passes in SHARED|PRIVATE. And we can use PRIVATE only as a parameter to the function, but never actually set it on the ramblock. If someone passes in PRIVATE, we don't include it in block->flags. (RMA_SHARED remains cleared) If someone passes in SHARED, we do set it in block->flags. If someone passes PRIVATE|SHARED, we assert. If someone passes in nothing: we set block->flags to SHARED with aux_ram_share=on. Otherwise, we do nothing (RAM_SHARED remains cleared) If that's also what you had in mind, great.
On 06.11.24 21:12, Steven Sistare wrote: > > > On 11/4/2024 4:36 PM, David Hildenbrand wrote: >> On 04.11.24 21:56, Steven Sistare wrote: >>> On 11/4/2024 3:15 PM, David Hildenbrand wrote: >>>> On 04.11.24 20:51, David Hildenbrand wrote: >>>>> On 04.11.24 18:38, Steven Sistare wrote: >>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote: >>>>>>> On 01.11.24 14:47, Steve Sistare wrote: >>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending >>>>>>>> on the value of the anon-alloc machine property. This option applies to >>>>>>>> memory allocated as a side effect of creating various devices. It does >>>>>>>> not apply to memory-backend-objects, whether explicitly specified on >>>>>>>> the command line, or implicitly created by the -m command line option. >>>>>>>> >>>>>>>> The memfd option is intended to support new migration modes, in which the >>>>>>>> memory region can be transferred in place to a new QEMU process, by sending >>>>>>>> the memfd file descriptor to the process. Memory contents are preserved, >>>>>>>> and if the mode also transfers device descriptors, then pages that are >>>>>>>> locked in memory for DMA remain locked. This behavior is a pre-requisite >>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes. >>>>>>> >>>>>>> A more portable, non-Linux specific variant of this will be using shm, >>>>>>> similar to backends/hostmem-shm.c. >>>>>>> >>>>>>> Likely we should be using that instead of memfd, or try hiding the >>>>>>> details. See below. >>>>>> >>>>>> For this series I would prefer to use memfd and hide the details. It's a >>>>>> concise (and well tested) solution albeit linux only. The code you supply >>>>>> for posix shm would be a good follow on patch to support other unices. >>>>> >>>>> Unless there is reason to use memfd we should start with the more >>>>> generic POSIX variant that is available even on systems without memfd. >>>>> Factoring stuff out as I drafted does look quite compelling. >>>>> >>>>> I can help with the rework, and send it out separately, so you can focus >>>>> on the "machine toggle" as part of this series. >>>>> >>>>> Of course, if we find out we need the memfd internally instead under >>>>> Linux for whatever reason later, we can use that instead. >>>>> >>>>> But IIUC, the main selling point for memfd are additional features >>>>> (hugetlb, memory sealing) that you aren't even using. >>>> >>>> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount). >>> >>> Sizing is a non-trivial difference. One can by default allocate all memory using memfd_create. >>> To do so using shm_open requires configuration on the mount. One step harder to use. >> >> Yes. >> >>> >>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM >>> if memory-backend-ram has hogged all the memory. >>> >>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open(). >>> >>> Yes, and if that is a good idea, then the same should be done for internal RAM >>> -- memfd if available and fallback to shm_open. >> >> Yes. >> >>> >>>> I'm hoping we can find a way where it just all is rather intuitive, like >>>> >>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on" >>>> >>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process. >>>> >>>> Thoughts? >>> >>> Agreed, though I thought I had already landed at the intuitive specification in my patch. >>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc >>> controls everything else. Now we're just riffing on the details: memfd vs shm_open, spelling >>> of options and words to describe them. >> >> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on". > > Hi David and Peter, > > I have implemented and tested the following, for both qemu_memfd_create > and qemu_shm_alloc. This is pseudo-code, with error conditions omitted > for simplicity. > > Any comments before I submit a complete patch? > > ---- > qemu-options.hx: > ``aux-ram-share=on|off`` > Allocate auxiliary guest RAM as an anonymous file that is > shareable with an external process. This option applies to > memory allocated as a side effect of creating various devices. > It does not apply to memory-backend-objects, whether explicitly > specified on the command line, or implicitly created by the -m > command line option. > > Some migration modes require aux-ram-share=on. > > qapi/migration.json: > @cpr-transfer: > ... > Memory-backend objects must have the share=on attribute, but > memory-backend-epc is not supported. The VM must be started > with the '-machine aux-ram-share=on' option. > > Define RAM_PRIVATE > > Define qemu_shm_alloc(), from David's tmp patch > > ram_backend_memory_alloc() > ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE; > memory_region_init_ram_flags_nomigrate(ram_flags) > > qemu_ram_alloc_internal() > ... > if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share) > new_block->flags |= RAM_SHARED; > > if (!host && (new_block->flags & RAM_SHARED)) { > qemu_ram_alloc_shared(new_block); > } else > new_block->fd = -1; > new_block->host = host; > } > ram_block_add(new_block); > > qemu_ram_alloc_shared() > if qemu_memfd_check() > new_block->fd = qemu_memfd_create() > else > new_block->fd = qemu_shm_alloc() Yes, that way "memory-backend-ram,share=on" will just mean "give me the best shared memory for RAM to be shared with other processes, I don't care about the details", and it will work on Linux kernels even before we had memfds. memory-backend-ram should be available on all architectures, and under Windows. qemu_anon_ram_alloc() under Linux just does nothing special, not even bail out. MAP_SHARED|MAP_ANON was always weird, because it meant "give me memory I can share only with subprocesses", but then, *there are not subprocesses for QEMU*. I recall there was a trick to obtain the fd under Linux for these regions using /proc/self/fd/, but it's very Linux specific ... So nobody would *actually* use that shared memory and it was only a hack for RDMA. Now we can do better. We'll have to decide if we simply fallback to qemu_anon_ram_alloc() if no shared memory can be created (unavailable), like we do on Windows. So maybe something like qemu_ram_alloc_shared() fd = -1; if (qemu_memfd_avilable()) { fd = qemu_memfd_create(); if (fd < 0) ... error } else if (qemu_shm_available()) fd = qemu_shm_alloc(); if (fd < 0) ... error } else { /* * Old behavior: try fd-less shared memory. We might * just end up with non-shared memory on Windows, but * nobody can make sure of this shared memory either way * ... should we just use non-shared memory? Or should * we simply bail out? But then, if there is no shared * memory nobody could possible use it. */ qemu_anon_ram_alloc(share=true) }
On 11/6/2024 4:21 PM, Peter Xu wrote: > On Wed, Nov 06, 2024 at 03:59:23PM -0500, Steven Sistare wrote: >> On 11/6/2024 3:41 PM, Peter Xu wrote: >>> On Wed, Nov 06, 2024 at 03:12:20PM -0500, Steven Sistare wrote: >>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote: >>>>> On 04.11.24 21:56, Steven Sistare wrote: >>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote: >>>>>>> On 04.11.24 20:51, David Hildenbrand wrote: >>>>>>>> On 04.11.24 18:38, Steven Sistare wrote: >>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote: >>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote: >>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending >>>>>>>>>>> on the value of the anon-alloc machine property. This option applies to >>>>>>>>>>> memory allocated as a side effect of creating various devices. It does >>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on >>>>>>>>>>> the command line, or implicitly created by the -m command line option. >>>>>>>>>>> >>>>>>>>>>> The memfd option is intended to support new migration modes, in which the >>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending >>>>>>>>>>> the memfd file descriptor to the process. Memory contents are preserved, >>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are >>>>>>>>>>> locked in memory for DMA remain locked. This behavior is a pre-requisite >>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes. >>>>>>>>>> >>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm, >>>>>>>>>> similar to backends/hostmem-shm.c. >>>>>>>>>> >>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the >>>>>>>>>> details. See below. >>>>>>>>> >>>>>>>>> For this series I would prefer to use memfd and hide the details. It's a >>>>>>>>> concise (and well tested) solution albeit linux only. The code you supply >>>>>>>>> for posix shm would be a good follow on patch to support other unices. >>>>>>>> >>>>>>>> Unless there is reason to use memfd we should start with the more >>>>>>>> generic POSIX variant that is available even on systems without memfd. >>>>>>>> Factoring stuff out as I drafted does look quite compelling. >>>>>>>> >>>>>>>> I can help with the rework, and send it out separately, so you can focus >>>>>>>> on the "machine toggle" as part of this series. >>>>>>>> >>>>>>>> Of course, if we find out we need the memfd internally instead under >>>>>>>> Linux for whatever reason later, we can use that instead. >>>>>>>> >>>>>>>> But IIUC, the main selling point for memfd are additional features >>>>>>>> (hugetlb, memory sealing) that you aren't even using. >>>>>>> >>>>>>> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount). >>>>>> >>>>>> Sizing is a non-trivial difference. One can by default allocate all memory using memfd_create. >>>>>> To do so using shm_open requires configuration on the mount. One step harder to use. >>>>> >>>>> Yes. >>>>> >>>>>> >>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM >>>>>> if memory-backend-ram has hogged all the memory. >>>>>> >>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open(). >>>>>> >>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM >>>>>> -- memfd if available and fallback to shm_open. >>>>> >>>>> Yes. >>>>> >>>>>> >>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like >>>>>>> >>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on" >>>>>>> >>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process. >>>>>>> >>>>>>> Thoughts? >>>>>> >>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch. >>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc >>>>>> controls everything else. Now we're just riffing on the details: memfd vs shm_open, spelling >>>>>> of options and words to describe them. >>>>> >>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on". >>>> >>>> Hi David and Peter, >>>> >>>> I have implemented and tested the following, for both qemu_memfd_create >>>> and qemu_shm_alloc. This is pseudo-code, with error conditions omitted >>>> for simplicity. >>> >>> I'm ok with either shm or memfd, as this feature only applies to Linux >>> anyway. I'll leave that part to you and David to decide. >>> >>>> >>>> Any comments before I submit a complete patch? >>>> >>>> ---- >>>> qemu-options.hx: >>>> ``aux-ram-share=on|off`` >>>> Allocate auxiliary guest RAM as an anonymous file that is >>>> shareable with an external process. This option applies to >>>> memory allocated as a side effect of creating various devices. >>>> It does not apply to memory-backend-objects, whether explicitly >>>> specified on the command line, or implicitly created by the -m >>>> command line option. >>>> >>>> Some migration modes require aux-ram-share=on. >>>> >>>> qapi/migration.json: >>>> @cpr-transfer: >>>> ... >>>> Memory-backend objects must have the share=on attribute, but >>>> memory-backend-epc is not supported. The VM must be started >>>> with the '-machine aux-ram-share=on' option. >>>> >>>> Define RAM_PRIVATE >>>> >>>> Define qemu_shm_alloc(), from David's tmp patch >>>> >>>> ram_backend_memory_alloc() >>>> ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE; >>>> memory_region_init_ram_flags_nomigrate(ram_flags) >>> >>> Looks all good until here. >>> >>>> >>>> qemu_ram_alloc_internal() >>>> ... >>>> if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share) >>> >>> Nitpick: could rely on flags-only, rather than testing "!host", AFAICT >>> that's equal to RAM_PREALLOC. >> >> IMO testing host is clearer and more future proof, regardless of how flags >> are currently used. If the caller passes host, then we should not allocate >> memory here, full stop. >> >>> Meanwhile I slightly prefer we don't touch >>> anything if SHARED|PRIVATE is set. >> >> OK, if SHARED is already set I will not set it again. >> >>> All combined, it could be: >>> >>> if (!(ram_flags & (RAM_PREALLOC | RAM_PRIVATE | RAM_SHARED))) { >>> // ramblock to be allocated, with no share/private request, aka, >>> // aux memory chunk... >>> } >>> >>>> new_block->flags |= RAM_SHARED; >>>> >>>> if (!host && (new_block->flags & RAM_SHARED)) { >>>> qemu_ram_alloc_shared(new_block); >>> >>> I'm not sure whether this needs its own helper. >> >> Reserve judgement until you see the full patch. The helper is a >> non-trivial subroutine and IMO it improves readability. Also the >> cpr find/save hooks are confined to the subroutine. > > I thought we can use the same code path to process "aux ramblock" and all > kinds of other RAM_SHARED ramblocks. IIUC cpr find/save should apply there > too, but maybe I missed something. Yes. qemu_ram_alloc_shared() allocates and handles CPR for aux ramblock and memory-backend-ram,share=on. The key change that David suggested, that allows this unification, is to push the fd creation down from ram_backend_memory_alloc to qemu_ram_alloc_shared. >>> Should we fallback to >>> ram_block_add() below, just like a RAM_SHARED? >> >> I thought we all discussed and agreed that the allocation should be performed >> above ram_block_add. David's suggested patch does it here also. > > I was not closely followed all the discussions happened.. so I could have > missed something indeed. > > One thing I want to double check is cpr will still make things like below > work, right? > > -object memory-backend-ram,share=on [1] Yes, this new patch makes that work for CPR. V3 did not support CPR for memory-backend-ram. > IIUC with the old code this won't create fd, so to make cpr work (and also > what I was trying to say in the previous email..) is we could silently > start to create memfds for these, which means we need to first teach > qemu_anon_ram_alloc() on creating memfd for RAM_SHARED and cache these fds > (which should hopefully keep the same behavior as before). Now the fd is created and cached in qemu_ram_alloc_internal -> qemu_ram_alloc_shared. > Then for aux ramblocks like ROMs, as long as it sets RAM_SHARED properly in > qemu_ram_alloc_internal() (but only when aux-share-mem=on, for sure..), > then the rest code path in ram_block_add() should be the same as when user > specified share=on in [1]. > > Anyway, if both of you agreed on it, I am happy to wait and read the whole > patch. > > Side note: I'll still use a few days for other things, but I'll get back to > read this whole series before next week.. btw, this series does not depend > on precreate phase now, am I right? Correct, this series does not depend on precreate. - Steve >>> IIUC, we could start to create RAM_SHARED in qemu_anon_ram_alloc() and >>> always cache the fd (even if we don't do that before)? >>> >>>> } else >>>> new_block->fd = -1; >>>> new_block->host = host; >>>> } >>>> ram_block_add(new_block); >>>> >>>> qemu_ram_alloc_shared() >>>> if qemu_memfd_check() >>>> new_block->fd = qemu_memfd_create() >>>> else >>>> new_block->fd = qemu_shm_alloc() >>>> new_block->host = file_ram_alloc(new_block->fd) >>>> ---- >>>> >>>> - Steve >>>> >>> >> >
On 11/7/2024 8:05 AM, David Hildenbrand wrote: > On 06.11.24 21:59, Steven Sistare wrote: >> On 11/6/2024 3:41 PM, Peter Xu wrote: >>> On Wed, Nov 06, 2024 at 03:12:20PM -0500, Steven Sistare wrote: >>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote: >>>>> On 04.11.24 21:56, Steven Sistare wrote: >>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote: >>>>>>> On 04.11.24 20:51, David Hildenbrand wrote: >>>>>>>> On 04.11.24 18:38, Steven Sistare wrote: >>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote: >>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote: >>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending >>>>>>>>>>> on the value of the anon-alloc machine property. This option applies to >>>>>>>>>>> memory allocated as a side effect of creating various devices. It does >>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on >>>>>>>>>>> the command line, or implicitly created by the -m command line option. >>>>>>>>>>> >>>>>>>>>>> The memfd option is intended to support new migration modes, in which the >>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending >>>>>>>>>>> the memfd file descriptor to the process. Memory contents are preserved, >>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are >>>>>>>>>>> locked in memory for DMA remain locked. This behavior is a pre-requisite >>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes. >>>>>>>>>> >>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm, >>>>>>>>>> similar to backends/hostmem-shm.c. >>>>>>>>>> >>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the >>>>>>>>>> details. See below. >>>>>>>>> >>>>>>>>> For this series I would prefer to use memfd and hide the details. It's a >>>>>>>>> concise (and well tested) solution albeit linux only. The code you supply >>>>>>>>> for posix shm would be a good follow on patch to support other unices. >>>>>>>> >>>>>>>> Unless there is reason to use memfd we should start with the more >>>>>>>> generic POSIX variant that is available even on systems without memfd. >>>>>>>> Factoring stuff out as I drafted does look quite compelling. >>>>>>>> >>>>>>>> I can help with the rework, and send it out separately, so you can focus >>>>>>>> on the "machine toggle" as part of this series. >>>>>>>> >>>>>>>> Of course, if we find out we need the memfd internally instead under >>>>>>>> Linux for whatever reason later, we can use that instead. >>>>>>>> >>>>>>>> But IIUC, the main selling point for memfd are additional features >>>>>>>> (hugetlb, memory sealing) that you aren't even using. >>>>>>> >>>>>>> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount). >>>>>> >>>>>> Sizing is a non-trivial difference. One can by default allocate all memory using memfd_create. >>>>>> To do so using shm_open requires configuration on the mount. One step harder to use. >>>>> >>>>> Yes. >>>>> >>>>>> >>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM >>>>>> if memory-backend-ram has hogged all the memory. >>>>>> >>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open(). >>>>>> >>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM >>>>>> -- memfd if available and fallback to shm_open. >>>>> >>>>> Yes. >>>>> >>>>>> >>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like >>>>>>> >>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on" >>>>>>> >>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process. >>>>>>> >>>>>>> Thoughts? >>>>>> >>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch. >>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc >>>>>> controls everything else. Now we're just riffing on the details: memfd vs shm_open, spelling >>>>>> of options and words to describe them. >>>>> >>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on". >>>> >>>> Hi David and Peter, >>>> >>>> I have implemented and tested the following, for both qemu_memfd_create >>>> and qemu_shm_alloc. This is pseudo-code, with error conditions omitted >>>> for simplicity. >>> >>> I'm ok with either shm or memfd, as this feature only applies to Linux >>> anyway. I'll leave that part to you and David to decide. >>> >>>> >>>> Any comments before I submit a complete patch? >>>> >>>> ---- >>>> qemu-options.hx: >>>> ``aux-ram-share=on|off`` >>>> Allocate auxiliary guest RAM as an anonymous file that is >>>> shareable with an external process. This option applies to >>>> memory allocated as a side effect of creating various devices. >>>> It does not apply to memory-backend-objects, whether explicitly >>>> specified on the command line, or implicitly created by the -m >>>> command line option. >>>> >>>> Some migration modes require aux-ram-share=on. >>>> >>>> qapi/migration.json: >>>> @cpr-transfer: >>>> ... >>>> Memory-backend objects must have the share=on attribute, but >>>> memory-backend-epc is not supported. The VM must be started >>>> with the '-machine aux-ram-share=on' option. >>>> >>>> Define RAM_PRIVATE >>>> >>>> Define qemu_shm_alloc(), from David's tmp patch >>>> >>>> ram_backend_memory_alloc() >>>> ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE; >>>> memory_region_init_ram_flags_nomigrate(ram_flags) >>> >>> Looks all good until here. >>> >>>> >>>> qemu_ram_alloc_internal() >>>> ... >>>> if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share) >>> >>> Nitpick: could rely on flags-only, rather than testing "!host", AFAICT >>> that's equal to RAM_PREALLOC. >> >> IMO testing host is clearer and more future proof, regardless of how flags >> are currently used. If the caller passes host, then we should not allocate >> memory here, full stop. >> >>> Meanwhile I slightly prefer we don't touch >>> anything if SHARED|PRIVATE is set. >> >> OK, if SHARED is already set I will not set it again. > > We only have to make sure that stuff like qemu_ram_is_shared() will continue working as expected. > > What I think we should do: > > We should probably assert that nobody passes in SHARED|PRIVATE. And we can use PRIVATE only as a parameter to the function, but never actually set it on the ramblock. > > If someone passes in PRIVATE, we don't include it in block->flags. (RMA_SHARED remains cleared) > > If someone passes in SHARED, we do set it in block->flags. > If someone passes PRIVATE|SHARED, we assert. > > If someone passes in nothing: we set block->flags to SHARED with aux_ram_share=on. Otherwise, we do nothing (RAM_SHARED remains cleared) > > If that's also what you had in mind, great. Yes, my patch does that, but it also sets RAM_PRIVATE on the ramblock. I will undo the latter. Do you plan to submit the part of your "tmp" patch that refactors shm_backend_memory_alloc and defines qemu_shm_alloc? If you want, I could include it in my series, with your Signed-off-by. Do you have any comments on my proposed name aux-ram-share, or my proposed text for qemu-options.hx and migration.json? Speaking now would prevent more version churn later. - Steve
On 11/7/2024 8:23 AM, David Hildenbrand wrote: > On 06.11.24 21:12, Steven Sistare wrote: >> On 11/4/2024 4:36 PM, David Hildenbrand wrote: >>> On 04.11.24 21:56, Steven Sistare wrote: >>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote: >>>>> On 04.11.24 20:51, David Hildenbrand wrote: >>>>>> On 04.11.24 18:38, Steven Sistare wrote: >>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote: >>>>>>>> On 01.11.24 14:47, Steve Sistare wrote: >>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending >>>>>>>>> on the value of the anon-alloc machine property. This option applies to >>>>>>>>> memory allocated as a side effect of creating various devices. It does >>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on >>>>>>>>> the command line, or implicitly created by the -m command line option. >>>>>>>>> >>>>>>>>> The memfd option is intended to support new migration modes, in which the >>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending >>>>>>>>> the memfd file descriptor to the process. Memory contents are preserved, >>>>>>>>> and if the mode also transfers device descriptors, then pages that are >>>>>>>>> locked in memory for DMA remain locked. This behavior is a pre-requisite >>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes. >>>>>>>> >>>>>>>> A more portable, non-Linux specific variant of this will be using shm, >>>>>>>> similar to backends/hostmem-shm.c. >>>>>>>> >>>>>>>> Likely we should be using that instead of memfd, or try hiding the >>>>>>>> details. See below. >>>>>>> >>>>>>> For this series I would prefer to use memfd and hide the details. It's a >>>>>>> concise (and well tested) solution albeit linux only. The code you supply >>>>>>> for posix shm would be a good follow on patch to support other unices. >>>>>> >>>>>> Unless there is reason to use memfd we should start with the more >>>>>> generic POSIX variant that is available even on systems without memfd. >>>>>> Factoring stuff out as I drafted does look quite compelling. >>>>>> >>>>>> I can help with the rework, and send it out separately, so you can focus >>>>>> on the "machine toggle" as part of this series. >>>>>> >>>>>> Of course, if we find out we need the memfd internally instead under >>>>>> Linux for whatever reason later, we can use that instead. >>>>>> >>>>>> But IIUC, the main selling point for memfd are additional features >>>>>> (hugetlb, memory sealing) that you aren't even using. >>>>> >>>>> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount). >>>> >>>> Sizing is a non-trivial difference. One can by default allocate all memory using memfd_create. >>>> To do so using shm_open requires configuration on the mount. One step harder to use. >>> >>> Yes. >>> >>>> >>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM >>>> if memory-backend-ram has hogged all the memory. >>>> >>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open(). >>>> >>>> Yes, and if that is a good idea, then the same should be done for internal RAM >>>> -- memfd if available and fallback to shm_open. >>> >>> Yes. >>> >>>> >>>>> I'm hoping we can find a way where it just all is rather intuitive, like >>>>> >>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on" >>>>> >>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process. >>>>> >>>>> Thoughts? >>>> >>>> Agreed, though I thought I had already landed at the intuitive specification in my patch. >>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc >>>> controls everything else. Now we're just riffing on the details: memfd vs shm_open, spelling >>>> of options and words to describe them. >>> >>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on". >> >> Hi David and Peter, >> >> I have implemented and tested the following, for both qemu_memfd_create >> and qemu_shm_alloc. This is pseudo-code, with error conditions omitted >> for simplicity. >> >> Any comments before I submit a complete patch? >> >> ---- >> qemu-options.hx: >> ``aux-ram-share=on|off`` >> Allocate auxiliary guest RAM as an anonymous file that is >> shareable with an external process. This option applies to >> memory allocated as a side effect of creating various devices. >> It does not apply to memory-backend-objects, whether explicitly >> specified on the command line, or implicitly created by the -m >> command line option. >> >> Some migration modes require aux-ram-share=on. >> >> qapi/migration.json: >> @cpr-transfer: >> ... >> Memory-backend objects must have the share=on attribute, but >> memory-backend-epc is not supported. The VM must be started >> with the '-machine aux-ram-share=on' option. >> >> Define RAM_PRIVATE >> >> Define qemu_shm_alloc(), from David's tmp patch >> >> ram_backend_memory_alloc() >> ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE; >> memory_region_init_ram_flags_nomigrate(ram_flags) >> >> qemu_ram_alloc_internal() >> ... >> if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share) >> new_block->flags |= RAM_SHARED; >> >> if (!host && (new_block->flags & RAM_SHARED)) { >> qemu_ram_alloc_shared(new_block); >> } else >> new_block->fd = -1; >> new_block->host = host; >> } >> ram_block_add(new_block); >> >> qemu_ram_alloc_shared() >> if qemu_memfd_check() >> new_block->fd = qemu_memfd_create() >> else >> new_block->fd = qemu_shm_alloc() > > Yes, that way "memory-backend-ram,share=on" will just mean "give me the best shared memory for RAM to be shared with other processes, I don't care about the details", and it will work on Linux kernels even before we had memfds. > > memory-backend-ram should be available on all architectures, and under Windows. qemu_anon_ram_alloc() under Linux just does nothing special, not even bail out. > > MAP_SHARED|MAP_ANON was always weird, because it meant "give me memory I can share only with subprocesses", but then, *there are not subprocesses for QEMU*. I recall there was a trick to obtain the fd under Linux for these regions using /proc/self/fd/, but it's very Linux specific ... > > So nobody would *actually* use that shared memory and it was only a hack for RDMA. Now we can do better. > > > We'll have to decide if we simply fallback to qemu_anon_ram_alloc() if no shared memory can be created (unavailable), like we do on Windows. > > So maybe something like > > qemu_ram_alloc_shared() > fd = -1; > > if (qemu_memfd_avilable()) { > fd = qemu_memfd_create(); > if (fd < 0) > ... error > } else if (qemu_shm_available()) > fd = qemu_shm_alloc(); > if (fd < 0) > ... error > } else { > /* > * Old behavior: try fd-less shared memory. We might > * just end up with non-shared memory on Windows, but > * nobody can make sure of this shared memory either way > * ... should we just use non-shared memory? Or should > * we simply bail out? But then, if there is no shared > * memory nobody could possible use it. > */ > qemu_anon_ram_alloc(share=true) > } Good catch. We need that fallback for backwards compatibility. Even with no use case for memory-backend-ram,share=on since the demise of rdma, users may specify it on windows, for no particular reason, but it works, and should continue to work after this series. CPR would be blocked. More generally for backwards compatibility for share=on for no particular reason, should we fallback if qemu_shm_alloc fails? If /dev/shm is mounted with default options and more than half of ram is requested, it will fail, whereas current qemu succeeds using MAP_SHARED|MAP_ANON. - Steve
On 07.11.24 15:04, Steven Sistare wrote: > On 11/7/2024 8:05 AM, David Hildenbrand wrote: >> On 06.11.24 21:59, Steven Sistare wrote: >>> On 11/6/2024 3:41 PM, Peter Xu wrote: >>>> On Wed, Nov 06, 2024 at 03:12:20PM -0500, Steven Sistare wrote: >>>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote: >>>>>> On 04.11.24 21:56, Steven Sistare wrote: >>>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote: >>>>>>>> On 04.11.24 20:51, David Hildenbrand wrote: >>>>>>>>> On 04.11.24 18:38, Steven Sistare wrote: >>>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote: >>>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote: >>>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending >>>>>>>>>>>> on the value of the anon-alloc machine property. This option applies to >>>>>>>>>>>> memory allocated as a side effect of creating various devices. It does >>>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on >>>>>>>>>>>> the command line, or implicitly created by the -m command line option. >>>>>>>>>>>> >>>>>>>>>>>> The memfd option is intended to support new migration modes, in which the >>>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending >>>>>>>>>>>> the memfd file descriptor to the process. Memory contents are preserved, >>>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are >>>>>>>>>>>> locked in memory for DMA remain locked. This behavior is a pre-requisite >>>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes. >>>>>>>>>>> >>>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm, >>>>>>>>>>> similar to backends/hostmem-shm.c. >>>>>>>>>>> >>>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the >>>>>>>>>>> details. See below. >>>>>>>>>> >>>>>>>>>> For this series I would prefer to use memfd and hide the details. It's a >>>>>>>>>> concise (and well tested) solution albeit linux only. The code you supply >>>>>>>>>> for posix shm would be a good follow on patch to support other unices. >>>>>>>>> >>>>>>>>> Unless there is reason to use memfd we should start with the more >>>>>>>>> generic POSIX variant that is available even on systems without memfd. >>>>>>>>> Factoring stuff out as I drafted does look quite compelling. >>>>>>>>> >>>>>>>>> I can help with the rework, and send it out separately, so you can focus >>>>>>>>> on the "machine toggle" as part of this series. >>>>>>>>> >>>>>>>>> Of course, if we find out we need the memfd internally instead under >>>>>>>>> Linux for whatever reason later, we can use that instead. >>>>>>>>> >>>>>>>>> But IIUC, the main selling point for memfd are additional features >>>>>>>>> (hugetlb, memory sealing) that you aren't even using. >>>>>>>> >>>>>>>> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount). >>>>>>> >>>>>>> Sizing is a non-trivial difference. One can by default allocate all memory using memfd_create. >>>>>>> To do so using shm_open requires configuration on the mount. One step harder to use. >>>>>> >>>>>> Yes. >>>>>> >>>>>>> >>>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM >>>>>>> if memory-backend-ram has hogged all the memory. >>>>>>> >>>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open(). >>>>>>> >>>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM >>>>>>> -- memfd if available and fallback to shm_open. >>>>>> >>>>>> Yes. >>>>>> >>>>>>> >>>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like >>>>>>>> >>>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on" >>>>>>>> >>>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process. >>>>>>>> >>>>>>>> Thoughts? >>>>>>> >>>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch. >>>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc >>>>>>> controls everything else. Now we're just riffing on the details: memfd vs shm_open, spelling >>>>>>> of options and words to describe them. >>>>>> >>>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on". >>>>> >>>>> Hi David and Peter, >>>>> >>>>> I have implemented and tested the following, for both qemu_memfd_create >>>>> and qemu_shm_alloc. This is pseudo-code, with error conditions omitted >>>>> for simplicity. >>>> >>>> I'm ok with either shm or memfd, as this feature only applies to Linux >>>> anyway. I'll leave that part to you and David to decide. >>>> >>>>> >>>>> Any comments before I submit a complete patch? >>>>> >>>>> ---- >>>>> qemu-options.hx: >>>>> ``aux-ram-share=on|off`` >>>>> Allocate auxiliary guest RAM as an anonymous file that is >>>>> shareable with an external process. This option applies to >>>>> memory allocated as a side effect of creating various devices. >>>>> It does not apply to memory-backend-objects, whether explicitly >>>>> specified on the command line, or implicitly created by the -m >>>>> command line option. >>>>> >>>>> Some migration modes require aux-ram-share=on. >>>>> >>>>> qapi/migration.json: >>>>> @cpr-transfer: >>>>> ... >>>>> Memory-backend objects must have the share=on attribute, but >>>>> memory-backend-epc is not supported. The VM must be started >>>>> with the '-machine aux-ram-share=on' option. >>>>> >>>>> Define RAM_PRIVATE >>>>> >>>>> Define qemu_shm_alloc(), from David's tmp patch >>>>> >>>>> ram_backend_memory_alloc() >>>>> ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE; >>>>> memory_region_init_ram_flags_nomigrate(ram_flags) >>>> >>>> Looks all good until here. >>>> >>>>> >>>>> qemu_ram_alloc_internal() >>>>> ... >>>>> if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share) >>>> >>>> Nitpick: could rely on flags-only, rather than testing "!host", AFAICT >>>> that's equal to RAM_PREALLOC. >>> >>> IMO testing host is clearer and more future proof, regardless of how flags >>> are currently used. If the caller passes host, then we should not allocate >>> memory here, full stop. >>> >>>> Meanwhile I slightly prefer we don't touch >>>> anything if SHARED|PRIVATE is set. >>> >>> OK, if SHARED is already set I will not set it again. >> >> We only have to make sure that stuff like qemu_ram_is_shared() will continue working as expected. >> >> What I think we should do: >> >> We should probably assert that nobody passes in SHARED|PRIVATE. And we can use PRIVATE only as a parameter to the function, but never actually set it on the ramblock. >> >> If someone passes in PRIVATE, we don't include it in block->flags. (RMA_SHARED remains cleared) >> >> If someone passes in SHARED, we do set it in block->flags. >> If someone passes PRIVATE|SHARED, we assert. >> >> If someone passes in nothing: we set block->flags to SHARED with aux_ram_share=on. Otherwise, we do nothing (RAM_SHARED remains cleared) >> >> If that's also what you had in mind, great. > > Yes, my patch does that, but it also sets RAM_PRIVATE on the ramblock. > I will undo the latter. > > Do you plan to submit the part of your "tmp" patch that refactors > shm_backend_memory_alloc and defines qemu_shm_alloc? If you want, > I could include it in my series, with your Signed-off-by. My patch went a bit too far I think. And would not work on win32 :) We should probably start with this: From 124920aeda2756faa104bfa6e934c7c20b1fbbe9 Mon Sep 17 00:00:00 2001 From: David Hildenbrand <david@redhat.com> Date: Mon, 4 Nov 2024 11:29:22 +0100 Subject: [PATCH] backends/hostmem-shm: factor out allocation of "anonymous shared memory with an fd" Let's factor it out so we can reuse it. Signed-off-by: David Hildenbrand <david@redhat.com> --- backends/hostmem-shm.c | 45 ++++------------------------------- include/qemu/osdep.h | 1 + system/physmem.c | 2 +- util/oslib-posix.c | 53 ++++++++++++++++++++++++++++++++++++++++++ util/oslib-win32.c | 6 +++++ 5 files changed, 65 insertions(+), 42 deletions(-) diff --git a/backends/hostmem-shm.c b/backends/hostmem-shm.c index 374edc3db8..837b9f1dd4 100644 --- a/backends/hostmem-shm.c +++ b/backends/hostmem-shm.c @@ -25,11 +25,9 @@ struct HostMemoryBackendShm { static bool shm_backend_memory_alloc(HostMemoryBackend *backend, Error **errp) { - g_autoptr(GString) shm_name = g_string_new(NULL); g_autofree char *backend_name = NULL; uint32_t ram_flags; - int fd, oflag; - mode_t mode; + int fd; if (!backend->size) { error_setg(errp, "can't create shm backend with size 0"); @@ -41,48 +39,13 @@ shm_backend_memory_alloc(HostMemoryBackend *backend, Error **errp) return false; } - /* - * Let's use `mode = 0` because we don't want other processes to open our - * memory unless we share the file descriptor with them. - */ - mode = 0; - oflag = O_RDWR | O_CREAT | O_EXCL; - backend_name = host_memory_backend_get_name(backend); - - /* - * Some operating systems allow creating anonymous POSIX shared memory - * objects (e.g. FreeBSD provides the SHM_ANON constant), but this is not - * defined by POSIX, so let's create a unique name. - * - * From Linux's shm_open(3) man-page: - * For portable use, a shared memory object should be identified - * by a name of the form /somename;" - */ - g_string_printf(shm_name, "/qemu-" FMT_pid "-shm-%s", getpid(), - backend_name); - - fd = shm_open(shm_name->str, oflag, mode); + fd = qemu_shm_alloc(backend->size, errp); if (fd < 0) { - error_setg_errno(errp, errno, - "failed to create POSIX shared memory"); - return false; - } - - /* - * We have the file descriptor, so we no longer need to expose the - * POSIX shared memory object. However it will remain allocated as long as - * there are file descriptors pointing to it. - */ - shm_unlink(shm_name->str); - - if (ftruncate(fd, backend->size) == -1) { - error_setg_errno(errp, errno, - "failed to resize POSIX shared memory to %" PRIu64, - backend->size); - close(fd); return false; } + /* Let's do the same as memory-backend-ram,share=on would do. */ + backend_name = host_memory_backend_get_name(backend); ram_flags = RAM_SHARED; ram_flags |= backend->reserve ? 0 : RAM_NORESERVE; diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h index fe7c3c5f67..4a24f11174 100644 --- a/include/qemu/osdep.h +++ b/include/qemu/osdep.h @@ -505,6 +505,7 @@ int qemu_daemon(int nochdir, int noclose); void *qemu_anon_ram_alloc(size_t size, uint64_t *align, bool shared, bool noreserve); void qemu_anon_ram_free(void *ptr, size_t size); +int qemu_shm_alloc(size_t size, Error **errp); #ifdef _WIN32 #define HAVE_CHARDEV_SERIAL 1 diff --git a/system/physmem.c b/system/physmem.c index dc1db3a384..1b477fec44 100644 --- a/system/physmem.c +++ b/system/physmem.c @@ -2089,7 +2089,7 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size, new_block->page_size = qemu_real_host_page_size(); new_block->host = host; new_block->flags = ram_flags; - ram_block_add(new_block, &local_err); + if (local_err) { g_free(new_block); error_propagate(errp, local_err); diff --git a/util/oslib-posix.c b/util/oslib-posix.c index 11b35e48fb..bc5c28b162 100644 --- a/util/oslib-posix.c +++ b/util/oslib-posix.c @@ -931,3 +931,56 @@ void qemu_close_all_open_fd(const int *skip, unsigned int nskip) qemu_close_all_open_fd_fallback(skip, nskip, open_max); } } + +int qemu_shm_alloc(size_t size, Error **errp) +{ + g_autoptr(GString) shm_name = g_string_new(NULL); + int fd, oflag, cur_sequence; + static int sequence; + mode_t mode; + + cur_sequence = qatomic_fetch_inc(&sequence); + + /* + * Let's use `mode = 0` because we don't want other processes to open our + * memory unless we share the file descriptor with them. + */ + mode = 0; + oflag = O_RDWR | O_CREAT | O_EXCL; + + /* + * Some operating systems allow creating anonymous POSIX shared memory + * objects (e.g. FreeBSD provides the SHM_ANON constant), but this is not + * defined by POSIX, so let's create a unique name. + * + * From Linux's shm_open(3) man-page: + * For portable use, a shared memory object should be identified + * by a name of the form /somename;" + */ + g_string_printf(shm_name, "/qemu-" FMT_pid "-shm-%d", getpid(), + cur_sequence); + + fd = shm_open(shm_name->str, oflag, mode); + if (fd < 0) { + error_setg_errno(errp, errno, + "failed to create POSIX shared memory"); + return -1; + } + + /* + * We have the file descriptor, so we no longer need to expose the + * POSIX shared memory object. However it will remain allocated as long as + * there are file descriptors pointing to it. + */ + shm_unlink(shm_name->str); + + if (ftruncate(fd, size) == -1) { + error_setg_errno(errp, errno, + "failed to resize POSIX shared memory to %" PRIu64, + size); + close(fd); + return -1; + } + + return fd; +} diff --git a/util/oslib-win32.c b/util/oslib-win32.c index b623830d62..f79a190b78 100644 --- a/util/oslib-win32.c +++ b/util/oslib-win32.c @@ -877,3 +877,9 @@ void qemu_win32_map_free(void *ptr, HANDLE h, Error **errp) } CloseHandle(h); } + +int qemu_shm_alloc(size_t size, Error **errp) +{ + error_setg("Shared memory is not supported."); + return -1; +}
On 07.11.24 17:02, Steven Sistare wrote: > On 11/7/2024 8:23 AM, David Hildenbrand wrote: >> On 06.11.24 21:12, Steven Sistare wrote: >>> On 11/4/2024 4:36 PM, David Hildenbrand wrote: >>>> On 04.11.24 21:56, Steven Sistare wrote: >>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote: >>>>>> On 04.11.24 20:51, David Hildenbrand wrote: >>>>>>> On 04.11.24 18:38, Steven Sistare wrote: >>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote: >>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote: >>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending >>>>>>>>>> on the value of the anon-alloc machine property. This option applies to >>>>>>>>>> memory allocated as a side effect of creating various devices. It does >>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on >>>>>>>>>> the command line, or implicitly created by the -m command line option. >>>>>>>>>> >>>>>>>>>> The memfd option is intended to support new migration modes, in which the >>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending >>>>>>>>>> the memfd file descriptor to the process. Memory contents are preserved, >>>>>>>>>> and if the mode also transfers device descriptors, then pages that are >>>>>>>>>> locked in memory for DMA remain locked. This behavior is a pre-requisite >>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes. >>>>>>>>> >>>>>>>>> A more portable, non-Linux specific variant of this will be using shm, >>>>>>>>> similar to backends/hostmem-shm.c. >>>>>>>>> >>>>>>>>> Likely we should be using that instead of memfd, or try hiding the >>>>>>>>> details. See below. >>>>>>>> >>>>>>>> For this series I would prefer to use memfd and hide the details. It's a >>>>>>>> concise (and well tested) solution albeit linux only. The code you supply >>>>>>>> for posix shm would be a good follow on patch to support other unices. >>>>>>> >>>>>>> Unless there is reason to use memfd we should start with the more >>>>>>> generic POSIX variant that is available even on systems without memfd. >>>>>>> Factoring stuff out as I drafted does look quite compelling. >>>>>>> >>>>>>> I can help with the rework, and send it out separately, so you can focus >>>>>>> on the "machine toggle" as part of this series. >>>>>>> >>>>>>> Of course, if we find out we need the memfd internally instead under >>>>>>> Linux for whatever reason later, we can use that instead. >>>>>>> >>>>>>> But IIUC, the main selling point for memfd are additional features >>>>>>> (hugetlb, memory sealing) that you aren't even using. >>>>>> >>>>>> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount). >>>>> >>>>> Sizing is a non-trivial difference. One can by default allocate all memory using memfd_create. >>>>> To do so using shm_open requires configuration on the mount. One step harder to use. >>>> >>>> Yes. >>>> >>>>> >>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM >>>>> if memory-backend-ram has hogged all the memory. >>>>> >>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open(). >>>>> >>>>> Yes, and if that is a good idea, then the same should be done for internal RAM >>>>> -- memfd if available and fallback to shm_open. >>>> >>>> Yes. >>>> >>>>> >>>>>> I'm hoping we can find a way where it just all is rather intuitive, like >>>>>> >>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on" >>>>>> >>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process. >>>>>> >>>>>> Thoughts? >>>>> >>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch. >>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc >>>>> controls everything else. Now we're just riffing on the details: memfd vs shm_open, spelling >>>>> of options and words to describe them. >>>> >>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on". >>> >>> Hi David and Peter, >>> >>> I have implemented and tested the following, for both qemu_memfd_create >>> and qemu_shm_alloc. This is pseudo-code, with error conditions omitted >>> for simplicity. >>> >>> Any comments before I submit a complete patch? >>> >>> ---- >>> qemu-options.hx: >>> ``aux-ram-share=on|off`` >>> Allocate auxiliary guest RAM as an anonymous file that is >>> shareable with an external process. This option applies to >>> memory allocated as a side effect of creating various devices. >>> It does not apply to memory-backend-objects, whether explicitly >>> specified on the command line, or implicitly created by the -m >>> command line option. >>> >>> Some migration modes require aux-ram-share=on. >>> >>> qapi/migration.json: >>> @cpr-transfer: >>> ... >>> Memory-backend objects must have the share=on attribute, but >>> memory-backend-epc is not supported. The VM must be started >>> with the '-machine aux-ram-share=on' option. >>> >>> Define RAM_PRIVATE >>> >>> Define qemu_shm_alloc(), from David's tmp patch >>> >>> ram_backend_memory_alloc() >>> ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE; >>> memory_region_init_ram_flags_nomigrate(ram_flags) >>> >>> qemu_ram_alloc_internal() >>> ... >>> if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share) >>> new_block->flags |= RAM_SHARED; >>> >>> if (!host && (new_block->flags & RAM_SHARED)) { >>> qemu_ram_alloc_shared(new_block); >>> } else >>> new_block->fd = -1; >>> new_block->host = host; >>> } >>> ram_block_add(new_block); >>> >>> qemu_ram_alloc_shared() >>> if qemu_memfd_check() >>> new_block->fd = qemu_memfd_create() >>> else >>> new_block->fd = qemu_shm_alloc() >> >> Yes, that way "memory-backend-ram,share=on" will just mean "give me the best shared memory for RAM to be shared with other processes, I don't care about the details", and it will work on Linux kernels even before we had memfds. >> >> memory-backend-ram should be available on all architectures, and under Windows. qemu_anon_ram_alloc() under Linux just does nothing special, not even bail out. >> >> MAP_SHARED|MAP_ANON was always weird, because it meant "give me memory I can share only with subprocesses", but then, *there are not subprocesses for QEMU*. I recall there was a trick to obtain the fd under Linux for these regions using /proc/self/fd/, but it's very Linux specific ... >> >> So nobody would *actually* use that shared memory and it was only a hack for RDMA. Now we can do better. >> >> >> We'll have to decide if we simply fallback to qemu_anon_ram_alloc() if no shared memory can be created (unavailable), like we do on Windows. >> >> So maybe something like >> >> qemu_ram_alloc_shared() >> fd = -1; >> >> if (qemu_memfd_avilable()) { >> fd = qemu_memfd_create(); >> if (fd < 0) >> ... error >> } else if (qemu_shm_available()) >> fd = qemu_shm_alloc(); >> if (fd < 0) >> ... error >> } else { >> /* >> * Old behavior: try fd-less shared memory. We might >> * just end up with non-shared memory on Windows, but >> * nobody can make sure of this shared memory either way >> * ... should we just use non-shared memory? Or should >> * we simply bail out? But then, if there is no shared >> * memory nobody could possible use it. >> */ >> qemu_anon_ram_alloc(share=true) >> } > > Good catch. We need that fallback for backwards compatibility. Even with > no use case for memory-backend-ram,share=on since the demise of rdma, users > may specify it on windows, for no particular reason, but it works, and should > continue to work after this series. CPR would be blocked. Yes, we should keep Windows working in the weird way it is working right now. > > More generally for backwards compatibility for share=on for no particular reason, > should we fallback if qemu_shm_alloc fails? If /dev/shm is mounted with default > options and more than half of ram is requested, it will fail, whereas current qemu > succeeds using MAP_SHARED|MAP_ANON. Only on Linux without memfd, of course. Maybe we should just warn when qemu_shm_alloc() fails (and comment that we continue for compat reasons only) and fallback to the stupid qemu_anon_ram_alloc(share=true). We could implement a fallback to shmget() but ... let's not go down that path. But we should not fallback to qemu_shm_alloc()/MAP_SHARED|MAP_ANON if memfd is available and that allocating the memfd failed. Failing to allocate a memfd might highlight a bigger problem.
On Thu, Nov 07, 2024 at 09:04:02AM -0500, Steven Sistare wrote: > On 11/7/2024 8:05 AM, David Hildenbrand wrote: > > On 06.11.24 21:59, Steven Sistare wrote: > > > On 11/6/2024 3:41 PM, Peter Xu wrote: > > > > On Wed, Nov 06, 2024 at 03:12:20PM -0500, Steven Sistare wrote: > > > > > On 11/4/2024 4:36 PM, David Hildenbrand wrote: > > > > > > On 04.11.24 21:56, Steven Sistare wrote: > > > > > > > On 11/4/2024 3:15 PM, David Hildenbrand wrote: > > > > > > > > On 04.11.24 20:51, David Hildenbrand wrote: > > > > > > > > > On 04.11.24 18:38, Steven Sistare wrote: > > > > > > > > > > On 11/4/2024 5:39 AM, David Hildenbrand wrote: > > > > > > > > > > > On 01.11.24 14:47, Steve Sistare wrote: > > > > > > > > > > > > Allocate anonymous memory using mmap MAP_ANON or memfd_create depending > > > > > > > > > > > > on the value of the anon-alloc machine property. This option applies to > > > > > > > > > > > > memory allocated as a side effect of creating various devices. It does > > > > > > > > > > > > not apply to memory-backend-objects, whether explicitly specified on > > > > > > > > > > > > the command line, or implicitly created by the -m command line option. > > > > > > > > > > > > > > > > > > > > > > > > The memfd option is intended to support new migration modes, in which the > > > > > > > > > > > > memory region can be transferred in place to a new QEMU process, by sending > > > > > > > > > > > > the memfd file descriptor to the process. Memory contents are preserved, > > > > > > > > > > > > and if the mode also transfers device descriptors, then pages that are > > > > > > > > > > > > locked in memory for DMA remain locked. This behavior is a pre-requisite > > > > > > > > > > > > for supporting vfio, vdpa, and iommufd devices with the new modes. > > > > > > > > > > > > > > > > > > > > > > A more portable, non-Linux specific variant of this will be using shm, > > > > > > > > > > > similar to backends/hostmem-shm.c. > > > > > > > > > > > > > > > > > > > > > > Likely we should be using that instead of memfd, or try hiding the > > > > > > > > > > > details. See below. > > > > > > > > > > > > > > > > > > > > For this series I would prefer to use memfd and hide the details. It's a > > > > > > > > > > concise (and well tested) solution albeit linux only. The code you supply > > > > > > > > > > for posix shm would be a good follow on patch to support other unices. > > > > > > > > > > > > > > > > > > Unless there is reason to use memfd we should start with the more > > > > > > > > > generic POSIX variant that is available even on systems without memfd. > > > > > > > > > Factoring stuff out as I drafted does look quite compelling. > > > > > > > > > > > > > > > > > > I can help with the rework, and send it out separately, so you can focus > > > > > > > > > on the "machine toggle" as part of this series. > > > > > > > > > > > > > > > > > > Of course, if we find out we need the memfd internally instead under > > > > > > > > > Linux for whatever reason later, we can use that instead. > > > > > > > > > > > > > > > > > > But IIUC, the main selling point for memfd are additional features > > > > > > > > > (hugetlb, memory sealing) that you aren't even using. > > > > > > > > > > > > > > > > FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount). > > > > > > > > > > > > > > Sizing is a non-trivial difference. One can by default allocate all memory using memfd_create. > > > > > > > To do so using shm_open requires configuration on the mount. One step harder to use. > > > > > > > > > > > > Yes. > > > > > > > > > > > > > > > > > > > > This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM > > > > > > > if memory-backend-ram has hogged all the memory. > > > > > > > > > > > > > > > Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open(). > > > > > > > > > > > > > > Yes, and if that is a good idea, then the same should be done for internal RAM > > > > > > > -- memfd if available and fallback to shm_open. > > > > > > > > > > > > Yes. > > > > > > > > > > > > > > > > > > > > > I'm hoping we can find a way where it just all is rather intuitive, like > > > > > > > > > > > > > > > > "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on" > > > > > > > > > > > > > > > > "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process. > > > > > > > > > > > > > > > > Thoughts? > > > > > > > > > > > > > > Agreed, though I thought I had already landed at the intuitive specification in my patch. > > > > > > > The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc > > > > > > > controls everything else. Now we're just riffing on the details: memfd vs shm_open, spelling > > > > > > > of options and words to describe them. > > > > > > > > > > > > Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on". > > > > > > > > > > Hi David and Peter, > > > > > > > > > > I have implemented and tested the following, for both qemu_memfd_create > > > > > and qemu_shm_alloc. This is pseudo-code, with error conditions omitted > > > > > for simplicity. > > > > > > > > I'm ok with either shm or memfd, as this feature only applies to Linux > > > > anyway. I'll leave that part to you and David to decide. > > > > > > > > > > > > > > Any comments before I submit a complete patch? > > > > > > > > > > ---- > > > > > qemu-options.hx: > > > > > ``aux-ram-share=on|off`` > > > > > Allocate auxiliary guest RAM as an anonymous file that is > > > > > shareable with an external process. This option applies to > > > > > memory allocated as a side effect of creating various devices. > > > > > It does not apply to memory-backend-objects, whether explicitly > > > > > specified on the command line, or implicitly created by the -m > > > > > command line option. > > > > > > > > > > Some migration modes require aux-ram-share=on. > > > > > > > > > > qapi/migration.json: > > > > > @cpr-transfer: > > > > > ... > > > > > Memory-backend objects must have the share=on attribute, but > > > > > memory-backend-epc is not supported. The VM must be started > > > > > with the '-machine aux-ram-share=on' option. > > > > > > > > > > Define RAM_PRIVATE > > > > > > > > > > Define qemu_shm_alloc(), from David's tmp patch > > > > > > > > > > ram_backend_memory_alloc() > > > > > ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE; > > > > > memory_region_init_ram_flags_nomigrate(ram_flags) > > > > > > > > Looks all good until here. > > > > > > > > > > > > > > qemu_ram_alloc_internal() > > > > > ... > > > > > if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share) > > > > > > > > Nitpick: could rely on flags-only, rather than testing "!host", AFAICT > > > > that's equal to RAM_PREALLOC. > > > > > > IMO testing host is clearer and more future proof, regardless of how flags > > > are currently used. If the caller passes host, then we should not allocate > > > memory here, full stop. > > > > > > > Meanwhile I slightly prefer we don't touch > > > > anything if SHARED|PRIVATE is set. > > > > > > OK, if SHARED is already set I will not set it again. > > > > We only have to make sure that stuff like qemu_ram_is_shared() will continue working as expected. > > > > What I think we should do: > > > > We should probably assert that nobody passes in SHARED|PRIVATE. And we can use PRIVATE only as a parameter to the function, but never actually set it on the ramblock. > > > > If someone passes in PRIVATE, we don't include it in block->flags. (RMA_SHARED remains cleared) > > > > If someone passes in SHARED, we do set it in block->flags. > > If someone passes PRIVATE|SHARED, we assert. > > > > If someone passes in nothing: we set block->flags to SHARED with aux_ram_share=on. Otherwise, we do nothing (RAM_SHARED remains cleared) > > > > If that's also what you had in mind, great. > > Yes, my patch does that, but it also sets RAM_PRIVATE on the ramblock. > I will undo the latter. David: why do we need to drop PRIVATE in ramblock flags? I thought it was pretty harmless. I suppose things like qemu_ram_is_shared() will even keep working as before? It looks ok to remove it too, but it adds logics that doesn't seem necessary to me, so just to double check if I missed something.. > > Do you plan to submit the part of your "tmp" patch that refactors > shm_backend_memory_alloc and defines qemu_shm_alloc? If you want, > I could include it in my series, with your Signed-off-by. > > Do you have any comments on my proposed name aux-ram-share, or my proposed text > for qemu-options.hx and migration.json? Speaking now would prevent more version > churn later. > > - Steve > >
On 07.11.24 17:32, Peter Xu wrote: > On Thu, Nov 07, 2024 at 09:04:02AM -0500, Steven Sistare wrote: >> On 11/7/2024 8:05 AM, David Hildenbrand wrote: >>> On 06.11.24 21:59, Steven Sistare wrote: >>>> On 11/6/2024 3:41 PM, Peter Xu wrote: >>>>> On Wed, Nov 06, 2024 at 03:12:20PM -0500, Steven Sistare wrote: >>>>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote: >>>>>>> On 04.11.24 21:56, Steven Sistare wrote: >>>>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote: >>>>>>>>> On 04.11.24 20:51, David Hildenbrand wrote: >>>>>>>>>> On 04.11.24 18:38, Steven Sistare wrote: >>>>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote: >>>>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote: >>>>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending >>>>>>>>>>>>> on the value of the anon-alloc machine property. This option applies to >>>>>>>>>>>>> memory allocated as a side effect of creating various devices. It does >>>>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on >>>>>>>>>>>>> the command line, or implicitly created by the -m command line option. >>>>>>>>>>>>> >>>>>>>>>>>>> The memfd option is intended to support new migration modes, in which the >>>>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending >>>>>>>>>>>>> the memfd file descriptor to the process. Memory contents are preserved, >>>>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are >>>>>>>>>>>>> locked in memory for DMA remain locked. This behavior is a pre-requisite >>>>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes. >>>>>>>>>>>> >>>>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm, >>>>>>>>>>>> similar to backends/hostmem-shm.c. >>>>>>>>>>>> >>>>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the >>>>>>>>>>>> details. See below. >>>>>>>>>>> >>>>>>>>>>> For this series I would prefer to use memfd and hide the details. It's a >>>>>>>>>>> concise (and well tested) solution albeit linux only. The code you supply >>>>>>>>>>> for posix shm would be a good follow on patch to support other unices. >>>>>>>>>> >>>>>>>>>> Unless there is reason to use memfd we should start with the more >>>>>>>>>> generic POSIX variant that is available even on systems without memfd. >>>>>>>>>> Factoring stuff out as I drafted does look quite compelling. >>>>>>>>>> >>>>>>>>>> I can help with the rework, and send it out separately, so you can focus >>>>>>>>>> on the "machine toggle" as part of this series. >>>>>>>>>> >>>>>>>>>> Of course, if we find out we need the memfd internally instead under >>>>>>>>>> Linux for whatever reason later, we can use that instead. >>>>>>>>>> >>>>>>>>>> But IIUC, the main selling point for memfd are additional features >>>>>>>>>> (hugetlb, memory sealing) that you aren't even using. >>>>>>>>> >>>>>>>>> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount). >>>>>>>> >>>>>>>> Sizing is a non-trivial difference. One can by default allocate all memory using memfd_create. >>>>>>>> To do so using shm_open requires configuration on the mount. One step harder to use. >>>>>>> >>>>>>> Yes. >>>>>>> >>>>>>>> >>>>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM >>>>>>>> if memory-backend-ram has hogged all the memory. >>>>>>>> >>>>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open(). >>>>>>>> >>>>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM >>>>>>>> -- memfd if available and fallback to shm_open. >>>>>>> >>>>>>> Yes. >>>>>>> >>>>>>>> >>>>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like >>>>>>>>> >>>>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on" >>>>>>>>> >>>>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process. >>>>>>>>> >>>>>>>>> Thoughts? >>>>>>>> >>>>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch. >>>>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc >>>>>>>> controls everything else. Now we're just riffing on the details: memfd vs shm_open, spelling >>>>>>>> of options and words to describe them. >>>>>>> >>>>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on". >>>>>> >>>>>> Hi David and Peter, >>>>>> >>>>>> I have implemented and tested the following, for both qemu_memfd_create >>>>>> and qemu_shm_alloc. This is pseudo-code, with error conditions omitted >>>>>> for simplicity. >>>>> >>>>> I'm ok with either shm or memfd, as this feature only applies to Linux >>>>> anyway. I'll leave that part to you and David to decide. >>>>> >>>>>> >>>>>> Any comments before I submit a complete patch? >>>>>> >>>>>> ---- >>>>>> qemu-options.hx: >>>>>> ``aux-ram-share=on|off`` >>>>>> Allocate auxiliary guest RAM as an anonymous file that is >>>>>> shareable with an external process. This option applies to >>>>>> memory allocated as a side effect of creating various devices. >>>>>> It does not apply to memory-backend-objects, whether explicitly >>>>>> specified on the command line, or implicitly created by the -m >>>>>> command line option. >>>>>> >>>>>> Some migration modes require aux-ram-share=on. >>>>>> >>>>>> qapi/migration.json: >>>>>> @cpr-transfer: >>>>>> ... >>>>>> Memory-backend objects must have the share=on attribute, but >>>>>> memory-backend-epc is not supported. The VM must be started >>>>>> with the '-machine aux-ram-share=on' option. >>>>>> >>>>>> Define RAM_PRIVATE >>>>>> >>>>>> Define qemu_shm_alloc(), from David's tmp patch >>>>>> >>>>>> ram_backend_memory_alloc() >>>>>> ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE; >>>>>> memory_region_init_ram_flags_nomigrate(ram_flags) >>>>> >>>>> Looks all good until here. >>>>> >>>>>> >>>>>> qemu_ram_alloc_internal() >>>>>> ... >>>>>> if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share) >>>>> >>>>> Nitpick: could rely on flags-only, rather than testing "!host", AFAICT >>>>> that's equal to RAM_PREALLOC. >>>> >>>> IMO testing host is clearer and more future proof, regardless of how flags >>>> are currently used. If the caller passes host, then we should not allocate >>>> memory here, full stop. >>>> >>>>> Meanwhile I slightly prefer we don't touch >>>>> anything if SHARED|PRIVATE is set. >>>> >>>> OK, if SHARED is already set I will not set it again. >>> >>> We only have to make sure that stuff like qemu_ram_is_shared() will continue working as expected. >>> >>> What I think we should do: >>> >>> We should probably assert that nobody passes in SHARED|PRIVATE. And we can use PRIVATE only as a parameter to the function, but never actually set it on the ramblock. >>> >>> If someone passes in PRIVATE, we don't include it in block->flags. (RMA_SHARED remains cleared) >>> >>> If someone passes in SHARED, we do set it in block->flags. >>> If someone passes PRIVATE|SHARED, we assert. >>> >>> If someone passes in nothing: we set block->flags to SHARED with aux_ram_share=on. Otherwise, we do nothing (RAM_SHARED remains cleared) >>> >>> If that's also what you had in mind, great. >> >> Yes, my patch does that, but it also sets RAM_PRIVATE on the ramblock. >> I will undo the latter. > > David: why do we need to drop PRIVATE in ramblock flags? I thought it was > pretty harmless. I suppose things like qemu_ram_is_shared() will even keep > working as before? > > It looks ok to remove it too, but it adds logics that doesn't seem > necessary to me, so just to double check if I missed something.. A finished ramblock is only boolean "shared" vs. "not shared/private". A single flag (RAM_SHARED) can express that clearly. Consequently there is less to get wrong when using RAM_PRIVATE only as a flag to the creation function (and documenting that!). To make RAM_PRIVATE consistent we might have to tweak all other RAMBlock creation functions to set RAM_PRIVATE in the !RAM_SHARED case, and I don't think that is wroth the trouble.
On 11/7/2024 11:26 AM, David Hildenbrand wrote: > On 07.11.24 17:02, Steven Sistare wrote: >> On 11/7/2024 8:23 AM, David Hildenbrand wrote: >>> On 06.11.24 21:12, Steven Sistare wrote: >>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote: >>>>> On 04.11.24 21:56, Steven Sistare wrote: >>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote: >>>>>>> On 04.11.24 20:51, David Hildenbrand wrote: >>>>>>>> On 04.11.24 18:38, Steven Sistare wrote: >>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote: >>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote: >>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending >>>>>>>>>>> on the value of the anon-alloc machine property. This option applies to >>>>>>>>>>> memory allocated as a side effect of creating various devices. It does >>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on >>>>>>>>>>> the command line, or implicitly created by the -m command line option. >>>>>>>>>>> >>>>>>>>>>> The memfd option is intended to support new migration modes, in which the >>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending >>>>>>>>>>> the memfd file descriptor to the process. Memory contents are preserved, >>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are >>>>>>>>>>> locked in memory for DMA remain locked. This behavior is a pre-requisite >>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes. >>>>>>>>>> >>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm, >>>>>>>>>> similar to backends/hostmem-shm.c. >>>>>>>>>> >>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the >>>>>>>>>> details. See below. >>>>>>>>> >>>>>>>>> For this series I would prefer to use memfd and hide the details. It's a >>>>>>>>> concise (and well tested) solution albeit linux only. The code you supply >>>>>>>>> for posix shm would be a good follow on patch to support other unices. >>>>>>>> >>>>>>>> Unless there is reason to use memfd we should start with the more >>>>>>>> generic POSIX variant that is available even on systems without memfd. >>>>>>>> Factoring stuff out as I drafted does look quite compelling. >>>>>>>> >>>>>>>> I can help with the rework, and send it out separately, so you can focus >>>>>>>> on the "machine toggle" as part of this series. >>>>>>>> >>>>>>>> Of course, if we find out we need the memfd internally instead under >>>>>>>> Linux for whatever reason later, we can use that instead. >>>>>>>> >>>>>>>> But IIUC, the main selling point for memfd are additional features >>>>>>>> (hugetlb, memory sealing) that you aren't even using. >>>>>>> >>>>>>> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount). >>>>>> >>>>>> Sizing is a non-trivial difference. One can by default allocate all memory using memfd_create. >>>>>> To do so using shm_open requires configuration on the mount. One step harder to use. >>>>> >>>>> Yes. >>>>> >>>>>> >>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM >>>>>> if memory-backend-ram has hogged all the memory. >>>>>> >>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open(). >>>>>> >>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM >>>>>> -- memfd if available and fallback to shm_open. >>>>> >>>>> Yes. >>>>> >>>>>> >>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like >>>>>>> >>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on" >>>>>>> >>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process. >>>>>>> >>>>>>> Thoughts? >>>>>> >>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch. >>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc >>>>>> controls everything else. Now we're just riffing on the details: memfd vs shm_open, spelling >>>>>> of options and words to describe them. >>>>> >>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on". >>>> >>>> Hi David and Peter, >>>> >>>> I have implemented and tested the following, for both qemu_memfd_create >>>> and qemu_shm_alloc. This is pseudo-code, with error conditions omitted >>>> for simplicity. >>>> >>>> Any comments before I submit a complete patch? >>>> >>>> ---- >>>> qemu-options.hx: >>>> ``aux-ram-share=on|off`` >>>> Allocate auxiliary guest RAM as an anonymous file that is >>>> shareable with an external process. This option applies to >>>> memory allocated as a side effect of creating various devices. >>>> It does not apply to memory-backend-objects, whether explicitly >>>> specified on the command line, or implicitly created by the -m >>>> command line option. >>>> >>>> Some migration modes require aux-ram-share=on. >>>> >>>> qapi/migration.json: >>>> @cpr-transfer: >>>> ... >>>> Memory-backend objects must have the share=on attribute, but >>>> memory-backend-epc is not supported. The VM must be started >>>> with the '-machine aux-ram-share=on' option. >>>> >>>> Define RAM_PRIVATE >>>> >>>> Define qemu_shm_alloc(), from David's tmp patch >>>> >>>> ram_backend_memory_alloc() >>>> ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE; >>>> memory_region_init_ram_flags_nomigrate(ram_flags) >>>> >>>> qemu_ram_alloc_internal() >>>> ... >>>> if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share) >>>> new_block->flags |= RAM_SHARED; >>>> >>>> if (!host && (new_block->flags & RAM_SHARED)) { >>>> qemu_ram_alloc_shared(new_block); >>>> } else >>>> new_block->fd = -1; >>>> new_block->host = host; >>>> } >>>> ram_block_add(new_block); >>>> >>>> qemu_ram_alloc_shared() >>>> if qemu_memfd_check() >>>> new_block->fd = qemu_memfd_create() >>>> else >>>> new_block->fd = qemu_shm_alloc() >>> >>> Yes, that way "memory-backend-ram,share=on" will just mean "give me the best shared memory for RAM to be shared with other processes, I don't care about the details", and it will work on Linux kernels even before we had memfds. >>> >>> memory-backend-ram should be available on all architectures, and under Windows. qemu_anon_ram_alloc() under Linux just does nothing special, not even bail out. >>> >>> MAP_SHARED|MAP_ANON was always weird, because it meant "give me memory I can share only with subprocesses", but then, *there are not subprocesses for QEMU*. I recall there was a trick to obtain the fd under Linux for these regions using /proc/self/fd/, but it's very Linux specific ... >>> >>> So nobody would *actually* use that shared memory and it was only a hack for RDMA. Now we can do better. >>> >>> >>> We'll have to decide if we simply fallback to qemu_anon_ram_alloc() if no shared memory can be created (unavailable), like we do on Windows. >>> >>> So maybe something like >>> >>> qemu_ram_alloc_shared() >>> fd = -1; >>> >>> if (qemu_memfd_avilable()) { >>> fd = qemu_memfd_create(); >>> if (fd < 0) >>> ... error >>> } else if (qemu_shm_available()) >>> fd = qemu_shm_alloc(); >>> if (fd < 0) >>> ... error >>> } else { >>> /* >>> * Old behavior: try fd-less shared memory. We might >>> * just end up with non-shared memory on Windows, but >>> * nobody can make sure of this shared memory either way >>> * ... should we just use non-shared memory? Or should >>> * we simply bail out? But then, if there is no shared >>> * memory nobody could possible use it. >>> */ >>> qemu_anon_ram_alloc(share=true) >>> } >> >> Good catch. We need that fallback for backwards compatibility. Even with >> no use case for memory-backend-ram,share=on since the demise of rdma, users >> may specify it on windows, for no particular reason, but it works, and should >> continue to work after this series. CPR would be blocked. > > Yes, we should keep Windows working in the weird way it is working right now. > > > > More generally for backwards compatibility for share=on for no particular reason, >> should we fallback if qemu_shm_alloc fails? If /dev/shm is mounted with default >> options and more than half of ram is requested, it will fail, whereas current qemu >> succeeds using MAP_SHARED|MAP_ANON. > > Only on Linux without memfd, of course. Maybe we should just warn when qemu_shm_alloc() fails (and comment that we continue for compat reasons only) and fallback to the stupid qemu_anon_ram_alloc(share=true). We could implement a fallback to shmget() but ... let's not go down that path. > > But we should not fallback to qemu_shm_alloc()/MAP_SHARED|MAP_ANON if memfd is available and that allocating the memfd failed. Failing to allocate a memfd might highlight a bigger problem. Agreed on all. One more opinion from you please, if you will. RAM_PRIVATE is only checked in qemu_ram_alloc_internal, and only needs to be set in ram_backend_memory_alloc -> ... -> qemu_ram_alloc_internal None of the other backends reach qemu_ram_alloc_internal. To be future proof, do you prefer I also set MAP_PRIVATE in the other backends, everywhere MAP_SHARED may be set, eg: file_backend_memory_alloc() ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE; Or omit RAM_PRIVATE in those cases where it will not be checked, to minimize exposure of this new flag? - Steve
On Thu, Nov 07, 2024 at 05:38:26PM +0100, David Hildenbrand wrote: > > David: why do we need to drop PRIVATE in ramblock flags? I thought it was > > pretty harmless. I suppose things like qemu_ram_is_shared() will even keep > > working as before? > > > > It looks ok to remove it too, but it adds logics that doesn't seem > > necessary to me, so just to double check if I missed something.. > > A finished ramblock is only boolean "shared" vs. "not shared/private". A > single flag (RAM_SHARED) can express that clearly. > > Consequently there is less to get wrong when using RAM_PRIVATE only as a > flag to the creation function (and documenting that!). > > To make RAM_PRIVATE consistent we might have to tweak all other RAMBlock > creation functions to set RAM_PRIVATE in the !RAM_SHARED case, and I don't > think that is wroth the trouble. Yeah, I actually prefer PRIVATE to be applied everywhere, and assert that either SHARED|PRIVATE be set in all ramblocks. But no strong opinions, if both of you like it only applied optionally, I already lost the vote. But yes, please in that case extremely carefully document PRIVATE as it can be very tricky now. PS: when I think about how to document that, I really, really hoped we simply apply it to all.. Thanks,
On 11/7/2024 11:19 AM, David Hildenbrand wrote: > On 07.11.24 15:04, Steven Sistare wrote: >> On 11/7/2024 8:05 AM, David Hildenbrand wrote: [...] >> Do you plan to submit the part of your "tmp" patch that refactors >> shm_backend_memory_alloc and defines qemu_shm_alloc? If you want, >> I could include it in my series, with your Signed-off-by. > > My patch went a bit too far I think. And would not work on win32 :) > > We should probably start with this: > > From 124920aeda2756faa104bfa6e934c7c20b1fbbe9 Mon Sep 17 00:00:00 2001 > From: David Hildenbrand <david@redhat.com> > Date: Mon, 4 Nov 2024 11:29:22 +0100 > Subject: [PATCH] backends/hostmem-shm: factor out allocation of "anonymous > shared memory with an fd" > > Let's factor it out so we can reuse it. > > Signed-off-by: David Hildenbrand <david@redhat.com> > --- > backends/hostmem-shm.c | 45 ++++------------------------------- > include/qemu/osdep.h | 1 + > system/physmem.c | 2 +- > util/oslib-posix.c | 53 ++++++++++++++++++++++++++++++++++++++++++ > util/oslib-win32.c | 6 +++++ > 5 files changed, 65 insertions(+), 42 deletions(-) [...] Thanks, I see what you fixed. FYI I deleted this remnant from "tmp": > @@ -2089,7 +2089,7 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size, > new_block->page_size = qemu_real_host_page_size(); > new_block->host = host; > new_block->flags = ram_flags; > - ram_block_add(new_block, &local_err); > + and I added this so all programs that link with libqemuutil also link with rt, which defines shm_open: diff --git a/meson.build b/meson.build index e324c49..aa54535 100644 --- a/meson.build +++ b/meson.build @@ -3604,9 +3604,13 @@ libqemuutil = static_library('qemuutil', build_by_default: false, sources: util_ss.sources() + stub_ss.sources() + genh, dependencies: [util_ss.dependencies(), libm, threads, glib, socket, malloc]) +qemuutil_deps = [event_loop_base] +if host_os != 'windows' + qemuutil_deps += [rt] +endif qemuutil = declare_dependency(link_with: libqemuutil, sources: genh + version_res, - dependencies: [event_loop_base]) + dependencies: qemuutil_deps) - Steve
On 07.11.24 17:40, Steven Sistare wrote: > On 11/7/2024 11:26 AM, David Hildenbrand wrote: >> On 07.11.24 17:02, Steven Sistare wrote: >>> On 11/7/2024 8:23 AM, David Hildenbrand wrote: >>>> On 06.11.24 21:12, Steven Sistare wrote: >>>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote: >>>>>> On 04.11.24 21:56, Steven Sistare wrote: >>>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote: >>>>>>>> On 04.11.24 20:51, David Hildenbrand wrote: >>>>>>>>> On 04.11.24 18:38, Steven Sistare wrote: >>>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote: >>>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote: >>>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending >>>>>>>>>>>> on the value of the anon-alloc machine property. This option applies to >>>>>>>>>>>> memory allocated as a side effect of creating various devices. It does >>>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on >>>>>>>>>>>> the command line, or implicitly created by the -m command line option. >>>>>>>>>>>> >>>>>>>>>>>> The memfd option is intended to support new migration modes, in which the >>>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending >>>>>>>>>>>> the memfd file descriptor to the process. Memory contents are preserved, >>>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are >>>>>>>>>>>> locked in memory for DMA remain locked. This behavior is a pre-requisite >>>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes. >>>>>>>>>>> >>>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm, >>>>>>>>>>> similar to backends/hostmem-shm.c. >>>>>>>>>>> >>>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the >>>>>>>>>>> details. See below. >>>>>>>>>> >>>>>>>>>> For this series I would prefer to use memfd and hide the details. It's a >>>>>>>>>> concise (and well tested) solution albeit linux only. The code you supply >>>>>>>>>> for posix shm would be a good follow on patch to support other unices. >>>>>>>>> >>>>>>>>> Unless there is reason to use memfd we should start with the more >>>>>>>>> generic POSIX variant that is available even on systems without memfd. >>>>>>>>> Factoring stuff out as I drafted does look quite compelling. >>>>>>>>> >>>>>>>>> I can help with the rework, and send it out separately, so you can focus >>>>>>>>> on the "machine toggle" as part of this series. >>>>>>>>> >>>>>>>>> Of course, if we find out we need the memfd internally instead under >>>>>>>>> Linux for whatever reason later, we can use that instead. >>>>>>>>> >>>>>>>>> But IIUC, the main selling point for memfd are additional features >>>>>>>>> (hugetlb, memory sealing) that you aren't even using. >>>>>>>> >>>>>>>> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount). >>>>>>> >>>>>>> Sizing is a non-trivial difference. One can by default allocate all memory using memfd_create. >>>>>>> To do so using shm_open requires configuration on the mount. One step harder to use. >>>>>> >>>>>> Yes. >>>>>> >>>>>>> >>>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM >>>>>>> if memory-backend-ram has hogged all the memory. >>>>>>> >>>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open(). >>>>>>> >>>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM >>>>>>> -- memfd if available and fallback to shm_open. >>>>>> >>>>>> Yes. >>>>>> >>>>>>> >>>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like >>>>>>>> >>>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on" >>>>>>>> >>>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process. >>>>>>>> >>>>>>>> Thoughts? >>>>>>> >>>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch. >>>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc >>>>>>> controls everything else. Now we're just riffing on the details: memfd vs shm_open, spelling >>>>>>> of options and words to describe them. >>>>>> >>>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on". >>>>> >>>>> Hi David and Peter, >>>>> >>>>> I have implemented and tested the following, for both qemu_memfd_create >>>>> and qemu_shm_alloc. This is pseudo-code, with error conditions omitted >>>>> for simplicity. >>>>> >>>>> Any comments before I submit a complete patch? >>>>> >>>>> ---- >>>>> qemu-options.hx: >>>>> ``aux-ram-share=on|off`` >>>>> Allocate auxiliary guest RAM as an anonymous file that is >>>>> shareable with an external process. This option applies to >>>>> memory allocated as a side effect of creating various devices. >>>>> It does not apply to memory-backend-objects, whether explicitly >>>>> specified on the command line, or implicitly created by the -m >>>>> command line option. >>>>> >>>>> Some migration modes require aux-ram-share=on. >>>>> >>>>> qapi/migration.json: >>>>> @cpr-transfer: >>>>> ... >>>>> Memory-backend objects must have the share=on attribute, but >>>>> memory-backend-epc is not supported. The VM must be started >>>>> with the '-machine aux-ram-share=on' option. >>>>> >>>>> Define RAM_PRIVATE >>>>> >>>>> Define qemu_shm_alloc(), from David's tmp patch >>>>> >>>>> ram_backend_memory_alloc() >>>>> ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE; >>>>> memory_region_init_ram_flags_nomigrate(ram_flags) >>>>> >>>>> qemu_ram_alloc_internal() >>>>> ... >>>>> if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share) >>>>> new_block->flags |= RAM_SHARED; >>>>> >>>>> if (!host && (new_block->flags & RAM_SHARED)) { >>>>> qemu_ram_alloc_shared(new_block); >>>>> } else >>>>> new_block->fd = -1; >>>>> new_block->host = host; >>>>> } >>>>> ram_block_add(new_block); >>>>> >>>>> qemu_ram_alloc_shared() >>>>> if qemu_memfd_check() >>>>> new_block->fd = qemu_memfd_create() >>>>> else >>>>> new_block->fd = qemu_shm_alloc() >>>> >>>> Yes, that way "memory-backend-ram,share=on" will just mean "give me the best shared memory for RAM to be shared with other processes, I don't care about the details", and it will work on Linux kernels even before we had memfds. >>>> >>>> memory-backend-ram should be available on all architectures, and under Windows. qemu_anon_ram_alloc() under Linux just does nothing special, not even bail out. >>>> >>>> MAP_SHARED|MAP_ANON was always weird, because it meant "give me memory I can share only with subprocesses", but then, *there are not subprocesses for QEMU*. I recall there was a trick to obtain the fd under Linux for these regions using /proc/self/fd/, but it's very Linux specific ... >>>> >>>> So nobody would *actually* use that shared memory and it was only a hack for RDMA. Now we can do better. >>>> >>>> >>>> We'll have to decide if we simply fallback to qemu_anon_ram_alloc() if no shared memory can be created (unavailable), like we do on Windows. >>>> >>>> So maybe something like >>>> >>>> qemu_ram_alloc_shared() >>>> fd = -1; >>>> >>>> if (qemu_memfd_avilable()) { >>>> fd = qemu_memfd_create(); >>>> if (fd < 0) >>>> ... error >>>> } else if (qemu_shm_available()) >>>> fd = qemu_shm_alloc(); >>>> if (fd < 0) >>>> ... error >>>> } else { >>>> /* >>>> * Old behavior: try fd-less shared memory. We might >>>> * just end up with non-shared memory on Windows, but >>>> * nobody can make sure of this shared memory either way >>>> * ... should we just use non-shared memory? Or should >>>> * we simply bail out? But then, if there is no shared >>>> * memory nobody could possible use it. >>>> */ >>>> qemu_anon_ram_alloc(share=true) >>>> } >>> >>> Good catch. We need that fallback for backwards compatibility. Even with >>> no use case for memory-backend-ram,share=on since the demise of rdma, users >>> may specify it on windows, for no particular reason, but it works, and should >>> continue to work after this series. CPR would be blocked. >> >> Yes, we should keep Windows working in the weird way it is working right now. >> >> > > More generally for backwards compatibility for share=on for no particular reason, >>> should we fallback if qemu_shm_alloc fails? If /dev/shm is mounted with default >>> options and more than half of ram is requested, it will fail, whereas current qemu >>> succeeds using MAP_SHARED|MAP_ANON. >> >> Only on Linux without memfd, of course. Maybe we should just warn when qemu_shm_alloc() fails (and comment that we continue for compat reasons only) and fallback to the stupid qemu_anon_ram_alloc(share=true). We could implement a fallback to shmget() but ... let's not go down that path. >> >> But we should not fallback to qemu_shm_alloc()/MAP_SHARED|MAP_ANON if memfd is available and that allocating the memfd failed. Failing to allocate a memfd might highlight a bigger problem. > > Agreed on all. > > One more opinion from you please, if you will. > > RAM_PRIVATE is only checked in qemu_ram_alloc_internal, and only needs to be > set in > ram_backend_memory_alloc -> ... -> qemu_ram_alloc_internal > > None of the other backends reach qemu_ram_alloc_internal. > > To be future proof, do you prefer I also set MAP_PRIVATE in the other backends, > everywhere MAP_SHARED may be set, eg: Hm, I think then we should set RAM_PRIVATE really everywhere where we'd want it and relied on !RAM_SHARED doing the right thing. Alternatively, we make our life easier and do something like /* * This flag is only used while creating+allocating RAM, and * prevents RAM_SHARED getting set for anonymous RAM automatically in * some configurations. * * By default, not setting RAM_SHARED on anonymous RAM implies * "private anonymous RAM"; however, in some configuration we want to * have most of this RAM automatically be "sharable anonymous RAM", * except for some cases that really want "private anonymous RAM". * * This anonymous RAM *must* be private. This flag only applies to * "anonymous" RAM, not fd/file-backed/preallocated one. */ RAM_FORCE_ANON_PRIVATE (1 << 13) BUT maybe an even better alternative now that we have the "aux-ram-share" parameter, could we use /* * Auxiliary RAM that was created automatically internally, instead of * explicitly like using memory-backend-ram or some other device on the * QEMU cmdline. */ RAM_AUX (1 << 13) So it will be quite clear that "aux-ram-share" only applies to RAM_AUX RAMBlocks. That actually looks quite compelling to me :)
On Fri, Nov 08, 2024 at 12:31:45PM +0100, David Hildenbrand wrote: > On 07.11.24 17:40, Steven Sistare wrote: > > On 11/7/2024 11:26 AM, David Hildenbrand wrote: > > > On 07.11.24 17:02, Steven Sistare wrote: > > > > On 11/7/2024 8:23 AM, David Hildenbrand wrote: > > > > > On 06.11.24 21:12, Steven Sistare wrote: > > > > > > On 11/4/2024 4:36 PM, David Hildenbrand wrote: > > > > > > > On 04.11.24 21:56, Steven Sistare wrote: > > > > > > > > On 11/4/2024 3:15 PM, David Hildenbrand wrote: > > > > > > > > > On 04.11.24 20:51, David Hildenbrand wrote: > > > > > > > > > > On 04.11.24 18:38, Steven Sistare wrote: > > > > > > > > > > > On 11/4/2024 5:39 AM, David Hildenbrand wrote: > > > > > > > > > > > > On 01.11.24 14:47, Steve Sistare wrote: > > > > > > > > > > > > > Allocate anonymous memory using mmap MAP_ANON or memfd_create depending > > > > > > > > > > > > > on the value of the anon-alloc machine property. This option applies to > > > > > > > > > > > > > memory allocated as a side effect of creating various devices. It does > > > > > > > > > > > > > not apply to memory-backend-objects, whether explicitly specified on > > > > > > > > > > > > > the command line, or implicitly created by the -m command line option. > > > > > > > > > > > > > > > > > > > > > > > > > > The memfd option is intended to support new migration modes, in which the > > > > > > > > > > > > > memory region can be transferred in place to a new QEMU process, by sending > > > > > > > > > > > > > the memfd file descriptor to the process. Memory contents are preserved, > > > > > > > > > > > > > and if the mode also transfers device descriptors, then pages that are > > > > > > > > > > > > > locked in memory for DMA remain locked. This behavior is a pre-requisite > > > > > > > > > > > > > for supporting vfio, vdpa, and iommufd devices with the new modes. > > > > > > > > > > > > > > > > > > > > > > > > A more portable, non-Linux specific variant of this will be using shm, > > > > > > > > > > > > similar to backends/hostmem-shm.c. > > > > > > > > > > > > > > > > > > > > > > > > Likely we should be using that instead of memfd, or try hiding the > > > > > > > > > > > > details. See below. > > > > > > > > > > > > > > > > > > > > > > For this series I would prefer to use memfd and hide the details. It's a > > > > > > > > > > > concise (and well tested) solution albeit linux only. The code you supply > > > > > > > > > > > for posix shm would be a good follow on patch to support other unices. > > > > > > > > > > > > > > > > > > > > Unless there is reason to use memfd we should start with the more > > > > > > > > > > generic POSIX variant that is available even on systems without memfd. > > > > > > > > > > Factoring stuff out as I drafted does look quite compelling. > > > > > > > > > > > > > > > > > > > > I can help with the rework, and send it out separately, so you can focus > > > > > > > > > > on the "machine toggle" as part of this series. > > > > > > > > > > > > > > > > > > > > Of course, if we find out we need the memfd internally instead under > > > > > > > > > > Linux for whatever reason later, we can use that instead. > > > > > > > > > > > > > > > > > > > > But IIUC, the main selling point for memfd are additional features > > > > > > > > > > (hugetlb, memory sealing) that you aren't even using. > > > > > > > > > > > > > > > > > > FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount). > > > > > > > > > > > > > > > > Sizing is a non-trivial difference. One can by default allocate all memory using memfd_create. > > > > > > > > To do so using shm_open requires configuration on the mount. One step harder to use. > > > > > > > > > > > > > > Yes. > > > > > > > > > > > > > > > > > > > > > > > This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM > > > > > > > > if memory-backend-ram has hogged all the memory. > > > > > > > > > > > > > > > > > Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open(). > > > > > > > > > > > > > > > > Yes, and if that is a good idea, then the same should be done for internal RAM > > > > > > > > -- memfd if available and fallback to shm_open. > > > > > > > > > > > > > > Yes. > > > > > > > > > > > > > > > > > > > > > > > > I'm hoping we can find a way where it just all is rather intuitive, like > > > > > > > > > > > > > > > > > > "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on" > > > > > > > > > > > > > > > > > > "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process. > > > > > > > > > > > > > > > > > > Thoughts? > > > > > > > > > > > > > > > > Agreed, though I thought I had already landed at the intuitive specification in my patch. > > > > > > > > The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc > > > > > > > > controls everything else. Now we're just riffing on the details: memfd vs shm_open, spelling > > > > > > > > of options and words to describe them. > > > > > > > > > > > > > > Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on". > > > > > > > > > > > > Hi David and Peter, > > > > > > > > > > > > I have implemented and tested the following, for both qemu_memfd_create > > > > > > and qemu_shm_alloc. This is pseudo-code, with error conditions omitted > > > > > > for simplicity. > > > > > > > > > > > > Any comments before I submit a complete patch? > > > > > > > > > > > > ---- > > > > > > qemu-options.hx: > > > > > > ``aux-ram-share=on|off`` > > > > > > Allocate auxiliary guest RAM as an anonymous file that is > > > > > > shareable with an external process. This option applies to > > > > > > memory allocated as a side effect of creating various devices. > > > > > > It does not apply to memory-backend-objects, whether explicitly > > > > > > specified on the command line, or implicitly created by the -m > > > > > > command line option. > > > > > > > > > > > > Some migration modes require aux-ram-share=on. > > > > > > > > > > > > qapi/migration.json: > > > > > > @cpr-transfer: > > > > > > ... > > > > > > Memory-backend objects must have the share=on attribute, but > > > > > > memory-backend-epc is not supported. The VM must be started > > > > > > with the '-machine aux-ram-share=on' option. > > > > > > > > > > > > Define RAM_PRIVATE > > > > > > > > > > > > Define qemu_shm_alloc(), from David's tmp patch > > > > > > > > > > > > ram_backend_memory_alloc() > > > > > > ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE; > > > > > > memory_region_init_ram_flags_nomigrate(ram_flags) > > > > > > > > > > > > qemu_ram_alloc_internal() > > > > > > ... > > > > > > if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share) > > > > > > new_block->flags |= RAM_SHARED; > > > > > > > > > > > > if (!host && (new_block->flags & RAM_SHARED)) { > > > > > > qemu_ram_alloc_shared(new_block); > > > > > > } else > > > > > > new_block->fd = -1; > > > > > > new_block->host = host; > > > > > > } > > > > > > ram_block_add(new_block); > > > > > > > > > > > > qemu_ram_alloc_shared() > > > > > > if qemu_memfd_check() > > > > > > new_block->fd = qemu_memfd_create() > > > > > > else > > > > > > new_block->fd = qemu_shm_alloc() > > > > > > > > > > Yes, that way "memory-backend-ram,share=on" will just mean "give me the best shared memory for RAM to be shared with other processes, I don't care about the details", and it will work on Linux kernels even before we had memfds. > > > > > > > > > > memory-backend-ram should be available on all architectures, and under Windows. qemu_anon_ram_alloc() under Linux just does nothing special, not even bail out. > > > > > > > > > > MAP_SHARED|MAP_ANON was always weird, because it meant "give me memory I can share only with subprocesses", but then, *there are not subprocesses for QEMU*. I recall there was a trick to obtain the fd under Linux for these regions using /proc/self/fd/, but it's very Linux specific ... > > > > > > > > > > So nobody would *actually* use that shared memory and it was only a hack for RDMA. Now we can do better. > > > > > > > > > > > > > > > We'll have to decide if we simply fallback to qemu_anon_ram_alloc() if no shared memory can be created (unavailable), like we do on Windows. > > > > > > > > > > So maybe something like > > > > > > > > > > qemu_ram_alloc_shared() > > > > > fd = -1; > > > > > > > > > > if (qemu_memfd_avilable()) { > > > > > fd = qemu_memfd_create(); > > > > > if (fd < 0) > > > > > ... error > > > > > } else if (qemu_shm_available()) > > > > > fd = qemu_shm_alloc(); > > > > > if (fd < 0) > > > > > ... error > > > > > } else { > > > > > /* > > > > > * Old behavior: try fd-less shared memory. We might > > > > > * just end up with non-shared memory on Windows, but > > > > > * nobody can make sure of this shared memory either way > > > > > * ... should we just use non-shared memory? Or should > > > > > * we simply bail out? But then, if there is no shared > > > > > * memory nobody could possible use it. > > > > > */ > > > > > qemu_anon_ram_alloc(share=true) > > > > > } > > > > > > > > Good catch. We need that fallback for backwards compatibility. Even with > > > > no use case for memory-backend-ram,share=on since the demise of rdma, users > > > > may specify it on windows, for no particular reason, but it works, and should > > > > continue to work after this series. CPR would be blocked. > > > > > > Yes, we should keep Windows working in the weird way it is working right now. > > > > > > > > More generally for backwards compatibility for share=on for no particular reason, > > > > should we fallback if qemu_shm_alloc fails? If /dev/shm is mounted with default > > > > options and more than half of ram is requested, it will fail, whereas current qemu > > > > succeeds using MAP_SHARED|MAP_ANON. > > > > > > Only on Linux without memfd, of course. Maybe we should just warn when qemu_shm_alloc() fails (and comment that we continue for compat reasons only) and fallback to the stupid qemu_anon_ram_alloc(share=true). We could implement a fallback to shmget() but ... let's not go down that path. > > > > > > But we should not fallback to qemu_shm_alloc()/MAP_SHARED|MAP_ANON if memfd is available and that allocating the memfd failed. Failing to allocate a memfd might highlight a bigger problem. > > > > Agreed on all. > > > > One more opinion from you please, if you will. > > > > RAM_PRIVATE is only checked in qemu_ram_alloc_internal, and only needs to be > > set in > > ram_backend_memory_alloc -> ... -> qemu_ram_alloc_internal > > > > None of the other backends reach qemu_ram_alloc_internal. > > > > To be future proof, do you prefer I also set MAP_PRIVATE in the other backends, > > everywhere MAP_SHARED may be set, eg: > > Hm, I think then we should set RAM_PRIVATE really everywhere where we'd want > it and relied on !RAM_SHARED doing the right thing. > > Alternatively, we make our life easier and do something like > > /* > * This flag is only used while creating+allocating RAM, and > * prevents RAM_SHARED getting set for anonymous RAM automatically in > * some configurations. > * > * By default, not setting RAM_SHARED on anonymous RAM implies > * "private anonymous RAM"; however, in some configuration we want to > * have most of this RAM automatically be "sharable anonymous RAM", > * except for some cases that really want "private anonymous RAM". > * > * This anonymous RAM *must* be private. This flag only applies to > * "anonymous" RAM, not fd/file-backed/preallocated one. > */ > RAM_FORCE_ANON_PRIVATE (1 << 13) > > > BUT maybe an even better alternative now that we have the "aux-ram-share" > parameter, could we use > > /* > * Auxiliary RAM that was created automatically internally, instead of > * explicitly like using memory-backend-ram or some other device on the > * QEMU cmdline. > */ > RAM_AUX (1 << 13) > > > So it will be quite clear that "aux-ram-share" only applies to RAM_AUX > RAMBlocks. > > That actually looks quite compelling to me :) Could anyone remind me why we can't simply set PRIVATE|SHARED all over the place? IMHO RAM_AUX is too hard for any new callers to know how to set. It's much easier when we already have SHARED, adding PRIVATE could be mostly natural, then we can already avoid AUX due to checking !SHARED & !PRIVATE. Basically, SHARED|PRIVATE then must come from an user request (QMP or cmdline), otherwise the caller should always set none of them, implying aux. It still looks the best to me. Thanks,
On 11/8/2024 6:31 AM, David Hildenbrand wrote: > On 07.11.24 17:40, Steven Sistare wrote: >> On 11/7/2024 11:26 AM, David Hildenbrand wrote: >>> On 07.11.24 17:02, Steven Sistare wrote: >>>> On 11/7/2024 8:23 AM, David Hildenbrand wrote: >>>>> On 06.11.24 21:12, Steven Sistare wrote: >>>>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote: >>>>>>> On 04.11.24 21:56, Steven Sistare wrote: >>>>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote: >>>>>>>>> On 04.11.24 20:51, David Hildenbrand wrote: >>>>>>>>>> On 04.11.24 18:38, Steven Sistare wrote: >>>>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote: >>>>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote: >>>>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending >>>>>>>>>>>>> on the value of the anon-alloc machine property. This option applies to >>>>>>>>>>>>> memory allocated as a side effect of creating various devices. It does >>>>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on >>>>>>>>>>>>> the command line, or implicitly created by the -m command line option. >>>>>>>>>>>>> >>>>>>>>>>>>> The memfd option is intended to support new migration modes, in which the >>>>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending >>>>>>>>>>>>> the memfd file descriptor to the process. Memory contents are preserved, >>>>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are >>>>>>>>>>>>> locked in memory for DMA remain locked. This behavior is a pre-requisite >>>>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes. >>>>>>>>>>>> >>>>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm, >>>>>>>>>>>> similar to backends/hostmem-shm.c. >>>>>>>>>>>> >>>>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the >>>>>>>>>>>> details. See below. >>>>>>>>>>> >>>>>>>>>>> For this series I would prefer to use memfd and hide the details. It's a >>>>>>>>>>> concise (and well tested) solution albeit linux only. The code you supply >>>>>>>>>>> for posix shm would be a good follow on patch to support other unices. >>>>>>>>>> >>>>>>>>>> Unless there is reason to use memfd we should start with the more >>>>>>>>>> generic POSIX variant that is available even on systems without memfd. >>>>>>>>>> Factoring stuff out as I drafted does look quite compelling. >>>>>>>>>> >>>>>>>>>> I can help with the rework, and send it out separately, so you can focus >>>>>>>>>> on the "machine toggle" as part of this series. >>>>>>>>>> >>>>>>>>>> Of course, if we find out we need the memfd internally instead under >>>>>>>>>> Linux for whatever reason later, we can use that instead. >>>>>>>>>> >>>>>>>>>> But IIUC, the main selling point for memfd are additional features >>>>>>>>>> (hugetlb, memory sealing) that you aren't even using. >>>>>>>>> >>>>>>>>> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount). >>>>>>>> >>>>>>>> Sizing is a non-trivial difference. One can by default allocate all memory using memfd_create. >>>>>>>> To do so using shm_open requires configuration on the mount. One step harder to use. >>>>>>> >>>>>>> Yes. >>>>>>> >>>>>>>> >>>>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM >>>>>>>> if memory-backend-ram has hogged all the memory. >>>>>>>> >>>>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open(). >>>>>>>> >>>>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM >>>>>>>> -- memfd if available and fallback to shm_open. >>>>>>> >>>>>>> Yes. >>>>>>> >>>>>>>> >>>>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like >>>>>>>>> >>>>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on" >>>>>>>>> >>>>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process. >>>>>>>>> >>>>>>>>> Thoughts? >>>>>>>> >>>>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch. >>>>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc >>>>>>>> controls everything else. Now we're just riffing on the details: memfd vs shm_open, spelling >>>>>>>> of options and words to describe them. >>>>>>> >>>>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on". >>>>>> >>>>>> Hi David and Peter, >>>>>> >>>>>> I have implemented and tested the following, for both qemu_memfd_create >>>>>> and qemu_shm_alloc. This is pseudo-code, with error conditions omitted >>>>>> for simplicity. >>>>>> >>>>>> Any comments before I submit a complete patch? >>>>>> >>>>>> ---- >>>>>> qemu-options.hx: >>>>>> ``aux-ram-share=on|off`` >>>>>> Allocate auxiliary guest RAM as an anonymous file that is >>>>>> shareable with an external process. This option applies to >>>>>> memory allocated as a side effect of creating various devices. >>>>>> It does not apply to memory-backend-objects, whether explicitly >>>>>> specified on the command line, or implicitly created by the -m >>>>>> command line option. >>>>>> >>>>>> Some migration modes require aux-ram-share=on. >>>>>> >>>>>> qapi/migration.json: >>>>>> @cpr-transfer: >>>>>> ... >>>>>> Memory-backend objects must have the share=on attribute, but >>>>>> memory-backend-epc is not supported. The VM must be started >>>>>> with the '-machine aux-ram-share=on' option. >>>>>> >>>>>> Define RAM_PRIVATE >>>>>> >>>>>> Define qemu_shm_alloc(), from David's tmp patch >>>>>> >>>>>> ram_backend_memory_alloc() >>>>>> ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE; >>>>>> memory_region_init_ram_flags_nomigrate(ram_flags) >>>>>> >>>>>> qemu_ram_alloc_internal() >>>>>> ... >>>>>> if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share) >>>>>> new_block->flags |= RAM_SHARED; >>>>>> >>>>>> if (!host && (new_block->flags & RAM_SHARED)) { >>>>>> qemu_ram_alloc_shared(new_block); >>>>>> } else >>>>>> new_block->fd = -1; >>>>>> new_block->host = host; >>>>>> } >>>>>> ram_block_add(new_block); >>>>>> >>>>>> qemu_ram_alloc_shared() >>>>>> if qemu_memfd_check() >>>>>> new_block->fd = qemu_memfd_create() >>>>>> else >>>>>> new_block->fd = qemu_shm_alloc() >>>>> >>>>> Yes, that way "memory-backend-ram,share=on" will just mean "give me the best shared memory for RAM to be shared with other processes, I don't care about the details", and it will work on Linux kernels even before we had memfds. >>>>> >>>>> memory-backend-ram should be available on all architectures, and under Windows. qemu_anon_ram_alloc() under Linux just does nothing special, not even bail out. >>>>> >>>>> MAP_SHARED|MAP_ANON was always weird, because it meant "give me memory I can share only with subprocesses", but then, *there are not subprocesses for QEMU*. I recall there was a trick to obtain the fd under Linux for these regions using /proc/self/fd/, but it's very Linux specific ... >>>>> >>>>> So nobody would *actually* use that shared memory and it was only a hack for RDMA. Now we can do better. >>>>> >>>>> >>>>> We'll have to decide if we simply fallback to qemu_anon_ram_alloc() if no shared memory can be created (unavailable), like we do on Windows. >>>>> >>>>> So maybe something like >>>>> >>>>> qemu_ram_alloc_shared() >>>>> fd = -1; >>>>> >>>>> if (qemu_memfd_avilable()) { >>>>> fd = qemu_memfd_create(); >>>>> if (fd < 0) >>>>> ... error >>>>> } else if (qemu_shm_available()) >>>>> fd = qemu_shm_alloc(); >>>>> if (fd < 0) >>>>> ... error >>>>> } else { >>>>> /* >>>>> * Old behavior: try fd-less shared memory. We might >>>>> * just end up with non-shared memory on Windows, but >>>>> * nobody can make sure of this shared memory either way >>>>> * ... should we just use non-shared memory? Or should >>>>> * we simply bail out? But then, if there is no shared >>>>> * memory nobody could possible use it. >>>>> */ >>>>> qemu_anon_ram_alloc(share=true) >>>>> } >>>> >>>> Good catch. We need that fallback for backwards compatibility. Even with >>>> no use case for memory-backend-ram,share=on since the demise of rdma, users >>>> may specify it on windows, for no particular reason, but it works, and should >>>> continue to work after this series. CPR would be blocked. >>> >>> Yes, we should keep Windows working in the weird way it is working right now. >>> >>> > > More generally for backwards compatibility for share=on for no particular reason, >>>> should we fallback if qemu_shm_alloc fails? If /dev/shm is mounted with default >>>> options and more than half of ram is requested, it will fail, whereas current qemu >>>> succeeds using MAP_SHARED|MAP_ANON. >>> >>> Only on Linux without memfd, of course. Maybe we should just warn when qemu_shm_alloc() fails (and comment that we continue for compat reasons only) and fallback to the stupid qemu_anon_ram_alloc(share=true). We could implement a fallback to shmget() but ... let's not go down that path. >>> >>> But we should not fallback to qemu_shm_alloc()/MAP_SHARED|MAP_ANON if memfd is available and that allocating the memfd failed. Failing to allocate a memfd might highlight a bigger problem. >> >> Agreed on all. >> >> One more opinion from you please, if you will. >> >> RAM_PRIVATE is only checked in qemu_ram_alloc_internal, and only needs to be >> set in >> ram_backend_memory_alloc -> ... -> qemu_ram_alloc_internal >> >> None of the other backends reach qemu_ram_alloc_internal. >> >> To be future proof, do you prefer I also set MAP_PRIVATE in the other backends, >> everywhere MAP_SHARED may be set, eg: > > Hm, I think then we should set RAM_PRIVATE really everywhere where we'd want it and relied on !RAM_SHARED doing the right thing. > > Alternatively, we make our life easier and do something like > > /* > * This flag is only used while creating+allocating RAM, and > * prevents RAM_SHARED getting set for anonymous RAM automatically in > * some configurations. > * > * By default, not setting RAM_SHARED on anonymous RAM implies > * "private anonymous RAM"; however, in some configuration we want to > * have most of this RAM automatically be "sharable anonymous RAM", > * except for some cases that really want "private anonymous RAM". > * > * This anonymous RAM *must* be private. This flag only applies to > * "anonymous" RAM, not fd/file-backed/preallocated one. > */ > RAM_FORCE_ANON_PRIVATE (1 << 13) > > > BUT maybe an even better alternative now that we have the "aux-ram-share" parameter, could we use > > /* > * Auxiliary RAM that was created automatically internally, instead of > * explicitly like using memory-backend-ram or some other device on the > * QEMU cmdline. > */ > RAM_AUX (1 << 13) > > > So it will be quite clear that "aux-ram-share" only applies to RAM_AUX RAMBlocks. > > That actually looks quite compelling to me :) Agreed, RAM_AUX is a clear solution. I would set it in these functions: qemu_ram_alloc_resizeable memory_region_init_ram_nomigrate memory_region_init_rom_nomigrate memory_region_init_rom_device_nomigrate and test it with aux_ram_share in qemu_ram_alloc_internal. if RAM_AUX && aux_ram_share flags |= RAM_SHARED However, we could just set RAM_SHARED at those same call sites: flags = current_machine->aux_ram_shared ? RAM_SHARED : 0; which is what I did in [PATCH V2 01/11] machine: alloc-anon option and test RAM_SHARED in qemu_ram_alloc_internal. No need for RAM_PRIVATE. RAM_AUX is nice because it declares intent more specifically. Your preference? - Steve
On 11/8/2024 8:43 AM, Peter Xu wrote: > On Fri, Nov 08, 2024 at 12:31:45PM +0100, David Hildenbrand wrote: >> On 07.11.24 17:40, Steven Sistare wrote: >>> On 11/7/2024 11:26 AM, David Hildenbrand wrote: >>>> On 07.11.24 17:02, Steven Sistare wrote: >>>>> On 11/7/2024 8:23 AM, David Hildenbrand wrote: >>>>>> On 06.11.24 21:12, Steven Sistare wrote: >>>>>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote: >>>>>>>> On 04.11.24 21:56, Steven Sistare wrote: >>>>>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote: >>>>>>>>>> On 04.11.24 20:51, David Hildenbrand wrote: >>>>>>>>>>> On 04.11.24 18:38, Steven Sistare wrote: >>>>>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote: >>>>>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote: >>>>>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending >>>>>>>>>>>>>> on the value of the anon-alloc machine property. This option applies to >>>>>>>>>>>>>> memory allocated as a side effect of creating various devices. It does >>>>>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on >>>>>>>>>>>>>> the command line, or implicitly created by the -m command line option. >>>>>>>>>>>>>> >>>>>>>>>>>>>> The memfd option is intended to support new migration modes, in which the >>>>>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending >>>>>>>>>>>>>> the memfd file descriptor to the process. Memory contents are preserved, >>>>>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are >>>>>>>>>>>>>> locked in memory for DMA remain locked. This behavior is a pre-requisite >>>>>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes. >>>>>>>>>>>>> >>>>>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm, >>>>>>>>>>>>> similar to backends/hostmem-shm.c. >>>>>>>>>>>>> >>>>>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the >>>>>>>>>>>>> details. See below. >>>>>>>>>>>> >>>>>>>>>>>> For this series I would prefer to use memfd and hide the details. It's a >>>>>>>>>>>> concise (and well tested) solution albeit linux only. The code you supply >>>>>>>>>>>> for posix shm would be a good follow on patch to support other unices. >>>>>>>>>>> >>>>>>>>>>> Unless there is reason to use memfd we should start with the more >>>>>>>>>>> generic POSIX variant that is available even on systems without memfd. >>>>>>>>>>> Factoring stuff out as I drafted does look quite compelling. >>>>>>>>>>> >>>>>>>>>>> I can help with the rework, and send it out separately, so you can focus >>>>>>>>>>> on the "machine toggle" as part of this series. >>>>>>>>>>> >>>>>>>>>>> Of course, if we find out we need the memfd internally instead under >>>>>>>>>>> Linux for whatever reason later, we can use that instead. >>>>>>>>>>> >>>>>>>>>>> But IIUC, the main selling point for memfd are additional features >>>>>>>>>>> (hugetlb, memory sealing) that you aren't even using. >>>>>>>>>> >>>>>>>>>> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount). >>>>>>>>> >>>>>>>>> Sizing is a non-trivial difference. One can by default allocate all memory using memfd_create. >>>>>>>>> To do so using shm_open requires configuration on the mount. One step harder to use. >>>>>>>> >>>>>>>> Yes. >>>>>>>> >>>>>>>>> >>>>>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM >>>>>>>>> if memory-backend-ram has hogged all the memory. >>>>>>>>> >>>>>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open(). >>>>>>>>> >>>>>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM >>>>>>>>> -- memfd if available and fallback to shm_open. >>>>>>>> >>>>>>>> Yes. >>>>>>>> >>>>>>>>> >>>>>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like >>>>>>>>>> >>>>>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on" >>>>>>>>>> >>>>>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process. >>>>>>>>>> >>>>>>>>>> Thoughts? >>>>>>>>> >>>>>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch. >>>>>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc >>>>>>>>> controls everything else. Now we're just riffing on the details: memfd vs shm_open, spelling >>>>>>>>> of options and words to describe them. >>>>>>>> >>>>>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on". >>>>>>> >>>>>>> Hi David and Peter, >>>>>>> >>>>>>> I have implemented and tested the following, for both qemu_memfd_create >>>>>>> and qemu_shm_alloc. This is pseudo-code, with error conditions omitted >>>>>>> for simplicity. >>>>>>> >>>>>>> Any comments before I submit a complete patch? >>>>>>> >>>>>>> ---- >>>>>>> qemu-options.hx: >>>>>>> ``aux-ram-share=on|off`` >>>>>>> Allocate auxiliary guest RAM as an anonymous file that is >>>>>>> shareable with an external process. This option applies to >>>>>>> memory allocated as a side effect of creating various devices. >>>>>>> It does not apply to memory-backend-objects, whether explicitly >>>>>>> specified on the command line, or implicitly created by the -m >>>>>>> command line option. >>>>>>> >>>>>>> Some migration modes require aux-ram-share=on. >>>>>>> >>>>>>> qapi/migration.json: >>>>>>> @cpr-transfer: >>>>>>> ... >>>>>>> Memory-backend objects must have the share=on attribute, but >>>>>>> memory-backend-epc is not supported. The VM must be started >>>>>>> with the '-machine aux-ram-share=on' option. >>>>>>> >>>>>>> Define RAM_PRIVATE >>>>>>> >>>>>>> Define qemu_shm_alloc(), from David's tmp patch >>>>>>> >>>>>>> ram_backend_memory_alloc() >>>>>>> ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE; >>>>>>> memory_region_init_ram_flags_nomigrate(ram_flags) >>>>>>> >>>>>>> qemu_ram_alloc_internal() >>>>>>> ... >>>>>>> if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share) >>>>>>> new_block->flags |= RAM_SHARED; >>>>>>> >>>>>>> if (!host && (new_block->flags & RAM_SHARED)) { >>>>>>> qemu_ram_alloc_shared(new_block); >>>>>>> } else >>>>>>> new_block->fd = -1; >>>>>>> new_block->host = host; >>>>>>> } >>>>>>> ram_block_add(new_block); >>>>>>> >>>>>>> qemu_ram_alloc_shared() >>>>>>> if qemu_memfd_check() >>>>>>> new_block->fd = qemu_memfd_create() >>>>>>> else >>>>>>> new_block->fd = qemu_shm_alloc() >>>>>> >>>>>> Yes, that way "memory-backend-ram,share=on" will just mean "give me the best shared memory for RAM to be shared with other processes, I don't care about the details", and it will work on Linux kernels even before we had memfds. >>>>>> >>>>>> memory-backend-ram should be available on all architectures, and under Windows. qemu_anon_ram_alloc() under Linux just does nothing special, not even bail out. >>>>>> >>>>>> MAP_SHARED|MAP_ANON was always weird, because it meant "give me memory I can share only with subprocesses", but then, *there are not subprocesses for QEMU*. I recall there was a trick to obtain the fd under Linux for these regions using /proc/self/fd/, but it's very Linux specific ... >>>>>> >>>>>> So nobody would *actually* use that shared memory and it was only a hack for RDMA. Now we can do better. >>>>>> >>>>>> >>>>>> We'll have to decide if we simply fallback to qemu_anon_ram_alloc() if no shared memory can be created (unavailable), like we do on Windows. >>>>>> >>>>>> So maybe something like >>>>>> >>>>>> qemu_ram_alloc_shared() >>>>>> fd = -1; >>>>>> >>>>>> if (qemu_memfd_avilable()) { >>>>>> fd = qemu_memfd_create(); >>>>>> if (fd < 0) >>>>>> ... error >>>>>> } else if (qemu_shm_available()) >>>>>> fd = qemu_shm_alloc(); >>>>>> if (fd < 0) >>>>>> ... error >>>>>> } else { >>>>>> /* >>>>>> * Old behavior: try fd-less shared memory. We might >>>>>> * just end up with non-shared memory on Windows, but >>>>>> * nobody can make sure of this shared memory either way >>>>>> * ... should we just use non-shared memory? Or should >>>>>> * we simply bail out? But then, if there is no shared >>>>>> * memory nobody could possible use it. >>>>>> */ >>>>>> qemu_anon_ram_alloc(share=true) >>>>>> } >>>>> >>>>> Good catch. We need that fallback for backwards compatibility. Even with >>>>> no use case for memory-backend-ram,share=on since the demise of rdma, users >>>>> may specify it on windows, for no particular reason, but it works, and should >>>>> continue to work after this series. CPR would be blocked. >>>> >>>> Yes, we should keep Windows working in the weird way it is working right now. >>>> >>>> > > More generally for backwards compatibility for share=on for no particular reason, >>>>> should we fallback if qemu_shm_alloc fails? If /dev/shm is mounted with default >>>>> options and more than half of ram is requested, it will fail, whereas current qemu >>>>> succeeds using MAP_SHARED|MAP_ANON. >>>> >>>> Only on Linux without memfd, of course. Maybe we should just warn when qemu_shm_alloc() fails (and comment that we continue for compat reasons only) and fallback to the stupid qemu_anon_ram_alloc(share=true). We could implement a fallback to shmget() but ... let's not go down that path. >>>> >>>> But we should not fallback to qemu_shm_alloc()/MAP_SHARED|MAP_ANON if memfd is available and that allocating the memfd failed. Failing to allocate a memfd might highlight a bigger problem. >>> >>> Agreed on all. >>> >>> One more opinion from you please, if you will. >>> >>> RAM_PRIVATE is only checked in qemu_ram_alloc_internal, and only needs to be >>> set in >>> ram_backend_memory_alloc -> ... -> qemu_ram_alloc_internal >>> >>> None of the other backends reach qemu_ram_alloc_internal. >>> >>> To be future proof, do you prefer I also set MAP_PRIVATE in the other backends, >>> everywhere MAP_SHARED may be set, eg: >> >> Hm, I think then we should set RAM_PRIVATE really everywhere where we'd want >> it and relied on !RAM_SHARED doing the right thing. >> >> Alternatively, we make our life easier and do something like >> >> /* >> * This flag is only used while creating+allocating RAM, and >> * prevents RAM_SHARED getting set for anonymous RAM automatically in >> * some configurations. >> * >> * By default, not setting RAM_SHARED on anonymous RAM implies >> * "private anonymous RAM"; however, in some configuration we want to >> * have most of this RAM automatically be "sharable anonymous RAM", >> * except for some cases that really want "private anonymous RAM". >> * >> * This anonymous RAM *must* be private. This flag only applies to >> * "anonymous" RAM, not fd/file-backed/preallocated one. >> */ >> RAM_FORCE_ANON_PRIVATE (1 << 13) >> >> >> BUT maybe an even better alternative now that we have the "aux-ram-share" >> parameter, could we use >> >> /* >> * Auxiliary RAM that was created automatically internally, instead of >> * explicitly like using memory-backend-ram or some other device on the >> * QEMU cmdline. >> */ >> RAM_AUX (1 << 13) >> >> >> So it will be quite clear that "aux-ram-share" only applies to RAM_AUX >> RAMBlocks. >> >> That actually looks quite compelling to me :) > > Could anyone remind me why we can't simply set PRIVATE|SHARED all over the > place? > > IMHO RAM_AUX is too hard for any new callers to know how to set. It's much > easier when we already have SHARED, adding PRIVATE could be mostly natural, > then we can already avoid AUX due to checking !SHARED & !PRIVATE. > > Basically, SHARED|PRIVATE then must come from an user request (QMP or > cmdline), otherwise the caller should always set none of them, implying > aux. > > It still looks the best to me. Our emails crossed. We could set PRIVATE|SHARED all over the place. Nothing wrong with that solution. I have no preference, other than finishing. - Steve
On 08.11.24 14:43, Peter Xu wrote: > On Fri, Nov 08, 2024 at 12:31:45PM +0100, David Hildenbrand wrote: >> On 07.11.24 17:40, Steven Sistare wrote: >>> On 11/7/2024 11:26 AM, David Hildenbrand wrote: >>>> On 07.11.24 17:02, Steven Sistare wrote: >>>>> On 11/7/2024 8:23 AM, David Hildenbrand wrote: >>>>>> On 06.11.24 21:12, Steven Sistare wrote: >>>>>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote: >>>>>>>> On 04.11.24 21:56, Steven Sistare wrote: >>>>>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote: >>>>>>>>>> On 04.11.24 20:51, David Hildenbrand wrote: >>>>>>>>>>> On 04.11.24 18:38, Steven Sistare wrote: >>>>>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote: >>>>>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote: >>>>>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending >>>>>>>>>>>>>> on the value of the anon-alloc machine property. This option applies to >>>>>>>>>>>>>> memory allocated as a side effect of creating various devices. It does >>>>>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on >>>>>>>>>>>>>> the command line, or implicitly created by the -m command line option. >>>>>>>>>>>>>> >>>>>>>>>>>>>> The memfd option is intended to support new migration modes, in which the >>>>>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending >>>>>>>>>>>>>> the memfd file descriptor to the process. Memory contents are preserved, >>>>>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are >>>>>>>>>>>>>> locked in memory for DMA remain locked. This behavior is a pre-requisite >>>>>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes. >>>>>>>>>>>>> >>>>>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm, >>>>>>>>>>>>> similar to backends/hostmem-shm.c. >>>>>>>>>>>>> >>>>>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the >>>>>>>>>>>>> details. See below. >>>>>>>>>>>> >>>>>>>>>>>> For this series I would prefer to use memfd and hide the details. It's a >>>>>>>>>>>> concise (and well tested) solution albeit linux only. The code you supply >>>>>>>>>>>> for posix shm would be a good follow on patch to support other unices. >>>>>>>>>>> >>>>>>>>>>> Unless there is reason to use memfd we should start with the more >>>>>>>>>>> generic POSIX variant that is available even on systems without memfd. >>>>>>>>>>> Factoring stuff out as I drafted does look quite compelling. >>>>>>>>>>> >>>>>>>>>>> I can help with the rework, and send it out separately, so you can focus >>>>>>>>>>> on the "machine toggle" as part of this series. >>>>>>>>>>> >>>>>>>>>>> Of course, if we find out we need the memfd internally instead under >>>>>>>>>>> Linux for whatever reason later, we can use that instead. >>>>>>>>>>> >>>>>>>>>>> But IIUC, the main selling point for memfd are additional features >>>>>>>>>>> (hugetlb, memory sealing) that you aren't even using. >>>>>>>>>> >>>>>>>>>> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount). >>>>>>>>> >>>>>>>>> Sizing is a non-trivial difference. One can by default allocate all memory using memfd_create. >>>>>>>>> To do so using shm_open requires configuration on the mount. One step harder to use. >>>>>>>> >>>>>>>> Yes. >>>>>>>> >>>>>>>>> >>>>>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM >>>>>>>>> if memory-backend-ram has hogged all the memory. >>>>>>>>> >>>>>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open(). >>>>>>>>> >>>>>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM >>>>>>>>> -- memfd if available and fallback to shm_open. >>>>>>>> >>>>>>>> Yes. >>>>>>>> >>>>>>>>> >>>>>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like >>>>>>>>>> >>>>>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on" >>>>>>>>>> >>>>>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process. >>>>>>>>>> >>>>>>>>>> Thoughts? >>>>>>>>> >>>>>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch. >>>>>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc >>>>>>>>> controls everything else. Now we're just riffing on the details: memfd vs shm_open, spelling >>>>>>>>> of options and words to describe them. >>>>>>>> >>>>>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on". >>>>>>> >>>>>>> Hi David and Peter, >>>>>>> >>>>>>> I have implemented and tested the following, for both qemu_memfd_create >>>>>>> and qemu_shm_alloc. This is pseudo-code, with error conditions omitted >>>>>>> for simplicity. >>>>>>> >>>>>>> Any comments before I submit a complete patch? >>>>>>> >>>>>>> ---- >>>>>>> qemu-options.hx: >>>>>>> ``aux-ram-share=on|off`` >>>>>>> Allocate auxiliary guest RAM as an anonymous file that is >>>>>>> shareable with an external process. This option applies to >>>>>>> memory allocated as a side effect of creating various devices. >>>>>>> It does not apply to memory-backend-objects, whether explicitly >>>>>>> specified on the command line, or implicitly created by the -m >>>>>>> command line option. >>>>>>> >>>>>>> Some migration modes require aux-ram-share=on. >>>>>>> >>>>>>> qapi/migration.json: >>>>>>> @cpr-transfer: >>>>>>> ... >>>>>>> Memory-backend objects must have the share=on attribute, but >>>>>>> memory-backend-epc is not supported. The VM must be started >>>>>>> with the '-machine aux-ram-share=on' option. >>>>>>> >>>>>>> Define RAM_PRIVATE >>>>>>> >>>>>>> Define qemu_shm_alloc(), from David's tmp patch >>>>>>> >>>>>>> ram_backend_memory_alloc() >>>>>>> ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE; >>>>>>> memory_region_init_ram_flags_nomigrate(ram_flags) >>>>>>> >>>>>>> qemu_ram_alloc_internal() >>>>>>> ... >>>>>>> if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share) >>>>>>> new_block->flags |= RAM_SHARED; >>>>>>> >>>>>>> if (!host && (new_block->flags & RAM_SHARED)) { >>>>>>> qemu_ram_alloc_shared(new_block); >>>>>>> } else >>>>>>> new_block->fd = -1; >>>>>>> new_block->host = host; >>>>>>> } >>>>>>> ram_block_add(new_block); >>>>>>> >>>>>>> qemu_ram_alloc_shared() >>>>>>> if qemu_memfd_check() >>>>>>> new_block->fd = qemu_memfd_create() >>>>>>> else >>>>>>> new_block->fd = qemu_shm_alloc() >>>>>> >>>>>> Yes, that way "memory-backend-ram,share=on" will just mean "give me the best shared memory for RAM to be shared with other processes, I don't care about the details", and it will work on Linux kernels even before we had memfds. >>>>>> >>>>>> memory-backend-ram should be available on all architectures, and under Windows. qemu_anon_ram_alloc() under Linux just does nothing special, not even bail out. >>>>>> >>>>>> MAP_SHARED|MAP_ANON was always weird, because it meant "give me memory I can share only with subprocesses", but then, *there are not subprocesses for QEMU*. I recall there was a trick to obtain the fd under Linux for these regions using /proc/self/fd/, but it's very Linux specific ... >>>>>> >>>>>> So nobody would *actually* use that shared memory and it was only a hack for RDMA. Now we can do better. >>>>>> >>>>>> >>>>>> We'll have to decide if we simply fallback to qemu_anon_ram_alloc() if no shared memory can be created (unavailable), like we do on Windows. >>>>>> >>>>>> So maybe something like >>>>>> >>>>>> qemu_ram_alloc_shared() >>>>>> fd = -1; >>>>>> >>>>>> if (qemu_memfd_avilable()) { >>>>>> fd = qemu_memfd_create(); >>>>>> if (fd < 0) >>>>>> ... error >>>>>> } else if (qemu_shm_available()) >>>>>> fd = qemu_shm_alloc(); >>>>>> if (fd < 0) >>>>>> ... error >>>>>> } else { >>>>>> /* >>>>>> * Old behavior: try fd-less shared memory. We might >>>>>> * just end up with non-shared memory on Windows, but >>>>>> * nobody can make sure of this shared memory either way >>>>>> * ... should we just use non-shared memory? Or should >>>>>> * we simply bail out? But then, if there is no shared >>>>>> * memory nobody could possible use it. >>>>>> */ >>>>>> qemu_anon_ram_alloc(share=true) >>>>>> } >>>>> >>>>> Good catch. We need that fallback for backwards compatibility. Even with >>>>> no use case for memory-backend-ram,share=on since the demise of rdma, users >>>>> may specify it on windows, for no particular reason, but it works, and should >>>>> continue to work after this series. CPR would be blocked. >>>> >>>> Yes, we should keep Windows working in the weird way it is working right now. >>>> >>>> > > More generally for backwards compatibility for share=on for no particular reason, >>>>> should we fallback if qemu_shm_alloc fails? If /dev/shm is mounted with default >>>>> options and more than half of ram is requested, it will fail, whereas current qemu >>>>> succeeds using MAP_SHARED|MAP_ANON. >>>> >>>> Only on Linux without memfd, of course. Maybe we should just warn when qemu_shm_alloc() fails (and comment that we continue for compat reasons only) and fallback to the stupid qemu_anon_ram_alloc(share=true). We could implement a fallback to shmget() but ... let's not go down that path. >>>> >>>> But we should not fallback to qemu_shm_alloc()/MAP_SHARED|MAP_ANON if memfd is available and that allocating the memfd failed. Failing to allocate a memfd might highlight a bigger problem. >>> >>> Agreed on all. >>> >>> One more opinion from you please, if you will. >>> >>> RAM_PRIVATE is only checked in qemu_ram_alloc_internal, and only needs to be >>> set in >>> ram_backend_memory_alloc -> ... -> qemu_ram_alloc_internal >>> >>> None of the other backends reach qemu_ram_alloc_internal. >>> >>> To be future proof, do you prefer I also set MAP_PRIVATE in the other backends, >>> everywhere MAP_SHARED may be set, eg: >> >> Hm, I think then we should set RAM_PRIVATE really everywhere where we'd want >> it and relied on !RAM_SHARED doing the right thing. >> >> Alternatively, we make our life easier and do something like >> >> /* >> * This flag is only used while creating+allocating RAM, and >> * prevents RAM_SHARED getting set for anonymous RAM automatically in >> * some configurations. >> * >> * By default, not setting RAM_SHARED on anonymous RAM implies >> * "private anonymous RAM"; however, in some configuration we want to >> * have most of this RAM automatically be "sharable anonymous RAM", >> * except for some cases that really want "private anonymous RAM". >> * >> * This anonymous RAM *must* be private. This flag only applies to >> * "anonymous" RAM, not fd/file-backed/preallocated one. >> */ >> RAM_FORCE_ANON_PRIVATE (1 << 13) >> >> >> BUT maybe an even better alternative now that we have the "aux-ram-share" >> parameter, could we use >> >> /* >> * Auxiliary RAM that was created automatically internally, instead of >> * explicitly like using memory-backend-ram or some other device on the >> * QEMU cmdline. >> */ >> RAM_AUX (1 << 13) >> >> >> So it will be quite clear that "aux-ram-share" only applies to RAM_AUX >> RAMBlocks. >> >> That actually looks quite compelling to me :) > > Could anyone remind me why we can't simply set PRIVATE|SHARED all over the > place? > > IMHO RAM_AUX is too hard for any new callers to know how to set. It's much > easier when we already have SHARED, adding PRIVATE could be mostly natural, > then we can already avoid AUX due to checking !SHARED & !PRIVATE. How is it clearer if you have to know whether you have to set RAM_PRIVATE or not for some RAM? Because you *wouldn't* set it "all over the place". No strong opinion, but RAM_AUX aligns much better with what we actually want to achieve: making aux RAM shared. Which implies, detecting aux RAM ...
On 08.11.24 14:56, Steven Sistare wrote: > On 11/8/2024 6:31 AM, David Hildenbrand wrote: >> On 07.11.24 17:40, Steven Sistare wrote: >>> On 11/7/2024 11:26 AM, David Hildenbrand wrote: >>>> On 07.11.24 17:02, Steven Sistare wrote: >>>>> On 11/7/2024 8:23 AM, David Hildenbrand wrote: >>>>>> On 06.11.24 21:12, Steven Sistare wrote: >>>>>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote: >>>>>>>> On 04.11.24 21:56, Steven Sistare wrote: >>>>>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote: >>>>>>>>>> On 04.11.24 20:51, David Hildenbrand wrote: >>>>>>>>>>> On 04.11.24 18:38, Steven Sistare wrote: >>>>>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote: >>>>>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote: >>>>>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending >>>>>>>>>>>>>> on the value of the anon-alloc machine property. This option applies to >>>>>>>>>>>>>> memory allocated as a side effect of creating various devices. It does >>>>>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on >>>>>>>>>>>>>> the command line, or implicitly created by the -m command line option. >>>>>>>>>>>>>> >>>>>>>>>>>>>> The memfd option is intended to support new migration modes, in which the >>>>>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending >>>>>>>>>>>>>> the memfd file descriptor to the process. Memory contents are preserved, >>>>>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are >>>>>>>>>>>>>> locked in memory for DMA remain locked. This behavior is a pre-requisite >>>>>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes. >>>>>>>>>>>>> >>>>>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm, >>>>>>>>>>>>> similar to backends/hostmem-shm.c. >>>>>>>>>>>>> >>>>>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the >>>>>>>>>>>>> details. See below. >>>>>>>>>>>> >>>>>>>>>>>> For this series I would prefer to use memfd and hide the details. It's a >>>>>>>>>>>> concise (and well tested) solution albeit linux only. The code you supply >>>>>>>>>>>> for posix shm would be a good follow on patch to support other unices. >>>>>>>>>>> >>>>>>>>>>> Unless there is reason to use memfd we should start with the more >>>>>>>>>>> generic POSIX variant that is available even on systems without memfd. >>>>>>>>>>> Factoring stuff out as I drafted does look quite compelling. >>>>>>>>>>> >>>>>>>>>>> I can help with the rework, and send it out separately, so you can focus >>>>>>>>>>> on the "machine toggle" as part of this series. >>>>>>>>>>> >>>>>>>>>>> Of course, if we find out we need the memfd internally instead under >>>>>>>>>>> Linux for whatever reason later, we can use that instead. >>>>>>>>>>> >>>>>>>>>>> But IIUC, the main selling point for memfd are additional features >>>>>>>>>>> (hugetlb, memory sealing) that you aren't even using. >>>>>>>>>> >>>>>>>>>> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount). >>>>>>>>> >>>>>>>>> Sizing is a non-trivial difference. One can by default allocate all memory using memfd_create. >>>>>>>>> To do so using shm_open requires configuration on the mount. One step harder to use. >>>>>>>> >>>>>>>> Yes. >>>>>>>> >>>>>>>>> >>>>>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM >>>>>>>>> if memory-backend-ram has hogged all the memory. >>>>>>>>> >>>>>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open(). >>>>>>>>> >>>>>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM >>>>>>>>> -- memfd if available and fallback to shm_open. >>>>>>>> >>>>>>>> Yes. >>>>>>>> >>>>>>>>> >>>>>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like >>>>>>>>>> >>>>>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on" >>>>>>>>>> >>>>>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process. >>>>>>>>>> >>>>>>>>>> Thoughts? >>>>>>>>> >>>>>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch. >>>>>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc >>>>>>>>> controls everything else. Now we're just riffing on the details: memfd vs shm_open, spelling >>>>>>>>> of options and words to describe them. >>>>>>>> >>>>>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on". >>>>>>> >>>>>>> Hi David and Peter, >>>>>>> >>>>>>> I have implemented and tested the following, for both qemu_memfd_create >>>>>>> and qemu_shm_alloc. This is pseudo-code, with error conditions omitted >>>>>>> for simplicity. >>>>>>> >>>>>>> Any comments before I submit a complete patch? >>>>>>> >>>>>>> ---- >>>>>>> qemu-options.hx: >>>>>>> ``aux-ram-share=on|off`` >>>>>>> Allocate auxiliary guest RAM as an anonymous file that is >>>>>>> shareable with an external process. This option applies to >>>>>>> memory allocated as a side effect of creating various devices. >>>>>>> It does not apply to memory-backend-objects, whether explicitly >>>>>>> specified on the command line, or implicitly created by the -m >>>>>>> command line option. >>>>>>> >>>>>>> Some migration modes require aux-ram-share=on. >>>>>>> >>>>>>> qapi/migration.json: >>>>>>> @cpr-transfer: >>>>>>> ... >>>>>>> Memory-backend objects must have the share=on attribute, but >>>>>>> memory-backend-epc is not supported. The VM must be started >>>>>>> with the '-machine aux-ram-share=on' option. >>>>>>> >>>>>>> Define RAM_PRIVATE >>>>>>> >>>>>>> Define qemu_shm_alloc(), from David's tmp patch >>>>>>> >>>>>>> ram_backend_memory_alloc() >>>>>>> ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE; >>>>>>> memory_region_init_ram_flags_nomigrate(ram_flags) >>>>>>> >>>>>>> qemu_ram_alloc_internal() >>>>>>> ... >>>>>>> if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share) >>>>>>> new_block->flags |= RAM_SHARED; >>>>>>> >>>>>>> if (!host && (new_block->flags & RAM_SHARED)) { >>>>>>> qemu_ram_alloc_shared(new_block); >>>>>>> } else >>>>>>> new_block->fd = -1; >>>>>>> new_block->host = host; >>>>>>> } >>>>>>> ram_block_add(new_block); >>>>>>> >>>>>>> qemu_ram_alloc_shared() >>>>>>> if qemu_memfd_check() >>>>>>> new_block->fd = qemu_memfd_create() >>>>>>> else >>>>>>> new_block->fd = qemu_shm_alloc() >>>>>> >>>>>> Yes, that way "memory-backend-ram,share=on" will just mean "give me the best shared memory for RAM to be shared with other processes, I don't care about the details", and it will work on Linux kernels even before we had memfds. >>>>>> >>>>>> memory-backend-ram should be available on all architectures, and under Windows. qemu_anon_ram_alloc() under Linux just does nothing special, not even bail out. >>>>>> >>>>>> MAP_SHARED|MAP_ANON was always weird, because it meant "give me memory I can share only with subprocesses", but then, *there are not subprocesses for QEMU*. I recall there was a trick to obtain the fd under Linux for these regions using /proc/self/fd/, but it's very Linux specific ... >>>>>> >>>>>> So nobody would *actually* use that shared memory and it was only a hack for RDMA. Now we can do better. >>>>>> >>>>>> >>>>>> We'll have to decide if we simply fallback to qemu_anon_ram_alloc() if no shared memory can be created (unavailable), like we do on Windows. >>>>>> >>>>>> So maybe something like >>>>>> >>>>>> qemu_ram_alloc_shared() >>>>>> fd = -1; >>>>>> >>>>>> if (qemu_memfd_avilable()) { >>>>>> fd = qemu_memfd_create(); >>>>>> if (fd < 0) >>>>>> ... error >>>>>> } else if (qemu_shm_available()) >>>>>> fd = qemu_shm_alloc(); >>>>>> if (fd < 0) >>>>>> ... error >>>>>> } else { >>>>>> /* >>>>>> * Old behavior: try fd-less shared memory. We might >>>>>> * just end up with non-shared memory on Windows, but >>>>>> * nobody can make sure of this shared memory either way >>>>>> * ... should we just use non-shared memory? Or should >>>>>> * we simply bail out? But then, if there is no shared >>>>>> * memory nobody could possible use it. >>>>>> */ >>>>>> qemu_anon_ram_alloc(share=true) >>>>>> } >>>>> >>>>> Good catch. We need that fallback for backwards compatibility. Even with >>>>> no use case for memory-backend-ram,share=on since the demise of rdma, users >>>>> may specify it on windows, for no particular reason, but it works, and should >>>>> continue to work after this series. CPR would be blocked. >>>> >>>> Yes, we should keep Windows working in the weird way it is working right now. >>>> >>>> > > More generally for backwards compatibility for share=on for no particular reason, >>>>> should we fallback if qemu_shm_alloc fails? If /dev/shm is mounted with default >>>>> options and more than half of ram is requested, it will fail, whereas current qemu >>>>> succeeds using MAP_SHARED|MAP_ANON. >>>> >>>> Only on Linux without memfd, of course. Maybe we should just warn when qemu_shm_alloc() fails (and comment that we continue for compat reasons only) and fallback to the stupid qemu_anon_ram_alloc(share=true). We could implement a fallback to shmget() but ... let's not go down that path. >>>> >>>> But we should not fallback to qemu_shm_alloc()/MAP_SHARED|MAP_ANON if memfd is available and that allocating the memfd failed. Failing to allocate a memfd might highlight a bigger problem. >>> >>> Agreed on all. >>> >>> One more opinion from you please, if you will. >>> >>> RAM_PRIVATE is only checked in qemu_ram_alloc_internal, and only needs to be >>> set in >>> ram_backend_memory_alloc -> ... -> qemu_ram_alloc_internal >>> >>> None of the other backends reach qemu_ram_alloc_internal. >>> >>> To be future proof, do you prefer I also set MAP_PRIVATE in the other backends, >>> everywhere MAP_SHARED may be set, eg: >> >> Hm, I think then we should set RAM_PRIVATE really everywhere where we'd want it and relied on !RAM_SHARED doing the right thing. >> >> Alternatively, we make our life easier and do something like >> >> /* >> * This flag is only used while creating+allocating RAM, and >> * prevents RAM_SHARED getting set for anonymous RAM automatically in >> * some configurations. >> * >> * By default, not setting RAM_SHARED on anonymous RAM implies >> * "private anonymous RAM"; however, in some configuration we want to >> * have most of this RAM automatically be "sharable anonymous RAM", >> * except for some cases that really want "private anonymous RAM". >> * >> * This anonymous RAM *must* be private. This flag only applies to >> * "anonymous" RAM, not fd/file-backed/preallocated one. >> */ >> RAM_FORCE_ANON_PRIVATE (1 << 13) >> >> >> BUT maybe an even better alternative now that we have the "aux-ram-share" parameter, could we use >> >> /* >> * Auxiliary RAM that was created automatically internally, instead of >> * explicitly like using memory-backend-ram or some other device on the >> * QEMU cmdline. >> */ >> RAM_AUX (1 << 13) >> >> >> So it will be quite clear that "aux-ram-share" only applies to RAM_AUX RAMBlocks. >> >> That actually looks quite compelling to me :) > > Agreed, RAM_AUX is a clear solution. I would set it in these functions: > qemu_ram_alloc_resizeable > memory_region_init_ram_nomigrate > memory_region_init_rom_nomigrate > memory_region_init_rom_device_nomigrate > > and test it with aux_ram_share in qemu_ram_alloc_internal. > if RAM_AUX && aux_ram_share > flags |= RAM_SHARED > > However, we could just set RAM_SHARED at those same call sites: > flags = current_machine->aux_ram_shared ? RAM_SHARED : 0; > which is what I did in > [PATCH V2 01/11] machine: alloc-anon option > and test RAM_SHARED in qemu_ram_alloc_internal. > No need for RAM_PRIVATE. > > RAM_AUX is nice because it declares intent more specifically. > > Your preference? My preference is either using RAM_AUX to flag AUX RAM, or the inverse, RAM_NON_AUX to flag non-aux RAM, such as from memory backends and likely ivshmem.c Peter still seems to prefer RAM_PRIVATE. So I guess it's up to you to decide ;)
On Fri, Nov 08, 2024 at 09:14:02AM -0500, Steven Sistare wrote: > > Could anyone remind me why we can't simply set PRIVATE|SHARED all over the > > place? > > > > IMHO RAM_AUX is too hard for any new callers to know how to set. It's much > > easier when we already have SHARED, adding PRIVATE could be mostly natural, > > then we can already avoid AUX due to checking !SHARED & !PRIVATE. > > > > Basically, SHARED|PRIVATE then must come from an user request (QMP or > > cmdline), otherwise the caller should always set none of them, implying > > aux. > > > > It still looks the best to me. > > Our emails crossed. We could set PRIVATE|SHARED all over the place. Nothing > wrong with that solution. I have no preference, other than finishing. The current AUX is exactly what I was picturing when I was replying v2, so the four paths you listed here: https://lore.kernel.org/qemu-devel/44b15731-0ee8-4e24-b4f5-0614bca594cb@oracle.com/ Agreed, RAM_AUX is a clear solution. I would set it in these functions: qemu_ram_alloc_resizeable memory_region_init_ram_nomigrate memory_region_init_rom_nomigrate memory_region_init_rom_device_nomigrate That is what I listed previously: https://lore.kernel.org/qemu-devel/Zv7C7MeVP2X8bEJU@x1n/ I think that means below paths [1-4] are only relevant: qemu_ram_alloc memory_region_init_rom_device_nomigrate [1] memory_region_init_ram_flags_nomigrate memory_region_init_ram_nomigrate [2] memory_region_init_rom_nomigrate [3] qemu_ram_alloc_resizeable [4] Except that if we don't want to risk using VM_SHARED unconditionally on linux for aux, then we need to have a new flag, aka RAM_AUX or similar. So I believe the list is at least correct. I changed my mind because I noticed it will be non-trivial to know whether one should set RAM_AUX when a new user needs to create some new ramblocks. In that case, we need to define RAM_AUX properly. One sane way to define it is: "if the user didn't specify share or private property, please use AUX", but then it'll overlap with RAM_SHARED / RAM_PRIVATE already, hence redundant. Yes, let's wait and see David's comment. Thanks,
On 11/8/2024 9:20 AM, David Hildenbrand wrote: > On 08.11.24 14:56, Steven Sistare wrote: >> On 11/8/2024 6:31 AM, David Hildenbrand wrote: >>> On 07.11.24 17:40, Steven Sistare wrote: >>>> On 11/7/2024 11:26 AM, David Hildenbrand wrote: >>>>> On 07.11.24 17:02, Steven Sistare wrote: >>>>>> On 11/7/2024 8:23 AM, David Hildenbrand wrote: >>>>>>> On 06.11.24 21:12, Steven Sistare wrote: >>>>>>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote: >>>>>>>>> On 04.11.24 21:56, Steven Sistare wrote: >>>>>>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote: >>>>>>>>>>> On 04.11.24 20:51, David Hildenbrand wrote: >>>>>>>>>>>> On 04.11.24 18:38, Steven Sistare wrote: >>>>>>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote: >>>>>>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote: >>>>>>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending >>>>>>>>>>>>>>> on the value of the anon-alloc machine property. This option applies to >>>>>>>>>>>>>>> memory allocated as a side effect of creating various devices. It does >>>>>>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on >>>>>>>>>>>>>>> the command line, or implicitly created by the -m command line option. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> The memfd option is intended to support new migration modes, in which the >>>>>>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending >>>>>>>>>>>>>>> the memfd file descriptor to the process. Memory contents are preserved, >>>>>>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are >>>>>>>>>>>>>>> locked in memory for DMA remain locked. This behavior is a pre-requisite >>>>>>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes. >>>>>>>>>>>>>> >>>>>>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm, >>>>>>>>>>>>>> similar to backends/hostmem-shm.c. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the >>>>>>>>>>>>>> details. See below. >>>>>>>>>>>>> >>>>>>>>>>>>> For this series I would prefer to use memfd and hide the details. It's a >>>>>>>>>>>>> concise (and well tested) solution albeit linux only. The code you supply >>>>>>>>>>>>> for posix shm would be a good follow on patch to support other unices. >>>>>>>>>>>> >>>>>>>>>>>> Unless there is reason to use memfd we should start with the more >>>>>>>>>>>> generic POSIX variant that is available even on systems without memfd. >>>>>>>>>>>> Factoring stuff out as I drafted does look quite compelling. >>>>>>>>>>>> >>>>>>>>>>>> I can help with the rework, and send it out separately, so you can focus >>>>>>>>>>>> on the "machine toggle" as part of this series. >>>>>>>>>>>> >>>>>>>>>>>> Of course, if we find out we need the memfd internally instead under >>>>>>>>>>>> Linux for whatever reason later, we can use that instead. >>>>>>>>>>>> >>>>>>>>>>>> But IIUC, the main selling point for memfd are additional features >>>>>>>>>>>> (hugetlb, memory sealing) that you aren't even using. >>>>>>>>>>> >>>>>>>>>>> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount). >>>>>>>>>> >>>>>>>>>> Sizing is a non-trivial difference. One can by default allocate all memory using memfd_create. >>>>>>>>>> To do so using shm_open requires configuration on the mount. One step harder to use. >>>>>>>>> >>>>>>>>> Yes. >>>>>>>>> >>>>>>>>>> >>>>>>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM >>>>>>>>>> if memory-backend-ram has hogged all the memory. >>>>>>>>>> >>>>>>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open(). >>>>>>>>>> >>>>>>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM >>>>>>>>>> -- memfd if available and fallback to shm_open. >>>>>>>>> >>>>>>>>> Yes. >>>>>>>>> >>>>>>>>>> >>>>>>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like >>>>>>>>>>> >>>>>>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on" >>>>>>>>>>> >>>>>>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process. >>>>>>>>>>> >>>>>>>>>>> Thoughts? >>>>>>>>>> >>>>>>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch. >>>>>>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc >>>>>>>>>> controls everything else. Now we're just riffing on the details: memfd vs shm_open, spelling >>>>>>>>>> of options and words to describe them. >>>>>>>>> >>>>>>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on". >>>>>>>> >>>>>>>> Hi David and Peter, >>>>>>>> >>>>>>>> I have implemented and tested the following, for both qemu_memfd_create >>>>>>>> and qemu_shm_alloc. This is pseudo-code, with error conditions omitted >>>>>>>> for simplicity. >>>>>>>> >>>>>>>> Any comments before I submit a complete patch? >>>>>>>> >>>>>>>> ---- >>>>>>>> qemu-options.hx: >>>>>>>> ``aux-ram-share=on|off`` >>>>>>>> Allocate auxiliary guest RAM as an anonymous file that is >>>>>>>> shareable with an external process. This option applies to >>>>>>>> memory allocated as a side effect of creating various devices. >>>>>>>> It does not apply to memory-backend-objects, whether explicitly >>>>>>>> specified on the command line, or implicitly created by the -m >>>>>>>> command line option. >>>>>>>> >>>>>>>> Some migration modes require aux-ram-share=on. >>>>>>>> >>>>>>>> qapi/migration.json: >>>>>>>> @cpr-transfer: >>>>>>>> ... >>>>>>>> Memory-backend objects must have the share=on attribute, but >>>>>>>> memory-backend-epc is not supported. The VM must be started >>>>>>>> with the '-machine aux-ram-share=on' option. >>>>>>>> >>>>>>>> Define RAM_PRIVATE >>>>>>>> >>>>>>>> Define qemu_shm_alloc(), from David's tmp patch >>>>>>>> >>>>>>>> ram_backend_memory_alloc() >>>>>>>> ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE; >>>>>>>> memory_region_init_ram_flags_nomigrate(ram_flags) >>>>>>>> >>>>>>>> qemu_ram_alloc_internal() >>>>>>>> ... >>>>>>>> if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share) >>>>>>>> new_block->flags |= RAM_SHARED; >>>>>>>> >>>>>>>> if (!host && (new_block->flags & RAM_SHARED)) { >>>>>>>> qemu_ram_alloc_shared(new_block); >>>>>>>> } else >>>>>>>> new_block->fd = -1; >>>>>>>> new_block->host = host; >>>>>>>> } >>>>>>>> ram_block_add(new_block); >>>>>>>> >>>>>>>> qemu_ram_alloc_shared() >>>>>>>> if qemu_memfd_check() >>>>>>>> new_block->fd = qemu_memfd_create() >>>>>>>> else >>>>>>>> new_block->fd = qemu_shm_alloc() >>>>>>> >>>>>>> Yes, that way "memory-backend-ram,share=on" will just mean "give me the best shared memory for RAM to be shared with other processes, I don't care about the details", and it will work on Linux kernels even before we had memfds. >>>>>>> >>>>>>> memory-backend-ram should be available on all architectures, and under Windows. qemu_anon_ram_alloc() under Linux just does nothing special, not even bail out. >>>>>>> >>>>>>> MAP_SHARED|MAP_ANON was always weird, because it meant "give me memory I can share only with subprocesses", but then, *there are not subprocesses for QEMU*. I recall there was a trick to obtain the fd under Linux for these regions using /proc/self/fd/, but it's very Linux specific ... >>>>>>> >>>>>>> So nobody would *actually* use that shared memory and it was only a hack for RDMA. Now we can do better. >>>>>>> >>>>>>> >>>>>>> We'll have to decide if we simply fallback to qemu_anon_ram_alloc() if no shared memory can be created (unavailable), like we do on Windows. >>>>>>> >>>>>>> So maybe something like >>>>>>> >>>>>>> qemu_ram_alloc_shared() >>>>>>> fd = -1; >>>>>>> >>>>>>> if (qemu_memfd_avilable()) { >>>>>>> fd = qemu_memfd_create(); >>>>>>> if (fd < 0) >>>>>>> ... error >>>>>>> } else if (qemu_shm_available()) >>>>>>> fd = qemu_shm_alloc(); >>>>>>> if (fd < 0) >>>>>>> ... error >>>>>>> } else { >>>>>>> /* >>>>>>> * Old behavior: try fd-less shared memory. We might >>>>>>> * just end up with non-shared memory on Windows, but >>>>>>> * nobody can make sure of this shared memory either way >>>>>>> * ... should we just use non-shared memory? Or should >>>>>>> * we simply bail out? But then, if there is no shared >>>>>>> * memory nobody could possible use it. >>>>>>> */ >>>>>>> qemu_anon_ram_alloc(share=true) >>>>>>> } >>>>>> >>>>>> Good catch. We need that fallback for backwards compatibility. Even with >>>>>> no use case for memory-backend-ram,share=on since the demise of rdma, users >>>>>> may specify it on windows, for no particular reason, but it works, and should >>>>>> continue to work after this series. CPR would be blocked. >>>>> >>>>> Yes, we should keep Windows working in the weird way it is working right now. >>>>> >>>>> > > More generally for backwards compatibility for share=on for no particular reason, >>>>>> should we fallback if qemu_shm_alloc fails? If /dev/shm is mounted with default >>>>>> options and more than half of ram is requested, it will fail, whereas current qemu >>>>>> succeeds using MAP_SHARED|MAP_ANON. >>>>> >>>>> Only on Linux without memfd, of course. Maybe we should just warn when qemu_shm_alloc() fails (and comment that we continue for compat reasons only) and fallback to the stupid qemu_anon_ram_alloc(share=true). We could implement a fallback to shmget() but ... let's not go down that path. >>>>> >>>>> But we should not fallback to qemu_shm_alloc()/MAP_SHARED|MAP_ANON if memfd is available and that allocating the memfd failed. Failing to allocate a memfd might highlight a bigger problem. >>>> >>>> Agreed on all. >>>> >>>> One more opinion from you please, if you will. >>>> >>>> RAM_PRIVATE is only checked in qemu_ram_alloc_internal, and only needs to be >>>> set in >>>> ram_backend_memory_alloc -> ... -> qemu_ram_alloc_internal >>>> >>>> None of the other backends reach qemu_ram_alloc_internal. >>>> >>>> To be future proof, do you prefer I also set MAP_PRIVATE in the other backends, >>>> everywhere MAP_SHARED may be set, eg: >>> >>> Hm, I think then we should set RAM_PRIVATE really everywhere where we'd want it and relied on !RAM_SHARED doing the right thing. >>> >>> Alternatively, we make our life easier and do something like >>> >>> /* >>> * This flag is only used while creating+allocating RAM, and >>> * prevents RAM_SHARED getting set for anonymous RAM automatically in >>> * some configurations. >>> * >>> * By default, not setting RAM_SHARED on anonymous RAM implies >>> * "private anonymous RAM"; however, in some configuration we want to >>> * have most of this RAM automatically be "sharable anonymous RAM", >>> * except for some cases that really want "private anonymous RAM". >>> * >>> * This anonymous RAM *must* be private. This flag only applies to >>> * "anonymous" RAM, not fd/file-backed/preallocated one. >>> */ >>> RAM_FORCE_ANON_PRIVATE (1 << 13) >>> >>> >>> BUT maybe an even better alternative now that we have the "aux-ram-share" parameter, could we use >>> >>> /* >>> * Auxiliary RAM that was created automatically internally, instead of >>> * explicitly like using memory-backend-ram or some other device on the >>> * QEMU cmdline. >>> */ >>> RAM_AUX (1 << 13) >>> >>> >>> So it will be quite clear that "aux-ram-share" only applies to RAM_AUX RAMBlocks. >>> >>> That actually looks quite compelling to me :) >> >> Agreed, RAM_AUX is a clear solution. I would set it in these functions: >> qemu_ram_alloc_resizeable >> memory_region_init_ram_nomigrate >> memory_region_init_rom_nomigrate >> memory_region_init_rom_device_nomigrate >> >> and test it with aux_ram_share in qemu_ram_alloc_internal. >> if RAM_AUX && aux_ram_share >> flags |= RAM_SHARED >> >> However, we could just set RAM_SHARED at those same call sites: >> flags = current_machine->aux_ram_shared ? RAM_SHARED : 0; >> which is what I did in >> [PATCH V2 01/11] machine: alloc-anon option >> and test RAM_SHARED in qemu_ram_alloc_internal. >> No need for RAM_PRIVATE. >> >> RAM_AUX is nice because it declares intent more specifically. >> >> Your preference? > > My preference is either using RAM_AUX to flag AUX RAM, or the inverse, RAM_NON_AUX to flag non-aux RAM, such as from memory backends and likely ivshmem.c > > Peter still seems to prefer RAM_PRIVATE. So I guess it's up to you to decide ;) I like the inverse flag RAM_NON_AUX, better name TBD. The call sites are well defined. That is what my V3 hack was testing (modulo ivshmem). object_dynamic_cast(new_block->mr->parent_obj.parent, TYPE_MEMORY_BACKEND - Steve
On 08.11.24 15:37, Steven Sistare wrote: > On 11/8/2024 9:20 AM, David Hildenbrand wrote: >> On 08.11.24 14:56, Steven Sistare wrote: >>> On 11/8/2024 6:31 AM, David Hildenbrand wrote: >>>> On 07.11.24 17:40, Steven Sistare wrote: >>>>> On 11/7/2024 11:26 AM, David Hildenbrand wrote: >>>>>> On 07.11.24 17:02, Steven Sistare wrote: >>>>>>> On 11/7/2024 8:23 AM, David Hildenbrand wrote: >>>>>>>> On 06.11.24 21:12, Steven Sistare wrote: >>>>>>>>> On 11/4/2024 4:36 PM, David Hildenbrand wrote: >>>>>>>>>> On 04.11.24 21:56, Steven Sistare wrote: >>>>>>>>>>> On 11/4/2024 3:15 PM, David Hildenbrand wrote: >>>>>>>>>>>> On 04.11.24 20:51, David Hildenbrand wrote: >>>>>>>>>>>>> On 04.11.24 18:38, Steven Sistare wrote: >>>>>>>>>>>>>> On 11/4/2024 5:39 AM, David Hildenbrand wrote: >>>>>>>>>>>>>>> On 01.11.24 14:47, Steve Sistare wrote: >>>>>>>>>>>>>>>> Allocate anonymous memory using mmap MAP_ANON or memfd_create depending >>>>>>>>>>>>>>>> on the value of the anon-alloc machine property. This option applies to >>>>>>>>>>>>>>>> memory allocated as a side effect of creating various devices. It does >>>>>>>>>>>>>>>> not apply to memory-backend-objects, whether explicitly specified on >>>>>>>>>>>>>>>> the command line, or implicitly created by the -m command line option. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> The memfd option is intended to support new migration modes, in which the >>>>>>>>>>>>>>>> memory region can be transferred in place to a new QEMU process, by sending >>>>>>>>>>>>>>>> the memfd file descriptor to the process. Memory contents are preserved, >>>>>>>>>>>>>>>> and if the mode also transfers device descriptors, then pages that are >>>>>>>>>>>>>>>> locked in memory for DMA remain locked. This behavior is a pre-requisite >>>>>>>>>>>>>>>> for supporting vfio, vdpa, and iommufd devices with the new modes. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> A more portable, non-Linux specific variant of this will be using shm, >>>>>>>>>>>>>>> similar to backends/hostmem-shm.c. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Likely we should be using that instead of memfd, or try hiding the >>>>>>>>>>>>>>> details. See below. >>>>>>>>>>>>>> >>>>>>>>>>>>>> For this series I would prefer to use memfd and hide the details. It's a >>>>>>>>>>>>>> concise (and well tested) solution albeit linux only. The code you supply >>>>>>>>>>>>>> for posix shm would be a good follow on patch to support other unices. >>>>>>>>>>>>> >>>>>>>>>>>>> Unless there is reason to use memfd we should start with the more >>>>>>>>>>>>> generic POSIX variant that is available even on systems without memfd. >>>>>>>>>>>>> Factoring stuff out as I drafted does look quite compelling. >>>>>>>>>>>>> >>>>>>>>>>>>> I can help with the rework, and send it out separately, so you can focus >>>>>>>>>>>>> on the "machine toggle" as part of this series. >>>>>>>>>>>>> >>>>>>>>>>>>> Of course, if we find out we need the memfd internally instead under >>>>>>>>>>>>> Linux for whatever reason later, we can use that instead. >>>>>>>>>>>>> >>>>>>>>>>>>> But IIUC, the main selling point for memfd are additional features >>>>>>>>>>>>> (hugetlb, memory sealing) that you aren't even using. >>>>>>>>>>>> >>>>>>>>>>>> FWIW, I'm looking into some details, and one difference is that shmem_open() under Linux (glibc) seems to go to /dev/shmem and memfd/SYSV go to the internal tmpfs mount. There is not a big difference, but there can be some difference (e.g., sizing of the /dev/shm mount). >>>>>>>>>>> >>>>>>>>>>> Sizing is a non-trivial difference. One can by default allocate all memory using memfd_create. >>>>>>>>>>> To do so using shm_open requires configuration on the mount. One step harder to use. >>>>>>>>>> >>>>>>>>>> Yes. >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> This is a real issue for memory-backend-ram, and becomes an issue for the internal RAM >>>>>>>>>>> if memory-backend-ram has hogged all the memory. >>>>>>>>>>> >>>>>>>>>>>> Regarding memory-backend-ram,share=on, I assume we can use memfd if available, but then fallback to shm_open(). >>>>>>>>>>> >>>>>>>>>>> Yes, and if that is a good idea, then the same should be done for internal RAM >>>>>>>>>>> -- memfd if available and fallback to shm_open. >>>>>>>>>> >>>>>>>>>> Yes. >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> I'm hoping we can find a way where it just all is rather intuitive, like >>>>>>>>>>>> >>>>>>>>>>>> "default-ram-share=on": behave for internal RAM just like "memory-backend-ram,share=on" >>>>>>>>>>>> >>>>>>>>>>>> "memory-backend-ram,share=on": use whatever mechanism we have to give us "anonymous" memory that can be shared using an fd with another process. >>>>>>>>>>>> >>>>>>>>>>>> Thoughts? >>>>>>>>>>> >>>>>>>>>>> Agreed, though I thought I had already landed at the intuitive specification in my patch. >>>>>>>>>>> The user must explicitly configure memory-backend-* to be usable with CPR, and anon-alloc >>>>>>>>>>> controls everything else. Now we're just riffing on the details: memfd vs shm_open, spelling >>>>>>>>>>> of options and words to describe them. >>>>>>>>>> >>>>>>>>>> Well, yes, and making it all a bit more consistent and the "machine option" behave just like "memory-backend-ram,share=on". >>>>>>>>> >>>>>>>>> Hi David and Peter, >>>>>>>>> >>>>>>>>> I have implemented and tested the following, for both qemu_memfd_create >>>>>>>>> and qemu_shm_alloc. This is pseudo-code, with error conditions omitted >>>>>>>>> for simplicity. >>>>>>>>> >>>>>>>>> Any comments before I submit a complete patch? >>>>>>>>> >>>>>>>>> ---- >>>>>>>>> qemu-options.hx: >>>>>>>>> ``aux-ram-share=on|off`` >>>>>>>>> Allocate auxiliary guest RAM as an anonymous file that is >>>>>>>>> shareable with an external process. This option applies to >>>>>>>>> memory allocated as a side effect of creating various devices. >>>>>>>>> It does not apply to memory-backend-objects, whether explicitly >>>>>>>>> specified on the command line, or implicitly created by the -m >>>>>>>>> command line option. >>>>>>>>> >>>>>>>>> Some migration modes require aux-ram-share=on. >>>>>>>>> >>>>>>>>> qapi/migration.json: >>>>>>>>> @cpr-transfer: >>>>>>>>> ... >>>>>>>>> Memory-backend objects must have the share=on attribute, but >>>>>>>>> memory-backend-epc is not supported. The VM must be started >>>>>>>>> with the '-machine aux-ram-share=on' option. >>>>>>>>> >>>>>>>>> Define RAM_PRIVATE >>>>>>>>> >>>>>>>>> Define qemu_shm_alloc(), from David's tmp patch >>>>>>>>> >>>>>>>>> ram_backend_memory_alloc() >>>>>>>>> ram_flags = backend->share ? RAM_SHARED : RAM_PRIVATE; >>>>>>>>> memory_region_init_ram_flags_nomigrate(ram_flags) >>>>>>>>> >>>>>>>>> qemu_ram_alloc_internal() >>>>>>>>> ... >>>>>>>>> if (!host && !(ram_flags & RAM_PRIVATE) && current_machine->aux_ram_share) >>>>>>>>> new_block->flags |= RAM_SHARED; >>>>>>>>> >>>>>>>>> if (!host && (new_block->flags & RAM_SHARED)) { >>>>>>>>> qemu_ram_alloc_shared(new_block); >>>>>>>>> } else >>>>>>>>> new_block->fd = -1; >>>>>>>>> new_block->host = host; >>>>>>>>> } >>>>>>>>> ram_block_add(new_block); >>>>>>>>> >>>>>>>>> qemu_ram_alloc_shared() >>>>>>>>> if qemu_memfd_check() >>>>>>>>> new_block->fd = qemu_memfd_create() >>>>>>>>> else >>>>>>>>> new_block->fd = qemu_shm_alloc() >>>>>>>> >>>>>>>> Yes, that way "memory-backend-ram,share=on" will just mean "give me the best shared memory for RAM to be shared with other processes, I don't care about the details", and it will work on Linux kernels even before we had memfds. >>>>>>>> >>>>>>>> memory-backend-ram should be available on all architectures, and under Windows. qemu_anon_ram_alloc() under Linux just does nothing special, not even bail out. >>>>>>>> >>>>>>>> MAP_SHARED|MAP_ANON was always weird, because it meant "give me memory I can share only with subprocesses", but then, *there are not subprocesses for QEMU*. I recall there was a trick to obtain the fd under Linux for these regions using /proc/self/fd/, but it's very Linux specific ... >>>>>>>> >>>>>>>> So nobody would *actually* use that shared memory and it was only a hack for RDMA. Now we can do better. >>>>>>>> >>>>>>>> >>>>>>>> We'll have to decide if we simply fallback to qemu_anon_ram_alloc() if no shared memory can be created (unavailable), like we do on Windows. >>>>>>>> >>>>>>>> So maybe something like >>>>>>>> >>>>>>>> qemu_ram_alloc_shared() >>>>>>>> fd = -1; >>>>>>>> >>>>>>>> if (qemu_memfd_avilable()) { >>>>>>>> fd = qemu_memfd_create(); >>>>>>>> if (fd < 0) >>>>>>>> ... error >>>>>>>> } else if (qemu_shm_available()) >>>>>>>> fd = qemu_shm_alloc(); >>>>>>>> if (fd < 0) >>>>>>>> ... error >>>>>>>> } else { >>>>>>>> /* >>>>>>>> * Old behavior: try fd-less shared memory. We might >>>>>>>> * just end up with non-shared memory on Windows, but >>>>>>>> * nobody can make sure of this shared memory either way >>>>>>>> * ... should we just use non-shared memory? Or should >>>>>>>> * we simply bail out? But then, if there is no shared >>>>>>>> * memory nobody could possible use it. >>>>>>>> */ >>>>>>>> qemu_anon_ram_alloc(share=true) >>>>>>>> } >>>>>>> >>>>>>> Good catch. We need that fallback for backwards compatibility. Even with >>>>>>> no use case for memory-backend-ram,share=on since the demise of rdma, users >>>>>>> may specify it on windows, for no particular reason, but it works, and should >>>>>>> continue to work after this series. CPR would be blocked. >>>>>> >>>>>> Yes, we should keep Windows working in the weird way it is working right now. >>>>>> >>>>>> > > More generally for backwards compatibility for share=on for no particular reason, >>>>>>> should we fallback if qemu_shm_alloc fails? If /dev/shm is mounted with default >>>>>>> options and more than half of ram is requested, it will fail, whereas current qemu >>>>>>> succeeds using MAP_SHARED|MAP_ANON. >>>>>> >>>>>> Only on Linux without memfd, of course. Maybe we should just warn when qemu_shm_alloc() fails (and comment that we continue for compat reasons only) and fallback to the stupid qemu_anon_ram_alloc(share=true). We could implement a fallback to shmget() but ... let's not go down that path. >>>>>> >>>>>> But we should not fallback to qemu_shm_alloc()/MAP_SHARED|MAP_ANON if memfd is available and that allocating the memfd failed. Failing to allocate a memfd might highlight a bigger problem. >>>>> >>>>> Agreed on all. >>>>> >>>>> One more opinion from you please, if you will. >>>>> >>>>> RAM_PRIVATE is only checked in qemu_ram_alloc_internal, and only needs to be >>>>> set in >>>>> ram_backend_memory_alloc -> ... -> qemu_ram_alloc_internal >>>>> >>>>> None of the other backends reach qemu_ram_alloc_internal. >>>>> >>>>> To be future proof, do you prefer I also set MAP_PRIVATE in the other backends, >>>>> everywhere MAP_SHARED may be set, eg: >>>> >>>> Hm, I think then we should set RAM_PRIVATE really everywhere where we'd want it and relied on !RAM_SHARED doing the right thing. >>>> >>>> Alternatively, we make our life easier and do something like >>>> >>>> /* >>>> * This flag is only used while creating+allocating RAM, and >>>> * prevents RAM_SHARED getting set for anonymous RAM automatically in >>>> * some configurations. >>>> * >>>> * By default, not setting RAM_SHARED on anonymous RAM implies >>>> * "private anonymous RAM"; however, in some configuration we want to >>>> * have most of this RAM automatically be "sharable anonymous RAM", >>>> * except for some cases that really want "private anonymous RAM". >>>> * >>>> * This anonymous RAM *must* be private. This flag only applies to >>>> * "anonymous" RAM, not fd/file-backed/preallocated one. >>>> */ >>>> RAM_FORCE_ANON_PRIVATE (1 << 13) >>>> >>>> >>>> BUT maybe an even better alternative now that we have the "aux-ram-share" parameter, could we use >>>> >>>> /* >>>> * Auxiliary RAM that was created automatically internally, instead of >>>> * explicitly like using memory-backend-ram or some other device on the >>>> * QEMU cmdline. >>>> */ >>>> RAM_AUX (1 << 13) >>>> >>>> >>>> So it will be quite clear that "aux-ram-share" only applies to RAM_AUX RAMBlocks. >>>> >>>> That actually looks quite compelling to me :) >>> >>> Agreed, RAM_AUX is a clear solution. I would set it in these functions: >>> qemu_ram_alloc_resizeable >>> memory_region_init_ram_nomigrate >>> memory_region_init_rom_nomigrate >>> memory_region_init_rom_device_nomigrate >>> >>> and test it with aux_ram_share in qemu_ram_alloc_internal. >>> if RAM_AUX && aux_ram_share >>> flags |= RAM_SHARED >>> >>> However, we could just set RAM_SHARED at those same call sites: >>> flags = current_machine->aux_ram_shared ? RAM_SHARED : 0; >>> which is what I did in >>> [PATCH V2 01/11] machine: alloc-anon option >>> and test RAM_SHARED in qemu_ram_alloc_internal. >>> No need for RAM_PRIVATE. >>> >>> RAM_AUX is nice because it declares intent more specifically. >>> >>> Your preference? >> >> My preference is either using RAM_AUX to flag AUX RAM, or the inverse, RAM_NON_AUX to flag non-aux RAM, such as from memory backends and likely ivshmem.c >> >> Peter still seems to prefer RAM_PRIVATE. So I guess it's up to you to decide ;) > > I like the inverse flag RAM_NON_AUX, better name TBD. > The call sites are well defined. > That is what my V3 hack was testing (modulo ivshmem). > object_dynamic_cast(new_block->mr->parent_obj.parent, TYPE_MEMORY_BACKEND Likely AUX is everything that is "neither explicitly specified by the user nor very special RAM" So I think hw/misc/ivshmem.c would also not count as "aux", and similarly hw/remote/memory.c; both use memory_region_init_ram_from_fd(share=on). memory_region_init_ram_ptr/memory_region_init_ram_device_ptr are similarly special: we cannot possibly turn them SHARED. But that's also what your code already handled. So maybe, really everything is AUX ram, except * Using memory_region_init_ram_from_fd()/ memory_region_init_ram_from_file() users. * Using memory_region_init_ram_ptr / memory_region_init_ram_device_ptr * Created via memory backends Note that hw/m68k/next-cube.c is one odd RAM_SHARED user. I don't know why it uses RAM_SHARED to get anonymous shared RAM. Likely a mistake when that code was introduced. CCing Thomas. commit 956a78118bfc7fa512b03cbe8a77b9384c6d89f4 Author: Thomas Huth <huth@tuxfamily.org> Date: Sat Jun 30 08:45:25 2018 +0200 m68k: Add NeXTcube machine It is still quite incomplete (no SCSI, no floppy emulation, no network, etc.), but the firmware already shows up the debug monitor prompt in the framebuffer display, so at least the very basics are already working. This code has been taken from Bryce Lanham's GSoC 2011 NeXT branch at https://github.com/blanham/qemu-NeXT/blob/next-cube/hw/next-cube.c and altered quite a bit to fit the latest interface and coding conventions of the current QEMU.
On Fri, Nov 08, 2024 at 03:18:07PM +0100, David Hildenbrand wrote: > > > > Could anyone remind me why we can't simply set PRIVATE|SHARED all over the > > place? > > > IMHO RAM_AUX is too hard for any new callers to know how to set. It's > much > > easier when we already have SHARED, adding PRIVATE could be mostly natural, > > then we can already avoid AUX due to checking !SHARED & !PRIVATE. > > How is it clearer if you have to know whether you have to set RAM_PRIVATE or > not for some RAM? Because you *wouldn't* set it "all over the place". I think I answered that in previous reply, but exactly after the line where it got cut-off.. :) https://lore.kernel.org/qemu-devel/Zy4VkScMEpYayGtM@x1n/ Basically, SHARED|PRIVATE then must come from an user request (QMP or cmdline), otherwise the caller should always set none of them, implying aux. But I confess that's not accurate.. some caller is based on type of objects etc. to decide mem must be SHARED. A better version could be: RAM_SHARED|RAM_PRIVATE describes the share-able attribute of the RAM block. It can never be set together. When one is set, the memory attribute must suffice the request. When none is set, QEMU will decide how to request the memory. The flag should only be set if the caller has explicit requirement on such memory property. For example, it can come from either a request from user (share=on/off), or the memory must be shared / private due to its own attribute (shm objects, like ivshmem, shm backend, or remotely shared mem, etc.). Otherwise, callers should leave both flags unset. Maybe we should document this directly into the flag definitions, so what flags to set would be clearer than before, and just to say it's not the caller's own willingness to set SHARED | PRIVATE randomly (so as to make cpr available as much as possible). Thanks,
On Fri, Nov 08, 2024 at 03:54:13PM +0100, David Hildenbrand wrote: > Likely AUX is everything that is "neither explicitly specified by the user nor > very special RAM" > > So I think hw/misc/ivshmem.c would also not count as "aux", and similarly > hw/remote/memory.c; both use memory_region_init_ram_from_fd(share=on). > > memory_region_init_ram_ptr/memory_region_init_ram_device_ptr are similarly > special: we cannot possibly turn them SHARED. But that's also what your code > already handled. > > So maybe, really everything is AUX ram, except > * Using memory_region_init_ram_from_fd()/ > memory_region_init_ram_from_file() users. > * Using memory_region_init_ram_ptr / memory_region_init_ram_device_ptr > * Created via memory backends > > > Note that hw/m68k/next-cube.c is one odd RAM_SHARED user. I don't know why > it uses RAM_SHARED to get anonymous shared RAM. Likely a mistake when that > code was introduced. > > CCing Thomas. > > commit 956a78118bfc7fa512b03cbe8a77b9384c6d89f4 > Author: Thomas Huth <huth@tuxfamily.org> > Date: Sat Jun 30 08:45:25 2018 +0200 > > m68k: Add NeXTcube machine > It is still quite incomplete (no SCSI, no floppy emulation, no network, > etc.), but the firmware already shows up the debug monitor prompt in the > framebuffer display, so at least the very basics are already working. > This code has been taken from Bryce Lanham's GSoC 2011 NeXT branch at > https://github.com/blanham/qemu-NeXT/blob/next-cube/hw/next-cube.c > and altered quite a bit to fit the latest interface and coding conventions > of the current QEMU. This might also imply that our current RAM_SHARED is already not crystal clear on when to use, not to mention RAM_AUX if to be introduced.. Please see my other email, trying to define RAM_SHARED properly. IIUC after we can properly define RAM_SHARED, then we don't need AUX, and everything will be crystal clear. Thanks,
On 08.11.24 16:07, Peter Xu wrote: > On Fri, Nov 08, 2024 at 03:54:13PM +0100, David Hildenbrand wrote: >> Likely AUX is everything that is "neither explicitly specified by the user nor >> very special RAM" >> >> So I think hw/misc/ivshmem.c would also not count as "aux", and similarly >> hw/remote/memory.c; both use memory_region_init_ram_from_fd(share=on). >> >> memory_region_init_ram_ptr/memory_region_init_ram_device_ptr are similarly >> special: we cannot possibly turn them SHARED. But that's also what your code >> already handled. >> >> So maybe, really everything is AUX ram, except >> * Using memory_region_init_ram_from_fd()/ >> memory_region_init_ram_from_file() users. >> * Using memory_region_init_ram_ptr / memory_region_init_ram_device_ptr >> * Created via memory backends >> >> >> Note that hw/m68k/next-cube.c is one odd RAM_SHARED user. I don't know why >> it uses RAM_SHARED to get anonymous shared RAM. Likely a mistake when that >> code was introduced. >> >> CCing Thomas. >> >> commit 956a78118bfc7fa512b03cbe8a77b9384c6d89f4 >> Author: Thomas Huth <huth@tuxfamily.org> >> Date: Sat Jun 30 08:45:25 2018 +0200 >> >> m68k: Add NeXTcube machine >> It is still quite incomplete (no SCSI, no floppy emulation, no network, >> etc.), but the firmware already shows up the debug monitor prompt in the >> framebuffer display, so at least the very basics are already working. >> This code has been taken from Bryce Lanham's GSoC 2011 NeXT branch at >> https://github.com/blanham/qemu-NeXT/blob/next-cube/hw/next-cube.c >> and altered quite a bit to fit the latest interface and coding conventions >> of the current QEMU. > > This might also imply that our current RAM_SHARED is already not crystal > clear on when to use, not to mention RAM_AUX if to be introduced.. Likely not. When the code was introduced we used magic boolean parameters and likely "true" was set by accident. There are not that many RAM_SHARED users at all ... Anyhow .... Please > see my other email, trying to define RAM_SHARED properly. > > IIUC after we can properly define RAM_SHARED, then we don't need AUX, and > everything will be crystal clear. I think I still prefer RAM_NO_AUX, but I'll leave it to you and Steven to figure out, it's been way to many emails at this point :)
> CCing Thomas. > > commit 956a78118bfc7fa512b03cbe8a77b9384c6d89f4 > Author: Thomas Huth <huth@tuxfamily.org> > Date: Sat Jun 30 08:45:25 2018 +0200 > > m68k: Add NeXTcube machine > > It is still quite incomplete (no SCSI, no floppy emulation, no network, > etc.), but the firmware already shows up the debug monitor prompt in the > framebuffer display, so at least the very basics are already working. > > This code has been taken from Bryce Lanham's GSoC 2011 NeXT branch at > > https://github.com/blanham/qemu-NeXT/blob/next-cube/hw/next-cube.c > > and altered quite a bit to fit the latest interface and coding conventions > of the current QEMU. Staring at that link, the code was /* MMIO */ cpu_register_physical_memory((uint32_t)0x2000000,0xD0000, cpu_register_io_memory(mmio_read, mmio_write, (void *)env,DEVICE_NATIVE_ENDIAN)); /* BMAP */ //acts as a catch-all for now cpu_register_physical_memory((uint32_t)0x2100000,0x3A7FF, cpu_register_io_memory(scr_read, scr_write, (void *)env,DEVICE_NATIVE_ENDIAN)); Which we converted to /* MMIO */ memory_region_init_io(mmiomem, NULL, &mmio_ops, machine, "next.mmio", 0xD0000); memory_region_add_subregion(sysmem, 0x02000000, mmiomem); /* BMAP memory */ memory_region_init_ram_shared_nomigrate(bmapm1, NULL, "next.bmapmem", 64, true, &error_fatal); memory_region_add_subregion(sysmem, 0x020c0000, bmapm1); So likely the "true" was added by mistake.
diff --git a/hw/core/machine.c b/hw/core/machine.c index adaba17..a89a32b 100644 --- a/hw/core/machine.c +++ b/hw/core/machine.c @@ -460,6 +460,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp) ms->mem_merge = value; } +static int machine_get_anon_alloc(Object *obj, Error **errp) +{ + MachineState *ms = MACHINE(obj); + + return ms->anon_alloc; +} + +static void machine_set_anon_alloc(Object *obj, int value, Error **errp) +{ + MachineState *ms = MACHINE(obj); + + ms->anon_alloc = value; +} + static bool machine_get_usb(Object *obj, Error **errp) { MachineState *ms = MACHINE(obj); @@ -1078,6 +1092,11 @@ static void machine_class_init(ObjectClass *oc, void *data) object_class_property_set_description(oc, "mem-merge", "Enable/disable memory merge support"); + object_class_property_add_enum(oc, "anon-alloc", "AnonAllocOption", + &AnonAllocOption_lookup, + machine_get_anon_alloc, + machine_set_anon_alloc); + object_class_property_add_bool(oc, "usb", machine_get_usb, machine_set_usb); object_class_property_set_description(oc, "usb", diff --git a/include/hw/boards.h b/include/hw/boards.h index 5966069..5a87647 100644 --- a/include/hw/boards.h +++ b/include/hw/boards.h @@ -393,6 +393,7 @@ struct MachineState { bool enable_graphics; ConfidentialGuestSupport *cgs; HostMemoryBackend *memdev; + AnonAllocOption anon_alloc; /* * convenience alias to ram_memdev_id backend memory region * or to numa container memory region diff --git a/qapi/machine.json b/qapi/machine.json index 3cc055b..f634c40 100644 --- a/qapi/machine.json +++ b/qapi/machine.json @@ -1898,3 +1898,17 @@ { 'command': 'x-query-interrupt-controllers', 'returns': 'HumanReadableText', 'features': [ 'unstable' ]} + +## +# @AnonAllocOption: +# +# An enumeration of the options for allocating anonymous guest memory. +# +# @mmap: allocate using mmap MAP_ANON +# +# @memfd: allocate using memfd_create +# +# Since: 9.2 +## +{ 'enum': 'AnonAllocOption', + 'data': [ 'mmap', 'memfd' ] } diff --git a/qemu-options.hx b/qemu-options.hx index dacc979..fdd6bf2 100644 --- a/qemu-options.hx +++ b/qemu-options.hx @@ -38,6 +38,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \ " nvdimm=on|off controls NVDIMM support (default=off)\n" " memory-encryption=@var{} memory encryption object to use (default=none)\n" " hmat=on|off controls ACPI HMAT support (default=off)\n" + " anon-alloc=mmap|memfd allocate anonymous guest RAM using mmap MAP_ANON or memfd_create (default: mmap)\n" " memory-backend='backend-id' specifies explicitly provided backend for main RAM (default=none)\n" " cxl-fmw.0.targets.0=firsttarget,cxl-fmw.0.targets.1=secondtarget,cxl-fmw.0.size=size[,cxl-fmw.0.interleave-granularity=granularity]\n", QEMU_ARCH_ALL) @@ -101,6 +102,16 @@ SRST Enables or disables ACPI Heterogeneous Memory Attribute Table (HMAT) support. The default is off. + ``anon-alloc=mmap|memfd`` + Allocate anonymous guest RAM using mmap MAP_ANON (the default) + or memfd_create. This option applies to memory allocated as a + side effect of creating various devices. It does not apply to + memory-backend-objects, whether explicitly specified on the + command line, or implicitly created by the -m command line + option. + + Some migration modes require anon-alloc=memfd. + ``memory-backend='id'`` An alternative to legacy ``-mem-path`` and ``mem-prealloc`` options. Allows to use a memory backend as main RAM. diff --git a/system/physmem.c b/system/physmem.c index dc1db3a..174f7e0 100644 --- a/system/physmem.c +++ b/system/physmem.c @@ -47,6 +47,7 @@ #include "qemu/qemu-print.h" #include "qemu/log.h" #include "qemu/memalign.h" +#include "qemu/memfd.h" #include "exec/memory.h" #include "exec/ioport.h" #include "sysemu/dma.h" @@ -69,6 +70,8 @@ #include "qemu/pmem.h" +#include "qapi/qapi-types-migration.h" +#include "migration/options.h" #include "migration/vmstate.h" #include "qemu/range.h" @@ -1849,6 +1852,35 @@ static void ram_block_add(RAMBlock *new_block, Error **errp) qemu_mutex_unlock_ramlist(); return; } + + } else if (current_machine->anon_alloc == ANON_ALLOC_OPTION_MEMFD && + !object_dynamic_cast(new_block->mr->parent_obj.parent, + TYPE_MEMORY_BACKEND)) { + size_t max_length = new_block->max_length; + MemoryRegion *mr = new_block->mr; + const char *name = memory_region_name(mr); + + new_block->mr->align = QEMU_VMALLOC_ALIGN; + new_block->flags |= RAM_SHARED; + + if (new_block->fd == -1) { + new_block->fd = qemu_memfd_create(name, max_length + mr->align, + 0, 0, 0, errp); + } + + if (new_block->fd >= 0) { + int mfd = new_block->fd; + qemu_set_cloexec(mfd); + new_block->host = file_ram_alloc(new_block, max_length, mfd, + false, 0, errp); + } + if (!new_block->host) { + qemu_mutex_unlock_ramlist(); + return; + } + memory_try_enable_merging(new_block->host, new_block->max_length); + free_on_error = true; + } else { new_block->host = qemu_anon_ram_alloc(new_block->max_length, &new_block->mr->align, @@ -1932,6 +1964,9 @@ static void ram_block_add(RAMBlock *new_block, Error **errp) ram_block_notify_add(new_block->host, new_block->used_length, new_block->max_length); } + trace_ram_block_add(memory_region_name(new_block->mr), new_block->flags, + new_block->fd, new_block->used_length, + new_block->max_length); return; out_free: diff --git a/system/trace-events b/system/trace-events index 074d001..267daca 100644 --- a/system/trace-events +++ b/system/trace-events @@ -47,3 +47,6 @@ dirtylimit_vcpu_execute(int cpu_index, int64_t sleep_time_us) "CPU[%d] sleep %"P # cpu-throttle.c cpu_throttle_set(int new_throttle_pct) "set guest CPU throttled by %d%%" + +#physmem.c +ram_block_add(const char *name, uint32_t flags, int fd, size_t used_length, size_t max_length) "%s, flags %u, fd %d, len %zu, maxlen %zu"
Allocate anonymous memory using mmap MAP_ANON or memfd_create depending on the value of the anon-alloc machine property. This option applies to memory allocated as a side effect of creating various devices. It does not apply to memory-backend-objects, whether explicitly specified on the command line, or implicitly created by the -m command line option. The memfd option is intended to support new migration modes, in which the memory region can be transferred in place to a new QEMU process, by sending the memfd file descriptor to the process. Memory contents are preserved, and if the mode also transfers device descriptors, then pages that are locked in memory for DMA remain locked. This behavior is a pre-requisite for supporting vfio, vdpa, and iommufd devices with the new modes. Signed-off-by: Steve Sistare <steven.sistare@oracle.com> --- hw/core/machine.c | 19 +++++++++++++++++++ include/hw/boards.h | 1 + qapi/machine.json | 14 ++++++++++++++ qemu-options.hx | 11 +++++++++++ system/physmem.c | 35 +++++++++++++++++++++++++++++++++++ system/trace-events | 3 +++ 6 files changed, 83 insertions(+)