Message ID | 20230302110925.4680-1-fam.zheng@bytedance.com (mailing list archive)
---|---
State | New, archived
Series | hostmem: Add clear option to file backend
On 02.03.23 12:09, Fam Zheng wrote:
> This adds a memset to clear the backing memory. This is useful in the case of PMEM DAX to drop dirty data, if the backing memory is handed over from a previous application or firmware which didn't clean up before exiting.

Why can't the VM manager do that instead? If you have a file that's certainly easily possible.
> On 2 Mar 2023, at 11:31, David Hildenbrand <david@redhat.com> wrote:
>
> On 02.03.23 12:09, Fam Zheng wrote:
>> This adds a memset to clear the backing memory. This is useful in the case of PMEM DAX to drop dirty data, if the backing memory is handed over from a previous application or firmware which didn't clean up before exiting.
>
> Why can't the VM manager do that instead? If you have a file that's certainly easily possible.

Hi David,

Technically yes, but I have a simple VM manager here which wants to avoid replicating the same mmap code, such as handling the flags depending on share=on|off,hugepages=on|off. All in all this approach requires the least additional code to achieve it.

Thanks,
Fam

> --
> Thanks,
>
> David / dhildenb
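The "mmap code" in question is the flag handling QEMU already does for these options. A rough, generic sketch of what a VM manager would otherwise have to reimplement; the function and parameter names here are illustrative, not QEMU's:

#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Illustrative helper, not QEMU code: the share=on|off choice maps to
 * MAP_SHARED vs MAP_PRIVATE; hugepages are normally obtained by pointing
 * the backing path at a hugetlbfs mount rather than by an extra mmap flag.
 */
static void *map_backing(int fd, size_t size, bool shared)
{
    int flags = shared ? MAP_SHARED : MAP_PRIVATE;

    return mmap(NULL, size, PROT_READ | PROT_WRITE, flags, fd, 0);
}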
On 02.03.23 12:37, Feiran Zheng wrote:
>
>
>> On 2 Mar 2023, at 11:31, David Hildenbrand <david@redhat.com> wrote:
>>
>> On 02.03.23 12:09, Fam Zheng wrote:
>>> This adds a memset to clear the backing memory. This is useful in the case of PMEM DAX to drop dirty data, if the backing memory is handed over from a previous application or firmware which didn't clean up before exiting.
>>
>> Why can't the VM manager do that instead? If you have a file that's certainly easily possible.
>
> Hi David,
>
> Technically yes, but I have a simple VM manager here which wants to avoid replicating the same mmap code, such as handling the flags depending on share=on|off,hugepages=on|off. All in all this approach requires the least additional code to achieve it.

so ... we're supposed to maintain that code in QEMU instead to make your life easier ? :)

Sorry, for this particular use case I don't see the big benefit of moving that code into QEMU.
On Thu, Mar 02, 2023 at 12:31:46PM +0100, David Hildenbrand wrote:
> On 02.03.23 12:09, Fam Zheng wrote:
> > This adds a memset to clear the backing memory. This is useful in the case of PMEM DAX to drop dirty data, if the backing memory is handed over from a previous application or firmware which didn't clean up before exiting.
> >
>
> Why can't the VM manager do that instead? If you have a file that's certainly easily possible.

This feels conceptually similar to the case where you expose a host block device to the guest. If that block device was previously given to a different guest it might still have data in it. Someone needs to take responsibility for scrubbing that data. Since that may take a non-trivial amount of time, it is typical to do that scrubbing in the background after the old VM is gone rather than putting it into the startup path for a new VM, which would delay boot.

PMEM is blurring the boundary between memory and disk, but the tradeoff is not so different. We know that in general merely faulting in guest memory is quite time consuming and delays VM startup significantly as RAM size increases. Doing the full memset can only be slower still.

For prealloc we've created complex code to fault in memory across many threads, and even that's too slow, so we're considering doing it in the background as the VM starts up.

IIUC, this patch just puts the memset in the critical serialized path. This will inevitably lead to a demand for improving performance by parallelizing across threads, but we know that's too slow already, and we can't play the background async game with memset as that's actually changing guest-visible contents.

IOW, for large PMEM sizes, it does look compelling to do the clearing of old data in the background, outside the context of QEMU VM startup, to avoid delayed startup.

I can still understand the appeal of a simple flag to set on QEMU from a usability POV, but I'm not sure it's a good idea to encourage this usage by mgmt apps.

With regards,
Daniel
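"Parallelizing across threads" here means splitting the mapped region into chunks and zeroing each chunk from its own worker, roughly the shape of QEMU's preallocation thread pool. A minimal standalone sketch of the idea, not QEMU code; the thread count and names are illustrative assumptions:

#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define CLEAR_THREADS 8

struct clear_job {
    char *start;
    size_t len;
};

static void *clear_worker(void *opaque)
{
    struct clear_job *job = opaque;

    memset(job->start, 0, job->len);   /* faults in and zeroes every page of the chunk */
    return NULL;
}

static void clear_region_parallel(void *area, size_t size)
{
    pthread_t threads[CLEAR_THREADS];
    struct clear_job jobs[CLEAR_THREADS];
    size_t chunk = size / CLEAR_THREADS;

    for (int i = 0; i < CLEAR_THREADS; i++) {
        jobs[i].start = (char *)area + i * chunk;
        /* the last worker also picks up the remainder */
        jobs[i].len = (i == CLEAR_THREADS - 1) ? size - i * chunk : chunk;
        pthread_create(&threads[i], NULL, clear_worker, &jobs[i]);
    }
    for (int i = 0; i < CLEAR_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }
}

As Daniel notes, even this tends to be too slow for very large backends, which is why scrubbing outside the VM startup path is attractive.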
> On 2 Mar 2023, at 11:39, David Hildenbrand <david@redhat.com> wrote:
>
> On 02.03.23 12:37, Feiran Zheng wrote:
>>> On 2 Mar 2023, at 11:31, David Hildenbrand <david@redhat.com> wrote:
>>>
>>> On 02.03.23 12:09, Fam Zheng wrote:
>>>> This adds a memset to clear the backing memory. This is useful in the case of PMEM DAX to drop dirty data, if the backing memory is handed over from a previous application or firmware which didn't clean up before exiting.
>>>
>>> Why can't the VM manager do that instead? If you have a file that's certainly easily possible.
>> Hi David,
>> Technically yes, but I have a simple VM manager here which wants to avoid replicating the same mmap code, such as handling the flags depending on share=on|off,hugepages=on|off. All in all this approach requires the least additional code to achieve it.
>
> so ... we're supposed to maintain that code in QEMU instead to make your life easier ? :)
>
> Sorry, for this particular use case I don't see the big benefit of moving that code into QEMU.
>

I am posting because this does not only make my life easier; supposedly it also makes other developers' lives easier, because the file here can be a char file and there is no easy way to clear it (/dev/dax1.0) from the command line if you want to invoke a QEMU command directly.

Maybe I'm missing a convenient command to clear a DAX char file?

Surely it doesn't make a big difference either way and I am not worried about maintaining the code outside QEMU, but I just think this flag is a nice thing in QEMU anyway.

Fam

> --
> Thanks,
>
> David / dhildenb
On 02.03.23 12:48, Feiran Zheng wrote:
>
>
>> On 2 Mar 2023, at 11:39, David Hildenbrand <david@redhat.com> wrote:
>>
>> On 02.03.23 12:37, Feiran Zheng wrote:
>>>> On 2 Mar 2023, at 11:31, David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 02.03.23 12:09, Fam Zheng wrote:
>>>>> This adds a memset to clear the backing memory. This is useful in the case of PMEM DAX to drop dirty data, if the backing memory is handed over from a previous application or firmware which didn't clean up before exiting.
>>>>
>>>> Why can't the VM manager do that instead? If you have a file that's certainly easily possible.
>>> Hi David,
>>> Technically yes, but I have a simple VM manager here which wants to avoid replicating the same mmap code, such as handling the flags depending on share=on|off,hugepages=on|off. All in all this approach requires the least additional code to achieve it.
>>
>> so ... we're supposed to maintain that code in QEMU instead to make your life easier ? :)
>>
>> Sorry, for this particular use case I don't see the big benefit of moving that code into QEMU.
>>
>
> I am posting because this does not only make my life easier; supposedly it also makes other developers' lives easier, because the file here can be a char file and there is no easy way to clear it (/dev/dax1.0) from the command line if you want to invoke a QEMU command directly.
>
> Maybe I'm missing a convenient command to clear a DAX char file?

Can't you simply use dd and read from /dev/zero?
> On 2 Mar 2023, at 11:44, Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Thu, Mar 02, 2023 at 12:31:46PM +0100, David Hildenbrand wrote:
>> On 02.03.23 12:09, Fam Zheng wrote:
>>> This adds a memset to clear the backing memory. This is useful in the case of PMEM DAX to drop dirty data, if the backing memory is handed over from a previous application or firmware which didn't clean up before exiting.
>>>
>>
>> Why can't the VM manager do that instead? If you have a file that's certainly easily possible.
>
> This feels conceptually similar to the case where you expose a host block device to the guest. If that block device was previously given to a different guest it might still have data in it. Someone needs to take responsibility for scrubbing that data. Since that may take a non-trivial amount of time, it is typical to do that scrubbing in the background after the old VM is gone rather than putting it into the startup path for a new VM, which would delay boot.
>
> PMEM is blurring the boundary between memory and disk, but the tradeoff is not so different. We know that in general merely faulting in guest memory is quite time consuming and delays VM startup significantly as RAM size increases. Doing the full memset can only be slower still.
>
> For prealloc we've created complex code to fault in memory across many threads, and even that's too slow, so we're considering doing it in the background as the VM starts up.
>
> IIUC, this patch just puts the memset in the critical serialized path. This will inevitably lead to a demand for improving performance by parallelizing across threads, but we know that's too slow already, and we can't play the background async game with memset as that's actually changing guest-visible contents.
>
> IOW, for large PMEM sizes, it does look compelling to do the clearing of old data in the background, outside the context of QEMU VM startup, to avoid delayed startup.
>
> I can still understand the appeal of a simple flag to set on QEMU from a usability POV, but I'm not sure it's a good idea to encourage this usage by mgmt apps.

I can totally see the reasoning about the latency here, but I'm a little dubious that multi-threading for memset can actually help reduce the start-up time; the total cost is going to be bound by memory bandwidth between the CPU and memory (even more so if it's PMEM), which is limited.

Fam
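To put rough, purely illustrative numbers on that point: at an assumed ~2 GB/s of sustained write bandwidth to a persistent-memory device, zeroing a 512 GiB namespace takes on the order of 250 seconds, regardless of how many threads issue the stores once the media bandwidth is saturated. Extra threads mainly help when per-page fault and page-table overhead dominates, as with preallocating small pages in DRAM, rather than when raw device bandwidth is the limit.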
> On 2 Mar 2023, at 11:54, David Hildenbrand <david@redhat.com> wrote:
>
> On 02.03.23 12:48, Feiran Zheng wrote:
>>> On 2 Mar 2023, at 11:39, David Hildenbrand <david@redhat.com> wrote:
>>>
>>> On 02.03.23 12:37, Feiran Zheng wrote:
>>>>> On 2 Mar 2023, at 11:31, David Hildenbrand <david@redhat.com> wrote:
>>>>>
>>>>> On 02.03.23 12:09, Fam Zheng wrote:
>>>>>> This adds a memset to clear the backing memory. This is useful in the case of PMEM DAX to drop dirty data, if the backing memory is handed over from a previous application or firmware which didn't clean up before exiting.
>>>>>
>>>>> Why can't the VM manager do that instead? If you have a file that's certainly easily possible.
>>>> Hi David,
>>>> Technically yes, but I have a simple VM manager here which wants to avoid replicating the same mmap code, such as handling the flags depending on share=on|off,hugepages=on|off. All in all this approach requires the least additional code to achieve it.
>>>
>>> so ... we're supposed to maintain that code in QEMU instead to make your life easier ? :)
>>>
>>> Sorry, for this particular use case I don't see the big benefit of moving that code into QEMU.
>>>
>> I am posting because this does not only make my life easier; supposedly it also makes other developers' lives easier, because the file here can be a char file and there is no easy way to clear it (/dev/dax1.0) from the command line if you want to invoke a QEMU command directly.
>> Maybe I'm missing a convenient command to clear a DAX char file?
>
> Can't you simply use dd and read from /dev/zero?
>

I don't think it works for DAX, because the driver only implements mmap, not read/write:

# strace -e write dd if=/dev/zero of=/dev/dax1.0 bs=1G count=1
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1073741824) = -1 EINVAL (Invalid argument)
write(2, "dd: error writing '/dev/dax1.0':"..., 50dd: error writing '/dev/dax1.0': Invalid argument

Fam
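Since device DAX nodes reject read()/write(), clearing one from outside QEMU means going through a mapping. A hypothetical standalone sketch of that; the device path and size below are assumptions (in practice the size would typically be read from sysfs, e.g. /sys/bus/dax/devices/dax1.0/size):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/dev/dax1.0";      /* assumed device node */
    size_t size = (size_t)8 << 30;         /* assumed 8 GiB namespace size */
    int fd = open(path, O_RDWR);

    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Device DAX only supports mmap; length/offset must respect its alignment. */
    void *area = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (area == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }

    memset(area, 0, size);                 /* drop the old contents */

    /*
     * For real persistent memory the zeroes should also be flushed out of
     * the CPU caches (e.g. with libpmem's pmem_persist()); plain msync()
     * is not a reliable substitute on device DAX.
     */
    munmap(area, size);
    close(fd);
    return 0;
}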
On 02.03.23 12:57, Feiran Zheng wrote:
>
>
>> On 2 Mar 2023, at 11:44, Daniel P. Berrangé <berrange@redhat.com> wrote:
>>
>> On Thu, Mar 02, 2023 at 12:31:46PM +0100, David Hildenbrand wrote:
>>> On 02.03.23 12:09, Fam Zheng wrote:
>>>> This adds a memset to clear the backing memory. This is useful in the case of PMEM DAX to drop dirty data, if the backing memory is handed over from a previous application or firmware which didn't clean up before exiting.
>>>>
>>>
>>> Why can't the VM manager do that instead? If you have a file that's certainly easily possible.
>>
>> This feels conceptually similar to the case where you expose a host block device to the guest. If that block device was previously given to a different guest it might still have data in it. Someone needs to take responsibility for scrubbing that data. Since that may take a non-trivial amount of time, it is typical to do that scrubbing in the background after the old VM is gone rather than putting it into the startup path for a new VM, which would delay boot.
>>
>> PMEM is blurring the boundary between memory and disk, but the tradeoff is not so different. We know that in general merely faulting in guest memory is quite time consuming and delays VM startup significantly as RAM size increases. Doing the full memset can only be slower still.
>>
>> For prealloc we've created complex code to fault in memory across many threads, and even that's too slow, so we're considering doing it in the background as the VM starts up.
>>
>> IIUC, this patch just puts the memset in the critical serialized path. This will inevitably lead to a demand for improving performance by parallelizing across threads, but we know that's too slow already, and we can't play the background async game with memset as that's actually changing guest-visible contents.
>>
>> IOW, for large PMEM sizes, it does look compelling to do the clearing of old data in the background, outside the context of QEMU VM startup, to avoid delayed startup.
>>
>> I can still understand the appeal of a simple flag to set on QEMU from a usability POV, but I'm not sure it's a good idea to encourage this usage by mgmt apps.
>
> I can totally see the reasoning about the latency here, but I'm a little dubious that multi-threading for memset can actually help reduce the start-up time; the total cost is going to be bound by memory bandwidth between the CPU and memory (even more so if it's PMEM), which is limited.

Right, daxio is the magic bit:

daxio.x86_64 : Perform I/O on Device DAX devices or zero a Device DAX device

# daxio -z -o /dev/dax0.0
daxio: copied 8587837440 bytes to device "/dev/dax0.0"
On 02.03.23 14:56, David Hildenbrand wrote:
> On 02.03.23 12:57, Feiran Zheng wrote:
>>
>>
>>> On 2 Mar 2023, at 11:44, Daniel P. Berrangé <berrange@redhat.com> wrote:
>>>
>>> On Thu, Mar 02, 2023 at 12:31:46PM +0100, David Hildenbrand wrote:
>>>> On 02.03.23 12:09, Fam Zheng wrote:
>>>>> This adds a memset to clear the backing memory. This is useful in the case of PMEM DAX to drop dirty data, if the backing memory is handed over from a previous application or firmware which didn't clean up before exiting.
>>>>>
>>>>
>>>> Why can't the VM manager do that instead? If you have a file that's certainly easily possible.
>>>
>>> This feels conceptually similar to the case where you expose a host block device to the guest. If that block device was previously given to a different guest it might still have data in it. Someone needs to take responsibility for scrubbing that data. Since that may take a non-trivial amount of time, it is typical to do that scrubbing in the background after the old VM is gone rather than putting it into the startup path for a new VM, which would delay boot.
>>>
>>> PMEM is blurring the boundary between memory and disk, but the tradeoff is not so different. We know that in general merely faulting in guest memory is quite time consuming and delays VM startup significantly as RAM size increases. Doing the full memset can only be slower still.
>>>
>>> For prealloc we've created complex code to fault in memory across many threads, and even that's too slow, so we're considering doing it in the background as the VM starts up.
>>>
>>> IIUC, this patch just puts the memset in the critical serialized path. This will inevitably lead to a demand for improving performance by parallelizing across threads, but we know that's too slow already, and we can't play the background async game with memset as that's actually changing guest-visible contents.
>>>
>>> IOW, for large PMEM sizes, it does look compelling to do the clearing of old data in the background, outside the context of QEMU VM startup, to avoid delayed startup.
>>>
>>> I can still understand the appeal of a simple flag to set on QEMU from a usability POV, but I'm not sure it's a good idea to encourage this usage by mgmt apps.
>>
>> I can totally see the reasoning about the latency here, but I'm a little dubious that multi-threading for memset can actually help reduce the start-up time; the total cost is going to be bound by memory bandwidth between the CPU and memory (even more so if it's PMEM), which is limited.
>
> Right, daxio is the magic bit:
>
> daxio.x86_64 : Perform I/O on Device DAX devices or zero a Device DAX device
>
> # daxio -z -o /dev/dax0.0
> daxio: copied 8587837440 bytes to device "/dev/dax0.0"
>

^ accidentally replied that to the wrong thread.

Regarding this thread: memory preallocation (page zeroing) benefits heavily from concurrency in QEMU. I assume it would similarly help on pmem; I didn't try, though.
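The concurrency David refers to is exposed on hostmem backends as the prealloc-threads property. An illustrative command-line fragment, with made-up id, size and path:

-object memory-backend-file,id=mem0,size=128G,mem-path=/dev/hugepages,share=on,prealloc=on,prealloc-threads=16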
On Thu, Mar 02, 2023 at 02:56:43PM +0100, David Hildenbrand wrote:
> On 02.03.23 12:57, Feiran Zheng wrote:
> >
> >
> > > On 2 Mar 2023, at 11:44, Daniel P. Berrangé <berrange@redhat.com> wrote:
> > >
> > > On Thu, Mar 02, 2023 at 12:31:46PM +0100, David Hildenbrand wrote:
> > > > On 02.03.23 12:09, Fam Zheng wrote:
> > > > > This adds a memset to clear the backing memory. This is useful in the case of PMEM DAX to drop dirty data, if the backing memory is handed over from a previous application or firmware which didn't clean up before exiting.
> > > > >
> > > >
> > > > Why can't the VM manager do that instead? If you have a file that's certainly easily possible.
> > >
> > > This feels conceptually similar to the case where you expose a host block device to the guest. If that block device was previously given to a different guest it might still have data in it. Someone needs to take responsibility for scrubbing that data. Since that may take a non-trivial amount of time, it is typical to do that scrubbing in the background after the old VM is gone rather than putting it into the startup path for a new VM, which would delay boot.
> > >
> > > PMEM is blurring the boundary between memory and disk, but the tradeoff is not so different. We know that in general merely faulting in guest memory is quite time consuming and delays VM startup significantly as RAM size increases. Doing the full memset can only be slower still.
> > >
> > > For prealloc we've created complex code to fault in memory across many threads, and even that's too slow, so we're considering doing it in the background as the VM starts up.
> > >
> > > IIUC, this patch just puts the memset in the critical serialized path. This will inevitably lead to a demand for improving performance by parallelizing across threads, but we know that's too slow already, and we can't play the background async game with memset as that's actually changing guest-visible contents.
> > >
> > > IOW, for large PMEM sizes, it does look compelling to do the clearing of old data in the background, outside the context of QEMU VM startup, to avoid delayed startup.
> > >
> > > I can still understand the appeal of a simple flag to set on QEMU from a usability POV, but I'm not sure it's a good idea to encourage this usage by mgmt apps.
> >
> > I can totally see the reasoning about the latency here, but I'm a little dubious that multi-threading for memset can actually help reduce the start-up time; the total cost is going to be bound by memory bandwidth between the CPU and memory (even more so if it's PMEM), which is limited.
>
> Right, daxio is the magic bit:
>
> daxio.x86_64 : Perform I/O on Device DAX devices or zero a Device DAX device
>
> # daxio -z -o /dev/dax0.0
> daxio: copied 8587837440 bytes to device "/dev/dax0.0"

I think Dan's concerns are valid, but I noticed daxio also just calls pmem_memset_persist(), so it's doing pretty much the same single-threaded thing as the patch:

https://github.com/pmem/pmdk/blob/master/src/tools/daxio/daxio.c#L506

Stefan
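For reference, a minimal sketch of that same single-threaded approach done directly with libpmem (PMDK), which is roughly what the daxio path linked above boils down to; the device path is an assumption and error handling is minimal:

#include <libpmem.h>
#include <stdio.h>

int main(void)
{
    size_t mapped_len;
    int is_pmem;
    /* For device DAX the length must be 0 and PMEM_FILE_CREATE is not used. */
    void *addr = pmem_map_file("/dev/dax1.0", 0, 0, 0, &mapped_len, &is_pmem);

    if (addr == NULL) {
        perror("pmem_map_file");
        return 1;
    }

    /* Zero the whole namespace and flush the stores to the media. */
    pmem_memset_persist(addr, 0, mapped_len);

    pmem_unmap(addr, mapped_len);
    return 0;
}

Unlike a plain memset, pmem_memset_persist() also flushes the zeroes out of the CPU caches to the media.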
diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
index 25141283c4..f7468d24ce 100644
--- a/backends/hostmem-file.c
+++ b/backends/hostmem-file.c
@@ -30,6 +30,7 @@ struct HostMemoryBackendFile {
     bool discard_data;
     bool is_pmem;
     bool readonly;
+    bool clear;
 };
 
 static void
@@ -56,6 +57,7 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
     ram_flags = backend->share ? RAM_SHARED : 0;
     ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
     ram_flags |= fb->is_pmem ? RAM_PMEM : 0;
+    ram_flags |= fb->clear ? RAM_CLEAR : 0;
     memory_region_init_ram_from_file(&backend->mr, OBJECT(backend), name,
                                      backend->size, fb->align, ram_flags,
                                      fb->mem_path, fb->readonly, errp);
@@ -168,6 +170,21 @@ static void file_memory_backend_set_readonly(Object *obj, bool value,
     fb->readonly = value;
 }
 
+static bool file_memory_backend_get_clear(Object *obj, Error **errp)
+{
+    HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(obj);
+
+    return fb->clear;
+}
+
+static void file_memory_backend_set_clear(Object *obj, bool value,
+                                          Error **errp)
+{
+    HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(obj);
+
+    fb->clear = value;
+}
+
 static void file_backend_unparent(Object *obj)
 {
     HostMemoryBackend *backend = MEMORY_BACKEND(obj);
@@ -204,6 +221,9 @@ file_backend_class_init(ObjectClass *oc, void *data)
     object_class_property_add_bool(oc, "readonly",
                                    file_memory_backend_get_readonly,
                                    file_memory_backend_set_readonly);
+    object_class_property_add_bool(oc, "clear",
+                                   file_memory_backend_get_clear,
+                                   file_memory_backend_set_clear);
 }
 
 static void file_backend_instance_finalize(Object *o)
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 2e602a2fad..3345db5241 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -232,6 +232,9 @@ typedef struct IOMMUTLBEvent {
 /* RAM that isn't accessible through normal means. */
 #define RAM_PROTECTED (1 << 8)
 
+/* Clear these pages when mapping */
+#define RAM_CLEAR (1 << 9)
+
 static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
                                        IOMMUNotifierFlag flags,
                                        hwaddr start, hwaddr end,
diff --git a/qapi/qom.json b/qapi/qom.json
index 30e76653ad..2c4aa5b0d5 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -629,6 +629,8 @@
 #         specify the required alignment via this option.
 #         0 selects a default alignment (currently the page size). (default: 0)
 #
+# @clear: if true, the memory is memset to 0 during init. (default: false)
+#
 # @discard-data: if true, the file contents can be destroyed when QEMU exits,
 #                to avoid unnecessarily flushing data to the backing file. Note
 #                that ``discard-data`` is only an optimization, and QEMU might
@@ -649,6 +651,7 @@
 { 'struct': 'MemoryBackendFileProperties',
   'base': 'MemoryBackendProperties',
   'data': { '*align': 'size',
+            '*clear': 'bool',
             '*discard-data': 'bool',
             'mem-path': 'str',
             '*pmem': { 'type': 'bool', 'if': 'CONFIG_LIBPMEM' },
diff --git a/qemu-options.hx b/qemu-options.hx
index beeb4475ba..6c8345c62e 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -4859,7 +4859,7 @@ SRST
     they are specified. Note that the 'id' property must be set. These
     objects are placed in the '/objects' path.
 
-    ``-object memory-backend-file,id=id,size=size,mem-path=dir,share=on|off,discard-data=on|off,merge=on|off,dump=on|off,prealloc=on|off,host-nodes=host-nodes,policy=default|preferred|bind|interleave,align=align,readonly=on|off``
+    ``-object memory-backend-file,id=id,size=size,mem-path=dir,share=on|off,clear=on|off,discard-data=on|off,merge=on|off,dump=on|off,prealloc=on|off,host-nodes=host-nodes,policy=default|preferred|bind|interleave,align=align,readonly=on|off``
         Creates a memory file backend object, which can be used to back
         the guest RAM with huge pages.
 
@@ -4886,6 +4886,8 @@ SRST
         Documentation/vm/numa\_memory\_policy.txt on the Linux kernel
         source tree for additional details.
 
+        Setting clear=on will make QEMU memset the backend to 0.
+
         Setting the ``discard-data`` boolean option to on indicates that
         file contents can be destroyed when QEMU exits, to avoid
         unnecessarily flushing data to the backing file. Note that
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index df54b917a9..573c686fd1 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -1409,6 +1409,10 @@ static void *file_ram_alloc(RAMBlock *block,
         return NULL;
     }
 
+    if (block->flags & RAM_CLEAR) {
+        memset(area, 0, memory);
+    }
+
     block->fd = fd;
     return area;
 }
@@ -1868,7 +1872,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
 
     /* Just support these ram flags by now. */
     assert((ram_flags & ~(RAM_SHARED | RAM_PMEM | RAM_NORESERVE |
-                          RAM_PROTECTED)) == 0);
+                          RAM_PROTECTED | RAM_CLEAR)) == 0);
 
     if (xen_enabled()) {
         error_setg(errp, "-mem-path not supported with Xen");
This adds a memset to clear the backing memory. This is useful in the case of PMEM DAX to drop dirty data, if the backing memory is handed over from a previous application or firmware which didn't clean up before exiting.

Signed-off-by: Fam Zheng <fam.zheng@bytedance.com>
---
 backends/hostmem-file.c | 20 ++++++++++++++++++++
 include/exec/memory.h   |  3 +++
 qapi/qom.json           |  3 +++
 qemu-options.hx         |  4 +++-
 softmmu/physmem.c       |  6 +++++-
 5 files changed, 34 insertions(+), 2 deletions(-)
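An illustrative invocation of the proposed option; the path and sizes are made up, and pmem=on assumes a QEMU build with libpmem, while clear=on is the new flag added by this patch:

-object memory-backend-file,id=pmem0,size=8G,mem-path=/dev/dax1.0,share=on,pmem=on,align=2M,clear=on

The backend can then be wired up as guest RAM or as the memdev of an nvdimm device, with the backing DAX contents zeroed before the guest first sees them.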