Message ID | 20230807190736.572665-2-david@redhat.com (mailing list archive)
---|---
State | New, archived
Series | softmmu/physmem: file_ram_open() readonly improvements
On Mon, Aug 07, 2023 at 09:07:32PM +0200, David Hildenbrand wrote:
> From: Thiner Logoer <logoerthiner1@163.com>
>
> Users may specify
> * "-mem-path" or
> * "-object memory-backend-file,share=off,readonly=off"
> and expect such COW (MAP_PRIVATE) mappings to work, even if the user does not have write permissions to open the file.
>
> For now, we would always fail in that case, always requiring file write permissions. Let's detect when that failure happens and fall back to opening the file readonly.
>
> Warn the user, since there are other use cases where we want the file to be mapped writable: ftruncate() and fallocate() will fail if the file was not opened with write permissions.
>
> Signed-off-by: Thiner Logoer <logoerthiner1@163.com>
> Co-developed-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  softmmu/physmem.c | 26 ++++++++++++++++++--------
>  1 file changed, 18 insertions(+), 8 deletions(-)
>
> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> index 3df73542e1..d1ae694b20 100644
> --- a/softmmu/physmem.c
> +++ b/softmmu/physmem.c
> @@ -1289,8 +1289,7 @@ static int64_t get_file_align(int fd)
>  static int file_ram_open(const char *path,
>                           const char *region_name,
>                           bool readonly,
> -                         bool *created,
> -                         Error **errp)
> +                         bool *created)
>  {
>      char *filename;
>      char *sanitized_name;
> @@ -1334,10 +1333,7 @@ static int file_ram_open(const char *path,
>              g_free(filename);
>          }
>          if (errno != EEXIST && errno != EINTR) {
> -            error_setg_errno(errp, errno,
> -                             "can't open backing store %s for guest RAM",
> -                             path);
> -            return -1;
> +            return -errno;
>          }
>          /*
>           * Try again on EINTR and EEXIST. The latter happens when
> @@ -1946,9 +1942,23 @@ RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
>      bool created;
>      RAMBlock *block;
>
> -    fd = file_ram_open(mem_path, memory_region_name(mr), readonly, &created,
> -                       errp);
> +    fd = file_ram_open(mem_path, memory_region_name(mr), readonly, &created);
> +    if (fd == -EACCES && !(ram_flags & RAM_SHARED) && !readonly) {
> +        /*
> +         * We can have a writable MAP_PRIVATE mapping of a readonly file.
> +         * However, some operations like ftruncate() or fallocate() might fail
> +         * later, let's warn the user.
> +         */
> +        fd = file_ram_open(mem_path, memory_region_name(mr), true, &created);
> +        if (fd >= 0) {
> +            warn_report("backing store %s for guest RAM (MAP_PRIVATE) opened"
> +                        " readonly because the file is not writable", mem_path);

I can understand the use case, but this will be slightly unwanted, especially since the user doesn't yet have a way to predict when it will happen.

Meanwhile this changes the behavior; is it a concern that someone may want to rely on the current behavior of failing?

To think from a higher level about the current use case, the ideal solution seems to me that if the ram file can be put on a file system that supports CoW itself (like btrfs), we can snapshot that ram file and make it RW for the qemu instance. Then it'll be able to open the file. We'll be able to keep the interface working as before; meanwhile it'll work with fallocate or truncation too, I assume.

Would that be better instead of changing QEMU?

Thanks,

> +        }
> +    }
>      if (fd < 0) {
> +        error_setg_errno(errp, -fd,
> +                         "can't open backing store %s for guest RAM",
> +                         mem_path);
>          return NULL;
>      }
>
> --
> 2.41.0
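For reference, the kernel behavior this fallback relies on (a MAP_PRIVATE mapping may be PROT_WRITE even when the underlying fd was opened O_RDONLY; writes then go to private copy-on-write pages) can be demonstrated with a minimal standalone C sketch. This is not QEMU code; the path is a placeholder and the file is assumed to exist, be read-only, and be at least one page long:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Placeholder path; assumed to exist, read-only, >= 4 KiB. */
    int fd = open("/tmp/guest-ram.img", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /*
     * PROT_WRITE is allowed here: MAP_PRIVATE means copy-on-write,
     * so a read-only fd suffices (MAP_SHARED would fail with EACCES).
     */
    void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    memset(p, 0xaa, 4096); /* triggers COW; the file itself is untouched */
    munmap(p, 4096);
    close(fd);
    return 0;
}
```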
At 2023-08-09 05:01:17, "Peter Xu" <peterx@redhat.com> wrote:
> On Mon, Aug 07, 2023 at 09:07:32PM +0200, David Hildenbrand wrote:
>> [...]
>
> I can understand the use case, but this will be slightly unwanted, especially since the user doesn't yet have a way to predict when it will happen.
>
> Meanwhile this changes the behavior; is it a concern that someone may want to rely on the current behavior of failing?

I am happy to learn if there is some solid evidence supporting this view. The target of compatibility is "private+discard", which itself seemed a pathological practice in the earlier discussion. The only behavioral difference that might be unwanted, per your argument, lies in when "readonly file+private+discard" fails: before the patch it fails early, after the patch it fails later and may do additional work first. Do you think a bug reporting mechanism which relies on qemu failure timing is valid? If someone argues that the early-failure behavior of "readonly file+private+discard" is something their system relies on, I would be happy to learn why.

Actually this is much easier to solve outside qemu by checking the memory file permissions, compared to the "readonly+private" alternative plan that requires btrfs. In the long run it is the "private+discard" setup that may deserve a warning and eventual deprecation, rather than "private+readonly file", which is supported by the linux kernel. The current patch is already a compromise for compatibility, as it always tries to open the file read-write first, preserving the behavior of the "readwrite+private+discard" case.

Personally I would prefer to try opening the file readonly first, as this reduces the risk of accidentally opening a ram file with write permission. However, I can accept a second level of compromise, i.e. adding a "-disallow-private-discard" qemu argument for at least three qemu releases before "private+discard" gets deprecated and removed, if extreme compatibility enthusiasts are involved. With this argument, readonly private mappings are enabled and a private discard fails immediately when a discard request is detected; without it, readonly private mappings are disabled and the behavior is unchanged. Such an argument is also useful because it helps find existing "private+discard" usage and assists debugging. Do you think this solution pays off? (I am no expert qemu programmer; I hope someone else can handle the command line arguments if this is preferred.)

> To think from a higher level about the current use case, the ideal solution seems to me that if the ram file can be put on a file system that supports CoW itself (like btrfs), we can snapshot that ram file and make it RW for the qemu instance. Then it'll be able to open the file. We'll be able to keep the interface working as before; meanwhile it'll work with fallocate or truncation too, I assume.
>
> Would that be better instead of changing QEMU?

I am afraid your alternative solution would still leave the use case impossible on a rootless system, while MAP_PRIVATE-based CoW works anywhere. For example, it will not work on ext4 with a read-only file, or when the file is owned by root and you are a normal user, or on squashfs. If basically everything on the machine is readonly and you are only a nobody, then only the in-qemu choice is possible.

I believe the warning that the file is opened readonly is enough for a potential user to reconsider. Do you think that looking solely at the qemu exit status while piping all qemu stderr output to /dev/null is one of the supported use cases?

--

Regards,
logoerthiner
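A minimal sketch of the outside-qemu permission check mentioned above, e.g. in a launcher that decides how to invoke qemu; the helper name is hypothetical and not part of QEMU:

```c
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical launcher-side helper: probe whether the RAM file is
 * writable for the real UID/GID before choosing the qemu arguments. */
static bool ram_file_is_writable(const char *path)
{
    return access(path, W_OK) == 0;
}
```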
Hi Peter!

>> -    fd = file_ram_open(mem_path, memory_region_name(mr), readonly, &created,
>> -                       errp);
>> +    fd = file_ram_open(mem_path, memory_region_name(mr), readonly, &created);
>> +    if (fd == -EACCES && !(ram_flags & RAM_SHARED) && !readonly) {
>> +        /*
>> +         * We can have a writable MAP_PRIVATE mapping of a readonly file.
>> +         * However, some operations like ftruncate() or fallocate() might fail
>> +         * later, let's warn the user.
>> +         */
>> +        fd = file_ram_open(mem_path, memory_region_name(mr), true, &created);
>> +        if (fd >= 0) {
>> +            warn_report("backing store %s for guest RAM (MAP_PRIVATE) opened"
>> +                        " readonly because the file is not writable", mem_path);
>
> I can understand the use case, but this will be slightly unwanted, especially since the user doesn't yet have a way to predict when it will happen.

Users can set the file permissions accordingly I guess. If they don't want the file to ever be modified via QEMU, set it R/O.

> Meanwhile this changes the behavior; is it a concern that someone may want to rely on the current behavior of failing?

The scenario would be that someone passes a readonly file to "-mem-path" or "-object memory-backend-file,share=off,readonly=off", with the expectation that it would currently fail.

If it now doesn't fail (and we warn instead), what would happen is:
* In file_ram_alloc() we won't even try ftruncate(), because the file already had a size > 0. So ftruncate() is not a concern, as I now realize.
* fallocate() might fail later. AFAIKS, that only applies to ram_block_discard_range().
  -> virtio-mem performs an initial ram_block_discard_range() check and fails gracefully early.
  -> virtio-balloon ignores any errors.
  -> ram_discard_range() in migration code fails early for postcopy in init_range() and loadvm_postcopy_ram_handle_discard(), handling it gracefully.

So mostly nothing "bad" would happen; it might just be undesirable, and we properly warn.

Most importantly, we won't be corrupting/touching the original file in any case, because it is R/O.

If we really want to be careful, we could glue that behavior to compat machines. I'm not really sure yet if we really have to go down that path.

Any other alternatives? I'd like to avoid new flags where not really required.

> To think from a higher level about the current use case, the ideal solution seems to me that if the ram file can be put on a file system that supports CoW itself (like btrfs), we can snapshot that ram file and make it RW for the qemu instance. Then it'll be able to open the file. We'll be able to keep the interface working as before; meanwhile it'll work with fallocate or truncation too, I assume.
>
> Would that be better instead of changing QEMU?

As I recently learned, using file-backed VMs (on real ssd/disks, not shmem/hugetlb) is usually undesired, because the dirtied pages will constantly get rewritten to disk by background writeback threads, eventually resulting in bad performance and SSD wear.

So while using a COW filesystem sounds cleaner in theory, it's not applicable in practice -- unless one disables any background writeback, which has different side effects because it cannot be configured on a per-file basis.

So for VM templating, it makes sense to capture the guest RAM and store it in a file, to then use a COW (MAP_PRIVATE) mapping. Using a read-only file makes perfect sense in that scenario IMHO.

[I'm curious at what point a filesystem will actually break COW. If it's wired up to the writenotify infrastructure, it would happen when actually writing to a page, not at mmap time. I know that filesystems use writenotify for lazy allocation of disk blocks on file holes; maybe they also do that for lazy allocation of disk blocks on COW.]

Thanks!
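The fallocate() failure mode discussed above can be reproduced with a small standalone sketch (again not QEMU code; the path is a placeholder). fallocate(2) documents EBADF when the fd is not open for writing, which is exactly the read-only-fd case:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/guest-ram.img", O_RDONLY); /* placeholder path */
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* Punching a hole needs a writable fd; expect EBADF on O_RDONLY. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  0, 4096) < 0) {
        perror("fallocate");
    }
    close(fd);
    return 0;
}
```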
On Wed, Aug 09, 2023 at 11:20:11AM +0200, David Hildenbrand wrote:
> Hi Peter!

Hi, David,

> > > +        fd = file_ram_open(mem_path, memory_region_name(mr), true, &created);
> > > +        if (fd >= 0) {
> > > +            warn_report("backing store %s for guest RAM (MAP_PRIVATE) opened"
> > > +                        " readonly because the file is not writable", mem_path);
> >
> > I can understand the use case, but this will be slightly unwanted, especially since the user doesn't yet have a way to predict when it will happen.
>
> Users can set the file permissions accordingly I guess. If they don't want the file to ever be modified via QEMU, set it R/O.
>
> > Meanwhile this changes the behavior; is it a concern that someone may want to rely on the current behavior of failing?
>
> The scenario would be that someone passes a readonly file to "-mem-path" or "-object memory-backend-file,share=off,readonly=off", with the expectation that it would currently fail.
>
> If it now doesn't fail (and we warn instead), what would happen is:
> * In file_ram_alloc() we won't even try ftruncate(), because the file already had a size > 0. So ftruncate() is not a concern, as I now realize.
> * fallocate() might fail later. AFAIKS, that only applies to ram_block_discard_range().
>   -> virtio-mem performs an initial ram_block_discard_range() check and fails gracefully early.
>   -> virtio-balloon ignores any errors.
>   -> ram_discard_range() in migration code fails early for postcopy in init_range() and loadvm_postcopy_ram_handle_discard(), handling it gracefully.
>
> So mostly nothing "bad" would happen; it might just be undesirable, and we properly warn.

Indeed, all of them can fail gracefully, while the balloon one is harmless. Definitely good news.

> Most importantly, we won't be corrupting/touching the original file in any case, because it is R/O.
>
> If we really want to be careful, we could glue that behavior to compat machines. I'm not really sure yet if we really have to go down that path.
>
> Any other alternatives? I'd like to avoid new flags where not really required.

I was just thinking of a new flag. :) So have you already discussed that possibility and decided it's not a good idea?

The root issue to me here is that we actually have two resources (the memory map of the process, and the file) but we only have one way to describe the permissions upon the two objects. I'd think it makes a lot more sense if a new flag is added, when there's a need to differentiate the two.

Consider if you see a bunch of qemu instances with:

    -mem-path $RAM_FILE

on the same host, which can be as weird as it could be to me.. At least '-mem-path' still looks like a way to exclusively own a ram file for an instance. I hesitate that the new fallback can confuse people too, while that's so far not the major use case.

Nobody may really rely on any existing behavior of the failure, but changing existing behavior is just always unwanted. The guideline here to me is: do we want existing "-mem-path XXX" users to start using the fallback in general? If it's "no", then maybe it implies a new flag is better?

> > To think from a higher level about the current use case, the ideal solution seems to me that if the ram file can be put on a file system that supports CoW itself (like btrfs), we can snapshot that ram file and make it RW for the qemu instance. Then it'll be able to open the file. We'll be able to keep the interface working as before; meanwhile it'll work with fallocate or truncation too, I assume.
> >
> > Would that be better instead of changing QEMU?
>
> As I recently learned, using file-backed VMs (on real ssd/disks, not shmem/hugetlb) is usually undesired, because the dirtied pages will constantly get rewritten to disk by background writeback threads, eventually resulting in bad performance and SSD wear.
>
> So while using a COW filesystem sounds cleaner in theory, it's not applicable in practice -- unless one disables any background writeback, which has different side effects because it cannot be configured on a per-file basis.
>
> So for VM templating, it makes sense to capture the guest RAM and store it in a file, to then use a COW (MAP_PRIVATE) mapping. Using a read-only file makes perfect sense in that scenario IMHO.

Valid point.

> [I'm curious at what point a filesystem will actually break COW. If it's wired up to the writenotify infrastructure, it would happen when actually writing to a page, not at mmap time. I know that filesystems use writenotify for lazy allocation of disk blocks on file holes; maybe they also do that for lazy allocation of disk blocks on COW.]

I don't know either, but it definitely looks more promising and reasonable if the CoW only happens when actually written, rather than when mapped RW.

Thanks,
>> Most importantly, we won't be corrupting/touching the original file in any case, because it is R/O.
>>
>> If we really want to be careful, we could glue that behavior to compat machines. I'm not really sure yet if we really have to go down that path.
>>
>> Any other alternatives? I'd like to avoid new flags where not really required.
>
> I was just thinking of a new flag. :) So have you already discussed that possibility and decided it's not a good idea?

Not really. I was briefly playing with that idea but already struggled to come up with a reasonable name :)

Fewer toggles, and just have it work nicely, if possible.

> The root issue to me here is that we actually have two resources (the memory map of the process, and the file) but we only have one way to describe the permissions upon the two objects. I'd think it makes a lot more sense if a new flag is added, when there's a need to differentiate the two.
>
> Consider if you see a bunch of qemu instances with:
>
>     -mem-path $RAM_FILE
>
> on the same host, which can be as weird as it could be to me.. At least '-mem-path' still looks like a way to exclusively own a ram file for an instance. I hesitate that the new fallback can confuse people too, while that's so far not the major use case.

Once I learned that this is not a MAP_SHARED mapping, I was extremely confused. For example, vhost-user will absolutely not work with "-mem-path", even though the documentation explicitly spells that out (I still have to send a patch to fix that).

I guess "-mem-path" was primarily only used to consume hugetlb. Even for tmpfs it will already result in double memory consumption, just like when using -memory-backend-memfd,share=no.

I guess deprecating it was the right decision.

But memory-backend-file also defaults to "share=no" ... so the same default behavior, unfortunately.

> Nobody may really rely on any existing behavior of the failure, but changing existing behavior is just always unwanted. The guideline here to me is: do we want existing "-mem-path XXX" users to start using the fallback in general? If it's "no", then maybe it implies a new flag is better?

I think we have the following options (there might be more)

1) This patch.

2) New flag for memory-backend-file. We already have "readonly" and "share=". I'm having a hard time coming up with a good name that really describes the subtle difference.

3) Glue behavior to the QEMU machine

For 3), one option would be to always open a COW file readonly (as Thiner originally proposed). We could leave "-mem-path" behavior alone and only change memory-backend-file semantics. If the COW file does *not* exist yet, we would refuse to create the file like patches 2+3 do. Therefore, no ftruncate() errors, and fallocate() errors would always happen.

What are your thoughts?

[...]

>> [I'm curious at what point a filesystem will actually break COW. If it's wired up to the writenotify infrastructure, it would happen when actually writing to a page, not at mmap time. I know that filesystems use writenotify for lazy allocation of disk blocks on file holes; maybe they also do that for lazy allocation of disk blocks on COW.]
>
> I don't know either, but it definitely looks more promising and reasonable if the CoW only happens when actually written, rather than when mapped RW.

That would be my best guess. But then, we have multiple pagecache pages pointing at the same disk block until COW happens ... maybe that's how it already works. Once I have some spare time, I might play with that.
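For concreteness, the two configurations being discussed in this thread look roughly as follows; the path, size, and machine type are placeholders:

```
# Legacy option; always a MAP_PRIVATE mapping of the file:
qemu-system-x86_64 -m 4G -mem-path /path/to/template.ram ...

# Backend option; share=off,readonly=off is the COW case in question:
qemu-system-x86_64 -m 4G \
    -object memory-backend-file,id=pc.ram,size=4G,mem-path=/path/to/template.ram,share=off,readonly=off \
    -machine q35,memory-backend=pc.ram ...
```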
At 2023-08-10 22:19:45, "David Hildenbrand" <david@redhat.com> wrote:
> [...]
>
> I think we have the following options (there might be more)
>
> 1) This patch.
>
> 2) New flag for memory-backend-file. We already have "readonly" and "share=". I'm having a hard time coming up with a good name that really describes the subtle difference.
>
> 3) Glue behavior to the QEMU machine

4) '-deny-private-discard' argv, or environment variable, or both

I proposed 4) earlier in the discussion: add a global qemu flag like '-deny-private-discard' or '-disallow-private-discard' (let's find a better name!) for some duration, until the private-discard behavior phases out. Without the flag we do everything exactly the same as before. With the flag, private CoW mappings of files are strictly opened readonly, discard on a private memory backend is denied early as soon as the possibility arises, and files backing private memory are always opened readonly without creating any file (so the file must exist, and there are no more nasty edge cases). This has the benefit that it can also help diagnose and debug all existing private-discard usage, which could be required in the long run. And with this flag we directly solve the immediate demand of <readonly file+private map> while delaying the hard problem indefinitely.

I think this solution seems the most promising and acceptable to most people. At least for my use case, I would be glad to add such a flag to my argv if it is all I need, since it does not hurt the flexibility I care about.

Note that a difference in this option should probably not cause a difference in the machine specification. Otherwise migration will fail because one machine has the option and the other does not, which is absurd, since it is a backend implementation flag.

> For 3), one option would be to always open a COW file readonly (as Thiner originally proposed). We could leave "-mem-path" behavior alone and only change memory-backend-file semantics. If the COW file does *not* exist yet, we would refuse to create the file like patches 2+3 do. Therefore, no ftruncate() errors, and fallocate() errors would always happen.
>
> What are your thoughts?

I would be happy if -mem-path stays supported, since in that case I do not need knowledge of the memory backend before migration.

--

Regards,
logoerthiner
On Fri, Aug 11, 2023 at 01:06:12AM +0800, ThinerLogoer wrote:
> > I think we have the following options (there might be more)
> >
> > 1) This patch.
> >
> > 2) New flag for memory-backend-file. We already have "readonly" and "share=". I'm having a hard time coming up with a good name that really describes the subtle difference.
> >
> > 3) Glue behavior to the QEMU machine
>
> 4) '-deny-private-discard' argv, or environment variable, or both

I'd personally vote for (2). How about "fdperm"? To describe when we want to use different rw permissions on the file (besides the access permission of the memory we already provided with "readonly"=XXX). IIUC the only sane value will be ro/rw/default, where "default" should just use the same rw permission as the memory ("readonly"=XXX).

Would that be relatively clean and also work in this use case?

(The other thing is that I'd wish we didn't have that fallback: as long as we have any of that "fallback", we'll need to stay compatible with it from then on, and forever...)
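If that route were taken, usage might look like the following. Note that "fdperm" is only a name proposed in this thread, not an existing memory-backend-file property, and the path and size are placeholders:

```
# Hypothetical syntax: writable COW memory over a file whose fd stays read-only
-object memory-backend-file,id=pc.ram,size=4G,mem-path=/path/to/template.ram,share=off,readonly=off,fdperm=ro
```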
At 2023-08-11 05:24:43, "Peter Xu" <peterx@redhat.com> wrote:
> On Fri, Aug 11, 2023 at 01:06:12AM +0800, ThinerLogoer wrote:
>> [...]
>>
>> 4) '-deny-private-discard' argv, or environment variable, or both
>
> I'd personally vote for (2). How about "fdperm"? To describe when we want to use different rw permissions on the file (besides the access permission of the memory we already provided with "readonly"=XXX). IIUC the only sane value will be ro/rw/default, where "default" should just use the same rw permission as the memory ("readonly"=XXX).
>
> Would that be relatively clean and also work in this use case?
>
> (The other thing is that I'd wish we didn't have that fallback: as long as we have any of that "fallback", we'll need to stay compatible with it from then on, and forever...)

If it must be (2), I would vote for (2) + (4), with (4) adjusting the default behavior of said `fdperm`. Mainly because (private+discard) is itself not a good practice, and (4) serves as a good tool to help catch existing (private+discard) problems.

Actually (readonly+private) is more reasonable than (private+discard), so I want at least some room for a default (readonly+private) behavior.

Also, in my case I kind of have to use "-mem-path" despite it being considered close to deprecated. Only with this can I avoid needing knowledge of the memory backend before migration. There actually seems to be no equivalent working after-migration setup of "-object memory-backend-file,... -machine q35,mem=..." that can match the before-migration setup of "-machine q35" (specifying nothing). Therefore I must make a plan and choose a migration method BEFORE I boot the machine and prepare to migrate, reducing operational freedom. Considering that, I have to use "-mem-path", which keeps the freedom but has no configurable arguments, so I have to rely on the default config.

Are there any "-object memory-backend-file..." setups equivalent to "-machine q35" that can migrate from and to each other? If there are, I want to try it out. By the way, "-object memory-backend-file,id=pc.ram" has just been killed by an earlier commit.

Either (4) or fixing this should help my config. I hope you can consider this more deeply and figure out a more systematic solution that helps more users?

--

Regards,
logoerthiner
On Fri, Aug 11, 2023 at 01:49:52PM +0800, ThinerLogoer wrote:
> At 2023-08-11 05:24:43, "Peter Xu" <peterx@redhat.com> wrote:
>> [...]
>
> If it must be (2), I would vote for (2) + (4), with (4) adjusting the default behavior of said `fdperm`. Mainly because (private+discard) is itself not a good practice, and (4) serves as a good tool to help catch existing (private+discard) problems.
>
> Actually (readonly+private) is more reasonable than (private+discard), so I want at least some room for a default (readonly+private) behavior.

Just for discussion purposes: I think maybe someday private+discard could work. IIUC what we're missing is a syscall interface to install a zero page for a MAP_PRIVATE mapping, atomically freeing what's underneath: it seems neither punching a hole nor DONTNEED suffices here. It'll just be another problem when having the zero page involved in file mappings, at least.

> Also, in my case I kind of have to use "-mem-path" despite it being considered close to deprecated. Only with this can I avoid needing knowledge of the memory backend before migration. There actually seems to be no equivalent working after-migration setup of "-object memory-backend-file,... -machine q35,mem=..." that can match the before-migration setup of "-machine q35" (specifying nothing). Therefore I must make a plan and choose a migration method BEFORE I boot the machine and prepare to migrate, reducing operational freedom. Considering that, I have to use "-mem-path", which keeps the freedom but has no configurable arguments, so I have to rely on the default config.
>
> Are there any "-object memory-backend-file..." setups equivalent to "-machine q35" that can migrate from and to each other? If there are, I want to try it out. By the way, "-object memory-backend-file,id=pc.ram" has just been killed by an earlier commit.

I'm actually not familiar enough with the interfaces here, but I just checked the man page; would this work for you, together with option (2)?

    memory-backend='id'
        An alternative to legacy -mem-path and mem-prealloc options. Allows to use a memory backend as main RAM.

        For example:

        -object memory-backend-file,id=pc.ram,size=512M,mem-path=/hugetlbfs,prealloc=on,share=on
        -machine memory-backend=pc.ram
        -m 512M
On 10.08.23 23:24, Peter Xu wrote:
> On Fri, Aug 11, 2023 at 01:06:12AM +0800, ThinerLogoer wrote:
>>> I think we have the following options (there might be more)
>>>
>>> 1) This patch.
>>>
>>> 2) New flag for memory-backend-file. We already have "readonly" and "share=". I'm having a hard time coming up with a good name that really describes the subtle difference.
>>>
>>> 3) Glue behavior to the QEMU machine
>>
>> 4) '-deny-private-discard' argv, or environment variable, or both
>
> I'd personally vote for (2). How about "fdperm"? To describe when we want to use different rw permissions on the file (besides the access permission of the memory we already provided with "readonly"=XXX). IIUC the only sane value will be ro/rw/default, where "default" should just use the same rw permission as the memory ("readonly"=XXX).

Hmm, I'm not particularly happy about that.

> Would that be relatively clean and also work in this use case?

I get the feeling that we are over-engineering something that probably should never have been allowed: MAP_PRIVATE mapping of a file and opening it rw because someone might punch holes into it.

Once we start adding new parameters just for that, I get a bit skeptical that this is what we want. The number of people that care about that is probably close to 0.

The only real use case where this used to make sense (by accident I assume) was with hugetlb. And somehow, we decided that it was a good idea for "-mem-path" to use MAP_PRIVATE.

So, what stops us from

a) Leaving -mem-path alone. Keep opening files rw.
b) Make memory-backend-file with shared=off,readonly=off open the file read-only
c) Gluing that behavior to a QEMU compat machine

fallocate(PUNCH_HOLE) will fail, and we can probably let virtio-mem/virtio-balloon and postcopy refuse to even start (virtio-mem already does that) as early as possible.

People that care about any such use case would already get a warning when punching a hole today.

If we ever support discarding RAM in that configuration, we can simply unlock it again.

Am I missing any important use case?
On 11.08.23 16:59, David Hildenbrand wrote:
> [...]
>
> Am I missing any important use case?

I just started looking into the origins of "-mem-path".

Originally c902760fb2 ("Add option to use file backed guest memory"):

* Without MAP_POPULATE support, we use MAP_PRIVATE
* With MAP_POPULATE support we use MAP_PRIVATE if mem_prealloc was not defined.

It was only used for hugetlb. The shared memory case didn't really matter: they just needed a way to get hugetlb pages into the VM. Opening the file R/W even with MAP_PRIVATE kind-of made sense in that case, it was an exclusive owner.

Discarding of RAM was not very popular back then I guess: virtio-mem didn't exist, virtio-balloon doesn't even handle hugetlb today really, postcopy didn't exist.

I guess that's why nobody really cared about "-mem-path" MAP_PRIVATE vs. MAP_SHARED semantics: just get hugetlb pages into the VM somehow.

Nowadays, "-mem-path" always defaults to MAP_PRIVATE. For the original hugetlb use case, it's still good enough. For anything else, I'm not so sure.
On Fri, Aug 11, 2023 at 04:59:56PM +0200, David Hildenbrand wrote:
> On 10.08.23 23:24, Peter Xu wrote:
>> [...]
>
> I get the feeling that we are over-engineering something that probably should never have been allowed: MAP_PRIVATE mapping of a file and opening it rw because someone might punch holes into it.
>
> Once we start adding new parameters just for that, I get a bit skeptical that this is what we want. The number of people that care about that is probably close to 0.
>
> The only real use case where this used to make sense (by accident I assume) was with hugetlb. And somehow, we decided that it was a good idea for "-mem-path" to use MAP_PRIVATE.
>
> So, what stops us from
>
> a) Leaving -mem-path alone. Keep opening files rw.
> b) Make memory-backend-file with shared=off,readonly=off open the file read-only
> c) Gluing that behavior to a QEMU compat machine

So we want to make all users with shared=off + readonly=off only open the file RO always, failing file write ops rather than crashing others. Sounds reasonable to me.

In that case, do we even need a compat bit, if we're 100% sure it won't break anyone but only help, given that everything should be transparent?
On Fri, Aug 11, 2023 at 05:26:24PM +0200, David Hildenbrand wrote:
> I just started looking into the origins of "-mem-path".
>
> Originally c902760fb2 ("Add option to use file backed guest memory"):
>
> * Without MAP_POPULATE support, we use MAP_PRIVATE
> * With MAP_POPULATE support we use MAP_PRIVATE if mem_prealloc was not defined.
>
> It was only used for hugetlb. The shared memory case didn't really matter: they just needed a way to get hugetlb pages into the VM. Opening the file R/W even with MAP_PRIVATE kind-of made sense in that case, it was an exclusive owner.
>
> Discarding of RAM was not very popular back then I guess: virtio-mem didn't exist, virtio-balloon doesn't even handle hugetlb today really, postcopy didn't exist.
>
> I guess that's why nobody really cared about "-mem-path" MAP_PRIVATE vs. MAP_SHARED semantics: just get hugetlb pages into the VM somehow.
>
> Nowadays, "-mem-path" always defaults to MAP_PRIVATE. For the original hugetlb use case, it's still good enough. For anything else, I'm not so sure.

Ok, this answers my other question on the compat bit.. thanks. Feel free to ignore it there.

But then I'd lean back towards simply adding a fdperm=; that seems the simplest, since even with a compat bit, we'd still face the risk of breaking -mem-path users for anyone using new machine types.
On 11.08.23 18:16, Peter Xu wrote:
> [...]
>
> But then I'd lean back towards simply adding a fdperm=; that seems the simplest, since even with a compat bit, we'd still face the risk of breaking -mem-path users for anyone using new machine types.

We wouldn't touch "-mem-path".
On Fri, Aug 11, 2023 at 06:17:05PM +0200, David Hildenbrand wrote:
> We wouldn't touch "-mem-path".
But still the same issue when someone uses -object memory-backend-file for
hugetlb, mapping privately, expecting ram discard to work?
Basically I see that example as, "hugetlb" in general made the private
mapping over RW file usable, so forbidden that anywhere may take a risk.
Thanks,
On 11.08.23 18:22, Peter Xu wrote:
> On Fri, Aug 11, 2023 at 06:17:05PM +0200, David Hildenbrand wrote:
>> We wouldn't touch "-mem-path".
>
> But still the same issue when someone uses -object memory-backend-file for hugetlb, mapping privately, expecting ram discard to work?
>
> Basically I see that example as, "hugetlb" in general made the private mapping over RW file usable, so forbidden that anywhere may take a risk.

These users can be directed to using hugetlb

a) using MAP_SHARED
b) using memory-backend-memfd, if MAP_PRIVATE is desired

Am I missing any important use case? Are we being a bit too careful about virtio-balloon and postcopy simply not being available for these corner cases?
On Fri, Aug 11, 2023 at 06:25:14PM +0200, David Hildenbrand wrote:
> On 11.08.23 18:22, Peter Xu wrote:
>> [...]
>
> These users can be directed to using hugetlb
>
> a) using MAP_SHARED
> b) using memory-backend-memfd, if MAP_PRIVATE is desired
>
> Am I missing any important use case? Are we being a bit too careful about virtio-balloon and postcopy simply not being available for these corner cases?

The current immediate issue is not really the mem=rw + fd=rw + private case (which was a known issue), but how to make mem=rw + fd=ro + private work for ThinerLogoer, iiuc.

I'd just think it safer to expose that cap to solve problem A (vm templating) without affecting problem B (fallocate-over-private not working right), when B is uncertain.

I'm also copying Daniel & the libvirt list in case there's a quick comment from there. Say, maybe libvirt never uses private mappings on hugetlb files over memory-backend-file at all; then it's probably fine.

In all cases, you and Igor should have the final grasp; no strong opinion from my side.
On 11.08.23 18:54, Peter Xu wrote:
> On Fri, Aug 11, 2023 at 06:25:14PM +0200, David Hildenbrand wrote:
>> [...]
>
> The current immediate issue is not really the mem=rw + fd=rw + private case (which was a known issue), but how to make mem=rw + fd=ro + private work for ThinerLogoer, iiuc.
>
> I'd just think it safer to expose that cap to solve problem A (vm templating) without affecting problem B (fallocate-over-private not working right), when B is uncertain.

Right, and I'm thinking about whether B is worth the effort.

> I'm also copying Daniel & the libvirt list in case there's a quick comment from there. Say, maybe libvirt never uses private mappings on hugetlb files over memory-backend-file at all; then it's probably fine.

libvirt certainly allows setting <access mode="shared"/> with <source type="file">.

Could be that they also end up mapping "<hugepages>" to memory-backend-file instead of memory-backend-memfd (e.g., for compatibility with older kernels?).

> In all cases, you and Igor should have the final grasp; no strong opinion from my side.

I do value your opinion, so I'm still trying to figure out if there are sane use cases that really need a new parameter. Let's recap:

When opening the file R/O, resulting in fallocate() refusing to work:
* virtio-balloon will fail to discard RAM but continue to "be alive"
* virtio-mem will discard any private pages, but cannot free up disk blocks using fallocate()
* postcopy would fail early

Postcopy:
* Works on shmem (MAP_SHARED / MAP_PRIVATE)
* Works on hugetlb (MAP_SHARED / MAP_PRIVATE)
* Does not work on file-backed memory (including MAP_PRIVATE)

We can ignore virtio-mem for now. What remains is postcopy and virtio-balloon.

memory-backend-file with MAP_PRIVATE on shmem/tmpfs results in double memory consumption, so we can mostly cross that out as a "sane use case". Rather make such users aware of that :D

memory-backend-file with MAP_PRIVATE on hugetlb works. virtio-balloon is not really compatible with hugetlb; free-page-reporting might work (although quite nonsensical). So postcopy remains as the most important use case.

memory-backend-file with MAP_PRIVATE on file-backed memory works. Postcopy does not apply. virtio-balloon should work, I guess.

So the two use cases that are left are:
* postcopy with hugetlb would fail
* virtio-balloon with file-backed memory cannot free up disk blocks

Am I missing a case?
On 11.08.23 07:49, ThinerLogoer wrote:
> At 2023-08-11 05:24:43, "Peter Xu" <peterx@redhat.com> wrote:
>> [...]
>>
>> I'd personally vote for (2). How about "fdperm"? To describe when we want to use different rw permissions on the file (besides the access permission of the memory we already provided with "readonly"=XXX). IIUC the only sane value will be ro/rw/default, where "default" should just use the same rw permission as the memory ("readonly"=XXX).
>>
>> Would that be relatively clean and also work in this use case?
>>
>> (The other thing is that I'd wish we didn't have that fallback: as long as we have any of that "fallback", we'll need to stay compatible with it from then on, and forever...)
>
> If it must be (2), I would vote for (2) + (4), with (4) adjusting the default behavior of said `fdperm`. Mainly because (private+discard) is itself not a good practice, and (4) serves as a good tool to help catch existing (private+discard) problems.

Instead of fdperm, maybe we could find a better name.

The man page of "open" says: The argument flags must include one of the following access modes: O_RDONLY, O_WRONLY, or O_RDWR. These request opening the file read-only, write-only, or read/write, respectively.

So maybe something a bit more of a mouthful like "file-access-mode" would be better.

We could have the options
* readwrite
* readonly
* auto

Whereby "auto" is the default.

Specifying:

* "readonly=true,file-access-mode=readwrite" would fail.
* "shared=true,readonly=false,file-access-mode=readonly" would fail.
* "shared=false,readonly=false,file-access-mode=readonly" would work.

In theory, we could adjust the mode of "auto" with compat machines. But maybe it would be sufficient to do something like the following

1) shared=true,readonly=false
   -> readwrite
2) shared=true,readonly=true
   -> readonly
2) shared=false,readonly=true
   -> readonly
3) shared=false,readonly=true
   hugetlb? -> readwrite
   otherwise -> readonly

Reason being the traditional usage of hugetlb with MAP_PRIVATE where I can see possible users with postcopy. Further, VM templating with hugetlb might not be that common ... users could still modify the behavior if they really have to.

That would make your use case work automatically with file-backed memory.
On Fri, Aug 11, 2023 at 07:39:37PM +0200, David Hildenbrand wrote:
> On 11.08.23 18:54, Peter Xu wrote:
>> [...]
>
> I do value your opinion, so I'm still trying to figure out if there are sane use cases that really need a new parameter. Let's recap:
>
> When opening the file R/O, resulting in fallocate() refusing to work:
> * virtio-balloon will fail to discard RAM but continue to "be alive"
> * virtio-mem will discard any private pages, but cannot free up disk blocks using fallocate()
> * postcopy would fail early
>
> Postcopy:
> * Works on shmem (MAP_SHARED / MAP_PRIVATE)
> * Works on hugetlb (MAP_SHARED / MAP_PRIVATE)
> * Does not work on file-backed memory (including MAP_PRIVATE)
>
> We can ignore virtio-mem for now. What remains is postcopy and virtio-balloon.
>
> memory-backend-file with MAP_PRIVATE on shmem/tmpfs results in double memory consumption, so we can mostly cross that out as a "sane use case". Rather make such users aware of that :D
>
> memory-backend-file with MAP_PRIVATE on hugetlb works. virtio-balloon is not really compatible with hugetlb; free-page-reporting might work (although quite nonsensical). So postcopy remains as the most important use case.
>
> memory-backend-file with MAP_PRIVATE on file-backed memory works. Postcopy does not apply. virtio-balloon should work, I guess.
>
> So the two use cases that are left are:
> * postcopy with hugetlb would fail
> * virtio-balloon with file-backed memory cannot free up disk blocks
>
> Am I missing a case?

Looks complete. I don't want to say so, but afaik postcopy should be a "corner case" in most cases, I'd say; people do still rely mostly on precopy. It's probably a matter of whether we'd like to take the risk.
At 2023-08-12 03:00:54, "David Hildenbrand" <david@redhat.com> wrote: >On 11.08.23 07:49, ThinerLogoer wrote: >> At 2023-08-11 05:24:43, "Peter Xu" <peterx@redhat.com> wrote: >>> On Fri, Aug 11, 2023 at 01:06:12AM +0800, ThinerLogoer wrote: >>>>> I think we have the following options (there might be more) >>>>> >>>>> 1) This patch. >>>>> >>>>> 2) New flag for memory-backend-file. We already have "readonly" and >>>>> "share=". I'm having a hard time coming up with a good name that really >>>>> describes the subtle difference. >>>>> >>>>> 3) Glue behavior to the QEMU machine >>>>> >>>> >>>> 4) '-deny-private-discard' argv, or environment variable, or both >>> >>> I'd personally vote for (2). How about "fdperm"? To describe when we want >>> to use different rw permissions on the file (besides the access permission >>> of the memory we already provided with "readonly"=XXX). IIUC the only sane >>> value will be ro/rw/default, where "default" should just use the same rw >>> permission as the memory ("readonly"=XXX). >>> >>> Would that be relatively clean and also work in this use case? >>> >>> (the other thing I'd wish we don't have that fallback is, as long as we >>> have any of that "fallback" we'll need to be compatible with it since >>> then, and for ever...) >> >> If it must be (2), I would vote (2) + (4), with (4) adjust the default behavior of said `fdperm`. >> Mainly because (private+discard) is itself not a good practice and (4) serves >> as a good tool to help catch existing (private+discard) problems. > >Instead of fdperm, maybe we could find a better name. > >The man page of "open" says: The argument flags must include one of the >following access modes: O_RDONLY, O_WRONLY, or O_RDWR. These request >opening the file read-only, write-only, or read/write, respectively. > >So maybe something a bit more mouthful like "file-access-mode" would be >better. Maybe "fd-mode"or "open-mode" to make it shorter and also non ambiguous. "open-access-mode" can also be considered. Btw my chatgpt agent suggested 10 names to me, in case you feel hard to decide a name: access-mode file-mode open-mode file-permission file-access-mode (aha!) file-access-permission file-io-access-mode file-open-permission file-operation-mode file-read-write-mode > >We could have the options >*readwrite >*readonly >*auto > >Whereby "auto" is the default. I like the "auto" here. > >Specifying: > >* "readonly=true,file-access-mode=readwrite" would fail. >* "shared=true,readonly=false,file-access-mode=readonly" would fail. >* "shared=false,readonly=false,file-access-mode=readonly" would work. Yeah, sanitizing the arguments here is wise. > >In theory, we could adjust the mode of "auto" with compat machines. But >maybe it would be sufficient to do something like the following > >1) shared=true,readonly=false > -> readwrite >2) shared=true,readonly=true > > readonly >2) shared=false,readonly=true > -> readonly >3) shared=false,readonly=true > hugetlb? -> readwrite > otherwise -> readonly Looks like there is some typo here. I assume it was: 1) shared=true,readonly=false -> readwrite 2) shared=true,readonly=true -> readonly 3) shared=false,readonly=true -> readonly 4) shared=false,readonly=false hugetlb? -> readwrite otherwise -> readonly >Reason being the traditional usage of hugetlb with MAP_PRIVATE where I >can see possible users with postcopy. Further, VM templating with >hugetlb might not be that common ... users could still modify the >behavior if they really have to. 
>
>That would make your use case work automatically with file-backed memory.

This seems a good idea. I am good with the solution you proposed here as well.

--
Regards,
logoerthiner
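For illustration, here is a minimal C sketch of the "auto" resolution rules from the corrected table above. The enum and function names are invented for this example; they are not from the thread or from QEMU:

```c
#include <stdbool.h>

/* Hypothetical names -- only illustrating the proposed "auto" rules. */
typedef enum {
    FILE_ACCESS_READONLY,
    FILE_ACCESS_READWRITE,
} FileAccessMode;

static FileAccessMode resolve_auto_mode(bool shared, bool readonly, bool is_hugetlb)
{
    if (shared && !readonly) {
        return FILE_ACCESS_READWRITE;   /* 1) shared=true,readonly=false */
    }
    if (readonly) {
        return FILE_ACCESS_READONLY;    /* 2)+3) readonly=true */
    }
    /* 4) shared=false,readonly=false: keep traditional hugetlb behavior */
    return is_hugetlb ? FILE_ACCESS_READWRITE : FILE_ACCESS_READONLY;
}
```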
At 2023-08-11 22:31:36, "Peter Xu" <peterx@redhat.com> wrote: >On Fri, Aug 11, 2023 at 01:49:52PM +0800, ThinerLogoer wrote: >> At 2023-08-11 05:24:43, "Peter Xu" <peterx@redhat.com> wrote: >> >On Fri, Aug 11, 2023 at 01:06:12AM +0800, ThinerLogoer wrote: >> >> >I think we have the following options (there might be more) >> >> > >> >> >1) This patch. >> >> > >> >> >2) New flag for memory-backend-file. We already have "readonly" and >> >> >"share=". I'm having a hard time coming up with a good name that really >> >> >describes the subtle difference. >> >> > >> >> >3) Glue behavior to the QEMU machine >> >> > >> >> >> >> 4) '-deny-private-discard' argv, or environment variable, or both >> > >> >I'd personally vote for (2). How about "fdperm"? To describe when we want >> >to use different rw permissions on the file (besides the access permission >> >of the memory we already provided with "readonly"=XXX). IIUC the only sane >> >value will be ro/rw/default, where "default" should just use the same rw >> >permission as the memory ("readonly"=XXX). >> > >> >Would that be relatively clean and also work in this use case? >> > >> >(the other thing I'd wish we don't have that fallback is, as long as we >> > have any of that "fallback" we'll need to be compatible with it since >> > then, and for ever...) >> >> If it must be (2), I would vote (2) + (4), with (4) adjust the default behavior of said `fdperm`. >> Mainly because (private+discard) is itself not a good practice and (4) serves >> as a good tool to help catch existing (private+discard) problems. >> >> Actually (readonly+private) is more reasonable than (private+discard), so I >> want at least one room for a default (readonly+private) behavior. > >Just for purely discussion purpose: I think maybe someday private+discard >could work. IIUC what we're missing is an syscall interface to install a >zero page for a MAP_PRIVATE, atomically freeing what's underneath: it seems >either punching a hole or DONTNEED won't suffice here. It'll just be >another problem when having zero page involved in file mappings at least. > >> >> Also in my case I kind of have to use "-mem-path" despite it being considered >> to be close to deprecated. Only with this I can avoid knowledge of memory >> backend before migration. Actually there seems to be no equivalent working after-migration >> setup of "-object memory-backend-file,... -machine q35,mem=..." that can match >> before-migration setup of "-machine q35" (specifying nothing). Therefore >> I must make a plan and choose a migration method BEFORE I boot the >> machine and prepare to migrate, reducing the operation freedom. >> Considering that, I have to use "-mem-path" which keeps the freedom but >> has no configurable argument and I have to rely on default config. >> >> Are there any "-object memory-backend-file..." setup equivalent to "-machine q35" >> that can migrate from and to each other? If there is, I want to try it out. >> By the way "-object memory-backend-file,id=pc.ram" has just been killed by an earlier >> commit. > >I'm actually not familiar enough on the interfaces here, but I just checked >up the man page; would this work for you, together with option (2)? > > memory-backend='id' > An alternative to legacy -mem-path and mem-prealloc options. Allows to use a memory backend as main RAM. > > For example: > > -object memory-backend-file,id=pc.ram,size=512M,mem-path=/hugetlbfs,prealloc=on,share=on > -machine memory-backend=pc.ram > -m 512M > Wait ... I thought it should not work but it did work today. 
How bad am I at reading the correct part of the documentation ...

'-machine q35 -m 512M' is equivalent to '-object memory-backend-file,id=pc.ram,size=512M -machine q35,memory-backend=pc.ram', the latter works, and the two mentioned setups can be migrated from one to another.

What I was consistently trying was '-object memory-backend-file,id=pc.ram,size=512M -machine q35', and qemu raises an error for this in a recent update:

```
qemu-system-x86_64: object name 'pc.ram' is reserved for the default RAM backend, it can't be used for any other purposes. Change the object's 'id' to something else
```

This error is misleading. Actually, in this case, the error message should be closer to:

```
object name 'pc.ram' is reserved for the default RAM backend, it can't be used for any other purposes. Change the object's 'id' to something else, or append "memory-backend=pc.ram" to -machine arguments
```

(I suggest rewording the error message like this because of the confusion just now.)

Even though the default memory backend name is pc.ram, the '-machine q35,memory-backend=pc.ram' part explicitly marks that qemu uses a memory backend named pc.ram, rather than relying on the default. It seems that if it relies on the default and a memory-backend-file has an id of "pc.ram" (on x86_64 of course), it will fail.

Great. Now I will consider using "-object memory-backend-file,id=pc.ram,size=512M -machine q35,memory-backend=pc.ram"

--
Regards,
logoerthiner
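For reference, the two invocations reported as migration-compatible above, written out as full command lines. A mem-path is added here because memory-backend-file requires one; the path and trailing arguments are illustrative:

```
# Default RAM backend, as before migration:
qemu-system-x86_64 -machine q35 -m 512M ...

# Explicit backend using the reserved name, migratable to/from the above:
qemu-system-x86_64 -object memory-backend-file,id=pc.ram,size=512M,mem-path=/tmp/pc.ram \
                   -machine q35,memory-backend=pc.ram ...
```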
@Stefan, see below on a R/O NVDIMM question.

We're discussing how to get a MAP_PRIVATE R/W mapping of a memory-backend-file running when using R/O files.

>
> This seems a good idea. I am good with the solution you proposed
> here as well.

I was just going to get started working on that, when I realized something important:

"@readonly: if true, the backing file is opened read-only; if false, it is opened read-write. (default: false)"

"@share: if false, the memory is private to QEMU; if true, it is shared (default: false)"

So readonly is *all* about the file access mode already ... the mmap() parameters are just a side-effect of that. Adding a new "file-access-mode" or similar would be wrong.

Here are the combinations we have right now:

-object memory-backend-file,share=on,readonly=on
-> Existing behavior: Open readonly, mmap readonly shared
-> Makes sense, mmap'ing readwrite would fail

-object memory-backend-file,share=on,readonly=off
-> Existing behavior: Open readwrite, mmap readwrite shared
-> Mostly makes sense: why open a shared file R/W and not mmap it R/W?

-object memory-backend-file,share=off,readonly=off
-> Existing behavior: Open readwrite, mmap readwrite private
-> Mostly makes sense: why open a file R/W and not map it R/W (even if private)?

-object memory-backend-file,share=off,readonly=on
-> Existing behavior: Open readonly, mmap readonly private
-> That's the problematic one

So for your use case (VM templating using a readonly file), you would actually want to use:

-object memory-backend-file,share=off,readonly=on

BUT, have the mmap be writable (instead of currently readonly).

Assuming we would change the current behavior, what if someone would specify:

-object memory-backend-file,readonly=on

(because the default is share=off ...) and using it for a R/O NVDIMM, where we expect any write access to fail.

But let's look at the commit that added the "readonly" parameter:

commit 86635aa4e9d627d5142b81c57a33dd1f36627d07
Author: Stefan Hajnoczi <stefanha@redhat.com>
Date: Mon Jan 4 17:13:19 2021 +0000

    hostmem-file: add readonly=on|off option

    Let -object memory-backend-file work on read-only files when the
    readonly=on option is given. This can be used to share the contents of a
    file between multiple guests while preventing them from consuming
    Copy-on-Write memory if guests dirty the pages, for example.

That was part of

https://lore.kernel.org/all/20210104171320.575838-3-stefanha@redhat.com/T/#m712f995e6dcfdde433958bae5095b145dd0ee640

From what I understand, for NVDIMMs we always use "-object memory-backend-file,share=on", even when we want a readonly NVDIMM.

So we have two options:

1) Change the current behavior of -object memory-backend-file,share=off,readonly=on:
-> Open the file r/o but mmap it writable

2) Add a new property to configure the mmap accessibility. Not a big fan of that.

@Stefan, do you have any concern when we would do 1) ?

As far as I can tell, we have to set the nvdimm to "unarmed=on" either way:

+ "unarmed" controls the ACPI NFIT NVDIMM Region Mapping Structure "NVDIMM
+ State Flags" Bit 3 indicating that the device is "unarmed" and cannot accept
+ persistent writes. Linux guest drivers set the device to read-only when this
+ bit is present. Set unarmed to on when the memdev has readonly=on.

So changing the behavior would not really break the nvdimm use case. Further, we could warn in nvdimm code when we stumble over this configuration with unarmed=on.
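The key property being relied on here -- that a MAP_PRIVATE mapping may be PROT_WRITE even when the fd was opened O_RDONLY -- can be demonstrated with a few lines of standalone C (not QEMU code; the file path is arbitrary and assumed readable and non-empty):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/etc/hostname", O_RDONLY);   /* any readable, non-empty file */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);

    if (fd < 0 || p == MAP_FAILED) {
        perror("open/mmap");
        return 1;
    }
    p[0] = 'X';    /* COW: the write succeeds, but never reaches the file */
    printf("private write ok: %c\n", p[0]);
    munmap(p, 4096);
    close(fd);
    return 0;
}
```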
On Mon, Aug 07, 2023 at 09:07:32PM +0200, David Hildenbrand wrote:
> From: Thiner Logoer <logoerthiner1@163.com>
>
> Users may specify
> * "-mem-path" or
> * "-object memory-backend-file,share=off,readonly=off"
> and expect such COW (MAP_PRIVATE) mappings to work, even if the user
> does not have write permissions to open the file.
>
> For now, we would always fail in that case, always requiring file write
> permissions. Let's detect when that failure happens and fallback to opening
> the file readonly.
>
> Warn the user, since there are other use cases where we want the file to
> be mapped writable: ftruncate() and fallocate() will fail if the file
> was not opened with write permissions.
>
> Signed-off-by: Thiner Logoer <logoerthiner1@163.com>
> Co-developed-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
> softmmu/physmem.c | 26 ++++++++++++++++++--------
> 1 file changed, 18 insertions(+), 8 deletions(-)
>
> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> index 3df73542e1..d1ae694b20 100644
> --- a/softmmu/physmem.c
> +++ b/softmmu/physmem.c
> @@ -1289,8 +1289,7 @@ static int64_t get_file_align(int fd)
> static int file_ram_open(const char *path,
> const char *region_name,
> bool readonly,
> - bool *created,
> - Error **errp)
> + bool *created)
> {
> char *filename;
> char *sanitized_name;
> @@ -1334,10 +1333,7 @@ static int file_ram_open(const char *path,
> g_free(filename);
> }
> if (errno != EEXIST && errno != EINTR) {
> - error_setg_errno(errp, errno,
> - "can't open backing store %s for guest RAM",
> - path);
> - return -1;
> + return -errno;
> }
> /*
> * Try again on EINTR and EEXIST. The latter happens when
> @@ -1946,9 +1942,23 @@ RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
> bool created;
> RAMBlock *block;
>
> - fd = file_ram_open(mem_path, memory_region_name(mr), readonly, &created,
> - errp);
> + fd = file_ram_open(mem_path, memory_region_name(mr), readonly, &created);
> + if (fd == -EACCES && !(ram_flags & RAM_SHARED) && !readonly) {
> + /*
> + * We can have a writable MAP_PRIVATE mapping of a readonly file.
> + * However, some operations like ftruncate() or fallocate() might fail
> + * later, let's warn the user.
> + */
> + fd = file_ram_open(mem_path, memory_region_name(mr), true, &created);
> + if (fd >= 0) {
> + warn_report("backing store %s for guest RAM (MAP_PRIVATE) opened"
> + " readonly because the file is not writable", mem_path);

IIUC, from the description, the goal is that usage of a readonly backing store is intended to be an explicitly supported deployment configuration. At the same time though, this scenario could also be a deployment mistake that we want to diagnose.

It is inappropriate to issue warn_report() for things that are supported usage. It is also undesirable to continue execution in the case of things which are a deployment mistake. These two scenarios are mutually incompatible, so I understand why you chose to fudge it with a warn_report().

I wonder if this is pointing to the need for another configuration knob for the memory backend, to express the different desired usage models.

We want write access when opening the file, either if we want the file shared, or so that we can ftruncate it to the right size, if it does not exist. If shared=off and the file is pre-created at the right size, we should be able to use O_RDONLY even if the file is writable.
So what if we added a 'create=yes|no' option to memory-backend-file?

-object memory-backend-file,share=off,readonly=off,create=yes

would imply the need for O_RDWR, so that ftruncate() can do its work. With share=off,create=no, we could unconditionally open O_RDONLY, even if the file is writable.

This would let us support read-only backing files, without any warn_report()s for this usage, while also stopping execution on deployment mistakes.

This doesn't help -mem-path, since it doesn't take options, but IMHO it would be acceptable to say users need to use the more verbose '-object memory-backend-file' instead.

> + }
> + }
> if (fd < 0) {
> + error_setg_errno(errp, -fd,
> + "can't open backing store %s for guest RAM",
> + mem_path);
> return NULL;
> }
>
> --
> 2.41.0
>

With regards,
Daniel
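A sketch of the open(2) flag selection this proposal implies. The helper is hypothetical; 'create' stands for the proposed create=yes|no option and 'share' for the existing share=on|off property:

```c
#include <fcntl.h>
#include <stdbool.h>

/* Hypothetical sketch of the proposal above, not actual QEMU code. */
static int backing_open_flags(bool share, bool create)
{
    if (share || create) {
        /* shared mappings and ftruncate()/fallocate() need a writable fd */
        return create ? (O_RDWR | O_CREAT) : O_RDWR;
    }
    /* share=off,create=no: a pre-created, pre-sized file can be opened R/O */
    return O_RDONLY;
}
```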
On Thu, Aug 10, 2023 at 04:19:45PM +0200, David Hildenbrand wrote:
> > > Most importantly, we won't be corrupting/touching the original file in any
> > > case, because it is R/O.
> > >
> > > If we really want to be careful, we could clue that behavior to compat
> > > machines. I'm not really sure yet if we really have to go down that path.
> > >
> > > Any other alternatives? I'd like to avoid new flags where not really
> > > required.
> >
> > I was just thinking of a new flag. :) So have you already discussed that
> > possibility and decided that not a good idea?
>
> Not really. I was briefly playing with that idea but already struggled to
> come up with a reasonable name :)
>
> Less toggles and just have it working nice, if possible.

IMHO having a new flag is desirable, because it is directly expressing the desired deployment scenario, such that we get good error reporting upon deployment mistakes, while at the same time allowing the readonly usage.

> > The root issue to me here is we actually have two resources (memory map of
> > the process, and the file) but we only have one way to describe the
> > permissions upon the two objects. I'd think it makes a lot more sense if a
> > new flag is added, when there's a need to differentiate the two.
> >
> > Consider if you see a bunch of qemu instances with:
> >
> > -mem-path $RAM_FILE
> >
> > On the same host, which can be as weird as it could be to me.. At least
> > '-mem-path' looks still like a way to exclusively own a ram file for an
> > instance. I hesitate the new fallback can confuse people too, while that's
> > so far not the major use case.
>
> Once I learned that this is not a MAP_SHARED mapping, I was extremely
> confused. For example, vhost-user with "-mem-path" will absolutely not work
> with "-mem-path", even though the documentation explicitly spells that out
> (I still have to send a patch to fix that).
>
> I guess "-mem-path" was primarily only used to consume hugetlb. Even for
> tmpfs it will already result in a double memory consumption, just like when
> using -memory-backend-memfd,share=no.
>
> I guess deprecating it was the right decision.

Regardless of whether it's deprecated or not, I think it's fine to just say people need to use the more verbose memory-backend-file syntax if they want to use an unusual deployment configuration where there is a readonly backing file.

> > Nobody may really rely on any existing behavior of the failure, but
> > changing existing behavior is just always not wanted. The guideline here
> > to me is: whether we want existing "-mem-path XXX" users to start using the
> > fallback in general? If it's "no", then maybe it implies a new flag is
> > better?
>
> I think we have the following options (there might be more)
>
> 1) This patch.
>
> 2) New flag for memory-backend-file. We already have "readonly" and
> "share=". I'm having a hard time coming up with a good name that really
> describes the subtle difference.
>
> 3) Glue behavior to the QEMU machine
>
> For 3), one option would be to always open a COW file readonly (as Thiner
> originally proposed). We could leave "-mem-path" behavior alone and only
> change memory-backend-file semantics. If the COW file does *not* exist yet,
> we would refuse to create the file like patch 2+3 do. Therefore, no
> ftruncate() errors, and fallocate() errors would always happen.

I'm for (2).

With regards,
Daniel
On 17.08.23 15:37, Daniel P. Berrangé wrote: > On Mon, Aug 07, 2023 at 09:07:32PM +0200, David Hildenbrand wrote: >> From: Thiner Logoer <logoerthiner1@163.com> >> >> Users may specify >> * "-mem-path" or >> * "-object memory-backend-file,share=off,readonly=off" >> and expect such COW (MAP_PRIVATE) mappings to work, even if the user >> does not have write permissions to open the file. >> >> For now, we would always fail in that case, always requiring file write >> permissions. Let's detect when that failure happens and fallback to opening >> the file readonly. >> >> Warn the user, since there are other use cases where we want the file to >> be mapped writable: ftruncate() and fallocate() will fail if the file >> was not opened with write permissions. >> >> Signed-off-by: Thiner Logoer <logoerthiner1@163.com> >> Co-developed-by: David Hildenbrand <david@redhat.com> >> Signed-off-by: David Hildenbrand <david@redhat.com> >> --- >> softmmu/physmem.c | 26 ++++++++++++++++++-------- >> 1 file changed, 18 insertions(+), 8 deletions(-) >> >> diff --git a/softmmu/physmem.c b/softmmu/physmem.c >> index 3df73542e1..d1ae694b20 100644 >> --- a/softmmu/physmem.c >> +++ b/softmmu/physmem.c >> @@ -1289,8 +1289,7 @@ static int64_t get_file_align(int fd) >> static int file_ram_open(const char *path, >> const char *region_name, >> bool readonly, >> - bool *created, >> - Error **errp) >> + bool *created) >> { >> char *filename; >> char *sanitized_name; >> @@ -1334,10 +1333,7 @@ static int file_ram_open(const char *path, >> g_free(filename); >> } >> if (errno != EEXIST && errno != EINTR) { >> - error_setg_errno(errp, errno, >> - "can't open backing store %s for guest RAM", >> - path); >> - return -1; >> + return -errno; >> } >> /* >> * Try again on EINTR and EEXIST. The latter happens when >> @@ -1946,9 +1942,23 @@ RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr, >> bool created; >> RAMBlock *block; >> >> - fd = file_ram_open(mem_path, memory_region_name(mr), readonly, &created, >> - errp); >> + fd = file_ram_open(mem_path, memory_region_name(mr), readonly, &created); >> + if (fd == -EACCES && !(ram_flags & RAM_SHARED) && !readonly) { >> + /* >> + * We can have a writable MAP_PRIVATE mapping of a readonly file. >> + * However, some operations like ftruncate() or fallocate() might fail >> + * later, let's warn the user. >> + */ >> + fd = file_ram_open(mem_path, memory_region_name(mr), true, &created); >> + if (fd >= 0) { >> + warn_report("backing store %s for guest RAM (MAP_PRIVATE) opened" >> + " readonly because the file is not writable", mem_path); > > IIUC, from the description, the goal is that usage of a readonly > backing store is intented to be an explicitly supported deployment > configuration. At the time time though, this scenario could also be > a deployment mistake that we want to diagnose FWIW, I abandoned this approach here and instead will look into making memory-backend-file,readonly=on,share=off create RAM instead of ROM. The fallback was wrong after realizing what "readonly" actually is supposed to do. I stared at libvirt, an even it never seems to set readonly=on for R/O DIMMs, so you always get RAM and then tell the nvdimm device to not perform any writes (unarmed=on)
On 17.08.23 15:42, Daniel P. Berrangé wrote: > On Thu, Aug 10, 2023 at 04:19:45PM +0200, David Hildenbrand wrote: >>>> Most importantly, we won't be corrupting/touching the original file in any >>>> case, because it is R/O. >>>> >>>> If we really want to be careful, we could clue that behavior to compat >>>> machines. I'm not really sure yet if we really have to go down that path. >>>> >>>> Any other alternatives? I'd like to avoid new flags where not really >>>> required. >>> >>> I was just thinking of a new flag. :) So have you already discussed that >>> possibility and decided that not a good idea? >> >> Not really. I was briefly playing with that idea but already struggled to >> come up with a reasonable name :) >> >> Less toggles and just have it working nice, if possible. > > IMHO having a new flag is desirable, because it is directly > expressing the desired deployment scenario, such tat we get > good error reporting upon deployment mistakes, while at the > same time allowing the readonly usage. > >>> The root issue to me here is we actually have two resources (memory map of >>> the process, and the file) but we only have one way to describe the >>> permissions upon the two objects. I'd think it makes a lot more sense if a >>> new flag is added, when there's a need to differentiate the two. >>> >>> Consider if you see a bunch of qemu instances with: >>> >>> -mem-path $RAM_FILE >>> >>> On the same host, which can be as weird as it could be to me.. At least >>> '-mem-path' looks still like a way to exclusively own a ram file for an >>> instance. I hesitate the new fallback can confuse people too, while that's >>> so far not the major use case. >> >> Once I learned that this is not a MAP_SHARED mapping, I was extremely >> confused. For example, vhost-user with "-mem-path" will absolutely not work >> with "-mem-path", even though the documentation explicitly spells that out >> (I still have to send a patch to fix that). >> >> I guess "-mem-path" was primarily only used to consume hugetlb. Even for >> tmpfs it will already result in a double memory consumption, just like when >> using -memory-backend-memfd,share=no. >> >> I guess deprecating it was the right decision. > > Regardless of whether its deprecated or not, I think its fine to just > say people need to use the more verbose memory-backend-file syntax > if they want to use an unusual deployment configuration where there is > a readonly backing file. > >>> Nobody may really rely on any existing behavior of the failure, but >>> changing existing behavior is just always not wanted. The guideline here >>> to me is: whether we want existing "-mem-path XXX" users to start using the >>> fallback in general? If it's "no", then maybe it implies a new flag is >>> better? >> >> I think we have the following options (there might be more) >> >> 1) This patch. >> >> 2) New flag for memory-backend-file. We already have "readonly" and >> "share=". I'm having a hard time coming up with a good name that really >> describes the subtle difference. >> >> 3) Glue behavior to the QEMU machine >> >> >> For 3), one option would be to always open a COW file readonly (as Thiner >> originally proposed). We could leave "-mem-path" behavior alone and only >> change memory-backend-file semantics. If the COW file does *not* exist yet, >> we would refuse to create the file like patch 2+3 do. Therefore, no >> ftruncate() errors, and fallocate() errors would always happen. > > I'm for (2). 
(2) in the form we discussed here is wrong because "readonly" already expresses "open this file readonly", not anything else.
On Fri, Aug 11, 2023 at 09:00:54PM +0200, David Hildenbrand wrote: > On 11.08.23 07:49, ThinerLogoer wrote: > > At 2023-08-11 05:24:43, "Peter Xu" <peterx@redhat.com> wrote: > > > On Fri, Aug 11, 2023 at 01:06:12AM +0800, ThinerLogoer wrote: > > > > > I think we have the following options (there might be more) > > > > > > > > > > 1) This patch. > > > > > > > > > > 2) New flag for memory-backend-file. We already have "readonly" and > > > > > "share=". I'm having a hard time coming up with a good name that really > > > > > describes the subtle difference. > > > > > > > > > > 3) Glue behavior to the QEMU machine > > > > > > > > > > > > > 4) '-deny-private-discard' argv, or environment variable, or both > > > > > > I'd personally vote for (2). How about "fdperm"? To describe when we want > > > to use different rw permissions on the file (besides the access permission > > > of the memory we already provided with "readonly"=XXX). IIUC the only sane > > > value will be ro/rw/default, where "default" should just use the same rw > > > permission as the memory ("readonly"=XXX). > > > > > > Would that be relatively clean and also work in this use case? > > > > > > (the other thing I'd wish we don't have that fallback is, as long as we > > > have any of that "fallback" we'll need to be compatible with it since > > > then, and for ever...) > > > > If it must be (2), I would vote (2) + (4), with (4) adjust the default behavior of said `fdperm`. > > Mainly because (private+discard) is itself not a good practice and (4) serves > > as a good tool to help catch existing (private+discard) problems. > > Instead of fdperm, maybe we could find a better name. > > The man page of "open" says: The argument flags must include one of the > following access modes: O_RDONLY, O_WRONLY, or O_RDWR. These request > opening the file read-only, write-only, or read/write, respectively. > > So maybe something a bit more mouthful like "file-access-mode" would be > better. I don't think we should directly express the config in terms of file-access-mode, as that's a low level impl detail. The required file access mode is an artifact of the higher level goal, or whether the RAM should be process private vs shared, and whether we want QEMU to be able to create the backing file or use pre-create one. IOW, we should express whether or not we want QEMU to try to pre-create the file or not. With regards, Daniel
On 17.08.23 15:46, Daniel P. Berrangé wrote: > On Fri, Aug 11, 2023 at 09:00:54PM +0200, David Hildenbrand wrote: >> On 11.08.23 07:49, ThinerLogoer wrote: >>> At 2023-08-11 05:24:43, "Peter Xu" <peterx@redhat.com> wrote: >>>> On Fri, Aug 11, 2023 at 01:06:12AM +0800, ThinerLogoer wrote: >>>>>> I think we have the following options (there might be more) >>>>>> >>>>>> 1) This patch. >>>>>> >>>>>> 2) New flag for memory-backend-file. We already have "readonly" and >>>>>> "share=". I'm having a hard time coming up with a good name that really >>>>>> describes the subtle difference. >>>>>> >>>>>> 3) Glue behavior to the QEMU machine >>>>>> >>>>> >>>>> 4) '-deny-private-discard' argv, or environment variable, or both >>>> >>>> I'd personally vote for (2). How about "fdperm"? To describe when we want >>>> to use different rw permissions on the file (besides the access permission >>>> of the memory we already provided with "readonly"=XXX). IIUC the only sane >>>> value will be ro/rw/default, where "default" should just use the same rw >>>> permission as the memory ("readonly"=XXX). >>>> >>>> Would that be relatively clean and also work in this use case? >>>> >>>> (the other thing I'd wish we don't have that fallback is, as long as we >>>> have any of that "fallback" we'll need to be compatible with it since >>>> then, and for ever...) >>> >>> If it must be (2), I would vote (2) + (4), with (4) adjust the default behavior of said `fdperm`. >>> Mainly because (private+discard) is itself not a good practice and (4) serves >>> as a good tool to help catch existing (private+discard) problems. >> >> Instead of fdperm, maybe we could find a better name. >> >> The man page of "open" says: The argument flags must include one of the >> following access modes: O_RDONLY, O_WRONLY, or O_RDWR. These request >> opening the file read-only, write-only, or read/write, respectively. >> >> So maybe something a bit more mouthful like "file-access-mode" would be >> better. > > I don't think we should directly express the config in terms > of file-access-mode, as that's a low level impl detail. The > required file access mode is an artifact of the higher level > goal, or whether the RAM should be process private vs shared, > and whether we want QEMU to be able to create the backing > file or use pre-create one. See my other mails "readonly" already expresses exactly that. So no need for "file-access-mode". (and as far as I can see, no need for any other flags)
> @Stefan, do you have any concern when we would do 1) ?
>
> As far as I can tell, we have to set the nvdimm to "unarmed=on" either way:
>
> + "unarmed" controls the ACPI NFIT NVDIMM Region Mapping Structure "NVDIMM
> + State Flags" Bit 3 indicating that the device is "unarmed" and cannot accept
> + persistent writes. Linux guest drivers set the device to read-only when this
> + bit is present. Set unarmed to on when the memdev has readonly=on.
>
> So changing the behavior would not really break the nvdimm use case.

Looking into the details, this seems to be the right thing to do.

This is what I have now as the patch description, which also highlights how libvirt doesn't even make use of readonly=true.

From 42f272ace68e0cd660a8448adb5aefb3b9dd7005 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Thu, 17 Aug 2023 12:09:07 +0200
Subject: [PATCH v2 2/4] backends/hostmem-file: Make share=off,readonly=on result in RAM instead of ROM

For now, "share=off,readonly=on" would always result in us opening the file R/O and mmap'ing the opened file MAP_PRIVATE R/O -- effectively turning it into ROM.

As documented, readonly only specifies that we want to open the file R/O:

    @readonly: if true, the backing file is opened read-only; if false,
    it is opened read-write. (default: false)

Especially for VM templating, "share=off" is a common use case. However, that use case is impossible with files that lack write permissions, because "share=off,readonly=off" will fail opening the file, and "share=off,readonly=on" will give us ROM instead of RAM.

With MAP_PRIVATE we can easily open the file R/O and mmap it R/W, to turn it into COW RAM: private changes don't affect the file after all and don't require write permissions.

This implies that we only get ROM now via "share=on,readonly=on". "share=off,readonly=on" will give us RAM.

The sole users of ROM via memory-backend-file are R/O NVDIMMs. They also require "unarmed=on" to be set for the nvdimm device.

With this change, R/O NVDIMMs will continue working even if "share=off,readonly=on" was specified, similar to when simply providing ordinary RAM to the nvdimm device and setting "unarmed=on".

Note that libvirt seems to default for a "readonly" nvdimm to
* -object memory-backend-file,share=off (implying readonly=off)
* -device nvdimm,unarmed=on
and never seems to even set "readonly=on" for memory-backend-file. So this change won't affect libvirt; they already always get COW RAM -- not modifying the underlying file but opening it R/O.

If someone really wants ROM, they can just use "share=on,readonly=on". After all, there is no relevant difference between a R/O MAP_SHARED file mapping and a R/O MAP_PRIVATE file mapping.

Signed-off-by: David Hildenbrand <david@redhat.com>
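Under the semantics described in this commit message, the two use cases would look roughly like this; the mem-path and size values are made up for the example:

```
# VM templating: COW RAM from a file we may not even be allowed to write to;
# the file is opened R/O, the mapping is MAP_PRIVATE R/W:
-object memory-backend-file,id=ram0,mem-path=/templates/vm.ram,size=4G,share=off,readonly=on

# Actual ROM (e.g. for a R/O NVDIMM): both the file and the mapping stay R/O:
-object memory-backend-file,id=nv0,mem-path=/templates/nvdimm.img,size=4G,share=on,readonly=on
```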
On Thu, Aug 17, 2023 at 04:30:16PM +0200, David Hildenbrand wrote: > > > @Stefan, do you have any concern when we would do 1) ? > > > > As far as I can tell, we have to set the nvdimm to "unarmed=on" either way: > > > > + "unarmed" controls the ACPI NFIT NVDIMM Region Mapping Structure "NVDIMM > > + State Flags" Bit 3 indicating that the device is "unarmed" and cannot accept > > + persistent writes. Linux guest drivers set the device to read-only when this > > + bit is present. Set unarmed to on when the memdev has readonly=on. > > > > So changing the behavior would not really break the nvdimm use case. > > Looking into the details, this seems to be the right thing to do. > > This is what I have now as patch description, that also highlights how libvirt > doesn't even make use of readonly=true. > > > From 42f272ace68e0cd660a8448adb5aefb3b9dd7005 Mon Sep 17 00:00:00 2001 > From: David Hildenbrand <david@redhat.com> > Date: Thu, 17 Aug 2023 12:09:07 +0200 > Subject: [PATCH v2 2/4] backends/hostmem-file: Make share=off,readonly=on > result in RAM instead of ROM > > For now, "share=off,readonly=on" would always result in us opening the > file R/O and mmap'ing the opened file MAP_PRIVATE R/O -- effectively > turning it into ROM. > > As documented, readonly only specifies that we want to open the file R/O: > > @readonly: if true, the backing file is opened read-only; if false, > it is opened read-write. (default: false) > > Especially for VM templating, "share=off" is a common use case. However, > that use case is impossible with files that lack write permissions, > because "share=off,readonly=off" will fail opening the file, and > "share=off,readonly=on" will give us ROM instead of RAM. > > With MAP_PRIVATE we can easily open the file R/O and mmap it R/W, to > turn it into COW RAM: private changes don't affect the file after all and > don't require write permissions. > > This implies that we only get ROM now via "share=on,readonly=on". > "share=off,readonly=on" will give us RAM. > > The sole user of ROM via memory-backend-file are R/O NVDIMMs. They > also require "unarmed=on" to be set for the nvdimm device. > > With this change, R/O NVDIMMs will continue working even if > "share=off,readonly=on" was specified similar to when simply > providing ordinary RAM to the nvdimm device and setting "unarmed=on". > > Note that libvirt seems to default for a "readonly" nvdimm to > * -object memory-backend-file,share=off (implying readonly=off) > * -device nvdimm,unarmed=on > And never seems to even set "readonly=on" for memory-backend-file. So > this change won't affect libvirt, they already always get COW RAM -- not > modifying the underlying file but opening it R/O. > > If someone really wants ROM, they can just use "share=on,readonly=on". > After all, there is not relevant difference between a R/O MAP_SHARED > file mapping and a R/O MAP_PRIVATE file mapping. > > Signed-off-by: David Hildenbrand <david@redhat.com> This still leaves the patch having a warn_report() which I think is undesirable to emit in a valid / supported use case. With regards, Daniel
On 17.08.23 16:37, Daniel P. Berrangé wrote: > On Thu, Aug 17, 2023 at 04:30:16PM +0200, David Hildenbrand wrote: >> >>> @Stefan, do you have any concern when we would do 1) ? >>> >>> As far as I can tell, we have to set the nvdimm to "unarmed=on" either way: >>> >>> + "unarmed" controls the ACPI NFIT NVDIMM Region Mapping Structure "NVDIMM >>> + State Flags" Bit 3 indicating that the device is "unarmed" and cannot accept >>> + persistent writes. Linux guest drivers set the device to read-only when this >>> + bit is present. Set unarmed to on when the memdev has readonly=on. >>> >>> So changing the behavior would not really break the nvdimm use case. >> >> Looking into the details, this seems to be the right thing to do. >> >> This is what I have now as patch description, that also highlights how libvirt >> doesn't even make use of readonly=true. >> >> >> From 42f272ace68e0cd660a8448adb5aefb3b9dd7005 Mon Sep 17 00:00:00 2001 >> From: David Hildenbrand <david@redhat.com> >> Date: Thu, 17 Aug 2023 12:09:07 +0200 >> Subject: [PATCH v2 2/4] backends/hostmem-file: Make share=off,readonly=on >> result in RAM instead of ROM >> >> For now, "share=off,readonly=on" would always result in us opening the >> file R/O and mmap'ing the opened file MAP_PRIVATE R/O -- effectively >> turning it into ROM. >> >> As documented, readonly only specifies that we want to open the file R/O: >> >> @readonly: if true, the backing file is opened read-only; if false, >> it is opened read-write. (default: false) >> >> Especially for VM templating, "share=off" is a common use case. However, >> that use case is impossible with files that lack write permissions, >> because "share=off,readonly=off" will fail opening the file, and >> "share=off,readonly=on" will give us ROM instead of RAM. >> >> With MAP_PRIVATE we can easily open the file R/O and mmap it R/W, to >> turn it into COW RAM: private changes don't affect the file after all and >> don't require write permissions. >> >> This implies that we only get ROM now via "share=on,readonly=on". >> "share=off,readonly=on" will give us RAM. >> >> The sole user of ROM via memory-backend-file are R/O NVDIMMs. They >> also require "unarmed=on" to be set for the nvdimm device. >> >> With this change, R/O NVDIMMs will continue working even if >> "share=off,readonly=on" was specified similar to when simply >> providing ordinary RAM to the nvdimm device and setting "unarmed=on". >> >> Note that libvirt seems to default for a "readonly" nvdimm to >> * -object memory-backend-file,share=off (implying readonly=off) >> * -device nvdimm,unarmed=on >> And never seems to even set "readonly=on" for memory-backend-file. So >> this change won't affect libvirt, they already always get COW RAM -- not >> modifying the underlying file but opening it R/O. >> >> If someone really wants ROM, they can just use "share=on,readonly=on". >> After all, there is not relevant difference between a R/O MAP_SHARED >> file mapping and a R/O MAP_PRIVATE file mapping. >> >> Signed-off-by: David Hildenbrand <david@redhat.com> > > This still leaves the patch having a warn_report() which I think is > undesirable to emit in a valid / supported use case. No warning. Please elaborate on "valid/supported use case".
On Thu, Aug 17, 2023 at 11:07:23AM +0200, David Hildenbrand wrote:
> @Stefan, see below on a R/O NVDIMM question.
>
> We're discussing how to get a MAP_PRIVATE R/W mapping of a
> memory-backend-file running when using R/O files.
>
> >
> > This seems a good idea. I am good with the solution you proposed
> > here as well.
>
> I was just going to get started working on that, when I realized
> something important:
>
> "@readonly: if true, the backing file is opened read-only; if false,
> it is opened read-write. (default: false)"
>
> "@share: if false, the memory is private to QEMU; if true, it is
> shared (default: false)"
>
> So readonly is *all* about the file access mode already ... the mmap()
> parameters are just a side-effect of that. Adding a new
> "file-access-mode" or similar would be wrong.

Not exactly a side effect, IMHO. IIUC it's simply because we didn't have a need to use different permissions at the memory/file levels. See the other patch commit message from Stefan:

https://lore.kernel.org/all/20210104171320.575838-2-stefanha@redhat.com/

    There is currently no way to open(O_RDONLY) and mmap(PROT_READ) when
    [...]

So the goal at that time was to map/open both in RO mode, afaiu. So one parameter was enough at that time. It doesn't necessarily have to be only about the file permission or the memory, while in reality/code it does apply to both for now, until we see a need to differentiate them for CoW purposes.

> Here are the combinations we have right now:
>
> -object memory-backend-file,share=on,readonly=on
>
> -> Existing behavior: Open readonly, mmap readonly shared
> -> Makes sense, mmap'ing readwrite would fail
>
> -object memory-backend-file,share=on,readonly=off
>
> -> Existing behavior: Open readwrite, mmap readwrite shared
> -> Mostly makes sense: why open a shared file R/W and not mmap it
> R/W?
>
> -object memory-backend-file,share=off,readonly=off
> -> Existing behavior: Open readwrite, mmap readwrite private
> -> Mostly makes sense: why open a file R/W and not map it R/W (even if
> private)?
>
> -object memory-backend-file,share=off,readonly=on
> -> Existing behavior: Open readonly, mmap readonly private
> -> That's the problematic one
>
> So for your use case (VM templating using a readonly file), you
> would actually want to use:
>
> -object memory-backend-file,share=off,readonly=on
>
> BUT, have the mmap be writable (instead of currently readonly).
>
> Assuming we would change the current behavior, what if someone would
> specify:
>
> -object memory-backend-file,readonly=on
>
> (because the default is share=off ...) and using it for a R/O NVDIMM,
> where we expect any write access to fail.

It will (as expected), right? fallocate() will just fail on the RO files.

To be explicit, maybe we should just remember the readonly attribute for a ramblock, and then we can even provide a more meaningful error log, like:

diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 3df73542e1..f8c11c8d54 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -3424,9 +3424,13 @@ int qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque)
 int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
 {
     int ret = -1;
-
     uint8_t *host_startaddr = rb->host + start;

+    if (rb->flags & RAM_READONLY) {
+        // more meaningful error reports (though right now no errp passed in..)
+        return -EACCES;
+    }
+

I see that Stefan even raised this question in the commit log:

    No new RAMBlock flag is introduced for read-only because it's unclear
    whether RAMBlocks need to know that they are read-only.
    Pass a bool readonly argument instead.

Right now failing at fallocate() isn't that bad to me.

>
> But let's look at the commit that added the "readonly" parameter:
>
> commit 86635aa4e9d627d5142b81c57a33dd1f36627d07
> Author: Stefan Hajnoczi <stefanha@redhat.com>
> Date: Mon Jan 4 17:13:19 2021 +0000
>
>     hostmem-file: add readonly=on|off option
>
>     Let -object memory-backend-file work on read-only files when the
>     readonly=on option is given. This can be used to share the contents of a
>     file between multiple guests while preventing them from consuming
>     Copy-on-Write memory if guests dirty the pages, for example.
>
> That was part of
>
> https://lore.kernel.org/all/20210104171320.575838-3-stefanha@redhat.com/T/#m712f995e6dcfdde433958bae5095b145dd0ee640
>
> From what I understand, for NVDIMMs we always use
> "-object memory-backend-file,share=on", even when we want a
> readonly NVDIMM.
>
> So we have two options:
>
> 1) Change the current behavior of -object memory-backend-file,share=off,readonly=on:
>
> -> Open the file r/o but mmap it writable
>
> 2) Add a new property to configure the mmap accessibility. Not a big fan of that.
>
> @Stefan, do you have any concern when we would do 1) ?
>
> As far as I can tell, we have to set the nvdimm to "unarmed=on" either way:
>
> + "unarmed" controls the ACPI NFIT NVDIMM Region Mapping Structure "NVDIMM
> + State Flags" Bit 3 indicating that the device is "unarmed" and cannot accept
> + persistent writes. Linux guest drivers set the device to read-only when this
> + bit is present. Set unarmed to on when the memdev has readonly=on.
>
> So changing the behavior would not really break the nvdimm use case.
>
> Further, we could warn in nvdimm code when we stumble over this configuration with
> unarmed=on.

I'll leave the nvdimm-specific question to Stefan, but won't this also map any readonly=on memory backends (besides nvdimm) to have the memory writable, which is still unexpected?
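Peter's point that fallocate() will just fail on R/O files is easy to check in isolation. A minimal Linux-only demo (the file path is arbitrary; the exact errno values follow the man pages, where ftruncate() and fallocate() on an fd not open for writing report EINVAL and EBADF respectively):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/etc/hostname", O_RDONLY);   /* any readable file */

    if (ftruncate(fd, 4096) < 0) {
        perror("ftruncate");    /* fails: fd is not open for writing */
    }
    if (fallocate(fd, 0, 0, 4096) < 0) {
        perror("fallocate");    /* fails: fd is not open for writing */
    }
    close(fd);
    return 0;
}
```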
On Thu, Aug 17, 2023 at 04:37:52PM +0200, David Hildenbrand wrote: > On 17.08.23 16:37, Daniel P. Berrangé wrote: > > On Thu, Aug 17, 2023 at 04:30:16PM +0200, David Hildenbrand wrote: > > > > > > > @Stefan, do you have any concern when we would do 1) ? > > > > > > > > As far as I can tell, we have to set the nvdimm to "unarmed=on" either way: > > > > > > > > + "unarmed" controls the ACPI NFIT NVDIMM Region Mapping Structure "NVDIMM > > > > + State Flags" Bit 3 indicating that the device is "unarmed" and cannot accept > > > > + persistent writes. Linux guest drivers set the device to read-only when this > > > > + bit is present. Set unarmed to on when the memdev has readonly=on. > > > > > > > > So changing the behavior would not really break the nvdimm use case. > > > > > > Looking into the details, this seems to be the right thing to do. > > > > > > This is what I have now as patch description, that also highlights how libvirt > > > doesn't even make use of readonly=true. > > > > > > > > > From 42f272ace68e0cd660a8448adb5aefb3b9dd7005 Mon Sep 17 00:00:00 2001 > > > From: David Hildenbrand <david@redhat.com> > > > Date: Thu, 17 Aug 2023 12:09:07 +0200 > > > Subject: [PATCH v2 2/4] backends/hostmem-file: Make share=off,readonly=on > > > result in RAM instead of ROM > > > > > > For now, "share=off,readonly=on" would always result in us opening the > > > file R/O and mmap'ing the opened file MAP_PRIVATE R/O -- effectively > > > turning it into ROM. > > > > > > As documented, readonly only specifies that we want to open the file R/O: > > > > > > @readonly: if true, the backing file is opened read-only; if false, > > > it is opened read-write. (default: false) > > > > > > Especially for VM templating, "share=off" is a common use case. However, > > > that use case is impossible with files that lack write permissions, > > > because "share=off,readonly=off" will fail opening the file, and > > > "share=off,readonly=on" will give us ROM instead of RAM. > > > > > > With MAP_PRIVATE we can easily open the file R/O and mmap it R/W, to > > > turn it into COW RAM: private changes don't affect the file after all and > > > don't require write permissions. > > > > > > This implies that we only get ROM now via "share=on,readonly=on". > > > "share=off,readonly=on" will give us RAM. > > > > > > The sole user of ROM via memory-backend-file are R/O NVDIMMs. They > > > also require "unarmed=on" to be set for the nvdimm device. > > > > > > With this change, R/O NVDIMMs will continue working even if > > > "share=off,readonly=on" was specified similar to when simply > > > providing ordinary RAM to the nvdimm device and setting "unarmed=on". > > > > > > Note that libvirt seems to default for a "readonly" nvdimm to > > > * -object memory-backend-file,share=off (implying readonly=off) > > > * -device nvdimm,unarmed=on > > > And never seems to even set "readonly=on" for memory-backend-file. So > > > this change won't affect libvirt, they already always get COW RAM -- not > > > modifying the underlying file but opening it R/O. > > > > > > If someone really wants ROM, they can just use "share=on,readonly=on". > > > After all, there is not relevant difference between a R/O MAP_SHARED > > > file mapping and a R/O MAP_PRIVATE file mapping. > > > > > > Signed-off-by: David Hildenbrand <david@redhat.com> > > > > This still leaves the patch having a warn_report() which I think is > > undesirable to emit in a valid / supported use case. > > No warning. > > Please elaborate on "valid/supported use case". 
The usage scenario that this patch aims to enable. IIUC, it will follow the codepath that leads to the warn_report() call in this patch. With regards, Daniel
On 17.08.23 16:45, Daniel P. Berrangé wrote: > On Thu, Aug 17, 2023 at 04:37:52PM +0200, David Hildenbrand wrote: >> On 17.08.23 16:37, Daniel P. Berrangé wrote: >>> On Thu, Aug 17, 2023 at 04:30:16PM +0200, David Hildenbrand wrote: >>>> >>>>> @Stefan, do you have any concern when we would do 1) ? >>>>> >>>>> As far as I can tell, we have to set the nvdimm to "unarmed=on" either way: >>>>> >>>>> + "unarmed" controls the ACPI NFIT NVDIMM Region Mapping Structure "NVDIMM >>>>> + State Flags" Bit 3 indicating that the device is "unarmed" and cannot accept >>>>> + persistent writes. Linux guest drivers set the device to read-only when this >>>>> + bit is present. Set unarmed to on when the memdev has readonly=on. >>>>> >>>>> So changing the behavior would not really break the nvdimm use case. >>>> >>>> Looking into the details, this seems to be the right thing to do. >>>> >>>> This is what I have now as patch description, that also highlights how libvirt >>>> doesn't even make use of readonly=true. >>>> >>>> >>>> From 42f272ace68e0cd660a8448adb5aefb3b9dd7005 Mon Sep 17 00:00:00 2001 >>>> From: David Hildenbrand <david@redhat.com> >>>> Date: Thu, 17 Aug 2023 12:09:07 +0200 >>>> Subject: [PATCH v2 2/4] backends/hostmem-file: Make share=off,readonly=on >>>> result in RAM instead of ROM >>>> >>>> For now, "share=off,readonly=on" would always result in us opening the >>>> file R/O and mmap'ing the opened file MAP_PRIVATE R/O -- effectively >>>> turning it into ROM. >>>> >>>> As documented, readonly only specifies that we want to open the file R/O: >>>> >>>> @readonly: if true, the backing file is opened read-only; if false, >>>> it is opened read-write. (default: false) >>>> >>>> Especially for VM templating, "share=off" is a common use case. However, >>>> that use case is impossible with files that lack write permissions, >>>> because "share=off,readonly=off" will fail opening the file, and >>>> "share=off,readonly=on" will give us ROM instead of RAM. >>>> >>>> With MAP_PRIVATE we can easily open the file R/O and mmap it R/W, to >>>> turn it into COW RAM: private changes don't affect the file after all and >>>> don't require write permissions. >>>> >>>> This implies that we only get ROM now via "share=on,readonly=on". >>>> "share=off,readonly=on" will give us RAM. >>>> >>>> The sole user of ROM via memory-backend-file are R/O NVDIMMs. They >>>> also require "unarmed=on" to be set for the nvdimm device. >>>> >>>> With this change, R/O NVDIMMs will continue working even if >>>> "share=off,readonly=on" was specified similar to when simply >>>> providing ordinary RAM to the nvdimm device and setting "unarmed=on". >>>> >>>> Note that libvirt seems to default for a "readonly" nvdimm to >>>> * -object memory-backend-file,share=off (implying readonly=off) >>>> * -device nvdimm,unarmed=on >>>> And never seems to even set "readonly=on" for memory-backend-file. So >>>> this change won't affect libvirt, they already always get COW RAM -- not >>>> modifying the underlying file but opening it R/O. >>>> >>>> If someone really wants ROM, they can just use "share=on,readonly=on". >>>> After all, there is not relevant difference between a R/O MAP_SHARED >>>> file mapping and a R/O MAP_PRIVATE file mapping. >>>> >>>> Signed-off-by: David Hildenbrand <david@redhat.com> >>> >>> This still leaves the patch having a warn_report() which I think is >>> undesirable to emit in a valid / supported use case. >> >> No warning. >> >> Please elaborate on "valid/supported use case". 
>
> The usage scenario that this patch aims to enable. IIUC, it will follow
> the codepath that leads to the warn_report() call in this patch.

It shouldn't, but I will double check!
On 17.08.23 16:41, Peter Xu wrote: > On Thu, Aug 17, 2023 at 11:07:23AM +0200, David Hildenbrand wrote: >> @Stefan, see below on a R/O NVDIMM question. >> >> We're discussing how to get MAPR_PRIVATE R/W mapping of a >> memory-backend-file running when using R/O files. >> >>> >>> This seems a good idea. I am good with the solution you proposed >>> here as well. >> >> I was just going to get started working on that, when I realized >> something important: >> >> >> "@readonly: if true, the backing file is opened read-only; if false, >> it is opened read-write. (default: false)" >> >> "@share: if false, the memory is private to QEMU; if true, it is >> shared (default: false)" >> >> So readonly is *all* about the file access mode already ... the mmap() >> parameters are just a side-effect of that. Adding a new >> "file-access-mode" or similar would be wrong. > > Not exactly a side effect, IMHO. IIUC it's simply because we didn't have a > need of using different perm for memory/file levels. See the other patch > commit message from Stefan: > > https://lore.kernel.org/all/20210104171320.575838-2-stefanha@redhat.com/ > > There is currently no way to open(O_RDONLY) and mmap(PROT_READ) when > [...] > > So the goal at that time was to map/open both in RO mode, afaiu. So one Good point. And you can have both with "share=on,readonly=on" ever since Stefan introduced that flag, which that patch enabled. Stefan didn't go into details to describe the required interactions between MAP_PRIVATE / MAP_SHARED. [...] >> -object memory-backend-file,share=off,readonly=on >> >> BUT, have the mmap be writable (instead of currently readonly). >> >> Assuming we would change the current behavior, what if someone would >> specify: >> >> -object memory-backend-file,readonly=on >> >> (because the default is share=off ...) and using it for a R/O NVDIMM, >> where we expect any write access to fail. > > It will (as expected), right? fallocate() will just fail on the RO files. Yes, if the file was opened R/O, any fallocate() will fail. > > To be explicit, maybe we should just remember the readonly attribute for a > ramblock and then we can even provide a more meaningful error log, like: Hah, I have a patch that adds RAM_READONLY :) . But it expresses "mmapped shared" not "file opened shared". > > diff --git a/softmmu/physmem.c b/softmmu/physmem.c > index 3df73542e1..f8c11c8d54 100644 > --- a/softmmu/physmem.c > +++ b/softmmu/physmem.c > @@ -3424,9 +3424,13 @@ int qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque) > int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length) > { > int ret = -1; > - > uint8_t *host_startaddr = rb->host + start; > > + if (rb->flags & RAM_READONLY) { > + // more meaningful error reports (though right now no errp passed in..) > + return -EACCES; > + } Remembering whether a file is opened R/O might be reasonable to improve the error messages. [...] >> >> But let's look at the commit that added the "readonly" parameter: >> >> commit 86635aa4e9d627d5142b81c57a33dd1f36627d07 >> Author: Stefan Hajnoczi <stefanha@redhat.com> >> Date: Mon Jan 4 17:13:19 2021 +0000 >> >> hostmem-file: add readonly=on|off option >> Let -object memory-backend-file work on read-only files when the >> readonly=on option is given. This can be used to share the contents of a >> file between multiple guests while preventing them from consuming >> Copy-on-Write memory if guests dirty the pages, for example. 
>>
>> That was part of
>>
>> https://lore.kernel.org/all/20210104171320.575838-3-stefanha@redhat.com/T/#m712f995e6dcfdde433958bae5095b145dd0ee640
>>
>> From what I understand, for NVDIMMs we always use
>> "-object memory-backend-file,share=on", even when we want a
>> readonly NVDIMM.
>>
>> So we have two options:
>>
>> 1) Change the current behavior of -object memory-backend-file,share=off,readonly=on:
>>
>> -> Open the file r/o but mmap it writable
>>
>> 2) Add a new property to configure the mmap accessibility. Not a big fan of that.
>>
>> @Stefan, do you have any concern when we would do 1) ?
>>
>> As far as I can tell, we have to set the nvdimm to "unarmed=on" either way:
>>
>> + "unarmed" controls the ACPI NFIT NVDIMM Region Mapping Structure "NVDIMM
>> + State Flags" Bit 3 indicating that the device is "unarmed" and cannot accept
>> + persistent writes. Linux guest drivers set the device to read-only when this
>> + bit is present. Set unarmed to on when the memdev has readonly=on.
>>
>> So changing the behavior would not really break the nvdimm use case.
>>
>> Further, we could warn in nvdimm code when we stumble over this configuration with
>> unarmed=on.
>
> I'll leave the nvdimm-specific question to Stefan, but won't this also map
> any readonly=on memory backends (besides nvdimm) to have the memory
> writable, which is still unexpected?

Note that libvirt *never* sets readonly=on for R/O NVDIMMs, and R/O NVDIMMs are really the only use case.

I don't think "open file read only" raises the expectation "map it read only". It certainly does for shared mappings, but not for private mappings. To me, the expectation of "open the file read only" is that the file won't be modified, ever.

Libvirt should probably be taught to use "share=on,readonly=on" when dealing with R/O NVDIMMs, instead of "share=off" (implying readonly=off), which creates a writable COW mapping.
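If a user wants the strict R/O NVDIMM behavior David describes, the configuration suggested here would look something like the following fragment; the memdev path and size are invented for the example:

```
# ROM: file opened R/O and mmap'ed R/O shared, so the guest sees a
# read-only NVDIMM; the device additionally reports itself as unarmed:
-object memory-backend-file,id=nv0,mem-path=/images/nvdimm.img,size=16G,share=on,readonly=on
-device nvdimm,memdev=nv0,unarmed=on
```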
On Thu, 17 Aug 2023 at 05:08, David Hildenbrand <david@redhat.com> wrote: > > @Stefan, see below on a R/O NVDIMM question. > > We're discussing how to get MAPR_PRIVATE R/W mapping of a > memory-backend-file running when using R/O files. > > > > > This seems a good idea. I am good with the solution you proposed > > here as well. > > I was just going to get started working on that, when I realized > something important: > > > "@readonly: if true, the backing file is opened read-only; if false, > it is opened read-write. (default: false)" > > "@share: if false, the memory is private to QEMU; if true, it is > shared (default: false)" > > So readonly is *all* about the file access mode already ... the mmap() > parameters are just a side-effect of that. Adding a new > "file-access-mode" or similar would be wrong. > > > Here are the combinations we have right now: > > -object memory-backend-file,share=on,readonly=on > > -> Existing behavior: Open readonly, mmap readonly shared > -> Makes sense, mmap'ing readwrite would fail > > -object memory-backend-file,share=on,readonly=off > > -> Existing behavior: Open readwrite, mmap readwrite shared > -> Mostly makes sense: why open a shared file R/W and not mmap it > R/W? > > -object memory-backend-file,share=off,readonly=off > -> Existing behavior: Open readwrite, mmap readwrite private > -> Mostly makes sense: why open a file R/W and not map it R/W (even if > private)? > > -object memory-backend-file,share=off,readonly=on > -> Existing behavior: Open readonly, mmap readonly private > -> That's the problematic one > > > So for your use case (VM templating using a readonly file), you > would actually want to use: > > -object memory-backend-file,share=off,readonly=on > > BUT, have the mmap be writable (instead of currently readonly). > > Assuming we would change the current behavior, what if someone would > specify: > > -object memory-backend-file,readonly=on > > (because the default is share=off ...) and using it for a R/O NVDIMM, > where we expect any write access to fail. > > > But let's look at the commit that added the "readonly" parameter: > > commit 86635aa4e9d627d5142b81c57a33dd1f36627d07 > Author: Stefan Hajnoczi <stefanha@redhat.com> > Date: Mon Jan 4 17:13:19 2021 +0000 > > hostmem-file: add readonly=on|off option > > Let -object memory-backend-file work on read-only files when the > readonly=on option is given. This can be used to share the contents of a > file between multiple guests while preventing them from consuming > Copy-on-Write memory if guests dirty the pages, for example. > > That was part of > > https://lore.kernel.org/all/20210104171320.575838-3-stefanha@redhat.com/T/#m712f995e6dcfdde433958bae5095b145dd0ee640 > > From what I understand, for NVDIMMs we always use > "-object memory-backend-file,share=on", even when we want a > readonly NVDIMM. > > > So we have two options: > > 1) Change the current behavior of -object memory-backend-file,share=off,readonly=on: > > -> Open the file r/o but mmap it writable Commit 86635aa4e9d627d5142b81c57a33dd1f36627d07 mentions that we don't want guests to be able to dirty pages on the host. The change you're proposing would not protect against guests that dirty the memory. I don't know how important that requirement was (that commit was a request from Kata Containers). > > 2) Add a new property to configure the mmap accessibility. Not a big fan of that. > > > @Stefan, do you have any concern when we would do 1) ? 
>
> As far as I can tell, we have to set the nvdimm to "unarmed=on" either way:
>
> + "unarmed" controls the ACPI NFIT NVDIMM Region Mapping Structure "NVDIMM
> + State Flags" Bit 3 indicating that the device is "unarmed" and cannot accept
> + persistent writes. Linux guest drivers set the device to read-only when this
> + bit is present. Set unarmed to on when the memdev has readonly=on.
>
> So changing the behavior would not really break the nvdimm use case.
>
> Further, we could warn in nvdimm code when we stumble over this configuration with
> unarmed=on.
>
> --
> Cheers,
>
> David / dhildenb
>
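For reference, the four combinations enumerated above boil down to how the
open(2) and mmap(2) parameters are derived from the two properties. A
simplified sketch, with an illustrative helper name, not the actual
file_ram_open()/qemu_ram_mmap() code:

```c
#include <fcntl.h>
#include <stdbool.h>
#include <sys/mman.h>

/* Illustrative helper, not actual QEMU code. */
static void ram_backend_params(bool share, bool readonly,
                               int *open_flags, int *mmap_prot,
                               int *mmap_flags)
{
    *open_flags = readonly ? O_RDONLY : O_RDWR;     /* file access mode */
    *mmap_prot  = readonly ? PROT_READ
                           : PROT_READ | PROT_WRITE; /* memory access mode */
    *mmap_flags = share ? MAP_SHARED : MAP_PRIVATE;
    /*
     * Option 1) above would special-case share=off,readonly=on:
     * keep *open_flags == O_RDONLY, but make *mmap_prot writable.
     */
}
```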
On 17.08.23 17:13, Stefan Hajnoczi wrote:
> On Thu, 17 Aug 2023 at 05:08, David Hildenbrand <david@redhat.com> wrote:
>>
>> @Stefan, see below on a R/O NVDIMM question.
>>
>> We're discussing how to get a MAP_PRIVATE R/W mapping of a
>> memory-backend-file running when using R/O files.
>>
>>>
>>> This seems a good idea. I am good with the solution you proposed
>>> here as well.
>>
>> I was just going to get started working on that, when I realized
>> something important:
>>
>>
>> "@readonly: if true, the backing file is opened read-only; if false,
>> it is opened read-write. (default: false)"
>>
>> "@share: if false, the memory is private to QEMU; if true, it is
>> shared (default: false)"
>>
>> So readonly is *all* about the file access mode already ... the mmap()
>> parameters are just a side-effect of that. Adding a new
>> "file-access-mode" or similar would be wrong.
>>
>>
>> Here are the combinations we have right now:
>>
>> -object memory-backend-file,share=on,readonly=on
>>
>> -> Existing behavior: Open readonly, mmap readonly shared
>> -> Makes sense, mmap'ing readwrite would fail
>>
>> -object memory-backend-file,share=on,readonly=off
>>
>> -> Existing behavior: Open readwrite, mmap readwrite shared
>> -> Mostly makes sense: why open a shared file R/W and not mmap it
>> R/W?
>>
>> -object memory-backend-file,share=off,readonly=off
>> -> Existing behavior: Open readwrite, mmap readwrite private
>> -> Mostly makes sense: why open a file R/W and not map it R/W (even if
>> private)?
>>
>> -object memory-backend-file,share=off,readonly=on
>> -> Existing behavior: Open readonly, mmap readonly private
>> -> That's the problematic one
>>
>>
>> So for your use case (VM templating using a readonly file), you
>> would actually want to use:
>>
>> -object memory-backend-file,share=off,readonly=on
>>
>> BUT, have the mmap be writable (instead of currently readonly).
>>
>> Assuming we would change the current behavior, what if someone
>> specified:
>>
>> -object memory-backend-file,readonly=on
>>
>> (because the default is share=off ...) and used it for a R/O NVDIMM,
>> where we expect any write access to fail.
>>
>>
>> But let's look at the commit that added the "readonly" parameter:
>>
>> commit 86635aa4e9d627d5142b81c57a33dd1f36627d07
>> Author: Stefan Hajnoczi <stefanha@redhat.com>
>> Date:   Mon Jan 4 17:13:19 2021 +0000
>>
>>     hostmem-file: add readonly=on|off option
>>
>>     Let -object memory-backend-file work on read-only files when the
>>     readonly=on option is given. This can be used to share the contents of a
>>     file between multiple guests while preventing them from consuming
>>     Copy-on-Write memory if guests dirty the pages, for example.
>>
>> That was part of
>>
>> https://lore.kernel.org/all/20210104171320.575838-3-stefanha@redhat.com/T/#m712f995e6dcfdde433958bae5095b145dd0ee640
>>
>> From what I understand, for NVDIMMs we always use
>> "-object memory-backend-file,share=on", even when we want a
>> readonly NVDIMM.
>>
>>
>> So we have two options:
>>
>> 1) Change the current behavior of -object memory-backend-file,share=off,readonly=on:
>>
>> -> Open the file r/o but mmap it writable
>
> Commit 86635aa4e9d627d5142b81c57a33dd1f36627d07 mentions that we don't
> want guests to be able to dirty pages on the host. The change you're
> proposing would not protect against guests that dirty the memory.

The guest could write memory but not modify the file. Only with
"share=off,readonly=on" of course, not with "share=on,readonly=on".
>
> I don't know how important that requirement was (that commit was a
> request from Kata Containers).

Let me take a look if Kata passes "share=on,readonly=on" or
"share=off,readonly=on".

Thanks!
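The "write memory but not the file" behavior above is plain MAP_PRIVATE
COW semantics. A standalone sketch, assuming a pre-existing read-only file
"template.ram" (name hypothetical) of at least one page:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical read-only backing file, at least one page in size. */
    int fd = open("template.ram", O_RDONLY);
    char *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);

    map[0] = 'X';               /* faults in a private, anonymous COW copy */

    char on_disk;
    pread(fd, &on_disk, 1, 0);  /* the file itself is untouched */
    printf("mapping: %c, file: %c\n", map[0], on_disk);

    munmap(map, 4096);
    close(fd);
    return 0;
}
```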
>> Commit 86635aa4e9d627d5142b81c57a33dd1f36627d07 mentions that we don't
>> want guests to be able to dirty pages on the host. The change you're
>> proposing would not protect against guests that dirty the memory.
>
> The guest could write memory but not modify the file. Only with
> "share=off,readonly=on" of course, not with "share=on,readonly=on".
>
>>
>> I don't know how important that requirement was (that commit was a
>> request from Kata Containers).
>
> Let me take a look if Kata passes "share=on,readonly=on" or
> "share=off,readonly=on".
>

At least their R/O DIMM test generates:

-device nvdimm,id=nv0,memdev=mem0,unarmed=on
-object memory-backend-file,id=mem0,mem-path=/root,size=65536,readonly=on

So they are assuming readonly with implied share=off creates ROM. If only
they had specified share=on ...

One way would be letting the R/O nvdimm set the MR container to readonly.
Then the guest also shouldn't be able to modify that memory.

Let me think about that.
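A rough sketch of that idea, assuming the nvdimm code has the backend at
hand. This is hypothetical, not a merged patch; the function name is made
up, while memory_region_set_readonly() and host_memory_backend_get_memory()
are existing QEMU APIs:

```c
#include "qemu/osdep.h"
#include "exec/memory.h"
#include "sysemu/hostmem.h"

/*
 * Hypothetical sketch: an unarmed (R/O) NVDIMM marks the backend's memory
 * region read-only, so the guest sees ROM no matter how the backing file
 * was mmap'ed.
 */
static void nvdimm_expose_readonly(HostMemoryBackend *backend)
{
    MemoryRegion *mr = host_memory_backend_get_memory(backend);

    memory_region_set_readonly(mr, true);
}
```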
On Thu, Aug 17, 2023 at 05:15:52PM +0200, David Hildenbrand wrote:
> > I don't know how important that requirement was (that commit was a
> > request from Kata Containers).
>
> Let me take a look if Kata passes "share=on,readonly=on" or
> "share=off,readonly=on".

The question is whether it's good enough to change the semantics as long
as we keep the original use cases that motivated those flags working
(nvdimm, kata, etc.), since anything introduced in qemu can potentially
be used elsewhere too.

David, could you share your concern on simply "having a new flag, while
keeping all existing flags unchanged in behavior"? You mentioned it's not
wanted, but I didn't yet see the reason behind that.

Thanks,
On 17.08.23 17:31, Peter Xu wrote:
> On Thu, Aug 17, 2023 at 05:15:52PM +0200, David Hildenbrand wrote:
>>> I don't know how important that requirement was (that commit was a
>>> request from Kata Containers).
>>
>> Let me take a look if Kata passes "share=on,readonly=on" or
>> "share=off,readonly=on".
>
> The question is whether it's good enough to change the semantics as long
> as we keep the original use cases that motivated those flags working
> (nvdimm, kata, etc.), since anything introduced in qemu can potentially
> be used elsewhere too.
>

Right. And we have to keep the R/O NVDIMM use case working as is apparently.

> David, could you share your concern on simply "having a new flag, while
> keeping all existing flags unchanged in behavior"? You mentioned it's not
> wanted, but I didn't yet see the reason behind that.

I'm really having a hard time coming up with something reasonable to
configure this. And apparently, we only want to configure
"share=off,readonly=on".

The best I could imagine was "readonly=file-only", but I'm also not too
happy about that. It doesn't make any sense for "share=on".

So if we could just let the memory backend do something reasonable and
have the single existing consumer (R/O NVDIMM) handle the changed case
explicitly internally, that turns out much cleaner.

IMHO, the user shouldn't have to worry about "how is it mmaped". "share"
and "readonly" express the memory semantics and the file semantics.

A R/O DIMM on the other hand (unarmed=on), knows that it is R/O, and the
user configured exactly that. So maybe it can simply expose it to the
system as readonly by marking the memory region container as being a ROM.

I have not given up yet, but this case is starting to get annoying.
On Fri, 11 Aug 2023 17:26:24 +0200
David Hildenbrand <david@redhat.com> wrote:

[...]

> Nowadays, "-mem-path" always defaults to MAP_PRIVATE. For the original
> hugetlb use case, it's still good enough. For anything else, I'm not so
> sure.

We can deprecate -mem-path and direct users to an explicitly defined
memory backend, with a legacy-compatible example like the sketch below.
That way users won't keep coming back trying to fix it or inventing
options to alter its behavior, when an explicit
'-machine memory-backend=foo' can do the job.
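Such a legacy-compatible replacement might look like this (sizes and the
hugetlbfs path are illustrative; share=off keeps -mem-path's MAP_PRIVATE
behavior, and whether this covers every -mem-path corner case would need
checking):

```
qemu-system-x86_64 \
    -object memory-backend-file,id=pc.ram,size=4G,mem-path=/dev/hugepages,share=off \
    -machine q35,memory-backend=pc.ram \
    -m 4G
```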
On 12.08.23 08:21, ThinerLogoer wrote:
> At 2023-08-11 22:31:36, "Peter Xu" <peterx@redhat.com> wrote:
>> On Fri, Aug 11, 2023 at 01:49:52PM +0800, ThinerLogoer wrote:
>>> At 2023-08-11 05:24:43, "Peter Xu" <peterx@redhat.com> wrote:
>>>> On Fri, Aug 11, 2023 at 01:06:12AM +0800, ThinerLogoer wrote:
>>>>>> I think we have the following options (there might be more)
>>>>>>
>>>>>> 1) This patch.
>>>>>>
>>>>>> 2) New flag for memory-backend-file. We already have "readonly" and
>>>>>> "share=". I'm having a hard time coming up with a good name that really
>>>>>> describes the subtle difference.
>>>>>>
>>>>>> 3) Glue behavior to the QEMU machine
>>>>>>
>>>>>
>>>>> 4) '-deny-private-discard' argv, or environment variable, or both
>>>>
>>>> I'd personally vote for (2). How about "fdperm"? To describe when we want
>>>> to use different rw permissions on the file (besides the access permission
>>>> of the memory we already provided with "readonly"=XXX). IIUC the only sane
>>>> values will be ro/rw/default, where "default" should just use the same rw
>>>> permission as the memory ("readonly"=XXX).
>>>>
>>>> Would that be relatively clean and also work in this use case?
>>>>
>>>> (the other reason I wish we didn't have that fallback is that, as long as
>>>> we have any such "fallback", we'll need to stay compatible with it from
>>>> then on, forever...)
>>>
>>> If it must be (2), I would vote (2) + (4), with (4) adjusting the default
>>> behavior of said `fdperm`. Mainly because (private+discard) is itself not
>>> a good practice and (4) serves as a good tool to help catch existing
>>> (private+discard) problems.
>>>
>>> Actually (readonly+private) is more reasonable than (private+discard), so I
>>> want at least some room for a default (readonly+private) behavior.
>>
>> Just for purely discussion purpose: I think maybe someday private+discard
>> could work. IIUC what we're missing is a syscall interface to install a
>> zero page for a MAP_PRIVATE, atomically freeing what's underneath: it seems
>> either punching a hole or DONTNEED won't suffice here. It'll just be
>> another problem when having zero pages involved in file mappings, at least.
>>
>>>
>>> Also in my case I kind of have to use "-mem-path" despite it being
>>> considered close to deprecated. Only with this can I avoid needing to know
>>> the memory backend before migration. Actually there seems to be no
>>> equivalent working after-migration setup of
>>> "-object memory-backend-file,... -machine q35,mem=..." that can match the
>>> before-migration setup of "-machine q35" (specifying nothing). Therefore
>>> I must make a plan and choose a migration method BEFORE I boot the
>>> machine and prepare to migrate, reducing the operational freedom.
>>> Considering that, I have to use "-mem-path", which keeps the freedom but
>>> has no configurable argument, so I have to rely on the default config.
>>>
>>> Are there any "-object memory-backend-file..." setups equivalent to "-machine q35"
>>> that can migrate from and to each other? If there is, I want to try it out.
>>> By the way "-object memory-backend-file,id=pc.ram" has just been killed by an earlier
>>> commit.
>>
>> I'm actually not familiar enough on the interfaces here, but I just checked
>> up the man page; would this work for you, together with option (2)?
>>
>>     memory-backend='id'
>>         An alternative to legacy -mem-path and mem-prealloc options. Allows to use a memory backend as main RAM.
>>
>> For example:
>>
>> -object memory-backend-file,id=pc.ram,size=512M,mem-path=/hugetlbfs,prealloc=on,share=on
>> -machine memory-backend=pc.ram
>> -m 512M
>>
>
> Wait ... I thought it should not work but it did work today. How bad am I at
> reading the correct part of the documentation ...
>
> '-machine q35 -m 512M' is equivalent to '-object
> memory-backend-file,id=pc.ram,size=512M
> -machine q35,memory-backend=pc.ram',
> the latter works, and the two mentioned setups can be
> migrated from one to another.
>
> What I was consistently trying was '-object
> memory-backend-file,id=pc.ram,size=512M -machine q35', and qemu raises an error
> for this in a recent update:
>
> ```
> qemu-system-x86_64: object name 'pc.ram' is reserved for the default RAM backend, it can't be used for any other purposes. Change the object's 'id' to something else
> ```
>
> This error is misleading. Actually in this case, the error message should be
> closer to:
> ```
> object name 'pc.ram' is reserved for the default RAM backend, it can't
> be used for any other purposes. Change the object's 'id' to something
> else, or append "memory-backend=pc.ram" to the -machine arguments
> ```

What about:

$ ./build/qemu-system-x86_64 -object memory-backend-file,id=pc.ram,size=512M,mem-path=tmp -machine q35
qemu-system-x86_64: object name 'pc.ram' is reserved for the default RAM backend, it can't be used for any other purposes. Change the object's 'id' to something else or disable automatic creation of the default RAM backend by setting the 'memory-backend' machine property

?
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 3df73542e1..d1ae694b20 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -1289,8 +1289,7 @@ static int64_t get_file_align(int fd)
 static int file_ram_open(const char *path,
                          const char *region_name,
                          bool readonly,
-                         bool *created,
-                         Error **errp)
+                         bool *created)
 {
     char *filename;
     char *sanitized_name;
@@ -1334,10 +1333,7 @@ static int file_ram_open(const char *path,
             g_free(filename);
         }
         if (errno != EEXIST && errno != EINTR) {
-            error_setg_errno(errp, errno,
-                             "can't open backing store %s for guest RAM",
-                             path);
-            return -1;
+            return -errno;
         }
         /*
          * Try again on EINTR and EEXIST. The latter happens when
@@ -1946,9 +1942,23 @@ RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
     bool created;
     RAMBlock *block;
 
-    fd = file_ram_open(mem_path, memory_region_name(mr), readonly, &created,
-                       errp);
+    fd = file_ram_open(mem_path, memory_region_name(mr), readonly, &created);
+    if (fd == -EACCES && !(ram_flags & RAM_SHARED) && !readonly) {
+        /*
+         * We can have a writable MAP_PRIVATE mapping of a readonly file.
+         * However, some operations like ftruncate() or fallocate() might fail
+         * later, let's warn the user.
+         */
+        fd = file_ram_open(mem_path, memory_region_name(mr), true, &created);
+        if (fd >= 0) {
+            warn_report("backing store %s for guest RAM (MAP_PRIVATE) opened"
+                        " readonly because the file is not writable", mem_path);
+        }
+    }
     if (fd < 0) {
+        error_setg_errno(errp, -fd,
+                         "can't open backing store %s for guest RAM",
+                         mem_path);
         return NULL;
     }
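Why the fallback in this patch deserves a warning: once the fd was opened
read-only, resizing operations on it fail. A standalone sketch (the file
name is illustrative):

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical backing file we could only open read-only. */
    int fd = open("guest-ram.img", O_RDONLY);

    if (ftruncate(fd, 1024 * 1024) < 0) {
        /* Fails: the fd was not opened for writing. */
        printf("ftruncate: %s\n", strerror(errno));
    }
    close(fd);
    return 0;
}
```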