
[v1,1/3] softmmu/physmem: fallback to opening guest RAM file as readonly in a MAP_PRIVATE mapping

Message ID: 20230807190736.572665-2-david@redhat.com
State: New, archived
Series: softmmu/physmem: file_ram_open() readonly improvements

Commit Message

David Hildenbrand Aug. 7, 2023, 7:07 p.m. UTC
From: Thiner Logoer <logoerthiner1@163.com>

Users may specify
* "-mem-path" or
* "-object memory-backend-file,share=off,readonly=off"
and expect such COW (MAP_PRIVATE) mappings to work, even if the user
does not have write permissions to open the file.

For now, we always fail in that case, requiring file write permissions.
Let's detect when that failure happens and fall back to opening the file
readonly.

Warn the user, since there are other use cases where we want the file to
be mapped writable: ftruncate() and fallocate() will fail if the file
was not opened with write permissions.

Signed-off-by: Thiner Logoer <logoerthiner1@163.com>
Co-developed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 softmmu/physmem.c | 26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)
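
For example (the path and sizes below are only illustrative), starting QEMU
against a guest RAM file that is not writable by the current user now
degrades to a warning instead of a hard failure:

  $ qemu-system-x86_64 -m 4G \
      -object memory-backend-file,id=ram0,size=4G,mem-path=/vm/ram,share=off,readonly=off \
      -machine q35,memory-backend=ram0
  qemu-system-x86_64: warning: backing store /vm/ram for guest RAM
  (MAP_PRIVATE) opened readonly because the file is not writable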

Comments

Peter Xu Aug. 8, 2023, 9:01 p.m. UTC | #1
On Mon, Aug 07, 2023 at 09:07:32PM +0200, David Hildenbrand wrote:
> From: Thiner Logoer <logoerthiner1@163.com>
> 
> Users may specify
> * "-mem-path" or
> * "-object memory-backend-file,share=off,readonly=off"
> and expect such COW (MAP_PRIVATE) mappings to work, even if the user
> does not have write permissions to open the file.
> 
> For now, we always fail in that case, requiring file write permissions.
> Let's detect when that failure happens and fall back to opening the file
> readonly.
> 
> Warn the user, since there are other use cases where we want the file to
> be mapped writable: ftruncate() and fallocate() will fail if the file
> was not opened with write permissions.
> 
> Signed-off-by: Thiner Logoer <logoerthiner1@163.com>
> Co-developed-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  softmmu/physmem.c | 26 ++++++++++++++++++--------
>  1 file changed, 18 insertions(+), 8 deletions(-)
> 
> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> index 3df73542e1..d1ae694b20 100644
> --- a/softmmu/physmem.c
> +++ b/softmmu/physmem.c
> @@ -1289,8 +1289,7 @@ static int64_t get_file_align(int fd)
>  static int file_ram_open(const char *path,
>                           const char *region_name,
>                           bool readonly,
> -                         bool *created,
> -                         Error **errp)
> +                         bool *created)
>  {
>      char *filename;
>      char *sanitized_name;
> @@ -1334,10 +1333,7 @@ static int file_ram_open(const char *path,
>              g_free(filename);
>          }
>          if (errno != EEXIST && errno != EINTR) {
> -            error_setg_errno(errp, errno,
> -                             "can't open backing store %s for guest RAM",
> -                             path);
> -            return -1;
> +            return -errno;
>          }
>          /*
>           * Try again on EINTR and EEXIST.  The latter happens when
> @@ -1946,9 +1942,23 @@ RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
>      bool created;
>      RAMBlock *block;
>  
> -    fd = file_ram_open(mem_path, memory_region_name(mr), readonly, &created,
> -                       errp);
> +    fd = file_ram_open(mem_path, memory_region_name(mr), readonly, &created);
> +    if (fd == -EACCES && !(ram_flags & RAM_SHARED) && !readonly) {
> +        /*
> +         * We can have a writable MAP_PRIVATE mapping of a readonly file.
> +         * However, some operations like ftruncate() or fallocate() might fail
> +         * later, let's warn the user.
> +         */
> +        fd = file_ram_open(mem_path, memory_region_name(mr), true, &created);
> +        if (fd >= 0) {
> +            warn_report("backing store %s for guest RAM (MAP_PRIVATE) opened"
> +                        " readonly because the file is not writable", mem_path);

I can understand the use case, but this will be slightly unwanted,
especially since the user doesn't yet have a way to predict when it will
happen.

Meanwhile, this changes the behavior; is it a concern that someone may want
to rely on the current behavior of failing?

To think about the current use case from a higher level, the ideal solution
seems to me that if the ram file can be put on a file system that supports
CoW itself (like btrfs), we can snapshot that ram file and make it RW for
the qemu instance. Then QEMU will be able to open the file.  We'll be able
to keep the interface working as before; meanwhile it'll work with fallocate
or truncation too, I assume.
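
(For reference, a minimal sketch of taking such a snapshot with the Linux
FICLONE ioctl, assuming a reflink-capable filesystem like btrfs or XFS;
the paths and names below are made up:)

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>            /* FICLONE */

/* Clone the readonly RAM template "src" into a writable snapshot "dst". */
int main(void)
{
    int src = open("/vm/ram-template", O_RDONLY);
    int dst = open("/vm/ram-instance", O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (src < 0 || dst < 0) {
        perror("open");
        return 1;
    }
    /* dst shares all data blocks with src; writes CoW new blocks. */
    if (ioctl(dst, FICLONE, src) < 0) {
        perror("ioctl(FICLONE)");  /* e.g. EOPNOTSUPP on ext4 */
        return 1;
    }
    close(src);
    close(dst);
    return 0;
}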

Would that be better instead of changing QEMU?

Thanks,

> +        }
> +    }
>      if (fd < 0) {
> +        error_setg_errno(errp, -fd,
> +                         "can't open backing store %s for guest RAM",
> +                         mem_path);
>          return NULL;
>      }
>  
> -- 
> 2.41.0
>
ThinerLogoer Aug. 9, 2023, 5:39 a.m. UTC | #2
At 2023-08-09 05:01:17, "Peter Xu" <peterx@redhat.com> wrote:
>On Mon, Aug 07, 2023 at 09:07:32PM +0200, David Hildenbrand wrote:
>> From: Thiner Logoer <logoerthiner1@163.com>
>> 
>> Users may specify
>> * "-mem-path" or
>> * "-object memory-backend-file,share=off,readonly=off"
>> and expect such COW (MAP_PRIVATE) mappings to work, even if the user
>> does not have write permissions to open the file.
>> 
>> For now, we always fail in that case, requiring file write permissions.
>> Let's detect when that failure happens and fall back to opening the file
>> readonly.
>> 
>> Warn the user, since there are other use cases where we want the file to
>> be mapped writable: ftruncate() and fallocate() will fail if the file
>> was not opened with write permissions.
>> 
>> Signed-off-by: Thiner Logoer <logoerthiner1@163.com>
>> Co-developed-by: David Hildenbrand <david@redhat.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  softmmu/physmem.c | 26 ++++++++++++++++++--------
>>  1 file changed, 18 insertions(+), 8 deletions(-)
>> 
>> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
>> index 3df73542e1..d1ae694b20 100644
>> --- a/softmmu/physmem.c
>> +++ b/softmmu/physmem.c
>> @@ -1289,8 +1289,7 @@ static int64_t get_file_align(int fd)
>>  static int file_ram_open(const char *path,
>>                           const char *region_name,
>>                           bool readonly,
>> -                         bool *created,
>> -                         Error **errp)
>> +                         bool *created)
>>  {
>>      char *filename;
>>      char *sanitized_name;
>> @@ -1334,10 +1333,7 @@ static int file_ram_open(const char *path,
>>              g_free(filename);
>>          }
>>          if (errno != EEXIST && errno != EINTR) {
>> -            error_setg_errno(errp, errno,
>> -                             "can't open backing store %s for guest RAM",
>> -                             path);
>> -            return -1;
>> +            return -errno;
>>          }
>>          /*
>>           * Try again on EINTR and EEXIST.  The latter happens when
>> @@ -1946,9 +1942,23 @@ RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
>>      bool created;
>>      RAMBlock *block;
>>  
>> -    fd = file_ram_open(mem_path, memory_region_name(mr), readonly, &created,
>> -                       errp);
>> +    fd = file_ram_open(mem_path, memory_region_name(mr), readonly, &created);
>> +    if (fd == -EACCES && !(ram_flags & RAM_SHARED) && !readonly) {
>> +        /*
>> +         * We can have a writable MAP_PRIVATE mapping of a readonly file.
>> +         * However, some operations like ftruncate() or fallocate() might fail
>> +         * later, let's warn the user.
>> +         */
>> +        fd = file_ram_open(mem_path, memory_region_name(mr), true, &created);
>> +        if (fd >= 0) {
>> +            warn_report("backing store %s for guest RAM (MAP_PRIVATE) opened"
>> +                        " readonly because the file is not writable", mem_path);
>
>I can understand the use case, but this will be slightly unwanted,
>especially since the user doesn't yet have a way to predict when it will
>happen.
>
>Meanwhile, this changes the behavior; is it a concern that someone may want
>to rely on the current behavior of failing?
>

I am happy to learn if there is some solid evidence supporting this view.

The target of compatibility is "private+discard", which itself seemed a
pathological practice in the earlier discussion. The only difference in
behavior that might be unwanted, per your argument, lies in when "readonly
file+private+discard" fails. Before the patch it fails early; after the
patch it fails later and may do additional work first.
Do you think that a bug reporting mechanism which relies on qemu failure
timing is valid?

If someone argues that the early-failure behavior of "readonly
file+private+discard" is something their system relies on, I would be happy
to learn why. Actually this is much easier to solve outside qemu, by
checking the memory file's permissions, than with the "readonly+private"
alternative plan that requires btrfs.

In the long run the "private+discard" setup may itself be warned about and
deprecated, rather than "private+readonly file", which is supported by the
Linux kernel.

The current patch is already a compromise for compatibility's sake, as it
always tries to open the file in read-write mode first, to preserve the
behavior of the "readwrite+private+discard" case. Personally I would prefer
to try opening the file readonly first, as this would reduce the risk of
accidentally opening the ram file with write permissions.

However, I can accept a second level of compromise, namely adding a
"-disallow-private-discard" qemu argument for at least three qemu releases
before "private+discard" gets deprecated and removed, if extreme
compatibility concerns are involved. With this argument, readonly private
is enabled and a private discard fails immediately when the discard request
is detected; without this argument, readonly private is disabled and the
behavior is unchanged. This argument is also useful because it helps find
existing "private+discard" usage and assists debugging.
Do you think that this solution pays off?
(I am no expert qemu programmer; I hope someone else can handle the command
line arguments if this is preferred.)

>To think about the current use case from a higher level, the ideal solution
>seems to me that if the ram file can be put on a file system that supports
>CoW itself (like btrfs), we can snapshot that ram file and make it RW for
>the qemu instance. Then QEMU will be able to open the file.  We'll be able
>to keep the interface working as before; meanwhile it'll work with fallocate
>or truncation too, I assume.
>
>Would that be better instead of changing QEMU?
>

I am afraid that your alternative solution will still make it impossible
to use this on a rootless system, while MAP_PRIVATE-based CoW works
anywhere. For example, it will not work on ext4 when the file mode is
readonly, or when the file is owned by root and you are a normal user, or
on squashfs. Basically, if everything on the machine is readonly and you
are only an unprivileged user, then only the in-qemu choice is possible.
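
A minimal sketch of that in-qemu choice -- a writable MAP_PRIVATE mapping
of a file opened readonly, which Linux allows (any readable, non-empty file
works here):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/etc/passwd", O_RDONLY);   /* no write permission needed */
    /* PROT_WRITE is fine with MAP_PRIVATE even on an O_RDONLY fd:
     * stores CoW into anonymous pages; the file itself is never touched. */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (fd < 0 || p == MAP_FAILED) {
        perror("open/mmap");
        return 1;
    }
    p[0] = 'X';                               /* private copy-on-write */
    printf("modified privately: %c\n", p[0]);
    munmap(p, 4096);
    close(fd);
    return 0;
}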

I believe the warning that the file is opened readonly is enough for a
potential user to consider doing otherwise. Do you think that looking solely
at the qemu exit status while piping all qemu stderr output to /dev/null
is one of the supported use cases?

--

Regards,

logoerthiner
David Hildenbrand Aug. 9, 2023, 9:20 a.m. UTC | #3
Hi Peter!

>> -    fd = file_ram_open(mem_path, memory_region_name(mr), readonly, &created,
>> -                       errp);
>> +    fd = file_ram_open(mem_path, memory_region_name(mr), readonly, &created);
>> +    if (fd == -EACCES && !(ram_flags & RAM_SHARED) && !readonly) {
>> +        /*
>> +         * We can have a writable MAP_PRIVATE mapping of a readonly file.
>> +         * However, some operations like ftruncate() or fallocate() might fail
>> +         * later, let's warn the user.
>> +         */
>> +        fd = file_ram_open(mem_path, memory_region_name(mr), true, &created);
>> +        if (fd >= 0) {
>> +            warn_report("backing store %s for guest RAM (MAP_PRIVATE) opened"
>> +                        " readonly because the file is not writable", mem_path);
> 
> I can understand the use case, but this will be slightly unwanted,
> especially since the user doesn't yet have a way to predict when it will
> happen.

Users can set the file permissions accordingly I guess. If they want 
the file to never ever be modified via QEMU, set it R/O.

> 
> Meanwhile, this changes the behavior; is it a concern that someone may want
> to rely on the current behavior of failing?

The scenario would be that someone passes a readonly file to "-mem-path" 
or "-object memory-backend-file,share=off,readonly=off", with the 
expectation that it would currently fail.

If it now doesn't fail (and we warn instead), what would happen is:
* In file_ram_alloc() we won't even try ftruncate(), because the file
   already had a size > 0. So ftruncate() is not a concern as I now
   realize.
* fallocate might fail later. AFAICS, that only applies to
   ram_block_discard_range().
  -> virtio-mem performs an initial ram_block_discard_range() check and
     fails gracefully early.
  -> virtio-ballooon ignores any errors
  -> ram_discard_range() in migration code fails early for postcopy in
     init_range() and loadvm_postcopy_ram_handle_discard(), handling it
     gracefully.

So mostly nothing "bad" would happen; it might just be undesirable, and 
we properly warn.

Most importantly, we won't be corrupting/touching the original file in 
any case, because it is R/O.
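
(As a quick sketch of why the R/O fd protects the file -- per fallocate(2),
punching a hole needs an fd opened for writing and fails with EBADF
otherwise; this mirrors what ram_block_discard_range() attempts for
file-backed RAM, though the path below is made up:)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/vm/ram-template", O_RDONLY);  /* as with the fallback */
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* Try to discard the first page of the backing file. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  0, 4096) < 0) {
        perror("fallocate");   /* EBADF: fd is not opened for writing */
    }
    close(fd);
    return 0;
}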

If we really want to be careful, we could glue that behavior to compat 
machines. I'm not really sure yet if we really have to go down that path.

Any other alternatives? I'd like to avoid new flags where not really 
required.

> 
> To think about the current use case from a higher level, the ideal solution
> seems to me that if the ram file can be put on a file system that supports
> CoW itself (like btrfs), we can snapshot that ram file and make it RW for
> the qemu instance. Then QEMU will be able to open the file.  We'll be able
> to keep the interface working as before; meanwhile it'll work with fallocate
> or truncation too, I assume.
> 
> Would that be better instead of changing QEMU?

As I recently learned, using file-backed VMs (on real ssd/disks, not 
shmem/hugetlb) is usually undesired, because the dirtied pages will 
constantly get rewritten to disk by background writeback threads, 
eventually resulting in bad performance and SSD wear.

So while using a COW filesystem sounds cleaner in theory, it's not 
applicable in practice -- unless one disables any background writeback, 
which has different side effects because it cannot be configured on a 
per-file basis.

So for VM templating, it makes sense to capture the guest RAM and store 
it in a file, to then use a COW (MAP_PRIVATE) mapping. Using a read-only 
file makes perfect sense in that scenario IMHO.

[I'm curious at what point a filesystem will actually break COW. if it's 
wired up to the writenotify infrastructure, it would happen when 
actually writing to a page, not at mmap time. I know that filesystems 
use writenotify for lazy allocation of disk blocks on file holes, maybe 
they also do that for lazy allocation of disk blocks on COW]

Thanks!
Peter Xu Aug. 9, 2023, 3:15 p.m. UTC | #4
On Wed, Aug 09, 2023 at 11:20:11AM +0200, David Hildenbrand wrote:
> Hi Peter!

Hi, David,

> 
> > > -    fd = file_ram_open(mem_path, memory_region_name(mr), readonly, &created,
> > > -                       errp);
> > > +    fd = file_ram_open(mem_path, memory_region_name(mr), readonly, &created);
> > > +    if (fd == -EACCES && !(ram_flags & RAM_SHARED) && !readonly) {
> > > +        /*
> > > +         * We can have a writable MAP_PRIVATE mapping of a readonly file.
> > > +         * However, some operations like ftruncate() or fallocate() might fail
> > > +         * later, let's warn the user.
> > > +         */
> > > +        fd = file_ram_open(mem_path, memory_region_name(mr), true, &created);
> > > +        if (fd >= 0) {
> > > +            warn_report("backing store %s for guest RAM (MAP_PRIVATE) opened"
> > > +                        " readonly because the file is not writable", mem_path);
> > 
> > I can understand the use case, but this will be slightly unwanted,
> > especially since the user doesn't yet have a way to predict when it will
> > happen.
> 
> Users can set the file permissions accordingly I guess. If they want
> the file to never ever be modified via QEMU, set it R/O.
> 
> > 
> > Meanwhile, this changes the behavior; is it a concern that someone may want
> > to rely on the current behavior of failing?
> 
> The scenario would be that someone passes a readonly file to "-mem-path" or
> "-object memory-backend-file,share=off,readonly=off", with the expectation
> that it would currently fail.
> 
> If it now doesn't fail (and we warn instead), what would happen is:
> * In file_ram_alloc() we won't even try ftruncate(), because the file
>   already had a size > 0. So ftruncate() is not a concern as I now
>   realize.
> * fallocate might fail later. AFAICS, that only applies to
>   ram_block_discard_range().
>  -> virtio-mem performs an initial ram_block_discard_range() check and
>     fails gracefully early.
>  -> virtio-ballooon ignores any errors
>  -> ram_discard_range() in migration code fails early for postcopy in
>     init_range() and loadvm_postcopy_ram_handle_discard(), handling it
>     gracefully.
> 
> So mostly nothing "bad" would happen; it might just be undesirable, and we
> properly warn.

Indeed, all of them can fail gracefully, while balloon one is harmless.
Definitely good news.

> 
> Most importantly, we won't be corrupting/touching the original file in any
> case, because it is R/O.
> 
> If we really want to be careful, we could glue that behavior to compat
> machines. I'm not really sure yet if we really have to go down that path.
> 
> Any other alternatives? I'd like to avoid new flags where not really
> required.

I was just thinking of a new flag. :) So have you already discussed that
possibility and decided it's not a good idea?

The root issue to me here is we actually have two resources (memory map of
the process, and the file) but we only have one way to describe the
permissions upon the two objects.  I'd think it makes a lot more sense if a
new flag is added, when there's a need to differentiate the two.

Consider if you see a bunch of qemu instances with:

  -mem-path $RAM_FILE

On the same host, which looks as weird as it could be to me... At least
'-mem-path' still looks like a way to exclusively own a ram file for an
instance. I worry the new fallback can confuse people too, while that's
so far not the major use case.

Nobody may really rely on the existing failure behavior, but changing
existing behavior is just never wanted.  The guideline here to me is:
do we want existing "-mem-path XXX" users to start using the fallback in
general?  If it's "no", then maybe that implies a new flag is better?

> 
> > 
> > To think about the current use case from a higher level, the ideal solution
> > seems to me that if the ram file can be put on a file system that supports
> > CoW itself (like btrfs), we can snapshot that ram file and make it RW for
> > the qemu instance. Then QEMU will be able to open the file.  We'll be able
> > to keep the interface working as before; meanwhile it'll work with fallocate
> > or truncation too, I assume.
> > 
> > Would that be better instead of changing QEMU?
> 
> As I recently learned, using file-backed VMs (on real ssd/disks, not
> shmem/hugetlb) is usually undesired, because the dirtied pages will
> constantly get rewritten to disk by background writeback threads, eventually
> resulting in bad performance and SSD wear.
> 
> So while using a COW filesystem sounds cleaner in theory, it's not
> applicable in practice -- unless one disables any background writeback,
> which has different side effects because it cannot be configured on a
> per-file basis.
> 
> So for VM templating, it makes sense to capture the guest RAM and store it
> in a file, to then use a COW (MAP_PRIVATE) mapping. Using a read-only file
> makes perfect sense in that scenario IMHO.

Valid point.

> 
> [I'm curious at what point a filesystem will actually break COW. if it's
> wired up to the writenotify infrastructure, it would happen when actually
> writing to a page, not at mmap time. I know that filesystems use writenotify
> for lazy allocation of disk blocks on file holes, maybe they also do that
> for lazy allocation of disk blocks on COW]

I don't know either, but it definitely looks more promising and reasonable
if the CoW only happens until being written, rather than being mapped RW.

Thanks,
David Hildenbrand Aug. 10, 2023, 2:19 p.m. UTC | #5
>> Most importantly, we won't be corrupting/touching the original file in any
>> case, because it is R/O.
>>
>> If we really want to be careful, we could glue that behavior to compat
>> machines. I'm not really sure yet if we really have to go down that path.
>>
>> Any other alternatives? I'd like to avoid new flags where not really
>> required.
> 
> I was just thinking of a new flag. :) So have you already discussed that
> possibility and decided it's not a good idea?

Not really. I was briefly playing with that idea but already struggled 
to come up with a reasonable name :)

Fewer toggles, and just have it working nicely, if possible.

> 
> The root issue to me here is we actually have two resources (memory map of
> the process, and the file) but we only have one way to describe the
> permissions upon the two objects.  I'd think it makes a lot more sense if a
> new flag is added, when there's a need to differentiate the two.
> 
> Consider if you see a bunch of qemu instances with:
> 
>    -mem-path $RAM_FILE
> 
> On the same host, which looks as weird as it could be to me... At least
> '-mem-path' still looks like a way to exclusively own a ram file for an
> instance. I worry the new fallback can confuse people too, while that's
> so far not the major use case.

Once I learned that this is not a MAP_SHARED mapping, I was extremely 
confused. For example, vhost-user will absolutely not work with 
"-mem-path", even though the documentation explicitly spells that out (I 
still have to send a patch to fix that).

I guess "-mem-path" was primarily only used to consume hugetlb. Even for 
tmpfs it will already result in a double memory consumption, just like 
when using -memory-backend-memfd,share=no.

I guess deprecating it was the right decision.

But memory-backend-file also defaults to "share=no" ... so the same 
default behavior unfortunately.

> 
> Nobody may really rely on the existing failure behavior, but changing
> existing behavior is just never wanted.  The guideline here to me is:
> do we want existing "-mem-path XXX" users to start using the fallback in
> general?  If it's "no", then maybe that implies a new flag is better?

I think we have the following options (there might be more)

1) This patch.

2) New flag for memory-backend-file. We already have "readonly" and 
"share=". I'm having a hard time coming up with a good name that really 
describes the subtle difference.

3) Glue behavior to the QEMU machine


For 3), one option would be to always open a COW file readonly (as 
Thiner originally proposed). We could leave "-mem-path" behavior alone 
and only change memory-backend-file semantics. If the COW file does 
*not* exist yet, we would refuse to create the file like patch 2+3 do. 
Therefore, no ftruncate() errors, and fallocate() errors would always 
happen.


What are your thoughts?

[...]

>>
>> [I'm curious at what point a filesystem will actually break COW. if it's
>> wired up to the writenotify infrastructure, it would happen when actually
>> writing to a page, not at mmap time. I know that filesystems use writenotify
>> for lazy allocation of disk blocks on file holes, maybe they also do that
>> for lazy allocation of disk blocks on COW]
> 
> I don't know either, but it definitely looks more promising and reasonable
> if the CoW only happens until being written, rather than being mapped RW.

That would be my best guess. But then, we have multiple pagecache pages 
pointing at the same disk block until COW happens ... maybe that's how it 
already works. Once I have some spare time, I might play with that.
ThinerLogoer Aug. 10, 2023, 5:06 p.m. UTC | #6
At 2023-08-10 22:19:45, "David Hildenbrand" <david@redhat.com> wrote:
>>> Most importantly, we won't be corrupting/touching the original file in any
>>> case, because it is R/O.
>>>
>>> If we really want to be careful, we could glue that behavior to compat
>>> machines. I'm not really sure yet if we really have to go down that path.
>>>
>>> Any other alternatives? I'd like to avoid new flags where not really
>>> required.
>> 
>> I was just thinking of a new flag. :) So have you already discussed that
>> possibility and decided it's not a good idea?
>
>Not really. I was briefly playing with that idea but already struggled 
>to come up with a reasonable name :)
>
>Fewer toggles, and just have it working nicely, if possible.
>
>> 
>> The root issue to me here is we actually have two resources (memory map of
>> the process, and the file) but we only have one way to describe the
>> permissions upon the two objects.  I'd think it makes a lot more sense if a
>> new flag is added, when there's a need to differentiate the two.
>> 
>> Consider if you see a bunch of qemu instances with:
>> 
>>    -mem-path $RAM_FILE
>> 
>> On the same host, which looks as weird as it could be to me... At least
>> '-mem-path' still looks like a way to exclusively own a ram file for an
>> instance. I worry the new fallback can confuse people too, while that's
>> so far not the major use case.
>
>Once I learned that this is not a MAP_SHARED mapping, I was extremely 
>confused. For example, vhost-user will absolutely not work with 
>"-mem-path", even though the documentation explicitly spells that out (I 
>still have to send a patch to fix that).
>
>I guess "-mem-path" was primarily only used to consume hugetlb. Even for 
>tmpfs it will already result in a double memory consumption, just like 
>when using -memory-backend-memfd,share=no.
>
>I guess deprecating it was the right decision.
>
>But memory-backend-file also defaults to "share=no" ... so the same 
>default behavior unfortunately.
>
>> 
>> Nobody may really rely on the existing failure behavior, but changing
>> existing behavior is just never wanted.  The guideline here to me is:
>> do we want existing "-mem-path XXX" users to start using the fallback in
>> general?  If it's "no", then maybe that implies a new flag is better?
>
>I think we have the following options (there might be more)
>
>1) This patch.
>
>2) New flag for memory-backend-file. We already have "readonly" and 
>"share=". I'm having a hard time coming up with a good name that really 
>describes the subtle difference.
>
>3) Glue behavior to the QEMU machine
>

4) '-deny-private-discard' argv, or environment variable, or both

I proposed a 4) earlier in the discussion, which is to add a global qemu
flag like '-deny-private-discard' or '-disallow-private-discard' (let's find
a better name!) for some duration, until the private discard behavior phases
out. Without the flag we do everything exactly the same as before; with the
flag, private CoW mappings strictly open the file readonly, discard on a
private memory backend is denied very early, as soon as the possibility
arises, and private memory-backed files are always opened readonly without
creating any file (so the file must exist, no more nasty edge cases).

This has the benefit that it can also help diagnose and debug all existing
private discard usage, which could be required in the long run. With this
flag we directly solve the immediate demand of <readonly file+private map>
while delaying the hard problem indefinitely. I think this solution seems
the most promising and acceptable to most people.
At least for my use case, I would be glad to add such a flag to my argv
if it is all I need, since it does not hurt the flexibility I care about.

Note that a difference in this option should probably not cause a
difference in the machine specification. Otherwise migration would fail
because one machine has this option and the other does not, which would be
absurd, since it is a backend implementation flag.

>
>For 3), one option would be to always open a COW file readonly (as 
>Thiner originally proposed). We could leave "-mem-path" behavior alone 
>and only change memory-backend-file semantics. If the COW file does 
>*not* exist yet, we would refuse to create the file like patch 2+3 do. 
>Therefore, no ftruncate() errors, and fallocate() errors would always 
>happen.
>
>
>What are your thoughts?
>
>[...]
>

I would be happy if -mem-path stays supported, since in that case I would
not need knowledge of the memory backend before migration.

--

Regards,

logoerthiner
Peter Xu Aug. 10, 2023, 9:24 p.m. UTC | #7
On Fri, Aug 11, 2023 at 01:06:12AM +0800, ThinerLogoer wrote:
> >I think we have the following options (there might be more)
> >
> >1) This patch.
> >
> >2) New flag for memory-backend-file. We already have "readonly" and 
> >"share=". I'm having a hard time coming up with a good name that really 
> >describes the subtle difference.
> >
> >3) Glue behavior to the QEMU machine
> >
> 
> 4) '-deny-private-discard' argv, or environment variable, or both

I'd personally vote for (2).  How about "fdperm"?  To describe when we want
to use different rw permissions on the file (besides the access permission
of the memory we already provided with "readonly"=XXX).  IIUC the only sane
value will be ro/rw/default, where "default" should just use the same rw
permission as the memory ("readonly"=XXX).

Would that be relatively clean and also work in this use case?
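
Something like this, say (the option name and values are of course only a
proposal at this point):

  -object memory-backend-file,id=ram0,size=4G,mem-path=/vm/ram,share=off,readonly=off,fdperm=ro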

(the other thing I'd wish we don't have that fallback is, as long as we
 have any of that "fallback" we'll need to be compatible with it since
 then, and for ever...)
ThinerLogoer Aug. 11, 2023, 5:49 a.m. UTC | #8
At 2023-08-11 05:24:43, "Peter Xu" <peterx@redhat.com> wrote:
>On Fri, Aug 11, 2023 at 01:06:12AM +0800, ThinerLogoer wrote:
>> >I think we have the following options (there might be more)
>> >
>> >1) This patch.
>> >
>> >2) New flag for memory-backend-file. We already have "readonly" and 
>> >"share=". I'm having a hard time coming up with a good name that really 
>> >describes the subtle difference.
>> >
>> >3) Glue behavior to the QEMU machine
>> >
>> 
>> 4) '-deny-private-discard' argv, or environment variable, or both
>
>I'd personally vote for (2).  How about "fdperm"?  To describe when we want
>to use different rw permissions on the file (besides the access permission
>of the memory we already provided with "readonly"=XXX).  IIUC the only sane
>value will be ro/rw/default, where "default" should just use the same rw
>permission as the memory ("readonly"=XXX).
>
>Would that be relatively clean and also work in this use case?
>
>(the other thing I'd wish we don't have that fallback is, as long as we
> have any of that "fallback" we'll need to be compatible with it since
> then, and for ever...)

If it must be (2), I would vote (2) + (4), with (4) adjusting the default behavior of said `fdperm`.
Mainly because (private+discard) is itself not a good practice, and (4) serves
as a good tool to help catch existing (private+discard) problems.

Actually (readonly+private) is more reasonable than (private+discard), so I
want at least some room for a default (readonly+private) behavior.

Also, in my case I kind of have to use "-mem-path" despite it being
considered close to deprecated. Only with it can I avoid needing knowledge
of the memory backend before migration. Actually there seems to be no
equivalent working after-migration setup of "-object memory-backend-file,...
-machine q35,mem=..." that can match a before-migration setup of "-machine
q35" (specifying nothing). Therefore I must make a plan and choose a
migration method BEFORE I boot the machine and prepare to migrate, reducing
operational freedom. Considering that, I have to use "-mem-path", which
keeps the freedom but has no configurable arguments, so I have to rely on
the default config.

Is there any "-object memory-backend-file..." setup equivalent to "-machine
q35", such that the two can migrate to and from each other? If there is, I
want to try it out. By the way, "-object memory-backend-file,id=pc.ram" has
just been killed by an earlier commit.

Either (4) or fixing this should help my config. I hope you can consider
this more deeply and figure out a more systematic solution that helps more
users.

--

Regards,
logoerthiner
Peter Xu Aug. 11, 2023, 2:31 p.m. UTC | #9
On Fri, Aug 11, 2023 at 01:49:52PM +0800, ThinerLogoer wrote:
> At 2023-08-11 05:24:43, "Peter Xu" <peterx@redhat.com> wrote:
> >On Fri, Aug 11, 2023 at 01:06:12AM +0800, ThinerLogoer wrote:
> >> >I think we have the following options (there might be more)
> >> >
> >> >1) This patch.
> >> >
> >> >2) New flag for memory-backend-file. We already have "readonly" and 
> >> >"share=". I'm having a hard time coming up with a good name that really 
> >> >describes the subtle difference.
> >> >
> >> >3) Glue behavior to the QEMU machine
> >> >
> >> 
> >> 4) '-deny-private-discard' argv, or environment variable, or both
> >
> >I'd personally vote for (2).  How about "fdperm"?  To describe when we want
> >to use different rw permissions on the file (besides the access permission
> >of the memory we already provided with "readonly"=XXX).  IIUC the only sane
> >value will be ro/rw/default, where "default" should just use the same rw
> >permission as the memory ("readonly"=XXX).
> >
> >Would that be relatively clean and also work in this use case?
> >
> >(the other thing I'd wish we don't have that fallback is, as long as we
> > have any of that "fallback" we'll need to be compatible with it since
> > then, and for ever...)
> 
> If it must be (2), I would vote (2) + (4), with (4) adjusting the default behavior of said `fdperm`.
> Mainly because (private+discard) is itself not a good practice, and (4) serves
> as a good tool to help catch existing (private+discard) problems.
> 
> Actually (readonly+private) is more reasonable than (private+discard), so I
> want at least some room for a default (readonly+private) behavior.

Just for discussion purposes: I think maybe someday private+discard could
work.  IIUC what we're missing is a syscall interface to install a zero
page for a MAP_PRIVATE mapping, atomically freeing what's underneath: it
seems neither punching a hole nor DONTNEED will suffice here.  Having zero
pages involved in file mappings will just be another problem, at least.
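
(A small sketch of why DONTNEED won't suffice, assuming Linux semantics: on
a private file mapping, MADV_DONTNEED drops the private CoW copy and later
reads repopulate from the file, not with zeroes:)

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/etc/passwd", O_RDONLY);   /* any non-empty file */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (fd < 0 || p == MAP_FAILED) {
        perror("open/mmap");
        return 1;
    }
    p[0] = 'X';                       /* dirty the private CoW copy */
    madvise(p, 4096, MADV_DONTNEED);  /* "discard" the page */
    /* Not 'X' and not zero: the page is refilled from the file, so
     * private+discard can't hand zeroed memory back to the guest. */
    printf("after MADV_DONTNEED: %c\n", p[0]);
    return 0;
}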

> 
> Also, in my case I kind of have to use "-mem-path" despite it being
> considered close to deprecated. Only with it can I avoid needing knowledge
> of the memory backend before migration. Actually there seems to be no
> equivalent working after-migration setup of "-object memory-backend-file,...
> -machine q35,mem=..." that can match a before-migration setup of "-machine
> q35" (specifying nothing). Therefore I must make a plan and choose a
> migration method BEFORE I boot the machine and prepare to migrate, reducing
> operational freedom. Considering that, I have to use "-mem-path", which
> keeps the freedom but has no configurable arguments, so I have to rely on
> the default config.
> 
> Is there any "-object memory-backend-file..." setup equivalent to "-machine
> q35", such that the two can migrate to and from each other? If there is, I
> want to try it out. By the way, "-object memory-backend-file,id=pc.ram" has
> just been killed by an earlier commit.

I'm actually not familiar enough with the interfaces here, but I just
checked the man page; would this work for you, together with option (2)?

        memory-backend='id'
                An alternative to legacy -mem-path and mem-prealloc options.  Allows to use a memory backend as main RAM.

                For example:

                -object memory-backend-file,id=pc.ram,size=512M,mem-path=/hugetlbfs,prealloc=on,share=on
                -machine memory-backend=pc.ram
                -m 512M
David Hildenbrand Aug. 11, 2023, 2:59 p.m. UTC | #10
On 10.08.23 23:24, Peter Xu wrote:
> On Fri, Aug 11, 2023 at 01:06:12AM +0800, ThinerLogoer wrote:
>>> I think we have the following options (there might be more)
>>>
>>> 1) This patch.
>>>
>>> 2) New flag for memory-backend-file. We already have "readonly" and
>>> "share=". I'm having a hard time coming up with a good name that really
>>> describes the subtle difference.
>>>
>>> 3) Glue behavior to the QEMU machine
>>>
>>
>> 4) '-deny-private-discard' argv, or environment variable, or both
> 
> I'd personally vote for (2).  How about "fdperm"?  To describe when we want
> to use different rw permissions on the file (besides the access permission
> of the memory we already provided with "readonly"=XXX).  IIUC the only sane
> value will be ro/rw/default, where "default" should just use the same rw
> permission as the memory ("readonly"=XXX).

Hmm, I'm not particularly happy about that.

> 
> Would that be relatively clean and also work in this use case?
> 

I get the feeling that we are over-engineering something that probably 
should never have been allowed: MAP_PRIVATE mapping of a file and 
opening it rw because someone might punch holes into it.

Once we start adding new parameters just for that, I get a bit skeptical 
that this is what we want. The number of people who care about that is 
probably close to 0.

The only real use case where this used to make sense (by accident I 
assume) was with hugetlb. And somehow, we decided that it was a good 
idea for "-mem-path" to use MAP_PRIVATE.

So, what stops us from

a) Leaving -mem-path alone. Keep opening files rw.
b) Make memory-backend-file with shared=off,readonly=off open the file
    read-only
c) Gluing that behavior to a QEMU compat machine

fallocate(PUNCH_HOLE) will fail, and we can probably let 
virtio-mem/virtio-balloon and postcopy refuse to even start (virtio-mem 
already does that) as early as possible.

People that care about any such use case would already get a warning 
when punching a hole today.

If we ever support discarding RAM in that configuration, we can simply 
unlock it again.

Am I missing any important use case?
David Hildenbrand Aug. 11, 2023, 3:26 p.m. UTC | #11
On 11.08.23 16:59, David Hildenbrand wrote:
> On 10.08.23 23:24, Peter Xu wrote:
>> On Fri, Aug 11, 2023 at 01:06:12AM +0800, ThinerLogoer wrote:
>>>> I think we have the following options (there might be more)
>>>>
>>>> 1) This patch.
>>>>
>>>> 2) New flag for memory-backend-file. We already have "readonly" and
>>>> "share=". I'm having a hard time coming up with a good name that really
>>>> describes the subtle difference.
>>>>
>>>> 3) Glue behavior to the QEMU machine
>>>>
>>>
>>> 4) '-deny-private-discard' argv, or environment variable, or both
>>
>> I'd personally vote for (2).  How about "fdperm"?  To describe when we want
>> to use different rw permissions on the file (besides the access permission
>> of the memory we already provided with "readonly"=XXX).  IIUC the only sane
>> value will be ro/rw/default, where "default" should just use the same rw
>> permission as the memory ("readonly"=XXX).
> 
> Hmm, I'm not particularly happy about that.
> 
>>
>> Would that be relatively clean and also work in this use case?
>>
> 
> I get the feeling that we are over-engineering something that probably
> should never have been allowed: MAP_PRIVATE mapping of a file and
> opening it rw because someone might punch holes into it.
> 
> Once we start adding new parameters just for that, I get a bit skeptical
> that this is what we want. The number of people who care about that is
> probably close to 0.
> 
> The only real use case where this used to make sense (by accident I
> assume) was with hugetlb. And somehow, we decided that it was a good
> idea for "-mem-path" to use MAP_PRIVATE.
> 
> So, what stops us from
> 
> a) Leaving -mem-path alone. Keep opening files rw.
> b) Make memory-backend-file with shared=off,readonly=off open the file
>      read-only
> c) Gluing that behavior to a QEMU compat machine
> 
> fallocate(PUNCH_HOLE) will fail, and we can probably let
> virtio-mem/virtio-balloon and postcopy refuse to even start (virtio-mem
> already does that) as early as possible.
> 
> People that care about any such use case would already get a warning
> when punching a hole today.
> 
> If we ever support discarding RAM in that configuration, we can simply
> unlock it again.
> 
> Am I missing any important use case?


I just started looking into the origins of "-mem-path".

Originally c902760fb2 ("Add option to use file backed guest memory"):

* Without MAP_POPULATE support, we use MAP_PRIVATE
* With MAP_POPULATE support we use MAP_PRIVATE if mem_prealloc was not
   defined.

It was only used for hugetlb. The shared memory case didn't really 
matter: they just needed a way to get hugetlb pages into the VM. Opening 
the file R/W even with MAP_PRIVATE kind-of made sense in that case; it 
was an exclusive owner.

Discarding of RAM was not very popular back then I guess: virtio-mem 
didn't exist, virtio-balloon doesn't even handle hugetlb today really, 
postcopy didn't exist.


I guess that's why nobody really cared about "-mem-path" MAP_PRIVATE vs. 
MAP_SHARED semantics: just get hugetlb pages into the VM somehow.

Nowadays, "-mem-path" always defaults to MAP_PRIVATE. For the original 
hugetlb use case, it's still good enough. For anything else, I'm not so 
sure.
Peter Xu Aug. 11, 2023, 3:47 p.m. UTC | #12
On Fri, Aug 11, 2023 at 04:59:56PM +0200, David Hildenbrand wrote:
> On 10.08.23 23:24, Peter Xu wrote:
> > On Fri, Aug 11, 2023 at 01:06:12AM +0800, ThinerLogoer wrote:
> > > > I think we have the following options (there might be more)
> > > > 
> > > > 1) This patch.
> > > > 
> > > > 2) New flag for memory-backend-file. We already have "readonly" and
> > > > "share=". I'm having a hard time coming up with a good name that really
> > > > describes the subtle difference.
> > > > 
> > > > 3) Glue behavior to the QEMU machine
> > > > 
> > > 
> > > 4) '-deny-private-discard' argv, or environment variable, or both
> > 
> > I'd personally vote for (2).  How about "fdperm"?  To describe when we want
> > to use different rw permissions on the file (besides the access permission
> > of the memory we already provided with "readonly"=XXX).  IIUC the only sane
> > value will be ro/rw/default, where "default" should just use the same rw
> > permission as the memory ("readonly"=XXX).
> 
> Hmm, I'm not particularly happy about that.
> 
> > 
> > Would that be relatively clean and also work in this use case?
> > 
> 
> I get the feeling that we are over-engineering something that probably
> should never have been allowed: MAP_PRIVATE mapping of a file and opening it
> rw because someone might punch holes into it.
> 
> Once we start adding new parameters just for that, I get a bit skeptical
> that this is what we want. The number of people who care about that is
> probably close to 0.
> 
> The only real use case where this used to make sense (by accident I assume)
> was with hugetlb. And somehow, we decided that it was a good idea for
> "-mem-path" to use MAP_PRIVATE.
> 
> So, what stops us from
> 
> a) Leaving -mem-path alone. Keep opening files rw.
> b) Make memory-backend-file with shared=off,readonly=off open the file
>    read-only
> c) Gluing that behavior to a QEMU compat machine

So we want to make all users with shared=off + readonly=off always open the
file RO, failing file write ops rather than crashing others.

Sounds reasonable to me.

In that case, do we even need a compat bit, if we're 100% sure it won't
break anyone but will only help, given that everything should be
transparent?
Peter Xu Aug. 11, 2023, 4:16 p.m. UTC | #13
On Fri, Aug 11, 2023 at 05:26:24PM +0200, David Hildenbrand wrote:
> I just started looking into the origins of "-mem-path".
> 
> Originally c902760fb2 ("Add option to use file backed guest memory"):
> 
> * Without MAP_POPULATE support, we use MAP_PRIVATE
> * With MAP_POPULATE support we use MAP_PRIVATE if mem_prealloc was not
>   defined.
> 
> It was only used for hugetlb. The shared memory case didn't really matter:
> they just needed a way to get hugetlb pages into the VM. Opening the file
> R/W even with MAP_PRIVATE kind-of made sense in that case; it was an
> exclusive owner.
> 
> Discarding of RAM was not very popular back then I guess: virtio-mem didn't
> exist, virtio-balloon doesn't even handle hugetlb today really, postcopy
> didn't exist.
> 
> 
> I guess that's why nobody really cared about "-mem-path" MAP_PRIVATE vs.
> MAP_SHARED semantics: just get hugetlb pages into the VM somehow.
> 
> Nowadays, "-mem-path" always defaults to MAP_PRIVATE. For the original
> hugetlb use case, it's still good enough. For anything else, I'm not so
> sure.

Ok this answers my other question on the compat bit then... thanks.  Feel
free to ignore it there.

But then I'd lean back towards simply adding a fdperm=; that seems the
simplest, since even with a compat bit, we still face the risk of breaking
-mem-path users on any new machine types.
David Hildenbrand Aug. 11, 2023, 4:17 p.m. UTC | #14
On 11.08.23 18:16, Peter Xu wrote:
> On Fri, Aug 11, 2023 at 05:26:24PM +0200, David Hildenbrand wrote:
>> I just started looking into the origins of "-mem-path".
>>
>> Originally c902760fb2 ("Add option to use file backed guest memory"):
>>
>> * Without MAP_POPULATE support, we use MAP_PRIVATE
>> * With MAP_POPULATE support we use MAP_PRIVATE if mem_prealloc was not
>>    defined.
>>
>> It was only used for hugetlb. The shared memory case didn't really matter:
>> they just needed a way to get hugetlb pages into the VM. Opening the file
>> R/W even with MAP_PRIVATE kind-of made sense in that case, it was an
>> exclusive owner.
>>
>> Discarding of RAM was not very popular back then I guess: virtio-mem didn't
>> exist, virtio-balloon doesn't even handle hugetlb today really, postcopy
>> didn't exist.
>>
>>
>> I guess that's why nobody really cared about "-mem-path" MAP_PRIVATE vs.
>> MAP_SHARED semantics: just get hugetlb pages into the VM somehow.
>>
>> Nowadays, "-mem-path" always defaults to MAP_PRIVATE. For the original
>> hugetlb use case, it's still good enough. For anything else, I'm not so
>> sure.
> 
> Ok this answers my other question on the compat bit then... thanks.  Feel
> free to ignore it there.
> 
> But then I'd lean back towards simply adding a fdperm=; that seems the
> simplest, since even with a compat bit, we still face the risk of breaking
> -mem-path users on any new machine types.

We wouldn't touch "-mem-path".
Peter Xu Aug. 11, 2023, 4:22 p.m. UTC | #15
On Fri, Aug 11, 2023 at 06:17:05PM +0200, David Hildenbrand wrote:
> We wouldn't touch "-mem-path".

But still the same issue when someone uses -object memory-backend-file for
hugetlb, mapping privately, expecting ram discard to work?

Basically I see that example as: "hugetlb" in general made the private
mapping over an RW file usable, so forbidding that anywhere may take a risk.

Thanks,
David Hildenbrand Aug. 11, 2023, 4:25 p.m. UTC | #16
On 11.08.23 18:22, Peter Xu wrote:
> On Fri, Aug 11, 2023 at 06:17:05PM +0200, David Hildenbrand wrote:
>> We wouldn't touch "-mem-path".
> 
> But still the same issue when someone uses -object memory-backend-file for
> hugetlb, mapping privately, expecting ram discard to work?
> 
> Basically I see that example as: "hugetlb" in general made the private
> mapping over an RW file usable, so forbidding that anywhere may take a risk.

These users can be directed to using hugetlb

a) using MAP_SHARED
b) using memory-backend-memfd, if MAP_PRIVATE is desired

Am I missing any important use case? Are we being a bit too careful about 
virtio-balloon and postcopy simply not being available for these corner 
cases?
Peter Xu Aug. 11, 2023, 4:54 p.m. UTC | #17
On Fri, Aug 11, 2023 at 06:25:14PM +0200, David Hildenbrand wrote:
> On 11.08.23 18:22, Peter Xu wrote:
> > On Fri, Aug 11, 2023 at 06:17:05PM +0200, David Hildenbrand wrote:
> > > We wouldn't touch "-mem-path".
> > 
> > But still the same issue when someone uses -object memory-backend-file for
> > hugetlb, mapping privately, expecting ram discard to work?
> > 
> > Basically I see that example as: "hugetlb" in general made the private
> > mapping over an RW file usable, so forbidding that anywhere may take a risk.
> 
> These users can be directed to using hugetlb
> 
> a) using MAP_SHARED
> b) using memory-backend-memfd, if MAP_PRIVATE is desired
> 
> Am I missing any important use case? Are we being a bit too careful about
> virtio-balloon and postcopy simply not being available for these corner
> cases?

The current immediate issue is not really the mem=rw + fd=rw + private case
(which was a known issue), but how to make mem=rw + fd=ro + private work
for ThinerLogoer, iiuc.

I'd just think it safer to expose that cap to solve problem A (vm
templating) without affecting problem B (fallocate-over-private not working
right), when B is uncertain.

I'm also copying Daniel & the libvirt list in case there's a quick comment
from there. Say, maybe libvirt never uses private mappings on hugetlb files
over memory-backend-file at all; then it's probably fine.

In all cases, you and Igor should have the final say; no strong opinion
from my side.
David Hildenbrand Aug. 11, 2023, 5:39 p.m. UTC | #18
On 11.08.23 18:54, Peter Xu wrote:
> On Fri, Aug 11, 2023 at 06:25:14PM +0200, David Hildenbrand wrote:
>> On 11.08.23 18:22, Peter Xu wrote:
>>> On Fri, Aug 11, 2023 at 06:17:05PM +0200, David Hildenbrand wrote:
>>>> We wouldn't touch "-mem-path".
>>>
>>> But still the same issue when someone uses -object memory-backend-file for
>>> hugetlb, mapping privately, expecting ram discard to work?
>>>
>>> Basically I see that example as: "hugetlb" in general made the private
>>> mapping over an RW file usable, so forbidding that anywhere may take a risk.
>>
>> These users can be directed to using hugetlb
>>
>> a) using MAP_SHARED
>> b) using memory-backend-memfd, if MAP_PRIVATE is desired
>>
>> Am I missing any important use case? Are we being a bit too careful about
>> virtio-balloon and postcopy simply not being available for these corner
>> cases?
> 
> The current immediate issue is not really the mem=rw + fd=rw + private case
> (which was a known issue), but how to make mem=rw + fd=ro + private work
> for ThinerLogoer, iiuc.
> 
> I'd just think it safer to expose that cap to solve problem A (vm
> templating) without affecting problem B (fallocate-over-private not working
> right), when B is uncertain.

Right, and I'm wondering if B is worth the effort.

> 
> I'm also copying Daniel & the libvirt list in case there's a quick comment
> from there. Say, maybe libvirt never uses private mappings on hugetlb files
> over memory-backend-file at all; then it's probably fine.

libvirt certainly allows setting <access mode="shared"/> with <source 
type="file">.

Could be that they also end up mapping "<hugepages>" to 
memory-backend-file instead of memory-backend-memfd (e.g., compatibility 
with older kernels?).

> 
> In all cases, you and Igor should have the final say; no strong opinion
> from my side.

I do value your opinion, so I'm still trying to figure out if there are 
sane use cases that really need a new parameter. Let's recap:

When opening the file R/O, resulting in fallocate() refusing to work:
* virtio-balloon will fail to discard RAM but continue to "be alive"
* virtio-mem will discard any private pages, but cannot free up disk
   blocks using fallocate.
* postcopy would fail early

Postcopy:
* Works on shmem (MAP_SHARED / MAP_PRIVATE)
* Works on hugetlb (MAP_SHARED / MAP_PRIVATE)
* Does not work on file-backed memory (including MAP_PRIVATE)

We can ignore virtio-mem for now. What remains is postcopy and 
virtio-balloon.

memory-backend-file with MAP_PRIVATE on shmem/tmpfs results in a double 
memory consumption, so we can mostly cross that out as "sane use case". 
Rather make such users aware of that :D

memory-backend-file with MAP_PRIVATE on hugetlb works. virtio-balloon is 
not really compatible with hugetlb; free-page-reporting might work 
(although it would be quite nonsensical). So postcopy remains as the most 
important use case.

memory-backend-file with MAP_PRIVATE on file-backed memory works. 
postcopy does not apply. virtio-balloon should work I guess.


So the two use cases that are left are:
* postcopy with hugetlb would fail
* virtio-balloon with file-backed memory cannot free up disk blocks

Am I missing a case?
David Hildenbrand Aug. 11, 2023, 7 p.m. UTC | #19
On 11.08.23 07:49, ThinerLogoer wrote:
> At 2023-08-11 05:24:43, "Peter Xu" <peterx@redhat.com> wrote:
>> On Fri, Aug 11, 2023 at 01:06:12AM +0800, ThinerLogoer wrote:
>>>> I think we have the following options (there might be more)
>>>>
>>>> 1) This patch.
>>>>
>>>> 2) New flag for memory-backend-file. We already have "readonly" and
>>>> "share=". I'm having a hard time coming up with a good name that really
>>>> describes the subtle difference.
>>>>
>>>> 3) Glue behavior to the QEMU machine
>>>>
>>>
>>> 4) '-deny-private-discard' argv, or environment variable, or both
>>
>> I'd personally vote for (2).  How about "fdperm"?  To describe when we want
>> to use different rw permissions on the file (besides the access permission
>> of the memory we already provided with "readonly"=XXX).  IIUC the only sane
>> value will be ro/rw/default, where "default" should just use the same rw
>> permission as the memory ("readonly"=XXX).
>>
>> Would that be relatively clean and also work in this use case?
>>
>> (the other thing I'd wish we don't have that fallback is, as long as we
>> have any of that "fallback" we'll need to be compatible with it since
>> then, and for ever...)
> 
> If it must be (2), I would vote (2) + (4), with (4) adjusting the default behavior of said `fdperm`.
> Mainly because (private+discard) is itself not a good practice, and (4) serves
> as a good tool to help catch existing (private+discard) problems.

Instead of fdperm, maybe we could find a better name.

The man page of "open" says: The argument flags must include one of the 
following access modes: O_RDONLY, O_WRONLY, or O_RDWR.  These request 
opening the file read-only, write-only, or read/write, respectively.

So maybe something a bit more of a mouthful, like "file-access-mode", would 
be better.

We could have the options
* readwrite
* readonly
* auto

Whereby "auto" is the default.

Specifying:

* "readonly=true,file-access-mode=readwrite" would fail.
* "shared=true,readonly=false,file-access-mode=readonly" would fail.
* "shared=false,readonly=false,file-access-mode=readonly" would work.

In theory, we could adjust the mode of "auto" with compat machines. But 
maybe it would be sufficient to do something like the following

1) shared=true,readonly=false
	-> readwrite
2) shared=true,readonly=true
	-> readonly
3) shared=false,readonly=true
	-> readonly
4) shared=false,readonly=false
	hugetlb? -> readwrite
	otherwise -> readonly

The reason is the traditional usage of hugetlb with MAP_PRIVATE, where I 
can see possible users of postcopy. Further, VM templating with 
hugetlb might not be that common ... users could still modify the 
behavior if they really have to.

That would make your use case work automatically with file-backed memory.
Peter Xu Aug. 11, 2023, 9:07 p.m. UTC | #20
On Fri, Aug 11, 2023 at 07:39:37PM +0200, David Hildenbrand wrote:
> On 11.08.23 18:54, Peter Xu wrote:
> > On Fri, Aug 11, 2023 at 06:25:14PM +0200, David Hildenbrand wrote:
> > > On 11.08.23 18:22, Peter Xu wrote:
> > > > On Fri, Aug 11, 2023 at 06:17:05PM +0200, David Hildenbrand wrote:
> > > > > We wouldn't touch "-mem-path".
> > > > 
> > > > But still the same issue when someone uses -object memory-backend-file for
> > > > hugetlb, mapping privately, expecting ram discard to work?
> > > > 
> > > > Basically I see that example as: "hugetlb" in general made the private
> > > > mapping over an RW file usable, so forbidding that anywhere may take a risk.
> > > 
> > > These users can be directed to using hugetlb
> > > 
> > > a) using MAP_SHARED
> > > b) using memory-backend-memfd, if MAP_PRIVATE is desired
> > > 
> > > Am I missing any important use case? Are we being a bit too careful about
> > > virtio-balloon and postcopy simply not being available for these corner
> > > cases?
> > 
> > The current immediate issue is not really the mem=rw + fd=rw + private case
> > (which was a known issue), but how to make mem=rw + fd=ro + private work
> > for ThinerLogoer, iiuc.
> > 
> > I'd just think it safer to expose that cap to solve problem A (vm
> > templating) without affecting problem B (fallocate-over-private not working
> > right), when B is uncertain.
> 
> Right, and I'm thinking about if B is worth the effort.
> 
> > 
> > I'm also copying Daniel & the libvirt list in case there's a quick comment
> > from there. Say, maybe libvirt never uses private mappings on hugetlb files
> > over memory-backend-file at all; then it's probably fine.
> 
> libvirt certainly allows setting <access mode="shared"/> with <source
> type="file">.
> 
> Could be that they also end up mapping "<hugepages>" to memory-backend-file
> instead of memory-backend-memfd (e.g., compatibility with older kernels?).
> 
> > 
> > In all cases, you and Igor should have the final say; no strong opinion
> > from my side.
> 
> I do value your opinion, so I'm still trying to figure out if there are sane
> use cases that really need a new parameter. Let's recap:
> 
> When opening the file R/O, resulting in fallocate() refusing to work:
> * virtio-balloon will fail to discard RAM but continue to "be alive"
> * virtio-mem will discard any private pages, but cannot free up disk
>   blocks using fallocate.
> * postcopy would fail early
> 
> Postcopy:
> * Works on shmem (MAP_SHARED / MAP_PRIVATE)
> * Works on hugetlb (MAP_SHARED / MAP_PRIVATE)
> * Does not work on file-backed memory (including MAP_PRIVATE)
> 
> We can ignore virtio-mem for now. What remains is postcopy and
> virtio-balloon.
> 
> memory-backend-file with MAP_PRIVATE on shmem/tmpfs results in a double
> memory consumption, so we can mostly cross that out as "sane use case".
> Rather make such users aware of that :D
> 
> memory-backend-file with MAP_PRIVATE on hugetlb works. virtio-balloon is not
> really compatible with hugetlb, free-page-reporting might work (although
> quite nonsensical). So postcopy as the most important use case remains.
> 
> memory-backend-file with MAP_PRIVATE on file-backed memory works. postcopy
> does not apply. virtio-balloon should work I guess.
> 
> 
> So the two use cases that are left are:
> * postcopy with hugetlb would fail
> * virtio-balloon with file-backed memory cannot free up disk blocks
> 
> Am I missing a case?

Looks complete.  I don't want to say so, but afaik postcopy should be a
"corner case" in most cases I'd say; people do still rely mostly on
precopy.  It's probably a matter of whether we'd like to take the risk.
ThinerLogoer Aug. 12, 2023, 5:18 a.m. UTC | #21
At 2023-08-12 03:00:54, "David Hildenbrand" <david@redhat.com> wrote:
>On 11.08.23 07:49, ThinerLogoer wrote:
>> At 2023-08-11 05:24:43, "Peter Xu" <peterx@redhat.com> wrote:
>>> On Fri, Aug 11, 2023 at 01:06:12AM +0800, ThinerLogoer wrote:
>>>>> I think we have the following options (there might be more)
>>>>>
>>>>> 1) This patch.
>>>>>
>>>>> 2) New flag for memory-backend-file. We already have "readonly" and
>>>>> "share=". I'm having a hard time coming up with a good name that really
>>>>> describes the subtle difference.
>>>>>
>>>>> 3) Glue behavior to the QEMU machine
>>>>>
>>>>
>>>> 4) '-deny-private-discard' argv, or environment variable, or both
>>>
>>> I'd personally vote for (2).  How about "fdperm"?  To describe when we want
>>> to use different rw permissions on the file (besides the access permission
>>> of the memory we already provided with "readonly"=XXX).  IIUC the only sane
>>> value will be ro/rw/default, where "default" should just use the same rw
>>> permission as the memory ("readonly"=XXX).
>>>
>>> Would that be relatively clean and also work in this use case?
>>>
>>> (the other thing I'd wish we don't have that fallback is, as long as we
>>> have any of that "fallback" we'll need to be compatible with it since
>>> then, and for ever...)
>> 
>> If it must be (2), I would vote (2) + (4), with (4) adjusting the default behavior of said `fdperm`.
>> Mainly because (private+discard) is itself not a good practice and (4) serves
>> as a good tool to help catch existing (private+discard) problems.
>
>Instead of fdperm, maybe we could find a better name.
>
>The man page of "open" says: The argument flags must include one of the 
>following access modes: O_RDONLY, O_WRONLY, or O_RDWR.  These request 
>opening the file read-only, write-only, or read/write, respectively.
>
>So maybe something a bit more of a mouthful like "file-access-mode" would be 
>better.

Maybe "fd-mode"or "open-mode" to make it shorter and also non ambiguous.

"open-access-mode" can also be considered.

Btw my chatgpt agent suggested 10 names to me, in case you find it hard
to decide on a name:

    access-mode
    file-mode
    open-mode
    file-permission
    file-access-mode (aha!)
    file-access-permission
    file-io-access-mode
    file-open-permission
    file-operation-mode
    file-read-write-mode


>
>We could have the options
>*readwrite
>*readonly
>*auto
>
>Whereby "auto" is the default.

I like the "auto" here.

>
>Specifying:
>
>* "readonly=true,file-access-mode=readwrite" would fail.
>* "shared=true,readonly=false,file-access-mode=readonly" would fail.
>* "shared=false,readonly=false,file-access-mode=readonly" would work.

Yeah, sanitizing the arguments here is wise.

>
>In theory, we could adjust the mode of "auto" with compat machines. But 
>maybe it would be sufficient to do something like the following
>
>1) shared=true,readonly=false
>	-> readwrite
>2) shared=true,readonly=true
>	> readonly
>2) shared=false,readonly=true
>	-> readonly
>3) shared=false,readonly=true
>	hugetlb? -> readwrite
>	otherwise -> readonly

Looks like there are some typos here. I assume it was:

1) shared=true,readonly=false
	-> readwrite
2) shared=true,readonly=true
	-> readonly
3) shared=false,readonly=true
	-> readonly
4) shared=false,readonly=false
	hugetlb? -> readwrite
	otherwise -> readonly
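
With the corrected table, resolving "auto" could look roughly like this
(a hypothetical helper reusing the FileAccessMode sketch from earlier in
the thread, not actual QEMU code):

```
#include <stdbool.h>

/* Resolve file-access-mode=auto per the corrected table above. */
static FileAccessMode resolve_auto(bool share, bool readonly,
                                   bool is_hugetlb)
{
    if (readonly) {
        return FILE_ACCESS_READONLY;            /* cases 2) and 3) */
    }
    if (share) {
        return FILE_ACCESS_READWRITE;           /* case 1) */
    }
    /* case 4): share=off,readonly=off */
    return is_hugetlb ? FILE_ACCESS_READWRITE : FILE_ACCESS_READONLY;
}
```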


>The reason is the traditional usage of hugetlb with MAP_PRIVATE, where I 
>can see possible users with postcopy. Further, VM templating with 
>hugetlb might not be that common ... users could still modify the 
>behavior if they really have to.
>
>That would make your use case work automatically with file-backed memory.

This seems a good idea. I am good with the solution you proposed
here as well.


--

Regards,

logoerthiner
ThinerLogoer Aug. 12, 2023, 6:21 a.m. UTC | #22
At 2023-08-11 22:31:36, "Peter Xu" <peterx@redhat.com> wrote:
>On Fri, Aug 11, 2023 at 01:49:52PM +0800, ThinerLogoer wrote:
>> At 2023-08-11 05:24:43, "Peter Xu" <peterx@redhat.com> wrote:
>> >On Fri, Aug 11, 2023 at 01:06:12AM +0800, ThinerLogoer wrote:
>> >> >I think we have the following options (there might be more)
>> >> >
>> >> >1) This patch.
>> >> >
>> >> >2) New flag for memory-backend-file. We already have "readonly" and 
>> >> >"share=". I'm having a hard time coming up with a good name that really 
>> >> >describes the subtle difference.
>> >> >
>> >> >3) Glue behavior to the QEMU machine
>> >> >
>> >> 
>> >> 4) '-deny-private-discard' argv, or environment variable, or both
>> >
>> >I'd personally vote for (2).  How about "fdperm"?  To describe when we want
>> >to use different rw permissions on the file (besides the access permission
>> >of the memory we already provided with "readonly"=XXX).  IIUC the only sane
>> >value will be ro/rw/default, where "default" should just use the same rw
>> >permission as the memory ("readonly"=XXX).
>> >
>> >Would that be relatively clean and also work in this use case?
>> >
>> >(the other thing I'd wish we don't have that fallback is, as long as we
>> > have any of that "fallback" we'll need to be compatible with it since
>> > then, and for ever...)
>> 
>> If it must be (2), I would vote (2) + (4), with (4) adjusting the default behavior of said `fdperm`.
>> Mainly because (private+discard) is itself not a good practice and (4) serves
>> as a good tool to help catch existing (private+discard) problems.
>> 
>> Actually (readonly+private) is more reasonable than (private+discard), so I
>> want at least some room for a default (readonly+private) behavior.
>
>Just for discussion purposes: I think maybe someday private+discard
>could work.  IIUC what we're missing is a syscall interface to install a
>zero page for a MAP_PRIVATE mapping, atomically freeing what's underneath:
>it seems neither punching a hole nor DONTNEED would suffice here.  It'll
>just be another problem when zero pages get involved in file mappings, at
>least.
>
>> 
>> Also in my case I kind of have to use "-mem-path" despite it being considered
>> to be close to deprecated. Only with this I can avoid knowledge of memory
>> backend before migration. Actually there seems to be no equivalent working after-migration
>> setup of "-object memory-backend-file,... -machine q35,mem=..." that can match
>> before-migration setup of "-machine q35" (specifying nothing). Therefore
>> I must make a plan and choose a migration method BEFORE I boot the
>> machine and prepare to migrate, reducing the operation freedom.
>> Considering that, I have to use "-mem-path" which keeps the freedom but
>> has no configurable argument and I have to rely on default config.
>> 
>> Are there any "-object memory-backend-file..." setup equivalent to "-machine q35"
>> that can migrate from and to each other? If there is, I want to try it out.
>> By the way "-object memory-backend-file,id=pc.ram" has just been killed by an earlier
>> commit.
>
>I'm actually not familiar enough with the interfaces here, but I just
>checked the man page; would this work for you, together with option (2)?
>
>        memory-backend='id'
>                An alternative to legacy -mem-path and mem-prealloc options.  Allows to use a memory backend as main RAM.
>
>                For example:
>
>                -object memory-backend-file,id=pc.ram,size=512M,mem-path=/hugetlbfs,prealloc=on,share=on
>                -machine memory-backend=pc.ram
>                -m 512M
>

Wait ... I thought it should not work but it did work today. How bad am I at reading
the correct part of documentation ...

'-machine q35 -m 512M' is equivalent to '-object
memory-backend-file,id=pc.ram,size=512M
-machine q35,memory-backend=pc.ram',
the latter works, and the two mentioned setups can be
migrated from one to another.

What I was consistently trying was '-object
memory-backend-file,id=pc.ram,size=512M -machine q35', and qemu raises an error
for this in a recent update:

```
qemu-system-x86_64: object name 'pc.ram' is reserved for the default RAM backend, it can't be used for any other purposes. Change the object's 'id' to something else
```

This error is misleading. Actually in this case, the error message should be
closer to:
```
object name 'pc.ram' is reserved for the default RAM backend, it can't
be used for any other purposes. Change the object's 'id' to something
else, or append "memory-backend=pc.ram" to -machine arguments
```

(I suggest rewriting the error message like this because of the confusion just now)


Even though the default memory backend name is pc.ram, the
'-machine q35,memory-backend=pc.ram' part explicitly marks that qemu
uses a memory backend named pc.ram, rather than relying on the default.

It seems that if it relies on the default and a memory-backend-file has
an id of "pc.ram" (on x86_64 of course), it will fail.

Great. Now I will consider using a "-object
memory-backend-file,id=pc.ram,size=512M
-machine q35,memory-backend=pc.ram"

--

Regards,

logoerthiner
David Hildenbrand Aug. 17, 2023, 9:07 a.m. UTC | #23
@Stefan, see below on a R/O NVDIMM question.

We're discussing how to get a MAP_PRIVATE R/W mapping of a
memory-backend-file running when using R/O files.

> 
> This seems a good idea. I am good with the solution you proposed
> here as well.

I was just going to get started working on that, when I realized
something important:


"@readonly: if true, the backing file is opened read-only; if false,
  it is opened read-write.  (default: false)"

"@share: if false, the memory is private to QEMU; if true, it is
  shared (default: false)"

So readonly is *all* about the file access mode already ... the mmap()
parameters are just a side-effect of that. Adding a new
"file-access-mode" or similar would be wrong.


Here are the combinations we have right now:

-object memory-backend-file,share=on,readonly=on

  -> Existing behavior: Open readonly, mmap readonly shared
  -> Makes sense, mmap'ing readwrite would fail

-object memory-backend-file,share=on,readonly=off

  -> Existing behavior: Open readwrite, mmap readwrite shared
  -> Mostly makes sense: why open a shared file R/W and not mmap it
     R/W?

-object memory-backend-file,share=off,readonly=off
  -> Existing behavior: Open readwrite, mmap readwrite private
  -> Mostly makes sense: why open a file R/W and not map it R/W (even if
     private)?

-object memory-backend-file,share=off,readonly=on
  -> Existing behavior: Open readonly, mmap readonly private
  -> That's the problematic one
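
In other words, the existing behavior boils down to this (a sketch of the
mapping just listed, not the actual file_ram_open()/mmap code):

```
#include <fcntl.h>
#include <stdbool.h>
#include <sys/mman.h>

/* Current mapping of share/readonly to open(2) and mmap(2) parameters. */
static void current_behavior(bool share, bool readonly,
                             int *open_flags, int *prot, int *map_flags)
{
    *open_flags = readonly ? O_RDONLY : O_RDWR;
    *prot       = readonly ? PROT_READ : PROT_READ | PROT_WRITE;
    *map_flags  = share ? MAP_SHARED : MAP_PRIVATE;
}
```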


So for your use case (VM templating using a readonly file), you
would actually want to use:

-object memory-backend-file,share=off,readonly=on

BUT, have the mmap be writable (instead of currently readonly).

Assuming we would change the current behavior, what if someone would
specify:

-object memory-backend-file,readonly=on

(because the default is share=off ...) and using it for a R/O NVDIMM,
where we expect any write access to fail.


But let's look at the commit that added the "readonly" parameter:

commit 86635aa4e9d627d5142b81c57a33dd1f36627d07
Author: Stefan Hajnoczi <stefanha@redhat.com>
Date:   Mon Jan 4 17:13:19 2021 +0000

     hostmem-file: add readonly=on|off option
     
     Let -object memory-backend-file work on read-only files when the
     readonly=on option is given. This can be used to share the contents of a
     file between multiple guests while preventing them from consuming
     Copy-on-Write memory if guests dirty the pages, for example.

That was part of

https://lore.kernel.org/all/20210104171320.575838-3-stefanha@redhat.com/T/#m712f995e6dcfdde433958bae5095b145dd0ee640

 From what I understand, for NVDIMMs we always use
"-object memory-backend-file,share=on", even when we want a
readonly NVDIMM.


So we have two options:

1) Change the current behavior of -object memory-backend-file,share=off,readonly=on:

-> Open the file r/o but mmap it writable

2) Add a new property to configure the mmap accessibility. Not a big fan of that.


@Stefan, do you have any concern when we would do 1) ?

As far as I can tell, we have to set the nvdimm to "unarmed=on" either way:

+   "unarmed" controls the ACPI NFIT NVDIMM Region Mapping Structure "NVDIMM
+   State Flags" Bit 3 indicating that the device is "unarmed" and cannot accept
+   persistent writes. Linux guest drivers set the device to read-only when this
+   bit is present. Set unarmed to on when the memdev has readonly=on.

So changing the behavior would not really break the nvdimm use case.

Further, we could warn in nvdimm code when we stumble over this configuration with
unarmed=on.
Daniel P. Berrangé Aug. 17, 2023, 1:37 p.m. UTC | #24
On Mon, Aug 07, 2023 at 09:07:32PM +0200, David Hildenbrand wrote:
> From: Thiner Logoer <logoerthiner1@163.com>
> 
> Users may specify
> * "-mem-path" or
> * "-object memory-backend-file,share=off,readonly=off"
> and expect such COW (MAP_PRIVATE) mappings to work, even if the user
> does not have write permissions to open the file.
> 
> For now, we would always fail in that case, always requiring file write
> permissions. Let's detect when that failure happens and fallback to opening
> the file readonly.
> 
> Warn the user, since there are other use cases where we want the file to
> be mapped writable: ftruncate() and fallocate() will fail if the file
> was not opened with write permissions.
> 
> Signed-off-by: Thiner Logoer <logoerthiner1@163.com>
> Co-developed-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  softmmu/physmem.c | 26 ++++++++++++++++++--------
>  1 file changed, 18 insertions(+), 8 deletions(-)
> 
> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> index 3df73542e1..d1ae694b20 100644
> --- a/softmmu/physmem.c
> +++ b/softmmu/physmem.c
> @@ -1289,8 +1289,7 @@ static int64_t get_file_align(int fd)
>  static int file_ram_open(const char *path,
>                           const char *region_name,
>                           bool readonly,
> -                         bool *created,
> -                         Error **errp)
> +                         bool *created)
>  {
>      char *filename;
>      char *sanitized_name;
> @@ -1334,10 +1333,7 @@ static int file_ram_open(const char *path,
>              g_free(filename);
>          }
>          if (errno != EEXIST && errno != EINTR) {
> -            error_setg_errno(errp, errno,
> -                             "can't open backing store %s for guest RAM",
> -                             path);
> -            return -1;
> +            return -errno;
>          }
>          /*
>           * Try again on EINTR and EEXIST.  The latter happens when
> @@ -1946,9 +1942,23 @@ RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
>      bool created;
>      RAMBlock *block;
>  
> -    fd = file_ram_open(mem_path, memory_region_name(mr), readonly, &created,
> -                       errp);
> +    fd = file_ram_open(mem_path, memory_region_name(mr), readonly, &created);
> +    if (fd == -EACCES && !(ram_flags & RAM_SHARED) && !readonly) {
> +        /*
> +         * We can have a writable MAP_PRIVATE mapping of a readonly file.
> +         * However, some operations like ftruncate() or fallocate() might fail
> +         * later, let's warn the user.
> +         */
> +        fd = file_ram_open(mem_path, memory_region_name(mr), true, &created);
> +        if (fd >= 0) {
> +            warn_report("backing store %s for guest RAM (MAP_PRIVATE) opened"
> +                        " readonly because the file is not writable", mem_path);

IIUC, from the description, the goal is that usage of a readonly
backing store is intended to be an explicitly supported deployment
configuration. At the same time though, this scenario could also be
a deployment mistake that we want to diagnose.

It is inappropriate to issue warn_report() for things that are
supported usage.

It is also undesirable to continue execution in the case of things
which are a deployment mistake.

These two scenarios are mutually incompatible, so I understand why
you chose to fudge it with a warn_report().

I wonder if this is pointing to the need for another configuration
knob for the memory backend, to express the different desired usage
models.

We want the file opened with write access (O_RDWR), either if we want
the file shared, or so that we can ftruncate it to the right size if it
does not exist. If share=off and the file is pre-created at the
right size, we should be able to use O_RDONLY even if the file is
writable.

So what if we added a 'create=yes|no' option to memory-backend-file

   -object memory-backend-file,share=off,readonly=off,create=yes

would imply the need for O_RDWR, so that ftruncate() can
do its work.

With share=off,create=no, we could unconditionally open O_RDONLY,
even if the file is writable.

This would let us support read-only backing files, without any
warn_report()s for this usage, while also stopping execution
on deployment mistakes.
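
As a sketch, the open flags for the proposed "create" property could
resolve like the following (the property and helper are hypothetical,
proposed here, not an existing QEMU option):

```
#include <fcntl.h>
#include <stdbool.h>

/* Hypothetical resolution for a create=yes|no property. */
static int resolve_open_flags(bool share, bool create)
{
    if (!share && !create) {
        /* Pre-created file, private COW mapping: R/O access suffices. */
        return O_RDONLY;
    }
    /* Shared mapping, or QEMU must create and ftruncate() the file. */
    return create ? O_RDWR | O_CREAT : O_RDWR;
}
```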

This doesn't help -mem-path, since it doesn't take options, but
IMHO it would be acceptable to say users need to use the more
verbose '-object memory-backend-file' instead.

> +        }
> +    }
>      if (fd < 0) {
> +        error_setg_errno(errp, -fd,
> +                         "can't open backing store %s for guest RAM",
> +                         mem_path);
>          return NULL;
>      }
>  
> -- 
> 2.41.0
> 
> 

With regards,
Daniel
Daniel P. Berrangé Aug. 17, 2023, 1:42 p.m. UTC | #25
On Thu, Aug 10, 2023 at 04:19:45PM +0200, David Hildenbrand wrote:
> > > Most importantly, we won't be corrupting/touching the original file in any
> > > case, because it is R/O.
> > > 
> > > If we really want to be careful, we could glue that behavior to compat
> > > machines. I'm not really sure yet if we really have to go down that path.
> > > 
> > > Any other alternatives? I'd like to avoid new flags where not really
> > > required.
> > 
> > I was just thinking of a new flag. :) So have you already discussed that
> > possibility and decided that it's not a good idea?
> 
> Not really. I was briefly playing with that idea but already struggled to
> come up with a reasonable name :)
> 
> Less toggles and just have it working nice, if possible.

IMHO having a new flag is desirable, because it is directly
expressing the desired deployment scenario, such that we get
good error reporting upon deployment mistakes, while at the
same time allowing the readonly usage.

> > The root issue to me here is we actually have two resources (memory map of
> > the process, and the file) but we only have one way to describe the
> > permissions upon the two objects.  I'd think it makes a lot more sense if a
> > new flag is added, when there's a need to differentiate the two.
> > 
> > Consider if you see a bunch of qemu instances with:
> > 
> >    -mem-path $RAM_FILE
> > 
> > On the same host, which can be as weird as it could be to me.. At least
> > '-mem-path' looks still like a way to exclusively own a ram file for an
> > instance. I worry the new fallback can confuse people too, while that's
> > so far not the major use case.
> 
> Once I learned that this is not a MAP_SHARED mapping, I was extremely
> > confused. For example, vhost-user will absolutely not work with
> > "-mem-path", even though the documentation explicitly spells that out
> (I still have to send a patch to fix that).
> 
> I guess "-mem-path" was primarily only used to consume hugetlb. Even for
> tmpfs it will already result in a double memory consumption, just like when
> using -memory-backend-memfd,share=no.
> 
> I guess deprecating it was the right decision.

Regardless of whether it's deprecated or not, I think it's fine to just
say people need to use the more verbose memory-backend-file syntax
if they want to use an unusual deployment configuration where there is
a readonly backing file.

> > Nobody may really rely on any existing behavior of the failure, but
> > changing existing behavior is just always not wanted.  The guideline here
> > to me is: whether we want existing "-mem-path XXX" users to start using the
> > fallback in general?  If it's "no", then maybe it implies a new flag is
> > better?
> 
> I think we have the following options (there might be more)
> 
> 1) This patch.
> 
> 2) New flag for memory-backend-file. We already have "readonly" and
> "share=". I'm having a hard time coming up with a good name that really
> describes the subtle difference.
> 
> 3) Glue behavior to the QEMU machine
> 
> 
> For 3), one option would be to always open a COW file readonly (as Thiner
> originally proposed). We could leave "-mem-path" behavior alone and only
> change memory-backend-file semantics. If the COW file does *not* exist yet,
> we would refuse to create the file like patch 2+3 do. Therefore, no
> ftruncate() errors, and fallocate() errors would always happen.

I'm for (2).


With regards,
Daniel
David Hildenbrand Aug. 17, 2023, 1:44 p.m. UTC | #26
On 17.08.23 15:37, Daniel P. Berrangé wrote:
> On Mon, Aug 07, 2023 at 09:07:32PM +0200, David Hildenbrand wrote:
>> From: Thiner Logoer <logoerthiner1@163.com>
>>
>> Users may specify
>> * "-mem-path" or
>> * "-object memory-backend-file,share=off,readonly=off"
>> and expect such COW (MAP_PRIVATE) mappings to work, even if the user
>> does not have write permissions to open the file.
>>
>> For now, we would always fail in that case, always requiring file write
>> permissions. Let's detect when that failure happens and fallback to opening
>> the file readonly.
>>
>> Warn the user, since there are other use cases where we want the file to
>> be mapped writable: ftruncate() and fallocate() will fail if the file
>> was not opened with write permissions.
>>
>> Signed-off-by: Thiner Logoer <logoerthiner1@163.com>
>> Co-developed-by: David Hildenbrand <david@redhat.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   softmmu/physmem.c | 26 ++++++++++++++++++--------
>>   1 file changed, 18 insertions(+), 8 deletions(-)
>>
>> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
>> index 3df73542e1..d1ae694b20 100644
>> --- a/softmmu/physmem.c
>> +++ b/softmmu/physmem.c
>> @@ -1289,8 +1289,7 @@ static int64_t get_file_align(int fd)
>>   static int file_ram_open(const char *path,
>>                            const char *region_name,
>>                            bool readonly,
>> -                         bool *created,
>> -                         Error **errp)
>> +                         bool *created)
>>   {
>>       char *filename;
>>       char *sanitized_name;
>> @@ -1334,10 +1333,7 @@ static int file_ram_open(const char *path,
>>               g_free(filename);
>>           }
>>           if (errno != EEXIST && errno != EINTR) {
>> -            error_setg_errno(errp, errno,
>> -                             "can't open backing store %s for guest RAM",
>> -                             path);
>> -            return -1;
>> +            return -errno;
>>           }
>>           /*
>>            * Try again on EINTR and EEXIST.  The latter happens when
>> @@ -1946,9 +1942,23 @@ RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
>>       bool created;
>>       RAMBlock *block;
>>   
>> -    fd = file_ram_open(mem_path, memory_region_name(mr), readonly, &created,
>> -                       errp);
>> +    fd = file_ram_open(mem_path, memory_region_name(mr), readonly, &created);
>> +    if (fd == -EACCES && !(ram_flags & RAM_SHARED) && !readonly) {
>> +        /*
>> +         * We can have a writable MAP_PRIVATE mapping of a readonly file.
>> +         * However, some operations like ftruncate() or fallocate() might fail
>> +         * later, let's warn the user.
>> +         */
>> +        fd = file_ram_open(mem_path, memory_region_name(mr), true, &created);
>> +        if (fd >= 0) {
>> +            warn_report("backing store %s for guest RAM (MAP_PRIVATE) opened"
>> +                        " readonly because the file is not writable", mem_path);
> 
> IIUC, from the description, the goal is that usage of a readonly
> backing store is intended to be an explicitly supported deployment
> configuration. At the same time though, this scenario could also be
> a deployment mistake that we want to diagnose.

FWIW, I abandoned this approach here and instead will look into making

memory-backend-file,readonly=on,share=off

create RAM instead of ROM.

The fallback was wrong, as I realized after understanding what 
"readonly" is actually supposed to do.

I stared at libvirt, and even it never seems to set readonly=on for R/O 
NVDIMMs, so you always get RAM and then tell the nvdimm device to not 
perform any writes (unarmed=on).
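
For reference, that libvirt-style configuration corresponds to something
like the following command line (paths, sizes and ids made up):

```
qemu-system-x86_64 \
    -machine pc,nvdimm=on \
    -object memory-backend-file,id=mem1,share=off,mem-path=/path/nvdimm.img,size=4G \
    -device nvdimm,id=nv1,memdev=mem1,unarmed=on \
    ...
```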
David Hildenbrand Aug. 17, 2023, 1:45 p.m. UTC | #27
On 17.08.23 15:42, Daniel P. Berrangé wrote:
> On Thu, Aug 10, 2023 at 04:19:45PM +0200, David Hildenbrand wrote:
>>>> Most importantly, we won't be corrupting/touching the original file in any
>>>> case, because it is R/O.
>>>>
>>>> If we really want to be careful, we could glue that behavior to compat
>>>> machines. I'm not really sure yet if we really have to go down that path.
>>>>
>>>> Any other alternatives? I'd like to avoid new flags where not really
>>>> required.
>>>
>>> I was just thinking of a new flag. :) So have you already discussed that
>>> possibility and decided that it's not a good idea?
>>
>> Not really. I was briefly playing with that idea but already struggled to
>> come up with a reasonable name :)
>>
>> Less toggles and just have it working nice, if possible.
> 
> IMHO having a new flag is desirable, because it is directly
> expressing the desired deployment scenario, such that we get
> good error reporting upon deployment mistakes, while at the
> same time allowing the readonly usage.
> 
>>> The root issue to me here is we actually have two resources (memory map of
>>> the process, and the file) but we only have one way to describe the
>>> permissions upon the two objects.  I'd think it makes a lot more sense if a
>>> new flag is added, when there's a need to differentiate the two.
>>>
>>> Consider if you see a bunch of qemu instances with:
>>>
>>>     -mem-path $RAM_FILE
>>>
>>> On the same host, which can be as weird as it could be to me.. At least
>>> '-mem-path' looks still like a way to exclusively own a ram file for an
>>> instance. I worry the new fallback can confuse people too, while that's
>>> so far not the major use case.
>>
>> Once I learned that this is not a MAP_SHARED mapping, I was extremely
>> confused. For example, vhost-user will absolutely not work with
>> "-mem-path", even though the documentation explicitly spells that out
>> (I still have to send a patch to fix that).
>>
>> I guess "-mem-path" was primarily only used to consume hugetlb. Even for
>> tmpfs it will already result in a double memory consumption, just like when
>> using -memory-backend-memfd,share=no.
>>
>> I guess deprecating it was the right decision.
> 
> Regardless of whether it's deprecated or not, I think it's fine to just
> say people need to use the more verbose memory-backend-file syntax
> if they want to use an unusual deployment configuration where there is
> a readonly backing file.
> 
>>> Nobody may really rely on any existing behavior of the failure, but
>>> changing existing behavior is just always not wanted.  The guideline here
>>> to me is: whether we want existing "-mem-path XXX" users to start using the
>>> fallback in general?  If it's "no", then maybe it implies a new flag is
>>> better?
>>
>> I think we have the following options (there might be more)
>>
>> 1) This patch.
>>
>> 2) New flag for memory-backend-file. We already have "readonly" and
>> "share=". I'm having a hard time coming up with a good name that really
>> describes the subtle difference.
>>
>> 3) Glue behavior to the QEMU machine
>>
>>
>> For 3), one option would be to always open a COW file readonly (as Thiner
>> originally proposed). We could leave "-mem-path" behavior alone and only
>> change memory-backend-file semantics. If the COW file does *not* exist yet,
>> we would refuse to create the file like patch 2+3 do. Therefore, no
>> ftruncate() errors, and fallocate() errors would always happen.
> 
> I'm for (2).

(2) in the form we discussed here is wrong because "readonly" already 
expresses "open this file readonly", not anything else.
Daniel P. Berrangé Aug. 17, 2023, 1:46 p.m. UTC | #28
On Fri, Aug 11, 2023 at 09:00:54PM +0200, David Hildenbrand wrote:
> On 11.08.23 07:49, ThinerLogoer wrote:
> > At 2023-08-11 05:24:43, "Peter Xu" <peterx@redhat.com> wrote:
> > > On Fri, Aug 11, 2023 at 01:06:12AM +0800, ThinerLogoer wrote:
> > > > > I think we have the following options (there might be more)
> > > > > 
> > > > > 1) This patch.
> > > > > 
> > > > > 2) New flag for memory-backend-file. We already have "readonly" and
> > > > > "share=". I'm having a hard time coming up with a good name that really
> > > > > describes the subtle difference.
> > > > > 
> > > > > 3) Glue behavior to the QEMU machine
> > > > > 
> > > > 
> > > > 4) '-deny-private-discard' argv, or environment variable, or both
> > > 
> > > I'd personally vote for (2).  How about "fdperm"?  To describe when we want
> > > to use different rw permissions on the file (besides the access permission
> > > of the memory we already provided with "readonly"=XXX).  IIUC the only sane
> > > value will be ro/rw/default, where "default" should just use the same rw
> > > permission as the memory ("readonly"=XXX).
> > > 
> > > Would that be relatively clean and also work in this use case?
> > > 
> > > (the other thing I'd wish we don't have that fallback is, as long as we
> > > have any of that "fallback" we'll need to be compatible with it since
> > > then, and for ever...)
> > 
> > If it must be (2), I would vote (2) + (4), with (4) adjusting the default behavior of said `fdperm`.
> > Mainly because (private+discard) is itself not a good practice and (4) serves
> > as a good tool to help catch existing (private+discard) problems.
> 
> Instead of fdperm, maybe we could find a better name.
> 
> The man page of "open" says: The argument flags must include one of the
> following access modes: O_RDONLY, O_WRONLY, or O_RDWR.  These request
> opening the file read-only, write-only, or read/write, respectively.
> 
> So maybe something a bit more of a mouthful like "file-access-mode" would be
> better.

I don't think we should directly express the config in terms
of file-access-mode, as that's a low level impl detail. The
required file access mode is an artifact of the higher level
goal, of whether the RAM should be process private vs shared,
and whether we want QEMU to be able to create the backing
file or use a pre-created one.

IOW, we should express whether or not we want QEMU to try to
pre-create the file or not.


With regards,
Daniel
David Hildenbrand Aug. 17, 2023, 1:48 p.m. UTC | #29
On 17.08.23 15:46, Daniel P. Berrangé wrote:
> On Fri, Aug 11, 2023 at 09:00:54PM +0200, David Hildenbrand wrote:
>> On 11.08.23 07:49, ThinerLogoer wrote:
>>> At 2023-08-11 05:24:43, "Peter Xu" <peterx@redhat.com> wrote:
>>>> On Fri, Aug 11, 2023 at 01:06:12AM +0800, ThinerLogoer wrote:
>>>>>> I think we have the following options (there might be more)
>>>>>>
>>>>>> 1) This patch.
>>>>>>
>>>>>> 2) New flag for memory-backend-file. We already have "readonly" and
>>>>>> "share=". I'm having a hard time coming up with a good name that really
>>>>>> describes the subtle difference.
>>>>>>
>>>>>> 3) Glue behavior to the QEMU machine
>>>>>>
>>>>>
>>>>> 4) '-deny-private-discard' argv, or environment variable, or both
>>>>
>>>> I'd personally vote for (2).  How about "fdperm"?  To describe when we want
>>>> to use different rw permissions on the file (besides the access permission
>>>> of the memory we already provided with "readonly"=XXX).  IIUC the only sane
>>>> value will be ro/rw/default, where "default" should just use the same rw
>>>> permission as the memory ("readonly"=XXX).
>>>>
>>>> Would that be relatively clean and also work in this use case?
>>>>
>>>> (the other thing I'd wish we don't have that fallback is, as long as we
>>>> have any of that "fallback" we'll need to be compatible with it since
>>>> then, and for ever...)
>>>
>>> If it must be (2), I would vote (2) + (4), with (4) adjusting the default behavior of said `fdperm`.
>>> Mainly because (private+discard) is itself not a good practice and (4) serves
>>> as a good tool to help catch existing (private+discard) problems.
>>
>> Instead of fdperm, maybe we could find a better name.
>>
>> The man page of "open" says: The argument flags must include one of the
>> following access modes: O_RDONLY, O_WRONLY, or O_RDWR.  These request
>> opening the file read-only, write-only, or read/write, respectively.
>>
>> So maybe something a bit more of a mouthful like "file-access-mode" would be
>> better.
> 
> I don't think we should directly express the config in terms
> of file-access-mode, as that's a low level impl detail. The
> required file access mode is an artifact of the higher level
> goal, of whether the RAM should be process private vs shared,
> and whether we want QEMU to be able to create the backing
> file or use a pre-created one.

See my other mails "readonly" already expresses exactly that. So no need 
for "file-access-mode".

(and as far as I can see, no need for any other flags)
David Hildenbrand Aug. 17, 2023, 2:30 p.m. UTC | #30
> @Stefan, do you have any concern when we would do 1) ?
> 
> As far as I can tell, we have to set the nvdimm to "unarmed=on" either way:
> 
> +   "unarmed" controls the ACPI NFIT NVDIMM Region Mapping Structure "NVDIMM
> +   State Flags" Bit 3 indicating that the device is "unarmed" and cannot accept
> +   persistent writes. Linux guest drivers set the device to read-only when this
> +   bit is present. Set unarmed to on when the memdev has readonly=on.
> 
> So changing the behavior would not really break the nvdimm use case.

Looking into the details, this seems to be the right thing to do.

This is what I have now as patch description, that also highlights how libvirt
doesn't even make use of readonly=true.


 From 42f272ace68e0cd660a8448adb5aefb3b9dd7005 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Thu, 17 Aug 2023 12:09:07 +0200
Subject: [PATCH v2 2/4] backends/hostmem-file: Make share=off,readonly=on
  result in RAM instead of ROM

For now, "share=off,readonly=on" would always result in us opening the
file R/O and mmap'ing the opened file MAP_PRIVATE R/O -- effectively
turning it into ROM.

As documented, readonly only specifies that we want to open the file R/O:

     @readonly: if true, the backing file is opened read-only; if false,
         it is opened read-write.  (default: false)

Especially for VM templating, "share=off" is a common use case. However,
that use case is impossible with files that lack write permissions,
because "share=off,readonly=off" will fail opening the file, and
"share=off,readonly=on" will give us ROM instead of RAM.

With MAP_PRIVATE we can easily open the file R/O and mmap it R/W, to
turn it into COW RAM: private changes don't affect the file after all and
don't require write permissions.

This implies that we only get ROM now via "share=on,readonly=on".
"share=off,readonly=on" will give us RAM.

The sole users of ROM via memory-backend-file are R/O NVDIMMs. They
also require "unarmed=on" to be set for the nvdimm device.

With this change, R/O NVDIMMs will continue working even if
"share=off,readonly=on" was specified similar to when simply
providing ordinary RAM to the nvdimm device and setting "unarmed=on".

Note that libvirt seems to default for a "readonly" nvdimm to
* -object memory-backend-file,share=off (implying readonly=off)
* -device nvdimm,unarmed=on
And never seems to even set "readonly=on" for memory-backend-file. So
this change won't affect libvirt, they already always get COW RAM -- not
modifying the underlying file but opening it R/O.

If someone really wants ROM, they can just use "share=on,readonly=on".
After all, there is no relevant difference between a R/O MAP_SHARED
file mapping and a R/O MAP_PRIVATE file mapping.

Signed-off-by: David Hildenbrand <david@redhat.com>
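
The mechanism relied upon here is plain POSIX behavior: a MAP_PRIVATE
mapping may be PROT_WRITE even when the fd was opened O_RDONLY. A minimal
sketch (not the actual patch):

```
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* COW RAM from a read-only file: writes go to anonymous COW pages and
 * never reach the underlying file. */
static void *cow_ram_from_readonly_file(const char *path, size_t size)
{
    int fd = open(path, O_RDONLY);
    void *ram;

    if (fd < 0) {
        return MAP_FAILED;
    }
    ram = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping stays valid after close() */
    return ram;
}
```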
Daniel P. Berrangé Aug. 17, 2023, 2:37 p.m. UTC | #31
On Thu, Aug 17, 2023 at 04:30:16PM +0200, David Hildenbrand wrote:
> 
> > @Stefan, do you have any concern when we would do 1) ?
> > 
> > As far as I can tell, we have to set the nvdimm to "unarmed=on" either way:
> > 
> > +   "unarmed" controls the ACPI NFIT NVDIMM Region Mapping Structure "NVDIMM
> > +   State Flags" Bit 3 indicating that the device is "unarmed" and cannot accept
> > +   persistent writes. Linux guest drivers set the device to read-only when this
> > +   bit is present. Set unarmed to on when the memdev has readonly=on.
> > 
> > So changing the behavior would not really break the nvdimm use case.
> 
> Looking into the details, this seems to be the right thing to do.
> 
> This is what I have now as patch description, that also highlights how libvirt
> doesn't even make use of readonly=true.
> 
> 
> From 42f272ace68e0cd660a8448adb5aefb3b9dd7005 Mon Sep 17 00:00:00 2001
> From: David Hildenbrand <david@redhat.com>
> Date: Thu, 17 Aug 2023 12:09:07 +0200
> Subject: [PATCH v2 2/4] backends/hostmem-file: Make share=off,readonly=on
>  result in RAM instead of ROM
> 
> For now, "share=off,readonly=on" would always result in us opening the
> file R/O and mmap'ing the opened file MAP_PRIVATE R/O -- effectively
> turning it into ROM.
> 
> As documented, readonly only specifies that we want to open the file R/O:
> 
>     @readonly: if true, the backing file is opened read-only; if false,
>         it is opened read-write.  (default: false)
> 
> Especially for VM templating, "share=off" is a common use case. However,
> that use case is impossible with files that lack write permissions,
> because "share=off,readonly=off" will fail opening the file, and
> "share=off,readonly=on" will give us ROM instead of RAM.
> 
> With MAP_PRIVATE we can easily open the file R/O and mmap it R/W, to
> turn it into COW RAM: private changes don't affect the file after all and
> don't require write permissions.
> 
> This implies that we only get ROM now via "share=on,readonly=on".
> "share=off,readonly=on" will give us RAM.
> 
> The sole users of ROM via memory-backend-file are R/O NVDIMMs. They
> also require "unarmed=on" to be set for the nvdimm device.
> 
> With this change, R/O NVDIMMs will continue working even if
> "share=off,readonly=on" was specified similar to when simply
> providing ordinary RAM to the nvdimm device and setting "unarmed=on".
> 
> Note that libvirt seems to default for a "readonly" nvdimm to
> * -object memory-backend-file,share=off (implying readonly=off)
> * -device nvdimm,unarmed=on
> And never seems to even set "readonly=on" for memory-backend-file. So
> this change won't affect libvirt, they already always get COW RAM -- not
> modifying the underlying file but opening it R/O.
> 
> If someone really wants ROM, they can just use "share=on,readonly=on".
> After all, there is no relevant difference between a R/O MAP_SHARED
> file mapping and a R/O MAP_PRIVATE file mapping.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

This still leaves the patch having a warn_report() which I think is
undesirable to emit in a valid / supported use case.

With regards,
Daniel
David Hildenbrand Aug. 17, 2023, 2:37 p.m. UTC | #32
On 17.08.23 16:37, Daniel P. Berrangé wrote:
> On Thu, Aug 17, 2023 at 04:30:16PM +0200, David Hildenbrand wrote:
>>
>>> @Stefan, do you have any concern when we would do 1) ?
>>>
>>> As far as I can tell, we have to set the nvdimm to "unarmed=on" either way:
>>>
>>> +   "unarmed" controls the ACPI NFIT NVDIMM Region Mapping Structure "NVDIMM
>>> +   State Flags" Bit 3 indicating that the device is "unarmed" and cannot accept
>>> +   persistent writes. Linux guest drivers set the device to read-only when this
>>> +   bit is present. Set unarmed to on when the memdev has readonly=on.
>>>
>>> So changing the behavior would not really break the nvdimm use case.
>>
>> Looking into the details, this seems to be the right thing to do.
>>
>> This is what I have now as patch description, that also highlights how libvirt
>> doesn't even make use of readonly=true.
>>
>>
>>  From 42f272ace68e0cd660a8448adb5aefb3b9dd7005 Mon Sep 17 00:00:00 2001
>> From: David Hildenbrand <david@redhat.com>
>> Date: Thu, 17 Aug 2023 12:09:07 +0200
>> Subject: [PATCH v2 2/4] backends/hostmem-file: Make share=off,readonly=on
>>   result in RAM instead of ROM
>>
>> For now, "share=off,readonly=on" would always result in us opening the
>> file R/O and mmap'ing the opened file MAP_PRIVATE R/O -- effectively
>> turning it into ROM.
>>
>> As documented, readonly only specifies that we want to open the file R/O:
>>
>>      @readonly: if true, the backing file is opened read-only; if false,
>>          it is opened read-write.  (default: false)
>>
>> Especially for VM templating, "share=off" is a common use case. However,
>> that use case is impossible with files that lack write permissions,
>> because "share=off,readonly=off" will fail opening the file, and
>> "share=off,readonly=on" will give us ROM instead of RAM.
>>
>> With MAP_PRIVATE we can easily open the file R/O and mmap it R/W, to
>> turn it into COW RAM: private changes don't affect the file after all and
>> don't require write permissions.
>>
>> This implies that we only get ROM now via "share=on,readonly=on".
>> "share=off,readonly=on" will give us RAM.
>>
>> The sole users of ROM via memory-backend-file are R/O NVDIMMs. They
>> also require "unarmed=on" to be set for the nvdimm device.
>>
>> With this change, R/O NVDIMMs will continue working even if
>> "share=off,readonly=on" was specified similar to when simply
>> providing ordinary RAM to the nvdimm device and setting "unarmed=on".
>>
>> Note that libvirt seems to default for a "readonly" nvdimm to
>> * -object memory-backend-file,share=off (implying readonly=off)
>> * -device nvdimm,unarmed=on
>> And never seems to even set "readonly=on" for memory-backend-file. So
>> this change won't affect libvirt, they already always get COW RAM -- not
>> modifying the underlying file but opening it R/O.
>>
>> If someone really wants ROM, they can just use "share=on,readonly=on".
>> After all, there is no relevant difference between a R/O MAP_SHARED
>> file mapping and a R/O MAP_PRIVATE file mapping.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> This still leaves the patch having a warn_report() which I think is
> undesirable to emit in a valid / supported use case.

No warning.

Please elaborate on "valid/supported use case".
Peter Xu Aug. 17, 2023, 2:41 p.m. UTC | #33
On Thu, Aug 17, 2023 at 11:07:23AM +0200, David Hildenbrand wrote:
> @Stefan, see below on a R/O NVDIMM question.
> 
> We're discussing how to get a MAP_PRIVATE R/W mapping of a
> memory-backend-file running when using R/O files.
> 
> > 
> > This seems a good idea. I am good with the solution you proposed
> > here as well.
> 
> I was just going to get started working on that, when I realized
> something important:
> 
> 
> "@readonly: if true, the backing file is opened read-only; if false,
>  it is opened read-write.  (default: false)"
> 
> "@share: if false, the memory is private to QEMU; if true, it is
>  shared (default: false)"
> 
> So readonly is *all* about the file access mode already ... the mmap()
> parameters are just a side-effect of that. Adding a new
> "file-access-mode" or similar would be wrong.

Not exactly a side effect, IMHO.  IIUC it's simply because we didn't have a
need of using different perm for memory/file levels.  See the other patch
commit message from Stefan:

https://lore.kernel.org/all/20210104171320.575838-2-stefanha@redhat.com/

    There is currently no way to open(O_RDONLY) and mmap(PROT_READ) when
    [...]

So the goal at that time was to map/open both in RO mode, afaiu.  So one
parameter was enough at that time.  It isn't necessarily only about the
file permission or the memory permission; in reality/code it applies to
both for now, until we see a need to differentiate them for CoW purposes.

> 
> 
> Here are the combinations we have right now:
> 
> -object memory-backend-file,share=on,readonly=on
> 
>  -> Existing behavior: Open readonly, mmap readonly shared
>  -> Makes sense, mmap'ing readwrite would fail
> 
> -object memory-backend-file,share=on,readonly=off
> 
>  -> Existing behavior: Open readwrite, mmap readwrite shared
>  -> Mostly makes sense: why open a shared file R/W and not mmap it
>     R/W?
> 
> -object memory-backend-file,share=off,readonly=off
>  -> Existing behavior: Open readwrite, mmap readwrite private
>  -> Mostly makes sense: why open a file R/W and not map it R/W (even if
>     private)?
> 
> -object memory-backend-file,share=off,readonly=on
>  -> Existing behavior: Open readonly, mmap readonly private
>  -> That's the problematic one
> 
> 
> So for your use case (VM templating using a readonly file), you
> would actually want to use:
> 
> -object memory-backend-file,share=off,readonly=on
> 
> BUT, have the mmap be writable (instead of currently readonly).
> 
> Assuming we would change the current behavior, what if someone would
> specify:
> 
> -object memory-backend-file,readonly=on
> 
> (because the default is share=off ...) and using it for a R/O NVDIMM,
> where we expect any write access to fail.

It will (as expected), right?  fallocate() will just fail on the RO files.

To be explicit, maybe we should just remember the readonly attribute for a
ramblock and then we can even provide a more meaningful error log, like:

diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 3df73542e1..f8c11c8d54 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -3424,9 +3424,13 @@ int qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque)
 int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
 {
     int ret = -1;
-
     uint8_t *host_startaddr = rb->host + start;
 
+    if (rb->flags & RAM_READONLY) {
+        // more meaningful error reports (though right now no errp passed in..)
+        return -EACCES;
+    }
+

I see that Stefan even raised this question in the commit log:

    No new RAMBlock flag is introduced for read-only because it's unclear
    whether RAMBlocks need to know that they are read-only. Pass a bool
    readonly argument instead.

Right now failing at fallocate() isn't that bad to me.

> 
> 
> But let's look at the commit that added the "readonly" parameter:
> 
> commit 86635aa4e9d627d5142b81c57a33dd1f36627d07
> Author: Stefan Hajnoczi <stefanha@redhat.com>
> Date:   Mon Jan 4 17:13:19 2021 +0000
> 
>     hostmem-file: add readonly=on|off option
>     Let -object memory-backend-file work on read-only files when the
>     readonly=on option is given. This can be used to share the contents of a
>     file between multiple guests while preventing them from consuming
>     Copy-on-Write memory if guests dirty the pages, for example.
> 
> That was part of
> 
> https://lore.kernel.org/all/20210104171320.575838-3-stefanha@redhat.com/T/#m712f995e6dcfdde433958bae5095b145dd0ee640
> 
> From what I understand, for NVDIMMs we always use
> "-object memory-backend-file,share=on", even when we want a
> readonly NVDIMM.
> 
> 
> So we have two options:
> 
> 1) Change the current behavior of -object memory-backend-file,share=off,readonly=on:
> 
> -> Open the file r/o but mmap it writable
> 
> 2) Add a new property to configure the mmap accessibility. Not a big fan of that.
> 
> 
> @Stefan, do you have any concern when we would do 1) ?
> 
> As far as I can tell, we have to set the nvdimm to "unarmed=on" either way:
> 
> +   "unarmed" controls the ACPI NFIT NVDIMM Region Mapping Structure "NVDIMM
> +   State Flags" Bit 3 indicating that the device is "unarmed" and cannot accept
> +   persistent writes. Linux guest drivers set the device to read-only when this
> +   bit is present. Set unarmed to on when the memdev has readonly=on.
> 
> So changing the behavior would not really break the nvdimm use case.
> 
> Further, we could warn in nvdimm code when we stumble over this configuration with
> unarmed=on.

I'll leave the nvdimm-specific question to Stefan, but won't this also map
any readonly=on memory backends (besides nvdimm) with the memory
writable, which is still unexpected?
Daniel P. Berrangé Aug. 17, 2023, 2:45 p.m. UTC | #34
On Thu, Aug 17, 2023 at 04:37:52PM +0200, David Hildenbrand wrote:
> On 17.08.23 16:37, Daniel P. Berrangé wrote:
> > On Thu, Aug 17, 2023 at 04:30:16PM +0200, David Hildenbrand wrote:
> > > 
> > > > @Stefan, do you have any concern when we would do 1) ?
> > > > 
> > > > As far as I can tell, we have to set the nvdimm to "unarmed=on" either way:
> > > > 
> > > > +   "unarmed" controls the ACPI NFIT NVDIMM Region Mapping Structure "NVDIMM
> > > > +   State Flags" Bit 3 indicating that the device is "unarmed" and cannot accept
> > > > +   persistent writes. Linux guest drivers set the device to read-only when this
> > > > +   bit is present. Set unarmed to on when the memdev has readonly=on.
> > > > 
> > > > So changing the behavior would not really break the nvdimm use case.
> > > 
> > > Looking into the details, this seems to be the right thing to do.
> > > 
> > > This is what I have now as patch description, that also highlights how libvirt
> > > doesn't even make use of readonly=true.
> > > 
> > > 
> > >  From 42f272ace68e0cd660a8448adb5aefb3b9dd7005 Mon Sep 17 00:00:00 2001
> > > From: David Hildenbrand <david@redhat.com>
> > > Date: Thu, 17 Aug 2023 12:09:07 +0200
> > > Subject: [PATCH v2 2/4] backends/hostmem-file: Make share=off,readonly=on
> > >   result in RAM instead of ROM
> > > 
> > > For now, "share=off,readonly=on" would always result in us opening the
> > > file R/O and mmap'ing the opened file MAP_PRIVATE R/O -- effectively
> > > turning it into ROM.
> > > 
> > > As documented, readonly only specifies that we want to open the file R/O:
> > > 
> > >      @readonly: if true, the backing file is opened read-only; if false,
> > >          it is opened read-write.  (default: false)
> > > 
> > > Especially for VM templating, "share=off" is a common use case. However,
> > > that use case is impossible with files that lack write permissions,
> > > because "share=off,readonly=off" will fail opening the file, and
> > > "share=off,readonly=on" will give us ROM instead of RAM.
> > > 
> > > With MAP_PRIVATE we can easily open the file R/O and mmap it R/W, to
> > > turn it into COW RAM: private changes don't affect the file after all and
> > > don't require write permissions.
> > > 
> > > This implies that we only get ROM now via "share=on,readonly=on".
> > > "share=off,readonly=on" will give us RAM.
> > > 
> > > The sole users of ROM via memory-backend-file are R/O NVDIMMs. They
> > > also require "unarmed=on" to be set for the nvdimm device.
> > > 
> > > With this change, R/O NVDIMMs will continue working even if
> > > "share=off,readonly=on" was specified similar to when simply
> > > providing ordinary RAM to the nvdimm device and setting "unarmed=on".
> > > 
> > > Note that libvirt seems to default for a "readonly" nvdimm to
> > > * -object memory-backend-file,share=off (implying readonly=off)
> > > * -device nvdimm,unarmed=on
> > > And never seems to even set "readonly=on" for memory-backend-file. So
> > > this change won't affect libvirt, they already always get COW RAM -- not
> > > modifying the underlying file but opening it R/O.
> > > 
> > > If someone really wants ROM, they can just use "share=on,readonly=on".
> > > After all, there is no relevant difference between a R/O MAP_SHARED
> > > file mapping and a R/O MAP_PRIVATE file mapping.
> > > 
> > > Signed-off-by: David Hildenbrand <david@redhat.com>
> > 
> > This still leaves the patch having a warn_report() which I think is
> > undesirable to emit in a valid / supported use case.
> 
> No warning.
> 
> Please elaborate on "valid/supported use case".

The usage scenario that this patch aims to enable. IIUC, it will follow
the codepath that leads to the warn_report() call in this patch.

With regards,
Daniel
David Hildenbrand Aug. 17, 2023, 2:47 p.m. UTC | #35
On 17.08.23 16:45, Daniel P. Berrangé wrote:
> On Thu, Aug 17, 2023 at 04:37:52PM +0200, David Hildenbrand wrote:
>> On 17.08.23 16:37, Daniel P. Berrangé wrote:
>>> On Thu, Aug 17, 2023 at 04:30:16PM +0200, David Hildenbrand wrote:
>>>>
>>>>> @Stefan, do you have any concern when we would do 1) ?
>>>>>
>>>>> As far as I can tell, we have to set the nvdimm to "unarmed=on" either way:
>>>>>
>>>>> +   "unarmed" controls the ACPI NFIT NVDIMM Region Mapping Structure "NVDIMM
>>>>> +   State Flags" Bit 3 indicating that the device is "unarmed" and cannot accept
>>>>> +   persistent writes. Linux guest drivers set the device to read-only when this
>>>>> +   bit is present. Set unarmed to on when the memdev has readonly=on.
>>>>>
>>>>> So changing the behavior would not really break the nvdimm use case.
>>>>
>>>> Looking into the details, this seems to be the right thing to do.
>>>>
>>>> This is what I have now as patch description, that also highlights how libvirt
>>>> doesn't even make use of readonly=true.
>>>>
>>>>
>>>>   From 42f272ace68e0cd660a8448adb5aefb3b9dd7005 Mon Sep 17 00:00:00 2001
>>>> From: David Hildenbrand <david@redhat.com>
>>>> Date: Thu, 17 Aug 2023 12:09:07 +0200
>>>> Subject: [PATCH v2 2/4] backends/hostmem-file: Make share=off,readonly=on
>>>>    result in RAM instead of ROM
>>>>
>>>> For now, "share=off,readonly=on" would always result in us opening the
>>>> file R/O and mmap'ing the opened file MAP_PRIVATE R/O -- effectively
>>>> turning it into ROM.
>>>>
>>>> As documented, readonly only specifies that we want to open the file R/O:
>>>>
>>>>       @readonly: if true, the backing file is opened read-only; if false,
>>>>           it is opened read-write.  (default: false)
>>>>
>>>> Especially for VM templating, "share=off" is a common use case. However,
>>>> that use case is impossible with files that lack write permissions,
>>>> because "share=off,readonly=off" will fail opening the file, and
>>>> "share=off,readonly=on" will give us ROM instead of RAM.
>>>>
>>>> With MAP_PRIVATE we can easily open the file R/O and mmap it R/W, to
>>>> turn it into COW RAM: private changes don't affect the file after all and
>>>> don't require write permissions.
>>>>
>>>> This implies that we only get ROM now via "share=on,readonly=on".
>>>> "share=off,readonly=on" will give us RAM.
>>>>
>>>> The sole users of ROM via memory-backend-file are R/O NVDIMMs. They
>>>> also require "unarmed=on" to be set for the nvdimm device.
>>>>
>>>> With this change, R/O NVDIMMs will continue working even if
>>>> "share=off,readonly=on" was specified similar to when simply
>>>> providing ordinary RAM to the nvdimm device and setting "unarmed=on".
>>>>
>>>> Note that for a "readonly" nvdimm, libvirt seems to default to
>>>> * -object memory-backend-file,share=off (implying readonly=off)
>>>> * -device nvdimm,unarmed=on
>>>> and never seems to set "readonly=on" for memory-backend-file at all. So
>>>> this change won't affect libvirt; they already always get COW RAM -- not
>>>> modifying the underlying file but opening it R/O.
>>>>
>>>> If someone really wants ROM, they can just use "share=on,readonly=on".
>>>> After all, there is no relevant difference between a R/O MAP_SHARED
>>>> file mapping and a R/O MAP_PRIVATE file mapping.
>>>>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>
>>> This still leaves the patch having a warn_report(), which I think is
>>> undesirable to emit in a valid / supported use case.
>>
>> No warning.
>>
>> Please elaborate on "valid/supported use case".
> 
> The usage scenario that this patch aims to enable. IIUC, it will follow
> the codepath that leads to the warn_report() call in this patch.

It shouldn't, but I will double check!
David Hildenbrand Aug. 17, 2023, 3:02 p.m. UTC | #36
On 17.08.23 16:41, Peter Xu wrote:
> On Thu, Aug 17, 2023 at 11:07:23AM +0200, David Hildenbrand wrote:
>> @Stefan, see below on a R/O NVDIMM question.
>>
>> We're discussing how to get a MAP_PRIVATE R/W mapping of a
>> memory-backend-file working when using R/O files.
>>
>>>
>>> This seems a good idea. I am good with the solution you proposed
>>> here as well.
>>
>> I was just going to get started working on that, when I realized
>> something important:
>>
>>
>> "@readonly: if true, the backing file is opened read-only; if false,
>>   it is opened read-write.  (default: false)"
>>
>> "@share: if false, the memory is private to QEMU; if true, it is
>>   shared (default: false)"
>>
>> So readonly is *all* about the file access mode already ... the mmap()
>> parameters are just a side-effect of that. Adding a new
>> "file-access-mode" or similar would be wrong.
> 
> Not exactly a side effect, IMHO.  IIUC it's simply because we didn't have a
> need to use different permissions at the memory/file levels.  See the other patch
> commit message from Stefan:
> 
> https://lore.kernel.org/all/20210104171320.575838-2-stefanha@redhat.com/
> 
>      There is currently no way to open(O_RDONLY) and mmap(PROT_READ) when
>      [...]
> 
> So the goal at that time was to map/open both in RO mode, afaiu.  So one

Good point. And you can have both with "share=on,readonly=on" ever since 
Stefan introduced that flag, which that patch enabled.

Stefan didn't go into detail describing the required interactions 
between MAP_PRIVATE / MAP_SHARED.

[...]

>> -object memory-backend-file,share=off,readonly=on
>>
>> BUT, have the mmap be writable (instead of currently readonly).
>>
>> Assuming we changed the current behavior, what if someone
>> specified:
>>
>> -object memory-backend-file,readonly=on
>>
>> (because the default is share=off ...) and used it for a R/O NVDIMM,
>> where we expect any write access to fail.
> 
> It will (as expected), right?  fallocate() will just fail on the RO files.

Yes, if the file was opened R/O, any fallocate() will fail.
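
For illustration, a minimal standalone sketch of that failure mode (mine, not 
part of the patch; "guest-ram.img" is just a placeholder, and the error codes 
are per the Linux ftruncate(2)/fallocate(2) man pages):

```
/* Both resize/preallocate operations fail on an fd opened O_RDONLY. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("guest-ram.img", O_RDONLY); /* placeholder backing file */

    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (ftruncate(fd, 1 << 20) < 0) {
        /* Fails with EINVAL: the fd is not open for writing. */
        fprintf(stderr, "ftruncate: %s\n", strerror(errno));
    }
    if (fallocate(fd, 0, 0, 1 << 20) < 0) {
        /* Fails with EBADF: the fd is not opened for writing. */
        fprintf(stderr, "fallocate: %s\n", strerror(errno));
    }
    close(fd);
    return 0;
}
```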

> 
> To be explicit, maybe we should just remember the readonly attribute for a
> ramblock and then we can even provide a more meaningful error log, like:

Hah, I have a patch that adds RAM_READONLY :). But it expresses 
"mmapped readonly", not "file opened readonly".

> 
> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> index 3df73542e1..f8c11c8d54 100644
> --- a/softmmu/physmem.c
> +++ b/softmmu/physmem.c
> @@ -3424,9 +3424,13 @@ int qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque)
>   int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
>   {
>       int ret = -1;
> -
>       uint8_t *host_startaddr = rb->host + start;
>   
> +    if (rb->flags & RAM_READONLY) {
> +        // more meaningful error reports (though right now no errp passed in..)
> +        return -EACCES;
> +    }

Remembering whether a file is opened R/O might be reasonable to improve 
the error messages.

[...]

>>
>> But let's look at the commit that added the "readonly" parameter:
>>
>> commit 86635aa4e9d627d5142b81c57a33dd1f36627d07
>> Author: Stefan Hajnoczi <stefanha@redhat.com>
>> Date:   Mon Jan 4 17:13:19 2021 +0000
>>
>>      hostmem-file: add readonly=on|off option
>>      Let -object memory-backend-file work on read-only files when the
>>      readonly=on option is given. This can be used to share the contents of a
>>      file between multiple guests while preventing them from consuming
>>      Copy-on-Write memory if guests dirty the pages, for example.
>>
>> That was part of
>>
>> https://lore.kernel.org/all/20210104171320.575838-3-stefanha@redhat.com/T/#m712f995e6dcfdde433958bae5095b145dd0ee640
>>
>>  From what I understand, for NVDIMMs we always use
>> "-object memory-backend-file,share=on", even when we want a
>> readonly NVDIMM.
>>
>>
>> So we have two options:
>>
>> 1) Change the current behavior of -object memory-backend-file,share=off,readonly=on:
>>
>> -> Open the file r/o but mmap it writable
>>
>> 2) Add a new property to configure the mmap accessibility. Not a big fan of that.
>>
>>
>> @Stefan, do you have any concerns if we do 1)?
>>
>> As far as I can tell, we have to set the nvdimm to "unarmed=on" either way:
>>
>> +   "unarmed" controls the ACPI NFIT NVDIMM Region Mapping Structure "NVDIMM
>> +   State Flags" Bit 3 indicating that the device is "unarmed" and cannot accept
>> +   persistent writes. Linux guest drivers set the device to read-only when this
>> +   bit is present. Set unarmed to on when the memdev has readonly=on.
>>
>> So changing the behavior would not really break the nvdimm use case.
>>
>> Further, we could warn in nvdimm code when we stumble over this configuration with
>> unarmed=on.
> 
> I'll leave the nvdimm-specific question to Stefan, but won't this also map
> any readonly=on memory backend (besides nvdimm) with the memory
> writable, which is still unexpected?

Note that libvirt *never* sets readonly=on for R/O NVDIMMs, and R/O 
NVDIMMs are really the only use case.

I don't think "open file read only" raises the expectation "map it read 
only". It certainly does for shared mappings, but not for private mappings.

To me, the expectation of "open the file read only" is that the file 
won't be modified, ever.

Libvirt should probably be taught to use "share=on,readonly=on" when 
dealing with R/O NVDIMMs, instead of "share=off" (implying readonly=off) 
to create a writable COW mapping.
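
To make the difference concrete, here is a minimal sketch (mine, not from any 
patch) of why opening R/O does not imply mapping R/O for private mappings: the 
kernel only requires write access to the file for writable MAP_SHARED 
mappings. "template-ram.img" is a placeholder, assumed to be at least one page 
in size:

```
/* A writable MAP_PRIVATE mapping of a file opened O_RDONLY succeeds;
 * the equivalent MAP_SHARED mapping fails with EACCES. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("template-ram.img", O_RDONLY); /* placeholder file */
    size_t len = 4096; /* file assumed >= one page */

    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Writable shared mapping: rejected, it could modify the R/O file. */
    if (mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0)
        == MAP_FAILED) {
        fprintf(stderr, "MAP_SHARED R/W: %s\n", strerror(errno)); /* EACCES */
    }

    /* Writable private mapping: allowed, writes go to anonymous COW pages. */
    unsigned char *priv = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE, fd, 0);
    if (priv == MAP_FAILED) {
        perror("MAP_PRIVATE R/W");
        close(fd);
        return 1;
    }
    priv[0] ^= 0xff; /* modifies the private copy, never the file */

    munmap(priv, len);
    close(fd);
    return 0;
}
```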
Stefan Hajnoczi Aug. 17, 2023, 3:13 p.m. UTC | #37
On Thu, 17 Aug 2023 at 05:08, David Hildenbrand <david@redhat.com> wrote:
>
> @Stefan, see below on a R/O NVDIMM question.
>
> We're discussing how to get a MAP_PRIVATE R/W mapping of a
> memory-backend-file working when using R/O files.
>
> >
> > This seems a good idea. I am good with the solution you proposed
> > here as well.
>
> I was just going to get started working on that, when I realized
> something important:
>
>
> "@readonly: if true, the backing file is opened read-only; if false,
>   it is opened read-write.  (default: false)"
>
> "@share: if false, the memory is private to QEMU; if true, it is
>   shared (default: false)"
>
> So readonly is *all* about the file access mode already ... the mmap()
> parameters are just a side-effect of that. Adding a new
> "file-access-mode" or similar would be wrong.
>
>
> Here are the combinations we have right now:
>
> -object memory-backend-file,share=on,readonly=on
>
>   -> Existing behavior: Open readonly, mmap readonly shared
>   -> Makes sense, mmap'ing readwrite would fail
>
> -object memory-backend-file,share=on,readonly=off
>
>   -> Existing behavior: Open readwrite, mmap readwrite shared
>   -> Mostly makes sense: why open a shared file R/W and not mmap it
>      R/W?
>
> -object memory-backend-file,share=off,readonly=off
>   -> Existing behavior: Open readwrite, mmap readwrite private
>   -> Mostly makes sense: why open a file R/W and not map it R/W (even if
>      private)?
>
> -object memory-backend-file,share=off,readonly=on
>   -> Existing behavior: Open readonly, mmap readonly private
>   -> That's the problematic one
>
>
> So for your use case (VM templating using a readonly file), you
> would actually want to use:
>
> -object memory-backend-file,share=off,readonly=on
>
> BUT, have the mmap be writable (instead of currently readonly).
>
> Assuming we changed the current behavior, what if someone
> specified:
>
> -object memory-backend-file,readonly=on
>
> (because the default is share=off ...) and used it for a R/O NVDIMM,
> where we expect any write access to fail.
>
>
> But let's look at the commit that added the "readonly" parameter:
>
> commit 86635aa4e9d627d5142b81c57a33dd1f36627d07
> Author: Stefan Hajnoczi <stefanha@redhat.com>
> Date:   Mon Jan 4 17:13:19 2021 +0000
>
>      hostmem-file: add readonly=on|off option
>
>      Let -object memory-backend-file work on read-only files when the
>      readonly=on option is given. This can be used to share the contents of a
>      file between multiple guests while preventing them from consuming
>      Copy-on-Write memory if guests dirty the pages, for example.
>
> That was part of
>
> https://lore.kernel.org/all/20210104171320.575838-3-stefanha@redhat.com/T/#m712f995e6dcfdde433958bae5095b145dd0ee640
>
>  From what I understand, for NVDIMMs we always use
> "-object memory-backend-file,share=on", even when we want a
> readonly NVDIMM.
>
>
> So we have two options:
>
> 1) Change the current behavior of -object memory-backend-file,share=off,readonly=on:
>
> -> Open the file r/o but mmap it writable

Commit 86635aa4e9d627d5142b81c57a33dd1f36627d07 mentions that we don't
want guests to be able to dirty pages on the host. The change you're
proposing would not protect against guests that dirty the memory.

I don't know how important that requirement was (that commit was a
request from Kata Containers).

>
> 2) Add a new property to configure the mmap accessibility. Not a big fan of that.
>
>
> @Stefan, do you have any concerns if we do 1)?
>
> As far as I can tell, we have to set the nvdimm to "unarmed=on" either way:
>
> +   "unarmed" controls the ACPI NFIT NVDIMM Region Mapping Structure "NVDIMM
> +   State Flags" Bit 3 indicating that the device is "unarmed" and cannot accept
> +   persistent writes. Linux guest drivers set the device to read-only when this
> +   bit is present. Set unarmed to on when the memdev has readonly=on.
>
> So changing the behavior would not really break the nvdimm use case.
>
> Further, we could warn in nvdimm code when we stumble over this configuration with
> unarmed=on.
>
> --
> Cheers,
>
> David / dhildenb
>
>
David Hildenbrand Aug. 17, 2023, 3:15 p.m. UTC | #38
On 17.08.23 17:13, Stefan Hajnoczi wrote:
> On Thu, 17 Aug 2023 at 05:08, David Hildenbrand <david@redhat.com> wrote:
>>
>> @Stefan, see below on a R/O NVDIMM question.
>>
>> We're discussing how to get a MAP_PRIVATE R/W mapping of a
>> memory-backend-file working when using R/O files.
>>
>>>
>>> This seems a good idea. I am good with the solution you proposed
>>> here as well.
>>
>> I was just going to get started working on that, when I realized
>> something important:
>>
>>
>> "@readonly: if true, the backing file is opened read-only; if false,
>>    it is opened read-write.  (default: false)"
>>
>> "@share: if false, the memory is private to QEMU; if true, it is
>>    shared (default: false)"
>>
>> So readonly is *all* about the file access mode already ... the mmap()
>> parameters are just a side-effect of that. Adding a new
>> "file-access-mode" or similar would be wrong.
>>
>>
>> Here are the combinations we have right now:
>>
>> -object memory-backend-file,share=on,readonly=on
>>
>>    -> Existing behavior: Open readonly, mmap readonly shared
>>    -> Makes sense, mmap'ing readwrite would fail
>>
>> -object memory-backend-file,share=on,readonly=off
>>
>>    -> Existing behavior: Open readwrite, mmap readwrite shared
>>    -> Mostly makes sense: why open a shared file R/W and not mmap it
>>       R/W?
>>
>> -object memory-backend-file,share=off,readonly=off
>>    -> Existing behavior: Open readwrite, mmap readwrite private
>>    -> Mostly makes sense: why open a file R/W and not map it R/W (even if
>>       private)?
>>
>> -object memory-backend-file,share=off,readonly=on
>>    -> Existing behavior: Open readonly, mmap readonly private
>>    -> That's the problematic one
>>
>>
>> So for your use case (VM templating using a readonly file), you
>> would actually want to use:
>>
>> -object memory-backend-file,share=off,readonly=on
>>
>> BUT, have the mmap be writable (instead of currently readonly).
>>
>> Assuming we changed the current behavior, what if someone
>> specified:
>>
>> -object memory-backend-file,readonly=on
>>
>> (because the default is share=off ...) and used it for a R/O NVDIMM,
>> where we expect any write access to fail.
>>
>>
>> But let's look at the commit that added the "readonly" parameter:
>>
>> commit 86635aa4e9d627d5142b81c57a33dd1f36627d07
>> Author: Stefan Hajnoczi <stefanha@redhat.com>
>> Date:   Mon Jan 4 17:13:19 2021 +0000
>>
>>       hostmem-file: add readonly=on|off option
>>
>>       Let -object memory-backend-file work on read-only files when the
>>       readonly=on option is given. This can be used to share the contents of a
>>       file between multiple guests while preventing them from consuming
>>       Copy-on-Write memory if guests dirty the pages, for example.
>>
>> That was part of
>>
>> https://lore.kernel.org/all/20210104171320.575838-3-stefanha@redhat.com/T/#m712f995e6dcfdde433958bae5095b145dd0ee640
>>
>>   From what I understand, for NVDIMMs we always use
>> "-object memory-backend-file,share=on", even when we want a
>> readonly NVDIMM.
>>
>>
>> So we have two options:
>>
>> 1) Change the current behavior of -object memory-backend-file,share=off,readonly=on:
>>
>> -> Open the file r/o but mmap it writable
> 
> Commit 86635aa4e9d627d5142b81c57a33dd1f36627d07 mentions that we don't
> want guests to be able to dirty pages on the host. The change you're
> proposing would not protect against guests that dirty the memory.

The guest could write memory but not modify the file. Only with 
"share=off,readonly=on" of course, not with "share=on,readonly=on".

> 
> I don't know how important that requirement was (that commit was a
> request from Kata Containers).

Let me take a look if Kata passes "share=on,readonly=on" or 
"share=off,readonly=off".

Thanks!
David Hildenbrand Aug. 17, 2023, 3:25 p.m. UTC | #39
>> Commit 86635aa4e9d627d5142b81c57a33dd1f36627d07 mentions that we don't
>> want guests to be able to dirty pages on the host. The change you're
>> proposing would not protect against guests that dirty the memory.
> 
> The guest could write memory but not modify the file. Only with
> "share=off,readonly=on" of course, not with "share=on,readonly=on".
> 
>>
>> I don't know how important that requirement was (that commit was a
>> request from Kata Containers).
> 
> Let me take a look if Kata passes "share=on,readonly=on" or
> "share=off,readonly=off".
> 

At least their R/O DIMM test generates:

-device nvdimm,id=nv0,memdev=mem0,unarmed=on -object 
memory-backend-file,id=mem0,mem-path=/root,size=65536,readonly=on

So they are assuming that readonly with the implied share=off creates ROM. If 
only they had specified share=on ...

One way would be letting the R/O nvdimm set the MR container to 
readonly. Then the guest also shouldn't be able to modify that memory. 
Let me think about that.
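
Roughly the direction I'm thinking of, as a sketch only; 
memory_region_set_readonly() exists, but the query helper and the exact hook 
in the nvdimm realize path are placeholders here:

```
/* Sketch: let the R/O nvdimm mark its memory region container readonly,
 * so the guest cannot modify the memory even when the backend mmap'ed
 * the file writable (MAP_PRIVATE). "mr" (the device's region container)
 * and "backend_readonly" (however we query the backend) are placeholders. */
if (nvdimm->unarmed && backend_readonly) {
    memory_region_set_readonly(mr, true);
}
```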
Peter Xu Aug. 17, 2023, 3:31 p.m. UTC | #40
On Thu, Aug 17, 2023 at 05:15:52PM +0200, David Hildenbrand wrote:
> > I don't know how important that requirement was (that commit was a
> > request from Kata Containers).
> 
> Let me take a look if Kata passes "share=on,readonly=on" or
> "share=off,readonly=off".

The question is whether it's good enough to change the semantics as long
as we keep the original purposes for which those flags were introduced
working (nvdimm, kata, etc.), since anything introduced in qemu can
potentially be used elsewhere too.

David, could you share your concerns about simply "having a new flag, while
keeping the behavior of all existing flags unchanged"?  You mentioned it's
not wanted, but I haven't yet seen the reasoning behind that.

Thanks,
David Hildenbrand Aug. 17, 2023, 3:43 p.m. UTC | #41
On 17.08.23 17:31, Peter Xu wrote:
> On Thu, Aug 17, 2023 at 05:15:52PM +0200, David Hildenbrand wrote:
>>> I don't know how important that requirement was (that commit was a
>>> request from Kata Containers).
>>
>> Let me take a look if Kata passes "share=on,readonly=on" or
>> "share=off,readonly=off".
> 
> The question is whether it's good enough if we change the semantics as long
> as we guarantee the original purposes of when introducing those flags would
> be enough (nvdimm, kata, etc.), as anything introduced in qemu can
> potentially be used elsewhere too.
> 

Right. And we have to keep the R/O NVDIMM use case working as-is, apparently.

> David, could you share your concerns about simply "having a new flag, while
> keeping the behavior of all existing flags unchanged"?  You mentioned it's
> not wanted, but I haven't yet seen the reasoning behind that.

I'm really having a hard time coming up with something reasonable to 
configure this. And apparently, we only want to configure 
"share=off,readonly=on".

The best I could come up with was "readonly=file-only", but I'm also not too 
happy with that. It doesn't make any sense for "share=on".

So if we could just let the memory backend do something reasonable and 
have the single existing consumer (R/O NVDIMM) handle the changed case 
explicitly internally, that turns out much cleaner.

IMHO, the user shouldn't have to worry about "how is it mmap'ed". "share" 
and "readonly" express the memory semantics and the file semantics.

An R/O NVDIMM, on the other hand (unarmed=on), knows that it is R/O, and the 
user configured exactly that. So maybe we can simply expose it to the 
system as readonly by marking the memory region container as being a ROM.

I have not given up yet, but this case is starting to get annoying.
Igor Mammedov Aug. 21, 2023, 12:20 p.m. UTC | #42
On Fri, 11 Aug 2023 17:26:24 +0200
David Hildenbrand <david@redhat.com> wrote:

[...]
> Nowadays, "-mem-path" always defaults to MAP_PRIVATE. For the original 
> hugetlb use case, it's still good enough. For anything else, I'm not so 
> sure.

We can deprecate -mem-path and direct users to an explicitly
defined memory backend, with a legacy-compatible example.
That way users won't keep coming back trying to fix it or
inventing options to alter its behavior, when an explicit
'-machine memory-backend=foo' can do the job.
David Hildenbrand Aug. 22, 2023, 1:35 p.m. UTC | #43
On 12.08.23 08:21, ThinerLogoer wrote:
> At 2023-08-11 22:31:36, "Peter Xu" <peterx@redhat.com> wrote:
>> On Fri, Aug 11, 2023 at 01:49:52PM +0800, ThinerLogoer wrote:
>>> At 2023-08-11 05:24:43, "Peter Xu" <peterx@redhat.com> wrote:
>>>> On Fri, Aug 11, 2023 at 01:06:12AM +0800, ThinerLogoer wrote:
>>>>>> I think we have the following options (there might be more)
>>>>>>
>>>>>> 1) This patch.
>>>>>>
>>>>>> 2) New flag for memory-backend-file. We already have "readonly" and
>>>>>> "share=". I'm having a hard time coming up with a good name that really
>>>>>> describes the subtle difference.
>>>>>>
>>>>>> 3) Glue behavior to the QEMU machine
>>>>>>
>>>>>
>>>>> 4) '-deny-private-discard' argv, or environment variable, or both
>>>>
>>>> I'd personally vote for (2).  How about "fdperm"?  To describe when we want
>>>> to use different rw permissions on the file (besides the access permission
>>>> of the memory we already provided with "readonly"=XXX).  IIUC the only sane
>>>> values will be ro/rw/default, where "default" should just use the same rw
>>>> permission as the memory ("readonly"=XXX).
>>>>
>>>> Would that be relatively clean and also work in this use case?
>>>>
>>>> (the other reason I wish we didn't have that fallback: as long as we
>>>> have any such "fallback", we'll need to stay compatible with it from
>>>> then on, forever...)
>>>
>>> If it must be (2), I would vote for (2) + (4), with (4) adjusting the default behavior of said `fdperm`.
>>> Mainly because (private+discard) is itself not a good practice, and (4) serves
>>> as a good tool to help catch existing (private+discard) problems.
>>>
>>> Actually (readonly+private) is more reasonable than (private+discard), so I
>>> want at least some room for a default (readonly+private) behavior.
>>
>> Purely for discussion purposes: I think maybe someday private+discard
>> could work.  IIUC what we're missing is a syscall interface to install a
>> zero page for a MAP_PRIVATE mapping, atomically freeing what's underneath:
>> it seems neither punching a hole nor DONTNEED will suffice here.  Having
>> zero pages involved in file mappings would just be another problem, at least.
>>
>>>
>>> Also, in my case I kind of have to use "-mem-path" despite it being considered
>>> close to deprecated. Only with it can I avoid needing to know the memory
>>> backend before migration. Actually there seems to be no working after-migration
>>> setup of "-object memory-backend-file,... -machine q35,mem=..." that can match a
>>> before-migration setup of "-machine q35" (specifying nothing). Therefore
>>> I must make a plan and choose a migration method BEFORE I boot the
>>> machine and prepare to migrate, reducing the operational freedom.
>>> Considering that, I have to use "-mem-path", which keeps that freedom but
>>> has no configurable arguments, so I have to rely on the default config.
>>>
>>> Are there any "-object memory-backend-file..." setup equivalent to "-machine q35"
>>> that can migrate from and to each other? If there is, I want to try it out.
>>> By the way "-object memory-backend-file,id=pc.ram" has just been killed by an earlier
>>> commit.
>>
>> I'm actually not familiar enough with the interfaces here, but I just
>> checked the man page; would this work for you, together with option (2)?
>>
>>         memory-backend='id'
>>                 An alternative to legacy -mem-path and mem-prealloc options.  Allows to use a memory backend as main RAM.
>>
>>                 For example:
>>
>>                 -object memory-backend-file,id=pc.ram,size=512M,mem-path=/hugetlbfs,prealloc=on,share=on
>>                 -machine memory-backend=pc.ram
>>                 -m 512M
>>
> 
> Wait ... I thought it should not work but it did work today. How bad am I at reading
> the correct part of the documentation ...
> 
> '-machine q35 -m 512M' is equivalent to '-object
> memory-backend-file,id=pc.ram,size=512M
> -machine q35,memory-backend=pc.ram',
> the latter works, and the two mentioned setups can be
> migrated from one to another.
> 
> What I was consistently trying was '-object
> memory-backend-file,id=pc.ram,size=512M -machine q35', and qemu raises an error
> for this in a recent update:
> 
> ```
> qemu-system-x86_64: object name 'pc.ram' is reserved for the default RAM backend, it can't be used for any other purposes. Change the object's 'id' to something else
> ```
> 
> This error is misleading. Actually, in this case the error message should be
> closer to:
> ```
> object name 'pc.ram' is reserved for the default RAM backend, it can't
> be used for any other purposes. Change the object's 'id' to something
> else, or append "memory-backend=pc.ram" to -machine arguments
> ```

What about:

$ ./build/qemu-system-x86_64 -object 
memory-backend-file,id=pc.ram,size=512M,mem-path=tmp -machine q35
qemu-system-x86_64: object name 'pc.ram' is reserved for the default RAM 
backend, it can't be used for any other purposes. Change the object's 
'id' to something else or disable automatic creation of the default RAM 
backend by setting the 'memory-backend' machine property

?
diff mbox series

Patch

diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 3df73542e1..d1ae694b20 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -1289,8 +1289,7 @@  static int64_t get_file_align(int fd)
 static int file_ram_open(const char *path,
                          const char *region_name,
                          bool readonly,
-                         bool *created,
-                         Error **errp)
+                         bool *created)
 {
     char *filename;
     char *sanitized_name;
@@ -1334,10 +1333,7 @@  static int file_ram_open(const char *path,
             g_free(filename);
         }
         if (errno != EEXIST && errno != EINTR) {
-            error_setg_errno(errp, errno,
-                             "can't open backing store %s for guest RAM",
-                             path);
-            return -1;
+            return -errno;
         }
         /*
          * Try again on EINTR and EEXIST.  The latter happens when
@@ -1946,9 +1942,23 @@  RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
     bool created;
     RAMBlock *block;
 
-    fd = file_ram_open(mem_path, memory_region_name(mr), readonly, &created,
-                       errp);
+    fd = file_ram_open(mem_path, memory_region_name(mr), readonly, &created);
+    if (fd == -EACCES && !(ram_flags & RAM_SHARED) && !readonly) {
+        /*
+         * We can have a writable MAP_PRIVATE mapping of a readonly file.
+         * However, some operations like ftruncate() or fallocate() might fail
+         * later, let's warn the user.
+         */
+        fd = file_ram_open(mem_path, memory_region_name(mr), true, &created);
+        if (fd >= 0) {
+            warn_report("backing store %s for guest RAM (MAP_PRIVATE) opened"
+                        " readonly because the file is not writable", mem_path);
+        }
+    }
     if (fd < 0) {
+        error_setg_errno(errp, -fd,
+                         "can't open backing store %s for guest RAM",
+                         mem_path);
         return NULL;
     }