
[v1,1/4] softmmu/physmem: Warn with ram_block_discard_range() on MAP_PRIVATE file mapping

Message ID 20230620130354.322180-2-david@redhat.com (mailing list archive)
State New, archived
Series virtio-mem: Support "x-ignore-shared" migration

Commit Message

David Hildenbrand June 20, 2023, 1:03 p.m. UTC
ram_block_discard_range() cannot possibly do the right thing in
MAP_PRIVATE file mappings in the general case.

To achieve the documented semantics, we also have to punch a hole into
the file, possibly messing with other MAP_PRIVATE/MAP_SHARED mappings
of such a file.

For example, using VM templating -- see commit b17fbbe55cba ("migration:
allow private destination ram with x-ignore-shared") -- in combination with
any mechanism that relies on discarding of RAM is problematic. This
includes:
* Postcopy live migration
* virtio-balloon inflation/deflation or free-page-reporting
* virtio-mem

So at least warn that something possibly dangerous is going on
when using ram_block_discard_range() in these cases.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 softmmu/physmem.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

Comments

Peter Xu June 21, 2023, 3:55 p.m. UTC | #1
On Tue, Jun 20, 2023 at 03:03:51PM +0200, David Hildenbrand wrote:
> ram_block_discard_range() cannot possibly do the right thing in
> MAP_PRIVATE file mappings in the general case.
> 
> To achieve the documented semantics, we also have to punch a hole into
> the file, possibly messing with other MAP_PRIVATE/MAP_SHARED mappings
> of such a file.
> 
> For example, using VM templating -- see commit b17fbbe55cba ("migration:
> allow private destination ram with x-ignore-shared") -- in combination with
> any mechanism that relies on discarding of RAM is problematic. This
> includes:
> * Postcopy live migration
> * virtio-balloon inflation/deflation or free-page-reporting
> * virtio-mem
> 
> So at least warn that something possibly dangerous is going on
> when using ram_block_discard_range() in these cases.

The issue is probably valid.

One thing I worry about is when the user (or, qemu instance) exclusively owns
the file but just forgot to attach share=on: it used to work perfectly, and
now it'll show this warning.  But I agree maybe it's good to remind them to
attach share=on.

For real private-mem users, the warning can be of real help; one should
probably leverage things like file snapshots provided by modern file
systems, so each VM would just have its own snapshot RAM file to use and
then map it share=on, I suppose.

For the long term, maybe we should support private mem here simply via
MADV_DONTNEED.  I assume that's the right semantics for postcopy (we'd just
need to support MINOR faults, though; MISSING faults will definitely stop
working.. but the rest of the framework shouldn't need much change), and I
hope that's also the semantics that balloon/virtio-mem want here.  Not
sure whether/when that's strongly needed, assuming the corner case above
can still be worked around properly by other means.

For now, a warning looks all sane.

> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: Peter Xu <peterx@redhat.com>

> ---
>  softmmu/physmem.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> index 6bdd944fe8..27c7219c82 100644
> --- a/softmmu/physmem.c
> +++ b/softmmu/physmem.c
> @@ -3451,6 +3451,24 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
>               * so a userfault will trigger.
>               */
>  #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> +            /*
> +             * We'll discard data from the actual file, even though we only
> +             * have a MAP_PRIVATE mapping, possibly messing with other
> +             * MAP_PRIVATE/MAP_SHARED mappings. There is no easy way to
> +             * change that behavior without violating the promised
> +             * semantics of ram_block_discard_range().
> +             *
> +             * Only warn, because it works as long as nobody else uses that
> +             * file.
> +             */
> +            if (!qemu_ram_is_shared(rb)) {
> +                warn_report_once("ram_block_discard_range: Discarding RAM"
> +                                 " in private file mappings is possibly"
> +                                 " dangerous, because it will modify the"
> +                                 " underlying file and will affect other"
> +                                 " users of the file");
> +            }
> +
>              ret = fallocate(rb->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
>                              start, length);
>              if (ret) {
> -- 
> 2.40.1
>
David Hildenbrand June 21, 2023, 4:17 p.m. UTC | #2
On 21.06.23 17:55, Peter Xu wrote:
> On Tue, Jun 20, 2023 at 03:03:51PM +0200, David Hildenbrand wrote:
>> ram_block_discard_range() cannot possibly do the right thing in
>> MAP_PRIVATE file mappings in the general case.
>>
>> To achieve the documented semantics, we also have to punch a hole into
>> the file, possibly messing with other MAP_PRIVATE/MAP_SHARED mappings
>> of such a file.
>>
>> For example, using VM templating -- see commit b17fbbe55cba ("migration:
>> allow private destination ram with x-ignore-shared") -- in combination with
>> any mechanism that relies on discarding of RAM is problematic. This
>> includes:
>> * Postcopy live migration
>> * virtio-balloon inflation/deflation or free-page-reporting
>> * virtio-mem
>>
>> So at least warn that something possibly dangerous is going on
>> when using ram_block_discard_range() in these cases.
> 
> The issue is probably valid.
> 
> One thing I worry about is when the user (or, qemu instance) exclusively owns
> the file but just forgot to attach share=on: it used to work perfectly, and
> now it'll show this warning.  But I agree maybe it's good to remind them to
> attach share=on.

For memory-backend-memfd "share=on" is fortunately the default. For 
memory-backend-file it isn't (and in most cases you do want share=on, 
like for hugetlbfs or tmpfs).

Missing the "share=on" for memory-backend-file can have sane use cases, 
but for the common /dev/shm/ case it even results in undesired 
double memory consumption (just like memory-backend-memfd,share=off).


> 
> For real private-mem users, the warning can be of real help; one should
> probably leverage things like file snapshots provided by modern file
> systems, so each VM would just have its own snapshot RAM file to use and
> then map it share=on, I suppose.

Yes, I agree. Although we recently learned that fs-backed VM RAM (SSD) 
performs poorly and will severely wear your SSD :(

> 
> For the long term, maybe we should support private mem here simply via
> MADV_DONTNEED.  I assume that's the right semantics for postcopy (we'd just
> need to support MINOR faults, though; MISSING faults will definitely stop
> working.. but the rest of the framework shouldn't need much change), and I
> hope that's also the semantics that balloon/virtio-mem want here.  Not
> sure whether/when that's strongly needed, assuming the corner case above
> can still be worked around properly by other means.

I briefly thought about that but came to the conclusion that fixing it 
is not that easy. So I went with the warning.

As documented, ram_block_discard_range() guarantees two things

a) Read 0 after discarding succeeded
b) Make postcopy work by triggering a fault on next access

And if we'd simply want to drop the FALLOC_FL_PUNCH_HOLE:

1) For hugetlb, only newer kernels support MADV_DONTNEED. So there is no 
way to just discard in a private mapping here that works for kernels we 
still care about.

2) free-page-reporting wants to read 0's when re-accessing discarded 
memory. If there is still something there in the file, that won't work.

3) Regarding postcopy on MAP_PRIVATE shmem, I am not sure if it will 
actually do what you want if the pagecache holds a page. Maybe it works, 
but I am not so sure. Needs investigation.


> 
> For now, a warning looks all sane.
> 
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> Acked-by: Peter Xu <peterx@redhat.com>

Thanks!
Peter Xu June 21, 2023, 4:55 p.m. UTC | #3
On Wed, Jun 21, 2023 at 06:17:37PM +0200, David Hildenbrand wrote:
> As documented, ram_block_discard_range() guarantees two things
> 
> a) Read 0 after discarding succeeded
> b) Make postcopy work by triggering a fault on next access
> 
> And if we'd simply want to drop the FALLOC_FL_PUNCH_HOLE:
> 
> 1) For hugetlb, only newer kernels support MADV_DONTNEED. So there is no way
> to just discard in a private mapping here that works for kernels we still
> care about.
> 
> 2) free-page-reporting wants to read 0's when re-accessing discarded memory.
> If there is still something there in the file, that won't work.

Ah right.  The semantics are indeed slightly different.

IMHO, ideally here we'd need a zero page installed as private, ignoring the
page cache underneath and freeing any possible private page.  But I just don't
know how to do that easily with the current mm infrastructure; otherwise,
free-page-reporting over private mem just won't really work at all,
it seems to me.

Maybe.. UFFDIO_ZEROPAGE would work?  We'd need uffd registered by default, but
that's slightly tricky.

> 
> 3) Regarding postcopy on MAP_PRIVATE shmem, I am not sure if it will
> actually do what you want if the pagecache holds a page. Maybe it works, but
> I am not so sure. Needs investigation.

For MINOR I think it will.  I actually already implemented some of that (I
think, all of what is required) in the HGM qemu rfc series, and smoke-tested
it a bit with the HGM kernel without any known issues yet.

IIUC we can work on MINOR support without HGM; I can separate it out.  It's
really a matter of whether it'll be worth the effort and time.

Thanks,
David Hildenbrand June 22, 2023, 1:10 p.m. UTC | #4
On 21.06.23 18:55, Peter Xu wrote:
> On Wed, Jun 21, 2023 at 06:17:37PM +0200, David Hildenbrand wrote:
>> As documented, ram_block_discard_range() guarantees two things
>>
>> a) Read 0 after discarding succeeded
>> b) Make postcopy work by triggering a fault on next access
>>
>> And if we'd simply want to drop the FALLOC_FL_PUNCH_HOLE:
>>
>> 1) For hugetlb, only newer kernels support MADV_DONTNEED. So there is no way
>> to just discard in a private mapping here that works for kernels we still
>> care about.
>>
>> 2) free-page-reporting wants to read 0's when re-accessing discarded memory.
>> If there is still something there in the file, that won't work.
> 
> Ah right.  The semantics are indeed slightly different.
> 
> IMHO, ideally here we'd need a zero page installed as private, ignoring the
> page cache underneath and freeing any possible private page.  But I just don't
> know how to do that easily with the current mm infrastructure; otherwise,
> free-page-reporting over private mem just won't really work at all,
> it seems to me.
> 
> Maybe.. UFFDIO_ZEROPAGE would work?  We'd need uffd registered by default, but
> that's slightly tricky.

Maybe ... depends also on the uffd semantics as in 3).

> 
>>
>> 3) Regarding postcopy on MAP_PRIVATE shmem, I am not sure if it will
>> actually do what you want if the pagecache holds a page. Maybe it works, but
>> I am not so sure. Needs investigation.
> 
> For MINOR I think it will.  I actually already implemented some of that (I
> think, all of what is required) in the HGM qemu rfc series, and smoke-tested
> it a bit with the HGM kernel without any known issues yet.
> 
> IIUC we can work on MINOR support without HGM; I can separate it out.  It's
> really a matter of whether it'll be worth the effort and time.

Yes, MINOR might work. But postcopy in particular doesn't make too much 
sense targeting a private mapping that already has some other pages in 
there ... so it might not be worth the trouble, I guess.
Peter Xu June 22, 2023, 2:54 p.m. UTC | #5
On Thu, Jun 22, 2023 at 03:10:47PM +0200, David Hildenbrand wrote:
> Maybe ... depends also on the uffd semantics as in 3).

UFFDIO_COPY|ZEROPAGE bypasses the page cache for private file mappings, afaict.
We still have a limit at the inode size (so we can't COPY|ZEROPAGE past
that offset of the vma), but the rest should be all fine.

Feel free to have a quick skim over 5b51072e97d5 ("userfaultfd: shmem:
allocate anonymous memory for MAP_PRIVATE shmem").

Thanks,

Patch

diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 6bdd944fe8..27c7219c82 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -3451,6 +3451,24 @@  int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
              * so a userfault will trigger.
              */
 #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
+            /*
+             * We'll discard data from the actual file, even though we only
+             * have a MAP_PRIVATE mapping, possibly messing with other
+             * MAP_PRIVATE/MAP_SHARED mappings. There is no easy way to
+             * change that behavior without violating the promised
+             * semantics of ram_block_discard_range().
+             *
+             * Only warn, because it works as long as nobody else uses that
+             * file.
+             */
+            if (!qemu_ram_is_shared(rb)) {
+                warn_report_once("ram_block_discard_range: Discarding RAM"
+                                 " in private file mappings is possibly"
+                                 " dangerous, because it will modify the"
+                                 " underlying file and will affect other"
+                                 " users of the file");
+            }
+
             ret = fallocate(rb->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                             start, length);
             if (ret) {