diff mbox series

[2/3] exec: posix_madvise usage on SunOS.

Message ID CA+XhMqwtUrSpCqNGEETBijewzvmpno8OAX_PKSShDP_gUQ-3VQ@mail.gmail.com (mailing list archive)
State New, archived
Headers show
Series None | expand

Commit Message

David CARLIER July 18, 2020, 1:20 p.m. UTC
From a9e3cced279ae55a59847ba232f7828bc2479367 Mon Sep 17 00:00:00 2001
From: David Carlier <devnexen@gmail.com>
Date: Sat, 18 Jul 2020 13:29:44 +0100
Subject: [PATCH 2/3] exec: posix_madvise usage on SunOS.

with _XOPEN_SOURCE set, the older mman.h API based on caddr_t handling
is not accessible thus using posix_madvise here.

Signed-off-by: David Carlier <devnexen@gmail.com>
---
 exec.c | 8 ++++++++
 1 file changed, 8 insertions(+)

                 error_report("ram_block_discard_range: Failed to
discard range "

Comments

Peter Maydell July 18, 2020, 2:15 p.m. UTC | #1
On Sat, 18 Jul 2020 at 14:21, David CARLIER <devnexen@gmail.com> wrote:
>
> From a9e3cced279ae55a59847ba232f7828bc2479367 Mon Sep 17 00:00:00 2001
> From: David Carlier <devnexen@gmail.com>
> Date: Sat, 18 Jul 2020 13:29:44 +0100
> Subject: [PATCH 2/3] exec: posix_madvise usage on SunOS.
>
> with _XOPEN_SOURCE set, the older mman.h API based on caddr_t handling
> is not accessible thus using posix_madvise here.
>
> Signed-off-by: David Carlier <devnexen@gmail.com>
> ---
>  exec.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/exec.c b/exec.c
> index 6f381f98e2..0466a75b89 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -3964,7 +3964,15 @@ int ram_block_discard_range(RAMBlock *rb,
> uint64_t start, size_t length)
>               * fallocate'd away).
>               */
>  #if defined(CONFIG_MADVISE)
> +#if !defined(CONFIG_SOLARIS)
>              ret =  madvise(host_startaddr, length, MADV_DONTNEED);
> +#else
> +        /*
> +         * mmap and its caddr_t based api is not accessible
> +         * with _XOPEN_SOURCE set on illumos
> +         */
> +            ret =  posix_madvise(host_startaddr, length, POSIX_MADV_DONTNEED);
> +#endif

Hi. I'm not sure this patch will do the right thing, because
I don't think that Solaris's POSIX_MADV_DONTNEED provides
the semantics that this QEMU function says it needs. The
comment at the top of the function says:

 * Unmap pages of memory from start to start+length such that
 * they a) read as 0, b) Trigger whatever fault mechanism
 * the OS provides for postcopy.
 * The pages must be unmapped by the end of the function.

(Aside: the use of 'unmap' in this comment is a bit confusing,
because it clearly doesn't mean 'unmap' if it wants read-as-0.
And the reference to faults on postcopy is incomprehensible
to me: if memory is read-as-0 it isn't going to fault.)

Linux's madvise(MADV_DONTNEED) does guarantee us this
read-as-zero behaviour. (It's a silly API choice that Linux
put this behaviour behind madvise, which is supposed to be
merely advisory, but that's how it is.) The Solaris
posix_madvise() manpage says it is merely advisory and
doesn't affect the behaviour of accesses to the memory.

If posix_madvise() behaviour was OK in this function, the
right way to fix this would be to use qemu_madvise()
instead, which already provides this "if host has
madvise(), use it, otherwise use posix_madvise()" logic.
But I suspect that the direct madvise() here is deliberate.

Side note: not sure the current code is correct for the
BSDs either -- they have madvise() but don't provide
Linux's really-read-as-zero guarantee for MADV_DONTNEED.
So we should probably be doing something else there, and
whatever that something-else is is probably also what
Solaris wants.

We use ram_block_discard_range() only in migration and
in virtio-balloon and virtio-mem; I've cc'd some people
who hopefully understand what the requirements on this
function are and might have a view on what the not-Linux
implementation should look like. (David Gilbert: git
blame says you wrote this code :-))

thanks
-- PMM
David CARLIER July 18, 2020, 3:23 p.m. UTC | #2
It probably does not indeed (maybe Solaris madvise does not provide
but memcntl ?), I was just trying to fix the build only :-)

On Sat, 18 Jul 2020 at 15:15, Peter Maydell <peter.maydell@linaro.org> wrote:
>
> On Sat, 18 Jul 2020 at 14:21, David CARLIER <devnexen@gmail.com> wrote:
> >
> > From a9e3cced279ae55a59847ba232f7828bc2479367 Mon Sep 17 00:00:00 2001
> > From: David Carlier <devnexen@gmail.com>
> > Date: Sat, 18 Jul 2020 13:29:44 +0100
> > Subject: [PATCH 2/3] exec: posix_madvise usage on SunOS.
> >
> > with _XOPEN_SOURCE set, the older mman.h API based on caddr_t handling
> > is not accessible thus using posix_madvise here.
> >
> > Signed-off-by: David Carlier <devnexen@gmail.com>
> > ---
> >  exec.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> >
> > diff --git a/exec.c b/exec.c
> > index 6f381f98e2..0466a75b89 100644
> > --- a/exec.c
> > +++ b/exec.c
> > @@ -3964,7 +3964,15 @@ int ram_block_discard_range(RAMBlock *rb,
> > uint64_t start, size_t length)
> >               * fallocate'd away).
> >               */
> >  #if defined(CONFIG_MADVISE)
> > +#if !defined(CONFIG_SOLARIS)
> >              ret =  madvise(host_startaddr, length, MADV_DONTNEED);
> > +#else
> > +        /*
> > +         * mmap and its caddr_t based api is not accessible
> > +         * with _XOPEN_SOURCE set on illumos
> > +         */
> > +            ret =  posix_madvise(host_startaddr, length, POSIX_MADV_DONTNEED);
> > +#endif
>
> Hi. I'm not sure this patch will do the right thing, because
> I don't think that Solaris's POSIX_MADV_DONTNEED provides
> the semantics that this QEMU function says it needs. The
> comment at the top of the function says:
>
>  * Unmap pages of memory from start to start+length such that
>  * they a) read as 0, b) Trigger whatever fault mechanism
>  * the OS provides for postcopy.
>  * The pages must be unmapped by the end of the function.
>
> (Aside: the use of 'unmap' in this comment is a bit confusing,
> because it clearly doesn't mean 'unmap' if it wants read-as-0.
> And the reference to faults on postcopy is incomprehensible
> to me: if memory is read-as-0 it isn't going to fault.)
>
> Linux's madvise(MADV_DONTNEED) does guarantee us this
> read-as-zero behaviour. (It's a silly API choice that Linux
> put this behaviour behind madvise, which is supposed to be
> merely advisory, but that's how it is.) The Solaris
> posix_madvise() manpage says it is merely advisory and
> doesn't affect the behaviour of accesses to the memory.
>
> If posix_madvise() behaviour was OK in this function, the
> right way to fix this would be to use qemu_madvise()
> instead, which already provides this "if host has
> madvise(), use it, otherwise use posix_madvise()" logic.
> But I suspect that the direct madvise() here is deliberate.
>
> Side note: not sure the current code is correct for the
> BSDs either -- they have madvise() but don't provide
> Linux's really-read-as-zero guarantee for MADV_DONTNEED.
> So we should probably be doing something else there, and
> whatever that something-else is is probably also what
> Solaris wants.
>
> We use ram_block_discard_range() only in migration and
> in virtio-balloon and virtio-mem; I've cc'd some people
> who hopefully understand what the requirements on this
> function are and might have a view on what the not-Linux
> implementation should look like. (David Gilbert: git
> blame says you wrote this code :-))
>
> thanks
> -- PMM
Dr. David Alan Gilbert July 20, 2020, 7:13 p.m. UTC | #3
(Copies in Dave Hildenbrand)

* Peter Maydell (peter.maydell@linaro.org) wrote:
> On Sat, 18 Jul 2020 at 14:21, David CARLIER <devnexen@gmail.com> wrote:
> >
> > From a9e3cced279ae55a59847ba232f7828bc2479367 Mon Sep 17 00:00:00 2001
> > From: David Carlier <devnexen@gmail.com>
> > Date: Sat, 18 Jul 2020 13:29:44 +0100
> > Subject: [PATCH 2/3] exec: posix_madvise usage on SunOS.
> >
> > with _XOPEN_SOURCE set, the older mman.h API based on caddr_t handling
> > is not accessible thus using posix_madvise here.
> >
> > Signed-off-by: David Carlier <devnexen@gmail.com>
> > ---
> >  exec.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> >
> > diff --git a/exec.c b/exec.c
> > index 6f381f98e2..0466a75b89 100644
> > --- a/exec.c
> > +++ b/exec.c
> > @@ -3964,7 +3964,15 @@ int ram_block_discard_range(RAMBlock *rb,
> > uint64_t start, size_t length)
> >               * fallocate'd away).
> >               */
> >  #if defined(CONFIG_MADVISE)
> > +#if !defined(CONFIG_SOLARIS)
> >              ret =  madvise(host_startaddr, length, MADV_DONTNEED);
> > +#else
> > +        /*
> > +         * mmap and its caddr_t based api is not accessible
> > +         * with _XOPEN_SOURCE set on illumos
> > +         */
> > +            ret =  posix_madvise(host_startaddr, length, POSIX_MADV_DONTNEED);
> > +#endif
> 
> Hi. I'm not sure this patch will do the right thing, because
> I don't think that Solaris's POSIX_MADV_DONTNEED provides
> the semantics that this QEMU function says it needs. The
> comment at the top of the function says:
> 
>  * Unmap pages of memory from start to start+length such that
>  * they a) read as 0, b) Trigger whatever fault mechanism
>  * the OS provides for postcopy.
>  * The pages must be unmapped by the end of the function.

This code has moved around a bit over it's life; joining the case
needed by balloon and the case needed by postcopy.

> (Aside: the use of 'unmap' in this comment is a bit confusing,
> because it clearly doesn't mean 'unmap' if it wants read-as-0.
> And the reference to faults on postcopy is incomprehensible
> to me: if memory is read-as-0 it isn't going to fault.)

I think because internally to Linux the behaviour is the same;
this causes the mapping to disappear from the TLB so it faults;
normally when reading the kernel resolves the fault and puts
a read-as-zero page there, except if userfault was enabled
for postcopy, in which case it gives us a kick and we service it.

> Linux's madvise(MADV_DONTNEED) does guarantee us this
> read-as-zero behaviour. (It's a silly API choice that Linux
> put this behaviour behind madvise, which is supposed to be
> merely advisory, but that's how it is.)

Yes, I don't think there's any equivalent to madvise
that guarantees anything.

> The Solaris
> posix_madvise() manpage says it is merely advisory and
> doesn't affect the behaviour of accesses to the memory.
> 
> If posix_madvise() behaviour was OK in this function, the
> right way to fix this would be to use qemu_madvise()
> instead, which already provides this "if host has
> madvise(), use it, otherwise use posix_madvise()" logic.
> But I suspect that the direct madvise() here is deliberate.

Yes, but I can't remember the semantics fully - I think it was because
we needed the guarantee at this point (and even Linux's
posix madvise did something different??)

I've got a note saying we didn't want to use
qemu_madvise because we wanted to be sure we didn't get
posix_madvise.

> Side note: not sure the current code is correct for the
> BSDs either -- they have madvise() but don't provide
> Linux's really-read-as-zero guarantee for MADV_DONTNEED.
> So we should probably be doing something else there, and
> whatever that something-else is is probably also what
> Solaris wants.
> 
> We use ram_block_discard_range() only in migration and
> in virtio-balloon and virtio-mem; I've cc'd some people
> who hopefully understand what the requirements on this
> function are and might have a view on what the not-Linux
> implementation should look like. (David Gilbert: git
> blame says you wrote this code :-))

Dave

> 
> thanks
> -- PMM
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
David Hildenbrand July 21, 2020, 8:22 a.m. UTC | #4
On 20.07.20 21:13, Dr. David Alan Gilbert wrote:
> (Copies in Dave Hildenbrand)
> 
> * Peter Maydell (peter.maydell@linaro.org) wrote:
>> On Sat, 18 Jul 2020 at 14:21, David CARLIER <devnexen@gmail.com> wrote:
>>>
>>> From a9e3cced279ae55a59847ba232f7828bc2479367 Mon Sep 17 00:00:00 2001
>>> From: David Carlier <devnexen@gmail.com>
>>> Date: Sat, 18 Jul 2020 13:29:44 +0100
>>> Subject: [PATCH 2/3] exec: posix_madvise usage on SunOS.
>>>
>>> with _XOPEN_SOURCE set, the older mman.h API based on caddr_t handling
>>> is not accessible thus using posix_madvise here.
>>>
>>> Signed-off-by: David Carlier <devnexen@gmail.com>
>>> ---
>>>  exec.c | 8 ++++++++
>>>  1 file changed, 8 insertions(+)
>>>
>>> diff --git a/exec.c b/exec.c
>>> index 6f381f98e2..0466a75b89 100644
>>> --- a/exec.c
>>> +++ b/exec.c
>>> @@ -3964,7 +3964,15 @@ int ram_block_discard_range(RAMBlock *rb,
>>> uint64_t start, size_t length)
>>>               * fallocate'd away).
>>>               */
>>>  #if defined(CONFIG_MADVISE)
>>> +#if !defined(CONFIG_SOLARIS)
>>>              ret =  madvise(host_startaddr, length, MADV_DONTNEED);
>>> +#else
>>> +        /*
>>> +         * mmap and its caddr_t based api is not accessible
>>> +         * with _XOPEN_SOURCE set on illumos
>>> +         */
>>> +            ret =  posix_madvise(host_startaddr, length, POSIX_MADV_DONTNEED);
>>> +#endif
>>
>> Hi. I'm not sure this patch will do the right thing, because
>> I don't think that Solaris's POSIX_MADV_DONTNEED provides
>> the semantics that this QEMU function says it needs. The
>> comment at the top of the function says:
>>
>>  * Unmap pages of memory from start to start+length such that
>>  * they a) read as 0, b) Trigger whatever fault mechanism
>>  * the OS provides for postcopy.
>>  * The pages must be unmapped by the end of the function.
> 
> This code has moved around a bit over it's life; joining the case
> needed by balloon and the case needed by postcopy.
> 
>> (Aside: the use of 'unmap' in this comment is a bit confusing,
>> because it clearly doesn't mean 'unmap' if it wants read-as-0.
>> And the reference to faults on postcopy is incomprehensible
>> to me: if memory is read-as-0 it isn't going to fault.)
> 
> I think because internally to Linux the behaviour is the same;
> this causes the mapping to disappear from the TLB so it faults;
> normally when reading the kernel resolves the fault and puts
> a read-as-zero page there, except if userfault was enabled
> for postcopy, in which case it gives us a kick and we service it.
> 
>> Linux's madvise(MADV_DONTNEED) does guarantee us this
>> read-as-zero behaviour. (It's a silly API choice that Linux
>> put this behaviour behind madvise, which is supposed to be
>> merely advisory, but that's how it is.)
> 
> Yes, I don't think there's any equivalent to madvise
> that guarantees anything.
> 
>> The Solaris
>> posix_madvise() manpage says it is merely advisory and
>> doesn't affect the behaviour of accesses to the memory.
>>
>> If posix_madvise() behaviour was OK in this function, the
>> right way to fix this would be to use qemu_madvise()
>> instead, which already provides this "if host has
>> madvise(), use it, otherwise use posix_madvise()" logic.
>> But I suspect that the direct madvise() here is deliberate.
> 
> Yes, but I can't remember the semantics fully - I think it was because
> we needed the guarantee at this point (and even Linux's
> posix madvise did something different??)
> 
> I've got a note saying we didn't want to use
> qemu_madvise because we wanted to be sure we didn't get
> posix_madvise.
> 
>> Side note: not sure the current code is correct for the
>> BSDs either -- they have madvise() but don't provide
>> Linux's really-read-as-zero guarantee for MADV_DONTNEED.
>> So we should probably be doing something else there, and
>> whatever that something-else is is probably also what
>> Solaris wants.
>>
>> We use ram_block_discard_range() only in migration and
>> in virtio-balloon and virtio-mem; I've cc'd some people
>> who hopefully understand what the requirements on this
>> function are and might have a view on what the not-Linux
>> implementation should look like. (David Gilbert: git
>> blame says you wrote this code :-))

virtio-mem depends on Linux (hw/virtio/Kconfig). I guess
userfaultfd/postcopy is also not relevant in the context of SunOS. So
what remains is virtio-balloon.

virito-balloon ideally wants to discard the actual mapped pages to free
up memory. When memory is re-accessed, a fresh page is faulted in (->
zero-page under Linux). Now, we already have other cases where it looks
like "the balloon works" but it really doesn't. One example is using
vfio+virtio-balloon under Linux - inflating the balloon is simply a NOP,
no memory is actually discarded.

I agree that POSIX_MADV_DONTNEED is not a proper match - different
guarantees. If SunOS cannot implement ram_block_discard_range() as
documented, we should disable it.

I would suggest using ram_block_discard_disable(true) when under SunOS
early during QEMU startup. In addition, we might want to return directly
in ram_block_dicard_range(). We might also want to make virito-balloon
depend on !SubOS.
Peter Maydell July 21, 2020, 9:31 a.m. UTC | #5
On Tue, 21 Jul 2020 at 09:22, David Hildenbrand <david@redhat.com> wrote:
> virtio-mem depends on Linux (hw/virtio/Kconfig). I guess
> userfaultfd/postcopy is also not relevant in the context of SunOS. So
> what remains is virtio-balloon.
>
> virito-balloon ideally wants to discard the actual mapped pages to free
> up memory. When memory is re-accessed, a fresh page is faulted in (->
> zero-page under Linux). Now, we already have other cases where it looks
> like "the balloon works" but it really doesn't. One example is using
> vfio+virtio-balloon under Linux - inflating the balloon is simply a NOP,
> no memory is actually discarded.
>
> I agree that POSIX_MADV_DONTNEED is not a proper match - different
> guarantees. If SunOS cannot implement ram_block_discard_range() as
> documented, we should disable it.

Could we also improve the documentation comment to make it clearer
what ram_block_discard_range() is actually supposed to be doing?
At the moment I don't think you can figure it out from the comments,
which are a confusing mix of claiming it unmaps memory and that it
zeroes it.

Is the Linux-specific stuff (userfaultfd) a *requirement* for the
function, or just a "this would be nice" extra and a valid
implementation would be eg "zero out the memory" ?

thanks
-- PMM
David Hildenbrand July 21, 2020, 9:48 a.m. UTC | #6
On 21.07.20 11:31, Peter Maydell wrote:
> On Tue, 21 Jul 2020 at 09:22, David Hildenbrand <david@redhat.com> wrote:
>> virtio-mem depends on Linux (hw/virtio/Kconfig). I guess
>> userfaultfd/postcopy is also not relevant in the context of SunOS. So
>> what remains is virtio-balloon.
>>
>> virito-balloon ideally wants to discard the actual mapped pages to free
>> up memory. When memory is re-accessed, a fresh page is faulted in (->
>> zero-page under Linux). Now, we already have other cases where it looks
>> like "the balloon works" but it really doesn't. One example is using
>> vfio+virtio-balloon under Linux - inflating the balloon is simply a NOP,
>> no memory is actually discarded.
>>
>> I agree that POSIX_MADV_DONTNEED is not a proper match - different
>> guarantees. If SunOS cannot implement ram_block_discard_range() as
>> documented, we should disable it.
> 
> Could we also improve the documentation comment to make it clearer
> what ram_block_discard_range() is actually supposed to be doing?
> At the moment I don't think you can figure it out from the comments,
> which are a confusing mix of claiming it unmaps memory and that it
> zeroes it.

Certainly. It really means (as the name suggests) "Discard the pages
used for backing the virtual address range, effectively freeing them
back to the OS. When accessing pages within the virtual address range
again, populate fresh, zeroed-out pages, just as when accessing memory
inside a new mmap for the first time.". It really tries to mimic the
"populate on demand" mechanism used on fresh, anonymous mmaps. If no
page is populated, a page fault in the OS is triggered and a fresh
(physical) page is populated. Of course, when only reading, the single
COW zero page can be used until written to the page.

"Unmap" is a misleading terminology, I agree.

> 
> Is the Linux-specific stuff (userfaultfd) a *requirement* for the
> function, or just a "this would be nice" extra and a valid
> implementation would be eg "zero out the memory" ?

postcopy/userfaultfd really needs all backing pages for the VMA to be
gone (discarded/zapped), so that a pagefault will be triggered. As
postcopy will resolve the pagefault itself, it does not rely on the
"zeroed" semantics. It will tell the OS which page to place.

Supplying anything not zero already smells like the original request
(discard the pages) was not properly implemented. Of course, one could
supply random data to a user space process on a pagefault, when
populating pages, but that already smells like a severe security issue ...

Simply zeroing out memory is not what this function promises. It does
not work with all three use cases: postcopy, virito-balloon and
virtio-mem. We really have to discard.
diff mbox series

Patch

diff --git a/exec.c b/exec.c
index 6f381f98e2..0466a75b89 100644
--- a/exec.c
+++ b/exec.c
@@ -3964,7 +3964,15 @@  int ram_block_discard_range(RAMBlock *rb,
uint64_t start, size_t length)
              * fallocate'd away).
              */
 #if defined(CONFIG_MADVISE)
+#if !defined(CONFIG_SOLARIS)
             ret =  madvise(host_startaddr, length, MADV_DONTNEED);
+#else
+        /*
+         * mmap and its caddr_t based api is not accessible
+         * with _XOPEN_SOURCE set on illumos
+         */
+            ret =  posix_madvise(host_startaddr, length, POSIX_MADV_DONTNEED);
+#endif
             if (ret) {
                 ret = -errno;