Message ID | CA+XhMqwtUrSpCqNGEETBijewzvmpno8OAX_PKSShDP_gUQ-3VQ@mail.gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | None | expand |
On Sat, 18 Jul 2020 at 14:21, David CARLIER <devnexen@gmail.com> wrote: > > From a9e3cced279ae55a59847ba232f7828bc2479367 Mon Sep 17 00:00:00 2001 > From: David Carlier <devnexen@gmail.com> > Date: Sat, 18 Jul 2020 13:29:44 +0100 > Subject: [PATCH 2/3] exec: posix_madvise usage on SunOS. > > with _XOPEN_SOURCE set, the older mman.h API based on caddr_t handling > is not accessible thus using posix_madvise here. > > Signed-off-by: David Carlier <devnexen@gmail.com> > --- > exec.c | 8 ++++++++ > 1 file changed, 8 insertions(+) > > diff --git a/exec.c b/exec.c > index 6f381f98e2..0466a75b89 100644 > --- a/exec.c > +++ b/exec.c > @@ -3964,7 +3964,15 @@ int ram_block_discard_range(RAMBlock *rb, > uint64_t start, size_t length) > * fallocate'd away). > */ > #if defined(CONFIG_MADVISE) > +#if !defined(CONFIG_SOLARIS) > ret = madvise(host_startaddr, length, MADV_DONTNEED); > +#else > + /* > + * mmap and its caddr_t based api is not accessible > + * with _XOPEN_SOURCE set on illumos > + */ > + ret = posix_madvise(host_startaddr, length, POSIX_MADV_DONTNEED); > +#endif Hi. I'm not sure this patch will do the right thing, because I don't think that Solaris's POSIX_MADV_DONTNEED provides the semantics that this QEMU function says it needs. The comment at the top of the function says: * Unmap pages of memory from start to start+length such that * they a) read as 0, b) Trigger whatever fault mechanism * the OS provides for postcopy. * The pages must be unmapped by the end of the function. (Aside: the use of 'unmap' in this comment is a bit confusing, because it clearly doesn't mean 'unmap' if it wants read-as-0. And the reference to faults on postcopy is incomprehensible to me: if memory is read-as-0 it isn't going to fault.) Linux's madvise(MADV_DONTNEED) does guarantee us this read-as-zero behaviour. (It's a silly API choice that Linux put this behaviour behind madvise, which is supposed to be merely advisory, but that's how it is.) The Solaris posix_madvise() manpage says it is merely advisory and doesn't affect the behaviour of accesses to the memory. If posix_madvise() behaviour was OK in this function, the right way to fix this would be to use qemu_madvise() instead, which already provides this "if host has madvise(), use it, otherwise use posix_madvise()" logic. But I suspect that the direct madvise() here is deliberate. Side note: not sure the current code is correct for the BSDs either -- they have madvise() but don't provide Linux's really-read-as-zero guarantee for MADV_DONTNEED. So we should probably be doing something else there, and whatever that something-else is is probably also what Solaris wants. We use ram_block_discard_range() only in migration and in virtio-balloon and virtio-mem; I've cc'd some people who hopefully understand what the requirements on this function are and might have a view on what the not-Linux implementation should look like. (David Gilbert: git blame says you wrote this code :-)) thanks -- PMM
It probably does not indeed (maybe Solaris madvise does not provide but memcntl ?), I was just trying to fix the build only :-) On Sat, 18 Jul 2020 at 15:15, Peter Maydell <peter.maydell@linaro.org> wrote: > > On Sat, 18 Jul 2020 at 14:21, David CARLIER <devnexen@gmail.com> wrote: > > > > From a9e3cced279ae55a59847ba232f7828bc2479367 Mon Sep 17 00:00:00 2001 > > From: David Carlier <devnexen@gmail.com> > > Date: Sat, 18 Jul 2020 13:29:44 +0100 > > Subject: [PATCH 2/3] exec: posix_madvise usage on SunOS. > > > > with _XOPEN_SOURCE set, the older mman.h API based on caddr_t handling > > is not accessible thus using posix_madvise here. > > > > Signed-off-by: David Carlier <devnexen@gmail.com> > > --- > > exec.c | 8 ++++++++ > > 1 file changed, 8 insertions(+) > > > > diff --git a/exec.c b/exec.c > > index 6f381f98e2..0466a75b89 100644 > > --- a/exec.c > > +++ b/exec.c > > @@ -3964,7 +3964,15 @@ int ram_block_discard_range(RAMBlock *rb, > > uint64_t start, size_t length) > > * fallocate'd away). > > */ > > #if defined(CONFIG_MADVISE) > > +#if !defined(CONFIG_SOLARIS) > > ret = madvise(host_startaddr, length, MADV_DONTNEED); > > +#else > > + /* > > + * mmap and its caddr_t based api is not accessible > > + * with _XOPEN_SOURCE set on illumos > > + */ > > + ret = posix_madvise(host_startaddr, length, POSIX_MADV_DONTNEED); > > +#endif > > Hi. I'm not sure this patch will do the right thing, because > I don't think that Solaris's POSIX_MADV_DONTNEED provides > the semantics that this QEMU function says it needs. The > comment at the top of the function says: > > * Unmap pages of memory from start to start+length such that > * they a) read as 0, b) Trigger whatever fault mechanism > * the OS provides for postcopy. > * The pages must be unmapped by the end of the function. > > (Aside: the use of 'unmap' in this comment is a bit confusing, > because it clearly doesn't mean 'unmap' if it wants read-as-0. > And the reference to faults on postcopy is incomprehensible > to me: if memory is read-as-0 it isn't going to fault.) > > Linux's madvise(MADV_DONTNEED) does guarantee us this > read-as-zero behaviour. (It's a silly API choice that Linux > put this behaviour behind madvise, which is supposed to be > merely advisory, but that's how it is.) The Solaris > posix_madvise() manpage says it is merely advisory and > doesn't affect the behaviour of accesses to the memory. > > If posix_madvise() behaviour was OK in this function, the > right way to fix this would be to use qemu_madvise() > instead, which already provides this "if host has > madvise(), use it, otherwise use posix_madvise()" logic. > But I suspect that the direct madvise() here is deliberate. > > Side note: not sure the current code is correct for the > BSDs either -- they have madvise() but don't provide > Linux's really-read-as-zero guarantee for MADV_DONTNEED. > So we should probably be doing something else there, and > whatever that something-else is is probably also what > Solaris wants. > > We use ram_block_discard_range() only in migration and > in virtio-balloon and virtio-mem; I've cc'd some people > who hopefully understand what the requirements on this > function are and might have a view on what the not-Linux > implementation should look like. (David Gilbert: git > blame says you wrote this code :-)) > > thanks > -- PMM
(Copies in Dave Hildenbrand) * Peter Maydell (peter.maydell@linaro.org) wrote: > On Sat, 18 Jul 2020 at 14:21, David CARLIER <devnexen@gmail.com> wrote: > > > > From a9e3cced279ae55a59847ba232f7828bc2479367 Mon Sep 17 00:00:00 2001 > > From: David Carlier <devnexen@gmail.com> > > Date: Sat, 18 Jul 2020 13:29:44 +0100 > > Subject: [PATCH 2/3] exec: posix_madvise usage on SunOS. > > > > with _XOPEN_SOURCE set, the older mman.h API based on caddr_t handling > > is not accessible thus using posix_madvise here. > > > > Signed-off-by: David Carlier <devnexen@gmail.com> > > --- > > exec.c | 8 ++++++++ > > 1 file changed, 8 insertions(+) > > > > diff --git a/exec.c b/exec.c > > index 6f381f98e2..0466a75b89 100644 > > --- a/exec.c > > +++ b/exec.c > > @@ -3964,7 +3964,15 @@ int ram_block_discard_range(RAMBlock *rb, > > uint64_t start, size_t length) > > * fallocate'd away). > > */ > > #if defined(CONFIG_MADVISE) > > +#if !defined(CONFIG_SOLARIS) > > ret = madvise(host_startaddr, length, MADV_DONTNEED); > > +#else > > + /* > > + * mmap and its caddr_t based api is not accessible > > + * with _XOPEN_SOURCE set on illumos > > + */ > > + ret = posix_madvise(host_startaddr, length, POSIX_MADV_DONTNEED); > > +#endif > > Hi. I'm not sure this patch will do the right thing, because > I don't think that Solaris's POSIX_MADV_DONTNEED provides > the semantics that this QEMU function says it needs. The > comment at the top of the function says: > > * Unmap pages of memory from start to start+length such that > * they a) read as 0, b) Trigger whatever fault mechanism > * the OS provides for postcopy. > * The pages must be unmapped by the end of the function. This code has moved around a bit over it's life; joining the case needed by balloon and the case needed by postcopy. > (Aside: the use of 'unmap' in this comment is a bit confusing, > because it clearly doesn't mean 'unmap' if it wants read-as-0. > And the reference to faults on postcopy is incomprehensible > to me: if memory is read-as-0 it isn't going to fault.) I think because internally to Linux the behaviour is the same; this causes the mapping to disappear from the TLB so it faults; normally when reading the kernel resolves the fault and puts a read-as-zero page there, except if userfault was enabled for postcopy, in which case it gives us a kick and we service it. > Linux's madvise(MADV_DONTNEED) does guarantee us this > read-as-zero behaviour. (It's a silly API choice that Linux > put this behaviour behind madvise, which is supposed to be > merely advisory, but that's how it is.) Yes, I don't think there's any equivalent to madvise that guarantees anything. > The Solaris > posix_madvise() manpage says it is merely advisory and > doesn't affect the behaviour of accesses to the memory. > > If posix_madvise() behaviour was OK in this function, the > right way to fix this would be to use qemu_madvise() > instead, which already provides this "if host has > madvise(), use it, otherwise use posix_madvise()" logic. > But I suspect that the direct madvise() here is deliberate. Yes, but I can't remember the semantics fully - I think it was because we needed the guarantee at this point (and even Linux's posix madvise did something different??) I've got a note saying we didn't want to use qemu_madvise because we wanted to be sure we didn't get posix_madvise. > Side note: not sure the current code is correct for the > BSDs either -- they have madvise() but don't provide > Linux's really-read-as-zero guarantee for MADV_DONTNEED. > So we should probably be doing something else there, and > whatever that something-else is is probably also what > Solaris wants. > > We use ram_block_discard_range() only in migration and > in virtio-balloon and virtio-mem; I've cc'd some people > who hopefully understand what the requirements on this > function are and might have a view on what the not-Linux > implementation should look like. (David Gilbert: git > blame says you wrote this code :-)) Dave > > thanks > -- PMM > -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On 20.07.20 21:13, Dr. David Alan Gilbert wrote: > (Copies in Dave Hildenbrand) > > * Peter Maydell (peter.maydell@linaro.org) wrote: >> On Sat, 18 Jul 2020 at 14:21, David CARLIER <devnexen@gmail.com> wrote: >>> >>> From a9e3cced279ae55a59847ba232f7828bc2479367 Mon Sep 17 00:00:00 2001 >>> From: David Carlier <devnexen@gmail.com> >>> Date: Sat, 18 Jul 2020 13:29:44 +0100 >>> Subject: [PATCH 2/3] exec: posix_madvise usage on SunOS. >>> >>> with _XOPEN_SOURCE set, the older mman.h API based on caddr_t handling >>> is not accessible thus using posix_madvise here. >>> >>> Signed-off-by: David Carlier <devnexen@gmail.com> >>> --- >>> exec.c | 8 ++++++++ >>> 1 file changed, 8 insertions(+) >>> >>> diff --git a/exec.c b/exec.c >>> index 6f381f98e2..0466a75b89 100644 >>> --- a/exec.c >>> +++ b/exec.c >>> @@ -3964,7 +3964,15 @@ int ram_block_discard_range(RAMBlock *rb, >>> uint64_t start, size_t length) >>> * fallocate'd away). >>> */ >>> #if defined(CONFIG_MADVISE) >>> +#if !defined(CONFIG_SOLARIS) >>> ret = madvise(host_startaddr, length, MADV_DONTNEED); >>> +#else >>> + /* >>> + * mmap and its caddr_t based api is not accessible >>> + * with _XOPEN_SOURCE set on illumos >>> + */ >>> + ret = posix_madvise(host_startaddr, length, POSIX_MADV_DONTNEED); >>> +#endif >> >> Hi. I'm not sure this patch will do the right thing, because >> I don't think that Solaris's POSIX_MADV_DONTNEED provides >> the semantics that this QEMU function says it needs. The >> comment at the top of the function says: >> >> * Unmap pages of memory from start to start+length such that >> * they a) read as 0, b) Trigger whatever fault mechanism >> * the OS provides for postcopy. >> * The pages must be unmapped by the end of the function. > > This code has moved around a bit over it's life; joining the case > needed by balloon and the case needed by postcopy. > >> (Aside: the use of 'unmap' in this comment is a bit confusing, >> because it clearly doesn't mean 'unmap' if it wants read-as-0. >> And the reference to faults on postcopy is incomprehensible >> to me: if memory is read-as-0 it isn't going to fault.) > > I think because internally to Linux the behaviour is the same; > this causes the mapping to disappear from the TLB so it faults; > normally when reading the kernel resolves the fault and puts > a read-as-zero page there, except if userfault was enabled > for postcopy, in which case it gives us a kick and we service it. > >> Linux's madvise(MADV_DONTNEED) does guarantee us this >> read-as-zero behaviour. (It's a silly API choice that Linux >> put this behaviour behind madvise, which is supposed to be >> merely advisory, but that's how it is.) > > Yes, I don't think there's any equivalent to madvise > that guarantees anything. > >> The Solaris >> posix_madvise() manpage says it is merely advisory and >> doesn't affect the behaviour of accesses to the memory. >> >> If posix_madvise() behaviour was OK in this function, the >> right way to fix this would be to use qemu_madvise() >> instead, which already provides this "if host has >> madvise(), use it, otherwise use posix_madvise()" logic. >> But I suspect that the direct madvise() here is deliberate. > > Yes, but I can't remember the semantics fully - I think it was because > we needed the guarantee at this point (and even Linux's > posix madvise did something different??) > > I've got a note saying we didn't want to use > qemu_madvise because we wanted to be sure we didn't get > posix_madvise. > >> Side note: not sure the current code is correct for the >> BSDs either -- they have madvise() but don't provide >> Linux's really-read-as-zero guarantee for MADV_DONTNEED. >> So we should probably be doing something else there, and >> whatever that something-else is is probably also what >> Solaris wants. >> >> We use ram_block_discard_range() only in migration and >> in virtio-balloon and virtio-mem; I've cc'd some people >> who hopefully understand what the requirements on this >> function are and might have a view on what the not-Linux >> implementation should look like. (David Gilbert: git >> blame says you wrote this code :-)) virtio-mem depends on Linux (hw/virtio/Kconfig). I guess userfaultfd/postcopy is also not relevant in the context of SunOS. So what remains is virtio-balloon. virito-balloon ideally wants to discard the actual mapped pages to free up memory. When memory is re-accessed, a fresh page is faulted in (-> zero-page under Linux). Now, we already have other cases where it looks like "the balloon works" but it really doesn't. One example is using vfio+virtio-balloon under Linux - inflating the balloon is simply a NOP, no memory is actually discarded. I agree that POSIX_MADV_DONTNEED is not a proper match - different guarantees. If SunOS cannot implement ram_block_discard_range() as documented, we should disable it. I would suggest using ram_block_discard_disable(true) when under SunOS early during QEMU startup. In addition, we might want to return directly in ram_block_dicard_range(). We might also want to make virito-balloon depend on !SubOS.
On Tue, 21 Jul 2020 at 09:22, David Hildenbrand <david@redhat.com> wrote: > virtio-mem depends on Linux (hw/virtio/Kconfig). I guess > userfaultfd/postcopy is also not relevant in the context of SunOS. So > what remains is virtio-balloon. > > virito-balloon ideally wants to discard the actual mapped pages to free > up memory. When memory is re-accessed, a fresh page is faulted in (-> > zero-page under Linux). Now, we already have other cases where it looks > like "the balloon works" but it really doesn't. One example is using > vfio+virtio-balloon under Linux - inflating the balloon is simply a NOP, > no memory is actually discarded. > > I agree that POSIX_MADV_DONTNEED is not a proper match - different > guarantees. If SunOS cannot implement ram_block_discard_range() as > documented, we should disable it. Could we also improve the documentation comment to make it clearer what ram_block_discard_range() is actually supposed to be doing? At the moment I don't think you can figure it out from the comments, which are a confusing mix of claiming it unmaps memory and that it zeroes it. Is the Linux-specific stuff (userfaultfd) a *requirement* for the function, or just a "this would be nice" extra and a valid implementation would be eg "zero out the memory" ? thanks -- PMM
On 21.07.20 11:31, Peter Maydell wrote: > On Tue, 21 Jul 2020 at 09:22, David Hildenbrand <david@redhat.com> wrote: >> virtio-mem depends on Linux (hw/virtio/Kconfig). I guess >> userfaultfd/postcopy is also not relevant in the context of SunOS. So >> what remains is virtio-balloon. >> >> virito-balloon ideally wants to discard the actual mapped pages to free >> up memory. When memory is re-accessed, a fresh page is faulted in (-> >> zero-page under Linux). Now, we already have other cases where it looks >> like "the balloon works" but it really doesn't. One example is using >> vfio+virtio-balloon under Linux - inflating the balloon is simply a NOP, >> no memory is actually discarded. >> >> I agree that POSIX_MADV_DONTNEED is not a proper match - different >> guarantees. If SunOS cannot implement ram_block_discard_range() as >> documented, we should disable it. > > Could we also improve the documentation comment to make it clearer > what ram_block_discard_range() is actually supposed to be doing? > At the moment I don't think you can figure it out from the comments, > which are a confusing mix of claiming it unmaps memory and that it > zeroes it. Certainly. It really means (as the name suggests) "Discard the pages used for backing the virtual address range, effectively freeing them back to the OS. When accessing pages within the virtual address range again, populate fresh, zeroed-out pages, just as when accessing memory inside a new mmap for the first time.". It really tries to mimic the "populate on demand" mechanism used on fresh, anonymous mmaps. If no page is populated, a page fault in the OS is triggered and a fresh (physical) page is populated. Of course, when only reading, the single COW zero page can be used until written to the page. "Unmap" is a misleading terminology, I agree. > > Is the Linux-specific stuff (userfaultfd) a *requirement* for the > function, or just a "this would be nice" extra and a valid > implementation would be eg "zero out the memory" ? postcopy/userfaultfd really needs all backing pages for the VMA to be gone (discarded/zapped), so that a pagefault will be triggered. As postcopy will resolve the pagefault itself, it does not rely on the "zeroed" semantics. It will tell the OS which page to place. Supplying anything not zero already smells like the original request (discard the pages) was not properly implemented. Of course, one could supply random data to a user space process on a pagefault, when populating pages, but that already smells like a severe security issue ... Simply zeroing out memory is not what this function promises. It does not work with all three use cases: postcopy, virito-balloon and virtio-mem. We really have to discard.
diff --git a/exec.c b/exec.c index 6f381f98e2..0466a75b89 100644 --- a/exec.c +++ b/exec.c @@ -3964,7 +3964,15 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length) * fallocate'd away). */ #if defined(CONFIG_MADVISE) +#if !defined(CONFIG_SOLARIS) ret = madvise(host_startaddr, length, MADV_DONTNEED); +#else + /* + * mmap and its caddr_t based api is not accessible + * with _XOPEN_SOURCE set on illumos + */ + ret = posix_madvise(host_startaddr, length, POSIX_MADV_DONTNEED); +#endif if (ret) { ret = -errno;