Message ID | 20210217154844.12392-1-david@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Series | [RFC] mm/madvise: introduce MADV_POPULATE to prefault/prealloc memory |
On 2/17/21 7:48 AM, David Hildenbrand wrote:
> While MADV_DONTNEED and FALLOC_FL_PUNCH_HOLE provide us ways to reliably
> discard memory, there is no generic approach to populate ("preallocate")
> memory.
>
> Although mmap() supports MAP_POPULATE, it is not applicable to the concept
> of sparse memory mappings, where we want to do populate/discard
> dynamically and avoid expensive/problematic remappings. In addition,
> we never actually report error during the final populate phase - it is
> best-effort only.

Seems pretty sane to me.

But, I was surprised that MADV_WILLNEED was not mentioned. It might be
nice to touch on why MADV_WILLNEED is a bad choice for this
functionality? We could theoretically have it populate anonymous
mappings instead of just swapping in.

I guess it's possible that folks are using MADV_WILLNEED on sparse
mappings that they don't want to populate, but it would be nice to get
that in the changelog.

I was also a bit bummed to see the broad VM_IO/PFNMAP restriction show
up again. I was just looking at implementing pre-faulting for the new
SGX driver:

> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/cpu/sgx/driver.c

It has a vm_ops->fault handler, but the VMAs are VM_IO. It obviously
doesn't work with gup, though. Not a deal breaker, and something we could
certainly add to this later.
On 17.02.21 17:46, Dave Hansen wrote:
> On 2/17/21 7:48 AM, David Hildenbrand wrote:
>> While MADV_DONTNEED and FALLOC_FL_PUNCH_HOLE provide us ways to reliably
>> discard memory, there is no generic approach to populate ("preallocate")
>> memory.
>>
>> Although mmap() supports MAP_POPULATE, it is not applicable to the concept
>> of sparse memory mappings, where we want to do populate/discard
>> dynamically and avoid expensive/problematic remappings. In addition,
>> we never actually report error during the final populate phase - it is
>> best-effort only.
>
> Seems pretty sane to me.
>
> But, I was surprised that MADV_WILLNEED was not mentioned. It might be
> nice to touch on why MADV_WILLNEED is a bad choice for this
> functionality? We could theoretically have it populate anonymous
> mappings instead of just swapping in.

I stumbled over it, but it ended up looking like mixing in different
semantics. "Expect access in the near future." and "might be a good idea
to read some pages" vs. "Definitely populate/preallocate all memory, and
definitely fail if that is not possible.".

> I guess it's possible that folks are using MADV_WILLNEED on sparse
> mappings that they don't want to populate, but it would be nice to get
> that in the changelog.

Indeed: prime example is virtio-balloon in QEMU when deflating. Just
because we are deflating the balloon doesn't mean that the guest is
going to use all memory immediately - and that we want to actually
consume memory immediately. ... we call MADV_WILLNEED unconditionally on
any memory backing when deflating ...

I'll definitely add that to the changelog - thanks.

> I was also a bit bummed to see the broad VM_IO/PFNMAP restriction show
> up again. I was just looking at implementing pre-faulting for the new
> SGX driver:

I added that because __mm_populate() similarly skips over VM_IO |
VM_PFNMAP. So it mimics the existing "populate semantics" we have.

>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/cpu/sgx/driver.c
>
> It has a vm_ops->fault handler, but the VMAs are VM_IO. It obviously
> doesn't work with gup, though. Not a deal breaker, and something we could
> certainly add to this later.

I assume you would then also want to support MAP_POPULATE, right?
Because it ends up using __mm_populate() and would not work.

Thanks!
+CC linux-api, please do so on further revisions. Keeping rest of the e-mail.

On 2/17/21 4:48 PM, David Hildenbrand wrote:
> When we manage sparse memory mappings dynamically in user space - also
> sometimes involving MADV_NORESERVE - we want to dynamically populate/
> discard memory inside such a sparse memory region. Example users are
> hypervisors (especially implementing memory ballooning or similar
> technologies like virtio-mem) and memory allocators. In addition, we want
> to fail in a nice way if populating does not succeed because we are out of
> backend memory (which can happen easily with file-based mappings,
> especially tmpfs and hugetlbfs).
>
> While MADV_DONTNEED and FALLOC_FL_PUNCH_HOLE provide us ways to reliably
> discard memory, there is no generic approach to populate ("preallocate")
> memory.
>
> Although mmap() supports MAP_POPULATE, it is not applicable to the concept
> of sparse memory mappings, where we want to do populate/discard
> dynamically and avoid expensive/problematic remappings. In addition,
> we never actually report error during the final populate phase - it is
> best-effort only.
>
> fallocate() can be used to preallocate file-based memory and fail in a safe
> way. However, it is less useful for private mappings on anonymous files
> due to COW semantics. For example, using fallocate() to preallocate memory
> on an anonymous memfd file that is mapped MAP_PRIVATE results in a double
> memory consumption when actually writing via the mapping. In addition,
> fallocate() does not actually populate page tables, so we still always
> have to resolve minor faults on first access.
>
> Because we don't have a proper interface, what applications
> (like QEMU and databases) end up doing is touching (i.e., writing) all
> individual pages. However, it requires expensive signal handling (SIGBUS);
> for example, this is problematic in hypervisors like QEMU where SIGBUS
> handlers might already be used by other subsystems concurrently to e.g.,
> handle hardware errors. "Simply" doing preallocation from another thread
> is not that easy.
>
> Let's introduce MADV_POPULATE with the following semantics
> 1. MADV_POPULATED does not work on PROT_NONE and special VMAs. It works
>    on everything else.
> 2. Errors during MADV_POPULATED (especially OOM) are reported. If we hit
>    hardware errors on pages, ignore them - nothing we really can or
>    should do.
> 3. On errors during MADV_POPULATED, some memory might have been
>    populated. Callers have to clean up if they care.
> 4. Concurrent changes to the virtual memory layour are tolerated - we
>    process each and every PFN only once, though.
> 5. If MADV_POPULATE succeeds, all memory in the range can be accessed
>    without SIGBUS. (of course, not if user space changed mappings in the
>    meantime or KSM kicked in on anonymous memory).
>
> Although sparse memory mappings are the primary use case, this will
> also be useful for ordinary preallocations where MAP_POPULATE is not
> desired (e.g., in QEMU, where users can trigger preallocation of
> guest RAM after the mapping was created).
>
> Looking at the history, MADV_POPULATE was already proposed in 2013 [1],
> however, the main motivation back then was performance improvements
> (which should also still be the case, but it's a secondary concern).
>
> Basic functionality was tested with:
> - anonymous memory
> - MAP_PRIVATE on anonymous file via memfd
> - MAP_SHARED on anonymous file via memfd
> - MAP_PRIVATE on anonymous hugetlbfs file via memfd
> - MAP_SHARED on anonymous hugetlbfs file via memfd
> - MAP_PRIVATE on tmpfs/shmem file (we end up with double memory consumption
>   though, as the actual file gets populated with zeroes)
> - MAP_SHARED on tmpfs/shmem file
>
> Note: For populating/preallocating zeroed-out memory while userfaultfd is
> active, it's even faster to first use fallocate() or to place zeroed pages
> via userfaultfd APIs. Otherwise, we'll have to route every fault while
> populating via the userfaultfd handler.
>
> [1] https://lkml.org/lkml/2013/6/27/698
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Jann Horn <jannh@google.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Rik van Riel <riel@surriel.com>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Richard Henderson <rth@twiddle.net>
> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
> Cc: Matt Turner <mattst88@gmail.com>
> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
> Cc: Helge Deller <deller@gmx.de>
> Cc: Chris Zankel <chris@zankel.net>
> Cc: Max Filippov <jcmvbkbc@gmail.com>
> Cc: linux-alpha@vger.kernel.org
> Cc: linux-mips@vger.kernel.org
> Cc: linux-parisc@vger.kernel.org
> Cc: linux-xtensa@linux-xtensa.org
> Cc: linux-arch@vger.kernel.org
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>
> If we agree that this makes sense I'll do more testing to see if we
> are missing any return value handling and prepare a man page update to
> document the semantics.
>
> Thoughts?
>
> ---
>  arch/alpha/include/uapi/asm/mman.h     |  2 +
>  arch/mips/include/uapi/asm/mman.h      |  2 +
>  arch/parisc/include/uapi/asm/mman.h    |  2 +
>  arch/xtensa/include/uapi/asm/mman.h    |  2 +
>  include/uapi/asm-generic/mman-common.h |  2 +
>  mm/madvise.c                           | 70 ++++++++++++++++++++++++++
>  6 files changed, 80 insertions(+)
>
> diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> index a18ec7f63888..e90eeb5e6cf1 100644
> --- a/arch/alpha/include/uapi/asm/mman.h
> +++ b/arch/alpha/include/uapi/asm/mman.h
> @@ -71,6 +71,8 @@
>  #define MADV_COLD       20              /* deactivate these pages */
>  #define MADV_PAGEOUT    21              /* reclaim these pages */
>
> +#define MADV_POPULATE   22              /* populate pages */
> +
>  /* compatibility flags */
>  #define MAP_FILE        0
>
> diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> index 57dc2ac4f8bd..b928becc5308 100644
> --- a/arch/mips/include/uapi/asm/mman.h
> +++ b/arch/mips/include/uapi/asm/mman.h
> @@ -98,6 +98,8 @@
>  #define MADV_COLD       20              /* deactivate these pages */
>  #define MADV_PAGEOUT    21              /* reclaim these pages */
>
> +#define MADV_POPULATE   22              /* populate pages */
> +
>  /* compatibility flags */
>  #define MAP_FILE        0
>
> diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> index ab78cba446ed..9d3a56044287 100644
> --- a/arch/parisc/include/uapi/asm/mman.h
> +++ b/arch/parisc/include/uapi/asm/mman.h
> @@ -52,6 +52,8 @@
>  #define MADV_COLD       20              /* deactivate these pages */
>  #define MADV_PAGEOUT    21              /* reclaim these pages */
>
> +#define MADV_POPULATE   22              /* populate pages */
> +
>  #define MADV_MERGEABLE   65             /* KSM may merge identical pages */
>  #define MADV_UNMERGEABLE 66             /* KSM may not merge identical pages */
>
> diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> index e5e643752947..3169b1be8920 100644
> --- a/arch/xtensa/include/uapi/asm/mman.h
> +++ b/arch/xtensa/include/uapi/asm/mman.h
> @@ -106,6 +106,8 @@
>  #define MADV_COLD       20              /* deactivate these pages */
>  #define MADV_PAGEOUT    21              /* reclaim these pages */
>
> +#define MADV_POPULATE   22              /* populate pages */
> +
>  /* compatibility flags */
>  #define MAP_FILE        0
>
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index f94f65d429be..fa617fd0d733 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -72,6 +72,8 @@
>  #define MADV_COLD       20              /* deactivate these pages */
>  #define MADV_PAGEOUT    21              /* reclaim these pages */
>
> +#define MADV_POPULATE   22              /* populate pages */
> +
>  /* compatibility flags */
>  #define MAP_FILE        0
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 6a660858784b..f76fdd6fcf10 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -53,6 +53,7 @@ static int madvise_need_mmap_write(int behavior)
>          case MADV_COLD:
>          case MADV_PAGEOUT:
>          case MADV_FREE:
> +        case MADV_POPULATE:
>                  return 0;
>          default:
>                  /* be safe, default to 1. list exceptions explicitly */
> @@ -821,6 +822,72 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
>                  return -EINVAL;
>  }
>
> +static long madvise_populate(struct vm_area_struct *vma,
> +                             struct vm_area_struct **prev,
> +                             unsigned long start, unsigned long end)
> +{
> +        struct mm_struct *mm = vma->vm_mm;
> +        unsigned long tmp_end;
> +        int locked = 1;
> +        long pages;
> +
> +        *prev = vma;
> +
> +        while (start < end) {
> +                /*
> +                 * We might have temporarily dropped the lock. For example,
> +                 * our VMA might have been split.
> +                 */
> +                if (!vma || start >= vma->vm_end) {
> +                        vma = find_vma(mm, start);
> +                        if (!vma)
> +                                return -ENOMEM;
> +                }
> +
> +                /* Bail out on incompatible VMA types. */
> +                if (vma->vm_flags & (VM_IO | VM_PFNMAP) ||
> +                    !vma_is_accessible(vma)) {
> +                        return -EINVAL;
> +                }
> +
> +                /*
> +                 * Populate pages and take care of VM_LOCKED: simulate user
> +                 * space access.
> +                 *
> +                 * For private, writable mappings, trigger a write fault to
> +                 * break COW (i.e., shared zeropage). For other mappings (i.e.,
> +                 * read-only, shared), trigger a read fault.
> +                 */
> +                tmp_end = min_t(unsigned long, end, vma->vm_end);
> +                pages = populate_vma_page_range(vma, start, tmp_end, &locked);
> +                if (!locked) {
> +                        mmap_read_lock(mm);
> +                        *prev = NULL;
> +                        vma = NULL;
> +                }
> +                if (pages < 0) {
> +                        switch (pages) {
> +                        case -EINTR:
> +                        case -ENOMEM:
> +                                return pages;
> +                        case -EHWPOISON:
> +                                /* Skip over any poisoned pages. */
> +                                start += PAGE_SIZE;
> +                                continue;
> +                        case -EBUSY:
> +                        case -EAGAIN:
> +                                continue;
> +                        default:
> +                                pr_warn_once("%s: unhandled return value: %ld\n",
> +                                             __func__, pages);
> +                                return -ENOMEM;
> +                        }
> +                }
> +                start += pages * PAGE_SIZE;
> +        }
> +        return 0;
> +}
> +
>  /*
>   * Application wants to free up the pages and associated backing store.
>   * This is effectively punching a hole into the middle of a file.
> @@ -934,6 +1001,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
>          case MADV_FREE:
>          case MADV_DONTNEED:
>                  return madvise_dontneed_free(vma, prev, start, end, behavior);
> +        case MADV_POPULATE:
> +                return madvise_populate(vma, prev, start, end);
>          default:
>                  return madvise_behavior(vma, prev, start, end, behavior);
>          }
> @@ -954,6 +1023,7 @@ madvise_behavior_valid(int behavior)
>          case MADV_FREE:
>          case MADV_COLD:
>          case MADV_PAGEOUT:
> +        case MADV_POPULATE:
>  #ifdef CONFIG_KSM
>          case MADV_MERGEABLE:
>          case MADV_UNMERGEABLE:
>
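For readers unfamiliar with the workaround the commit message describes ("touching (i.e., writing) all individual pages"), the following is a minimal user-space sketch of that status quo, not code from the patch or from QEMU. The helper name `touch_pages` is hypothetical. It rewrites one byte per page with its own value so that a write fault is triggered without changing the contents; it assumes no other thread writes to the range concurrently, and it installs no SIGBUS handler, which is exactly the part the thread calls expensive and fragile.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: write-touch one byte per page in [addr, addr + len)
 * so that every page in the range gets faulted in.
 *
 * Each byte is rewritten with its current value, so the contents stay intact
 * as long as nobody else modifies the range at the same time.
 *
 * If the backing store (e.g. tmpfs or hugetlbfs) runs out of memory, the
 * process receives SIGBUS. This sketch does not handle that signal.
 */
static void touch_pages(void *addr, size_t len)
{
    const size_t page = (size_t)sysconf(_SC_PAGESIZE);
    volatile unsigned char *p = addr;

    assert(addr != NULL);

    for (size_t off = 0; off < len; off += page)
        p[off] = p[off];
}
```

A caller would typically pass the address and length returned by an earlier mmap() call. The proposed madvise() advice is meant to replace this loop with a single system call that reports failure through its return value.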
On Wed 17-02-21 16:48:44, David Hildenbrand wrote:
> When we manage sparse memory mappings dynamically in user space - also
> sometimes involving MADV_NORESERVE - we want to dynamically populate/

Just wondering what is MADV_NORESERVE? I do not see anything like that
in the Linus tree. Did you mean MAP_NORESERVE?

> discard memory inside such a sparse memory region. Example users are
> hypervisors (especially implementing memory ballooning or similar
> technologies like virtio-mem) and memory allocators. In addition, we want
> to fail in a nice way if populating does not succeed because we are out of
> backend memory (which can happen easily with file-based mappings,
> especially tmpfs and hugetlbfs).

By "fail in a nice way" you mean failing before a #PF would fail and
raise SIGBUS, which would be harder to handle?

[...]
> Because we don't have a proper interface, what applications
> (like QEMU and databases) end up doing is touching (i.e., writing) all
> individual pages. However, it requires expensive signal handling (SIGBUS);
> for example, this is problematic in hypervisors like QEMU where SIGBUS
> handlers might already be used by other subsystems concurrently to e.g.,
> handle hardware errors. "Simply" doing preallocation from another thread
> is not that easy.

OK, that clarifies my above question.

>
> Let's introduce MADV_POPULATE with the following semantics
> 1. MADV_POPULATED does not work on PROT_NONE and special VMAs. It works
>    on everything else.

It would be better to clarify what "does not work" means. I assume those
are ignored and do not report any error?

> 2. Errors during MADV_POPULATED (especially OOM) are reported.

How do you want to achieve that? gup/page fault handler will allocate
memory and trigger the oom without caller noticing that. You would
somehow have to weaken the allocation context to GFP_RETRY_MAYFAIL or
NORETRY to achieve the error handling.

>    If we hit
>    hardware errors on pages, ignore them - nothing we really can or
>    should do.
> 3. On errors during MADV_POPULATED, some memory might have been
>    populated. Callers have to clean up if they care.

How does caller find out? madvise reports 0 on success so how do you
find out how much has been populated?

> 4. Concurrent changes to the virtual memory layour are tolerated - we
>    process each and every PFN only once, though.

I do not understand this. madvise is about virtual address space not a
physical address space.

> 5. If MADV_POPULATE succeeds, all memory in the range can be accessed
>    without SIGBUS. (of course, not if user space changed mappings in the
>    meantime or KSM kicked in on anonymous memory).

I do not see how KSM would change anything here and maybe it is not
really important to mention it. KSM should be really transparent from
the user space POV. Parallel and destructive virtual address space
operations are also expected to change the outcome and there is nothing
the kernel can do about that to provide any meaningful guarantees. I
guess we want to assume a reasonable userspace behavior here.

> Although sparse memory mappings are the primary use case, this will
> also be useful for ordinary preallocations where MAP_POPULATE is not
> desired (e.g., in QEMU, where users can trigger preallocation of
> guest RAM after the mapping was created).
>
> Looking at the history, MADV_POPULATE was already proposed in 2013 [1],
> however, the main motivation back then was performance improvements
> (which should also still be the case, but it's a secondary concern).

Well, I think it is more of a concern than in the pre-Spectre era when
syscalls were quite cheap.
On 18.02.21 11:25, Michal Hocko wrote:
> On Wed 17-02-21 16:48:44, David Hildenbrand wrote:
>> When we manage sparse memory mappings dynamically in user space - also
>> sometimes involving MADV_NORESERVE - we want to dynamically populate/
>
> Just wondering what is MADV_NORESERVE? I do not see anything like that
> in the Linus tree. Did you mean MAP_NORESERVE?

Most certainly, thanks :)

>
>> discard memory inside such a sparse memory region. Example users are
>> hypervisors (especially implementing memory ballooning or similar
>> technologies like virtio-mem) and memory allocators. In addition, we want
>> to fail in a nice way if populating does not succeed because we are out of
>> backend memory (which can happen easily with file-based mappings,
>> especially tmpfs and hugetlbfs).
>
> By "fail in a nice way" you mean failing before a #PF would fail and
> raise SIGBUS, which would be harder to handle?

Yes.

>
> [...]
>> Because we don't have a proper interface, what applications
>> (like QEMU and databases) end up doing is touching (i.e., writing) all
>> individual pages. However, it requires expensive signal handling (SIGBUS);
>> for example, this is problematic in hypervisors like QEMU where SIGBUS
>> handlers might already be used by other subsystems concurrently to e.g.,
>> handle hardware errors. "Simply" doing preallocation from another thread
>> is not that easy.
>
> OK, that clarifies my above question.
>
>>
>> Let's introduce MADV_POPULATE with the following semantics
>> 1. MADV_POPULATED does not work on PROT_NONE and special VMAs. It works
>>    on everything else.
>
> It would be better to clarify what "does not work" means. I assume those
> are ignored and do not report any error?

I'm currently preparing the man page. "Fail with -ENOMEM" (like
MADV_DONTNEED or MADV_REMOVE)

>
>> 2. Errors during MADV_POPULATED (especially OOM) are reported.
>
> How do you want to achieve that? gup/page fault handler will allocate
> memory and trigger the oom without caller noticing that. You would
> somehow have to weaken the allocation context to GFP_RETRY_MAYFAIL or
> NORETRY to achieve the error handling.

Okay, I should be clearer here (again, I'm realizing this as well while
I create the man page). "OOM" is confusing: the goal is to avoid SIGBUS
at runtime - like we would get on actual file systems/shmem/hugetlbfs
when preallocating.

It cannot save us from the actual OOM killer. To handle anonymous memory
more reliably I'll need other means as well (dynamic swap space
allocation for sparse mappings).

>
>>    If we hit
>>    hardware errors on pages, ignore them - nothing we really can or
>>    should do.
>> 3. On errors during MADV_POPULATED, some memory might have been
>>    populated. Callers have to clean up if they care.
>
> How does caller find out? madvise reports 0 on success so how do you
> find out how much has been populated?

If there is an error, something might have been populated. In my QEMU
implementation, I simply discard the range again, good enough. I don't
think we need to really indicate "error and populated" or "error and not
populated".

>
>> 4. Concurrent changes to the virtual memory layour are tolerated - we
>>    process each and every PFN only once, though.
>
> I do not understand this. madvise is about virtual address space not a
> physical address space.

What I wanted to express: if we detect a change in the mapping we don't
restart at the beginning, we always make forward progress. We process
each virtual address once (on a per-page basis, thus I accidentally used
"PFN").

>
>> 5. If MADV_POPULATE succeeds, all memory in the range can be accessed
>>    without SIGBUS. (of course, not if user space changed mappings in the
>>    meantime or KSM kicked in on anonymous memory).
>
> I do not see how KSM would change anything here and maybe it is not
> really important to mention it. KSM should be really transparent from
> the user space POV. Parallel and destructive virtual address space
> operations are also expected to change the outcome and there is nothing
> the kernel can do about that to provide any meaningful guarantees. I
> guess we want to assume a reasonable userspace behavior here.

It's just a note that we cannot protect from someone interfering
(discard/ksm/whatever). I'm making that clearer in the cover letter.

Thanks!
>>>    If we hit
>>>    hardware errors on pages, ignore them - nothing we really can or
>>>    should do.
>>> 3. On errors during MADV_POPULATED, some memory might have been
>>>    populated. Callers have to clean up if they care.
>>
>> How does caller find out? madvise reports 0 on success so how do you
>> find out how much has been populated?
>
> If there is an error, something might have been populated. In my QEMU
> implementation, I simply discard the range again, good enough. I don't
> think we need to really indicate "error and populated" or "error and not
> populated".

Clarifying again: if madvise(MADV_POPULATE) succeeds, it returns 0. If
there was a problem populating memory, it returns -ENOMEM (similar to
MADV_WILLNEED). Callers can detect the error and discard.
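The "detect the error and discard" pattern can be sketched from user space as follows. This is an illustration under stated assumptions, not the QEMU implementation: the helper name `populate_or_discard` is hypothetical, the advice value 22 is copied from this RFC patch and is not a stable ABI, and MADV_DONTNEED is only the right discard primitive for private anonymous memory (a shared file mapping would need fallocate() with FALLOC_FL_PUNCH_HOLE instead).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Assumption: advice value taken from the RFC patch in this thread. */
#define RFC_MADV_POPULATE 22

/*
 * Hypothetical helper: populate [addr, addr + len) of a private anonymous
 * mapping. On failure, discard whatever was populated so that the range is
 * sparse again.
 *
 * Returns 0 on success or a negative errno value. A kernel that does not
 * know the advice value is expected to report EINVAL.
 */
static int populate_or_discard(void *addr, size_t len)
{
    int err;

    assert(addr != NULL);

    if (madvise(addr, len, RFC_MADV_POPULATE) == 0)
        return 0;

    err = errno;
    /* Best-effort cleanup: part of the range may have been populated. */
    (void)madvise(addr, len, MADV_DONTNEED);
    return -err;
}
```

Because the interface does not say how much was populated before the failure, the sketch discards the whole range rather than trying to track partial progress, which matches the "simply discard the range again" approach described above.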
>> Let's introduce MADV_POPULATE with the following semantics
>> 1. MADV_POPULATED does not work on PROT_NONE and special VMAs. It works
>>    on everything else.
>> 2. Errors during MADV_POPULATED (especially OOM) are reported. If we hit
>>    hardware errors on pages, ignore them - nothing we really can or
>>    should do.
>> 3. On errors during MADV_POPULATED, some memory might have been
>>    populated. Callers have to clean up if they care.
>> 4. Concurrent changes to the virtual memory layour are tolerated - we
                                                    ^t
>>    process each and every PFN only once, though.
>> 5. If MADV_POPULATE succeeds, all memory in the range can be accessed
>>    without SIGBUS. (of course, not if user space changed mappings in the
>>    meantime or KSM kicked in on anonymous memory).

You are talking both about MADV_POPULATE and MADV_POPULATED here.

Eike
On Thu 18-02-21 11:44:41, David Hildenbrand wrote:
> On 18.02.21 11:25, Michal Hocko wrote:
> > On Wed 17-02-21 16:48:44, David Hildenbrand wrote:
> > > When we manage sparse memory mappings dynamically in user space - also
> > > sometimes involving MADV_NORESERVE - we want to dynamically populate/
> >
> > Just wondering what is MADV_NORESERVE? I do not see anything like that
> > in the Linus tree. Did you mean MAP_NORESERVE?
>
> Most certainly, thanks :)

OK, good, I thought I had missed something.

[...]
> > > 2. Errors during MADV_POPULATED (especially OOM) are reported.
> >
> > How do you want to achieve that? gup/page fault handler will allocate
> > memory and trigger the oom without caller noticing that. You would
> > somehow have to weaken the allocation context to GFP_RETRY_MAYFAIL or
> > NORETRY to achieve the error handling.
>
> Okay, I should be more clear here (again, I'm realizing this as well while I
> create the man page), OOM is confusing: avoid SIGBUS at runtime - like we
> would get on actual file systems/shmem/hugetlbfs when preallocating.

Yes, preventing SIGBUS for unreserved mappings is a reasonable
expectation. Regarding OOM, chances are off, I am afraid. We used to have
a weaker model for MAP_POPULATE for memcg oom in the past and it turned
out more problematic than useful.

> It cannot save us from the actual OOM killer. To handle anonymous memory
> more reliable I'll need other means as well (dynamic swap space allocation
> for sparse mappings).
>
> >
> > >    If we hit
> > >    hardware errors on pages, ignore them - nothing we really can or
> > >    should do.
> > > 3. On errors during MADV_POPULATED, some memory might have been
> > >    populated. Callers have to clean up if they care.
> >
> > How does caller find out? madvise reports 0 on success so how do you
> > find out how much has been populated?
>
> If there is an error, something might have been populated. In my QEMU
> implementation, I simply discard the range again, good enough. I don't think
> we need to really indicate "error and populated" or "error and not
> populated".

Agreed. The wording just suggests that the syscall actually provides
some means for an effective way to handle those errors. Maybe you should
just stick with the first sentence and drop the second.

> > > 4. Concurrent changes to the virtual memory layour are tolerated - we
> > >    process each and every PFN only once, though.
> >
> > I do not understand this. madvise is about virtual address space not a
> > physical address space.
>
> What I wanted to express: if we detect a change in the mapping we don't
> restart at the beginning, we always make forward progress. We process each
> virtual address once (on a per-page basis, thus I accidentally used "PFN").

This is an implicit assumption. Your range can have the same page mapped
several times in the given address range and all you care about is that
you fault those which are not present during the virtual address space
walk. Your syscall can return and a large part of the address space
might be unpopulated because memory reclaim just dropped those pages and
that would be fine. This shouldn't really imply memory presence - mlock
does that.

> > > 5. If MADV_POPULATE succeeds, all memory in the range can be accessed
> > >    without SIGBUS. (of course, not if user space changed mappings in the
> > >    meantime or KSM kicked in on anonymous memory).
> >
> > I do not see how KSM would change anything here and maybe it is not
> > really important to mention it. KSM should be really transparent from
> > the user space POV. Parallel and destructive virtual address space
> > operations are also expected to change the outcome and there is nothing
> > the kernel can do about that to provide any meaningful guarantees. I
> > guess we want to assume a reasonable userspace behavior here.
>
> It's just a note that we cannot protect from someone interfering
> (discard/ksm/whatever). I'm making that clearer in the cover letter.

Again, that is an implicit expectation. madvise will not work for anybody
shooting their own foot.
> On 18.02.2021 at 12:15, Rolf Eike Beer <eike-kernel@sf-tec.de> wrote:
>
>>> Let's introduce MADV_POPULATE with the following semantics
>>> 1. MADV_POPULATED does not work on PROT_NONE and special VMAs. It works
>>>    on everything else.
>>> 2. Errors during MADV_POPULATED (especially OOM) are reported. If we hit
>>>    hardware errors on pages, ignore them - nothing we really can or
>>>    should do.
>>> 3. On errors during MADV_POPULATED, some memory might have been
>>>    populated. Callers have to clean up if they care.
>>> 4. Concurrent changes to the virtual memory layour are tolerated - we
>                                                     ^t
>>>    process each and every PFN only once, though.
>>> 5. If MADV_POPULATE succeeds, all memory in the range can be accessed
>>>    without SIGBUS. (of course, not if user space changed mappings in the
>>>    meantime or KSM kicked in on anonymous memory).
>
> You are talking both about MADV_POPULATE and MADV_POPULATED here.
>

Already fixed up :) thanks!

> Eike
>
On Thu 18-02-21 11:54:48, David Hildenbrand wrote:
> > > >    If we hit
> > > >    hardware errors on pages, ignore them - nothing we really can or
> > > >    should do.
> > > > 3. On errors during MADV_POPULATED, some memory might have been
> > > >    populated. Callers have to clean up if they care.
> > >
> > > How does caller find out? madvise reports 0 on success so how do you
> > > find out how much has been populated?
> >
> > If there is an error, something might have been populated. In my QEMU
> > implementation, I simply discard the range again, good enough. I don't
> > think we need to really indicate "error and populated" or "error and not
> > populated".
>
> Clarifying again: if madvise(MADV_POPULATE) succeeds, it returns 0. If
> there was a problem populating memory, it returns -ENOMEM (similar to
> MADV_WILLNEED). Callers can detect the error and discard.

As I responded to the previous mail, I wouldn't really bother telling
callers what they should do. The interface will not give them any means
to identify the error. They just have to live with the fact that the
operation has failed.
>>>> If we hit >>>> hardware errors on pages, ignore them - nothing we really can or >>>> should do. >>>> 3. On errors during MADV_POPULATED, some memory might have been >>>> populated. Callers have to clean up if they care. >>> >>> How does caller find out? madvise reports 0 on success so how do you >>> find out how much has been populated? >> >> If there is an error, something might have been populated. In my QEMU >> implementation, I simply discard the range again, good enough. I don't think >> we need to really indicate "error and populated" or "error and not >> populated". > > Agreed. The wording just suggests that the syscall actually provides any > means for an effective way to handle those errors. Maybe you should just > stick with the first sentence and drop the second. Makes sense. "On errors during MADV_POPULATE, some memory might have been populated." > >>>> 4. Concurrent changes to the virtual memory layour are tolerated - we >>>> process each and every PFN only once, though. >>> >>> I do not understand this. madvise is about virtual address space not a >>> physical address space. >> >> What I wanted to express: if we detect a change in the mapping we don't >> restart at the beginning, we always make forward progress. We process each >> virtual address once (on a per-page basis, thus I accidentally used "PFN"). > > This is an implicit assumption. Your range can have the same page mapped > several times in the given address range and all you care about is that > you fault those which are not present during the virtual address space > walk. Your syscall can return and large part of the address space might > be unpopulated because memory reclaim just dropped those pages and that > would be fine. This shouldn't really imply memory presence - mlock does > that. "Concurrent changes to the virtual memory layout are tolerated. The range is processed exactly once." > >>>> 5. If MADV_POPULATE succeeds, all memory in the range can be accessed >>>> without SIGBUS. 
(of course, not if user space changed mappings in the
>>>> meantime or KSM kicked in on anonymous memory).
>>>
>>> I do not see how KSM would change anything here and maybe it is not
>>> really important to mention it. KSM should be really transparent from
>>> the user space POV. Parallel and destructive virtual address space
>>> operations are also expected to change the outcome and there is nothing
>>> the kernel can do about that to provide any meaningful guarantees. I guess
>>> we want to assume reasonable userspace behavior here.
>>
>> It's just a note that we cannot protect from someone interfering
>> (discard/ksm/whatever). I'm making that clearer in the cover letter.
>
> Again, that is an implicit expectation. madvise will not work for anybody
> shooting their own foot.

Okay, I'll drop that part, thanks!
Hi, David,

On Wed, Feb 17, 2021 at 04:48:44PM +0100, David Hildenbrand wrote:
> When we manage sparse memory mappings dynamically in user space - also
> sometimes involving MADV_NORESERVE - we want to dynamically populate/
> discard memory inside such a sparse memory region. Example users are
> hypervisors (especially implementing memory ballooning or similar
> technologies like virtio-mem) and memory allocators. In addition, we want
> to fail in a nice way if populating does not succeed because we are out of
> backend memory (which can happen easily with file-based mappings,
> especially tmpfs and hugetlbfs).

Could you explain a bit more how you plan to use this new interface for
the virtio-balloon scenario?

Meanwhile, here you seem to be talking about file-backed memory, but later
it sounds more like this is for anonymous memory, so I'm slightly confused.

Thanks,
On 18.02.21 23:59, Peter Xu wrote:
> Hi, David,
>
> On Wed, Feb 17, 2021 at 04:48:44PM +0100, David Hildenbrand wrote:
>> When we manage sparse memory mappings dynamically in user space - also
>> sometimes involving MADV_NORESERVE - we want to dynamically populate/
>> discard memory inside such a sparse memory region. Example users are
>> hypervisors (especially implementing memory ballooning or similar
>> technologies like virtio-mem) and memory allocators. In addition, we want
>> to fail in a nice way if populating does not succeed because we are out of
>> backend memory (which can happen easily with file-based mappings,
>> especially tmpfs and hugetlbfs).
>
> Could you explain a bit more on how do you plan to use this new interface for
> the virtio-balloon scenario?

Sure, that will bring up an interesting point to discuss
(MADV_POPULATE_WRITE).

I'm planning on using it in virtio-mem: whenever the guest requests the
hypervisor (via a virtio-mem device) to make specific blocks available
("plug"), I want to have a configurable option ("populate=on" /
"prealloc=on") to perform safety checks ("prealloc") and populate page
tables.

This becomes especially relevant for private/shared hugetlbfs and shared
files/shmem where we have a limited pool size (e.g., huge pages, tmpfs size,
filesystem size). But it will also come in handy when just preallocating
(esp. zeroing) anonymous memory.

For virtio-balloon it is not applicable because it really only supports
anonymous memory and we cannot fail requests to deflate ...

--- Example ---

Example: Assume the guest requests to make 128 MB available and we're using
hugetlbfs. Assume we're out of huge pages in the hypervisor - we want to
fail the request - I want to do some kind of preallocation.

So I could do fallocate() on anything that's MAP_SHARED, but not on anything
that's MAP_PRIVATE. hugetlbfs via memfd() cannot be preallocated without
going via SIGBUS handlers.
--- QEMU memory configurations ---

I see the following combinations relevant in QEMU that I want to support
with virtio-mem:

1) MAP_PRIVATE anonymous memory
2) MAP_PRIVATE on hugetlbfs (esp. via memfd)
3) MAP_SHARED on hugetlbfs (esp. via memfd)
4) MAP_SHARED on shmem (file / memfd)
5) MAP_SHARED on some sparse file.

Other MAP_PRIVATE mappings barely make any sense to me - "read the file and
write to page cache" is not really applicable to VM RAM (not to mention
doing fallocate(PUNCH_HOLE) that invalidates the private copies of all other
mappings on that file).

--- Ways to populate/preallocate ---

I see the following ways to populate/preallocate:

a) MADV_POPULATE: write fault on writable MAP_PRIVATE, read fault on
   MAP_SHARED
b) Writing to MAP_PRIVATE | MAP_SHARED from user space.
c) (below) MADV_POPULATE_WRITE: write fault on writable MAP_PRIVATE |
   MAP_SHARED

Especially, 2) is kind of weird as implemented in QEMU
(util/oslib-posix.c:do_touch_pages):

"Read & write back the same value, so we don't corrupt existing user/app
data ... TODO: get a better solution from kernel so we don't need to write
at all so we don't cause wear on the storage backing the region..."

So if we have zero, we write zero. We'll COW pages, triggering a write fault
- and that's the only good thing about it. For example, similar to
MADV_POPULATE, nothing stops KSM from merging anonymous pages again. So for
anonymous memory the actual write is not helpful at all. Similarly for
hugetlbfs, the actual write is not necessary - but there is no other way to
really achieve the goal.

--- How MADV_POPULATE is useful ---

With virtio-mem, our VM will usually write to memory before it reads it.

With 1) and 2) it does exactly what I want: trigger COW / allocate memory
and trigger a write fault. The only issue with 1) is that KSM might come
around and undo our work - but that could only be avoided by writing random
numbers to all pages from user space. Or we could simply disable KSM
in that setup ...
--- How MADV_POPULATE is not perfect ---

KSM can merge anonymous pages again. Just like the current QEMU
implementation. The only way around that is writing random numbers to the
pages or mlocking all memory. No big news.

Nothing stops reclaim/swap code from depopulating when using files. Again,
no big news - we have to mlock.

--- How MADV_POPULATE_WRITE might be useful ---

With 3), 4) and 5) MADV_POPULATE does partially what I want: preallocate
memory and populate page tables. But as it's a read fault, I think we'll have
another minor fault on access. Not perfect, but better than failing with
SIGBUS. One way around that would be having an additional
MADV_POPULATE_WRITE, to use in cases where it makes sense (I think at least
3) and 4), most probably not on actual files like 5) ). Trigger a write
fault without actually writing.

Makes sense?
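For comparison, option b) above (populating from user space by reading a value and writing the same value back) can be sketched like this. It is a minimal illustration under assumptions, not the code from util/oslib-posix.c: `touch_pages()` and its parameters are hypothetical names, and the caller has to supply the page size that actually backs the mapping.

```c
#include <stddef.h>

/*
 * Prefault [addr, addr + len) from user space by touching one byte per
 * page: read the byte and write the same value back. The store triggers
 * a write fault (breaking COW where needed) without changing contents.
 *
 * page_size must be the page size backing the mapping (for example the
 * huge page size for hugetlbfs). Returns the number of pages touched.
 */
static size_t touch_pages(void *addr, size_t len, size_t page_size)
{
	volatile unsigned char *p = addr;
	size_t off, touched = 0;

	if (page_size == 0)
		return 0;

	for (off = 0; off < len; off += page_size) {
		/* volatile keeps the compiler from dropping the store */
		p[off] = p[off];
		touched++;
	}
	return touched;
}
```

Unlike a madvise()-based interface, this approach cannot return an error: if the backend runs out of memory, the failing access raises SIGBUS, which is why a SIGBUS handler is needed around such a loop.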
On Wed 17-02-21 16:48:44, David Hildenbrand wrote: [...] I only got to the implementation now. > +static long madvise_populate(struct vm_area_struct *vma, > + struct vm_area_struct **prev, > + unsigned long start, unsigned long end) > +{ > + struct mm_struct *mm = vma->vm_mm; > + unsigned long tmp_end; > + int locked = 1; > + long pages; > + > + *prev = vma; > + > + while (start < end) { > + /* > + * We might have temporarily dropped the lock. For example, > + * our VMA might have been split. > + */ > + if (!vma || start >= vma->vm_end) { > + vma = find_vma(mm, start); > + if (!vma) > + return -ENOMEM; > + } Why do you need to find a vma when you already have one. do_madvise will give you your vma already. I do understand that you want to finish the vma for some errors but that shouldn't require handling vmas. You should be in the shope of one here unless I miss anything. > + > + /* Bail out on incompatible VMA types. */ > + if (vma->vm_flags & (VM_IO | VM_PFNMAP) || > + !vma_is_accessible(vma)) { > + return -EINVAL; > + } > + > + /* > + * Populate pages and take care of VM_LOCKED: simulate user > + * space access. > + * > + * For private, writable mappings, trigger a write fault to > + * break COW (i.e., shared zeropage). For other mappings (i.e., > + * read-only, shared), trigger a read fault. > + */ > + tmp_end = min_t(unsigned long, end, vma->vm_end); > + pages = populate_vma_page_range(vma, start, tmp_end, &locked); > + if (!locked) { > + mmap_read_lock(mm); > + *prev = NULL; > + vma = NULL; > + } > + if (pages < 0) { > + switch (pages) { > + case -EINTR: > + case -ENOMEM: > + return pages; > + case -EHWPOISON: > + /* Skip over any poisoned pages. 
*/ > + start += PAGE_SIZE; > + continue; > + case -EBUSY: > + case -EAGAIN: > + continue; > + default: > + pr_warn_once("%s: unhandled return value: %ld\n", > + __func__, pages); > + return -ENOMEM; > + } > + } > + start += pages * PAGE_SIZE; > + } > + return 0; > +} > + > /* > * Application wants to free up the pages and associated backing store. > * This is effectively punching a hole into the middle of a file. > @@ -934,6 +1001,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev, > case MADV_FREE: > case MADV_DONTNEED: > return madvise_dontneed_free(vma, prev, start, end, behavior); > + case MADV_POPULATE: > + return madvise_populate(vma, prev, start, end); > default: > return madvise_behavior(vma, prev, start, end, behavior); > } > @@ -954,6 +1023,7 @@ madvise_behavior_valid(int behavior) > case MADV_FREE: > case MADV_COLD: > case MADV_PAGEOUT: > + case MADV_POPULATE: > #ifdef CONFIG_KSM > case MADV_MERGEABLE: > case MADV_UNMERGEABLE: > -- > 2.29.2 >
On 19.02.21 11:35, Michal Hocko wrote:
> On Wed 17-02-21 16:48:44, David Hildenbrand wrote:
> [...]
>
> I only got to the implementation now.
>
>> +static long madvise_populate(struct vm_area_struct *vma,
>> +			     struct vm_area_struct **prev,
>> +			     unsigned long start, unsigned long end)
>> +{
>> +	struct mm_struct *mm = vma->vm_mm;
>> +	unsigned long tmp_end;
>> +	int locked = 1;
>> +	long pages;
>> +
>> +	*prev = vma;
>> +
>> +	while (start < end) {
>> +		/*
>> +		 * We might have temporarily dropped the lock. For example,
>> +		 * our VMA might have been split.
>> +		 */
>> +		if (!vma || start >= vma->vm_end) {
>> +			vma = find_vma(mm, start);
>> +			if (!vma)
>> +				return -ENOMEM;
>> +		}
>
> Why do you need to find a vma when you already have one. do_madvise will
> give you your vma already. I do understand that you want to finish the
> vma for some errors but that shouldn't require handling vmas. You should
> be in the shope of one here unless I miss anything.

See below, we might temporarily drop the lock while not having processed all
pages

>
>> +
>> +		/* Bail out on incompatible VMA types. */
>> +		if (vma->vm_flags & (VM_IO | VM_PFNMAP) ||
>> +		    !vma_is_accessible(vma)) {
>> +			return -EINVAL;
>> +		}
>> +
>> +		/*
>> +		 * Populate pages and take care of VM_LOCKED: simulate user
>> +		 * space access.
>> +		 *
>> +		 * For private, writable mappings, trigger a write fault to
>> +		 * break COW (i.e., shared zeropage). For other mappings (i.e.,
>> +		 * read-only, shared), trigger a read fault.
>> +		 */
>> +		tmp_end = min_t(unsigned long, end, vma->vm_end);
>> +		pages = populate_vma_page_range(vma, start, tmp_end, &locked);
>> +		if (!locked) {
>> +			mmap_read_lock(mm);
>> +			*prev = NULL;
>> +			vma = NULL;

^ here

so, the VMA might have been replaced/split/... in the meantime.

So to make forward progress, I have to look up again. (similar, but different
to madvise_dontneed_free()).
On Fri 19-02-21 11:43:48, David Hildenbrand wrote: > On 19.02.21 11:35, Michal Hocko wrote: > > On Wed 17-02-21 16:48:44, David Hildenbrand wrote: > > [...] > > > > I only got to the implementation now. > > > > > +static long madvise_populate(struct vm_area_struct *vma, > > > + struct vm_area_struct **prev, > > > + unsigned long start, unsigned long end) > > > +{ > > > + struct mm_struct *mm = vma->vm_mm; > > > + unsigned long tmp_end; > > > + int locked = 1; > > > + long pages; > > > + > > > + *prev = vma; > > > + > > > + while (start < end) { > > > + /* > > > + * We might have temporarily dropped the lock. For example, > > > + * our VMA might have been split. > > > + */ > > > + if (!vma || start >= vma->vm_end) { > > > + vma = find_vma(mm, start); > > > + if (!vma) > > > + return -ENOMEM; > > > + } > > > > Why do you need to find a vma when you already have one. do_madvise will > > give you your vma already. I do understand that you want to finish the > > vma for some errors but that shouldn't require handling vmas. You should > > be in the shope of one here unless I miss anything. > > See below, we might temporary drop the lock while not having processed all > pages > > > > > > + > > > + /* Bail out on incompatible VMA types. */ > > > + if (vma->vm_flags & (VM_IO | VM_PFNMAP) || > > > + !vma_is_accessible(vma)) { > > > + return -EINVAL; > > > + } > > > + > > > + /* > > > + * Populate pages and take care of VM_LOCKED: simulate user > > > + * space access. > > > + * > > > + * For private, writable mappings, trigger a write fault to > > > + * break COW (i.e., shared zeropage). For other mappings (i.e., > > > + * read-only, shared), trigger a read fault. > > > + */ > > > + tmp_end = min_t(unsigned long, end, vma->vm_end); > > > + pages = populate_vma_page_range(vma, start, tmp_end, &locked); > > > + if (!locked) { > > > + mmap_read_lock(mm); > > > + *prev = NULL; > > > + vma = NULL; > > ^ here > > so, the VMA might have been replaced/split/... in the meantime. 
> > So to make forward progress, I have to lookup again. (similar. but different > to madvise_dontneed_free()). Right. Missed that.
On 19.02.21 12:04, Michal Hocko wrote: > On Fri 19-02-21 11:43:48, David Hildenbrand wrote: >> On 19.02.21 11:35, Michal Hocko wrote: >>> On Wed 17-02-21 16:48:44, David Hildenbrand wrote: >>> [...] >>> >>> I only got to the implementation now. >>> >>>> +static long madvise_populate(struct vm_area_struct *vma, >>>> + struct vm_area_struct **prev, >>>> + unsigned long start, unsigned long end) >>>> +{ >>>> + struct mm_struct *mm = vma->vm_mm; >>>> + unsigned long tmp_end; >>>> + int locked = 1; >>>> + long pages; >>>> + >>>> + *prev = vma; >>>> + >>>> + while (start < end) { >>>> + /* >>>> + * We might have temporarily dropped the lock. For example, >>>> + * our VMA might have been split. >>>> + */ >>>> + if (!vma || start >= vma->vm_end) { >>>> + vma = find_vma(mm, start); >>>> + if (!vma) >>>> + return -ENOMEM; >>>> + } >>> >>> Why do you need to find a vma when you already have one. do_madvise will >>> give you your vma already. I do understand that you want to finish the >>> vma for some errors but that shouldn't require handling vmas. You should >>> be in the shope of one here unless I miss anything. >> >> See below, we might temporary drop the lock while not having processed all >> pages >> >>> >>>> + >>>> + /* Bail out on incompatible VMA types. */ >>>> + if (vma->vm_flags & (VM_IO | VM_PFNMAP) || >>>> + !vma_is_accessible(vma)) { >>>> + return -EINVAL; >>>> + } >>>> + >>>> + /* >>>> + * Populate pages and take care of VM_LOCKED: simulate user >>>> + * space access. >>>> + * >>>> + * For private, writable mappings, trigger a write fault to >>>> + * break COW (i.e., shared zeropage). For other mappings (i.e., >>>> + * read-only, shared), trigger a read fault. >>>> + */ >>>> + tmp_end = min_t(unsigned long, end, vma->vm_end); >>>> + pages = populate_vma_page_range(vma, start, tmp_end, &locked); >>>> + if (!locked) { >>>> + mmap_read_lock(mm); >>>> + *prev = NULL; >>>> + vma = NULL; >> >> ^ here >> >> so, the VMA might have been replaced/split/... 
in the meantime. >> >> So to make forward progress, I have to lookup again. (similar. but different >> to madvise_dontneed_free()). > > Right. Missed that. It would look more natural if we'd just be processing the whole range - but then it would not fit into the generic infrastructure and would result in even more code. I decided to go with "process the passed range and treat the given VMA as an initial VMA that is invalidated as soon as we drop the lock".
On Fri, Feb 19, 2021 at 09:20:16AM +0100, David Hildenbrand wrote: > On 18.02.21 23:59, Peter Xu wrote: > > Hi, David, > > > > On Wed, Feb 17, 2021 at 04:48:44PM +0100, David Hildenbrand wrote: > > > When we manage sparse memory mappings dynamically in user space - also > > > sometimes involving MADV_NORESERVE - we want to dynamically populate/ > > > discard memory inside such a sparse memory region. Example users are > > > hypervisors (especially implementing memory ballooning or similar > > > technologies like virtio-mem) and memory allocators. In addition, we want > > > to fail in a nice way if populating does not succeed because we are out of > > > backend memory (which can happen easily with file-based mappings, > > > especially tmpfs and hugetlbfs). > > > > Could you explain a bit more on how do you plan to use this new interface for > > the virtio-balloon scenario? > > Sure, that will bring up an interesting point to discuss > (MADV_POPULATE_WRITE). > > I'm planning on using it in virtio-mem: whenever the guests requests the > hypervisor (via a virtio-mem device) to make specific blocks available > ("plug"), I want to have a configurable option ("populate=on" / > "prealloc="on") to perform safety checks ("prealloc") and populate page > tables. As you mentioned in the commit message, the original goal for MADV_POPULATE should be for performance's sake, which I can understand. But for safety check, I'm curious whether we'd have better way to do that besides populating the whole memory. E.g., can we simply ask the kernel "how much memory this process can still allocate", then get a number out of it? I'm not sure whether it can be done already by either cgroup or any other facilities, or maybe it's still missing. But I'd raise this question up, since these two requirements seem to be two standalone issues to solve at least to me. It could be an overkill to populate all the memory just for a sanity check. 
> > This becomes especially relevant for private/shared hugetlbfs and shared > files/shmem where we have a limited pool size (e.g., huge pages, tmpfs size, > filesystem size). But it will also come in handy when just preallocating > (esp. zeroing) anonymous memory. > > For virito-balloon it is not applicable because it really only supports > anonymous memory and we cannot fail requests to deflate ... > > --- Example --- > > Example: Assume the guests requests to make 128 MB available and we're using > hugetlbfs. Assume we're out of huge pages in the hypervisor - we want to > fail the request - I want to do some kind of preallocation. > > So I could do fallocate() on anything that's MAP_SHARED, but not on anything > that's MAP_PRIVATE. hugetlbfs via memfd() cannot be preallocated without > going via SIGBUS handlers. > > --- QEMU memory configurations --- > > I see the following combinations relevant in QEMU that I want to support > with virito-mem: > > 1) MAP_PRIVATE anonymous memory > 2) MAP_PRIVATE on hugetlbfs (esp. via memfd) > 3) MAP_SHARED on hugetlbfs (esp. via memfd) > 4) MAP_SHARED on shmem (file / memfd) > 5) MAP_SHARED on some sparse file. > > Other MAP_PRIVATE mappings barely make any sense to me - "read the file and > write to page cache" is not really applicable to VM RAM (not to mention > doing fallocate(PUNCH_HOLE) that invalidates the private copies of all other > mappings on that file). > > --- Ways to populate/preallocate --- > > I see the following ways to populate/preallocate: > > a) MADV_POPULATE: write fault on writable MAP_PRIVATE, read fault on > MAP_SHARED > b) Writing to MAP_PRIVATE | MAP_SHARED from user space. > c) (below) MADV_POPULATE_WRITE: write fault on writable MAP_PRIVATE | > MAP_SHARED > > Especially, 2) is kind of weird as implemented in QEMU > (util/oslib-posix.c:do_touch_pages): > > "Read & write back the same value, so we don't corrupt existing user/app > data ... 
TODO: get a better solution from kernel so we don't need to write > at all so we don't cause wear on the storage backing the region..." It's interesting to know about commit 1e356fc14be ("mem-prealloc: reduce large guest start-up and migration time.", 2017-03-14). It seems for speeding up VM boot, but what I can't understand is why it would cause the delay of hugetlb accounting - I thought we'd fail even earlier at either fallocate() on the hugetlb file (when we use /dev/hugepages) or on mmap() of the memfd which contains the huge pages. See hugetlb_reserve_pages() and its callers. Or did I miss something? I think there's a special case if QEMU fork() with a MAP_PRIVATE hugetlbfs mapping, that could cause the memory accouting to be delayed until COW happens. However that's definitely not the case for QEMU since QEMU won't work at all as late as that point. IOW, for hugetlbfs I don't know why we need to populate the pages at all if we simply want to know "whether we do still have enough space".. And IIUC 2) above is the major issue you'd like to solve too. > > So if we have zero, we write zero. We'll COW pages, triggering a write fault > - and that's the only good thing about it. For example, similar to > MADV_POPULATE, nothing stops KSM from merging anonymous pages again. So for > anonymous memory the actual write is not helpful at all. Similarly for > hugetlbfs, the actual write is not necessary - but there is no other way to > really achieve the goal. > > --- How MADV_POPULATE is useful --- > > With virito-mem, our VM will usually write to memory before it reads it. > > With 1) and 2) it does exactly what I want: trigger COW / allocate memory > and trigger a write fault. The only issue with 1) is that KSM might come > around and undo our work - but that could only be avoided by writing random > numbers to all pages from user space. Or we could simply rather disable KSM > in that setup ... 
> > --- How MADV_POPULATE is not perfect --- > > KSM can merge anonymous pages again. Just like the current QEMU > implementation. The only way around that is writing random numbers to the > pages or mlocking all memory. No big news. > > Nothing stops reclaim/swap code from depopulating when using files. Again, > no big new - we have to mlock. > > --- HOW MADV_POPULATE_WRITE might be useful --- > > With 3) 4) 5) MADV_POPULATE does partially what I want: preallocate memory > and populate page tables. But as it's a read fault, I think we'll have > another minor fault on access. Not perfect, but better than failing with > SIGBUS. One way around that would be having an additional > MADV_POPULATE_WRITE, to use in cases where it makes sense (I think at least > 3) and 4), most probably not on actual files like 5) ). Right, it seems when populating memories we'll read-fault on file-backed. However that'll be another performance issue to think about. So I'd hope we can start with the current virtio-mem issue on memory accounting, then we can discuss them separately. Btw, thanks for the long write-up, it definitely helps me to understand what you wanted to achieve. Thanks,
On 19.02.21 17:31, Peter Xu wrote:
> On Fri, Feb 19, 2021 at 09:20:16AM +0100, David Hildenbrand wrote:
>> On 18.02.21 23:59, Peter Xu wrote:
>>> Hi, David,
>>>
>>> On Wed, Feb 17, 2021 at 04:48:44PM +0100, David Hildenbrand wrote:
>>>> When we manage sparse memory mappings dynamically in user space - also
>>>> sometimes involving MADV_NORESERVE - we want to dynamically populate/
>>>> discard memory inside such a sparse memory region. Example users are
>>>> hypervisors (especially implementing memory ballooning or similar
>>>> technologies like virtio-mem) and memory allocators. In addition, we want
>>>> to fail in a nice way if populating does not succeed because we are out of
>>>> backend memory (which can happen easily with file-based mappings,
>>>> especially tmpfs and hugetlbfs).
>>>
>>> Could you explain a bit more on how do you plan to use this new interface for
>>> the virtio-balloon scenario?
>>
>> Sure, that will bring up an interesting point to discuss
>> (MADV_POPULATE_WRITE).
>>
>> I'm planning on using it in virtio-mem: whenever the guests requests the
>> hypervisor (via a virtio-mem device) to make specific blocks available
>> ("plug"), I want to have a configurable option ("populate=on" /
>> "prealloc="on") to perform safety checks ("prealloc") and populate page
>> tables.
>
> As you mentioned in the commit message, the original goal for MADV_POPULATE
> should be for performance's sake, which I can understand. But for safety
> check, I'm curious whether we'd have better way to do that besides populating
> the whole memory.

Well, it's 100% what I want for "populate=on"/"prealloc=on" semantics.

There is no real memory overcommit for huge pages, so any lazy allocation
("reserve only") only saves you boot time - which is not really an issue for
virtio-mem, as the memory gets added and initialized asynchronously as the
guest boots up.

"reserve=on,prealloc=off" is another future use case I have in mind -
possible only for some memory backends (esp.
anonymous memory - below). > > E.g., can we simply ask the kernel "how much memory this process can still > allocate", then get a number out of it? I'm not sure whether it can be done Anything like that is completely racy and unreliable. > already by either cgroup or any other facilities, or maybe it's still missing. > But I'd raise this question up, since these two requirements seem to be two > standalone issues to solve at least to me. It could be an overkill to populate > all the memory just for a sanity check. For anonymous memory I have something in the works to dynamically reserve swap space per process for the memory reservation for not accounted private writable MAP_DONTRESERVE memory. However, it works because swap space is per-system, not per-node or anything else. Doing that for file systems/hugetlbfs is a different beast. And anonymous memory is right now less of my concern, as we're used to overcommitting there - limited pool sizes are more of an issue. >> --- Ways to populate/preallocate --- >> >> I see the following ways to populate/preallocate: >> >> a) MADV_POPULATE: write fault on writable MAP_PRIVATE, read fault on >> MAP_SHARED >> b) Writing to MAP_PRIVATE | MAP_SHARED from user space. >> c) (below) MADV_POPULATE_WRITE: write fault on writable MAP_PRIVATE | >> MAP_SHARED >> >> Especially, 2) is kind of weird as implemented in QEMU >> (util/oslib-posix.c:do_touch_pages): >> >> "Read & write back the same value, so we don't corrupt existing user/app >> data ... TODO: get a better solution from kernel so we don't need to write >> at all so we don't cause wear on the storage backing the region..." > > It's interesting to know about commit 1e356fc14be ("mem-prealloc: reduce large > guest start-up and migration time.", 2017-03-14). 
It seems for speeding up VM > boot, but what I can't understand is why it would cause the delay of hugetlb > accounting - I thought we'd fail even earlier at either fallocate() on the > hugetlb file (when we use /dev/hugepages) or on mmap() of the memfd which > contains the huge pages. See hugetlb_reserve_pages() and its callers. Or did > I miss something? We should fail on mmap() when the reservation happens (unless MAP_NORESERVE is passed) I think. > > I think there's a special case if QEMU fork() with a MAP_PRIVATE hugetlbfs > mapping, that could cause the memory accouting to be delayed until COW happens. That would be kind of weird. I'd assume the reservation gets properly done during fork() - just like for VM_ACCOUNT. > However that's definitely not the case for QEMU since QEMU won't work at all as > late as that point. > > IOW, for hugetlbfs I don't know why we need to populate the pages at all if we > simply want to know "whether we do still have enough space".. And IIUC 2) > above is the major issue you'd like to solve too. To avoid page faults at runtime on access I think. Reservation <= Preallocation. [...] >> --- HOW MADV_POPULATE_WRITE might be useful --- >> >> With 3) 4) 5) MADV_POPULATE does partially what I want: preallocate memory >> and populate page tables. But as it's a read fault, I think we'll have >> another minor fault on access. Not perfect, but better than failing with >> SIGBUS. One way around that would be having an additional >> MADV_POPULATE_WRITE, to use in cases where it makes sense (I think at least >> 3) and 4), most probably not on actual files like 5) ). > > Right, it seems when populating memories we'll read-fault on file-backed. > However that'll be another performance issue to think about. So I'd hope we > can start with the current virtio-mem issue on memory accounting, then we can > discuss them separately. MADV_POPULATE is certainly something I want and what fits nicely into the existing model of MAP_POPULATE. 
Doing reservation only is a different topic - and is most probably only possible for anonymous memory in a clean way. > Btw, thanks for the long write-up, it definitely helps me to understand what > you wanted to achieve. Sure! Thanks!
>> It's interesting to know about commit 1e356fc14be ("mem-prealloc: reduce large >> guest start-up and migration time.", 2017-03-14). It seems for speeding up VM >> boot, but what I can't understand is why it would cause the delay of hugetlb >> accounting - I thought we'd fail even earlier at either fallocate() on the >> hugetlb file (when we use /dev/hugepages) or on mmap() of the memfd which >> contains the huge pages. See hugetlb_reserve_pages() and its callers. Or did >> I miss something? > > We should fail on mmap() when the reservation happens (unless > MAP_NORESERVE is passed) I think. > >> >> I think there's a special case if QEMU fork() with a MAP_PRIVATE hugetlbfs >> mapping, that could cause the memory accouting to be delayed until COW happens. > > That would be kind of weird. I'd assume the reservation gets properly > done during fork() - just like for VM_ACCOUNT. > >> However that's definitely not the case for QEMU since QEMU won't work at all as >> late as that point. >> >> IOW, for hugetlbfs I don't know why we need to populate the pages at all if we >> simply want to know "whether we do still have enough space".. And IIUC 2) >> above is the major issue you'd like to solve too. > > To avoid page faults at runtime on access I think. Reservation <= > Preallocation. 
I just learned that there is more to it: (test done on v5.9)

# echo 512 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
# cat /sys/devices/system/node/node*/meminfo | grep HugePages_
Node 0 HugePages_Total: 512
Node 0 HugePages_Free: 512
Node 0 HugePages_Surp: 0
Node 1 HugePages_Total: 0
Node 1 HugePages_Free: 0
Node 1 HugePages_Surp: 0
# cat /proc/meminfo | grep HugePages_
HugePages_Total: 512
HugePages_Free: 512
HugePages_Rsvd: 0
HugePages_Surp: 0

# /usr/libexec/qemu-kvm -m 1G -smp 1 -object memory-backend-memfd,id=mem0,size=1G,hugetlb=on,hugetlbsize=2M,policy=bind,host-nodes=0 -numa node,nodeid=0,memdev=mem0 -hda Fedora-Cloud-Base-Rawhide-20201004.n.1.x86_64.qcow2 -nographic
-> works just fine

# /usr/libexec/qemu-kvm -m 1G -smp 1 -object memory-backend-memfd,id=mem0,size=1G,hugetlb=on,hugetlbsize=2M,policy=bind,host-nodes=1 -numa node,nodeid=0,memdev=mem0 -hda Fedora-Cloud-Base-Rawhide-20201004.n.1.x86_64.qcow2 -nographic
-> Does not fail nicely but crashes!

See https://bugzilla.redhat.com/show_bug.cgi?id=1686261 for something
similar, however, it no longer applies like that on more recent kernels.

Hugetlbfs reservations don't always protect you (especially with NUMA) -
that's why e.g., libvirt always tells QEMU to prealloc.

I think the "issue" is that the reservation happens on mmap(). mbind() runs
afterwards. Preallocation saves you from that.

I suspect something similar will happen with anonymous memory with mbind()
even if we reserved swap space. Did not test yet, though.
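For anyone who wants to repeat the check above from a program rather than with grep, the system-wide counters can be pulled out of /proc/meminfo-style text with a small helper. This is only a sketch: `meminfo_field()` is a hypothetical name, and it handles the "Name: value" layout of /proc/meminfo, not the "Node N ..." prefix of the per-node files. As discussed earlier in the thread, such a snapshot is racy and says nothing about per-node placement, so it cannot replace preallocation.

```c
#include <stdio.h>
#include <string.h>

/*
 * Extract a numeric field such as "HugePages_Free" from meminfo-style
 * text with one "Name:   value" entry per line. Returns 0 and stores
 * the value in *out on success, -1 if the field is missing or malformed.
 */
static int meminfo_field(const char *text, const char *name,
			 unsigned long *out)
{
	size_t name_len = strlen(name);
	const char *line = text;

	while (line && *line) {
		/* Match the full field name directly followed by ':' */
		if (strncmp(line, name, name_len) == 0 &&
		    line[name_len] == ':')
			return sscanf(line + name_len + 1, "%lu", out) == 1 ?
			       0 : -1;
		/* Advance to the start of the next line, if any */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```

In practice the text would come from reading /proc/meminfo into a buffer. The helper only sees a string, so it can also be fed captured output like the one shown above.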
On Fri, Feb 19, 2021 at 06:13:47PM +0100, David Hildenbrand wrote: > On 19.02.21 17:31, Peter Xu wrote: > > On Fri, Feb 19, 2021 at 09:20:16AM +0100, David Hildenbrand wrote: > > > On 18.02.21 23:59, Peter Xu wrote: > > > > Hi, David, > > > > > > > > On Wed, Feb 17, 2021 at 04:48:44PM +0100, David Hildenbrand wrote: > > > > > When we manage sparse memory mappings dynamically in user space - also > > > > > sometimes involving MADV_NORESERVE - we want to dynamically populate/ > > > > > discard memory inside such a sparse memory region. Example users are > > > > > hypervisors (especially implementing memory ballooning or similar > > > > > technologies like virtio-mem) and memory allocators. In addition, we want > > > > > to fail in a nice way if populating does not succeed because we are out of > > > > > backend memory (which can happen easily with file-based mappings, > > > > > especially tmpfs and hugetlbfs). [1] > > E.g., can we simply ask the kernel "how much memory this process can still > > allocate", then get a number out of it? I'm not sure whether it can be done > > Anything like that is completely racy and unreliable. The failure path won't be racy imho - If we can detect current process doesn't have enough memory budget, it'll be more efficient to fail even before trying to populate any memory and then drop part of them again. But I see your point - indeed it's good to guarantee the guest won't crash at any point of further guest side memory access. Another question: can the user actually specify arbitrary max-length for the virtio-mem device (which decides the maximum memory this device could possibly consume)? I thought we should check that first before realizing the device and we really shouldn't fail any guest memory access if that check passed. Feel free to correct me.. [...] > > > > I think there's a special case if QEMU fork() with a MAP_PRIVATE hugetlbfs > > mapping, that could cause the memory accouting to be delayed until COW happens. 
>
> That would be kind of weird. I'd assume the reservation gets properly done
> during fork() - just like for VM_ACCOUNT.

AFAIK VM_ACCOUNT is never applied for hugetlbfs. Nor do I know of any
accounting done for hugetlbfs during fork(), if not taking the pinned pages
into account - that is definitely a special case.

>
> > However that's definitely not the case for QEMU since QEMU won't work at all as
> > late as that point.
> >
> > IOW, for hugetlbfs I don't know why we need to populate the pages at all if we
> > simply want to know "whether we do still have enough space".. And IIUC 2)
> > above is the major issue you'd like to solve too.
>
> To avoid page faults at runtime on access I think. Reservation <=
> Preallocation.

Yes. Besides my above question regarding max-length of virtio-mem device: we
care most about private mappings of hugetlbfs/shmem here, am I right?

I'm wondering why we'd need MAP_PRIVATE for these at all in a VM context. It's
definitely not the major scenario when they're shared with either ovs or any
non-qemu process, because then MAP_SHARED is a must. Then if we use them
privately, can we simply always make it MAP_SHARED? IMHO MAP_PRIVATE could be
helpful only if we'd like the COW semantics, meaning that when there's
something already, we'd like to keep that snapshot but trigger a page copy on
writes. But is that the case for a VM memory backend which should always be
zeroed by default?

Then, I'm wondering whether we can simply avoid bothering with MAP_PRIVATE on
this file-backed memory at all - then we'll naturally have fallocate() at hand,
which seems to be working for us already.

Thanks,
On 2/19/21 11:14 AM, David Hildenbrand wrote: >>> It's interesting to know about commit 1e356fc14be ("mem-prealloc: reduce large >>> guest start-up and migration time.", 2017-03-14). It seems for speeding up VM >>> boot, but what I can't understand is why it would cause the delay of hugetlb >>> accounting - I thought we'd fail even earlier at either fallocate() on the >>> hugetlb file (when we use /dev/hugepages) or on mmap() of the memfd which >>> contains the huge pages. See hugetlb_reserve_pages() and its callers. Or did >>> I miss something? >> >> We should fail on mmap() when the reservation happens (unless >> MAP_NORESERVE is passed) I think. >> >>> >>> I think there's a special case if QEMU fork() with a MAP_PRIVATE hugetlbfs >>> mapping, that could cause the memory accouting to be delayed until COW happens. >> >> That would be kind of weird. I'd assume the reservation gets properly >> done during fork() - just like for VM_ACCOUNT. >> >>> However that's definitely not the case for QEMU since QEMU won't work at all as >>> late as that point. >>> >>> IOW, for hugetlbfs I don't know why we need to populate the pages at all if we >>> simply want to know "whether we do still have enough space".. And IIUC 2) >>> above is the major issue you'd like to solve too. >> >> To avoid page faults at runtime on access I think. Reservation <= >> Preallocation. 
> > I just learned that there is more to it: (test done on v5.9) > > # echo 512 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages > # cat /sys/devices/system/node/node*/meminfo | grep HugePages_ > Node 0 HugePages_Total: 512 > Node 0 HugePages_Free: 512 > Node 0 HugePages_Surp: 0 > Node 1 HugePages_Total: 0 > Node 1 HugePages_Free: 0 > Node 1 HugePages_Surp: 0 > # cat /proc/meminfo | grep HugePages_ > HugePages_Total: 512 > HugePages_Free: 512 > HugePages_Rsvd: 0 > HugePages_Surp: 0 > > # /usr/libexec/qemu-kvm -m 1G -smp 1 -object memory-backend-memfd,id=mem0,size=1G,hugetlb=on,hugetlbsize=2M,policy=bind,host-nodes=0 -numa node,nodeid=0,memdev=mem0 -hda Fedora-Cloud-Base-Rawhide-20201004.n.1.x86_64.qcow2 -nographic > -> works just fine > > # /usr/libexec/qemu-kvm -m 1G -smp 1 -object memory-backend-memfd,id=mem0,size=1G,hugetlb=on,hugetlbsize=2M,policy=bind,host-nodes=1 -numa node,nodeid=0,memdev=mem0 -hda Fedora-Cloud-Base-Rawhide-20201004.n.1.x86_64.qcow2 -nographic > -> Does not fail nicely but crashes! > > > See https://bugzilla.redhat.com/show_bug.cgi?id=1686261 for something similar, however, it no longer applies like that on more recent kernels. > > Hugetlbfs reservations don't always protect you (especially with NUMA) - that's why e.g., libvirt always tells QEMU to prealloc. > > I think the "issue" is that the reservation happens on mmap(). mbind() runs afterwards. Preallocation saves you from that. > > I suspect something similar will happen with anonymous memory with mbind() even if we reserved swap space. Did not test yet, though. > Sorry, for jumping in late ... hugetlb keyword just hit my mail filters :) Yes, it is true that hugetlb reservations are not numa aware. So, even if pages are reserved at mmap time one could still SIGBUS if a fault is restricted to a node with insufficient pages. I looked into this some years ago, and there really is not a good way to make hugetlb reservations numa aware. 
Preallocation, or on-demand populating as proposed here, is a way around the issue.
> On 19.02.2021 at 20:23, Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Feb 19, 2021 at 06:13:47PM +0100, David Hildenbrand wrote:
>>> On 19.02.21 17:31, Peter Xu wrote:
>>> On Fri, Feb 19, 2021 at 09:20:16AM +0100, David Hildenbrand wrote:
>>>> On 18.02.21 23:59, Peter Xu wrote:
>>>>> Hi, David,
>>>>>
>>>>> On Wed, Feb 17, 2021 at 04:48:44PM +0100, David Hildenbrand wrote:
>>>>>> When we manage sparse memory mappings dynamically in user space - also
>>>>>> sometimes involving MAP_NORESERVE - we want to dynamically populate/
>>>>>> discard memory inside such a sparse memory region. Example users are
>>>>>> hypervisors (especially implementing memory ballooning or similar
>>>>>> technologies like virtio-mem) and memory allocators. In addition, we want
>>>>>> to fail in a nice way if populating does not succeed because we are out of
>>>>>> backend memory (which can happen easily with file-based mappings,
>>>>>> especially tmpfs and hugetlbfs).
>
> [1]
>
>>> E.g., can we simply ask the kernel "how much memory this process can still
>>> allocate", then get a number out of it? I'm not sure whether it can be done
>>
>> Anything like that is completely racy and unreliable.
>
> The failure path won't be racy imho - If we can detect current process doesn't
> have enough memory budget, it'll be more efficient to fail even before trying
> to populate any memory and then drop part of them again.
>
> But I see your point - indeed it's good to guarantee the guest won't crash at
> any point of further guest side memory access.
>
> Another question: can the user actually specify arbitrary max-length for the
> virtio-mem device (which decides the maximum memory this device could possibly
> consume)? I thought we should check that first before realizing the device and
> we really shouldn't fail any guest memory access if that check passed. Feel
> free to correct me.

Max-length is currently limited by the mmap() we're allowed to create. With MAP_NORESERVE this can be big (not merged yet). Checking max-length at initialization time does not make too much sense. Just imagine shrinking/relocating other VMs so you can grow this VM further. Or migrating the VM to another machine where you might grow it further.

The ultimate goal is to adjust the mapping size dynamically on demand, but that's stuff for the future as it turns out to be complicated. For example, hugetlbfs VMAs cannot be merged yet (although I think it shouldn't be too hard to implement). The short term approach is only exposing a small window of the bigger mmap to the guest.

>>
>> That would be kind of weird. I'd assume the reservation gets properly done
>> during fork() - just like for VM_ACCOUNT.
>
> AFAIK VM_ACCOUNT is never applied for hugetlbfs. Neither do I know any
> accounting done for hugetlbfs during fork(), if not taking the pinned pages
> into account - that is definitely a special case.
>

Yes, it isn't - I meant "like" as in "similar to swap reservation".

>>
>>> However that's definitely not the case for QEMU since QEMU won't work at all as
>>> late as that point.
>>>
>>> IOW, for hugetlbfs I don't know why we need to populate the pages at all if we
>>> simply want to know "whether we do still have enough space".. And IIUC 2)
>>> above is the major issue you'd like to solve too.
>>
>> To avoid page faults at runtime on access I think. Reservation <=
>> Preallocation.
>
> Yes. Besides my above question regarding max-length of virtio-mem device: we
> care most about private mappings of hugetlbfs/shmem here, am I right?
>
> I'm thinking why we'd need MAP_PRIVATE of these at all for VM context.

One reason is that MAP_SHARED does not support mbind() - which should include hugetlbfs. I did not investigate other side effects / performance considerations on allocation.

Similarly, fallocate() does not respect/care about NUMA. (And yes, NUMA for virtio-mem will be important).
> Sorry, for jumping in late ... hugetlb keyword just hit my mail filters :) > Sorry for not realizing to cc you before I sent out the man page update :) > Yes, it is true that hugetlb reservations are not numa aware. So, even if > pages are reserved at mmap time one could still SIGBUS if a fault is > restricted to a node with insufficient pages. > > I looked into this some years ago, and there really is not a good way to > make hugetlb reservations numa aware. preallocation, or on demand > populating as proposed here is a way around the issue. Thanks for confirming, this makes a lot of sense to me now.
On 17.02.21 16:48, David Hildenbrand wrote:
> When we manage sparse memory mappings dynamically in user space - also
> sometimes involving MAP_NORESERVE - we want to dynamically populate/
> discard memory inside such a sparse memory region. Example users are
> hypervisors (especially implementing memory ballooning or similar
> technologies like virtio-mem) and memory allocators. In addition, we want
> to fail in a nice way if populating does not succeed because we are out of
> backend memory (which can happen easily with file-based mappings,
> especially tmpfs and hugetlbfs).
>
> While MADV_DONTNEED and FALLOC_FL_PUNCH_HOLE provide us ways to reliably
> discard memory, there is no generic approach to populate ("preallocate")
> memory.
>
> Although mmap() supports MAP_POPULATE, it is not applicable to the concept
> of sparse memory mappings, where we want to do populate/discard
> dynamically and avoid expensive/problematic remappings. In addition,
> we never actually report error during the final populate phase - it is
> best-effort only.
>
> fallocate() can be used to preallocate file-based memory and fail in a safe
> way. However, it is less useful for private mappings on anonymous files
> due to COW semantics. For example, using fallocate() to preallocate memory
> on anonymous memfd files that are mapped MAP_PRIVATE results in double
> memory consumption when actually writing via the mapping. In addition,
> fallocate() does not actually populate page tables, so we still always
> have to resolve minor faults on first access.
>
> Because we don't have a proper interface, what applications
> (like QEMU and databases) end up doing is touching (i.e., writing) all
> individual pages. However, it requires expensive signal handling (SIGBUS);
> for example, this is problematic in hypervisors like QEMU where SIGBUS
> handlers might already be used by other subsystems concurrently to, e.g.,
> handle hardware errors. "Simply" doing preallocation from another thread
> is not that easy.
>
> Let's introduce MADV_POPULATE with the following semantics:
> 1. MADV_POPULATE does not work on PROT_NONE and special VMAs. It works
>    on everything else.
> 2. Errors during MADV_POPULATE (especially OOM) are reported. If we hit
>    hardware errors on pages, ignore them - nothing we really can or
>    should do.
> 3. On errors during MADV_POPULATE, some memory might have been
>    populated. Callers have to clean up if they care.
> 4. Concurrent changes to the virtual memory layout are tolerated - we
>    process each and every PFN only once, though.
> 5. If MADV_POPULATE succeeds, all memory in the range can be accessed
>    without SIGBUS. (of course, not if user space changed mappings in the
>    meantime or KSM kicked in on anonymous memory).
>
> Although sparse memory mappings are the primary use case, this will
> also be useful for ordinary preallocations where MAP_POPULATE is not
> desired (e.g., in QEMU, where users can trigger preallocation of
> guest RAM after the mapping was created).
>
> Looking at the history, MADV_POPULATE was already proposed in 2013 [1],
> however, the main motivation back then was performance improvements
> (which should also still be the case, but it's a secondary concern).
>
> Basic functionality was tested with:
> - anonymous memory
> - MAP_PRIVATE on anonymous file via memfd
> - MAP_SHARED on anonymous file via memfd
> - MAP_PRIVATE on anonymous hugetlbfs file via memfd
> - MAP_SHARED on anonymous hugetlbfs file via memfd
> - MAP_PRIVATE on tmpfs/shmem file (we end up with double memory consumption
>   though, as the actual file gets populated with zeroes)
> - MAP_SHARED on tmpfs/shmem file
>
> Note: For populating/preallocating zeroed-out memory while userfaultfd is
> active, it's even faster to use first fallocate() or placing zeroed pages
> via userfaultfd APIs. Otherwise, we'll have to route every fault while
> populating via the userfaultfd handler.
> > [1] https://lkml.org/lkml/2013/6/27/698 > > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Arnd Bergmann <arnd@arndb.de> > Cc: Michal Hocko <mhocko@suse.com> > Cc: Oscar Salvador <osalvador@suse.de> > Cc: Matthew Wilcox (Oracle) <willy@infradead.org> > Cc: Andrea Arcangeli <aarcange@redhat.com> > Cc: Minchan Kim <minchan@kernel.org> > Cc: Jann Horn <jannh@google.com> > Cc: Jason Gunthorpe <jgg@ziepe.ca> > Cc: Dave Hansen <dave.hansen@intel.com> > Cc: Hugh Dickins <hughd@google.com> > Cc: Rik van Riel <riel@surriel.com> > Cc: Michael S. Tsirkin <mst@redhat.com> > Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > Cc: Vlastimil Babka <vbabka@suse.cz> > Cc: Richard Henderson <rth@twiddle.net> > Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> > Cc: Matt Turner <mattst88@gmail.com> > Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> > Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> > Cc: Helge Deller <deller@gmx.de> > Cc: Chris Zankel <chris@zankel.net> > Cc: Max Filippov <jcmvbkbc@gmail.com> > Cc: linux-alpha@vger.kernel.org > Cc: linux-mips@vger.kernel.org > Cc: linux-parisc@vger.kernel.org > Cc: linux-xtensa@linux-xtensa.org > Cc: linux-arch@vger.kernel.org > Signed-off-by: David Hildenbrand <david@redhat.com> > --- > > If we agree that this makes sense I'll do more testing to see if we > are missing any return value handling and prepare a man page update to > document the semantics. > > Thoughts? Thinking about MADV_POPULATE vs. MADV_POPULATE_WRITE I wonder if it would be more versatile to break with existing MAP_POPULATE semantics and directly go with MADV_POPULATE_READ: simulate user space read access without actually reading. Trigger a read fault if required. MADV_POPULATE_WRITE: simulate user space write access without actually writing. Trigger a write fault if required. 
For my use case, I could use MADV_POPULATE_WRITE on anonymous memory and RAM-backed files (shmem/hugetlb) - I would not have a minor fault when the guest inside the VM first initializes memory. This mimics how QEMU currently preallocates memory.

However, I would use MADV_POPULATE_READ on any !RAM-backed files where we actually have to write-back to a (slow?) device. Dirtying everything although the guest might not actually consume it in the near future might be undesired.

MADV_POPULATE_READ could also come in handy in combination with userfaultfd-wp [1], when handling unpopulated memory via ordinary userfaultfd MISSING events is undesired. I could imagine it can speed up live migration of VMs in general, where we might end up reading a lot of unpopulated memory to figure out it's all zeroes after faulting in the shared zeropage.

Thoughts?

[1] https://lkml.kernel.org/r/20210219211054.GL6669@xz-x1
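As a rough user-space sketch of the write-populate idea above: the RFC in this thread only defines MADV_POPULATE, so the advice values 22 and 23 below are an assumption taken from later mainline uapi headers on common architectures, and `populate_write` is a hypothetical helper name. If the kernel rejects the advice with EINVAL, the sketch falls back to touching each page by hand, which is roughly what applications do today (and which can still SIGBUS if the backend runs out of memory).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumption: values from later mainline uapi headers; older libc
 * headers may not define them at all. */
#ifndef MADV_POPULATE_READ
#define MADV_POPULATE_READ 22
#endif
#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23
#endif

/*
 * Hypothetical helper: make [addr, addr + len) populated and writable.
 * addr is expected to be page-aligned and mapped PROT_READ | PROT_WRITE.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int populate_write(void *addr, size_t len)
{
    if (madvise(addr, len, MADV_POPULATE_WRITE) == 0)
        return 0;
    if (errno != EINVAL)
        return -1; /* e.g. ENOMEM: reported here instead of a later SIGBUS */

    /* The kernel does not know the advice: touch every page manually. */
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    volatile unsigned char *p = addr;
    for (size_t off = 0; off < len; off += (size_t)page)
        p[off] = p[off]; /* read, then write back: keeps existing content */
    return 0;
}
```

A read-only variant would pass MADV_POPULATE_READ instead and fall back to plain reads; the error handling would stay the same.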
I am slowly catching up with this thread. On Fri 19-02-21 09:20:16, David Hildenbrand wrote: [...] > So if we have zero, we write zero. We'll COW pages, triggering a write fault > - and that's the only good thing about it. For example, similar to > MADV_POPULATE, nothing stops KSM from merging anonymous pages again. So for > anonymous memory the actual write is not helpful at all. Similarly for > hugetlbfs, the actual write is not necessary - but there is no other way to > really achieve the goal. I really do not see why you care about KSM so much. Isn't KSM an explicit opt-in with a fine grained interface to control which memory to KSM or not?
On 22.02.21 13:46, Michal Hocko wrote: > I am slowly catching up with this thread. > > On Fri 19-02-21 09:20:16, David Hildenbrand wrote: > [...] >> So if we have zero, we write zero. We'll COW pages, triggering a write fault >> - and that's the only good thing about it. For example, similar to >> MADV_POPULATE, nothing stops KSM from merging anonymous pages again. So for >> anonymous memory the actual write is not helpful at all. Similarly for >> hugetlbfs, the actual write is not necessary - but there is no other way to >> really achieve the goal. > > I really do not see why you care about KSM so much. Isn't KSM an > explicit opt-in with a fine grained interface to control which memory to > KSM or not? Yeah, I think it's opt-in via MADV_MERGEABLE. E.g., QEMU defaults to enable KSM unless explicitly disabled by the user. But I agree, I got distracted by KSM details.
On Sat 20-02-21 10:12:26, David Hildenbrand wrote:
[...]
> Thinking about MADV_POPULATE vs. MADV_POPULATE_WRITE I wonder if it would be
> more versatile to break with existing MAP_POPULATE semantics and directly go
> with
>
> MADV_POPULATE_READ: simulate user space read access without actually
> reading. Trigger a read fault if required.
>
> MADV_POPULATE_WRITE: simulate user space write access without actually
> writing. Trigger a write fault if required.
>
> For my use case, I could use MADV_POPULATE_WRITE on anonymous memory and
> RAM-backed files (shmem/hugetlb) - I would not have a minor fault when the
> guest inside the VM first initializes memory. This mimics how QEMU currently
> preallocates memory.
>
> However, I would use MADV_POPULATE_READ on any !RAM-backed files where we
> actually have to write-back to a (slow?) device. Dirtying everything
> although the guest might not actually consume it in the near future might be
> undesired.

Isn't that what the current mm_populate does?

        if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
                gup_flags |= FOLL_WRITE;

So it will write fault to shared memory mappings but it will touch
others.
On 22.02.21 13:56, Michal Hocko wrote: > On Sat 20-02-21 10:12:26, David Hildenbrand wrote: > [...] >> Thinking about MADV_POPULATE vs. MADV_POPULATE_WRITE I wonder if it would be >> more versatile to break with existing MAP_POPULATE semantics and directly go >> with >> >> MADV_POPULATE_READ: simulate user space read access without actually >> reading. Trigger a read fault if required. >> >> MADV_POPULATE_WRITE: simulate user space write access without actually >> writing. Trigger a write fault if required. >> >> For my use case, I could use MADV_POPULATE_WRITE on anonymous memory and >> RAM-backed files (shmem/hugetlb) - I would not have a minor fault when the >> guest inside the VM first initializes memory. This mimics how QEMU currently >> preallocates memory. >> >> However, I would use MADV_POPULATE_READ on any !RAM-backed files where we >> actually have to write-back to a (slow?) device. Dirtying everything >> although the guest might not actually consume it in the near future might be >> undesired. > > Isn't what the current mm_populate does? > if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE) > gup_flags |= FOLL_WRITE; > > So it will write fault to shared memory mappings but it will touch > others. Exactly. But for hugetlbfs/shmem ("!RAM-backed files") this is not what we want.
On Mon 22-02-21 13:59:55, David Hildenbrand wrote:
> On 22.02.21 13:56, Michal Hocko wrote:
> > On Sat 20-02-21 10:12:26, David Hildenbrand wrote:
> > [...]
> > > Thinking about MADV_POPULATE vs. MADV_POPULATE_WRITE I wonder if it would be
> > > more versatile to break with existing MAP_POPULATE semantics and directly go
> > > with
> > >
> > > MADV_POPULATE_READ: simulate user space read access without actually
> > > reading. Trigger a read fault if required.
> > >
> > > MADV_POPULATE_WRITE: simulate user space write access without actually
> > > writing. Trigger a write fault if required.
> > >
> > > For my use case, I could use MADV_POPULATE_WRITE on anonymous memory and
> > > RAM-backed files (shmem/hugetlb) - I would not have a minor fault when the
> > > guest inside the VM first initializes memory. This mimics how QEMU currently
> > > preallocates memory.
> > >
> > > However, I would use MADV_POPULATE_READ on any !RAM-backed files where we
> > > actually have to write-back to a (slow?) device. Dirtying everything
> > > although the guest might not actually consume it in the near future might be
> > > undesired.
> >
> > Isn't what the current mm_populate does?
> >         if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
> >                 gup_flags |= FOLL_WRITE;
> >
> > So it will write fault to shared memory mappings but it will touch
> > others.

Bleh, I have written that opposite to the actual behavior. It will write
fault on writable private mappings and only touch read-only or shared
mappings.

>
> Exactly. But for hugetlbfs/shmem ("!RAM-backed files") this is not what we
> want.

OK, then I must have misread your requirements. Maybe I just got lost in
all the combinations you have listed.
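To make the quoted check easier to reason about, here is a small stand-alone sketch of the same predicate. The two flag values are illustrative stand-ins for the kernel's vm_flags bits, and `populate_write_faults` is a hypothetical name, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the kernel's vm_flags bits. */
#define VM_WRITE  0x00000002UL
#define VM_SHARED 0x00000008UL

/*
 * Mirrors the quoted condition: a write fault (FOLL_WRITE) is requested
 * only when the VMA is writable AND not shared. Everything else
 * (read-only, or shared regardless of writability) is merely touched.
 */
static bool populate_write_faults(unsigned long vm_flags)
{
    return (vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE;
}
```

So a writable MAP_PRIVATE mapping gets write-faulted, while a writable MAP_SHARED mapping only gets read-faulted, which is the case the thread argues is not wanted for shmem/hugetlbfs.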
>> Exactly. But for hugetlbfs/shmem ("!RAM-backed files") this is not what we >> want. > > OK, then I must have misread your requirements. Maybe I just got lost in > all the combinations you have listed. Another special case could be dax/pmem I think. You might want to fault it in readable/writable but not perform an actual read/write unless really required. QEMU phrases this as "don't cause wear on the storage backing".
On Mon 22-02-21 14:22:37, David Hildenbrand wrote: > > > Exactly. But for hugetlbfs/shmem ("!RAM-backed files") this is not what we > > > want. > > > > OK, then I must have misread your requirements. Maybe I just got lost in > > all the combinations you have listed. > > Another special case could be dax/pmem I think. You might want to fault it > in readable/writable but not perform an actual read/write unless really > required. > > QEMU phrases this as "don't cause wear on the storage backing". Sorry for being dense here but I still do not follow. If you do not want to read then what do you want to populate from? Only map if it is in the page cache?
On 22.02.21 15:02, Michal Hocko wrote: > On Mon 22-02-21 14:22:37, David Hildenbrand wrote: >>>> Exactly. But for hugetlbfs/shmem ("!RAM-backed files") this is not what we >>>> want. >>> >>> OK, then I must have misread your requirements. Maybe I just got lost in >>> all the combinations you have listed. >> >> Another special case could be dax/pmem I think. You might want to fault it >> in readable/writable but not perform an actual read/write unless really >> required. >> >> QEMU phrases this as "don't cause wear on the storage backing". > > Sorry for being dense here but I still do not follow. If you do not want > to read then what do you want to populate from? Only map if it is in the In the context of VMs it's usually rather a mean to preallocate backend storage - which would also happen on read access. See below on case 4). > page cache? Let's try to untangle my thoughts regarding VMs. We could have as backend storage for our VM: 1) Anonymous memory 2) hugetlbfs (private/shared) 3) tmpfs/shmem (private/shared) 4) Ordinary files (shared) 5) DAX/PMEM (shared) Excluding special cases (hypervisor upgrades with 2) and 3) ), we expect to have pre-existing content in files only in 4) and 5). 4) and 5) might be used as NVDIMM backend for a guest, or as DIMM backend. The first access of our VM to memory could be a) Write: the usual case when exposed as RAM/DIMM to out guest. b) Read: possible case when exposed as an NVDIMM to our guest (we don't know). But eventually, we might write to (parts of) NVDIMMs later. We "preallocate"/"populate" memory of our VM so that - We know we have sufficient backend storage (esp. hugetlbfs, shmem, files) - so we don't randomly crash the VM. My most important use case. - We avoid page faults (including page zeroing!) at runtime. Especially relevant for RT workloads. With 1), 2), and 3) we want to have pages faulted in writable - we expect that our guest will write to that memory. 
MADV_POPULATE would do that only for 1), and MAP_PRIVATE of 2). For the shared parts, we would want MADV_POPULATE_WRITE semantics.

With 5), we already had complaints that preallocation in QEMU takes a long time - because we end up actually reading/writing slow PMEM (libvirt now disables preallocation for that reason, which makes sense). However, MADV_POPULATE_WRITE would help to prefault without actually reading/writing pmem - if we want to avoid any minor faults.

With 4), I think we primarily prealloc/prefault to make sure we have sufficient backend storage. fallocate() might do a better job just for the allocation. But if there is sufficient RAM it might make sense to prefault all guest RAM at least readable - then we only have a minor fault when the VM writes to it and might avoid having to go to disk. Prefaulting everything writable means that we *have to* write back all guest RAM even if the guest never accessed it.

So I think there are cases where MADV_POPULATE_READ (current MADV_POPULATE) semantics could make sense.
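One possible reading of the five backend cases, written down as a tiny decision helper. All enum and function names here are hypothetical and only serve to summarize the reasoning; this is not how QEMU or the patch decides anything.

```c
#include <assert.h>

/* Hypothetical labels for the backend types 1) to 5) listed above. */
enum vm_backend {
    BACKEND_ANON,      /* 1) anonymous memory */
    BACKEND_HUGETLBFS, /* 2) hugetlbfs (private/shared) */
    BACKEND_SHMEM,     /* 3) tmpfs/shmem (private/shared) */
    BACKEND_FILE,      /* 4) ordinary files (shared) */
    BACKEND_DAX        /* 5) DAX/PMEM (shared) */
};

enum populate_mode { POPULATE_MODE_READ, POPULATE_MODE_WRITE };

/* Sketch: which populate flavor the discussion suggests per backend. */
static enum populate_mode pick_populate_mode(enum vm_backend backend)
{
    switch (backend) {
    case BACKEND_FILE:
        /* Prefault readable only, so untouched guest RAM is not dirtied
         * and does not have to be written back. */
        return POPULATE_MODE_READ;
    case BACKEND_ANON:
    case BACKEND_HUGETLBFS:
    case BACKEND_SHMEM:
    case BACKEND_DAX:
        /* The guest is expected to write; a simulated write access
         * avoids later faults without actually storing data. */
        return POPULATE_MODE_WRITE;
    }
    assert(!"unknown backend");
    return POPULATE_MODE_WRITE;
}
```

For 4), fallocate() may still be the better tool when only the allocation matters, as noted above.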
> +               tmp_end = min_t(unsigned long, end, vma->vm_end);
> +               pages = populate_vma_page_range(vma, start, tmp_end, &locked);
> +               if (!locked) {
> +                       mmap_read_lock(mm);
> +                       *prev = NULL;
> +                       vma = NULL;

^ locked = 1; is missing here.


--- Simple benchmark ---

I implemented MADV_POPULATE_READ and MADV_POPULATE_WRITE and performed
some simple measurements to simulate memory preallocation with empty files:

1) mmap a 2 MiB/128 MiB/4 GiB region (anonymous, memfd, memfd hugetlb)
2) Discard all memory using fallocate/madvise
3) Prefault memory using different approaches and measure the time this
   takes.

I repeat 2)+3) 10 times and compute the average. I only use a single thread.

Read: Read from each page a byte.
Write: Write one byte of each page (0).
Read+Write: Read one byte and write the value back for each page
POPULATE: MADV_POPULATE (this patch)
POPULATE_READ: MADV_POPULATE_READ
POPULATE_WRITE: MADV_POPULATE_WRITE

--- Benchmark results ---

Measuring 10 iterations each:
==================================================
2 MiB MAP_PRIVATE:
**************************************************
Anonymous   : Read           :    0.159 ms
Anonymous   : Write          :    0.244 ms
Anonymous   : Read+Write     :    0.383 ms
Anonymous   : POPULATE       :    0.167 ms
Anonymous   : POPULATE_READ  :    0.064 ms
Anonymous   : POPULATE_WRITE :    0.165 ms
Memfd 4 KiB : Read           :    0.401 ms
Memfd 4 KiB : Write          :    0.056 ms
Memfd 4 KiB : Read+Write     :    0.075 ms
Memfd 4 KiB : POPULATE       :    0.057 ms
Memfd 4 KiB : POPULATE_READ  :    0.337 ms
Memfd 4 KiB : POPULATE_WRITE :    0.056 ms
Memfd 2 MiB : Read           :    0.041 ms
Memfd 2 MiB : Write          :    0.030 ms
Memfd 2 MiB : Read+Write     :    0.031 ms
Memfd 2 MiB : POPULATE       :    0.031 ms
Memfd 2 MiB : POPULATE_READ  :    0.031 ms
Memfd 2 MiB : POPULATE_WRITE :    0.031 ms
**************************************************
2 MiB MAP_SHARED:
**************************************************
Anonymous   : Read           :    0.071 ms
Anonymous   : Write          :    0.181 ms
Anonymous   : Read+Write     :    0.081 ms
Anonymous   : POPULATE       :    0.069 ms
Anonymous   : POPULATE_READ  :    0.069 ms
Anonymous   : POPULATE_WRITE :    0.115 ms
Memfd 4 KiB : Read           :    0.401 ms
Memfd 4 KiB : Write          :    0.351 ms
Memfd 4 KiB : Read+Write     :    0.414 ms
Memfd 4 KiB : POPULATE       :    0.338 ms
Memfd 4 KiB : POPULATE_READ  :    0.339 ms
Memfd 4 KiB : POPULATE_WRITE :    0.279 ms
Memfd 2 MiB : Read           :    0.031 ms
Memfd 2 MiB : Write          :    0.031 ms
Memfd 2 MiB : Read+Write     :    0.031 ms
Memfd 2 MiB : POPULATE       :    0.031 ms
Memfd 2 MiB : POPULATE_READ  :    0.031 ms
Memfd 2 MiB : POPULATE_WRITE :    0.031 ms
**************************************************
128 MiB MAP_PRIVATE:
**************************************************
Anonymous   : Read           :    7.517 ms
Anonymous   : Write          :   22.503 ms
Anonymous   : Read+Write     :   33.186 ms
Anonymous   : POPULATE       :   18.381 ms
Anonymous   : POPULATE_READ  :    3.952 ms
Anonymous   : POPULATE_WRITE :   18.354 ms
Memfd 4 KiB : Read           :   34.300 ms
Memfd 4 KiB : Write          :    4.659 ms
Memfd 4 KiB : Read+Write     :    6.531 ms
Memfd 4 KiB : POPULATE       :    5.219 ms
Memfd 4 KiB : POPULATE_READ  :   29.744 ms
Memfd 4 KiB : POPULATE_WRITE :    5.244 ms
Memfd 2 MiB : Read           :   10.228 ms
Memfd 2 MiB : Write          :   10.130 ms
Memfd 2 MiB : Read+Write     :   10.190 ms
Memfd 2 MiB : POPULATE       :   10.007 ms
Memfd 2 MiB : POPULATE_READ  :   10.008 ms
Memfd 2 MiB : POPULATE_WRITE :   10.010 ms
**************************************************
128 MiB MAP_SHARED:
**************************************************
Anonymous   : Read           :    7.295 ms
Anonymous   : Write          :   15.234 ms
Anonymous   : Read+Write     :    7.460 ms
Anonymous   : POPULATE       :    5.196 ms
Anonymous   : POPULATE_READ  :    5.190 ms
Anonymous   : POPULATE_WRITE :    8.245 ms
Memfd 4 KiB : Read           :   34.412 ms
Memfd 4 KiB : Write          :   30.586 ms
Memfd 4 KiB : Read+Write     :   35.157 ms
Memfd 4 KiB : POPULATE       :   29.643 ms
Memfd 4 KiB : POPULATE_READ  :   29.691 ms
Memfd 4 KiB : POPULATE_WRITE :   25.790 ms
Memfd 2 MiB : Read           :   10.210 ms
Memfd 2 MiB : Write          :   10.074 ms
Memfd 2 MiB : Read+Write     :   10.068 ms
Memfd 2 MiB : POPULATE       :   10.034 ms
Memfd 2 MiB : POPULATE_READ  :   10.037 ms
Memfd 2 MiB : POPULATE_WRITE :   10.031 ms
**************************************************
4096 MiB MAP_PRIVATE:
**************************************************
Anonymous   : Read           :  240.947 ms
Anonymous   : Write          :  712.941 ms
Anonymous   : Read+Write     : 1027.636 ms
Anonymous   : POPULATE       :  571.816 ms
Anonymous   : POPULATE_READ  :  120.215 ms
Anonymous   : POPULATE_WRITE :  570.750 ms
Memfd 4 KiB : Read           : 1054.739 ms
Memfd 4 KiB : Write          :  145.534 ms
Memfd 4 KiB : Read+Write     :  202.275 ms
Memfd 4 KiB : POPULATE       :  162.597 ms
Memfd 4 KiB : POPULATE_READ  :  914.747 ms
Memfd 4 KiB : POPULATE_WRITE :  161.281 ms
Memfd 2 MiB : Read           :  351.818 ms
Memfd 2 MiB : Write          :  352.357 ms
Memfd 2 MiB : Read+Write     :  352.762 ms
Memfd 2 MiB : POPULATE       :  351.471 ms
Memfd 2 MiB : POPULATE_READ  :  351.553 ms
Memfd 2 MiB : POPULATE_WRITE :  351.931 ms
**************************************************
4096 MiB MAP_SHARED:
**************************************************
Anonymous   : Read           :  229.338 ms
Anonymous   : Write          :  478.964 ms
Anonymous   : Read+Write     :  234.546 ms
Anonymous   : POPULATE       :  161.635 ms
Anonymous   : POPULATE_READ  :  160.943 ms
Anonymous   : POPULATE_WRITE :  252.686 ms
Memfd 4 KiB : Read           : 1052.828 ms
Memfd 4 KiB : Write          :  929.237 ms
Memfd 4 KiB : Read+Write     : 1074.494 ms
Memfd 4 KiB : POPULATE       :  915.663 ms
Memfd 4 KiB : POPULATE_READ  :  915.001 ms
Memfd 4 KiB : POPULATE_WRITE :  787.388 ms
Memfd 2 MiB : Read           :  353.580 ms
Memfd 2 MiB : Write          :  353.197 ms
Memfd 2 MiB : Read+Write     :  353.172 ms
Memfd 2 MiB : POPULATE       :  353.686 ms
Memfd 2 MiB : POPULATE_READ  :  353.465 ms
Memfd 2 MiB : POPULATE_WRITE :  352.776 ms
**************************************************

--- Discussion ---

1) With huge pages, the performance benefit is negligible with the sizes I
tried, because there are few actual page faults. Most time is spent
zeroing huge pages I guess. It will take quite a lot of memory to pay off.

2) In most 4k cases, the POPULATE_READ/POPULATE_WRITE variants are faster
than manually reading or writing from user space. The exception is
MAP_PRIVATE "Memfd 4 KiB", where a plain write is on par with or slightly
faster than POPULATE_WRITE.

What sticks out a bit is:

3) For MAP_SHARED on anonymous memory, it is fastest to first read and then
write memory. It's slightly faster than POPULATE_WRITE and quite a lot
faster than a simple write - what?! I assume with the read access we
prepare a fresh zero page and with the write access we only have to change
PTE access rights. But why is this faster than writing directly?

4) POPULATE_WRITE with MAP_SHARED "Memfd 4 KiB" is faster than
POPULATE_READ - it's the fastest way to preallocate that memory.
Similarly, ordinary writes are faster than ordinary reads.

I did not try with actual files, files that already have content, or with
multiple threads yet. Also, I did not try on a subset of an mmap yet - for
simplicity I populate the whole mapping.
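For reference, the measurement loop described under "Simple benchmark" could look roughly like the sketch below for the plain "Write" variant on private anonymous memory. This is not the benchmark program used for the numbers above; `prefault_write` and `bench_write_ms` are hypothetical names, and only the discard step (MADV_DONTNEED) and the per-page write are taken from the description.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/* "Write" variant: store one zero byte into every page.
 * Returns the number of pages touched. */
static size_t prefault_write(volatile unsigned char *p, size_t len, size_t page)
{
    size_t touched = 0;
    for (size_t off = 0; off < len; off += page, touched++)
        p[off] = 0;
    return touched;
}

/* Average time in ms over `iters` rounds of discard + prefault on a
 * private anonymous mapping of `len` bytes. Returns -1.0 on error. */
static double bench_write_ms(size_t len, int iters)
{
    if (iters <= 0)
        return -1.0;
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1.0;

    double total_ms = 0.0;
    for (int i = 0; i < iters; i++) {
        struct timespec start, end;

        /* Step 2): discard everything so each round starts unpopulated. */
        if (madvise(p, len, MADV_DONTNEED) != 0) {
            munmap(p, len);
            return -1.0;
        }
        /* Step 3): prefault and measure only this part. */
        clock_gettime(CLOCK_MONOTONIC, &start);
        size_t touched = prefault_write(p, len, page);
        clock_gettime(CLOCK_MONOTONIC, &end);
        assert(touched == (len + page - 1) / page);

        total_ms += (end.tv_sec - start.tv_sec) * 1e3 +
                    (end.tv_nsec - start.tv_nsec) / 1e6;
    }
    munmap(p, len);
    return total_ms / iters;
}
```

Calling `bench_write_ms((size_t)2 << 20, 10)` would correspond to the "2 MiB MAP_PRIVATE / Anonymous / Write" row; absolute numbers will of course differ per machine and kernel.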
On 24.02.21 15:25, David Hildenbrand wrote:

[...]

> 2) In all 4k cases, the POPULATE_READ/POPULATE_WRITE variants are faster
> than manually reading or writing from user space.

Forgot to mention one exception: on Memfd 4 KiB with MAP_PRIVATE,
POPULATE_WRITE is slower than a simple write. And a read fault is far
slower than a write fault (what?).
On 24.02.21 15:25, David Hildenbrand wrote:

[...]

> What sticks out a bit is:
>
> 3) For MAP_SHARED on anonymous memory, it is fastest to first read and
> then write memory. It's slightly faster than POPULATE_WRITE and quite a
> lot faster than a simple write - what?!. It's even faster than
> POPULATE_WRITE - what?! I assume with the read access we prepare a fresh
> zero page and with the write access we only have to change PTE access
> rights. But why is this faster than writing directly?

Okay, MADV_DONTNEED does not seem to really work on MAP_SHARED anonymous
memory. If I use a fresh mmap for each and every iteration, the numbers
make more sense:

**************************************************
4096 MiB MAP_SHARED:
**************************************************
Anonymous   : Read           : 1054.154 ms
Anonymous   : Write          :  924.572 ms
Anonymous   : Read+Write     : 1075.215 ms
Anonymous   : POPULATE       :  911.386 ms
Anonymous   : POPULATE_READ  :  909.392 ms
Anonymous   : POPULATE_WRITE :  793.143 ms
diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index a18ec7f63888..e90eeb5e6cf1 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -71,6 +71,8 @@
 #define MADV_COLD	20		/* deactivate these pages */
 #define MADV_PAGEOUT	21		/* reclaim these pages */
 
+#define MADV_POPULATE	22		/* populate pages */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 57dc2ac4f8bd..b928becc5308 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -98,6 +98,8 @@
 #define MADV_COLD	20		/* deactivate these pages */
 #define MADV_PAGEOUT	21		/* reclaim these pages */
 
+#define MADV_POPULATE	22		/* populate pages */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index ab78cba446ed..9d3a56044287 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -52,6 +52,8 @@
 #define MADV_COLD	20		/* deactivate these pages */
 #define MADV_PAGEOUT	21		/* reclaim these pages */
 
+#define MADV_POPULATE	22		/* populate pages */
+
 #define MADV_MERGEABLE   65		/* KSM may merge identical pages */
 #define MADV_UNMERGEABLE 66		/* KSM may not merge identical pages */
 
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index e5e643752947..3169b1be8920 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -106,6 +106,8 @@
 #define MADV_COLD	20		/* deactivate these pages */
 #define MADV_PAGEOUT	21		/* reclaim these pages */
 
+#define MADV_POPULATE	22		/* populate pages */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index f94f65d429be..fa617fd0d733 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -72,6 +72,8 @@
 #define MADV_COLD	20		/* deactivate these pages */
 #define MADV_PAGEOUT	21		/* reclaim these pages */
 
+#define MADV_POPULATE	22		/* populate pages */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/mm/madvise.c b/mm/madvise.c
index 6a660858784b..f76fdd6fcf10 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -53,6 +53,7 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_COLD:
 	case MADV_PAGEOUT:
 	case MADV_FREE:
+	case MADV_POPULATE:
 		return 0;
 	default:
 		/* be safe, default to 1. list exceptions explicitly */
@@ -821,6 +822,72 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
 		return -EINVAL;
 }
 
+static long madvise_populate(struct vm_area_struct *vma,
+			     struct vm_area_struct **prev,
+			     unsigned long start, unsigned long end)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long tmp_end;
+	int locked = 1;
+	long pages;
+
+	*prev = vma;
+
+	while (start < end) {
+		/*
+		 * We might have temporarily dropped the lock. For example,
+		 * our VMA might have been split.
+		 */
+		if (!vma || start >= vma->vm_end) {
+			vma = find_vma(mm, start);
+			if (!vma)
+				return -ENOMEM;
+		}
+
+		/* Bail out on incompatible VMA types. */
+		if (vma->vm_flags & (VM_IO | VM_PFNMAP) ||
+		    !vma_is_accessible(vma)) {
+			return -EINVAL;
+		}
+
+		/*
+		 * Populate pages and take care of VM_LOCKED: simulate user
+		 * space access.
+		 *
+		 * For private, writable mappings, trigger a write fault to
+		 * break COW (i.e., shared zeropage). For other mappings (i.e.,
+		 * read-only, shared), trigger a read fault.
+		 */
+		tmp_end = min_t(unsigned long, end, vma->vm_end);
+		pages = populate_vma_page_range(vma, start, tmp_end, &locked);
+		if (!locked) {
+			mmap_read_lock(mm);
+			*prev = NULL;
+			vma = NULL;
+		}
+		if (pages < 0) {
+			switch (pages) {
+			case -EINTR:
+			case -ENOMEM:
+				return pages;
+			case -EHWPOISON:
+				/* Skip over any poisoned pages. */
+				start += PAGE_SIZE;
+				continue;
+			case -EBUSY:
+			case -EAGAIN:
+				continue;
+			default:
+				pr_warn_once("%s: unhandled return value: %ld\n",
+					     __func__, pages);
+				return -ENOMEM;
+			}
+		}
+		start += pages * PAGE_SIZE;
+	}
+	return 0;
+}
+
 /*
  * Application wants to free up the pages and associated backing store.
  * This is effectively punching a hole into the middle of a file.
@@ -934,6 +1001,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	case MADV_FREE:
 	case MADV_DONTNEED:
 		return madvise_dontneed_free(vma, prev, start, end, behavior);
+	case MADV_POPULATE:
+		return madvise_populate(vma, prev, start, end);
 	default:
 		return madvise_behavior(vma, prev, start, end, behavior);
 	}
@@ -954,6 +1023,7 @@ madvise_behavior_valid(int behavior)
 	case MADV_FREE:
 	case MADV_COLD:
 	case MADV_PAGEOUT:
+	case MADV_POPULATE:
 #ifdef CONFIG_KSM
 	case MADV_MERGEABLE:
 	case MADV_UNMERGEABLE:
When we manage sparse memory mappings dynamically in user space - also
sometimes involving MAP_NORESERVE - we want to dynamically populate/
discard memory inside such a sparse memory region. Example users are
hypervisors (especially implementing memory ballooning or similar
technologies like virtio-mem) and memory allocators. In addition, we want
to fail in a nice way if populating does not succeed because we are out of
backend memory (which can happen easily with file-based mappings,
especially tmpfs and hugetlbfs).

While MADV_DONTNEED and FALLOC_FL_PUNCH_HOLE provide us ways to reliably
discard memory, there is no generic approach to populate ("preallocate")
memory.

Although mmap() supports MAP_POPULATE, it is not applicable to the concept
of sparse memory mappings, where we want to do populate/discard
dynamically and avoid expensive/problematic remappings. In addition,
we never actually report errors during the final populate phase - it is
best-effort only.

fallocate() can be used to preallocate file-based memory and fail in a safe
way. However, it is less useful for private mappings on anonymous files
due to COW semantics. For example, using fallocate() to preallocate memory
on anonymous memfd files that are mapped MAP_PRIVATE results in double
memory consumption when actually writing via the mapping. In addition,
fallocate() does not actually populate page tables, so we still always
have to resolve minor faults on first access.

Because we don't have a proper interface, what applications
(like QEMU and databases) end up doing is touching (i.e., writing) all
individual pages. However, this requires expensive signal handling
(SIGBUS); for example, this is problematic in hypervisors like QEMU where
SIGBUS handlers might already be used by other subsystems concurrently to,
e.g., handle hardware errors. "Simply" doing preallocation from another
thread is not that easy.

Let's introduce MADV_POPULATE with the following semantics:
1. MADV_POPULATE does not work on PROT_NONE and special VMAs. It works
   on everything else.
2. Errors during MADV_POPULATE (especially OOM) are reported. If we hit
   hardware errors on pages, ignore them - nothing we really can or
   should do.
3. On errors during MADV_POPULATE, some memory might have been
   populated. Callers have to clean up if they care.
4. Concurrent changes to the virtual memory layout are tolerated - we
   process each and every PFN only once, though.
5. If MADV_POPULATE succeeds, all memory in the range can be accessed
   without SIGBUS (of course, not if user space changed mappings in the
   meantime or KSM kicked in on anonymous memory).

Although sparse memory mappings are the primary use case, this will also
be useful for ordinary preallocations where MAP_POPULATE is not desired
(e.g., in QEMU, where users can trigger preallocation of guest RAM after
the mapping was created).

Looking at the history, MADV_POPULATE was already proposed in 2013 [1],
however, the main motivation back then was performance improvements
(which should also still be the case, but it's a secondary concern).

Basic functionality was tested with:
- anonymous memory
- MAP_PRIVATE on anonymous file via memfd
- MAP_SHARED on anonymous file via memfd
- MAP_PRIVATE on anonymous hugetlbfs file via memfd
- MAP_SHARED on anonymous hugetlbfs file via memfd
- MAP_PRIVATE on tmpfs/shmem file (we end up with double memory consumption
  though, as the actual file gets populated with zeroes)
- MAP_SHARED on tmpfs/shmem file

Note: For populating/preallocating zeroed-out memory while userfaultfd is
active, it's even faster to first use fallocate() or to place zeroed
pages via userfaultfd APIs. Otherwise, we'll have to route every fault
while populating via the userfaultfd handler.

[1] https://lkml.org/lkml/2013/6/27/698

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Chris Zankel <chris@zankel.net>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: linux-alpha@vger.kernel.org
Cc: linux-mips@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linux-xtensa@linux-xtensa.org
Cc: linux-arch@vger.kernel.org
Signed-off-by: David Hildenbrand <david@redhat.com>
---

If we agree that this makes sense, I'll do more testing to see if we
are missing any return value handling and prepare a man page update to
document the semantics.

Thoughts?

---
 arch/alpha/include/uapi/asm/mman.h     |  2 +
 arch/mips/include/uapi/asm/mman.h      |  2 +
 arch/parisc/include/uapi/asm/mman.h    |  2 +
 arch/xtensa/include/uapi/asm/mman.h    |  2 +
 include/uapi/asm-generic/mman-common.h |  2 +
 mm/madvise.c                           | 70 ++++++++++++++++++++++++++
 6 files changed, 80 insertions(+)
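
To make the fallocate() alternative from the changelog concrete, here is a minimal sketch of preallocating an anonymous memfd with error reporting. The helper name memfd_prealloc() and the memfd name "prealloc-sketch" are made up for illustration; memfd_create() needs a reasonably recent glibc. As the changelog points out, this preallocates the file pages only: page tables are still empty after a later mmap(), and writes through a MAP_PRIVATE mapping would allocate additional COW pages.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: create an anonymous memfd of `len` bytes and
 * preallocate its backing pages with fallocate(). Returns the fd, or -1
 * with errno set (for example ENOSPC or ENOMEM when the backend is out
 * of memory), so the caller can fail gracefully instead of taking a
 * SIGBUS on first access.
 */
static int memfd_prealloc(size_t len)
{
	int fd = memfd_create("prealloc-sketch", MFD_CLOEXEC);
	int saved_errno;

	if (fd < 0)
		return -1;

	if (ftruncate(fd, (off_t)len) || fallocate(fd, 0, 0, (off_t)len)) {
		saved_errno = errno;
		close(fd);
		errno = saved_errno;	/* report the original failure */
		return -1;
	}
	return fd;
}
```

A caller would typically mmap() the returned fd with MAP_SHARED afterwards; the first access to each page then resolves as a minor fault against the already allocated page.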