[RFC] mm/madvise: introduce MADV_POPULATE to prefault/prealloc memory

Message ID 20210217154844.12392-1-david@redhat.com (mailing list archive)
State Superseded
Series [RFC] mm/madvise: introduce MADV_POPULATE to prefault/prealloc memory

Commit Message

David Hildenbrand Feb. 17, 2021, 3:48 p.m. UTC
When we manage sparse memory mappings dynamically in user space - also
sometimes involving MADV_NORESERVE - we want to dynamically populate/
discard memory inside such a sparse memory region. Example users are
hypervisors (especially implementing memory ballooning or similar
technologies like virtio-mem) and memory allocators. In addition, we want
to fail in a nice way if populating does not succeed because we are out of
backend memory (which can happen easily with file-based mappings,
especially tmpfs and hugetlbfs).

While MADV_DONTNEED and FALLOC_FL_PUNCH_HOLE provide us ways to reliably
discard memory, there is no generic approach to populate ("preallocate")
memory.
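
As a minimal sketch (not part of the patch), the two discard paths mentioned
above could be wrapped like this. The helper names discard_mapping_range()
and discard_file_range() are made up for illustration; only madvise() and
fallocate() with the flags shown are existing interfaces.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>

/*
 * Hypothetical helper: drop the pages backing [addr, addr + len) of a
 * mapping. For private anonymous memory, the next access sees zero pages.
 */
static int discard_mapping_range(void *addr, size_t len)
{
	assert(len > 0);
	return madvise(addr, len, MADV_DONTNEED);
}

/*
 * Hypothetical helper: free the backing store of [off, off + len) in a
 * file (e.g., a memfd or tmpfs file) without changing the file size.
 */
static int discard_file_range(int fd, off_t off, off_t len)
{
	assert(len > 0);
	return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 off, len);
}
```

Both helpers return 0 on success and -1 with errno set on failure, exactly
like the underlying calls.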

Although mmap() supports MAP_POPULATE, it is not applicable to the concept
of sparse memory mappings, where we want to do populate/discard
dynamically and avoid expensive/problematic remappings. In addition,
we never actually report errors during the final populate phase - it is
best-effort only.

fallocate() can be used to preallocate file-based memory and fail in a safe
way. However, it is less useful for private mappings on anonymous files
due to COW semantics. For example, using fallocate() to preallocate memory
on anonymous memfd files that are mapped MAP_PRIVATE results in double
memory consumption when actually writing via the mapping. In addition,
fallocate() does not actually populate page tables, so we still always
have to resolve minor faults on first access.
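
For illustration only, preallocating an anonymous memfd via fallocate() might
look like the sketch below. The helper name make_prealloc_memfd() is an
assumption, not something from the patch. Note that this only allocates the
file's backing memory: it does not populate page tables, and writing through
a MAP_PRIVATE mapping of the file would still allocate additional COW pages.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create an anonymous memfd and preallocate len bytes
 * of backing memory for it. Returns the fd on success or -errno on failure,
 * so running out of backend memory is reported as an error code rather
 * than as SIGBUS on a later access to a shared mapping of the file.
 */
static int make_prealloc_memfd(const char *name, off_t len)
{
	int fd, err;

	assert(len > 0);
	fd = memfd_create(name, MFD_CLOEXEC);
	if (fd < 0)
		return -errno;

	/* mode 0: allocate the range and extend the file size to len */
	if (fallocate(fd, 0, 0, len) != 0) {
		err = errno;
		close(fd);
		return -err;
	}
	return fd;
}
```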

Because we don't have a proper interface, what applications
(like QEMU and databases) end up doing is touching (i.e., writing) all
individual pages. However, it requires expensive signal handling (SIGBUS);
for example, this is problematic in hypervisors like QEMU where SIGBUS
handlers might already be used by other subsystems concurrently to, e.g.,
handle hardware errors. "Simply" doing preallocation from another thread
is not that easy.

Let's introduce MADV_POPULATE with the following semantics:
1. MADV_POPULATED does not work on PROT_NONE and special VMAs. It works
   on everything else.
2. Errors during MADV_POPULATED (especially OOM) are reported. If we hit
   hardware errors on pages, ignore them - nothing we really can or
   should do.
3. On errors during MADV_POPULATED, some memory might have been
   populated. Callers have to clean up if they care.
4. Concurrent changes to the virtual memory layour are tolerated - we
   process each and every PFN only once, though.
5. If MADV_POPULATE succeeds, all memory in the range can be accessed
   without SIGBUS. (of course, not if user space changed mappings in the
   meantime or KSM kicked in on anonymous memory).

Although sparse memory mappings are the primary use case, this will
also be useful for ordinary preallocations where MAP_POPULATE is not
desired (e.g., in QEMU, where users can trigger preallocation of
guest RAM after the mapping was created).

Looking at the history, MADV_POPULATE was already proposed in 2013 [1];
however, the main motivation back then was performance improvements
(which should also still be the case, but it's a secondary concern).

Basic functionality was tested with:
- anonymous memory
- MAP_PRIVATE on anonymous file via memfd
- MAP_SHARED on anonymous file via memfd
- MAP_PRIVATE on anonymous hugetlbfs file via memfd
- MAP_SHARED on anonymous hugetlbfs file via memfd
- MAP_PRIVATE on tmpfs/shmem file (we end up with double memory consumption
  though, as the actual file gets populated with zeroes)
- MAP_SHARED on tmpfs/shmem file

Note: For populating/preallocating zeroed-out memory while userfaultfd is
active, it's even faster to first use fallocate() or to place zeroed pages
via userfaultfd APIs. Otherwise, we'll have to route every fault while
populating via the userfaultfd handler.

[1] https://lkml.org/lkml/2013/6/27/698

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Chris Zankel <chris@zankel.net>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: linux-alpha@vger.kernel.org
Cc: linux-mips@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linux-xtensa@linux-xtensa.org
Cc: linux-arch@vger.kernel.org
Signed-off-by: David Hildenbrand <david@redhat.com>
---

If we agree that this makes sense I'll do more testing to see if we
are missing any return value handling and prepare a man page update to
document the semantics.

Thoughts?

---
 arch/alpha/include/uapi/asm/mman.h     |  2 +
 arch/mips/include/uapi/asm/mman.h      |  2 +
 arch/parisc/include/uapi/asm/mman.h    |  2 +
 arch/xtensa/include/uapi/asm/mman.h    |  2 +
 include/uapi/asm-generic/mman-common.h |  2 +
 mm/madvise.c                           | 70 ++++++++++++++++++++++++++
 6 files changed, 80 insertions(+)

Comments

Dave Hansen Feb. 17, 2021, 4:46 p.m. UTC | #1
On 2/17/21 7:48 AM, David Hildenbrand wrote:
> While MADV_DONTNEED and FALLOC_FL_PUNCH_HOLE provide us ways to reliably
> discard memory, there is no generic approach to populate ("preallocate")
> memory.
> 
> Although mmap() supports MAP_POPULATE, it is not applicable to the concept
> of sparse memory mappings, where we want to do populate/discard
> dynamically and avoid expensive/problematic remappings. In addition,
> we never actually report error during the final populate phase - it is
> best-effort only.

Seems pretty sane to me.

But, I was surprised that MADV_WILLNEED was not mentioned.  It might be
nice to touch on why MADV_WILLNEED is a bad choice for this
functionality?  We could theoretically have it populate anonymous
mappings instead of just swapping in.

I guess it's possible that folks are using MADV_WILLNEED on sparse
mappings that they don't want to populate, but it would be nice to get
that in the changelog.

I was also a bit bummed to see the broad VM_IO/PFNMAP restriction show
up again.  I was just looking at implementing pre-faulting for the new
SGX driver:

> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/cpu/sgx/driver.c

It has a vm_ops->fault handler, but the VMAs are VM_IO.  It obviously
doesn't work with gup, though.  Not a deal breaker, and something we could
certainly add to this later.

David Hildenbrand Feb. 17, 2021, 5:06 p.m. UTC | #2
On 17.02.21 17:46, Dave Hansen wrote:
> On 2/17/21 7:48 AM, David Hildenbrand wrote:
>> While MADV_DONTNEED and FALLOC_FL_PUNCH_HOLE provide us ways to reliably
>> discard memory, there is no generic approach to populate ("preallocate")
>> memory.
>>
>> Although mmap() supports MAP_POPULATE, it is not applicable to the concept
>> of sparse memory mappings, where we want to do populate/discard
>> dynamically and avoid expensive/problematic remappings. In addition,
>> we never actually report error during the final populate phase - it is
>> best-effort only.
> 
> Seems pretty sane to me.
> 
> But, I was surprised that MADV_WILLNEED was no mentioned.  It might be
> nice to touch on on why MADV_WILLNEED is a bad choice for this
> functionality?  We could theoretically have it populate anonymous
> mappings instead of just swapping in.

I stumbled over it, but it ended up looking like mixing in different 
semantics.

"Expect access in the near future." and "might be a good idea to read 
some pages" vs. "Definitely populate/preallocate all memory and 
definitely fail.".

> 
> I guess it's possible that folks are using MADV_WILLNEED on sparse
> mappings that they don't want to populate, but it would be nice to get
> that in the changelog.

Indeed: prime example is virtio-balloon in QEMU when deflating. Just 
because we are deflating the balloon doesn't mean that the guest is 
going to use all memory immediately - and that we want to actually 
consume memory immediately. ... we call MADV_WILLNEED unconditionally on 
any memory backing when deflating ...

I'll definitely add that to the changelog - thanks.

> 
> I was also a bit bummed to see the broad VM_IO/PFNMAP restriction show
> up again.  I was just looking at implementing pre-faulting for the new
> SGX driver:

I added that because __mm_populate() similarly skips over VM_IO | 
VM_PFNMAP. So it mimics existing "populate semantics" we have.

> 
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/cpu/sgx/driver.c
> 
> It has a vm_ops->fault handler, but the VMAs are VM_IO.  It obviously
> don't work with gup, though.  Not a deal breaker, and something we could
> certainly add to this later.

I assume you would then also want to support MAP_POPULATE, right? 
Because it ends up using __mm_populate() and would not work.

Thanks!

Vlastimil Babka Feb. 17, 2021, 5:21 p.m. UTC | #3
+CC linux-api; please do so on further revisions.

Keeping rest of the e-mail.

On 2/17/21 4:48 PM, David Hildenbrand wrote:
> When we manage sparse memory mappings dynamically in user space - also
> sometimes involving MADV_NORESERVE - we want to dynamically populate/
> discard memory inside such a sparse memory region. Example users are
> hypervisors (especially implementing memory ballooning or similar
> technologies like virtio-mem) and memory allocators. In addition, we want
> to fail in a nice way if populating does not succeed because we are out of
> backend memory (which can happen easily with file-based mappings,
> especially tmpfs and hugetlbfs).
> 
> While MADV_DONTNEED and FALLOC_FL_PUNCH_HOLE provide us ways to reliably
> discard memory, there is no generic approach to populate ("preallocate")
> memory.
> 
> Although mmap() supports MAP_POPULATE, it is not applicable to the concept
> of sparse memory mappings, where we want to do populate/discard
> dynamically and avoid expensive/problematic remappings. In addition,
> we never actually report error during the final populate phase - it is
> best-effort only.
> 
> fallocate() can be used to preallocate file-based memory and fail in a safe
> way. However, it is less useful for private mappings on anonymous files
> due to COW semantics. For example, using fallocate() to preallocate memory
> on an anonymous memfd files that are mapped MAP_PRIVATE results in a double
> memory consumption when actually writing via the mapping. In addition,
> fallocate() does not actually populate page tables, so we still always
> have to resolve minor faults on first access.
> 
> Because we don't have a proper interface, what applications
> (like QEMU and databases) end up doing is touching (i.e., writing) all
> individual pages. However, it requires expensive signal handling (SIGBUS);
> for example, this is problematic in hypervisors like QEMU where SIGBUS
> handlers might already be used by other subsystems concurrently to e.g,
> handle hardware errors. "Simply" doing preallocation from another thread
> is not that easy.
> 
> Let's introduce MADV_POPULATE with the following semantics
> 1. MADV_POPULATED does not work on PROT_NONE and special VMAs. It works
>    on everything else.
> 2. Errors during MADV_POPULATED (especially OOM) are reported. If we hit
>    hardware errors on pages, ignore them - nothing we really can or
>    should do.
> 3. On errors during MADV_POPULATED, some memory might have been
>    populated. Callers have to clean up if they care.
> 4. Concurrent changes to the virtual memory layour are tolerated - we
>    process each and every PFN only once, though.
> 5. If MADV_POPULATE succeeds, all memory in the range can be accessed
>    without SIGBUS. (of course, not if user space changed mappings in the
>    meantime or KSM kicked in on anonymous memory).
> 
> Although sparse memory mappings are the primary use case, this will
> also be useful for ordinary preallocations where MAP_POPULATE is not
> desired (e.g., in QEMU, where users can trigger preallocation of
> guest RAM after the mapping was created).
> 
> Looking at the history, MADV_POPULATE was already proposed in 2013 [1],
> however, the main motivation back than was performance improvements
> (which should also still be the case, but it's a seconary concern).
> 
> Basic functionality was tested with:
> - anonymous memory
> - MAP_PRIVATE on anonymous file via memfd
> - MAP_SHARED on anonymous file via memf
> - MAP_PRIVATE on anonymous hugetlbfs file via memfd
> - MAP_SHARED on anonymous hugetlbfs file via memfd
> - MAP_PRIVATE on tmpfs/shmem file (we end up with double memory consumption
>   though, as the actual file gets populated with zeroes)
> - MAP_SHARED on tmpfs/shmem file
> 
> Note: For populating/preallocating zeroed-out memory while userfaultfd is
> active, it's even faster to use first fallocate() or placing zeroed pages
> via userfaultfd APIs. Otherwise, we'll have to route every fault while
> populating via the userfaultfd handler.
> 
> [1] https://lkml.org/lkml/2013/6/27/698
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Jann Horn <jannh@google.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Rik van Riel <riel@surriel.com>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Richard Henderson <rth@twiddle.net>
> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
> Cc: Matt Turner <mattst88@gmail.com>
> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
> Cc: Helge Deller <deller@gmx.de>
> Cc: Chris Zankel <chris@zankel.net>
> Cc: Max Filippov <jcmvbkbc@gmail.com>
> Cc: linux-alpha@vger.kernel.org
> Cc: linux-mips@vger.kernel.org
> Cc: linux-parisc@vger.kernel.org
> Cc: linux-xtensa@linux-xtensa.org
> Cc: linux-arch@vger.kernel.org
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
> 
> If we agree that this makes sense I'll do more testing to see if we
> are missing any return value handling and prepare a man page update to
> document the semantics.
> 
> Thoughts?
> 
> ---
>  arch/alpha/include/uapi/asm/mman.h     |  2 +
>  arch/mips/include/uapi/asm/mman.h      |  2 +
>  arch/parisc/include/uapi/asm/mman.h    |  2 +
>  arch/xtensa/include/uapi/asm/mman.h    |  2 +
>  include/uapi/asm-generic/mman-common.h |  2 +
>  mm/madvise.c                           | 70 ++++++++++++++++++++++++++
>  6 files changed, 80 insertions(+)
> 
> diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> index a18ec7f63888..e90eeb5e6cf1 100644
> --- a/arch/alpha/include/uapi/asm/mman.h
> +++ b/arch/alpha/include/uapi/asm/mman.h
> @@ -71,6 +71,8 @@
>  #define MADV_COLD	20		/* deactivate these pages */
>  #define MADV_PAGEOUT	21		/* reclaim these pages */
>  
> +#define MADV_POPULATE	22		/* populate pages */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>  
> diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> index 57dc2ac4f8bd..b928becc5308 100644
> --- a/arch/mips/include/uapi/asm/mman.h
> +++ b/arch/mips/include/uapi/asm/mman.h
> @@ -98,6 +98,8 @@
>  #define MADV_COLD	20		/* deactivate these pages */
>  #define MADV_PAGEOUT	21		/* reclaim these pages */
>  
> +#define MADV_POPULATE	22		/* populate pages */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>  
> diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> index ab78cba446ed..9d3a56044287 100644
> --- a/arch/parisc/include/uapi/asm/mman.h
> +++ b/arch/parisc/include/uapi/asm/mman.h
> @@ -52,6 +52,8 @@
>  #define MADV_COLD	20		/* deactivate these pages */
>  #define MADV_PAGEOUT	21		/* reclaim these pages */
>  
> +#define MADV_POPULATE	22		/* populate pages */
> +
>  #define MADV_MERGEABLE   65		/* KSM may merge identical pages */
>  #define MADV_UNMERGEABLE 66		/* KSM may not merge identical pages */
>  
> diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> index e5e643752947..3169b1be8920 100644
> --- a/arch/xtensa/include/uapi/asm/mman.h
> +++ b/arch/xtensa/include/uapi/asm/mman.h
> @@ -106,6 +106,8 @@
>  #define MADV_COLD	20		/* deactivate these pages */
>  #define MADV_PAGEOUT	21		/* reclaim these pages */
>  
> +#define MADV_POPULATE	22		/* populate pages */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>  
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index f94f65d429be..fa617fd0d733 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -72,6 +72,8 @@
>  #define MADV_COLD	20		/* deactivate these pages */
>  #define MADV_PAGEOUT	21		/* reclaim these pages */
>  
> +#define MADV_POPULATE	22		/* populate pages */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>  
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 6a660858784b..f76fdd6fcf10 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -53,6 +53,7 @@ static int madvise_need_mmap_write(int behavior)
>  	case MADV_COLD:
>  	case MADV_PAGEOUT:
>  	case MADV_FREE:
> +	case MADV_POPULATE:
>  		return 0;
>  	default:
>  		/* be safe, default to 1. list exceptions explicitly */
> @@ -821,6 +822,72 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
>  		return -EINVAL;
>  }
>  
> +static long madvise_populate(struct vm_area_struct *vma,
> +			     struct vm_area_struct **prev,
> +			     unsigned long start, unsigned long end)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	unsigned long tmp_end;
> +	int locked = 1;
> +	long pages;
> +
> +	*prev = vma;
> +
> +	while (start < end) {
> +		/*
> +		 * We might have temporarily dropped the lock. For example,
> +		 * our VMA might have been split.
> +		 */
> +		if (!vma || start >= vma->vm_end) {
> +			vma = find_vma(mm, start);
> +			if (!vma)
> +				return -ENOMEM;
> +		}
> +
> +		/* Bail out on incompatible VMA types. */
> +		if (vma->vm_flags & (VM_IO | VM_PFNMAP) ||
> +		    !vma_is_accessible(vma)) {
> +			return -EINVAL;
> +		}
> +
> +		/*
> +		 * Populate pages and take care of VM_LOCKED: simulate user
> +		 * space access.
> +		 *
> +		 * For private, writable mappings, trigger a write fault to
> +		 * break COW (i.e., shared zeropage). For other mappings (i.e.,
> +		 * read-only, shared), trigger a read fault.
> +		 */
> +		tmp_end = min_t(unsigned long, end, vma->vm_end);
> +		pages = populate_vma_page_range(vma, start, tmp_end, &locked);
> +		if (!locked) {
> +			mmap_read_lock(mm);
> +			*prev = NULL;
> +			vma = NULL;
> +		}
> +		if (pages < 0) {
> +			switch (pages) {
> +			case -EINTR:
> +			case -ENOMEM:
> +				return pages;
> +			case -EHWPOISON:
> +				/* Skip over any poisoned pages. */
> +				start += PAGE_SIZE;
> +				continue;
> +			case -EBUSY:
> +			case -EAGAIN:
> +				continue;
> +			default:
> +				pr_warn_once("%s: unhandled return value: %ld\n",
> +					     __func__, pages);
> +				return -ENOMEM;
> +			}
> +		}
> +		start += pages * PAGE_SIZE;
> +	}
> +	return 0;
> +}
> +
>  /*
>   * Application wants to free up the pages and associated backing store.
>   * This is effectively punching a hole into the middle of a file.
> @@ -934,6 +1001,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
>  	case MADV_FREE:
>  	case MADV_DONTNEED:
>  		return madvise_dontneed_free(vma, prev, start, end, behavior);
> +	case MADV_POPULATE:
> +		return madvise_populate(vma, prev, start, end);
>  	default:
>  		return madvise_behavior(vma, prev, start, end, behavior);
>  	}
> @@ -954,6 +1023,7 @@ madvise_behavior_valid(int behavior)
>  	case MADV_FREE:
>  	case MADV_COLD:
>  	case MADV_PAGEOUT:
> +	case MADV_POPULATE:
>  #ifdef CONFIG_KSM
>  	case MADV_MERGEABLE:
>  	case MADV_UNMERGEABLE:
>

Michal Hocko Feb. 18, 2021, 10:25 a.m. UTC | #4
On Wed 17-02-21 16:48:44, David Hildenbrand wrote:
> When we manage sparse memory mappings dynamically in user space - also
> sometimes involving MADV_NORESERVE - we want to dynamically populate/

Just wondering what is MADV_NORESERVE? I do not see anything like that
in the Linus tree. Did you mean MAP_NORESERVE?

> discard memory inside such a sparse memory region. Example users are
> hypervisors (especially implementing memory ballooning or similar
> technologies like virtio-mem) and memory allocators. In addition, we want
> to fail in a nice way if populating does not succeed because we are out of
> backend memory (which can happen easily with file-based mappings,
> especially tmpfs and hugetlbfs).

By "fail in a nice way" you mean failing before a #PF would fail and
raise SIGBUS, which would be harder to handle?

[...]
> Because we don't have a proper interface, what applications
> (like QEMU and databases) end up doing is touching (i.e., writing) all
> individual pages. However, it requires expensive signal handling (SIGBUS);
> for example, this is problematic in hypervisors like QEMU where SIGBUS
> handlers might already be used by other subsystems concurrently to e.g,
> handle hardware errors. "Simply" doing preallocation from another thread
> is not that easy.

OK, that clarifies my above question.

> 
> Let's introduce MADV_POPULATE with the following semantics
> 1. MADV_POPULATED does not work on PROT_NONE and special VMAs. It works
>    on everything else.

This should better clarify what "does not work" means. I assume those are
ignored and do not report any error?

> 2. Errors during MADV_POPULATED (especially OOM) are reported.

How do you want to achieve that? gup/page fault handler will allocate
memory and trigger the oom without caller noticing that. You would
somehow have to weaken the allocation context to GFP_RETRY_MAYFAIL or
NORETRY to achieve the error handling.

>    If we hit
>    hardware errors on pages, ignore them - nothing we really can or
>    should do.
> 3. On errors during MADV_POPULATED, some memory might have been
>    populated. Callers have to clean up if they care.

How does caller find out? madvise reports 0 on success so how do you
find out how much has been populated?

> 4. Concurrent changes to the virtual memory layour are tolerated - we
>    process each and every PFN only once, though.

I do not understand this. madvise is about virtual address space not a
physical address space.

> 5. If MADV_POPULATE succeeds, all memory in the range can be accessed
>    without SIGBUS. (of course, not if user space changed mappings in the
>    meantime or KSM kicked in on anonymous memory).

I do not see how KSM would change anything here and maybe it is not
really important to mention it. KSM should be really transparent from
the user space POV. Parallel and destructive virtual address space
operations are also expected to change the outcome, and there is nothing
the kernel can do about that to provide any meaningful guarantees. I guess
we want to assume reasonable userspace behavior here.

> Although sparse memory mappings are the primary use case, this will
> also be useful for ordinary preallocations where MAP_POPULATE is not
> desired (e.g., in QEMU, where users can trigger preallocation of
> guest RAM after the mapping was created).
> 
> Looking at the history, MADV_POPULATE was already proposed in 2013 [1],
> however, the main motivation back than was performance improvements
> (which should also still be the case, but it's a seconary concern).

Well, I think it is more of a concern than in the pre-Spectre era, when
syscalls were quite cheap.

David Hildenbrand Feb. 18, 2021, 10:44 a.m. UTC | #5
On 18.02.21 11:25, Michal Hocko wrote:
> On Wed 17-02-21 16:48:44, David Hildenbrand wrote:
>> When we manage sparse memory mappings dynamically in user space - also
>> sometimes involving MADV_NORESERVE - we want to dynamically populate/
> 
> Just wondering what is MADV_NORESERVE? I do not see anything like that
> in the Linus tree. Did you mean MAP_NORESERVE?

Most certainly, thanks :)

> 
>> discard memory inside such a sparse memory region. Example users are
>> hypervisors (especially implementing memory ballooning or similar
>> technologies like virtio-mem) and memory allocators. In addition, we want
>> to fail in a nice way if populating does not succeed because we are out of
>> backend memory (which can happen easily with file-based mappings,
>> especially tmpfs and hugetlbfs).
> 
> by "fail in a nice way" you mean before a #PF would fail and SIGBUS
> which would be harder to handle?

Yes.

> 
> [...]
>> Because we don't have a proper interface, what applications
>> (like QEMU and databases) end up doing is touching (i.e., writing) all
>> individual pages. However, it requires expensive signal handling (SIGBUS);
>> for example, this is problematic in hypervisors like QEMU where SIGBUS
>> handlers might already be used by other subsystems concurrently to e.g,
>> handle hardware errors. "Simply" doing preallocation from another thread
>> is not that easy.
> 
> OK, that clarifies my above question.
> 
>>
>> Let's introduce MADV_POPULATE with the following semantics
>> 1. MADV_POPULATED does not work on PROT_NONE and special VMAs. It works
>>     on everything else.
> 
> This would better clarify what "does not work" means. I assume those are
> ignored and do not report any error?

I'm currently preparing the man page. "Fail with -ENOMEM" (like 
MADV_DONTNEED or MADV_REMOVE)

> 
>> 2. Errors during MADV_POPULATED (especially OOM) are reported.
> 
> How do you want to achieve that? gup/page fault handler will allocate
> memory and trigger the oom without caller noticing that. You would
> somehow have to weaken the allocation context to GFP_RETRY_MAYFAIL or
> NORETRY to achieve the error handling.

Okay, I should be more clear here (again, I'm realizing this as well
while I create the man page). "OOM" is confusing: what I mean is avoiding
SIGBUS at runtime - like we would get on actual file systems/shmem/
hugetlbfs when preallocating.

It cannot save us from the actual OOM killer. To handle anonymous memory
more reliably I'll need other means as well (dynamic swap space
allocation for sparse mappings).

> 
>>     If we hit
>>     hardware errors on pages, ignore them - nothing we really can or
>>     should do.
>> 3. On errors during MADV_POPULATED, some memory might have been
>>     populated. Callers have to clean up if they care.
> 
> How does caller find out? madvise reports 0 on success so how do you
> find out how much has been populated?

If there is an error, something might have been populated. In my QEMU 
implementation, I simply discard the range again, good enough. I don't 
think we need to really indicate "error and populated" or "error and not 
populated".


> 
>> 4. Concurrent changes to the virtual memory layour are tolerated - we
>>     process each and every PFN only once, though.
> 
> I do not understand this. madvise is about virtual address space not a
> physical address space.

What I wanted to express: if we detect a change in the mapping we don't 
restart at the beginning, we always make forward progress. We process 
each virtual address once (on a per-page basis, thus I accidentally used 
"PFN").

> 
>> 5. If MADV_POPULATE succeeds, all memory in the range can be accessed
>>     without SIGBUS. (of course, not if user space changed mappings in the
>>     meantime or KSM kicked in on anonymous memory).
> 
> I do not see how KSM would change anything here and maybe it is not
> really important to mention it. KSM should be really transparent from
> the users space POV. Parallel and destructive virtual address space
> operations are also expected to change the outcome and there is nothing
> kernel do about at and provide any meaningful guarantees. I guess we
> want to assume a reasonable userspace behavior here.

It's just a note that we cannot protect from someone interfering 
(discard/ksm/whatever). I'm making that clearer in the cover letter.

Thanks!

David Hildenbrand Feb. 18, 2021, 10:54 a.m. UTC | #6
>>>      If we hit
>>>      hardware errors on pages, ignore them - nothing we really can or
>>>      should do.
>>> 3. On errors during MADV_POPULATED, some memory might have been
>>>      populated. Callers have to clean up if they care.
>>
>> How does caller find out? madvise reports 0 on success so how do you
>> find out how much has been populated?
> 
> If there is an error, something might have been populated. In my QEMU
> implementation, I simply discard the range again, good enough. I don't
> think we need to really indicate "error and populated" or "error and not
> populated".

Clarifying again: if madvise(MADV_POPULATED) succeeds, it returns 0. If
there was a problem populating memory, it returns -ENOMEM (similar to
MADV_WILLNEED). Callers can detect the error and discard.

Rolf Eike Beer Feb. 18, 2021, 11:07 a.m. UTC | #7
>> Let's introduce MADV_POPULATE with the following semantics
>> 1. MADV_POPULATED does not work on PROT_NONE and special VMAs. It 
>> works
>>    on everything else.
>> 2. Errors during MADV_POPULATED (especially OOM) are reported. If we 
>> hit
>>    hardware errors on pages, ignore them - nothing we really can or
>>    should do.
>> 3. On errors during MADV_POPULATED, some memory might have been
>>    populated. Callers have to clean up if they care.
>> 4. Concurrent changes to the virtual memory layour are tolerated - we
                                                     ^t
>>    process each and every PFN only once, though.
>> 5. If MADV_POPULATE succeeds, all memory in the range can be accessed
>>    without SIGBUS. (of course, not if user space changed mappings in 
>> the
>>    meantime or KSM kicked in on anonymous memory).

You are talking both about MADV_POPULATE and MADV_POPULATED here.

Eike

Michal Hocko Feb. 18, 2021, 11:27 a.m. UTC | #8
On Thu 18-02-21 11:44:41, David Hildenbrand wrote:
> On 18.02.21 11:25, Michal Hocko wrote:
> > On Wed 17-02-21 16:48:44, David Hildenbrand wrote:
> > > When we manage sparse memory mappings dynamically in user space - also
> > > sometimes involving MADV_NORESERVE - we want to dynamically populate/
> > 
> > Just wondering what is MADV_NORESERVE? I do not see anything like that
> > in the Linus tree. Did you mean MAP_NORESERVE?
> 
> Most certainly, thanks :)

OK, good, I thought I had missed something.
[...]
> > > 2. Errors during MADV_POPULATED (especially OOM) are reported.
> > 
> > How do you want to achieve that? gup/page fault handler will allocate
> > memory and trigger the oom without caller noticing that. You would
> > somehow have to weaken the allocation context to GFP_RETRY_MAYFAIL or
> > NORETRY to achieve the error handling.
> 
> Okay, I should be more clear here (again, I'm realizing this as well while I
> create the man page), OOM is confusing: avoid SIGBUS at runtime - like we
> would get on actual file systems/shmem/hugetlbfs when preallocating.

Yes, preventing SIGBUS for unreserved mappings is a reasonable
expectation. Regarding OOM, chances are off, I am afraid. We used to have
a weaker model for MAP_POPULATE for memcg OOM in the past, and it turned
out to be more problematic than useful.
 
> It cannot save us from the actual OOM killer. To handle anonymous memory
> more reliable I'll need other means as well (dynamic swap space allocation
> for sparse mappings).
> 
> > 
> > >     If we hit
> > >     hardware errors on pages, ignore them - nothing we really can or
> > >     should do.
> > > 3. On errors during MADV_POPULATED, some memory might have been
> > >     populated. Callers have to clean up if they care.
> > 
> > How does caller find out? madvise reports 0 on success so how do you
> > find out how much has been populated?
> 
> If there is an error, something might have been populated. In my QEMU
> implementation, I simply discard the range again, good enough. I don't think
> we need to really indicate "error and populated" or "error and not
> populated".

Agreed. The wording just suggests that the syscall actually provides some
means to handle those errors effectively. Maybe you should just stick with
the first sentence and drop the second.
 
> > > 4. Concurrent changes to the virtual memory layour are tolerated - we
> > >     process each and every PFN only once, though.
> > 
> > I do not understand this. madvise is about virtual address space not a
> > physical address space.
> 
> What I wanted to express: if we detect a change in the mapping we don't
> restart at the beginning, we always make forward progress. We process each
> virtual address once (on a per-page basis, thus I accidentally used "PFN").

This is an implicit assumption. Your range can have the same page mapped
several times in the given address range, and all you care about is that
you fault in those which are not present during the virtual address space
walk. Your syscall can return and a large part of the address space might
be unpopulated because memory reclaim just dropped those pages, and that
would be fine. This shouldn't really imply memory presence - mlock does
that.
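
To make the distinction concrete: residency is something user space can
only observe (for example with mincore()) or enforce (with mlock()); a
successful populate call by itself promises neither. A minimal sketch,
where count_resident_pages() is a made-up helper name:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: count the pages in [addr, addr + len) that
 * mincore() currently reports as resident. addr must be page-aligned.
 * Returns -1 on error. The result is only a snapshot: reclaim may drop
 * pages again right after the call unless the range is mlock()ed.
 */
static long count_resident_pages(void *addr, size_t len)
{
	size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
	size_t npages = (len + page_size - 1) / page_size;
	unsigned char *vec = malloc(npages);
	long resident = 0;

	if (!vec)
		return -1;
	if (mincore(addr, len, vec) != 0) {
		free(vec);
		return -1;
	}
	for (size_t i = 0; i < npages; i++) {
		/* Bit 0 of each vector entry is the residency bit. */
		if (vec[i] & 1)
			resident++;
	}
	free(vec);
	return resident;
}
```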

> > > 5. If MADV_POPULATE succeeds, all memory in the range can be accessed
> > >     without SIGBUS. (of course, not if user space changed mappings in the
> > >     meantime or KSM kicked in on anonymous memory).
> > 
> > I do not see how KSM would change anything here and maybe it is not
> > really important to mention it. KSM should be really transparent from
> > the users space POV. Parallel and destructive virtual address space
> > operations are also expected to change the outcome and there is nothing
> > kernel do about at and provide any meaningful guarantees. I guess we
> > want to assume a reasonable userspace behavior here.
> 
> It's just a note that we cannot protect from someone interfering
> (discard/ksm/whatever). I'm making that clearer in the cover letter.

Again, that is an implicit expectation. madvise will not work for anybody
shooting themselves in the foot.
David Hildenbrand Feb. 18, 2021, 11:27 a.m. UTC | #9
> Am 18.02.2021 um 12:15 schrieb Rolf Eike Beer <eike-kernel@sf-tec.de>:
> 
> 
>> 
>>> Let's introduce MADV_POPULATE with the following semantics
>>> 1. MADV_POPULATED does not work on PROT_NONE and special VMAs. It works
>>>   on everything else.
>>> 2. Errors during MADV_POPULATED (especially OOM) are reported. If we hit
>>>   hardware errors on pages, ignore them - nothing we really can or
>>>   should do.
>>> 3. On errors during MADV_POPULATED, some memory might have been
>>>   populated. Callers have to clean up if they care.
>>> 4. Concurrent changes to the virtual memory layour are tolerated - we
>                                                    ^t
>>>   process each and every PFN only once, though.
>>> 5. If MADV_POPULATE succeeds, all memory in the range can be accessed
>>>   without SIGBUS. (of course, not if user space changed mappings in the
>>>   meantime or KSM kicked in on anonymous memory).
> 
> You are talking both about MADV_POPULATE and MADV_POPULATED here.
> 

Already fixed up :) thanks!

> Eike
>
Michal Hocko Feb. 18, 2021, 11:28 a.m. UTC | #10
On Thu 18-02-21 11:54:48, David Hildenbrand wrote:
> > > >      If we hit
> > > >      hardware errors on pages, ignore them - nothing we really can or
> > > >      should do.
> > > > 3. On errors during MADV_POPULATED, some memory might have been
> > > >      populated. Callers have to clean up if they care.
> > > 
> > > How does caller find out? madvise reports 0 on success so how do you
> > > find out how much has been populated?
> > 
> > If there is an error, something might have been populated. In my QEMU
> > implementation, I simply discard the range again, good enough. I don't
> > think we need to really indicate "error and populated" or "error and not
> > populated".
> 
> Clarifying again: if madvise(MADV_POPULATED) succeeds, it returns 0. If
> there was a problem populating memory, it returns -ENOMEM (similar to
> MADV_WILLNEED). Callers can detect the error and discard.

As I responded to the previous mail, I wouldn't really bother telling
callers what they should do. The interface will not give them any means
to identify the error. They just have to live with the fact that the
operation has failed.
David Hildenbrand Feb. 18, 2021, 11:38 a.m. UTC | #11
>>>>      If we hit
>>>>      hardware errors on pages, ignore them - nothing we really can or
>>>>      should do.
>>>> 3. On errors during MADV_POPULATED, some memory might have been
>>>>      populated. Callers have to clean up if they care.
>>>
>>> How does caller find out? madvise reports 0 on success so how do you
>>> find out how much has been populated?
>>
>> If there is an error, something might have been populated. In my QEMU
>> implementation, I simply discard the range again, good enough. I don't think
>> we need to really indicate "error and populated" or "error and not
>> populated".
> 
> Agreed. The wording just suggests that the syscall actually provides any
> means for an effective way to handle those errors. Maybe you should just
> stick with the first sentence and drop the second.

Makes sense. "On errors during MADV_POPULATE, some memory might have 
been populated."
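
From the caller's side, the error handling discussed here could look
roughly like this. It is a sketch only: populate_or_discard() is a
hypothetical helper, and the populate advice is passed in as a parameter
because MADV_POPULATE as proposed in this RFC is not defined in any
released header.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: try to populate [addr, addr + len) using the given
 * madvise() advice. On failure, discard the range again so that no
 * partially populated memory is left behind.
 * Returns 0 on success or a negative errno value.
 */
static int populate_or_discard(void *addr, size_t len, int populate_advice)
{
	int err;

	if (madvise(addr, len, populate_advice) == 0)
		return 0;

	err = errno;
	/* Some pages may have been populated before the failure: drop them. */
	(void)madvise(addr, len, MADV_DONTNEED);
	return -err;
}
```

For shared file-backed memory, the discard step would rather need
FALLOC_FL_PUNCH_HOLE on the backing file, since MADV_DONTNEED only zaps the
page tables there and leaves the file's pages allocated.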

>   
>>>> 4. Concurrent changes to the virtual memory layour are tolerated - we
>>>>      process each and every PFN only once, though.
>>>
>>> I do not understand this. madvise is about virtual address space not a
>>> physical address space.
>>
>> What I wanted to express: if we detect a change in the mapping we don't
>> restart at the beginning, we always make forward progress. We process each
>> virtual address once (on a per-page basis, thus I accidentally used "PFN").
> 
> This is an implicit assumption. Your range can have the same page mapped
> several times in the given address range and all you care about is that
> you fault those which are not present during the virtual address space
> walk. Your syscall can return and large part of the address space might
> be unpopulated because memory reclaim just dropped those pages and that
> would be fine. This shouldn't really imply memory presence - mlock does
> that.

"Concurrent changes to the virtual memory layout are tolerated. The 
range is processed exactly once."

> 
>>>> 5. If MADV_POPULATE succeeds, all memory in the range can be accessed
>>>>      without SIGBUS. (of course, not if user space changed mappings in the
>>>>      meantime or KSM kicked in on anonymous memory).
>>>
>>> I do not see how KSM would change anything here and maybe it is not
>>> really important to mention it. KSM should be really transparent from
>>> the users space POV. Parallel and destructive virtual address space
>>> operations are also expected to change the outcome and there is nothing
>>> kernel do about at and provide any meaningful guarantees. I guess we
>>> want to assume a reasonable userspace behavior here.
>>
>> It's just a note that we cannot protect from someone interfering
>> (discard/ksm/whatever). I'm making that clearer in the cover letter.
> 
> Again that is implicit expectation. madvise will not work for anybody
> shooting an own foot.

Okay, I'll drop that part, thanks!
Peter Xu Feb. 18, 2021, 10:59 p.m. UTC | #12
Hi, David,

On Wed, Feb 17, 2021 at 04:48:44PM +0100, David Hildenbrand wrote:
> When we manage sparse memory mappings dynamically in user space - also
> sometimes involving MADV_NORESERVE - we want to dynamically populate/
> discard memory inside such a sparse memory region. Example users are
> hypervisors (especially implementing memory ballooning or similar
> technologies like virtio-mem) and memory allocators. In addition, we want
> to fail in a nice way if populating does not succeed because we are out of
> backend memory (which can happen easily with file-based mappings,
> especially tmpfs and hugetlbfs).

Could you explain a bit more about how you plan to use this new interface for
the virtio-balloon scenario?

Meanwhile, here you seem to be talking about file-backed memory, but later
it sounds more like anonymous memory, so I'm slightly confused.

Thanks,
David Hildenbrand Feb. 19, 2021, 8:20 a.m. UTC | #13
On 18.02.21 23:59, Peter Xu wrote:
> Hi, David,
> 
> On Wed, Feb 17, 2021 at 04:48:44PM +0100, David Hildenbrand wrote:
>> When we manage sparse memory mappings dynamically in user space - also
>> sometimes involving MADV_NORESERVE - we want to dynamically populate/
>> discard memory inside such a sparse memory region. Example users are
>> hypervisors (especially implementing memory ballooning or similar
>> technologies like virtio-mem) and memory allocators. In addition, we want
>> to fail in a nice way if populating does not succeed because we are out of
>> backend memory (which can happen easily with file-based mappings,
>> especially tmpfs and hugetlbfs).
> 
> Could you explain a bit more on how do you plan to use this new interface for
> the virtio-balloon scenario?

Sure, that will bring up an interesting point to discuss 
(MADV_POPULATE_WRITE).

I'm planning on using it in virtio-mem: whenever the guest requests the
hypervisor (via a virtio-mem device) to make specific blocks available
("plug"), I want to have a configurable option ("populate=on" /
"prealloc=on") to perform safety checks ("prealloc") and populate page
tables.

This becomes especially relevant for private/shared hugetlbfs and shared 
files/shmem where we have a limited pool size (e.g., huge pages, tmpfs 
size, filesystem size). But it will also come in handy when just 
preallocating (esp. zeroing) anonymous memory.

For virtio-balloon it is not applicable because it really only supports
anonymous memory and we cannot fail requests to deflate ...

--- Example ---

Example: Assume the guest requests to make 128 MB available and we're
using hugetlbfs. Assume we're out of huge pages in the hypervisor - we
want to fail the request - I want to do some kind of preallocation.

So I could do fallocate() on anything that's MAP_SHARED, but not on 
anything that's MAP_PRIVATE. hugetlbfs via memfd() cannot be 
preallocated without going via SIGBUS handlers.
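
For the MAP_SHARED shmem case, the fallocate() approach mentioned above
might look like this. A sketch under assumptions: the name
create_preallocated_memfd() is made up, and a plain (non-hugetlb) memfd is
used so that it runs without a configured huge page pool.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a memfd of the given size and preallocate
 * its backing pages. Returns the file descriptor on success or a negative
 * errno value, so running out of backend memory shows up as an error here
 * instead of as SIGBUS on first access through a shared mapping.
 */
static int create_preallocated_memfd(off_t size)
{
	int fd = memfd_create("guest-ram", MFD_CLOEXEC);
	int err;

	if (fd < 0)
		return -errno;

	/* Set the file size, then ask for backing pages up front. */
	if (ftruncate(fd, size) != 0 || fallocate(fd, 0, 0, size) != 0) {
		err = errno;
		close(fd);
		return -err;
	}
	return fd;
}
```

Note that this only allocates pages in the file; it neither populates page
tables nor helps a MAP_PRIVATE mapping of the same file.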

--- QEMU memory configurations ---

I see the following combinations relevant in QEMU that I want to support
with virtio-mem:

1) MAP_PRIVATE anonymous memory
2) MAP_PRIVATE on hugetlbfs (esp. via memfd)
3) MAP_SHARED on hugetlbfs (esp. via memfd)
4) MAP_SHARED on shmem (file / memfd)
5) MAP_SHARED on some sparse file.

Other MAP_PRIVATE mappings barely make any sense to me - "read the file 
and write to page cache" is not really applicable to VM RAM (not to 
mention doing fallocate(PUNCH_HOLE) that invalidates the private copies 
of all other mappings on that file).

--- Ways to populate/preallocate ---

I see the following ways to populate/preallocate:

a) MADV_POPULATE: write fault on writable MAP_PRIVATE, read fault on
    MAP_SHARED
b) Writing to MAP_PRIVATE | MAP_SHARED from user space.
c) (below) MADV_POPULATE_WRITE: write fault on writable MAP_PRIVATE |
    MAP_SHARED

Especially, b) is kind of weird as implemented in QEMU
(util/oslib-posix.c:do_touch_pages):

"Read & write back the same value, so we don't corrupt existing user/app 
data ... TODO: get a better solution from kernel so we don't need to 
write at all so we don't cause wear on the storage backing the region..."

So if we have zero, we write zero. We'll COW pages, triggering a write 
fault - and that's the only good thing about it. For example, similar to 
MADV_POPULATE, nothing stops KSM from merging anonymous pages again. So 
for anonymous memory the actual write is not helpful at all. Similarly 
for hugetlbfs, the actual write is not necessary - but there is no other 
way to really achieve the goal.
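
The "read and write back the same value" approach boils down to something
like the following. This is a simplified sketch in the spirit of the quoted
comment, not QEMU's actual code; in particular it leaves out the SIGBUS
handling, and touch_pages() is a made-up name.

```c
#include <stddef.h>

/*
 * Touch one byte per page in [addr, addr + len): read it and write the
 * same value back. This forces a write fault on every page without
 * changing the contents.
 */
static void touch_pages(char *addr, size_t len, size_t page_size)
{
	for (size_t off = 0; off < len; off += page_size) {
		/* volatile keeps the compiler from dropping the self-assignment */
		volatile char *p = addr + off;

		*p = *p;
	}
}
```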

--- How MADV_POPULATE is useful ---

With virtio-mem, our VM will usually write to memory before it reads it.

With 1) and 2) it does exactly what I want: trigger COW / allocate
memory and trigger a write fault. The only issue with 1) is that KSM
might come around and undo our work - but that could only be avoided by
writing random numbers to all pages from user space. Or we could simply
disable KSM in that setup ...

--- How MADV_POPULATE is not perfect ---

KSM can merge anonymous pages again. Just like the current QEMU 
implementation. The only way around that is writing random numbers to 
the pages or mlocking all memory. No big news.

Nothing stops reclaim/swap code from depopulating when using files.
Again, no big news - we have to mlock.

--- How MADV_POPULATE_WRITE might be useful ---

With 3) 4) 5) MADV_POPULATE does partially what I want: preallocate 
memory and populate page tables. But as it's a read fault, I think we'll 
have another minor fault on access. Not perfect, but better than failing 
with SIGBUS. One way around that would be having an additional 
MADV_POPULATE_WRITE, to use in cases where it makes sense (I think at 
least 3) and 4), most probably not on actual files like 5) ).

Trigger a write fault without actually writing.
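
As a rough illustration of how a caller would use such a flag: later
mainline kernels (5.14 and newer, to my knowledge) ship MADV_POPULATE_READ
and MADV_POPULATE_WRITE, which did not exist when this RFC was posted. The
sketch below assumes those semantics; populate_write() is a made-up helper,
and the fallback define is only there for older headers.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_POPULATE_WRITE
/* Assumption: value taken from the mainline uapi headers. */
#define MADV_POPULATE_WRITE 23
#endif

/*
 * Hypothetical helper: prefault [addr, addr + len) writable without
 * modifying its contents. Returns 0 on success or a negative errno value;
 * kernels without support are expected to report -EINVAL.
 */
static int populate_write(void *addr, size_t len)
{
	if (madvise(addr, len, MADV_POPULATE_WRITE) == 0)
		return 0;
	return -errno;
}
```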


Makes sense?
Michal Hocko Feb. 19, 2021, 10:35 a.m. UTC | #14
On Wed 17-02-21 16:48:44, David Hildenbrand wrote:
[...]

I only got to the implementation now.

> +static long madvise_populate(struct vm_area_struct *vma,
> +			     struct vm_area_struct **prev,
> +			     unsigned long start, unsigned long end)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	unsigned long tmp_end;
> +	int locked = 1;
> +	long pages;
> +
> +	*prev = vma;
> +
> +	while (start < end) {
> +		/*
> +		 * We might have temporarily dropped the lock. For example,
> +		 * our VMA might have been split.
> +		 */
> +		if (!vma || start >= vma->vm_end) {
> +			vma = find_vma(mm, start);
> +			if (!vma)
> +				return -ENOMEM;
> +		}

Why do you need to find a vma when you already have one? do_madvise will
give you your vma already. I do understand that you want to finish the
vma for some errors, but that shouldn't require handling vmas. You should
be in the scope of one here unless I am missing anything.

> +
> +		/* Bail out on incompatible VMA types. */
> +		if (vma->vm_flags & (VM_IO | VM_PFNMAP) ||
> +		    !vma_is_accessible(vma)) {
> +			return -EINVAL;
> +		}
> +
> +		/*
> +		 * Populate pages and take care of VM_LOCKED: simulate user
> +		 * space access.
> +		 *
> +		 * For private, writable mappings, trigger a write fault to
> +		 * break COW (i.e., shared zeropage). For other mappings (i.e.,
> +		 * read-only, shared), trigger a read fault.
> +		 */
> +		tmp_end = min_t(unsigned long, end, vma->vm_end);
> +		pages = populate_vma_page_range(vma, start, tmp_end, &locked);
> +		if (!locked) {
> +			mmap_read_lock(mm);
> +			*prev = NULL;
> +			vma = NULL;
> +		}
> +		if (pages < 0) {
> +			switch (pages) {
> +			case -EINTR:
> +			case -ENOMEM:
> +				return pages;
> +			case -EHWPOISON:
> +				/* Skip over any poisoned pages. */
> +				start += PAGE_SIZE;
> +				continue;
> +			case -EBUSY:
> +			case -EAGAIN:
> +				continue;
> +			default:
> +				pr_warn_once("%s: unhandled return value: %ld\n",
> +					     __func__, pages);
> +				return -ENOMEM;
> +			}
> +		}
> +		start += pages * PAGE_SIZE;
> +	}
> +	return 0;
> +}
> +
>  /*
>   * Application wants to free up the pages and associated backing store.
>   * This is effectively punching a hole into the middle of a file.
> @@ -934,6 +1001,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
>  	case MADV_FREE:
>  	case MADV_DONTNEED:
>  		return madvise_dontneed_free(vma, prev, start, end, behavior);
> +	case MADV_POPULATE:
> +		return madvise_populate(vma, prev, start, end);
>  	default:
>  		return madvise_behavior(vma, prev, start, end, behavior);
>  	}
> @@ -954,6 +1023,7 @@ madvise_behavior_valid(int behavior)
>  	case MADV_FREE:
>  	case MADV_COLD:
>  	case MADV_PAGEOUT:
> +	case MADV_POPULATE:
>  #ifdef CONFIG_KSM
>  	case MADV_MERGEABLE:
>  	case MADV_UNMERGEABLE:
> -- 
> 2.29.2
>
David Hildenbrand Feb. 19, 2021, 10:43 a.m. UTC | #15
On 19.02.21 11:35, Michal Hocko wrote:
> On Wed 17-02-21 16:48:44, David Hildenbrand wrote:
> [...]
> 
> I only got  to the implementation now.
> 
>> +static long madvise_populate(struct vm_area_struct *vma,
>> +			     struct vm_area_struct **prev,
>> +			     unsigned long start, unsigned long end)
>> +{
>> +	struct mm_struct *mm = vma->vm_mm;
>> +	unsigned long tmp_end;
>> +	int locked = 1;
>> +	long pages;
>> +
>> +	*prev = vma;
>> +
>> +	while (start < end) {
>> +		/*
>> +		 * We might have temporarily dropped the lock. For example,
>> +		 * our VMA might have been split.
>> +		 */
>> +		if (!vma || start >= vma->vm_end) {
>> +			vma = find_vma(mm, start);
>> +			if (!vma)
>> +				return -ENOMEM;
>> +		}
> 
> Why do you need to find a vma when you already have one. do_madvise will
> give you your vma already. I do understand that you want to finish the
> vma for some errors but that shouldn't require handling vmas. You should
> be in the shope of one here unless I miss anything.

See below, we might temporarily drop the lock before having processed
all pages.

> 
>> +
>> +		/* Bail out on incompatible VMA types. */
>> +		if (vma->vm_flags & (VM_IO | VM_PFNMAP) ||
>> +		    !vma_is_accessible(vma)) {
>> +			return -EINVAL;
>> +		}
>> +
>> +		/*
>> +		 * Populate pages and take care of VM_LOCKED: simulate user
>> +		 * space access.
>> +		 *
>> +		 * For private, writable mappings, trigger a write fault to
>> +		 * break COW (i.e., shared zeropage). For other mappings (i.e.,
>> +		 * read-only, shared), trigger a read fault.
>> +		 */
>> +		tmp_end = min_t(unsigned long, end, vma->vm_end);
>> +		pages = populate_vma_page_range(vma, start, tmp_end, &locked);
>> +		if (!locked) {
>> +			mmap_read_lock(mm);
>> +			*prev = NULL;
>> +			vma = NULL;

^ here

so, the VMA might have been replaced/split/... in the meantime.

So to make forward progress, I have to look it up again (similar to, but
different from, madvise_dontneed_free()).
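
The control flow can be modelled in user space with a toy example. This is
only an illustration of the "re-lookup after the lock was dropped, never
restart" pattern; all toy_* names are made up, the region array is assumed
to be sorted, and nothing here is kernel code.

```c
#include <stddef.h>

struct toy_vma {
	unsigned long start;	/* inclusive */
	unsigned long end;	/* exclusive */
};

/* Like find_vma(): first region that ends above addr, or NULL. */
static const struct toy_vma *toy_find_vma(const struct toy_vma *vmas,
					  size_t n, unsigned long addr)
{
	for (size_t i = 0; i < n; i++)
		if (addr < vmas[i].end)
			return &vmas[i];
	return NULL;
}

/*
 * Walk [start, end) exactly once in steps of at most `chunk` bytes. After
 * each step the cached region is treated as stale (as if the lock had
 * been dropped) and looked up again, but `start` only ever moves forward.
 * Returns the number of bytes processed, or -1 when hitting a hole.
 */
static long toy_populate(const struct toy_vma *vmas, size_t n,
			 unsigned long start, unsigned long end,
			 unsigned long chunk)
{
	const struct toy_vma *vma = NULL;
	long processed = 0;

	while (start < end) {
		if (!vma || start >= vma->end) {
			vma = toy_find_vma(vmas, n, start);
			if (!vma || start < vma->start)
				return -1;
		}

		unsigned long tmp_end = end < vma->end ? end : vma->end;
		unsigned long step = tmp_end - start;

		if (step > chunk)
			step = chunk;
		processed += (long)step;
		start += step;
		vma = NULL;	/* pretend the lock was dropped */
	}
	return processed;
}
```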
Michal Hocko Feb. 19, 2021, 11:04 a.m. UTC | #16
On Fri 19-02-21 11:43:48, David Hildenbrand wrote:
> On 19.02.21 11:35, Michal Hocko wrote:
> > On Wed 17-02-21 16:48:44, David Hildenbrand wrote:
> > [...]
> > 
> > I only got  to the implementation now.
> > 
> > > +static long madvise_populate(struct vm_area_struct *vma,
> > > +			     struct vm_area_struct **prev,
> > > +			     unsigned long start, unsigned long end)
> > > +{
> > > +	struct mm_struct *mm = vma->vm_mm;
> > > +	unsigned long tmp_end;
> > > +	int locked = 1;
> > > +	long pages;
> > > +
> > > +	*prev = vma;
> > > +
> > > +	while (start < end) {
> > > +		/*
> > > +		 * We might have temporarily dropped the lock. For example,
> > > +		 * our VMA might have been split.
> > > +		 */
> > > +		if (!vma || start >= vma->vm_end) {
> > > +			vma = find_vma(mm, start);
> > > +			if (!vma)
> > > +				return -ENOMEM;
> > > +		}
> > 
> > Why do you need to find a vma when you already have one. do_madvise will
> > give you your vma already. I do understand that you want to finish the
> > vma for some errors but that shouldn't require handling vmas. You should
> > be in the shope of one here unless I miss anything.
> 
> See below, we might temporary drop the lock while not having processed all
> pages
> 
> > 
> > > +
> > > +		/* Bail out on incompatible VMA types. */
> > > +		if (vma->vm_flags & (VM_IO | VM_PFNMAP) ||
> > > +		    !vma_is_accessible(vma)) {
> > > +			return -EINVAL;
> > > +		}
> > > +
> > > +		/*
> > > +		 * Populate pages and take care of VM_LOCKED: simulate user
> > > +		 * space access.
> > > +		 *
> > > +		 * For private, writable mappings, trigger a write fault to
> > > +		 * break COW (i.e., shared zeropage). For other mappings (i.e.,
> > > +		 * read-only, shared), trigger a read fault.
> > > +		 */
> > > +		tmp_end = min_t(unsigned long, end, vma->vm_end);
> > > +		pages = populate_vma_page_range(vma, start, tmp_end, &locked);
> > > +		if (!locked) {
> > > +			mmap_read_lock(mm);
> > > +			*prev = NULL;
> > > +			vma = NULL;
> 
> ^ here
> 
> so, the VMA might have been replaced/split/... in the meantime.
> 
> So to make forward progress, I have to lookup again. (similar. but different
> to madvise_dontneed_free()).

Right. Missed that.
David Hildenbrand Feb. 19, 2021, 11:10 a.m. UTC | #17
On 19.02.21 12:04, Michal Hocko wrote:
> On Fri 19-02-21 11:43:48, David Hildenbrand wrote:
>> On 19.02.21 11:35, Michal Hocko wrote:
>>> On Wed 17-02-21 16:48:44, David Hildenbrand wrote:
>>> [...]
>>>
>>> I only got  to the implementation now.
>>>
>>>> +static long madvise_populate(struct vm_area_struct *vma,
>>>> +			     struct vm_area_struct **prev,
>>>> +			     unsigned long start, unsigned long end)
>>>> +{
>>>> +	struct mm_struct *mm = vma->vm_mm;
>>>> +	unsigned long tmp_end;
>>>> +	int locked = 1;
>>>> +	long pages;
>>>> +
>>>> +	*prev = vma;
>>>> +
>>>> +	while (start < end) {
>>>> +		/*
>>>> +		 * We might have temporarily dropped the lock. For example,
>>>> +		 * our VMA might have been split.
>>>> +		 */
>>>> +		if (!vma || start >= vma->vm_end) {
>>>> +			vma = find_vma(mm, start);
>>>> +			if (!vma)
>>>> +				return -ENOMEM;
>>>> +		}
>>>
>>> Why do you need to find a vma when you already have one. do_madvise will
>>> give you your vma already. I do understand that you want to finish the
>>> vma for some errors but that shouldn't require handling vmas. You should
>>> be in the shope of one here unless I miss anything.
>>
>> See below, we might temporary drop the lock while not having processed all
>> pages
>>
>>>
>>>> +
>>>> +		/* Bail out on incompatible VMA types. */
>>>> +		if (vma->vm_flags & (VM_IO | VM_PFNMAP) ||
>>>> +		    !vma_is_accessible(vma)) {
>>>> +			return -EINVAL;
>>>> +		}
>>>> +
>>>> +		/*
>>>> +		 * Populate pages and take care of VM_LOCKED: simulate user
>>>> +		 * space access.
>>>> +		 *
>>>> +		 * For private, writable mappings, trigger a write fault to
>>>> +		 * break COW (i.e., shared zeropage). For other mappings (i.e.,
>>>> +		 * read-only, shared), trigger a read fault.
>>>> +		 */
>>>> +		tmp_end = min_t(unsigned long, end, vma->vm_end);
>>>> +		pages = populate_vma_page_range(vma, start, tmp_end, &locked);
>>>> +		if (!locked) {
>>>> +			mmap_read_lock(mm);
>>>> +			*prev = NULL;
>>>> +			vma = NULL;
>>
>> ^ here
>>
>> so, the VMA might have been replaced/split/... in the meantime.
>>
>> So to make forward progress, I have to lookup again. (similar. but different
>> to madvise_dontneed_free()).
> 
> Right. Missed that.

It would look more natural if we'd just be processing the whole range - 
but then it would not fit into the generic infrastructure and would 
result in even more code.

I decided to go with "process the passed range and treat the given VMA 
as an initial VMA that is invalidated as soon as we drop the lock".
Peter Xu Feb. 19, 2021, 4:31 p.m. UTC | #18
On Fri, Feb 19, 2021 at 09:20:16AM +0100, David Hildenbrand wrote:
> On 18.02.21 23:59, Peter Xu wrote:
> > Hi, David,
> > 
> > On Wed, Feb 17, 2021 at 04:48:44PM +0100, David Hildenbrand wrote:
> > > When we manage sparse memory mappings dynamically in user space - also
> > > sometimes involving MADV_NORESERVE - we want to dynamically populate/
> > > discard memory inside such a sparse memory region. Example users are
> > > hypervisors (especially implementing memory ballooning or similar
> > > technologies like virtio-mem) and memory allocators. In addition, we want
> > > to fail in a nice way if populating does not succeed because we are out of
> > > backend memory (which can happen easily with file-based mappings,
> > > especially tmpfs and hugetlbfs).
> > 
> > Could you explain a bit more on how do you plan to use this new interface for
> > the virtio-balloon scenario?
> 
> Sure, that will bring up an interesting point to discuss
> (MADV_POPULATE_WRITE).
> 
> I'm planning on using it in virtio-mem: whenever the guests requests the
> hypervisor (via a virtio-mem device) to make specific blocks available
> ("plug"), I want to have a configurable option ("populate=on" /
> "prealloc="on") to perform safety checks ("prealloc") and populate page
> tables.

As you mentioned in the commit message, the original goal of MADV_POPULATE
should be performance, which I can understand.  But for the safety
check, I'm curious whether we'd have a better way to do that besides populating
the whole memory.

E.g., can we simply ask the kernel "how much memory can this process still
allocate", then get a number out of it?  I'm not sure whether it can be done
already by either cgroup or any other facilities, or maybe it's still missing.
But I'd raise this question, since these two requirements seem to be two
standalone issues to solve, at least to me.  It could be overkill to populate
all the memory just for a sanity check.

> 
> This becomes especially relevant for private/shared hugetlbfs and shared
> files/shmem where we have a limited pool size (e.g., huge pages, tmpfs size,
> filesystem size). But it will also come in handy when just preallocating
> (esp. zeroing) anonymous memory.
> 
> For virito-balloon it is not applicable because it really only supports
> anonymous memory and we cannot fail requests to deflate ...
> 
> --- Example ---
> 
> Example: Assume the guests requests to make 128 MB available and we're using
> hugetlbfs. Assume we're out of huge pages in the hypervisor - we want to
> fail the request - I want to do some kind of preallocation.
> 
> So I could do fallocate() on anything that's MAP_SHARED, but not on anything
> that's MAP_PRIVATE. hugetlbfs via memfd() cannot be preallocated without
> going via SIGBUS handlers.
> 
> --- QEMU memory configurations ---
> 
> I see the following combinations relevant in QEMU that I want to support
> with virito-mem:
> 
> 1) MAP_PRIVATE anonymous memory
> 2) MAP_PRIVATE on hugetlbfs (esp. via memfd)
> 3) MAP_SHARED on hugetlbfs (esp. via memfd)
> 4) MAP_SHARED on shmem (file / memfd)
> 5) MAP_SHARED on some sparse file.
> 
> Other MAP_PRIVATE mappings barely make any sense to me - "read the file and
> write to page cache" is not really applicable to VM RAM (not to mention
> doing fallocate(PUNCH_HOLE) that invalidates the private copies of all other
> mappings on that file).
> 
> --- Ways to populate/preallocate ---
> 
> I see the following ways to populate/preallocate:
> 
> a) MADV_POPULATE: write fault on writable MAP_PRIVATE, read fault on
>    MAP_SHARED
> b) Writing to MAP_PRIVATE | MAP_SHARED from user space.
> c) (below) MADV_POPULATE_WRITE: write fault on writable MAP_PRIVATE |
>    MAP_SHARED
> 
> Especially, 2) is kind of weird as implemented in QEMU
> (util/oslib-posix.c:do_touch_pages):
> 
> "Read & write back the same value, so we don't corrupt existing user/app
> data ... TODO: get a better solution from kernel so we don't need to write
> at all so we don't cause wear on the storage backing the region..."

It's interesting to know about commit 1e356fc14be ("mem-prealloc: reduce large
guest start-up and migration time.", 2017-03-14).  It seems to be for speeding
up VM boot, but what I can't understand is why it would delay hugetlb
accounting - I thought we'd fail even earlier at either fallocate() on the
hugetlb file (when we use /dev/hugepages) or on mmap() of the memfd which
contains the huge pages.  See hugetlb_reserve_pages() and its callers.  Or did
I miss something?

I think there's a special case if QEMU fork()s with a MAP_PRIVATE hugetlbfs
mapping, which could cause the memory accounting to be delayed until COW happens.
However that's definitely not the case for QEMU since QEMU won't work at all as
late as that point.

IOW, for hugetlbfs I don't know why we need to populate the pages at all if we
simply want to know "whether we still have enough space".  And IIUC 2)
above is the major issue you'd like to solve too.
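
The early failure I have in mind can be checked from user space with an
anonymous MAP_HUGETLB mapping. A sketch, assuming the default huge page
size and no MAP_NORESERVE; try_map_hugetlb() is a hypothetical name:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: map `len` bytes of private anonymous hugetlb
 * memory. Returns 0 and stores the address in *out on success, or a
 * negative errno value if mmap() itself fails, i.e. before any page has
 * been touched.
 */
static int try_map_hugetlb(size_t len, void **out)
{
	void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (addr == MAP_FAILED)
		return -errno;
	*out = addr;
	return 0;
}
```

With an empty huge page pool I would expect this to fail at mmap() time
already, without populating anything.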

> 
> So if we have zero, we write zero. We'll COW pages, triggering a write fault
> - and that's the only good thing about it. For example, similar to
> MADV_POPULATE, nothing stops KSM from merging anonymous pages again. So for
> anonymous memory the actual write is not helpful at all. Similarly for
> hugetlbfs, the actual write is not necessary - but there is no other way to
> really achieve the goal.
> 
> --- How MADV_POPULATE is useful ---
> 
> With virito-mem, our VM will usually write to memory before it reads it.
> 
> With 1) and 2) it does exactly what I want: trigger COW / allocate memory
> and trigger a write fault. The only issue with 1) is that KSM might come
> around and undo our work - but that could only be avoided by writing random
> numbers to all pages from user space. Or we could simply rather disable KSM
> in that setup ...
> 
> --- How MADV_POPULATE is not perfect ---
> 
> KSM can merge anonymous pages again. Just like the current QEMU
> implementation. The only way around that is writing random numbers to the
> pages or mlocking all memory. No big news.
> 
> Nothing stops reclaim/swap code from depopulating when using files. Again,
> no big new - we have to mlock.
> 
> --- HOW MADV_POPULATE_WRITE might be useful ---
> 
> With 3) 4) 5) MADV_POPULATE does partially what I want: preallocate memory
> and populate page tables. But as it's a read fault, I think we'll have
> another minor fault on access. Not perfect, but better than failing with
> SIGBUS. One way around that would be having an additional
> MADV_POPULATE_WRITE, to use in cases where it makes sense (I think at least
> 3) and 4), most probably not on actual files like 5) ).

Right, it seems that when populating memory we'll read-fault on file-backed
mappings.  However, that'll be another performance issue to think about.  So I'd
hope we can start with the current virtio-mem issue on memory accounting, then
we can discuss them separately.

Btw, thanks for the long write-up, it definitely helps me to understand what
you wanted to achieve.

Thanks,
David Hildenbrand Feb. 19, 2021, 5:13 p.m. UTC | #19
On 19.02.21 17:31, Peter Xu wrote:
> On Fri, Feb 19, 2021 at 09:20:16AM +0100, David Hildenbrand wrote:
>> On 18.02.21 23:59, Peter Xu wrote:
>>> Hi, David,
>>>
>>> On Wed, Feb 17, 2021 at 04:48:44PM +0100, David Hildenbrand wrote:
>>>> When we manage sparse memory mappings dynamically in user space - also
>>>> sometimes involving MADV_NORESERVE - we want to dynamically populate/
>>>> discard memory inside such a sparse memory region. Example users are
>>>> hypervisors (especially implementing memory ballooning or similar
>>>> technologies like virtio-mem) and memory allocators. In addition, we want
>>>> to fail in a nice way if populating does not succeed because we are out of
>>>> backend memory (which can happen easily with file-based mappings,
>>>> especially tmpfs and hugetlbfs).
>>>
>>> Could you explain a bit more on how do you plan to use this new interface for
>>> the virtio-balloon scenario?
>>
>> Sure, that will bring up an interesting point to discuss
>> (MADV_POPULATE_WRITE).
>>
>> I'm planning on using it in virtio-mem: whenever the guest requests the
>> hypervisor (via a virtio-mem device) to make specific blocks available
>> ("plug"), I want to have a configurable option ("populate=on" /
>> "prealloc=on") to perform safety checks ("prealloc") and populate page
>> tables.
> 
> As you mentioned in the commit message, the original goal for MADV_POPULATE
> should be for performance's sake, which I can understand.  But for safety
> check, I'm curious whether we'd have better way to do that besides populating
> the whole memory.

Well, it's 100% what I want for "populate=on"/"prealloc=on" semantics.

There is no real memory overcommit for huge pages, so any lazy
allocation ("reserve only") only saves you boot time - which is not
really an issue for virtio-mem, as the memory gets added and initialized
asynchronously as the guest boots up.

"reserve=on,prealloc=off" is another future use case I have in mind - 
possible only for some memory backends (esp. anonymous memory - below).


> 
> E.g., can we simply ask the kernel "how much memory this process can still
> allocate", then get a number out of it?  I'm not sure whether it can be done

Anything like that is completely racy and unreliable.

> already by either cgroup or any other facilities, or maybe it's still missing.
> But I'd raise this question up, since these two requirements seem to be two
> standalone issues to solve at least to me.  It could be an overkill to populate
> all the memory just for a sanity check.

For anonymous memory I have something in the works to dynamically
reserve swap space per process for otherwise unaccounted, private,
writable MAP_NORESERVE memory.

However, it works because swap space is per-system, not per-node or 
anything else. Doing that for file systems/hugetlbfs is a different beast.

And anonymous memory is right now less of my concern, as we're used to 
overcommitting there - limited pool sizes are more of an issue.

>> --- Ways to populate/preallocate ---
>>
>> I see the following ways to populate/preallocate:
>>
>> a) MADV_POPULATE: write fault on writable MAP_PRIVATE, read fault on
>>     MAP_SHARED
>> b) Writing to MAP_PRIVATE | MAP_SHARED from user space.
>> c) (below) MADV_POPULATE_WRITE: write fault on writable MAP_PRIVATE |
>>     MAP_SHARED
>>
>> Especially, 2) is kind of weird as implemented in QEMU
>> (util/oslib-posix.c:do_touch_pages):
>>
>> "Read & write back the same value, so we don't corrupt existing user/app
>> data ... TODO: get a better solution from kernel so we don't need to write
>> at all so we don't cause wear on the storage backing the region..."
> 
> It's interesting to know about commit 1e356fc14be ("mem-prealloc: reduce large
> guest start-up and migration time.", 2017-03-14).  It seems for speeding up VM
> boot, but what I can't understand is why it would cause the delay of hugetlb
> accounting - I thought we'd fail even earlier at either fallocate() on the
> hugetlb file (when we use /dev/hugepages) or on mmap() of the memfd which
> contains the huge pages.  See hugetlb_reserve_pages() and its callers.  Or did
> I miss something?

We should fail on mmap() when the reservation happens (unless 
MAP_NORESERVE is passed) I think.

> 
> I think there's a special case if QEMU fork() with a MAP_PRIVATE hugetlbfs
> mapping, that could cause the memory accounting to be delayed until COW happens.

That would be kind of weird. I'd assume the reservation gets properly 
done during fork() - just like for VM_ACCOUNT.

> However that's definitely not the case for QEMU since QEMU won't work at all as
> late as that point.
> 
> IOW, for hugetlbfs I don't know why we need to populate the pages at all if we
> simply want to know "whether we do still have enough space"..  And IIUC 2)
> above is the major issue you'd like to solve too.

To avoid page faults at runtime on access I think. Reservation <= 
Preallocation.

[...]

>> --- HOW MADV_POPULATE_WRITE might be useful ---
>>
>> With 3) 4) 5) MADV_POPULATE does partially what I want: preallocate memory
>> and populate page tables. But as it's a read fault, I think we'll have
>> another minor fault on access. Not perfect, but better than failing with
>> SIGBUS. One way around that would be having an additional
>> MADV_POPULATE_WRITE, to use in cases where it makes sense (I think at least
>> 3) and 4), most probably not on actual files like 5) ).
> 
> Right, it seems when populating memories we'll read-fault on file-backed.
> However that'll be another performance issue to think about.  So I'd hope we
> can start with the current virtio-mem issue on memory accounting, then we can
> discuss them separately.

MADV_POPULATE is certainly something I want and what fits nicely into 
the existing model of MAP_POPULATE. Doing reservation only is a 
different topic - and is most probably only possible for anonymous 
memory in a clean way.

> Btw, thanks for the long write-up, it definitely helps me to understand what
> you wanted to achieve.

Sure! Thanks!
David Hildenbrand Feb. 19, 2021, 7:14 p.m. UTC | #20
>> It's interesting to know about commit 1e356fc14be ("mem-prealloc: reduce large
>> guest start-up and migration time.", 2017-03-14).  It seems for speeding up VM
>> boot, but what I can't understand is why it would cause the delay of hugetlb
>> accounting - I thought we'd fail even earlier at either fallocate() on the
>> hugetlb file (when we use /dev/hugepages) or on mmap() of the memfd which
>> contains the huge pages.  See hugetlb_reserve_pages() and its callers.  Or did
>> I miss something?
> 
> We should fail on mmap() when the reservation happens (unless
> MAP_NORESERVE is passed) I think.
> 
>>
>> I think there's a special case if QEMU fork() with a MAP_PRIVATE hugetlbfs
>> mapping, that could cause the memory accounting to be delayed until COW happens.
> 
> That would be kind of weird. I'd assume the reservation gets properly
> done during fork() - just like for VM_ACCOUNT.
> 
>> However that's definitely not the case for QEMU since QEMU won't work at all as
>> late as that point.
>>
>> IOW, for hugetlbfs I don't know why we need to populate the pages at all if we
>> simply want to know "whether we do still have enough space"..  And IIUC 2)
>> above is the major issue you'd like to solve too.
> 
> To avoid page faults at runtime on access I think. Reservation <=
> Preallocation.

I just learned that there is more to it: (test done on v5.9)

# echo 512 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
# cat /sys/devices/system/node/node*/meminfo | grep HugePages_
Node 0 HugePages_Total:   512
Node 0 HugePages_Free:    512
Node 0 HugePages_Surp:      0
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0
Node 1 HugePages_Surp:      0
# cat /proc/meminfo  | grep HugePages_
HugePages_Total:     512
HugePages_Free:      512
HugePages_Rsvd:        0
HugePages_Surp:        0

# /usr/libexec/qemu-kvm -m 1G -smp 1 -object memory-backend-memfd,id=mem0,size=1G,hugetlb=on,hugetlbsize=2M,policy=bind,host-nodes=0 -numa node,nodeid=0,memdev=mem0 -hda Fedora-Cloud-Base-Rawhide-20201004.n.1.x86_64.qcow2 -nographic
-> works just fine

# /usr/libexec/qemu-kvm -m 1G -smp 1 -object memory-backend-memfd,id=mem0,size=1G,hugetlb=on,hugetlbsize=2M,policy=bind,host-nodes=1 -numa node,nodeid=0,memdev=mem0 -hda Fedora-Cloud-Base-Rawhide-20201004.n.1.x86_64.qcow2 -nographic
-> Does not fail nicely but crashes!


See https://bugzilla.redhat.com/show_bug.cgi?id=1686261 for something similar; however, it no longer applies in that form on more recent kernels.

Hugetlbfs reservations don't always protect you (especially with NUMA) - that's why e.g., libvirt always tells QEMU to prealloc.

I think the "issue" is that the reservation happens on mmap(). mbind() runs afterwards. Preallocation saves you from that.

I suspect something similar will happen with anonymous memory with mbind() even if we reserved swap space. Did not test yet, though.
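
For diagnostics, the per-node counters printed above can be parsed from text in that format. The following C sketch is illustrative only: node_hugepages_free() is a hypothetical helper, and, as noted earlier in the thread, any such check is racy, so it cannot replace preallocation.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: find "Node <node> HugePages_Free: <n>" in text that
 * uses the per-node meminfo format shown above.
 * Returns n, or -1 if no matching line exists. */
static long node_hugepages_free(const char *text, int node)
{
    assert(text != NULL);

    for (const char *line = text; line && *line; ) {
        int n;
        long free_pages;

        /* Both conversions succeed only on the HugePages_Free line. */
        if (sscanf(line, "Node %d HugePages_Free: %ld", &n, &free_pages) == 2 &&
            n == node)
            return free_pages;

        /* Advance to the start of the next line, if any. */
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}
```

A result of 0 for the node named in host-nodes= would explain the second QEMU invocation above, but a non-zero result gives no guarantee that a later fault succeeds.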
Peter Xu Feb. 19, 2021, 7:23 p.m. UTC | #21
On Fri, Feb 19, 2021 at 06:13:47PM +0100, David Hildenbrand wrote:
> On 19.02.21 17:31, Peter Xu wrote:
> > On Fri, Feb 19, 2021 at 09:20:16AM +0100, David Hildenbrand wrote:
> > > On 18.02.21 23:59, Peter Xu wrote:
> > > > Hi, David,
> > > > 
> > > > On Wed, Feb 17, 2021 at 04:48:44PM +0100, David Hildenbrand wrote:
> > > > > When we manage sparse memory mappings dynamically in user space - also
> > > > > sometimes involving MADV_NORESERVE - we want to dynamically populate/
> > > > > discard memory inside such a sparse memory region. Example users are
> > > > > hypervisors (especially implementing memory ballooning or similar
> > > > > technologies like virtio-mem) and memory allocators. In addition, we want
> > > > > to fail in a nice way if populating does not succeed because we are out of
> > > > > backend memory (which can happen easily with file-based mappings,
> > > > > especially tmpfs and hugetlbfs).

[1]

> > E.g., can we simply ask the kernel "how much memory this process can still
> > allocate", then get a number out of it?  I'm not sure whether it can be done
> 
> Anything like that is completely racy and unreliable.

The failure path won't be racy imho - If we can detect current process doesn't
have enough memory budget, it'll be more efficient to fail even before trying
to populate any memory and then drop part of them again.

But I see your point - indeed it's good to guarantee the guest won't crash at
any point of further guest side memory access.

Another question: can the user actually specify arbitrary max-length for the
virtio-mem device (which decides the maximum memory this device could possibly
consume)?  I thought we should check that first before realizing the device and
we really shouldn't fail any guest memory access if that check passed. Feel
free to correct me..

[...]

> > 
> > I think there's a special case if QEMU fork() with a MAP_PRIVATE hugetlbfs
> > mapping, that could cause the memory accounting to be delayed until COW happens.
> 
> That would be kind of weird. I'd assume the reservation gets properly done
> during fork() - just like for VM_ACCOUNT.

AFAIK VM_ACCOUNT is never applied for hugetlbfs.  Neither do I know any
accounting done for hugetlbfs during fork(), if not taking the pinned pages
into account - that is definitely a special case.

> 
> > However that's definitely not the case for QEMU since QEMU won't work at all as
> > late as that point.
> > 
> > IOW, for hugetlbfs I don't know why we need to populate the pages at all if we
> > simply want to know "whether we do still have enough space"..  And IIUC 2)
> > above is the major issue you'd like to solve too.
> 
> To avoid page faults at runtime on access I think. Reservation <=
> Preallocation.

Yes.  Besides my above question regarding max-length of virtio-mem device: we
care most about private mappings of hugetlbfs/shmem here, am I right?

I'm wondering why we'd need MAP_PRIVATE of these at all for VM context.

It's definitely not the major scenario when they're used shared with either ovs
or any non-qemu process, because then MAP_SHARED is a must. Then if we use them
privately, can we simply always make it MAP_SHARED?

IMHO MAP_PRIVATE could be helpful only if we'd like the COW semantics, so it
means when there's something there already, we'd like to keep that snapshot but
trigger a page copy on write.  But is that the case for a VM memory backend
which should always be zeroed by default?  Then, I'm wondering whether we can
simply avoid bothering with MAP_PRIVATE on this file-backed memory at all - then
we'll naturally get fallocate() on hand, which seems to be working for us already.
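
The MAP_SHARED plus fallocate() idea can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not QEMU's code: prealloc_and_map_shared() is a hypothetical helper, and posix_fallocate() stands in for fallocate() so that the example does not depend on Linux-specific feature macros. It does not address NUMA placement or page table population.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>

/* Hypothetical helper: reserve backing store for fd, then map it shared.
 * Returns the mapping, or MAP_FAILED with errno set (for example ENOSPC
 * when the backend runs out of space). */
static void *prealloc_and_map_shared(int fd, size_t len)
{
    assert(fd >= 0 && len > 0);

    /* posix_fallocate() returns an error number instead of setting errno. */
    int err = posix_fallocate(fd, 0, (off_t)len);
    if (err != 0) {
        errno = err;
        return MAP_FAILED;
    }

    /* With MAP_SHARED, writes go to the preallocated file pages, so no
     * second copy is created on write as it would be with MAP_PRIVATE. */
    return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}
```

Because the allocation failure is returned before any access, the caller can fail cleanly, although the first access to each page still takes a minor fault.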

Thanks,
Mike Kravetz Feb. 19, 2021, 7:25 p.m. UTC | #22
On 2/19/21 11:14 AM, David Hildenbrand wrote:
>>> It's interesting to know about commit 1e356fc14be ("mem-prealloc: reduce large
>>> guest start-up and migration time.", 2017-03-14).  It seems for speeding up VM
>>> boot, but what I can't understand is why it would cause the delay of hugetlb
>>> accounting - I thought we'd fail even earlier at either fallocate() on the
>>> hugetlb file (when we use /dev/hugepages) or on mmap() of the memfd which
>>> contains the huge pages.  See hugetlb_reserve_pages() and its callers.  Or did
>>> I miss something?
>>
>> We should fail on mmap() when the reservation happens (unless
>> MAP_NORESERVE is passed) I think.
>>
>>>
>>> I think there's a special case if QEMU fork() with a MAP_PRIVATE hugetlbfs
>>> mapping, that could cause the memory accounting to be delayed until COW happens.
>>
>> That would be kind of weird. I'd assume the reservation gets properly
>> done during fork() - just like for VM_ACCOUNT.
>>
>>> However that's definitely not the case for QEMU since QEMU won't work at all as
>>> late as that point.
>>>
>>> IOW, for hugetlbfs I don't know why we need to populate the pages at all if we
>>> simply want to know "whether we do still have enough space"..  And IIUC 2)
>>> above is the major issue you'd like to solve too.
>>
>> To avoid page faults at runtime on access I think. Reservation <=
>> Preallocation.
> 
> I just learned that there is more to it: (test done on v5.9)
> 
> # echo 512 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
> # cat /sys/devices/system/node/node*/meminfo | grep HugePages_
> Node 0 HugePages_Total:   512
> Node 0 HugePages_Free:    512
> Node 0 HugePages_Surp:      0
> Node 1 HugePages_Total:     0
> Node 1 HugePages_Free:      0
> Node 1 HugePages_Surp:      0
> # cat /proc/meminfo  | grep HugePages_
> HugePages_Total:     512
> HugePages_Free:      512
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> 
> # /usr/libexec/qemu-kvm -m 1G -smp 1 -object memory-backend-memfd,id=mem0,size=1G,hugetlb=on,hugetlbsize=2M,policy=bind,host-nodes=0 -numa node,nodeid=0,memdev=mem0 -hda Fedora-Cloud-Base-Rawhide-20201004.n.1.x86_64.qcow2 -nographic
> -> works just fine
> 
> # /usr/libexec/qemu-kvm -m 1G -smp 1 -object memory-backend-memfd,id=mem0,size=1G,hugetlb=on,hugetlbsize=2M,policy=bind,host-nodes=1 -numa node,nodeid=0,memdev=mem0 -hda Fedora-Cloud-Base-Rawhide-20201004.n.1.x86_64.qcow2 -nographic
> -> Does not fail nicely but crashes!
> 
> 
> See https://bugzilla.redhat.com/show_bug.cgi?id=1686261 for something similar, however, it no longer applies like that on more recent kernels.
> 
> Hugetlbfs reservations don't always protect you (especially with NUMA) - that's why e.g., libvirt always tells QEMU to prealloc.
> 
> I think the "issue" is that the reservation happens on mmap(). mbind() runs afterwards. Preallocation saves you from that.
> 
> I suspect something similar will happen with anonymous memory with mbind() even if we reserved swap space. Did not test yet, though.
> 

Sorry, for jumping in late ... hugetlb keyword just hit my mail filters :)

Yes, it is true that hugetlb reservations are not numa aware.  So, even if
pages are reserved at mmap time one could still SIGBUS if a fault is
restricted to a node with insufficient pages.

I looked into this some years ago, and there really is not a good way to
make hugetlb reservations numa aware.  preallocation, or on demand
populating as proposed here is a way around the issue.
David Hildenbrand Feb. 19, 2021, 8:04 p.m. UTC | #23
> On 19.02.2021 at 20:23, Peter Xu <peterx@redhat.com> wrote:
> 
> On Fri, Feb 19, 2021 at 06:13:47PM +0100, David Hildenbrand wrote:
>>> On 19.02.21 17:31, Peter Xu wrote:
>>> On Fri, Feb 19, 2021 at 09:20:16AM +0100, David Hildenbrand wrote:
>>>> On 18.02.21 23:59, Peter Xu wrote:
>>>>> Hi, David,
>>>>> 
>>>>> On Wed, Feb 17, 2021 at 04:48:44PM +0100, David Hildenbrand wrote:
>>>>>> When we manage sparse memory mappings dynamically in user space - also
>>>>>> sometimes involving MADV_NORESERVE - we want to dynamically populate/
>>>>>> discard memory inside such a sparse memory region. Example users are
>>>>>> hypervisors (especially implementing memory ballooning or similar
>>>>>> technologies like virtio-mem) and memory allocators. In addition, we want
>>>>>> to fail in a nice way if populating does not succeed because we are out of
>>>>>> backend memory (which can happen easily with file-based mappings,
>>>>>> especially tmpfs and hugetlbfs).
> 
> [1]
> 
>>> E.g., can we simply ask the kernel "how much memory this process can still
>>> allocate", then get a number out of it?  I'm not sure whether it can be done
>> 
>> Anything like that is completely racy and unreliable.
> 
> The failure path won't be racy imho - If we can detect current process doesn't
> have enough memory budget, it'll be more efficient to fail even before trying
> to populate any memory and then drop part of them again.
> 
> But I see your point - indeed it's good to guarantee the guest won't crash at
> any point of further guest side memory access.
> 
> Another question: can the user actually specify arbitrary max-length for the
> virtio-mem device (which decides the maximum memory this device could possibly
> consume)?  I thought we should check that first before realizing the device and
> we really shouldn't fail any guest memory access if that check passed. Feel
> free to correct me.

Max-length is currently limited by the mmap() we're allowed to create. With MAP_NORESERVE this can be big (not merged yet).

Checking max-length at initialization time does not make too much sense. Just imagine shrinking/relocating other VMs so you can grow this VM further. Or migrating the VM to another machine where you might grow it further.

The ultimate goal is to adjust the mapping size dynamically on demand, but that's stuff for the future as it turns out to be complicated. For example, hugetlbfs VMAs cannot be merged yet (although I think it shouldn't be too hard to implement).

The short term approach is only exposing a small window of the bigger mmap to the guest.

>> 
>> That would be kind of weird. I'd assume the reservation gets properly done
>> during fork() - just like for VM_ACCOUNT.
> 
> AFAIK VM_ACCOUNT is never applied for hugetlbfs.  Neither do I know any
> accounting done for hugetlbfs during fork(), if not taking the pinned pages
> into account - that is definitely a special case.
> 

Yes, it isn't - I meant "like" as in "similar to swap reservation".

>> 
>>> However that's definitely not the case for QEMU since QEMU won't work at all as
>>> late as that point.
>>> 
>>> IOW, for hugetlbfs I don't know why we need to populate the pages at all if we
>>> simply want to know "whether we do still have enough space"..  And IIUC 2)
>>> above is the major issue you'd like to solve too.
>> 
>> To avoid page faults at runtime on access I think. Reservation <=
>> Preallocation.
> 
> Yes.  Besides my above question regarding max-length of virtio-mem device: we
> care most about private mappings of hugetlbfs/shmem here, am I right?
> 
> I'm thinking why we'd need MAP_PRIVATE of these at all for VM context.

One reason is that MAP_SHARED does not support mbind() - which should include hugetlbfs. I did not investigate other side effects / performance considerations on allocation.

Similarly, fallocate() does not respect/care about NUMA.

(And yes, NUMA for virtio-mem will be important).
David Hildenbrand Feb. 20, 2021, 9:01 a.m. UTC | #24
> Sorry, for jumping in late ... hugetlb keyword just hit my mail filters :)
> 

Sorry for not thinking to cc you before I sent out the man page update :)

> Yes, it is true that hugetlb reservations are not numa aware.  So, even if
> pages are reserved at mmap time one could still SIGBUS if a fault is
> restricted to a node with insufficient pages.
> 
> I looked into this some years ago, and there really is not a good way to
> make hugetlb reservations numa aware.  preallocation, or on demand
> populating as proposed here is a way around the issue.


Thanks for confirming, this makes a lot of sense to me now.
David Hildenbrand Feb. 20, 2021, 9:12 a.m. UTC | #25
On 17.02.21 16:48, David Hildenbrand wrote:
> When we manage sparse memory mappings dynamically in user space - also
> sometimes involving MADV_NORESERVE - we want to dynamically populate/
> discard memory inside such a sparse memory region. Example users are
> hypervisors (especially implementing memory ballooning or similar
> technologies like virtio-mem) and memory allocators. In addition, we want
> to fail in a nice way if populating does not succeed because we are out of
> backend memory (which can happen easily with file-based mappings,
> especially tmpfs and hugetlbfs).
> 
> While MADV_DONTNEED and FALLOC_FL_PUNCH_HOLE provide us ways to reliably
> discard memory, there is no generic approach to populate ("preallocate")
> memory.
> 
> Although mmap() supports MAP_POPULATE, it is not applicable to the concept
> of sparse memory mappings, where we want to do populate/discard
> dynamically and avoid expensive/problematic remappings. In addition,
> we never actually report error during the final populate phase - it is
> best-effort only.
> 
> fallocate() can be used to preallocate file-based memory and fail in a safe
> way. However, it is less useful for private mappings on anonymous files
> due to COW semantics. For example, using fallocate() to preallocate memory
> on an anonymous memfd files that are mapped MAP_PRIVATE results in a double
> memory consumption when actually writing via the mapping. In addition,
> fallocate() does not actually populate page tables, so we still always
> have to resolve minor faults on first access.
> 
> Because we don't have a proper interface, what applications
> (like QEMU and databases) end up doing is touching (i.e., writing) all
> individual pages. However, it requires expensive signal handling (SIGBUS);
> for example, this is problematic in hypervisors like QEMU where SIGBUS
> handlers might already be used by other subsystems concurrently to e.g,
> handle hardware errors. "Simply" doing preallocation from another thread
> is not that easy.
> 
> Let's introduce MADV_POPULATE with the following semantics
> 1. MADV_POPULATE does not work on PROT_NONE and special VMAs. It works
>     on everything else.
> 2. Errors during MADV_POPULATE (especially OOM) are reported. If we hit
>     hardware errors on pages, ignore them - nothing we really can or
>     should do.
> 3. On errors during MADV_POPULATE, some memory might have been
>     populated. Callers have to clean up if they care.
> 4. Concurrent changes to the virtual memory layout are tolerated - we
>     process each and every PFN only once, though.
> 5. If MADV_POPULATE succeeds, all memory in the range can be accessed
>     without SIGBUS. (of course, not if user space changed mappings in the
>     meantime or KSM kicked in on anonymous memory).
> 
> Although sparse memory mappings are the primary use case, this will
> also be useful for ordinary preallocations where MAP_POPULATE is not
> desired (e.g., in QEMU, where users can trigger preallocation of
> guest RAM after the mapping was created).
> 
> Looking at the history, MADV_POPULATE was already proposed in 2013 [1],
> however, the main motivation back then was performance improvements
> (which should also still be the case, but it's a secondary concern).
> 
> Basic functionality was tested with:
> - anonymous memory
> - MAP_PRIVATE on anonymous file via memfd
> - MAP_SHARED on anonymous file via memfd
> - MAP_PRIVATE on anonymous hugetlbfs file via memfd
> - MAP_SHARED on anonymous hugetlbfs file via memfd
> - MAP_PRIVATE on tmpfs/shmem file (we end up with double memory consumption
>    though, as the actual file gets populated with zeroes)
> - MAP_SHARED on tmpfs/shmem file
> 
> Note: For populating/preallocating zeroed-out memory while userfaultfd is
> active, it's even faster to use first fallocate() or placing zeroed pages
> via userfaultfd APIs. Otherwise, we'll have to route every fault while
> populating via the userfaultfd handler.
> 
> [1] https://lkml.org/lkml/2013/6/27/698
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Jann Horn <jannh@google.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Rik van Riel <riel@surriel.com>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Richard Henderson <rth@twiddle.net>
> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
> Cc: Matt Turner <mattst88@gmail.com>
> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
> Cc: Helge Deller <deller@gmx.de>
> Cc: Chris Zankel <chris@zankel.net>
> Cc: Max Filippov <jcmvbkbc@gmail.com>
> Cc: linux-alpha@vger.kernel.org
> Cc: linux-mips@vger.kernel.org
> Cc: linux-parisc@vger.kernel.org
> Cc: linux-xtensa@linux-xtensa.org
> Cc: linux-arch@vger.kernel.org
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
> 
> If we agree that this makes sense I'll do more testing to see if we
> are missing any return value handling and prepare a man page update to
> document the semantics.
> 
> Thoughts?

Thinking about MADV_POPULATE vs. MADV_POPULATE_WRITE I wonder if it 
would be more versatile to break with existing MAP_POPULATE semantics 
and directly go with

MADV_POPULATE_READ: simulate user space read access without actually 
reading. Trigger a read fault if required.

MADV_POPULATE_WRITE: simulate user space write access without actually 
writing. Trigger a write fault if required.

For my use case, I could use MADV_POPULATE_WRITE on anonymous memory and 
RAM-backed files (shmem/hugetlb) - I would not have a minor fault when 
the guest inside the VM first initializes memory. This mimics how QEMU 
currently preallocates memory.

However, I would use MADV_POPULATE_READ on any !RAM-backed files where 
we actually have to write-back to a (slow?) device. Dirtying everything 
although the guest might not actually consume it in the near future 
might be undesired.

MADV_POPULATE_READ could also come in handy in combination with
userfaultfd-wp [1], when handling unpopulated memory via ordinary
userfaultfd MISSING events is undesired. I could imagine it can speed up
live migration of VMs in general, where we might end up reading a lot of
unpopulated memory to figure out it's all zeroes after faulting in the
shared zeropage.

Thoughts?

[1] https://lkml.kernel.org/r/20210219211054.GL6669@xz-x1
Michal Hocko Feb. 22, 2021, 12:46 p.m. UTC | #26
I am slowly catching up with this thread.

On Fri 19-02-21 09:20:16, David Hildenbrand wrote:
[...]
> So if we have zero, we write zero. We'll COW pages, triggering a write fault
> - and that's the only good thing about it. For example, similar to
> MADV_POPULATE, nothing stops KSM from merging anonymous pages again. So for
> anonymous memory the actual write is not helpful at all. Similarly for
> hugetlbfs, the actual write is not necessary - but there is no other way to
> really achieve the goal.

I really do not see why you care about KSM so much. Isn't KSM an
explicit opt-in with a fine grained interface to control which memory to
KSM or not?
David Hildenbrand Feb. 22, 2021, 12:52 p.m. UTC | #27
On 22.02.21 13:46, Michal Hocko wrote:
> I am slowly catching up with this thread.
> 
> On Fri 19-02-21 09:20:16, David Hildenbrand wrote:
> [...]
>> So if we have zero, we write zero. We'll COW pages, triggering a write fault
>> - and that's the only good thing about it. For example, similar to
>> MADV_POPULATE, nothing stops KSM from merging anonymous pages again. So for
>> anonymous memory the actual write is not helpful at all. Similarly for
>> hugetlbfs, the actual write is not necessary - but there is no other way to
>> really achieve the goal.
> 
> I really do not see why you care about KSM so much. Isn't KSM an
> explicit opt-in with a fine grained interface to control which memory to
> KSM or not?

Yeah, I think it's opt-in via MADV_MERGEABLE. E.g., QEMU defaults to
enabling KSM unless explicitly disabled by the user.

But I agree, I got distracted by KSM details.
Michal Hocko Feb. 22, 2021, 12:56 p.m. UTC | #28
On Sat 20-02-21 10:12:26, David Hildenbrand wrote:
[...]
> Thinking about MADV_POPULATE vs. MADV_POPULATE_WRITE I wonder if it would be
> more versatile to break with existing MAP_POPULATE semantics and directly go
> with
> 
> MADV_POPULATE_READ: simulate user space read access without actually
> reading. Trigger a read fault if required.
> 
> MADV_POPULATE_WRITE: simulate user space write access without actually
> writing. Trigger a write fault if required.
> 
> For my use case, I could use MADV_POPULATE_WRITE on anonymous memory and
> RAM-backed files (shmem/hugetlb) - I would not have a minor fault when the
> guest inside the VM first initializes memory. This mimics how QEMU currently
> preallocates memory.
> 
> However, I would use MADV_POPULATE_READ on any !RAM-backed files where we
> actually have to write-back to a (slow?) device. Dirtying everything
> although the guest might not actually consume it in the near future might be
> undesired.

Isn't that what the current mm_populate does?
        if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
                gup_flags |= FOLL_WRITE;

So it will write fault to shared memory mappings but it will touch
others.
David Hildenbrand Feb. 22, 2021, 12:59 p.m. UTC | #29
On 22.02.21 13:56, Michal Hocko wrote:
> On Sat 20-02-21 10:12:26, David Hildenbrand wrote:
> [...]
>> Thinking about MADV_POPULATE vs. MADV_POPULATE_WRITE I wonder if it would be
>> more versatile to break with existing MAP_POPULATE semantics and directly go
>> with
>>
>> MADV_POPULATE_READ: simulate user space read access without actually
>> reading. Trigger a read fault if required.
>>
>> MADV_POPULATE_WRITE: simulate user space write access without actually
>> writing. Trigger a write fault if required.
>>
>> For my use case, I could use MADV_POPULATE_WRITE on anonymous memory and
>> RAM-backed files (shmem/hugetlb) - I would not have a minor fault when the
>> guest inside the VM first initializes memory. This mimics how QEMU currently
>> preallocates memory.
>>
>> However, I would use MADV_POPULATE_READ on any !RAM-backed files where we
>> actually have to write-back to a (slow?) device. Dirtying everything
>> although the guest might not actually consume it in the near future might be
>> undesired.
> 
> Isn't what the current mm_populate does?
>          if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
>                  gup_flags |= FOLL_WRITE;
> 
> So it will write fault to shared memory mappings but it will touch
> others.

Exactly. But for hugetlbfs/shmem ("RAM-backed files") this is not what
we want.
Michal Hocko Feb. 22, 2021, 1:19 p.m. UTC | #30
On Mon 22-02-21 13:59:55, David Hildenbrand wrote:
> On 22.02.21 13:56, Michal Hocko wrote:
> > On Sat 20-02-21 10:12:26, David Hildenbrand wrote:
> > [...]
> > > Thinking about MADV_POPULATE vs. MADV_POPULATE_WRITE I wonder if it would be
> > > more versatile to break with existing MAP_POPULATE semantics and directly go
> > > with
> > > 
> > > MADV_POPULATE_READ: simulate user space read access without actually
> > > reading. Trigger a read fault if required.
> > > 
> > > MADV_POPULATE_WRITE: simulate user space write access without actually
> > > writing. Trigger a write fault if required.
> > > 
> > > For my use case, I could use MADV_POPULATE_WRITE on anonymous memory and
> > > RAM-backed files (shmem/hugetlb) - I would not have a minor fault when the
> > > guest inside the VM first initializes memory. This mimics how QEMU currently
> > > preallocates memory.
> > > 
> > > However, I would use MADV_POPULATE_READ on any !RAM-backed files where we
> > > actually have to write-back to a (slow?) device. Dirtying everything
> > > although the guest might not actually consume it in the near future might be
> > > undesired.
> > 
> > Isn't what the current mm_populate does?
> >          if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
> >                  gup_flags |= FOLL_WRITE;
> > 
> > So it will write fault to shared memory mappings but it will touch
> > others.

Ugh, I wrote that the opposite way round from the actual behavior. It
will write fault on writable private mappings and only touch read-only
or shared mappings.
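
To make the corrected description concrete, here is a small user-space
sketch of the check quoted above. The helper name
populate_uses_write_fault() is made up for illustration, and the
VM_WRITE/VM_SHARED values are copied from the kernel's
include/linux/mm.h only so that the snippet builds outside the kernel.

```c
#include <stdbool.h>

/*
 * Same bit values as the kernel's VM_WRITE and VM_SHARED, redefined here
 * only so that this sketch builds in user space.
 */
#define VM_WRITE	0x00000002UL
#define VM_SHARED	0x00000008UL

/* Hypothetical helper: true if populate would trigger a write fault. */
static bool populate_uses_write_fault(unsigned long vm_flags)
{
	/* Writable and private: write fault to break COW. */
	return (vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE;
}
```

Only a writable private mapping passes the check; writable shared
mappings and all read-only mappings get a read fault.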

> 
> Exactly. But for hugetlbfs/shmem ("RAM-backed files") this is not what we
> want.

OK, then I must have misread your requirements. Maybe I just got lost in
all the combinations you have listed.
David Hildenbrand Feb. 22, 2021, 1:22 p.m. UTC | #31
>> Exactly. But for hugetlbfs/shmem ("RAM-backed files") this is not what we
>> want.
> 
> OK, then I must have misread your requirements. Maybe I just got lost in
> all the combinations you have listed.

Another special case could be dax/pmem I think. You might want to fault 
it in readable/writable but not perform an actual read/write unless 
really required.

QEMU phrases this as "don't cause wear on the storage backing".
Michal Hocko Feb. 22, 2021, 2:02 p.m. UTC | #32
On Mon 22-02-21 14:22:37, David Hildenbrand wrote:
> > > Exactly. But for hugetlbfs/shmem ("RAM-backed files") this is not what we
> > > want.
> > 
> > OK, then I must have misread your requirements. Maybe I just got lost in
> > all the combinations you have listed.
> 
> Another special case could be dax/pmem I think. You might want to fault it
> in readable/writable but not perform an actual read/write unless really
> required.
> 
> QEMU phrases this as "don't cause wear on the storage backing".

Sorry for being dense here but I still do not follow. If you do not want
to read then what do you want to populate from? Only map if it is in the
page cache?
David Hildenbrand Feb. 22, 2021, 3:30 p.m. UTC | #33
On 22.02.21 15:02, Michal Hocko wrote:
> On Mon 22-02-21 14:22:37, David Hildenbrand wrote:
>>>> Exactly. But for hugetlbfs/shmem ("RAM-backed files") this is not what we
>>>> want.
>>>
>>> OK, then I must have misread your requirements. Maybe I just got lost in
>>> all the combinations you have listed.
>>
>> Another special case could be dax/pmem I think. You might want to fault it
>> in readable/writable but not perform an actual read/write unless really
>> required.
>>
>> QEMU phrases this as "don't cause wear on the storage backing".
> 
> Sorry for being dense here but I still do not follow. If you do not want
> to read then what do you want to populate from? Only map if it is in the

In the context of VMs it's usually rather a means to preallocate backend
storage - which would also happen on read access. See below on case 4).

> page cache?

Let's try to untangle my thoughts regarding VMs. We could have as 
backend storage for our VM:

1) Anonymous memory
2) hugetlbfs (private/shared)
3) tmpfs/shmem (private/shared)
4) Ordinary files (shared)
5) DAX/PMEM (shared)

Excluding special cases (hypervisor upgrades with 2) and 3) ), we expect 
to have pre-existing content in files only in 4) and 5). 4) and 5) might 
be used as NVDIMM backend for a guest, or as DIMM backend.

The first access of our VM to memory could be
a) Write: the usual case when exposed as RAM/DIMM to our guest.
b) Read: possible case when exposed as an NVDIMM to our guest (we don't
    know). But eventually, we might write to (parts of) NVDIMMs later.

We "preallocate"/"populate" memory of our VM so that
- We know we have sufficient backend storage (esp. hugetlbfs, shmem,
   files) - so we don't randomly crash the VM. My most important use
   case.
- We avoid page faults (including page zeroing!) at runtime. Especially
   relevant for RT workloads.
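
A minimal sketch of what "fail in a nice way" could look like from user
space, assuming the MADV_POPULATE_WRITE semantics discussed in this
thread. The value 23 is the one later merged in mainline and is an
assumption relative to this RFC (which only defines MADV_POPULATE);
prealloc_range() is a made-up helper name.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Assumption: value as later merged in mainline, not part of this RFC. */
#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23
#endif

/*
 * Hypothetical helper: prefault [addr, addr + len) writable.
 * Returns 0 on success or a negative errno value, so the caller can
 * report the problem instead of crashing on first access later.
 */
static int prealloc_range(void *addr, size_t len)
{
	if (madvise(addr, len, MADV_POPULATE_WRITE) == 0)
		return 0;
	/* E.g. -EINVAL on kernels that do not know the advice value. */
	return -errno;
}
```

A hypervisor would call this once per memory backend at startup and
refuse to start the VM on a non-zero return value, instead of relying
on SIGBUS handling while touching pages manually.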

With 1), 2), and 3) we want to have pages faulted in writable - we 
expect that our guest will write to that memory. MADV_POPULATE would do 
that only for 1), and MAP_PRIVATE of 2). For the shared parts, we would 
want MADV_POPULATE_WRITE semantics.

With 5), we already had complaints that preallocation in QEMU takes a
long time - because we end up actually reading/writing slow PMEM
(libvirt now disables preallocation for that reason, which makes sense).
However, MADV_POPULATE_WRITE would help to prefault without actually
reading/writing pmem - if we want to avoid any minor faults.

With 4), I think we primarily prealloc/prefault to make sure we have 
sufficient backend storage. fallocate() might do a better job just for 
the allocation. But if there is sufficient RAM it might make sense to 
prefault all guest RAM at least readable - then we only have a minor 
fault when the VM writes to it and might avoid having to go to disk. 
Prefaulting everything writable means that we *have to* write back all 
guest RAM even if the guest never accessed it. So I think there are 
cases where MADV_POPULATE_READ (current MADV_POPULATE) semantics could 
make sense.
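
The mapping from backend type to populate variant described above could
be sketched as follows. This is only an illustration of the reasoning
in this mail, not code from QEMU or from the patch: the enum, the
function name, and the numeric advice values (taken from what was later
merged in mainline) are assumptions.

```c
/* Assumption: values as later merged in mainline, not part of this RFC. */
#ifndef MADV_POPULATE_READ
#define MADV_POPULATE_READ 22
#endif
#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23
#endif

/* Hypothetical classification of VM memory backends, cases 1) to 5). */
enum vm_backend {
	BACKEND_ANON,		/* 1) anonymous memory */
	BACKEND_HUGETLBFS,	/* 2) hugetlbfs (private/shared) */
	BACKEND_SHMEM,		/* 3) tmpfs/shmem (private/shared) */
	BACKEND_FILE,		/* 4) ordinary files (shared) */
	BACKEND_DAX,		/* 5) DAX/PMEM (shared) */
};

/* Hypothetical helper: pick the madvise() advice for a backend. */
static int populate_advice(enum vm_backend backend)
{
	switch (backend) {
	case BACKEND_FILE:
		/* Avoid dirtying pages that would all need write-back. */
		return MADV_POPULATE_READ;
	case BACKEND_ANON:
	case BACKEND_HUGETLBFS:
	case BACKEND_SHMEM:
	case BACKEND_DAX:
	default:
		/* The guest is expected to write; map pages writable. */
		return MADV_POPULATE_WRITE;
	}
}
```

The only backend that ends up with read semantics is the ordinary file,
matching the write-back concern from case 4).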
David Hildenbrand Feb. 24, 2021, 2:25 p.m. UTC | #34
> +		tmp_end = min_t(unsigned long, end, vma->vm_end);
> +		pages = populate_vma_page_range(vma, start, tmp_end, &locked);
> +		if (!locked) {
> +			mmap_read_lock(mm);
> +			*prev = NULL;
> +			vma = NULL;

^ locked = 1; is missing here.


--- Simple benchmark ---

I implemented MADV_POPULATE_READ and MADV_POPULATE_WRITE and performed 
some simple measurements to simulate memory preallocation with empty files:

1) mmap a 2 MiB/128 MiB/4 GiB region (anonymous, memfd, memfd hugetlb)
2) Discard all memory using fallocate/madvise
3) Prefault memory using different approaches and measure the time this
    takes.

I repeat 2)+3) 10 times and compute the average. I only use a single thread.

Read: Read one byte from each page.
Write: Write one byte of each page (0).
Read/Write: Read one byte and write the value back for each page
POPULATE: MADV_POPULATE (this patch)
POPULATE_READ: MADV_POPULATE_READ
POPULATE_WRITE: MADV_POPULATE_WRITE
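
For reference, the manual approaches and the measurement loop can be
sketched roughly like this. It is a reconstruction under assumptions,
not the benchmark program actually used: all function names are made
up, and the discard step only uses MADV_DONTNEED, which is what a
private anonymous mapping needs (a memfd would need
fallocate(FALLOC_FL_PUNCH_HOLE) instead).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

typedef void (*prefault_fn)(volatile char *mem, size_t len, size_t page);

/* "Read": read one byte from each page. */
static void prefault_read(volatile char *mem, size_t len, size_t page)
{
	for (size_t off = 0; off < len; off += page)
		(void)mem[off];
}

/* "Write": write a zero byte to each page. */
static void prefault_write(volatile char *mem, size_t len, size_t page)
{
	for (size_t off = 0; off < len; off += page)
		mem[off] = 0;
}

/* "Read/Write": read one byte and write the same value back. */
static void prefault_read_write(volatile char *mem, size_t len, size_t page)
{
	for (size_t off = 0; off < len; off += page)
		mem[off] = mem[off];
}

/*
 * Repeat discard + prefault and return the average prefault time in ms,
 * or -1.0 on error. Only the prefault step is timed.
 */
static double bench_prefault(void *mem, size_t len, prefault_fn fn,
			     int iterations)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	double total_ms = 0.0;

	if (iterations <= 0)
		return -1.0;

	for (int i = 0; i < iterations; i++) {
		struct timespec t0, t1;

		/* Step 2): discard all memory. */
		if (madvise(mem, len, MADV_DONTNEED))
			return -1.0;

		/* Step 3): prefault and measure. */
		clock_gettime(CLOCK_MONOTONIC, &t0);
		fn(mem, len, page);
		clock_gettime(CLOCK_MONOTONIC, &t1);

		total_ms += (t1.tv_sec - t0.tv_sec) * 1e3 +
			    (t1.tv_nsec - t0.tv_nsec) / 1e6;
	}
	return total_ms / iterations;
}
```

For example, bench_prefault(mem, len, prefault_write, 10) corresponds
to one "Write" row for a private anonymous mapping.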

--- Benchmark results ---

Measuring 10 iterations each:
==================================================
2 MiB MAP_PRIVATE:
**************************************************
Anonymous      : Read           :     0.159 ms
Anonymous      : Write          :     0.244 ms
Anonymous      : Read+Write     :     0.383 ms
Anonymous      : POPULATE       :     0.167 ms
Anonymous      : POPULATE_READ  :     0.064 ms
Anonymous      : POPULATE_WRITE :     0.165 ms
Memfd 4 KiB    : Read           :     0.401 ms
Memfd 4 KiB    : Write          :     0.056 ms
Memfd 4 KiB    : Read+Write     :     0.075 ms
Memfd 4 KiB    : POPULATE       :     0.057 ms
Memfd 4 KiB    : POPULATE_READ  :     0.337 ms
Memfd 4 KiB    : POPULATE_WRITE :     0.056 ms
Memfd 2 MiB    : Read           :     0.041 ms
Memfd 2 MiB    : Write          :     0.030 ms
Memfd 2 MiB    : Read+Write     :     0.031 ms
Memfd 2 MiB    : POPULATE       :     0.031 ms
Memfd 2 MiB    : POPULATE_READ  :     0.031 ms
Memfd 2 MiB    : POPULATE_WRITE :     0.031 ms
**************************************************
2 MiB MAP_SHARED:
**************************************************
Anonymous      : Read           :     0.071 ms
Anonymous      : Write          :     0.181 ms
Anonymous      : Read+Write     :     0.081 ms
Anonymous      : POPULATE       :     0.069 ms
Anonymous      : POPULATE_READ  :     0.069 ms
Anonymous      : POPULATE_WRITE :     0.115 ms
Memfd 4 KiB    : Read           :     0.401 ms
Memfd 4 KiB    : Write          :     0.351 ms
Memfd 4 KiB    : Read+Write     :     0.414 ms
Memfd 4 KiB    : POPULATE       :     0.338 ms
Memfd 4 KiB    : POPULATE_READ  :     0.339 ms
Memfd 4 KiB    : POPULATE_WRITE :     0.279 ms
Memfd 2 MiB    : Read           :     0.031 ms
Memfd 2 MiB    : Write          :     0.031 ms
Memfd 2 MiB    : Read+Write     :     0.031 ms
Memfd 2 MiB    : POPULATE       :     0.031 ms
Memfd 2 MiB    : POPULATE_READ  :     0.031 ms
Memfd 2 MiB    : POPULATE_WRITE :     0.031 ms
**************************************************
128 MiB MAP_PRIVATE:
**************************************************
Anonymous      : Read           :     7.517 ms
Anonymous      : Write          :    22.503 ms
Anonymous      : Read+Write     :    33.186 ms
Anonymous      : POPULATE       :    18.381 ms
Anonymous      : POPULATE_READ  :     3.952 ms
Anonymous      : POPULATE_WRITE :    18.354 ms
Memfd 4 KiB    : Read           :    34.300 ms
Memfd 4 KiB    : Write          :     4.659 ms
Memfd 4 KiB    : Read+Write     :     6.531 ms
Memfd 4 KiB    : POPULATE       :     5.219 ms
Memfd 4 KiB    : POPULATE_READ  :    29.744 ms
Memfd 4 KiB    : POPULATE_WRITE :     5.244 ms
Memfd 2 MiB    : Read           :    10.228 ms
Memfd 2 MiB    : Write          :    10.130 ms
Memfd 2 MiB    : Read+Write     :    10.190 ms
Memfd 2 MiB    : POPULATE       :    10.007 ms
Memfd 2 MiB    : POPULATE_READ  :    10.008 ms
Memfd 2 MiB    : POPULATE_WRITE :    10.010 ms
**************************************************
128 MiB MAP_SHARED:
**************************************************
Anonymous      : Read           :     7.295 ms
Anonymous      : Write          :    15.234 ms
Anonymous      : Read+Write     :     7.460 ms
Anonymous      : POPULATE       :     5.196 ms
Anonymous      : POPULATE_READ  :     5.190 ms
Anonymous      : POPULATE_WRITE :     8.245 ms
Memfd 4 KiB    : Read           :    34.412 ms
Memfd 4 KiB    : Write          :    30.586 ms
Memfd 4 KiB    : Read+Write     :    35.157 ms
Memfd 4 KiB    : POPULATE       :    29.643 ms
Memfd 4 KiB    : POPULATE_READ  :    29.691 ms
Memfd 4 KiB    : POPULATE_WRITE :    25.790 ms
Memfd 2 MiB    : Read           :    10.210 ms
Memfd 2 MiB    : Write          :    10.074 ms
Memfd 2 MiB    : Read+Write     :    10.068 ms
Memfd 2 MiB    : POPULATE       :    10.034 ms
Memfd 2 MiB    : POPULATE_READ  :    10.037 ms
Memfd 2 MiB    : POPULATE_WRITE :    10.031 ms
**************************************************
4096 MiB MAP_PRIVATE:
**************************************************
Anonymous      : Read           :   240.947 ms
Anonymous      : Write          :   712.941 ms
Anonymous      : Read+Write     :  1027.636 ms
Anonymous      : POPULATE       :   571.816 ms
Anonymous      : POPULATE_READ  :   120.215 ms
Anonymous      : POPULATE_WRITE :   570.750 ms
Memfd 4 KiB    : Read           :  1054.739 ms
Memfd 4 KiB    : Write          :   145.534 ms
Memfd 4 KiB    : Read+Write     :   202.275 ms
Memfd 4 KiB    : POPULATE       :   162.597 ms
Memfd 4 KiB    : POPULATE_READ  :   914.747 ms
Memfd 4 KiB    : POPULATE_WRITE :   161.281 ms
Memfd 2 MiB    : Read           :   351.818 ms
Memfd 2 MiB    : Write          :   352.357 ms
Memfd 2 MiB    : Read+Write     :   352.762 ms
Memfd 2 MiB    : POPULATE       :   351.471 ms
Memfd 2 MiB    : POPULATE_READ  :   351.553 ms
Memfd 2 MiB    : POPULATE_WRITE :   351.931 ms
**************************************************
4096 MiB MAP_SHARED:
**************************************************
Anonymous      : Read           :   229.338 ms
Anonymous      : Write          :   478.964 ms
Anonymous      : Read+Write     :   234.546 ms
Anonymous      : POPULATE       :   161.635 ms
Anonymous      : POPULATE_READ  :   160.943 ms
Anonymous      : POPULATE_WRITE :   252.686 ms
Memfd 4 KiB    : Read           :  1052.828 ms
Memfd 4 KiB    : Write          :   929.237 ms
Memfd 4 KiB    : Read+Write     :  1074.494 ms
Memfd 4 KiB    : POPULATE       :   915.663 ms
Memfd 4 KiB    : POPULATE_READ  :   915.001 ms
Memfd 4 KiB    : POPULATE_WRITE :   787.388 ms
Memfd 2 MiB    : Read           :   353.580 ms
Memfd 2 MiB    : Write          :   353.197 ms
Memfd 2 MiB    : Read+Write     :   353.172 ms
Memfd 2 MiB    : POPULATE       :   353.686 ms
Memfd 2 MiB    : POPULATE_READ  :   353.465 ms
Memfd 2 MiB    : POPULATE_WRITE :   352.776 ms
**************************************************


--- Discussion ---

1) With huge pages, the performance benefit is negligible with the sizes
I tried, because there are few actual page faults. Most time is spent
zeroing huge pages I guess. It will take quite a lot of memory to pay off.

2) In all 4k cases, the POPULATE_READ/POPULATE_WRITE variants are faster 
than manually reading or writing from user space.


What sticks out a bit is:

3) For MAP_SHARED on anonymous memory, it is fastest to first read and
then write memory. It's slightly faster than POPULATE_WRITE and quite a
lot faster than a simple write - what?! I assume with the read access we
prepare a fresh zero page and with the write access we only have to
change PTE access rights. But why is this faster than writing directly?

4) POPULATE_WRITE with MAP_SHARED "Memfd 4 KiB" is faster than 
POPULATE_READ - it's the fastest way to preallocate that memory. 
Similarly, ordinary writes are faster than ordinary reads.


I did not try with actual files, files that already have content, or
with multiple threads yet. Also, I did not try on a subset of a mmap yet
- for simplicity I populate the whole mapping.
David Hildenbrand Feb. 24, 2021, 2:38 p.m. UTC | #35
On 24.02.21 15:25, David Hildenbrand wrote:
>> +		tmp_end = min_t(unsigned long, end, vma->vm_end);
>> +		pages = populate_vma_page_range(vma, start, tmp_end, &locked);
>> +		if (!locked) {
>> +			mmap_read_lock(mm);
>> +			*prev = NULL;
>> +			vma = NULL;
> 
> ^ locked = 1; is missing here.
> 
> 
> --- Simple benchmark ---
> 
> I implemented MADV_POPULATE_READ and MADV_POPULATE_WRITE and performed
> some simple measurements to simulate memory preallocation with empty files:
> 
> 1) mmap a 2 MiB/128 MiB/4 GiB region (anonymous, memfd, memfd hugetlb)
> 2) Discard all memory using fallocate/madvise
> 3) Prefault memory using different approaches and measure the time this
>      takes.
> 
> I repeat 2)+3) 10 times and compute the average. I only use a single thread.
> 
> Read: Read from each page a byte.
> Write: Write one byte of each page (0).
> Read/Write: Read one byte and write the value back for each page
> POPULATE: MADV_POPULATE (this patch)
> POPULATE_READ: MADV_POPULATE_READ
> POPULATE_WRITE: MADV_POPULATE_WRITE
> 
> [...]
> 
> --- Discussion ---
> 
> 1) With huge pages, the performance benefit is negligible with the sizes
> I tried, because there are few actual page faults. Most time is spent
> zeroing huge pages I guess. It will take quite a lot of memory to pay off.
> 
> 2) In all 4k cases, the POPULATE_READ/POPULATE_WRITE variants are faster
> than manually reading or writing from user space.

Forgot to mention one exception: on Memfd 4 KiB with MAP_PRIVATE,
POPULATE_WRITE is slower than a simple write. Also, a read fault is
much slower than a write fault there (what?).
David Hildenbrand Feb. 25, 2021, 8:41 a.m. UTC | #36
On 24.02.21 15:25, David Hildenbrand wrote:
>> +		tmp_end = min_t(unsigned long, end, vma->vm_end);
>> +		pages = populate_vma_page_range(vma, start, tmp_end, &locked);
>> +		if (!locked) {
>> +			mmap_read_lock(mm);
>> +			*prev = NULL;
>> +			vma = NULL;
> 
> ^ locked = 1; is missing here.
> 
> 
> --- Simple benchmark ---
> 
> I implemented MADV_POPULATE_READ and MADV_POPULATE_WRITE and performed
> some simple measurements to simulate memory preallocation with empty files:
> 
> 1) mmap a 2 MiB/128 MiB/4 GiB region (anonymous, memfd, memfd hugetlb)
> 2) Discard all memory using fallocate/madvise
> 3) Prefault memory using different approaches and measure the time this
>      takes.
> 
> I repeat 2)+3) 10 times and compute the average. I only use a single thread.
> 
> Read: Read from each page a byte.
> Write: Write one byte of each page (0).
> Read/Write: Read one byte and write the value back for each page
> POPULATE: MADV_POPULATE (this patch)
> POPULATE_READ: MADV_POPULATE_READ
> POPULATE_WRITE: MADV_POPULATE_WRITE
> 
> [...]
> 
> --- Discussion ---
> 
> 1) With huge pages, the performance benefit is negligible with the sizes
> I tried, because there are few actual page faults. Most time is spent
> zeroing huge pages I guess. It will take quite a lot of memory to pay off.
> 
> 2) In all 4k cases, the POPULATE_READ/POPULATE_WRITE variants are faster
> than manually reading or writing from user space.
> 
> 
> What sticks out a bit is:
> 
> 3) For MAP_SHARED on anonymous memory, it is fastest to first read and
> then write memory. It's slightly faster than POPULATE_WRITE and quite a
> lot faster than a simple write - what?! I assume with the read access we
> prepare a fresh zero page and with the write access we only have to
> change PTE access rights. But why is this faster than writing directly?

Okay, MADV_DONTNEED does not seem to really work on MAP_SHARED of 
anonymous memory. If I use a fresh mmap for each and every iteration the 
numbers make more sense:

**************************************************
4096 MiB MAP_SHARED:
**************************************************
Anonymous      : Read           :  1054.154 ms
Anonymous      : Write          :   924.572 ms
Anonymous      : Read+Write     :  1075.215 ms
Anonymous      : POPULATE       :   911.386 ms
Anonymous      : POPULATE_READ  :   909.392 ms
Anonymous      : POPULATE_WRITE :   793.143 ms
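
The MADV_DONTNEED observation can be checked with a small test like the
one below. It is a sketch with a made-up helper name: shared anonymous
memory is backed by shmem, so MADV_DONTNEED only drops the page table
entries and the data stays around, while on a private anonymous mapping
the page is really discarded. That would presumably explain the odd
numbers from the first run.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper. map_type is MAP_PRIVATE or MAP_SHARED.
 * Returns 1 if a byte written before MADV_DONTNEED is still visible
 * afterwards, 0 if it reads back as zero, and -1 on error.
 */
static int content_survives_dontneed(int map_type)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	volatile char *mem;
	int survived;

	mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   map_type | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED)
		return -1;

	mem[0] = 42;
	if (madvise((void *)mem, len, MADV_DONTNEED)) {
		munmap((void *)mem, len);
		return -1;
	}

	/* The next access faults the page back in (or a fresh zero page). */
	survived = (mem[0] == 42);
	munmap((void *)mem, len);
	return survived;
}
```

On Linux I would expect 1 for MAP_SHARED and 0 for MAP_PRIVATE, which
is why a fresh mmap per iteration is needed to measure real allocation
cost for shared anonymous memory.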

Patch

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index a18ec7f63888..e90eeb5e6cf1 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -71,6 +71,8 @@ 
 #define MADV_COLD	20		/* deactivate these pages */
 #define MADV_PAGEOUT	21		/* reclaim these pages */
 
+#define MADV_POPULATE	22		/* populate pages */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 57dc2ac4f8bd..b928becc5308 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -98,6 +98,8 @@ 
 #define MADV_COLD	20		/* deactivate these pages */
 #define MADV_PAGEOUT	21		/* reclaim these pages */
 
+#define MADV_POPULATE	22		/* populate pages */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index ab78cba446ed..9d3a56044287 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -52,6 +52,8 @@ 
 #define MADV_COLD	20		/* deactivate these pages */
 #define MADV_PAGEOUT	21		/* reclaim these pages */
 
+#define MADV_POPULATE	22		/* populate pages */
+
 #define MADV_MERGEABLE   65		/* KSM may merge identical pages */
 #define MADV_UNMERGEABLE 66		/* KSM may not merge identical pages */
 
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index e5e643752947..3169b1be8920 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -106,6 +106,8 @@ 
 #define MADV_COLD	20		/* deactivate these pages */
 #define MADV_PAGEOUT	21		/* reclaim these pages */
 
+#define MADV_POPULATE	22		/* populate pages */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index f94f65d429be..fa617fd0d733 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -72,6 +72,8 @@ 
 #define MADV_COLD	20		/* deactivate these pages */
 #define MADV_PAGEOUT	21		/* reclaim these pages */
 
+#define MADV_POPULATE	22		/* populate pages */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/mm/madvise.c b/mm/madvise.c
index 6a660858784b..f76fdd6fcf10 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -53,6 +53,7 @@  static int madvise_need_mmap_write(int behavior)
 	case MADV_COLD:
 	case MADV_PAGEOUT:
 	case MADV_FREE:
+	case MADV_POPULATE:
 		return 0;
 	default:
 		/* be safe, default to 1. list exceptions explicitly */
@@ -821,6 +822,72 @@  static long madvise_dontneed_free(struct vm_area_struct *vma,
 		return -EINVAL;
 }
 
+static long madvise_populate(struct vm_area_struct *vma,
+			     struct vm_area_struct **prev,
+			     unsigned long start, unsigned long end)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long tmp_end;
+	int locked = 1;
+	long pages;
+
+	*prev = vma;
+
+	while (start < end) {
+		/*
+		 * We might have temporarily dropped the lock. For example,
+		 * our VMA might have been split.
+		 */
+		if (!vma || start >= vma->vm_end) {
+			vma = find_vma(mm, start);
+			if (!vma || start < vma->vm_start)
+				return -ENOMEM;
+		}
+
+		/* Bail out on incompatible VMA types. */
+		if (vma->vm_flags & (VM_IO | VM_PFNMAP) ||
+		    !vma_is_accessible(vma))
+			return -EINVAL;
+
+		/*
+		 * Populate pages and take care of VM_LOCKED: simulate user
+		 * space access.
+		 *
+		 * For private, writable mappings, trigger a write fault to
+		 * break COW (i.e., shared zeropage). For other mappings (i.e.,
+		 * read-only, shared), trigger a read fault.
+		 */
+		tmp_end = min_t(unsigned long, end, vma->vm_end);
+		pages = populate_vma_page_range(vma, start, tmp_end, &locked);
+		if (!locked) {
+			mmap_read_lock(mm);
+			locked = 1;
+			*prev = NULL;
+			vma = NULL;
+		}
+		if (pages < 0) {
+			switch (pages) {
+			case -EINTR:
+			case -ENOMEM:
+				return pages;
+			case -EHWPOISON:
+				/* Skip over any poisoned pages. */
+				start += PAGE_SIZE;
+				continue;
+			case -EBUSY:
+			case -EAGAIN:
+				continue;
+			default:
+				pr_warn_once("%s: unhandled return value: %ld\n",
+					     __func__, pages);
+				return -ENOMEM;
+			}
+		}
+		start += pages * PAGE_SIZE;
+	}
+	return 0;
+}
+
 /*
  * Application wants to free up the pages and associated backing store.
  * This is effectively punching a hole into the middle of a file.
@@ -934,6 +1001,8 @@  madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	case MADV_FREE:
 	case MADV_DONTNEED:
 		return madvise_dontneed_free(vma, prev, start, end, behavior);
+	case MADV_POPULATE:
+		return madvise_populate(vma, prev, start, end);
 	default:
 		return madvise_behavior(vma, prev, start, end, behavior);
 	}
@@ -954,6 +1023,7 @@  madvise_behavior_valid(int behavior)
 	case MADV_FREE:
 	case MADV_COLD:
 	case MADV_PAGEOUT:
+	case MADV_POPULATE:
 #ifdef CONFIG_KSM
 	case MADV_MERGEABLE:
 	case MADV_UNMERGEABLE: