[4/9] userfaultfd/shmem: support UFFDIO_CONTINUE for shmem

Message ID 20210408234327.624367-5-axelrasmussen@google.com (mailing list archive)
State New, archived
Series userfaultfd: add minor fault handling for shmem

Commit Message

Axel Rasmussen April 8, 2021, 11:43 p.m. UTC
With this change, userspace can resolve a minor fault within a
shmem-backed area with a UFFDIO_CONTINUE ioctl. The semantics for this
match those for hugetlbfs - we look up the existing page in the page
cache, and install PTEs for it.

This commit introduces a new helper: mcopy_atomic_install_ptes.

Why handle UFFDIO_CONTINUE for shmem in mm/userfaultfd.c, instead of in
shmem.c? The existing userfault implementation only relies on shmem.c
for VM_SHARED VMAs. However, minor fault handling / CONTINUE work just
fine for !VM_SHARED VMAs as well. We'd prefer to handle CONTINUE for
shmem in one place, regardless of shared/private (to reduce code
duplication).

Why add a new mcopy_atomic_install_ptes helper? A problem we have with
CONTINUE is that shmem_mcopy_atomic_pte() and mcopy_atomic_pte() are
*close* to what we want, but not exactly. We do want to set up the PTEs
in a CONTINUE operation, but we don't want to, e.g., allocate a new page,
charge it (e.g. to the shmem inode), manipulate various flags, etc. We
also have the problem stated above: shmem_mcopy_atomic_pte() and
mcopy_atomic_pte() each handle one half of the problem (shared /
private) that CONTINUE cares about. So, introduce mcontinue_atomic_pte()
to handle all of the shmem CONTINUE cases, and have it use the new
helper so it doesn't duplicate code with mcopy_atomic_pte().

In a future commit, shmem_mcopy_atomic_pte() will also be modified to
use this new helper. However, since this is a bigger refactor, it seems
most clear to do it as a separate change.

Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 mm/userfaultfd.c | 176 +++++++++++++++++++++++++++++++++++------------
 1 file changed, 131 insertions(+), 45 deletions(-)
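
As context for the semantics above, here is a minimal, hypothetical
userspace sketch (not part of this patch) of a fault-handler thread
resolving a shmem minor fault with UFFDIO_CONTINUE; the helper name and
the error handling are illustrative assumptions only:

	#include <sys/ioctl.h>
	#include <linux/userfaultfd.h>

	static int resolve_minor_fault(int uffd, unsigned long addr,
				       unsigned long page_size)
	{
		/*
		 * The page contents were already made correct, e.g. via
		 * a second, non-registered mapping of the same shmem
		 * file. Ask the kernel to install PTEs for the existing
		 * page cache page; no new page is allocated, no copy is
		 * made.
		 */
		struct uffdio_continue cont = {
			.range = { .start = addr, .len = page_size },
			.mode  = 0,	/* or UFFDIO_CONTINUE_MODE_DONTWAKE */
		};

		return ioctl(uffd, UFFDIO_CONTINUE, &cont);
	}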

Comments

Peter Xu April 12, 2021, 11:17 p.m. UTC | #1
On Thu, Apr 08, 2021 at 04:43:22PM -0700, Axel Rasmussen wrote:
> +/*
> + * Install PTEs, to map dst_addr (within dst_vma) to page.
> + *
> + * This function handles MCOPY_ATOMIC_CONTINUE (which is always file-backed),
> + * whether or not dst_vma is VM_SHARED. It also handles the more general
> + * MCOPY_ATOMIC_NORMAL case, when dst_vma is *not* VM_SHARED (it may be file
> + * backed, or not).
> + *
> + * Note that MCOPY_ATOMIC_NORMAL for a VM_SHARED dst_vma is handled by
> + * shmem_mcopy_atomic_pte instead.
> + */
> +static int mcopy_atomic_install_ptes(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> +				     struct vm_area_struct *dst_vma,
> +				     unsigned long dst_addr, struct page *page,
> +				     bool newly_allocated, bool wp_copy)
> +{
> +	int ret;
> +	pte_t _dst_pte, *dst_pte;
> +	int writable;
> +	bool vm_shared = dst_vma->vm_flags & VM_SHARED;
> +	spinlock_t *ptl;
> +	struct inode *inode;
> +	pgoff_t offset, max_off;
> +
> +	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
> +	writable = dst_vma->vm_flags & VM_WRITE;
> +	/* For private, non-anon we need CoW (don't write to page cache!) */
> +	if (!vma_is_anonymous(dst_vma) && !vm_shared)
> +		writable = 0;
> +
> +	if (writable || vma_is_anonymous(dst_vma))
> +		_dst_pte = pte_mkdirty(_dst_pte);
> +	if (writable) {
> +		if (wp_copy)
> +			_dst_pte = pte_mkuffd_wp(_dst_pte);
> +		else
> +			_dst_pte = pte_mkwrite(_dst_pte);
> +	} else if (vm_shared) {
> +		/*
> +		 * Since we didn't pte_mkdirty(), mark the page dirty or it
> +		 * could be freed from under us. We could do this
> +		 * unconditionally, but doing it only if !writable is faster.
> +		 */
> +		set_page_dirty(page);
> +	}
> +
> +	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
> +
> +	if (vma_is_shmem(dst_vma)) {
> +		/* The shmem MAP_PRIVATE case requires checking the i_size */

When you start to use this function in the last patch, won't it be needed even
for MAP_SHARED?

How about directly stating the reason for doing this ("serialize against
truncate with the PT lock") instead of commenting on "who will need it"?

> +		inode = dst_vma->vm_file->f_inode;
> +		offset = linear_page_index(dst_vma, dst_addr);
> +		max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
> +		ret = -EFAULT;
> +		if (unlikely(offset >= max_off))
> +			goto out_unlock;
> +	}

[...]

> +/* Handles UFFDIO_CONTINUE for all shmem VMAs (shared or private). */
> +static int mcontinue_atomic_pte(struct mm_struct *dst_mm,
> +				pmd_t *dst_pmd,
> +				struct vm_area_struct *dst_vma,
> +				unsigned long dst_addr,
> +				bool wp_copy)
> +{
> +	struct inode *inode = file_inode(dst_vma->vm_file);
> +	pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
> +	struct page *page;
> +	int ret;
> +
> +	ret = shmem_getpage(inode, pgoff, &page, SGP_READ);

SGP_READ looks right, as we don't want page allocation.  However, I noticed
there's a very slight difference when the page was just fallocated:

	/* fallocated page? */
	if (page && !PageUptodate(page)) {
		if (sgp != SGP_READ)
			goto clear;
		unlock_page(page);
		put_page(page);
		page = NULL;
		hindex = index;
	}

I think it won't happen in your case since the page should be uptodate already
(the other thread should check and modify the page before CONTINUE), but I'll
still raise it: if the page was just fallocated, it smells better to still
install the fallocated page (do we need to clear the page and SetPageUptodate()?).
Axel Rasmussen April 13, 2021, 4:40 a.m. UTC | #2
On Mon, Apr 12, 2021 at 4:17 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Apr 08, 2021 at 04:43:22PM -0700, Axel Rasmussen wrote:
> > +/*
> > + * Install PTEs, to map dst_addr (within dst_vma) to page.
> > + *
> > + * This function handles MCOPY_ATOMIC_CONTINUE (which is always file-backed),
> > + * whether or not dst_vma is VM_SHARED. It also handles the more general
> > + * MCOPY_ATOMIC_NORMAL case, when dst_vma is *not* VM_SHARED (it may be file
> > + * backed, or not).
> > + *
> > + * Note that MCOPY_ATOMIC_NORMAL for a VM_SHARED dst_vma is handled by
> > + * shmem_mcopy_atomic_pte instead.
> > + */
> > +static int mcopy_atomic_install_ptes(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> > +                                  struct vm_area_struct *dst_vma,
> > +                                  unsigned long dst_addr, struct page *page,
> > +                                  bool newly_allocated, bool wp_copy)
> > +{
> > +     int ret;
> > +     pte_t _dst_pte, *dst_pte;
> > +     int writable;
> > +     bool vm_shared = dst_vma->vm_flags & VM_SHARED;
> > +     spinlock_t *ptl;
> > +     struct inode *inode;
> > +     pgoff_t offset, max_off;
> > +
> > +     _dst_pte = mk_pte(page, dst_vma->vm_page_prot);
> > +     writable = dst_vma->vm_flags & VM_WRITE;
> > +     /* For private, non-anon we need CoW (don't write to page cache!) */
> > +     if (!vma_is_anonymous(dst_vma) && !vm_shared)
> > +             writable = 0;
> > +
> > +     if (writable || vma_is_anonymous(dst_vma))
> > +             _dst_pte = pte_mkdirty(_dst_pte);
> > +     if (writable) {
> > +             if (wp_copy)
> > +                     _dst_pte = pte_mkuffd_wp(_dst_pte);
> > +             else
> > +                     _dst_pte = pte_mkwrite(_dst_pte);
> > +     } else if (vm_shared) {
> > +             /*
> > +              * Since we didn't pte_mkdirty(), mark the page dirty or it
> > +              * could be freed from under us. We could do this
> > +              * unconditionally, but doing it only if !writable is faster.
> > +              */
> > +             set_page_dirty(page);
> > +     }
> > +
> > +     dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
> > +
> > +     if (vma_is_shmem(dst_vma)) {
> > +             /* The shmem MAP_PRIVATE case requires checking the i_size */
>
> When you start to use this function in the last patch it'll be needed too even
> if MAP_SHARED?
>
> How about directly state the reason of doing this ("serialize against truncate
> with the PT lock") instead of commenting about "who will need it"?
>
> > +             inode = dst_vma->vm_file->f_inode;
> > +             offset = linear_page_index(dst_vma, dst_addr);
> > +             max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
> > +             ret = -EFAULT;
> > +             if (unlikely(offset >= max_off))
> > +                     goto out_unlock;
> > +     }
>
> [...]
>
> > +/* Handles UFFDIO_CONTINUE for all shmem VMAs (shared or private). */
> > +static int mcontinue_atomic_pte(struct mm_struct *dst_mm,
> > +                             pmd_t *dst_pmd,
> > +                             struct vm_area_struct *dst_vma,
> > +                             unsigned long dst_addr,
> > +                             bool wp_copy)
> > +{
> > +     struct inode *inode = file_inode(dst_vma->vm_file);
> > +     pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
> > +     struct page *page;
> > +     int ret;
> > +
> > +     ret = shmem_getpage(inode, pgoff, &page, SGP_READ);
>
> SGP_READ looks right, as we don't want page allocation.  However I noticed
> there's very slight difference when the page was just fallocated:
>
>         /* fallocated page? */
>         if (page && !PageUptodate(page)) {
>                 if (sgp != SGP_READ)
>                         goto clear;
>                 unlock_page(page);
>                 put_page(page);
>                 page = NULL;
>                 hindex = index;
>         }
>
> I think it won't happen for your case since the page should be uptodate already
> (the other thread should check and modify the page before CONTINUE), but still
> raise this up, since if the page was allocated it smells better to still
> install the fallocated page (do we need to clear the page and SetUptodate)?

Sorry for the somewhat rambling thought process:

My first thought is, I don't really know what PageUptodate means for
shmem pages. If I understand correctly, normally we say PageUptodate()
if the in-memory data is at least as recent as the on-disk data. But
shmem pages are entirely in memory - they are file-backed in name only,
in some sense.

fallocate() does all sorts of things, so the comment seems a bit
ambiguous to me, but the implication seems to be that we're worried
specifically about the case where the shmem page was recently allocated
with fallocate(mode=0). In that case, do we use !PageUptodate() to
denote that the page has been allocated, but its contents are
undefined?

I suppose that would make sense, as the action "goto clear;" generally
memset()-s the page to zero it, and then calls SetPageUptodate().

Okay so let's say the following sequence of events happens:

1. Userspace calls fallocate(mode=0) to allocate some shmem pages.
2. Another thread, via a UFFD-registered mapping, manages to trigger a
minor fault on one such page, while we still have !PageUptodate().
(I'm not 100% sure this can happen, but let's say it can.)
3. UFFD handler thread gets the minor fault event, and for whatever
(buggy?) reason does nothing - it doesn't modify the page, it just
calls CONTINUE.

I think if we get to this point, zeroing the page, returning it, and
setting up the PTEs seems somewhat reasonable to me. I suppose
alternatively we could notice that this happened and return an error
to the caller? I'm hesitant to mess with the behavior of
shmem_getpage_gfp() to make such a thing happen though. I do think if
we're going to set up the PTEs instead of returning an error, we
definitely do need to clear and SetPageUptodate() the page first.

In conclusion, I think this behavior is correct.
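
(To make the sequence concrete, here is a compressed, hypothetical
sketch of steps 1-3; the memfd setup and the UFFD registration of the
mapping with UFFDIO_REGISTER_MODE_MINOR are assumed, and error handling
is elided:)

	int fd = memfd_create("shmem-test", 0);
	ftruncate(fd, page_size);
	char *map = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
	/* ... [map, map + page_size) registered with UFFDIO_REGISTER_MODE_MINOR ... */

	fallocate(fd, 0, 0, page_size);	/* 1: page allocated, still !PageUptodate */
	(void)map[0];			/* 2: minor fault on the fallocated page */
	/* 3: the handler thread issues UFFDIO_CONTINUE without touching the page */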

>
> --
> Peter Xu
>
Peter Xu April 13, 2021, 6:12 p.m. UTC | #3
On Mon, Apr 12, 2021 at 09:40:22PM -0700, Axel Rasmussen wrote:
> On Mon, Apr 12, 2021 at 4:17 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, Apr 08, 2021 at 04:43:22PM -0700, Axel Rasmussen wrote:
> > > +/*
> > > + * Install PTEs, to map dst_addr (within dst_vma) to page.
> > > + *
> > > + * This function handles MCOPY_ATOMIC_CONTINUE (which is always file-backed),
> > > + * whether or not dst_vma is VM_SHARED. It also handles the more general
> > > + * MCOPY_ATOMIC_NORMAL case, when dst_vma is *not* VM_SHARED (it may be file
> > > + * backed, or not).
> > > + *
> > > + * Note that MCOPY_ATOMIC_NORMAL for a VM_SHARED dst_vma is handled by
> > > + * shmem_mcopy_atomic_pte instead.
> > > + */
> > > +static int mcopy_atomic_install_ptes(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> > > +                                  struct vm_area_struct *dst_vma,
> > > +                                  unsigned long dst_addr, struct page *page,
> > > +                                  bool newly_allocated, bool wp_copy)
> > > +{
> > > +     int ret;
> > > +     pte_t _dst_pte, *dst_pte;
> > > +     int writable;
> > > +     bool vm_shared = dst_vma->vm_flags & VM_SHARED;
> > > +     spinlock_t *ptl;
> > > +     struct inode *inode;
> > > +     pgoff_t offset, max_off;
> > > +
> > > +     _dst_pte = mk_pte(page, dst_vma->vm_page_prot);
> > > +     writable = dst_vma->vm_flags & VM_WRITE;
> > > +     /* For private, non-anon we need CoW (don't write to page cache!) */
> > > +     if (!vma_is_anonymous(dst_vma) && !vm_shared)
> > > +             writable = 0;
> > > +
> > > +     if (writable || vma_is_anonymous(dst_vma))
> > > +             _dst_pte = pte_mkdirty(_dst_pte);
> > > +     if (writable) {
> > > +             if (wp_copy)
> > > +                     _dst_pte = pte_mkuffd_wp(_dst_pte);
> > > +             else
> > > +                     _dst_pte = pte_mkwrite(_dst_pte);
> > > +     } else if (vm_shared) {
> > > +             /*
> > > +              * Since we didn't pte_mkdirty(), mark the page dirty or it
> > > +              * could be freed from under us. We could do this
> > > +              * unconditionally, but doing it only if !writable is faster.
> > > +              */
> > > +             set_page_dirty(page);
> > > +     }
> > > +
> > > +     dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
> > > +
> > > +     if (vma_is_shmem(dst_vma)) {
> > > +             /* The shmem MAP_PRIVATE case requires checking the i_size */
> >
> > When you start to use this function in the last patch it'll be needed too even
> > if MAP_SHARED?
> >
> > How about directly state the reason of doing this ("serialize against truncate
> > with the PT lock") instead of commenting about "who will need it"?
> >
> > > +             inode = dst_vma->vm_file->f_inode;
> > > +             offset = linear_page_index(dst_vma, dst_addr);
> > > +             max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
> > > +             ret = -EFAULT;
> > > +             if (unlikely(offset >= max_off))
> > > +                     goto out_unlock;
> > > +     }
> >
> > [...]
> >
> > > +/* Handles UFFDIO_CONTINUE for all shmem VMAs (shared or private). */
> > > +static int mcontinue_atomic_pte(struct mm_struct *dst_mm,
> > > +                             pmd_t *dst_pmd,
> > > +                             struct vm_area_struct *dst_vma,
> > > +                             unsigned long dst_addr,
> > > +                             bool wp_copy)
> > > +{
> > > +     struct inode *inode = file_inode(dst_vma->vm_file);
> > > +     pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
> > > +     struct page *page;
> > > +     int ret;
> > > +
> > > +     ret = shmem_getpage(inode, pgoff, &page, SGP_READ);
> >
> > SGP_READ looks right, as we don't want page allocation.  However I noticed
> > there's very slight difference when the page was just fallocated:
> >
> >         /* fallocated page? */
> >         if (page && !PageUptodate(page)) {
> >                 if (sgp != SGP_READ)
> >                         goto clear;

[1]

> >                 unlock_page(page);
> >                 put_page(page);
> >                 page = NULL;
> >                 hindex = index;
> >         }
> >
> > I think it won't happen for your case since the page should be uptodate already
> > (the other thread should check and modify the page before CONTINUE), but still
> > raise this up, since if the page was allocated it smells better to still
> > install the fallocated page (do we need to clear the page and SetUptodate)?
> 
> Sorry for the somewhat rambling thought process:
> 
> My first thought is, I don't really know what PageUptodate means for
> shmem pages. If I understand correctly, normally we say PageUptodate()
> if the in memory data is more recent or equivalent to the on-disk
> data. But, shmem pages are entirely in memory - they are file backed
> in name only, in some sense.
> 
> fallocate() does all sorts of things so the comment to me seems a bit
> ambiguous, but it seems the implication is that we're worried
> specifically about the case where the shmem page was recently
> allocated with fallocate(mode=0)? In that case, do we use
> !PageUptodate() to denote that the page has been allocated, but its
> contents are undefined?
> 
> I suppose that would make sense, as the action "goto clear;" generally
> memset()-s the page to zero it, and then calls SetPageUptodate().
> 
> Okay so let's say the following sequence of events happens:
> 
> 1. Userspace calls fallocate(mode=0) to allocate some shmem pages.
> 2. Another thread, via a UFFD-registered mapping, manages to trigger a
> minor fault on one such page, while we still have !PageUptodate().
> (I'm not 100% sure this can happen, but let's say it can.)
> 3. UFFD handler thread gets the minor fault event, and for whatever
> (buggy?) reason does nothing - it doesn't modify the page, it just
> calls CONTINUE.

[2]

> 
> I think if we get to this point, zeroing the page, returning it, and
> setting up the PTEs seems somewhat reasonable to me. I suppose
> alternatively we could notice that this happened and return an error
> to the caller? I'm hesitant to mess with the behavior of
> shmem_getpage_gfp() to make such a thing happen though. I do think if
> we're going to set up the PTEs instead of returning an error, we
> definitely do need to clear and SetPageUptodate() the page first.
> 
> In conclusion, I think this behavior is correct.

I agree with you (mostly :), except for one thing: you passed in SGP_READ, so
IMHO it won't do what you explained (see [1] above: "goto clear" happens only
when "sgp != SGP_READ"); instead, I think it'll reset the page pointer to
NULL.  Then, shortly after, in the latter block:

	if (page || sgp == SGP_READ)
		goto out;

So I think in the end shmem_getpage_gfp(SGP_READ) will return NULL.

I do have some confusion here regarding SGP_READ, since the comment in
shmem_fs.h says:

	SGP_READ,	/* don't exceed i_size, don't allocate page */

It's natural to read it as "return the fallocated page" in this case.  However,
that seems not to be the case?  My gut feeling is that the comment for SGP_READ
needs a touch-up, to state that for newly fallocated (and not yet used) pages
it'll return NULL even on a cache hit.

So I think you're right that in all cases this may be a trivial case.  However,
I've learned a lesson elsewhere that we should never overlook zero pages, and
that is also relevant here - although the fallocated page is still !Uptodate,
so the page clearing happens even later, from the userspace POV the user could
assume it's a zero page even if the page is never accessed at all (since any
access will trigger the clear).  The user program could then avoid modifying
this page if it somehow knows it remains a zero page (e.g., via a zero-page
bitmap?).  So your example [2] above does seem a valid one worth thinking
about; at least it's not fully paranoid.
Axel Rasmussen April 15, 2021, 6:18 p.m. UTC | #4
On Tue, Apr 13, 2021 at 11:12 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Apr 12, 2021 at 09:40:22PM -0700, Axel Rasmussen wrote:
> > On Mon, Apr 12, 2021 at 4:17 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Thu, Apr 08, 2021 at 04:43:22PM -0700, Axel Rasmussen wrote:
> > > > +/*
> > > > + * Install PTEs, to map dst_addr (within dst_vma) to page.
> > > > + *
> > > > + * This function handles MCOPY_ATOMIC_CONTINUE (which is always file-backed),
> > > > + * whether or not dst_vma is VM_SHARED. It also handles the more general
> > > > + * MCOPY_ATOMIC_NORMAL case, when dst_vma is *not* VM_SHARED (it may be file
> > > > + * backed, or not).
> > > > + *
> > > > + * Note that MCOPY_ATOMIC_NORMAL for a VM_SHARED dst_vma is handled by
> > > > + * shmem_mcopy_atomic_pte instead.
> > > > + */
> > > > +static int mcopy_atomic_install_ptes(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> > > > +                                  struct vm_area_struct *dst_vma,
> > > > +                                  unsigned long dst_addr, struct page *page,
> > > > +                                  bool newly_allocated, bool wp_copy)
> > > > +{
> > > > +     int ret;
> > > > +     pte_t _dst_pte, *dst_pte;
> > > > +     int writable;
> > > > +     bool vm_shared = dst_vma->vm_flags & VM_SHARED;
> > > > +     spinlock_t *ptl;
> > > > +     struct inode *inode;
> > > > +     pgoff_t offset, max_off;
> > > > +
> > > > +     _dst_pte = mk_pte(page, dst_vma->vm_page_prot);
> > > > +     writable = dst_vma->vm_flags & VM_WRITE;
> > > > +     /* For private, non-anon we need CoW (don't write to page cache!) */
> > > > +     if (!vma_is_anonymous(dst_vma) && !vm_shared)
> > > > +             writable = 0;
> > > > +
> > > > +     if (writable || vma_is_anonymous(dst_vma))
> > > > +             _dst_pte = pte_mkdirty(_dst_pte);
> > > > +     if (writable) {
> > > > +             if (wp_copy)
> > > > +                     _dst_pte = pte_mkuffd_wp(_dst_pte);
> > > > +             else
> > > > +                     _dst_pte = pte_mkwrite(_dst_pte);
> > > > +     } else if (vm_shared) {
> > > > +             /*
> > > > +              * Since we didn't pte_mkdirty(), mark the page dirty or it
> > > > +              * could be freed from under us. We could do this
> > > > +              * unconditionally, but doing it only if !writable is faster.
> > > > +              */
> > > > +             set_page_dirty(page);
> > > > +     }
> > > > +
> > > > +     dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
> > > > +
> > > > +     if (vma_is_shmem(dst_vma)) {
> > > > +             /* The shmem MAP_PRIVATE case requires checking the i_size */
> > >
> > > When you start to use this function in the last patch it'll be needed too even
> > > if MAP_SHARED?
> > >
> > > How about directly state the reason of doing this ("serialize against truncate
> > > with the PT lock") instead of commenting about "who will need it"?
> > >
> > > > +             inode = dst_vma->vm_file->f_inode;
> > > > +             offset = linear_page_index(dst_vma, dst_addr);
> > > > +             max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
> > > > +             ret = -EFAULT;
> > > > +             if (unlikely(offset >= max_off))
> > > > +                     goto out_unlock;
> > > > +     }
> > >
> > > [...]
> > >
> > > > +/* Handles UFFDIO_CONTINUE for all shmem VMAs (shared or private). */
> > > > +static int mcontinue_atomic_pte(struct mm_struct *dst_mm,
> > > > +                             pmd_t *dst_pmd,
> > > > +                             struct vm_area_struct *dst_vma,
> > > > +                             unsigned long dst_addr,
> > > > +                             bool wp_copy)
> > > > +{
> > > > +     struct inode *inode = file_inode(dst_vma->vm_file);
> > > > +     pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
> > > > +     struct page *page;
> > > > +     int ret;
> > > > +
> > > > +     ret = shmem_getpage(inode, pgoff, &page, SGP_READ);
> > >
> > > SGP_READ looks right, as we don't want page allocation.  However I noticed
> > > there's very slight difference when the page was just fallocated:
> > >
> > >         /* fallocated page? */
> > >         if (page && !PageUptodate(page)) {
> > >                 if (sgp != SGP_READ)
> > >                         goto clear;
>
> [1]
>
> > >                 unlock_page(page);
> > >                 put_page(page);
> > >                 page = NULL;
> > >                 hindex = index;
> > >         }
> > >
> > > I think it won't happen for your case since the page should be uptodate already
> > > (the other thread should check and modify the page before CONTINUE), but still
> > > raise this up, since if the page was allocated it smells better to still
> > > install the fallocated page (do we need to clear the page and SetUptodate)?
> >
> > Sorry for the somewhat rambling thought process:
> >
> > My first thought is, I don't really know what PageUptodate means for
> > shmem pages. If I understand correctly, normally we say PageUptodate()
> > if the in memory data is more recent or equivalent to the on-disk
> > data. But, shmem pages are entirely in memory - they are file backed
> > in name only, in some sense.
> >
> > fallocate() does all sorts of things so the comment to me seems a bit
> > ambiguous, but it seems the implication is that we're worried
> > specifically about the case where the shmem page was recently
> > allocated with fallocate(mode=0)? In that case, do we use
> > !PageUptodate() to denote that the page has been allocated, but its
> > contents are undefined?
> >
> > I suppose that would make sense, as the action "goto clear;" generally
> > memset()-s the page to zero it, and then calls SetPageUptodate().
> >
> > Okay so let's say the following sequence of events happens:
> >
> > 1. Userspace calls fallocate(mode=0) to allocate some shmem pages.
> > 2. Another thread, via a UFFD-registered mapping, manages to trigger a
> > minor fault on one such page, while we still have !PageUptodate().
> > (I'm not 100% sure this can happen, but let's say it can.)
> > 3. UFFD handler thread gets the minor fault event, and for whatever
> > (buggy?) reason does nothing - it doesn't modify the page, it just
> > calls CONTINUE.
>
> [2]
>
> >
> > I think if we get to this point, zeroing the page, returning it, and
> > setting up the PTEs seems somewhat reasonable to me. I suppose
> > alternatively we could notice that this happened and return an error
> > to the caller? I'm hesitant to mess with the behavior of
> > shmem_getpage_gfp() to make such a thing happen though. I do think if
> > we're going to set up the PTEs instead of returning an error, we
> > definitely do need to clear and SetPageUptodate() the page first.
> >
> > In conclusion, I think this behavior is correct.
>
> I agree with you (mostly :), but except one thing: you passed in SGP_READ, so
> IMHO it won't do what you explained (see [1] above: "goto clear" is with "sgp
> != SGP_READ" only); instead of doing what you said, I think it'll reset page
> pointer to NULL..  Then quickly in the latter block:
>
>         if (page || sgp == SGP_READ)
>                 goto out;
>
> So I think at last shmem_getpage_gfp(SGP_READ) will return NULL.

Ah, indeed, I mistakenly read that as "sgp != SGP_READ".

>
> I do think I've got some confusion here regarding SGP_READ, since from the
> comment in shmem_fs.h it says:
>
>         SGP_READ,       /* don't exceed i_size, don't allocate page */
>
> It's natural to think it as "return the fallocated page" in this case.  However
> it seems not the case?  My gut feeling is the comment for SGP_READ needs a
> touch up, so as to state that for newly fallocated (and not used) pages it'll
> return NULL even if cache hit.
>
> So I think you're right, for all cases this may be a trivial case.  However
> I've got a lesson somewhere else that we should never overlook zero pages,
> which is also related to this case - although fallocated page is still
> !Uptodate so clear page happens even latter, however from userspace pov, the
> user could assume it's a zero page even if the page is not accessed at all
> (since any access will cause clear page).  Then the user program could avoid
> modifying this page if it knows this page keeps to be zero page somehow (e.g.,
> a zero page bitmap?). Then your example above [2] seems indeed a valid one
> worth thinking, at least not fully paranoid.

I'm kind of left thinking along the same lines though.

Modifying shmem_getpage_gfp to behave differently in this case seems
likely to be a can of worms.

With the current behavior, at least it doesn't seem too problematic -
we're not going to hand userspace an unzeroed page or anything. At
worst, it's a bit of a rough edge userspace might hit: the UFFD handler
thread will just get an error when it tries to CONTINUE, and could
recover by, e.g., writing to the page and trying again.

I'm inclined to leave it as is, and maybe look into a way to file down
the rough edge in a future patch, as well as some of the other
cleanups we've discussed elsewhere. Objections?
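
(For illustration, a hypothetical recovery path in the handler -
assuming the failed CONTINUE surfaces as EFAULT, and that "alias" is a
second, non-registered mapping of the same shmem file:)

	if (ioctl(uffd, UFFDIO_CONTINUE, &cont) < 0 && errno == EFAULT) {
		/*
		 * Writing through the non-registered alias makes the
		 * fallocated page uptodate (shmem clears it first), so
		 * the retried CONTINUE can now find and install it.
		 */
		memset(alias + offset, 0, page_size);
		ioctl(uffd, UFFDIO_CONTINUE, &cont);
	}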

>
> --
> Peter Xu
>

Patch

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 60ae22207761..a539fe18b9a7 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -48,6 +48,87 @@  struct vm_area_struct *find_dst_vma(struct mm_struct *dst_mm,
 	return dst_vma;
 }
 
+/*
+ * Install PTEs, to map dst_addr (within dst_vma) to page.
+ *
+ * This function handles MCOPY_ATOMIC_CONTINUE (which is always file-backed),
+ * whether or not dst_vma is VM_SHARED. It also handles the more general
+ * MCOPY_ATOMIC_NORMAL case, when dst_vma is *not* VM_SHARED (it may be file
+ * backed, or not).
+ *
+ * Note that MCOPY_ATOMIC_NORMAL for a VM_SHARED dst_vma is handled by
+ * shmem_mcopy_atomic_pte instead.
+ */
+static int mcopy_atomic_install_ptes(struct mm_struct *dst_mm, pmd_t *dst_pmd,
+				     struct vm_area_struct *dst_vma,
+				     unsigned long dst_addr, struct page *page,
+				     bool newly_allocated, bool wp_copy)
+{
+	int ret;
+	pte_t _dst_pte, *dst_pte;
+	int writable;
+	bool vm_shared = dst_vma->vm_flags & VM_SHARED;
+	spinlock_t *ptl;
+	struct inode *inode;
+	pgoff_t offset, max_off;
+
+	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
+	writable = dst_vma->vm_flags & VM_WRITE;
+	/* For private, non-anon we need CoW (don't write to page cache!) */
+	if (!vma_is_anonymous(dst_vma) && !vm_shared)
+		writable = 0;
+
+	if (writable || vma_is_anonymous(dst_vma))
+		_dst_pte = pte_mkdirty(_dst_pte);
+	if (writable) {
+		if (wp_copy)
+			_dst_pte = pte_mkuffd_wp(_dst_pte);
+		else
+			_dst_pte = pte_mkwrite(_dst_pte);
+	} else if (vm_shared) {
+		/*
+		 * Since we didn't pte_mkdirty(), mark the page dirty or it
+		 * could be freed from under us. We could do this
+		 * unconditionally, but doing it only if !writable is faster.
+		 */
+		set_page_dirty(page);
+	}
+
+	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
+
+	if (vma_is_shmem(dst_vma)) {
+		/* The shmem MAP_PRIVATE case requires checking the i_size */
+		inode = dst_vma->vm_file->f_inode;
+		offset = linear_page_index(dst_vma, dst_addr);
+		max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
+		ret = -EFAULT;
+		if (unlikely(offset >= max_off))
+			goto out_unlock;
+	}
+
+	ret = -EEXIST;
+	if (!pte_none(*dst_pte))
+		goto out_unlock;
+
+	inc_mm_counter(dst_mm, mm_counter(page));
+	if (vma_is_shmem(dst_vma))
+		page_add_file_rmap(page, false);
+	else
+		page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
+
+	if (newly_allocated)
+		lru_cache_add_inactive_or_unevictable(page, dst_vma);
+
+	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+
+	/* No need to invalidate - it was non-present before */
+	update_mmu_cache(dst_vma, dst_addr, dst_pte);
+	ret = 0;
+out_unlock:
+	pte_unmap_unlock(dst_pte, ptl);
+	return ret;
+}
+
 static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 			    pmd_t *dst_pmd,
 			    struct vm_area_struct *dst_vma,
@@ -56,13 +137,9 @@  static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 			    struct page **pagep,
 			    bool wp_copy)
 {
-	pte_t _dst_pte, *dst_pte;
-	spinlock_t *ptl;
 	void *page_kaddr;
 	int ret;
 	struct page *page;
-	pgoff_t offset, max_off;
-	struct inode *inode;
 
 	if (!*pagep) {
 		ret = -ENOMEM;
@@ -99,43 +176,12 @@  static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 	if (mem_cgroup_charge(page, dst_mm, GFP_KERNEL))
 		goto out_release;
 
-	_dst_pte = pte_mkdirty(mk_pte(page, dst_vma->vm_page_prot));
-	if (dst_vma->vm_flags & VM_WRITE) {
-		if (wp_copy)
-			_dst_pte = pte_mkuffd_wp(_dst_pte);
-		else
-			_dst_pte = pte_mkwrite(_dst_pte);
-	}
-
-	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
-	if (dst_vma->vm_file) {
-		/* the shmem MAP_PRIVATE case requires checking the i_size */
-		inode = dst_vma->vm_file->f_inode;
-		offset = linear_page_index(dst_vma, dst_addr);
-		max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
-		ret = -EFAULT;
-		if (unlikely(offset >= max_off))
-			goto out_release_uncharge_unlock;
-	}
-	ret = -EEXIST;
-	if (!pte_none(*dst_pte))
-		goto out_release_uncharge_unlock;
-
-	inc_mm_counter(dst_mm, MM_ANONPAGES);
-	page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
-	lru_cache_add_inactive_or_unevictable(page, dst_vma);
-
-	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
-
-	/* No need to invalidate - it was non-present before */
-	update_mmu_cache(dst_vma, dst_addr, dst_pte);
-
-	pte_unmap_unlock(dst_pte, ptl);
-	ret = 0;
+	ret = mcopy_atomic_install_ptes(dst_mm, dst_pmd, dst_vma, dst_addr,
+					page, true, wp_copy);
+	if (ret)
+		goto out_release;
 out:
 	return ret;
-out_release_uncharge_unlock:
-	pte_unmap_unlock(dst_pte, ptl);
 out_release:
 	put_page(page);
 	goto out;
@@ -176,6 +222,41 @@  static int mfill_zeropage_pte(struct mm_struct *dst_mm,
 	return ret;
 }
 
+/* Handles UFFDIO_CONTINUE for all shmem VMAs (shared or private). */
+static int mcontinue_atomic_pte(struct mm_struct *dst_mm,
+				pmd_t *dst_pmd,
+				struct vm_area_struct *dst_vma,
+				unsigned long dst_addr,
+				bool wp_copy)
+{
+	struct inode *inode = file_inode(dst_vma->vm_file);
+	pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
+	struct page *page;
+	int ret;
+
+	ret = shmem_getpage(inode, pgoff, &page, SGP_READ);
+	if (ret)
+		goto out;
+	if (!page) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	ret = mcopy_atomic_install_ptes(dst_mm, dst_pmd, dst_vma, dst_addr,
+					page, false, wp_copy);
+	if (ret)
+		goto out_release;
+
+	unlock_page(page);
+	ret = 0;
+out:
+	return ret;
+out_release:
+	unlock_page(page);
+	put_page(page);
+	goto out;
+}
+
 static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
 {
 	pgd_t *pgd;
@@ -415,11 +496,16 @@  static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
 						unsigned long dst_addr,
 						unsigned long src_addr,
 						struct page **page,
-						bool zeropage,
+						enum mcopy_atomic_mode mode,
 						bool wp_copy)
 {
 	ssize_t err;
 
+	if (mode == MCOPY_ATOMIC_CONTINUE) {
+		return mcontinue_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
+					    wp_copy);
+	}
+
 	/*
 	 * The normal page fault path for a shmem will invoke the
 	 * fault, fill the hole in the file and COW it right away. The
@@ -431,7 +517,7 @@  static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
 	 * and not in the radix tree.
 	 */
 	if (!(dst_vma->vm_flags & VM_SHARED)) {
-		if (!zeropage)
+		if (mode == MCOPY_ATOMIC_NORMAL)
 			err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
 					       dst_addr, src_addr, page,
 					       wp_copy);
@@ -441,7 +527,8 @@  static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm,
 	} else {
 		VM_WARN_ON_ONCE(wp_copy);
 		err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
-					     dst_addr, src_addr, zeropage,
+					     dst_addr, src_addr,
+					     mode != MCOPY_ATOMIC_NORMAL,
 					     page);
 	}
 
@@ -463,7 +550,6 @@  static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 	long copied;
 	struct page *page;
 	bool wp_copy;
-	bool zeropage = (mcopy_mode == MCOPY_ATOMIC_ZEROPAGE);
 
 	/*
 	 * Sanitize the command parameters:
@@ -526,7 +612,7 @@  static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 
 	if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
 		goto out_unlock;
-	if (mcopy_mode == MCOPY_ATOMIC_CONTINUE)
+	if (!vma_is_shmem(dst_vma) && mcopy_mode == MCOPY_ATOMIC_CONTINUE)
 		goto out_unlock;
 
 	/*
@@ -574,7 +660,7 @@  static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 		BUG_ON(pmd_trans_huge(*dst_pmd));
 
 		err = mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
-				       src_addr, &page, zeropage, wp_copy);
+				       src_addr, &page, mcopy_mode, wp_copy);
 		cond_resched();
 
 		if (unlikely(err == -ENOENT)) {