[RFC,v2,2/3] mm/gup: introduce FOLL_PIN flag for get_user_pages()

Message ID 20190817022419.23304-3-jhubbard@nvidia.com (mailing list archive)
State Superseded
Series mm/gup: introduce vaddr_pin_pages_remote(), FOLL_PIN

Commit Message

John Hubbard Aug. 17, 2019, 2:24 a.m. UTC
From: John Hubbard <jhubbard@nvidia.com>
FOLL_PIN is set by callers of vaddr_pin_pages(). This is different
from FOLL_LONGTERM, because even short-term page pins need a new kind
of tracking if the pinned pages' data might be modified.

This situation is described in more detail in commit fc1d8e7cca2d
("mm: introduce put_user_page*(), placeholder versions").
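
As a rough illustration of the extra tracking described above, here is
a toy userspace model in C. It is not kernel code: struct toy_page,
toy_pin_page(), toy_put_user_page() and toy_page_maybe_pinned() are
hypothetical names invented for this sketch, and the pincount field
only stands in for the pin counting that FOLL_PIN is meant to enable.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model; this is NOT the kernel's struct page. */
struct toy_page {
	int refcount;	/* stands in for page->_refcount */
	int pincount;	/* hypothetical extra counter implied by FOLL_PIN */
};

/* Model of acquiring a page with FOLL_PIN set: take a ref and a pin. */
static void toy_pin_page(struct toy_page *page)
{
	page->refcount++;
	page->pincount++;
}

/* Model of put_user_page(): drop both the pin and the reference. */
static void toy_put_user_page(struct toy_page *page)
{
	assert(page->pincount > 0 && page->refcount > 0);
	page->pincount--;
	page->refcount--;
}

/* What a filesystem could ask: might an outside entity still be
 * modifying this page's data? */
static bool toy_page_maybe_pinned(const struct toy_page *page)
{
	return page->pincount > 0;
}
```

The point of the model is that a plain reference count cannot
distinguish "someone holds the page" from "someone may be writing to
the page's data". A separate pin count can, provided every pin is
dropped through the matching release call.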

FOLL_PIN is added now, rather than waiting until there is code that
takes action based on FOLL_PIN. That's because having FOLL_PIN in
the code helps to highlight the differences between:

    a) get_user_pages(): soon to be deprecated. Used to pin pages,
       but without awareness of file systems that might use those
       pages,

    b) The original vaddr_pin_pages(): intended only for
       FOLL_LONGTERM and DAX use cases. This assumes direct IO
       and therefore is not applicable to most of the other
       callers of get_user_pages(), and

Also add fairly extensive documentation of the meaning and use
of both FOLL_PIN and FOLL_LONGTERM.

Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases
in this documentation. (I've reworded it and expanded on it slightly.)

Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/mm.h | 56 +++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 50 insertions(+), 6 deletions(-)
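
For readers who want the four documented cases at a glance, the
following is a small standalone C sketch of the flag choice. The flag
values are copied from the patch below. The enum and the function
foll_flags_for_case() are hypothetical helpers invented for this
illustration and are not part of the patch or of any kernel API.

```c
#include <assert.h>

/* Flag values copied from the patch below. */
#define FOLL_LONGTERM	0x10000
#define FOLL_PIN	0x40000

/* Hypothetical names for the four cases in the new mm.h comment. */
enum gup_case {
	GUP_CASE_DIO = 1,		/* CASE 1: Direct IO buffers */
	GUP_CASE_RDMA,			/* CASE 2: long-term DMA buffers */
	GUP_CASE_ODP,			/* CASE 3: MMU notifiers handle sync */
	GUP_CASE_STRUCT_PAGE_ONLY,	/* CASE 4: struct page manipulation */
};

/* Return the FOLL_PIN/FOLL_LONGTERM bits the comment recommends. */
static unsigned int foll_flags_for_case(enum gup_case c)
{
	assert(c >= GUP_CASE_DIO && c <= GUP_CASE_STRUCT_PAGE_ONLY);

	switch (c) {
	case GUP_CASE_DIO:
		return FOLL_PIN;
	case GUP_CASE_RDMA:
		return FOLL_PIN | FOLL_LONGTERM;
	case GUP_CASE_ODP:
	case GUP_CASE_STRUCT_PAGE_ONLY:
	default:
		return 0;	/* normal GUP calls are sufficient */
	}
}
```

Note that the DAX variant of case 2 needs a file lease in addition to
these flags, which this sketch does not model.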

Comments

John Hubbard Aug. 17, 2019, 2:36 a.m. UTC | #1
On 8/16/19 7:24 PM, jhubbard@nvidia.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> DKIM-Signature: v a a-sha256; claxed/relaxed; d idia.com; s;
> 	t66008674; bhMai0va6k/z2enpQJ4Nfvbj5WByFxGAO1JwdIBbXio	h PGP-Universal:From:To:CC:Subject:Date:Message-ID:X-Mailer:
> 	 In-Reply-To:References:MIME-Version:X-NVConfidentiality:
> 	 Content-Transfer-Encoding:Content-Type;
> 	bÖUDSde9XF/IsNteBaYOBWeKiHhWmeU9ekUJNvCviHssBDCtw0T+M/2TlEPEzomIT
> 	 fGXzIQNlGN6MXFbaBoyBmF/zjCu02TmTNExbVJ3/5N6PTyOuJFCx9ZN1/5gXsB11m1
> 	 xAHIWE+VOZs4qqDeHDBqKZq+FaxQHNvGz0j6lyVBA70TfseNoZqZZrSil8uvaKJwKd
> 	 TQ1ht+AGWbw9p610JmaPb4u6o/eV6Ns8Sl3EVnjWWu94T6ISNIaWCiC6wQQF6L1YCH
> 	 G5Pjn+0rEjhk6XG4TyLudi5lWp3IVBHd8+WlWlnl+bvLCC55RUAjPJLn7LaVyVdh0F
> 	 nLHwm3bN2Jotg

I cannot readily explain the above email glitch, but I did just now switch
back to mailgw.nvidia.com for this patchset, in order to get the nice behavior
of having "From:" really be my native NVIDIA email address. That's very nice,
but if the glitches happen again, I'll switch back to using gmail for 
git-send-email.

Sorry about the weirdness. It does still let you apply the patch, I
just now checked on that.

thanks,

John Hubbard Aug. 19, 2019, 8:59 p.m. UTC | #2
On 8/16/19 7:36 PM, John Hubbard wrote:
> On 8/16/19 7:24 PM, jhubbard@nvidia.com wrote:
>> From: John Hubbard <jhubbard@nvidia.com>
>> DKIM-Signature: v a a-sha256; claxed/relaxed; d idia.com; s;
>> 	t66008674; bhMai0va6k/z2enpQJ4Nfvbj5WByFxGAO1JwdIBbXio	h PGP-Universal:From:To:CC:Subject:Date:Message-ID:X-Mailer:
>> 	 In-Reply-To:References:MIME-Version:X-NVConfidentiality:
>> 	 Content-Transfer-Encoding:Content-Type;
>> 	bÖUDSde9XF/IsNteBaYOBWeKiHhWmeU9ekUJNvCviHssBDCtw0T+M/2TlEPEzomIT
>> 	 fGXzIQNlGN6MXFbaBoyBmF/zjCu02TmTNExbVJ3/5N6PTyOuJFCx9ZN1/5gXsB11m1
>> 	 xAHIWE+VOZs4qqDeHDBqKZq+FaxQHNvGz0j6lyVBA70TfseNoZqZZrSil8uvaKJwKd
>> 	 TQ1ht+AGWbw9p610JmaPb4u6o/eV6Ns8Sl3EVnjWWu94T6ISNIaWCiC6wQQF6L1YCH
>> 	 G5Pjn+0rEjhk6XG4TyLudi5lWp3IVBHd8+WlWlnl+bvLCC55RUAjPJLn7LaVyVdh0F
>> 	 nLHwm3bN2Jotg
> 
> I cannot readily explain the above email glitch, but I did just now switch
> back to mailgw.nvidia.com for this patchset, in order to get the nice behavior
> of having "From:" really be my native NVIDIA email address. That's very nice,
> but if the glitches happen again, I'll switch back to using gmail for
> git-send-email.
> 
> Sorry about the weirdness. It does still let you apply the patch, I
> just now checked on that.
> 

Hi Ira, could you please let me know if you'd like me to repost this patch, or
the entire patchset, or if you're able to deal with it as-is? As it stands, the
DKIM-Signature cruft above needs to be manually removed, either from the patch, or
from the commit log after applying the patch.

Also, as noted in the email thread involving Bharath and sgi-gru [1], I'm
currently planning on branching from your tree, and continuing the misc
call site conversions from there. And then just adapting to whatever API
changes are made to vaddr_*() functions. And the biovec call site conversions should
be based on that as well.

[1] https://lore.kernel.org/r/0c2ad29b-934c-ec30-66c3-b153baf1fba5@nvidia.com

thanks,

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bc675e94ddf8..6e7de424bf5e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2644,6 +2644,8 @@  static inline vm_fault_t vmf_error(int err)
 struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 			 unsigned int foll_flags);
 
+/* Flags for follow_page(), get_user_pages ("GUP"), and vaddr_pin_pages(): */
+
 #define FOLL_WRITE	0x01	/* check pte is writable */
 #define FOLL_TOUCH	0x02	/* mark page accessed */
 #define FOLL_GET	0x04	/* do get_page on page */
@@ -2663,13 +2665,15 @@  struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_ANON	0x8000	/* don't do file mappings */
 #define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
 #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
+#define FOLL_PIN	0x40000	/* pages must be released via put_user_page() */
 
 /*
- * NOTE on FOLL_LONGTERM:
+ * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
+ * other. Here is what they mean, and how to use them:
  *
  * FOLL_LONGTERM indicates that the page will be held for an indefinite time
- * period _often_ under userspace control.  This is contrasted with
- * iov_iter_get_pages() where usages which are transient.
+ * period _often_ under userspace control.  This is in contrast to
+ * iov_iter_get_pages(), where usages are transient.
  *
  * FIXME: For pages which are part of a filesystem, mappings are subject to the
  * lifetime enforced by the filesystem and we need guarantees that longterm
@@ -2684,11 +2688,51 @@  struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
  * Currently only get_user_pages() and get_user_pages_fast() support this flag
  * and calls to get_user_pages_[un]locked are specifically not allowed.  This
  * is due to an incompatibility with the FS DAX check and
- * FAULT_FLAG_ALLOW_RETRY
+ * FAULT_FLAG_ALLOW_RETRY.
  *
- * In the CMA case: longterm pins in a CMA region would unnecessarily fragment
- * that region.  And so CMA attempts to migrate the page before pinning when
+ * In the CMA case: long term pins in a CMA region would unnecessarily fragment
+ * that region.  And so, CMA attempts to migrate the page before pinning, when
  * FOLL_LONGTERM is specified.
+ *
+ * FOLL_PIN indicates that a special kind of tracking (not just page->_refcount,
+ * but an additional pin counting system) will be invoked. This is intended for
+ * anything that gets a page reference and then touches page data (for example,
+ * Direct IO). This lets the filesystem know that some non-file-system entity is
+ * potentially changing the pages' data. FOLL_PIN pages must be released,
+ * ultimately, by a call to put_user_page(). Typically that will be via one of
+ * the vaddr_unpin_pages() variants.
+ *
+ * FIXME: note that this special tracking is not in place yet. However, the
+ * pages should still be released by put_user_page().
+ *
+ * When and where to use each flag:
+ *
+ * CASE 1: Direct IO (DIO). There are GUP references to pages that are serving
+ * as DIO buffers. These buffers are needed for a relatively short time (so they
+ * are not "long term"). No special synchronization with page_mkclean() or
+ * munmap() is provided. Therefore, flags to set at the call site are:
+ *
+ *     FOLL_PIN
+ *
+ * CASE 2: RDMA. There are GUP references to pages that are serving as DMA
+ * buffers. These buffers are needed for a long time ("long term"). No special
+ * synchronization with page_mkclean() or munmap() is provided. Therefore, flags
+ * to set at the call site are:
+ *
+ *     FOLL_PIN | FOLL_LONGTERM
+ *
+ * There is also a special case when the pages are DAX pages: in addition to the
+ * above flags, the caller needs a file lease. This is provided via the struct
+ * vaddr_pin argument to vaddr_pin_pages().
+ *
+ * CASE 3: ODP (Mellanox/Infiniband On Demand Paging: the hardware supports
+ * replayable page faulting). There are GUP references to pages serving as DMA
+ * buffers. For ODP, MMU notifiers are used to synchronize with page_mkclean()
+ * and munmap(). Therefore, normal GUP calls are sufficient, so neither flag
+ * needs to be set.
+ *
+ * CASE 4: pinning for struct page manipulation only. Here, normal GUP calls are
+ * sufficient, so neither flag needs to be set.
  */
 
 static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)