diff mbox

[v3,1/4] mm: introduce get_user_pages_longterm

Message ID 151197873499.26211.11687422577653326365.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Dan Williams Nov. 29, 2017, 6:05 p.m. UTC
Until there is a solution to the dma-to-dax vs truncate problem it is
not safe to allow long standing memory registrations against
filesytem-dax vmas. Device-dax vmas do not have this problem and are
explicitly allowed.

This is temporary until a "memory registration with layout-lease"
mechanism can be implemented for the affected sub-systems (RDMA and
V4L2).

Cc: <stable@vger.kernel.org>
Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/fs.h |   14 +++++++++++
 include/linux/mm.h |   13 +++++++++++
 mm/gup.c           |   64 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 91 insertions(+)

Comments

Michal Hocko Nov. 30, 2017, 9:53 a.m. UTC | #1
On Wed 29-11-17 10:05:35, Dan Williams wrote:
> Until there is a solution to the dma-to-dax vs truncate problem it is
> not safe to allow long standing memory registrations against
> filesytem-dax vmas. Device-dax vmas do not have this problem and are
> explicitly allowed.
> 
> This is temporary until a "memory registration with layout-lease"
> mechanism can be implemented for the affected sub-systems (RDMA and
> V4L2).

One thing is not clear to me. Who is allowed to pin pages for ever?
Is it possible to pin LRU pages that way as well? If yes then there
absolutely has to be a limit for that. Sorry I could have studied the
code much more but from a quick glance it seems to me that this is not
limited to dax (or non-LRU in general) pages.
Dan Williams Nov. 30, 2017, 4:39 p.m. UTC | #2
On Thu, Nov 30, 2017 at 1:53 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Wed 29-11-17 10:05:35, Dan Williams wrote:
>> Until there is a solution to the dma-to-dax vs truncate problem it is
>> not safe to allow long standing memory registrations against
>> filesytem-dax vmas. Device-dax vmas do not have this problem and are
>> explicitly allowed.
>>
>> This is temporary until a "memory registration with layout-lease"
>> mechanism can be implemented for the affected sub-systems (RDMA and
>> V4L2).
>
> One thing is not clear to me. Who is allowed to pin pages for ever?
> Is it possible to pin LRU pages that way as well? If yes then there
> absolutely has to be a limit for that. Sorry I could have studied the
> code much more but from a quick glance it seems to me that this is not
> limited to dax (or non-LRU in general) pages.

I would turn this question around. "who can not tolerate a page being
pinned forever?". In the case of filesytem-dax a page is
one-in-the-same object as a filesystem-block, and a filesystem expects
that its operations will not be blocked indefinitely. LRU pages can
continue to be pinned indefinitely because operations can continue
around the pinned page, i.e. every agent, save for the dma agent,
drops their reference to the page and its tolerable that the final
put_page() never arrives. As far as I can tell it's only filesystems
and dax that have this collision of wanting to revoke dma access to a
page combined with not being able to wait indefinitely for dma to
quiesce.
Michal Hocko Nov. 30, 2017, 5:42 p.m. UTC | #3
On Thu 30-11-17 08:39:51, Dan Williams wrote:
> On Thu, Nov 30, 2017 at 1:53 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > On Wed 29-11-17 10:05:35, Dan Williams wrote:
> >> Until there is a solution to the dma-to-dax vs truncate problem it is
> >> not safe to allow long standing memory registrations against
> >> filesytem-dax vmas. Device-dax vmas do not have this problem and are
> >> explicitly allowed.
> >>
> >> This is temporary until a "memory registration with layout-lease"
> >> mechanism can be implemented for the affected sub-systems (RDMA and
> >> V4L2).
> >
> > One thing is not clear to me. Who is allowed to pin pages for ever?
> > Is it possible to pin LRU pages that way as well? If yes then there
> > absolutely has to be a limit for that. Sorry I could have studied the
> > code much more but from a quick glance it seems to me that this is not
> > limited to dax (or non-LRU in general) pages.
> 
> I would turn this question around. "who can not tolerate a page being
> pinned forever?".

Any struct page on the movable zone or anything that is living on the
LRU list because such a memory is unreclaimable.

> In the case of filesytem-dax a page is
> one-in-the-same object as a filesystem-block, and a filesystem expects
> that its operations will not be blocked indefinitely. LRU pages can
> continue to be pinned indefinitely because operations can continue
> around the pinned page, i.e. every agent, save for the dma agent,
> drops their reference to the page and its tolerable that the final
> put_page() never arrives.

I do not understand. Are you saying that a user triggered IO can pin LRU
pages indefinitely. This would be _really_ wrong. It would be basically
an mlock without any limit. So I must be misreading you here

> As far as I can tell it's only filesystems
> and dax that have this collision of wanting to revoke dma access to a
> page combined with not being able to wait indefinitely for dma to
> quiesce.
Dan Williams Nov. 30, 2017, 6:03 p.m. UTC | #4
On Thu, Nov 30, 2017 at 9:42 AM, Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 30-11-17 08:39:51, Dan Williams wrote:
> > On Thu, Nov 30, 2017 at 1:53 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > > On Wed 29-11-17 10:05:35, Dan Williams wrote:
> > >> Until there is a solution to the dma-to-dax vs truncate problem it is
> > >> not safe to allow long standing memory registrations against
> > >> filesytem-dax vmas. Device-dax vmas do not have this problem and are
> > >> explicitly allowed.
> > >>
> > >> This is temporary until a "memory registration with layout-lease"
> > >> mechanism can be implemented for the affected sub-systems (RDMA and
> > >> V4L2).
> > >
> > > One thing is not clear to me. Who is allowed to pin pages for ever?
> > > Is it possible to pin LRU pages that way as well? If yes then there
> > > absolutely has to be a limit for that. Sorry I could have studied the
> > > code much more but from a quick glance it seems to me that this is not
> > > limited to dax (or non-LRU in general) pages.
> >
> > I would turn this question around. "who can not tolerate a page being
> > pinned forever?".
>
> Any struct page on the movable zone or anything that is living on the
> LRU list because such a memory is unreclaimable.
>
> > In the case of filesytem-dax a page is
> > one-in-the-same object as a filesystem-block, and a filesystem expects
> > that its operations will not be blocked indefinitely. LRU pages can
> > continue to be pinned indefinitely because operations can continue
> > around the pinned page, i.e. every agent, save for the dma agent,
> > drops their reference to the page and its tolerable that the final
> > put_page() never arrives.
>
> I do not understand. Are you saying that a user triggered IO can pin LRU
> pages indefinitely. This would be _really_ wrong. It would be basically
> an mlock without any limit. So I must be misreading you here

You're not misreading. See ib_umem_get() for example, it pins pages in
response to the userspace library call ibv_reg_mr() (memory
registration), and will not release those pages unless/until a call to
ibv_dereg_mr() is made. The current plan to fix this is to create
something like a ibv_reg_mr_lease() call that registers the memory
with an F_SETLEASE semantic so that the kernel can notify userspace
that a memory registration is being forcibly revoked by the kernel. A
previous attempt at something like this was the proposed MAP_DIRECT
mmap flag [1].

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-October/012815.html
Michal Hocko Nov. 30, 2017, 6:17 p.m. UTC | #5
On Thu 30-11-17 10:03:26, Dan Williams wrote:
> On Thu, Nov 30, 2017 at 9:42 AM, Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Thu 30-11-17 08:39:51, Dan Williams wrote:
> > > On Thu, Nov 30, 2017 at 1:53 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > > > On Wed 29-11-17 10:05:35, Dan Williams wrote:
> > > >> Until there is a solution to the dma-to-dax vs truncate problem it is
> > > >> not safe to allow long standing memory registrations against
> > > >> filesytem-dax vmas. Device-dax vmas do not have this problem and are
> > > >> explicitly allowed.
> > > >>
> > > >> This is temporary until a "memory registration with layout-lease"
> > > >> mechanism can be implemented for the affected sub-systems (RDMA and
> > > >> V4L2).
> > > >
> > > > One thing is not clear to me. Who is allowed to pin pages for ever?
> > > > Is it possible to pin LRU pages that way as well? If yes then there
> > > > absolutely has to be a limit for that. Sorry I could have studied the
> > > > code much more but from a quick glance it seems to me that this is not
> > > > limited to dax (or non-LRU in general) pages.
> > >
> > > I would turn this question around. "who can not tolerate a page being
> > > pinned forever?".
> >
> > Any struct page on the movable zone or anything that is living on the
> > LRU list because such a memory is unreclaimable.
> >
> > > In the case of filesytem-dax a page is
> > > one-in-the-same object as a filesystem-block, and a filesystem expects
> > > that its operations will not be blocked indefinitely. LRU pages can
> > > continue to be pinned indefinitely because operations can continue
> > > around the pinned page, i.e. every agent, save for the dma agent,
> > > drops their reference to the page and its tolerable that the final
> > > put_page() never arrives.
> >
> > I do not understand. Are you saying that a user triggered IO can pin LRU
> > pages indefinitely. This would be _really_ wrong. It would be basically
> > an mlock without any limit. So I must be misreading you here
> 
> You're not misreading. See ib_umem_get() for example, it pins pages in
> response to the userspace library call ibv_reg_mr() (memory
> registration), and will not release those pages unless/until a call to
> ibv_dereg_mr() is made.

Who and how many LRU pages can pin that way and how do you prevent nasty
users to DoS systems this way?

I remember PeterZ wanted to address a similar issue by vmpin syscall
that would be a subject of a rlimit control. Sorry but I cannot find a
reference here but if this is at g-u-p level without any accounting then
it smells quite broken to me.
Dan Williams Nov. 30, 2017, 6:32 p.m. UTC | #6
[ adding linux-rdma ]

On Thu, Nov 30, 2017 at 10:17 AM, Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 30-11-17 10:03:26, Dan Williams wrote:
> > On Thu, Nov 30, 2017 at 9:42 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Thu 30-11-17 08:39:51, Dan Williams wrote:
> > > > On Thu, Nov 30, 2017 at 1:53 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > > > > On Wed 29-11-17 10:05:35, Dan Williams wrote:
> > > > >> Until there is a solution to the dma-to-dax vs truncate problem it is
> > > > >> not safe to allow long standing memory registrations against
> > > > >> filesytem-dax vmas. Device-dax vmas do not have this problem and are
> > > > >> explicitly allowed.
> > > > >>
> > > > >> This is temporary until a "memory registration with layout-lease"
> > > > >> mechanism can be implemented for the affected sub-systems (RDMA and
> > > > >> V4L2).
> > > > >
> > > > > One thing is not clear to me. Who is allowed to pin pages for ever?
> > > > > Is it possible to pin LRU pages that way as well? If yes then there
> > > > > absolutely has to be a limit for that. Sorry I could have studied the
> > > > > code much more but from a quick glance it seems to me that this is not
> > > > > limited to dax (or non-LRU in general) pages.
> > > >
> > > > I would turn this question around. "who can not tolerate a page being
> > > > pinned forever?".
> > >
> > > Any struct page on the movable zone or anything that is living on the
> > > LRU list because such a memory is unreclaimable.
> > >
> > > > In the case of filesytem-dax a page is
> > > > one-in-the-same object as a filesystem-block, and a filesystem expects
> > > > that its operations will not be blocked indefinitely. LRU pages can
> > > > continue to be pinned indefinitely because operations can continue
> > > > around the pinned page, i.e. every agent, save for the dma agent,
> > > > drops their reference to the page and its tolerable that the final
> > > > put_page() never arrives.
> > >
> > > I do not understand. Are you saying that a user triggered IO can pin LRU
> > > pages indefinitely. This would be _really_ wrong. It would be basically
> > > an mlock without any limit. So I must be misreading you here
> >
> > You're not misreading. See ib_umem_get() for example, it pins pages in
> > response to the userspace library call ibv_reg_mr() (memory
> > registration), and will not release those pages unless/until a call to
> > ibv_dereg_mr() is made.
>
> Who and how many LRU pages can pin that way and how do you prevent nasty
> users to DoS systems this way?

I assume this is something the RDMA community has had to contend with?
I'm not an RDMA person, I'm just here to fix dax.

> I remember PeterZ wanted to address a similar issue by vmpin syscall
> that would be a subject of a rlimit control. Sorry but I cannot find a
> reference here

https://lwn.net/Articles/600502/

> but if this is at g-u-p level without any accounting then
> it smells quite broken to me.

It's certainly broken with respect to filesystem-dax and if there is
other breakage we should get it all on the table.
Michal Hocko Dec. 1, 2017, 10:12 a.m. UTC | #7
On Thu 30-11-17 12:01:17, Jason Gunthorpe wrote:
> On Thu, Nov 30, 2017 at 10:32:42AM -0800, Dan Williams wrote:
> > > Who and how many LRU pages can pin that way and how do you prevent nasty
> > > users to DoS systems this way?
> > 
> > I assume this is something the RDMA community has had to contend with?
> > I'm not an RDMA person, I'm just here to fix dax.
> 
> The RDMA implementation respects the mlock rlimit

OK, so then I am kind of lost in why do we need a special g-u-p variant.
The documentation doesn't say and quite contrary it assumes that the
caller knows what he is doing. This cannot be the right approach.

In other words, what does V4L2 does in the same context? Does it account
the pinned memory or it allows user to pin arbitrary amount of memory.
Dan Williams Dec. 1, 2017, 4:29 p.m. UTC | #8
On Fri, Dec 1, 2017 at 8:02 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Fri, Dec 01, 2017 at 11:12:18AM +0100, Michal Hocko wrote:
> > On Thu 30-11-17 12:01:17, Jason Gunthorpe wrote:
> > > On Thu, Nov 30, 2017 at 10:32:42AM -0800, Dan Williams wrote:
> > > > > Who and how many LRU pages can pin that way and how do you prevent nasty
> > > > > users to DoS systems this way?
> > > >
> > > > I assume this is something the RDMA community has had to contend with?
> > > > I'm not an RDMA person, I'm just here to fix dax.
> > >
> > > The RDMA implementation respects the mlock rlimit
> >
> > OK, so then I am kind of lost in why do we need a special g-u-p variant.
> > The documentation doesn't say and quite contrary it assumes that the
> > caller knows what he is doing. This cannot be the right approach.
>
> I thought it was because get_user_pages_longterm is supposed to fail
> on DAX mappings?

Correct, the rlimit checks are a separate issue,
get_user_pages_longterm is only there to avoid open coding vma lookup
and vma_is_fsdax() checks in multiple code paths.

> And maybe we should think about moving the rlimit accounting into this
> new function too someday?

DAX pages are not accounted in any rlimit because they are statically
allocated reserved memory regions.
Michal Hocko Dec. 4, 2017, 9:31 a.m. UTC | #9
On Fri 01-12-17 08:29:53, Dan Williams wrote:
> On Fri, Dec 1, 2017 at 8:02 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Fri, Dec 01, 2017 at 11:12:18AM +0100, Michal Hocko wrote:
> > > On Thu 30-11-17 12:01:17, Jason Gunthorpe wrote:
> > > > On Thu, Nov 30, 2017 at 10:32:42AM -0800, Dan Williams wrote:
> > > > > > Who and how many LRU pages can pin that way and how do you prevent nasty
> > > > > > users to DoS systems this way?
> > > > >
> > > > > I assume this is something the RDMA community has had to contend with?
> > > > > I'm not an RDMA person, I'm just here to fix dax.
> > > >
> > > > The RDMA implementation respects the mlock rlimit
> > >
> > > OK, so then I am kind of lost in why do we need a special g-u-p variant.
> > > The documentation doesn't say and quite contrary it assumes that the
> > > caller knows what he is doing. This cannot be the right approach.
> >
> > I thought it was because get_user_pages_longterm is supposed to fail
> > on DAX mappings?
> 
> Correct, the rlimit checks are a separate issue,
> get_user_pages_longterm is only there to avoid open coding vma lookup
> and vma_is_fsdax() checks in multiple code paths.

Then it is a terrible misnomer. One would expect this is a proper way to
get a longterm pin on a page.
 
> > And maybe we should think about moving the rlimit accounting into this
> > new function too someday?
> 
> DAX pages are not accounted in any rlimit because they are statically
> allocated reserved memory regions.

Which is OK, but how do you prevent anybody calling this function on
normal LRU pages?
Dan Williams Dec. 4, 2017, 5:01 p.m. UTC | #10
On Mon, Dec 4, 2017 at 1:31 AM, Michal Hocko <mhocko@kernel.org> wrote:
>
> On Fri 01-12-17 08:29:53, Dan Williams wrote:
> > On Fri, Dec 1, 2017 at 8:02 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Fri, Dec 01, 2017 at 11:12:18AM +0100, Michal Hocko wrote:
> > > > On Thu 30-11-17 12:01:17, Jason Gunthorpe wrote:
> > > > > On Thu, Nov 30, 2017 at 10:32:42AM -0800, Dan Williams wrote:
> > > > > > > Who and how many LRU pages can pin that way and how do you prevent nasty
> > > > > > > users to DoS systems this way?
> > > > > >
> > > > > > I assume this is something the RDMA community has had to contend with?
> > > > > > I'm not an RDMA person, I'm just here to fix dax.
> > > > >
> > > > > The RDMA implementation respects the mlock rlimit
> > > >
> > > > OK, so then I am kind of lost in why do we need a special g-u-p variant.
> > > > The documentation doesn't say and quite contrary it assumes that the
> > > > caller knows what he is doing. This cannot be the right approach.
> > >
> > > I thought it was because get_user_pages_longterm is supposed to fail
> > > on DAX mappings?
> >
> > Correct, the rlimit checks are a separate issue,
> > get_user_pages_longterm is only there to avoid open coding vma lookup
> > and vma_is_fsdax() checks in multiple code paths.
>
> Then it is a terrible misnomer. One would expect this is a proper way to
> get a longterm pin on a page.

Yes, I can see that. The "get_user_pages_longterm" symbol name is
encoding the lifetime expectations of the caller vs properly
implementing 'longterm' pinning. However the proper interface to
establish a long term pin does not currently exist needs and
ultimately needs more coordination with userspace. We need a way for
the kernel to explicitly revoke the pin. So, this
get_user_pages_longterm change is only a stop-gap to prevent data
corruption and userspace from growing further expectations that
filesystem-dax supports long term pinning through the legacy
interfaces.

> > > And maybe we should think about moving the rlimit accounting into this
> > > new function too someday?
> >
> > DAX pages are not accounted in any rlimit because they are statically
> > allocated reserved memory regions.
>
> Which is OK, but how do you prevent anybody calling this function on
> normal LRU pages?

I don't, and didn't consider this angle as it's a consideration that
is missing from the existing gup interfaces. It is an additional gap
we need to fill.
diff mbox

Patch

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2995a271ec46..8a9f6d048487 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3194,6 +3194,20 @@  static inline bool vma_is_dax(struct vm_area_struct *vma)
 	return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host);
 }
 
+static inline bool vma_is_fsdax(struct vm_area_struct *vma)
+{
+	struct inode *inode;
+
+	if (!vma->vm_file)
+		return false;
+	if (!vma_is_dax(vma))
+		return false;
+	inode = file_inode(vma->vm_file);
+	if (inode->i_mode == S_IFCHR)
+		return false; /* device-dax */
+	return true;
+}
+
 static inline int iocb_flags(struct file *file)
 {
 	int res = 0;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b3b6a7e313e9..ea818ff739cd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1380,6 +1380,19 @@  long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
 		    unsigned int gup_flags, struct page **pages, int *locked);
 long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
 		    struct page **pages, unsigned int gup_flags);
+#ifdef CONFIG_FS_DAX
+long get_user_pages_longterm(unsigned long start, unsigned long nr_pages,
+			    unsigned int gup_flags, struct page **pages,
+			    struct vm_area_struct **vmas);
+#else
+static inline long get_user_pages_longterm(unsigned long start,
+		unsigned long nr_pages, unsigned int gup_flags,
+		struct page **pages, struct vm_area_struct **vmas)
+{
+	return get_user_pages(start, nr_pages, gup_flags, pages, vmas);
+}
+#endif /* CONFIG_FS_DAX */
+
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages);
 
diff --git a/mm/gup.c b/mm/gup.c
index 85cc822fd403..3dc8a7807ea0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1095,6 +1095,70 @@  long get_user_pages(unsigned long start, unsigned long nr_pages,
 }
 EXPORT_SYMBOL(get_user_pages);
 
+#ifdef CONFIG_FS_DAX
+/*
+ * This is the same as get_user_pages() in that it assumes we are
+ * operating on the current task's mm, but it goes further to validate
+ * that the vmas associated with the address range are suitable for
+ * longterm elevated page reference counts. For example, filesystem-dax
+ * mappings are subject to the lifetime enforced by the filesystem and
+ * we need guarantees that longterm users like RDMA and V4L2 only
+ * establish mappings that have a kernel enforced revocation mechanism.
+ *
+ * "longterm" == userspace controlled elevated page count lifetime.
+ * Contrast this to iov_iter_get_pages() usages which are transient.
+ */
+long get_user_pages_longterm(unsigned long start, unsigned long nr_pages,
+		unsigned int gup_flags, struct page **pages,
+		struct vm_area_struct **vmas_arg)
+{
+	struct vm_area_struct **vmas = vmas_arg;
+	struct vm_area_struct *vma_prev = NULL;
+	long rc, i;
+
+	if (!pages)
+		return -EINVAL;
+
+	if (!vmas) {
+		vmas = kzalloc(sizeof(struct vm_area_struct *) * nr_pages,
+				GFP_KERNEL);
+		if (!vmas)
+			return -ENOMEM;
+	}
+
+	rc = get_user_pages(start, nr_pages, gup_flags, pages, vmas);
+
+	for (i = 0; i < rc; i++) {
+		struct vm_area_struct *vma = vmas[i];
+
+		if (vma == vma_prev)
+			continue;
+
+		vma_prev = vma;
+
+		if (vma_is_fsdax(vma))
+			break;
+	}
+
+	/*
+	 * Either get_user_pages() failed, or the vma validation
+	 * succeeded, in either case we don't need to put_page() before
+	 * returning.
+	 */
+	if (i >= rc)
+		goto out;
+
+	for (i = 0; i < rc; i++)
+		put_page(pages[i]);
+	rc = -EOPNOTSUPP;
+out:
+	if (vmas != vmas_arg)
+		kfree(vmas);
+	return rc;
+}
+EXPORT_SYMBOL(get_user_pages_longterm);
+#endif /* CONFIG_FS_DAX */
+
 /**
  * populate_vma_page_range() -  populate a range of pages in the vma.
  * @vma:   target vma