Message ID | 151197873499.26211.11687422577653326365.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Wed 29-11-17 10:05:35, Dan Williams wrote: > Until there is a solution to the dma-to-dax vs truncate problem it is > not safe to allow long standing memory registrations against > filesytem-dax vmas. Device-dax vmas do not have this problem and are > explicitly allowed. > > This is temporary until a "memory registration with layout-lease" > mechanism can be implemented for the affected sub-systems (RDMA and > V4L2). One thing is not clear to me. Who is allowed to pin pages for ever? Is it possible to pin LRU pages that way as well? If yes then there absolutely has to be a limit for that. Sorry I could have studied the code much more but from a quick glance it seems to me that this is not limited to dax (or non-LRU in general) pages.
On Thu, Nov 30, 2017 at 1:53 AM, Michal Hocko <mhocko@kernel.org> wrote: > On Wed 29-11-17 10:05:35, Dan Williams wrote: >> Until there is a solution to the dma-to-dax vs truncate problem it is >> not safe to allow long standing memory registrations against >> filesytem-dax vmas. Device-dax vmas do not have this problem and are >> explicitly allowed. >> >> This is temporary until a "memory registration with layout-lease" >> mechanism can be implemented for the affected sub-systems (RDMA and >> V4L2). > > One thing is not clear to me. Who is allowed to pin pages for ever? > Is it possible to pin LRU pages that way as well? If yes then there > absolutely has to be a limit for that. Sorry I could have studied the > code much more but from a quick glance it seems to me that this is not > limited to dax (or non-LRU in general) pages. I would turn this question around. "who can not tolerate a page being pinned forever?". In the case of filesytem-dax a page is one-in-the-same object as a filesystem-block, and a filesystem expects that its operations will not be blocked indefinitely. LRU pages can continue to be pinned indefinitely because operations can continue around the pinned page, i.e. every agent, save for the dma agent, drops their reference to the page and its tolerable that the final put_page() never arrives. As far as I can tell it's only filesystems and dax that have this collision of wanting to revoke dma access to a page combined with not being able to wait indefinitely for dma to quiesce.
On Thu 30-11-17 08:39:51, Dan Williams wrote: > On Thu, Nov 30, 2017 at 1:53 AM, Michal Hocko <mhocko@kernel.org> wrote: > > On Wed 29-11-17 10:05:35, Dan Williams wrote: > >> Until there is a solution to the dma-to-dax vs truncate problem it is > >> not safe to allow long standing memory registrations against > >> filesytem-dax vmas. Device-dax vmas do not have this problem and are > >> explicitly allowed. > >> > >> This is temporary until a "memory registration with layout-lease" > >> mechanism can be implemented for the affected sub-systems (RDMA and > >> V4L2). > > > > One thing is not clear to me. Who is allowed to pin pages for ever? > > Is it possible to pin LRU pages that way as well? If yes then there > > absolutely has to be a limit for that. Sorry I could have studied the > > code much more but from a quick glance it seems to me that this is not > > limited to dax (or non-LRU in general) pages. > > I would turn this question around. "who can not tolerate a page being > pinned forever?". Any struct page on the movable zone or anything that is living on the LRU list because such a memory is unreclaimable. > In the case of filesytem-dax a page is > one-in-the-same object as a filesystem-block, and a filesystem expects > that its operations will not be blocked indefinitely. LRU pages can > continue to be pinned indefinitely because operations can continue > around the pinned page, i.e. every agent, save for the dma agent, > drops their reference to the page and its tolerable that the final > put_page() never arrives. I do not understand. Are you saying that a user triggered IO can pin LRU pages indefinitely. This would be _really_ wrong. It would be basically an mlock without any limit. So I must be misreading you here > As far as I can tell it's only filesystems > and dax that have this collision of wanting to revoke dma access to a > page combined with not being able to wait indefinitely for dma to > quiesce.
On Thu, Nov 30, 2017 at 9:42 AM, Michal Hocko <mhocko@kernel.org> wrote: > > On Thu 30-11-17 08:39:51, Dan Williams wrote: > > On Thu, Nov 30, 2017 at 1:53 AM, Michal Hocko <mhocko@kernel.org> wrote: > > > On Wed 29-11-17 10:05:35, Dan Williams wrote: > > >> Until there is a solution to the dma-to-dax vs truncate problem it is > > >> not safe to allow long standing memory registrations against > > >> filesytem-dax vmas. Device-dax vmas do not have this problem and are > > >> explicitly allowed. > > >> > > >> This is temporary until a "memory registration with layout-lease" > > >> mechanism can be implemented for the affected sub-systems (RDMA and > > >> V4L2). > > > > > > One thing is not clear to me. Who is allowed to pin pages for ever? > > > Is it possible to pin LRU pages that way as well? If yes then there > > > absolutely has to be a limit for that. Sorry I could have studied the > > > code much more but from a quick glance it seems to me that this is not > > > limited to dax (or non-LRU in general) pages. > > > > I would turn this question around. "who can not tolerate a page being > > pinned forever?". > > Any struct page on the movable zone or anything that is living on the > LRU list because such a memory is unreclaimable. > > > In the case of filesytem-dax a page is > > one-in-the-same object as a filesystem-block, and a filesystem expects > > that its operations will not be blocked indefinitely. LRU pages can > > continue to be pinned indefinitely because operations can continue > > around the pinned page, i.e. every agent, save for the dma agent, > > drops their reference to the page and its tolerable that the final > > put_page() never arrives. > > I do not understand. Are you saying that a user triggered IO can pin LRU > pages indefinitely. This would be _really_ wrong. It would be basically > an mlock without any limit. So I must be misreading you here You're not misreading. See ib_umem_get() for example, it pins pages in response to the userspace library call ibv_reg_mr() (memory registration), and will not release those pages unless/until a call to ibv_dereg_mr() is made. The current plan to fix this is to create something like a ibv_reg_mr_lease() call that registers the memory with an F_SETLEASE semantic so that the kernel can notify userspace that a memory registration is being forcibly revoked by the kernel. A previous attempt at something like this was the proposed MAP_DIRECT mmap flag [1]. [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-October/012815.html
On Thu 30-11-17 10:03:26, Dan Williams wrote: > On Thu, Nov 30, 2017 at 9:42 AM, Michal Hocko <mhocko@kernel.org> wrote: > > > > On Thu 30-11-17 08:39:51, Dan Williams wrote: > > > On Thu, Nov 30, 2017 at 1:53 AM, Michal Hocko <mhocko@kernel.org> wrote: > > > > On Wed 29-11-17 10:05:35, Dan Williams wrote: > > > >> Until there is a solution to the dma-to-dax vs truncate problem it is > > > >> not safe to allow long standing memory registrations against > > > >> filesytem-dax vmas. Device-dax vmas do not have this problem and are > > > >> explicitly allowed. > > > >> > > > >> This is temporary until a "memory registration with layout-lease" > > > >> mechanism can be implemented for the affected sub-systems (RDMA and > > > >> V4L2). > > > > > > > > One thing is not clear to me. Who is allowed to pin pages for ever? > > > > Is it possible to pin LRU pages that way as well? If yes then there > > > > absolutely has to be a limit for that. Sorry I could have studied the > > > > code much more but from a quick glance it seems to me that this is not > > > > limited to dax (or non-LRU in general) pages. > > > > > > I would turn this question around. "who can not tolerate a page being > > > pinned forever?". > > > > Any struct page on the movable zone or anything that is living on the > > LRU list because such a memory is unreclaimable. > > > > > In the case of filesytem-dax a page is > > > one-in-the-same object as a filesystem-block, and a filesystem expects > > > that its operations will not be blocked indefinitely. LRU pages can > > > continue to be pinned indefinitely because operations can continue > > > around the pinned page, i.e. every agent, save for the dma agent, > > > drops their reference to the page and its tolerable that the final > > > put_page() never arrives. > > > > I do not understand. Are you saying that a user triggered IO can pin LRU > > pages indefinitely. This would be _really_ wrong. It would be basically > > an mlock without any limit. So I must be misreading you here > > You're not misreading. See ib_umem_get() for example, it pins pages in > response to the userspace library call ibv_reg_mr() (memory > registration), and will not release those pages unless/until a call to > ibv_dereg_mr() is made. Who and how many LRU pages can pin that way and how do you prevent nasty users to DoS systems this way? I remember PeterZ wanted to address a similar issue by vmpin syscall that would be a subject of a rlimit control. Sorry but I cannot find a reference here but if this is at g-u-p level without any accounting then it smells quite broken to me.
[ adding linux-rdma ] On Thu, Nov 30, 2017 at 10:17 AM, Michal Hocko <mhocko@kernel.org> wrote: > > On Thu 30-11-17 10:03:26, Dan Williams wrote: > > On Thu, Nov 30, 2017 at 9:42 AM, Michal Hocko <mhocko@kernel.org> wrote: > > > > > > On Thu 30-11-17 08:39:51, Dan Williams wrote: > > > > On Thu, Nov 30, 2017 at 1:53 AM, Michal Hocko <mhocko@kernel.org> wrote: > > > > > On Wed 29-11-17 10:05:35, Dan Williams wrote: > > > > >> Until there is a solution to the dma-to-dax vs truncate problem it is > > > > >> not safe to allow long standing memory registrations against > > > > >> filesytem-dax vmas. Device-dax vmas do not have this problem and are > > > > >> explicitly allowed. > > > > >> > > > > >> This is temporary until a "memory registration with layout-lease" > > > > >> mechanism can be implemented for the affected sub-systems (RDMA and > > > > >> V4L2). > > > > > > > > > > One thing is not clear to me. Who is allowed to pin pages for ever? > > > > > Is it possible to pin LRU pages that way as well? If yes then there > > > > > absolutely has to be a limit for that. Sorry I could have studied the > > > > > code much more but from a quick glance it seems to me that this is not > > > > > limited to dax (or non-LRU in general) pages. > > > > > > > > I would turn this question around. "who can not tolerate a page being > > > > pinned forever?". > > > > > > Any struct page on the movable zone or anything that is living on the > > > LRU list because such a memory is unreclaimable. > > > > > > > In the case of filesytem-dax a page is > > > > one-in-the-same object as a filesystem-block, and a filesystem expects > > > > that its operations will not be blocked indefinitely. LRU pages can > > > > continue to be pinned indefinitely because operations can continue > > > > around the pinned page, i.e. every agent, save for the dma agent, > > > > drops their reference to the page and its tolerable that the final > > > > put_page() never arrives. > > > > > > I do not understand. Are you saying that a user triggered IO can pin LRU > > > pages indefinitely. This would be _really_ wrong. It would be basically > > > an mlock without any limit. So I must be misreading you here > > > > You're not misreading. See ib_umem_get() for example, it pins pages in > > response to the userspace library call ibv_reg_mr() (memory > > registration), and will not release those pages unless/until a call to > > ibv_dereg_mr() is made. > > Who and how many LRU pages can pin that way and how do you prevent nasty > users to DoS systems this way? I assume this is something the RDMA community has had to contend with? I'm not an RDMA person, I'm just here to fix dax. > I remember PeterZ wanted to address a similar issue by vmpin syscall > that would be a subject of a rlimit control. Sorry but I cannot find a > reference here https://lwn.net/Articles/600502/ > but if this is at g-u-p level without any accounting then > it smells quite broken to me. It's certainly broken with respect to filesystem-dax and if there is other breakage we should get it all on the table.
On Thu 30-11-17 12:01:17, Jason Gunthorpe wrote: > On Thu, Nov 30, 2017 at 10:32:42AM -0800, Dan Williams wrote: > > > Who and how many LRU pages can pin that way and how do you prevent nasty > > > users to DoS systems this way? > > > > I assume this is something the RDMA community has had to contend with? > > I'm not an RDMA person, I'm just here to fix dax. > > The RDMA implementation respects the mlock rlimit OK, so then I am kind of lost in why do we need a special g-u-p variant. The documentation doesn't say and quite contrary it assumes that the caller knows what he is doing. This cannot be the right approach. In other words, what does V4L2 does in the same context? Does it account the pinned memory or it allows user to pin arbitrary amount of memory.
On Fri, Dec 1, 2017 at 8:02 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Fri, Dec 01, 2017 at 11:12:18AM +0100, Michal Hocko wrote: > > On Thu 30-11-17 12:01:17, Jason Gunthorpe wrote: > > > On Thu, Nov 30, 2017 at 10:32:42AM -0800, Dan Williams wrote: > > > > > Who and how many LRU pages can pin that way and how do you prevent nasty > > > > > users to DoS systems this way? > > > > > > > > I assume this is something the RDMA community has had to contend with? > > > > I'm not an RDMA person, I'm just here to fix dax. > > > > > > The RDMA implementation respects the mlock rlimit > > > > OK, so then I am kind of lost in why do we need a special g-u-p variant. > > The documentation doesn't say and quite contrary it assumes that the > > caller knows what he is doing. This cannot be the right approach. > > I thought it was because get_user_pages_longterm is supposed to fail > on DAX mappings? Correct, the rlimit checks are a separate issue, get_user_pages_longterm is only there to avoid open coding vma lookup and vma_is_fsdax() checks in multiple code paths. > And maybe we should think about moving the rlimit accounting into this > new function too someday? DAX pages are not accounted in any rlimit because they are statically allocated reserved memory regions.
On Fri 01-12-17 08:29:53, Dan Williams wrote: > On Fri, Dec 1, 2017 at 8:02 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > On Fri, Dec 01, 2017 at 11:12:18AM +0100, Michal Hocko wrote: > > > On Thu 30-11-17 12:01:17, Jason Gunthorpe wrote: > > > > On Thu, Nov 30, 2017 at 10:32:42AM -0800, Dan Williams wrote: > > > > > > Who and how many LRU pages can pin that way and how do you prevent nasty > > > > > > users to DoS systems this way? > > > > > > > > > > I assume this is something the RDMA community has had to contend with? > > > > > I'm not an RDMA person, I'm just here to fix dax. > > > > > > > > The RDMA implementation respects the mlock rlimit > > > > > > OK, so then I am kind of lost in why do we need a special g-u-p variant. > > > The documentation doesn't say and quite contrary it assumes that the > > > caller knows what he is doing. This cannot be the right approach. > > > > I thought it was because get_user_pages_longterm is supposed to fail > > on DAX mappings? > > Correct, the rlimit checks are a separate issue, > get_user_pages_longterm is only there to avoid open coding vma lookup > and vma_is_fsdax() checks in multiple code paths. Then it is a terrible misnomer. One would expect this is a proper way to get a longterm pin on a page. > > And maybe we should think about moving the rlimit accounting into this > > new function too someday? > > DAX pages are not accounted in any rlimit because they are statically > allocated reserved memory regions. Which is OK, but how do you prevent anybody calling this function on normal LRU pages?
On Mon, Dec 4, 2017 at 1:31 AM, Michal Hocko <mhocko@kernel.org> wrote: > > On Fri 01-12-17 08:29:53, Dan Williams wrote: > > On Fri, Dec 1, 2017 at 8:02 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > > > On Fri, Dec 01, 2017 at 11:12:18AM +0100, Michal Hocko wrote: > > > > On Thu 30-11-17 12:01:17, Jason Gunthorpe wrote: > > > > > On Thu, Nov 30, 2017 at 10:32:42AM -0800, Dan Williams wrote: > > > > > > > Who and how many LRU pages can pin that way and how do you prevent nasty > > > > > > > users to DoS systems this way? > > > > > > > > > > > > I assume this is something the RDMA community has had to contend with? > > > > > > I'm not an RDMA person, I'm just here to fix dax. > > > > > > > > > > The RDMA implementation respects the mlock rlimit > > > > > > > > OK, so then I am kind of lost in why do we need a special g-u-p variant. > > > > The documentation doesn't say and quite contrary it assumes that the > > > > caller knows what he is doing. This cannot be the right approach. > > > > > > I thought it was because get_user_pages_longterm is supposed to fail > > > on DAX mappings? > > > > Correct, the rlimit checks are a separate issue, > > get_user_pages_longterm is only there to avoid open coding vma lookup > > and vma_is_fsdax() checks in multiple code paths. > > Then it is a terrible misnomer. One would expect this is a proper way to > get a longterm pin on a page. Yes, I can see that. The "get_user_pages_longterm" symbol name is encoding the lifetime expectations of the caller vs properly implementing 'longterm' pinning. However the proper interface to establish a long term pin does not currently exist needs and ultimately needs more coordination with userspace. We need a way for the kernel to explicitly revoke the pin. So, this get_user_pages_longterm change is only a stop-gap to prevent data corruption and userspace from growing further expectations that filesystem-dax supports long term pinning through the legacy interfaces. > > > And maybe we should think about moving the rlimit accounting into this > > > new function too someday? > > > > DAX pages are not accounted in any rlimit because they are statically > > allocated reserved memory regions. > > Which is OK, but how do you prevent anybody calling this function on > normal LRU pages? I don't, and didn't consider this angle as it's a consideration that is missing from the existing gup interfaces. It is an additional gap we need to fill.
diff --git a/include/linux/fs.h b/include/linux/fs.h index 2995a271ec46..8a9f6d048487 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -3194,6 +3194,20 @@ static inline bool vma_is_dax(struct vm_area_struct *vma) return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host); } +static inline bool vma_is_fsdax(struct vm_area_struct *vma) +{ + struct inode *inode; + + if (!vma->vm_file) + return false; + if (!vma_is_dax(vma)) + return false; + inode = file_inode(vma->vm_file); + if (inode->i_mode == S_IFCHR) + return false; /* device-dax */ + return true; +} + static inline int iocb_flags(struct file *file) { int res = 0; diff --git a/include/linux/mm.h b/include/linux/mm.h index b3b6a7e313e9..ea818ff739cd 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1380,6 +1380,19 @@ long get_user_pages_locked(unsigned long start, unsigned long nr_pages, unsigned int gup_flags, struct page **pages, int *locked); long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages, struct page **pages, unsigned int gup_flags); +#ifdef CONFIG_FS_DAX +long get_user_pages_longterm(unsigned long start, unsigned long nr_pages, + unsigned int gup_flags, struct page **pages, + struct vm_area_struct **vmas); +#else +static inline long get_user_pages_longterm(unsigned long start, + unsigned long nr_pages, unsigned int gup_flags, + struct page **pages, struct vm_area_struct **vmas) +{ + return get_user_pages(start, nr_pages, gup_flags, pages, vmas); +} +#endif /* CONFIG_FS_DAX */ + int get_user_pages_fast(unsigned long start, int nr_pages, int write, struct page **pages); diff --git a/mm/gup.c b/mm/gup.c index 85cc822fd403..3dc8a7807ea0 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1095,6 +1095,70 @@ long get_user_pages(unsigned long start, unsigned long nr_pages, } EXPORT_SYMBOL(get_user_pages); +#ifdef CONFIG_FS_DAX +/* + * This is the same as get_user_pages() in that it assumes we are + * operating on the current task's mm, but it goes further to validate + * that the vmas associated with the address range are suitable for + * longterm elevated page reference counts. For example, filesystem-dax + * mappings are subject to the lifetime enforced by the filesystem and + * we need guarantees that longterm users like RDMA and V4L2 only + * establish mappings that have a kernel enforced revocation mechanism. + * + * "longterm" == userspace controlled elevated page count lifetime. + * Contrast this to iov_iter_get_pages() usages which are transient. + */ +long get_user_pages_longterm(unsigned long start, unsigned long nr_pages, + unsigned int gup_flags, struct page **pages, + struct vm_area_struct **vmas_arg) +{ + struct vm_area_struct **vmas = vmas_arg; + struct vm_area_struct *vma_prev = NULL; + long rc, i; + + if (!pages) + return -EINVAL; + + if (!vmas) { + vmas = kzalloc(sizeof(struct vm_area_struct *) * nr_pages, + GFP_KERNEL); + if (!vmas) + return -ENOMEM; + } + + rc = get_user_pages(start, nr_pages, gup_flags, pages, vmas); + + for (i = 0; i < rc; i++) { + struct vm_area_struct *vma = vmas[i]; + + if (vma == vma_prev) + continue; + + vma_prev = vma; + + if (vma_is_fsdax(vma)) + break; + } + + /* + * Either get_user_pages() failed, or the vma validation + * succeeded, in either case we don't need to put_page() before + * returning. + */ + if (i >= rc) + goto out; + + for (i = 0; i < rc; i++) + put_page(pages[i]); + rc = -EOPNOTSUPP; +out: + if (vmas != vmas_arg) + kfree(vmas); + return rc; +} +EXPORT_SYMBOL(get_user_pages_longterm); +#endif /* CONFIG_FS_DAX */ + /** * populate_vma_page_range() - populate a range of pages in the vma. * @vma: target vma
Until there is a solution to the dma-to-dax vs truncate problem it is not safe to allow long standing memory registrations against filesytem-dax vmas. Device-dax vmas do not have this problem and are explicitly allowed. This is temporary until a "memory registration with layout-lease" mechanism can be implemented for the affected sub-systems (RDMA and V4L2). Cc: <stable@vger.kernel.org> Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings") Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com> --- include/linux/fs.h | 14 +++++++++++ include/linux/mm.h | 13 +++++++++++ mm/gup.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 91 insertions(+)