[v5,05/15] mm/frame-vector: Use FOLL_LONGTERM

Message ID	20201030100815.2269-6-daniel.vetter@ffwll.ch (mailing list archive)
State	New, archived

Daniel Vetter Oct. 30, 2020, 10:08 a.m. UTC

This is used by media/videobuf2 for persistent dma mappings, not just
for a single dma operation and then freed again, so needs
FOLL_LONGTERM.

Unfortunately current pup_locked doesn't support FOLL_LONGTERM due to
locking issues. Rework the code to pull the pup path out from the
mmap_sem critical section as suggested by Jason.

By relying entirely on the vma checks in pin_user_pages and follow_pfn
(for vm_flags and vma_is_fsdax) we can also streamline the code a lot.

Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Pawel Osciak <pawel@osciak.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Tomasz Figa <tfiga@chromium.org>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-samsung-soc@vger.kernel.org
Cc: linux-media@vger.kernel.org
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
--
v2: Streamline the code and further simplify the loop checks (Jason)

v5: Review from Tomasz:
- fix page counting for the follow_pfn case by resetting ret
- drop gup_flags parameter, now unused
---
 .../media/common/videobuf2/videobuf2-memops.c |  3 +-
 include/linux/mm.h                            |  2 +-
 mm/frame_vector.c                             | 53 ++++++-------------
 3 files changed, 19 insertions(+), 39 deletions(-)
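
The control flow described in the changelog above can be modelled in plain userspace C. This is only a sketch under stated assumptions: `model_vec`, `model_pin_fn`, `model_pfn_fn` and the `model_*` helpers are hypothetical stand-ins for `struct frame_vector`, `pin_user_pages_fast()` and `follow_pfn()`, not kernel APIs, and the VMA walk and locking are left out. It shows why `ret` has to be reset before the PFN fallback: after a failed pin attempt it may still hold a negative errno, which must not be used as a frame count.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are kernel APIs. */
struct model_vec {
	int got_ref;	/* entries hold pinned page references */
	int is_pfns;	/* entries hold raw PFNs without references */
};

/* Stand-in for pin_user_pages_fast(): pages pinned, or a negative errno. */
typedef int (*model_pin_fn)(unsigned int nr_frames);
/* Stand-in for follow_pfn() on frame idx: 0 on success, negative errno. */
typedef int (*model_pfn_fn)(unsigned int idx);

int model_get_vaddr_frames(unsigned int nr_frames, struct model_vec *vec,
			   model_pin_fn pin, model_pfn_fn lookup)
{
	int ret;

	assert(vec != NULL && pin != NULL && lookup != NULL);
	if (nr_frames == 0)
		return 0;

	/* Step 1: try to pin ordinary pages; the caller holds no lock here. */
	ret = pin(nr_frames);
	if (ret > 0) {
		vec->got_ref = 1;
		vec->is_pfns = 0;
		return ret;
	}

	/* Step 2: PFN fallback (done under mmap_read_lock in the patch). */
	vec->got_ref = 0;
	vec->is_pfns = 1;
	ret = 0;	/* drop the pin error so ret counts frames from zero */
	while ((unsigned int)ret < nr_frames && lookup((unsigned int)ret) == 0)
		ret++;

	/* Like the patch, report -EFAULT when nothing could be resolved. */
	return ret ? ret : -EFAULT;
}

/* Example stand-ins used to exercise the model. */
int model_pin_all(unsigned int nr_frames) { return (int)nr_frames; }
int model_pin_fail(unsigned int nr_frames) { (void)nr_frames; return -EFAULT; }
int model_lookup_first_three(unsigned int idx) { return idx < 3 ? 0 : -EFAULT; }
```

With `model_pin_fail` and `model_lookup_first_three`, a request for four frames yields a partial count of three in PFN mode; the videobuf2 caller in the patch rejects such incomplete sets itself.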

Tomasz Figa Oct. 30, 2020, 2:11 p.m. UTC | #1

On Fri, Oct 30, 2020 at 11:08 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> This is used by media/videobuf2 for persistent dma mappings, not just
> for a single dma operation and then freed again, so needs
> FOLL_LONGTERM.
>
> Unfortunately current pup_locked doesn't support FOLL_LONGTERM due to
> locking issues. Rework the code to pull the pup path out from the
> mmap_sem critical section as suggested by Jason.
>
> By relying entirely on the vma checks in pin_user_pages and follow_pfn
> (for vm_flags and vma_is_fsdax) we can also streamline the code a lot.
>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Pawel Osciak <pawel@osciak.com>
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Kyungmin Park <kyungmin.park@samsung.com>
> Cc: Tomasz Figa <tfiga@chromium.org>
> Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Jérôme Glisse <jglisse@redhat.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: linux-mm@kvack.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-samsung-soc@vger.kernel.org
> Cc: linux-media@vger.kernel.org
> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> --
> v2: Streamline the code and further simplify the loop checks (Jason)
>
> v5: Review from Tomasz:
> - fix page counting for the follow_pfn case by resetting ret
> - drop gup_flags parameter, now unused
> ---
>  .../media/common/videobuf2/videobuf2-memops.c |  3 +-
>  include/linux/mm.h                            |  2 +-
>  mm/frame_vector.c                             | 53 ++++++-------------
>  3 files changed, 19 insertions(+), 39 deletions(-)
>

Thanks, looks good to me now.

Acked-by: Tomasz Figa <tfiga@chromium.org>

From reading the code, this is quite unlikely to introduce any
behavior changes, but just to be safe, did you have a chance to test
this with some V4L2 driver?

Best regards,
Tomasz

> diff --git a/drivers/media/common/videobuf2/videobuf2-memops.c b/drivers/media/common/videobuf2/videobuf2-memops.c
> index 6e9e05153f4e..9dd6c27162f4 100644
> --- a/drivers/media/common/videobuf2/videobuf2-memops.c
> +++ b/drivers/media/common/videobuf2/videobuf2-memops.c
> @@ -40,7 +40,6 @@ struct frame_vector *vb2_create_framevec(unsigned long start,
>         unsigned long first, last;
>         unsigned long nr;
>         struct frame_vector *vec;
> -       unsigned int flags = FOLL_FORCE | FOLL_WRITE;
>
>         first = start >> PAGE_SHIFT;
>         last = (start + length - 1) >> PAGE_SHIFT;
> @@ -48,7 +47,7 @@ struct frame_vector *vb2_create_framevec(unsigned long start,
>         vec = frame_vector_create(nr);
>         if (!vec)
>                 return ERR_PTR(-ENOMEM);
> -       ret = get_vaddr_frames(start & PAGE_MASK, nr, flags, vec);
> +       ret = get_vaddr_frames(start & PAGE_MASK, nr, vec);
>         if (ret < 0)
>                 goto out_destroy;
>         /* We accept only complete set of PFNs */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ef360fe70aaf..d6b8e30dce2e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1765,7 +1765,7 @@ struct frame_vector {
>  struct frame_vector *frame_vector_create(unsigned int nr_frames);
>  void frame_vector_destroy(struct frame_vector *vec);
>  int get_vaddr_frames(unsigned long start, unsigned int nr_pfns,
> -                    unsigned int gup_flags, struct frame_vector *vec);
> +                    struct frame_vector *vec);
>  void put_vaddr_frames(struct frame_vector *vec);
>  int frame_vector_to_pages(struct frame_vector *vec);
>  void frame_vector_to_pfns(struct frame_vector *vec);
> diff --git a/mm/frame_vector.c b/mm/frame_vector.c
> index 10f82d5643b6..f8c34b895c76 100644
> --- a/mm/frame_vector.c
> +++ b/mm/frame_vector.c
> @@ -32,13 +32,12 @@
>   * This function takes care of grabbing mmap_lock as necessary.
>   */
>  int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
> -                    unsigned int gup_flags, struct frame_vector *vec)
> +                    struct frame_vector *vec)
>  {
>         struct mm_struct *mm = current->mm;
>         struct vm_area_struct *vma;
>         int ret = 0;
>         int err;
> -       int locked;
>
>         if (nr_frames == 0)
>                 return 0;
> @@ -48,40 +47,26 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
>
>         start = untagged_addr(start);
>
> -       mmap_read_lock(mm);
> -       locked = 1;
> -       vma = find_vma_intersection(mm, start, start + 1);
> -       if (!vma) {
> -               ret = -EFAULT;
> -               goto out;
> -       }
> -
> -       /*
> -        * While get_vaddr_frames() could be used for transient (kernel
> -        * controlled lifetime) pinning of memory pages all current
> -        * users establish long term (userspace controlled lifetime)
> -        * page pinning. Treat get_vaddr_frames() like
> -        * get_user_pages_longterm() and disallow it for filesystem-dax
> -        * mappings.
> -        */
> -       if (vma_is_fsdax(vma)) {
> -               ret = -EOPNOTSUPP;
> -               goto out;
> -       }
> -
> -       if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) {
> +       ret = pin_user_pages_fast(start, nr_frames,
> +                                 FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,
> +                                 (struct page **)(vec->ptrs));
> +       if (ret > 0) {
>                 vec->got_ref = true;
>                 vec->is_pfns = false;
> -               ret = pin_user_pages_locked(start, nr_frames,
> -                       gup_flags, (struct page **)(vec->ptrs), &locked);
> -               goto out;
> +               goto out_unlocked;
>         }
>
> +       mmap_read_lock(mm);
>         vec->got_ref = false;
>         vec->is_pfns = true;
> +       ret = 0;
>         do {
>                 unsigned long *nums = frame_vector_pfns(vec);
>
> +               vma = find_vma_intersection(mm, start, start + 1);
> +               if (!vma)
> +                       break;
> +
>                 while (ret < nr_frames && start + PAGE_SIZE <= vma->vm_end) {
>                         err = follow_pfn(vma, start, &nums[ret]);
>                         if (err) {
> @@ -92,17 +77,13 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
>                         start += PAGE_SIZE;
>                         ret++;
>                 }
> -               /*
> -                * We stop if we have enough pages or if VMA doesn't completely
> -                * cover the tail page.
> -                */
> -               if (ret >= nr_frames || start < vma->vm_end)
> +               /* Bail out if VMA doesn't completely cover the tail page. */
> +               if (start < vma->vm_end)
>                         break;
> -               vma = find_vma_intersection(mm, start, start + 1);
> -       } while (vma && vma->vm_flags & (VM_IO | VM_PFNMAP));
> +       } while (ret < nr_frames);
>  out:
> -       if (locked)
> -               mmap_read_unlock(mm);
> +       mmap_read_unlock(mm);
> +out_unlocked:
>         if (!ret)
>                 ret = -EFAULT;
>         if (ret > 0)
> --
> 2.28.0
>

Daniel Vetter Oct. 30, 2020, 2:37 p.m. UTC | #2

On Fri, Oct 30, 2020 at 3:11 PM Tomasz Figa <tfiga@chromium.org> wrote:
>
> On Fri, Oct 30, 2020 at 11:08 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> >
> > This is used by media/videobuf2 for persistent dma mappings, not just
> > for a single dma operation and then freed again, so needs
> > FOLL_LONGTERM.
> >
> > Unfortunately current pup_locked doesn't support FOLL_LONGTERM due to
> > locking issues. Rework the code to pull the pup path out from the
> > mmap_sem critical section as suggested by Jason.
> >
> > By relying entirely on the vma checks in pin_user_pages and follow_pfn
> > (for vm_flags and vma_is_fsdax) we can also streamline the code a lot.
> >
> > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > Cc: Jason Gunthorpe <jgg@ziepe.ca>
> > Cc: Pawel Osciak <pawel@osciak.com>
> > Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> > Cc: Kyungmin Park <kyungmin.park@samsung.com>
> > Cc: Tomasz Figa <tfiga@chromium.org>
> > Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: John Hubbard <jhubbard@nvidia.com>
> > Cc: Jérôme Glisse <jglisse@redhat.com>
> > Cc: Jan Kara <jack@suse.cz>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: linux-mm@kvack.org
> > Cc: linux-arm-kernel@lists.infradead.org
> > Cc: linux-samsung-soc@vger.kernel.org
> > Cc: linux-media@vger.kernel.org
> > Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> > --
> > v2: Streamline the code and further simplify the loop checks (Jason)
> >
> > v5: Review from Tomasz:
> > - fix page counting for the follow_pfn case by resetting ret
> > - drop gup_flags parameter, now unused
> > ---
> >  .../media/common/videobuf2/videobuf2-memops.c |  3 +-
> >  include/linux/mm.h                            |  2 +-
> >  mm/frame_vector.c                             | 53 ++++++-------------
> >  3 files changed, 19 insertions(+), 39 deletions(-)
> >
>
> Thanks, looks good to me now.
>
> Acked-by: Tomasz Figa <tfiga@chromium.org>
>
> From reading the code, this is quite unlikely to introduce any
> behavior changes, but just to be safe, did you have a chance to test
> this with some V4L2 driver?

Nah, unfortunately not.
-Daniel

>
> Best regards,
> Tomasz
>
> > diff --git a/drivers/media/common/videobuf2/videobuf2-memops.c b/drivers/media/common/videobuf2/videobuf2-memops.c
> > index 6e9e05153f4e..9dd6c27162f4 100644
> > --- a/drivers/media/common/videobuf2/videobuf2-memops.c
> > +++ b/drivers/media/common/videobuf2/videobuf2-memops.c
> > @@ -40,7 +40,6 @@ struct frame_vector *vb2_create_framevec(unsigned long start,
> >         unsigned long first, last;
> >         unsigned long nr;
> >         struct frame_vector *vec;
> > -       unsigned int flags = FOLL_FORCE | FOLL_WRITE;
> >
> >         first = start >> PAGE_SHIFT;
> >         last = (start + length - 1) >> PAGE_SHIFT;
> > @@ -48,7 +47,7 @@ struct frame_vector *vb2_create_framevec(unsigned long start,
> >         vec = frame_vector_create(nr);
> >         if (!vec)
> >                 return ERR_PTR(-ENOMEM);
> > -       ret = get_vaddr_frames(start & PAGE_MASK, nr, flags, vec);
> > +       ret = get_vaddr_frames(start & PAGE_MASK, nr, vec);
> >         if (ret < 0)
> >                 goto out_destroy;
> >         /* We accept only complete set of PFNs */
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index ef360fe70aaf..d6b8e30dce2e 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1765,7 +1765,7 @@ struct frame_vector {
> >  struct frame_vector *frame_vector_create(unsigned int nr_frames);
> >  void frame_vector_destroy(struct frame_vector *vec);
> >  int get_vaddr_frames(unsigned long start, unsigned int nr_pfns,
> > -                    unsigned int gup_flags, struct frame_vector *vec);
> > +                    struct frame_vector *vec);
> >  void put_vaddr_frames(struct frame_vector *vec);
> >  int frame_vector_to_pages(struct frame_vector *vec);
> >  void frame_vector_to_pfns(struct frame_vector *vec);
> > diff --git a/mm/frame_vector.c b/mm/frame_vector.c
> > index 10f82d5643b6..f8c34b895c76 100644
> > --- a/mm/frame_vector.c
> > +++ b/mm/frame_vector.c
> > @@ -32,13 +32,12 @@
> >   * This function takes care of grabbing mmap_lock as necessary.
> >   */
> >  int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
> > -                    unsigned int gup_flags, struct frame_vector *vec)
> > +                    struct frame_vector *vec)
> >  {
> >         struct mm_struct *mm = current->mm;
> >         struct vm_area_struct *vma;
> >         int ret = 0;
> >         int err;
> > -       int locked;
> >
> >         if (nr_frames == 0)
> >                 return 0;
> > @@ -48,40 +47,26 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
> >
> >         start = untagged_addr(start);
> >
> > -       mmap_read_lock(mm);
> > -       locked = 1;
> > -       vma = find_vma_intersection(mm, start, start + 1);
> > -       if (!vma) {
> > -               ret = -EFAULT;
> > -               goto out;
> > -       }
> > -
> > -       /*
> > -        * While get_vaddr_frames() could be used for transient (kernel
> > -        * controlled lifetime) pinning of memory pages all current
> > -        * users establish long term (userspace controlled lifetime)
> > -        * page pinning. Treat get_vaddr_frames() like
> > -        * get_user_pages_longterm() and disallow it for filesystem-dax
> > -        * mappings.
> > -        */
> > -       if (vma_is_fsdax(vma)) {
> > -               ret = -EOPNOTSUPP;
> > -               goto out;
> > -       }
> > -
> > -       if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) {
> > +       ret = pin_user_pages_fast(start, nr_frames,
> > +                                 FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,
> > +                                 (struct page **)(vec->ptrs));
> > +       if (ret > 0) {
> >                 vec->got_ref = true;
> >                 vec->is_pfns = false;
> > -               ret = pin_user_pages_locked(start, nr_frames,
> > -                       gup_flags, (struct page **)(vec->ptrs), &locked);
> > -               goto out;
> > +               goto out_unlocked;
> >         }
> >
> > +       mmap_read_lock(mm);
> >         vec->got_ref = false;
> >         vec->is_pfns = true;
> > +       ret = 0;
> >         do {
> >                 unsigned long *nums = frame_vector_pfns(vec);
> >
> > +               vma = find_vma_intersection(mm, start, start + 1);
> > +               if (!vma)
> > +                       break;
> > +
> >                 while (ret < nr_frames && start + PAGE_SIZE <= vma->vm_end) {
> >                         err = follow_pfn(vma, start, &nums[ret]);
> >                         if (err) {
> > @@ -92,17 +77,13 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
> >                         start += PAGE_SIZE;
> >                         ret++;
> >                 }
> > -               /*
> > -                * We stop if we have enough pages or if VMA doesn't completely
> > -                * cover the tail page.
> > -                */
> > -               if (ret >= nr_frames || start < vma->vm_end)
> > +               /* Bail out if VMA doesn't completely cover the tail page. */
> > +               if (start < vma->vm_end)
> >                         break;
> > -               vma = find_vma_intersection(mm, start, start + 1);
> > -       } while (vma && vma->vm_flags & (VM_IO | VM_PFNMAP));
> > +       } while (ret < nr_frames);
> >  out:
> > -       if (locked)
> > -               mmap_read_unlock(mm);
> > +       mmap_read_unlock(mm);
> > +out_unlocked:
> >         if (!ret)
> >                 ret = -EFAULT;
> >         if (ret > 0)
> > --
> > 2.28.0
> >

John Hubbard Oct. 31, 2020, 2:55 a.m. UTC | #3

On 10/30/20 3:08 AM, Daniel Vetter wrote:
> This is used by media/videobuf2 for persistent dma mappings, not just
> for a single dma operation and then freed again, so needs
> FOLL_LONGTERM.
> 
> Unfortunately current pup_locked doesn't support FOLL_LONGTERM due to
> locking issues. Rework the code to pull the pup path out from the
> mmap_sem critical section as suggested by Jason.
> 
> By relying entirely on the vma checks in pin_user_pages and follow_pfn

There are vma checks in pin_user_pages(), but this patch changes things
to call pin_user_pages_fast(). And that does not have the vma checks.
More below about this:

> (for vm_flags and vma_is_fsdax) we can also streamline the code a lot.
> 
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Pawel Osciak <pawel@osciak.com>
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Kyungmin Park <kyungmin.park@samsung.com>
> Cc: Tomasz Figa <tfiga@chromium.org>
> Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Jérôme Glisse <jglisse@redhat.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: linux-mm@kvack.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-samsung-soc@vger.kernel.org
> Cc: linux-media@vger.kernel.org
> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> --
> v2: Streamline the code and further simplify the loop checks (Jason)
> 
> v5: Review from Tomasz:
> - fix page counting for the follow_pfn case by resetting ret
> - drop gup_flags parameter, now unused
> ---
>   .../media/common/videobuf2/videobuf2-memops.c |  3 +-
>   include/linux/mm.h                            |  2 +-
>   mm/frame_vector.c                             | 53 ++++++-------------
>   3 files changed, 19 insertions(+), 39 deletions(-)
> 
> diff --git a/drivers/media/common/videobuf2/videobuf2-memops.c b/drivers/media/common/videobuf2/videobuf2-memops.c
> index 6e9e05153f4e..9dd6c27162f4 100644
> --- a/drivers/media/common/videobuf2/videobuf2-memops.c
> +++ b/drivers/media/common/videobuf2/videobuf2-memops.c
> @@ -40,7 +40,6 @@ struct frame_vector *vb2_create_framevec(unsigned long start,
>   	unsigned long first, last;
>   	unsigned long nr;
>   	struct frame_vector *vec;
> -	unsigned int flags = FOLL_FORCE | FOLL_WRITE;
>   
>   	first = start >> PAGE_SHIFT;
>   	last = (start + length - 1) >> PAGE_SHIFT;
> @@ -48,7 +47,7 @@ struct frame_vector *vb2_create_framevec(unsigned long start,
>   	vec = frame_vector_create(nr);
>   	if (!vec)
>   		return ERR_PTR(-ENOMEM);
> -	ret = get_vaddr_frames(start & PAGE_MASK, nr, flags, vec);
> +	ret = get_vaddr_frames(start & PAGE_MASK, nr, vec);
>   	if (ret < 0)
>   		goto out_destroy;
>   	/* We accept only complete set of PFNs */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ef360fe70aaf..d6b8e30dce2e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1765,7 +1765,7 @@ struct frame_vector {
>   struct frame_vector *frame_vector_create(unsigned int nr_frames);
>   void frame_vector_destroy(struct frame_vector *vec);
>   int get_vaddr_frames(unsigned long start, unsigned int nr_pfns,
> -		     unsigned int gup_flags, struct frame_vector *vec);
> +		     struct frame_vector *vec);
>   void put_vaddr_frames(struct frame_vector *vec);
>   int frame_vector_to_pages(struct frame_vector *vec);
>   void frame_vector_to_pfns(struct frame_vector *vec);
> diff --git a/mm/frame_vector.c b/mm/frame_vector.c
> index 10f82d5643b6..f8c34b895c76 100644
> --- a/mm/frame_vector.c
> +++ b/mm/frame_vector.c
> @@ -32,13 +32,12 @@
>    * This function takes care of grabbing mmap_lock as necessary.
>    */
>   int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
> -		     unsigned int gup_flags, struct frame_vector *vec)
> +		     struct frame_vector *vec)
>   {
>   	struct mm_struct *mm = current->mm;
>   	struct vm_area_struct *vma;
>   	int ret = 0;
>   	int err;
> -	int locked;
>   
>   	if (nr_frames == 0)
>   		return 0;
> @@ -48,40 +47,26 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
>   
>   	start = untagged_addr(start);
>   
> -	mmap_read_lock(mm);
> -	locked = 1;
> -	vma = find_vma_intersection(mm, start, start + 1);
> -	if (!vma) {
> -		ret = -EFAULT;
> -		goto out;
> -	}
> -
> -	/*
> -	 * While get_vaddr_frames() could be used for transient (kernel
> -	 * controlled lifetime) pinning of memory pages all current
> -	 * users establish long term (userspace controlled lifetime)
> -	 * page pinning. Treat get_vaddr_frames() like
> -	 * get_user_pages_longterm() and disallow it for filesystem-dax
> -	 * mappings.
> -	 */
> -	if (vma_is_fsdax(vma)) {
> -		ret = -EOPNOTSUPP;
> -		goto out;
> -	}
> -
> -	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) {

By removing this check from this location, and changing from
pin_user_pages_locked() to pin_user_pages_fast(), I *think* we end up
losing the check entirely. Is that intended? If so it could use a comment
somewhere to explain why.

thanks,

Daniel Vetter Oct. 31, 2020, 2:45 p.m. UTC | #4

On Sat, Oct 31, 2020 at 3:55 AM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 10/30/20 3:08 AM, Daniel Vetter wrote:
> > This is used by media/videobuf2 for persistent dma mappings, not just
> > for a single dma operation and then freed again, so needs
> > FOLL_LONGTERM.
> >
> > Unfortunately current pup_locked doesn't support FOLL_LONGTERM due to
> > locking issues. Rework the code to pull the pup path out from the
> > mmap_sem critical section as suggested by Jason.
> >
> > By relying entirely on the vma checks in pin_user_pages and follow_pfn
>
> There are vma checks in pin_user_pages(), but this patch changes things
> to call pin_user_pages_fast(). And that does not have the vma checks.
> More below about this:
>
> > (for vm_flags and vma_is_fsdax) we can also streamline the code a lot.
> >
> > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > Cc: Jason Gunthorpe <jgg@ziepe.ca>
> > Cc: Pawel Osciak <pawel@osciak.com>
> > Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> > Cc: Kyungmin Park <kyungmin.park@samsung.com>
> > Cc: Tomasz Figa <tfiga@chromium.org>
> > Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: John Hubbard <jhubbard@nvidia.com>
> > Cc: Jérôme Glisse <jglisse@redhat.com>
> > Cc: Jan Kara <jack@suse.cz>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: linux-mm@kvack.org
> > Cc: linux-arm-kernel@lists.infradead.org
> > Cc: linux-samsung-soc@vger.kernel.org
> > Cc: linux-media@vger.kernel.org
> > Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> > --
> > v2: Streamline the code and further simplify the loop checks (Jason)
> >
> > v5: Review from Tomasz:
> > - fix page counting for the follow_pfn case by resetting ret
> > - drop gup_flags parameter, now unused
> > ---
> >   .../media/common/videobuf2/videobuf2-memops.c |  3 +-
> >   include/linux/mm.h                            |  2 +-
> >   mm/frame_vector.c                             | 53 ++++++-------------
> >   3 files changed, 19 insertions(+), 39 deletions(-)
> >
> > diff --git a/drivers/media/common/videobuf2/videobuf2-memops.c b/drivers/media/common/videobuf2/videobuf2-memops.c
> > index 6e9e05153f4e..9dd6c27162f4 100644
> > --- a/drivers/media/common/videobuf2/videobuf2-memops.c
> > +++ b/drivers/media/common/videobuf2/videobuf2-memops.c
> > @@ -40,7 +40,6 @@ struct frame_vector *vb2_create_framevec(unsigned long start,
> >       unsigned long first, last;
> >       unsigned long nr;
> >       struct frame_vector *vec;
> > -     unsigned int flags = FOLL_FORCE | FOLL_WRITE;
> >
> >       first = start >> PAGE_SHIFT;
> >       last = (start + length - 1) >> PAGE_SHIFT;
> > @@ -48,7 +47,7 @@ struct frame_vector *vb2_create_framevec(unsigned long start,
> >       vec = frame_vector_create(nr);
> >       if (!vec)
> >               return ERR_PTR(-ENOMEM);
> > -     ret = get_vaddr_frames(start & PAGE_MASK, nr, flags, vec);
> > +     ret = get_vaddr_frames(start & PAGE_MASK, nr, vec);
> >       if (ret < 0)
> >               goto out_destroy;
> >       /* We accept only complete set of PFNs */
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index ef360fe70aaf..d6b8e30dce2e 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1765,7 +1765,7 @@ struct frame_vector {
> >   struct frame_vector *frame_vector_create(unsigned int nr_frames);
> >   void frame_vector_destroy(struct frame_vector *vec);
> >   int get_vaddr_frames(unsigned long start, unsigned int nr_pfns,
> > -                  unsigned int gup_flags, struct frame_vector *vec);
> > +                  struct frame_vector *vec);
> >   void put_vaddr_frames(struct frame_vector *vec);
> >   int frame_vector_to_pages(struct frame_vector *vec);
> >   void frame_vector_to_pfns(struct frame_vector *vec);
> > diff --git a/mm/frame_vector.c b/mm/frame_vector.c
> > index 10f82d5643b6..f8c34b895c76 100644
> > --- a/mm/frame_vector.c
> > +++ b/mm/frame_vector.c
> > @@ -32,13 +32,12 @@
> >    * This function takes care of grabbing mmap_lock as necessary.
> >    */
> >   int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
> > -                  unsigned int gup_flags, struct frame_vector *vec)
> > +                  struct frame_vector *vec)
> >   {
> >       struct mm_struct *mm = current->mm;
> >       struct vm_area_struct *vma;
> >       int ret = 0;
> >       int err;
> > -     int locked;
> >
> >       if (nr_frames == 0)
> >               return 0;
> > @@ -48,40 +47,26 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
> >
> >       start = untagged_addr(start);
> >
> > -     mmap_read_lock(mm);
> > -     locked = 1;
> > -     vma = find_vma_intersection(mm, start, start + 1);
> > -     if (!vma) {
> > -             ret = -EFAULT;
> > -             goto out;
> > -     }
> > -
> > -     /*
> > -      * While get_vaddr_frames() could be used for transient (kernel
> > -      * controlled lifetime) pinning of memory pages all current
> > -      * users establish long term (userspace controlled lifetime)
> > -      * page pinning. Treat get_vaddr_frames() like
> > -      * get_user_pages_longterm() and disallow it for filesystem-dax
> > -      * mappings.
> > -      */
> > -     if (vma_is_fsdax(vma)) {
> > -             ret = -EOPNOTSUPP;
> > -             goto out;
> > -     }
> > -
> > -     if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) {
>
> By removing this check from this location, and changing from
> pin_user_pages_locked() to pin_user_pages_fast(), I *think* we end up
> losing the check entirely. Is that intended? If so it could use a comment
> somewhere to explain why.

Yeah this wasn't intentional. I think I needed to drop the _locked
version to prep for FOLL_LONGTERM, and figured _fast is always better.
But I didn't realize that _fast doesn't have the vma checks, gup.c got
me a bit confused.

I'll remedy this in all the patches where this applies (because a
VM_IO | VM_PFNMAP can point at struct page backed memory, and that
exact use-case is what we want to stop with the unsafe_follow_pfn work
since it wrecks things like cma or security).

Aside: I do wonder whether the lack of that check isn't a problem.
VM_IO | VM_PFNMAP generally means driver managed, which means the
driver isn't going to consult the page pin count or anything like that
(at least not necessarily) when revoking or moving that memory, since
we're assuming it's totally under driver control. So if pup_fast can
get into such a mapping, we might have a problem.
-Daniel
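
The check under discussion can be illustrated with a small userspace model. This is a sketch only: `model_vma`, `MODEL_VM_IO`, `MODEL_VM_PFNMAP` and `model_vma_allows_page_pin()` are hypothetical names, and the flag values are placeholders rather than the kernel's definitions. It mirrors the condition the patch removed, `!(vma->vm_flags & (VM_IO | VM_PFNMAP))`, which allowed the page-pinning path only for mappings that are not driver managed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder flag bits for illustration; not the kernel's values. */
#define MODEL_VM_IO	0x1UL
#define MODEL_VM_PFNMAP	0x2UL

/* Hypothetical stand-in for the one field of vm_area_struct used here. */
struct model_vma {
	unsigned long vm_flags;
};

/*
 * True when the mapping carries neither flag, i.e. when the page-based
 * pin path would have been taken by the pre-patch code.
 */
bool model_vma_allows_page_pin(const struct model_vma *vma)
{
	assert(vma != NULL);
	return !(vma->vm_flags & (MODEL_VM_IO | MODEL_VM_PFNMAP));
}
```

A mapping with either flag set fails the predicate, which is the case the thread worries the `_fast` variant no longer filters out.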

> thanks,
> --
> John Hubbard
> NVIDIA
>
> > +     ret = pin_user_pages_fast(start, nr_frames,
> > +                               FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,
> > +                               (struct page **)(vec->ptrs));
> > +     if (ret > 0) {
> >               vec->got_ref = true;
> >               vec->is_pfns = false;
> > -             ret = pin_user_pages_locked(start, nr_frames,
> > -                     gup_flags, (struct page **)(vec->ptrs), &locked);
> > -             goto out;
> > +             goto out_unlocked;
> >       }
> >
> > +     mmap_read_lock(mm);
> >       vec->got_ref = false;
> >       vec->is_pfns = true;
> > +     ret = 0;
> >       do {
> >               unsigned long *nums = frame_vector_pfns(vec);
> >
> > +             vma = find_vma_intersection(mm, start, start + 1);
> > +             if (!vma)
> > +                     break;
> > +
> >               while (ret < nr_frames && start + PAGE_SIZE <= vma->vm_end) {
> >                       err = follow_pfn(vma, start, &nums[ret]);
> >                       if (err) {
> > @@ -92,17 +77,13 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
> >                       start += PAGE_SIZE;
> >                       ret++;
> >               }
> > -             /*
> > -              * We stop if we have enough pages or if VMA doesn't completely
> > -              * cover the tail page.
> > -              */
> > -             if (ret >= nr_frames || start < vma->vm_end)
> > +             /* Bail out if VMA doesn't completely cover the tail page. */
> > +             if (start < vma->vm_end)
> >                       break;
> > -             vma = find_vma_intersection(mm, start, start + 1);
> > -     } while (vma && vma->vm_flags & (VM_IO | VM_PFNMAP));
> > +     } while (ret < nr_frames);
> >   out:
> > -     if (locked)
> > -             mmap_read_unlock(mm);
> > +     mmap_read_unlock(mm);
> > +out_unlocked:
> >       if (!ret)
> >               ret = -EFAULT;
> >       if (ret > 0)
> >
>
>

John Hubbard Nov. 1, 2020, 5:22 a.m. UTC | #5

On 10/31/20 7:45 AM, Daniel Vetter wrote:
> On Sat, Oct 31, 2020 at 3:55 AM John Hubbard <jhubbard@nvidia.com> wrote:
>> On 10/30/20 3:08 AM, Daniel Vetter wrote:
...
>> By removing this check from this location, and changing from
>> pin_user_pages_locked() to pin_user_pages_fast(), I *think* we end up
>> losing the check entirely. Is that intended? If so it could use a comment
>> somewhere to explain why.
> 
> Yeah this wasn't intentional. I think I needed to drop the _locked
> version to prep for FOLL_LONGTERM, and figured _fast is always better.
> But I didn't realize that _fast doesn't have the vma checks, gup.c got
> me a bit confused.

Actually, I thought that the change to _fast was a very nice touch, btw.

> 
> I'll remedy this in all the patches where this applies (because a
> VM_IO | VM_PFNMAP can point at struct page backed memory, and that
> exact use-case is what we want to stop with the unsafe_follow_pfn work
> since it wrecks things like cma or security).
> 
> Aside: I do wonder whether the lack of that check isn't a problem.
> VM_IO | VM_PFNMAP generally means driver managed, which means the
> driver isn't going to consult the page pin count or anything like that
> (at least not necessarily) when revoking or moving that memory, since
> we're assuming it's totally under driver control. So if pup_fast can
> get into such a mapping, we might have a problem.
> -Daniel
>

Yes. I don't know why that check is missing from the _fast path.
Probably just an oversight, seeing as how it's in the slow path. Maybe
the appropriate response here is to add a separate patch that adds the
check.

I wonder if I'm overlooking something, but it certainly seems correct to
do that.

  thanks,

Daniel Vetter Nov. 1, 2020, 10:30 a.m. UTC | #6

On Sun, Nov 1, 2020 at 6:22 AM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 10/31/20 7:45 AM, Daniel Vetter wrote:
> > On Sat, Oct 31, 2020 at 3:55 AM John Hubbard <jhubbard@nvidia.com> wrote:
> >> On 10/30/20 3:08 AM, Daniel Vetter wrote:
> ...
> >> By removing this check from this location, and changing from
> >> pin_user_pages_locked() to pin_user_pages_fast(), I *think* we end up
> >> losing the check entirely. Is that intended? If so it could use a comment
> >> somewhere to explain why.
> >
> > Yeah this wasn't intentional. I think I needed to drop the _locked
> > version to prep for FOLL_LONGTERM, and figured _fast is always better.
> > But I didn't realize that _fast doesn't have the vma checks, gup.c got
> > me a bit confused.
>
> Actually, I thought that the change to _fast was a very nice touch, btw.
>
> >
> > I'll remedy this in all the patches where this applies (because a
> > VM_IO | VM_PFNMAP can point at struct page backed memory, and that
> > exact use-case is what we want to stop with the unsafe_follow_pfn work
> > since it wrecks things like cma or security).
> >
> > Aside: I do wonder whether the lack of that check isn't a problem.
> > VM_IO | VM_PFNMAP generally means driver managed, which means the
> > driver isn't going to consult the page pin count or anything like that
> > (at least not necessarily) when revoking or moving that memory, since
> > we're assuming it's totally under driver control. So if pup_fast can
> > get into such a mapping, we might have a problem.
> > -Daniel
> >
>
> Yes. I don't know why that check is missing from the _fast path.
> Probably just an oversight, seeing as how it's in the slow path. Maybe
> the appropriate response here is to add a separate patch that adds the
> check.
>
> I wonder if I'm overlooking something, but it certainly seems correct to
> do that.

You'll need the mmap_sem to get at the vma to be able to do this
check. If you add that to _fast, you made it as fast as the slow one.
Plus there's _fast_only due to locking recursion issues in fast-paths
(I assume, I didn't check all the callers).

I'm just wondering whether we have a bug somewhere with device
drivers. For CMA regions we always check in try_grab_page, but for dax
I'm not seeing where the checks in the _fast fastpaths are, and that
all still leaves random device driver mappings behind which aren't
backed by CMA but still point to something with a struct page behind
it. I'm probably just missing something, but no idea what.
-Daniel

John Hubbard Nov. 1, 2020, 9:13 p.m. UTC | #7

On 11/1/20 2:30 AM, Daniel Vetter wrote:
> On Sun, Nov 1, 2020 at 6:22 AM John Hubbard <jhubbard@nvidia.com> wrote:
>>
>> On 10/31/20 7:45 AM, Daniel Vetter wrote:
>>> On Sat, Oct 31, 2020 at 3:55 AM John Hubbard <jhubbard@nvidia.com> wrote:
>>>> On 10/30/20 3:08 AM, Daniel Vetter wrote:
>> ...
>>>> By removing this check from this location, and changing from
>>>> pin_user_pages_locked() to pin_user_pages_fast(), I *think* we end up
>>>> losing the check entirely. Is that intended? If so it could use a comment
>>>> somewhere to explain why.
>>>
>>> Yeah this wasn't intentional. I think I needed to drop the _locked
>>> version to prep for FOLL_LONGTERM, and figured _fast is always better.
>>> But I didn't realize that _fast doesn't have the vma checks, gup.c got
>>> me a bit confused.
>>
>> Actually, I thought that the change to _fast was a very nice touch, btw.
>>
>>>
>>> I'll remedy this in all the patches where this applies (because a
>>> VM_IO | VM_PFNMAP can point at struct page backed memory, and that
>>> exact use-case is what we want to stop with the unsafe_follow_pfn work
>>> since it wrecks things like cma or security).
>>>
>>> Aside: I do wonder whether the lack of that check isn't a problem.
>>> VM_IO | VM_PFNMAP generally means driver managed, which means the
>>> driver isn't going to consult the page pin count or anything like that
>>> (at least not necessarily) when revoking or moving that memory, since
>>> we're assuming it's totally under driver control. So if pup_fast can
>>> get into such a mapping, we might have a problem.
>>> -Daniel
>>>
>>
>> Yes. I don't know why that check is missing from the _fast path.
>> Probably just an oversight, seeing as how it's in the slow path. Maybe
>> the appropriate response here is to add a separate patch that adds the
>> check.
>>
>> I wonder if I'm overlooking something, but it certainly seems correct to
>> do that.
> 
> You'll need the mmap_sem to get at the vma to be able to do this
> check. If you add that to _fast, you made it as fast as the slow one.

Arggh, yes of course. Strike that, please. :)

> Plus there's _fast_only due to locking recursion issues in fast-paths
> (I assume, I didn't check all the callers).
> 
> I'm just wondering whether we have a bug somewhere with device
> drivers. For CMA regions we always check in try_grab_page, but for dax

OK, so here you're talking about a different bug than the VM_IO | VM_PFNMAP
pages, I think. This is about the "FOLL_LONGTERM + CMA + gup/pup _fast"
combination that is not allowed, right?

For that: try_grab_page() doesn't check anything, but try_grab_compound_head()
does, but only for pup_fast, not gup_fast. That was added by commit
df3a0a21b698d ("mm/gup: fix omission of check on FOLL_LONGTERM in gup fast
path") in April.

I recall that the patch was just plugging a very specific hole, as opposed
to locking down the API against mistakes or confused callers. And it does
seem that there are some holes.

> I'm not seeing where the checks in the _fast fastpaths are, and that
> all still leaves random device driver mappings behind which aren't
> backed by CMA but still point to something with a struct page behind
> it. I'm probably just missing something, but no idea what.
> -Daniel
> 

Certainly we've established that we can't check VMA flags by that time,
so I'm not sure that there is much we can check by the time we get to
gup/pup _fast. Seems like the device drivers have to avoid calling _fast
with pages that live in VM_IO | VM_PFNMAP, by design, right? Or maybe
you're talking about CMA checks only?


thanks,

Daniel Vetter Nov. 1, 2020, 10:50 p.m. UTC | #8

On Sun, Nov 1, 2020 at 10:13 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 11/1/20 2:30 AM, Daniel Vetter wrote:
> > On Sun, Nov 1, 2020 at 6:22 AM John Hubbard <jhubbard@nvidia.com> wrote:
> >>
> >> On 10/31/20 7:45 AM, Daniel Vetter wrote:
> >>> On Sat, Oct 31, 2020 at 3:55 AM John Hubbard <jhubbard@nvidia.com> wrote:
> >>>> On 10/30/20 3:08 AM, Daniel Vetter wrote:
> >> ...
> >>>> By removing this check from this location, and changing from
> >>>> pin_user_pages_locked() to pin_user_pages_fast(), I *think* we end up
> >>>> losing the check entirely. Is that intended? If so it could use a comment
> >>>> somewhere to explain why.
> >>>
> >>> Yeah this wasn't intentional. I think I needed to drop the _locked
> >>> version to prep for FOLL_LONGTERM, and figured _fast is always better.
> >>> But I didn't realize that _fast doesn't have the vma checks, gup.c got
> >>> me a bit confused.
> >>
> >> Actually, I thought that the change to _fast was a very nice touch, btw.
> >>
> >>>
> >>> I'll remedy this in all the patches where this applies (because a
> >>> VM_IO | VM_PFNMAP can point at struct page backed memory, and that
> >>> exact use-case is what we want to stop with the unsafe_follow_pfn work
> >>> since it wrecks things like cma or security).
> >>>
> >>> Aside: I do wonder whether the lack of that check isn't a problem.
> >>> VM_IO | VM_PFNMAP generally means driver managed, which means the
> >>> driver isn't going to consult the page pin count or anything like that
> >>> (at least not necessarily) when revoking or moving that memory, since
> >>> we're assuming it's totally under driver control. So if pup_fast can
> >>> get into such a mapping, we might have a problem.
> >>> -Daniel
> >>>
> >>
> >> Yes. I don't know why that check is missing from the _fast path.
> >> Probably just an oversight, seeing as how it's in the slow path. Maybe
> >> the appropriate response here is to add a separate patch that adds the
> >> check.
> >>
> >> I wonder if I'm overlooking something, but it certainly seems correct to
> >> do that.
> >
> > You'll need the mmap_sem to get at the vma to be able to do this
> > check. If you add that to _fast, you made it as fast as the slow one.
>
> Arggh, yes of course. Strike that, please. :)
>
> > Plus there's _fast_only due to locking recursion issues in fast-paths
> > (I assume, I didn't check all the callers).
> >
> > I'm just wondering whether we have a bug somewhere with device
> > drivers. For CMA regions we always check in try_grab_page, but for dax
>
> OK, so here you're talking about a different bug than the VM_IO | VM_PFNMAP
> pages, I think. This is about the "FOLL_LONGTERM + CMA + gup/pup _fast"
> combination that is not allowed, right?

Yeah sorry, I got distracted reading code and noticed we might have
another issue.

> For that: try_grab_page() doesn't check anything, but try_grab_compound_head()
> does, but only for pup_fast, not gup_fast. That was added by commit
> df3a0a21b698d ("mm/gup: fix omission of check on FOLL_LONGTERM in gup fast
> path") in April.
>
> I recall that the patch was just plugging a very specific hole, as opposed
> to locking down the API against mistakes or confused callers. And it does
> seem that there are some holes.

Yup that's the one I've found.

> > I'm not seeing where the checks in the _fast fastpaths are, and that
> > all still leaves random device driver mappings behind which aren't
> > backed by CMA but still point to something with a struct page behind
> > it. I'm probably just missing something, but no idea what.
> > -Daniel
> >
>
> Certainly we've established that we can't check VMA flags by that time,
> so I'm not sure that there is much we can check by the time we get to
> gup/pup _fast. Seems like the device drivers have to avoid calling _fast
> with pages that live in VM_IO | VM_PFNMAP, by design, right? Or maybe
> you're talking about CMA checks only?

It's not device drivers, but everyone else. At least my understanding
is that VM_IO | VM_PFNMAP means "even if it happens to be backed by a
struct page, do not treat it like normal memory". And gup/pup_fast
happily break that. I tried to chase the history of that test, didn't
turn up anything I understood much:

commit 1ff8038988adecfde71d82c0597727fc239d4e8c
Author: Linus Torvalds <torvalds@g5.osdl.org>
Date:   Mon Dec 12 16:24:33 2005 -0800

   get_user_pages: don't try to follow PFNMAP pages

   Nick Piggin points out that a few drivers play games with VM_IO (why?
   who knows..) and thus a pfn-remapped area may not have that bit set even
   if remap_pfn_range() set it originally.

   So make it explicit in get_user_pages() that we don't follow VM_PFNMAP
   pages, since pretty much by definition they do not have a "struct page"
   associated with them.

   Signed-off-by: Linus Torvalds <torvalds@osdl.org>

diff --git a/mm/memory.c b/mm/memory.c
index 47c533eaa072..d22f78c8a381 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1009,7 +1009,7 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
                       continue;
               }

-               if (!vma || (vma->vm_flags & VM_IO)
+               if (!vma || (vma->vm_flags & (VM_IO | VM_PFNMAP))
                               || !(vm_flags & vma->vm_flags))
                       return i ? : -EFAULT;


The VM_IO check is kinda lost in pre-history.

tbh I have no idea what the various variants of pup/gup are supposed
to be doing vs. these VMA flags in the various cases. Just smells a
bit like potential trouble due to randomly pinning stuff without the
owner of that memory having an idea what's going on.
-Daniel
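
For illustration, the condition from that 2005 hunk can be modeled in
userspace roughly as below. This is only a sketch: the MODEL_* flag values,
struct model_vma and model_slow_path_check() are made-up names for this
example, not kernel definitions, and the real code returns the number of
pages already collected ("i ? : -EFAULT") rather than always failing.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Illustrative flag values for this userspace model only; the kernel's
 * real VM_* definitions live in include/linux/mm.h.
 */
#define MODEL_VM_READ   0x1UL
#define MODEL_VM_WRITE  0x2UL
#define MODEL_VM_IO     0x4UL
#define MODEL_VM_PFNMAP 0x8UL

/* Hypothetical stand-in for struct vm_area_struct. */
struct model_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	unsigned long vm_flags;
};

/*
 * Models the slow-path test: refuse when there is no vma, when the vma is
 * marked VM_IO or VM_PFNMAP, or when none of the wanted access bits are
 * permitted by the vma.
 */
static int model_slow_path_check(const struct model_vma *vma,
				 unsigned long wanted)
{
	if (!vma || (vma->vm_flags & (MODEL_VM_IO | MODEL_VM_PFNMAP))
		 || !(wanted & vma->vm_flags))
		return -EFAULT;
	return 0;
}
```

The point of the model is that this test needs the vma in hand, which is
what the lockless _fast path does not have.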

Tomasz Figa Nov. 2, 2020, 6:19 p.m. UTC | #9

On Fri, Oct 30, 2020 at 3:38 PM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> On Fri, Oct 30, 2020 at 3:11 PM Tomasz Figa <tfiga@chromium.org> wrote:
> >
> > On Fri, Oct 30, 2020 at 11:08 AM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > >
> > > > This is used by media/videobuf2 for persistent dma mappings, not just
> > > for a single dma operation and then freed again, so needs
> > > FOLL_LONGTERM.
> > >
> > > Unfortunately current pup_locked doesn't support FOLL_LONGTERM due to
> > > locking issues. Rework the code to pull the pup path out from the
> > > mmap_sem critical section as suggested by Jason.
> > >
> > > By relying entirely on the vma checks in pin_user_pages and follow_pfn
> > > (for vm_flags and vma_is_fsdax) we can also streamline the code a lot.
> > >
> > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > > Cc: Jason Gunthorpe <jgg@ziepe.ca>
> > > Cc: Pawel Osciak <pawel@osciak.com>
> > > Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> > > Cc: Kyungmin Park <kyungmin.park@samsung.com>
> > > Cc: Tomasz Figa <tfiga@chromium.org>
> > > Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Cc: John Hubbard <jhubbard@nvidia.com>
> > > Cc: Jérôme Glisse <jglisse@redhat.com>
> > > Cc: Jan Kara <jack@suse.cz>
> > > Cc: Dan Williams <dan.j.williams@intel.com>
> > > Cc: linux-mm@kvack.org
> > > Cc: linux-arm-kernel@lists.infradead.org
> > > Cc: linux-samsung-soc@vger.kernel.org
> > > Cc: linux-media@vger.kernel.org
> > > Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> > > --
> > > v2: Streamline the code and further simplify the loop checks (Jason)
> > >
> > > v5: Review from Tomasz:
> > > - fix page counting for the follow_pfn case by resetting ret
> > > > - drop gup_flags parameter, now unused
> > > ---
> > >  .../media/common/videobuf2/videobuf2-memops.c |  3 +-
> > >  include/linux/mm.h                            |  2 +-
> > >  mm/frame_vector.c                             | 53 ++++++-------------
> > >  3 files changed, 19 insertions(+), 39 deletions(-)
> > >
> >
> > Thanks, looks good to me now.
> >
> > Acked-by: Tomasz Figa <tfiga@chromium.org>
> >
> > From reading the code, this is quite unlikely to introduce any
> > behavior changes, but just to be safe, did you have a chance to test
> > this with some V4L2 driver?
>
> Nah, unfortunately not.

I believe we don't have any setup that could exercise the IO/PFNMAP
user pointers, but it should be possible to exercise the basic userptr
path by enabling the virtual (fake) video driver, vivid or
CONFIG_VIDEO_VIVID, in your kernel and then using yavta [1] with
--userptr and --capture=<number of frames> (and possibly some more
options) to grab a couple of frames from the test pattern generator.

Does it sound like something that you could give a try? Feel free to
ping me on IRC (tfiga on #v4l or #dri-devel) if you need any help.

[1] https://git.ideasonboard.org/yavta.git

Best regards,
Tomasz

> -Daniel
>
> >
> > Best regards,
> > Tomasz
> >
> > > diff --git a/drivers/media/common/videobuf2/videobuf2-memops.c b/drivers/media/common/videobuf2/videobuf2-memops.c
> > > index 6e9e05153f4e..9dd6c27162f4 100644
> > > --- a/drivers/media/common/videobuf2/videobuf2-memops.c
> > > +++ b/drivers/media/common/videobuf2/videobuf2-memops.c
> > > @@ -40,7 +40,6 @@ struct frame_vector *vb2_create_framevec(unsigned long start,
> > >         unsigned long first, last;
> > >         unsigned long nr;
> > >         struct frame_vector *vec;
> > > -       unsigned int flags = FOLL_FORCE | FOLL_WRITE;
> > >
> > >         first = start >> PAGE_SHIFT;
> > >         last = (start + length - 1) >> PAGE_SHIFT;
> > > @@ -48,7 +47,7 @@ struct frame_vector *vb2_create_framevec(unsigned long start,
> > >         vec = frame_vector_create(nr);
> > >         if (!vec)
> > >                 return ERR_PTR(-ENOMEM);
> > > -       ret = get_vaddr_frames(start & PAGE_MASK, nr, flags, vec);
> > > +       ret = get_vaddr_frames(start & PAGE_MASK, nr, vec);
> > >         if (ret < 0)
> > >                 goto out_destroy;
> > >         /* We accept only complete set of PFNs */
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index ef360fe70aaf..d6b8e30dce2e 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -1765,7 +1765,7 @@ struct frame_vector {
> > >  struct frame_vector *frame_vector_create(unsigned int nr_frames);
> > >  void frame_vector_destroy(struct frame_vector *vec);
> > >  int get_vaddr_frames(unsigned long start, unsigned int nr_pfns,
> > > -                    unsigned int gup_flags, struct frame_vector *vec);
> > > +                    struct frame_vector *vec);
> > >  void put_vaddr_frames(struct frame_vector *vec);
> > >  int frame_vector_to_pages(struct frame_vector *vec);
> > >  void frame_vector_to_pfns(struct frame_vector *vec);
> > > diff --git a/mm/frame_vector.c b/mm/frame_vector.c
> > > index 10f82d5643b6..f8c34b895c76 100644
> > > --- a/mm/frame_vector.c
> > > +++ b/mm/frame_vector.c
> > > @@ -32,13 +32,12 @@
> > >   * This function takes care of grabbing mmap_lock as necessary.
> > >   */
> > >  int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
> > > -                    unsigned int gup_flags, struct frame_vector *vec)
> > > +                    struct frame_vector *vec)
> > >  {
> > >         struct mm_struct *mm = current->mm;
> > >         struct vm_area_struct *vma;
> > >         int ret = 0;
> > >         int err;
> > > -       int locked;
> > >
> > >         if (nr_frames == 0)
> > >                 return 0;
> > > @@ -48,40 +47,26 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
> > >
> > >         start = untagged_addr(start);
> > >
> > > -       mmap_read_lock(mm);
> > > -       locked = 1;
> > > -       vma = find_vma_intersection(mm, start, start + 1);
> > > -       if (!vma) {
> > > -               ret = -EFAULT;
> > > -               goto out;
> > > -       }
> > > -
> > > -       /*
> > > -        * While get_vaddr_frames() could be used for transient (kernel
> > > -        * controlled lifetime) pinning of memory pages all current
> > > -        * users establish long term (userspace controlled lifetime)
> > > -        * page pinning. Treat get_vaddr_frames() like
> > > -        * get_user_pages_longterm() and disallow it for filesystem-dax
> > > -        * mappings.
> > > -        */
> > > -       if (vma_is_fsdax(vma)) {
> > > -               ret = -EOPNOTSUPP;
> > > -               goto out;
> > > -       }
> > > -
> > > -       if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) {
> > > +       ret = pin_user_pages_fast(start, nr_frames,
> > > +                                 FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,
> > > +                                 (struct page **)(vec->ptrs));
> > > +       if (ret > 0) {
> > >                 vec->got_ref = true;
> > >                 vec->is_pfns = false;
> > > -               ret = pin_user_pages_locked(start, nr_frames,
> > > -                       gup_flags, (struct page **)(vec->ptrs), &locked);
> > > -               goto out;
> > > +               goto out_unlocked;
> > >         }
> > >
> > > +       mmap_read_lock(mm);
> > >         vec->got_ref = false;
> > >         vec->is_pfns = true;
> > > +       ret = 0;
> > >         do {
> > >                 unsigned long *nums = frame_vector_pfns(vec);
> > >
> > > +               vma = find_vma_intersection(mm, start, start + 1);
> > > +               if (!vma)
> > > +                       break;
> > > +
> > >                 while (ret < nr_frames && start + PAGE_SIZE <= vma->vm_end) {
> > >                         err = follow_pfn(vma, start, &nums[ret]);
> > >                         if (err) {
> > > @@ -92,17 +77,13 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
> > >                         start += PAGE_SIZE;
> > >                         ret++;
> > >                 }
> > > -               /*
> > > -                * We stop if we have enough pages or if VMA doesn't completely
> > > -                * cover the tail page.
> > > -                */
> > > -               if (ret >= nr_frames || start < vma->vm_end)
> > > +               /* Bail out if VMA doesn't completely cover the tail page. */
> > > +               if (start < vma->vm_end)
> > >                         break;
> > > -               vma = find_vma_intersection(mm, start, start + 1);
> > > -       } while (vma && vma->vm_flags & (VM_IO | VM_PFNMAP));
> > > +       } while (ret < nr_frames);
> > >  out:
> > > -       if (locked)
> > > -               mmap_read_unlock(mm);
> > > +       mmap_read_unlock(mm);
> > > +out_unlocked:
> > >         if (!ret)
> > >                 ret = -EFAULT;
> > >         if (ret > 0)
> > > --
> > > 2.28.0
> > >
>
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
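
The reworked follow_pfn loop in the quoted diff can be sketched as a small
userspace model. Everything named model_* below is hypothetical and exists
only for this example. The real follow_pfn() can fail and rejects vmas that
are not VM_IO | VM_PFNMAP; this sketch leaves both out and assumes a
linear pfn mapping per vma.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL

/* Hypothetical stand-in for a pfn-mapped vma. */
struct model_vma {
	unsigned long vm_start;
	unsigned long vm_end;	/* exclusive */
	unsigned long pfn_base;	/* pfn mapped at vm_start */
};

/* Models find_vma_intersection(mm, addr, addr + 1). */
static const struct model_vma *model_find_vma(const struct model_vma *vmas,
					      size_t nr_vmas,
					      unsigned long addr)
{
	for (size_t i = 0; i < nr_vmas; i++)
		if (addr >= vmas[i].vm_start && addr < vmas[i].vm_end)
			return &vmas[i];
	return NULL;
}

/*
 * Same loop shape as the patched get_vaddr_frames(): look up the vma at
 * the top of every iteration, collect pfns while whole pages fit, bail
 * out if the vma does not completely cover the tail page.
 */
static int model_get_pfns(const struct model_vma *vmas, size_t nr_vmas,
			  unsigned long start, unsigned int nr_frames,
			  unsigned long *nums)
{
	int ret = 0;

	assert(vmas != NULL && nums != NULL);
	if (nr_frames == 0)
		return 0;

	do {
		const struct model_vma *vma =
			model_find_vma(vmas, nr_vmas, start);

		if (!vma)
			break;

		while ((unsigned int)ret < nr_frames &&
		       start + MODEL_PAGE_SIZE <= vma->vm_end) {
			nums[ret] = vma->pfn_base +
				(start - vma->vm_start) / MODEL_PAGE_SIZE;
			start += MODEL_PAGE_SIZE;
			ret++;
		}
		/* Bail out if VMA doesn't completely cover the tail page. */
		if (start < vma->vm_end)
			break;
	} while ((unsigned int)ret < nr_frames);

	return ret ? ret : -EFAULT;
}
```

In this model a request that runs into an unmapped gap returns the partial
count, which matches the caller's "We accept only complete set of PFNs"
check having to compare the result against the requested number.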

Jason Gunthorpe Nov. 4, 2020, 2 p.m. UTC | #10

On Sun, Nov 01, 2020 at 11:50:39PM +0100, Daniel Vetter wrote:

> It's not device drivers, but everyone else. At least my understanding
> is that VM_IO | VM_PFNMAP means "even if it happens to be backed by a
> struct page, do not treat it like normal memory". And gup/pup_fast
> happily break that. I tried to chase the history of that test, didn't
> turn up anything I understood much:

VM_IO isn't supposed to have struct pages, so how can gup_fast return
them?

I thought some magic in the PTE flags excluded this?

Jason

Daniel Vetter Nov. 4, 2020, 3:54 p.m. UTC | #11

On Wed, Nov 4, 2020 at 3:00 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Sun, Nov 01, 2020 at 11:50:39PM +0100, Daniel Vetter wrote:
>
> > It's not device drivers, but everyone else. At least my understanding
> > is that VM_IO | VM_PFNMAP means "even if it happens to be backed by a
> > struct page, do not treat it like normal memory". And gup/pup_fast
> > happily break that. I tried to chase the history of that test, didn't
> > turn up anything I understood much:
>
> VM_IO isn't supposed to have struct pages, so how can gup_fast return
> them?
>
> I thought some magic in the PTE flags excluded this?

I don't really have a box here, but dma_mmap_attrs() and friends to
mmap dma_alloc_coherent memory is set up as VM_IO | VM_PFNMAP (it's
actually enforced since underneath it uses remap_pfn_range), and
usually (except if it's pre-cma carveout) that's just normal struct
page backed memory. Sometimes from a cma region (so will be caught by
the cma page check), but if you have an iommu to make it
device-contiguous, that's not needed.

I think only some architectures have a special io pte flag, and those
are only used for real mmio access. And I think the popular ones all
don't. But that stuff is really not my expertise, just some drive-by
reading I've done to understand how the pci mmap stuff works (which is
special in yet other ways I think).

So probably I'm missing something, but I'm not seeing anything that
prevents this from coming out of a pup/gup_fast.
-Daniel

Daniel Vetter Nov. 4, 2020, 4:26 p.m. UTC | #12

On Wed, Nov 4, 2020 at 5:21 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Nov 04, 2020 at 04:54:19PM +0100, Daniel Vetter wrote:
> > I don't really have a box here, but dma_mmap_attrs() and friends to
> > mmap dma_alloc_coherent memory is set up as VM_IO | VM_PFNMAP (it's
> > actually enforced since underneath it uses remap_pfn_range), and
> > usually (except if it's pre-cma carveout) that's just normal struct
> > page backed memory. Sometimes from a cma region (so will be caught by
> > the cma page check), but if you have an iommu to make it
> > device-contiguous, that's not needed.
>
> dma_mmap_* memory may or may not be page backed, but it absolutely
> must not be resolved by get_user_pages and friends as it is special.
> So yes, not being able to get a struct page back from such an mmap is
> a feature.

Yes, that's clear.

What we're discussing is whether gup_fast and pup_fast also obey this,
or fall over and can give you the struct page that's backing the
dma_mmap_* memory. Since the _fast variant doesn't check for
vma->vm_flags, and afaict that's the only thing which closes this gap.
And like you restate, that would be a bit of a problem. So where's that
check which Jason&me aren't spotting?
-Daniel

Jason Gunthorpe Nov. 4, 2020, 6:17 p.m. UTC | #13

On Wed, Nov 04, 2020 at 04:41:19PM +0000, Christoph Hellwig wrote:
> On Wed, Nov 04, 2020 at 04:37:58PM +0000, Christoph Hellwig wrote:
> > On Wed, Nov 04, 2020 at 05:26:58PM +0100, Daniel Vetter wrote:
> > > What we're discussing is whether gup_fast and pup_fast also obey this,
> > > or fall over and can give you the struct page that's backing the
> > > dma_mmap_* memory. Since the _fast variant doesn't check for
> > > vma->vm_flags, and afaict that's the only thing which closes this gap.
> > > And like you restate, that would be a bit of a problem. So where's that
> > > check which Jason&me aren't spotting?
> > 
> > remap_pte_range uses pte_mkspecial to set up the PTEs, and gup_pte_range
> > errors out on pte_special.  Of course this only works for the
> > CONFIG_ARCH_HAS_PTE_SPECIAL case, for other architectures we do have
> > a real problem.
> 
> Except that we don't really support pte-level gup-fast without
> CONFIG_ARCH_HAS_PTE_SPECIAL, and in fact all architectures selecting
> HAVE_FAST_GUP also select ARCH_HAS_PTE_SPECIAL, so we should be fine.

Mm, I thought it was probably the special flag..

Knowing that CONFIG_HAVE_FAST_GUP can't be set without
CONFIG_ARCH_HAS_PTE_SPECIAL is pretty insightful, can we put that in
the Kconfig?

config HAVE_FAST_GUP
        depends on MMU
        depends on ARCH_HAS_PTE_SPECIAL
        bool

?

Jason
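
A rough userspace model of the pte_special() mechanism described in the
quoted explanation, as a sketch only: struct model_pte and the model_*
helpers are invented for this example and stand in for remap_pte_range()
marking ptes special and gup_pte_range() refusing them. The real fast path
also handles huge pages, refcounts and races, none of which is modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified page table entry. */
struct model_pte {
	unsigned long pfn;
	bool present;
	bool special;	/* what pte_mkspecial() would set */
};

/* Models a pte installed for ordinary struct-page-backed memory. */
static struct model_pte model_normal_pte(unsigned long pfn)
{
	struct model_pte pte = { .pfn = pfn, .present = true, .special = false };
	return pte;
}

/* Models a pte installed by remap_pfn_range(): always marked special. */
static struct model_pte model_remap_pte(unsigned long pfn)
{
	struct model_pte pte = { .pfn = pfn, .present = true, .special = true };
	return pte;
}

/*
 * Models the pte-level fast path: walk the ptes without any vma, and stop
 * at the first one that is not present or is special. Returns how many
 * pfns were collected; the caller would fall back to the slow path for
 * the remainder.
 */
static size_t model_gup_fast_ptes(const struct model_pte *ptes, size_t n,
				  unsigned long *out)
{
	size_t got = 0;

	assert(ptes != NULL && out != NULL);
	for (size_t i = 0; i < n; i++) {
		if (!ptes[i].present || ptes[i].special)
			break;
		out[got++] = ptes[i].pfn;
	}
	return got;
}
```

So in this model the vm_flags test is not needed on the fast path, because
the information travels in the pte itself.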

John Hubbard Nov. 4, 2020, 6:44 p.m. UTC | #14

On 11/4/20 10:17 AM, Jason Gunthorpe wrote:
> On Wed, Nov 04, 2020 at 04:41:19PM +0000, Christoph Hellwig wrote:
>> On Wed, Nov 04, 2020 at 04:37:58PM +0000, Christoph Hellwig wrote:
>>> On Wed, Nov 04, 2020 at 05:26:58PM +0100, Daniel Vetter wrote:
>>>> What we're discussing is whether gup_fast and pup_fast also obey this,
>>>> or fall over and can give you the struct page that's backing the
>>>> dma_mmap_* memory. Since the _fast variant doesn't check for
>>>> vma->vm_flags, and afaict that's the only thing which closes this gap.
>>>> And like you restate, that would be a bit of a problem. So where's that
>>>> check which Jason&me aren't spotting?
>>>
>>> remap_pte_range uses pte_mkspecial to set up the PTEs, and gup_pte_range
>>> errors out on pte_special.  Of course this only works for the
>>> CONFIG_ARCH_HAS_PTE_SPECIAL case, for other architectures we do have
>>> a real problem.
>>
>> Except that we don't really support pte-level gup-fast without
>> CONFIG_ARCH_HAS_PTE_SPECIAL, and in fact all architectures selecting
>> HAVE_FAST_GUP also select ARCH_HAS_PTE_SPECIAL, so we should be fine.
> 
> Mm, I thought it was probably the special flag..
> 
> Knowing that CONFIG_HAVE_FAST_GUP can't be set without
> CONFIG_ARCH_HAS_PTE_SPECIAL is pretty insightful, can we put that in
> the Kconfig?
> 
> config HAVE_FAST_GUP
>          depends on MMU
>          depends on ARCH_HAS_PTE_SPECIAL
>          bool
> 
Well, the !CONFIG_ARCH_HAS_PTE_SPECIAL case points out in a comment that
gup-fast is not *completely* unavailable there, so I don't think you want
to shut it off like that:

/*
  * If we can't determine whether or not a pte is special, then fail immediately
  * for ptes. Note, we can still pin HugeTLB and THP as these are guaranteed not
  * to be special.
  *
  * For a futex to be placed on a THP tail page, get_futex_key requires a
  * get_user_pages_fast_only implementation that can pin pages. Thus it's still
  * useful to have gup_huge_pmd even if we can't operate on ptes.
  */


thanks,

Jason Gunthorpe Nov. 4, 2020, 7:02 p.m. UTC | #15

On Wed, Nov 04, 2020 at 10:44:56AM -0800, John Hubbard wrote:
> On 11/4/20 10:17 AM, Jason Gunthorpe wrote:
> > On Wed, Nov 04, 2020 at 04:41:19PM +0000, Christoph Hellwig wrote:
> > > On Wed, Nov 04, 2020 at 04:37:58PM +0000, Christoph Hellwig wrote:
> > > > On Wed, Nov 04, 2020 at 05:26:58PM +0100, Daniel Vetter wrote:
> > > > > What we're discussing is whether gup_fast and pup_fast also obey this,
> > > > > or fall over and can give you the struct page that's backing the
> > > > > dma_mmap_* memory. Since the _fast variant doesn't check for
> > > > > vma->vm_flags, and afaict that's the only thing which closes this gap.
> > > > > And like you restate, that would be a bit of a problem. So where's that
> > > > > check which Jason&me aren't spotting?
> > > > 
> > > > remap_pte_range uses pte_mkspecial to set up the PTEs, and gup_pte_range
> > > > errors out on pte_special.  Of course this only works for the
> > > > CONFIG_ARCH_HAS_PTE_SPECIAL case, for other architectures we do have
> > > > a real problem.
> > > 
> > > Except that we don't really support pte-level gup-fast without
> > > CONFIG_ARCH_HAS_PTE_SPECIAL, and in fact all architectures selecting
> > > HAVE_FAST_GUP also select ARCH_HAS_PTE_SPECIAL, so we should be fine.
> > 
> > Mm, I thought it was probably the special flag..
> > 
> > Knowing that CONFIG_HAVE_FAST_GUP can't be set without
> > CONFIG_ARCH_HAS_PTE_SPECIAL is pretty insightful, can we put that in
> > the Kconfig?
> > 
> > config HAVE_FAST_GUP
> >          depends on MMU
> >          depends on ARCH_HAS_PTE_SPECIAL
> >          bool
> > 
> Well, the !CONFIG_ARCH_HAS_PTE_SPECIAL case points out in a comment that
> gup-fast is not *completely* unavailable there, so I don't think you want
> to shut it off like that:
> 
> /*
>  * If we can't determine whether or not a pte is special, then fail immediately
>  * for ptes. Note, we can still pin HugeTLB and THP as these are guaranteed not
>  * to be special.
>  *
>  * For a futex to be placed on a THP tail page, get_futex_key requires a
>  * get_user_pages_fast_only implementation that can pin pages. Thus it's still
>  * useful to have gup_huge_pmd even if we can't operate on ptes.
>  */

I saw that once and I really couldn't make sense of it.
What use is having futexes that only work on THP pages? I'm confused.

CH said there was no case of HAVE_FAST_GUP !ARCH_HAS_PTE_SPECIAL, is
one hidden someplace then?

Jason

Daniel Vetter Nov. 5, 2020, 9:25 a.m. UTC | #16

On Wed, Nov 04, 2020 at 10:44:56AM -0800, John Hubbard wrote:
> On 11/4/20 10:17 AM, Jason Gunthorpe wrote:
> > On Wed, Nov 04, 2020 at 04:41:19PM +0000, Christoph Hellwig wrote:
> > > On Wed, Nov 04, 2020 at 04:37:58PM +0000, Christoph Hellwig wrote:
> > > > On Wed, Nov 04, 2020 at 05:26:58PM +0100, Daniel Vetter wrote:
> > > > > What we're discussing is whether gup_fast and pup_fast also obey this,
> > > > > or fall over and can give you the struct page that's backing the
> > > > > dma_mmap_* memory. Since the _fast variant doesn't check for
> > > > > vma->vm_flags, and afaict that's the only thing which closes this gap.
> > > > > And like you restate, that would be a bit a problem. So where's that
> > > > > check which Jason&me aren't spotting?
> > > > 
> > > > remap_pte_range uses pte_mkspecial to set up the PTEs, and gup_pte_range
> > > > errors out on pte_special.  Of course this only works for the
> > > > CONFIG_ARCH_HAS_PTE_SPECIAL case, for other architectures we do have
> > > > a real problem.
> > > 
> > > Except that we don't really support pte-level gup-fast without
> > > CONFIG_ARCH_HAS_PTE_SPECIAL, and in fact all architectures selecting
> > > HAVE_FAST_GUP also select ARCH_HAS_PTE_SPECIAL, so we should be fine.

Thanks for the explainer. I guess I can go back to _fast and instead
adjust the commit message to explain why that's all fine.

> > Mm, I thought it was probably the special flag..
> > 
> > Knowing that CONFIG_HAVE_FAST_GUP can't be set without
> > CONFIG_ARCH_HAS_PTE_SPECIAL is pretty insightful, can we put that in
> > the Kconfig?
> > 
> > config HAVE_FAST_GUP
> >          depends on MMU
> >          depends on ARCH_HAS_PTE_SPECIAL
> >          bool
> > 
> Well, the !CONFIG_ARCH_HAS_PTE_SPECIAL case points out in a comment that
> gup-fast is not *completely* unavailable there, so I don't think you want
> to shut it off like that:
> 
> /*
>  * If we can't determine whether or not a pte is special, then fail immediately
>  * for ptes. Note, we can still pin HugeTLB and THP as these are guaranteed not
>  * to be special.
>  *
>  * For a futex to be placed on a THP tail page, get_futex_key requires a
>  * get_user_pages_fast_only implementation that can pin pages. Thus it's still
>  * useful to have gup_huge_pmd even if we can't operate on ptes.
>  */

We recently started supporting hugepage faults in gpu drivers, and I'm not
seeing a pud_mkhugespecial anywhere. So I'm not sure this works, but it's
probably just me missing something again.
-Daniel
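
For readers following along, here is a rough userspace sketch of the
PTE-level argument made above: remap-style insertion marks the entry
special, and the gup-fast PTE loop stops at the first special entry. This
is a toy model and not the code in mm/gup.c; struct toy_pte and all toy_*
functions are made-up names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy PTE: only the two bits this discussion cares about. */
struct toy_pte {
	bool present;
	bool special;	/* stands in for what pte_mkspecial() sets */
};

/* Stand-in for a remap_pfn_range()-style insertion: always marks special. */
struct toy_pte toy_remap_pfn(void)
{
	return (struct toy_pte){ .present = true, .special = true };
}

/* Stand-in for an ordinary struct-page-backed mapping. */
struct toy_pte toy_map_normal_page(void)
{
	return (struct toy_pte){ .present = true, .special = false };
}

/*
 * Stand-in for the PTE loop in gup-fast: walk n entries and stop at the
 * first one that is not present or is special. Returns how many entries
 * it would have pinned; a real caller would fall back to the slow path
 * for the rest.
 */
size_t toy_gup_fast_ptes(const struct toy_pte *ptes, size_t n)
{
	size_t pinned = 0;

	assert(ptes != NULL);
	for (size_t i = 0; i < n; i++) {
		if (!ptes[i].present || ptes[i].special)
			break;
		pinned++;
	}
	return pinned;
}
```

The sketch only covers the PTE level. The open question in this mail is
the pmd/pud level, where there is no equivalent special bit to test.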

Jason Gunthorpe Nov. 5, 2020, 12:49 p.m. UTC | #17

On Thu, Nov 05, 2020 at 10:25:24AM +0100, Daniel Vetter wrote:
> > /*
> >  * If we can't determine whether or not a pte is special, then fail immediately
> >  * for ptes. Note, we can still pin HugeTLB and THP as these are guaranteed not
> >  * to be special.
> >  *
> >  * For a futex to be placed on a THP tail page, get_futex_key requires a
> >  * get_user_pages_fast_only implementation that can pin pages. Thus it's still
> >  * useful to have gup_huge_pmd even if we can't operate on ptes.
> >  */
> 
> We support hugepage faults in gpu drivers since recently, and I'm not
> seeing a pud_mkhugespecial anywhere. So not sure this works, but probably
> just me missing something again.

It means ioremap can't create an IO page PUD, it has to be broken up.

Does ioremap even create anything larger than PTEs?

Jason

John Hubbard Nov. 6, 2020, 4:08 a.m. UTC | #18

On 11/5/20 4:49 AM, Jason Gunthorpe wrote:
> On Thu, Nov 05, 2020 at 10:25:24AM +0100, Daniel Vetter wrote:
>>> /*
>>>   * If we can't determine whether or not a pte is special, then fail immediately
>>>   * for ptes. Note, we can still pin HugeTLB and THP as these are guaranteed not
>>>   * to be special.
>>>   *
>>>   * For a futex to be placed on a THP tail page, get_futex_key requires a
>>>   * get_user_pages_fast_only implementation that can pin pages. Thus it's still
>>>   * useful to have gup_huge_pmd even if we can't operate on ptes.
>>>   */
>>
>> We support hugepage faults in gpu drivers since recently, and I'm not
>> seeing a pud_mkhugespecial anywhere. So not sure this works, but probably
>> just me missing something again.
> 
> It means ioremap can't create an IO page PUD, it has to be broken up.
> 
> Does ioremap even create anything larger than PTEs?
> 

From my reading, yes. See ioremap_try_huge_pmd().

thanks,

Daniel Vetter Nov. 6, 2020, 10:01 a.m. UTC | #19

On Fri, Nov 6, 2020 at 5:08 AM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 11/5/20 4:49 AM, Jason Gunthorpe wrote:
> > On Thu, Nov 05, 2020 at 10:25:24AM +0100, Daniel Vetter wrote:
> >>> /*
> >>>   * If we can't determine whether or not a pte is special, then fail immediately
> >>>   * for ptes. Note, we can still pin HugeTLB and THP as these are guaranteed not
> >>>   * to be special.
> >>>   *
> >>>   * For a futex to be placed on a THP tail page, get_futex_key requires a
> >>>   * get_user_pages_fast_only implementation that can pin pages. Thus it's still
> >>>   * useful to have gup_huge_pmd even if we can't operate on ptes.
> >>>   */
> >>
> >> We support hugepage faults in gpu drivers since recently, and I'm not
> >> seeing a pud_mkhugespecial anywhere. So not sure this works, but probably
> >> just me missing something again.
> >
> > It means ioremap can't create an IO page PUD, it has to be broken up.
> >
> > Does ioremap even create anything larger than PTEs?

gpu drivers also tend to use vmf_insert_pfn* directly, so we can do
on-demand paging and move buffers around. From what I glanced at, for the
lowest level we do the pte_mkspecial correctly (I think I convinced
myself that vm_insert_pfn does that), but for pud/pmd levels it seems
just yolo.

remap_pfn_range seems to indeed split down to pte level always.

>  From my reading, yes. See ioremap_try_huge_pmd().

The ioremap here shouldn't matter, since this is for kernel-internal
mappings. So that's all fine I think.
-Daniel

Daniel Vetter Nov. 6, 2020, 10:27 a.m. UTC | #20

On Fri, Nov 6, 2020 at 11:01 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Fri, Nov 6, 2020 at 5:08 AM John Hubbard <jhubbard@nvidia.com> wrote:
> >
> > On 11/5/20 4:49 AM, Jason Gunthorpe wrote:
> > > On Thu, Nov 05, 2020 at 10:25:24AM +0100, Daniel Vetter wrote:
> > >>> /*
> > >>>   * If we can't determine whether or not a pte is special, then fail immediately
> > >>>   * for ptes. Note, we can still pin HugeTLB and THP as these are guaranteed not
> > >>>   * to be special.
> > >>>   *
> > >>>   * For a futex to be placed on a THP tail page, get_futex_key requires a
> > >>>   * get_user_pages_fast_only implementation that can pin pages. Thus it's still
> > >>>   * useful to have gup_huge_pmd even if we can't operate on ptes.
> > >>>   */
> > >>
> > >> We support hugepage faults in gpu drivers since recently, and I'm not
> > >> seeing a pud_mkhugespecial anywhere. So not sure this works, but probably
> > >> just me missing something again.
> > >
> > > It means ioremap can't create an IO page PUD, it has to be broken up.
> > >
> > > Does ioremap even create anything larger than PTEs?
>
> gpu drivers also tend to use vmf_insert_pfn* directly, so we can do
> on-demand paging and move buffers around. From what I glanced for
> lowest level we to the pte_mkspecial correctly (I think I convinced
> myself that vm_insert_pfn does that), but for pud/pmd levels it seems
> just yolo.

So I dug around a bit more and ttm sets PFN_DEV | PFN_MAP to get past
the various pfn_t_devmap checks (see e.g. vmf_insert_pfn_pmd_prot()).
x86-64 has ARCH_HAS_PTE_DEVMAP, and gup.c seems to handle these
specially, but frankly I got totally lost in what this does.

The comment above the pfn_t_devmap check makes me wonder whether doing
this is correct or not.

Also adding Thomas Hellstrom, who implemented the huge map support in ttm.
-Daniel

> remap_pfn_range seems to indeed split down to pte level always.
>
> >  From my reading, yes. See ioremap_try_huge_pmd().
>
> The ioremap here shouldn't matter, since this is for kernel-internal
> mappings. So that's all fine I think.
> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

Jason Gunthorpe Nov. 6, 2020, 12:55 p.m. UTC | #21

On Fri, Nov 06, 2020 at 11:27:59AM +0100, Daniel Vetter wrote:
> On Fri, Nov 6, 2020 at 11:01 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > On Fri, Nov 6, 2020 at 5:08 AM John Hubbard <jhubbard@nvidia.com> wrote:
> > >
> > > On 11/5/20 4:49 AM, Jason Gunthorpe wrote:
> > > > On Thu, Nov 05, 2020 at 10:25:24AM +0100, Daniel Vetter wrote:
> > > >>> /*
> > > >>>   * If we can't determine whether or not a pte is special, then fail immediately
> > > >>>   * for ptes. Note, we can still pin HugeTLB and THP as these are guaranteed not
> > > >>>   * to be special.
> > > >>>   *
> > > >>>   * For a futex to be placed on a THP tail page, get_futex_key requires a
> > > >>>   * get_user_pages_fast_only implementation that can pin pages. Thus it's still
> > > >>>   * useful to have gup_huge_pmd even if we can't operate on ptes.
> > > >>>   */
> > > >>
> > > >> We support hugepage faults in gpu drivers since recently, and I'm not
> > > >> seeing a pud_mkhugespecial anywhere. So not sure this works, but probably
> > > >> just me missing something again.
> > > >
> > > > It means ioremap can't create an IO page PUD, it has to be broken up.
> > > >
> > > > Does ioremap even create anything larger than PTEs?
> >
> > gpu drivers also tend to use vmf_insert_pfn* directly, so we can do
> > on-demand paging and move buffers around. From what I glanced for
> > lowest level we to the pte_mkspecial correctly (I think I convinced
> > myself that vm_insert_pfn does that), but for pud/pmd levels it seems
> > just yolo.
> 
> So I dug around a bit more and ttm sets PFN_DEV | PFN_MAP to get past
> the various pft_t_devmap checks (see e.g. vmf_insert_pfn_pmd_prot()).
> x86-64 has ARCH_HAS_PTE_DEVMAP, and gup.c seems to handle these
> specially, but frankly I got totally lost in what this does.

The fact vmf_insert_pfn_pmd_prot() has all those BUG_ON's to prevent
putting VM_PFNMAP pages into the page tables seems like a big red
flag.

The comment seems to confirm what we are talking about here:

	/*
	 * If we had pmd_special, we could avoid all these restrictions,
	 * but we need to be consistent with PTEs and architectures that
	 * can't support a 'special' bit.
	 */

i.e. without the ability to mark special we can't block fast gup, and
anyone who does O_DIRECT on these ranges will crash the kernel when it
tries to convert an IO page into a struct page.

Should be easy enough to directly test?

Putting non-struct page PTEs into a VMA without setting VM_PFNMAP just
seems horribly wrong to me.

Jason
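
A rough userspace sketch of the gap described above, for illustration
only. It assumes a huge entry that carries no special bit, a slow path
that can consult the VMA flags, and a fast path that sees only the entry.
All toy_* names and TOY_VM_PFNMAP are made up here. The real mm/gup.c has
additional devmap handling that this does not model, so whether the
kernel really behaves like this is exactly what needs testing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag value; stands in for the kernel's VM_PFNMAP. */
#define TOY_VM_PFNMAP 0x1u

struct toy_vma {
	unsigned int vm_flags;
};

/* A huge (PMD-level) entry: this toy version has no 'special' bit. */
struct toy_pmd {
	bool present;
	bool has_struct_page;	/* ground truth the walker cannot see */
};

/* Slow path: holds the VMA, so it can refuse PFN mappings by flag. */
bool toy_gup_slow_allows(const struct toy_vma *vma, const struct toy_pmd *pmd)
{
	assert(vma != NULL && pmd != NULL);
	if (vma->vm_flags & TOY_VM_PFNMAP)
		return false;
	return pmd->present;
}

/* Fast path: no VMA available, only the entry itself. */
bool toy_gup_fast_allows(const struct toy_pmd *pmd)
{
	assert(pmd != NULL);
	return pmd->present;
}
```

In this model a present huge entry with no struct page behind it is
refused by the slow path but accepted by the fast path, which is the
failure mode being worried about.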

Jason Gunthorpe Nov. 6, 2020, 12:58 p.m. UTC | #22

On Fri, Nov 06, 2020 at 11:01:57AM +0100, Daniel Vetter wrote:

> gpu drivers also tend to use vmf_insert_pfn* directly, so we can do
> on-demand paging and move buffers around. From what I glanced for
> lowest level we to the pte_mkspecial correctly (I think I convinced
> myself that vm_insert_pfn does that), but for pud/pmd levels it seems
> just yolo.
> 
> remap_pfn_range seems to indeed split down to pte level always.

That's what it looked like to me too.
 
> >  From my reading, yes. See ioremap_try_huge_pmd().
> 
> The ioremap here shouldn't matter, since this is for kernel-internal
> mappings. So that's all fine I think.

Right, sorry to be unclear, we are talking about io_remap_pfn_range()
which is for userspace mappings in VMAs

Jason

Thomas Hellström Nov. 9, 2020, 8:44 a.m. UTC | #23

On Fri, 2020-11-06 at 08:55 -0400, Jason Gunthorpe wrote:
> On Fri, Nov 06, 2020 at 11:27:59AM +0100, Daniel Vetter wrote:
> > On Fri, Nov 6, 2020 at 11:01 AM Daniel Vetter <daniel@ffwll.ch>
> > wrote:
> > > On Fri, Nov 6, 2020 at 5:08 AM John Hubbard <jhubbard@nvidia.com>
> > > wrote:
> > > > On 11/5/20 4:49 AM, Jason Gunthorpe wrote:
> > > > > On Thu, Nov 05, 2020 at 10:25:24AM +0100, Daniel Vetter
> > > > > wrote:
> > > > > > > /*
> > > > > > >   * If we can't determine whether or not a pte is
> > > > > > > special, then fail immediately
> > > > > > >   * for ptes. Note, we can still pin HugeTLB and THP as
> > > > > > > these are guaranteed not
> > > > > > >   * to be special.
> > > > > > >   *
> > > > > > >   * For a futex to be placed on a THP tail page,
> > > > > > > get_futex_key requires a
> > > > > > >   * get_user_pages_fast_only implementation that can pin
> > > > > > > pages. Thus it's still
> > > > > > >   * useful to have gup_huge_pmd even if we can't operate
> > > > > > > on ptes.
> > > > > > >   */
> > > > > > 
> > > > > > We support hugepage faults in gpu drivers since recently,
> > > > > > and I'm not
> > > > > > seeing a pud_mkhugespecial anywhere. So not sure this
> > > > > > works, but probably
> > > > > > just me missing something again.
> > > > > 
> > > > > It means ioremap can't create an IO page PUD, it has to be
> > > > > broken up.
> > > > > 
> > > > > Does ioremap even create anything larger than PTEs?
> > > 
> > > gpu drivers also tend to use vmf_insert_pfn* directly, so we can
> > > do
> > > on-demand paging and move buffers around. From what I glanced for
> > > lowest level we to the pte_mkspecial correctly (I think I
> > > convinced
> > > myself that vm_insert_pfn does that), but for pud/pmd levels it
> > > seems
> > > just yolo.
> > 
> > So I dug around a bit more and ttm sets PFN_DEV | PFN_MAP to get
> > past
> > the various pft_t_devmap checks (see e.g.
> > vmf_insert_pfn_pmd_prot()).
> > x86-64 has ARCH_HAS_PTE_DEVMAP, and gup.c seems to handle these
> > specially, but frankly I got totally lost in what this does.
> 
> The fact vmf_insert_pfn_pmd_prot() has all those BUG_ON's to prevent
> putting VM_PFNMAP pages into the page tables seems like a big red
> flag.
> 
> The comment seems to confirm what we are talking about here:
> 
> 	/*
> 	 * If we had pmd_special, we could avoid all these
> restrictions,
> 	 * but we need to be consistent with PTEs and architectures
> that
> 	 * can't support a 'special' bit.
> 	 */
> 
> ie without the ability to mark special we can't block fast gup and
> anyone who does O_DIRECT on these ranges will crash the kernel when
> it
> tries to convert a IO page into a struct page.
> 
> Should be easy enough to directly test?
> 
> Putting non-struct page PTEs into a VMA without setting VM_PFNMAP
> just
> seems horribly wrong to me.

Although core mm special huge-page support is currently quite limited,
some time ago, I extended the pre-existing vma_is_dax() to
vma_is_special_huge():

/**
 * vma_is_special_huge - Are transhuge page-table entries considered special?
 * @vma: Pointer to the struct vm_area_struct to consider
 *
 * Whether transhuge page-table entries are considered "special" following
 * the definition in vm_normal_page().
 *
 * Return: true if transhuge page-table entries should be considered special,
 * false otherwise.
 */
static inline bool vma_is_special_huge(const struct vm_area_struct *vma)
{
	return vma_is_dax(vma) || (vma->vm_file &&
				   (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)));
}

meaning that currently all transhuge page-table entries in a PFNMAP or
MIXEDMAP vma are considered "special". The number of calls to this
function (mainly in the page-splitting code) is quite limited, so
replacing it with a more elaborate per-page-table-entry scheme would, I
guess, definitely be possible, although all functions using it would
need a fallback path for architectures not supporting it.

/Thomas



> 
> Jason
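
To make the predicate above easier to poke at, here is a userspace
approximation. It is only a sketch: struct toy_vma, the is_dax field
(standing in for vma_is_dax()) and the TOY_VM_* bit values are invented
for illustration and are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative bit values, not the kernel's definitions. */
#define TOY_VM_PFNMAP   (1u << 0)
#define TOY_VM_MIXEDMAP (1u << 1)

struct toy_vma {
	const void *vm_file;	/* non-NULL for file-backed mappings */
	unsigned int vm_flags;
	bool is_dax;		/* stands in for vma_is_dax() */
};

/* Same shape as vma_is_special_huge(), expressed on the toy VMA. */
bool toy_vma_is_special_huge(const struct toy_vma *vma)
{
	assert(vma != NULL);
	return vma->is_dax ||
	       (vma->vm_file != NULL &&
		(vma->vm_flags & (TOY_VM_PFNMAP | TOY_VM_MIXEDMAP)) != 0);
}
```

Note that the answer depends only on the VMA: a PFNMAP vma without a
backing file is not reported as special, and nothing in the predicate
looks at an individual page-table entry.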

Jason Gunthorpe Nov. 9, 2020, 8:19 p.m. UTC | #24

On Mon, Nov 09, 2020 at 09:44:02AM +0100, Thomas Hellström wrote:
> static inline bool vma_is_special_huge(const struct vm_area_struct
> *vma)
> {
> 	return vma_is_dax(vma) || (vma->vm_file &&
> 				   (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)));
> }

That is testing a VMA, not a PTE, which doesn't help protect
get_user_pages_fast.

Sounds like it has opened a big user-triggerable crash problem in DRM,
and the huge page stuff needs to be reverted.

Dan?

Jason
