[v3,00/23] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM
mbox series

Message ID 20191112000700.3455038-1-jhubbard@nvidia.com
Headers show
Series
  • mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM
Related show

Message

John Hubbard Nov. 12, 2019, 12:06 a.m. UTC
Hi,

The cover letter is long, so the more important stuff is first:

* Jason, if you or someone could look at the the VFIO cleanup (patch 8)
  and conversion to FOLL_PIN (patch 18), to make sure it's use of
  remote and longterm gup matches what we discussed during the review
  of v2, I'd appreciate it.

* Also for Jason and IB: as noted below, in patch 11, I am (too?) boldly
  converting from put_user_pages() to release_pages().

* Jerome, I am going to take a look at doing your FOLL_GET change idea
  (some callers should set FOLL_GET) separately, because it blew up "a
  little bit" in my face. It's definitely a separate--tiny, but risky--project.
  It also looks more reasonable when applied on top of this series here
  (and it conflicts a lot), so I'm going to send it as a follow-up.

Changes since v2:

* Added a patch to convert IB/umem from normal gup, to gup_fast(). This
  is also posted separately, in order to hopefully get some runtime
  testing.

* Changed the page devmap code to be a little clearer,
  thanks to Jerome for that.

* Split out the page devmap changes into a separate patch (and moved
  Ira's Signed-off-by to that patch).

* Fixed my bug in IB: ODP code does not require pin_user_pages()
  semantics. Therefore, revert the put_user_page() calls to put_page(),
  and leave the get_user_pages() call as-is.

      * As part of the revert, I am proposing here a change directly
        from put_user_pages(), to release_pages(). I'd feel better if
        someone agrees that this is the best way. It uses the more
        efficient release_pages(), instead of put_page() in a loop,
        and keep the change to just a few character on one line,
        but OTOH it is not a pure revert.

* Loosened the FOLL_LONGTERM restrictions in the
  __get_user_pages_locked() implementation, and used that in order
  to fix up a VFIO bug. Thanks to Jason for that idea.

    * Note the use of release_pages() in IB: is that OK?

* Added a few more WARN's and clarifying comments nearby.

* Many documentation improvements in various comments.

* Moved the new pin_user_pages.rst from Documentation/vm/ to
  Documentation/core-api/ .

* Commit descriptions: added clarifying notes to the three patches
  (drm/via, fs/io_uring, net/xdp) that already had put_user_page()
  calls in place.

* Collected all pending Reviewed-by and Acked-by tags, from v1 and v2
  email threads.

* Lot of churn from v2 --> v3, so it's possible that new bugs
  sneaked in.

NOT DONE: separate patchset is required:

* __get_user_pages_locked(): stop compensating for
  buggy callers who failed to set FOLL_GET. Instead, assert
  that FOLL_GET is set (and fail if it's not).

======================================================================
Original cover letter (edited to fix up the patch description numbers)

This applies cleanly to linux-next and mmotm, and also to linux.git if
linux-next's commit 20cac10710c9 ("mm/gup_benchmark: fix MAP_HUGETLB
case") is first applied there.

This provides tracking of dma-pinned pages. This is a prerequisite to
solving the larger problem of proper interactions between file-backed
pages, and [R]DMA activities, as discussed in [1], [2], [3], and in
a remarkable number of email threads since about 2017. :)

A new internal gup flag, FOLL_PIN is introduced, and thoroughly
documented in the last patch's Documentation/vm/pin_user_pages.rst.

I believe that this will provide a good starting point for doing the
layout lease work that Ira Weiny has been working on. That's because
these new wrapper functions provide a clean, constrained, systematically
named set of functionality that, again, is required in order to even
know if a page is "dma-pinned".

In contrast to earlier approaches, the page tracking can be
incrementally applied to the kernel call sites that, until now, have
been simply calling get_user_pages() ("gup"). In other words, opt-in by
changing from this:

    get_user_pages() (sets FOLL_GET)
    put_page()

to this:
    pin_user_pages() (sets FOLL_PIN)
    put_user_page()

Because there are interdependencies with FOLL_LONGTERM, a similar
conversion as for FOLL_PIN, was applied. The change was from this:

    get_user_pages(FOLL_LONGTERM) (also sets FOLL_GET)
    put_page()

to this:
    pin_longterm_pages() (sets FOLL_PIN | FOLL_LONGTERM)
    put_user_page()

============================================================
Patch summary:

* Patches 1-8: refactoring and preparatory cleanup, independent fixes

* Patch 9: introduce pin_user_pages(), FOLL_PIN, but no functional
           changes yet
* Patches 10-15: Convert existing put_user_page() callers, to use the
                new pin*()
* Patch 16: Activate tracking of FOLL_PIN pages.
* Patches 17-19: convert FOLL_LONGTERM callers
* Patches: 20-22: gup_benchmark and run_vmtests support
* Patch 23: enforce FOLL_LONGTERM as a gup-internal (only) flag

============================================================
Testing:

* I've done some overall kernel testing (LTP, and a few other goodies),
  and some directed testing to exercise some of the changes. And as you
  can see, gup_benchmark is enhanced to exercise this. Basically, I've been
  able to runtime test the core get_user_pages() and pin_user_pages() and
  related routines, but not so much on several of the call sites--but those
  are generally just a couple of lines changed, each.

  Not much of the kernel is actually using this, which on one hand
  reduces risk quite a lot. But on the other hand, testing coverage
  is low. So I'd love it if, in particular, the Infiniband and PowerPC
  folks could do a smoke test of this series for me.

  Also, my runtime testing for the call sites so far is very weak:

    * io_uring: Some directed tests from liburing exercise this, and they pass.
    * process_vm_access.c: A small directed test passes.
    * gup_benchmark: the enhanced version hits the new gup.c code, and passes.
    * infiniband (still only have crude "IB pingpong" working, on a
                  good day: it's not exercising my conversions at runtime...)
    * VFIO: compiles (I'm vowing to set up a run time test soon, but it's
                      not ready just yet)
    * powerpc: it compiles...
    * drm/via: compiles...
    * goldfish: compiles...
    * net/xdp: compiles...
    * media/v4l2: compiles...

============================================================
Next:

* Get the block/bio_vec sites converted to use pin_user_pages().

* Work with Ira and Dave Chinner to weave this together with the
  layout lease stuff.

============================================================

[1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/

John Hubbard (23):
  mm/gup: pass flags arg to __gup_device_* functions
  mm/gup: factor out duplicate code from four routines
  mm/gup: move try_get_compound_head() to top, fix minor issues
  mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages
  goldish_pipe: rename local pin_user_pages() routine
  IB/umem: use get_user_pages_fast() to pin DMA pages
  media/v4l2-core: set pages dirty upon releasing DMA buffers
  vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM
  mm/gup: introduce pin_user_pages*() and FOLL_PIN
  goldish_pipe: convert to pin_user_pages() and put_user_page()
  IB/{core,hw,umem}: set FOLL_PIN, FOLL_LONGTERM via
    pin_longterm_pages*()
  mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()
  drm/via: set FOLL_PIN via pin_user_pages_fast()
  fs/io_uring: set FOLL_PIN via pin_user_pages()
  net/xdp: set FOLL_PIN via pin_user_pages()
  mm/gup: track FOLL_PIN pages
  media/v4l2-core: pin_longterm_pages (FOLL_PIN) and put_user_page()
    conversion
  vfio, mm: pin_longterm_pages (FOLL_PIN) and put_user_page() conversion
  powerpc: book3s64: convert to pin_longterm_pages() and put_user_page()
  mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding
    "1"
  mm/gup_benchmark: support pin_user_pages() and related calls
  selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN
    coverage
  mm/gup: remove support for gup(FOLL_LONGTERM)

 Documentation/core-api/index.rst            |   1 +
 Documentation/core-api/pin_user_pages.rst   | 218 +++++++
 arch/powerpc/mm/book3s64/iommu_api.c        |  15 +-
 drivers/gpu/drm/via/via_dmablit.c           |   2 +-
 drivers/infiniband/core/umem.c              |  17 +-
 drivers/infiniband/core/umem_odp.c          |  24 +-
 drivers/infiniband/hw/hfi1/user_pages.c     |   4 +-
 drivers/infiniband/hw/mthca/mthca_memfree.c |   3 +-
 drivers/infiniband/hw/qib/qib_user_pages.c  |   8 +-
 drivers/infiniband/hw/qib/qib_user_sdma.c   |   2 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c    |   9 +-
 drivers/infiniband/sw/siw/siw_mem.c         |   5 +-
 drivers/media/v4l2-core/videobuf-dma-sg.c   |  10 +-
 drivers/platform/goldfish/goldfish_pipe.c   |  35 +-
 drivers/vfio/vfio_iommu_type1.c             |  35 +-
 fs/io_uring.c                               |   5 +-
 include/linux/mm.h                          | 164 +++++-
 include/linux/mmzone.h                      |   2 +
 include/linux/page_ref.h                    |  10 +
 mm/gup.c                                    | 608 ++++++++++++++++----
 mm/gup_benchmark.c                          |  87 ++-
 mm/huge_memory.c                            |  54 +-
 mm/hugetlb.c                                |  39 +-
 mm/memremap.c                               |  67 +--
 mm/process_vm_access.c                      |  28 +-
 mm/vmstat.c                                 |   2 +
 net/xdp/xdp_umem.c                          |   4 +-
 tools/testing/selftests/vm/gup_benchmark.c  |  34 +-
 tools/testing/selftests/vm/run_vmtests      |  22 +
 29 files changed, 1180 insertions(+), 334 deletions(-)
 create mode 100644 Documentation/core-api/pin_user_pages.rst

Comments

Jason Gunthorpe Nov. 12, 2019, 8:38 p.m. UTC | #1
On Mon, Nov 11, 2019 at 04:06:37PM -0800, John Hubbard wrote:
> Hi,
> 
> The cover letter is long, so the more important stuff is first:
> 
> * Jason, if you or someone could look at the the VFIO cleanup (patch 8)
>   and conversion to FOLL_PIN (patch 18), to make sure it's use of
>   remote and longterm gup matches what we discussed during the review
>   of v2, I'd appreciate it.
> 
> * Also for Jason and IB: as noted below, in patch 11, I am (too?) boldly
>   converting from put_user_pages() to release_pages().

Why are we doing this? I think things got confused here someplace, as
the comment still says:

/**
 * put_user_page() - release a gup-pinned page
 * @page:            pointer to page to be released
 *
 * Pages that were pinned via get_user_pages*() must be released via
 * either put_user_page(), or one of the put_user_pages*() routines
 * below.

I feel like if put_user_pages() is not the correct way to undo
get_user_pages() then it needs to be deleted.

Jason
John Hubbard Nov. 12, 2019, 9:10 p.m. UTC | #2
On 11/12/19 12:38 PM, Jason Gunthorpe wrote:
> On Mon, Nov 11, 2019 at 04:06:37PM -0800, John Hubbard wrote:
>> Hi,
>>
>> The cover letter is long, so the more important stuff is first:
>>
>> * Jason, if you or someone could look at the the VFIO cleanup (patch 8)
>>   and conversion to FOLL_PIN (patch 18), to make sure it's use of
>>   remote and longterm gup matches what we discussed during the review
>>   of v2, I'd appreciate it.
>>
>> * Also for Jason and IB: as noted below, in patch 11, I am (too?) boldly
>>   converting from put_user_pages() to release_pages().
> 
> Why are we doing this? I think things got confused here someplace, as


Because:

a) These need put_page() calls,  and

b) there is no put_pages() call, but there is a release_pages() call that
is, arguably, what put_pages() would be.


> the comment still says:
> 
> /**
>  * put_user_page() - release a gup-pinned page
>  * @page:            pointer to page to be released
>  *
>  * Pages that were pinned via get_user_pages*() must be released via
>  * either put_user_page(), or one of the put_user_pages*() routines
>  * below.


Ohhh, I missed those comments. They need to all be changed over to
say "pages that were pinned via pin_user_pages*() or 
pin_longterm_pages*() must be released via put_user_page*()."

The get_user_pages*() pages must still be released via put_page.

The churn is due to a fairly significant change in strategy, whis
is: instead of changing all get_user_pages*() sites to call 
put_user_page(), change selected sites to call pin_user_pages*() or 
pin_longterm_pages*(), plus put_user_page().

That allows incrementally converting the kernel over to using the
new pin APIs, without taking on the huge risk of a big one-shot
conversion. 

So, I've ended up with one place that actually needs to get reverted
back to get_user_pages(), and that's the IB ODP code.

> 
> I feel like if put_user_pages() is not the correct way to undo
> get_user_pages() then it needs to be deleted.
> 

Yes, you're right. I'll fix the put_user_page comments() as described.


thanks,

John Hubbard
NVIDIA
Daniel Vetter Nov. 13, 2019, 8:22 a.m. UTC | #3
On Tue, Nov 12, 2019 at 10:10 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 11/12/19 12:38 PM, Jason Gunthorpe wrote:
> > On Mon, Nov 11, 2019 at 04:06:37PM -0800, John Hubbard wrote:
> >> Hi,
> >>
> >> The cover letter is long, so the more important stuff is first:
> >>
> >> * Jason, if you or someone could look at the the VFIO cleanup (patch 8)
> >>   and conversion to FOLL_PIN (patch 18), to make sure it's use of
> >>   remote and longterm gup matches what we discussed during the review
> >>   of v2, I'd appreciate it.
> >>
> >> * Also for Jason and IB: as noted below, in patch 11, I am (too?) boldly
> >>   converting from put_user_pages() to release_pages().
> >
> > Why are we doing this? I think things got confused here someplace, as
>
>
> Because:
>
> a) These need put_page() calls,  and
>
> b) there is no put_pages() call, but there is a release_pages() call that
> is, arguably, what put_pages() would be.
>
>
> > the comment still says:
> >
> > /**
> >  * put_user_page() - release a gup-pinned page
> >  * @page:            pointer to page to be released
> >  *
> >  * Pages that were pinned via get_user_pages*() must be released via
> >  * either put_user_page(), or one of the put_user_pages*() routines
> >  * below.
>
>
> Ohhh, I missed those comments. They need to all be changed over to
> say "pages that were pinned via pin_user_pages*() or
> pin_longterm_pages*() must be released via put_user_page*()."
>
> The get_user_pages*() pages must still be released via put_page.
>
> The churn is due to a fairly significant change in strategy, whis
> is: instead of changing all get_user_pages*() sites to call
> put_user_page(), change selected sites to call pin_user_pages*() or
> pin_longterm_pages*(), plus put_user_page().

Can't we call this unpin_user_page then, for some symmetry? Or is that
even more churn?

Looking from afar the naming here seems really confusing.
-Daniel

> That allows incrementally converting the kernel over to using the
> new pin APIs, without taking on the huge risk of a big one-shot
> conversion.
>
> So, I've ended up with one place that actually needs to get reverted
> back to get_user_pages(), and that's the IB ODP code.
>
> >
> > I feel like if put_user_pages() is not the correct way to undo
> > get_user_pages() then it needs to be deleted.
> >
>
> Yes, you're right. I'll fix the put_user_page comments() as described.
>
>
> thanks,
>
> John Hubbard
> NVIDIA
John Hubbard Nov. 13, 2019, 9:02 a.m. UTC | #4
On 11/13/19 12:22 AM, Daniel Vetter wrote:
...
>>> Why are we doing this? I think things got confused here someplace, as
>>
>>
>> Because:
>>
>> a) These need put_page() calls,  and
>>
>> b) there is no put_pages() call, but there is a release_pages() call that
>> is, arguably, what put_pages() would be.
>>
>>
>>> the comment still says:
>>>
>>> /**
>>>   * put_user_page() - release a gup-pinned page
>>>   * @page:            pointer to page to be released
>>>   *
>>>   * Pages that were pinned via get_user_pages*() must be released via
>>>   * either put_user_page(), or one of the put_user_pages*() routines
>>>   * below.
>>
>>
>> Ohhh, I missed those comments. They need to all be changed over to
>> say "pages that were pinned via pin_user_pages*() or
>> pin_longterm_pages*() must be released via put_user_page*()."
>>
>> The get_user_pages*() pages must still be released via put_page.
>>
>> The churn is due to a fairly significant change in strategy, whis
>> is: instead of changing all get_user_pages*() sites to call
>> put_user_page(), change selected sites to call pin_user_pages*() or
>> pin_longterm_pages*(), plus put_user_page().
> 
> Can't we call this unpin_user_page then, for some symmetry? Or is that
> even more churn?
> 
> Looking from afar the naming here seems really confusing.


That look from afar is valuable, because I'm too close to the problem to see
how the naming looks. :)

unpin_user_page() sounds symmetrical. It's true that it would cause more
churn (which is why I started off with a proposal that avoids changing the
names of put_user_page*() APIs). But OTOH, the amount of churn is proportional
to the change in direction here, and it's really only 10 or 20 lines changed,
in the end.

So I'm open to changing to that naming. It would be nice to hear what others
prefer, too...


thanks,
Jan Kara Nov. 13, 2019, 10:12 a.m. UTC | #5
On Wed 13-11-19 01:02:02, John Hubbard wrote:
> On 11/13/19 12:22 AM, Daniel Vetter wrote:
> ...
> > > > Why are we doing this? I think things got confused here someplace, as
> > > 
> > > 
> > > Because:
> > > 
> > > a) These need put_page() calls,  and
> > > 
> > > b) there is no put_pages() call, but there is a release_pages() call that
> > > is, arguably, what put_pages() would be.
> > > 
> > > 
> > > > the comment still says:
> > > > 
> > > > /**
> > > >   * put_user_page() - release a gup-pinned page
> > > >   * @page:            pointer to page to be released
> > > >   *
> > > >   * Pages that were pinned via get_user_pages*() must be released via
> > > >   * either put_user_page(), or one of the put_user_pages*() routines
> > > >   * below.
> > > 
> > > 
> > > Ohhh, I missed those comments. They need to all be changed over to
> > > say "pages that were pinned via pin_user_pages*() or
> > > pin_longterm_pages*() must be released via put_user_page*()."
> > > 
> > > The get_user_pages*() pages must still be released via put_page.
> > > 
> > > The churn is due to a fairly significant change in strategy, whis
> > > is: instead of changing all get_user_pages*() sites to call
> > > put_user_page(), change selected sites to call pin_user_pages*() or
> > > pin_longterm_pages*(), plus put_user_page().
> > 
> > Can't we call this unpin_user_page then, for some symmetry? Or is that
> > even more churn?
> > 
> > Looking from afar the naming here seems really confusing.
> 
> 
> That look from afar is valuable, because I'm too close to the problem to see
> how the naming looks. :)
> 
> unpin_user_page() sounds symmetrical. It's true that it would cause more
> churn (which is why I started off with a proposal that avoids changing the
> names of put_user_page*() APIs). But OTOH, the amount of churn is proportional
> to the change in direction here, and it's really only 10 or 20 lines changed,
> in the end.
> 
> So I'm open to changing to that naming. It would be nice to hear what others
> prefer, too...

FWIW I'd find unpin_user_page() also better than put_user_page() as a
counterpart to pin_user_pages().

								Honza
Daniel Vetter Nov. 13, 2019, 11:43 a.m. UTC | #6
On Wed, Nov 13, 2019 at 11:12:10AM +0100, Jan Kara wrote:
> On Wed 13-11-19 01:02:02, John Hubbard wrote:
> > On 11/13/19 12:22 AM, Daniel Vetter wrote:
> > ...
> > > > > Why are we doing this? I think things got confused here someplace, as
> > > > 
> > > > 
> > > > Because:
> > > > 
> > > > a) These need put_page() calls,  and
> > > > 
> > > > b) there is no put_pages() call, but there is a release_pages() call that
> > > > is, arguably, what put_pages() would be.
> > > > 
> > > > 
> > > > > the comment still says:
> > > > > 
> > > > > /**
> > > > >   * put_user_page() - release a gup-pinned page
> > > > >   * @page:            pointer to page to be released
> > > > >   *
> > > > >   * Pages that were pinned via get_user_pages*() must be released via
> > > > >   * either put_user_page(), or one of the put_user_pages*() routines
> > > > >   * below.
> > > > 
> > > > 
> > > > Ohhh, I missed those comments. They need to all be changed over to
> > > > say "pages that were pinned via pin_user_pages*() or
> > > > pin_longterm_pages*() must be released via put_user_page*()."
> > > > 
> > > > The get_user_pages*() pages must still be released via put_page.
> > > > 
> > > > The churn is due to a fairly significant change in strategy, whis
> > > > is: instead of changing all get_user_pages*() sites to call
> > > > put_user_page(), change selected sites to call pin_user_pages*() or
> > > > pin_longterm_pages*(), plus put_user_page().
> > > 
> > > Can't we call this unpin_user_page then, for some symmetry? Or is that
> > > even more churn?
> > > 
> > > Looking from afar the naming here seems really confusing.
> > 
> > 
> > That look from afar is valuable, because I'm too close to the problem to see
> > how the naming looks. :)
> > 
> > unpin_user_page() sounds symmetrical. It's true that it would cause more
> > churn (which is why I started off with a proposal that avoids changing the
> > names of put_user_page*() APIs). But OTOH, the amount of churn is proportional
> > to the change in direction here, and it's really only 10 or 20 lines changed,
> > in the end.
> > 
> > So I'm open to changing to that naming. It would be nice to hear what others
> > prefer, too...
> 
> FWIW I'd find unpin_user_page() also better than put_user_page() as a
> counterpart to pin_user_pages().

One more point from afar on pin/unpin: We use that a lot in graphics for
permanently pinned graphics buffer objects. Which really only should be
used for scanout. So at least graphics folks should have an appropriate
mindset and try to make sure we don't overuse this stuff.
-Daniel
John Hubbard Nov. 13, 2019, 8:28 p.m. UTC | #7
On 11/13/19 3:43 AM, Daniel Vetter wrote:
...
>>>> Can't we call this unpin_user_page then, for some symmetry? Or is that
>>>> even more churn?
>>>>
>>>> Looking from afar the naming here seems really confusing.
>>>
>>>
>>> That look from afar is valuable, because I'm too close to the problem to see
>>> how the naming looks. :)
>>>
>>> unpin_user_page() sounds symmetrical. It's true that it would cause more
>>> churn (which is why I started off with a proposal that avoids changing the
>>> names of put_user_page*() APIs). But OTOH, the amount of churn is proportional
>>> to the change in direction here, and it's really only 10 or 20 lines changed,
>>> in the end.
>>>
>>> So I'm open to changing to that naming. It would be nice to hear what others
>>> prefer, too...
>>
>> FWIW I'd find unpin_user_page() also better than put_user_page() as a
>> counterpart to pin_user_pages().
> 
> One more point from afar on pin/unpin: We use that a lot in graphics for
> permanently pinned graphics buffer objects. Which really only should be
> used for scanout. So at least graphics folks should have an appropriate
> mindset and try to make sure we don't overuse this stuff.
> -Daniel
> 

OK, Ira also likes "unpin", and so far no one has said *anything* in favor
of the "put_user_page" names, so I think we have a winner! I'll change the
names to unpin_user_page*().


thanks,