[0/3] mm/gup: track FOLL_PIN pages (follow on from v12)

Message ID 20200125021115.731629-1-jhubbard@nvidia.com (mailing list archive)

Message

John Hubbard Jan. 25, 2020, 2:11 a.m. UTC
Leon Romanovsky:

If you get a chance, I'd love to have this short series (or even just
the first patch; the others are just selftests) run through your test
suite that was previously choking on my earlier v11 patchset. The huge
page pincount limitations are removed, so I'm expecting a perfect test
run this time!

Everyone:

This activates tracking of FOLL_PIN pages. This is in support of fixing
the get_user_pages()+DMA problem described in [1]-[4].

It is based on today's (Jan 24) mmotm. There is a git repo and branch,
for convenience in reviewing:

    git@github.com:johnhubbard/linux.git track_user_pages_v1_mmotm_24Jan2020

FOLL_PIN support is (so far) in mmotm and linux-next. However, the
patch to use FOLL_PIN to track pages was *not* submitted, because Leon
saw an RDMA test suite failure that involved (I think) page refcount
overflows when huge pages were used.

This patch definitively solves that kind of overflow problem by adding
an exact pincount, for compound pages of order > 1, in the 3rd struct
page of the compound page. Where that field is available, exact
pincounting is used instead of the GUP_PIN_COUNTING_BIAS approach.
Thanks again to Jan Kara for that idea.
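
To make the overflow concern concrete, here is a rough userspace
sketch of the arithmetic. It is not code from the patch. It assumes
that GUP_PIN_COUNTING_BIAS is 1024 (the value in include/linux/mm.h),
that the head page's refcount is a signed 32-bit int, and that pinning
a whole compound page adds the bias once per subpage to the head
refcount. The names MODEL_PIN_BIAS and max_biased_pins() are made up
for illustration.

```c
#include <assert.h>
#include <limits.h>

/* Assumed value, mirroring GUP_PIN_COUNTING_BIAS in include/linux/mm.h. */
#define MODEL_PIN_BIAS 1024

/*
 * Hypothetical helper: how many times can a compound page with
 * `subpages` subpages be pinned in full before the head page's signed
 * 32-bit refcount would exceed INT_MAX? Each full pin is modeled as
 * adding MODEL_PIN_BIAS per subpage to the head refcount.
 */
long long max_biased_pins(long long subpages)
{
	long long per_pin;

	assert(subpages > 0);
	per_pin = (long long)MODEL_PIN_BIAS * subpages;

	return (long long)INT_MAX / per_pin;
}
```

Under those assumptions, max_biased_pins(512) (a 2 MB huge page made
of 4 KB pages) gives 4095, versus 2097151 for max_biased_pins(1). A
few thousand full pins of one huge page is plausibly within reach of
an RDMA workload, which is consistent with the failure Leon saw. An
exact pincount costs only one count per pin, so it does not hit this
limit.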

Here's the last reviewed version of the tracking patch (v11):

  https://lore.kernel.org/r/20191216222537.491123-1-jhubbard@nvidia.com

Jan Kara had provided a reviewed-by tag for that, but I've had to remove
it (again) here, due to having changed the patch "a little bit", in
order to add the feature described above.

Other interesting changes:

* dump_page(): added one or two new things to report for compound
  pages: head refcount (for all compound pages), and map_pincount (for
  compound pages of order > 1).

* Documentation/core-api/pin_user_pages.rst: removed the "TODO" for the
  huge page refcount upper limit problems, and added notes about how it
  works now. Also added a note about the dump_page() enhancements.

* Added some comments in gup.c and mm.h, to explain that there are two
  ways to count pinned pages: exact (for compound pages of order > 1)
  and fuzzy (GUP_PIN_COUNTING_BIAS: for all other pages).
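
The two counting modes can be modeled in userspace roughly as follows.
This is only a sketch under stated assumptions, not the code in gup.c
or mm.h: struct model_page and the model_*() functions are
hypothetical stand-ins, and the bias value of 1024 is assumed to match
GUP_PIN_COUNTING_BIAS. The exact case is modeled as taking one
ordinary reference plus one pincount per pin.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PIN_BIAS 1024	/* assumed: GUP_PIN_COUNTING_BIAS */

/* Hypothetical userspace stand-in for a (head) struct page. */
struct model_page {
	int refcount;		/* ordinary refs plus pin contributions */
	int pincount;		/* exact pins; used only if has_pincount */
	bool has_pincount;	/* true for compound pages of order > 1 */
};

/* Pin: exact counting where available, otherwise add the bias. */
void model_pin(struct model_page *p)
{
	if (p->has_pincount) {
		p->refcount += 1;
		p->pincount += 1;
	} else {
		p->refcount += MODEL_PIN_BIAS;
	}
}

/* Unpin: undo exactly what model_pin() did for this kind of page. */
void model_unpin(struct model_page *p)
{
	if (p->has_pincount) {
		assert(p->pincount > 0);
		p->pincount -= 1;
		p->refcount -= 1;
	} else {
		assert(p->refcount >= MODEL_PIN_BIAS);
		p->refcount -= MODEL_PIN_BIAS;
	}
}

/* Exact answer for pincount pages; a heuristic for everything else. */
bool model_maybe_pinned(const struct model_page *p)
{
	if (p->has_pincount)
		return p->pincount > 0;
	return p->refcount >= MODEL_PIN_BIAS;
}
```

The "fuzzy" part shows up in the last function: a page without a
pincount field that happens to hold 1024 ordinary references reports
as pinned even though nobody pinned it, whereas a page with a pincount
field and the same refcount does not.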

============================================================
General notes about the tracking patch:

This is a prerequisite to solving the problem of proper interactions
between file-backed pages and [R]DMA activities, as discussed in [1],
[2], [3], [4] and in a remarkable number of email threads since about
2017. :)

In contrast to earlier approaches, the page tracking can be
incrementally applied to the kernel call sites that, until now, have
been simply calling get_user_pages() ("gup"). In other words, opt-in by
changing from this:

    get_user_pages() (sets FOLL_GET)
    put_page()

to this:

    pin_user_pages() (sets FOLL_PIN)
    unpin_user_page()

============================================================
Next steps:

* Convert more subsystems from get_user_pages() to pin_user_pages().
* Work with Ira and others to connect this all up with file system
  leases.

[1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/
[4] LWN kernel index: get_user_pages() https://lwn.net/Kernel/Index/#Memory_management-get_user_pages

John Hubbard (3):
  mm/gup: track FOLL_PIN pages
  mm/gup_benchmark: support pin_user_pages() and related calls
  selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN
    coverage

 Documentation/core-api/pin_user_pages.rst  |  48 ++-
 include/linux/mm.h                         | 109 ++++-
 include/linux/mm_types.h                   |   7 +-
 include/linux/mmzone.h                     |   2 +
 include/linux/page_ref.h                   |  10 +
 mm/debug.c                                 |  22 +-
 mm/gup.c                                   | 467 ++++++++++++++++-----
 mm/gup_benchmark.c                         |  70 ++-
 mm/huge_memory.c                           |  29 +-
 mm/hugetlb.c                               |  44 +-
 mm/page_alloc.c                            |   2 +
 mm/rmap.c                                  |   6 +
 mm/vmstat.c                                |   2 +
 tools/testing/selftests/vm/gup_benchmark.c |  15 +-
 tools/testing/selftests/vm/run_vmtests     |  22 +
 15 files changed, 678 insertions(+), 177 deletions(-)

Comments

Leon Romanovsky Jan. 25, 2020, 4:23 p.m. UTC | #1
On Fri, Jan 24, 2020 at 06:11:12PM -0800, John Hubbard wrote:
> Leon Romanovsky:
>
> If you get a chance, I'd love to have this short series (or even just
> the first patch; the others are just selftests) run through your test
> suite that was previously choking on my earlier v11 patchset. The huge
> page pincount limitations are removed, so I'm expecting a perfect test
> run this time!
>

I added those patches to our regression and I will post the results in
a couple of days.

Thanks
Leon Romanovsky Jan. 29, 2020, 5:47 a.m. UTC | #2
On Sat, Jan 25, 2020 at 06:23:39PM +0200, Leon Romanovsky wrote:
> On Fri, Jan 24, 2020 at 06:11:12PM -0800, John Hubbard wrote:
> > Leon Romanovsky:
> >
> > If you get a chance, I'd love to have this short series (or even just
> > the first patch; the others are just selftests) run through your test
> > suite that was previously choking on my earlier v11 patchset. The huge
> > page pincount limitations are removed, so I'm expecting a perfect test
> > run this time!
> >
>
> I added those patches to our regression and I will post the results in
> a couple of days.

Hi John,

The patches survived our RDMA verification night runs.

Thanks

>
> Thanks
>
John Hubbard Jan. 29, 2020, 8:01 p.m. UTC | #3
On 1/28/20 9:47 PM, Leon Romanovsky wrote:
> On Sat, Jan 25, 2020 at 06:23:39PM +0200, Leon Romanovsky wrote:
>> On Fri, Jan 24, 2020 at 06:11:12PM -0800, John Hubbard wrote:
>>> Leon Romanovsky:
>>>
>>> If you get a chance, I'd love to have this short series (or even just
>>> the first patch; the others are just selftests) run through your test
>>> suite that was previously choking on my earlier v11 patchset. The huge
>>> page pincount limitations are removed, so I'm expecting a perfect test
>>> run this time!
>>>
>>
>> I added those patches to our regression and I will post the results in
>> a couple of days.
> 
> Hi John,
> 
> The patches survived our RDMA verification night runs.
> 

Great! Thanks very much for running those. That's a pretty solid 
confirmation that the earlier patch *did* allow a huge page refcount
overflow, and that this approach avoids it. 

thanks,