[RFC,0/9] mm, sparse-vmemmap: Introduce compound pagemaps

Message ID 20201208172901.17384-1-joao.m.martins@oracle.com (mailing list archive)

Message

Joao Martins Dec. 8, 2020, 5:28 p.m. UTC
Hey,

This small series attempts to minimize 'struct page' overhead by
pursuing a similar approach as Muchun Song's series "Free some vmemmap
pages of hugetlb page"[0], but applied to devmap/ZONE_DEVICE.

[0] https://lore.kernel.org/linux-mm/20201130151838.11208-1-songmuchun@bytedance.com/

The link above describes it quite nicely, but the idea is to reuse tail
page vmemmap areas, in particular the area which only describes tail pages.
So a vmemmap page describes 64 struct pages, and the first vmemmap page for a
given ZONE_DEVICE compound page would contain the head page and 63 tail pages.
The second vmemmap page would contain only tail pages, and that's what gets
reused across the rest of the subsection/section. The bigger the page size, the
bigger the savings (2M hpage -> save 6 vmemmap pages; 1G hpage -> save 4094
vmemmap pages).

In terms of savings, per 1TB of memory, the struct page cost would go down
with a compound pagemap (a quick sketch of the arithmetic follows the list below):

* with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total memory)
* with 1G pages we lose 8MB instead of 16G (0.0007% instead of 1.5% of total memory)
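
To make the arithmetic concrete, here is a rough userspace back-of-the-envelope
sketch. It is illustrative only (not part of the series) and assumes x86-64 with
4K base pages, a 64-byte struct page, and that only the head vmemmap page plus
one reusable tail vmemmap page are kept per compound page:

    #include <stdio.h>

    #define PAGE_SZ        4096UL   /* base page size */
    #define STRUCT_PAGE_SZ 64UL     /* sizeof(struct page) */
    /* struct pages described by one vmemmap page */
    #define PAGES_PER_VMEMMAP (PAGE_SZ / STRUCT_PAGE_SZ)    /* 64 */

    int main(void)
    {
            unsigned long tb = 1UL << 40;
            unsigned long aligns[] = { 2UL << 20, 1UL << 30 };

            for (int i = 0; i < 2; i++) {
                    unsigned long align = aligns[i];
                    /* vmemmap pages needed to describe one @align compound page */
                    unsigned long total = (align / PAGE_SZ) / PAGES_PER_VMEMMAP;
                    unsigned long kept = 2;  /* head page + one reused tail page */
                    unsigned long saved = total - kept;
                    /* memmap footprint per 1TB of device memory */
                    unsigned long cost_mb = ((tb / align) * kept * PAGE_SZ) >> 20;

                    printf("align %luM: save %lu of %lu vmemmap pages, ~%luMB left per 1TB\n",
                           align >> 20, saved, total, cost_mb);
            }
            return 0;
    }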

Along the way I've extended it past 'struct page' overhead, *trying* to address a
few performance issues we knew about for pmem, specifically in the
{pin,get}_user_pages* function family with device-dax VMAs, which are really
slow even for the fast variants. THP is great with the -fast variants, but
everything except hugetlbfs performs rather poorly with non-fast gup.

So to summarize what the series does:

Patches 1-5: Much like Muchun's series, we reuse tail page areas across a given
page size (namely @align, as referred to by the rest of the memremap/dax code)
and enable memremap to initialize the ZONE_DEVICE pages as compound pages of a
given @align order (a rough sketch of what a caller might look like follows the
numbers below). The main difference, though, is that contrary to the hugetlbfs
series, there's no pre-existing vmemmap for the area, because we are onlining it.
IOW there is no freeing of pages of an already initialized vmemmap as in the
hugetlbfs case, which simplifies the logic (besides not being arch-specific).
After these, there's a quite visible improvement in region bootstrap of the pmem
memmap, given that we initialize fewer struct pages depending on the page size.

    NVDIMM namespace bootstrap improves from ~750ms to ~190ms/<=1ms on emulated NVDIMMs
    with 2M and 1G respectively. A similarly proportional net gain is observed when
    running on actual NVDIMMs.
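
For illustration, a caller opting into this could look roughly like the sketch
below. The @align field on struct dev_pagemap and the exact plumbing are
assumptions for the sketch and may not match the patches verbatim:

    /*
     * Hypothetical sketch only: the @align field is illustrative of how a
     * ZONE_DEVICE user such as device-dax could request a compound pagemap;
     * it may not match the series' final interface.
     */
    static void *region_memremap_compound(struct device *dev,
                                          struct dev_pagemap *pgmap,
                                          unsigned long align)
    {
            pgmap->type = MEMORY_DEVICE_GENERIC;
            /*
             * Ask memremap_pages() to initialize the memmap as compound
             * pages of @align order (e.g. 2M or 1G) so that the tail-page
             * portions of the vmemmap can be reused.
             */
            pgmap->align = align;

            return devm_memremap_pages(dev, pgmap);
    }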

Patches 6-8: Optimize page refcount grabbing/releasing given that we are working
with compound pages, i.e. we do 1 increment/decrement to the head page for a
given set of N subpages, as opposed to N individual writes (a minimal sketch of
the idea follows the numbers below). {get,pin}_user_pages_fast() for ZONE_DEVICE
with a compound pagemap consequently improves considerably, and
unpin_user_pages() improves as well when passed a set of consecutive pages:

                                           before          after
    (get_user_pages_fast 1G;2M page size) ~75k  us -> ~3.2k ; ~5.2k us
    (pin_user_pages_fast 1G;2M page size) ~125k us -> ~3.4k ; ~5.5k us
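
A minimal sketch of the refcount idea (illustrative only, not the exact hunks
from patches 6-7; the helper name is made up for the sketch):

    /*
     * Sketch: for N consecutive subpages belonging to the same compound
     * page, do a single atomic update on the head page instead of N
     * individual get_page()/put_page() operations.
     */
    static void grab_compound_range(struct page *page, unsigned int nr_pages)
    {
            struct page *head = compound_head(page);

            /* one atomic add to the head page's refcount... */
            page_ref_add(head, nr_pages);
            /* ...covering all nr_pages subpages of the pin/get request */
    }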

The RDMA patch (patch 8/9) demonstrates the improvement for an existing
user. For unpin_user_pages() we have an additional test to demonstrate the
improvement.  The test performs MR reg/unreg continuously and measures its
rate for a given period. So essentially ib_mem_get and ib_mem_release are
being stress tested, which at the end of the day means pin_user_pages_longterm()
and unpin_user_pages() for a scatterlist:

    Before:
    159 rounds in 5.027 sec: 31617.923 usec / round (device-dax)
    466 rounds in 5.009 sec: 10748.456 usec / round (hugetlbfs)
	        
    After:
    305 rounds in 5.010 sec: 16426.047 usec / round (device-dax)
    1073 rounds in 5.004 sec: 4663.622 usec / round (hugetlbfs)

Patch 9: Improves {pin,get}_user_pages() and its longterm counterpart. It
is very experimental, and I imported most of follow_hugetlb_page(), except
that we do the same trick as gup-fast. While doing the patch I felt this batching
should live in follow_page_mask(), with that being changed to return a set
of pages/something-else when walking over PMDs/PUDs for THP / devmap pages. This
patch then brings the previous MR reg/unreg test (above) to parity
between device-dax and hugetlbfs.
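
Roughly, the batching when walking a devmap PMD looks like the sketch below,
loosely modeled on gup-fast's record_subpages(); the function name is made up
and this is not the actual patch 9 code:

    /*
     * Sketch only: record the subpages of a PMD-mapped compound page
     * covering [addr, end) in one go and take a single head reference,
     * instead of resolving and refcounting one base page at a time.
     */
    static int record_pmd_subpages(pmd_t pmd, unsigned long addr,
                                   unsigned long end, struct page **pages)
    {
            struct page *head, *page;
            int nr, i;

            page = pmd_page(pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
            head = compound_head(page);
            nr = (end - addr) >> PAGE_SHIFT;

            for (i = 0; i < nr; i++)
                    pages[i] = page + i;

            /* one refcount update for the whole batch */
            page_ref_add(head, nr);
            return nr;
    }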

Some of the patches are a little fresh/WIP (especially patches 3 and 9) and we are
still running tests. Hence the RFC, asking for comments on the general direction
of the work before continuing.

Patches apply on top of linux-next tag next-20201208 (commit a9e26cb5f261).

Comments and suggestions very much appreciated!

Thanks,
	Joao

Joao Martins (9):
  memremap: add ZONE_DEVICE support for compound pages
  sparse-vmemmap: Consolidate arguments in vmemmap section populate
  sparse-vmemmap: Reuse vmemmap areas for a given page size
  mm/page_alloc: Reuse tail struct pages for compound pagemaps
  device-dax: Compound pagemap support
  mm/gup: Grab head page refcount once for group of subpages
  mm/gup: Decrement head page once for group of subpages
  RDMA/umem: batch page unpin in __ib_mem_release()
  mm: Add follow_devmap_page() for devdax vmas

 drivers/dax/device.c           |  54 ++++++---
 drivers/infiniband/core/umem.c |  25 +++-
 include/linux/huge_mm.h        |   4 +
 include/linux/memory_hotplug.h |  16 ++-
 include/linux/memremap.h       |   2 +
 include/linux/mm.h             |   6 +-
 mm/gup.c                       | 130 ++++++++++++++++-----
 mm/huge_memory.c               | 202 +++++++++++++++++++++++++++++++++
 mm/memory_hotplug.c            |  13 ++-
 mm/memremap.c                  |  13 ++-
 mm/page_alloc.c                |  28 ++++-
 mm/sparse-vmemmap.c            |  97 +++++++++++++---
 mm/sparse.c                    |  16 +--
 13 files changed, 531 insertions(+), 75 deletions(-)

Comments

David Hildenbrand Dec. 9, 2020, 9:38 a.m. UTC | #1
On 08.12.20 18:28, Joao Martins wrote:
> Hey,
> 
> This small series, attempts at minimizing 'struct page' overhead by
> pursuing a similar approach as Muchun Song series "Free some vmemmap
> pages of hugetlb page"[0] but applied to devmap/ZONE_DEVICE. 
> 
> [0] https://lore.kernel.org/linux-mm/20201130151838.11208-1-songmuchun@bytedance.com/
> 
> The link above describes it quite nicely, but the idea is to reuse tail
> page vmemmap areas, particular the area which only describes tail pages.
> So a vmemmap page describes 64 struct pages, and the first page for a given
> ZONE_DEVICE vmemmap would contain the head page and 63 tail pages. The second
> vmemmap page would contain only tail pages, and that's what gets reused across
> the rest of the subsection/section. The bigger the page size, the bigger the
> savings (2M hpage -> save 6 vmemmap pages; 1G hpage -> save 4094 vmemmap pages).
> 
> In terms of savings, per 1Tb of memory, the struct page cost would go down
> with compound pagemap:
> 
> * with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total memory)
> * with 1G pages we lose 8MB instead of 16G (0.0007% instead of 1.5% of total memory)
> 

That's the dream :)

> Along the way I've extended it past 'struct page' overhead *trying* to address a
> few performance issues we knew about for pmem, specifically on the
> {pin,get}_user_pages* function family with device-dax vmas which are really
> slow even of the fast variants. THP is great on -fast variants but all except
> hugetlbfs perform rather poorly on non-fast gup.
> 
> So to summarize what the series does:
> 
> Patches 1-5: Much like Muchun series, we reuse tail page areas across a given
> page size (namely @align was referred by remaining memremap/dax code) and
> enabling of memremap to initialize the ZONE_DEVICE pages as compound pages or a
> given @align order. The main difference though, is that contrary to the hugetlbfs
> series, there's no vmemmap for the area, because we are onlining it.

Yeah, I'd argue that this case is a lot easier to handle. When the buddy
is involved, things get more complicated.
Muchun Song Dec. 9, 2020, 9:52 a.m. UTC | #2
On Wed, Dec 9, 2020 at 1:32 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> Hey,
>
> This small series, attempts at minimizing 'struct page' overhead by
> pursuing a similar approach as Muchun Song series "Free some vmemmap
> pages of hugetlb page"[0] but applied to devmap/ZONE_DEVICE.
>
> [0] https://lore.kernel.org/linux-mm/20201130151838.11208-1-songmuchun@bytedance.com/
>

Oh, well. It looks like you agree with my optimization approach
and have fully understood it. Also, you are welcome to help me review that
series if you have time. :)

> The link above describes it quite nicely, but the idea is to reuse tail
> page vmemmap areas, particular the area which only describes tail pages.
> So a vmemmap page describes 64 struct pages, and the first page for a given
> ZONE_DEVICE vmemmap would contain the head page and 63 tail pages. The second
> vmemmap page would contain only tail pages, and that's what gets reused across
> the rest of the subsection/section. The bigger the page size, the bigger the
> savings (2M hpage -> save 6 vmemmap pages; 1G hpage -> save 4094 vmemmap pages).
>
> In terms of savings, per 1Tb of memory, the struct page cost would go down
> with compound pagemap:
>
> * with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total memory)
> * with 1G pages we lose 8MB instead of 16G (0.0007% instead of 1.5% of total memory)
>
> Along the way I've extended it past 'struct page' overhead *trying* to address a
> few performance issues we knew about for pmem, specifically on the
> {pin,get}_user_pages* function family with device-dax vmas which are really
> slow even of the fast variants. THP is great on -fast variants but all except
> hugetlbfs perform rather poorly on non-fast gup.
>
> So to summarize what the series does:
>
> Patches 1-5: Much like Muchun series, we reuse tail page areas across a given
> page size (namely @align was referred by remaining memremap/dax code) and
> enabling of memremap to initialize the ZONE_DEVICE pages as compound pages or a
> given @align order. The main difference though, is that contrary to the hugetlbfs
> series, there's no vmemmap for the area, because we are onlining it. IOW no
> freeing of pages of already initialized vmemmap like the case for hugetlbfs,
> which simplifies the logic (besides not being arch-specific). After these,
> there's quite visible region bootstrap of pmem memmap given that we would
> initialize fewer struct pages depending on the page size.
>
>     NVDIMM namespace bootstrap improves from ~750ms to ~190ms/<=1ms on emulated NVDIMMs
>     with 2M and 1G respectivally. The net gain in improvement is similarly observed
>     in proportion when running on actual NVDIMMs.
>
> Patch 6 - 8: Optimize grabbing/release a page refcount changes given that we
> are working with compound pages i.e. we do 1 increment/decrement to the head
> page for a given set of N subpages compared as opposed to N individual writes.
> {get,pin}_user_pages_fast() for zone_device with compound pagemap consequently
> improves considerably, and unpin_user_pages() improves as well when passed a
> set of consecutive pages:
>
>                                            before          after
>     (get_user_pages_fast 1G;2M page size) ~75k  us -> ~3.2k ; ~5.2k us
>     (pin_user_pages_fast 1G;2M page size) ~125k us -> ~3.4k ; ~5.5k us
>
> The RDMA patch (patch 8/9) is to demonstrate the improvement for an existing
> user. For unpin_user_pages() we have an additional test to demonstrate the
> improvement.  The test performs MR reg/unreg continuously and measuring its
> rate for a given period. So essentially ib_mem_get and ib_mem_release being
> stress tested which at the end of day means: pin_user_pages_longterm() and
> unpin_user_pages() for a scatterlist:
>
>     Before:
>     159 rounds in 5.027 sec: 31617.923 usec / round (device-dax)
>     466 rounds in 5.009 sec: 10748.456 usec / round (hugetlbfs)
>
>     After:
>     305 rounds in 5.010 sec: 16426.047 usec / round (device-dax)
>     1073 rounds in 5.004 sec: 4663.622 usec / round (hugetlbfs)
>
> Patch 9: Improves {pin,get}_user_pages() and its longterm counterpart. It
> is very experimental, and I imported most of follow_hugetlb_page(), except
> that we do the same trick as gup-fast. In doing the patch I feel this batching
> should live in follow_page_mask() and having that being changed to return a set
> of pages/something-else when walking over PMD/PUDs for THP / devmap pages. This
> patch then brings the previous test of mr reg/unreg (above) on parity
> between device-dax and hugetlbfs.
>
> Some of the patches are a little fresh/WIP (specially patch 3 and 9) and we are
> still running tests. Hence the RFC, asking for comments and general direction
> of the work before continuing.
>
> Patches apply on top of linux-next tag next-20201208 (commit a9e26cb5f261).
>
> Comments and suggestions very much appreciated!
>
> Thanks,
>         Joao
>
> Joao Martins (9):
>   memremap: add ZONE_DEVICE support for compound pages
>   sparse-vmemmap: Consolidate arguments in vmemmap section populate
>   sparse-vmemmap: Reuse vmemmap areas for a given page size
>   mm/page_alloc: Reuse tail struct pages for compound pagemaps
>   device-dax: Compound pagemap support
>   mm/gup: Grab head page refcount once for group of subpages
>   mm/gup: Decrement head page once for group of subpages
>   RDMA/umem: batch page unpin in __ib_mem_release()
>   mm: Add follow_devmap_page() for devdax vmas
>
>  drivers/dax/device.c           |  54 ++++++---
>  drivers/infiniband/core/umem.c |  25 +++-
>  include/linux/huge_mm.h        |   4 +
>  include/linux/memory_hotplug.h |  16 ++-
>  include/linux/memremap.h       |   2 +
>  include/linux/mm.h             |   6 +-
>  mm/gup.c                       | 130 ++++++++++++++++-----
>  mm/huge_memory.c               | 202 +++++++++++++++++++++++++++++++++
>  mm/memory_hotplug.c            |  13 ++-
>  mm/memremap.c                  |  13 ++-
>  mm/page_alloc.c                |  28 ++++-
>  mm/sparse-vmemmap.c            |  97 +++++++++++++---
>  mm/sparse.c                    |  16 +--
>  13 files changed, 531 insertions(+), 75 deletions(-)
>
> --
> 2.17.1
>
Dan Williams Feb. 20, 2021, 1:18 a.m. UTC | #3
On Tue, Dec 8, 2020 at 9:32 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> Hey,
>
> This small series, attempts at minimizing 'struct page' overhead by
> pursuing a similar approach as Muchun Song series "Free some vmemmap
> pages of hugetlb page"[0] but applied to devmap/ZONE_DEVICE.
>
> [0] https://lore.kernel.org/linux-mm/20201130151838.11208-1-songmuchun@bytedance.com/

Clever!

>
> The link above describes it quite nicely, but the idea is to reuse tail
> page vmemmap areas, particular the area which only describes tail pages.
> So a vmemmap page describes 64 struct pages, and the first page for a given
> ZONE_DEVICE vmemmap would contain the head page and 63 tail pages. The second
> vmemmap page would contain only tail pages, and that's what gets reused across
> the rest of the subsection/section. The bigger the page size, the bigger the
> savings (2M hpage -> save 6 vmemmap pages; 1G hpage -> save 4094 vmemmap pages).
>
> In terms of savings, per 1Tb of memory, the struct page cost would go down
> with compound pagemap:
>
> * with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total memory)
> * with 1G pages we lose 8MB instead of 16G (0.0007% instead of 1.5% of total memory)

Nice!

>
> Along the way I've extended it past 'struct page' overhead *trying* to address a
> few performance issues we knew about for pmem, specifically on the
> {pin,get}_user_pages* function family with device-dax vmas which are really
> slow even of the fast variants. THP is great on -fast variants but all except
> hugetlbfs perform rather poorly on non-fast gup.
>
> So to summarize what the series does:
>
> Patches 1-5: Much like Muchun series, we reuse tail page areas across a given
> page size (namely @align was referred by remaining memremap/dax code) and
> enabling of memremap to initialize the ZONE_DEVICE pages as compound pages or a
> given @align order. The main difference though, is that contrary to the hugetlbfs
> series, there's no vmemmap for the area, because we are onlining it. IOW no
> freeing of pages of already initialized vmemmap like the case for hugetlbfs,
> which simplifies the logic (besides not being arch-specific). After these,
> there's quite visible region bootstrap of pmem memmap given that we would
> initialize fewer struct pages depending on the page size.
>
>     NVDIMM namespace bootstrap improves from ~750ms to ~190ms/<=1ms on emulated NVDIMMs
>     with 2M and 1G respectivally. The net gain in improvement is similarly observed
>     in proportion when running on actual NVDIMMs.

>
> Patch 6 - 8: Optimize grabbing/release a page refcount changes given that we
> are working with compound pages i.e. we do 1 increment/decrement to the head
> page for a given set of N subpages compared as opposed to N individual writes.
> {get,pin}_user_pages_fast() for zone_device with compound pagemap consequently
> improves considerably, and unpin_user_pages() improves as well when passed a
> set of consecutive pages:
>
>                                            before          after
>     (get_user_pages_fast 1G;2M page size) ~75k  us -> ~3.2k ; ~5.2k us
>     (pin_user_pages_fast 1G;2M page size) ~125k us -> ~3.4k ; ~5.5k us

Compelling!

>
> The RDMA patch (patch 8/9) is to demonstrate the improvement for an existing
> user. For unpin_user_pages() we have an additional test to demonstrate the
> improvement.  The test performs MR reg/unreg continuously and measuring its
> rate for a given period. So essentially ib_mem_get and ib_mem_release being
> stress tested which at the end of day means: pin_user_pages_longterm() and
> unpin_user_pages() for a scatterlist:
>
>     Before:
>     159 rounds in 5.027 sec: 31617.923 usec / round (device-dax)
>     466 rounds in 5.009 sec: 10748.456 usec / round (hugetlbfs)
>
>     After:
>     305 rounds in 5.010 sec: 16426.047 usec / round (device-dax)
>     1073 rounds in 5.004 sec: 4663.622 usec / round (hugetlbfs)

Why does hugetlbfs get faster for a ZONE_DEVICE change? Might answer
that question myself when I get to patch 8.

>
> Patch 9: Improves {pin,get}_user_pages() and its longterm counterpart. It
> is very experimental, and I imported most of follow_hugetlb_page(), except
> that we do the same trick as gup-fast. In doing the patch I feel this batching
> should live in follow_page_mask() and having that being changed to return a set
> of pages/something-else when walking over PMD/PUDs for THP / devmap pages. This
> patch then brings the previous test of mr reg/unreg (above) on parity
> between device-dax and hugetlbfs.
>
> Some of the patches are a little fresh/WIP (specially patch 3 and 9) and we are
> still running tests. Hence the RFC, asking for comments and general direction
> of the work before continuing.

Will go look at the code, but I don't see anything scary conceptually
here. The fact that pfn_to_page() does not need to change is among the
most compelling features of this approach.
Joao Martins Feb. 22, 2021, 11:06 a.m. UTC | #4
On 2/20/21 1:18 AM, Dan Williams wrote:
> On Tue, Dec 8, 2020 at 9:32 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> The link above describes it quite nicely, but the idea is to reuse tail
>> page vmemmap areas, particular the area which only describes tail pages.
>> So a vmemmap page describes 64 struct pages, and the first page for a given
>> ZONE_DEVICE vmemmap would contain the head page and 63 tail pages. The second
>> vmemmap page would contain only tail pages, and that's what gets reused across
>> the rest of the subsection/section. The bigger the page size, the bigger the
>> savings (2M hpage -> save 6 vmemmap pages; 1G hpage -> save 4094 vmemmap pages).
>>
>> In terms of savings, per 1Tb of memory, the struct page cost would go down
>> with compound pagemap:
>>
>> * with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total memory)
>> * with 1G pages we lose 8MB instead of 16G (0.0007% instead of 1.5% of total memory)
> 
> Nice!
> 

I failed to mention this in the cover letter, but I should say that with this trick we will
need to build the vmemmap page tables with base pages for 2M align, as opposed to hugepages
in the vmemmap page tables (as you probably could tell from the patches). This means that
we have to allocate a PMD page, and that costs 2GB per 1TB (as opposed to 4M). This is
fixable for 1G align by reusing PMD pages (albeit I haven't done that in this RFC series).

The footprint reduction is still big; to reiterate the numbers above (and I will fix this
in the v2 cover letter):

* with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total memory)
* with 1G pages we lose 8MB instead of 16G (0.0007% instead of 1.5% of total memory)

For vmemmap page tables, we need to use base pages for 2M pages. So taking that into
account, in this RFC series:

* with 2M pages we lose 6G instead of 16G (0.586% instead of 1.5% of total memory)
* with 1G pages we lose ~2GB instead of 16G (0.19% instead of 1.5% of total memory)

For 1G align, we are able to reuse vmemmap PMDs that only point to tail pages, so
ultimately we can get the page table overhead from 2GB to 12MB:

* with 1G pages we lose 20MB instead of 16G (0.0019% instead of 1.5% of total memory)

>>
>> The RDMA patch (patch 8/9) is to demonstrate the improvement for an existing
>> user. For unpin_user_pages() we have an additional test to demonstrate the
>> improvement.  The test performs MR reg/unreg continuously and measuring its
>> rate for a given period. So essentially ib_mem_get and ib_mem_release being
>> stress tested which at the end of day means: pin_user_pages_longterm() and
>> unpin_user_pages() for a scatterlist:
>>
>>     Before:
>>     159 rounds in 5.027 sec: 31617.923 usec / round (device-dax)
>>     466 rounds in 5.009 sec: 10748.456 usec / round (hugetlbfs)
>>
>>     After:
>>     305 rounds in 5.010 sec: 16426.047 usec / round (device-dax)
>>     1073 rounds in 5.004 sec: 4663.622 usec / round (hugetlbfs)
> 
> Why does hugetlbfs get faster for a ZONE_DEVICE change? Might answer
> that question myself when I get to patch 8.
> 
Because the unpinning improvements aren't ZONE_DEVICE specific.

FWIW, I moved those two offending patches outside of this series:

  https://lore.kernel.org/linux-mm/20210212130843.13865-1-joao.m.martins@oracle.com/

>>
>> Patch 9: Improves {pin,get}_user_pages() and its longterm counterpart. It
>> is very experimental, and I imported most of follow_hugetlb_page(), except
>> that we do the same trick as gup-fast. In doing the patch I feel this batching
>> should live in follow_page_mask() and having that being changed to return a set
>> of pages/something-else when walking over PMD/PUDs for THP / devmap pages. This
>> patch then brings the previous test of mr reg/unreg (above) on parity
>> between device-dax and hugetlbfs.
>>
>> Some of the patches are a little fresh/WIP (specially patch 3 and 9) and we are
>> still running tests. Hence the RFC, asking for comments and general direction
>> of the work before continuing.
> 
> Will go look at the code, but I don't see anything scary conceptually
> here. The fact that pfn_to_page() does not need to change is among the
> most compelling features of this approach.
> 
Glad to hear that :D
Joao Martins Feb. 22, 2021, 2:32 p.m. UTC | #5
On 2/22/21 11:06 AM, Joao Martins wrote:
> On 2/20/21 1:18 AM, Dan Williams wrote:
>> On Tue, Dec 8, 2020 at 9:32 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>
>>> The link above describes it quite nicely, but the idea is to reuse tail
>>> page vmemmap areas, particular the area which only describes tail pages.
>>> So a vmemmap page describes 64 struct pages, and the first page for a given
>>> ZONE_DEVICE vmemmap would contain the head page and 63 tail pages. The second
>>> vmemmap page would contain only tail pages, and that's what gets reused across
>>> the rest of the subsection/section. The bigger the page size, the bigger the
>>> savings (2M hpage -> save 6 vmemmap pages; 1G hpage -> save 4094 vmemmap pages).
>>>
>>> In terms of savings, per 1Tb of memory, the struct page cost would go down
>>> with compound pagemap:
>>>
>>> * with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total memory)
>>> * with 1G pages we lose 8MB instead of 16G (0.0007% instead of 1.5% of total memory)
>>
>> Nice!
>>
> 
> I failed to mention this in the cover letter but I should say that with this trick we will
> need to build the vmemmap page tables with basepages for 2M align, as opposed to hugepages
> in the vmemmap page tables (as you probably could tell from the patches). 

Also to be clear: by "we will need to build the vmemmap page tables with basepages for 2M
align" I strictly refer to the ZONE_DEVICE range for which we are mapping the struct pages.
It's not the entirety of the vmemmap!

> This means that
> we have to allocate a PMD page, and that costs 2GB per 1Tb (as opposed to 4M). This is
> fixable for 1G align by reusing PMD pages (albeit I haven't done that in this RFC series).
> 
> The footprint reduction is still big, so to iterate the numbers above (and I will fix this
> in the v2 cover letter):
> 
> * with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total memory)
> * with 1G pages we lose 8MB instead of 16G (0.0007% instead of 1.5% of total memory)
> 
> For vmemmap page tables, we need to use base pages for 2M pages. So taking that into
> account, in this RFC series:
> 
> * with 2M pages we lose 6G instead of 16G (0.586% instead of 1.5% of total memory)
> * with 1G pages we lose ~2GB instead of 16G (0.19% instead of 1.5% of total memory)
> 
> For 1G align, we are able to reuse vmemmap PMDs that only point to tail pages, so
> ultimately we can get the page table overhead from 2GB to 12MB:
> 
> * with 1G pages we lose 20MB instead of 16G (0.0019% instead of 1.5% of total memory)
> 
>>>
>>> The RDMA patch (patch 8/9) is to demonstrate the improvement for an existing
>>> user. For unpin_user_pages() we have an additional test to demonstrate the
>>> improvement.  The test performs MR reg/unreg continuously and measuring its
>>> rate for a given period. So essentially ib_mem_get and ib_mem_release being
>>> stress tested which at the end of day means: pin_user_pages_longterm() and
>>> unpin_user_pages() for a scatterlist:
>>>
>>>     Before:
>>>     159 rounds in 5.027 sec: 31617.923 usec / round (device-dax)
>>>     466 rounds in 5.009 sec: 10748.456 usec / round (hugetlbfs)
>>>
>>>     After:
>>>     305 rounds in 5.010 sec: 16426.047 usec / round (device-dax)
>>>     1073 rounds in 5.004 sec: 4663.622 usec / round (hugetlbfs)
>>
>> Why does hugetlbfs get faster for a ZONE_DEVICE change? Might answer
>> that question myself when I get to patch 8.
>>
> Because the unpinning improvements aren't ZONE_DEVICE specific.
> 
> FWIW, I moved those two offending patches outside of this series:
> 
>   https://lore.kernel.org/linux-mm/20210212130843.13865-1-joao.m.martins@oracle.com/
> 
>>>
>>> Patch 9: Improves {pin,get}_user_pages() and its longterm counterpart. It
>>> is very experimental, and I imported most of follow_hugetlb_page(), except
>>> that we do the same trick as gup-fast. In doing the patch I feel this batching
>>> should live in follow_page_mask() and having that being changed to return a set
>>> of pages/something-else when walking over PMD/PUDs for THP / devmap pages. This
>>> patch then brings the previous test of mr reg/unreg (above) on parity
>>> between device-dax and hugetlbfs.
>>>
>>> Some of the patches are a little fresh/WIP (specially patch 3 and 9) and we are
>>> still running tests. Hence the RFC, asking for comments and general direction
>>> of the work before continuing.
>>
>> Will go look at the code, but I don't see anything scary conceptually
>> here. The fact that pfn_to_page() does not need to change is among the
>> most compelling features of this approach.
>>
> Glad to hear that :D
>
Joao Martins Feb. 23, 2021, 4:28 p.m. UTC | #6
On 2/20/21 1:18 AM, Dan Williams wrote:
> On Tue, Dec 8, 2020 at 9:32 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>> Patch 6 - 8: Optimize grabbing/release a page refcount changes given that we
>> are working with compound pages i.e. we do 1 increment/decrement to the head
>> page for a given set of N subpages compared as opposed to N individual writes.
>> {get,pin}_user_pages_fast() for zone_device with compound pagemap consequently
>> improves considerably, and unpin_user_pages() improves as well when passed a
>> set of consecutive pages:
>>
>>                                            before          after
>>     (get_user_pages_fast 1G;2M page size) ~75k  us -> ~3.2k ; ~5.2k us
>>     (pin_user_pages_fast 1G;2M page size) ~125k us -> ~3.4k ; ~5.5k us
> 
> Compelling!
> 

BTW is there any reason why we don't support pin_user_pages_fast() with FOLL_LONGTERM for
device-dax?

Looking at the history, I understand that fsdax can't support it atm, but I am not sure
that the same holds for device-dax. I have this small chunk (see below the scissors mark)
which relaxes this for a pgmap of type MEMORY_DEVICE_GENERIC, albeit I am not sure if there
is a fundamental issue for the other types that makes this an unwelcome change.

	Joao

--------------------->8---------------------

Subject: [PATCH] mm/gup: allow FOLL_LONGTERM pin-fast for
 MEMORY_DEVICE_GENERIC

The downside would be one extra lookup in dev_pagemap tree
for other pgmap->types (P2P, FSDAX, PRIVATE). But just one
per gup-fast() call.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 include/linux/mm.h |  5 +++++
 mm/gup.c           | 24 +++++++++++++-----------
 2 files changed, 18 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 32f0c3986d4f..c89a049bbd7a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1171,6 +1171,11 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
 		page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
 }

+static inline bool devmap_longterm_available(const struct dev_pagemap *pgmap)
+{
+	return pgmap->type == MEMORY_DEVICE_GENERIC;
+}
+
 /* 127: arbitrary random number, small enough to assemble well */
 #define page_ref_zero_or_close_to_overflow(page) \
 	((unsigned int) page_ref_count(page) + 127u <= 127u)
diff --git a/mm/gup.c b/mm/gup.c
index 222d1fdc5cfa..03e370d360e6 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2092,14 +2092,18 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned
long end,
 			goto pte_unmap;

 		if (pte_devmap(pte)) {
-			if (unlikely(flags & FOLL_LONGTERM))
-				goto pte_unmap;
-
 			pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
 			if (unlikely(!pgmap)) {
 				undo_dev_pagemap(nr, nr_start, flags, pages);
 				goto pte_unmap;
 			}
+
+			if (unlikely(flags & FOLL_LONGTERM) &&
+			    !devmap_longterm_available(pgmap)) {
+				undo_dev_pagemap(nr, nr_start, flags, pages);
+				goto pte_unmap;
+			}
+
 		} else if (pte_special(pte))
 			goto pte_unmap;

@@ -2195,6 +2199,10 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 			return 0;
 		}

+		if (unlikely(flags & FOLL_LONGTERM) &&
+		    !devmap_longterm_available(pgmap))
+			return 0;
+
@@ -2356,12 +2364,9 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 	if (!pmd_access_permitted(orig, flags & FOLL_WRITE))
 		return 0;

-	if (pmd_devmap(orig)) {
-		if (unlikely(flags & FOLL_LONGTERM))
-			return 0;
+	if (pmd_devmap(orig))
 		return __gup_device_huge_pmd(orig, pmdp, addr, end, flags,
 					     pages, nr);
-	}

 	page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
 	refs = record_subpages(page, addr, end, pages + *nr);
@@ -2390,12 +2395,9 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 	if (!pud_access_permitted(orig, flags & FOLL_WRITE))
 		return 0;

-	if (pud_devmap(orig)) {
-		if (unlikely(flags & FOLL_LONGTERM))
-			return 0;
+	if (pud_devmap(orig))
 		return __gup_device_huge_pud(orig, pudp, addr, end, flags,
 					     pages, nr);
-	}

 	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
 	refs = record_subpages(page, addr, end, pages + *nr);
Dan Williams Feb. 23, 2021, 4:44 p.m. UTC | #7
On Tue, Feb 23, 2021 at 8:30 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> On 2/20/21 1:18 AM, Dan Williams wrote:
> > On Tue, Dec 8, 2020 at 9:32 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> >> Patch 6 - 8: Optimize grabbing/release a page refcount changes given that we
> >> are working with compound pages i.e. we do 1 increment/decrement to the head
> >> page for a given set of N subpages compared as opposed to N individual writes.
> >> {get,pin}_user_pages_fast() for zone_device with compound pagemap consequently
> >> improves considerably, and unpin_user_pages() improves as well when passed a
> >> set of consecutive pages:
> >>
> >>                                            before          after
> >>     (get_user_pages_fast 1G;2M page size) ~75k  us -> ~3.2k ; ~5.2k us
> >>     (pin_user_pages_fast 1G;2M page size) ~125k us -> ~3.4k ; ~5.5k us
> >
> > Compelling!
> >
>
> BTW is there any reason why we don't support pin_user_pages_fast() with FOLL_LONGTERM for
> device-dax?
>

Good catch.

Must have been an oversight of the conversion. FOLL_LONGTERM collides
with filesystem operations, but not device-dax. In fact that's the
motivation for device-dax in the first instance, no need to coordinate
runtime physical address layout changes because the device is
statically allocated.

> Looking at the history, I understand that fsdax can't support it atm, but I am not sure
> that the same holds for device-dax. I have this small chunk (see below the scissors mark)
> which relaxes this for a pgmap of type MEMORY_DEVICE_GENERIC, albeit not sure if there is
> a fundamental issue for the other types that makes this an unwelcoming change.
>
>         Joao
>
> --------------------->8---------------------
>
> Subject: [PATCH] mm/gup: allow FOLL_LONGTERM pin-fast for
>  MEMORY_DEVICE_GENERIC
>
> The downside would be one extra lookup in dev_pagemap tree
> for other pgmap->types (P2P, FSDAX, PRIVATE). But just one
> per gup-fast() call.

I'd guess a dev_pagemap lookup is faster than a get_user_pages slow
path. It should be measurable that this change is at least as fast or
faster than falling back to the slow path, but it would be good to
measure.

Code changes look good to me.

>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  include/linux/mm.h |  5 +++++
>  mm/gup.c           | 24 +++++++++++++-----------
>  2 files changed, 18 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 32f0c3986d4f..c89a049bbd7a 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1171,6 +1171,11 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
>                 page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
>  }
>
> +static inline bool devmap_longterm_available(const struct dev_pagemap *pgmap)
> +{

I'd call this devmap_can_longterm().

> +       return pgmap->type == MEMORY_DEVICE_GENERIC;
> +}
> +
>  /* 127: arbitrary random number, small enough to assemble well */
>  #define page_ref_zero_or_close_to_overflow(page) \
>         ((unsigned int) page_ref_count(page) + 127u <= 127u)
> diff --git a/mm/gup.c b/mm/gup.c
> index 222d1fdc5cfa..03e370d360e6 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2092,14 +2092,18 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned
> long end,
>                         goto pte_unmap;
>
>                 if (pte_devmap(pte)) {
> -                       if (unlikely(flags & FOLL_LONGTERM))
> -                               goto pte_unmap;
> -
>                         pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
>                         if (unlikely(!pgmap)) {
>                                 undo_dev_pagemap(nr, nr_start, flags, pages);
>                                 goto pte_unmap;
>                         }
> +
> +                       if (unlikely(flags & FOLL_LONGTERM) &&
> +                           !devmap_longterm_available(pgmap)) {
> +                               undo_dev_pagemap(nr, nr_start, flags, pages);
> +                               goto pte_unmap;
> +                       }
> +
>                 } else if (pte_special(pte))
>                         goto pte_unmap;
>
> @@ -2195,6 +2199,10 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>                         return 0;
>                 }
>
> +               if (unlikely(flags & FOLL_LONGTERM) &&
> +                   !devmap_longterm_available(pgmap))
> +                       return 0;
> +
> @@ -2356,12 +2364,9 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
>         if (!pmd_access_permitted(orig, flags & FOLL_WRITE))
>                 return 0;
>
> -       if (pmd_devmap(orig)) {
> -               if (unlikely(flags & FOLL_LONGTERM))
> -                       return 0;
> +       if (pmd_devmap(orig))
>                 return __gup_device_huge_pmd(orig, pmdp, addr, end, flags,
>                                              pages, nr);
> -       }
>
>         page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
>         refs = record_subpages(page, addr, end, pages + *nr);
> @@ -2390,12 +2395,9 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
>         if (!pud_access_permitted(orig, flags & FOLL_WRITE))
>                 return 0;
>
> -       if (pud_devmap(orig)) {
> -               if (unlikely(flags & FOLL_LONGTERM))
> -                       return 0;
> +       if (pud_devmap(orig))
>                 return __gup_device_huge_pud(orig, pudp, addr, end, flags,
>                                              pages, nr);
> -       }
>
>         page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
>         refs = record_subpages(page, addr, end, pages + *nr);
> --
> 2.17.1
Joao Martins Feb. 23, 2021, 5:15 p.m. UTC | #8
On 2/23/21 4:44 PM, Dan Williams wrote:
> On Tue, Feb 23, 2021 at 8:30 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> On 2/20/21 1:18 AM, Dan Williams wrote:
>>> On Tue, Dec 8, 2020 at 9:32 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>> Patch 6 - 8: Optimize grabbing/release a page refcount changes given that we
>>>> are working with compound pages i.e. we do 1 increment/decrement to the head
>>>> page for a given set of N subpages compared as opposed to N individual writes.
>>>> {get,pin}_user_pages_fast() for zone_device with compound pagemap consequently
>>>> improves considerably, and unpin_user_pages() improves as well when passed a
>>>> set of consecutive pages:
>>>>
>>>>                                            before          after
>>>>     (get_user_pages_fast 1G;2M page size) ~75k  us -> ~3.2k ; ~5.2k us
>>>>     (pin_user_pages_fast 1G;2M page size) ~125k us -> ~3.4k ; ~5.5k us
>>>
>>> Compelling!
>>>
>>
>> BTW is there any reason why we don't support pin_user_pages_fast() with FOLL_LONGTERM for
>> device-dax?
>>
> 
> Good catch.
> 
> Must have been an oversight of the conversion. FOLL_LONGTERM collides
> with filesystem operations, but not device-dax. 

hmmmm, fwiw, it was unilaterally disabled for any devmap pmd/pud in commit 7af75561e171
("mm/gup: add FOLL_LONGTERM capability to GUP fast"), and I can only assume that
by "DAX pages" the submitter was only referring to fs-dax pages.

> In fact that's the
> motivation for device-dax in the first instance, no need to coordinate
> runtime physical address layout changes because the device is
> statically allocated.
> 
/me nods

>> Looking at the history, I understand that fsdax can't support it atm, but I am not sure
>> that the same holds for device-dax. I have this small chunk (see below the scissors mark)
>> which relaxes this for a pgmap of type MEMORY_DEVICE_GENERIC, albeit not sure if there is
>> a fundamental issue for the other types that makes this an unwelcoming change.
>>
>>         Joao
>>
>> --------------------->8---------------------
>>
>> Subject: [PATCH] mm/gup: allow FOLL_LONGTERM pin-fast for
>>  MEMORY_DEVICE_GENERIC
>>
>> The downside would be one extra lookup in dev_pagemap tree
>> for other pgmap->types (P2P, FSDAX, PRIVATE). But just one
>> per gup-fast() call.
> 
> I'd guess a dev_pagemap lookup is faster than a get_user_pages slow
> path. It should be measurable that this change is at least as fast or
> faster than falling back to the slow path, but it would be good to
> measure.
> 
But with the changes I am/will be making, I hope gup-fast and gup-slow
will be equally fast (for present pmds/puds of course, as the fault path makes it slower).

I'll formally submit the patch below once I've run through the numbers.

> Code changes look good to me.
> 
Cool! Will add in the suggested change below.

>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  include/linux/mm.h |  5 +++++
>>  mm/gup.c           | 24 +++++++++++++-----------
>>  2 files changed, 18 insertions(+), 11 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 32f0c3986d4f..c89a049bbd7a 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1171,6 +1171,11 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
>>                 page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
>>  }
>>
>> +static inline bool devmap_longterm_available(const struct dev_pagemap *pgmap)
>> +{
> 
> I'd call this devmap_can_longterm().
> 
Ack.
Dan Williams Feb. 23, 2021, 6:15 p.m. UTC | #9
On Tue, Feb 23, 2021 at 9:16 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> On 2/23/21 4:44 PM, Dan Williams wrote:
> > On Tue, Feb 23, 2021 at 8:30 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> >>
> >> On 2/20/21 1:18 AM, Dan Williams wrote:
> >>> On Tue, Dec 8, 2020 at 9:32 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> >>>> Patch 6 - 8: Optimize grabbing/release a page refcount changes given that we
> >>>> are working with compound pages i.e. we do 1 increment/decrement to the head
> >>>> page for a given set of N subpages compared as opposed to N individual writes.
> >>>> {get,pin}_user_pages_fast() for zone_device with compound pagemap consequently
> >>>> improves considerably, and unpin_user_pages() improves as well when passed a
> >>>> set of consecutive pages:
> >>>>
> >>>>                                            before          after
> >>>>     (get_user_pages_fast 1G;2M page size) ~75k  us -> ~3.2k ; ~5.2k us
> >>>>     (pin_user_pages_fast 1G;2M page size) ~125k us -> ~3.4k ; ~5.5k us
> >>>
> >>> Compelling!
> >>>
> >>
> >> BTW is there any reason why we don't support pin_user_pages_fast() with FOLL_LONGTERM for
> >> device-dax?
> >>
> >
> > Good catch.
> >
> > Must have been an oversight of the conversion. FOLL_LONGTERM collides
> > with filesystem operations, but not device-dax.
>
> hmmmm, fwiw, it was unilaterally disabled for any devmap pmd/pud in commit 7af75561e171
> ("mm/gup: add FOLL_LONGTERM capability to GUP fast") and I must only assume that
> by "DAX pages" the submitter was only referring to fs-dax pages.

Agree, that was an fsdax only assumption. Maybe this went unnoticed
because the primary gup-fast case for direct-I/O was not impacted.
Dan Williams Feb. 23, 2021, 10:48 p.m. UTC | #10
On Tue, Feb 23, 2021 at 10:54 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Feb 23, 2021 at 08:44:52AM -0800, Dan Williams wrote:
>
> > > The downside would be one extra lookup in dev_pagemap tree
> > > for other pgmap->types (P2P, FSDAX, PRIVATE). But just one
> > > per gup-fast() call.
> >
> > I'd guess a dev_pagemap lookup is faster than a get_user_pages slow
> > path. It should be measurable that this change is at least as fast or
> > faster than falling back to the slow path, but it would be good to
> > measure.
>
> What is the dev_pagemap thing doing in gup fast anyhow?
>
> I've been wondering for a while..

It's there to synchronize against dax-device removal. The device will
suspend removal awaiting all page references to be dropped, but
gup-fast could be racing device removal. So gup-fast checks for
pte_devmap() to grab a live reference to the device before assuming it
can pin a page.
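
For reference, a simplified illustration of that check as it exists in today's
gup_pte_range() (error handling trimmed; the helper name is made up and this is
not a drop-in replacement for the real code):

    /*
     * Simplified illustration of the existing gup-fast handling of
     * pte_devmap() entries.
     */
    static struct dev_pagemap *gup_fast_devmap_check(pte_t pte,
                                                     struct dev_pagemap *pgmap)
    {
            if (!pte_devmap(pte))
                    return pgmap;
            /*
             * Pin the hosting dev_pagemap so device removal is held off
             * until the page reference is taken; a NULL return means the
             * device is already dying and gup-fast must bail out to the
             * slow path.
             */
            return get_dev_pagemap(pte_pfn(pte), pgmap);
    }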
Dan Williams Feb. 24, 2021, 12:14 a.m. UTC | #11
[ add Ralph ]

On Tue, Feb 23, 2021 at 3:07 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Feb 23, 2021 at 02:48:20PM -0800, Dan Williams wrote:
> > On Tue, Feb 23, 2021 at 10:54 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Tue, Feb 23, 2021 at 08:44:52AM -0800, Dan Williams wrote:
> > >
> > > > > The downside would be one extra lookup in dev_pagemap tree
> > > > > for other pgmap->types (P2P, FSDAX, PRIVATE). But just one
> > > > > per gup-fast() call.
> > > >
> > > > I'd guess a dev_pagemap lookup is faster than a get_user_pages slow
> > > > path. It should be measurable that this change is at least as fast or
> > > > faster than falling back to the slow path, but it would be good to
> > > > measure.
> > >
> > > What is the dev_pagemap thing doing in gup fast anyhow?
> > >
> > > I've been wondering for a while..
> >
> > It's there to synchronize against dax-device removal. The device will
> > suspend removal awaiting all page references to be dropped, but
> > gup-fast could be racing device removal. So gup-fast checks for
> > pte_devmap() to grab a live reference to the device before assuming it
> > can pin a page.
>
> From the perspective of CPU A it can't tell if CPU B is doing a HW
> page table walk or a GUP fast when it invalidates a page table. The
> design of gup-fast is supposed to be the same as the design of a HW
> page table walk, and the tlb invalidate CPU A does when removing a
> page from a page table is supposed to serialize against both a HW page
> table walk and gup-fast.
>
> Given that the HW page table walker does not do dev_pagemap stuff, why
> does gup-fast?

gup-fast historically assumed that the 'struct page' and memory
backing the page-table walk could not physically be removed from the
system during its walk because those pages were allocated from the
page allocator before being mapped into userspace. So there is an
implied elevated reference on any page that gup-fast would be asked to
walk, or pte_special() is there to say "wait, nevermind, this isn't a
page allocator page, fall back to gup-slow()". pte_devmap() is there to
say "wait, there is no implied elevated reference for this page, check
and hold dev_pagemap alive until a page reference can be taken". So it
splits the difference between pte_special() and typical page allocator
pages.

> Can you sketch the exact race this is protecting against?

Thread1 mmaps /mnt/daxfile1 from a "mount -o dax" filesystem and
issues direct I/O with that mapping as the target buffer, Thread2 does
"echo "namespace0.0" > /sys/bus/nd/drivers/nd_pmem/unbind". Without
the dev_pagemap check reference gup-fast could execute
get_page(pte_page(pte)) on a page that doesn't even exist anymore
because the driver unbind has already performed remove_pages().

Effectively the same percpu_ref that protects the pmem0 block device
from new command submissions while the device is dying also prevents
new dax page references being taken while the device is dying.

This could be solved with the traditional gup-fast rules if the device
driver could tell the filesystem to unmap all dax files and force them
to re-fault through the gup-slow path to see that the device is now
dying. I'll likely be working on that sooner rather than later given
some of the expectations of the CXL persistent memory "dirty shutdown"
detection.
Dan Williams Feb. 24, 2021, 1:32 a.m. UTC | #12
On Tue, Feb 23, 2021 at 5:00 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Feb 23, 2021 at 04:14:01PM -0800, Dan Williams wrote:
> > [ add Ralph ]
> >
> > On Tue, Feb 23, 2021 at 3:07 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Tue, Feb 23, 2021 at 02:48:20PM -0800, Dan Williams wrote:
> > > > On Tue, Feb 23, 2021 at 10:54 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > > >
> > > > > On Tue, Feb 23, 2021 at 08:44:52AM -0800, Dan Williams wrote:
> > > > >
> > > > > > > The downside would be one extra lookup in dev_pagemap tree
> > > > > > > for other pgmap->types (P2P, FSDAX, PRIVATE). But just one
> > > > > > > per gup-fast() call.
> > > > > >
> > > > > > I'd guess a dev_pagemap lookup is faster than a get_user_pages slow
> > > > > > path. It should be measurable that this change is at least as fast or
> > > > > > faster than falling back to the slow path, but it would be good to
> > > > > > measure.
> > > > >
> > > > > What is the dev_pagemap thing doing in gup fast anyhow?
> > > > >
> > > > > I've been wondering for a while..
> > > >
> > > > It's there to synchronize against dax-device removal. The device will
> > > > suspend removal awaiting all page references to be dropped, but
> > > > gup-fast could be racing device removal. So gup-fast checks for
> > > > pte_devmap() to grab a live reference to the device before assuming it
> > > > can pin a page.
> > >
> > > From the perspective of CPU A it can't tell if CPU B is doing a HW
> > > page table walk or a GUP fast when it invalidates a page table. The
> > > design of gup-fast is supposed to be the same as the design of a HW
> > > page table walk, and the tlb invalidate CPU A does when removing a
> > > page from a page table is supposed to serialize against both a HW page
> > > table walk and gup-fast.
> > >
> > > Given that the HW page table walker does not do dev_pagemap stuff, why
> > > does gup-fast?
> >
> > gup-fast historically assumed that the 'struct page' and memory
> > backing the page-table walk could not physically be removed from the
> > system during its walk because those pages were allocated from the
> > page allocator before being mapped into userspace.
>
> No, I'd say gup-fast assumes that any non-special PTE it finds in a
> page table must have a struct page.
>
> If something wants to remove that struct page it must first remove all
> the PTEs pointing at it from the entire system and flush the TLBs,
> which directly prevents a future gup-fast from running and trying to
> access the struct page. No extra locking needed
>
> > implied elevated reference on any page that gup-fast would be asked to
> > walk, or pte_special() is there to "say wait, nevermind this isn't a
> > page allocator page fallback to gup-slow()".
>
> pte_special says there is no struct page, and some of those cases can
> be fixed up in gup-slow.
>
> > > Can you sketch the exact race this is protecting against?
> >
> > Thread1 mmaps /mnt/daxfile1 from a "mount -o dax" filesystem and
> > issues direct I/O with that mapping as the target buffer, Thread2 does
> > "echo "namespace0.0" > /sys/bus/nd/drivers/nd_pmem/unbind". Without
> > the dev_pagemap check reference gup-fast could execute
> > get_page(pte_page(pte)) on a page that doesn't even exist anymore
> > because the driver unbind has already performed remove_pages().
>
> Surely the unbind either waits for all the VMAs to be destroyed or
> zaps them before allowing things to progress to remove_pages()?

If we're talking about device-dax this is precisely what it does, zaps
and prevents new faults from resolving, but filesystem-dax...

> Having a situation where the CPU page tables still point at physical
> pages that have been removed sounds so crazy/insecure, that can't be
> what is happening, can it??

Hmm, that may be true and an original dax bug! The unbind of a
block-device from underneath the filesystem does trigger the
filesystem to emergency shutdown / go read-only, but unless that
process also includes a global zap of all dax mappings not only is
that violating expectations of "page-tables to disappearing memory",
but the filesystem may also want to guarantee that no further dax
writes can happen after shutdown. Right now I believe it only assumes
that mmap I/O will come from page writeback so there's no need to
bother applications with mappings to page cache, but dax mappings need
to be ripped away.

/me goes to look at what filesystems guarantee when the block-device is
surprise-removed out from under them.

In any event, this accelerates the effort to go implement
fs-global-dax-zap at the request of the device driver.