
[v5,0/8] mm, dax: Introduce compound pages in devmap

Message ID: 20211112150824.11028-1-joao.m.martins@oracle.com

Message

Joao Martins Nov. 12, 2021, 3:08 p.m. UTC
Changes since v4[4]:

* Remove patches 8-14 as they will go in 2 separate (parallel) series;
* Rename @geometry to @vmemmap_shift (Christoph Hellwig)
* Make @vmemmap_shift an order rather than nr of pages (Christoph Hellwig)
* Consequently remove the pgmap_geometry_order() helper as it's no longer
  needed, in favour of accessing the structure member directly [Patches 4 and 8]
* Rename pgmap_geometry() to pgmap_vmemmap_nr() in patches 4 and 8;
* Remove usage of pgmap_geometry() in favour of testing
  @vmemmap_shift for non-zero directly in patch 8;
* Patch 5 is new for using `struct_size()` (Dan Williams)
* Add a 'static_dev_dax()' helper for testing pgmap == NULL handling
for dynamic dax devices.
* Expand patch 6 to be explicitly on those !pgmap cases, and replace
those with static_dev_dax().
* Add gup/pin_user_pages() performance numbers with this series to patch 8.
* Massage commit description to remove mentions of @geometry.
* Add Dan's Reviewed-by on patch 8 (Dan Williams)

---

This series converts device-dax to use compound pages, and moves away from the
'struct page per basepage on PMD/PUD' that is done today. Doing so unlocks a
few noticeable improvements in unpin_user_pages() and makes the
device-dax+altmap case 4x faster in pinning (numbers below and in the last patch).

I've split the compound-pages-in-devmap part from the rest of the work based on
recent discussions about pending and planned devmap work[5][6]. There is
consensus that device-dax should be using compound pages to represent its
PMDs/PUDs just like HugeTLB and THP, which should lead to less specialization
of the dax parts. I will pursue the rest of the work in parallel once this part
is merged, in particular the GUP-{slow,fast} improvements [7] and the tail
struct page deduplication memory savings part[8].

To summarize what the series does:

Patch 1: Prepare hwpoisoning to work with dax compound pages.

Patches 2-3: Split the current prep_compound_page() utility function into head
and tail counterparts and use those two helpers where appropriate, to take
advantage of caches being warm after __init_single_page(). This is used when
initializing ZONE_DEVICE metadata as device-dax namespaces are brought up.
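
Roughly, the split looks like the following (a simplified sketch, not the
exact diff):

    static void prep_compound_head(struct page *page, unsigned int order)
    {
            set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
            set_compound_order(page, order);
            atomic_set(compound_mapcount_ptr(page), -1);
    }

    static void prep_compound_tail(struct page *head, int tail_idx)
    {
            struct page *p = head + tail_idx;

            p->mapping = TAIL_MAPPING;
            set_compound_head(p, head);
    }

    void prep_compound_page(struct page *page, unsigned int order)
    {
            int i;

            __SetPageHead(page);
            for (i = 1; i < (1 << order); i++)
                    prep_compound_tail(page, i);
            prep_compound_head(page, order);
    }

Callers that initialize struct pages one at a time (like
memmap_init_zone_device()) can then call prep_compound_tail() right after
__init_single_page(), while that page's cache lines are still hot.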

Patches 4-7: Add devmap support for compound pages in device-dax.
memmap_init_zone_device() initializes its metadata as compound pages, and a new
devmap property, @vmemmap_shift, describes how the vmemmap is structured: it is
essentially the page order of the metadata. It defaults to 0, which keeps
today's behaviour of base pages in devmap (as with fsdax and the others), so
usage of a compound devmap remains optional. Finally, device-dax sets the
devmap @vmemmap_shift to a value based on its own @align property; starting
with device-dax (*not* fsdax) we enable it by default. There are a few pinning
improvements, particularly on the unpinning case and with an altmap, and
unpin_user_page_range_dirty_lock() becomes just as effective as for
THP/hugetlb[0] pages:

    $ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S -a -n 512 -w
    (pin_user_pages_fast 2M pages) put:~71 ms -> put:~22 ms
    [altmap]
    (pin_user_pages_fast 2M pages) get:~524ms put:~525 ms -> get:~127ms put:~71ms

    $ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S -a -n 512 -w
    (pin_user_pages_fast 2M pages) put:~513 ms -> put:~188 ms
    [altmap with -m 127004]
    (pin_user_pages_fast 2M pages) get:~4.1 secs put:~4.12 secs -> get:~1sec put:~563ms
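
To illustrate the interface described above, the new property boils down to
roughly the following (names as used in this series; a simplified sketch, not
the exact diff):

    /* include/linux/memremap.h */
    struct dev_pagemap {
            ...
            /*
             * 0 (the default) means the vmemmap is composed of base pages,
             * as today; N means the metadata is made of order-N compound
             * pages, i.e. one head page per 2^N base pages.
             */
            unsigned long vmemmap_shift;
            ...
    };

    static inline unsigned long pgmap_vmemmap_nr(struct dev_pagemap *pgmap)
    {
            return 1 << pgmap->vmemmap_shift;
    }

    /* drivers/dax/device.c: derive the order from the device @align */
    pgmap->vmemmap_shift = order_base_2(dev_dax->align >> PAGE_SHIFT);

memmap_init_zone_device() then initializes one compound page per
pgmap_vmemmap_nr(pgmap) base PFNs instead of treating every page as order-0.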

Tested on x86 with 1TB+ of pmem (including registering it with RDMA, with and
without an altmap), alongside gup_test selftests with dynamic and static dax
regions, coupled with the ndctl unit tests for dynamic dax devices that
exercise all of this. Note that for dynamic dax regions I had to revert
commit 8aa83e6395 ("x86/setup: Call early_reserve_memory() earlier"); it is a
known issue that that commit broke efi_fake_mem=.

Patches apply on top of linux-next tag next-20211112 (commit f2e19fd15bd7).

Thanks for all the reviews.

Comments and suggestions very much appreciated!

Older Changelog,

v3[3] -> v4[4]:

 * Collect Dan's Reviewed-by on patches 1-5,8,9,11
 * Collect Muchun's Reviewed-by on patches 1, 2, 11
 * Reorder patches to first introduce compound pages in ZONE_DEVICE with
 device-dax (for pmem) as the first user (patches 1-8), followed by the
 sparse-vmemmap changes that minimize struct page overhead for devmap (patches 9-14)
 * Eliminate remnant @align references to use @geometry (Dan)
 * Convert mentions of 'compound pagemap' to 'compound devmap' throughout
   the series to avoid confusions of this work conflicting/referring to
   anything Folio or pagemap related.
 * Delete pgmap_pfn_geometry() on patch 4
   and rework other patches to use pgmap_geometry() instead (Dan)
 * Convert @geometry to be a number of pages rather than page size in patch 4 (Dan)
 * Make pgmap_geometry() more readable (Christoph)
 * Simplify pgmap refcount pfn computation in memremap_pages() (Christoph)
 * Rework memmap_init_compound() in patch 4 to use the same style as
 memmap_init_zone_device i.e. iterating over PFNs, rather than struct pages (Dan)
 * Add comment on devmap prep_compound_head callsite explaining why it needs
 to be used after first+second tail pages have been initialized (Dan, Jane)
 * Initialize tail page refcount to zero in patch 4
 * Make sure pfn_next() iterates over compound pages (rather than base pages) in
 patch 4 to tackle the zone_device elevated page refcount.
 [ Note these last two bullet points above are unneeded once this patch is merged:
   https://lore.kernel.org/linux-mm/20210825034828.12927-3-alex.sierra@amd.com/ ]
 * Remove usage of ternary operator when computing @end in gup_device_huge() in patch 8 (Dan)
 * Remove pinned_head variable in patch 8
 * Remove put_dev_pagemap() need for compound case as that is now fixed for the general case
 in patch 8
 * Switch to PageHead() instead of PageCompound() as we only work with either base pages
 or head pages in patch 8 (Matthew)
 * Fix kdoc of @altmap and improve kdoc for @pgmap in patch 9 (Dan)
 * Fix up missing return in vmemmap_populate_address() in patch 10
 * Change error handling style in all patches (Dan)
 * Change title of vmemmap_dedup.rst to be more representative of the purpose in patch 12 (Dan)
 * Move some of the section and subsection tail page reuse code into helpers
 reuse_compound_section() and compound_section_tail_page() for readability in patch 12 (Dan)
 * Commit description fixes for clarity in various patches (Dan)
 * Add pgmap_geometry_order() helper and
   drop unneeded geometry_size, order variables in patch 12
 * Drop unneeded byte based computation to be PFN in patch 12
 * Handle the dynamic dax region properly when ensuring a stable dev_dax->pgmap in patch 6.
 * Add a compound_nr_pages() helper and use it in memmap_init_zone_device to calculate
 the number of unique struct pages to initialize depending on @altmap existence in patch 13 (Dan)
 * Add compound_section_tail_huge_page() for the tail page PMD reuse in patch 14 (Dan)
 * Reword cover letter.

v2 -> v3[3]:
 * Collect Mike's Ack on patch 2 (Mike)
 * Collect Naoya's Reviewed-by on patch 1 (Naoya)
 * Rename compound_pagemaps.rst doc page (and its mentions) to vmemmap_dedup.rst (Mike, Muchun)
 * Rebased to next-20210714

v1[1] -> v2[2]:

 (New patches 7, 10, 11)
 * Remove occurrences of 'we' in the commit descriptions (now for real) [Dan]
 * Add comment on top of compound_head() for fsdax (Patch 1) [Dan]
 * Massage commit descriptions of cleanup/refactor patches to reflect [Dan]
 that it's in preparation for bigger infra in sparse-vmemmap. (Patch 2,3,5) [Dan]
 * Greatly improve all commit messages in terms of grammar/wording and clarity. [Dan]
 * Rename variable/helpers from dev_pagemap::align to @geometry, reflecting
 that it's not the same thing as dev_dax->align, Patch 4 [Dan]
 * Move compound page init logic into separate memmap_init_compound() helper, Patch 4 [Dan]
 * Simplify patch 9 as a result of having compound initialization differently [Dan]
 * Rename @pfn_align variable in memmap_init_zone_device to @pfns_per_compound [Dan]
 * Rename Subject of patch 6 [Dan]
 * Move hugetlb_vmemmap.c comment block to Documentation/vm Patch 7 [Dan]
 * Add some type-safety to @block and use 'struct page *' rather than
 void, Patch 8 [Dan]
 * Add some comments to less obvious parts on 1G compound page case, Patch 8 [Dan]
 * Remove the vmemmap lookup function in favour of
 pmd_off_k() + pte_offset_kernel() given some guarantees on section onlining
 serialization, Patch 8
 * Add a comment to get_page() mentioning where/how it is freed, Patch 8 [Dan]
 * Add docs about device-dax usage of tail dedup technique in newly added
 compound_pagemaps.rst doc entry.
 * Add cleanup patch for device-dax for ensuring dev_dax::pgmap is always set [Dan]
 * Add cleanup patch for device-dax for using ALIGN() [Dan]
 * Store pinned head in separate @pinned_head variable and fix error case, patch 13 [Dan]
 * Add comment on difference of @next value for PageCompound(), patch 13 [Dan]
 * Move PUD compound page to be last patch [Dan]
 * Add vmemmap layout for PUD compound geometry in compound_pagemaps.rst doc, patch 14 [Dan]
 * Rebased to next-20210617 

 RFC[0] -> v1:
 (New patches 1-3, 5-8 but the diffstat isn't that different)
 * Fix hwpoisoning of devmap pages reported by Jane (Patch 1 is new in v1)
 * Fix/Massage commit messages to be clearer and remove the 'we' occurrences (Dan, John, Matthew)
 * Use pfn_align to be clear it's nr of pages for @align value (John, Dan)
 * Add two helpers pgmap_align() and pgmap_pfn_align() as accessors of pgmap->align;
 * Remove the gup_device_compound_huge special path and have the same code
   work both ways while special casing when devmap page is compound (Jason, John)
 * Avoid usage of vmemmap_populate_basepages() and introduce a first class
   loop that doesn't care about passing an altmap for memmap reuse. (Dan)
 * Completely rework the vmemmap_populate_compound() to avoid the sparse_add_section
   hack into passing block across sparse_add_section calls. It's a lot easier to
   follow and more explicit in what it does.
 * Replace the vmemmap refactoring with adding a @pgmap argument and moving
   parts of the vmemmap_populate_base_pages(). (Patch 5 and 6 are new as a result)
 * Add PMD tail page vmemmap area reuse for 1GB pages. (Patch 8 is new)
 * Improve memmap_init_zone_device() to initialize compound pages while
    struct pages are cache warm. That led to an even further speedup over the
    RFC series, from 190ms -> 80-120ms. Patches 2 and 3 are new as a result (Dan)
 * Remove PGMAP_COMPOUND and use @align as the property to detect whether
   or not to reuse vmemmap areas (Dan)

[0] https://lore.kernel.org/linux-mm/20201208172901.17384-1-joao.m.martins@oracle.com/
[1] https://lore.kernel.org/linux-mm/20210325230938.30752-1-joao.m.martins@oracle.com/
[2] https://lore.kernel.org/linux-mm/20210617184507.3662-1-joao.m.martins@oracle.com/
[3] https://lore.kernel.org/linux-mm/20210714193542.21857-1-joao.m.martins@oracle.com/
[4] https://lore.kernel.org/linux-mm/20210827145819.16471-1-joao.m.martins@oracle.com/
[5] https://lore.kernel.org/linux-mm/20211018182559.GC3686969@ziepe.ca/
[6] https://lore.kernel.org/linux-mm/499043a0-b3d8-7a42-4aee-84b81f5b633f@oracle.com/
[7] https://lore.kernel.org/linux-mm/20210827145819.16471-9-joao.m.martins@oracle.com/
[8] https://lore.kernel.org/linux-mm/20210827145819.16471-13-joao.m.martins@oracle.com/

Joao Martins (8):
  memory-failure: fetch compound_head after pgmap_pfn_valid()
  mm/page_alloc: split prep_compound_page into head and tail subparts
  mm/page_alloc: refactor memmap_init_zone_device() page init
  mm/memremap: add ZONE_DEVICE support for compound pages
  device-dax: use ALIGN() for determining pgoff
  device-dax: use struct_size()
  device-dax: ensure dev_dax->pgmap is valid for dynamic devices
  device-dax: compound devmap support

 drivers/dax/bus.c        |  14 ++++
 drivers/dax/bus.h        |   1 +
 drivers/dax/device.c     |  88 ++++++++++++++++++-------
 include/linux/memremap.h |  11 ++++
 mm/memory-failure.c      |   6 ++
 mm/memremap.c            |  12 ++--
 mm/page_alloc.c          | 138 +++++++++++++++++++++++++++------------
 7 files changed, 200 insertions(+), 70 deletions(-)

Comments

Jason Gunthorpe Nov. 12, 2021, 3:40 p.m. UTC | #1
On Fri, Nov 12, 2021 at 04:08:16PM +0100, Joao Martins wrote:

> This series converts device-dax to use compound pages, and moves away from the
> 'struct page per basepage on PMD/PUD' that is done today. Doing so, unlocks a
> few noticeable improvements on unpin_user_pages() and makes device-dax+altmap case
> 4x times faster in pinning (numbers below and in last patch).

I like it - aside from performance this series is important to clean
up the ZONE_DEVICE refcounting mess as it means that only fsdax will
be installing tail pages as PMD entries.

Thanks,
Jason
Joao Martins Nov. 15, 2021, 11 a.m. UTC | #2
On 11/12/21 16:40, Jason Gunthorpe wrote:
> On Fri, Nov 12, 2021 at 04:08:16PM +0100, Joao Martins wrote:
> 
>> This series converts device-dax to use compound pages, and moves away from the
>> 'struct page per basepage on PMD/PUD' that is done today. Doing so, unlocks a
>> few noticeable improvements on unpin_user_pages() and makes device-dax+altmap case
>> 4x times faster in pinning (numbers below and in last patch).
> 
> I like it - aside from performance this series is important to clean
> up the ZONE_DEVICE refcounting mess as it means that only fsdax will
> be installing tail pages as PMD entries.
> 
Yes, indeed. I should have emphasized that more in the cover letter.

Will fix for v6 (if there's a new respin).

> Thanks,
> Jason
>