mbox series

[v4,00/14] mm, sparse-vmemmap: Introduce compound devmaps for device-dax

Message ID 20210827145819.16471-1-joao.m.martins@oracle.com (mailing list archive)
Headers show
Series mm, sparse-vmemmap: Introduce compound devmaps for device-dax | expand

Message

Joao Martins Aug. 27, 2021, 2:58 p.m. UTC
Changes since v3[0]:

 * Collect Dan's Reviewed-by on patches 1-5,8,9,11
 * Collect Muchun Reviewed-by on patch 1,2,11
 * Reorder patches to first introduce compound pages in ZONE_DEVICE with
 device-dax (for pmem) as first user (patches 1-8) followed by implementing
 the sparse-vmemmap changes for minimize struct page overhead for devmap (patches 9-14)
 * Eliminate remnant @align references to use @geometry (Dan)
 * Convert mentions of 'compound pagemap' to 'compound devmap' throughout
   the series to avoid confusions of this work conflicting/referring to
   anything Folio or pagemap related.
 * Delete pgmap_pfn_geometry() on patch 4
   and rework other patches to use pgmap_geometry() instead (Dan)
 * Convert @geometry to be a number of pages rather than page size in patch 4 (Dan)
 * Make pgmap_geometry() more readable (Christoph)
 * Simplify pgmap refcount pfn computation in memremap_pages() (Christoph)
 * Rework memmap_init_compound() in patch 4 to use the same style as
 memmap_init_zone_device i.e. iterating over PFNs, rather than struct pages (Dan)
 * Add comment on devmap prep_compound_head callsite explaining why it needs
 to be used after first+second tail pages have been initialized (Dan, Jane)
 * Initialize tail page refcount to zero in patch 4
 * Make sure pfn_next() iterate over compound pages (rather than base page) in
 patch 4 to tackle the zone_device elevated page refcount.
 [ Note these last two bullet points above are unneeded once this patch is merged:
   https://lore.kernel.org/linux-mm/20210825034828.12927-3-alex.sierra@amd.com/ ]
 * Remove usage of ternary operator when computing @end in gup_device_huge() in patch 8 (Dan)
 * Remove pinned_head variable in patch 8
 * Remove put_dev_pagemap() need for compound case as that is now fixed for the general case
 in patch 8
 * Switch to PageHead() instead of PageCompound() as we only work with either base pages
 or head pages in patch 8 (Matthew)
 * Fix kdoc of @altmap and improve kdoc for @pgmap in patch 9 (Dan)
 * Fix up missing return in vmemmap_populate_address() in patch 10
 * Change error handling style in all patches (Dan)
 * Change title of vmemmap_dedup.rst to be more representative of the purpose in patch 12 (Dan)
 * Move some of the section and subsection tail page reuse code into helpers
 reuse_compound_section() and compound_section_tail_page() for readability in patch 12 (Dan)
 * Commit description fixes for clearity in various patches (Dan)
 * Add pgmap_geometry_order() helper and
   drop unneeded geometry_size, order variables in patch 12
 * Drop unneeded byte based computation to be PFN in patch 12
 * Handle the dynamic dax region properly when ensuring a stable dev_dax->pgmap in patch 6.
 * Add a compound_nr_pages() helper and use it in memmap_init_zone_device to calculate
 the number of unique struct pages to initialize depending on @altmap existence in patch 13 (Dan)
 * Add compound_section_tail_huge_page() for the tail page PMD reuse in patch 14 (Dan)
 * Reword cover letter.

[0] https://lore.kernel.org/linux-mm/20210617184507.3662-1-joao.m.martins@oracle.com/

Full changelog of previous versions at the bottom of cover letter.
---
This series, attempts at minimizing 'struct page' overhead by
pursuing a similar approach as Muchun Song series "Free some vmemmap
pages of hugetlb page"[0] (now merged for v5.14), but applied to devmap/ZONE_DEVICE.

[0] https://lore.kernel.org/linux-mm/20210308102807.59745-1-songmuchun@bytedance.com/

The link above describes it quite nicely, but the idea is to reuse/deduplicate tail
page vmemmap areas, particular the area which only describes tail pages.
So a vmemmap page describes 64 struct pages, and the first page for a given
ZONE_DEVICE vmemmap would contain the head page and 63 tail pages. The second
vmemmap page would contain only tail pages, and that's what gets reused across
the rest of the subsection/section. The bigger the page size, the bigger the
savings (2M hpage -> save 6 vmemmap pages; 1G hpage -> save 4094 vmemmap pages).

This series also takes one step further on 1GB pages and *also* deduplicate PMD pages
which only contain tail pages which allows to keep parity with current hugepage
based vmemmap. This further let us more than halve the overhead with 1GB pages
(40M -> 16M per Tb).

In terms of savings, per 1Tb of memory, the struct page cost would go down
with compound devmap:

* with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total memory)
* with 1G pages we lose 16MB instead of 16G (0.0014% instead of 1.5% of total memory)

Along the way I've extended it past 'struct page' overhead *trying* to address a
few performance issues we knew about for pmem, specifically on the
{pin,get}_user_pages_fast with device-dax vmas which are really
slow even of the fast variants. THP is great on -fast variants but all except
hugetlbfs perform rather poorly on non-fast gup. Although I deferred the
__get_user_pages() improvements (in a follow up small series I have stashed as its
ortogonal to device-dax; THP suffers from the same syndrome).

To summarize what the series does:

Patch 1: Prepare hwpoisoning to work with dax compound pages.

Patches 2-7: Prepare devmap infra to use compound pages (for device-dax).
Specifically, we split the current utility function of prep_compound_page()
into head and tail and use those two helpers where appropriate to take
advantage of caches being warm after __init_single_page(). Have
memmap_init_zone_device() initialize its metadata as compound pages, thus
introducing a new devmap property known as geometry which
outlines how the vmemmap is structured (defaults to base pages as done today).
Finally enable device-dax usage of devmap @geometry to a value
based on its own @align property. @geometry getter routine returns 1 by default (which
is today's case of base pages in devmap) and the usage of compound devmap is
optional. Starting with device-dax (*not* fsdax) we enable it by default
but fsdax and the rest of devmap users there's no change.

Patch 8: Optimize grabbing page refcount changes given that we
are working with compound pages i.e. we do 1 increment to the head
page for a given set of N subpages compared as opposed to N individual writes.
{get,pin}_user_pages_fast() for zone_device with compound devmap consequently
improves considerably with DRAM stored struct pages. It also *greatly*
improves pinning with altmap. Results with gup_test:

                                                   before     after
    (16G get_user_pages_fast 2M page size)         ~59 ms -> ~6.1 ms
    (16G pin_user_pages_fast 2M page size)         ~87 ms -> ~6.2 ms
    (16G get_user_pages_fast altmap 2M page size) ~494 ms -> ~9 ms
    (16G pin_user_pages_fast altmap 2M page size) ~494 ms -> ~10 ms

    altmap performance gets specially interesting when pinning a pmem dimm:

                                                   before     after
    (128G get_user_pages_fast 2M page size)         ~492 ms -> ~49 ms
    (128G pin_user_pages_fast 2M page size)         ~493 ms -> ~50 ms
    (128G get_user_pages_fast altmap 2M page size)  ~3.91 s -> ~70 ms
    (128G pin_user_pages_fast altmap 2M page size)  ~3.97 s -> ~74 ms

Patches 9-14: These patches take care of the struct page savings (also referred
to here as tail-page/vmemmap deduplication). Much like Muchun series, we reuse
PTE (and PMD) tail page vmemmap areas across a given geometry (which is seede by
 @align property of dax-core code) and enabling of memremap to
initialize the ZONE_DEVICE pages as compound pages of a given @align order. The
main difference though, is that contrary to the hugetlbfs series, there's no
vmemmap for the area, because we are populating it as opposed to remapping it.
IOW no freeing of pages of already initialized vmemmap like the case for
hugetlbfs, which simplifies the logic (besides not being arch-specific). After
these, there's also quite a visible region bootstrap of pmem vmemmap given that we
would initialize fewer struct pages depending on the page size with DRAM backed
struct pages -- because fewer pages are unique and most tail pages (with bigger
geometry sizes). altmap sees no significant difference in bootstrap AFAICT.
Patch 14 comes last as it's an improvement, not mandated for the initial
functionality. Also move the very nice docs of hugetlb_vmemmap.c into a
Documentation/vm/ entry and adjust them to the DAX case.

    NVDIMM namespace bootstrap improves from ~268-358 ms to ~80-110/<1ms on 128G NVDIMMs
    with 2M and 1G respectivally. And struct page needed capacity will be 3.8x / 1071x
    smaller for 2M and 1G respectivally.

Tested on x86 with 1Tb+ of pmem (alongside registering it with RDMA with and
without altmap), alongside gup_test selftests with dynamic dax regions and
static dax regions. Coupled with ndctl unit tests for both kinds of device-dax
regions (static/dynamic) that exercise all of this.
Note: Build-tested arm64 and ppc64.

I have deferred the __get_user_pages() patch to outside this series
(https://lore.kernel.org/linux-mm/20201208172901.17384-11-joao.m.martins@oracle.com/),
as I found an simpler way to address it and that is also applicable to
THP. But will submit that as a follow up of this.

Patches apply on top of linux-next tag next-20210827 (commit 5e63226c7228).

Comments and suggestions very much appreciated!

Older Changelog,

v2 -> v3[3]:
 * Collect Mike's Ack on patch 2 (Mike)
 * Collect Naoya's Reviewed-by on patch 1 (Naoya)
 * Rename compound_pagemaps.rst doc page (and its mentions) to vmemmap_dedup.rst (Mike, Muchun)
 * Rebased to next-20210714

v1[1] -> v2[2]:

 (New patches 7, 10, 11)
 * Remove occurences of 'we' in the commit descriptions (now for real) [Dan]
 * Add comment on top of compound_head() for fsdax (Patch 1) [Dan]
 * Massage commit descriptions of cleanup/refactor patches to reflect [Dan]
 that it's in preparation for bigger infra in sparse-vmemmap. (Patch 2,3,5) [Dan]
 * Greatly improve all commit messages in terms of grammar/wording and clearity. [Dan]
 * Rename variable/helpers from dev_pagemap::align to @geometry, reflecting
 tht it's not the same thing as dev_dax->align, Patch 4 [Dan]
 * Move compound page init logic into separate memmap_init_compound() helper, Patch 4 [Dan]
 * Simplify patch 9 as a result of having compound initialization differently [Dan]
 * Rename @pfn_align variable in memmap_init_zone_device to @pfns_per_compound [Dan]
 * Rename Subject of patch 6 [Dan]
 * Move hugetlb_vmemmap.c comment block to Documentation/vm Patch 7 [Dan]
 * Add some type-safety to @block and use 'struct page *' rather than
 void, Patch 8 [Dan]
 * Add some comments to less obvious parts on 1G compound page case, Patch 8 [Dan]
 * Remove vmemmap lookup function in place of
 pmd_off_k() + pte_offset_kernel() given some guarantees on section onlining
 serialization, Patch 8
 * Add a comment to get_page() mentioning where/how it is, Patch 8 freed [Dan]
 * Add docs about device-dax usage of tail dedup technique in newly added
 compound_pagemaps.rst doc entry.
 * Add cleanup patch for device-dax for ensuring dev_dax::pgmap is always set [Dan]
 * Add cleanup patch for device-dax for using ALIGN() [Dan]
 * Store pinned head in separate @pinned_head variable and fix error case, patch 13 [Dan]
 * Add comment on difference of @next value for PageCompound(), patch 13 [Dan]
 * Move PUD compound page to be last patch [Dan]
 * Add vmemmap layout for PUD compound geometry in compound_pagemaps.rst doc, patch 14 [Dan]
 * Rebased to next-20210617 

 RFC[0] -> v1:
 (New patches 1-3, 5-8 but the diffstat isn't that different)
 * Fix hwpoisoning of devmap pages reported by Jane (Patch 1 is new in v1)
 * Fix/Massage commit messages to be more clear and remove the 'we' occurences (Dan, John, Matthew)
 * Use pfn_align to be clear it's nr of pages for @align value (John, Dan)
 * Add two helpers pgmap_align() and pgmap_pfn_align() as accessors of pgmap->align;
 * Remove the gup_device_compound_huge special path and have the same code
   work both ways while special casing when devmap page is compound (Jason, John)
 * Avoid usage of vmemmap_populate_basepages() and introduce a first class
   loop that doesn't care about passing an altmap for memmap reuse. (Dan)
 * Completely rework the vmemmap_populate_compound() to avoid the sparse_add_section
   hack into passing block across sparse_add_section calls. It's a lot easier to
   follow and more explicit in what it does.
 * Replace the vmemmap refactoring with adding a @pgmap argument and moving
   parts of the vmemmap_populate_base_pages(). (Patch 5 and 6 are new as a result)
 * Add PMD tail page vmemmap area reuse for 1GB pages. (Patch 8 is new)
 * Improve memmap_init_zone_device() to initialize compound pages when
   struct pages are cache warm. That lead to a even further speed up further
   from RFC series from 190ms -> 80-120ms. Patches 2 and 3 are the new ones
   as a result (Dan)
 * Remove PGMAP_COMPOUND and use @align as the property to detect whether
   or not to reuse vmemmap areas (Dan)

[0] https://lore.kernel.org/linux-mm/20201208172901.17384-1-joao.m.martins@oracle.com/
[1] https://lore.kernel.org/linux-mm/20210325230938.30752-1-joao.m.martins@oracle.com/
[2] https://lore.kernel.org/linux-mm/20210617184507.3662-1-joao.m.martins@oracle.com/
[3] https://lore.kernel.org/linux-mm/20210617184507.3662-1-joao.m.martins@oracle.com/

Joao Martins (14):
  memory-failure: fetch compound_head after pgmap_pfn_valid()
  mm/page_alloc: split prep_compound_page into head and tail subparts
  mm/page_alloc: refactor memmap_init_zone_device() page init
  mm/memremap: add ZONE_DEVICE support for compound pages
  device-dax: use ALIGN() for determining pgoff
  device-dax: ensure dev_dax->pgmap is valid for dynamic devices
  device-dax: compound devmap support
  mm/gup: grab head page refcount once for group of subpages
  mm/sparse-vmemmap: add a pgmap argument to section activation
  mm/sparse-vmemmap: refactor core of vmemmap_populate_basepages() to
    helper
  mm/hugetlb_vmemmap: move comment block to Documentation/vm
  mm/sparse-vmemmap: populate compound devmaps
  mm/page_alloc: reuse tail struct pages for compound devmaps
  mm/sparse-vmemmap: improve memory savings for compound pud geometry

 Documentation/vm/index.rst         |   1 +
 Documentation/vm/vmemmap_dedup.rst | 300 +++++++++++++++++++++++++++++
 drivers/dax/bus.c                  |   8 +
 drivers/dax/device.c               |  58 ++++--
 include/linux/memory_hotplug.h     |   5 +-
 include/linux/memremap.h           |  17 ++
 include/linux/mm.h                 |   8 +-
 mm/gup.c                           |  51 +++--
 mm/hugetlb_vmemmap.c               | 162 +---------------
 mm/memory-failure.c                |   6 +
 mm/memory_hotplug.c                |   3 +-
 mm/memremap.c                      |  13 +-
 mm/page_alloc.c                    | 151 +++++++++++----
 mm/sparse-vmemmap.c                | 278 +++++++++++++++++++++++---
 mm/sparse.c                        |  26 ++-
 15 files changed, 803 insertions(+), 284 deletions(-)
 create mode 100644 Documentation/vm/vmemmap_dedup.rst