[026/146] mm/page_alloc: split prep_compound_page into head and tail subparts

Message ID 20220114220415.Wq5bV9-yy%akpm@linux-foundation.org
State New
Series [001/146] kthread: add the helper function kthread_run_on_cpu()

Commit Message

Andrew Morton Jan. 14, 2022, 10:04 p.m. UTC
From: Joao Martins <joao.m.martins@oracle.com>
Subject: mm/page_alloc: split prep_compound_page into head and tail subparts

Patch series "mm, device-dax: Introduce compound pages in devmap", v7.

This series converts device-dax to use compound pages, and moves away
from the 'struct page per base page on PMD/PUD' scheme that is used
today.  Doing so 1) unlocks a few noticeable improvements on
unpin_user_pages() and makes the device-dax+altmap case 4x faster in
pinning (numbers below and in the last patch), and 2) is, as mentioned
in various other threads, one important step towards cleaning up
ZONE_DEVICE refcounting.

I've split the compound-pages-on-devmap part from the rest, based on
recent discussions about pending and future devmap work[5][6].  There
is consensus that device-dax should be using compound pages to
represent its PMD/PUDs just like HugeTLB and THP, and that this leads
to less specialization of the dax parts.  I will pursue the rest of
the work in parallel once this part is merged, particularly the
GUP-{slow,fast} improvements[7] and the tail struct page deduplication
memory savings part[8].

To summarize what the series does:

Patch 1: Prepare hwpoisoning to work with dax compound pages.

Patches 2-3: Split the current utility function prep_compound_page()
into head and tail counterparts, and use those two helpers where
appropriate to take advantage of caches being warm after
__init_single_page().  This is used when initializing the ZONE_DEVICE
metadata as we bring up device-dax namespaces.
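
For reference, the devmap patches later in the series use the two
helpers roughly as sketched below.  This is a simplified illustration,
not the exact hunk from patches 4-10: each tail page is linked into
the compound page right after __init_single_page() writes it, while
its struct page cacheline is still warm, and the head is prepared once
at the end.

	/*
	 * Sketch only: a simplified stand-in for the compound-page loop
	 * that the later patches add to memmap_init_zone_device().  The
	 * prep_compound_{head,tail}() helpers are the ones introduced by
	 * patches 2-3.
	 */
	static void init_compound_sketch(struct page *head,
					 unsigned long head_pfn,
					 unsigned int order,
					 unsigned long zone_idx, int nid)
	{
		unsigned long pfn, nr_pages = 1UL << order;

		__SetPageHead(head);
		for (pfn = head_pfn + 1; pfn < head_pfn + nr_pages; pfn++) {
			struct page *page = pfn_to_page(pfn);

			__init_single_page(page, pfn, zone_idx, nid);
			/* @page's cacheline is warm: link the tail now */
			prep_compound_tail(head, pfn - head_pfn);
			/* compound tail pages hold no reference */
			set_page_count(page, 0);
		}
		prep_compound_head(head, order);
	}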

Patches 4-10: Add devmap support for compound pages in device-dax. 
memmap_init_zone_device() initializes its metadata as compound pages,
and a new devmap property, vmemmap_shift, is introduced to describe
how the vmemmap is structured (it essentially specifies the page order
of the metadata, and defaults to base pages as done today).  While at
it, patches 5-9 do a few cleanups in device-dax.  Finally, device-dax
is enabled to set the devmap @vmemmap_shift to a value based on its
own @align property.  @vmemmap_shift is 0 by default (today's case of
base pages in devmap, as for fsdax and the others), so use of the
compound devmap is optional; starting with device-dax (*not* fsdax) we
enable it by default.  There are a few pinning improvements,
particularly in the unpinning and altmap cases, and
unpin_user_page_range_dirty_lock() becomes just as effective as it is
for THP/hugetlb[0] pages.

    $ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S -a -n 512 -w
    (pin_user_pages_fast 2M pages) put:~71 ms -> put:~22 ms
    [altmap]
    (pin_user_pages_fast 2M pages) get:~524ms put:~525 ms -> get:~127ms put:~71ms

    $ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S -a -n 512 -w
    (pin_user_pages_fast 2M pages) put:~513 ms -> put:~188 ms
    [altmap with -m 127004]
    (pin_user_pages_fast 2M pages) get:~4.1 secs put:~4.12 secs -> get:~1sec put:~563ms
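
To make the @vmemmap_shift property above concrete, here is a hedged
sketch; the field lives in struct dev_pagemap as described in this
cover letter, while the helper and the device-dax assignment below are
illustrative rather than the series' exact code:

	/*
	 * Sketch of the new devmap property.  vmemmap_shift is the page
	 * order of the memmap: 0 (the default) keeps today's base struct
	 * pages, while device-dax derives it from its @align.
	 */
	struct dev_pagemap {
		/* ... existing fields ... */
		unsigned long vmemmap_shift;
	};

	/* Number of base struct pages behind one compound page. */
	static inline unsigned long pgmap_vmemmap_nr(struct dev_pagemap *pgmap)
	{
		return 1UL << pgmap->vmemmap_shift;
	}

	/* device-dax (fragment): e.g. a 2M @align on x86 yields order-9
	 * (PMD-sized) compound pages. */
	pgmap->vmemmap_shift = order_base_2(dev_dax->align >> PAGE_SHIFT);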

Tested on x86 with 1TB+ of pmem (alongside registering it with RDMA,
with and without altmap), together with gup_test selftests on dynamic
and static dax regions, coupled with ndctl unit tests for dynamic dax
devices that exercise all of this.  Note that for dynamic dax regions
I had to revert commit 8aa83e6395 ("x86/setup: Call
early_reserve_memory() earlier"); it is a known issue that this commit
broke efi_fake_mem=.


This patch (of 11):

Split the utility function prep_compound_page() into head and tail
counterparts, and use them accordingly.

This is in preparation for sharing the storage for compound page
metadata.

Link: https://lkml.kernel.org/r/20211202204422.26777-1-joao.m.martins@oracle.com
Link: https://lkml.kernel.org/r/20211202204422.26777-3-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |   30 ++++++++++++++++++++----------
 1 file changed, 20 insertions(+), 10 deletions(-)

--- a/mm/page_alloc.c~mm-page_alloc-split-prep_compound_page-into-head-and-tail-subparts
+++ a/mm/page_alloc.c
@@ -726,23 +726,33 @@  void free_compound_page(struct page *pag
 	free_the_page(page, compound_order(page));
 }
 
+static void prep_compound_head(struct page *page, unsigned int order)
+{
+	set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
+	set_compound_order(page, order);
+	atomic_set(compound_mapcount_ptr(page), -1);
+	if (hpage_pincount_available(page))
+		atomic_set(compound_pincount_ptr(page), 0);
+}
+
+static void prep_compound_tail(struct page *head, int tail_idx)
+{
+	struct page *p = head + tail_idx;
+
+	p->mapping = TAIL_MAPPING;
+	set_compound_head(p, head);
+}
+
 void prep_compound_page(struct page *page, unsigned int order)
 {
 	int i;
 	int nr_pages = 1 << order;
 
 	__SetPageHead(page);
-	for (i = 1; i < nr_pages; i++) {
-		struct page *p = page + i;
-		p->mapping = TAIL_MAPPING;
-		set_compound_head(p, page);
-	}
+	for (i = 1; i < nr_pages; i++)
+		prep_compound_tail(page, i);
 
-	set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
-	set_compound_order(page, order);
-	atomic_set(compound_mapcount_ptr(page), -1);
-	if (hpage_pincount_available(page))
-		atomic_set(compound_pincount_ptr(page), 0);
+	prep_compound_head(page, order);
 }
 
 #ifdef CONFIG_DEBUG_PAGEALLOC