diff mbox series

[RFC,6/9] mm/gup: Grab head page refcount once for group of subpages

Message ID 20201208172901.17384-8-joao.m.martins@oracle.com (mailing list archive)
State New
Headers show
Series mm, sparse-vmemmap: Introduce compound pagemaps | expand

Commit Message

Joao Martins Dec. 8, 2020, 5:28 p.m. UTC
Much like hugetlbfs or THPs, we treat device pagemaps with
compound pages like the rest of GUP handling of compound pages.

Rather than incrementing the refcount every 4K, we record
all sub pages and increment by @refs amount *once*.

Performance measured by gup_benchmark improves considerably
get_user_pages_fast() and pin_user_pages_fast():

 $ gup_benchmark -f /dev/dax0.2 -m 16384 -r 10 -S [-u,-a] -n 512 -w

(get_user_pages_fast 2M pages) ~75k us -> ~3.6k us
(pin_user_pages_fast 2M pages) ~125k us -> ~3.8k us

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/gup.c | 67 ++++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 51 insertions(+), 16 deletions(-)

Comments

John Hubbard Dec. 9, 2020, 4:40 a.m. UTC | #1
On 12/8/20 9:28 AM, Joao Martins wrote:
> Much like hugetlbfs or THPs, we treat device pagemaps with
> compound pages like the rest of GUP handling of compound pages.
> 
> Rather than incrementing the refcount every 4K, we record
> all sub pages and increment by @refs amount *once*.
> 
> Performance measured by gup_benchmark improves considerably
> get_user_pages_fast() and pin_user_pages_fast():
> 
>   $ gup_benchmark -f /dev/dax0.2 -m 16384 -r 10 -S [-u,-a] -n 512 -w

"gup_test", now that you're in linux-next, actually.

(Maybe I'll retrofit that test with getopt_long(), those options are
getting more elaborate.)

> 
> (get_user_pages_fast 2M pages) ~75k us -> ~3.6k us
> (pin_user_pages_fast 2M pages) ~125k us -> ~3.8k us

That is a beautiful result! I'm very motivated to see if this patchset
can make it in, in some form.

> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   mm/gup.c | 67 ++++++++++++++++++++++++++++++++++++++++++--------------
>   1 file changed, 51 insertions(+), 16 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 98eb8e6d2609..194e6981eb03 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2250,22 +2250,68 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>   }
>   #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
>   
> +
> +static int record_subpages(struct page *page, unsigned long addr,
> +			   unsigned long end, struct page **pages)
> +{
> +	int nr;
> +
> +	for (nr = 0; addr != end; addr += PAGE_SIZE)
> +		pages[nr++] = page++;
> +
> +	return nr;
> +}
> +
>   #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
> -static int __gup_device_huge(unsigned long pfn, unsigned long addr,
> -			     unsigned long end, unsigned int flags,
> -			     struct page **pages, int *nr)
> +static int __gup_device_compound_huge(struct dev_pagemap *pgmap,
> +				      struct page *head, unsigned long sz,

If this variable survives (I see Jason requested a reorg of this math stuff,
and I also like that idea), then I'd like a slightly better name for "sz".

I was going to suggest one, but then realized that I can't understand how this
works. See below...

> +				      unsigned long addr, unsigned long end,
> +				      unsigned int flags, struct page **pages)
> +{
> +	struct page *page;
> +	int refs;
> +
> +	if (!(pgmap->flags & PGMAP_COMPOUND))
> +		return -1;

btw, I'm unhappy with returning -1 here and assigning it later to a refs variable.
(And that will show up even more clearly as an issue if you attempt to make
refs unsigned everywhere!)

I'm not going to suggest anything because there are a lot of ways to structure
these routines, and I don't want to overly constrain you. Just please don't assign
negative values to any refs variables.

> +
> +	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);

If you pass in PMD_SHIFT or PUD_SHIFT for, that's a number-of-bits, isn't it?
Not a size. And if it's not a size, then sz - 1 doesn't work, does it? If it
does work, then better naming might help. I'm probably missing a really
obvious math trick here.


thanks,
Joao Martins Dec. 9, 2020, 11:05 a.m. UTC | #2
On 12/8/20 7:49 PM, Jason Gunthorpe wrote:
> On Tue, Dec 08, 2020 at 05:28:58PM +0000, Joao Martins wrote:
>> Much like hugetlbfs or THPs, we treat device pagemaps with
>> compound pages like the rest of GUP handling of compound pages.
>>
>> Rather than incrementing the refcount every 4K, we record
>> all sub pages and increment by @refs amount *once*.
>>
>> Performance measured by gup_benchmark improves considerably
>> get_user_pages_fast() and pin_user_pages_fast():
>>
>>  $ gup_benchmark -f /dev/dax0.2 -m 16384 -r 10 -S [-u,-a] -n 512 -w
>>
>> (get_user_pages_fast 2M pages) ~75k us -> ~3.6k us
>> (pin_user_pages_fast 2M pages) ~125k us -> ~3.8k us
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>  mm/gup.c | 67 ++++++++++++++++++++++++++++++++++++++++++--------------
>>  1 file changed, 51 insertions(+), 16 deletions(-)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 98eb8e6d2609..194e6981eb03 100644
>> +++ b/mm/gup.c
>> @@ -2250,22 +2250,68 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>>  }
>>  #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
>>  
>> +
>> +static int record_subpages(struct page *page, unsigned long addr,
>> +			   unsigned long end, struct page **pages)
>> +{
>> +	int nr;
>> +
>> +	for (nr = 0; addr != end; addr += PAGE_SIZE)
>> +		pages[nr++] = page++;
>> +
>> +	return nr;
>> +}
>> +
>>  #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
>> -static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>> -			     unsigned long end, unsigned int flags,
>> -			     struct page **pages, int *nr)
>> +static int __gup_device_compound_huge(struct dev_pagemap *pgmap,
>> +				      struct page *head, unsigned long sz,
>> +				      unsigned long addr, unsigned long end,
>> +				      unsigned int flags, struct page **pages)
>> +{
>> +	struct page *page;
>> +	int refs;
>> +
>> +	if (!(pgmap->flags & PGMAP_COMPOUND))
>> +		return -1;
>> +
>> +	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
> 
> All the places that call record_subpages do some kind of maths like
> this, it should be placed inside record_subpages and not opencoded
> everywhere.
> 
Makes sense.

>> +	refs = record_subpages(page, addr, end, pages);
>> +
>> +	SetPageReferenced(page);
>> +	head = try_grab_compound_head(head, refs, flags);
>> +	if (!head) {
>> +		ClearPageReferenced(page);
>> +		return 0;
>> +	}
>> +
>> +	return refs;
>> +}
> 
> Why is all of this special? Any time we see a PMD/PGD/etc pointing to
> PFN we can apply this optimization. How come device has its own
> special path to do this?? 
> 
I think the reason is that zone_device struct pages have no relationship to one other. So
you anyways need to change individual pages, as opposed to just the head page.

I made it special to avoid breaking other ZONE_DEVICE users (and gating that with
PGMAP_COMPOUND). But if there's no concerns with that, I can unilaterally enable it.

> Why do we need to check PGMAP_COMPOUND? Why do we need to get pgmap?
> (We already removed that from the hmm version of this, was that wrong?
> Is this different?) Dan?
> 
> Also undo_dev_pagemap() is now out of date, we have unpin_user_pages()
> for that and no other error unwind touches ClearPageReferenced..
> 
/me nods Yeap I saw that too.

> Basic idea is good though!
> 
Cool, thanks!

	Joao
Joao Martins Dec. 9, 2020, 1:44 p.m. UTC | #3
On 12/9/20 4:40 AM, John Hubbard wrote:
> On 12/8/20 9:28 AM, Joao Martins wrote:
>> Much like hugetlbfs or THPs, we treat device pagemaps with
>> compound pages like the rest of GUP handling of compound pages.
>>
>> Rather than incrementing the refcount every 4K, we record
>> all sub pages and increment by @refs amount *once*.
>>
>> Performance measured by gup_benchmark improves considerably
>> get_user_pages_fast() and pin_user_pages_fast():
>>
>>   $ gup_benchmark -f /dev/dax0.2 -m 16384 -r 10 -S [-u,-a] -n 512 -w
> 
> "gup_test", now that you're in linux-next, actually.
> 
> (Maybe I'll retrofit that test with getopt_long(), those options are
> getting more elaborate.)
> 
:)

>>
>> (get_user_pages_fast 2M pages) ~75k us -> ~3.6k us
>> (pin_user_pages_fast 2M pages) ~125k us -> ~3.8k us
> 
> That is a beautiful result! I'm very motivated to see if this patchset
> can make it in, in some form.
> 
Cool!

>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>   mm/gup.c | 67 ++++++++++++++++++++++++++++++++++++++++++--------------
>>   1 file changed, 51 insertions(+), 16 deletions(-)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 98eb8e6d2609..194e6981eb03 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -2250,22 +2250,68 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>>   }
>>   #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
>>   
>> +
>> +static int record_subpages(struct page *page, unsigned long addr,
>> +			   unsigned long end, struct page **pages)
>> +{
>> +	int nr;
>> +
>> +	for (nr = 0; addr != end; addr += PAGE_SIZE)
>> +		pages[nr++] = page++;
>> +
>> +	return nr;
>> +}
>> +
>>   #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
>> -static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>> -			     unsigned long end, unsigned int flags,
>> -			     struct page **pages, int *nr)
>> +static int __gup_device_compound_huge(struct dev_pagemap *pgmap,
>> +				      struct page *head, unsigned long sz,
> 
> If this variable survives (I see Jason requested a reorg of this math stuff,
> and I also like that idea), then I'd like a slightly better name for "sz".
> 
Yeap.

> I was going to suggest one, but then realized that I can't understand how this
> works. See below...
> 
>> +				      unsigned long addr, unsigned long end,
>> +				      unsigned int flags, struct page **pages)
>> +{
>> +	struct page *page;
>> +	int refs;
>> +
>> +	if (!(pgmap->flags & PGMAP_COMPOUND))
>> +		return -1;
> 
> btw, I'm unhappy with returning -1 here and assigning it later to a refs variable.
> (And that will show up even more clearly as an issue if you attempt to make
> refs unsigned everywhere!)
> 
Yes true.

The usage of @refs = -1 (therefore an int) was to differentiate when we are not in a
PGMAP_COMPOUND pgmap (and so for logic to keep as today).

Notice that in the PGMAP_COMPOUND case if we fail to grab the head compound page we return 0.

> I'm not going to suggest anything because there are a lot of ways to structure
> these routines, and I don't want to overly constrain you. Just please don't assign
> negative values to any refs variables.
> 
OK.

TBH I'm a little afraid this can turn into further complexity if I have to keep the
non-compound pgmap around. But I will see how I can adjust this.

>> +
>> +	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
> 
> If you pass in PMD_SHIFT or PUD_SHIFT for, that's a number-of-bits, isn't it?
> Not a size. And if it's not a size, then sz - 1 doesn't work, does it? If it
> does work, then better naming might help. I'm probably missing a really
> obvious math trick here.

You're right. That was a mistake on my end, indeed. But the mistake wouldn't change the
logic, as the PageReference bit only applies to the head page.

	Joao
Joao Martins Dec. 9, 2020, 4:02 p.m. UTC | #4
On 12/9/20 3:15 PM, Jason Gunthorpe wrote:
> On Wed, Dec 09, 2020 at 11:05:39AM +0000, Joao Martins wrote:
>>> Why is all of this special? Any time we see a PMD/PGD/etc pointing to
>>> PFN we can apply this optimization. How come device has its own
>>> special path to do this?? 
>>
>> I think the reason is that zone_device struct pages have no
>> relationship to one other. So you anyways need to change individual
>> pages, as opposed to just the head page.
> 
> Huh? That can't be, unpin doesn't know the memory type when it unpins
> it, and as your series shows unpin always operates on the compound
> head. Thus pinning must also operate on compound heads
> 
I was referring to the code without this series, in the paragraph above.
Meaning today zone_device pages are *not* represented compound pages. And so
compound_head(page) on a non compound page just returns the page itself.

Otherwise, try_grab_page() (e.g. when pinning pages) would be broken.

>> I made it special to avoid breaking other ZONE_DEVICE users (and
>> gating that with PGMAP_COMPOUND). But if there's no concerns with
>> that, I can unilaterally enable it.
> 
> I didn't understand what PGMAP_COMPOUND was supposed to be for..
>  
PGMAP_COMPOUND purpose is to online these pages as compound pages (so head
and tails).

Today (without the series) struct pages are not represented the way they
are expressed in the page tables, which is what I am hoping to fix in this
series thus initializing these as compound pages of a given order. But me
introducing PGMAP_COMPOUND was to conservatively keep both old (non-compound)
and new (compound pages) co-exist.

I wasn't sure I could just enable regardless, worried that I would be breaking
other ZONE_DEVICE/memremap_pages users.

>>> Why do we need to check PGMAP_COMPOUND? Why do we need to get pgmap?
>>> (We already removed that from the hmm version of this, was that wrong?
>>> Is this different?) Dan?
> 
> And this is the key question - why do we need to get a pgmap here?
> 
> I'm going to assert that a pgmap cannot be destroyed concurrently with
> fast gup running. This is surely true on x86 as the TLB flush that
> must have preceeded a pgmap destroy excludes fast gup. Other arches
> must emulate this in their pgmap implementations.
> 
> So, why do we need pgmap here? Hoping Dan might know
> 
> If we delete the pgmap then the devmap stop being special.
>
I will let Dan chip in.

> CH and I looked at this and deleted it from the hmm side:
> 
> commit 068354ade5dd9e2b07d9b0c57055a681db6f4e37
> Author: Jason Gunthorpe <jgg@ziepe.ca>
> Date:   Fri Mar 27 17:00:13 2020 -0300
> 
>     mm/hmm: remove pgmap checking for devmap pages
>     
>     The checking boils down to some racy check if the pagemap is still
>     available or not. Instead of checking this, rely entirely on the
>     notifiers, if a pagemap is destroyed then all pages that belong to it must
>     be removed from the tables and the notifiers triggered.
>     
>     Link: https://lore.kernel.org/r/20200327200021.29372-2-jgg@ziepe.ca
> 
> Though I am wondering if this whole hmm thing is racy with memory
> unplug. Hmm.

	Joao
Joao Martins Dec. 9, 2020, 5:27 p.m. UTC | #5
On 12/9/20 4:24 PM, Jason Gunthorpe wrote:
> On Wed, Dec 09, 2020 at 04:02:05PM +0000, Joao Martins wrote:
> 
>> Today (without the series) struct pages are not represented the way they
>> are expressed in the page tables, which is what I am hoping to fix in this
>> series thus initializing these as compound pages of a given order. But me
>> introducing PGMAP_COMPOUND was to conservatively keep both old (non-compound)
>> and new (compound pages) co-exist.
> 
> Oooh, that I didn't know.. That is kind of horrible to have a PMD
> pointing at an order 0 page only in this one special case.
> 
> Still, I think it would be easier to teach record_subpages() that a
> PMD doesn't necessarily point to a high order page, eg do something
> like I suggested for the SGL where it extracts the page order and
> iterates over the contiguous range of pfns.
> 
/me nods

> This way it can remain general with no particularly special path for
> devmap or a special PGMAP_COMPOUND check here.

The less special paths the better, indeed.

	Joao
Matthew Wilcox Dec. 9, 2020, 6:14 p.m. UTC | #6
On Wed, Dec 09, 2020 at 12:24:38PM -0400, Jason Gunthorpe wrote:
> On Wed, Dec 09, 2020 at 04:02:05PM +0000, Joao Martins wrote:
> 
> > Today (without the series) struct pages are not represented the way they
> > are expressed in the page tables, which is what I am hoping to fix in this
> > series thus initializing these as compound pages of a given order. But me
> > introducing PGMAP_COMPOUND was to conservatively keep both old (non-compound)
> > and new (compound pages) co-exist.
> 
> Oooh, that I didn't know.. That is kind of horrible to have a PMD
> pointing at an order 0 page only in this one special case.

Uh, yes.  I'm surprised it hasn't caused more problems.

> Still, I think it would be easier to teach record_subpages() that a
> PMD doesn't necessarily point to a high order page, eg do something
> like I suggested for the SGL where it extracts the page order and
> iterates over the contiguous range of pfns.

But we also see good performance improvements from doing all reference
counts on the head page instead of spread throughout the pages, so we
really want compound pages.
Joao Martins Dec. 10, 2020, 3:43 p.m. UTC | #7
On 12/9/20 6:14 PM, Matthew Wilcox wrote:
> On Wed, Dec 09, 2020 at 12:24:38PM -0400, Jason Gunthorpe wrote:
>> On Wed, Dec 09, 2020 at 04:02:05PM +0000, Joao Martins wrote:
>>
>>> Today (without the series) struct pages are not represented the way they
>>> are expressed in the page tables, which is what I am hoping to fix in this
>>> series thus initializing these as compound pages of a given order. But me
>>> introducing PGMAP_COMPOUND was to conservatively keep both old (non-compound)
>>> and new (compound pages) co-exist.
>>
>> Oooh, that I didn't know.. That is kind of horrible to have a PMD
>> pointing at an order 0 page only in this one special case.
> 
> Uh, yes.  I'm surprised it hasn't caused more problems.
> 
There was 1 or 2 problems in the KVM MMU related to zone device pages.

See commit e851265a816f ("KVM: x86/mmu: Use huge pages for DAX-backed files")
which eventually lead to commit db5432165e9b5 ("KVM: x86/mmu: Walk host page
tables to find THP mappings") to be less amenable to metadata changes.

>> Still, I think it would be easier to teach record_subpages() that a
>> PMD doesn't necessarily point to a high order page, eg do something
>> like I suggested for the SGL where it extracts the page order and
>> iterates over the contiguous range of pfns.
> 
> But we also see good performance improvements from doing all reference
> counts on the head page instead of spread throughout the pages, so we
> really want compound pages.

Going further than just refcounts and borrowing your (or someone else?)
idea, perhaps also a FOLL_HEAD gup flag that would let us only work with
head pages (or folios). Which would consequently let us pin/grab bigger
swathes of memory e.g. 1G (in 2M head pages) or 512G (in 1G head pages)
with just 1 page for storing the struct pages[*]. Albeit I suspect the
numbers would have to justify it.

	Joao

[*] One page happens to be what's used for RDMA/umem and vdpa as callers
of pin_user_pages*()
diff mbox series

Patch

diff --git a/mm/gup.c b/mm/gup.c
index 98eb8e6d2609..194e6981eb03 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2250,22 +2250,68 @@  static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 }
 #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
 
+
+static int record_subpages(struct page *page, unsigned long addr,
+			   unsigned long end, struct page **pages)
+{
+	int nr;
+
+	for (nr = 0; addr != end; addr += PAGE_SIZE)
+		pages[nr++] = page++;
+
+	return nr;
+}
+
 #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
-static int __gup_device_huge(unsigned long pfn, unsigned long addr,
-			     unsigned long end, unsigned int flags,
-			     struct page **pages, int *nr)
+static int __gup_device_compound_huge(struct dev_pagemap *pgmap,
+				      struct page *head, unsigned long sz,
+				      unsigned long addr, unsigned long end,
+				      unsigned int flags, struct page **pages)
+{
+	struct page *page;
+	int refs;
+
+	if (!(pgmap->flags & PGMAP_COMPOUND))
+		return -1;
+
+	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
+	refs = record_subpages(page, addr, end, pages);
+
+	SetPageReferenced(page);
+	head = try_grab_compound_head(head, refs, flags);
+	if (!head) {
+		ClearPageReferenced(page);
+		return 0;
+	}
+
+	return refs;
+}
+
+static int __gup_device_huge(unsigned long pfn, unsigned long sz,
+			     unsigned long addr, unsigned long end,
+			     unsigned int flags, struct page **pages, int *nr)
 {
 	int nr_start = *nr;
 	struct dev_pagemap *pgmap = NULL;
 
 	do {
 		struct page *page = pfn_to_page(pfn);
+		int refs;
 
 		pgmap = get_dev_pagemap(pfn, pgmap);
 		if (unlikely(!pgmap)) {
 			undo_dev_pagemap(nr, nr_start, flags, pages);
 			return 0;
 		}
+
+		refs = __gup_device_compound_huge(pgmap, page, sz, addr, end,
+						  flags, pages + *nr);
+		if (refs >= 0) {
+			*nr += refs;
+			put_dev_pagemap(pgmap);
+			return refs ? 1 : 0;
+		}
+
 		SetPageReferenced(page);
 		pages[*nr] = page;
 		if (unlikely(!try_grab_page(page, flags))) {
@@ -2289,7 +2335,7 @@  static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 	int nr_start = *nr;
 
 	fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-	if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
+	if (!__gup_device_huge(fault_pfn, PMD_SHIFT, addr, end, flags, pages, nr))
 		return 0;
 
 	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
@@ -2307,7 +2353,7 @@  static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 	int nr_start = *nr;
 
 	fault_pfn = pud_pfn(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
-	if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
+	if (!__gup_device_huge(fault_pfn, PUD_SHIFT, addr, end, flags, pages, nr))
 		return 0;
 
 	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
@@ -2334,17 +2380,6 @@  static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
 }
 #endif
 
-static int record_subpages(struct page *page, unsigned long addr,
-			   unsigned long end, struct page **pages)
-{
-	int nr;
-
-	for (nr = 0; addr != end; addr += PAGE_SIZE)
-		pages[nr++] = page++;
-
-	return nr;
-}
-
 #ifdef CONFIG_ARCH_HAS_HUGEPD
 static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
 				      unsigned long sz)