Message ID: 20201208172901.17384-8-joao.m.martins@oracle.com (mailing list archive)
State: New, archived
Series: mm, sparse-vmemmap: Introduce compound pagemaps
On Tue, Dec 08, 2020 at 05:28:58PM +0000, Joao Martins wrote:
> Much like hugetlbfs or THPs, we treat device pagemaps with
> compound pages like the rest of GUP handling of compound pages.
>
> Rather than incrementing the refcount every 4K, we record
> all sub pages and increment by @refs amount *once*.
>
> Performance measured by gup_benchmark improves considerably for
> get_user_pages_fast() and pin_user_pages_fast():
>
>  $ gup_benchmark -f /dev/dax0.2 -m 16384 -r 10 -S [-u,-a] -n 512 -w
>
>  (get_user_pages_fast 2M pages) ~75k us -> ~3.6k us
>  (pin_user_pages_fast 2M pages) ~125k us -> ~3.8k us
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  mm/gup.c | 67 ++++++++++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 51 insertions(+), 16 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 98eb8e6d2609..194e6981eb03 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2250,22 +2250,68 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>  }
>  #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
>
> +
> +static int record_subpages(struct page *page, unsigned long addr,
> +			   unsigned long end, struct page **pages)
> +{
> +	int nr;
> +
> +	for (nr = 0; addr != end; addr += PAGE_SIZE)
> +		pages[nr++] = page++;
> +
> +	return nr;
> +}
> +
>  #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
> -static int __gup_device_huge(unsigned long pfn, unsigned long addr,
> -			     unsigned long end, unsigned int flags,
> -			     struct page **pages, int *nr)
> +static int __gup_device_compound_huge(struct dev_pagemap *pgmap,
> +				      struct page *head, unsigned long sz,
> +				      unsigned long addr, unsigned long end,
> +				      unsigned int flags, struct page **pages)
> +{
> +	struct page *page;
> +	int refs;
> +
> +	if (!(pgmap->flags & PGMAP_COMPOUND))
> +		return -1;
> +
> +	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);

All the places that call record_subpages do some kind of maths like
this, it should be placed inside record_subpages and not opencoded
everywhere.

> +	refs = record_subpages(page, addr, end, pages);
> +
> +	SetPageReferenced(page);
> +	head = try_grab_compound_head(head, refs, flags);
> +	if (!head) {
> +		ClearPageReferenced(page);
> +		return 0;
> +	}
> +
> +	return refs;
> +}

Why is all of this special? Any time we see a PMD/PGD/etc pointing to
a PFN we can apply this optimization. How come device has its own
special path to do this??

Why do we need to check PGMAP_COMPOUND? Why do we need to get pgmap?
(We already removed that from the hmm version of this, was that wrong?
Is this different?) Dan?

Also undo_dev_pagemap() is now out of date, we have unpin_user_pages()
for that and no other error unwind touches ClearPageReferenced.

Basic idea is good though!

Jason
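For concreteness, here is one possible shape of the refactor Jason is
asking for -- a sketch only, not the series' actual code. The signature
split (compound head plus a byte size) and keeping the name "sz" are
assumptions carried over from the posted patch:

  /*
   * Sketch: fold the head-offset maths into record_subpages() so that
   * callers pass the compound head directly instead of open-coding the
   * offset computation. "sz" is assumed to be a size in bytes (e.g.
   * PMD_SIZE), not a shift count -- see John's reply below.
   */
  static int record_subpages(struct page *head, unsigned long sz,
  			     unsigned long addr, unsigned long end,
  			     struct page **pages)
  {
  	struct page *page = head + ((addr & (sz - 1)) >> PAGE_SHIFT);
  	int nr;

  	for (nr = 0; addr != end; addr += PAGE_SIZE)
  		pages[nr++] = page++;

  	return nr;
  }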
On 12/8/20 9:28 AM, Joao Martins wrote:
> Much like hugetlbfs or THPs, we treat device pagemaps with
> compound pages like the rest of GUP handling of compound pages.
>
> Rather than incrementing the refcount every 4K, we record
> all sub pages and increment by @refs amount *once*.
>
> Performance measured by gup_benchmark improves considerably for
> get_user_pages_fast() and pin_user_pages_fast():
>
>  $ gup_benchmark -f /dev/dax0.2 -m 16384 -r 10 -S [-u,-a] -n 512 -w

"gup_test", now that you're in linux-next, actually.

(Maybe I'll retrofit that test with getopt_long(), those options are
getting more elaborate.)

>
>  (get_user_pages_fast 2M pages) ~75k us -> ~3.6k us
>  (pin_user_pages_fast 2M pages) ~125k us -> ~3.8k us

That is a beautiful result! I'm very motivated to see if this patchset
can make it in, in some form.

>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  mm/gup.c | 67 ++++++++++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 51 insertions(+), 16 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 98eb8e6d2609..194e6981eb03 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2250,22 +2250,68 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>  }
>  #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
>
> +
> +static int record_subpages(struct page *page, unsigned long addr,
> +			   unsigned long end, struct page **pages)
> +{
> +	int nr;
> +
> +	for (nr = 0; addr != end; addr += PAGE_SIZE)
> +		pages[nr++] = page++;
> +
> +	return nr;
> +}
> +
>  #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
> -static int __gup_device_huge(unsigned long pfn, unsigned long addr,
> -			     unsigned long end, unsigned int flags,
> -			     struct page **pages, int *nr)
> +static int __gup_device_compound_huge(struct dev_pagemap *pgmap,
> +				      struct page *head, unsigned long sz,

If this variable survives (I see Jason requested a reorg of this math
stuff, and I also like that idea), then I'd like a slightly better name
for "sz".

I was going to suggest one, but then realized that I can't understand
how this works. See below...

> +				      unsigned long addr, unsigned long end,
> +				      unsigned int flags, struct page **pages)
> +{
> +	struct page *page;
> +	int refs;
> +
> +	if (!(pgmap->flags & PGMAP_COMPOUND))
> +		return -1;

btw, I'm unhappy with returning -1 here and assigning it later to a
refs variable. (And that will show up even more clearly as an issue if
you attempt to make refs unsigned everywhere!)

I'm not going to suggest anything because there are a lot of ways to
structure these routines, and I don't want to overly constrain you.
Just please don't assign negative values to any refs variables.

> +
> +	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);

If you pass in PMD_SHIFT or PUD_SHIFT for sz, that's a number-of-bits,
isn't it? Not a size. And if it's not a size, then sz - 1 doesn't work,
does it? If it does work, then better naming might help. I'm probably
missing a really obvious math trick here.
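To make the shift-versus-size point concrete, here is a standalone demo
(hypothetical address value; PMD_SIZE/PMD_SHIFT spelled out with their
x86-64 values for 2M mappings) showing why addr & (sz - 1) only yields
the sub-PMD offset when sz is a byte size:

  #include <stdio.h>

  int main(void)
  {
  	unsigned long addr = 0x7f0000083000UL;	/* an example user address */
  	unsigned long pmd_size = 1UL << 21;	/* PMD_SIZE: bytes, 0x200000 */
  	unsigned long pmd_shift = 21;		/* PMD_SHIFT: a bit count */

  	/* Correct: (size - 1) is an offset mask within the 2M region. */
  	printf("offset via size:  %#lx\n", addr & (pmd_size - 1));  /* 0x83000 */

  	/* Wrong: (shift - 1) is just the number 20, not a mask. */
  	printf("offset via shift: %#lx\n", addr & (pmd_shift - 1)); /* 0 */

  	return 0;
  }

So the patch as posted, which passes PMD_SHIFT/PUD_SHIFT into "sz",
computes the wrong tail-page offset; it would need PMD_SIZE/PUD_SIZE
(or 1UL << shift) instead -- which is the mistake Joao confirms below.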
On 12/8/20 7:49 PM, Jason Gunthorpe wrote:
> On Tue, Dec 08, 2020 at 05:28:58PM +0000, Joao Martins wrote:
>> Much like hugetlbfs or THPs, we treat device pagemaps with
>> compound pages like the rest of GUP handling of compound pages.
>>
>> Rather than incrementing the refcount every 4K, we record
>> all sub pages and increment by @refs amount *once*.
>>
>> Performance measured by gup_benchmark improves considerably for
>> get_user_pages_fast() and pin_user_pages_fast():
>>
>>  $ gup_benchmark -f /dev/dax0.2 -m 16384 -r 10 -S [-u,-a] -n 512 -w
>>
>>  (get_user_pages_fast 2M pages) ~75k us -> ~3.6k us
>>  (pin_user_pages_fast 2M pages) ~125k us -> ~3.8k us
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  mm/gup.c | 67 ++++++++++++++++++++++++++++++++++++++++++--------------
>>  1 file changed, 51 insertions(+), 16 deletions(-)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 98eb8e6d2609..194e6981eb03 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -2250,22 +2250,68 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>>  }
>>  #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
>>
>> +
>> +static int record_subpages(struct page *page, unsigned long addr,
>> +			   unsigned long end, struct page **pages)
>> +{
>> +	int nr;
>> +
>> +	for (nr = 0; addr != end; addr += PAGE_SIZE)
>> +		pages[nr++] = page++;
>> +
>> +	return nr;
>> +}
>> +
>>  #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
>> -static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>> -			     unsigned long end, unsigned int flags,
>> -			     struct page **pages, int *nr)
>> +static int __gup_device_compound_huge(struct dev_pagemap *pgmap,
>> +				      struct page *head, unsigned long sz,
>> +				      unsigned long addr, unsigned long end,
>> +				      unsigned int flags, struct page **pages)
>> +{
>> +	struct page *page;
>> +	int refs;
>> +
>> +	if (!(pgmap->flags & PGMAP_COMPOUND))
>> +		return -1;
>> +
>> +	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
>
> All the places that call record_subpages do some kind of maths like
> this, it should be placed inside record_subpages and not opencoded
> everywhere.
>

Makes sense.

>> +	refs = record_subpages(page, addr, end, pages);
>> +
>> +	SetPageReferenced(page);
>> +	head = try_grab_compound_head(head, refs, flags);
>> +	if (!head) {
>> +		ClearPageReferenced(page);
>> +		return 0;
>> +	}
>> +
>> +	return refs;
>> +}
>
> Why is all of this special? Any time we see a PMD/PGD/etc pointing to
> a PFN we can apply this optimization. How come device has its own
> special path to do this??
>

I think the reason is that zone_device struct pages have no
relationship to one another. So you anyways need to change individual
pages, as opposed to just the head page.

I made it special to avoid breaking other ZONE_DEVICE users (and
gating that with PGMAP_COMPOUND). But if there are no concerns with
that, I can unilaterally enable it.

> Why do we need to check PGMAP_COMPOUND? Why do we need to get pgmap?
> (We already removed that from the hmm version of this, was that wrong?
> Is this different?) Dan?
>
> Also undo_dev_pagemap() is now out of date, we have unpin_user_pages()
> for that and no other error unwind touches ClearPageReferenced.
>

/me nods. Yeap, I saw that too.

> Basic idea is good though!
>

Cool, thanks!

Joao
On 12/9/20 4:40 AM, John Hubbard wrote:
> On 12/8/20 9:28 AM, Joao Martins wrote:
>> Much like hugetlbfs or THPs, we treat device pagemaps with
>> compound pages like the rest of GUP handling of compound pages.
>>
>> Rather than incrementing the refcount every 4K, we record
>> all sub pages and increment by @refs amount *once*.
>>
>> Performance measured by gup_benchmark improves considerably for
>> get_user_pages_fast() and pin_user_pages_fast():
>>
>>  $ gup_benchmark -f /dev/dax0.2 -m 16384 -r 10 -S [-u,-a] -n 512 -w
>
> "gup_test", now that you're in linux-next, actually.
>
> (Maybe I'll retrofit that test with getopt_long(), those options are
> getting more elaborate.)
>

:)

>>
>>  (get_user_pages_fast 2M pages) ~75k us -> ~3.6k us
>>  (pin_user_pages_fast 2M pages) ~125k us -> ~3.8k us
>
> That is a beautiful result! I'm very motivated to see if this patchset
> can make it in, in some form.
>

Cool!

>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  mm/gup.c | 67 ++++++++++++++++++++++++++++++++++++++++++--------------
>>  1 file changed, 51 insertions(+), 16 deletions(-)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 98eb8e6d2609..194e6981eb03 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -2250,22 +2250,68 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>>  }
>>  #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
>>
>> +
>> +static int record_subpages(struct page *page, unsigned long addr,
>> +			   unsigned long end, struct page **pages)
>> +{
>> +	int nr;
>> +
>> +	for (nr = 0; addr != end; addr += PAGE_SIZE)
>> +		pages[nr++] = page++;
>> +
>> +	return nr;
>> +}
>> +
>>  #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
>> -static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>> -			     unsigned long end, unsigned int flags,
>> -			     struct page **pages, int *nr)
>> +static int __gup_device_compound_huge(struct dev_pagemap *pgmap,
>> +				      struct page *head, unsigned long sz,
>
> If this variable survives (I see Jason requested a reorg of this math
> stuff, and I also like that idea), then I'd like a slightly better name
> for "sz".
>

Yeap.

> I was going to suggest one, but then realized that I can't understand
> how this works. See below...
>
>> +				      unsigned long addr, unsigned long end,
>> +				      unsigned int flags, struct page **pages)
>> +{
>> +	struct page *page;
>> +	int refs;
>> +
>> +	if (!(pgmap->flags & PGMAP_COMPOUND))
>> +		return -1;
>
> btw, I'm unhappy with returning -1 here and assigning it later to a
> refs variable. (And that will show up even more clearly as an issue if
> you attempt to make refs unsigned everywhere!)
>

Yes, true.

The usage of @refs = -1 (therefore an int) was to differentiate when we
are not in a PGMAP_COMPOUND pgmap (and so the logic stays as it is
today). Notice that in the PGMAP_COMPOUND case, if we fail to grab the
head compound page we return 0.

> I'm not going to suggest anything because there are a lot of ways to
> structure these routines, and I don't want to overly constrain you.
> Just please don't assign negative values to any refs variables.
>

OK. TBH I'm a little afraid this can turn into further complexity if I
have to keep the non-compound pgmap case around. But I will see how I
can adjust this.

>> +
>> +	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
>
> If you pass in PMD_SHIFT or PUD_SHIFT for sz, that's a number-of-bits,
> isn't it? Not a size. And if it's not a size, then sz - 1 doesn't work,
> does it? If it does work, then better naming might help. I'm probably
> missing a really obvious math trick here.

You're right. That was a mistake on my end, indeed. But the mistake
wouldn't change the logic, as the PageReferenced bit only applies to
the head page.

Joao
On Wed, Dec 09, 2020 at 11:05:39AM +0000, Joao Martins wrote:
> > Why is all of this special? Any time we see a PMD/PGD/etc pointing to
> > a PFN we can apply this optimization. How come device has its own
> > special path to do this??
>
> I think the reason is that zone_device struct pages have no
> relationship to one another. So you anyways need to change individual
> pages, as opposed to just the head page.

Huh? That can't be, unpin doesn't know the memory type when it unpins
it, and as your series shows unpin always operates on the compound
head. Thus pinning must also operate on compound heads.

> I made it special to avoid breaking other ZONE_DEVICE users (and
> gating that with PGMAP_COMPOUND). But if there are no concerns with
> that, I can unilaterally enable it.

I didn't understand what PGMAP_COMPOUND was supposed to be for..

> > Why do we need to check PGMAP_COMPOUND? Why do we need to get pgmap?
> > (We already removed that from the hmm version of this, was that wrong?
> > Is this different?) Dan?

And this is the key question - why do we need to get a pgmap here?

I'm going to assert that a pgmap cannot be destroyed concurrently with
fast gup running. This is surely true on x86 as the TLB flush that
must have preceded a pgmap destroy excludes fast gup. Other arches
must emulate this in their pgmap implementations.

So, why do we need pgmap here? Hoping Dan might know.

If we delete the pgmap then the devmap pages stop being special.

CH and I looked at this and deleted it from the hmm side:

commit 068354ade5dd9e2b07d9b0c57055a681db6f4e37
Author: Jason Gunthorpe <jgg@ziepe.ca>
Date:   Fri Mar 27 17:00:13 2020 -0300

    mm/hmm: remove pgmap checking for devmap pages

    The checking boils down to some racy check if the pagemap is still
    available or not. Instead of checking this, rely entirely on the
    notifiers, if a pagemap is destroyed then all pages that belong to it
    must be removed from the tables and the notifiers triggered.

    Link: https://lore.kernel.org/r/20200327200021.29372-2-jgg@ziepe.ca

Though I am wondering if this whole hmm thing is racy with memory
unplug. Hmm.

Jason
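For readers following the lifetime argument: fast gup's exclusion
against TLB flush comes from running the whole walk with IRQs disabled,
roughly the shape below (a simplified sketch of this era's mm/gup.c,
not a verbatim copy). An IPI-based TLB shootdown, which must precede
tearing down the mappings a pgmap covers, cannot complete while this
runs:

  /*
   * Simplified sketch of fast gup's lockless walk. Disabling IRQs
   * holds off the TLB-shootdown IPIs that must precede freeing page
   * tables or destroying a pgmap, so on x86 the entries walked here
   * cannot disappear underneath us.
   */
  static int fast_gup_walk(unsigned long start, unsigned long end,
  			   unsigned int gup_flags, struct page **pages)
  {
  	unsigned long irq_flags;
  	int nr_pinned = 0;

  	local_irq_save(irq_flags);
  	gup_pgd_range(start, end, gup_flags, pages, &nr_pinned);
  	local_irq_restore(irq_flags);

  	return nr_pinned;
  }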
On 12/9/20 3:15 PM, Jason Gunthorpe wrote:
> On Wed, Dec 09, 2020 at 11:05:39AM +0000, Joao Martins wrote:
>>> Why is all of this special? Any time we see a PMD/PGD/etc pointing to
>>> a PFN we can apply this optimization. How come device has its own
>>> special path to do this??
>>
>> I think the reason is that zone_device struct pages have no
>> relationship to one another. So you anyways need to change individual
>> pages, as opposed to just the head page.
>
> Huh? That can't be, unpin doesn't know the memory type when it unpins
> it, and as your series shows unpin always operates on the compound
> head. Thus pinning must also operate on compound heads.
>

I was referring to the code without this series, in the paragraph
above. Meaning that today zone_device pages are *not* represented as
compound pages, and so compound_head(page) on a non-compound page just
returns the page itself. Otherwise, try_grab_page() (e.g. when pinning
pages) would be broken.

>> I made it special to avoid breaking other ZONE_DEVICE users (and
>> gating that with PGMAP_COMPOUND). But if there are no concerns with
>> that, I can unilaterally enable it.
>
> I didn't understand what PGMAP_COMPOUND was supposed to be for..
>

PGMAP_COMPOUND's purpose is to online these pages as compound pages
(so head and tails).

Today (without the series) struct pages are not represented the way
they are expressed in the page tables, which is what I am hoping to fix
in this series by initializing these as compound pages of a given
order. But I introduced PGMAP_COMPOUND to conservatively keep the old
(non-compound) and new (compound) pages co-existing. I wasn't sure I
could just enable it regardless, worried that I would be breaking other
ZONE_DEVICE/memremap_pages users.

>>> Why do we need to check PGMAP_COMPOUND? Why do we need to get pgmap?
>>> (We already removed that from the hmm version of this, was that wrong?
>>> Is this different?) Dan?
>
> And this is the key question - why do we need to get a pgmap here?
>
> I'm going to assert that a pgmap cannot be destroyed concurrently with
> fast gup running. This is surely true on x86 as the TLB flush that
> must have preceded a pgmap destroy excludes fast gup. Other arches
> must emulate this in their pgmap implementations.
>
> So, why do we need pgmap here? Hoping Dan might know.
>
> If we delete the pgmap then the devmap pages stop being special.
>

I will let Dan chip in.

> CH and I looked at this and deleted it from the hmm side:
>
> commit 068354ade5dd9e2b07d9b0c57055a681db6f4e37
> Author: Jason Gunthorpe <jgg@ziepe.ca>
> Date:   Fri Mar 27 17:00:13 2020 -0300
>
>     mm/hmm: remove pgmap checking for devmap pages
>
>     The checking boils down to some racy check if the pagemap is still
>     available or not. Instead of checking this, rely entirely on the
>     notifiers, if a pagemap is destroyed then all pages that belong to
>     it must be removed from the tables and the notifiers triggered.
>
>     Link: https://lore.kernel.org/r/20200327200021.29372-2-jgg@ziepe.ca
>
> Though I am wondering if this whole hmm thing is racy with memory
> unplug. Hmm.

Joao
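As a sketch of what "onlining as compound pages" entails (illustrative
only; the helper name and placement are assumptions, not the series'
actual code), each huge-page-aligned chunk of the device memmap would
get its metadata linked up the way prep_compound_page() does for the
buddy allocator:

  /*
   * Hypothetical sketch: mark one aligned chunk of a device pagemap as
   * a compound page of the given order, so struct pages mirror the way
   * the range is expressed in the page tables.
   */
  static void memmap_init_compound(struct page *head, unsigned int order)
  {
  	unsigned long i, nr_pages = 1UL << order;

  	__SetPageHead(head);
  	for (i = 1; i < nr_pages; i++)
  		set_compound_head(head + i, head);	/* tail -> head link */
  	set_compound_order(head, order);
  }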
On Wed, Dec 09, 2020 at 04:02:05PM +0000, Joao Martins wrote:
> Today (without the series) struct pages are not represented the way
> they are expressed in the page tables, which is what I am hoping to
> fix in this series by initializing these as compound pages of a given
> order. But I introduced PGMAP_COMPOUND to conservatively keep the old
> (non-compound) and new (compound) pages co-existing.

Oooh, that I didn't know.. That is kind of horrible to have a PMD
pointing at an order 0 page only in this one special case.

Still, I think it would be easier to teach record_subpages() that a
PMD doesn't necessarily point to a high order page, eg do something
like I suggested for the SGL where it extracts the page order and
iterates over the contiguous range of pfns.

This way it can remain general with no particularly special path for
devmap or a special PGMAP_COMPOUND check here.

Jason
On 12/9/20 4:24 PM, Jason Gunthorpe wrote:
> On Wed, Dec 09, 2020 at 04:02:05PM +0000, Joao Martins wrote:
>
>> Today (without the series) struct pages are not represented the way
>> they are expressed in the page tables, which is what I am hoping to
>> fix in this series by initializing these as compound pages of a given
>> order. But I introduced PGMAP_COMPOUND to conservatively keep the old
>> (non-compound) and new (compound) pages co-existing.
>
> Oooh, that I didn't know.. That is kind of horrible to have a PMD
> pointing at an order 0 page only in this one special case.
>
> Still, I think it would be easier to teach record_subpages() that a
> PMD doesn't necessarily point to a high order page, eg do something
> like I suggested for the SGL where it extracts the page order and
> iterates over the contiguous range of pfns.
>

/me nods

> This way it can remain general with no particularly special path for
> devmap or a special PGMAP_COMPOUND check here.

The fewer special paths the better, indeed.

Joao
On Wed, Dec 09, 2020 at 12:24:38PM -0400, Jason Gunthorpe wrote:
> On Wed, Dec 09, 2020 at 04:02:05PM +0000, Joao Martins wrote:
>
> > Today (without the series) struct pages are not represented the way
> > they are expressed in the page tables, which is what I am hoping to
> > fix in this series by initializing these as compound pages of a
> > given order. But I introduced PGMAP_COMPOUND to conservatively keep
> > the old (non-compound) and new (compound) pages co-existing.
>
> Oooh, that I didn't know.. That is kind of horrible to have a PMD
> pointing at an order 0 page only in this one special case.

Uh, yes. I'm surprised it hasn't caused more problems.

> Still, I think it would be easier to teach record_subpages() that a
> PMD doesn't necessarily point to a high order page, eg do something
> like I suggested for the SGL where it extracts the page order and
> iterates over the contiguous range of pfns.

But we also see good performance improvements from doing all reference
counts on the head page instead of spread throughout the pages, so we
really want compound pages.
On Wed, Dec 09, 2020 at 06:14:06PM +0000, Matthew Wilcox wrote:
> > Still, I think it would be easier to teach record_subpages() that a
> > PMD doesn't necessarily point to a high order page, eg do something
> > like I suggested for the SGL where it extracts the page order and
> > iterates over the contiguous range of pfns.
>
> But we also see good performance improvements from doing all reference
> counts on the head page instead of spread throughout the pages, so we
> really want compound pages.

Oh no doubt! I'm not saying not to do that, just wanting to see some
consolidation of the page table reading code.

Instead of obtaining and checking the pgmap for PGMAP_COMPOUND (which
is unique to devmap and very expensive) do the same algorithm we are
talking about for unpin.

Given a starting pfn and # of pages following (eg a PMD can be
described like this) - compute the minimum list of (compound_head,
ntails) tuples that spans that physical range.

For instance using your folio language all the gup fast stuff pretty
much boils down to:

  start_page = pmd_page(*pmd);
  // Select the sub PMD range GUP is interested in
  npages = adjust_for_vaddr(&start_page, vaddr, vlength, PMD_SHIFT);

  for_each_folio(start_page, npages, &folio, &ntails)
	try_grab_folio(folio, ntails)

  record_pages_in_output(start_page, npages);

No need for all the gup_device* stuff at all. If 'for_each_folio'
starts returning high order pages for devmap because the first part of
this series made compound_order higher, then great!

It also consolidates with the trailing part of gup_hugepte() and more
on the gup slow side too.

for_each_folio() is just some simple maths that does:

  folio = to_folio(page)
  ntails = min(1 << folio_order(folio) - (head - page), num_pages)
  num_pages -= ntails
  page += ntails

Jason
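Rendered as real C against the then-current page API -- a sketch only,
with compound_head()/compound_order() standing in for the folio calls,
which did not exist yet, and try_grab_compound_head() as in the patch
above -- the per-iteration maths might look like:

  /*
   * Sketch of one for_each_folio() step using pre-folio primitives:
   * given the current page and how many pages remain, find the
   * compound head and how many of the remaining pages share it.
   */
  static unsigned long folio_span(struct page *page, unsigned long num_pages,
  				  struct page **headp)
  {
  	struct page *head = compound_head(page);
  	unsigned long ntails = (1UL << compound_order(head)) - (page - head);

  	*headp = head;
  	return min(ntails, num_pages);
  }

  /*
   * Usage sketch: walk a PMD-sized range, taking one batched ref per
   * compound head (order-0 pages simply yield ntails == 1):
   *
   *	while (num_pages) {
   *		ntails = folio_span(page, num_pages, &head);
   *		if (!try_grab_compound_head(head, ntails, flags))
   *			break;
   *		num_pages -= ntails;
   *		page += ntails;
   *	}
   */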
On 12/9/20 6:14 PM, Matthew Wilcox wrote:
> On Wed, Dec 09, 2020 at 12:24:38PM -0400, Jason Gunthorpe wrote:
>> On Wed, Dec 09, 2020 at 04:02:05PM +0000, Joao Martins wrote:
>>
>>> Today (without the series) struct pages are not represented the way
>>> they are expressed in the page tables, which is what I am hoping to
>>> fix in this series by initializing these as compound pages of a
>>> given order. But I introduced PGMAP_COMPOUND to conservatively keep
>>> the old (non-compound) and new (compound) pages co-existing.
>>
>> Oooh, that I didn't know.. That is kind of horrible to have a PMD
>> pointing at an order 0 page only in this one special case.
>
> Uh, yes. I'm surprised it hasn't caused more problems.
>

There were one or two problems in the KVM MMU related to zone device
pages. See commit e851265a816f ("KVM: x86/mmu: Use huge pages for
DAX-backed files"), which eventually led to commit db5432165e9b5 ("KVM:
x86/mmu: Walk host page tables to find THP mappings"), making it less
susceptible to metadata changes.

>> Still, I think it would be easier to teach record_subpages() that a
>> PMD doesn't necessarily point to a high order page, eg do something
>> like I suggested for the SGL where it extracts the page order and
>> iterates over the contiguous range of pfns.
>
> But we also see good performance improvements from doing all reference
> counts on the head page instead of spread throughout the pages, so we
> really want compound pages.

Going further than just refcounts, and borrowing your (or someone
else's?) idea, perhaps also a FOLL_HEAD gup flag that would let us only
work with head pages (or folios). That would consequently let us
pin/grab bigger swathes of memory, e.g. 1G (in 2M head pages) or 512G
(in 1G head pages), with just one page for storing the struct pages[*].
Albeit I suspect the numbers would have to justify it.

Joao

[*] One page happens to be what's used for RDMA/umem and vdpa as
callers of pin_user_pages*()
diff --git a/mm/gup.c b/mm/gup.c
index 98eb8e6d2609..194e6981eb03 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2250,22 +2250,68 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 }
 #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
 
+
+static int record_subpages(struct page *page, unsigned long addr,
+			   unsigned long end, struct page **pages)
+{
+	int nr;
+
+	for (nr = 0; addr != end; addr += PAGE_SIZE)
+		pages[nr++] = page++;
+
+	return nr;
+}
+
 #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
-static int __gup_device_huge(unsigned long pfn, unsigned long addr,
-			     unsigned long end, unsigned int flags,
-			     struct page **pages, int *nr)
+static int __gup_device_compound_huge(struct dev_pagemap *pgmap,
+				      struct page *head, unsigned long sz,
+				      unsigned long addr, unsigned long end,
+				      unsigned int flags, struct page **pages)
+{
+	struct page *page;
+	int refs;
+
+	if (!(pgmap->flags & PGMAP_COMPOUND))
+		return -1;
+
+	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
+	refs = record_subpages(page, addr, end, pages);
+
+	SetPageReferenced(page);
+	head = try_grab_compound_head(head, refs, flags);
+	if (!head) {
+		ClearPageReferenced(page);
+		return 0;
+	}
+
+	return refs;
+}
+
+static int __gup_device_huge(unsigned long pfn, unsigned long sz,
+			     unsigned long addr, unsigned long end,
+			     unsigned int flags, struct page **pages, int *nr)
 {
 	int nr_start = *nr;
 	struct dev_pagemap *pgmap = NULL;
 
 	do {
 		struct page *page = pfn_to_page(pfn);
+		int refs;
 
 		pgmap = get_dev_pagemap(pfn, pgmap);
 		if (unlikely(!pgmap)) {
 			undo_dev_pagemap(nr, nr_start, flags, pages);
 			return 0;
 		}
+
+		refs = __gup_device_compound_huge(pgmap, page, sz, addr, end,
+						  flags, pages + *nr);
+		if (refs >= 0) {
+			*nr += refs;
+			put_dev_pagemap(pgmap);
+			return refs ? 1 : 0;
+		}
+
 		SetPageReferenced(page);
 		pages[*nr] = page;
 		if (unlikely(!try_grab_page(page, flags))) {
@@ -2289,7 +2335,7 @@ static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 	int nr_start = *nr;
 
 	fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-	if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
+	if (!__gup_device_huge(fault_pfn, PMD_SHIFT, addr, end, flags, pages, nr))
 		return 0;
 
 	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
@@ -2307,7 +2353,7 @@ static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 	int nr_start = *nr;
 
 	fault_pfn = pud_pfn(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
-	if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
+	if (!__gup_device_huge(fault_pfn, PUD_SHIFT, addr, end, flags, pages, nr))
 		return 0;
 
 	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
@@ -2334,17 +2380,6 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
 }
 #endif
 
-static int record_subpages(struct page *page, unsigned long addr,
-			   unsigned long end, struct page **pages)
-{
-	int nr;
-
-	for (nr = 0; addr != end; addr += PAGE_SIZE)
-		pages[nr++] = page++;
-
-	return nr;
-}
-
 #ifdef CONFIG_ARCH_HAS_HUGEPD
 static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
 				      unsigned long sz)
Much like hugetlbfs or THPs, we treat device pagemaps with compound
pages like the rest of GUP handling of compound pages.

Rather than incrementing the refcount every 4K, we record all sub pages
and increment by @refs amount *once*.

Performance measured by gup_benchmark improves considerably for
get_user_pages_fast() and pin_user_pages_fast():

 $ gup_benchmark -f /dev/dax0.2 -m 16384 -r 10 -S [-u,-a] -n 512 -w

 (get_user_pages_fast 2M pages) ~75k us -> ~3.6k us
 (pin_user_pages_fast 2M pages) ~125k us -> ~3.8k us

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/gup.c | 67 ++++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 51 insertions(+), 16 deletions(-)