Message ID | 20201208172901.17384-2-joao.m.martins@oracle.com
---|---
State | New, archived
Series | mm, sparse-vmemmap: Introduce compound pagemaps
On 12/8/20 9:28 AM, Joao Martins wrote: > Add a new flag for struct dev_pagemap which designates that a a pagemap a a > is described as a set of compound pages or in other words, that how > pages are grouped together in the page tables are reflected in how we > describe struct pages. This means that rather than initializing > individual struct pages, we also initialize these struct pages, as Let's not say "rather than x, we also do y", because it's self-contradictory. I think you want to just leave out the "also", like this: "This means that rather than initializing> individual struct pages, we initialize these struct pages ..." Is that right? > compound pages (on x86: 2M or 1G compound pages) > > For certain ZONE_DEVICE users, like device-dax, which have a fixed page > size, this creates an opportunity to optimize GUP and GUP-fast walkers, > thus playing the same tricks as hugetlb pages. > > Signed-off-by: Joao Martins <joao.m.martins@oracle.com> > --- > include/linux/memremap.h | 2 ++ > mm/memremap.c | 8 ++++++-- > mm/page_alloc.c | 7 +++++++ > 3 files changed, 15 insertions(+), 2 deletions(-) > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h > index 79c49e7f5c30..f8f26b2cc3da 100644 > --- a/include/linux/memremap.h > +++ b/include/linux/memremap.h > @@ -90,6 +90,7 @@ struct dev_pagemap_ops { > }; > > #define PGMAP_ALTMAP_VALID (1 << 0) > +#define PGMAP_COMPOUND (1 << 1) > > /** > * struct dev_pagemap - metadata for ZONE_DEVICE mappings > @@ -114,6 +115,7 @@ struct dev_pagemap { > struct completion done; > enum memory_type type; > unsigned int flags; > + unsigned int align; This also needs an "@aline" entry in the comment block above. > const struct dev_pagemap_ops *ops; > void *owner; > int nr_range; > diff --git a/mm/memremap.c b/mm/memremap.c > index 16b2fb482da1..287a24b7a65a 100644 > --- a/mm/memremap.c > +++ b/mm/memremap.c > @@ -277,8 +277,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params, > memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE], > PHYS_PFN(range->start), > PHYS_PFN(range_len(range)), pgmap); > - percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id) > - - pfn_first(pgmap, range_id)); > + if (pgmap->flags & PGMAP_COMPOUND) > + percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id) > + - pfn_first(pgmap, range_id)) / PHYS_PFN(pgmap->align)); Is there some reason that we cannot use range_len(), instead of pfn_end() minus pfn_first()? (Yes, this more about the pre-existing code than about your change.) And if not, then why are the nearby range_len() uses OK? I realize that range_len() is simpler and skips a case, but it's not clear that it's required here. But I'm new to this area so be warned. :) Also, dividing by PHYS_PFN() feels quite misleading: that function does what you happen to want, but is not named accordingly. Can you use or create something more accurately named? Like "number of pages in this large page"? 
> + else > + percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id) > + - pfn_first(pgmap, range_id)); > return 0; > > err_add_memory: > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index eaa227a479e4..9716ecd58e29 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -6116,6 +6116,8 @@ void __ref memmap_init_zone_device(struct zone *zone, > unsigned long pfn, end_pfn = start_pfn + nr_pages; > struct pglist_data *pgdat = zone->zone_pgdat; > struct vmem_altmap *altmap = pgmap_altmap(pgmap); > + bool compound = pgmap->flags & PGMAP_COMPOUND; > + unsigned int align = PHYS_PFN(pgmap->align); Maybe align_pfn or pfn_align? Don't want the same name for things that are actually different types, in meaning anyway. > unsigned long zone_idx = zone_idx(zone); > unsigned long start = jiffies; > int nid = pgdat->node_id; > @@ -6171,6 +6173,11 @@ void __ref memmap_init_zone_device(struct zone *zone, > } > } > > + if (compound) { > + for (pfn = start_pfn; pfn < end_pfn; pfn += align) > + prep_compound_page(pfn_to_page(pfn), order_base_2(align)); > + } > + > pr_info("%s initialised %lu pages in %ums\n", __func__, > nr_pages, jiffies_to_msecs(jiffies - start)); > } > thanks,
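A minimal sketch of how the two concrete suggestions above could be folded into the patch; the kernel-doc wording and the pgmap_pfns_per_compound() helper name are illustrative assumptions, not taken from any posted revision:

 /**
  * struct dev_pagemap - metadata for ZONE_DEVICE mappings
  * ...
  * @align: compound page size, in bytes, used to map this pagemap
  * ...
  */

 /* Name the "base pages per compound page" computation instead of
  * open-coding PHYS_PFN(pgmap->align) at the call sites. */
 static unsigned long pgmap_pfns_per_compound(struct dev_pagemap *pgmap)
 {
 	return PHYS_PFN(pgmap->align);
 }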
On Tue, Dec 08, 2020 at 09:59:19PM -0800, John Hubbard wrote: > On 12/8/20 9:28 AM, Joao Martins wrote: > > Add a new flag for struct dev_pagemap which designates that a a pagemap > > a a > > > is described as a set of compound pages or in other words, that how > > pages are grouped together in the page tables are reflected in how we > > describe struct pages. This means that rather than initializing > > individual struct pages, we also initialize these struct pages, as > > Let's not say "rather than x, we also do y", because it's self-contradictory. > I think you want to just leave out the "also", like this: > > "This means that rather than initializing> individual struct pages, we > initialize these struct pages ..." > > Is that right? I'd phrase it as: Add a new flag for struct dev_pagemap which specifies that a pagemap is composed of a set of compound pages instead of individual pages. When these pages are initialised, most are initialised as tail pages instead of order-0 pages. > > For certain ZONE_DEVICE users, like device-dax, which have a fixed page > > size, this creates an opportunity to optimize GUP and GUP-fast walkers, > > thus playing the same tricks as hugetlb pages. Rather than "playing the same tricks", how about "are treated the same way as THP or hugetlb pages"? > > + if (pgmap->flags & PGMAP_COMPOUND) > > + percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id) > > + - pfn_first(pgmap, range_id)) / PHYS_PFN(pgmap->align)); > > Is there some reason that we cannot use range_len(), instead of pfn_end() minus > pfn_first()? (Yes, this more about the pre-existing code than about your change.) > > And if not, then why are the nearby range_len() uses OK? I realize that range_len() > is simpler and skips a case, but it's not clear that it's required here. But I'm > new to this area so be warned. :) > > Also, dividing by PHYS_PFN() feels quite misleading: that function does what you > happen to want, but is not named accordingly. Can you use or create something > more accurately named? Like "number of pages in this large page"? We have compound_nr(), but that takes a struct page as an argument. We also have HPAGE_NR_PAGES. I'm not quite clear what you want.
On 12/9/20 6:33 AM, Matthew Wilcox wrote: > On Tue, Dec 08, 2020 at 09:59:19PM -0800, John Hubbard wrote: >> On 12/8/20 9:28 AM, Joao Martins wrote: >>> Add a new flag for struct dev_pagemap which designates that a a pagemap >> >> a a >> Ugh. Yeah, will fix. >>> is described as a set of compound pages or in other words, that how >>> pages are grouped together in the page tables are reflected in how we >>> describe struct pages. This means that rather than initializing >>> individual struct pages, we also initialize these struct pages, as >> >> Let's not say "rather than x, we also do y", because it's self-contradictory. >> I think you want to just leave out the "also", like this: >> >> "This means that rather than initializing> individual struct pages, we >> initialize these struct pages ..." >> >> Is that right? > Nope, my previous text was broken. > I'd phrase it as: > > Add a new flag for struct dev_pagemap which specifies that a pagemap is > composed of a set of compound pages instead of individual pages. When > these pages are initialised, most are initialised as tail pages > instead of order-0 pages. > Thanks, I will use this instead. >>> For certain ZONE_DEVICE users, like device-dax, which have a fixed page >>> size, this creates an opportunity to optimize GUP and GUP-fast walkers, >>> thus playing the same tricks as hugetlb pages. > > Rather than "playing the same tricks", how about "are treated the same > way as THP or hugetlb pages"? > >>> + if (pgmap->flags & PGMAP_COMPOUND) >>> + percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id) >>> + - pfn_first(pgmap, range_id)) / PHYS_PFN(pgmap->align)); >> >> Is there some reason that we cannot use range_len(), instead of pfn_end() minus >> pfn_first()? (Yes, this more about the pre-existing code than about your change.) >> Indeed one could use range_len() / pgmap->align and it would work. But (...) >> And if not, then why are the nearby range_len() uses OK? I realize that range_len() >> is simpler and skips a case, but it's not clear that it's required here. But I'm >> new to this area so be warned. :) >> My use of pfns to calculate the nr of pages was to remain consistent with the rest of the code in the function taking references in the pgmap->ref. The usages of range_len() one sees are when the hotplug takes place, which works at addresses and not PFNs. >> Also, dividing by PHYS_PFN() feels quite misleading: that function does what you >> happen to want, but is not named accordingly. Can you use or create something >> more accurately named? Like "number of pages in this large page"? > > We have compound_nr(), but that takes a struct page as an argument. > We also have HPAGE_NR_PAGES. I'm not quite clear what you want. > If possible I would rather keep the pfns as with the rest of the code. Another alternative is something like a range_nr_pages helper, but I am not sure it's worth the trouble for one caller. Joao
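A sketch of the range_nr_pages() idea floated above, built on the existing range_len() helper from include/linux/range.h; the function itself is hypothetical and not in-tree:

 static unsigned long range_nr_pages(const struct range *range)
 {
 	/* bytes spanned by the range converted to a number of base pages */
 	return PHYS_PFN(range_len(range));
 }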
On Tue, Dec 8, 2020 at 9:32 AM Joao Martins <joao.m.martins@oracle.com> wrote: > > Add a new flag for struct dev_pagemap which designates that a a pagemap > is described as a set of compound pages or in other words, that how > pages are grouped together in the page tables are reflected in how we > describe struct pages. This means that rather than initializing > individual struct pages, we also initialize these struct pages, as > compound pages (on x86: 2M or 1G compound pages) > > For certain ZONE_DEVICE users, like device-dax, which have a fixed page > size, this creates an opportunity to optimize GUP and GUP-fast walkers, > thus playing the same tricks as hugetlb pages. > > Signed-off-by: Joao Martins <joao.m.martins@oracle.com> > --- > include/linux/memremap.h | 2 ++ > mm/memremap.c | 8 ++++++-- > mm/page_alloc.c | 7 +++++++ > 3 files changed, 15 insertions(+), 2 deletions(-) > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h > index 79c49e7f5c30..f8f26b2cc3da 100644 > --- a/include/linux/memremap.h > +++ b/include/linux/memremap.h > @@ -90,6 +90,7 @@ struct dev_pagemap_ops { > }; > > #define PGMAP_ALTMAP_VALID (1 << 0) > +#define PGMAP_COMPOUND (1 << 1) Why is a new flag needed versus just the align attribute? In other words there should be no need to go back to the old/slow days of 'struct page' per pfn after compound support is added.
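A sketch of what Dan is suggesting here, deriving compound behaviour from the align attribute alone instead of a separate flag; it assumes ->align is left at 0 or PAGE_SIZE by users that still want order-0 pages:

 static bool pgmap_is_compound(struct dev_pagemap *pgmap)
 {
 	return pgmap->align > PAGE_SIZE;
 }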
On Tue, Dec 8, 2020 at 9:59 PM John Hubbard <jhubbard@nvidia.com> wrote: > > On 12/8/20 9:28 AM, Joao Martins wrote: > > Add a new flag for struct dev_pagemap which designates that a a pagemap > > a a > > > is described as a set of compound pages or in other words, that how > > pages are grouped together in the page tables are reflected in how we > > describe struct pages. This means that rather than initializing > > individual struct pages, we also initialize these struct pages, as > > Let's not say "rather than x, we also do y", because it's self-contradictory. > I think you want to just leave out the "also", like this: > > "This means that rather than initializing> individual struct pages, we > initialize these struct pages ..." > > Is that right? > > > compound pages (on x86: 2M or 1G compound pages) > > > > For certain ZONE_DEVICE users, like device-dax, which have a fixed page > > size, this creates an opportunity to optimize GUP and GUP-fast walkers, > > thus playing the same tricks as hugetlb pages. > > > > Signed-off-by: Joao Martins <joao.m.martins@oracle.com> > > --- > > include/linux/memremap.h | 2 ++ > > mm/memremap.c | 8 ++++++-- > > mm/page_alloc.c | 7 +++++++ > > 3 files changed, 15 insertions(+), 2 deletions(-) > > > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h > > index 79c49e7f5c30..f8f26b2cc3da 100644 > > --- a/include/linux/memremap.h > > +++ b/include/linux/memremap.h > > @@ -90,6 +90,7 @@ struct dev_pagemap_ops { > > }; > > > > #define PGMAP_ALTMAP_VALID (1 << 0) > > +#define PGMAP_COMPOUND (1 << 1) > > > > /** > > * struct dev_pagemap - metadata for ZONE_DEVICE mappings > > @@ -114,6 +115,7 @@ struct dev_pagemap { > > struct completion done; > > enum memory_type type; > > unsigned int flags; > > + unsigned int align; > > This also needs an "@aline" entry in the comment block above. > > > const struct dev_pagemap_ops *ops; > > void *owner; > > int nr_range; > > diff --git a/mm/memremap.c b/mm/memremap.c > > index 16b2fb482da1..287a24b7a65a 100644 > > --- a/mm/memremap.c > > +++ b/mm/memremap.c > > @@ -277,8 +277,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params, > > memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE], > > PHYS_PFN(range->start), > > PHYS_PFN(range_len(range)), pgmap); > > - percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id) > > - - pfn_first(pgmap, range_id)); > > + if (pgmap->flags & PGMAP_COMPOUND) > > + percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id) > > + - pfn_first(pgmap, range_id)) / PHYS_PFN(pgmap->align)); > > Is there some reason that we cannot use range_len(), instead of pfn_end() minus > pfn_first()? (Yes, this more about the pre-existing code than about your change.) > > And if not, then why are the nearby range_len() uses OK? I realize that range_len() > is simpler and skips a case, but it's not clear that it's required here. But I'm > new to this area so be warned. :) There's a subtle distinction between the range that was passed in and the pfns that are activated inside of it. See the offset trickery in pfn_first(). > Also, dividing by PHYS_PFN() feels quite misleading: that function does what you > happen to want, but is not named accordingly. Can you use or create something > more accurately named? Like "number of pages in this large page"? It's not the number of pages in a large page it's converting bytes to pages. 
Other places in the kernel write it as (x >> PAGE_SHIFT), but my thought process was that if I'm going to add () I might as well use a macro that already does this. That said I think this calculation is broken precisely because pfn_first() makes the result unaligned. Rather than fix the unaligned pfn_first() problem I would use this support as an opportunity to revisit the option of storing pages in the vmem_altmap reserve space. The altmap's whole reason for existence was that 1.5% of large PMEM might completely swamp DRAM. However, if that overhead is reduced by an order (or orders) of magnitude the primary need for vmem_altmap vanishes. Now, we'll still need to keep it around for the ->align == PAGE_SIZE case, but for the most part existing deployments that are specifying page map on PMEM and an align > PAGE_SIZE can instead just transparently be upgraded to page map on a smaller amount of DRAM. > > > + else > > + percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id) > > + - pfn_first(pgmap, range_id)); > > return 0; > > > > err_add_memory: > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index eaa227a479e4..9716ecd58e29 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -6116,6 +6116,8 @@ void __ref memmap_init_zone_device(struct zone *zone, > > unsigned long pfn, end_pfn = start_pfn + nr_pages; > > struct pglist_data *pgdat = zone->zone_pgdat; > > struct vmem_altmap *altmap = pgmap_altmap(pgmap); > > + bool compound = pgmap->flags & PGMAP_COMPOUND; > > + unsigned int align = PHYS_PFN(pgmap->align); > > Maybe align_pfn or pfn_align? Don't want the same name for things that are actually > different types, in meaning anyway. Good catch.
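To make the naming point concrete, the memmap_init_zone_device() hunk would read roughly as below with the pfn_align spelling John proposes; this is illustrative only and not taken from a later revision:

 	bool compound = pgmap->flags & PGMAP_COMPOUND;
 	unsigned int pfn_align = PHYS_PFN(pgmap->align);	/* base pages per compound page */
 	/* ... */
 	if (compound) {
 		for (pfn = start_pfn; pfn < end_pfn; pfn += pfn_align)
 			prep_compound_page(pfn_to_page(pfn),
 					   order_base_2(pfn_align));
 	}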
On 2/20/21 1:24 AM, Dan Williams wrote: > On Tue, Dec 8, 2020 at 9:32 AM Joao Martins <joao.m.martins@oracle.com> wrote: >> >> Add a new flag for struct dev_pagemap which designates that a a pagemap >> is described as a set of compound pages or in other words, that how >> pages are grouped together in the page tables are reflected in how we >> describe struct pages. This means that rather than initializing >> individual struct pages, we also initialize these struct pages, as >> compound pages (on x86: 2M or 1G compound pages) >> >> For certain ZONE_DEVICE users, like device-dax, which have a fixed page >> size, this creates an opportunity to optimize GUP and GUP-fast walkers, >> thus playing the same tricks as hugetlb pages. >> >> Signed-off-by: Joao Martins <joao.m.martins@oracle.com> >> --- >> include/linux/memremap.h | 2 ++ >> mm/memremap.c | 8 ++++++-- >> mm/page_alloc.c | 7 +++++++ >> 3 files changed, 15 insertions(+), 2 deletions(-) >> >> diff --git a/include/linux/memremap.h b/include/linux/memremap.h >> index 79c49e7f5c30..f8f26b2cc3da 100644 >> --- a/include/linux/memremap.h >> +++ b/include/linux/memremap.h >> @@ -90,6 +90,7 @@ struct dev_pagemap_ops { >> }; >> >> #define PGMAP_ALTMAP_VALID (1 << 0) >> +#define PGMAP_COMPOUND (1 << 1) > > Why is a new flag needed versus just the align attribute? In other > words there should be no need to go back to the old/slow days of > 'struct page' per pfn after compound support is added. > Ack, I suppose I could just use pgmap @align attribute as you mentioned. Joao
On 2/20/21 1:43 AM, Dan Williams wrote: > On Tue, Dec 8, 2020 at 9:59 PM John Hubbard <jhubbard@nvidia.com> wrote: >> On 12/8/20 9:28 AM, Joao Martins wrote: >>> diff --git a/mm/memremap.c b/mm/memremap.c >>> index 16b2fb482da1..287a24b7a65a 100644 >>> --- a/mm/memremap.c >>> +++ b/mm/memremap.c >>> @@ -277,8 +277,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params, >>> memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE], >>> PHYS_PFN(range->start), >>> PHYS_PFN(range_len(range)), pgmap); >>> - percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id) >>> - - pfn_first(pgmap, range_id)); >>> + if (pgmap->flags & PGMAP_COMPOUND) >>> + percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id) >>> + - pfn_first(pgmap, range_id)) / PHYS_PFN(pgmap->align)); >> >> Is there some reason that we cannot use range_len(), instead of pfn_end() minus >> pfn_first()? (Yes, this more about the pre-existing code than about your change.) >> >> And if not, then why are the nearby range_len() uses OK? I realize that range_len() >> is simpler and skips a case, but it's not clear that it's required here. But I'm >> new to this area so be warned. :) > > There's a subtle distinction between the range that was passed in and > the pfns that are activated inside of it. See the offset trickery in > pfn_first(). > >> Also, dividing by PHYS_PFN() feels quite misleading: that function does what you >> happen to want, but is not named accordingly. Can you use or create something >> more accurately named? Like "number of pages in this large page"? > > It's not the number of pages in a large page it's converting bytes to > pages. Other place in the kernel write it as (x >> PAGE_SHIFT), but my > though process was if I'm going to add () might as well use a macro > that already does this. > > That said I think this calculation is broken precisely because > pfn_first() makes the result unaligned. > > Rather than fix the unaligned pfn_first() problem I would use this > support as an opportunity to revisit the option of storing pages in > the vmem_altmap reserve soace. The altmap's whole reason for existence > was that 1.5% of large PMEM might completely swamp DRAM. However if > that overhead is reduced by an order (or orders) of magnitude the > primary need for vmem_altmap vanishes. > > Now, we'll still need to keep it around for the ->align == PAGE_SIZE > case, but for most part existing deployments that are specifying page > map on PMEM and an align > PAGE_SIZE can instead just transparently be > upgraded to page map on a smaller amount of DRAM. > I feel the altmap is still relevant. Even with the struct page reuse for tail pages, the overhead for 2M align is still non-negligeble i.e. 4G per 1Tb (strictly speaking about what's stored in the altmap). Muchun and Matthew were thinking (in another thread) on compound_head() adjustments that probably can make this overhead go to 2G (if we learn to differentiate the reused head page from the real head page). But even there it's still 2G per 1Tb. 1G pages, though, have a better story to remove altmap need. One thing to point out about altmap is that the degradation (in pinning and unpining) we observed with struct page's in device memory, is no longer observed once 1) we batch ref count updates as we move to compound pages 2) reusing tail pages seems to lead to these struct pages staying more likely in cache which perhaps contributes to dirtying a lot less cachelines. Joao
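A back-of-envelope check of the figures above, assuming 4 KiB base pages and a 64-byte struct page (both assumptions, neither is stated explicitly in the thread):

 1 TiB / 2 MiB                     = 524,288 compound pages
 vmemmap per 2 MiB compound page   = 512 * 64 B = 32 KiB = 8 base pages
 keeping 2 of those 8 pages        = 524,288 * 2 * 4 KiB = 4 GiB per TiB
 keeping only 1 (dedup'd head too) = 2 GiB per TiB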
On Mon, Feb 22, 2021 at 3:24 AM Joao Martins <joao.m.martins@oracle.com> wrote: > > On 2/20/21 1:43 AM, Dan Williams wrote: > > On Tue, Dec 8, 2020 at 9:59 PM John Hubbard <jhubbard@nvidia.com> wrote: > >> On 12/8/20 9:28 AM, Joao Martins wrote: > >>> diff --git a/mm/memremap.c b/mm/memremap.c > >>> index 16b2fb482da1..287a24b7a65a 100644 > >>> --- a/mm/memremap.c > >>> +++ b/mm/memremap.c > >>> @@ -277,8 +277,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params, > >>> memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE], > >>> PHYS_PFN(range->start), > >>> PHYS_PFN(range_len(range)), pgmap); > >>> - percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id) > >>> - - pfn_first(pgmap, range_id)); > >>> + if (pgmap->flags & PGMAP_COMPOUND) > >>> + percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id) > >>> + - pfn_first(pgmap, range_id)) / PHYS_PFN(pgmap->align)); > >> > >> Is there some reason that we cannot use range_len(), instead of pfn_end() minus > >> pfn_first()? (Yes, this more about the pre-existing code than about your change.) > >> > >> And if not, then why are the nearby range_len() uses OK? I realize that range_len() > >> is simpler and skips a case, but it's not clear that it's required here. But I'm > >> new to this area so be warned. :) > > > > There's a subtle distinction between the range that was passed in and > > the pfns that are activated inside of it. See the offset trickery in > > pfn_first(). > > > >> Also, dividing by PHYS_PFN() feels quite misleading: that function does what you > >> happen to want, but is not named accordingly. Can you use or create something > >> more accurately named? Like "number of pages in this large page"? > > > > It's not the number of pages in a large page it's converting bytes to > > pages. Other place in the kernel write it as (x >> PAGE_SHIFT), but my > > though process was if I'm going to add () might as well use a macro > > that already does this. > > > > That said I think this calculation is broken precisely because > > pfn_first() makes the result unaligned. > > > > Rather than fix the unaligned pfn_first() problem I would use this > > support as an opportunity to revisit the option of storing pages in > > the vmem_altmap reserve soace. The altmap's whole reason for existence > > was that 1.5% of large PMEM might completely swamp DRAM. However if > > that overhead is reduced by an order (or orders) of magnitude the > > primary need for vmem_altmap vanishes. > > > > Now, we'll still need to keep it around for the ->align == PAGE_SIZE > > case, but for most part existing deployments that are specifying page > > map on PMEM and an align > PAGE_SIZE can instead just transparently be > > upgraded to page map on a smaller amount of DRAM. > > > I feel the altmap is still relevant. Even with the struct page reuse for > tail pages, the overhead for 2M align is still non-negligeble i.e. 4G per > 1Tb (strictly speaking about what's stored in the altmap). Muchun and > Matthew were thinking (in another thread) on compound_head() adjustments > that probably can make this overhead go to 2G (if we learn to differentiate > the reused head page from the real head page). I think that realization is more justification to make a new first class vmemmap_populate_compound_pages() rather than try to reuse vmemmap_populate_basepages() with new parameters. > But even there it's still > 2G per 1Tb. 1G pages, though, have a better story to remove altmap need. 
The concern that led to altmap is that someone would build a system with a 96:1 (PMEM:RAM) ratio where that correlates to maximum PMEM and minimum RAM, and mapping all PMEM consumes all RAM. As far as I understand real world populations are rarely going past 8:1, that seems to make 'struct page' in RAM feasible even for the 2M compound page case. Let me ask you for a data point, since you're one of the people actively deploying such systems, would you still use the 'struct page' in PMEM capability after this set was merged? > One thing to point out about altmap is that the degradation (in pinning and > unpining) we observed with struct page's in device memory, is no longer observed > once 1) we batch ref count updates as we move to compound pages 2) reusing > tail pages seems to lead to these struct pages staying more likely in cache > which perhaps contributes to dirtying a lot less cachelines. True, it makes it more palatable to survive 'struct page' in PMEM, but it's an ongoing maintenance burden that I'm not sure there are users after putting 'struct page' on a diet. Don't get me wrong the capability is still needed for filesystem-dax, but the distinction is that vmemmap_populate_compound_pages() need never worry about an altmap.
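A quick sanity check on those ratios, again assuming a 64-byte struct page and 4 KiB base pages (assumptions, not figures from the thread):

 struct page overhead = 64 B / 4 KiB        ~= 1.56% of mapped capacity
 96:1 PMEM:RAM        -> 0.0156 * 96        ~= 1.5x of all RAM
 8:1 PMEM:RAM         -> 0.0156 * 8         ~= 12.5% of RAM, before any tail-page reuse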
On 2/22/21 8:37 PM, Dan Williams wrote: > On Mon, Feb 22, 2021 at 3:24 AM Joao Martins <joao.m.martins@oracle.com> wrote: >> On 2/20/21 1:43 AM, Dan Williams wrote: >>> On Tue, Dec 8, 2020 at 9:59 PM John Hubbard <jhubbard@nvidia.com> wrote: >>>> On 12/8/20 9:28 AM, Joao Martins wrote: >>>>> diff --git a/mm/memremap.c b/mm/memremap.c >>>>> index 16b2fb482da1..287a24b7a65a 100644 >>>>> --- a/mm/memremap.c >>>>> +++ b/mm/memremap.c >>>>> @@ -277,8 +277,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params, >>>>> memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE], >>>>> PHYS_PFN(range->start), >>>>> PHYS_PFN(range_len(range)), pgmap); >>>>> - percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id) >>>>> - - pfn_first(pgmap, range_id)); >>>>> + if (pgmap->flags & PGMAP_COMPOUND) >>>>> + percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id) >>>>> + - pfn_first(pgmap, range_id)) / PHYS_PFN(pgmap->align)); >>>> >>>> Is there some reason that we cannot use range_len(), instead of pfn_end() minus >>>> pfn_first()? (Yes, this more about the pre-existing code than about your change.) >>>> >>>> And if not, then why are the nearby range_len() uses OK? I realize that range_len() >>>> is simpler and skips a case, but it's not clear that it's required here. But I'm >>>> new to this area so be warned. :) >>> >>> There's a subtle distinction between the range that was passed in and >>> the pfns that are activated inside of it. See the offset trickery in >>> pfn_first(). >>> >>>> Also, dividing by PHYS_PFN() feels quite misleading: that function does what you >>>> happen to want, but is not named accordingly. Can you use or create something >>>> more accurately named? Like "number of pages in this large page"? >>> >>> It's not the number of pages in a large page it's converting bytes to >>> pages. Other place in the kernel write it as (x >> PAGE_SHIFT), but my >>> though process was if I'm going to add () might as well use a macro >>> that already does this. >>> >>> That said I think this calculation is broken precisely because >>> pfn_first() makes the result unaligned. >>> >>> Rather than fix the unaligned pfn_first() problem I would use this >>> support as an opportunity to revisit the option of storing pages in >>> the vmem_altmap reserve soace. The altmap's whole reason for existence >>> was that 1.5% of large PMEM might completely swamp DRAM. However if >>> that overhead is reduced by an order (or orders) of magnitude the >>> primary need for vmem_altmap vanishes. >>> >>> Now, we'll still need to keep it around for the ->align == PAGE_SIZE >>> case, but for most part existing deployments that are specifying page >>> map on PMEM and an align > PAGE_SIZE can instead just transparently be >>> upgraded to page map on a smaller amount of DRAM. >>> >> I feel the altmap is still relevant. Even with the struct page reuse for >> tail pages, the overhead for 2M align is still non-negligeble i.e. 4G per >> 1Tb (strictly speaking about what's stored in the altmap). Muchun and >> Matthew were thinking (in another thread) on compound_head() adjustments >> that probably can make this overhead go to 2G (if we learn to differentiate >> the reused head page from the real head page). > > I think that realization is more justification to make a new first > class vmemmap_populate_compound_pages() rather than try to reuse > vmemmap_populate_basepages() with new parameters. 
> I was already going to move this to vmemmap_populate_compound_pages() based on your earlier suggestion :) >> But even there it's still >> 2G per 1Tb. 1G pages, though, have a better story to remove altmap need. > > The concern that led to altmap is that someone would build a system > with a 96:1 (PMEM:RAM) ratio where that correlates to maximum PMEM and > minimum RAM, and mapping all PMEM consumes all RAM. As far as I > understand real world populations are rarely going past 8:1, that > seems to make 'struct page' in RAM feasible even for the 2M compound > page case. > > Let me ask you for a data point, since you're one of the people > actively deploying such systems, would you still use the 'struct page' > in PMEM capability after this set was merged? > We might be sticking to RAM stored 'struct page' yes, but hard to say atm what the future holds. >> One thing to point out about altmap is that the degradation (in pinning and >> unpining) we observed with struct page's in device memory, is no longer observed >> once 1) we batch ref count updates as we move to compound pages 2) reusing >> tail pages seems to lead to these struct pages staying more likely in cache >> which perhaps contributes to dirtying a lot less cachelines. > > True, it makes it more palatable to survive 'struct page' in PMEM, but > it's an ongoing maintenance burden that I'm not sure there are users > after putting 'struct page' on a diet. FWIW all I was trying to point out is that the 2M huge page overhead is still non trivial. It is indeed much better than it is ATM yes, but still 6G per 1TB with 2M huge pages. Only with 1G would be non-existent overhead, but then we have a trade-off elsewhere in terms of poisoning a whole 1G page and what not. > Don't get me wrong the > capability is still needed for filesystem-dax, but the distinction is > that vmemmap_populate_compound_pages() need never worry about an > altmap. > IMO there's not much added complexity strictly speaking about altmap. We still use the same vmemmap_{pmd,pte,pgd}_populate helpers which just pass an altmap. So whatever it is being maintained for fsdax or other altmap consumers (e.g. we seem to be working towards hotplug making use of it) we are using it in the exact same way. The complexity of the future vmemmap_populate_compound_pages() has more to do with reusing vmemmap blocks allocated in previous vmemmap pages, and preserving that across section onlining (for 1G pages). Joao
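For reference, the helper under discussion would mirror vmemmap_populate_basepages(); the signature below is a guess for illustration only, not the one that eventually landed:

 int __meminit vmemmap_populate_compound_pages(unsigned long start,
 					      unsigned long end, int node,
 					      struct vmem_altmap *altmap,
 					      unsigned long align);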
On Tue, Feb 23, 2021 at 7:46 AM Joao Martins <joao.m.martins@oracle.com> wrote: > > On 2/22/21 8:37 PM, Dan Williams wrote: > > On Mon, Feb 22, 2021 at 3:24 AM Joao Martins <joao.m.martins@oracle.com> wrote: > >> On 2/20/21 1:43 AM, Dan Williams wrote: > >>> On Tue, Dec 8, 2020 at 9:59 PM John Hubbard <jhubbard@nvidia.com> wrote: > >>>> On 12/8/20 9:28 AM, Joao Martins wrote: > >>>>> diff --git a/mm/memremap.c b/mm/memremap.c > >>>>> index 16b2fb482da1..287a24b7a65a 100644 > >>>>> --- a/mm/memremap.c > >>>>> +++ b/mm/memremap.c > >>>>> @@ -277,8 +277,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params, > >>>>> memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE], > >>>>> PHYS_PFN(range->start), > >>>>> PHYS_PFN(range_len(range)), pgmap); > >>>>> - percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id) > >>>>> - - pfn_first(pgmap, range_id)); > >>>>> + if (pgmap->flags & PGMAP_COMPOUND) > >>>>> + percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id) > >>>>> + - pfn_first(pgmap, range_id)) / PHYS_PFN(pgmap->align)); > >>>> > >>>> Is there some reason that we cannot use range_len(), instead of pfn_end() minus > >>>> pfn_first()? (Yes, this more about the pre-existing code than about your change.) > >>>> > >>>> And if not, then why are the nearby range_len() uses OK? I realize that range_len() > >>>> is simpler and skips a case, but it's not clear that it's required here. But I'm > >>>> new to this area so be warned. :) > >>> > >>> There's a subtle distinction between the range that was passed in and > >>> the pfns that are activated inside of it. See the offset trickery in > >>> pfn_first(). > >>> > >>>> Also, dividing by PHYS_PFN() feels quite misleading: that function does what you > >>>> happen to want, but is not named accordingly. Can you use or create something > >>>> more accurately named? Like "number of pages in this large page"? > >>> > >>> It's not the number of pages in a large page it's converting bytes to > >>> pages. Other place in the kernel write it as (x >> PAGE_SHIFT), but my > >>> though process was if I'm going to add () might as well use a macro > >>> that already does this. > >>> > >>> That said I think this calculation is broken precisely because > >>> pfn_first() makes the result unaligned. > >>> > >>> Rather than fix the unaligned pfn_first() problem I would use this > >>> support as an opportunity to revisit the option of storing pages in > >>> the vmem_altmap reserve soace. The altmap's whole reason for existence > >>> was that 1.5% of large PMEM might completely swamp DRAM. However if > >>> that overhead is reduced by an order (or orders) of magnitude the > >>> primary need for vmem_altmap vanishes. > >>> > >>> Now, we'll still need to keep it around for the ->align == PAGE_SIZE > >>> case, but for most part existing deployments that are specifying page > >>> map on PMEM and an align > PAGE_SIZE can instead just transparently be > >>> upgraded to page map on a smaller amount of DRAM. > >>> > >> I feel the altmap is still relevant. Even with the struct page reuse for > >> tail pages, the overhead for 2M align is still non-negligeble i.e. 4G per > >> 1Tb (strictly speaking about what's stored in the altmap). Muchun and > >> Matthew were thinking (in another thread) on compound_head() adjustments > >> that probably can make this overhead go to 2G (if we learn to differentiate > >> the reused head page from the real head page). 
> > > > I think that realization is more justification to make a new first > > class vmemmap_populate_compound_pages() rather than try to reuse > > vmemmap_populate_basepages() with new parameters. > > > I was already going to move this to vmemmap_populate_compound_pages() based > on your earlier suggestion :) > > >> But even there it's still > >> 2G per 1Tb. 1G pages, though, have a better story to remove altmap need. > > > > The concern that led to altmap is that someone would build a system > > with a 96:1 (PMEM:RAM) ratio where that correlates to maximum PMEM and > > minimum RAM, and mapping all PMEM consumes all RAM. As far as I > > understand real world populations are rarely going past 8:1, that > > seems to make 'struct page' in RAM feasible even for the 2M compound > > page case. > > > > Let me ask you for a data point, since you're one of the people > > actively deploying such systems, would you still use the 'struct page' > > in PMEM capability after this set was merged? > > > We might be sticking to RAM stored 'struct page' yes, but hard to say atm > what the future holds. > > >> One thing to point out about altmap is that the degradation (in pinning and > >> unpining) we observed with struct page's in device memory, is no longer observed > >> once 1) we batch ref count updates as we move to compound pages 2) reusing > >> tail pages seems to lead to these struct pages staying more likely in cache > >> which perhaps contributes to dirtying a lot less cachelines. > > > > True, it makes it more palatable to survive 'struct page' in PMEM, but > > it's an ongoing maintenance burden that I'm not sure there are users > > after putting 'struct page' on a diet. > > FWIW all I was trying to point out is that the 2M huge page overhead is still non > trivial. It is indeed much better than it is ATM yes, but still 6G per 1TB with 2M huge > pages. Only with 1G would be non-existent overhead, but then we have a trade-off elsewhere > in terms of poisoning a whole 1G page and what not. > > > Don't get me wrong the > > capability is still needed for filesystem-dax, but the distinction is > > that vmemmap_populate_compound_pages() need never worry about an > > altmap. > > > IMO there's not much added complexity strictly speaking about altmap. We still use the > same vmemmap_{pmd,pte,pgd}_populate helpers which just pass an altmap. So whatever it is > being maintained for fsdax or other altmap consumers (e.g. we seem to be working towards > hotplug making use of it) we are using it in the exact same way. > > The complexity of the future vmemmap_populate_compound_pages() has more to do with reusing > vmemmap blocks allocated in previous vmemmap pages, and preserving that across section > onlining (for 1G pages). True, I'm less worried about the complexity as much as opportunistically converting configurations to RAM backed pages. It's already the case that poison handling is page mapping size aligned for device-dax, and filesystem-dax needs to stick with non-compound-pages for the foreseeable future. Ok, let's try to keep altmap in vmemmap_populate_compound_pages() and see how it looks.
On 2/23/21 4:50 PM, Dan Williams wrote: > On Tue, Feb 23, 2021 at 7:46 AM Joao Martins <joao.m.martins@oracle.com> wrote: >> On 2/22/21 8:37 PM, Dan Williams wrote: >>> On Mon, Feb 22, 2021 at 3:24 AM Joao Martins <joao.m.martins@oracle.com> wrote: >>>> On 2/20/21 1:43 AM, Dan Williams wrote: >>>>> On Tue, Dec 8, 2020 at 9:59 PM John Hubbard <jhubbard@nvidia.com> wrote: >>>>>> On 12/8/20 9:28 AM, Joao Martins wrote: [...] >>> Don't get me wrong the >>> capability is still needed for filesystem-dax, but the distinction is >>> that vmemmap_populate_compound_pages() need never worry about an >>> altmap. >>> >> IMO there's not much added complexity strictly speaking about altmap. We still use the >> same vmemmap_{pmd,pte,pgd}_populate helpers which just pass an altmap. So whatever it is >> being maintained for fsdax or other altmap consumers (e.g. we seem to be working towards >> hotplug making use of it) we are using it in the exact same way. >> >> The complexity of the future vmemmap_populate_compound_pages() has more to do with reusing >> vmemmap blocks allocated in previous vmemmap pages, and preserving that across section >> onlining (for 1G pages). > > True, I'm less worried about the complexity as much as > opportunistically converting configurations to RAM backed pages. It's > already the case that poison handling is page mapping size aligned for > device-dax, and filesystem-dax needs to stick with non-compound-pages > for the foreseeable future. > Hmm, I was sort off wondering that fsdax could move to compound pages too as opposed to base pages, albeit not necessarily using the vmemmap page reuse as it splits pages IIUC. > Ok, let's try to keep altmap in vmemmap_populate_compound_pages() and > see how it looks. > OK, will do.
On Tue, Feb 23, 2021 at 9:19 AM Joao Martins <joao.m.martins@oracle.com> wrote: > > On 2/23/21 4:50 PM, Dan Williams wrote: > > On Tue, Feb 23, 2021 at 7:46 AM Joao Martins <joao.m.martins@oracle.com> wrote: > >> On 2/22/21 8:37 PM, Dan Williams wrote: > >>> On Mon, Feb 22, 2021 at 3:24 AM Joao Martins <joao.m.martins@oracle.com> wrote: > >>>> On 2/20/21 1:43 AM, Dan Williams wrote: > >>>>> On Tue, Dec 8, 2020 at 9:59 PM John Hubbard <jhubbard@nvidia.com> wrote: > >>>>>> On 12/8/20 9:28 AM, Joao Martins wrote: > > [...] > > >>> Don't get me wrong the > >>> capability is still needed for filesystem-dax, but the distinction is > >>> that vmemmap_populate_compound_pages() need never worry about an > >>> altmap. > >>> > >> IMO there's not much added complexity strictly speaking about altmap. We still use the > >> same vmemmap_{pmd,pte,pgd}_populate helpers which just pass an altmap. So whatever it is > >> being maintained for fsdax or other altmap consumers (e.g. we seem to be working towards > >> hotplug making use of it) we are using it in the exact same way. > >> > >> The complexity of the future vmemmap_populate_compound_pages() has more to do with reusing > >> vmemmap blocks allocated in previous vmemmap pages, and preserving that across section > >> onlining (for 1G pages). > > > > True, I'm less worried about the complexity as much as > > opportunistically converting configurations to RAM backed pages. It's > > already the case that poison handling is page mapping size aligned for > > device-dax, and filesystem-dax needs to stick with non-compound-pages > > for the foreseeable future. > > > Hmm, I was sort off wondering that fsdax could move to compound pages too as > opposed to base pages, albeit not necessarily using the vmemmap page reuse > as it splits pages IIUC. I'm not sure compound pages for fsdax would work long term because there's no infrastructure to reassemble compound pages after a split. So if you fracture a block and then coalesce it back to a 2MB or 1GB aligned block there's nothing to go fixup the compound page... unless the filesystem wants to get into mm metadata fixups.
On 2/22/21 8:37 PM, Dan Williams wrote: > On Mon, Feb 22, 2021 at 3:24 AM Joao Martins <joao.m.martins@oracle.com> wrote: >> On 2/20/21 1:43 AM, Dan Williams wrote: >>> On Tue, Dec 8, 2020 at 9:59 PM John Hubbard <jhubbard@nvidia.com> wrote: >>>> On 12/8/20 9:28 AM, Joao Martins wrote: >> One thing to point out about altmap is that the degradation (in pinning and >> unpining) we observed with struct page's in device memory, is no longer observed >> once 1) we batch ref count updates as we move to compound pages 2) reusing >> tail pages seems to lead to these struct pages staying more likely in cache >> which perhaps contributes to dirtying a lot less cachelines. > > True, it makes it more palatable to survive 'struct page' in PMEM, I want to retract for now what I said above wrt the "no degradation with struct pages in device memory" comment. I was fooled by a bug in a patch later down this series, in particular because I accidentally cleared PGMAP_ALTMAP_VALID when unilaterally setting PGMAP_COMPOUND, which consequently led to always allocating struct pages from memory. No wonder the numbers were just as fast. I am still confident that it's going to be faster and show less degradation in pinning/init. Init for now is worst-case 2x faster. But it might still be too early to say it is *as fast as* struct pages in memory. The broken masking of the PGMAP_ALTMAP_VALID bit did hide one flaw, where we don't support altmap for basepages on x86/mm and it apparently depends on architectures to implement it (and a couple other issues). The vmemmap allocation isn't the problem, so the previous comment in this thread that altmap doesn't change much in vmemmap_populate_compound_pages() is still accurate. The problem, though, resides in the freeing of vmemmap pagetables with basepages *with altmap* (e.g. at dax-device teardown), which requires arch support. Doing it properly would mean making the altmap reserve smaller (given fewer pages are allocated), and giving the altmap pfn allocator the ability to track references per pfn. But I think that deserves its own separate patch series (probably almost just as big). Perhaps for this set I can do without altmap as you suggested, and use hugepage vmemmap population (which wouldn't lead to device memory savings) instead of reusing base pages. I would still leave the compound page support logic as the metadata representation for > 4K @align, as I think that's the right thing to do. And then do a separate series on improving altmap to leverage the metadata reduction support as done with non-device struct pages. Thoughts? Joao
On Wed, Mar 10, 2021 at 10:13 AM Joao Martins <joao.m.martins@oracle.com> wrote: > > On 2/22/21 8:37 PM, Dan Williams wrote: > > On Mon, Feb 22, 2021 at 3:24 AM Joao Martins <joao.m.martins@oracle.com> wrote: > >> On 2/20/21 1:43 AM, Dan Williams wrote: > >>> On Tue, Dec 8, 2020 at 9:59 PM John Hubbard <jhubbard@nvidia.com> wrote: > >>>> On 12/8/20 9:28 AM, Joao Martins wrote: > >> One thing to point out about altmap is that the degradation (in pinning and > >> unpining) we observed with struct page's in device memory, is no longer observed > >> once 1) we batch ref count updates as we move to compound pages 2) reusing > >> tail pages seems to lead to these struct pages staying more likely in cache > >> which perhaps contributes to dirtying a lot less cachelines. > > > > True, it makes it more palatable to survive 'struct page' in PMEM, > > I want to retract for now what I said above wrt to the no degradation with > struct page in device comment. I was fooled by a bug on a patch later down > this series. Particular because I accidentally cleared PGMAP_ALTMAP_VALID when > unilaterally setting PGMAP_COMPOUND, which consequently lead to always > allocating struct pages from memory. No wonder the numbers were just as fast. > I am still confident that it's going to be faster and observe less degradation > in pinning/init. Init for now is worst-case 2x faster. But to be *as fast* struct > pages in memory might still be early to say. > > The broken masking of the PGMAP_ALTMAP_VALID bit did hide one flaw, where > we don't support altmap for basepages on x86/mm and it apparently depends > on architectures to implement it (and a couple other issues). The vmemmap > allocation isn't the problem, so the previous comment in this thread that > altmap doesn't change much in the vmemmap_populate_compound_pages() is > still accurate. > > The problem though resides on the freeing of vmemmap pagetables with > basepages *with altmap* (e.g. at dax-device teardown) which require arch > support. Doing it properly would mean making the altmap reserve smaller > (given fewer pages are allocated), and the ability for the altmap pfn > allocator to track references per pfn. But I think it deserves its own > separate patch series (probably almost just as big). > > Perhaps for this set I can stick without altmap as you suggested, and > use hugepage vmemmap population (which wouldn't > lead to device memory savings) instead of reusing base pages . I would > still leave the compound page support logic as metadata representation > for > 4K @align, as I think that's the right thing to do. And then > a separate series onto improving altmap to leverage the metadata reduction > support as done with non-device struct pages. > > Thoughts? The space savings is the whole point. So I agree with moving altmap support to a follow-on enhancement, but land the non-altmap basepage support in the first round.
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 79c49e7f5c30..f8f26b2cc3da 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -90,6 +90,7 @@ struct dev_pagemap_ops {
 };
 
 #define PGMAP_ALTMAP_VALID	(1 << 0)
+#define PGMAP_COMPOUND		(1 << 1)
 
 /**
  * struct dev_pagemap - metadata for ZONE_DEVICE mappings
@@ -114,6 +115,7 @@ struct dev_pagemap {
 	struct completion done;
 	enum memory_type type;
 	unsigned int flags;
+	unsigned int align;
 	const struct dev_pagemap_ops *ops;
 	void *owner;
 	int nr_range;
diff --git a/mm/memremap.c b/mm/memremap.c
index 16b2fb482da1..287a24b7a65a 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -277,8 +277,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
 	memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
 				PHYS_PFN(range->start),
 				PHYS_PFN(range_len(range)), pgmap);
-	percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
-			- pfn_first(pgmap, range_id));
+	if (pgmap->flags & PGMAP_COMPOUND)
+		percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
+			- pfn_first(pgmap, range_id)) / PHYS_PFN(pgmap->align));
+	else
+		percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
+				- pfn_first(pgmap, range_id));
 	return 0;
 
 err_add_memory:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eaa227a479e4..9716ecd58e29 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6116,6 +6116,8 @@ void __ref memmap_init_zone_device(struct zone *zone,
 	unsigned long pfn, end_pfn = start_pfn + nr_pages;
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	struct vmem_altmap *altmap = pgmap_altmap(pgmap);
+	bool compound = pgmap->flags & PGMAP_COMPOUND;
+	unsigned int align = PHYS_PFN(pgmap->align);
 	unsigned long zone_idx = zone_idx(zone);
 	unsigned long start = jiffies;
 	int nid = pgdat->node_id;
@@ -6171,6 +6173,11 @@ void __ref memmap_init_zone_device(struct zone *zone,
 		}
 	}
 
+	if (compound) {
+		for (pfn = start_pfn; pfn < end_pfn; pfn += align)
+			prep_compound_page(pfn_to_page(pfn), order_base_2(align));
+	}
+
 	pr_info("%s initialised %lu pages in %ums\n", __func__,
 		nr_pages, jiffies_to_msecs(jiffies - start));
 }
Add a new flag for struct dev_pagemap which designates that a a pagemap
is described as a set of compound pages or in other words, that how
pages are grouped together in the page tables are reflected in how we
describe struct pages. This means that rather than initializing
individual struct pages, we also initialize these struct pages, as
compound pages (on x86: 2M or 1G compound pages)

For certain ZONE_DEVICE users, like device-dax, which have a fixed page
size, this creates an opportunity to optimize GUP and GUP-fast walkers,
thus playing the same tricks as hugetlb pages.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 include/linux/memremap.h | 2 ++
 mm/memremap.c            | 8 ++++++--
 mm/page_alloc.c          | 7 +++++++
 3 files changed, 15 insertions(+), 2 deletions(-)
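Illustrative only: how a fixed page size ZONE_DEVICE user such as device-dax might opt in to the fields introduced by this patch. The surrounding code and the chosen align value are assumptions; the actual device-dax hookup comes in a later patch of the series:

 	pgmap->flags |= PGMAP_COMPOUND;
 	pgmap->align = PMD_SIZE;	/* or PUD_SIZE for 1G mappings */
 	addr = devm_memremap_pages(dev, pgmap);
 	if (IS_ERR(addr))
 		return PTR_ERR(addr);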