Message ID: 20210714193542.21857-5-joao.m.martins@oracle.com (mailing list archive)
State: New
Series: mm, sparse-vmemmap: Introduce compound pagemaps
On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> Add a new align property for struct dev_pagemap which specifies that a

s/align/@geometry/

> pagemap is composed of a set of compound pages of size @align,

s/@align/@geometry/

> instead of
> base pages. When a compound page geometry is requested, all but the first
> page are initialised as tail pages instead of order-0 pages.
>
> For certain ZONE_DEVICE users like device-dax which have a fixed page size,
> this creates an opportunity to optimize GUP and GUP-fast walkers, treating
> it the same way as THP or hugetlb pages.
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  include/linux/memremap.h | 17 +++++++++++++++++
>  mm/memremap.c            |  8 ++++++--
>  mm/page_alloc.c          | 34 +++++++++++++++++++++++++++++++++-
>  3 files changed, 56 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 119f130ef8f1..e5ab6d4525c1 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -99,6 +99,10 @@ struct dev_pagemap_ops {
>   * @done: completion for @internal_ref
>   * @type: memory type: see MEMORY_* in memory_hotplug.h
>   * @flags: PGMAP_* flags to specify defailed behavior
> + * @geometry: structural definition of how the vmemmap metadata is populated.
> + *	A zero or PAGE_SIZE defaults to using base pages as the memmap metadata
> + *	representation. A bigger value but also multiple of PAGE_SIZE will set
> + *	up compound struct pages representative of the requested geometry size.
>   * @ops: method table
>   * @owner: an opaque pointer identifying the entity that manages this
>   *	instance. Used by various helpers to make sure that no
> @@ -114,6 +118,7 @@ struct dev_pagemap {
> 	struct completion done;
> 	enum memory_type type;
> 	unsigned int flags;
> +	unsigned long geometry;
> 	const struct dev_pagemap_ops *ops;
> 	void *owner;
> 	int nr_range;
> @@ -130,6 +135,18 @@ static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap)
> 	return NULL;
> }
>
> +static inline unsigned long pgmap_geometry(struct dev_pagemap *pgmap)
> +{
> +	if (!pgmap || !pgmap->geometry)
> +		return PAGE_SIZE;
> +	return pgmap->geometry;
> +}
> +
> +static inline unsigned long pgmap_pfn_geometry(struct dev_pagemap *pgmap)
> +{
> +	return PHYS_PFN(pgmap_geometry(pgmap));
> +}

Are both needed? Maybe just have ->geometry natively be in nr_pages
units directly, because pgmap_pfn_geometry() makes it confusing
whether it's a geometry of the pfn or the geometry of the pgmap.

> +
>  #ifdef CONFIG_ZONE_DEVICE
>  bool pfn_zone_device_reserved(unsigned long pfn);
>  void *memremap_pages(struct dev_pagemap *pgmap, int nid);
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 805d761740c4..ffcb924eb6a5 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -318,8 +318,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
> 	memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
> 				PHYS_PFN(range->start),
> 				PHYS_PFN(range_len(range)), pgmap);
> -	percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
> -			- pfn_first(pgmap, range_id));
> +	if (pgmap_geometry(pgmap) > PAGE_SIZE)

This would become

    if (pgmap_geometry(pgmap) > 1)

> +		percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
> +			- pfn_first(pgmap, range_id)) / pgmap_pfn_geometry(pgmap));

...and this would be pgmap_geometry()

> +	else
> +		percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
> +				- pfn_first(pgmap, range_id));
> 	return 0;
>
>  err_add_memory:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 79f3b38afeca..188cb5f8c308 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6597,6 +6597,31 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
> 	}
> }
>
> +static void __ref memmap_init_compound(struct page *page, unsigned long pfn,

I'd feel better if @page was renamed @head... more below:

> +					unsigned long zone_idx, int nid,
> +					struct dev_pagemap *pgmap,
> +					unsigned long nr_pages)
> +{
> +	unsigned int order_align = order_base_2(nr_pages);
> +	unsigned long i;
> +
> +	__SetPageHead(page);
> +
> +	for (i = 1; i < nr_pages; i++) {

The switch of loop styles is jarring. I.e. the switch from
memmap_init_zone_device() that is using pfn, end_pfn, and a local
'struct page *' variable to this helper using pfn + i and a mix of
helpers (__init_zone_device_page, prep_compound_tail) that have
different expectations of head page + tail_idx and current page.

I.e. this reads more obviously correct to me, but maybe I'm just in
the wrong headspace:

    for (pfn = head_pfn + 1; pfn < end_pfn; pfn++) {
        struct page *page = pfn_to_page(pfn);

        __init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
        prep_compound_tail(head, pfn - head_pfn);

> +		__init_zone_device_page(page + i, pfn + i, zone_idx,
> +					nid, pgmap);
> +		prep_compound_tail(page, i);
> +
> +		/*
> +		 * The first and second tail pages need to
> +		 * initialized first, hence the head page is
> +		 * prepared last.

I'd change this comment to say why rather than restate what can be
gleaned from the code. It's actually not clear to me why this order is
necessary.

> +		 */
> +		if (i == 2)
> +			prep_compound_head(page, order_align);
> +	}
> +}
> +
>  void __ref memmap_init_zone_device(struct zone *zone,
> 				   unsigned long start_pfn,
> 				   unsigned long nr_pages,
> @@ -6605,6 +6630,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
> 	unsigned long pfn, end_pfn = start_pfn + nr_pages;
> 	struct pglist_data *pgdat = zone->zone_pgdat;
> 	struct vmem_altmap *altmap = pgmap_altmap(pgmap);
> +	unsigned int pfns_per_compound = pgmap_pfn_geometry(pgmap);
> 	unsigned long zone_idx = zone_idx(zone);
> 	unsigned long start = jiffies;
> 	int nid = pgdat->node_id;
> @@ -6622,10 +6648,16 @@ void __ref memmap_init_zone_device(struct zone *zone,
> 		nr_pages = end_pfn - start_pfn;
> 	}
>
> -	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> +	for (pfn = start_pfn; pfn < end_pfn; pfn += pfns_per_compound) {
> 		struct page *page = pfn_to_page(pfn);
>
> 		__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
> +
> +		if (pfns_per_compound == 1)
> +			continue;
> +
> +		memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
> +				     pfns_per_compound);

I otherwise don't see anything broken with this patch, so feel free to include:

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

...on the resend with the fixups.
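For illustration, Dan's nr_pages suggestion could collapse the two helpers
into one -- a minimal sketch, assuming ->geometry is stored natively in
nr-of-base-pages units (this is not the code as posted):

	/* sketch: ->geometry in nr-of-base-pages units, 0 means base pages */
	static inline unsigned long pgmap_geometry(struct dev_pagemap *pgmap)
	{
		if (pgmap && pgmap->geometry)
			return pgmap->geometry;
		return 1;
	}

With that, pgmap_pfn_geometry() disappears, and the few callers that need a
byte size multiply by PAGE_SIZE instead.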
> +static inline unsigned long pgmap_geometry(struct dev_pagemap *pgmap)
> +{
> +	if (!pgmap || !pgmap->geometry)
> +		return PAGE_SIZE;
> +	return pgmap->geometry;

Nit, but avoiding all the negations would make this a little easier to
read:

	if (pgmap && pgmap->geometry)
		return pgmap->geometry;
	return PAGE_SIZE;

> +	if (pgmap_geometry(pgmap) > PAGE_SIZE)
> +		percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
> +			- pfn_first(pgmap, range_id)) / pgmap_pfn_geometry(pgmap));
> +	else
> +		percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
> +				- pfn_first(pgmap, range_id));

This is a horrible unreadable mess, which is trivially fixed by a
strategically used local variable:

	refs = pfn_end(pgmap, range_id) - pfn_first(pgmap, range_id);
	if (pgmap_geometry(pgmap) > PAGE_SIZE)
		refs /= pgmap_pfn_geometry(pgmap);
	percpu_ref_get_many(pgmap->ref, refs);
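Taken together with Dan's units suggestion, the branch disappears entirely --
a sketch, assuming a nr-of-pages ->geometry that defaults to 1:

	unsigned long refs;

	refs = (pfn_end(pgmap, range_id) - pfn_first(pgmap, range_id))
			/ pgmap_geometry(pgmap);
	percpu_ref_get_many(pgmap->ref, refs);

Dividing by a geometry of 1 in the base-pages case is a no-op, so no
special-casing is needed.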
On 7/15/21 2:08 AM, Dan Williams wrote:
> On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> Add a new align property for struct dev_pagemap which specifies that a
>
> s/align/@geometry/
>
Yeap, updated.

>> pagemap is composed of a set of compound pages of size @align,
>
> s/@align/@geometry/
>
Yeap, updated.

[..]
>> +static inline unsigned long pgmap_geometry(struct dev_pagemap *pgmap)
>> +{
>> +	if (!pgmap || !pgmap->geometry)
>> +		return PAGE_SIZE;
>> +	return pgmap->geometry;
>> +}
>> +
>> +static inline unsigned long pgmap_pfn_geometry(struct dev_pagemap *pgmap)
>> +{
>> +	return PHYS_PFN(pgmap_geometry(pgmap));
>> +}
>
> Are both needed? Maybe just have ->geometry natively be in nr_pages
> units directly, because pgmap_pfn_geometry() makes it confusing
> whether it's a geometry of the pfn or the geometry of the pgmap.
>
I use pgmap_geometry() largely when we manipulate the memmap in
sparse-vmemmap code, as we deal with addresses/offsets/subsection-size,
while using pgmap_pfn_geometry() for code that deals with PFN
initialization. For this patch I could remove the confusion.

And actually maybe I can just store the pgmap_geometry() value in bytes
locally in vmemmap_populate_compound_pages() and we can remove this extra
helper.

[..]
>> -	percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
>> -			- pfn_first(pgmap, range_id));
>> +	if (pgmap_geometry(pgmap) > PAGE_SIZE)
>
> This would become
>
>     if (pgmap_geometry(pgmap) > 1)
>
>> +		percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
>> +			- pfn_first(pgmap, range_id)) / pgmap_pfn_geometry(pgmap));
>
> ...and this would be pgmap_geometry()
>
>> +	else
>> +		percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
>> +				- pfn_first(pgmap, range_id));
>> 	return 0;
>>
Let me adjust accordingly.

[..]
>> +static void __ref memmap_init_compound(struct page *page, unsigned long pfn,
>
> I'd feel better if @page was renamed @head... more below:
>
Oh yeah -- definitely more readable.

>> +					unsigned long zone_idx, int nid,
>> +					struct dev_pagemap *pgmap,
>> +					unsigned long nr_pages)
>> +{
>> +	unsigned int order_align = order_base_2(nr_pages);
>> +	unsigned long i;
>> +
>> +	__SetPageHead(page);
>> +
>> +	for (i = 1; i < nr_pages; i++) {
>
> The switch of loop styles is jarring. I.e. the switch from
> memmap_init_zone_device() that is using pfn, end_pfn, and a local
> 'struct page *' variable to this helper using pfn + i and a mix of
> helpers (__init_zone_device_page, prep_compound_tail) that have
> different expectations of head page + tail_idx and current page.
>
> I.e. this reads more obviously correct to me, but maybe I'm just in
> the wrong headspace:
>
>     for (pfn = head_pfn + 1; pfn < end_pfn; pfn++) {
>         struct page *page = pfn_to_page(pfn);
>
>         __init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
>         prep_compound_tail(head, pfn - head_pfn);
>
Personally -- and I am dubious given I have been staring at this code -- I
find what I wrote a little better, as it follows more closely what compound
page initialization does. It's easier for me to read that I am initializing
a number of tail pages and a head page (for a known geometry size).

Additionally, it's unnecessary (and a tiny bit inefficient?) to keep doing
pfn_to_page(pfn), provided ZONE_DEVICE requires SPARSEMEM_VMEMMAP and so
the page pointers are all contiguous; for any given PFN we can avoid
dereferencing vmemmap vaddrs back and forth. Which is the second reason I
pass a page and iterate over its tails based on a head page pointer. But I
was of two minds when writing this, so if there's no added inefficiency I
can rewrite it like the above.

>> +		__init_zone_device_page(page + i, pfn + i, zone_idx,
>> +					nid, pgmap);
>> +		prep_compound_tail(page, i);
>> +
>> +		/*
>> +		 * The first and second tail pages need to
>> +		 * initialized first, hence the head page is
>> +		 * prepared last.
>
> I'd change this comment to say why rather than restate what can be
> gleaned from the code. It's actually not clear to me why this order is
> necessary.
>
So the first tail page stores mapcount_ptr and the compound order, and the
second tail page stores pincount_ptr. prep_compound_head() does this:

	set_compound_order(page, order);
	atomic_set(compound_mapcount_ptr(page), -1);
	if (hpage_pincount_available(page))
		atomic_set(compound_pincount_ptr(page), 0);

So we need those tail pages initialized first, prior to initializing the
head.

I can expand the comment above to make it clear why we need the first and
second tail pages.

[..]
>> -	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
>> +	for (pfn = start_pfn; pfn < end_pfn; pfn += pfns_per_compound) {
>> 		struct page *page = pfn_to_page(pfn);
>>
>> 		__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
>> +
>> +		if (pfns_per_compound == 1)
>> +			continue;
>> +
>> +		memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
>> +				     pfns_per_compound);
>
> I otherwise don't see anything broken with this patch, so feel free to include:
>
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
>
> ...on the resend with the fixups.
>
Thanks.

I will wait on whether you still want to retain the tag, given the implied
changes fixing the failure you reported.
On Wed, Jul 14, 2021 at 06:08:14PM -0700, Dan Williams wrote:
> > +static inline unsigned long pgmap_geometry(struct dev_pagemap *pgmap)
> > +{
> > +	if (!pgmap || !pgmap->geometry)
> > +		return PAGE_SIZE;
> > +	return pgmap->geometry;
> > +}
> > +
> > +static inline unsigned long pgmap_pfn_geometry(struct dev_pagemap *pgmap)
> > +{
> > +	return PHYS_PFN(pgmap_geometry(pgmap));
> > +}
>
> Are both needed? Maybe just have ->geometry natively be in nr_pages
> units directly, because pgmap_pfn_geometry() makes it confusing
> whether it's a geometry of the pfn or the geometry of the pgmap.

Actually - do we need non-power of two sizes here? Otherwise a shift
for the pfns would be really nice, as that simplifies a lot of the
calculations.
On 7/15/21 1:52 PM, Joao Martins wrote:
> On 7/15/21 2:08 AM, Dan Williams wrote:
>> On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
[..]
>>> +static inline unsigned long pgmap_geometry(struct dev_pagemap *pgmap)
>>> +{
>>> +	if (!pgmap || !pgmap->geometry)
>>> +		return PAGE_SIZE;
>>> +	return pgmap->geometry;
>>> +}
>>> +
>>> +static inline unsigned long pgmap_pfn_geometry(struct dev_pagemap *pgmap)
>>> +{
>>> +	return PHYS_PFN(pgmap_geometry(pgmap));
>>> +}
>>
>> Are both needed? Maybe just have ->geometry natively be in nr_pages
>> units directly, because pgmap_pfn_geometry() makes it confusing
>> whether it's a geometry of the pfn or the geometry of the pgmap.
>>
> I use pgmap_geometry() largely when we manipulate the memmap in
> sparse-vmemmap code, as we deal with addresses/offsets/subsection-size,
> while using pgmap_pfn_geometry() for code that deals with PFN
> initialization. For this patch I could remove the confusion.
>
> And actually maybe I can just store the pgmap_geometry() value in bytes
> locally in vmemmap_populate_compound_pages() and we can remove this extra
> helper.
>
But one nice property of pgmap_geometry() is the @pgmap check, and not
requiring the driver to initialize a pgmap->geometry property. A zeroed
value still supports pgmap users in the old case, where there's no
@geometry (or the user doesn't care).

So departing from this helper might mean that either memremap_pages() sets
the right @geometry when a zeroed value is passed in, or
__populate_section_memmap() makes sure a pgmap is associated when trying to
figure out if there's a geometry to consider in the section mapping.
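One way to keep that zero-value convenience while dropping the helper would
be to normalize the field once at registration time -- a hypothetical
placement, not something proposed in the thread verbatim:

	/* e.g. early in memremap_pages(): treat unset geometry as base pages */
	if (!pgmap->geometry)
		pgmap->geometry = PAGE_SIZE;

After that, readers such as __populate_section_memmap() could use
pgmap->geometry directly, at the cost of still needing a NULL-pgmap check
wherever no pagemap is associated.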
On 7/15/21 7:48 AM, Christoph Hellwig wrote:
>> +static inline unsigned long pgmap_geometry(struct dev_pagemap *pgmap)
>> +{
>> +	if (!pgmap || !pgmap->geometry)
>> +		return PAGE_SIZE;
>> +	return pgmap->geometry;
>
> Nit, but avoiding all the negations would make this a little easier to
> read:
>
> 	if (pgmap && pgmap->geometry)
> 		return pgmap->geometry;
> 	return PAGE_SIZE;
>
Nicer indeed. But this might be removed, should we follow Dan's suggestion
of geometry representing a nr of pages.

>> +	if (pgmap_geometry(pgmap) > PAGE_SIZE)
>> +		percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
>> +			- pfn_first(pgmap, range_id)) / pgmap_pfn_geometry(pgmap));
>> +	else
>> +		percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
>> +				- pfn_first(pgmap, range_id));
>
> This is a horrible unreadable mess, which is trivially fixed by a
> strategically used local variable:
>
> 	refs = pfn_end(pgmap, range_id) - pfn_first(pgmap, range_id);
> 	if (pgmap_geometry(pgmap) > PAGE_SIZE)
> 		refs /= pgmap_pfn_geometry(pgmap);
> 	percpu_ref_get_many(pgmap->ref, refs);
>
Yeap, much more readable -- thanks for the suggestion.
On 7/15/21 1:59 PM, Christoph Hellwig wrote:
> On Wed, Jul 14, 2021 at 06:08:14PM -0700, Dan Williams wrote:
>>> +static inline unsigned long pgmap_geometry(struct dev_pagemap *pgmap)
>>> +{
>>> +	if (!pgmap || !pgmap->geometry)
>>> +		return PAGE_SIZE;
>>> +	return pgmap->geometry;
>>> +}
>>> +
>>> +static inline unsigned long pgmap_pfn_geometry(struct dev_pagemap *pgmap)
>>> +{
>>> +	return PHYS_PFN(pgmap_geometry(pgmap));
>>> +}
>>
>> Are both needed? Maybe just have ->geometry natively be in nr_pages
>> units directly, because pgmap_pfn_geometry() makes it confusing
>> whether it's a geometry of the pfn or the geometry of the pgmap.
>
> Actually - do we need non-power of two sizes here? Otherwise a shift
> for the pfns would be really nice, as that simplifies a lot of the
> calculations.
>
AIUI, it's only powers of two: PAGE_SIZE (1 in nr-of-pages units), PMD_SIZE
(512) and PUD_SIZE (262144).
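Since only powers of two are expected, Christoph's shift idea might look
like this -- a sketch with a hypothetical field and helper name, not an API
from the posted series:

	/* pgmap->geometry_order: log2 of pfns per compound page, 0 == base pages */
	static inline unsigned long pgmap_pfn_geometry(struct dev_pagemap *pgmap)
	{
		return 1UL << (pgmap ? pgmap->geometry_order : 0);
	}

The ref accounting in pagemap_range() would then reduce to a shift:

	refs = (pfn_end(pgmap, range_id) - pfn_first(pgmap, range_id))
			>> pgmap->geometry_order;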
On Thu, Jul 15, 2021 at 5:52 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> On 7/15/21 2:08 AM, Dan Williams wrote:
> > On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
[..]
> >> +static void __ref memmap_init_compound(struct page *page, unsigned long pfn,
> >
> > I'd feel better if @page was renamed @head... more below:
> >
> Oh yeah -- definitely more readable.
>
[..]
> >> +	for (i = 1; i < nr_pages; i++) {
> >
> > The switch of loop styles is jarring. [..]
> >
> > I.e. this reads more obviously correct to me, but maybe I'm just in
> > the wrong headspace:
> >
> >     for (pfn = head_pfn + 1; pfn < end_pfn; pfn++) {
> >         struct page *page = pfn_to_page(pfn);
> >
> >         __init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
> >         prep_compound_tail(head, pfn - head_pfn);
> >
> Personally -- and I am dubious given I have been staring at this code -- I
> find what I wrote a little better, as it follows more closely what compound
> page initialization does. It's easier for me to read that I am initializing
> a number of tail pages and a head page (for a known geometry size).
>
> Additionally, it's unnecessary (and a tiny bit inefficient?) to keep doing
> pfn_to_page(pfn), provided ZONE_DEVICE requires SPARSEMEM_VMEMMAP and so
> the page pointers are all contiguous; for any given PFN we can avoid
> dereferencing vmemmap vaddrs back and forth. Which is the second reason I
> pass a page and iterate over its tails based on a head page pointer. But I
> was of two minds when writing this, so if there's no added inefficiency I
> can rewrite it like the above.

I mainly just don't want 2 different styles between
memmap_init_zone_device() and this helper. So if the argument is that
"it's inefficient to use pfn_to_page() here" then why does the caller
use pfn_to_page()? I won't argue too much for one way or the other, I'm
still biased towards my rewrite, but whatever you pick just make the
style consistent.

[..]
> So we need those tail pages initialized first, prior to initializing the
> head.
>
> I can expand the comment above to make it clear why we need the first and
> second tail pages.

Thanks!

[..]
> > I otherwise don't see anything broken with this patch, so feel free to include:
> >
> > Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> >
> > ...on the resend with the fixups.
> >
> Thanks.
>
> I will wait on whether you still want to retain the tag, given the implied
> changes fixing the failure you reported.

Yeah, tag is still valid.
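Filling out Dan's fragment into a complete helper, the consistent-style
version reads like this -- a sketch of the direction agreed in this
sub-thread, not the code as merged:

	static void __ref memmap_init_compound(struct page *head,
			unsigned long head_pfn, unsigned long zone_idx,
			int nid, struct dev_pagemap *pgmap,
			unsigned long nr_pages)
	{
		unsigned long pfn, end_pfn = head_pfn + nr_pages;
		unsigned int order = order_base_2(nr_pages);

		__SetPageHead(head);
		for (pfn = head_pfn + 1; pfn < end_pfn; pfn++) {
			struct page *page = pfn_to_page(pfn);

			__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
			prep_compound_tail(head, pfn - head_pfn);

			/* head is prepared once tails 1 and 2 are initialized */
			if (pfn == head_pfn + 2)
				prep_compound_head(head, order);
		}
	}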
On 7/15/2021 5:52 AM, Joao Martins wrote:
>>> +		__init_zone_device_page(page + i, pfn + i, zone_idx,
>>> +					nid, pgmap);
>>> +		prep_compound_tail(page, i);
>>> +
>>> +		/*
>>> +		 * The first and second tail pages need to
>>> +		 * initialized first, hence the head page is
>>> +		 * prepared last.
>>
>> I'd change this comment to say why rather than restate what can be
>> gleaned from the code. It's actually not clear to me why this order is
>> necessary.
>>
> So the first tail page stores mapcount_ptr and the compound order, and the
> second tail page stores pincount_ptr. prep_compound_head() does this:
>
> 	set_compound_order(page, order);
> 	atomic_set(compound_mapcount_ptr(page), -1);
> 	if (hpage_pincount_available(page))
> 		atomic_set(compound_pincount_ptr(page), 0);
>
> So we need those tail pages initialized first, prior to initializing the
> head.
>
> I can expand the comment above to make it clear why we need the first and
> second tail pages.
>
Perhaps just say:

  The reason prep_compound_head() is called after the 1st and 2nd tail
  pages have been initialized is that it overwrites some of the tail page
  fields set up by __init_zone_device_page(), rather than the other way
  around.

?

thanks,
-jane
On 7/22/21 1:38 AM, Jane Chu wrote:
> On 7/15/2021 5:52 AM, Joao Martins wrote:
[..]
>> So we need those tail pages initialized first, prior to initializing the
>> head.
>>
>> I can expand the comment above to make it clear why we need the first and
>> second tail pages.
>>
> Perhaps just say:
>
>   The reason prep_compound_head() is called after the 1st and 2nd tail
>   pages have been initialized is that it overwrites some of the tail page
>   fields set up by __init_zone_device_page(), rather than the other way
>   around.
>
> ?

Yeap, something along those lines is what I was thinking. Perhaps
explicitly mentioning the struct page fields that the 1st and 2nd tail
pages store, to avoid the reader thinking the ordering is arbitrarily
picked.

	Joao
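Along those lines, the expanded comment might read (a draft only, combining
Jane's wording with the fields named earlier in the thread):

	/*
	 * The first tail page stores compound_mapcount_ptr() and
	 * compound_order() and the second tail page stores
	 * compound_pincount_ptr(). Call prep_compound_head() after
	 * the first and second tail pages have been initialized to
	 * not have the data overwritten by __init_zone_device_page().
	 */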
On 7/15/21 8:48 PM, Dan Williams wrote:
> On Thu, Jul 15, 2021 at 5:52 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>> On 7/15/21 2:08 AM, Dan Williams wrote:
>>> On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
[..]
>>> The switch of loop styles is jarring. [..] I.e. this reads more
>>> obviously correct to me, but maybe I'm just in the wrong headspace:
>>>
>>>     for (pfn = head_pfn + 1; pfn < end_pfn; pfn++) {
>>>         struct page *page = pfn_to_page(pfn);
>>>
>>>         __init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
>>>         prep_compound_tail(head, pfn - head_pfn);
>>>
[..]
> I mainly just don't want 2 different styles between
> memmap_init_zone_device() and this helper. So if the argument is that
> "it's inefficient to use pfn_to_page() here" then why does the caller
> use pfn_to_page()? I won't argue too much for one way or the other, I'm
> still biased towards my rewrite, but whatever you pick just make the
> style consistent.
>
Meanwhile, it turns out my concerns didn't materialize: I am not seeing a
visible difference compared to the old numbers, so I switched to the style
you suggested above.
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 119f130ef8f1..e5ab6d4525c1 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -99,6 +99,10 @@ struct dev_pagemap_ops {
  * @done: completion for @internal_ref
  * @type: memory type: see MEMORY_* in memory_hotplug.h
  * @flags: PGMAP_* flags to specify defailed behavior
+ * @geometry: structural definition of how the vmemmap metadata is populated.
+ *	A zero or PAGE_SIZE defaults to using base pages as the memmap metadata
+ *	representation. A bigger value but also multiple of PAGE_SIZE will set
+ *	up compound struct pages representative of the requested geometry size.
  * @ops: method table
  * @owner: an opaque pointer identifying the entity that manages this
  *	instance. Used by various helpers to make sure that no
@@ -114,6 +118,7 @@ struct dev_pagemap {
 	struct completion done;
 	enum memory_type type;
 	unsigned int flags;
+	unsigned long geometry;
 	const struct dev_pagemap_ops *ops;
 	void *owner;
 	int nr_range;
@@ -130,6 +135,18 @@ static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap)
 	return NULL;
 }
 
+static inline unsigned long pgmap_geometry(struct dev_pagemap *pgmap)
+{
+	if (!pgmap || !pgmap->geometry)
+		return PAGE_SIZE;
+	return pgmap->geometry;
+}
+
+static inline unsigned long pgmap_pfn_geometry(struct dev_pagemap *pgmap)
+{
+	return PHYS_PFN(pgmap_geometry(pgmap));
+}
+
 #ifdef CONFIG_ZONE_DEVICE
 bool pfn_zone_device_reserved(unsigned long pfn);
 void *memremap_pages(struct dev_pagemap *pgmap, int nid);
diff --git a/mm/memremap.c b/mm/memremap.c
index 805d761740c4..ffcb924eb6a5 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -318,8 +318,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
 	memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
 				PHYS_PFN(range->start),
 				PHYS_PFN(range_len(range)), pgmap);
-	percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
-			- pfn_first(pgmap, range_id));
+	if (pgmap_geometry(pgmap) > PAGE_SIZE)
+		percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
+			- pfn_first(pgmap, range_id)) / pgmap_pfn_geometry(pgmap));
+	else
+		percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
+				- pfn_first(pgmap, range_id));
 	return 0;
 
 err_add_memory:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 79f3b38afeca..188cb5f8c308 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6597,6 +6597,31 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
 	}
 }
 
+static void __ref memmap_init_compound(struct page *page, unsigned long pfn,
+					unsigned long zone_idx, int nid,
+					struct dev_pagemap *pgmap,
+					unsigned long nr_pages)
+{
+	unsigned int order_align = order_base_2(nr_pages);
+	unsigned long i;
+
+	__SetPageHead(page);
+
+	for (i = 1; i < nr_pages; i++) {
+		__init_zone_device_page(page + i, pfn + i, zone_idx,
+					nid, pgmap);
+		prep_compound_tail(page, i);
+
+		/*
+		 * The first and second tail pages need to
+		 * initialized first, hence the head page is
+		 * prepared last.
+		 */
+		if (i == 2)
+			prep_compound_head(page, order_align);
+	}
+}
+
 void __ref memmap_init_zone_device(struct zone *zone,
 				   unsigned long start_pfn,
 				   unsigned long nr_pages,
@@ -6605,6 +6630,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 	unsigned long pfn, end_pfn = start_pfn + nr_pages;
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	struct vmem_altmap *altmap = pgmap_altmap(pgmap);
+	unsigned int pfns_per_compound = pgmap_pfn_geometry(pgmap);
 	unsigned long zone_idx = zone_idx(zone);
 	unsigned long start = jiffies;
 	int nid = pgdat->node_id;
@@ -6622,10 +6648,16 @@ void __ref memmap_init_zone_device(struct zone *zone,
 		nr_pages = end_pfn - start_pfn;
 	}
 
-	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+	for (pfn = start_pfn; pfn < end_pfn; pfn += pfns_per_compound) {
 		struct page *page = pfn_to_page(pfn);
 
 		__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
+
+		if (pfns_per_compound == 1)
+			continue;
+
+		memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
+				     pfns_per_compound);
 	}
 
 	pr_info("%s initialised %lu pages in %ums\n", __func__,
Add a new align property for struct dev_pagemap which specifies that a
pagemap is composed of a set of compound pages of size @align, instead of
base pages. When a compound page geometry is requested, all but the first
page are initialised as tail pages instead of order-0 pages.

For certain ZONE_DEVICE users like device-dax which have a fixed page size,
this creates an opportunity to optimize GUP and GUP-fast walkers, treating
it the same way as THP or hugetlb pages.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 include/linux/memremap.h | 17 +++++++++++++++++
 mm/memremap.c            |  8 ++++++--
 mm/page_alloc.c          | 34 +++++++++++++++++++++++++++++++++-
 3 files changed, 56 insertions(+), 3 deletions(-)
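For context, a hedged example of how a ZONE_DEVICE user such as device-dax
would opt in to compound pages with this revision's byte-based @geometry
(call site simplified and hypothetical):

	static void *example_dax_setup(struct device *dev,
				       struct dev_pagemap *pgmap)
	{
		/* ask for 2M compound pages in the device memmap */
		pgmap->geometry = PMD_SIZE;
		return devm_memremap_pages(dev, pgmap);
	}

Leaving ->geometry at zero keeps today's base-page behaviour, per the
pgmap_geometry() default above.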