diff mbox series

[v4,04/14] mm/memremap: add ZONE_DEVICE support for compound pages

Message ID 20210827145819.16471-5-joao.m.martins@oracle.com (mailing list archive)
State New
Series mm, sparse-vmemmap: Introduce compound devmaps for device-dax

Commit Message

Joao Martins Aug. 27, 2021, 2:58 p.m. UTC
Add a new @geometry property for struct dev_pagemap which specifies that a
devmap is composed of a set of compound pages of size @geometry, instead of
base pages. When a compound page geometry is requested, all but the first
page are initialised as tail pages instead of order-0 pages.

For certain ZONE_DEVICE users like device-dax which have a fixed page size,
this creates an opportunity to optimize GUP and GUP-fast walkers, treating
it the same way as THP or hugetlb pages.

Additionally, commit 7118fc2906e2 ("hugetlb: address ref count racing in
prep_compound_gigantic_page") removed set_page_count() because setting the
page ref count to zero was redundant. Devmap pages don't come from the
page allocator, though, and only the head page refcount is used for
compound pages, hence initialize the tail page ref count to zero.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/memremap.h | 17 +++++++++++++++++
 mm/memremap.c            | 12 ++++++------
 mm/page_alloc.c          | 37 ++++++++++++++++++++++++++++++++++++-
 3 files changed, 59 insertions(+), 7 deletions(-)

Comments

Christoph Hellwig Aug. 27, 2021, 3:33 p.m. UTC | #1
On Fri, Aug 27, 2021 at 03:58:09PM +0100, Joao Martins wrote:
> + * @geometry: structural definition of how the vmemmap metadata is populated.
> + *	A zero or 1 defaults to using base pages as the memmap metadata
> + *	representation. A bigger value will set up compound struct pages
> + *	representative of the requested geometry size.
>   * @ops: method table
>   * @owner: an opaque pointer identifying the entity that manages this
>   *	instance.  Used by various helpers to make sure that no
> @@ -114,6 +118,7 @@ struct dev_pagemap {
>  	struct completion done;
>  	enum memory_type type;
>  	unsigned int flags;
> +	unsigned long geometry;

So why not make this a shift as I suggested somewhere deep in the
last thread?  Also geometry sounds a bit strange, even if I can't really
offer anything better offhand.

> +static inline unsigned long pgmap_geometry(struct dev_pagemap *pgmap)
> +{
> +	if (pgmap && pgmap->geometry)
> +		return pgmap->geometry;

Why does this need to support a NULL pgmap?

> +static void __ref memmap_init_compound(struct page *head, unsigned long head_pfn,

Overly long line.
Joao Martins Aug. 27, 2021, 4 p.m. UTC | #2
On 8/27/21 4:33 PM, Christoph Hellwig wrote:
> On Fri, Aug 27, 2021 at 03:58:09PM +0100, Joao Martins wrote:
>> + * @geometry: structural definition of how the vmemmap metadata is populated.
>> + *	A zero or 1 defaults to using base pages as the memmap metadata
>> + *	representation. A bigger value will set up compound struct pages
>> + *	representative of the requested geometry size.
>>   * @ops: method table
>>   * @owner: an opaque pointer identifying the entity that manages this
>>   *	instance.  Used by various helpers to make sure that no
>> @@ -114,6 +118,7 @@ struct dev_pagemap {
>>  	struct completion done;
>>  	enum memory_type type;
>>  	unsigned int flags;
>> +	unsigned long geometry;
> 
> So why not make this a shift as I suggested somewhere deep in the
> last thread? 

So your previous suggestion had the check for pgmap_geometry() > PAGE_SIZE,
but because pgmap_geometry() returns 1 by default, the pfn_end() - pfn
computation wouldn't change for those that don't request a geometry.

So rather than what you initially suggested:

	refs = pfn_end(pgmap, range_id) - pfn_first(pgmap, range_id);
	if (pgmap_geometry(pgmap) > 1)
		refs /= pgmap_geometry(pgmap);

I would turn it into this, because for users that don't select a geometry
it's now always a division by 1:

	refs = pfn_end(pgmap, range_id) - pfn_first(pgmap, range_id);
	refs /= pgmap_geometry(pgmap);

So I felt like doing it straight away inline when calling percpu_ref_get_many():
	
	(pfn_end(pgmap, range_id) - pfn_first(pgmap, range_id)) / pgmap_geometry(pgmap);

I can switch to a shift if you prefer:

	(pfn_end(pgmap, range_id) - pfn_first(pgmap, range_id))
		>> pgmap_geometry_order(pgmap);

> Also geometry sounds a bit strange, even if I can't really
> offer anything better offhand.
> 
We started with @align (like in the device-dax core), and then we switched
to @geometry because these are slightly different things (one relates to
the vmemmap metadata structure, i.e. the number of pages, and the other is
how the mmap is aligned to a page size). I couldn't suggest anything else,
besides a more verbose name like vmemmap_align maybe.

>> +static inline unsigned long pgmap_geometry(struct dev_pagemap *pgmap)
>> +{
>> +	if (pgmap && pgmap->geometry)
>> +		return pgmap->geometry;
> 
> Why does this need to support a NULL pgmap?
> 
This was mainly a defensive choice, similar to put_dev_pagemap(). There's
only one caller that passes a NULL pgmap: populate_section_memmap(), which
has an altmap but no pgmap.

>> +static void __ref memmap_init_compound(struct page *head, unsigned long head_pfn,
> 
> Overly long line.
> 
Fixed.

Interesting that checkpatch doesn't complain when a line is one character past 80 columns.
Christoph Hellwig Sept. 1, 2021, 9:44 a.m. UTC | #3
On Fri, Aug 27, 2021 at 05:00:11PM +0100, Joao Martins wrote:
> So I felt like doing it straight away inline when calling percpu_ref_get_many():
> 	
> 	(pfn_end(pgmap, range_id) - pfn_first(pgmap, range_id)) / pgmap_geometry(pgmap);
> 
> I can switch to a shift if you prefer:
> 
> 	(pfn_end(pgmap, range_id) - pfn_first(pgmap, range_id))
> 		>> pgmap_geometry_order(pgmap);

Yes.  A shift is less overhead than a branch.

> > Also geometry sounds a bit strange, even if I can't really
> > offer anything better offhand.
> > 
> We started with @align (like in the device-dax core), and then we switched
> to @geometry because these are slightly different things (one relates to
> the vmemmap metadata structure, i.e. the number of pages, and the other is
> how the mmap is aligned to a page size). I couldn't suggest anything else,
> besides a more verbose name like vmemmap_align maybe.

It for sure isn't an alignment.  I think the term that comes closest
is a granularity.  But something like vmemmap_shift if switching to
a shift might be descriptive enough for the struct member name.
Joao Martins Sept. 9, 2021, 9:38 a.m. UTC | #4
On 9/1/21 10:44 AM, Christoph Hellwig wrote:
> On Fri, Aug 27, 2021 at 05:00:11PM +0100, Joao Martins wrote:
>> So I felt like doing it straight away inline when calling percpu_ref_get_many():
>> 	
>> 	(pfn_end(pgmap, range_id) - pfn_first(pgmap, range_id)) / pgmap_geometry(pgmap);
>>
>> I can switch to a shift if you prefer:
>>
>> 	(pfn_end(pgmap, range_id) - pfn_first(pgmap, range_id))
>> 		>> pgmap_geometry_order(pgmap);
> 
> Yes.  A shift is less overhead than a branch.
> 
OK.

But FWIW, the overhead of this shift or branch doesn't matter much
compared to the rest of memremap_pages().

>>> Also geometry sounds a bit strange, even if I can't really
>>> offer anything better offhand.
>>>
>> We started with @align (like in the device-dax core), and then we switched
>> to @geometry because these are slightly different things (one relates to
>> the vmemmap metadata structure, i.e. the number of pages, and the other is
>> how the mmap is aligned to a page size). I couldn't suggest anything else,
>> besides a more verbose name like vmemmap_align maybe.
> 
> It for sure isn't an alignment.  I think the term that comes closest
> is a granularity.  But something like vmemmap_shift if switching to
> a shift might be descriptive enough for the struct member name.
> 
Hmmm, it would be a 'shift related to vmemmap' in the literal sense, but
while descriptive of the field it doesn't tell much otherwise. 'geometry'
is more widely used as a block device term, which is why Dan suggested it.
Alternatively: vmemmap_compound_nr / vmemmap_npages (if it represents a
number of pages per compound page), or vmemmap_order (which is clearer
about what it is than vmemmap_shift?). Changing to an 'order'/'shift'
representation also gets the 'being a power of 2' enforcement for free.
And compound pages IIUC always represent a power-of-2 number of pages.

	Joao

Patch

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 119f130ef8f1..4b78d30c3987 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -99,6 +99,10 @@  struct dev_pagemap_ops {
  * @done: completion for @internal_ref
  * @type: memory type: see MEMORY_* in memory_hotplug.h
  * @flags: PGMAP_* flags to specify defailed behavior
+ * @geometry: structural definition of how the vmemmap metadata is populated.
+ *	A zero or 1 defaults to using base pages as the memmap metadata
+ *	representation. A bigger value will set up compound struct pages
+ *	representative of the requested geometry size.
  * @ops: method table
  * @owner: an opaque pointer identifying the entity that manages this
  *	instance.  Used by various helpers to make sure that no
@@ -114,6 +118,7 @@  struct dev_pagemap {
 	struct completion done;
 	enum memory_type type;
 	unsigned int flags;
+	unsigned long geometry;
 	const struct dev_pagemap_ops *ops;
 	void *owner;
 	int nr_range;
@@ -130,6 +135,18 @@  static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap)
 	return NULL;
 }
 
+static inline unsigned long pgmap_geometry(struct dev_pagemap *pgmap)
+{
+	if (pgmap && pgmap->geometry)
+		return pgmap->geometry;
+	return 1;
+}
+
+static inline unsigned long pgmap_geometry_order(struct dev_pagemap *pgmap)
+{
+	return order_base_2(pgmap_geometry(pgmap));
+}
+
 #ifdef CONFIG_ZONE_DEVICE
 bool pfn_zone_device_reserved(unsigned long pfn);
 void *memremap_pages(struct dev_pagemap *pgmap, int nid);
diff --git a/mm/memremap.c b/mm/memremap.c
index 84de22c14567..99646082436f 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -102,11 +102,11 @@  static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id)
 	return (range->start + range_len(range)) >> PAGE_SHIFT;
 }
 
-static unsigned long pfn_next(unsigned long pfn)
+static unsigned long pfn_next(struct dev_pagemap *pgmap, unsigned long pfn)
 {
-	if (pfn % 1024 == 0)
+	if (pfn % (1024 << pgmap_geometry_order(pgmap)))
 		cond_resched();
-	return pfn + 1;
+	return pfn + pgmap_geometry(pgmap);
 }
 
 /*
@@ -130,7 +130,7 @@  bool pfn_zone_device_reserved(unsigned long pfn)
 }
 
 #define for_each_device_pfn(pfn, map, i) \
-	for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(pfn))
+	for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(map, pfn))
 
 static void dev_pagemap_kill(struct dev_pagemap *pgmap)
 {
@@ -315,8 +315,8 @@  static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
 	memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
 				PHYS_PFN(range->start),
 				PHYS_PFN(range_len(range)), pgmap);
-	percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
-			- pfn_first(pgmap, range_id));
+	percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
+			- pfn_first(pgmap, range_id)) / pgmap_geometry(pgmap));
 	return 0;
 
 err_add_memory:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 57ef05800c06..c1497a928005 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6610,6 +6610,34 @@  static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
 	}
 }
 
+static void __ref memmap_init_compound(struct page *head, unsigned long head_pfn,
+				       unsigned long zone_idx, int nid,
+				       struct dev_pagemap *pgmap,
+				       unsigned long nr_pages)
+{
+	unsigned long pfn, end_pfn = head_pfn + nr_pages;
+	unsigned int order = pgmap_geometry_order(pgmap);
+
+	__SetPageHead(head);
+	for (pfn = head_pfn + 1; pfn < end_pfn; pfn++) {
+		struct page *page = pfn_to_page(pfn);
+
+		__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
+		prep_compound_tail(head, pfn - head_pfn);
+		set_page_count(page, 0);
+
+		/*
+		 * The first tail page stores compound_mapcount_ptr() and
+		 * compound_order() and the second tail page stores
+		 * compound_pincount_ptr(). Call prep_compound_head() after
+		 * the first and second tail pages have been initialized to
+		 * not have the data overwritten.
+		 */
+		if (pfn == head_pfn + 2)
+			prep_compound_head(head, order);
+	}
+}
+
 void __ref memmap_init_zone_device(struct zone *zone,
 				   unsigned long start_pfn,
 				   unsigned long nr_pages,
@@ -6618,6 +6646,7 @@  void __ref memmap_init_zone_device(struct zone *zone,
 	unsigned long pfn, end_pfn = start_pfn + nr_pages;
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	struct vmem_altmap *altmap = pgmap_altmap(pgmap);
+	unsigned int pfns_per_compound = pgmap_geometry(pgmap);
 	unsigned long zone_idx = zone_idx(zone);
 	unsigned long start = jiffies;
 	int nid = pgdat->node_id;
@@ -6635,10 +6664,16 @@  void __ref memmap_init_zone_device(struct zone *zone,
 		nr_pages = end_pfn - start_pfn;
 	}
 
-	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+	for (pfn = start_pfn; pfn < end_pfn; pfn += pfns_per_compound) {
 		struct page *page = pfn_to_page(pfn);
 
 		__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
+
+		if (pfns_per_compound == 1)
+			continue;
+
+		memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
+				     pfns_per_compound);
 	}
 
 	pr_info("%s initialised %lu pages in %ums\n", __func__,