
[v2,2/4] hugetlb: add HPageCma flag and code to free non-gigantic pages in CMA

Message ID 20210923175347.10727-3-mike.kravetz@oracle.com (mailing list archive)
State New
Series hugetlb: add demote/split page functionality

Commit Message

Mike Kravetz Sept. 23, 2021, 5:53 p.m. UTC
When huge page demotion is fully implemented, gigantic pages can be
demoted to a smaller huge page size.  For example, on x86 a 1G page
can be demoted to 512 2M pages.  However, gigantic pages can potentially
be allocated from CMA.  If a gigantic page which was allocated from CMA
is demoted, the corresponding demoted pages need to be returned to CMA.
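
For concreteness, the order arithmetic behind that example (illustration
only; the numbers assume x86-64 with 4K base pages and are not taken from
this patch):

	/*
	 * PUD_SHIFT - PAGE_SHIFT = 30 - 12 = 18  (order of a 1G gigantic page)
	 * PMD_SHIFT - PAGE_SHIFT = 21 - 12 = 9   (HUGETLB_PAGE_ORDER, a 2M page)
	 * so demoting one 1G page yields 1 << (18 - 9) = 512 pages of 2M.
	 */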

In order to track hugetlb pages that need to be returned to CMA, add the
hugetlb-specific flag HPageCma.  The flag is set when a huge page is
allocated from CMA and is transferred to any demoted pages.  Non-gigantic
huge page freeing code checks for the flag and takes appropriate action.
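
For reference, HPAGEFLAG(Cma, cma) generates the accessors used below.  A
simplified sketch of what the macro in include/linux/hugetlb.h expands to
(the real definition may differ in detail; the flag bits live in
page->private):

	static inline int HPageCma(struct page *page)
	{
		return test_bit(HPG_cma, &page->private);
	}

	static inline void SetHPageCma(struct page *page)
	{
		set_bit(HPG_cma, &page->private);
	}

	static inline void ClearHPageCma(struct page *page)
	{
		clear_bit(HPG_cma, &page->private);
	}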

This also requires a change to CMA reservations for gigantic pages.
Currently, the 'order_per_bit' is set to the gigantic page size.
However, if gigantic pages can be demoted, this needs to be set to the
order of the smallest huge page.  At CMA reservation time we do not know
the smallest huge page size, so use HUGETLB_PAGE_ORDER.  Also, prohibit
demotion to huge page sizes smaller than HUGETLB_PAGE_ORDER.
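
For reference while reading the last hunk, the reservation call being
changed is declared roughly as follows in include/linux/cma.h (shown only
for context; consult the header for the authoritative prototype):

	int __init cma_declare_contiguous_nid(phys_addr_t base,
				phys_addr_t size, phys_addr_t limit,
				phys_addr_t alignment, unsigned int order_per_bit,
				bool fixed, const char *name, struct cma **res_cma,
				int nid);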

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/hugetlb.h |  7 +++++
 mm/hugetlb.c            | 64 +++++++++++++++++++++++++++++------------
 2 files changed, 53 insertions(+), 18 deletions(-)

Comments

David Hildenbrand Sept. 24, 2021, 9:36 a.m. UTC | #1
On 23.09.21 19:53, Mike Kravetz wrote:
> When huge page demotion is fully implemented, gigantic pages can be
> demoted to a smaller huge page size.  For example, on x86 a 1G page
> can be demoted to 512 2M pages.  However, gigantic pages can potentially
> be allocated from CMA.  If a gigantic page which was allocated from CMA
> is demoted, the corresponding demoted pages need to be returned to CMA.
> 
> In order to track hugetlb pages that need to be returned to CMA, add the
> hugetlb specific flag HPageCma.  Flag is set when a huge page is
> allocated from CMA and transferred to any demoted pages.  Non-gigantic
> huge page freeing code checks for the flag and takes appropriate action.

Do we really need that flag, or couldn't we simply always try
cma_release() and fall back to our ordinary freeing path?

IIRC, cma knows exactly whether something was allocated via a CMA area
and can be freed via it. No need for additional tracking usually.
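
I.e. roughly the pre-patch free_gigantic_page() path, applied to the
non-gigantic case as well (sketch only, mirroring the code this patch
removes):

	static void free_gigantic_page(struct page *page, unsigned int order)
	{
	#ifdef CONFIG_CMA
		/*
		 * cma_release() returns false if the page was not allocated
		 * from the given CMA area, so just try it and fall back.
		 */
		if (cma_release(hugetlb_cma[page_to_nid(page)], page, 1 << order))
			return;
	#endif
		free_contig_range(page_to_pfn(page), 1 << order);
	}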
Mike Kravetz Sept. 29, 2021, 7:42 p.m. UTC | #2
On 9/24/21 2:36 AM, David Hildenbrand wrote:
> On 23.09.21 19:53, Mike Kravetz wrote:
>> When huge page demotion is fully implemented, gigantic pages can be
>> demoted to a smaller huge page size.  For example, on x86 a 1G page
>> can be demoted to 512 2M pages.  However, gigantic pages can potentially
>> be allocated from CMA.  If a gigantic page which was allocated from CMA
>> is demoted, the corresponding demoted pages need to be returned to CMA.
>>
>> In order to track hugetlb pages that need to be returned to CMA, add the
>> hugetlb specific flag HPageCma.  Flag is set when a huge page is
>> allocated from CMA and transferred to any demoted pages.  Non-gigantic
>> huge page freeing code checks for the flag and takes appropriate action.
> 
> Do we really need that flag, or couldn't we simply always try cma_release() and fall back to our ordinary freeing path?
> 
> IIRC, cma knows exactly whether something was allocated via a CMA area and can be freed via it. No need for additional tracking usually.
> 

Yes, I think this is possible.
Initially, I thought the check for whether pages were part of CMA
involved a mutex or some type of locking.  But, it really is
lightweight, so it should not be an issue to call it in every case.

I will most likely add a !CONFIG_CMA stub for cma_release() and remove
some of the #ifdefs in hugetlb.c.
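
Something like this minimal sketch for the !CONFIG_CMA case (untested;
the parameter list should mirror the real cma_release() declaration in
include/linux/cma.h):

	#ifndef CONFIG_CMA
	/*
	 * Nothing can ever be allocated from CMA without CONFIG_CMA, so the
	 * stub always fails and callers take their ordinary freeing path.
	 */
	static inline bool cma_release(struct cma *cma, const struct page *pages,
				       unsigned long count)
	{
		return false;
	}
	#endif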

Thanks, I will update in next version.
Mike Kravetz Sept. 29, 2021, 11:21 p.m. UTC | #3
On 9/29/21 12:42 PM, Mike Kravetz wrote:
> On 9/24/21 2:36 AM, David Hildenbrand wrote:
>> On 23.09.21 19:53, Mike Kravetz wrote:
>>> When huge page demotion is fully implemented, gigantic pages can be
>>> demoted to a smaller huge page size.  For example, on x86 a 1G page
>>> can be demoted to 512 2M pages.  However, gigantic pages can potentially
>>> be allocated from CMA.  If a gigantic page which was allocated from CMA
>>> is demoted, the corresponding demoted pages need to be returned to CMA.
>>>
>>> In order to track hugetlb pages that need to be returned to CMA, add the
>>> hugetlb specific flag HPageCma.  Flag is set when a huge page is
>>> allocated from CMA and transferred to any demoted pages.  Non-gigantic
>>> huge page freeing code checks for the flag and takes appropriate action.
>>
>> Do we really need that flag, or couldn't we simply always try cma_release() and fall back to our ordinary freeing path?
>>
>> IIRC, cma knows exactly whether something was allocated via a CMA area and can be freed via it. No need for additional tracking usually.
>>
> 
> Yes, I think this is possible.
> Initially, I thought the check for whether pages were part of CMA
> involved a mutex or some type of locking.  But, it really is
> lightweight, so it should not be an issue to call it in every case.

When modifying the code, I did come across one issue.  Sorry I did not
immediately remember this.

Gigantic pages are allocated as a 'set of pages' and turned into a compound
page by the hugetlb code.  They must be restored to a 'set of pages' before
calling cma_release.  You cannot pass a compound page to cma_release.
Non-gigantic pages are allocated from buddy directly as compound pages.
They are returned to buddy as compound pages.

So, the issue comes about when freeing a non-gigantic page.  We would
need to convert it to a 'set of pages' before calling cma_release just to
see if cma_release succeeds.  Then, if it fails, convert back to a
compound page to call __free_pages.  Conversion is somewhat expensive as
we must modify every tail page struct.
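
Roughly, the conversion in question is the tail page walk already done for
gigantic pages (adapted from destroy_compound_gigantic_page(), see the
first hunk below; shown here only to illustrate the cost):

	/* every tail page's struct page gets written on conversion */
	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
		p->mapping = NULL;
		clear_compound_head(p);
		set_page_refcounted(p);
	}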

Some possible solutions:
- Create a cma_pages_valid() routine that checks whether the pages
  belong to a cma region.  Only convert to 'set of pages' if cma_pages_valid
  and we know subsequent cma_release will succeed.
- Always convert non-gigantic pages to a 'set of pages' before freeing.
  Alternatively, don't allocate compound pages from buddy and just use
  the hugetlb gigantic page prep and destroy routines for all hugetlb
  page sizes.
- Use some kind of flag as in proposed patch.

Having hugetlb just allocate a set of pages from buddy is interesting.
This would make the allocate/free code paths for gigantic and
non-gigantic pages align more closely.  It may result in overall code
simplification, not just for demote.
Mike Kravetz Oct. 1, 2021, 5:50 p.m. UTC | #4
On 9/29/21 4:21 PM, Mike Kravetz wrote:
> On 9/29/21 12:42 PM, Mike Kravetz wrote:
>> On 9/24/21 2:36 AM, David Hildenbrand wrote:
>>> On 23.09.21 19:53, Mike Kravetz wrote:
>>>> When huge page demotion is fully implemented, gigantic pages can be
>>>> demoted to a smaller huge page size.  For example, on x86 a 1G page
>>>> can be demoted to 512 2M pages.  However, gigantic pages can potentially
>>>> be allocated from CMA.  If a gigantic page which was allocated from CMA
>>>> is demoted, the corresponding demoted pages need to be returned to CMA.
>>>>
>>>> In order to track hugetlb pages that need to be returned to CMA, add the
>>>> hugetlb specific flag HPageCma.  Flag is set when a huge page is
>>>> allocated from CMA and transferred to any demoted pages.  Non-gigantic
>>>> huge page freeing code checks for the flag and takes appropriate action.
>>>
>>> Do we really need that flag, or couldn't we simply always try cma_release() and fall back to our ordinary freeing path?
>>>
>>> IIRC, cma knows exactly whether something was allocated via a CMA area and can be freed via it. No need for additional tracking usually.
>>>
>>
>> Yes, I think this is possible.
>> Initially, I thought the check for whether pages were part of CMA
>> involved a mutex or some type of locking.  But, it really is
>> lightweight, so it should not be an issue to call it in every case.
> 
> When modifying the code, I did come across one issue.  Sorry I did not
> immediately remember this.
> 
> Gigantic pages are allocated as a 'set of pages' and turned into a compound
> page by the hugetlb code.  They must be restored to a 'set of pages' before
> calling cma_release.  You cannot pass a compound page to cma_release.
> Non-gigantic pages are allocated from buddy directly as compound pages.
> They are returned to buddy as compound pages.
> 
> So, the issue comes about when freeing a non-gigantic page.  We would
> need to convert it to a 'set of pages' before calling cma_release just to
> see if cma_release succeeds.  Then, if it fails, convert back to a
> compound page to call __free_pages.  Conversion is somewhat expensive as
> we must modify every tail page struct.
> 
> Some possible solutions:
> - Create a cma_pages_valid() routine that checks whether the pages
>   belong to a cma region.  Only convert to 'set of pages' if cma_pages_valid
>   and we know subsequent cma_release will succeed.
> - Always convert non-gigantic pages to a 'set of pages' before freeing.
>   Alternatively, don't allocate compound pages from buddy and just use
>   the hugetlb gigantic page prep and destroy routines for all hugetlb
>   page sizes.
> - Use some kind of flag as in proposed patch.
> 
> Having hugetlb just allocate a set of pages from buddy is interesting.
> This would make the allocate/free code paths for gigantic and
> non-gigantic pages align more closely.  It may result in overall code
> simplification, not just for demote.

Taking this approach actually got a bit ugly in alloc_and_dissolve_huge_page,
which is used in migration.  Instead, I took the approach of adding a
'cma_pages_valid()' interface to check at page freeing time.  Sending
out v3 shortly.
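
For reference, the check I have in mind looks roughly like the following
(a sketch against the struct cma fields in mm/cma.h; the actual interface
in v3 may differ):

	/*
	 * Return true only if the page range lies entirely within the given
	 * CMA area, so callers know a later cma_release() will succeed before
	 * paying for any compound page conversion.
	 */
	bool cma_pages_valid(struct cma *cma, const struct page *pages,
			     unsigned long count)
	{
		unsigned long pfn;

		if (!cma || !pages)
			return false;

		pfn = page_to_pfn(pages);
		if (pfn < cma->base_pfn ||
		    pfn + count > cma->base_pfn + cma->count)
			return false;

		return true;
	}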

Patch

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index f2c3979efd69..08668b9f5f71 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -533,6 +533,11 @@  unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
  * HPG_freed - Set when page is on the free lists.
  *	Synchronization: hugetlb_lock held for examination and modification.
  * HPG_vmemmap_optimized - Set when the vmemmap pages of the page are freed.
+ * HPG_cma - Set if huge page was directly allocated from CMA area via
+ *	cma_alloc.  Initially set for gigantic page cma allocations, but can
+ *	be set in non-gigantic pages if gigantic pages are demoted.
+ *	Synchronization: Only accessed or modified when there is only one
+ *	reference to the page at allocation, free or demote time.
  */
 enum hugetlb_page_flags {
 	HPG_restore_reserve = 0,
@@ -540,6 +545,7 @@  enum hugetlb_page_flags {
 	HPG_temporary,
 	HPG_freed,
 	HPG_vmemmap_optimized,
+	HPG_cma,
 	__NR_HPAGEFLAGS,
 };
 
@@ -586,6 +592,7 @@  HPAGEFLAG(Migratable, migratable)
 HPAGEFLAG(Temporary, temporary)
 HPAGEFLAG(Freed, freed)
 HPAGEFLAG(VmemmapOptimized, vmemmap_optimized)
+HPAGEFLAG(Cma, cma)
 
 #ifdef CONFIG_HUGETLB_PAGE
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c76ee0bd6374..c3f7da8f0c68 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1272,6 +1272,7 @@  static void destroy_compound_gigantic_page(struct page *page,
 	atomic_set(compound_pincount_ptr(page), 0);
 
 	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
+		p->mapping = NULL;
 		clear_compound_head(p);
 		set_page_refcounted(p);
 	}
@@ -1283,16 +1284,12 @@  static void destroy_compound_gigantic_page(struct page *page,
 
 static void free_gigantic_page(struct page *page, unsigned int order)
 {
-	/*
-	 * If the page isn't allocated using the cma allocator,
-	 * cma_release() returns false.
-	 */
 #ifdef CONFIG_CMA
-	if (cma_release(hugetlb_cma[page_to_nid(page)], page, 1 << order))
-		return;
+	if (HPageCma(page))
+		cma_release(hugetlb_cma[page_to_nid(page)], page, 1 << order);
+	else
 #endif
-
-	free_contig_range(page_to_pfn(page), 1 << order);
+		free_contig_range(page_to_pfn(page), 1 << order);
 }
 
 #ifdef CONFIG_CONTIG_ALLOC
@@ -1311,8 +1308,10 @@  static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
 		if (hugetlb_cma[nid]) {
 			page = cma_alloc(hugetlb_cma[nid], nr_pages,
 					huge_page_order(h), true);
-			if (page)
+			if (page) {
+				SetHPageCma(page);
 				return page;
+			}
 		}
 
 		if (!(gfp_mask & __GFP_THISNODE)) {
@@ -1322,8 +1321,10 @@  static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
 
 				page = cma_alloc(hugetlb_cma[node], nr_pages,
 						huge_page_order(h), true);
-				if (page)
+				if (page) {
+					SetHPageCma(page);
 					return page;
+				}
 			}
 		}
 	}
@@ -1480,6 +1481,20 @@  static void __update_and_free_page(struct hstate *h, struct page *page)
 		destroy_compound_gigantic_page(page, huge_page_order(h));
 		free_gigantic_page(page, huge_page_order(h));
 	} else {
+#ifdef CONFIG_CMA
+		/*
+		 * Could be a page that was demoted from a gigantic page
+		 * which was allocated in a CMA area.
+		 */
+		if (HPageCma(page)) {
+			destroy_compound_gigantic_page(page,
+					huge_page_order(h));
+			if (!cma_release(hugetlb_cma[page_to_nid(page)], page,
+					1 << huge_page_order(h)))
+				VM_BUG_ON_PAGE(1, page);
+			return;
+		}
+#endif
 		__free_pages(page, huge_page_order(h));
 	}
 }
@@ -2997,14 +3012,19 @@  static void __init hugetlb_init_hstates(void)
 			hugetlb_hstate_alloc_pages(h);
 
 		/*
-		 * Set demote order for each hstate.  Note that
-		 * h->demote_order is initially 0.
+		 * Set demote order for each hstate.  hstates are not ordered,
+		 * so this is brute force.  Note that h->demote_order is
+		 * initially 0.  If cma is used for gigantic pages, the smallest
+		 * demote size is HUGETLB_PAGE_ORDER.
 		 */
-		for_each_hstate(h2) {
-			if (h2 == h)
-				continue;
-			if (h2->order < h->order && h2->order > h->demote_order)
-				h->demote_order = h2->order;
+		if (!hugetlb_cma_size || !(h->order <= HUGETLB_PAGE_ORDER)) {
+			for_each_hstate(h2) {
+				if (h2 == h)
+					continue;
+				if (h2->order < h->order &&
+				    h2->order > h->demote_order)
+					h->demote_order = h2->order;
+			}
 		}
 	}
 	VM_BUG_ON(minimum_order == UINT_MAX);
@@ -3555,6 +3575,8 @@  static ssize_t demote_size_store(struct kobject *kobj,
 	if (!t_hstate)
 		return -EINVAL;
 	demote_order = t_hstate->order;
+	if (demote_order < HUGETLB_PAGE_ORDER)
+		return -EINVAL;
 
 	/* demote order must be smaller hstate order */
 	h = kobj_to_hstate(kobj, &nid);
@@ -6563,7 +6585,13 @@  void __init hugetlb_cma_reserve(int order)
 		size = round_up(size, PAGE_SIZE << order);
 
 		snprintf(name, sizeof(name), "hugetlb%d", nid);
-		res = cma_declare_contiguous_nid(0, size, 0, PAGE_SIZE << order,
+		/*
+		 * Note that 'order per bit' is based on smallest size that
+		 * may be returned to CMA allocator in the case of
+		 * huge page demotion.
+		 */
+		res = cma_declare_contiguous_nid(0, size, 0,
+						PAGE_SIZE << HUGETLB_PAGE_ORDER,
 						 0, false, name,
 						 &hugetlb_cma[nid], nid);
 		if (res) {