
mm: page_alloc: Assume huge tail pages are valid when allocating contiguous pages

Message ID 20230414082222.idgw745cgcduzy37@techsingularity.net (mailing list archive)
State New
Series mm: page_alloc: Assume huge tail pages are valid when allocating contiguous pages

Commit Message

Mel Gorman April 14, 2023, 8:22 a.m. UTC
A bug was reported by Yuanxi Liu where allocating 1G pages at runtime is
taking an excessive amount of time for large amounts of memory. Further
testing showed that the cost of allocating huge pages is linear, i.e. if
allocating 1G pages in batches of 10 then the time to allocate nr_hugepages
from 10->20->30->etc increases linearly even though 10 pages are allocated
at each step. Profiles indicated that much of the time is spent checking
the validity of ranges within already existing huge pages and then
attempting a migration that fails after isolating the range, draining
pages and doing a whole lot of other useless work.

Commit eb14d4eefdc4 ("mm,page_alloc: drop unnecessary checks from
pfn_range_valid_contig") removed two checks, one of which skipped huge
pages for contiguous allocations on the grounds that huge pages can
migrate. While there may be value in migrating a 2M page to satisfy a 1G
allocation, it is pointless to move a 1G page for a new 1G allocation or
to scan for validity within an existing huge page.

Reintroduce the PageHuge check with some limitations. The new check
will allow an attempt to migrate a 2M page for a 1G allocation, but
contiguous requests within CMA regions will always attempt the migration.
The function is renamed because pfn_range_valid_contig is easily confused
with a pfn_valid() check, which is not what the function does.

The hpagealloc test allocates huge pages in batches and reports the
average latency per page over time. This test happens just after boot when
fragmentation is not an issue. Units are in milliseconds.
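
The test itself is not reproduced here. A rough, illustrative equivalent
(not the actual hpagealloc script; the sysfs path and batch size of 10 are
assumptions) would be a small program that grows the 1G hugetlb pool in
steps and times each step:

	/*
	 * Illustrative sketch only: grow the 1G hugetlb pool in batches of
	 * ten pages and report the average latency per page for each batch.
	 * Assumes the usual nr_hugepages sysfs interface and must be run
	 * as root on a kernel with 1G hugetlb support.
	 */
	#include <stdio.h>
	#include <time.h>

	#define POOL "/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages"

	int main(void)
	{
		int batch;

		for (batch = 1; batch <= 10; batch++) {
			struct timespec start, end;
			double ms;
			FILE *f = fopen(POOL, "w");

			if (!f)
				return 1;

			clock_gettime(CLOCK_MONOTONIC, &start);
			fprintf(f, "%d\n", batch * 10);
			fclose(f);	/* the write reaches the kernel on flush */
			clock_gettime(CLOCK_MONOTONIC, &end);

			ms = (end.tv_sec - start.tv_sec) * 1000.0 +
			     (end.tv_nsec - start.tv_nsec) / 1e6;
			printf("nr_hugepages=%d: %.2f ms/page\n",
			       batch * 10, ms / 10);
		}
		return 0;
	}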

hpagealloc
                               6.3.0-rc6              6.3.0-rc6              6.3.0-rc6
                                 vanilla   hugeallocrevert-v1r1      hugeallocfix-v1r1
Min       Latency       26.42 (   0.00%)        5.07 (  80.82%)       20.30 (  23.19%)
1st-qrtle Latency      356.61 (   0.00%)        5.34 (  98.50%)       20.57 (  94.23%)
2nd-qrtle Latency      697.26 (   0.00%)        5.47 (  99.22%)       20.84 (  97.01%)
3rd-qrtle Latency      972.94 (   0.00%)        5.50 (  99.43%)       21.16 (  97.83%)
Max-1     Latency       26.42 (   0.00%)        5.07 (  80.82%)       20.30 (  23.19%)
Max-5     Latency       82.14 (   0.00%)        5.11 (  93.78%)       20.49 (  75.05%)
Max-10    Latency      150.54 (   0.00%)        5.20 (  96.55%)       20.52 (  86.37%)
Max-90    Latency     1164.45 (   0.00%)        5.53 (  99.52%)       21.20 (  98.18%)
Max-95    Latency     1223.06 (   0.00%)        5.55 (  99.55%)       21.22 (  98.26%)
Max-99    Latency     1278.67 (   0.00%)        5.57 (  99.56%)       22.81 (  98.22%)
Max       Latency     1310.90 (   0.00%)        8.06 (  99.39%)       24.87 (  98.10%)
Amean     Latency      678.36 (   0.00%)        5.44 *  99.20%*       20.93 *  96.91%*

                   6.3.0-rc6   6.3.0-rc6   6.3.0-rc6
                     vanilla   revert-v1   hugeallocfix-v1
Duration User           0.28        0.27        0.27
Duration System       808.66       17.77       36.63
Duration Elapsed      830.87       18.08       36.95

The vanilla kernel is poor, taking up to 1.3 seconds to allocate a huge
page and almost 10 minutes in total to run the test. Reverting the
problematic commit reduces the worst case to 8ms and with this patch it
is 24ms. This patch fixes the main issue by skipping huge pages but does
not reintroduce the page_count() check because a page with an elevated
count can potentially migrate. Note that a simpler fix that only checks
PageHuge also performs similarly, with the caveat that 1G allocations may
fail due to smaller huge pages that could have been migrated.

BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=217022
Fixes: eb14d4eefdc4 ("mm,page_alloc: drop unnecessary checks from pfn_range_valid_contig")
Reported-by: Yuanxi Liu <y.liu@naruida.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

Comments

Michal Hocko April 14, 2023, 8:55 a.m. UTC | #1
On Fri 14-04-23 09:22:22, Mel Gorman wrote:
[...]
> +
> +		/*
> +		 * Do not migrate huge pages that span the size of the region
> +		 * being allocated contiguous. e.g. Do not migrate a 1G page
> +		 * for a 1G allocation request. CMA is an exception as the
> +		 * region may be reserved for hardware that requires physical
> +		 * memory without a MMU or scatter/gather capability.
> +		 *
> +		 * Note that the compound check is race-prone versus
> +		 * free/split/collapse but it should be safe and result in
> +		 * a premature skip or a useless migration attempt.
> +		 */
> +		if (PageHuge(page) && compound_nr(page) >= nr_pages &&
> +		    !is_migrate_cma_page(page)) {
> +			return false;

Is the CMA check working as expected? The function sounds quite generic
and I agree that it would make sense if it was generic, but in fact it is
used only for GB pages, and unless I am missing something it would allow
CMA pages to be migrated and potentially allocate over that region without
any possibility of migrating the GB page out, so the CMA region would be
essentially unusable for CMA users. GB pages already have their CMA
allocator path before we get to alloc_contig_pages. Or do I miss
something?
Mel Gorman April 14, 2023, 9:52 a.m. UTC | #2
On Fri, Apr 14, 2023 at 10:55:04AM +0200, Michal Hocko wrote:
> On Fri 14-04-23 09:22:22, Mel Gorman wrote:
> [...]
> > +
> > +		/*
> > +		 * Do not migrate huge pages that span the size of the region
> > +		 * being allocated contiguous. e.g. Do not migrate a 1G page
> > +		 * for a 1G allocation request. CMA is an exception as the
> > +		 * region may be reserved for hardware that requires physical
> > +		 * memory without a MMU or scatter/gather capability.
> > +		 *
> > +		 * Note that the compound check is race-prone versus
> > +		 * free/split/collapse but it should be safe and result in
> > +		 * a premature skip or a useless migration attempt.
> > +		 */
> > +		if (PageHuge(page) && compound_nr(page) >= nr_pages &&
> > +		    !is_migrate_cma_page(page)) {
> > +			return false;
> 
> Is the CMA check working as expected?

I didn't test it as I don't have a good simulator for CMA constraints,
which are still a mobile phone concern for devices like cameras.

> The function sounds quite generic
> and I agree that it would make sense if it was generic but it is used
> only for GB pages in fact and unless I am missing something it would
> allow to migrate CMA pages and potentially allocate over that region
> without any possibility to migrate GB page out so the CMA region would
> be essentially unusable for CMA users.

It's used primarily for 1G pages but does have other users (debugging
mostly, low priority). As it's advertised as a general API, I decided to
treat it as such and that meant being nice to CMA if possible. If CMA pages
migrate but the target location is still usable then it should be fine. If
a CMA page can migrate to a location that breaks a device then that's a bug.

> GB pages already have their CMA
> allocator path before we get to alloc_contig_pages. Or do I miss
> something?

I don't think you missed anything. The CMA check is, at best, an effort
to have a potentially useful semantic but it's very doubtful anyone will
notice or care. I'm perfectly happy just to drop the CMA check because that
is a straightforward fix and more suitable as a -stable backport.  I'm also
happy to just go with a PageHuge check and ignore any possibility that a 2M
page could be migrated to satisfy a 1G allocation.  1G allocation requests
after significant uptime are a crapshoot at best and relying on them
succeeding is unwise. There is a non-zero possibility that the latency
incurred migrating 2M pages and still failing a 1G allocation could itself
be classed as a bug, with users preferring fast failure of 1G allocation
attempts.
Michal Hocko April 14, 2023, 10:19 a.m. UTC | #3
On Fri 14-04-23 10:52:04, Mel Gorman wrote:
> On Fri, Apr 14, 2023 at 10:55:04AM +0200, Michal Hocko wrote:
> > On Fri 14-04-23 09:22:22, Mel Gorman wrote:
> > [...]
> > > +
> > > +		/*
> > > +		 * Do not migrate huge pages that span the size of the region
> > > +		 * being allocated contiguous. e.g. Do not migrate a 1G page
> > > +		 * for a 1G allocation request. CMA is an exception as the
> > > +		 * region may be reserved for hardware that requires physical
> > > +		 * memory without a MMU or scatter/gather capability.
> > > +		 *
> > > +		 * Note that the compound check is race-prone versus
> > > +		 * free/split/collapse but it should be safe and result in
> > > +		 * a premature skip or a useless migration attempt.
> > > +		 */
> > > +		if (PageHuge(page) && compound_nr(page) >= nr_pages &&
> > > +		    !is_migrate_cma_page(page)) {
> > > +			return false;
> > 
> > Is the CMA check working as expected?
> 
> I didn't test it as I don't have a good simulator for CMA contraints which
> is still a mobile phone concern for devices like cameras.
> 
> > The function sounds quite generic
> > and I agree that it would make sense if it was generic but it is used
> > only for GB pages in fact and unless I am missing something it would
> > allow to migrate CMA pages and potentially allocate over that region
> > without any possibility to migrate GB page out so the CMA region would
> > be essentially unusable for CMA users.
> 
> It's used primarily for 1G pages but does have other users (debugging
> mostly, low priority). As it's advertised as a general API, I decided to
> treat it as such and that meant being nice to CMA if possible. If CMA pages
> migrate but can still use the target location then it should be fine. If a
> CMA can migrate to an usable location that breaks a device then that's a bug.
> 
> > GB pages already have their CMA
> > allocator path before we get to alloc_contig_pages. Or do I miss
> > something?
> 
> I don't think you missed anything. The CMA check is, at best, an effort
> to have a potentially useful semantic but it's very doubtful anyone will
> notice or care. I'm perfectly happy just to drop the CMA check because it's a
> straight-forward fix and more suitable as a -stable backport.  I'm also happy
> to just go with a PageHuge check and ignore any possibility that a 2M page
> could be migrated to satisfy a 1G allocation.  1G allocation requests after
> significant uptime is a crapshoot at best and relying on them succeeding is
> unwise. There is a non-zero possibility that the latency incurred migrating
> 2M pages and still failing a 1G allocation could itself be classed as a
> bug with users preferring fast-failure of 1G allocation attempts.

Yes, the simpler the better. If we encounter a real usecase where a couple
of 2MB hugetlb pages stand in the way of GB pages then we can add the
check, so I would just go with reintroducing the PageHuge check alone.

Thanks!
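
For reference, the PageHuge-only variant being discussed would presumably
reduce to something like the following in the pfn scan loop (an
illustrative sketch, not the hunk that was actually posted):

		/*
		 * Sketch of the simpler variant: bail out of the range as
		 * soon as any hugetlb page is seen, without distinguishing
		 * gigantic pages or CMA pageblocks.
		 */
		if (PageHuge(page))
			return false;
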
David Hildenbrand April 14, 2023, 10:31 a.m. UTC | #4
On 14.04.23 12:19, Michal Hocko wrote:
> On Fri 14-04-23 10:52:04, Mel Gorman wrote:
>> On Fri, Apr 14, 2023 at 10:55:04AM +0200, Michal Hocko wrote:
>>> On Fri 14-04-23 09:22:22, Mel Gorman wrote:
>>> [...]
>>>> +
>>>> +		/*
>>>> +		 * Do not migrate huge pages that span the size of the region
>>>> +		 * being allocated contiguous. e.g. Do not migrate a 1G page
>>>> +		 * for a 1G allocation request. CMA is an exception as the
>>>> +		 * region may be reserved for hardware that requires physical
>>>> +		 * memory without a MMU or scatter/gather capability.
>>>> +		 *
>>>> +		 * Note that the compound check is race-prone versus
>>>> +		 * free/split/collapse but it should be safe and result in
>>>> +		 * a premature skip or a useless migration attempt.
>>>> +		 */
>>>> +		if (PageHuge(page) && compound_nr(page) >= nr_pages &&
>>>> +		    !is_migrate_cma_page(page)) {
>>>> +			return false;
>>>
>>> Is the CMA check working as expected?
>>
>> I didn't test it as I don't have a good simulator for CMA contraints which
>> is still a mobile phone concern for devices like cameras.
>>
>>> The function sounds quite generic
>>> and I agree that it would make sense if it was generic but it is used
>>> only for GB pages in fact and unless I am missing something it would
>>> allow to migrate CMA pages and potentially allocate over that region
>>> without any possibility to migrate GB page out so the CMA region would
>>> be essentially unusable for CMA users.
>>
>> It's used primarily for 1G pages but does have other users (debugging
>> mostly, low priority). As it's advertised as a general API, I decided to
>> treat it as such and that meant being nice to CMA if possible. If CMA pages
>> migrate but can still use the target location then it should be fine. If a
>> CMA can migrate to an usable location that breaks a device then that's a bug.
>>
>>> GB pages already have their CMA
>>> allocator path before we get to alloc_contig_pages. Or do I miss
>>> something?
>>
>> I don't think you missed anything. The CMA check is, at best, an effort
>> to have a potentially useful semantic but it's very doubtful anyone will
>> notice or care. I'm perfectly happy just to drop the CMA check because it's a
>> straight-forward fix and more suitable as a -stable backport.  I'm also happy
>> to just go with a PageHuge check and ignore any possibility that a 2M page
>> could be migrated to satisfy a 1G allocation.  1G allocation requests after
>> significant uptime is a crapshoot at best and relying on them succeeding is
>> unwise. There is a non-zero possibility that the latency incurred migrating
>> 2M pages and still failing a 1G allocation could itself be classed as a
>> bug with users preferring fast-failure of 1G allocation attempts.
> 
> Yes, the simpler the better. If we encounter a real usecase where couple
> of 2MB hugetlb pages stand in the way to GB pages then we can add the
> check so I would just go with reintroducing the PageHuge check alone.

alloc_contig_pages() -> __alloc_contig_pages() -> 
alloc_contig_range(MIGRATE_MOVABLE)

Should always fail when stumbling over MIGRATE_CMA pageblocks IIRC.

So we could bail out in that function early if we stumble over any CMA 
region.
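
A minimal sketch of that early bail-out, assuming it would sit in the same
pfn scan loop as the other checks, might look like:

		/*
		 * Hypothetical early bail-out along these lines:
		 * alloc_contig_range(MIGRATE_MOVABLE) will fail on
		 * MIGRATE_CMA pageblocks anyway, so give up on the range as
		 * soon as a CMA page is seen instead of discovering the
		 * failure later in alloc_contig_range().
		 */
		if (is_migrate_cma_page(page))
			return false;
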
Matthew Wilcox April 14, 2023, 12:22 p.m. UTC | #5
On Fri, Apr 14, 2023 at 09:22:22AM +0100, Mel Gorman wrote:
> +		/*
> +		 * Do not migrate huge pages that span the size of the region
> +		 * being allocated contiguous. e.g. Do not migrate a 1G page
> +		 * for a 1G allocation request. CMA is an exception as the
> +		 * region may be reserved for hardware that requires physical
> +		 * memory without a MMU or scatter/gather capability.
> +		 *
> +		 * Note that the compound check is race-prone versus
> +		 * free/split/collapse but it should be safe and result in
> +		 * a premature skip or a useless migration attempt.
> +		 */
> +		if (PageHuge(page) && compound_nr(page) >= nr_pages &&

This confuses me.  PageHuge() can be called on tail pages, but if
compound_nr() is called on a tail page, it returns 1.  So I'm not
sure why this works.  Also, do you really want PageHuge (ie only
hugetlbfs pages), or do you really just want to check PageCompound(),
which would also be true for THP?
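
For context, compound_nr() at the time reduces to roughly the following
(simplified, not the verbatim kernel source), which is why a tail page
reports a size of 1:

	static inline unsigned long compound_nr(struct page *page)
	{
		/* Tail (and order-0) pages are not head pages: report 1. */
		if (!PageHead(page))
			return 1;
		return 1UL << compound_order(page);
	}
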
Mel Gorman April 14, 2023, 1:29 p.m. UTC | #6
On Fri, Apr 14, 2023 at 01:22:06PM +0100, Matthew Wilcox wrote:
> On Fri, Apr 14, 2023 at 09:22:22AM +0100, Mel Gorman wrote:
> > +		/*
> > +		 * Do not migrate huge pages that span the size of the region
> > +		 * being allocated contiguous. e.g. Do not migrate a 1G page
> > +		 * for a 1G allocation request. CMA is an exception as the
> > +		 * region may be reserved for hardware that requires physical
> > +		 * memory without a MMU or scatter/gather capability.
> > +		 *
> > +		 * Note that the compound check is race-prone versus
> > +		 * free/split/collapse but it should be safe and result in
> > +		 * a premature skip or a useless migration attempt.
> > +		 */
> > +		if (PageHuge(page) && compound_nr(page) >= nr_pages &&
> 
> This confuses me.  PageHuge() can be called on tail pages, but if
> compound_nr() is called on a tail page, it returns 1. 

Given the calling context is a linear scan, the head page will be scanned
first, so the case of compound_nr() being called on a tail page of a
qualifying huge page shouldn't arise.

> So I'm not
> sure why this works.  Also, do you really want PageHuge (ie only
> hugetlbfs pages), or do you really just want to check PageCompound(),
> which would also be true for THP?
> 

For now I only want hugetlbfs pages as the fix is for a regression when
allocating 1G hugetlbfs pages and previous behaviour avoided existing
hugetlbfs pages. THP pages can be split+migrated of course but the cost
of the 1G allocation attempt may be excessive relative to any benefit.
Mike Kravetz April 14, 2023, 7:06 p.m. UTC | #7
Thanks Mel!
Apologies for not noticing when the bug was posted to the list.  Otherwise,
I would have jumped on it.

On 04/14/23 09:22, Mel Gorman wrote:
> A bug was reported by Yuanxi Liu where allocating 1G pages at runtime is
> taking an excessive amount of time for large amounts of memory. Further
> testing allocating huge pages that the cost is linear i.e. if allocating
> 1G pages in batches of 10 then the time to allocate nr_hugepages from
> 10->20->30->etc increases linearly even though 10 pages are allocated at
> each step. Profiles indicated that much of the time is spent checking the
> validity within already existing huge pages and then attempting a migration
> that fails after isolating the range, draining pages and a whole lot of
> other useless work.
> 
> Commit eb14d4eefdc4 ("mm,page_alloc: drop unnecessary checks from
> pfn_range_valid_contig") removed two checks, one which ignored huge pages
> for contiguous allocations as huge pages can migrate. While there may be
> value on migrating a 2M page to satisfy a 1G allocation, it's pointless
> to move a 1G page for a new 1G allocation or scan for validity within an
> existing huge page.

eb14d4eefdc4 was the last patch in Oscar's series "Make alloc_contig_range
handle Hugetlb pages".
https://lore.kernel.org/linux-mm/20210419075413.1064-1-osalvador@suse.de/

It seems bailing out of alloc_contig_range when encountering hugetlb
pages was an actual issue, as the cover letter says:

" This can be problematic for some users, e.g: CMA and virtio-mem, where those
  users will fail the call if alloc_contig_range ever sees a HugeTLB page, even
  when those pages lay in ZONE_MOVABLE and are free.
  That problem can be easily solved by replacing the page in the free hugepage
  pool."

> Reintroduce the PageHuge check with some limitations. The new check
> will allow an attempt to migrate a 2M page for a 1G allocation but
> contiguous requests within CMA regions will always attempt the migration.
> The function is renamed as pfn_range_valid_contig can be easily confused
> with a pfn_valid() check which is not what the function does.

Michal suggested that we reintroduce the simple PageHuge check and fail
alloc_contig_range if any hugetlb pages are encountered.  However, this
certainly would impact the issue Oscar was trying to address.

Do note that Oscar's series actually tried to address the issue here.  In
isolate_or_dissolve_huge_page() there is this check.

	/*
	 * Fence off gigantic pages as there is a cyclic dependency between
	 * alloc_contig_range and them. Return -ENOMEM as this has the effect
	 * of bailing out right away without further retrying.
	 */
	if (hstate_is_gigantic(h))
		return -ENOMEM;

So we actually bail when getting here.  But we are well down the call chain:
alloc_contig_pages() -> alloc_contig_range() -> __alloc_contig_migrate_range()
-> isolate_migratepages_range() -> isolate_migratepages_block() ->
isolate_or_dissolve_huge_page

The overhead of getting to this point where we bail in
isolate_or_dissolve_huge_page is excessive.

I suggest reintroducing the PageHuge check with the condition that it will
allow migration of non-gigantic pages.  This will at least address the issue
seen by Oscar.  I agree with others that the CMA check is not needed.

> +		if (PageHuge(page) && compound_nr(page) >= nr_pages &&

I 'think' some architectures have more than one gigantic huge page size.  So,
it seems this should really be a gigantic page check.  Otherwise we will bail
later in isolate_or_dissolve_huge_page as we do today.  However, we cannot
use the hugetlb interfaces to check for a gigantic page as they require the
page to be 'stable' and we hold no locks.  So, it would need to be open
coded something like:

	if (PageHuge(page) && compound_order(page) > MAX_ORDER)
		/* after MAX_ORDER update in linux-next */
Mike Kravetz April 14, 2023, 8:07 p.m. UTC | #8
On 04/14/23 12:06, Mike Kravetz wrote:
> Thanks Mel!
> Apologies for not noticing when the bug was posted to the list.  Otherwise,
> I would have jumped on it.
> 
> On 04/14/23 09:22, Mel Gorman wrote:
> > A bug was reported by Yuanxi Liu where allocating 1G pages at runtime is
> > taking an excessive amount of time for large amounts of memory. Further
> > testing allocating huge pages that the cost is linear i.e. if allocating
> > 1G pages in batches of 10 then the time to allocate nr_hugepages from
> > 10->20->30->etc increases linearly even though 10 pages are allocated at
> > each step. Profiles indicated that much of the time is spent checking the
> > validity within already existing huge pages and then attempting a migration
> > that fails after isolating the range, draining pages and a whole lot of
> > other useless work.
> > 
> > Commit eb14d4eefdc4 ("mm,page_alloc: drop unnecessary checks from
> > pfn_range_valid_contig") removed two checks, one which ignored huge pages
> > for contiguous allocations as huge pages can migrate. While there may be
> > value on migrating a 2M page to satisfy a 1G allocation, it's pointless
> > to move a 1G page for a new 1G allocation or scan for validity within an
> > existing huge page.
> 
> eb14d4eefdc4 was the last patch in Oscar's series "Make alloc_contig_range
> handle Hugetlb pages".
> https://lore.kernel.org/linux-mm/20210419075413.1064-1-osalvador@suse.de/
> 
> It seems bailing out of alloc_contig_range when experiencing hugetlb
> pages was an actual issue as the cover letter says:
> 

David correctly pointed out that I was confusing alloc_contig_range and
alloc_contig_pages.  Sorry!

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7136c36c5d01..9036306b3d53 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -9434,22 +9434,45 @@  static int __alloc_contig_pages(unsigned long start_pfn,
 				  gfp_mask);
 }
 
-static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn,
+/*
+ * Returns true if it's worth trying to migrate pages within a range for
+ * a contiguous allocation.
+ */
+static bool pfn_range_suitable_contig(struct zone *z, unsigned long start_pfn,
 				   unsigned long nr_pages)
 {
 	unsigned long i, end_pfn = start_pfn + nr_pages;
 	struct page *page;
 
 	for (i = start_pfn; i < end_pfn; i++) {
+		/* Must be valid. */
 		page = pfn_to_online_page(i);
 		if (!page)
 			return false;
 
+		/* Must be within one zone. */
 		if (page_zone(page) != z)
 			return false;
 
+		/* Reserved pages cannot migrate. */
 		if (PageReserved(page))
 			return false;
+
+		/*
+		 * Do not migrate huge pages that span the size of the region
+		 * being allocated contiguous. e.g. Do not migrate a 1G page
+		 * for a 1G allocation request. CMA is an exception as the
+		 * region may be reserved for hardware that requires physical
+		 * memory without a MMU or scatter/gather capability.
+		 *
+		 * Note that the compound check is race-prone versus
+		 * free/split/collapse but it should be safe and result in
+		 * a premature skip or a useless migration attempt.
+		 */
+		if (PageHuge(page) && compound_nr(page) >= nr_pages &&
+		    !is_migrate_cma_page(page)) {
+			return false;
+		}
 	}
 	return true;
 }
@@ -9498,7 +9521,7 @@  struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask,
 
 		pfn = ALIGN(zone->zone_start_pfn, nr_pages);
 		while (zone_spans_last_pfn(zone, pfn, nr_pages)) {
-			if (pfn_range_valid_contig(zone, pfn, nr_pages)) {
+			if (pfn_range_suitable_contig(zone, pfn, nr_pages)) {
 				/*
 				 * We release the zone lock here because
 				 * alloc_contig_range() will also lock the zone