Message ID: 20230201162549.68384-1-halbuer@sra.uni-hannover.de (mailing list archive)
State: New
Series: mm: reduce lock contention of pcp buffer refill
On Wed, 1 Feb 2023 17:25:49 +0100 Alexander Halbuer <halbuer@sra.uni-hannover.de> wrote:

> The `rmqueue_bulk` function batches the allocation of multiple elements to
> refill the per-CPU buffers into a single hold of the zone lock. Each
> element is allocated and checked using the `check_pcp_refill` function.
> The check touches every related struct page, which is especially expensive
> for higher-order allocations (huge pages). This patch reduces the time
> holding the lock by moving the check out of the critical section, similar
> to the `rmqueue_buddy` function, which allocates a single element.
> Measurements of parallel allocation-heavy workloads show a reduction of
> the average huge page allocation latency by 50 percent for two cores and
> nearly 90 percent for 24 cores.

Sounds nice.

Were you able to test how much benefit we get by simply removing the
check_new_pages() call from rmqueue_bulk()?

Vlastimil, I find this quite confusing:

#ifdef CONFIG_DEBUG_VM
/*
 * With DEBUG_VM enabled, order-0 pages are checked for expected state when
 * being allocated from pcp lists. With debug_pagealloc also enabled, they are
 * also checked when pcp lists are refilled from the free lists.
 */
static inline bool check_pcp_refill(struct page *page, unsigned int order)
{
	if (debug_pagealloc_enabled_static())
		return check_new_pages(page, order);
	else
		return false;
}

static inline bool check_new_pcp(struct page *page, unsigned int order)
{
	return check_new_pages(page, order);
}
#else
/*
 * With DEBUG_VM disabled, free order-0 pages are checked for expected state
 * when pcp lists are being refilled from the free lists. With debug_pagealloc
 * enabled, they are also checked when being allocated from the pcp lists.
 */
static inline bool check_pcp_refill(struct page *page, unsigned int order)
{
	return check_new_pages(page, order);
}

static inline bool check_new_pcp(struct page *page, unsigned int order)
{
	if (debug_pagealloc_enabled_static())
		return check_new_pages(page, order);
	else
		return false;
}
#endif /* CONFIG_DEBUG_VM */

and the 4462b32c9285b5 changelog is a struggle to follow.

Why are we performing *any* checks when CONFIG_DEBUG_VM=n and when
debug_pagealloc_enabled is false?

Anyway, these checks sound quite costly, so let's revisit their
desirability?
On 2/3/23 00:25, Andrew Morton wrote:
> On Wed, 1 Feb 2023 17:25:49 +0100 Alexander Halbuer <halbuer@sra.uni-hannover.de> wrote:
>
>> The `rmqueue_bulk` function batches the allocation of multiple elements to
>> refill the per-CPU buffers into a single hold of the zone lock.
>> [...]
>
> Sounds nice.
>
> Were you able to test how much benefit we get by simply removing the
> check_new_pages() call from rmqueue_bulk()?

I did some further investigations and measurements to quantify potential
performance gains. Benchmarks ran on a machine with 24 physical cores
fixed at 2.1 GHz. The results show significant performance gains with the
patch applied in parallel scenarios. Eliminating the check reduces
allocation latency further, especially for low core counts.

The following tables show the average allocation latencies for huge pages
and normal pages for three different configurations: the unpatched kernel,
the patched kernel, and an additional version without the check. (The
Diff columns are relative to the unpatched Default column.)

Huge pages
+-------+---------+-------+----------+---------+----------+
| Cores | Default | Patch | Patch    | NoCheck | NoCheck  |
|       | (ns)    | (ns)  | Diff     | (ns)    | Diff     |
+-------+---------+-------+----------+---------+----------+
|     1 |     127 |   124 |  (-2.4%) |     118 |  (-7.1%) |
|     2 |     140 |   140 |  (-0.0%) |     134 |  (-4.3%) |
|     3 |     143 |   142 |  (-0.7%) |     134 |  (-6.3%) |
|     4 |     178 |   159 | (-10.7%) |     156 | (-12.4%) |
|     6 |     269 |   239 | (-11.2%) |     237 | (-11.9%) |
|     8 |     363 |   321 | (-11.6%) |     319 | (-12.1%) |
|    10 |     454 |   409 |  (-9.9%) |     409 |  (-9.9%) |
|    12 |     545 |   494 |  (-9.4%) |     488 | (-10.5%) |
|    14 |     639 |   578 |  (-9.5%) |     574 | (-10.2%) |
|    16 |     735 |   660 | (-10.2%) |     653 | (-11.2%) |
|    20 |     915 |   826 |  (-9.7%) |     815 | (-10.9%) |
|    24 |    1105 |   992 | (-10.2%) |     982 | (-11.1%) |
+-------+---------+-------+----------+---------+----------+

Normal pages
+-------+---------+-------+----------+---------+----------+
| Cores | Default | Patch | Patch    | NoCheck | NoCheck  |
|       | (ns)    | (ns)  | Diff     | (ns)    | Diff     |
+-------+---------+-------+----------+---------+----------+
|     1 |    2790 |  2767 |  (-0.8%) |     171 | (-93.9%) |
|     2 |    6685 |  3484 | (-47.9%) |     519 | (-92.2%) |
|     3 |   10501 |  3599 | (-65.7%) |     855 | (-91.9%) |
|     4 |   14264 |  3635 | (-74.5%) |    1139 | (-92.0%) |
|     6 |   21800 |  3551 | (-83.7%) |    1713 | (-92.1%) |
|     8 |   29563 |  3570 | (-87.9%) |    2268 | (-92.3%) |
|    10 |   37210 |  3845 | (-89.7%) |    2872 | (-92.3%) |
|    12 |   44780 |  4452 | (-90.1%) |    3417 | (-92.4%) |
|    14 |   52576 |  5100 | (-90.3%) |    4020 | (-92.4%) |
|    16 |   60118 |  5785 | (-90.4%) |    4604 | (-92.3%) |
|    20 |   75037 |  7270 | (-90.3%) |    6486 | (-91.4%) |
|    24 |   90226 |  8712 | (-90.3%) |    7061 | (-92.2%) |
+-------+---------+-------+----------+---------+----------+

> Vlastimil, I find this quite confusing:
>
> [check_pcp_refill()/check_new_pcp() definitions ...]
>
> and the 4462b32c9285b5 changelog is a struggle to follow.
>
> Why are we performing *any* checks when CONFIG_DEBUG_VM=n and when
> debug_pagealloc_enabled is false?
>
> Anyway, these checks sound quite costly, so let's revisit their
> desirability?
On 2/3/23 00:25, Andrew Morton wrote:
> On Wed, 1 Feb 2023 17:25:49 +0100 Alexander Halbuer <halbuer@sra.uni-hannover.de> wrote:
>
>> The `rmqueue_bulk` function batches the allocation of multiple elements to
>> refill the per-CPU buffers into a single hold of the zone lock.
>> [...]
>
> Sounds nice.
>
> Were you able to test how much benefit we get by simply removing the
> check_new_pages() call from rmqueue_bulk()?
>
> Vlastimil, I find this quite confusing:
>
> [check_pcp_refill()/check_new_pcp() definitions ...]
>
> and the 4462b32c9285b5 changelog is a struggle to follow.
>
> Why are we performing *any* checks when CONFIG_DEBUG_VM=n and when
> debug_pagealloc_enabled is false?

Well, AFAIK the history was like this:

- At first we always did the checks when allocating or freeing a page,
  even when it was allocated/freed from the pcplist.

- Then Mel, in 479f854a207c and 4db7548ccbd9, changed it so the checks
  were done only when moving pages between the pcplists and the zone's
  freelists, so the pcplist-cached fast paths became faster. But that
  means some errors may no longer be caught, so with DEBUG_VM the checks
  were still done on every alloc/free.

- My 4462b32c9285b5 changed it so that when debug_pagealloc is boot-time
  enabled, the checks happen both on pcplist and zone freelist
  alloc/free. This was mainly to allow enabling the checks on production
  kernels without recompiling with DEBUG_VM. But it's a mode where
  catching the culprit is more desirable than peak performance.

> Anyway, these checks sound quite costly, so let's revisit their
> desirability?

So AFAIK we never went without doing the checks at all. But given that by
default we haven't done them on the pcplists for years, and the majority
of allocations use the pcplists, maybe indeed we won't lose much coverage
by not doing the checks at all.

But I also wonder what the kernel hardening folks think here - are
hardened kernels usually built with DEBUG_VM or debug_pagealloc enabled,
or would you like to hook some other kernel option for keeping the checks
on page alloc/free active? And should those checks be done on every
alloc/free, including pcplist-cached allocations?
On 2/7/23 17:11, Alexander Halbuer wrote:
> On 2/3/23 00:25, Andrew Morton wrote:
>> Were you able to test how much benefit we get by simply removing the
>> check_new_pages() call from rmqueue_bulk()?
>
> I did some further investigations and measurements to quantify potential
> performance gains. Benchmarks ran on a machine with 24 physical cores
> fixed at 2.1 GHz. The results show significant performance gains with the
> patch applied in parallel scenarios. Eliminating the check reduces
> allocation latency further, especially for low core counts.

Are the benchmarks done via a kernel module calling bulk allocation, or
from userspace?

> The following tables show the average allocation latencies for huge pages
> and normal pages for three different configurations: the unpatched kernel,
> the patched kernel, and an additional version without the check.
>
> Huge pages
> [...]
>
> Normal pages
> [...]

Nice improvements. I assume the table headings are switched? Why would
Normal be more expensive than Huge?
On 2/1/23 17:25, Alexander Halbuer wrote:
> The `rmqueue_bulk` function batches the allocation of multiple elements to
> refill the per-CPU buffers into a single hold of the zone lock. Each
> element is allocated and checked using the `check_pcp_refill` function.
> The check touches every related struct page, which is especially expensive
> for higher-order allocations (huge pages). This patch reduces the time
> holding the lock by moving the check out of the critical section, similar
> to the `rmqueue_buddy` function, which allocates a single element.
> Measurements of parallel allocation-heavy workloads show a reduction of
> the average huge page allocation latency by 50 percent for two cores and
> nearly 90 percent for 24 cores.
>
> Signed-off-by: Alexander Halbuer <halbuer@sra.uni-hannover.de>

Even if we proceed with disabling the checks in default
non-debugging/non-hardened configurations, this would still help those
configurations, so:

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Suggestion below:

> ---
>  mm/page_alloc.c | 22 ++++++++++++++++++----
>  1 file changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0745aedebb37..4b80438b1f59 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3119,6 +3119,8 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  {
>  	unsigned long flags;
>  	int i, allocated = 0;
> +	struct list_head *prev_tail = list->prev;
> +	struct page *pos, *n;
>
>  	spin_lock_irqsave(&zone->lock, flags);
>  	for (i = 0; i < count; ++i) {
> @@ -3127,9 +3129,6 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  		if (unlikely(page == NULL))
>  			break;
>
> -		if (unlikely(check_pcp_refill(page, order)))
> -			continue;
> -
>  		/*
>  		 * Split buddy pages returned by expand() are received here in
>  		 * physical page order. The page is added to the tail of
> @@ -3141,7 +3140,6 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  		 * pages are ordered properly.
>  		 */
>  		list_add_tail(&page->pcp_list, list);
> -		allocated++;
>  		if (is_migrate_cma(get_pcppage_migratetype(page)))
>  			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
>  					      -(1 << order));

As another benefit of your patch, the NR_FREE_CMA_PAGES counter will no
longer become inaccurate when we leak CMA pages that fail the check.

You could also try another patch that moves the above check into the loop
below, and see if it makes any difference in your benchmark. The loop
could count is_migrate_cma pages, and afterwards do a single
"if (cma_pages > 0) mod_zone_page_state(...)" - because we are no longer
inside the spin_lock_irqsave() block, we need to use the safe
mod_zone_page_state() variant without underscores (see the sketch after
this mail). Thanks!

> @@ -3155,6 +3153,22 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  	 */
>  	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
>  	spin_unlock_irqrestore(&zone->lock, flags);
> +
> +	/*
> +	 * Pages are appended to the pcp list without checking to reduce the
> +	 * time holding the zone lock. Checking the appended pages happens right
> +	 * after the critical section while still holding the pcp lock.
> +	 */
> +	pos = list_first_entry(prev_tail, struct page, pcp_list);
> +	list_for_each_entry_safe_from(pos, n, list, pcp_list) {
> +		if (unlikely(check_pcp_refill(pos, order))) {
> +			list_del(&pos->pcp_list);
> +			continue;
> +		}
> +
> +		allocated++;
> +	}
> +
>  	return allocated;
>  }
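To make the suggestion above concrete, here is a rough, untested sketch of
how the post-unlock check loop could absorb the CMA accounting. The new
local variable ("int cma_pages = 0;") is an assumption; everything else
follows the patch under review:

	pos = list_first_entry(prev_tail, struct page, pcp_list);
	list_for_each_entry_safe_from(pos, n, list, pcp_list) {
		if (unlikely(check_pcp_refill(pos, order))) {
			list_del(&pos->pcp_list);
			continue;
		}

		/* Count CMA pages here instead of under the zone lock. */
		if (is_migrate_cma(get_pcppage_migratetype(pos)))
			cma_pages++;
		allocated++;
	}

	/*
	 * One batched counter update. We are outside spin_lock_irqsave()
	 * here, so the non-__ variant must be used, and pages that failed
	 * the check are correctly excluded from the adjustment.
	 */
	if (cma_pages > 0)
		mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
				    -(cma_pages << order));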
On 2/8/23 16:11, Vlastimil Babka wrote:
> On 2/7/23 17:11, Alexander Halbuer wrote:
>> I did some further investigations and measurements to quantify potential
>> performance gains. Benchmarks ran on a machine with 24 physical cores
>> fixed at 2.1 GHz. [...]
>
> Are the benchmarks done via a kernel module calling bulk allocation, or
> from userspace?

The benchmarks are done via a kernel module using the alloc_pages_node()
function to allocate from a specific NUMA node. (The test machine is a
dual-socket system featuring two Intel Xeon Gold 6252 with 24 physical
cores each; benchmark threads are pinned to the first CPU to prevent
additional NUMA effects.)

The module was originally developed as part of a discussion about
different allocator designs, to get real numbers from an actual buddy
allocator implementation. Measurements cover different allocation
scenarios (bulk, repeat, random) for different allocation sizes (orders)
and degrees of parallelism. The parallel bulk allocation benchmark showed
an outlier for huge page allocations (order 9) that was much worse than
order 8 and even order 10, despite the lack of a pcp buffer for those
sizes. This was the origin of my further investigations.

The benchmark module internally uses multiple threads that independently
perform a specified number of allocations in a loop around the mentioned
alloc_pages_node() function; a barrier synchronizes all those threads at
the beginning (a simplified sketch of such a loop follows below). The
values shown in the tables are the average durations of all threads
divided by the number of allocations, to break the result down to a
single allocation.

>> The following tables show the average allocation latencies for huge pages
>> and normal pages for three different configurations: the unpatched kernel,
>> the patched kernel, and an additional version without the check.
>>
>> Huge pages
>> [...]
>>
>> Normal pages
>> [...]
>
> Nice improvements. I assume the table headings are switched? Why would
> Normal be more expensive than Huge?

Yes, you are right, I messed up the table headings.
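For illustration, a minimal, hypothetical sketch of the kind of per-thread
measurement loop described above might look as follows. The function name
and the alloc-then-free ("repeat"-style) structure are assumptions; only
the use of alloc_pages_node() and the barrier/pinning setup come from the
thread itself:

#include <linux/gfp.h>
#include <linux/math64.h>
#include <linux/timekeeping.h>

/*
 * Hypothetical per-thread benchmark body (not the actual module from
 * this thread). Each thread calls this after the start barrier; the
 * caller averages the per-thread results.
 */
static u64 bench_alloc_avg_ns(int nid, unsigned int order, unsigned long n)
{
	unsigned long i;
	u64 start, end;

	start = ktime_get_ns();
	for (i = 0; i < n; i++) {
		struct page *page = alloc_pages_node(nid, GFP_KERNEL, order);

		/* Free immediately; a "bulk" scenario would batch frees. */
		if (likely(page))
			__free_pages(page, order);
	}
	end = ktime_get_ns();

	return div_u64(end - start, n);	/* average ns per allocation */
}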
On Wed, Feb 08, 2023 at 11:45:14AM +0100, Vlastimil Babka wrote:
> But I also wonder what the kernel hardening folks think here - are
> hardened kernels usually built with DEBUG_VM or debug_pagealloc enabled,
> or would you like to hook some other kernel option for keeping the checks
> on page alloc/free active? And should those checks be done on every
> alloc/free, including pcplist-cached allocations?

What we're depending on for heap-related (i.e. both page allocator and
slab) hardening currently is:

- CONFIG_SLAB_FREELIST_HARDENED
  - pointer obfuscation (SLUB) -- freelist_ptr(), set_freepointer()
    (illustrated below)
  - pool membership verification (SLUB and SLAB) -- cache_from_obj()
  - consecutive double-free detection (SLUB and SLAB) -- __free_one()
- allocation order randomization
  - CONFIG_SLAB_FREELIST_RANDOM (SLUB and SLAB)
  - CONFIG_SHUFFLE_PAGE_ALLOCATOR (page allocator)
- memory wiping (both slab and page allocator)
  - init_on_alloc / CONFIG_INIT_ON_ALLOC_DEFAULT_ON
  - init_on_free / CONFIG_INIT_ON_FREE_DEFAULT_ON

It'd be nice to gain slab redzone verification, but that seems expensive
enough that anyone interested in that level of hardening has likely
turned on full KASAN.
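For reference, the pointer obfuscation mentioned in the first item above
looks roughly like this - a simplified sketch of SLUB's freelist_ptr()
from mm/slub.c of that era; exact details (e.g. KASAN tag handling) differ
between kernel versions:

/*
 * Simplified sketch of SLUB freelist pointer obfuscation under
 * CONFIG_SLAB_FREELIST_HARDENED. The stored next-free pointer is XORed
 * with a per-cache random value and the byte-swapped storage address, so
 * leaking or corrupting a freelist entry neither reveals nor trivially
 * redirects a real kernel pointer.
 */
static inline void *freelist_ptr(const struct kmem_cache *s, void *ptr,
				 unsigned long ptr_addr)
{
#ifdef CONFIG_SLAB_FREELIST_HARDENED
	return (void *)((unsigned long)ptr ^ s->random ^ swab(ptr_addr));
#else
	return ptr;
#endif
}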
On Wed, Feb 01, 2023 at 05:25:49PM +0100, Alexander Halbuer wrote:
> The `rmqueue_bulk` function batches the allocation of multiple elements to
> refill the per-CPU buffers into a single hold of the zone lock.
> [...]
>
> Signed-off-by: Alexander Halbuer <halbuer@sra.uni-hannover.de>

Acked-by: Mel Gorman <mgorman@techsingularity.net>

Minor comments only.

> ---
>  mm/page_alloc.c | 22 ++++++++++++++++++----
>  1 file changed, 18 insertions(+), 4 deletions(-)
>
> [...]
> @@ -3141,7 +3140,6 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  		 * pages are ordered properly.
>  		 */
>  		list_add_tail(&page->pcp_list, list);
> -		allocated++;
>  		if (is_migrate_cma(get_pcppage_migratetype(page)))
>  			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
>  					      -(1 << order));

As a side-effect, the NR_FREE_CMA_PAGES accounting no longer drifts when
bad pages are found. It's rarely an issue, and when it is an issue,
corruption due to a use-after-free has occurred and the system is already
potentially screwed. It's not enough to justify a patch split or a -stable
backport, and the drift has probably existed forever. I don't remember the
last time these checks actually caught corruption of struct pages,
although years ago I was told they often found problems.

> @@ -3155,6 +3153,22 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  	 */
>  	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
>  	spin_unlock_irqrestore(&zone->lock, flags);
> +
> +	/*
> +	 * Pages are appended to the pcp list without checking to reduce the
> +	 * time holding the zone lock. Checking the appended pages happens right
> +	 * after the critical section while still holding the pcp lock.
> +	 */
> +	pos = list_first_entry(prev_tail, struct page, pcp_list);
> +	list_for_each_entry_safe_from(pos, n, list, pcp_list) {
> +		if (unlikely(check_pcp_refill(pos, order))) {
> +			list_del(&pos->pcp_list);
> +			continue;
> +		}
> +
> +		allocated++;
> +	}
> +

Minor nit, and not important for this patch, but the list is traversed
even when check_pcp_refill() does nothing but return false. The associated
helpers like check_pcp_refill() under CONFIG_DEBUG_VM would need a
companion helper to determine whether the list traversal is necessary at
all (a hypothetical sketch follows below). Maybe the compiler can figure
it out, but I doubt it due to the static branches.

Secondly, this does not kill the patch, but the performance benefit is
likely artificial or limited in most cases. It specifically affects the
case where a buddy allocation such as THP or HugeTLBFS happens in parallel
with allocations that are refilling the PCP lists. The most likely time
for this to happen is during early init of a memory-intensive parallelised
application. While the init phase *can* be critical when the primary
metric is "time before an application is warmed up", it's not common. Even
if the patch is only taking care of a corner case, it's still justified.
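A hypothetical shape of that companion helper (the name and placement are
illustrative, not from this thread), mirroring the existing
check_pcp_refill() logic so the traversal can be skipped and "allocated"
set to "i" directly when no checking can happen:

#ifdef CONFIG_DEBUG_VM
static inline bool pcp_refill_checks_enabled(void)
{
	/*
	 * With DEBUG_VM, check_pcp_refill() only checks when
	 * debug_pagealloc is boot-time enabled.
	 */
	return debug_pagealloc_enabled_static();
}
#else
static inline bool pcp_refill_checks_enabled(void)
{
	/* Without DEBUG_VM, the refill path always checks. */
	return true;
}
#endif

	/* In rmqueue_bulk(), after dropping the zone lock: */
	if (pcp_refill_checks_enabled()) {
		/* traverse and check the newly added pages as above */
	} else {
		allocated = i;
	}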
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0745aedebb37..4b80438b1f59 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3119,6 +3119,8 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 {
 	unsigned long flags;
 	int i, allocated = 0;
+	struct list_head *prev_tail = list->prev;
+	struct page *pos, *n;
 
 	spin_lock_irqsave(&zone->lock, flags);
 	for (i = 0; i < count; ++i) {
@@ -3127,9 +3129,6 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		if (unlikely(page == NULL))
 			break;
 
-		if (unlikely(check_pcp_refill(page, order)))
-			continue;
-
 		/*
 		 * Split buddy pages returned by expand() are received here in
 		 * physical page order. The page is added to the tail of
@@ -3141,7 +3140,6 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		 * pages are ordered properly.
 		 */
 		list_add_tail(&page->pcp_list, list);
-		allocated++;
 		if (is_migrate_cma(get_pcppage_migratetype(page)))
 			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
 					      -(1 << order));
@@ -3155,6 +3153,22 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 	 */
 	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
 	spin_unlock_irqrestore(&zone->lock, flags);
+
+	/*
+	 * Pages are appended to the pcp list without checking to reduce the
+	 * time holding the zone lock. Checking the appended pages happens right
+	 * after the critical section while still holding the pcp lock.
+	 */
+	pos = list_first_entry(prev_tail, struct page, pcp_list);
+	list_for_each_entry_safe_from(pos, n, list, pcp_list) {
+		if (unlikely(check_pcp_refill(pos, order))) {
+			list_del(&pos->pcp_list);
+			continue;
+		}
+
+		allocated++;
+	}
+
 	return allocated;
 }
The `rmqueue_bulk` function batches the allocation of multiple elements to
refill the per-CPU buffers into a single hold of the zone lock. Each
element is allocated and checked using the `check_pcp_refill` function.
The check touches every related struct page, which is especially expensive
for higher-order allocations (huge pages). This patch reduces the time
holding the lock by moving the check out of the critical section, similar
to the `rmqueue_buddy` function, which allocates a single element.
Measurements of parallel allocation-heavy workloads show a reduction of
the average huge page allocation latency by 50 percent for two cores and
nearly 90 percent for 24 cores.

Signed-off-by: Alexander Halbuer <halbuer@sra.uni-hannover.de>
---
 mm/page_alloc.c | 22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)