Message ID: 20230201162549.68384-1-halbuer@sra.uni-hannover.de (mailing list archive)
State: New
Series: mm: reduce lock contention of pcp buffer refill
On Wed, 1 Feb 2023 17:25:49 +0100 Alexander Halbuer <halbuer@sra.uni-hannover.de> wrote:

> The `rmqueue_bulk` function batches the allocation of multiple elements to
> refill the per-CPU buffers into a single hold of the zone lock. Each
> element is allocated and checked using the `check_pcp_refill` function.
> The check touches every related struct page, which is especially expensive
> for higher-order allocations (huge pages). This patch reduces the time
> holding the lock by moving the check out of the critical section, similar
> to the `rmqueue_buddy` function, which allocates a single element.
> Measurements of parallel allocation-heavy workloads show a reduction of
> the average huge page allocation latency by 50 percent for two cores and
> nearly 90 percent for 24 cores.

Sounds nice.

Were you able to test how much benefit we get by simply removing the
check_new_pages() call from rmqueue_bulk()?

Vlastimil, I find this quite confusing:

#ifdef CONFIG_DEBUG_VM
/*
 * With DEBUG_VM enabled, order-0 pages are checked for expected state when
 * being allocated from pcp lists. With debug_pagealloc also enabled, they are
 * also checked when pcp lists are refilled from the free lists.
 */
static inline bool check_pcp_refill(struct page *page, unsigned int order)
{
	if (debug_pagealloc_enabled_static())
		return check_new_pages(page, order);
	else
		return false;
}

static inline bool check_new_pcp(struct page *page, unsigned int order)
{
	return check_new_pages(page, order);
}
#else
/*
 * With DEBUG_VM disabled, free order-0 pages are checked for expected state
 * when pcp lists are being refilled from the free lists. With debug_pagealloc
 * enabled, they are also checked when being allocated from the pcp lists.
 */
static inline bool check_pcp_refill(struct page *page, unsigned int order)
{
	return check_new_pages(page, order);
}

static inline bool check_new_pcp(struct page *page, unsigned int order)
{
	if (debug_pagealloc_enabled_static())
		return check_new_pages(page, order);
	else
		return false;
}
#endif /* CONFIG_DEBUG_VM */

and the 4462b32c9285b5 changelog is a struggle to follow.

Why are we performing *any* checks when CONFIG_DEBUG_VM=n and when
debug_pagealloc_enabled is false?

Anyway, these checks sound quite costly, so let's revisit their
desirability?
On 2/3/23 00:25, Andrew Morton wrote:
> On Wed, 1 Feb 2023 17:25:49 +0100 Alexander Halbuer <halbuer@sra.uni-hannover.de> wrote:
>
>> The `rmqueue_bulk` function batches the allocation of multiple elements to
>> refill the per-CPU buffers into a single hold of the zone lock.
>> [...]
>
> Sounds nice.
>
> Were you able to test how much benefit we get by simply removing the
> check_new_pages() call from rmqueue_bulk()?

I did some further investigations and measurements to quantify potential
performance gains. Benchmarks ran on a machine with 24 physical cores
fixed at 2.1 GHz. The results show significant performance gains with the
patch applied in parallel scenarios. Eliminating the check reduces
allocation latency further, especially for low core counts.

The following tables show the average allocation latencies for huge pages
and normal pages for three different configurations: the unpatched kernel,
the patched kernel, and an additional version without the check. (The
Diff columns are relative to the unpatched Default column.)

Huge pages
+-------+---------+-------+----------+---------+----------+
| Cores | Default | Patch | Patch    | NoCheck | NoCheck  |
|       | (ns)    | (ns)  | Diff     | (ns)    | Diff     |
+-------+---------+-------+----------+---------+----------+
|     1 |     127 |   124 |  (-2.4%) |     118 |  (-7.1%) |
|     2 |     140 |   140 |  (-0.0%) |     134 |  (-4.3%) |
|     3 |     143 |   142 |  (-0.7%) |     134 |  (-6.3%) |
|     4 |     178 |   159 | (-10.7%) |     156 | (-12.4%) |
|     6 |     269 |   239 | (-11.2%) |     237 | (-11.9%) |
|     8 |     363 |   321 | (-11.6%) |     319 | (-12.1%) |
|    10 |     454 |   409 |  (-9.9%) |     409 |  (-9.9%) |
|    12 |     545 |   494 |  (-9.4%) |     488 | (-10.5%) |
|    14 |     639 |   578 |  (-9.5%) |     574 | (-10.2%) |
|    16 |     735 |   660 | (-10.2%) |     653 | (-11.2%) |
|    20 |     915 |   826 |  (-9.7%) |     815 | (-10.9%) |
|    24 |    1105 |   992 | (-10.2%) |     982 | (-11.1%) |
+-------+---------+-------+----------+---------+----------+

Normal pages
+-------+---------+-------+----------+---------+----------+
| Cores | Default | Patch | Patch    | NoCheck | NoCheck  |
|       | (ns)    | (ns)  | Diff     | (ns)    | Diff     |
+-------+---------+-------+----------+---------+----------+
|     1 |    2790 |  2767 |  (-0.8%) |     171 | (-93.9%) |
|     2 |    6685 |  3484 | (-47.9%) |     519 | (-92.2%) |
|     3 |   10501 |  3599 | (-65.7%) |     855 | (-91.9%) |
|     4 |   14264 |  3635 | (-74.5%) |    1139 | (-92.0%) |
|     6 |   21800 |  3551 | (-83.7%) |    1713 | (-92.1%) |
|     8 |   29563 |  3570 | (-87.9%) |    2268 | (-92.3%) |
|    10 |   37210 |  3845 | (-89.7%) |    2872 | (-92.3%) |
|    12 |   44780 |  4452 | (-90.1%) |    3417 | (-92.4%) |
|    14 |   52576 |  5100 | (-90.3%) |    4020 | (-92.4%) |
|    16 |   60118 |  5785 | (-90.4%) |    4604 | (-92.3%) |
|    20 |   75037 |  7270 | (-90.3%) |    6486 | (-91.4%) |
|    24 |   90226 |  8712 | (-90.3%) |    7061 | (-92.2%) |
+-------+---------+-------+----------+---------+----------+

> Vlastimil, I find this quite confusing:
>
> [check_pcp_refill()/check_new_pcp() definitions ...]
>
> and the 4462b32c9285b5 changelog is a struggle to follow.
>
> Why are we performing *any* checks when CONFIG_DEBUG_VM=n and when
> debug_pagealloc_enabled is false?
>
> Anyway, these checks sound quite costly, so let's revisit their
> desirability?
On 2/3/23 00:25, Andrew Morton wrote:
> On Wed, 1 Feb 2023 17:25:49 +0100 Alexander Halbuer <halbuer@sra.uni-hannover.de> wrote:
>
>> The `rmqueue_bulk` function batches the allocation of multiple elements to
>> refill the per-CPU buffers into a single hold of the zone lock.
>> [...]
>
> Sounds nice.
>
> Were you able to test how much benefit we get by simply removing the
> check_new_pages() call from rmqueue_bulk()?
>
> Vlastimil, I find this quite confusing:
>
> [check_pcp_refill()/check_new_pcp() definitions ...]
>
> and the 4462b32c9285b5 changelog is a struggle to follow.
>
> Why are we performing *any* checks when CONFIG_DEBUG_VM=n and when
> debug_pagealloc_enabled is false?

Well, AFAIK the history was like this:

- At first we always did the checks when allocating or freeing a page,
  even when it was allocated/freed from the pcplist.

- Then Mel, in 479f854a207c and 4db7548ccbd9, changed it so the checks
  were done only when moving pages between the pcplists and the zone's
  freelists, so the pcplist-cached fast paths became faster. But that
  means some errors may no longer be caught, so with DEBUG_VM the checks
  were still done on every alloc/free.

- My 4462b32c9285b5 changed it so that when debug_pagealloc is boot-time
  enabled, the checks happen both on pcplist and zone freelist
  alloc/free. This was mainly to allow enabling the checks on production
  kernels without recompiling with DEBUG_VM. But it's a mode where
  catching the culprit is more desirable than peak performance.

> Anyway, these checks sound quite costly, so let's revisit their
> desirability?

So AFAIK we never went without doing the checks at all. But given that by
default we haven't done them on the pcplists for years, and the majority
of allocations use the pcplists, maybe indeed we won't lose much coverage
by not doing the checks at all.

But I also wonder what the kernel hardening folks think here - are
hardened kernels usually built with DEBUG_VM or debug_pagealloc enabled,
or would you like to hook some other kernel option for keeping the checks
on page alloc/free active? And should those checks be done on every
alloc/free, including pcplist-cached allocations?
On 2/7/23 17:11, Alexander Halbuer wrote:
> On 2/3/23 00:25, Andrew Morton wrote:
>> Were you able to test how much benefit we get by simply removing the
>> check_new_pages() call from rmqueue_bulk()?
>
> I did some further investigations and measurements to quantify potential
> performance gains. Benchmarks ran on a machine with 24 physical cores
> fixed at 2.1 GHz. The results show significant performance gains with the
> patch applied in parallel scenarios. Eliminating the check reduces
> allocation latency further, especially for low core counts.

Are the benchmarks done via a kernel module calling bulk allocation, or
from userspace?

> The following tables show the average allocation latencies for huge pages
> and normal pages for three different configurations: the unpatched kernel,
> the patched kernel, and an additional version without the check.
>
> Huge pages
> [...]
>
> Normal pages
> [...]

Nice improvements. I assume the table headings are switched? Why would
Normal be more expensive than Huge?
On 2/1/23 17:25, Alexander Halbuer wrote:
> The `rmqueue_bulk` function batches the allocation of multiple elements to
> refill the per-CPU buffers into a single hold of the zone lock. Each
> element is allocated and checked using the `check_pcp_refill` function.
> The check touches every related struct page, which is especially expensive
> for higher-order allocations (huge pages). This patch reduces the time
> holding the lock by moving the check out of the critical section, similar
> to the `rmqueue_buddy` function, which allocates a single element.
> Measurements of parallel allocation-heavy workloads show a reduction of
> the average huge page allocation latency by 50 percent for two cores and
> nearly 90 percent for 24 cores.
>
> Signed-off-by: Alexander Halbuer <halbuer@sra.uni-hannover.de>

Even if we proceed with disabling the checks in default
non-debugging/non-hardened configurations, this would still help those
configurations, so:

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Suggestion below:

> ---
>  mm/page_alloc.c | 22 ++++++++++++++++++----
>  1 file changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0745aedebb37..4b80438b1f59 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3119,6 +3119,8 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  {
>  	unsigned long flags;
>  	int i, allocated = 0;
> +	struct list_head *prev_tail = list->prev;
> +	struct page *pos, *n;
>
>  	spin_lock_irqsave(&zone->lock, flags);
>  	for (i = 0; i < count; ++i) {
> @@ -3127,9 +3129,6 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  		if (unlikely(page == NULL))
>  			break;
>
> -		if (unlikely(check_pcp_refill(page, order)))
> -			continue;
> -
>  		/*
>  		 * Split buddy pages returned by expand() are received here in
>  		 * physical page order. The page is added to the tail of
> @@ -3141,7 +3140,6 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  		 * pages are ordered properly.
>  		 */
>  		list_add_tail(&page->pcp_list, list);
> -		allocated++;
>  		if (is_migrate_cma(get_pcppage_migratetype(page)))
>  			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
>  					      -(1 << order));

As another benefit of your patch, the NR_FREE_CMA_PAGES counter will no
longer become inaccurate when we leak CMA pages that fail the check.

You could also try another patch that moves the above check into the loop
below, and see if it makes any difference in your benchmark. The loop
could count is_migrate_cma pages, and afterwards do a single
"if (cma_pages > 0) mod_zone_page_state(...)" - because we are no longer
inside the spin_lock_irqsave() block, we need to use the safe
mod_zone_page_state() variant without underscores (see the sketch after
this mail). Thanks!

> @@ -3155,6 +3153,22 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  	 */
>  	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
>  	spin_unlock_irqrestore(&zone->lock, flags);
> +
> +	/*
> +	 * Pages are appended to the pcp list without checking to reduce the
> +	 * time holding the zone lock. Checking the appended pages happens right
> +	 * after the critical section while still holding the pcp lock.
> +	 */
> +	pos = list_first_entry(prev_tail, struct page, pcp_list);
> +	list_for_each_entry_safe_from(pos, n, list, pcp_list) {
> +		if (unlikely(check_pcp_refill(pos, order))) {
> +			list_del(&pos->pcp_list);
> +			continue;
> +		}
> +
> +		allocated++;
> +	}
> +
>  	return allocated;
>  }
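To make the suggestion above concrete, here is a rough, untested sketch of
how the post-unlock check loop could absorb the CMA accounting. The new
local variable ("int cma_pages = 0;") is an assumption; everything else
follows the patch under review:

	pos = list_first_entry(prev_tail, struct page, pcp_list);
	list_for_each_entry_safe_from(pos, n, list, pcp_list) {
		if (unlikely(check_pcp_refill(pos, order))) {
			list_del(&pos->pcp_list);
			continue;
		}

		/* Count CMA pages here instead of under the zone lock. */
		if (is_migrate_cma(get_pcppage_migratetype(pos)))
			cma_pages++;
		allocated++;
	}

	/*
	 * One batched counter update. We are outside spin_lock_irqsave()
	 * here, so the non-__ variant must be used, and pages that failed
	 * the check are correctly excluded from the adjustment.
	 */
	if (cma_pages > 0)
		mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
				    -(cma_pages << order));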
On 2/8/23 16:11, Vlastimil Babka wrote:
> On 2/7/23 17:11, Alexander Halbuer wrote:
>> I did some further investigations and measurements to quantify potential
>> performance gains. Benchmarks ran on a machine with 24 physical cores
>> fixed at 2.1 GHz. [...]
>
> Are the benchmarks done via a kernel module calling bulk allocation, or
> from userspace?

The benchmarks are done via a kernel module using the alloc_pages_node()
function to allocate from a specific NUMA node. (The test machine is a
dual-socket system featuring two Intel Xeon Gold 6252 with 24 physical
cores each; benchmark threads are pinned to the first CPU to prevent
additional NUMA effects.)

The module was originally developed as part of a discussion about
different allocator designs, to get real numbers from an actual buddy
allocator implementation. Measurements cover different allocation
scenarios (bulk, repeat, random) for different allocation sizes (orders)
and degrees of parallelism. The parallel bulk allocation benchmark showed
an outlier for huge page allocations (order 9) that was much worse than
order 8 and even order 10, despite the lack of a pcp buffer for those
sizes. This was the origin of my further investigations.

The benchmark module internally uses multiple threads that independently
perform a specified number of allocations in a loop around the mentioned
alloc_pages_node() function; a barrier synchronizes all those threads at
the beginning (a simplified sketch of such a loop follows below). The
values shown in the tables are the average durations of all threads
divided by the number of allocations, to break the result down to a
single allocation.

>> The following tables show the average allocation latencies for huge pages
>> and normal pages for three different configurations: the unpatched kernel,
>> the patched kernel, and an additional version without the check.
>>
>> Huge pages
>> [...]
>>
>> Normal pages
>> [...]
>
> Nice improvements. I assume the table headings are switched? Why would
> Normal be more expensive than Huge?

Yes, you are right, I messed up the table headings.
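For illustration, a minimal, hypothetical sketch of the kind of per-thread
measurement loop described above might look as follows. The function name
and the alloc-then-free ("repeat"-style) structure are assumptions; only
the use of alloc_pages_node() and the barrier/pinning setup come from the
thread itself:

#include <linux/gfp.h>
#include <linux/math64.h>
#include <linux/timekeeping.h>

/*
 * Hypothetical per-thread benchmark body (not the actual module from
 * this thread). Each thread calls this after the start barrier; the
 * caller averages the per-thread results.
 */
static u64 bench_alloc_avg_ns(int nid, unsigned int order, unsigned long n)
{
	unsigned long i;
	u64 start, end;

	start = ktime_get_ns();
	for (i = 0; i < n; i++) {
		struct page *page = alloc_pages_node(nid, GFP_KERNEL, order);

		/* Free immediately; a "bulk" scenario would batch frees. */
		if (likely(page))
			__free_pages(page, order);
	}
	end = ktime_get_ns();

	return div_u64(end - start, n);	/* average ns per allocation */
}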
On Wed, Feb 08, 2023 at 11:45:14AM +0100, Vlastimil Babka wrote:
> But I also wonder what the kernel hardening folks think here - are
> hardened kernels usually built with DEBUG_VM or debug_pagealloc enabled,
> or would you like to hook some other kernel option for keeping the checks
> on page alloc/free active? And should those checks be done on every
> alloc/free, including pcplist-cached allocations?

What we're depending on for heap-related (i.e. both page allocator and
slab) hardening currently is:

- CONFIG_SLAB_FREELIST_HARDENED
  - pointer obfuscation (SLUB) -- freelist_ptr(), set_freepointer()
    (illustrated below)
  - pool membership verification (SLUB and SLAB) -- cache_from_obj()
  - consecutive double-free detection (SLUB and SLAB) -- __free_one()
- allocation order randomization
  - CONFIG_SLAB_FREELIST_RANDOM (SLUB and SLAB)
  - CONFIG_SHUFFLE_PAGE_ALLOCATOR (page allocator)
- memory wiping (both slab and page allocator)
  - init_on_alloc / CONFIG_INIT_ON_ALLOC_DEFAULT_ON
  - init_on_free / CONFIG_INIT_ON_FREE_DEFAULT_ON

It'd be nice to gain slab redzone verification, but that seems expensive
enough that anyone interested in that level of hardening has likely
turned on full KASAN.
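For reference, the pointer obfuscation mentioned in the first item above
looks roughly like this - a simplified sketch of SLUB's freelist_ptr()
from mm/slub.c of that era; exact details (e.g. KASAN tag handling) differ
between kernel versions:

/*
 * Simplified sketch of SLUB freelist pointer obfuscation under
 * CONFIG_SLAB_FREELIST_HARDENED. The stored next-free pointer is XORed
 * with a per-cache random value and the byte-swapped storage address, so
 * leaking or corrupting a freelist entry neither reveals nor trivially
 * redirects a real kernel pointer.
 */
static inline void *freelist_ptr(const struct kmem_cache *s, void *ptr,
				 unsigned long ptr_addr)
{
#ifdef CONFIG_SLAB_FREELIST_HARDENED
	return (void *)((unsigned long)ptr ^ s->random ^ swab(ptr_addr));
#else
	return ptr;
#endif
}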
On Wed, Feb 01, 2023 at 05:25:49PM +0100, Alexander Halbuer wrote:
> The `rmqueue_bulk` function batches the allocation of multiple elements to
> refill the per-CPU buffers into a single hold of the zone lock.
> [...]
>
> Signed-off-by: Alexander Halbuer <halbuer@sra.uni-hannover.de>

Acked-by: Mel Gorman <mgorman@techsingularity.net>

Minor comments only.

> ---
>  mm/page_alloc.c | 22 ++++++++++++++++++----
>  1 file changed, 18 insertions(+), 4 deletions(-)
>
> [...]
> @@ -3141,7 +3140,6 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  		 * pages are ordered properly.
>  		 */
>  		list_add_tail(&page->pcp_list, list);
> -		allocated++;
>  		if (is_migrate_cma(get_pcppage_migratetype(page)))
>  			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
>  					      -(1 << order));

As a side-effect, the NR_FREE_CMA_PAGES accounting no longer drifts when
bad pages are found. It's rarely an issue, and when it is an issue,
corruption due to a use-after-free has occurred and the system is already
potentially screwed. It's not enough to justify a patch split or a -stable
backport, and the drift has probably existed forever. I don't remember the
last time these checks actually caught corruption of struct pages,
although years ago I was told they often found problems.

> @@ -3155,6 +3153,22 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  	 */
>  	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
>  	spin_unlock_irqrestore(&zone->lock, flags);
> +
> +	/*
> +	 * Pages are appended to the pcp list without checking to reduce the
> +	 * time holding the zone lock. Checking the appended pages happens right
> +	 * after the critical section while still holding the pcp lock.
> +	 */
> +	pos = list_first_entry(prev_tail, struct page, pcp_list);
> +	list_for_each_entry_safe_from(pos, n, list, pcp_list) {
> +		if (unlikely(check_pcp_refill(pos, order))) {
> +			list_del(&pos->pcp_list);
> +			continue;
> +		}
> +
> +		allocated++;
> +	}
> +

Minor nit, and not important for this patch, but the list is traversed
even when check_pcp_refill() does nothing but return false. The associated
helpers like check_pcp_refill() under CONFIG_DEBUG_VM would need a
companion helper to determine whether the list traversal is necessary at
all (a hypothetical sketch follows below). Maybe the compiler can figure
it out, but I doubt it due to the static branches.

Secondly, this does not kill the patch, but the performance benefit is
likely artificial or limited in most cases. It specifically affects the
case where a buddy allocation such as THP or HugeTLBFS happens in parallel
with allocations that are refilling the PCP lists. The most likely time
for this to happen is during early init of a memory-intensive parallelised
application. While the init phase *can* be critical when the primary
metric is "time before an application is warmed up", it's not common. Even
if the patch is only taking care of a corner case, it's still justified.
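A hypothetical shape of that companion helper (the name and placement are
illustrative, not from this thread), mirroring the existing
check_pcp_refill() logic so the traversal can be skipped and "allocated"
set to "i" directly when no checking can happen:

#ifdef CONFIG_DEBUG_VM
static inline bool pcp_refill_checks_enabled(void)
{
	/*
	 * With DEBUG_VM, check_pcp_refill() only checks when
	 * debug_pagealloc is boot-time enabled.
	 */
	return debug_pagealloc_enabled_static();
}
#else
static inline bool pcp_refill_checks_enabled(void)
{
	/* Without DEBUG_VM, the refill path always checks. */
	return true;
}
#endif

	/* In rmqueue_bulk(), after dropping the zone lock: */
	if (pcp_refill_checks_enabled()) {
		/* traverse and check the newly added pages as above */
	} else {
		allocated = i;
	}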
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0745aedebb37..4b80438b1f59 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3119,6 +3119,8 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 {
 	unsigned long flags;
 	int i, allocated = 0;
+	struct list_head *prev_tail = list->prev;
+	struct page *pos, *n;
 
 	spin_lock_irqsave(&zone->lock, flags);
 	for (i = 0; i < count; ++i) {
@@ -3127,9 +3129,6 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		if (unlikely(page == NULL))
 			break;
 
-		if (unlikely(check_pcp_refill(page, order)))
-			continue;
-
 		/*
 		 * Split buddy pages returned by expand() are received here in
 		 * physical page order. The page is added to the tail of
@@ -3141,7 +3140,6 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		 * pages are ordered properly.
 		 */
 		list_add_tail(&page->pcp_list, list);
-		allocated++;
 		if (is_migrate_cma(get_pcppage_migratetype(page)))
 			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
 					      -(1 << order));
@@ -3155,6 +3153,22 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 	 */
 	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
 	spin_unlock_irqrestore(&zone->lock, flags);
+
+	/*
+	 * Pages are appended to the pcp list without checking to reduce the
+	 * time holding the zone lock. Checking the appended pages happens right
+	 * after the critical section while still holding the pcp lock.
+	 */
+	pos = list_first_entry(prev_tail, struct page, pcp_list);
+	list_for_each_entry_safe_from(pos, n, list, pcp_list) {
+		if (unlikely(check_pcp_refill(pos, order))) {
+			list_del(&pos->pcp_list);
+			continue;
+		}
+
+		allocated++;
+	}
+
 	return allocated;
 }
The `rmqueue_bulk` function batches the allocation of multiple elements to
refill the per-CPU buffers into a single hold of the zone lock. Each
element is allocated and checked using the `check_pcp_refill` function.
The check touches every related struct page, which is especially expensive
for higher-order allocations (huge pages). This patch reduces the time
holding the lock by moving the check out of the critical section, similar
to the `rmqueue_buddy` function, which allocates a single element.
Measurements of parallel allocation-heavy workloads show a reduction of
the average huge page allocation latency by 50 percent for two cores and
nearly 90 percent for 24 cores.

Signed-off-by: Alexander Halbuer <halbuer@sra.uni-hannover.de>
---
 mm/page_alloc.c | 22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)