[0/11,v3] Use local_lock for pcp protection and reduce stat overhead

Message ID 20210414133931.4555-1-mgorman@techsingularity.net

Message

Mel Gorman April 14, 2021, 1:39 p.m. UTC
Changelog since v2
o Fix zonestats initialisation
o Merged memory hotplug fix separately
o Embed local_lock within per_cpu_pages

This series requires patches in Andrew's tree so for convenience, it's
also available at

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-percpu-local_lock-v3r6

The PCP (per-cpu page allocator in page_alloc.c) shares locking
requirements with vmstat and the zone lock, which is inconvenient and
causes some issues. First, the PCP list and vmstat share the same
per-cpu space, meaning that vmstat updates can dirty cache lines holding
per-cpu lists across CPUs unless padding is used. Second, PREEMPT_RT
does not want to disable IRQs for too long in the page allocator.

This series splits the locking requirements, uses lock types more
suitable for PREEMPT_RT, reduces the time when special locking is required
for stats and reduces the time when IRQs need to be disabled on !PREEMPT_RT
kernels.

Why local_lock? PREEMPT_RT considers the following sequence to be unsafe
as documented in Documentation/locking/locktypes.rst

   local_irq_disable();
   spin_lock(&lock);

The pcp allocator has this sequence for rmqueue_pcplist (local_irq_save)
-> __rmqueue_pcplist -> rmqueue_bulk (spin_lock). While it's possible to
separate this out, it generally means there are points where we enable
IRQs and immediately disable them again. To prevent migration and the
per-cpu pointer going stale, migrate_disable is also needed. That amounts
to a custom lock that is similar to, but worse than, local_lock.
Furthermore, on PREEMPT_RT, it's undesirable to leave IRQs disabled for
too long.
By converting to local_lock, which disables migration on PREEMPT_RT, the
locking requirements can be separated and the protections for the PCP,
stats and the zone lock can start moving to PREEMPT_RT-safe equivalents.
As a bonus, local_lock also means that PROVE_LOCKING does something useful.
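To illustrate, here is a rough sketch of the conversion in the PCP
allocation path. The pagesets local_lock and struct per_cpu_pages names
follow the series (the final patch later embeds the lock in
per_cpu_pages), but the function shape and arguments below are
simplified rather than the literal patch:

struct pagesets {
	local_lock_t lock;
};
static DEFINE_PER_CPU(struct pagesets, pagesets) = {
	.lock = INIT_LOCAL_LOCK(lock),
};

/* Simplified sketch, not the exact rmqueue_pcplist() from the patch */
static struct page *rmqueue_pcplist(struct zone *zone, gfp_t gfp_flags,
				    int migratetype, unsigned int alloc_flags)
{
	struct per_cpu_pages *pcp;
	struct list_head *list;
	struct page *page;
	unsigned long flags;

	/* Previously local_irq_save(flags), which PROVE_LOCKING cannot see */
	local_lock_irqsave(&pagesets.lock, flags);
	pcp = this_cpu_ptr(zone->per_cpu_pageset);
	list = &pcp->lists[migratetype];
	page = __rmqueue_pcplist(zone, migratetype, alloc_flags, pcp, list);
	local_unlock_irqrestore(&pagesets.lock, flags);

	return page;
}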

After that, it's obvious that zone_statistics incurs too much overhead
and leaves IRQs disabled for longer than necessary on !PREEMPT_RT
kernels. zone_statistics uses perfectly accurate counters requiring IRQs
be disabled for parallel RMW sequences when inaccurate ones like vm_events
would do. The series makes the NUMA statistics (NUMA_HIT and friends)
inaccurate counters that then require no special protection on !PREEMPT_RT.
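As a sketch of what "inaccurate" means here, the hot path becomes a
plain per-cpu increment with no IRQ or atomic protection. The
per_cpu_zonestat and vm_numa_event names follow the series, but the
helper below is simplified:

/* Sketch: a rare lost update is tolerated, so no protection is needed */
static inline void __count_numa_event(struct zone *zone,
				      enum numa_stat_item item)
{
	struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;

	raw_cpu_inc(pzstats->vm_numa_event[item]);
}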

The bulk page allocator can then do stat updates in bulk with IRQs enabled,
which should improve efficiency. Technically, this could have been
done without the local_lock and vmstat conversion work; the order
simply reflects the timing of when the different series were implemented.
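A sketch of the resulting pattern in the bulk allocator follows. The
helper names mirror the series but the function shape is simplified
(and the preferred zone is assumed equal to the zone); the point is
only the ordering: pages are taken from the PCP under the local lock
and the statistics are accounted once, in bulk, after IRQs are enabled
again:

static int bulk_alloc_sketch(struct zone *zone, int migratetype,
			     unsigned int alloc_flags, int nr_pages,
			     struct page **page_array)
{
	struct per_cpu_pages *pcp;
	struct list_head *list;
	unsigned long flags;
	int nr_account = 0;

	local_lock_irqsave(&pagesets.lock, flags);
	pcp = this_cpu_ptr(zone->per_cpu_pageset);
	list = &pcp->lists[migratetype];

	while (nr_account < nr_pages) {
		struct page *page = __rmqueue_pcplist(zone, migratetype,
						      alloc_flags, pcp, list);
		if (!page)
			break;
		page_array[nr_account++] = page;
	}
	local_unlock_irqrestore(&pagesets.lock, flags);

	/* One batched update with IRQs enabled instead of one per page */
	__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
	zone_statistics(zone, zone, nr_account);

	return nr_account;
}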

Finally, there are places where we conflate IRQs being disabled for the
PCP with the IRQ-safe zone spinlock. The remainder of the series reduces
the scope of what is protected by disabled IRQs on !PREEMPT_RT kernels.
By the end of the series, page_alloc.c does not call local_irq_save at
all, so the locking scope is a bit clearer. The one exception is that
modifying NR_FREE_PAGES still happens in places where it is known that
IRQs are disabled, as it's harmless for PREEMPT_RT and it would be
expensive to split the locking there.
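As one concrete example of the scope reduction, a patch in the series
makes free_one_page() take the zone lock with spin_lock_irqsave()
itself instead of relying on callers having disabled IRQs. A simplified
sketch of the result, with the body trimmed to the relevant calls:

static void free_one_page(struct zone *zone, struct page *page,
			  unsigned long pfn, unsigned int order,
			  int migratetype, fpi_t fpi_flags)
{
	unsigned long flags;

	/* zone->lock now disables IRQs itself, independent of the PCP lock */
	spin_lock_irqsave(&zone->lock, flags);
	if (unlikely(has_isolate_pageblock(zone) ||
		     is_migrate_isolate(migratetype)))
		migratetype = get_pfnblock_migratetype(page, pfn);
	__free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
	spin_unlock_irqrestore(&zone->lock, flags);
}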

No performance data is included because, despite the overhead of the stats,
it's within the noise for most workloads on !PREEMPT_RT. However, Jesper
Dangaard Brouer ran a page allocation microbenchmark on an E5-1650 v4 @
3.60GHz CPU on the first version of this series. Focusing on the array
variant of the bulk page allocator reveals the following.

(CPU: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz)
ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size

 Size    Baseline        Patched
 1       56.383          54.225 (+3.83%)
 2       40.047          35.492 (+11.38%)
 3       37.339          32.643 (+12.58%)
 4       35.578          30.992 (+12.89%)
 8       33.592          29.606 (+11.87%)
 16      32.362          28.532 (+11.85%)
 32      31.476          27.728 (+11.91%)
 64      30.633          27.252 (+11.04%)
 128     30.596          27.090 (+11.46%)

While this is a positive outcome, the series is more likely to be
interesting to the RT people in terms of getting parts of the PREEMPT_RT
tree into mainline.

 drivers/base/node.c    |  18 +--
 include/linux/mmzone.h |  58 ++++++--
 include/linux/vmstat.h |  65 +++++----
 mm/mempolicy.c         |   2 +-
 mm/page_alloc.c        | 302 +++++++++++++++++++++++++----------------
 mm/vmstat.c            | 250 ++++++++++++----------------------
 6 files changed, 370 insertions(+), 325 deletions(-)

Comments

Vlastimil Babka April 15, 2021, 12:25 p.m. UTC | #1
On 4/14/21 3:39 PM, Mel Gorman wrote:
> Historically when freeing pages, free_one_page() assumed that callers
> had IRQs disabled and the zone->lock could be acquired with spin_lock().
> This confuses the scope of what local_lock_irq is protecting and what
> zone->lock is protecting in free_unref_page_list in particular.
> 
> This patch uses spin_lock_irqsave() for the zone->lock in
> free_one_page() instead of relying on callers to have disabled
> IRQs. free_unref_page_commit() is changed to only deal with PCP pages
> protected by the local lock. free_unref_page_list() then first frees
> isolated pages to the buddy lists with free_one_page() and frees the rest
> of the pages to the PCP via free_unref_page_commit(). The end result
> is that free_one_page() no longer depends on side-effects of
> local_lock to be correct.
> 
> Note that this may incur a performance penalty while memory hot-remove
> is running but that is not a common operation.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

A nit below:

> @@ -3294,6 +3295,7 @@ void free_unref_page_list(struct list_head *list)
>  	struct page *page, *next;
>  	unsigned long flags, pfn;
>  	int batch_count = 0;
> +	int migratetype;
>  
>  	/* Prepare pages for freeing */
>  	list_for_each_entry_safe(page, next, list, lru) {
> @@ -3301,15 +3303,28 @@ void free_unref_page_list(struct list_head *list)
>  		if (!free_unref_page_prepare(page, pfn))
>  			list_del(&page->lru);
>  		set_page_private(page, pfn);

Should probably move this below so we don't set private for pages that then go
through free_one_page()? Doesn't seem to be a bug, just unnecessary.

> +
> +		/*
> +		 * Free isolated pages directly to the allocator, see
> +		 * comment in free_unref_page.
> +		 */
> +		migratetype = get_pcppage_migratetype(page);
> +		if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
> +			if (unlikely(is_migrate_isolate(migratetype))) {
> +				free_one_page(page_zone(page), page, pfn, 0,
> +							migratetype, FPI_NONE);
> +				list_del(&page->lru);
> +			}
> +		}
>  	}
>  
>  	local_lock_irqsave(&pagesets.lock, flags);
>  	list_for_each_entry_safe(page, next, list, lru) {
> -		unsigned long pfn = page_private(page);
> -
> +		pfn = page_private(page);
>  		set_page_private(page, 0);
> +		migratetype = get_pcppage_migratetype(page);
>  		trace_mm_page_free_batched(page);
> -		free_unref_page_commit(page, pfn);
> +		free_unref_page_commit(page, pfn, migratetype);
>  
>  		/*
>  		 * Guard against excessive IRQ disabled times when we get
>
Mel Gorman April 15, 2021, 2:11 p.m. UTC | #2
On Thu, Apr 15, 2021 at 02:25:36PM +0200, Vlastimil Babka wrote:
> > @@ -3294,6 +3295,7 @@ void free_unref_page_list(struct list_head *list)
> >  	struct page *page, *next;
> >  	unsigned long flags, pfn;
> >  	int batch_count = 0;
> > +	int migratetype;
> >  
> >  	/* Prepare pages for freeing */
> >  	list_for_each_entry_safe(page, next, list, lru) {
> > @@ -3301,15 +3303,28 @@ void free_unref_page_list(struct list_head *list)
> >  		if (!free_unref_page_prepare(page, pfn))
> >  			list_del(&page->lru);
> >  		set_page_private(page, pfn);
> 
> Should probably move this below so we don't set private for pages that then go
> through free_one_page()? Doesn't seem to be a bug, just unnecessary.
> 

Sure.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1d87ca364680..a9c1282d9c7b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3293,7 +3293,6 @@ void free_unref_page_list(struct list_head *list)
 		pfn = page_to_pfn(page);
 		if (!free_unref_page_prepare(page, pfn))
 			list_del(&page->lru);
-		set_page_private(page, pfn);
 
 		/*
 		 * Free isolated pages directly to the allocator, see
@@ -3307,6 +3306,8 @@ void free_unref_page_list(struct list_head *list)
 				list_del(&page->lru);
 			}
 		}
+
+		set_page_private(page, pfn);
 	}
 
 	local_lock_irqsave(&pagesets.lock, flags);
Vlastimil Babka April 15, 2021, 2:53 p.m. UTC | #3
On 4/14/21 3:39 PM, Mel Gorman wrote:
> struct per_cpu_pages is protected by the pagesets lock but the lock can be
> embedded within struct per_cpu_pages at a minor cost. This is possible
> because per-cpu lookups are based on offsets. Paraphrasing an explanation
> from Peter Zijlstra
> 
>   The whole thing relies on:
> 
>     &per_cpu_ptr(msblk->stream, cpu)->lock == per_cpu_ptr(&msblk->stream->lock, cpu)
> 
>   Which is true because the lhs:
> 
>     (local_lock_t *)((zone->per_cpu_pages + per_cpu_offset(cpu)) + offsetof(struct per_cpu_pages, lock))
> 
>   and the rhs:
> 
>     (local_lock_t *)((zone->per_cpu_pages + offsetof(struct per_cpu_pages, lock)) + per_cpu_offset(cpu))
> 
>   are identical, because addition is associative.
> 
> More details are included in mmzone.h. This embedding is not completely
> free for three reasons.
> 
> 1. As local_lock does not return a per-cpu structure, the PCP has to
>    be looked up twice -- first to acquire the lock and again to get the
>    PCP pointer.
> 
> 2. For PREEMPT_RT and CONFIG_DEBUG_LOCK_ALLOC, local_lock is potentially
>    a spinlock or has lock-specific tracking. In both cases, it becomes
>    necessary to release/acquire different locks when freeing a list of
>    pages in free_unref_page_list.

Looks like this pattern could benefit from a local_lock API helper that would do
the right thing? It probably couldn't do much to optimize the CONFIG_PREEMPT_RT
case, which would need to unlock/lock in any case, but CONFIG_DEBUG_LOCK_ALLOC
could perhaps keep IRQs disabled and just note the change of what's
acquired?

> 3. For most kernel configurations, local_lock_t is empty and no storage is
>    required. By embedding the lock, the memory consumption on PREEMPT_RT
>    and CONFIG_DEBUG_LOCK_ALLOC is higher.

But I wonder, is there really a benefit to this increased complexity? Before the
patch we had "pagesets" - a local_lock that protects all zones' pcplists. Now
each zone's pcplists have their own local_lock. On !PREEMPT_RT we will never take
the locks of multiple zones from the same CPU in parallel, because we use
local_lock_irqsave(). Can that parallelism happen on PREEMPT_RT, because that
could perhaps justify the change?

> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
Mel Gorman April 15, 2021, 3:29 p.m. UTC | #4
On Thu, Apr 15, 2021 at 04:53:46PM +0200, Vlastimil Babka wrote:
> On 4/14/21 3:39 PM, Mel Gorman wrote:
> > struct per_cpu_pages is protected by the pagesets lock but the lock can be
> > embedded within struct per_cpu_pages at a minor cost. This is possible
> > because per-cpu lookups are based on offsets. Paraphrasing an explanation
> > from Peter Zijlstra
> > 
> >   The whole thing relies on:
> > 
> >     &per_cpu_ptr(msblk->stream, cpu)->lock == per_cpu_ptr(&msblk->stream->lock, cpu)
> > 
> >   Which is true because the lhs:
> > 
> >     (local_lock_t *)((zone->per_cpu_pages + per_cpu_offset(cpu)) + offsetof(struct per_cpu_pages, lock))
> > 
> >   and the rhs:
> > 
> >     (local_lock_t *)((zone->per_cpu_pages + offsetof(struct per_cpu_pages, lock)) + per_cpu_offset(cpu))
> > 
> >   are identical, because addition is associative.
> > 
> > More details are included in mmzone.h. This embedding is not completely
> > free for three reasons.
> > 
> > 1. As local_lock does not return a per-cpu structure, the PCP has to
> >    be looked up twice -- first to acquire the lock and again to get the
> >    PCP pointer.
> > 
> > 2. For PREEMPT_RT and CONFIG_DEBUG_LOCK_ALLOC, local_lock is potentially
> >    a spinlock or has lock-specific tracking. In both cases, it becomes
> >    necessary to release/acquire different locks when freeing a list of
> >    pages in free_unref_page_list.
> 
> Looks like this pattern could benefit from a local_lock API helper that would do
> the right thing? It probably couldn't do much to optimize the CONFIG_PREEMPT_RT
> case, which would need to unlock/lock in any case, but CONFIG_DEBUG_LOCK_ALLOC
> could perhaps keep IRQs disabled and just note the change of what's
> acquired?
> 

A helper could potentially be used but right now, there is only one
call-site that needs this type of care, so it may be overkill. A helper
was proposed that can look up and lock a per-cpu structure, which is
generally useful but does not suit the case where different locks need
to be acquired.

> > 3. For most kernel configurations, local_lock_t is empty and no storage is
> >    required. By embedding the lock, the memory consumption on PREEMPT_RT
> >    and CONFIG_DEBUG_LOCK_ALLOC is higher.
> 
> But I wonder, is there really a benefit to this increased complexity? Before the
> patch we had "pagesets" - a local_lock that protects all zones' pcplists. Now
> each zone's pcplists have their own local_lock. On !PREEMPT_RT we will never take
> the locks of multiple zones from the same CPU in parallel, because we use
> local_lock_irqsave(). Can that parallelism happen on PREEMPT_RT, because that
> could perhaps justify the change?
> 

I don't think PREEMPT_RT gets additional parallelism because it's still
a per-cpu structure that is being protected. The difference is whether
we are protecting the CPU-N index for all per_cpu_pages or just one.
The patch exists because it was asked why the lock was not embedded within
the structure it's protecting. I initially thought that was unsafe and
I was wrong, as explained in the changelog. Now I find that it *can* be
done, but it's a bit ugly, so I put it at the end of the series where it
can be dropped if necessary.
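For anyone puzzling over the quoted equivalence, a stand-alone
demonstration of the pointer arithmetic is below. The types are
hypothetical userspace stand-ins, not kernel code; it only shows that
adding the cpu offset before or after taking the member offset lands on
the same address because addition is associative:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

struct pcp {
	long count;
	int lock;	/* stand-in for the embedded local_lock_t */
};

int main(void)
{
	struct pcp area[4];				/* stand-in for the per-cpu area */
	struct pcp *base = area;
	uintptr_t cpu_off = 2 * sizeof(struct pcp);	/* stand-in for per_cpu_offset(cpu) */

	/* lhs: add the cpu offset to the base, then take the member */
	int *lhs = &((struct pcp *)((uintptr_t)base + cpu_off))->lock;
	/* rhs: take the member of the base, then add the cpu offset */
	int *rhs = (int *)((uintptr_t)&base->lock + cpu_off);

	assert(lhs == rhs);
	printf("both point at %p\n", (void *)lhs);
	return 0;
}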