mbox series

[RFC,0/6] Use local_lock for pcp protection and reduce stat overhead

Message ID 20210329120648.19040-1-mgorman@techsingularity.net (mailing list archive)
Headers show
Series Use local_lock for pcp protection and reduce stat overhead | expand

Message

Mel Gorman March 29, 2021, 12:06 p.m. UTC
This series requires patches in Andrew's tree so the series is also
available at

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-percpu-local_lock-v1r15

tldr: Jesper and Chuck, it would be nice to verify if this series helps
	the allocation rate of the bulk page allocator. RT people, this
	*partially* addresses some problems PREEMPT_RT has with the page
	allocator but it needs review.

The PCP (per-cpu page allocator in page_alloc.c) share locking requirements
with vmstat which is inconvenient and causes some issues. Possibly because
of that, the PCP list and vmstat share the same per-cpu space meaning that
it's possible that vmstat updates dirty cache lines holding per-cpu lists
across CPUs unless padding is used. The series splits that structure and
separates the locking.

Second, PREEMPT_RT considers the following sequence to be unsafe
as documented in Documentation/locking/locktypes.rst

   local_irq_disable();
   spin_lock(&lock);

The pcp allocator has this sequence for rmqueue_pcplist (local_irq_save)
-> __rmqueue_pcplist -> rmqueue_bulk (spin_lock). This series explicitly
separates the locking requirements for the PCP list (local_lock) and stat
updates (irqs disabled). Once that is done, the length of time IRQs are
disabled can be reduced and in some cases, IRQ disabling can be replaced
with preempt_disable.

After that, it was very obvious that zone_statistics in particular has way
too much overhead and leaves IRQs disabled for longer than necessary. It
has perfectly accurate counters requiring IRQs be disabled for parallel
RMW sequences when inaccurate ones like vm_events would do. The series
makes the NUMA statistics (NUMA_HIT and friends) inaccurate counters that
only require preempt be disabled.

Finally the bulk page allocator can then do all the stat updates in bulk
with IRQs enabled which should improve the efficiency of the bulk page
allocator. Technically, this could have been done without the local_lock
and vmstat conversion work and the order simply reflects the timing of
when different series were implemented.

No performance data is included because despite the overhead of the
stats, it's within the noise for most workloads but Jesper and Chuck may
observe a significant different with the same tests used for the bulk
page allocator. The series is more likely to be interesting to the RT
folk in terms of slowing getting the PREEMPT tree into mainline.

 drivers/base/node.c    |  18 +--
 include/linux/mmzone.h |  29 +++--
 include/linux/vmstat.h |  65 ++++++-----
 mm/mempolicy.c         |   2 +-
 mm/page_alloc.c        | 173 ++++++++++++++++------------
 mm/vmstat.c            | 254 +++++++++++++++--------------------------
 6 files changed, 254 insertions(+), 287 deletions(-)

Comments

Jesper Dangaard Brouer March 30, 2021, 6:51 p.m. UTC | #1
On Mon, 29 Mar 2021 13:06:42 +0100
Mel Gorman <mgorman@techsingularity.net> wrote:

> This series requires patches in Andrew's tree so the series is also
> available at
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-percpu-local_lock-v1r15
> 
> tldr: Jesper and Chuck, it would be nice to verify if this series helps
> 	the allocation rate of the bulk page allocator. RT people, this
> 	*partially* addresses some problems PREEMPT_RT has with the page
> 	allocator but it needs review.

I've run a new micro-benchmark[1] which shows:
(CPU: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz)

BASELINE
 single_page alloc+put: 194 cycles(tsc) 54.106 ns

LIST variant: time_bulk_page_alloc_free_list: step=bulk size

 Per elem: 200 cycles(tsc) 55.667 ns (step:1)
 Per elem: 143 cycles(tsc) 39.755 ns (step:2)
 Per elem: 132 cycles(tsc) 36.758 ns (step:3)
 Per elem: 128 cycles(tsc) 35.795 ns (step:4)
 Per elem: 123 cycles(tsc) 34.339 ns (step:8)
 Per elem: 120 cycles(tsc) 33.396 ns (step:16)
 Per elem: 118 cycles(tsc) 32.806 ns (step:32)
 Per elem: 115 cycles(tsc) 32.169 ns (step:64)
 Per elem: 116 cycles(tsc) 32.258 ns (step:128)

ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size

 Per elem: 195 cycles(tsc) 54.225 ns (step:1)
 Per elem: 127 cycles(tsc) 35.492 ns (step:2)
 Per elem: 117 cycles(tsc) 32.643 ns (step:3)
 Per elem: 111 cycles(tsc) 30.992 ns (step:4)
 Per elem: 106 cycles(tsc) 29.606 ns (step:8)
 Per elem: 102 cycles(tsc) 28.532 ns (step:16)
 Per elem: 99 cycles(tsc) 27.728 ns (step:32)
 Per elem: 98 cycles(tsc) 27.252 ns (step:64)
 Per elem: 97 cycles(tsc) 27.090 ns (step:128)

[1] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org#micro-benchmark-page_bench04_bulk-local_lock-v1r15

This should be seen in comparison with the older micro-benchmark[2]
done on branch mm-bulk-rebase-v5r9.

BASELINE
 single_page alloc+put: Per elem: 199 cycles(tsc) 55.472 ns

LIST variant: time_bulk_page_alloc_free_list: step=bulk size

 Per elem: 206 cycles(tsc) 57.478 ns (step:1)
 Per elem: 154 cycles(tsc) 42.861 ns (step:2)
 Per elem: 145 cycles(tsc) 40.536 ns (step:3)
 Per elem: 142 cycles(tsc) 39.477 ns (step:4)
 Per elem: 142 cycles(tsc) 39.610 ns (step:8)
 Per elem: 137 cycles(tsc) 38.155 ns (step:16)
 Per elem: 135 cycles(tsc) 37.739 ns (step:32)
 Per elem: 134 cycles(tsc) 37.282 ns (step:64)
 Per elem: 133 cycles(tsc) 36.993 ns (step:128)

ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size

 Per elem: 202 cycles(tsc) 56.383 ns (step:1)
 Per elem: 144 cycles(tsc) 40.047 ns (step:2)
 Per elem: 134 cycles(tsc) 37.339 ns (step:3)
 Per elem: 128 cycles(tsc) 35.578 ns (step:4)
 Per elem: 120 cycles(tsc) 33.592 ns (step:8)
 Per elem: 116 cycles(tsc) 32.362 ns (step:16)
 Per elem: 113 cycles(tsc) 31.476 ns (step:32)
 Per elem: 110 cycles(tsc) 30.633 ns (step:64)
 Per elem: 110 cycles(tsc) 30.596 ns (step:128)

[2] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org#micro-benchmark-page_bench04_bulk

This new patchset does show some improvements in the micro-benchmark.


> The PCP (per-cpu page allocator in page_alloc.c) share locking requirements
> with vmstat which is inconvenient and causes some issues. Possibly because
> of that, the PCP list and vmstat share the same per-cpu space meaning that
> it's possible that vmstat updates dirty cache lines holding per-cpu lists
> across CPUs unless padding is used. The series splits that structure and
> separates the locking.
> 
> Second, PREEMPT_RT considers the following sequence to be unsafe
> as documented in Documentation/locking/locktypes.rst
> 
>    local_irq_disable();
>    spin_lock(&lock);
> 
> The pcp allocator has this sequence for rmqueue_pcplist (local_irq_save)
> -> __rmqueue_pcplist -> rmqueue_bulk (spin_lock). This series explicitly  
> separates the locking requirements for the PCP list (local_lock) and stat
> updates (irqs disabled). Once that is done, the length of time IRQs are
> disabled can be reduced and in some cases, IRQ disabling can be replaced
> with preempt_disable.
> 
> After that, it was very obvious that zone_statistics in particular has way
> too much overhead and leaves IRQs disabled for longer than necessary. It
> has perfectly accurate counters requiring IRQs be disabled for parallel
> RMW sequences when inaccurate ones like vm_events would do. The series
> makes the NUMA statistics (NUMA_HIT and friends) inaccurate counters that
> only require preempt be disabled.
> 
> Finally the bulk page allocator can then do all the stat updates in bulk
> with IRQs enabled which should improve the efficiency of the bulk page
> allocator. Technically, this could have been done without the local_lock
> and vmstat conversion work and the order simply reflects the timing of
> when different series were implemented.
> 
> No performance data is included because despite the overhead of the
> stats, it's within the noise for most workloads but Jesper and Chuck may
> observe a significant different with the same tests used for the bulk
> page allocator. The series is more likely to be interesting to the RT
> folk in terms of slowing getting the PREEMPT tree into mainline.

I've try to run some longer packet benchmarks later.  A quick test
showed performance was within same range, and slightly better.  The
perf report and objdump, did reveal that code layout prefers the label
"failed:" as the primary code path, which should only be used for
single page allocs, which is kind of weird. (But as performance is the
same or slightly better, I will not complain).


>  drivers/base/node.c    |  18 +--
>  include/linux/mmzone.h |  29 +++--
>  include/linux/vmstat.h |  65 ++++++-----
>  mm/mempolicy.c         |   2 +-
>  mm/page_alloc.c        | 173 ++++++++++++++++------------
>  mm/vmstat.c            | 254 +++++++++++++++--------------------------
>  6 files changed, 254 insertions(+), 287 deletions(-)
Mel Gorman March 31, 2021, 7:38 a.m. UTC | #2
On Tue, Mar 30, 2021 at 08:51:54PM +0200, Jesper Dangaard Brouer wrote:
> On Mon, 29 Mar 2021 13:06:42 +0100
> Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> > This series requires patches in Andrew's tree so the series is also
> > available at
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-percpu-local_lock-v1r15
> > 
> > tldr: Jesper and Chuck, it would be nice to verify if this series helps
> > 	the allocation rate of the bulk page allocator. RT people, this
> > 	*partially* addresses some problems PREEMPT_RT has with the page
> > 	allocator but it needs review.
> 
> I've run a new micro-benchmark[1] which shows:
> (CPU: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz)
> 
> <Editting to focus on arrays>
> BASELINE
>  single_page alloc+put: 194 cycles(tsc) 54.106 ns
> 
> ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size
> 
>  Per elem: 195 cycles(tsc) 54.225 ns (step:1)
>  Per elem: 127 cycles(tsc) 35.492 ns (step:2)
>  Per elem: 117 cycles(tsc) 32.643 ns (step:3)
>  Per elem: 111 cycles(tsc) 30.992 ns (step:4)
>  Per elem: 106 cycles(tsc) 29.606 ns (step:8)
>  Per elem: 102 cycles(tsc) 28.532 ns (step:16)
>  Per elem: 99 cycles(tsc) 27.728 ns (step:32)
>  Per elem: 98 cycles(tsc) 27.252 ns (step:64)
>  Per elem: 97 cycles(tsc) 27.090 ns (step:128)
> 
> This should be seen in comparison with the older micro-benchmark[2]
> done on branch mm-bulk-rebase-v5r9.
> 
> BASELINE
>  single_page alloc+put: Per elem: 199 cycles(tsc) 55.472 ns
> 
> ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size
> 
>  Per elem: 202 cycles(tsc) 56.383 ns (step:1)
>  Per elem: 144 cycles(tsc) 40.047 ns (step:2)
>  Per elem: 134 cycles(tsc) 37.339 ns (step:3)
>  Per elem: 128 cycles(tsc) 35.578 ns (step:4)
>  Per elem: 120 cycles(tsc) 33.592 ns (step:8)
>  Per elem: 116 cycles(tsc) 32.362 ns (step:16)
>  Per elem: 113 cycles(tsc) 31.476 ns (step:32)
>  Per elem: 110 cycles(tsc) 30.633 ns (step:64)
>  Per elem: 110 cycles(tsc) 30.596 ns (step:128)
> 

Ok, so bulk allocation is faster than allocating single pages, no surprise
there. Putting the array figures for bulk allocation into tabular format
and comparing we get;

Array variant (time to allocate a page in nanoseconds, lower is better)
        Baseline        Patched
1       56.383          54.225 (+3.83%)
2       40.047          35.492 (+11.38%)
3       37.339          32.643 (+12.58%)
4       35.578          30.992 (+12.89%)
8       33.592          29.606 (+11.87%)
16      32.362          28.532 (+11.85%)
32      31.476          27.728 (+11.91%)
64      30.633          27.252 (+11.04%)
128     30.596          27.090 (+11.46%)

The series is 11-12% faster when allocating multiple pages.  That's a
fairly positive outcome and I'll include this in the series leader if
you have no objections.

Thanks Jesper!
Jesper Dangaard Brouer March 31, 2021, 8:17 a.m. UTC | #3
On Wed, 31 Mar 2021 08:38:05 +0100
Mel Gorman <mgorman@techsingularity.net> wrote:

> On Tue, Mar 30, 2021 at 08:51:54PM +0200, Jesper Dangaard Brouer wrote:
> > On Mon, 29 Mar 2021 13:06:42 +0100
> > Mel Gorman <mgorman@techsingularity.net> wrote:
> >   
> > > This series requires patches in Andrew's tree so the series is also
> > > available at
> > > 
> > > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-percpu-local_lock-v1r15
> > > 
> > > tldr: Jesper and Chuck, it would be nice to verify if this series helps
> > > 	the allocation rate of the bulk page allocator. RT people, this
> > > 	*partially* addresses some problems PREEMPT_RT has with the page
> > > 	allocator but it needs review.  
> > 
> > I've run a new micro-benchmark[1] which shows:
> > (CPU: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz)
> > 
> > <Editting to focus on arrays>
> > BASELINE
> >  single_page alloc+put: 194 cycles(tsc) 54.106 ns
> > 
> > ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size
> > 
> >  Per elem: 195 cycles(tsc) 54.225 ns (step:1)
> >  Per elem: 127 cycles(tsc) 35.492 ns (step:2)
> >  Per elem: 117 cycles(tsc) 32.643 ns (step:3)
> >  Per elem: 111 cycles(tsc) 30.992 ns (step:4)
> >  Per elem: 106 cycles(tsc) 29.606 ns (step:8)
> >  Per elem: 102 cycles(tsc) 28.532 ns (step:16)
> >  Per elem: 99 cycles(tsc) 27.728 ns (step:32)
> >  Per elem: 98 cycles(tsc) 27.252 ns (step:64)
> >  Per elem: 97 cycles(tsc) 27.090 ns (step:128)
> > 
> > This should be seen in comparison with the older micro-benchmark[2]
> > done on branch mm-bulk-rebase-v5r9.
> > 
> > BASELINE
> >  single_page alloc+put: Per elem: 199 cycles(tsc) 55.472 ns
> > 
> > ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size
> > 
> >  Per elem: 202 cycles(tsc) 56.383 ns (step:1)
> >  Per elem: 144 cycles(tsc) 40.047 ns (step:2)
> >  Per elem: 134 cycles(tsc) 37.339 ns (step:3)
> >  Per elem: 128 cycles(tsc) 35.578 ns (step:4)
> >  Per elem: 120 cycles(tsc) 33.592 ns (step:8)
> >  Per elem: 116 cycles(tsc) 32.362 ns (step:16)
> >  Per elem: 113 cycles(tsc) 31.476 ns (step:32)
> >  Per elem: 110 cycles(tsc) 30.633 ns (step:64)
> >  Per elem: 110 cycles(tsc) 30.596 ns (step:128)
> >   
> 
> Ok, so bulk allocation is faster than allocating single pages, no surprise
> there. Putting the array figures for bulk allocation into tabular format
> and comparing we get;
> 
> Array variant (time to allocate a page in nanoseconds, lower is better)
>         Baseline        Patched
> 1       56.383          54.225 (+3.83%)
> 2       40.047          35.492 (+11.38%)
> 3       37.339          32.643 (+12.58%)
> 4       35.578          30.992 (+12.89%)
> 8       33.592          29.606 (+11.87%)
> 16      32.362          28.532 (+11.85%)
> 32      31.476          27.728 (+11.91%)
> 64      30.633          27.252 (+11.04%)
> 128     30.596          27.090 (+11.46%)
> 
> The series is 11-12% faster when allocating multiple pages.  That's a
> fairly positive outcome and I'll include this in the series leader if
> you have no objections.

That is fine by me to add this to the cover letter.  I like your
tabular format as it makes is easier to compare. If you use the
nanosec measurements and not the cycles, you should state that
this was run on a CPU E5-1650 v4 @ 3.60GHz.  You might notice that the
factor between cycles(tsc) and ns is very close to 3.6.
Mel Gorman March 31, 2021, 8:52 a.m. UTC | #4
Ingo, Thomas or Peter, is there any chance one of you could take a look
at patch "[PATCH 2/6] mm/page_alloc: Convert per-cpu list protection to
local_lock" from this series? It's partially motivated by PREEMPT_RT. More
details below.

On Mon, Mar 29, 2021 at 01:06:42PM +0100, Mel Gorman wrote:
> This series requires patches in Andrew's tree so the series is also
> available at
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-percpu-local_lock-v1r15
> 
> The PCP (per-cpu page allocator in page_alloc.c) share locking requirements
> with vmstat which is inconvenient and causes some issues. Possibly because
> of that, the PCP list and vmstat share the same per-cpu space meaning that
> it's possible that vmstat updates dirty cache lines holding per-cpu lists
> across CPUs unless padding is used. The series splits that structure and
> separates the locking.
> 

The bulk page allocation series that the local_lock work had an
additional fix so I've rebased this onto

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-percpu-local_lock-v1r16

> Second, PREEMPT_RT considers the following sequence to be unsafe
> as documented in Documentation/locking/locktypes.rst
> 
>    local_irq_disable();
>    spin_lock(&lock);
> 
> The pcp allocator has this sequence for rmqueue_pcplist (local_irq_save)
> -> __rmqueue_pcplist -> rmqueue_bulk (spin_lock). This series explicitly
> separates the locking requirements for the PCP list (local_lock) and stat
> updates (irqs disabled). Once that is done, the length of time IRQs are
> disabled can be reduced and in some cases, IRQ disabling can be replaced
> with preempt_disable.
> 

It's this part I'm interested in even though it only partially addresses
the preempt-rt tree concerns. More legwork is needed for preempt-rt which
is outside the context of this series. At minimum, it involves

1. Split locking of pcp and buddy allocator instead of using spin_lock()
   when it's "known" that IRQs are disabled (not necessarily a valid
   assumption on PREEMPT_RT)
2. Split the zone lock into what protects the zone metadata and what
   protects the free lists

This looks straight-forward but it involves audit work and it may be
difficult to avoid regressing non-PREEMPT_RT kernels by disabling/enabling
IRQs when switching between the pcp allocator and the buddy allocator.

> After that, it was very obvious that zone_statistics in particular has way
> too much overhead and leaves IRQs disabled for longer than necessary. It
> has perfectly accurate counters requiring IRQs be disabled for parallel
> RMW sequences when inaccurate ones like vm_events would do. The series
> makes the NUMA statistics (NUMA_HIT and friends) inaccurate counters that
> only require preempt be disabled.
> 
> Finally the bulk page allocator can then do all the stat updates in bulk
> with IRQs enabled which should improve the efficiency of the bulk page
> allocator. Technically, this could have been done without the local_lock
> and vmstat conversion work and the order simply reflects the timing of
> when different series were implemented.
> 
> No performance data is included because despite the overhead of the
> stats, it's within the noise for most workloads but Jesper and Chuck may
> observe a significant different with the same tests used for the bulk
> page allocator. The series is more likely to be interesting to the RT
> folk in terms of slowing getting the PREEMPT tree into mainline.
> 
>  drivers/base/node.c    |  18 +--
>  include/linux/mmzone.h |  29 +++--
>  include/linux/vmstat.h |  65 ++++++-----
>  mm/mempolicy.c         |   2 +-
>  mm/page_alloc.c        | 173 ++++++++++++++++------------
>  mm/vmstat.c            | 254 +++++++++++++++--------------------------
>  6 files changed, 254 insertions(+), 287 deletions(-)
> 
> -- 
> 2.26.2
>
Thomas Gleixner March 31, 2021, 9:51 a.m. UTC | #5
On Wed, Mar 31 2021 at 09:52, Mel Gorman wrote:
> Ingo, Thomas or Peter, is there any chance one of you could take a look
> at patch "[PATCH 2/6] mm/page_alloc: Convert per-cpu list protection to
> local_lock" from this series? It's partially motivated by PREEMPT_RT. More
> details below.

Sure.