Message ID | 20210329120648.19040-1-mgorman@techsingularity.net
---|---
Series | Use local_lock for pcp protection and reduce stat overhead
On Mon, 29 Mar 2021 13:06:42 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:

> This series requires patches in Andrew's tree so the series is also
> available at
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-percpu-local_lock-v1r15
>
> tldr: Jesper and Chuck, it would be nice to verify if this series helps
> the allocation rate of the bulk page allocator. RT people, this
> *partially* addresses some problems PREEMPT_RT has with the page
> allocator but it needs review.

I've run a new micro-benchmark[1] which shows:
(CPU: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz)

BASELINE
 single_page alloc+put: 194 cycles(tsc) 54.106 ns

LIST variant: time_bulk_page_alloc_free_list: step=bulk size
 Per elem: 200 cycles(tsc) 55.667 ns (step:1)
 Per elem: 143 cycles(tsc) 39.755 ns (step:2)
 Per elem: 132 cycles(tsc) 36.758 ns (step:3)
 Per elem: 128 cycles(tsc) 35.795 ns (step:4)
 Per elem: 123 cycles(tsc) 34.339 ns (step:8)
 Per elem: 120 cycles(tsc) 33.396 ns (step:16)
 Per elem: 118 cycles(tsc) 32.806 ns (step:32)
 Per elem: 115 cycles(tsc) 32.169 ns (step:64)
 Per elem: 116 cycles(tsc) 32.258 ns (step:128)

ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size
 Per elem: 195 cycles(tsc) 54.225 ns (step:1)
 Per elem: 127 cycles(tsc) 35.492 ns (step:2)
 Per elem: 117 cycles(tsc) 32.643 ns (step:3)
 Per elem: 111 cycles(tsc) 30.992 ns (step:4)
 Per elem: 106 cycles(tsc) 29.606 ns (step:8)
 Per elem: 102 cycles(tsc) 28.532 ns (step:16)
 Per elem: 99 cycles(tsc) 27.728 ns (step:32)
 Per elem: 98 cycles(tsc) 27.252 ns (step:64)
 Per elem: 97 cycles(tsc) 27.090 ns (step:128)

[1] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org#micro-benchmark-page_bench04_bulk-local_lock-v1r15

This should be seen in comparison with the older micro-benchmark[2]
done on branch mm-bulk-rebase-v5r9.

BASELINE
 single_page alloc+put: Per elem: 199 cycles(tsc) 55.472 ns

LIST variant: time_bulk_page_alloc_free_list: step=bulk size
 Per elem: 206 cycles(tsc) 57.478 ns (step:1)
 Per elem: 154 cycles(tsc) 42.861 ns (step:2)
 Per elem: 145 cycles(tsc) 40.536 ns (step:3)
 Per elem: 142 cycles(tsc) 39.477 ns (step:4)
 Per elem: 142 cycles(tsc) 39.610 ns (step:8)
 Per elem: 137 cycles(tsc) 38.155 ns (step:16)
 Per elem: 135 cycles(tsc) 37.739 ns (step:32)
 Per elem: 134 cycles(tsc) 37.282 ns (step:64)
 Per elem: 133 cycles(tsc) 36.993 ns (step:128)

ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size
 Per elem: 202 cycles(tsc) 56.383 ns (step:1)
 Per elem: 144 cycles(tsc) 40.047 ns (step:2)
 Per elem: 134 cycles(tsc) 37.339 ns (step:3)
 Per elem: 128 cycles(tsc) 35.578 ns (step:4)
 Per elem: 120 cycles(tsc) 33.592 ns (step:8)
 Per elem: 116 cycles(tsc) 32.362 ns (step:16)
 Per elem: 113 cycles(tsc) 31.476 ns (step:32)
 Per elem: 110 cycles(tsc) 30.633 ns (step:64)
 Per elem: 110 cycles(tsc) 30.596 ns (step:128)

[2] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org#micro-benchmark-page_bench04_bulk

This new patchset does show some improvements in the micro-benchmark.

> The PCP (per-cpu page allocator in page_alloc.c) shares locking requirements
> with vmstat which is inconvenient and causes some issues. Possibly because
> of that, the PCP list and vmstat share the same per-cpu space meaning that
> it's possible that vmstat updates dirty cache lines holding per-cpu lists
> across CPUs unless padding is used. The series splits that structure and
> separates the locking.
>
> Second, PREEMPT_RT considers the following sequence to be unsafe
> as documented in Documentation/locking/locktypes.rst
>
> local_irq_disable();
> spin_lock(&lock);
>
> The pcp allocator has this sequence for rmqueue_pcplist (local_irq_save)
> -> __rmqueue_pcplist -> rmqueue_bulk (spin_lock). This series explicitly
> separates the locking requirements for the PCP list (local_lock) and stat
> updates (irqs disabled). Once that is done, the length of time IRQs are
> disabled can be reduced and in some cases, IRQ disabling can be replaced
> with preempt_disable.
>
> After that, it was very obvious that zone_statistics in particular has way
> too much overhead and leaves IRQs disabled for longer than necessary. It
> has perfectly accurate counters requiring IRQs be disabled for parallel
> RMW sequences when inaccurate ones like vm_events would do. The series
> makes the NUMA statistics (NUMA_HIT and friends) inaccurate counters that
> only require preempt be disabled.
>
> Finally the bulk page allocator can then do all the stat updates in bulk
> with IRQs enabled which should improve the efficiency of the bulk page
> allocator. Technically, this could have been done without the local_lock
> and vmstat conversion work and the order simply reflects the timing of
> when different series were implemented.
>
> No performance data is included because despite the overhead of the
> stats, it's within the noise for most workloads but Jesper and Chuck may
> observe a significant difference with the same tests used for the bulk
> page allocator. The series is more likely to be interesting to the RT
> folk in terms of slowly getting the PREEMPT tree into mainline.

I'll try to run some longer packet benchmarks later. A quick test showed
performance was within the same range, and slightly better.

The perf report and objdump did reveal that the code layout prefers the
label "failed:" as the primary code path, even though it should only be
used for single-page allocs, which is kind of weird. (But as performance
is the same or slightly better, I will not complain.)

>  drivers/base/node.c    |  18 +--
>  include/linux/mmzone.h |  29 +++--
>  include/linux/vmstat.h |  65 ++++++-----
>  mm/mempolicy.c         |   2 +-
>  mm/page_alloc.c        | 173 ++++++++++++++++------------
>  mm/vmstat.c            | 254 +++++++++++++++--------------------
>  6 files changed, 254 insertions(+), 287 deletions(-)
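The locking split described in the quoted cover letter can be sketched roughly as follows. This is only an illustration of the local_lock pattern, not the actual patch: the structure and function names (pagesets, pcp_op_old, pcp_op_new) are placeholders, and the real conversion in mm/page_alloc.c differs in detail.

```c
/* Illustrative sketch only -- names are placeholders, not the real patch. */
#include <linux/local_lock.h>
#include <linux/percpu.h>

struct pagesets {
	local_lock_t lock;	/* protects this CPU's pcp lists only */
};

static DEFINE_PER_CPU(struct pagesets, pagesets) = {
	.lock = INIT_LOCAL_LOCK(lock),
};

static void pcp_op_old(void)
{
	unsigned long flags;

	/*
	 * Old scheme: a single IRQ-off section covers both the pcp lists
	 * and the vmstat updates, and a spin_lock() may be taken inside
	 * it -- the sequence PREEMPT_RT documents as unsafe.
	 */
	local_irq_save(flags);
	/* ... manipulate pcp lists, possibly take zone->lock, update stats ... */
	local_irq_restore(flags);
}

static void pcp_op_new(void)
{
	unsigned long flags;

	/*
	 * New scheme: the pcp lists get their own local_lock, which
	 * documents what is being protected and maps to a per-CPU lock
	 * on PREEMPT_RT instead of a bare IRQ-off section.
	 */
	local_lock_irqsave(&pagesets.lock, flags);
	/* ... manipulate this CPU's pcp lists ... */
	local_unlock_irqrestore(&pagesets.lock, flags);

	/*
	 * Stat updates are done separately under their own, shorter,
	 * IRQ- or preempt-disabled sections.
	 */
}
```

On a non-RT kernel without lockdep, local_lock_irqsave() essentially compiles down to local_irq_save(), so the fast path is largely unchanged; the gain is the documented scope of protection and what PREEMPT_RT can substitute for it.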
On Tue, Mar 30, 2021 at 08:51:54PM +0200, Jesper Dangaard Brouer wrote:
> On Mon, 29 Mar 2021 13:06:42 +0100
> Mel Gorman <mgorman@techsingularity.net> wrote:
>
> > This series requires patches in Andrew's tree so the series is also
> > available at
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-percpu-local_lock-v1r15
> >
> > tldr: Jesper and Chuck, it would be nice to verify if this series helps
> > the allocation rate of the bulk page allocator. RT people, this
> > *partially* addresses some problems PREEMPT_RT has with the page
> > allocator but it needs review.
>
> I've run a new micro-benchmark[1] which shows:
> (CPU: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz)
>
> <Editing to focus on arrays>
> BASELINE
>  single_page alloc+put: 194 cycles(tsc) 54.106 ns
>
> ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size
>  Per elem: 195 cycles(tsc) 54.225 ns (step:1)
>  Per elem: 127 cycles(tsc) 35.492 ns (step:2)
>  Per elem: 117 cycles(tsc) 32.643 ns (step:3)
>  Per elem: 111 cycles(tsc) 30.992 ns (step:4)
>  Per elem: 106 cycles(tsc) 29.606 ns (step:8)
>  Per elem: 102 cycles(tsc) 28.532 ns (step:16)
>  Per elem: 99 cycles(tsc) 27.728 ns (step:32)
>  Per elem: 98 cycles(tsc) 27.252 ns (step:64)
>  Per elem: 97 cycles(tsc) 27.090 ns (step:128)
>
> This should be seen in comparison with the older micro-benchmark[2]
> done on branch mm-bulk-rebase-v5r9.
>
> BASELINE
>  single_page alloc+put: Per elem: 199 cycles(tsc) 55.472 ns
>
> ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size
>  Per elem: 202 cycles(tsc) 56.383 ns (step:1)
>  Per elem: 144 cycles(tsc) 40.047 ns (step:2)
>  Per elem: 134 cycles(tsc) 37.339 ns (step:3)
>  Per elem: 128 cycles(tsc) 35.578 ns (step:4)
>  Per elem: 120 cycles(tsc) 33.592 ns (step:8)
>  Per elem: 116 cycles(tsc) 32.362 ns (step:16)
>  Per elem: 113 cycles(tsc) 31.476 ns (step:32)
>  Per elem: 110 cycles(tsc) 30.633 ns (step:64)
>  Per elem: 110 cycles(tsc) 30.596 ns (step:128)
>

Ok, so bulk allocation is faster than allocating single pages, no surprise
there. Putting the array figures for bulk allocation into tabular format
and comparing we get:

Array variant (time to allocate a page in nanoseconds, lower is better)

        Baseline   Patched
  1     56.383     54.225 (+3.83%)
  2     40.047     35.492 (+11.38%)
  3     37.339     32.643 (+12.58%)
  4     35.578     30.992 (+12.89%)
  8     33.592     29.606 (+11.87%)
 16     32.362     28.532 (+11.85%)
 32     31.476     27.728 (+11.91%)
 64     30.633     27.252 (+11.04%)
128     30.596     27.090 (+11.46%)

The series is 11-12% faster when allocating multiple pages. That's a
fairly positive outcome and I'll include this in the series leader if
you have no objections.

Thanks Jesper!
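For reference, the percentages in the table are the reduction in per-page allocation time relative to the baseline; at step 1, for example, (56.383 - 54.225) / 56.383 ≈ 3.83%.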
On Wed, 31 Mar 2021 08:38:05 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:

> On Tue, Mar 30, 2021 at 08:51:54PM +0200, Jesper Dangaard Brouer wrote:
> > On Mon, 29 Mar 2021 13:06:42 +0100
> > Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > > This series requires patches in Andrew's tree so the series is also
> > > available at
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-percpu-local_lock-v1r15
> > >
> > > tldr: Jesper and Chuck, it would be nice to verify if this series helps
> > > the allocation rate of the bulk page allocator. RT people, this
> > > *partially* addresses some problems PREEMPT_RT has with the page
> > > allocator but it needs review.
> >
> > I've run a new micro-benchmark[1] which shows:
> > (CPU: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz)
> >
> > <Editing to focus on arrays>
> > BASELINE
> >  single_page alloc+put: 194 cycles(tsc) 54.106 ns
> >
> > ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size
> >  Per elem: 195 cycles(tsc) 54.225 ns (step:1)
> >  Per elem: 127 cycles(tsc) 35.492 ns (step:2)
> >  Per elem: 117 cycles(tsc) 32.643 ns (step:3)
> >  Per elem: 111 cycles(tsc) 30.992 ns (step:4)
> >  Per elem: 106 cycles(tsc) 29.606 ns (step:8)
> >  Per elem: 102 cycles(tsc) 28.532 ns (step:16)
> >  Per elem: 99 cycles(tsc) 27.728 ns (step:32)
> >  Per elem: 98 cycles(tsc) 27.252 ns (step:64)
> >  Per elem: 97 cycles(tsc) 27.090 ns (step:128)
> >
> > This should be seen in comparison with the older micro-benchmark[2]
> > done on branch mm-bulk-rebase-v5r9.
> >
> > BASELINE
> >  single_page alloc+put: Per elem: 199 cycles(tsc) 55.472 ns
> >
> > ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size
> >  Per elem: 202 cycles(tsc) 56.383 ns (step:1)
> >  Per elem: 144 cycles(tsc) 40.047 ns (step:2)
> >  Per elem: 134 cycles(tsc) 37.339 ns (step:3)
> >  Per elem: 128 cycles(tsc) 35.578 ns (step:4)
> >  Per elem: 120 cycles(tsc) 33.592 ns (step:8)
> >  Per elem: 116 cycles(tsc) 32.362 ns (step:16)
> >  Per elem: 113 cycles(tsc) 31.476 ns (step:32)
> >  Per elem: 110 cycles(tsc) 30.633 ns (step:64)
> >  Per elem: 110 cycles(tsc) 30.596 ns (step:128)
> >
>
> Ok, so bulk allocation is faster than allocating single pages, no surprise
> there. Putting the array figures for bulk allocation into tabular format
> and comparing we get:
>
> Array variant (time to allocate a page in nanoseconds, lower is better)
>
>         Baseline   Patched
>   1     56.383     54.225 (+3.83%)
>   2     40.047     35.492 (+11.38%)
>   3     37.339     32.643 (+12.58%)
>   4     35.578     30.992 (+12.89%)
>   8     33.592     29.606 (+11.87%)
>  16     32.362     28.532 (+11.85%)
>  32     31.476     27.728 (+11.91%)
>  64     30.633     27.252 (+11.04%)
> 128     30.596     27.090 (+11.46%)
>
> The series is 11-12% faster when allocating multiple pages. That's a
> fairly positive outcome and I'll include this in the series leader if
> you have no objections.

That is fine by me to add this to the cover letter. I like your tabular
format as it makes it easier to compare.

If you use the nanosec measurements and not the cycles, you should state
that this was run on a CPU E5-1650 v4 @ 3.60GHz. You might notice that
the factor between cycles(tsc) and ns is very close to 3.6.
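As a quick sanity check of that factor: 194 cycles / 3.6 GHz ≈ 53.9 ns, which matches the reported 54.106 ns for the baseline single-page case to within the rounding of the integer cycle count.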
Ingo, Thomas or Peter, is there any chance one of you could take a look
at patch "[PATCH 2/6] mm/page_alloc: Convert per-cpu list protection to
local_lock" from this series? It's partially motivated by PREEMPT_RT.
More details below.

On Mon, Mar 29, 2021 at 01:06:42PM +0100, Mel Gorman wrote:
> This series requires patches in Andrew's tree so the series is also
> available at
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-percpu-local_lock-v1r15
>
> The PCP (per-cpu page allocator in page_alloc.c) shares locking requirements
> with vmstat which is inconvenient and causes some issues. Possibly because
> of that, the PCP list and vmstat share the same per-cpu space meaning that
> it's possible that vmstat updates dirty cache lines holding per-cpu lists
> across CPUs unless padding is used. The series splits that structure and
> separates the locking.
>

The bulk page allocation series that the local_lock work depends on had
an additional fix, so I've rebased this onto

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-percpu-local_lock-v1r16

> Second, PREEMPT_RT considers the following sequence to be unsafe
> as documented in Documentation/locking/locktypes.rst
>
> local_irq_disable();
> spin_lock(&lock);
>
> The pcp allocator has this sequence for rmqueue_pcplist (local_irq_save)
> -> __rmqueue_pcplist -> rmqueue_bulk (spin_lock). This series explicitly
> separates the locking requirements for the PCP list (local_lock) and stat
> updates (irqs disabled). Once that is done, the length of time IRQs are
> disabled can be reduced and in some cases, IRQ disabling can be replaced
> with preempt_disable.
>

It's this part I'm interested in even though it only partially addresses
the preempt-rt tree concerns. More legwork is needed for preempt-rt, which
is outside the context of this series. At minimum, it involves

1. Split locking of pcp and buddy allocator instead of using spin_lock()
   when it's "known" that IRQs are disabled (not necessarily a valid
   assumption on PREEMPT_RT)

2. Split the zone lock into what protects the zone metadata and what
   protects the free lists

This looks straightforward but it involves audit work and it may be
difficult to avoid regressing non-PREEMPT_RT kernels by disabling/enabling
IRQs when switching between the pcp allocator and the buddy allocator.

> After that, it was very obvious that zone_statistics in particular has way
> too much overhead and leaves IRQs disabled for longer than necessary. It
> has perfectly accurate counters requiring IRQs be disabled for parallel
> RMW sequences when inaccurate ones like vm_events would do. The series
> makes the NUMA statistics (NUMA_HIT and friends) inaccurate counters that
> only require preempt be disabled.
>
> Finally the bulk page allocator can then do all the stat updates in bulk
> with IRQs enabled which should improve the efficiency of the bulk page
> allocator. Technically, this could have been done without the local_lock
> and vmstat conversion work and the order simply reflects the timing of
> when different series were implemented.
>
> No performance data is included because despite the overhead of the
> stats, it's within the noise for most workloads but Jesper and Chuck may
> observe a significant difference with the same tests used for the bulk
> page allocator. The series is more likely to be interesting to the RT
> folk in terms of slowly getting the PREEMPT tree into mainline.
>
>  drivers/base/node.c    |  18 +--
>  include/linux/mmzone.h |  29 +++--
>  include/linux/vmstat.h |  65 ++++++-----
>  mm/mempolicy.c         |   2 +-
>  mm/page_alloc.c        | 173 ++++++++++++++++------------
>  mm/vmstat.c            | 254 +++++++++++++++--------------------
>  6 files changed, 254 insertions(+), 287 deletions(-)
>
> --
> 2.26.2
>
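The zone_statistics point quoted above (exact counters needing an IRQ-disabled read-modify-write versus vm_event-style approximate counters that only need preemption disabled) can be illustrated with a minimal sketch. The counter names below are placeholders, not the real mm/vmstat.c implementation.

```c
/* Illustrative sketch only -- placeholder counters, not mm/vmstat.c. */
#include <linux/percpu.h>
#include <linux/preempt.h>
#include <linux/irqflags.h>

static DEFINE_PER_CPU(unsigned long, numa_hit_exact);
static DEFINE_PER_CPU(unsigned long, numa_hit_approx);

static void count_numa_hit_exact(void)
{
	unsigned long flags;

	/*
	 * Exact counter: interrupts are disabled so an interrupt handler
	 * on this CPU cannot interleave with the read-modify-write and
	 * no events are ever lost.
	 */
	local_irq_save(flags);
	__this_cpu_inc(numa_hit_exact);
	local_irq_restore(flags);
}

static void count_numa_hit_approx(void)
{
	/*
	 * Approximate, vm_event-style counter: disabling preemption only
	 * keeps the update on one CPU. On architectures where the percpu
	 * increment is not a single instruction, a rare interrupt racing
	 * with it can drop an event, which is acceptable for statistics
	 * like NUMA_HIT.
	 */
	preempt_disable();
	__this_cpu_inc(numa_hit_approx);
	preempt_enable();
}
```

The approximate version keeps interrupts enabled, which is what lets callers such as the bulk allocator batch many stat updates without extending an IRQ-off section.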
On Wed, Mar 31 2021 at 09:52, Mel Gorman wrote:
> Ingo, Thomas or Peter, is there any chance one of you could take a look
> at patch "[PATCH 2/6] mm/page_alloc: Convert per-cpu list protection to
> local_lock" from this series? It's partially motivated by PREEMPT_RT. More
> details below.

Sure.