[RFC,0/9] Improve zone lock scalability using Daniel Jordan's list work

Message ID 20180911053616.6894-1-aaron.lu@intel.com

Message

Aaron Lu Sept. 11, 2018, 5:36 a.m. UTC
At this year's MM summit, Daniel Jordan and others proposed an
innovative technique that lets multiple threads concurrently call
list_del() at any position in a list and list_add() at the head of the
list, without taking a lock[0].

People think this technique may be useful for improving zone lock
scalability, so here is my attempt. This series is based on Daniel
Jordan's most recent patchset[1]. To make this series self-contained,
two of his patches are included here.

Scalability is best when multiple threads operate at different positions
of the list. Since the free path accesses (buddy) pages at random
positions on the free list during merging, it is a good fit for this
technique. This patchset makes the free path run concurrently.

Patch 1 is for testing purposes only; it removes the LRU lock from the
picture so we can get a better understanding of how much improvement
this patchset brings to the zone lock.

Patches 2-3 are Daniel's work realizing concurrent list_del() and
list_add(); the new APIs are called smp_list_del() and
smp_list_splice().
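
To give a feel for the core trick, here is a minimal, illustrative
sketch of lock-free insertion at the head of a singly linked list. It
is NOT the smp_list_*() implementation from [1], which operates on the
kernel's doubly linked struct list_head and is more involved; the names
here are made up:

  struct slist_node {
          struct slist_node *next;
  };

  /*
   * Push 'node' at the head without a lock: retry the cmpxchg on the
   * head pointer until no other thread raced with us in between.
   */
  static void slist_push(struct slist_node **head, struct slist_node *node)
  {
          struct slist_node *first;

          do {
                  first = READ_ONCE(*head);
                  node->next = first;
          } while (cmpxchg(head, first, node) != first);
  }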

Patches 4-7 make the free path run concurrently by converting the zone
lock from a spinlock to a rwlock and having the free path take the zone
lock in read mode; see the sketch below. To avoid complexity and
problems, all other code paths take the zone lock in write mode.
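
The rough shape of that conversion, as an untested sketch (the function
names here are hypothetical, not the actual patches):

  /*
   * Freeing threads enter in read mode, so many of them can be inside
   * at once; they rely on the smp_list_*() APIs to make the actual
   * list manipulation safe.
   */
  static void free_one_page_concurrent(struct zone *zone, struct page *page,
                                       unsigned int order, int migratetype)
  {
          read_lock(&zone->lock);
          /* ... merge buddies with smp_list_del(), then publish the
           * result with smp_list_splice() onto the free list head ... */
          read_unlock(&zone->lock);
  }

  /*
   * Allocation (and every other path) takes the lock in write mode,
   * excluding all freers as well as other allocators.
   */
  static struct page *rmqueue_one(struct zone *zone, unsigned int order,
                                  int migratetype)
  {
          struct page *page;

          write_lock(&zone->lock);
          page = NULL;    /* placeholder: __rmqueue()-style removal here */
          write_unlock(&zone->lock);
          return page;
  }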

Patch 8 is an optimization that reduces accesses to the free list head
to avoid severe cache bouncing. It comes with a side effect: with this
patch, there will be mergeable pages left unmerged in Buddy.

Patch 9 mitigates the fragmentation issue introduced in patch 8 by
pre-merging pages before they are sent for merging under the zone lock;
a sketch of that idea follows.
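
Roughly, the pre-merge idea looks like this (a hypothetical helper, not
the actual patch; it ignores zone and pageblock boundary checks, keeps
each page's order in page_private() for the sketch, and is quadratic in
the batch size, which is fine for a small batch):

  /*
   * Before taking the zone lock, combine pages in a local to-be-freed
   * batch with their buddies when both halves happen to be in the
   * batch, so fewer, larger pages are handed to the locked merge.
   */
  static void premerge_local_pages(struct list_head *pages)
  {
          struct page *page, *other;
          bool merged;

          do {
                  merged = false;
                  list_for_each_entry(page, pages, lru) {
                          unsigned int order = page_private(page);
                          unsigned long pfn = page_to_pfn(page);
                          unsigned long buddy_pfn = pfn ^ (1UL << order);

                          list_for_each_entry(other, pages, lru) {
                                  if (other == page ||
                                      page_to_pfn(other) != buddy_pfn ||
                                      page_private(other) != order)
                                          continue;
                                  /* Keep the lower buddy at order + 1. */
                                  if (pfn < buddy_pfn) {
                                          list_del(&other->lru);
                                          set_page_private(page, order + 1);
                                  } else {
                                          list_del(&page->lru);
                                          set_page_private(other, order + 1);
                                  }
                                  merged = true;
                                  break;
                          }
                          if (merged)
                                  break;  /* list changed, restart scan */
                  }
          } while (merged);
  }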

This patchset is based on v4.19-rc2.

Performance-wise, on a 2-socket Intel Skylake server with 56 cores/112
threads, using will-it-scale/page_fault1 in process mode (higher is
better):

kernel        performance      zone lock contention
patch1         9219349         76.99%
patch7         2461133 -73.3%  54.46% (another 34.66% on smp_list_add())
patch8        11712766 +27.0%  68.14%
patch9        11386980 +23.5%  67.18%

Though lock contention dropped a lot with patch 7, performance dropped
considerably due to severe cache bouncing on the free list head among
multiple threads freeing pages at the same time, because every page free
needs to add the freed page at the free list head.

Patch 8 is meant to solve this cache bouncing problem and shows good
results, except for the above-mentioned side effect of leaving mergeable
pages unmerged in Buddy. Patch 9 reduces the fragmentation problem to
some extent, at the cost of a slight performance drop.

As a comparison, here are the numbers for the no_merge+cluster_alloc
approach I posted before[2]:

kernel                 performance      zone lock contention
patch1                  9219349         76.99%
no_merge               11733153 +27.3%  69.18%
no_merge+cluster_alloc 12094893 +31.2%   0.73%

no_merge (skip merging for order-0 pages on the free path) has similar
performance and zone lock contention to patch 8/9, while with
cluster_alloc, which also improves the allocation side, zone lock
contention for this workload is almost gone.

To get an idea of how fragmentation is affected by patch 8 and how much
improvement patch 9 brings, here is the output of /proc/buddyinfo after
running will-it-scale/page_fault1 for 3 minutes:

With patch7:
Node 0, zone      DMA      0      2      1      1      3      2      2      1      0      1      3
Node 0, zone    DMA32      7      3      6      5      5     10      6      7      6     10    410
Node 0, zone   Normal  17820  16819  14645  12969  11367   9229   6365   3062    756     69   5646
Node 1, zone   Normal  44789  60354  52331  37532  22071   9604   2750    241     32     11   6378

With patch8:
Node 0, zone      DMA      0      2      1      1      3      2      2      1      0      1      3
Node 0, zone    DMA32      7      9      5      4      5     10      6      7      6     10    410
Node 0, zone   Normal 404917 119614  79446  58303  20679   3106    222     89     28      9   5615
Node 1, zone   Normal 507659 127355  64470  53549  14104   1288     30      4      1      1   6078

With patch9:
Node 0, zone      DMA      0      3      0      1      3      0      1      0      1      1      3
Node 0, zone    DMA32     11    423    621    705    726    702     60     14      5      6    296
Node 0, zone   Normal  20407  21016  18731  16195  13697  10483   6873   3148    735     39   5637
Node 1, zone   Normal  79738  76963  59313  35996  18626   9743   3947    750     21      2   6080

A lot more pages stay at order 0 with patch 8 than with patch 7;
consequently, there are fewer order-5 and above pages with patch 8 than
with patch 7, suggesting that some pages are not properly merged into
high-order pages with patch 8 applied. Patch 9 leaves far fewer pages at
order 0 than patch 8, which is a good sign, though still not as good as
patch 7.

As a comparison, this is the result with no_merge (think of it as a
worst case regarding fragmentation):

With no_merge:
Node 0, zone      DMA      0      2      1      1      3      2      2      1      0      1      3
Node 0, zone    DMA32      7      3      6      5      5     10      6      7      6     10    410
Node 0, zone   Normal 1895199      5      1      1      4      2      2      1      1      1   5614
Node 1, zone   Normal 1718733      4      1     13     10      3      2      0      1      1   6008

Conclusion: the approach proposed here causes a performance drop due to
cache bouncing on the free list head. If we can live with some mergeable
pages staying unmerged in Buddy, zone lock scalability can be improved:
performance increases 20%+ and lock contention drops by about 8
percentage points. no_merge+cluster_alloc, on the other hand, can almost
entirely eliminate zone lock contention, but has a worse fragmentation
issue.

[0] https://lwn.net/Articles/753058/
[1] https://lkml.kernel.org/r/20180911004240.4758-1-daniel.m.jordan@oracle.com
[2] https://lkml.kernel.org/r/20180509085450.3524-1-aaron.lu@intel.com

Aaron Lu (7):
  mm: do not add anon pages to LRU
  mm: convert zone lock from spinlock to rwlock
  mm/page_alloc: use helper functions to add/remove a page to/from buddy
  use atomic for free_area[order].nr_free
  mm: use read_lock for free path
  mm: use smp_list_splice() on free path
  mm: page_alloc: merge before sending pages to global pool

Daniel Jordan (2):
  mm: introduce smp_list_del for concurrent list entry removals
  mm: introduce smp_list_splice to prepare for concurrent LRU adds

 include/linux/list.h   |   4 +
 include/linux/mm.h     |   1 +
 include/linux/mmzone.h |   4 +-
 init/main.c            |   1 +
 lib/Makefile           |   2 +-
 lib/list.c             | 227 ++++++++++++++++++++++++++++
 mm/compaction.c        |  90 +++++------
 mm/hugetlb.c           |   8 +-
 mm/memory.c            |   2 +-
 mm/page_alloc.c        | 332 ++++++++++++++++++++++++++++-------------
 mm/page_isolation.c    |  12 +-
 mm/vmstat.c            |   8 +-
 12 files changed, 526 insertions(+), 165 deletions(-)
 create mode 100644 lib/list.c

Comments

Daniel Jordan Sept. 21, 2018, 5:45 p.m. UTC | #1
On Tue, Sep 11, 2018 at 01:36:07PM +0800, Aaron Lu wrote:
> At this year's MM summit, Daniel Jordan and others proposed an
> innovative technique that lets multiple threads concurrently call
> list_del() at any position in a list and list_add() at the head of the
> list, without taking a lock[0].
> 
> People think this technique may be useful for improving zone lock
> scalability, so here is my attempt.

Nice, this uses the smp_list_* functions well in spite of the limitations you
encountered with them here.

> Performance-wise, on a 2-socket Intel Skylake server with 56 cores/112
> threads, using will-it-scale/page_fault1 in process mode (higher is
> better):
> 
> kernel        performance      zone lock contention
> patch1         9219349         76.99%
> patch7         2461133 -73.3%  54.46% (another 34.66% on smp_list_add())
> patch8        11712766 +27.0%  68.14%
> patch9        11386980 +23.5%  67.18%

Is "zone lock contention" the percentage that readers and writers combined
spent waiting?  I'm curious to see read and write wait time broken out, since
it seems there are writers (very likely on the allocation side) spoiling the
parallelism we get with the read lock.

If the contention is from allocation, I wonder whether it's feasible to make
that path use SMP list functions.  Something like smp_list_cut_position
combined with using page clusters from [*] to cut off a chunk of list.  Many
details to keep in mind there, though, like having to unset PageBuddy in that
list chunk when other tasks can be concurrently merging pages that are part of
it.

Or maybe what's needed is a more scalable data structure than an array of
lists, since contention on the heads seems to be the limiting factor.  A simple
list that keeps the pages in most-recently-used order (except when adding to
the list tail) is good for cache warmth, but I wonder how helpful that is when
all CPUs can allocate from the front.  Having multiple places to put pages of a
given order/mt would ease the contention.
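
For instance (purely illustrative, invented names), several heads per
order/migratetype that CPUs spread across would dilute contention on
any single head:

  #define FREE_LIST_SHARDS 8

  struct sharded_free_area {
          struct list_head heads[FREE_LIST_SHARDS];
          unsigned long nr_free;
  };

  /* Hash on the CPU so different CPUs usually touch different heads. */
  static struct list_head *pick_head(struct sharded_free_area *area)
  {
          return &area->heads[raw_smp_processor_id() % FREE_LIST_SHARDS];
  }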

> Though lock contention dropped a lot with patch 7, performance dropped
> considerably due to severe cache bouncing on the free list head among
> multiple threads freeing pages at the same time, because every page free
> needs to add the freed page at the free list head.

Could be beneficial to take an MCS-style approach in smp_list_splice/add so
that multiple waiters aren't bouncing the same cacheline around.  This is
something I'm planning to try on lru_lock.
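
Roughly something like this untested sketch (invented names, and a
delegation flavor of MCS rather than a lock): each freeing CPU publishes
its page with a single xchg on a queue tail and then spins on its own
node, and whoever arrives first becomes the combiner and puts everyone's
pages on the list on their behalf:

  struct free_node {
          struct free_node *next;
          struct page *page;
          int done;
  };

  static void combined_free(struct free_node **tail,
                            struct list_head *freelist,
                            struct free_node *node, struct page *page)
  {
          struct free_node *prev, *n, *next;

          node->page = page;
          node->next = NULL;
          node->done = 0;

          prev = xchg(tail, node);
          if (prev) {
                  /* A combiner exists; link in, spin on our own node. */
                  WRITE_ONCE(prev->next, node);
                  while (!smp_load_acquire(&node->done))
                          cpu_relax();
                  return;
          }

          /* We are the combiner: free our page and everyone behind us. */
          n = node;
          for (;;) {
                  list_add(&n->page->lru, freelist);
                  next = READ_ONCE(n->next);
                  if (!next) {
                          /* Possibly the last node: try to close the queue. */
                          if (cmpxchg(tail, n, NULL) == n) {
                                  if (n != node)
                                          smp_store_release(&n->done, 1);
                                  return;
                          }
                          /* A new node is being linked in; wait for it. */
                          while (!(next = READ_ONCE(n->next)))
                                  cpu_relax();
                  }
                  /* Read ->next before releasing the node's owner. */
                  if (n != node)
                          smp_store_release(&n->done, 1);
                  n = next;
          }
  }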

Daniel

[*] https://lkml.kernel.org/r/20180509085450.3524-1-aaron.lu@intel.com
Aaron Lu Sept. 25, 2018, 2:37 a.m. UTC | #2
On Fri, Sep 21, 2018 at 10:45:36AM -0700, Daniel Jordan wrote:
> On Tue, Sep 11, 2018 at 01:36:07PM +0800, Aaron Lu wrote:
> > At this year's MM summit, Daniel Jordan and others proposed an
> > innovative technique that lets multiple threads concurrently call
> > list_del() at any position in a list and list_add() at the head of the
> > list, without taking a lock[0].
> > 
> > People think this technique may be useful for improving zone lock
> > scalability, so here is my attempt.
> 
> Nice, this uses the smp_list_* functions well in spite of the limitations you
> encountered with them here.
> 
> > Performance-wise, on a 2-socket Intel Skylake server with 56 cores/112
> > threads, using will-it-scale/page_fault1 in process mode (higher is
> > better):
> > 
> > kernel        performance      zone lock contention
> > patch1         9219349         76.99%
> > patch7         2461133 -73.3%  54.46% (another 34.66% on smp_list_add())
> > patch8        11712766 +27.0%  68.14%
> > patch9        11386980 +23.5%  67.18%
> 
> Is "zone lock contention" the percentage that readers and writers combined
> spent waiting?  I'm curious to see read and write wait time broken out, since
> it seems there are writers (very likely on the allocation side) spoiling the
> parallelism we get with the read lock.

The lock contention is combined: the read side consumes about 31% and
the write side about 35%. The write side is definitely blocking the
read side.

I also tried not taking the lock at all on the free path (not even in
read mode) to avoid the free path blocking on the allocation path, but
that caused other unpleasant consequences for the allocation path:
free_list head->next can be NULL when allocating pages, because the free
path may be concurrently adding pages to the list with
smp_list_add/splice, so I had to fetch pages from free_list head->prev
instead on the allocation path. Also, a fetched page can be merged in
the meantime on the free path, so I needed to confirm whether it was
really usable, etc. This complicated the allocation path and didn't
deliver good results, so I gave up on the idea.

> If the contention is from allocation, I wonder whether it's feasible to make
> that path use SMP list functions.  Something like smp_list_cut_position
> combined with using page clusters from [*] to cut off a chunk of list.  Many
> details to keep in mind there, though, like having to unset PageBuddy in that
> list chunk when other tasks can be concurrently merging pages that are part of
> it.

As you note, the PageBuddy flag is a problem. If I cut off a batch of
pages from the free_list and then drop the lock, these pages will still
have the PageBuddy flag set, and the free path can attempt a merge with
any of them and cause problems.

The PageBuddy flag cannot be cleared while holding the lock, since that
would require accessing the 'struct page's of these pages, which is the
most time-consuming part of all the operations that happen on the
allocation path under the zone lock.

This is doable in the no_merge+cluster_alloc approach you referenced
because merging is skipped most of the time; and when merging really
needs to happen, as in compaction, cluster_alloc is disabled.

> Or maybe what's needed is a more scalable data structure than an array of
> lists, since contention on the heads seems to be the limiting factor.  A simple
> list that keeps the pages in most-recently-used order (except when adding to
> the list tail) is good for cache warmth, but I wonder how helpful that is when
> all CPUs can allocate from the front.  Having multiple places to put pages of a
> given order/mt would ease the contention.

Agree.

> > Though lock contention dropped a lot with patch 7, performance dropped
> > considerably due to severe cache bouncing on the free list head among
> > multiple threads freeing pages at the same time, because every page free
> > needs to add the freed page at the free list head.
> 
> Could be beneficial to take an MCS-style approach in smp_list_splice/add so
> that multiple waiters aren't bouncing the same cacheline around.  This is
> something I'm planning to try on lru_lock.

That's a good idea.
If that is done, we can at least parallelise the free path and gain
something by not paying the penalty of cache bouncing on the list head.

> 
> Daniel
> 
> [*] https://lkml.kernel.org/r/20180509085450.3524-1-aaron.lu@intel.com

And thanks a lot for the comments!