[PATCH-tip,00/22] locking/rwsem: Rework rwsem-xadd & enable new rwsem features

Message ID 1549566446-27967-1-git-send-email-longman@redhat.com

Message

Waiman Long Feb. 7, 2019, 7:07 p.m. UTC
This patchset revamps the current rwsem-xadd implementation to make
it saner and easier to work with. This patchset removes all the
architecture specific assembly code and uses generic C code for all
architectures. This eases maintenance and enables us to enhance the
code more easily.

This patchset also implements the following 3 new features:

 1) Waiter lock handoff
 2) Reader optimistic spinning
 3) Store write-lock owner in the atomic count (x86-64 only)

Waiter lock handoff is similar to the mechanism currently in the mutex
code. This ensures that lock starvation won't happen.

Reader optimistic spinning enables readers to acquire the lock more
quickly.  So workloads that use a mix of readers and writers should
see an increase in performance.

Finally, storing the write-lock owner into the count will allow
optimistic spinners to get to the lock holder's task structure more
quickly and eliminate the timing gap where the write lock is acquired
but the owner isn't known yet. This is important for RT tasks where
spinning on a lock with an unknown owner is not allowed.

Because multiple readers can share the same lock, there is a natural
preference for readers when measuring in terms of
locking throughput as more readers are likely to get into the locking
fast path than the writers. With waiter lock handoff, we are not going
to starve the writers.

Patches 1-2 rework the qspinlock_stat code to make it generic (lock
event counting) so that it can be used by all architectures and all
locking code.

Patch 3 relocates rwsem_down_read_failed() and associated functions
to below the optimistic spinning functions.

Patch 4 eliminates all architecture specific code and uses generic C
code for all.

Patch 5 moves code that manages the owner field closer to the rwsem
lock fast path as it is not needed by the rwsem-spinlock code.

Patch 6 renames rwsem.h to rwsem-xadd.h as it is now specific to
rwsem-xadd.c only.

Patch 7 hides the internal rwsem-xadd functions from the public.

Patch 8 moves the DEBUG_RWSEMS_WARN_ON checks from rwsem.c to
kernel/locking/rwsem-xadd.h and adds some new ones.

Patch 9 enhances the DEBUG_RWSEMS_WARN_ON macro to print out rwsem
internal states that can be useful for debugging purposes.

Patch 10 enables lock event counting in the rwsem code.

Patch 11 implements a new rwsem locking scheme similar to what qrwlock
is currently doing. Write lock is done by atomic_cmpxchg() while read
lock is still being done by atomic_add().
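
A minimal user-space sketch of the split that patch 11 describes, written
with C11 atomics instead of the kernel's atomic API. The bit layout
(SKETCH_WRITER_LOCKED, SKETCH_READER_BIAS) and every function name here are
hypothetical, chosen only for illustration; they are not the layout used in
the patch, and the slow paths (queuing, sleeping, wakeup) are left out.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical layout for illustration only: bit 0 marks a writer,
 * readers are counted in units of SKETCH_READER_BIAS above the flag bits. */
#define SKETCH_WRITER_LOCKED 0x1L
#define SKETCH_READER_BIAS   0x100L

struct sketch_rwsem {
	atomic_long count;
};

/* Writer fast path: a compare-and-exchange that succeeds only when the
 * count is completely free (no writer, no readers, no flags). */
static bool sketch_write_trylock(struct sketch_rwsem *sem)
{
	long expected = 0;

	return atomic_compare_exchange_strong_explicit(&sem->count, &expected,
						       SKETCH_WRITER_LOCKED,
						       memory_order_acquire,
						       memory_order_relaxed);
}

static void sketch_write_unlock(struct sketch_rwsem *sem)
{
	atomic_fetch_and_explicit(&sem->count, ~SKETCH_WRITER_LOCKED,
				  memory_order_release);
}

/* Reader fast path: unconditionally add the reader bias (a single xadd
 * on x86), then back out if a writer turned out to hold the lock. */
static bool sketch_read_trylock(struct sketch_rwsem *sem)
{
	long old = atomic_fetch_add_explicit(&sem->count, SKETCH_READER_BIAS,
					     memory_order_acquire);

	if (!(old & SKETCH_WRITER_LOCKED))
		return true;

	atomic_fetch_sub_explicit(&sem->count, SKETCH_READER_BIAS,
				  memory_order_relaxed);
	return false;
}

static void sketch_read_unlock(struct sketch_rwsem *sem)
{
	atomic_fetch_sub_explicit(&sem->count, SKETCH_READER_BIAS,
				  memory_order_release);
}
```

In this sketch a reader never retries its atomic operation, whereas a
writer's compare-and-exchange fails whenever any reader bias is present,
even a transient one from a reader that is about to back out.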

Patch 12 implements lock handoff to prevent lock starvation.

Patch 13 removes rwsem_wake() wakeup optimization as it doesn't work
with lock handoff.

Patch 14 adds some new rwsem owner access helper functions.

Patch 15 merges the write-lock owner task pointer into the count.
Only 64-bit count has enough space to provide a reasonable number of bits
for reader count. ARM64 seems to have a problem with the current encoding
scheme. So this owner merging is currently limited to x86-64 only.
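
The bit budget behind this limit can be worked through in a few lines of C.
The figures (52 address bits, 64-byte alignment, 2 flag bits) come from the
patch 15 description quoted later in this thread; the field order and the
pack/unpack helpers are assumptions made for this sketch, not the encoding
the patch actually uses.

```c
#include <stdint.h>

enum {
	SKETCH_WORD_BITS   = 64,
	SKETCH_ADDR_BITS   = 52, /* up to 52 address bits (4PB) on x86-64 */
	SKETCH_ALIGN_BITS  = 6,  /* task pointer aligned to a 64-byte cacheline */
	SKETCH_FLAG_BITS   = 2,  /* reserved status flags */
	/* Owner needs 52 - 6 = 46 bits once the always-zero low bits are dropped. */
	SKETCH_OWNER_BITS  = SKETCH_ADDR_BITS - SKETCH_ALIGN_BITS,
	/* What is left for readers: 64 - 46 - 2 = 16 bits. */
	SKETCH_READER_BITS = SKETCH_WORD_BITS - SKETCH_OWNER_BITS - SKETCH_FLAG_BITS,
};

/* Hypothetical field order: flags lowest, then reader count, then owner. */
#define SKETCH_OWNER_SHIFT (SKETCH_FLAG_BITS + SKETCH_READER_BITS)
#define SKETCH_MAX_READERS ((UINT64_C(1) << SKETCH_READER_BITS) - 1)

/* Compress a 52-bit, 64-byte-aligned address into the owner field. */
static uint64_t sketch_pack_owner(uint64_t addr)
{
	return (addr >> SKETCH_ALIGN_BITS) << SKETCH_OWNER_SHIFT;
}

/* Recover the address from a count word. */
static uint64_t sketch_unpack_owner(uint64_t count)
{
	return (count >> SKETCH_OWNER_SHIFT) << SKETCH_ALIGN_BITS;
}
```

With these numbers SKETCH_MAX_READERS is 65535, which matches the (64k-1)
reader limit stated in the patch description.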

Patch 16 eliminates redundant computation of the merged owner-count.

Patch 17 reduces the chance of missed optimistic spinning opportunity
because of some race conditions.

Patch 18 makes rwsem_spin_on_owner() return a tri-state value.

Patch 19 enables readers to spin on a writer-owned rwsem.

Patch 20 enables lock waiters to spin on a reader-owned rwsem with
a limited number of tries.

Patch 21 makes reader wakeup wake all the readers in the wait queue
instead of just those in the front.

Patch 22 prevents RT tasks from spinning on a rwsem with unknown owner.

In terms of performance, eliminating architecture specific assembly code
and using generic code doesn't seem to have any impact on performance.

Supporting lock handoff does have a minor performance impact on highly
contended rwsem, but it is a price worth paying for preventing lock
starvation.

Reader optimistic spinning is generally good for performance. Of course,
there will be some corner cases where performance may suffer.

Merging owner into count does have a minor performance impact. We can
discuss if this is a feature we want to have in the rwsem code.

There is also some performance data scattered in some of the patches.


Waiman Long (22):
  locking/qspinlock_stat: Introduce a generic lockevent counting APIs
  locking/lock_events: Make lock_events available for all archs & other
    locks
  locking/rwsem: Relocate rwsem_down_read_failed()
  locking/rwsem: Remove arch specific rwsem files
  locking/rwsem: Move owner setting code from rwsem.c to rwsem.h
  locking/rwsem: Rename kernel/locking/rwsem.h
  locking/rwsem: Move rwsem internal function declarations to
    rwsem-xadd.h
  locking/rwsem: Add debug check for __down_read*()
  locking/rwsem: Enhance DEBUG_RWSEMS_WARN_ON() macro
  locking/rwsem: Enable lock event counting
  locking/rwsem: Implement a new locking scheme
  locking/rwsem: Implement lock handoff to prevent lock starvation
  locking/rwsem: Remove rwsem_wake() wakeup optimization
  locking/rwsem: Add more rwsem owner access helpers
  locking/rwsem: Merge owner into count on x86-64
  locking/rwsem: Remove redundant computation of writer lock word
  locking/rwsem: Recheck owner if it is not on cpu
  locking/rwsem: Make rwsem_spin_on_owner() return a tri-state value
  locking/rwsem: Enable readers spinning on writer
  locking/rwsem: Enable count-based spinning on reader
  locking/rwsem: Wake up all readers in wait queue
  locking/rwsem: Ensure an RT task will not spin on reader

 MAINTAINERS                         |   1 -
 arch/Kconfig                        |  10 +
 arch/alpha/include/asm/rwsem.h      | 211 -----------
 arch/arm/include/asm/Kbuild         |   1 -
 arch/arm64/include/asm/Kbuild       |   1 -
 arch/hexagon/include/asm/Kbuild     |   1 -
 arch/ia64/include/asm/rwsem.h       | 172 ---------
 arch/powerpc/include/asm/Kbuild     |   1 -
 arch/s390/include/asm/Kbuild        |   1 -
 arch/sh/include/asm/Kbuild          |   1 -
 arch/sparc/include/asm/Kbuild       |   1 -
 arch/x86/Kconfig                    |   8 -
 arch/x86/include/asm/rwsem.h        | 237 -------------
 arch/x86/lib/Makefile               |   1 -
 arch/x86/lib/rwsem.S                | 156 ---------
 arch/xtensa/include/asm/Kbuild      |   1 -
 include/asm-generic/rwsem.h         | 140 --------
 include/linux/rwsem.h               |  11 +-
 kernel/locking/Makefile             |   1 +
 kernel/locking/lock_events.c        | 153 ++++++++
 kernel/locking/lock_events.h        |  55 +++
 kernel/locking/lock_events_list.h   |  71 ++++
 kernel/locking/percpu-rwsem.c       |   4 +
 kernel/locking/qspinlock.c          |   8 +-
 kernel/locking/qspinlock_paravirt.h |  19 +-
 kernel/locking/qspinlock_stat.h     | 242 +++----------
 kernel/locking/rwsem-xadd.c         | 682 +++++++++++++++++++++---------------
 kernel/locking/rwsem-xadd.h         | 436 +++++++++++++++++++++++
 kernel/locking/rwsem.c              |  31 +-
 kernel/locking/rwsem.h              | 134 -------
 30 files changed, 1197 insertions(+), 1594 deletions(-)
 delete mode 100644 arch/alpha/include/asm/rwsem.h
 delete mode 100644 arch/ia64/include/asm/rwsem.h
 delete mode 100644 arch/x86/include/asm/rwsem.h
 delete mode 100644 arch/x86/lib/rwsem.S
 delete mode 100644 include/asm-generic/rwsem.h
 create mode 100644 kernel/locking/lock_events.c
 create mode 100644 kernel/locking/lock_events.h
 create mode 100644 kernel/locking/lock_events_list.h
 create mode 100644 kernel/locking/rwsem-xadd.h
 delete mode 100644 kernel/locking/rwsem.h

Comments

Peter Zijlstra Feb. 7, 2019, 7:45 p.m. UTC | #1
On Thu, Feb 07, 2019 at 02:07:19PM -0500, Waiman Long wrote:
> On 32-bit architectures, there aren't enough bits to hold both.
> 64-bit architectures, however, can have enough bits to do that. For
> x86-64, the physical address can use up to 52 bits. That is 4PB of
> memory. That leaves 12 bits available for other use. The task structure
> pointer is also aligned to the L1 cache size. That means another 6 bits
> (64 bytes cacheline) will be available. Reserving 2 bits for status
> flags, we will have 16 bits for the reader count.  That can support
> up to (64k-1) readers.

*groan*...

So take qrwlock's idea for a queue, then make the count value (similar
to the new mutex); that is have a bit0 be a r/w bit, when w bits 6-N are
owner, when r they are reader-count. bit1 can be a pending bit, bit2 a
handoff bit etc..

That should fit and work on 32bit and 64bit without issue.

I have a half-arsed rwsem-atomic.c somewhere that does just that. I just
never got around to doing all the optimistic spin and steal crap that
makes our current rwsem fly.

And that nicely gets rid of that mind bending BIAS crud.
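
One way to read the layout suggested above, as a hedged sketch: bit 0
selects how the upper bits are interpreted, bits 1 and 2 are pending and
handoff flags, and everything from bit 6 up is either the owner pointer
(writer) or the reader count (readers). The macro and function names are
invented for this illustration, and the sketch assumes the owner structure
is aligned to 64 bytes so that its low 6 bits are zero.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_WRITER        ((uintptr_t)1 << 0) /* set: payload is the owner */
#define SKETCH_PENDING       ((uintptr_t)1 << 1)
#define SKETCH_HANDOFF       ((uintptr_t)1 << 2)
#define SKETCH_PAYLOAD_SHIFT 6
#define SKETCH_FLAG_MASK     (((uintptr_t)1 << SKETCH_PAYLOAD_SHIFT) - 1)

static bool sketch_is_writer(uintptr_t count)
{
	return count & SKETCH_WRITER;
}

/* Writer-held word: the 64-byte-aligned owner pointer plus the r/w bit. */
static uintptr_t sketch_make_writer(const void *owner)
{
	return (uintptr_t)owner | SKETCH_WRITER;
}

/* Owner is only meaningful while the writer bit is set. */
static void *sketch_owner(uintptr_t count)
{
	if (!sketch_is_writer(count))
		return NULL;
	return (void *)(count & ~SKETCH_FLAG_MASK);
}

/* Reader count is only meaningful while the writer bit is clear. */
static uintptr_t sketch_readers(uintptr_t count)
{
	if (sketch_is_writer(count))
		return 0;
	return count >> SKETCH_PAYLOAD_SHIFT;
}

/* One more reader: bump the payload by one unit, leaving the flags alone. */
static uintptr_t sketch_add_reader(uintptr_t count)
{
	return count + ((uintptr_t)1 << SKETCH_PAYLOAD_SHIFT);
}
```

Because the payload starts at the same bit on 32-bit and 64-bit words, the
same macros work for both; only the number of payload bits changes.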
Davidlohr Bueso Feb. 7, 2019, 7:51 p.m. UTC | #2
On Thu, 07 Feb 2019, Waiman Long wrote:
> 30 files changed, 1197 insertions(+), 1594 deletions(-)

Performance numbers on numerous workloads, pretty please.

I'll go and throw this at my mmap_sem intensive workloads
I've collected.

Thanks,
Davidlohr
Waiman Long Feb. 7, 2019, 7:55 p.m. UTC | #3
On 02/07/2019 02:45 PM, Peter Zijlstra wrote:
> On Thu, Feb 07, 2019 at 02:07:19PM -0500, Waiman Long wrote:
>> On 32-bit architectures, there aren't enough bits to hold both.
>> 64-bit architectures, however, can have enough bits to do that. For
>> x86-64, the physical address can use up to 52 bits. That is 4PB of
>> memory. That leaves 12 bits available for other use. The task structure
>> pointer is also aligned to the L1 cache size. That means another 6 bits
>> (64 bytes cacheline) will be available. Reserving 2 bits for status
>> flags, we will have 16 bits for the reader count.  That can support
>> up to (64k-1) readers.
> *groan*...
>
> So take qrwlock's idea for a queue, then make the count value (similar
> to the new mutex); that is have a bit0 be a r/w bit, when w bits 6-N are
> owner, when r they are reader-count. bit1 can be a pending bit, bit2 a
> handoff bit etc..
>
> That should fit and work on 32bit and 64bit without issue.
>
> I have a half-arsed rwsem-atomic.c somewhere that does just that. I just
> never got around to doing all the optimistic spin and steal crap that
> makes our current rwsem fly.
>
> And that nicely gets rid of that mind bending BIAS crud.

Well, the reason for this compromise is to keep using xadd for readers.
Your scheme will certainly work, but we have to use cmpxchg for readers
too. That will have a performance impact especially with multiple
readers contending which I am trying to avoid.

Cheers,
Longman
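
To illustrate the trade-off raised in this reply: when the same bits hold
either an owner or a reader count, a reader can no longer add to the count
blindly, because the add could corrupt a writer's owner bits. It has to
check the writer bit and retry a compare-and-exchange instead. The sketch
below uses C11 atomics; the constants and the function name are
hypothetical and are not taken from any posted patch.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_WRITER      0x1UL
#define SKETCH_READER_UNIT 0x40UL /* payload assumed to start at bit 6 */

/* Reader acquire with a shared owner/reader-count payload.
 * Returns false as soon as a writer is seen to hold the lock. */
static bool sketch_read_trylock_cmpxchg(atomic_ulong *count)
{
	unsigned long old = atomic_load_explicit(count, memory_order_relaxed);

	while (!(old & SKETCH_WRITER)) {
		if (atomic_compare_exchange_weak_explicit(count, &old,
							  old + SKETCH_READER_UNIT,
							  memory_order_acquire,
							  memory_order_relaxed))
			return true;
		/* The failed exchange reloaded 'old'; re-check and retry. */
	}
	return false;
}
```

Each failed exchange means another CPU changed the word first, so with many
readers arriving at once some of them loop several times, which is the cost
that a single unconditional xadd avoids.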
Waiman Long Feb. 7, 2019, 8 p.m. UTC | #4
On 02/07/2019 02:51 PM, Davidlohr Bueso wrote:
> On Thu, 07 Feb 2019, Waiman Long wrote:
>> 30 files changed, 1197 insertions(+), 1594 deletions(-)
>
> Performance numbers on numerous workloads, pretty please.
>
> I'll go and throw this at my mmap_sem intensive workloads
> I've collected.
>
> Thanks,
> Davidlohr

Thanks for getting some of the performance numbers. This is the initial
draft after more than a year of hibernation. I will also get other
performance numbers in a subsequent revision of the patch.

Cheers,
Longman
Linus Torvalds Feb. 8, 2019, 7:50 p.m. UTC | #5
On Thu, Feb 7, 2019 at 11:08 AM Waiman Long <longman@redhat.com> wrote:
>
> This patchset revamps the current rwsem-xadd implementation to make
> it saner and easier to work with. This patchset removes all the
> architecture specific assembly code and uses generic C code for all
> architectures. This eases maintenance and enables us to enhance the
> code more easily.
>
> This patchset also implements the following 3 new features:
>
>  1) Waiter lock handoff
>  2) Reader optimistic spinning
>  3) Store write-lock owner in the atomic count (x86-64 only)

The patches are kind of hard to read, with most of them just doing
prep-work that doesn't necessarily matter to the big picture.

What I'd really like to see is

 (a) an overview of the new locking logic

 (b) what's the new fastpath case

 (c) some performance numbers

to explain the changes from a "this is the point of the whole
exercise" standpoint.

And yes, I realize that the lock handoff and optimistic spinning is a
big deal, since I've seen the same regression numbers that presumably
caused this effort to be resurrected. So it's not that I don't find
this intriguing and worthwhile, it's literally that I'd like a summary
not so much of the individual patches, but of the new model.

Please?

             Linus
Waiman Long Feb. 8, 2019, 8:31 p.m. UTC | #6
On 02/08/2019 02:50 PM, Linus Torvalds wrote:
> On Thu, Feb 7, 2019 at 11:08 AM Waiman Long <longman@redhat.com> wrote:
>> This patchset revamps the current rwsem-xadd implementation to make
>> it saner and easier to work with. This patchset removes all the
>> architecture specific assembly code and uses generic C code for all
>> architectures. This eases maintenance and enables us to enhance the
>> code more easily.
>>
>> This patchset also implements the following 3 new features:
>>
>>  1) Waiter lock handoff
>>  2) Reader optimistic spinning
>>  3) Store write-lock owner in the atomic count (x86-64 only)
> The patches are kind of hard to read, with most of them just doing
> prep-work that doesn't necessarily matter to the big picture.
>
> What I'd really like to see is
>
>  (a) an overview of the new locking logic

The new locking logic is similar to qrwlock (see patch 11). Cmpxchg is
used to acquire the write lock, while xadd is still used for read lock.
Some of the bits in the count are also reserved for special purposes such
as marking waiters or lock handoff. Patch 15 tries to compress the write-lock
owner task pointer and put it into the count field for x86-64 at the
expense of fewer bits available for reader count. I have sent out an
additional patch this morning to make sure that the reader count won't
overflow.

In terms of performance, there isn't much change with respect to
read-lock performance. For write-lock, I saw a slight drop in some
cases, but nothing significant. The merging of owner task pointer into
the count field does impose a slightly bigger drop than I would have
liked which I am going to look into a bit more.

>
>  (b) what's the new fastpath case

The only change in the fastpath is the use of cmpxchg for writer lock.

>
>  (c) some performance numbers

There is performance data in patches 11, 12, 15, 19, 20 and 21. There was
performance data for patch 4 as well for eliminating the arch specific
file. Apparently, I might have deleted it accidentally. Anyway, no
noticeable performance difference was observed when switching to use
generic C code for x86, ppc and ARM64.

The major gain in performance is due to the reader optimistic spinning
patches. The microbenchmark that I used showed an order-of-magnitude
performance improvement for mixed reader-writer workloads. Of course, we
will see less performance gain with real world benchmarks.

I am planning to run more performance tests and post the data sometime
next week. Davidlohr is also going to run some of his rwsem performance
tests on this patchset.

>
> to explain the changes from a "this is the point of the whole
> exercise" standpoint.
>
> And yes, I realize that the lock handoff and optimistic spinning is a
> big deal, since I've seen the same regression numbers that presumably
> caused this effort to be resurrected. So it's not that I don't find
> this intriguing and worthwhile, it's literally that I'd like a summary
> not so much of the individual patches, but of the new model.
>
> Please?

Maybe I should break this patchset into a few smaller ones to make it
easier to review. Any suggestion is welcome.

Cheers,
Longman
Linus Torvalds Feb. 9, 2019, 12:03 a.m. UTC | #7
On Fri, Feb 8, 2019 at 12:31 PM Waiman Long <longman@redhat.com> wrote:
>
> >  (b) what's the new fastpath case
>
> The only change in the fastpath is the use of cmpxchg for writer lock.

.. since a big deal here was about using the generic atomic accessor
functions, I really was looking forward to seeing the *actual* fast
path code generation.

In other words, right now I have very little visibility in how it
actually affects the code. Looking at the patches themselves doesn't
make it obvious. I was hoping for the overview to really explain the
whole "before and after" situation, and it didn't. Not at the high
level, and not at a low level. And no performance numbers in the
overview either.

And yes, I see the numbers in the patches, but what I really hoped for
was some real load numbers. In particular, I would have loved to see
numbers from the kernel test robot "will-it-scale.per_thread_ops"
case, which is the one that had a 65% regression due to the lack of
reader spinning.

So I was kind of hoping to hear whether that regression is basically
entirely gone with this patch series, or if we still have a regression
due to the extra downgrade, or what?

                 Linus
Ingo Molnar Feb. 11, 2019, 7:38 a.m. UTC | #8
* Waiman Long <longman@redhat.com> wrote:

> On 02/07/2019 02:51 PM, Davidlohr Bueso wrote:
> > On Thu, 07 Feb 2019, Waiman Long wrote:
> >> 30 files changed, 1197 insertions(+), 1594 deletions(-)
> >
> > Performance numbers on numerous workloads, pretty please.
> >
> > I'll go and throw this at my mmap_sem intensive workloads
> > I've collected.
> >
> > Thanks,
> > Davidlohr
> 
> Thanks for getting some of the performance numbers. This is the initial
> draft after more than a year of hibernation. I will also get other
> performance numbers in a subsequent revision of the patch.

If you could sort all the invariant preparatory patches to the head of 
the series I can merge them to reduce overall complexity and simplify 
performance testing and review of the rest.

Thanks,

	Ingo
Rong Chen Feb. 13, 2019, 9:19 a.m. UTC | #9
Hi all,

Kernel test robot reported a will-it-scale.per_thread_ops -64.1% regression on IVB-desktop for v4.20-rc1.
The first bad commit is: 9bc8039e715da3b53dbac89525323a9f2f69b7b5, Yang Shi <yang.shi@linux.alibaba.com>: mm: brk: downgrade mmap_sem to read when shrinking
(https://lists.01.org/pipermail/lkp/2018-November/009335.html).

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
  gcc-7/performance/x86_64-rhel-7.2/thread/100%/debian-x86_64-2018-04-03.cgz/lkp-ivb-d01/brk1/will-it-scale/0x20

commit: 
  85a06835f6 ("mm: mremap: downgrade mmap_sem to read when shrinking")
  9bc8039e71 ("mm: brk: downgrade mmap_sem to read when shrinking")

85a06835f6f1ba79 9bc8039e715da3b53dbac89525 
---------------- -------------------------- 
         %stddev     %change         %stddev
             \          |                \  
    196250 ±  8%     -64.1%      70494        will-it-scale.per_thread_ops
    127330 ± 19%     -98.0%       2525 ± 24%  will-it-scale.time.involuntary_context_switches
    727.50 ±  2%     -77.0%     167.25        will-it-scale.time.percent_of_cpu_this_job_got
      2141 ±  2%     -77.6%     479.12        will-it-scale.time.system_time
     50.48 ±  7%     -48.5%      25.98        will-it-scale.time.user_time
  34925294 ± 18%    +270.3%  1.293e+08 ±  4%  will-it-scale.time.voluntary_context_switches
   1570007 ±  8%     -64.1%     563958        will-it-scale.workload
      6435 ±  2%      -6.4%       6024        proc-vmstat.nr_shmem
      1298 ± 16%     -44.5%     721.00 ± 18%  proc-vmstat.pgactivate
      2341           +16.4%       2724        slabinfo.kmalloc-96.active_objs
      2341           +16.4%       2724        slabinfo.kmalloc-96.num_objs
      6346 ±150%     -87.8%     776.25 ±  9%  softirqs.NET_RX
    160107 ±  8%    +151.9%     403273        softirqs.SCHED
   1097999           -13.0%     955526        softirqs.TIMER
      5.50 ±  9%     -81.8%       1.00        vmstat.procs.r
    230700 ± 19%    +269.9%     853292 ±  4%  vmstat.system.cs
     26706 ±  3%     +15.7%      30910 ±  5%  vmstat.system.in
     11.24 ± 23%     +72.2       83.39        mpstat.cpu.idle%
      0.00 ±131%      +0.0        0.04 ± 99%  mpstat.cpu.iowait%
     86.32 ±  2%     -70.8       15.54        mpstat.cpu.sys%
      2.44 ±  7%      -1.4        1.04 ±  8%  mpstat.cpu.usr%
  20610709 ± 15%   +2376.0%  5.103e+08 ± 34%  cpuidle.C1.time
   3233399 ±  8%    +241.5%   11042785 ± 25%  cpuidle.C1.usage
  36172040 ±  6%    +931.3%   3.73e+08 ± 15%  cpuidle.C1E.time
    783605 ±  4%    +548.7%    5083041 ± 18%  cpuidle.C1E.usage
  28753819 ± 39%   +1054.5%  3.319e+08 ± 49%  cpuidle.C3.time
    283912 ± 25%    +688.4%    2238225 ± 34%  cpuidle.C3.usage
 1.507e+08 ± 47%    +292.3%  5.913e+08 ± 28%  cpuidle.C6.time
    339861 ± 37%    +549.7%    2208222 ± 24%  cpuidle.C6.usage
   2709719 ±  5%    +824.2%   25043444        cpuidle.POLL.time
  28602864 ± 18%    +173.7%   78276116 ± 10%  cpuidle.POLL.usage


We found that the patchset could fix the regression.
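
For reading the tables in this report: in rows where the change column
carries a percent sign, it appears to be the relative difference between
the left (base) and right (new) values. The helper below is a hypothetical
illustration of that arithmetic and is not part of the test robot's
tooling.

```c
/* Relative change from a base measurement to a new one, in percent. */
static double sketch_pct_change(double base, double new_value)
{
	return (new_value - base) / base * 100.0;
}
```

Applied to will-it-scale.per_thread_ops, 196250 to 70494 gives roughly
-64.1% for the regression, and 71190 to 199232 gives roughly +180% once the
patchset is applied on top of the regressing commit.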

tests: 1
testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-brk1-ucode=0x20/lkp-ivb-d01

commit: 
  85a06835f6 ("mm: mremap: downgrade mmap_sem to read when shrinking")
  fb835fe7f0 ("locking/rwsem: Ensure an RT task will not spin on reader")

85a06835f6f1ba79  fb835fe7f0adbd7c2c074b98ec  
----------------  --------------------------  
         %stddev      change         %stddev
             \          |                \  
    120736 ± 22%        56%     188019 ±  6%  will-it-scale.time.involuntary_context_switches
      2126 ±  3%         4%       2215        will-it-scale.time.system_time
       722 ±  3%         4%        752        will-it-scale.time.percent_of_cpu_this_job_got
  36256485 ± 27%       -35%   23682989 ±  3%  will-it-scale.time.voluntary_context_switches
      3151 ±  9%        11%       3504        turbostat.Avg_MHz
    229285 ± 32%       -30%     160660 ±  3%  vmstat.system.cs
    120736 ± 22%        56%     188019 ±  6%  time.involuntary_context_switches
      2126 ±  3%         4%       2215        time.system_time
       722 ±  3%         4%        752        time.percent_of_cpu_this_job_got
  36256485 ± 27%       -35%   23682989 ±  3%  time.voluntary_context_switches
        23             643%        171 ±  3%  proc-vmstat.nr_zone_inactive_file
        23             643%        171 ±  3%  proc-vmstat.nr_inactive_file
      3664              12%       4121        proc-vmstat.nr_kernel_stack
      6392               6%       6785        proc-vmstat.nr_slab_unreclaimable
      9991                       10176        proc-vmstat.nr_slab_reclaimable
     63938                       62394        proc-vmstat.nr_zone_active_anon
     63938                       62394        proc-vmstat.nr_active_anon
    386388 ±  9%        -6%     362272        proc-vmstat.pgfree
    368296 ±  9%       -10%     333074        proc-vmstat.numa_hit
    368296 ±  9%       -10%     333074        proc-vmstat.numa_local
      5169 ± 13%       -28%       3745        proc-vmstat.nr_shmem
      1801 ± 21%       -83%        309        proc-vmstat.pgactivate
         0            1e+04      11441        latency_stats.avg.msleep.cpuinfo_open.proc_reg_open.do_dentry_open.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
     13165 ±222%     -1e+04          0        latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup_revalidate.lookup_fast.walk_component.link_path_walk.path_lookupat.filename_lookup
     22499 ±151%     -2e+04        657 ±  7%  latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.__lookup_slow.lookup_slow.walk_component.path_lookupat.filename_lookup
    117414 ±181%     -9e+04      24418 ± 44%  latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_do_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open.do_syscall_64
    666005 ±218%     -7e+05        198 ±141%  latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_access.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat.filename_lookup
   2600097 ±132%     -3e+06        572        latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_getattr.__nfs_revalidate_inode.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat
  34391390 ±150%     -3e+07      21807 ±141%  latency_stats.avg.io_schedule.nfs_lock_and_join_requests.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
  34624774 ±149%     -3e+07      37668 ± 58%  latency_stats.avg.max
         0            1e+04      11441        latency_stats.max.msleep.cpuinfo_open.proc_reg_open.do_dentry_open.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
     22499 ±151%     -2e+04        657 ±  7%  latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.__lookup_slow.lookup_slow.walk_component.path_lookupat.filename_lookup
     37845 ±222%     -4e+04          0        latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup_revalidate.lookup_fast.walk_component.link_path_walk.path_lookupat.filename_lookup
     80096 ± 59%     -8e+04          0        latency_stats.max.call_rwsem_down_write_failed_killable.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe
    177149 ±195%     -2e+05      24418 ± 44%  latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_do_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open.do_syscall_64
    689417 ±209%     -7e+05        200 ±141%  latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_access.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat.filename_lookup
  18679699 ±129%     -2e+07        656        latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_getattr.__nfs_revalidate_inode.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat
  83587334 ±129%     -8e+07      43457 ±141%  latency_stats.max.io_schedule.nfs_lock_and_join_requests.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
  84867236 ±126%     -8e+07      59318 ± 86%  latency_stats.max.max
         0            1e+04      11441        latency_stats.sum.msleep.cpuinfo_open.proc_reg_open.do_dentry_open.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
     22499 ±151%     -2e+04        657 ±  7%  latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.__lookup_slow.lookup_slow.walk_component.path_lookupat.filename_lookup
     39431 ±222%     -4e+04          0        latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup_revalidate.lookup_fast.walk_component.link_path_walk.path_lookupat.filename_lookup
    216448 ±200%     -2e+05      24418 ± 44%  latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_do_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open.do_syscall_64
    691960 ±208%     -7e+05        397 ±141%  latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_access.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat.filename_lookup
  24239011 ±140%     -2e+07       4768 ± 10%  latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_getattr.__nfs_revalidate_inode.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat
 1.771e+08 ±122%     -2e+08      43614 ±141%  latency_stats.sum.io_schedule.nfs_lock_and_join_requests.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
 1.939e+08 ± 36%     -2e+08          0        latency_stats.sum.call_rwsem_down_write_failed_killable.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe
 2.943e+08 ± 51%     -2e+08   51929782        latency_stats.sum.max
    407463 ± 10%      -100%          0        perf-stat.total.page-faults
  74225651 ± 26%      -100%          0        perf-stat.total.context-switches
     55293 ± 25%      -100%          0        perf-stat.total.cpu-migrations
    407463 ± 10%      -100%          0        perf-stat.total.minor-faults


tests: 1
testcase/path_params/tbox_group/run: will-it-scale/performance-thread-100%-brk1-ucode=0x20/lkp-ivb-d01

commit: 
  9bc8039e71 ("mm: brk: downgrade mmap_sem to read when shrinking")
  fb835fe7f0 ("locking/rwsem: Ensure an RT task will not spin on reader")

9bc8039e715da3b5  fb835fe7f0adbd7c2c074b98ec  
----------------  --------------------------  
         %stddev      change         %stddev
             \          |                \  
      3500 ± 36%      5272%     188019 ±  6%  will-it-scale.time.involuntary_context_switches
       483             358%       2215        will-it-scale.time.system_time
       168             346%        752        will-it-scale.time.percent_of_cpu_this_job_got
     71190             180%     199232 ±  4%  will-it-scale.per_thread_ops
    569524             180%    1593862 ±  4%  will-it-scale.workload
     25.85              93%      49.95 ±  3%  will-it-scale.time.user_time
 1.314e+08 ±  3%       -82%   23682989 ±  3%  will-it-scale.time.voluntary_context_switches
     30501 ±  9%       -15%      25813 ±  4%  vmstat.system.in
    799593 ± 10%       -80%     160660 ±  3%  vmstat.system.cs
       887 ± 11%       295%       3504        turbostat.Avg_MHz
     23.60 ± 10%        68%      39.54        turbostat.CorWatt
     28.38 ±  8%        57%      44.43        turbostat.PkgWatt
      3500 ± 36%      5272%     188019 ±  6%  time.involuntary_context_switches
       483             358%       2215        time.system_time
       168             346%        752        time.percent_of_cpu_this_job_got
     25.85              93%      49.95 ±  3%  time.user_time
 1.314e+08 ±  3%       -82%   23682989 ±  3%  time.voluntary_context_switches
         0 ± 44%     46220%        386        proc-vmstat.nr_zone_active_file
         0 ± 44%     46220%        386        proc-vmstat.nr_active_file
        23             643%        171 ±  3%  proc-vmstat.nr_zone_inactive_file
        23             643%        171 ±  3%  proc-vmstat.nr_inactive_file
      3690              12%       4121        proc-vmstat.nr_kernel_stack
      6419               6%       6785        proc-vmstat.nr_slab_unreclaimable
      9961                       10176        proc-vmstat.nr_slab_reclaimable
    229251                      231278        proc-vmstat.nr_zone_unevictable
    229251                      231278        proc-vmstat.nr_unevictable
      1008                        1005        proc-vmstat.nr_page_table_pages
     63178                       62394        proc-vmstat.nr_zone_active_anon
     63178                       62394        proc-vmstat.nr_active_anon
    432061 ± 12%       -11%     385372        proc-vmstat.pgfault
    408099 ± 10%       -11%     362272        proc-vmstat.pgfree
    422206 ±  9%       -11%     373690        proc-vmstat.pgalloc_normal
    382357 ± 11%       -13%     333074        proc-vmstat.numa_hit
    382357 ± 11%       -13%     333074        proc-vmstat.numa_local
      4428 ± 17%       -15%       3745        proc-vmstat.nr_shmem
         0            1e+04      11441        latency_stats.avg.msleep.cpuinfo_open.proc_reg_open.do_dentry_open.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
     11180 ±168%     -1e+04        657 ±  7%  latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.__lookup_slow.lookup_slow.walk_component.path_lookupat.filename_lookup
     19239 ±223%     -2e+04          0        latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_get_acl.get_acl.posix_acl_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open
     63702 ±169%     -4e+04      24418 ± 44%  latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_do_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open.do_syscall_64
     77617 ±205%     -8e+04        510 ± 11%  latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
   3043762 ±124%     -3e+06        572        latency_stats.avg.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_getattr.__nfs_revalidate_inode.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat
  11630441 ±139%     -1e+07      21807 ±141%  latency_stats.avg.io_schedule.nfs_lock_and_join_requests.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
  12242832 ±129%     -1e+07      37668 ± 58%  latency_stats.avg.max
         0            1e+04      11441        latency_stats.max.msleep.cpuinfo_open.proc_reg_open.do_dentry_open.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
     11180 ±168%     -1e+04        657 ±  7%  latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.__lookup_slow.lookup_slow.walk_component.path_lookupat.filename_lookup
     19239 ±223%     -2e+04          0        latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_get_acl.get_acl.posix_acl_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open
     29152 ± 11%     -3e+04          0        latency_stats.max.call_rwsem_down_write_failed_killable.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe
     65909 ±164%     -4e+04      24418 ± 44%  latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_do_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open.do_syscall_64
     77617 ±205%     -8e+04        510 ± 11%  latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
  17301268 ±125%     -2e+07        656        latency_stats.max.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_getattr.__nfs_revalidate_inode.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat
  44248611 ±140%     -4e+07      43457 ±141%  latency_stats.max.io_schedule.nfs_lock_and_join_requests.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
  46380610 ±130%     -5e+07      59318 ± 86%  latency_stats.max.max
         0            1e+04      11441        latency_stats.sum.msleep.cpuinfo_open.proc_reg_open.do_dentry_open.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
     11180 ±168%     -1e+04        657 ±  7%  latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.__lookup_slow.lookup_slow.walk_component.path_lookupat.filename_lookup
     19239 ±223%     -2e+04          0        latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_get_acl.get_acl.posix_acl_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open
     74047 ±148%     -5e+04      24418 ± 44%  latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_do_create.nfs3_proc_create.nfs_create.path_openat.do_filp_open.do_sys_open.do_syscall_64
     77617 ±205%     -8e+04        510 ± 11%  latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_lookup.nfs_lookup.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
  26043088 ±130%     -3e+07       4768 ± 10%  latency_stats.sum.rpc_wait_bit_killable.__rpc_execute.rpc_run_task.rpc_call_sync.nfs3_rpc_wrapper.nfs3_proc_getattr.__nfs_revalidate_inode.nfs_do_access.nfs_permission.inode_permission.link_path_walk.path_lookupat
  82480038 ±152%     -8e+07      43614 ±141%  latency_stats.sum.io_schedule.nfs_lock_and_join_requests.nfs_updatepage.nfs_write_end.generic_perform_write.nfs_file_write.__vfs_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
 1.771e+09           -2e+09   51929782        latency_stats.sum.max
 1.771e+09           -2e+09          0        latency_stats.sum.call_rwsem_down_write_failed_killable.__x64_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe
    420016 ± 12%      -100%          0        perf-stat.total.page-faults
 2.648e+08 ±  3%      -100%          0        perf-stat.total.context-switches
     52212 ± 18%      -100%          0        perf-stat.total.cpu-migrations
    420016 ± 12%      -100%          0        perf-stat.total.minor-faults

Best Regards,
Rong Chen
Linus Torvalds Feb. 13, 2019, 7:56 p.m. UTC | #10
Ok, those test robot reports are hard to read, but trying to distill it down:

On Wed, Feb 13, 2019 at 1:19 AM Chen Rong <rong.a.chen@intel.com> wrote:
>
>          %stddev     %change         %stddev
>              \          |                \
>     196250 ±  8%     -64.1%      70494        will-it-scale.per_thread_ops

That's the original 64% regression..

And then with the patch set:

>          %stddev      change         %stddev
>              \          |                \
>      71190             180%     199232 ±  4%  will-it-scale.per_thread_ops

looks like it's back up where it used to be.
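
As a quick sanity check on those two percentages, here is a minimal Python sketch using only the four per_thread_ops figures quoted above. The helper name pct_change is illustrative and not part of the robot report.

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

# Original regression report: 196250 -> 70494 ops per thread.
regression = pct_change(196250, 70494)

# Report with the patch set applied: 71190 -> 199232 ops per thread.
recovery = pct_change(71190, 199232)

# Ratio of the patched result to the pre-regression baseline.
vs_baseline = 199232 / 196250

print(f"regression: {regression:.1f}%")
print(f"recovery:   {recovery:.0f}%")
print(f"patched / original baseline: {vs_baseline:.3f}")
```

The two reports come from separate runs, so the intermediate values (70494 and 71190) differ slightly. The patched number lands within a couple of percent of the original baseline, which is inside the reported stddev.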

So I guess we have numbers for the regression now. Thanks.

And that closes my biggest question for the new model, and with the
new organization that gets rid of the arch-specific asm separately
first and makes it a bit more legible that way, I guess I'll just Ack
the whole series.

             Linus
Davidlohr Bueso Feb. 14, 2019, 1:23 p.m. UTC | #11
On Fri, 08 Feb 2019, Waiman Long wrote:
>I am planning to run more performance tests and post the data sometime
>next week. Davidlohr is also going to run some of his rwsem performance
>tests on this patchset.

So I ran this series on a 40-core IB 2 socket with various workloads in
mmtests. Below are some of the interesting ones; full numbers and curves
at https://linux-scalability.org/rwsem-reader-spinner/

All workloads are with increasing number of threads.

-- pagefault timings: pft is an artificial pf benchmark (thus reader stress).
metric is faults/cpu and faults/sec
                                       v5.0-rc6                 v5.0-rc6
                                                                    dirty
Hmean     faults/cpu-1    624224.9815 (   0.00%)   618847.5201 *  -0.86%*
Hmean     faults/cpu-4    539550.3509 (   0.00%)   547407.5738 *   1.46%*
Hmean     faults/cpu-7    401470.3461 (   0.00%)   381157.9830 *  -5.06%*
Hmean     faults/cpu-12   267617.0353 (   0.00%)   271098.5441 *   1.30%*
Hmean     faults/cpu-21   176194.4641 (   0.00%)   175151.3256 *  -0.59%*
Hmean     faults/cpu-30   119927.3862 (   0.00%)   120610.1348 *   0.57%*
Hmean     faults/cpu-40    91203.6820 (   0.00%)    91832.7489 *   0.69%*
Hmean     faults/sec-1    623292.3467 (   0.00%)   617992.0795 *  -0.85%*
Hmean     faults/sec-4   2113364.6898 (   0.00%)  2140254.8238 *   1.27%*
Hmean     faults/sec-7   2557378.4385 (   0.00%)  2450945.7060 *  -4.16%*
Hmean     faults/sec-12  2696509.8975 (   0.00%)  2747968.9819 *   1.91%*
Hmean     faults/sec-21  2902892.5639 (   0.00%)  2905923.3881 *   0.10%*
Hmean     faults/sec-30  2956696.5793 (   0.00%)  2990583.5147 *   1.15%*
Hmean     faults/sec-40  3422806.4806 (   0.00%)  3352970.3082 *  -2.04%*
Stddev    faults/cpu-1      2949.5159 (   0.00%)     2802.2712 (   4.99%)
Stddev    faults/cpu-4     24165.9454 (   0.00%)    15841.1232 (  34.45%)
Stddev    faults/cpu-7     20914.8351 (   0.00%)    22744.3294 (  -8.75%)
Stddev    faults/cpu-12    11274.3490 (   0.00%)    14733.3152 ( -30.68%)
Stddev    faults/cpu-21     2500.1950 (   0.00%)     2200.9518 (  11.97%)
Stddev    faults/cpu-30     1599.5346 (   0.00%)     1414.0339 (  11.60%)
Stddev    faults/cpu-40     1473.0181 (   0.00%)     3004.1209 (-103.94%)
Stddev    faults/sec-1      2655.2581 (   0.00%)     2405.1625 (   9.42%)
Stddev    faults/sec-4     84042.7234 (   0.00%)    57996.7158 (  30.99%)
Stddev    faults/sec-7    123656.7901 (   0.00%)   135591.1087 (  -9.65%)
Stddev    faults/sec-12    97135.6091 (   0.00%)   127054.4926 ( -30.80%)
Stddev    faults/sec-21    69564.6264 (   0.00%)    65922.6381 (   5.24%)
Stddev    faults/sec-30    51524.4027 (   0.00%)    56109.4159 (  -8.90%)
Stddev    faults/sec-40   101927.5280 (   0.00%)   160117.0093 ( -57.09%)

With the exception of the hiccup at 7 threads, things are pretty much in
the noise region for both metrics.
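
One rough way to look at "in the noise" is to express each delta in units of the larger of the two standard deviations. The sketch below does that for two rows copied from the table above. The summarize helper and the one-stddev yardstick are assumptions for illustration only, not the significance test that mmtests applies when it marks rows with asterisks.

```python
# (baseline mean, patched mean, baseline stddev, patched stddev),
# copied from the faults/cpu rows of the pft table.
rows = {
    "faults/cpu-7":  (401470.3461, 381157.9830, 20914.8351, 22744.3294),
    "faults/cpu-40": (91203.6820, 91832.7489, 1473.0181, 3004.1209),
}

def summarize(base_mean, new_mean, base_std, new_std):
    """Return (percent change, |delta| expressed in stddev units)."""
    delta = new_mean - base_mean
    pct = delta / base_mean * 100.0
    sigmas = abs(delta) / max(base_std, new_std)
    return pct, sigmas

results = {name: summarize(*vals) for name, vals in rows.items()}
for name, (pct, sigmas) in results.items():
    print(f"{name}: {pct:+.2f}% ({sigmas:.2f} stddev)")
```

By this crude measure even the 7-thread drop is a bit under one standard deviation, so it is not obviously outside the run-to-run variation either.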

-- git checkout

First metric is total runtime for runs with incremental threads.

           v5.0-rc6    v5.0-rc6
                          dirty
User         218.95      219.07
System       149.29      146.82
Elapsed     1574.10     1427.08

In this case there's a non-trivial improvement (roughly 9-10%, going by the
elapsed times above) in overall elapsed time.

-- reaim (which is always susceptible to rwsem changes for both mmap_sem and
i_mmap)
                                     v5.0-rc6               v5.0-rc6
                                                                dirty
Hmean     compute-1         6674.01 (   0.00%)     6544.28 *  -1.94%*
Hmean     compute-21       85294.91 (   0.00%)    85524.20 *   0.27%*
Hmean     compute-41      149674.70 (   0.00%)   149494.58 *  -0.12%*
Hmean     compute-61      177721.15 (   0.00%)   170507.38 *  -4.06%*
Hmean     compute-81      181531.07 (   0.00%)   180463.24 *  -0.59%*
Hmean     compute-101     189024.09 (   0.00%)   187288.86 *  -0.92%*
Hmean     compute-121     200673.24 (   0.00%)   195327.65 *  -2.66%*
Hmean     compute-141     213082.29 (   0.00%)   211290.80 *  -0.84%*
Hmean     compute-161     207764.06 (   0.00%)   204626.68 *  -1.51%*

The 'compute' workload overall takes a small hit.
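
To put a number on "small hit", the per-row changes can be recomputed from the Hmean values above and averaged. This is only a sketch: the plain arithmetic mean of the per-row percentages is an assumed summary, not a figure reported by mmtests.

```python
# thread count -> (v5.0-rc6 Hmean, v5.0-rc6-dirty Hmean), from the compute rows.
compute = {
    1:   (6674.01, 6544.28),
    21:  (85294.91, 85524.20),
    41:  (149674.70, 149494.58),
    61:  (177721.15, 170507.38),
    81:  (181531.07, 180463.24),
    101: (189024.09, 187288.86),
    121: (200673.24, 195327.65),
    141: (213082.29, 211290.80),
    161: (207764.06, 204626.68),
}

# Percent change per thread count, relative to the unpatched kernel.
changes = {n: (new - base) / base * 100.0 for n, (base, new) in compute.items()}

mean_change = sum(changes.values()) / len(changes)
regressed = sum(1 for c in changes.values() if c < 0)

for n, c in changes.items():
    print(f"compute-{n}: {c:+.2f}%")
print(f"mean change: {mean_change:+.2f}% ({regressed} of {len(changes)} rows slower)")
```

Eight of the nine thread counts come out slower with the patches, and the average change is a bit under -1.5%.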

Hmean     new_dbase-1         60.48 (   0.00%)       60.63 *   0.25%*
Hmean     new_dbase-21      6590.49 (   0.00%)     6671.81 *   1.23%*
Hmean     new_dbase-41     14202.91 (   0.00%)    14470.59 *   1.88%*
Hmean     new_dbase-61     21207.24 (   0.00%)    21067.40 *  -0.66%*
Hmean     new_dbase-81     25542.40 (   0.00%)    25542.40 *   0.00%*
Hmean     new_dbase-101    30165.28 (   0.00%)    30046.21 *  -0.39%*
Hmean     new_dbase-121    33638.33 (   0.00%)    33219.90 *  -1.24%*
Hmean     new_dbase-141    36723.70 (   0.00%)    37504.52 *   2.13%*
Hmean     new_dbase-161    42242.51 (   0.00%)    42117.34 *  -0.30%*
Hmean     shared-1            76.54 (   0.00%)       76.09 *  -0.59%*
Hmean     shared-21         7535.51 (   0.00%)     5518.75 * -26.76%*
Hmean     shared-41        17207.81 (   0.00%)    14651.94 * -14.85%*
Hmean     shared-61        20716.98 (   0.00%)    18667.52 *  -9.89%*
Hmean     shared-81        27603.83 (   0.00%)    23466.45 * -14.99%*
Hmean     shared-101       26008.59 (   0.00%)    29536.96 *  13.57%*
Hmean     shared-121       28354.76 (   0.00%)    43139.39 *  52.14%*
Hmean     shared-141       38509.25 (   0.00%)    41619.35 *   8.08%*
Hmean     shared-161       40496.07 (   0.00%)    44303.46 *   9.40%*

Overall there is a small hit (in the noise level but consistent throughout
many workloads), except git-checkout which does quite well.

Thanks,
Davidlohr
Waiman Long Feb. 14, 2019, 3:22 p.m. UTC | #12
On 02/14/2019 08:23 AM, Davidlohr Bueso wrote:
> On Fri, 08 Feb 2019, Waiman Long wrote:
>> I am planning to run more performance tests and post the data sometime
>> next week. Davidlohr is also going to run some of his rwsem performance
>> tests on this patchset.
>
> So I ran this series on a 40-core IB 2 socket with various workloads in
> mmtests. Below are some of the interesting ones; full numbers and curves
> at https://linux-scalability.org/rwsem-reader-spinner/
>
> All workloads are with increasing number of threads.
>
> -- pagefault timings: pft is an artificial pf benchmark (thus reader
> stress).
> metric is faults/cpu and faults/sec
>                                       v5.0-rc6                 v5.0-rc6
>                                                                    dirty
> Hmean     faults/cpu-1    624224.9815 (   0.00%)   618847.5201 *  -0.86%*
> Hmean     faults/cpu-4    539550.3509 (   0.00%)   547407.5738 *   1.46%*
> Hmean     faults/cpu-7    401470.3461 (   0.00%)   381157.9830 *  -5.06%*
> Hmean     faults/cpu-12   267617.0353 (   0.00%)   271098.5441 *   1.30%*
> Hmean     faults/cpu-21   176194.4641 (   0.00%)   175151.3256 *  -0.59%*
> Hmean     faults/cpu-30   119927.3862 (   0.00%)   120610.1348 *   0.57%*
> Hmean     faults/cpu-40    91203.6820 (   0.00%)    91832.7489 *   0.69%*
> Hmean     faults/sec-1    623292.3467 (   0.00%)   617992.0795 *  -0.85%*
> Hmean     faults/sec-4   2113364.6898 (   0.00%)  2140254.8238 *   1.27%*
> Hmean     faults/sec-7   2557378.4385 (   0.00%)  2450945.7060 *  -4.16%*
> Hmean     faults/sec-12  2696509.8975 (   0.00%)  2747968.9819 *   1.91%*
> Hmean     faults/sec-21  2902892.5639 (   0.00%)  2905923.3881 *   0.10%*
> Hmean     faults/sec-30  2956696.5793 (   0.00%)  2990583.5147 *   1.15%*
> Hmean     faults/sec-40  3422806.4806 (   0.00%)  3352970.3082 *  -2.04%*
> Stddev    faults/cpu-1      2949.5159 (   0.00%)     2802.2712 (   4.99%)
> Stddev    faults/cpu-4     24165.9454 (   0.00%)    15841.1232 (  34.45%)
> Stddev    faults/cpu-7     20914.8351 (   0.00%)    22744.3294 (  -8.75%)
> Stddev    faults/cpu-12    11274.3490 (   0.00%)    14733.3152 ( -30.68%)
> Stddev    faults/cpu-21     2500.1950 (   0.00%)     2200.9518 (  11.97%)
> Stddev    faults/cpu-30     1599.5346 (   0.00%)     1414.0339 (  11.60%)
> Stddev    faults/cpu-40     1473.0181 (   0.00%)     3004.1209 (-103.94%)
> Stddev    faults/sec-1      2655.2581 (   0.00%)     2405.1625 (   9.42%)
> Stddev    faults/sec-4     84042.7234 (   0.00%)    57996.7158 (  30.99%)
> Stddev    faults/sec-7    123656.7901 (   0.00%)   135591.1087 (  -9.65%)
> Stddev    faults/sec-12    97135.6091 (   0.00%)   127054.4926 ( -30.80%)
> Stddev    faults/sec-21    69564.6264 (   0.00%)    65922.6381 (   5.24%)
> Stddev    faults/sec-30    51524.4027 (   0.00%)    56109.4159 (  -8.90%)
> Stddev    faults/sec-40   101927.5280 (   0.00%)   160117.0093 ( -57.09%)
>
> With the exception of the hiccup at 7 threads, things are pretty much in
> the noise region for both metrics.
>
> -- git checkout
>
> First metric is total runtime for runs with incremental threads.
>
>           v5.0-rc6    v5.0-rc6
>                          dirty
> User         218.95      219.07
> System       149.29      146.82
> Elapsed     1574.10     1427.08
>
> In this case there's a non-trivial improvement (roughly 9-10%, going by the
> elapsed times above) in overall elapsed time.
>
> -- reaim (which is always susceptible to rwsem changes for both
> mmap_sem and
> i_mmap)
>                                     v5.0-rc6               v5.0-rc6
>                                                                dirty
> Hmean     compute-1         6674.01 (   0.00%)     6544.28 *  -1.94%*
> Hmean     compute-21       85294.91 (   0.00%)    85524.20 *   0.27%*
> Hmean     compute-41      149674.70 (   0.00%)   149494.58 *  -0.12%*
> Hmean     compute-61      177721.15 (   0.00%)   170507.38 *  -4.06%*
> Hmean     compute-81      181531.07 (   0.00%)   180463.24 *  -0.59%*
> Hmean     compute-101     189024.09 (   0.00%)   187288.86 *  -0.92%*
> Hmean     compute-121     200673.24 (   0.00%)   195327.65 *  -2.66%*
> Hmean     compute-141     213082.29 (   0.00%)   211290.80 *  -0.84%*
> Hmean     compute-161     207764.06 (   0.00%)   204626.68 *  -1.51%*
>
> The 'compute' workload overall takes a small hit.
>
> Hmean     new_dbase-1         60.48 (   0.00%)       60.63 *   0.25%*
> Hmean     new_dbase-21      6590.49 (   0.00%)     6671.81 *   1.23%*
> Hmean     new_dbase-41     14202.91 (   0.00%)    14470.59 *   1.88%*
> Hmean     new_dbase-61     21207.24 (   0.00%)    21067.40 *  -0.66%*
> Hmean     new_dbase-81     25542.40 (   0.00%)    25542.40 *   0.00%*
> Hmean     new_dbase-101    30165.28 (   0.00%)    30046.21 *  -0.39%*
> Hmean     new_dbase-121    33638.33 (   0.00%)    33219.90 *  -1.24%*
> Hmean     new_dbase-141    36723.70 (   0.00%)    37504.52 *   2.13%*
> Hmean     new_dbase-161    42242.51 (   0.00%)    42117.34 *  -0.30%*
> Hmean     shared-1            76.54 (   0.00%)       76.09 *  -0.59%*
> Hmean     shared-21         7535.51 (   0.00%)     5518.75 * -26.76%*
> Hmean     shared-41        17207.81 (   0.00%)    14651.94 * -14.85%*
> Hmean     shared-61        20716.98 (   0.00%)    18667.52 *  -9.89%*
> Hmean     shared-81        27603.83 (   0.00%)    23466.45 * -14.99%*
> Hmean     shared-101       26008.59 (   0.00%)    29536.96 *  13.57%*
> Hmean     shared-121       28354.76 (   0.00%)    43139.39 *  52.14%*
> Hmean     shared-141       38509.25 (   0.00%)    41619.35 *   8.08%*
> Hmean     shared-161       40496.07 (   0.00%)    44303.46 *   9.40%*
>
> Overall there is a small hit (in the noise level but consistent
> throughout
> many workloads), except git-checkout which does quite well.
>
> Thanks,
> Davidlohr

Thanks for running the patch through your performance tests.

Cheers,
Longman
huang ying April 10, 2019, 8:15 a.m. UTC | #13
Hi, Waiman,

What's the status of this patchset?  And its merging plan?

Best Regards,
Huang, Ying
Waiman Long April 10, 2019, 4:08 p.m. UTC | #14
On 04/10/2019 04:15 AM, huang ying wrote:
> Hi, Waiman,
>
> What's the status of this patchset?  And its merging plan?
>
> Best Regards,
> Huang, Ying

I have broken the patch into 3 parts (0/1/2) and rewritten some of them.
Part 0 has been merged into tip. Parts 1 and 2 are still under testing.

Cheers,
Longman
huang ying April 12, 2019, 12:49 a.m. UTC | #15
On Thu, Apr 11, 2019 at 12:08 AM Waiman Long <longman@redhat.com> wrote:
>
> On 04/10/2019 04:15 AM, huang ying wrote:
> > Hi, Waiman,
> >
> > What's the status of this patchset?  And its merging plan?
> >
> > Best Regards,
> > Huang, Ying
>
> I have broken the patch into 3 parts (0/1/2) and rewritten some of them.
> Part 0 has been merged into tip. Parts 1 and 2 are still under testing.

Thanks!  Please keep me updated!

Best Regards,
Huang, Ying

> Cheers,
> Longman
>