
[v4,0/3] locking/rwsem: Rwsem rearchitecture part 0

Message ID 1550095217-12047-1-git-send-email-longman@redhat.com

Message

Waiman Long Feb. 13, 2019, 10 p.m. UTC
v4:
 - Remove rwsem-spinlock.c and make all archs use rwsem-xadd.c.

v3:
 - Optimize __down_read_trylock() for the uncontended case as suggested
   by Linus.

v2:
 - Add patch 2 to optimize __down_read_trylock() as suggested by PeterZ.
 - Update performance test data in patch 1.

The goal of this patchset is to remove the architecture specific files
for rwsem-xadd to make it easier to add enhancements in the later rwsem
patches. It also removes the legacy rwsem-spinlock.c file and makes all
the architectures use a single implementation of rwsem - rwsem-xadd.c.

Waiman Long (3):
  locking/rwsem: Remove arch specific rwsem files
  locking/rwsem: Remove rwsem-spinlock.c & use rwsem-xadd.c for all
    archs
  locking/rwsem: Optimize down_read_trylock()

 MAINTAINERS                     |   1 -
 arch/alpha/Kconfig              |   7 -
 arch/alpha/include/asm/rwsem.h  | 211 -------------------------
 arch/arc/Kconfig                |   3 -
 arch/arm/Kconfig                |   4 -
 arch/arm/include/asm/Kbuild     |   1 -
 arch/arm64/Kconfig              |   3 -
 arch/arm64/include/asm/Kbuild   |   1 -
 arch/c6x/Kconfig                |   3 -
 arch/csky/Kconfig               |   3 -
 arch/h8300/Kconfig              |   3 -
 arch/hexagon/Kconfig            |   6 -
 arch/hexagon/include/asm/Kbuild |   1 -
 arch/ia64/Kconfig               |   4 -
 arch/ia64/include/asm/rwsem.h   | 172 --------------------
 arch/m68k/Kconfig               |   7 -
 arch/microblaze/Kconfig         |   6 -
 arch/mips/Kconfig               |   7 -
 arch/nds32/Kconfig              |   3 -
 arch/nios2/Kconfig              |   3 -
 arch/openrisc/Kconfig           |   6 -
 arch/parisc/Kconfig             |   6 -
 arch/powerpc/Kconfig            |   7 -
 arch/powerpc/include/asm/Kbuild |   1 -
 arch/riscv/Kconfig              |   3 -
 arch/s390/Kconfig               |   6 -
 arch/s390/include/asm/Kbuild    |   1 -
 arch/sh/Kconfig                 |   6 -
 arch/sh/include/asm/Kbuild      |   1 -
 arch/sparc/Kconfig              |   8 -
 arch/sparc/include/asm/Kbuild   |   1 -
 arch/unicore32/Kconfig          |   6 -
 arch/x86/Kconfig                |   3 -
 arch/x86/include/asm/rwsem.h    | 237 ----------------------------
 arch/x86/lib/Makefile           |   1 -
 arch/x86/lib/rwsem.S            | 156 ------------------
 arch/x86/um/Kconfig             |   6 -
 arch/x86/um/Makefile            |   1 -
 arch/xtensa/Kconfig             |   3 -
 arch/xtensa/include/asm/Kbuild  |   1 -
 include/asm-generic/rwsem.h     | 140 -----------------
 include/linux/rwsem-spinlock.h  |  47 ------
 include/linux/rwsem.h           |   9 +-
 kernel/Kconfig.locks            |   2 +-
 kernel/locking/Makefile         |   4 +-
 kernel/locking/percpu-rwsem.c   |   2 +
 kernel/locking/rwsem-spinlock.c | 339 ----------------------------------------
 kernel/locking/rwsem.h          | 130 +++++++++++++++
 48 files changed, 135 insertions(+), 1447 deletions(-)
 delete mode 100644 arch/alpha/include/asm/rwsem.h
 delete mode 100644 arch/ia64/include/asm/rwsem.h
 delete mode 100644 arch/x86/include/asm/rwsem.h
 delete mode 100644 arch/x86/lib/rwsem.S
 delete mode 100644 include/asm-generic/rwsem.h
 delete mode 100644 include/linux/rwsem-spinlock.h
 delete mode 100644 kernel/locking/rwsem-spinlock.c

Comments

Peter Zijlstra Feb. 14, 2019, 10:37 a.m. UTC | #1
On Wed, Feb 13, 2019 at 05:00:14PM -0500, Waiman Long wrote:
> [...]

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

with the caveat that I'm happy to exchange patch 3 back to my earlier
suggestion in case Will expresses concerns wrt the ARM64 performance of
Linus' suggestion.
Waiman Long Feb. 14, 2019, 10:05 p.m. UTC | #2
On 02/14/2019 05:37 AM, Peter Zijlstra wrote:
> On Wed, Feb 13, 2019 at 05:00:14PM -0500, Waiman Long wrote:
>> [...]
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>
> with the caveat that I'm happy to exchange patch 3 back to my earlier
> suggestion in case Will expresses concerns wrt the ARM64 performance of
> Linus' suggestion.

I inserted a few lock event counters into the rwsem trylock code:

static inline int __down_read_trylock(struct rw_semaphore *sem)
{
        /*
         * Optimize for the case when the rwsem is not locked at all.
         */
        long tmp = RWSEM_UNLOCKED_VALUE;

        lockevent_inc(rwsem_rtrylock);
        do {
                if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
                                        tmp + RWSEM_ACTIVE_READ_BIAS)) {
                        rwsem_set_reader_owned(sem);
                        return 1;
                }
                lockevent_inc(rwsem_rtrylock_retry);
        } while (tmp >= 0);
        lockevent_inc(rwsem_rtrylock_fail);
        return 0;
}

static inline int __down_write_trylock(struct rw_semaphore *sem)
{
        long tmp;

        lockevent_inc(rwsem_wtrylock);
        tmp = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
                      RWSEM_ACTIVE_WRITE_BIAS);
        if (tmp == RWSEM_UNLOCKED_VALUE) {
                rwsem_set_owner(sem);
                return true;
        }
        lockevent_inc(rwsem_wtrylock_fail);
        return false;
}

I booted the new kernel on a 4-socket 56-core 112-thread Broadwell
system. The counter values were as follows:

1) After bootup:

rwsem_rtrylock=784029
rwsem_rtrylock_fail=59
rwsem_rtrylock_retry=394
rwsem_wtrylock=18284
rwsem_wtrylock_fail=230

2) After parallel kernel build (-j112):

rwsem_rtrylock=338667559
rwsem_rtrylock_fail=18
rwsem_rtrylock_retry=51
rwsem_wtrylock=17016332
rwsem_wtrylock_fail=98058

At least for these two use cases, try-for-ownership as suggested by
Linus is the right choice.

Cheers,
Longman
Will Deacon Feb. 15, 2019, 6:40 p.m. UTC | #3
On Thu, Feb 14, 2019 at 11:37:15AM +0100, Peter Zijlstra wrote:
> > [...]
> 
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> 
> with the caveat that I'm happy to exchange patch 3 back to my earlier
> suggestion in case Will expresses concerns wrt the ARM64 performance of
> Linus' suggestion.

Right, the current proposal doesn't work well for us, unfortunately. Which
was your earlier suggestion?

Will
Waiman Long Feb. 15, 2019, 6:58 p.m. UTC | #4
On 02/15/2019 01:40 PM, Will Deacon wrote:
> On Thu, Feb 14, 2019 at 11:37:15AM +0100, Peter Zijlstra wrote:
>> [...]
>> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>>
>> with the caveat that I'm happy to exchange patch 3 back to my earlier
>> suggestion in case Will expresses concerns wrt the ARM64 performance of
>> Linus' suggestion.
> Right, the current proposal doesn't work well for us, unfortunately. Which
> was your earlier suggestion?
>
> Will

In my posting yesterday, I showed that most of the trylocks done were
actually uncontended. Assuming that pattern holds for most of the
workloads, it will not be that bad after all.

-Longman
Will Deacon Feb. 18, 2019, 2:58 p.m. UTC | #5
On Fri, Feb 15, 2019 at 01:58:34PM -0500, Waiman Long wrote:
> [...]
> 
> In my posting yesterday, I showed that most of the trylocks done were
> actually uncontended. Assuming that pattern holds for most of the
> workloads, it will not be that bad after all.

That's fair enough; if you're going to sit in a tight trylock() loop like the
benchmark does, then you're much better off just calling lock() if you care
at all about scalability.

Will