
lockref: Limit number of cmpxchg loop retries

Message ID: 20190605134849.28108-1-jglauber@marvell.com (mailing list archive)
State: Mainlined, archived
Commit: 893a7d32e8e04ca4d6c882336b26ed660ca0a48d
Series: lockref: Limit number of cmpxchg loop retries

Commit Message

Jan Glauber June 5, 2019, 1:48 p.m. UTC
The lockref cmpxchg loop is unbounded as long as the spinlock is not
taken. Depending on the hardware implementation of compare-and-swap,
a high number of loop retries can occur.

Add an upper bound to the loop to force the fallback to spinlocks
after some time. A retry value of 100 should not impact any hardware
that does not have this issue.

With the retry limit, the performance of an open-close testcase
improved by 60-70% on ThunderX2.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jan Glauber <jglauber@marvell.com>
---
 lib/lockref.c | 3 +++
 1 file changed, 3 insertions(+)
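The open-close testcase itself is not shown in the thread. As a rough
illustration, a stress test along the following lines (a hypothetical
reconstruction; the target path, iteration count, and default thread
count are assumptions) makes every CPU hammer the same dentry's lockref:

/*
 * Hypothetical open-close stress test, not the benchmark used in this
 * thread. Each thread repeatedly opens and closes one shared path, so
 * all CPUs contend on that dentry's lockref. Build with: cc -O2 -pthread
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define ITERATIONS 1000000L

static const char *path = "/tmp/lockref-test";	/* assumed target */

static void *worker(void *arg)
{
	(void)arg;
	for (long i = 0; i < ITERATIONS; i++) {
		int fd = open(path, O_RDONLY);

		if (fd < 0) {
			perror("open");
			exit(1);
		}
		close(fd);
	}
	return NULL;
}

int main(int argc, char **argv)
{
	int nthreads = argc > 1 ? atoi(argv[1]) : 4;
	pthread_t threads[nthreads];

	/* Create the target file before the threads start. */
	close(open(path, O_CREAT | O_RDONLY, 0644));

	for (int i = 0; i < nthreads; i++)
		pthread_create(&threads[i], NULL, worker, NULL);
	for (int i = 0; i < nthreads; i++)
		pthread_join(threads[i], NULL);
	return 0;
}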

Comments

Linus Torvalds June 5, 2019, 8:16 p.m. UTC | #1
On Wed, Jun 5, 2019 at 6:49 AM Jan Glauber <jglauber@cavium.com> wrote:
>
> Add an upper bound to the loop to force the fallback to spinlocks
> after some time. A retry value of 100 should not impact any hardware
> that does not have this issue.
>
> With the retry limit, the performance of an open-close testcase
> improved by 60-70% on ThunderX2.

Btw, did you do any kind of performance analysis across different
retry limit values?

I'm perfectly happy to just pick a random number and '100' looks fine
to me, so this is mainly out of curiosity.

                       Linus
Jan Glauber June 6, 2019, 8:03 a.m. UTC | #2
On Wed, Jun 05, 2019 at 01:16:46PM -0700, Linus Torvalds wrote:
> On Wed, Jun 5, 2019 at 6:49 AM Jan Glauber <jglauber@cavium.com> wrote:
> >
> > Add an upper bound to the loop to force the fallback to spinlocks
> > after some time. A retry value of 100 should not impact any hardware
> > that does not have this issue.
> >
> > With the retry limit, the performance of an open-close testcase
> > improved by 60-70% on ThunderX2.
> 
> Btw, did you do any kind of performance analysis across different
> retry limit values?

I tried 15/50/100/200/500; results were largely identical up to 100.
For SMT=4 a higher retry value might be better, but unless we can add a
sysctl value, 100 looked like a good compromise to me.

--Jan

> I'm perfectly happy to just pick a random number and '100' looks fine
> to me, so this is mainly out of curiosity.
> 
>                        Linus
Will Deacon June 6, 2019, 9:41 a.m. UTC | #3
On Thu, Jun 06, 2019 at 08:03:27AM +0000, Jan Glauber wrote:
> On Wed, Jun 05, 2019 at 01:16:46PM -0700, Linus Torvalds wrote:
> > On Wed, Jun 5, 2019 at 6:49 AM Jan Glauber <jglauber@cavium.com> wrote:
> > >
> > > Add an upper bound to the loop to force the fallback to spinlocks
> > > after some time. A retry value of 100 should not impact any hardware
> > > that does not have this issue.
> > >
> > > With the retry limit, the performance of an open-close testcase
> > > improved by 60-70% on ThunderX2.
> > 
> > Btw, did you do any kind of performance analysis across different
> > retry limit values?
> 
> I tried 15/50/100/200/500; results were largely identical up to 100.
> For SMT=4 a higher retry value might be better, but unless we can add a
> sysctl value, 100 looked like a good compromise to me.

Perhaps I'm just getting confused pre-morning-coffee, but I thought the
original complaint (and the reason for this patch even existing) was that
when many CPUs were hammering the lockref then performance tanked? In which
case, increasing the threshold as the number of CPUs increases seems
counter-intuitive to me because it suggests that the larger the system,
the harder we should try to make the cmpxchg work.

Will
Jan Glauber June 6, 2019, 10:28 a.m. UTC | #4
On Thu, Jun 06, 2019 at 10:41:54AM +0100, Will Deacon wrote:
> On Thu, Jun 06, 2019 at 08:03:27AM +0000, Jan Glauber wrote:
> > On Wed, Jun 05, 2019 at 01:16:46PM -0700, Linus Torvalds wrote:
> > > On Wed, Jun 5, 2019 at 6:49 AM Jan Glauber <jglauber@cavium.com> wrote:
> > > >
> > > > Add an upper bound to the loop to force the fallback to spinlocks
> > > > after some time. A retry value of 100 should not impact any hardware
> > > > that does not have this issue.
> > > >
> > > > With the retry limit, the performance of an open-close testcase
> > > > improved by 60-70% on ThunderX2.
> > > 
> > > Btw, did you do any kind of performance analysis across different
> > > retry limit values?
> > 
> > I tried 15/50/100/200/500; results were largely identical up to 100.
> > For SMT=4 a higher retry value might be better, but unless we can add a
> > sysctl value, 100 looked like a good compromise to me.
> 
> Perhaps I'm just getting confused pre-morning-coffee, but I thought the
> original complaint (and the reason for this patch even existing) was that
> when many CPUs were hammering the lockref then performance tanked? In which
> case, increasing the threshold as the number of CPUs increases seems
> counter-intuitive to me because it suggests that the larger the system,
> the harder we should try to make the cmpxchg work.

For SMT=4 the top hit I see is queued_spin_lock_slowpath(). Maybe this is more
costly with more threads, so trying harder to use lockref-cmpxchg makes
the microbenchmark faster in that case?

--Jan
Jan Glauber June 7, 2019, 7:27 a.m. UTC | #5
On Thu, Jun 06, 2019 at 10:28:12AM +0000, Jan Glauber wrote:
> On Thu, Jun 06, 2019 at 10:41:54AM +0100, Will Deacon wrote:
> > On Thu, Jun 06, 2019 at 08:03:27AM +0000, Jan Glauber wrote:
> > > On Wed, Jun 05, 2019 at 01:16:46PM -0700, Linus Torvalds wrote:
> > > > On Wed, Jun 5, 2019 at 6:49 AM Jan Glauber <jglauber@cavium.com> wrote:
> > > > >
> > > > > Add an upper bound to the loop to force the fallback to spinlocks
> > > > > after some time. A retry value of 100 should not impact any hardware
> > > > > that does not have this issue.
> > > > >
> > > > > With the retry limit, the performance of an open-close testcase
> > > > > improved by 60-70% on ThunderX2.
> > > > 
> > > > Btw, did you do any kind of performance analysis across different
> > > > retry limit values?
> > > 
> > > I tried 15/50/100/200/500; results were largely identical up to 100.
> > > For SMT=4 a higher retry value might be better, but unless we can add a
> > > sysctl value, 100 looked like a good compromise to me.
> > 
> > Perhaps I'm just getting confused pre-morning-coffee, but I thought the
> > original complaint (and the reason for this patch even existing) was that
> > when many CPUs were hammering the lockref then performance tanked? In which
> > case, increasing the threshold as the number of CPUs increases seems
> > counter-intuitive to me because it suggests that the larger the system,
> > the harder we should try to make the cmpxchg work.
> 
> For SMT=4 the top hit I see is queued_spin_lock_slowpath(). Maybe this is more
> costly with more threads, so trying harder to use lockref-cmpxchg makes
> the microbenchmark faster in that case?

To clarify, with 224 threads & CPUs queued_spin_lock_slowpath is the top hit
even without a retry limit in lockref. This could be unrelated to the lockref
fallback; it looks like it's coming from the spinlock in:
	do_sys_open -> get_unused_fd_flags -> __alloc_fd

--Jan
Linus Torvalds June 7, 2019, 8:14 p.m. UTC | #6
On Fri, Jun 7, 2019 at 12:27 AM Jan Glauber <jglauber@marvell.com> wrote:
>
> To clarify, with 224 threads & CPUs queued_spin_lock_slowpath is the top hit
> even without a retry limit in lockref. This could be unrelated to the lockref
> fallback; it looks like it's coming from the spinlock in:
>         do_sys_open -> get_unused_fd_flags -> __alloc_fd

At some point I stop worrying about microbenchmarks just because it's
easy to hit some locking paths in them, without it necessarily being
relevant in real loads.

But I'll apply the lockref patch because I think the "limit cmpxchg
loop" is conceptually a valid model, and while I think the "hitting
the same dentry lockref over and over again" is likely also an
artifact of a microbenchmark, I could at least imagine that it happens
with some common dentries (root, cwd) in some situations.

                   Linus

Patch

diff --git a/lib/lockref.c b/lib/lockref.c
index 3d468b53d4c9..5b34bbd3eba8 100644
--- a/lib/lockref.c
+++ b/lib/lockref.c
@@ -9,6 +9,7 @@ 
  * failure case.
  */
 #define CMPXCHG_LOOP(CODE, SUCCESS) do {					\
+	int retry = 100;							\
 	struct lockref old;							\
 	BUILD_BUG_ON(sizeof(old) != 8);						\
 	old.lock_count = READ_ONCE(lockref->lock_count);			\
@@ -21,6 +22,8 @@ 
 		if (likely(old.lock_count == prev.lock_count)) {		\
 			SUCCESS;						\
 		}								\
+		if (!--retry)							\
+			break;							\
 		cpu_relax();							\
 	}									\
 } while (0)
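
The pattern the patch bounds can be sketched in plain C11 outside the
kernel. The layout below (a lock bit and a reference count packed into
one 64-bit word) and all names are illustrative stand-ins for the
kernel's lockref, not its actual implementation:

/*
 * Userspace sketch of a bounded-cmpxchg lockref "get" operation.
 * Fast path: retry the compare-and-swap up to 100 times while the
 * lock bit is clear; under too much contention, fall back to taking
 * the lock, mirroring what the patched CMPXCHG_LOOP does.
 */
#include <stdatomic.h>
#include <stdint.h>

#define LOCKED_BIT	UINT64_C(1)	/* low bit: lock held */
#define COUNT_UNIT	UINT64_C(2)	/* the count lives above it */

struct my_lockref {
	_Atomic uint64_t lock_count;
};

/* Slow-path stand-in: grab the lock bit, bump the count under it. */
static void my_lockref_get_slow(struct my_lockref *ref)
{
	uint64_t old = atomic_load_explicit(&ref->lock_count,
					    memory_order_relaxed);

	for (;;) {
		old &= ~LOCKED_BIT;	/* only swap in while unlocked */
		if (atomic_compare_exchange_weak(&ref->lock_count, &old,
						 old | LOCKED_BIT))
			break;
	}
	/* Count and lock share one word: add, then drop the lock bit. */
	atomic_fetch_add(&ref->lock_count, COUNT_UNIT);
	atomic_fetch_and(&ref->lock_count, ~LOCKED_BIT);
}

static void my_lockref_get(struct my_lockref *ref)
{
	int retry = 100;
	uint64_t old = atomic_load_explicit(&ref->lock_count,
					    memory_order_relaxed);

	while (!(old & LOCKED_BIT)) {
		/* On failure, compare_exchange_weak reloads 'old'. */
		if (atomic_compare_exchange_weak(&ref->lock_count, &old,
						 old + COUNT_UNIT))
			return;
		if (!--retry)
			break;	/* contended: stop spinning on cmpxchg */
	}
	my_lockref_get_slow(ref);
}

Without the retry counter, the while loop above can keep failing for as
long as other CPUs keep changing the word; the bound converts that
unbounded spin into at most 100 attempts followed by the locked slow path.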