
lockref: Limit number of cmpxchg loop retries

Message ID: 20190605134849.28108-1-jglauber@marvell.com (mailing list archive)
State: Mainlined, archived
Commit: 893a7d32e8e04ca4d6c882336b26ed660ca0a48d
Series: lockref: Limit number of cmpxchg loop retries

Commit Message

Jan Glauber June 5, 2019, 1:48 p.m. UTC
The lockref cmpxchg loop is unbounded as long as the spinlock is not
taken. Depending on the hardware implementation of compare-and-swap,
a high number of loop retries can occur.

Add an upper bound to the loop to force the fallback to spinlocks
after some time. A retry value of 100 should not impact any hardware
that does not have this issue.

With the retry limit, the performance of an open-close testcase
improved by 60-70% on ThunderX2.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jan Glauber <jglauber@marvell.com>
---
 lib/lockref.c | 3 +++
 1 file changed, 3 insertions(+)
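The open-close testcase itself is not shown in the thread. As a rough
illustration, a stress test along the following lines (a hypothetical
reconstruction; the target path, iteration count, and default thread
count are assumptions) makes every CPU hammer the same dentry's lockref:

/*
 * Hypothetical open-close stress test, not the benchmark used in this
 * thread. Each thread repeatedly opens and closes one shared path, so
 * all CPUs contend on that dentry's lockref. Build with: cc -O2 -pthread
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define ITERATIONS 1000000L

static const char *path = "/tmp/lockref-test";	/* assumed target */

static void *worker(void *arg)
{
	(void)arg;
	for (long i = 0; i < ITERATIONS; i++) {
		int fd = open(path, O_RDONLY);

		if (fd < 0) {
			perror("open");
			exit(1);
		}
		close(fd);
	}
	return NULL;
}

int main(int argc, char **argv)
{
	int nthreads = argc > 1 ? atoi(argv[1]) : 4;
	pthread_t threads[nthreads];

	/* Create the target file before the threads start. */
	close(open(path, O_CREAT | O_RDONLY, 0644));

	for (int i = 0; i < nthreads; i++)
		pthread_create(&threads[i], NULL, worker, NULL);
	for (int i = 0; i < nthreads; i++)
		pthread_join(threads[i], NULL);
	return 0;
}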

Comments

Linus Torvalds June 5, 2019, 8:16 p.m. UTC | #1
On Wed, Jun 5, 2019 at 6:49 AM Jan Glauber <jglauber@cavium.com> wrote:
>
> Add an upper bound to the loop to force the fallback to spinlocks
> after some time. A retry value of 100 should not impact any hardware
> that does not have this issue.
>
> With the retry limit, the performance of an open-close testcase
> improved by 60-70% on ThunderX2.

Btw, did you do any kind of performance analysis across different
retry limit values?

I'm perfectly happy to just pick a random number and '100' looks fine
to me, so this is mainly out of curiosity.

                       Linus
Jan Glauber June 6, 2019, 8:03 a.m. UTC | #2
On Wed, Jun 05, 2019 at 01:16:46PM -0700, Linus Torvalds wrote:
> On Wed, Jun 5, 2019 at 6:49 AM Jan Glauber <jglauber@cavium.com> wrote:
> >
> > Add an upper bound to the loop to force the fallback to spinlocks
> > after some time. A retry value of 100 should not impact any hardware
> > that does not have this issue.
> >
> > With the retry limit, the performance of an open-close testcase
> > improved by 60-70% on ThunderX2.
> 
> Btw, did you do any kind of performance analysis across different
> retry limit values?

I tried 15/50/100/200/500; results were largely identical up to 100.
For SMT=4 a higher retry value might be better, but unless we can add a
sysctl value, 100 looked like a good compromise to me.

--Jan

> I'm perfectly happy to just pick a random number and '100' looks fine
> to me, so this is mainly out of curiosity.
> 
>                        Linus
Will Deacon June 6, 2019, 9:41 a.m. UTC | #3
On Thu, Jun 06, 2019 at 08:03:27AM +0000, Jan Glauber wrote:
> On Wed, Jun 05, 2019 at 01:16:46PM -0700, Linus Torvalds wrote:
> > On Wed, Jun 5, 2019 at 6:49 AM Jan Glauber <jglauber@cavium.com> wrote:
> > >
> > > Add an upper bound to the loop to force the fallback to spinlocks
> > > after some time. A retry value of 100 should not impact any hardware
> > > that does not have this issue.
> > >
> > > With the retry limit, the performance of an open-close testcase
> > > improved by 60-70% on ThunderX2.
> > 
> > Btw, did you do any kind of performance analysis across different
> > retry limit values?
> 
> I tried 15/50/100/200/500; results were largely identical up to 100.
> For SMT=4 a higher retry value might be better, but unless we can add a
> sysctl value, 100 looked like a good compromise to me.

Perhaps I'm just getting confused pre-morning-coffee, but I thought the
original complaint (and the reason for this patch even existing) was that
when many CPUs were hammering the lockref then performance tanked? In which
case, increasing the threshold as the number of CPUs increases seems
counter-intuitive to me because it suggests that the larger the system,
the harder we should try to make the cmpxchg work.

Will
Jan Glauber June 6, 2019, 10:28 a.m. UTC | #4
On Thu, Jun 06, 2019 at 10:41:54AM +0100, Will Deacon wrote:
> On Thu, Jun 06, 2019 at 08:03:27AM +0000, Jan Glauber wrote:
> > On Wed, Jun 05, 2019 at 01:16:46PM -0700, Linus Torvalds wrote:
> > > On Wed, Jun 5, 2019 at 6:49 AM Jan Glauber <jglauber@cavium.com> wrote:
> > > >
> > > > Add an upper bound to the loop to force the fallback to spinlocks
> > > > after some time. A retry value of 100 should not impact any hardware
> > > > that does not have this issue.
> > > >
> > > > With the retry limit, the performance of an open-close testcase
> > > > improved by 60-70% on ThunderX2.
> > > 
> > > Btw, did you do any kind of performance analysis across different
> > > retry limit values?
> > 
> > I tried 15/50/100/200/500; results were largely identical up to 100.
> > For SMT=4 a higher retry value might be better, but unless we can add a
> > sysctl value, 100 looked like a good compromise to me.
> 
> Perhaps I'm just getting confused pre-morning-coffee, but I thought the
> original complaint (and the reason for this patch even existing) was that
> when many CPUs were hammering the lockref then performance tanked? In which
> case, increasing the threshold as the number of CPUs increases seems
> counter-intuitive to me because it suggests that the larger the system,
> the harder we should try to make the cmpxchg work.

For SMT=4 the top hit I see is queued_spin_lock_slowpath(). Maybe this is more
costly with more threads, so trying harder to use lockref-cmpxchg makes
the microbenchmark faster in that case?

--Jan
Jan Glauber June 7, 2019, 7:27 a.m. UTC | #5
On Thu, Jun 06, 2019 at 10:28:12AM +0000, Jan Glauber wrote:
> On Thu, Jun 06, 2019 at 10:41:54AM +0100, Will Deacon wrote:
> > On Thu, Jun 06, 2019 at 08:03:27AM +0000, Jan Glauber wrote:
> > > On Wed, Jun 05, 2019 at 01:16:46PM -0700, Linus Torvalds wrote:
> > > > On Wed, Jun 5, 2019 at 6:49 AM Jan Glauber <jglauber@cavium.com> wrote:
> > > > >
> > > > > Add an upper bound to the loop to force the fallback to spinlocks
> > > > > after some time. A retry value of 100 should not impact any hardware
> > > > > that does not have this issue.
> > > > >
> > > > > With the retry limit, the performance of an open-close testcase
> > > > > improved by 60-70% on ThunderX2.
> > > > 
> > > > Btw, did you do any kind of performance analysis across different
> > > > retry limit values?
> > > 
> > > I tried 15/50/100/200/500; results were largely identical up to 100.
> > > For SMT=4 a higher retry value might be better, but unless we can add a
> > > sysctl value, 100 looked like a good compromise to me.
> > 
> > Perhaps I'm just getting confused pre-morning-coffee, but I thought the
> > original complaint (and the reason for this patch even existing) was that
> > when many CPUs were hammering the lockref then performance tanked? In which
> > case, increasing the threshold as the number of CPUs increases seems
> > counter-intuitive to me because it suggests that the larger the system,
> > the harder we should try to make the cmpxchg work.
> 
> For SMT=4 the top hit I see is queued_spin_lock_slowpath(). Maybe this is more
> costly with more threads, so trying harder to use lockref-cmpxchg makes
> the microbenchmark faster in that case?

To clarify, with 224 threads & CPUs queued_spin_lock_slowpath is the top hit
even without a retry limit in lockref. This could be unrelated to the lockref
fallback; it looks like it's coming from the spinlock in:
	do_sys_open -> get_unused_fd_flags -> __alloc_fd

--Jan
Linus Torvalds June 7, 2019, 8:14 p.m. UTC | #6
On Fri, Jun 7, 2019 at 12:27 AM Jan Glauber <jglauber@marvell.com> wrote:
>
> To clarify, with 224 threads & CPUs queued_spin_lock_slowpath is the top hit
> even without a retry limit in lockref. This could be unrelated to the lockref
> fallback; it looks like it's coming from the spinlock in:
>         do_sys_open -> get_unused_fd_flags -> __alloc_fd

At some point I stop worrying about microbenchmarks just because it's
easy to hit some locking paths in them, without it necessarily being
relevant in real loads.

But I'll apply the lockref patch because I think the "limit cmpxchg
loop" is conceptually a valid model, and while I think the "hitting
the same dentry lockref over and over again" is likely also an
artifact of a microbenchmark, I could at least imagine that it happens
with some common dentries (root, cwd) in some situations.

                   Linus

Patch

diff --git a/lib/lockref.c b/lib/lockref.c
index 3d468b53d4c9..5b34bbd3eba8 100644
--- a/lib/lockref.c
+++ b/lib/lockref.c
@@ -9,6 +9,7 @@ 
  * failure case.
  */
 #define CMPXCHG_LOOP(CODE, SUCCESS) do {					\
+	int retry = 100;							\
 	struct lockref old;							\
 	BUILD_BUG_ON(sizeof(old) != 8);						\
 	old.lock_count = READ_ONCE(lockref->lock_count);			\
@@ -21,6 +22,8 @@ 
 		if (likely(old.lock_count == prev.lock_count)) {		\
 			SUCCESS;						\
 		}								\
+		if (!--retry)							\
+			break;							\
 		cpu_relax();							\
 	}									\
 } while (0)
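
The pattern the patch bounds can be sketched in plain C11 outside the
kernel. The layout below (a lock bit and a reference count packed into
one 64-bit word) and all names are illustrative stand-ins for the
kernel's lockref, not its actual implementation:

/*
 * Userspace sketch of a bounded-cmpxchg lockref "get" operation.
 * Fast path: retry the compare-and-swap up to 100 times while the
 * lock bit is clear; under too much contention, fall back to taking
 * the lock, mirroring what the patched CMPXCHG_LOOP does.
 */
#include <stdatomic.h>
#include <stdint.h>

#define LOCKED_BIT	UINT64_C(1)	/* low bit: lock held */
#define COUNT_UNIT	UINT64_C(2)	/* the count lives above it */

struct my_lockref {
	_Atomic uint64_t lock_count;
};

/* Slow-path stand-in: grab the lock bit, bump the count under it. */
static void my_lockref_get_slow(struct my_lockref *ref)
{
	uint64_t old = atomic_load_explicit(&ref->lock_count,
					    memory_order_relaxed);

	for (;;) {
		old &= ~LOCKED_BIT;	/* only swap in while unlocked */
		if (atomic_compare_exchange_weak(&ref->lock_count, &old,
						 old | LOCKED_BIT))
			break;
	}
	/* Count and lock share one word: add, then drop the lock bit. */
	atomic_fetch_add(&ref->lock_count, COUNT_UNIT);
	atomic_fetch_and(&ref->lock_count, ~LOCKED_BIT);
}

static void my_lockref_get(struct my_lockref *ref)
{
	int retry = 100;
	uint64_t old = atomic_load_explicit(&ref->lock_count,
					    memory_order_relaxed);

	while (!(old & LOCKED_BIT)) {
		/* On failure, compare_exchange_weak reloads 'old'. */
		if (atomic_compare_exchange_weak(&ref->lock_count, &old,
						 old + COUNT_UNIT))
			return;
		if (!--retry)
			break;	/* contended: stop spinning on cmpxchg */
	}
	my_lockref_get_slow(ref);
}

Without the retry counter, the while loop above can keep failing for as
long as other CPUs keep changing the word; the bound converts that
unbounded spin into at most 100 attempts followed by the locked slow path.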