Message ID | 20190605134849.28108-1-jglauber@marvell.com (mailing list archive) |
---|---|
State | Mainlined, archived |
Commit | 893a7d32e8e04ca4d6c882336b26ed660ca0a48d |
Series | lockref: Limit number of cmpxchg loop retries |
On Wed, Jun 5, 2019 at 6:49 AM Jan Glauber <jglauber@cavium.com> wrote:
>
> Add an upper bound to the loop to force the fallback to spinlocks
> after some time. A retry value of 100 should not impact any hardware
> that does not have this issue.
>
> With the retry limit the performance of an open-close testcase
> improved between 60-70% on ThunderX2.

Btw, did you do any kind of performance analysis across different
retry limit values?

I'm perfectly happy to just pick a random number and '100' looks fine
to me, so this is mainly out of curiosity.

Linus
On Wed, Jun 05, 2019 at 01:16:46PM -0700, Linus Torvalds wrote:
> On Wed, Jun 5, 2019 at 6:49 AM Jan Glauber <jglauber@cavium.com> wrote:
> >
> > Add an upper bound to the loop to force the fallback to spinlocks
> > after some time. A retry value of 100 should not impact any hardware
> > that does not have this issue.
> >
> > With the retry limit the performance of an open-close testcase
> > improved between 60-70% on ThunderX2.
>
> Btw, did you do any kind of performance analysis across different
> retry limit values?

I tried 15/50/100/200/500, results were largely identical up to 100.
For SMT=4 a higher retry value might be better, but unless we can add a
sysctl value 100 looked like a good compromise to me.

--Jan

> I'm perfectly happy to just pick a random number and '100' looks fine
> to me, so this is mainly out of curiosity.
>
> Linus
On Thu, Jun 06, 2019 at 08:03:27AM +0000, Jan Glauber wrote:
> On Wed, Jun 05, 2019 at 01:16:46PM -0700, Linus Torvalds wrote:
> > On Wed, Jun 5, 2019 at 6:49 AM Jan Glauber <jglauber@cavium.com> wrote:
> > >
> > > Add an upper bound to the loop to force the fallback to spinlocks
> > > after some time. A retry value of 100 should not impact any hardware
> > > that does not have this issue.
> > >
> > > With the retry limit the performance of an open-close testcase
> > > improved between 60-70% on ThunderX2.
> >
> > Btw, did you do any kind of performance analysis across different
> > retry limit values?
>
> I tried 15/50/100/200/500, results were largely identical up to 100.
> For SMT=4 a higher retry value might be better, but unless we can add a
> sysctl value 100 looked like a good compromise to me.

Perhaps I'm just getting confused pre-morning-coffee, but I thought the
original complaint (and the reason for this patch even existing) was that
when many CPUs were hammering the lockref then performance tanked? In which
case, increasing the threshold as the number of CPUs increases seems
counter-intuitive to me because it suggests that the larger the system,
the harder we should try to make the cmpxchg work.

Will
On Thu, Jun 06, 2019 at 10:41:54AM +0100, Will Deacon wrote:
> On Thu, Jun 06, 2019 at 08:03:27AM +0000, Jan Glauber wrote:
> > On Wed, Jun 05, 2019 at 01:16:46PM -0700, Linus Torvalds wrote:
> > > On Wed, Jun 5, 2019 at 6:49 AM Jan Glauber <jglauber@cavium.com> wrote:
> > > >
> > > > Add an upper bound to the loop to force the fallback to spinlocks
> > > > after some time. A retry value of 100 should not impact any hardware
> > > > that does not have this issue.
> > > >
> > > > With the retry limit the performance of an open-close testcase
> > > > improved between 60-70% on ThunderX2.
> > >
> > > Btw, did you do any kind of performance analysis across different
> > > retry limit values?
> >
> > I tried 15/50/100/200/500, results were largely identical up to 100.
> > For SMT=4 a higher retry value might be better, but unless we can add a
> > sysctl value 100 looked like a good compromise to me.
>
> Perhaps I'm just getting confused pre-morning-coffee, but I thought the
> original complaint (and the reason for this patch even existing) was that
> when many CPUs were hammering the lockref then performance tanked? In which
> case, increasing the threshold as the number of CPUs increases seems
> counter-intuitive to me because it suggests that the larger the system,
> the harder we should try to make the cmpxchg work.

For SMT=4 the top hit I see is queued_spin_lock_slowpath(). Maybe this is more
costly with more threads, so trying harder to use lockref-cmpxchg makes
the microbenchmark faster in that case?

--Jan
On Thu, Jun 06, 2019 at 10:28:12AM +0000, Jan Glauber wrote:
> On Thu, Jun 06, 2019 at 10:41:54AM +0100, Will Deacon wrote:
> > On Thu, Jun 06, 2019 at 08:03:27AM +0000, Jan Glauber wrote:
> > > On Wed, Jun 05, 2019 at 01:16:46PM -0700, Linus Torvalds wrote:
> > > > On Wed, Jun 5, 2019 at 6:49 AM Jan Glauber <jglauber@cavium.com> wrote:
> > > > >
> > > > > Add an upper bound to the loop to force the fallback to spinlocks
> > > > > after some time. A retry value of 100 should not impact any hardware
> > > > > that does not have this issue.
> > > > >
> > > > > With the retry limit the performance of an open-close testcase
> > > > > improved between 60-70% on ThunderX2.
> > > >
> > > > Btw, did you do any kind of performance analysis across different
> > > > retry limit values?
> > >
> > > I tried 15/50/100/200/500, results were largely identical up to 100.
> > > For SMT=4 a higher retry value might be better, but unless we can add a
> > > sysctl value 100 looked like a good compromise to me.
> >
> > Perhaps I'm just getting confused pre-morning-coffee, but I thought the
> > original complaint (and the reason for this patch even existing) was that
> > when many CPUs were hammering the lockref then performance tanked? In which
> > case, increasing the threshold as the number of CPUs increases seems
> > counter-intuitive to me because it suggests that the larger the system,
> > the harder we should try to make the cmpxchg work.
>
> For SMT=4 the top hit I see is queued_spin_lock_slowpath(). Maybe this is more
> costly with more threads, so trying harder to use lockref-cmpxchg makes
> the microbenchmark faster in that case?

To clarify, with 224 threads & CPUs queued_spin_lock_slowpath is the top hit
even without a retry limit in lockref. This could be unrelated to the lockref
fallback, it looks like it's coming from the spinlock in:
do_sys_open -> get_unused_fd_flags -> __alloc_fd

--Jan
On Fri, Jun 7, 2019 at 12:27 AM Jan Glauber <jglauber@marvell.com> wrote:
>
> To clarify, with 224 threads & CPUs queued_spin_lock_slowpath is the top hit
> even without a retry limit in lockref. This could be unrelated to the lockref
> fallback, it looks like it's coming from the spinlock in:
> do_sys_open -> get_unused_fd_flags -> __alloc_fd

At some point I stop worrying about microbenchmarks just because it's easy
to hit some locking paths in them, without it necessarily being relevant
in real loads.

But I'll apply the lockref patch because I think the "limit cmpxchg loop"
is conceptually a valid model, and while I think the "hitting the same
dentry lockref over and over again" is likely also an artifact of a
microbenchmark, I could at least imagine that it happens with some common
dentries (root, cwd) in some situations.

Linus
```diff
diff --git a/lib/lockref.c b/lib/lockref.c
index 3d468b53d4c9..5b34bbd3eba8 100644
--- a/lib/lockref.c
+++ b/lib/lockref.c
@@ -9,6 +9,7 @@
  * failure case.
  */
 #define CMPXCHG_LOOP(CODE, SUCCESS) do {				\
+	int retry = 100;						\
 	struct lockref old;						\
 	BUILD_BUG_ON(sizeof(old) != 8);					\
 	old.lock_count = READ_ONCE(lockref->lock_count);		\
@@ -21,6 +22,8 @@
 		if (likely(old.lock_count == prev.lock_count)) {	\
 			SUCCESS;					\
 		}							\
+		if (!--retry)						\
+			break;						\
 		cpu_relax();						\
 	}								\
 } while (0)
```
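For context (this is not part of the patch, and is reproduced from memory rather than from this thread), the callers in lib/lockref.c are what turn that `break` into a spinlock fallback. lockref_get(), for example, has roughly this shape:

```c
void lockref_get(struct lockref *lockref)
{
	/* Lock-free fast path: bump the count while the embedded lock is free. */
	CMPXCHG_LOOP(
		new.count++;
	,
		return;
	);

	/*
	 * The cmpxchg loop gave up (lock observed held, or, with the patch
	 * above, 100 failed attempts), so fall back to the spinlock.
	 */
	spin_lock(&lockref->lock);
	lockref->count++;
	spin_unlock(&lockref->lock);
}
```

With the retry limit, heavily contended callers stop spinning in the cmpxchg loop and queue on the spinlock slow path instead, which is where the ThunderX2 improvement quoted above comes from.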
The lockref cmpxchg loop is unbounded as long as the spinlock is not
taken. Depending on the hardware implementation of compare-and-swap a
high number of loop retries might happen.

Add an upper bound to the loop to force the fallback to spinlocks
after some time. A retry value of 100 should not impact any hardware
that does not have this issue.

With the retry limit the performance of an open-close testcase
improved between 60-70% on ThunderX2.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jan Glauber <jglauber@marvell.com>
---
 lib/lockref.c | 3 +++
 1 file changed, 3 insertions(+)
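To make the mechanism concrete outside the kernel, here is a minimal user-space sketch of the same idea. It is purely illustrative: the names (`mini_lockref`, `mini_get`), the bit layout, the C11 atomics, and the hand-rolled lock bit are assumptions of this sketch, not part of the patch or of lib/lockref.c.

```c
#include <stdatomic.h>
#include <stdint.h>

#define LOCK_BIT  (1ULL << 63)	/* hypothetical "lock held" flag, packed with the count */
#define MAX_RETRY 100		/* same bound the patch picks */

struct mini_lockref {
	_Atomic uint64_t lock_count;	/* lock flag and reference count share one word */
};

void mini_lock(struct mini_lockref *lr)
{
	uint64_t old = atomic_load_explicit(&lr->lock_count, memory_order_relaxed);

	for (;;) {
		old &= ~LOCK_BIT;	/* only expect success while the lock bit is clear */
		if (atomic_compare_exchange_weak_explicit(&lr->lock_count, &old,
							  old | LOCK_BIT,
							  memory_order_acquire,
							  memory_order_relaxed))
			return;
	}
}

void mini_unlock(struct mini_lockref *lr)
{
	atomic_fetch_and_explicit(&lr->lock_count, ~LOCK_BIT, memory_order_release);
}

/* Take a reference: bounded lock-free attempts first, then the locked slow path. */
void mini_get(struct mini_lockref *lr)
{
	uint64_t old = atomic_load_explicit(&lr->lock_count, memory_order_relaxed);

	for (int retry = MAX_RETRY; retry > 0 && !(old & LOCK_BIT); retry--) {
		/* The CAS succeeds only if nobody locked or changed the word meanwhile. */
		if (atomic_compare_exchange_weak_explicit(&lr->lock_count, &old,
							  old + 1,
							  memory_order_relaxed,
							  memory_order_relaxed))
			return;		/* count bumped without ever touching the lock */
	}

	/* Lock observed held, or too many failed attempts: fall back to the lock. */
	mini_lock(lr);
	atomic_fetch_add_explicit(&lr->lock_count, 1, memory_order_relaxed);
	mini_unlock(lr);
}
```

The point the thread debates is only the bound: without `MAX_RETRY`, `mini_get()` could spin in the CAS loop indefinitely on hardware where contended compare-and-swap makes little forward progress, which is the ThunderX2 behaviour the patch works around.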