RFC: mutex: hung tasks on SMP platforms with asm-generic/mutex-xchg.h

Message ID: 20120807115647.GA12828@mudshark.cambridge.arm.com

Commit Message

Will Deacon Aug. 7, 2012, 11:56 a.m. UTC
Hello,

ARM recently moved to asm-generic/mutex-xchg.h for its mutex implementation
after our previous implementation was found to be missing some crucial
memory barriers. However, I'm seeing some problems running hackbench on
SMP platforms due to the way in which the MUTEX_SPIN_ON_OWNER code operates.

The symptoms are that a bunch of hackbench tasks are left waiting on an
unlocked mutex and therefore never get woken up to claim it. I think this
boils down to the following sequence:


        Task A        Task B        Task C        Lock value
0                                                     1
1       lock()                                        0
2                     lock()                          0
3                     spin(A)                         0
4       unlock()                                      1
5                                   lock()            0
6                     cmpxchg(1,0)                    0
7                     contended()                    -1
8       lock()                                        0
9       spin(C)                                       0
10                                  unlock()          1
11      cmpxchg(1,0)                                  0
12      unlock()                                      1


At this point, the lock is unlocked, but Task B is in an uninterruptible
sleep with nobody to wake it up.
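
For reference, the fastpaths in asm-generic/mutex-xchg.h boil down to
something like this (a simplified sketch rather than the verbatim
header):

	/*
	 * mutex_lock() fastpath: unconditionally write 0 (locked).
	 * Any old value other than 1 means contention -> slowpath.
	 */
	static inline void
	__mutex_fastpath_lock(atomic_t *count, void (*fail_fn)(atomic_t *))
	{
		if (unlikely(atomic_xchg(count, 0) != 1))
			fail_fn(count);
	}

	/*
	 * mutex_unlock() fastpath: unconditionally write 1 (unlocked).
	 * An old value of 0 means nobody was waiting; anything less
	 * means we must take the slowpath to wake a waiter.
	 */
	static inline void
	__mutex_fastpath_unlock(atomic_t *count, void (*fail_fn)(atomic_t *))
	{
		if (unlikely(atomic_xchg(count, 1) != 0))
			fail_fn(count);
	}

Note that the lock fastpath overwrites the count with 0 regardless of
its previous value, which is how the -1 written on line 7 above can be
lost on line 8.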

The following patch fixes the problem by ensuring we put the lock into
the contended state if we acquire it from the spin loop on the slowpath
but I'd like to be sure that this won't cause problems with other mutex
implementations:


diff --git a/kernel/mutex.c b/kernel/mutex.c
index a307cc9..27b7887 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -170,7 +170,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
                if (owner && !mutex_spin_on_owner(lock, owner))
                        break;
 
-               if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
+               if (atomic_cmpxchg(&lock->count, 1, -1) == 1) {
                        lock_acquired(&lock->dep_map, ip);
                        mutex_set_owner(lock);
                        preempt_enable();

All comments welcome.

Cheers,

Will

Comments

Peter Zijlstra Aug. 7, 2012, 1:48 p.m. UTC | #1
On Tue, 2012-08-07 at 12:56 +0100, Will Deacon wrote:
> Hello,
> 
> ARM recently moved to asm-generic/mutex-xchg.h for its mutex implementation
> after our previous implementation was found to be missing some crucial
> memory barriers. 


This is a76d7bd96d ("ARM: 7467/1: mutex: use generic xchg-based
implementation for ARMv6+"), right? Why do you use xchg and not the
dec-based implementation? The changelog mumbles something about shorter
critical sections, but, not knowing anything about ARM, I wonder why
that is.

> However, I'm seeing some problems running hackbench on
> SMP platforms due to the way in which the MUTEX_SPIN_ON_OWNER code operates.
> 
> The symptoms are that a bunch of hackbench tasks are left waiting on an
> unlocked mutex and therefore never get woken up to claim it. I think this
> boils down to the following sequence:
> 
> 
>         Task A        Task B        Task C        Lock value
> 0                                                     1
> 1       lock()                                        0
> 2                     lock()                          0
> 3                     spin(A)                         0
> 4       unlock()                                      1
> 5                                   lock()            0
> 6                     cmpxchg(1,0)                    0
> 7                     contended()                    -1
> 8       lock()                                        0
> 9       spin(C)                                       0
> 10                                  unlock()          1
> 11      cmpxchg(1,0)                                  0
> 12      unlock()                                      1
> 
> 
> At this point, the lock is unlocked, but Task B is in an uninterruptible
> sleep with nobody to wake it up.
> 
> The following patch fixes the problem by ensuring we put the lock into
> the contended state if we acquire it from the spin loop on the slowpath
> but I'd like to be sure that this won't cause problems with other mutex
> implementations:
> 
> 
> diff --git a/kernel/mutex.c b/kernel/mutex.c
> index a307cc9..27b7887 100644
> --- a/kernel/mutex.c
> +++ b/kernel/mutex.c
> @@ -170,7 +170,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
>                 if (owner && !mutex_spin_on_owner(lock, owner))
>                         break;
>  
> -               if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
> +               if (atomic_cmpxchg(&lock->count, 1, -1) == 1) {
>                         lock_acquired(&lock->dep_map, ip);
>                         mutex_set_owner(lock);
>                         preempt_enable();
> 

But in this case, either B is still spinning in our spin-loop, or it has
already passed the atomic_xchg(&lock->count, -1) when we fell out.

Since you say B is in UNINTERRUPTIBLE state, we'll assume it fell
through and so the lock count should be -1 (or less) to mark it
contended.

Will Deacon Aug. 7, 2012, 2:04 p.m. UTC | #2
On Tue, Aug 07, 2012 at 02:48:42PM +0100, Peter Zijlstra wrote:
> On Tue, 2012-08-07 at 12:56 +0100, Will Deacon wrote:
> > ARM recently moved to asm-generic/mutex-xchg.h for its mutex implementation
> > after our previous implementation was found to be missing some crucial
> > memory barriers. 
> 
> 
> This is a76d7bd96d ("ARM: 7467/1: mutex: use generic xchg-based
> implementation for ARMv6+"), right? Why do you use xchg and not the
> dec-based implementation? The changelog mumbles something about shorter
> critical sections, but, not knowing anything about ARM, I wonder why
> that is.

Correct, that's the patch. We don't have atomic add/sub instructions on ARM,
so instead we have to do:

1:	ldrex	...	@ Exclusive load
	add/sub ...     @ Do the arithmetic
	strex	...	@ Exclusive store
	cmp	...	@ Check the store succeeded
	bne	1b	@ Retry if we weren't atomic

So using dec adds a sub instruction where xchg needs no arithmetic at
all. I suspect there's no measurable difference between the two, but we
already use the xchg-based implementation for CPUs prior to ARMv6, so it
saves an ifdef as well. There is some discussion on the original patch
here:

  http://lists.infradead.org/pipermail/linux-arm-kernel/2012-July/109333.html
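
To make that concrete, the dec-based generic fastpath is roughly the
following (a sketch of asm-generic/mutex-dec.h); on ARM the
atomic_dec_return() expands to an ldrex/sub/strex loop like the one
above:

	static inline void
	__mutex_fastpath_lock(atomic_t *count, void (*fail_fn)(atomic_t *))
	{
		if (unlikely(atomic_dec_return(count) < 0))
			fail_fn(count);
	}

The sub in that loop is precisely the instruction that the xchg-based
version avoids.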

> >         Task A        Task B        Task C        Lock value
> > 0                                                     1
> > 1       lock()                                        0
> > 2                     lock()                          0
> > 3                     spin(A)                         0
> > 4       unlock()                                      1
> > 5                                   lock()            0
> > 6                     cmpxchg(1,0)                    0
> > 7                     contended()                    -1
> > 8       lock()                                        0
> > 9       spin(C)                                       0
> > 10                                  unlock()          1
> > 11      cmpxchg(1,0)                                  0
> > 12      unlock()                                      1
> > 
> > 
> > At this point, the lock is unlocked, but Task B is in an uninterruptible
> > sleep with nobody to wake it up.

[...]

> But in this case, either B is still spinning in our spin-loop, or it has
> already passed the atomic_xchg(&lock->count, -1) when we fell out.

Yes, it does that xchg on line 7 (see the lock value of -1)...

> Since you say B is in UNINTERRUPTIBLE state, we'll assume it fell
> through and so the lock count should be -1 (or less) to mark it
> contended.

... but then A sets it straight back to 0 in __mutex_fastpath_lock and falls
down the slowpath due to it being contended. The problem is that it doesn't
restore the -1 when it acquires the lock on line 11, so B is never woken up.

Will

Nicolas Pitre Aug. 7, 2012, 5:14 p.m. UTC | #3
On Tue, 7 Aug 2012, Will Deacon wrote:

> Hello,
> 
> ARM recently moved to asm-generic/mutex-xchg.h for its mutex implementation
> after our previous implementation was found to be missing some crucial
> memory barriers. However, I'm seeing some problems running hackbench on
> SMP platforms due to the way in which the MUTEX_SPIN_ON_OWNER code operates.
> 
> The symptoms are that a bunch of hackbench tasks are left waiting on an
> unlocked mutex and therefore never get woken up to claim it. I think this
> boils down to the following sequence:
> 
> 
>         Task A        Task B        Task C        Lock value
> 0                                                     1
> 1       lock()                                        0
> 2                     lock()                          0
> 3                     spin(A)                         0
> 4       unlock()                                      1
> 5                                   lock()            0
> 6                     cmpxchg(1,0)                    0
> 7                     contended()                    -1
> 8       lock()                                        0
> 9       spin(C)                                       0
> 10                                  unlock()          1
> 11      cmpxchg(1,0)                                  0
> 12      unlock()                                      1
> 
> 
> At this point, the lock is unlocked, but Task B is in an uninterruptible
> sleep with nobody to wake it up.

I fail to see how the lock value would go from -1 to 0 on line 8.  How 
does that happen?

> The following patch fixes the problem by ensuring we put the lock into
> the contended state if we acquire it from the spin loop on the slowpath
> but I'd like to be sure that this won't cause problems with other mutex
> implementations:
> 
> 
> diff --git a/kernel/mutex.c b/kernel/mutex.c
> index a307cc9..27b7887 100644
> --- a/kernel/mutex.c
> +++ b/kernel/mutex.c
> @@ -170,7 +170,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
>                 if (owner && !mutex_spin_on_owner(lock, owner))
>                         break;
>  
> -               if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
> +               if (atomic_cmpxchg(&lock->count, 1, -1) == 1) {
>                         lock_acquired(&lock->dep_map, ip);
>                         mutex_set_owner(lock);
>                         preempt_enable();

This would force invocation of the slow path on unlock even though in 
most cases the lock is unlikely to be contended.  The really slow path 
does check if the waiting list is empty and sets the count to 0 before 
exiting to avoid that.  I don't see how this could be done safely in the 
spin_on_owner loop code, as the lock->wait_lock isn't held (which appears 
to be the point of this code in the first place).
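
That cleanup looks roughly like this (quoting the tail of
__mutex_lock_common() from memory, so treat it as a sketch):

	/* got the lock - lock->wait_lock is held at this point */
	mutex_remove_waiter(lock, &waiter, current_thread_info());
	mutex_set_owner(lock);

	/* set it to 0 if there are no waiters left: */
	if (likely(list_empty(&lock->wait_list)))
		atomic_set(&lock->count, 0);

	spin_unlock_mutex(&lock->wait_lock, flags);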

Yet, if the lock is heavily contended with a waiting task, the count 
should never get back to 1, and the cmpxchg on line 11 would not set the 
count to 0.  Hence my question about line 8 above.


Nicolas

Will Deacon Aug. 7, 2012, 5:33 p.m. UTC | #4
On Tue, Aug 07, 2012 at 06:14:36PM +0100, Nicolas Pitre wrote:
> On Tue, 7 Aug 2012, Will Deacon wrote:
> > The symptoms are that a bunch of hackbench tasks are left waiting on an
> > unlocked mutex and therefore never get woken up to claim it. I think this
> > boils down to the following sequence:
> > 
> > 
> >         Task A        Task B        Task C        Lock value
> > 0                                                     1
> > 1       lock()                                        0
> > 2                     lock()                          0
> > 3                     spin(A)                         0
> > 4       unlock()                                      1
> > 5                                   lock()            0
> > 6                     cmpxchg(1,0)                    0
> > 7                     contended()                    -1
> > 8       lock()                                        0
> > 9       spin(C)                                       0
> > 10                                  unlock()          1
> > 11      cmpxchg(1,0)                                  0
> > 12      unlock()                                      1
> > 
> > 
> > At this point, the lock is unlocked, but Task B is in an uninterruptible
> > sleep with nobody to wake it up.
> 
> I fail to see how the lock value would go from -1 to 0 on line 8.  How 
> does that happen?

What I think is happening is that B writes the -1 in __mutex_lock_common
and, after seeing a NULL owner (C may not have set that yet), drops through
to the:

	if (atomic_xchg(&lock->count, -1) == 1)
		goto done;

bit. At the same time, A does a mutex_lock, which goes down the fastpath:

	if (unlikely(atomic_xchg(count, 0) != 1))
		fail_fn(count);

setting the count to 0. It then trundles off down the slowpath and spins on
the new owner (C).

Maybe my diagram is confusing... the lock value is supposed to be the value
*after* the relevant operations on that same line have completed.

> > diff --git a/kernel/mutex.c b/kernel/mutex.c
> > index a307cc9..27b7887 100644
> > --- a/kernel/mutex.c
> > +++ b/kernel/mutex.c
> > @@ -170,7 +170,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
> >                 if (owner && !mutex_spin_on_owner(lock, owner))
> >                         break;
> >  
> > -               if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
> > +               if (atomic_cmpxchg(&lock->count, 1, -1) == 1) {
> >                         lock_acquired(&lock->dep_map, ip);
> >                         mutex_set_owner(lock);
> >                         preempt_enable();
> 
> This would force invocation of the slow path on unlock even though in 
> most cases the lock is unlikely to be contended.  The really slow path 
> does check if the waiting list is empty and sets the count to 0 before 
> exiting to avoid that.  I don't see how this could be done safely in the 
> spin_on_owner loop code, as the lock->wait_lock isn't held (which appears 
> to be the point of this code in the first place).

Indeed, it will trigger the slowpath on the next unlock but only in the case
that the lock was contended. You're right that there might not be any
waiters though, and we'd need to take the spinlock to check that.

> Yet, if the lock is heavily contended with a waiting task, the count 
> should never get back to 1, and the cmpxchg on line 11 would not set the 
> count to 0.  Hence my question about line 8 above.

Hmm. __mutex_fastpath_unlock always sets the count to 1:

	if (unlikely(atomic_xchg(count, 1) != 0))
		fail_fn(count);

so there's always a window for a spinning waiter (as opposed to one blocked
in the queue) to succeed in the cmpxchg.

Unless I'm barking up the wrong tree!

Will

Will Deacon Aug. 7, 2012, 5:38 p.m. UTC | #5
On Tue, Aug 07, 2012 at 06:33:44PM +0100, Will Deacon wrote:
> What I think is happening is that B writes the -1 in __mutex_lock_common
> and, after seeing a NULL owner (C may not have set that yet), drops through
> to the:
> 
> 	if (atomic_xchg(&lock->count, -1) == 1)
> 		goto done;

Sorry, should have proofread that. I meant to say:

 What I think is happening is that B writes the -1 in __mutex_lock_common
 after seeing a NULL owner (C may not have set that yet) and dropping through
 to the:
 
 	if (atomic_xchg(&lock->count, -1) == 1)
 		goto done;
 

Will

Nicolas Pitre Aug. 7, 2012, 6:28 p.m. UTC | #6
On Tue, 7 Aug 2012, Will Deacon wrote:

> On Tue, Aug 07, 2012 at 06:14:36PM +0100, Nicolas Pitre wrote:
> > On Tue, 7 Aug 2012, Will Deacon wrote:
> > > The symptoms are that a bunch of hackbench tasks are left waiting on an
> > > unlocked mutex and therefore never get woken up to claim it. I think this
> > > boils down to the following sequence:
> > > 
> > > 
> > >         Task A        Task B        Task C        Lock value
> > > 0                                                     1
> > > 1       lock()                                        0
> > > 2                     lock()                          0
> > > 3                     spin(A)                         0
> > > 4       unlock()                                      1
> > > 5                                   lock()            0
> > > 6                     cmpxchg(1,0)                    0
> > > 7                     contended()                    -1
> > > 8       lock()                                        0
> > > 9       spin(C)                                       0
> > > 10                                  unlock()          1
> > > 11      cmpxchg(1,0)                                  0
> > > 12      unlock()                                      1
> > > 
> > > 
> > > At this point, the lock is unlocked, but Task B is in an uninterruptible
> > > sleep with nobody to wake it up.
> > 
> > I fail to see how the lock value would go from -1 to 0 on line 8.  How 
> > does that happen?
> 
> [...]

Forget that.  I assumed cmpxchg when it is just xchg.

(And, for that matter, I'm even the original author of some of that
 code: http://lkml.org/lkml/2005/12/26/83.)

Back to thinking.


Nicolas

Patch

diff --git a/kernel/mutex.c b/kernel/mutex.c
index a307cc9..27b7887 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -170,7 +170,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
                if (owner && !mutex_spin_on_owner(lock, owner))
                        break;
 
-               if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
+               if (atomic_cmpxchg(&lock->count, 1, -1) == 1) {
                        lock_acquired(&lock->dep_map, ip);
                        mutex_set_owner(lock);
                        preempt_enable();