
fix a race condition in cancelable mcs spinlocks

Message ID alpine.LRH.2.02.1406011342470.20831@file01.intranet.prod.int.rdu2.redhat.com (mailing list archive)
State Not Applicable

Commit Message

Mikulas Patocka June 1, 2014, 5:53 p.m. UTC
The cancelable MCS spinlocks introduced in
fb0527bd5ea99bfeb2dd91e3c1433ecf745d6b99 break the kernel on PA-RISC.

How to reproduce:
* Use a machine with two dual-core PA-8900 processors.
* Run the LVM testsuite and compile the kernel in an endless loop at the
  same time.
* Wait for an hour or two and the kernel locks up.

You see some processes locked up in osq_lock and osq_unlock:
INFO: rcu_sched self-detected stall on CPU { 2}  (t=18000 jiffies g=247335 c=247334 q=101)
CPU: 2 PID: 21006 Comm: lvm Tainted: G           O  3.15.0-rc7 #9
Backtrace:
 [<000000004013e8a4>] show_stack+0x14/0x20
 [<00000000403016f0>] dump_stack+0x88/0x100
 [<00000000401b8738>] rcu_check_callbacks+0x4a8/0x900
 [<00000000401714c4>] update_process_times+0x64/0xc0
 [<000000004013fa24>] timer_interrupt+0x19c/0x200
 [<00000000401ad8d8>] handle_irq_event_percpu+0xa8/0x238
 [<00000000401b2454>] handle_percpu_irq+0x9c/0xd0
 [<00000000401acc40>] generic_handle_irq+0x40/0x50
 [<00000000401408cc>] do_cpu_irq_mask+0x1ac/0x298
 [<000000004012c074>] intr_return+0x0/0xc
 [<00000000401a609c>] osq_lock+0xc4/0x178
 [<0000000040138d24>] __mutex_lock_slowpath+0x1cc/0x290
 [<0000000040138e78>] mutex_lock+0x90/0x98
 [<00000000402a5614>] kernfs_activate+0x6c/0x1a0
 [<00000000402a59e0>] kernfs_add_one+0x140/0x190
 [<00000000402a75ec>] __kernfs_create_file+0xa4/0xf8

INFO: rcu_sched self-detected stall on CPU { 3}  (t=18473 jiffies g=247335 c=247334 q=101)
CPU: 3 PID: 21051 Comm: udevd Tainted: G           O  3.15.0-rc7 #9
Backtrace:
 [<000000004013e8a4>] show_stack+0x14/0x20
 [<00000000403016f0>] dump_stack+0x88/0x100
 [<00000000401b8738>] rcu_check_callbacks+0x4a8/0x900
 [<00000000401714c4>] update_process_times+0x64/0xc0
 [<000000004013fa24>] timer_interrupt+0x19c/0x200
 [<00000000401ad8d8>] handle_irq_event_percpu+0xa8/0x238
 [<00000000401b2454>] handle_percpu_irq+0x9c/0xd0
 [<00000000401acc40>] generic_handle_irq+0x40/0x50
 [<00000000401408cc>] do_cpu_irq_mask+0x1ac/0x298
 [<000000004012c074>] intr_return+0x0/0xc
 [<00000000401a6220>] osq_unlock+0xd0/0xf8
 [<0000000040138dcc>] __mutex_lock_slowpath+0x274/0x290
 [<0000000040138e78>] mutex_lock+0x90/0x98
 [<00000000402a3a90>] kernfs_dop_revalidate+0x48/0x108
 [<0000000040233310>] lookup_fast+0x320/0x348
 [<0000000040234600>] link_path_walk+0x190/0x9d8


The code in kernel/locking/mcs_spinlock.c is broken.

PA-RISC doesn't have xchg or cmpxchg atomic instructions like other
processors. It only has ldcw and ldcd instructions that load a word (or
doubleword) from memory and atomically store zero at the same location.
These instructions can only be used to implement spinlocks; direct
implementation of other atomic operations is impossible.

Consequently, Linux xchg and cmpxchg functions are implemented in such a
way that they hash the address, use the hash to index a spinlock, take the
spinlock, perform the xchg or cmpxchg operation non-atomically and drop
the spinlock.

If you write to some variable with ACCESS_ONCE and use cmpxchg or xchg at
the same time, you break it. ACCESS_ONCE doesn't take the hashed spinlock,
so, in this case, cmpxchg or xchg isn't really atomic at all.

This patch fixes the bug by introducing a new type atomic_pointer_t 
(backed by atomic_long_t) and replacing the offending pointer with it. 
atomic_long_set takes the hashed spinlock, so it avoids the race 
condition.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 include/asm-generic/atomic-long.h |   24 ++++++++++++++++++++++++
 kernel/locking/mcs_spinlock.c     |   16 ++++++++--------
 kernel/locking/mcs_spinlock.h     |    4 +++-
 3 files changed, 35 insertions(+), 9 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Peter Zijlstra June 1, 2014, 7:20 p.m. UTC | #1
On Sun, Jun 01, 2014 at 01:53:11PM -0400, Mikulas Patocka wrote:
> PA-RISC doesn't have xchg or cmpxchg atomic instructions like other
> processors. It only has ldcw and ldcd instructions that load a word (or
> doubleword) from memory and atomically store zero at the same location.
> These instructions can only be used to implement spinlocks, direct
> implementation of other atomic operations is impossible.
> 
> Consequently, Linux xchg and cmpxchg functions are implemented in such a
> way that they hash the address, use the hash to index a spinlock, take the
> spinlock, perform the xchg or cmpxchg operation non-atomically and drop
> the spinlock.
> 
> If you write to some variable with ACCESS_ONCE and use cmpxchg or xchg at
> the same time, you break it. ACCESS_ONCE doesn't take the hashed spinlock,
> so, in this case, cmpxchg or xchg isn't really atomic at all.

And this is really the first place in the kernel that breaks like this?
I've been using xchg() and cmpxchg() without such consideration for
quite a while.

Doesn't sparc32 have similarly broken atomic ops?

Ideally, if we really want to preserve such broken-ness, we'd add some
debugging infrastructure to detect such nonsense.

> This patch fixes the bug by introducing a new type atomic_pointer_t 
> (backed by atomic_long_t) and replacing the offending pointer with it. 
> atomic_long_set takes the hashed spinlock, so it avoids the race 
> condition.

So I hate that twice, once since xchg() and cmpxchg() do not share the
atomic_ prefix, its inappropriate and misleading here, and secondly,
non of this is specific to pointers, xchg() and cmpxchg() take any
(naturally aligned) 'native' size type.

John David Anglin June 1, 2014, 8:46 p.m. UTC | #2
On 1-Jun-14, at 3:20 PM, Peter Zijlstra wrote:

>> If you write to some variable with ACCESS_ONCE and use cmpxchg or xchg
>> at the same time, you break it. ACCESS_ONCE doesn't take the hashed
>> spinlock, so, in this case, cmpxchg or xchg isn't really atomic at all.
>
> And this is really the first place in the kernel that breaks like this?
> I've been using xchg() and cmpxchg() without such consideration for
> quite a while.

I believe Mikulas is correct.  Even in a controlled situation where a
cmpxchg operation is used to implement pthread_spin_lock() in userspace,
we found recently that the lock must be released with a cmpxchg operation
and not a simple write on SMP systems.  There is a race in the cache
operations or instruction ordering that's not present with the ldcw
instruction.

Dave
--
John David Anglin	dave.anglin@bell.net



Peter Zijlstra June 1, 2014, 9:30 p.m. UTC | #3
On Sun, Jun 01, 2014 at 04:46:26PM -0400, John David Anglin wrote:
> On 1-Jun-14, at 3:20 PM, Peter Zijlstra wrote:
> 
> >>If you write to some variable with ACCESS_ONCE and use cmpxchg or xchg
> >>at
> >>the same time, you break it. ACCESS_ONCE doesn't take the hashed
> >>spinlock,
> >>so, in this case, cmpxchg or xchg isn't really atomic at all.
> >
> >And this is really the first place in the kernel that breaks like this?
> >I've been using xchg() and cmpxchg() without such consideration for
> >quite a while.
> 
> I believe Mikulas is correct.  Even in a controlled situation where a
> cmpxchg operation
> is used to implement pthread_spin_lock() in userspace, we found recently
> that the lock
> must be released with a  cmpxchg operation and not a simple write on SMP
> systems.
> There is a race in the cache operations or instruction ordering that's not
> present with
> the ldcw instruction.

Oh, I'm not arguing that. He's quite right that its broken, but this
form of atomic ops is also quite insane and unusual. Most sane machines
don't have this problem.

My main concern is how are we going to avoid breaking parisc (and I
think sparc32, which is similarly retarded) in the future; we should
invest in machinery to find and detect these things.

Paul E. McKenney June 1, 2014, 9:46 p.m. UTC | #4
On Sun, Jun 01, 2014 at 11:30:03PM +0200, Peter Zijlstra wrote:
> On Sun, Jun 01, 2014 at 04:46:26PM -0400, John David Anglin wrote:
> > On 1-Jun-14, at 3:20 PM, Peter Zijlstra wrote:
> > 
> > >>If you write to some variable with ACCESS_ONCE and use cmpxchg or xchg
> > >>at
> > >>the same time, you break it. ACCESS_ONCE doesn't take the hashed
> > >>spinlock,
> > >>so, in this case, cmpxchg or xchg isn't really atomic at all.
> > >
> > >And this is really the first place in the kernel that breaks like this?
> > >I've been using xchg() and cmpxchg() without such consideration for
> > >quite a while.
> > 
> > I believe Mikulas is correct.  Even in a controlled situation where a
> > cmpxchg operation
> > is used to implement pthread_spin_lock() in userspace, we found recently
> > that the lock
> > must be released with a  cmpxchg operation and not a simple write on SMP
> > systems.
> > There is a race in the cache operations or instruction ordering that's not
> > present with
> > the ldcw instruction.
> 
> Oh, I'm not arguing that. He's quite right that its broken, but this
> form of atomic ops is also quite insane and unusual. Most sane machines
> don't have this problem.
> 
> My main concern is how are we going to avoid breaking parisc (and I
> think sparc32, which is similarly retarded) in the future; we should
> invest in machinery to find and detect these things.

I cannot see an easy way to fix this by making ACCESS_ONCE() arch-dependent.
But could the compiler help out by recognizing ACCESS_ONCE() and generating
the needed code for it on sparc and pa-risc?

							Thanx, Paul

Mikulas Patocka June 2, 2014, 9:19 a.m. UTC | #5
On Sun, 1 Jun 2014, Peter Zijlstra wrote:

> On Sun, Jun 01, 2014 at 04:46:26PM -0400, John David Anglin wrote:
> > On 1-Jun-14, at 3:20 PM, Peter Zijlstra wrote:
> > 
> > >>If you write to some variable with ACCESS_ONCE and use cmpxchg or xchg
> > >>at
> > >>the same time, you break it. ACCESS_ONCE doesn't take the hashed
> > >>spinlock,
> > >>so, in this case, cmpxchg or xchg isn't really atomic at all.
> > >
> > >And this is really the first place in the kernel that breaks like this?
> > >I've been using xchg() and cmpxchg() without such consideration for
> > >quite a while.
> > 
> > I believe Mikulas is correct.  Even in a controlled situation where a
> > cmpxchg operation
> > is used to implement pthread_spin_lock() in userspace, we found recently
> > that the lock
> > must be released with a  cmpxchg operation and not a simple write on SMP
> > systems.
> > There is a race in the cache operations or instruction ordering that's not
> > present with
> > the ldcw instruction.
> 
> Oh, I'm not arguing that. He's quite right that its broken, but this
> form of atomic ops is also quite insane and unusual. Most sane machines
> don't have this problem.
> 
> My main concern is how are we going to avoid breaking parisc (and I
> think sparc32, which is similarly retarded) in the future; we should
> invest in machinery to find and detect these things.

Grep the kernel for "\<xchg\>" and "\<cmpxchg\>" and replace them with 
atomic types and atomic access functions.

Mikulas
Mikulas Patocka June 2, 2014, 10:34 a.m. UTC | #6
On Sun, 1 Jun 2014, Peter Zijlstra wrote:

> On Sun, Jun 01, 2014 at 01:53:11PM -0400, Mikulas Patocka wrote:
> > PA-RISC doesn't have xchg or cmpxchg atomic instructions like other
> > processors. It only has ldcw and ldcd instructions that load a word (or
> > doubleword) from memory and atomically store zero at the same location.
> > These instructions can only be used to implement spinlocks, direct
> > implementation of other atomic operations is impossible.
> > 
> > Consequently, Linux xchg and cmpxchg functions are implemented in such a
> > way that they hash the address, use the hash to index a spinlock, take the
> > spinlock, perform the xchg or cmpxchg operation non-atomically and drop
> > the spinlock.
> > 
> > If you write to some variable with ACCESS_ONCE and use cmpxchg or xchg at
> > the same time, you break it. ACCESS_ONCE doesn't take the hashed spinlock,
> > so, in this case, cmpxchg or xchg isn't really atomic at all.
> 
> And this is really the first place in the kernel that breaks like this?
> I've been using xchg() and cmpxchg() without such consideration for
> quite a while.

It happens on a common mutex operation and it took an hour of 
stress-testing to trigger it.

The other cases may be buggy too, but no one has written a stress test 
specifically tailored for them.

> Doesn't sparc32 have similarly broken atomic ops?

Yes. tile32, arc and metag seem to be broken by this too. hexagon also has 
non-standard atomic_set, so it may be broken too.

> Ideally, if we really want to preserve such broken-ness, we'd add some
> debugging infrastructure to detect such nonsense.
> 
> > This patch fixes the bug by introducing a new type atomic_pointer_t 
> > (backed by atomic_long_t) and replacing the offending pointer with it. 
> > atomic_long_set takes the hashed spinlock, so it avoids the race 
> > condition.
> 
> So I hate that twice, once since xchg() and cmpxchg() do not share the
> atomic_ prefix, its inappropriate and misleading here, and secondly,
> non of this is specific to pointers, xchg() and cmpxchg() take any
> (naturally aligned) 'native' size type.

I think we don't need xchg() and cmpxchg() at all, because we have atomic 
types.

Mikulas
Paul E. McKenney June 2, 2014, 1:24 p.m. UTC | #7
On Mon, Jun 02, 2014 at 05:19:39AM -0400, Mikulas Patocka wrote:
> 
> 
> On Sun, 1 Jun 2014, Peter Zijlstra wrote:
> 
> > On Sun, Jun 01, 2014 at 04:46:26PM -0400, John David Anglin wrote:
> > > On 1-Jun-14, at 3:20 PM, Peter Zijlstra wrote:
> > > 
> > > >>If you write to some variable with ACCESS_ONCE and use cmpxchg or xchg
> > > >>at
> > > >>the same time, you break it. ACCESS_ONCE doesn't take the hashed
> > > >>spinlock,
> > > >>so, in this case, cmpxchg or xchg isn't really atomic at all.
> > > >
> > > >And this is really the first place in the kernel that breaks like this?
> > > >I've been using xchg() and cmpxchg() without such consideration for
> > > >quite a while.
> > > 
> > > I believe Mikulas is correct.  Even in a controlled situation where a
> > > cmpxchg operation
> > > is used to implement pthread_spin_lock() in userspace, we found recently
> > > that the lock
> > > must be released with a  cmpxchg operation and not a simple write on SMP
> > > systems.
> > > There is a race in the cache operations or instruction ordering that's not
> > > present with
> > > the ldcw instruction.
> > 
> > Oh, I'm not arguing that. He's quite right that its broken, but this
> > form of atomic ops is also quite insane and unusual. Most sane machines
> > don't have this problem.
> > 
> > My main concern is how are we going to avoid breaking parisc (and I
> > think sparc32, which is similarly retarded) in the future; we should
> > invest in machinery to find and detect these things.
> 
> Grep the kernel for "\<xchg\>" and "\<cmpxchg\>" and replace them with 
> atomic types and atomic access functions.

Not so good for pointers, though.  Defeats type-checking, for one thing.
An example of this is use of xchg() for atomically enqueuing RCU callbacks
in kernel/rcu/tree_plugin.h.

I still like the idea of PA-RISC's compiler implementing ACCESS_ONCE()
as needed to make things work on that architecture.

							Thanx, Paul

Mikulas Patocka June 2, 2014, 1:58 p.m. UTC | #8
On Sun, 1 Jun 2014, John David Anglin wrote:

> On 1-Jun-14, at 3:20 PM, Peter Zijlstra wrote:
> 
> > > If you write to some variable with ACCESS_ONCE and use cmpxchg or xchg at
> > > the same time, you break it. ACCESS_ONCE doesn't take the hashed spinlock,
> > > so, in this case, cmpxchg or xchg isn't really atomic at all.
> > 
> > And this is really the first place in the kernel that breaks like this?
> > I've been using xchg() and cmpxchg() without such consideration for
> > quite a while.
> 
> I believe Mikulas is correct.  Even in a controlled situation where a 
> cmpxchg operation is used to implement pthread_spin_lock() in userspace, 
> we found recently that the lock must be released with a cmpxchg 
> operation and not a simple write on SMP systems. There is a race in the 
> cache operations or instruction ordering that's not present with the 
> ldcw instruction.
> 
> Dave
> --
> John David Anglin	dave.anglin@bell.net

That is strange.

Spinlock with cmpxchg on lock and a single write on unlock should work,
assuming that cmpxchg doesn't write to the target address when it detects
a mismatch (the cmpxchg in the kernel syscall page doesn't do that; it
nullifies the write instruction on mismatch).

Do you have some code that reproduces this misbehavior?

We really need to find out why it behaves this way:
- is PA-RISC really out of order? (we used to believe that it is in-order
  and we have empty barrier instructions in the kernel). Does adding the
  "SYNC" instruction before the write in pthread_spin_unlock fix it?
- does the processor perform nullified writes unconditionally? Does
  moving the write in the cmpxchg implementation from the nullified slot
  to its own branch fix it?
- does adding a dummy "ldcw" instruction to an unrelated address fix it?
  Is it that "ldcw" has some magic barrier properties?

I think we need to perform these tests and maybe some more to find out
what really happened there...

BTW. in Debian 5 libc 2.7, pthread_spin_lock uses ldcw and 
pthread_spin_unlock uses a single write (just like the kernel spinlock 
implementation). In Debian-ports libc 2.18, both pthread_spin_lock and 
pthread_spin_unlock call the kernel syscall page. What was the reason for 
switching to a less efficient implementation?

Mikulas
Mikulas Patocka June 2, 2014, 2:02 p.m. UTC | #9
On Mon, 2 Jun 2014, Mikulas Patocka wrote:

> 
> 
> On Sun, 1 Jun 2014, John David Anglin wrote:
> 
> > On 1-Jun-14, at 3:20 PM, Peter Zijlstra wrote:
> > 
> > > > If you write to some variable with ACCESS_ONCE and use cmpxchg or xchg at
> > > > the same time, you break it. ACCESS_ONCE doesn't take the hashed spinlock,
> > > > so, in this case, cmpxchg or xchg isn't really atomic at all.
> > > 
> > > And this is really the first place in the kernel that breaks like this?
> > > I've been using xchg() and cmpxchg() without such consideration for
> > > quite a while.
> > 
> > I believe Mikulas is correct.  Even in a controlled situation where a 
> > cmpxchg operation is used to implement pthread_spin_lock() in userspace, 
> > we found recently that the lock must be released with a cmpxchg 
> > operation and not a simple write on SMP systems. There is a race in the 
> > cache operations or instruction ordering that's not present with the 
> > ldcw instruction.
> > 
> > Dave
> > --
> > John David Anglin	dave.anglin@bell.net
> 
> That is strange.
> 
> Spinlock with cmpxchg on lock and a single write on unlock should work,
> assuming that cmpxchg doesn't write to the target address when it detects
> mismatch (the cmpxchg in the kernel syscall page doesn't do it, it
> nullifies the write instruction on mismatch).
> 
> Do you have some code that reproduces this misbehavior?
> 
> We really need to find out why it behaves this way:
> - is PA-RISC really out of order? (we used to believe that it is in-order
>   and we have empty barrier instructions in the kernel). Does adding the
>   "SYNC" instruction before the write in pthread_spin_unlock fix it?
> - does the processor perform nullified writes unconditionally? Does
>   moving the write in the cmpxchg implementation from the nullified slot
>   to its own branch fix it?
> - does adding a dummy "ldcw" instruction to an unrelated address fix it?
>   Is it that "ldcw" has some magic barrier properties?

- and there is the "stw,o" instruction that does an ordered store
according to the specification, so we should test it too...

> I think we need to perform these tests and maybe some more to find out
> what really happened there...
> 
> BTW. in Debian 5 libc 2.7, pthread_spin_lock uses ldcw and 
> pthread_spin_unlock uses a single write (just like the kernel spinlock 
> implementation). In Debian-ports libc 2.18, both pthread_spin_lock and 
> pthread_spin_unlock call the kernel syscall page. What was the reason for 
> switching to a less efficient implementation?
> 
> Mikulas
> 
Waiman Long June 2, 2014, 2:14 p.m. UTC | #10
On 06/01/2014 01:53 PM, Mikulas Patocka wrote:
> The code in kernel/locking/mcs_spinlock.c is broken.

The osq_lock and osq_unlock functions aren't the only ones that need to 
be changed; mcs_spin_lock and mcs_spin_unlock have exactly the same 
problem. There are certainly problems in other places as well.

> PA-RISC doesn't have xchg or cmpxchg atomic instructions like other
> processors. It only has ldcw and ldcd instructions that load a word (or
> doubleword) from memory and atomically store zero at the same location.
> These instructions can only be used to implement spinlocks, direct
> implementation of other atomic operations is impossible.
>
> Consequently, Linux xchg and cmpxchg functions are implemented in such a
> way that they hash the address, use the hash to index a spinlock, take the
> spinlock, perform the xchg or cmpxchg operation non-atomically and drop
> the spinlock.
>
> If you write to some variable with ACCESS_ONCE and use cmpxchg or xchg at
> the same time, you break it. ACCESS_ONCE doesn't take the hashed spinlock,
> so, in this case, cmpxchg or xchg isn't really atomic at all.
>
> This patch fixes the bug by introducing a new type atomic_pointer_t
> (backed by atomic_long_t) and replacing the offending pointer with it.
> atomic_long_set takes the hashed spinlock, so it avoids the race
> condition.

I believe the mixing of cmpxchg/xchg and ACCESS_ONCE() is fairly common 
in the kernel; it will be an additional burden on kernel developers to 
make sure that this kind of breakage doesn't happen. We also need clear 
documentation somewhere describing this kind of architecture-specific 
behavior, maybe in memory-barriers.txt.
> Index: linux-3.15-rc7/kernel/locking/mcs_spinlock.h
> ===================================================================
> --- linux-3.15-rc7.orig/kernel/locking/mcs_spinlock.h	2014-05-31 19:01:01.000000000 +0200
> +++ linux-3.15-rc7/kernel/locking/mcs_spinlock.h	2014-06-01 14:17:49.000000000 +0200
> @@ -13,6 +13,7 @@
>   #define __LINUX_MCS_SPINLOCK_H
>
>   #include<asm/mcs_spinlock.h>
> +#include<linux/atomic.h>
>
>   struct mcs_spinlock {
>   	struct mcs_spinlock *next;
> @@ -119,7 +120,8 @@ void mcs_spin_unlock(struct mcs_spinlock
>    */
>
>   struct optimistic_spin_queue {
> -	struct optimistic_spin_queue *next, *prev;
> +	atomic_pointer_t next;
> +	struct optimistic_spin_queue *prev;
>   	int locked; /* 1 if lock acquired */
>   };

Is there a way to do it without changing the pointer type? It will make 
the code harder to read and understand.

-Longman
Jason Low June 2, 2014, 3:27 p.m. UTC | #11
On Mon, 2014-06-02 at 10:14 -0400, Waiman Long wrote:
> On 06/01/2014 01:53 PM, Mikulas Patocka wrote:
> >   struct optimistic_spin_queue {
> > -	struct optimistic_spin_queue *next, *prev;
> > +	atomic_pointer_t next;
> > +	struct optimistic_spin_queue *prev;
> >   	int locked; /* 1 if lock acquired */
> >   };
> 
> Is there a way to do it without changing the pointer type? It will make 
> the code harder to read and understand.

I agree that it would be nice if there is a way to fix this without
changing the pointer type of "next". The change of the type to
atomic_pointer_t might make it less obvious what "next" is for. This is
then compounded with "prev" being kept as a pointer to
optimistic_spin_queue, which can further make it appear as if "next" may
potentially point to something different.


John David Anglin June 2, 2014, 3:39 p.m. UTC | #12
On 6/2/2014 10:02 AM, Mikulas Patocka wrote:
>
> On Mon, 2 Jun 2014, Mikulas Patocka wrote:
>
>>
>> On Sun, 1 Jun 2014, John David Anglin wrote:
>>
>>> On 1-Jun-14, at 3:20 PM, Peter Zijlstra wrote:
>>>
>>>>> If you write to some variable with ACCESS_ONCE and use cmpxchg or xchg at
>>>>> the same time, you break it. ACCESS_ONCE doesn't take the hashed spinlock,
>>>>> so, in this case, cmpxchg or xchg isn't really atomic at all.
>>>> And this is really the first place in the kernel that breaks like this?
>>>> I've been using xchg() and cmpxchg() without such consideration for
>>>> quite a while.
>>> I believe Mikulas is correct.  Even in a controlled situation where a
>>> cmpxchg operation is used to implement pthread_spin_lock() in userspace,
>>> we found recently that the lock must be released with a cmpxchg
>>> operation and not a simple write on SMP systems. There is a race in the
>>> cache operations or instruction ordering that's not present with the
>>> ldcw instruction.
>>>
>>> Dave
>>> --
>>> John David Anglin	dave.anglin@bell.net
>> That is strange.
>>
>> Spinlock with cmpxchg on lock and a single write on unlock should work,
>> assuming that cmpxchg doesn't write to the target address when it detects
>> mismatch (the cmpxchg in the kernel syscall page doesn't do it, it
>> nullifies the write instruction on mismatch).
>>
>> Do you have some code that reproduces this misbehavior?
There is a pthread_spin_lock test in the kyotocabinet package that
reproduces this misbehavior.  Essentially, it creates four threads which
loop doing pthread_spin_lock(), sched_yield() and then
pthread_spin_unlock().  On SMP systems, the test hangs with the
pthread_spin_lock locked and no thread holding the lock (i.e., unlock
failed).

The pthread support uses the cmpxchg code in
arch/parisc/kernel/syscall.S.  This uses "hashed" locks, etc., in a
manner similar to the kernel code.

>>
>> We really need to find out why it behaves this way:
>> - is PA-RISC really out of order? (we used to believe that it is in-order
>>    and we have empty barrier instructions in the kernel). Does adding the
>>    "SYNC" instruction before the write in pthread_spin_unlock fix it?
I tried the "SYNC" instruction both before the write and after the cmpxchg
operation.  In the cmpxchg operation, I also tried it with a cache flush.
I was trying to simulate ldcw behavior.
>> - does the processor perform nullified writes unconditionally? Does
>>    moving the write in the cmpxchg implementation from the nullified slot
>>    to its own branch fix it?
I don't see how the processor can perform nullified writes unconditionally,
although that might explain the observed symptom.  Didn't try moving the
cmpxchg write.

>> - does adding a dummy "ldcw" instruction to an unrelated address fix it?
>>    Is it that "ldcw" has some magic barrier properties?
I had wondered about that.  One can't use %r0 as the instruction target,
as the architecture manual says that it may then be implemented as a
normal load.  "ldcw" definitely has some magic cache and barrier
properties.  A normal store definitely works with it to reset the
semaphore.
> - and there is the "stw,o" instruction that does an ordered store
> according to the specification, so we should test it too...
This doesn't help.

Currently, the Debian eglibc has a pthread_spin_unlock.diff patch that
resolves the kyotocabinet bug.  See:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=725508

>
>> I think we need to perform these tests and maybe some more to find out
>> what really happened there...
>>
>> BTW. in Debian 5 libc 2.7, pthread_spin_lock uses ldcw and
>> pthread_spin_unlock uses a single write (just like the kernel spinlock
>> implementation). In Debian-ports libc 2.18, both pthread_spin_lock and
>> pthread_spin_unlock call the kernel syscall page. What was the reason for
>> switching to a less efficient implementation?
>>
>> Mikulas
>>
>

Dave
Mikulas Patocka June 2, 2014, 3:57 p.m. UTC | #13
On Mon, 2 Jun 2014, Paul E. McKenney wrote:

> On Mon, Jun 02, 2014 at 05:19:39AM -0400, Mikulas Patocka wrote:
> > 
> > 
> > On Sun, 1 Jun 2014, Peter Zijlstra wrote:
> > 
> > > On Sun, Jun 01, 2014 at 04:46:26PM -0400, John David Anglin wrote:
> > > > On 1-Jun-14, at 3:20 PM, Peter Zijlstra wrote:
> > > > 
> > > > >>If you write to some variable with ACCESS_ONCE and use cmpxchg or xchg
> > > > >>at
> > > > >>the same time, you break it. ACCESS_ONCE doesn't take the hashed
> > > > >>spinlock,
> > > > >>so, in this case, cmpxchg or xchg isn't really atomic at all.
> > > > >
> > > > >And this is really the first place in the kernel that breaks like this?
> > > > >I've been using xchg() and cmpxchg() without such consideration for
> > > > >quite a while.
> > > > 
> > > > I believe Mikulas is correct.  Even in a controlled situation where a
> > > > cmpxchg operation
> > > > is used to implement pthread_spin_lock() in userspace, we found recently
> > > > that the lock
> > > > must be released with a  cmpxchg operation and not a simple write on SMP
> > > > systems.
> > > > There is a race in the cache operations or instruction ordering that's not
> > > > present with
> > > > the ldcw instruction.
> > > 
> > > Oh, I'm not arguing that. He's quite right that its broken, but this
> > > form of atomic ops is also quite insane and unusual. Most sane machines
> > > don't have this problem.
> > > 
> > > My main concern is how are we going to avoid breaking parisc (and I
> > > think sparc32, which is similarly retarded) in the future; we should
> > > invest in machinery to find and detect these things.
> > 
> > Grep the kernel for "\<xchg\>" and "\<cmpxchg\>" and replace them with 
> > atomic types and atomic access functions.
> 
> Not so good for pointers, though.  Defeats type-checking, for one thing.
> An example of this is use of xchg() for atomically enqueuing RCU callbacks
> in kernel/rcu/tree_plugin.h.
> 
> I still like the idea of PA-RISC's compiler implementing ACCESS_ONCE()
> as needed to make things work on that architecture.
> 
> 							Thanx, Paul

We can perform some preprocessor tricks to check the pointer type. See my 
next patch that adds type checking - you declare the variable with

	atomic_pointer(struct optimistic_spin_queue *) next;

and the pointer type is checked on all atomic operations involving this 
variable.


The problem with ACCESS_ONCE is that people omit it. There's plenty of 
places in the kernel where ACCESS_ONCE should be used and isn't 
(i_size_read, i_size_write, rt_mutex_is_locked...). Nothing really forces 
people to write the code correctly and use it.

atomic_pointer (and other atomic types) have the advantage that they force 
people to use the atomic functions to access them. If you read or write to 
the variable directly, it won't compile.

I think the best solution is to wrap the critical pointers with 
atomic_pointer(pointer_type *) and let the compiler report errors on all 
places where it is used unsafely.

Mikulas
--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paul E. McKenney June 2, 2014, 4:39 p.m. UTC | #14
On Mon, Jun 02, 2014 at 11:57:14AM -0400, Mikulas Patocka wrote:
> 
> 
> On Mon, 2 Jun 2014, Paul E. McKenney wrote:
> 
> > On Mon, Jun 02, 2014 at 05:19:39AM -0400, Mikulas Patocka wrote:
> > > 
> > > 
> > > On Sun, 1 Jun 2014, Peter Zijlstra wrote:
> > > 
> > > > On Sun, Jun 01, 2014 at 04:46:26PM -0400, John David Anglin wrote:
> > > > > On 1-Jun-14, at 3:20 PM, Peter Zijlstra wrote:
> > > > > 
> > > > > >>If you write to some variable with ACCESS_ONCE and use cmpxchg or xchg
> > > > > >>at
> > > > > >>the same time, you break it. ACCESS_ONCE doesn't take the hashed
> > > > > >>spinlock,
> > > > > >>so, in this case, cmpxchg or xchg isn't really atomic at all.
> > > > > >
> > > > > >And this is really the first place in the kernel that breaks like this?
> > > > > >I've been using xchg() and cmpxchg() without such consideration for
> > > > > >quite a while.
> > > > > 
> > > > > I believe Mikulas is correct.  Even in a controlled situation where a
> > > > > cmpxchg operation
> > > > > is used to implement pthread_spin_lock() in userspace, we found recently
> > > > > that the lock
> > > > > must be released with a  cmpxchg operation and not a simple write on SMP
> > > > > systems.
> > > > > There is a race in the cache operations or instruction ordering that's not
> > > > > present with
> > > > > the ldcw instruction.
> > > > 
> > > > Oh, I'm not arguing that. He's quite right that its broken, but this
> > > > form of atomic ops is also quite insane and unusual. Most sane machines
> > > > don't have this problem.
> > > > 
> > > > My main concern is how are we going to avoid breaking parisc (and I
> > > > think sparc32, which is similarly retarded) in the future; we should
> > > > invest in machinery to find and detect these things.
> > > 
> > > Grep the kernel for "\<xchg\>" and "\<cmpxchg\>" and replace them with 
> > > atomic types and atomic access functions.
> > 
> > Not so good for pointers, though.  Defeats type-checking, for one thing.
> > An example of this is use of xchg() for atomically enqueuing RCU callbacks
> > in kernel/rcu/tree_plugin.h.
> > 
> > I still like the idea of PA-RISC's compiler implementing ACCESS_ONCE()
> > as needed to make things work on that architecture.
> > 
> > 							Thanx, Paul
> 
> We can perform some preprocessor tricks to check the pointer type. See my 
> next patch that adds type checking - you declare the variable with
> 
> 	atomic_pointer(struct optimistic_spin_queue *) next;
> 
> and the pointer type is checked on all atomic operations involving this 
> variable.

The special handling of ACCESS_ONCE() on architectures needing it is
way better than this sort of modification, from what I can see.

> The problem with ACCESS_ONCE is that people omit it. There's plenty of 
> places in the kernel where ACCESS_ONCE should be used and isn't 
> (i_size_read, i_size_write, rt_mutex_is_locked...). Nothing really forces 
> people to write the code correctly and use it.

Well, that would be another thing to add to the compiler modification,
have it check for a variable passed to xchg() or cmpxchg() and assigned
without the benefit of ACCESS_ONCE().  Of course, there will be false
positives, such as non-atomic assignments during initialization and
cleanup that cannot race with xchg() or cmpxchg().  Also cases where
all the xchg() and cmpxchg() are done under a lock, so that normal
assignments under that lock are OK.

Alternatively, perhaps a coccinelle script or change to sparse, smatch,
or whatever could help here.

> atomic_pointer (and other atomic types) have the advantage that they force 
> people to use the atomic functions to access them. If you read or write to 
> the variable directly, it won't compile.

Including the safe uses of normal assignment called out above?

> I think the best solution is to wrap the critical pointers with 
> atomic_pointer(pointer_type *) and let the compiler report errors on all 
> places where it is used unsafely.

I understand that you like this approach, but I am not at all convinced.

							Thanx, Paul

James Bottomley June 2, 2014, 7:56 p.m. UTC | #15
On Sun, 2014-06-01 at 23:30 +0200, Peter Zijlstra wrote:
> On Sun, Jun 01, 2014 at 04:46:26PM -0400, John David Anglin wrote:
> > On 1-Jun-14, at 3:20 PM, Peter Zijlstra wrote:
> > 
> > >>If you write to some variable with ACCESS_ONCE and use cmpxchg or xchg
> > >>at
> > >>the same time, you break it. ACCESS_ONCE doesn't take the hashed
> > >>spinlock,
> > >>so, in this case, cmpxchg or xchg isn't really atomic at all.
> > >
> > >And this is really the first place in the kernel that breaks like this?
> > >I've been using xchg() and cmpxchg() without such consideration for
> > >quite a while.
> > 
> > I believe Mikulas is correct.  Even in a controlled situation where a
> > cmpxchg operation
> > is used to implement pthread_spin_lock() in userspace, we found recently
> > that the lock
> > must be released with a  cmpxchg operation and not a simple write on SMP
> > systems.
> > There is a race in the cache operations or instruction ordering that's not
> > present with
> > the ldcw instruction.
> 
> Oh, I'm not arguing that. He's quite right that its broken, but this
> form of atomic ops is also quite insane and unusual. Most sane machines
> don't have this problem.
> 
> My main concern is how are we going to avoid breaking parisc (and I
> think sparc32, which is similarly retarded) in the future; we should
> invest in machinery to find and detect these things.

Architecturally, there is a way we could emulate the atomic exchange
instructions.  We could have a special section of memory that always
triggers a page trap.  In the Q state dtlb trap handlers we could
recognise the "atomic" section of memory and wrap the attempted
modification in a semaphore.  This would add a bit of overhead, but not
a huge amount if we do it in the trap handlers like the TMPALIAS
flushes.  This involves a lot of work for us because we have to decode
the instructions in software, recognise the operations and manually
apply the hashed semaphores around them.  If we did it like this, all
we'd need by way of mainline support is that variables treated as
atomically exchangeable should be in a separate section (because it's a
page fault handler effectively, we need them all separated from "normal"
code).  This would probably require some type of variable marker and if
we ever saw a xchg or cmpxchg on a variable without the marker, we could
break the build.

The way we'd implement it: the memory region would be read- and write-protected, so all loads and stores trap to the dtlb absent handlers.
For a ldX instruction, if it were not followed by a stX to the same
location, we'd simply give it the value.  For stX followed by ldX for
xchg, we'd take the lock, do the exchange and drop the lock and for stX
not preceded by ldX, we'd take the lock, do the store and drop the lock.
To avoid compromising the protected region, we'd actually back it by a
different area of kernel memory where we make the real modifications,
rather than trying to muck with temporarily inserting a TLB entry.  On
return we'd have to nullify the instructions to avoid re-trapping.
Effectively this has us emulating all load and store operations with a
shadow memory region ... if you know the number of possible address
modes for PARISC, you'll realise that's a non-trivial amount of code.
Plus we'd either have to ensure the shadow region had a permanent TLB
entry or do a full fault and exit the TLB handler (we can't take a
nested TLB fault within the TLB fault handler).

Is it worth it ... definitely not if we can just prevent mainline from
using xchg on our architecture.

James


Peter Zijlstra June 3, 2014, 7:56 a.m. UTC | #16
On Mon, Jun 02, 2014 at 12:56:40PM -0700, James Bottomley wrote:
> Architecturally, there is a way we could emulate the atomic exchange
> instructions.  We could have a special section of memory that always
> triggers a page trap.  In the Q state dtlb trap handlers we could
> recognise the "atomic" section of memory and wrap the attempted
> modification in a semaphore.  This would add a bit of overhead, but not
> a huge amount if we do it in the trap handlers like the TMPALIAS
> flushes.  This involves a lot of work for us because we have to decode
> the instructions in software, recognise the operations and manually
> apply the hashed semaphores around them.  If we did it like this, all
> we'd need by way of mainline support is that variables treated as
> atomically exchangeable should be in a separate section (because it's a
> page fault handler effectively, we need them all separated from "normal"
> code).  This would probably require some type of variable marker and if
> we ever saw a xchg or cmpxchg on a variable without the marker, we could
> break the build.

Cute, but I don't think that's entirely feasible given how these things
can be embedded in other structures (some dynamically allocated etc..).
Mikulas Patocka June 4, 2014, 12:53 p.m. UTC | #17
On Tue, 3 Jun 2014, Peter Zijlstra wrote:

> On Mon, Jun 02, 2014 at 12:56:40PM -0700, James Bottomley wrote:
> > Architecturally, there is a way we could emulate the atomic exchange
> > instructions.  We could have a special section of memory that always
> > triggers a page trap.  In the Q state dtlb trap handlers we could
> > recognise the "atomic" section of memory and wrap the attempted
> > modification in a semaphore.  This would add a bit of overhead, but not
> > a huge amount if we do it in the trap handlers like the TMPALIAS
> > flushes.  This involves a lot of work for us because we have to decode
> > the instructions in software, recognise the operations and manually
> > apply the hashed semaphores around them.  If we did it like this, all
> > we'd need by way of mainline support is that variables treated as
> > atomically exchangeable should be in a separate section (because it's a
> > page fault handler effectively, we need them all separated from "normal"
> > code).  This would probably require some type of variable marker and if
> > we ever saw a xchg or cmpxchg on a variable without the marker, we could
> > break the build.
> 
> Cute, but I don't think that's entirely feasible given how these things
> can be embedded in other structures (some dynamically allocated etc..).

We could deliberately misalign all the atomic variables - then we would 
take the alignment trap (the handler for that is already written) and take 
the atomic spinlock in it.

I've got another idea - we could stop the other CPUs while xchg or cmpxchg 
is being executed. But there is a problem if the other CPU has interrupts 
disabled. Could we mask interrupts on PA-RISC in such a way that they are 
all disabled except one IPI that stops the CPU temporarily? Maybe do not 
mask interrupts with PSW I-bit and mask them with EIEM instead (leaving 
the one interrupt for cmpxchg IPI enabled)?

Mikulas

Patch

Index: linux-3.15-rc7/kernel/locking/mcs_spinlock.c
===================================================================
--- linux-3.15-rc7.orig/kernel/locking/mcs_spinlock.c	2014-05-31 19:05:24.000000000 +0200
+++ linux-3.15-rc7/kernel/locking/mcs_spinlock.c	2014-06-01 14:31:58.000000000 +0200
@@ -47,8 +47,8 @@  osq_wait_next(struct optimistic_spin_que
 		 * wait for either @lock to point to us, through its Step-B, or
 		 * wait for a new @node->next from its Step-C.
 		 */
-		if (node->next) {
-			next = xchg(&node->next, NULL);
+		if (atomic_pointer_read(&node->next)) {
+			next = atomic_pointer_xchg(&node->next, NULL);
 			if (next)
 				break;
 		}
@@ -65,13 +65,13 @@  bool osq_lock(struct optimistic_spin_que
 	struct optimistic_spin_queue *prev, *next;
 
 	node->locked = 0;
-	node->next = NULL;
+	atomic_pointer_set(&node->next, NULL);
 
 	node->prev = prev = xchg(lock, node);
 	if (likely(prev == NULL))
 		return true;
 
-	ACCESS_ONCE(prev->next) = node;
+	atomic_pointer_set(&prev->next, node);
 
 	/*
 	 * Normally @prev is untouchable after the above store; because at that
@@ -103,8 +103,8 @@  unqueue:
 	 */
 
 	for (;;) {
-		if (prev->next == node &&
-		    cmpxchg(&prev->next, node, NULL) == node)
+		if (atomic_pointer_read(&prev->next) == node &&
+		    atomic_pointer_cmpxchg(&prev->next, node, NULL) == node)
 			break;
 
 		/*
@@ -144,7 +144,7 @@  unqueue:
 	 */
 
 	ACCESS_ONCE(next->prev) = prev;
-	ACCESS_ONCE(prev->next) = next;
+	atomic_pointer_set(&prev->next, next);
 
 	return false;
 }
@@ -163,7 +163,7 @@  void osq_unlock(struct optimistic_spin_q
 	/*
 	 * Second most likely case.
 	 */
-	next = xchg(&node->next, NULL);
+	next = atomic_pointer_xchg(&node->next, NULL);
 	if (next) {
 		ACCESS_ONCE(next->locked) = 1;
 		return;
Index: linux-3.15-rc7/kernel/locking/mcs_spinlock.h
===================================================================
--- linux-3.15-rc7.orig/kernel/locking/mcs_spinlock.h	2014-05-31 19:01:01.000000000 +0200
+++ linux-3.15-rc7/kernel/locking/mcs_spinlock.h	2014-06-01 14:17:49.000000000 +0200
@@ -13,6 +13,7 @@ 
 #define __LINUX_MCS_SPINLOCK_H
 
 #include <asm/mcs_spinlock.h>
+#include <linux/atomic.h>
 
 struct mcs_spinlock {
 	struct mcs_spinlock *next;
@@ -119,7 +120,8 @@  void mcs_spin_unlock(struct mcs_spinlock
  */
 
 struct optimistic_spin_queue {
-	struct optimistic_spin_queue *next, *prev;
+	atomic_pointer_t next;
+	struct optimistic_spin_queue *prev;
 	int locked; /* 1 if lock acquired */
 };
 
Index: linux-3.15-rc7/include/asm-generic/atomic-long.h
===================================================================
--- linux-3.15-rc7.orig/include/asm-generic/atomic-long.h	2014-06-01 14:04:17.000000000 +0200
+++ linux-3.15-rc7/include/asm-generic/atomic-long.h	2014-06-01 14:30:19.000000000 +0200
@@ -255,4 +255,28 @@  static inline long atomic_long_add_unles
 
 #endif  /*  BITS_PER_LONG == 64  */
 
+typedef atomic_long_t atomic_pointer_t;
+
+#define ATOMIC_POINTER_INIT(i)	ATOMIC_LONG_INIT((long)(i))
+
+static inline void *atomic_pointer_read(atomic_pointer_t *v)
+{
+	return (void *)atomic_long_read(v);
+}
+
+static inline void atomic_pointer_set(atomic_pointer_t *v, void *i)
+{
+	atomic_long_set(v, (long)i);
+}
+
+static inline void *atomic_pointer_xchg(atomic_pointer_t *v, void *i)
+{
+	return (void *)atomic_long_xchg(v, (long)i);
+}
+
+static inline void *atomic_pointer_cmpxchg(atomic_pointer_t *v, void *old, void *new)
+{
+	return (void *)atomic_long_cmpxchg(v, (long)old, (long)new);
+}
+
 #endif  /*  _ASM_GENERIC_ATOMIC_LONG_H  */