locking/atomic: Make test_and_*_bit() ordered on failure

Message ID 20220816070311.89186-1-marcan@marcan.st (mailing list archive)
State New, archived
Series locking/atomic: Make test_and_*_bit() ordered on failure

Commit Message

Hector Martin Aug. 16, 2022, 7:03 a.m. UTC
These operations are documented as always ordered in
include/asm-generic/bitops/instrumented-atomic.h, and producer-consumer
type use cases where one side needs to ensure a flag is left pending
after some shared data was updated rely on this ordering, even in the
failure case.

This is the case with the workqueue code, which currently suffers from a
reproducible ordering violation on Apple M1 platforms (which are
notoriously out-of-order) that ends up causing the TTY layer to fail to
deliver data to userspace properly under the right conditions. This
change fixes that bug.

Change the documentation to restrict the "no order on failure" story to
the _lock() variant (for which it makes sense), and remove the
early-exit from the generic implementation, which is what causes the
missing barrier semantics in that case. Without this, the remaining
atomic op is fully ordered (including on ARM64 LSE, as of recent
versions of the architecture spec).

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org
Fixes: e986a0d6cb36 ("locking/atomics, asm-generic/bitops/atomic.h: Rewrite using atomic_*() APIs")
Fixes: 61e02392d3c7 ("locking/atomic/bitops: Document and clarify ordering semantics for failed test_and_{}_bit()")
Signed-off-by: Hector Martin <marcan@marcan.st>
---
 Documentation/atomic_bitops.txt     | 2 +-
 include/asm-generic/bitops/atomic.h | 6 ------
 2 files changed, 1 insertion(+), 7 deletions(-)
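
For illustration, the producer/consumer pattern described above looks roughly
like the sketch below. This is not the actual workqueue code; FLAG_PENDING,
shared, kick_consumer() and consume() are made-up names standing in for the
real machinery.

#include <linux/bitops.h>	/* test_and_set_bit(), test_and_clear_bit() */
#include <linux/compiler.h>	/* READ_ONCE(), WRITE_ONCE() */

static unsigned long flags;
static int shared;
#define FLAG_PENDING	0

/* Hypothetical helpers standing in for the real consumer machinery. */
static void kick_consumer(void);
static void consume(int val);

/* Producer: publish the data, then make sure the flag is left pending. */
static void producer(int data)
{
	WRITE_ONCE(shared, data);
	/*
	 * Even when the bit was already set (the "failed" case), this
	 * test_and_set_bit() must be fully ordered so that the consumer
	 * which clears the bit is guaranteed to observe the store above.
	 */
	if (test_and_set_bit(FLAG_PENDING, &flags))
		return;
	kick_consumer();	/* hypothetical: schedule the consumer */
}

/* Consumer: take the flag down, then read the data. */
static void consumer(void)
{
	if (test_and_clear_bit(FLAG_PENDING, &flags))
		consume(READ_ONCE(shared));
}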

Comments

Arnd Bergmann Aug. 16, 2022, 8:16 a.m. UTC | #1
On Tue, Aug 16, 2022 at 9:03 AM Hector Martin <marcan@marcan.st> wrote:
>
> These operations are documented as always ordered in
> include/asm-generic/bitops/instrumented-atomic.h, and producer-consumer
> type use cases where one side needs to ensure a flag is left pending
> after some shared data was updated rely on this ordering, even in the
> failure case.
>
> This is the case with the workqueue code, which currently suffers from a
> reproducible ordering violation on Apple M1 platforms (which are
> notoriously out-of-order) that ends up causing the TTY layer to fail to
> deliver data to userspace properly under the right conditions. This
> change fixes that bug.
>
> Change the documentation to restrict the "no order on failure" story to
> the _lock() variant (for which it makes sense), and remove the
> early-exit from the generic implementation, which is what causes the
> missing barrier semantics in that case. Without this, the remaining
> atomic op is fully ordered (including on ARM64 LSE, as of recent
> versions of the architecture spec).
>
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: stable@vger.kernel.org
> Fixes: e986a0d6cb36 ("locking/atomics, asm-generic/bitops/atomic.h: Rewrite using atomic_*() APIs")
> Fixes: 61e02392d3c7 ("locking/atomic/bitops: Document and clarify ordering semantics for failed test_and_{}_bit()")
> Signed-off-by: Hector Martin <marcan@marcan.st>
> ---
>  Documentation/atomic_bitops.txt     | 2 +-
>  include/asm-generic/bitops/atomic.h | 6 ------

I double-checked all the architecture specific implementations to ensure
that the asm-generic one is the only one that needs the fix.

I assume this gets merged through the locking tree or that Linus picks it up
directly, not through my asm-generic tree.

Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Jon Nettleton Aug. 16, 2022, 12:29 p.m. UTC | #2
On Tue, Aug 16, 2022 at 10:17 AM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Tue, Aug 16, 2022 at 9:03 AM Hector Martin <marcan@marcan.st> wrote:
> >
> > These operations are documented as always ordered in
> > include/asm-generic/bitops/instrumented-atomic.h, and producer-consumer
> > type use cases where one side needs to ensure a flag is left pending
> > after some shared data was updated rely on this ordering, even in the
> > failure case.
> >
> > This is the case with the workqueue code, which currently suffers from a
> > reproducible ordering violation on Apple M1 platforms (which are
> > notoriously out-of-order) that ends up causing the TTY layer to fail to
> > deliver data to userspace properly under the right conditions. This
> > change fixes that bug.
> >
> > Change the documentation to restrict the "no order on failure" story to
> > the _lock() variant (for which it makes sense), and remove the
> > early-exit from the generic implementation, which is what causes the
> > missing barrier semantics in that case. Without this, the remaining
> > atomic op is fully ordered (including on ARM64 LSE, as of recent
> > versions of the architecture spec).
> >
> > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: stable@vger.kernel.org
> > Fixes: e986a0d6cb36 ("locking/atomics, asm-generic/bitops/atomic.h: Rewrite using atomic_*() APIs")
> > Fixes: 61e02392d3c7 ("locking/atomic/bitops: Document and clarify ordering semantics for failed test_and_{}_bit()")
> > Signed-off-by: Hector Martin <marcan@marcan.st>
> > ---
> >  Documentation/atomic_bitops.txt     | 2 +-
> >  include/asm-generic/bitops/atomic.h | 6 ------
>
> I double-checked all the architecture specific implementations to ensure
> that the asm-generic one is the only one that needs the fix.
>
> I assume this gets merged through the locking tree or that Linus picks it up
> directly, not through my asm-generic tree.
>
> Reviewed-by: Arnd Bergmann <arnd@arndb.de>
>

Testing this patch on pre-Armv8.1 cores, specifically Cortex-A72 and
Cortex-A53, I am seeing a huge performance drop with this patch applied.
Perf is showing lock_is_held_type() as the worst offender, but that could
just be the function getting blamed. The most obvious indicator of the
performance loss is ssh throughput: with the patch I am only able to
achieve around 20MB/s, and without the patch I am able to transfer around
112MB/s, no other changes.

When I have more time I can do some more in-depth testing, but for now I
just wanted to bring this issue up so perhaps others can chime in
regarding how it performs on their hardware.

-Jon
Will Deacon Aug. 16, 2022, 1 p.m. UTC | #3
On Tue, Aug 16, 2022 at 02:29:49PM +0200, Jon Nettleton wrote:
> On Tue, Aug 16, 2022 at 10:17 AM Arnd Bergmann <arnd@arndb.de> wrote:
> >
> > On Tue, Aug 16, 2022 at 9:03 AM Hector Martin <marcan@marcan.st> wrote:
> > >
> > > These operations are documented as always ordered in
> > > include/asm-generic/bitops/instrumented-atomic.h, and producer-consumer
> > > type use cases where one side needs to ensure a flag is left pending
> > > after some shared data was updated rely on this ordering, even in the
> > > failure case.
> > >
> > > This is the case with the workqueue code, which currently suffers from a
> > > reproducible ordering violation on Apple M1 platforms (which are
> > > notoriously out-of-order) that ends up causing the TTY layer to fail to
> > > deliver data to userspace properly under the right conditions. This
> > > change fixes that bug.
> > >
> > > Change the documentation to restrict the "no order on failure" story to
> > > the _lock() variant (for which it makes sense), and remove the
> > > early-exit from the generic implementation, which is what causes the
> > > missing barrier semantics in that case. Without this, the remaining
> > > atomic op is fully ordered (including on ARM64 LSE, as of recent
> > > versions of the architecture spec).
> > >
> > > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> > > Cc: stable@vger.kernel.org
> > > Fixes: e986a0d6cb36 ("locking/atomics, asm-generic/bitops/atomic.h: Rewrite using atomic_*() APIs")
> > > Fixes: 61e02392d3c7 ("locking/atomic/bitops: Document and clarify ordering semantics for failed test_and_{}_bit()")
> > > Signed-off-by: Hector Martin <marcan@marcan.st>
> > > ---
> > >  Documentation/atomic_bitops.txt     | 2 +-
> > >  include/asm-generic/bitops/atomic.h | 6 ------
> >
> > I double-checked all the architecture specific implementations to ensure
> > that the asm-generic one is the only one that needs the fix.
> >
> > I assume this gets merged through the locking tree or that Linus picks it up
> > directly, not through my asm-generic tree.
> >
> > Reviewed-by: Arnd Bergmann <arnd@arndb.de>
> >
> 
> Testing this patch on pre Armv8.1 specifically Cortex-A72 and
> Cortex-A53 cores I am seeing
> a huge performance drop with this patch applied. Perf is showing
> lock_is_held_type() as the worst offender

Hmm, that should only exist if LOCKDEP is enabled and performance tends to
go out of the window if you have that on. Can you reproduce the same
regression with lockdep disabled?

Will
Jon Nettleton Aug. 16, 2022, 1:05 p.m. UTC | #4
On Tue, Aug 16, 2022 at 3:01 PM Will Deacon <will@kernel.org> wrote:
>
> On Tue, Aug 16, 2022 at 02:29:49PM +0200, Jon Nettleton wrote:
> > On Tue, Aug 16, 2022 at 10:17 AM Arnd Bergmann <arnd@arndb.de> wrote:
> > >
> > > On Tue, Aug 16, 2022 at 9:03 AM Hector Martin <marcan@marcan.st> wrote:
> > > >
> > > > These operations are documented as always ordered in
> > > > include/asm-generic/bitops/instrumented-atomic.h, and producer-consumer
> > > > type use cases where one side needs to ensure a flag is left pending
> > > > after some shared data was updated rely on this ordering, even in the
> > > > failure case.
> > > >
> > > > This is the case with the workqueue code, which currently suffers from a
> > > > reproducible ordering violation on Apple M1 platforms (which are
> > > > notoriously out-of-order) that ends up causing the TTY layer to fail to
> > > > deliver data to userspace properly under the right conditions. This
> > > > change fixes that bug.
> > > >
> > > > Change the documentation to restrict the "no order on failure" story to
> > > > the _lock() variant (for which it makes sense), and remove the
> > > > early-exit from the generic implementation, which is what causes the
> > > > missing barrier semantics in that case. Without this, the remaining
> > > > atomic op is fully ordered (including on ARM64 LSE, as of recent
> > > > versions of the architecture spec).
> > > >
> > > > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> > > > Cc: stable@vger.kernel.org
> > > > Fixes: e986a0d6cb36 ("locking/atomics, asm-generic/bitops/atomic.h: Rewrite using atomic_*() APIs")
> > > > Fixes: 61e02392d3c7 ("locking/atomic/bitops: Document and clarify ordering semantics for failed test_and_{}_bit()")
> > > > Signed-off-by: Hector Martin <marcan@marcan.st>
> > > > ---
> > > >  Documentation/atomic_bitops.txt     | 2 +-
> > > >  include/asm-generic/bitops/atomic.h | 6 ------
> > >
> > > I double-checked all the architecture specific implementations to ensure
> > > that the asm-generic one is the only one that needs the fix.
> > >
> > > I assume this gets merged through the locking tree or that Linus picks it up
> > > directly, not through my asm-generic tree.
> > >
> > > Reviewed-by: Arnd Bergmann <arnd@arndb.de>
> > >
> >
> > Testing this patch on pre Armv8.1 specifically Cortex-A72 and
> > Cortex-A53 cores I am seeing
> > a huge performance drop with this patch applied. Perf is showing
> > lock_is_held_type() as the worst offender
>
> Hmm, that should only exist if LOCKDEP is enabled and performance tends to
> go out of the window if you have that on. Can you reproduce the same
> regression with lockdep disabled?
>
> Will

Yep I am working on it.  We should note that

config LOCKDEP_SUPPORT
        def_bool y

is the default for arm64

-Jon
Marc Zyngier Aug. 16, 2022, 1:23 p.m. UTC | #5
On Tue, 16 Aug 2022 14:05:54 +0100,
Jon Nettleton <jon@solid-run.com> wrote:
> 
> On Tue, Aug 16, 2022 at 3:01 PM Will Deacon <will@kernel.org> wrote:
> >
> > On Tue, Aug 16, 2022 at 02:29:49PM +0200, Jon Nettleton wrote:
> > > On Tue, Aug 16, 2022 at 10:17 AM Arnd Bergmann <arnd@arndb.de> wrote:
> > > >
> > > > On Tue, Aug 16, 2022 at 9:03 AM Hector Martin <marcan@marcan.st> wrote:
> > > > >
> > > > > These operations are documented as always ordered in
> > > > > include/asm-generic/bitops/instrumented-atomic.h, and producer-consumer
> > > > > type use cases where one side needs to ensure a flag is left pending
> > > > > after some shared data was updated rely on this ordering, even in the
> > > > > failure case.
> > > > >
> > > > > This is the case with the workqueue code, which currently suffers from a
> > > > > reproducible ordering violation on Apple M1 platforms (which are
> > > > > notoriously out-of-order) that ends up causing the TTY layer to fail to
> > > > > deliver data to userspace properly under the right conditions. This
> > > > > change fixes that bug.
> > > > >
> > > > > Change the documentation to restrict the "no order on failure" story to
> > > > > the _lock() variant (for which it makes sense), and remove the
> > > > > early-exit from the generic implementation, which is what causes the
> > > > > missing barrier semantics in that case. Without this, the remaining
> > > > > atomic op is fully ordered (including on ARM64 LSE, as of recent
> > > > > versions of the architecture spec).
> > > > >
> > > > > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> > > > > Cc: stable@vger.kernel.org
> > > > > Fixes: e986a0d6cb36 ("locking/atomics, asm-generic/bitops/atomic.h: Rewrite using atomic_*() APIs")
> > > > > Fixes: 61e02392d3c7 ("locking/atomic/bitops: Document and clarify ordering semantics for failed test_and_{}_bit()")
> > > > > Signed-off-by: Hector Martin <marcan@marcan.st>
> > > > > ---
> > > > >  Documentation/atomic_bitops.txt     | 2 +-
> > > > >  include/asm-generic/bitops/atomic.h | 6 ------
> > > >
> > > > I double-checked all the architecture specific implementations to ensure
> > > > that the asm-generic one is the only one that needs the fix.
> > > >
> > > > I assume this gets merged through the locking tree or that Linus picks it up
> > > > directly, not through my asm-generic tree.
> > > >
> > > > Reviewed-by: Arnd Bergmann <arnd@arndb.de>
> > > >
> > >
> > > Testing this patch on pre Armv8.1 specifically Cortex-A72 and
> > > Cortex-A53 cores I am seeing
> > > a huge performance drop with this patch applied. Perf is showing
> > > lock_is_held_type() as the worst offender
> >
> > Hmm, that should only exist if LOCKDEP is enabled and performance tends to
> > go out of the window if you have that on. Can you reproduce the same
> > regression with lockdep disabled?
> >
> > Will
> 
> Yep I am working on it.  We should note that
> 
> config LOCKDEP_SUPPORT
>         def_bool y
> 
> is the default for arm64

Yes, as the architecture supports LOCKDEP. However, you probably have
something like CONFIG_PROVE_LOCKING enabled to see such a performance hit (and
that's definitely not on by default).

	M.
Will Deacon Aug. 16, 2022, 2:04 p.m. UTC | #6
On Tue, Aug 16, 2022 at 04:03:11PM +0900, Hector Martin wrote:
> These operations are documented as always ordered in
> include/asm-generic/bitops/instrumented-atomic.h, and producer-consumer
> type use cases where one side needs to ensure a flag is left pending
> after some shared data was updated rely on this ordering, even in the
> failure case.
> 
> This is the case with the workqueue code, which currently suffers from a
> reproducible ordering violation on Apple M1 platforms (which are
> notoriously out-of-order) that ends up causing the TTY layer to fail to
> deliver data to userspace properly under the right conditions. This
> change fixes that bug.
> 
> Change the documentation to restrict the "no order on failure" story to
> the _lock() variant (for which it makes sense), and remove the
> early-exit from the generic implementation, which is what causes the
> missing barrier semantics in that case. Without this, the remaining
> atomic op is fully ordered (including on ARM64 LSE, as of recent
> versions of the architecture spec).
> 
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: stable@vger.kernel.org
> Fixes: e986a0d6cb36 ("locking/atomics, asm-generic/bitops/atomic.h: Rewrite using atomic_*() APIs")
> Fixes: 61e02392d3c7 ("locking/atomic/bitops: Document and clarify ordering semantics for failed test_and_{}_bit()")
> Signed-off-by: Hector Martin <marcan@marcan.st>
> ---
>  Documentation/atomic_bitops.txt     | 2 +-
>  include/asm-generic/bitops/atomic.h | 6 ------
>  2 files changed, 1 insertion(+), 7 deletions(-)
> 
> diff --git a/Documentation/atomic_bitops.txt b/Documentation/atomic_bitops.txt
> index 093cdaefdb37..d8b101c97031 100644
> --- a/Documentation/atomic_bitops.txt
> +++ b/Documentation/atomic_bitops.txt
> @@ -59,7 +59,7 @@ Like with atomic_t, the rule of thumb is:
>   - RMW operations that have a return value are fully ordered.
>  
>   - RMW operations that are conditional are unordered on FAILURE,
> -   otherwise the above rules apply. In the case of test_and_{}_bit() operations,
> +   otherwise the above rules apply. In the case of test_and_set_bit_lock(),
>     if the bit in memory is unchanged by the operation then it is deemed to have
>     failed.

The next sentence is:

  | Except for a successful test_and_set_bit_lock() which has ACQUIRE
  | semantics and clear_bit_unlock() which has RELEASE semantics.

so I think it reads a bit strangely now. How about something like:


diff --git a/Documentation/atomic_bitops.txt b/Documentation/atomic_bitops.txt
index 093cdaefdb37..3b516729ec81 100644
--- a/Documentation/atomic_bitops.txt
+++ b/Documentation/atomic_bitops.txt
@@ -59,12 +59,15 @@ Like with atomic_t, the rule of thumb is:
  - RMW operations that have a return value are fully ordered.
 
  - RMW operations that are conditional are unordered on FAILURE,
-   otherwise the above rules apply. In the case of test_and_{}_bit() operations,
-   if the bit in memory is unchanged by the operation then it is deemed to have
-   failed.
+   otherwise the above rules apply. For the purposes of ordering, the
+   test_and_{}_bit() operations are treated as unconditional.
 
-Except for a successful test_and_set_bit_lock() which has ACQUIRE semantics and
-clear_bit_unlock() which has RELEASE semantics.
+Except for:
+
+ - test_and_set_bit_lock() which has ACQUIRE semantics on success and is
+   unordered on failure;
+
+ - clear_bit_unlock() which has RELEASE semantics.
 
 Since a platform only has a single means of achieving atomic operations
 the same barriers as for atomic_t are used, see atomic_t.txt.


?

> diff --git a/include/asm-generic/bitops/atomic.h b/include/asm-generic/bitops/atomic.h
> index 3096f086b5a3..71ab4ba9c25d 100644
> --- a/include/asm-generic/bitops/atomic.h
> +++ b/include/asm-generic/bitops/atomic.h
> @@ -39,9 +39,6 @@ arch_test_and_set_bit(unsigned int nr, volatile unsigned long *p)
>  	unsigned long mask = BIT_MASK(nr);
>  
>  	p += BIT_WORD(nr);
> -	if (READ_ONCE(*p) & mask)
> -		return 1;
> -
>  	old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p);
>  	return !!(old & mask);
>  }
> @@ -53,9 +50,6 @@ arch_test_and_clear_bit(unsigned int nr, volatile unsigned long *p)
>  	unsigned long mask = BIT_MASK(nr);
>  
>  	p += BIT_WORD(nr);
> -	if (!(READ_ONCE(*p) & mask))
> -		return 0;
> -
>  	old = arch_atomic_long_fetch_andnot(mask, (atomic_long_t *)p);
>  	return !!(old & mask);

I suppose one sad thing about this is that, on arm64, we could reasonably
keep the READ_ONCE() path with a DMB LD (R->RW) barrier before the return
but I don't think we can express that in the Linux memory model so we
end up in RmW territory every time. Perhaps we should roll our own
implementation in the arch code?

In any case, this should fix the problem so:

Acked-by: Will Deacon <will@kernel.org>

Will
Will Deacon Aug. 16, 2022, 2:06 p.m. UTC | #7
On Tue, Aug 16, 2022 at 10:16:04AM +0200, Arnd Bergmann wrote:
> On Tue, Aug 16, 2022 at 9:03 AM Hector Martin <marcan@marcan.st> wrote:
> >
> > These operations are documented as always ordered in
> > include/asm-generic/bitops/instrumented-atomic.h, and producer-consumer
> > type use cases where one side needs to ensure a flag is left pending
> > after some shared data was updated rely on this ordering, even in the
> > failure case.
> >
> > This is the case with the workqueue code, which currently suffers from a
> > reproducible ordering violation on Apple M1 platforms (which are
> > notoriously out-of-order) that ends up causing the TTY layer to fail to
> > deliver data to userspace properly under the right conditions. This
> > change fixes that bug.
> >
> > Change the documentation to restrict the "no order on failure" story to
> > the _lock() variant (for which it makes sense), and remove the
> > early-exit from the generic implementation, which is what causes the
> > missing barrier semantics in that case. Without this, the remaining
> > atomic op is fully ordered (including on ARM64 LSE, as of recent
> > versions of the architecture spec).
> >
> > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: stable@vger.kernel.org
> > Fixes: e986a0d6cb36 ("locking/atomics, asm-generic/bitops/atomic.h: Rewrite using atomic_*() APIs")
> > Fixes: 61e02392d3c7 ("locking/atomic/bitops: Document and clarify ordering semantics for failed test_and_{}_bit()")
> > Signed-off-by: Hector Martin <marcan@marcan.st>
> > ---
> >  Documentation/atomic_bitops.txt     | 2 +-
> >  include/asm-generic/bitops/atomic.h | 6 ------
> 
> I double-checked all the architecture specific implementations to ensure
> that the asm-generic one is the only one that needs the fix.

I couldn't figure out parisc -- do you know what ordering their spinlocks
provide? They have a comment talking about a release, but I don't know what
the ordering guarantees of an "ldcw" are.

Will
Hector Martin Aug. 16, 2022, 2:30 p.m. UTC | #8
On 16/08/2022 23.04, Will Deacon wrote:
>> diff --git a/Documentation/atomic_bitops.txt b/Documentation/atomic_bitops.txt
>> index 093cdaefdb37..d8b101c97031 100644
>> --- a/Documentation/atomic_bitops.txt
>> +++ b/Documentation/atomic_bitops.txt
>> @@ -59,7 +59,7 @@ Like with atomic_t, the rule of thumb is:
>>   - RMW operations that have a return value are fully ordered.
>>  
>>   - RMW operations that are conditional are unordered on FAILURE,
>> -   otherwise the above rules apply. In the case of test_and_{}_bit() operations,
>> +   otherwise the above rules apply. In the case of test_and_set_bit_lock(),
>>     if the bit in memory is unchanged by the operation then it is deemed to have
>>     failed.
> 
> The next sentence is:
> 
>   | Except for a successful test_and_set_bit_lock() which has ACQUIRE
>   | semantics and clear_bit_unlock() which has RELEASE semantics.
> 
> so I think it reads a bit strangely now. How about something like:
> 
> 
> diff --git a/Documentation/atomic_bitops.txt b/Documentation/atomic_bitops.txt
> index 093cdaefdb37..3b516729ec81 100644
> --- a/Documentation/atomic_bitops.txt
> +++ b/Documentation/atomic_bitops.txt
> @@ -59,12 +59,15 @@ Like with atomic_t, the rule of thumb is:
>   - RMW operations that have a return value are fully ordered.
>  
>   - RMW operations that are conditional are unordered on FAILURE,
> -   otherwise the above rules apply. In the case of test_and_{}_bit() operations,
> -   if the bit in memory is unchanged by the operation then it is deemed to have
> -   failed.
> +   otherwise the above rules apply. For the purposes of ordering, the
> +   test_and_{}_bit() operations are treated as unconditional.
>  
> -Except for a successful test_and_set_bit_lock() which has ACQUIRE semantics and
> -clear_bit_unlock() which has RELEASE semantics.
> +Except for:
> +
> + - test_and_set_bit_lock() which has ACQUIRE semantics on success and is
> +   unordered on failure;
> +
> + - clear_bit_unlock() which has RELEASE semantics.
>  
>  Since a platform only has a single means of achieving atomic operations
>  the same barriers as for atomic_t are used, see atomic_t.txt.

Makes sense! I'll send a v2 with that in a couple of days if nothing
else comes up.

>> diff --git a/include/asm-generic/bitops/atomic.h b/include/asm-generic/bitops/atomic.h
>> index 3096f086b5a3..71ab4ba9c25d 100644
>> --- a/include/asm-generic/bitops/atomic.h
>> +++ b/include/asm-generic/bitops/atomic.h
>> @@ -39,9 +39,6 @@ arch_test_and_set_bit(unsigned int nr, volatile unsigned long *p)
>>  	unsigned long mask = BIT_MASK(nr);
>>  
>>  	p += BIT_WORD(nr);
>> -	if (READ_ONCE(*p) & mask)
>> -		return 1;
>> -
>>  	old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p);
>>  	return !!(old & mask);
>>  }
>> @@ -53,9 +50,6 @@ arch_test_and_clear_bit(unsigned int nr, volatile unsigned long *p)
>>  	unsigned long mask = BIT_MASK(nr);
>>  
>>  	p += BIT_WORD(nr);
>> -	if (!(READ_ONCE(*p) & mask))
>> -		return 0;
>> -
>>  	old = arch_atomic_long_fetch_andnot(mask, (atomic_long_t *)p);
>>  	return !!(old & mask);
> 
> I suppose one sad thing about this is that, on arm64, we could reasonably
> keep the READ_ONCE() path with a DMB LD (R->RW) barrier before the return
> but I don't think we can express that in the Linux memory model so we
> end up in RmW territory every time.

You'd need a barrier *before* the READ_ONCE(), since what we're trying
to prevent is a consumer from writing to the value without being able to
observe the writes that happened prior, while this side read the old
value. A barrier after the READ_ONCE() doesn't do anything, as that read
is the last memory operation in this thread (of the problematic sequence).

At that point, I'm not sure DMB LD / early read / LSE atomic would be
any faster than just always doing the LSE atomic?

- Hector
Will Deacon Aug. 16, 2022, 5:36 p.m. UTC | #9
On Tue, Aug 16, 2022 at 11:30:45PM +0900, Hector Martin wrote:
> On 16/08/2022 23.04, Will Deacon wrote:
> >> diff --git a/include/asm-generic/bitops/atomic.h b/include/asm-generic/bitops/atomic.h
> >> index 3096f086b5a3..71ab4ba9c25d 100644
> >> --- a/include/asm-generic/bitops/atomic.h
> >> +++ b/include/asm-generic/bitops/atomic.h
> >> @@ -39,9 +39,6 @@ arch_test_and_set_bit(unsigned int nr, volatile unsigned long *p)
> >>  	unsigned long mask = BIT_MASK(nr);
> >>  
> >>  	p += BIT_WORD(nr);
> >> -	if (READ_ONCE(*p) & mask)
> >> -		return 1;
> >> -
> >>  	old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p);
> >>  	return !!(old & mask);
> >>  }
> >> @@ -53,9 +50,6 @@ arch_test_and_clear_bit(unsigned int nr, volatile unsigned long *p)
> >>  	unsigned long mask = BIT_MASK(nr);
> >>  
> >>  	p += BIT_WORD(nr);
> >> -	if (!(READ_ONCE(*p) & mask))
> >> -		return 0;
> >> -
> >>  	old = arch_atomic_long_fetch_andnot(mask, (atomic_long_t *)p);
> >>  	return !!(old & mask);
> > 
> > I suppose one sad thing about this is that, on arm64, we could reasonably
> > keep the READ_ONCE() path with a DMB LD (R->RW) barrier before the return
> > but I don't think we can express that in the Linux memory model so we
> > end up in RmW territory every time.
> 
> You'd need a barrier *before* the READ_ONCE(), since what we're trying
> to prevent is a consumer from writing to the value without being able to
> observe the writes that happened prior, while this side read the old
> value. A barrier after the READ_ONCE() doesn't do anything, as that read
> is the last memory operation in this thread (of the problematic sequence).

Right, having gone back to your litmus test, I now realise it's the "SB"
shape from the memory ordering terminology. It's funny because the arm64
acquire/release instructions are RCsc and so upgrading the READ_ONCE()
to an *arm64* acquire instruction would work for your specific case, but
only because the preceding store is a release.

> At that point, I'm not sure DMB LD / early read / LSE atomic would be
> any faster than just always doing the LSE atomic?

It depends a lot on the configuration of the system and the state of the
relevant cacheline, but generally avoiding an RmW by introducing a barrier
is likely to be a win. It just gets ugly here as we'd want to avoid the
DMB in the case where we end up doing the RmW. Possibly we could do
something funky like a test-and-test-and-test-and-set (!) where we do
the DMB+READ_ONCE() only if the first READ_ONCE() has the bit set, but
even just typing that is horrible and I'd _absolutely_ want to see perf
numbers to show that it's a benefit once you start taking into account
things like branch prediction.

Anywho, since Linus has applied the patch and it should work, this is
just an interesting aside.

Will
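
(Purely as an aside for illustration: the "test-and-test-and-test-and-set"
shape Will describes might look roughly like the sketch below. It is not
proposed code; arch_test_and_set_bit() is just used as a stand-in name, a
full smp_mb() is used rather than trying to pick the minimal barrier, and
whether any of this is actually a win is exactly the open question above.)

static inline int arch_test_and_set_bit(unsigned int nr, volatile unsigned long *p)
{
	unsigned long old, mask = BIT_MASK(nr);

	p += BIT_WORD(nr);
	if (READ_ONCE(*p) & mask) {
		/*
		 * Looks set already: order prior accesses against a second
		 * read, and only take the unordered early return if the bit
		 * is still set ("test-and-test-and-...-set").
		 */
		smp_mb();
		if (READ_ONCE(*p) & mask)
			return 1;
	}
	old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p);
	return !!(old & mask);
}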
Jon Nettleton Aug. 16, 2022, 5:49 p.m. UTC | #10
On Tue, Aug 16, 2022 at 7:38 PM Will Deacon <will@kernel.org> wrote:
>
> On Tue, Aug 16, 2022 at 11:30:45PM +0900, Hector Martin wrote:
> > On 16/08/2022 23.04, Will Deacon wrote:
> > >> diff --git a/include/asm-generic/bitops/atomic.h b/include/asm-generic/bitops/atomic.h
> > >> index 3096f086b5a3..71ab4ba9c25d 100644
> > >> --- a/include/asm-generic/bitops/atomic.h
> > >> +++ b/include/asm-generic/bitops/atomic.h
> > >> @@ -39,9 +39,6 @@ arch_test_and_set_bit(unsigned int nr, volatile unsigned long *p)
> > >>    unsigned long mask = BIT_MASK(nr);
> > >>
> > >>    p += BIT_WORD(nr);
> > >> -  if (READ_ONCE(*p) & mask)
> > >> -          return 1;
> > >> -
> > >>    old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p);
> > >>    return !!(old & mask);
> > >>  }
> > >> @@ -53,9 +50,6 @@ arch_test_and_clear_bit(unsigned int nr, volatile unsigned long *p)
> > >>    unsigned long mask = BIT_MASK(nr);
> > >>
> > >>    p += BIT_WORD(nr);
> > >> -  if (!(READ_ONCE(*p) & mask))
> > >> -          return 0;
> > >> -
> > >>    old = arch_atomic_long_fetch_andnot(mask, (atomic_long_t *)p);
> > >>    return !!(old & mask);
> > >
> > > I suppose one sad thing about this is that, on arm64, we could reasonably
> > > keep the READ_ONCE() path with a DMB LD (R->RW) barrier before the return
> > > but I don't think we can express that in the Linux memory model so we
> > > end up in RmW territory every time.
> >
> > You'd need a barrier *before* the READ_ONCE(), since what we're trying
> > to prevent is a consumer from writing to the value without being able to
> > observe the writes that happened prior, while this side read the old
> > value. A barrier after the READ_ONCE() doesn't do anything, as that read
> > is the last memory operation in this thread (of the problematic sequence).
>
> Right, having gone back to your litmus test, I now realise it's the "SB"
> shape from the memory ordering terminology. It's funny because the arm64
> acquire/release instructions are RCsc and so upgrading the READ_ONCE()
> to an *arm64* acquire instruction would work for your specific case, but
> only because the preceding store is a release.
>
> > At that point, I'm not sure DMB LD / early read / LSE atomic would be
> > any faster than just always doing the LSE atomic?
>
> It depends a lot on the configuration of the system and the state of the
> relevant cacheline, but generally avoiding an RmW by introducing a barrier
> is likely to be a win. It just gets ugly here as we'd want to avoid the
> DMB in the case where we end up doing the RmW. Possibly we could do
> something funky like a test-and-test-and-test-and-set (!) where we do
> the DMB+READ_ONCE() only if the first READ_ONCE() has the bit set, but
> even just typing that is horrible and I'd _absolutely_ want to see perf
> numbers to show that it's a benefit once you start taking into account
> things like branch prediction.
>
> Anywho, since Linus has applied the patch and it should work, this is
> just an interesting aside.
>
> Will
>

It is moot if Linus has already taken the patch, but with a stock kernel
config I am still seeing a slight performance dip, though only ~1-2% in
the specific tests I was running. Sorry about the noise; I will need to
look at my kernel builder and see what went wrong when I have more time.

Cheers,
Jon
Linus Torvalds Aug. 16, 2022, 6:02 p.m. UTC | #11
On Tue, Aug 16, 2022 at 10:49 AM Jon Nettleton <jon@solid-run.com> wrote:
>
> It is moot if Linus has already taken the patch, but with a stock
> kernel config I am
> still seeing a slight performance dip but only ~1-2% in the specific
> tests I was running.

It would be interesting to hear if you can pinpoint in the profiles
where the time is spent.

It might be some random place that really doesn't care about ordering
at all, and then we could easily rewrite _that_ particular case to do
the unordered test explicitly, ie something like

-        if (test_and_set_bit()) ...
+       if (test_bit() || test_and_set_bit()) ...

or even introduce an explicitly unordered "test_and_set_bit_relaxed()" thing.

                 Linus
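
(For illustration, an explicitly unordered variant could simply keep the
early exit that this patch removes from the generic implementation. The
name test_and_set_bit_relaxed() is only the suggestion above, not an
existing API, and this is a sketch rather than proposed code.)

static inline bool test_and_set_bit_relaxed(unsigned int nr,
					    volatile unsigned long *p)
{
	unsigned long old, mask = BIT_MASK(nr);

	p += BIT_WORD(nr);
	if (READ_ONCE(*p) & mask)	/* unordered "failure" fast path */
		return true;
	old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p);
	return !!(old & mask);
}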
Matthew Wilcox Aug. 16, 2022, 6:14 p.m. UTC | #12
On Tue, Aug 16, 2022 at 03:06:41PM +0100, Will Deacon wrote:
> On Tue, Aug 16, 2022 at 10:16:04AM +0200, Arnd Bergmann wrote:
> > On Tue, Aug 16, 2022 at 9:03 AM Hector Martin <marcan@marcan.st> wrote:
> > >
> > > These operations are documented as always ordered in
> > > include/asm-generic/bitops/instrumented-atomic.h, and producer-consumer
> > > type use cases where one side needs to ensure a flag is left pending
> > > after some shared data was updated rely on this ordering, even in the
> > > failure case.
> > >
> > > This is the case with the workqueue code, which currently suffers from a
> > > reproducible ordering violation on Apple M1 platforms (which are
> > > notoriously out-of-order) that ends up causing the TTY layer to fail to
> > > deliver data to userspace properly under the right conditions. This
> > > change fixes that bug.
> > >
> > > Change the documentation to restrict the "no order on failure" story to
> > > the _lock() variant (for which it makes sense), and remove the
> > > early-exit from the generic implementation, which is what causes the
> > > missing barrier semantics in that case. Without this, the remaining
> > > atomic op is fully ordered (including on ARM64 LSE, as of recent
> > > versions of the architecture spec).
> > >
> > > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> > > Cc: stable@vger.kernel.org
> > > Fixes: e986a0d6cb36 ("locking/atomics, asm-generic/bitops/atomic.h: Rewrite using atomic_*() APIs")
> > > Fixes: 61e02392d3c7 ("locking/atomic/bitops: Document and clarify ordering semantics for failed test_and_{}_bit()")
> > > Signed-off-by: Hector Martin <marcan@marcan.st>
> > > ---
> > >  Documentation/atomic_bitops.txt     | 2 +-
> > >  include/asm-generic/bitops/atomic.h | 6 ------
> > 
> > I double-checked all the architecture specific implementations to ensure
> > that the asm-generic one is the only one that needs the fix.
> 
> I couldn't figure out parisc -- do you know what ordering their spinlocks
> provide? They have a comment talking about a release, but I don't know what
> the ordering guarantees of an "ldcw" are.

"The semaphore operation is strongly ordered" (that's from the
description of the LDCW instruction)
Jon Nettleton Aug. 17, 2022, 5:40 a.m. UTC | #13
On Tue, Aug 16, 2022 at 8:02 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, Aug 16, 2022 at 10:49 AM Jon Nettleton <jon@solid-run.com> wrote:
> >
> > It is moot if Linus has already taken the patch, but with a stock
> > kernel config I am
> > still seeing a slight performance dip but only ~1-2% in the specific
> > tests I was running.
>
> It would be interesting to hear if you can pinpoint in the profiles
> where the time is spent.
>
> It might be some random place that really doesn't care about ordering
> at all, and then we could easily rewrite _that_ particular case to do
> the unordered test explicitly, ie something like
>
> -        if (test_and_set_bit()) ...
> +       if (test_bit() || test_and_set_bit()) ...
>
> or even introduce an explicitly unordered "test_and_set_bit_relaxed()" thing.
>
>                  Linus

This is very interesting: the additional performance overhead doesn't seem
to be coming from within the kernel but from userspace. Comparing patched
and unpatched kernels, I am seeing more cycles being taken up by glibc
atomics like __aarch64_cas4_acq and __aarch64_ldadd4_acq_rel.

I need to test further to see if there is less effect on a system with
fewer cores. This is a 16-core Cortex-A72; it is possible this is less of
an issue on 4-core A72s and A53s.

-Jon
David Laight Aug. 17, 2022, 8:20 a.m. UTC | #14
...
>  	p += BIT_WORD(nr);
> -	if (READ_ONCE(*p) & mask)
> -		return 1;
> -
>  	old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p);
>  	return !!(old & mask);
>  }

This looks like the same pattern (attempting to avoid a
locked bus cycle) that caused the qdisc code to sit on
transmit packets (even on x86).
That had some barriers in it (possibly nops on x86) that
didn't help - although the comments suggested otherwise.

I wonder if the pattern has been used anywhere else?

	David


Patch

diff --git a/Documentation/atomic_bitops.txt b/Documentation/atomic_bitops.txt
index 093cdaefdb37..d8b101c97031 100644
--- a/Documentation/atomic_bitops.txt
+++ b/Documentation/atomic_bitops.txt
@@ -59,7 +59,7 @@  Like with atomic_t, the rule of thumb is:
  - RMW operations that have a return value are fully ordered.
 
  - RMW operations that are conditional are unordered on FAILURE,
-   otherwise the above rules apply. In the case of test_and_{}_bit() operations,
+   otherwise the above rules apply. In the case of test_and_set_bit_lock(),
    if the bit in memory is unchanged by the operation then it is deemed to have
    failed.
 
diff --git a/include/asm-generic/bitops/atomic.h b/include/asm-generic/bitops/atomic.h
index 3096f086b5a3..71ab4ba9c25d 100644
--- a/include/asm-generic/bitops/atomic.h
+++ b/include/asm-generic/bitops/atomic.h
@@ -39,9 +39,6 @@  arch_test_and_set_bit(unsigned int nr, volatile unsigned long *p)
 	unsigned long mask = BIT_MASK(nr);
 
 	p += BIT_WORD(nr);
-	if (READ_ONCE(*p) & mask)
-		return 1;
-
 	old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p);
 	return !!(old & mask);
 }
@@ -53,9 +50,6 @@  arch_test_and_clear_bit(unsigned int nr, volatile unsigned long *p)
 	unsigned long mask = BIT_MASK(nr);
 
 	p += BIT_WORD(nr);
-	if (!(READ_ONCE(*p) & mask))
-		return 0;
-
 	old = arch_atomic_long_fetch_andnot(mask, (atomic_long_t *)p);
 	return !!(old & mask);
 }