diff mbox series

[v2] buffer: Fix I/O error due to ARM read-after-read hazard

Message ID 20191112130244.16630-1-vincent.whitchurch@axis.com (mailing list archive)
State New, archived
Headers show
Series [v2] buffer: Fix I/O error due to ARM read-after-read hazard | expand

Commit Message

Vincent Whitchurch Nov. 12, 2019, 1:02 p.m. UTC
On my dual-core ARM Cortex-A9, reading from squashfs (over
dm-verity/ubi/mtd) in a loop for hundreds of hours invariably results in
a read failure in squashfs_read_data().  The errors occur because the
buffer_uptodate() check fails after wait_on_buffer().  Further debugging
shows that the bh was in fact uptodate and that there is no actual I/O
error in the lower layers.

The problem is caused by the read-after-read hazards in the ARM
Cortex-A9 MPCore (erratum #761319, see [1]).  The code generated by the
compiler for the combination of the wait_on_buffer() and
buffer_uptodate() calls reads the flags value twice from memory (see the
excerpt of the assembly below).  The new value of the BH_Lock flag is
seen but the new value of BH_Uptodate is not even though both the bits
are read from the same memory location.

 27c:	9d08      	ldr	r5, [sp, #32]
 27e:	2400      	movs	r4, #0
 280:	e006      	b.n	290 <squashfs_read_data+0x290>
 282:	6803      	ldr	r3, [r0, #0]
 284:	07da      	lsls	r2, r3, #31
 286:	f140 810d 	bpl.w	4a4 <squashfs_read_data+0x4a4>
 28a:	3401      	adds	r4, #1
 28c:	42bc      	cmp	r4, r7
 28e:	da08      	bge.n	2a2 <squashfs_read_data+0x2a2>
 290:	f855 0f04 	ldr.w	r0, [r5, #4]!
 294:	6803      	ldr	r3, [r0, #0]
 296:	0759      	lsls	r1, r3, #29
 298:	d5f3      	bpl.n	282 <squashfs_read_data+0x282>
 29a:	f7ff fffe 	bl	0 <__wait_on_buffer>

Work around this problem by adding a DMB between the two reads of
bh->flags, as recommended in the ARM document.  With this barrier, no
failures have been seen in more than 5000 hours of the same test.

[1] http://infocenter.arm.com/help/topic/com.arm.doc.uan0004a/UAN0004A_a9_read_read.pdf

Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
---
v2: Reword commit message.

 include/linux/buffer_head.h | 8 ++++++++
 1 file changed, 8 insertions(+)

Comments

Catalin Marinas Nov. 12, 2019, 4:08 p.m. UTC | #1
On Tue, Nov 12, 2019 at 02:02:44PM +0100, Vincent Whitchurch wrote:
> On my dual-core ARM Cortex-A9, reading from squashfs (over
> dm-verity/ubi/mtd) in a loop for hundreds of hours invariably results in
> a read failure in squashfs_read_data().  The errors occur because the
> buffer_uptodate() check fails after wait_on_buffer().  Further debugging
> shows that the bh was in fact uptodate and that there is no actual I/O
> error in the lower layers.
> 
> The problem is caused by the read-after-read hazards in the ARM
> Cortex-A9 MPCore (erratum #761319, see [1]).  The code generated by the
> compiler for the combination of the wait_on_buffer() and
> buffer_uptodate() calls reads the flags value twice from memory (see the
> excerpt of the assembly below).  The new value of the BH_Lock flag is
> seen but the new value of BH_Uptodate is not even though both the bits
> are read from the same memory location.
> 
>  27c:	9d08      	ldr	r5, [sp, #32]
>  27e:	2400      	movs	r4, #0
>  280:	e006      	b.n	290 <squashfs_read_data+0x290>
>  282:	6803      	ldr	r3, [r0, #0]
>  284:	07da      	lsls	r2, r3, #31
>  286:	f140 810d 	bpl.w	4a4 <squashfs_read_data+0x4a4>
>  28a:	3401      	adds	r4, #1
>  28c:	42bc      	cmp	r4, r7
>  28e:	da08      	bge.n	2a2 <squashfs_read_data+0x2a2>
>  290:	f855 0f04 	ldr.w	r0, [r5, #4]!
>  294:	6803      	ldr	r3, [r0, #0]
>  296:	0759      	lsls	r1, r3, #29
>  298:	d5f3      	bpl.n	282 <squashfs_read_data+0x282>
>  29a:	f7ff fffe 	bl	0 <__wait_on_buffer>
> 
> Work around this problem by adding a DMB between the two reads of
> bh->flags, as recommended in the ARM document.  With this barrier, no
> failures have been seen in more than 5000 hours of the same test.
> 
> [1] http://infocenter.arm.com/help/topic/com.arm.doc.uan0004a/UAN0004A_a9_read_read.pdf

I thought we were going to fix the compiler. I found an old thread here:

https://gcc.gnu.org/ml/gcc-patches/2014-06/msg00714.html

Also cc'ing Richard Earnshaw as he may been involved in the gcc
discussion at the time.

While you can add some barrier here, there may be other cases where this
can go wrong.
Linus Torvalds Nov. 12, 2019, 5:54 p.m. UTC | #2
On Tue, Nov 12, 2019 at 8:09 AM Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> While you can add some barrier here, there may be other cases where this
> can go wrong.

Yeah, that patch isn't going to be accepted. We don't add random arch
workarounds to core header files like that - particularly when it's
not at all clear that there aren't hundreds of other cases where that
cpu errata can trigger..

                Linus
Will Deacon Nov. 12, 2019, 6 p.m. UTC | #3
On Tue, Nov 12, 2019 at 04:08:57PM +0000, Catalin Marinas wrote:
> On Tue, Nov 12, 2019 at 02:02:44PM +0100, Vincent Whitchurch wrote:
> > On my dual-core ARM Cortex-A9, reading from squashfs (over
> > dm-verity/ubi/mtd) in a loop for hundreds of hours invariably results in
> > a read failure in squashfs_read_data().  The errors occur because the
> > buffer_uptodate() check fails after wait_on_buffer().  Further debugging
> > shows that the bh was in fact uptodate and that there is no actual I/O
> > error in the lower layers.
> > 
> > The problem is caused by the read-after-read hazards in the ARM
> > Cortex-A9 MPCore (erratum #761319, see [1]).  The code generated by the
> > compiler for the combination of the wait_on_buffer() and
> > buffer_uptodate() calls reads the flags value twice from memory (see the
> > excerpt of the assembly below).  The new value of the BH_Lock flag is
> > seen but the new value of BH_Uptodate is not even though both the bits
> > are read from the same memory location.
> > 
> >  27c:	9d08      	ldr	r5, [sp, #32]
> >  27e:	2400      	movs	r4, #0
> >  280:	e006      	b.n	290 <squashfs_read_data+0x290>
> >  282:	6803      	ldr	r3, [r0, #0]
> >  284:	07da      	lsls	r2, r3, #31
> >  286:	f140 810d 	bpl.w	4a4 <squashfs_read_data+0x4a4>
> >  28a:	3401      	adds	r4, #1
> >  28c:	42bc      	cmp	r4, r7
> >  28e:	da08      	bge.n	2a2 <squashfs_read_data+0x2a2>
> >  290:	f855 0f04 	ldr.w	r0, [r5, #4]!
> >  294:	6803      	ldr	r3, [r0, #0]
> >  296:	0759      	lsls	r1, r3, #29
> >  298:	d5f3      	bpl.n	282 <squashfs_read_data+0x282>
> >  29a:	f7ff fffe 	bl	0 <__wait_on_buffer>
> > 
> > Work around this problem by adding a DMB between the two reads of
> > bh->flags, as recommended in the ARM document.  With this barrier, no
> > failures have been seen in more than 5000 hours of the same test.
> > 
> > [1] http://infocenter.arm.com/help/topic/com.arm.doc.uan0004a/UAN0004A_a9_read_read.pdf
> 
> I thought we were going to fix the compiler. I found an old thread here:
> 
> https://gcc.gnu.org/ml/gcc-patches/2014-06/msg00714.html
> 
> Also cc'ing Richard Earnshaw as he may been involved in the gcc
> discussion at the time.
> 
> While you can add some barrier here, there may be other cases where this
> can go wrong.

Hmm, and afaict, even if the compiler was modified to emit LDREX instructions
for volatile loads, it wouldn't help in this case because test_bit() isn't
using READ_ONCE().

It's also slightly odd that the proposed patch makes the code look like:

	for (i = 0; i < b; i++) {
		if (buffer_locked(bh)) {
			__wait_on_buffer(bh);
			smp_rmb();
		}
		if (!buffer_uptodate(bh[i]))
			goto block_release;
	}

whereas there are other potential RAR orderings between buffer_locked()
and __wait_on_buffer() and also probably between successive iterations
of the loop.

So, really, the only way I see to solve this is for us to use READ_ONCE
consistently for all relaxed atomic loads (KCSAN is starting to tread on
this), and then to patch READ_ONCE to emit a DMB at runtime for arch/arm/
(maybe a static key would work if you can avoid the recursion).

I've already got patches at [1] to allow architectures to override
READ_ONCE, because Alpha needs to do something similar.

Will

[1] https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/log/?h=lto
Catalin Marinas Nov. 12, 2019, 6:22 p.m. UTC | #4
On Tue, Nov 12, 2019 at 06:00:35PM +0000, Will Deacon wrote:
> On Tue, Nov 12, 2019 at 04:08:57PM +0000, Catalin Marinas wrote:
> > On Tue, Nov 12, 2019 at 02:02:44PM +0100, Vincent Whitchurch wrote:
> > > On my dual-core ARM Cortex-A9, reading from squashfs (over
> > > dm-verity/ubi/mtd) in a loop for hundreds of hours invariably results in
> > > a read failure in squashfs_read_data().  The errors occur because the
> > > buffer_uptodate() check fails after wait_on_buffer().  Further debugging
> > > shows that the bh was in fact uptodate and that there is no actual I/O
> > > error in the lower layers.
> > > 
> > > The problem is caused by the read-after-read hazards in the ARM
> > > Cortex-A9 MPCore (erratum #761319, see [1]).  The code generated by the
> > > compiler for the combination of the wait_on_buffer() and
> > > buffer_uptodate() calls reads the flags value twice from memory (see the
> > > excerpt of the assembly below).  The new value of the BH_Lock flag is
> > > seen but the new value of BH_Uptodate is not even though both the bits
> > > are read from the same memory location.
> > > 
> > >  27c:	9d08      	ldr	r5, [sp, #32]
> > >  27e:	2400      	movs	r4, #0
> > >  280:	e006      	b.n	290 <squashfs_read_data+0x290>
> > >  282:	6803      	ldr	r3, [r0, #0]
> > >  284:	07da      	lsls	r2, r3, #31
> > >  286:	f140 810d 	bpl.w	4a4 <squashfs_read_data+0x4a4>
> > >  28a:	3401      	adds	r4, #1
> > >  28c:	42bc      	cmp	r4, r7
> > >  28e:	da08      	bge.n	2a2 <squashfs_read_data+0x2a2>
> > >  290:	f855 0f04 	ldr.w	r0, [r5, #4]!
> > >  294:	6803      	ldr	r3, [r0, #0]
> > >  296:	0759      	lsls	r1, r3, #29
> > >  298:	d5f3      	bpl.n	282 <squashfs_read_data+0x282>
> > >  29a:	f7ff fffe 	bl	0 <__wait_on_buffer>
> > > 
> > > Work around this problem by adding a DMB between the two reads of
> > > bh->flags, as recommended in the ARM document.  With this barrier, no
> > > failures have been seen in more than 5000 hours of the same test.
> > > 
> > > [1] http://infocenter.arm.com/help/topic/com.arm.doc.uan0004a/UAN0004A_a9_read_read.pdf
> > 
> > I thought we were going to fix the compiler. I found an old thread here:
> > 
> > https://gcc.gnu.org/ml/gcc-patches/2014-06/msg00714.html
> > 
> > Also cc'ing Richard Earnshaw as he may been involved in the gcc
> > discussion at the time.
> > 
> > While you can add some barrier here, there may be other cases where this
> > can go wrong.
> 
> Hmm, and afaict, even if the compiler was modified to emit LDREX instructions
> for volatile loads, it wouldn't help in this case because test_bit() isn't
> using READ_ONCE().

I think changing volatile accesses to LDREX in gcc wasn't acceptable
since they may read Device memory and not allowed on ARM.

> It's also slightly odd that the proposed patch makes the code look like:
> 
> 	for (i = 0; i < b; i++) {
> 		if (buffer_locked(bh)) {
> 			__wait_on_buffer(bh);
> 			smp_rmb();
> 		}

The proposed patch actually keeps smp_rmb() outside the 'if' block but
your point below still stands.

> 		if (!buffer_uptodate(bh[i]))
> 			goto block_release;
> 	}
> 
> whereas there are other potential RAR orderings between buffer_locked()
> and __wait_on_buffer() and also probably between successive iterations
> of the loop.
> 
> So, really, the only way I see to solve this is for us to use READ_ONCE
> consistently for all relaxed atomic loads (KCSAN is starting to tread on
> this), and then to patch READ_ONCE to emit a DMB at runtime for arch/arm/
> (maybe a static key would work if you can avoid the recursion).

OK, so this includes changing test_bit() to perform a READ_ONCE.
Linus Torvalds Nov. 12, 2019, 6:39 p.m. UTC | #5
On Tue, Nov 12, 2019 at 10:22 AM Catalin Marinas
<catalin.marinas@arm.com> wrote:
>
> OK, so this includes changing test_bit() to perform a READ_ONCE.

That's not going to happen.

              Linus
Will Deacon Nov. 13, 2019, 10:23 a.m. UTC | #6
On Tue, Nov 12, 2019 at 10:39:01AM -0800, Linus Torvalds wrote:
> On Tue, Nov 12, 2019 at 10:22 AM Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> >
> > OK, so this includes changing test_bit() to perform a READ_ONCE.
> 
> That's not going to happen.

Ok, I'll stick my neck out here, but if test_bit() is being used to read
a bitmap that is being concurrently modified (e.g. by set_bit() which boils
down to atomic_long_or()), then why isn't READ_ONCE() required? Right now,
test_bit takes a 'const volatile unsigned long *addr' argument, so I don't
see that you'll get a change in codegen except on alpha and, with this
erratum, arm32.

Will
Russell King (Oracle) Nov. 13, 2019, 10:31 a.m. UTC | #7
On Wed, Nov 13, 2019 at 10:23:58AM +0000, Will Deacon wrote:
> On Tue, Nov 12, 2019 at 10:39:01AM -0800, Linus Torvalds wrote:
> > On Tue, Nov 12, 2019 at 10:22 AM Catalin Marinas
> > <catalin.marinas@arm.com> wrote:
> > >
> > > OK, so this includes changing test_bit() to perform a READ_ONCE.
> > 
> > That's not going to happen.
> 
> Ok, I'll stick my neck out here, but if test_bit() is being used to read
> a bitmap that is being concurrently modified (e.g. by set_bit() which boils
> down to atomic_long_or()), then why isn't READ_ONCE() required? Right now,
> test_bit takes a 'const volatile unsigned long *addr' argument, so I don't
> see that you'll get a change in codegen except on alpha and, with this
> erratum, arm32.

I'm not entirely clear what you're suggesting, so I'll just pick the
scenario that I think you're talking about - but I'm not sure it's the
one you're intending.

Using test_bit() in one thread and set_bit() on the same bit in another
thread without locking is going to be racy by definition.  It's entirely
possible for:

	Thread 1			Thread 2
	bit = test_bit(...);
					set_bit(...);
	/* use bit */

and here, bit == 0 but the bit has been set by thread 2.  Use of the
result from test_bit() is inherently a non-atomic operation.

This is why we have test_and_set_bit() and friends that atomically test
that a bit is clear before setting it.  Where this is especially
important is for some filesystems, as they use test_and_xxx_bit() to
manage their allocation bitmaps.
Will Deacon Nov. 13, 2019, 10:49 a.m. UTC | #8
Hi Russell,

On Wed, Nov 13, 2019 at 10:31:51AM +0000, Russell King - ARM Linux admin wrote:
> On Wed, Nov 13, 2019 at 10:23:58AM +0000, Will Deacon wrote:
> > On Tue, Nov 12, 2019 at 10:39:01AM -0800, Linus Torvalds wrote:
> > > On Tue, Nov 12, 2019 at 10:22 AM Catalin Marinas
> > > <catalin.marinas@arm.com> wrote:
> > > >
> > > > OK, so this includes changing test_bit() to perform a READ_ONCE.
> > > 
> > > That's not going to happen.
> > 
> > Ok, I'll stick my neck out here, but if test_bit() is being used to read
> > a bitmap that is being concurrently modified (e.g. by set_bit() which boils
> > down to atomic_long_or()), then why isn't READ_ONCE() required? Right now,
> > test_bit takes a 'const volatile unsigned long *addr' argument, so I don't
> > see that you'll get a change in codegen except on alpha and, with this
> > erratum, arm32.
> 
> I'm not entirely clear what you're suggesting, so I'll just pick the
> scenario that I think you're talking about - but I'm not sure it's the
> one you're intending.
> 
> Using test_bit() in one thread and set_bit() on the same bit in another
> thread without locking is going to be racy by definition.  It's entirely
> possible for:
> 
> 	Thread 1			Thread 2
> 	bit = test_bit(...);
> 					set_bit(...);
> 	/* use bit */
> 
> and here, bit == 0 but the bit has been set by thread 2.  Use of the
> result from test_bit() is inherently a non-atomic operation.

I think it's atomic in the same way that atomic_read() is atomic (which is
typically defined using READ_ONCE()).

> This is why we have test_and_set_bit() and friends that atomically test
> that a bit is clear before setting it.  Where this is especially
> important is for some filesystems, as they use test_and_xxx_bit() to
> manage their allocation bitmaps.

Agreed, but what we don't want is something like:

	Thread 1			Thread 2
					set_bit(...); // bit is now 1
	test_bit(...); // returns 1
	test_bit(...); // returns 0

which is what can happen due to this erratum. It's generally good practice
to use READ_ONCE() when reading something which can be updated concurrently
because:

	* It ensures that the value is (re-)loaded from memory

	* It prevents the compiler from performing harmful optimisations,
	  such as merging or tearing (although in this case I suspect
	  these are ok because we're dealing with a single bit)

	* On Alpha, it gives you a barrier so that dependency ordering
	  can be relied upon from the load

	* It keeps KCSAN happy

I think the current definition of test_bit() gives you the first two by
casting to volatile explicitly, but not the second two.

So I'm mainly curious as to the disconnect between my thinking and Linus's
"That's not going to happen" comment, given that it shouldn't have an
impact on architectures that don't need magic here.

Will
Linus Torvalds Nov. 13, 2019, 4:36 p.m. UTC | #9
On Wed, Nov 13, 2019 at 2:24 AM Will Deacon <will@kernel.org> wrote:
>
> Ok, I'll stick my neck out here, but if test_bit() is being used to read
> a bitmap that is being concurrently modified (e.g. by set_bit() which boils
> down to atomic_long_or()), then why isn't READ_ONCE() required? Right now,
> test_bit takes a 'const volatile unsigned long *addr' argument, so I don't
> see that you'll get a change in codegen except on alpha and, with this
> erratum, arm32.

You're right. We've moved back to actually having it volatile (or
possibly never got away from it). My bad.

At one point, we tried very hard to make test_bit() perform much
better when you tested multiple bits, and I thought we ended up having
that merged and didn't want to regress. But apparently we never did
that, or it got undone.

test_bit() is a very unfortunate interface, in that we actually use it
in some situations where we _really_ would want to merge reads (not
split them, but merge them). There are several cases where we do
constant test-bits on the same word, and don't care about ordering.
Things like thread flags etc.

It's explicitly not ordered, so *merging* reads would be right and
lovely, it's the "don't read twice" that is bad. But we have no way to
tell the compiler that ;(

Anyway, READ_ONCE() is ok by me when I look at the code, because as
you say, it's already volatile (like my memory ;).

           Linus
Linus Torvalds Nov. 13, 2019, 4:40 p.m. UTC | #10
On Wed, Nov 13, 2019 at 8:36 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> test_bit() is a very unfortunate interface, in that we actually use it
> in some situations where we _really_ would want to merge reads (not
> split them, but merge them). There are several cases where we do
> constant test-bits on the same word, and don't care about ordering.
> Things like thread flags etc.

Side note: test_bit() really isn't good for locking in the first
place. The fact that the buffer heads use it for that is very
non-optimal indeed.

Particularly for testing something like "is this buffer uptodate", it
should be a "smp_load_acquire()", not a test_bit(). And READ_ONCE()
doesn't really help.

So in many ways it would be much better to make the buffer head stuff
use proper ordered accesses. But I suspect nobody is going to ever
want to go through that pain for a legacy thing, so the papering it
over with READ_ONCE() and a ugly ARM hw erratum hack is probably the
best we'll do..

                Linus
Herbert Xu Nov. 14, 2019, 1:28 p.m. UTC | #11
Will Deacon <will@kernel.org> wrote:
>
> which is what can happen due to this erratum. It's generally good practice
> to use READ_ONCE() when reading something which can be updated concurrently
> because:
> 
>        * It ensures that the value is (re-)loaded from memory
> 
>        * It prevents the compiler from performing harmful optimisations,
>          such as merging or tearing (although in this case I suspect
>          these are ok because we're dealing with a single bit)
> 
>        * On Alpha, it gives you a barrier so that dependency ordering
>          can be relied upon from the load

The Alpha barrier matters for pointers, how could it make a
difference for individual bits?

Cheers,
Will Deacon Nov. 20, 2019, 7:18 p.m. UTC | #12
On Thu, Nov 14, 2019 at 09:28:48PM +0800, Herbert Xu wrote:
> Will Deacon <will@kernel.org> wrote:
> >
> > which is what can happen due to this erratum. It's generally good practice
> > to use READ_ONCE() when reading something which can be updated concurrently
> > because:
> > 
> >        * It ensures that the value is (re-)loaded from memory
> > 
> >        * It prevents the compiler from performing harmful optimisations,
> >          such as merging or tearing (although in this case I suspect
> >          these are ok because we're dealing with a single bit)
> > 
> >        * On Alpha, it gives you a barrier so that dependency ordering
> >          can be relied upon from the load
> 
> The Alpha barrier matters for pointers, how could it make a
> difference for individual bits?

I guess you could use the result of test_bit to index into an array or
something?

Will
Herbert Xu Nov. 21, 2019, 1:25 a.m. UTC | #13
On Wed, Nov 20, 2019 at 07:18:40PM +0000, Will Deacon wrote:
>
> > The Alpha barrier matters for pointers, how could it make a
> > difference for individual bits?
> 
> I guess you could use the result of test_bit to index into an array or
> something?

Can Alpha Assembly even do this without using a branch?

Cheers,
Will Deacon Nov. 21, 2019, 4:53 p.m. UTC | #14
On Thu, Nov 21, 2019 at 09:25:33AM +0800, Herbert Xu wrote:
> On Wed, Nov 20, 2019 at 07:18:40PM +0000, Will Deacon wrote:
> >
> > > The Alpha barrier matters for pointers, how could it make a
> > > difference for individual bits?
> > 
> > I guess you could use the result of test_bit to index into an array or
> > something?
> 
> Can Alpha Assembly even do this without using a branch?

Don't see why not: you can add the base address to the scaled result
of test_bit and use that as the address register into a load instruction.

Will
diff mbox series

Patch

diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 7b73ef7f902d..4ef909a91f8c 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -352,6 +352,14 @@  static inline void wait_on_buffer(struct buffer_head *bh)
 	might_sleep();
 	if (buffer_locked(bh))
 		__wait_on_buffer(bh);
+
+#ifdef CONFIG_ARM
+	/*
+	 * Work around ARM Cortex-A9 MPcore Read-after-Read Hazards (erratum
+	 * 761319).
+	 */
+	smp_rmb();
+#endif
 }
 
 static inline int trylock_buffer(struct buffer_head *bh)