Message ID | 20191112130244.16630-1-vincent.whitchurch@axis.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v2] buffer: Fix I/O error due to ARM read-after-read hazard | expand |
On Tue, Nov 12, 2019 at 02:02:44PM +0100, Vincent Whitchurch wrote: > On my dual-core ARM Cortex-A9, reading from squashfs (over > dm-verity/ubi/mtd) in a loop for hundreds of hours invariably results in > a read failure in squashfs_read_data(). The errors occur because the > buffer_uptodate() check fails after wait_on_buffer(). Further debugging > shows that the bh was in fact uptodate and that there is no actual I/O > error in the lower layers. > > The problem is caused by the read-after-read hazards in the ARM > Cortex-A9 MPCore (erratum #761319, see [1]). The code generated by the > compiler for the combination of the wait_on_buffer() and > buffer_uptodate() calls reads the flags value twice from memory (see the > excerpt of the assembly below). The new value of the BH_Lock flag is > seen but the new value of BH_Uptodate is not even though both the bits > are read from the same memory location. > > 27c: 9d08 ldr r5, [sp, #32] > 27e: 2400 movs r4, #0 > 280: e006 b.n 290 <squashfs_read_data+0x290> > 282: 6803 ldr r3, [r0, #0] > 284: 07da lsls r2, r3, #31 > 286: f140 810d bpl.w 4a4 <squashfs_read_data+0x4a4> > 28a: 3401 adds r4, #1 > 28c: 42bc cmp r4, r7 > 28e: da08 bge.n 2a2 <squashfs_read_data+0x2a2> > 290: f855 0f04 ldr.w r0, [r5, #4]! > 294: 6803 ldr r3, [r0, #0] > 296: 0759 lsls r1, r3, #29 > 298: d5f3 bpl.n 282 <squashfs_read_data+0x282> > 29a: f7ff fffe bl 0 <__wait_on_buffer> > > Work around this problem by adding a DMB between the two reads of > bh->flags, as recommended in the ARM document. With this barrier, no > failures have been seen in more than 5000 hours of the same test. > > [1] http://infocenter.arm.com/help/topic/com.arm.doc.uan0004a/UAN0004A_a9_read_read.pdf I thought we were going to fix the compiler. I found an old thread here: https://gcc.gnu.org/ml/gcc-patches/2014-06/msg00714.html Also cc'ing Richard Earnshaw as he may been involved in the gcc discussion at the time. While you can add some barrier here, there may be other cases where this can go wrong.
On Tue, Nov 12, 2019 at 8:09 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > While you can add some barrier here, there may be other cases where this > can go wrong. Yeah, that patch isn't going to be accepted. We don't add random arch workarounds to core header files like that - particularly when it's not at all clear that there aren't hundreds of other cases where that cpu errata can trigger.. Linus
On Tue, Nov 12, 2019 at 04:08:57PM +0000, Catalin Marinas wrote: > On Tue, Nov 12, 2019 at 02:02:44PM +0100, Vincent Whitchurch wrote: > > On my dual-core ARM Cortex-A9, reading from squashfs (over > > dm-verity/ubi/mtd) in a loop for hundreds of hours invariably results in > > a read failure in squashfs_read_data(). The errors occur because the > > buffer_uptodate() check fails after wait_on_buffer(). Further debugging > > shows that the bh was in fact uptodate and that there is no actual I/O > > error in the lower layers. > > > > The problem is caused by the read-after-read hazards in the ARM > > Cortex-A9 MPCore (erratum #761319, see [1]). The code generated by the > > compiler for the combination of the wait_on_buffer() and > > buffer_uptodate() calls reads the flags value twice from memory (see the > > excerpt of the assembly below). The new value of the BH_Lock flag is > > seen but the new value of BH_Uptodate is not even though both the bits > > are read from the same memory location. > > > > 27c: 9d08 ldr r5, [sp, #32] > > 27e: 2400 movs r4, #0 > > 280: e006 b.n 290 <squashfs_read_data+0x290> > > 282: 6803 ldr r3, [r0, #0] > > 284: 07da lsls r2, r3, #31 > > 286: f140 810d bpl.w 4a4 <squashfs_read_data+0x4a4> > > 28a: 3401 adds r4, #1 > > 28c: 42bc cmp r4, r7 > > 28e: da08 bge.n 2a2 <squashfs_read_data+0x2a2> > > 290: f855 0f04 ldr.w r0, [r5, #4]! > > 294: 6803 ldr r3, [r0, #0] > > 296: 0759 lsls r1, r3, #29 > > 298: d5f3 bpl.n 282 <squashfs_read_data+0x282> > > 29a: f7ff fffe bl 0 <__wait_on_buffer> > > > > Work around this problem by adding a DMB between the two reads of > > bh->flags, as recommended in the ARM document. With this barrier, no > > failures have been seen in more than 5000 hours of the same test. > > > > [1] http://infocenter.arm.com/help/topic/com.arm.doc.uan0004a/UAN0004A_a9_read_read.pdf > > I thought we were going to fix the compiler. I found an old thread here: > > https://gcc.gnu.org/ml/gcc-patches/2014-06/msg00714.html > > Also cc'ing Richard Earnshaw as he may been involved in the gcc > discussion at the time. > > While you can add some barrier here, there may be other cases where this > can go wrong. Hmm, and afaict, even if the compiler was modified to emit LDREX instructions for volatile loads, it wouldn't help in this case because test_bit() isn't using READ_ONCE(). It's also slightly odd that the proposed patch makes the code look like: for (i = 0; i < b; i++) { if (buffer_locked(bh)) { __wait_on_buffer(bh); smp_rmb(); } if (!buffer_uptodate(bh[i])) goto block_release; } whereas there are other potential RAR orderings between buffer_locked() and __wait_on_buffer() and also probably between successive iterations of the loop. So, really, the only way I see to solve this is for us to use READ_ONCE consistently for all relaxed atomic loads (KCSAN is starting to tread on this), and then to patch READ_ONCE to emit a DMB at runtime for arch/arm/ (maybe a static key would work if you can avoid the recursion). I've already got patches at [1] to allow architectures to override READ_ONCE, because Alpha needs to do something similar. Will [1] https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/log/?h=lto
On Tue, Nov 12, 2019 at 06:00:35PM +0000, Will Deacon wrote: > On Tue, Nov 12, 2019 at 04:08:57PM +0000, Catalin Marinas wrote: > > On Tue, Nov 12, 2019 at 02:02:44PM +0100, Vincent Whitchurch wrote: > > > On my dual-core ARM Cortex-A9, reading from squashfs (over > > > dm-verity/ubi/mtd) in a loop for hundreds of hours invariably results in > > > a read failure in squashfs_read_data(). The errors occur because the > > > buffer_uptodate() check fails after wait_on_buffer(). Further debugging > > > shows that the bh was in fact uptodate and that there is no actual I/O > > > error in the lower layers. > > > > > > The problem is caused by the read-after-read hazards in the ARM > > > Cortex-A9 MPCore (erratum #761319, see [1]). The code generated by the > > > compiler for the combination of the wait_on_buffer() and > > > buffer_uptodate() calls reads the flags value twice from memory (see the > > > excerpt of the assembly below). The new value of the BH_Lock flag is > > > seen but the new value of BH_Uptodate is not even though both the bits > > > are read from the same memory location. > > > > > > 27c: 9d08 ldr r5, [sp, #32] > > > 27e: 2400 movs r4, #0 > > > 280: e006 b.n 290 <squashfs_read_data+0x290> > > > 282: 6803 ldr r3, [r0, #0] > > > 284: 07da lsls r2, r3, #31 > > > 286: f140 810d bpl.w 4a4 <squashfs_read_data+0x4a4> > > > 28a: 3401 adds r4, #1 > > > 28c: 42bc cmp r4, r7 > > > 28e: da08 bge.n 2a2 <squashfs_read_data+0x2a2> > > > 290: f855 0f04 ldr.w r0, [r5, #4]! > > > 294: 6803 ldr r3, [r0, #0] > > > 296: 0759 lsls r1, r3, #29 > > > 298: d5f3 bpl.n 282 <squashfs_read_data+0x282> > > > 29a: f7ff fffe bl 0 <__wait_on_buffer> > > > > > > Work around this problem by adding a DMB between the two reads of > > > bh->flags, as recommended in the ARM document. With this barrier, no > > > failures have been seen in more than 5000 hours of the same test. > > > > > > [1] http://infocenter.arm.com/help/topic/com.arm.doc.uan0004a/UAN0004A_a9_read_read.pdf > > > > I thought we were going to fix the compiler. I found an old thread here: > > > > https://gcc.gnu.org/ml/gcc-patches/2014-06/msg00714.html > > > > Also cc'ing Richard Earnshaw as he may been involved in the gcc > > discussion at the time. > > > > While you can add some barrier here, there may be other cases where this > > can go wrong. > > Hmm, and afaict, even if the compiler was modified to emit LDREX instructions > for volatile loads, it wouldn't help in this case because test_bit() isn't > using READ_ONCE(). I think changing volatile accesses to LDREX in gcc wasn't acceptable since they may read Device memory and not allowed on ARM. > It's also slightly odd that the proposed patch makes the code look like: > > for (i = 0; i < b; i++) { > if (buffer_locked(bh)) { > __wait_on_buffer(bh); > smp_rmb(); > } The proposed patch actually keeps smp_rmb() outside the 'if' block but your point below still stands. > if (!buffer_uptodate(bh[i])) > goto block_release; > } > > whereas there are other potential RAR orderings between buffer_locked() > and __wait_on_buffer() and also probably between successive iterations > of the loop. > > So, really, the only way I see to solve this is for us to use READ_ONCE > consistently for all relaxed atomic loads (KCSAN is starting to tread on > this), and then to patch READ_ONCE to emit a DMB at runtime for arch/arm/ > (maybe a static key would work if you can avoid the recursion). OK, so this includes changing test_bit() to perform a READ_ONCE.
On Tue, Nov 12, 2019 at 10:22 AM Catalin Marinas <catalin.marinas@arm.com> wrote: > > OK, so this includes changing test_bit() to perform a READ_ONCE. That's not going to happen. Linus
On Tue, Nov 12, 2019 at 10:39:01AM -0800, Linus Torvalds wrote: > On Tue, Nov 12, 2019 at 10:22 AM Catalin Marinas > <catalin.marinas@arm.com> wrote: > > > > OK, so this includes changing test_bit() to perform a READ_ONCE. > > That's not going to happen. Ok, I'll stick my neck out here, but if test_bit() is being used to read a bitmap that is being concurrently modified (e.g. by set_bit() which boils down to atomic_long_or()), then why isn't READ_ONCE() required? Right now, test_bit takes a 'const volatile unsigned long *addr' argument, so I don't see that you'll get a change in codegen except on alpha and, with this erratum, arm32. Will
On Wed, Nov 13, 2019 at 10:23:58AM +0000, Will Deacon wrote: > On Tue, Nov 12, 2019 at 10:39:01AM -0800, Linus Torvalds wrote: > > On Tue, Nov 12, 2019 at 10:22 AM Catalin Marinas > > <catalin.marinas@arm.com> wrote: > > > > > > OK, so this includes changing test_bit() to perform a READ_ONCE. > > > > That's not going to happen. > > Ok, I'll stick my neck out here, but if test_bit() is being used to read > a bitmap that is being concurrently modified (e.g. by set_bit() which boils > down to atomic_long_or()), then why isn't READ_ONCE() required? Right now, > test_bit takes a 'const volatile unsigned long *addr' argument, so I don't > see that you'll get a change in codegen except on alpha and, with this > erratum, arm32. I'm not entirely clear what you're suggesting, so I'll just pick the scenario that I think you're talking about - but I'm not sure it's the one you're intending. Using test_bit() in one thread and set_bit() on the same bit in another thread without locking is going to be racy by definition. It's entirely possible for: Thread 1 Thread 2 bit = test_bit(...); set_bit(...); /* use bit */ and here, bit == 0 but the bit has been set by thread 2. Use of the result from test_bit() is inherently a non-atomic operation. This is why we have test_and_set_bit() and friends that atomically test that a bit is clear before setting it. Where this is especially important is for some filesystems, as they use test_and_xxx_bit() to manage their allocation bitmaps.
Hi Russell, On Wed, Nov 13, 2019 at 10:31:51AM +0000, Russell King - ARM Linux admin wrote: > On Wed, Nov 13, 2019 at 10:23:58AM +0000, Will Deacon wrote: > > On Tue, Nov 12, 2019 at 10:39:01AM -0800, Linus Torvalds wrote: > > > On Tue, Nov 12, 2019 at 10:22 AM Catalin Marinas > > > <catalin.marinas@arm.com> wrote: > > > > > > > > OK, so this includes changing test_bit() to perform a READ_ONCE. > > > > > > That's not going to happen. > > > > Ok, I'll stick my neck out here, but if test_bit() is being used to read > > a bitmap that is being concurrently modified (e.g. by set_bit() which boils > > down to atomic_long_or()), then why isn't READ_ONCE() required? Right now, > > test_bit takes a 'const volatile unsigned long *addr' argument, so I don't > > see that you'll get a change in codegen except on alpha and, with this > > erratum, arm32. > > I'm not entirely clear what you're suggesting, so I'll just pick the > scenario that I think you're talking about - but I'm not sure it's the > one you're intending. > > Using test_bit() in one thread and set_bit() on the same bit in another > thread without locking is going to be racy by definition. It's entirely > possible for: > > Thread 1 Thread 2 > bit = test_bit(...); > set_bit(...); > /* use bit */ > > and here, bit == 0 but the bit has been set by thread 2. Use of the > result from test_bit() is inherently a non-atomic operation. I think it's atomic in the same way that atomic_read() is atomic (which is typically defined using READ_ONCE()). > This is why we have test_and_set_bit() and friends that atomically test > that a bit is clear before setting it. Where this is especially > important is for some filesystems, as they use test_and_xxx_bit() to > manage their allocation bitmaps. Agreed, but what we don't want is something like: Thread 1 Thread 2 set_bit(...); // bit is now 1 test_bit(...); // returns 1 test_bit(...); // returns 0 which is what can happen due to this erratum. It's generally good practice to use READ_ONCE() when reading something which can be updated concurrently because: * It ensures that the value is (re-)loaded from memory * It prevents the compiler from performing harmful optimisations, such as merging or tearing (although in this case I suspect these are ok because we're dealing with a single bit) * On Alpha, it gives you a barrier so that dependency ordering can be relied upon from the load * It keeps KCSAN happy I think the current definition of test_bit() gives you the first two by casting to volatile explicitly, but not the second two. So I'm mainly curious as to the disconnect between my thinking and Linus's "That's not going to happen" comment, given that it shouldn't have an impact on architectures that don't need magic here. Will
On Wed, Nov 13, 2019 at 2:24 AM Will Deacon <will@kernel.org> wrote: > > Ok, I'll stick my neck out here, but if test_bit() is being used to read > a bitmap that is being concurrently modified (e.g. by set_bit() which boils > down to atomic_long_or()), then why isn't READ_ONCE() required? Right now, > test_bit takes a 'const volatile unsigned long *addr' argument, so I don't > see that you'll get a change in codegen except on alpha and, with this > erratum, arm32. You're right. We've moved back to actually having it volatile (or possibly never got away from it). My bad. At one point, we tried very hard to make test_bit() perform much better when you tested multiple bits, and I thought we ended up having that merged and didn't want to regress. But apparently we never did that, or it got undone. test_bit() is a very unfortunate interface, in that we actually use it in some situations where we _really_ would want to merge reads (not split them, but merge them). There are several cases where we do constant test-bits on the same word, and don't care about ordering. Things like thread flags etc. It's explicitly not ordered, so *merging* reads would be right and lovely, it's the "don't read twice" that is bad. But we have no way to tell the compiler that ;( Anyway, READ_ONCE() is ok by me when I look at the code, because as you say, it's already volatile (like my memory ;). Linus
On Wed, Nov 13, 2019 at 8:36 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > test_bit() is a very unfortunate interface, in that we actually use it > in some situations where we _really_ would want to merge reads (not > split them, but merge them). There are several cases where we do > constant test-bits on the same word, and don't care about ordering. > Things like thread flags etc. Side note: test_bit() really isn't good for locking in the first place. The fact that the buffer heads use it for that is very non-optimal indeed. Particularly for testing something like "is this buffer uptodate", it should be a "smp_load_acquire()", not a test_bit(). And READ_ONCE() doesn't really help. So in many ways it would be much better to make the buffer head stuff use proper ordered accesses. But I suspect nobody is going to ever want to go through that pain for a legacy thing, so the papering it over with READ_ONCE() and a ugly ARM hw erratum hack is probably the best we'll do.. Linus
Will Deacon <will@kernel.org> wrote: > > which is what can happen due to this erratum. It's generally good practice > to use READ_ONCE() when reading something which can be updated concurrently > because: > > * It ensures that the value is (re-)loaded from memory > > * It prevents the compiler from performing harmful optimisations, > such as merging or tearing (although in this case I suspect > these are ok because we're dealing with a single bit) > > * On Alpha, it gives you a barrier so that dependency ordering > can be relied upon from the load The Alpha barrier matters for pointers, how could it make a difference for individual bits? Cheers,
On Thu, Nov 14, 2019 at 09:28:48PM +0800, Herbert Xu wrote: > Will Deacon <will@kernel.org> wrote: > > > > which is what can happen due to this erratum. It's generally good practice > > to use READ_ONCE() when reading something which can be updated concurrently > > because: > > > > * It ensures that the value is (re-)loaded from memory > > > > * It prevents the compiler from performing harmful optimisations, > > such as merging or tearing (although in this case I suspect > > these are ok because we're dealing with a single bit) > > > > * On Alpha, it gives you a barrier so that dependency ordering > > can be relied upon from the load > > The Alpha barrier matters for pointers, how could it make a > difference for individual bits? I guess you could use the result of test_bit to index into an array or something? Will
On Wed, Nov 20, 2019 at 07:18:40PM +0000, Will Deacon wrote: > > > The Alpha barrier matters for pointers, how could it make a > > difference for individual bits? > > I guess you could use the result of test_bit to index into an array or > something? Can Alpha Assembly even do this without using a branch? Cheers,
On Thu, Nov 21, 2019 at 09:25:33AM +0800, Herbert Xu wrote: > On Wed, Nov 20, 2019 at 07:18:40PM +0000, Will Deacon wrote: > > > > > The Alpha barrier matters for pointers, how could it make a > > > difference for individual bits? > > > > I guess you could use the result of test_bit to index into an array or > > something? > > Can Alpha Assembly even do this without using a branch? Don't see why not: you can add the base address to the scaled result of test_bit and use that as the address register into a load instruction. Will
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h index 7b73ef7f902d..4ef909a91f8c 100644 --- a/include/linux/buffer_head.h +++ b/include/linux/buffer_head.h @@ -352,6 +352,14 @@ static inline void wait_on_buffer(struct buffer_head *bh) might_sleep(); if (buffer_locked(bh)) __wait_on_buffer(bh); + +#ifdef CONFIG_ARM + /* + * Work around ARM Cortex-A9 MPcore Read-after-Read Hazards (erratum + * 761319). + */ + smp_rmb(); +#endif } static inline int trylock_buffer(struct buffer_head *bh)
On my dual-core ARM Cortex-A9, reading from squashfs (over dm-verity/ubi/mtd) in a loop for hundreds of hours invariably results in a read failure in squashfs_read_data(). The errors occur because the buffer_uptodate() check fails after wait_on_buffer(). Further debugging shows that the bh was in fact uptodate and that there is no actual I/O error in the lower layers. The problem is caused by the read-after-read hazards in the ARM Cortex-A9 MPCore (erratum #761319, see [1]). The code generated by the compiler for the combination of the wait_on_buffer() and buffer_uptodate() calls reads the flags value twice from memory (see the excerpt of the assembly below). The new value of the BH_Lock flag is seen but the new value of BH_Uptodate is not even though both the bits are read from the same memory location. 27c: 9d08 ldr r5, [sp, #32] 27e: 2400 movs r4, #0 280: e006 b.n 290 <squashfs_read_data+0x290> 282: 6803 ldr r3, [r0, #0] 284: 07da lsls r2, r3, #31 286: f140 810d bpl.w 4a4 <squashfs_read_data+0x4a4> 28a: 3401 adds r4, #1 28c: 42bc cmp r4, r7 28e: da08 bge.n 2a2 <squashfs_read_data+0x2a2> 290: f855 0f04 ldr.w r0, [r5, #4]! 294: 6803 ldr r3, [r0, #0] 296: 0759 lsls r1, r3, #29 298: d5f3 bpl.n 282 <squashfs_read_data+0x282> 29a: f7ff fffe bl 0 <__wait_on_buffer> Work around this problem by adding a DMB between the two reads of bh->flags, as recommended in the ARM document. With this barrier, no failures have been seen in more than 5000 hours of the same test. [1] http://infocenter.arm.com/help/topic/com.arm.doc.uan0004a/UAN0004A_a9_read_read.pdf Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com> --- v2: Reword commit message. include/linux/buffer_head.h | 8 ++++++++ 1 file changed, 8 insertions(+)