Message ID | 20230820010219.12907-1-wqu@suse.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | fstests: fsstress: wait interrupted aio to finish | expand |
On 20/08/2023 09:02, Qu Wenruo wrote: > [BUG] > There is a very low chance to hit data csum mismatch (caught by scrub) > during test case btrfs/06[234567]. > > After some extra digging, it turns out that plain fsstress itself is > enough to cause the problem: > > ``` > workload() > { > mkfs.btrfs -f -m single -d single --csum sha256 $dev1 > /dev/null > mount $dev1 $mnt > > #$fsstress -p 10 -n 1000 -w -d $mnt > umount $mnt > btrfs check --check-data-csum $dev1 || fail > } > > runtime=1024 > for (( i = 0; i < $runtime; i++ )); do > echo "=== $i / $runtime ===" > workload > done > ``` > > Inside a VM which has only 6 cores, above script can trigger with 1/20 > possibility. > > [CAUSE] > Locally I got a much smaller workload to reproduce: > > $fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt -v > /tmp/fsstress > > With extra kernel trace_prinkt() on the buffered/direct writes. > > It turns out that the following direct write is always the cause: > > btrfs_do_write_iter: r/i=5/283 buffered fileoff=708608(709121) len=12288(7712) > > btrfs_do_write_iter: r/i=5/283 direct fileoff=8192(8192) len=73728(73728) <<<<< > > btrfs_do_write_iter: r/i=5/283 direct fileoff=589824(589824) len=16384(16384) > > With the involved byte number, it's easy to pin down the fsstress > opeartion: > > 0/31: writev d0/f3[285 2 0 0 296 1457078] [709121,8,964] 0 > 0/32: chown d0/f2 308134/1763236 0 > > 0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[285 2 308134 1763236 320 1457078] return 25, fallback to stat() > 0/33: awrite - io_getevents failed -4 <<<< > > 0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[285 2 308134 1763236 320 1457078] return 25, fallback to stat() > > Note the 0/33, when the data csum mismatch triggered, it always fail > with -4 (-EINTR). > > It looks like with lucky enough concurrency, we can get to the following > situation inside fsstress: > > Process A | Process B > -----------------------------------+--------------------------------------- > do_aio_rw() | > |- io_sumit(); | nit: typo: io_submit() > |- io_get_events(); | io_getevents(); > | Returned -EINTR, but IO hasn't | > | finished. | > `- free(buf); | malloc(); > | Got the same memory of @buf from > | thread A. > | Modify the memory > | Now the buffer is changed while > | still under IO > > This is the typical buffer modification during direct IO, which is going > to cause csum mismatch for btrfs, and btrfs properly detects it. > > This is the direct cause of the problem. > > However I'm still not 100% sure on why we got -EINTR for io_getevents(). > My previous tests shows that if I disable uring_write, then no more such > data corruption, thus I guess io_uring has something affecting aio? > > [FIX] > To fix the problem, we can just retry io_getevents() so that we can > properly wait for the IO. > > This prevents us to modify the IO buffer before writeback really > finishes. > > With this fixes, I can no longer reproduce the data corruption. > > Signed-off-by: Qu Wenruo <wqu@suse.com> > --- > ltp/fsstress.c | 10 +++++++++- > 1 file changed, 9 insertions(+), 1 deletion(-) > > diff --git a/ltp/fsstress.c b/ltp/fsstress.c > index 6641a525..f9f132bf 100644 > --- a/ltp/fsstress.c > +++ b/ltp/fsstress.c > @@ -2223,7 +2223,15 @@ do_aio_rw(opnum_t opno, long r, int flags) > procid, opno, iswrite ? "awrite" : "aread", e); > goto aio_out; > } > - if ((e = io_getevents(io_ctx, 1, 1, &event, NULL)) != 1) { > + do { > + e = io_getevents(io_ctx, 1, 1, &event, NULL); > + if (e == -EINTR) { > + if (v) > + printf("%d/%lld: %s - io_getevents interruped %d, retrying\n", > + procid, opno, iswrite ? "awrite" : "aread", e); > + } > + } while (e == -EINTR); I hope that after a few retries, the event status changes?; otherwise, we won't come out of this loop. Why not retry with a counter? Thanks, Anand > + if (e != 1) { > if (v) > printf("%d/%lld: %s - io_getevents failed %d\n", > procid, opno, iswrite ? "awrite" : "aread", e);
On 2023/8/21 16:45, Anand Jain wrote: > On 20/08/2023 09:02, Qu Wenruo wrote: >> [BUG] >> There is a very low chance to hit data csum mismatch (caught by scrub) >> during test case btrfs/06[234567]. >> >> After some extra digging, it turns out that plain fsstress itself is >> enough to cause the problem: >> >> ``` >> workload() >> { >> mkfs.btrfs -f -m single -d single --csum sha256 $dev1 > /dev/null >> mount $dev1 $mnt >> >> #$fsstress -p 10 -n 1000 -w -d $mnt >> umount $mnt >> btrfs check --check-data-csum $dev1 || fail >> } >> >> runtime=1024 >> for (( i = 0; i < $runtime; i++ )); do >> echo "=== $i / $runtime ===" >> workload >> done >> ``` >> >> Inside a VM which has only 6 cores, above script can trigger with 1/20 >> possibility. >> >> [CAUSE] >> Locally I got a much smaller workload to reproduce: >> >> $fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt -v > /tmp/fsstress >> >> With extra kernel trace_prinkt() on the buffered/direct writes. >> >> It turns out that the following direct write is always the cause: >> >> btrfs_do_write_iter: r/i=5/283 buffered fileoff=708608(709121) >> len=12288(7712) >> >> btrfs_do_write_iter: r/i=5/283 direct fileoff=8192(8192) >> len=73728(73728) <<<<< >> >> btrfs_do_write_iter: r/i=5/283 direct fileoff=589824(589824) >> len=16384(16384) >> >> With the involved byte number, it's easy to pin down the fsstress >> opeartion: >> >> 0/31: writev d0/f3[285 2 0 0 296 1457078] [709121,8,964] 0 >> 0/32: chown d0/f2 308134/1763236 0 >> >> 0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[285 2 308134 1763236 >> 320 1457078] return 25, fallback to stat() >> 0/33: awrite - io_getevents failed -4 <<<< >> >> 0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[285 2 308134 1763236 >> 320 1457078] return 25, fallback to stat() >> >> Note the 0/33, when the data csum mismatch triggered, it always fail >> with -4 (-EINTR). >> >> It looks like with lucky enough concurrency, we can get to the following >> situation inside fsstress: >> >> Process A | Process B >> >> -----------------------------------+--------------------------------------- >> do_aio_rw() | >> |- io_sumit(); | > > nit: typo: io_submit() > >> |- io_get_events(); | > > io_getevents(); > >> | Returned -EINTR, but IO hasn't | >> | finished. | >> `- free(buf); | malloc(); >> | Got the same memory of @buf from >> | thread A. >> | Modify the memory >> | Now the buffer is changed while >> | still under IO >> >> This is the typical buffer modification during direct IO, which is going >> to cause csum mismatch for btrfs, and btrfs properly detects it. >> >> This is the direct cause of the problem. >> >> However I'm still not 100% sure on why we got -EINTR for io_getevents(). >> My previous tests shows that if I disable uring_write, then no more such >> data corruption, thus I guess io_uring has something affecting aio? >> >> [FIX] >> To fix the problem, we can just retry io_getevents() so that we can >> properly wait for the IO. >> >> This prevents us to modify the IO buffer before writeback really >> finishes. >> >> With this fixes, I can no longer reproduce the data corruption. >> >> Signed-off-by: Qu Wenruo <wqu@suse.com> >> --- >> ltp/fsstress.c | 10 +++++++++- >> 1 file changed, 9 insertions(+), 1 deletion(-) >> >> diff --git a/ltp/fsstress.c b/ltp/fsstress.c >> index 6641a525..f9f132bf 100644 >> --- a/ltp/fsstress.c >> +++ b/ltp/fsstress.c >> @@ -2223,7 +2223,15 @@ do_aio_rw(opnum_t opno, long r, int flags) >> procid, opno, iswrite ? "awrite" : "aread", e); >> goto aio_out; >> } >> - if ((e = io_getevents(io_ctx, 1, 1, &event, NULL)) != 1) { >> + do { >> + e = io_getevents(io_ctx, 1, 1, &event, NULL); >> + if (e == -EINTR) { >> + if (v) >> + printf("%d/%lld: %s - io_getevents interruped %d, >> retrying\n", >> + procid, opno, iswrite ? "awrite" : "aread", e); >> + } > + } while (e == -EINTR); > > I hope that after a few retries, the event status changes?; otherwise, > we won't come out of this loop. Why not retry with a counter? Initially I had a counter, but later discard the idea. The root cause of the interruption is uring is using signal by default to inform about the finished IO. Thus it would interrupt aio. This means, if we have a counter, heavy enough uring workload can still go beyond that counter and cause the same problem. Thus I discard the counter idea, to make sure we always wait for the IO. Thanks, Qu > > Thanks, Anand > > >> + if (e != 1) { >> if (v) >> printf("%d/%lld: %s - io_getevents failed %d\n", >> procid, opno, iswrite ? "awrite" : "aread", e); >
LGTM.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
diff --git a/ltp/fsstress.c b/ltp/fsstress.c index 6641a525..f9f132bf 100644 --- a/ltp/fsstress.c +++ b/ltp/fsstress.c @@ -2223,7 +2223,15 @@ do_aio_rw(opnum_t opno, long r, int flags) procid, opno, iswrite ? "awrite" : "aread", e); goto aio_out; } - if ((e = io_getevents(io_ctx, 1, 1, &event, NULL)) != 1) { + do { + e = io_getevents(io_ctx, 1, 1, &event, NULL); + if (e == -EINTR) { + if (v) + printf("%d/%lld: %s - io_getevents interruped %d, retrying\n", + procid, opno, iswrite ? "awrite" : "aread", e); + } + } while (e == -EINTR); + if (e != 1) { if (v) printf("%d/%lld: %s - io_getevents failed %d\n", procid, opno, iswrite ? "awrite" : "aread", e);
[BUG] There is a very low chance to hit data csum mismatch (caught by scrub) during test case btrfs/06[234567]. After some extra digging, it turns out that plain fsstress itself is enough to cause the problem: ``` workload() { mkfs.btrfs -f -m single -d single --csum sha256 $dev1 > /dev/null mount $dev1 $mnt #$fsstress -p 10 -n 1000 -w -d $mnt umount $mnt btrfs check --check-data-csum $dev1 || fail } runtime=1024 for (( i = 0; i < $runtime; i++ )); do echo "=== $i / $runtime ===" workload done ``` Inside a VM which has only 6 cores, above script can trigger with 1/20 possibility. [CAUSE] Locally I got a much smaller workload to reproduce: $fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt -v > /tmp/fsstress With extra kernel trace_prinkt() on the buffered/direct writes. It turns out that the following direct write is always the cause: btrfs_do_write_iter: r/i=5/283 buffered fileoff=708608(709121) len=12288(7712) btrfs_do_write_iter: r/i=5/283 direct fileoff=8192(8192) len=73728(73728) <<<<< btrfs_do_write_iter: r/i=5/283 direct fileoff=589824(589824) len=16384(16384) With the involved byte number, it's easy to pin down the fsstress opeartion: 0/31: writev d0/f3[285 2 0 0 296 1457078] [709121,8,964] 0 0/32: chown d0/f2 308134/1763236 0 0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[285 2 308134 1763236 320 1457078] return 25, fallback to stat() 0/33: awrite - io_getevents failed -4 <<<< 0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[285 2 308134 1763236 320 1457078] return 25, fallback to stat() Note the 0/33, when the data csum mismatch triggered, it always fail with -4 (-EINTR). It looks like with lucky enough concurrency, we can get to the following situation inside fsstress: Process A | Process B -----------------------------------+--------------------------------------- do_aio_rw() | |- io_sumit(); | |- io_get_events(); | | Returned -EINTR, but IO hasn't | | finished. | `- free(buf); | malloc(); | Got the same memory of @buf from | thread A. | Modify the memory | Now the buffer is changed while | still under IO This is the typical buffer modification during direct IO, which is going to cause csum mismatch for btrfs, and btrfs properly detects it. This is the direct cause of the problem. However I'm still not 100% sure on why we got -EINTR for io_getevents(). My previous tests shows that if I disable uring_write, then no more such data corruption, thus I guess io_uring has something affecting aio? [FIX] To fix the problem, we can just retry io_getevents() so that we can properly wait for the IO. This prevents us to modify the IO buffer before writeback really finishes. With this fixes, I can no longer reproduce the data corruption. Signed-off-by: Qu Wenruo <wqu@suse.com> --- ltp/fsstress.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-)