Message ID | 20171013134218.19048-1-anand.jain@oracle.com (mailing list archive) |
---|---|
State | New, archived |
On Fri, Oct 13, 2017 at 09:42:18PM +0800, Anand Jain wrote:
> When one of the devices is missing, bbio_error() takes care
> of setting the error status. And if it is the only IO that is
> pending in that stripe, it fails to check the status of the
> other IO at %bbio_error before setting the error %bi_status
> for the %orig_bio. Fix this by checking if %bbio->error
> has crossed the %bbio->max_errors. Thxs.
>
> With the reproducer below, the fdatasync error is seen intermittently.
>
>  mount -o degraded /dev/sdc /btrfs
>  dd status=none if=/dev/zero of=$(mktemp /btrfs/XXX) bs=4096 count=1 conv=fdatasync
>
>  dd: fdatasync failed for ‘/btrfs/LSe’: Input/output error
>
> The problem is intermittent because the following conditions
> have to be met, which depends on timing.
>  In btrfs_map_bio()
>   . In the RAID1, the missing device has to be at %dev_nr = 1
>  In bbio_error()
>   . Before bbio_error() is called, the bio of the not-missing
>     device at %dev_nr=0 must be completed so that the below
>     condition is true
>       if (atomic_dec_and_test(&bbio->stripes_pending)) {
>
> Signed-off-by: Anand Jain <anand.jain@oracle.com>
> ---
>  fs/btrfs/volumes.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 9af633dcf015..efd502176915 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -6131,7 +6131,10 @@ static void bbio_error(struct btrfs_bio *bbio, struct bio *bio, u64 logical)
>
>  		btrfs_io_bio(bio)->mirror_num = bbio->mirror_num;
>  		bio->bi_iter.bi_sector = logical >> 9;
> -		bio->bi_status = BLK_STS_IOERR;
> +		if (atomic_read(&bbio->error) > bbio->max_errors)
> +			bio->bi_status = BLK_STS_IOERR;
> +		else
> +			bio->bi_status = 0;

Thanks for the fix, I'd prefer BLK_STS_OK rather than 0.

With that,

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>

-liubo

>  		btrfs_end_bbio(bbio, bio);
>  	}
>  }
> --
> 2.13.1
>> -		bio->bi_status = BLK_STS_IOERR;
>> +		if (atomic_read(&bbio->error) > bbio->max_errors)
>> +			bio->bi_status = BLK_STS_IOERR;
>> +		else
>> +			bio->bi_status = 0;
>
> Thanks for the fix, I'd prefer BLK_STS_OK rather than 0.
>
> With that,
>
> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>

Thanks for the review, will fix it.

-Anand

> -liubo
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 9af633dcf015..efd502176915 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6131,7 +6131,10 @@ static void bbio_error(struct btrfs_bio *bbio, struct bio *bio, u64 logical)
 
 		btrfs_io_bio(bio)->mirror_num = bbio->mirror_num;
 		bio->bi_iter.bi_sector = logical >> 9;
-		bio->bi_status = BLK_STS_IOERR;
+		if (atomic_read(&bbio->error) > bbio->max_errors)
+			bio->bi_status = BLK_STS_IOERR;
+		else
+			bio->bi_status = 0;
 		btrfs_end_bbio(bbio, bio);
 	}
 }
When one of the devices is missing, bbio_error() takes care of setting
the error status. And if it is the only IO that is pending in that stripe,
it fails to check the status of the other IO at %bbio_error before setting
the error %bi_status for the %orig_bio. Fix this by checking if
%bbio->error has crossed the %bbio->max_errors. Thxs.

With the reproducer below, the fdatasync error is seen intermittently.

 mount -o degraded /dev/sdc /btrfs
 dd status=none if=/dev/zero of=$(mktemp /btrfs/XXX) bs=4096 count=1 conv=fdatasync

 dd: fdatasync failed for ‘/btrfs/LSe’: Input/output error

The problem is intermittent because the following conditions have to
be met, which depends on timing.
 In btrfs_map_bio()
  . In the RAID1, the missing device has to be at %dev_nr = 1
 In bbio_error()
  . Before bbio_error() is called, the bio of the not-missing
    device at %dev_nr=0 must be completed so that the below
    condition is true
      if (atomic_dec_and_test(&bbio->stripes_pending)) {

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 fs/btrfs/volumes.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)