Message ID: 20190213095044.29628-1-bob.liu@oracle.com (mailing list archive)
Series: Block/XFS: Support alternative mirror device retry
Hi Bob

On 2/13/19 5:50 PM, Bob Liu wrote:
> Motivation:
> When fs data/metadata checksums mismatch, lower block devices may have other
> correct copies. e.g. if XFS successfully reads a metadata buffer off a raid1
> but decides that the metadata is garbage, today it will shut down the entire
> filesystem without trying any of the other mirrors. This is a severe loss of
> service, and we propose these patches to have XFS try harder to avoid
> failure.
>
> This patch set prototypes the mirror retry idea by:
> * Adding @nr_mirrors to struct request_queue, similar to blk_queue_nonrot();
>   a filesystem can grab the device request queue and check the maximum
>   number of mirrors the block device has.
>   Helper functions were also added to get/set nr_mirrors.
>
> * Introducing bi_rd_hint just like bi_write_hint, but bi_rd_hint is a long
>   bitmap in order to support the stacked layer case.

Why do we need a bitmap to know which underlying device has been tried?
For example, in the following scenario:

          md8
        /  |  \
     sda  sdb  sdc

If raid1 reads the data from sda and the fs check finds the data is corrupted,
then we may just need to let raid1 know that the data came from sda. Based on
this hint, raid1 could handle it with handle_read_error to try another replica
and fix the error.

If this is feasible, we just need to modify the bio as follows and needn't add
any bytes to it:

struct bio {
	...
	union {
		unsigned short bi_write_hint;
		unsigned short bi_read_hint;
	};
	...
}

Thanks
Jianchao

> * Modify md/raid1 to support this retry feature.
>
> * Adapt xfs to use this feature.
>   If the read verify fails, we loop over the available mirrors and retry the
>   read.
>
> * Rewrite retried read
>   When the read verification fails but the retry succeeds,
>   write the buffer back to correct the bad mirror.
>
> * Add tracepoints and logging to alternate device retry.
>   This patch adds new log entries and trace points to the alternate device
>   retry error path.
>
> Changes v2:
> - No more reuse of bi_write_hint
> - Stacked layer support (see patch 4/9)
> - Other feedback fixes
>
> Allison Henderson (5):
>   Add b_alt_retry to xfs_buf
>   xfs: Add b_rd_hint to xfs_buf
>   xfs: Add device retry
>   xfs: Rewrite retried read
>   xfs: Add tracepoints and logging to alternate device retry
>
> Bob Liu (4):
>   block: add nr_mirrors to request_queue
>   block: add rd_hint to bio and request
>   md:raid1: set mirrors correctly
>   md:raid1: rd_hint support and consider stacked layer case
>
>  Documentation/block/biodoc.txt |   3 +
>  block/bio.c                    |   1 +
>  block/blk-core.c               |   4 ++
>  block/blk-merge.c              |   6 ++
>  block/blk-settings.c           |  24 +++++++
>  block/bounce.c                 |   1 +
>  drivers/md/raid1.c             | 123 ++++++++++++++++++++++++++++++++-
>  fs/xfs/xfs_buf.c               |  58 +++++++++++++++-
>  fs/xfs/xfs_buf.h               |  14 ++++
>  fs/xfs/xfs_trace.h             |   6 +-
>  include/linux/blk_types.h      |   1 +
>  include/linux/blkdev.h         |   4 ++
>  include/linux/types.h          |   3 +
>  13 files changed, 244 insertions(+), 4 deletions(-)
On Wed, Feb 13, 2019 at 05:50:35PM +0800, Bob Liu wrote:
> Motivation:
> When fs data/metadata checksums mismatch, lower block devices may have other
> correct copies. e.g. if XFS successfully reads a metadata buffer off a raid1
> but decides that the metadata is garbage, today it will shut down the entire
> filesystem without trying any of the other mirrors. This is a severe loss of
> service, and we propose these patches to have XFS try harder to avoid
> failure.
>
[...]
>
> * Adapt xfs to use this feature.
>   If the read verify fails, we loop over the available mirrors and retry the
>   read.

Why does the filesystem have to iterate every single possible combination of
devices that are underneath it?

Wouldn't it be much simpler to be able to attach a verifier function to the
bio, and have each layer that gets called iterate over all its copies
internally until the verifier function passes or all copies are exhausted?

This works for stacked mirrors - it can pass the higher layer verifier down as
far as necessary. It can work for RAID5/6, too, by having that layer supply
its own verifier for reads that verifies parity and can reconstruct on
failure; then, once it has reconstructed a valid stripe, it can run the
verifier that was supplied to it from above, etc.

i.e. I don't see why only filesystems should drive retries or have to be aware
of the underlying storage stacking. ISTM that each layer of the storage stack
should be able to verify that what has been returned to it is valid
independently of the higher layer requirements. The only difference from a
caller point of view should be submit_bio(bio); vs
submit_bio_verify(bio, verifier_cb_func);

Cheers,

Dave.
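To make the shape of this suggestion concrete, here is a minimal sketch of
what such an interface could look like. Everything below is an assumption for
illustration only: submit_bio_verify(), a bi_verifier field in struct bio,
struct mirror_ctx and mirror_submit_to_copy() do not exist; only submit_bio(),
bio_endio(), bio_put() and BLK_STS_IOERR are real kernel symbols.

typedef int (*bio_verifier_t)(struct bio *bio);

/* Hypothetical submission variant: attach the caller's verifier to the bio. */
static inline void submit_bio_verify(struct bio *bio, bio_verifier_t verify)
{
	bio->bi_verifier = verify;	/* assumed new field, not in struct bio today */
	submit_bio(bio);
}

/* Hypothetical per-read state kept by the layer that owns the copies. */
struct mirror_ctx {
	struct bio	*master;	/* the caller's bio */
	int		next_copy;
	int		nr_copies;
};

static void mirror_submit_to_copy(struct mirror_ctx *ctx, int copy);	/* hypothetical */

/*
 * Read completion in the mirrored layer: retry internally until the caller's
 * verifier passes or every copy has been tried.
 */
static void mirror_read_endio(struct bio *copy_bio)
{
	struct mirror_ctx *ctx = copy_bio->bi_private;
	struct bio *master = ctx->master;

	if (!copy_bio->bi_status &&
	    (!master->bi_verifier || master->bi_verifier(master) == 0)) {
		bio_endio(master);			/* good, verified data */
	} else if (++ctx->next_copy < ctx->nr_copies) {
		mirror_submit_to_copy(ctx, ctx->next_copy);	/* quiet retry */
	} else {
		master->bi_status = BLK_STS_IOERR;	/* all copies exhausted */
		bio_endio(master);
	}
	bio_put(copy_bio);
}

The point of the sketch is only that the retry loop lives next to the copies,
so no per-device retry state has to travel up through the bio.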
On 2/18/19 4:08 PM, jianchao.wang wrote:
> Hi Bob
>
> On 2/13/19 5:50 PM, Bob Liu wrote:
>> Motivation:
>> When fs data/metadata checksums mismatch, lower block devices may have
>> other correct copies.
[...]
>>
>> * Introducing bi_rd_hint just like bi_write_hint, but bi_rd_hint is a long
>>   bitmap in order to support the stacked layer case.
>
> Why do we need a bitmap to know which underlying device has been tried?
> For example, in the following scenario:
>
>           md8
>         /  |  \
>      sda  sdb  sdc
>
> If raid1 reads the data from sda and the fs check finds the data is
> corrupted, then we may just need to let raid1 know that the data came from
> sda. Based on this hint, raid1 could handle it with handle_read_error to try
> another replica and fix the error.

This doesn't work. The md raid1 layer can only see IO success or failure, so
fix_read_error won't fix this.

Sorry for the noise.

Thanks
Jianchao

[...]
On Tue, Feb 19, 2019 at 08:31:50AM +1100, Dave Chinner wrote:
> On Wed, Feb 13, 2019 at 05:50:35PM +0800, Bob Liu wrote:
> > Motivation:
> > When fs data/metadata checksums mismatch, lower block devices may have
> > other correct copies.
[...]
>
> Wouldn't it be much simpler to be able to attach a verifier function to the
> bio, and have each layer that gets called iterate over all its copies
> internally until the verifier function passes or all copies are exhausted?
>
[...]
>
> i.e. I don't see why only filesystems should drive retries or have to be
> aware of the underlying storage stacking. ISTM that each layer of the
> storage stack should be able to verify that what has been returned to it is
> valid independently of the higher layer requirements. The only difference
> from a caller point of view should be submit_bio(bio); vs
> submit_bio_verify(bio, verifier_cb_func);

What if, instead of constructing a giant pile of verifier call chains, we
simply had a return value from ->bi_end_io that would then be returned from
bio_endio()?

Stacked things like dm-linear would have to know how to connect the upper
endio to the lower endio, though. And that could have its downsides, too. How
long do we tie up resources in the scsi layer while upper levels are busy
running verification functions...?

Hmmmmmmmmm....

--D

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
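A minimal sketch of the signature change being floated here, purely
hypothetical since today both ->bi_end_io() and bio_endio() return void; the
enum, fs_verify_buffer() and fs_read_end_io() below are invented for the
example.

/* A made-up status a completion handler could hand back through bio_endio(). */
enum bio_verify_status {
	BIO_VERIFY_OK,		/* handler accepted the data */
	BIO_VERIFY_RETRY,	/* handler's verifier rejected it, try another copy */
};

static int fs_verify_buffer(void *owner);	/* hypothetical fs-level verifier */

/* Filesystem-owned ->bi_end_io() with the changed return type. */
static enum bio_verify_status fs_read_end_io(struct bio *bio)
{
	if (!bio->bi_status && fs_verify_buffer(bio->bi_private) == 0)
		return BIO_VERIFY_OK;
	return BIO_VERIFY_RETRY;
}

/*
 * bio_endio() would then return that value to whichever layer called it, so a
 * raid1-like layer could do something along the lines of:
 *
 *	if (bio_endio(master_bio) == BIO_VERIFY_RETRY)
 *		retry the read from a different mirror before tearing down
 *		its per-request state;
 *
 * A pass-through layer such as dm-linear would have to forward the value from
 * the upper endio to the lower one, which is the wrinkle noted above.
 */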
On Mon, Feb 18, 2019 at 06:55:20PM -0800, Darrick J. Wong wrote:
> On Tue, Feb 19, 2019 at 08:31:50AM +1100, Dave Chinner wrote:
[...]
>
> What if, instead of constructing a giant pile of verifier call chains, we
> simply had a return value from ->bi_end_io that would then be returned from
> bio_endio()?

Conceptually it achieves the same thing - getting the high level verifier
status down to the lower layer to say "this copy is bad, try again" - but I
suspect all the bio chaining and cloning done in the stack makes this much
more difficult than it seems.

> Stacked things like dm-linear would have to know how to connect the upper
> endio to the lower endio, though. And that could have its downsides, too.

Stacking always makes things hard :/

> How long do we tie up resources in the scsi layer while upper levels are
> busy running verification functions...?

I suspect there's a more important issue to worry about: we run the XFS read
verifiers in an async work queue context after collecting the IO completion
status from the bio, rather than running directly in the bio->bi_end_io()
call chain.

Cheers,

Dave.
On 2/19/19 5:31 AM, Dave Chinner wrote:
> On Wed, Feb 13, 2019 at 05:50:35PM +0800, Bob Liu wrote:
[...]
>
> Wouldn't it be much simpler to be able to attach a verifier function to the
> bio, and have each layer that gets called iterate over all its copies
> internally until the verifier function passes or all copies are exhausted?
>
[...]
> The only difference from a caller point of view should be submit_bio(bio);
> vs submit_bio_verify(bio, verifier_cb_func);

We already have bio->bi_end_io(); how about doing the verification inside
bi_end_io()?

Then the whole sequence would look like:

  bio_endio()
    1. bio->bi_end_io()
         xfs_buf_bio_end_io()
           verify, set bio->bi_status = "please retry" if verify fails
    2. if bio->bi_status == retry
    3. resubmit the bio

Is it fine to resubmit a bio inside bio_endio()?

- Thanks,
Bob.
On Thu, Feb 28, 2019 at 10:22:02PM +0800, Bob Liu wrote:
> On 2/19/19 5:31 AM, Dave Chinner wrote:
[...]
>
> We already have bio->bi_end_io(); how about doing the verification inside
> bi_end_io()?
>
> Then the whole sequence would look like:
>
>   bio_endio()
>     1. bio->bi_end_io()
>          xfs_buf_bio_end_io()
>            verify, set bio->bi_status = "please retry" if verify fails
>     2. if bio->bi_status == retry
>     3. resubmit the bio

As I mentioned to Darrick, this isn't as simple as it seems, because what XFS
actually does is this:

  IO completion thread                 Workqueue Thread
  bio_endio(bio)
    bio->bi_end_io(bio)
      xfs_buf_bio_end_io(bio)
        bp->b_error = bio->bi_status
        xfs_buf_ioend_async(bp)
          queue_work(bp->b_ioend_wq, bp)
    bio_put(bio)
  <io completion done>
                                       .....
                                       xfs_buf_ioend(bp)
                                         bp->b_ops->read_verify()
                                       .....

IOWs, XFS does not do read verification inside the bio completion context, but
instead defers it to an external workqueue so it does not delay processing
incoming bio IO completions. Hence there is no way to get the verification
status back to the bio completion (the bio has already been freed!) to
resubmit from there.

This is one of the reasons I suggested a verifier be added to the submission,
so the bio itself is wholly responsible for running it, not an external,
filesystem level completion function that may operate outside of bio scope....

> Is it fine to resubmit a bio inside bio_endio()?

Depends on the context the bio_endio() completion is running in.

Cheers,

Dave.
On Feb 28, 2019, at 7:22 AM, Bob Liu <bob.liu@oracle.com> wrote:
> On 2/19/19 5:31 AM, Dave Chinner wrote:
>> On Wed, Feb 13, 2019 at 05:50:35PM +0800, Bob Liu wrote:
[...]
>>
>> Why does the filesystem have to iterate every single possible combination
>> of devices that are underneath it?

Even if the filesystem isn't doing this iteration, there needs to be some way
to track which devices or combinations of devices have been tried for the bio,
which likely still means something inside the bio.

>> Wouldn't it be much simpler to be able to attach a verifier function to the
>> bio, and have each layer that gets called iterate over all its copies
>> internally until the verifier function passes or all copies are exhausted?
>>
[...]
>>
>> i.e. I don't see why only filesystems should drive retries or have to be
>> aware of the underlying storage stacking. ISTM that each layer of the
>> storage stack should be able to verify that what has been returned to it is
>> valid independently of the higher layer requirements. The only difference
>> from a caller point of view should be submit_bio(bio); vs
>> submit_bio_verify(bio, verifier_cb_func);

I don't think the filesystem should be aware of the stacking (nor is it in the
proposed implementation). That said, the filesystem-level checksums should,
IMHO, be checked at the filesystem level, and this proposal allows the
filesystem to tell the lower layer "this read was bad, try something else".

One option, instead of having a bitmap with one bit for every possible
device/combination in the system, would be to have a counter instead. This is
much denser, and even the existing "__u16 bio_write_hint" field would be
enough for 2^16 different devices/combinations of devices to be tried. The
main difference would be that the retry layers in the device layer would need
to have a deterministic iterator for the bio.

For stacked devices it would need to use the same API to determine how many
possible combinations are below it, and do a modulus to pass down the
per-device iteration number. The easiest would be to iterate in numeric order,
but it would also be possible to use something like a PRNG seeded by e.g. the
block number to change the order on a per-bio basis to even out the load, if
that is desirable.

> For a two layer stacked md case like:
>
>                /dev/md0
>             /      |      \
>    /dev/md1-a  /dev/md1-b  /dev/sdf
>      /    \      /   |   \
> /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde

In this case, the top-level md0 would call blk_queue_get_copies() on each of
its sub-devices to determine how many sub-devices/combinations they have, and
pick the maximum (3 in this case), multiplied by the number of top-level
devices (also 3 in this case). That means the top-level device would return
blk_queue_get_copies() == 9 combinations, but the same could be done
recursively for more/non-uniform layers if needed.

The top-level device maps md1-a = [0-2], md1-b = [3-5], sdf = [6-8], and can
easily map an incoming bio_read_hint to the next device, either by a simple
increment or by predetermining a device ordering and following that (e.g. 0,
3, 6, 1, 4, 7, 2, 5, 8), or any other deterministic order that hits all of the
devices exactly once. During submission, bio_read_hint is set to the modulus
of the value (so that each layer in the stack sees only values in the range
[0, copies)), and when the bio completes the top-level device will set
bio_read_hint to be the next sub-device to try (like the original proposal was
splitting and combining the bitmaps). If a sub-device gets a bad index (e.g.
md1-a sees bio_read_hint == 2, or sdf sees anything other than 0) it is a
no-op and returns e.g. -EAGAIN to the upper device, so that it moves on to the
next device without returning to the caller.

>> I suspect there's a more important issue to worry about: we run the XFS
>> read verifiers in an async work queue context after collecting the IO
>> completion status from the bio, rather than running directly in the
>> bio->bi_end_io() call chain.

In this proposal, XFS would just have to save the __u16 bio_read_hint field
from the previous bio completion and set it in the retried bio, so that it
could start at the next device/combination. Obviously, this would mean that
the internal device iterator couldn't have any hidden state for the bio, so
that just setting bio_read_hint would be the same as resubmitting the original
bio again, but that is already a given or this whole problem wouldn't exist in
the first place.

Cheers, Andreas
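As a concrete illustration of the arithmetic described above - note that
blk_queue_get_copies() and bio_read_hint are proposed names from this thread,
not existing kernel API, and struct stacked_dev plus the helpers below are
invented purely for the example:

#define MAX_CHILDREN	16		/* arbitrary bound for the sketch */

struct stacked_dev {
	unsigned int		nr_children;
	struct block_device	*child[MAX_CHILDREN];
};

/* A stacked device advertises max(child copies) * nr_children combinations. */
static unsigned int stacked_get_copies(struct stacked_dev *dev)
{
	unsigned int i, max_child = 1;

	for (i = 0; i < dev->nr_children; i++)
		max_child = max(max_child,
				blk_queue_get_copies(bdev_get_queue(dev->child[i])));

	return max_child * dev->nr_children;
}

/*
 * Map an incoming hint to (child index, child-local hint).  With the md0
 * example above, max_child == 3 and nr_children == 3, so hints 0-2 go to
 * md1-a, 3-5 to md1-b and 6-8 to sdf, each reduced to the range [0, 3).
 * A child that receives a hint >= its real copy count would return -EAGAIN
 * and the parent simply advances to the next index.
 */
static void stacked_map_hint(struct stacked_dev *dev, unsigned int hint,
			     unsigned int *child_idx, unsigned int *child_hint)
{
	unsigned int max_child = stacked_get_copies(dev) / dev->nr_children;

	*child_idx  = hint / max_child;
	*child_hint = hint % max_child;
}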
On 3/1/19 7:28 AM, Andreas Dilger wrote:
> On Feb 28, 2019, at 7:22 AM, Bob Liu <bob.liu@oracle.com> wrote:
>> On 2/19/19 5:31 AM, Dave Chinner wrote:
[...]
>
> I don't think the filesystem should be aware of the stacking (nor is it in
> the proposed implementation). That said, the filesystem-level checksums
> should, IMHO, be checked at the filesystem level, and this proposal allows
> the filesystem to tell the lower layer "this read was bad, try something
> else".
>
> One option, instead of having a bitmap with one bit for every possible
> device/combination in the system, would be to have a counter instead. This
> is much denser, and even the existing "__u16 bio_write_hint" field would be
> enough for 2^16 different devices/combinations of devices to be tried.

Indeed! This way is better than a bitmap.
But as Dave mentioned, it's much better and simpler to attach a verifier
function to the bio..

- Thanks,
Bob

> The main difference would be that the retry layers in the device layer
> would need to have a deterministic iterator for the bio.
>
[...]
>
> Cheers, Andreas
On 3/1/19 5:49 AM, Dave Chinner wrote:
> On Thu, Feb 28, 2019 at 10:22:02PM +0800, Bob Liu wrote:
>> On 2/19/19 5:31 AM, Dave Chinner wrote:
[...]
>
> IOWs, XFS does not do read verification inside the bio completion context,
> but instead defers it to an external workqueue so it does not delay
> processing incoming bio IO completions. Hence there is no way to get the
> verification status back to the bio completion (the bio has already been
> freed!) to resubmit from there.
>
> This is one of the reasons I suggested a verifier be added to the
> submission, so the bio itself is wholly responsible for running it,

But then the completion time of an I/O would be longer if the verifier
function is called inside bio_endio(). Would that be a problem? It used to be
async, since, as you mentioned, xfs uses a workqueue.

Thanks,
-Bob

> not an external, filesystem level completion function that may operate
> outside of bio scope....
>
>> Is it fine to resubmit a bio inside bio_endio()?
>
> Depends on the context the bio_endio() completion is running in.
On Sun, Mar 03, 2019 at 10:37:59AM +0800, Bob Liu wrote:
> On 3/1/19 5:49 AM, Dave Chinner wrote:
> > On Thu, Feb 28, 2019 at 10:22:02PM +0800, Bob Liu wrote:
[...]
> >
> > This is one of the reasons I suggested a verifier be added to the
> > submission, so the bio itself is wholly responsible for running it,
>
> But then the completion time of an I/O would be longer if the verifier
> function is called inside bio_endio(). Would that be a problem?

No, because then we don't have to do it in the filesystem. i.e. the filesystem
doesn't complete the IO until after the verifier has run, so from the
perspective of the waiting reader it doesn't matter where it is run, because
the overall I/O latency is the same.

Cheers,

Dave.
On Thu, Feb 28, 2019 at 04:28:53PM -0700, Andreas Dilger wrote:
> On Feb 28, 2019, at 7:22 AM, Bob Liu <bob.liu@oracle.com> wrote:
> > On 2/19/19 5:31 AM, Dave Chinner wrote:
> >> On Wed, Feb 13, 2019 at 05:50:35PM +0800, Bob Liu wrote:
[...]
> >>
> >> Why does the filesystem have to iterate every single possible combination
> >> of devices that are underneath it?
>
> Even if the filesystem isn't doing this iteration, there needs to be some
> way to track which devices or combinations of devices have been tried for
> the bio, which likely still means something inside the bio.

I don't believe it needs to be "in the bio". The thing that does the iteration
(i.e. the layer with multiple copies or rebuild capability) is the one that
captures the IO completion state, runs the verifier it is supplied with, and
re-issues the read if the verifier or the initial IO fails. i.e. it moves the
iteration down to the thing that knows what can be iterated, and so there's no
state needed in the bio itself.

> >> Wouldn't it be much simpler to be able to attach a verifier function to
> >> the bio, and have each layer that gets called iterate over all its copies
> >> internally until the verifier function passes or all copies are
> >> exhausted?
> >>
[...]
>
> I don't think the filesystem should be aware of the stacking (nor is it in
> the proposed implementation). That said, the filesystem-level checksums
> should, IMHO, be checked at the filesystem level, and this proposal allows
> the filesystem to tell the lower layer "this read was bad, try something
> else".

After the fact, yes.

I want the verification during the IO, while the layer that knows about
iteration and recovery can do this easily. i.e. all the complexity right now
is because we back out of the layer that can do iteration before we can run
the verification, and so we have to carry some state up to a higher level and
then pass it back down in a completely separate IO context. That's where all
this "need to carry state in the bio" stuff comes from, and that's what I'm
trying to get rid of.

> One option, instead of having a bitmap with one bit for every possible
> device/combination in the system, would be to have a counter instead. This
> is much denser, and even the existing "__u16 bio_write_hint" field would be
> enough for 2^16 different devices/combinations of devices to be tried. The
> main difference would be that the retry layers in the device layer would
> need to have a deterministic iterator for the bio.

The problem there is stacked layers - each layer needs a unique ID for its
iterator function, as this complexity:

> For stacked devices it would need to use the same API to determine how many
> possible combinations are below it, and do a modulus to pass down the
> per-device iteration number.
>
[...]
>
> If a sub-device gets a bad index (e.g. md1-a sees bio_read_hint == 2, or sdf
> sees anything other than 0) it is a no-op and returns e.g. -EAGAIN to the
> upper device, so that it moves on to the next device without returning to
> the caller.

.... clearly demonstrates. I'd much prefer stacking completions and running
them on demand, as that should work for all stacked types and not require
magic to iterate constructs like the above.

Passing the verifier down also allows the underlying layer to repair itself.
i.e. if it gets a verifier failure, then retries and gets success, it knows
immediately which part of the mirror contains bad data and can repair it. It
can also trigger a region scrub, knowing which device might be bad and which
is likely to contain good data. i.e. we can start to think about automated
block device self-repair if we can supply a data verifier with submit_bio()...

> >> I suspect there's a more important issue to worry about: we run the XFS
> >> read verifiers in an async work queue context after collecting the IO
> >> completion status from the bio, rather than running directly in the
> >> bio->bi_end_io() call chain.
>
> In this proposal, XFS would just have to save the __u16 bio_read_hint field
> from the previous bio completion and set it in the retried bio, so that it
> could start at the next device/combination.
[...]

It still requires code in the filesystem to iterate and retry N times, instead
of never. And we still have to re-write the data we read to fix the underlying
device issue (which the device should already know about and have fixed by
this point!)

i.e. we either get verified data returned on bio completion, or we get an
error to say the data was corrupt and unrecoverable. If someone wants "fail
fast" semantics, then they simply don't provide a verifier....

Cheers,

Dave.
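Continuing the hypothetical mirror_ctx sketch from earlier in the thread, the
self-repair idea might look roughly like this; mirror_rewrite_copy(),
mirror_schedule_scrub() and the copy_failed_verify[] tracking are all invented
for illustration and are not existing md/raid1 code:

static void mirror_rewrite_copy(struct mirror_ctx *ctx, int copy);	/* hypothetical */
static void mirror_schedule_scrub(struct mirror_ctx *ctx, int copy);	/* hypothetical */

/*
 * Called once some copy has passed the caller-supplied verifier.  Because the
 * mirror layer ran the verifier itself, it knows exactly which replicas
 * returned bad data (tracked here in an assumed copy_failed_verify[] array in
 * mirror_ctx) and can repair them from the verified data before completing
 * the bio - no filesystem-driven re-write needed.
 */
static void mirror_read_verified_ok(struct mirror_ctx *ctx, int good_copy)
{
	int i;

	for (i = 0; i < ctx->nr_copies; i++) {
		if (i == good_copy || !ctx->copy_failed_verify[i])
			continue;
		mirror_rewrite_copy(ctx, i);	/* fix the bad replica */
		mirror_schedule_scrub(ctx, i);	/* optionally scrub the region */
	}

	bio_endio(ctx->master);		/* caller sees verified data, no error */
}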