
[RFC,0/5] btrfs: raid56: do full stripe data checksum verification and recovery at RMW time

Message ID: cover.1665730948.git.wqu@suse.com
Series: btrfs: raid56: do full stripe data checksum verification and recovery at RMW time

Message

Qu Wenruo Oct. 14, 2022, 7:17 a.m. UTC
There is a long-standing problem for all RAID56 implementations (not only
btrfs RAID56): if we have corrupted data on-disk and do an RMW using that
corrupted data, we lose the chance of recovery.

Since the new parity is calculated using the corrupted sector, we will
never be able to recover the old data.

However, btrfs has data checksums to save the day here: if we do a full
scrub-level verification at RMW time, we can detect the corrupted data
before we do any write.

Then we do the proper recovery, recheck the data checksums, restore the
old data, and call it a day.
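
To make the failure mode concrete, here is a toy user-space illustration
(not btrfs code; the byte-sized "sectors" and the trivial csum check are
made up purely for illustration).  Recomputing parity from a silently
corrupted sector bakes the corruption in, while verifying a stored
checksum first lets the old parity rebuild the sector before the write:

/*
 * Toy illustration only, not btrfs code: a 3-device "RAID5" kept in
 * memory, one byte per "sector" so the math stays visible.
 * d0 and d1 are data sectors, p is parity (p = d0 ^ d1).
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint8_t d0 = 0xaa, d1 = 0x55;   /* on-disk data sectors */
        uint8_t p = d0 ^ d1;            /* on-disk parity */
        uint8_t good_d0 = d0;           /* what d0 should contain */
        uint8_t csum_d0 = d0;           /* stand-in for the data csum */
        uint8_t new_d1 = 0x11;          /* sub-stripe write to d1 */

        d0 = 0x00;                      /* silent corruption of d0 */

        /*
         * Naive RMW: new parity is computed from the corrupted d0,
         * so the old d0 can never again be rebuilt from (d1, p).
         */
        uint8_t naive_p = d0 ^ new_d1;
        printf("naive RMW:        rebuilt d0 = 0x%02x, expected 0x%02x\n",
               (unsigned)(new_d1 ^ naive_p), (unsigned)good_d0);

        /*
         * Verify-first RMW: check the data csum, and on mismatch
         * rebuild d0 from the *old* parity before writing anything.
         */
        if (d0 != csum_d0)
                d0 = d1 ^ p;
        uint8_t safe_p = d0 ^ new_d1;
        printf("verify-first RMW: rebuilt d0 = 0x%02x, expected 0x%02x\n",
               (unsigned)(new_d1 ^ safe_p), (unsigned)good_d0);
        return 0;
}

The real series of course uses the csum tree and the filesystem's csum
type (crc32c by default) rather than the byte comparison above.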

This series adds that ability. Currently there are the following
limitations:

- Only works for full stripes without a missing device
  The code is already in place for the missing device + corrupted
  sector case of RAID6.

  But for now, I don't really want to test RAID6 yet.

- We only handle data checksum verification
  Metadata verification would be much more complex, and in the long run
  we will only recommend the metadata RAID1C3/C4 profiles to complement
  RAID56 data profiles.

  Thus we may never support metadata verification for RAID56.

- If we find corrupted sectors which cannot be repaired, we fail all
  the bios for the full stripe
  This is to avoid further data corruption, but in the future we may
  just continue with the corrupted data.

  That will need extra work to roll back to the original corrupted data
  (as the recovery process will change the content).

- Way more overhead for the sub-stripe write RMW cycle
  Now we need to:

  * Fetch the data csums for the full stripe
  * Verify the data csums
  * Do RAID56 recovery (if needed)
  * Verify the recovered data (if needed)

  Thankfully this only affects uncached sub-stripe writes.
  The successfully recovered data can be cached for later use.

- Will not write back the recovered data during RMW
  Thus later reads will still need to go through the recovery path to
  repair the data on disk.

  This can later be enhanced to let RMW write back the full stripe if
  we did some recovery during RMW.


- May need further refactoring to change how we handle the RAID56 workflow
  Currently we use quite a few workqueues to handle RAID56, and all the
  work is delayed.

  This doesn't look sane to me, and is especially hard to read (too many
  jumps just for a full RMW cycle).

  It may be a good idea to turn it into a submit-and-wait workflow (a
  rough sketch of that pattern follows after this list).
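
For reference, the submit-and-wait idea boils down to something like the
following user-space sketch, with pthreads standing in for the kernel's
workqueue and completion machinery (none of this is actual btrfs code,
and the names are made up for illustration):

/* Sketch only, not btrfs code: pthreads stand in for workqueues. */
#include <pthread.h>
#include <stdio.h>

/* Poor man's "struct completion". */
struct completion {
        pthread_mutex_t lock;
        pthread_cond_t cond;
        int done;
};

static void complete(struct completion *c)
{
        pthread_mutex_lock(&c->lock);
        c->done = 1;
        pthread_cond_signal(&c->cond);
        pthread_mutex_unlock(&c->lock);
}

static void wait_for_completion(struct completion *c)
{
        pthread_mutex_lock(&c->lock);
        while (!c->done)
                pthread_cond_wait(&c->cond, &c->lock);
        pthread_mutex_unlock(&c->lock);
}

/* One RMW work item; in the kernel this would carry the rbio. */
struct rmw_work {
        struct completion done;
        int ret;
};

static void *rmw_worker(void *arg)
{
        struct rmw_work *work = arg;

        /* read -> verify csums -> recover -> compute parity -> write */
        work->ret = 0;
        complete(&work->done);
        return NULL;
}

int main(void)
{
        struct rmw_work work = {
                .done = { PTHREAD_MUTEX_INITIALIZER,
                          PTHREAD_COND_INITIALIZER, 0 },
        };
        pthread_t worker;

        /* Submit the work, then wait for it instead of chaining callbacks. */
        pthread_create(&worker, NULL, rmw_worker, &work);
        wait_for_completion(&work.done);
        pthread_join(worker, NULL);
        printf("full stripe RMW done, ret = %d\n", work.ret);
        return 0;
}

Instead of bouncing an rbio through several delayed works, the caller
submits one work item and simply waits for it to finish, which keeps the
whole RMW cycle readable in one place.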

[REASON for RFC]
Although the patchset has not only passed the RAID56 test groups but also
my local destructive RMW test cases, some of the above limitations still
need to be addressed.

And whether the trade-off is worthwhile may still need to be discussed.

Qu Wenruo (5):
  btrfs: refactor btrfs_lookup_csums_range()
  btrfs: raid56: refactor __raid_recover_end_io()
  btrfs: introduce a bitmap based csum range search function
  btrfs: raid56: prepare data checksums for later sub-stripe
    verification
  btrfs: raid56: do full stripe data checksum verification before RMW

 fs/btrfs/ctree.h      |   8 +-
 fs/btrfs/file-item.c  | 196 ++++++++++++--
 fs/btrfs/inode.c      |   6 +-
 fs/btrfs/raid56.c     | 608 +++++++++++++++++++++++++++++++-----------
 fs/btrfs/raid56.h     |  12 +
 fs/btrfs/relocation.c |   4 +-
 fs/btrfs/scrub.c      |   8 +-
 fs/btrfs/tree-log.c   |  16 +-
 8 files changed, 664 insertions(+), 194 deletions(-)

Comments

David Sterba Oct. 25, 2022, 1:48 p.m. UTC | #1
On Fri, Oct 14, 2022 at 03:17:08PM +0800, Qu Wenruo wrote:
> There is a long-standing problem for all RAID56 implementations (not only
> btrfs RAID56): if we have corrupted data on-disk and do an RMW using that
> corrupted data, we lose the chance of recovery.
> 
> [...]

So this improves reliability at the cost of a full RMW cycle; do you
have any numbers to compare? The affected workload is a cold write in
the middle of a stripe; following writes would likely hit the cached
stripe. For linear writes the cost should be amortized; random rewrites
have always been problematic performance-wise, though I don't know
whether raid5 (or any striped profile) performance was any better there
to begin with.
Qu Wenruo Oct. 25, 2022, 11:30 p.m. UTC | #2
On 2022/10/25 21:48, David Sterba wrote:
> On Fri, Oct 14, 2022 at 03:17:08PM +0800, Qu Wenruo wrote:
>> [...]
>
> So this improves reliability at the cost of a full RMW cycle; do you
> have any numbers to compare?

That's the problem: I don't have enough physical disks to do a
real-world test.

But some basic analysis shows that, for a typical 5-disk RAID5, a full
stripe holds 256K of data; with the default CRC32 checksum that is at
most 256 bytes of csums.

Thus a csum search would read at most two leaves.

The cost should not be that huge AFAIK.
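
To spell out the arithmetic behind that estimate (assuming the default
64K per-device stripe length, 4K sector size and 4-byte CRC32/crc32c
csums):

#include <stdio.h>

int main(void)
{
        /* Assumed defaults: 64K per-device stripe, 4K sectors, 4-byte csums. */
        const unsigned int nr_data = 5 - 1;             /* 5-disk RAID5 */
        const unsigned int stripe_len = 64 * 1024;
        const unsigned int sectorsize = 4 * 1024;
        const unsigned int csum_size = 4;

        unsigned int full_stripe = nr_data * stripe_len;     /* 256K */
        unsigned int nr_sectors = full_stripe / sectorsize;  /* 64 sectors */
        unsigned int csum_bytes = nr_sectors * csum_size;    /* 256 bytes */

        printf("full stripe data %uK, csums %u bytes\n",
               full_stripe / 1024, csum_bytes);
        /* 256 bytes of csums easily fit within one or two csum tree leaves. */
        return 0;
}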

> The affected workload is a cold write in
> the middle of a stripe; following writes would likely hit the cached
> stripe. For linear writes the cost should be amortized; random rewrites
> have always been problematic performance-wise, though I don't know
> whether raid5 (or any striped profile) performance was any better there
> to begin with.

Just to mention another thing you may want to take into consideration:
I'm doing a refactor of the RAID56 write path, to make the whole
sub-stripe/full-stripe write path fit into a single function and follow
a submit-and-wait model.

My current plan is to get the refactor merged first (mostly done, running
the tests now), then the DRMW fix (which would be much smaller and
simpler to merge after the refactor).

Thanks,
Qu
David Sterba Oct. 26, 2022, 1:19 p.m. UTC | #3
On Wed, Oct 26, 2022 at 07:30:04AM +0800, Qu Wenruo wrote:
> On 2022/10/25 21:48, David Sterba wrote:
> > On Fri, Oct 14, 2022 at 03:17:08PM +0800, Qu Wenruo wrote:
> > So this improves reliability at the cost of a full RMW cycle; do you
> > have any numbers to compare?
> 
> That's the problem: I don't have enough physical disks to do a
> real-world test.
> 
> But some basic analysis shows that, for a typical 5-disk RAID5, a full
> stripe holds 256K of data; with the default CRC32 checksum that is at
> most 256 bytes of csums.
> 
> Thus a csum search would read at most two leaves.
> 
> The cost should not be that huge AFAIK.

Ok, thanks, that's also a good estimate. We can ask somebody with a
sufficient setup for a real test once we have the code ready.

> > The affected workload is a cold write in
> > the middle of a stripe; following writes would likely hit the cached
> > stripe. For linear writes the cost should be amortized; random rewrites
> > have always been problematic performance-wise, though I don't know
> > whether raid5 (or any striped profile) performance was any better there
> > to begin with.
> 
> Just to mention another thing you may want to take into consideration:
> I'm doing a refactor of the RAID56 write path, to make the whole
> sub-stripe/full-stripe write path fit into a single function and follow
> a submit-and-wait model.
> 
> My current plan is to get the refactor merged first (mostly done, running
> the tests now), then the DRMW fix (which would be much smaller and
> simpler to merge after the refactor).

Ok, and now is also a good time for the refactoring or preparatory work.