[2/4] Btrfs: fix data corruption in raid6

Message ID 20171102005405.20420-3-bo.li.liu@oracle.com (mailing list archive)
State New, archived

Commit Message

Liu Bo Nov. 2, 2017, 12:54 a.m. UTC
With the raid6 profile, btrfs can end up with data corruption through
the following steps.

Say we have 5 disks set up with the raid6 profile,

1) mount this btrfs
2) one disk gets pulled out
3) write something to btrfs and sync
4) another disk gets pulled out
5) write something to btrfs and sync
6) umount this btrfs
7) bring the two disks back
8) reboot
9) mount this btrfs

Chances are that mount will fail (even with -odegraded) because of
failures reading metadata blocks; IOW, our raid6 setup is not able to
tolerate two disk failures in some cases.

So it turns out that there is a bug in raid6's recovery code, that is,

if we have one of the stripes in the raid6 layout shown here,

| D1(*)  | D2(*)  | D3 | D4 | D5    |
-------------------------------------
| data1  | data2  | P  | Q  | data0 |

D1 and D2 are the two disks which got pulled out and brought back.
When mount reads data1 and finds out that data1 doesn't match its crc,
btrfs goes to recover data1 from the other stripes; what it does is

1) read data2, parity P, parity Q and data0

2) as we have a valid parity P and two data stripes, it goes to
   recover in raid5 style.

(However, since disk D2 was pulled out and data2 on D2 could be stale,
we still get the wrong data1 from this reconstruction.)

3) btrfs continues to try to reconstruct data1 from parity Q, data2
   and data0, but we still get the wrong data for the same reason.
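
To illustrate the failure mode, here is a minimal userspace C sketch
(not btrfs code): with raid5-style parity P = d0 ^ d1 ^ d2, rebuilding
d1 as P ^ d0 ^ d2 only gives the right answer if the d2 read back is
the same d2 that went into P; a stale d2 silently yields a wrong d1.

#include <stdio.h>

int main(void)
{
	unsigned char d0 = 0xAA, d1 = 0x55, d2 = 0x0F;
	unsigned char p = d0 ^ d1 ^ d2;    /* parity written with the fresh d2 */

	unsigned char d2_stale = 0xF0;     /* D2 missed the last write */
	unsigned char rebuilt = p ^ d0 ^ d2_stale;

	printf("expected d1 = 0x%02x, rebuilt d1 = 0x%02x\n", d1, rebuilt);
	return 0;
}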

The fix here is to take advantage of the device flag 'In_sync':
all data on a device might be stale if 'In_sync' has not been set.

With this, we can rebuild the correct data1 from parity P, parity Q and data0.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
 fs/btrfs/raid56.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Anand Jain Nov. 7, 2017, 8:32 a.m. UTC | #1
On 11/02/2017 08:54 AM, Liu Bo wrote:
> With raid6 profile, btrfs can end up with data corruption by the
> following steps.
> 
> Say we have a 5 disks that are set up with raid6 profile,
> 
> 1) mount this btrfs
> 2) one disk gets pulled out
> 3) write something to btrfs and sync
> 4) another disk gets pulled out
> 5) write something to btrfs and sync
> 6) umount this btrfs
> 7) bring the two disk back
> 8) reboot
> 9) mount this btrfs
> 
> Chances are mount will fail (even with -odegraded) because of failure
> on reading metadata blocks, IOW, our raid6 setup is not able to
> protect two disk failures in some cases.
> 
> So it turns out that there is a bug in raid6's recover code, that is,
> 
> if we have one of stripes in the raid6 layout as shown here,
> 
> | D1(*)  | D2(*)  | D3 | D4 | D5    |
> -------------------------------------
> | data1  | data2  | P  | Q  | data0 |


> D1 and D2 are the two disks which got pulled out and brought back.
> When mount reads data1 and finds out that data1 doesn't match its crc,
> btrfs goes to recover data1 from other stripes, what it'll be doing is
> 
> 1) read data2, parity P, parity Q and data0
>
> 2) as we have a valid parity P and two data stripes, it goes to
>     recover in raid5 style.


> (However, since disk D2 was pulled out and data2 on D2 could be stale,

  data2 should end up with a crc error; if not, then raid5 recovery works
  as expected, OR this example is confusing for explaining the context of
  two data stripes missing.

Thanks, Anand


> we still get the wrong data1 from this reconstruction.)
> 
> 3) btrfs continues to try to reconstruct data1 from parity Q, data2
>     and data0, we still get the wrong one for the same reason.
> 
> The fix here is to take advantage of the device flag, ie. 'In_sync',
> all data on a device might be stale if 'In_sync' has not been set.
 >
> With this, we can build the correct data2 from parity P, Q and data1.
> 
> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
> ---
>   fs/btrfs/raid56.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> index 67262f8..3c0ce61 100644
> --- a/fs/btrfs/raid56.c
> +++ b/fs/btrfs/raid56.c
> @@ -1076,7 +1076,8 @@ static int rbio_add_io_page(struct btrfs_raid_bio *rbio,
>   	disk_start = stripe->physical + (page_index << PAGE_SHIFT);
>   
>   	/* if the device is missing, just fail this stripe */
> -	if (!stripe->dev->bdev)
> +	if (!stripe->dev->bdev ||
> +	    !test_bit(In_sync, &stripe->dev->flags))
>   		return fail_rbio_index(rbio, stripe_nr);
>   
>   	/* see if we can add this page onto our existing bio */
> 
Liu Bo Nov. 8, 2017, 7:53 p.m. UTC | #2
On Tue, Nov 07, 2017 at 04:32:55PM +0800, Anand Jain wrote:
> 
> 
> On 11/02/2017 08:54 AM, Liu Bo wrote:
> >With raid6 profile, btrfs can end up with data corruption by the
> >following steps.
> >
> >Say we have a 5 disks that are set up with raid6 profile,
> >
> >1) mount this btrfs
> >2) one disk gets pulled out
> >3) write something to btrfs and sync
> >4) another disk gets pulled out
> >5) write something to btrfs and sync
> >6) umount this btrfs
> >7) bring the two disk back
> >8) reboot
> >9) mount this btrfs
> >
> >Chances are mount will fail (even with -odegraded) because of failure
> >on reading metadata blocks, IOW, our raid6 setup is not able to
> >protect two disk failures in some cases.
> >
> >So it turns out that there is a bug in raid6's recover code, that is,
> >
> >if we have one of stripes in the raid6 layout as shown here,
> >
> >| D1(*)  | D2(*)  | D3 | D4 | D5    |
> >-------------------------------------
> >| data1  | data2  | P  | Q  | data0 |
> 
> 
> >D1 and D2 are the two disks which got pulled out and brought back.
> >When mount reads data1 and finds out that data1 doesn't match its crc,
> >btrfs goes to recover data1 from other stripes, what it'll be doing is
> >
> >1) read data2, parity P, parity Q and data0
> >
> >2) as we have a valid parity P and two data stripes, it goes to
> >    recover in raid5 style.
> 
> 
> >(However, since disk D2 was pulled out and data2 on D2 could be stale,
> 
>  data2 should end up crc error, if not then raid5 recover is as
>  expected OR this example is confusing to explain the context of
>  two data stipe missing.

The assumption you have is not true; when doing reconstruction, we
don't verify the checksum of each data stripe that is read from disk.

Thanks,

-liubo
> 
> Thanks, Anand
> 
> 
> >we still get the wrong data1 from this reconstruction.)
> >
> >3) btrfs continues to try to reconstruct data1 from parity Q, data2
> >    and data0, we still get the wrong one for the same reason.
> >
> >The fix here is to take advantage of the device flag, ie. 'In_sync',
> >all data on a device might be stale if 'In_sync' has not been set.
> >
> >With this, we can build the correct data2 from parity P, Q and data1.
> >
> >Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
> >---
> >  fs/btrfs/raid56.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> >diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> >index 67262f8..3c0ce61 100644
> >--- a/fs/btrfs/raid56.c
> >+++ b/fs/btrfs/raid56.c
> >@@ -1076,7 +1076,8 @@ static int rbio_add_io_page(struct btrfs_raid_bio *rbio,
> >  	disk_start = stripe->physical + (page_index << PAGE_SHIFT);
> >  	/* if the device is missing, just fail this stripe */
> >-	if (!stripe->dev->bdev)
> >+	if (!stripe->dev->bdev ||
> >+	    !test_bit(In_sync, &stripe->dev->flags))
> >  		return fail_rbio_index(rbio, stripe_nr);
> >  	/* see if we can add this page onto our existing bio */
> >
Anand Jain Nov. 9, 2017, 9:29 a.m. UTC | #3
On 11/09/2017 03:53 AM, Liu Bo wrote:
> On Tue, Nov 07, 2017 at 04:32:55PM +0800, Anand Jain wrote:
>>
>>
>> On 11/02/2017 08:54 AM, Liu Bo wrote:
>>> With raid6 profile, btrfs can end up with data corruption by the
>>> following steps.
>>>
>>> Say we have a 5 disks that are set up with raid6 profile,
>>>
>>> 1) mount this btrfs
>>> 2) one disk gets pulled out
>>> 3) write something to btrfs and sync
>>> 4) another disk gets pulled out
>>> 5) write something to btrfs and sync
>>> 6) umount this btrfs
>>> 7) bring the two disk back
>>> 8) reboot
>>> 9) mount this btrfs
>>>
>>> Chances are mount will fail (even with -odegraded) because of failure
>>> on reading metadata blocks, IOW, our raid6 setup is not able to
>>> protect two disk failures in some cases.
>>>
>>> So it turns out that there is a bug in raid6's recover code, that is,
>>>
>>> if we have one of stripes in the raid6 layout as shown here,
>>>
>>> | D1(*)  | D2(*)  | D3 | D4 | D5    |
>>> -------------------------------------
>>> | data1  | data2  | P  | Q  | data0 |
>>
>>
>>> D1 and D2 are the two disks which got pulled out and brought back.
>>> When mount reads data1 and finds out that data1 doesn't match its crc,
>>> btrfs goes to recover data1 from other stripes, what it'll be doing is
>>>
>>> 1) read data2, parity P, parity Q and data0
>>>
>>> 2) as we have a valid parity P and two data stripes, it goes to
>>>     recover in raid5 style.
>>
>>
>>> (However, since disk D2 was pulled out and data2 on D2 could be stale,
>>
>>   data2 should end up crc error, if not then raid5 recover is as
>>   expected OR this example is confusing to explain the context of
>>   two data stipe missing.
> 
> The assumption you have is not true, 

> when doing reconstruction, we
> don't verify checksum for each data stripe that is read from disk.

  Why? What if data0 (which has the In_sync flag) was wrong?

  I am wary about the In_sync approach. IMO we are losing the
  advantage of the FS being merged with the volume manager - which is to
  know exactly which active stripes were lost for quicker reconstruction.

  Let's say the RAID6 is 50% full when disk1 is pulled out, disk2 is
  pulled out at 75% full, and it reaches 90% full.

  So now when all the disks are back, two disks will not have the
  In_sync flag? Hope I understood this correctly.

  Now:
  15% of the data has lost two stripes.
  25% of the data has lost one stripe.
  50% of the data did not lose any stripe at all.

  Now for read and reconstruct:
  50% of the data does not need any reconstruction.
  25% of the data needs RAID5-style reconstruction as it has lost only
  one stripe.
  15% of the data needs RAID6 reconstruction since it has lost two
  stripes.

  Does the In_sync design here help it work like this?

  Now for RAID1, let's say disk1 lost the first 10% of the stripes and
  disk2 lost the last 10% of the stripes, and these stripes are mutually
  exclusive. That means there is no single stripe which has been lost
  completely.

  There is no code yet where I can see when In_sync will be reset,
  which is fine for this example; assume no reconstruction is done. But
  now, when both disks are mounted, since In_sync will be based on
  dev_item.generation, one of them will get In_sync; to me it looks like
  there will be 10% of stripes which will fail even though that's not
  practically true in this example.


Thanks, Anand



> Thanks,
> 
> -liubo
>>
>> Thanks, Anand
>>
>>
>>> we still get the wrong data1 from this reconstruction.)
>>>
>>> 3) btrfs continues to try to reconstruct data1 from parity Q, data2
>>>     and data0, we still get the wrong one for the same reason.
>>>
>>> The fix here is to take advantage of the device flag, ie. 'In_sync',
>>> all data on a device might be stale if 'In_sync' has not been set.
>>>
>>> With this, we can build the correct data2 from parity P, Q and data1.
>>>
>>> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
>>> ---
>>>   fs/btrfs/raid56.c | 3 ++-
>>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
>>> index 67262f8..3c0ce61 100644
>>> --- a/fs/btrfs/raid56.c
>>> +++ b/fs/btrfs/raid56.c
>>> @@ -1076,7 +1076,8 @@ static int rbio_add_io_page(struct btrfs_raid_bio *rbio,
>>>   	disk_start = stripe->physical + (page_index << PAGE_SHIFT);
>>>   	/* if the device is missing, just fail this stripe */
>>> -	if (!stripe->dev->bdev)
>>> +	if (!stripe->dev->bdev ||
>>> +	    !test_bit(In_sync, &stripe->dev->flags))
>>>   		return fail_rbio_index(rbio, stripe_nr);
>>>   	/* see if we can add this page onto our existing bio */
>>>
Qu Wenruo Nov. 9, 2017, 9:59 a.m. UTC | #4
On 2017年11月09日 17:29, Anand Jain wrote:
> 
> 
> On 11/09/2017 03:53 AM, Liu Bo wrote:
>> On Tue, Nov 07, 2017 at 04:32:55PM +0800, Anand Jain wrote:
>>>
>>>
>>> On 11/02/2017 08:54 AM, Liu Bo wrote:
>>>> With raid6 profile, btrfs can end up with data corruption by the
>>>> following steps.
>>>>
>>>> Say we have a 5 disks that are set up with raid6 profile,
>>>>
>>>> 1) mount this btrfs
>>>> 2) one disk gets pulled out
>>>> 3) write something to btrfs and sync
>>>> 4) another disk gets pulled out
>>>> 5) write something to btrfs and sync
>>>> 6) umount this btrfs
>>>> 7) bring the two disk back
>>>> 8) reboot
>>>> 9) mount this btrfs
>>>>
>>>> Chances are mount will fail (even with -odegraded) because of failure
>>>> on reading metadata blocks, IOW, our raid6 setup is not able to
>>>> protect two disk failures in some cases.
>>>>
>>>> So it turns out that there is a bug in raid6's recover code, that is,
>>>>
>>>> if we have one of stripes in the raid6 layout as shown here,
>>>>
>>>> | D1(*)  | D2(*)  | D3 | D4 | D5    |
>>>> -------------------------------------
>>>> | data1  | data2  | P  | Q  | data0 |
>>>
>>>
>>>> D1 and D2 are the two disks which got pulled out and brought back.
>>>> When mount reads data1 and finds out that data1 doesn't match its crc,
>>>> btrfs goes to recover data1 from other stripes, what it'll be doing is
>>>>
>>>> 1) read data2, parity P, parity Q and data0
>>>>
>>>> 2) as we have a valid parity P and two data stripes, it goes to
>>>>     recover in raid5 style.
>>>
>>>
>>>> (However, since disk D2 was pulled out and data2 on D2 could be stale,
>>>
>>>   data2 should end up crc error, if not then raid5 recover is as
>>>   expected OR this example is confusing to explain the context of
>>>   two data stipe missing.
>>
>> The assumption you have is not true, 
> 
>> when doing reconstruction, we
>> don't verify checksum for each data stripe that is read from disk.

I have a question: why not verify the data stripe's csum when doing
reconstruction?

I know it can be nasty for metadata, since until we fully rebuild the
stripe we won't be able to check if the reconstruction is correct.
(And if the reconstruction failed, we should try to rebuild the block
using other untried methods.)

But for data (with csum), I think it's possible to verify the csum at
reconstruction time, and allow btrfs to better understand which repair
method should be applied.

So at least for data with csum (the easiest case), reconstruction
should be done as follows for the case of D1 corruption:
case D1 corruption:
| D1(*)  | D2(*)  | D3 | D4 | D5    |
-------------------------------------
| data1  | data2  | P  | Q  | data0 |

0) D1 csum mismatch
   Goto repair procedure
1) Read out csum for data0~data2
2) Read out data0~data2, P, Q
3) Verify data0~data2
   And both data1 and data2 fail the check
4) Repair using P, Q and data0
5) Recheck csum
   If data1/2 match their csums, the repair is done well.
   If data1/2 mismatch their csums, return EIO.
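
A minimal C sketch of this decision flow (all names are hypothetical,
and the csum lookup and the P/Q math are stubbed out; this only shows
the order of the checks):

#include <stdbool.h>

struct page_buf {
	unsigned char *data;
	bool csum_ok;		/* step 3: result of the csum-tree check */
};

/* placeholder for the real two-erasure (P + Q) reconstruction */
void rebuild_from_pq(struct page_buf *d1, struct page_buf *d2,
		     const struct page_buf *d0,
		     const unsigned char *p, const unsigned char *q)
{
	(void)d1; (void)d2; (void)d0; (void)p; (void)q;
}

/* placeholder for re-verifying a rebuilt page against its csum */
bool recheck_csum(const struct page_buf *pg)
{
	(void)pg;
	return true;
}

int repair_data1(struct page_buf *d0, struct page_buf *d1,
		 struct page_buf *d2,
		 const unsigned char *p, const unsigned char *q)
{
	if (d1->csum_ok)
		return 0;			/* nothing to repair */

	if (!d2->csum_ok)
		/* step 4: data1 and data2 both bad, use P, Q and data0 */
		rebuild_from_pq(d1, d2, d0, p, q);
	/* else: only data1 is bad, a raid5-style xor with P would do */

	/* step 5: accept the repair only if the result matches its csum */
	return recheck_csum(d1) ? 0 : -5;	/* -EIO */
}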

Thanks,
Qu

> 
>  Why ? what if data0 (which has InSync flag) was wrong ?
> 
>  I am wary about the In_sync approach. IMO we are loosing the
>  advantage of FS being merged with volume manager - which is to
>  know exactly which active stripes were lost for quicker reconstruct.
> 
>  Lets say RAID6 is 50% full and disk1 is pulled out, and at 75%
>  full disk2 is pulled out and reaches 90% full.
> 
>  So now when all the disks are back, two disks will not have
>  In_sync flag? hope I understood this correctly.
> 
>  Now.
>  15% of the data have lost two stripes.
>  25% of the data have lost one stripe.
>  50% of the data did not loose any stripe at all.
> 
>  Now for read and reconstruct..
>  50% of the data does not need any reconstruction.
>  25% of the data needs RAID5 style reconstruction as they have lost only
> one stripe.
>  15% of the data needs RAID6 reconstruction since they have lost two
> stripes.
> 
>  Does InSync design here help to work like this ?
> 
>  Now for RAID1, lets say disk1 lost first 10% of the stripes and
>  disk2 lost the last 10% of the stripes and these stripes are mutually
>  exclusive. That means there is no single stripe which has lost
>  completely.
> 
>  There is no code yet where I can see when the In_sync will be reset,
>  which is fine for this eg. Assume no reconstruction is done. But now
>  when both the disks are being mounted and as the In_sync will be based
>  on dev_item.generation, one of it will get In_sync to me it looks like
>  there will be 10% of stripes which will fail even though its not
>  practically true in this example.
> 
> 
> Thanks, Anand
> 
> 
> 
>> Thanks,
>>
>> -liubo
>>>
>>> Thanks, Anand
>>>
>>>
>>>> we still get the wrong data1 from this reconstruction.)
>>>>
>>>> 3) btrfs continues to try to reconstruct data1 from parity Q, data2
>>>>     and data0, we still get the wrong one for the same reason.
>>>>
>>>> The fix here is to take advantage of the device flag, ie. 'In_sync',
>>>> all data on a device might be stale if 'In_sync' has not been set.
>>>>
>>>> With this, we can build the correct data2 from parity P, Q and data1.
>>>>
>>>> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
>>>> ---
>>>>   fs/btrfs/raid56.c | 3 ++-
>>>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
>>>> index 67262f8..3c0ce61 100644
>>>> --- a/fs/btrfs/raid56.c
>>>> +++ b/fs/btrfs/raid56.c
>>>> @@ -1076,7 +1076,8 @@ static int rbio_add_io_page(struct
>>>> btrfs_raid_bio *rbio,
>>>>       disk_start = stripe->physical + (page_index << PAGE_SHIFT);
>>>>       /* if the device is missing, just fail this stripe */
>>>> -    if (!stripe->dev->bdev)
>>>> +    if (!stripe->dev->bdev ||
>>>> +        !test_bit(In_sync, &stripe->dev->flags))
>>>>           return fail_rbio_index(rbio, stripe_nr);
>>>>       /* see if we can add this page onto our existing bio */
>>>>
Liu Bo Nov. 10, 2017, 12:12 a.m. UTC | #5
On Thu, Nov 09, 2017 at 05:29:25PM +0800, Anand Jain wrote:
> 
> 
> On 11/09/2017 03:53 AM, Liu Bo wrote:
> >On Tue, Nov 07, 2017 at 04:32:55PM +0800, Anand Jain wrote:
> >>
> >>
> >>On 11/02/2017 08:54 AM, Liu Bo wrote:
> >>>With raid6 profile, btrfs can end up with data corruption by the
> >>>following steps.
> >>>
> >>>Say we have a 5 disks that are set up with raid6 profile,
> >>>
> >>>1) mount this btrfs
> >>>2) one disk gets pulled out
> >>>3) write something to btrfs and sync
> >>>4) another disk gets pulled out
> >>>5) write something to btrfs and sync
> >>>6) umount this btrfs
> >>>7) bring the two disk back
> >>>8) reboot
> >>>9) mount this btrfs
> >>>
> >>>Chances are mount will fail (even with -odegraded) because of failure
> >>>on reading metadata blocks, IOW, our raid6 setup is not able to
> >>>protect two disk failures in some cases.
> >>>
> >>>So it turns out that there is a bug in raid6's recover code, that is,
> >>>
> >>>if we have one of stripes in the raid6 layout as shown here,
> >>>
> >>>| D1(*)  | D2(*)  | D3 | D4 | D5    |
> >>>-------------------------------------
> >>>| data1  | data2  | P  | Q  | data0 |
> >>
> >>
> >>>D1 and D2 are the two disks which got pulled out and brought back.
> >>>When mount reads data1 and finds out that data1 doesn't match its crc,
> >>>btrfs goes to recover data1 from other stripes, what it'll be doing is
> >>>
> >>>1) read data2, parity P, parity Q and data0
> >>>
> >>>2) as we have a valid parity P and two data stripes, it goes to
> >>>    recover in raid5 style.
> >>
> >>
> >>>(However, since disk D2 was pulled out and data2 on D2 could be stale,
> >>
> >>  data2 should end up crc error, if not then raid5 recover is as
> >>  expected OR this example is confusing to explain the context of
> >>  two data stipe missing.
> >
> >The assumption you have is not true,
> 
> >when doing reconstruction, we
> >don't verify checksum for each data stripe that is read from disk.
> 
>  Why ? what if data0 (which has InSync flag) was wrong ?
> 

(Writing this took me the whole afternoon, hopefully it makes
sense...)

Checksum verification happens much later, in readpage's end io, and if
data0 is wrong, then the rebuilt data won't match its checksum.

As Qu Wenruo has the same question, here is my input on why
checksum verification is not done:

a) We may have multiple different raid profiles within one btrfs, and
   the checksum tree also resides somewhere on the disks; for
   raid1 (the default) and raid5, two disk failures may end up with
   data loss, and it applies to raid6 as well because of the bug this
   patch is addressing.

b) If this is a metadata stripe, then it's likely that the stale
   content still matches its checksum, as metadata blocks store
   their checksum in the header part, thus checksum verification
   cannot be trusted.

c) If this is a data stripe, then chances are it belongs to
   nocow/nodatasum inodes such that no checksum can be verified.  In
   case the checksum is available, yes, we can finally verify the
   checksum, but keep in mind that the whole data stripe may consist of
   several pieces of data which are from different inodes, some may be
   nodatacow/nodatasum, some may not.

Given all the troubles it could lead to, I'd like to avoid it.

As a raid6 system, 2 disk failures can always be tolerated by
rebuilding data from other good copies.

Now that we know two things, i.e.
a) data1 fails its checksum verification and
b) data2 from an out-of-sync disk might be stale,
it's straightforward to rebuild data1 from the other good copies.


>  I am wary about the In_sync approach. IMO we are loosing the
>  advantage of FS being merged with volume manager - which is to
>  know exactly which active stripes were lost for quicker reconstruct.
>

Thanks for pointing this out, I'd say the filesystem still has some
advantages over a vanilla volume manager when resyncing out-of-sync
disks.

>  Lets say RAID6 is 50% full and disk1 is pulled out, and at 75%
>  full disk2 is pulled out and reaches 90% full.
> 
>  So now when all the disks are back, two disks will not have
>  In_sync flag? hope I understood this correctly.
> 
>  Now.
>  15% of the data have lost two stripes.
>  25% of the data have lost one stripe.
>  50% of the data did not loose any stripe at all.
> 
>  Now for read and reconstruct..
>  50% of the data does not need any reconstruction.

A 'Yes' for this; actually I should have avoided reading from
out-of-sync disks, but I forgot to do so. So as a side effect, if the
data read out from an out-of-sync disk matches its checksum, then it's
proven to be OK.

>  25% of the data needs RAID5 style reconstruction as they have lost only one
> stripe.
>  15% of the data needs RAID6 reconstruction since they have lost two
> stripes.
>

It depends, the stripes on the out-of-sync disks could be
a) all data stripes (as in the example in this patch)
b) one data stripe and one stripe P
c) one data stripe and one stripe Q

with In_sync flag,
for a) raid6 reconstruction,
for b) raid6 reconstruction,
for c) raid5 reconstruction.

>  Does InSync design here help to work like this ?
> 
>  Now for RAID1, lets say disk1 lost first 10% of the stripes and
>  disk2 lost the last 10% of the stripes and these stripes are mutually
>  exclusive. That means there is no single stripe which has lost
>  completely.
>

True, probably we can drop patch 3 to let reads happen on out-of-sync
disks and go through the existing repair routine.

>  There is no code yet where I can see when the In_sync will be reset,
>  which is fine for this eg. Assume no reconstruction is done. But now
>  when both the disks are being mounted and as the In_sync will be based
>  on dev_item.generation, one of it will get In_sync to me it looks like
>  there will be 10% of stripes which will fail even though its not
>  practically true in this example.
>

Not sure if I understand this paragraph, are you talking about raid1
or raid6?

Regarding when to reset In_sync, my naive idea is to utilize scrub
to verify checksums on all content that the filesystem knows about and
do the repair work. If it turns out that everything looks good, then we
know the disk is already resynced, and it's a good time to set In_sync.

Thanks,

-liubo
Qu Wenruo Nov. 10, 2017, 12:54 a.m. UTC | #6
On 2017年11月10日 08:12, Liu Bo wrote:
> On Thu, Nov 09, 2017 at 05:29:25PM +0800, Anand Jain wrote:
>>
>>
>> On 11/09/2017 03:53 AM, Liu Bo wrote:
>>> On Tue, Nov 07, 2017 at 04:32:55PM +0800, Anand Jain wrote:
>>>>
>>>>
>>>> On 11/02/2017 08:54 AM, Liu Bo wrote:
>>>>> With raid6 profile, btrfs can end up with data corruption by the
>>>>> following steps.
>>>>>
>>>>> Say we have a 5 disks that are set up with raid6 profile,
>>>>>
>>>>> 1) mount this btrfs
>>>>> 2) one disk gets pulled out
>>>>> 3) write something to btrfs and sync
>>>>> 4) another disk gets pulled out
>>>>> 5) write something to btrfs and sync
>>>>> 6) umount this btrfs
>>>>> 7) bring the two disk back
>>>>> 8) reboot
>>>>> 9) mount this btrfs
>>>>>
>>>>> Chances are mount will fail (even with -odegraded) because of failure
>>>>> on reading metadata blocks, IOW, our raid6 setup is not able to
>>>>> protect two disk failures in some cases.
>>>>>
>>>>> So it turns out that there is a bug in raid6's recover code, that is,
>>>>>
>>>>> if we have one of stripes in the raid6 layout as shown here,
>>>>>
>>>>> | D1(*)  | D2(*)  | D3 | D4 | D5    |
>>>>> -------------------------------------
>>>>> | data1  | data2  | P  | Q  | data0 |
>>>>
>>>>
>>>>> D1 and D2 are the two disks which got pulled out and brought back.
>>>>> When mount reads data1 and finds out that data1 doesn't match its crc,
>>>>> btrfs goes to recover data1 from other stripes, what it'll be doing is
>>>>>
>>>>> 1) read data2, parity P, parity Q and data0
>>>>>
>>>>> 2) as we have a valid parity P and two data stripes, it goes to
>>>>>    recover in raid5 style.
>>>>
>>>>
>>>>> (However, since disk D2 was pulled out and data2 on D2 could be stale,
>>>>
>>>>  data2 should end up crc error, if not then raid5 recover is as
>>>>  expected OR this example is confusing to explain the context of
>>>>  two data stipe missing.
>>>
>>> The assumption you have is not true,
>>
>>> when doing reconstruction, we
>>> don't verify checksum for each data stripe that is read from disk.
>>
>>  Why ? what if data0 (which has InSync flag) was wrong ?
>>
> 
> (Wrinting these took me the whole afternoon, hopefully it makes
> sense...)
> 
> Checksum verification happens much later, in readpage's end io, and if
> data0 is wrong, then the rebuilt data couldn't match its checksum.
> 
> As Qu Wenruo also has the same question, here is my input about why
> checksum is not done,
> 
> a) We may have mutliple different raid profiles within one btrfs, and
>    checksum tree also resides somewhere on the disks, for
>    raid1(default) and raid5, two disk failures may end up with data
>    loss, it applies to raid6 as well because of the bug this patch is
>    addressing.

However, raid56 is more complex than other profiles, so it even has its
own raid56.[ch].
Although currently it's mostly for the complex RMW cycle, I think it can
also be enhanced to do csum verification at repair time.

> 
> b) If this is a metadata stripe, then it's likely that the stale
>    content still matches with its checksum as metadata blocks store
>    their checksum in the header part, thus checksum verification can
>    not be trusted.

Metadata can be handled later, it's much more complex since (by default)
metadata crosses several pages, which makes it impossible to verify
until we rebuild all related pages.

BTW, if metadata (a tree block) has stale content, it will not match the csum.
The tree block csum covers the whole tree block, so any stale data in the
tree block will cause a mismatched csum.
It's hard to do just because we can't verify whether the rebuilt/original
tree block is good until we rebuild/read the whole tree block.

> 
> c) If this is a data stripe, then chances are they're belonging to
>    nocow/nodatasum inodes such that no checksum could be verified.  In
>    case that checksum is available, yes, finally we can verify
>    checksum, but keep in mind, the whole data stripe may consist of
>    several pieces of data which are from different inodes, some may
>    nodatacow/nodatasum, some may not.

Ignore such case too.

> 
> Given all the troubles it could lead to, I'd like to avoid it.

For the hard parts, I think keeping them untouched until a good solution
is found is completely acceptable.

No need to do it all in one go; step by step should be the correct way.

> 
> As a raid6 system, 2 disks failures can always be tolerated by
> rebuilding data from other good copies.
> 
> Now that we have known two things, i.e.
> a) data1 fails its checksum verification and
> b) data2 from a out-of-sync disk might be stale,
> it's straightforward to rebuild data1 from other good copies.

But the out-of-sync flag is too generic to determine if we can trust the
copy.
Csum here is a much better indicator, which works at page level (for
data) rather than device level.

For the case of 3 out-of-sync devices, do we just treat them as 3
missing devices?

Thanks,
Qu
> 
> 
>>  I am wary about the In_sync approach. IMO we are loosing the
>>  advantage of FS being merged with volume manager - which is to
>>  know exactly which active stripes were lost for quicker reconstruct.
>>
> 
> Thanks for pointing this out, I'd say filesystem still has some
> advantages over a vanilla volume manager while resync-ing out-of-sync
> disks.
> 
>>  Lets say RAID6 is 50% full and disk1 is pulled out, and at 75%
>>  full disk2 is pulled out and reaches 90% full.
>>
>>  So now when all the disks are back, two disks will not have
>>  In_sync flag? hope I understood this correctly.
>>
>>  Now.
>>  15% of the data have lost two stripes.
>>  25% of the data have lost one stripe.
>>  50% of the data did not loose any stripe at all.
>>
>>  Now for read and reconstruct..
>>  50% of the data does not need any reconstruction.
> 
> A 'Yes' for this, actually I should have avoided reading from
> out-of-sync disks, but I forgot to do so. So as a side effect, if the
> data read out from a out-of-sync disk matches its checksum, then it's
> proven to be OK.
> 
>>  25% of the data needs RAID5 style reconstruction as they have lost only one
>> stripe.
>>  15% of the data needs RAID6 reconstruction since they have lost two
>> stripes.
>>
> 
> It depends, the stripes on the out-of-sync disks could be
> a) all data stripes(as the example in this patch)
> b) one data stripe and one stripe P
> c) one data stripe and one stripe Q
> 
> with In_sync flag,
> for a) raid6 reconstruction,
> for b) raid6 reconstruction,
> for c) raid5 reconstruction.
> 
>>  Does InSync design here help to work like this ?
>>
>>  Now for RAID1, lets say disk1 lost first 10% of the stripes and
>>  disk2 lost the last 10% of the stripes and these stripes are mutually
>>  exclusive. That means there is no single stripe which has lost
>>  completely.
>>
> 
> True, probably we can drop patch 3 to let read happen on out-of-sync
> disks and go the existing repair routine.
> 
>>  There is no code yet where I can see when the In_sync will be reset,
>>  which is fine for this eg. Assume no reconstruction is done. But now
>>  when both the disks are being mounted and as the In_sync will be based
>>  on dev_item.generation, one of it will get In_sync to me it looks like
>>  there will be 10% of stripes which will fail even though its not
>>  practically true in this example.
>>
> 
> Not sure if I understand this paragraph, are you talking about raid1
> or raid6?
> 
> Regarding to when to reset In_sync, my naive idea is to utilize scrub
> to verify checksum on all content that filesystem knows and do the
> repair work, If it turns out that everything looks good, then we know
> the disk is already resync, it's a good time to set In_sync.
> 
> Thanks,
> 
> -liubo
Qu Wenruo Nov. 10, 2017, 10:15 a.m. UTC | #7
On 2017年11月10日 08:54, Qu Wenruo wrote:
> 
> 
> On 2017年11月10日 08:12, Liu Bo wrote:
>> On Thu, Nov 09, 2017 at 05:29:25PM +0800, Anand Jain wrote:
>>>
>>>
>>> On 11/09/2017 03:53 AM, Liu Bo wrote:
>>>> On Tue, Nov 07, 2017 at 04:32:55PM +0800, Anand Jain wrote:
>>>>>
>>>>>
>>>>> On 11/02/2017 08:54 AM, Liu Bo wrote:
>>>>>> With raid6 profile, btrfs can end up with data corruption by the
>>>>>> following steps.
>>>>>>
>>>>>> Say we have a 5 disks that are set up with raid6 profile,
>>>>>>
>>>>>> 1) mount this btrfs
>>>>>> 2) one disk gets pulled out
>>>>>> 3) write something to btrfs and sync
>>>>>> 4) another disk gets pulled out
>>>>>> 5) write something to btrfs and sync
>>>>>> 6) umount this btrfs
>>>>>> 7) bring the two disk back
>>>>>> 8) reboot
>>>>>> 9) mount this btrfs
>>>>>>
>>>>>> Chances are mount will fail (even with -odegraded) because of failure
>>>>>> on reading metadata blocks, IOW, our raid6 setup is not able to
>>>>>> protect two disk failures in some cases.
>>>>>>
>>>>>> So it turns out that there is a bug in raid6's recover code, that is,
>>>>>>
>>>>>> if we have one of stripes in the raid6 layout as shown here,
>>>>>>
>>>>>> | D1(*)  | D2(*)  | D3 | D4 | D5    |
>>>>>> -------------------------------------
>>>>>> | data1  | data2  | P  | Q  | data0 |
>>>>>
>>>>>
>>>>>> D1 and D2 are the two disks which got pulled out and brought back.
>>>>>> When mount reads data1 and finds out that data1 doesn't match its crc,
>>>>>> btrfs goes to recover data1 from other stripes, what it'll be doing is
>>>>>>
>>>>>> 1) read data2, parity P, parity Q and data0
>>>>>>
>>>>>> 2) as we have a valid parity P and two data stripes, it goes to
>>>>>>    recover in raid5 style.
>>>>>
>>>>>
>>>>>> (However, since disk D2 was pulled out and data2 on D2 could be stale,
>>>>>
>>>>>  data2 should end up crc error, if not then raid5 recover is as
>>>>>  expected OR this example is confusing to explain the context of
>>>>>  two data stipe missing.
>>>>
>>>> The assumption you have is not true,
>>>
>>>> when doing reconstruction, we
>>>> don't verify checksum for each data stripe that is read from disk.
>>>
>>>  Why ? what if data0 (which has InSync flag) was wrong ?
>>>
>>
>> (Wrinting these took me the whole afternoon, hopefully it makes
>> sense...)
>>
>> Checksum verification happens much later, in readpage's end io, and if
>> data0 is wrong, then the rebuilt data couldn't match its checksum.
>>
>> As Qu Wenruo also has the same question, here is my input about why
>> checksum is not done,
>>
>> a) We may have mutliple different raid profiles within one btrfs, and
>>    checksum tree also resides somewhere on the disks, for
>>    raid1(default) and raid5, two disk failures may end up with data
>>    loss, it applies to raid6 as well because of the bug this patch is
>>    addressing.
> 
> However, raid56 is more complex than other profiles, so it even has its
> own raid56.[ch].
> Although currently it's mostly for the complex RMW cycle, I think it can
> also be enhanced to do csum verification at repair time.

BTW, repair can even be done without verifying csum of data2/data0.

Just try all repair combinations, until the repair result matches the
checksum, or no combination can produce a good repair.

I know this sounds stupid, but at least this means it's possible.

The main point here is that we should at least ensure the repair result
matches the checksum.

This also reminds me that checksum verification should be handled at a
lower level, rather than being done at endio time.

(Not related to this topic, but I also want to implement an optional
csum check hook in the bio, so that cases like dm-integrity on dm-raid1
can recognize the bad and good copies, and even let btrfs reuse device
mapper to do the chunk map, but that's another story.)

Thanks,
Qu

> 
>>
>> b) If this is a metadata stripe, then it's likely that the stale
>>    content still matches with its checksum as metadata blocks store
>>    their checksum in the header part, thus checksum verification can
>>    not be trusted.
> 
> Metadata can be handled later, it's much more complex since (by default)
> metadata crosses several pages, which makes it impossible to verify
> until we rebuild all related pages.
> 
> BTW, if metadata (tree block) has stale content, it will not match the csum.
> Tree block csum is for the whole tree block, so any stale data in the
> whole tree block will cause mismatched csum.
> It's hard to do just because we can't verify if the rebuilt/original
> tree block is good until we rebuild/read the whole tree block.
> 
>>
>> c) If this is a data stripe, then chances are they're belonging to
>>    nocow/nodatasum inodes such that no checksum could be verified.  In
>>    case that checksum is available, yes, finally we can verify
>>    checksum, but keep in mind, the whole data stripe may consist of
>>    several pieces of data which are from different inodes, some may
>>    nodatacow/nodatasum, some may not.
> 
> Ignore such case too.
> 
>>
>> Given all the troubles it could lead to, I'd like to avoid it.
> 
> For hard parts, I think keeping them untouched until good solution is
> found is completely acceptable.
> 
> No need to do it all-in-one, step by step should be the correct way.
> 
>>
>> As a raid6 system, 2 disks failures can always be tolerated by
>> rebuilding data from other good copies.
>>
>> Now that we have known two things, i.e.
>> a) data1 fails its checksum verification and
>> b) data2 from a out-of-sync disk might be stale,
>> it's straightforward to rebuild data1 from other good copies.
> 
> But the out-of-sync flag is too generic to determine if we can trust the
> copy.
> Csum here is a much better indicator, which can provide page level (for
> data) indicator other than device level.
> 
> For case 3 devices out-of-sync, we just treat it as 3 missing devices?
> 
> Thanks,
> Qu
>>
>>
>>>  I am wary about the In_sync approach. IMO we are loosing the
>>>  advantage of FS being merged with volume manager - which is to
>>>  know exactly which active stripes were lost for quicker reconstruct.
>>>
>>
>> Thanks for pointing this out, I'd say filesystem still has some
>> advantages over a vanilla volume manager while resync-ing out-of-sync
>> disks.
>>
>>>  Lets say RAID6 is 50% full and disk1 is pulled out, and at 75%
>>>  full disk2 is pulled out and reaches 90% full.
>>>
>>>  So now when all the disks are back, two disks will not have
>>>  In_sync flag? hope I understood this correctly.
>>>
>>>  Now.
>>>  15% of the data have lost two stripes.
>>>  25% of the data have lost one stripe.
>>>  50% of the data did not loose any stripe at all.
>>>
>>>  Now for read and reconstruct..
>>>  50% of the data does not need any reconstruction.
>>
>> A 'Yes' for this, actually I should have avoided reading from
>> out-of-sync disks, but I forgot to do so. So as a side effect, if the
>> data read out from a out-of-sync disk matches its checksum, then it's
>> proven to be OK.
>>
>>>  25% of the data needs RAID5 style reconstruction as they have lost only one
>>> stripe.
>>>  15% of the data needs RAID6 reconstruction since they have lost two
>>> stripes.
>>>
>>
>> It depends, the stripes on the out-of-sync disks could be
>> a) all data stripes(as the example in this patch)
>> b) one data stripe and one stripe P
>> c) one data stripe and one stripe Q
>>
>> with In_sync flag,
>> for a) raid6 reconstruction,
>> for b) raid6 reconstruction,
>> for c) raid5 reconstruction.
>>
>>>  Does InSync design here help to work like this ?
>>>
>>>  Now for RAID1, lets say disk1 lost first 10% of the stripes and
>>>  disk2 lost the last 10% of the stripes and these stripes are mutually
>>>  exclusive. That means there is no single stripe which has lost
>>>  completely.
>>>
>>
>> True, probably we can drop patch 3 to let read happen on out-of-sync
>> disks and go the existing repair routine.
>>
>>>  There is no code yet where I can see when the In_sync will be reset,
>>>  which is fine for this eg. Assume no reconstruction is done. But now
>>>  when both the disks are being mounted and as the In_sync will be based
>>>  on dev_item.generation, one of it will get In_sync to me it looks like
>>>  there will be 10% of stripes which will fail even though its not
>>>  practically true in this example.
>>>
>>
>> Not sure if I understand this paragraph, are you talking about raid1
>> or raid6?
>>
>> Regarding to when to reset In_sync, my naive idea is to utilize scrub
>> to verify checksum on all content that filesystem knows and do the
>> repair work, If it turns out that everything looks good, then we know
>> the disk is already resync, it's a good time to set In_sync.
>>
>> Thanks,
>>
>> -liubo

Patch

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 67262f8..3c0ce61 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1076,7 +1076,8 @@  static int rbio_add_io_page(struct btrfs_raid_bio *rbio,
 	disk_start = stripe->physical + (page_index << PAGE_SHIFT);
 
 	/* if the device is missing, just fail this stripe */
-	if (!stripe->dev->bdev)
+	if (!stripe->dev->bdev ||
+	    !test_bit(In_sync, &stripe->dev->flags))
 		return fail_rbio_index(rbio, stripe_nr);
 
 	/* see if we can add this page onto our existing bio */