btrfs: raid56: Use correct stolen pages to calculate P/Q
diff mbox

Message ID 20161121085016.7148-1-quwenruo@cn.fujitsu.com
State New
Headers show

Commit Message

Qu Wenruo Nov. 21, 2016, 8:50 a.m. UTC
In the following situation, scrub will calculate wrong parity to
overwrite correct one:

RAID5 full stripe:

Before
|     Dev 1      |     Dev  2     |     Dev 3     |
| Data stripe 1  | Data stripe 2  | Parity Stripe |
--------------------------------------------------- 0
| 0x0000 (Bad)   |     0xcdcd     |     0x0000    |
--------------------------------------------------- 4K
|     0xcdcd     |     0xcdcd     |     0x0000    |
...
|     0xcdcd     |     0xcdcd     |     0x0000    |
--------------------------------------------------- 64K

After scrubbing dev3 only:

|     Dev 1      |     Dev  2     |     Dev 3     |
| Data stripe 1  | Data stripe 2  | Parity Stripe |
--------------------------------------------------- 0
| 0xcdcd (Good)  |     0xcdcd     | 0xcdcd (Bad)  |
--------------------------------------------------- 4K
|     0xcdcd     |     0xcdcd     |     0x0000    |
...
|     0xcdcd     |     0xcdcd     |     0x0000    |
--------------------------------------------------- 64K

The calltrace of such corruption is as following:

scrub_bio_end_io_worker() get called for each extent read out
|- scriub_block_complete()
   |- Data extent csum mismatch
   |- scrub_handle_errored_block
      |- scrub_recheck_block()
         |- scrub_submit_raid56_bio_wait()
            |- raid56_parity_recover()

Now we have a rbio with correct data stripe 1 recovered.
Let's call it "good_rbio".

scrub_parity_check_and_repair()
|- raid56_parity_submit_scrub_rbio()
   |- lock_stripe_add()
   |  |- steal_rbio()
   |     |- Recovered data are steal from "good_rbio", stored into
   |        rbio->stripe_pages[]
   |        Now rbio->bio_pages[] are bad data read from disk.
   |- async_scrub_parity()
      |- scrub_parity_work() (delayed_call to scrub_parity_work)

scrub_parity_work()
|- raid56_parity_scrub_stripe()
   |- validate_rbio_for_parity_scrub()
      |- finish_parity_scrub()
         |- Recalculate parity using *BAD* pages in rbio->bio_pages[]
            So good parity is overwritten with *BAD* one

The fix is to introduce 2 new members, bad_ondisk_a/b, to struct
btrfs_raid_bio, to info scrub code to use correct data pages to
re-calculate parity.

Reported-by: Goffredo Baroncelli <kreijack@inwind.it>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
Thanks to the above hell of delayed function all and damn stupid code
logical, such bug is quite hard to trace.

The damn kernel scrub is already multi-thread, why do such meaningless
delayed function call again and again?

What's wrong with single thread scrub?
We can do thing like in each stripe for raid56 which is easy and
straightforward, only delayed thing is to wake up waiter:

	lock_full_stripe()
	if (!is_parity_stripe()) {
		prepare_data_stripe_bios()
		submit_and_wait_bios()
		if (check_csum() == 0)
			goto out;
	}
	prepare_full_stripe_bios()
	submit_and_wait_bios()

	recover_raid56_stipres();
	prepare_full_stripe_write_bios()
	submit_and_wait_bios()

out:
  	unlock_full_stripe()

We really need to re-work the whole damn scrub code.

Also, we need to enhance btrfs-progs to detect scrub problem(my
submitted offline scrub is good enough for such usage), and tools to
corrupt extents reliably to put it into xfstests test cases.

RAID56 scrub code is neither tested nor well-designed.
---
 fs/btrfs/raid56.c | 29 ++++++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

Comments

Goffredo Baroncelli Nov. 21, 2016, 6:48 p.m. UTC | #1
Hi Qu,

I tested this succefully for RAID5 when doing a scrub (i.e.: I mount a corrupted disks, then I ran "btrfs scrub start ...", then I check the disks).

However if I do a "cat mnt/out.txt" (out.txt is the corrupted file):
1) the system detect that the file is corrupted   (good :) )
2) the system return the correct file content     (good :) )
3) the data on the platter are still wrong        (no good :( )


Enclosed the script which reproduces the problem. Note that:
If I corrupt the data, in the dmesg two time appears a line which says:

[ 3963.763384] BTRFS warning (device loop2): csum failed ino 257 off 0 csum 2280586218 expected csum 3192393815
[ 3963.766927] BTRFS warning (device loop2): csum failed ino 257 off 0 csum 2280586218 expected csum 3192393815

If I corrupt the parity, of course the system doesn't detect the corruption nor try to correct it. But this is the expected behavior.
 
BR
G.Baroncelli



On 2016-11-21 09:50, Qu Wenruo wrote:
> In the following situation, scrub will calculate wrong parity to
> overwrite correct one:
> 
> RAID5 full stripe:
> 
> Before
> |     Dev 1      |     Dev  2     |     Dev 3     |
> | Data stripe 1  | Data stripe 2  | Parity Stripe |
> --------------------------------------------------- 0
> | 0x0000 (Bad)   |     0xcdcd     |     0x0000    |
> --------------------------------------------------- 4K
> |     0xcdcd     |     0xcdcd     |     0x0000    |
> ...
> |     0xcdcd     |     0xcdcd     |     0x0000    |
> --------------------------------------------------- 64K
> 
> After scrubbing dev3 only:
> 
> |     Dev 1      |     Dev  2     |     Dev 3     |
> | Data stripe 1  | Data stripe 2  | Parity Stripe |
> --------------------------------------------------- 0
> | 0xcdcd (Good)  |     0xcdcd     | 0xcdcd (Bad)  |
> --------------------------------------------------- 4K
> |     0xcdcd     |     0xcdcd     |     0x0000    |
> ...
> |     0xcdcd     |     0xcdcd     |     0x0000    |
> --------------------------------------------------- 64K
> 
> The calltrace of such corruption is as following:
> 
> scrub_bio_end_io_worker() get called for each extent read out
> |- scriub_block_complete()
>    |- Data extent csum mismatch
>    |- scrub_handle_errored_block
>       |- scrub_recheck_block()
>          |- scrub_submit_raid56_bio_wait()
>             |- raid56_parity_recover()
> 
> Now we have a rbio with correct data stripe 1 recovered.
> Let's call it "good_rbio".
> 
> scrub_parity_check_and_repair()
> |- raid56_parity_submit_scrub_rbio()
>    |- lock_stripe_add()
>    |  |- steal_rbio()
>    |     |- Recovered data are steal from "good_rbio", stored into
>    |        rbio->stripe_pages[]
>    |        Now rbio->bio_pages[] are bad data read from disk.
>    |- async_scrub_parity()
>       |- scrub_parity_work() (delayed_call to scrub_parity_work)
> 
> scrub_parity_work()
> |- raid56_parity_scrub_stripe()
>    |- validate_rbio_for_parity_scrub()
>       |- finish_parity_scrub()
>          |- Recalculate parity using *BAD* pages in rbio->bio_pages[]
>             So good parity is overwritten with *BAD* one
> 
> The fix is to introduce 2 new members, bad_ondisk_a/b, to struct
> btrfs_raid_bio, to info scrub code to use correct data pages to
> re-calculate parity.
> 
> Reported-by: Goffredo Baroncelli <kreijack@inwind.it>
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> ---
> Thanks to the above hell of delayed function all and damn stupid code
> logical, such bug is quite hard to trace.
> 
> The damn kernel scrub is already multi-thread, why do such meaningless
> delayed function call again and again?
> 
> What's wrong with single thread scrub?
> We can do thing like in each stripe for raid56 which is easy and
> straightforward, only delayed thing is to wake up waiter:
> 
> 	lock_full_stripe()
> 	if (!is_parity_stripe()) {
> 		prepare_data_stripe_bios()
> 		submit_and_wait_bios()
> 		if (check_csum() == 0)
> 			goto out;
> 	}
> 	prepare_full_stripe_bios()
> 	submit_and_wait_bios()
> 
> 	recover_raid56_stipres();
> 	prepare_full_stripe_write_bios()
> 	submit_and_wait_bios()
> 
> out:
>   	unlock_full_stripe()
> 
> We really need to re-work the whole damn scrub code.
> 
> Also, we need to enhance btrfs-progs to detect scrub problem(my
> submitted offline scrub is good enough for such usage), and tools to
> corrupt extents reliably to put it into xfstests test cases.
> 
> RAID56 scrub code is neither tested nor well-designed.
> ---
>  fs/btrfs/raid56.c | 29 ++++++++++++++++++++++++++++-
>  1 file changed, 28 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> index d016d4a..87e3565 100644
> --- a/fs/btrfs/raid56.c
> +++ b/fs/btrfs/raid56.c
> @@ -133,6 +133,16 @@ struct btrfs_raid_bio {
>  	/* second bad stripe (for raid6 use) */
>  	int failb;
>  
> +	/*
> +	 * For steal_rbio, we can steal recovered correct page,
> +	 * but in finish_parity_scrub(), we still use bad on-disk
> +	 * page to calculate parity.
> +	 * Use these members to info finish_parity_scrub() to use
> +	 * correct pages
> +	 */
> +	int bad_ondisk_a;
> +	int bad_ondisk_b;
> +
>  	int scrubp;
>  	/*
>  	 * number of pages needed to represent the full
> @@ -310,6 +320,12 @@ static void steal_rbio(struct btrfs_raid_bio *src, struct btrfs_raid_bio *dest)
>  	if (!test_bit(RBIO_CACHE_READY_BIT, &src->flags))
>  		return;
>  
> +	/* Record recovered stripe number */
> +	if (src->faila != -1)
> +		dest->bad_ondisk_a = src->faila;
> +	if (src->failb != -1)
> +		dest->bad_ondisk_b = src->failb;
> +
>  	for (i = 0; i < dest->nr_pages; i++) {
>  		s = src->stripe_pages[i];
>  		if (!s || !PageUptodate(s)) {
> @@ -998,6 +1014,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
>  	rbio->stripe_npages = stripe_npages;
>  	rbio->faila = -1;
>  	rbio->failb = -1;
> +	rbio->bad_ondisk_a = -1;
> +	rbio->bad_ondisk_b = -1;
>  	atomic_set(&rbio->refs, 1);
>  	atomic_set(&rbio->error, 0);
>  	atomic_set(&rbio->stripes_pending, 0);
> @@ -2352,7 +2370,16 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
>  		void *parity;
>  		/* first collect one page from each data stripe */
>  		for (stripe = 0; stripe < nr_data; stripe++) {
> -			p = page_in_rbio(rbio, stripe, pagenr, 0);
> +
> +			/*
> +			 * Use stolen recovered page other than bad
> +			 * on disk pages
> +			 */
> +			if (stripe == rbio->bad_ondisk_a ||
> +			    stripe == rbio->bad_ondisk_b)
> +				p = rbio_stripe_page(rbio, stripe, pagenr);
> +			else
> +				p = page_in_rbio(rbio, stripe, pagenr, 0);
>  			pointers[stripe] = kmap(p);
>  		}
>  
>
Qu Wenruo Nov. 22, 2016, 12:28 a.m. UTC | #2
At 11/22/2016 02:48 AM, Goffredo Baroncelli wrote:
> Hi Qu,
>
> I tested this succefully for RAID5 when doing a scrub (i.e.: I mount a corrupted disks, then I ran "btrfs scrub start ...", then I check the disks).
>
> However if I do a "cat mnt/out.txt" (out.txt is the corrupted file):
> 1) the system detect that the file is corrupted   (good :) )
> 2) the system return the correct file content     (good :) )
> 3) the data on the platter are still wrong        (no good :( )

Do you mean, read the corrupted data won't repair it?

IIRC that's the designed behavior.

For RAID5/6 read, there are several different mode, like READ_REBUILD or 
SCRUB_PARITY.

I'm not sure for write, but for read it won't write correct data.

So it's a designed behavior if I don't miss something.

Thanks,
Qu

>
>
> Enclosed the script which reproduces the problem. Note that:
> If I corrupt the data, in the dmesg two time appears a line which says:
>
> [ 3963.763384] BTRFS warning (device loop2): csum failed ino 257 off 0 csum 2280586218 expected csum 3192393815
> [ 3963.766927] BTRFS warning (device loop2): csum failed ino 257 off 0 csum 2280586218 expected csum 3192393815
>
> If I corrupt the parity, of course the system doesn't detect the corruption nor try to correct it. But this is the expected behavior.
>
> BR
> G.Baroncelli
>
>
>
> On 2016-11-21 09:50, Qu Wenruo wrote:
>> In the following situation, scrub will calculate wrong parity to
>> overwrite correct one:
>>
>> RAID5 full stripe:
>>
>> Before
>> |     Dev 1      |     Dev  2     |     Dev 3     |
>> | Data stripe 1  | Data stripe 2  | Parity Stripe |
>> --------------------------------------------------- 0
>> | 0x0000 (Bad)   |     0xcdcd     |     0x0000    |
>> --------------------------------------------------- 4K
>> |     0xcdcd     |     0xcdcd     |     0x0000    |
>> ...
>> |     0xcdcd     |     0xcdcd     |     0x0000    |
>> --------------------------------------------------- 64K
>>
>> After scrubbing dev3 only:
>>
>> |     Dev 1      |     Dev  2     |     Dev 3     |
>> | Data stripe 1  | Data stripe 2  | Parity Stripe |
>> --------------------------------------------------- 0
>> | 0xcdcd (Good)  |     0xcdcd     | 0xcdcd (Bad)  |
>> --------------------------------------------------- 4K
>> |     0xcdcd     |     0xcdcd     |     0x0000    |
>> ...
>> |     0xcdcd     |     0xcdcd     |     0x0000    |
>> --------------------------------------------------- 64K
>>
>> The calltrace of such corruption is as following:
>>
>> scrub_bio_end_io_worker() get called for each extent read out
>> |- scriub_block_complete()
>>    |- Data extent csum mismatch
>>    |- scrub_handle_errored_block
>>       |- scrub_recheck_block()
>>          |- scrub_submit_raid56_bio_wait()
>>             |- raid56_parity_recover()
>>
>> Now we have a rbio with correct data stripe 1 recovered.
>> Let's call it "good_rbio".
>>
>> scrub_parity_check_and_repair()
>> |- raid56_parity_submit_scrub_rbio()
>>    |- lock_stripe_add()
>>    |  |- steal_rbio()
>>    |     |- Recovered data are steal from "good_rbio", stored into
>>    |        rbio->stripe_pages[]
>>    |        Now rbio->bio_pages[] are bad data read from disk.
>>    |- async_scrub_parity()
>>       |- scrub_parity_work() (delayed_call to scrub_parity_work)
>>
>> scrub_parity_work()
>> |- raid56_parity_scrub_stripe()
>>    |- validate_rbio_for_parity_scrub()
>>       |- finish_parity_scrub()
>>          |- Recalculate parity using *BAD* pages in rbio->bio_pages[]
>>             So good parity is overwritten with *BAD* one
>>
>> The fix is to introduce 2 new members, bad_ondisk_a/b, to struct
>> btrfs_raid_bio, to info scrub code to use correct data pages to
>> re-calculate parity.
>>
>> Reported-by: Goffredo Baroncelli <kreijack@inwind.it>
>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>> ---
>> Thanks to the above hell of delayed function all and damn stupid code
>> logical, such bug is quite hard to trace.
>>
>> The damn kernel scrub is already multi-thread, why do such meaningless
>> delayed function call again and again?
>>
>> What's wrong with single thread scrub?
>> We can do thing like in each stripe for raid56 which is easy and
>> straightforward, only delayed thing is to wake up waiter:
>>
>> 	lock_full_stripe()
>> 	if (!is_parity_stripe()) {
>> 		prepare_data_stripe_bios()
>> 		submit_and_wait_bios()
>> 		if (check_csum() == 0)
>> 			goto out;
>> 	}
>> 	prepare_full_stripe_bios()
>> 	submit_and_wait_bios()
>>
>> 	recover_raid56_stipres();
>> 	prepare_full_stripe_write_bios()
>> 	submit_and_wait_bios()
>>
>> out:
>>   	unlock_full_stripe()
>>
>> We really need to re-work the whole damn scrub code.
>>
>> Also, we need to enhance btrfs-progs to detect scrub problem(my
>> submitted offline scrub is good enough for such usage), and tools to
>> corrupt extents reliably to put it into xfstests test cases.
>>
>> RAID56 scrub code is neither tested nor well-designed.
>> ---
>>  fs/btrfs/raid56.c | 29 ++++++++++++++++++++++++++++-
>>  1 file changed, 28 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
>> index d016d4a..87e3565 100644
>> --- a/fs/btrfs/raid56.c
>> +++ b/fs/btrfs/raid56.c
>> @@ -133,6 +133,16 @@ struct btrfs_raid_bio {
>>  	/* second bad stripe (for raid6 use) */
>>  	int failb;
>>
>> +	/*
>> +	 * For steal_rbio, we can steal recovered correct page,
>> +	 * but in finish_parity_scrub(), we still use bad on-disk
>> +	 * page to calculate parity.
>> +	 * Use these members to info finish_parity_scrub() to use
>> +	 * correct pages
>> +	 */
>> +	int bad_ondisk_a;
>> +	int bad_ondisk_b;
>> +
>>  	int scrubp;
>>  	/*
>>  	 * number of pages needed to represent the full
>> @@ -310,6 +320,12 @@ static void steal_rbio(struct btrfs_raid_bio *src, struct btrfs_raid_bio *dest)
>>  	if (!test_bit(RBIO_CACHE_READY_BIT, &src->flags))
>>  		return;
>>
>> +	/* Record recovered stripe number */
>> +	if (src->faila != -1)
>> +		dest->bad_ondisk_a = src->faila;
>> +	if (src->failb != -1)
>> +		dest->bad_ondisk_b = src->failb;
>> +
>>  	for (i = 0; i < dest->nr_pages; i++) {
>>  		s = src->stripe_pages[i];
>>  		if (!s || !PageUptodate(s)) {
>> @@ -998,6 +1014,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
>>  	rbio->stripe_npages = stripe_npages;
>>  	rbio->faila = -1;
>>  	rbio->failb = -1;
>> +	rbio->bad_ondisk_a = -1;
>> +	rbio->bad_ondisk_b = -1;
>>  	atomic_set(&rbio->refs, 1);
>>  	atomic_set(&rbio->error, 0);
>>  	atomic_set(&rbio->stripes_pending, 0);
>> @@ -2352,7 +2370,16 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
>>  		void *parity;
>>  		/* first collect one page from each data stripe */
>>  		for (stripe = 0; stripe < nr_data; stripe++) {
>> -			p = page_in_rbio(rbio, stripe, pagenr, 0);
>> +
>> +			/*
>> +			 * Use stolen recovered page other than bad
>> +			 * on disk pages
>> +			 */
>> +			if (stripe == rbio->bad_ondisk_a ||
>> +			    stripe == rbio->bad_ondisk_b)
>> +				p = rbio_stripe_page(rbio, stripe, pagenr);
>> +			else
>> +				p = page_in_rbio(rbio, stripe, pagenr, 0);
>>  			pointers[stripe] = kmap(p);
>>  		}
>>
>>
>
>


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Goffredo Baroncelli Nov. 22, 2016, 6:02 p.m. UTC | #3
On 2016-11-22 01:28, Qu Wenruo wrote:
> 
> 
> At 11/22/2016 02:48 AM, Goffredo Baroncelli wrote:
>> Hi Qu,
>>
>> I tested this succefully for RAID5 when doing a scrub (i.e.: I mount a corrupted disks, then I ran "btrfs scrub start ...", then I check the disks).
>>
>> However if I do a "cat mnt/out.txt" (out.txt is the corrupted file):
>> 1) the system detect that the file is corrupted   (good :) )
>> 2) the system return the correct file content     (good :) )
>> 3) the data on the platter are still wrong        (no good :( )
> 
> Do you mean, read the corrupted data won't repair it?
> 
> IIRC that's the designed behavior.

:O

You are right... I was unaware of that....

So you can add a "tested-by: Goffredo Baroncelli <kreijack@inwind.it>"

BR
G.Baroncelli

> 
> For RAID5/6 read, there are several different mode, like READ_REBUILD or SCRUB_PARITY.
> 
> I'm not sure for write, but for read it won't write correct data.
> 
> So it's a designed behavior if I don't miss something.
> 
> Thanks,
> Qu
> 
>>
>>
>> Enclosed the script which reproduces the problem. Note that:
>> If I corrupt the data, in the dmesg two time appears a line which says:
>>
>> [ 3963.763384] BTRFS warning (device loop2): csum failed ino 257 off 0 csum 2280586218 expected csum 3192393815
>> [ 3963.766927] BTRFS warning (device loop2): csum failed ino 257 off 0 csum 2280586218 expected csum 3192393815
>>
>> If I corrupt the parity, of course the system doesn't detect the corruption nor try to correct it. But this is the expected behavior.
>>
>> BR
>> G.Baroncelli
>>
>>
>>
>> On 2016-11-21 09:50, Qu Wenruo wrote:
>>> In the following situation, scrub will calculate wrong parity to
>>> overwrite correct one:
>>>
>>> RAID5 full stripe:
>>>
>>> Before
>>> |     Dev 1      |     Dev  2     |     Dev 3     |
>>> | Data stripe 1  | Data stripe 2  | Parity Stripe |
>>> --------------------------------------------------- 0
>>> | 0x0000 (Bad)   |     0xcdcd     |     0x0000    |
>>> --------------------------------------------------- 4K
>>> |     0xcdcd     |     0xcdcd     |     0x0000    |
>>> ...
>>> |     0xcdcd     |     0xcdcd     |     0x0000    |
>>> --------------------------------------------------- 64K
>>>
>>> After scrubbing dev3 only:
>>>
>>> |     Dev 1      |     Dev  2     |     Dev 3     |
>>> | Data stripe 1  | Data stripe 2  | Parity Stripe |
>>> --------------------------------------------------- 0
>>> | 0xcdcd (Good)  |     0xcdcd     | 0xcdcd (Bad)  |
>>> --------------------------------------------------- 4K
>>> |     0xcdcd     |     0xcdcd     |     0x0000    |
>>> ...
>>> |     0xcdcd     |     0xcdcd     |     0x0000    |
>>> --------------------------------------------------- 64K
>>>
>>> The calltrace of such corruption is as following:
>>>
>>> scrub_bio_end_io_worker() get called for each extent read out
>>> |- scriub_block_complete()
>>>    |- Data extent csum mismatch
>>>    |- scrub_handle_errored_block
>>>       |- scrub_recheck_block()
>>>          |- scrub_submit_raid56_bio_wait()
>>>             |- raid56_parity_recover()
>>>
>>> Now we have a rbio with correct data stripe 1 recovered.
>>> Let's call it "good_rbio".
>>>
>>> scrub_parity_check_and_repair()
>>> |- raid56_parity_submit_scrub_rbio()
>>>    |- lock_stripe_add()
>>>    |  |- steal_rbio()
>>>    |     |- Recovered data are steal from "good_rbio", stored into
>>>    |        rbio->stripe_pages[]
>>>    |        Now rbio->bio_pages[] are bad data read from disk.
>>>    |- async_scrub_parity()
>>>       |- scrub_parity_work() (delayed_call to scrub_parity_work)
>>>
>>> scrub_parity_work()
>>> |- raid56_parity_scrub_stripe()
>>>    |- validate_rbio_for_parity_scrub()
>>>       |- finish_parity_scrub()
>>>          |- Recalculate parity using *BAD* pages in rbio->bio_pages[]
>>>             So good parity is overwritten with *BAD* one
>>>
>>> The fix is to introduce 2 new members, bad_ondisk_a/b, to struct
>>> btrfs_raid_bio, to info scrub code to use correct data pages to
>>> re-calculate parity.
>>>
>>> Reported-by: Goffredo Baroncelli <kreijack@inwind.it>
>>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>>> ---
>>> Thanks to the above hell of delayed function all and damn stupid code
>>> logical, such bug is quite hard to trace.
>>>
>>> The damn kernel scrub is already multi-thread, why do such meaningless
>>> delayed function call again and again?
>>>
>>> What's wrong with single thread scrub?
>>> We can do thing like in each stripe for raid56 which is easy and
>>> straightforward, only delayed thing is to wake up waiter:
>>>
>>>     lock_full_stripe()
>>>     if (!is_parity_stripe()) {
>>>         prepare_data_stripe_bios()
>>>         submit_and_wait_bios()
>>>         if (check_csum() == 0)
>>>             goto out;
>>>     }
>>>     prepare_full_stripe_bios()
>>>     submit_and_wait_bios()
>>>
>>>     recover_raid56_stipres();
>>>     prepare_full_stripe_write_bios()
>>>     submit_and_wait_bios()
>>>
>>> out:
>>>       unlock_full_stripe()
>>>
>>> We really need to re-work the whole damn scrub code.
>>>
>>> Also, we need to enhance btrfs-progs to detect scrub problem(my
>>> submitted offline scrub is good enough for such usage), and tools to
>>> corrupt extents reliably to put it into xfstests test cases.
>>>
>>> RAID56 scrub code is neither tested nor well-designed.
>>> ---
>>>  fs/btrfs/raid56.c | 29 ++++++++++++++++++++++++++++-
>>>  1 file changed, 28 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
>>> index d016d4a..87e3565 100644
>>> --- a/fs/btrfs/raid56.c
>>> +++ b/fs/btrfs/raid56.c
>>> @@ -133,6 +133,16 @@ struct btrfs_raid_bio {
>>>      /* second bad stripe (for raid6 use) */
>>>      int failb;
>>>
>>> +    /*
>>> +     * For steal_rbio, we can steal recovered correct page,
>>> +     * but in finish_parity_scrub(), we still use bad on-disk
>>> +     * page to calculate parity.
>>> +     * Use these members to info finish_parity_scrub() to use
>>> +     * correct pages
>>> +     */
>>> +    int bad_ondisk_a;
>>> +    int bad_ondisk_b;
>>> +
>>>      int scrubp;
>>>      /*
>>>       * number of pages needed to represent the full
>>> @@ -310,6 +320,12 @@ static void steal_rbio(struct btrfs_raid_bio *src, struct btrfs_raid_bio *dest)
>>>      if (!test_bit(RBIO_CACHE_READY_BIT, &src->flags))
>>>          return;
>>>
>>> +    /* Record recovered stripe number */
>>> +    if (src->faila != -1)
>>> +        dest->bad_ondisk_a = src->faila;
>>> +    if (src->failb != -1)
>>> +        dest->bad_ondisk_b = src->failb;
>>> +
>>>      for (i = 0; i < dest->nr_pages; i++) {
>>>          s = src->stripe_pages[i];
>>>          if (!s || !PageUptodate(s)) {
>>> @@ -998,6 +1014,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
>>>      rbio->stripe_npages = stripe_npages;
>>>      rbio->faila = -1;
>>>      rbio->failb = -1;
>>> +    rbio->bad_ondisk_a = -1;
>>> +    rbio->bad_ondisk_b = -1;
>>>      atomic_set(&rbio->refs, 1);
>>>      atomic_set(&rbio->error, 0);
>>>      atomic_set(&rbio->stripes_pending, 0);
>>> @@ -2352,7 +2370,16 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
>>>          void *parity;
>>>          /* first collect one page from each data stripe */
>>>          for (stripe = 0; stripe < nr_data; stripe++) {
>>> -            p = page_in_rbio(rbio, stripe, pagenr, 0);
>>> +
>>> +            /*
>>> +             * Use stolen recovered page other than bad
>>> +             * on disk pages
>>> +             */
>>> +            if (stripe == rbio->bad_ondisk_a ||
>>> +                stripe == rbio->bad_ondisk_b)
>>> +                p = rbio_stripe_page(rbio, stripe, pagenr);
>>> +            else
>>> +                p = page_in_rbio(rbio, stripe, pagenr, 0);
>>>              pointers[stripe] = kmap(p);
>>>          }
>>>
>>>
>>
>>
> 
> 
>
Chris Mason Nov. 22, 2016, 6:58 p.m. UTC | #4
On 11/21/2016 03:50 AM, Qu Wenruo wrote:
> In the following situation, scrub will calculate wrong parity to
> overwrite correct one:
>
> RAID5 full stripe:
>
> Before
> |     Dev 1      |     Dev  2     |     Dev 3     |
> | Data stripe 1  | Data stripe 2  | Parity Stripe |
> --------------------------------------------------- 0
> | 0x0000 (Bad)   |     0xcdcd     |     0x0000    |
> --------------------------------------------------- 4K
> |     0xcdcd     |     0xcdcd     |     0x0000    |
> ...
> |     0xcdcd     |     0xcdcd     |     0x0000    |
> --------------------------------------------------- 64K
>
> After scrubbing dev3 only:
>
> |     Dev 1      |     Dev  2     |     Dev 3     |
> | Data stripe 1  | Data stripe 2  | Parity Stripe |
> --------------------------------------------------- 0
> | 0xcdcd (Good)  |     0xcdcd     | 0xcdcd (Bad)  |
> --------------------------------------------------- 4K
> |     0xcdcd     |     0xcdcd     |     0x0000    |
> ...
> |     0xcdcd     |     0xcdcd     |     0x0000    |
> --------------------------------------------------- 64K
>
> The calltrace of such corruption is as following:
>
> scrub_bio_end_io_worker() get called for each extent read out
> |- scriub_block_complete()
>    |- Data extent csum mismatch
>    |- scrub_handle_errored_block
>       |- scrub_recheck_block()
>          |- scrub_submit_raid56_bio_wait()
>             |- raid56_parity_recover()
>
> Now we have a rbio with correct data stripe 1 recovered.
> Let's call it "good_rbio".
>
> scrub_parity_check_and_repair()
> |- raid56_parity_submit_scrub_rbio()
>    |- lock_stripe_add()
>    |  |- steal_rbio()
>    |     |- Recovered data are steal from "good_rbio", stored into
>    |        rbio->stripe_pages[]
>    |        Now rbio->bio_pages[] are bad data read from disk.
>    |- async_scrub_parity()
>       |- scrub_parity_work() (delayed_call to scrub_parity_work)
>
> scrub_parity_work()
> |- raid56_parity_scrub_stripe()
>    |- validate_rbio_for_parity_scrub()
>       |- finish_parity_scrub()
>          |- Recalculate parity using *BAD* pages in rbio->bio_pages[]
>             So good parity is overwritten with *BAD* one
>
> The fix is to introduce 2 new members, bad_ondisk_a/b, to struct
> btrfs_raid_bio, to info scrub code to use correct data pages to
> re-calculate parity.
>
> Reported-by: Goffredo Baroncelli <kreijack@inwind.it>
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> ---
> Thanks to the above hell of delayed function all and damn stupid code
> logical, such bug is quite hard to trace.
>
> The damn kernel scrub is already multi-thread, why do such meaningless
> delayed function call again and again?
>
> What's wrong with single thread scrub?
> We can do thing like in each stripe for raid56 which is easy and
> straightforward, only delayed thing is to wake up waiter:
>
> 	lock_full_stripe()
> 	if (!is_parity_stripe()) {
> 		prepare_data_stripe_bios()
> 		submit_and_wait_bios()
> 		if (check_csum() == 0)
> 			goto out;
> 	}
> 	prepare_full_stripe_bios()
> 	submit_and_wait_bios()
>
> 	recover_raid56_stipres();
> 	prepare_full_stripe_write_bios()
> 	submit_and_wait_bios()
>
> out:
>   	unlock_full_stripe()
>
> We really need to re-work the whole damn scrub code.
>
> Also, we need to enhance btrfs-progs to detect scrub problem(my
> submitted offline scrub is good enough for such usage), and tools to
> corrupt extents reliably to put it into xfstests test cases.
>
> RAID56 scrub code is neither tested nor well-designed.

Great description, thanks for tracking this down.

> @@ -2352,7 +2370,16 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
>  		void *parity;
>  		/* first collect one page from each data stripe */
>  		for (stripe = 0; stripe < nr_data; stripe++) {
> -			p = page_in_rbio(rbio, stripe, pagenr, 0);
> +
> +			/*
> +			 * Use stolen recovered page other than bad
> +			 * on disk pages
> +			 */
> +			if (stripe == rbio->bad_ondisk_a ||
> +			    stripe == rbio->bad_ondisk_b)
> +				p = rbio_stripe_page(rbio, stripe, pagenr);
> +			else
> +				p = page_in_rbio(rbio, stripe, pagenr, 0);
>  			pointers[stripe] = kmap(p);
>  		}
>
>

We're changing which pages we kmap() but not which ones we kunmap(). 
Can you please update the kunmap loop to use this pointers array?  Also 
it looks like this kmap is never unmapped.

pointers[stripe++] = kmap(q_page);

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Qu Wenruo Nov. 23, 2016, 12:26 a.m. UTC | #5
At 11/23/2016 02:58 AM, Chris Mason wrote:
> On 11/21/2016 03:50 AM, Qu Wenruo wrote:
>> In the following situation, scrub will calculate wrong parity to
>> overwrite correct one:
>>
>> RAID5 full stripe:
>>
>> Before
>> |     Dev 1      |     Dev  2     |     Dev 3     |
>> | Data stripe 1  | Data stripe 2  | Parity Stripe |
>> --------------------------------------------------- 0
>> | 0x0000 (Bad)   |     0xcdcd     |     0x0000    |
>> --------------------------------------------------- 4K
>> |     0xcdcd     |     0xcdcd     |     0x0000    |
>> ...
>> |     0xcdcd     |     0xcdcd     |     0x0000    |
>> --------------------------------------------------- 64K
>>
>> After scrubbing dev3 only:
>>
>> |     Dev 1      |     Dev  2     |     Dev 3     |
>> | Data stripe 1  | Data stripe 2  | Parity Stripe |
>> --------------------------------------------------- 0
>> | 0xcdcd (Good)  |     0xcdcd     | 0xcdcd (Bad)  |
>> --------------------------------------------------- 4K
>> |     0xcdcd     |     0xcdcd     |     0x0000    |
>> ...
>> |     0xcdcd     |     0xcdcd     |     0x0000    |
>> --------------------------------------------------- 64K
>>
>> The calltrace of such corruption is as following:
>>
>> scrub_bio_end_io_worker() get called for each extent read out
>> |- scriub_block_complete()
>>    |- Data extent csum mismatch
>>    |- scrub_handle_errored_block
>>       |- scrub_recheck_block()
>>          |- scrub_submit_raid56_bio_wait()
>>             |- raid56_parity_recover()
>>
>> Now we have a rbio with correct data stripe 1 recovered.
>> Let's call it "good_rbio".
>>
>> scrub_parity_check_and_repair()
>> |- raid56_parity_submit_scrub_rbio()
>>    |- lock_stripe_add()
>>    |  |- steal_rbio()
>>    |     |- Recovered data are steal from "good_rbio", stored into
>>    |        rbio->stripe_pages[]
>>    |        Now rbio->bio_pages[] are bad data read from disk.
>>    |- async_scrub_parity()
>>       |- scrub_parity_work() (delayed_call to scrub_parity_work)
>>
>> scrub_parity_work()
>> |- raid56_parity_scrub_stripe()
>>    |- validate_rbio_for_parity_scrub()
>>       |- finish_parity_scrub()
>>          |- Recalculate parity using *BAD* pages in rbio->bio_pages[]
>>             So good parity is overwritten with *BAD* one
>>
>> The fix is to introduce 2 new members, bad_ondisk_a/b, to struct
>> btrfs_raid_bio, to info scrub code to use correct data pages to
>> re-calculate parity.
>>
>> Reported-by: Goffredo Baroncelli <kreijack@inwind.it>
>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>> ---
>> Thanks to the above hell of delayed function all and damn stupid code
>> logical, such bug is quite hard to trace.
>>
>> The damn kernel scrub is already multi-thread, why do such meaningless
>> delayed function call again and again?
>>
>> What's wrong with single thread scrub?
>> We can do thing like in each stripe for raid56 which is easy and
>> straightforward, only delayed thing is to wake up waiter:
>>
>>     lock_full_stripe()
>>     if (!is_parity_stripe()) {
>>         prepare_data_stripe_bios()
>>         submit_and_wait_bios()
>>         if (check_csum() == 0)
>>             goto out;
>>     }
>>     prepare_full_stripe_bios()
>>     submit_and_wait_bios()
>>
>>     recover_raid56_stipres();
>>     prepare_full_stripe_write_bios()
>>     submit_and_wait_bios()
>>
>> out:
>>       unlock_full_stripe()
>>
>> We really need to re-work the whole damn scrub code.
>>
>> Also, we need to enhance btrfs-progs to detect scrub problem(my
>> submitted offline scrub is good enough for such usage), and tools to
>> corrupt extents reliably to put it into xfstests test cases.
>>
>> RAID56 scrub code is neither tested nor well-designed.
>
> Great description, thanks for tracking this down.
>
>> @@ -2352,7 +2370,16 @@ static noinline void finish_parity_scrub(struct
>> btrfs_raid_bio *rbio,
>>          void *parity;
>>          /* first collect one page from each data stripe */
>>          for (stripe = 0; stripe < nr_data; stripe++) {
>> -            p = page_in_rbio(rbio, stripe, pagenr, 0);
>> +
>> +            /*
>> +             * Use stolen recovered page other than bad
>> +             * on disk pages
>> +             */
>> +            if (stripe == rbio->bad_ondisk_a ||
>> +                stripe == rbio->bad_ondisk_b)
>> +                p = rbio_stripe_page(rbio, stripe, pagenr);
>> +            else
>> +                p = page_in_rbio(rbio, stripe, pagenr, 0);
>>              pointers[stripe] = kmap(p);
>>          }
>>
>>
>
> We're changing which pages we kmap() but not which ones we kunmap(). Can
> you please update the kunmap loop to use this pointers array?  Also it
> looks like this kmap is never unmapped.

Oh I forget that.
I'll update it soon.

This reminds me, is there any kernel debug option to trace such unmapped 
pages?

Thanks,
Qu

>
> pointers[stripe++] = kmap(q_page);
>
> -chris
>
>


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Zygo Blaxell Nov. 25, 2016, 4:31 a.m. UTC | #6
On Tue, Nov 22, 2016 at 07:02:13PM +0100, Goffredo Baroncelli wrote:
> On 2016-11-22 01:28, Qu Wenruo wrote:
> > 
> > 
> > At 11/22/2016 02:48 AM, Goffredo Baroncelli wrote:
> >> Hi Qu,
> >>
> >> I tested this succefully for RAID5 when doing a scrub (i.e.: I mount a corrupted disks, then I ran "btrfs scrub start ...", then I check the disks).
> >>
> >> However if I do a "cat mnt/out.txt" (out.txt is the corrupted file):
> >> 1) the system detect that the file is corrupted   (good :) )
> >> 2) the system return the correct file content     (good :) )
> >> 3) the data on the platter are still wrong        (no good :( )
> > 
> > Do you mean, read the corrupted data won't repair it?
> > 
> > IIRC that's the designed behavior.
> 
> :O
> 
> You are right... I was unaware of that....

This is correct.

Ordinary reads shouldn't touch corrupt data, they should only read
around it.  Scrubs in read-write mode should write corrected data over
the corrupt data.  Read-only scrubs can only report errors without
correcting them.

Rewriting corrupt data outside of scrub (i.e. on every read) is a
bad idea.  Consider what happens if a RAM controller gets too hot:
checksums start failing randomly, but the data on disk is still OK.
If we tried to fix the bad data on every read, we'd probably just trash
the filesystem in some cases.

This risk mitigation measure does rely on admins taking a machine in this
state down immediately, and also somehow knowing not to start a scrub
while their RAM is failing...which is kind of an annoying requirement
for the admin.

> So you can add a "tested-by: Goffredo Baroncelli <kreijack@inwind.it>"
> 
> BR
> G.Baroncelli
> 
> > 
> > For RAID5/6 read, there are several different mode, like READ_REBUILD or SCRUB_PARITY.
> > 
> > I'm not sure for write, but for read it won't write correct data.
> > 
> > So it's a designed behavior if I don't miss something.
> > 
> > Thanks,
> > Qu
> > 
> >>
> >>
> >> Enclosed the script which reproduces the problem. Note that:
> >> If I corrupt the data, in the dmesg two time appears a line which says:
> >>
> >> [ 3963.763384] BTRFS warning (device loop2): csum failed ino 257 off 0 csum 2280586218 expected csum 3192393815
> >> [ 3963.766927] BTRFS warning (device loop2): csum failed ino 257 off 0 csum 2280586218 expected csum 3192393815
> >>
> >> If I corrupt the parity, of course the system doesn't detect the corruption nor try to correct it. But this is the expected behavior.
> >>
> >> BR
> >> G.Baroncelli
> >>
> >>
> >>
> >> On 2016-11-21 09:50, Qu Wenruo wrote:
> >>> In the following situation, scrub will calculate wrong parity to
> >>> overwrite correct one:
> >>>
> >>> RAID5 full stripe:
> >>>
> >>> Before
> >>> |     Dev 1      |     Dev  2     |     Dev 3     |
> >>> | Data stripe 1  | Data stripe 2  | Parity Stripe |
> >>> --------------------------------------------------- 0
> >>> | 0x0000 (Bad)   |     0xcdcd     |     0x0000    |
> >>> --------------------------------------------------- 4K
> >>> |     0xcdcd     |     0xcdcd     |     0x0000    |
> >>> ...
> >>> |     0xcdcd     |     0xcdcd     |     0x0000    |
> >>> --------------------------------------------------- 64K
> >>>
> >>> After scrubbing dev3 only:
> >>>
> >>> |     Dev 1      |     Dev  2     |     Dev 3     |
> >>> | Data stripe 1  | Data stripe 2  | Parity Stripe |
> >>> --------------------------------------------------- 0
> >>> | 0xcdcd (Good)  |     0xcdcd     | 0xcdcd (Bad)  |
> >>> --------------------------------------------------- 4K
> >>> |     0xcdcd     |     0xcdcd     |     0x0000    |
> >>> ...
> >>> |     0xcdcd     |     0xcdcd     |     0x0000    |
> >>> --------------------------------------------------- 64K
> >>>
> >>> The calltrace of such corruption is as following:
> >>>
> >>> scrub_bio_end_io_worker() get called for each extent read out
> >>> |- scriub_block_complete()
> >>>    |- Data extent csum mismatch
> >>>    |- scrub_handle_errored_block
> >>>       |- scrub_recheck_block()
> >>>          |- scrub_submit_raid56_bio_wait()
> >>>             |- raid56_parity_recover()
> >>>
> >>> Now we have a rbio with correct data stripe 1 recovered.
> >>> Let's call it "good_rbio".
> >>>
> >>> scrub_parity_check_and_repair()
> >>> |- raid56_parity_submit_scrub_rbio()
> >>>    |- lock_stripe_add()
> >>>    |  |- steal_rbio()
> >>>    |     |- Recovered data are steal from "good_rbio", stored into
> >>>    |        rbio->stripe_pages[]
> >>>    |        Now rbio->bio_pages[] are bad data read from disk.
> >>>    |- async_scrub_parity()
> >>>       |- scrub_parity_work() (delayed_call to scrub_parity_work)
> >>>
> >>> scrub_parity_work()
> >>> |- raid56_parity_scrub_stripe()
> >>>    |- validate_rbio_for_parity_scrub()
> >>>       |- finish_parity_scrub()
> >>>          |- Recalculate parity using *BAD* pages in rbio->bio_pages[]
> >>>             So good parity is overwritten with *BAD* one
> >>>
> >>> The fix is to introduce 2 new members, bad_ondisk_a/b, to struct
> >>> btrfs_raid_bio, to info scrub code to use correct data pages to
> >>> re-calculate parity.
> >>>
> >>> Reported-by: Goffredo Baroncelli <kreijack@inwind.it>
> >>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> >>> ---
> >>> Thanks to the above hell of delayed function all and damn stupid code
> >>> logical, such bug is quite hard to trace.
> >>>
> >>> The damn kernel scrub is already multi-thread, why do such meaningless
> >>> delayed function call again and again?
> >>>
> >>> What's wrong with single thread scrub?
> >>> We can do thing like in each stripe for raid56 which is easy and
> >>> straightforward, only delayed thing is to wake up waiter:
> >>>
> >>>     lock_full_stripe()
> >>>     if (!is_parity_stripe()) {
> >>>         prepare_data_stripe_bios()
> >>>         submit_and_wait_bios()
> >>>         if (check_csum() == 0)
> >>>             goto out;
> >>>     }
> >>>     prepare_full_stripe_bios()
> >>>     submit_and_wait_bios()
> >>>
> >>>     recover_raid56_stipres();
> >>>     prepare_full_stripe_write_bios()
> >>>     submit_and_wait_bios()
> >>>
> >>> out:
> >>>       unlock_full_stripe()
> >>>
> >>> We really need to re-work the whole damn scrub code.
> >>>
> >>> Also, we need to enhance btrfs-progs to detect scrub problem(my
> >>> submitted offline scrub is good enough for such usage), and tools to
> >>> corrupt extents reliably to put it into xfstests test cases.
> >>>
> >>> RAID56 scrub code is neither tested nor well-designed.
> >>> ---
> >>>  fs/btrfs/raid56.c | 29 ++++++++++++++++++++++++++++-
> >>>  1 file changed, 28 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> >>> index d016d4a..87e3565 100644
> >>> --- a/fs/btrfs/raid56.c
> >>> +++ b/fs/btrfs/raid56.c
> >>> @@ -133,6 +133,16 @@ struct btrfs_raid_bio {
> >>>      /* second bad stripe (for raid6 use) */
> >>>      int failb;
> >>>
> >>> +    /*
> >>> +     * For steal_rbio, we can steal recovered correct page,
> >>> +     * but in finish_parity_scrub(), we still use bad on-disk
> >>> +     * page to calculate parity.
> >>> +     * Use these members to info finish_parity_scrub() to use
> >>> +     * correct pages
> >>> +     */
> >>> +    int bad_ondisk_a;
> >>> +    int bad_ondisk_b;
> >>> +
> >>>      int scrubp;
> >>>      /*
> >>>       * number of pages needed to represent the full
> >>> @@ -310,6 +320,12 @@ static void steal_rbio(struct btrfs_raid_bio *src, struct btrfs_raid_bio *dest)
> >>>      if (!test_bit(RBIO_CACHE_READY_BIT, &src->flags))
> >>>          return;
> >>>
> >>> +    /* Record recovered stripe number */
> >>> +    if (src->faila != -1)
> >>> +        dest->bad_ondisk_a = src->faila;
> >>> +    if (src->failb != -1)
> >>> +        dest->bad_ondisk_b = src->failb;
> >>> +
> >>>      for (i = 0; i < dest->nr_pages; i++) {
> >>>          s = src->stripe_pages[i];
> >>>          if (!s || !PageUptodate(s)) {
> >>> @@ -998,6 +1014,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
> >>>      rbio->stripe_npages = stripe_npages;
> >>>      rbio->faila = -1;
> >>>      rbio->failb = -1;
> >>> +    rbio->bad_ondisk_a = -1;
> >>> +    rbio->bad_ondisk_b = -1;
> >>>      atomic_set(&rbio->refs, 1);
> >>>      atomic_set(&rbio->error, 0);
> >>>      atomic_set(&rbio->stripes_pending, 0);
> >>> @@ -2352,7 +2370,16 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
> >>>          void *parity;
> >>>          /* first collect one page from each data stripe */
> >>>          for (stripe = 0; stripe < nr_data; stripe++) {
> >>> -            p = page_in_rbio(rbio, stripe, pagenr, 0);
> >>> +
> >>> +            /*
> >>> +             * Use stolen recovered page other than bad
> >>> +             * on disk pages
> >>> +             */
> >>> +            if (stripe == rbio->bad_ondisk_a ||
> >>> +                stripe == rbio->bad_ondisk_b)
> >>> +                p = rbio_stripe_page(rbio, stripe, pagenr);
> >>> +            else
> >>> +                p = page_in_rbio(rbio, stripe, pagenr, 0);
> >>>              pointers[stripe] = kmap(p);
> >>>          }
> >>>
> >>>
> >>
> >>
> > 
> > 
> > 
> 
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
>
Gareth Pye Nov. 25, 2016, 4:40 a.m. UTC | #7
On Fri, Nov 25, 2016 at 3:31 PM, Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> This risk mitigation measure does rely on admins taking a machine in this
> state down immediately, and also somehow knowing not to start a scrub
> while their RAM is failing...which is kind of an annoying requirement
> for the admin.

Attempting to detect if RAM is bad when scrub starts is both time
consuming and not very reliable right.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Zygo Blaxell Nov. 25, 2016, 5:07 a.m. UTC | #8
On Fri, Nov 25, 2016 at 03:40:36PM +1100, Gareth Pye wrote:
> On Fri, Nov 25, 2016 at 3:31 PM, Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> >
> > This risk mitigation measure does rely on admins taking a machine in this
> > state down immediately, and also somehow knowing not to start a scrub
> > while their RAM is failing...which is kind of an annoying requirement
> > for the admin.
> 
> Attempting to detect if RAM is bad when scrub starts is both time
> consuming and not very reliable right.

RAM, like all hardware, could fail at any time, and a scrub could already
be running when it happens.  This is annoying but also a fact of life that
admins have to deal with.

Testing RAM before scrub starts is not more beneficial than testing RAM
at random intervals--but if you are testing RAM at random intervals,
why not do it at the same intervals as scrub?

If I see corruption errors showing up in stats, I will do a basic sanity
test to make sure they're coming from the storage layer and not somewhere
closer to the CPU.  If all errors come from one device and there are clear
log messages showing SCSI device errors and the SMART log matches the
other data, RAM is probably not the root case of failures, so scrub away.

If normally reliable programs like /bin/sh start randomly segfaulting,
there's smoke pouring out of the back of the machine, all the disks are
full of csum failures, and the BIOS welcome message has spelling errors
that weren't there before, I would *not* start a scrub.  More like
turn the machine off, take it apart, test all the pieces separately,
and only do a scrub after everything above the storage layer had been
replaced or recertified.  I certainly wouldn't want the filesystem to
try to fix the csum failures it finds in such situations.
Goffredo Baroncelli Nov. 26, 2016, 1:12 p.m. UTC | #9
On 2016-11-25 05:31, Zygo Blaxell wrote:
>>> Do you mean, read the corrupted data won't repair it?
>>>
>>> IIRC that's the designed behavior.
>> :O
>>
>> You are right... I was unaware of that....
> This is correct.
> 
> Ordinary reads shouldn't touch corrupt data, they should only read
> around it.  Scrubs in read-write mode should write corrected data over
> the corrupt data.  Read-only scrubs can only report errors without
> correcting them.
> 
> Rewriting corrupt data outside of scrub (i.e. on every read) is a
> bad idea.  Consider what happens if a RAM controller gets too hot:
> checksums start failing randomly, but the data on disk is still OK.
> If we tried to fix the bad data on every read, we'd probably just trash
> the filesystem in some cases.



I cant agree. If the filesystem is mounted read-only this behavior may be correct; bur in others cases I don't see any reason to not correct wrong data even in the read case. If your ram is unreliable you have big problem anyway.

The likelihood that the data contained in a disk is "corrupted" is higher than the likelihood that the RAM is bad.

BTW Btrfs in RAID1 mode corrects the data even in the read case. So I am still convinced that is the RAID5/6 behavior "strange".

BR
G.Baroncelli
Chris Mason Nov. 26, 2016, 5:18 p.m. UTC | #10
On 11/22/2016 07:26 PM, Qu Wenruo wrote:
>
>>
>> We're changing which pages we kmap() but not which ones we kunmap(). Can
>> you please update the kunmap loop to use this pointers array?  Also it
>> looks like this kmap is never unmapped.
>
> Oh I forget that.
> I'll update it soon.

Thanks!

>
> This reminds me, is there any kernel debug option to trace such unmapped
> pages?
>

I don't think so, which is surprising.  It explodes so quickly on 32 bit 
machines, its easiest to boot it on a 32 bit qemu.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Zygo Blaxell Nov. 26, 2016, 6:54 p.m. UTC | #11
On Sat, Nov 26, 2016 at 02:12:56PM +0100, Goffredo Baroncelli wrote:
> On 2016-11-25 05:31, Zygo Blaxell wrote:
> >>> Do you mean, read the corrupted data won't repair it?
> >>>
> >>> IIRC that's the designed behavior.
> >> :O
> >>
> >> You are right... I was unaware of that....
> > This is correct.
> > 
> > Ordinary reads shouldn't touch corrupt data, they should only read
> > around it.  Scrubs in read-write mode should write corrected data over
> > the corrupt data.  Read-only scrubs can only report errors without
> > correcting them.
> > 
> > Rewriting corrupt data outside of scrub (i.e. on every read) is a
> > bad idea.  Consider what happens if a RAM controller gets too hot:
> > checksums start failing randomly, but the data on disk is still OK.
> > If we tried to fix the bad data on every read, we'd probably just trash
> > the filesystem in some cases.
> 
> 
> 
> I cant agree. If the filesystem is mounted read-only this behavior may
> be correct; bur in others cases I don't see any reason to not correct
> wrong data even in the read case. If your ram is unreliable you have
> big problem anyway.

If you don't like RAM corruption, pick any other failure mode.  Laptops
have to deal with things like vibration and temperature extremes which
produce the same results (spurious csum failures and IO errors under
conditions where writing will only destroy data that would otherwise
be recoverable).

> The likelihood that the data contained in a disk is "corrupted" is
> higher than the likelihood that the RAM is bad.
>
> BTW Btrfs in RAID1 mode corrects the data even in the read case. So

Have you tested this?  I think you'll find that it doesn't.

> I am still convinced that is the RAID5/6 behavior "strange".
> 
> BR
> G.Baroncelli
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
>
Goffredo Baroncelli Nov. 26, 2016, 11:16 p.m. UTC | #12
On 2016-11-26 19:54, Zygo Blaxell wrote:
> On Sat, Nov 26, 2016 at 02:12:56PM +0100, Goffredo Baroncelli wrote:
>> On 2016-11-25 05:31, Zygo Blaxell wrote:
[...]
>>
>> BTW Btrfs in RAID1 mode corrects the data even in the read case. So
> 
> Have you tested this?  I think you'll find that it doesn't.

Yes I tested it; and it does the rebuild automatically.
I corrupted a disk of mirror, then I read the related file. The log  says:

[   59.287748] BTRFS warning (device vdb): csum failed ino 257 off 0 csum 12813760 expected csum 3114703128
[   59.291542] BTRFS warning (device vdb): csum failed ino 257 off 0 csum 12813760 expected csum 3114703128
[   59.294950] BTRFS info (device vdb): read error corrected: ino 257 off 0 (dev /dev/vdb sector 2154496)
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

IIRC In case of RAID5/6 the last line is missing. However in both the case the data returned is good; but in RAID1 the data is corrected also on the disk.

Where you read that the data is not rebuild automatically ?

In fact I was surprised that RAID5/6 behaves differently....

> 
>> I am still convinced that is the RAID5/6 behavior "strange".
>>
>> BR
>> G.Baroncelli
>> -- 
>> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
>>
Zygo Blaxell Nov. 27, 2016, 4:53 p.m. UTC | #13
On Sun, Nov 27, 2016 at 12:16:34AM +0100, Goffredo Baroncelli wrote:
> On 2016-11-26 19:54, Zygo Blaxell wrote:
> > On Sat, Nov 26, 2016 at 02:12:56PM +0100, Goffredo Baroncelli wrote:
> >> On 2016-11-25 05:31, Zygo Blaxell wrote:
> [...]
> >>
> >> BTW Btrfs in RAID1 mode corrects the data even in the read case. So
> > 
> > Have you tested this?  I think you'll find that it doesn't.
> 
> Yes I tested it; and it does the rebuild automatically.
> I corrupted a disk of mirror, then I read the related file. The log  says:
> 
> [   59.287748] BTRFS warning (device vdb): csum failed ino 257 off 0 csum 12813760 expected csum 3114703128
> [   59.291542] BTRFS warning (device vdb): csum failed ino 257 off 0 csum 12813760 expected csum 3114703128
> [   59.294950] BTRFS info (device vdb): read error corrected: ino 257 off 0 (dev /dev/vdb sector 2154496)
>                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

> IIRC In case of RAID5/6 the last line is missing. However in both the
> case the data returned is good; but in RAID1 the data is corrected
> also on the disk.
> 
> Where you read that the data is not rebuild automatically ?

Experience?  I have real disk failures all the time.  Errors on RAID1
arrays persist until scrubbed.

No, wait... _transid_ errors always persist until scrubbed.  csum failures
are rewritten in repair_io_failure.  There is a comment earlier in
repair_io_failure that rewrite in RAID56 is not supported yet.

> In fact I was surprised that RAID5/6 behaves differently....

The difference is surprising, no matter which strategy you believe
is correct.  ;)

> >> I am still convinced that is the RAID5/6 behavior "strange".
> >>
> >> BR
> >> G.Baroncelli
> >> -- 
> >> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> >> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
> >>
> 
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
>
Qu Wenruo Nov. 28, 2016, 12:40 a.m. UTC | #14
At 11/27/2016 07:16 AM, Goffredo Baroncelli wrote:
> On 2016-11-26 19:54, Zygo Blaxell wrote:
>> On Sat, Nov 26, 2016 at 02:12:56PM +0100, Goffredo Baroncelli wrote:
>>> On 2016-11-25 05:31, Zygo Blaxell wrote:
> [...]
>>>
>>> BTW Btrfs in RAID1 mode corrects the data even in the read case. So
>>
>> Have you tested this?  I think you'll find that it doesn't.
>
> Yes I tested it; and it does the rebuild automatically.
> I corrupted a disk of mirror, then I read the related file. The log  says:
>
> [   59.287748] BTRFS warning (device vdb): csum failed ino 257 off 0 csum 12813760 expected csum 3114703128
> [   59.291542] BTRFS warning (device vdb): csum failed ino 257 off 0 csum 12813760 expected csum 3114703128
> [   59.294950] BTRFS info (device vdb): read error corrected: ino 257 off 0 (dev /dev/vdb sector 2154496)
>                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> IIRC In case of RAID5/6 the last line is missing. However in both the case the data returned is good; but in RAID1 the data is corrected also on the disk.
>
> Where you read that the data is not rebuild automatically ?
>
> In fact I was surprised that RAID5/6 behaves differently....
>

Yes, I also tried that and realized that RAID1 is recovering corrupted 
data at *READ* time.

The main difference between RAID1 and RAID56 seems to be the complexity.

For RAID56, we have different read/write behavior, for read, we use flag 
BTRFS_RBIO_READ_REBUILD, which will only rebuild data but not write them 
into disk.
And I'm a little concern about the race between read time fix and write.

I assume it's possible to change the behavior to follow RAID1, but I'd 
like to do it in the following steps:
1) Fix known RAID56 bugs
    With the v3 patch and previous 2 patches, it seems OK now.
2) Full fstests test case, with all possible corruption combination
    (WIP)
3) Rework current RAID56 code to a cleaner and more readable status
    (long term)
4) Add the support to fix things at read time.

So the behavior change is not something we will see in short time.

Thanks,
Qu
>>
>>> I am still convinced that is the RAID5/6 behavior "strange".
>>>
>>> BR
>>> G.Baroncelli
>>> --
>>> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
>>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
>>>
>
>


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Anton Mitterer Nov. 28, 2016, 3:37 a.m. UTC | #15
On Sat, 2016-11-26 at 14:12 +0100, Goffredo Baroncelli wrote:
> I cant agree. If the filesystem is mounted read-only this behavior
> may be correct; bur in others cases I don't see any reason to not
> correct wrong data even in the read case. If your ram is unreliable
> you have big problem anyway.

I'd agree with that - more or less.

If the memory is broken, then even without repairing (on read) a
filesystem that is written to will likely be further corrupted.


I think for safety it's best to repair as early as possible (and thus
on read when a damage is detected), as further  blocks/devices may fail
till eventually a scrub(with repair) would be run manually.

However, there may some workloads under which such auto-repair is
undesirable as it may cost performance and safety may be less important
than that.

Thus I think, there should be a mount-option that let users control
whether repair should happen on normal reads or not... and this should
IMO be independent of whether the fs was mounted ro or rw.
I'd say the default should go for data safety (i.e. repair as soon as
possible).


Cheers,
Chris.
Andrei Borzenkov Nov. 28, 2016, 3:53 a.m. UTC | #16
28.11.2016 06:37, Christoph Anton Mitterer пишет:
> On Sat, 2016-11-26 at 14:12 +0100, Goffredo Baroncelli wrote:
>> I cant agree. If the filesystem is mounted read-only this behavior
>> may be correct; bur in others cases I don't see any reason to not
>> correct wrong data even in the read case. If your ram is unreliable
>> you have big problem anyway.
> 
> I'd agree with that - more or less.
> 
> If the memory is broken, then even without repairing (on read) a
> filesystem that is written to will likely be further corrupted.
> 
> 
> I think for safety it's best to repair as early as possible (and thus
> on read when a damage is detected), as further  blocks/devices may fail
> till eventually a scrub(with repair) would be run manually.
> 
> However, there may some workloads under which such auto-repair is
> undesirable as it may cost performance and safety may be less important
> than that.
> 
> Thus I think, there should be a mount-option that let users control
> whether repair should happen on normal reads or not... and this should
> IMO be independent of whether the fs was mounted ro or rw.

If you allow any write to filesystem before resuming from hibernation
you risk corrupted filesystem. I strongly believe that "ro" must be
really read-only - having separate option to control it will require
update of every tool that generates initramfs and even then you cannot
avoid using older initramfs with newer kernel.

> I'd say the default should go for data safety (i.e. repair as soon as
> possible).
> 

Safety means exactly opposite for me - do not modify data if explicitly
requested not to do it. Having filesystem in half-read-only state will
be too error prone.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Anton Mitterer Nov. 28, 2016, 4:01 a.m. UTC | #17
On Mon, 2016-11-28 at 06:53 +0300, Andrei Borzenkov wrote:
> If you allow any write to filesystem before resuming from hibernation
> you risk corrupted filesystem. I strongly believe that "ro" must be
> really read-only

You're aware that "ro" already doesn't mean "no changes to the block
device" on most modern filesystems (including btrfs)?


> - having separate option to control it will require
> update of every tool that generates initramfs and even then you
> cannot
> avoid using older initramfs with newer kernel.

What would the initramfs have to deal with it? Isn't the whole
repairing thing completely transparent (unless perhaps you do some lo-
level forensics or access the fs directly and not throught the
filesystem driver)?


> Safety means exactly opposite for me - do not modify data if
> explicitly
> requested not to do it.

Depends on what you mean by data...
If you mean "the files as exported in the file hierarchy"... then there
won't be any modifications by repair (after all it just repairs and
gives back the correct data or fails)...


Cheers,
Chris.
Goffredo Baroncelli Nov. 28, 2016, 6:32 p.m. UTC | #18
On 2016-11-28 04:37, Christoph Anton Mitterer wrote:
> I think for safety it's best to repair as early as possible (and thus
> on read when a damage is detected), as further  blocks/devices may fail
> till eventually a scrub(with repair) would be run manually.
> 
> However, there may some workloads under which such auto-repair is
> undesirable as it may cost performance and safety may be less important
> than that.

I am assuming that a corruption is a quite rare event. So occasionally it could happens that a page is corrupted and the system corrects it. This shouldn't  have an impact on the workloads.

BR
G.Baroncelli
Goffredo Baroncelli Nov. 28, 2016, 6:45 p.m. UTC | #19
On 2016-11-28 01:40, Qu Wenruo wrote:
> 
> At 11/27/2016 07:16 AM, Goffredo Baroncelli wrote:
>> On 2016-11-26 19:54, Zygo Blaxell wrote:
>>> On Sat, Nov 26, 2016 at 02:12:56PM +0100, Goffredo Baroncelli wrote:
>>>> On 2016-11-25 05:31, Zygo Blaxell wrote:
>> [...]
>>>>
>>>> BTW Btrfs in RAID1 mode corrects the data even in the read case. So
>>>
>>> Have you tested this?  I think you'll find that it doesn't.
>>
>> Yes I tested it; and it does the rebuild automatically.
>> I corrupted a disk of mirror, then I read the related file. The log  says:
>>
>> [   59.287748] BTRFS warning (device vdb): csum failed ino 257 off 0 csum 12813760 expected csum 3114703128
>> [   59.291542] BTRFS warning (device vdb): csum failed ino 257 off 0 csum 12813760 expected csum 3114703128
>> [   59.294950] BTRFS info (device vdb): read error corrected: ino 257 off 0 (dev /dev/vdb sector 2154496)
>>                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>
>> IIRC In case of RAID5/6 the last line is missing. However in both the case the data returned is good; but in RAID1 the data is corrected also on the disk.
>>
>> Where you read that the data is not rebuild automatically ?
>>
>> In fact I was surprised that RAID5/6 behaves differently....
>>
> 
> Yes, I also tried that and realized that RAID1 is recovering corrupted data at *READ* time.
> 
> The main difference between RAID1 and RAID56 seems to be the complexity.
> 
> For RAID56, we have different read/write behavior, for read, we use flag BTRFS_RBIO_READ_REBUILD, which will only rebuild data but not write them into disk.
> And I'm a little concern about the race between read time fix and write.
> 
> I assume it's possible to change the behavior to follow RAID1, but I'd like to do it in the following steps:
> 1) Fix known RAID56 bugs
>    With the v3 patch and previous 2 patches, it seems OK now.
> 2) Full fstests test case, with all possible corruption combination
>    (WIP)
> 3) Rework current RAID56 code to a cleaner and more readable status
>    (long term)
> 4) Add the support to fix things at read time.
> 
> So the behavior change is not something we will see in short time.

+1
I am understanding that the status of RAID5/6 code is so badly that we need to correct all the more critical bugs and then increase the tests for not regression.
On the point 3, I don't know the code well enough to say something, the code is very complex.
I see the point 4 as the less urgent. 

Let me to make a request: I would like to know your opinion about my email "RFC: raid with a variable stripe size", which started a little thread. I am asking this because you now have the hands on this code: my suggestion (use different BG, with different stripe size to avoid RMW cycles) or the Zygo's one (don't fill a stripe if you don't need to avoid RMW cycles) are difficult to implement ?

BR
G.Baroncelli


> 
> Thanks,
> Qu
Christoph Anton Mitterer Nov. 28, 2016, 7 p.m. UTC | #20
On Mon, 2016-11-28 at 19:32 +0100, Goffredo Baroncelli wrote:
> I am assuming that a corruption is a quite rare event. So
> occasionally it could happens that a page is corrupted and the system
> corrects it. This shouldn't  have an impact on the workloads.

Probably, but it still make sense to make it configurable, especially
as an option like this would be needed to make a btrfs truly non-
writing to the device again...

The ro mount option just says that the files/permissions/etc. don't
change - not that any internas doesn't change, so it would IMO be an
abuse if ro would be used to disable repairs.

And some people simply may want that - and if it's just to rule out the
possibility of corruptions on a ro-fs in case of bad memory ;)

Cheers,
Chris.
Christoph Anton Mitterer Nov. 28, 2016, 7:01 p.m. UTC | #21
On Mon, 2016-11-28 at 19:45 +0100, Goffredo Baroncelli wrote:
> I am understanding that the status of RAID5/6 code is so badly
Just some random thought:

If the code for raid56 is really as bad as it's often claimed (I
haven't read it, to be honest) .... could it perhaps make sense to
consider to start it from scratch? And/or merge it with a more generic
approach that allows n-way-parity RAIDs (I think such patch was posted
hear some year(s) ago).


Cheers,
Chris.
Austin S. Hemmelgarn Nov. 28, 2016, 7:39 p.m. UTC | #22
On 2016-11-28 14:01, Christoph Anton Mitterer wrote:
> On Mon, 2016-11-28 at 19:45 +0100, Goffredo Baroncelli wrote:
>> I am understanding that the status of RAID5/6 code is so badly
> Just some random thought:
>
> If the code for raid56 is really as bad as it's often claimed (I
> haven't read it, to be honest) .... could it perhaps make sense to
> consider to start it from scratch? And/or merge it with a more generic
> approach that allows n-way-parity RAIDs (I think such patch was posted
> hear some year(s) ago).
Such a suggestion for higher order parity support was made some time ago 
(at least a year and a half I believe, probably more).  It was at the 
time stated that it was planned after n-way replication and raid5/6.

Personally I feel that sort of road-map is misguided, as all three are 
functionally interdependent as the original proposal had suggested, and 
I'm also of the opinion that the raid5/6 code probably should be redone 
from scratch (I wasn't the one who wrote it, and can't contribute much 
more than some attempts at testing as a third-party, so I obviously 
can't make that decision myself).  Doing so would likely make it much 
easier to implement higher order parity (and arbitrary 
striping/replication, which is something I'm _very_ interested in).  The 
existing code isn't bad in a stylistic or even traditional coding sense, 
but makes a significant number of poor choices in the high-level side of 
things (all the issues with parity for example).  If we just want 
working raid5/6, then working on the existing implementation is fine. 
If we want support for arbitrary combinations of 
striping/replication/parity (which would be a killer feature and 
something that BTRFS could actually say nothing else has), then 
rewriting from scratch is probably easier because of some of the 
low-level design choices made in the raid5/6 code.

Part of the problem though is that there are more admins and support 
focused people than coders involved.  In my case I very much do not have 
the expertise to work on kernel code beyond tweaking constants and 
twiddling default values for things (I'm technically a programmer, but 
90% of the work I do is sysops type stuff written in a mix of sh, 
Python, and about 7 different data serialization formats).
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Zygo Blaxell Nov. 28, 2016, 9:48 p.m. UTC | #23
On Mon, Nov 28, 2016 at 07:32:38PM +0100, Goffredo Baroncelli wrote:
> On 2016-11-28 04:37, Christoph Anton Mitterer wrote:
> > I think for safety it's best to repair as early as possible (and thus
> > on read when a damage is detected), as further  blocks/devices may fail
> > till eventually a scrub(with repair) would be run manually.
> > 
> > However, there may some workloads under which such auto-repair is
> > undesirable as it may cost performance and safety may be less important
> > than that.
> 
> I am assuming that a corruption is a quite rare event. So occasionally
> it could happens that a page is corrupted and the system corrects
> it. This shouldn't  have an impact on the workloads.

Depends heavily on the specifics of the failure case.  If a drive's
embedded controller RAM fails, you get corruption on the majority of
reads from a single disk, and most writes will be corrupted (even if they
were not before).  If there's a transient failure due to environmental
issues (e.g. short-term high-amplitude vibration or overheating) then
writes may pause for mechanical retry loops.  If there is bitrot in SSDs
(particularly in the address translation tables) it looks like a wall
of random noise that only ends when the disk goes offline.  You can get
combinations of these (e.g. RAM failures caused by transient overheating)
where the drive's behavior changes over time.

When in doubt, don't write.

> BR
> G.Baroncelli
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
>
Christoph Anton Mitterer Nov. 29, 2016, 1:52 a.m. UTC | #24
On Mon, 2016-11-28 at 16:48 -0500, Zygo Blaxell wrote:
> If a drive's
> embedded controller RAM fails, you get corruption on the majority of
> reads from a single disk, and most writes will be corrupted (even if
> they
> were not before).

Administrating a multi-PiB Tier-2 for the LHC Computing Grid with quite
a number of disks for nearly 10 years now, I'd have never stumbled on
such a case of breakage so far...

Actually most cases are as simple as HDD fails to work and this is
properly signalled to the controller.



> If there's a transient failure due to environmental
> issues (e.g. short-term high-amplitude vibration or overheating) then
> writes may pause for mechanical retry loops.  If there is bitrot in
> SSDs
> (particularly in the address translation tables) it looks like a wall
> of random noise that only ends when the disk goes offline.  You can
> get
> combinations of these (e.g. RAM failures caused by transient
> overheating)
> where the drive's behavior changes over time.
> 
> When in doubt, don't write.

Sorry, but these cases as any cases of memory issues (be it main memory
or HDD controller) would also kick in at any normal writes.

So there's no point in protecting against this on the storage side...

Either never write at all... or have good backups for these rare cases.



Cheers,
Chris.
Zygo Blaxell Nov. 29, 2016, 3:19 a.m. UTC | #25
On Tue, Nov 29, 2016 at 02:52:47AM +0100, Christoph Anton Mitterer wrote:
> On Mon, 2016-11-28 at 16:48 -0500, Zygo Blaxell wrote:
> > If a drive's
> > embedded controller RAM fails, you get corruption on the majority of
> > reads from a single disk, and most writes will be corrupted (even if
> > they
> > were not before).
> 
> Administrating a multi-PiB Tier-2 for the LHC Computing Grid with quite
> a number of disks for nearly 10 years now, I'd have never stumbled on
> such a case of breakage so far...

In data centers you won't see breakages that are common on desktop and
laptop drives.  Laptops in particular sometimes (often?) go to places
that are much less friendly to hardware.

All my NAS and enterprise drives in server racks and data centers just
wake up one morning stone dead or with a few well-behaved bad sectors,
with none of this drama.  Boring!

> Actually most cases are as simple as HDD fails to work and this is
> properly signalled to the controller.
> 
> 
> 
> > If there's a transient failure due to environmental
> > issues (e.g. short-term high-amplitude vibration or overheating) then
> > writes may pause for mechanical retry loops.  If there is bitrot in
> > SSDs
> > (particularly in the address translation tables) it looks like a wall
> > of random noise that only ends when the disk goes offline.  You can
> > get
> > combinations of these (e.g. RAM failures caused by transient
> > overheating)
> > where the drive's behavior changes over time.
> > 
> > When in doubt, don't write.
> 
> Sorry, but these cases as any cases of memory issues (be it main memory
> or HDD controller) would also kick in at any normal writes.

Yes, but in a RAID1 context there will be another disk with a good copy
(or if main RAM is failing, the entire filesystem will be toast no matter
what you do).

> So there's no point in protecting against this on the storage side...
> 
> Either never write at all... or have good backups for these rare cases.
> 
> 
> 
> Cheers,
> Chris.
Adam Borowski Nov. 29, 2016, 7:35 a.m. UTC | #26
On Tue, Nov 29, 2016 at 02:52:47AM +0100, Christoph Anton Mitterer wrote:
> On Mon, 2016-11-28 at 16:48 -0500, Zygo Blaxell wrote:
> > If a drive's embedded controller RAM fails, you get corruption on the
> > majority of reads from a single disk, and most writes will be corrupted
> > (even if they were not before).
> 
> Administrating a multi-PiB Tier-2 for the LHC Computing Grid with quite
> a number of disks for nearly 10 years now, I'd have never stumbled on
> such a case of breakage so far...
> 
> Actually most cases are as simple as HDD fails to work and this is
> properly signalled to the controller.

I administer no real storage at this time, and got only 16 disks (plus a few
disk-likes) to my name right now.  Yet in a ~2 months span I've seen three
cases of silent data corruption:

* a RasPi I used for DNS recursor/DHCP/aiccu started mangling some writes,
  with no notification that something is amiss.  With ext4 being a
  silentdatalossfs, there was no clue it was a disk (ok, SD) problem at all,
  making it really "fun" to debug.  Happens on multiple SD cards, thus it's
  the machine that's at fault.

* a HDD had some link resets and silent data corruption, diagnosed to a bad
  SATA cable, the disk works fine since (obviously after extensive tests).

* a HDD that has link resets and silent data corruption (apparently
  write-time only(?)), Marduk knows why.  Happens with multiple cables and
  two machines, putting the blame somewhere on the disk.

Thus, assumption that the controller will be notified about read errors is
quite invalid.  In the above cases, if recovery was possible it'd be
beneficial to rewrite a good copy of the data.


Meow!
Christoph Anton Mitterer Nov. 29, 2016, 2:24 p.m. UTC | #27
On Tue, 2016-11-29 at 08:35 +0100, Adam Borowski wrote:
> I administer no real storage at this time, and got only 16 disks
> (plus a few
> disk-likes) to my name right now.  Yet in a ~2 months span I've seen
> three
> cases of silent data corruption

I didn't meant to say we'd have no silent data corruption.
OTOH, we cannot have had many of them, since the storage management
software itself has another layer of checksumming.

What I meant to say is that we likely never had such a scenario
described by Zygo, where e.g. broken HDD controller memory would cause
corruptions (and which would be then bad in case of auto-RAID-repair-
on-read).
Cause IMO such memory issues would rather soon be noticed.


Cheers,
Chris.

Patch
diff mbox

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index d016d4a..87e3565 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -133,6 +133,16 @@  struct btrfs_raid_bio {
 	/* second bad stripe (for raid6 use) */
 	int failb;
 
+	/*
+	 * For steal_rbio, we can steal recovered correct page,
+	 * but in finish_parity_scrub(), we still use bad on-disk
+	 * page to calculate parity.
+	 * Use these members to info finish_parity_scrub() to use
+	 * correct pages
+	 */
+	int bad_ondisk_a;
+	int bad_ondisk_b;
+
 	int scrubp;
 	/*
 	 * number of pages needed to represent the full
@@ -310,6 +320,12 @@  static void steal_rbio(struct btrfs_raid_bio *src, struct btrfs_raid_bio *dest)
 	if (!test_bit(RBIO_CACHE_READY_BIT, &src->flags))
 		return;
 
+	/* Record recovered stripe number */
+	if (src->faila != -1)
+		dest->bad_ondisk_a = src->faila;
+	if (src->failb != -1)
+		dest->bad_ondisk_b = src->failb;
+
 	for (i = 0; i < dest->nr_pages; i++) {
 		s = src->stripe_pages[i];
 		if (!s || !PageUptodate(s)) {
@@ -998,6 +1014,8 @@  static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
 	rbio->stripe_npages = stripe_npages;
 	rbio->faila = -1;
 	rbio->failb = -1;
+	rbio->bad_ondisk_a = -1;
+	rbio->bad_ondisk_b = -1;
 	atomic_set(&rbio->refs, 1);
 	atomic_set(&rbio->error, 0);
 	atomic_set(&rbio->stripes_pending, 0);
@@ -2352,7 +2370,16 @@  static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
 		void *parity;
 		/* first collect one page from each data stripe */
 		for (stripe = 0; stripe < nr_data; stripe++) {
-			p = page_in_rbio(rbio, stripe, pagenr, 0);
+
+			/*
+			 * Use stolen recovered page other than bad
+			 * on disk pages
+			 */
+			if (stripe == rbio->bad_ondisk_a ||
+			    stripe == rbio->bad_ondisk_b)
+				p = rbio_stripe_page(rbio, stripe, pagenr);
+			else
+				p = page_in_rbio(rbio, stripe, pagenr, 0);
 			pointers[stripe] = kmap(p);
 		}