Adventures in btrfs raid5 disk recovery

Message ID 20160620034427.GK15597@hungrycats.org (mailing list archive)
State New, archived

Commit Message

Zygo Blaxell June 20, 2016, 3:44 a.m. UTC
Not so long ago, I had a disk fail in a btrfs filesystem with raid1
metadata and raid5 data.  I mounted the filesystem readonly, replaced
the failing disk, and attempted to recover by adding the new disk and
deleting the missing disk.

It's not going well so far.  Pay attention, there are at least four
separate problems in here and we're not even half done yet.

I'm currently using kernel 4.6.2 with btrfs fixes forward-ported from
4.5.7, because 4.5.7 has a number of fixes that 4.6.2 doesn't.  I have
also pulled in some patches from the 4.7-rc series.

This fixed a few problems I encountered early on, and I'm still making
forward progress, but I've only replaced 50% of the failed disk so far,
and this is week four of this particular project.

What worked:

'mount -odegraded,...' successfully mounts the filesystem RW.  
'btrfs device add' adds the new disk.  Success!

The first thing I did was balance the metadata onto non-missing disks.
That went well.  Now there are only data chunks to recover from the
missing disk.  Success!
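
For the record, the recovery sequence so far amounts to roughly this
(device names and mount point are placeholders, and mount options are
abbreviated):

        mount -o degraded /dev/vdc /data    # mount RW with the failed disk missing
        btrfs device add /dev/vdf /data     # add the replacement disk
        btrfs balance start -m /data        # rewrite metadata chunks onto present disks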

The normal 'device delete' operation got about 25% of the way in,
then got stuck on some corrupted sectors and aborted with EIO.
That ends the success, but I've had similar problems with raid5 arrays
before and been able to solve them.

I've managed to remove about half of the data from the missing disk
so far.  'balance start -ddevid=<missing disk ID>,drange=0..100000000000'
(with increasing values for drange) is able to move data off the failed
disk while avoiding the damaged regions.  It looks like this process could
reduce the amount of data on "missing" devices to a manageable number,
then I could identify the offending corrupted extents with 'btrfs scrub',
remove the files containing them, and finish the device delete operation.
Hope!
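
In rough shell terms (the devid, range step, and total size below are
placeholders, not the real numbers):

        start=0; step=100000000000
        while [ "$start" -lt 8000000000000 ]; do
                # ranges containing corrupted extents abort with EIO; note
                # them and move on, then come back and subdivide them later
                btrfs balance start -ddevid=8,drange=$start..$((start + step)) /data \
                        || echo "range starting at $start hit corruption"
                start=$((start + step))
        done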

What doesn't work:

The first problem is that the kernel keeps crashing.  I put the filesystem
and all its disks in a KVM so the crashes are less disruptive, and I can
debug them (or at least collect panic logs).  OK now crashes are merely a
performance problem.

Why did I mention 'btrfs scrub' above?  Because 'btrfs scrub' tells me
where corrupted blocks are.  'device delete' fills my kernel logs with
lines like this:

	[26054.744158] BTRFS info (device vdc): relocating block group 27753592127488 flags 129
	[26809.746993] BTRFS warning (device vdc): csum failed ino 404 off 6021976064 csum 778377694 expected csum 2827380172
	[26809.747029] BTRFS warning (device vdc): csum failed ino 404 off 6021980160 csum 3776938678 expected csum 514150079
	[26809.747077] BTRFS warning (device vdc): csum failed ino 404 off 6021984256 csum 470593400 expected csum 642831408
	[26809.747093] BTRFS warning (device vdc): csum failed ino 404 off 6021988352 csum 796755777 expected csum 690854341
	[26809.747108] BTRFS warning (device vdc): csum failed ino 404 off 6021992448 csum 4115095129 expected csum 249712906
	[26809.747122] BTRFS warning (device vdc): csum failed ino 404 off 6021996544 csum 2337431338 expected csum 1869250975
	[26809.747138] BTRFS warning (device vdc): csum failed ino 404 off 6022000640 csum 3543852608 expected csum 1929026437
	[26809.747154] BTRFS warning (device vdc): csum failed ino 404 off 6022004736 csum 3417780495 expected csum 3698318115
	[26809.747169] BTRFS warning (device vdc): csum failed ino 404 off 6022008832 csum 3423877520 expected csum 2981727596
	[26809.747183] BTRFS warning (device vdc): csum failed ino 404 off 6022012928 csum 550838742 expected csum 1005563554
	[26896.379773] BTRFS info (device vdc): relocating block group 27753592127488 flags 129
	[27791.128098] __readpage_endio_check: 7 callbacks suppressed
	[27791.236794] BTRFS warning (device vdc): csum failed ino 405 off 6021980160 csum 3776938678 expected csum 514150079
	[27791.236799] BTRFS warning (device vdc): csum failed ino 405 off 6021971968 csum 3304844252 expected csum 4171523312
	[27791.236821] BTRFS warning (device vdc): csum failed ino 405 off 6021984256 csum 470593400 expected csum 642831408
	[27791.236825] BTRFS warning (device vdc): csum failed ino 405 off 6021988352 csum 796755777 expected csum 690854341
	[27791.236842] BTRFS warning (device vdc): csum failed ino 405 off 6021992448 csum 4115095129 expected csum 249712906
	[27791.236847] BTRFS warning (device vdc): csum failed ino 405 off 6021996544 csum 2337431338 expected csum 1869250975
	[27791.236857] BTRFS warning (device vdc): csum failed ino 405 off 6022004736 csum 3417780495 expected csum 3698318115
	[27791.236864] BTRFS warning (device vdc): csum failed ino 405 off 6022000640 csum 3543852608 expected csum 1929026437
	[27791.236874] BTRFS warning (device vdc): csum failed ino 405 off 6022008832 csum 3423877520 expected csum 2981727596
	[27791.236978] BTRFS warning (device vdc): csum failed ino 405 off 6021976064 csum 778377694 expected csum 2827380172
	[27927.222130] BTRFS info (device vdc): relocating block group 27753592127488 flags 129
	[28679.904235] __readpage_endio_check: 7 callbacks suppressed
	[28679.907160] BTRFS warning (device vdc): csum failed ino 406 off 6021984256 csum 470593400 expected csum 642831408
	[28679.914460] BTRFS warning (device vdc): csum failed ino 406 off 6021980160 csum 3776938678 expected csum 514150079
	[28679.914566] BTRFS warning (device vdc): csum failed ino 406 off 6021976064 csum 778377694 expected csum 2827380172
	[28679.914584] BTRFS warning (device vdc): csum failed ino 406 off 6021988352 csum 796755777 expected csum 690854341
	[28679.914596] BTRFS warning (device vdc): csum failed ino 406 off 6021992448 csum 4115095129 expected csum 249712906
	[28679.914616] BTRFS warning (device vdc): csum failed ino 406 off 6021996544 csum 2337431338 expected csum 1869250975
	[28679.914625] BTRFS warning (device vdc): csum failed ino 406 off 6022000640 csum 3543852608 expected csum 1929026437
	[28679.914641] BTRFS warning (device vdc): csum failed ino 406 off 6022004736 csum 3417780495 expected csum 3698318115
	[28679.914643] BTRFS warning (device vdc): csum failed ino 406 off 6022008832 csum 3423877520 expected csum 2981727596
	[28679.914661] BTRFS warning (device vdc): csum failed ino 406 off 6022012928 csum 550838742 expected csum 1005563554
	[28769.622165] BTRFS info (device vdc): relocating block group 27753592127488 flags 129
	[29409.886503] __readpage_endio_check: 7 callbacks suppressed
	[29409.889486] BTRFS warning (device vdc): csum failed ino 407 off 6021971968 csum 3304844252 expected csum 4171523312
	[29409.896090] BTRFS warning (device vdc): csum failed ino 407 off 6021976064 csum 778377694 expected csum 2827380172
	[29409.918458] BTRFS warning (device vdc): csum failed ino 407 off 6021984256 csum 470593400 expected csum 642831408
	[29409.918463] BTRFS warning (device vdc): csum failed ino 407 off 6021980160 csum 3776938678 expected csum 514150079
	[29409.918489] BTRFS warning (device vdc): csum failed ino 407 off 6021992448 csum 4115095129 expected csum 249712906
	[29409.918504] BTRFS warning (device vdc): csum failed ino 407 off 6021996544 csum 2337431338 expected csum 1869250975
	[29409.918517] BTRFS warning (device vdc): csum failed ino 407 off 6022000640 csum 3543852608 expected csum 1929026437
	[29409.918530] BTRFS warning (device vdc): csum failed ino 407 off 6022004736 csum 3417780495 expected csum 3698318115
	[29409.918543] BTRFS warning (device vdc): csum failed ino 407 off 6022008832 csum 3423877520 expected csum 2981727596
	[29409.918556] BTRFS warning (device vdc): csum failed ino 407 off 6022012928 csum 550838742 expected csum 1005563554

"ino 407" is a lie and I have not yet been able to identify what the
offsets mean (assuming they're not lies too).  What's really interesting
is that ino increases by one each time 'btrfs dev del' is run.
I guess it's just printing uninitialized memory here or something?
OK I'll avoid some of the more broken btrfs tools.

'btrfs scrub' tells me the precise ino, offset, and path names of
corrupted files, but it takes a week to run, so I'm moving as much data
as possible onto newer (and faster) disks to make the scrub run faster.
Sounds like a viable plan.
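
Once scrub reports a believable inode number, or the log gives a logical
byte address, btrfs-progs can map either back to path names; <ino> and
<logical> are placeholders here:

        btrfs inspect-internal inode-resolve <ino> /data
        btrfs inspect-internal logical-resolve -v <logical> /data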

I kept getting crashes at two specific BUG_ONs, which I patched around
as follows:


This code gives me more questions than answers:  Why is there a BUG_ON
located before the call to btrfs_is_parity_mirror?  Why is mirror_num
even passed to btrfs_is_parity_mirror, when that function has never used
its mirror_num argument since it was written in 2013?  More worrying.

Those two patches helped to the extent that my kernel stopped crashing
whenever it encountered a bad checksum on the filesystem; however, now I'm
getting EIO randomly all over the filesystem, including in files that were
written entirely _after_ the disk failure.  I had assumed such files would
be written successfully, but apparently not.  Uh oh.  Better stop writing
new data to the filesystem, then.

I added 'convert=raid1' to the 'btrfs balance' command to try to stem
the flow of corrupted data.  It seems to help, but hasn't completely
stopped new files from getting corrupted.  The new disks are large enough
to hold all the data in raid1, so maybe if I manage to delete the missing
disk I'll convert the whole filesystem to raid1 and pretend the raid5
feature doesn't exist any more.
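
In its standalone form the conversion is just another balance filter;
roughly (the 'soft' filter skips chunks already in the target profile):

        btrfs balance start -dconvert=raid1,soft /data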

I've already "voided the warranty" as it were, so I'm not surprised
to see a bunch of unexpected corruption.  The kernel already tried
twice to kill itself before getting as far as I am now, so there was
probably a good reason for that.  At this point it's still less
disruptive to just fix the corruption one file at a time from backups,
and without the patches the only other option is to mkfs+restore the
filesystem, so I'm pressing on... :-/

The code I'm changing above hasn't changed since 2011, which is well
before there was raid5.  Clearly it's not handling the cases that raid5
brings up very well.  Is recovery a lost cause?  Do I just give up now
and mkfs+restore the filesystem?

Compounding the difficulty is the fact that the balance operation
keeps getting stuck every few hours with this backtrace:

[18365.506156] Call Trace:
[18365.506164]  [<ffffffff81b17b1f>] schedule+0x3f/0xa0
[18365.506167]  [<ffffffff81407ebd>] lock_extent_bits+0x18d/0x1e0
[18365.506171]  [<ffffffff810ddc30>] ? wait_woken+0xb0/0xb0
[18365.506174]  [<ffffffff8143ea70>] invalidate_extent_cache+0xe0/0x200
[18365.506177]  [<ffffffff814430ed>] merge_reloc_root+0x33d/0x570
[18365.506179]  [<ffffffff81443463>] merge_reloc_roots+0x143/0x250
[18365.506181]  [<ffffffff814439e1>] relocate_block_group+0x471/0x6b0
[18365.506183]  [<ffffffff81443dcd>] ? btrfs_relocate_block_group+0x1ad/0x2a0
[18365.506185]  [<ffffffff81443dd5>] btrfs_relocate_block_group+0x1b5/0x2a0
[18365.506188]  [<ffffffff81413877>] btrfs_relocate_chunk.isra.37+0x47/0xd0
[18365.506190]  [<ffffffff814150f1>] btrfs_balance+0xa71/0x1200
[18365.506192]  [<ffffffff81420072>] ? btrfs_ioctl_balance+0x182/0x3b0
[18365.506195]  [<ffffffff81420072>] btrfs_ioctl_balance+0x182/0x3b0
[18365.506197]  [<ffffffff8142680d>] btrfs_ioctl+0x139d/0x2550
[18365.506201]  [<ffffffff81071f4a>] ? __do_page_fault+0x1da/0x570
[18365.506203]  [<ffffffff812823af>] do_vfs_ioctl+0x8f/0x680
[18365.506205]  [<ffffffff81071f4a>] ? __do_page_fault+0x1da/0x570
[18365.506207]  [<ffffffff8128e9ef>] ? __fget_light+0x6f/0x90
[18365.506209]  [<ffffffff81282a19>] SyS_ioctl+0x79/0x90
[18365.506212]  [<ffffffff81b1ddc0>] entry_SYSCALL_64_fastpath+0x23/0xc1

That's probably another bug, but I don't know if it's because of something
I've done, something weird in my filesystem's metadata, or a previously
unreported regression since kernel 4.4.  For now I just detect when it
happens inside the KVM and reboot it from outside.

I only have a few months left before the next disk in this array fails,
and I'd like to be out of degraded mode by then.  ;)

Comments

Roman Mamedov June 20, 2016, 6:13 p.m. UTC | #1
On Sun, 19 Jun 2016 23:44:27 -0400
Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote:

> It's not going well so far.  Pay attention, there are at least four
> separate problems in here and we're not even half done yet.
> 
> I'm currently using kernel 4.6.2 with btrfs fixes forward-ported from
> 4.5.7, because 4.5.7 has a number of fixes that 4.6.2 doesn't.  I have
> also pulled in some patches from the 4.7-rc series.
> 
> This fixed a few problems I encountered early on, and I'm still making
> forward progress, but I've only replaced 50% of the failed disk so far,
> and this is week four of this particular project.

From a practical standpoint, [aside from not using Btrfs RAID5], you'd be
better off shutting down the system, booting a rescue OS, copying the content
of the failing disk to the replacement one using 'ddrescue', then removing the
bad disk, and after boot up your main system wouldn't notice anything has ever
happened, aside from a few recoverable CRC errors in the "holes" on the areas
which ddrescue failed to copy.

But in general it's commendable that you're experimenting with doing things
"the native way", as this provides feedback to the developers and could help
make the RAID implementation better. I guess that's the whole point of the
exercise and the report, and I hope this ends up being useful for everyone.
Zygo Blaxell June 20, 2016, 7:11 p.m. UTC | #2
On Mon, Jun 20, 2016 at 11:13:51PM +0500, Roman Mamedov wrote:
> On Sun, 19 Jun 2016 23:44:27 -0400
> Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote:
> From a practical standpoint, [aside from not using Btrfs RAID5], you'd be
> better off shutting down the system, booting a rescue OS, copying the content
> of the failing disk to the replacement one using 'ddrescue', then removing the
> bad disk, and after boot up your main system wouldn't notice anything has ever
> happened, aside from a few recoverable CRC errors in the "holes" on the areas
> which ddrescue failed to copy.

I'm aware of ddrescue and myrescue, but in this case the disk has failed,
past tense.  At this point the remaining choices are to make btrfs native
raid5 recovery work, or to restore from backups.

> But in general it's commendable that you're experimenting with doing things
> "the native way", as this is provides feedback to the developers and could help
> make the RAID implementation better. I guess that's the whole point of the
> exercise and the report, and hope this ends up being useful for everyone.

The intent was both to provide a cautionary tale for anyone considering
deploying a btrfs raid5 system today, and to possibly engage some
developers to help solve the problems.

The underlying causes seem to be somewhat removed from where the symptoms
are appearing, and at the moment I don't understand this code well enough
to know where to look for them.  Any assistance would be greatly appreciated.


> -- 
> With respect,
> Roman
Chris Murphy June 20, 2016, 7:30 p.m. UTC | #3
On Mon, Jun 20, 2016 at 1:11 PM, Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
> On Mon, Jun 20, 2016 at 11:13:51PM +0500, Roman Mamedov wrote:
>> On Sun, 19 Jun 2016 23:44:27 -0400
>> Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote:
>> From a practical standpoint, [aside from not using Btrfs RAID5], you'd be
>> better off shutting down the system, booting a rescue OS, copying the content
>> of the failing disk to the replacement one using 'ddrescue', then removing the
>> bad disk, and after boot up your main system wouldn't notice anything has ever
>> happened, aside from a few recoverable CRC errors in the "holes" on the areas
>> which ddrescue failed to copy.
>
> I'm aware of ddrescue and myrescue, but in this case the disk has failed,
> past tense.  At this point the remaining choices are to make btrfs native
> raid5 recovery work, or to restore from backups.

Seems difficult at best due to this:
>> The normal 'device delete' operation got about 25% of the way in,
>> then got stuck on some corrupted sectors and aborted with EIO.

In effect it's like a 2 disk failure for a raid5 (or it's
intermittently a 2 disk failure but always at least a 1 disk failure).
That's not something md raid recovers from. Even manual recovery in
such a case is far from certain.

Perhaps Roman's advice is also a question about the cause of this
corruption? I'm wondering this myself. That's the real problem here as
I see it. Losing a drive is ordinary. Additional corruption happening
afterward is not. And are those corrupt sectors hardware corruptions,
or Btrfs corruptions at the time the data was written to disk, or
Btrfs being confused as it's reading the data from disk?

For me the critical question is what does "some corrupted sectors" mean?
Zygo Blaxell June 20, 2016, 8:40 p.m. UTC | #4
On Mon, Jun 20, 2016 at 01:30:11PM -0600, Chris Murphy wrote:
> On Mon, Jun 20, 2016 at 1:11 PM, Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> > On Mon, Jun 20, 2016 at 11:13:51PM +0500, Roman Mamedov wrote:
> >> On Sun, 19 Jun 2016 23:44:27 -0400
> Seems difficult at best due to this:
> >> The normal 'device delete' operation got about 25% of the way in,
> >> then got stuck on some corrupted sectors and aborted with EIO.
> 
> In effect it's like a 2 disk failure for a raid5 (or it's
> intermittently a 2 disk failure but always at least a 1 disk failure).
> That's not something md raid recovers from. Even manual recovery in
> such a case is far from certain.
> 
> Perhaps Roman's advice is also a question about the cause of this
> corruption? I'm wondering this myself. That's the real problem here as
> I see it. Losing a drive is ordinary. Additional corruptions happening
> afterward is not. And are those corrupt sectors hardware corruptions,
> or Btrfs corruptions at the time the data was written to disk, or
> Btrfs being confused as it's reading the data from disk?

> For me the critical question is what does "some corrupted sectors" mean?

On other raid5 arrays, I would observe a small amount of corruption every
time there was a system crash (some of which were triggered by disk
failures, some not).  It looked like any writes in progress at the time
of the failure would be damaged.  In the past I would just mop up the
corrupt files (they were always the last extents written, easy to find
with find-new or scrub) and have no further problems.  In the earlier
cases there were no new instances of corruption after the initial failure
event and manual cleanup.

Now that I dig a little deeper into this, I do see one fairly significant
piece of data:

	root@host:~# btrfs dev stat /data | grep -v ' 0$'
	[/dev/vdc].corruption_errs 16774
	[/dev/vde].write_io_errs   121
	[/dev/vde].read_io_errs    4
	[devid:8].read_io_errs    16

Prior to the failure of devid:8, vde had 121 write errors and 4 read
errors (these counter values are months old and the errors were long
since repaired by scrub).  The 16774 corruption errors on vdc are all
new since the devid:8 failure, though.

> 
> 
> -- 
> Chris Murphy
>
Chris Murphy June 20, 2016, 9:27 p.m. UTC | #5
On Mon, Jun 20, 2016 at 2:40 PM, Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
> On Mon, Jun 20, 2016 at 01:30:11PM -0600, Chris Murphy wrote:

>> For me the critical question is what does "some corrupted sectors" mean?
>
> On other raid5 arrays, I would observe a small amount of corruption every
> time there was a system crash (some of which were triggered by disk
> failures, some not).

What test are you using to determine there is corruption, and how much
data is corrupted? Is this on every disk? Non-deterministically fewer
than all disks? Have you identified this as a torn write or
misdirected write or is it just garbage at some sectors? And what's
the size? Partial sector? Partial md chunk (or fs block?)

>  It looked like any writes in progress at the time
> of the failure would be damaged.  In the past I would just mop up the
> corrupt files (they were always the last extents written, easy to find
> with find-new or scrub) and have no further problems.

This is on Btrfs? This isn't supposed to be possible. Even a literal
overwrite of a file is not an overwrite on Btrfs unless the file is
nodatacow. Data extents get written, then the metadata is updated to
point to those new blocks. There should be flush or fua requests to
make sure the order is such that the fs points to either the old or
new file, in either case uncorrupted. That's why I'm curious about the
nature of this corruption. It sounds like your hardware is not exactly
honoring flush requests.

With md raid and any other file system, it's pure luck that such
corrupted writes would only affect data extents and not the fs
metadata. Corrupted fs metadata is not well tolerated by any file
system, not least because most of them have no idea the metadata is
corrupt. At least Btrfs can detect this, and either use another copy
if there is one, or just stop and face plant before more damage happens.
Maybe an exception now is XFS v5 metadata, which employs checksumming.
But it still doesn't know if data extents are wrong (i.e. a torn or
misdirected write).

I've had perhaps a hundred power off during write with Btrfs and SSD
and I don't ever see corrupt files. It's definitely not normal to see
this with Btrfs.


> In the earlier
> cases there were no new instances of corruption after the initial failure
> event and manual cleanup.
>
> Now that I dig a little deeper into this, I do see one fairly significant
> piece of data:
>
>         root@host:~# btrfs dev stat /data | grep -v ' 0$'
>         [/dev/vdc].corruption_errs 16774
>         [/dev/vde].write_io_errs   121
>         [/dev/vde].read_io_errs    4
>         [devid:8].read_io_errs    16
>
> Prior to the failure of devid:8, vde had 121 write errors and 4 read
> errors (these counter values are months old and the errors were long
> since repaired by scrub).  The 16774 corruption errors on vdc are all
> new since the devid:8 failure, though.

On md RAID 5 and 6, if the array gets parity mismatch counts above 0
doing a scrub (check > md/sync_action), there's a hardware problem.
It's entirely possible you've found a bug, but it must be extremely
obscure to basically not have hit everyone trying Btrfs raid56. I
think you need to track down the source of this corruption and stop it
however possible, whether that's changing hardware or making sure
the system isn't crashing.
Zygo Blaxell June 21, 2016, 1:55 a.m. UTC | #6
On Mon, Jun 20, 2016 at 03:27:03PM -0600, Chris Murphy wrote:
> On Mon, Jun 20, 2016 at 2:40 PM, Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> > On Mon, Jun 20, 2016 at 01:30:11PM -0600, Chris Murphy wrote:
> 
> >> For me the critical question is what does "some corrupted sectors" mean?
> >
> > On other raid5 arrays, I would observe a small amount of corruption every
> > time there was a system crash (some of which were triggered by disk
> > failures, some not).
> 
> What test are you using to determine there is corruption, and how much
> data is corrupted? Is this on every disk? Non-deterministically fewer
> than all disks? Have you identified this as a torn write or
> misdirected write or is it just garbage at some sectors? And what's
> the size? Partial sector? Partial md chunk (or fs block?)

In earlier cases, scrub, read(), and btrfs dev stat all reported the
incidents differently.  Scrub would attribute errors randomly to disks
(error counts spread randomly across all the disks in the 'btrfs scrub
status -d' output).  'dev stat' would correctly increment counts on only
those disks which had individually had an event (e.g. media error or
SATA bus reset).

Before deploying raid5, I tested these by intentionally corrupting
one disk in an otherwise healthy raid5 array and watching the result.
When scrub identified an inode and offset in the kernel log, the csum
failure log message matched the offsets producing EIO on read(), but
the statistics reported by scrub about which disk had been corrupted
were mostly wrong.  In such cases a scrub could repair the data.

A different thing happens if there is a crash.  In that case, scrub cannot
repair the errors.  Every btrfs raid5 filesystem I've deployed so far
behaves this way when disks turn bad.  I had assumed it was a software bug
in the comparatively new raid5 support that would get fixed eventually.

In this current case, I'm getting things like this:

[12008.243867] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 4105596028 wanted 787343232 mirror 0
[12008.243876] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 1689373462 wanted 787343232 mirror 0
[12008.243885] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 3621611229 wanted 787343232 mirror 0
[12008.243893] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 113993114 wanted 787343232 mirror 0
[12008.243902] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 1464956834 wanted 787343232 mirror 0
[12008.243911] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 2545274038 wanted 787343232 mirror 0
[12008.243942] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 4090153227 wanted 787343232 mirror 0
[12008.243952] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 4129844199 wanted 787343232 mirror 0
[12008.243961] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 4129844199 wanted 787343232 mirror 0
[12008.243976] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 172651968 wanted 787343232 mirror 0
[12008.246158] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 4129844199 wanted 787343232 mirror 1
[12008.247557] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 1374425809 wanted 787343232 mirror 1
[12008.403493] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 1567917468 wanted 787343232 mirror 1
[12008.409809] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 2881359629 wanted 787343232 mirror 0
[12008.411165] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 3021442070 wanted 787343232 mirror 0
[12008.411180] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 3984314874 wanted 787343232 mirror 0
[12008.411189] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 599192427 wanted 787343232 mirror 0
[12008.411199] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 2887010053 wanted 787343232 mirror 0
[12008.411208] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 1314141634 wanted 787343232 mirror 0
[12008.411217] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 3156167613 wanted 787343232 mirror 0
[12008.411227] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 565550942 wanted 787343232 mirror 0
[12008.411236] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 4068631390 wanted 787343232 mirror 0
[12008.411245] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 531263990 wanted 787343232 mirror 0
[12008.411255] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 1992044858 wanted 787343232 mirror 0
[12008.411264] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 1516200640 wanted 787343232 mirror 0
[12008.411273] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 3762784693 wanted 787343232 mirror 0
[12008.411282] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 42246361 wanted 787343232 mirror 0
[12008.411291] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 3989363071 wanted 787343232 mirror 0
[12008.422079] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 983444376 wanted 931479324 mirror 0
[12008.422402] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825372160 csum 3757815310 wanted 913327053 mirror 2
[12008.542715] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 425031800 wanted 931479324 mirror 0
[12008.542722] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 1254202754 wanted 931479324 mirror 0
[12008.542749] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 4281398184 wanted 931479324 mirror 0
[12008.543730] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 1016285311 wanted 931479324 mirror 0
[12008.545174] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 3122785442 wanted 931479324 mirror 0
[12008.549530] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 2691681328 wanted 931479324 mirror 0
[12008.556722] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 1956690651 wanted 931479324 mirror 0
[12008.556744] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 857176374 wanted 931479324 mirror 0
[12008.556756] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 2858632553 wanted 931479324 mirror 0
[12008.557901] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825372160 csum 3245431096 wanted 913327053 mirror 2
[12008.557907] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825372160 csum 3536357904 wanted 931479324 mirror 2
[12008.557935] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825372160 csum 3536357904 wanted 931479324 mirror 2
[12008.557938] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825372160 csum 3536357904 wanted 931479324 mirror 2
[12008.557952] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825372160 csum 3536357904 wanted 931479324 mirror 2
[12008.557954] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 3251056521 wanted 931479324 mirror 0
[12008.558054] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 3711050593 wanted 931479324 mirror 0
[12008.561564] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 3540653895 wanted 931479324 mirror 0
[12008.568799] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 876087402 wanted 931479324 mirror 0
[12008.568817] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 809865123 wanted 931479324 mirror 0
[12008.568827] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 884024971 wanted 931479324 mirror 0
[12008.568839] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 2677615947 wanted 931479324 mirror 0
[12008.572122] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 3536357904 wanted 931479324 mirror 1
[12008.572907] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825372160 csum 3763483742 wanted 931479324 mirror 2
[12008.572964] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825372160 csum 3157651551 wanted 931479324 mirror 2
[12008.572994] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825372160 csum 824699142 wanted 931479324 mirror 2
[12008.573066] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825372160 csum 3536357904 wanted 931479324 mirror 2
[12008.573097] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825372160 csum 3536357904 wanted 931479324 mirror 2
[12008.573126] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825372160 csum 3536357904 wanted 931479324 mirror 2
[12008.573155] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825372160 csum 3536357904 wanted 931479324 mirror 2
[12008.573187] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825372160 csum 3536357904 wanted 931479324 mirror 2
[12008.573209] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825372160 csum 2691681328 wanted 931479324 mirror 2
[12008.573239] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825372160 csum 3757815310 wanted 913327053 mirror 2
[12008.573529] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 3536357904 wanted 931479324 mirror 0
[12008.574222] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825368064 csum 3757815310 wanted 913327053 mirror 2
[12008.577201] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 3536357904 wanted 931479324 mirror 0
[12008.595253] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825372160 csum 3757815310 wanted 913327053 mirror 2

The "wanted" field looks suspicious--this filesystem does not have large
areas filled with blocks of constant value, much less multiple different
constant values, and certainly not in the affected files.  

The other weird thing here is that I can't find an example in the
logs of an extent with an EIO that isn't compressed.  I've been looking
up a random sample of the extent numbers, matching them up to filefrag
output, and finding e.g. the one compressed extent in the middle of an
otherwise uncompressed git pack file.  That's...odd.  Maybe there's a
problem with compressed extents in particular?  I'll see if I can
script something to check all the logs at once...
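
(The check is roughly this, with the path a placeholder: compressed
extents should show up in 'filefrag -v' output with the "encoded" flag,
so something like

	filefrag -v /data/path/to/pack | grep -c encoded

gives a quick count of compressed extents in a file.)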

> >  It looked like any writes in progress at the time
> > of the failure would be damaged.  In the past I would just mop up the
> > corrupt files (they were always the last extents written, easy to find
> > with find-new or scrub) and have no further problems.
> 
> This is on Btrfs? This isn't supposed to be possible. Even a literal
> overwrite of a file is not an overwrite on Btrfs unless the file is
> nodatacow. Data extents get written, then the metadata is updated to
> point to those new blocks. There should be flush or fua requests to
> make sure the order is such that the fs points to either the old or
> new file, in either case uncorrupted. That's why I'm curious about the
> nature of this corruption. It sounds like your hardware is not exactly
> honoring flush requests.

That's true when all the writes are ordered within a single device, but
possibly not when writes must be synchronized across multiple devices.
It could be possible to corrupt existing data if:

	- btrfs puts multiple extents in the same RAID5 stripe

	- writes two of the extents in different transactions

	- the array is degraded so the RAID5 write hole applies

	- writes stop (crash, powerfail, etc) after updating some of
	the blocks in the stripe but not others in the second transaction.

This corrupts the data written in the first transaction, unless there's
something in btrfs that explicitly prevents this from happening (e.g.
the allocator prevents the second step, which would seem to burn up a
lot of space).
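
A toy illustration of those last two steps, with bash arithmetic standing
in for the strips of one stripe (D4 is the strip on the missing disk):

	D1=0x11; D2=0x22; D3=0x33; D4=0x44   # D4 lives on the missing disk
	P=$(( D1 ^ D2 ^ D3 ^ D4 ))           # parity as of the first transaction
	D2=0x99                              # second transaction rewrites D2's strip,
	                                     # then crashes before P is rewritten
	echo $(( D1 ^ D2 ^ D3 ^ P ))         # "reconstructed" D4 is 255, not 0x44 (68)

The strip that nothing rewrote is the one that comes back wrong.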

The filesystem would continue to work afterwards with raid1 metadata
because every disk in raid1 updates its blocks in the same order,
and there are no interdependencies between blocks on different disks
(not like a raid5 stripe, anyway).

Excluding the current event, this only happened to me when the disks
were doing other questionable things, like SATA link resets or dropping
off the SATA bus, so there could have been an additional hardware issue.

This does happen on _every_ raid5 deployment I've run so far, though,
and they don't share any common hardware or firmware components.

> I've had perhaps a hundred power off during write with Btrfs and SSD
> and I don't ever see corrupt files. It's definitely not normal to see
> this with Btrfs.

These are spinning rust.  They are probably quite vulnerable to RAID
write hole issues when writing in degraded mode due to the comparatively
large write latency variation between disks.

> > Now that I dig a little deeper into this, I do see one fairly significant
> > piece of data:
> >
> >         root@host:~# btrfs dev stat /data | grep -v ' 0$'
> >         [/dev/vdc].corruption_errs 16774
> >         [/dev/vde].write_io_errs   121
> >         [/dev/vde].read_io_errs    4
> >         [devid:8].read_io_errs    16
> >
> > Prior to the failure of devid:8, vde had 121 write errors and 4 read
> > errors (these counter values are months old and the errors were long
> > since repaired by scrub).  The 16774 corruption errors on vdc are all
> > new since the devid:8 failure, though.
> 
> On md RAID 5 and 6, if the array gets parity mismatch counts above 0
> doing a scrub (check > md/sync_action) there's a hardware problem.
> It's entirely possible you've found a bug, but it must be extremely
> obscure to basically not have hit everyone trying Btrfs raid56. I

I've met many btrfs users on IRC but never one with a successful raid56
recovery story.  Give me a shout on IRC if you have one.  ;)

(I'm excluding my own recovery success stories--those were 16GB test
filesystems running on kvm, and they recovered *perfectly*).

> think you need to track down the source of this corruption and stop it
>  however possible; whether that's changing hardware, or making sure
> the system isn't crashing.

It's been a few hours, there have been plenty of EIOs on reads, but the
dev stat counters haven't changed from the values above.

I'm inclined to blame software for the EIOs:

	- the dev stat counters aren't changing, but there are dozens
	of read() EIOs going by.

	- the checksums of long block runs are all identical and not zero.
	Not one or two examples, all the logs going back several days
	look like this.

	- with the patches, I'm executing code on the far side of BUG_ONs.
	Most btrfs users aren't testing that code, they would have
	crashed first.  It still could be my fault for walking past the
	DANGER signs...

Probably something goes wrong earlier on, and the corruption is just a
symptom of the same failure that led to the original BUG_ONs.

> -- 
> Chris Murphy
Zygo Blaxell June 21, 2016, 3:53 a.m. UTC | #7
On Mon, Jun 20, 2016 at 09:55:59PM -0400, Zygo Blaxell wrote:
> In this current case, I'm getting things like this:
> 
> [12008.243867] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 4105596028 wanted 787343232 mirror 0
[...]
> The other weird thing here is that I can't find an example in the
> logs of an extent with an EIO that isn't compressed.  I've been looking
> up a random sample of the extent numbers, matching them up to filefrag
> output, and finding e.g. the one compressed extent in the middle of an
> otherwise uncompressed git pack file.  That's...odd.  Maybe there's a
> problem with compressed extents in particular?  I'll see if I can
> script something to check all the logs at once...

No need for a script:  this message wording appears only in
fs/btrfs/compression.c so it can only ever be emitted by reading a
compressed extent.

Maybe there's a problem specific to raid5, degraded mode, and compressed
extents?
Zygo Blaxell June 22, 2016, 4:06 a.m. UTC | #8
TL;DR:

	Kernel 4.6.2 causes a world of pain.  Use 4.5.7 instead.

	'btrfs dev stat' doesn't seem to count "csum failed"
	(i.e. corruption) errors in compressed extents.



On Sun, Jun 19, 2016 at 11:44:27PM -0400, Zygo Blaxell wrote:
> Not so long ago, I had a disk fail in a btrfs filesystem with raid1
> metadata and raid5 data.  I mounted the filesystem readonly, replaced
> the failing disk, and attempted to recover by adding the new disk and
> deleting the missing disk.

> I'm currently using kernel 4.6.2

That turned out to be a mistake.  4.6.2 has some severe problems.

Over the past few days I've been upgrading other machines from 4.5.7
to 4.6.2.  This morning I saw the aggregate data coming back from
those machines, and it's all bad:  stalls in snapshot delete, balance,
and sync; some machines just lock up with no console messages; a lot of
watchdog timeouts.  None of the machines could get to an uptime over 26
hours and still be in a usable state.

I switched to 4.5.7 and the crashes, balance/delete hangs, and some of
the data corruption modes stopped.

> I'm
> getting EIO randomly all over the filesystem, including in files that were
> written entirely _after_ the disk failure.

There were actually four distinct corruption modes happening:

1.  There are some number (16500 so far) of "normal" corrupt blocks:
read() repeatably returns EIO, they show up in scrub with sane log
messages, and replacing the files that contain these blocks makes them go
away.  These blocks appear to be contained in extents that coincide with
the date of the disk failure.  Interestingly, no matter how many times I
read these blocks, I get no increase in the 'btrfs dev stat' numbers
even though I get kernel csum failure messages.  That looks like a bug.

2.  When attempting to replace corrupted files with rsync, I had used
'rsync --inplace'.  This caused bad blocks to be overwritten within
extents, but did not necessarily replace the _entire_ extent containing a
bad block.  This creates corrupt blocks that show up in scrub, balance,
and device delete, but not when reading files.  It also updates the
timestamps, so a file with old corruption looks "new" to an insufficiently
sophisticated analysis tool.  (A safer replacement recipe is sketched in
the note after this list.)

3.  Files were corrupted while they were written and accessed via NFS.
This created files with correct btrfs checksums, but garbage contents.
This would show up as failures during 'git gc' or rsync checksum
mismatches.  In each of the many VM crashes, any writes in progress at
the time of the crash were lost.  This effectively rewound the filesystem
by several minutes each time, as btrfs reverts to the previous committed
tree on the next mount.  4.6.2's hanging issues made this worse by
delaying btrfs commits indefinitely.  The NFS clients were completely
unaware of this, so when the VM rebooted, files ended up with holes,
or would just disappear while in use.

4.  After a VM crash and the filesystem reverted to the previous
committed tree, files with bad blocks that had been repaired through
the NFS server or with rsync would be "unrepaired" (i.e. the filesystem
would revert back to the original corrupted blocks after the mount).

Combinations of these could occur as well for extra confusion, and some
corrupted blocks are contained in many files thanks to dedup.
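
A less hazardous way to replace a damaged file (the note promised under
mode 2 above): let rsync write a whole new copy and rename it into place
instead of patching in place, so every extent of the file gets rewritten.
Roughly, with placeholder paths:

	rsync -a /backup/data/some/file /data/some/file   # note: no --inplace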

With kernel 4.5.7 there have been no lockups during commit and no VM
crashes, so I haven't seen corruption modes 3 or 4 since the switch to 4.5.7.

Balance is now running normally to move the remaining data off the
missing disk.  ETA is 558 hours.  See you in mid-July!  ;)
Chris Murphy June 22, 2016, 5:14 p.m. UTC | #9
On Mon, Jun 20, 2016 at 7:55 PM, Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
> On Mon, Jun 20, 2016 at 03:27:03PM -0600, Chris Murphy wrote:
>> On Mon, Jun 20, 2016 at 2:40 PM, Zygo Blaxell
>> <ce3g8jdj@umail.furryterror.org> wrote:
>> > On Mon, Jun 20, 2016 at 01:30:11PM -0600, Chris Murphy wrote:
>>
>> >> For me the critical question is what does "some corrupted sectors" mean?
>> >
>> > On other raid5 arrays, I would observe a small amount of corruption every
>> > time there was a system crash (some of which were triggered by disk
>> > failures, some not).
>>
>> What test are you using to determine there is corruption, and how much
>> data is corrupted? Is this on every disk? Non-deterministically fewer
>> than all disks? Have you identified this as a torn write or
>> misdirected write or is it just garbage at some sectors? And what's
>> the size? Partial sector? Partial md chunk (or fs block?)
>
> In earlier cases, scrub, read(), and btrfs dev stat all reported the
> incidents differently.  Scrub would attribute errors randomly to disks
> (error counts spread randomly across all the disks in the 'btrfs scrub
> status -d' output).  'dev stat' would correctly increment counts on only
> those disks which had individually had an event (e.g. media error or
> SATA bus reset).
>
> Before deploying raid5, I tested these by intentionally corrupting
> one disk in an otherwise healthy raid5 array and watching the result.

It's difficult to reproduce if no one understands how you
intentionally corrupted that disk. Read literally, you corrupted the
entire disk, but that's impractical. The fs is expected to behave
differently depending on what's been corrupted and how much.

> When scrub identified an inode and offset in the kernel log, the csum
> failure log message matched the offsets producing EIO on read(), but
> the statistics reported by scrub about which disk had been corrupted
> were mostly wrong.  In such cases a scrub could repair the data.

I don't often use the -Bd options, so I haven't tested it thoroughly,
but what you're describing sounds like a bug in user space tools. I've
found it reflects the same information as btrfs dev stats, and dev
stats have been reliable in my testing.


> A different thing happens if there is a crash.  In that case, scrub cannot
> repair the errors.  Every btrfs raid5 filesystem I've deployed so far
> behaves this way when disks turn bad.  I had assumed it was a software bug
> in the comparatively new raid5 support that would get fixed eventually.

This is really annoyingly vague. You don't give a complete recipe for
reproducing this sequence. Here's what I'm understanding and what I'm
missing:

1. The intentional corruption, the extent of which is undefined, is still present.
2. A drive is bad, but that doesn't tell us if it's totally dead, or
only intermittently spitting out spurious information.
3. Is the volume remounted degraded or is the bad drive still being
used by Btrfs? Because Btrfs has no concept (patches pending) of drive
faulty state like md, let alone an automatic change to that faulty
state. It just keeps on trying to read or write to bad drives, even if
they're physically removed.
4. You've initiated a scrub, and the corruption in 1 is not fixed.

OK so what am I missing?

Because it sounds to me like you have two copies of data that are
gone. For raid 5 that's data loss; scrub can't fix it. Corruption
is missing data. The bad drive is missing data.

What values do you get for

smartctl -l scterc /dev/sdX
cat /sys/block/sdX/device/timeout



>> This is on Btrfs? This isn't supposed to be possible. Even a literal
>> overwrite of a file is not an overwrite on Btrfs unless the file is
>> nodatacow. Data extents get written, then the metadata is updated to
>> point to those new blocks. There should be flush or fua requests to
>> make sure the order is such that the fs points to either the old or
>> new file, in either case uncorrupted. That's why I'm curious about the
>> nature of this corruption. It sounds like your hardware is not exactly
>> honoring flush requests.
>
> That's true when all the writes are ordered within a single device, but
> possibly not when writes must be synchronized across multiple devices.

I think that's a big problem: the fs cannot be consistent if the super
block points to any tree whose metadata or data isn't on stable media.

But if you think it's happening you might benefit from integrity
checking, maybe try just the metadata one for starters which is the
check_int mount option (it must be compiled in first for that mount
option to work).

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/check-integrity.c?id=refs/tags/v4.6.2
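
That's roughly the following (it only does anything on a kernel built
with CONFIG_BTRFS_FS_CHECK_INTEGRITY=y, and the device name is a
placeholder):

mount -o check_int /dev/sdX /mnt        # metadata-only integrity checks
# check_int_data also covers data writes, at a much larger performance cost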



> It could be possible to corrupt existing data if:
>
>         - btrfs puts multiple extents in the same RAID5 stripe
>
>         - writes two of the extents in different transactions
>
>         - the array is degraded so the RAID5 write hole applies
>
>         - writes stop (crash, powerfail, etc) after updating some of
>         the blocks in the stripe but not others in the second transaction.

I do not know the exact nature of the Btrfs raid56 write hole. Maybe a
dev or someone who knows can explain it.

However, from btrfs-debug-tree on a 3 device raid5 volume:

item 5 key (FIRST_CHUNK_TREE CHUNK_ITEM 1103101952) itemoff 15621 itemsize 144
	chunk length 2147483648 owner 2 stripe_len 65536
	type DATA|RAID5 num_stripes 3
		stripe 0 devid 2 offset 9437184
		dev uuid: 3c6f37eb-5cae-455a-82bc-a1b0877dea55
		stripe 1 devid 1 offset 1094713344
		dev uuid: 13104709-6f30-4982-979e-4f055c326fad
		stripe 2 devid 3 offset 1083179008
		dev uuid: d45fc482-a0c1-46b1-98c1-41cea5a11c80

I expect that parity is in this data block group, and therefore is
checksummed the same as any other data in that block group. While
parity could become stale like with conventional raid5, unlike
conventional raid 5, Btrfs would know about this because there'd be a
checksum mismatch.

So in your example of degraded writes, no matter what, the on disk
format makes it discoverable that there is a problem:

A. The "updating" is still always COW so there is no overwriting.

B. The updated data extent is supposed to be on stable media before
both the metadata pointing to it and its checksum (in the extent
csum tree) are on stable media. Perhaps it's acceptable for certain
metadata to be written out of order with data? But certainly not for
the superblock to be updated to point to updated trees that haven't
actually been written to stable media yet.

But even if that were to happen due to some hardware bug, the fs can
figure this out at mount time as it bootstraps itself from the super,
and will then fail to mount. This is what -o recovery (now
usebackuproot in newer kernels) is used for: in effect the fs
is rolled back to a state where the super points to completely valid
roots. But these mount options are considered red flags for hardware
doing the wrong thing. Yes, you lose some data, which is bad anyway due
to writes being interrupted during an oops, but at least the fs is OK.
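
Roughly:

mount -o usebackuproot /dev/sdX /mnt    # '-o recovery' on pre-4.6 kernels; sdX is a placeholder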



> This corrupts the data written in the first transaction, unless there's
> something in btrfs that explicitly prevents this from happening (e.g.
> the allocator prevents the second step, which would seem to burn up a
> lot of space).

I don't follow how your steps corrupt the data in the first
transaction. Everything is copy on write in Btrfs, including raid5. So
the new transaction does not overwrite the first; it's a new write, and
what should happen is that data gets written to unused sectors, then
metadata is written to point to those sectors (chunk tree, dev tree,
extent tree, csum tree, fs tree, etc), and once that's on stable
media, the superblock is updated. There are all kinds of
optimizations that probably allow some of those things to happen in a
different order depending on what's going on, and the drive will also
reorder those writes. But when Btrfs issues req_flush to the drive
before it writes an updated superblock and the drive lies that it has
flushed when it hasn't, that's when problems happen.

And perhaps naively, I'd expect in any multiple device volume, that
Btrfs issues req_flush right before and after it updates supers on all
the drives. If that's really honored by the drive, then the writing of
supers doesn't happen until everything before the req_flush is
actually on stable media.

The very fact you're not getting mount failures tells me all the
supers on all the drives must be the same, or Btrfs would complain.
There are generally at least three super blocks on each drive, and
each super has four backup roots. So that's 15 opportunities per drive
to determine right off the bat if something is really screwy, and just
fail to mount.

But you're not seeing mount failure. I think something else is going
on. Possibly there was already some weird corruption before either
your intentional corruption or the bad drive happened.


> The filesystem would continue to work afterwards with raid1 metadata
> because every disk in raid1 updates its blocks in the same order,
> and there are no interdependencies between blocks on different disks
> (not like a raid5 stripe, anyway).

I'm not sure what you mean by this. Btrfs raid1 means two copies. It
doesn't matter how many drives there are, there are two copies of
metadata in your case, and you have no idea which drives those
metadata block groups are on without checking btrfs-debug-tree.


>
> Excluding the current event, this only happened to me when the disks
> were doing other questionable things, like SATA link resets or dropping
> off the SATA bus, so there could have been an additional hardware issue.

Link resets are bad. There's bad hardware, a bad cable, or, highly likely
and common, bad sectors plus a misconfiguration of drive error
recovery and the SCSI command timer.


> This does happen on _every_ raid5 deployment I've run so far, though,
> and they don't share any common hardware or firmware components.

Misconfigurations in raid5 are the default. I've mentioned this before on
this list, and the linux-raid@ list is full of such problems. It's pretty
much a weekly occurrence that someone loses all data on a raid5 array due
to this very common misconfiguration.
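
The usual fix is to either cap the drive's internal error recovery below
the kernel's command timer, or raise the timer above the drive's worst
case recovery time (sdX is a placeholder):

smartctl -l scterc,70,70 /dev/sdX           # 7 second recovery, if SCT ERC is supported
echo 180 > /sys/block/sdX/device/timeout    # otherwise give the kernel more patience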


> I've met many btrfs users on IRC but never one with a successful raid56
> recovery story.  Give me a shout on IRC if you have one.  ;)

You'd have to check the archives, they do exist. But typically by the
time they're on this list, the problems are too far gone. It's also
very common on linux-raid@. The issue is people treat their arrays
like they're backups, or like they're permission to reduce backup
frequency, rather than using them for the sole purpose they're
intended for, which is to extend uptime in the face of a *single*
hardware failure, specifically one drive. The instant there are two
problems, implosion is almost certain.
Zygo Blaxell June 22, 2016, 8:35 p.m. UTC | #10
On Wed, Jun 22, 2016 at 11:14:30AM -0600, Chris Murphy wrote:
> > Before deploying raid5, I tested these by intentionally corrupting
> > one disk in an otherwise healthy raid5 array and watching the result.
> 
> It's difficult to reproduce if no one understands how you
> intentionally corrupted that disk. Read literally, you corrupted the
> entire disk, but that's impractical. The fs is expected to behave
> differently depending on what's been corrupted and how much.

The first round of testing I did (a year ago, when deciding whether
btrfs raid5 was mature enough to start using) was:

	Create a 5-disk RAID5

	Put some known data on it until it's full (i.e. random test
	patterns).  At the time I didn't do any tests involving
	compressible data, which I now realize was a serious gap in
	my test coverage.

	Pick 1000 random blocks (excluding superblocks) on one of the
	disks and write random data to them

	Read and verify the data through the filesystem, do scrub, etc.
	Exercise all the btrfs features related to error reporting
	and recovery.

I expected scrub and dev stat to report accurate corruption counts (except
for the 1 in 4 billion case where a bad csum matches by random chance),
and I expected all the data to be reconstructed, since only one drive was
corrupted (assuming there are no unplanned disk failures during the
test, obviously) and the corruption occurred while the filesystem was
offline so there was no possibility of hitting the RAID write hole.

My results from that testing were that everything worked except for the
mostly-harmless quirk where scrub counts errors on random disks instead
of the disk where the errors occur.
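
Roughly, the injection step can be scripted like this (device name, disk
size, and 4KiB block size are placeholders; the three superblock mirrors
at 64KiB, 64MiB and 256GiB are skipped):

	# overwrite 1000 random 4KiB blocks on one member disk
	for blk in $(shuf -i 0-976000000 -n 1000); do
		case "$blk" in 16|16384|67108864) continue ;; esac
		dd if=/dev/urandom of=/dev/sdX bs=4096 seek="$blk" count=1 conv=fsync
	done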

> I don't often use the -Bd options, so I haven't tested it thoroughly,
> but what you're describing sounds like a bug in user space tools. I've
> found it reflects the same information as btrfs dev stats, and dev
> stats have been reliable in my testing.

Don't the user space tools just read what the kernel tells them?
I don't know how *not* to produce this behavior on btrfs raid5 or raid6.
It should show up on any btrfs raid56 system.

> > A different thing happens if there is a crash.  In that case, scrub cannot
> > repair the errors.  Every btrfs raid5 filesystem I've deployed so far
> > behaves this way when disks turn bad.  I had assumed it was a software bug
> > in the comparatively new raid5 support that would get fixed eventually.
> 
> This is really annoyingly vague. You don't give a complete recipe for
> reproducing this sequence. Here's what I'm understanding and what I'm
> missing:
> 
> 1. The intentional corruption, extent of which is undefined, is still present.

No intentional corruption here (quote:  "A different thing happens if
there is a crash...").  Now we are talking about the baseline behavior
when there is a crash on a btrfs raid5 array, especially crashes
triggered by a disk-level failure (e.g. watchdog timeout because a disk
or controller has hung) but also ordinary power failures or other crashes
triggered by external causes.

> 2. A drive is bad, but that doesn't tell us if it's totally dead, or
> only intermittently spitting out spurious information.

The most common drive-initiated reboot case is that one drive temporarily
locks up and triggers the host to perform a watchdog reset.  The reset
is successful and the filesystem can be mounted again with all drives
present; however, a small amount of raid5 data appears to be corrupted
each time.  The raid1 metadata passes all the integrity checks I can
throw at it:  btrfs check, scrub, balance, walk the filesystem with find
-type f -exec cat ..., compare with the last backup, etc.

Usually when I detect this case, I delete any corrupted data, delete
the disk that triggers the lockups and have no further problems with
that array.

> 3. Is the volume remounted degraded or is the bad drive still being
> used by Btrfs? Because Btrfs has no concept (patches pending) of drive
> faulty state like md, let alone an automatic change to that faulty
> state. It just keeps on trying to read or write to bad drives, even if
> they're physically removed.

In the baseline case the filesystem has all drives present after remount.
It could be as simple as power-cycling the host while writes are active.

> 4. You've initiated a scrub, and the corruption in 1 is not fixed.

In this pattern, btrfs may find both correctable and uncorrectable
corrupted data, usually on one of the drives.  scrub fixes the correctable
corruption, but fails on the uncorrectable.

> OK so what am I missing?

Nothing yet.  The above is the "normal" btrfs raid5 crash experience with
a non-degraded raid5 array.  A few megabytes of corrupt extents can easily
be restored from backups or deleted and everything's fine after that.

In my *current* failure case, I'm experiencing *additional* issues.
In total:

	all of the above, plus:

	BUG_ON()s as described in the first mail on this thread,

	additional csum failure kernel messages without corresponding
	dev stat increments,

	several problems that seem to occur only on kernel 4.6.2 and
	not 4.5.7.

Another distinction between the current case and the previous cases is
that this time the array is degraded:  one drive is entirely missing.

The current case seems to be hitting code paths in the kernel that I've
never seen executed before, and it looks like there are problems with
them (e.g. any function with a mirror_num parameter does not seem to be
valid for raid56, and yet the kernel is crashing in some of them).

> Because it sounds to me like you have two copies of data that are
> gone. For raid 5 that's data loss, scrub can't fix things. Corruption
> is missing data. The bad drive is missing data.
> 
> What values do you get for
> 
> smartctl -l scterc /dev/sdX
> cat /sys/block/sdX/device/timeout

"SCT Error Recovery Control command not supported" and "30" respectively.

There are no kernel log messages suggesting timeouts in the current case
after the failed drive was disconnected from the system.  There were
plenty of these starting just after the drive failed, but that's to be
expected, and should be tolerated by a raid5 implementation.

All remaining drives appear to be healthy.  The array completed a scrub
two weeks before the disk failure with no errors (as it has every two
weeks since it was created).

> I do not know the exact nature of the Btrfs raid56 write hole. Maybe a
> dev or someone who knows can explain it.

If you have 3 raid5 devices, they might be laid out on disk like this
(e.g. with a 16K stripe width):

	Address:  0..16K        16..32K         32..64K
        Disk 1: [0..16K]        [32..64K]       [PARITY]
        Disk 2: [16..32K]       [PARITY]        [80..96K]
        Disk 3: [PARITY]        [64..80K]       [96..112K]

btrfs logical address ranges are inside [].  Disk physical address ranges
are shown at the top of each column.  (I've simplified the mapping here;
pretend all the addresses are relative to the start of a block group).

If we want to write a 32K extent at logical address 0, we'd write all
three disks in one column (disk1 gets 0..16K, disk2 gets 16..32K, disk3
gets parity for the other two disks).  The parity will be temporarily
invalid for the time between the first disk write and the last disk write.
In non-degraded mode the parity isn't necessary, but in degraded mode
the entire column cannot be reconstructed because of invalid parity.

To see why this could be a problem, suppose btrfs writes a 4K extent at
logical address 32K.  This requires updating (at least) disk 1 (where the
logical address 32K resides) and disk 2 (the parity for this column).
This means any data that existed at logical addresses 36K..80K (or at
least 32..36K and 64..68K) has its parity temporarily invalidated between
the write to the first and last disks.  If there were metadata pointing
to other blocks in this column, the metadata temporarily points to
damaged data during the write.  If there is no data in other blocks in
this column then it doesn't matter that the parity doesn't match--the
content of the reconstructed unallocated blocks would be undefined
even in the success cases.
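
To make the window concrete, here's a toy model of one column (just the
XOR arithmetic, not btrfs code): strip A is being written in the current
transaction, strip B holds previously committed data, and parity is
updated in place.  A crash between the data write and the parity write
leaves B unrecoverable if its disk then fails:

	#include <stdio.h>
	#include <string.h>
	#include <stdint.h>

	#define STRIP 16

	static void xor_strips(uint8_t *dst, const uint8_t *a, const uint8_t *b)
	{
		for (int i = 0; i < STRIP; i++)
			dst[i] = a[i] ^ b[i];
	}

	int main(void)
	{
		uint8_t a[STRIP], b[STRIP], parity[STRIP], rebuilt[STRIP];

		/* transaction 1: full column write, parity consistent */
		memset(a, 'A', STRIP);		/* disk 1 */
		memset(b, 'B', STRIP);		/* disk 2: committed data */
		xor_strips(parity, a, b);	/* disk 3 */

		/* transaction 2: sub-stripe update of strip A only.  The data
		 * write reaches the platter; we "crash" before the parity write. */
		memset(a, 'C', STRIP);

		/* after reboot, disk 2 dies.  Rebuild B from A ^ parity. */
		xor_strips(rebuilt, a, parity);
		printf("rebuilt B %s the committed data\n",
		       memcmp(rebuilt, b, STRIP) ? "does NOT match" : "matches");
		return 0;
	}

Strip B was never written in the second transaction, yet it's the block
that becomes unrecoverable.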

Last time I checked, btrfs doesn't COW entire RAID stripes (it does
RMW them but that's not the same thing at all).  COW affects only extents
in the logical address space.  To avoid the write hole issue, btrfs
would have to avoid writing to any column that is partially occupied by
existing committed data (i.e. it would have to write the entire column
in a single transaction or not write to the column at all).

btrfs doesn't do this, which can be proven with a simple experiment:

	# btrfs sub create tmp && cd tmp
	# for x in $(seq 0 9); do head -c 4096 < /dev/urandom >> f; sync; done; filefrag -v f
	Filesystem type is: 9123683e
	File size of f is 40960 (10 blocks of 4096 bytes)
	 ext:     logical_offset:        physical_offset: length:   expected: flags:
	   0:        0..       0: 2412725689..2412725689:      1:            
	   1:        1..       1: 2412725690..2412725690:      1:            
	   2:        2..       2: 2412725691..2412725691:      1:            
	   3:        3..       3: 2412725692..2412725692:      1:            
	   4:        4..       4: 2412725693..2412725693:      1:            
	   5:        5..       5: 2412725694..2412725694:      1:            
	   6:        6..       6: 2412725695..2412725695:      1:            
	   7:        7..       7: 2412725698..2412725698:      1: 2412725696:
	   8:        8..       8: 2412725699..2412725699:      1:            
	   9:        9..       9: 2412725700..2412725700:      1:             last,eof
	f: 2 extents found

Here I have allocated 10 consecutive physical blocks in 10 separate
btrfs transactions on a 5-disk raid5 array.

Those physical_offset blocks need to be at least N-1 blocks apart (for a
5-disk array, this is 4) to avoid the write hole problem (except for the
special case of one block at the end of a column adjacent to the first
block of the next column).  They certainly cannot _all_ be adjacent.

If block #3 in this file is modified, btrfs will free physical block
2412725692 because of CoW:

	# head -c 4096 < /dev/urandom | dd seek=3 bs=4k of=f conv=notrunc; sync
	1+0 records in
	1+0 records out
	4096 bytes (4.1 kB) copied, 9.3396e-05 s, 43.9 MB/s
	Filesystem type is: 9123683e
	File size of f is 40960 (10 blocks of 4096 bytes)
	 ext:     logical_offset:        physical_offset: length:   expected: flags:
	   0:        0..       0: 2412725689..2412725689:      1:            
	   1:        1..       1: 2412725690..2412725690:      1:            
	   2:        2..       2: 2412725691..2412725691:      1:            
	   3:        3..       3: 2412725701..2412725701:      1: 2412725692:
	   4:        4..       4: 2412725693..2412725693:      1: 2412725702:
	   5:        5..       5: 2412725694..2412725694:      1:            
	   6:        6..       6: 2412725695..2412725695:      1:            
	   7:        7..       7: 2412725698..2412725698:      1: 2412725696:
	   8:        8..       8: 2412725699..2412725699:      1:            
	   9:        9..       9: 2412725700..2412725700:      1:             last,eof
	f: 4 extents found

If in the future btrfs allocates physical block 2412725692 to
a different file, up to 3 other blocks in this file (most likely
2412725689..2412725691) could be lost if a crash or disk I/O error also
occurs during the same transaction.  btrfs does do this--in fact, the
_very next block_ allocated by the filesystem is 2412725692:

	# head -c 4096 < /dev/urandom >> f; sync; filefrag -v f
	Filesystem type is: 9123683e
	File size of f is 45056 (11 blocks of 4096 bytes)
	 ext:     logical_offset:        physical_offset: length:   expected: flags:
	   0:        0..       0: 2412725689..2412725689:      1:            
	   1:        1..       1: 2412725690..2412725690:      1:            
	   2:        2..       2: 2412725691..2412725691:      1:            
	   3:        3..       3: 2412725701..2412725701:      1: 2412725692:
	   4:        4..       4: 2412725693..2412725693:      1: 2412725702:
	   5:        5..       5: 2412725694..2412725694:      1:            
	   6:        6..       6: 2412725695..2412725695:      1:            
	   7:        7..       7: 2412725698..2412725698:      1: 2412725696:
	   8:        8..       8: 2412725699..2412725699:      1:            
	   9:        9..       9: 2412725700..2412725700:      1:            
	  10:       10..      10: 2412725692..2412725692:      1: 2412725701: last,eof
	f: 5 extents found

These extents all have separate transids too:

	# btrfs sub find-new . 1
	inode 257 file offset 0 len 4096 disk start 9882524422144 offset 0 gen 1432814 flags NONE f
	inode 257 file offset 4096 len 4096 disk start 9882524426240 offset 0 gen 1432816 flags NONE f
	inode 257 file offset 8192 len 4096 disk start 9882524430336 offset 0 gen 1432817 flags NONE f
	inode 257 file offset 12288 len 4096 disk start 9882524471296 offset 0 gen 1432825 flags NONE f
	inode 257 file offset 16384 len 4096 disk start 9882524438528 offset 0 gen 1432819 flags NONE f
	inode 257 file offset 20480 len 4096 disk start 9882524442624 offset 0 gen 1432820 flags NONE f
	inode 257 file offset 24576 len 4096 disk start 9882524446720 offset 0 gen 1432821 flags NONE f
	inode 257 file offset 28672 len 4096 disk start 9882524459008 offset 0 gen 1432822 flags NONE f
	inode 257 file offset 32768 len 4096 disk start 9882524463104 offset 0 gen 1432823 flags NONE f
	inode 257 file offset 36864 len 4096 disk start 9882524467200 offset 0 gen 1432824 flags NONE f
	inode 257 file offset 40960 len 4096 disk start 9882524434432 offset 0 gen 1432826 flags NONE f
	transid marker was 1432826

> > The filesystem would continue to work afterwards with raid1 metadata
> > because every disk in raid1 updates its blocks in the same order,
> > and there are no interdependencies between blocks on different disks
> > (not like a raid5 stripe, anyway).
> 
> I'm not sure what you mean by this. Btrfs raid1 means two copies. It
> doesn't matter how many drives there are, there are two copies of
> metadata in your case, and you have no idea which drives those
> metadata block groups are on without checking btrfs-debug-tree.

What I mean is that there are no dependencies between logically adjacent
blocks on physically separate disks in raid1 as there are in raid5,
because there are no stripes in raid1.  The whole above scenario
cannot occur in raid1.
Goffredo Baroncelli June 23, 2016, 7:32 p.m. UTC | #11
On 2016-06-22 22:35, Zygo Blaxell wrote:
>> I do not know the exact nature of the Btrfs raid56 write hole. Maybe a
>> > dev or someone who knows can explain it.
> If you have 3 raid5 devices, they might be laid out on disk like this
> (e.g. with a 16K stripe width):
> 
> 	Address:  0..16K        16..32K         32..64K
>         Disk 1: [0..16K]        [32..64K]       [PARITY]
>         Disk 2: [16..32K]       [PARITY]        [80..96K]
>         Disk 3: [PARITY]        [64..80K]       [96..112K]
> 
> btrfs logical address ranges are inside [].  Disk physical address ranges
> are shown at the top of each column.  (I've simplified the mapping here;
> pretend all the addresses are relative to the start of a block group).
> 
> If we want to write a 32K extent at logical address 0, we'd write all
> three disks in one column (disk1 gets 0..16K, disk2 gets 16..32K, disk3
> gets parity for the other two disks).  The parity will be temporarily
> invalid for the time between the first disk write and the last disk write.
> In non-degraded mode the parity isn't necessary, but in degraded mode
> the entire column cannot be reconstructed because of invalid parity.
> 
> To see why this could be a problem, suppose btrfs writes a 4K extent at
> logical address 32K.  This requires updating (at least) disk 1 (where the
> logical address 32K resides) and disk 2 (the parity for this column).
> This means any data that existed at logical addresses 36K..80K (or at
> least 32..36K and 64..68K) has its parity temporarily invalidated between
> the write to the first and last disks.  If there were metadata pointing
> to other blocks in this column, the metadata temporarily points to
> damaged data during the write.  If there is no data in other blocks in
> this column then it doesn't matter that the parity doesn't match--the
> content of the reconstructed unallocated blocks would be undefined
> even in the success cases.
[...]

Sorry, but I can't follow you.

RAID5 protects you in case of a failure (or a missing write) of a *single* disk.

The raid write hole happens when a stripe is not completely written on the platters: the parity and the related data mismatch. In this case a "simple" raid5 may return wrong data if the parity is used to compute the data, because a "simple" raid5 is unable to detect whether the returned data is right or not.

The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the checksum. 

BTRFS is able to discard the wrong data: i.e. in the case of a 3-disk raid5, the right data may be extracted from data1+data2, or, if the checksum doesn't match, from data1+parity, or, failing that, from data2+parity.
NOTE1: the real difference between the BTRFS (and ZFS) raid and a "simple" raid5 is that the latter doesn't try another pair of disks.
NOTE2: this works only if a single write is corrupted; if more writes (== more disks) are involved you get checksum mismatches, and raid5 is unable to protect you.
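
In code, the idea is something like the following sketch (only a sketch: the checksum function is a stand-in for the real crc32c, and the strip handling is simplified):

	#include <stdint.h>
	#include <string.h>

	#define STRIP 4096

	/* bitwise CRC-32C as a stand-in for the btrfs data checksum */
	static uint32_t csum(const uint8_t *buf, size_t len)
	{
		uint32_t c = ~0u;
		for (size_t i = 0; i < len; i++) {
			c ^= buf[i];
			for (int k = 0; k < 8; k++)
				c = (c >> 1) ^ (0x82F63B78 & -(c & 1));
		}
		return ~c;
	}

	/* return a good copy of data1, given the strips as read from a
	 * 3-disk raid5 and the stored checksum of data1 */
	static int recover_data1(uint8_t *out, const uint8_t *data1,
				 const uint8_t *data2, const uint8_t *parity,
				 uint32_t csum1)
	{
		/* first try the block as read from its own disk */
		if (csum(data1, STRIP) == csum1) {
			memcpy(out, data1, STRIP);
			return 0;
		}
		/* otherwise rebuild it from the other data strip + parity */
		for (int i = 0; i < STRIP; i++)
			out[i] = data2[i] ^ parity[i];
		if (csum(out, STRIP) == csum1)
			return 0;
		return -1;	/* two of three strips bad: raid5 cannot help */
	}

A "simple" raid5 stops after the first combination because it has no checksum to tell it that the result is wrong; this is NOTE1 above.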

In case of "degraded mode", you don't have any redundancy. So if a stripe of a degraded filesystem is not fully written to the disk, is like a block not fully written to the disk. And you have checksums mismatch. But this is not what is called raid write hole.


On 2016-06-22 22:35, Zygo Blaxell wrote:
> If in the future btrfs allocates physical block 2412725692 to
> a different file, up to 3 other blocks in this file (most likely
> 2412725689..2412725691) could be lost if a crash or disk I/O error also
> occurs during the same transaction.  btrfs does do this--in fact, the
> _very next block_ allocated by the filesystem is 2412725692:
> 
> 	# head -c 4096 < /dev/urandom >> f; sync; filefrag -v f
> 	Filesystem type is: 9123683e
> 	File size of f is 45056 (11 blocks of 4096 bytes)
> 	 ext:     logical_offset:        physical_offset: length:   expected: flags:
> 	   0:        0..       0: 2412725689..2412725689:      1:            
> 	   1:        1..       1: 2412725690..2412725690:      1:            
> 	   2:        2..       2: 2412725691..2412725691:      1:            
> 	   3:        3..       3: 2412725701..2412725701:      1: 2412725692:
> 	   4:        4..       4: 2412725693..2412725693:      1: 2412725702:
> 	   5:        5..       5: 2412725694..2412725694:      1:            
> 	   6:        6..       6: 2412725695..2412725695:      1:            
> 	   7:        7..       7: 2412725698..2412725698:      1: 2412725696:
> 	   8:        8..       8: 2412725699..2412725699:      1:            
> 	   9:        9..       9: 2412725700..2412725700:      1:            
> 	  10:       10..      10: 2412725692..2412725692:      1: 2412725701: last,eof
> 	f: 5 extents found

You are assuming that if you touch a block, all the blocks of the same stripe spread over the disks are involved. I disagree. The only parts which are involved are the part of the stripe which contains the changed block and the part which contains the parity.
If both of these parts become corrupted, RAID5 is unable to protect you (two failures, when raid5 has only _one_ redundancy). But if only one of them is corrupted, BTRFS, with the help of the checksum, is capable of detecting which one is corrupted and of returning good data (and rebuilding the bad parts).


BR
G.Baroncelli
Chris Murphy June 23, 2016, 11:37 p.m. UTC | #12
On Wed, Jun 22, 2016 at 11:14 AM, Chris Murphy <lists@colorremedies.com> wrote:

>
> However, from btrfs-debug-tree from a 3 device raid5 volume:
>
> item 5 key (FIRST_CHUNK_TREE CHUNK_ITEM 1103101952) itemoff 15621 itemsize 144
> chunk length 2147483648 owner 2 stripe_len 65536
> type DATA|RAID5 num_stripes 3
> stripe 0 devid 2 offset 9437184
> dev uuid: 3c6f37eb-5cae-455a-82bc-a1b0877dea55
> stripe 1 devid 1 offset 1094713344
> dev uuid: 13104709-6f30-4982-979e-4f055c326fad
> stripe 2 devid 3 offset 1083179008
> dev uuid: d45fc482-a0c1-46b1-98c1-41cea5a11c80
>
> I expect that parity is in this data block group, and therefore is
> checksummed the same as any other data in that block group.

This appears to be wrong. Comparing the same file, one file only, on
two new Btrfs volumes, one volume single, one volume raid5, I get a
single csum tree entry:

raid5
    item 0 key (EXTENT_CSUM EXTENT_CSUM 12009865216) itemoff 16155 itemsize 128
        extent csum item

single

    item 0 key (EXTENT_CSUM EXTENT_CSUM 2168717312) itemoff 16155 itemsize 128
        extent csum item

They're both the same size. They both contain the same data. So it
looks like parity is not separately checksummed.

If there's a missing 64KiB data strip (bad sector, or dead drive), the
reconstruction of that strip from parity should match available csums
for those blocks. So in this way it's possible to infer if the parity
strip is bad. But it also means assuming that everything else about this
full stripe (the remaining data strips and their csums) is correct.




> So in your example of degraded writes, no matter what the on disk
> format makes it discoverable there is a problem:
>
> A. The "updating" is still always COW so there is no overwriting.

There is RMW code in btrfs/raid56.c but I don't know when that gets
triggered. With simple files changing one character with vi and gedit,
I get completely different logical and physical numbers with each
change, so it's clearly cowing the entire stripe (192KiB in my 3 dev
raid5).


[root@f24s ~]# filefrag -v /mnt/5/64k-a-then64k-b.txt
Filesystem type is: 9123683e
File size of /mnt/5/64k-a-then64k-b.txt is 131072 (32 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..      31:    2931744..   2931775:     32:             last,eof
/mnt/5/64k-a-then64k-b.txt: 1 extent found
[root@f24s ~]# btrfs-map-logical -l $[4096*2931744] /dev/VG/a
mirror 1 logical 12008423424 physical 1114112 device /dev/mapper/VG-b
mirror 2 logical 12008423424 physical 34668544 device /dev/mapper/VG-a
[root@f24s ~]# vi /mnt/5/64k-a-then64k-b.txt
[root@f24s ~]# filefrag -v /mnt/5/64k-a-then64k-b.txt
Filesystem type is: 9123683e
File size of /mnt/5/64k-a-then64k-b.txt is 131072 (32 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..      31:    2931776..   2931807:     32:             last,eof
/mnt/5/64k-a-then64k-b.txt: 1 extent found
[root@f24s ~]# btrfs-map-logical -l $[4096*29317776] /dev/VG/a
No extent found at range [120085610496,120085626880)
[root@f24s ~]# btrfs-map-logical -l $[4096*2931776] /dev/VG/a
mirror 1 logical 12008554496 physical 1108475904 device /dev/mapper/VG-c
mirror 2 logical 12008554496 physical 1179648 device /dev/mapper/VG-b
[root@f24s ~]#

There is a neat bug/RFE I found for btrfs-map-logical [1]: it doesn't
report back the physical locations for all num_stripes on the volume.
It only spits back two, and sometimes it's the two data strips,
sometimes it's one data and one parity strip.


[1]
https://bugzilla.kernel.org/show_bug.cgi?id=120941
Chris Murphy June 24, 2016, 12:26 a.m. UTC | #13
On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli <kreijack@inwind.it> wrote:

>
> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the checksum.

Yeah I'm kinda confused on this point.

https://btrfs.wiki.kernel.org/index.php/RAID56

It says there is a write hole for Btrfs. But defines it in terms of
parity possibly being stale after a crash. I think the term comes not
from merely parity being wrong but parity being wrong *and* then being
used to wrongly reconstruct data because it's blindly trusted.

I don't read code well enough, but I'd be surprised if Btrfs
reconstructs from parity and doesn't then check the resulting
reconstructed data to its EXTENT_CSUM.
Zygo Blaxell June 24, 2016, 1:36 a.m. UTC | #14
On Thu, Jun 23, 2016 at 09:32:50PM +0200, Goffredo Baroncelli wrote:
> The raid write hole happens when a stripe is not completely written
> on the platters: the parity and the related data mismatch. In this
> case a "simple" raid5 may return wrong data if the parity is used to
> compute the data. But this happens because a "simple" raid5 is unable
> to detected if the returned data is right or none.
> 
> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the
> checksum.

Checksums do not help with the raid5 write hole.  The way btrfs does
checksums might even make it worse.

ZFS reduces the number of disks in a stripe when a disk failure
is detected so that writes are always in non-degraded mode, and it
presumably avoids sub-stripe-width data allocations or uses journalling
to avoid the write hole.  btrfs seems to use neither tactic.  At best,
btrfs will avoid creating new block groups on disks that are missing at
mount time, and it doesn't deal with sub-stripe-width allocations at all.

I'm working from two assumptions as I haven't found all the relevant
code yet:

	1.  btrfs writes parity stripes at fixed locations relative to
	the data in the same stripe.  If this is true, then the parity
	blocks are _not_ CoW while the data blocks and their checksums
	_are_ CoW.  I don't know if the parity block checksums are also
	CoW.

	2.  btrfs sometimes puts data from two different transactions in
	the same stripe at the same time--a fundamental violation of the
	CoW concept.  I inferred this from the logical block addresses.

Unless I'm missing something in the code somewhere, parity blocks can
have out-of-date checksums for short periods of time between flushes and
commits.  This would lose data by falsely reporting valid parity blocks
as checksum failures.  If any *single* failure occurs at the same time
(such as a missing write or disk failure) a small amount of data will
be lost.

> BTRFS is able to discard the wrong data: i.e. in case of a 3 disks
> raid5, the right data may be extracted from the data1+data2 or if the
> checksum doesn't match from data1+parity or if the checksum doesn't
> match from data2+parity.

Suppose we have a sequence like this (3-disk RAID5 array, one stripe
containing 2 data and 1 parity block) starting with the stripe empty:

	1.  write data block 1 to disk 1 of stripe (parity is now invalid, no checksum yet)

	2.  write parity block to disk 3 in stripe (parity becomes valid again, no checksum yet)

	3.  commit metadata pointing to block 1 (parity and checksums now valid)

	4.  write data block 2 to disk 2 of stripe (parity and parity checksum now invalid)

	5.  write parity block to disk 3 in stripe (parity valid now, parity checksum still invalid)

	6.  commit metadata pointing to block 2 (parity and checksums now valid)

We can be interrupted at any point between step 1 and 4 with no data loss.
Before step 3 the data and parity blocks are not part of the extent
tree so their contents are irrelevant.  After step 3 (assuming each
step is completed in order) data block 1 is part of the extent tree and
can be reconstructed if any one disk fails.  This is the part of btrfs
raid5 that works.

If we are interrupted between steps 4 and 6 (e.g. power fails), a single
disk failure or corruption will cause data loss in block 1.  Note that
block 1 is *not* written between steps 4 and 6, so we are retroactively
damaging some previously written data that is not part of the current
transaction.

If we are interrupted between steps 4 and 5, we can no longer reconstruct
block 1 (block2 ^ parity) or block 2 (block1 ^ parity) because the parity
block doesn't match the data blocks in the same stripe
(i.e. block1 ^ block2 != parity).

If we are interrupted between step 5 and 6, the parity block checksum
committed at step 3 will fail.  Data block 2 will not be accessible
since the metadata was not written to point to it, but data block 1
will be intact, readable, and have a correct checksum as long as none
of the disks fail.  This can be repaired by a scrub (scrub will simply
throw the parity block away and reconstruct it from block1 and block2).
If disk 1 fails before the next scrub, data block 1 will be lost because
btrfs will believe the parity block is incorrect even though it is not.

This risk happens on *every* write to a stripe that is not a full stripe
write and contains existing committed data blocks.  It will occur more
often on full and heavily fragmented filesystems (filesystems which 
have these properties are more likely to write new data on stripes that 
already contain old data).

In cases where an entire stripe is written at once, or a stripe is
partially filled but no further writes ever modify the stripe, everything
works as intended in btrfs.

> NOTE2: this works if only one write is corrupted. If more writes (==
> more disks) are involved, you got checksum mismatch. If more than one
> write are corrupted, raid5 is unable to protect you.

No write corruption is required for data loss.  Data can be lost with
any single disk failure.

> In case of "degraded mode", you don't have any redundancy. So if a
> stripe of a degraded filesystem is not fully written to the disk,
> is like a block not fully written to the disk. And you have checksums
> mismatch. But this is not what is called raid write hole.

If a block is not fully written to the disk, btrfs should not update
the metadata tree to point to it, so no (committed) data will be lost.

> On 2016-06-22 22:35, Zygo Blaxell wrote:
> > If in the future btrfs allocates physical block 2412725692 to
> > a different file, up to 3 other blocks in this file (most likely
> > 2412725689..2412725691) could be lost if a crash or disk I/O error also
> > occurs during the same transaction.  btrfs does do this--in fact, the
> > _very next block_ allocated by the filesystem is 2412725692:
> > 
> > 	# head -c 4096 < /dev/urandom >> f; sync; filefrag -v f
> > 	Filesystem type is: 9123683e
> > 	File size of f is 45056 (11 blocks of 4096 bytes)
> > 	 ext:     logical_offset:        physical_offset: length:   expected: flags:
> > 	   0:        0..       0: 2412725689..2412725689:      1:            
> > 	   1:        1..       1: 2412725690..2412725690:      1:            
> > 	   2:        2..       2: 2412725691..2412725691:      1:            
> > 	   3:        3..       3: 2412725701..2412725701:      1: 2412725692:
> > 	   4:        4..       4: 2412725693..2412725693:      1: 2412725702:
> > 	   5:        5..       5: 2412725694..2412725694:      1:            
> > 	   6:        6..       6: 2412725695..2412725695:      1:            
> > 	   7:        7..       7: 2412725698..2412725698:      1: 2412725696:
> > 	   8:        8..       8: 2412725699..2412725699:      1:            
> > 	   9:        9..       9: 2412725700..2412725700:      1:            
> > 	  10:       10..      10: 2412725692..2412725692:      1: 2412725701: last,eof
> > 	f: 5 extents found
> 
> You are assuming that if you touch a block, all the blocks of the same
> stripe spread over the disks are involved. I disagree. The only parts
> which are involved, are the part of stripe which contains the changed
> block and the parts which contains the parity.

Any block change always affects all the others in the same stripe.  With
checksums, every write temporarily puts the stripe it touches into degraded
mode (although
this only happens to the parity blocks as the data blocks are protected
by CoW algorithms).  Checksums are CoW and updated in strictly-ordered
transactions with write barriers; stripes are modified in-place and they
are a function of the contents of many non-atomically-updated disks.

> If both the parts become corrupted, RAID5 is unable to protect you
> (two failure, when raid 5 has only _one_ redundancy). But if only
> one of these is corrupted, BTRFS with the help of the checksum is
> capable to detect which one is corrupted and to return good data
> (and to rebuild the bad parts).

This is only true for stripes that are not being updated.  During a stripe
update, btrfs raid5 seems to be much less tolerant, only one disk failure
away from data loss (of a single stripe) instead of the expected two.

All this rests on my assumptions above.  If those turn out to be wrong,
then most of the above is wrong too, and I need a new theory to explain
why almost every unplanned reboot corrupts a small amount of data on
big raid5 filesystems.
Zygo Blaxell June 24, 2016, 1:47 a.m. UTC | #15
On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:
> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli <kreijack@inwind.it> wrote:
> > The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the checksum.
> 
> Yeah I'm kinda confused on this point.
> 
> https://btrfs.wiki.kernel.org/index.php/RAID56
> 
> It says there is a write hole for Btrfs. But defines it in terms of
> parity possibly being stale after a crash. I think the term comes not
> from merely parity being wrong but parity being wrong *and* then being
> used to wrongly reconstruct data because it's blindly trusted.

I think the opposite is more likely, as the layers above raid56
seem to check the data against sums before raid56 ever sees it.
(If those layers seem inverted to you, I agree, but OTOH there are
probably good reason to do it that way).

It looks like uncorrectable failures might occur because parity is
correct, but the parity checksum is out of date, so the parity checksum
doesn't match even though data blindly reconstructed from the parity
*would* match the data.

> I don't read code well enough, but I'd be surprised if Btrfs
> reconstructs from parity and doesn't then check the resulting
> reconstructed data to its EXTENT_CSUM.

I wouldn't be surprised if both things happen in different code paths,
given the number of different paths leading into the raid56 code and
the number of distinct failure modes it seems to have.
Zygo Blaxell June 24, 2016, 2:07 a.m. UTC | #16
On Thu, Jun 23, 2016 at 05:37:09PM -0600, Chris Murphy wrote:
> > So in your example of degraded writes, no matter what the on disk
> > format makes it discoverable there is a problem:
> >
> > A. The "updating" is still always COW so there is no overwriting.
> 
> There is RMW code in btrfs/raid56.c but I don't know when that gets
> triggered. 

RMW seems to be for cases where part of a stripe is modified but the
entire stripe has not yet been read into memory.  It reads the remaining
blocks (reconstructing missing blocks if necessary) then calculates
new parity blocks.
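
In outline it's something like this (just the shape of the operation,
not the actual raid56.c code):

	#include <stdint.h>
	#include <string.h>

	/* read-modify-write of one data strip in an N-data-disk raid5 stripe */
	static void rmw_one_strip(uint8_t **data, uint8_t *parity, int ndata,
				  int target, const uint8_t *newbuf, size_t len)
	{
		/* read phase: every other data strip must already be in memory,
		 * read from disk or reconstructed if its disk is missing */

		/* modify: drop the new contents into the target strip */
		memcpy(data[target], newbuf, len);

		/* write: recompute parity over all the data strips, then write
		 * the target strip and the parity strip back out.  Between those
		 * two device writes the stripe is effectively degraded. */
		memset(parity, 0, len);
		for (int d = 0; d < ndata; d++)
			for (size_t i = 0; i < len; i++)
				parity[i] ^= data[d][i];
	}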

> With simple files changing one character with vi and gedit,
> I get completely different logical and physical numbers with each
> change, so it's clearly cowing the entire stripe (192KiB in my 3 dev
> raid5).

You are COWing the entire file because vi and gedit do truncate followed
by full-file write.

Try again with 'dd conv=notrunc bs=4k count=1 seek=N of=...' or
edit the file with a sector-level hex editor.

> [root@f24s ~]# filefrag -v /mnt/5/64k-a-then64k-b.txt
> Filesystem type is: 9123683e
> File size of /mnt/5/64k-a-then64k-b.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    2931744..   2931775:     32:             last,eof
> /mnt/5/64k-a-then64k-b.txt: 1 extent found
> [root@f24s ~]# btrfs-map-logical -l $[4096*2931744] /dev/VG/a
> mirror 1 logical 12008423424 physical 1114112 device /dev/mapper/VG-b
> mirror 2 logical 12008423424 physical 34668544 device /dev/mapper/VG-a
> [root@f24s ~]# vi /mnt/5/64k-a-then64k-b.txt
> [root@f24s ~]# filefrag -v /mnt/5/64k-a-then64k-b.txt
> Filesystem type is: 9123683e
> File size of /mnt/5/64k-a-then64k-b.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    2931776..   2931807:     32:             last,eof
> /mnt/5/64k-a-then64k-b.txt: 1 extent found
> [root@f24s ~]# btrfs-map-logical -l $[4096*29317776] /dev/VG/a
> No extent found at range [120085610496,120085626880)
> [root@f24s ~]# btrfs-map-logical -l $[4096*2931776] /dev/VG/a
> mirror 1 logical 12008554496 physical 1108475904 device /dev/mapper/VG-c
> mirror 2 logical 12008554496 physical 1179648 device /dev/mapper/VG-b
> [root@f24s ~]#
> 
> There is a neat bug/rfe I found for btrfs-map-logical, it doesn't
> report back the physical locations for all num_stripes on the volume.
> It only spits back two, and sometimes it's the two data strips,
> sometimes it's one data and one parity strip.
> 
> 
> [1]
> https://bugzilla.kernel.org/show_bug.cgi?id=120941
> 
Zygo Blaxell June 24, 2016, 2:17 a.m. UTC | #17
On Thu, Jun 23, 2016 at 05:37:09PM -0600, Chris Murphy wrote:
> > I expect that parity is in this data block group, and therefore is
> > checksummed the same as any other data in that block group.
> 
> This appears to be wrong. Comparing the same file, one file only, one
> two new Btrfs volumes, one volume single, one volume raid5, I get a
> single csum tree entry:
> 
> raid5
>     item 0 key (EXTENT_CSUM EXTENT_CSUM 12009865216) itemoff 16155 itemsize 128
>         extent csum item
> 
> single
> 
>     item 0 key (EXTENT_CSUM EXTENT_CSUM 2168717312) itemoff 16155 itemsize 128
>         extent csum item
> 
> They're both the same size. They both contain the same data. So it
> looks like parity is not separately checksummed.

I'm inclined to agree because I didn't find any code that *writes* parity
csums...but if there are no parity csums, what does this code do?

scrub.c:
	static noinline_for_stack int scrub_raid56_parity(struct scrub_ctx *sctx,
	[...]
			ret = btrfs_lookup_csums_range(csum_root,
						extent_logical,
						extent_logical + extent_len - 1,
						&sctx->csum_list, 1);
			if (ret)
				goto out;

                        ret = scrub_extent_for_parity(sparity, extent_logical,
                                                      extent_len,
                                                      extent_physical,
                                                      extent_dev, flags,
                                                      generation,
                                                      extent_mirror_num);
Andrei Borzenkov June 24, 2016, 4:02 a.m. UTC | #18
On 24.06.2016 04:47, Zygo Blaxell wrote:
> On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:
>> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli <kreijack@inwind.it> wrote:
>>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the checksum.
>>
>> Yeah I'm kinda confused on this point.
>>
>> https://btrfs.wiki.kernel.org/index.php/RAID56
>>
>> It says there is a write hole for Btrfs. But defines it in terms of
>> parity possibly being stale after a crash. I think the term comes not
>> from merely parity being wrong but parity being wrong *and* then being
>> used to wrongly reconstruct data because it's blindly trusted.
> 
> I think the opposite is more likely, as the layers above raid56
> seem to check the data against sums before raid56 ever sees it.
> (If those layers seem inverted to you, I agree, but OTOH there are
> probably good reason to do it that way).
> 

Yes, that's how I read code as well. btrfs layer that does checksumming
is unaware of parity blocks at all; for all practical purposes they do
not exist. What happens is approximately

1. logical extent is allocated and checksum computed
2. it is mapped to physical area(s) on disks, skipping over what would
be parity blocks
3. when these areas are written out, RAID56 parity is computed and filled in

IOW btrfs checksums are for (meta)data and RAID56 parity is not data.
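
A toy version of step 2, just to make it concrete (the rotation rule
here is only illustrative, not necessarily the real btrfs chunk
mapping; the point is that no logical address ever lands on parity):

	#include <stdio.h>
	#include <stdint.h>

	#define NDISKS    3
	#define NDATA     (NDISKS - 1)
	#define STRIP_LEN 65536ULL

	/* map a chunk-relative logical offset to (disk, disk offset),
	 * skipping the parity strip, which rotates per stripe */
	static void map_logical(uint64_t logical, int *disk, uint64_t *offset)
	{
		uint64_t strip_nr  = logical / STRIP_LEN;	/* which data strip  */
		uint64_t stripe_nr = strip_nr / NDATA;		/* which full stripe */
		int data_slot      = strip_nr % NDATA;
		int parity_disk    = stripe_nr % NDISKS;

		*disk   = (parity_disk + 1 + data_slot) % NDISKS;
		*offset = stripe_nr * STRIP_LEN + logical % STRIP_LEN;
	}

	int main(void)
	{
		for (uint64_t l = 0; l < 6 * STRIP_LEN; l += STRIP_LEN) {
			int disk;
			uint64_t off;
			map_logical(l, &disk, &off);
			printf("logical %7llu -> disk %d offset %7llu\n",
			       (unsigned long long)l, disk,
			       (unsigned long long)off);
		}
		return 0;
	}

Only the data strips get logical addresses (and therefore checksums);
the parity locations are filled in later, at write-out time, exactly as
in step 3.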

> It looks like uncorrectable failures might occur because parity is
> correct, but the parity checksum is out of date, so the parity checksum
> doesn't match even though data blindly reconstructed from the parity
> *would* match the data.
> 

Yep, that is how I read it too. So if your data is checksummed, it
should at least avoid silent corruption.

>> I don't read code well enough, but I'd be surprised if Btrfs
>> reconstructs from parity and doesn't then check the resulting
>> reconstructed data to its EXTENT_CSUM.
> 
> I wouldn't be surprised if both things happen in different code paths,
> given the number of different paths leading into the raid56 code and
> the number of distinct failure modes it seems to have.
> 

Well, the problem is that parity block cannot be redirected on write as
data blocks; which makes it impossible to version control it. The only
solution I see is to always use full stripe writes by either wasting
time in fixed width stripe or using variable width, so that every stripe
always gets new version of parity. This makes it possible to keep parity
checksums like data checksums.
Chris Murphy June 24, 2016, 5:20 a.m. UTC | #19
On Thu, Jun 23, 2016 at 8:07 PM, Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:

>> With simple files changing one character with vi and gedit,
>> I get completely different logical and physical numbers with each
>> change, so it's clearly cowing the entire stripe (192KiB in my 3 dev
>> raid5).
>
> You are COWing the entire file because vi and gedit do truncate followed
> by full-file write.

I'm seeing the file inode change with either a vi or gedit
modification, even when the file size is exactly the same and just a
character is substituted. So as far as VFS and Btrfs are concerned, it's
an entirely
different file, so it's like faux-CoW that would have happened on any
file system, not an overwrite.


> Try again with 'dd conv=notrunc bs=4k count=1 seek=N of=...' or
> edit the file with a sector-level hex editor.

The inode is now the same; one of the 4096 byte blocks is
dereferenced, a new 4096 byte block is referenced and written, the
other 3 blocks remain untouched, and the other files in the stripe remain
untouched. So it's pretty clearly cow'd in this case.

[root@f24s ~]# filefrag -v /mnt/5/*
Filesystem type is: 9123683e
File size of /mnt/5/a.txt is 16383 (4 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       3:    2931712..   2931715:      4:             last,eof
/mnt/5/a.txt: 1 extent found
File size of /mnt/5/b.txt is 16383 (4 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       3:    2931716..   2931719:      4:             last,eof
/mnt/5/b.txt: 1 extent found
File size of /mnt/5/c.txt is 16383 (4 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       3:    2931720..   2931723:      4:             last,eof
/mnt/5/c.txt: 1 extent found
File size of /mnt/5/d.txt is 16383 (4 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       3:    2931724..   2931727:      4:             last,eof
/mnt/5/d.txt: 1 extent found
File size of /mnt/5/e.txt is 16383 (4 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       3:    2931728..   2931731:      4:             last,eof
/mnt/5/e.txt: 1 extent found

[root@f24s ~]# ls -li /mnt/5/*
285 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/a.txt
286 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/b.txt
287 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/c.txt
288 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/d.txt
289 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/e.txt

[root@f24s ~]# btrfs-map-logical -l $[4096*2931712] /dev/VG/a
mirror 1 logical 12008292352 physical 34603008 device /dev/mapper/VG-a
mirror 2 logical 12008292352 physical 1108344832 device /dev/mapper/VG-c
[root@f24s ~]# btrfs-map-logical -l $[4096*2931716] /dev/VG/a
mirror 1 logical 12008308736 physical 34619392 device /dev/mapper/VG-a
mirror 2 logical 12008308736 physical 1108361216 device /dev/mapper/VG-c
[root@f24s ~]# btrfs-map-logical -l $[4096*2931720] /dev/VG/a
mirror 1 logical 12008325120 physical 34635776 device /dev/mapper/VG-a
mirror 2 logical 12008325120 physical 1108377600 device /dev/mapper/VG-c
[root@f24s ~]# btrfs-map-logical -l $[4096*2931724] /dev/VG/a
mirror 1 logical 12008341504 physical 34652160 device /dev/mapper/VG-a
mirror 2 logical 12008341504 physical 1108393984 device /dev/mapper/VG-c
[root@f24s ~]# btrfs-map-logical -l $[4096*2931728] /dev/VG/a
mirror 1 logical 12008357888 physical 1048576 device /dev/mapper/VG-b
mirror 2 logical 12008357888 physical 1108344832 device /dev/mapper/VG-c


[root@f24s ~]# echo -n "g" | dd of=/mnt/5/a.txt conv=notrunc
0+1 records in
0+1 records out
1 byte copied, 0.000314582 s, 3.2 kB/s
[root@f24s ~]# ls -li /mnt/5/*
285 -rw-r--r--. 1 root root 16383 Jun 23 23:06 /mnt/5/a.txt
286 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/b.txt
287 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/c.txt
288 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/d.txt
289 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/e.txt

[root@f24s ~]# filefrag -v /mnt/5/*
Filesystem type is: 9123683e
File size of /mnt/5/a.txt is 16383 (4 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       0:    2931732..   2931732:      1:
   1:        1..       3:    2931713..   2931715:      3:    2931733: last,eof
/mnt/5/a.txt: 2 extents found
File size of /mnt/5/b.txt is 16383 (4 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       3:    2931716..   2931719:      4:             last,eof
/mnt/5/b.txt: 1 extent found
File size of /mnt/5/c.txt is 16383 (4 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       3:    2931720..   2931723:      4:             last,eof
/mnt/5/c.txt: 1 extent found
File size of /mnt/5/d.txt is 16383 (4 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       3:    2931724..   2931727:      4:             last,eof
/mnt/5/d.txt: 1 extent found
File size of /mnt/5/e.txt is 16383 (4 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       3:    2931728..   2931731:      4:             last,eof
/mnt/5/e.txt: 1 extent found

[root@f24s ~]# btrfs-map-logical -l $[4096*2931732] /dev/VG/a
mirror 1 logical 12008374272 physical 1064960 device /dev/mapper/VG-b
mirror 2 logical 12008374272 physical 1108361216 device /dev/mapper/VG-c


It has been cow'd.

[root@f24s ~]# dd if=/dev/VG/b bs=1 skip=1064960 count=4096
2>/dev/null | hexdump -C
00000000  67 61 61 61 61 61 61 61  61 61 61 61 61 61 61 61  |gaaaaaaaaaaaaaaa|
00000010  61 61 61 61 61 61 61 61  61 61 61 61 61 61 61 61  |aaaaaaaaaaaaaaaa|
*
00001000
[root@f24s ~]# dd if=/dev/VG/c bs=1 skip=1108361216 count=4096
2>/dev/null | hexdump -C
00000000  05 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
00000010  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
*
00001000
[root@f24s ~]# btrfs-map-logical -l $[4096*2931712] /dev/VG/a
mirror 1 logical 12008292352 physical 34603008 device /dev/mapper/VG-a
mirror 2 logical 12008292352 physical 1108344832 device /dev/mapper/VG-c
[root@f24s ~]# dd if=/dev/VG/a bs=1 skip=34603008 count=4096
2>/dev/null | hexdump -C
00000000  61 61 61 61 61 61 61 61  61 61 61 61 61 61 61 61  |aaaaaaaaaaaaaaaa|
*
00001000


So at the old address, the "aaaaa..." is still there. And the single
block added for this file, at new logical and physical addresses,
contains the modification substituting "g" for the first "a".

In this case there is no rmw, no partial stripe modification, and no data
already on-disk is at risk. Even the metadata leaf/node is cow'd; it
has a new logical and physical address as well, and contains
information for all five files.
Hugo Mills June 24, 2016, 8:50 a.m. UTC | #20
On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote:
> On 24.06.2016 04:47, Zygo Blaxell wrote:
> > On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:
> >> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli <kreijack@inwind.it> wrote:
> >>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the checksum.
> >>
> >> Yeah I'm kinda confused on this point.
> >>
> >> https://btrfs.wiki.kernel.org/index.php/RAID56
> >>
> >> It says there is a write hole for Btrfs. But defines it in terms of
> >> parity possibly being stale after a crash. I think the term comes not
> >> from merely parity being wrong but parity being wrong *and* then being
> >> used to wrongly reconstruct data because it's blindly trusted.
> > 
> > I think the opposite is more likely, as the layers above raid56
> > seem to check the data against sums before raid56 ever sees it.
> > (If those layers seem inverted to you, I agree, but OTOH there are
> > probably good reason to do it that way).
> > 
> 
> Yes, that's how I read code as well. btrfs layer that does checksumming
> is unaware of parity blocks at all; for all practical purposes they do
> not exist. What happens is approximately
> 
> 1. logical extent is allocated and checksum computed
> 2. it is mapped to physical area(s) on disks, skipping over what would
> be parity blocks
> 3. when these areas are written out, RAID56 parity is computed and filled in
> 
> IOW btrfs checksums are for (meta)data and RAID56 parity is not data.

   Checksums are not parity, correct. However, every data block
(including, I think, the parity) is checksummed and put into the csum
tree. This allows the FS to determine where damage has occurred,
rather than simply detecting that it has occurred (which would be the
case if the parity doesn't match the data, or if the two copies of a
RAID-1 array don't match).

   (Note that csums for metadata are stored in the metadata block
itself, not in the csum tree).

   Hugo.

> > It looks like uncorrectable failures might occur because parity is
> > correct, but the parity checksum is out of date, so the parity checksum
> > doesn't match even though data blindly reconstructed from the parity
> > *would* match the data.
> > 
> 
> Yep, that is how I read it too. So if your data is checksummed, it
> should at least avoid silent corruption.
> 
> >> I don't read code well enough, but I'd be surprised if Btrfs
> >> reconstructs from parity and doesn't then check the resulting
> >> reconstructed data to its EXTENT_CSUM.
> > 
> > I wouldn't be surprised if both things happen in different code paths,
> > given the number of different paths leading into the raid56 code and
> > the number of distinct failure modes it seems to have.
> > 
> 
> Well, the problem is that parity block cannot be redirected on write as
> data blocks; which makes it impossible to version control it. The only
> solution I see is to always use full stripe writes by either wasting
> time in fixed width stripe or using variable width, so that every stripe
> always gets new version of parity. This makes it possible to keep parity
> checksums like data checksums.
>
Andrei Borzenkov June 24, 2016, 9:52 a.m. UTC | #21
On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
> On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote:
>> On 24.06.2016 04:47, Zygo Blaxell wrote:
>> > On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:
>> >> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli <kreijack@inwind.it> wrote:
>> >>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the checksum.
>> >>
>> >> Yeah I'm kinda confused on this point.
>> >>
>> >> https://btrfs.wiki.kernel.org/index.php/RAID56
>> >>
>> >> It says there is a write hole for Btrfs. But defines it in terms of
>> >> parity possibly being stale after a crash. I think the term comes not
>> >> from merely parity being wrong but parity being wrong *and* then being
>> >> used to wrongly reconstruct data because it's blindly trusted.
>> >
>> > I think the opposite is more likely, as the layers above raid56
>> > seem to check the data against sums before raid56 ever sees it.
>> > (If those layers seem inverted to you, I agree, but OTOH there are
>> > probably good reason to do it that way).
>> >
>>
>> Yes, that's how I read code as well. btrfs layer that does checksumming
>> is unaware of parity blocks at all; for all practical purposes they do
>> not exist. What happens is approximately
>>
>> 1. logical extent is allocated and checksum computed
>> 2. it is mapped to physical area(s) on disks, skipping over what would
>> be parity blocks
>> 3. when these areas are written out, RAID56 parity is computed and filled in
>>
>> IOW btrfs checksums are for (meta)data and RAID56 parity is not data.
>
>    Checksums are not parity, correct. However, every data block
> (including, I think, the parity) is checksummed and put into the csum
> tree. This allows the FS to determine where damage has occurred,
> rather than simply detecting that it has occurred (which would be the
> case if the parity doesn't match the data, or if the two copies of a
> RAID-1 array don't match).
>

Yes, that is what I wrote below. But that means that RAID5 with one
degraded disk won't be able to reconstruct data on this degraded disk
because reconstructed extent content won't match checksum. Which kinda
makes RAID5 pointless.

...
>
>> > It looks like uncorrectable failures might occur because parity is
>> > correct, but the parity checksum is out of date, so the parity checksum
>> > doesn't match even though data blindly reconstructed from the parity
>> > *would* match the data.
>> >
>>
>> Yep, that is how I read it too. So if your data is checksummed, it
>> should at least avoid silent corruption.
>>
Andrei Borzenkov June 24, 2016, 10:16 a.m. UTC | #22
On Fri, Jun 24, 2016 at 8:20 AM, Chris Murphy <lists@colorremedies.com> wrote:

> [root@f24s ~]# filefrag -v /mnt/5/*
> Filesystem type is: 9123683e
> File size of /mnt/5/a.txt is 16383 (4 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..       3:    2931712..   2931715:      4:             last,eof

Hmm ... I wonder what is wrong here (openSUSE Tumbleweed)

nohostname:~ # filefrag -v /mnt/1
Filesystem type is: 9123683e
File size of /mnt/1 is 3072 (1 block of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       0:     269376..    269376:      1:             last,eof
/mnt/1: 1 extent found

But!

nohostname:~ # filefrag -v /etc/passwd
Filesystem type is: 9123683e
File size of /etc/passwd is 1527 (1 block of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..    4095:          0..      4095:   4096:
last,not_aligned,inline,eof
/etc/passwd: 1 extent found
nohostname:~ #

Why does it work for one filesystem but not for the other one?
...
>
> So at the old address, it shows the "aaaaa..." is still there. And at
> the added single block for this file at new logical and physical
> addresses, is the modification substituting the first "a" for "g".
>
> In this case, no rmw, no partial stripe modification, and no data
> already on-disk is at risk.

You misunderstand the nature of the problem. What is put at risk is data
that is already on disk and "shares" parity with new data.

As an example, here are the first 64K, in several extents, on a 4-disk RAID5
with (so far) a single data chunk:

        item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 1103101952) itemoff
15491 itemsize 176
                chunk length 3221225472 owner 2 stripe_len 65536
                type DATA|RAID5 num_stripes 4
                        stripe 0 devid 4 offset 9437184
                        dev uuid: ed13e42e-1633-4230-891c-897e86d1c0be
                        stripe 1 devid 3 offset 9437184
                        dev uuid: 10032b95-3f48-4ea0-a9ee-90064c53da1f
                        stripe 2 devid 2 offset 1074790400
                        dev uuid: cd749bd9-3d72-43b4-89a8-45e4a92658cf
                        stripe 3 devid 1 offset 1094713344
                        dev uuid: 41538b9f-3869-4c32-b3e2-30aa2ea1534e
                dev extent chunk_tree 3
                chunk objectid 256 chunk offset 1103101952 length 1073741824


        item 5 key (1 DEV_EXTENT 1094713344) itemoff 16027 itemsize 48
                dev extent chunk_tree 3
                chunk objectid 256 chunk offset 1103101952 length 1073741824
        item 7 key (2 DEV_EXTENT 1074790400) itemoff 15931 itemsize 48
                dev extent chunk_tree 3
                chunk objectid 256 chunk offset 1103101952 length 1073741824
        item 9 key (3 DEV_EXTENT 9437184) itemoff 15835 itemsize 48
                dev extent chunk_tree 3
                chunk objectid 256 chunk offset 1103101952 length 1073741824
        item 11 key (4 DEV_EXTENT 9437184) itemoff 15739 itemsize 48
                dev extent chunk_tree 3
                chunk objectid 256 chunk offset 1103101952 length 1073741824

where devid 1 = sdb1, 2 = sdc1 etc.

Now let's write some data (I created several files) up to 64K in size:

mirror 1 logical 1103364096 physical 1074855936 device /dev/sdc1
mirror 2 logical 1103364096 physical 9502720 device /dev/sde1
mirror 1 logical 1103368192 physical 1074860032 device /dev/sdc1
mirror 2 logical 1103368192 physical 9506816 device /dev/sde1
mirror 1 logical 1103372288 physical 1074864128 device /dev/sdc1
mirror 2 logical 1103372288 physical 9510912 device /dev/sde1
mirror 1 logical 1103376384 physical 1074868224 device /dev/sdc1
mirror 2 logical 1103376384 physical 9515008 device /dev/sde1
mirror 1 logical 1103380480 physical 1074872320 device /dev/sdc1
mirror 2 logical 1103380480 physical 9519104 device /dev/sde1

Note that btrfs allocates 64K on the same device before switching to
the next one. What is a bit misleading here is that sdc1 is data and sde1 is
parity (you can see it in the checksum tree, where only items for sdc1
exist).

Now let's write next 64k and see what happens

nohostname:~ # btrfs-map-logical -l 1103429632 -b 65536 /dev/sdb1
mirror 1 logical 1103429632 physical 1094778880 device /dev/sdb1
mirror 2 logical 1103429632 physical 9502720 device /dev/sde1

See? btrfs now allocates a new strip on sdb1; this strip is at the
same offset as the previous one on sdc1 (64K) and so shares the same
parity strip on sde1. If you compare the 64K on sde1 at offset 9502720
before and after, you will see that it has changed. IN PLACE. Without
CoW. This is exactly what puts the existing data on sdc1 at risk: if sdb1
is updated but sde1 is not, an attempt to reconstruct the data on sdc1 will
either fail (if we have checksums) or result in silent corruption.
Hugo Mills June 24, 2016, 10:16 a.m. UTC | #23
On Fri, Jun 24, 2016 at 12:52:21PM +0300, Andrei Borzenkov wrote:
> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
> > On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote:
> >> 24.06.2016 04:47, Zygo Blaxell пишет:
> >> > On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:
> >> >> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli <kreijack@inwind.it> wrote:
> >> >>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the checksum.
> >> >>
> >> >> Yeah I'm kinda confused on this point.
> >> >>
> >> >> https://btrfs.wiki.kernel.org/index.php/RAID56
> >> >>
> >> >> It says there is a write hole for Btrfs. But defines it in terms of
> >> >> parity possibly being stale after a crash. I think the term comes not
> >> >> from merely parity being wrong but parity being wrong *and* then being
> >> >> used to wrongly reconstruct data because it's blindly trusted.
> >> >
> >> > I think the opposite is more likely, as the layers above raid56
> >> > seem to check the data against sums before raid56 ever sees it.
> >> > (If those layers seem inverted to you, I agree, but OTOH there are
> >> > probably good reason to do it that way).
> >> >
> >>
> >> Yes, that's how I read code as well. btrfs layer that does checksumming
> >> is unaware of parity blocks at all; for all practical purposes they do
> >> not exist. What happens is approximately
> >>
> >> 1. logical extent is allocated and checksum computed
> >> 2. it is mapped to physical area(s) on disks, skipping over what would
> >> be parity blocks
> >> 3. when these areas are written out, RAID56 parity is computed and filled in
> >>
> >> IOW btrfs checksums are for (meta)data and RAID56 parity is not data.
> >
> >    Checksums are not parity, correct. However, every data block
> > (including, I think, the parity) is checksummed and put into the csum
> > tree. This allows the FS to determine where damage has occurred,
> > rather thansimply detecting that it has occurred (which would be the
> > case if the parity doesn't match the data, or if the two copies of a
> > RAID-1 array don't match).
> >
> 
> Yes, that is what I wrote below. But that means that RAID5 with one
> degraded disk won't be able to reconstruct data on this degraded disk
> because reconstructed extent content won't match checksum. Which kinda
> makes RAID5 pointless.

   Eh? How do you come to that conclusion?

   For data, say you have n-1 good devices, with n-1 blocks on them.
Each block has a checksum in the metadata, so you can read that
checksum, read the blocks, and verify that they're not damaged. From
those n-1 known-good blocks (all data, or one parity and the rest
data) you can reconstruct the remaining block. That reconstructed
block won't be checked against the csum for the missing block -- it'll
just be written and a new csum for it written with it.
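
   For illustration only -- this is not the btrfs code path, just a
sketch of what a csum-verified rebuild would look like, assuming a
crc32-style csum stored per block:

    import zlib
    from functools import reduce

    def rebuild_missing(survivors, stored_csum):
        # XOR the n-1 known-good strips (data plus parity), column by column
        candidate = bytes(reduce(lambda a, b: a ^ b, col)
                          for col in zip(*survivors))
        # hypothetical extra step: compare against the csum recorded for
        # the missing block instead of just re-csumming the result
        if zlib.crc32(candidate) != stored_csum:
            raise IOError("rebuilt block fails csum, don't write it back")
        return candidate

The information needed for that extra check is already in the csum tree.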

   Hugo.

> ...
> >
> >> > It looks like uncorrectable failures might occur because parity is
> >> > correct, but the parity checksum is out of date, so the parity checksum
> >> > doesn't match even though data blindly reconstructed from the parity
> >> > *would* match the data.
> >> >
> >>
> >> Yep, that is how I read it too. So if your data is checksummed, it
> >> should at least avoid silent corruption.
> >>
Andrei Borzenkov June 24, 2016, 10:19 a.m. UTC | #24
On Fri, Jun 24, 2016 at 1:16 PM, Hugo Mills <hugo@carfax.org.uk> wrote:
> On Fri, Jun 24, 2016 at 12:52:21PM +0300, Andrei Borzenkov wrote:
>> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
>> > On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote:
>> >> 24.06.2016 04:47, Zygo Blaxell пишет:
>> >> > On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:
>> >> >> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli <kreijack@inwind.it> wrote:
>> >> >>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the checksum.
>> >> >>
>> >> >> Yeah I'm kinda confused on this point.
>> >> >>
>> >> >> https://btrfs.wiki.kernel.org/index.php/RAID56
>> >> >>
>> >> >> It says there is a write hole for Btrfs. But defines it in terms of
>> >> >> parity possibly being stale after a crash. I think the term comes not
>> >> >> from merely parity being wrong but parity being wrong *and* then being
>> >> >> used to wrongly reconstruct data because it's blindly trusted.
>> >> >
>> >> > I think the opposite is more likely, as the layers above raid56
>> >> > seem to check the data against sums before raid56 ever sees it.
>> >> > (If those layers seem inverted to you, I agree, but OTOH there are
>> >> > probably good reason to do it that way).
>> >> >
>> >>
>> >> Yes, that's how I read code as well. btrfs layer that does checksumming
>> >> is unaware of parity blocks at all; for all practical purposes they do
>> >> not exist. What happens is approximately
>> >>
>> >> 1. logical extent is allocated and checksum computed
>> >> 2. it is mapped to physical area(s) on disks, skipping over what would
>> >> be parity blocks
>> >> 3. when these areas are written out, RAID56 parity is computed and filled in
>> >>
>> >> IOW btrfs checksums are for (meta)data and RAID56 parity is not data.
>> >
>> >    Checksums are not parity, correct. However, every data block
>> > (including, I think, the parity) is checksummed and put into the csum
>> > tree. This allows the FS to determine where damage has occurred,
>> > rather thansimply detecting that it has occurred (which would be the
>> > case if the parity doesn't match the data, or if the two copies of a
>> > RAID-1 array don't match).
>> >
>>
>> Yes, that is what I wrote below. But that means that RAID5 with one
>> degraded disk won't be able to reconstruct data on this degraded disk
>> because reconstructed extent content won't match checksum. Which kinda
>> makes RAID5 pointless.
>
>    Eh? How do you come to that conclusion?
>
>    For data, say you have n-1 good devices, with n-1 blocks on them.
> Each block has a checksum in the metadata, so you can read that
> checksum, read the blocks, and verify that they're not damaged. From
> those n-1 known-good blocks (all data, or one parity and the rest

We do not know whether parity is good or not because as far as I can
tell parity is not checksummed.

> data) you can reconstruct the remaining block. That reconstructed
> block won't be checked against the csum for the missing block -- it'll
> just be written and a new csum for it written with it.
>

So we have silent corruption. I fail to understand how it is an improvement :)

>    Hugo.
>
>> ...
>> >
>> >> > It looks like uncorrectable failures might occur because parity is
>> >> > correct, but the parity checksum is out of date, so the parity checksum
>> >> > doesn't match even though data blindly reconstructed from the parity
>> >> > *would* match the data.
>> >> >
>> >>
>> >> Yep, that is how I read it too. So if your data is checksummed, it
>> >> should at least avoid silent corruption.
>> >>
>
> --
> Hugo Mills             | Debugging is like hitting yourself in the head with
> hugo@... carfax.org.uk | hammer: it feels so good when you find the bug, and
> http://carfax.org.uk/  | you're allowed to stop debugging.
> PGP: E2AB1DE4          |                                        PotatoEngineer
Hugo Mills June 24, 2016, 10:59 a.m. UTC | #25
On Fri, Jun 24, 2016 at 01:19:30PM +0300, Andrei Borzenkov wrote:
> On Fri, Jun 24, 2016 at 1:16 PM, Hugo Mills <hugo@carfax.org.uk> wrote:
> > On Fri, Jun 24, 2016 at 12:52:21PM +0300, Andrei Borzenkov wrote:
> >> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
> >> > On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote:
> >> >> 24.06.2016 04:47, Zygo Blaxell пишет:
> >> >> > On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:
> >> >> >> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli <kreijack@inwind.it> wrote:
> >> >> >>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the checksum.
> >> >> >>
> >> >> >> Yeah I'm kinda confused on this point.
> >> >> >>
> >> >> >> https://btrfs.wiki.kernel.org/index.php/RAID56
> >> >> >>
> >> >> >> It says there is a write hole for Btrfs. But defines it in terms of
> >> >> >> parity possibly being stale after a crash. I think the term comes not
> >> >> >> from merely parity being wrong but parity being wrong *and* then being
> >> >> >> used to wrongly reconstruct data because it's blindly trusted.
> >> >> >
> >> >> > I think the opposite is more likely, as the layers above raid56
> >> >> > seem to check the data against sums before raid56 ever sees it.
> >> >> > (If those layers seem inverted to you, I agree, but OTOH there are
> >> >> > probably good reason to do it that way).
> >> >> >
> >> >>
> >> >> Yes, that's how I read code as well. btrfs layer that does checksumming
> >> >> is unaware of parity blocks at all; for all practical purposes they do
> >> >> not exist. What happens is approximately
> >> >>
> >> >> 1. logical extent is allocated and checksum computed
> >> >> 2. it is mapped to physical area(s) on disks, skipping over what would
> >> >> be parity blocks
> >> >> 3. when these areas are written out, RAID56 parity is computed and filled in
> >> >>
> >> >> IOW btrfs checksums are for (meta)data and RAID56 parity is not data.
> >> >
> >> >    Checksums are not parity, correct. However, every data block
> >> > (including, I think, the parity) is checksummed and put into the csum
> >> > tree. This allows the FS to determine where damage has occurred,
> >> > rather thansimply detecting that it has occurred (which would be the
> >> > case if the parity doesn't match the data, or if the two copies of a
> >> > RAID-1 array don't match).
> >> >
> >>
> >> Yes, that is what I wrote below. But that means that RAID5 with one
> >> degraded disk won't be able to reconstruct data on this degraded disk
> >> because reconstructed extent content won't match checksum. Which kinda
> >> makes RAID5 pointless.
> >
> >    Eh? How do you come to that conclusion?
> >
> >    For data, say you have n-1 good devices, with n-1 blocks on them.
> > Each block has a checksum in the metadata, so you can read that
> > checksum, read the blocks, and verify that they're not damaged. From
> > those n-1 known-good blocks (all data, or one parity and the rest
> 
> We do not know whether parity is good or not because as far as I can
> tell parity is not checksummed.

   I was about to write a devastating rebuttal of this... then I
actually tested it, and holy crap you're right.

   I've just closed the terminal in question by accident, so I can't
copy-and-paste, but the way I checked was:

# mkfs.btrfs -mraid1 -draid5 /dev/loop{0,1,2}
# mount /dev/loop0 foo
# dd if=/dev/urandom of=foo/file bs=4k count=32
# umount /dev/loop0
# btrfs-debug-tree /dev/loop0

then look at the csum tree:

     item 0 key (EXTENT_CSUM EXTENT_CSUM 351469568) itemoff 16155 itemsize 128
     	  extent csum item

There is a single csum item in it, of length 128. At 4 bytes per csum,
that's 32 checksums, which covers the 32 4KiB blocks I wrote, leaving
nothing for the parity.
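
   As a sanity check on that arithmetic -- and assuming a purely
hypothetical scheme where parity blocks got the same 4-bytes-per-4KiB
csums as data:

    csum_size     = 4                    # bytes per csum
    data_blocks   = 32                   # the 32 x 4KiB blocks written above
    parity_blocks = data_blocks // 2     # 2 data strips + 1 parity on 3 devices

    print(data_blocks * csum_size)                     # 128 -- what the item shows
    print((data_blocks + parity_blocks) * csum_size)   # 192 -- what it would be
                                                       # if parity were csummed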

   This is fundamentally broken, and I think we need to change the
wiki to indicate that the parity RAID implementation is not
recommended, because it doesn't actually do the job it's meant to in a
reliable way. :(

   Hugo.

> > data) you can reconstruct the remaining block. That reconstructed
> > block won't be checked against the csum for the missing block -- it'll
> > just be written and a new csum for it written with it.
> >
> 
> So we have silent corruption. I fail to understand how it is an improvement :)
> 
> >    Hugo.
> >
> >> ...
> >> >
> >> >> > It looks like uncorrectable failures might occur because parity is
> >> >> > correct, but the parity checksum is out of date, so the parity checksum
> >> >> > doesn't match even though data blindly reconstructed from the parity
> >> >> > *would* match the data.
> >> >> >
> >> >>
> >> >> Yep, that is how I read it too. So if your data is checksummed, it
> >> >> should at least avoid silent corruption.
> >> >>
> >
Austin S. Hemmelgarn June 24, 2016, 11:24 a.m. UTC | #26
On 2016-06-24 01:20, Chris Murphy wrote:
> On Thu, Jun 23, 2016 at 8:07 PM, Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
>
>>> With simple files changing one character with vi and gedit,
>>> I get completely different logical and physical numbers with each
>>> change, so it's clearly cowing the entire stripe (192KiB in my 3 dev
>>> raid5).
>>
>> You are COWing the entire file because vi and gedit do truncate followed
>> by full-file write.
>
> I'm seeing the file inode changes with either a vi or gedit
> modification, even when file size is exactly the same, just character
> substitute. So as far as VFS and Btrfs are concerned, it's an entirely
> different file, so it's like faux-CoW that would have happened on any
> file system, not an overwrite.
Yes, at least Vim (which is what most Linux systems use for vi) writes
to a temporary file and then does a replace-by-rename.  The idea is that
POSIX implies this should be atomic (except it's not actually
required by POSIX, and even on some journaled and COW filesystems, it
isn't actually atomic).
Austin S. Hemmelgarn June 24, 2016, 11:36 a.m. UTC | #27
On 2016-06-24 06:59, Hugo Mills wrote:
> On Fri, Jun 24, 2016 at 01:19:30PM +0300, Andrei Borzenkov wrote:
>> On Fri, Jun 24, 2016 at 1:16 PM, Hugo Mills <hugo@carfax.org.uk> wrote:
>>> On Fri, Jun 24, 2016 at 12:52:21PM +0300, Andrei Borzenkov wrote:
>>>> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
>>>>> On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote:
>>>>>> 24.06.2016 04:47, Zygo Blaxell пишет:
>>>>>>> On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:
>>>>>>>> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli <kreijack@inwind.it> wrote:
>>>>>>>>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the checksum.
>>>>>>>>
>>>>>>>> Yeah I'm kinda confused on this point.
>>>>>>>>
>>>>>>>> https://btrfs.wiki.kernel.org/index.php/RAID56
>>>>>>>>
>>>>>>>> It says there is a write hole for Btrfs. But defines it in terms of
>>>>>>>> parity possibly being stale after a crash. I think the term comes not
>>>>>>>> from merely parity being wrong but parity being wrong *and* then being
>>>>>>>> used to wrongly reconstruct data because it's blindly trusted.
>>>>>>>
>>>>>>> I think the opposite is more likely, as the layers above raid56
>>>>>>> seem to check the data against sums before raid56 ever sees it.
>>>>>>> (If those layers seem inverted to you, I agree, but OTOH there are
>>>>>>> probably good reason to do it that way).
>>>>>>>
>>>>>>
>>>>>> Yes, that's how I read code as well. btrfs layer that does checksumming
>>>>>> is unaware of parity blocks at all; for all practical purposes they do
>>>>>> not exist. What happens is approximately
>>>>>>
>>>>>> 1. logical extent is allocated and checksum computed
>>>>>> 2. it is mapped to physical area(s) on disks, skipping over what would
>>>>>> be parity blocks
>>>>>> 3. when these areas are written out, RAID56 parity is computed and filled in
>>>>>>
>>>>>> IOW btrfs checksums are for (meta)data and RAID56 parity is not data.
>>>>>
>>>>>    Checksums are not parity, correct. However, every data block
>>>>> (including, I think, the parity) is checksummed and put into the csum
>>>>> tree. This allows the FS to determine where damage has occurred,
>>>>> rather thansimply detecting that it has occurred (which would be the
>>>>> case if the parity doesn't match the data, or if the two copies of a
>>>>> RAID-1 array don't match).
>>>>>
>>>>
>>>> Yes, that is what I wrote below. But that means that RAID5 with one
>>>> degraded disk won't be able to reconstruct data on this degraded disk
>>>> because reconstructed extent content won't match checksum. Which kinda
>>>> makes RAID5 pointless.
>>>
>>>    Eh? How do you come to that conclusion?
>>>
>>>    For data, say you have n-1 good devices, with n-1 blocks on them.
>>> Each block has a checksum in the metadata, so you can read that
>>> checksum, read the blocks, and verify that they're not damaged. From
>>> those n-1 known-good blocks (all data, or one parity and the rest
>>
>> We do not know whether parity is good or not because as far as I can
>> tell parity is not checksummed.
>
>    I was about to write a devastating rebuttal of this... then I
> actually tested it, and holy crap you're right.
>
>    I've just closed the terminal in question by accident, so I can't
> copy-and-paste, but the way I checked was:
>
> # mkfs.btrfs -mraid1 -draid5 /dev/loop{0,1,2}
> # mount /dev/loop0 foo
> # dd if=/dev/urandom of=foo/file bs=4k count=32
> # umount /dev/loop0
> # btrfs-debug-tree /dev/loop0
>
> then look at the csum tree:
>
>      item 0 key (EXTENT_CSUM EXTENT_CSUM 351469568) itemoff 16155 itemsize 128
>      	  extent csum item
>
> There is a single csum item in it, of length 128. At 4 bytes per csum,
> that's 32 checksums, which covers the 32 4KiB blocks I wrote, leaving
> nothing for the parity.
>
>    This is fundamentally broken, and I think we need to change the
> wiki to indicate that the parity RAID implementation is not
> recommended, because it doesn't actually do the job it's meant to in a
> reliable way. :(
>
So that's item 4 now, together with:
1. Rebuilds seemingly randomly decide, based on the filesystem, whether
or not to take an insanely long time (it always happens on some arrays,
never happens on others; I have yet to see a report where it happens
intermittently).
2. Failed disks seem to occasionally cause irreversible data corruption.
3. Classic erasure-code write-hole, just slightly different because of COW.

TBH, as much as I hate to say this, it looks like the raid5/6 code needs
to be redone from scratch.  At an absolute minimum, we need to put a warning
in mkfs for people using raid5/6 to tell them they shouldn't be using it
outside of testing.
Zygo Blaxell June 24, 2016, 4:32 p.m. UTC | #28
On Thu, Jun 23, 2016 at 11:20:40PM -0600, Chris Murphy wrote:
> [root@f24s ~]# filefrag -v /mnt/5/*
> Filesystem type is: 9123683e
> File size of /mnt/5/a.txt is 16383 (4 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..       0:    2931732..   2931732:      1:
>    1:        1..       3:    2931713..   2931715:      3:    2931733: last,eof
> /mnt/5/a.txt: 2 extents found
> File size of /mnt/5/b.txt is 16383 (4 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..       3:    2931716..   2931719:      4:             last,eof
> /mnt/5/b.txt: 1 extent found
> File size of /mnt/5/c.txt is 16383 (4 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..       3:    2931720..   2931723:      4:             last,eof
> /mnt/5/c.txt: 1 extent found
> File size of /mnt/5/d.txt is 16383 (4 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..       3:    2931724..   2931727:      4:             last,eof
> /mnt/5/d.txt: 1 extent found
> File size of /mnt/5/e.txt is 16383 (4 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..       3:    2931728..   2931731:      4:             last,eof
> /mnt/5/e.txt: 1 extent found

> So at the old address, it shows the "aaaaa..." is still there. And at
> the added single block for this file at new logical and physical
> addresses, is the modification substituting the first "a" for "g".
> 
> In this case, no rmw, no partial stripe modification, and no data
> already on-disk is at risk. Even the metadata leaf/node is cow'd, it
> has a new logical and physical address as well, which contains
> information for all five files.

Well, of course not.  You're not setting up the conditions for failure.
The extent at 2931712..2931715 is 4 blocks long, so when you overwrite
part of the extent all 4 blocks remain occupied.

You need extents that are shorter than the stripe width, and you need to
write to the same stripe in two different btrfs transactions (i.e. you
need to delete an extent and then have a new extent mapped in the old
location).
Zygo Blaxell June 24, 2016, 4:39 p.m. UTC | #29
On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote:
> >> I don't read code well enough, but I'd be surprised if Btrfs
> >> reconstructs from parity and doesn't then check the resulting
> >> reconstructed data to its EXTENT_CSUM.
> > 
> > I wouldn't be surprised if both things happen in different code paths,
> > given the number of different paths leading into the raid56 code and
> > the number of distinct failure modes it seems to have.
> 
> Well, the problem is that parity block cannot be redirected on write as
> data blocks; which makes it impossible to version control it. The only
> solution I see is to always use full stripe writes by either wasting
> time in fixed width stripe or using variable width, so that every stripe
> always gets new version of parity. This makes it possible to keep parity
> checksums like data checksums.

The allocator could try harder to avoid partial stripe writes.  We can
write multiple small extents to the same stripe as long as we always do
it all within one transaction, and then later treat the entire stripe
as read-only until every extent is removed.  It would be possible to do
that by fudging extent lengths (effectively adding a bunch of prealloc-ish
space if we have a partial write after all the delalloc stuff is done),
but it could also waste some blocks on every single transaction, or
create a bunch of "free but unavailable" space that makes df/statvfs
output even more wrong than it usually is.
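
Very roughly, and purely as a sketch of the idea (the names and the
fixed 128K figure are made up for a 3-disk raid5; this is not how the
delalloc code is actually structured):

    FULL_STRIPE_DATA = 128 * 1024   # data bytes per full stripe, 3-disk raid5

    def pad_transaction_tail(delalloc_bytes):
        # Round the last allocation of a transaction up to a full stripe
        # so the raid56 layer never read-modify-writes a stripe left over
        # from an older transaction.  Returns (allocated, wasted).
        remainder = delalloc_bytes % FULL_STRIPE_DATA
        wasted = 0 if remainder == 0 else FULL_STRIPE_DATA - remainder
        return delalloc_bytes + wasted, wasted

The 'wasted' term is exactly the per-transaction cost mentioned above.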

The raid5 rmw code could try to relocate the other extents sharing a
stripe, but I fear that with the current state of backref walking code
that would make raid5 spectacularly slow if a filesystem is anywhere
near full.

We could also write rmw parity block updates to a journal (like another
log tree).  That would enable us to at least fix up the parity blocks
after a crash, and close the write hole.  That's an on-disk format
change though.
Chris Murphy June 24, 2016, 4:52 p.m. UTC | #30
On Fri, Jun 24, 2016 at 2:50 AM, Hugo Mills <hugo@carfax.org.uk> wrote:

>    Checksums are not parity, correct. However, every data block
> (including, I think, the parity) is checksummed and put into the csum
> tree.

I don't see how parity is checksummed. It definitely is not in the
csum tree. Two file systems, one raid5, one single, each with a single
identical file:

raid5
    item 0 key (EXTENT_CSUM EXTENT_CSUM 12009865216) itemoff 16155 itemsize 128
        extent csum item

single

    item 0 key (EXTENT_CSUM EXTENT_CSUM 2168717312) itemoff 16155 itemsize 128
        extent csum item

That's the only entry in the csum tree. The raid5 one is not 33.33%
bigger to account for the extra parity being checksummed.

Now, if parity is used for reconstruction of data, that data *is*
checksummed, so if it fails checksum after reconstruction the
information is available to determine that it was incorrectly
reconstructed. The notes in btrfs/raid56.c recognize the possibility
of parity corruption and how to handle it. But I think that corruption
is inferred. Maybe the parity csums are in some other metadata item,
but I don't see them in the csum tree.
Hugo Mills June 24, 2016, 4:56 p.m. UTC | #31
On Fri, Jun 24, 2016 at 10:52:53AM -0600, Chris Murphy wrote:
> On Fri, Jun 24, 2016 at 2:50 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
> 
> >    Checksums are not parity, correct. However, every data block
> > (including, I think, the parity) is checksummed and put into the csum
> > tree.
> 
> I don't see how parity is checksummed. It definitely is not in the
> csum tree. Two file systems, one raid5, one single, each with a single
> identical file:

   It isn't -- I was wrong up there, and corrected myself in a later
message after investigation. (Although in this case, I regard reality
as being at fault ;) ).

   Hugo.

> raid5
>     item 0 key (EXTENT_CSUM EXTENT_CSUM 12009865216) itemoff 16155 itemsize 128
>         extent csum item
> 
> single
> 
>     item 0 key (EXTENT_CSUM EXTENT_CSUM 2168717312) itemoff 16155 itemsize 128
>         extent csum item
> 
> That's the only entry in the csum tree. The raid5 one is not 33.33%
> bigger to account for the extra parity being checksummed.
> 
> Now, if parity is used to reconstruction of data, that data *is*
> checksummed so if it fails checksum after reconstruction the
> information is available to determine it was incorrectly
> reconstructed. The notes in btrfs/raid56.c recognize the possibility
> of parity corruption and how to handle it. But I think that corruption
> is inferred. Maybe the parity csums are in some other metadata item,
> but I don't see how it's in the csum tree.
> 
>
Chris Murphy June 24, 2016, 5:06 p.m. UTC | #32
On Fri, Jun 24, 2016 at 3:52 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
>eta)data and RAID56 parity is not data.
>>
>>    Checksums are not parity, correct. However, every data block
>> (including, I think, the parity) is checksummed and put into the csum
>> tree. This allows the FS to determine where damage has occurred,
>> rather thansimply detecting that it has occurred (which would be the
>> case if the parity doesn't match the data, or if the two copies of a
>> RAID-1 array don't match).
>>
>
> Yes, that is what I wrote below. But that means that RAID5 with one
> degraded disk won't be able to reconstruct data on this degraded disk
> because reconstructed extent content won't match checksum. Which kinda
> makes RAID5 pointless.

I don't understand this. Whether the failed disk means a stripe is
missing a data strip or parity strip, if any other strip is damaged of
course the reconstruction isn't going to match checksum. This does not
make raid5 pointless.
Andrei Borzenkov June 24, 2016, 5:21 p.m. UTC | #33
24.06.2016 20:06, Chris Murphy пишет:
> On Fri, Jun 24, 2016 at 3:52 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
>> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
>> eta)data and RAID56 parity is not data.
>>>
>>>    Checksums are not parity, correct. However, every data block
>>> (including, I think, the parity) is checksummed and put into the csum
>>> tree. This allows the FS to determine where damage has occurred,
>>> rather thansimply detecting that it has occurred (which would be the
>>> case if the parity doesn't match the data, or if the two copies of a
>>> RAID-1 array don't match).
>>>
>>
>> Yes, that is what I wrote below. But that means that RAID5 with one
>> degraded disk won't be able to reconstruct data on this degraded disk
>> because reconstructed extent content won't match checksum. Which kinda
>> makes RAID5 pointless.
> 
> I don't understand this. Whether the failed disk means a stripe is
> missing a data strip or parity strip, if any other strip is damaged of
> course the reconstruction isn't going to match checksum. This does not
> make raid5 pointless.
> 

Yes, you are right. We have a double failure here. Still, in the
current situation we apparently may end up with btrfs reconstructing
the missing block using wrong information. As was mentioned elsewhere,
btrfs does not verify the checksum of the reconstructed block, meaning
data corruption.
Chris Murphy June 24, 2016, 5:33 p.m. UTC | #34
On Fri, Jun 24, 2016 at 4:16 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
> On Fri, Jun 24, 2016 at 8:20 AM, Chris Murphy <lists@colorremedies.com> wrote:
>
>> [root@f24s ~]# filefrag -v /mnt/5/*
>> Filesystem type is: 9123683e
>> File size of /mnt/5/a.txt is 16383 (4 blocks of 4096 bytes)
>>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>>    0:        0..       3:    2931712..   2931715:      4:             last,eof
>
> Hmm ... I wonder what is wrong here (openSUSE Tumbleweed)
>
> nohostname:~ # filefrag -v /mnt/1
> Filesystem type is: 9123683e
> File size of /mnt/1 is 3072 (1 block of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..       0:     269376..    269376:      1:             last,eof
> /mnt/1: 1 extent found
>
> But!
>
> nohostname:~ # filefrag -v /etc/passwd
> Filesystem type is: 9123683e
> File size of /etc/passwd is 1527 (1 block of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..    4095:          0..      4095:   4096:
> last,not_aligned,inline,eof
> /etc/passwd: 1 extent found
> nohostname:~ #
>
> Why it works for one filesystem but does not work for an other one?
> ...
>>
>> So at the old address, it shows the "aaaaa..." is still there. And at
>> the added single block for this file at new logical and physical
>> addresses, is the modification substituting the first "a" for "g".
>>
>> In this case, no rmw, no partial stripe modification, and no data
>> already on-disk is at risk.
>
> You misunderstand the nature of problem. What is put at risk is data
> that is already on disk and "shares" parity with new data.
>
> As example, here are the first 64K in several extents on 4 disk RAID5
> with so far single data chunk
>
>         item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 1103101952) itemoff
> 15491 itemsize 176
>                 chunk length 3221225472 owner 2 stripe_len 65536
>                 type DATA|RAID5 num_stripes 4
>                         stripe 0 devid 4 offset 9437184
>                         dev uuid: ed13e42e-1633-4230-891c-897e86d1c0be
>                         stripe 1 devid 3 offset 9437184
>                         dev uuid: 10032b95-3f48-4ea0-a9ee-90064c53da1f
>                         stripe 2 devid 2 offset 1074790400
>                         dev uuid: cd749bd9-3d72-43b4-89a8-45e4a92658cf
>                         stripe 3 devid 1 offset 1094713344
>                         dev uuid: 41538b9f-3869-4c32-b3e2-30aa2ea1534e
>                 dev extent chunk_tree 3
>                 chunk objectid 256 chunk offset 1103101952 length 1073741824
>
>
>         item 5 key (1 DEV_EXTENT 1094713344) itemoff 16027 itemsize 48
>                 dev extent chunk_tree 3
>                 chunk objectid 256 chunk offset 1103101952 length 1073741824
>         item 7 key (2 DEV_EXTENT 1074790400) itemoff 15931 itemsize 48
>                 dev extent chunk_tree 3
>                 chunk objectid 256 chunk offset 1103101952 length 1073741824
>         item 9 key (3 DEV_EXTENT 9437184) itemoff 15835 itemsize 48
>                 dev extent chunk_tree 3
>                 chunk objectid 256 chunk offset 1103101952 length 1073741824
>         item 11 key (4 DEV_EXTENT 9437184) itemoff 15739 itemsize 48
>                 dev extent chunk_tree 3
>                 chunk objectid 256 chunk offset 1103101952 length 1073741824
>
> where devid 1 = sdb1, 2 = sdc1 etc.
>
> Now let's write some data (I created several files) up to 64K in size:
>
> mirror 1 logical 1103364096 physical 1074855936 device /dev/sdc1
> mirror 2 logical 1103364096 physical 9502720 device /dev/sde1
> mirror 1 logical 1103368192 physical 1074860032 device /dev/sdc1
> mirror 2 logical 1103368192 physical 9506816 device /dev/sde1
> mirror 1 logical 1103372288 physical 1074864128 device /dev/sdc1
> mirror 2 logical 1103372288 physical 9510912 device /dev/sde1
> mirror 1 logical 1103376384 physical 1074868224 device /dev/sdc1
> mirror 2 logical 1103376384 physical 9515008 device /dev/sde1
> mirror 1 logical 1103380480 physical 1074872320 device /dev/sdc1
> mirror 2 logical 1103380480 physical 9519104 device /dev/sde1
>
> Note that btrfs allocates 64K on the same device before switching to
> the next one. What is a bit misleading here, sdc1 is data and sde1 is
> parity (you can see it in checksum tree, where only items for sdc1
> exist).
>
> Now let's write next 64k and see what happens
>
> nohostname:~ # btrfs-map-logical -l 1103429632 -b 65536 /dev/sdb1
> mirror 1 logical 1103429632 physical 1094778880 device /dev/sdb1
> mirror 2 logical 1103429632 physical 9502720 device /dev/sde1
>
> See? btrfs now allocates new stripe on sdb1; this stripe is at the
> same offset as previous one on sdc1 (64K) and so shares the same
> parity stripe on sde1.

Yep, I've seen this also. What's not clear is whether there's any
optimization where it does partial strip writes, i.e. only a
certain sector needs updating so only that sector is written. The md
driver does have this optimization for raid5; I don't think it does for
raid6. So either someone familiar with the code needs to tell us, or
someone needs to do some tracing (not difficult, but the output is
super verbose).

In any case it's a risk when parity is not cow'd: if it
were stale or corrupt, a rebuild that depends on parity will rebuild
bad data, but that bad data is discoverable. It's not silent, so I
don't think it's completely correct to say there's a write hole. It's
more correct to say parity is not cow'd, therefore the first part of
the write hole can happen; it's just not going to result in silent
corruption if there's an interruption writing the parity.

So far I haven't seen data extents not be cow'd, which I think we'd
agree would be a bigger problem if it were to happen.

>If you compare 64K on sde1 at offset 9502720
> before and after, you will see that it has changed. INPLACE. Without
> CoW. This is exactly what puts existing data on sdc1 at risk - if sdb1
> is updated but sde1 is not, attempt to reconstruct data on sdc1 will
> either fail (if we have checksums) or result in silent corruption.

Which happens with any implementation that does rmw; partial stripe
writes have all sorts of other problems as well. The ZFS way of dealing
with this is carrying a boatload of additional metadata to account for
variable stripe size so that they can cow everything. And it gets
massively fragmented as well. So there's a tradeoff for everything.
Chris Murphy June 24, 2016, 5:40 p.m. UTC | #35
On Fri, Jun 24, 2016 at 4:16 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
> On Fri, Jun 24, 2016 at 12:52:21PM +0300, Andrei Borzenkov wrote:

>> Yes, that is what I wrote below. But that means that RAID5 with one
>> degraded disk won't be able to reconstruct data on this degraded disk
>> because reconstructed extent content won't match checksum. Which kinda
>> makes RAID5 pointless.
>
>    Eh? How do you come to that conclusion?
>
>    For data, say you have n-1 good devices, with n-1 blocks on them.
> Each block has a checksum in the metadata, so you can read that
> checksum, read the blocks, and verify that they're not damaged. From
> those n-1 known-good blocks (all data, or one parity and the rest
> data) you can reconstruct the remaining block. That reconstructed
> block won't be checked against the csum for the missing block -- it'll
> just be written and a new csum for it written with it.

The last sentence is hugely problematic. Parity doesn't appear to be
either CoW'd or checksummed. If it is used for reconstruction and the
reconstructed data isn't compared to the data's EXTENT_CSUM entry, but
that entry is rather recomputed and written, that's just like blindly
trusting the parity is correct and then authenticating it with a csum.

It's  not difficult to test. Corrupt one byte of parity. Yank a drive.
Add a new one. Start a reconstruction with scrub or balance (or both
to see if they differ) and find out what happens. What should happen
is the reconstruct should work for everything except that one file. If
it's reconstructed silently, it should contain visible corruption and
we all collectively raise our eyebrows.
Chris Murphy June 24, 2016, 5:52 p.m. UTC | #36
On Fri, Jun 24, 2016 at 11:21 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
> 24.06.2016 20:06, Chris Murphy пишет:
>> On Fri, Jun 24, 2016 at 3:52 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
>>> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
>>> eta)data and RAID56 parity is not data.
>>>>
>>>>    Checksums are not parity, correct. However, every data block
>>>> (including, I think, the parity) is checksummed and put into the csum
>>>> tree. This allows the FS to determine where damage has occurred,
>>>> rather thansimply detecting that it has occurred (which would be the
>>>> case if the parity doesn't match the data, or if the two copies of a
>>>> RAID-1 array don't match).
>>>>
>>>
>>> Yes, that is what I wrote below. But that means that RAID5 with one
>>> degraded disk won't be able to reconstruct data on this degraded disk
>>> because reconstructed extent content won't match checksum. Which kinda
>>> makes RAID5 pointless.
>>
>> I don't understand this. Whether the failed disk means a stripe is
>> missing a data strip or parity strip, if any other strip is damaged of
>> course the reconstruction isn't going to match checksum. This does not
>> make raid5 pointless.
>>
>
> Yes, you are right. We have double failure here. Still, in current
> situation we apparently may end with btrfs reconstructing missing block
> using wrong information. As was mentioned elsewhere, btrfs does not
> verify checksum of reconstructed block, meaning data corruption.

Well, that'd be bad, but also good in that it would explain a lot of
problems people have when metadata is also raid5. In this whole thread
the premise is that the metadata is raid1, so the fs doesn't totally
face plant; we just get a bunch of weird data corruptions. The metadata
raid5 cases were sorta "WTF happened?" and not much was really said
about them other than telling the user to scrape off what they can and
start over.

Anyway, while not good, I still think it would not be super problematic
to at least *do* check EXTENT_CSUM after reconstruction from parity
rather than assuming that the reconstruction happened correctly. The
data needed to pass or fail the rebuild is already on the disk. It just
needs to be checked.

Better would be to get parity csummed and put into the csum tree. But
I don't know how much that helps. Think about always computing and
writing csums for parity, which almost never get used vs keeping
things the way they are now and just *checking our work* after
reconstruction from parity. If there's some obvious major advantage to
checksumming the parity I'm all ears but I'm not thinking of it at the
moment.
Zygo Blaxell June 24, 2016, 6:06 p.m. UTC | #37
On Fri, Jun 24, 2016 at 11:40:56AM -0600, Chris Murphy wrote:
> On Fri, Jun 24, 2016 at 4:16 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
> > On Fri, Jun 24, 2016 at 12:52:21PM +0300, Andrei Borzenkov wrote:
> >    For data, say you have n-1 good devices, with n-1 blocks on them.
> > Each block has a checksum in the metadata, so you can read that
> > checksum, read the blocks, and verify that they're not damaged. From
> > those n-1 known-good blocks (all data, or one parity and the rest
> > data) you can reconstruct the remaining block. That reconstructed
> > block won't be checked against the csum for the missing block -- it'll
> > just be written and a new csum for it written with it.
> 
> The last sentence is hugely problematic. Parity doesn't appear to be
> either CoW'd or checksummed. If it is used for reconstruction and the
> reconstructed data isn't compared to the data's EXTENT_CSUM entry, but
> that entry is rather recomputed and written, that's just like blindly
> trusting the parity is correct and then authenticating it with a csum.

I think what happens is the data is recomputed, but the csum on the
data is _not_ updated (the csum does not reside in the raid56 code).
A read of the reconstructed data would get a csum failure (of course,
every 4 billionth time this happens the csum is correct by random chance,
so you wouldn't want to be reading parity blocks from a drive full of
garbage, but that's a different matter).

> It's  not difficult to test. Corrupt one byte of parity. Yank a drive.
> Add a new one. Start a reconstruction with scrub or balance (or both
> to see if they differ) and find out what happens. What should happen
> is the reconstruct should work for everything except that one file. If
> it's reconstructed silently, it should contain visible corruption and
> we all collectively raise our eyebrows.

I've done something like that test:  write random data to 1000 random
blocks on one disk, then run scrub.  It reconstructs the data without
problems (except for the minor wart that 'scrub status -d' counts the
errors randomly against every device, while 'dev stats' counts all the
errors on the disk that was corrupted).

Disk-side data corruption is a thing I have to deal with a few times each
year, so I tested the btrfs raid5 implementation for that case before I
started using it.  As far as I can tell so far, everything in btrfs raid5
works properly if a disk fails _while the filesystem is not mounted_.

The problem I see in the field is not *silent* corruption.  It's a whole
lot of very *noisy* corruption detected under circumstances where I'd
expect to see no corruption at all (silent or otherwise).
Austin S. Hemmelgarn June 24, 2016, 6:19 p.m. UTC | #38
On 2016-06-24 13:52, Chris Murphy wrote:
> On Fri, Jun 24, 2016 at 11:21 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
>> 24.06.2016 20:06, Chris Murphy пишет:
>>> On Fri, Jun 24, 2016 at 3:52 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
>>>> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
>>>> eta)data and RAID56 parity is not data.
>>>>>
>>>>>    Checksums are not parity, correct. However, every data block
>>>>> (including, I think, the parity) is checksummed and put into the csum
>>>>> tree. This allows the FS to determine where damage has occurred,
>>>>> rather thansimply detecting that it has occurred (which would be the
>>>>> case if the parity doesn't match the data, or if the two copies of a
>>>>> RAID-1 array don't match).
>>>>>
>>>>
>>>> Yes, that is what I wrote below. But that means that RAID5 with one
>>>> degraded disk won't be able to reconstruct data on this degraded disk
>>>> because reconstructed extent content won't match checksum. Which kinda
>>>> makes RAID5 pointless.
>>>
>>> I don't understand this. Whether the failed disk means a stripe is
>>> missing a data strip or parity strip, if any other strip is damaged of
>>> course the reconstruction isn't going to match checksum. This does not
>>> make raid5 pointless.
>>>
>>
>> Yes, you are right. We have double failure here. Still, in current
>> situation we apparently may end with btrfs reconstructing missing block
>> using wrong information. As was mentioned elsewhere, btrfs does not
>> verify checksum of reconstructed block, meaning data corruption.
>
> Well that'd be bad, but also good in that it would explain a lot of
> problems people have when metadata is also raid5. In this whole thread
> the premise is the metadata is raid1, so the fs doesn't totally face
> plant we just get a bunch of weird data corruptions. The metadata
> raid5 case were sorta "WTF happened?" and not much was really said
> about it other than telling the user to scrape off what they can and
> start over.
>
> Anyway, while not good I still think this is not super problematic to
> at least *do* check EXTENT_CSUM after reconstruction from parity
> rather than assuming that reconstruction happened correctly. The data
> needed to pass fail the rebuild is already on the disk. It just needs
> to be checked.
>
> Better would be to get parity csummed and put into the csum tree. But
> I don't know how much that helps. Think about always computing and
> writing csums for parity, which almost never get used vs keeping
> things the way they are now and just *checking our work* after
> reconstruction from parity. If there's some obvious major advantage to
> checksumming the parity I'm all ears but I'm not thinking of it at the
> moment.
>
Well, the obvious major advantage that comes to mind for me for
checksumming parity is that it would let us scrub the parity data itself
and verify it.  I'd personally much rather know my parity is bad before
I need to use it than after using it to reconstruct data and getting an
error there, and I'd be willing to bet that most seasoned sysadmins
working for companies using big storage arrays feel the same
about it.  I could see it being practical to have an option to turn this
off for performance reasons or similar, but again, I have a feeling that
most people would rather be able to check whether a rebuild will eat data
before trying to rebuild (depending on the situation in such a case, it
will sometimes just make more sense to nuke the array and restore from a
backup instead of spending time waiting for it to rebuild).

Chris Murphy June 25, 2016, 4:44 p.m. UTC | #39
On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:

> Well, the obvious major advantage that comes to mind for me to checksumming
> parity is that it would let us scrub the parity data itself and verify it.

OK, but hold on. During scrub, it should read data, compute checksums
*and* parity, and compare those to what's on disk -> EXTENT_CSUM in
the checksum tree, and the parity strip in the chunk tree. And if the
parity is wrong, then it should be replaced.

Even 'check > md/sync_action' does this. So, no pun intended, but Btrfs
isn't even at parity with mdadm on data integrity if it doesn't check
whether the parity matches the data.
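
In rough pseudo-Python, just to pin down what "compare those to what's
on disk" means (none of these names exist in the btrfs code):

    from functools import reduce

    def scrub_full_stripe(data_strips, parity_strip, csums, csum_fn):
        # 1. verify each data strip against its stored csum
        bad = [i for i, (strip, c) in enumerate(zip(data_strips, csums))
               if csum_fn(strip) != c]
        # 2. recompute parity from the data and compare with what's on disk
        expected = bytes(reduce(lambda a, b: a ^ b, col)
                         for col in zip(*data_strips))
        if not bad and expected != parity_strip:
            return "rewrite parity"      # data checks out, parity is stale
        return "repair data" if bad else "ok"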


> I'd personally much rather know my parity is bad before I need to use it
> than after using it to reconstruct data and getting an error there, and I'd
> be willing to be that most seasoned sysadmins working for companies using
> big storage arrays likely feel the same about it.

That doesn't require parity csums though. It just requires computing
parity during a scrub and comparing it to the parity on disk to make
sure they're the same. If they aren't, assuming no other error for
that full stripe read, then the parity block is replaced.

So that's also something to check in the code or poke a system with a
stick and see what happens.

> I could see it being
> practical to have an option to turn this off for performance reasons or
> similar, but again, I have a feeling that most people would rather be able
> to check if a rebuild will eat data before trying to rebuild (depending on
> the situation in such a case, it will sometimes just make more sense to nuke
> the array and restore from a backup instead of spending time waiting for it
> to rebuild).

The much bigger problem we have right now, affecting both Btrfs and
LVM/mdadm md raid, is this silly bad default with non-enterprise
drives having no configurable SCT ERC, with ensuing long recovery
times, and the kernel SCSI command timer at 30 seconds - which
also fucks over regular single disk users, because it
means they don't get the "benefit" of long recovery times, which is
the whole g'd point of that feature. This itself causes so many
problems where bad sectors just get worse and don't get fixed up
because of all the link resets. So I still think it's a bullshit
default on the kernel side, because it pretty much affects the majority
use case; it is only a non-problem with proprietary hardware raid, and
with software raid using enterprise (or NAS-specific) drives that
already have short recovery times by default.

This has been true for a very long time, maybe a decade. And it's such
complete utter crap that this hasn't been dealt with properly by any
party. No distribution has fixed this for their users. Upstream udev
hasn't dealt with it. And kernel folks haven't dealt with it. It's a
perverse joke on the user to do this out of the box.
Chris Murphy June 25, 2016, 9:52 p.m. UTC | #40
Interestingly enough, so far I'm finding with full stripe writes, i.e.
3x raid5, exactly 128KiB data writes, devid 3 is always parity. This
is raid4. So...I wonder if some of these slow cases end up with a
bunch of stripes that are effectively raid4-like, and have a lot of
parity overwrites, which is where raid4 suffers due to disk
contention.

Totally speculative as the sample size is too small and distinctly non-random.


Chris Murphy
Andrei Borzenkov June 26, 2016, 7:54 a.m. UTC | #41
26.06.2016 00:52, Chris Murphy пишет:
> Interestingly enough, so far I'm finding with full stripe writes, i.e.
> 3x raid5, exactly 128KiB data writes, devid 3 is always parity. This
> is raid4.

That's not what the code suggests, nor what I see in practice - parity
seems to be distributed across all disks; each new 128KiB file (extent)
has parity on a new disk. At least as long as we can trust
btrfs-map-logical to always show parity as "mirror 2".

Do you see consecutive full stripes in your tests? Or how do you
determine which devid has parity for a given full stripe? This
information is not actually stored anywhere; it is computed based on
block group geometry and logical stripe offset.
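
Something along these lines, I believe (a generic rotating-parity
layout, shown only for illustration; I have not verified that the
rotation direction matches what the btrfs mapping code actually does):

    def parity_stripe_index(num_stripes, stripe_len, chunk_offset, logical):
        # which member of the chunk's stripe array holds parity for the
        # full stripe containing 'logical'
        data_stripes = num_stripes - 1
        full_stripe_nr = (logical - chunk_offset) // (stripe_len * data_stripes)
        return (num_stripes - 1) - (full_stripe_nr % num_stripes)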

P.S. usage of "stripe" to mean "stripe element" actually adds to
confusion when reading code :)
Duncan June 26, 2016, 3:03 p.m. UTC | #42
Andrei Borzenkov posted on Sun, 26 Jun 2016 10:54:16 +0300 as excerpted:

> P.S. usage of "stripe" to mean "stripe element" actually adds to
> confusion when reading code :)

... and posts (including patches, which I guess are code as well, just
not applied yet).  I've been noticing that in the "stripe length"
patches, the comments associated with the patches suggest it's actually
"strip length" they're talking about, using the "N strips, one per
device, make a stripe" definition.
Chris Murphy June 26, 2016, 7:30 p.m. UTC | #43
On Sun, Jun 26, 2016 at 1:54 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
> 26.06.2016 00:52, Chris Murphy пишет:
>> Interestingly enough, so far I'm finding with full stripe writes, i.e.
>> 3x raid5, exactly 128KiB data writes, devid 3 is always parity. This
>> is raid4.
>
> That's not what code suggests and what I see in practice - parity seems
> to be distributed across all disks; each new 128KiB file (extent) has
> parity on new disk. At least as long as we can trust btrfs-map-logical
> to always show parity as "mirror 2".


tl;dr Andrei is correct there's no raid4 behavior here.

Looks like mirror 2 is always parity, more on that below.


>
> Do you see consecutive full stripes in your tests? Or how do you
> determine which devid has parity for a given full stripe?

I do see consecutive full stripe writes, but it doesn't always happen.
Not checking for consecutiveness is where I became confused.

[root@f24s ~]# filefrag -v /mnt/5/ab*
Filesystem type is: 9123683e
File size of /mnt/5/ab128_2.txt is 131072 (32 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..      31:    3456128..   3456159:     32:             last,eof
/mnt/5/ab128_2.txt: 1 extent found
File size of /mnt/5/ab128_3.txt is 131072 (32 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..      31:    3456224..   3456255:     32:             last,eof
/mnt/5/ab128_3.txt: 1 extent found
File size of /mnt/5/ab128_4.txt is 131072 (32 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..      31:    3456320..   3456351:     32:             last,eof
/mnt/5/ab128_4.txt: 1 extent found
File size of /mnt/5/ab128_5.txt is 131072 (32 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..      31:    3456352..   3456383:     32:             last,eof
/mnt/5/ab128_5.txt: 1 extent found
File size of /mnt/5/ab128_6.txt is 131072 (32 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..      31:    3456384..   3456415:     32:             last,eof
/mnt/5/ab128_6.txt: 1 extent found
File size of /mnt/5/ab128_7.txt is 131072 (32 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..      31:    3456416..   3456447:     32:             last,eof
/mnt/5/ab128_7.txt: 1 extent found
File size of /mnt/5/ab128_8.txt is 131072 (32 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..      31:    3456448..   3456479:     32:             last,eof
/mnt/5/ab128_8.txt: 1 extent found
File size of /mnt/5/ab128_9.txt is 131072 (32 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..      31:    3456480..   3456511:     32:             last,eof
/mnt/5/ab128_9.txt: 1 extent found
File size of /mnt/5/ab128.txt is 131072 (32 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..      31:    3456096..   3456127:     32:             last,eof
/mnt/5/ab128.txt: 1 extent found

Starting with the bottom file and then going from the top, so they're
in 4096-byte block order; the 2nd column is the difference from the
previous value:

3456096
3456128 32
3456224 96
3456320 96
3456352 32
3456384 32
3456416 32
3456448 32
3456480 32

So the first two files are consecutive full stripe writes. The next
two aren't. The next five are. They were all copied at the same time.
I don't know why they aren't always consecutive writes.


[root@f24s ~]# btrfs-map-logical -l $[4096*3456096] /dev/VG/a
mirror 1 logical 14156169216 physical 1108541440 device /dev/mapper/VG-a
mirror 2 logical 14156169216 physical 2182283264 device /dev/mapper/VG-c
[root@f24s ~]# btrfs-map-logical -l $[4096*3456128] /dev/VG/a
mirror 1 logical 14156300288 physical 1075052544 device /dev/mapper/VG-b
mirror 2 logical 14156300288 physical 1108606976 device /dev/mapper/VG-a
[root@f24s ~]# btrfs-map-logical -l $[4096*3456224] /dev/VG/a
mirror 1 logical 14156693504 physical 1075249152 device /dev/mapper/VG-b
mirror 2 logical 14156693504 physical 1108803584 device /dev/mapper/VG-a
[root@f24s ~]# btrfs-map-logical -l $[4096*3456320] /dev/VG/a
mirror 1 logical 14157086720 physical 1075445760 device /dev/mapper/VG-b
mirror 2 logical 14157086720 physical 1109000192 device /dev/mapper/VG-a
[root@f24s ~]# btrfs-map-logical -l $[4096*3456352] /dev/VG/a
mirror 1 logical 14157217792 physical 2182807552 device /dev/mapper/VG-c
mirror 2 logical 14157217792 physical 1075511296 device /dev/mapper/VG-b
[root@f24s ~]# btrfs-map-logical -l $[4096*3456384] /dev/VG/a
mirror 1 logical 14157348864 physical 1109131264 device /dev/mapper/VG-a
mirror 2 logical 14157348864 physical 2182873088 device /dev/mapper/VG-c
[root@f24s ~]# btrfs-map-logical -l $[4096*3456416] /dev/VG/a
mirror 1 logical 14157479936 physical 1075642368 device /dev/mapper/VG-b
mirror 2 logical 14157479936 physical 1109196800 device /dev/mapper/VG-a
[root@f24s ~]# btrfs-map-logical -l $[4096*3456448] /dev/VG/a
mirror 1 logical 14157611008 physical 2183004160 device /dev/mapper/VG-c
mirror 2 logical 14157611008 physical 1075707904 device /dev/mapper/VG-b
[root@f24s ~]# btrfs-map-logical -l $[4096*3456480] /dev/VG/a
mirror 1 logical 14157742080 physical 1109327872 device /dev/mapper/VG-a
mirror 2 logical 14157742080 physical 2183069696 device /dev/mapper/VG-c


To confirm/deny mirror 2 is parity (128KiB file is 64KiB "a", 64KiB
"b", so expected parity is 0x03; if it's always 128KiB of the same
value then parity is 0x00 and can result in confusion/mistakes with
unwritten free space).
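
Quick check of that expected byte -- each parity byte is just the XOR
of the corresponding 'a' and 'b' bytes:

    >>> hex(0x61 ^ 0x62)     # ord('a') ^ ord('b')
    '0x3'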

[root@f24s ~]# dd if=/dev/VG/c bs=1 count=65536 skip=2182283264
2>/dev/null | hexdump -C
00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
*
00010000
[root@f24s ~]# dd if=/dev/VG/a bs=1 count=65536 skip=1108606976
2>/dev/null | hexdump -C
00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
*
00010000
[root@f24s ~]# dd if=/dev/VG/a bs=1 count=65536 skip=1108803584
2>/dev/null | hexdump -C
00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
*
00010000
[root@f24s ~]# dd if=/dev/VG/a bs=1 count=65536 skip=1109000192
2>/dev/null | hexdump -C
00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
*
00010000
[root@f24s ~]# dd if=/dev/VG/b bs=1 count=65536 skip=1075511296
2>/dev/null | hexdump -C
00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
*
00010000
[root@f24s ~]# dd if=/dev/VG/c bs=1 count=65536 skip=2182873088
2>/dev/null | hexdump -C
00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
*
00010000
[root@f24s ~]# dd if=/dev/VG/a bs=1 count=65536 skip=1109196800
2>/dev/null | hexdump -C
00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
*
00010000
[root@f24s ~]# dd if=/dev/VG/b bs=1 count=65536 skip=1075707904
2>/dev/null | hexdump -C
00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
*
00010000
[root@f24s ~]# dd if=/dev/VG/c bs=1 count=65536 skip=2183069696
2>/dev/null | hexdump -C
00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
*
00010000

Ok so in particular the last five, parity is on device b, c, a, b, c -
that suggests it's distributing parity on consecutive full stripe
writes.

Where I became confused is, there's not always a consecutive write,
and that's what ends up causing parity to end up on one device less
often. In the above example, parity goes 4x VG/a, 3x VG/c, and 2x
VG/b.

Basically it's a bad test. The sample size is too small. I'd need to
increase the sample size by a ton in order to know for sure if this is
really a problem.


>This
> information is not actually stored anywhere, it is computed based on
> block group geometry and logical stripe offset.

I think you're right. A better test is a scrub or balance on a raid5
that's exhibiting slowness, and find out if there's disk contention on
that system, and whether it's the result of parity not being
distributed enough.


> P.S. usage of "stripe" to mean "stripe element" actually adds to
> confusion when reading code :)

It's confusing everywhere. mdadm chunk = strip = stripe element. And
then LVM introduces -i --stripes which means "data strips" i.e. if you
choose -i 3 with raid6 segment type, you get 5 strips per stripe (3
data 2 parity). It's horrible.
Zygo Blaxell June 26, 2016, 7:52 p.m. UTC | #44
On Sun, Jun 26, 2016 at 01:30:03PM -0600, Chris Murphy wrote:
> On Sun, Jun 26, 2016 at 1:54 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
> > 26.06.2016 00:52, Chris Murphy пишет:
> >> Interestingly enough, so far I'm finding with full stripe writes, i.e.
> >> 3x raid5, exactly 128KiB data writes, devid 3 is always parity. This
> >> is raid4.
> >
> > That's not what code suggests and what I see in practice - parity seems
> > to be distributed across all disks; each new 128KiB file (extent) has
> > parity on new disk. At least as long as we can trust btrfs-map-logical
> > to always show parity as "mirror 2".
> 
> 
> tl;dr Andrei is correct there's no raid4 behavior here.
> 
> Looks like mirror 2 is always parity, more on that below.
> 
> 
> >
> > Do you see consecutive full stripes in your tests? Or how do you
> > determine which devid has parity for a given full stripe?
> 
> I do see consecutive full stripe writes, but it doesn't always happen.
> But not checking the consecutivity is where I became confused.
> 
> [root@f24s ~]# filefrag -v /mnt/5/ab*
> Filesystem type is: 9123683e
> File size of /mnt/5/ab128_2.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    3456128..   3456159:     32:             last,eof
> /mnt/5/ab128_2.txt: 1 extent found
> File size of /mnt/5/ab128_3.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    3456224..   3456255:     32:             last,eof
> /mnt/5/ab128_3.txt: 1 extent found
> File size of /mnt/5/ab128_4.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    3456320..   3456351:     32:             last,eof
> /mnt/5/ab128_4.txt: 1 extent found
> File size of /mnt/5/ab128_5.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    3456352..   3456383:     32:             last,eof
> /mnt/5/ab128_5.txt: 1 extent found
> File size of /mnt/5/ab128_6.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    3456384..   3456415:     32:             last,eof
> /mnt/5/ab128_6.txt: 1 extent found
> File size of /mnt/5/ab128_7.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    3456416..   3456447:     32:             last,eof
> /mnt/5/ab128_7.txt: 1 extent found
> File size of /mnt/5/ab128_8.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    3456448..   3456479:     32:             last,eof
> /mnt/5/ab128_8.txt: 1 extent found
> File size of /mnt/5/ab128_9.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    3456480..   3456511:     32:             last,eof
> /mnt/5/ab128_9.txt: 1 extent found
> File size of /mnt/5/ab128.txt is 131072 (32 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..      31:    3456096..   3456127:     32:             last,eof
> /mnt/5/ab128.txt: 1 extent found
> 
> Starting with the bottom file then from the top so they're in 4096
> byte block order; and the 2nd column is the difference in value:
> 
> 3456096
> 3456128 32
> 3456224 96
> 3456320 96
> 3456352 32
> 3456384 32
> 3456416 32
> 3456448 32
> 3456480 32
> 
> So the first two files are consecutive full stripe writes. The next
> two aren't. The next five are. They were all copied at the same time.
> I don't know why they aren't always consecutive writes.

The logical addresses don't include parity stripes, so you won't find
them with FIEMAP.  Parity locations are calculated after the logical ->
(disk, chunk_offset) translation is done (it's the same chunk_offset on
every disk, but one of the disks is parity while the others are data).
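
Roughly, the placement can be thought of like this (an illustrative Python
sketch only; the 64KiB per-device strip size is btrfs's, but the exact
rotation offset and direction here are assumptions, not the kernel code):

STRIP_SIZE = 64 * 1024   # btrfs per-device strip length

def parity_device(logical, chunk_logical_start=0, num_devices=3):
    # parity position is derived from geometry, not stored anywhere
    data_devices = num_devices - 1
    full_stripe = STRIP_SIZE * data_devices            # 128 KiB of data
    stripe_nr = (logical - chunk_logical_start) // full_stripe
    return stripe_nr % num_devices                     # rotates per full stripe

for n in range(5):
    print(n, parity_device(n * 128 * 1024))   # 0 1 2 0 1 ...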

> [root@f24s ~]# btrfs-map-logical -l $[4096*3456096] /dev/VG/a
> mirror 1 logical 14156169216 physical 1108541440 device /dev/mapper/VG-a
> mirror 2 logical 14156169216 physical 2182283264 device /dev/mapper/VG-c
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456128] /dev/VG/a
> mirror 1 logical 14156300288 physical 1075052544 device /dev/mapper/VG-b
> mirror 2 logical 14156300288 physical 1108606976 device /dev/mapper/VG-a
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456224] /dev/VG/a
> mirror 1 logical 14156693504 physical 1075249152 device /dev/mapper/VG-b
> mirror 2 logical 14156693504 physical 1108803584 device /dev/mapper/VG-a
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456320] /dev/VG/a
> mirror 1 logical 14157086720 physical 1075445760 device /dev/mapper/VG-b
> mirror 2 logical 14157086720 physical 1109000192 device /dev/mapper/VG-a
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456352] /dev/VG/a
> mirror 1 logical 14157217792 physical 2182807552 device /dev/mapper/VG-c
> mirror 2 logical 14157217792 physical 1075511296 device /dev/mapper/VG-b
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456384] /dev/VG/a
> mirror 1 logical 14157348864 physical 1109131264 device /dev/mapper/VG-a
> mirror 2 logical 14157348864 physical 2182873088 device /dev/mapper/VG-c
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456416] /dev/VG/a
> mirror 1 logical 14157479936 physical 1075642368 device /dev/mapper/VG-b
> mirror 2 logical 14157479936 physical 1109196800 device /dev/mapper/VG-a
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456448] /dev/VG/a
> mirror 1 logical 14157611008 physical 2183004160 device /dev/mapper/VG-c
> mirror 2 logical 14157611008 physical 1075707904 device /dev/mapper/VG-b
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456480] /dev/VG/a
> mirror 1 logical 14157742080 physical 1109327872 device /dev/mapper/VG-a
> mirror 2 logical 14157742080 physical 2183069696 device /dev/mapper/VG-c
> 
> 
> To confirm/deny mirror 2 is parity (128KiB file is 64KiB "a", 64KiB
> "b", so expected parity is 0x03; if it's always 128KiB of the same
> value then parity is 0x00 and can result in confusion/mistakes with
> unwritten free space).
> 
> [root@f24s ~]# dd if=/dev/VG/c bs=1 count=65536 skip=2182283264
> 2>/dev/null | hexdump -C
> 00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/a bs=1 count=65536 skip=1108606976
> 2>/dev/null | hexdump -C
> 00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/a bs=1 count=65536 skip=1108803584
> 2>/dev/null | hexdump -C
> 00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/a bs=1 count=65536 skip=1109000192
> 2>/dev/null | hexdump -C
> 00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/b bs=1 count=65536 skip=1075511296
> 2>/dev/null | hexdump -C
> 00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/c bs=1 count=65536 skip=2182873088
> 2>/dev/null | hexdump -C
> 00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/a bs=1 count=65536 skip=1109196800
> 2>/dev/null | hexdump -C
> 00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/b bs=1 count=65536 skip=1075707904
> 2>/dev/null | hexdump -C
> 00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
> *
> 00010000
> [root@f24s ~]# dd if=/dev/VG/c bs=1 count=65536 skip=2183069696
> 2>/dev/null | hexdump -C
> 00000000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
> *
> 00010000
> 
> Ok so in particular the last five, parity is on device b, c, a, b, c -
> that suggests it's distributing parity on consecutive full stripe
> writes.
> 
> Where I became confused is, there's not always a consecutive write,
> and that's what ends up causing parity to end up on one device less
> often. In the above example, parity goes 4x VG/a, 3x VG/c, and 2x
> VG/b.
> 
> Basically it's a bad test. The sample size is too small. I'd need to
> increase the sample size by a ton in order to know for sure if this is
> really a problem.
> 
> 
> >This
> > information is not actually stored anywhere, it is computed based on
> > block group geometry and logical stripe offset.
> 
> I think you're right. A better test is a scrub or balance on a raid5
> that's exhibiting slowness, and find out if there's disk contention on
> that system, and whether it's the result of parity not being
> distributed enough.
> 
> 
> > P.S. usage of "stripe" to mean "stripe element" actually adds to
> > confusion when reading code :)
> 
> It's confusing everywhere. mdadm chunk = strip = stripe element. And
> then LVM introduces -i --stripes which means "data strips" i.e. if you
> choose -i 3 with raid6 segment type, you get 5 strips per stripe (3
> data 2 parity). It's horrible.
> 
> 
> 
> 
> -- 
> Chris Murphy
>
Austin S. Hemmelgarn June 27, 2016, 11:21 a.m. UTC | #45
On 2016-06-25 12:44, Chris Murphy wrote:
> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>
>> Well, the obvious major advantage that comes to mind for me to checksumming
>> parity is that it would let us scrub the parity data itself and verify it.
>
> OK but hold on. During scrub, it should read data, compute checksums
> *and* parity, and compare those to what's on-disk - > EXTENT_CSUM in
> the checksum tree, and the parity strip in the chunk tree. And if
> parity is wrong, then it should be replaced.
Except that's horribly inefficient.  With limited exceptions involving 
highly situational co-processors, computing a checksum of a parity block 
is always going to be faster than computing parity for the stripe.  By 
using that to check parity, we can safely speed up the common case of 
near zero errors during a scrub by a pretty significant factor.

The ideal situation that I'd like to see for scrub WRT parity is:
1. Store checksums for the parity itself.
2. During scrub, if the checksum is good, the parity is good, and we 
just saved the time of computing the whole parity block.
3. If the checksum is not good, then compute the parity.  If the parity 
just computed matches what is there already, the checksum is bad and 
should be rewritten (and we should probably recompute the whole block of 
checksums it's in), otherwise, the parity was bad, write out the new 
parity and update the checksum.
4. Have an option to skip the csum check on the parity and always 
compute it.
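
In rough pseudocode, that flow would look something like this (an
illustrative Python sketch, with zlib.crc32 standing in for btrfs's crc32c
and an in-memory dict standing in for the on-disk structures):

import zlib
from functools import reduce

def checksum(block):                  # stand-in for btrfs's crc32c
    return zlib.crc32(block)

def xor_parity(data_strips):          # raid5 parity = byte-wise XOR of strips
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*data_strips))

def scrub_parity(stripe, always_compute=False):
    # cheap path: if the stored parity checksums correctly, trust it
    if not always_compute and checksum(stripe["parity"]) == stripe["parity_csum"]:
        return "parity ok"
    computed = xor_parity(stripe["data"])
    if computed == stripe["parity"]:
        stripe["parity_csum"] = checksum(computed)     # the csum was bad
        return "csum repaired"
    stripe["parity"] = computed                        # the parity was bad
    stripe["parity_csum"] = checksum(computed)
    return "parity repaired"

data = [b"a" * 65536, b"b" * 65536]
stripe = {"data": data, "parity": xor_parity(data), "parity_csum": 0}
print(scrub_parity(stripe))           # stale csum here -> "csum repaired"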
>
> Even check > md/sync_action does this. So no pun intended but Btrfs
> isn't even at parity with mdadm on data integrity if it doesn't check
> if the parity matches data.
Except that MD and LVM don't have checksums to verify anything outside 
of the very high-level metadata.  They have to compute the parity during 
a scrub because that's the _only_ way they have to check data integrity. 
  Just because that's the only way for them to check it does not mean we 
have to follow their design, especially considering that we have other, 
faster ways to check it.
>
>
>> I'd personally much rather know my parity is bad before I need to use it
>> than after using it to reconstruct data and getting an error there, and I'd
>> be willing to be that most seasoned sysadmins working for companies using
>> big storage arrays likely feel the same about it.
>
> That doesn't require parity csums though. It just requires computing
> parity during a scrub and comparing it to the parity on disk to make
> sure they're the same. If they aren't, assuming no other error for
> that full stripe read, then the parity block is replaced.
It does not require it, but it can make it significantly more efficient, 
and even a 1% increase in efficiency is a huge difference on a big array.
>
> So that's also something to check in the code or poke a system with a
> stick and see what happens.
>
>> I could see it being
>> practical to have an option to turn this off for performance reasons or
>> similar, but again, I have a feeling that most people would rather be able
>> to check if a rebuild will eat data before trying to rebuild (depending on
>> the situation in such a case, it will sometimes just make more sense to nuke
>> the array and restore from a backup instead of spending time waiting for it
>> to rebuild).
>
> The much bigger problem we have right now that affects Btrfs,
> LVM/mdadm md raid, is this silly bad default with non-enterprise
> drives having no configurable SCT ERC, with ensuing long recovery
> times, and the kernel SCSI command timer at 30 seconds - which
> actually also fucks over regular single disk users also because it
> means they don't get the "benefit" of long recovery times, which is
> the whole g'd point of that feature. This itself causes so many
> problems where bad sectors just get worse and don't get fixed up
> because of all the link resets. So I still think it's a bullshit
> default kernel side because it pretty much affects the majority use
> case, it is only a non-problem with proprietary hardware raid, and
> software raid using enterprise (or NAS specific) drives that already
> have short recovery times by default.
On this, we can agree.
Chris Murphy June 27, 2016, 4:17 p.m. UTC | #46
On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-06-25 12:44, Chris Murphy wrote:
>>
>> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
>> <ahferroin7@gmail.com> wrote:
>>
>>> Well, the obvious major advantage that comes to mind for me to
>>> checksumming
>>> parity is that it would let us scrub the parity data itself and verify
>>> it.
>>
>>
>> OK but hold on. During scrub, it should read data, compute checksums
>> *and* parity, and compare those to what's on-disk - > EXTENT_CSUM in
>> the checksum tree, and the parity strip in the chunk tree. And if
>> parity is wrong, then it should be replaced.
>
> Except that's horribly inefficient.  With limited exceptions involving
> highly situational co-processors, computing a checksum of a parity block is
> always going to be faster than computing parity for the stripe.  By using
> that to check parity, we can safely speed up the common case of near zero
> errors during a scrub by a pretty significant factor.

OK I'm in favor of that. Although somehow md gets away with this by
computing and checking parity for its scrubs, and still manages to
keep drives saturated in the process - at least HDDs, I'm not sure how
it fares on SSDs.




> The ideal situation that I'd like to see for scrub WRT parity is:
> 1. Store checksums for the parity itself.
> 2. During scrub, if the checksum is good, the parity is good, and we just
> saved the time of computing the whole parity block.
> 3. If the checksum is not good, then compute the parity.  If the parity just
> computed matches what is there already, the checksum is bad and should be
> rewritten (and we should probably recompute the whole block of checksums
> it's in), otherwise, the parity was bad, write out the new parity and update
> the checksum.
> 4. Have an option to skip the csum check on the parity and always compute
> it.
>>
>>
>> Even check > md/sync_action does this. So no pun intended but Btrfs
>> isn't even at parity with mdadm on data integrity if it doesn't check
>> if the parity matches data.
>
> Except that MD and LVM don't have checksums to verify anything outside of
> the very high-level metadata.  They have to compute the parity during a
> scrub because that's the _only_ way they have to check data integrity.  Just
> because that's the only way for them to check it does not mean we have to
> follow their design, especially considering that we have other, faster ways
> to check it.

I'm not opposed to this optimization. But to better qualify my previous
"major advantage" remark: what I meant was in terms of solving a
functional deficiency.



>> The much bigger problem we have right now that affects Btrfs,
>> LVM/mdadm md raid, is this silly bad default with non-enterprise
>> drives having no configurable SCT ERC, with ensuing long recovery
>> times, and the kernel SCSI command timer at 30 seconds - which
>> actually also fucks over regular single disk users also because it
>> means they don't get the "benefit" of long recovery times, which is
>> the whole g'd point of that feature. This itself causes so many
>> problems where bad sectors just get worse and don't get fixed up
>> because of all the link resets. So I still think it's a bullshit
>> default kernel side because it pretty much affects the majority use
>> case, it is only a non-problem with proprietary hardware raid, and
>> software raid using enterprise (or NAS specific) drives that already
>> have short recovery times by default.
>
> On this, we can agree.

It just came up again in a thread over the weekend on linux-raid@. I'm
going to ask while people are paying attention if a patch to change
the 30 second time out to something a lot higher has ever been
floated, what the negatives might be, and where to get this fixed if
it wouldn't be accepted in the kernel code directly.

*Ideally* I think we'd want two timeouts. I'd like to see commands
have a timer that results in merely a warning that could be used by
e.g. btrfs scrub to know "hey this sector range is 'slow' I'm going to
write over those sectors". That's how bad sectors start out, they read
slower and eventually go beyond 30 seconds and now it's all link
resets. If the problem could be fixed before then... that's the best
scenario.

The 2nd timer would be, OK the controller or drive just face planted, reset.
Chris Murphy June 27, 2016, 8:54 p.m. UTC | #47
For what it's worth, I found btrfs-map-logical can produce the full
mapping for raid5 (didn't test raid6) by specifying the extent block
length. If that's omitted, it only shows the device+mapping for the
first strip.

This example is a 3 disk raid5, with a 128KiB file all in a single extent.

[root@f24s ~]# btrfs-map-logical -l 14157742080 /dev/VG/a
mirror 1 logical 14157742080 physical 1109327872 device /dev/mapper/VG-a
mirror 2 logical 14157742080 physical 2183069696 device /dev/mapper/VG-c

[root@f24s ~]# btrfs-map-logical -l 14157742080 -b 131072 /dev/VG/a
mirror 1 logical 14157742080 physical 1109327872 device /dev/mapper/VG-a
mirror 1 logical 14157807616 physical 1075773440 device /dev/mapper/VG-b
mirror 2 logical 14157742080 physical 2183069696 device /dev/mapper/VG-c
mirror 2 logical 14157807616 physical 2183069696 device /dev/mapper/VG-c

It's also possible to use -c and -o to copy the extent to a file and
more easily diff it with a control file, rather than using dd.

Chris Murphy
Henk Slager June 27, 2016, 9:02 p.m. UTC | #48
On Mon, Jun 27, 2016 at 6:17 PM, Chris Murphy <lists@colorremedies.com> wrote:
> On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> On 2016-06-25 12:44, Chris Murphy wrote:
>>>
>>> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
>>> <ahferroin7@gmail.com> wrote:
>>>
>>>> Well, the obvious major advantage that comes to mind for me to
>>>> checksumming
>>>> parity is that it would let us scrub the parity data itself and verify
>>>> it.
>>>
>>>
>>> OK but hold on. During scrub, it should read data, compute checksums
>>> *and* parity, and compare those to what's on-disk - > EXTENT_CSUM in
>>> the checksum tree, and the parity strip in the chunk tree. And if
>>> parity is wrong, then it should be replaced.
>>
>> Except that's horribly inefficient.  With limited exceptions involving
>> highly situational co-processors, computing a checksum of a parity block is
>> always going to be faster than computing parity for the stripe.  By using
>> that to check parity, we can safely speed up the common case of near zero
>> errors during a scrub by a pretty significant factor.
>
> OK I'm in favor of that. Although somehow md gets away with this by
> computing and checking parity for its scrubs, and still manages to
> keep drives saturated in the process - at least HDDs, I'm not sure how
> it fares on SSDs.

What I read in this thread clarifies the different flavors of errors I
saw when trying btrfs raid5 while corrupting 1 device or just
unexpectedly removing a device and replacing it with a fresh one.
In particular, I was not aware of the lack of parity csums, and I think
that is really wrong.

Consider a 4-disk btrfs raid10 and a 3-disk btrfs raid5. Both protect
against the loss of 1 device or badblocks on 1 device. In the current
design (unoptimized for performance), raid10 reads from 2 disks and,
as far as I remember, raid5 does as well, per task/process.
Which pair of strips raid10 reads is pseudo-random AFAIK, so one could
get low throughput if some device in the array is older/slower and
that one is picked. The translation from device to fs logical layer is
just a simple function, namely a copy, so there is the option to keep
data in place (zero-copy). The data is at least read by the csum check,
and in case of failure the btrfs code picks the alternative strip and
corrects it, etc.

For raid5, assuming it avoids the parity in principle, it is also
a strip pair and a csum check. In case of csum failure, one needs the
parity strip and a parity calculation. To me, it looks like the 'fear'
of this calculation has made raid56 a sort of add-on, instead of a
more integral part.

Looking at the raid6 perf test at boot in dmesg, it reports 30 GByte/s,
even higher than memory bandwidth. So although a calculation is needed
if data0strip+paritystrip were used instead of data0strip+data1strip,
I think that, looking at the total cost, it can be cheaper than
spending time on seeks, at least on HDDs. If the parity calculation is
treated in a transparent way, the same as a copy, then there is more
flexibility in selecting disks (and strips), which I think enables
easier design and performance optimizations.

>> The ideal situation that I'd like to see for scrub WRT parity is:
>> 1. Store checksums for the parity itself.
>> 2. During scrub, if the checksum is good, the parity is good, and we just
>> saved the time of computing the whole parity block.
>> 3. If the checksum is not good, then compute the parity.  If the parity just
>> computed matches what is there already, the checksum is bad and should be
>> rewritten (and we should probably recompute the whole block of checksums
>> it's in), otherwise, the parity was bad, write out the new parity and update
>> the checksum.

This 3rd point: if the parity matches but the csum is not good, then
there is a btrfs design error or some hardware/CPU/memory problem.
Compare with btrfs raid10: if the copies match but the csum is wrong,
then something is fatally wrong. Just do the first step, the csum
check, and if it is wrong, regenerate the assumed-corrupt strip from
the 3 others. And for a 3-disk raid5, from the 2 others, whether that
is a copy or a parity calculation.

>> 4. Have an option to skip the csum check on the parity and always compute
>> it.
Zygo Blaxell June 27, 2016, 9:57 p.m. UTC | #49
On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:
> On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
> > On 2016-06-25 12:44, Chris Murphy wrote:
> >> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
> >> <ahferroin7@gmail.com> wrote:
> >>
> >> OK but hold on. During scrub, it should read data, compute checksums
> >> *and* parity, and compare those to what's on-disk - > EXTENT_CSUM in
> >> the checksum tree, and the parity strip in the chunk tree. And if
> >> parity is wrong, then it should be replaced.
> >
> > Except that's horribly inefficient.  With limited exceptions involving
> > highly situational co-processors, computing a checksum of a parity block is
> > always going to be faster than computing parity for the stripe.  By using
> > that to check parity, we can safely speed up the common case of near zero
> > errors during a scrub by a pretty significant factor.
> 
> OK I'm in favor of that. Although somehow md gets away with this by
> computing and checking parity for its scrubs, and still manages to
> keep drives saturated in the process - at least HDDs, I'm not sure how
> it fares on SSDs.

A modest desktop CPU can compute raid6 parity at 6GB/sec, a less-modest
one at more than 10GB/sec.  Maybe a bottleneck is within reach of an
array of SSDs vs. a slow CPU.

> It just came up again in a thread over the weekend on linux-raid@. I'm
> going to ask while people are paying attention if a patch to change
> the 30 second time out to something a lot higher has ever been
> floated, what the negatives might be, and where to get this fixed if
> it wouldn't be accepted in the kernel code directly.

Defaults are defaults, they're not for everyone.  30 seconds is about
two minutes too short for an SMR drive's worst-case write latency, or
28 seconds too long for an OLTP system, or just right for an end-user's
personal machine with a low-energy desktop drive and a long spin-up time.

Once a drive starts taking 30+ seconds to do I/O, I consider the drive
failed in the sense that it's too slow to meet latency requirements.
When the problem is that it's already taking too long, the solution is
not waiting even longer.  To put things in perspective, consider that
server hardware watchdog timeouts are typically 60 seconds by default
(if not maximum).

If anything, I want the timeout to be shorter so that upper layers with
redundancy can get an EIO and initiate repair promptly, and admins can
get notified to evict chronic offenders from their drive slots, without
having to pay extra for hard disk firmware with that feature.

> *Ideally* I think we'd want two timeouts. I'd like to see commands
> have a timer that results in merely a warning that could be used by
> e.g. btrfs scrub to know "hey this sector range is 'slow' I'm going to
> write over those sectors". That's how bad sectors start out, they read
> slower and eventually go beyond 30 seconds and now it's all link
> resets. If the problem could be fixed before then... that's the best
> scenario.

What's the downside of a link reset?  Can the driver not just return
EIO for all the outstanding IOs in progress at reset, and let the upper
layers deal with it?  Or is the problem that the upper layers are all
horribly broken by EIOs, or drive firmware horribly broken by link resets?

The upper layers could time the IOs, and make their own decisions based
on the timing (e.g. btrfs or mdadm could proactively repair anything that
took more than 10 seconds to read).  That might be a better approach,
since shortening the time to an EIO is only useful when you have a
redundancy layer in place to do something about them.

> The 2nd timer would be, OK the controller or drive just face planted, reset.
> 
> -- 
> Chris Murphy
>
Chris Murphy June 27, 2016, 10:30 p.m. UTC | #50
On Mon, Jun 27, 2016 at 3:57 PM, Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
> On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:

>
>> It just came up again in a thread over the weekend on linux-raid@. I'm
>> going to ask while people are paying attention if a patch to change
>> the 30 second time out to something a lot higher has ever been
>> floated, what the negatives might be, and where to get this fixed if
>> it wouldn't be accepted in the kernel code directly.
>
> Defaults are defaults, they're not for everyone.  30 seconds is about
> two minutes too short for an SMR drive's worst-case write latency, or
> 28 seconds too long for an OLTP system, or just right for an end-user's
> personal machine with a low-energy desktop drive and a long spin-up time.

The question is where is the correct place to change the default to
broadly capture most use cases, because it's definitely incompatible
with consumer SATA drives, whether in an enclosure or not.

Maybe it's with the kernel teams at each distribution? Or maybe an
upstream udev rule?

In any case something needs to give here because it's been years of
bugging users about this misconfiguration and people constantly run
into it, which means user education is not working.


>
> Once a drive starts taking 30+ seconds to do I/O, I consider the drive
> failed in the sense that it's too slow to meet latency requirements.

Well that is then a mismatch between use case and the drive purchasing
decision. Consumer drives do this. It's how they're designed to work.


> When the problem is that it's already taking too long, the solution is
> not waiting even longer.  To put things in perspective, consider that
> server hardware watchdog timeouts are typically 60 seconds by default
> (if not maximum).

If you want the data retrieved from that particular device, the only
solution is waiting longer. The alternative is what you get, an IO
error (well actually you get a link reset, which also means the entire
command queue is purged on SATA drives).


> If anything, I want the timeout to be shorter so that upper layers with
> redundancy can get an EIO and initiate repair promptly, and admins can
> get notified to evict chronic offenders from their drive slots, without
> having to pay extra for hard disk firmware with that feature.

The drive totally thwarts this. It doesn't report back to the kernel
what command is hung, as far as I'm aware. It just hangs and goes into
a so-called "deep recovery"; there is no way to know what sector is
causing the problem until the drive reports a read error, which will
include the affected sector LBA.

Btrfs does have something of a work around for when things get slow,
and that's balance, read and rewrite everything. The write forces
sector remapping by the drive firmware for bad sectors.


>> *Ideally* I think we'd want two timeouts. I'd like to see commands
>> have a timer that results in merely a warning that could be used by
>> e.g. btrfs scrub to know "hey this sector range is 'slow' I'm going to
>> write over those sectors". That's how bad sectors start out, they read
>> slower and eventually go beyond 30 seconds and now it's all link
>> resets. If the problem could be fixed before then... that's the best
>> scenario.
>
> What's the downside of a link reset?  Can the driver not just return
> EIO for all the outstanding IOs in progress at reset, and let the upper
> layers deal with it?  Or is the problem that the upper layers are all
> horribly broken by EIOs, or drive firmware horribly broken by link resets?

Link reset clears the entire command queue on SATA drives, and it
wipes away any possibility of finding out what LBA, or even a range of
LBAs, is the source of the stall. So it pretty much gets you nothing.


> The upper layers could time the IOs, and make their own decisions based
> on the timing (e.g. btrfs or mdadm could proactively repair anything that
> took more than 10 seconds to read).  That might be a better approach,
> since shortening the time to an EIO is only useful when you have a
> redundancy layer in place to do something about them.

For RAID with redundancy, that's doable, although I have no idea what
work is needed, or even if it's possible, to track commands in this
manner, and fall back to some kind of repair mode as if it were a read
error.

For single drives and RAID 0, the only possible solution is to not do
link resets for up to 3 minutes and hope the drive returns the single
copy of data.

Even in the case of Btrfs DUP, it's thwarted without a read error
reported from the drive (or it returning bad data).
Zygo Blaxell June 28, 2016, 1:52 a.m. UTC | #51
On Mon, Jun 27, 2016 at 04:30:23PM -0600, Chris Murphy wrote:
> On Mon, Jun 27, 2016 at 3:57 PM, Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> > On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:
> > If anything, I want the timeout to be shorter so that upper layers with
> > redundancy can get an EIO and initiate repair promptly, and admins can
> > get notified to evict chronic offenders from their drive slots, without
> > having to pay extra for hard disk firmware with that feature.
> 
> The drive totally thwarts this. It doesn't report back to the kernel
> what command is hung, as far as I'm aware. It just hangs and goes into
> a so called "deep recovery" there is no way to know what sector is
> causing the problem

I'm proposing just treat the link reset _as_ an EIO, unless transparent
link resets are required for link speed negotiation or something.
The drive wouldn't be thwarting anything, the host would just ignore it
(unless the drive doesn't respond to a link reset until after its internal
timeout, in which case nothing is saved by shortening the timeout).

> until the drive reports a read error, which will
> include the affected sector LBA.

It doesn't matter which sector.  Chances are good that it was more than
one of the outstanding requested sectors anyway.  Rewrite them all.

We know which sectors they are because somebody has an IO operation
waiting for a status on each of them (unless they're using AIO or some
other API where a request can be fired at a hard drive and the reply
discarded).  Notify all of them that their IO failed and move on.

> Btrfs does have something of a work around for when things get slow,
> and that's balance, read and rewrite everything. The write forces
> sector remapping by the drive firmware for bad sectors.

It's a crude form of "resilvering" as ZFS calls it.

> > The upper layers could time the IOs, and make their own decisions based
> > on the timing (e.g. btrfs or mdadm could proactively repair anything that
> > took more than 10 seconds to read).  That might be a better approach,
> > since shortening the time to an EIO is only useful when you have a
> > redundancy layer in place to do something about them.
> 
> For RAID with redundancy, that's doable, although I have no idea what
> work is needed, or even if it's possible, to track commands in this
> manner, and fall back to some kind of repair mode as if it were a read
> error.

If btrfs sees EIO from a lower block layer it will try to reconstruct the
missing data (but not repair it).  If that happens during a scrub,
it will also attempt to rewrite the missing data over the original
offending sectors.  This happens every few months in my server pool,
and seems to be working even on btrfs raid5.

Last time I checked all the RAID implementations on Linux (ok, so that's
pretty much just md-raid) had some sort of repair capability.  lvm uses
(or can use) the md-raid implementation.  ext4 and xfs on naked disk
partitions will have problems, but that's because they were designed in
the 1990's when we were young and naive and still believed hard disks
would one day become reliable devices without buggy firmware.

> For single drives and RAID 0, the only possible solution is to not do
> link resets for up to 3 minutes and hope the drive returns the single
> copy of data.

So perhaps the timeout should be influenced by higher layers, e.g. if a
disk becomes part of a raid1, its timeout should be shortened by default,
while a timeout for a disk that is not used by a redundant layer should
be longer.

> Even in the case of Btrfs DUP, it's thwarted without a read error
> reported from the drive (or it returning bad data).

That case gets messy--different timeouts for different parts of the disk.
Probably not practical.
Chris Murphy June 28, 2016, 2:39 a.m. UTC | #52
On Mon, Jun 27, 2016 at 7:52 PM, Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
> On Mon, Jun 27, 2016 at 04:30:23PM -0600, Chris Murphy wrote:
>> On Mon, Jun 27, 2016 at 3:57 PM, Zygo Blaxell
>> <ce3g8jdj@umail.furryterror.org> wrote:
>> > On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:
>> > If anything, I want the timeout to be shorter so that upper layers with
>> > redundancy can get an EIO and initiate repair promptly, and admins can
>> > get notified to evict chronic offenders from their drive slots, without
>> > having to pay extra for hard disk firmware with that feature.
>>
>> The drive totally thwarts this. It doesn't report back to the kernel
>> what command is hung, as far as I'm aware. It just hangs and goes into
>> a so called "deep recovery" there is no way to know what sector is
>> causing the problem
>
> I'm proposing just treat the link reset _as_ an EIO, unless transparent
> link resets are required for link speed negotiation or something.

That's not one EIO, that's possibly 31 items in the command queue that
get knocked over when the link is reset. I don't have the expertise to
know whether it's sane to interpret many EIOs all at once as an
implicit indication of bad sectors. Offhand, I think that's probably
specious.

> The drive wouldn't be thwarting anything, the host would just ignore it
> (unless the drive doesn't respond to a link reset until after its internal
> timeout, in which case nothing is saved by shortening the timeout).
>
>> until the drive reports a read error, which will
>> include the affected sector LBA.
>
> It doesn't matter which sector.  Chances are good that it was more than
> one of the outstanding requested sectors anyway.  Rewrite them all.

*shrug* even if valid, it only helps the raid 1+ cases. It does
nothing to help raid0, linear/concat, or single device deployments.
Those users also deserve to have access to their data, if the drive
can recover it by giving it enough time to do so.


> We know which sectors they are because somebody has an IO operation
> waiting for a status on each of them (unless they're using AIO or some
> other API where a request can be fired at a hard drive and the reply
> discarded).  Notify all of them that their IO failed and move on.

Dunno, maybe.


>
>> Btrfs does have something of a work around for when things get slow,
>> and that's balance, read and rewrite everything. The write forces
>> sector remapping by the drive firmware for bad sectors.
>
> It's a crude form of "resilvering" as ZFS calls it.

In what manner is it crude?




> If btrfs sees EIO from a lower block layer it will try to reconstruct the
> missing data (but not repair it).  If that happens during a scrub,
> it will also attempt to rewrite the missing data over the original
> offending sectors.  This happens every few months in my server pool,
> and seems to be working even on btrfs raid5.
>
> Last time I checked all the RAID implementations on Linux (ok, so that's
> pretty much just md-raid) had some sort of repair capability.

You can read man 4 md, and you can also look on linux-raid@; it's very
clearly necessary for the drive to report a read or write error
explicitly with LBA for md to do repairs. If there are link resets,
bad sectors accumulate and the obvious inevitably happens.



>
>> For single drives and RAID 0, the only possible solution is to not do
>> link resets for up to 3 minutes and hope the drive returns the single
>> copy of data.
>
> So perhaps the timeout should be influenced by higher layers, e.g. if a
> disk becomes part of a raid1, its timeout should be shortened by default,
> while a timeout for a disk that is not used in by redundant layer should
> be longer.

And there are a pile of reasons why link resets are necessary that
have nothing to do with bad sectors. So if you end up with a drive or
controller misbehaving and the new behavior is to force a bunch of new
(corrective) writes to the drive right after a reset, it could actually
make its problems worse for all we know.

I think it's highly speculative to assume a hung block device means a
bad sector and should be treated as one, and that doing so will
cause no other side effects. Whether this is at all sane to do is a
question for block device/SCSI experts to opine on. I'm sure
they're reasonably aware of this problem, and if it were that simple
they'd have done it already, but conversely 5 years of telling users
to change the command timer or stop using the wrong kind of drives for
RAID really isn't sufficiently good advice either.

The reality is that drive manufacturers have handed us drives that, far
and wide, either don't support SCT ERC or have it disabled by default.
So maybe the thing to do is have udev poll the drive for SCT ERC: if
it's already at 70,70 then leave the SCSI command timer as is. If it
reports that it's disabled, then udev needs to know whether the drive
is part of some kind of RAID 1+ and, if so, set SCT ERC to 70,70. If
it's a single drive, linear/concat, or RAID 0, then it should instead
change the SCSI command timer to 180 and let the user sort out the
possible insanity that follows, which will be the occasional
half-minute-or-longer hang whenever a drive has a bad sector.
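
Something along these lines, run once per disk from a udev rule or a boot
script (a rough Python sketch; the smartctl output parsing and the
RAID-membership check are placeholders/assumptions, and error handling is
omitted):

#!/usr/bin/env python3
import subprocess, sys

def main(dev):                                    # dev like "sda"
    out = subprocess.run(["smartctl", "-l", "scterc", f"/dev/{dev}"],
                         capture_output=True, text=True).stdout
    if "70 (7.0 seconds)" in out:                 # naive parse, assumed output format
        return                                    # SCT ERC already short; keep 30 s timer
    in_redundant_raid = False                     # placeholder: ask mdadm/btrfs/lvm here
    if in_redundant_raid:
        subprocess.run(["smartctl", "-l", "scterc,70,70", f"/dev/{dev}"])
    else:
        # single disk / linear / raid0: give the drive time to recover on its own
        with open(f"/sys/block/{dev}/device/timeout", "w") as f:
            f.write("180")

if __name__ == "__main__":
    main(sys.argv[1])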

>
>> Even in the case of Btrfs DUP, it's thwarted without a read error
>> reported from the drive (or it returning bad data).
>
> That case gets messy--different timeouts for different parts of the disk.
> Probably not practical.

The point is that DUP implies single device, and for certain that
configuration should not have such a short command timer. If the drive
can recover it at all, it could possibly remap the sector on its own.
In fact that's what is supposed to happen, but drives seem to only do
this when the sector read is already painfully slow.
Zygo Blaxell June 28, 2016, 3:17 a.m. UTC | #53
On Mon, Jun 27, 2016 at 08:39:21PM -0600, Chris Murphy wrote:
> On Mon, Jun 27, 2016 at 7:52 PM, Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> > On Mon, Jun 27, 2016 at 04:30:23PM -0600, Chris Murphy wrote:
> >> Btrfs does have something of a work around for when things get slow,
> >> and that's balance, read and rewrite everything. The write forces
> >> sector remapping by the drive firmware for bad sectors.
> >
> > It's a crude form of "resilvering" as ZFS calls it.
> 
> In what manner is it crude?

Balance relocates extents, looks up backrefs, and rewrites metadata, all
of which are extra work above what is required by resilvering (and extra
work that is proportional to the number of backrefs and the (currently
extremely poor) performance of the backref walking code, so snapshots
and large files multiply the workload).

Resilvering should just read data, reconstruct it from a mirror if
necessary, and write it back to the original location (or read one
mirror and rewrite the other).  That's more like what scrub does, except
scrub rewrites only the blocks it couldn't read (or that failed csum).

> > Last time I checked all the RAID implementations on Linux (ok, so that's
> > pretty much just md-raid) had some sort of repair capability.
> 
> You can read man 4 md, and you can also look on linux-raid@, it's very
> clearly necessary for the drive to report a read or write error
> explicitly with LBA for md to do repairs. If there are link resets,
> bad sectors accumulate and the obvious inevitably happens.

I am looking at the md code.  It looks at ->bi_error, and nothing else as
far as I can tell.  It doesn't even care if the error is EIO--any non-zero
return value from the lower bio layer seems to trigger automatic recovery.
Austin S. Hemmelgarn June 28, 2016, 11:23 a.m. UTC | #54
On 2016-06-27 23:17, Zygo Blaxell wrote:
> On Mon, Jun 27, 2016 at 08:39:21PM -0600, Chris Murphy wrote:
>> On Mon, Jun 27, 2016 at 7:52 PM, Zygo Blaxell
>> <ce3g8jdj@umail.furryterror.org> wrote:
>>> On Mon, Jun 27, 2016 at 04:30:23PM -0600, Chris Murphy wrote:
>>>> Btrfs does have something of a work around for when things get slow,
>>>> and that's balance, read and rewrite everything. The write forces
>>>> sector remapping by the drive firmware for bad sectors.
>>>
>>> It's a crude form of "resilvering" as ZFS calls it.
>>
>> In what manner is it crude?
>
> Balance relocates extents, looks up backrefs, and rewrites metadata, all
> of which are extra work above what is required by resilvering (and extra
> work that is proportional to the number of backrefs and the (currently
> extremely poor) performance of the backref walking code, so snapshots
> and large files multiply the workload).
>
> Resilvering should just read data, reconstruct it from a mirror if
> necessary, and write it back to the original location (or read one
> mirror and rewrite the other).  That's more like what scrub does, except
> scrub rewrites only the blocks it couldn't read (or that failed csum).
It's worth pointing out that balance was not designed for resilvering;
it was designed for reshaping arrays, converting replication profiles,
and compaction at the chunk level.  Balance is not a resilvering tool;
resilvering just happens to be a useful side effect of running a balance
(actually, so is the chunk-level compaction).


Austin S. Hemmelgarn June 28, 2016, 12:05 p.m. UTC | #55
On 2016-06-27 17:57, Zygo Blaxell wrote:
> On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:
>> On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn
>> <ahferroin7@gmail.com> wrote:
>>> On 2016-06-25 12:44, Chris Murphy wrote:
>>>> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
>>>> <ahferroin7@gmail.com> wrote:
>>>>
>>>> OK but hold on. During scrub, it should read data, compute checksums
>>>> *and* parity, and compare those to what's on-disk - > EXTENT_CSUM in
>>>> the checksum tree, and the parity strip in the chunk tree. And if
>>>> parity is wrong, then it should be replaced.
>>>
>>> Except that's horribly inefficient.  With limited exceptions involving
>>> highly situational co-processors, computing a checksum of a parity block is
>>> always going to be faster than computing parity for the stripe.  By using
>>> that to check parity, we can safely speed up the common case of near zero
>>> errors during a scrub by a pretty significant factor.
>>
>> OK I'm in favor of that. Although somehow md gets away with this by
>> computing and checking parity for its scrubs, and still manages to
>> keep drives saturated in the process - at least HDDs, I'm not sure how
>> it fares on SSDs.
>
> A modest desktop CPU can compute raid6 parity at 6GB/sec, a less-modest
> one at more than 10GB/sec.  Maybe a bottleneck is within reach of an
> array of SSDs vs. a slow CPU.
OK, great for people who are using modern desktop or server CPU's.  Not 
everyone has that luxury, and even on many such CPU's, it's _still_ 
faster to compute CRC32c checksums.  On top of that, we don't appear to
be using the in-kernel parity-raid libraries (or if we are, I haven't 
been able to find where we are calling the functions for it), so we 
don't necessarily get assembly optimized or co-processor accelerated 
computation of the parity itself.  The other thing that I didn't mention 
above though, is that computing parity checksums will always take less 
time than computing parity, because you have to process significantly 
less data.  On a 4 disk RAID5 array, you're processing roughly 2/3 as 
much data to do the parity checksums instead of parity itself, which 
means that the parity computation would need to be 200% faster than the 
CRC32c computation to break even, and this margin gets bigger and bigger 
as you add more disks.

On small arrays, this obviously won't have much impact.  Once you start 
to scale past a few TB though, even a few hundred MB/s faster processing 
means a significant decrease in processing time.  Say you have a CPU 
which gets about 12.0GB/s for RAID5 parity, and and about 12.25GB/s for 
CRC32c (~2% is a conservative ratio assuming you use the CRC32c 
instruction and assembly optimized RAID5 parity computations on a modern 
x86_64 processor (the ratio on both the mobile Core i5 in my laptop and 
the Xeon E3 in my home server is closer to 5%)).  Assuming those 
numbers, and that we're already checking checksums on non-parity blocks, 
processing 120TB of data in a 4 disk array (which gives 40TB of parity 
data, so 160TB total) gives:
For computing the parity to scrub:
120TB / 12.25GB =  9795.9 seconds for processing CRC32c csums of all the 
regular data
120TB / 12GB    = 10000 seconds for processing parity of all stripes
                 = 19795.9 seconds total
                 ~ 5.5 hours total

For computing csums of the parity:
120TB / 12.25GB =  9795.9 seconds for processing CRC32c csums of all the 
regular data
40TB / 12.25GB  =  3265.3 seconds for processing CRC32c csums of all the 
parity data
                 = 13061.2 seconds total
                 ~ 3.6 hours total

The checksum based computation is approximately 34% faster than the 
parity computation.  Much of this of course is that you have to process 
the regular data twice for the parity computation method (once for 
csums, once for parity).  You could probably do one pass computing both 
values, but that would need to be done carefully; and, without 
significant optimization, would likely not get you much benefit other 
than cutting the number of loads in half.
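
The same back-of-the-envelope comparison as a small calculation (the
12GB/s and 12.25GB/s figures are the assumed throughputs from above):

DATA_TB, PARITY_TB = 120, 40                 # 4-disk raid5: parity = data / 3
CRC_GBPS, PARITY_GBPS = 12.25, 12.0

csum_data = DATA_TB * 1000 / CRC_GBPS        # csum all regular data (seconds)
scrub_by_parity = csum_data + DATA_TB * 1000 / PARITY_GBPS
scrub_by_csum = csum_data + PARITY_TB * 1000 / CRC_GBPS

print(f"recompute parity: {scrub_by_parity / 3600:.1f} h")              # ~5.5 h
print(f"csum the parity:  {scrub_by_csum / 3600:.1f} h")                # ~3.6 h
print(f"saving: {(1 - scrub_by_csum / scrub_by_parity) * 100:.0f}%")    # ~34%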
Steven Haigh June 28, 2016, 12:14 p.m. UTC | #56
On 28/06/16 22:05, Austin S. Hemmelgarn wrote:
> On 2016-06-27 17:57, Zygo Blaxell wrote:
>> On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:
>>> On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn
>>> <ahferroin7@gmail.com> wrote:
>>>> On 2016-06-25 12:44, Chris Murphy wrote:
>>>>> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
>>>>> <ahferroin7@gmail.com> wrote:
>>>>>
>>>>> OK but hold on. During scrub, it should read data, compute checksums
>>>>> *and* parity, and compare those to what's on-disk - > EXTENT_CSUM in
>>>>> the checksum tree, and the parity strip in the chunk tree. And if
>>>>> parity is wrong, then it should be replaced.
>>>>
>>>> Except that's horribly inefficient.  With limited exceptions involving
>>>> highly situational co-processors, computing a checksum of a parity
>>>> block is
>>>> always going to be faster than computing parity for the stripe.  By
>>>> using
>>>> that to check parity, we can safely speed up the common case of near
>>>> zero
>>>> errors during a scrub by a pretty significant factor.
>>>
>>> OK I'm in favor of that. Although somehow md gets away with this by
>>> computing and checking parity for its scrubs, and still manages to
>>> keep drives saturated in the process - at least HDDs, I'm not sure how
>>> it fares on SSDs.
>>
>> A modest desktop CPU can compute raid6 parity at 6GB/sec, a less-modest
>> one at more than 10GB/sec.  Maybe a bottleneck is within reach of an
>> array of SSDs vs. a slow CPU.
> OK, great for people who are using modern desktop or server CPU's.  Not
> everyone has that luxury, and even on many such CPU's, it's _still_
> faster to computer CRC32c checksums.  On top of that, we don't appear to
> be using the in-kernel parity-raid libraries (or if we are, I haven't
> been able to find where we are calling the functions for it), so we
> don't necessarily get assembly optimized or co-processor accelerated
> computation of the parity itself.  The other thing that I didn't mention
> above though, is that computing parity checksums will always take less
> time than computing parity, because you have to process significantly
> less data.  On a 4 disk RAID5 array, you're processing roughly 2/3 as
> much data to do the parity checksums instead of parity itself, which
> means that the parity computation would need to be 200% faster than the
> CRC32c computation to break even, and this margin gets bigger and bigger
> as you add more disks.
> 
> On small arrays, this obviously won't have much impact.  Once you start
> to scale past a few TB though, even a few hundred MB/s faster processing
> means a significant decrease in processing time.  Say you have a CPU
> which gets about 12.0GB/s for RAID5 parity, and and about 12.25GB/s for
> CRC32c (~2% is a conservative ratio assuming you use the CRC32c
> instruction and assembly optimized RAID5 parity computations on a modern
> x86_64 processor (the ratio on both the mobile Core i5 in my laptop and
> the Xeon E3 in my home server is closer to 5%)).  Assuming those
> numbers, and that we're already checking checksums on non-parity blocks,
> processing 120TB of data in a 4 disk array (which gives 40TB of parity
> data, so 160TB total) gives:
> For computing the parity to scrub:
> 120TB / 12.25GB =  9795.9 seconds for processing CRC32c csums of all the
> regular data
> 120TB / 12GB    = 10000 seconds for processing parity of all stripes
>                 = 19795.9 seconds total
>                 ~ 5.4 hours total
> 
> For computing csums of the parity:
> 120TB / 12.25GB =  9795.9 seconds for processing CRC32c csums of all the
> regular data
> 40TB / 12.25GB  =  3265.3 seconds for processing CRC32c csums of all the
> parity data
>                 = 13061.2 seconds total
>                 ~ 3.6 hours total
> 
> The checksum based computation is approximately 34% faster than the
> parity computation.  Much of this of course is that you have to process
> the regular data twice for the parity computation method (once for
> csums, once for parity).  You could probably do one pass computing both
> values, but that would need to be done carefully; and, without
> significant optimization, would likely not get you much benefit other
> than cutting the number of loads in half.

And it all means jack shit because you don't get the data to disk that
quick. Who cares if it's 500% faster - if it still saturates the
throughput of the actual drives, what difference does it make?

I'm all for actual solutions, but the nirvana fallacy seems to apply here...
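
For reference, the break-even claim quoted above can be checked per stripe. A sketch of the arithmetic, assuming a 4-disk array (3 data blocks plus 1 parity block per stripe) and that the data blocks get checksummed under either scheme:

	checksum-the-parity scrub:   4 blocks hashed            ->  4 / v_csum
	recompute-the-parity scrub:  3 blocks hashed + 3 XORed  ->  3 / v_csum + 3 / v_parity

The two are equal when 3 / v_parity = 1 / v_csum, i.e. v_parity = 3 * v_csum, so parity generation would have to run at three times the CRC32c rate (200% faster) just to break even, and the required ratio grows as disks are added.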
Austin S. Hemmelgarn June 28, 2016, 12:25 p.m. UTC | #57
On 2016-06-28 08:14, Steven Haigh wrote:
> On 28/06/16 22:05, Austin S. Hemmelgarn wrote:
>> On 2016-06-27 17:57, Zygo Blaxell wrote:
>>> On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:
>>>> On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn
>>>> <ahferroin7@gmail.com> wrote:
>>>>> On 2016-06-25 12:44, Chris Murphy wrote:
>>>>>> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
>>>>>> <ahferroin7@gmail.com> wrote:
>>>>>>
>>>>>> OK but hold on. During scrub, it should read data, compute checksums
>>>>>> *and* parity, and compare those to what's on-disk - > EXTENT_CSUM in
>>>>>> the checksum tree, and the parity strip in the chunk tree. And if
>>>>>> parity is wrong, then it should be replaced.
>>>>>
>>>>> Except that's horribly inefficient.  With limited exceptions involving
>>>>> highly situational co-processors, computing a checksum of a parity
>>>>> block is
>>>>> always going to be faster than computing parity for the stripe.  By
>>>>> using
>>>>> that to check parity, we can safely speed up the common case of near
>>>>> zero
>>>>> errors during a scrub by a pretty significant factor.
>>>>
>>>> OK I'm in favor of that. Although somehow md gets away with this by
>>>> computing and checking parity for its scrubs, and still manages to
>>>> keep drives saturated in the process - at least HDDs, I'm not sure how
>>>> it fares on SSDs.
>>>
>>> A modest desktop CPU can compute raid6 parity at 6GB/sec, a less-modest
>>> one at more than 10GB/sec.  Maybe a bottleneck is within reach of an
>>> array of SSDs vs. a slow CPU.
>> OK, great for people who are using modern desktop or server CPU's.  Not
>> everyone has that luxury, and even on many such CPU's, it's _still_
>> faster to compute CRC32c checksums.  On top of that, we don't appear to
>> be using the in-kernel parity-raid libraries (or if we are, I haven't
>> been able to find where we are calling the functions for it), so we
>> don't necessarily get assembly optimized or co-processor accelerated
>> computation of the parity itself.  The other thing that I didn't mention
>> above though, is that computing parity checksums will always take less
>> time than computing parity, because you have to process significantly
>> less data.  On a 4 disk RAID5 array, you're processing roughly 2/3 as
>> much data to do the parity checksums instead of parity itself, which
>> means that the parity computation would need to be 200% faster than the
>> CRC32c computation to break even, and this margin gets bigger and bigger
>> as you add more disks.
>>
>> On small arrays, this obviously won't have much impact.  Once you start
>> to scale past a few TB though, even a few hundred MB/s faster processing
>> means a significant decrease in processing time.  Say you have a CPU
>> which gets about 12.0GB/s for RAID5 parity, and about 12.25GB/s for
>> CRC32c (~2% is a conservative ratio assuming you use the CRC32c
>> instruction and assembly optimized RAID5 parity computations on a modern
>> x86_64 processor (the ratio on both the mobile Core i5 in my laptop and
>> the Xeon E3 in my home server is closer to 5%)).  Assuming those
>> numbers, and that we're already checking checksums on non-parity blocks,
>> processing 120TB of data in a 4 disk array (which gives 40TB of parity
>> data, so 160TB total) gives:
>> For computing the parity to scrub:
>> 120TB / 12.25GB =  9795.9 seconds for processing CRC32c csums of all the
>> regular data
>> 120TB / 12GB    = 10000 seconds for processing parity of all stripes
>>                 = 19795.9 seconds total
>>                 ~ 5.5 hours total
>>
>> For computing csums of the parity:
>> 120TB / 12.25GB =  9795.9 seconds for processing CRC32c csums of all the
>> regular data
>> 40TB / 12.25GB  =  3265.3 seconds for processing CRC32c csums of all the
>> parity data
>>                 = 13061.2 seconds total
>>                 ~ 3.6 hours total
>>
>> The checksum based computation is approximately 34% faster than the
>> parity computation.  Much of this of course is that you have to process
>> the regular data twice for the parity computation method (once for
>> csums, once for parity).  You could probably do one pass computing both
>> values, but that would need to be done carefully; and, without
>> significant optimization, would likely not get you much benefit other
>> than cutting the number of loads in half.
>
> And it all means jack shit because you don't get the data to disk that
> quick. Who cares if it's 500% faster - if it still saturates the
> throughput of the actual drives, what difference does it make?
It has less impact on everything else running on the system at the time
because it uses less CPU time and potentially less memory.  This is the
exact same reason that you want your RAID parity computation performance
to be as good as possible: the less time the CPU spends on that, the more
it can spend on other things.  On top of that, there are high-end systems
that do have SSD's that can sustain multiple GB/s of data transfer, and
NVDIMM's are starting to become popular in the server
market, and those give you data transfer speeds equivalent to regular 
memory bandwidth (which can be well over 20GB/s on decent hardware (I've 
got a relatively inexpensive system using DDR3-1866 RAM that has roughly 
22-24GB/s of memory bandwidth)).  Looking at this another way, the fact 
that the storage device is the bottleneck right now is not a good excuse 
to not worry about making everything else as efficient as possible.
Steven Haigh June 28, 2016, 4:40 p.m. UTC | #58
On 28/06/16 22:25, Austin S. Hemmelgarn wrote:
> On 2016-06-28 08:14, Steven Haigh wrote:
>> On 28/06/16 22:05, Austin S. Hemmelgarn wrote:
>>> On 2016-06-27 17:57, Zygo Blaxell wrote:
>>>> On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:
>>>>> On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn
>>>>> <ahferroin7@gmail.com> wrote:
>>>>>> On 2016-06-25 12:44, Chris Murphy wrote:
>>>>>>> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
>>>>>>> <ahferroin7@gmail.com> wrote:
>>>>>>>
>>>>>>> OK but hold on. During scrub, it should read data, compute checksums
>>>>>>> *and* parity, and compare those to what's on-disk - > EXTENT_CSUM in
>>>>>>> the checksum tree, and the parity strip in the chunk tree. And if
>>>>>>> parity is wrong, then it should be replaced.
>>>>>>
>>>>>> Except that's horribly inefficient.  With limited exceptions
>>>>>> involving
>>>>>> highly situational co-processors, computing a checksum of a parity
>>>>>> block is
>>>>>> always going to be faster than computing parity for the stripe.  By
>>>>>> using
>>>>>> that to check parity, we can safely speed up the common case of near
>>>>>> zero
>>>>>> errors during a scrub by a pretty significant factor.
>>>>>
>>>>> OK I'm in favor of that. Although somehow md gets away with this by
>>>>> computing and checking parity for its scrubs, and still manages to
>>>>> keep drives saturated in the process - at least HDDs, I'm not sure how
>>>>> it fares on SSDs.
>>>>
>>>> A modest desktop CPU can compute raid6 parity at 6GB/sec, a less-modest
>>>> one at more than 10GB/sec.  Maybe a bottleneck is within reach of an
>>>> array of SSDs vs. a slow CPU.
>>> OK, great for people who are using modern desktop or server CPU's.  Not
>>> everyone has that luxury, and even on many such CPU's, it's _still_
>>> faster to compute CRC32c checksums.  On top of that, we don't appear to
>>> be using the in-kernel parity-raid libraries (or if we are, I haven't
>>> been able to find where we are calling the functions for it), so we
>>> don't necessarily get assembly optimized or co-processor accelerated
>>> computation of the parity itself.  The other thing that I didn't mention
>>> above though, is that computing parity checksums will always take less
>>> time than computing parity, because you have to process significantly
>>> less data.  On a 4 disk RAID5 array, you're processing roughly 2/3 as
>>> much data to do the parity checksums instead of parity itself, which
>>> means that the parity computation would need to be 200% faster than the
>>> CRC32c computation to break even, and this margin gets bigger and bigger
>>> as you add more disks.
>>>
>>> On small arrays, this obviously won't have much impact.  Once you start
>>> to scale past a few TB though, even a few hundred MB/s faster processing
>>> means a significant decrease in processing time.  Say you have a CPU
>>> which gets about 12.0GB/s for RAID5 parity, and about 12.25GB/s for
>>> CRC32c (~2% is a conservative ratio assuming you use the CRC32c
>>> instruction and assembly optimized RAID5 parity computations on a modern
>>> x86_64 processor (the ratio on both the mobile Core i5 in my laptop and
>>> the Xeon E3 in my home server is closer to 5%)).  Assuming those
>>> numbers, and that we're already checking checksums on non-parity blocks,
>>> processing 120TB of data in a 4 disk array (which gives 40TB of parity
>>> data, so 160TB total) gives:
>>> For computing the parity to scrub:
>>> 120TB / 12.25GB =  9795.9 seconds for processing CRC32c csums of all the
>>> regular data
>>> 120TB / 12GB    = 10000 seconds for processing parity of all stripes
>>>                 = 19795.9 seconds total
>>>                 ~ 5.5 hours total
>>>
>>> For computing csums of the parity:
>>> 120TB / 12.25GB =  9795.9 seconds for processing CRC32c csums of all the
>>> regular data
>>> 40TB / 12.25GB  =  3265.3 seconds for processing CRC32c csums of all the
>>> parity data
>>>                 = 13061.2 seconds total
>>>                 ~ 3.6 hours total
>>>
>>> The checksum based computation is approximately 34% faster than the
>>> parity computation.  Much of this of course is that you have to process
>>> the regular data twice for the parity computation method (once for
>>> csums, once for parity).  You could probably do one pass computing both
>>> values, but that would need to be done carefully; and, without
>>> significant optimization, would likely not get you much benefit other
>>> than cutting the number of loads in half.
>>
>> And it all means jack shit because you don't get the data to disk that
>> quick. Who cares if it's 500% faster - if it still saturates the
>> throughput of the actual drives, what difference does it make?
> It has less impact on everything else running on the system at the time
> because it uses less CPU time and potentially less memory.  This is the
> exact same reason that you want your RAID parity computation performance
> as good as possible, the less time the CPU spends on that, the more it
> can spend on other things.  On top of that, there are high-end systems
> that do have SSD's that can get multiple GB/s of data transfer per
> second, and NVDIMM's are starting to become popular in the server
> market, and those give you data transfer speeds equivalent to regular
> memory bandwidth (which can be well over 20GB/s on decent hardware (I've
> got a relatively inexpensive system using DDR3-1866 RAM that has roughly
> 22-24GB/s of memory bandwidth)).  Looking at this another way, the fact
> that the storage device is the bottleneck right now is not a good excuse
> to not worry about making everything else as efficient as possible.

If it's purely about performance - then start with multi-threading as a base
- not chopping features to get better performance. I'm not aware of any
modern CPU that comes with a single core these days - so parallel
workloads are much more efficient than a single thread.

Yes, it's a law of diminishing returns - but if you're not doing a full
check of data when one would assume you are, then is that broken by design?

Personally, during a scrub, I would want to know if either the checksum
OR the parity is wrong - as that indicates problems at a much deeper
level. As someone who just lost ~4TB of data due to BTRFS bugs,
protection of data trumps performance in most cases.
Chris Murphy June 28, 2016, 6:01 p.m. UTC | #59
Just wiping the slate clean to summarize:


1. We have a consistent ~1 in 3 maybe 1 in 2, reproducible corruption
of *data extent* parity during a scrub with raid5. Goffredo and I have
both reproduced it. It's a big bug. It might still be useful if
someone else can reproduce it too.

Goffredo, can you file a bug at bugzilla.kernel.org and reference your
bug thread?  I don't know if the key developers know about this, it
might be worth pinging them on IRC once the bug is filed.

Unknown if it affects balance, or raid 6. And if it affects raid 6, is
p or q corrupted, or both? Unknown how this manifests on the metadata
raid5 profile (only data raid5 was tested). Presumably if there is
metadata corruption that's fixed during a scrub, and its parity is
overwritten with corrupt parity, the next time there's a degraded
state, the file system would face plant somehow. And we've seen quite
a few degraded raid5's (and even 6's) face plant in inexplicable ways
and we just kinda go, shit. Which is what the fs is doing when it
encounters a pile of csum errors. It treats the csum errors as a
signal to disregard the fs rather than maybe only being suspicious of
the fs. Could it turn out that these file systems were recoverable,
just that Btrfs wasn't tolerating any csum error and wouldn't proceed
further?

2. The existing scrub code computes parity on-the-fly, compares it
with what's on-disk, and overwrites if there's a mismatch. If there's
a mismatch, there's no message anywhere. It's a feature request to get
a message on parity mismatches. An additional feature request would be
to get a parity_error counter along the lines of the other error
counters we have for scrub stats and dev stats.

3. I think it's a more significant change to get parity checksums
stored some where. Right now the csum tree holds item type EXTENT_CSUM
but parity is not an extent, it's also not data, it's a variant of
data. So it seems to me we'd need a new item type PARITY_CSUM to get
it into the existing csum tree. And I'm not sure what incompatibility
that brings; presumably older kernels could mount such a volume ro
safely, but shouldn't write to it, including btrfs check --repair
should probably fail.


Chris Murphy
Steven Haigh June 28, 2016, 6:17 p.m. UTC | #60
On 29/06/16 04:01, Chris Murphy wrote:
> Just wiping the slate clean to summarize:
> 
> 
> 1. We have a consistent ~1 in 3 maybe 1 in 2, reproducible corruption
> of *data extent* parity during a scrub with raid5. Goffredo and I have
> both reproduced it. It's a big bug. It might still be useful if
> someone else can reproduce it too.
> 
> Goffredo, can you file a bug at bugzilla.kernel.org and reference your
> bug thread?  I don't know if the key developers know about this, it
> might be worth pinging them on IRC once the bug is filed.
> 
> Unknown if it affects balance, or raid 6. And if it affects raid 6, is
> p or q corrupted, or both? Unknown how this manifests on metadata
> raid5 profile (only tested was data raid5). Presumably if there is
> metadata corruption that's fixed during a scrub, and its parity is
> overwritten with corrupt parity, the next time there's a degraded
> state, the file system would face plant somehow. And we've seen quite
> a few degraded raid5's (and even 6's) face plant in inexplicable ways
> and we just kinda go, shit. Which is what the fs is doing when it
> encounters a pile of csum errors. It treats the csum errors as a
> signal to disregard the fs rather than maybe only being suspicious of
> the fs. Could it turn out that these file systems were recoverable,
> just that Btrfs wasn't tolerating any csum error and wouldn't proceed
> further?

I believe this is the same case for RAID6 based on my experiences. I
actually wondered if the system halts were the result of a TON of csum
errors - not the actual result of those errors. Just about every system
hang where it went to 100% CPU usage on all cores and just stopped was
after a flood of csum errors. If it was only one or two (or I copied
data off via a network connection where the read rate was slower), I
found I had a MUCH lower chance of the system locking up.

In fact, now that I think about it, when I was copying data to an
external USB drive (maxed out at ~30MB/sec), I still got csum errors -
but the system never hung.

Every crash ended with the last line along the lines of "Stopped
recurring error. Your system needs rebooting". I wonder whether, if this
error reporting were altered, the system would stay up.

Of course I have no way of testing this.....
Chris Murphy July 5, 2016, 11:05 p.m. UTC | #61
Related:
http://www.spinics.net/lists/raid/msg52880.html

Looks like there is some traction to figuring out what to do about
this, whether it's a udev rule or something that happens in the kernel
itself. Pretty much the only hardware setup unaffected by this are
those with enterprise or NAS drives. Every configuration of a consumer
drive, single, linear/concat, and all software (mdadm, lvm, Btrfs)
RAID Levels are adversely affected by this.

I suspect, but haven't tested, that ZFS On Linux would be equally
affected, unless they're completely reimplementing their own block
layer (?) So there are quite a few parties now negatively impacted by
the current default behavior.


Chris Murphy
Austin S. Hemmelgarn July 6, 2016, 11:51 a.m. UTC | #62
On 2016-07-05 19:05, Chris Murphy wrote:
> Related:
> http://www.spinics.net/lists/raid/msg52880.html
>
> Looks like there is some traction to figuring out what to do about
> this, whether it's a udev rule or something that happens in the kernel
> itself. Pretty much the only hardware setup unaffected by this are
> those with enterprise or NAS drives. Every configuration of a consumer
> drive, single, linear/concat, and all software (mdadm, lvm, Btrfs)
> RAID Levels are adversely affected by this.
The thing I don't get about this is that while the per-device settings 
on a given system are policy, the default value is not, and should be 
expected to work correctly (but not necessarily optimally) on as many 
systems as possible, so any claim that this should be fixed in udev is
bogus by the regular kernel rules.
>
> I suspect, but haven't tested, that ZFS On Linux would be equally
> affected, unless they're completely reimplementing their own block
> layer (?) So there are quite a few parties now negatively impacted by
> the current default behavior.
OTOH, I would not be surprised if the stance there is 'you get no 
support if you're not using enterprise drives', not because of the project
itself, but because it's ZFS.  Part of their minimum recommended 
hardware requirements is ECC RAM, so it wouldn't surprise me if 
enterprise storage devices are there too.

Chris Murphy July 6, 2016, 4:43 p.m. UTC | #63
On Wed, Jul 6, 2016 at 5:51 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-07-05 19:05, Chris Murphy wrote:
>>
>> Related:
>> http://www.spinics.net/lists/raid/msg52880.html
>>
>> Looks like there is some traction to figuring out what to do about
>> this, whether it's a udev rule or something that happens in the kernel
>> itself. Pretty much the only hardware setup unaffected by this are
>> those with enterprise or NAS drives. Every configuration of a consumer
>> drive, single, linear/concat, and all software (mdadm, lvm, Btrfs)
>> RAID Levels are adversely affected by this.
>
> The thing I don't get about this is that while the per-device settings on a
> given system are policy, the default value is not, and should be expected to
> work correctly (but not necessarily optimally) on as many systems as
> possible, so any claim that this should be fixed in udev is bogus by the
> regular kernel rules.

Sure. But changing it in the kernel leads to what other consequences?
It fixes the problem under discussion but what problem will it
introduce? I think it's valid to explore this, at the least so
affected parties can be informed.

Also, the problem isn't instigated by Linux, rather by drive
manufacturers introducing a whole new kind of error recovery, with an
order of magnitude longer recovery time. Now probably most hardware in
the field are such drives. Even SSDs like my Samsung 840 EVO that
support SCT ERC have it disabled, therefore the top end recovery time
is undiscoverable in the device itself. Maybe it's buried in a spec.

So does it make sense to just set the default to 180? Or is there a
smarter way to do this? I don't know.
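
For reference, both knobs under discussion can be inspected from userspace; a minimal sketch, with /dev/sda standing in for whatever device is being checked:

	# SCT ERC (error recovery control) timers, if the drive supports the command
	smartctl -l scterc /dev/sda

	# Kernel-side SCSI command timer for the same device, in seconds (default 30)
	cat /sys/block/sda/device/timeout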


>> I suspect, but haven't tested, that ZFS On Linux would be equally
>> affected, unless they're completely reimplementing their own block
>> layer (?) So there are quite a few parties now negatively impacted by
>> the current default behavior.
>
> OTOH, I would not be surprised if the stance there is 'you get no support if
> you're not using enterprise drives', not because of the project itself, but
> because it's ZFS.  Part of their minimum recommended hardware requirements
> is ECC RAM, so it wouldn't surprise me if enterprise storage devices are
> there too.

http://open-zfs.org/wiki/Hardware
"Consistent performance requires hard drives that support error
recovery control. "

"Drives that lack such functionality can be expected to have
arbitrarily high limits. Several minutes is not impossible. Drives
with this functionality typically default to 7 seconds. ZFS does not
currently adjust this setting on drives. However, it is advisable to
write a script to set the error recovery time to a low value, such as
0.1 seconds until ZFS is modified to control it. This must be done on
every boot. "

They do not explicitly require enterprise drives, but they clearly
expect SCT ERC enabled to some sane value.

At least for Btrfs and ZFS, the mkfs is in a position to know all
parameters for properly setting SCT ERC and the SCSI command timer for
every device. Maybe it could create the udev rule? Single and raid0
profiles need to permit long recoveries; where raid1, 5, 6 need to set
things for very short recoveries.

Possibly mdadm and lvm tools do the same thing.
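
To make the idea concrete, a rule of the kind being proposed might look roughly like this; a sketch only, where the file name is made up and a real tool would have to pick the value per redundancy profile as described above:

	# /etc/udev/rules.d/60-block-timeout.rules (hypothetical)
	# Raise the kernel command timer on SATA disks so an in-drive recovery
	# that takes minutes does not trigger a link reset mid-recovery.
	ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTR{device/timeout}="180"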
Austin S. Hemmelgarn July 6, 2016, 5:18 p.m. UTC | #64
On 2016-07-06 12:43, Chris Murphy wrote:
> On Wed, Jul 6, 2016 at 5:51 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> On 2016-07-05 19:05, Chris Murphy wrote:
>>>
>>> Related:
>>> http://www.spinics.net/lists/raid/msg52880.html
>>>
>>> Looks like there is some traction to figuring out what to do about
>>> this, whether it's a udev rule or something that happens in the kernel
>>> itself. Pretty much the only hardware setup unaffected by this are
>>> those with enterprise or NAS drives. Every configuration of a consumer
>>> drive, single, linear/concat, and all software (mdadm, lvm, Btrfs)
>>> RAID Levels are adversely affected by this.
>>
>> The thing I don't get about this is that while the per-device settings on a
>> given system are policy, the default value is not, and should be expected to
>> work correctly (but not necessarily optimally) on as many systems as
>> possible, so any claim that this should be fixed in udev is bogus by the
>> regular kernel rules.
>
> Sure. But changing it in the kernel leads to what other consequences?
> It fixes the problem under discussion but what problem will it
> introduce? I think it's valid to explore this, at the least so
> affected parties can be informed.
>
> Also, the problem isn't instigated by Linux, rather by drive
> manufacturers introducing a whole new kind of error recovery, with an
> order of magnitude longer recovery time. Now probably most hardware in
> the field are such drives. Even SSDs like my Samsung 840 EVO that
> support SCT ERC have it disabled, therefore the top end recovery time
> is undiscoverable in the device itself. Maybe it's buried in a spec.
>
> So does it make sense to just set the default to 180? Or is there a
> smarter way to do this? I don't know.
Just thinking about this:
1. People who are setting this somewhere will be functionally unaffected.
2. People using single disks which have lots of errors may or may not 
see an apparent degradation of performance, but will likely have the 
life expectancy of their device extended.
3. Individuals who are not setting this but should be will on average be 
no worse off than before other than seeing a bigger performance hit on a 
disk error.
4. People with single disks which are new will see no functional change 
until the disk has an error.

In an ideal situation, what I'd want to see is:
1. If the device supports SCT ERC, set scsi_command_timer to a reasonable
percentage over that (probably something like 25%, which would give 
roughly 10 seconds for the normal 7 second ERC timer).
2. If the device is actually a SCSI device, keep the 30 second timer 
(IIRC, this is reasonable for SCSI disks).
3. Otherwise, set the timer to 200 (we need a slight buffer over the 
expected disk timeout to account for things like latency outside of the 
disk).
>
>
>>> I suspect, but haven't tested, that ZFS On Linux would be equally
>>> affected, unless they're completely reimplementing their own block
>>> layer (?) So there are quite a few parties now negatively impacted by
>>> the current default behavior.
>>
>> OTOH, I would not be surprised if the stance there is 'you get no support if
>> you're not using enterprise drives', not because of the project itself, but
>> because it's ZFS.  Part of their minimum recommended hardware requirements
>> is ECC RAM, so it wouldn't surprise me if enterprise storage devices are
>> there too.
>
> http://open-zfs.org/wiki/Hardware
> "Consistent performance requires hard drives that support error
> recovery control. "
>
> "Drives that lack such functionality can be expected to have
> arbitrarily high limits. Several minutes is not impossible. Drives
> with this functionality typically default to 7 seconds. ZFS does not
> currently adjust this setting on drives. However, it is advisable to
> write a script to set the error recovery time to a low value, such as
> 0.1 seconds until ZFS is modified to control it. This must be done on
> every boot. "
>
> They do not explicitly require enterprise drives, but they clearly
> expect SCT ERC enabled to some sane value.
>
> At least for Btrfs and ZFS, the mkfs is in a position to know all
> parameters for properly setting SCT ERC and the SCSI command timer for
> every device. Maybe it could create the udev rule? Single and raid0
> profiles need to permit long recoveries; where raid1, 5, 6 need to set
> things for very short recoveries.
>
> Possibly mdadm and lvm tools do the same thing.
I'm pretty certain they don't create rules, or even try to check the
drive for SCT ERC support.  The problem with doing this is that you 
can't be certain that your underlying device is actually a physical 
storage device or not, and thus you have to check more than just the SCT 
ERC commands, and many people (myself included) don't like tools doing 
things that modify the persistent functioning of their system that the 
tool itself is not intended to do (and messing with block layer settings 
falls into that category for a mkfs tool).

Chris Murphy July 6, 2016, 6:45 p.m. UTC | #65
On Wed, Jul 6, 2016 at 11:18 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-07-06 12:43, Chris Murphy wrote:

>> So does it make sense to just set the default to 180? Or is there a
>> smarter way to do this? I don't know.
>
> Just thinking about this:
> 1. People who are setting this somewhere will be functionally unaffected.

I think it's statistically 0 people changing this from default. It's
people with drives that have no SCT ERC support, used in raid1+, who
happen to stumble upon this very obscure workaround to avoid link
resets in the face of media defects. Rare.


> 2. People using single disks which have lots of errors may or may not see an
> apparent degradation of performance, but will likely have the life
> expectancy of their device extended.

Well they have link resets and their file system presumably face
plants as a result of a pile of commands in the queue returning as
unsuccessful. So they have premature death of their system, rather
than it getting sluggish. This is a long standing indicator on Windows
to just reinstall the OS and restore data from backups -> the user has
an opportunity to freshen up user data backup, and the reinstallation
and restore from backup results in freshly written sectors which is
how bad sectors get fixed. The marginally bad sectors get new writes
and now read fast (or fast enough), and the persistently bad sectors
result in the drive firmware remapping to reserve sectors.

The main thing in my opinion is less extension of drive life, as it is
the user gets to use the system, albeit sluggish, to make a backup of
their data rather than possibly losing it.


> 3. Individuals who are not setting this but should be will on average be no
> worse off than before other than seeing a bigger performance hit on a disk
> error.
> 4. People with single disks which are new will see no functional change
> until the disk has an error.

I follow.


>
> In an ideal situation, what I'd want to see is:
> 1. If the device supports SCT ERC, set scsi_command_timer to a reasonable
> percentage over that (probably something like 25%, which would give roughly
> 10 seconds for the normal 7 second ERC timer).
> 2. If the device is actually a SCSI device, keep the 30 second timer (IIRC,
> this is reasonable for SCSI disks).
> 3. Otherwise, set the timer to 200 (we need a slight buffer over the
> expected disk timeout to account for things like latency outside of the
> disk).

Well if it's a non-redundant configuration, you'd want those long
recoveries permitted, rather than enable SCT ERC. The drive has the
ability to relocate sector data on a marginal (slow) read that's still
successful. But clearly many manufacturers tolerate slow reads that
don't result in immediate reallocation or overwrite or we wouldn't be
in this situation in the first place. I think this auto reallocation
is thwarted by enabling SCT ERC. It just flat out gives up and reports
a read error. So it is still data loss in the non-redundant
configuration and thus not an improvement.

Basically it's:

For SATA and USB drives:

if data redundant, then enable short SCT ERC time if supported, if not
supported then extend SCSI command timer to 200;

if data not redundant, then disable SCT ERC if supported, and extend
SCSI command timer to 200.

For SCSI (SAS most likely these days), keep things the same as now.
But that's only because this is a rare enough configuration now I
don't know if we really know the problems there. It may be that their
error recovery in 7 seconds is massively better and more reliable than
consumer drives over 180 seconds.
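
Spelled out as commands, the policy above might look something like this; a sketch only, with /dev/sdb and /dev/sdc as placeholders and 180 picked to cover the worst-case recovery times mentioned earlier:

	# Device backing a redundant profile (raid1/5/6): fail fast in the drive,
	# let the RAID layer reconstruct the sector.  70 = 7.0 seconds.
	smartctl -l scterc,70,70 /dev/sdb

	# Device backing a non-redundant profile (single/raid0): let the drive
	# retry as long as it likes, and give the kernel an even longer timeout.
	smartctl -l scterc,0,0 /dev/sdc
	echo 180 > /sys/block/sdc/device/timeout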




>
>>
>>
>>>> I suspect, but haven't tested, that ZFS On Linux would be equally
>>>> affected, unless they're completely reimplementing their own block
>>>> layer (?) So there are quite a few parties now negatively impacted by
>>>> the current default behavior.
>>>
>>>
>>> OTOH, I would not be surprised if the stance there is 'you get no support
>>> if
>>> you're not using enterprise drives', not because of the project itself, but
>>> because it's ZFS.  Part of their minimum recommended hardware
>>> requirements
>>> is ECC RAM, so it wouldn't surprise me if enterprise storage devices are
>>> there too.
>>
>>
>> http://open-zfs.org/wiki/Hardware
>> "Consistent performance requires hard drives that support error
>> recovery control. "
>>
>> "Drives that lack such functionality can be expected to have
>> arbitrarily high limits. Several minutes is not impossible. Drives
>> with this functionality typically default to 7 seconds. ZFS does not
>> currently adjust this setting on drives. However, it is advisable to
>> write a script to set the error recovery time to a low value, such as
>> 0.1 seconds until ZFS is modified to control it. This must be done on
>> every boot. "
>>
>> They do not explicitly require enterprise drives, but they clearly
>> expect SCT ERC enabled to some sane value.
>>
>> At least for Btrfs and ZFS, the mkfs is in a position to know all
>> parameters for properly setting SCT ERC and the SCSI command timer for
>> every device. Maybe it could create the udev rule? Single and raid0
>> profiles need to permit long recoveries; where raid1, 5, 6 need to set
>> things for very short recoveries.
>>
>> Possibly mdadm and lvm tools do the same thing.
>
> I'm pretty certain they don't create rules, or even try to check the drive
> for SCT ERC support.

They don't. That's a suggested change in behavior. Sorry "should do
the same thing" instead of "do the same thing".


> The problem with doing this is that you can't be
> certain that your underlying device is actually a physical storage device or
> not, and thus you have to check more than just the SCT ERC commands, and
> many people (myself included) don't like tools doing things that modify the
> persistent functioning of their system that the tool itself is not intended
> to do (and messing with block layer settings falls into that category for a
> mkfs tool).

Yep it's imperfect unless there's the proper cross communication
between layers. There are some such things like hardware raid geometry
that optionally poke through (when supported by hardware raid drivers)
so that things like mkfs.xfs can automatically provide the right sunit
and swidth for optimized layout, which the device mapper already does
automatically. So it could be done; it's just a matter of how big a
problem it is to build that, vs just going with a new one size fits
all default command timer?

If it were always 200 instead of 30, the consequence is if there's a
link problem that is not related to media errors. But what the hell
takes that long to report an explicit error? Even cable problems
generate UDMA errors pretty much instantly.
Austin S. Hemmelgarn July 6, 2016, 7:15 p.m. UTC | #66
On 2016-07-06 14:45, Chris Murphy wrote:
> On Wed, Jul 6, 2016 at 11:18 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> On 2016-07-06 12:43, Chris Murphy wrote:
>
>>> So does it make sense to just set the default to 180? Or is there a
>>> smarter way to do this? I don't know.
>>
>> Just thinking about this:
>> 1. People who are setting this somewhere will be functionally unaffected.
>
> I think it's statistically 0 people changing this from default. It's
> people with drives that have no SCT ERC support, used in raid1+, who
> happen to stumble upon this very obscure workaround to avoid link
> resets in the face of media defects. Rare.
Not as much as you think, once someone has this issue, they usually put 
preventative measures in place on any system where it applies.  I'd be 
willing to bet that most sysadmins at big companies like RedHat or 
Oracle are setting this.
>
>
>> 2. People using single disks which have lots of errors may or may not see an
>> apparent degradation of performance, but will likely have the life
>> expectancy of their device extended.
>
> Well they have link resets and their file system presumably face
> plants as a result of a pile of commands in the queue returning as
> unsuccessful. So they have premature death of their system, rather
> than it getting sluggish. This is a long standing indicator on Windows
> to just reinstall the OS and restore data from backups -> the user has
> an opportunity to freshen up user data backup, and the reinstallation
> and restore from backup results in freshly written sectors which is
> how bad sectors get fixed. The marginally bad sectors get new writes
> and now read fast (or fast enough), and the persistently bad sectors
> result in the drive firmware remapping to reserve sectors.
>
> The main thing in my opinion is less extension of drive life, as it is
> the user gets to use the system, albeit sluggish, to make a backup of
> their data rather than possibly losing it.
The extension of the drive's lifetime is a nice benefit, but not what my 
point was here.  For people in this particular case, it will almost 
certainly only make things better (although at first it may make 
performance worse).
>
>
>> 3. Individuals who are not setting this but should be will on average be no
>> worse off than before other than seeing a bigger performance hit on a disk
>> error.
>> 4. People with single disks which are new will see no functional change
>> until the disk has an error.
>
> I follow.
>
>
>>
>> In an ideal situation, what I'd want to see is:
>> 1. If the device supports SCT ERC, set scsi_command_timer to a reasonable
>> percentage over that (probably something like 25%, which would give roughly
>> 10 seconds for the normal 7 second ERC timer).
>> 2. If the device is actually a SCSI device, keep the 30 second timer (IIRC,
>> this is reasonable for SCSI disks).
>> 3. Otherwise, set the timer to 200 (we need a slight buffer over the
>> expected disk timeout to account for things like latency outside of the
>> disk).
>
> Well if it's a non-redundant configuration, you'd want those long
> recoveries permitted, rather than enable SCT ERC. The drive has the
> ability to relocate sector data on a marginal (slow) read that's still
> successful. But clearly many manufacturers tolerate slow reads that
> don't result in immediate reallocation or overwrite or we wouldn't be
> in this situation in the first place. I think this auto reallocation
> is thwarted by enabling SCT ERC. It just flat out gives up and reports
> a read error. So it is still data loss in the non-redundant
> configuration and thus not an improvement.
I agree, but if it's only the kernel doing this, then we can't make 
judgements based on userspace usage.  Also, the first situation while 
not optimal is still better than what happens now, at least there you 
will get an I/O error in a reasonable amount of time (as opposed to 
after a really long time if ever).
>
> Basically it's:
>
> For SATA and USB drives:
>
> if data redundant, then enable short SCT ERC time if supported, if not
> supported then extend SCSI command timer to 200;
>
> if data not redundant, then disable SCT ERC if supported, and extend
> SCSI command timer to 200.
>
> For SCSI (SAS most likely these days), keep things the same as now.
> But that's only because this is a rare enough configuration now I
> don't know if we really know the problems there. It may be that their
> error recovery in 7 seconds is massively better and more reliable than
> consumer drives over 180 seconds.
I don't see why you would think this is not common.  If you count just 
by systems, then it's absolutely outnumbered at least 100 to 1 by 
regular ATA disks.  If you look at individual disks though, the reverse 
is true, because people who use SCSI drives tend to use _lots_ of disks 
(think big data centers, NAS and SAN systems and such).  OTOH, both are 
probably vastly outnumbered by stuff that doesn't use either standard 
for storage...

Separately, USB gets _really_ complicated if you want to cover
everything: USB drives may or may not present as non-rotational; may or
may not show up as SATA or SCSI bridges (there are some of the more
expensive flash drives that actually use SSD controllers plus USB-SAT
chips internally); and if they do show up as such, may or may not support
the required commands (most don't, but it's seemingly hit or miss which do).
>
>
>>
>>>
>>>
>>>>> I suspect, but haven't tested, that ZFS On Linux would be equally
>>>>> affected, unless they're completely reimplementing their own block
>>>>> layer (?) So there are quite a few parties now negatively impacted by
>>>>> the current default behavior.
>>>>
>>>>
>>>> OTOH, I would not be surprised if the stance there is 'you get no support
>>>> if
>>>> you're not using enterprise drives', not because of the project itself, but
>>>> because it's ZFS.  Part of their minimum recommended hardware
>>>> requirements
>>>> is ECC RAM, so it wouldn't surprise me if enterprise storage devices are
>>>> there too.
>>>
>>>
>>> http://open-zfs.org/wiki/Hardware
>>> "Consistent performance requires hard drives that support error
>>> recovery control. "
>>>
>>> "Drives that lack such functionality can be expected to have
>>> arbitrarily high limits. Several minutes is not impossible. Drives
>>> with this functionality typically default to 7 seconds. ZFS does not
>>> currently adjust this setting on drives. However, it is advisable to
>>> write a script to set the error recovery time to a low value, such as
>>> 0.1 seconds until ZFS is modified to control it. This must be done on
>>> every boot. "
>>>
>>> They do not explicitly require enterprise drives, but they clearly
>>> expect SCT ERC enabled to some sane value.
>>>
>>> At least for Btrfs and ZFS, the mkfs is in a position to know all
>>> parameters for properly setting SCT ERC and the SCSI command timer for
>>> every device. Maybe it could create the udev rule? Single and raid0
>>> profiles need to permit long recoveries; where raid1, 5, 6 need to set
>>> things for very short recoveries.
>>>
>>> Possibly mdadm and lvm tools do the same thing.
>>
>> I"m pretty certain they don't create rules, or even try to check the drive
>> for SCT ERC support.
>
> They don't. That's a suggested change in behavior. Sorry "should do
> the same thing" instead of "do the same thing".
>
>
>> The problem with doing this is that you can't be
>> certain that your underlying device is actually a physical storage device or
>> not, and thus you have to check more than just the SCT ERC commands, and
>> many people (myself included) don't like tools doing things that modify the
>> persistent functioning of their system that the tool itself is not intended
>> to do (and messing with block layer settings falls into that category for a
>> mkfs tool).
>
> Yep it's imperfect unless there's the proper cross communication
> between layers. There are some such things like hardware raid geometry
> that optionally poke through (when supported by hardware raid drivers)
> so that things like mkfs.xfs can automatically provide the right sunit
> swidth for optimized layout; which the device mapper already does
> automatically. So it could be done it's just a matter of how big of a
> problem is this to build it, vs just going with a new one size fits
> all default command timer?
The other problem though is that the existing things pass through 
_read-only_ data, while this requires writable data to be passed 
through, which leads to all kinds of complicated issues potentially.
>
> If it were always 200 instead of 30, the consequence is if there's a
> link problem that is not related to media errors. But what the hell
> takes that long to report an explicit error? Even cable problems
> generate UDMA errors pretty much instantly.
And that is more why I'd suggest changing the kernel default first 
before trying to use special heuristics or anything like that.  The 
caveat is that it would need to be for ATA disks only to not break SCSI 
(which works fine right now) and USB (which has it's own unique issues).

Chris Murphy July 6, 2016, 9:01 p.m. UTC | #67
On Wed, Jul 6, 2016 at 1:15 PM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-07-06 14:45, Chris Murphy wrote:

>> I think it's statistically 0 people changing this from default. It's
>> people with drives that have no SCT ERC support, used in raid1+, who
>> happen to stumble upon this very obscure workaround to avoid link
>> resets in the face of media defects. Rare.
>
> Not as much as you think, once someone has this issue, they usually put
> preventative measures in place on any system where it applies.  I'd be
> willing to bet that most sysadmins at big companies like RedHat or Oracle
> are setting this.

SCT ERC yes. Changing the kernel's command timer? I think almost zero.



>> Well they have link resets and their file system presumably face
>> plants as a result of a pile of commands in the queue returning as
>> unsuccessful. So they have premature death of their system, rather
>> than it getting sluggish. This is a long standing indicator on Windows
>> to just reinstall the OS and restore data from backups -> the user has
>> an opportunity to freshen up user data backup, and the reinstallation
>> and restore from backup results in freshly written sectors which is
>> how bad sectors get fixed. The marginally bad sectors get new writes
>> and now read fast (or fast enough), and the persistently bad sectors
>> result in the drive firmware remapping to reserve sectors.
>>
>> The main thing in my opinion is less extension of drive life, as it is
>> the user gets to use the system, albeit sluggish, to make a backup of
>> their data rather than possibly losing it.
>
> The extension of the drive's lifetime is a nice benefit, but not what my
> point was here.  For people in this particular case, it will almost
> certainly only make things better (although at first it may make performance
> worse).

I'm not sure why it makes performance worse. The options are, slower
reads vs a file system that almost certainly face plants upon a link
reset.




>> Basically it's:
>>
>> For SATA and USB drives:
>>
>> if data redundant, then enable short SCT ERC time if supported, if not
>> supported then extend SCSI command timer to 200;
>>
>> if data not redundant, then disable SCT ERC if supported, and extend
>> SCSI command timer to 200.
>>
>> For SCSI (SAS most likely these days), keep things the same as now.
>> But that's only because this is a rare enough configuration now I
>> don't know if we really know the problems there. It may be that their
>> error recovery in 7 seconds is massively better and more reliable than
>> consumer drives over 180 seconds.
>
> I don't see why you would think this is not common.

I was not clear. Single device SAS is probably not common. They're
typically being used in arrays where data is redundant. Using such a
drive with short error recovery as a single boot drive? Probably not
that common.



> Separately, USB gets _really_ complicated if you want to cover everything,
> USB drives may or may not present as non-rotational, may or may not show up
> as SATA or SCSI bridges (there are some of the more expensive flash drives
> that actually use SSD controllers plus USB-SAT chips internally), if they do
> show up as such, may or may not support the required commands (most don't,
> but it's seemingly hit or miss which do).

Yup. Well, do what we can instead of just ignoring the problem? They
can still be polled for features including SCT ERC and if it's not
supported or configurable then fall back to increasing the command
timer. I'm not sure what else can be done anyway.

The main obstacle is squaring the device capability (low level) with
storage stack redundancy 0 or 1 (high level). Something has to be
aware of both to ideally get all devices ideally configured.



>> Yep it's imperfect unless there's the proper cross communication
>> between layers. There are some such things like hardware raid geometry
>> that optionally poke through (when supported by hardware raid drivers)
>> so that things like mkfs.xfs can automatically provide the right sunit
>> swidth for optimized layout; which the device mapper already does
>> automatically. So it could be done it's just a matter of how big of a
>> problem is this to build it, vs just going with a new one size fits
>> all default command timer?
>
> The other problem though is that the existing things pass through
> _read-only_ data, while this requires writable data to be passed through,
> which leads to all kinds of complicated issues potentially.

I'm aware. There are also plenty of bugs even if write were to pass
through. I've encountered more drives than not which accept only one
SCT ERC change per poweron. A 2nd change causes the drive to offline
and vanish off the bus. So no doubt this whole area is fragile enough
that not even the drive, controller, and enclosure vendors are aware of
where all the bodies are buried.

What I think is fairly well established is that at least on Windows
their lower level stuff including kernel tolerates these very high
recovery times. The OS just gets irritatingly slow but doesn't flip
out. Linux is flipping out. And it's not Linux's direct fault, that's
drive manufacturers, but Linux needs to adapt.



>>
>>
>> If it were always 200 instead of 30, the consequence is if there's a
>> link problem that is not related to media errors. But what the hell
>> takes that long to report an explicit error? Even cable problems
>> generate UDMA errors pretty much instantly.
>
> And that is more why I'd suggest changing the kernel default first before
> trying to use special heuristics or anything like that.  The caveat is that
> it would need to be for ATA disks only to not break SCSI (which works fine
> right now) and USB (which has it's own unique issues).


I think you're probably right. Simpler is better.

Thing is, there will be consequences. In the software raid case where
a drive hangs on a media defect, right now this means a link reset at
30 seconds, which results in md reconstructing data and it goes where
needed by pretty much 31 seconds after requested. If that changes to
180 seconds, there will no doubt be some use cases that will be, WTF
just happened? This used to always recover in 30 seconds at the
longest, and now it's causing the network stack to implode while
waiting.

So all kinds of other timeouts might get impacted.

I wonder if it makes sense to change the default SCSI command timer on
a distribution and see what happens - if e.g. Fedora or opensuse would
volunteer to make the change for Rawhide or Tumbleweed. *shrug*
Statistically, the users of those rolling releases may not have
a drive with media defects and a delay-intolerant workload for maybe
years...
diff mbox

Patch

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 57fccc4..1ed1e43 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2296,7 +2296,10 @@  int btrfs_check_repairable(struct inode *inode, struct bio *failed_bio,
 		 * if the following BUG_ON triggers, our validation request got
 		 * merged. we need separate requests for our algorithm to work.
 		 */
-		BUG_ON(failrec->in_validation);
+		if (failrec->in_validation) {
+			printk(KERN_ERR "failrec->in_validation at %s:%d\n", __FUNCTION__, __LINE__);
+			return 0;
+		}
 		failrec->in_validation = 1;
 		failrec->this_mirror = failed_mirror;
 	} else {
@@ -2306,7 +2309,10 @@  int btrfs_check_repairable(struct inode *inode, struct bio *failed_bio,
 		 * everything for repair_io_failure to do the rest for us.
 		 */
 		if (failrec->in_validation) {
-			BUG_ON(failrec->this_mirror != failed_mirror);
+			if (failrec->this_mirror != failed_mirror) {
+				printk(KERN_ERR "failrec->this_mirror (%d) != failed_mirror (%d) at %s:%d\n", failrec->this_mirror, failed_mirror, __FUNCTION__, __LINE__);
+				return 0;
+			}
 			failrec->in_validation = 0;
 			failrec->this_mirror = 0;
 		}

'failrec->this_mirror' often has values like 0, 1, or 2, but sometimes
has values like -445886021 or -823858068.  More uninitialized memory?
This worries me.

The other BUG_ON lands here:

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c1fe22c..71afeb3 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2018,10 +2018,10 @@  int repair_io_failure(struct inode *inode, u64 start, u64 length, u64 logical,
 	int ret;
 
 	ASSERT(!(fs_info->sb->s_flags & MS_RDONLY));
-	BUG_ON(!mirror_num);
+	WARN_ON(!mirror_num);
 
 	/* we can't repair anything in raid56 yet */
-	if (btrfs_is_parity_mirror(map_tree, logical, length, mirror_num))
+	if (!mirror_num || btrfs_is_parity_mirror(map_tree, logical, length, mirror_num))
 		return 0;
 
 	bio = btrfs_io_bio_alloc(GFP_NOFS, 1);