Message ID | 20160620034427.GK15597@hungrycats.org (mailing list archive) |
---|---|
State | New, archived |
On Sun, 19 Jun 2016 23:44:27 -0400 Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote:
> It's not going well so far. Pay attention, there are at least four
> separate problems in here and we're not even half done yet.
>
> I'm currently using kernel 4.6.2 with btrfs fixes forward-ported from
> 4.5.7, because 4.5.7 has a number of fixes that 4.6.2 doesn't. I have
> also pulled in some patches from the 4.7-rc series.
>
> This fixed a few problems I encountered early on, and I'm still making
> forward progress, but I've only replaced 50% of the failed disk so far,
> and this is week four of this particular project.

From a practical standpoint, [aside from not using Btrfs RAID5], you'd be better off shutting down the system, booting a rescue OS, copying the content of the failing disk to the replacement one using 'ddrescue', then removing the bad disk; after boot-up your main system wouldn't notice anything had ever happened, aside from a few recoverable CRC errors in the "holes" where ddrescue failed to copy.

But in general it's commendable that you're experimenting with doing things "the native way", as this provides feedback to the developers and could help make the RAID implementation better. I guess that's the whole point of the exercise and the report, and I hope this ends up being useful for everyone.
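For reference, the ddrescue workflow Roman describes might look something like this (a sketch only; /dev/sdb as the failing disk and /dev/sdc as the replacement are hypothetical names):

# From a rescue OS, with the filesystem unmounted:
# Pass 1: copy everything that reads cleanly, skipping bad areas quickly
ddrescue -f -n /dev/sdb /dev/sdc rescue.map
# Pass 2: retry the bad areas a few more times
ddrescue -f -r3 /dev/sdb /dev/sdc rescue.map
# Sectors that never read back remain "holes" on the new disk; a later
# btrfs scrub can rebuild those blocks from parity or mirror copies.

The map file records progress, so the second pass resumes where the first left off.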
On Mon, Jun 20, 2016 at 11:13:51PM +0500, Roman Mamedov wrote: > On Sun, 19 Jun 2016 23:44:27 -0400 > Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote: > From a practical standpoint, [aside from not using Btrfs RAID5], you'd be > better off shutting down the system, booting a rescue OS, copying the content > of the failing disk to the replacement one using 'ddrescue', then removing the > bad disk, and after boot up your main system wouldn't notice anything has ever > happened, aside from a few recoverable CRC errors in the "holes" on the areas > which ddrescue failed to copy. I'm aware of ddrescue and myrescue, but in this case the disk has failed, past tense. At this point the remaining choices are to make btrfs native raid5 recovery work, or to restore from backups. > But in general it's commendable that you're experimenting with doing things > "the native way", as this is provides feedback to the developers and could help > make the RAID implementation better. I guess that's the whole point of the > exercise and the report, and hope this ends up being useful for everyone. The intent was both to provide a cautionary tale for anyone considering deploying a btrfs raid5 system today, and to possibly engage some developers to help solve the problems. The underlying causes seem to be somewhat removed from where the symptoms are appearing, and at the moment I don't understand this code well enough to know where to look for them. Any assistance would be greatly appreciated. > -- > With respect, > Roman
On Mon, Jun 20, 2016 at 1:11 PM, Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote: > On Mon, Jun 20, 2016 at 11:13:51PM +0500, Roman Mamedov wrote: >> On Sun, 19 Jun 2016 23:44:27 -0400 >> Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote: >> From a practical standpoint, [aside from not using Btrfs RAID5], you'd be >> better off shutting down the system, booting a rescue OS, copying the content >> of the failing disk to the replacement one using 'ddrescue', then removing the >> bad disk, and after boot up your main system wouldn't notice anything has ever >> happened, aside from a few recoverable CRC errors in the "holes" on the areas >> which ddrescue failed to copy. > > I'm aware of ddrescue and myrescue, but in this case the disk has failed, > past tense. At this point the remaining choices are to make btrfs native > raid5 recovery work, or to restore from backups. Seems difficult at best due to this: >>The normal 'device delete' operation got about 25% of the way in, then got stuck on some corrupted sectors and aborting with EIO. In effect it's like a 2 disk failure for a raid5 (or it's intermittently a 2 disk failure but always at least a 1 disk failure). That's not something md raid recovers from. Even manual recovery in such a case is far from certain. Perhaps Roman's advice is also a question about the cause of this corruption? I'm wondering this myself. That's the real problem here as I see it. Losing a drive is ordinary. Additional corruptions happening afterward is not. And are those corrupt sectors hardware corruptions, or Btrfs corruptions at the time the data was written to disk, or Btrfs being confused as it's reading the data from disk? For me the critical question is what does "some corrupted sectors" mean?
On Mon, Jun 20, 2016 at 01:30:11PM -0600, Chris Murphy wrote:
> On Mon, Jun 20, 2016 at 1:11 PM, Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> > On Mon, Jun 20, 2016 at 11:13:51PM +0500, Roman Mamedov wrote:
> >> On Sun, 19 Jun 2016 23:44:27 -0400
> Seems difficult at best due to this:
> >> The normal 'device delete' operation got about 25% of the way in,
> then got stuck on some corrupted sectors and aborting with EIO.
>
> In effect it's like a 2 disk failure for a raid5 (or it's
> intermittently a 2 disk failure but always at least a 1 disk failure).
> That's not something md raid recovers from. Even manual recovery in
> such a case is far from certain.
>
> Perhaps Roman's advice is also a question about the cause of this
> corruption? I'm wondering this myself. That's the real problem here as
> I see it. Losing a drive is ordinary. Additional corruptions happening
> afterward is not. And are those corrupt sectors hardware corruptions,
> or Btrfs corruptions at the time the data was written to disk, or
> Btrfs being confused as it's reading the data from disk?
> For me the critical question is what does "some corrupted sectors" mean?

On other raid5 arrays, I would observe a small amount of corruption every time there was a system crash (some of which were triggered by disk failures, some not). It looked like any writes in progress at the time of the failure would be damaged. In the past I would just mop up the corrupt files (they were always the last extents written, easy to find with find-new or scrub) and have no further problems. In the earlier cases there were no new instances of corruption after the initial failure event and manual cleanup.

Now that I dig a little deeper into this, I do see one fairly significant piece of data:

root@host:~# btrfs dev stat /data | grep -v ' 0$'
[/dev/vdc].corruption_errs 16774
[/dev/vde].write_io_errs 121
[/dev/vde].read_io_errs 4
[devid:8].read_io_errs 16

Prior to the failure of devid:8, vde had 121 write errors and 4 read errors (these counter values are months old and the errors were long since repaired by scrub). The 16774 corruption errors on vdc are all new since the devid:8 failure, though.

> --
> Chris Murphy
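As a concrete illustration of the find-new approach mentioned above (a sketch; the mount point and generation numbers are hypothetical):

# Print just the current generation by asking for changes newer than an
# impossibly high generation ("transid marker was <gen>"):
btrfs subvolume find-new /data 999999999
# After an incident, list files written since a known-good generation;
# the filename is the last field of each output line:
btrfs subvolume find-new /data 1432814 | awk '{print $NF}' | sort -u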
On Mon, Jun 20, 2016 at 2:40 PM, Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote: > On Mon, Jun 20, 2016 at 01:30:11PM -0600, Chris Murphy wrote: >> For me the critical question is what does "some corrupted sectors" mean? > > On other raid5 arrays, I would observe a small amount of corruption every > time there was a system crash (some of which were triggered by disk > failures, some not). What test are you using to determine there is corruption, and how much data is corrupted? Is this on every disk? Non-deterministically fewer than all disks? Have you identified this as a torn write or misdirected write or is it just garbage at some sectors? And what's the size? Partial sector? Partial md chunk (or fs block?) > It looked like any writes in progress at the time > of the failure would be damaged. In the past I would just mop up the > corrupt files (they were always the last extents written, easy to find > with find-new or scrub) and have no further problems. This is on Btrfs? This isn't supposed to be possible. Even a literal overwrite of a file is not an overwrite on Btrfs unless the file is nodatacow. Data extents get written, then the metadata is updated to point to those new blocks. There should be flush or fua requests to make sure the order is such that the fs points to either the old or new file, in either case uncorrupted. That's why I'm curious about the nature of this corruption. It sounds like your hardware is not exactly honoring flush requests. With md raid and any other file system, it's pure luck that such corrupted writes would only affect data extents and not the fs metadata. Corrupted fs metadata is not well tolerated by any file system, not least of which is most of them have no idea the metadata is corrupt. At least Btrfs can determine this and if there's another copy use that or just stop and face plant before more damage happens. Maybe an exception now is XFS v5 metadata which employs checksumming. But it still doesn't know if data extents are wrong (i.e. a torn or misdirected write). I've had perhaps a hundred power off during write with Btrfs and SSD and I don't ever see corrupt files. It's definitely not normal to see this with Btrfs. > In the earlier > cases there were no new instances of corruption after the initial failure > event and manual cleanup. > > Now that I did a little deeper into this, I do see one fairly significant > piece of data: > > root@host:~# btrfs dev stat /data | grep -v ' 0$' > [/dev/vdc].corruption_errs 16774 > [/dev/vde].write_io_errs 121 > [/dev/vde].read_io_errs 4 > [devid:8].read_io_errs 16 > > Prior to the failure of devid:8, vde had 121 write errors and 4 read > errors (these counter values are months old and the errors were long > since repaired by scrub). The 16774 corruption errors on vdc are all > new since the devid:8 failure, though. On md RAID 5 and 6, if the array gets parity mismatch counts above 0 doing a scrub (check > md/sync_action) there's a hardware problem. It's entirely possible you've found a bug, but it must be extremely obscure to basically not have hit everyone trying Btrfs raid56. I think you need to track down the source of this corruption and stop it however possible; whether that's changing hardware, or making sure the system isn't crashing.
On Mon, Jun 20, 2016 at 03:27:03PM -0600, Chris Murphy wrote: > On Mon, Jun 20, 2016 at 2:40 PM, Zygo Blaxell > <ce3g8jdj@umail.furryterror.org> wrote: > > On Mon, Jun 20, 2016 at 01:30:11PM -0600, Chris Murphy wrote: > > >> For me the critical question is what does "some corrupted sectors" mean? > > > > On other raid5 arrays, I would observe a small amount of corruption every > > time there was a system crash (some of which were triggered by disk > > failures, some not). > > What test are you using to determine there is corruption, and how much > data is corrupted? Is this on every disk? Non-deterministically fewer > than all disks? Have you identified this as a torn write or > misdirected write or is it just garbage at some sectors? And what's > the size? Partial sector? Partial md chunk (or fs block?) In earlier cases, scrub, read(), and btrfs dev stat all reported the incidents differently. Scrub would attribute errors randomly to disks (error counts spread randomly across all the disks in the 'btrfs scrub status -d' output). 'dev stat' would correctly increment counts on only those disks which had individually had an event (e.g. media error or SATA bus reset). Before deploying raid5, I tested these by intentionally corrupting one disk in an otherwise healthy raid5 array and watching the result. When scrub identified an inode and offset in the kernel log, the csum failure log message matched the offsets producing EIO on read(), but the statistics reported by scrub about which disk had been corrupted were mostly wrong. In such cases a scrub could repair the data. A different thing happens if there is a crash. In that case, scrub cannot repair the errors. Every btrfs raid5 filesystem I've deployed so far behaves this way when disks turn bad. I had assumed it was a software bug in the comparatively new raid5 support that would get fixed eventually. 
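For anyone who wants to reproduce the comparison described above, the commands involved look roughly like this (a sketch; /data is a placeholder):

btrfs scrub start -Bd /data   # -B: run in the foreground, -d: per-device statistics
btrfs scrub status -d /data   # scrub's per-device attribution of errors
btrfs dev stat /data          # the kernel's per-device error counters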
In this current case, I'm getting things like this: [12008.243867] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 4105596028 wanted 787343232 mirror 0 [12008.243876] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 1689373462 wanted 787343232 mirror 0 [12008.243885] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 3621611229 wanted 787343232 mirror 0 [12008.243893] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 113993114 wanted 787343232 mirror 0 [12008.243902] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 1464956834 wanted 787343232 mirror 0 [12008.243911] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 2545274038 wanted 787343232 mirror 0 [12008.243942] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 4090153227 wanted 787343232 mirror 0 [12008.243952] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 4129844199 wanted 787343232 mirror 0 [12008.243961] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 4129844199 wanted 787343232 mirror 0 [12008.243976] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 172651968 wanted 787343232 mirror 0 [12008.246158] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 4129844199 wanted 787343232 mirror 1 [12008.247557] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 1374425809 wanted 787343232 mirror 1 [12008.403493] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 1567917468 wanted 787343232 mirror 1 [12008.409809] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 2881359629 wanted 787343232 mirror 0 [12008.411165] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 3021442070 wanted 787343232 mirror 0 [12008.411180] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 3984314874 wanted 787343232 mirror 0 [12008.411189] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 599192427 wanted 787343232 mirror 0 [12008.411199] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 2887010053 wanted 787343232 mirror 0 [12008.411208] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 1314141634 wanted 787343232 mirror 0 [12008.411217] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 3156167613 wanted 787343232 mirror 0 [12008.411227] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 565550942 wanted 787343232 mirror 0 [12008.411236] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 4068631390 wanted 787343232 mirror 0 [12008.411245] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 531263990 wanted 787343232 mirror 0 [12008.411255] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 1992044858 wanted 787343232 mirror 0 [12008.411264] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 1516200640 wanted 787343232 mirror 0 [12008.411273] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 3762784693 wanted 787343232 mirror 0 [12008.411282] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 42246361 wanted 787343232 mirror 0 [12008.411291] BTRFS info (device vdc): csum failed 
ino 4420604 extent 26805825306624 csum 3989363071 wanted 787343232 mirror 0 [12008.422079] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 983444376 wanted 931479324 mirror 0 [12008.422402] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825372160 csum 3757815310 wanted 913327053 mirror 2 [12008.542715] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 425031800 wanted 931479324 mirror 0 [12008.542722] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 1254202754 wanted 931479324 mirror 0 [12008.542749] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 4281398184 wanted 931479324 mirror 0 [12008.543730] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 1016285311 wanted 931479324 mirror 0 [12008.545174] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 3122785442 wanted 931479324 mirror 0 [12008.549530] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 2691681328 wanted 931479324 mirror 0 [12008.556722] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 1956690651 wanted 931479324 mirror 0 [12008.556744] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 857176374 wanted 931479324 mirror 0 [12008.556756] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 2858632553 wanted 931479324 mirror 0 [12008.557901] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825372160 csum 3245431096 wanted 913327053 mirror 2 [12008.557907] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825372160 csum 3536357904 wanted 931479324 mirror 2 [12008.557935] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825372160 csum 3536357904 wanted 931479324 mirror 2 [12008.557938] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825372160 csum 3536357904 wanted 931479324 mirror 2 [12008.557952] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825372160 csum 3536357904 wanted 931479324 mirror 2 [12008.557954] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 3251056521 wanted 931479324 mirror 0 [12008.558054] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 3711050593 wanted 931479324 mirror 0 [12008.561564] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 3540653895 wanted 931479324 mirror 0 [12008.568799] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 876087402 wanted 931479324 mirror 0 [12008.568817] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 809865123 wanted 931479324 mirror 0 [12008.568827] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 884024971 wanted 931479324 mirror 0 [12008.568839] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 2677615947 wanted 931479324 mirror 0 [12008.572122] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 3536357904 wanted 931479324 mirror 1 [12008.572907] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825372160 csum 3763483742 wanted 931479324 mirror 2 [12008.572964] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825372160 csum 3157651551 wanted 931479324 mirror 2 [12008.572994] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825372160 csum 824699142 wanted 931479324 mirror 2 [12008.573066] BTRFS info (device vdc): csum 
failed ino 4420604 extent 26805825372160 csum 3536357904 wanted 931479324 mirror 2 [12008.573097] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825372160 csum 3536357904 wanted 931479324 mirror 2 [12008.573126] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825372160 csum 3536357904 wanted 931479324 mirror 2 [12008.573155] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825372160 csum 3536357904 wanted 931479324 mirror 2 [12008.573187] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825372160 csum 3536357904 wanted 931479324 mirror 2 [12008.573209] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825372160 csum 2691681328 wanted 931479324 mirror 2 [12008.573239] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825372160 csum 3757815310 wanted 913327053 mirror 2 [12008.573529] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 3536357904 wanted 931479324 mirror 0 [12008.574222] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825368064 csum 3757815310 wanted 913327053 mirror 2 [12008.577201] BTRFS info (device vdc): csum failed ino 4420604 extent 1541295742976 csum 3536357904 wanted 931479324 mirror 0 [12008.595253] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825372160 csum 3757815310 wanted 913327053 mirror 2

The "wanted" field looks suspicious--this filesystem does not have large areas filled with blocks of constant value, much less multiple different constant values, and certainly not in the affected files.

The other weird thing here is that I can't find an example in the logs of an extent with an EIO that isn't compressed. I've been looking up a random sample of the extent numbers, matching them up to filefrag output, and finding e.g. the one compressed extent in the middle of an otherwise uncompressed git pack file. That's...odd. Maybe there's a problem with compressed extents in particular? I'll see if I can script something to check all the logs at once...

> > It looked like any writes in progress at the time
> > of the failure would be damaged. In the past I would just mop up the
> > corrupt files (they were always the last extents written, easy to find
> > with find-new or scrub) and have no further problems.
>
> This is on Btrfs? This isn't supposed to be possible. Even a literal
> overwrite of a file is not an overwrite on Btrfs unless the file is
> nodatacow. Data extents get written, then the metadata is updated to
> point to those new blocks. There should be flush or fua requests to
> make sure the order is such that the fs points to either the old or
> new file, in either case uncorrupted. That's why I'm curious about the
> nature of this corruption. It sounds like your hardware is not exactly
> honoring flush requests.

That's true when all the writes are ordered within a single device, but possibly not when writes must be synchronized across multiple devices. It could be possible to corrupt existing data if:

- btrfs puts multiple extents in the same RAID5 stripe
- writes two of the extents in different transactions
- the array is degraded so the RAID5 write hole applies
- writes stop (crash, powerfail, etc) after updating some of the blocks in the stripe but not others in the second transaction.

This corrupts the data written in the first transaction, unless there's something in btrfs that explicitly prevents this from happening (e.g. the allocator prevents the second step, which would seem to burn up a lot of space).
The filesystem would continue to work afterwards with raid1 metadata because every disk in raid1 updates its blocks in the same order, and there are no interdependencies between blocks on different disks (not like a raid5 stripe, anyway).

Excluding the current event, this only happened to me when the disks were doing other questionable things, like SATA link resets or dropping off the SATA bus, so there could have been an additional hardware issue. This does happen on _every_ raid5 deployment I've run so far, though, and they don't share any common hardware or firmware components.

> I've had perhaps a hundred power off during write with Btrfs and SSD
> and I don't ever see corrupt files. It's definitely not normal to see
> this with Btrfs.

These are spinning rust. They are probably quite vulnerable to RAID write hole issues when writing in degraded mode due to the comparatively large write latency variation between disks.

> > Now that I did a little deeper into this, I do see one fairly significant
> > piece of data:
> >
> > root@host:~# btrfs dev stat /data | grep -v ' 0$'
> > [/dev/vdc].corruption_errs 16774
> > [/dev/vde].write_io_errs 121
> > [/dev/vde].read_io_errs 4
> > [devid:8].read_io_errs 16
> >
> > Prior to the failure of devid:8, vde had 121 write errors and 4 read
> > errors (these counter values are months old and the errors were long
> > since repaired by scrub). The 16774 corruption errors on vdc are all
> > new since the devid:8 failure, though.
>
> On md RAID 5 and 6, if the array gets parity mismatch counts above 0
> doing a scrub (check > md/sync_action) there's a hardware problem.
> It's entirely possible you've found a bug, but it must be extremely
> obscure to basically not have hit everyone trying Btrfs raid56. I

I've met many btrfs users on IRC but never one with a successful raid56 recovery story. Give me a shout on IRC if you have one. ;) (I'm excluding my own recovery success stories--those were 16GB test filesystems running on kvm, and they recovered *perfectly*).

> think you need to track down the source of this corruption and stop it
> however possible; whether that's changing hardware, or making sure
> the system isn't crashing.

It's been a few hours, there have been plenty of EIOs on reads, but the dev stat counters haven't changed from the values above. I'm inclined to blame software for the EIOs:

- the dev stat counters aren't changing, but there are dozens of read() EIOs going by.
- the checksums of long block runs are all identical and not zero. Not one or two examples, all the logs going back several days look like this.
- with the patches, I'm executing code on the far side of BUG_ONs. Most btrfs users aren't testing that code, they would have crashed first. It still could be my fault for walking past the DANGER signs...

Probably something goes wrong earlier on, and the corruption is just a symptom of the same failure that led to the original BUG_ONs.

> --
> Chris Murphy
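One rough way to demonstrate the mismatch described above between read() EIOs and the dev stat counters (a sketch; the paths are hypothetical):

btrfs dev stat /data > /tmp/stats.before
find /data -type f -exec cat {} + > /dev/null 2> /tmp/read-errors
grep -c 'Input/output error' /tmp/read-errors   # count the EIO messages from cat
btrfs dev stat /data > /tmp/stats.after
diff /tmp/stats.before /tmp/stats.after         # no output => the counters never moved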
On Mon, Jun 20, 2016 at 09:55:59PM -0400, Zygo Blaxell wrote: > In this current case, I'm getting things like this: > > [12008.243867] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 4105596028 wanted 787343232 mirror 0 [...] > The other other weird thing here is that I can't find an example in the > logs of an extent with an EIO that isn't compressed. I've been looking > up a random sample of the extent numbers, matching them up to filefrag > output, and finding e.g. the one compressed extent in the middle of an > otherwise uncompressed git pack file. That's...odd. Maybe there's a > problem with compressed extents in particular? I'll see if I can > script something to check all the logs at once... No need for a script: this message wording appears only in fs/btrfs/compression.c so it can only ever be emitted by reading a compressed extent. Maybe there's a problem specific to raid5, degraded mode, and compressed extents?
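The claim above is easy to re-check against a kernel source tree (a sketch, assuming a local checkout of the kernel version in use):

# Show every place the btrfs code logs a "csum failed" message; per the
# observation above, only the fs/btrfs/compression.c variant uses the
# "extent ... wanted ... mirror" wording seen in these logs:
git grep -n "csum failed" fs/btrfs/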
TL;DR: Kernel 4.6.2 causes a world of pain. Use 4.5.7 instead. 'btrfs dev stat' doesn't seem to count "csum failed" (i.e. corruption) errors in compressed extents.

On Sun, Jun 19, 2016 at 11:44:27PM -0400, Zygo Blaxell wrote:
> Not so long ago, I had a disk fail in a btrfs filesystem with raid1
> metadata and raid5 data. I mounted the filesystem readonly, replaced
> the failing disk, and attempted to recover by adding the new disk and
> deleting the missing disk.

> I'm currently using kernel 4.6.2

That turned out to be a mistake. 4.6.2 has some severe problems. Over the past few days I've been upgrading other machines from 4.5.7 to 4.6.2. This morning I saw the aggregate data coming back from those machines, and it's all bad: stalls in snapshot delete, balance, and sync; some machines just lock up with no console messages; a lot of watchdog timeouts. None of the machines could get to an uptime over 26 hours and still be in a usable state. I switched to 4.5.7 and the crashes, balance/delete hangs, and some of the data corruption modes stopped.

> I'm
> getting EIO randomly all over the filesystem, including in files that were
> written entirely _after_ the disk failure.

There were actually four distinct corruption modes happening:

1. There are some number (16500 so far) "normal" corrupt blocks: read repeatably returns EIO, they show up in scrub with sane log messages, and replacing the files that contain these blocks makes them go away. These blocks appear to be contained in extents that coincide with the date of the disk failure. Interestingly, no matter how many times I read these blocks, I get no increase in the 'btrfs dev stat' numbers even though I get kernel csum failure messages. That looks like a bug.

2. When attempting to replace corrupted files with rsync, I had used 'rsync --inplace'. This caused bad blocks to be overwritten within extents, but does not necessarily replace the _entire_ extent containing a bad block. This creates corrupt blocks that show up in scrub, balance, and device delete, but not when reading files. It also updates the timestamps so a file with old corruption looks "new" to an insufficiently sophisticated analysis tool.

3. Files were corrupted while they were written and accessed via NFS. This created files with correct btrfs checksums, but garbage contents. This would show up as failures during 'git gc' or rsync checksum mismatches. During one of the many VM crashes, any writes in progress at the time of the crash were lost. This effectively rewound the filesystem several minutes each time as btrfs reverts to the previous committed tree on the next mount. 4.6.2's hanging issues made this worse by delaying btrfs commits indefinitely. The NFS clients were completely unaware of this, so when the VM rebooted, files ended up with holes, or would just disappear while in use.

4. After a VM crash and the filesystem reverted to the previous committed tree, files with bad blocks that had been repaired through the NFS server or with rsync would be "unrepaired" (i.e. the filesystem would revert back to the original corrupted blocks after the mount).

Combinations of these could occur as well for extra confusion, and some corrupted blocks are contained in many files thanks to dedup.

With kernel 4.5.7 there have been no lockups during commit and no VM crashes, so I haven't seen any of corruption modes 3 and 4 since 4.5.7. Balance is now running normally to move the remaining data off the missing disk. ETA is 558 hours. See you in mid-July! ;)
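For corruption mode 2 above, the workaround is to let rsync write a whole new file instead of patching blocks in place, so btrfs allocates fresh extents for the entire file (a sketch; the backup host and paths are hypothetical):

# Without --inplace, rsync writes a temporary copy and renames it over the
# original, so every extent of the restored file is newly allocated;
# --checksum forces the transfer even when size and mtime still match:
rsync -av --checksum backuphost:/backup/data/somefile /data/somefile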
On Mon, Jun 20, 2016 at 7:55 PM, Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote:
> On Mon, Jun 20, 2016 at 03:27:03PM -0600, Chris Murphy wrote:
>> On Mon, Jun 20, 2016 at 2:40 PM, Zygo Blaxell
>> <ce3g8jdj@umail.furryterror.org> wrote:
>> > On Mon, Jun 20, 2016 at 01:30:11PM -0600, Chris Murphy wrote:
>> >
>> >> For me the critical question is what does "some corrupted sectors" mean?
>> >
>> > On other raid5 arrays, I would observe a small amount of corruption every
>> > time there was a system crash (some of which were triggered by disk
>> > failures, some not).
>>
>> What test are you using to determine there is corruption, and how much
>> data is corrupted? Is this on every disk? Non-deterministically fewer
>> than all disks? Have you identified this as a torn write or
>> misdirected write or is it just garbage at some sectors? And what's
>> the size? Partial sector? Partial md chunk (or fs block?)
>
> In earlier cases, scrub, read(), and btrfs dev stat all reported the
> incidents differently. Scrub would attribute errors randomly to disks
> (error counts spread randomly across all the disks in the 'btrfs scrub
> status -d' output). 'dev stat' would correctly increment counts on only
> those disks which had individually had an event (e.g. media error or
> SATA bus reset).
>
> Before deploying raid5, I tested these by intentionally corrupting
> one disk in an otherwise healthy raid5 array and watching the result.

It's difficult to reproduce if no one understands how you intentionally corrupted that disk. Literal reading, you corrupted the entire disk, but that's impractical. The fs is expected to behave differently depending on what's been corrupted and how much.

> When scrub identified an inode and offset in the kernel log, the csum
> failure log message matched the offsets producing EIO on read(), but
> the statistics reported by scrub about which disk had been corrupted
> were mostly wrong. In such cases a scrub could repair the data.

I don't often use the -Bd options, so I haven't tested it thoroughly, but what you're describing sounds like a bug in user space tools. I've found it reflects the same information as btrfs dev stats, and dev stats have been reliable in my testing.

> A different thing happens if there is a crash. In that case, scrub cannot
> repair the errors. Every btrfs raid5 filesystem I've deployed so far
> behaves this way when disks turn bad. I had assumed it was a software bug
> in the comparatively new raid5 support that would get fixed eventually.

This is really annoyingly vague. You don't give a complete recipe for reproducing this sequence. Here's what I'm understanding and what I'm missing:

1. The intentional corruption, extent of which is undefined, is still present.

2. A drive is bad, but that doesn't tell us if it's totally dead, or only intermittently spitting out spurious information.

3. Is the volume remounted degraded or is the bad drive still being used by Btrfs? Because Btrfs has no concept (patches pending) of drive faulty state like md, let alone an automatic change to that faulty state. It just keeps on trying to read or write to bad drives, even if they're physically removed.

4. You've initiated a scrub, and the corruption in 1 is not fixed.

OK so what am I missing? Because it sounds to me like you have two copies of data that are gone. For raid 5 that's data loss, scrub can't fix things. Corruption is missing data. The bad drive is missing data.
What values do you get for

smartctl -l scterc /dev/sdX
cat /sys/block/sdX/device/timeout

>> This is on Btrfs? This isn't supposed to be possible. Even a literal
>> overwrite of a file is not an overwrite on Btrfs unless the file is
>> nodatacow. Data extents get written, then the metadata is updated to
>> point to those new blocks. There should be flush or fua requests to
>> make sure the order is such that the fs points to either the old or
>> new file, in either case uncorrupted. That's why I'm curious about the
>> nature of this corruption. It sounds like your hardware is not exactly
>> honoring flush requests.
>
> That's true when all the writes are ordered within a single device, but
> possibly not when writes must be synchronized across multiple devices.

I think that's a big problem, the fs cannot be consistent if the super block points to any tree whose metadata or data isn't on stable media. But if you think it's happening you might benefit from integrity checking, maybe try just the metadata one for starters which is the check_int mount option (it must be compiled in first for that mount option to work).

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/check-integrity.c?id=refs/tags/v4.6.2

> It could be possible to corrupt existing data if:
>
> - btrfs puts multiple extents in the same RAID5 stripe
>
> - writes two of the extents in different transactions
>
> - the array is degraded so the RAID5 write hole applies
>
> - writes stop (crash, powerfail, etc) after updating some of
> the blocks in the stripe but not others in the second transaction.

I do not know the exact nature of the Btrfs raid56 write hole. Maybe a dev or someone who knows can explain it.

However, from btrfs-debug-tree from a 3 device raid5 volume:

item 5 key (FIRST_CHUNK_TREE CHUNK_ITEM 1103101952) itemoff 15621 itemsize 144
    chunk length 2147483648 owner 2 stripe_len 65536
    type DATA|RAID5 num_stripes 3
        stripe 0 devid 2 offset 9437184
        dev uuid: 3c6f37eb-5cae-455a-82bc-a1b0877dea55
        stripe 1 devid 1 offset 1094713344
        dev uuid: 13104709-6f30-4982-979e-4f055c326fad
        stripe 2 devid 3 offset 1083179008
        dev uuid: d45fc482-a0c1-46b1-98c1-41cea5a11c80

I expect that parity is in this data block group, and therefore is checksummed the same as any other data in that block group. While parity could become stale like with conventional raid5, unlike conventional raid 5, Btrfs would know about this because there'd be a checksum mismatch.

So in your example of degraded writes, no matter what, the on disk format makes it discoverable there is a problem:

A. The "updating" is still always COW so there is no overwriting.

B. The updated data extent is supposed to be on stable media before both metadata pointing to it and including its checksum (in the extent csum tree) are on stable media.

Perhaps it's acceptable for certain metadata to be written out of order with data? But certainly not the superblock being updated to point to updated trees that haven't actually been sent to stable media yet. But even if that were to happen due to some hardware bug, the fs can figure this out at mount time as it bootstraps itself from the super, and will then fail to mount. This is what -o recovery (now usebackuproot in newer kernels) is used for; so that in effect the fs is rolled back to a state where the super points to completely valid roots. But these mount options are considered red flags for hardware doing the wrong thing.
Yes you lose some data, which is bad anyway due to writes being interrupted during an oops, but at least the fs is OK.

> This corrupts the data written in the first transaction, unless there's
> something in btrfs that explicitly prevents this from happening (e.g.
> the allocator prevents the second step, which would seem to burn up a
> lot of space).

I don't follow from your steps how the data in the first transaction is corrupt. Everything is copy on write in Btrfs including raid5. So that new transaction does not overwrite the first, it's a new write, and what should happen is that data gets written to unused sectors, then metadata is written to point to those sectors (chunk tree, dev tree, extent tree, csum tree, fs tree, etc), and once that's on stable media, then the superblock is updated.

There are all kinds of optimizations happening that probably allow some of those things to happen in different order depending on what's going on, and then the drive also will reorder those writes, but when Btrfs issues req_flush to the drive before it writes an updated superblock, and the drive lies that it has flushed when it hasn't? That's when problems happen. And perhaps naively, I'd expect in any multiple device volume, that Btrfs issues req_flush right before and after it updates supers on all the drives. If that's really honored by the drive, then the writing of supers doesn't happen until everything before the req_flush is actually on stable media.

The very fact you're not getting mount failures tells me all the supers on all the drives must be the same, or Btrfs would complain. There are generally at least three super blocks on each drive, and each super has four backup roots. So that's 15 opportunities per drive to determine right off the bat if something is really screwy, and just fail to mount. But you're not seeing mount failure. I think something else is going on. Possibly there was already some weird corruption before either your intentional corruption or the bad drive happened.

> The filesystem would continue to work afterwards with raid1 metadata
> because every disk in raid1 updates its blocks in the same order,
> and there are no interdependencies between blocks on different disks
> (not like a raid5 stripe, anyway).

I'm not sure what you mean by this. Btrfs raid1 means two copies. It doesn't matter how many drives there are, there are two copies of metadata in your case, and you have no idea which drives those metadata block groups are on without checking btrfs-debug-tree.

> Excluding the current event, this only happened to me when the disks
> were doing other questionable things, like SATA link resets or dropping
> off the SATA bus, so there could have been an additional hardware issue.

Link resets are bad. There's bad hardware, a bad cable, or (highly likely and common) bad sectors plus a misconfiguration of drive error recovery and the SCSI command timer.

> This does happen on _every_ raid5 deployment I've run so far, though,
> and they don't share any common hardware or firmware components.

Misconfigurations in raid 5 are the default. I've mentioned this before on this list. And the linux-raid@ list is full of such problems. It's pretty much a weekly occurrence someone loses all data on a raid5 array due to this very common misconfiguration.

> I've met many btrfs users on IRC but never one with a successful raid56
> recovery story. Give me a shout on IRC if you have one. ;)

You'd have to check the archives, they do exist.
But typically by the time they're on this list, the problems are too far gone. It's also very common on linux-raid@. The issue is people treat their arrays like they're backups, or like they're permission to delay backup frequency, rather than using them for the sole purpose they're intended for, which is to extend uptime in the face of a *single* hardware failure, specifically one drive. The instant there are two problems, implosion is almost certain.
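For readers hitting the misconfiguration Chris describes, the usual check and adjustment looks something like this (a sketch; sdX is a placeholder and the exact values are a matter of preference):

# Does the drive support SCT Error Recovery Control, and what is the
# kernel's SCSI command timeout?
smartctl -l scterc /dev/sdX
cat /sys/block/sdX/device/timeout
# If ERC is supported, cap in-drive retries at 7 seconds (70 deciseconds):
smartctl -l scterc,70,70 /dev/sdX
# If ERC is not supported (common on desktop drives), raise the kernel
# timeout instead so the drive's long internal retries don't trigger resets:
echo 180 > /sys/block/sdX/device/timeout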
On Wed, Jun 22, 2016 at 11:14:30AM -0600, Chris Murphy wrote:
> > Before deploying raid5, I tested these by intentionally corrupting
> > one disk in an otherwise healthy raid5 array and watching the result.
>
> It's difficult to reproduce if no one understands how you
> intentionally corrupted that disk. Literal reading, you corrupted the
> entire disk, but that's impractical. The fs is expected to behave
> differently depending on what's been corrupted and how much.

The first round of testing I did (a year ago, when deciding whether btrfs raid5 was mature enough to start using) was:

- Create a 5-disk RAID5.
- Put some known data on it until it's full (i.e. random test patterns). At the time I didn't do any tests involving compressible data, which I now realize was a serious gap in my test coverage.
- Pick 1000 random blocks (excluding superblocks) on one of the disks and write random data to them.
- Read and verify the data through the filesystem, do scrub, etc. Exercise all the btrfs features related to error reporting and recovery.

(A rough sketch of this procedure appears after this message.)

I expected scrub and dev stat to report accurate corruption counts (except for the 1 in 4 billion case where a bad csum matches by random chance), and I expected all the data to be reconstructed since only one drive was corrupted (assuming there are no unplanned disk failures during the test, obviously) and the corruption occurred while the filesystem was offline so there was no possibility of RAID write hole.

My results from that testing were that everything worked except for the mostly-harmless quirk where scrub counts errors on random disks instead of the disk where the errors occur.

> I don't often use the -Bd options, so I haven't tested it thoroughly,
> but what you're describing sounds like a bug in user space tools. I've
> found it reflects the same information as btrfs dev stats, and dev
> stats have been reliable in my testing.

Don't the user space tools just read what the kernel tells them? I don't know how *not* to produce this behavior on btrfs raid5 or raid6. It should show up on any btrfs raid56 system.

> > A different thing happens if there is a crash. In that case, scrub cannot
> > repair the errors. Every btrfs raid5 filesystem I've deployed so far
> > behaves this way when disks turn bad. I had assumed it was a software bug
> > in the comparatively new raid5 support that would get fixed eventually.
>
> This is really annoyingly vague. You don't give a complete recipe for
> reproducing this sequence. Here's what I'm understanding and what I'm
> missing:
>
> 1. The intentional corruption, extent of which is undefined, is still present.

No intentional corruption here (quote: "A different thing happens if there is a crash..."). Now we are talking about the baseline behavior when there is a crash on a btrfs raid5 array, especially crashes triggered by a disk-level failure (e.g. watchdog timeout because a disk or controller has hung) but also ordinary power failures or other crashes triggered by external causes.

> 2. A drive is bad, but that doesn't tell us if it's totally dead, or
> only intermittently spitting out spurious information.

The most common drive-initiated reboot case is that one drive temporarily locks up and triggers the host to perform a watchdog reset. The reset is successful and the filesystem can be mounted again with all drives present; however, a small amount of raid5 data appears to be corrupted each time.
The raid1 metadata passes all the integrity checks I can throw at it: btrfs check, scrub, balance, walk the filesystem with find -type f -exec cat ..., compare with the last backup, etc. Usually when I detect this case, I delete any corrupted data, delete the disk that triggers the lockups and have no further problems with that array.

> 3. Is the volume remounted degraded or is the bad drive still being
> used by Btrfs? Because Btrfs has no concept (patches pending) of drive
> faulty state like md, let alone an automatic change to that faulty
> state. It just keeps on trying to read or write to bad drives, even if
> they're physically removed.

In the baseline case the filesystem has all drives present after remount. It could be as simple as power-cycling the host while writes are active.

> 4. You've initiated a scrub, and the corruption in 1 is not fixed.

In this pattern, btrfs may find both correctable and uncorrectable corrupted data, usually on one of the drives. scrub fixes the correctable corruption, but fails on the uncorrectable.

> OK so what am I missing?

Nothing yet. The above is the "normal" btrfs raid5 crash experience with a non-degraded raid5 array. A few megabytes of corrupt extents can easily be restored from backups or deleted and everything's fine after that.

In my *current* failure case, I'm experiencing *additional* issues. In total: all of the above, plus: BUG_ON()s as described in the first mail on this thread, additional csum failure kernel messages without corresponding dev stat increments, several problems that seem to occur only on kernel 4.6.2 and not 4.5.7. Another distinction between the current case and the previous cases is that this time the array is degraded: one drive is entirely missing.

The current case seems to be hitting code paths in the kernel that I've never seen executed before, and it looks like there are problems with them (e.g. any function with a mirror_num parameter does not seem to be valid for raid56, and yet the kernel is crashing in some of them).

> Because it sounds to me like you have two copies of data that are
> gone. For raid 5 that's data loss, scrub can't fix things. Corruption
> is missing data. The bad drive is missing data.
>
> What values do you get for
>
> smartctl -l scterc /dev/sdX
> cat /sys/block/sdX/device/timeout

"SCT Error Recovery Control command not supported" and "30" respectively. There are no kernel log messages suggesting timeouts in the current case after the failed drive was disconnected from the system. There were plenty of these starting just after the drive failed, but that's to be expected, and should be tolerated by a raid5 implementation. All remaining drives appear to be healthy. The array completed a scrub two weeks before the disk failure with no errors (as it has every two weeks since it was created).

> I do not know the exact nature of the Btrfs raid56 write hole. Maybe a
> dev or someone who knows can explain it.

If you have 3 raid5 devices, they might be laid out on disk like this (e.g. with a 16K stripe width):

Address:  0..16K     16..32K    32..64K
Disk 1:   [0..16K]   [32..64K]  [PARITY]
Disk 2:   [16..32K]  [PARITY]   [80..96K]
Disk 3:   [PARITY]   [64..80K]  [96..112K]

btrfs logical address ranges are inside []. Disk physical address ranges are shown at the top of each column. (I've simplified the mapping here; pretend all the addresses are relative to the start of a block group).
If we want to write a 32K extent at logical address 0, we'd write all three disks in one column (disk1 gets 0..16K, disk2 gets 16..32K, disk3 gets parity for the other two disks). The parity will be temporarily invalid for the time between the first disk write and the last disk write. In non-degraded mode the parity isn't necessary, but in degraded mode the entire column cannot be reconstructed because of invalid parity.

To see why this could be a problem, suppose btrfs writes a 4K extent at logical address 32K. This requires updating (at least) disk 1 (where the logical address 32K resides) and disk 2 (the parity for this column). This means any data that existed at logical addresses 36K..80K (or at least 32..36K and 64..68K) has its parity temporarily invalidated between the write to the first and last disks. If there were metadata pointing to other blocks in this column, the metadata temporarily points to damaged data during the write. If there is no data in other blocks in this column then it doesn't matter that the parity doesn't match--the content of the reconstructed unallocated blocks would be undefined even in the success cases.

Last time I checked, btrfs doesn't COW entire RAID stripes (it does RMW them but that's not the same thing at all). COW affects only extents in the logical address space. To avoid the write hole issue, btrfs would have to avoid writing to any column that is partially occupied by existing committed data (i.e. it would have to write the entire column in a single transaction or not write to the column at all). btrfs doesn't do this, which can be proven with a simple experiment:

# btrfs sub create tmp && cd tmp
# for x in $(seq 0 9); do head -c 4096 < /dev/urandom >> f; sync; done; filefrag -v f
Filesystem type is: 9123683e
File size of f is 40960 (10 blocks of 4096 bytes)
 ext: logical_offset: physical_offset:           length: expected:   flags:
   0:    0..    0: 2412725689..2412725689:     1:
   1:    1..    1: 2412725690..2412725690:     1:
   2:    2..    2: 2412725691..2412725691:     1:
   3:    3..    3: 2412725692..2412725692:     1:
   4:    4..    4: 2412725693..2412725693:     1:
   5:    5..    5: 2412725694..2412725694:     1:
   6:    6..    6: 2412725695..2412725695:     1:
   7:    7..    7: 2412725698..2412725698:     1: 2412725696:
   8:    8..    8: 2412725699..2412725699:     1:
   9:    9..    9: 2412725700..2412725700:     1:             last,eof
f: 2 extents found

Here I have allocated 10 consecutive physical blocks in 10 separate btrfs transactions on a 5-disk raid5 array. Those physical_offset blocks need to be at least N-1 blocks apart (for a 5-disk array, this is 4) to avoid the write hole problem (except for the special case of one block at the end of a column adjacent to the first block of the next column). They certainly cannot _all_ be adjacent.

If block #3 in this file is modified, btrfs will free physical block 2412725692 because of CoW:

# head -c 4096 < /dev/urandom | dd seek=3 bs=4k of=f conv=notrunc; sync
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 9.3396e-05 s, 43.9 MB/s
Filesystem type is: 9123683e
File size of f is 40960 (10 blocks of 4096 bytes)
 ext: logical_offset: physical_offset:           length: expected:   flags:
   0:    0..    0: 2412725689..2412725689:     1:
   1:    1..    1: 2412725690..2412725690:     1:
   2:    2..    2: 2412725691..2412725691:     1:
   3:    3..    3: 2412725701..2412725701:     1: 2412725692:
   4:    4..    4: 2412725693..2412725693:     1: 2412725702:
   5:    5..    5: 2412725694..2412725694:     1:
   6:    6..    6: 2412725695..2412725695:     1:
   7:    7..    7: 2412725698..2412725698:     1: 2412725696:
   8:    8..    8: 2412725699..2412725699:     1:
   9:    9..    9: 2412725700..2412725700:     1:             last,eof
f: 4 extents found

If in the future btrfs allocates physical block 2412725692 to a different file, up to 3 other blocks in this file (most likely 2412725689..2412725691) could be lost if a crash or disk I/O error also occurs during the same transaction. btrfs does do this--in fact, the _very next block_ allocated by the filesystem is 2412725692:

# head -c 4096 < /dev/urandom >> f; sync; filefrag -v f
Filesystem type is: 9123683e
File size of f is 45056 (11 blocks of 4096 bytes)
 ext: logical_offset: physical_offset:           length: expected:   flags:
   0:    0..    0: 2412725689..2412725689:     1:
   1:    1..    1: 2412725690..2412725690:     1:
   2:    2..    2: 2412725691..2412725691:     1:
   3:    3..    3: 2412725701..2412725701:     1: 2412725692:
   4:    4..    4: 2412725693..2412725693:     1: 2412725702:
   5:    5..    5: 2412725694..2412725694:     1:
   6:    6..    6: 2412725695..2412725695:     1:
   7:    7..    7: 2412725698..2412725698:     1: 2412725696:
   8:    8..    8: 2412725699..2412725699:     1:
   9:    9..    9: 2412725700..2412725700:     1:
  10:   10..   10: 2412725692..2412725692:     1: 2412725701: last,eof
f: 5 extents found

These extents all have separate transids too:

# btrfs sub find-new . 1
inode 257 file offset 0 len 4096 disk start 9882524422144 offset 0 gen 1432814 flags NONE f
inode 257 file offset 4096 len 4096 disk start 9882524426240 offset 0 gen 1432816 flags NONE f
inode 257 file offset 8192 len 4096 disk start 9882524430336 offset 0 gen 1432817 flags NONE f
inode 257 file offset 12288 len 4096 disk start 9882524471296 offset 0 gen 1432825 flags NONE f
inode 257 file offset 16384 len 4096 disk start 9882524438528 offset 0 gen 1432819 flags NONE f
inode 257 file offset 20480 len 4096 disk start 9882524442624 offset 0 gen 1432820 flags NONE f
inode 257 file offset 24576 len 4096 disk start 9882524446720 offset 0 gen 1432821 flags NONE f
inode 257 file offset 28672 len 4096 disk start 9882524459008 offset 0 gen 1432822 flags NONE f
inode 257 file offset 32768 len 4096 disk start 9882524463104 offset 0 gen 1432823 flags NONE f
inode 257 file offset 36864 len 4096 disk start 9882524467200 offset 0 gen 1432824 flags NONE f
inode 257 file offset 40960 len 4096 disk start 9882524434432 offset 0 gen 1432826 flags NONE f
transid marker was 1432826

> > The filesystem would continue to work afterwards with raid1 metadata
> > because every disk in raid1 updates its blocks in the same order,
> > and there are no interdependencies between blocks on different disks
> > (not like a raid5 stripe, anyway).
>
> I'm not sure what you mean by this. Btrfs raid1 means two copies. It
> doesn't matter how many drives there are, there are two copies of
> metadata in your case, and you have no idea which drives those
> metadata block groups are on without checking btrfs-debug-tree.

What I mean is that there are no dependencies between logically adjacent blocks on physically separate disks in raid1 as there are in raid5, because there are no stripes in raid1. The whole above scenario cannot occur in raid1.
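As referenced above, a rough sketch of the offline single-disk corruption test described in the previous message (destructive, for throwaway test arrays only; the device name, mount point, and block counts are hypothetical):

# bash: corrupt 1000 random 4 KiB blocks on ONE member of an unmounted array
umount /mnt/test
dev=/dev/vdb
size_blocks=$(( $(blockdev --getsize64 $dev) / 4096 ))
for i in $(seq 1000); do
    blk=$(( (RANDOM * 32768 + RANDOM) % size_blocks ))
    # skip the btrfs superblock copies at 64KiB, 64MiB and 256GiB
    case $blk in 16|16384|67108864) continue ;; esac
    dd if=/dev/urandom of=$dev bs=4096 count=1 seek=$blk conv=notrunc 2>/dev/null
done
# Remount and compare what read(), scrub and the counters report:
mount $dev /mnt/test
btrfs scrub start -Bd /mnt/test
btrfs dev stat /mnt/test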
On 2016-06-22 22:35, Zygo Blaxell wrote:
>> I do not know the exact nature of the Btrfs raid56 write hole. Maybe a
>> dev or someone who knows can explain it.
>
> If you have 3 raid5 devices, they might be laid out on disk like this
> (e.g. with a 16K stripe width):
>
> Address:  0..16K     16..32K    32..64K
> Disk 1:   [0..16K]   [32..64K]  [PARITY]
> Disk 2:   [16..32K]  [PARITY]   [80..96K]
> Disk 3:   [PARITY]   [64..80K]  [96..112K]
>
> btrfs logical address ranges are inside []. Disk physical address ranges
> are shown at the top of each column. (I've simplified the mapping here;
> pretend all the addresses are relative to the start of a block group).
>
> If we want to write a 32K extent at logical address 0, we'd write all
> three disks in one column (disk1 gets 0..16K, disk2 gets 16..32K, disk3
> gets parity for the other two disks). The parity will be temporarily
> invalid for the time between the first disk write and the last disk write.
> In non-degraded mode the parity isn't necessary, but in degraded mode
> the entire column cannot be reconstructed because of invalid parity.
>
> To see why this could be a problem, suppose btrfs writes a 4K extent at
> logical address 32K. This requires updating (at least) disk 1 (where the
> logical address 32K resides) and disk 2 (the parity for this column).
> This means any data that existed at logical addresses 36K..80K (or at
> least 32..36K and 64..68K) has its parity temporarily invalidated between
> the write to the first and last disks. If there were metadata pointing
> to other blocks in this column, the metadata temporarily points to
> damaged data during the write. If there is no data in other blocks in
> this column then it doesn't matter that the parity doesn't match--the
> content of the reconstructed unallocated blocks would be undefined
> even in the success cases.
[...]

Sorry, but I can't follow you.

RAID5 protects you in case of a failure (or a missing write) of a *single* disk. The raid write hole happens when a stripe is not completely written on the platters: the parity and the related data mismatch. In this case a "simple" raid5 may return wrong data if the parity is used to compute the data. But this happens because a "simple" raid5 is unable to detect whether the returned data is right or not.

The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the checksum. BTRFS is able to discard the wrong data: i.e. in case of a 3 disks raid5, the right data may be extracted from data1+data2, or if the checksum doesn't match from data1+parity, or if the checksum still doesn't match from data2+parity.

NOTE1: the real difference between the BTRFS (and ZFS) raid and the "simple" raid5 is that the latter doesn't try another pair of disks.

NOTE2: this works if only one write is corrupted. If more writes (== more disks) are involved, you get a checksum mismatch. If more than one write is corrupted, raid5 is unable to protect you.

In case of "degraded mode", you don't have any redundancy. So if a stripe of a degraded filesystem is not fully written to the disk, it is like a block not fully written to the disk, and you get checksum mismatches. But this is not what is called the raid write hole.

On 2016-06-22 22:35, Zygo Blaxell wrote:
> If in the future btrfs allocates physical block 2412725692 to
> a different file, up to 3 other blocks in this file (most likely
> 2412725689..2412725691) could be lost if a crash or disk I/O error also
> occurs during the same transaction.
btrfs does do this--in fact, the > _very next block_ allocated by the filesystem is 2412725692: > > # head -c 4096 < /dev/urandom >> f; sync; filefrag -v f > Filesystem type is: 9123683e > File size of f is 45056 (11 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 0: 2412725689..2412725689: 1: > 1: 1.. 1: 2412725690..2412725690: 1: > 2: 2.. 2: 2412725691..2412725691: 1: > 3: 3.. 3: 2412725701..2412725701: 1: 2412725692: > 4: 4.. 4: 2412725693..2412725693: 1: 2412725702: > 5: 5.. 5: 2412725694..2412725694: 1: > 6: 6.. 6: 2412725695..2412725695: 1: > 7: 7.. 7: 2412725698..2412725698: 1: 2412725696: > 8: 8.. 8: 2412725699..2412725699: 1: > 9: 9.. 9: 2412725700..2412725700: 1: > 10: 10.. 10: 2412725692..2412725692: 1: 2412725701: last,eof > f: 5 extents found You are assuming that if you touch a block, all the blocks of the same stripe spread over the disks are involved. I disagree. The only parts which are involved, are the part of stripe which contains the changed block and the parts which contains the parity. If both the parts become corrupted, RAID5 is unable to protect you (two failure, when raid 5 has only _one_ redundancy). But if only one of these is corrupted, BTRFS with the help of the checksum is capable to detect which one is corrupted and to return good data (and to rebuild the bad parts). BR G.Baroncelli
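(A toy illustration of the recovery Goffredo describes, purely as a sketch and not btrfs code: a 3-disk stripe with XOR parity and a per-block checksum, where a read tries the plain copy first and then the remaining data+parity pair, keeping whichever result matches the checksum. zlib's crc32 stands in for the real csum here.)

from zlib import crc32

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def make_stripe(d1, d2):
    # 3-disk RAID5 stripe: two data blocks plus their XOR parity.
    return {"d1": d1, "d2": d2, "p": xor(d1, d2)}

def read_d1(stripe, csum_d1):
    # Try the plain copy first, then rebuild from the other data block and
    # parity, and keep whichever candidate matches the stored checksum.
    for candidate in (stripe["d1"], xor(stripe["d2"], stripe["p"])):
        if crc32(candidate) == csum_d1:
            return candidate
    raise IOError("more than one member of the stripe is bad")

d1, d2 = b"A" * 16, b"B" * 16
stripe = make_stripe(d1, d2)
csum_d1 = crc32(d1)
stripe["d1"] = b"X" * 16                 # corrupt the on-disk copy of data1
assert read_d1(stripe, csum_d1) == d1    # still recovered from data2 + parity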
On Wed, Jun 22, 2016 at 11:14 AM, Chris Murphy <lists@colorremedies.com> wrote: > > However, from btrfs-debug-tree from a 3 device raid5 volume: > > item 5 key (FIRST_CHUNK_TREE CHUNK_ITEM 1103101952) itemoff 15621 itemsize 144 > chunk length 2147483648 owner 2 stripe_len 65536 > type DATA|RAID5 num_stripes 3 > stripe 0 devid 2 offset 9437184 > dev uuid: 3c6f37eb-5cae-455a-82bc-a1b0877dea55 > stripe 1 devid 1 offset 1094713344 > dev uuid: 13104709-6f30-4982-979e-4f055c326fad > stripe 2 devid 3 offset 1083179008 > dev uuid: d45fc482-a0c1-46b1-98c1-41cea5a11c80 > > I expect that parity is in this data block group, and therefore is > checksummed the same as any other data in that block group. This appears to be wrong. Comparing the same file, one file only, one two new Btrfs volumes, one volume single, one volume raid5, I get a single csum tree entry: raid5 item 0 key (EXTENT_CSUM EXTENT_CSUM 12009865216) itemoff 16155 itemsize 128 extent csum item single item 0 key (EXTENT_CSUM EXTENT_CSUM 2168717312) itemoff 16155 itemsize 128 extent csum item They're both the same size. They both contain the same data. So it looks like parity is not separately checksummed. If there's a missing 64KiB data strip (bad sector, or dead drive), the reconstruction of that strip from parity should match available csums for those blocks. So in this way it's possible to infer if the parity strip is bad. But, it also means assuming everything else about this full stripe: the remaining data strips and their csums, are correct. > So in your example of degraded writes, no matter what the on disk > format makes it discoverable there is a problem: > > A. The "updating" is still always COW so there is no overwriting. There is RMW code in btrfs/raid56.c but I don't know when that gets triggered. With simple files changing one character with vi and gedit, I get completely different logical and physical numbers with each change, so it's clearly cowing the entire stripe (192KiB in my 3 dev raid5). [root@f24s ~]# filefrag -v /mnt/5/64k-a-then64k-b.txt Filesystem type is: 9123683e File size of /mnt/5/64k-a-then64k-b.txt is 131072 (32 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 31: 2931744.. 2931775: 32: last,eof /mnt/5/64k-a-then64k-b.txt: 1 extent found [root@f24s ~]# btrfs-map-logical -l $[4096*2931744] /dev/VG/a mirror 1 logical 12008423424 physical 1114112 device /dev/mapper/VG-b mirror 2 logical 12008423424 physical 34668544 device /dev/mapper/VG-a [root@f24s ~]# vi /mnt/5/64k-a-then64k-b.txt [root@f24s ~]# filefrag -v /mnt/5/64k-a-then64k-b.txt Filesystem type is: 9123683e File size of /mnt/5/64k-a-then64k-b.txt is 131072 (32 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 31: 2931776.. 2931807: 32: last,eof /mnt/5/64k-a-then64k-b.txt: 1 extent found [root@f24s ~]# btrfs-map-logical -l $[4096*29317776] /dev/VG/a No extent found at range [120085610496,120085626880) [root@f24s ~]# btrfs-map-logical -l $[4096*2931776] /dev/VG/a mirror 1 logical 12008554496 physical 1108475904 device /dev/mapper/VG-c mirror 2 logical 12008554496 physical 1179648 device /dev/mapper/VG-b [root@f24s ~]# There is a neat bug/rfe I found for btrfs-map-logical, it doesn't report back the physical locations for all num_stripes on the volume. It only spits back two, and sometimes it's the two data strips, sometimes it's one data and one parity strip. [1] https://bugzilla.kernel.org/show_bug.cgi?id=120941
On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli <kreijack@inwind.it> wrote:
>
> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the checksum.

Yeah, I'm kinda confused on this point.

https://btrfs.wiki.kernel.org/index.php/RAID56

It says there is a write hole for Btrfs, but it defines it in terms of
parity possibly being stale after a crash. I think the term comes not from
merely parity being wrong, but from parity being wrong *and* then being
used to wrongly reconstruct data because it's blindly trusted.

I don't read code well enough, but I'd be surprised if Btrfs reconstructs
from parity and doesn't then check the resulting reconstructed data against
its EXTENT_CSUM.
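(For what it's worth, the behaviour hoped for here -- rebuild from parity, then refuse the result if it fails the stored checksum -- looks like the toy sketch below. This is only an illustration of the idea, with XOR parity and zlib's crc32 standing in for the real csum; whether the actual raid56 paths always behave this way is exactly the open question.)

from zlib import crc32

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def degraded_read(missing_csum, surviving_data, parity):
    # Reconstruct the missing block, then verify it instead of trusting parity.
    rebuilt = xor(surviving_data, parity)
    if crc32(rebuilt) != missing_csum:
        raise IOError("parity is stale: reconstruction rejected, not returned")
    return rebuilt

d1, d2 = b"old contents 1__", b"old contents 2__"
good_parity = xor(d1, d2)
stale_parity = xor(d1, b"newer contents__")   # parity updated for a write d2 never saw

assert degraded_read(crc32(d1), d2, good_parity) == d1
try:
    degraded_read(crc32(d1), d2, stale_parity)
except IOError as e:
    print(e)    # detected and reported, not silently returned as data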
On Thu, Jun 23, 2016 at 09:32:50PM +0200, Goffredo Baroncelli wrote: > The raid write hole happens when a stripe is not completely written > on the platters: the parity and the related data mismatch. In this > case a "simple" raid5 may return wrong data if the parity is used to > compute the data. But this happens because a "simple" raid5 is unable > to detected if the returned data is right or none. > > The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the > checksum. Checksums do not help with the raid5 write hole. The way btrfs does checksums might even make it worse. ZFS reduces the number of disks in a stripe when a disk failure is detected so that writes are always in non-degraded mode, and they presumably avoid sub-stripe-width data allocations or use journalling to avoid the write hole. btrfs seems to use neither tactic. At best, btrfs will avoid creating new block groups on disks that are missing at mount time, and it doesn't deal with sub-stripe-width allocations at all. I'm working from two assumptions as I haven't found all the relevant code yet: 1. btrfs writes parity stripes at fixed locations relative to the data in the same stripe. If this is true, then the parity blocks are _not_ CoW while the data blocks and their checksums _are_ CoW. I don't know if the parity block checksums are also CoW. 2. btrfs sometimes puts data from two different transactions in the same stripe at the same time--a fundamental violation of the CoW concept. I inferred this from the logical block addresses. Unless I'm missing something in the code somewhere, parity blocks can have out-of-date checksums for short periods of time between flushes and commits. This would lose data by falsely reporting valid parity blocks as checksum failures. If any *single* failure occurs at the same time (such as a missing write or disk failure) a small amount of data will be lost. > BTRFS is able to discard the wrong data: i.e. in case of a 3 disks > raid5, the right data may be extracted from the data1+data2 or if the > checksum doesn't match from data1+parity or if the checksum doesn't > match from data2+parity. Suppose we have a sequence like this (3-disk RAID5 array, one stripe containing 2 data and 1 parity block) starting with the stripe empty: 1. write data block 1 to disk 1 of stripe (parity is now invalid, no checksum yet) 2. write parity block to disk 3 in stripe (parity becomes valid again, no checksum yet) 3. commit metadata pointing to block 1 (parity and checksums now valid) 4. write data block 2 to disk 2 of stripe (parity and parity checksum now invalid) 5. write parity block to disk 3 in stripe (parity valid now, parity checksum still invalid) 6. commit metadata pointing to block 2 (parity and checksums now valid) We can be interrupted at any point between step 1 and 4 with no data loss. Before step 3 the data and parity blocks are not part of the extent tree so their contents are irrelevant. After step 3 (assuming each step is completed in order) data block 1 is part of the extent tree and can be reconstructed if any one disk fails. This is the part of btrfs raid5 that works. If we are interrupted between steps 4 and 6 (e.g. power fails), a single disk failure or corruption will cause data loss in block 1. Note that block 1 is *not* written between steps 4 and 6, so we are retroactively damaging some previously written data that is not part of the current transaction. 
If we are interrupted between steps 4 and 5, we can no longer reconstruct block 1 (block2 ^ parity) or block 2 (block1 ^ parity) because the parity block doesn't match the data blocks in the same stripe (i.e. block1 ^ block2 != parity). If we are interrupted between step 5 and 6, the parity block checksum committed at step 3 will fail. Data block 2 will not be accessible since the metadata was not written to point to it, but data block 1 will be intact, readable, and have a correct checksum as long as none of the disks fail. This can be repaired by a scrub (scrub will simply throw the parity block away and reconstruct it from block1 and block2). If disk 1 fails before the next scrub, data block 1 will be lost because btrfs will believe the parity block is incorrect even though it is not. This risk happens on *every* write to a stripe that is not a full stripe write and contains existing committed data blocks. It will occur more often on full and heavily fragmented filesystems (filesystems which have these properties are more likely to write new data on stripes that already contain old data). In cases where an entire stripe is written at once, or a stripe is partially filled but no further writes ever modify the stripe, everything works as intended in btrfs. > NOTE2: this works if only one write is corrupted. If more writes (== > more disks) are involved, you got checksum mismatch. If more than one > write are corrupted, raid5 is unable to protect you. No write corruption is required for data loss. Data can be lost with any single disk failure. > In case of "degraded mode", you don't have any redundancy. So if a > stripe of a degraded filesystem is not fully written to the disk, > is like a block not fully written to the disk. And you have checksums > mismatch. But this is not what is called raid write hole. If a block is not fully written to the disk, btrfs should not update the metadata tree to point to it, so no (committed) data will be lost. > On 2016-06-22 22:35, Zygo Blaxell wrote: > > If in the future btrfs allocates physical block 2412725692 to > > a different file, up to 3 other blocks in this file (most likely > > 2412725689..2412725691) could be lost if a crash or disk I/O error also > > occurs during the same transaction. btrfs does do this--in fact, the > > _very next block_ allocated by the filesystem is 2412725692: > > > > # head -c 4096 < /dev/urandom >> f; sync; filefrag -v f > > Filesystem type is: 9123683e > > File size of f is 45056 (11 blocks of 4096 bytes) > > ext: logical_offset: physical_offset: length: expected: flags: > > 0: 0.. 0: 2412725689..2412725689: 1: > > 1: 1.. 1: 2412725690..2412725690: 1: > > 2: 2.. 2: 2412725691..2412725691: 1: > > 3: 3.. 3: 2412725701..2412725701: 1: 2412725692: > > 4: 4.. 4: 2412725693..2412725693: 1: 2412725702: > > 5: 5.. 5: 2412725694..2412725694: 1: > > 6: 6.. 6: 2412725695..2412725695: 1: > > 7: 7.. 7: 2412725698..2412725698: 1: 2412725696: > > 8: 8.. 8: 2412725699..2412725699: 1: > > 9: 9.. 9: 2412725700..2412725700: 1: > > 10: 10.. 10: 2412725692..2412725692: 1: 2412725701: last,eof > > f: 5 extents found > > You are assuming that if you touch a block, all the blocks of the same > stripe spread over the disks are involved. I disagree. The only parts > which are involved, are the part of stripe which contains the changed > block and the parts which contains the parity. Any block change always affects all others in the stripe. 
With checksums, every write temporarily puts each stripe in degraded mode (although this only happens to the parity blocks as the data blocks are protected by CoW algorithms). Checksums are CoW and updated in strictly-ordered transactions with write barriers; stripes are modified in-place and they are a function of the contents of many non-atomically-updated disks. > If both the parts become corrupted, RAID5 is unable to protect you > (two failure, when raid 5 has only _one_ redundancy). But if only > one of these is corrupted, BTRFS with the help of the checksum is > capable to detect which one is corrupted and to return good data > (and to rebuild the bad parts). This is only true for stripes that are not being updated. During a stripe update, btrfs raid5 seems to be much less tolerant, only one disk failure away from data loss (of a single stripe) instead of the expected two. All this rests on my assumptions above. If those turn out to be wrong, then most of the above is wrong too, and I need a new theory to explain why almost every unplanned reboot corrupts a small amount of data on big raid5 filesystems.
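(The six-step sequence above can be replayed as a toy model to make the window concrete. This is a sketch under the same assumptions stated in the mail -- parity at a fixed location, not CoW -- and the step numbers mirror the list above.)

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

EMPTY = b"\0" * 4
disk1, disk2, disk3 = EMPTY, EMPTY, EMPTY          # data, data, parity

disk1 = b"blk1"                                    # step 1: write data block 1
disk3 = xor(disk1, disk2)                          # step 2: write parity
committed = {"block1": disk1}                      # step 3: commit metadata for block 1

disk2 = b"blk2"                                    # step 4: write data block 2 (same stripe,
                                                   #         later transaction) -- parity is now stale
# --- crash here, then disk 1 fails ---
rebuilt_block1 = xor(disk2, disk3)                 # parity still reflects the old disk2 contents
assert rebuilt_block1 != committed["block1"]       # committed block 1 cannot be reconstructed

# had the crash not happened:
disk3 = xor(disk1, disk2)                          # step 5: rewrite parity
rebuilt_block1 = xor(disk2, disk3)
assert rebuilt_block1 == committed["block1"]       # stripe is consistent again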
On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote: > On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli <kreijack@inwind.it> wrote: > > The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the checksum. > > Yeah I'm kinda confused on this point. > > https://btrfs.wiki.kernel.org/index.php/RAID56 > > It says there is a write hole for Btrfs. But defines it in terms of > parity possibly being stale after a crash. I think the term comes not > from merely parity being wrong but parity being wrong *and* then being > used to wrongly reconstruct data because it's blindly trusted. I think the opposite is more likely, as the layers above raid56 seem to check the data against sums before raid56 ever sees it. (If those layers seem inverted to you, I agree, but OTOH there are probably good reason to do it that way). It looks like uncorrectable failures might occur because parity is correct, but the parity checksum is out of date, so the parity checksum doesn't match even though data blindly reconstructed from the parity *would* match the data. > I don't read code well enough, but I'd be surprised if Btrfs > reconstructs from parity and doesn't then check the resulting > reconstructed data to its EXTENT_CSUM. I wouldn't be surprised if both things happen in different code paths, given the number of different paths leading into the raid56 code and the number of distinct failure modes it seems to have.
On Thu, Jun 23, 2016 at 05:37:09PM -0600, Chris Murphy wrote: > > So in your example of degraded writes, no matter what the on disk > > format makes it discoverable there is a problem: > > > > A. The "updating" is still always COW so there is no overwriting. > > There is RMW code in btrfs/raid56.c but I don't know when that gets > triggered. RMW seems to be for cases where part of a stripe is modified but the entire stripe has not yet been read into memory. It reads the remaining blocks (reconstructing missing blocks if necessary) then calculates new parity blocks. > With simple files changing one character with vi and gedit, > I get completely different logical and physical numbers with each > change, so it's clearly cowing the entire stripe (192KiB in my 3 dev > raid5). You are COWing the entire file because vi and gedit do truncate followed by full-file write. Try again with 'dd conv=notrunc bs=4k count=1 seek=N of=...' or edit the file with a sector-level hex editor. > [root@f24s ~]# filefrag -v /mnt/5/64k-a-then64k-b.txt > Filesystem type is: 9123683e > File size of /mnt/5/64k-a-then64k-b.txt is 131072 (32 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 31: 2931744.. 2931775: 32: last,eof > /mnt/5/64k-a-then64k-b.txt: 1 extent found > [root@f24s ~]# btrfs-map-logical -l $[4096*2931744] /dev/VG/a > mirror 1 logical 12008423424 physical 1114112 device /dev/mapper/VG-b > mirror 2 logical 12008423424 physical 34668544 device /dev/mapper/VG-a > [root@f24s ~]# vi /mnt/5/64k-a-then64k-b.txt > [root@f24s ~]# filefrag -v /mnt/5/64k-a-then64k-b.txt > Filesystem type is: 9123683e > File size of /mnt/5/64k-a-then64k-b.txt is 131072 (32 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 31: 2931776.. 2931807: 32: last,eof > /mnt/5/64k-a-then64k-b.txt: 1 extent found > [root@f24s ~]# btrfs-map-logical -l $[4096*29317776] /dev/VG/a > No extent found at range [120085610496,120085626880) > [root@f24s ~]# btrfs-map-logical -l $[4096*2931776] /dev/VG/a > mirror 1 logical 12008554496 physical 1108475904 device /dev/mapper/VG-c > mirror 2 logical 12008554496 physical 1179648 device /dev/mapper/VG-b > [root@f24s ~]# > > There is a neat bug/rfe I found for btrfs-map-logical, it doesn't > report back the physical locations for all num_stripes on the volume. > It only spits back two, and sometimes it's the two data strips, > sometimes it's one data and one parity strip. > > > [1] > https://bugzilla.kernel.org/show_bug.cgi?id=120941 > > > -- > Chris Murphy >
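(For reference, the parity arithmetic behind the RMW update described above can be written two equivalent ways: recompute parity from every data block in the stripe, which is what the description amounts to, or the classic RMW shortcut of XOR-ing out the old data and XOR-ing in the new. A hedged illustration only, not the actual raid56.c flow.)

from functools import reduce

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def parity_full_recompute(data_blocks):
    # Read every data block in the stripe and recompute parity from scratch.
    return reduce(xor, data_blocks)

def parity_rmw(old_parity, old_block, new_block):
    # Classic read-modify-write shortcut: only the changed block and parity are read.
    return xor(xor(old_parity, old_block), new_block)

stripe = [b"aaaa", b"bbbb", b"cccc"]
parity = parity_full_recompute(stripe)

new_block = b"zzzz"
p1 = parity_rmw(parity, stripe[1], new_block)
stripe[1] = new_block
p2 = parity_full_recompute(stripe)
assert p1 == p2        # both approaches yield the same new parity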
On Thu, Jun 23, 2016 at 05:37:09PM -0600, Chris Murphy wrote: > > I expect that parity is in this data block group, and therefore is > > checksummed the same as any other data in that block group. > > This appears to be wrong. Comparing the same file, one file only, one > two new Btrfs volumes, one volume single, one volume raid5, I get a > single csum tree entry: > > raid5 > item 0 key (EXTENT_CSUM EXTENT_CSUM 12009865216) itemoff 16155 itemsize 128 > extent csum item > > single > > item 0 key (EXTENT_CSUM EXTENT_CSUM 2168717312) itemoff 16155 itemsize 128 > extent csum item > > They're both the same size. They both contain the same data. So it > looks like parity is not separately checksummed. I'm inclined to agree because I didn't find any code that *writes* parity csums...but if there are no parity csums, what does this code do? scrub.c: static noinline_for_stack int scrub_raid56_parity(struct scrub_ctx *sctx, [...] ret = btrfs_lookup_csums_range(csum_root, extent_logical, extent_logical + extent_len - 1, &sctx->csum_list, 1); if (ret) goto out; ret = scrub_extent_for_parity(sparity, extent_logical, extent_len, extent_physical, extent_dev, flags, generation, extent_mirror_num);
24.06.2016 04:47, Zygo Blaxell пишет: > On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote: >> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli <kreijack@inwind.it> wrote: >>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the checksum. >> >> Yeah I'm kinda confused on this point. >> >> https://btrfs.wiki.kernel.org/index.php/RAID56 >> >> It says there is a write hole for Btrfs. But defines it in terms of >> parity possibly being stale after a crash. I think the term comes not >> from merely parity being wrong but parity being wrong *and* then being >> used to wrongly reconstruct data because it's blindly trusted. > > I think the opposite is more likely, as the layers above raid56 > seem to check the data against sums before raid56 ever sees it. > (If those layers seem inverted to you, I agree, but OTOH there are > probably good reason to do it that way). > Yes, that's how I read code as well. btrfs layer that does checksumming is unaware of parity blocks at all; for all practical purposes they do not exist. What happens is approximately 1. logical extent is allocated and checksum computed 2. it is mapped to physical area(s) on disks, skipping over what would be parity blocks 3. when these areas are written out, RAID56 parity is computed and filled in IOW btrfs checksums are for (meta)data and RAID56 parity is not data. > It looks like uncorrectable failures might occur because parity is > correct, but the parity checksum is out of date, so the parity checksum > doesn't match even though data blindly reconstructed from the parity > *would* match the data. > Yep, that is how I read it too. So if your data is checksummed, it should at least avoid silent corruption. >> I don't read code well enough, but I'd be surprised if Btrfs >> reconstructs from parity and doesn't then check the resulting >> reconstructed data to its EXTENT_CSUM. > > I wouldn't be surprised if both things happen in different code paths, > given the number of different paths leading into the raid56 code and > the number of distinct failure modes it seems to have. > Well, the problem is that parity block cannot be redirected on write as data blocks; which makes it impossible to version control it. The only solution I see is to always use full stripe writes by either wasting time in fixed width stripe or using variable width, so that every stripe always gets new version of parity. This makes it possible to keep parity checksums like data checksums.
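(Step 2 of that description -- mapping logical stripe units onto disks while skipping over what would be parity blocks -- might look roughly like the sketch below for a rotating-parity layout. The rotation pattern and 64K strip size are assumptions for illustration only; the real mapping lives in the chunk tree, not here.)

NUM_DISKS = 3
STRIPE_LEN = 64 * 1024          # one strip per disk per stripe row

def map_logical(unit):
    # Return (disk, row) for the unit-th logical 64K strip; logical addresses
    # never land on a parity unit, which is why the csum tree never sees parity.
    data_per_row = NUM_DISKS - 1
    row = unit // data_per_row
    parity_disk = row % NUM_DISKS              # assumed rotation pattern
    disk = unit % data_per_row
    if disk >= parity_disk:
        disk += 1                              # skip over the parity position
    return disk, row

for unit in range(6):
    disk, row = map_logical(unit)
    print(f"logical strip {unit}: disk {disk}, row {row}, "
          f"offset {row * STRIPE_LEN} within the device extent")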
On Thu, Jun 23, 2016 at 8:07 PM, Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote: >> With simple files changing one character with vi and gedit, >> I get completely different logical and physical numbers with each >> change, so it's clearly cowing the entire stripe (192KiB in my 3 dev >> raid5). > > You are COWing the entire file because vi and gedit do truncate followed > by full-file write. I'm seeing the file inode changes with either a vi or gedit modification, even when file size is exactly the same, just character substitute. So as far as VFS and Btrfs are concerned, it's an entirely different file, so it's like faux-CoW that would have happened on any file system, not an overwrite. > Try again with 'dd conv=notrunc bs=4k count=1 seek=N of=...' or > edit the file with a sector-level hex editor. The inode is now the same, one of the 4096 byte blocks is dereferenced, a new 4096 byte block is referenced, and written, the other 3 blocks remain untouched, the other files in the stripe remain untouched. So it's pretty clearly cow'd in this case. [root@f24s ~]# filefrag -v /mnt/5/* Filesystem type is: 9123683e File size of /mnt/5/a.txt is 16383 (4 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 3: 2931712.. 2931715: 4: last,eof /mnt/5/a.txt: 1 extent found File size of /mnt/5/b.txt is 16383 (4 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 3: 2931716.. 2931719: 4: last,eof /mnt/5/b.txt: 1 extent found File size of /mnt/5/c.txt is 16383 (4 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 3: 2931720.. 2931723: 4: last,eof /mnt/5/c.txt: 1 extent found File size of /mnt/5/d.txt is 16383 (4 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 3: 2931724.. 2931727: 4: last,eof /mnt/5/d.txt: 1 extent found File size of /mnt/5/e.txt is 16383 (4 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 3: 2931728.. 2931731: 4: last,eof /mnt/5/e.txt: 1 extent found [root@f24s ~]# ls -li /mnt/5/* 285 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/a.txt 286 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/b.txt 287 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/c.txt 288 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/d.txt 289 -rw-r--r--. 
1 root root 16383 Jun 23 22:57 /mnt/5/e.txt [root@f24s ~]# btrfs-map-logical -l $[4096*2931712] /dev/VG/a mirror 1 logical 12008292352 physical 34603008 device /dev/mapper/VG-a mirror 2 logical 12008292352 physical 1108344832 device /dev/mapper/VG-c [root@f24s ~]# btrfs-map-logical -l $[4096*2931716] /dev/VG/a mirror 1 logical 12008308736 physical 34619392 device /dev/mapper/VG-a mirror 2 logical 12008308736 physical 1108361216 device /dev/mapper/VG-c [root@f24s ~]# btrfs-map-logical -l $[4096*2931720] /dev/VG/a mirror 1 logical 12008325120 physical 34635776 device /dev/mapper/VG-a mirror 2 logical 12008325120 physical 1108377600 device /dev/mapper/VG-c [root@f24s ~]# btrfs-map-logical -l $[4096*2931724] /dev/VG/a mirror 1 logical 12008341504 physical 34652160 device /dev/mapper/VG-a mirror 2 logical 12008341504 physical 1108393984 device /dev/mapper/VG-c [root@f24s ~]# btrfs-map-logical -l $[4096*2931728] /dev/VG/a mirror 1 logical 12008357888 physical 1048576 device /dev/mapper/VG-b mirror 2 logical 12008357888 physical 1108344832 device /dev/mapper/VG-c [root@f24s ~]# echo -n "g" | dd of=/mnt/5/a.txt conv=notrunc 0+1 records in 0+1 records out 1 byte copied, 0.000314582 s, 3.2 kB/s [root@f24s ~]# ls -li /mnt/5/* 285 -rw-r--r--. 1 root root 16383 Jun 23 23:06 /mnt/5/a.txt 286 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/b.txt 287 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/c.txt 288 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/d.txt 289 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/e.txt [root@f24s ~]# filefrag -v /mnt/5/* Filesystem type is: 9123683e File size of /mnt/5/a.txt is 16383 (4 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 0: 2931732.. 2931732: 1: 1: 1.. 3: 2931713.. 2931715: 3: 2931733: last,eof /mnt/5/a.txt: 2 extents found File size of /mnt/5/b.txt is 16383 (4 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 3: 2931716.. 2931719: 4: last,eof /mnt/5/b.txt: 1 extent found File size of /mnt/5/c.txt is 16383 (4 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 3: 2931720.. 2931723: 4: last,eof /mnt/5/c.txt: 1 extent found File size of /mnt/5/d.txt is 16383 (4 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 3: 2931724.. 2931727: 4: last,eof /mnt/5/d.txt: 1 extent found File size of /mnt/5/e.txt is 16383 (4 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 3: 2931728.. 2931731: 4: last,eof /mnt/5/e.txt: 1 extent found [root@f24s ~]# btrfs-map-logical -l $[4096*2931732] /dev/VG/a mirror 1 logical 12008374272 physical 1064960 device /dev/mapper/VG-b mirror 2 logical 12008374272 physical 1108361216 device /dev/mapper/VG-c It has been cow'd. 
[root@f24s ~]# dd if=/dev/VG/b bs=1 skip=1064960 count=4096 2>/dev/null | hexdump -C 00000000 67 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 |gaaaaaaaaaaaaaaa| 00000010 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 |aaaaaaaaaaaaaaaa| * 00001000 [root@f24s ~]# dd if=/dev/VG/c bs=1 skip=1108361216 count=4096 2>/dev/null | hexdump -C 00000000 05 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 |................| 00000010 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 |................| * 00001000 [root@f24s ~]# btrfs-map-logical -l $[4096*2931712] /dev/VG/a mirror 1 logical 12008292352 physical 34603008 device /dev/mapper/VG-a mirror 2 logical 12008292352 physical 1108344832 device /dev/mapper/VG-c [root@f24s ~]# dd if=/dev/VG/a bs=1 skip=34603008 count=4096 2>/dev/null | hexdump -C 00000000 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 |aaaaaaaaaaaaaaaa| * 00001000 So at the old address, it shows the "aaaaa..." is still there. And at the added single block for this file at new logical and physical addresses, is the modification substituting the first "a" for "g". In this case, no rmw, no partial stripe modification, and no data already on-disk is at risk. Even the metadata leaf/node is cow'd, it has a new logical and physical address as well, which contains information for all five files.
On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote: > 24.06.2016 04:47, Zygo Blaxell пишет: > > On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote: > >> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli <kreijack@inwind.it> wrote: > >>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the checksum. > >> > >> Yeah I'm kinda confused on this point. > >> > >> https://btrfs.wiki.kernel.org/index.php/RAID56 > >> > >> It says there is a write hole for Btrfs. But defines it in terms of > >> parity possibly being stale after a crash. I think the term comes not > >> from merely parity being wrong but parity being wrong *and* then being > >> used to wrongly reconstruct data because it's blindly trusted. > > > > I think the opposite is more likely, as the layers above raid56 > > seem to check the data against sums before raid56 ever sees it. > > (If those layers seem inverted to you, I agree, but OTOH there are > > probably good reason to do it that way). > > > > Yes, that's how I read code as well. btrfs layer that does checksumming > is unaware of parity blocks at all; for all practical purposes they do > not exist. What happens is approximately > > 1. logical extent is allocated and checksum computed > 2. it is mapped to physical area(s) on disks, skipping over what would > be parity blocks > 3. when these areas are written out, RAID56 parity is computed and filled in > > IOW btrfs checksums are for (meta)data and RAID56 parity is not data. Checksums are not parity, correct. However, every data block (including, I think, the parity) is checksummed and put into the csum tree. This allows the FS to determine where damage has occurred, rather thansimply detecting that it has occurred (which would be the case if the parity doesn't match the data, or if the two copies of a RAID-1 array don't match). (Note that csums for metadata are stored in the metadata block itself, not in the csum tree). Hugo. > > It looks like uncorrectable failures might occur because parity is > > correct, but the parity checksum is out of date, so the parity checksum > > doesn't match even though data blindly reconstructed from the parity > > *would* match the data. > > > > Yep, that is how I read it too. So if your data is checksummed, it > should at least avoid silent corruption. > > >> I don't read code well enough, but I'd be surprised if Btrfs > >> reconstructs from parity and doesn't then check the resulting > >> reconstructed data to its EXTENT_CSUM. > > > > I wouldn't be surprised if both things happen in different code paths, > > given the number of different paths leading into the raid56 code and > > the number of distinct failure modes it seems to have. > > > > Well, the problem is that parity block cannot be redirected on write as > data blocks; which makes it impossible to version control it. The only > solution I see is to always use full stripe writes by either wasting > time in fixed width stripe or using variable width, so that every stripe > always gets new version of parity. This makes it possible to keep parity > checksums like data checksums. >
On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills <hugo@carfax.org.uk> wrote: > On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote: >> 24.06.2016 04:47, Zygo Blaxell пишет: >> > On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote: >> >> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli <kreijack@inwind.it> wrote: >> >>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the checksum. >> >> >> >> Yeah I'm kinda confused on this point. >> >> >> >> https://btrfs.wiki.kernel.org/index.php/RAID56 >> >> >> >> It says there is a write hole for Btrfs. But defines it in terms of >> >> parity possibly being stale after a crash. I think the term comes not >> >> from merely parity being wrong but parity being wrong *and* then being >> >> used to wrongly reconstruct data because it's blindly trusted. >> > >> > I think the opposite is more likely, as the layers above raid56 >> > seem to check the data against sums before raid56 ever sees it. >> > (If those layers seem inverted to you, I agree, but OTOH there are >> > probably good reason to do it that way). >> > >> >> Yes, that's how I read code as well. btrfs layer that does checksumming >> is unaware of parity blocks at all; for all practical purposes they do >> not exist. What happens is approximately >> >> 1. logical extent is allocated and checksum computed >> 2. it is mapped to physical area(s) on disks, skipping over what would >> be parity blocks >> 3. when these areas are written out, RAID56 parity is computed and filled in >> >> IOW btrfs checksums are for (meta)data and RAID56 parity is not data. > > Checksums are not parity, correct. However, every data block > (including, I think, the parity) is checksummed and put into the csum > tree. This allows the FS to determine where damage has occurred, > rather thansimply detecting that it has occurred (which would be the > case if the parity doesn't match the data, or if the two copies of a > RAID-1 array don't match). > Yes, that is what I wrote below. But that means that RAID5 with one degraded disk won't be able to reconstruct data on this degraded disk because reconstructed extent content won't match checksum. Which kinda makes RAID5 pointless. ... > >> > It looks like uncorrectable failures might occur because parity is >> > correct, but the parity checksum is out of date, so the parity checksum >> > doesn't match even though data blindly reconstructed from the parity >> > *would* match the data. >> > >> >> Yep, that is how I read it too. So if your data is checksummed, it >> should at least avoid silent corruption. >> -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Jun 24, 2016 at 8:20 AM, Chris Murphy <lists@colorremedies.com> wrote: > [root@f24s ~]# filefrag -v /mnt/5/* > Filesystem type is: 9123683e > File size of /mnt/5/a.txt is 16383 (4 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 3: 2931712.. 2931715: 4: last,eof Hmm ... I wonder what is wrong here (openSUSE Tumbleweed) nohostname:~ # filefrag -v /mnt/1 Filesystem type is: 9123683e File size of /mnt/1 is 3072 (1 block of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 0: 269376.. 269376: 1: last,eof /mnt/1: 1 extent found But! nohostname:~ # filefrag -v /etc/passwd Filesystem type is: 9123683e File size of /etc/passwd is 1527 (1 block of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 4095: 0.. 4095: 4096: last,not_aligned,inline,eof /etc/passwd: 1 extent found nohostname:~ # Why it works for one filesystem but does not work for an other one? ... > > So at the old address, it shows the "aaaaa..." is still there. And at > the added single block for this file at new logical and physical > addresses, is the modification substituting the first "a" for "g". > > In this case, no rmw, no partial stripe modification, and no data > already on-disk is at risk. You misunderstand the nature of problem. What is put at risk is data that is already on disk and "shares" parity with new data. As example, here are the first 64K in several extents on 4 disk RAID5 with so far single data chunk item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 1103101952) itemoff 15491 itemsize 176 chunk length 3221225472 owner 2 stripe_len 65536 type DATA|RAID5 num_stripes 4 stripe 0 devid 4 offset 9437184 dev uuid: ed13e42e-1633-4230-891c-897e86d1c0be stripe 1 devid 3 offset 9437184 dev uuid: 10032b95-3f48-4ea0-a9ee-90064c53da1f stripe 2 devid 2 offset 1074790400 dev uuid: cd749bd9-3d72-43b4-89a8-45e4a92658cf stripe 3 devid 1 offset 1094713344 dev uuid: 41538b9f-3869-4c32-b3e2-30aa2ea1534e dev extent chunk_tree 3 chunk objectid 256 chunk offset 1103101952 length 1073741824 item 5 key (1 DEV_EXTENT 1094713344) itemoff 16027 itemsize 48 dev extent chunk_tree 3 chunk objectid 256 chunk offset 1103101952 length 1073741824 item 7 key (2 DEV_EXTENT 1074790400) itemoff 15931 itemsize 48 dev extent chunk_tree 3 chunk objectid 256 chunk offset 1103101952 length 1073741824 item 9 key (3 DEV_EXTENT 9437184) itemoff 15835 itemsize 48 dev extent chunk_tree 3 chunk objectid 256 chunk offset 1103101952 length 1073741824 item 11 key (4 DEV_EXTENT 9437184) itemoff 15739 itemsize 48 dev extent chunk_tree 3 chunk objectid 256 chunk offset 1103101952 length 1073741824 where devid 1 = sdb1, 2 = sdc1 etc. Now let's write some data (I created several files) up to 64K in size: mirror 1 logical 1103364096 physical 1074855936 device /dev/sdc1 mirror 2 logical 1103364096 physical 9502720 device /dev/sde1 mirror 1 logical 1103368192 physical 1074860032 device /dev/sdc1 mirror 2 logical 1103368192 physical 9506816 device /dev/sde1 mirror 1 logical 1103372288 physical 1074864128 device /dev/sdc1 mirror 2 logical 1103372288 physical 9510912 device /dev/sde1 mirror 1 logical 1103376384 physical 1074868224 device /dev/sdc1 mirror 2 logical 1103376384 physical 9515008 device /dev/sde1 mirror 1 logical 1103380480 physical 1074872320 device /dev/sdc1 mirror 2 logical 1103380480 physical 9519104 device /dev/sde1 Note that btrfs allocates 64K on the same device before switching to the next one. 
What is a bit misleading here is that sdc1 is data and sde1 is parity (you
can see it in the checksum tree, where only items for sdc1 exist).

Now let's write the next 64K and see what happens

nohostname:~ # btrfs-map-logical -l 1103429632 -b 65536 /dev/sdb1
mirror 1 logical 1103429632 physical 1094778880 device /dev/sdb1
mirror 2 logical 1103429632 physical 9502720 device /dev/sde1

See? btrfs now allocates a new stripe on sdb1; this stripe is at the same
offset as the previous one on sdc1 (64K) and so shares the same parity
stripe on sde1. If you compare the 64K on sde1 at offset 9502720 before and
after, you will see that it has changed. IN PLACE. Without CoW.

This is exactly what puts the existing data on sdc1 at risk - if sdb1 is
updated but sde1 is not, an attempt to reconstruct the data on sdc1 will
either fail (if we have checksums) or result in silent corruption.
On Fri, Jun 24, 2016 at 12:52:21PM +0300, Andrei Borzenkov wrote: > On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills <hugo@carfax.org.uk> wrote: > > On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote: > >> 24.06.2016 04:47, Zygo Blaxell пишет: > >> > On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote: > >> >> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli <kreijack@inwind.it> wrote: > >> >>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the checksum. > >> >> > >> >> Yeah I'm kinda confused on this point. > >> >> > >> >> https://btrfs.wiki.kernel.org/index.php/RAID56 > >> >> > >> >> It says there is a write hole for Btrfs. But defines it in terms of > >> >> parity possibly being stale after a crash. I think the term comes not > >> >> from merely parity being wrong but parity being wrong *and* then being > >> >> used to wrongly reconstruct data because it's blindly trusted. > >> > > >> > I think the opposite is more likely, as the layers above raid56 > >> > seem to check the data against sums before raid56 ever sees it. > >> > (If those layers seem inverted to you, I agree, but OTOH there are > >> > probably good reason to do it that way). > >> > > >> > >> Yes, that's how I read code as well. btrfs layer that does checksumming > >> is unaware of parity blocks at all; for all practical purposes they do > >> not exist. What happens is approximately > >> > >> 1. logical extent is allocated and checksum computed > >> 2. it is mapped to physical area(s) on disks, skipping over what would > >> be parity blocks > >> 3. when these areas are written out, RAID56 parity is computed and filled in > >> > >> IOW btrfs checksums are for (meta)data and RAID56 parity is not data. > > > > Checksums are not parity, correct. However, every data block > > (including, I think, the parity) is checksummed and put into the csum > > tree. This allows the FS to determine where damage has occurred, > > rather thansimply detecting that it has occurred (which would be the > > case if the parity doesn't match the data, or if the two copies of a > > RAID-1 array don't match). > > > > Yes, that is what I wrote below. But that means that RAID5 with one > degraded disk won't be able to reconstruct data on this degraded disk > because reconstructed extent content won't match checksum. Which kinda > makes RAID5 pointless. Eh? How do you come to that conclusion? For data, say you have n-1 good devices, with n-1 blocks on them. Each block has a checksum in the metadata, so you can read that checksum, read the blocks, and verify that they're not damaged. From those n-1 known-good blocks (all data, or one parity and the rest data) you can reconstruct the remaining block. That reconstructed block won't be checked against the csum for the missing block -- it'll just be written and a new csum for it written with it. Hugo. > ... > > > >> > It looks like uncorrectable failures might occur because parity is > >> > correct, but the parity checksum is out of date, so the parity checksum > >> > doesn't match even though data blindly reconstructed from the parity > >> > *would* match the data. > >> > > >> > >> Yep, that is how I read it too. So if your data is checksummed, it > >> should at least avoid silent corruption. > >>
On Fri, Jun 24, 2016 at 1:16 PM, Hugo Mills <hugo@carfax.org.uk> wrote: > On Fri, Jun 24, 2016 at 12:52:21PM +0300, Andrei Borzenkov wrote: >> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills <hugo@carfax.org.uk> wrote: >> > On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote: >> >> 24.06.2016 04:47, Zygo Blaxell пишет: >> >> > On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote: >> >> >> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli <kreijack@inwind.it> wrote: >> >> >>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the checksum. >> >> >> >> >> >> Yeah I'm kinda confused on this point. >> >> >> >> >> >> https://btrfs.wiki.kernel.org/index.php/RAID56 >> >> >> >> >> >> It says there is a write hole for Btrfs. But defines it in terms of >> >> >> parity possibly being stale after a crash. I think the term comes not >> >> >> from merely parity being wrong but parity being wrong *and* then being >> >> >> used to wrongly reconstruct data because it's blindly trusted. >> >> > >> >> > I think the opposite is more likely, as the layers above raid56 >> >> > seem to check the data against sums before raid56 ever sees it. >> >> > (If those layers seem inverted to you, I agree, but OTOH there are >> >> > probably good reason to do it that way). >> >> > >> >> >> >> Yes, that's how I read code as well. btrfs layer that does checksumming >> >> is unaware of parity blocks at all; for all practical purposes they do >> >> not exist. What happens is approximately >> >> >> >> 1. logical extent is allocated and checksum computed >> >> 2. it is mapped to physical area(s) on disks, skipping over what would >> >> be parity blocks >> >> 3. when these areas are written out, RAID56 parity is computed and filled in >> >> >> >> IOW btrfs checksums are for (meta)data and RAID56 parity is not data. >> > >> > Checksums are not parity, correct. However, every data block >> > (including, I think, the parity) is checksummed and put into the csum >> > tree. This allows the FS to determine where damage has occurred, >> > rather thansimply detecting that it has occurred (which would be the >> > case if the parity doesn't match the data, or if the two copies of a >> > RAID-1 array don't match). >> > >> >> Yes, that is what I wrote below. But that means that RAID5 with one >> degraded disk won't be able to reconstruct data on this degraded disk >> because reconstructed extent content won't match checksum. Which kinda >> makes RAID5 pointless. > > Eh? How do you come to that conclusion? > > For data, say you have n-1 good devices, with n-1 blocks on them. > Each block has a checksum in the metadata, so you can read that > checksum, read the blocks, and verify that they're not damaged. From > those n-1 known-good blocks (all data, or one parity and the rest We do not know whether parity is good or not because as far as I can tell parity is not checksummed. > data) you can reconstruct the remaining block. That reconstructed > block won't be checked against the csum for the missing block -- it'll > just be written and a new csum for it written with it. > So we have silent corruption. I fail to understand how it is an improvement :) > Hugo. > >> ... >> > >> >> > It looks like uncorrectable failures might occur because parity is >> >> > correct, but the parity checksum is out of date, so the parity checksum >> >> > doesn't match even though data blindly reconstructed from the parity >> >> > *would* match the data. >> >> > >> >> >> >> Yep, that is how I read it too. 
So if your data is checksummed, it
>> >> should at least avoid silent corruption.
>> >>
>
> --
> Hugo Mills             | Debugging is like hitting yourself in the head with
> hugo@... carfax.org.uk | hammer: it feels so good when you find the bug, and
> http://carfax.org.uk/  | you're allowed to stop debugging.
> PGP: E2AB1DE4          | PotatoEngineer
On Fri, Jun 24, 2016 at 01:19:30PM +0300, Andrei Borzenkov wrote: > On Fri, Jun 24, 2016 at 1:16 PM, Hugo Mills <hugo@carfax.org.uk> wrote: > > On Fri, Jun 24, 2016 at 12:52:21PM +0300, Andrei Borzenkov wrote: > >> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills <hugo@carfax.org.uk> wrote: > >> > On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote: > >> >> 24.06.2016 04:47, Zygo Blaxell пишет: > >> >> > On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote: > >> >> >> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli <kreijack@inwind.it> wrote: > >> >> >>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the checksum. > >> >> >> > >> >> >> Yeah I'm kinda confused on this point. > >> >> >> > >> >> >> https://btrfs.wiki.kernel.org/index.php/RAID56 > >> >> >> > >> >> >> It says there is a write hole for Btrfs. But defines it in terms of > >> >> >> parity possibly being stale after a crash. I think the term comes not > >> >> >> from merely parity being wrong but parity being wrong *and* then being > >> >> >> used to wrongly reconstruct data because it's blindly trusted. > >> >> > > >> >> > I think the opposite is more likely, as the layers above raid56 > >> >> > seem to check the data against sums before raid56 ever sees it. > >> >> > (If those layers seem inverted to you, I agree, but OTOH there are > >> >> > probably good reason to do it that way). > >> >> > > >> >> > >> >> Yes, that's how I read code as well. btrfs layer that does checksumming > >> >> is unaware of parity blocks at all; for all practical purposes they do > >> >> not exist. What happens is approximately > >> >> > >> >> 1. logical extent is allocated and checksum computed > >> >> 2. it is mapped to physical area(s) on disks, skipping over what would > >> >> be parity blocks > >> >> 3. when these areas are written out, RAID56 parity is computed and filled in > >> >> > >> >> IOW btrfs checksums are for (meta)data and RAID56 parity is not data. > >> > > >> > Checksums are not parity, correct. However, every data block > >> > (including, I think, the parity) is checksummed and put into the csum > >> > tree. This allows the FS to determine where damage has occurred, > >> > rather thansimply detecting that it has occurred (which would be the > >> > case if the parity doesn't match the data, or if the two copies of a > >> > RAID-1 array don't match). > >> > > >> > >> Yes, that is what I wrote below. But that means that RAID5 with one > >> degraded disk won't be able to reconstruct data on this degraded disk > >> because reconstructed extent content won't match checksum. Which kinda > >> makes RAID5 pointless. > > > > Eh? How do you come to that conclusion? > > > > For data, say you have n-1 good devices, with n-1 blocks on them. > > Each block has a checksum in the metadata, so you can read that > > checksum, read the blocks, and verify that they're not damaged. From > > those n-1 known-good blocks (all data, or one parity and the rest > > We do not know whether parity is good or not because as far as I can > tell parity is not checksummed. I was about to write a devastating rebuttal of this... then I actually tested it, and holy crap you're right. 
I've just closed the terminal in question by accident, so I can't copy-and-paste, but the way I checked was: # mkfs.btrfs -mraid1 -draid5 /dev/loop{0,1,2} # mount /dev/loop0 foo # dd if=/dev/urandom of=foo/file bs=4k count=32 # umount /dev/loop0 # btrfs-debug-tree /dev/loop0 then look at the csum tree: item 0 key (EXTENT_CSUM EXTENT_CSUM 351469568) itemoff 16155 itemsize 128 extent csum item There is a single csum item in it, of length 128. At 4 bytes per csum, that's 32 checksums, which covers the 32 4KiB blocks I wrote, leaving nothing for the parity. This is fundamentally broken, and I think we need to change the wiki to indicate that the parity RAID implementation is not recommended, because it doesn't actually do the job it's meant to in a reliable way. :( Hugo. > > data) you can reconstruct the remaining block. That reconstructed > > block won't be checked against the csum for the missing block -- it'll > > just be written and a new csum for it written with it. > > > > So we have silent corruption. I fail to understand how it is an improvement :) > > > Hugo. > > > >> ... > >> > > >> >> > It looks like uncorrectable failures might occur because parity is > >> >> > correct, but the parity checksum is out of date, so the parity checksum > >> >> > doesn't match even though data blindly reconstructed from the parity > >> >> > *would* match the data. > >> >> > > >> >> > >> >> Yep, that is how I read it too. So if your data is checksummed, it > >> >> should at least avoid silent corruption. > >> >> > >
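(As a back-of-envelope check of Hugo's 128-byte item, assuming 4-byte crc32c csums and the 3-device geometry used in the test: 32 data blocks account for exactly those 128 bytes, while checksummed parity would have added one parity block per two data blocks.)

CSUM_SIZE = 4                      # crc32c
DATA_BLOCKS = 32                   # 32 x 4KiB written in the test above
DATA_DISKS = 2                     # 3-device raid5: 2 data + 1 parity per row

data_only = DATA_BLOCKS * CSUM_SIZE
with_parity = (DATA_BLOCKS + DATA_BLOCKS // DATA_DISKS) * CSUM_SIZE

print(f"data csums only:     {data_only} bytes (matches itemsize 128)")
print(f"if parity had csums: {with_parity} bytes")   # 192, which is not what is observed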
On 2016-06-24 01:20, Chris Murphy wrote: > On Thu, Jun 23, 2016 at 8:07 PM, Zygo Blaxell > <ce3g8jdj@umail.furryterror.org> wrote: > >>> With simple files changing one character with vi and gedit, >>> I get completely different logical and physical numbers with each >>> change, so it's clearly cowing the entire stripe (192KiB in my 3 dev >>> raid5). >> >> You are COWing the entire file because vi and gedit do truncate followed >> by full-file write. > > I'm seeing the file inode changes with either a vi or gedit > modification, even when file size is exactly the same, just character > substitute. So as far as VFS and Btrfs are concerned, it's an entirely > different file, so it's like faux-CoW that would have happened on any > file system, not an overwrite. Yes, at least Vim (which is what most Linux systems use for vi) writes to a temporary file then does a replace by rename. The idea is that POSIX implies that this should be atomic (except it's not actually required by POSIX, and even on some journaled and COW filesystems, it isn't actually atomic). -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2016-06-24 06:59, Hugo Mills wrote: > On Fri, Jun 24, 2016 at 01:19:30PM +0300, Andrei Borzenkov wrote: >> On Fri, Jun 24, 2016 at 1:16 PM, Hugo Mills <hugo@carfax.org.uk> wrote: >>> On Fri, Jun 24, 2016 at 12:52:21PM +0300, Andrei Borzenkov wrote: >>>> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills <hugo@carfax.org.uk> wrote: >>>>> On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote: >>>>>> 24.06.2016 04:47, Zygo Blaxell пишет: >>>>>>> On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote: >>>>>>>> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli <kreijack@inwind.it> wrote: >>>>>>>>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the checksum. >>>>>>>> >>>>>>>> Yeah I'm kinda confused on this point. >>>>>>>> >>>>>>>> https://btrfs.wiki.kernel.org/index.php/RAID56 >>>>>>>> >>>>>>>> It says there is a write hole for Btrfs. But defines it in terms of >>>>>>>> parity possibly being stale after a crash. I think the term comes not >>>>>>>> from merely parity being wrong but parity being wrong *and* then being >>>>>>>> used to wrongly reconstruct data because it's blindly trusted. >>>>>>> >>>>>>> I think the opposite is more likely, as the layers above raid56 >>>>>>> seem to check the data against sums before raid56 ever sees it. >>>>>>> (If those layers seem inverted to you, I agree, but OTOH there are >>>>>>> probably good reason to do it that way). >>>>>>> >>>>>> >>>>>> Yes, that's how I read code as well. btrfs layer that does checksumming >>>>>> is unaware of parity blocks at all; for all practical purposes they do >>>>>> not exist. What happens is approximately >>>>>> >>>>>> 1. logical extent is allocated and checksum computed >>>>>> 2. it is mapped to physical area(s) on disks, skipping over what would >>>>>> be parity blocks >>>>>> 3. when these areas are written out, RAID56 parity is computed and filled in >>>>>> >>>>>> IOW btrfs checksums are for (meta)data and RAID56 parity is not data. >>>>> >>>>> Checksums are not parity, correct. However, every data block >>>>> (including, I think, the parity) is checksummed and put into the csum >>>>> tree. This allows the FS to determine where damage has occurred, >>>>> rather thansimply detecting that it has occurred (which would be the >>>>> case if the parity doesn't match the data, or if the two copies of a >>>>> RAID-1 array don't match). >>>>> >>>> >>>> Yes, that is what I wrote below. But that means that RAID5 with one >>>> degraded disk won't be able to reconstruct data on this degraded disk >>>> because reconstructed extent content won't match checksum. Which kinda >>>> makes RAID5 pointless. >>> >>> Eh? How do you come to that conclusion? >>> >>> For data, say you have n-1 good devices, with n-1 blocks on them. >>> Each block has a checksum in the metadata, so you can read that >>> checksum, read the blocks, and verify that they're not damaged. From >>> those n-1 known-good blocks (all data, or one parity and the rest >> >> We do not know whether parity is good or not because as far as I can >> tell parity is not checksummed. > > I was about to write a devastating rebuttal of this... then I > actually tested it, and holy crap you're right. 
> > I've just closed the terminal in question by accident, so I can't > copy-and-paste, but the way I checked was: > > # mkfs.btrfs -mraid1 -draid5 /dev/loop{0,1,2} > # mount /dev/loop0 foo > # dd if=/dev/urandom of=foo/file bs=4k count=32 > # umount /dev/loop0 > # btrfs-debug-tree /dev/loop0 > > then look at the csum tree: > > item 0 key (EXTENT_CSUM EXTENT_CSUM 351469568) itemoff 16155 itemsize 128 > extent csum item > > There is a single csum item in it, of length 128. At 4 bytes per csum, > that's 32 checksums, which covers the 32 4KiB blocks I wrote, leaving > nothing for the parity. > > This is fundamentally broken, and I think we need to change the > wiki to indicate that the parity RAID implementation is not > recommended, because it doesn't actually do the job it's meant to in a > reliable way. :( > So item 4 now then, together with: 1. Rebuilds seemingly randomly decide based on the filesystem whether or not to take an insanely long time (always happens on some arrays, never happens on others, I have yet to see a report where it happens intermittently). 2. Failed disks seem to occasionally cause irreversible data corruption. 3. Classic erasure-code write-hole, just slightly different because of COW. TBH, as much as I hate to say this, it looks like the raid5/6 code needs redone from scratch. At an absolute minimum, we need to put a warning in mkfs for people using raid5/6 to tell them they shouldn't be using it outside of testing. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jun 23, 2016 at 11:20:40PM -0600, Chris Murphy wrote: > [root@f24s ~]# filefrag -v /mnt/5/* > Filesystem type is: 9123683e > File size of /mnt/5/a.txt is 16383 (4 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 0: 2931732.. 2931732: 1: > 1: 1.. 3: 2931713.. 2931715: 3: 2931733: last,eof > /mnt/5/a.txt: 2 extents found > File size of /mnt/5/b.txt is 16383 (4 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 3: 2931716.. 2931719: 4: last,eof > /mnt/5/b.txt: 1 extent found > File size of /mnt/5/c.txt is 16383 (4 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 3: 2931720.. 2931723: 4: last,eof > /mnt/5/c.txt: 1 extent found > File size of /mnt/5/d.txt is 16383 (4 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 3: 2931724.. 2931727: 4: last,eof > /mnt/5/d.txt: 1 extent found > File size of /mnt/5/e.txt is 16383 (4 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 3: 2931728.. 2931731: 4: last,eof > /mnt/5/e.txt: 1 extent found > So at the old address, it shows the "aaaaa..." is still there. And at > the added single block for this file at new logical and physical > addresses, is the modification substituting the first "a" for "g". > > In this case, no rmw, no partial stripe modification, and no data > already on-disk is at risk. Even the metadata leaf/node is cow'd, it > has a new logical and physical address as well, which contains > information for all five files. Well, of course not. You're not setting up the conditions for failure. The extent at 2931712..2931715 is 4 blocks long, so when you overwrite part of the extent all 4 blocks remain occupied. You need extents that are shorter than the stripe width, and you need to write to the same stripe in two different btrfs transactions (i.e. you need to delete an extent and then have a new extent mapped in the old location).
On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote: > >> I don't read code well enough, but I'd be surprised if Btrfs > >> reconstructs from parity and doesn't then check the resulting > >> reconstructed data to its EXTENT_CSUM. > > > > I wouldn't be surprised if both things happen in different code paths, > > given the number of different paths leading into the raid56 code and > > the number of distinct failure modes it seems to have. > > Well, the problem is that parity block cannot be redirected on write as > data blocks; which makes it impossible to version control it. The only > solution I see is to always use full stripe writes by either wasting > time in fixed width stripe or using variable width, so that every stripe > always gets new version of parity. This makes it possible to keep parity > checksums like data checksums. The allocator could try harder to avoid partial stripe writes. We can write multiple small extents to the same stripe as long as we always do it all within one transaction, and then later treat the entire stripe as read-only until every extent is removed. It would be possible to do that by fudging extent lengths (effectively adding a bunch of prealloc-ish space if we have a partial write after all the delalloc stuff is done), but it could also waste some blocks on every single transaction, or create a bunch of "free but unavailable" space that makes df/statvfs output even more wrong than it usually is. The raid5 rmw code could try to relocate the other extents sharing a stripe, but I fear that with the current state of backref walking code that would make raid5 spectacularly slow if a filesystem is anywhere near full. We could also write rmw parity block updates to a journal (like another log tree). That would enable us to at least fix up the parity blocks after a crash, and close the write hole. That's an on-disk format change though.
On Fri, Jun 24, 2016 at 2:50 AM, Hugo Mills <hugo@carfax.org.uk> wrote: > Checksums are not parity, correct. However, every data block > (including, I think, the parity) is checksummed and put into the csum > tree. I don't see how parity is checksummed. It definitely is not in the csum tree. Two file systems, one raid5, one single, each with a single identical file: raid5 item 0 key (EXTENT_CSUM EXTENT_CSUM 12009865216) itemoff 16155 itemsize 128 extent csum item single item 0 key (EXTENT_CSUM EXTENT_CSUM 2168717312) itemoff 16155 itemsize 128 extent csum item That's the only entry in the csum tree. The raid5 one is not 33.33% bigger to account for the extra parity being checksummed. Now, if parity is used to reconstruct data, that data *is* checksummed, so if it fails checksum after reconstruction the information is available to determine it was incorrectly reconstructed. The notes in btrfs/raid56.c recognize the possibility of parity corruption and how to handle it. But I think that corruption is inferred. Maybe the parity csums are in some other metadata item, but I don't see how it's in the csum tree.
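A sketch of how to repeat that comparison on scratch devices (loop device paths are illustrative; the point is only that the raid5 csum items do not grow by the extra third that checksummed parity would require):

# One single-profile and one raid5-profile filesystem, same test file.
mkfs.btrfs -dsingle /dev/loop0
mkfs.btrfs -mraid1 -draid5 /dev/loop1 /dev/loop2 /dev/loop3
# ...mount each, copy the identical file onto both, then unmount...
btrfs-debug-tree /dev/loop0 | grep -A1 EXTENT_CSUM
btrfs-debug-tree /dev/loop1 | grep -A1 EXTENT_CSUM
# Compare the itemsize values: raid5 is not ~33% larger for the same data.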
On Fri, Jun 24, 2016 at 10:52:53AM -0600, Chris Murphy wrote: > On Fri, Jun 24, 2016 at 2:50 AM, Hugo Mills <hugo@carfax.org.uk> wrote: > > > Checksums are not parity, correct. However, every data block > > (including, I think, the parity) is checksummed and put into the csum > > tree. > > I don't see how parity is checksummed. It definitely is not in the > csum tree. Two file systems, one raid5, one single, each with a single > identical file: It isn't -- I was wrong up there, and corrected myself in a later message after investigation. (Although in this case, I regard reality as being at fault ;) ). Hugo. > raid5 > item 0 key (EXTENT_CSUM EXTENT_CSUM 12009865216) itemoff 16155 itemsize 128 > extent csum item > > single > > item 0 key (EXTENT_CSUM EXTENT_CSUM 2168717312) itemoff 16155 itemsize 128 > extent csum item > > That's the only entry in the csum tree. The raid5 one is not 33.33% > bigger to account for the extra parity being checksummed. > > Now, if parity is used to reconstruction of data, that data *is* > checksummed so if it fails checksum after reconstruction the > information is available to determine it was incorrectly > reconstructed. The notes in btrfs/raid56.c recognize the possibility > of parity corruption and how to handle it. But I think that corruption > is inferred. Maybe the parity csums are in some other metadata item, > but I don't see how it's in the csum tree. > >
On Fri, Jun 24, 2016 at 3:52 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote: > On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills <hugo@carfax.org.uk> wrote: >eta)data and RAID56 parity is not data. >> >> Checksums are not parity, correct. However, every data block >> (including, I think, the parity) is checksummed and put into the csum >> tree. This allows the FS to determine where damage has occurred, >> rather thansimply detecting that it has occurred (which would be the >> case if the parity doesn't match the data, or if the two copies of a >> RAID-1 array don't match). >> > > Yes, that is what I wrote below. But that means that RAID5 with one > degraded disk won't be able to reconstruct data on this degraded disk > because reconstructed extent content won't match checksum. Which kinda > makes RAID5 pointless. I don't understand this. Whether the failed disk means a stripe is missing a data strip or parity strip, if any other strip is damaged of course the reconstruction isn't going to match checksum. This does not make raid5 pointless.
24.06.2016 20:06, Chris Murphy wrote: > On Fri, Jun 24, 2016 at 3:52 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote: >> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills <hugo@carfax.org.uk> wrote: >> eta)data and RAID56 parity is not data. >>> >>> Checksums are not parity, correct. However, every data block >>> (including, I think, the parity) is checksummed and put into the csum >>> tree. This allows the FS to determine where damage has occurred, >>> rather thansimply detecting that it has occurred (which would be the >>> case if the parity doesn't match the data, or if the two copies of a >>> RAID-1 array don't match). >>> >> >> Yes, that is what I wrote below. But that means that RAID5 with one >> degraded disk won't be able to reconstruct data on this degraded disk >> because reconstructed extent content won't match checksum. Which kinda >> makes RAID5 pointless. > > I don't understand this. Whether the failed disk means a stripe is > missing a data strip or parity strip, if any other strip is damaged of > course the reconstruction isn't going to match checksum. This does not > make raid5 pointless. > Yes, you are right. We have a double failure here. Still, in the current situation we apparently may end up with btrfs reconstructing the missing block using wrong information. As was mentioned elsewhere, btrfs does not verify the checksum of the reconstructed block, meaning data corruption.
On Fri, Jun 24, 2016 at 4:16 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote: > On Fri, Jun 24, 2016 at 8:20 AM, Chris Murphy <lists@colorremedies.com> wrote: > >> [root@f24s ~]# filefrag -v /mnt/5/* >> Filesystem type is: 9123683e >> File size of /mnt/5/a.txt is 16383 (4 blocks of 4096 bytes) >> ext: logical_offset: physical_offset: length: expected: flags: >> 0: 0.. 3: 2931712.. 2931715: 4: last,eof > > Hmm ... I wonder what is wrong here (openSUSE Tumbleweed) > > nohostname:~ # filefrag -v /mnt/1 > Filesystem type is: 9123683e > File size of /mnt/1 is 3072 (1 block of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 0: 269376.. 269376: 1: last,eof > /mnt/1: 1 extent found > > But! > > nohostname:~ # filefrag -v /etc/passwd > Filesystem type is: 9123683e > File size of /etc/passwd is 1527 (1 block of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 4095: 0.. 4095: 4096: > last,not_aligned,inline,eof > /etc/passwd: 1 extent found > nohostname:~ # > > Why it works for one filesystem but does not work for an other one? > ... >> >> So at the old address, it shows the "aaaaa..." is still there. And at >> the added single block for this file at new logical and physical >> addresses, is the modification substituting the first "a" for "g". >> >> In this case, no rmw, no partial stripe modification, and no data >> already on-disk is at risk. > > You misunderstand the nature of problem. What is put at risk is data > that is already on disk and "shares" parity with new data. > > As example, here are the first 64K in several extents on 4 disk RAID5 > with so far single data chunk > > item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 1103101952) itemoff > 15491 itemsize 176 > chunk length 3221225472 owner 2 stripe_len 65536 > type DATA|RAID5 num_stripes 4 > stripe 0 devid 4 offset 9437184 > dev uuid: ed13e42e-1633-4230-891c-897e86d1c0be > stripe 1 devid 3 offset 9437184 > dev uuid: 10032b95-3f48-4ea0-a9ee-90064c53da1f > stripe 2 devid 2 offset 1074790400 > dev uuid: cd749bd9-3d72-43b4-89a8-45e4a92658cf > stripe 3 devid 1 offset 1094713344 > dev uuid: 41538b9f-3869-4c32-b3e2-30aa2ea1534e > dev extent chunk_tree 3 > chunk objectid 256 chunk offset 1103101952 length 1073741824 > > > item 5 key (1 DEV_EXTENT 1094713344) itemoff 16027 itemsize 48 > dev extent chunk_tree 3 > chunk objectid 256 chunk offset 1103101952 length 1073741824 > item 7 key (2 DEV_EXTENT 1074790400) itemoff 15931 itemsize 48 > dev extent chunk_tree 3 > chunk objectid 256 chunk offset 1103101952 length 1073741824 > item 9 key (3 DEV_EXTENT 9437184) itemoff 15835 itemsize 48 > dev extent chunk_tree 3 > chunk objectid 256 chunk offset 1103101952 length 1073741824 > item 11 key (4 DEV_EXTENT 9437184) itemoff 15739 itemsize 48 > dev extent chunk_tree 3 > chunk objectid 256 chunk offset 1103101952 length 1073741824 > > where devid 1 = sdb1, 2 = sdc1 etc. 
> > Now let's write some data (I created several files) up to 64K in size: > > mirror 1 logical 1103364096 physical 1074855936 device /dev/sdc1 > mirror 2 logical 1103364096 physical 9502720 device /dev/sde1 > mirror 1 logical 1103368192 physical 1074860032 device /dev/sdc1 > mirror 2 logical 1103368192 physical 9506816 device /dev/sde1 > mirror 1 logical 1103372288 physical 1074864128 device /dev/sdc1 > mirror 2 logical 1103372288 physical 9510912 device /dev/sde1 > mirror 1 logical 1103376384 physical 1074868224 device /dev/sdc1 > mirror 2 logical 1103376384 physical 9515008 device /dev/sde1 > mirror 1 logical 1103380480 physical 1074872320 device /dev/sdc1 > mirror 2 logical 1103380480 physical 9519104 device /dev/sde1 > > Note that btrfs allocates 64K on the same device before switching to > the next one. What is a bit misleading here, sdc1 is data and sde1 is > parity (you can see it in checksum tree, where only items for sdc1 > exist). > > Now let's write next 64k and see what happens > > nohostname:~ # btrfs-map-logical -l 1103429632 -b 65536 /dev/sdb1 > mirror 1 logical 1103429632 physical 1094778880 device /dev/sdb1 > mirror 2 logical 1103429632 physical 9502720 device /dev/sde1 > > See? btrfs now allocates new stripe on sdb1; this stripe is at the > same offset as previous one on sdc1 (64K) and so shares the same > parity stripe on sde1. Yep, I've seen this also. What's not clear is if there's any optimization where it's doing partial strip writes, i.e. only a certain sector needs updating so only that sector is written. The md driver does have this optimization for raid5; I don't think it does for raid6. So either someone familiar with the code needs to tell us, or someone needs to do some tracing (not difficult but the output is super verbose). In any case it's a risk when parity is not cow'd: if it were stale or corrupt, a rebuild that depends on parity will rebuild bad data, and that bad data is discoverable. It's not silent, so I don't think it's completely correct to say there's a write hole. It's more correct to say parity is not cow'd, therefore the first part of the write hole can happen; it's just not going to result in silent corruption if there's an interruption writing the parity. So far I haven't seen data extents not be cow'd, which I think we'd agree is a bigger problem if that were to happen. >If you compare 64K on sde1 at offset 9502720 > before and after, you will see that it has changed. INPLACE. Without > CoW. This is exactly what puts existing data on sdc1 at risk - if sdb1 > is updated but sde1 is not, attempt to reconstruct data on sdc1 will > either fail (if we have checksums) or result in silent corruption. Which happens with any implementation that does rmw; partial stripe writes have all sorts of other problems also. The ZFS way of dealing with this is carrying a boatload of additional metadata to account for variable stripe size so that they can cow everything. And it gets massively fragmented as well. So there's a tradeoff for everything.
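The in-place parity overwrite Andrei points out can be captured directly; a sketch, reusing the offsets from his btrfs-map-logical output (they are specific to his filesystem and only illustrative here):

# Snapshot the 64KiB parity strip on sde1 before the second write.
dd if=/dev/sde1 bs=1 skip=9502720 count=65536 2>/dev/null > parity.before
# ...write the next 64KiB of data into the same stripe, then sync...
dd if=/dev/sde1 bs=1 skip=9502720 count=65536 2>/dev/null > parity.after
cmp parity.before parity.after || echo "parity strip was rewritten in place"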
On Fri, Jun 24, 2016 at 4:16 AM, Hugo Mills <hugo@carfax.org.uk> wrote: > On Fri, Jun 24, 2016 at 12:52:21PM +0300, Andrei Borzenkov wrote: >> Yes, that is what I wrote below. But that means that RAID5 with one >> degraded disk won't be able to reconstruct data on this degraded disk >> because reconstructed extent content won't match checksum. Which kinda >> makes RAID5 pointless. > > Eh? How do you come to that conclusion? > > For data, say you have n-1 good devices, with n-1 blocks on them. > Each block has a checksum in the metadata, so you can read that > checksum, read the blocks, and verify that they're not damaged. From > those n-1 known-good blocks (all data, or one parity and the rest > data) you can reconstruct the remaining block. That reconstructed > block won't be checked against the csum for the missing block -- it'll > just be written and a new csum for it written with it. The last sentence is hugely problematic. Parity doesn't appear to be either CoW'd or checksummed. If it is used for reconstruction and the reconstructed data isn't compared to the data's EXTENT_CSUM entry, but that entry is rather recomputed and written, that's just like blindly trusting the parity is correct and then authenticating it with a csum. It's not difficult to test. Corrupt one byte of parity. Yank a drive. Add a new one. Start a reconstruction with scrub or balance (or both to see if they differ) and find out what happens. What should happen is the reconstruct should work for everything except that one file. If it's reconstructed silently, it should contain visible corruption and we all collectively raise our eyebrows.
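A rough sketch of that test (device names are illustrative; $PARITY_DEV and $PARITY_OFFSET are placeholders that have to be filled in from btrfs-map-logical first, and the filesystem should be unmounted while the byte is flipped):

# Corrupt a single byte of parity on one member.
umount /mnt/test
printf '\xff' | dd of=$PARITY_DEV bs=1 seek=$PARITY_OFFSET count=1 conv=notrunc

# Simulate losing a different member and rebuilding onto a fresh disk
# (devid 3 standing in for the missing device).
mount -o degraded /dev/sdb /mnt/test
btrfs replace start -B 3 /dev/sde /mnt/test

# Only the file whose stripe depended on the corrupted parity should come
# back with a csum error; everything else should rebuild cleanly.
btrfs scrub start -B /mnt/test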
On Fri, Jun 24, 2016 at 11:21 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote: > 24.06.2016 20:06, Chris Murphy пишет: >> On Fri, Jun 24, 2016 at 3:52 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote: >>> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills <hugo@carfax.org.uk> wrote: >>> eta)data and RAID56 parity is not data. >>>> >>>> Checksums are not parity, correct. However, every data block >>>> (including, I think, the parity) is checksummed and put into the csum >>>> tree. This allows the FS to determine where damage has occurred, >>>> rather thansimply detecting that it has occurred (which would be the >>>> case if the parity doesn't match the data, or if the two copies of a >>>> RAID-1 array don't match). >>>> >>> >>> Yes, that is what I wrote below. But that means that RAID5 with one >>> degraded disk won't be able to reconstruct data on this degraded disk >>> because reconstructed extent content won't match checksum. Which kinda >>> makes RAID5 pointless. >> >> I don't understand this. Whether the failed disk means a stripe is >> missing a data strip or parity strip, if any other strip is damaged of >> course the reconstruction isn't going to match checksum. This does not >> make raid5 pointless. >> > > Yes, you are right. We have double failure here. Still, in current > situation we apparently may end with btrfs reconstructing missing block > using wrong information. As was mentioned elsewhere, btrfs does not > verify checksum of reconstructed block, meaning data corruption. Well that'd be bad, but also good in that it would explain a lot of problems people have when metadata is also raid5. In this whole thread the premise is the metadata is raid1, so the fs doesn't totally face plant we just get a bunch of weird data corruptions. The metadata raid5 case were sorta "WTF happened?" and not much was really said about it other than telling the user to scrape off what they can and start over. Anyway, while not good I still think this is not super problematic to at least *do* check EXTENT_CSUM after reconstruction from parity rather than assuming that reconstruction happened correctly. The data needed to pass fail the rebuild is already on the disk. It just needs to be checked. Better would be to get parity csummed and put into the csum tree. But I don't know how much that helps. Think about always computing and writing csums for parity, which almost never get used vs keeping things the way they are now and just *checking our work* after reconstruction from parity. If there's some obvious major advantage to checksumming the parity I'm all ears but I'm not thinking of it at the moment.
On Fri, Jun 24, 2016 at 11:40:56AM -0600, Chris Murphy wrote: > On Fri, Jun 24, 2016 at 4:16 AM, Hugo Mills <hugo@carfax.org.uk> wrote: > > On Fri, Jun 24, 2016 at 12:52:21PM +0300, Andrei Borzenkov wrote: > > For data, say you have n-1 good devices, with n-1 blocks on them. > > Each block has a checksum in the metadata, so you can read that > > checksum, read the blocks, and verify that they're not damaged. From > > those n-1 known-good blocks (all data, or one parity and the rest > > data) you can reconstruct the remaining block. That reconstructed > > block won't be checked against the csum for the missing block -- it'll > > just be written and a new csum for it written with it. > > The last sentence is hugely problematic. Parity doesn't appear to be > either CoW'd or checksummed. If it is used for reconstruction and the > reconstructed data isn't compared to the data's EXTENT_CSUM entry, but > that entry is rather recomputed and written, that's just like blindly > trusting the parity is correct and then authenticating it with a csum. I think what happens is the data is recomputed, but the csum on the data is _not_ updated (the csum does not reside in the raid56 code). A read of the reconstructed data would get a csum failure (of course, every 4 billionth time this happens the csum is correct by random chance, so you wouldn't want to be reading parity blocks from a drive full of garbage, but that's a different matter). > It's not difficult to test. Corrupt one byte of parity. Yank a drive. > Add a new one. Start a reconstruction with scrub or balance (or both > to see if they differ) and find out what happens. What should happen > is the reconstruct should work for everything except that one file. If > it's reconstructed silently, it should contain visible corruption and > we all collectively raise our eyebrows. I've done something like that test: write random data to 1000 random blocks on one disk, then run scrub. It reconstructs the data without problems (except for the minor wart that 'scrub status -d' counts the errors randomly against every device, while 'dev stats' counts all the errors on the disk that was corrupted). Disk-side data corruption is a thing I have to deal with a few times each year, so I tested the btrfs raid5 implementation for that case before I started using it. As far as I can tell so far, everything in btrfs raid5 works properly if a disk fails _while the filesystem is not mounted_. The problem I see in the field is not *silent* corruption. It's a whole lot of very *noisy* corruption detected under circumstances where I'd expect to see no corruption at all (silent or otherwise).
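A sketch of that corruption test, assuming a scratch raid5 mounted at /mnt/test with /dev/sdc as one of its members (this destroys data on that device, so only run it against a throwaway array):

DEV=/dev/sdc
NBLK=$(( $(blockdev --getsz "$DEV") / 8 ))     # device size in 4KiB blocks
for i in $(seq 1000); do
    blk=$(( (RANDOM * 32768 + RANDOM) % NBLK ))
    dd if=/dev/urandom of="$DEV" bs=4096 seek="$blk" count=1 conv=notrunc 2>/dev/null
done
btrfs scrub start -Bd /mnt/test     # -d: per-device statistics
btrfs dev stats /mnt/test           # per-device error counters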
On 2016-06-24 13:52, Chris Murphy wrote: > On Fri, Jun 24, 2016 at 11:21 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote: >> 24.06.2016 20:06, Chris Murphy пишет: >>> On Fri, Jun 24, 2016 at 3:52 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote: >>>> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills <hugo@carfax.org.uk> wrote: >>>> eta)data and RAID56 parity is not data. >>>>> >>>>> Checksums are not parity, correct. However, every data block >>>>> (including, I think, the parity) is checksummed and put into the csum >>>>> tree. This allows the FS to determine where damage has occurred, >>>>> rather thansimply detecting that it has occurred (which would be the >>>>> case if the parity doesn't match the data, or if the two copies of a >>>>> RAID-1 array don't match). >>>>> >>>> >>>> Yes, that is what I wrote below. But that means that RAID5 with one >>>> degraded disk won't be able to reconstruct data on this degraded disk >>>> because reconstructed extent content won't match checksum. Which kinda >>>> makes RAID5 pointless. >>> >>> I don't understand this. Whether the failed disk means a stripe is >>> missing a data strip or parity strip, if any other strip is damaged of >>> course the reconstruction isn't going to match checksum. This does not >>> make raid5 pointless. >>> >> >> Yes, you are right. We have double failure here. Still, in current >> situation we apparently may end with btrfs reconstructing missing block >> using wrong information. As was mentioned elsewhere, btrfs does not >> verify checksum of reconstructed block, meaning data corruption. > > Well that'd be bad, but also good in that it would explain a lot of > problems people have when metadata is also raid5. In this whole thread > the premise is the metadata is raid1, so the fs doesn't totally face > plant we just get a bunch of weird data corruptions. The metadata > raid5 case were sorta "WTF happened?" and not much was really said > about it other than telling the user to scrape off what they can and > start over. > > Anyway, while not good I still think this is not super problematic to > at least *do* check EXTENT_CSUM after reconstruction from parity > rather than assuming that reconstruction happened correctly. The data > needed to pass fail the rebuild is already on the disk. It just needs > to be checked. > > Better would be to get parity csummed and put into the csum tree. But > I don't know how much that helps. Think about always computing and > writing csums for parity, which almost never get used vs keeping > things the way they are now and just *checking our work* after > reconstruction from parity. If there's some obvious major advantage to > checksumming the parity I'm all ears but I'm not thinking of it at the > moment. > Well, the obvious major advantage that comes to mind for me to checksumming parity is that it would let us scrub the parity data itself and verify it. I'd personally much rather know my parity is bad before I need to use it than after using it to reconstruct data and getting an error there, and I'd be willing to be that most seasoned sysadmins working for companies using big storage arrays likely feel the same about it. 
I could see it being practical to have an option to turn this off for performance reasons or similar, but again, I have a feeling that most people would rather be able to check if a rebuild will eat data before trying to rebuild (depending on the situation in such a case, it will sometimes just make more sense to nuke the array and restore from a backup instead of spending time waiting for it to rebuild).
On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: > Well, the obvious major advantage that comes to mind for me to checksumming > parity is that it would let us scrub the parity data itself and verify it. OK but hold on. During scrub, it should read data, compute checksums *and* parity, and compare those to what's on-disk - > EXTENT_CSUM in the checksum tree, and the parity strip in the chunk tree. And if parity is wrong, then it should be replaced. Even check > md/sync_action does this. So no pun intended but Btrfs isn't even at parity with mdadm on data integrity if it doesn't check if the parity matches data. > I'd personally much rather know my parity is bad before I need to use it > than after using it to reconstruct data and getting an error there, and I'd > be willing to be that most seasoned sysadmins working for companies using > big storage arrays likely feel the same about it. That doesn't require parity csums though. It just requires computing parity during a scrub and comparing it to the parity on disk to make sure they're the same. If they aren't, assuming no other error for that full stripe read, then the parity block is replaced. So that's also something to check in the code or poke a system with a stick and see what happens. > I could see it being > practical to have an option to turn this off for performance reasons or > similar, but again, I have a feeling that most people would rather be able > to check if a rebuild will eat data before trying to rebuild (depending on > the situation in such a case, it will sometimes just make more sense to nuke > the array and restore from a backup instead of spending time waiting for it > to rebuild). The much bigger problem we have right now that affects Btrfs, LVM/mdadm md raid, is this silly bad default with non-enterprise drives having no configurable SCT ERC, with ensuing long recovery times, and the kernel SCSI command timer at 30 seconds - which actually also fucks over regular single disk users also because it means they don't get the "benefit" of long recovery times, which is the whole g'd point of that feature. This itself causes so many problems where bad sectors just get worse and don't get fixed up because of all the link resets. So I still think it's a bullshit default kernel side because it pretty much affects the majority use case, it is only a non-problem with proprietary hardware raid, and software raid using enterprise (or NAS specific) drives that already have short recovery times by default. This has been true for a very long time, maybe a decade. And it's such complete utter crap that this hasn't been dealt with properly by any party. No distribution has fixed this for their users. Upstream udev hasn't dealt with it. And kernel folks haven't dealt with it. It's a perverse joke on the user to do this out of the box.
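For reference, the mismatch Chris is describing is visible (and adjustable) from userspace; a sketch, assuming a SATA drive at /dev/sda and the commonly suggested values rather than anything official:

smartctl -l scterc /dev/sda            # does the drive support SCT ERC at all?
smartctl -l scterc,70,70 /dev/sda      # if so, cap recovery at 7.0 seconds
cat /sys/block/sda/device/timeout      # kernel SCSI command timer, default 30
# If the drive has no configurable SCT ERC, raise the kernel timeout above
# the drive's worst-case internal recovery time instead:
echo 180 > /sys/block/sda/device/timeout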
Interestingly enough, so far I'm finding with full stripe writes, i.e. 3x raid5, exactly 128KiB data writes, devid 3 is always parity. This is raid4. So...I wonder if some of these slow cases end up with a bunch of stripes that are effectively raid4-like, and have a lot of parity overwrites, which is where raid4 suffers due to disk contention. Totally speculative as the sample size is too small and distinctly non-random. Chris Murphy
26.06.2016 00:52, Chris Murphy wrote: > Interestingly enough, so far I'm finding with full stripe writes, i.e. > 3x raid5, exactly 128KiB data writes, devid 3 is always parity. This > is raid4. That's not what the code suggests, nor what I see in practice - parity seems to be distributed across all disks; each new 128KiB file (extent) has parity on a new disk. At least as long as we can trust btrfs-map-logical to always show parity as "mirror 2". Do you see consecutive full stripes in your tests? Or how do you determine which devid has parity for a given full stripe? This information is not actually stored anywhere, it is computed based on block group geometry and logical stripe offset. P.S. usage of "stripe" to mean "stripe element" actually adds to confusion when reading code :)
Andrei Borzenkov posted on Sun, 26 Jun 2016 10:54:16 +0300 as excerpted: > P.S. usage of "stripe" to mean "stripe element" actually adds to > confusion when reading code :) ... and posts (including patches, which I guess are code as well, just not applied yet). I've been noticing that in the "stripe length" patches, where the comments associated with the patches suggest it's actually "strip length" they're talking about, using the "N strips, one per device, make a stripe" definition.
On Sun, Jun 26, 2016 at 1:54 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote: > 26.06.2016 00:52, Chris Murphy пишет: >> Interestingly enough, so far I'm finding with full stripe writes, i.e. >> 3x raid5, exactly 128KiB data writes, devid 3 is always parity. This >> is raid4. > > That's not what code suggests and what I see in practice - parity seems > to be distributed across all disks; each new 128KiB file (extent) has > parity on new disk. At least as long as we can trust btrfs-map-logical > to always show parity as "mirror 2". tl;dr Andrei is correct there's no raid4 behavior here. Looks like mirror 2 is always parity, more on that below. > > Do you see consecutive full stripes in your tests? Or how do you > determine which devid has parity for a given full stripe? I do see consecutive full stripe writes, but it doesn't always happen. But not checking the consecutivity is where I became confused. [root@f24s ~]# filefrag -v /mnt/5/ab* Filesystem type is: 9123683e File size of /mnt/5/ab128_2.txt is 131072 (32 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 31: 3456128.. 3456159: 32: last,eof /mnt/5/ab128_2.txt: 1 extent found File size of /mnt/5/ab128_3.txt is 131072 (32 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 31: 3456224.. 3456255: 32: last,eof /mnt/5/ab128_3.txt: 1 extent found File size of /mnt/5/ab128_4.txt is 131072 (32 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 31: 3456320.. 3456351: 32: last,eof /mnt/5/ab128_4.txt: 1 extent found File size of /mnt/5/ab128_5.txt is 131072 (32 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 31: 3456352.. 3456383: 32: last,eof /mnt/5/ab128_5.txt: 1 extent found File size of /mnt/5/ab128_6.txt is 131072 (32 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 31: 3456384.. 3456415: 32: last,eof /mnt/5/ab128_6.txt: 1 extent found File size of /mnt/5/ab128_7.txt is 131072 (32 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 31: 3456416.. 3456447: 32: last,eof /mnt/5/ab128_7.txt: 1 extent found File size of /mnt/5/ab128_8.txt is 131072 (32 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 31: 3456448.. 3456479: 32: last,eof /mnt/5/ab128_8.txt: 1 extent found File size of /mnt/5/ab128_9.txt is 131072 (32 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 31: 3456480.. 3456511: 32: last,eof /mnt/5/ab128_9.txt: 1 extent found File size of /mnt/5/ab128.txt is 131072 (32 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 31: 3456096.. 3456127: 32: last,eof /mnt/5/ab128.txt: 1 extent found Starting with the bottom file then from the top so they're in 4096 byte block order; and the 2nd column is the difference in value: 3456096 3456128 32 3456224 96 3456320 96 3456352 32 3456384 32 3456416 32 3456448 32 3456480 32 So the first two files are consecutive full stripe writes. The next two aren't. The next five are. They were all copied at the same time. I don't know why they aren't always consecutive writes. 
[root@f24s ~]# btrfs-map-logical -l $[4096*3456096] /dev/VG/a mirror 1 logical 14156169216 physical 1108541440 device /dev/mapper/VG-a mirror 2 logical 14156169216 physical 2182283264 device /dev/mapper/VG-c [root@f24s ~]# btrfs-map-logical -l $[4096*3456128] /dev/VG/a mirror 1 logical 14156300288 physical 1075052544 device /dev/mapper/VG-b mirror 2 logical 14156300288 physical 1108606976 device /dev/mapper/VG-a [root@f24s ~]# btrfs-map-logical -l $[4096*3456224] /dev/VG/a mirror 1 logical 14156693504 physical 1075249152 device /dev/mapper/VG-b mirror 2 logical 14156693504 physical 1108803584 device /dev/mapper/VG-a [root@f24s ~]# btrfs-map-logical -l $[4096*3456320] /dev/VG/a mirror 1 logical 14157086720 physical 1075445760 device /dev/mapper/VG-b mirror 2 logical 14157086720 physical 1109000192 device /dev/mapper/VG-a [root@f24s ~]# btrfs-map-logical -l $[4096*3456352] /dev/VG/a mirror 1 logical 14157217792 physical 2182807552 device /dev/mapper/VG-c mirror 2 logical 14157217792 physical 1075511296 device /dev/mapper/VG-b [root@f24s ~]# btrfs-map-logical -l $[4096*3456384] /dev/VG/a mirror 1 logical 14157348864 physical 1109131264 device /dev/mapper/VG-a mirror 2 logical 14157348864 physical 2182873088 device /dev/mapper/VG-c [root@f24s ~]# btrfs-map-logical -l $[4096*3456416] /dev/VG/a mirror 1 logical 14157479936 physical 1075642368 device /dev/mapper/VG-b mirror 2 logical 14157479936 physical 1109196800 device /dev/mapper/VG-a [root@f24s ~]# btrfs-map-logical -l $[4096*3456448] /dev/VG/a mirror 1 logical 14157611008 physical 2183004160 device /dev/mapper/VG-c mirror 2 logical 14157611008 physical 1075707904 device /dev/mapper/VG-b [root@f24s ~]# btrfs-map-logical -l $[4096*3456480] /dev/VG/a mirror 1 logical 14157742080 physical 1109327872 device /dev/mapper/VG-a mirror 2 logical 14157742080 physical 2183069696 device /dev/mapper/VG-c To confirm/deny mirror 2 is parity (128KiB file is 64KiB "a", 64KiB "b", so expected parity is 0x03; if it's always 128KiB of the same value then parity is 0x00 and can result in confusion/mistakes with unwritten free space). 
[root@f24s ~]# dd if=/dev/VG/c bs=1 count=65536 skip=2182283264 2>/dev/null | hexdump -C 00000000 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 |................| * 00010000 [root@f24s ~]# dd if=/dev/VG/a bs=1 count=65536 skip=1108606976 2>/dev/null | hexdump -C 00000000 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 |................| * 00010000 [root@f24s ~]# dd if=/dev/VG/a bs=1 count=65536 skip=1108803584 2>/dev/null | hexdump -C 00000000 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 |................| * 00010000 [root@f24s ~]# dd if=/dev/VG/a bs=1 count=65536 skip=1109000192 2>/dev/null | hexdump -C 00000000 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 |................| * 00010000 [root@f24s ~]# dd if=/dev/VG/b bs=1 count=65536 skip=1075511296 2>/dev/null | hexdump -C 00000000 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 |................| * 00010000 [root@f24s ~]# dd if=/dev/VG/c bs=1 count=65536 skip=2182873088 2>/dev/null | hexdump -C 00000000 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 |................| * 00010000 [root@f24s ~]# dd if=/dev/VG/a bs=1 count=65536 skip=1109196800 2>/dev/null | hexdump -C 00000000 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 |................| * 00010000 [root@f24s ~]# dd if=/dev/VG/b bs=1 count=65536 skip=1075707904 2>/dev/null | hexdump -C 00000000 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 |................| * 00010000 [root@f24s ~]# dd if=/dev/VG/c bs=1 count=65536 skip=2183069696 2>/dev/null | hexdump -C 00000000 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 |................| * 00010000 Ok so in particular the last five, parity is on device b, c, a, b, c - that suggests it's distributing parity on consecutive full stripe writes. Where I became confused is, there's not always a consecutive write, and that's what ends up causing parity to end up on one device less often. In the above example, parity goes 4x VG/a, 3x VG/c, and 2x VG/b. Basically it's a bad test. The sample size is too small. I'd need to increase the sample size by a ton in order to know for sure if this is really a problem. >This > information is not actually stored anywhere, it is computed based on > block group geometry and logical stripe offset. I think you're right. A better test is a scrub or balance on a raid5 that's exhibiting slowness, and find out if there's disk contention on that system, and whether it's the result of parity not being distributed enough. > P.S. usage of "stripe" to mean "stripe element" actually adds to > confusion when reading code :) It's confusing everywhere. mdadm chunk = strip = stripe element. And then LVM introduces -i --stripes which means "data strips" i.e. if you choose -i 3 with raid6 segment type, you get 5 strips per stripe (3 data 2 parity). It's horrible.
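As a sanity check on that 0x03 value: the 128KiB file is 64KiB of "a" (0x61) followed by 64KiB of "b" (0x62), and single-parity raid5 is just the byte-wise XOR of the data strips, so:

printf 'a=0x%x b=0x%x parity=0x%02x\n' "'a" "'b" $(( 0x61 ^ 0x62 ))
# prints: a=0x61 b=0x62 parity=0x03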
On Sun, Jun 26, 2016 at 01:30:03PM -0600, Chris Murphy wrote: > On Sun, Jun 26, 2016 at 1:54 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote: > > 26.06.2016 00:52, Chris Murphy пишет: > >> Interestingly enough, so far I'm finding with full stripe writes, i.e. > >> 3x raid5, exactly 128KiB data writes, devid 3 is always parity. This > >> is raid4. > > > > That's not what code suggests and what I see in practice - parity seems > > to be distributed across all disks; each new 128KiB file (extent) has > > parity on new disk. At least as long as we can trust btrfs-map-logical > > to always show parity as "mirror 2". > > > tl;dr Andrei is correct there's no raid4 behavior here. > > Looks like mirror 2 is always parity, more on that below. > > > > > > Do you see consecutive full stripes in your tests? Or how do you > > determine which devid has parity for a given full stripe? > > I do see consecutive full stripe writes, but it doesn't always happen. > But not checking the consecutivity is where I became confused. > > [root@f24s ~]# filefrag -v /mnt/5/ab* > Filesystem type is: 9123683e > File size of /mnt/5/ab128_2.txt is 131072 (32 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 31: 3456128.. 3456159: 32: last,eof > /mnt/5/ab128_2.txt: 1 extent found > File size of /mnt/5/ab128_3.txt is 131072 (32 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 31: 3456224.. 3456255: 32: last,eof > /mnt/5/ab128_3.txt: 1 extent found > File size of /mnt/5/ab128_4.txt is 131072 (32 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 31: 3456320.. 3456351: 32: last,eof > /mnt/5/ab128_4.txt: 1 extent found > File size of /mnt/5/ab128_5.txt is 131072 (32 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 31: 3456352.. 3456383: 32: last,eof > /mnt/5/ab128_5.txt: 1 extent found > File size of /mnt/5/ab128_6.txt is 131072 (32 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 31: 3456384.. 3456415: 32: last,eof > /mnt/5/ab128_6.txt: 1 extent found > File size of /mnt/5/ab128_7.txt is 131072 (32 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 31: 3456416.. 3456447: 32: last,eof > /mnt/5/ab128_7.txt: 1 extent found > File size of /mnt/5/ab128_8.txt is 131072 (32 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 31: 3456448.. 3456479: 32: last,eof > /mnt/5/ab128_8.txt: 1 extent found > File size of /mnt/5/ab128_9.txt is 131072 (32 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 31: 3456480.. 3456511: 32: last,eof > /mnt/5/ab128_9.txt: 1 extent found > File size of /mnt/5/ab128.txt is 131072 (32 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 31: 3456096.. 3456127: 32: last,eof > /mnt/5/ab128.txt: 1 extent found > > Starting with the bottom file then from the top so they're in 4096 > byte block order; and the 2nd column is the difference in value: > > 3456096 > 3456128 32 > 3456224 96 > 3456320 96 > 3456352 32 > 3456384 32 > 3456416 32 > 3456448 32 > 3456480 32 > > So the first two files are consecutive full stripe writes. The next > two aren't. The next five are. They were all copied at the same time. > I don't know why they aren't always consecutive writes. 
The logical addresses don't include parity stripes, so you won't find them with FIEMAP. Parity locations are calculated after the logical -> (disk, chunk_offset) translation is done (it's the same chunk_offset on every disk, but one of the disks is parity while the others are data). > [root@f24s ~]# btrfs-map-logical -l $[4096*3456096] /dev/VG/a > mirror 1 logical 14156169216 physical 1108541440 device /dev/mapper/VG-a > mirror 2 logical 14156169216 physical 2182283264 device /dev/mapper/VG-c > [root@f24s ~]# btrfs-map-logical -l $[4096*3456128] /dev/VG/a > mirror 1 logical 14156300288 physical 1075052544 device /dev/mapper/VG-b > mirror 2 logical 14156300288 physical 1108606976 device /dev/mapper/VG-a > [root@f24s ~]# btrfs-map-logical -l $[4096*3456224] /dev/VG/a > mirror 1 logical 14156693504 physical 1075249152 device /dev/mapper/VG-b > mirror 2 logical 14156693504 physical 1108803584 device /dev/mapper/VG-a > [root@f24s ~]# btrfs-map-logical -l $[4096*3456320] /dev/VG/a > mirror 1 logical 14157086720 physical 1075445760 device /dev/mapper/VG-b > mirror 2 logical 14157086720 physical 1109000192 device /dev/mapper/VG-a > [root@f24s ~]# btrfs-map-logical -l $[4096*3456352] /dev/VG/a > mirror 1 logical 14157217792 physical 2182807552 device /dev/mapper/VG-c > mirror 2 logical 14157217792 physical 1075511296 device /dev/mapper/VG-b > [root@f24s ~]# btrfs-map-logical -l $[4096*3456384] /dev/VG/a > mirror 1 logical 14157348864 physical 1109131264 device /dev/mapper/VG-a > mirror 2 logical 14157348864 physical 2182873088 device /dev/mapper/VG-c > [root@f24s ~]# btrfs-map-logical -l $[4096*3456416] /dev/VG/a > mirror 1 logical 14157479936 physical 1075642368 device /dev/mapper/VG-b > mirror 2 logical 14157479936 physical 1109196800 device /dev/mapper/VG-a > [root@f24s ~]# btrfs-map-logical -l $[4096*3456448] /dev/VG/a > mirror 1 logical 14157611008 physical 2183004160 device /dev/mapper/VG-c > mirror 2 logical 14157611008 physical 1075707904 device /dev/mapper/VG-b > [root@f24s ~]# btrfs-map-logical -l $[4096*3456480] /dev/VG/a > mirror 1 logical 14157742080 physical 1109327872 device /dev/mapper/VG-a > mirror 2 logical 14157742080 physical 2183069696 device /dev/mapper/VG-c > > > To confirm/deny mirror 2 is parity (128KiB file is 64KiB "a", 64KiB > "b", so expected parity is 0x03; if it's always 128KiB of the same > value then parity is 0x00 and can result in confusion/mistakes with > unwritten free space). 
> > [root@f24s ~]# dd if=/dev/VG/c bs=1 count=65536 skip=2182283264 > 2>/dev/null | hexdump -C > 00000000 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 |................| > * > 00010000 > [root@f24s ~]# dd if=/dev/VG/a bs=1 count=65536 skip=1108606976 > 2>/dev/null | hexdump -C > 00000000 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 |................| > * > 00010000 > [root@f24s ~]# dd if=/dev/VG/a bs=1 count=65536 skip=1108803584 > 2>/dev/null | hexdump -C > 00000000 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 |................| > * > 00010000 > [root@f24s ~]# dd if=/dev/VG/a bs=1 count=65536 skip=1109000192 > 2>/dev/null | hexdump -C > 00000000 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 |................| > * > 00010000 > [root@f24s ~]# dd if=/dev/VG/b bs=1 count=65536 skip=1075511296 > 2>/dev/null | hexdump -C > 00000000 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 |................| > * > 00010000 > [root@f24s ~]# dd if=/dev/VG/c bs=1 count=65536 skip=2182873088 > 2>/dev/null | hexdump -C > 00000000 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 |................| > * > 00010000 > [root@f24s ~]# dd if=/dev/VG/a bs=1 count=65536 skip=1109196800 > 2>/dev/null | hexdump -C > 00000000 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 |................| > * > 00010000 > [root@f24s ~]# dd if=/dev/VG/b bs=1 count=65536 skip=1075707904 > 2>/dev/null | hexdump -C > 00000000 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 |................| > * > 00010000 > [root@f24s ~]# dd if=/dev/VG/c bs=1 count=65536 skip=2183069696 > 2>/dev/null | hexdump -C > 00000000 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 |................| > * > 00010000 > > Ok so in particular the last five, parity is on device b, c, a, b, c - > that suggests it's distributing parity on consecutive full stripe > writes. > > Where I became confused is, there's not always a consecutive write, > and that's what ends up causing parity to end up on one device less > often. In the above example, parity goes 4x VG/a, 3x VG/c, and 2x > VG/b. > > Basically it's a bad test. The sample size is too small. I'd need to > increase the sample size by a ton in order to know for sure if this is > really a problem. > > > >This > > information is not actually stored anywhere, it is computed based on > > block group geometry and logical stripe offset. > > I think you're right. A better test is a scrub or balance on a raid5 > that's exhibiting slowness, and find out if there's disk contention on > that system, and whether it's the result of parity not being > distributed enough. > > > > P.S. usage of "stripe" to mean "stripe element" actually adds to > > confusion when reading code :) > > It's confusing everywhere. mdadm chunk = strip = stripe element. And > then LVM introduces -i --stripes which means "data strips" i.e. if you > choose -i 3 with raid6 segment type, you get 5 strips per stripe (3 > data 2 parity). It's horrible. > > > > > -- > Chris Murphy >
On 2016-06-25 12:44, Chris Murphy wrote: > On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn > <ahferroin7@gmail.com> wrote: > >> Well, the obvious major advantage that comes to mind for me to checksumming >> parity is that it would let us scrub the parity data itself and verify it. > > OK but hold on. During scrub, it should read data, compute checksums > *and* parity, and compare those to what's on-disk - > EXTENT_CSUM in > the checksum tree, and the parity strip in the chunk tree. And if > parity is wrong, then it should be replaced. Except that's horribly inefficient. With limited exceptions involving highly situational co-processors, computing a checksum of a parity block is always going to be faster than computing parity for the stripe. By using that to check parity, we can safely speed up the common case of near zero errors during a scrub by a pretty significant factor. The ideal situation that I'd like to see for scrub WRT parity is: 1. Store checksums for the parity itself. 2. During scrub, if the checksum is good, the parity is good, and we just saved the time of computing the whole parity block. 3. If the checksum is not good, then compute the parity. If the parity just computed matches what is there already, the checksum is bad and should be rewritten (and we should probably recompute the whole block of checksums it's in), otherwise, the parity was bad, write out the new parity and update the checksum. 4. Have an option to skip the csum check on the parity and always compute it. > > Even check > md/sync_action does this. So no pun intended but Btrfs > isn't even at parity with mdadm on data integrity if it doesn't check > if the parity matches data. Except that MD and LVM don't have checksums to verify anything outside of the very high-level metadata. They have to compute the parity during a scrub because that's the _only_ way they have to check data integrity. Just because that's the only way for them to check it does not mean we have to follow their design, especially considering that we have other, faster ways to check it. > > >> I'd personally much rather know my parity is bad before I need to use it >> than after using it to reconstruct data and getting an error there, and I'd >> be willing to be that most seasoned sysadmins working for companies using >> big storage arrays likely feel the same about it. > > That doesn't require parity csums though. It just requires computing > parity during a scrub and comparing it to the parity on disk to make > sure they're the same. If they aren't, assuming no other error for > that full stripe read, then the parity block is replaced. It does not require it, but it can make it significantly more efficient, and even a 1% increase in efficiency is a huge difference on a big array. > > So that's also something to check in the code or poke a system with a > stick and see what happens. > >> I could see it being >> practical to have an option to turn this off for performance reasons or >> similar, but again, I have a feeling that most people would rather be able >> to check if a rebuild will eat data before trying to rebuild (depending on >> the situation in such a case, it will sometimes just make more sense to nuke >> the array and restore from a backup instead of spending time waiting for it >> to rebuild). 
> > The much bigger problem we have right now that affects Btrfs, > LVM/mdadm md raid, is this silly bad default with non-enterprise > drives having no configurable SCT ERC, with ensuing long recovery > times, and the kernel SCSI command timer at 30 seconds - which > actually also fucks over regular single disk users also because it > means they don't get the "benefit" of long recovery times, which is > the whole g'd point of that feature. This itself causes so many > problems where bad sectors just get worse and don't get fixed up > because of all the link resets. So I still think it's a bullshit > default kernel side because it pretty much affects the majority use > case, it is only a non-problem with proprietary hardware raid, and > software raid using enterprise (or NAS specific) drives that already > have short recovery times by default. On this, we can agree.
On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: > On 2016-06-25 12:44, Chris Murphy wrote: >> >> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn >> <ahferroin7@gmail.com> wrote: >> >>> Well, the obvious major advantage that comes to mind for me to >>> checksumming >>> parity is that it would let us scrub the parity data itself and verify >>> it. >> >> >> OK but hold on. During scrub, it should read data, compute checksums >> *and* parity, and compare those to what's on-disk - > EXTENT_CSUM in >> the checksum tree, and the parity strip in the chunk tree. And if >> parity is wrong, then it should be replaced. > > Except that's horribly inefficient. With limited exceptions involving > highly situational co-processors, computing a checksum of a parity block is > always going to be faster than computing parity for the stripe. By using > that to check parity, we can safely speed up the common case of near zero > errors during a scrub by a pretty significant factor. OK I'm in favor of that. Although somehow md gets away with this by computing and checking parity for its scrubs, and still manages to keep drives saturated in the process - at least HDDs, I'm not sure how it fares on SSDs. > The ideal situation that I'd like to see for scrub WRT parity is: > 1. Store checksums for the parity itself. > 2. During scrub, if the checksum is good, the parity is good, and we just > saved the time of computing the whole parity block. > 3. If the checksum is not good, then compute the parity. If the parity just > computed matches what is there already, the checksum is bad and should be > rewritten (and we should probably recompute the whole block of checksums > it's in), otherwise, the parity was bad, write out the new parity and update > the checksum. > 4. Have an option to skip the csum check on the parity and always compute > it. >> >> >> Even check > md/sync_action does this. So no pun intended but Btrfs >> isn't even at parity with mdadm on data integrity if it doesn't check >> if the parity matches data. > > Except that MD and LVM don't have checksums to verify anything outside of > the very high-level metadata. They have to compute the parity during a > scrub because that's the _only_ way they have to check data integrity. Just > because that's the only way for them to check it does not mean we have to > follow their design, especially considering that we have other, faster ways > to check it. I'm not opposed to this optimization. But retroactively better qualifying my previous "major advantage" what I meant was in terms of solving functional deficiency. >> The much bigger problem we have right now that affects Btrfs, >> LVM/mdadm md raid, is this silly bad default with non-enterprise >> drives having no configurable SCT ERC, with ensuing long recovery >> times, and the kernel SCSI command timer at 30 seconds - which >> actually also fucks over regular single disk users also because it >> means they don't get the "benefit" of long recovery times, which is >> the whole g'd point of that feature. This itself causes so many >> problems where bad sectors just get worse and don't get fixed up >> because of all the link resets. So I still think it's a bullshit >> default kernel side because it pretty much affects the majority use >> case, it is only a non-problem with proprietary hardware raid, and >> software raid using enterprise (or NAS specific) drives that already >> have short recovery times by default. > > On this, we can agree. 
It just came up again in a thread over the weekend on linux-raid@. I'm going to ask while people are paying attention if a patch to change the 30 second time out to something a lot higher has ever been floated, what the negatives might be, and where to get this fixed if it wouldn't be accepted in the kernel code directly. *Ideally* I think we'd want two timeouts. I'd like to see commands have a timer that results in merely a warning that could be used by e.g. btrfs scrub to know "hey this sector range is 'slow' I'm going to write over those sectors". That's how bad sectors start out, they read slower and eventually go beyond 30 seconds and now it's all link resets. If the problem could be fixed before then... that's the best scenario. The 2nd timer would be, OK the controller or drive just face planted, reset.
For what it's worth I found btrfs-map-logical can produce mapping for raid5 (didn't test raid6) by specifying the extent block length. If that's omitted it only shows the device+mapping for the first strip. This example is a 3 disk raid5, with a 128KiB file all in a single extent. [root@f24s ~]# btrfs-map-logical -l 14157742080 /dev/VG/a mirror 1 logical 14157742080 physical 1109327872 device /dev/mapper/VG-a mirror 2 logical 14157742080 physical 2183069696 device /dev/mapper/VG-c [root@f24s ~]# btrfs-map-logical -l 14157742080 -b 131072 /dev/VG/a mirror 1 logical 14157742080 physical 1109327872 device /dev/mapper/VG-a mirror 1 logical 14157807616 physical 1075773440 device /dev/mapper/VG-b mirror 2 logical 14157742080 physical 2183069696 device /dev/mapper/VG-c mirror 2 logical 14157807616 physical 2183069696 device /dev/mapper/VG-c It's also possible to use -c and -o to copy the extent to a file and more easily diff it with a control file, rather than using dd. Chris Murphy
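A sketch of that -c/-o usage against the same extent (the logical address and length come from the earlier filefrag output; the exact semantics of -c, which copy to dump, and -o, the output file, should be double-checked against the btrfs-map-logical help text):

btrfs-map-logical -l 14157742080 -b 131072 -c 1 -o /tmp/copy1 /dev/VG/a
btrfs-map-logical -l 14157742080 -b 131072 -c 2 -o /tmp/copy2 /dev/VG/a
cmp /tmp/copy1 /mnt/5/ab128_9.txt   # copy 1 should match the file's data
hexdump -C /tmp/copy2 | head        # copy 2 should show the parity bytes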
On Mon, Jun 27, 2016 at 6:17 PM, Chris Murphy <lists@colorremedies.com> wrote: > On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn > <ahferroin7@gmail.com> wrote: >> On 2016-06-25 12:44, Chris Murphy wrote: >>> >>> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn >>> <ahferroin7@gmail.com> wrote: >>> >>>> Well, the obvious major advantage that comes to mind for me to >>>> checksumming >>>> parity is that it would let us scrub the parity data itself and verify >>>> it. >>> >>> >>> OK but hold on. During scrub, it should read data, compute checksums >>> *and* parity, and compare those to what's on-disk - > EXTENT_CSUM in >>> the checksum tree, and the parity strip in the chunk tree. And if >>> parity is wrong, then it should be replaced. >> >> Except that's horribly inefficient. With limited exceptions involving >> highly situational co-processors, computing a checksum of a parity block is >> always going to be faster than computing parity for the stripe. By using >> that to check parity, we can safely speed up the common case of near zero >> errors during a scrub by a pretty significant factor. > > OK I'm in favor of that. Although somehow md gets away with this by > computing and checking parity for its scrubs, and still manages to > keep drives saturated in the process - at least HDDs, I'm not sure how > it fares on SSDs. What I read in this thread clarifies the different flavors of errors I saw when trying btrfs raid5 while corrupting one device, or just unexpectedly removing a device and replacing it with a fresh one. In particular I was not aware of the lack of parity csums, and I think this is really wrong. Consider a 4-disk btrfs raid10 and a 3-disk btrfs raid5. Both protect against the loss of one device or bad blocks on one device. In the current design (unoptimized for performance), raid10 reads from 2 disks and raid5 does as well (as far as I remember) per task/process. Which pair of strips raid10 uses is pseudo-random AFAIK, so one could get low throughput if some device in the array is older/slower and that one is picked. Going from the device to the fs logical layer is just a simple function, namely a copy, so there is the option to keep data in place (zero-copy). The data is at least read by the csum check, and in case of failure the btrfs code picks the alternative strip and corrects it, etc. For raid5, assuming it avoids the parity in principle, it is also a strip pair plus a csum check. In case of csum failure, one needs the parity strip and a parity calculation. To me, it looks like the 'fear' of this calculation has made raid56 a sort of add-on, instead of a more integral part. Looking at the raid6 perf test at boot in dmesg, it is 30GByte/s, even higher than memory bandwidth. So although a calculation is needed if data0strip+paritystrip were used instead of data0strip+data1strip, I think that looking at total cost it can be cheaper than spending time on seeks, at least on HDDs. If the parity calculation is treated in a transparent way, the same as a copy, then there is more flexibility in selecting disks (and strips), and that enables easier design and performance optimizations, I think. >> The ideal situation that I'd like to see for scrub WRT parity is: >> 1. Store checksums for the parity itself. >> 2. During scrub, if the checksum is good, the parity is good, and we just >> saved the time of computing the whole parity block. >> 3. If the checksum is not good, then compute the parity.
If the parity just >> computed matches what is there already, the checksum is bad and should be >> rewritten (and we should probably recompute the whole block of checksums >> it's in), otherwise, the parity was bad, write out the new parity and update >> the checksum.

On this 3rd point: if the parity matches but the csum is not good, then there is a btrfs design error or some hardware/CPU/memory problem. Compare with btrfs raid10: if the copies match but the csum is wrong, then something is fatally wrong. Just as a first step, do the csum check; if it is wrong, that would mean regenerating the assumed-corrupt strip from the 3 others (and for a 3-disk raid5, from the 2 others), whether that is a copy or a parity calculation.

>> 4. Have an option to skip the csum check on the parity and always compute >> it.
On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote: > On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn > <ahferroin7@gmail.com> wrote: > > On 2016-06-25 12:44, Chris Murphy wrote: > >> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn > >> <ahferroin7@gmail.com> wrote: > >> > >> OK but hold on. During scrub, it should read data, compute checksums > >> *and* parity, and compare those to what's on-disk - > EXTENT_CSUM in > >> the checksum tree, and the parity strip in the chunk tree. And if > >> parity is wrong, then it should be replaced. > > > > Except that's horribly inefficient. With limited exceptions involving > > highly situational co-processors, computing a checksum of a parity block is > > always going to be faster than computing parity for the stripe. By using > > that to check parity, we can safely speed up the common case of near zero > > errors during a scrub by a pretty significant factor. > > OK I'm in favor of that. Although somehow md gets away with this by > computing and checking parity for its scrubs, and still manages to > keep drives saturated in the process - at least HDDs, I'm not sure how > it fares on SSDs. A modest desktop CPU can compute raid6 parity at 6GB/sec, a less-modest one at more than 10GB/sec. Maybe a bottleneck is within reach of an array of SSDs vs. a slow CPU. > It just came up again in a thread over the weekend on linux-raid@. I'm > going to ask while people are paying attention if a patch to change > the 30 second time out to something a lot higher has ever been > floated, what the negatives might be, and where to get this fixed if > it wouldn't be accepted in the kernel code directly. Defaults are defaults, they're not for everyone. 30 seconds is about two minutes too short for an SMR drive's worst-case write latency, or 28 seconds too long for an OLTP system, or just right for an end-user's personal machine with a low-energy desktop drive and a long spin-up time. Once a drive starts taking 30+ seconds to do I/O, I consider the drive failed in the sense that it's too slow to meet latency requirements. When the problem is that it's already taking too long, the solution is not waiting even longer. To put things in perspective, consider that server hardware watchdog timeouts are typically 60 seconds by default (if not maximum). If anything, I want the timeout to be shorter so that upper layers with redundancy can get an EIO and initiate repair promptly, and admins can get notified to evict chronic offenders from their drive slots, without having to pay extra for hard disk firmware with that feature. > *Ideally* I think we'd want two timeouts. I'd like to see commands > have a timer that results in merely a warning that could be used by > e.g. btrfs scrub to know "hey this sector range is 'slow' I'm going to > write over those sectors". That's how bad sectors start out, they read > slower and eventually go beyond 30 seconds and now it's all link > resets. If the problem could be fixed before then... that's the best > scenario. What's the downside of a link reset? Can the driver not just return EIO for all the outstanding IOs in progress at reset, and let the upper layers deal with it? Or is the problem that the upper layers are all horribly broken by EIOs, or drive firmware horribly broken by link resets? The upper layers could time the IOs, and make their own decisions based on the timing (e.g. btrfs or mdadm could proactively repair anything that took more than 10 seconds to read). 
That might be a better approach, since shortening the time to an EIO is only useful when you have a redundancy layer in place to do something about them. > The 2nd timer would be, OK the controller or drive just face planted, reset. > > -- > Chris Murphy >
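For anyone following along, the timeout being discussed is the per-device SCSI command timer exposed in sysfs; it can be inspected and raised like this (sdX is a placeholder, and the value does not persist across reboots):

  cat /sys/block/sdX/device/timeout          # default is 30 (seconds)
  echo 180 > /sys/block/sdX/device/timeout   # allow a slow consumer drive to finish its recovery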
On Mon, Jun 27, 2016 at 3:57 PM, Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote: > On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote: > >> It just came up again in a thread over the weekend on linux-raid@. I'm >> going to ask while people are paying attention if a patch to change >> the 30 second time out to something a lot higher has ever been >> floated, what the negatives might be, and where to get this fixed if >> it wouldn't be accepted in the kernel code directly. > > Defaults are defaults, they're not for everyone. 30 seconds is about > two minutes too short for an SMR drive's worst-case write latency, or > 28 seconds too long for an OLTP system, or just right for an end-user's > personal machine with a low-energy desktop drive and a long spin-up time. The question is where is the correct place to change the default to broadly capture most use cases, because it's definitely incompatible with consumer SATA drives, whether in an enclosure or not. Maybe it's with the kernel teams at each distribution? Or maybe an upstream udev rule? In any case something needs to give here because it's been years of bugging users about this misconfiguration and people constantly run into it, which means user education is not working. > > Once a drive starts taking 30+ seconds to do I/O, I consider the drive > failed in the sense that it's too slow to meet latency requirements. Well that is then a mismatch between use case and the drive purchasing decision. Consumer drives do this. It's how they're designed to work. > When the problem is that it's already taking too long, the solution is > not waiting even longer. To put things in perspective, consider that > server hardware watchdog timeouts are typically 60 seconds by default > (if not maximum). If you want the data retrieved from that particular device, the only solution is waiting longer. The alternative is what you get, an IO error (well actually you get a link reset, which also means the entire command queue is purged on SATA drives). > If anything, I want the timeout to be shorter so that upper layers with > redundancy can get an EIO and initiate repair promptly, and admins can > get notified to evict chronic offenders from their drive slots, without > having to pay extra for hard disk firmware with that feature. The drive totally thwarts this. It doesn't report back to the kernel what command is hung, as far as I'm aware. It just hangs and goes into a so called "deep recovery" there is no way to know what sector is causing the problem until the drive reports a read error, which will include the affected sector LBA. Btrfs does have something of a work around for when things get slow, and that's balance, read and rewrite everything. The write forces sector remapping by the drive firmware for bad sectors. >> *Ideally* I think we'd want two timeouts. I'd like to see commands >> have a timer that results in merely a warning that could be used by >> e.g. btrfs scrub to know "hey this sector range is 'slow' I'm going to >> write over those sectors". That's how bad sectors start out, they read >> slower and eventually go beyond 30 seconds and now it's all link >> resets. If the problem could be fixed before then... that's the best >> scenario. > > What's the downside of a link reset? Can the driver not just return > EIO for all the outstanding IOs in progress at reset, and let the upper > layers deal with it? Or is the problem that the upper layers are all > horribly broken by EIOs, or drive firmware horribly broken by link resets? 
A link reset clears the entire command queue on SATA drives, and it wipes away any possibility of finding out what LBA, or even what range of LBAs, is the source of the stall. So it pretty much gets you nothing.

> The upper layers could time the IOs, and make their own decisions based > on the timing (e.g. btrfs or mdadm could proactively repair anything that > took more than 10 seconds to read). That might be a better approach, > since shortening the time to an EIO is only useful when you have a > redundancy layer in place to do something about them.

For RAID with redundancy, that's doable, although I have no idea what work is needed, or even whether it's possible, to track commands in this manner and fall back to some kind of repair mode as if it were a read error.

For single drives and RAID 0, the only possible solution is to not do link resets for up to 3 minutes and hope the drive returns the single copy of data. Even in the case of Btrfs DUP, self-repair is thwarted unless the drive reports a read error (or returns bad data).
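The drive-side counterpart to the host command timer is SCT ERC, which caps how long the drive itself retries before reporting a read error; on drives that support it, smartctl can query and set it (sdX is a placeholder, values are in tenths of a second, and many drives forget the setting at power-off):

  smartctl -l scterc /dev/sdX         # show the current read/write ERC limits, if supported
  smartctl -l scterc,70,70 /dev/sdX   # cap recovery at 7.0 seconds for reads and writes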
On Mon, Jun 27, 2016 at 04:30:23PM -0600, Chris Murphy wrote: > On Mon, Jun 27, 2016 at 3:57 PM, Zygo Blaxell > <ce3g8jdj@umail.furryterror.org> wrote: > > On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote: > > If anything, I want the timeout to be shorter so that upper layers with > > redundancy can get an EIO and initiate repair promptly, and admins can > > get notified to evict chronic offenders from their drive slots, without > > having to pay extra for hard disk firmware with that feature. > > The drive totally thwarts this. It doesn't report back to the kernel > what command is hung, as far as I'm aware. It just hangs and goes into > a so called "deep recovery" there is no way to know what sector is > causing the problem I'm proposing just treat the link reset _as_ an EIO, unless transparent link resets are required for link speed negotiation or something. The drive wouldn't be thwarting anything, the host would just ignore it (unless the drive doesn't respond to a link reset until after its internal timeout, in which case nothing is saved by shortening the timeout). > until the drive reports a read error, which will > include the affected sector LBA. It doesn't matter which sector. Chances are good that it was more than one of the outstanding requested sectors anyway. Rewrite them all. We know which sectors they are because somebody has an IO operation waiting for a status on each of them (unless they're using AIO or some other API where a request can be fired at a hard drive and the reply discarded). Notify all of them that their IO failed and move on. > Btrfs does have something of a work around for when things get slow, > and that's balance, read and rewrite everything. The write forces > sector remapping by the drive firmware for bad sectors. It's a crude form of "resilvering" as ZFS calls it. > > The upper layers could time the IOs, and make their own decisions based > > on the timing (e.g. btrfs or mdadm could proactively repair anything that > > took more than 10 seconds to read). That might be a better approach, > > since shortening the time to an EIO is only useful when you have a > > redundancy layer in place to do something about them. > > For RAID with redundancy, that's doable, although I have no idea what > work is needed, or even if it's possible, to track commands in this > manner, and fall back to some kind of repair mode as if it were a read > error. If btrfs sees EIO from a lower block layer it will try to reconstruct the missing data (but not repair it). If that happens during a scrub, it will also attempt to rewrite the missing data over the original offending sectors. This happens every few months in my server pool, and seems to be working even on btrfs raid5. Last time I checked all the RAID implementations on Linux (ok, so that's pretty much just md-raid) had some sort of repair capability. lvm uses (or can use) the md-raid implementation. ext4 and xfs on naked disk partitions will have problems, but that's because they were designed in the 1990's when we were young and naive and still believed hard disks would one day become reliable devices without buggy firmware. > For single drives and RAID 0, the only possible solution is to not do > link resets for up to 3 minutes and hope the drive returns the single > copy of data. So perhaps the timeout should be influenced by higher layers, e.g. if a disk becomes part of a raid1, its timeout should be shortened by default, while a timeout for a disk that is not used in by redundant layer should be longer. 
> Even in the case of Btrfs DUP, it's thwarted without a read error > reported from the drive (or it returning bad data). That case gets messy--different timeouts for different parts of the disk. Probably not practical.
On Mon, Jun 27, 2016 at 7:52 PM, Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote: > On Mon, Jun 27, 2016 at 04:30:23PM -0600, Chris Murphy wrote: >> On Mon, Jun 27, 2016 at 3:57 PM, Zygo Blaxell >> <ce3g8jdj@umail.furryterror.org> wrote: >> > On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote: >> > If anything, I want the timeout to be shorter so that upper layers with >> > redundancy can get an EIO and initiate repair promptly, and admins can >> > get notified to evict chronic offenders from their drive slots, without >> > having to pay extra for hard disk firmware with that feature. >> >> The drive totally thwarts this. It doesn't report back to the kernel >> what command is hung, as far as I'm aware. It just hangs and goes into >> a so called "deep recovery" there is no way to know what sector is >> causing the problem > > I'm proposing just treat the link reset _as_ an EIO, unless transparent > link resets are required for link speed negotiation or something. That's not one EIO, that's possibly 31 items in the command queue that get knocked over when the link is reset. I don't have the expertise to know whether it's sane to interpret many EIO all at once as an implicit indication of bad sectors. Off hand I think that's probably specious. > The drive wouldn't be thwarting anything, the host would just ignore it > (unless the drive doesn't respond to a link reset until after its internal > timeout, in which case nothing is saved by shortening the timeout). > >> until the drive reports a read error, which will >> include the affected sector LBA. > > It doesn't matter which sector. Chances are good that it was more than > one of the outstanding requested sectors anyway. Rewrite them all. *shrug* even if valid, it only helps the raid 1+ cases. It does nothing to help raid0, linear/concat, or single device deployments. Those users also deserve to have access to their data, if the drive can recover it by giving it enough time to do so. > We know which sectors they are because somebody has an IO operation > waiting for a status on each of them (unless they're using AIO or some > other API where a request can be fired at a hard drive and the reply > discarded). Notify all of them that their IO failed and move on. Dunno, maybe. > >> Btrfs does have something of a work around for when things get slow, >> and that's balance, read and rewrite everything. The write forces >> sector remapping by the drive firmware for bad sectors. > > It's a crude form of "resilvering" as ZFS calls it. In what manner is it crude? > If btrfs sees EIO from a lower block layer it will try to reconstruct the > missing data (but not repair it). If that happens during a scrub, > it will also attempt to rewrite the missing data over the original > offending sectors. This happens every few months in my server pool, > and seems to be working even on btrfs raid5. > > Last time I checked all the RAID implementations on Linux (ok, so that's > pretty much just md-raid) had some sort of repair capability. You can read man 4 md, and you can also look on linux-raid@, it's very clearly necessary for the drive to report a read or write error explicitly with LBA for md to do repairs. If there are link resets, bad sectors accumulate and the obvious inevitably happens. > >> For single drives and RAID 0, the only possible solution is to not do >> link resets for up to 3 minutes and hope the drive returns the single >> copy of data. > > So perhaps the timeout should be influenced by higher layers, e.g. 
if a > disk becomes part of a raid1, its timeout should be shortened by default, > while a timeout for a disk that is not used in by redundant layer should > be longer. And there are a pile of reasons why link resets are necessary that have nothing to do with bad sectors. So if you end up with a drive or controller misbehaving and the new behavior is to force a bunch of new (corrective) writes to the drive right after a reset it could actually make its problems worse for all we know. I think it's highly speculative to assume hung block devices means bad sector and should be treated as a bad sector, and that doing so will cause no other side effects. That's a question for block device/SCSI experts to opine on whether this is at all sane to do. I'm sure they're reasonably aware of this problem that if it were that simple they'd have done that already, but conversely 5 years of telling users to change the command timer or stop using the wrong kind of drives for RAID really isn't sufficiently good advice either. The reality is that manufacturers of drives have handed us drives that far and wide don't support SCT ERC or it's disabled by default, so yeah maybe the thing to do is udev polls the drive for SCT ERC, if it's already at 70,70 then leave the SCSI command timer as is. If it reports it's disabled, then udev needs to know if it's in some kind of RAID 1+ and if so then set SCT ERC to 70,70. If it's a single drive, linear/concat, or RAID 0, then instead it should change the SCSI command timer to 180 and let the user sort out the possible insanity that follows which will be the occasional half minute or longer hang whenever a drive has a bad sector. > >> Even in the case of Btrfs DUP, it's thwarted without a read error >> reported from the drive (or it returning bad data). > > That case gets messy--different timeouts for different parts of the disk. > Probably not practical. The point is that DUP implies single device, and for certain that configuration should not have such a short command timer. If the drive can recover it at all, it could possibly remap the sector on its own. In fact that's what is supposed to happen, but drives seem to only do this when the sector read is already painfully slow.
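To make that concrete, here is a very rough sketch of what such a rule plus helper might look like. Everything here is hypothetical: the file names, the helper script, and the 70,70/180 values are illustrative, it skips the RAID-1+-membership check described above, and the ERC support test is simplistic.

  # /etc/udev/rules.d/60-io-timeout.rules  (hypothetical name and path)
  ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", RUN+="/usr/local/sbin/set-io-timeout.sh %k"

  #!/bin/sh
  # set-io-timeout.sh (sketch): prefer a short drive-side ERC limit;
  # if the drive has no usable SCT ERC, fall back to a long host-side timer.
  dev="$1"
  if smartctl -l scterc,70,70 "/dev/$dev" >/dev/null 2>&1; then
      :   # drive now gives up after 7 s, well inside the default 30 s command timer
  else
      echo 180 > "/sys/block/$dev/device/timeout"   # no ERC: let the drive retry for minutes
  fi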
On Mon, Jun 27, 2016 at 08:39:21PM -0600, Chris Murphy wrote: > On Mon, Jun 27, 2016 at 7:52 PM, Zygo Blaxell > <ce3g8jdj@umail.furryterror.org> wrote: > > On Mon, Jun 27, 2016 at 04:30:23PM -0600, Chris Murphy wrote: > >> Btrfs does have something of a work around for when things get slow, > >> and that's balance, read and rewrite everything. The write forces > >> sector remapping by the drive firmware for bad sectors. > > > > It's a crude form of "resilvering" as ZFS calls it. > > In what manner is it crude? Balance relocates extents, looks up backrefs, and rewrites metadata, all of which are extra work above what is required by resilvering (and extra work that is proportional to the number of backrefs and the (currently extremely poor) performance of the backref walking code, so snapshots and large files multiply the workload). Resilvering should just read data, reconstruct it from a mirror if necessary, and write it back to the original location (or read one mirror and rewrite the other). That's more like what scrub does, except scrub rewrites only the blocks it couldn't read (or that failed csum). > > Last time I checked all the RAID implementations on Linux (ok, so that's > > pretty much just md-raid) had some sort of repair capability. > > You can read man 4 md, and you can also look on linux-raid@, it's very > clearly necessary for the drive to report a read or write error > explicitly with LBA for md to do repairs. If there are link resets, > bad sectors accumulate and the obvious inevitably happens. I am looking at the md code. It looks at ->bi_error, and nothing else as far as I can tell. It doesn't even care if the error is EIO--any non-zero return value from the lower bio layer seems to trigger automatic recovery.
On 2016-06-27 23:17, Zygo Blaxell wrote: > On Mon, Jun 27, 2016 at 08:39:21PM -0600, Chris Murphy wrote: >> On Mon, Jun 27, 2016 at 7:52 PM, Zygo Blaxell >> <ce3g8jdj@umail.furryterror.org> wrote: >>> On Mon, Jun 27, 2016 at 04:30:23PM -0600, Chris Murphy wrote: >>>> Btrfs does have something of a work around for when things get slow, >>>> and that's balance, read and rewrite everything. The write forces >>>> sector remapping by the drive firmware for bad sectors. >>> >>> It's a crude form of "resilvering" as ZFS calls it. >> >> In what manner is it crude? > > Balance relocates extents, looks up backrefs, and rewrites metadata, all > of which are extra work above what is required by resilvering (and extra > work that is proportional to the number of backrefs and the (currently > extremely poor) performance of the backref walking code, so snapshots > and large files multiply the workload). > > Resilvering should just read data, reconstruct it from a mirror if > necessary, and write it back to the original location (or read one > mirror and rewrite the other). That's more like what scrub does, except > scrub rewrites only the blocks it couldn't read (or that failed csum). It's worth pointing out that balance was not designed for resilvering, it was designed for reshaping arrays, converting replication profiles, and compaction at the chunk level. Balance is not a resilvering tool, that just happens to be a useful side effect of running a balance (actually, so is the chunk level compaction). -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2016-06-27 17:57, Zygo Blaxell wrote: > On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote: >> On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn >> <ahferroin7@gmail.com> wrote: >>> On 2016-06-25 12:44, Chris Murphy wrote: >>>> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn >>>> <ahferroin7@gmail.com> wrote: >>>> >>>> OK but hold on. During scrub, it should read data, compute checksums >>>> *and* parity, and compare those to what's on-disk - > EXTENT_CSUM in >>>> the checksum tree, and the parity strip in the chunk tree. And if >>>> parity is wrong, then it should be replaced. >>> >>> Except that's horribly inefficient. With limited exceptions involving >>> highly situational co-processors, computing a checksum of a parity block is >>> always going to be faster than computing parity for the stripe. By using >>> that to check parity, we can safely speed up the common case of near zero >>> errors during a scrub by a pretty significant factor. >> >> OK I'm in favor of that. Although somehow md gets away with this by >> computing and checking parity for its scrubs, and still manages to >> keep drives saturated in the process - at least HDDs, I'm not sure how >> it fares on SSDs. > > A modest desktop CPU can compute raid6 parity at 6GB/sec, a less-modest > one at more than 10GB/sec. Maybe a bottleneck is within reach of an > array of SSDs vs. a slow CPU. OK, great for people who are using modern desktop or server CPU's. Not everyone has that luxury, and even on many such CPU's, it's _still_ faster to computer CRC32c checksums. On top of that, we don't appear to be using the in-kernel parity-raid libraries (or if we are, I haven't been able to find where we are calling the functions for it), so we don't necessarily get assembly optimized or co-processor accelerated computation of the parity itself. The other thing that I didn't mention above though, is that computing parity checksums will always take less time than computing parity, because you have to process significantly less data. On a 4 disk RAID5 array, you're processing roughly 2/3 as much data to do the parity checksums instead of parity itself, which means that the parity computation would need to be 200% faster than the CRC32c computation to break even, and this margin gets bigger and bigger as you add more disks. On small arrays, this obviously won't have much impact. Once you start to scale past a few TB though, even a few hundred MB/s faster processing means a significant decrease in processing time. Say you have a CPU which gets about 12.0GB/s for RAID5 parity, and and about 12.25GB/s for CRC32c (~2% is a conservative ratio assuming you use the CRC32c instruction and assembly optimized RAID5 parity computations on a modern x86_64 processor (the ratio on both the mobile Core i5 in my laptop and the Xeon E3 in my home server is closer to 5%)). 
Assuming those numbers, and that we're already checking checksums on non-parity blocks, processing 120TB of data in a 4 disk array (which gives 40TB of parity data, so 160TB total) gives:

For computing the parity to scrub:
120TB / 12.25GB/s = 9795.9 seconds for processing CRC32c csums of all the regular data
120TB / 12GB/s = 10000 seconds for processing parity of all stripes
= 19795.9 seconds total
~ 5.5 hours total

For computing csums of the parity:
120TB / 12.25GB/s = 9795.9 seconds for processing CRC32c csums of all the regular data
40TB / 12.25GB/s = 3265.3 seconds for processing CRC32c csums of all the parity data
= 13061.2 seconds total
~ 3.6 hours total

The checksum-based computation is approximately 34% faster than the parity computation. Much of this, of course, is that you have to process the regular data twice for the parity computation method (once for csums, once for parity). You could probably do one pass computing both values, but that would need to be done carefully; and, without significant optimization, it would likely not get you much benefit other than cutting the number of loads in half.
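As a quick, purely illustrative sanity check of that arithmetic (using the same assumed 12.25 GB/s CRC32c and 12 GB/s parity figures, with times in seconds):

  echo 'scale=1; 120000/12.25 + 120000/12' | bc      # recompute-parity scrub:  19795.9
  echo 'scale=1; 120000/12.25 + 40000/12.25' | bc    # csum-the-parity scrub:   13061.2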
On 28/06/16 22:05, Austin S. Hemmelgarn wrote: > On 2016-06-27 17:57, Zygo Blaxell wrote: >> On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote: >>> On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn >>> <ahferroin7@gmail.com> wrote: >>>> On 2016-06-25 12:44, Chris Murphy wrote: >>>>> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn >>>>> <ahferroin7@gmail.com> wrote: >>>>> >>>>> OK but hold on. During scrub, it should read data, compute checksums >>>>> *and* parity, and compare those to what's on-disk - > EXTENT_CSUM in >>>>> the checksum tree, and the parity strip in the chunk tree. And if >>>>> parity is wrong, then it should be replaced. >>>> >>>> Except that's horribly inefficient. With limited exceptions involving >>>> highly situational co-processors, computing a checksum of a parity >>>> block is >>>> always going to be faster than computing parity for the stripe. By >>>> using >>>> that to check parity, we can safely speed up the common case of near >>>> zero >>>> errors during a scrub by a pretty significant factor. >>> >>> OK I'm in favor of that. Although somehow md gets away with this by >>> computing and checking parity for its scrubs, and still manages to >>> keep drives saturated in the process - at least HDDs, I'm not sure how >>> it fares on SSDs. >> >> A modest desktop CPU can compute raid6 parity at 6GB/sec, a less-modest >> one at more than 10GB/sec. Maybe a bottleneck is within reach of an >> array of SSDs vs. a slow CPU. > OK, great for people who are using modern desktop or server CPU's. Not > everyone has that luxury, and even on many such CPU's, it's _still_ > faster to computer CRC32c checksums. On top of that, we don't appear to > be using the in-kernel parity-raid libraries (or if we are, I haven't > been able to find where we are calling the functions for it), so we > don't necessarily get assembly optimized or co-processor accelerated > computation of the parity itself. The other thing that I didn't mention > above though, is that computing parity checksums will always take less > time than computing parity, because you have to process significantly > less data. On a 4 disk RAID5 array, you're processing roughly 2/3 as > much data to do the parity checksums instead of parity itself, which > means that the parity computation would need to be 200% faster than the > CRC32c computation to break even, and this margin gets bigger and bigger > as you add more disks. > > On small arrays, this obviously won't have much impact. Once you start > to scale past a few TB though, even a few hundred MB/s faster processing > means a significant decrease in processing time. Say you have a CPU > which gets about 12.0GB/s for RAID5 parity, and and about 12.25GB/s for > CRC32c (~2% is a conservative ratio assuming you use the CRC32c > instruction and assembly optimized RAID5 parity computations on a modern > x86_64 processor (the ratio on both the mobile Core i5 in my laptop and > the Xeon E3 in my home server is closer to 5%)). 
Assuming those > numbers, and that we're already checking checksums on non-parity blocks, > processing 120TB of data in a 4 disk array (which gives 40TB of parity > data, so 160TB total) gives: > For computing the parity to scrub: > 120TB / 12.25GB = 9795.9 seconds for processing CRC32c csums of all the > regular data > 120TB / 12GB = 10000 seconds for processing parity of all stripes > = 19795.9 seconds total > ~ 5.4 hours total > > For computing csums of the parity: > 120TB / 12.25GB = 9795.9 seconds for processing CRC32c csums of all the > regular data > 40TB / 12.25GB = 3265.3 seconds for processing CRC32c csums of all the > parity data > = 13061.2 seconds total > ~ 3.6 hours total > > The checksum based computation is approximately 34% faster than the > parity computation. Much of this of course is that you have to process > the regular data twice for the parity computation method (once for > csums, once for parity). You could probably do one pass computing both > values, but that would need to be done carefully; and, without > significant optimization, would likely not get you much benefit other > than cutting the number of loads in half. And it all means jack shit because you don't get the data to disk that quick. Who cares if its 500% faster - if it still saturates the throughput of the actual drives, what difference does it make? I'm all for actual solutions, but the nirvana fallacy seems to apply here...
On 2016-06-28 08:14, Steven Haigh wrote: > On 28/06/16 22:05, Austin S. Hemmelgarn wrote: >> On 2016-06-27 17:57, Zygo Blaxell wrote: >>> On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote: >>>> On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn >>>> <ahferroin7@gmail.com> wrote: >>>>> On 2016-06-25 12:44, Chris Murphy wrote: >>>>>> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn >>>>>> <ahferroin7@gmail.com> wrote: >>>>>> >>>>>> OK but hold on. During scrub, it should read data, compute checksums >>>>>> *and* parity, and compare those to what's on-disk - > EXTENT_CSUM in >>>>>> the checksum tree, and the parity strip in the chunk tree. And if >>>>>> parity is wrong, then it should be replaced. >>>>> >>>>> Except that's horribly inefficient. With limited exceptions involving >>>>> highly situational co-processors, computing a checksum of a parity >>>>> block is >>>>> always going to be faster than computing parity for the stripe. By >>>>> using >>>>> that to check parity, we can safely speed up the common case of near >>>>> zero >>>>> errors during a scrub by a pretty significant factor. >>>> >>>> OK I'm in favor of that. Although somehow md gets away with this by >>>> computing and checking parity for its scrubs, and still manages to >>>> keep drives saturated in the process - at least HDDs, I'm not sure how >>>> it fares on SSDs. >>> >>> A modest desktop CPU can compute raid6 parity at 6GB/sec, a less-modest >>> one at more than 10GB/sec. Maybe a bottleneck is within reach of an >>> array of SSDs vs. a slow CPU. >> OK, great for people who are using modern desktop or server CPU's. Not >> everyone has that luxury, and even on many such CPU's, it's _still_ >> faster to computer CRC32c checksums. On top of that, we don't appear to >> be using the in-kernel parity-raid libraries (or if we are, I haven't >> been able to find where we are calling the functions for it), so we >> don't necessarily get assembly optimized or co-processor accelerated >> computation of the parity itself. The other thing that I didn't mention >> above though, is that computing parity checksums will always take less >> time than computing parity, because you have to process significantly >> less data. On a 4 disk RAID5 array, you're processing roughly 2/3 as >> much data to do the parity checksums instead of parity itself, which >> means that the parity computation would need to be 200% faster than the >> CRC32c computation to break even, and this margin gets bigger and bigger >> as you add more disks. >> >> On small arrays, this obviously won't have much impact. Once you start >> to scale past a few TB though, even a few hundred MB/s faster processing >> means a significant decrease in processing time. Say you have a CPU >> which gets about 12.0GB/s for RAID5 parity, and and about 12.25GB/s for >> CRC32c (~2% is a conservative ratio assuming you use the CRC32c >> instruction and assembly optimized RAID5 parity computations on a modern >> x86_64 processor (the ratio on both the mobile Core i5 in my laptop and >> the Xeon E3 in my home server is closer to 5%)). 
Assuming those >> numbers, and that we're already checking checksums on non-parity blocks, >> processing 120TB of data in a 4 disk array (which gives 40TB of parity >> data, so 160TB total) gives: >> For computing the parity to scrub: >> 120TB / 12.25GB = 9795.9 seconds for processing CRC32c csums of all the >> regular data >> 120TB / 12GB = 10000 seconds for processing parity of all stripes >> = 19795.9 seconds total >> ~ 5.4 hours total >> >> For computing csums of the parity: >> 120TB / 12.25GB = 9795.9 seconds for processing CRC32c csums of all the >> regular data >> 40TB / 12.25GB = 3265.3 seconds for processing CRC32c csums of all the >> parity data >> = 13061.2 seconds total >> ~ 3.6 hours total >> >> The checksum based computation is approximately 34% faster than the >> parity computation. Much of this of course is that you have to process >> the regular data twice for the parity computation method (once for >> csums, once for parity). You could probably do one pass computing both >> values, but that would need to be done carefully; and, without >> significant optimization, would likely not get you much benefit other >> than cutting the number of loads in half. > > And it all means jack shit because you don't get the data to disk that > quick. Who cares if its 500% faster - if it still saturates the > throughput of the actual drives, what difference does it make? It has less impact on everything else running on the system at the time because it uses less CPU time and potentially less memory. This is the exact same reason that you want your RAID parity computation performance as good as possible, the less time the CPU spends on that, the more it can spend on other things. On top of that, there are high-end systems that do have SSD's that can get multiple GB/s of data transfer per second, and NVDIMM's are starting to become popular in the server market, and those give you data transfer speeds equivalent to regular memory bandwidth (which can be well over 20GB/s on decent hardware (I've got a relatively inexpensive system using DDR3-1866 RAM that has roughly 22-24GB/s of memory bandwidth)). Looking at this another way, the fact that the storage device is the bottleneck right now is not a good excuse to not worry about making everything else as efficient as possible. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 28/06/16 22:25, Austin S. Hemmelgarn wrote: > On 2016-06-28 08:14, Steven Haigh wrote: >> On 28/06/16 22:05, Austin S. Hemmelgarn wrote: >>> On 2016-06-27 17:57, Zygo Blaxell wrote: >>>> On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote: >>>>> On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn >>>>> <ahferroin7@gmail.com> wrote: >>>>>> On 2016-06-25 12:44, Chris Murphy wrote: >>>>>>> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn >>>>>>> <ahferroin7@gmail.com> wrote: >>>>>>> >>>>>>> OK but hold on. During scrub, it should read data, compute checksums >>>>>>> *and* parity, and compare those to what's on-disk - > EXTENT_CSUM in >>>>>>> the checksum tree, and the parity strip in the chunk tree. And if >>>>>>> parity is wrong, then it should be replaced. >>>>>> >>>>>> Except that's horribly inefficient. With limited exceptions >>>>>> involving >>>>>> highly situational co-processors, computing a checksum of a parity >>>>>> block is >>>>>> always going to be faster than computing parity for the stripe. By >>>>>> using >>>>>> that to check parity, we can safely speed up the common case of near >>>>>> zero >>>>>> errors during a scrub by a pretty significant factor. >>>>> >>>>> OK I'm in favor of that. Although somehow md gets away with this by >>>>> computing and checking parity for its scrubs, and still manages to >>>>> keep drives saturated in the process - at least HDDs, I'm not sure how >>>>> it fares on SSDs. >>>> >>>> A modest desktop CPU can compute raid6 parity at 6GB/sec, a less-modest >>>> one at more than 10GB/sec. Maybe a bottleneck is within reach of an >>>> array of SSDs vs. a slow CPU. >>> OK, great for people who are using modern desktop or server CPU's. Not >>> everyone has that luxury, and even on many such CPU's, it's _still_ >>> faster to computer CRC32c checksums. On top of that, we don't appear to >>> be using the in-kernel parity-raid libraries (or if we are, I haven't >>> been able to find where we are calling the functions for it), so we >>> don't necessarily get assembly optimized or co-processor accelerated >>> computation of the parity itself. The other thing that I didn't mention >>> above though, is that computing parity checksums will always take less >>> time than computing parity, because you have to process significantly >>> less data. On a 4 disk RAID5 array, you're processing roughly 2/3 as >>> much data to do the parity checksums instead of parity itself, which >>> means that the parity computation would need to be 200% faster than the >>> CRC32c computation to break even, and this margin gets bigger and bigger >>> as you add more disks. >>> >>> On small arrays, this obviously won't have much impact. Once you start >>> to scale past a few TB though, even a few hundred MB/s faster processing >>> means a significant decrease in processing time. Say you have a CPU >>> which gets about 12.0GB/s for RAID5 parity, and and about 12.25GB/s for >>> CRC32c (~2% is a conservative ratio assuming you use the CRC32c >>> instruction and assembly optimized RAID5 parity computations on a modern >>> x86_64 processor (the ratio on both the mobile Core i5 in my laptop and >>> the Xeon E3 in my home server is closer to 5%)). 
Assuming those >>> numbers, and that we're already checking checksums on non-parity blocks, >>> processing 120TB of data in a 4 disk array (which gives 40TB of parity >>> data, so 160TB total) gives: >>> For computing the parity to scrub: >>> 120TB / 12.25GB = 9795.9 seconds for processing CRC32c csums of all the >>> regular data >>> 120TB / 12GB = 10000 seconds for processing parity of all stripes >>> = 19795.9 seconds total >>> ~ 5.4 hours total >>> >>> For computing csums of the parity: >>> 120TB / 12.25GB = 9795.9 seconds for processing CRC32c csums of all the >>> regular data >>> 40TB / 12.25GB = 3265.3 seconds for processing CRC32c csums of all the >>> parity data >>> = 13061.2 seconds total >>> ~ 3.6 hours total >>> >>> The checksum based computation is approximately 34% faster than the >>> parity computation. Much of this of course is that you have to process >>> the regular data twice for the parity computation method (once for >>> csums, once for parity). You could probably do one pass computing both >>> values, but that would need to be done carefully; and, without >>> significant optimization, would likely not get you much benefit other >>> than cutting the number of loads in half. >> >> And it all means jack shit because you don't get the data to disk that >> quick. Who cares if its 500% faster - if it still saturates the >> throughput of the actual drives, what difference does it make? > It has less impact on everything else running on the system at the time > because it uses less CPU time and potentially less memory. This is the > exact same reason that you want your RAID parity computation performance > as good as possible, the less time the CPU spends on that, the more it > can spend on other things. On top of that, there are high-end systems > that do have SSD's that can get multiple GB/s of data transfer per > second, and NVDIMM's are starting to become popular in the server > market, and those give you data transfer speeds equivalent to regular > memory bandwidth (which can be well over 20GB/s on decent hardware (I've > got a relatively inexpensive system using DDR3-1866 RAM that has roughly > 22-24GB/s of memory bandwidth)). Looking at this another way, the fact > that the storage device is the bottleneck right now is not a good excuse > to not worry about making everything else as efficient as possible. If its purely about performance - then start with multi-thread as a base - not chopping features to make better performance. I'm not aware of any modern CPU that comes with a single core these days - so parallel workloads are much more efficient than a single thread. Yes, its a law of diminishing returns - but if you're not doing a full check of data when one would assume you are, then is that broken by design? Personally, during a scrub, I would want to know if either the checksum OR the parity is wrong - as that indicates problems at a much deeper level. As someone who just lost ~4Tb of data due to BTRFS bugs, protection of data trumps performance in most cases.
Just wiping the slate clean to summarize:

1. We have a consistent (~1 in 3, maybe 1 in 2) reproducible corruption of *data extent* parity during a scrub with raid5. Goffredo and I have both reproduced it. It's a big bug. It might still be useful if someone else can reproduce it too.

Goffredo, can you file a bug at bugzilla.kernel.org and reference your bug thread? I don't know if the key developers know about this; it might be worth pinging them on IRC once the bug is filed.

It's unknown whether it affects balance, or raid6. And if it affects raid6, is p or q corrupted, or both? It's also unknown how this manifests on the metadata raid5 profile (only data raid5 was tested). Presumably, if there is metadata corruption that's fixed during a scrub and its parity is overwritten with corrupt parity, then the next time there's a degraded state the file system would face plant somehow. And we've seen quite a few degraded raid5's (and even 6's) face plant in inexplicable ways and we just kinda go, shit. Which is what the fs is doing when it encounters a pile of csum errors: it treats the csum errors as a signal to disregard the fs rather than maybe only being suspicious of the fs. Could it turn out that these file systems were recoverable, just that Btrfs wasn't tolerating any csum error and wouldn't proceed further?

2. The existing scrub code computes parity on the fly, compares it with what's on-disk, and overwrites it if there's a mismatch. If there's a mismatch, there's no message anywhere. It's a feature request to get a message on parity mismatches. An additional feature request would be a parity_error counter along the lines of the other error counters we have for scrub stats and dev stats.

3. I think it's a more significant change to get parity checksums stored somewhere. Right now the csum tree holds the item type EXTENT_CSUM, but parity is not an extent; it's also not data, it's a variant of data. So it seems to me we'd need a new item type PARITY_CSUM to get it into the existing csum tree. And I'm not sure what incompatibility that brings; presumably older kernels could mount such a volume read-only safely, but shouldn't write to it, and btrfs check --repair should probably fail too.

Chris Murphy
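For reference, the existing per-device counters that a parity_error counter would sit alongside look like this (device path and values are only illustrative):

  # btrfs device stats /mnt
  [/dev/sdb].write_io_errs    0
  [/dev/sdb].read_io_errs     0
  [/dev/sdb].flush_io_errs    0
  [/dev/sdb].corruption_errs  0
  [/dev/sdb].generation_errs  0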
On 29/06/16 04:01, Chris Murphy wrote: > Just wiping the slate clean to summarize: > > > 1. We have a consistent ~1 in 3 maybe 1 in 2, reproducible corruption > of *data extent* parity during a scrub with raid5. Goffredo and I have > both reproduced it. It's a big bug. It might still be useful if > someone else can reproduce it too. > > Goffredo, can you file a bug at bugzilla.kernel.org and reference your > bug thread? I don't know if the key developers know about this, it > might be worth pinging them on IRC once the bug is filed. > > Unknown if it affects balance, or raid 6. And if it affects raid 6, is > p or q corrupted, or both? Unknown how this manifests on metadata > raid5 profile (only tested was data raid5). Presumably if there is > metadata corruption that's fixed during a scrub, and its parity is > overwritten with corrupt parity, the next time there's a degraded > state, the file system would face plant somehow. And we've seen quite > a few degraded raid5's (and even 6's) face plant in inexplicable ways > and we just kinda go, shit. Which is what the fs is doing when it > encounters a pile of csum errors. It treats the csum errors as a > signal to disregard the fs rather than maybe only being suspicious of > the fs. Could it turn out that these file systems were recoverable, > just that Btrfs wasn't tolerating any csum error and wouldn't proceed > further? I believe this is the same case for RAID6 based on my experiences. I actually wondered if the system halts were the result of a TON of csum errors - not the actual result of those errors. Just about every system hang when to 100% CPU usage on all cores and the system just stopped was after a flood of csum errors. If it was only one or two (or I copied data off via a network connection where the read rate was slower), I found I had a MUCH lower chance of the system locking up. In fact, now that I think about it, when I was copying data to an external USB drive (maxed out at ~30MB/sec), I still got csum errors - but the system never hung. Every crash ended with the last line along the lines of "Stopped recurring error. Your system needs rebooting". I wonder if this error reporting was altered, that the system wouldn't go down. Of course I have no way of testing this.....
Related: http://www.spinics.net/lists/raid/msg52880.html

Looks like there is some traction on figuring out what to do about this, whether it's a udev rule or something that happens in the kernel itself. Pretty much the only hardware setups unaffected by this are those with enterprise or NAS drives. Every configuration of a consumer drive is adversely affected: single, linear/concat, and all software (mdadm, lvm, Btrfs) RAID levels.

I suspect, but haven't tested, that ZFS On Linux would be equally affected, unless they're completely reimplementing their own block layer(?). So there are quite a few parties now negatively impacted by the current default behavior.

Chris Murphy
On 2016-07-05 19:05, Chris Murphy wrote: > Related: > http://www.spinics.net/lists/raid/msg52880.html > > Looks like there is some traction to figuring out what to do about > this, whether it's a udev rule or something that happens in the kernel > itself. Pretty much the only hardware setup unaffected by this are > those with enterprise or NAS drives. Every configuration of a consumer > drive, single, linear/concat, and all software (mdadm, lvm, Btrfs) > RAID Levels are adversely affected by this.

The thing I don't get about this is that while the per-device settings on a given system are policy, the default value is not, and should be expected to work correctly (but not necessarily optimally) on as many systems as possible, so any claim that this should be fixed in udev is bogus by the regular kernel rules.

> > I suspect, but haven't tested, that ZFS On Linux would be equally > affected, unless they're completely reimplementing their own block > layer (?) So there are quite a few parties now negatively impacted by > the current default behavior.

OTOH, I would not be surprised if the stance there is 'you get no support if you're not using enterprise drives', not because of the project itself, but because it's ZFS. Part of their minimum recommended hardware requirements is ECC RAM, so it wouldn't surprise me if enterprise storage devices are there too.
On Wed, Jul 6, 2016 at 5:51 AM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: > On 2016-07-05 19:05, Chris Murphy wrote: >> >> Related: >> http://www.spinics.net/lists/raid/msg52880.html >> >> Looks like there is some traction to figuring out what to do about >> this, whether it's a udev rule or something that happens in the kernel >> itself. Pretty much the only hardware setup unaffected by this are >> those with enterprise or NAS drives. Every configuration of a consumer >> drive, single, linear/concat, and all software (mdadm, lvm, Btrfs) >> RAID Levels are adversely affected by this. > > The thing I don't get about this is that while the per-device settings on a > given system are policy, the default value is not, and should be expected to > work correctly (but not necessarily optimally) on as many systems as > possible, so any claim that this should be fixed in udev are bogus by the > regular kernel rules. Sure. But changing it in the kernel leads to what other consequences? It fixes the problem under discussion but what problem will it introduce? I think it's valid to explore this, at the least so affected parties can be informed. Also, the problem isn't instigated by Linux, rather by drive manufacturers introducing a whole new kind of error recovery, with an order of magnitude longer recovery time. Now probably most hardware in the field are such drives. Even SSDs like my Samsung 840 EVO that support SCT ERC have it disabled, therefore the top end recovery time is undiscoverable in the device itself. Maybe it's buried in a spec. So does it make sense to just set the default to 180? Or is there a smarter way to do this? I don't know. >> I suspect, but haven't tested, that ZFS On Linux would be equally >> affected, unless they're completely reimplementing their own block >> layer (?) So there are quite a few parties now negatively impacted by >> the current default behavior. > > OTOH, I would not be surprised if the stance there is 'you get no support if > your not using enterprise drives', not because of the project itself, but > because it's ZFS. Part of their minimum recommended hardware requirements > is ECC RAM, so it wouldn't surprise me if enterprise storage devices are > there too. http://open-zfs.org/wiki/Hardware "Consistent performance requires hard drives that support error recovery control. " "Drives that lack such functionality can be expected to have arbitrarily high limits. Several minutes is not impossible. Drives with this functionality typically default to 7 seconds. ZFS does not currently adjust this setting on drives. However, it is advisable to write a script to set the error recovery time to a low value, such as 0.1 seconds until ZFS is modified to control it. This must be done on every boot. " They do not explicitly require enterprise drives, but they clearly expect SCT ERC enabled to some sane value. At least for Btrfs and ZFS, the mkfs is in a position to know all parameters for properly setting SCT ERC and the SCSI command timer for every device. Maybe it could create the udev rule? Single and raid0 profiles need to permit long recoveries; where raid1, 5, 6 need to set things for very short recoveries. Possibly mdadm and lvm tools do the same thing.
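For what it's worth, the kind of per-boot script that wiki text describes can be as small as this (the drive list and the 70,70 value are illustrative; the units are tenths of a second, so the wiki's 0.1 s would be scterc,1,1):

  #!/bin/sh
  # Re-apply a short SCT ERC limit at every boot; most drives forget it at power-off.
  for dev in /dev/sd[a-d]; do
      smartctl -q errorsonly -l scterc,70,70 "$dev"
  done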
On 2016-07-06 12:43, Chris Murphy wrote: > On Wed, Jul 6, 2016 at 5:51 AM, Austin S. Hemmelgarn > <ahferroin7@gmail.com> wrote: >> On 2016-07-05 19:05, Chris Murphy wrote: >>> >>> Related: >>> http://www.spinics.net/lists/raid/msg52880.html >>> >>> Looks like there is some traction to figuring out what to do about >>> this, whether it's a udev rule or something that happens in the kernel >>> itself. Pretty much the only hardware setup unaffected by this are >>> those with enterprise or NAS drives. Every configuration of a consumer >>> drive, single, linear/concat, and all software (mdadm, lvm, Btrfs) >>> RAID Levels are adversely affected by this. >> >> The thing I don't get about this is that while the per-device settings on a >> given system are policy, the default value is not, and should be expected to >> work correctly (but not necessarily optimally) on as many systems as >> possible, so any claim that this should be fixed in udev are bogus by the >> regular kernel rules. > > Sure. But changing it in the kernel leads to what other consequences? > It fixes the problem under discussion but what problem will it > introduce? I think it's valid to explore this, at the least so > affected parties can be informed. > > Also, the problem isn't instigated by Linux, rather by drive > manufacturers introducing a whole new kind of error recovery, with an > order of magnitude longer recovery time. Now probably most hardware in > the field are such drives. Even SSDs like my Samsung 840 EVO that > support SCT ERC have it disabled, therefore the top end recovery time > is undiscoverable in the device itself. Maybe it's buried in a spec. > > So does it make sense to just set the default to 180? Or is there a > smarter way to do this? I don't know. Just thinking about this: 1. People who are setting this somewhere will be functionally unaffected. 2. People using single disks which have lots of errors may or may not see an apparent degradation of performance, but will likely have the life expectancy of their device extended. 3. Individuals who are not setting this but should be will on average be no worse off than before other than seeing a bigger performance hit on a disk error. 4. People with single disks which are new will see no functional change until the disk has an error. In an ideal situation, what I'd want to see is: 1. If the device supports SCT ERC, set scsi_command_timer to reasonable percentage over that (probably something like 25%, which would give roughly 10 seconds for the normal 7 second ERC timer). 2. If the device is actually a SCSI device, keep the 30 second timer (IIRC< this is reasonable for SCSI disks). 3. Otherwise, set the timer to 200 (we need a slight buffer over the expected disk timeout to account for things like latency outside of the disk). > > >>> I suspect, but haven't tested, that ZFS On Linux would be equally >>> affected, unless they're completely reimplementing their own block >>> layer (?) So there are quite a few parties now negatively impacted by >>> the current default behavior. >> >> OTOH, I would not be surprised if the stance there is 'you get no support if >> your not using enterprise drives', not because of the project itself, but >> because it's ZFS. Part of their minimum recommended hardware requirements >> is ECC RAM, so it wouldn't surprise me if enterprise storage devices are >> there too. > > http://open-zfs.org/wiki/Hardware > "Consistent performance requires hard drives that support error > recovery control. 
" > > "Drives that lack such functionality can be expected to have > arbitrarily high limits. Several minutes is not impossible. Drives > with this functionality typically default to 7 seconds. ZFS does not > currently adjust this setting on drives. However, it is advisable to > write a script to set the error recovery time to a low value, such as > 0.1 seconds until ZFS is modified to control it. This must be done on > every boot. " > > They do not explicitly require enterprise drives, but they clearly > expect SCT ERC enabled to some sane value. > > At least for Btrfs and ZFS, the mkfs is in a position to know all > parameters for properly setting SCT ERC and the SCSI command timer for > every device. Maybe it could create the udev rule? Single and raid0 > profiles need to permit long recoveries; where raid1, 5, 6 need to set > things for very short recoveries. > > Possibly mdadm and lvm tools do the same thing. I"m pretty certain they don't create rules, or even try to check the drive for SCT ERC support. The problem with doing this is that you can't be certain that your underlying device is actually a physical storage device or not, and thus you have to check more than just the SCT ERC commands, and many people (myself included) don't like tools doing things that modify the persistent functioning of their system that the tool itself is not intended to do (and messing with block layer settings falls into that category for a mkfs tool). -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jul 6, 2016 at 11:18 AM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: > On 2016-07-06 12:43, Chris Murphy wrote: >> So does it make sense to just set the default to 180? Or is there a >> smarter way to do this? I don't know. > > Just thinking about this: > 1. People who are setting this somewhere will be functionally unaffected. I think it's statistically 0 people changing this from default. It's people with drives that have no SCT ERC support, used in raid1+, who happen to stumble upon this very obscure workaround to avoid link resets in the face of media defects. Rare. > 2. People using single disks which have lots of errors may or may not see an > apparent degradation of performance, but will likely have the life > expectancy of their device extended. Well they have link resets and their file system presumably face plants as a result of a pile of commands in the queue returning as unsuccessful. So they have premature death of their system, rather than it getting sluggish. This is a long-standing indicator on Windows to just reinstall the OS and restore data from backups -> the user has an opportunity to freshen up user data backup, and the reinstallation and restore from backup results in freshly written sectors which is how bad sectors get fixed. The marginally bad sectors get new writes and now read fast (or fast enough), and the persistently bad sectors result in the drive firmware remapping to reserve sectors. The main thing in my opinion is less the extension of drive life than that the user gets to use the system, albeit sluggishly, to make a backup of their data rather than possibly losing it. > 3. Individuals who are not setting this but should be will on average be no > worse off than before other than seeing a bigger performance hit on a disk > error. > 4. People with single disks which are new will see no functional change > until the disk has an error. I follow. > > In an ideal situation, what I'd want to see is: > 1. If the device supports SCT ERC, set scsi_command_timer to a reasonable > percentage over that (probably something like 25%, which would give roughly > 10 seconds for the normal 7 second ERC timer). > 2. If the device is actually a SCSI device, keep the 30 second timer (IIRC, > this is reasonable for SCSI disks). > 3. Otherwise, set the timer to 200 (we need a slight buffer over the > expected disk timeout to account for things like latency outside of the > disk). Well if it's a non-redundant configuration, you'd want those long recoveries permitted, rather than enable SCT ERC. The drive has the ability to relocate sector data on a marginal (slow) read that's still successful. But clearly many manufacturers tolerate slow reads that don't result in immediate reallocation or overwrite or we wouldn't be in this situation in the first place. I think this auto reallocation is thwarted by enabling SCT ERC. It just flat out gives up and reports a read error. So it is still data loss in the non-redundant configuration and thus not an improvement. Basically it's: For SATA and USB drives: if data redundant, then enable short SCT ERC time if supported, if not supported then extend SCSI command timer to 200; if data not redundant, then disable SCT ERC if supported, and extend SCSI command timer to 200. For SCSI (SAS most likely these days), keep things the same as now. But that's only because this is a rare enough configuration now I don't know if we really know the problems there.
It may be that their error recovery in 7 seconds is massively better and more reliable than consumer drives over 180 seconds. > >> >> >>>> I suspect, but haven't tested, that ZFS On Linux would be equally >>>> affected, unless they're completely reimplementing their own block >>>> layer (?) So there are quite a few parties now negatively impacted by >>>> the current default behavior. >>> >>> >>> OTOH, I would not be surprised if the stance there is 'you get no support >>> if >>> your not using enterprise drives', not because of the project itself, but >>> because it's ZFS. Part of their minimum recommended hardware >>> requirements >>> is ECC RAM, so it wouldn't surprise me if enterprise storage devices are >>> there too. >> >> >> http://open-zfs.org/wiki/Hardware >> "Consistent performance requires hard drives that support error >> recovery control. " >> >> "Drives that lack such functionality can be expected to have >> arbitrarily high limits. Several minutes is not impossible. Drives >> with this functionality typically default to 7 seconds. ZFS does not >> currently adjust this setting on drives. However, it is advisable to >> write a script to set the error recovery time to a low value, such as >> 0.1 seconds until ZFS is modified to control it. This must be done on >> every boot. " >> >> They do not explicitly require enterprise drives, but they clearly >> expect SCT ERC enabled to some sane value. >> >> At least for Btrfs and ZFS, the mkfs is in a position to know all >> parameters for properly setting SCT ERC and the SCSI command timer for >> every device. Maybe it could create the udev rule? Single and raid0 >> profiles need to permit long recoveries; where raid1, 5, 6 need to set >> things for very short recoveries. >> >> Possibly mdadm and lvm tools do the same thing. > > I'm pretty certain they don't create rules, or even try to check the drive > for SCT ERC support. They don't. That's a suggested change in behavior. Sorry, I meant "should do the same thing" instead of "do the same thing". > The problem with doing this is that you can't be > certain that your underlying device is actually a physical storage device or > not, and thus you have to check more than just the SCT ERC commands, and > many people (myself included) don't like tools doing things that modify the > persistent functioning of their system that the tool itself is not intended > to do (and messing with block layer settings falls into that category for a > mkfs tool). Yep it's imperfect unless there's the proper cross communication between layers. There are some such things like hardware raid geometry that optionally poke through (when supported by hardware raid drivers) so that things like mkfs.xfs can automatically provide the right sunit swidth for optimized layout; which the device mapper already does automatically. So it could be done; it's just a matter of how big a problem this is to build, vs just going with a new one-size-fits-all default command timer? If it were always 200 instead of 30, the consequence only shows up if there's a link problem that is not related to media errors. But what the hell takes that long to report an explicit error? Even cable problems generate UDMA errors pretty much instantly.
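[Illustration only: the two SATA/USB branches in the message above correspond to the following commands. /dev/sdX is a placeholder, the values restate the figures in the thread (ERC times are given to smartctl in tenths of a second), and whether a given drive -- especially one behind a USB bridge -- honours them is exactly the open question being discussed.]

# Member of a redundant array (md/LVM/Btrfs raid1/5/6): fail fast and let
# the upper layer reconstruct the data.  70,70 = 7.0 s for reads and writes.
smartctl -l scterc,70,70 /dev/sdX
# If the drive does not support SCT ERC, extend the kernel's command timer
# instead so the drive's own long recovery can finish:
echo 200 > /sys/block/sdX/device/timeout

# Non-redundant data (single disk, raid0/linear): let the drive try as long
# as it likes, and keep the kernel from resetting the link while it does.
smartctl -l scterc,0,0 /dev/sdX        # 0,0 disables the ERC time limit
echo 200 > /sys/block/sdX/device/timeout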
On 2016-07-06 14:45, Chris Murphy wrote: > On Wed, Jul 6, 2016 at 11:18 AM, Austin S. Hemmelgarn > <ahferroin7@gmail.com> wrote: >> On 2016-07-06 12:43, Chris Murphy wrote: > >>> So does it make sense to just set the default to 180? Or is there a >>> smarter way to do this? I don't know. >> >> Just thinking about this: >> 1. People who are setting this somewhere will be functionally unaffected. > > I think it's statistically 0 people changing this from default. It's > people with drives that have no SCT ERC support, used in raid1+, who > happen to stumble upon this very obscure workaround to avoid link > resets in the face of media defects. Rare. Not as much as you think: once someone has hit this issue, they usually put preventative measures in place on any system where it applies. I'd be willing to bet that most sysadmins at big companies like Red Hat or Oracle are setting this. > > >> 2. People using single disks which have lots of errors may or may not see an >> apparent degradation of performance, but will likely have the life >> expectancy of their device extended. > > Well they have link resets and their file system presumably face > plants as a result of a pile of commands in the queue returning as > unsuccessful. So they have premature death of their system, rather > than it getting sluggish. This is a long-standing indicator on Windows > to just reinstall the OS and restore data from backups -> the user has > an opportunity to freshen up user data backup, and the reinstallation > and restore from backup results in freshly written sectors which is > how bad sectors get fixed. The marginally bad sectors get new writes > and now read fast (or fast enough), and the persistently bad sectors > result in the drive firmware remapping to reserve sectors. > > The main thing in my opinion is less the extension of drive life than that > the user gets to use the system, albeit sluggishly, to make a backup of > their data rather than possibly losing it. The extension of the drive's lifetime is a nice benefit, but not what my point was here. For people in this particular case, it will almost certainly only make things better (although at first it may make performance worse). > > >> 3. Individuals who are not setting this but should be will on average be no >> worse off than before other than seeing a bigger performance hit on a disk >> error. >> 4. People with single disks which are new will see no functional change >> until the disk has an error. > > I follow. > > >> >> In an ideal situation, what I'd want to see is: >> 1. If the device supports SCT ERC, set scsi_command_timer to a reasonable >> percentage over that (probably something like 25%, which would give roughly >> 10 seconds for the normal 7 second ERC timer). >> 2. If the device is actually a SCSI device, keep the 30 second timer (IIRC, >> this is reasonable for SCSI disks). >> 3. Otherwise, set the timer to 200 (we need a slight buffer over the >> expected disk timeout to account for things like latency outside of the >> disk). > > Well if it's a non-redundant configuration, you'd want those long > recoveries permitted, rather than enable SCT ERC. The drive has the > ability to relocate sector data on a marginal (slow) read that's still > successful. But clearly many manufacturers tolerate slow reads that > don't result in immediate reallocation or overwrite or we wouldn't be > in this situation in the first place. I think this auto reallocation > is thwarted by enabling SCT ERC. It just flat out gives up and reports > a read error.
So it is still data loss in the non-redundant > configuration and thus not an improvement. I agree, but if it's only the kernel doing this, then we can't make judgements based on userspace usage. Also, the first situation, while not optimal, is still better than what happens now; at least there you will get an I/O error in a reasonable amount of time (as opposed to after a really long time if ever). > > Basically it's: > > For SATA and USB drives: > > if data redundant, then enable short SCT ERC time if supported, if not > supported then extend SCSI command timer to 200; > > if data not redundant, then disable SCT ERC if supported, and extend > SCSI command timer to 200. > > For SCSI (SAS most likely these days), keep things the same as now. > But that's only because this is a rare enough configuration now I > don't know if we really know the problems there. It may be that their > error recovery in 7 seconds is massively better and more reliable than > consumer drives over 180 seconds. I don't see why you would think this is not common. If you count just by systems, then it's absolutely outnumbered at least 100 to 1 by regular ATA disks. If you look at individual disks though, the reverse is true, because people who use SCSI drives tend to use _lots_ of disks (think big data centers, NAS and SAN systems and such). OTOH, both are probably vastly outnumbered by stuff that doesn't use either standard for storage... Separately, USB gets _really_ complicated if you want to cover everything: USB drives may or may not present as non-rotational, may or may not show up as SATA or SCSI bridges (there are some of the more expensive flash drives that actually use SSD controllers plus USB-SAT chips internally), and if they do show up as such, may or may not support the required commands (most don't, but it's seemingly hit or miss which do). > > >> >>> >>> >>>>> I suspect, but haven't tested, that ZFS On Linux would be equally >>>>> affected, unless they're completely reimplementing their own block >>>>> layer (?) So there are quite a few parties now negatively impacted by >>>>> the current default behavior. >>>> >>>> >>>> OTOH, I would not be surprised if the stance there is 'you get no support >>>> if >>>> your not using enterprise drives', not because of the project itself, but >>>> because it's ZFS. Part of their minimum recommended hardware >>>> requirements >>>> is ECC RAM, so it wouldn't surprise me if enterprise storage devices are >>>> there too. >>> >>> >>> http://open-zfs.org/wiki/Hardware >>> "Consistent performance requires hard drives that support error >>> recovery control. " >>> >>> "Drives that lack such functionality can be expected to have >>> arbitrarily high limits. Several minutes is not impossible. Drives >>> with this functionality typically default to 7 seconds. ZFS does not >>> currently adjust this setting on drives. However, it is advisable to >>> write a script to set the error recovery time to a low value, such as >>> 0.1 seconds until ZFS is modified to control it. This must be done on >>> every boot. " >>> >>> They do not explicitly require enterprise drives, but they clearly >>> expect SCT ERC enabled to some sane value. >>> >>> At least for Btrfs and ZFS, the mkfs is in a position to know all >>> parameters for properly setting SCT ERC and the SCSI command timer for >>> every device. Maybe it could create the udev rule? Single and raid0 >>> profiles need to permit long recoveries; where raid1, 5, 6 need to set >>> things for very short recoveries.
>>> >>> Possibly mdadm and lvm tools do the same thing. >> >> I'm pretty certain they don't create rules, or even try to check the drive >> for SCT ERC support. > > They don't. That's a suggested change in behavior. Sorry, I meant "should do > the same thing" instead of "do the same thing". > > >> The problem with doing this is that you can't be >> certain that your underlying device is actually a physical storage device or >> not, and thus you have to check more than just the SCT ERC commands, and >> many people (myself included) don't like tools doing things that modify the >> persistent functioning of their system that the tool itself is not intended >> to do (and messing with block layer settings falls into that category for a >> mkfs tool). > > Yep it's imperfect unless there's the proper cross communication > between layers. There are some such things like hardware raid geometry > that optionally poke through (when supported by hardware raid drivers) > so that things like mkfs.xfs can automatically provide the right sunit > swidth for optimized layout; which the device mapper already does > automatically. So it could be done; it's just a matter of how big a > problem this is to build, vs just going with a new one-size-fits-all > default command timer? The other problem though is that the existing things pass through _read-only_ data, while this requires writable data to be passed through, which leads to all kinds of complicated issues potentially. > > If it were always 200 instead of 30, the consequence only shows up if there's a > link problem that is not related to media errors. But what the hell > takes that long to report an explicit error? Even cable problems > generate UDMA errors pretty much instantly. And that is more why I'd suggest changing the kernel default first before trying to use special heuristics or anything like that. The caveat is that it would need to be for ATA disks only to not break SCSI (which works fine right now) and USB (which has its own unique issues).
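[Illustration only: the "ATA disks only" scoping described above can also be expressed today as a local udev rule rather than a kernel change. The file name and the 180-second value below are hypothetical; the rule itself uses only standard udev keys (ID_BUS is set by udev's ata_id builtin, and ATTR{device/timeout} writes the per-device command timer in sysfs).]

# Hypothetical drop-in; takes effect on the next add/change event per disk.
cat > /etc/udev/rules.d/60-ata-command-timeout.rules <<'EOF'
ACTION=="add|change", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", ENV{ID_BUS}=="ata", ATTR{device/timeout}="180"
EOF
udevadm control --reload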
On Wed, Jul 6, 2016 at 1:15 PM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: > On 2016-07-06 14:45, Chris Murphy wrote: >> I think it's statistically 0 people changing this from default. It's >> people with drives that have no SCT ERC support, used in raid1+, who >> happen to stumble upon this very obscure workaround to avoid link >> resets in the face of media defects. Rare. > > Not as much as you think: once someone has hit this issue, they usually put > preventative measures in place on any system where it applies. I'd be > willing to bet that most sysadmins at big companies like Red Hat or Oracle > are setting this. SCT ERC yes. Changing the kernel's command timer? I think almost zero. >> Well they have link resets and their file system presumably face >> plants as a result of a pile of commands in the queue returning as >> unsuccessful. So they have premature death of their system, rather >> than it getting sluggish. This is a long-standing indicator on Windows >> to just reinstall the OS and restore data from backups -> the user has >> an opportunity to freshen up user data backup, and the reinstallation >> and restore from backup results in freshly written sectors which is >> how bad sectors get fixed. The marginally bad sectors get new writes >> and now read fast (or fast enough), and the persistently bad sectors >> result in the drive firmware remapping to reserve sectors. >> >> The main thing in my opinion is less the extension of drive life than that >> the user gets to use the system, albeit sluggishly, to make a backup of >> their data rather than possibly losing it. > > The extension of the drive's lifetime is a nice benefit, but not what my > point was here. For people in this particular case, it will almost > certainly only make things better (although at first it may make performance > worse). I'm not sure why it makes performance worse. The options are slower reads vs a file system that almost certainly face plants upon a link reset. >> Basically it's: >> >> For SATA and USB drives: >> >> if data redundant, then enable short SCT ERC time if supported, if not >> supported then extend SCSI command timer to 200; >> >> if data not redundant, then disable SCT ERC if supported, and extend >> SCSI command timer to 200. >> >> For SCSI (SAS most likely these days), keep things the same as now. >> But that's only because this is a rare enough configuration now I >> don't know if we really know the problems there. It may be that their >> error recovery in 7 seconds is massively better and more reliable than >> consumer drives over 180 seconds. > > I don't see why you would think this is not common. I was not clear. Single device SAS is probably not common. They're typically being used in arrays where data is redundant. Using such a drive with short error recovery as a single boot drive? Probably not that common. > Separately, USB gets _really_ complicated if you want to cover everything: > USB drives may or may not present as non-rotational, may or may not show up > as SATA or SCSI bridges (there are some of the more expensive flash drives > that actually use SSD controllers plus USB-SAT chips internally), and if they do > show up as such, may or may not support the required commands (most don't, > but it's seemingly hit or miss which do). Yup. Well, do what we can instead of just ignoring the problem? They can still be polled for features including SCT ERC and if it's not supported or configurable then fall back to increasing the command timer. I'm not sure what else can be done anyway.
The main obstacle is squaring the device capability (low level) with storage stack redundancy 0 or 1 (high level). Something has to be aware of both to get all devices ideally configured. >> Yep it's imperfect unless there's the proper cross communication >> between layers. There are some such things like hardware raid geometry >> that optionally poke through (when supported by hardware raid drivers) >> so that things like mkfs.xfs can automatically provide the right sunit >> swidth for optimized layout; which the device mapper already does >> automatically. So it could be done; it's just a matter of how big a >> problem this is to build, vs just going with a new one-size-fits-all >> default command timer? > > The other problem though is that the existing things pass through > _read-only_ data, while this requires writable data to be passed through, > which leads to all kinds of complicated issues potentially. I'm aware. There are also plenty of bugs even if write were to pass through. I've encountered more drives than not which accept only one SCT ERC change per poweron. A 2nd change causes the drive to go offline and vanish off the bus. So no doubt this whole area is fragile enough that not even the drive, controller, and enclosure vendors are aware of where all the bodies are buried. What I think is fairly well established is that at least on Windows the lower-level stuff, including the kernel, tolerates these very high recovery times. The OS just gets irritatingly slow but doesn't flip out. Linux is flipping out. And it's not Linux's direct fault, that's drive manufacturers, but Linux needs to adapt. >> >> >> If it were always 200 instead of 30, the consequence only shows up if there's a >> link problem that is not related to media errors. But what the hell >> takes that long to report an explicit error? Even cable problems >> generate UDMA errors pretty much instantly. > > And that is more why I'd suggest changing the kernel default first before > trying to use special heuristics or anything like that. The caveat is that > it would need to be for ATA disks only to not break SCSI (which works fine > right now) and USB (which has its own unique issues). I think you're probably right. Simpler is better. Thing is, there will be consequences. In the software raid case where a drive hangs on a media defect, right now this means a link reset at 30 seconds, which results in md reconstructing data and it goes where needed by pretty much 31 seconds after requested. If that changes to 180 seconds, there will no doubt be some use cases that will be, WTF just happened? This used to always recover in 30 seconds at the longest, and now it's causing the network stack to implode while waiting. So all kinds of other timeouts might get impacted. I wonder if it makes sense to change the default SCSI command timer on a distribution and see what happens - if e.g. Fedora or openSUSE would volunteer to make the change for Rawhide or Tumbleweed. *shrug* Statistically, users of those rolling releases may not have a drive with media defects and a delay-intolerant workload for maybe years...
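[Illustration only: the polling mentioned above (checking a drive for SCT ERC support before deciding what to do with the command timer) boils down to two read-only queries; /dev/sdX is a placeholder and neither command changes any setting.]

# Ask the drive whether SCT ERC is supported and what it is currently set to
# (smartctl reports the read/write limits, or that the command is unsupported):
smartctl -l scterc /dev/sdX
# Show the kernel's current command timer for the same disk, in seconds
# (30 is the default under discussion):
cat /sys/block/sdX/device/timeout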
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 57fccc4..1ed1e43 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2296,7 +2296,10 @@ int btrfs_check_repairable(struct inode *inode, struct bio *failed_bio,
 		 * if the following BUG_ON triggers, our validation request got
 		 * merged. we need separate requests for our algorithm to work.
 		 */
-		BUG_ON(failrec->in_validation);
+		if (failrec->in_validation) {
+			printk(KERN_ERR "failrec->in_validation at %s:%d\n", __FUNCTION__, __LINE__);
+			return 0;
+		}
 		failrec->in_validation = 1;
 		failrec->this_mirror = failed_mirror;
 	} else {
@@ -2306,7 +2309,10 @@ int btrfs_check_repairable(struct inode *inode, struct bio *failed_bio,
 		 * everything for repair_io_failure to do the rest for us.
 		 */
 		if (failrec->in_validation) {
-			BUG_ON(failrec->this_mirror != failed_mirror);
+			if (failrec->this_mirror != failed_mirror) {
+				printk(KERN_ERR "failrec->this_mirror (%d) != failed_mirror (%d) at %s:%d\n", failrec->this_mirror, failed_mirror, __FUNCTION__, __LINE__);
+				return 0;
+			}
 			failrec->in_validation = 0;
 			failrec->this_mirror = 0;
 		}

'failrec->this_mirror' often has values like 0, 1, or 2, but sometimes has values like -445886021 or -823858068. More uninitialized memory? This worries me.

The other BUG_ON lands here:

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c1fe22c..71afeb3 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2018,10 +2018,10 @@ int repair_io_failure(struct inode *inode, u64 start, u64 length, u64 logical,
 	int ret;
 
 	ASSERT(!(fs_info->sb->s_flags & MS_RDONLY));
-	BUG_ON(!mirror_num);
+	WARN_ON(!mirror_num);
 
 	/* we can't repair anything in raid56 yet */
-	if (btrfs_is_parity_mirror(map_tree, logical, length, mirror_num))
+	if (!mirror_num || btrfs_is_parity_mirror(map_tree, logical, length, mirror_num))
 		return 0;
 
 	bio = btrfs_io_bio_alloc(GFP_NOFS, 1);