Message ID | 5e8601d223da$0cbf6c00$263e4400$@com (mailing list archive) |
---|---|
State | New, archived |
On Tue, Oct 11, 2016 at 10:10 AM, Jason D. Michaelson <jasondmichaelson@gmail.com> wrote:

> superblock: bytenr=65536, device=/dev/sda
> ---------------------------------------------------------
> generation 161562
> root 5752616386560

> superblock: bytenr=65536, device=/dev/sdh
> ---------------------------------------------------------
> generation 161474
> root 4844272943104

OK so most obvious is that the bad super is many generations behind the good super. That's expected given all the write errors.

> root@castor:~/logs# btrfs-find-root /dev/sda
> parent transid verify failed on 5752357961728 wanted 161562 found 159746
> parent transid verify failed on 5752357961728 wanted 161562 found 159746
> Couldn't setup extent tree
> Superblock thinks the generation is 161562
> Superblock thinks the level is 1

This squares with the good super, so btrfs-find-root is using a good super. I don't know what 5752357961728 is for; maybe it's possible to read it with

  btrfs-debug-tree -b 5752357961728 <anydev>

and see what comes back. It is not the tree root according to the super, though. So what do you get for

  btrfs-debug-tree -b 5752616386560 <anydev>

Going back to your logs...

[ 38.810575] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
[ 38.810595] NFSD: starting 90-second grace period (net ffffffffb12e5b80)
[ 241.292816] INFO: task bfad_worker:234 blocked for more than 120 seconds.
[ 241.299135] Not tainted 4.7.0-0.bpo.1-amd64 #1
[ 241.305645] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

I don't know what this kernel is. I think you'd be better off with stable 4.7.7 or 4.8.1 for this work, so you're not running into a bunch of weird blocked-task problems in addition to whatever is going on with the fs.

[ 20.552205] BTRFS: device fsid 73ed01df-fb2a-4b27-b6fc-12a57da934bd devid 3 transid 161562 /dev/sdd
[ 20.552372] BTRFS: device fsid 73ed01df-fb2a-4b27-b6fc-12a57da934bd devid 5 transid 161562 /dev/sdf
[ 20.552524] BTRFS: device fsid 73ed01df-fb2a-4b27-b6fc-12a57da934bd devid 6 transid 161562 /dev/sde
[ 20.552689] BTRFS: device fsid 73ed01df-fb2a-4b27-b6fc-12a57da934bd devid 4 transid 161562 /dev/sdg
[ 20.552858] BTRFS: device fsid 73ed01df-fb2a-4b27-b6fc-12a57da934bd devid 1 transid 161562 /dev/sda
[ 669.843166] BTRFS warning (device sda): devid 2 uuid dc8760f1-2c54-4134-a9a7-a0ac2b7a9f1c is missing
[232572.871243] sd 0:0:8:0: [sdh] tag#4 Sense Key : Medium Error [current]

Two devices are missing, in effect, for this failed read: one literally missing, and the other missing due to an unrecoverable read error. The fact that it's not trying to fix anything suggests it hasn't really finished mounting; there must be something wrong where it either just gets confused and won't fix things (because it might make them worse), or there isn't redundancy.
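For clarity, a minimal sketch of those two reads, assuming btrfs-progs is on the PATH and using /dev/sda to stand in for <anydev>:

  # Dump the block btrfs-find-root keeps complaining about (not the tree root, per the super)
  btrfs-debug-tree -b 5752357961728 /dev/sda

  # Dump the tree root bytenr recorded in the good super
  btrfs-debug-tree -b 5752616386560 /dev/sda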
[52799.495999] mce: [Hardware Error]: Machine check events logged
[53249.491975] mce: [Hardware Error]: Machine check events logged
[231298.005594] mce: [Hardware Error]: Machine check events logged

And a bunch of other hardware issues... I *really* think you need to get the hardware issues sorted out before working on this file system, unless you just don't care that much about it. There are already enough unknowns without adding whatever effect the hardware issues are having while you try to repair things, or even understand what's going on.

> sys_chunk_array[2048]:
>         item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 0)
>                 chunk length 4194304 owner 2 stripe_len 65536
>                 type SYSTEM num_stripes 1
>                         stripe 0 devid 1 offset 0
>                         dev uuid: 08c50aa9-c2dd-43b7-a631-6dfdc7d69ea4
>         item 1 key (FIRST_CHUNK_TREE CHUNK_ITEM 20971520)
>                 chunk length 11010048 owner 2 stripe_len 65536
>                 type SYSTEM|RAID6 num_stripes 6
>                         stripe 0 devid 6 offset 1048576
>                         dev uuid: 390a1fd8-cc6c-40e7-b0b5-88ca7dcbcc32
>                         stripe 1 devid 5 offset 1048576
>                         dev uuid: 2df974c5-9dde-4062-81e9-c6eeee13db62
>                         stripe 2 devid 4 offset 1048576
>                         dev uuid: dce3d159-721d-4859-9955-37a03769bb0d
>                         stripe 3 devid 3 offset 1048576
>                         dev uuid: 6f7142db-824c-4791-a5b2-d6ce11c81c8f
>                         stripe 4 devid 2 offset 1048576
>                         dev uuid: dc8760f1-2c54-4134-a9a7-a0ac2b7a9f1c
>                         stripe 5 devid 1 offset 20971520
>                         dev uuid: 08c50aa9-c2dd-43b7-a631-6dfdc7d69ea4

Huh, well, item 0 is damn strange. I wonder how that happened. The dev uuid of that single SYSTEM chunk matches devid 1, so it is a single point of failure. This could be an artifact of creating the raid6 with an old btrfs-progs. I just created a volume with btrfs-progs 4.7.3:

# mkfs.btrfs -draid6 -mraid6 /dev/mapper/VG-1 /dev/mapper/VG-2 /dev/mapper/VG-3 /dev/mapper/VG-4 /dev/mapper/VG-5 /dev/mapper/VG-6

and the super block's sys_chunk_array has only a SYSTEM|RAID6 chunk; there is no single SYSTEM chunk. After mounting, copying some data over, and then unmounting: same thing, one system chunk, raid6. So *IF* there is anything wrong with that single SYSTEM chunk, all bets are off; there is no way to even attempt to fix the problem. That might explain why it's not getting past the very earliest stage of mounting. But it's inconclusive.
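As a side note, a quick way to check whether a given super carries that lone SYSTEM chunk is to dump the full superblock yourself; a sketch, assuming the btrfs-progs of that era (newer releases expose the same data through 'btrfs inspect-internal dump-super -f'), with /dev/sda standing in for any member device:

  # Print the full superblock, including sys_chunk_array and backup_roots
  btrfs-show-super -f /dev/sda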
Re-adding the btrfs list.

On Tue, Oct 11, 2016 at 1:00 PM, Jason D. Michaelson <jasondmichaelson@gmail.com> wrote:
>
>
>> -----Original Message-----
>> From: chris@colorremedies.com [mailto:chris@colorremedies.com] On
>> Behalf Of Chris Murphy
>> Sent: Tuesday, October 11, 2016 12:41 PM
>> To: Jason D. Michaelson
>> Cc: Chris Murphy; Btrfs BTRFS
>> Subject: Re: raid6 file system in a bad state
>>
>> On Tue, Oct 11, 2016 at 10:10 AM, Jason D. Michaelson
>> <jasondmichaelson@gmail.com> wrote:
>> > superblock: bytenr=65536, device=/dev/sda
>> > ---------------------------------------------------------
>> > generation 161562
>> > root 5752616386560
>>
>> > superblock: bytenr=65536, device=/dev/sdh
>> > ---------------------------------------------------------
>> > generation 161474
>> > root 4844272943104
>>
>> OK so most obvious is that the bad super is many generations behind
>> the good super. That's expected given all the write errors.
>>
>
> Is there any chance/way of going back to use this generation/root as a source for btrfs restore?

Yes, with the -t option and that root bytenr for the generation you want to restore. Thing is, that's so far back the metadata may be gone (overwritten) already. But it's worth a shot; I've recovered recently deleted files this way.

OK at this point I'm thinking that fixing the super blocks won't change anything because it sounds like it's using the new ones anyway, and maybe the thing to try is going back to a tree root that isn't in any of the new supers. That means losing anything that was being written when the lost writes happened. However, for all we know some overwrites happened, in which case this won't work. And it also does nothing to deal with the fragile state of having at least two flaky devices, and one of the system chunks with no redundancy.

Try 'btrfs check' without repair. And then also try it with the -r flag using the various tree roots we've seen so far. Try explicitly using 5752616386560, which is what it ought to use first anyway. And then also 4844272943104. That might go far enough back before the bad sectors were a factor. Normally what you'd want is for it to use one of the backup roots, but it's consistently running into a problem with all of them when using the recovery mount option.
> -----Original Message----- > From: chris@colorremedies.com [mailto:chris@colorremedies.com] On > Behalf Of Chris Murphy > Sent: Tuesday, October 11, 2016 3:38 PM > To: Jason D. Michaelson; Btrfs BTRFS > Cc: Chris Murphy > Subject: Re: raid6 file system in a bad state > > readding btrfs > > On Tue, Oct 11, 2016 at 1:00 PM, Jason D. Michaelson > <jasondmichaelson@gmail.com> wrote: > > > > > >> -----Original Message----- > >> From: chris@colorremedies.com [mailto:chris@colorremedies.com] On > >> Behalf Of Chris Murphy > >> Sent: Tuesday, October 11, 2016 12:41 PM > >> To: Jason D. Michaelson > >> Cc: Chris Murphy; Btrfs BTRFS > >> Subject: Re: raid6 file system in a bad state > >> > >> On Tue, Oct 11, 2016 at 10:10 AM, Jason D. Michaelson > >> <jasondmichaelson@gmail.com> wrote: > >> > superblock: bytenr=65536, device=/dev/sda > >> > --------------------------------------------------------- > >> > generation 161562 > >> > root 5752616386560 > >> > >> > >> > >> > superblock: bytenr=65536, device=/dev/sdh > >> > --------------------------------------------------------- > >> > generation 161474 > >> > root 4844272943104 > >> > >> OK so most obvious is that the bad super is many generations back > >> than the good super. That's expected given all the write errors. > >> > >> > > > > Is there any chance/way of going back to use this generation/root as > a source for btrfs restore? > > Yes with -t option and that root bytenr for the generation you want to > restore. Thing is, that's so far back the metadata may be gone > (overwritten) already. But worth a shot. I've recovered recently > deleted files this way. With the bad disc in place: root@castor:~/btrfs-progs# ./btrfs restore -t 4844272943104 -D /dev/sda /dev/null parent transid verify failed on 4844272943104 wanted 161562 found 161476 parent transid verify failed on 4844272943104 wanted 161562 found 161476 checksum verify failed on 4844272943104 found E808AB28 wanted 0CEB169E checksum verify failed on 4844272943104 found 4694222D wanted 5D4F0640 checksum verify failed on 4844272943104 found 4694222D wanted 5D4F0640 bytenr mismatch, want=4844272943104, have=66211125067776 Couldn't read tree root Could not open root, trying backup super warning, device 6 is missing warning, device 5 is missing warning, device 4 is missing warning, device 3 is missing warning, device 2 is missing checksum verify failed on 20971520 found 0FBD46D5 wanted FC3EB3AB checksum verify failed on 20971520 found 0FBD46D5 wanted FC3EB3AB bytenr mismatch, want=20971520, have=267714560 ERROR: cannot read chunk root Could not open root, trying backup super warning, device 6 is missing warning, device 5 is missing warning, device 4 is missing warning, device 3 is missing warning, device 2 is missing checksum verify failed on 20971520 found 0FBD46D5 wanted FC3EB3AB checksum verify failed on 20971520 found 0FBD46D5 wanted FC3EB3AB bytenr mismatch, want=20971520, have=267714560 ERROR: cannot read chunk root Could not open root, trying backup super And what's interesting is that when I move the /dev/sdd (the current bad disc) out of /dev, rescan, and run btrfs restore with the main root I get similar output: root@castor:~/btrfs-progs# ./btrfs restore -D /dev/sda /dev/null warning, device 2 is missing checksum verify failed on 21430272 found 71001E6E wanted 95E3A3D8 checksum verify failed on 21430272 found 992E0C37 wanted 36992D8B checksum verify failed on 21430272 found 992E0C37 wanted 36992D8B bytenr mismatch, want=21430272, have=264830976 Couldn't read chunk tree Could not open 
root, trying backup super warning, device 6 is missing warning, device 5 is missing warning, device 4 is missing warning, device 3 is missing warning, device 2 is missing checksum verify failed on 20971520 found 0FBD46D5 wanted FC3EB3AB checksum verify failed on 20971520 found 0FBD46D5 wanted FC3EB3AB bytenr mismatch, want=20971520, have=267714560 ERROR: cannot read chunk root Could not open root, trying backup super warning, device 6 is missing warning, device 5 is missing warning, device 4 is missing warning, device 3 is missing warning, device 2 is missing checksum verify failed on 20971520 found 0FBD46D5 wanted FC3EB3AB checksum verify failed on 20971520 found 0FBD46D5 wanted FC3EB3AB bytenr mismatch, want=20971520, have=267714560 ERROR: cannot read chunk root Could not open root, trying backup super So it doesn't seem to work, but the difference in output between the two, at least to my untrained eyes, is intriguing, to say the least. > > > OK at this point I'm thinking that fixing the super blocks won't change > anything because it sounds like it's using the new ones anyway and > maybe the thing to try is going back to a tree root that isn't in any > of the new supers. That means losing anything that was being written > when the lost writes happened. However, for all we know some overwrites > happened so this won't work. And also it does nothing to deal with the > fragile state of having at least two flaky devices, and one of the > system chunks with no redundancy. > This is the one thing I'm not following you on. I know there's one device that's flaky. Originally sdi, switched to sdh, and today (after reboot to 4.7.7), sdd. You'll have to forgive my ignorance, but I'm missing how you determined that a second was flaky (or was that from the ITEM 0 not being replicated you mentioned yesterday?) > > Try 'btrfs check' without repair. And then also try it with -r flag > using the various tree roots we've seen so far. Try explicitly using > 5752616386560, which is what it ought to use first anyway. And then > also 4844272943104. 
> root@castor:~/btrfs-progs# ./btrfs check --readonly /dev/sda parent transid verify failed on 5752357961728 wanted 161562 found 159746 parent transid verify failed on 5752357961728 wanted 161562 found 159746 checksum verify failed on 5752357961728 found B5CA97C0 wanted 51292A76 checksum verify failed on 5752357961728 found 8582246F wanted B53BE280 checksum verify failed on 5752357961728 found 8582246F wanted B53BE280 bytenr mismatch, want=5752357961728, have=56504706479104 Couldn't setup extent tree ERROR: cannot open file system root@castor:~/btrfs-progs# ./btrfs check --readonly /dev/sdd parent transid verify failed on 4844272943104 wanted 161474 found 161476 parent transid verify failed on 4844272943104 wanted 161474 found 161476 checksum verify failed on 4844272943104 found E808AB28 wanted 0CEB169E checksum verify failed on 4844272943104 found 4694222D wanted 5D4F0640 checksum verify failed on 4844272943104 found 4694222D wanted 5D4F0640 bytenr mismatch, want=4844272943104, have=66211125067776 Couldn't read tree root ERROR: cannot open file system root@castor:~/btrfs-progs# ./btrfs check --readonly -r 5752616386560 /dev/sda parent transid verify failed on 5752357961728 wanted 161562 found 159746 parent transid verify failed on 5752357961728 wanted 161562 found 159746 checksum verify failed on 5752357961728 found B5CA97C0 wanted 51292A76 checksum verify failed on 5752357961728 found 8582246F wanted B53BE280 checksum verify failed on 5752357961728 found 8582246F wanted B53BE280 bytenr mismatch, want=5752357961728, have=56504706479104 Couldn't setup extent tree ERROR: cannot open file system root@castor:~/btrfs-progs# ./btrfs check --readonly -r 5752616386560 /dev/sdd parent transid verify failed on 5752616386560 wanted 161474 found 161562 parent transid verify failed on 5752616386560 wanted 161474 found 161562 checksum verify failed on 5752616386560 found 2A134884 wanted CEF0F532 checksum verify failed on 5752616386560 found B7FE62DB wanted 3786D60F checksum verify failed on 5752616386560 found B7FE62DB wanted 3786D60F bytenr mismatch, want=5752616386560, have=56504661311488 Couldn't read tree root ERROR: cannot open file system root@castor:~/btrfs-progs# ./btrfs check --readonly -r 4844272943104 /dev/sda parent transid verify failed on 4844272943104 wanted 161562 found 161476 parent transid verify failed on 4844272943104 wanted 161562 found 161476 checksum verify failed on 4844272943104 found E808AB28 wanted 0CEB169E checksum verify failed on 4844272943104 found 4694222D wanted 5D4F0640 checksum verify failed on 4844272943104 found 4694222D wanted 5D4F0640 bytenr mismatch, want=4844272943104, have=66211125067776 Couldn't read tree root ERROR: cannot open file system root@castor:~/btrfs-progs# ./btrfs check --readonly -r 4844272943104 /dev/sdd parent transid verify failed on 4844272943104 wanted 161474 found 161476 parent transid verify failed on 4844272943104 wanted 161474 found 161476 checksum verify failed on 4844272943104 found E808AB28 wanted 0CEB169E checksum verify failed on 4844272943104 found 4694222D wanted 5D4F0640 checksum verify failed on 4844272943104 found 4694222D wanted 5D4F0640 bytenr mismatch, want=4844272943104, have=66211125067776 Couldn't read tree root ERROR: cannot open file system > That might go far enough back before the bad sectors were a factor. > Normally what you'd want is for it to use one of the backup roots, but > it's consistently running into a problem with all of them when using > recovery mount option. 
>

Is that a result of all of them being identical, save for the bad disc?

Again, Chris, thank you so much for your time looking at this! Btrfs on the whole is something that, as a developer, I'd love to become involved with. Alas, there are only 24 hours in the day.
On Wed, Oct 12, 2016 at 11:59 AM, Jason D. Michaelson <jasondmichaelson@gmail.com> wrote: > With the bad disc in place: > > root@castor:~/btrfs-progs# ./btrfs restore -t 4844272943104 -D /dev/sda /dev/null > parent transid verify failed on 4844272943104 wanted 161562 found 161476 > parent transid verify failed on 4844272943104 wanted 161562 found 161476 > checksum verify failed on 4844272943104 found E808AB28 wanted 0CEB169E > checksum verify failed on 4844272943104 found 4694222D wanted 5D4F0640 > checksum verify failed on 4844272943104 found 4694222D wanted 5D4F0640 > bytenr mismatch, want=4844272943104, have=66211125067776 > Couldn't read tree root > Could not open root, trying backup super > warning, device 6 is missing > warning, device 5 is missing > warning, device 4 is missing > warning, device 3 is missing > warning, device 2 is missing > checksum verify failed on 20971520 found 0FBD46D5 wanted FC3EB3AB > checksum verify failed on 20971520 found 0FBD46D5 wanted FC3EB3AB > bytenr mismatch, want=20971520, have=267714560 > ERROR: cannot read chunk root > Could not open root, trying backup super > warning, device 6 is missing > warning, device 5 is missing > warning, device 4 is missing > warning, device 3 is missing > warning, device 2 is missing > checksum verify failed on 20971520 found 0FBD46D5 wanted FC3EB3AB > checksum verify failed on 20971520 found 0FBD46D5 wanted FC3EB3AB > bytenr mismatch, want=20971520, have=267714560 > ERROR: cannot read chunk root > Could not open root, trying backup super Don't all of these device missing messages seem bogus? I don't know how to find out what's going on here. If it were me, I'd try to reproduce this with a couple of distros's live images (Fedora Rawhide and openSUSE Tumbleweed), and if they're both reproducing this "missing" output, I'd file a bugzilla.kernel.org bug with a strace. I mean, this stuff is hard enough as it is without bugs like this getting in the way. Fedora 25 nightly: https://kojipkgs.fedoraproject.org/compose/branched/Fedora-25-20161008.n.0/compose/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-25-20161008.n.0.iso That'll have some version of kernel 4.8, not sure which one. And it will have btrfs-progs 4.6.1 which is safe but showing its age in Btrfs years. You can do this from inside the live environment: sudo dnf update https://kojipkgs.fedoraproject.org//packages/btrfs-progs/4.7.3/1.fc26/x86_64/btrfs-progs-4.7.3-1.fc26.x86_64.rpm or sudo dnf update https://kojipkgs.fedoraproject.org//packages/btrfs-progs/4.8.1/1.fc26/x86_64/btrfs-progs-4.8.1-1.fc26.x86_64.rpm It's probably just as valid to do this with whatever you have now, strace that and file a bug. But it doesn't really for sure isolate whether it's a local problem or not. 
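As a rough illustration of that last suggestion, and assuming the locally built progs in ~/btrfs-progs with /dev/sda as the device argument, capturing a trace for the bug report might look like:

  # Trace the restore attempt (child processes included) and save the log for the bug report
  strace -f -o btrfs-restore.strace ./btrfs restore -t 4844272943104 -D /dev/sda /dev/null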
> > And what's interesting is that when I move the /dev/sdd (the current bad disc) out of /dev, rescan, and run btrfs restore with the main root I get similar output: > > root@castor:~/btrfs-progs# ./btrfs restore -D /dev/sda /dev/null > warning, device 2 is missing > checksum verify failed on 21430272 found 71001E6E wanted 95E3A3D8 > checksum verify failed on 21430272 found 992E0C37 wanted 36992D8B > checksum verify failed on 21430272 found 992E0C37 wanted 36992D8B > bytenr mismatch, want=21430272, have=264830976 > Couldn't read chunk tree > Could not open root, trying backup super > warning, device 6 is missing > warning, device 5 is missing > warning, device 4 is missing > warning, device 3 is missing > warning, device 2 is missing > checksum verify failed on 20971520 found 0FBD46D5 wanted FC3EB3AB > checksum verify failed on 20971520 found 0FBD46D5 wanted FC3EB3AB > bytenr mismatch, want=20971520, have=267714560 > ERROR: cannot read chunk root > Could not open root, trying backup super > warning, device 6 is missing > warning, device 5 is missing > warning, device 4 is missing > warning, device 3 is missing > warning, device 2 is missing > checksum verify failed on 20971520 found 0FBD46D5 wanted FC3EB3AB > checksum verify failed on 20971520 found 0FBD46D5 wanted FC3EB3AB > bytenr mismatch, want=20971520, have=267714560 > ERROR: cannot read chunk root > Could not open root, trying backup super > > So it doesn't seem to work, but the difference in output between the two, at least to my untrained eyes, is intriguing, to say the least. Yeah I'm not sure what to recommend now. > >> >> >> OK at this point I'm thinking that fixing the super blocks won't change >> anything because it sounds like it's using the new ones anyway and >> maybe the thing to try is going back to a tree root that isn't in any >> of the new supers. That means losing anything that was being written >> when the lost writes happened. However, for all we know some overwrites >> happened so this won't work. And also it does nothing to deal with the >> fragile state of having at least two flaky devices, and one of the >> system chunks with no redundancy. >> > > This is the one thing I'm not following you on. I know there's one device that's flaky. Originally sdi, switched to sdh, and today (after reboot to 4.7.7), sdd. You'll have to forgive my ignorance, but I'm missing how you determined that a second was flaky (or was that from the ITEM 0 not being replicated you mentioned yesterday?) In your dmesg there was one device reported missing entirely, and then a separate device had a sector read failure. > >> >> Try 'btrfs check' without repair. And then also try it with -r flag >> using the various tree roots we've seen so far. Try explicitly using >> 5752616386560, which is what it ought to use first anyway. And then >> also 4844272943104. 
>> > > root@castor:~/btrfs-progs# ./btrfs check --readonly /dev/sda > parent transid verify failed on 5752357961728 wanted 161562 found 159746 > parent transid verify failed on 5752357961728 wanted 161562 found 159746 > checksum verify failed on 5752357961728 found B5CA97C0 wanted 51292A76 > checksum verify failed on 5752357961728 found 8582246F wanted B53BE280 > checksum verify failed on 5752357961728 found 8582246F wanted B53BE280 > bytenr mismatch, want=5752357961728, have=56504706479104 > Couldn't setup extent tree > ERROR: cannot open file system > root@castor:~/btrfs-progs# ./btrfs check --readonly /dev/sdd > parent transid verify failed on 4844272943104 wanted 161474 found 161476 > parent transid verify failed on 4844272943104 wanted 161474 found 161476 > checksum verify failed on 4844272943104 found E808AB28 wanted 0CEB169E > checksum verify failed on 4844272943104 found 4694222D wanted 5D4F0640 > checksum verify failed on 4844272943104 found 4694222D wanted 5D4F0640 > bytenr mismatch, want=4844272943104, have=66211125067776 > Couldn't read tree root > ERROR: cannot open file system > > root@castor:~/btrfs-progs# ./btrfs check --readonly -r 5752616386560 /dev/sda > parent transid verify failed on 5752357961728 wanted 161562 found 159746 > parent transid verify failed on 5752357961728 wanted 161562 found 159746 > checksum verify failed on 5752357961728 found B5CA97C0 wanted 51292A76 > checksum verify failed on 5752357961728 found 8582246F wanted B53BE280 > checksum verify failed on 5752357961728 found 8582246F wanted B53BE280 > bytenr mismatch, want=5752357961728, have=56504706479104 > Couldn't setup extent tree > ERROR: cannot open file system > root@castor:~/btrfs-progs# ./btrfs check --readonly -r 5752616386560 /dev/sdd > parent transid verify failed on 5752616386560 wanted 161474 found 161562 > parent transid verify failed on 5752616386560 wanted 161474 found 161562 > checksum verify failed on 5752616386560 found 2A134884 wanted CEF0F532 > checksum verify failed on 5752616386560 found B7FE62DB wanted 3786D60F > checksum verify failed on 5752616386560 found B7FE62DB wanted 3786D60F > bytenr mismatch, want=5752616386560, have=56504661311488 > Couldn't read tree root > ERROR: cannot open file system > > root@castor:~/btrfs-progs# ./btrfs check --readonly -r 4844272943104 /dev/sda > parent transid verify failed on 4844272943104 wanted 161562 found 161476 > parent transid verify failed on 4844272943104 wanted 161562 found 161476 > checksum verify failed on 4844272943104 found E808AB28 wanted 0CEB169E > checksum verify failed on 4844272943104 found 4694222D wanted 5D4F0640 > checksum verify failed on 4844272943104 found 4694222D wanted 5D4F0640 > bytenr mismatch, want=4844272943104, have=66211125067776 > Couldn't read tree root > ERROR: cannot open file system > root@castor:~/btrfs-progs# ./btrfs check --readonly -r 4844272943104 /dev/sdd > parent transid verify failed on 4844272943104 wanted 161474 found 161476 > parent transid verify failed on 4844272943104 wanted 161474 found 161476 > checksum verify failed on 4844272943104 found E808AB28 wanted 0CEB169E > checksum verify failed on 4844272943104 found 4694222D wanted 5D4F0640 > checksum verify failed on 4844272943104 found 4694222D wanted 5D4F0640 > bytenr mismatch, want=4844272943104, have=66211125067776 > Couldn't read tree root > ERROR: cannot open file system Someone else who knows more will have to speak up. 
This is one of the more annoying things about Btrfs's state right now: it's not at all clear to a regular user what sequence to attempt repairs in. It's a shot in the dark. Other file systems are much easier: it fails to mount, you run fsck with default options, and it either can fix it or it can't. With Btrfs there are many options, many possible orders, very developer-oriented messages, and no hints about what the next step should be.

At this point you could set up some kind of overlay on each drive, maybe also using blockdev to set each block device read-only to ensure the original is not modified. Something like this:

https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID#Making_the_harddisks_read-only_using_an_overlay_file

But that approach will avoid this gotcha: "Block-level copies of devices"

https://btrfs.wiki.kernel.org/index.php/Gotchas

I haven't tried this, so I'm not really certain how to hide the original and the overlay from the kernel, since they both have to be present at the same time. Maybe an LVM snapshot LV can be presented to libvirt/virt-manager and you can use a recent distro image to boot the VM and try some repairs.

I just can't tell you what order to do them in. "Cannot read chunk root" is a problem; maybe it can be repaired with btrfs rescue chunk-recover. "Cannot read tree root" is also a problem; once the chunk root is repaired, maybe it's possible to repair that too. The extent tree can't be used until the chunk tree is readable, so that ought to just take care of itself. You might be looking at chunk-recover, super-recover, check --repair, and maybe even check --repair --init-extent-tree. And as a last resort, --init-csum-tree, which really just papers over real problems in a way that leaves the file system not knowing what's bad and makes things worse, but it might survive long enough to get more data off.

And actually, before any of the above, you could see if you can take a btrfs-image -t4 -c9 -s, and also a btrfs-debug-tree dump, and write the output to a file somewhere. Maybe then it's a useful donation image for making the tools better.

>
>> That might go far enough back before the bad sectors were a factor.
>> Normally what you'd want is for it to use one of the backup roots, but
>> it's consistently running into a problem with all of them when using
>> recovery mount option.
>>
>
> Is that a result of all of them being identical, save for the bad disc?

I don't understand the question. The bad disk is the one that has the bad super, but all the tools are clearly ignoring the bad super when looking for the tree root, so I don't think the bad disk is a factor. I can't prove it, but I think the problems were happening before the bad disk; it just added to the confusion and may also be preventing repairs.
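To make the overlay idea concrete, here is a minimal sketch of the approach described on that raid wiki page; the overlay size, file paths, and device name are illustrative, it would need to be repeated for each member device, and any later repair attempt would then target /dev/mapper/sda-overlay rather than the raw disk:

  # Keep the original device from being written to
  blockdev --setro /dev/sda

  # Sparse file to absorb any writes, attached as a loop device
  truncate -s 50G /tmp/sda-overlay.img
  loop=$(losetup --find --show /tmp/sda-overlay.img)

  # Copy-on-write snapshot: reads fall through to /dev/sda, writes land in the overlay file
  dmsetup create sda-overlay --table "0 $(blockdev --getsz /dev/sda) snapshot /dev/sda $loop P 8"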
This may be relevant and is pretty terrible.

http://www.spinics.net/lists/linux-btrfs/msg59741.html


Chris Murphy
I've been following that thread. It's been my fear. I'm in the process of restoring what I can get off of it so that I can re-create the file system with raid1, which, if I'm reading that thread correctly, doesn't suffer at all from the RMW problems extant in the raid5/6 code at the moment.

Again, thanks for your help.

> -----Original Message-----
> From: chris@colorremedies.com [mailto:chris@colorremedies.com] On
> Behalf Of Chris Murphy
> Sent: Friday, October 14, 2016 4:55 PM
> To: Chris Murphy
> Cc: Jason D. Michaelson; Btrfs BTRFS
> Subject: Re: raid6 file system in a bad state
>
> This may be relevant and is pretty terrible.
>
> http://www.spinics.net/lists/linux-btrfs/msg59741.html
>
>
> Chris Murphy
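For what it's worth, a sketch of that re-creation once the restore is done, with illustrative device names for the surviving drives:

  # raid1 for both data and metadata, avoiding the raid5/6 parity RMW path entirely
  mkfs.btrfs -d raid1 -m raid1 /dev/sda /dev/sde /dev/sdf /dev/sdg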
--- sda	2016-10-11 11:09:42.853170807 -0500
+++ sdh	2016-10-11 11:09:46.469082028 -0500
@@ -1,16 +1,16 @@
-superblock: bytenr=65536, device=/dev/sda
+superblock: bytenr=65536, device=/dev/sdh
 ---------------------------------------------------------
 csum_type 0 (crc32c)
 csum_size 4
-csum 0x45278835 [match]
+csum 0x0f7dfe09 [match]
 bytenr 65536
 flags 0x1
 ( WRITTEN )
 magic _BHRfS_M [match]
 fsid 73ed01df-fb2a-4b27-b6fc-12a57da934bd
 label
-generation 161562
-root 5752616386560
+generation 161474
+root 4844272943104
 sys_array_size 354
 chunk_root_generation 156893
 root_level 1
@@ -20,7 +20,7 @@
 log_root_transid 0
 log_root_level 0
 total_bytes 18003557892096
-bytes_used 7107627130880
+bytes_used 7110395990016
 sectorsize 4096
 nodesize 16384
 leafsize 16384
@@ -34,17 +34,17 @@
 BIG_METADATA |
 EXTENDED_IREF |
 RAID56 )
-cache_generation 161562
-uuid_tree_generation 161562
-dev_item.uuid 08c50aa9-c2dd-43b7-a631-6dfdc7d69ea4
+cache_generation 161474
+uuid_tree_generation 161474
+dev_item.uuid dc8760f1-2c54-4134-a9a7-a0ac2b7a9f1c
 dev_item.fsid 73ed01df-fb2a-4b27-b6fc-12a57da934bd [match]
 dev_item.type 0
 dev_item.total_bytes 3000592982016
-dev_item.bytes_used 1800957198336
+dev_item.bytes_used 1800936226816
 dev_item.io_align 4096
 dev_item.io_width 4096
 dev_item.sector_size 4096
-dev_item.devid 1
+dev_item.devid 2
 dev_item.dev_group 0
 dev_item.seek_speed 0
 dev_item.bandwidth 0
@@ -72,47 +72,47 @@
 dev uuid: 08c50aa9-c2dd-43b7-a631-6dfdc7d69ea4
 backup_roots[4]:
 backup 0:
-    backup_tree_root: 5752437456896 gen: 161561 level: 1
+    backup_tree_root: 4844253364224 gen: 161473 level: 1
     backup_chunk_root: 20971520 gen: 156893 level: 1
-    backup_extent_root: 5752385224704 gen: 161561 level: 2
+    backup_extent_root: 4844248121344 gen: 161473 level: 2
     backup_fs_root: 124387328 gen: 74008 level: 0
-    backup_dev_root: 5752437587968 gen: 161561 level: 1
-    backup_csum_root: 5752389615616 gen: 161561 level: 3
+    backup_dev_root: 1411186688 gen: 156893 level: 1
+    backup_csum_root: 4844247793664 gen: 161473 level: 3
     backup_total_bytes: 18003557892096
-    backup_bytes_used: 7112579833856
+    backup_bytes_used: 7110380077056
     backup_num_devices: 6
 backup 1:
-    backup_tree_root: 5752616386560 gen: 161562 level: 1
+    backup_tree_root: 4844272943104 gen: 161474 level: 1
     backup_chunk_root: 20971520 gen: 156893 level: 1
-    backup_extent_root: 5752649416704 gen: 161563 level: 2
+    backup_extent_root: 4844268240896 gen: 161474 level: 2
     backup_fs_root: 124387328 gen: 74008 level: 0
-    backup_dev_root: 5752616501248 gen: 161562 level: 1
-    backup_csum_root: 5752650203136 gen: 161563 level: 3
+    backup_dev_root: 1411186688 gen: 156893 level: 1
+    backup_csum_root: 4844254216192 gen: 161474 level: 3
     backup_total_bytes: 18003557892096
-    backup_bytes_used: 7107602407424
+    backup_bytes_used: 7110395990016
     backup_num_devices: 6
 backup 2:
-    backup_tree_root: 5752112103424 gen: 161559 level: 1
+    backup_tree_root: 4844252168192 gen: 161471 level: 1
     backup_chunk_root: 20971520 gen: 156893 level: 1
-    backup_extent_root: 5752207409152 gen: 161560 level: 2
+    backup_extent_root: 4844242698240 gen: 161471 level: 2
     backup_fs_root: 124387328 gen: 74008 level: 0
-    backup_dev_root: 5752113463296 gen: 161559 level: 1
-    backup_csum_root: 5752205492224 gen: 161560 level: 3
+    backup_dev_root: 1411186688 gen: 156893 level: 1
+    backup_csum_root: 4844241764352 gen: 161471 level: 3
     backup_total_bytes: 18003557892096
-    backup_bytes_used: 7112514002944
+    backup_bytes_used: 7110343888896
     backup_num_devices: 6
 backup 3:
-    backup_tree_root: 5752298307584 gen: 161560 level: 1
+    backup_tree_root: 4844263358464 gen: 161472 level: 1
     backup_chunk_root: 20971520 gen: 156893 level: 1
-    backup_extent_root: 5752385224704 gen: 161561 level: 2
+    backup_extent_root: 4844261965824 gen: 161472 level: 2
     backup_fs_root: 124387328 gen: 74008 level: 0
-    backup_dev_root: 5752299978752 gen: 161560 level: 1
-    backup_csum_root: 5752389615616 gen: 161561 level: 3
+    backup_dev_root: 1411186688 gen: 156893 level: 1
+    backup_csum_root: 4844261801984 gen: 161472 level: 3
     backup_total_bytes: 18003557892096
-    backup_bytes_used: 7112542425088
+    backup_bytes_used: 7110370037760
     backup_num_devices: 6