From: Zygo Blaxell
To: linux-btrfs@vger.kernel.org
Subject: Adventures in btrfs raid5 disk recovery
Date: Sun, 19 Jun 2016 23:44:27 -0400
Message-ID: <20160620034427.GK15597@hungrycats.org>

Not so long ago, I had a disk fail in a btrfs filesystem with raid1
metadata and raid5 data. I mounted the filesystem readonly, replaced the
failing disk, and attempted to recover by adding the new disk and
deleting the missing disk.

It's not going well so far. Pay attention: there are at least four
separate problems in here, and we're not even half done yet.

I'm currently using kernel 4.6.2 with btrfs fixes forward-ported from
4.5.7, because 4.5.7 has a number of fixes that 4.6.2 doesn't. I have
also pulled in some patches from the 4.7-rc series. These fixed a few
problems I encountered early on, and I'm still making forward progress,
but I've only replaced 50% of the failed disk so far, and this is week
four of this particular project.

What worked:

'mount -odegraded,...' successfully mounts the filesystem RW. 'btrfs
device add' adds the new disk. Success!

The first thing I did was balance the metadata onto non-missing disks.
That went well. Now there are only data chunks to recover from the
missing disk. Success!

The normal 'device delete' operation got about 25% of the way in, then
got stuck on some corrupted sectors and aborted with EIO. That ends the
success, but I've had similar problems with raid5 arrays before and been
able to solve them.

I've managed to remove about half of the data from the missing disk so
far. 'balance start -ddevid=,drange=0..100000000000' (with increasing
values for drange) is able to move data off the failed disk while
avoiding the damaged regions.
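In case it's useful to anyone following along, the loop I'm running
looks roughly like the sketch below. It is not a transcript of my exact
commands: the devid, mount point, and sizes are placeholders for
illustration, not the real values from this array.

#!/bin/bash
# Walk the failed device in fixed-size drange windows so that one bad
# region does not abort the whole balance. All values are placeholders.
DEVID=5                    # devid of the "missing" disk (hypothetical)
MNT=/media/recovery        # mount point (hypothetical)
STEP=100000000000          # 100 GB of device address space per pass
DISK_SIZE=4000000000000    # rough size of the failed disk (hypothetical)

start=0
while [ "$start" -lt "$DISK_SIZE" ]; do
	end=$((start + STEP))
	echo "balancing devid $DEVID drange $start..$end"
	# A window that covers a corrupted region fails with EIO;
	# ignore the error and move on to the next window.
	btrfs balance start -ddevid="$DEVID",drange="$start..$end" "$MNT" || true
	start=$end
done

Each pass only relocates block groups whose offsets on that device fall
inside the current drange window, which is what lets the loop step
around the damaged regions instead of dying on the first EIO the way
'device delete' does.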
It looks like this process could reduce the amount of data on "missing"
devices to a manageable amount; then I could identify the offending
corrupted extents with 'btrfs scrub', remove the files containing them,
and finish the device delete operation. Hope!

What doesn't work:

The first problem is that the kernel keeps crashing. I put the
filesystem and all its disks in a KVM so the crashes are less disruptive
and I can debug them (or at least collect panic logs). OK, now crashes
are merely a performance problem.

Why did I mention 'btrfs scrub' above? Because 'btrfs scrub' tells me
where corrupted blocks are. 'device delete' fills my kernel logs with
lines like this:

[26054.744158] BTRFS info (device vdc): relocating block group 27753592127488 flags 129
[26809.746993] BTRFS warning (device vdc): csum failed ino 404 off 6021976064 csum 778377694 expected csum 2827380172
[26809.747029] BTRFS warning (device vdc): csum failed ino 404 off 6021980160 csum 3776938678 expected csum 514150079
[26809.747077] BTRFS warning (device vdc): csum failed ino 404 off 6021984256 csum 470593400 expected csum 642831408
[26809.747093] BTRFS warning (device vdc): csum failed ino 404 off 6021988352 csum 796755777 expected csum 690854341
[26809.747108] BTRFS warning (device vdc): csum failed ino 404 off 6021992448 csum 4115095129 expected csum 249712906
[26809.747122] BTRFS warning (device vdc): csum failed ino 404 off 6021996544 csum 2337431338 expected csum 1869250975
[26809.747138] BTRFS warning (device vdc): csum failed ino 404 off 6022000640 csum 3543852608 expected csum 1929026437
[26809.747154] BTRFS warning (device vdc): csum failed ino 404 off 6022004736 csum 3417780495 expected csum 3698318115
[26809.747169] BTRFS warning (device vdc): csum failed ino 404 off 6022008832 csum 3423877520 expected csum 2981727596
[26809.747183] BTRFS warning (device vdc): csum failed ino 404 off 6022012928 csum 550838742 expected csum 1005563554
[26896.379773] BTRFS info (device vdc): relocating block group 27753592127488 flags 129
[27791.128098] __readpage_endio_check: 7 callbacks suppressed
[27791.236794] BTRFS warning (device vdc): csum failed ino 405 off 6021980160 csum 3776938678 expected csum 514150079
[27791.236799] BTRFS warning (device vdc): csum failed ino 405 off 6021971968 csum 3304844252 expected csum 4171523312
[27791.236821] BTRFS warning (device vdc): csum failed ino 405 off 6021984256 csum 470593400 expected csum 642831408
[27791.236825] BTRFS warning (device vdc): csum failed ino 405 off 6021988352 csum 796755777 expected csum 690854341
[27791.236842] BTRFS warning (device vdc): csum failed ino 405 off 6021992448 csum 4115095129 expected csum 249712906
[27791.236847] BTRFS warning (device vdc): csum failed ino 405 off 6021996544 csum 2337431338 expected csum 1869250975
[27791.236857] BTRFS warning (device vdc): csum failed ino 405 off 6022004736 csum 3417780495 expected csum 3698318115
[27791.236864] BTRFS warning (device vdc): csum failed ino 405 off 6022000640 csum 3543852608 expected csum 1929026437
[27791.236874] BTRFS warning (device vdc): csum failed ino 405 off 6022008832 csum 3423877520 expected csum 2981727596
[27791.236978] BTRFS warning (device vdc): csum failed ino 405 off 6021976064 csum 778377694 expected csum 2827380172
[27927.222130] BTRFS info (device vdc): relocating block group 27753592127488 flags 129
[28679.904235] __readpage_endio_check: 7 callbacks suppressed
[28679.907160] BTRFS warning (device vdc): csum failed ino 406 off 6021984256 csum 470593400 expected csum 642831408
[28679.914460] BTRFS warning (device vdc): csum failed ino 406 off 6021980160 csum 3776938678 expected csum 514150079
[28679.914566] BTRFS warning (device vdc): csum failed ino 406 off 6021976064 csum 778377694 expected csum 2827380172
[28679.914584] BTRFS warning (device vdc): csum failed ino 406 off 6021988352 csum 796755777 expected csum 690854341
[28679.914596] BTRFS warning (device vdc): csum failed ino 406 off 6021992448 csum 4115095129 expected csum 249712906
[28679.914616] BTRFS warning (device vdc): csum failed ino 406 off 6021996544 csum 2337431338 expected csum 1869250975
[28679.914625] BTRFS warning (device vdc): csum failed ino 406 off 6022000640 csum 3543852608 expected csum 1929026437
[28679.914641] BTRFS warning (device vdc): csum failed ino 406 off 6022004736 csum 3417780495 expected csum 3698318115
[28679.914643] BTRFS warning (device vdc): csum failed ino 406 off 6022008832 csum 3423877520 expected csum 2981727596
[28679.914661] BTRFS warning (device vdc): csum failed ino 406 off 6022012928 csum 550838742 expected csum 1005563554
[28769.622165] BTRFS info (device vdc): relocating block group 27753592127488 flags 129
[29409.886503] __readpage_endio_check: 7 callbacks suppressed
[29409.889486] BTRFS warning (device vdc): csum failed ino 407 off 6021971968 csum 3304844252 expected csum 4171523312
[29409.896090] BTRFS warning (device vdc): csum failed ino 407 off 6021976064 csum 778377694 expected csum 2827380172
[29409.918458] BTRFS warning (device vdc): csum failed ino 407 off 6021984256 csum 470593400 expected csum 642831408
[29409.918463] BTRFS warning (device vdc): csum failed ino 407 off 6021980160 csum 3776938678 expected csum 514150079
[29409.918489] BTRFS warning (device vdc): csum failed ino 407 off 6021992448 csum 4115095129 expected csum 249712906
[29409.918504] BTRFS warning (device vdc): csum failed ino 407 off 6021996544 csum 2337431338 expected csum 1869250975
[29409.918517] BTRFS warning (device vdc): csum failed ino 407 off 6022000640 csum 3543852608 expected csum 1929026437
[29409.918530] BTRFS warning (device vdc): csum failed ino 407 off 6022004736 csum 3417780495 expected csum 3698318115
[29409.918543] BTRFS warning (device vdc): csum failed ino 407 off 6022008832 csum 3423877520 expected csum 2981727596
[29409.918556] BTRFS warning (device vdc): csum failed ino 407 off 6022012928 csum 550838742 expected csum 1005563554

"ino 407" is a lie, and I have not yet been able to identify what the
offsets mean (assuming they're not lies too). What's really interesting
is that the ino increases by one each time 'btrfs dev del' is run. I
guess it's just printing uninitialized memory here or something?

OK, I'll avoid some of the more broken btrfs tools. 'btrfs scrub' tells
me the precise ino, offset, and path names of corrupted files, but it
takes a week to run, so I'm moving as much data as possible onto newer
(and faster) disks to make the scrub run faster. Sounds like a viable
plan. (A rough sketch of that cleanup step is a little further down.)

I kept getting crashes at two specific BUG_ONs, which I patched around
with the two diffs quoted at the end of this mail.

The code around the second BUG_ON (the last diff below) gives me more
questions than answers:

Why is there a BUG_ON located before the call to btrfs_is_parity_mirror?

Why is mirror_num even passed to btrfs_is_parity_mirror?
btrfs_is_parity_mirror doesn't use the mirror_num argument, and never
has since it was written in 2013. More worrying.
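For completeness, here's roughly what that scrub-driven cleanup looks
like. This is a sketch rather than a transcript of my session: the mount
point is a placeholder, and the exact wording of the scrub warning line
is from memory, not from this filesystem's logs.

# Run a foreground scrub with per-device statistics.
btrfs scrub start -Bd /media/recovery

# Unrepairable blocks show up in the kernel log with a "(path: ...)"
# suffix naming the affected file; collect those paths so the files
# can be deleted here and restored from backups.
dmesg | sed -n 's/.*(path: \(.*\))$/\1/p' | sort -u

Anyway, back to the crashes.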
Those two patches helped to the extent that my kernel stopped crashing
whenever it encountered a bad checksum on the filesystem; however, now
I'm getting EIO randomly all over the filesystem, including in files
that were written entirely _after_ the disk failure. I had assumed there
was no reason such files would fail to be written correctly, but
apparently not.

Uh oh. Better stop writing new data to the filesystem, then.

I added 'convert=raid1' to the 'btrfs balance' command to try to stem
the flow of corrupted data. It seems to help, but hasn't completely
stopped new files from getting corrupted. The new disks are large enough
to hold all the data in raid1, so maybe if I manage to delete the
missing disk I'll convert the whole filesystem to raid1 and pretend the
raid5 feature doesn't exist any more.

I've already "voided the warranty," as it were, so I'm not surprised to
be seeing a bunch of unexpected corruption. The kernel already tried
twice to kill itself before getting as far as I am now, so there was
probably a good reason for that. At this point it's still less
disruptive to just fix the corruption one file at a time from backups,
and without the patches the only other option is to mkfs+restore the
filesystem, so I'm pressing on... :-/

The code I'm changing in those patches hasn't changed since 2011, which
is well before there was raid5. Clearly it's not handling the cases that
raid5 brings up very well. Is recovery a lost cause? Do I just give up
now and mkfs+restore the filesystem?

Compounding the difficulty is the fact that the balance operation keeps
getting stuck every few hours with this backtrace:

[18365.506156] Call Trace:
[18365.506164] [] schedule+0x3f/0xa0
[18365.506167] [] lock_extent_bits+0x18d/0x1e0
[18365.506171] [] ? wait_woken+0xb0/0xb0
[18365.506174] [] invalidate_extent_cache+0xe0/0x200
[18365.506177] [] merge_reloc_root+0x33d/0x570
[18365.506179] [] merge_reloc_roots+0x143/0x250
[18365.506181] [] relocate_block_group+0x471/0x6b0
[18365.506183] [] ? btrfs_relocate_block_group+0x1ad/0x2a0
[18365.506185] [] btrfs_relocate_block_group+0x1b5/0x2a0
[18365.506188] [] btrfs_relocate_chunk.isra.37+0x47/0xd0
[18365.506190] [] btrfs_balance+0xa71/0x1200
[18365.506192] [] ? btrfs_ioctl_balance+0x182/0x3b0
[18365.506195] [] btrfs_ioctl_balance+0x182/0x3b0
[18365.506197] [] btrfs_ioctl+0x139d/0x2550
[18365.506201] [] ? __do_page_fault+0x1da/0x570
[18365.506203] [] do_vfs_ioctl+0x8f/0x680
[18365.506205] [] ? __do_page_fault+0x1da/0x570
[18365.506207] [] ? __fget_light+0x6f/0x90
[18365.506209] [] SyS_ioctl+0x79/0x90
[18365.506212] [] entry_SYSCALL_64_fastpath+0x23/0xc1

That's probably another bug, but I don't know if it's because of
something I've done, something weird in my filesystem's metadata, or a
previously unreported regression since kernel 4.4. For now I just detect
when it happens inside the KVM and reboot it from outside.

I only have a few months left before the next disk in this array fails,
and I'd like to be out of degraded mode by then. ;)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 57fccc4..1ed1e43 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2296,7 +2296,10 @@ int btrfs_check_repairable(struct inode *inode, struct bio *failed_bio,
 		 * if the following BUG_ON triggers, our validation request got
 		 * merged. we need separate requests for our algorithm to work.
 		 */
-		BUG_ON(failrec->in_validation);
+		if (failrec->in_validation) {
+			printk(KERN_ERR "failrec->in_validation at %s:%d\n", __FUNCTION__, __LINE__);
+			return 0;
+		}
 		failrec->in_validation = 1;
 		failrec->this_mirror = failed_mirror;
 	} else {
@@ -2306,7 +2309,10 @@ int btrfs_check_repairable(struct inode *inode, struct bio *failed_bio,
 		 * everything for repair_io_failure to do the rest for us.
 		 */
 		if (failrec->in_validation) {
-			BUG_ON(failrec->this_mirror != failed_mirror);
+			if (failrec->this_mirror != failed_mirror) {
+				printk(KERN_ERR "failrec->this_mirror (%d) != failed_mirror (%d) at %s:%d\n", failrec->this_mirror, failed_mirror, __FUNCTION__, __LINE__);
+				return 0;
+			}
 			failrec->in_validation = 0;
 			failrec->this_mirror = 0;
 		}

'failrec->this_mirror' often has values like 0, 1, or 2, but sometimes
has values like -445886021 or -823858068. More uninitialized memory?
This worries me.

The other BUG_ON lands here:

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c1fe22c..71afeb3 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2018,10 +2018,10 @@ int repair_io_failure(struct inode *inode, u64 start, u64 length, u64 logical,
 	int ret;
 
 	ASSERT(!(fs_info->sb->s_flags & MS_RDONLY));
-	BUG_ON(!mirror_num);
+	WARN_ON(!mirror_num);
 
 	/* we can't repair anything in raid56 yet */
-	if (btrfs_is_parity_mirror(map_tree, logical, length, mirror_num))
+	if (!mirror_num || btrfs_is_parity_mirror(map_tree, logical, length, mirror_num))
 		return 0;
 
 	bio = btrfs_io_bio_alloc(GFP_NOFS, 1);