From patchwork Mon Apr 23 08:23:04 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Qu Wenruo
X-Patchwork-Id: 10356387
Subject: Re: 4.17-rc1 FS went read-only during balance
To: Dmitrii Tcvetkov, linux-btrfs@vger.kernel.org
References: <20180421175548.4b07dffc@demfloro.ru>
 <5775f38a-5f17-1f6d-a6cd-289e18188a26@gmx.com>
 <20180423080745.5a9dc6be@demfloro.ru>
 <3d2443c8-0b34-2eea-3adc-2f33570f75b1@gmx.com>
 <20180423105543.43f13e3a@job>
From: Qu Wenruo
Date: Mon, 23 Apr 2018 16:23:04 +0800
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
 Thunderbird/52.7.0
In-Reply-To: <20180423105543.43f13e3a@job>
Sender: linux-btrfs-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: linux-btrfs@vger.kernel.org

On 2018-04-23 16:04, Dmitrii Tcvetkov wrote:
>>>>> TL;DR It seems as regression in 4.17, but I managed to find a
>>>>> workaround to make filesystem rw mountable again.
>>>>>
>>>>> Kernel built from tag v4.17-rc1
>>>>> btrfs-progs 4.16
>>>>>
>>>>> Tonight two my machines (PC (ECC RAM) and laptop(non-ECC RAM)) were
>>>>> doing usual weekly balance with this command via cron:
>>>>> btrfs balance start -musage=50 -dusage=50
>>>>> Both machines run same kernel version.
>>>>>
>>>>> On PC that caused root and "data" filesystems to go readonly. Root
>>>>> is on an SSD with data single and metadata DUP, "data" filesystem
>>>>> is on 2 HDDs with RAID1 for data and metadata.
>>>>>
>>>>> On laptop only /home went ro, it's on NVMe SSD with data single and
>>>>> metadata DUP.
>>>>>
>>>>> Btrfs check of PC rootfs was without any errors in both modes, I did
>>>>> them once each before reboot on readonly filesystem with --force
>>>>> flag and then from live usb. Same output without any errors.
>>>>>
>>>>> After reboot kernel refused rw mount rootfs with the same error as
>>>>> during cron balance, ro mount was accepted, error during rw mount:
>>>>> BTRFS: error (device dm-17) in merge_reloc_roots:2465: errno=-117
>>>
>>>> 117 means EUCLEAN, which could be caused by the newly introduced
>>>> first_key and level check.
>>>
>>>> Please apply this hotfix to fix it.
>>>> btrfs: Only check first key for committed tree blocks
>>>> (Which is included in latest pull request)
>>>
>>>> Also, please consider enable CONFIG_BTRFS_DEBUG to provide extra
>>>> debug info.
>>>
>>>> Thanks,
>>>> Qu
>>>
>>> I tried 4.17-rc2 (as the pull request was pulled) with
>>> CONFIG_BTRFS_DEBUG on LVM snapshot of laptop home partition (/dev/vdb)
>>> in a VM (VM kernel sees only snapshot so no UUID collisions). Dmesg
>>> attached.
>>
>> Thanks for the info and your previous btrfs-image.
>>
>> The image itself shows nothing wrong, so it should be runtime problem.
>> Would you please apply these two debug patches?
>> https://patchwork.kernel.org/patch/10335133/
>> https://patchwork.kernel.org/patch/10335135/
>>
>> And the attached diff file?
>>
>> My guess is the parent node is not initialized correctly in this case.
>>
>> Thanks,
>> Qu
>
> Dmesg from kernel with all three patches applied attached.
>

Thanks for the debug info, it really helps a lot!

It turns out that I'm just a super idiot: a typo in replace_path()
caused this, and it could not be triggered unless we enter it from
relocation recovery.

Please try the attached patch to see if it solves the problem.

Thanks,
Qu

From 4b70eb864192ec5cf54a7e67e2957ddf0e5c0f6f Mon Sep 17 00:00:00 2001
From: Qu Wenruo
Date: Mon, 23 Apr 2018 16:13:55 +0800
Subject: [PATCH] btrfs: Fix wrong first_key parameter in replace_path

Commit 581c1760415c ("btrfs: Validate child tree block's level and first
key") introduced a new @first_key parameter for read_tree_block(), however
the caller in replace_path() is passing the wrong key to read_tree_block().
It should use @first_key rather than @key.

Normally this won't expose a problem, as @key is normally initialized to
the same value as the @first_key we expect.
However, in the relocation recovery case @key can be set to (0, 0, 0), and
since no valid key in the relocation tree can be (0, 0, 0), it will cause
read_tree_block() to return -EUCLEAN and interrupt relocation recovery.

Fix it by setting @first_key correctly.

Signed-off-by: Qu Wenruo
---
 fs/btrfs/relocation.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 00b7d3231821..b041b945a7ae 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -1841,7 +1841,7 @@ int replace_path(struct btrfs_trans_handle *trans,
 		old_bytenr = btrfs_node_blockptr(parent, slot);
 		blocksize = fs_info->nodesize;
 		old_ptr_gen = btrfs_node_ptr_generation(parent, slot);
-		btrfs_node_key_to_cpu(parent, &key, slot);
+		btrfs_node_key_to_cpu(parent, &first_key, slot);
 
 		if (level <= max_level) {
 			eb = path->nodes[level];
-- 
2.17.0
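
For readers who want to see the failure mode without reading relocation.c,
here is a rough userspace sketch of what the commit message describes. It is
not kernel code. Everything prefixed with sketch_ is made up for
illustration, the key struct is a simplified stand-in for struct btrfs_key,
and the error value 117 is hard-coded on the assumption that it matches
EUCLEAN on Linux (as the errno=-117 in the report suggests). The model
assumes the child block always starts with the key stored in its parent's
slot, and only shows how a wrong expected key turns a healthy block into
-EUCLEAN.

```c
#include <assert.h>

/* Assumed to equal EUCLEAN on Linux; hard-coded so the sketch stays portable. */
#define SKETCH_EUCLEAN 117

/* Simplified stand-in for struct btrfs_key (hypothetical, not the kernel layout). */
struct sketch_key {
	unsigned long long objectid;
	unsigned char type;
	unsigned long long offset;
};

/*
 * Stand-in for the first-key validation: the first key found in the
 * block must match the key the caller says it expects.
 */
static int sketch_check_first_key(const struct sketch_key *stored,
				  const struct sketch_key *expected)
{
	assert(stored && expected);

	if (stored->objectid != expected->objectid ||
	    stored->type != expected->type ||
	    stored->offset != expected->offset)
		return -SKETCH_EUCLEAN;
	return 0;
}

/*
 * Model of reading one child block.
 *
 * parent_slot_key: key recorded in the parent node's slot for this child.
 * stale_key:       whatever key the caller happened to hold; in the
 *                  relocation recovery case this can be (0, 0, 0).
 * use_fix:         0 = expected key comes from stale_key (the bug),
 *                  1 = expected key comes from the parent slot (the fix).
 */
static int sketch_read_child(const struct sketch_key *parent_slot_key,
			     const struct sketch_key *stale_key,
			     int use_fix)
{
	/* Assumption of this model: the child really starts with the slot key. */
	const struct sketch_key *child_first = parent_slot_key;
	const struct sketch_key *expected =
		use_fix ? parent_slot_key : stale_key;

	return sketch_check_first_key(child_first, expected);
}
```

With a slot key such as (256, 1, 0), calling sketch_read_child() with a
matching stale key passes even without the fix, which is why normal balance
never hit this. With a zeroed stale key the unfixed variant returns -117,
while the fixed variant succeeds because the expected key no longer depends
on what the caller passed in.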