From patchwork Fri Dec 14 17:48:50 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Dmitriy Gorokh
X-Patchwork-Id: 10731511
References: <66F8D435-4E51-4761-B6CF-BA96F4BC5986@wdc.com>
In-Reply-To: <66F8D435-4E51-4761-B6CF-BA96F4BC5986@wdc.com>
From: Dmitriy Gorokh
Date: Fri, 14 Dec 2018 20:48:50 +0300
Subject: [PATCH v2] btrfs: raid56: data corruption on a device removal
To: linux-btrfs@vger.kernel.org
Sender: linux-btrfs-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: linux-btrfs@vger.kernel.org

A RAID5 or RAID6 filesystem might get corrupted in the following scenario:

1. Create a 4-disk RAID6 filesystem.
2. Preallocate 16 10GB files.
3. Run fio:
   'fio --name=testload --directory=./ --size=10G --numjobs=16 --bs=64k --iodepth=64 --rw=randrw --verify=sha256 --time_based --runtime=3600'
4. After a few minutes, pull out two drives:
   'echo 1 > /sys/block/sdc/device/delete ; echo 1 > /sys/block/sdd/device/delete'

In about 5 out of 10 runs, the test led to silent data corruption of a
random stripe, resulting in 'IO Error' and 'csum failed' messages when
reading the affected file. It usually affects only a small portion of
the files.

It is possible that a few bios which were being processed during the
drive removal contained a non-zero bio->bi_iter.bi_done field despite an
EIO bi_status. The bi_sector field was also increased from its original
value by that bi_done value. This appears to be a fairly rare condition.

Subsequently, in the raid_rmw_end_io handler, that failed bio can be
translated to a wrong stripe number and fail the wrong rbio.

Reviewed-by: Johannes Thumshirn
Signed-off-by: Dmitriy Gorokh
Reviewed-by: Liu Bo
---
 fs/btrfs/raid56.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 3c8093757497..cd2038315feb 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1451,6 +1451,12 @@ static int find_bio_stripe(struct btrfs_raid_bio *rbio,
 	struct btrfs_bio_stripe *stripe;
 
 	physical <<= 9;
+	/*
+	 * Since the failed bio can return partial data, bi_sector might be
+	 * incremented by that value. We need to revert it back to the
+	 * state before the bio was submitted.
+	 */
+	physical -= bio->bi_iter.bi_done;
 
 	for (i = 0; i < rbio->bbio->num_stripes; i++) {
 		stripe = &rbio->bbio->stripes[i];
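
Illustration only, not part of the patch: a minimal userspace sketch of
why subtracting bi_done matters for the stripe lookup. Everything here
is a hypothetical simplification: struct model_bio, struct model_stripe,
model_bio_advance(), model_find_stripe(), the 64K MODEL_STRIPE_LEN and
the integer device ids are assumptions made for the example, not the
kernel's types or the exact logic of find_bio_stripe().

```c
#include <stdint.h>

#define MODEL_SECTOR_SHIFT 9            /* 512-byte sectors */
#define MODEL_STRIPE_LEN   (64 * 1024)  /* assumed stripe length */

/* Hypothetical stand-in for the bio fields named in the changelog. */
struct model_bio {
	uint64_t bi_sector;  /* current sector, moves forward on progress */
	uint32_t bi_done;    /* bytes already accounted as completed */
	int dev;             /* simplified device id */
};

/* Hypothetical stand-in for one stripe of a full RAID stripe set. */
struct model_stripe {
	uint64_t physical;   /* byte offset of the stripe on its device */
	int dev;             /* simplified device id */
};

/*
 * Model of partial completion: the iterator moves forward by 'bytes',
 * so bi_sector no longer points at the submitted start position.
 */
static void model_bio_advance(struct model_bio *bio, uint32_t bytes)
{
	bio->bi_sector += bytes >> MODEL_SECTOR_SHIFT;
	bio->bi_done += bytes;
}

/*
 * Return the index of the stripe the bio falls into, or -1 if none
 * matches. With undo_done set, the start position is restored by
 * subtracting bi_done first, which is the idea behind the patch.
 */
static int model_find_stripe(const struct model_stripe *stripes, int n,
			     const struct model_bio *bio, int undo_done)
{
	uint64_t physical = bio->bi_sector << MODEL_SECTOR_SHIFT;
	int i;

	if (undo_done)
		physical -= bio->bi_done;

	for (i = 0; i < n; i++) {
		if (bio->dev == stripes[i].dev &&
		    physical >= stripes[i].physical &&
		    physical < stripes[i].physical + MODEL_STRIPE_LEN)
			return i;
	}
	return -1;
}
```

Example under these assumptions: four stripes all start at byte offset
1 MiB on devices 0..3, and a 4K bio is submitted to device 2 at sector
2168, i.e. the last 4K of that stripe. After model_bio_advance(&bio, 4096)
bi_sector is 2176, which is exactly the end of the stripe, so the lookup
without the correction returns -1, while the lookup with the correction
returns index 2. In this toy model the uncorrected lookup simply misses
the stripe; the point is only that the lookup result changes once
bi_sector has moved.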