From patchwork Fri Nov 20 12:16:22 2015 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 7667201 Return-Path: X-Original-To: patchwork-linux-btrfs@patchwork.kernel.org Delivered-To: patchwork-parsemail@patchwork1.web.kernel.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.136]) by patchwork1.web.kernel.org (Postfix) with ESMTP id 4446E9F2E2 for ; Fri, 20 Nov 2015 12:16:37 +0000 (UTC) Received: from mail.kernel.org (localhost [127.0.0.1]) by mail.kernel.org (Postfix) with ESMTP id 2AC4420434 for ; Fri, 20 Nov 2015 12:16:36 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id F2AC1203FB for ; Fri, 20 Nov 2015 12:16:34 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1760035AbbKTMQa (ORCPT ); Fri, 20 Nov 2015 07:16:30 -0500 Received: from mail.kernel.org ([198.145.29.136]:41560 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751295AbbKTMQa (ORCPT ); Fri, 20 Nov 2015 07:16:30 -0500 Received: from mail.kernel.org (localhost [127.0.0.1]) by mail.kernel.org (Postfix) with ESMTP id 08BD2203FB; Fri, 20 Nov 2015 12:16:29 +0000 (UTC) Received: from debian3.lan (bl8-199-62.dsl.telepac.pt [85.241.199.62]) (using TLSv1.2 with cipher AES128-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id A594520435; Fri, 20 Nov 2015 12:16:27 +0000 (UTC) From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Cc: Filipe Manana Subject: [PATCH v2] Btrfs: fix race when finishing dev replace leading to transaction abort Date: Fri, 20 Nov 2015 12:16:22 +0000 Message-Id: <1448021782-722-1-git-send-email-fdmanana@kernel.org> X-Mailer: git-send-email 2.1.3 In-Reply-To: <1448017261-15543-1-git-send-email-fdmanana@kernel.org> References: <1448017261-15543-1-git-send-email-fdmanana@kernel.org> X-Spam-Status: No, score=-7.5 required=5.0 tests=BAYES_00, RCVD_IN_DNSWL_HI, RP_MATCHES_RCVD, UNPARSEABLE_RELAY autolearn=unavailable version=3.3.1 X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on mail.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Sender: linux-btrfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Filipe Manana During the final phase of a device replace operation, I ran into a transaction abort that resulted in the following trace: [23919.655368] WARNING: CPU: 10 PID: 30175 at fs/btrfs/extent-tree.c:9843 btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]() [23919.664742] BTRFS: Transaction aborted (error -2) [23919.665749] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 parport psmouse acpi_cpufreq processor i2c_core evdev microcode pcspkr button serio_raw ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom virtio_scsi ata_generic ata_piix virtio_pci floppy virtio_ring libata e1000 virtio scsi_mod [last unloaded: btrfs] [23919.679442] CPU: 10 PID: 30175 Comm: fsstress Not tainted 4.3.0-rc5-btrfs-next-17+ #1 [23919.682392] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014 [23919.689151] 0000000000000000 ffff8804020cbb50 ffffffff812566f4 ffff8804020cbb98 [23919.692604] ffff8804020cbb88 ffffffff8104d0a6 ffffffffa03eea69 ffff88041b678a48 [23919.694230] ffff88042ac38000 ffff88041b678930 00000000fffffffe ffff8804020cbbf0 [23919.696716] Call Trace: [23919.698669] [] dump_stack+0x4e/0x79 [23919.700597] [] warn_slowpath_common+0x9f/0xb8 [23919.701958] [] ? btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs] [23919.703612] [] warn_slowpath_fmt+0x48/0x50 [23919.705047] [] btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs] [23919.706967] [] __btrfs_end_transaction+0x84/0x2dd [btrfs] [23919.708611] [] btrfs_end_transaction+0x10/0x12 [btrfs] [23919.710099] [] btrfs_alloc_data_chunk_ondemand+0x121/0x28b [btrfs] [23919.711970] [] btrfs_fallocate+0x7d3/0xc6d [btrfs] [23919.713602] [] ? lock_acquire+0x10d/0x194 [23919.714756] [] ? percpu_down_read+0x51/0x78 [23919.716155] [] ? __sb_start_write+0x5f/0xb0 [23919.718918] [] ? __sb_start_write+0x5f/0xb0 [23919.724170] [] vfs_fallocate+0x170/0x1ff [23919.725482] [] ioctl_preallocate+0x89/0x9b [23919.726790] [] do_vfs_ioctl+0x406/0x4e6 [23919.728428] [] ? SYSC_newfstat+0x25/0x2e [23919.729642] [] ? __fget_light+0x4d/0x71 [23919.730782] [] SyS_ioctl+0x57/0x79 [23919.731847] [] entry_SYSCALL_64_fastpath+0x12/0x6f [23919.733330] ---[ end trace 166ef301a335832a ]--- This is due to a race between device replace and chunk allocation, which the following diagram illustrates: CPU 1 CPU 2 btrfs_dev_replace_finishing() at this point dev_replace->tgtdev->devid == BTRFS_DEV_REPLACE_DEVID (0ULL) ... btrfs_start_transaction() btrfs_commit_transaction() btrfs_fallocate() btrfs_alloc_data_chunk_ondemand() btrfs_join_transaction() --> starts a new transaction do_chunk_alloc() lock fs_info->chunk_mutex btrfs_alloc_chunk() --> creates extent map for the new chunk with em->bdev->map->stripes[i]->dev->devid == X (X > 0) --> extent map is added to fs_info->mapping_tree --> initial phase of bg A allocation completes unlock fs_info->chunk_mutex lock fs_info->chunk_mutex btrfs_dev_replace_update_device_in_mapping_tree() --> iterates fs_info->mapping_tree and replaces the device in every extent map's map->stripes[] with dev_replace->tgtdev, which still has an id of 0ULL (BTRFS_DEV_REPLACE_DEVID) btrfs_end_transaction() btrfs_create_pending_block_groups() --> starts final phase of bg A creation (update device, extent, and chunk trees, etc) btrfs_finish_chunk_alloc() btrfs_update_device() --> attempts to update a device item with ID == 0ULL (BTRFS_DEV_REPLACE_DEVID) which is the current ID of bg A's em->bdev->map->stripes[i]->dev->devid --> doesn't find such item returns -ENOENT --> the device id should have been X and not 0ULL got -ENOENT from btrfs_finish_chunk_alloc() and aborts current transaction finishes setting up the target device, namely it sets tgtdev->devid to the value of srcdev->devid, which is X (and X > 0) So fix this by replacing the device object in the extent maps only after finishing the setup for the target device (setting its new ID, etc) and before setting the old device's ID to BTRFS_DEV_REPLACE_DEVID. This happened while running fstest btrfs/071. Signed-off-by: Filipe Manana --- V2: Made the attribution of a new id from the old device to happen after replacing the device for the extent maps with the new device, as obviously it was missing before. fs/btrfs/dev-replace.c | 20 +++++++++++++------- 1 file changed, 13 insertions(+), 7 deletions(-) diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c index 1e668fb..923b375 100644 --- a/fs/btrfs/dev-replace.c +++ b/fs/btrfs/dev-replace.c @@ -516,12 +516,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info, dev_replace->time_stopped = get_seconds(); dev_replace->item_needs_writeback = 1; - /* replace old device with new one in mapping tree */ - if (!scrub_ret) { - btrfs_dev_replace_update_device_in_mapping_tree(fs_info, - src_device, - tgt_device); - } else { + if (scrub_ret) { btrfs_err_in_rcu(root->fs_info, "btrfs_scrub_dev(%s, %llu, %s) failed %d", src_device->missing ? "" : @@ -547,7 +542,6 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info, rcu_str_deref(tgt_device->name)); tgt_device->is_tgtdev_for_dev_replace = 0; tgt_device->devid = src_device->devid; - src_device->devid = BTRFS_DEV_REPLACE_DEVID; memcpy(uuid_tmp, tgt_device->uuid, sizeof(uuid_tmp)); memcpy(tgt_device->uuid, src_device->uuid, sizeof(tgt_device->uuid)); memcpy(src_device->uuid, uuid_tmp, sizeof(src_device->uuid)); @@ -565,6 +559,18 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info, list_add(&tgt_device->dev_alloc_list, &fs_info->fs_devices->alloc_list); fs_info->fs_devices->rw_devices++; + /* + * Now that our new device replaced the old device and it's completely + * setup, replace the old device with the new one in the mapping tree. + * But do it before setting the old device's id to + * BTRFS_DEV_REPLACE_DEVID, to avoid racing with tasks finishing chunk + * allocation. + */ + btrfs_dev_replace_update_device_in_mapping_tree(fs_info, + src_device, + tgt_device); + src_device->devid = BTRFS_DEV_REPLACE_DEVID; + btrfs_dev_replace_unlock(dev_replace); btrfs_rm_dev_replace_blocked(fs_info);