From patchwork Mon Apr 25 01:01:10 2016 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 8937541 Return-Path: X-Original-To: patchwork-linux-btrfs@patchwork.kernel.org Delivered-To: patchwork-parsemail@patchwork2.web.kernel.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.136]) by patchwork2.web.kernel.org (Postfix) with ESMTP id F1E31BF29F for ; Tue, 26 Apr 2016 10:39:45 +0000 (UTC) Received: from mail.kernel.org (localhost [127.0.0.1]) by mail.kernel.org (Postfix) with ESMTP id 9741F20114 for ; Tue, 26 Apr 2016 10:39:44 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id EC57820125 for ; Tue, 26 Apr 2016 10:39:42 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752215AbcDZKjN (ORCPT ); Tue, 26 Apr 2016 06:39:13 -0400 Received: from mail.kernel.org ([198.145.29.136]:54168 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751060AbcDZKjG (ORCPT ); Tue, 26 Apr 2016 06:39:06 -0400 Received: from mail.kernel.org (localhost [127.0.0.1]) by mail.kernel.org (Postfix) with ESMTP id 63BE12010C for ; Tue, 26 Apr 2016 10:39:04 +0000 (UTC) Received: from debian3.lan (bl12-226-64.dsl.telepac.pt [85.245.226.64]) (using TLSv1.2 with cipher AES128-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id A92572013A for ; Tue, 26 Apr 2016 10:39:02 +0000 (UTC) From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH 1/3] Btrfs: fix race in relocation that makes us miss extents Date: Mon, 25 Apr 2016 02:01:10 +0100 Message-Id: <1461546072-11154-2-git-send-email-fdmanana@kernel.org> X-Mailer: git-send-email 2.7.0.rc3 In-Reply-To: <1461546072-11154-1-git-send-email-fdmanana@kernel.org> References: <1461546072-11154-1-git-send-email-fdmanana@kernel.org> X-Spam-Status: No, score=-6.6 required=5.0 tests=BAYES_00, DATE_IN_PAST_24_48, RCVD_IN_DNSWL_HI, RP_MATCHES_RCVD, UNPARSEABLE_RELAY autolearn=unavailable version=3.3.1 X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on mail.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Sender: linux-btrfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Filipe Manana Before it starts the actual process of moving extents, relocation first sets the block group to read only mode, to prevent tasks from allocating new extents from it, and then flushes delalloc and waits for any ordered extents to complete. The flushing is done to synchronize with tasks that are running concurrently and have allocated an extent from the block group right before we set it to readonly mode but have not yet created the respective ordered extent (i.e. tasks that started flushing delalloc). Even though we wait for the ordered extents to complete, this is not enough to guarantee we will process (relocate) the respective extents, because the extent items are added to extent tree only when delayed references are run and we search for extents through the extent tree's commit root. Therefore we need to commit the current transaction if we waited for any ordered extents, otherwise we will miss extent items in find_next_extent() unless by chance the transaction used to complete the ordered extents is committed by some other concurrent task before we start scanning for extents in the extent tree. The race is illustrated by the following diagram: CPU 1 CPU 2 CPU 3 does buffered write against inode I btrfs_relocate_block_group(bg X) starts flushing delalloc extent_writepages() extent_write_cache_pages() locks first page in range __extent_writepage() writepage_delalloc() run_delalloc_range() cow_file_range() btrfs_reserve_extent() --> reserves extent from bg X sets bg X to RO btrfs_start_delalloc_roots() __start_delalloc_inodes() btrfs_alloc_delalloc_work() queues job to run btrfs_run_delalloc_work(inode I) at CPU 3 waits for job at CPU 3 to complete btrfs_run_delalloc_work(inode I) filemap_flush() writepages() extent_writepages() extent_write_cache_pages() --> blocks when trying to lock first page in range btrfs_add_ordered_extent(oe O) clears EXTENT_DELALLOC bit from the range unlocks first page in range gets lock on page leaves and unlocks the page because it's under writeback btrfs_wait_ordered_roots() btrfs_wait_ordered_extents() --> waits for ordered extent O to complete ordered extent O completes (btrfs_finish_ordered_io()) using transaction N to update the subvol and csum trees at this point transaction N is still the current transaction, it hasn't been committed yet nor did its delayed references got run, meaning there isn't yet an extent item in the extent tree for the extent that ordered extent O used we're at rc->stage == MOVE_DATA_EXTENTS relocate_block_group(bg X) find_next_extent() --> searches the extent tree for extent items using the commit root --> transaction N still not committed so it misses the extent from ordered extent O When this happens we end up not moving the extent resulting in the following trace/warning once the relocation finishes: [ 7260.832836] ------------[ cut here ]------------ [ 7260.834653] WARNING: CPU: 5 PID: 6765 at fs/btrfs/relocation.c:4318 btrfs_relocate_block_group+0x245/0x2a1 [btrfs]() [ 7260.838268] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc parport processor button loop autofs4 ext4 crc16 mbcache jbd2 sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring e1000 virtio scsi_mod floppy [last unloaded: btrfs] [ 7260.850935] CPU: 5 PID: 6765 Comm: btrfs Not tainted 4.5.0-rc6-btrfs-next-28+ #1 [ 7260.852998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [ 7260.852998] 0000000000000000 ffff88020bf57bc0 ffffffff812648b3 0000000000000000 [ 7260.852998] 0000000000000009 ffff88020bf57bf8 ffffffff81051608 ffffffffa03c1b2d [ 7260.852998] ffff8800b2bbb800 0000000000000000 ffff8800b17bcc58 ffff8800399dd000 [ 7260.852998] Call Trace: [ 7260.852998] [] dump_stack+0x67/0x90 [ 7260.852998] [] warn_slowpath_common+0x99/0xb2 [ 7260.852998] [] ? btrfs_relocate_block_group+0x245/0x2a1 [btrfs] [ 7260.852998] [] warn_slowpath_null+0x1a/0x1c [ 7260.852998] [] btrfs_relocate_block_group+0x245/0x2a1 [btrfs] [ 7260.852998] [] btrfs_relocate_chunk.isra.29+0x66/0xdb [btrfs] [ 7260.852998] [] btrfs_balance+0xde1/0xe4e [btrfs] [ 7260.852998] [] ? debug_smp_processor_id+0x17/0x19 [ 7260.852998] [] btrfs_ioctl_balance+0x255/0x2d3 [btrfs] [ 7260.852998] [] btrfs_ioctl+0x11e0/0x1dff [btrfs] [ 7260.852998] [] ? handle_mm_fault+0x443/0xd63 [ 7260.852998] [] ? _raw_spin_unlock+0x31/0x44 [ 7260.852998] [] ? arch_local_irq_save+0x9/0xc [ 7260.852998] [] vfs_ioctl+0x18/0x34 [ 7260.852998] [] do_vfs_ioctl+0x550/0x5be [ 7260.852998] [] ? __fget_light+0x4d/0x71 [ 7260.852998] [] SyS_ioctl+0x57/0x79 [ 7260.852998] [] entry_SYSCALL_64_fastpath+0x12/0x6b [ 7260.893268] ---[ end trace eb7803b24ebab8ad ]--- It is triggered because after the first stage of the relocation (rc->stage == MOVE_DATA_EXTENTS), we commit the current transaction and then the second time we call relocate_block_group() (rc->stage == UPDATE_DATA_PTRS), we have flushed and waited for delalloc on the relocation inode, but since we didn't find and move the extent in the first stage, the block group still has a non zero number of used bytes and therefore it triggers a warning at the end of btrfs_relocate_block_group(). Later on when trying to read the extent contents from disk we hit a BUG_ON() due to the inability to map a block with a logical address that belongs to the block group we relocated and is no longer valid, resulting in the following trace: [ 7344.885290] BTRFS critical (device sdi): unable to find logical 12845056 len 4096 [ 7344.887518] ------------[ cut here ]------------ [ 7344.888431] kernel BUG at fs/btrfs/inode.c:1833! [ 7344.888431] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 7344.888431] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc parport processor button loop autofs4 ext4 crc16 mbcache jbd2 sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring e1000 virtio scsi_mod floppy [last unloaded: btrfs] [ 7344.888431] CPU: 0 PID: 6831 Comm: od Tainted: G W 4.5.0-rc6-btrfs-next-28+ #1 [ 7344.888431] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [ 7344.888431] task: ffff880215818600 ti: ffff880204684000 task.ti: ffff880204684000 [ 7344.888431] RIP: 0010:[] [] btrfs_merge_bio_hook+0x54/0x6b [btrfs] [ 7344.888431] RSP: 0018:ffff8802046878f0 EFLAGS: 00010282 [ 7344.888431] RAX: 00000000ffffffea RBX: 0000000000001000 RCX: 0000000000000001 [ 7344.888431] RDX: ffff88023ec0f950 RSI: ffffffff8183b638 RDI: 00000000ffffffff [ 7344.888431] RBP: ffff880204687908 R08: 0000000000000001 R09: 0000000000000000 [ 7344.888431] R10: ffff880204687770 R11: ffffffff82f2d52d R12: 0000000000001000 [ 7344.888431] R13: ffff88021afbfee8 R14: 0000000000006208 R15: ffff88006cd199b0 [ 7344.888431] FS: 00007f1f9e1d6700(0000) GS:ffff88023ec00000(0000) knlGS:0000000000000000 [ 7344.888431] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 7344.888431] CR2: 00007f1f9dc8cb60 CR3: 000000023e3b6000 CR4: 00000000000006f0 [ 7344.888431] Stack: [ 7344.888431] 0000000000001000 0000000000001000 ffff880204687b98 ffff880204687950 [ 7344.888431] ffffffffa0395c8f ffffea0004d64d48 0000000000000000 0000000000001000 [ 7344.888431] ffffea0004d64d48 0000000000001000 0000000000000000 0000000000000000 [ 7344.888431] Call Trace: [ 7344.888431] [] submit_extent_page+0xf5/0x16f [btrfs] [ 7344.888431] [] __do_readpage+0x4a0/0x4f1 [btrfs] [ 7344.888431] [] ? btrfs_create_repair_bio+0xcb/0xcb [btrfs] [ 7344.888431] [] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs] [ 7344.888431] [] ? trace_hardirqs_on+0xd/0xf [ 7344.888431] [] __do_contiguous_readpages.constprop.26+0xc2/0xe4 [btrfs] [ 7344.888431] [] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs] [ 7344.888431] [] __extent_readpages.constprop.25+0xed/0x100 [btrfs] [ 7344.888431] [] ? lru_cache_add+0xe/0x10 [ 7344.888431] [] extent_readpages+0x160/0x1aa [btrfs] [ 7344.888431] [] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs] [ 7344.888431] [] ? alloc_pages_current+0xa9/0xcd [ 7344.888431] [] btrfs_readpages+0x1f/0x21 [btrfs] [ 7344.888431] [] __do_page_cache_readahead+0x168/0x1fc [ 7344.888431] [] ondemand_readahead+0x1f6/0x207 [ 7344.888431] [] ? ondemand_readahead+0x1f6/0x207 [ 7344.888431] [] ? pagecache_get_page+0x2b/0x154 [ 7344.888431] [] page_cache_sync_readahead+0x3d/0x3f [ 7344.888431] [] generic_file_read_iter+0x197/0x4e1 [ 7344.888431] [] __vfs_read+0x79/0x9d [ 7344.888431] [] vfs_read+0x8f/0xd2 [ 7344.888431] [] SyS_read+0x50/0x7e [ 7344.888431] [] entry_SYSCALL_64_fastpath+0x12/0x6b [ 7344.888431] Code: 8d 4d e8 45 31 c9 45 31 c0 48 8b 00 48 c1 e2 09 48 8b 80 80 fc ff ff 4c 89 65 e8 48 8b b8 f0 01 00 00 e8 1d 42 02 00 85 c0 79 02 <0f> 0b 4c 01 e3 31 c0 48 3b 5d e8 5a 5b 41 5c 0f 97 c0 5d c3 31 [ 7344.888431] RIP [] btrfs_merge_bio_hook+0x54/0x6b [btrfs] [ 7344.888431] RSP [ 7344.970544] ---[ end trace eb7803b24ebab8ae ]--- So if we waited for any ordered extents, make sure we attach to the current transaction and commit it. Another patch in the series makes sure that we only wait for ordered extents against extents located in our block group instead of any ordered extent. Signed-off-by: Filipe Manana Reviewed-by: Liu Bo --- fs/btrfs/ordered-data.c | 6 +++++- fs/btrfs/ordered-data.h | 2 +- fs/btrfs/relocation.c | 24 +++++++++++++++++++++++- 3 files changed, 29 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c index 0de7da5..41101b2 100644 --- a/fs/btrfs/ordered-data.c +++ b/fs/btrfs/ordered-data.c @@ -708,11 +708,12 @@ int btrfs_wait_ordered_extents(struct btrfs_root *root, int nr) return count; } -void btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, int nr) +int btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, int nr) { struct btrfs_root *root; struct list_head splice; int done; + int total_done = 0; INIT_LIST_HEAD(&splice); @@ -730,6 +731,7 @@ void btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, int nr) done = btrfs_wait_ordered_extents(root, nr); btrfs_put_fs_root(root); + total_done += done; spin_lock(&fs_info->ordered_root_lock); if (nr != -1) { @@ -740,6 +742,8 @@ void btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, int nr) list_splice_tail(&splice, &fs_info->ordered_roots); spin_unlock(&fs_info->ordered_root_lock); mutex_unlock(&fs_info->ordered_operations_mutex); + + return total_done; } /* diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h index 23c9605..f29e08e 100644 --- a/fs/btrfs/ordered-data.h +++ b/fs/btrfs/ordered-data.h @@ -198,7 +198,7 @@ int btrfs_ordered_update_i_size(struct inode *inode, u64 offset, int btrfs_find_ordered_sum(struct inode *inode, u64 offset, u64 disk_bytenr, u32 *sum, int len); int btrfs_wait_ordered_extents(struct btrfs_root *root, int nr); -void btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, int nr); +int btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, int nr); void btrfs_get_logged_extents(struct inode *inode, struct list_head *logged_list, const loff_t start, diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c index 2bd0011..f689be2 100644 --- a/fs/btrfs/relocation.c +++ b/fs/btrfs/relocation.c @@ -4258,7 +4258,29 @@ int btrfs_relocate_block_group(struct btrfs_root *extent_root, u64 group_start) err = ret; goto out; } - btrfs_wait_ordered_roots(fs_info, -1); + ret = btrfs_wait_ordered_roots(fs_info, -1); + /* + * We might have waited for an ordered extent to complete which was + * destined at an extent allocated from our block group. The extent item + * is only added to the extent tree after delayed references are run + * and it won't be accessible through the extent tree's commit root + * unless we commit the current transaction (the one the ordered extent + * used to update the subvol and csum trees). + * We always search for extents using the extent tree's commit root, + * with find_next_extent(), so we do this transaction commit to make + * sure we don't miss any extents. + */ + if (ret > 0) { + struct btrfs_trans_handle *trans; + + trans = btrfs_attach_transaction(extent_root); + if (IS_ERR(trans) && PTR_ERR(trans) != -ENOENT) + err = PTR_ERR(trans); + else if (!IS_ERR(trans)) + err = btrfs_commit_transaction(trans, extent_root); + if (err) + goto out; + } while (1) { mutex_lock(&fs_info->cleaner_mutex);