From patchwork Wed Feb 10 22:14:33 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Josef Bacik X-Patchwork-Id: 12081855 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5760EC433E6 for ; Wed, 10 Feb 2021 22:15:28 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 2B6B964E02 for ; Wed, 10 Feb 2021 22:15:28 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231193AbhBJWP0 (ORCPT ); Wed, 10 Feb 2021 17:15:26 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57238 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232813AbhBJWPW (ORCPT ); Wed, 10 Feb 2021 17:15:22 -0500 Received: from mail-qk1-x731.google.com (mail-qk1-x731.google.com [IPv6:2607:f8b0:4864:20::731]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id DF075C0613D6 for ; Wed, 10 Feb 2021 14:14:41 -0800 (PST) Received: by mail-qk1-x731.google.com with SMTP id o193so3374951qke.11 for ; Wed, 10 Feb 2021 14:14:41 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=toxicpanda-com.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=mEHNIYeLl7iOw3nd1CdxJKtufXh4H/Rnl9mCb/QuHtg=; b=ct+AxeqwF+/RHWdDfJGyS9kDrNaThi0eX0ctTdnW461NS1Sr3J21GIju2bwetP40JM jnKRI7zqUkvdNO8apgm1nivD2I8HAsb1nVrWbf6/gzQ80Pca7AJCPvGEsb7jFxyHnzGi 
Vhv+rJQZMAZCeIaETfEPEQSiu3CxKg9EXQy8iYj9pUWAmKltVQj02CviCn76Y7+rZVbF aMI3U3j5Hz9biOtpvjgWSq0usuT9C9Ap+8PaMrH0Ij7O0UCB4qmy4kQmUxNLc4/26vgN AaXpRD0TBhvdvotnNrTCqN04gQoEb10oVnOWBFcizgK1sT9pbmCUO5N5xtYKw2cCChjI GQxw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=mEHNIYeLl7iOw3nd1CdxJKtufXh4H/Rnl9mCb/QuHtg=; b=LM8SI0Qf/TA1ZE1MJd14wU9yTyY9hrmRf7PakvTIT+HupjfGsDMlNEDESUfbV5P+77 imLzBCv1EBQERKqRJbSV5Uz8AYxV52r6HTzVz2YX+2gAh1tM73WGRUNOzMBiHhf2BfUz aYx6Rs3kitMzZaOqlmfr4XTrQfe0zx3o+vxuKXENi2gCprV71dBNdRw1B68nnvYQrh1F cH3f2K0fU2dfZYgTgxxv+t1rgt0t1JeIV9liMS4eUkZx7urV3h9Bexcin8CNIplrM7uL c9AxV0rPYJWtA2lcYMT+Z0yMVEaz8HsoXnMd7zeMniRlAQrCz5arFWMZnkAG//5QmEPp rZCw== X-Gm-Message-State: AOAM533COv1OqJfH0HkqIwfpxJoJLQLZ1opehdJzh6Nsr+4aZ6R8riwa TrzyDMqFkIGGdwjIFDNlAtlku4qAKSbbDaoE X-Google-Smtp-Source: ABdhPJzX+Pk3E3LhsZQFcK8TOgqmlGRPknuGu6HxbpchctfPkbj4bsUdpQfbl86m4P+syZo+qXRZXA== X-Received: by 2002:a05:620a:959:: with SMTP id w25mr4673782qkw.345.1612995280705; Wed, 10 Feb 2021 14:14:40 -0800 (PST) Received: from localhost (cpe-174-109-172-136.nc.res.rr.com. [174.109.172.136]) by smtp.gmail.com with ESMTPSA id i5sm2458810qkg.32.2021.02.10.14.14.39 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 10 Feb 2021 14:14:40 -0800 (PST) From: Josef Bacik To: linux-btrfs@vger.kernel.org, kernel-team@fb.com Cc: Filipe Manana Subject: [PATCH v2 1/4] btrfs: add a i_mmap_lock to our inode Date: Wed, 10 Feb 2021 17:14:33 -0500 Message-Id: <1963d741dfcd35e9585e1d7f96d1e45a44288125.1612995212.git.josef@toxicpanda.com> X-Mailer: git-send-email 2.26.2 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org We need to be able to exclude page_mkwrite from happening concurrently with certain operations. 
To facilitate this, add an i_mmap_lock to our inode, down_read() it in our mkwrite, and add a new ILOCK flag to indicate that we want to take the i_mmap_lock as well. I used pahole to check the size of the btrfs_inode; the sizes are as follows:
no lockdep:
  before: 1120 (3 per 4k page)
  after: 1160 (3 per 4k page)
lockdep:
  before: 2072 (1 per 4k page)
  after: 2224 (1 per 4k page)
We're slightly larger but it doesn't change how many objects we can fit per page.
Reviewed-by: Filipe Manana Signed-off-by: Josef Bacik --- fs/btrfs/btrfs_inode.h | 1 + fs/btrfs/ctree.h | 1 + fs/btrfs/inode.c | 10 ++++++++++ 3 files changed, 12 insertions(+) diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index 28e202e89660..26837c3ca7f6 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -220,6 +220,7 @@ struct btrfs_inode { /* Hook into fs_info->delayed_iputs */ struct list_head delayed_iput; + struct rw_semaphore i_mmap_lock; struct inode vfs_inode; }; diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 3bc00aed13b2..5a410c812978 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3174,6 +3174,7 @@ extern const struct iomap_dio_ops btrfs_dio_ops; /* Inode locking type flags, by default the exclusive lock is taken */ #define BTRFS_ILOCK_SHARED (1U << 0) #define BTRFS_ILOCK_TRY (1U << 1) +#define BTRFS_ILOCK_MMAP (1U << 2) int btrfs_inode_lock(struct inode *inode, unsigned int ilock_flags); void btrfs_inode_unlock(struct inode *inode, unsigned int ilock_flags); diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 535abf898225..4c3ba0a3e0e6 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -102,6 +102,7 @@ static void __endio_write_update_ordered(struct btrfs_inode *inode, * BTRFS_ILOCK_SHARED - acquire a shared lock on the inode * BTRFS_ILOCK_TRY - try to acquire the lock, if fails on first attempt * return -EAGAIN + * BTRFS_ILOCK_MMAP - acquire a write lock on the i_mmap_lock */ int btrfs_inode_lock(struct inode *inode, unsigned int ilock_flags) { 
@@ -122,6 +123,8 @@ int btrfs_inode_lock(struct inode *inode, unsigned int ilock_flags) } inode_lock(inode); } + if (ilock_flags & BTRFS_ILOCK_MMAP) + down_write(&BTRFS_I(inode)->i_mmap_lock); return 0; } @@ -133,6 +136,8 @@ int btrfs_inode_lock(struct inode *inode, unsigned int ilock_flags) */ void btrfs_inode_unlock(struct inode *inode, unsigned int ilock_flags) { + if (ilock_flags & BTRFS_ILOCK_MMAP) + up_write(&BTRFS_I(inode)->i_mmap_lock); if (ilock_flags & BTRFS_ILOCK_SHARED) inode_unlock_shared(inode); else @@ -8538,6 +8543,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf) ret = VM_FAULT_NOPAGE; /* make the VM retry the fault */ again: + down_read(&BTRFS_I(inode)->i_mmap_lock); lock_page(page); size = i_size_read(inode); @@ -8566,6 +8572,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf) unlock_extent_cached(io_tree, page_start, page_end, &cached_state); unlock_page(page); + up_read(&BTRFS_I(inode)->i_mmap_lock); btrfs_start_ordered_extent(ordered, 1); btrfs_put_ordered_extent(ordered); goto again; @@ -8623,6 +8630,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf) BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit; unlock_extent_cached(io_tree, page_start, page_end, &cached_state); + up_read(&BTRFS_I(inode)->i_mmap_lock); btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE); sb_end_pagefault(inode->i_sb); @@ -8631,6 +8639,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf) out_unlock: unlock_page(page); + up_read(&BTRFS_I(inode)->i_mmap_lock); out: btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE); btrfs_delalloc_release_space(BTRFS_I(inode), data_reserved, page_start, @@ -8882,6 +8891,7 @@ struct inode *btrfs_alloc_inode(struct super_block *sb) INIT_LIST_HEAD(&ei->delalloc_inodes); INIT_LIST_HEAD(&ei->delayed_iput); RB_CLEAR_NODE(&ei->rb_node); + init_rwsem(&ei->i_mmap_lock); return inode; } From patchwork Wed Feb 10 22:14:34 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 
Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Josef Bacik X-Patchwork-Id: 12081857 From: Josef Bacik To: linux-btrfs@vger.kernel.org, kernel-team@fb.com Cc: Filipe Manana Subject: [PATCH v2 2/4] btrfs: cleanup inode_lock/inode_unlock uses Date: Wed, 10 Feb 2021 17:14:34 -0500 Message-Id: <89a4671fb7927a5485cf52a95822bebfe64302cc.1612995212.git.josef@toxicpanda.com> X-Mailer: git-send-email 2.26.2 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org
A few places we intermix btrfs_inode_lock with an inode_unlock, and some places we just use inode_lock/inode_unlock instead of btrfs_inode_lock. 
None of these places are using this incorrectly, but as we adjust some of these callers it would be nice to keep everything consistent, so convert everybody to use btrfs_inode_lock/btrfs_inode_unlock. Reviewed-by: Filipe Manana Signed-off-by: Josef Bacik --- fs/btrfs/delayed-inode.c | 4 ++-- fs/btrfs/file.c | 18 +++++++++--------- fs/btrfs/ioctl.c | 26 +++++++++++++------------- fs/btrfs/reflink.c | 4 ++-- fs/btrfs/relocation.c | 4 ++-- 5 files changed, 28 insertions(+), 28 deletions(-) diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c index ec0b50b8c5d6..ec6c277f1f91 100644 --- a/fs/btrfs/delayed-inode.c +++ b/fs/btrfs/delayed-inode.c @@ -1588,8 +1588,8 @@ bool btrfs_readdir_get_delayed_items(struct inode *inode, * We can only do one readdir with delayed items at a time because of * item->readdir_list. */ - inode_unlock_shared(inode); - inode_lock(inode); + btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED); + btrfs_inode_lock(inode, 0); mutex_lock(&delayed_node->mutex); item = __btrfs_first_delayed_insertion_item(delayed_node); diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 01a72f53fb5d..728736e3d4b8 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -2122,7 +2122,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) if (ret) goto out; - inode_lock(inode); + btrfs_inode_lock(inode, 0); atomic_inc(&root->log_batch); @@ -2154,7 +2154,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) */ ret = start_ordered_ops(inode, start, end); if (ret) { - inode_unlock(inode); + btrfs_inode_unlock(inode, 0); goto out; } @@ -2255,7 +2255,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) * file again, but that will end up using the synchronization * inside btrfs_sync_log to keep things safe. 
*/ - inode_unlock(inode); + btrfs_inode_unlock(inode, 0); if (ret != BTRFS_NO_LOG_SYNC) { if (!ret) { @@ -2285,7 +2285,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) out_release_extents: btrfs_release_log_ctx_extents(&ctx); - inode_unlock(inode); + btrfs_inode_unlock(inode, 0); goto out; } @@ -2868,7 +2868,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len) if (ret) return ret; - inode_lock(inode); + btrfs_inode_lock(inode, 0); ino_size = round_up(inode->i_size, fs_info->sectorsize); ret = find_first_non_hole(BTRFS_I(inode), &offset, &len); if (ret < 0) @@ -2908,7 +2908,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len) truncated_block = true; ret = btrfs_truncate_block(BTRFS_I(inode), offset, 0, 0); if (ret) { - inode_unlock(inode); + btrfs_inode_unlock(inode, 0); return ret; } } @@ -3009,7 +3009,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len) ret = ret2; } } - inode_unlock(inode); + btrfs_inode_unlock(inode, 0); return ret; } @@ -3374,7 +3374,7 @@ static long btrfs_fallocate(struct file *file, int mode, if (mode & FALLOC_FL_ZERO_RANGE) { ret = btrfs_zero_range(inode, offset, len, mode); - inode_unlock(inode); + btrfs_inode_unlock(inode, 0); return ret; } @@ -3484,7 +3484,7 @@ static long btrfs_fallocate(struct file *file, int mode, unlock_extent_cached(&BTRFS_I(inode)->io_tree, alloc_start, locked_end, &cached_state); out: - inode_unlock(inode); + btrfs_inode_unlock(inode, 0); /* Let go of our reservation. 
*/ if (ret != 0 && !(mode & FALLOC_FL_ZERO_RANGE)) btrfs_free_reserved_data_space(BTRFS_I(inode), data_reserved, diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index a8c60d46d19c..c9f2bc0602d6 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -226,7 +226,7 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg) if (ret) return ret; - inode_lock(inode); + btrfs_inode_lock(inode, 0); fsflags = btrfs_mask_fsflags_for_type(inode, fsflags); old_fsflags = btrfs_inode_flags_to_fsflags(binode->flags); @@ -353,7 +353,7 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg) out_end_trans: btrfs_end_transaction(trans); out_unlock: - inode_unlock(inode); + btrfs_inode_unlock(inode, 0); mnt_drop_write_file(file); return ret; } @@ -449,7 +449,7 @@ static int btrfs_ioctl_fssetxattr(struct file *file, void __user *arg) if (ret) return ret; - inode_lock(inode); + btrfs_inode_lock(inode, 0); old_flags = binode->flags; old_i_flags = inode->i_flags; @@ -501,7 +501,7 @@ static int btrfs_ioctl_fssetxattr(struct file *file, void __user *arg) inode->i_flags = old_i_flags; } - inode_unlock(inode); + btrfs_inode_unlock(inode, 0); mnt_drop_write_file(file); return ret; @@ -1013,7 +1013,7 @@ static noinline int btrfs_mksubvol(const struct path *parent, out_dput: dput(dentry); out_unlock: - inode_unlock(dir); + btrfs_inode_unlock(dir, 0); return error; } @@ -1611,7 +1611,7 @@ int btrfs_defrag_file(struct inode *inode, struct file *file, ra_index += cluster; } - inode_lock(inode); + btrfs_inode_lock(inode, 0); if (IS_SWAPFILE(inode)) { ret = -ETXTBSY; } else { @@ -1620,13 +1620,13 @@ int btrfs_defrag_file(struct inode *inode, struct file *file, ret = cluster_pages_for_defrag(inode, pages, i, cluster); } if (ret < 0) { - inode_unlock(inode); + btrfs_inode_unlock(inode, 0); goto out_ra; } defrag_count += ret; balance_dirty_pages_ratelimited(inode->i_mapping); - inode_unlock(inode); + btrfs_inode_unlock(inode, 0); if (newer_than) { if (newer_off == (u64)-1) 
@@ -1674,9 +1674,9 @@ int btrfs_defrag_file(struct inode *inode, struct file *file, out_ra: if (do_compress) { - inode_lock(inode); + btrfs_inode_lock(inode, 0); BTRFS_I(inode)->defrag_compress = BTRFS_COMPRESS_NONE; - inode_unlock(inode); + btrfs_inode_unlock(inode, 0); } if (!file) kfree(ra); @@ -3092,9 +3092,9 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file, goto out_dput; } - inode_lock(inode); + btrfs_inode_lock(inode, 0); err = btrfs_delete_subvolume(dir, dentry); - inode_unlock(inode); + btrfs_inode_unlock(inode, 0); if (!err) { fsnotify_rmdir(dir, dentry); d_delete(dentry); @@ -3103,7 +3103,7 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file, out_dput: dput(dentry); out_unlock_dir: - inode_unlock(dir); + btrfs_inode_unlock(dir, 0); free_subvol_name: kfree(subvol_name_ptr); free_parent: diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c index b24396cf2f99..12df3ee84e93 100644 --- a/fs/btrfs/reflink.c +++ b/fs/btrfs/reflink.c @@ -819,7 +819,7 @@ loff_t btrfs_remap_file_range(struct file *src_file, loff_t off, return -EINVAL; if (same_inode) - inode_lock(src_inode); + btrfs_inode_lock(src_inode, 0); else lock_two_nondirectories(src_inode, dst_inode); @@ -835,7 +835,7 @@ loff_t btrfs_remap_file_range(struct file *src_file, loff_t off, out_unlock: if (same_inode) - inode_unlock(src_inode); + btrfs_inode_unlock(src_inode, 0); else unlock_two_nondirectories(src_inode, dst_inode); diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c index 232d5da7b7be..bf269ee17e68 100644 --- a/fs/btrfs/relocation.c +++ b/fs/btrfs/relocation.c @@ -2578,7 +2578,7 @@ static noinline_for_stack int prealloc_file_extent_cluster( return btrfs_end_transaction(trans); } - inode_lock(&inode->vfs_inode); + btrfs_inode_lock(&inode->vfs_inode, 0); for (nr = 0; nr < cluster->nr; nr++) { start = cluster->boundary[nr] - offset; if (nr + 1 < cluster->nr) @@ -2596,7 +2596,7 @@ static noinline_for_stack int prealloc_file_extent_cluster( if (ret) break; } - 
inode_unlock(&inode->vfs_inode); + btrfs_inode_unlock(&inode->vfs_inode, 0); if (cur_offset < prealloc_end) btrfs_free_reserved_data_space_noquota(inode->root->fs_info,
From patchwork Wed Feb 10 22:14:35 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Josef Bacik X-Patchwork-Id: 12081859 From: Josef Bacik To: linux-btrfs@vger.kernel.org, kernel-team@fb.com Cc: Filipe Manana Subject: [PATCH v2 3/4] btrfs: exclude mmaps while doing remap Date: Wed, 10 Feb 2021 17:14:35 -0500 Message-Id: <78f6c599aea8e4b887583736670f9d61a8cc4b3c.1612995212.git.josef@toxicpanda.com> X-Mailer: git-send-email 2.26.2 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org
Darrick reported a potential issue to me where we could allow mmap writes after validating a page range matched in the case of dedupe. Generally we rely on lock page -> lock extent with the ordered flush to protect us, but this is done after we check the pages because we use the generic helpers, so we could modify the page in between doing the check and locking the range. There also exists a deadlock, as described by Filipe """ When cloning a file range, we lock the inodes, flush any delalloc within the respective file ranges, wait for any ordered extents and then lock the file ranges in both inodes. This means that right after we flush delalloc and before we lock the file ranges, memory mapped writes can come in and dirty pages in the file ranges of the clone operation. Most of the time this is harmless and causes no problems. However, if we are low on available metadata space, we can later end up in a deadlock when starting a transaction to replace file extent items. This happens if when allocating metadata space for the transaction, we need to wait for the async reclaim thread to release space and the reclaim thread needs to flush delalloc for the inode that got the memory mapped write and has its range locked by the clone task. 
Basically what happens is the following:
1) A clone operation locks inodes A and B, flushes delalloc for both inodes in the respective file ranges and waits for any ordered extents in those ranges to complete;
2) Before the clone task locks the file ranges, another task does a memory mapped write (which does not lock the inode) for one of the inodes of the clone operation. So now we have a dirty page in one of the ranges used by the clone operation;
3) The clone operation locks the file ranges for inodes A and B;
4) Later, when iterating over the file extents of inode A, the clone task attempts to start a transaction. There's not enough available free metadata space, so the async reclaim task is started (if not running already) and we wait for someone to wake us up on our reservation ticket;
5) The async reclaim task is not able to release space by any other means and decides to flush delalloc for the inode of the clone operation;
6) The workqueue job used to flush the inode blocks when starting delalloc for the inode, since the file range is currently locked by the clone task;
7) But the clone task is waiting on its reservation ticket and the async reclaim task is waiting on the flush job to complete, which can't progress since the clone task has the file range locked. So unless some other task is able to release space, for example an ordered extent for some other inode completes, we have a deadlock between all these tasks;
When this happens stack traces like the following show up in dmesg/syslog:
INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds. Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000 Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs] Call Trace: __schedule+0x5d1/0xcf0 schedule+0x45/0xe0 lock_extent_bits+0x1e6/0x2d0 [btrfs] ? finish_wait+0x90/0x90 btrfs_invalidatepage+0x32c/0x390 [btrfs] ? 
__mod_memcg_state+0x8e/0x160 __extent_writepage+0x2d4/0x400 [btrfs] extent_write_cache_pages+0x2b2/0x500 [btrfs] ? lock_release+0x20e/0x4c0 ? trace_hardirqs_on+0x1b/0xf0 extent_writepages+0x43/0x90 [btrfs] ? lock_acquire+0x1a3/0x490 do_writepages+0x43/0xe0 ? __filemap_fdatawrite_range+0xa4/0x100 __filemap_fdatawrite_range+0xc5/0x100 btrfs_run_delalloc_work+0x17/0x40 [btrfs] btrfs_work_helper+0xf1/0x600 [btrfs] process_one_work+0x24e/0x5e0 worker_thread+0x50/0x3b0 ? process_one_work+0x5e0/0x5e0 kthread+0x153/0x170 ? kthread_mod_delayed_work+0xc0/0xc0 ret_from_fork+0x22/0x30 INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds. Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000 Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs] Call Trace: __schedule+0x5d1/0xcf0 ? kvm_clock_read+0x14/0x30 ? wait_for_completion+0x81/0x110 schedule+0x45/0xe0 schedule_timeout+0x30c/0x580 ? _raw_spin_unlock_irqrestore+0x3c/0x60 ? lock_acquire+0x1a3/0x490 ? try_to_wake_up+0x7a/0xa20 ? lock_release+0x20e/0x4c0 ? lock_acquired+0x199/0x490 ? wait_for_completion+0x81/0x110 wait_for_completion+0xab/0x110 start_delalloc_inodes+0x2af/0x390 [btrfs] btrfs_start_delalloc_roots+0x12d/0x250 [btrfs] flush_space+0x24f/0x660 [btrfs] btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs] process_one_work+0x24e/0x5e0 worker_thread+0x20f/0x3b0 ? process_one_work+0x5e0/0x5e0 kthread+0x153/0x170 ? kthread_mod_delayed_work+0xc0/0xc0 ret_from_fork+0x22/0x30 (...) several other tasks blocked on inode locks held by the clone task below (...) RIP: 0033:0x7f61efe73fff Code: Unable to access opcode bytes at RIP 0x7f61efe73fd5. 
RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000202 ORIG_RAX: 000000000000013c RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73fff RDX: 00000000ffffff9c RSI: 0000560fbd604690 RDI: 00000000ffffff9c RBP: 00007ffc3371beb0 R08: 0000000000000002 R09: 0000560fbd5d75f0 R10: 0000560fbd5d81f0 R11: 0000000000000202 R12: 0000000000000002 R13: 000000000000000b R14: 00007ffc3371bea0 R15: 00007ffc3371beb0 task: fdm-stress state:D stack: 0 pid:2508234 ppid:2508153 flags:0x00004000 Call Trace: __schedule+0x5d1/0xcf0 ? _raw_spin_unlock_irqrestore+0x3c/0x60 schedule+0x45/0xe0 __reserve_bytes+0x4a4/0xb10 [btrfs] ? finish_wait+0x90/0x90 btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs] btrfs_block_rsv_add+0x1f/0x50 [btrfs] start_transaction+0x2d1/0x760 [btrfs] btrfs_replace_file_extents+0x120/0x930 [btrfs] ? lock_release+0x20e/0x4c0 btrfs_clone+0x3e4/0x7e0 [btrfs] ? btrfs_lookup_first_ordered_extent+0x8e/0x100 [btrfs] btrfs_clone_files+0xf6/0x150 [btrfs] btrfs_remap_file_range+0x324/0x3d0 [btrfs] do_clone_file_range+0xd4/0x1f0 vfs_clone_file_range+0x4d/0x230 ? lock_release+0x20e/0x4c0 ioctl_file_clone+0x8f/0xc0 do_vfs_ioctl+0x342/0x750 __x64_sys_ioctl+0x62/0xb0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 """
Fix both of these issues by excluding mmaps from happening while we are doing any sort of remap, which prevents this race completely. 
Reviewed-by: Filipe Manana Signed-off-by: Josef Bacik --- fs/btrfs/reflink.c | 30 ++++++++++++++++++++++++------ 1 file changed, 24 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c index 12df3ee84e93..bb87f4037cf6 100644 --- a/fs/btrfs/reflink.c +++ b/fs/btrfs/reflink.c @@ -590,6 +590,20 @@ static void btrfs_double_extent_lock(struct inode *inode1, u64 loff1, lock_extent(&BTRFS_I(inode2)->io_tree, loff2, loff2 + len - 1); } +static void btrfs_double_mmap_lock(struct inode *inode1, struct inode *inode2) +{ + if (inode1 < inode2) + swap(inode1, inode2); + down_write(&BTRFS_I(inode1)->i_mmap_lock); + down_write_nested(&BTRFS_I(inode2)->i_mmap_lock, SINGLE_DEPTH_NESTING); +} + +static void btrfs_double_mmap_unlock(struct inode *inode1, struct inode *inode2) +{ + up_write(&BTRFS_I(inode1)->i_mmap_lock); + up_write(&BTRFS_I(inode2)->i_mmap_lock); +} + static int btrfs_extent_same_range(struct inode *src, u64 loff, u64 len, struct inode *dst, u64 dst_loff) { @@ -818,10 +832,12 @@ loff_t btrfs_remap_file_range(struct file *src_file, loff_t off, if (remap_flags & ~(REMAP_FILE_DEDUP | REMAP_FILE_ADVISORY)) return -EINVAL; - if (same_inode) - btrfs_inode_lock(src_inode, 0); - else + if (same_inode) { + btrfs_inode_lock(src_inode, BTRFS_ILOCK_MMAP); + } else { lock_two_nondirectories(src_inode, dst_inode); + btrfs_double_mmap_lock(src_inode, dst_inode); + } ret = btrfs_remap_file_range_prep(src_file, off, dst_file, destoff, &len, remap_flags); @@ -834,10 +850,12 @@ loff_t btrfs_remap_file_range(struct file *src_file, loff_t off, ret = btrfs_clone_files(dst_file, src_file, off, len, destoff); out_unlock: - if (same_inode) - btrfs_inode_unlock(src_inode, 0); - else + if (same_inode) { + btrfs_inode_unlock(src_inode, BTRFS_ILOCK_MMAP); + } else { + btrfs_double_mmap_unlock(src_inode, dst_inode); unlock_two_nondirectories(src_inode, dst_inode); + } return ret < 0 ? 
ret : len; }
From patchwork Wed Feb 10 22:14:36 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Josef Bacik X-Patchwork-Id: 12081861 From: Josef Bacik To: linux-btrfs@vger.kernel.org, kernel-team@fb.com Cc: Filipe Manana Subject: [PATCH v2 4/4] btrfs: exclude mmap from happening during all fallocate operations Date: Wed, 10 Feb 2021 17:14:36 -0500 Message-Id: <74e3efcc0f2a011b0caed71a0480c24e8739586d.1612995212.git.josef@toxicpanda.com> X-Mailer: git-send-email 2.26.2 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org
There's a small window where a deadlock can happen between fallocate and mmap. 
This is described in detail by Filipe:

"""
When doing a fallocate operation we lock the inode, flush delalloc within
the target range, wait for any ordered extents to complete and then lock
the file range. Before we lock the range and after we flush delalloc, there
is a time window where another task can come in and do a memory mapped write
for a page within the fallocate range.

This means that after fallocate locks the range, there can be a dirty page
in the range. More often than not, this does not cause any problem.
The exception is when we are low on available metadata space, because a
fallocate operation needs to start a transaction while holding the file
range locked, either through btrfs_prealloc_file_range() or through the
call to btrfs_fallocate_update_isize(). If that's the case, we can end up
in a deadlock. The following list of steps explains how that happens:

1) A fallocate operation starts, locks the inode, flushes delalloc in the
   range and waits for ordered extents in the range to complete;

2) Before the fallocate task locks the file range, another task does a
   memory mapped write for a page in the fallocate target range. This is
   possible since memory mapped writes do not (and cannot) lock the
   inode;

3) The fallocate task locks the file range. At this point there is one
   dirty page in the range (due to the memory mapped write);

4) When the fallocate task attempts to start a transaction, it blocks when
   attempting to reserve metadata space, since we are low on available
   metadata space. Before blocking (waiting on its reservation ticket), it
   starts the async reclaim task (if not running already);

5) The async reclaim task is not able to release space through any other
   means, so it decides to flush delalloc for inodes with dirty pages.
   It finds that the inode used in the fallocate operation has a dirty
   page and therefore queues a job (fs_info->flush_workers workqueue) to
   flush delalloc for that inode and waits on that job to complete;

6) The flush job blocks when attempting to lock the file range because
   it is currently locked by the fallocate task;

7) The fallocate task keeps waiting for its metadata reservation, waiting
   for a wakeup on its reservation ticket. The async reclaim task is
   waiting on the flush job, which in turn is waiting to lock the file
   range that is currently locked by the fallocate task. So unless some
   other task is able to release enough metadata space, for example an
   ordered extent for some other inode completes, we end up in a deadlock
   between all these tasks.

When this happens stack traces like the following show up in dmesg/syslog:

  INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
        Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000
  Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
  Call Trace:
   __schedule+0x5d1/0xcf0
   schedule+0x45/0xe0
   lock_extent_bits+0x1e6/0x2d0 [btrfs]
   ? finish_wait+0x90/0x90
   btrfs_invalidatepage+0x32c/0x390 [btrfs]
   ? __mod_memcg_state+0x8e/0x160
   __extent_writepage+0x2d4/0x400 [btrfs]
   extent_write_cache_pages+0x2b2/0x500 [btrfs]
   ? lock_release+0x20e/0x4c0
   ? trace_hardirqs_on+0x1b/0xf0
   extent_writepages+0x43/0x90 [btrfs]
   ? lock_acquire+0x1a3/0x490
   do_writepages+0x43/0xe0
   ? __filemap_fdatawrite_range+0xa4/0x100
   __filemap_fdatawrite_range+0xc5/0x100
   btrfs_run_delalloc_work+0x17/0x40 [btrfs]
   btrfs_work_helper+0xf1/0x600 [btrfs]
   process_one_work+0x24e/0x5e0
   worker_thread+0x50/0x3b0
   ? process_one_work+0x5e0/0x5e0
   kthread+0x153/0x170
   ? kthread_mod_delayed_work+0xc0/0xc0
   ret_from_fork+0x22/0x30
  INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
        Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000
  Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
  Call Trace:
   __schedule+0x5d1/0xcf0
   ? kvm_clock_read+0x14/0x30
   ? wait_for_completion+0x81/0x110
   schedule+0x45/0xe0
   schedule_timeout+0x30c/0x580
   ? _raw_spin_unlock_irqrestore+0x3c/0x60
   ? lock_acquire+0x1a3/0x490
   ? try_to_wake_up+0x7a/0xa20
   ? lock_release+0x20e/0x4c0
   ? lock_acquired+0x199/0x490
   ? wait_for_completion+0x81/0x110
   wait_for_completion+0xab/0x110
   start_delalloc_inodes+0x2af/0x390 [btrfs]
   btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
   flush_space+0x24f/0x660 [btrfs]
   btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
   process_one_work+0x24e/0x5e0
   worker_thread+0x20f/0x3b0
   ? process_one_work+0x5e0/0x5e0
   kthread+0x153/0x170
   ? kthread_mod_delayed_work+0xc0/0xc0
   ret_from_fork+0x22/0x30

  (...)
  several tasks waiting for the inode lock held by the fallocate task below
  (...)

  RIP: 0033:0x7f61efe73fff
  Code: Unable to access opcode bytes at RIP 0x7f61efe73fd5.
  RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000202 ORIG_RAX: 000000000000013c
  RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73fff
  RDX: 00000000ffffff9c RSI: 0000560fbd5d90a0 RDI: 00000000ffffff9c
  RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003
  R10: 0000560fbd5d7ad0 R11: 0000000000000202 R12: 0000000000000001
  R13: 000000000000005e R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
  task:fdm-stress state:D stack: 0 pid:2508243 ppid:2508153 flags:0x00000000
  Call Trace:
   __schedule+0x5d1/0xcf0
   ? _raw_spin_unlock_irqrestore+0x3c/0x60
   schedule+0x45/0xe0
   __reserve_bytes+0x4a4/0xb10 [btrfs]
   ? finish_wait+0x90/0x90
   btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
   btrfs_block_rsv_add+0x1f/0x50 [btrfs]
   start_transaction+0x2d1/0x760 [btrfs]
   btrfs_replace_file_extents+0x120/0x930 [btrfs]
   ? btrfs_fallocate+0xdcf/0x1260 [btrfs]
   btrfs_fallocate+0xdfb/0x1260 [btrfs]
   ? filename_lookup+0xf1/0x180
   vfs_fallocate+0x14f/0x440
   ioctl_preallocate+0x92/0xc0
   do_vfs_ioctl+0x66b/0x750
   ? __do_sys_newfstat+0x53/0x60
   __x64_sys_ioctl+0x62/0xb0
   do_syscall_64+0x33/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
"""

Fix this by disallowing mmaps from happening while we're doing any of the
fallocate operations on this inode.

Reviewed-by: Filipe Manana
Signed-off-by: Josef Bacik
---
 fs/btrfs/file.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 728736e3d4b8..3a2928749349 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2868,7 +2868,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 	if (ret)
 		return ret;
 
-	btrfs_inode_lock(inode, 0);
+	btrfs_inode_lock(inode, BTRFS_ILOCK_MMAP);
 	ino_size = round_up(inode->i_size, fs_info->sectorsize);
 	ret = find_first_non_hole(BTRFS_I(inode), &offset, &len);
 	if (ret < 0)
@@ -2908,7 +2908,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 		truncated_block = true;
 		ret = btrfs_truncate_block(BTRFS_I(inode), offset, 0, 0);
 		if (ret) {
-			btrfs_inode_unlock(inode, 0);
+			btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP);
 			return ret;
 		}
 	}
@@ -3009,7 +3009,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 				ret = ret2;
 		}
 	}
-	btrfs_inode_unlock(inode, 0);
+	btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP);
 	return ret;
 }
 
@@ -3332,7 +3332,7 @@ static long btrfs_fallocate(struct file *file, int mode,
 			return ret;
 	}
 
-	btrfs_inode_lock(inode, 0);
+	btrfs_inode_lock(inode, BTRFS_ILOCK_MMAP);
 
 	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) {
 		ret = inode_newsize_ok(inode, offset + len);
@@ -3374,7 +3374,7 @@ static long btrfs_fallocate(struct file *file, int mode,
 
 	if (mode & FALLOC_FL_ZERO_RANGE) {
 		ret = btrfs_zero_range(inode, offset, len, mode);
-		btrfs_inode_unlock(inode, 0);
+		btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP);
 		return ret;
 	}
 
@@ -3484,7 +3484,7 @@ static long btrfs_fallocate(struct file *file, int mode,
 	unlock_extent_cached(&BTRFS_I(inode)->io_tree, alloc_start, locked_end,
 			     &cached_state);
 out:
-	btrfs_inode_unlock(inode, 0);
+	btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP);
 	/* Let go of our reservation. */
 	if (ret != 0 && !(mode & FALLOC_FL_ZERO_RANGE))
 		btrfs_free_reserved_data_space(BTRFS_I(inode), data_reserved,