From patchwork Mon Dec 14 18:19:38 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Josef Bacik X-Patchwork-Id: 11972653 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 960D3C4361B for ; Mon, 14 Dec 2020 18:22:42 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 3B01A20578 for ; Mon, 14 Dec 2020 18:22:42 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2440728AbgLNSUa (ORCPT ); Mon, 14 Dec 2020 13:20:30 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52534 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2440718AbgLNSU0 (ORCPT ); Mon, 14 Dec 2020 13:20:26 -0500 Received: from mail-qt1-x841.google.com (mail-qt1-x841.google.com [IPv6:2607:f8b0:4864:20::841]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 84964C0613D6 for ; Mon, 14 Dec 2020 10:19:46 -0800 (PST) Received: by mail-qt1-x841.google.com with SMTP id c14so12566902qtn.0 for ; Mon, 14 Dec 2020 10:19:46 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=toxicpanda-com.20150623.gappssmtp.com; s=20150623; h=from:to:subject:date:message-id:in-reply-to:references:mime-version :content-transfer-encoding; bh=wdcl2xtzih8VI306oUh2SKwkge0xZ3P7J5fl8vyUtg0=; b=v8UskHRW9OMgP6TmR2CiH+wYWaK0JyNt2PdGo7cJlaVXzOGT4q55wMXmIbuUsk9gsg SSV9oMLBNr5S7ZMnTqENCD9TH3qL0xy4qZDL+uoy622sEWocVV8B1WGpd1EyFRslGZhS sSvNEBEbRLvOAx6X2HpJblFPYsQx3zkw7KtHFdQFLSmS85xYHOFO0sGG7VRa1MzKf3xC NNpZlbq24SfvKIhS1NyRfcJQ/UT4YR4ASq/Uugj4gvx1jbNKyr6gXttkI19bcMcJg39y n/kEyInvBCMrgKYEPLztUlNzNwakMSGkQiioVAcNCeKlAJLkLklmNOS24R6yNQWUG968 lE4g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=wdcl2xtzih8VI306oUh2SKwkge0xZ3P7J5fl8vyUtg0=; b=YsA8gHZyGHx7c/dmMTel47WhGn0z04v6KiIIWIoRHbqPHZteL1k7+agPF5iGkuUrK2 wCQrSn8ghdLB0c+A2ToZGP80q/lYbHsWm378ebC8J1rENhlDuYaehNpokZOnZSrwrZPr KlI3h1AGRjxbse7wa5v2rZr3k4NcM5qDLcN1Xdyn73tGiRg65z141RjlvTgrSovdR5gS SMB0Z9oNpmQZlrsnTQDh1fU29q9KgbC4Lc2IN+Jbsxt30mUTJNRooCloDLGhGV6V9BxX ewdyPlPWXDoyg1NqpIAV1ipB1xcA4xUUpMrLaajTJAQbOAdaFEIUH2+w3rmVyBIo7ZJU lzKA== X-Gm-Message-State: AOAM5315oX8xaUW/hsHupNHiU4h0eo9EJU9KUPQjpVXk1QmdfIBgDRrR SS3+gl6BtV2G2vEj9ygxyIaFHOIWUQRSoZgS X-Google-Smtp-Source: ABdhPJyzz9VyeBpXimytqfzuBFyNm7TcYt+hIpTd7ZEyxmajfKpHOdOJqb9GqHBcOXA2CiEZSM2pjQ== X-Received: by 2002:a05:622a:182:: with SMTP id s2mr31650704qtw.147.1607969985349; Mon, 14 Dec 2020 10:19:45 -0800 (PST) Received: from localhost (cpe-174-109-172-136.nc.res.rr.com. [174.109.172.136]) by smtp.gmail.com with ESMTPSA id w7sm15271756qkd.92.2020.12.14.10.19.44 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 14 Dec 2020 10:19:44 -0800 (PST) From: Josef Bacik To: linux-btrfs@vger.kernel.org, kernel-team@fb.com Subject: [PATCH 1/4] btrfs: add a i_mmap_lock to our inode Date: Mon, 14 Dec 2020 13:19:38 -0500 Message-Id: <7322b31fb9e7a2c27b9fbca528904c163997ffa9.1607969636.git.josef@toxicpanda.com> X-Mailer: git-send-email 2.26.2 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org We need to be able to exclude page_mkwrite from happening concurrently with certain operations. To facilitate this, add a i_mmap_lock to our inode, down_read() it in our mkwrite, and add a new ILOCK flag to indicate that we want to take the i_mmap_lock as well. I used pahole to check the size of the btrfs_inode, the sizes are as follows no lockdep: before: 1120 (3 per 4k page) after: 1160 (3 per 4k page) lockdep: before: 2072 (1 per 4k page) after: 2224 (1 per 4k page) We're slightly larger but it doesn't change how many objects we can fit per page. Signed-off-by: Josef Bacik --- fs/btrfs/btrfs_inode.h | 1 + fs/btrfs/ctree.h | 1 + fs/btrfs/inode.c | 10 ++++++++++ 3 files changed, 12 insertions(+) diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index d9bf53d9ff90..66cec5c13396 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -220,6 +220,7 @@ struct btrfs_inode { /* Hook into fs_info->delayed_iputs */ struct list_head delayed_iput; + struct rw_semaphore i_mmap_lock; struct inode vfs_inode; }; diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 3935d297d198..e2bebb3318cc 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3147,6 +3147,7 @@ extern const struct iomap_dio_ops btrfs_dio_ops; /* Inode locking type flags, by default the exclusive lock is taken */ #define BTRFS_ILOCK_SHARED (1U << 0) #define BTRFS_ILOCK_TRY (1U << 1) +#define BTRFS_ILOCK_MMAP (1U << 2) int btrfs_inode_lock(struct inode *inode, unsigned int ilock_flags); void btrfs_inode_unlock(struct inode *inode, unsigned int ilock_flags); diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 070716650df8..a5ee4d45ee02 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -101,6 +101,7 @@ static void __endio_write_update_ordered(struct btrfs_inode *inode, * BTRFS_ILOCK_SHARED - acquire a shared lock on the inode * BTRFS_ILOCK_TRY - try to acquire the lock, if fails on first attempt * return -EAGAIN + * BTRFS_ILOCK_MMAP - acquire a write lock on the i_mmap_lock */ int btrfs_inode_lock(struct inode *inode, unsigned int ilock_flags) { @@ -121,6 +122,8 @@ int btrfs_inode_lock(struct inode *inode, unsigned int ilock_flags) } inode_lock(inode); } + if (ilock_flags & BTRFS_ILOCK_MMAP) + down_write(&BTRFS_I(inode)->i_mmap_lock); return 0; } @@ -132,6 +135,8 @@ int btrfs_inode_lock(struct inode *inode, unsigned int ilock_flags) */ void btrfs_inode_unlock(struct inode *inode, unsigned int ilock_flags) { + if (ilock_flags & BTRFS_ILOCK_MMAP) + up_write(&BTRFS_I(inode)->i_mmap_lock); if (ilock_flags & BTRFS_ILOCK_SHARED) inode_unlock_shared(inode); else @@ -8344,6 +8349,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf) ret = VM_FAULT_NOPAGE; /* make the VM retry the fault */ again: + down_read(&BTRFS_I(inode)->i_mmap_lock); lock_page(page); size = i_size_read(inode); @@ -8367,6 +8373,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf) unlock_extent_cached(io_tree, page_start, page_end, &cached_state); unlock_page(page); + up_read(&BTRFS_I(inode)->i_mmap_lock); btrfs_start_ordered_extent(ordered, 1); btrfs_put_ordered_extent(ordered); goto again; @@ -8424,6 +8431,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf) BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit; unlock_extent_cached(io_tree, page_start, page_end, &cached_state); + up_read(&BTRFS_I(inode)->i_mmap_lock); btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE); sb_end_pagefault(inode->i_sb); @@ -8432,6 +8440,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf) out_unlock: unlock_page(page); + up_read(&BTRFS_I(inode)->i_mmap_lock); out: btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE); btrfs_delalloc_release_space(BTRFS_I(inode), data_reserved, page_start, @@ -8680,6 +8689,7 @@ struct inode *btrfs_alloc_inode(struct super_block *sb) INIT_LIST_HEAD(&ei->delalloc_inodes); INIT_LIST_HEAD(&ei->delayed_iput); RB_CLEAR_NODE(&ei->rb_node); + init_rwsem(&ei->i_mmap_lock); return inode; } From patchwork Mon Dec 14 18:19:39 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Josef Bacik X-Patchwork-Id: 11972647 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3707BC2BB40 for ; Mon, 14 Dec 2020 18:20:57 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 04EF9224D1 for ; Mon, 14 Dec 2020 18:20:56 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2440747AbgLNSUl (ORCPT ); Mon, 14 Dec 2020 13:20:41 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52540 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2440721AbgLNSU2 (ORCPT ); Mon, 14 Dec 2020 13:20:28 -0500 Received: from mail-qk1-x743.google.com (mail-qk1-x743.google.com [IPv6:2607:f8b0:4864:20::743]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 72DF6C061793 for ; Mon, 14 Dec 2020 10:19:48 -0800 (PST) Received: by mail-qk1-x743.google.com with SMTP id i67so8968426qkf.11 for ; Mon, 14 Dec 2020 10:19:48 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=toxicpanda-com.20150623.gappssmtp.com; s=20150623; h=from:to:subject:date:message-id:in-reply-to:references:mime-version :content-transfer-encoding; bh=X5143zDzE8AvLarE6SdUPx/9IUh1GYaqX7I2YiVbxbo=; b=hwkvuzpHephdRTm+cW+M7jGc8n0D16+blhWpT1o/Q0HkBq3nxMlFOa5A2U8pIBB2o6 ZvRMaxmmTBkynsnMx6EqKjH6cLm33tu12HPGGUkyJ7Hl6NMnfAn/2T2Fan6flFZDZLVb VfdNXE8xJw78S3cfewOSmQjifcivvIGWadJ1KmFfAWdMXiND1B2LJCkrhxF3Xil1Fng8 KFDwtaT+y81exRy3rCddgWAUZA4NZO55wb/JOUHH48aplx1N5iDMFI1ZRlLWPAszFEFC IpZq8/roVJvMQ9cN0vxfsl6E+ARFzKM3dsK3CRJszkP9jBGfcMRO1H7Js6yZGP+/vpw+ zBhA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=X5143zDzE8AvLarE6SdUPx/9IUh1GYaqX7I2YiVbxbo=; b=EQ1Q45M28BJqJnL6srteZ82HyQSZnpKgVNy40tTJp7y+gr+pIfztewT26fGE7hsQ8v Re3WhAbldDmM8QxN+Lpc6Fuarg9GdN3YCifIyEqGOH6aYrucXuuuxqNhnw2Yh8Up1x67 W74FEqI+dYz+SulVdWKk6y9OP7qOdRudnn3f0E2VTzFd0vaEOXvMk1JUrBAgvqqudj4T dm4dlmphb4EwEygmuswIFmPDtz+0kt+ZHGoZnJSUfsoJwGuNQOt+ohIWIB+s8CgTwL6p XTW/Biv2XZXOi6zMlvm5qtGHnvpGtbwZ780pHkgm+6hqpZsYOJ96R+FLqGj8G9WWlziZ nZjg== X-Gm-Message-State: AOAM533pW6BaBQ6zswa1B2mVLY9zAzA+zl9PRyY1f7Ms5tyED09UiJ1P MJd7cDjVmMq9CmHnMRffBT7eTLzN+E4XAAim X-Google-Smtp-Source: ABdhPJyGhBcTJ2a+uQijXZCge8ETHZaVDnJQvH3F6nA1kYopLDa/9gXmXGnZ8Z2YL27SSbEFSH7Ixg== X-Received: by 2002:a37:a085:: with SMTP id j127mr33016368qke.273.1607969987288; Mon, 14 Dec 2020 10:19:47 -0800 (PST) Received: from localhost (cpe-174-109-172-136.nc.res.rr.com. [174.109.172.136]) by smtp.gmail.com with ESMTPSA id k32sm14800729qte.59.2020.12.14.10.19.46 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 14 Dec 2020 10:19:46 -0800 (PST) From: Josef Bacik To: linux-btrfs@vger.kernel.org, kernel-team@fb.com Subject: [PATCH 2/4] btrfs: cleanup inode_lock/inode_unlock uses Date: Mon, 14 Dec 2020 13:19:39 -0500 Message-Id: <11f7daa4cdb14f05e53b6ba49cb837307e770f66.1607969636.git.josef@toxicpanda.com> X-Mailer: git-send-email 2.26.2 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org A few places we intermix btrfs_inode_lock with a inode_unlock, and some places we just use inode_lock/inode_unlock instead of btrfs_inode_lock. None of these places are using this incorrectly, but as we adjust some of these callers it would be nice to keep everything consistent, so convert everybody to use btrfs_inode_lock/btrfs_inode_unlock. Signed-off-by: Josef Bacik --- fs/btrfs/delayed-inode.c | 4 ++-- fs/btrfs/file.c | 18 +++++++++--------- fs/btrfs/ioctl.c | 26 +++++++++++++------------- fs/btrfs/reflink.c | 4 ++-- fs/btrfs/relocation.c | 4 ++-- 5 files changed, 28 insertions(+), 28 deletions(-) diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c index 70c0340d839c..7532229e7efb 100644 --- a/fs/btrfs/delayed-inode.c +++ b/fs/btrfs/delayed-inode.c @@ -1588,8 +1588,8 @@ bool btrfs_readdir_get_delayed_items(struct inode *inode, * We can only do one readdir with delayed items at a time because of * item->readdir_list. */ - inode_unlock_shared(inode); - inode_lock(inode); + btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED); + btrfs_inode_lock(inode, 0); mutex_lock(&delayed_node->mutex); item = __btrfs_first_delayed_insertion_item(delayed_node); diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 0e41459b8de6..ed633ae87e06 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -2131,7 +2131,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) if (ret) goto out; - inode_lock(inode); + btrfs_inode_lock(inode, 0); atomic_inc(&root->log_batch); @@ -2163,7 +2163,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) */ ret = start_ordered_ops(inode, start, end); if (ret) { - inode_unlock(inode); + btrfs_inode_unlock(inode, 0); goto out; } @@ -2259,7 +2259,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) * file again, but that will end up using the synchronization * inside btrfs_sync_log to keep things safe. */ - inode_unlock(inode); + btrfs_inode_unlock(inode, 0); if (ret != BTRFS_NO_LOG_SYNC) { if (!ret) { @@ -2289,7 +2289,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) out_release_extents: btrfs_release_log_ctx_extents(&ctx); - inode_unlock(inode); + btrfs_inode_unlock(inode, 0); goto out; } @@ -2872,7 +2872,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len) if (ret) return ret; - inode_lock(inode); + btrfs_inode_lock(inode, 0); ino_size = round_up(inode->i_size, fs_info->sectorsize); ret = find_first_non_hole(BTRFS_I(inode), &offset, &len); if (ret < 0) @@ -2912,7 +2912,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len) truncated_block = true; ret = btrfs_truncate_block(BTRFS_I(inode), offset, 0, 0); if (ret) { - inode_unlock(inode); + btrfs_inode_unlock(inode, 0); return ret; } } @@ -3013,7 +3013,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len) ret = ret2; } } - inode_unlock(inode); + btrfs_inode_unlock(inode, 0); return ret; } @@ -3378,7 +3378,7 @@ static long btrfs_fallocate(struct file *file, int mode, if (mode & FALLOC_FL_ZERO_RANGE) { ret = btrfs_zero_range(inode, offset, len, mode); - inode_unlock(inode); + btrfs_inode_unlock(inode, 0); return ret; } @@ -3488,7 +3488,7 @@ static long btrfs_fallocate(struct file *file, int mode, unlock_extent_cached(&BTRFS_I(inode)->io_tree, alloc_start, locked_end, &cached_state); out: - inode_unlock(inode); + btrfs_inode_unlock(inode, 0); /* Let go of our reservation. */ if (ret != 0 && !(mode & FALLOC_FL_ZERO_RANGE)) btrfs_free_reserved_data_space(BTRFS_I(inode), data_reserved, diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index dde49a791f3e..becb5d9ae3ce 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -226,7 +226,7 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg) if (ret) return ret; - inode_lock(inode); + btrfs_inode_lock(inode, 0); fsflags = btrfs_mask_fsflags_for_type(inode, fsflags); old_fsflags = btrfs_inode_flags_to_fsflags(binode->flags); @@ -353,7 +353,7 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg) out_end_trans: btrfs_end_transaction(trans); out_unlock: - inode_unlock(inode); + btrfs_inode_unlock(inode, 0); mnt_drop_write_file(file); return ret; } @@ -449,7 +449,7 @@ static int btrfs_ioctl_fssetxattr(struct file *file, void __user *arg) if (ret) return ret; - inode_lock(inode); + btrfs_inode_lock(inode, 0); old_flags = binode->flags; old_i_flags = inode->i_flags; @@ -501,7 +501,7 @@ static int btrfs_ioctl_fssetxattr(struct file *file, void __user *arg) inode->i_flags = old_i_flags; } - inode_unlock(inode); + btrfs_inode_unlock(inode, 0); mnt_drop_write_file(file); return ret; @@ -1010,7 +1010,7 @@ static noinline int btrfs_mksubvol(const struct path *parent, out_dput: dput(dentry); out_unlock: - inode_unlock(dir); + btrfs_inode_unlock(dir, 0); return error; } @@ -1602,7 +1602,7 @@ int btrfs_defrag_file(struct inode *inode, struct file *file, ra_index += cluster; } - inode_lock(inode); + btrfs_inode_lock(inode, 0); if (IS_SWAPFILE(inode)) { ret = -ETXTBSY; } else { @@ -1611,13 +1611,13 @@ int btrfs_defrag_file(struct inode *inode, struct file *file, ret = cluster_pages_for_defrag(inode, pages, i, cluster); } if (ret < 0) { - inode_unlock(inode); + btrfs_inode_unlock(inode, 0); goto out_ra; } defrag_count += ret; balance_dirty_pages_ratelimited(inode->i_mapping); - inode_unlock(inode); + btrfs_inode_unlock(inode, 0); if (newer_than) { if (newer_off == (u64)-1) @@ -1665,9 +1665,9 @@ int btrfs_defrag_file(struct inode *inode, struct file *file, out_ra: if (do_compress) { - inode_lock(inode); + btrfs_inode_lock(inode, 0); BTRFS_I(inode)->defrag_compress = BTRFS_COMPRESS_NONE; - inode_unlock(inode); + btrfs_inode_unlock(inode, 0); } if (!file) kfree(ra); @@ -3083,9 +3083,9 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file, goto out_dput; } - inode_lock(inode); + btrfs_inode_lock(inode, 0); err = btrfs_delete_subvolume(dir, dentry); - inode_unlock(inode); + btrfs_inode_unlock(inode, 0); if (!err) { fsnotify_rmdir(dir, dentry); d_delete(dentry); @@ -3094,7 +3094,7 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file, out_dput: dput(dentry); out_unlock_dir: - inode_unlock(dir); + btrfs_inode_unlock(dir, 0); free_subvol_name: kfree(subvol_name_ptr); free_parent: diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c index b03e7891394e..97f58b15eba2 100644 --- a/fs/btrfs/reflink.c +++ b/fs/btrfs/reflink.c @@ -816,7 +816,7 @@ loff_t btrfs_remap_file_range(struct file *src_file, loff_t off, return -EINVAL; if (same_inode) - inode_lock(src_inode); + btrfs_inode_lock(src_inode, 0); else lock_two_nondirectories(src_inode, dst_inode); @@ -832,7 +832,7 @@ loff_t btrfs_remap_file_range(struct file *src_file, loff_t off, out_unlock: if (same_inode) - inode_unlock(src_inode); + btrfs_inode_unlock(src_inode, 0); else unlock_two_nondirectories(src_inode, dst_inode); diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c index 19b7db8b2117..3fa4f7ba8da3 100644 --- a/fs/btrfs/relocation.c +++ b/fs/btrfs/relocation.c @@ -2553,7 +2553,7 @@ static noinline_for_stack int prealloc_file_extent_cluster( if (ret) return ret; - inode_lock(&inode->vfs_inode); + btrfs_inode_lock(&inode->vfs_inode, 0); for (nr = 0; nr < cluster->nr; nr++) { start = cluster->boundary[nr] - offset; if (nr + 1 < cluster->nr) @@ -2571,7 +2571,7 @@ static noinline_for_stack int prealloc_file_extent_cluster( if (ret) break; } - inode_unlock(&inode->vfs_inode); + btrfs_inode_unlock(&inode->vfs_inode, 0); if (cur_offset < prealloc_end) btrfs_free_reserved_data_space_noquota(inode->root->fs_info, From patchwork Mon Dec 14 18:19:40 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Josef Bacik X-Patchwork-Id: 11972649 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id F3E87C4361B for ; Mon, 14 Dec 2020 18:21:04 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id A97F720578 for ; Mon, 14 Dec 2020 18:21:04 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2440750AbgLNSUx (ORCPT ); Mon, 14 Dec 2020 13:20:53 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52548 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2440726AbgLNSUa (ORCPT ); Mon, 14 Dec 2020 13:20:30 -0500 Received: from mail-qt1-x843.google.com (mail-qt1-x843.google.com [IPv6:2607:f8b0:4864:20::843]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 53290C061794 for ; Mon, 14 Dec 2020 10:19:50 -0800 (PST) Received: by mail-qt1-x843.google.com with SMTP id y15so12544794qtv.5 for ; Mon, 14 Dec 2020 10:19:50 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=toxicpanda-com.20150623.gappssmtp.com; s=20150623; h=from:to:subject:date:message-id:in-reply-to:references:mime-version :content-transfer-encoding; bh=qQN/DyTHwM/g9N7fwg9qb6OrZHRuzbYiwmxkaZjwCZM=; b=R63kFYiGXd8sANXacwgQ/f4XKAZYO6OR3DTogE7k1+hfNlQAwt8Ynl67HNewE/Bgyh RURostM8hqo5lYEUbgKMcHF0x7xsJeGRGVWsY6N/LJm/DgaFczWP1TQ/Ux8d/LOrx9hX SMAjYZSoVgedyqblFaC6dMAXzg9gc/LDAphDr+WHLVNF3ic6P292V9nI9AAd2R3atIPT QED/cAS5046MtjfavEj/ALgl1RCya6yrX9aC2IdvZXdWJeO6otCiyuG8Y8LeXnZn2hp+ 07B62WPBF+eOAKD5coms9lpQWsHEb2EsfZBPwNki7pimiZz4y6ujXwP9RHbxBPnpi79A /uXg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=qQN/DyTHwM/g9N7fwg9qb6OrZHRuzbYiwmxkaZjwCZM=; b=T1aMsSKvI/yZZThKcM+uWwfAwuWnMe5W4HNUXl3TwmmRUb1EFbjSMSuYiJZlznYR9Q Jev8/pWqqXxm6QI9DfFf3ceV4Awkt0fuXd48NqQLPkMHiA5pP3Vf0lW7Npu0Vl+ELa4O i4PMPQgt9QhIIDkakb+nUclRXbn67M2UGqUBlo7VC0uRTAXBHDXi2csXHhcJIa8nHX// wRUxjrlqVXGBBAUVGmXIfDwPeEwSRj8JiFQAjYkPxjImXC27MYIDfYcXggaaa51y+qQK zxu6oZxdYuBCdI2JPnr5rIxo8CPv1JB319q2DmHlvadqeTdIpxWZ3kX8yFH1gD29xJAa HdJg== X-Gm-Message-State: AOAM533TkqkxtH5B+50Ku8bqmZC6AEtZDmOIlzuIBqT2FTNTc6qIzDkb mAHykr707cqmIv6tYLzvmR9KprS+0uxJSt/+ X-Google-Smtp-Source: ABdhPJw661JeHSxdr1mzOxpvUCxCzrhLNpa8fUU9JvPZsMLXlSgR+NrsUZl1Z6h3j4GxsuvYaHh7iQ== X-Received: by 2002:ac8:4a17:: with SMTP id x23mr34483454qtq.138.1607969988956; Mon, 14 Dec 2020 10:19:48 -0800 (PST) Received: from localhost (cpe-174-109-172-136.nc.res.rr.com. [174.109.172.136]) by smtp.gmail.com with ESMTPSA id i11sm15908403qkg.45.2020.12.14.10.19.48 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 14 Dec 2020 10:19:48 -0800 (PST) From: Josef Bacik To: linux-btrfs@vger.kernel.org, kernel-team@fb.com Subject: [PATCH 3/4] btrfs: exclude mmaps while doing remap Date: Mon, 14 Dec 2020 13:19:40 -0500 Message-Id: X-Mailer: git-send-email 2.26.2 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org Darrick reported a potential issue to me where we could allow mmap writes after validating a page range matched in the case of dedupe. Generally we rely on lock page -> lock extent with the ordered flush to protect us, but this is done after we check the pages because we use the generic helpers, so we could modify the page in between doing the check and locking the range. There also exists a deadlock, as described by Filipe """ When cloning a file range, we lock the inodes, flush any delalloc within the respective file ranges, wait for any ordered extents and then lock the file ranges in both inodes. This means that right after we flush delalloc and before we lock the file ranges, memory mapped writes can come in and dirty pages in the file ranges of the clone operation. Most of the time this is harmless and causes no problems. However, if we are low on available metadata space, we can later end up in a deadlock when starting a transaction to replace file extent items. This happens if when allocating metadata space for the transaction, we need to wait for the async reclaim thread to release space and the reclaim thread needs to flush delalloc for the inode that got the memory mapped write and has its range locked by the clone task. Basically what happens is the following: 1) A clone operation locks inodes A and B, flushes delalloc for both inodes in the respective file ranges and waits for any ordered extents in those ranges to complete; 2) Before the clone task locks the file ranges, another task does a memory mapped write (which does not lock the inode) for one of the inodes of the clone operation. So now we have a dirty page in one of the ranges used by the clone operation; 3) The clone operation locks the file ranges for inodes A and B; 4) Later, when iterating over the file extents of inode A, the clone task attempts to start a transaction. There's not enough available free metadata space, so the async reclaim task is started (if not running already) and we wait for someone to wake us up on our reservation ticket; 5) The async reclaim task is not able to release space by any other means and decides to flush delalloc for the inode of the clone operation; 6) The workqueue job used to flush the inode blocks when starting delalloc for the inode, since the file range is currently locked by the clone task; 7) But the clone task is waiting on its reservation ticket and the async reclaim task is waiting on the flush job to complete, which can't progress since the clone task has the file range locked. So unless some other task is able to release space, for example an ordered extent for some other inode completes, we have a deadlock between all these tasks; When this happens stack traces like the following showup in dmesg/syslog: INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds. Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000 Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs] Call Trace: __schedule+0x5d1/0xcf0 schedule+0x45/0xe0 lock_extent_bits+0x1e6/0x2d0 [btrfs] ? finish_wait+0x90/0x90 btrfs_invalidatepage+0x32c/0x390 [btrfs] ? __mod_memcg_state+0x8e/0x160 __extent_writepage+0x2d4/0x400 [btrfs] extent_write_cache_pages+0x2b2/0x500 [btrfs] ? lock_release+0x20e/0x4c0 ? trace_hardirqs_on+0x1b/0xf0 extent_writepages+0x43/0x90 [btrfs] ? lock_acquire+0x1a3/0x490 do_writepages+0x43/0xe0 ? __filemap_fdatawrite_range+0xa4/0x100 __filemap_fdatawrite_range+0xc5/0x100 btrfs_run_delalloc_work+0x17/0x40 [btrfs] btrfs_work_helper+0xf1/0x600 [btrfs] process_one_work+0x24e/0x5e0 worker_thread+0x50/0x3b0 ? process_one_work+0x5e0/0x5e0 kthread+0x153/0x170 ? kthread_mod_delayed_work+0xc0/0xc0 ret_from_fork+0x22/0x30 INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds. Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000 Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs] Call Trace: __schedule+0x5d1/0xcf0 ? kvm_clock_read+0x14/0x30 ? wait_for_completion+0x81/0x110 schedule+0x45/0xe0 schedule_timeout+0x30c/0x580 ? _raw_spin_unlock_irqrestore+0x3c/0x60 ? lock_acquire+0x1a3/0x490 ? try_to_wake_up+0x7a/0xa20 ? lock_release+0x20e/0x4c0 ? lock_acquired+0x199/0x490 ? wait_for_completion+0x81/0x110 wait_for_completion+0xab/0x110 start_delalloc_inodes+0x2af/0x390 [btrfs] btrfs_start_delalloc_roots+0x12d/0x250 [btrfs] flush_space+0x24f/0x660 [btrfs] btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs] process_one_work+0x24e/0x5e0 worker_thread+0x20f/0x3b0 ? process_one_work+0x5e0/0x5e0 kthread+0x153/0x170 ? kthread_mod_delayed_work+0xc0/0xc0 ret_from_fork+0x22/0x30 (...) several other tasks blocked on inode locks held by the clone task below (...) RIP: 0033:0x7f61efe73fff Code: Unable to access opcode bytes at RIP 0x7f61efe73fd5. RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000202 ORIG_RAX: 000000000000013c RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73fff RDX: 00000000ffffff9c RSI: 0000560fbd604690 RDI: 00000000ffffff9c RBP: 00007ffc3371beb0 R08: 0000000000000002 R09: 0000560fbd5d75f0 R10: 0000560fbd5d81f0 R11: 0000000000000202 R12: 0000000000000002 R13: 000000000000000b R14: 00007ffc3371bea0 R15: 00007ffc3371beb0 task: fdm-stress state:D stack: 0 pid:2508234 ppid:2508153 flags:0x00004000 Call Trace: __schedule+0x5d1/0xcf0 ? _raw_spin_unlock_irqrestore+0x3c/0x60 schedule+0x45/0xe0 __reserve_bytes+0x4a4/0xb10 [btrfs] ? finish_wait+0x90/0x90 btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs] btrfs_block_rsv_add+0x1f/0x50 [btrfs] start_transaction+0x2d1/0x760 [btrfs] btrfs_replace_file_extents+0x120/0x930 [btrfs] ? lock_release+0x20e/0x4c0 btrfs_clone+0x3e4/0x7e0 [btrfs] ? btrfs_lookup_first_ordered_extent+0x8e/0x100 [btrfs] btrfs_clone_files+0xf6/0x150 [btrfs] btrfs_remap_file_range+0x324/0x3d0 [btrfs] do_clone_file_range+0xd4/0x1f0 vfs_clone_file_range+0x4d/0x230 ? lock_release+0x20e/0x4c0 ioctl_file_clone+0x8f/0xc0 do_vfs_ioctl+0x342/0x750 __x64_sys_ioctl+0x62/0xb0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 """ Fix both of these issues by excluding mmaps from happening we are doing any sort of remap, which prevents this race completely. Signed-off-by: Josef Bacik Signed-off-by: Darrick J. Wong --- fs/btrfs/reflink.c | 30 ++++++++++++++++++++++++------ 1 file changed, 24 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c index 97f58b15eba2..7df2182c98d4 100644 --- a/fs/btrfs/reflink.c +++ b/fs/btrfs/reflink.c @@ -587,6 +587,20 @@ static void btrfs_double_extent_lock(struct inode *inode1, u64 loff1, lock_extent(&BTRFS_I(inode2)->io_tree, loff2, loff2 + len - 1); } +static void btrfs_double_mmap_lock(struct inode *inode1, struct inode *inode2) +{ + if (inode1 < inode2) + swap(inode1, inode2); + down_write(&BTRFS_I(inode1)->i_mmap_lock); + down_write_nested(&BTRFS_I(inode2)->i_mmap_lock, SINGLE_DEPTH_NESTING); +} + +static void btrfs_double_mmap_unlock(struct inode *inode1, struct inode *inode2) +{ + up_write(&BTRFS_I(inode1)->i_mmap_lock); + up_write(&BTRFS_I(inode2)->i_mmap_lock); +} + static int btrfs_extent_same_range(struct inode *src, u64 loff, u64 len, struct inode *dst, u64 dst_loff) { @@ -815,10 +829,12 @@ loff_t btrfs_remap_file_range(struct file *src_file, loff_t off, if (remap_flags & ~(REMAP_FILE_DEDUP | REMAP_FILE_ADVISORY)) return -EINVAL; - if (same_inode) - btrfs_inode_lock(src_inode, 0); - else + if (same_inode) { + btrfs_inode_lock(src_inode, BTRFS_ILOCK_MMAP); + } else { lock_two_nondirectories(src_inode, dst_inode); + btrfs_double_mmap_lock(src_inode, dst_inode); + } ret = btrfs_remap_file_range_prep(src_file, off, dst_file, destoff, &len, remap_flags); @@ -831,10 +847,12 @@ loff_t btrfs_remap_file_range(struct file *src_file, loff_t off, ret = btrfs_clone_files(dst_file, src_file, off, len, destoff); out_unlock: - if (same_inode) - btrfs_inode_unlock(src_inode, 0); - else + if (same_inode) { + btrfs_inode_unlock(src_inode, BTRFS_ILOCK_MMAP); + } else { + btrfs_double_mmap_unlock(src_inode, dst_inode); unlock_two_nondirectories(src_inode, dst_inode); + } return ret < 0 ? ret : len; } From patchwork Mon Dec 14 18:19:41 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Josef Bacik X-Patchwork-Id: 11972651 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 30949C4361B for ; Mon, 14 Dec 2020 18:21:22 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id EA40120578 for ; Mon, 14 Dec 2020 18:21:21 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2440743AbgLNSVJ (ORCPT ); Mon, 14 Dec 2020 13:21:09 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52558 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2440735AbgLNSUd (ORCPT ); Mon, 14 Dec 2020 13:20:33 -0500 Received: from mail-qv1-xf42.google.com (mail-qv1-xf42.google.com [IPv6:2607:f8b0:4864:20::f42]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 31C10C06179C for ; Mon, 14 Dec 2020 10:19:53 -0800 (PST) Received: by mail-qv1-xf42.google.com with SMTP id bd6so8225288qvb.9 for ; Mon, 14 Dec 2020 10:19:53 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=toxicpanda-com.20150623.gappssmtp.com; s=20150623; h=from:to:subject:date:message-id:in-reply-to:references:mime-version :content-transfer-encoding; bh=AeN6ivgHC82uEJDuVzIm5oYCOM9gnft3Ccr9v+xSQSo=; b=HC/yoEqYxkZjt368DHgUhfdqZhne/cf763oVLWxJP/N6PlZtdNpmEB63st+Ligs949 HC5eAuzhOEZnK2lLIXgDxqjDFA74dBiIyIH8/9njsQr7zliVJQuIAYDt04qr8GRan91U cZB7+upCMukWm2DKWTJOgKAkT4BBk+XWBAmqQ/iXaOLo/gkOF+Lez3hNHTJV6QvlV01T bqYI5Y2pdCe4Y4ZQpstTUlX+bo1vzcSvEdazk3/V8Ou/2o4US6knQvcv0iXwBDJs3C3V n/QbdmXj5/ktSkeShxKc5MCJg/AMrdfGvxOozTE0k5BeyMyCsORQi8EO6r8iyritg6Oa TNIQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=AeN6ivgHC82uEJDuVzIm5oYCOM9gnft3Ccr9v+xSQSo=; b=n0i0WFsjVAzTVr1hdOmqieVBffrQ66QAXUNUFhtigkuoXNOOHOSWxVeMJLViIfmHUP sB1qXNWSaGS+h/22ko19T0kRtLa5UkXP8Uz3Mv+YcTT4f9OWbFahto/PV1BCLZPMnV5q I/BhLDtV2vbOZSlc2HG+NsJ4MFQCIMAeqgskrlq3i3sgKpYUiJPVHHjrT6UdhKFbx4Wz MmT46X2T+xU/Hm2fcgmCTVK1ygrCwmIXqVPwz+xa9p6wMywCa1qwqOWDsGBfZBQ4BzhJ g1zhWWKC2jar0ifOHE8me7WzNipr1PrNtVN6y9jh/Jv+40PgZeZgtBfJLKW9J93tlLpp hGBA== X-Gm-Message-State: AOAM531u2daV6aZaTDVUz8ZllQnqoRaW4aoBBbw3+HIezMQtLOACyJ9a 2Zj2piRdA+CislFc5sg95bhpl18QoM0Z/bFK X-Google-Smtp-Source: ABdhPJyZAgzRnFjb6ulf/7zERsrWvjxIygG3KLs88/TWUoPr2a39BZaH69/mu18ctE7vPlWXY2jFjA== X-Received: by 2002:a05:6214:1110:: with SMTP id e16mr32784135qvs.57.1607969991816; Mon, 14 Dec 2020 10:19:51 -0800 (PST) Received: from localhost (cpe-174-109-172-136.nc.res.rr.com. [174.109.172.136]) by smtp.gmail.com with ESMTPSA id g28sm9285209qtm.91.2020.12.14.10.19.50 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 14 Dec 2020 10:19:50 -0800 (PST) From: Josef Bacik To: linux-btrfs@vger.kernel.org, kernel-team@fb.com Subject: [PATCH 4/4] btrfs: exclude mmap from happening during all fallocate operations Date: Mon, 14 Dec 2020 13:19:41 -0500 Message-Id: <0598524d659ea05c5ed1fe2fb80326a04be83324.1607969636.git.josef@toxicpanda.com> X-Mailer: git-send-email 2.26.2 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org There's a small window where a deadlock can happen between fallocate and mmap. This is described in detail by Filipe """ When doing a fallocate operation we lock the inode, flush delalloc within the target range, wait for any ordered extents to complete and then lock the file range. Before we lock the range and after we flush delalloc, there is a time window where another task can come in and do a memory mapped write for a page within the fallocate range. This means that after fallocate locks the range, there can be a dirty page in the range. More often than not, this does not cause any problem. The exception is when we are low on available metadata space, because an fallocate operation needs to start a transaction while holding the file range locked, either through btrfs_prealloc_file_range() or through the call to btrfs_fallocate_update_isize(). If that's the case, we can end up in a deadlock. The following list of steps explains how that happens: 1) A fallocate operation starts, locks the inode, flushes delalloc in the range and waits for ordered extents in the range to complete; 2) Before the fallocate task locks the file range, another task does a memory mapped write for a page in the fallocate target range. This is possible since memory mapped writes do not (and can not) lock the inode; 3) The fallocate task locks the file range. At this point there is one dirty page in the range (due to the memory mapped write); 4) When the fallocate task attempts to start a transaction, it blocks when attempting to reserve metadata space, since we are low on available metadata space. Before blocking (wait on its reservation ticket), it starts the async reclaim task (if not running already); 5) The async reclaim task is not able to release space through any other means, so it decides to flush delalloc for inodes with dirty pages. It finds that the inode used in the fallocate operation has a dirty page and therefore queues a job (fs_info->flushs_workers workqueue) to flush delalloc for that inode and waits on that job to complete; 6) The flush job blocks when attempting to lock the file range because it is currently locked by the fallocate task; 7) The fallocate task keeps waiting for its metadata reservation, waiting for a wakeup on its reservation ticket. The async reclaim task is waiting on the flush job, which in turn is waiting for locking the file range that is currently locked by the fallocate task. So unless some other task is able to release enough metadata space, for example an ordered extent for some other inode completes, we end up in a deadlock between all these tasks. When this happens stack traces like the following showup in dmesg/syslog: INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds. Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000 Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs] Call Trace: __schedule+0x5d1/0xcf0 schedule+0x45/0xe0 lock_extent_bits+0x1e6/0x2d0 [btrfs] ? finish_wait+0x90/0x90 btrfs_invalidatepage+0x32c/0x390 [btrfs] ? __mod_memcg_state+0x8e/0x160 __extent_writepage+0x2d4/0x400 [btrfs] extent_write_cache_pages+0x2b2/0x500 [btrfs] ? lock_release+0x20e/0x4c0 ? trace_hardirqs_on+0x1b/0xf0 extent_writepages+0x43/0x90 [btrfs] ? lock_acquire+0x1a3/0x490 do_writepages+0x43/0xe0 ? __filemap_fdatawrite_range+0xa4/0x100 __filemap_fdatawrite_range+0xc5/0x100 btrfs_run_delalloc_work+0x17/0x40 [btrfs] btrfs_work_helper+0xf1/0x600 [btrfs] process_one_work+0x24e/0x5e0 worker_thread+0x50/0x3b0 ? process_one_work+0x5e0/0x5e0 kthread+0x153/0x170 ? kthread_mod_delayed_work+0xc0/0xc0 ret_from_fork+0x22/0x30 INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds. Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000 Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs] Call Trace: __schedule+0x5d1/0xcf0 ? kvm_clock_read+0x14/0x30 ? wait_for_completion+0x81/0x110 schedule+0x45/0xe0 schedule_timeout+0x30c/0x580 ? _raw_spin_unlock_irqrestore+0x3c/0x60 ? lock_acquire+0x1a3/0x490 ? try_to_wake_up+0x7a/0xa20 ? lock_release+0x20e/0x4c0 ? lock_acquired+0x199/0x490 ? wait_for_completion+0x81/0x110 wait_for_completion+0xab/0x110 start_delalloc_inodes+0x2af/0x390 [btrfs] btrfs_start_delalloc_roots+0x12d/0x250 [btrfs] flush_space+0x24f/0x660 [btrfs] btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs] process_one_work+0x24e/0x5e0 worker_thread+0x20f/0x3b0 ? process_one_work+0x5e0/0x5e0 kthread+0x153/0x170 ? kthread_mod_delayed_work+0xc0/0xc0 ret_from_fork+0x22/0x30 (...) several tasks waiting for the inode lock held by the fallocate task below (...) RIP: 0033:0x7f61efe73fff Code: Unable to access opcode bytes at RIP 0x7f61efe73fd5. RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000202 ORIG_RAX: 000000000000013c RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73fff RDX: 00000000ffffff9c RSI: 0000560fbd5d90a0 RDI: 00000000ffffff9c RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003 R10: 0000560fbd5d7ad0 R11: 0000000000000202 R12: 0000000000000001 R13: 000000000000005e R14: 00007ffc3371bea0 R15: 00007ffc3371beb0 task:fdm-stress state:D stack: 0 pid:2508243 ppid:2508153 flags:0x00000000 Call Trace: __schedule+0x5d1/0xcf0 ? _raw_spin_unlock_irqrestore+0x3c/0x60 schedule+0x45/0xe0 __reserve_bytes+0x4a4/0xb10 [btrfs] ? finish_wait+0x90/0x90 btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs] btrfs_block_rsv_add+0x1f/0x50 [btrfs] start_transaction+0x2d1/0x760 [btrfs] btrfs_replace_file_extents+0x120/0x930 [btrfs] ? btrfs_fallocate+0xdcf/0x1260 [btrfs] btrfs_fallocate+0xdfb/0x1260 [btrfs] ? filename_lookup+0xf1/0x180 vfs_fallocate+0x14f/0x440 ioctl_preallocate+0x92/0xc0 do_vfs_ioctl+0x66b/0x750 ? __do_sys_newfstat+0x53/0x60 __x64_sys_ioctl+0x62/0xb0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 """ Fix this by disallowing mmaps from happening while we're doing any of the fallocate operations on this inode. Signed-off-by: Josef Bacik --- fs/btrfs/file.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index ed633ae87e06..d573ffcbd626 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -2872,7 +2872,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len) if (ret) return ret; - btrfs_inode_lock(inode, 0); + btrfs_inode_lock(inode, BTRFS_ILOCK_MMAP); ino_size = round_up(inode->i_size, fs_info->sectorsize); ret = find_first_non_hole(BTRFS_I(inode), &offset, &len); if (ret < 0) @@ -2912,7 +2912,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len) truncated_block = true; ret = btrfs_truncate_block(BTRFS_I(inode), offset, 0, 0); if (ret) { - btrfs_inode_unlock(inode, 0); + btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP); return ret; } } @@ -3013,7 +3013,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len) ret = ret2; } } - btrfs_inode_unlock(inode, 0); + btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP); return ret; } @@ -3336,7 +3336,7 @@ static long btrfs_fallocate(struct file *file, int mode, return ret; } - btrfs_inode_lock(inode, 0); + btrfs_inode_lock(inode, BTRFS_ILOCK_MMAP); if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) { ret = inode_newsize_ok(inode, offset + len); @@ -3378,7 +3378,7 @@ static long btrfs_fallocate(struct file *file, int mode, if (mode & FALLOC_FL_ZERO_RANGE) { ret = btrfs_zero_range(inode, offset, len, mode); - btrfs_inode_unlock(inode, 0); + btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP); return ret; } @@ -3488,7 +3488,7 @@ static long btrfs_fallocate(struct file *file, int mode, unlock_extent_cached(&BTRFS_I(inode)->io_tree, alloc_start, locked_end, &cached_state); out: - btrfs_inode_unlock(inode, 0); + btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP); /* Let go of our reservation. */ if (ret != 0 && !(mode & FALLOC_FL_ZERO_RANGE)) btrfs_free_reserved_data_space(BTRFS_I(inode), data_reserved,