From patchwork Tue Feb 23 12:08:47 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 12100275 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-19.0 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9E01EC433DB for ; Tue, 23 Feb 2021 12:11:41 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 60B8664E00 for ; Tue, 23 Feb 2021 12:11:41 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232548AbhBWMLk (ORCPT ); Tue, 23 Feb 2021 07:11:40 -0500 Received: from mail.kernel.org ([198.145.29.99]:41630 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232689AbhBWMJe (ORCPT ); Tue, 23 Feb 2021 07:09:34 -0500 Received: by mail.kernel.org (Postfix) with ESMTPSA id 901DD64E25 for ; Tue, 23 Feb 2021 12:08:52 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1614082133; bh=2fez1eKCiJtOXj51F8XTqr8DVrp7oBIqvrZhWsIjCW4=; h=From:To:Subject:Date:In-Reply-To:References:From; b=bg3QI/A1CtpNiyVobNf/3V+O5zvY0MzmGcDLEZvt3NIHvUM49dtcxgKtY8pZyfMuV fX4HlIy3EVdYSIODxPt4kjk2X5PPo9Ln5aQCEjovc9nmyCRIh8dezg5bAlQ8zfkbHD DDK/l0ObsxSGSMGzXCW8tOYUZc+xyFZvWdOd2C6ipL9PSEWKrntfe4o8I6S7uLIAOZ RrfZEjWNGSR7USYrs/u+oKcjySSmVTtxlo6/fqbWqB2A5wIjxNYgXS2kU2scyOaa/P ya9Ukqowola+VRIv2p2JXTsEKkb6UonjIWABwqqPUIPzHmWNX/foKxu1UiIcxZfh9s 0qhd/GWVhHcAA== From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH 1/3] btrfs: fix race between memory mapped writes and fsync Date: Tue, 23 Feb 2021 12:08:47 +0000 Message-Id: <913b5529db9d7ff6dcfb370785e471e0024241c2.1614081281.git.fdmanana@suse.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Filipe Manana When doing an fsync we flush all delalloc, lock the inode (vfs lock), flush any new delalloc that might have been created before taking the lock and then wait either for the ordered extents to complete or just for the writeback to complete (depending on whether the full sync flag is set or not). We then start logging the inode and assume that while we are doing it no one else is touching the inode's file extent items (or adding new ones). That is generally true because all operations that modify an inode acquire the inode's lock first, including buffered and direct IO writes. However there is one exception: memory mapped writes, which do not and can not acquire the inode's lock. This can cause two types of issues: ending up logging file extent items with overlapping ranges, which is detected by the tree checker and will result in aborting the transaction when starting writeback for a log tree's extent buffers, or a silent corruption where we log a version of the file that never existed. Scenario 1 - logging overlapping extents The following steps explain how we can end up with file extents items with overlapping ranges in a log tree due to a race between a fsync and memory mapped writes: 1) Task A starts an fsync on inode X, which has the full sync runtime flag set. First it starts by flushing all delalloc for the inode; 2) Task A then locks the inode and flushes any other delalloc that might have been created after the previous flush and waits for all ordered extents to complete; 3) In the inode's root we have the following leaf: Leaf N, generation == current transaction id: --------------------------------------------------------- | (...) [ file extent item, offset 640K, length 128K ] | --------------------------------------------------------- The last file extent item in leaf N covers the file range from 640K to 768K; 4) Task B does a memory mapped write for the page corresponding to the file range from 764K to 768K; 5) Task A starts logging the inode. At copy_inode_items_to_log() it uses btrfs_search_forward() to search for leafs modified in the current transaction that contain items for the inode. It finds leaf N and copies all the inode items from that leaf into the log tree. Now the log tree has a copy of the last file extent item from leaf N. At the end of the while loop at copy_inode_items_to_log(), we have the minimum key set to: min_key.objectid = min_key.type = BTRFS_EXTENT_DATA_KEY min_key.offset = 640K Then we increment the key's offset by 1 so that the next call to btrfs_search_forward() leaves us at the first key greater than the key we just processed. But before btrfs_search_forward() is called again... 6) Dellaloc for the page at offset 764K, dirtied by task B, is started. It can be started for several reasons: - The async reclaim task is attempting to satisfy metadata or data reservation requests, and it has reached a point where it decided to flush delalloc; - Due to memory pressure the vm triggers writeback of dirty pages; - The system call sync_file_range(2) is called from user space. 7) When the respective ordered extent completes, it trims the length of the existing file extent item for file offset 640K from 128K to 124K, and a new file extent item is added with a key offset of 764K and a length of 4K; 8) Task A calls btrfs_search_forward(), which returns us a path pointing to the leaf (can be leaf N or some other) containing the new file extent item for file offset 764K. We end up copying this item to the log tree, which overlaps with the last copied file extent item, which covers the file range from 640K to 768K. When writeback is triggered for log tree's extent buffers, the issue will be detected by the tree checker which will dump a trace and an error message on dmesg/syslog. If the writeback is triggered when syncing the log, which typically is, then we also end up aborting the current transaction. This is the same type of problem fixed in 0c713cbab6200b ("Btrfs: fix race between ranged fsync and writeback of adjacent ranges"). Scenario 2 - logging a version of the file that never existed This scenario only happens when using the NO_HOLES feature and results in a silent corruption, in the sense that is not detectable by 'btrfs check' or the tree checker: 1) We have an inode I with a size of 1M and two file extent items, one covering an extent with disk_bytenr == X for the file range [0, 512K[ and another one covering another extent with disk_bytenr == Y for the file range [512K, 1M[; 2) A hole is punched for the file range [512K, 1M[; 3) Task A starts an fsync of inode I, which has the full sync runtime flag set. It starts by flushing all existing delalloc, locks the inode (VFS lock), starts any new delalloc that might have been created before taking the lock and waits for all ordered extents to complete; 4) Some other task does a memory mapped write for the page corresponding to the file range [640K, 644K[ for example; 5) Task A then logs all items of the inode with the call to copy_inode_items_to_log(); 6) In the meanwhile delalloc for the range [640K, 644K[ is started. It can be started for several reasons: - The async reclaim task is attempting to satisfy metadata or data reservation requests, and it has reached a point where it decided to flush delalloc; - Due to memory pressure the vm triggers writeback of dirty pages; - The system call sync_file_range(2) is called from user space. 7) The ordered extent for the range [640K, 644K[ completes and a file extent item for that range is added to the subvolume tree, pointing to a 4K extent with a disk_bytenr == Z; 8) Task A then calls btrfs_log_holes(), to scan for implicit holes in the subvolume tree. It finds two implicit holes: - one for the file range [512K, 640K[ - one for the file range [644K, 1M[ As a result we end up neither logging a hole for the range [640K, 644K[ nor logging the file extent item with a disk_bytenr == Z. This means that if we have a power failure and replay the log tree we end up getting the following file extent layout: [ disk_bytenr X ] [ hole ] [ disk_bytenr Y ] [ hole ] 0 512K 512K 640K 640K 644K 644K 1M Which does not corresponding to any layout the file ever had before the power failure. The only two valid layouts would be: [ disk_bytenr X ] [ hole ] 0 512K 512K 1M and [ disk_bytenr X ] [ hole ] [ disk_bytenr Z ] [ hole ] 0 512K 512K 640K 640K 644K 644K 1M This can be fixed by serializing memory mapped writes with fsync, and there are two ways to do it: 1) Make a fsync lock the entire file range, from 0 to (u64)-1 / LLONG_MAX in the inode's io tree. This prevents the race but also blocks any reads during the duration of the fsync, which has a negative impact for many common workloads; 2) Make an fsync write lock the i_mmap_lock semaphore in the inode. This semaphore was recently added by Josef's patch set: btrfs: add a i_mmap_lock to our inode btrfs: cleanup inode_lock/inode_unlock uses btrfs: exclude mmaps while doing remap btrfs: exclude mmap from happening during all fallocate operations and is used to solve races between memory mapped writes and clone/dedupe/fallocate. This also makes us have the same behaviour we have regarding other writes (buffered and direct IO) and fsync - block them while the inode logging is in progress. This change uses the second approach due to the performance impact of the first one. Signed-off-by: Filipe Manana --- fs/btrfs/file.c | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 3a2928749349..809eee806505 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -2122,7 +2122,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) if (ret) goto out; - btrfs_inode_lock(inode, 0); + btrfs_inode_lock(inode, BTRFS_ILOCK_MMAP); atomic_inc(&root->log_batch); @@ -2135,11 +2135,11 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) &BTRFS_I(inode)->runtime_flags); /* - * Before we acquired the inode's lock, someone may have dirtied more - * pages in the target range. We need to make sure that writeback for - * any such pages does not start while we are logging the inode, because - * if it does, any of the following might happen when we are not doing a - * full inode sync: + * Before we acquired the inode's lock and the mmap lock, someone may + * have dirtied more pages in the target range. We need to make sure + * that writeback for any such pages does not start while we are logging + * the inode, because if it does, any of the following might happen when + * we are not doing a full inode sync: * * 1) We log an extent after its writeback finishes but before its * checksums are added to the csum tree, leading to -EIO errors @@ -2154,7 +2154,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) */ ret = start_ordered_ops(inode, start, end); if (ret) { - btrfs_inode_unlock(inode, 0); + btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP); goto out; } @@ -2255,7 +2255,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) * file again, but that will end up using the synchronization * inside btrfs_sync_log to keep things safe. */ - btrfs_inode_unlock(inode, 0); + btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP); if (ret != BTRFS_NO_LOG_SYNC) { if (!ret) { @@ -2285,7 +2285,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) out_release_extents: btrfs_release_log_ctx_extents(&ctx); - btrfs_inode_unlock(inode, 0); + btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP); goto out; } From patchwork Tue Feb 23 12:08:48 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 12100277 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-19.0 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3904DC433E6 for ; Tue, 23 Feb 2021 12:11:43 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id E86C364E21 for ; Tue, 23 Feb 2021 12:11:42 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232452AbhBWMLl (ORCPT ); Tue, 23 Feb 2021 07:11:41 -0500 Received: from mail.kernel.org ([198.145.29.99]:41632 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232690AbhBWMJe (ORCPT ); Tue, 23 Feb 2021 07:09:34 -0500 Received: by mail.kernel.org (Postfix) with ESMTPSA id 7009764E32 for ; Tue, 23 Feb 2021 12:08:53 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1614082133; bh=Co0/NHNHWHYPqn528EX0MjRKqrULSIK1Jy+x6KpP4mI=; h=From:To:Subject:Date:In-Reply-To:References:From; b=PkbWuFBDaxL8WZ4SFbSOh7aIfI8bSwvM/63IOcVCwt5tShuPMrYDn7CpZl6oXxctU tEZy2T9C+/SvjnMLPTakdR1OhXJ69jrmmiFx1GnCJYtRshmdxusAWHS9Elx1CNT65Q SVjwWXVlB5UGukdHTDyVkmGHSAeCkijUmOlJ79XnCHEmns9zSBnnZRjcE82SvHhQD/ +epRcBd5HSeKyjleP6P6C+4yys5fa3wDdsGX5crJwbCqzmV1lg4sNyLiuJP2y7eRE9 POC+Hx9wsXhknAc+1zUK3yRfW5V0IXuyg9ClumIhZtIEcn1YgoMb9BhOrKEJeWMJ31 1iYFfDuX+hThA== From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH 2/3] btrfs: fix race between marking inode needs to be logged and log syncing Date: Tue, 23 Feb 2021 12:08:48 +0000 Message-Id: <2afba65e798814e2a4a3af69f210c5a35a63fb3d.1614081281.git.fdmanana@suse.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Filipe Manana We have a race between marking that an inode needs to be logged, either at btrfs_set_inode_last_trans() or at btrfs_page_mkwrite(), and between btrfs_sync_log(). The following steps describe how the race happens. 1) We are at transaction N; 2) Inode I was previously fsynced in the current transaction so it has: inode->logged_trans set to N; 3) The inode's root currently has: root->log_transid set to 1 root->last_log_commit set to 0 Which means only one log transaction was committed to far, log transaction 0. When a log tree is created we set ->log_transid and ->last_log_commit of its parent root to 0 (at btrfs_add_log_tree()); 4) One more range of pages is dirtied in inode I; 5) Some task A starts an fsync against some other inode J (same root), and so it joins log transaction 1. Before task A calls btrfs_sync_log()... 6) Task B starts an fsync against inode I, which currently has the full sync flag set, so it starts delalloc and waits for the ordered extent to complete before calling btrfs_inode_in_log() at btrfs_sync_file(); 7) During ordered extent completion we have btrfs_update_inode() called against inode I, which in turn calls btrfs_set_inode_last_trans(), which does the following: spin_lock(&inode->lock); inode->last_trans = trans->transaction->transid; inode->last_sub_trans = inode->root->log_transid; inode->last_log_commit = inode->root->last_log_commit; spin_unlock(&inode->lock); So ->last_trans is set to N and ->last_sub_trans set to 1. But before setting ->last_log_commit... 8) Task A is at btrfs_sync_log(): - it increments root->log_transid to 2 - starts writeteback for all log tree extent buffers - waits for the writeback to complete - writes the super blocks - updates root->last_log_commit to 1 It's a lot of slow steps between updating root->log_transid and root->last_log_commit; 9) The task doing the ordered extent completion, currently at btrfs_set_inode_last_trans(), then finally runs: inode->last_log_commit = inode->root->last_log_commit; spin_unlock(&inode->lock); Which results in inode->last_log_commit being set to 1. The ordered extent completes; 10) Task B is resumed, and it calls btrfs_inode_in_log() whichs returns true because we have all the following conditions met: inode->logged_trans == N which matches fs_info->generation && inode->last_subtrans (1) <= inode->last_log_commit (1) && inode->last_subtrans (1) <= root->last_log_commit (1) && list inode->extent_tree.modified_extents is empty And as a consequence we return without logging the inode, so the existing logged version of the inode does not point to the extent that was written after the previous fsync. It should be impossible in practice for one task be able to do so much progress in btrfs_sync_log() while another task is at btrfs_set_inode_last_trans() right after it reads root->log_transid and before it reads root->last_log_commit. Even if kernel preemption is enabled we know the task at btrfs_set_inode_last_trans() can not be preempted because it is holding the inode's spinlock. However there is another place where we do the same without holding the spinlock, which is in the memory mapped write path at: vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf) { (...) BTRFS_I(inode)->last_trans = fs_info->generation; BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid; BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit; (...) So with preemption happening after setting ->last_sub_trans and before setting ->last_log_commit, it is less of a stretch to have another task do enough progress at btrfs_sync_log() such that the task doing the memory mapped write ends up with ->last_sub_trans and ->last_log_commit set to the same value. It is still a big stretch to get there, as the task doing btrfs_sync_log() has to start writeback, wait for its completion and write the super blocks. So fix this in two different ways: 1) For btrfs_set_inode_last_trans(), simply set ->last_log_commit to the value of ->last_sub_trans minus 1; 2) For btrfs_page_mkwrite() only set the inode's ->last_sub_trans, just like we do for buffered and direct writes at btrfs_file_write_iter(), which is all we need to make sure multiple writes and fsyncs to an inode in the same transaction never result in an fsync missing that the inode changed and needs to be logged. Turn this into a helper function and use it both at btrfs_page_mkwrite() and at btrfs_file_write_iter() - this also fixes the problem that at btrfs_page_mkwrite() we were setting those fields without the protection of the inode's spinlock. This is an extremely unlikely race to happen in pratice. Signed-off-by: Filipe Manana --- fs/btrfs/btrfs_inode.h | 15 +++++++++++++++ fs/btrfs/file.c | 10 ++-------- fs/btrfs/inode.c | 4 +--- fs/btrfs/transaction.h | 2 +- 4 files changed, 19 insertions(+), 12 deletions(-) diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index 26837c3ca7f6..8011596e1eb3 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -300,6 +300,21 @@ static inline void btrfs_mod_outstanding_extents(struct btrfs_inode *inode, mod); } +/* + * Called every time after doing a buffered, direct IO or memory mapped write. + * + * This is to ensure that if we write to a file that was previously fsynced in + * the current transaction, then try to fsync it again in the same transaction, + * we will know that there were changes in the file and that it needs to be + * logged. + */ +static inline void btrfs_set_inode_last_sub_trans(struct btrfs_inode *inode) +{ + spin_lock(&inode->lock); + inode->last_sub_trans = inode->root->log_transid; + spin_unlock(&inode->lock); +} + static inline int btrfs_inode_in_log(struct btrfs_inode *inode, u64 generation) { int ret = 0; diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 809eee806505..2dcbe3602b6a 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -2014,14 +2014,8 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb, else num_written = btrfs_buffered_write(iocb, from); - /* - * We also have to set last_sub_trans to the current log transid, - * otherwise subsequent syncs to a file that's been synced in this - * transaction will appear to have already occurred. - */ - spin_lock(&inode->lock); - inode->last_sub_trans = inode->root->log_transid; - spin_unlock(&inode->lock); + btrfs_set_inode_last_sub_trans(inode); + if (num_written > 0) num_written = generic_write_sync(iocb, num_written); diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 1710060eb2dc..ae3b439565b4 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -8623,9 +8623,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf) set_page_dirty(page); SetPageUptodate(page); - BTRFS_I(inode)->last_trans = fs_info->generation; - BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid; - BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit; + btrfs_set_inode_last_sub_trans(BTRFS_I(inode)); unlock_extent_cached(io_tree, page_start, page_end, &cached_state); up_read(&BTRFS_I(inode)->i_mmap_lock); diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h index 6335716e513f..dd7c3eea08ad 100644 --- a/fs/btrfs/transaction.h +++ b/fs/btrfs/transaction.h @@ -175,7 +175,7 @@ static inline void btrfs_set_inode_last_trans(struct btrfs_trans_handle *trans, spin_lock(&inode->lock); inode->last_trans = trans->transaction->transid; inode->last_sub_trans = inode->root->log_transid; - inode->last_log_commit = inode->root->last_log_commit; + inode->last_log_commit = inode->last_sub_trans - 1; spin_unlock(&inode->lock); } From patchwork Tue Feb 23 12:08:49 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 12100279 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-19.0 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 161E0C433DB for ; Tue, 23 Feb 2021 12:11:46 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id DAECD64E00 for ; Tue, 23 Feb 2021 12:11:45 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232648AbhBWMLn (ORCPT ); Tue, 23 Feb 2021 07:11:43 -0500 Received: from mail.kernel.org ([198.145.29.99]:41634 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232696AbhBWMJf (ORCPT ); Tue, 23 Feb 2021 07:09:35 -0500 Received: by mail.kernel.org (Postfix) with ESMTPSA id 3FBFA601FF for ; Tue, 23 Feb 2021 12:08:54 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1614082134; bh=DXKgqUXWggre8pb+n7ubDQFTEZqblI7KH0bCrTVwXoc=; h=From:To:Subject:Date:In-Reply-To:References:From; b=L3ERtcls24xtv76wn5KSgiwdxvOvSemBsCQOe9i37zcIU989KIGrJisEAu+IGzD/8 VUAHRIf+onW4V9WmUvGaJ+OePBuxl/+ldlwVqHo3T/3emguNpmlQYhpUeGpCVHDQbV YaqiUGY4tHq1+b5G53/PvggLWWJyxv073+AA3H13PkUOy1vdlni3K0xwOVRV+Z5N9n niwHRDsCI4vEEFy5oxTa9vNY4Zr+xW7YH0B9kuOJ217GD4z48l9Y+XzqRyn6XtjrDC mpoVVL3qVyfGvPejLyiA7Kqrai3gaN2B0wKnUL1dQnDdwreeyYguhIq6QhTPkUylgC 2RfEFR1MqETMg== From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH 3/3] btrfs: remove stale comment and logic from btrfs_inode_in_log() Date: Tue, 23 Feb 2021 12:08:49 +0000 Message-Id: X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Filipe Manana Currently btrfs_inode_in_log() checks the list of modified extents of the inode, and has a comment mentioning why, as it used to be necessary to make sure if we did something like the following: mmap write range A mmap write range B msync range A (ranged fsync) msync range B (ranged fsync) we ended up with both ranges being logged. If we did not check it, then the second fsync would do nothing because btrfs_inode_in_log() would return true. This was added in 125c4cf9f37c98 ("Btrfs: set inode's logged_trans/last_log_commit after ranged fsync") and test case generic/325 from fstests exercises that scenario. However, as of commit 487781796d3022 ("btrfs: make fast fsyncs wait only for writeback"), every ranged fsync is now turned into a full ranged fsync (operates on the range from 0 to LLONG_MAX), so it is now pointless to test of emptiness of the list of modified extents, and the comment is clearly outdated. So just remove the comment and list emptiness check, while also changing the function's return type to be a boolean instead of an integer. In case one day we get support for ranged fsyncs again, it will be easy to notice the check is necessary again, because it will make generic/325 always fail. Signed-off-by: Filipe Manana --- fs/btrfs/btrfs_inode.h | 17 ++++------------- 1 file changed, 4 insertions(+), 13 deletions(-) diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index 8011596e1eb3..c652e19ad74e 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -315,24 +315,15 @@ static inline void btrfs_set_inode_last_sub_trans(struct btrfs_inode *inode) spin_unlock(&inode->lock); } -static inline int btrfs_inode_in_log(struct btrfs_inode *inode, u64 generation) +static inline bool btrfs_inode_in_log(struct btrfs_inode *inode, u64 generation) { - int ret = 0; + bool ret = false; spin_lock(&inode->lock); if (inode->logged_trans == generation && inode->last_sub_trans <= inode->last_log_commit && - inode->last_sub_trans <= inode->root->last_log_commit) { - /* - * After a ranged fsync we might have left some extent maps - * (that fall outside the fsync's range). So return false - * here if the list isn't empty, to make sure btrfs_log_inode() - * will be called and process those extent maps. - */ - smp_mb(); - if (list_empty(&inode->extent_tree.modified_extents)) - ret = 1; - } + inode->last_sub_trans <= inode->root->last_log_commit) + ret = true; spin_unlock(&inode->lock); return ret; }