
[01/13] btrfs: fix missed extent on fsync after dropping extent maps

Message ID 25d72acc3215f74fbb885562667bf12401c214e9.1663594828.git.fdmanana@suse.com (mailing list archive)
State New, archived
Series btrfs: fixes and cleanups around extent maps

Commit Message

Filipe Manana Sept. 19, 2022, 2:06 p.m. UTC
From: Filipe Manana <fdmanana@suse.com>

When dropping extent maps for a range, through btrfs_drop_extent_cache(),
if we find an extent map that starts before our target range and/or ends
after the target range, and we are not able to allocate extent maps for
splitting that extent map, then we don't fail and simply remove the entire
extent map from the inode's extent map tree.

This is generally fine, because in case anyone needs to access the extent
map, it can just load it again later from the respective file extent
item(s) in the subvolume btree. However, if that extent map is new and is
in the list of modified extents, then a fast fsync will miss the parts of
the extent that were outside our range (that needed to be split),
therefore not logging them. Fix that by marking the inode for a full
fsync. This issue was introduced when commit 7014cdb49305ed ("Btrfs:
btrfs_drop_extent_cache should never fail") removed the BUG_ON()s that
triggered when the split extent map allocations failed, back in 2012,
at which point the fast fsync path already existed but was still very
recent.
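
As a hypothetical worked example (numbers invented for illustration):
suppose a new extent map covers [0, 128K) and we drop the range
[32K, 96K). Both a front and a back split are needed, and if the split
allocations fail the whole map is removed, so a fast fsync would miss
[0, 32K) and [96K, 128K). A small standalone C program showing the
range math:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* Hypothetical new extent map and drop range, in bytes. */
	uint64_t em_start = 0, em_len = 128 * 1024;      /* [0, 128K)  */
	uint64_t start = 32 * 1024, end = 96 * 1024 - 1; /* [32K, 96K) */
	uint64_t len = end - start + 1;

	/* A front split is needed because the extent map starts before
	 * the range, and a back split because it ends after the range. */
	printf("front split: [%llu, %llu)\n",
	       (unsigned long long)em_start, (unsigned long long)start);
	printf("back split:  [%llu, %llu)\n",
	       (unsigned long long)(start + len),
	       (unsigned long long)(em_start + em_len));

	/* If allocating either split fails and the whole map is removed,
	 * both subranges above vanish from the list of modified extents,
	 * hence the fallback to marking the inode for a full fsync. */
	return 0;
}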

Also, in the case where we could allocate extent maps for the split
operations but then fail to add a split extent map to the tree, mark the
inode for a full fsync as well. This is not supposed to ever fail, and we
assert that, but in case assertions are disabled (CONFIG_BTRFS_ASSERT is
not set), it's the correct thing to do to make sure a fast fsync will not
miss a new extent.
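
A minimal userspace sketch of that defensive pattern (an assertion for
debug builds plus a runtime fallback for builds without assertions);
the names below are hypothetical stand-ins for the kernel helpers, and
the actual patch uses WARN_ON() for the runtime check:

#include <stdbool.h>
#include <stdio.h>

/* Stand-in for CONFIG_BTRFS_ASSERT: remove to model a build with
 * assertions disabled. */
#define DEMO_ASSERT_ENABLED

#ifdef DEMO_ASSERT_ENABLED
#define ASSERT(cond) \
	do { if (!(cond)) fprintf(stderr, "assertion failed\n"); } while (0)
#else
#define ASSERT(cond) do { } while (0)
#endif

static bool full_sync_flag;

static void set_inode_full_sync(void)
{
	full_sync_flag = true;
}

static void handle_add_mapping_result(int ret, bool modified)
{
	/* Logic error, shouldn't happen. */
	ASSERT(ret == 0);
	/* Even with assertions compiled out, make sure a fast fsync
	 * cannot miss the extent: fall back to a full fsync. */
	if (ret != 0 && modified)
		set_inode_full_sync();
}

int main(void)
{
	handle_add_mapping_result(-17 /* e.g. -EEXIST */, true);
	printf("full sync flagged: %d\n", full_sync_flag);
	return 0;
}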

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/file.c | 58 +++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 46 insertions(+), 12 deletions(-)

Comments

Anand Jain Sept. 20, 2022, 10:19 a.m. UTC | #1
On 19/09/2022 22:06, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When dropping extent maps for a range, through btrfs_drop_extent_cache(),
> if we find an extent map that starts before our target range and/or ends
> after the target range, and we are not able to allocate extent maps for
> splitting that extent map, then we don't fail and simply remove the entire
> extent map from the inode's extent map tree.


> +		if (testend && em->start + em->len > start + len)

  %len comes from

  	u64 len = end - start + 1;

  IMO >= was correct here, because %start + %len is already after the
  range, as in the original code. No?

> +			ends_after_range = true;
>   		flags = em->flags;
>   		gen = em->generation;
>   		if (skip_pinned && test_bit(EXTENT_FLAG_PINNED, &em->flags)) {
> -			if (testend && em->start + em->len >= start + len) {
> +			if (ends_after_range) {
>   				free_extent_map(em);
>   				write_unlock(&em_tree->lock);
>   				break;
Filipe Manana Sept. 20, 2022, 10:27 a.m. UTC | #2
On Tue, Sep 20, 2022 at 11:19 AM Anand Jain <anand.jain@oracle.com> wrote:
>
> On 19/09/2022 22:06, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > When dropping extent maps for a range, through btrfs_drop_extent_cache(),
> > if we find an extent map that starts before our target range and/or ends
> > after the target range, and we are not able to allocate extent maps for
> > splitting that extent map, then we don't fail and simply remove the entire
> > extent map from the inode's extent map tree.
>
>
> > +             if (testend && em->start + em->len > start + len)
>
>   %len comes from
>
>         u64 len = end - start + 1;
>
>   IMO >= was correct here, because %start + %len is already after the
>   range, as in the original code. No?

No, > is the correct thing to do. It only matters if the extent map's
range ends after our range.
Try the math with a simple example, with a start of 0 and a length of
4096 (end offset is 4095), and you'll see.

Thanks.

>
> > +                     ends_after_range = true;
> >               flags = em->flags;
> >               gen = em->generation;
> >               if (skip_pinned && test_bit(EXTENT_FLAG_PINNED, &em->flags)) {
> > -                     if (testend && em->start + em->len >= start + len) {
> > +                     if (ends_after_range) {
> >                               free_extent_map(em);
> >                               write_unlock(&em_tree->lock);
> >                               break;
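
To spell out that arithmetic, a minimal sketch in standalone C of the
boundary check, using the numbers above (start = 0, len = 4096):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t start = 0;
	uint64_t len = 4096; /* end offset is start + len - 1 = 4095 */

	/* An extent map ending exactly at the range end: no part of it
	 * lies beyond the range, so no back split is needed. */
	uint64_t em_start = 0, em_len = 4096;

	/* '>' correctly says the map does not extend past the range... */
	assert(!(em_start + em_len > start + len));
	/* ...while '>=' is true here, i.e. it would wrongly treat the
	 * map as extending beyond the range. */
	assert(em_start + em_len >= start + len);

	/* A map that really ends after the range (last byte at 8191). */
	em_len = 8192;
	assert(em_start + em_len > start + len);

	printf("'>' is the correct boundary comparison\n");
	return 0;
}
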
Anand Jain Sept. 21, 2022, 11:11 a.m. UTC | #3
On 20/09/2022 18:27, Filipe Manana wrote:
> On Tue, Sep 20, 2022 at 11:19 AM Anand Jain <anand.jain@oracle.com> wrote:
>>
>> On 19/09/2022 22:06, fdmanana@kernel.org wrote:
>>> From: Filipe Manana <fdmanana@suse.com>
>>>
>>> When dropping extent maps for a range, through btrfs_drop_extent_cache(),
>>> if we find an extent map that starts before our target range and/or ends
>>> after the target range, and we are not able to allocate extent maps for
>>> splitting that extent map, then we don't fail and simply remove the entire
>>> extent map from the inode's extent map tree.
>>
>>
>>> +             if (testend && em->start + em->len > start + len)
>>
>>    %len comes from
>>
>>          u64 len = end - start + 1;
>>
>>    IMO >= was correct here, because %start + %len is already after the
>>    range, as in the original code. No?
> 
> No, > is the correct thing to do. It only matters if the extent map's
> range ends after our range.

  Got it.

> Try the math with a simple example, with a start of 0 and a length of
> 4096 (end offset is 4095), and you'll see.

Thanks.


> Thanks.
> 
>>
>>> +                     ends_after_range = true;
>>>                flags = em->flags;
>>>                gen = em->generation;
>>>                if (skip_pinned && test_bit(EXTENT_FLAG_PINNED, &em->flags)) {
>>> -                     if (testend && em->start + em->len >= start + len) {
>>> +                     if (ends_after_range) {
>>>                                free_extent_map(em);
>>>                                write_unlock(&em_tree->lock);
>>>                                break;
Anand Jain Sept. 21, 2022, 11:11 a.m. UTC | #4
On 19/09/2022 22:06, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When dropping extent maps for a range, through btrfs_drop_extent_cache(),
> if we find an extent map that starts before our target range and/or ends
> after the target range, and we are not able to allocate extent maps for
> splitting that extent map, then we don't fail and simply remove the entire
> extent map from the inode's extent map tree.
> 
> This is generally fine, because in case anyone needs to access the extent
> map, it can just load it again later from the respective file extent
> item(s) in the subvolume btree. However, if that extent map is new and is
> in the list of modified extents, then a fast fsync will miss the parts of
> the extent that were outside our range (that needed to be split),
> therefore not logging them. Fix that by marking the inode for a full
> fsync. This issue was introduced when commit 7014cdb49305ed ("Btrfs:
> btrfs_drop_extent_cache should never fail") removed the BUG_ON()s that
> triggered when the split extent map allocations failed, back in 2012,
> at which point the fast fsync path already existed but was still very
> recent.
> 
> Also, in the case where we could allocate extent maps for the split
> operations but then fail to add a split extent map to the tree, mark the
> inode for a full fsync as well. This is not supposed to ever fail, and we
> assert that, but in case assertions are disabled (CONFIG_BTRFS_ASSERT is
> not set), it's the correct thing to do to make sure a fast fsync will not
> miss a new extent.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Anand Jain <anand.jain@oracle.com>

Patch

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index fea508a35900..0e52292cf8bc 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -523,6 +523,7 @@ void btrfs_drop_extent_cache(struct btrfs_inode *inode, u64 start, u64 end,
 		testend = 0;
 	}
 	while (1) {
+		bool ends_after_range = false;
 		int no_splits = 0;
 
 		modified = false;
@@ -539,10 +540,12 @@ void btrfs_drop_extent_cache(struct btrfs_inode *inode, u64 start, u64 end,
 			write_unlock(&em_tree->lock);
 			break;
 		}
+		if (testend && em->start + em->len > start + len)
+			ends_after_range = true;
 		flags = em->flags;
 		gen = em->generation;
 		if (skip_pinned && test_bit(EXTENT_FLAG_PINNED, &em->flags)) {
-			if (testend && em->start + em->len >= start + len) {
+			if (ends_after_range) {
 				free_extent_map(em);
 				write_unlock(&em_tree->lock);
 				break;
@@ -592,7 +595,7 @@ void btrfs_drop_extent_cache(struct btrfs_inode *inode, u64 start, u64 end,
 			split = split2;
 			split2 = NULL;
 		}
-		if (testend && em->start + em->len > start + len) {
+		if (ends_after_range) {
 			u64 diff = start + len - em->start;
 
 			split->start = start + len;
@@ -630,14 +633,42 @@ void btrfs_drop_extent_cache(struct btrfs_inode *inode, u64 start, u64 end,
 			} else {
 				ret = add_extent_mapping(em_tree, split,
 							 modified);
-				ASSERT(ret == 0); /* Logic error */
+				/* Logic error, shouldn't happen. */
+				ASSERT(ret == 0);
+				if (WARN_ON(ret != 0) && modified)
+					btrfs_set_inode_full_sync(inode);
 			}
 			free_extent_map(split);
 			split = NULL;
 		}
 next:
-		if (extent_map_in_tree(em))
+		if (extent_map_in_tree(em)) {
+			/*
+			 * If the extent map is still in the tree it means that
+			 * either of the following is true:
+			 *
+			 * 1) It fits entirely in our range (doesn't end beyond
+			 *    it or starts before it);
+			 *
+			 * 2) It starts before our range and/or ends after our
+			 *    range, and we were not able to allocate the extent
+			 *    maps for split operations, @split and @split2.
+			 *
+			 * If we are at case 2) then we just remove the entire
+			 * extent map - this is fine, since anyone who needs to
+			 * access the subranges outside our range will just
+			 * load it again from the subvolume tree's file extent
+			 * item. However if the extent map was in the list of
+			 * modified extents, then we must mark the inode for a
+			 * full fsync, otherwise a fast fsync will miss this
+			 * extent if it's new and needs to be logged.
+			 */
+			if ((em->start < start || ends_after_range) && modified) {
+				ASSERT(no_splits);
+				btrfs_set_inode_full_sync(inode);
+			}
 			remove_extent_mapping(em_tree, em);
+		}
 		write_unlock(&em_tree->lock);
 
 		/* once for us */
@@ -2199,14 +2230,6 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 
 	atomic_inc(&root->log_batch);
 
-	/*
-	 * Always check for the full sync flag while holding the inode's lock,
-	 * to avoid races with other tasks. The flag must be either set all the
-	 * time during logging or always off all the time while logging.
-	 */
-	full_sync = test_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
-			     &BTRFS_I(inode)->runtime_flags);
-
 	/*
 	 * Before we acquired the inode's lock and the mmap lock, someone may
 	 * have dirtied more pages in the target range. We need to make sure
@@ -2231,6 +2254,17 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 		goto out;
 	}
 
+	/*
+	 * Always check for the full sync flag while holding the inode's lock,
+	 * to avoid races with other tasks. The flag must be either set all the
+	 * time during logging or always off all the time while logging.
+	 * We check the flag here after starting delalloc above, because when
+	 * running delalloc the full sync flag may be set if we need to drop
+	 * extra extent map ranges due to temporary memory allocation failures.
+	 */
+	full_sync = test_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
+			     &BTRFS_I(inode)->runtime_flags);
+
 	/*
 	 * We have to do this here to avoid the priority inversion of waiting on
 	 * IO of a lower priority task while holding a transaction open.
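
The net effect of those last two hunks is that the full sync flag is
now sampled only after delalloc has been started, since running
delalloc can itself set the flag. A minimal single-threaded model of
why the old ordering could miss the flag (helper names here are
hypothetical, not the kernel's):

#include <stdbool.h>
#include <stdio.h>

static bool needs_full_sync;

/* Model of starting ordered ops / delalloc: dropping extent map ranges
 * during delalloc can hit a temporary allocation failure and flag the
 * inode for a full sync. */
static void start_ordered_ops(void)
{
	needs_full_sync = true;
}

int main(void)
{
	bool full_sync;

	/* Old order: the flag is sampled before delalloc runs. */
	needs_full_sync = false;
	full_sync = needs_full_sync;
	start_ordered_ops();
	printf("old order sees full_sync=%d (flag missed)\n", full_sync);

	/* New order: delalloc first, then sample the flag. */
	needs_full_sync = false;
	start_ordered_ops();
	full_sync = needs_full_sync;
	printf("new order sees full_sync=%d\n", full_sync);
	return 0;
}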