[v5,4/5] btrfs: fix crash with non-zero pre in btrfs_split_ordered_extent

Message ID: b8c66eeedfcea2c068befcd40de6a95cf5d64d7b.1679512207.git.boris@bur.io
State: New, archived
Series: btrfs: fix corruption caused by partial dio writes

Commit Message

Boris Burkov March 22, 2023, 7:11 p.m. UTC
If pre != 0 in btrfs_split_ordered_extent(), then we do the following:
1. remove ordered (at file_offset) from the rb tree
2. modify file_offset += pre
3. re-insert ordered
4. clone an ordered extent at offset 0, length pre, from ordered.
5. clone an ordered extent for the post range, if necessary.

Step 4 is not correct: at this point, the start of ordered is already
the end of the desired new pre extent. Further, this causes a panic when
btrfs_alloc_ordered_extent() sees that the node (from the modified and
re-inserted ordered) is already present at file_offset + 0 = file_offset.
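
To make the off-by-pre concrete, here is a walk-through with made-up
numbers (illustrative only, not from an actual report):

	ordered->file_offset = 16K, num_bytes = 12K, pre = 4K

	step 2: ordered->file_offset += pre  -> ordered now starts at 20K
	step 3: re-insert ordered at 20K
	step 4: clone_ordered_extent(ordered, 0, pre)
	        -> clone file_offset = ordered->file_offset + 0 = 20K

	The clone lands at 20K and collides with the re-inserted ordered,
	instead of covering the intended pre range [16K, 20K).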

We can fix this by either using a negative offset, or by moving the
clone of the pre extent to after we remove the original one, but before
we modify and re-insert it. The former feels quite kludgy, as we are
"cloning" from outside the range of the ordered extent, so opt for the
latter, which does have some locking annoyances.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/ordered-data.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

Comments

Naohiro Aota March 23, 2023, 8:36 a.m. UTC | #1
On Wed, Mar 22, 2023 at 12:11:51PM -0700, Boris Burkov wrote:
> If pre != 0 in btrfs_split_ordered_extent(), then we do the following:
> 1. remove ordered (at file_offset) from the rb tree
> 2. modify file_offset += pre
> 3. re-insert ordered
> 4. clone an ordered extent at offset 0, length pre, from ordered.
> 5. clone an ordered extent for the post range, if necessary.
> 
> Step 4 is not correct: at this point, the start of ordered is already
> the end of the desired new pre extent. Further, this causes a panic when
> btrfs_alloc_ordered_extent() sees that the node (from the modified and
> re-inserted ordered) is already present at file_offset + 0 = file_offset.
> 
> We can fix this by either using a negative offset, or by moving the
> clone of the pre extent to after we remove the original one, but before
> we modify and re-insert it. The former feels quite kludgy, as we are
> "cloning" from outside the range of the ordered extent, so opt for the
> latter, which does have some locking annoyances.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/ordered-data.c | 20 +++++++++++++-------
>  1 file changed, 13 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index 4bebebb9b434..d14a3fe1a113 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -1161,6 +1161,17 @@ int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered,
>  	if (tree->last == node)
>  		tree->last = NULL;
>  
> +	if (pre) {
> +		spin_unlock_irq(&tree->lock);
> +		oe = clone_ordered_extent(ordered, 0, pre);
> +		ret = IS_ERR(oe) ? PTR_ERR(oe) : 0;
> +		if (!ret && ret_pre)
> +			*ret_pre = oe;
> +		if (ret)
> +			goto out;

How about just "return ret;"?
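
That is, roughly (a sketch; nothing else changes, and the out: label at
the end of the function then becomes unnecessary, since this is its
only user):

	if (pre) {
		spin_unlock_irq(&tree->lock);
		oe = clone_ordered_extent(ordered, 0, pre);
		ret = IS_ERR(oe) ? PTR_ERR(oe) : 0;
		if (!ret && ret_pre)
			*ret_pre = oe;
		if (ret)
			return ret;
		spin_lock_irq(&tree->lock);
	}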

> +		spin_lock_irq(&tree->lock);

I'm concerned about the locking too. Before this spin_lock_irq() is
taken, there is nothing in the tree for the ordered extent range. So,
maybe, someone might insert or look up that range in the meantime, and
fail? Well, this function is called under the IO for this range, so it
might be OK, though...
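
The window, as an illustrative interleaving (not an observed trace):

    CPU A (btrfs_split_ordered_extent)    CPU B
    remove ordered from the tree
    spin_unlock_irq(&tree->lock)
                                          look up any offset in the
                                          original OE's range
                                          -> finds no ordered extent
    oe = clone_ordered_extent(...)
    spin_lock_irq(&tree->lock)
    modify and re-insert ordered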

So, I considered another approach: factoring out some parts of
btrfs_add_ordered_extent() and using them to rewrite
btrfs_split_ordered_extent().

btrfs_add_ordered_extent() is doing three things:

1. Space accounting
   - btrfs_qgroup_free_data() or btrfs_qgroup_release_data()
   - percpu_counter_add_batch(...)
2. Allocating and initializing btrfs_ordered_extent
3. Adding the btrfs_ordered_extent entry to trees, incrementing OE counter
   - tree_insert(&tree->tree, ...)
   - list_add_tail(&entry->root_extent_list, &root->ordered_extents);
   - btrfs_mod_outstanding_extents(inode, 1);

For btrfs_split_ordered_extent(), we don't need to do #1 above as it was
already done for the original ordered extent. So, if we factor #2 out
into a function, e.g. init_ordered_extent(), we can rewrite
clone_ordered_extent() to return a cloned OE (doing #2), and also
rewrite btrfs_split_ordered_extent() to do #3 as follows:

/* clone_ordered_extent() now returns new ordered extent. */
/* It is not inserted into the trees, yet. */
static struct btrfs_ordered_extent *clone_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pos,
				u64 len)
{
	struct inode *inode = ordered->inode;
	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
	u64 file_offset = ordered->file_offset + pos;
	u64 disk_bytenr = ordered->disk_bytenr + pos;
	unsigned long flags = ordered->flags & BTRFS_ORDERED_TYPE_FLAGS;

	WARN_ON_ONCE(flags & (1 << BTRFS_ORDERED_COMPRESSED));
	return init_ordered_extent(BTRFS_I(inode), file_offset, len, len,
				   disk_bytenr, len, 0, flags,
				   ordered->compress_type);
}

int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre,
				u64 post)
{
...
	/* clone new OEs first */
	if (pre)
		pre_oe = clone_ordered_extent(ordered, 0, pre);
	if (post)
		post_oe = clone_ordered_extent(ordered, ordered->disk_num_bytes - post, post);
	/* check pre_oe and post_oe */

	spin_lock_irq(&tree->lock);
	/* remove original OE from tree */
	...

	/* modify the original OE params */
	ordered->file_offset += pre;
	...

	/* Re-insert the original OE */
	node = tree_insert(&tree->tree, ordered->file_offset, &ordered->rb_node);
	...

	/* Insert new OEs */
	if (pre_oe)
		node = tree_insert(...);
	...
	spin_unlock_irq(&tree->lock);

	/* And, do the root->ordered_extents and outstanding_extents works */
	...
}

With this approach, no one can see the intermediate state in which an OE
is missing for some area of the original OE range.
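
For reference, the factored-out helper could look roughly like the
below. This is only a sketch: it lifts the allocate-and-initialize half
(#2) out of the current btrfs_add_ordered_extent(), so the exact
parameter and field list would mirror whatever that function sets
today. Tree insertion and the counters (#3) deliberately stay in the
caller.

static struct btrfs_ordered_extent *init_ordered_extent(
		struct btrfs_inode *inode, u64 file_offset, u64 num_bytes,
		u64 ram_bytes, u64 disk_bytenr, u64 disk_num_bytes,
		u64 offset, unsigned long flags, int compress_type)
{
	struct btrfs_ordered_extent *entry;

	/* Allocate and initialize only; no tree/list insertion (#3). */
	entry = kmem_cache_zalloc(btrfs_ordered_extent_cache, GFP_NOFS);
	if (!entry)
		return ERR_PTR(-ENOMEM);

	entry->file_offset = file_offset;
	entry->num_bytes = num_bytes;
	entry->ram_bytes = ram_bytes;
	entry->disk_bytenr = disk_bytenr;
	entry->disk_num_bytes = disk_num_bytes;
	entry->offset = offset;
	entry->bytes_left = num_bytes;
	entry->inode = igrab(&inode->vfs_inode);
	entry->compress_type = compress_type;
	entry->flags = flags;
	refcount_set(&entry->refs, 1);
	init_waitqueue_head(&entry->wait);
	INIT_LIST_HEAD(&entry->list);
	INIT_LIST_HEAD(&entry->log_list);
	INIT_LIST_HEAD(&entry->root_extent_list);
	INIT_LIST_HEAD(&entry->work_list);
	init_completion(&entry->completion);

	return entry;
}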

> +	}
> +
>  	ordered->file_offset += pre;
>  	ordered->disk_bytenr += pre;
>  	ordered->num_bytes -= (pre + post);
> @@ -1176,18 +1187,13 @@ int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered,
>  
>  	spin_unlock_irq(&tree->lock);
>  
> -	if (pre) {
> -		oe = clone_ordered_extent(ordered, 0, pre);
> -		ret = IS_ERR(oe) ? PTR_ERR(oe) : 0;
> -		if (!ret && ret_pre)
> -			*ret_pre = oe;
> -	}
> -	if (!ret && post) {
> +	if (post) {
>  		oe = clone_ordered_extent(ordered, pre + ordered->disk_num_bytes, post);
>  		ret = IS_ERR(oe) ? PTR_ERR(oe) : 0;
>  		if (!ret && ret_post)
>  			*ret_post = oe;
>  	}
> +out:
>  	return ret;
>  }
>  
> -- 
> 2.38.1
>
Boris Burkov March 23, 2023, 4:22 p.m. UTC | #2
On Thu, Mar 23, 2023 at 08:36:08AM +0000, Naohiro Aota wrote:
> On Wed, Mar 22, 2023 at 12:11:51PM -0700, Boris Burkov wrote:
> > If pre != 0 in btrfs_split_ordered_extent(), then we do the following:
> > 1. remove ordered (at file_offset) from the rb tree
> > 2. modify file_offset += pre
> > 3. re-insert ordered
> > 4. clone an ordered extent at offset 0, length pre, from ordered.
> > 5. clone an ordered extent for the post range, if necessary.
> > 
> > Step 4 is not correct: at this point, the start of ordered is already
> > the end of the desired new pre extent. Further, this causes a panic when
> > btrfs_alloc_ordered_extent() sees that the node (from the modified and
> > re-inserted ordered) is already present at file_offset + 0 = file_offset.
> > 
> > We can fix this by either using a negative offset, or by moving the
> > clone of the pre extent to after we remove the original one, but before
> > we modify and re-insert it. The former feels quite kludgy, as we are
> > "cloning" from outside the range of the ordered extent, so opt for the
> > latter, which does have some locking annoyances.
> > 
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> >  fs/btrfs/ordered-data.c | 20 +++++++++++++-------
> >  1 file changed, 13 insertions(+), 7 deletions(-)
> > 
> > diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> > index 4bebebb9b434..d14a3fe1a113 100644
> > --- a/fs/btrfs/ordered-data.c
> > +++ b/fs/btrfs/ordered-data.c
> > @@ -1161,6 +1161,17 @@ int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered,
> >  	if (tree->last == node)
> >  		tree->last = NULL;
> >  
> > +	if (pre) {
> > +		spin_unlock_irq(&tree->lock);
> > +		oe = clone_ordered_extent(ordered, 0, pre);
> > +		ret = IS_ERR(oe) ? PTR_ERR(oe) : 0;
> > +		if (!ret && ret_pre)
> > +			*ret_pre = oe;
> > +		if (ret)
> > +			goto out;
> 
> How about just "return ret;"?
> 
> > +		spin_lock_irq(&tree->lock);
> 
> I'm concerned about the locking too. Before this spin_lock_irq() is
> taken, there is nothing in the tree for the ordered extent range. So,
> maybe, someone might insert or look up that range in the meantime, and
> fail? Well, this function is called under the IO for this range, so it
> might be OK, though...
> 
> So, I considered another approach: factoring out some parts of
> btrfs_add_ordered_extent() and using them to rewrite
> btrfs_split_ordered_extent().
> 
> btrfs_add_ordered_extent() is doing three things:
> 
> 1. Space accounting
>    - btrfs_qgroup_free_data() or btrfs_qgroup_release_data()
>    - percpu_counter_add_batch(...)
> 2. Allocating and initializing btrfs_ordered_extent
> 3. Adding the btrfs_ordered_extent entry to trees, incrementing OE counter
>    - tree_insert(&tree->tree, ...)
>    - list_add_tail(&entry->root_extent_list, &root->ordered_extents);
>    - btrfs_mod_outstanding_extents(inode, 1);
> 
> For btrfs_split_ordered_extent(), we don't need to do #1 above as it was
> already done for the original ordered extent. So, if we factor #2 out
> into a function, e.g. init_ordered_extent(), we can rewrite
> clone_ordered_extent() to return a cloned OE (doing #2), and also
> rewrite btrfs_split_ordered_extent() to do #3 as follows:
> 
> /* clone_ordered_extent() now returns new ordered extent. */
> /* It is not inserted into the trees, yet. */
> static struct btrfs_ordered_extent *clone_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pos,
> 				u64 len)
> {
> 	struct inode *inode = ordered->inode;
> 	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
> 	u64 file_offset = ordered->file_offset + pos;
> 	u64 disk_bytenr = ordered->disk_bytenr + pos;
> 	unsigned long flags = ordered->flags & BTRFS_ORDERED_TYPE_FLAGS;
> 
> 	WARN_ON_ONCE(flags & (1 << BTRFS_ORDERED_COMPRESSED));
> 	return init_ordered_extent(BTRFS_I(inode), file_offset, len, len,
> 				   disk_bytenr, len, 0, flags,
> 				   ordered->compress_type);
> }
> 
> int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre,
> 				u64 post)
> {
> ...
> 	/* clone new OEs first */
> 	if (pre)
> 		pre_oe = clone_ordered_extent(ordered, 0, pre);
> 	if (post)
> 		post_oe = clone_ordered_extent(ordered, ordered->disk_num_bytes - post, post);
> 	/* check pre_oe and post_oe */
> 
> 	spin_lock_irq(&tree->lock);
> 	/* remove original OE from tree */
> 	...
> 
> 	/* modify the original OE params */
> 	ordered->file_offset += pre;
> 	...
> 
> 	/* Re-insert the original OE */
> 	node = tree_insert(&tree->tree, ordered->file_offset, &ordered->rb_node);
> 	...
> 
> 	/* Insert new OEs */
> 	if (pre_oe)
> 		node = tree_insert(...);
> 	...
> 	spin_unlock_irq(&tree->lock);
> 
> 	/* And, do the root->ordered_extents and outstanding_extents works */
> 	...
> }
> 
> With this approach, no one can see the intermediate state in which an OE
> is missing for some area of the original OE range.

I like this solution. I think it is nice to split it up so that the
three steps are separate: i.e., initialize the two new OEs with the old
state, then modify the middle OE with the new state and re-insert the
new OEs together. And everything after the initialization can be under
the lock.

However, based on Christoph's response, I would lean towards getting rid
of the three way split altogether. I would love to hear your thoughts in
that thread as well, before committing to that, though.

If we do keep the three-way split, then I will definitely implement your
idea here; I think it's nicer than the weird lock dropping/re-taking
stuff I was doing.

Thanks,
Boris

> 
> > +	}
> > +
> >  	ordered->file_offset += pre;
> >  	ordered->disk_bytenr += pre;
> >  	ordered->num_bytes -= (pre + post);
> > @@ -1176,18 +1187,13 @@ int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered,
> >  
> >  	spin_unlock_irq(&tree->lock);
> >  
> > -	if (pre) {
> > -		oe = clone_ordered_extent(ordered, 0, pre);
> > -		ret = IS_ERR(oe) ? PTR_ERR(oe) : 0;
> > -		if (!ret && ret_pre)
> > -			*ret_pre = oe;
> > -	}
> > -	if (!ret && post) {
> > +	if (post) {
> >  		oe = clone_ordered_extent(ordered, pre + ordered->disk_num_bytes, post);
> >  		ret = IS_ERR(oe) ? PTR_ERR(oe) : 0;
> >  		if (!ret && ret_post)
> >  			*ret_post = oe;
> >  	}
> > +out:
> >  	return ret;
> >  }
> >  
> > -- 
> > 2.38.1
> >

Patch

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 4bebebb9b434..d14a3fe1a113 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -1161,6 +1161,17 @@  int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered,
 	if (tree->last == node)
 		tree->last = NULL;
 
+	if (pre) {
+		spin_unlock_irq(&tree->lock);
+		oe = clone_ordered_extent(ordered, 0, pre);
+		ret = IS_ERR(oe) ? PTR_ERR(oe) : 0;
+		if (!ret && ret_pre)
+			*ret_pre = oe;
+		if (ret)
+			goto out;
+		spin_lock_irq(&tree->lock);
+	}
+
 	ordered->file_offset += pre;
 	ordered->disk_bytenr += pre;
 	ordered->num_bytes -= (pre + post);
@@ -1176,18 +1187,13 @@  int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered,
 
 	spin_unlock_irq(&tree->lock);
 
-	if (pre) {
-		oe = clone_ordered_extent(ordered, 0, pre);
-		ret = IS_ERR(oe) ? PTR_ERR(oe) : 0;
-		if (!ret && ret_pre)
-			*ret_pre = oe;
-	}
-	if (!ret && post) {
+	if (post) {
 		oe = clone_ordered_extent(ordered, pre + ordered->disk_num_bytes, post);
 		ret = IS_ERR(oe) ? PTR_ERR(oe) : 0;
 		if (!ret && ret_post)
 			*ret_post = oe;
 	}
+out:
 	return ret;
 }