[2/2] btrfs: reserve delalloc metadata differently

Message ID 20190410195610.84110-3-josef@toxicpanda.com (mailing list archive)
State New, archived
Series ENOSPC refinements

Commit Message

Josef Bacik April 10, 2019, 7:56 p.m. UTC
With the per-inode block rsvs we started refilling the reserve based on
the calculated size of the outstanding csum bytes and extents for the
inode, including the amount we were adding with the new operation.

However generic/224 exposed a problem with this approach.  With 1000
files all writing at the same time we ended up with a bunch of bytes
being reserved but unusable.

When you write to a file we reserve space for the csum leaves for those
bytes, the number of extent items required to cover those bytes, and a
single credit for updating the inode at ordered extent finish for that
range of bytes.  This is held until the ordered extent finishes and we
release all of the reserved space.

If a second write comes in at this point we would add a single
reservation for the new outstanding extent and however many reservations
for the csum leaves.  At this point we find the delta of how much we
have reserved and how much outstanding size this is and attempt to
reserve this delta.  If the first write finishes it will not release any
space, because the space it had reserved for the initial write is still
needed for the second write.  However some space would have been used,
as we have added csums, extent items, and dirtied the inode.  Our
reserved space would be > 0 but < the total needed reserved space.

This is just for a single inode; now consider generic/224.  This has
1000 inodes writing in parallel to a very small file system, 1gib.  In
my testing this usually means we get about a 120mib metadata area to
work with, more than enough to allow the writes to continue, but not
enough if all of the inodes are stuck trying to reserve the slack space
while continuing to hold their leftovers from their initial writes.

Fix this by pre-reserving _only_ the space we are currently trying to
add.  Then once that is successful modify our inode's csum count and
outstanding extents, and then add the newly reserved space to the inode's
block_rsv.  This allows us to actually pass generic/224 without running
out of metadata space.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent-tree.c | 145 ++++++++++++++++++-------------------------------
 1 file changed, 53 insertions(+), 92 deletions(-)

Comments

Nikolay Borisov April 12, 2019, 1:06 p.m. UTC | #1
On 10.04.19 г. 22:56 ч., Josef Bacik wrote:
> With the per-inode block rsvs we started refilling the reserve based on
> the calculated size of the outstanding csum bytes and extents for the
> inode, including the amount we were adding with the new operation.
> 
> However generic/224 exposed a problem with this approach.  With 1000
> files all writing at the same time we ended up with a bunch of bytes
> being reserved but unusable.
> 
> When you write to a file we reserve space for the csum leaves for those
> bytes, the number of extent items required to cover those bytes, and a
> single credit for updating the inode at ordered extent finish for that
> range of bytes.  This is held until the ordered extent finishes and we
> release all of the reserved space.
> 
> If a second write comes in at this point we would add a single
> reservation for the new outstanding extent and however many reservations
> for the csum leaves.  

If a second write comes in we won't do anything different from the first,
i.e. calculate the number of extent items + csum bytes required, add
them to the block reservation and call btrfs_inode_rsv_refill, which
should refill the delta necessary for the 2nd write.


> At this point we find the delta of how much we
> have reserved and how much outstanding size this is and attempt to
> reserve this delta.  If the first write finishes it will not release any
> space, because the space it had reserved for the initial write is still
> needed for the second write.  However some space would have been used,

Each and every reservation is responsible for itself.  How would the
first one know that some of its space is required for the second, and
hence not release it?


> as we have added csums, extent items, and dirtied the inode.  Our
> reserved space would be > 0 but < the total needed reserved space.
> 
> This is just for a single inode, now consider generic/224.  This has
> 1000 inodes writing in parallel to a very small file system, 1gib.  In
> my testing this usually means we get about a 120mib metadata area to
> work with, more than enough to allow the writes to continue, but not
> enough if all of the inodes are stuck trying to reserve the slack space
> while continuing to hold their leftovers from their initial writes.
> 
> Fix this by pre-reserving _only_ the space we are currently trying to
> add.  Then once that is successful modify our inode's csum count and
> outstanding extents, and then add the newly reserved space to the inode's
> block_rsv.  This allows us to actually pass generic/224 without running
> out of metadata space.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/extent-tree.c | 145 ++++++++++++++++++-------------------------------
>  1 file changed, 53 insertions(+), 92 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 0982456ebabb..9aff7a8817d9 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -5811,85 +5811,6 @@ int btrfs_block_rsv_refill(struct btrfs_root *root,
>  	return ret;
>  }
>  
> -static void calc_refill_bytes(struct btrfs_block_rsv *block_rsv,
> -				u64 *metadata_bytes, u64 *qgroup_bytes)
> -{
> -	*metadata_bytes = 0;
> -	*qgroup_bytes = 0;
> -
> -	spin_lock(&block_rsv->lock);
> -	if (block_rsv->reserved < block_rsv->size)
> -		*metadata_bytes = block_rsv->size - block_rsv->reserved;
> -	if (block_rsv->qgroup_rsv_reserved < block_rsv->qgroup_rsv_size)
> -		*qgroup_bytes = block_rsv->qgroup_rsv_size -
> -			block_rsv->qgroup_rsv_reserved;
> -	spin_unlock(&block_rsv->lock);
> -}
> -
> -/**
> - * btrfs_inode_rsv_refill - refill the inode block rsv.
> - * @inode - the inode we are refilling.
> - * @flush - the flushing restriction.
> - *
> - * Essentially the same as btrfs_block_rsv_refill, except it uses the
> - * block_rsv->size as the minimum size.  We'll either refill the missing amount
> - * or return if we already have enough space.  This will also handle the reserve
> - * tracepoint for the reserved amount.
> - */
> -static int btrfs_inode_rsv_refill(struct btrfs_inode *inode,
> -				  enum btrfs_reserve_flush_enum flush)
> -{
> -	struct btrfs_root *root = inode->root;
> -	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
> -	u64 num_bytes, last = 0;
> -	u64 qgroup_num_bytes;
> -	int ret = -ENOSPC;
> -
> -	calc_refill_bytes(block_rsv, &num_bytes, &qgroup_num_bytes);
> -	if (num_bytes == 0)
> -		return 0;
> -
> -	do {
> -		ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_num_bytes,
> -							 true);
> -		if (ret)
> -			return ret;
> -		ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
> -		if (ret) {
> -			btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes);
> -			last = num_bytes;
> -			/*
> -			 * If we are fragmented we can end up with a lot of
> -			 * outstanding extents which will make our size be much
> -			 * larger than our reserved amount.
> -			 *
> -			 * If the reservation happens here, it might be very
> -			 * big though not needed in the end, if the delalloc
> -			 * flushing happens.
> -			 *
> -			 * If this is the case try and do the reserve again.
> -			 */
> -			if (flush == BTRFS_RESERVE_FLUSH_ALL)
> -				calc_refill_bytes(block_rsv, &num_bytes,
> -						   &qgroup_num_bytes);
> -			if (num_bytes == 0)
> -				return 0;
> -		}
> -	} while (ret && last != num_bytes);
> -
> -	if (!ret) {
> -		block_rsv_add_bytes(block_rsv, num_bytes, false);
> -		trace_btrfs_space_reservation(root->fs_info, "delalloc",
> -					      btrfs_ino(inode), num_bytes, 1);
> -
> -		/* Don't forget to increase qgroup_rsv_reserved */
> -		spin_lock(&block_rsv->lock);
> -		block_rsv->qgroup_rsv_reserved += qgroup_num_bytes;
> -		spin_unlock(&block_rsv->lock);
> -	}
> -	return ret;
> -}
> -
>  static u64 __btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
>  				     struct btrfs_block_rsv *block_rsv,
>  				     u64 num_bytes, u64 *qgroup_to_release)
> @@ -6190,9 +6111,26 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
>  	spin_unlock(&block_rsv->lock);
>  }
>  
> +static inline void calc_inode_reservations(struct btrfs_fs_info *fs_info,
> +					   struct btrfs_inode *inode,
> +					   u64 num_bytes, u64 *meta_reserve,
> +					   u64 *qgroup_reserve)
> +{
> +	u64 nr_extents = count_max_extents(num_bytes);
> +	u64 csum_leaves = btrfs_csum_bytes_to_leaves(fs_info, num_bytes);
> +
> +	/* We add one for the inode update at finish ordered time. */
> +	*meta_reserve = btrfs_calc_trans_metadata_size(fs_info,
> +						nr_extents + csum_leaves + 1);
> +	*qgroup_reserve = nr_extents * fs_info->nodesize;
> +}
> +
>  int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
>  {
> -	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> +	struct btrfs_root *root = inode->root;
> +	struct btrfs_fs_info *fs_info = root->fs_info;
> +	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
> +	u64 meta_reserve, qgroup_reserve;
>  	unsigned nr_extents;
>  	enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_FLUSH_ALL;
>  	int ret = 0;
> @@ -6222,7 +6160,31 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
>  
>  	num_bytes = ALIGN(num_bytes, fs_info->sectorsize);
>  
> -	/* Add our new extents and calculate the new rsv size. */
> +	/*
> +	 * Josef, we always want to do it this way, every other way is wrong and
> +	 * ends in tears.  Pre-reserving the amount we are going to add will
> +	 * always be the right way, because otherwise if we have enough
> +	 * parallelism we could end up with thousands of inodes all holding
> +	 * little bits of reservations they were able to make previously and the
> +	 * only way to reclaim that space is to ENOSPC out the operations and
> +	 * clear everything out and try again, which is shitty.  This way we
> +	 * just over-reserve slightly, and clean up the mess when we are done.
> +	 */
> +	calc_inode_reservations(fs_info, inode, num_bytes, &meta_reserve,
> +				&qgroup_reserve);
> +	ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_reserve, true);
> +	if (ret)
> +		goto out_fail;
> +	ret = reserve_metadata_bytes(root, block_rsv, meta_reserve, flush);
> +	if (ret)
> +		goto out_qgroup;
> +
> +	/*
> +	 * Now we need to update our outstanding extents and csum bytes _first_
> +	 * and then add the reservation to the block_rsv.  This keeps us from
> +	 * racing with an ordered completion or some such that would think it
> +	 * needs to free the reservation we just made.
> +	 */
>  	spin_lock(&inode->lock);
>  	nr_extents = count_max_extents(num_bytes);
>  	btrfs_mod_outstanding_extents(inode, nr_extents);
> @@ -6230,22 +6192,21 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
>  	btrfs_calculate_inode_block_rsv_size(fs_info, inode);
>  	spin_unlock(&inode->lock);
>  
> -	ret = btrfs_inode_rsv_refill(inode, flush);
> -	if (unlikely(ret))
> -		goto out_fail;
> +	/* Now we can safely add our space to our block rsv. */
> +	block_rsv_add_bytes(block_rsv, meta_reserve, false);
> +	trace_btrfs_space_reservation(root->fs_info, "delalloc",
> +				      btrfs_ino(inode), meta_reserve, 1);
> +
> +	spin_lock(&block_rsv->lock);
> +	block_rsv->qgroup_rsv_reserved += qgroup_reserve;
> +	spin_unlock(&block_rsv->lock);
>  
>  	if (delalloc_lock)
>  		mutex_unlock(&inode->delalloc_mutex);
>  	return 0;
> -
> +out_qgroup:
> +	btrfs_qgroup_free_meta_prealloc(root, qgroup_reserve);
>  out_fail:
> -	spin_lock(&inode->lock);
> -	nr_extents = count_max_extents(num_bytes);
> -	btrfs_mod_outstanding_extents(inode, -nr_extents);
> -	inode->csum_bytes -= num_bytes;
> -	btrfs_calculate_inode_block_rsv_size(fs_info, inode);
> -	spin_unlock(&inode->lock);
> -
>  	btrfs_inode_rsv_release(inode, true);
>  	if (delalloc_lock)
>  		mutex_unlock(&inode->delalloc_mutex);
>
Josef Bacik April 12, 2019, 1:26 p.m. UTC | #2
On Fri, Apr 12, 2019 at 04:06:25PM +0300, Nikolay Borisov wrote:
> 
> 
> On 10.04.19 г. 22:56 ч., Josef Bacik wrote:
> > With the per-inode block rsvs we started refilling the reserve based on
> > the calculated size of the outstanding csum bytes and extents for the
> > inode, including the amount we were adding with the new operation.
> > 
> > However generic/224 exposed a problem with this approach.  With 1000
> > files all writing at the same time we ended up with a bunch of bytes
> > being reserved but unusable.
> > 
> > When you write to a file we reserve space for the csum leaves for those
> > bytes, the number of extent items required to cover those bytes, and a
> > single credit for updating the inode at ordered extent finish for that
> > range of bytes.  This is held until the ordered extent finishes and we
> > release all of the reserved space.
> > 
> > If a second write comes in at this point we would add a single
> > reservation for the new outstanding extent and however many reservations
> > for the csum leaves.  
> 
> If a second write comes we won't do anything different than the first
> i.e calculate the number of extent items + csums bytes required, add
> them to the block reservation and call btrfs_inode_rsv_refill which
> should refill the delta necessary for the 2nd write.
> 
> 
> At this point we find the delta of how much we
> > have reserved and how much outstanding size this is and attempt to
> > reserve this delta.  If the first write finishes it will not release any
> > space, because the space it had reserved for the initial write is still
> > needed for the second write.  However some space would have been used,
> 
> Each and every reservation is responsible for itself how come the first
> one will know some of its space is required for the second, hence it
> won't be released?
> 

Write 1 comes in, sets the size to 3mib, reserves 3mib.
Write 2 comes in, sets the size to 5mib, attempts to reserve 2mib.
  - can't reserve because there's not enough space, starts flushing.
Write 1 finishes, used 1mib of its 3mib reservation
Write 1 sets the size to 3mib
We still have 2mib in reserves, which is less than 3mib, so no bytes are
  released to the space info.

Now multiply this by 1000, you have 1000 files with 2mib sitting in their
reservations, but they need 2mib, and there's no space to be squeezed from the
rest of the fs, so they start to ENOSPC out one by one.

With the new thing we get this

Write 1 comes in, reserves 3mib, sets the size to 3mib.
Write 2 comes in, attempts to reserve 3mib.
  - can't reserve because there's not enough space, starts flushing.
Write 1 finishes, used 1mib of its 3mib reservation
Write 1 sets the size to 0mib
Write 1 releases 2mib to the space_info, allowing the next waiter to claim 2mib
  of whatever its reservation was.

Multiply this by 1000.  As we get smaller and smaller amounts of metadata space
to work with, we get fewer and fewer writes happening in parallel, because we
only have X inodes' worth of reservations in flight at any given time.  Thanks,

Josef
Nikolay Borisov April 12, 2019, 1:35 p.m. UTC | #3
On 12.04.19 г. 16:26 ч., Josef Bacik wrote:
> On Fri, Apr 12, 2019 at 04:06:25PM +0300, Nikolay Borisov wrote:
>>
>>
>> On 10.04.19 г. 22:56 ч., Josef Bacik wrote:
>>> With the per-inode block rsvs we started refilling the reserve based on
>>> the calculated size of the outstanding csum bytes and extents for the
>>> inode, including the amount we were adding with the new operation.
>>>
>>> However generic/224 exposed a problem with this approach.  With 1000
>>> files all writing at the same time we ended up with a bunch of bytes
>>> being reserved but unusable.
>>>
>>> When you write to a file we reserve space for the csum leaves for those
>>> bytes, the number of extent items required to cover those bytes, and a
>>> single credit for updating the inode at ordered extent finish for that
>>> range of bytes.  This is held until the ordered extent finishes and we
>>> release all of the reserved space.
>>>
>>> If a second write comes in at this point we would add a single
>>> reservation for the new outstanding extent and however many reservations
>>> for the csum leaves.  
>>
>> If a second write comes we won't do anything different than the first
>> i.e calculate the number of extent items + csums bytes required, add
>> them to the block reservation and call btrfs_inode_rsv_refill which
>> should refill the delta necessary for the 2nd write.
>>
>>
>> At this point we find the delta of how much we
>>> have reserved and how much outstanding size this is and attempt to
>>> reserve this delta.  If the first write finishes it will not release any
>>> space, because the space it had reserved for the initial write is still
>>> needed for the second write.  However some space would have been used,
>>
>> Each and every reservation is responsible for itself how come the first
>> one will know some of its space is required for the second, hence it
>> won't be released?
>>
> 
> Write 1 comes in, sets the size to 3mib, reserves 3mib.
> Write 2 comes in, sets the size to 5 mib, attempts to reserve 2mib.
>   - can't reserve because there's not enough space, starts flushing.
> Write 1 finishes, used 1mib of its 3mib reservation
> Write 1 sets the size to 3mib
> We still have 2mib in reserves, which is less than 3mib, so no bytes are
>   released to the space info.
> 
> Now multiply this by 1000, you have 1000 files with 2mib sitting in their
> reservations, but they need 2mib, and there's no space to be squeezed from the
> rest of the fs, so they start to ENOSPC out one by one.
> 
> With the new thing we get this
> 
> Write 1 comes in, reserves 3mib, sets the size to 3mib.
> Write 2 comes in, attempts to reserve 3mib.
>   - can't reserve because there's not enough space, starts flushing.

Um, no, you've removed btrfs_inode_rsv_refill so there is no flushing
happening in btrfs_delalloc_reserve_metadata whatsoever. Neither of the 2
remaining callers of btrfs_delalloc_reserve_metadata does any flushing
based on the retval of that function.

This actually means that you should also remove the 'flush' variable in
the same function, otherwise you are leaving an unused variable behind,
which is not nice.

> Write 1 finishes, used 1mib of its 3mib reservation
> Write 1 sets the size to 0mib

Reviewing the way outstanding_extents is used, I'm getting a little bit
confused: on the one hand we are modifying this value when we reserve
metadata, on the other hand we are also modifying the count when adding
ordered extents. Shouldn't that count have already been accounted for
when doing the initial metadata reservation?

> Write 1 releases 2mib to the space_info, allowing the next waiter to claim 2mib
>   of whatever its reservation was.
> 
> Multiply this by 1000.  As we get smaller and smaller amounts of metadata space
> to work with, we get less and less writes happening in parallel, because we only
> have X inodes worth of reservations to be in flight at any given time.  Thanks,
> 
> Josef
>
Josef Bacik April 12, 2019, 1:37 p.m. UTC | #4
On Fri, Apr 12, 2019 at 04:35:20PM +0300, Nikolay Borisov wrote:
> 
> 
> On 12.04.19 г. 16:26 ч., Josef Bacik wrote:
> > On Fri, Apr 12, 2019 at 04:06:25PM +0300, Nikolay Borisov wrote:
> >>
> >>
> >> On 10.04.19 г. 22:56 ч., Josef Bacik wrote:
> >>> With the per-inode block rsvs we started refilling the reserve based on
> >>> the calculated size of the outstanding csum bytes and extents for the
> >>> inode, including the amount we were adding with the new operation.
> >>>
> >>> However generic/224 exposed a problem with this approach.  With 1000
> >>> files all writing at the same time we ended up with a bunch of bytes
> >>> being reserved but unusable.
> >>>
> >>> When you write to a file we reserve space for the csum leaves for those
> >>> bytes, the number of extent items required to cover those bytes, and a
> >>> single credit for updating the inode at ordered extent finish for that
> >>> range of bytes.  This is held until the ordered extent finishes and we
> >>> release all of the reserved space.
> >>>
> >>> If a second write comes in at this point we would add a single
> >>> reservation for the new outstanding extent and however many reservations
> >>> for the csum leaves.  
> >>
> >> If a second write comes we won't do anything different than the first
> >> i.e calculate the number of extent items + csums bytes required, add
> >> them to the block reservation and call btrfs_inode_rsv_refill which
> >> should refill the delta necessary for the 2nd write.
> >>
> >>
> >> At this point we find the delta of how much we
> >>> have reserved and how much outstanding size this is and attempt to
> >>> reserve this delta.  If the first write finishes it will not release any
> >>> space, because the space it had reserved for the initial write is still
> >>> needed for the second write.  However some space would have been used,
> >>
> >> Each and every reservation is responsible for itself how come the first
> >> one will know some of its space is required for the second, hence it
> >> won't be released?
> >>
> > 
> > Write 1 comes in, sets the size to 3mib, reserves 3mib.
> > Write 2 comes in, sets the size to 5 mib, attempts to reserve 2mib.
> >   - can't reserve because there's not enough space, starts flushing.
> > Write 1 finishes, used 1mib of its 3mib reservation
> > Write 1 sets the size to 3mib
> > We still have 2mib in reserves, which is less than 3mib, so no bytes are
> >   released to the space info.
> > 
> > Now multiply this by 1000, you have 1000 files with 2mib sitting in their
> > reservations, but they need 2mib, and there's no space to be squeezed from the
> > rest of the fs, so they start to ENOSPC out one by one.
> > 
> > With the new thing we get this
> > 
> > Write 1 comes in, reserves 3mib, sets the size to 3mib.
> > Write 2 comes in, attempts to reserve 3mib.
> >   - can't reserve because there's not enough space, starts flushing.
> 
> Um, no, you've removed btrfs_inode_rsv_refill so there is no flushing
> happening in btrfs_delalloc_reserve_metadata whatsoever. None of the 2
> remaining callers of btrfs_delalloc_reserve_metadata does any flushing
> based on the retval of that function.
> 

Please go read the code again.  Thanks,

Josef
David Sterba April 29, 2019, 6:33 p.m. UTC | #5
On Wed, Apr 10, 2019 at 03:56:10PM -0400, Josef Bacik wrote:
> With the per-inode block rsvs we started refilling the reserve based on
> the calculated size of the outstanding csum bytes and extents for the
> inode, including the amount we were adding with the new operation.
> 
> However generic/224 exposed a problem with this approach.  With 1000
> files all writing at the same time we ended up with a bunch of bytes
> being reserved but unusable.
> 
> When you write to a file we reserve space for the csum leaves for those
> bytes, the number of extent items required to cover those bytes, and a
> single credit for updating the inode at ordered extent finish for that
> range of bytes.  This is held until the ordered extent finishes and we
> release all of the reserved space.
> 
> If a second write comes in at this point we would add a single
> reservation for the new outstanding extent and however many reservations
> for the csum leaves.  At this point we find the delta of how much we
> have reserved and how much outstanding size this is and attempt to
> reserve this delta.  If the first write finishes it will not release any
> space, because the space it had reserved for the initial write is still
> needed for the second write.  However some space would have been used,
> as we have added csums, extent items, and dirtied the inode.  Our
> reserved space would be > 0 but < the total needed reserved space.
> 
> This is just for a single inode, now consider generic/224.  This has
> 1000 inodes writing in parallel to a very small file system, 1gib.  In
> my testing this usually means we get about a 120mib metadata area to
> work with, more than enough to allow the writes to continue, but not
> enough if all of the inodes are stuck trying to reserve the slack space
> while continuing to hold their leftovers from their initial writes.
> 
> Fix this by pre-reserving _only_ the space we are currently trying to
> add.  Then once that is successful modify our inode's csum count and
> outstanding extents, and then add the newly reserved space to the inode's
> block_rsv.  This allows us to actually pass generic/224 without running
> out of metadata space.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/extent-tree.c | 145 ++++++++++++++++++-------------------------------
>  1 file changed, 53 insertions(+), 92 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 0982456ebabb..9aff7a8817d9 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -5811,85 +5811,6 @@ int btrfs_block_rsv_refill(struct btrfs_root *root,
>  	return ret;
>  }
>  
> -static void calc_refill_bytes(struct btrfs_block_rsv *block_rsv,
> -				u64 *metadata_bytes, u64 *qgroup_bytes)
> -{
> -	*metadata_bytes = 0;
> -	*qgroup_bytes = 0;
> -
> -	spin_lock(&block_rsv->lock);
> -	if (block_rsv->reserved < block_rsv->size)
> -		*metadata_bytes = block_rsv->size - block_rsv->reserved;
> -	if (block_rsv->qgroup_rsv_reserved < block_rsv->qgroup_rsv_size)
> -		*qgroup_bytes = block_rsv->qgroup_rsv_size -
> -			block_rsv->qgroup_rsv_reserved;
> -	spin_unlock(&block_rsv->lock);
> -}
> -
> -/**
> - * btrfs_inode_rsv_refill - refill the inode block rsv.
> - * @inode - the inode we are refilling.
> - * @flush - the flushing restriction.
> - *
> - * Essentially the same as btrfs_block_rsv_refill, except it uses the
> - * block_rsv->size as the minimum size.  We'll either refill the missing amount
> - * or return if we already have enough space.  This will also handle the reserve
> - * tracepoint for the reserved amount.
> - */
> -static int btrfs_inode_rsv_refill(struct btrfs_inode *inode,
> -				  enum btrfs_reserve_flush_enum flush)
> -{
> -	struct btrfs_root *root = inode->root;
> -	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
> -	u64 num_bytes, last = 0;
> -	u64 qgroup_num_bytes;
> -	int ret = -ENOSPC;
> -
> -	calc_refill_bytes(block_rsv, &num_bytes, &qgroup_num_bytes);
> -	if (num_bytes == 0)
> -		return 0;
> -
> -	do {
> -		ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_num_bytes,
> -							 true);
> -		if (ret)
> -			return ret;
> -		ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
> -		if (ret) {
> -			btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes);
> -			last = num_bytes;
> -			/*
> -			 * If we are fragmented we can end up with a lot of
> -			 * outstanding extents which will make our size be much
> -			 * larger than our reserved amount.
> -			 *
> -			 * If the reservation happens here, it might be very
> -			 * big though not needed in the end, if the delalloc
> -			 * flushing happens.
> -			 *
> -			 * If this is the case try and do the reserve again.
> -			 */
> -			if (flush == BTRFS_RESERVE_FLUSH_ALL)
> -				calc_refill_bytes(block_rsv, &num_bytes,
> -						   &qgroup_num_bytes);
> -			if (num_bytes == 0)
> -				return 0;
> -		}
> -	} while (ret && last != num_bytes);
> -
> -	if (!ret) {
> -		block_rsv_add_bytes(block_rsv, num_bytes, false);
> -		trace_btrfs_space_reservation(root->fs_info, "delalloc",
> -					      btrfs_ino(inode), num_bytes, 1);
> -
> -		/* Don't forget to increase qgroup_rsv_reserved */
> -		spin_lock(&block_rsv->lock);
> -		block_rsv->qgroup_rsv_reserved += qgroup_num_bytes;
> -		spin_unlock(&block_rsv->lock);
> -	}
> -	return ret;
> -}
> -
>  static u64 __btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
>  				     struct btrfs_block_rsv *block_rsv,
>  				     u64 num_bytes, u64 *qgroup_to_release)
> @@ -6190,9 +6111,26 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
>  	spin_unlock(&block_rsv->lock);
>  }
>  
> +static inline void calc_inode_reservations(struct btrfs_fs_info *fs_info,

I don't think this needs to be a static inline, just static.

> +					   struct btrfs_inode *inode,

Unused parameter.

> +					   u64 num_bytes, u64 *meta_reserve,
> +					   u64 *qgroup_reserve)
> +{
> +	u64 nr_extents = count_max_extents(num_bytes);
> +	u64 csum_leaves = btrfs_csum_bytes_to_leaves(fs_info, num_bytes);
> +
> +	/* We add one for the inode update at finish ordered time. */
> +	*meta_reserve = btrfs_calc_trans_metadata_size(fs_info,
> +						nr_extents + csum_leaves + 1);
> +	*qgroup_reserve = nr_extents * fs_info->nodesize;
> +}
> +
>  int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
>  {
> -	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> +	struct btrfs_root *root = inode->root;
> +	struct btrfs_fs_info *fs_info = root->fs_info;
> +	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
> +	u64 meta_reserve, qgroup_reserve;
>  	unsigned nr_extents;
>  	enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_FLUSH_ALL;
>  	int ret = 0;
> @@ -6222,7 +6160,31 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
>  
>  	num_bytes = ALIGN(num_bytes, fs_info->sectorsize);
>  
> -	/* Add our new extents and calculate the new rsv size. */
> +	/*
> +	 * Josef, we always want to do it this way, every other way is wrong and

Come on, talking to yourself in comments again?

> +	 * ends in tears.  Pre-reserving the amount we are going to add will
> +	 * always be the right way, because otherwise if we have enough
> +	 * parallelism we could end up with thousands of inodes all holding
> +	 * little bits of reservations they were able to make previously and the
> +	 * only way to reclaim that space is to ENOSPC out the operations and
> +	 * clear everything out and try again, which is shitty.  This way we

'shitty' replaced by 'bad'

Otherwise, patches updated and added to misc-next, thanks.

Patch

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0982456ebabb..9aff7a8817d9 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5811,85 +5811,6 @@  int btrfs_block_rsv_refill(struct btrfs_root *root,
 	return ret;
 }
 
-static void calc_refill_bytes(struct btrfs_block_rsv *block_rsv,
-				u64 *metadata_bytes, u64 *qgroup_bytes)
-{
-	*metadata_bytes = 0;
-	*qgroup_bytes = 0;
-
-	spin_lock(&block_rsv->lock);
-	if (block_rsv->reserved < block_rsv->size)
-		*metadata_bytes = block_rsv->size - block_rsv->reserved;
-	if (block_rsv->qgroup_rsv_reserved < block_rsv->qgroup_rsv_size)
-		*qgroup_bytes = block_rsv->qgroup_rsv_size -
-			block_rsv->qgroup_rsv_reserved;
-	spin_unlock(&block_rsv->lock);
-}
-
-/**
- * btrfs_inode_rsv_refill - refill the inode block rsv.
- * @inode - the inode we are refilling.
- * @flush - the flushing restriction.
- *
- * Essentially the same as btrfs_block_rsv_refill, except it uses the
- * block_rsv->size as the minimum size.  We'll either refill the missing amount
- * or return if we already have enough space.  This will also handle the reserve
- * tracepoint for the reserved amount.
- */
-static int btrfs_inode_rsv_refill(struct btrfs_inode *inode,
-				  enum btrfs_reserve_flush_enum flush)
-{
-	struct btrfs_root *root = inode->root;
-	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
-	u64 num_bytes, last = 0;
-	u64 qgroup_num_bytes;
-	int ret = -ENOSPC;
-
-	calc_refill_bytes(block_rsv, &num_bytes, &qgroup_num_bytes);
-	if (num_bytes == 0)
-		return 0;
-
-	do {
-		ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_num_bytes,
-							 true);
-		if (ret)
-			return ret;
-		ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
-		if (ret) {
-			btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes);
-			last = num_bytes;
-			/*
-			 * If we are fragmented we can end up with a lot of
-			 * outstanding extents which will make our size be much
-			 * larger than our reserved amount.
-			 *
-			 * If the reservation happens here, it might be very
-			 * big though not needed in the end, if the delalloc
-			 * flushing happens.
-			 *
-			 * If this is the case try and do the reserve again.
-			 */
-			if (flush == BTRFS_RESERVE_FLUSH_ALL)
-				calc_refill_bytes(block_rsv, &num_bytes,
-						   &qgroup_num_bytes);
-			if (num_bytes == 0)
-				return 0;
-		}
-	} while (ret && last != num_bytes);
-
-	if (!ret) {
-		block_rsv_add_bytes(block_rsv, num_bytes, false);
-		trace_btrfs_space_reservation(root->fs_info, "delalloc",
-					      btrfs_ino(inode), num_bytes, 1);
-
-		/* Don't forget to increase qgroup_rsv_reserved */
-		spin_lock(&block_rsv->lock);
-		block_rsv->qgroup_rsv_reserved += qgroup_num_bytes;
-		spin_unlock(&block_rsv->lock);
-	}
-	return ret;
-}
-
 static u64 __btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
 				     struct btrfs_block_rsv *block_rsv,
 				     u64 num_bytes, u64 *qgroup_to_release)
@@ -6190,9 +6111,26 @@  static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info,
 	spin_unlock(&block_rsv->lock);
 }
 
+static inline void calc_inode_reservations(struct btrfs_fs_info *fs_info,
+					   struct btrfs_inode *inode,
+					   u64 num_bytes, u64 *meta_reserve,
+					   u64 *qgroup_reserve)
+{
+	u64 nr_extents = count_max_extents(num_bytes);
+	u64 csum_leaves = btrfs_csum_bytes_to_leaves(fs_info, num_bytes);
+
+	/* We add one for the inode update at finish ordered time. */
+	*meta_reserve = btrfs_calc_trans_metadata_size(fs_info,
+						nr_extents + csum_leaves + 1);
+	*qgroup_reserve = nr_extents * fs_info->nodesize;
+}
+
 int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
 {
-	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	struct btrfs_root *root = inode->root;
+	struct btrfs_fs_info *fs_info = root->fs_info;
+	struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
+	u64 meta_reserve, qgroup_reserve;
 	unsigned nr_extents;
 	enum btrfs_reserve_flush_enum flush = BTRFS_RESERVE_FLUSH_ALL;
 	int ret = 0;
@@ -6222,7 +6160,31 @@  int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
 
 	num_bytes = ALIGN(num_bytes, fs_info->sectorsize);
 
-	/* Add our new extents and calculate the new rsv size. */
+	/*
+	 * Josef, we always want to do it this way, every other way is wrong and
+	 * ends in tears.  Pre-reserving the amount we are going to add will
+	 * always be the right way, because otherwise if we have enough
+	 * parallelism we could end up with thousands of inodes all holding
+	 * little bits of reservations they were able to make previously and the
+	 * only way to reclaim that space is to ENOSPC out the operations and
+	 * clear everything out and try again, which is shitty.  This way we
+	 * just over-reserve slightly, and clean up the mess when we are done.
+	 */
+	calc_inode_reservations(fs_info, inode, num_bytes, &meta_reserve,
+				&qgroup_reserve);
+	ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_reserve, true);
+	if (ret)
+		goto out_fail;
+	ret = reserve_metadata_bytes(root, block_rsv, meta_reserve, flush);
+	if (ret)
+		goto out_qgroup;
+
+	/*
+	 * Now we need to update our outstanding extents and csum bytes _first_
+	 * and then add the reservation to the block_rsv.  This keeps us from
+	 * racing with an ordered completion or some such that would think it
+	 * needs to free the reservation we just made.
+	 */
 	spin_lock(&inode->lock);
 	nr_extents = count_max_extents(num_bytes);
 	btrfs_mod_outstanding_extents(inode, nr_extents);
@@ -6230,22 +6192,21 @@  int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes)
 	btrfs_calculate_inode_block_rsv_size(fs_info, inode);
 	spin_unlock(&inode->lock);
 
-	ret = btrfs_inode_rsv_refill(inode, flush);
-	if (unlikely(ret))
-		goto out_fail;
+	/* Now we can safely add our space to our block rsv. */
+	block_rsv_add_bytes(block_rsv, meta_reserve, false);
+	trace_btrfs_space_reservation(root->fs_info, "delalloc",
+				      btrfs_ino(inode), meta_reserve, 1);
+
+	spin_lock(&block_rsv->lock);
+	block_rsv->qgroup_rsv_reserved += qgroup_reserve;
+	spin_unlock(&block_rsv->lock);
 
 	if (delalloc_lock)
 		mutex_unlock(&inode->delalloc_mutex);
 	return 0;
-
+out_qgroup:
+	btrfs_qgroup_free_meta_prealloc(root, qgroup_reserve);
 out_fail:
-	spin_lock(&inode->lock);
-	nr_extents = count_max_extents(num_bytes);
-	btrfs_mod_outstanding_extents(inode, -nr_extents);
-	inode->csum_bytes -= num_bytes;
-	btrfs_calculate_inode_block_rsv_size(fs_info, inode);
-	spin_unlock(&inode->lock);
-
 	btrfs_inode_rsv_release(inode, true);
 	if (delalloc_lock)
 		mutex_unlock(&inode->delalloc_mutex);