[24/24] btrfs: add a comment explaining the data flush steps

Message ID 20200203204951.517751-25-josef@toxicpanda.com
State New
Series
  • Convert data reservations to the ticketing infrastructure

Commit Message

Josef Bacik Feb. 3, 2020, 8:49 p.m. UTC
The data flushing steps are not obvious to people other than myself and
Chris.  Write a giant comment explaining the reasoning behind each flush
step for data as well as why it is in that particular order.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/space-info.c | 46 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

Comments

Nikolay Borisov Feb. 4, 2020, 9:47 a.m. UTC | #1
On 3.02.20 at 22:49, Josef Bacik wrote:
> The data flushing steps are not obvious to people other than myself and
> Chris.  Write a giant comment explaining the reasoning behind each flush
> step for data as well as why it is in that particular order.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/space-info.c | 46 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 46 insertions(+)
> 
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index 18a31d96bbbd..d3befc536a7f 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -780,6 +780,52 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
>  	} while (flush_state <= COMMIT_TRANS);
>  }
>  
> +/*
> + * FLUSH_DELALLOC_WAIT:
> + *   Space is free'd from flushing delalloc in one of two ways.
> + *
> + *   1) compression is on and we allocate less space than we reserved.
> + *   2) We are overwriting existing space.
> + *
> + *   For #1 that extra space is reclaimed as soon as the delalloc pages are
> + *   cow'ed, by way of btrfs_add_reserved_bytes() which adds the actual extent
> + *   length to ->bytes_reserved, and subtracts the reserved space from
> + *   ->bytes_may_use.
> + *
> + *   For #2 this is trickier.  Once the ordered extent runs we will drop the
> + *   extent in the range we are overwriting, which creates a delayed ref for
> + *   that freed extent.  This however is not reclaimed until the transaction
> + *   commits, thus the next stages.
> + *
> + * RUN_DELAYED_IPUTS
> + *   If we are freeing inodes, we want to make sure all delayed iputs have
> + *   completed, because they could have been on an inode with i_nlink == 0, and
> + *   thus have been trunated and free'd up space.  But again this space is not
> + *   immediately re-usable, it comes in the form of a delayed ref, which must be
> + *   run and then the transaction must be committed.
> + *
> + * FLUSH_DELAYED_REFS
> + *   The above two cases generate delayed refs that will affect
> + *   ->total_bytes_pinned.  However this counter can be inconsistent with
> + *   reality if there are outstanding delayed refs.  This is because we adjust
> + *   the counter based on the other delayed refs that exist.  So for example, if

IMO this sentence would be clearer if it simply said something along the
lines of: "This is because we adjust the counter based solely on the
current set of delayed refs and disregard any on-disk state, which might
include more refs".

> + *   we have an extent with 2 references, but we only drop 1, we'll see that
> + *   there is a negative delayed ref count for the extent and assume that the
> + *   space will be free'd, and thus increase ->total_bytes_pinned.
> + *
> + *   Running the delayed refs gives us the actual real view of what will be
> + *   freed at the transaction commit time.  This stage will not actually free
> + *   space for us, it just makes sure that may_commit_transaction() has all of
> + *   the information it needs to make the right decision.

Is there any particular reason why total_bytes_pinned is sort of double
accounted? I.e. first add_pinned_bytes is called when a DROP delayed ref
is added, with a negative ref count. Then, during processing of that
delayed ref, __btrfs_free_extent either:

a) Removes the ref but doesn't free the extent, if there were other,
non-delayed refs for this extent, or

b) Removes the extent and calls btrfs_update_block_group to account it
again as pinned (this time also setting the respective ranges as pinned).

This double accounting doesn't actually happen, because after the
processing of the DROP delayed ref is finished,
cleanup_ref_head->btrfs_cleanup_ref_head_accounting will clean up the
pinned bytes added at creation time of the DROP delayed ref. Can we
avoid this and simply rely on the update of total_bytes_pinned when an
extent is freed?


> + *
> + * COMMIT_TRANS
> + *   This is where we reclaim all of the pinned space generated by the previous
> + *   two stages.  We will not commit the transaction if we don't think we're
> + *   likely to satisfy our request, which means if our current free space +
> + *   total_bytes_pinned < reservation we will not commit.  This is why the
> + *   previous states are actually important, to make sure we know for sure
> + *   wether committing the transaction will allow us to make progress.

typo: s/wether/whether/

> + */
>  static const enum btrfs_flush_state data_flush_states[] = {
>  	FLUSH_DELALLOC_WAIT,
>  	RUN_DELAYED_IPUTS,
> 

This looks good and helped me understand the machinery and put some
context into the code.
Josef Bacik Feb. 4, 2020, 4:16 p.m. UTC | #2
On 2/4/20 4:47 AM, Nikolay Borisov wrote:
> 
> 
> On 3.02.20 г. 22:49 ч., Josef Bacik wrote:
>> The data flushing steps are not obvious to people other than myself and
>> Chris.  Write a giant comment explaining the reasoning behind each flush
>> step for data as well as why it is in that particular order.
>>
>> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
>> ---
>>   fs/btrfs/space-info.c | 46 +++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 46 insertions(+)
>>
>> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
>> index 18a31d96bbbd..d3befc536a7f 100644
>> --- a/fs/btrfs/space-info.c
>> +++ b/fs/btrfs/space-info.c
>> @@ -780,6 +780,52 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
>>   	} while (flush_state <= COMMIT_TRANS);
>>   }
>>   
>> +/*
>> + * FLUSH_DELALLOC_WAIT:
>> + *   Space is free'd from flushing delalloc in one of two ways.
>> + *
>> + *   1) compression is on and we allocate less space than we reserved.
>> + *   2) We are overwriting existing space.
>> + *
>> + *   For #1 that extra space is reclaimed as soon as the delalloc pages are
>> + *   cow'ed, by way of btrfs_add_reserved_bytes() which adds the actual extent
>> + *   length to ->bytes_reserved, and subtracts the reserved space from
>> + *   ->bytes_may_use.
>> + *
>> + *   For #2 this is trickier.  Once the ordered extent runs we will drop the
>> + *   extent in the range we are overwriting, which creates a delayed ref for
>> + *   that freed extent.  This however is not reclaimed until the transaction
>> + *   commits, thus the next stages.
>> + *
>> + * RUN_DELAYED_IPUTS
>> + *   If we are freeing inodes, we want to make sure all delayed iputs have
>> + *   completed, because they could have been on an inode with i_nlink == 0, and
>> + *   thus have been trunated and free'd up space.  But again this space is not
>> + *   immediately re-usable, it comes in the form of a delayed ref, which must be
>> + *   run and then the transaction must be committed.
>> + *
>> + * FLUSH_DELAYED_REFS
>> + *   The above two cases generate delayed refs that will affect
>> + *   ->total_bytes_pinned.  However this counter can be inconsistent with
>> + *   reality if there are outstanding delayed refs.  This is because we adjust
>> + *   the counter based on the other delayed refs that exist.  So for example, if
> 
> IMO this sentence would be clearer if it simply says something along the
> lines of " This is because we adjust the counter based solely on the
> current set of delayed refs and disregard any on-disk state which might
> include more refs".
> 
>> + *   we have an extent with 2 references, but we only drop 1, we'll see that
>> + *   there is a negative delayed ref count for the extent and assume that the
>> + *   space will be free'd, and thus increase ->total_bytes_pinned.
>> + *
>> + *   Running the delayed refs gives us the actual real view of what will be
>> + *   freed at the transaction commit time.  This stage will not actually free
>> + *   space for us, it just makes sure that may_commit_transaction() has all of
>> + *   the information it needs to make the right decision.
> 
> Is there any particular reason why total_bytes_pinned is sort of double
> accounted. I.e first add_pinned_bytes is called when a DROP del ref is
> added with negative ref. Then during processing of that delref
> __btrfs_free_extent either:
> 
> a) Removes the ref but doesn't free the extent if there were other,
> non-del refs for this extent
> 
> b) Removes the extent and calls btrfs_update_block_group to account it
> again as pinned (this time also setting the respective ranges as pinned)
> 
> This double accounting doesn't really happen because after the
> processing of the DROP del ref is finished
> cleanup_ref_head->btrfs_cleanup_ref_head_accounting will actually clean
> the pinned bytes added at creation time of the DROP del ref. Can we
> avoid this and simply rely on the update of total_bytes_pinned when an
> extent is freed.
> 

We discussed this offline; just replying here for those playing along at home.

Now that everything is unified we no longer need this extra total_bytes_pinned 
accounting.  Before, this is what we were relying on in the data path to account 
for any data that may be freed once delayed refs run, since we didn't run 
delayed refs in the data reservation path.

Now that we are in fact doing this, we do not need the extra accounting.  In 
fact, at the point we check in may_commit_transaction() our 
space_info->bytes_pinned will be up to date, and thus we don't need this percpu 
counter at all.  Nikolay is going to follow up after these patches are merged 
and remove the counter altogether.  Thanks,

Josef

Patch

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 18a31d96bbbd..d3befc536a7f 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -780,6 +780,52 @@  static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
 	} while (flush_state <= COMMIT_TRANS);
 }
 
+/*
+ * FLUSH_DELALLOC_WAIT:
+ *   Space is free'd from flushing delalloc in one of two ways.
+ *
+ *   1) compression is on and we allocate less space than we reserved.
+ *   2) We are overwriting existing space.
+ *
+ *   For #1 that extra space is reclaimed as soon as the delalloc pages are
+ *   cow'ed, by way of btrfs_add_reserved_bytes() which adds the actual extent
+ *   length to ->bytes_reserved, and subtracts the reserved space from
+ *   ->bytes_may_use.
+ *
+ *   For #2 this is trickier.  Once the ordered extent runs we will drop the
+ *   extent in the range we are overwriting, which creates a delayed ref for
+ *   that freed extent.  This however is not reclaimed until the transaction
+ *   commits, thus the next stages.
+ *
+ * RUN_DELAYED_IPUTS
+ *   If we are freeing inodes, we want to make sure all delayed iputs have
+ *   completed, because they could have been on an inode with i_nlink == 0, and
+ *   thus have been trunated and free'd up space.  But again this space is not
+ *   immediately re-usable, it comes in the form of a delayed ref, which must be
+ *   run and then the transaction must be committed.
+ *
+ * FLUSH_DELAYED_REFS
+ *   The above two cases generate delayed refs that will affect
+ *   ->total_bytes_pinned.  However this counter can be inconsistent with
+ *   reality if there are outstanding delayed refs.  This is because we adjust
+ *   the counter based on the other delayed refs that exist.  So for example, if
+ *   we have an extent with 2 references, but we only drop 1, we'll see that
+ *   there is a negative delayed ref count for the extent and assume that the
+ *   space will be free'd, and thus increase ->total_bytes_pinned.
+ *
+ *   Running the delayed refs gives us the actual real view of what will be
+ *   freed at the transaction commit time.  This stage will not actually free
+ *   space for us, it just makes sure that may_commit_transaction() has all of
+ *   the information it needs to make the right decision.
+ *
+ * COMMIT_TRANS
+ *   This is where we reclaim all of the pinned space generated by the previous
+ *   two stages.  We will not commit the transaction if we don't think we're
+ *   likely to satisfy our request, which means if our current free space +
+ *   total_bytes_pinned < reservation we will not commit.  This is why the
+ *   previous states are actually important, to make sure we know for sure
+ *   wether committing the transaction will allow us to make progress.
+ */
 static const enum btrfs_flush_state data_flush_states[] = {
 	FLUSH_DELALLOC_WAIT,
 	RUN_DELAYED_IPUTS,