
[2/3] btrfs: add a comment describing delalloc space reservation

Message ID 20200203204436.517473-3-josef@toxicpanda.com (mailing list archive)
State New, archived
Series Add comments describing how space reservation works

Commit Message

Josef Bacik Feb. 3, 2020, 8:44 p.m. UTC
delalloc space reservation is tricky because it encompasses both data
and metadata.  Make it clear what each side does, the general flow of
how space is moved throughout the lifetime of a write, and what goes
into the calculations.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/delalloc-space.c | 90 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 90 insertions(+)

Comments

Qu Wenruo Feb. 4, 2020, 9:48 a.m. UTC | #1
On 2020/2/4 4:44 AM, Josef Bacik wrote:
> delalloc space reservation is tricky because it encompasses both data
> and metadata.  Make it clear what each side does, the general flow of
> how space is moved throughout the lifetime of a write, and what goes
> into the calculations.

In fact, describing the lifespan of a write would be super helpful for newbies.

I still remember the pain when trying to understand the whole mechanism
years ago.

> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/delalloc-space.c | 90 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 90 insertions(+)
> 
> diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
> index c13d8609cc99..09a9c01fc1b5 100644
> --- a/fs/btrfs/delalloc-space.c
> +++ b/fs/btrfs/delalloc-space.c
> @@ -9,6 +9,96 @@
>  #include "qgroup.h"
>  #include "block-group.h"
>  
> +/*
> + * HOW DOES THIS WORK
> + *
> + * There are two stages to data reservations, one for data and one for metadata
> + * to handle the new extents and checksums generated by writing data.
> + *
> + *
> + * DATA RESERVATION
> + *   The data reservation stuff is relatively straightforward.  We want X bytes,
> + *   and thus need to make sure we have X bytes free in data space in order to
> + *   write that data.  If there is not X bytes free, allocate data chunks until
> + *   we can satisfy that reservation.  If we can no longer allocate data chunks,
> + *   attempt to flush space to see if we can now make the reservaiton.  See the
> + *   comment for data_flush_states to see how that flushing is accomplished.

What about a shorter version like this?
We want X bytes of data space.

If there is not enough, try the following methods in order:
- Allocate new data chunk
- Flush space
  See comment for data_flush_states
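
As a rough user-space sketch of that flow (every function and global below is
an invented stand-in for illustration, not a btrfs function; the flush
fallback is only noted in a comment):

#include <stdio.h>
#include <errno.h>
#include <stdbool.h>

static unsigned long long total_data_space = 1ULL << 30; /* 1GiB of data chunks */
static unsigned long long bytes_may_use;                  /* outstanding reservations */

static bool have_free_data_space(unsigned long long bytes)
{
	return total_data_space - bytes_may_use >= bytes;
}

static int allocate_data_chunk(void)
{
	/* pretend the device can hold at most 4GiB of data chunks */
	if (total_data_space >= 4ULL << 30)
		return -ENOSPC;
	total_data_space += 1ULL << 30;
	return 0;
}

static int reserve_data_bytes(unsigned long long bytes)
{
	while (!have_free_data_space(bytes)) {
		/* the real code would fall back to flushing space here */
		if (allocate_data_chunk())
			return -ENOSPC;
	}
	bytes_may_use += bytes; /* i.e. added to space_info->bytes_may_use */
	return 0;
}

int main(void)
{
	printf("reserve 2GiB: %d\n", reserve_data_bytes(2ULL << 30));
	printf("total=%llu may_use=%llu\n", total_data_space, bytes_may_use);
	return 0;
}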

> + *
> + *   Once this space is reserved, it is added to space_info->bytes_may_use.  The
> + *   caller must keep track of this reservation and free it up if it is never
> + *   used.  With the buffered IO case this is handled via the EXTENT_DELALLOC
> + *   bit's on the inode's io_tree.  For direct IO it's more straightforward, we
> + *   take the reservation at the start of the operation, and if we write less
> + *   than we reserved we free the excess.

This part involves the lifespan and state machine of data.
I guess more explanation on the state machine would help a lot.

Like:
Page clean
|
+- btrfs_buffered_write()
|  Reserve data space for data, metadata space for csum/file
|  extents/inodes.
|
Page dirty
|
+- run_delalloc_range()
|  Allocate data extents, submit ordered extents to do csum calculation
|  and bio submission
|
Page write back
|
+- finish_ordered_io()
|  Insert csum and file extents
|
Page clean

Although I'm not sure whether such a lifespan description belongs in delalloc-space.c.

> + *
> + *   For the buffered case our reservation will take one of two paths
> + *
> + *   1) It is allocated.  In find_free_extent() we will call
> + *   btrfs_add_reserved_bytes() with the size of the extent we made, along with
> + *   the size that we are covering with this allocation.  For non-compressed
> + *   these will be the same thing, but for compressed they could be different.
> + *   In any case, we increase space_info->bytes_reserved by the extent size, and
> + *   reduce the space_info->bytes_may_use by the ram_bytes size.  From now on
> + *   the handling of this reserved space is the responsibility of the ordered
> + *   extent or the cow path.
> + *
> + *   2) There is an error, and we free it.  This is handled with the
> + *   EXTENT_CLEAR_DATA_RESV bit when clearing EXTENT_DELALLOC on the inode's
> + *   io_tree.
> + *
> + * METADATA RESERVATION
> + *   The general metadata reservation lifetimes are discussed elsewhere, this
> + *   will just focus on how it is used for delalloc space.
> + *
> + *   There are 3 things we are keeping reservations for.

It looks like the 3 things are too detailed, I guess?
It's definitely educational, but I'm not sure it fits the introductory
nature of such a comment.

I guess we only need to mention:
- Objective
  What this meta rsv is used for (inode item, file extents, csum)

- Location of interest
  Important details. (outstanding extents and DELALLOC bits for metadata
  rsv calculation)

- Timing of such rsv
  When we reserve/update, use and release it. (function entry points)

Then it should be enough for people to dig further on their own.

Thanks,
Qu

> + *
> + *   1) Updating the inode item.  We hold a reservation for this inode as long
> + *   as there are dirty bytes outstanding for this inode.  This is because we
> + *   may update the inode multiple times throughout an operation, and there is
> + *   no telling when we may have to do a full cow back to that inode item.  Thus
> + *   we must always hold a reservation.
> + *
> + *   2) Adding an extent item.  This is trickier, so a few sub points
> + *
> + *     a) We keep track of how many extents an inode may need to create in
> + *     inode->outstanding_extents.  This is how many items we will have reserved
> + *     for the extents for this inode.
> + *
> + *     b) count_max_extents() is used to figure out how many extent items we
> + *     will need based on the contiguous area we have dirtied.  Thus if we are
> + *     writing 4k extents but they coalesce into a very large extent, we will
> + *     break this into smaller extents which means we'll need a reservation for
> + *     each of those extents.
> + *
> + *     c) When we set EXTENT_DELALLOC on the inode io_tree we will figure out
> + *     the nummber of extents needed for the contiguous area we just created,
> + *     and add that to inode->outstanding_extents.
> + *
> + *     d) We have no idea at reservation time how this new extent fits into
> + *     existing extents.  We unconditionally use count_max_extents() on the
> + *     reservation we are currently doing.  The reservation _must_ use
> + *     btrfs_delalloc_release_extents() once it has done it's work to clear up
> + *     this outstanding extents.  This means that we will transiently have too
> + *     many extent reservations for this inode than we need.  For example say we
> + *     have a clean inode, and we do a buffered write of 4k.  The reservation
> + *     code will mod outstanding_extents to 1, and then set_delalloc will
> + *     increase it to 2.  Then once we are finished,
> + *     btrfs_delalloc_release_extents() will drop it back down to 1 again.
> + *
> + *     e) Ordered extents take on the responsibility of their extent.  We know
> + *     that the ordered extent represents a single inode item, so it will modify
> + *     ->outstanding_extents by 1, and will clear delalloc which will adjust the
> + *     ->outstanding_extents by whatever value it needs to be adjusted to.  Once
> + *     the ordered io is finished we drop the ->outstanding_extents by 1 and if
> + *     we are 0 we drop our inode item reservation as well.
> + *
> + *   3) Adding csums for the range.  This is more straightforward than the
> + *   extent items, as we just want to hold the number of bytes we'll need for
> + *   checksums until the ordered extent is removed.  If there is an error it is
> + *   cleared via the EXTENT_CLEAR_META_RESV bit when clearning EXTENT_DELALLOC
> + *   on the inode io_tree.
> + */
> +
>  int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes)
>  {
>  	struct btrfs_root *root = inode->root;
>
Nikolay Borisov Feb. 4, 2020, 12:27 p.m. UTC | #2
On 4.02.20 at 11:48, Qu Wenruo wrote:
>


<snip>

> 
>> + *
>> + *   Once this space is reserved, it is added to space_info->bytes_may_use.  The
>> + *   caller must keep track of this reservation and free it up if it is never
>> + *   used.  With the buffered IO case this is handled via the EXTENT_DELALLOC
>> + *   bit's on the inode's io_tree.  For direct IO it's more straightforward, we
>> + *   take the reservation at the start of the operation, and if we write less
>> + *   than we reserved we free the excess.
> 
> This part involves the lifespan and state machine of data.
> I guess more explanation on the state machine would help a lot.
> 
> Like:
> Page clean
> |
> +- btrfs_buffered_write()
> |  Reserve data space for data, metadata space for csum/file
> |  extents/inodes.
> |
> Page dirty
> |
> +- run_delalloc_range()
> |  Allocate data extents, submit ordered extents to do csum calculation
> |  and bio submission
> Page write back
> |
> +- finish_oredred_io()
> |  Insert csum and file extents
> |
> Page clean
> 
> Although I'm not sure if such lifespan should belong to delalloc-space.c.

This omits a lot of critical details. For example, it should be noted
that in btrfs_buffered_write we reserve as much space as is requested by
the user. Then for run_delalloc_range it must be mentioned that in the
case of compressed extents it can be called to allocate an extent which
is smaller than the space reserved in btrfs_buffered_write => that's
where the possible space savings in the compressed case come from.

As a matter of fact, running ordered io doesn't really free any space
apart from a bit of metadata space (unless we do overwrites, as per our
discussion with Josef in the Slack channel).
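
To make the compression point concrete, here is a toy model of the
accounting move from path 1) of the comment, with made-up numbers; the
variables only mirror the names used in the comment (bytes_may_use,
bytes_reserved, ram_bytes), this is not kernel code:

#include <stdio.h>

int main(void)
{
	unsigned long long bytes_may_use = 64 * 1024;  /* reserved at buffered write time */
	unsigned long long bytes_reserved = 0;

	unsigned long long ram_bytes = 64 * 1024;      /* uncompressed range covered */
	unsigned long long extent_size = 16 * 1024;    /* compressed on-disk extent */

	/* what btrfs_add_reserved_bytes() is described as doing */
	bytes_reserved += extent_size;
	bytes_may_use -= ram_bytes;

	printf("may_use=%llu reserved=%llu savings=%llu\n",
	       bytes_may_use, bytes_reserved, ram_bytes - extent_size);
	return 0;
}

The 48KiB difference is exactly where the space saving in the compressed
case shows up.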

<snip>

>> + *
>> + *   1) Updating the inode item.  We hold a reservation for this inode as long
>> + *   as there are dirty bytes outstanding for this inode.  This is because we
>> + *   may update the inode multiple times throughout an operation, and there is
>> + *   no telling when we may have to do a full cow back to that inode item.  Thus
>> + *   we must always hold a reservation.
>> + *
>> + *   2) Adding an extent item.  This is trickier, so a few sub points
>> + *
>> + *     a) We keep track of how many extents an inode may need to create in
>> + *     inode->outstanding_extents.  This is how many items we will have reserved
>> + *     for the extents for this inode.
>> + *
>> + *     b) count_max_extents() is used to figure out how many extent items we
>> + *     will need based on the contiguous area we have dirtied.  Thus if we are
>> + *     writing 4k extents but they coalesce into a very large extent, we will

The way you have worded this is a bit confusing, because first you
mention that count_max_extents calculates how many extent items we'll
need for a contiguous area. Then you mention that if we make a bunch of
4k writes that coalesce into a single extent, i.e. create one large
contiguous extent (that's what coalescing implies in this context),
we'll have to split them. This is counter-intuitive.

I guess what you meant here is physically contiguous as opposed to
logically contiguous?
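
A quick sketch of the splitting that point b) is getting at (the 128MiB
constant is assumed to mirror BTRFS_MAX_EXTENT_SIZE at the time of this
patch, and the helper below is an illustration rather than the kernel
implementation):

#include <stdio.h>

/* assumed to mirror BTRFS_MAX_EXTENT_SIZE (128MiB) at the time of this patch */
#define MAX_EXTENT_SIZE (128ULL * 1024 * 1024)

/* roughly what count_max_extents() computes: DIV_ROUND_UP(len, max size) */
static unsigned long long max_extents_for(unsigned long long len)
{
	return (len + MAX_EXTENT_SIZE - 1) / MAX_EXTENT_SIZE;
}

int main(void)
{
	/* a single 4KiB write needs one extent item reservation ... */
	printf("4KiB -> %llu extent item(s)\n", max_extents_for(4096));
	/* ... but a 1GiB logically contiguous delalloc range still needs 8,
	 * because on disk it can be split into extents of at most
	 * MAX_EXTENT_SIZE each */
	printf("1GiB -> %llu extent item(s)\n", max_extents_for(1ULL << 30));
	return 0;
}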

>> + *     break this into smaller extents which means we'll need a reservation for
>> + *     each of those extents.
>> + *
>> + *     c) When we set EXTENT_DELALLOC on the inode io_tree we will figure out
>> + *     the nummber of extents needed for the contiguous area we just created,

nit: s/nummber/number

>> + *     and add that to inode->outstanding_extents.

<snip>

>> + *
>> + *   3) Adding csums for the range.  This is more straightforward than the
>> + *   extent items, as we just want to hold the number of bytes we'll need for
>> + *   checksums until the ordered extent is removed.  If there is an error it is
>> + *   cleared via the EXTENT_CLEAR_META_RESV bit when clearning EXTENT_DELALLOC

nit: s/clearning/clearing

>> + *   on the inode io_tree.
>> + */
>> +
>>  int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes)
>>  {
>>  	struct btrfs_root *root = inode->root;
>>
>
Qu Wenruo Feb. 4, 2020, 12:39 p.m. UTC | #3
On 2020/2/4 8:27 PM, Nikolay Borisov wrote:
>
>
> On 4.02.20 г. 11:48 ч., Qu Wenruo wrote:
>>
>
>
> <snip>
>
>>
>>> + *
>>> + *   Once this space is reserved, it is added to space_info->bytes_may_use.  The
>>> + *   caller must keep track of this reservation and free it up if it is never
>>> + *   used.  With the buffered IO case this is handled via the EXTENT_DELALLOC
>>> + *   bit's on the inode's io_tree.  For direct IO it's more straightforward, we
>>> + *   take the reservation at the start of the operation, and if we write less
>>> + *   than we reserved we free the excess.
>>
>> This part involves the lifespan and state machine of data.
>> I guess more explanation on the state machine would help a lot.
>>
>> Like:
>> Page clean
>> |
>> +- btrfs_buffered_write()
>> |  Reserve data space for data, metadata space for csum/file
>> |  extents/inodes.
>> |
>> Page dirty
>> |
>> +- run_delalloc_range()
>> |  Allocate data extents, submit ordered extents to do csum calculation
>> |  and bio submission
>> Page write back
>> |
>> +- finish_oredred_io()
>> |  Insert csum and file extents
>> |
>> Page clean
>>
>> Although I'm not sure if such lifespan should belong to delalloc-space.c.
>
> This omits a lot of critical details. FOr example it should be noted
> that in btrfs_buffered_write we reserve as much space as is requested by
> the user. Then in run_delalloc_range it must be mentioned that in case
> of compressed extents it can be called to allocate an extent which is
> less than the space reserved in btrfs_buffered_write => that's where the
> possible space savings in case of compressed come from.

If you spoil everything in the introduction, I guess it's no longer an
introduction.
An introduction should only tell the overall picture, not every detail.
For details, we go read the mentioned functions.

And too many details would make the introduction pretty hard to
maintain. What if one day we drop the current always-reserve behavior
for buffered writes?

So I tend to have just an overview and the entry functions, with
minimal details unless it's a really complex design.

Thanks,
Qu

>
> As a matter of fact running ordered io doesn't really clean any space
> apart from a bit of metadata space (unless we do overwrites, as per our
> discussion with josef in the slack channel).
>
> <snip>
>
>>> + *
>>> + *   1) Updating the inode item.  We hold a reservation for this inode as long
>>> + *   as there are dirty bytes outstanding for this inode.  This is because we
>>> + *   may update the inode multiple times throughout an operation, and there is
>>> + *   no telling when we may have to do a full cow back to that inode item.  Thus
>>> + *   we must always hold a reservation.
>>> + *
>>> + *   2) Adding an extent item.  This is trickier, so a few sub points
>>> + *
>>> + *     a) We keep track of how many extents an inode may need to create in
>>> + *     inode->outstanding_extents.  This is how many items we will have reserved
>>> + *     for the extents for this inode.
>>> + *
>>> + *     b) count_max_extents() is used to figure out how many extent items we
>>> + *     will need based on the contiguous area we have dirtied.  Thus if we are
>>> + *     writing 4k extents but they coalesce into a very large extent, we will
>
> I THe way you have worded this is a bit confusing because first you
> mention that count_max_extents calcs how many extent items we'll need
> for a contiguous area. Then you mention that if we make a bunch of 4k
> writes that coalesce to a single extent i.e create 1 large contiguous
> (that's what coalescing implies in this context) we'll have to split it
> them. This is counter-intuitive.
>
> I guess what you meant here is physically contiguous as opposed to
> logically contiguous?
>
>>> + *     break this into smaller extents which means we'll need a reservation for
>>> + *     each of those extents.
>>> + *
>>> + *     c) When we set EXTENT_DELALLOC on the inode io_tree we will figure out
>>> + *     the nummber of extents needed for the contiguous area we just created,
>
> nit: s/nummber/number
>
>>> + *     and add that to inode->outstanding_extents.
>
> <snip>
>
>>> + *
>>> + *   3) Adding csums for the range.  This is more straightforward than the
>>> + *   extent items, as we just want to hold the number of bytes we'll need for
>>> + *   checksums until the ordered extent is removed.  If there is an error it is
>>> + *   cleared via the EXTENT_CLEAR_META_RESV bit when clearning EXTENT_DELALLOC
>
> nit: s/clearning/clearing
>
>>> + *   on the inode io_tree.
>>> + */
>>> +
>>>  int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes)
>>>  {
>>>  	struct btrfs_root *root = inode->root;
>>>
>>
David Sterba Feb. 5, 2020, 1:44 p.m. UTC | #4
On Tue, Feb 04, 2020 at 08:39:19PM +0800, Qu Wenruo wrote:
> >> Although I'm not sure if such lifespan should belong to delalloc-space.c.
> >
> > This omits a lot of critical details. FOr example it should be noted
> > that in btrfs_buffered_write we reserve as much space as is requested by
> > the user. Then in run_delalloc_range it must be mentioned that in case
> > of compressed extents it can be called to allocate an extent which is
> > less than the space reserved in btrfs_buffered_write => that's where the
> > possible space savings in case of compressed come from.
> 
> If you spoiler everything in the introduction, I guess it's no longer
> introduction.
> An introduction should only tell the overall picture, not every details.
> For details, we go read mentioned functions.
> 
> And too many details would make the introduction pretty hard to
> maintain. What if one day we don't the current always reserve behavior
> for buffered write?
> 
> So I tend to have just a overview, and entrance function. With minimal
> details unless it's a really complex design.

I'd rather keep it as it is and enhance it further; the ASCII graphics
might help but I don't insist. Even if it's introductory, giving just
pointers would IMHO be too little. I'm assuming the overview should be
enough to read, and then one can go to the code and already have a clue
what's happening.

We can update the docs further, but we need a start, so I've applied v2
to misc-next; let's take that as the starting point and discuss
suggested improvements as new patches. In case there's something that
really does not fit the .c file comment, there's always the docs git
repository.

Patch

diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
index c13d8609cc99..09a9c01fc1b5 100644
--- a/fs/btrfs/delalloc-space.c
+++ b/fs/btrfs/delalloc-space.c
@@ -9,6 +9,96 @@ 
 #include "qgroup.h"
 #include "block-group.h"
 
+/*
+ * HOW DOES THIS WORK
+ *
+ * There are two stages to data reservations, one for data and one for metadata
+ * to handle the new extents and checksums generated by writing data.
+ *
+ *
+ * DATA RESERVATION
+ *   The data reservation stuff is relatively straightforward.  We want X bytes,
+ *   and thus need to make sure we have X bytes free in data space in order to
+ *   write that data.  If there is not X bytes free, allocate data chunks until
+ *   we can satisfy that reservation.  If we can no longer allocate data chunks,
+ *   attempt to flush space to see if we can now make the reservaiton.  See the
+ *   comment for data_flush_states to see how that flushing is accomplished.
+ *
+ *   Once this space is reserved, it is added to space_info->bytes_may_use.  The
+ *   caller must keep track of this reservation and free it up if it is never
+ *   used.  With the buffered IO case this is handled via the EXTENT_DELALLOC
+ *   bit's on the inode's io_tree.  For direct IO it's more straightforward, we
+ *   take the reservation at the start of the operation, and if we write less
+ *   than we reserved we free the excess.
+ *
+ *   For the buffered case our reservation will take one of two paths
+ *
+ *   1) It is allocated.  In find_free_extent() we will call
+ *   btrfs_add_reserved_bytes() with the size of the extent we made, along with
+ *   the size that we are covering with this allocation.  For non-compressed
+ *   these will be the same thing, but for compressed they could be different.
+ *   In any case, we increase space_info->bytes_reserved by the extent size, and
+ *   reduce the space_info->bytes_may_use by the ram_bytes size.  From now on
+ *   the handling of this reserved space is the responsibility of the ordered
+ *   extent or the cow path.
+ *
+ *   2) There is an error, and we free it.  This is handled with the
+ *   EXTENT_CLEAR_DATA_RESV bit when clearing EXTENT_DELALLOC on the inode's
+ *   io_tree.
+ *
+ * METADATA RESERVATION
+ *   The general metadata reservation lifetimes are discussed elsewhere, this
+ *   will just focus on how it is used for delalloc space.
+ *
+ *   There are 3 things we are keeping reservations for.
+ *
+ *   1) Updating the inode item.  We hold a reservation for this inode as long
+ *   as there are dirty bytes outstanding for this inode.  This is because we
+ *   may update the inode multiple times throughout an operation, and there is
+ *   no telling when we may have to do a full cow back to that inode item.  Thus
+ *   we must always hold a reservation.
+ *
+ *   2) Adding an extent item.  This is trickier, so a few sub points
+ *
+ *     a) We keep track of how many extents an inode may need to create in
+ *     inode->outstanding_extents.  This is how many items we will have reserved
+ *     for the extents for this inode.
+ *
+ *     b) count_max_extents() is used to figure out how many extent items we
+ *     will need based on the contiguous area we have dirtied.  Thus if we are
+ *     writing 4k extents but they coalesce into a very large extent, we will
+ *     break this into smaller extents which means we'll need a reservation for
+ *     each of those extents.
+ *
+ *     c) When we set EXTENT_DELALLOC on the inode io_tree we will figure out
+ *     the nummber of extents needed for the contiguous area we just created,
+ *     and add that to inode->outstanding_extents.
+ *
+ *     d) We have no idea at reservation time how this new extent fits into
+ *     existing extents.  We unconditionally use count_max_extents() on the
+ *     reservation we are currently doing.  The reservation _must_ use
+ *     btrfs_delalloc_release_extents() once it has done it's work to clear up
+ *     this outstanding extents.  This means that we will transiently have too
+ *     many extent reservations for this inode than we need.  For example say we
+ *     have a clean inode, and we do a buffered write of 4k.  The reservation
+ *     code will mod outstanding_extents to 1, and then set_delalloc will
+ *     increase it to 2.  Then once we are finished,
+ *     btrfs_delalloc_release_extents() will drop it back down to 1 again.
+ *
+ *     e) Ordered extents take on the responsibility of their extent.  We know
+ *     that the ordered extent represents a single inode item, so it will modify
+ *     ->outstanding_extents by 1, and will clear delalloc which will adjust the
+ *     ->outstanding_extents by whatever value it needs to be adjusted to.  Once
+ *     the ordered io is finished we drop the ->outstanding_extents by 1 and if
+ *     we are 0 we drop our inode item reservation as well.
+ *
+ *   3) Adding csums for the range.  This is more straightforward than the
+ *   extent items, as we just want to hold the number of bytes we'll need for
+ *   checksums until the ordered extent is removed.  If there is an error it is
+ *   cleared via the EXTENT_CLEAR_META_RESV bit when clearning EXTENT_DELALLOC
+ *   on the inode io_tree.
+ */
+
 int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes)
 {
 	struct btrfs_root *root = inode->root;
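
For reference, a small user-space walk-through of the outstanding_extents
bookkeeping from sub-points 2c)-2e) above, using the clean-inode 4k buffered
write example from the comment; the *_model functions are invented, only the
counter transitions follow the description:

#include <stdio.h>

/* stand-in for inode->outstanding_extents */
static int outstanding_extents;

/* reservation path: unconditionally count the extents for this write */
static void delalloc_reserve_model(void)
{
	outstanding_extents += 1; /* count_max_extents(4k) == 1 */
	printf("after reserve:           %d\n", outstanding_extents);
}

/* setting EXTENT_DELALLOC counts the merged contiguous area again */
static void set_delalloc_model(void)
{
	outstanding_extents += 1;
	printf("after set_delalloc:      %d\n", outstanding_extents);
}

/* btrfs_delalloc_release_extents() drops the transient extra reservation */
static void release_extents_model(void)
{
	outstanding_extents -= 1;
	printf("after release_extents:   %d\n", outstanding_extents);
}

/* finishing the ordered io drops the last count; at zero the inode item
 * reservation can be dropped as well (sub-point 2e) */
static void finish_ordered_io_model(void)
{
	outstanding_extents -= 1;
	printf("after finish_ordered_io: %d\n", outstanding_extents);
	if (outstanding_extents == 0)
		printf("drop the inode item reservation too\n");
}

int main(void)
{
	delalloc_reserve_model();  /* 0 -> 1 */
	set_delalloc_model();      /* 1 -> 2 */
	release_extents_model();   /* 2 -> 1 */
	finish_ordered_io_model(); /* 1 -> 0 */
	return 0;
}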