[1/3] btrfs: add a comment describing block-rsvs

Message ID 20200203204436.517473-2-josef@toxicpanda.com
State New
Series
  • Add comments describing how space reservation works

Commit Message

Josef Bacik Feb. 3, 2020, 8:44 p.m. UTC
This is a giant comment at the top of block-rsv.c describing generally
how block rsvs work.  It is purely about the block rsv's themselves, and
nothing to do with how the actual reservation system works.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/block-rsv.c | 81 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 81 insertions(+)

Comments

Qu Wenruo Feb. 4, 2020, 9:30 a.m. UTC | #1
On 2020/2/4 4:44 AM, Josef Bacik wrote:
> This is a giant comment at the top of block-rsv.c describing generally
> how block rsvs work.  It is purely about the block rsv's themselves, and
> nothing to do with how the actual reservation system works.

A comment like this really helps!

That said, it is heavy on words and light on ASCII art or diagrams,
so I'm not sure it's really easy to read.

Some questions are inlined below.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/block-rsv.c | 81 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 81 insertions(+)
> 
> diff --git a/fs/btrfs/block-rsv.c b/fs/btrfs/block-rsv.c
> index d07bd41a7c1e..54380f477f80 100644
> --- a/fs/btrfs/block-rsv.c
> +++ b/fs/btrfs/block-rsv.c
> @@ -6,6 +6,87 @@
>  #include "space-info.h"
>  #include "transaction.h"
>  
> +/*
> + * HOW DO BLOCK RSVS WORK
> + *
> + *   Think of block_rsv's as buckets for logically grouped reservations.  Each
> + *   block_rsv has a ->size and a ->reserved.  ->size is how large we want our
> + *   block rsv to be, ->reserved is how much space is currently reserved for
> + *   this block reserve.

This block_rsv system is for metadata only, right?
Since the data size can be determined easily, there is no need for such
a complex system there.

> + *
> + *   ->failfast exists for the truncate case, and is described below.
> + *
> + * NORMAL OPERATION
> + *   We determine we need N items of reservation, we use the appropriate
> + *   btrfs_calc*() helper to determine the number of bytes.  We call into
> + *   reserve_metadata_bytes() and get our bytes, we then add this space to our
> + *   ->size and our ->reserved.

Since you mentioned bytes_may_use in the finish part, what about also
mentioning that here?

> + *
> + *   We go to modify the tree for our operation, we allocate a tree block, which
> + *   calls btrfs_use_block_rsv(), and subtracts nodesize from
> + *   block_rsv->reserved.
> + *
> + *   We finish our operation, we subtract our original reservation from ->size,
> + *   and then we subtract ->size from ->reserved if there is an excess and free
> + *   the excess back to the space info, by reducing space_info->bytes_may_use by
> + *   the excess amount.

I find the workflow can be expressed as a timeline (?) graph like this:

+--- Reserve:
|    Entrance: btrfs_block_rsv_add(), btrfs_block_rsv_refill()
|
|    Calculate needed bytes by btrfs_calc*(), then add the needed space
|    to our ->size and our ->reserved.
|    This also contributes to space_info->bytes_may_use.
|
+--- Use:
|    Entrance: btrfs_use_block_rsv()
|
|    We're allocating a tree block, which subtracts nodesize from
|    block_rsv->reserved.
|
+--- Finish:
     Entrance: btrfs_block_rsv_release()

     we subtract our original reservation from ->size,
     and then we subtract ->size from ->reserved if there is an excess
     and free the excess back to the space info, by reducing
     space_info->bytes_may_use by the excess amount.
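
To make the three phases concrete, here is a small user-space model of the
counters. It is only a sketch: the model_* names are hypothetical, none of
this is the kernel code, and it assumes that bytes consumed by a tree block
simply stop counting toward bytes_may_use (the quoted comment does not say
where they go).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins; only the counters discussed above are modelled. */
struct model_space_info {
	uint64_t bytes_may_use;
};

struct model_block_rsv {
	uint64_t size;		/* how large we want the reserve to be */
	uint64_t reserved;	/* how much is currently reserved */
};

/* Reserve: grow ->size and ->reserved, and account in bytes_may_use. */
void model_rsv_add(struct model_space_info *si,
		   struct model_block_rsv *rsv, uint64_t bytes)
{
	si->bytes_may_use += bytes;
	rsv->size += bytes;
	rsv->reserved += bytes;
}

/* Use: a tree block allocation takes nodesize out of ->reserved. */
int model_rsv_use(struct model_space_info *si,
		  struct model_block_rsv *rsv, uint64_t nodesize)
{
	if (rsv->reserved < nodesize)
		return -1;	/* any fallback path is not modelled here */
	rsv->reserved -= nodesize;
	/* Assumption of this model: used bytes leave bytes_may_use. */
	si->bytes_may_use -= nodesize;
	return 0;
}

/* Finish: shrink ->size, then hand back whatever ->reserved exceeds it. */
uint64_t model_rsv_release(struct model_space_info *si,
			   struct model_block_rsv *rsv, uint64_t bytes)
{
	uint64_t excess = 0;

	if (bytes > rsv->size)
		bytes = rsv->size;
	rsv->size -= bytes;
	if (rsv->reserved > rsv->size) {
		excess = rsv->reserved - rsv->size;
		rsv->reserved = rsv->size;
		si->bytes_may_use -= excess;
	}
	return excess;
}

/* Reserve three 16K nodes, use one, release; returns final bytes_may_use. */
uint64_t model_demo_lifecycle(void)
{
	struct model_space_info si = { 0 };
	struct model_block_rsv rsv = { 0 };
	const uint64_t nodesize = 16384;

	model_rsv_add(&si, &rsv, 3 * nodesize);
	assert(model_rsv_use(&si, &rsv, nodesize) == 0);
	model_rsv_release(&si, &rsv, 3 * nodesize);
	return si.bytes_may_use;
}
```

In this model a balanced reserve/use/release cycle leaves bytes_may_use
where it started, which is the invariant the "Finish" step is there to
maintain.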

> + *
> + *   In some cases we may return this excess to the global block reserve or
> + *   delayed refs reserve if either of their ->size is greater than their
> + *   ->reserved.
> + *

Types of block_rsv:

> + * BLOCK_RSV_TRANS, BLOCK_RSV_DELOPS, BLOCK_RSV_CHUNK
> + *   These behave normally, as described above, just within the confines of the
> + *   lifetime of their particular operation (transaction for the whole trans
> + *   handle lifetime, for example).
> + *
> + * BLOCK_RSV_GLOBAL
> + *   This has existed forever, with diminishing degrees of importance.
> + *   Currently it exists to save us from ourselves.  We definitely over-reserve
> + *   space most of the time, but the nature of COW is that we do not know how
> + *   much space we may need to use for any given operation.  This is
> + *   particularly true about the extent tree.  Modifying one extent could
> + *   balloon into 1000 modifications of the extent tree, which we have no way of
> + *   properly predicting.  To cover this case we have the global reserve act as
> + *   the "root" space to allow us to not abort the transaciton when things are
> + *   very tight.  As such we tend to treat this space as sacred, and only use it
> + *   if we are desparate.  Generally we should no longer be depending on its
> + *   space, and if new use cases arise we need to address them elsewhere.

Although we all know the global rsv is really important for essential
tree updates, can we make this a little simpler?
It looks too long to read.

I guess we don't need to put all the related info here.
Maybe just mentioning the usage of each type is enough?
(The reader will still go grepping for more details anyway.)

This also applies to the remaining types.

Thanks,
Qu

> + *
> + * BLOCK_RSV_DELALLOC
> + *   The individual item sizes are determined by the per-inode size
> + *   calculations, which are described with the delalloc code.  This is pretty
> + *   straightforward, it's just the calculation of ->size encodes a lot of
> + *   different items, and thus it gets used when updating inodes, inserting file
> + *   extents, and inserting checksums.
> + *
> + * BLOCK_RSV_DELREFS
> + *   We keep a running tally of how many delayed refs we have on the system.
> + *   We assume each one of these delayed refs are going to use a full
> + *   reservation.  We use the transaction items and pre-reserve space for every
> + *   operation, and use this reservation to refill any gap between ->size and
> + *   ->reserved that may exist.
> + *
> + *   From there it's straightforward, removing a delayed ref means we remove its
> + *   count from ->size and free up reservations as necessary.  Since this is the
> + *   most dynamic block rsv in the system, we will try to refill this block rsv
> + *   first with any excess returned by any other block reserve.
> + *
> + * BLOCK_RSV_EMPTY
> + *   This is the fallback block rsv to make us try to reserve space if we don't
> + *   have a specific bucket for this allocation.  It is mostly used for updating
> + *   the device tree and such, since that is a separate pool we're content to
> + *   just reserve space from the space_info on demand.
> + *
> + * BLOCK_RSV_TEMP
> + *   This is used by things like truncate and iput.  We will temporarily
> + *   allocate a block rsv, set it to some size, and then truncate bytes until we
> + *   have no space left.  With ->failfast set we'll simply return ENOSPC from
> + *   btrfs_use_block_rsv() to signal that we need to unwind and try to make a
> + *   new reservation.  This is because these operations are unbounded, so we
> + *   want to do as much work as we can, and then back off and re-reserve.
> + */
> +
>  static u64 block_rsv_release_bytes(struct btrfs_fs_info *fs_info,
>  				    struct btrfs_block_rsv *block_rsv,
>  				    struct btrfs_block_rsv *dest, u64 num_bytes,
>
Nikolay Borisov Feb. 4, 2020, 10:32 a.m. UTC | #2
On 4.02.20 at 11:30, Qu Wenruo wrote:
> 
> 
> On 2020/2/4 4:44 AM, Josef Bacik wrote:
>> This is a giant comment at the top of block-rsv.c describing generally
>> how block rsvs work.  It is purely about the block rsv's themselves, and
>> nothing to do with how the actual reservation system works.
> 
> A comment like this really helps!
> 
> That said, it is heavy on words and light on ASCII art or diagrams,
> so I'm not sure it's really easy to read.
> 
> Some questions are inlined below.
>>
>> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
>> ---
>>  fs/btrfs/block-rsv.c | 81 ++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 81 insertions(+)
>>
>> diff --git a/fs/btrfs/block-rsv.c b/fs/btrfs/block-rsv.c
>> index d07bd41a7c1e..54380f477f80 100644
>> --- a/fs/btrfs/block-rsv.c
>> +++ b/fs/btrfs/block-rsv.c
>> @@ -6,6 +6,87 @@
>>  #include "space-info.h"
>>  #include "transaction.h"
>>


<snip>

>> + *
>> + *   We go to modify the tree for our operation, we allocate a tree block, which
>> + *   calls btrfs_use_block_rsv(), and subtracts nodesize from
>> + *   block_rsv->reserved.
>> + *
>> + *   We finish our operation, we subtract our original reservation from ->size,
>> + *   and then we subtract ->size from ->reserved if there is an excess and free
>> + *   the excess back to the space info, by reducing space_info->bytes_may_use by
>> + *   the excess amount.
> 
> I find the workflow can be expressed as a timeline (?) graph like this:
> 
> +--- Reserve:
> |    Entrance: btrfs_block_rsv_add(), btrfs_block_rsv_refill()
> |
> |    Calculate needed bytes by btrfs_calc*(), then add the needed space
> |    to our ->size and our ->reserved.
> |    This also contributes to space_info->bytes_may_use.
> |
> +--- Use:
> |    Entrance: btrfs_use_block_rsv()
> |
> |    We're allocating a tree block, which subtracts nodesize from
> |    block_rsv->reserved.
> |
> +--- Finish:
>      Entrance: btrfs_block_rsv_release()
> 
>      we subtract our original reservation from ->size,
>      and then we subtract ->size from ->reserved if there is an excess
>      and free the excess back to the space info, by reducing
>      space_info->bytes_may_use by the excess amount.

I find this graphic helpful. Also, IMO it's important to state explicitly
that ->size is based on an overestimation, whereas the space subtracted
from ->reserved is always based on real usage. That is why we can end up
with excess space that can be returned.

Over-reservation is mentioned in the BLOCK_RSV_GLOBAL paragraph, but I
think it should be here and can be removed from there.
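
As an illustration of that point, the sketch below releases an
over-estimated reservation and first tops up a destination reserve that is
short of its ->size before returning the rest. This is a simplified model,
not the kernel's release path: model_release_excess() and the single "dest"
argument are hypothetical, and the numbers (10 nodes estimated, 2 actually
used) are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a block reserve. */
struct model_block_rsv {
	uint64_t size;
	uint64_t reserved;
};

/*
 * Drop `bytes` of the original (over-)estimate from src->size.  Whatever
 * src->reserved still holds above the new ->size is excess: it first fills
 * any gap between dest->size and dest->reserved, and only the remainder is
 * subtracted from *bytes_may_use.  Returns that remainder.
 */
uint64_t model_release_excess(struct model_block_rsv *src,
			      struct model_block_rsv *dest,
			      uint64_t bytes, uint64_t *bytes_may_use)
{
	uint64_t excess = 0;

	if (bytes > src->size)
		bytes = src->size;
	src->size -= bytes;
	if (src->reserved > src->size) {
		excess = src->reserved - src->size;
		src->reserved = src->size;
	}
	if (dest && dest->reserved < dest->size) {
		uint64_t deficit = dest->size - dest->reserved;
		uint64_t give = excess < deficit ? excess : deficit;

		dest->reserved += give;
		excess -= give;
	}
	*bytes_may_use -= excess;
	return excess;
}

/* Estimate 10 nodes, really use 2, release into a reserve short by 3. */
uint64_t model_demo_overestimate(void)
{
	const uint64_t nodesize = 16384;
	struct model_block_rsv trans = {
		.size = 10 * nodesize, .reserved = 10 * nodesize,
	};
	struct model_block_rsv delrefs = {
		.size = 4 * nodesize, .reserved = 1 * nodesize,
	};
	uint64_t bytes_may_use = trans.reserved + delrefs.reserved;

	/* Real usage: only two tree blocks were allocated. */
	trans.reserved -= 2 * nodesize;
	bytes_may_use -= 2 * nodesize;

	return model_release_excess(&trans, &delrefs, 10 * nodesize,
				    &bytes_may_use);
}
```

Here 8 nodes are left over, 3 of them close the destination's gap, and the
remaining 5 go back to the space info.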
> 
>> + *
>> + *   In some cases we may return this excess to the global block reserve or
>> + *   delayed refs reserve if either of their ->size is greater than their
>> + *   ->reserved.
>> + *
> 
> Types of block_rsv:
> 
>> + * BLOCK_RSV_TRANS, BLOCK_RSV_DELOPS, BLOCK_RSV_CHUNK
>> + *   These behave normally, as described above, just within the confines of the
>> + *   lifetime of their particular operation (transaction for the whole trans
>> + *   handle lifetime, for example).
>> + *
>> + * BLOCK_RSV_GLOBAL
>> + *   This has existed forever, with diminishing degrees of importance.
>> + *   Currently it exists to save us from ourselves.  We definitely over-reserve
>> + *   space most of the time, but the nature of COW is that we do not know how
>> + *   much space we may need to use for any given operation.  This is
>> + *   particularly true about the extent tree.  Modifying one extent could
>> + *   balloon into 1000 modifications of the extent tree, which we have no way of
>> + *   properly predicting.  To cover this case we have the global reserve act as
>> + *   the "root" space to allow us to not abort the transaciton when things are
nit: s/transaciton/transaction
>> + *   very tight.  As such we tend to treat this space as sacred, and only use it
>> + *   if we are desparate.  Generally we should no longer be depending on its
nit: s/desparate/desperate

>> + *   space, and if new use cases arise we need to address them elsewhere.
> 
> Although we all know the global rsv is really important for essential
> tree updates, can we make this a little simpler?
> It looks too long to read.

The 2nd sentence of the paragraph can be removed. It could also mention
that the global rsv is used for trees other than the extent tree, i.e.
the chunk/csum ones. Also, isn't it used to ensure progress of unlink()?

> 
> I guess we don't need to put all the related info here.
> Maybe just mentioning the usage of each type is enough?
> (The reader will still go grepping for more details anyway.)
> 
> This also applies to the remaining types.


I disagree; those comments provide glimpses of the problems that
necessitated having block rsvs in the first place. It's good to read this
before diving into the code.

<snip>

Patch

diff --git a/fs/btrfs/block-rsv.c b/fs/btrfs/block-rsv.c
index d07bd41a7c1e..54380f477f80 100644
--- a/fs/btrfs/block-rsv.c
+++ b/fs/btrfs/block-rsv.c
@@ -6,6 +6,87 @@ 
 #include "space-info.h"
 #include "transaction.h"
 
+/*
+ * HOW DO BLOCK RSVS WORK
+ *
+ *   Think of block_rsv's as buckets for logically grouped reservations.  Each
+ *   block_rsv has a ->size and a ->reserved.  ->size is how large we want our
+ *   block rsv to be, ->reserved is how much space is currently reserved for
+ *   this block reserve.
+ *
+ *   ->failfast exists for the truncate case, and is described below.
+ *
+ * NORMAL OPERATION
+ *   We determine we need N items of reservation, we use the appropriate
+ *   btrfs_calc*() helper to determine the number of bytes.  We call into
+ *   reserve_metadata_bytes() and get our bytes, we then add this space to our
+ *   ->size and our ->reserved.
+ *
+ *   We go to modify the tree for our operation, we allocate a tree block, which
+ *   calls btrfs_use_block_rsv(), and subtracts nodesize from
+ *   block_rsv->reserved.
+ *
+ *   We finish our operation, we subtract our original reservation from ->size,
+ *   and then we subtract ->size from ->reserved if there is an excess and free
+ *   the excess back to the space info, by reducing space_info->bytes_may_use by
+ *   the excess amount.
+ *
+ *   In some cases we may return this excess to the global block reserve or
+ *   delayed refs reserve if either of their ->size is greater than their
+ *   ->reserved.
+ *
+ * BLOCK_RSV_TRANS, BLOCK_RSV_DELOPS, BLOCK_RSV_CHUNK
+ *   These behave normally, as described above, just within the confines of the
+ *   lifetime of their particular operation (transaction for the whole trans
+ *   handle lifetime, for example).
+ *
+ * BLOCK_RSV_GLOBAL
+ *   This has existed forever, with diminishing degrees of importance.
+ *   Currently it exists to save us from ourselves.  We definitely over-reserve
+ *   space most of the time, but the nature of COW is that we do not know how
+ *   much space we may need to use for any given operation.  This is
+ *   particularly true about the extent tree.  Modifying one extent could
+ *   balloon into 1000 modifications of the extent tree, which we have no way of
+ *   properly predicting.  To cover this case we have the global reserve act as
+ *   the "root" space to allow us to not abort the transaciton when things are
+ *   very tight.  As such we tend to treat this space as sacred, and only use it
+ *   if we are desparate.  Generally we should no longer be depending on its
+ *   space, and if new use cases arise we need to address them elsewhere.
+ *
+ * BLOCK_RSV_DELALLOC
+ *   The individual item sizes are determined by the per-inode size
+ *   calculations, which are described with the delalloc code.  This is pretty
+ *   straightforward, it's just the calculation of ->size encodes a lot of
+ *   different items, and thus it gets used when updating inodes, inserting file
+ *   extents, and inserting checksums.
+ *
+ * BLOCK_RSV_DELREFS
+ *   We keep a running tally of how many delayed refs we have on the system.
+ *   We assume each one of these delayed refs are going to use a full
+ *   reservation.  We use the transaction items and pre-reserve space for every
+ *   operation, and use this reservation to refill any gap between ->size and
+ *   ->reserved that may exist.
+ *
+ *   From there it's straightforward, removing a delayed ref means we remove its
+ *   count from ->size and free up reservations as necessary.  Since this is the
+ *   most dynamic block rsv in the system, we will try to refill this block rsv
+ *   first with any excess returned by any other block reserve.
+ *
+ * BLOCK_RSV_EMPTY
+ *   This is the fallback block rsv to make us try to reserve space if we don't
+ *   have a specific bucket for this allocation.  It is mostly used for updating
+ *   the device tree and such, since that is a separate pool we're content to
+ *   just reserve space from the space_info on demand.
+ *
+ * BLOCK_RSV_TEMP
+ *   This is used by things like truncate and iput.  We will temporarily
+ *   allocate a block rsv, set it to some size, and then truncate bytes until we
+ *   have no space left.  With ->failfast set we'll simply return ENOSPC from
+ *   btrfs_use_block_rsv() to signal that we need to unwind and try to make a
+ *   new reservation.  This is because these operations are unbounded, so we
+ *   want to do as much work as we can, and then back off and re-reserve.
+ */
+
 static u64 block_rsv_release_bytes(struct btrfs_fs_info *fs_info,
 				    struct btrfs_block_rsv *block_rsv,
 				    struct btrfs_block_rsv *dest, u64 num_bytes,