[3/3] btrfs: describe the space reservation system in general

Message ID 20200203204436.517473-4-josef@toxicpanda.com
State New
Series
  • Add comments describing how space reservation works

Commit Message

Josef Bacik Feb. 3, 2020, 8:44 p.m. UTC
Add another comment to cover how the space reservation system works
generally.  This covers the actual reservation flow, as well as how
flushing is handled.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/space-info.c | 128 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 128 insertions(+)

Comments

Qu Wenruo Feb. 4, 2020, 10:14 a.m. UTC | #1
On 2020/2/4 4:44 AM, Josef Bacik wrote:
> Add another comment to cover how the space reservation system works
> generally.  This covers the actual reservation flow, as well as how
> flushing is handled.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/space-info.c | 128 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 128 insertions(+)
> 
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index d3befc536a7f..6de1fbe2835a 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -10,6 +10,134 @@
>  #include "transaction.h"
>  #include "block-group.h"
>  
> +/*
> + * HOW DOES SPACE RESERVATION WORK
> + *
> + * If you want to know about delalloc specifically, there is a separate comment
> + * for that with the delalloc code.  This comment is about how the whole system
> + * works generally.
> + *
> + * BASIC CONCEPTS
> + *
> + *   1) space_info.  This is the ultimate arbiter of how much space we can use.
> + *   There's a description of the bytes_ fields with the struct declaration,
> + *   refer to that for specifics on each field.  Suffice it to say that for
> + *   reservations we care about total_bytes - SUM(space_info->bytes_) when
> + *   determining if there is space to make an allocation.

How about mentioning the three types of space info (DATA, METADATA, SYSTEM)?

This may also be a good time to update the comment on
btrfs_space_info::bytes_*.
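
To make the "total_bytes - SUM(space_info->bytes_)" check in the quoted
comment concrete, here is a minimal sketch. It is a toy model, not the kernel
code: the struct, both helpers, and the toy_ prefix are hypothetical, and only
the counter names follow struct btrfs_space_info.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of one space_info (hypothetical, illustrative subset). */
struct toy_space_info {
	uint64_t total_bytes;    /* space in allocated chunks of this type */
	uint64_t bytes_used;     /* extents already written out */
	uint64_t bytes_reserved; /* extents allocated but not yet written */
	uint64_t bytes_pinned;   /* freed space, not reusable until commit */
	uint64_t bytes_readonly; /* space in read-only block groups */
	uint64_t bytes_may_use;  /* outstanding reservations */
};

/* SUM(space_info->bytes_*) from the comment. */
static uint64_t toy_space_info_used(const struct toy_space_info *s)
{
	return s->bytes_used + s->bytes_reserved + s->bytes_pinned +
	       s->bytes_readonly + s->bytes_may_use;
}

/* True if 'bytes' fits in total_bytes - SUM(bytes_*), with no overcommit. */
static bool toy_can_reserve(const struct toy_space_info *s, uint64_t bytes)
{
	uint64_t used = toy_space_info_used(s);

	if (used > s->total_bytes)
		return false;
	return bytes <= s->total_bytes - used;
}
```

With total_bytes = 100 and the counters summing to 80, a 20 byte request fits
and a 21 byte request does not.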

> + *
> + *   2) block_rsv's.  These are basically buckets for every different type of
> + *   metadata reservation we have.  You can see the comment in the block_rsv
> + *   code on the rules for each type, but generally block_rsv->reserved is how
> + *   much space is accounted for in space_info->bytes_may_use.
> + *
> + *   3) btrfs_calc*_size.  These are the worst case calculations we use based
> + *   on the number of items we will want to modify.  We have one for changing
> + *   items, and one for inserting new items.  Generally we use these helpers to
> + *   determine the size of the block reserves, and then use the actual bytes
> + *   values to adjust the space_info counters.
> + *
> + * MAKING RESERVATIONS, THE NORMAL CASE
> + *
> + *   Things wanting to make reservations will calculate the size that they want
> + *   and make a reservation request.  If there is sufficient space, and there
> + *   are no current reservations pending, we will adjust
> + *   space_info->bytes_may_use by this amount.
> + *
> + *   Once we allocate an extent, we will add that size to ->bytes_reserved and
> + *   subtract the size from ->bytes_may_use.  Once that extent is written out we
> + *   subtract that value from ->bytes_reserved and add it to ->bytes_used.

This is the lifespan of a reservation! A diagram would definitely be easier
to read, with fewer words.
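
One way to picture the lifespan in the quoted paragraph is
bytes_may_use -> bytes_reserved -> bytes_used, with the error path dropping
the reservation from bytes_may_use. The sketch below models only those
transitions; the struct and the toy_* helpers are hypothetical names, not
kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of the space_info counters. */
struct toy_counters {
	uint64_t bytes_may_use;
	uint64_t bytes_reserved;
	uint64_t bytes_used;
};

/* Reservation granted: account it in bytes_may_use. */
static void toy_reserve(struct toy_counters *c, uint64_t bytes)
{
	c->bytes_may_use += bytes;
}

/* Extent allocated: move the size from bytes_may_use to bytes_reserved. */
static void toy_alloc_extent(struct toy_counters *c, uint64_t bytes)
{
	assert(c->bytes_may_use >= bytes);
	c->bytes_may_use -= bytes;
	c->bytes_reserved += bytes;
}

/* Extent written out: move the size from bytes_reserved to bytes_used. */
static void toy_extent_written(struct toy_counters *c, uint64_t bytes)
{
	assert(c->bytes_reserved >= bytes);
	c->bytes_reserved -= bytes;
	c->bytes_used += bytes;
}

/* Error or unused remainder: the reserver drops it from bytes_may_use. */
static void toy_release(struct toy_counters *c, uint64_t bytes)
{
	assert(c->bytes_may_use >= bytes);
	c->bytes_may_use -= bytes;
}
```

Reserving 100, allocating and writing a 60 byte extent, then releasing the
remaining 40 leaves bytes_used at 60 and the other two counters at 0.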

> + *
> + *   If there is an error at any point the reserver is responsible for dropping
> + *   its reservation from ->bytes_may_use.
> + *
> + * MAKING RESERVATIONS, FLUSHING
> + *
> + *   If we are unable to satisfy our reservation, or if there are pending
> + *   reservations already, we will create a reserve ticket and add ourselves to
> + *   the appropriate list.  This is controlled by btrfs_reserve_flush_enum.  For
> + *   simplicity's sake this boils down to two cases, priority and normal.

This is the core of the whole ticketing space reservation system.
It definitely needs something like a statement of the objective (to get some
free space) and the entry functions.
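
As a rough illustration of the ticket handling described in the quoted
comment (requests served in order, ticket->bytes set to 0 on success, and the
priority list served first), here is a toy model. The array-based lists, the
'woken' flag, and every toy_* name are assumptions made for the sketch, and
stopping at the first ticket that does not fit is one reading of "in order".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reserve ticket. */
struct toy_ticket {
	uint64_t bytes; /* bytes still needed; 0 means satisfied */
	bool woken;     /* stands in for waking the sleeping task */
};

/*
 * Grant space to tickets in list order and stop at the first one that
 * does not fit, so a small later request cannot jump the queue.
 * Returns the free bytes left over.
 */
static uint64_t toy_grant_list(struct toy_ticket *list, size_t n,
			       uint64_t free_bytes)
{
	for (size_t i = 0; i < n; i++) {
		if (list[i].bytes == 0)
			continue;
		if (list[i].bytes > free_bytes)
			break;
		free_bytes -= list[i].bytes;
		list[i].bytes = 0;
		list[i].woken = true;
	}
	return free_bytes;
}

/* Priority tickets see new space first, then the normal tickets. */
static uint64_t toy_try_granting(struct toy_ticket *prio, size_t nprio,
				 struct toy_ticket *normal, size_t nnormal,
				 uint64_t free_bytes)
{
	free_bytes = toy_grant_list(prio, nprio, free_bytes);
	return toy_grant_list(normal, nnormal, free_bytes);
}
```

In this model such a pass would be run every time one of the ->bytes_*
counters changes.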

> + *
> + *   1) Priority.  These reservations are important and have limited ability to
> + *   flush space.  For example, the relocation code currently tries to make a
> + *   reservation under a transaction commit, thus it cannot wait on anything
> + *   that may want to commit the transaction.  These tasks will add themselves
> + *   to the priority list and thus get any new space first, and then they can
> + *   flush space directly in their own context that is safe for them to do
> + *   without causing a deadlock.
> + *
> + *   2) Normal.  These reservations can wait forever on anything, because they do
> + *   not hold resources that they would deadlock on.  These tickets simply go to
> + *   sleep and start an async thread that will flush space on their behalf.
> + *   Every time one of the ->bytes_* counters is adjusted for the space info, we
> + *   will check to see if there is enough space to satisfy the requests (in
> + *   order) on either of our lists.  If there is enough space we will set the
> + *   ticket->bytes = 0, and wake the task up.  If we flush a few times and fail
> + *   to make any progress we will wake up all of the tickets and fail them all.
> + *
> + * THE FLUSHING STATES
> + *
> + *   Generally speaking we will have two cases for each state, a "nice" state
> + *   and an "ALL THE THINGS" state.  In btrfs we delay a lot of work in order to
> + *   reduce the locking overhead on the various trees, and even to keep from
> + *   doing any work at all in the case of delayed refs.  Each of these delayed
> + *   things however hold reservations, and so letting them run allows us to
> + *   reclaim space so we can make new reservations.

So these are just delayed work items that can free space when they are run.
The last sentence looks sufficient to me.

> + *
> + *   FLUSH_DELAYED_ITEMS
> + *     Every inode has a delayed item to update the inode (item).

The best explanation. This one sentence is enough to answer my question
about what delayed items are.

>  Take a simple write
> + *     for example, we would update the inode item at write time to update the
> + *     mtime, and then again at finish_ordered_io() time in order to update the
> + *     isize or bytes.  We keep these delayed items to coalesce these operations
> + *     into a single operation done on demand.  These are an easy way to reclaim
> + *     metadata space.
> + *
> + *   FLUSH_DELALLOC
> + *     Look at the delalloc comment to get an idea of how much space is reserved
> + *     for delayed allocation.  We can reclaim some of this space simply by
> + *     running delalloc, but usually we need to wait for ordered extents to
> + *     reclaim the bulk of this space.
> + *
> + *   FLUSH_DELAYED_REFS
> + *     We have a block reserve for the outstanding delayed refs space, and every
> + *     delayed ref operation holds a reservation.  Running these is a quick way
> + *     to reclaim space, but we want to hold this until the end because COW can
> + *     churn a lot and we can avoid making some extent tree modifications if we
> + *     are able to delay for as long as possible.
> + *
> + *   ALLOC_CHUNK
> + *     We will skip this the first time through space reservation, because of
> + *     overcommit and we don't want to have a lot of useless metadata space when
> + *     our worst case reservations will likely never come true.
> + *
> + *   RUN_DELAYED_IPUTS

Although I guess we all know what a delayed iput does, a one-line
explanation would definitely help.

> + *     If we're freeing inodes we're likely freeing checksums, file extent
> + *     items, and extent tree items.  Loads of space could be freed up by these
> + *     operations, however they won't be usable until the transaction commits.
> + *
> + *   COMMIT_TRANS
> + *     may_commit_transaction() is the ultimate arbiter on whether we commit the
> + *     transaction or not.  In order to avoid constantly churning we do all the
> + *     above flushing first and then commit the transaction as the last resort.
> + *     However we need to take into account things like pinned space that would
> + *     be freed, plus any delayed work we may not have gotten rid of in the case
> + *     of metadata.

All these comments make sense; it would be fantastic if we could reduce
the number of lines, though.

Thanks for all these comments, they really help,
Qu
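
The state list quoted above can be read as an escalation from cheap to
expensive work, with the transaction commit as the last resort. Here is a
minimal sketch of that walk, assuming the order in which the comment lists
the states. The enum, the callback type, and the table-driven example
callback are all hypothetical. The "nice" and "ALL THE THINGS" variants, and
skipping ALLOC_CHUNK on the first pass, are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* States in the order the comment lists them (toy names). */
enum toy_flush_state {
	TOY_FLUSH_DELAYED_ITEMS,
	TOY_FLUSH_DELALLOC,
	TOY_FLUSH_DELAYED_REFS,
	TOY_ALLOC_CHUNK,
	TOY_RUN_DELAYED_IPUTS,
	TOY_COMMIT_TRANS,
	TOY_NR_FLUSH_STATES,
};

/* Hypothetical callback: returns the bytes reclaimed by running a state. */
typedef uint64_t (*toy_flush_fn)(enum toy_flush_state state, void *ctx);

/*
 * Run the states in order until 'needed' bytes have been reclaimed.
 * Returns true on success; *last is the final state that was run.
 */
static bool toy_flush_space(uint64_t needed, toy_flush_fn flush, void *ctx,
			    enum toy_flush_state *last)
{
	uint64_t reclaimed = 0;

	for (int s = 0; s < TOY_NR_FLUSH_STATES; s++) {
		reclaimed += flush((enum toy_flush_state)s, ctx);
		*last = (enum toy_flush_state)s;
		if (reclaimed >= needed)
			return true;
	}
	return false;
}

/* Example callback: ctx is a per-state table of reclaimable bytes. */
static uint64_t toy_table_flush(enum toy_flush_state state, void *ctx)
{
	const uint64_t *table = ctx;

	return table[state];
}
```

A caller that needs only a little space stops at an early state; one that
needs more runs through to TOY_COMMIT_TRANS before giving up.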

> + *
> + * OVERCOMMIT
> + *   Because we hold so many reservations for metadata we will allow you to
> + *   reserve more space than is currently free in the currently allocated
> + *   metadata space.  This only happens with metadata, data does not allow
> + *   overcommitting.
> + *
> + *   You can see the current logic for when we allow overcommit in
> + *   btrfs_can_overcommit(), but it only applies to unallocated space.  If there
> + *   is no unallocated space to be had, all reservations are kept within the
> + *   free space in the allocated metadata chunks.
> + *
> + *   Because of overcommitting, you generally want to use the
> + *   btrfs_can_overcommit() logic for metadata allocations, as it does the right
> + *   thing with or without extra unallocated space.
> + */
> +
>  u64 __pure btrfs_space_info_used(struct btrfs_space_info *s_info,
>  			  bool may_use_included)
>  {
>
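
The overcommit rule quoted above (metadata may reserve beyond the allocated
chunks, but only against unallocated space) can be sketched as follows. This
is not the logic of btrfs_can_overcommit(): the struct, the helper, and the
overcommit_shift knob, which stands in for whatever fraction of unallocated
space the real policy allows, are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical view of the metadata space_info plus device state. */
struct toy_meta_space {
	uint64_t total_bytes; /* space in allocated metadata chunks */
	uint64_t used_bytes;  /* sum of the bytes_* counters */
	uint64_t unallocated; /* device space not yet in any chunk */
};

/*
 * Allow a reservation to exceed the allocated chunks by a fraction
 * (1 / 2^overcommit_shift) of the unallocated space.  With no
 * unallocated space the limit falls back to total_bytes.
 */
static bool toy_can_reserve_meta(const struct toy_meta_space *s,
				 uint64_t bytes,
				 unsigned int overcommit_shift)
{
	uint64_t limit;

	assert(overcommit_shift < 64);
	limit = s->total_bytes + (s->unallocated >> overcommit_shift);
	return s->used_bytes + bytes <= limit;
}
```

With 100 bytes of chunks, 90 used, and 80 unallocated, a shift of 1 gives a
limit of 140, so a 50 byte reservation succeeds. Once unallocated drops to 0
the same request fails, and only 10 bytes can be reserved.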

Patch

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index d3befc536a7f..6de1fbe2835a 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -10,6 +10,134 @@ 
 #include "transaction.h"
 #include "block-group.h"
 
+/*
+ * HOW DOES SPACE RESERVATION WORK
+ *
+ * If you want to know about delalloc specifically, there is a separate comment
+ * for that with the delalloc code.  This comment is about how the whole system
+ * works generally.
+ *
+ * BASIC CONCEPTS
+ *
+ *   1) space_info.  This is the ultimate arbiter of how much space we can use.
+ *   There's a description of the bytes_ fields with the struct declaration,
+ *   refer to that for specifics on each field.  Suffice it to say that for
+ *   reservations we care about total_bytes - SUM(space_info->bytes_) when
+ *   determining if there is space to make an allocation.
+ *
+ *   2) block_rsv's.  These are basically buckets for every different type of
+ *   metadata reservation we have.  You can see the comment in the block_rsv
+ *   code on the rules for each type, but generally block_rsv->reserved is how
+ *   much space is accounted for in space_info->bytes_may_use.
+ *
+ *   3) btrfs_calc*_size.  These are the worst case calculations we use based
+ *   on the number of items we will want to modify.  We have one for changing
+ *   items, and one for inserting new items.  Generally we use these helpers to
+ *   determine the size of the block reserves, and then use the actual bytes
+ *   values to adjust the space_info counters.
+ *
+ * MAKING RESERVATIONS, THE NORMAL CASE
+ *
+ *   Things wanting to make reservations will calculate the size that they want
+ *   and make a reservation request.  If there is sufficient space, and there
+ *   are no current reservations pending, we will adjust
+ *   space_info->bytes_may_use by this amount.
+ *
+ *   Once we allocate an extent, we will add that size to ->bytes_reserved and
+ *   subtract the size from ->bytes_may_use.  Once that extent is written out we
+ *   subtract that value from ->bytes_reserved and add it to ->bytes_used.
+ *
+ *   If there is an error at any point the reserver is responsible for dropping
+ *   its reservation from ->bytes_may_use.
+ *
+ * MAKING RESERVATIONS, FLUSHING
+ *
+ *   If we are unable to satisfy our reservation, or if there are pending
+ *   reservations already, we will create a reserve ticket and add ourselves to
+ *   the appropriate list.  This is controlled by btrfs_reserve_flush_enum.  For
+ *   simplicity's sake this boils down to two cases, priority and normal.
+ *
+ *   1) Priority.  These reservations are important and have limited ability to
+ *   flush space.  For example, the relocation code currently tries to make a
+ *   reservation under a transaction commit, thus it cannot wait on anything
+ *   that may want to commit the transaction.  These tasks will add themselves
+ *   to the priority list and thus get any new space first, and then they can
+ *   flush space directly in their own context that is safe for them to do
+ *   without causing a deadlock.
+ *
+ *   2) Normal.  These reservations can wait forever on anything, because they do
+ *   not hold resources that they would deadlock on.  These tickets simply go to
+ *   sleep and start an async thread that will flush space on their behalf.
+ *   Every time one of the ->bytes_* counters is adjusted for the space info, we
+ *   will check to see if there is enough space to satisfy the requests (in
+ *   order) on either of our lists.  If there is enough space we will set the
+ *   ticket->bytes = 0, and wake the task up.  If we flush a few times and fail
+ *   to make any progress we will wake up all of the tickets and fail them all.
+ *
+ * THE FLUSHING STATES
+ *
+ *   Generally speaking we will have two cases for each state, a "nice" state
+ *   and an "ALL THE THINGS" state.  In btrfs we delay a lot of work in order to
+ *   reduce the locking overhead on the various trees, and even to keep from
+ *   doing any work at all in the case of delayed refs.  Each of these delayed
+ *   things however hold reservations, and so letting them run allows us to
+ *   reclaim space so we can make new reservations.
+ *
+ *   FLUSH_DELAYED_ITEMS
+ *     Every inode has a delayed item to update the inode.  Take a simple write
+ *     for example, we would update the inode item at write time to update the
+ *     mtime, and then again at finish_ordered_io() time in order to update the
+ *     isize or bytes.  We keep these delayed items to coalesce these operations
+ *     into a single operation done on demand.  These are an easy way to reclaim
+ *     metadata space.
+ *
+ *   FLUSH_DELALLOC
+ *     Look at the delalloc comment to get an idea of how much space is reserved
+ *     for delayed allocation.  We can reclaim some of this space simply by
+ *     running delalloc, but usually we need to wait for ordered extents to
+ *     reclaim the bulk of this space.
+ *
+ *   FLUSH_DELAYED_REFS
+ *     We have a block reserve for the outstanding delayed refs space, and every
+ *     delayed ref operation holds a reservation.  Running these is a quick way
+ *     to reclaim space, but we want to hold this until the end because COW can
+ *     churn a lot and we can avoid making some extent tree modifications if we
+ *     are able to delay for as long as possible.
+ *
+ *   ALLOC_CHUNK
+ *     We will skip this the first time through space reservation, because of
+ *     overcommit and we don't want to have a lot of useless metadata space when
+ *     our worst case reservations will likely never come true.
+ *
+ *   RUN_DELAYED_IPUTS
+ *     If we're freeing inodes we're likely freeing checksums, file extent
+ *     items, and extent tree items.  Loads of space could be freed up by these
+ *     operations, however they won't be usable until the transaction commits.
+ *
+ *   COMMIT_TRANS
+ *     may_commit_transaction() is the ultimate arbiter on whether we commit the
+ *     transaction or not.  In order to avoid constantly churning we do all the
+ *     above flushing first and then commit the transaction as the last resort.
+ *     However we need to take into account things like pinned space that would
+ *     be freed, plus any delayed work we may not have gotten rid of in the case
+ *     of metadata.
+ *
+ * OVERCOMMIT
+ *   Because we hold so many reservations for metadata we will allow you to
+ *   reserve more space than is currently free in the currently allocated
+ *   metadata space.  This only happens with metadata, data does not allow
+ *   overcommitting.
+ *
+ *   You can see the current logic for when we allow overcommit in
+ *   btrfs_can_overcommit(), but it only applies to unallocated space.  If there
+ *   is no unallocated space to be had, all reservations are kept within the
+ *   free space in the allocated metadata chunks.
+ *
+ *   Because of overcommitting, you generally want to use the
+ *   btrfs_can_overcommit() logic for metadata allocations, as it does the right
+ *   thing with or without extra unallocated space.
+ */
+
 u64 __pure btrfs_space_info_used(struct btrfs_space_info *s_info,
 			  bool may_use_included)
 {