
[v2,2/2] btrfs-progs: set the proper minimum size for a zoned file system

Message ID c1cfe98ea6c2610373d11d4df7c8855e6e98d3dc.1688658745.git.josef@toxicpanda.com (mailing list archive)
State New, archived
Series btrfs-progs: some zoned mkfs fixups

Commit Message

Josef Bacik July 6, 2023, 3:54 p.m. UTC
We currently set the minimum size of the file system to 5 * the zone size;
however, it actually needs to be 7 * the zone size.  Fix up the comment and
the math to match our actual minimum zoned file system size.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 mkfs/main.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Christoph Hellwig July 7, 2023, 11:38 a.m. UTC | #1
On Thu, Jul 06, 2023 at 11:54:00AM -0400, Josef Bacik wrote:
> We currently set the minimum size of the file system to 5 * the zone size;
> however, it actually needs to be 7 * the zone size.  Fix up the comment and
> the math to match our actual minimum zoned file system size.

Hmm.  Is this actually correct?  Don't we also need at least a second
metadata and system block group in case the first one fills up and
metadata needs to go somewhere else to be able to reset the previous
ones?

Sorry, should have noticed that last time around.
Naohiro Aota July 10, 2023, 12:57 a.m. UTC | #2
On Fri, Jul 07, 2023 at 04:38:10AM -0700, Christoph Hellwig wrote:
> On Thu, Jul 06, 2023 at 11:54:00AM -0400, Josef Bacik wrote:
> > We currently set the minimum size of the file system to 5 * the zone size;
> > however, it actually needs to be 7 * the zone size.  Fix up the comment and
> > the math to match our actual minimum zoned file system size.
> 
> Hmm.  Is this actually correct?  Don't we also need at least a second
> metadata and system block group in case the first one fills up and
> metadata needs to go somewhere else to be able to reset the previous
> ones?
> 
> Sorry, should have noticed that last time around.

It depends on what we consider "minimal" to be. Even with the 5 zones (2
SBs + 1 per BG type), we can start writing to the file system.

If you need to run a relocation, one more block group for it is needed.

The fsync block group might be optional, because if the fsync node
allocation fails, it should fall back to the full sync. That will kill the
performance, but it still works...

If we define "minimal" as being able to write and delete a file
indefinitely without ENOSPC, we need one (or two, depending on the metadata
profile) more BGs per META/SYSTEM.
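
As a back-of-the-envelope illustration (a sketch only, not the actual
mkfs code; all the names below are made up), the two definitions work
out to:

    /* Zone budget, assuming SINGLE profile for everything. */
    #define ZONES_SB          2  /* the two superblock zones */
    #define ZONES_PER_BG_TYPE 1  /* SYSTEM, METADATA, DATA */

    /* "can start writing": 2 SBs + 1 zone per BG type = 5 zones */
    static unsigned int min_zones_writable(void)
    {
            return ZONES_SB + 3 * ZONES_PER_BG_TYPE;
    }

    /* "can keep writing indefinitely": add relocation + tree log = 7 */
    static unsigned int min_zones_sustainable(void)
    {
            return min_zones_writable() + 1 + 1;
    }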
Naohiro Aota July 10, 2023, 12:59 a.m. UTC | #3
On Thu, Jul 06, 2023 at 11:54:00AM -0400, Josef Bacik wrote:
> We currently set the minimum size of the file system to 5 * the zone size;
> however, it actually needs to be 7 * the zone size.  Fix up the comment and
> the math to match our actual minimum zoned file system size.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  mkfs/main.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/mkfs/main.c b/mkfs/main.c
> index 8d94dac8..c7d7399f 100644
> --- a/mkfs/main.c
> +++ b/mkfs/main.c
> @@ -84,10 +84,12 @@ struct prepare_device_progress {
>   * 1 zone for the system block group
>   * 1 zone for a metadata block group
>   * 1 zone for a data block group
> + * 1 zone for a relocation block group
> + * 1 zone for the tree log
>   */
>  static u64 min_zoned_fs_size(const char *file)
>  {
> -	return 5 * zone_size(file);
> +	return 7 * zone_size(file);

When we use the DUP profile for METADATA or SYSTEM, we need two zones each
for the METADATA, SYSTEM, and tree log BGs (see the sketch below).

>  }
>  
>  static int create_metadata_block_groups(struct btrfs_root *root, bool mixed,
> -- 
> 2.41.0
>
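
Something like the following would account for that (a sketch only;
"metadata_ncopies" is a hypothetical parameter, not the actual
btrfs-progs interface):

    /*
     * Scale the SYSTEM/METADATA/tree-log zones by the number of copies
     * the metadata profile writes (1 for SINGLE, 2 for DUP).
     */
    static u64 min_zoned_fs_size(const char *file, int metadata_ncopies)
    {
            u64 zones = 2;                 /* superblock zones */

            zones += 3 * metadata_ncopies; /* system + metadata + tree log */
            zones += 2;                    /* data + relocation */

            return zones * zone_size(file);
    }

With SINGLE metadata this gives the 7 zones from the patch; with DUP it
gives 10.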
Christoph Hellwig July 10, 2023, 5:28 a.m. UTC | #4
On Mon, Jul 10, 2023 at 12:57:52AM +0000, Naohiro Aota wrote:
> It depends on what we consider "minimal" to be.

I think minimal means a file system that can actually be continuously
used.

> Even with the 5 zones (2
> SBs + 1 per BG type), we can start writing to the file system.
> 
> If you need to run a relocation, one more block group for it is needed.
> 
> The fsync block group might be optional, because if the fsync node
> allocation fails, it should fall back to the full sync. That will kill the
> performance, but it still works...
> 
> If we define "minimal" as being able to write and delete a file
> indefinitely without ENOSPC, we need one (or two, depending on the metadata
> profile) more BGs per META/SYSTEM.

Based on my sentence above, we then need:

 2 zones for the primary superblock

 metadata replication factor * (
   2 zones for the system block group
   2 zones for a metadata block group
   2 zones for the tree log)

 data replication factor * (
   1 zone for a data block group
   1 zone for a relocation block group)

where the two zones for the non-sb, non-data blocks account for being
able to continue writing after filling up one bg and for allowing gc.
In fact, in other zoned storage systems even just two might lead to
deadlocks in that case, depending on the exact algorithm, but I don't
know enough about btrfs metadata placement to understand how that works
on zoned btrfs right now.
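
Or, as a rough C sketch of the same formula (the factor parameters are
illustrative, not an existing interface):

    static u64 min_zones(unsigned int meta_factor, unsigned int data_factor)
    {
            u64 zones = 2;                          /* primary superblock */

            zones += meta_factor * (2 + 2 + 2);     /* system, metadata, log */
            zones += data_factor * (1 + 1);         /* data, relocation */

            return zones;
    }

With SINGLE everywhere that is 2 + 6 + 2 = 10 zones; with DUP metadata
it is 2 + 12 + 2 = 16.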
David Sterba July 13, 2023, 6:19 p.m. UTC | #5
On Sun, Jul 09, 2023 at 10:28:12PM -0700, hch@infradead.org wrote:
> On Mon, Jul 10, 2023 at 12:57:52AM +0000, Naohiro Aota wrote:
> > It depends on what we consider "minimal" to be.
> 
> I think minimal means a file system that can actually be continuously
> used.
> 
> > Even with the 5 zones (2
> > SBs + 1 per BG type), we can start writing to the file system.
> > 
> > If you need to run a relocation, one more block group for it is needed.
> > 
> > The fsync block group might be optional, because if the fsync node
> > allocation fails, it should fall back to the full sync. That will kill the
> > performance, but it still works...
> > 
> > If we define "minimal" as being able to write and delete a file
> > indefinitely without ENOSPC, we need one (or two, depending on the metadata
> > profile) more BGs per META/SYSTEM.
> 
> Based on my sentence above, we then need:
> 
>  2 zones for the primary superblock
> 
>  metadata replication factor * (
>    2 zones for the system block group
>    2 zones for a metadata block group
>    2 zones for the tree log)
> 
>  data replication factor * (
>    1 zone for a data block group
>    1 zone for a relocation block group)

I think the relocation should be counted separately; there can be only one
relocation process running per block group type, i.e. data/metadata, and
the replication depends on the respective factor. Otherwise, yeah, the
formula for the minimal number of zones needs to take into account the
replication and all the normal usage cases. In total this is still a low
number, say always below 20 with the currently supported profiles. Devices
typically have more, and for emulated ones we should scale the size or
the zone size properly.

Setting up devices with a small number of spare zones is also interesting,
as is using small zones that will trigger the reclaim more often.
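
As a quick sanity check of the "below 20" number (my arithmetic, with
relocation counted once per block group type): DUP metadata plus SINGLE
data gives 2 + 2 * (2 + 2 + 2) + 1 + 1 = 16 zones.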
Naohiro Aota July 21, 2023, 6:43 a.m. UTC | #6
On Thu, Jul 13, 2023 at 08:19:22PM +0200, David Sterba wrote:
> On Sun, Jul 09, 2023 at 10:28:12PM -0700, hch@infradead.org wrote:
> > On Mon, Jul 10, 2023 at 12:57:52AM +0000, Naohiro Aota wrote:
> > > It depends on what we consider "minimal" to be.
> > 
> > I think minimal means a file system that can actually be continuously
> > used.
> > 
> > > Even with the 5 zones (2
> > > SBs + 1 per BG type), we can start writing to the file system.
> > > 
> > > If you need to run a relocation, one more block group for it is needed.
> > > 
> > > The fsync block group might be optional, because if the fsync node
> > > allocation fails, it should fall back to the full sync. That will kill the
> > > performance, but it still works...
> > > 
> > > If we define "minimal" as being able to write and delete a file
> > > indefinitely without ENOSPC, we need one (or two, depending on the metadata
> > > profile) more BGs per META/SYSTEM.
> > 
> > Based on my sentence above, we then need:
> > 
> >  2 zones for the primary superblock
> > 
> >  metadata replication factor * (
> >    2 zones for the system block group
> >    2 zones for a metadata block group
> >    2 zones for the tree log)
> > 
> >  data replication factor * (
> >    1 zone for a data block group
> >    1 zone for a relocation block group)
> 
> I think the relocation should be counted separately; there can be only one
> relocation process running per block group type, i.e. data/metadata, and

The relocation block group only writes relocated data, and that data must be
written into a dedicated block group. Relocated metadata can go into the
same BG as normal metadata. So, the above calculation looks good to me.

> the replication depends on the respective factor. Otherwise, yeah, the
> formula for the minimal number of zones needs to take into account the
> replication and all the normal usage cases. In total this is still a low
> number, say always below 20 with the currently supported profiles. Devices
> typically have more, and for emulated ones we should scale the size or
> the zone size properly.
> 
> Setting up devices with a small number of spare zones is also interesting,
> as is using small zones that will trigger the reclaim more often.

Patch

diff --git a/mkfs/main.c b/mkfs/main.c
index 8d94dac8..c7d7399f 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -84,10 +84,12 @@ struct prepare_device_progress {
  * 1 zone for the system block group
  * 1 zone for a metadata block group
  * 1 zone for a data block group
+ * 1 zone for a relocation block group
+ * 1 zone for the tree log
  */
 static u64 min_zoned_fs_size(const char *file)
 {
-	return 5 * zone_size(file);
+	return 7 * zone_size(file);
 }
 
 static int create_metadata_block_groups(struct btrfs_root *root, bool mixed,