
[RFC] xfs: filesystem expansion design documentation

Message ID 20240721230100.4159699-1-david@fromorbit.com (mailing list archive)
State New
Series [RFC] xfs: filesystem expansion design documentation

Commit Message

Dave Chinner July 21, 2024, 11:01 p.m. UTC
From: Dave Chinner <dchinner@redhat.com>

xfs-expand is an attempt to address the container/vm orchestration
image issue where really small XFS filesystems are grown to massive
sizes via xfs_growfs and end up with really insane, suboptimal
geometries.

Rather than growing a filesystem by appending AGs, expanding a
filesystem is based on allowing the existing AGs to grow to their
maximum size first. If further growth is needed, then the
traditional "append more AGs" growfs mechanism is triggered.

This document describes the structure of an XFS filesystem needed to
achieve this expansion, as well as the design of userspace tools
needed to make the mechanism work.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 Documentation/filesystems/xfs/index.rst       |   1 +
 .../filesystems/xfs/xfs-expand-design.rst     | 312 ++++++++++++++++++
 2 files changed, 313 insertions(+)
 create mode 100644 Documentation/filesystems/xfs/xfs-expand-design.rst

Comments

Darrick J. Wong July 23, 2024, 11:58 p.m. UTC | #1
On Mon, Jul 22, 2024 at 09:01:00AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> xfs-expand is an attempt to address the container/vm orchestration
> image issue where really small XFS filesystems are grown to massive
> sizes via xfs_growfs and end up with really insane, suboptimal
> geometries.
> 
> Rather that grow a filesystem by appending AGs, expanding a
> filesystem is based on allowing existing AGs to be expanded to
> maximum sizes first. If further growth is needed, then the
> traditional "append more AGs" growfs mechanism is triggered.
> 
> This document describes the structure of an XFS filesystem needed to
> achieve this expansion, as well as the design of userspace tools
> needed to make the mechanism work.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  Documentation/filesystems/xfs/index.rst       |   1 +
>  .../filesystems/xfs/xfs-expand-design.rst     | 312 ++++++++++++++++++
>  2 files changed, 313 insertions(+)
>  create mode 100644 Documentation/filesystems/xfs/xfs-expand-design.rst
> 
> diff --git a/Documentation/filesystems/xfs/index.rst b/Documentation/filesystems/xfs/index.rst
> index ab66c57a5d18..cb570fc886b2 100644
> --- a/Documentation/filesystems/xfs/index.rst
> +++ b/Documentation/filesystems/xfs/index.rst
> @@ -12,3 +12,4 @@ XFS Filesystem Documentation
>     xfs-maintainer-entry-profile
>     xfs-self-describing-metadata
>     xfs-online-fsck-design
> +   xfs-expand-design
> diff --git a/Documentation/filesystems/xfs/xfs-expand-design.rst b/Documentation/filesystems/xfs/xfs-expand-design.rst
> new file mode 100644
> index 000000000000..fffc0b44518d
> --- /dev/null
> +++ b/Documentation/filesystems/xfs/xfs-expand-design.rst
> @@ -0,0 +1,312 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +===============================
> +XFS Filesystem Expansion Design
> +===============================
> +
> +Background
> +==========
> +
> +XFS has long been able to grow the size of the filesystem dynamically whilst
> +mounted. The functionality has been used extensively over the past 3 decades
> +for managing filesystems on expandable storage arrays, but over the past decade
> +there has been significant growth in filesystem image based orchestration
> +frameworks that require expansion of the filesystem image during deployment.
> +
> +These frameworks want the initial image to be as small as possible to minimise
> +the cost of deployment, but then want that image to scale to whatever size the
> +deployment requires. This means that the base image can be as small as a few
> +hundred megabytes and be expanded on deployment to tens of terabytes.
> +
> +Growing a filesystem by 4-5 orders of magnitude is a long way outside the scope
> +of the original xfs_growfs design requirements. It was designed for users who
> +were adding physical storage to already large storage arrays; a single order of
> +magnitude in growth was considered a very large expansion.
> +
> +As a result, we have a situation where growing a filesystem works well up to a
> +certain point, yet we have orchestration frameworks that allow users to expand
> +filesystems a long way past this point without them being aware of the issues
> +it will cause them further down the track.

Ok, same growfs-on-deploy problem that we have.  Though, the minimum OCI
boot volume size is ~47GB so at least we're not going from 2G -> 200G.
Usually.

> +Scope
> +=====
> +
> +The need to expand filesystems with a geometry optimised for small storage
> +volumes onto much larger storage volumes results in a large filesystem with
> +poorly optimised geometry. Growing a small XFS filesystem by several orders of
> +magnitude results in a filesystem with many small allocation groups (AGs). This
> +is bad for allocation efficiency, contiguous free space management, allocation
> +performance as the filesystem fills, and so on. The filesystem will also end up
> +with a very small journal for the size of the filesystem which can limit the
> +metadata performance and concurrency in the filesystem drastically.
> +
> +These issues are a result of the filesystem growing algorithm. It is an
> +append-only mechanism which takes advantage of the fact we can safely initialise
> +the metadata for new AGs beyond the end of the existing filesystem without
> +impacting runtime behaviour. Those newly initialised AGs can then be enabled
> +atomically by running a single transaction to expose that newly initialised
> +space to the running filesystem.
> +
> +As a result, the growing algorithm is a fast, transparent, simple and crash-safe
> +algorithm that can be run while the filesystem is mounted. It's a very good
> +algorithm for growing a filesystem on a block device that has had new physical
> +storage appended to its LBA space.
> +
> +However, this algorithm shows its limitations when we move to system deployment
> +via filesystem image distribution. These deployments optimise the base
> +filesystem image for minimal size to minimise the time and cost of deploying
> +them to the newly provisioned system (be it VM or container). They rely on the
> +filesystem's ability to grow to the size of the destination storage during the
> +first system bringup, when they tailor the deployed filesystem image for its
> +intended purpose and identity.
> +
> +If the deployed system has substantial storage provisioned, this means the
> +filesystem image will be expanded by multiple orders of magnitude during the
> +system initialisation phase, and this is where the existing append-based growing
> +algorithm falls apart. This is the issue that this design seeks to resolve.

I very much appreciate the scope definition here.  I also very much
appreciate starting off with a design document!  Thank you.

<snip out parts I'm already familiar with>

> +Optimising Physical AG Realignment
> +==================================
> +
> +The elephant in the room at this point in time is the fact that we have to
> +physically move data around to expand AGs. While this makes AG size expansion
> +prohibitive for large filesystems, they should already have large AGs and so
> +using the existing grow mechanism will continue to be the right tool to use for
> +expanding them.
> +
> +However, for small filesystems and filesystem images in the order of hundreds of
> +MB to a few GB in size, the cost of moving data around is much more tolerable.
> +If we can optimise the IO patterns to be purely sequential, offload the movement
> +to the hardware, or even use address space manipulation APIs to minimise the
> +cost of this movement, then resizing AGs via realignment becomes even more
> +appealing.
> +
> +Realigning AGs must avoid overwriting parts of AGs that have not yet been
> +realigned. That means we can't realign the AGs from AG 1 upwards - doing so will
> +overwrite parts of AG2 before we've realigned that data. Hence realignment must
> +be done from the highest AG first, and work downwards.
> +
> +Moving the data within an AG could be optimised to be space usage aware, similar
> +to what xfs_copy does to build sparse filesystem images. However, the space
> +optimised filesystem images aren't going to have a lot of free space in them,
> +and what there is may be quite fragmented. Hence doing free space aware copying
> +of relatively full small AGs may be IOPS intensive. Given we are talking about
> +AGs in the typical size range from 64-512MB, doing a sequential copy of the
> +entire AG isn't going to take very long on any storage. If we have to do several
> +hundred seeks in that range to skip free space, then copying the free space will
> +cost less than the seeks and the partial RAID stripe writes that small IOs will
> +cause.
> +
> +Hence the simplest, sequentially optimised data moving algorithm will be:
> +
> +.. code-block:: c
> +
> +	for (agno = sb_agcount - 1; agno > 0; agno--) {
> +		src = agno * sb_agblocks;
> +		dst = agno * new_agblocks;
> +		copy_file_range(src, dst, sb_agblocks);
> +	}
> +
> +This also leads to optimisation via server side or block device copy offload
> +infrastructure. Instead of streaming the data through kernel buffers, the copy
> +is handed to the server/hardware to move the data internally as quickly as
> +possible.
> +
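> +As an illustrative sketch only (assuming the filesystem image is accessible as
> +a regular file via a single file descriptor), the per-AG copy might look like
> +the following using the real copy_file_range(2) interface, called for each AG
> +from the highest down to AG 1:
> +
> +.. code-block:: c
> +
> +	#define _GNU_SOURCE
> +	#include <unistd.h>
> +	#include <sys/types.h>
> +	#include <errno.h>
> +	#include <stdint.h>
> +
> +	/*
> +	 * Sketch only: move one AG from its old location to its realigned
> +	 * location. copy_file_range() may return a short count, so loop
> +	 * until the whole AG has been moved. Sizes and offsets are bytes.
> +	 */
> +	static int realign_ag(int fd, uint32_t agno, uint64_t old_agbytes,
> +			      uint64_t new_agbytes)
> +	{
> +		loff_t src = (loff_t)agno * old_agbytes;
> +		loff_t dst = (loff_t)agno * new_agbytes;
> +		uint64_t left = old_agbytes;
> +
> +		while (left > 0) {
> +			ssize_t ret = copy_file_range(fd, &src, fd, &dst,
> +						      left, 0);
> +			if (ret < 0)
> +				return -errno;
> +			if (ret == 0)
> +				return -EIO;	/* unexpected EOF */
> +			left -= ret;
> +		}
> +		return 0;
> +	}
> +
> +Note that copy_file_range() rejects overlapping source and destination ranges
> +within the same file, so a plain read/write fallback would be needed whenever
> +an AG's new location overlaps its old one.
> +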
> +For filesystem images held in files and, potentially, on sparse storage devices
> +like dm-thinp, we don't even need to copy the data.  We can simply insert holes
> +into the underlying mapping at the appropriate place.  For filesystem images,
> +this is:
> +
> +.. code-block:: c
> +
> +	len = new_agblocks - sb_agblocks;
> +	for (agno = 1; agno < sb_agcount; agno++) {
> +		src = agno * sb_agblocks;
> +		fallocate(FALLOC_FL_INSERT_RANGE, src, len)
> +	}
> +
> +Then the filesystem image can be copied to the destination block device in an
> +efficient manner (i.e. skipping holes in the image file).
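> +
> +A similar illustrative sketch for the hole-insertion variant (again assuming a
> +regular file image; FALLOC_FL_INSERT_RANGE requires the offset and length to be
> +multiples of the image filesystem's block size). Working from the highest AG
> +downwards keeps each insertion at an offset that has not yet been shifted by a
> +previous insert:
> +
> +.. code-block:: c
> +
> +	#define _GNU_SOURCE
> +	#include <fcntl.h>
> +	#include <errno.h>
> +	#include <stdint.h>
> +
> +	/*
> +	 * Sketch only: insert 'len' bytes of hole in front of each AG so
> +	 * that AG 'agno' ends up at offset agno * (old_agbytes + len).
> +	 */
> +	static int expand_by_insert_range(int fd, uint32_t agcount,
> +					  uint64_t old_agbytes, uint64_t len)
> +	{
> +		uint32_t agno;
> +
> +		for (agno = agcount - 1; agno > 0; agno--) {
> +			off_t off = (off_t)agno * old_agbytes;
> +
> +			if (fallocate(fd, FALLOC_FL_INSERT_RANGE, off, len) < 0)
> +				return -errno;
> +		}
> +		return 0;
> +	}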

Does dm-thinp support insert range?  In the worst case (copy_file_range,
block device doesn't support xcopy) this results in a pagecache copy of
nearly all of the filesystem, doesn't it?

What about the log?  If sb_agblocks increases, that can cause
transaction reservations to increase, which also increases the minimum
log size.  If mkfs is careful, then I suppose xfs_expand could move the
log and make it bigger?  Or does mkfs create a log as if sb_agblocks
were 1TB, which will make the deployment image bigger?

Also, perhaps xfs_expand is a good opportunity to stamp a new uuid into
the superblock and set the metauuid bit?

I think the biggest difficulty for us (OCI) is that our block storage is
some sort of software defined storage system that exposes iscsi and
virtio-scsi endpoints.  For this to work, we'd have to have an
INSERT_RANGE SCSI command that the VM could send to the target and have
the device resize.  Does that exist today?

> +Hence there are several different realignment strategies that can be used to
> +optimise the expansion of the filesystem. The optimal strategy will ultimately
> +depend on how the orchestration software sets up the filesystem for
> +configuration at first boot. The userspace xfs expansion tool should be able to
> +support all these mechanisms directly so that higher level infrastructure
> +can simply select the option that best suits the installation being performed.
> +
> +
> +Limitations
> +===========
> +
> +This document describes an offline mechanism for expanding the filesystem
> +geometry. It doesn't add new AGs, just expands the existing AGs. If the
> +filesystem needs to be made larger than maximally sized AGs can address, then
> +a subsequent online xfs_growfs operation is still required.
> +
> +For container/vm orchestration software, this isn't a huge issue as they
> +generally grow the image from within the initramfs context on first boot. That
> +is currently a "mount; xfs_growfs" operation pair; adding expansion to this
> +would simply require adding expansion before the mount. i.e. first boot becomes
> +a "xfs_expand; mount; xfs_growfs" operation. Depending on the eventual size of
> +the target filesystem, the xfs_growfs operation may be a no-op.
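> +
> +As an illustration only (the xfs_expand command line does not exist yet, so the
> +interface shown here is assumed), a first boot sequence might look like:
> +
> +.. code-block:: sh
> +
> +	# illustrative first-boot sequence; device and mount point are examples
> +	xfs_expand /dev/vda3		# offline: expand the existing AGs
> +	mount /dev/vda3 /sysroot
> +	xfs_growfs /sysroot		# online: append new AGs if still needed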

I don't know about your cloud, but ours seems to optimize vm deploy
times very heavily.  Right now their firstboot payload calls xfs_admin
to change the fs uuid, mounts the fs, and then growfs's it into the
container.

Adding another pre-mount firstboot program (and one that potentially
might do a lot of IO) isn't going to be popular with them.  The vanilla
OL8 images that you can deploy from seem to consume ~12GB at first boot,
and that's before installing anything else.  Large Well Known Database
Products use quite a bit more... though at least those appliances format
a /data partition at deploy time and leave the rootfs alone.

> +Whether expansion can be done online is an open question. AG expansion changes
> +fundamental constants that are calculated at mount time (e.g. maximum AG btree
> +heights), and so an online expand would need to recalculate many internal
> +constants that are used throughout the codebase. This seems like a complex
> +problem to solve and isn't really necessary for the use case we need to address,
> +so online expansion remains a potential future enhancement that requires a lot
> +more thought.

<nod> There are a lot of moving pieces, online explode sounds hard.

--D

> -- 
> 2.45.1
> 
>
Dave Chinner July 24, 2024, 12:46 a.m. UTC | #2
On Tue, Jul 23, 2024 at 04:58:01PM -0700, Darrick J. Wong wrote:
> On Mon, Jul 22, 2024 at 09:01:00AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > xfs-expand is an attempt to address the container/vm orchestration
> > image issue where really small XFS filesystems are grown to massive
> > sizes via xfs_growfs and end up with really insane, suboptimal
> > geometries.
> > 
> > Rather that grow a filesystem by appending AGs, expanding a
> > filesystem is based on allowing existing AGs to be expanded to
> > maximum sizes first. If further growth is needed, then the
> > traditional "append more AGs" growfs mechanism is triggered.
> > 
> > This document describes the structure of an XFS filesystem needed to
> > achieve this expansion, as well as the design of userspace tools
> > needed to make the mechanism work.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  Documentation/filesystems/xfs/index.rst       |   1 +
> >  .../filesystems/xfs/xfs-expand-design.rst     | 312 ++++++++++++++++++
> >  2 files changed, 313 insertions(+)
> >  create mode 100644 Documentation/filesystems/xfs/xfs-expand-design.rst
> > 
> > diff --git a/Documentation/filesystems/xfs/index.rst b/Documentation/filesystems/xfs/index.rst
> > index ab66c57a5d18..cb570fc886b2 100644
> > --- a/Documentation/filesystems/xfs/index.rst
> > +++ b/Documentation/filesystems/xfs/index.rst
> > @@ -12,3 +12,4 @@ XFS Filesystem Documentation
> >     xfs-maintainer-entry-profile
> >     xfs-self-describing-metadata
> >     xfs-online-fsck-design
> > +   xfs-expand-design
> > diff --git a/Documentation/filesystems/xfs/xfs-expand-design.rst b/Documentation/filesystems/xfs/xfs-expand-design.rst
> > new file mode 100644
> > index 000000000000..fffc0b44518d
> > --- /dev/null
> > +++ b/Documentation/filesystems/xfs/xfs-expand-design.rst
> > @@ -0,0 +1,312 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +===============================
> > +XFS Filesystem Expansion Design
> > +===============================
> > +
> > +Background
> > +==========
> > +
> > +XFS has long been able to grow the size of the filesystem dynamically whilst
> > +mounted. The functionality has been used extensively over the past 3 decades
> > +for managing filesystems on expandable storage arrays, but over the past decade
> > +there has been significant growth in filesystem image based orchestration
> > +frameworks that require expansion of the filesystem image during deployment.
> > +
> > +These frameworks want the initial image to be as small as possible to minimise
> > +the cost of deployment, but then want that image to scale to whatever size the
> > +deployment requires. This means that the base image can be as small as a few
> > +hundred megabytes and be expanded on deployment to tens of terabytes.
> > +
> > +Growing a filesystem by 4-5 orders of magnitude is a long way outside the scope
> > +of the original xfs_growfs design requirements. It was designed for users who
> > +were adding physical storage to already large storage arrays; a single order of
> > +magnitude in growth was considered a very large expansion.
> > +
> > +As a result, we have a situation where growing a filesystem works well up to a
> > +certain point, yet we have orchestration frameworks that allows users to expand
> > +filesystems a long way past this point without them being aware of the issues
> > +it will cause them further down the track.
> 
> Ok, same growfs-on-deploy problem that we have.  Though, the minimum OCI
> boot volume size is ~47GB so at least we're not going from 2G -> 200G.
> Usually.
> 
> > +Scope
> > +=====
> > +
> > +The need to expand filesystems with a geometry optimised for small storage
> > +volumes onto much larger storage volumes results in a large filesystem with
> > +poorly optimised geometry. Growing a small XFS filesystem by several orders of
> > +magnitude results in filesystem with many small allocation groups (AGs). This is
> > +bad for allocation effciency, contiguous free space management, allocation
> > +performance as the filesystem fills, and so on. The filesystem will also end up
> > +with a very small journal for the size of the filesystem which can limit the
> > +metadata performance and concurrency in the filesystem drastically.
> > +
> > +These issues are a result of the filesystem growing algorithm. It is an
> > +append-only mechanism which takes advantage of the fact we can safely initialise
> > +the metadata for new AGs beyond the end of the existing filesystem without
> > +impacting runtime behaviour. Those newly initialised AGs can then be enabled
> > +atomically by running a single transaction to expose that newly initialised
> > +space to the running filesystem.
> > +
> > +As a result, the growing algorithm is a fast, transparent, simple and crash-safe
> > +algorithm that can be run while the filesystem is mounted. It's a very good
> > +algorithm for growing a filesystem on a block device that has has new physical
> > +storage appended to it's LBA space.
> > +
> > +However, this algorithm shows it's limitations when we move to system deployment
> > +via filesystem image distribution. These deployments optimise the base
> > +filesystem image for minimal size to minimise the time and cost of deploying
> > +them to the newly provisioned system (be it VM or container). They rely on the
> > +filesystem's ability to grow the filesystem to the size of the destination
> > +storage during the first system bringup when they tailor the deployed filesystem
> > +image for it's intented purpose and identity.
> > +
> > +If the deployed system has substantial storage provisioned, this means the
> > +filesystem image will be expanded by multiple orders of magnitude during the
> > +system initialisation phase, and this is where the existing append-based growing
> > +algorithm falls apart. This is the issue that this design seeks to resolve.
> 
> I very much appreciate the scope definition here.  I also very much
> appreciate starting off with a design document!  Thank you.
> 
> <snip out parts I'm already familiar with>
> 
> > +Optimising Physical AG Realignment
> > +==================================
> > +
> > +The elephant in the room at this point in time is the fact that we have to
> > +physically move data around to expand AGs. While this makes AG size expansion
> > +prohibitive for large filesystems, they should already have large AGs and so
> > +using the existing grow mechanism will continue to be the right tool to use for
> > +expanding them.
> > +
> > +However, for small filesystems and filesystem images in the order of hundreds of
> > +MB to a few GB in size, the cost of moving data around is much more tolerable.
> > +If we can optimise the IO patterns to be purely sequential, offload the movement
> > +to the hardware, or even use address space manipulation APIs to minimise the
> > +cost of this movement, then resizing AGs via realignment becomes even more
> > +appealing.
> > +
> > +Realigning AGs must avoid overwriting parts of AGs that have not yet been
> > +realigned. That means we can't realign the AGs from AG 1 upwards - doing so will
> > +overwrite parts of AG2 before we've realigned that data. Hence realignment must
> > +be done from the highest AG first, and work downwards.
> > +
> > +Moving the data within an AG could be optimised to be space usage aware, similar
> > +to what xfs_copy does to build sparse filesystem images. However, the space
> > +optimised filesystem images aren't going to have a lot of free space in them,
> > +and what there is may be quite fragmented. Hence doing free space aware copying
> > +of relatively full small AGs may be IOPS intensive. Given we are talking about
> > +AGs in the typical size range from 64-512MB, doing a sequential copy of the
> > +entire AG isn't going to take very long on any storage. If we have to do several
> > +hundred seeks in that range to skip free space, then copying the free space will
> > +cost less than the seeks and the partial RAID stripe writes that small IOs will
> > +cause.
> > +
> > +Hence the simplest, sequentially optimised data moving algorithm will be:
> > +
> > +.. code-block:: c
> > +
> > +	for (agno = sb_agcount - 1; agno > 0; agno--) {
> > +		src = agno * sb_agblocks;
> > +		dst = agno * new_agblocks;
> > +		copy_file_range(src, dst, sb_agblocks);
> > +	}
> > +
> > +This also leads to optimisation via server side or block device copy offload
> > +infrastructure. Instead of streaming the data through kernel buffers, the copy
> > +is handed to the server/hardware to moves the data internally as quickly as
> > +possible.
> > +
> > +For filesystem images held in files and, potentially, on sparse storage devices
> > +like dm-thinp, we don't even need to copy the data.  We can simply insert holes
> > +into the underlying mapping at the appropriate place.  For filesystem images,
> > +this is:
> > +
> > +.. code-block:: c
> > +
> > +	len = new_agblocks - sb_agblocks;
> > +	for (agno = 1; agno < sb_agcount; agno++) {
> > +		src = agno * sb_agblocks;
> > +		fallocate(FALLOC_FL_INSERT_RANGE, src, len)
> > +	}
> > +
> > +Then the filesystem image can be copied to the destination block device in an
> > +efficient manner (i.e. skipping holes in the image file).
> 
> Does dm-thinp support insert range?

No - that would be a future enhancement. I mention it simply because
these are things we would really want sparse block devices to
support natively.

> In the worst case (copy_file_range,
> block device doesn't support xcopy) this results in a pagecache copy of
> nearly all of the filesystem, doesn't it?

Yes, it would.

> What about the log?  If sb_agblocks increases, that can cause
> transaction reservations to increase, which also increases the minimum
> log size.

Not caring, because the current default minimum of 64MB is big enough for
any physical filesystem size. Further, 64MB is big enough for decent
metadata performance even on large filesystems, so we really don't
need to touch the journal here.

> If mkfs is careful, then I suppose xfs_expand could move the
> log and make it bigger?  Or does mkfs create a log as if sb_agblocks
> were 1TB, which will make the deployment image bigger?

Making the log bigger is not as straightforward as it could be.
If the log is dirty when we expand the filesystem and the dirty
section wraps the end of the log, expansion just got really complex.
So I'm just going to say "make the log large enough in the image
file to begin with" and assert that we already do this with a
current mkfs.

> Also, perhaps xfs_expand is a good opportunity to stamp a new uuid into
> the superblock and set the metauuid bit?

Isn't provisioning software generally already doing this via
xfs_admin? We don't do this with growfs, and I'd prefer not to
overload an expansion tool with random other administrative
functions that only some use cases/environments might need. 

> I think the biggest difficulty for us (OCI) is that our block storage is
> some sort of software defined storage system that exposes iscsi and
> virtio-scsi endpoints.  For this to work, we'd have to have an
> INSERT_RANGE SCSI command that the VM could send to the target and have
> the device resize.  Does that exist today?

Not that I know of, but it's largely irrelevant to the operation of
xfs_expand what go-fast features the underlying storage devices
support. If they support INSERT_RANGE, we can use
it. If they support FICLONERANGE we can use it. If the storage
supports copy offload, copy_file_range() can use it. If all else
fails, the kernel will just bounce the data through internal
buffers (page cache for buffered IO or pipe buffers for direct IO).

> > +Limitations
> > +===========
> > +
> > +This document describes an offline mechanism for expanding the filesystem
> > +geometery. It doesn't add new AGs, just expands they existing AGs. If the
> > +filesystem needs to be made larger than maximally sized AGs can address, then
> > +a subsequent online xfs_growfs operation is still required.
> > +
> > +For container/vm orchestration software, this isn't a huge issue as they
> > +generally grow the image from within the initramfs context on first boot. That
> > +is currently a "mount; xfs_growfs" operation pair; adding expansion to this
> > +would simply require adding expansion before the mount. i.e. first boot becomes
> > +a "xfs_expand; mount; xfs_growfs" operation. Depending on the eventual size of
> > +the target filesystem, the xfs-growfs operation may be a no-op.
> 
> I don't know about your cloud, but ours seems to optimize vm deploy
> times very heavily.  Right now their firstboot payload calls xfs_admin
> to change the fs uuid, mounts the fs, and then growfs's it into the
> container.
> 
> Adding another pre-mount firstboot program (and one that potentially
> might do a lot of IO) isn't going to be popular with them.

There's nothing that requires xfs_expand to be done at first boot.
First boot is just part of the deployment scripts and it may make
sense to do the expansion as early as possible in the deployment
process.

e.g. It could be run immediately after the image file is cloned from
the source golden image. At that point it's still just an XFS file,
right? INSERT_RANGE won't affect the fact the extents are shared
with the golden image, and it will be fast enough that it likely
won't make a measurable impact on deployment speed. Four insert
range calls on a largely contiguous file will take less than 100ms
in most cases.

-Dave.
Eric Sandeen July 24, 2024, 3:41 p.m. UTC | #3
On 7/23/24 7:46 PM, Dave Chinner wrote:
>> What about the log?  If sb_agblocks increases, that can cause
>> transaction reservations to increase, which also increases the minimum
>> log size.
> Not caring, because the current default minimum of 64MB is big enough for
> any physical filesystem size. Further, 64MB is big enough for decent
> metadata performance even on large filesystem, so we really don't
> need to touch the journal here.

Seems fair, but just to stir the pot, "growing the log" offline, when
you've just added potentially gigabytes of free space to an AG, should
be trivial, right?

-Eric
Eric Sandeen July 24, 2024, 3:44 p.m. UTC | #4
On 7/24/24 10:41 AM, Eric Sandeen wrote:
> On 7/23/24 7:46 PM, Dave Chinner wrote:
>>> What about the log?  If sb_agblocks increases, that can cause
>>> transaction reservations to increase, which also increases the minimum
>>> log size.
>> Not caring, because the current default minimum of 64MB is big enough for
>> any physical filesystem size. Further, 64MB is big enough for decent
>> metadata performance even on large filesystem, so we really don't
>> need to touch the journal here.
> 
> Seems fair, but just to stir the pot, "growing the log" offline, when
> you've just added potentially gigabytes of free space to an AG, should
> be trivial, right?

Ugh I'm sorry, read to the end before responding, Eric.

(I had assumed that an expand operation would require a clean log, but I
suppose it doesn't have to.)

-Eric
Darrick J. Wong July 24, 2024, 5:23 p.m. UTC | #5
On Wed, Jul 24, 2024 at 10:44:47AM -0500, Eric Sandeen wrote:
> On 7/24/24 10:41 AM, Eric Sandeen wrote:
> > On 7/23/24 7:46 PM, Dave Chinner wrote:
> >>> What about the log?  If sb_agblocks increases, that can cause
> >>> transaction reservations to increase, which also increases the minimum
> >>> log size.
> >> Not caring, because the current default minimum of 64MB is big enough for
> >> any physical filesystem size. Further, 64MB is big enough for decent
> >> metadata performance even on large filesystem, so we really don't
> >> need to touch the journal here.

<shrug> I think our support staff might disagree about that for the
large machines they have to support, but log expansion doesn't need to
be implemented in the initial proposal or programming effort.

> > Seems fair, but just to stir the pot, "growing the log" offline, when
> > you've just added potentially gigabytes of free space to an AG, should
> > be trivial, right?

Possibly -- if there's free space after the end and the log is clean.
Maybe mkfs should try to allocate the log at the /end/ of the AG to
make this easier?

> Ugh I'm sorry, read to the end before responding, Eric.
> 
> (I had assumed that an expand operation would require a clean log, but I
> suppose it doesn't have to.)

...why not require a clean filesystem?  Most of these 10000x XFS
expansions are gold master cloud images coming from a vendor, which
implies that we could hold them to slightly higher cleanliness levels.

--D

> -Eric
> 
>
Eric Sandeen July 24, 2024, 5:33 p.m. UTC | #6
On 7/24/24 12:23 PM, Darrick J. Wong wrote:
> On Wed, Jul 24, 2024 at 10:44:47AM -0500, Eric Sandeen wrote:
>> On 7/24/24 10:41 AM, Eric Sandeen wrote:
>>> On 7/23/24 7:46 PM, Dave Chinner wrote:
>>>>> What about the log?  If sb_agblocks increases, that can cause
>>>>> transaction reservations to increase, which also increases the minimum
>>>>> log size.
>>>> Not caring, because the current default minimum of 64MB is big enough for
>>>> any physical filesystem size. Further, 64MB is big enough for decent
>>>> metadata performance even on large filesystem, so we really don't
>>>> need to touch the journal here.
> 
> <shrug> I think our support staff might disagree about that for the
> large machines they have to support, but log expansion doesn't need to
> be implemented in the initial proposal or programming effort.

+1

Yeah I'd prefer to not get bogged down in that, it can be done later,
or not. Let's focus on the core of the proposal, even though I'm
academically intrigued by log growth possibilities.

(he says, then comments more)

>>> Seems fair, but just to stir the pot, "growing the log" offline, when
>>> you've just added potentially gigabytes of free space to an AG, should
>>> be trivial, right?
> 
> Possibly -- if there's free space after the end and the log is clean.
> Maybe mkfs should try to allocate the log at the /end/ of the AG to
> make this easier?

It can be any sufficiently large free space, right, doesn't have to
be adjacent to the log. So as long as you've expanded more than 64MB
there's a chance to move to a bigger log region, I think.

>> Ugh I'm sorry, read to the end before responding, Eric.
>>
>> (I had assumed that an expand operation would require a clean log, but I
>> suppose it doesn't have to.)
> 
> ...why not require a clean filesystem?  Most of these 10000x XFS
> expansions are gold master cloud images coming from a vendor, which
> implies that we could hold them to slightly higher cleanliness levels.

Yeah I mean it only matters if you want to change the log size, right.
So could even do

# xfs_expand --grow-data=4T --grow-log=2G fs-image.img
Error: Cannot grow a dirty log, please mount ...
# 

But I think we're getting ahead of ourselves a little, let's see if
the basic proposal makes sense before debating extra bells and
whistles too much? It's probably enough at this point to say
"yeah, it's possible if we decide we want it, the design does not
preclude it."

-Eric

> --D
> 
>> -Eric
>>
>>
>
Darrick J. Wong July 24, 2024, 9:08 p.m. UTC | #7
On Wed, Jul 24, 2024 at 10:46:15AM +1000, Dave Chinner wrote:
> On Tue, Jul 23, 2024 at 04:58:01PM -0700, Darrick J. Wong wrote:
> > On Mon, Jul 22, 2024 at 09:01:00AM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > xfs-expand is an attempt to address the container/vm orchestration
> > > image issue where really small XFS filesystems are grown to massive
> > > sizes via xfs_growfs and end up with really insane, suboptimal
> > > geometries.
> > > 
> > > Rather that grow a filesystem by appending AGs, expanding a
> > > filesystem is based on allowing existing AGs to be expanded to
> > > maximum sizes first. If further growth is needed, then the
> > > traditional "append more AGs" growfs mechanism is triggered.
> > > 
> > > This document describes the structure of an XFS filesystem needed to
> > > achieve this expansion, as well as the design of userspace tools
> > > needed to make the mechanism work.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  Documentation/filesystems/xfs/index.rst       |   1 +
> > >  .../filesystems/xfs/xfs-expand-design.rst     | 312 ++++++++++++++++++
> > >  2 files changed, 313 insertions(+)
> > >  create mode 100644 Documentation/filesystems/xfs/xfs-expand-design.rst
> > > 
> > > diff --git a/Documentation/filesystems/xfs/index.rst b/Documentation/filesystems/xfs/index.rst
> > > index ab66c57a5d18..cb570fc886b2 100644
> > > --- a/Documentation/filesystems/xfs/index.rst
> > > +++ b/Documentation/filesystems/xfs/index.rst
> > > @@ -12,3 +12,4 @@ XFS Filesystem Documentation
> > >     xfs-maintainer-entry-profile
> > >     xfs-self-describing-metadata
> > >     xfs-online-fsck-design
> > > +   xfs-expand-design
> > > diff --git a/Documentation/filesystems/xfs/xfs-expand-design.rst b/Documentation/filesystems/xfs/xfs-expand-design.rst
> > > new file mode 100644
> > > index 000000000000..fffc0b44518d
> > > --- /dev/null
> > > +++ b/Documentation/filesystems/xfs/xfs-expand-design.rst
> > > @@ -0,0 +1,312 @@
> > > +.. SPDX-License-Identifier: GPL-2.0
> > > +
> > > +===============================
> > > +XFS Filesystem Expansion Design
> > > +===============================
> > > +
> > > +Background
> > > +==========
> > > +
> > > +XFS has long been able to grow the size of the filesystem dynamically whilst
> > > +mounted. The functionality has been used extensively over the past 3 decades
> > > +for managing filesystems on expandable storage arrays, but over the past decade
> > > +there has been significant growth in filesystem image based orchestration
> > > +frameworks that require expansion of the filesystem image during deployment.
> > > +
> > > +These frameworks want the initial image to be as small as possible to minimise
> > > +the cost of deployment, but then want that image to scale to whatever size the
> > > +deployment requires. This means that the base image can be as small as a few
> > > +hundred megabytes and be expanded on deployment to tens of terabytes.
> > > +
> > > +Growing a filesystem by 4-5 orders of magnitude is a long way outside the scope
> > > +of the original xfs_growfs design requirements. It was designed for users who
> > > +were adding physical storage to already large storage arrays; a single order of
> > > +magnitude in growth was considered a very large expansion.
> > > +
> > > +As a result, we have a situation where growing a filesystem works well up to a
> > > +certain point, yet we have orchestration frameworks that allows users to expand
> > > +filesystems a long way past this point without them being aware of the issues
> > > +it will cause them further down the track.
> > 
> > Ok, same growfs-on-deploy problem that we have.  Though, the minimum OCI
> > boot volume size is ~47GB so at least we're not going from 2G -> 200G.
> > Usually.
> > 
> > > +Scope
> > > +=====
> > > +
> > > +The need to expand filesystems with a geometry optimised for small storage
> > > +volumes onto much larger storage volumes results in a large filesystem with
> > > +poorly optimised geometry. Growing a small XFS filesystem by several orders of
> > > +magnitude results in filesystem with many small allocation groups (AGs). This is
> > > +bad for allocation effciency, contiguous free space management, allocation
> > > +performance as the filesystem fills, and so on. The filesystem will also end up
> > > +with a very small journal for the size of the filesystem which can limit the
> > > +metadata performance and concurrency in the filesystem drastically.
> > > +
> > > +These issues are a result of the filesystem growing algorithm. It is an
> > > +append-only mechanism which takes advantage of the fact we can safely initialise
> > > +the metadata for new AGs beyond the end of the existing filesystem without
> > > +impacting runtime behaviour. Those newly initialised AGs can then be enabled
> > > +atomically by running a single transaction to expose that newly initialised
> > > +space to the running filesystem.
> > > +
> > > +As a result, the growing algorithm is a fast, transparent, simple and crash-safe
> > > +algorithm that can be run while the filesystem is mounted. It's a very good
> > > +algorithm for growing a filesystem on a block device that has has new physical
> > > +storage appended to it's LBA space.
> > > +
> > > +However, this algorithm shows it's limitations when we move to system deployment
> > > +via filesystem image distribution. These deployments optimise the base
> > > +filesystem image for minimal size to minimise the time and cost of deploying
> > > +them to the newly provisioned system (be it VM or container). They rely on the
> > > +filesystem's ability to grow the filesystem to the size of the destination
> > > +storage during the first system bringup when they tailor the deployed filesystem
> > > +image for it's intented purpose and identity.
> > > +
> > > +If the deployed system has substantial storage provisioned, this means the
> > > +filesystem image will be expanded by multiple orders of magnitude during the
> > > +system initialisation phase, and this is where the existing append-based growing
> > > +algorithm falls apart. This is the issue that this design seeks to resolve.
> > 
> > I very much appreciate the scope definition here.  I also very much
> > appreciate starting off with a design document!  Thank you.
> > 
> > <snip out parts I'm already familiar with>
> > 
> > > +Optimising Physical AG Realignment
> > > +==================================
> > > +
> > > +The elephant in the room at this point in time is the fact that we have to
> > > +physically move data around to expand AGs. While this makes AG size expansion
> > > +prohibitive for large filesystems, they should already have large AGs and so
> > > +using the existing grow mechanism will continue to be the right tool to use for
> > > +expanding them.
> > > +
> > > +However, for small filesystems and filesystem images in the order of hundreds of
> > > +MB to a few GB in size, the cost of moving data around is much more tolerable.
> > > +If we can optimise the IO patterns to be purely sequential, offload the movement
> > > +to the hardware, or even use address space manipulation APIs to minimise the
> > > +cost of this movement, then resizing AGs via realignment becomes even more
> > > +appealing.
> > > +
> > > +Realigning AGs must avoid overwriting parts of AGs that have not yet been
> > > +realigned. That means we can't realign the AGs from AG 1 upwards - doing so will
> > > +overwrite parts of AG2 before we've realigned that data. Hence realignment must
> > > +be done from the highest AG first, and work downwards.
> > > +
> > > +Moving the data within an AG could be optimised to be space usage aware, similar
> > > +to what xfs_copy does to build sparse filesystem images. However, the space
> > > +optimised filesystem images aren't going to have a lot of free space in them,
> > > +and what there is may be quite fragmented. Hence doing free space aware copying
> > > +of relatively full small AGs may be IOPS intensive. Given we are talking about
> > > +AGs in the typical size range from 64-512MB, doing a sequential copy of the
> > > +entire AG isn't going to take very long on any storage. If we have to do several
> > > +hundred seeks in that range to skip free space, then copying the free space will
> > > +cost less than the seeks and the partial RAID stripe writes that small IOs will
> > > +cause.
> > > +
> > > +Hence the simplest, sequentially optimised data moving algorithm will be:
> > > +
> > > +.. code-block:: c
> > > +
> > > +	for (agno = sb_agcount - 1; agno > 0; agno--) {
> > > +		src = agno * sb_agblocks;
> > > +		dst = agno * new_agblocks;
> > > +		copy_file_range(src, dst, sb_agblocks);
> > > +	}
> > > +
> > > +This also leads to optimisation via server side or block device copy offload
> > > +infrastructure. Instead of streaming the data through kernel buffers, the copy
> > > +is handed to the server/hardware to moves the data internally as quickly as
> > > +possible.
> > > +
> > > +For filesystem images held in files and, potentially, on sparse storage devices
> > > +like dm-thinp, we don't even need to copy the data.  We can simply insert holes
> > > +into the underlying mapping at the appropriate place.  For filesystem images,
> > > +this is:
> > > +
> > > +.. code-block:: c
> > > +
> > > +	len = new_agblocks - sb_agblocks;
> > > +	for (agno = 1; agno < sb_agcount; agno++) {
> > > +		src = agno * sb_agblocks;
> > > +		fallocate(FALLOC_FL_INSERT_RANGE, src, len)
> > > +	}
> > > +
> > > +Then the filesystem image can be copied to the destination block device in an
> > > +efficient manner (i.e. skipping holes in the image file).
> > 
> > Does dm-thinp support insert range?
> 
> No - that would be a future enhancement. I mention it simly because
> these are things we would really want sparse block devices to
> support natively.

<nod> Should the next revision cc -fsdevel and -block, then?

> > In the worst case (copy_file_range,
> > block device doesn't support xcopy) this results in a pagecache copy of
> > nearly all of the filesystem, doesn't it?
> 
> Yes, it would.

Counter-proposal: Instead of remapping the AGs to higher LBAs, what if
we allowed people to create single-AG filesystems with large(ish)
sb_agblocks.  You could then format a 2GB image with (say) a 100G AG
size and copy your 2GB of data into the filesystem.  At deploy time,
growfs will expand AG 0 to 100G and add new AGs after that, same as it
does now.

I think all we'd need is to add a switch to mkfs to tell it that it's
creating one of these gold master images, which would disable this
check:

	if (agsize > dblocks) {
		fprintf(stderr,
	_("agsize (%lld blocks) too big, data area is %lld blocks\n"),
			(long long)agsize, (long long)dblocks);
			usage();
	}

and set a largeish default AG size.  We might want to set a compat bit
so that xfs_repair won't complain about the single AG.
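
For example (illustrative only, since the check above rejects it today), the
gold master format step on a pre-created 2GB image file could then be as
simple as:

	# agsize larger than the image; the proposed mkfs switch would allow this
	mkfs.xfs -d agsize=100g fs-image.img

and the deployed system gets its extra AGs from the firstboot growfs as usual.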

Yes, there are drawbacks, like the lack of redundant superblocks.  But
if growfs really runs at firstboot, then the deployed customer system
will likely have more than 1 AG and therefore be fine.

As for validating the integrity of the GM image, well, maybe the vendor
should enable fsverity. ;)

> > What about the log?  If sb_agblocks increases, that can cause
> > transaction reservations to increase, which also increases the minimum
> > log size.
> 
> Not caring, because the current default minimum of 64MB is big enough for
> any physical filesystem size. Further, 64MB is big enough for decent
> metadata performance even on large filesystem, so we really don't
> need to touch the journal here.
> 
> > If mkfs is careful, then I suppose xfs_expand could move the
> > log and make it bigger?  Or does mkfs create a log as if sb_agblocks
> > were 1TB, which will make the deployment image bigger?
> 
> making the log bigger is not as straight forward as it could be.
> If the log is dirty when we expand the filesystem and the dirty
> section wraps the end of the log, expansion just got really complex.
> So I'm just going to say "make the log large enough in the image
> file to begin with" and assert that we already do this with a
> current mkfs.
> 
> > Also, perhaps xfs_expand is a good opportunity to stamp a new uuid into
> > the superblock and set the metauuid bit?
> 
> Isn't provisioning software is generally already doing this via
> xfs_admin? We don't do this with growfs, and I'd prefer not to
> overload an expansion tool with random other administrative
> functions that only some use cases/environments might need. 

Yeah, though it'd be awfully convenient to do it while we've already got
the filesystem "mounted" in one userspace program.

> > I think the biggest difficulty for us (OCI) is that our block storage is
> > some sort of software defined storage system that exposes iscsi and
> > virtio-scsi endpoints.  For this to work, we'd have to have an
> > INSERT_RANGE SCSI command that the VM could send to the target and have
> > the device resize.  Does that exist today?
> 
> Not that I know of, but it's laregely irrelevant to the operation of
> xfs_expand what go fast features the underlying
> storage devices support . If they support INSERT_RANGE, we can use
> it. If they support FICLONERANGE we can use it. If the storage
> supports copy offload, copy_file_range() can use it. If all else
> fails, the kernel will just bounce the data through internal
> buffers (page cache for buffered IO or pipe buffers for direct IO).

I suspect copy offload will suffice for this, provided we can get the
OCI block storage folks to support it.  Now that support has landed in
the kernel, we can start those discussions internally.

> > > +Limitations
> > > +===========
> > > +
> > > +This document describes an offline mechanism for expanding the filesystem
> > > +geometery. It doesn't add new AGs, just expands they existing AGs. If the
> > > +filesystem needs to be made larger than maximally sized AGs can address, then
> > > +a subsequent online xfs_growfs operation is still required.
> > > +
> > > +For container/vm orchestration software, this isn't a huge issue as they
> > > +generally grow the image from within the initramfs context on first boot. That
> > > +is currently a "mount; xfs_growfs" operation pair; adding expansion to this
> > > +would simply require adding expansion before the mount. i.e. first boot becomes
> > > +a "xfs_expand; mount; xfs_growfs" operation. Depending on the eventual size of
> > > +the target filesystem, the xfs-growfs operation may be a no-op.
> > 
> > I don't know about your cloud, but ours seems to optimize vm deploy
> > times very heavily.  Right now their firstboot payload calls xfs_admin
> > to change the fs uuid, mounts the fs, and then growfs's it into the
> > container.
> > 
> > Adding another pre-mount firstboot program (and one that potentially
> > might do a lot of IO) isn't going to be popular with them.
> 
> There's nothing that requires xfs_expand to be done at first boot.
> First boot is just part of the deployment scripts and it may make
> sense to do the expansion as early as possible in the deployment
> process.

Yeah, but how often do you need to do a 10000x expansion on anything
other than a freshly cloned image?  Is that common in your cloudworld?
OCI usage patterns seem to be exploding the image on firstboot and
incremental growfs after that.

I think the difference between you and I here is that I see this
xfs_expand proposal as entirely a firstboot assistance program, whereas
you're looking at this more as a general operation that can happen at
any time.

> e.g. It could be run immediately after the image file is cloned from
> the source golden image. At that point it's still just an XFS file,
> right?

OCI block volumes are software defined storage; client VMs and
hypervisors don't have access to the internals of the storage nodes.

>        INSERT_RANGE won't affect the fact the extents are shared
> with the golden image, and it will be fast enough that it likely
> won't make a measurable impact on deployment speed. Four insert
> range calls on a largely contiguous file will take less than 100ms
> in most cases.

Let's hope so!

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Eric Sandeen July 24, 2024, 10:06 p.m. UTC | #8
On 7/21/24 6:01 PM, Dave Chinner wrote:
> +The solution should already be obvious: we can exploit the sparseness of FSBNO
> +addressing to allow AGs to grow to 1TB (maximum size) simply by configuring
> +sb_agblklog appropriately at mkfs.xfs time. Hence if we have 16MB AGs (minimum
> +size) and sb_agblklog = 30 (1TB max AG size), we can expand the AG size up
> +to their maximum size before we start appending new AGs.

there's a check in xfs_validate_sb_common() that tests whether sb_agblklog is
really just the log2 round-up of sb_agblocks:

sbp->sb_agblklog != xfs_highbit32(sbp->sb_agblocks - 1) + 1

so I think the proposed idea would require a feature flag, right?
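
(For example, with 4k blocks a 16MB AG has sb_agblocks = 4096, so this check
expects sb_agblklog == xfs_highbit32(4095) + 1 == 12, and an existing kernel
would reject a superblock carrying the much larger sb_agblklog the proposal
wants.)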

That might make it a little trickier as a drop-in replacement for cloud
providers because these new expandable filesystem images would only work on
kernels that understand the (trivial) new feature, unless I'm missing
something.

-Eric
Darrick J. Wong July 24, 2024, 10:12 p.m. UTC | #9
On Wed, Jul 24, 2024 at 05:06:58PM -0500, Eric Sandeen wrote:
> On 7/21/24 6:01 PM, Dave Chinner wrote:
> > +The solution should already be obvious: we can exploit the sparseness of FSBNO
> > +addressing to allow AGs to grow to 1TB (maximum size) simply by configuring
> > +sb_agblklog appropriately at mkfs.xfs time. Hence if we have 16MB AGs (minimum
> > +size) and sb_agblklog = 30 (1TB max AG size), we can expand the AG size up
> > +to their maximum size before we start appending new AGs.
> 
> there's a check in xfs_validate_sb_common() that tests whether sg_agblklog is
> really the next power of two from sb_agblklog:
> 
> sbp->sb_agblklog != xfs_highbit32(sbp->sb_agblocks - 1) + 1
> 
> so I think the proposed idea would require a feature flag, right?
> 
> That might make it a little trickier as a drop-in replacement for cloud
> providers because these new expandable filesystem images would only work on
> kernels that understand the (trivial) new feature, unless I'm missing
> something.

agblklog and agblocks would both be set correctly if you did mkfs.xfs -d
agsize=1T on a small image; afaict it's only mkfs that cares that
dblocks >= agblocks.

--D

> -Eric
>
Eric Sandeen July 24, 2024, 10:38 p.m. UTC | #10
On 7/24/24 5:12 PM, Darrick J. Wong wrote:
> On Wed, Jul 24, 2024 at 05:06:58PM -0500, Eric Sandeen wrote:
>> On 7/21/24 6:01 PM, Dave Chinner wrote:
>>> +The solution should already be obvious: we can exploit the sparseness of FSBNO
>>> +addressing to allow AGs to grow to 1TB (maximum size) simply by configuring
>>> +sb_agblklog appropriately at mkfs.xfs time. Hence if we have 16MB AGs (minimum
>>> +size) and sb_agblklog = 30 (1TB max AG size), we can expand the AG size up
>>> +to their maximum size before we start appending new AGs.
>>
>> there's a check in xfs_validate_sb_common() that tests whether sg_agblklog is
>> really the next power of two from sb_agblklog:
>>
>> sbp->sb_agblklog != xfs_highbit32(sbp->sb_agblocks - 1) + 1
>>
>> so I think the proposed idea would require a feature flag, right?
>>
>> That might make it a little trickier as a drop-in replacement for cloud
>> providers because these new expandable filesystem images would only work on
>> kernels that understand the (trivial) new feature, unless I'm missing
>> something.
> 
> agblklog and agblocks would both be set correctly if you did mkfs.xfs -d
> agsize=1T on a small image; afaict it's only mkfs that cares that
> dblocks >= agblocks.

Yes, in a single AG image like you've suggested.

In Dave's proposal, with multiple AGs, I think it would need to be handled.

-Eric

> --D
> 
>> -Eric
>>
>
Eric Sandeen July 24, 2024, 10:50 p.m. UTC | #11
On 7/24/24 4:08 PM, Darrick J. Wong wrote:
> On Wed, Jul 24, 2024 at 10:46:15AM +1000, Dave Chinner wrote:

...

> Counter-proposal: Instead of remapping the AGs to higher LBAs, what if
> we allowed people to create single-AG filesystems with large(ish)
> sb_agblocks.  You could then format a 2GB image with (say) a 100G AG
> size and copy your 2GB of data into the filesystem.  At deploy time,
> growfs will expand AG 0 to 100G and add new AGs after that, same as it
> does now.

And that could be done online...

> I think all we'd need is to add a switch to mkfs to tell it that it's
> creating one of these gold master images, which would disable this
> check:
> 
> 	if (agsize > dblocks) {
> 		fprintf(stderr,
> 	_("agsize (%lld blocks) too big, data area is %lld blocks\n"),
> 			(long long)agsize, (long long)dblocks);
> 			usage();
> 	}

(plus removing the single-ag check)

> and set a largeish default AG size.  We might want to set a compat bit
> so that xfs_repair won't complain about the single AG.
> 
> Yes, there are drawbacks, like the lack of redundant superblocks.  But
> if growfs really runs at firstboot, then the deployed customer system
> will likely have more than 1 AG and therefore be fine.

Other drawbacks are that you've fixed the AG size, so if you don't grow
past the AG size you picked at mkfs time, you've still got only one
superblock in the deployed image.

i.e. if you set it to 100G, you're OK if you're growing to 300-400G.
If you are only growing to 50G, not so much.

(and vice versa - if you optimize for gaining superblocks, you have to
pick a fairly small AG size, then run the risk of growing thousands of them)

In other words, it requires choices at mkfs time, whereas Dave's proposal
lets those choices be made per system, at "expand" time, when the desired
final size is known.

(And, you start right out of the gate with poorly distributed data and inodes,
though I'm not sure how much that'd matter in practice.)

(I'm not sure the ideas are even mutually exclusive; I think you could have
a single AG image with dblocks << agblocks << 2^agblocklog, and a simple
growfs adds agblocks-sized AGs, whereas an "expand" could adjust agblocks,
then growfs to add more?)

> As for validating the integrity of the GM image, well, maybe the vendor
> should enable fsverity. ;)

And host it on ext4, LOL.

-Eric
Darrick J. Wong July 24, 2024, 11:48 p.m. UTC | #12
On Wed, Jul 24, 2024 at 05:50:18PM -0500, Eric Sandeen wrote:
> On 7/24/24 4:08 PM, Darrick J. Wong wrote:
> > On Wed, Jul 24, 2024 at 10:46:15AM +1000, Dave Chinner wrote:
> 
> ...
> 
> > Counter-proposal: Instead of remapping the AGs to higher LBAs, what if
> > we allowed people to create single-AG filesystems with large(ish)
> > sb_agblocks.  You could then format a 2GB image with (say) a 100G AG
> > size and copy your 2GB of data into the filesystem.  At deploy time,
> > growfs will expand AG 0 to 100G and add new AGs after that, same as it
> > does now.
> 
> And that could be done oneline...
> 
> > I think all we'd need is to add a switch to mkfs to tell it that it's
> > creating one of these gold master images, which would disable this
> > check:
> > 
> > 	if (agsize > dblocks) {
> > 		fprintf(stderr,
> > 	_("agsize (%lld blocks) too big, data area is %lld blocks\n"),
> > 			(long long)agsize, (long long)dblocks);
> > 			usage();
> > 	}
> 
> (plus removing the single-ag check)
> 
> > and set a largeish default AG size.  We might want to set a compat bit
> > so that xfs_repair won't complain about the single AG.
> > 
> > Yes, there are drawbacks, like the lack of redundant superblocks.  But
> > if growfs really runs at firstboot, then the deployed customer system
> > will likely have more than 1 AG and therefore be fine.
> 
> Other drawbacks are that you've fixed the AG size, so if you don't grow
> past the AG size you picked at mkfs time, you've still got only one
> superblock in the deployed image.

Yes, that is a significant drawback. :)

> i.e. if you set it to 100G, you're OK if you're growing to 300-400G.
> If you are only growing to 50G, not so much.

Yes, though the upside of this counter proposal is that it can be done
today with relatively few code changes.  Dave's requires storage
devices and the kernel to support accelerated remapping, which is going
to take some time and conversations with vendors.

That said, I agree with Dave that his proposal probably results in
files spread more evenly around the disk.

But let's think about this -- would it be advantageous for a freshly
deployed system to have a lot of contiguous space at the end?

If the expand(ed) image is a root filesystem, then the existing content
isn't going to change a whole lot, right?  And if we're really launching
into the nopets era, then the system gets redeployed every quarter with
the latest OS update.

(Not that I do that; I'm still a grumpy Debian greybeard with too many
pets.)

OTOH, do you (or Dave) anticipate needing to expandfs an empty data
partition in the deployed image?  A common pattern amongst our software
is to send out a ~16G root fs image which is deployed into a VM with a
~250G boot volume and a 100TB data volume.  The firstboot process growfs
the rootfs by another ~235G, then it formats a fresh xfs onto the 100TB
volume.

The performance of the freshly formatted data partition is most
important, and we've spent years showing that layout and performance are
better if you do the fresh format.  So I don't think we're going to go
back to expanding data partitions.

> (and vice versa - if you optimize for gaining superblocks, you have to
> pick a fairly small AG size, then run the risk of growing thousands of them)
>
> In other words, it requires choices at mkfs time, whereas Dave's proposal
> lets those choices be made per system, at "expand" time, when the desired
> final size is known.

If you only have one AG, then the agnumber segment of the FSBNO will be
zero.  IOWs, you can increase agblklog on a single-AG fs because there
are no FSBNOs that need re-encoding.  You can even decrease it, so long
as you don't go below the size of the fs.

The ability to adjust goes away as soon as you hit two AGs.

Adjusting agblklog would require some extension to the growfs ioctl.
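
To make the encoding concrete, here's a toy standalone demo of the sparse
fsbno layout (a simplified stand-in for the real XFS_AGB_TO_FSB() macro,
not the actual kernel code):

#include <stdio.h>
#include <stdint.h>

/* agno lives in the bits above agblklog, agbno in the bits below it */
static uint64_t agb_to_fsb(uint32_t agno, uint32_t agbno, int agblklog)
{
	return ((uint64_t)agno << agblklog) | agbno;
}

int main(void)
{
	/* AG 0: the encoding is just the agbno, whatever agblklog is */
	printf("AG 0, agbno 1000: agblklog 16 -> %llu, agblklog 30 -> %llu\n",
	       (unsigned long long)agb_to_fsb(0, 1000, 16),
	       (unsigned long long)agb_to_fsb(0, 1000, 30));

	/* AG 2: the encoded value bakes in the agblklog it was built with */
	printf("AG 2, agbno 1000: agblklog 16 -> %llu, agblklog 30 -> %llu\n",
	       (unsigned long long)agb_to_fsb(2, 1000, 16),
	       (unsigned long long)agb_to_fsb(2, 1000, 30));
	return 0;
}

With agno == 0 the stored value never changes, which is why a single-AG fs
is free to pick a new agblklog; once agno is non-zero, every stored fsbno
(and inode number) bakes the old agblklog in.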

> (And, you start right out of the gate with poorly distributed data and inodes,
> though I'm not sure how much that'd matter in practice.)

On fast storage it probably doesn't matter.  OTOH, Dave's proposal does
mean that the log stays in the middle of the disk, which might be
advantageous if you /are/ running on spinning rust.

> (I'm not sure the ideas are even mutually exclusive; I think you could have
> a single AG image with dblocks << agblocks << 2^agblocklog, and a simple
> growfs adds agblocks-sized AGs, whereas an "expand" could adjust agblocks,
> then growfs to add more?)

Yes.

> > As for validating the integrity of the GM image, well, maybe the vendor
> > should enable fsverity. ;)
> 
> And host it on ext4, LOL.

I think we can land fsverity in the same timeframe as whatever we land
on for implementing xfs_explode^Wexpandfs.  Probably sooner.

--D

> -Eric
>
Dave Chinner July 25, 2024, 12:41 a.m. UTC | #13
On Wed, Jul 24, 2024 at 02:08:33PM -0700, Darrick J. Wong wrote:
> On Wed, Jul 24, 2024 at 10:46:15AM +1000, Dave Chinner wrote:
> > On Tue, Jul 23, 2024 at 04:58:01PM -0700, Darrick J. Wong wrote:
> > > On Mon, Jul 22, 2024 at 09:01:00AM +1000, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > > xfs-expand is an attempt to address the container/vm orchestration
> > > > image issue where really small XFS filesystems are grown to massive
> > > > sizes via xfs_growfs and end up with really insane, suboptimal
> > > > geometries.
....
> > > > +Moving the data within an AG could be optimised to be space usage aware, similar
> > > > +to what xfs_copy does to build sparse filesystem images. However, the space
> > > > +optimised filesystem images aren't going to have a lot of free space in them,
> > > > +and what there is may be quite fragmented. Hence doing free space aware copying
> > > > +of relatively full small AGs may be IOPS intensive. Given we are talking about
> > > > +AGs in the typical size range from 64-512MB, doing a sequential copy of the
> > > > +entire AG isn't going to take very long on any storage. If we have to do several
> > > > +hundred seeks in that range to skip free space, then copying the free space will
> > > > +cost less than the seeks and the partial RAID stripe writes that small IOs will
> > > > +cause.
> > > > +
> > > > +Hence the simplest, sequentially optimised data moving algorithm will be:
> > > > +
> > > > +.. code-block:: c
> > > > +
> > > > +	for (agno = sb_agcount - 1; agno > 0; agno--) {
> > > > +		src = agno * sb_agblocks;
> > > > +		dst = agno * new_agblocks;
> > > > +		copy_file_range(src, dst, sb_agblocks);
> > > > +	}
> > > > +
> > > > +This also leads to optimisation via server side or block device copy offload
> > > > +infrastructure. Instead of streaming the data through kernel buffers, the copy
> > > > +is handed to the server/hardware to moves the data internally as quickly as
> > > > +possible.
> > > > +
> > > > +For filesystem images held in files and, potentially, on sparse storage devices
> > > > +like dm-thinp, we don't even need to copy the data.  We can simply insert holes
> > > > +into the underlying mapping at the appropriate place.  For filesystem images,
> > > > +this is:
> > > > +
> > > > +.. code-block:: c
> > > > +
> > > > +	len = new_agblocks - sb_agblocks;
> > > > +	for (agno = 1; agno < sb_agcount; agno++) {
> > > > +		src = agno * sb_agblocks;
> > > > +		fallocate(FALLOC_FL_INSERT_RANGE, src, len)
> > > > +	}
> > > > +
> > > > +Then the filesystem image can be copied to the destination block device in an
> > > > +efficient manner (i.e. skipping holes in the image file).
> > > 
> > > Does dm-thinp support insert range?
> > 
> > No - that would be a future enhancement. I mention it simply because
> > these are things we would really want sparse block devices to
> > support natively.
> 
> <nod> Should the next revision should cc -fsdevel and -block, then?

No. This is purely an XFS feature at this point. If future needs
change and we require work outside of XFS to be done, then it can be
taken up with external teams to design and implement the optional
acceleration functions that we desire.

> > > In the worst case (copy_file_range,
> > > block device doesn't support xcopy) this results in a pagecache copy of
> > > nearly all of the filesystem, doesn't it?
> > 
> > Yes, it would.
> 
> Counter-proposal: Instead of remapping the AGs to higher LBAs, what if
> we allowed people to create single-AG filesystems with large(ish)
> sb_agblocks.  You could then format a 2GB image with (say) a 100G AG
> size and copy your 2GB of data into the filesystem.  At deploy time,
> growfs will expand AG 0 to 100G and add new AGs after that, same as it
> does now.

We can already do this with existing tools.

All it requires is using xfs_db to rewrite the sb/ag geometry and
adding new freespace records. Now you have a 100GB AG instead of 2GB
and you can mount it and run growfs to add all the extra AGs you
need.

Maybe it wasn't obvious from my descriptions of the sparse address
space diagrams, but single AG filesystems have no restrictions of AG
size growth because there are no high bits set in any of the sparse
64 bit address spaces (i.e. fsbno or inode numbers). Hence we can
expand the AG size without worrying about overwriting the address
space used by higher AGs.

IOWs, the need for reserving sparse address space bits just doesn't
exist for single AG filesystems.  The point of this proposal is to
document a generic algorithm that avoids the problem of the higher
AG address space limiting how large lower AGs can be made. That's
the problem that prevents substantial resizing of AGs, and that's
what this design document addresses.

> > > Also, perhaps xfs_expand is a good opportunity to stamp a new uuid into
> > > the superblock and set the metauuid bit?
> > 
> > Isn't provisioning software generally already doing this via
> > xfs_admin? We don't do this with growfs, and I'd prefer not to
> > overload an expansion tool with random other administrative
> > functions that only some use cases/environments might need. 
> 
> Yeah, though it'd be awfully convenient to do it while we've already got
> the filesystem "mounted" in one userspace program.

"it'd be awfully convenient" isn't a technical argument. It's an
entirely subjective observation and assumes an awful lot about the
implementation design that hasn't been started yet.

Indeed, from an implementation perspective I'm considering that
xfs_expand might even implemented as a simple shell script that
wraps xfs_db and xfs_io. I strongly suspect that we don't need to
write any custom C code for it at all. It's really that simple.

Hence talking about anything to do with optimising the whole expand
process to take on other administration tasks before we've even
started on a detailed implementation design is highly premature.  I
want to make sure the high level design and algorithms are
sufficient for all the use cases people can come up with, not define
exactly how we are going to implement the functionality.

> > > > +Limitations
> > > > +===========
> > > > +
> > > > +This document describes an offline mechanism for expanding the filesystem
> > > > +geometery. It doesn't add new AGs, just expands they existing AGs. If the
> > > > +filesystem needs to be made larger than maximally sized AGs can address, then
> > > > +a subsequent online xfs_growfs operation is still required.
> > > > +
> > > > +For container/vm orchestration software, this isn't a huge issue as they
> > > > +generally grow the image from within the initramfs context on first boot. That
> > > > +is currently a "mount; xfs_growfs" operation pair; adding expansion to this
> > > > +would simply require adding expansion before the mount. i.e. first boot becomes
> > > > +a "xfs_expand; mount; xfs_growfs" operation. Depending on the eventual size of
> > > > +the target filesystem, the xfs-growfs operation may be a no-op.
> > > 
> > > I don't know about your cloud, but ours seems to optimize vm deploy
> > > times very heavily.  Right now their firstboot payload calls xfs_admin
> > > to change the fs uuid, mounts the fs, and then growfs's it into the
> > > container.
> > > 
> > > Adding another pre-mount firstboot program (and one that potentially
> > > might do a lot of IO) isn't going to be popular with them.
> > 
> > There's nothing that requires xfs_expand to be done at first boot.
> > First boot is just part of the deployment scripts and it may make
> > sense to do the expansion as early as possible in the deployment
> > process.
> 
> Yeah, but how often do you need to do a 10000x expansion on anything
> other than a freshly cloned image?  Is that common in your cloudworld?
> OCI usage patterns seem to be exploding the image on firstboot and
> incremental growfs after that.

I've seen it happen many times outside of container/VMs - this was
even a significant problem 20+ years ago when AGs were limited to
4GB. That specific historic case was fixed by moving to 1TB max AG
size, but there was no way to convert an existing filesystem. This
is the "cloud case" in a nutshell, so it's clearly not a new
problem.

Even ignoring the historic situation, we still see people have these
problems with growing filesystems. It's especially prevalent with
demand driven thin provisioned storage. Project starts small with
only the space they need (e.g. for initial documentation), then as
it ramps up and starts to generate TBs of data, the storage gets
expanded from its initial "few GBs" size. Same problem, different
environment.

> I think the difference between you and I here is that I see this
> xfs_expand proposal as entirely a firstboot assistance program, whereas
> you're looking at this more as a general operation that can happen at
> any time.

Yes. As I've done for the past 15+ years, I'm thinking about the
best solution for the wider XFS and storage community first and
commercial imperatives second. I've seen people use XFS features and
storage APIs for things I've never considered when designing them.
I'm constantly surprised by how people use the functionality we
provide in innovative, unexpected ways because they are generic
enough to provide building blocks that people can use to implement
new ideas.

Filesystem expansion is, IMO, one of those "generically useful"
tools and algorithms.

Perhaps it's not an obvious jump, but I'm also thinking about how we
might be able to do the opposite of AG expansion to shrink the
filesystem online. Not sure it is possible yet, but having the
ability to dynamically resize AGs opens up many new possibilities.
That's way outside the scope of this discussion, but I mention it
simply to point out that the core of this generic expansion idea -
decoupling the AG physical size from the internal sparse 64 bit
addressing layout - has many potential future uses...

-Dave.
Dave Chinner July 25, 2024, 1:42 a.m. UTC | #14
On Wed, Jul 24, 2024 at 05:38:46PM -0500, Eric Sandeen wrote:
> On 7/24/24 5:12 PM, Darrick J. Wong wrote:
> > On Wed, Jul 24, 2024 at 05:06:58PM -0500, Eric Sandeen wrote:
> >> On 7/21/24 6:01 PM, Dave Chinner wrote:
> >>> +The solution should already be obvious: we can exploit the sparseness of FSBNO
> >>> +addressing to allow AGs to grow to 1TB (maximum size) simply by configuring
> >>> +sb_agblklog appropriately at mkfs.xfs time. Hence if we have 16MB AGs (minimum
> >>> +size) and sb_agblklog = 30 (1TB max AG size), we can expand the AG size up
> >>> +to their maximum size before we start appending new AGs.
> >>
> >> there's a check in xfs_validate_sb_common() that tests whether sb_agblklog is
> >> really log2 of sb_agblocks rounded up:
> >>
> >> sbp->sb_agblklog != xfs_highbit32(sbp->sb_agblocks - 1) + 1
> >>
> >> so I think the proposed idea would require a feature flag, right?
> >>
> >> That might make it a little trickier as a drop-in replacement for cloud
> >> providers because these new expandable filesystem images would only work on
> >> kernels that understand the (trivial) new feature, unless I'm missing
> >> something.

Yes, I think that's the case.

I don't see this as a showstopper - golden images that VMs get
cloned from tend to be specific to the distro release they contain,
so as long as both the orchestration nodes and the distro kernel
within the image support the feature bit it will just work, yes?

This essentially makes it an image build time decision to support
the expand feature.  i.e. if the fs in the image has the feature bit
set, deployment can use expand, otherwise skip it and go straight to
grow.
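
To sketch the shape of that, the feature bit would essentially gate the
existing geometry check in the superblock verifier - something like the
following (illustrative only; xfs_sb_has_expand_feature() is a placeholder
name for whatever the bit ends up being called):

	/*
	 * Only enforce the agblklog/agblocks coupling on filesystems
	 * that don't carry the (hypothetical) expand feature bit.
	 */
	if (!xfs_sb_has_expand_feature(sbp) &&
	    sbp->sb_agblklog != xfs_highbit32(sbp->sb_agblocks - 1) + 1)
		return -EFSCORRUPTED;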

> > agblklog and agblocks would both be set correctly if you did mkfs.xfs -d
> > agsize=1T on a small image; afaict it's only mkfs that cares that
> > dblocks >= agblocks.

I don't think we can change sb_agblocks like this.

Fundamentally, sb_agblocks is the physical length of AGs, not the
theoretical maximum size. Setting it to the maximum possible size
fails in all sorts of ways for multiple AG filesystems
(__xfs_ag_block_count() is the simplest example to point out), and
even on single AG filesystems it will be problematic.

sb->sb_agblklog is not a physical size - it is decoupled from the
physical size of the AG so it can be used to calculate the address
space locations of inodes and FSBNOs. Hence we can change
sb_agblklog without changing sb_agblocks and other physical lengths
because it doesn't change anything to do with the physical layout of
the filesystem.

> Yes, in a single AG image like you've suggested.

I'm not so sure about that.

We're not so lucky with code that validates against the AG header
lengths or just uses sb_agblocks blindly.

e.g. for single AG fs's to work, they need to use
__xfs_ag_block_count() everywhere to get the size of the AG and have
it return sb_dblocks instead.  We have code that assumes that
sb_agblocks is the highest AGBNO possible in an AG and doesn't take
into account the size of the runt AG (e.g.
xfs_alloc_ag_max_usable() is one of these).
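
For reference, the per-AG length calculation that would have to be used
consistently is conceptually just this (a simplified sketch of the idea,
not the actual libxfs helper):

#include <stdint.h>

typedef uint64_t xfs_rfsblock_t;
typedef uint32_t xfs_agblock_t;
typedef uint32_t xfs_agnumber_t;

/*
 * Every AG is sb_agblocks long except the last (runt) AG, which only
 * holds whatever is left of sb_dblocks. With agcount == 1, AG 0 is the
 * runt, so its real length is sb_dblocks regardless of how large
 * sb_agblocks says an AG could be.
 */
xfs_agblock_t
ag_block_count(xfs_agnumber_t agno, xfs_agnumber_t agcount,
	       xfs_agblock_t agblocks, xfs_rfsblock_t dblocks)
{
	if (agno < agcount - 1)
		return agblocks;
	return dblocks - (xfs_rfsblock_t)agno * agblocks;
}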

I've always understood there to be an implicit requirement for
sb_agblocks and agf->agf_length/agi->agi_length to be the same and
define the actual maximum valid block number in the AG.

I don't think we can change that, and I don't think we should try to
change this even for single AG filesystems. We don't even support
single AG filesystems. And I don't think we should try to change
such a fundamental physical layout definition for what would be a
filesystem format that only physically exists temporarily in a cloud
deployment context.

-Dave.
Dave Chinner July 29, 2024, 1:45 a.m. UTC | #15
On Wed, Jul 24, 2024 at 05:38:46PM -0500, Eric Sandeen wrote:
> On 7/24/24 5:12 PM, Darrick J. Wong wrote:
> > On Wed, Jul 24, 2024 at 05:06:58PM -0500, Eric Sandeen wrote:
> >> On 7/21/24 6:01 PM, Dave Chinner wrote:
> >>> +The solution should already be obvious: we can exploit the sparseness of FSBNO
> >>> +addressing to allow AGs to grow to 1TB (maximum size) simply by configuring
> >>> +sb_agblklog appropriately at mkfs.xfs time. Hence if we have 16MB AGs (minimum
> >>> +size) and sb_agblklog = 30 (1TB max AG size), we can expand the AG size up
> >>> +to their maximum size before we start appending new AGs.
> >>
> >> there's a check in xfs_validate_sb_common() that tests whether sb_agblklog is
> >> really log2 of sb_agblocks rounded up:
> >>
> >> sbp->sb_agblklog != xfs_highbit32(sbp->sb_agblocks - 1) + 1
> >>
> >> so I think the proposed idea would require a feature flag, right?
> >>
> >> That might make it a little trickier as a drop-in replacement for cloud
> >> providers because these new expandable filesystem images would only work on
> >> kernels that understand the (trivial) new feature, unless I'm missing
> >> something.
> > 
> > agblklog and agblocks would both be set correctly if you did mkfs.xfs -d
> > agsize=1T on a small image; afaict it's only mkfs that cares that
> > dblocks >= agblocks.
> 
> Yes, in a single AG image like you've suggested.
> 
> In Dave's proposal, with multiple AGs, I think it would need to be handled.

So in looking into this in more detail, there is one thing I've
overlooked that will absolutely require a change of on-disk format:
we encode the physical location (daddr) of some types of metadata
into the metadata.

This was added so that we could detect misdirected reads or writes;
the metadata knows its physical location and so checks that it
matches the physical location the buffer was read from or is being
written to.

This includes the btree block headers, symlink blocks and
directory/attr blocks, and I'm pretty sure that the locations
recorded by the rmapbt's are physical locations as well.

To allow expansion to physically relocate metadata, I think we're
going to have to change all of these to record fsbno rather than
daddr so that we can change the physical location of the metadata
without needing to change all the metadata location references. We
calculate daddr from FSBNO or AGBNO everywhere, so I think all the
info we need to do this is already available in the code.

Daddrs and fsbnos are also the same size on disk (__be64) so there's
no change to metadata layouts or anything like that. It's just a
change of what those fields store from physical to internal
sparse addressing.
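
To illustrate the difference, here's a toy standalone calculation using
simplified stand-ins for the fsbno/daddr conversions (example block and
sector sizes only):

#include <stdio.h>
#include <stdint.h>

#define BLOCKLOG	12	/* 4k filesystem blocks (example) */
#define SECTORLOG	9	/* 512 byte sectors */

/* daddr depends on the physical AG size; the sparse fsbno does not */
static uint64_t agb_to_daddr(uint32_t agno, uint32_t agbno, uint32_t agblocks)
{
	return ((uint64_t)agno * agblocks + agbno) << (BLOCKLOG - SECTORLOG);
}

static uint64_t agb_to_fsb(uint32_t agno, uint32_t agbno, int agblklog)
{
	return ((uint64_t)agno << agblklog) | agbno;
}

int main(void)
{
	uint32_t agno = 2, agbno = 100;
	uint32_t old_agblocks = 32768;		/* 128MB AGs */
	uint32_t new_agblocks = 262144;		/* expanded to 1GB AGs */
	int agblklog = 30;			/* reserved at mkfs time */

	printf("daddr before/after expand: %llu -> %llu\n",
	       (unsigned long long)agb_to_daddr(agno, agbno, old_agblocks),
	       (unsigned long long)agb_to_daddr(agno, agbno, new_agblocks));
	printf("fsbno before/after expand: %llu -> %llu\n",
	       (unsigned long long)agb_to_fsb(agno, agbno, agblklog),
	       (unsigned long long)agb_to_fsb(agno, agbno, agblklog));
	return 0;
}

The physical sector address of AG 2's block 100 moves when the AGs are
resized, but its (agno, agbno) encoding does not - which is the property
the self-describing metadata needs if it is to keep verifying after an
expand.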

Hence I don't think this is a showstopper issue - it will just
require verifiers and metadata header initialisation to use the
correct encoding at runtime based on the superblock feature bit.
It's not really that complex, just a bit more work than I originally
thought it would be.

-Dave.
Brian Foster Aug. 2, 2024, 12:09 p.m. UTC | #16
On Thu, Jul 25, 2024 at 10:41:50AM +1000, Dave Chinner wrote:
> On Wed, Jul 24, 2024 at 02:08:33PM -0700, Darrick J. Wong wrote:
> > On Wed, Jul 24, 2024 at 10:46:15AM +1000, Dave Chinner wrote:
> > > On Tue, Jul 23, 2024 at 04:58:01PM -0700, Darrick J. Wong wrote:
> > > > On Mon, Jul 22, 2024 at 09:01:00AM +1000, Dave Chinner wrote:
> > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > 
> > > > > xfs-expand is an attempt to address the container/vm orchestration
> > > > > image issue where really small XFS filesystems are grown to massive
> > > > > sizes via xfs_growfs and end up with really insane, suboptimal
> > > > > geometries.
> ....
> > > > > +Moving the data within an AG could be optimised to be space usage aware, similar
> > > > > +to what xfs_copy does to build sparse filesystem images. However, the space
> > > > > +optimised filesystem images aren't going to have a lot of free space in them,
> > > > > +and what there is may be quite fragmented. Hence doing free space aware copying
> > > > > +of relatively full small AGs may be IOPS intensive. Given we are talking about
> > > > > +AGs in the typical size range from 64-512MB, doing a sequential copy of the
> > > > > +entire AG isn't going to take very long on any storage. If we have to do several
> > > > > +hundred seeks in that range to skip free space, then copying the free space will
> > > > > +cost less than the seeks and the partial RAID stripe writes that small IOs will
> > > > > +cause.
> > > > > +
> > > > > +Hence the simplest, sequentially optimised data moving algorithm will be:
> > > > > +
> > > > > +.. code-block:: c
> > > > > +
> > > > > +	for (agno = sb_agcount - 1; agno > 0; agno--) {
> > > > > +		src = agno * sb_agblocks;
> > > > > +		dst = agno * new_agblocks;
> > > > > +		copy_file_range(src, dst, sb_agblocks);
> > > > > +	}
> > > > > +
> > > > > +This also leads to optimisation via server side or block device copy offload
> > > > > +infrastructure. Instead of streaming the data through kernel buffers, the copy
> > > > > +is handed to the server/hardware to moves the data internally as quickly as
> > > > > +possible.
> > > > > +
> > > > > +For filesystem images held in files and, potentially, on sparse storage devices
> > > > > +like dm-thinp, we don't even need to copy the data.  We can simply insert holes
> > > > > +into the underlying mapping at the appropriate place.  For filesystem images,
> > > > > +this is:
> > > > > +
> > > > > +.. code-block:: c
> > > > > +
> > > > > +	len = new_agblocks - sb_agblocks;
> > > > > +	for (agno = 1; agno < sb_agcount; agno++) {
> > > > > +		src = agno * sb_agblocks;
> > > > > +		fallocate(FALLOC_FL_INSERT_RANGE, src, len)
> > > > > +	}
> > > > > +
> > > > > +Then the filesystem image can be copied to the destination block device in an
> > > > > +efficient manner (i.e. skipping holes in the image file).
> > > > 
> > > > Does dm-thinp support insert range?
> > > 
> > > No - that would be a future enhancement. I mention it simply because
> > > these are things we would really want sparse block devices to
> > > support natively.
> > 
> > <nod> Should the next revision should cc -fsdevel and -block, then?
> 
> No. This is purely an XFS feature at this point. If future needs
> change and we require work outside of XFS to be done, then it can be
> taken up with external teams to design and implement the optional
> acceleration functions that we desire.
> 
> > > > In the worst case (copy_file_range,
> > > > block device doesn't support xcopy) this results in a pagecache copy of
> > > > nearly all of the filesystem, doesn't it?
> > > 
> > > Yes, it would.
> > 
> > Counter-proposal: Instead of remapping the AGs to higher LBAs, what if
> > we allowed people to create single-AG filesystems with large(ish)
> > sb_agblocks.  You could then format a 2GB image with (say) a 100G AG
> > size and copy your 2GB of data into the filesystem.  At deploy time,
> > growfs will expand AG 0 to 100G and add new AGs after that, same as it
> > does now.
> 
> We can already do this with existing tools.
> 
> All it requires is using xfs_db to rewrite the sb/ag geometry and
> adding new freespace records. Now you have a 100GB AG instead of 2GB
> and you can mount it and run growfs to add all the extra AGs you
> need.
> 

I'm not sure you have to do anything around adding new freespace records
as such. If you format a small agcount=1 fs and then bump the sb agblock
fields via xfs_db, then last I knew you could mount and run xfs_growfs
according to the outsized AG size with no further changes at all. The
grow will add the appropriate free space records as normal. I've not
tried that with actually copying in data first, but otherwise it
survives repair at least.

It would be interesting to see if that all still works with data present
as long as you update the agblock fields before copying anything in. The
presumption is that this operates mostly the same as extending an
existing runt AG to a size that's still smaller than the full AG size,
which obviously works today, just with the special circumstance that
said runt AG happens to be the only AG in the fs.

FWIW, Darrick's proposal here is pretty much exactly what I've thought
XFS should have been at least considering doing to mitigate this cloudy
deployment problem for quite some time. I agree with the caveats and
limitations that Eric points out, but for a narrowly scoped cloudy
deployment tool I have a hard time seeing how this could be anything
but a significant improvement over the status quo for a relatively small
amount of work (assuming it can be fully verified to work, of course ;).

ISTM that the major caveats can be managed with some reasonable amount
of guardrails or policy enforcement. For example, if mkfs.xfs supported
an "image mode" param that specified a target/min deployment size, that
could be used as a hint to set the superblock AG size, a log size more
reflective of the eventual deployment size, and perhaps even facilitate
some usage limitations until that expected deployment occurs.

I.e., perhaps also set an "image mode" feature flag that requires a
corresponding mount option in order to mount writeable (to populate the
base image), and otherwise the fs mounts in a restricted-read mode where
grow is allowed, and grow only clears the image mode state once the fs
grows to at least 2 AGs (or whatever), etc. etc. At that point the user
doesn't have to know or care about any of the underlying geometry
details, just format and deploy as documented.
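
Purely as a sketch of the kind of guardrail I mean (every name here is
made up, nothing like this exists today):

	/* hypothetical mount-time policy for an "image mode" filesystem */
	if (xfs_sb_has_image_mode(sbp) && !xfs_mount_opt_image_build(mp)) {
		/* deployed image: read-only access plus growfs only */
		xfs_force_readonly(mp);
	}

	/* ...and growfs drops the restriction once the fs looks "real" */
	if (xfs_sb_has_image_mode(sbp) && new_agcount >= 2)
		xfs_sb_clear_image_mode(sbp);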

That's certainly not perfect and I suspect we could probably come up
with a variety of different policy permutations or user interfaces to
consider, but ISTM there's real practical value to something like that.

> Maybe it wasn't obvious from my descriptions of the sparse address
> space diagrams, but single AG filesystems have no restrictions of AG
> size growth because there are no high bits set in any of the sparse
> 64 bit address spaces (i.e. fsbno or inode numbers). Hence we can
> expand the AG size without worrying about overwriting the address
> space used by higher AGs.
> 
> IOWs, the need for reserving sparse address space bits just doesn't
> exist for single AG filesystems.  The point of this proposal is to
> document a generic algorithm that avoids the problem of the higher
> AG address space limiting how large lower AGs can be made. That's
> the problem that prevents substantial resizing of AGs, and that's
> what this design document addresses.
> 

So in theory it sounds like you could also defer setting the final AG
size on a single AG fs at least until the first grow, right? If so, that
might be particularly useful for the typical one-shot
expansion/deployment use case. The mkfs could just set image mode and
single AG, and then the first major grow finalizes the ag size and
expands to a sane (i.e. multi AG) format.

I guess that potentially makes log sizing a bit harder, but we've had
some discussion around online moving and resizing the log in the past as
well and IIRC it wasn't really that hard to achieve so long as we keep
the environment somewhat constrained by virtue of switching across a
quiesce, for example. I don't recall all the details though; I'd have to
see if I can dig up that conversation...

> > > > Also, perhaps xfs_expand is a good opportunity to stamp a new uuid into
> > > > the superblock and set the metauuid bit?
> > > 
> > > Isn't provisioning software generally already doing this via
> > > xfs_admin? We don't do this with growfs, and I'd prefer not to
> > > overload an expansion tool with random other administrative
> > > functions that only some use cases/environments might need. 
> > 
> > Yeah, though it'd be awfully convenient to do it while we've already got
> > the filesystem "mounted" in one userspace program.
> 
> "it'd be awfully convenient" isn't a technical argument. It's an
> entirely subjective observation and assumes an awful lot about the
> implementation design that hasn't been started yet.
> 
> Indeed, from an implementation perspective I'm considering that
> xfs_expand might even be implemented as a simple shell script that
> wraps xfs_db and xfs_io. I strongly suspect that we don't need to
> write any custom C code for it at all. It's really that simple.
> 
> Hence talking about anything to do with optimising the whole expand
> process to take on other administration tasks before we've even
> started on a detailed implementation design is highly premature.  I
> want to make sure the high level design and algorithms are
> sufficient for all the use cases people can come up with, not define
> exactly how we are going to implement the functionality.
> 
> > > > > +Limitations
> > > > > +===========
> > > > > +
> > > > > +This document describes an offline mechanism for expanding the filesystem
> > > > > +geometery. It doesn't add new AGs, just expands they existing AGs. If the
> > > > > +filesystem needs to be made larger than maximally sized AGs can address, then
> > > > > +a subsequent online xfs_growfs operation is still required.
> > > > > +
> > > > > +For container/vm orchestration software, this isn't a huge issue as they
> > > > > +generally grow the image from within the initramfs context on first boot. That
> > > > > +is currently a "mount; xfs_growfs" operation pair; adding expansion to this
> > > > > +would simply require adding expansion before the mount. i.e. first boot becomes
> > > > > +a "xfs_expand; mount; xfs_growfs" operation. Depending on the eventual size of
> > > > > +the target filesystem, the xfs-growfs operation may be a no-op.
> > > > 
> > > > I don't know about your cloud, but ours seems to optimize vm deploy
> > > > times very heavily.  Right now their firstboot payload calls xfs_admin
> > > > to change the fs uuid, mounts the fs, and then growfs's it into the
> > > > container.
> > > > 
> > > > Adding another pre-mount firstboot program (and one that potentially
> > > > might do a lot of IO) isn't going to be popular with them.
> > > 
> > > There's nothing that requires xfs_expand to be done at first boot.
> > > First boot is just part of the deployment scripts and it may make
> > > sense to do the expansion as early as possible in the deployment
> > > process.
> > 
> > Yeah, but how often do you need to do a 10000x expansion on anything
> > other than a freshly cloned image?  Is that common in your cloudworld?
> > OCI usage patterns seem to be exploding the image on firstboot and
> > incremental growfs after that.
> 
> I've seen it happen many times outside of container/VMs - this was
> even a significant problem 20+ years ago when AGs were limited to
> 4GB. That specific historic case was fixed by moving to 1TB max AG
> size, but there was no way to convert an existing filesystem. This
> is the "cloud case" in a nutshell, so it's clearly not a new
> problem.
> 
> Even ignoring the historic situation, we still see people have these
> problems with growing filesystems. It's especially prevalent with
> demand driven thin provisioned storage. Project starts small with
> only the space they need (e.g. for initial documentation), then as
> it ramps up and starts to generate TBs of data, the storage gets
> expanded from its initial "few GBs" size. Same problem, different
> environment.
> 
> > I think the difference between you and I here is that I see this
> > xfs_expand proposal as entirely a firstboot assistance program, whereas
> > you're looking at this more as a general operation that can happen at
> > any time.
> 
> Yes. As I've done for the past 15+ years, I'm thinking about the
> best solution for the wider XFS and storage community first and
> commercial imperatives second. I've seen people use XFS features and
> storage APIs for things I've never considered when designing them.
> I'm constantly surprised by how people use the functionality we
> provide in innovative, unexpected ways because they are generic
> enough to provide building blocks that people can use to implement
> new ideas.
> 

I think Darrick's agcount=1 proposal here is a nice example of that,
taking advantage of existing design flexibility to provide a creative
and elegantly simple solution to a real problem with minimal change and
risk.

Again the concept obviously needs to be prototyped enough to establish
confidence that it can work, but I don't really see how this has to be
mutually exclusive from a generic and more flexible and robust expand
mechanism in the longer term.

Just my .02. Thanks for the nice writeup, BTW.

Brian

> Filesystem expansion is, IMO, one of those "generically useful"
> tools and algorithms.
> 
> Perhaps it's not an obvious jump, but I'm also thinking about how we
> might be able to do the opposite of AG expansion to shrink the
> filesystem online. Not sure it is possible yet, but having the
> ability to dynamically resize AGs opens up many new possibilities.
> That's way outside the scope of this discussion, but I mention it
> simply to point out that the core of this generic expansion idea -
> decoupling the AG physical size from the internal sparse 64 bit
> addressing layout - has many potential future uses...
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Dave Chinner Aug. 6, 2024, 2:01 a.m. UTC | #17
On Fri, Aug 02, 2024 at 08:09:15AM -0400, Brian Foster wrote:
> On Thu, Jul 25, 2024 at 10:41:50AM +1000, Dave Chinner wrote:
> > On Wed, Jul 24, 2024 at 02:08:33PM -0700, Darrick J. Wong wrote:
> > > Counter-proposal: Instead of remapping the AGs to higher LBAs, what if
> > > we allowed people to create single-AG filesystems with large(ish)
> > > sb_agblocks.  You could then format a 2GB image with (say) a 100G AG
> > > size and copy your 2GB of data into the filesystem.  At deploy time,
> > > growfs will expand AG 0 to 100G and add new AGs after that, same as it
> > > does now.
> > 
> > We can already do this with existing tools.
> > 
> > All it requires is using xfs_db to rewrite the sb/ag geometry and
> > adding new freespace records. Now you have a 100GB AG instead of 2GB
> > and you can mount it and run growfs to add all the extra AGs you
> > need.
> > 
> 
> I'm not sure you have to do anything around adding new freespace records
> as such. If you format a small agcount=1 fs and then bump the sb agblock
> fields via xfs_db, then last I knew you could mount and run xfs_growfs
> according to the outsized AG size with no further changes at all. The
> grow will add the appropriate free space records as normal. I've not
> tried that with actually copying in data first, but otherwise it
> survives repair at least.

Yes, we could do it that way, too. It's just another existing
tool, but it requires the kernel to support an on-disk format
variant we have never previously supported.

> It would be interesting to see if that all still works with data present
> as long as you update the agblock fields before copying anything in. The
> presumption is that this operates mostly the same as extending an
> existing runt AG to a size that's still smaller than the full AG size,
> which obviously works today, just with the special circumstance that
> said runt AG happens to be the only AG in the fs.

Yes, exactly. Without significant effort to audit the code to ensure
this is safe I'm hesitant to say it'll actually work correctly. And
given that we can avoid such issues completely with a little bit of
xfs_db magic, the idea of having to support a weird one-off on
disk format variant forever just to allow growfs to do this One Neat
Trick doesn't seem like a good one to me.

> FWIW, Darrick's proposal here is pretty much exactly what I've thought
> XFS should have been at least considering doing to mitigate this cloudy
> deployment problem for quite some time. I agree with the caveats and
> limitations that Eric points out, but for a narrowly scoped cloudy
> deployment tool I have a hard time seeing how this could be anything
> but a significant improvement over the status quo for a relatively small
> amount of work (assuming it can be fully verified to work, of course ;).

I've considered it, too, and looked at what it means in the bigger
picture of the proliferation of cloud providers out there that have
their own image build/deploy implementations.

These cloud implementations would need all the client side
kernels and userspace to support this new way of deploying XFS
images. It will also need the orchestration software to buy into
this new XFS toolchain for image building and deployment. There are
a -lot- of moving parts to make this work, but it cannot replace the
existing build/deploy pipelines because clouds have to support older
distros/kernels that do stuff the current way.

That's one of the driving factors behind the expansion design - it
requires little in the way of changes to the build/deploy pipeline.
There are minimal kernel changes needed to support it, so backports
to LTS kernels could be done and they would filter through to
distros automatically.

To use it we need to add a mkfs flag to the image builder.
xfs_expand can be added to the boot time deploy scripts
unconditionally because it is a no-op if the feature bit is not set
on the image. The existing growfs post mount can remain because that
is a no-op if expansion to max size has already been done. If it
hasn't been expanded or growing is still required after expansion,
then the existing growfs callout just does the right thing.

So when you look at the bigger picture, expansion slots into the
existing image build/deploy pipelines in a backwards compatible way
with minimal changes to those pipelines. It's a relatively simple
tool that provides the mechanism, and it's relatively simple for
cloud infrastructure developers to slot that mechanism into their
pipelines.

It is the simplicity with which expansion fits into existing cloud
infrastructures that makes it a compelling solution to the
problem.

> ISTM that the major caveats can be managed with some reasonable amount
> of guardrails or policy enforcement. For example, if mkfs.xfs supported
> an "image mode" param that specified a target/min deployment size, that
> could be used as a hint to set the superblock AG size, a log size more
> reflective of the eventual deployment size, and perhaps even facilitate
> some usage limitations until that expected deployment occurs.
> 
> I.e., perhaps also set an "image mode" feature flag that requires a
> corresponding mount option in order to mount writeable (to populate the
> base image), and otherwise the fs mounts in a restricted-read mode where
> grow is allowed, and grow only clears the image mode state once the fs
> grows to at least 2 AGs (or whatever), etc. etc. At that point the user
> doesn't have to know or care about any of the underlying geometry
> details, just format and deploy as documented.
>
> That's certainly not perfect and I suspect we could probably come up
> with a variety of different policy permutations or user interfaces to
> consider, but ISTM there's real practical value to something like that.

These are more creative ideas, but there are a lot of conditionals
like "perhaps", "maybe", "suspect", "just", "if", etc in the above
comments. It is blue-sky thinking, not technical design review
feedback I can actually make use of.

> > Maybe it wasn't obvious from my descriptions of the sparse address
> > space diagrams, but single AG filesystems have no restrictions of AG
> > size growth because there are no high bits set in any of the sparse
> > 64 bit address spaces (i.e. fsbno or inode numbers). Hence we can
> > expand the AG size without worrying about overwriting the address
> > space used by higher AGs.
> > 
> > IOWs, the need for reserving sparse address space bits just doesn't
> > exist for single AG filesystems.  The point of this proposal is to
> > document a generic algorithm that avoids the problem of the higher
> > AG address space limiting how large lower AGs can be made. That's
> > the problem that prevents substantial resizing of AGs, and that's
> > what this design document addresses.
> > 
> 
> So in theory it sounds like you could also defer setting the final AG
> size on a single AG fs at least until the first grow, right? If so, that
> might be particularly useful for the typical one-shot
> expansion/deployment use case. The mkfs could just set image mode and
> single AG, and then the first major grow finalizes the ag size and
> expands to a sane (i.e. multi AG) format.

Sure, we could do that, too. 

However, on top of everything else, we need to redesign the kernel
growfs code to make AG sizing decisions. It needs to be able to
work out things like alignment of AG headers for given filesystem
functionality (e.g. stripe units, forced alignment for atomic
writes, etc), etc instead of just replicating what the userspace mkfs
logic decided at mkfs time.

Hence what seems like a simple addition to growfs gets much more
complex the moment we look at what "growing the AG size" really
means. It's these sorts of "how do we do that sanely in the kernel"
issues that lead me to doing AG expansion offline in userspace
because all the information to make these decisions correctly and
then implement them efficiently already exists in xfsprogs.

> I guess that potentially makes log sizing a bit harder, but we've had
> some discussion around online moving and resizing the log in the past as
> well and IIRC it wasn't really that hard to achieve so long as we keep
> the environment somewhat constrained by virtue of switching across a
> quiesce, for example. I don't recall all the details though; I'd have to
> see if I can dig up that conversation...

We addressed that problem a couple of years ago by making the log
a minimum of 64MB in size.

commit 6e0ed3d19c54603f0f7d628ea04b550151d8a262
Author: Darrick J. Wong <djwong@kernel.org>
Date:   Thu Aug 4 21:27:01 2022 -0500

    mkfs: stop allowing tiny filesystems

    Refuse to format a filesystem that are "too small", because these
    configurations are known to have performance and redundancy problems
    that are not present on the volume sizes that XFS is best at handling.

    Specifically, this means that we won't allow logs smaller than 64MB, we
    won't allow single-AG filesystems, and we won't allow volumes smaller
    than 300MB.  There are two exceptions: the first is an undocumented CLI
    option that can be used for crafting debug filesystems.

    The second exception is that if fstests is detected, because there are a
    lot of fstests that use tiny filesystems to perform targeted regression
    and functional testing in a controlled environment.  Fixing the ~40 or
    so tests to run more slowly with larger filesystems isn't worth the risk
    of breaking the tests.

    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
    Signed-off-by: Eric Sandeen <sandeen@sandeen.net>

For the vast majority of applications, 64MB is large enough to run
decently concurrent sustained metadata modification workloads at
speeds within 10-15% of a maximally sized 2GB journal. This amount
of performance is more than sufficient for most image based cloud
deployments....

> > > I think the difference between you and I here is that I see this
> > > xfs_expand proposal as entirely a firstboot assistance program, whereas
> > > you're looking at this more as a general operation that can happen at
> > > any time.
> > 
> > Yes. As I've done for the past 15+ years, I'm thinking about the
> > best solution for the wider XFS and storage community first and
> > commercial imperatives second. I've seen people use XFS features and
> > storage APIs for things I've never considered when designing them.
> > I'm constantly surprised by how people use the functionality we
> > provide in innovative, unexpected ways because they are generic
> > enough to provide building blocks that people can use to implement
> > new ideas.
> > 
> 
> I think Darrick's agcount=1 proposal here is a nice example of that,
> taking advantage of existing design flexibility to provide a creative
> and elegantly simple solution to a real problem with minimal change and
> risk.
>
> Again the concept obviously needs to be prototyped enough to establish
> confidence that it can work, but I don't really see how this has to be
> mutually exclusive from a generic and more flexible and robust expand
> mechanism in the longer term.

I never said it was mutually exclusive nor that it is a bad idea.  I
just don't think this potential new functionality is a relevant
consideration when it comes to reviewing the design of an expansion
tool.

If we want to do image generation and expansion in a completely
different way in future, then nothing in the xfs_expand proposal
prevents that. Whoever wants this new functionality can spend the
time to form these ideas into a solid proposal and write up the
requirements and the high level design into a document we can
review.

However, for the moment I need people to focus on the fundamental
architectural change that the expansion tool relies on: breaking the
AG size relationship between logical and physical addressing inside
the filesystem. This is the design change that needs focussed
review, not the expansion functionality that is built on top of it.

If this architectural change is solid, then it leads directly into
variable sized AGs and, potentially, a simple offline shrink
mechanism that works the opposite way to expansion.

IOWs, focussing on alternative ideas for image generation and
filesystem growing completely misses the important underlying
architectural change that expansion requires, which I need reviewers
to scrutinise closely.

-Dave.
Brian Foster Aug. 9, 2024, 1:31 p.m. UTC | #18
On Tue, Aug 06, 2024 at 12:01:52PM +1000, Dave Chinner wrote:
> On Fri, Aug 02, 2024 at 08:09:15AM -0400, Brian Foster wrote:
> > On Thu, Jul 25, 2024 at 10:41:50AM +1000, Dave Chinner wrote:
> > > On Wed, Jul 24, 2024 at 02:08:33PM -0700, Darrick J. Wong wrote:
> > > > Counter-proposal: Instead of remapping the AGs to higher LBAs, what if
> > > > we allowed people to create single-AG filesystems with large(ish)
> > > > sb_agblocks.  You could then format a 2GB image with (say) a 100G AG
> > > > size and copy your 2GB of data into the filesystem.  At deploy time,
> > > > growfs will expand AG 0 to 100G and add new AGs after that, same as it
> > > > does now.
> > > 
> > > We can already do this with existing tools.
> > > 
> > > All it requires is using xfs_db to rewrite the sb/ag geometry and
> > > adding new freespace records. Now you have a 100GB AG instead of 2GB
> > > and you can mount it and run growfs to add all the extra AGs you
> > > need.
> > > 
> > 
> > I'm not sure you have to do anything around adding new freespace records
> > as such. If you format a small agcount=1 fs and then bump the sb agblock
> > fields via xfs_db, then last I knew you could mount and run xfs_growfs
> > according to the outsized AG size with no further changes at all. The
> > grow will add the appropriate free space records as normal. I've not
> > tried that with actually copying in data first, but otherwise it
> > survives repair at least.
> 
> Yes, we could do it that way, too. It's just another existing
> tool, but it requires the kernel to support an on-disk format
> variant we have never previously supported.
> 
> > It would be interesting to see if that all still works with data present
> > as long as you update the agblock fields before copying anything in. The
> > presumption is that this operates mostly the same as extending an
> > existing runt AG to a size that's still smaller than the full AG size,
> > which obviously works today, just with the special circumstance that
> > said runt AG happens to be the only AG in the fs.
> 
> Yes, exactly. Without significant effort to audit the code to ensure
> this is safe I'm hesitant to say it'll actually work correctly. And
> given that we can avoid such issues completely with a little bit of
> xfs_db magic, the idea of having to support a weird one-off on
> disk format variant forever just to allow growfs to do this One Neat
> Trick doesn't seem like a good one to me.
> 

The One Neat Trick is to enable growfs to work exactly as it does today,
so it seems like a good idea to me.

> > FWIW, Darrick's proposal here is pretty much exactly what I've thought
> > XFS should have been at least considering doing to mitigate this cloudy
> > deployment problem for quite some time. I agree with the caveats and
> > limitations that Eric points out, but for a narrowly scoped cloudy
> > deployment tool I have a hard time seeing how this could be anything
> > but a significant improvement over the status quo for a relatively small
> > amount of work (assuming it can be fully verified to work, of course ;).
> 
> I've considered it, too, and looked at what it means in the bigger
> picture of the proliferation of cloud providers out there that have
> their own image build/deploy implementations.
> 
> These cloud implementations would need all the client side
> kernels and userspace to support this new way of deploying XFS
> images. It will also need the orchestration software to buy into
> this new XFS toolchain for image building and deployment. There are
> a -lot- of moving parts to make this work, but it cannot replace the
> existing build/deploy pipeling because clouds have to support older
> distros/kernels that do stuff the current way.
> 

The whole point is to use a format that fundamentally works with
existing workflows. The customization is at image creation time and
doesn't introduce any special deployment requirements.

Here's a repeat of the previous experiment, now with throwing some data
in...

# truncate -s 512M img
# ~/xfsprogs-dev/mkfs/mkfs.xfs -f img -dagcount=1 -lsize=64m
(creates 512MB fs based on underlying image file size)
# xfs_db -xc "sb 0" -c "write -d agblocks 2097152" -c "write -d agblklog 21" ./img
(bump AG size to 8GB)

# xfs_repair -nf -o force_geometry ./img
(passes)
<mount, copy in xfsprogs source repo, umount>
# xfs_repair -nf -o force_geometry ./img
(still passes)

# truncate -s 40G ./img
# mount ./img /mnt/
# xfs_growfs /mnt/
meta-data=/dev/loop0             isize=512    agcount=1, agsize=2097152 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
data     =                       bsize=4096   blocks=131072, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=16384, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
data blocks changed from 131072 to 10485760
# xfs_info /mnt/
meta-data=/dev/loop0             isize=512    agcount=5, agsize=2097152 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
data     =                       bsize=4096   blocks=10485760, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=16384, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

... and repair still passes and I can clean/compile the xfsprogs source
repo copied in above without any userspace or filesystem level errors.

This is on an upstream xfsprogs/kernel with the only change being I
removed the agcount=1 restriction from mkfs to format the fs. So there
are essentially zero functional changes and this produces a 5xAG fs in a
use case where we otherwise currently produce something like ~300.
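
(Rough arithmetic behind those counts, assuming 4k blocks and the usual
four-AG default for a small fs: the bumped agsize above is 2097152 blocks
= 8GB, so 40G of space is ceil(40/8) = 5 AGs; growing the stock 512MB
geometry - four ~128MB AGs - to 40G instead gives roughly 40G / 128M ~= 320
AGs.)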

> That's one of the driving factors behind the expansion design - it
> requires little in the way of changes to the build/deploy pipeline.
> There are minimal kernel changes needed to support it, so backports
> to LTS kernels could be done and they would filter through to
> distros automatically.
> 
> To use it we need to add a mkfs flag to the image builder.
> xfs_expand can be added to the boot time deploy scripts
> unconditionally because it is a no-op if the feature bit is not set
> on the image. The existing growfs post mount can remain because that
> is a no-op if expansion to max size has already been done. If it
> hasn't been expanded or growing is still required after expansion,
> then the existing growfs callout just does the right thing.
> 
> So when you look at the bigger picture, expansion slots into the
> existing image build/deploy pipelines in a backwards compatible way
> with minimal changes to those pipelines. It's a relatively simple
> tool that provides the mechanism, and it's relatively simple for
> cloud infrastructure developers to slot that mechanism into their
> pipelines.
> 
> It is the simplicity with which expansion fits into existing cloud
> infrastructures that makes it a compelling solution to the
> problem.
> 

Compared to the sequence above, you'd have the same sort of mkfs flag,
xfs_expand is not involved, and growfs works exactly as it does today.

I don't dispute the value of the expansion concept, particularly for the
additional flexibility and some of the other benefits you've brought up.
I am skeptical it is the most elegant solution for the basic cloud image
copy -> grow deployment use case though.

> > ISTM that the major caveats can be managed with some reasonable amount
> > of guardrails or policy enforcement. For example, if mkfs.xfs supported
> > an "image mode" param that specified a target/min deployment size, that
> > could be used as a hint to set the superblock AG size, a log size more
> > reflective of the eventual deployment size, and perhaps even facilitate
> > some usage limitations until that expected deployment occurs.
> > 
> > I.e., perhaps also set an "image mode" feature flag that requires a
> > corresponding mount option in order to mount writeable (to populate the
> > base image), and otherwise the fs mounts in a restricted-read mode where
> > grow is allowed, and grow only clears the image mode state once the fs
> > grows to at least 2 AGs (or whatever), etc. etc. At that point the user
> > doesn't have to know or care about any of the underlying geometry
> > details, just format and deploy as documented.
> >
> > That's certainly not perfect and I suspect we could probably come up
> > with a variety of different policy permutations or user interfaces to
> > consider, but ISTM there's real practical value to something like that.
> 
> These are more creative ideas, but there are a lot of conditionals
> like "perhaps", "maybe", "suspect", "just", "if", etc in the above
> comments. It is blue-sky thinking, not technical design review
> feedback I can actually make use of.
> 
> > > Maybe it wasn't obvious from my descriptions of the sparse address
> > > space diagrams, but single AG filesystems have no restrictions of AG
> > > size growth because there are no high bits set in any of the sparse
> > > 64 bit address spaces (i.e. fsbno or inode numbers). Hence we can
> > > expand the AG size without worrying about overwriting the address
> > > space used by higher AGs.
> > > 
> > > IOWs, the need for reserving sparse address space bits just doesn't
> > > exist for single AG filesystems.  The point of this proposal is to
> > > document a generic algorithm that avoids the problem of the higher
> > > AG address space limiting how large lower AGs can be made. That's
> > > the problem that prevents substantial resizing of AGs, and that's
> > > what this design document addresses.
> > > 
> > 
> > So in theory it sounds like you could also defer setting the final AG
> > size on a single AG fs at least until the first grow, right? If so, that
> > might be particularly useful for the typical one-shot
> > expansion/deployment use case. The mkfs could just set image mode and
> > single AG, and then the first major grow finalizes the ag size and
> > expands to a sane (i.e. multi AG) format.
> 
> Sure, we could do that, too. 
> 
> However, on top of everything else, we need to redesign the kernel
> growfs code to make AG sizing decisions. It needs to be able to
> work out things like alignment of AG headers for given filesystem
> functionality (e.g. stripe units, forced alignment for atomic
> writes, and so on) instead of just replicating what the userspace mkfs
> logic decided at mkfs time.
> 
> Hence what seems like a simple addition to growfs gets much more
> complex the moment we look at what "growing the AG size" really
> means. It's these sorts of "how do we do that sanely in the kernel"
> issues that lead me to doing AG expansion offline in userspace
> because all the information to make these decisions correctly and
> then implement them efficiently already exists in xfsprogs.
> 

Yeah, that's a fair concern. That's an idea for an optional usability
enhancement, so it's not a show stopper if it just didn't work.

> > I guess that potentially makes log sizing a bit harder, but we've had
> > some discussion around online moving and resizing the log in the past as
> > well and IIRC it wasn't really that hard to achieve so long as we keep
> > the environment somewhat constrained by virtue of switching across a
> > quiesce, for example. I don't recall all the details though; I'd have to
> > see if I can dig up that conversation...
> 
> We addressed that problem a couple of years ago by making the log
> a minimum of 64MB in size.
> 
> commit 6e0ed3d19c54603f0f7d628ea04b550151d8a262
> Author: Darrick J. Wong <djwong@kernel.org>
> Date:   Thu Aug 4 21:27:01 2022 -0500
> 
>     mkfs: stop allowing tiny filesystems
> 
>     Refuse to format a filesystem that are "too small", because these
>     configurations are known to have performance and redundancy problems
>     that are not present on the volume sizes that XFS is best at handling.
> 
>     Specifically, this means that we won't allow logs smaller than 64MB, we
>     won't allow single-AG filesystems, and we won't allow volumes smaller
>     than 300MB.  There are two exceptions: the first is an undocumented CLI
>     option that can be used for crafting debug filesystems.
> 
>     The second exception is that if fstests is detected, because there are a
>     lot of fstests that use tiny filesystems to perform targeted regression
>     and functional testing in a controlled environment.  Fixing the ~40 or
>     so tests to run more slowly with larger filesystems isn't worth the risk
>     of breaking the tests.
> 
>     Signed-off-by: Darrick J. Wong <djwong@kernel.org>
>     Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
>     Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
> 
> For the vast majority of applications, 64MB is large enough to run
> decently concurrent sustained metadata modification workloads at
> speeds within 10-15% of a maximally sized 2GB journal. This amount
> of performance is more than sufficient for most image based cloud
> deployments....
> 

Seems like that would make a reasonable default size for the cloud
deployment case then.

> > > > I think the difference between you and I here is that I see this
> > > > xfs_expand proposal as entirely a firstboot assistance program, whereas
> > > > you're looking at this more as a general operation that can happen at
> > > > any time.
> > > 
> > > Yes. As I've done for the past 15+ years, I'm thinking about the
> > > best solution for the wider XFS and storage community first and
> > > commercial imperatives second. I've seen people use XFS features and
> > > storage APIs for things I've never considered when designing them.
> > > I'm constantly surprised by how people use the functionality we
> > > provide in innovative, unexpected ways because they are generic
> > > enough to provide building blocks that people can use to implement
> > > new ideas.
> > > 
> > 
> > I think Darrick's agcount=1 proposal here is a nice example of that,
> > taking advantage of existing design flexibility to provide a creative
> > and elegantly simple solution to a real problem with minimal change and
> > risk.
> >
> > Again the concept obviously needs to be prototyped enough to establish
> > confidence that it can work, but I don't really see how this has to be
> > mutually exclusive from a generic and more flexible and robust expand
> > mechanism in the longer term.
> 
> I never said it was mutually exclusive nor that it is a bad idea.  I
> just don't think this potential new functionality is a relevant
> consideration when it comes to reviewing the design of an expansion
> tool.
> 

It was raised as a counterproposal for the problem/use case stated in
the doc.

> If we want to do image generation and expansion in a completely
> different way in future, then nothing in the xfs_expand proposal
> prevents that. Whoever wants this new functionality can spend the
> time to form these ideas into a solid proposal and write up the
> requirements and the high level design into a document we can
> review.
> 
> However, for the moment I need people to focus on the fundamental
> architectural change that the expansion tool relies on: breaking the
> AG size relationship between logical and physical addressing inside
> the filesystem. This is the design change that needs focussed
> review, not the expansion functionality that is built on top of it.
> 
> If this architectural change is solid, then it leads directly into
> variable sized AGs and, potentially, a simple offline shrink
> mechanism that works the opposite way to expansion.
> 

I think the doc could be improved to expand on those things, and perhaps
imply less strongly that the expansion tool is solely intended to address
the problematic cloudy image growfs use case.

Brian

> IOWs, focussing on alternative ideas for image generation and
> filesystem growing completely misses the important underlying
> architectural change that expansion requires that I need reviewers
> to scrutinise closely.
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
diff mbox series

Patch

diff --git a/Documentation/filesystems/xfs/index.rst b/Documentation/filesystems/xfs/index.rst
index ab66c57a5d18..cb570fc886b2 100644
--- a/Documentation/filesystems/xfs/index.rst
+++ b/Documentation/filesystems/xfs/index.rst
@@ -12,3 +12,4 @@  XFS Filesystem Documentation
    xfs-maintainer-entry-profile
    xfs-self-describing-metadata
    xfs-online-fsck-design
+   xfs-expand-design
diff --git a/Documentation/filesystems/xfs/xfs-expand-design.rst b/Documentation/filesystems/xfs/xfs-expand-design.rst
new file mode 100644
index 000000000000..fffc0b44518d
--- /dev/null
+++ b/Documentation/filesystems/xfs/xfs-expand-design.rst
@@ -0,0 +1,312 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+===============================
+XFS Filesystem Expansion Design
+===============================
+
+Background
+==========
+
+XFS has long been able to grow the size of the filesystem dynamically whilst
+mounted. The functionality has been used extensively over the past 3 decades
+for managing filesystems on expandable storage arrays, but over the past decade
+there has been significant growth in filesystem image based orchestration
+frameworks that require expansion of the filesystem image during deployment.
+
+These frameworks want the initial image to be as small as possible to minimise
+the cost of deployment, but then want that image to scale to whatever size the
+deployment requires. This means that the base image can be as small as a few
+hundred megabytes and be expanded on deployment to tens of terabytes.
+
+Growing a filesystem by 4-5 orders of magnitude is a long way outside the scope
+of the original xfs_growfs design requirements. It was designed for users who
+were adding physical storage to already large storage arrays; a single order of
+magnitude in growth was considered a very large expansion.
+
+As a result, we have a situation where growing a filesystem works well up to a
+certain point, yet we have orchestration frameworks that allow users to expand
+filesystems a long way past this point without them being aware of the issues
+it will cause them further down the track.
+
+
+Scope
+=====
+
+The need to expand filesystems with a geometry optimised for small storage
+volumes onto much larger storage volumes results in a large filesystem with
+poorly optimised geometry. Growing a small XFS filesystem by several orders of
+magnitude results in a filesystem with many small allocation groups (AGs). This
+is bad for allocation efficiency, contiguous free space management, allocation
+performance as the filesystem fills, and so on. The filesystem will also end up
+with a very small journal for the size of the filesystem, which can drastically
+limit metadata performance and concurrency.
+
+These issues are a result of the filesystem growing algorithm. It is an
+append-only mechanism which takes advantage of the fact we can safely initialise
+the metadata for new AGs beyond the end of the existing filesystem without
+impacting runtime behaviour. Those newly initialised AGs can then be enabled
+atomically by running a single transaction to expose that newly initialised
+space to the running filesystem.
+
+As a result, the growing algorithm is a fast, transparent, simple and crash-safe
+algorithm that can be run while the filesystem is mounted. It's a very good
+algorithm for growing a filesystem on a block device that has had new physical
+storage appended to its LBA space.
+
+However, this algorithm shows its limitations when we move to system deployment
+via filesystem image distribution. These deployments optimise the base
+filesystem image for minimal size to minimise the time and cost of deploying
+them to the newly provisioned system (be it VM or container). They rely on
+being able to grow the filesystem to the size of the destination storage
+during the first system bringup, when they tailor the deployed filesystem
+image for its intended purpose and identity.
+
+If the deployed system has substantial storage provisioned, this means the
+filesystem image will be expanded by multiple orders of magnitude during the
+system initialisation phase, and this is where the existing append-based growing
+algorithm falls apart. This is the issue that this design seeks to resolve.
+
+
+XFS Internal Block Addressing
+=============================
+
+XFS has three different ways of addressing a storage block in the on-disk
+format. It can address storage blocks by:
+
+- DADDR addressing (xfs_daddr_t).
+  This is the linear LBA space of the underlying block device.
+
+- AGBNO addressing (xfs_agblock_t).
+  This is a linear address space starting at zero that indexes filesystem blocks
+  relative to the first block in an allocation group.
+
+- FSBNO addressing (xfs_fsblock_t).
+  This is a sparse encoding of (xfs_agnumber_t, xfs_agblock_t) that can address
+  any block in the entire filesystem.
+
+We are going to ignore the DADDR encoding for the moment as we first must
+understand how AGBNO and FSBNO addressing are related.
+
+The FSBNO encoding is a 64 bit number with the AG and AGBNO encoding within it
+being determined by fields in the superblock that are set at mkfs.xfs time and
+fixed for the life of the filesystem. The key point is that these fields
+determine the amount of FSBNO address space the AGBNO is encoded into::
+
+	FSBNO = (AG << sb_agblklog) | AGBNO
+
+This results in the FSBNO being a sparse encoding of the location of the block
+within the filesystem whenever the number of blocks in an AG is not an exact
+power of 2. The linear AGBNO address space for all the blocks in a single AG of
+size sb_agblocks looks like this::
+
+	0		      sb_agblocks	(1 << sb_agblklog)
+	+--------------------------+.................+
+
+If we have multiple AGs, the FSBNO address space looks like this::
+
+		    addressable space		unavailable
+	AG 0	+--------------------------+.................+
+	AG 1	+--------------------------+.................+
+	AG 2	+--------------------------+.................+
+	AG 3	+--------------------------+.................+
+
+Hence we have holes in the FSBNO address space where there is no
+underlying physical LBA address space. The amount of unavailable space is
+determined by mkfs.xfs - it simply rounds the AG size up to the next highest
+power of 2. This power of 2 rounding means that encoding can be efficiently
+translated by the kernel with shifts and masks.
+
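+As an illustration, the encode and decode operations look something like the
+sketch below. The helper names are illustrative only; the kernel implements
+these conversions with helpers such as XFS_FSB_TO_AGNO() and
+XFS_FSB_TO_AGBNO():
+
+.. code-block:: c
+
+	/* pack an (AG, AGBNO) pair into the sparse FSBNO address space */
+	xfs_fsblock_t agb_to_fsb(xfs_agnumber_t agno, xfs_agblock_t agbno)
+	{
+		return ((xfs_fsblock_t)agno << sb_agblklog) | agbno;
+	}
+
+	/* the AG number is the high bits above sb_agblklog */
+	xfs_agnumber_t fsb_to_agno(xfs_fsblock_t fsbno)
+	{
+		return fsbno >> sb_agblklog;
+	}
+
+	/* the AG-relative block number is the low sb_agblklog bits */
+	xfs_agblock_t fsb_to_agbno(xfs_fsblock_t fsbno)
+	{
+		return fsbno & ((1ULL << sb_agblklog) - 1);
+	}
+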
+This minimal expansion of the AGBNO address space also allows the number of AGs
+to be increased to scale the filesystem all the way out to the full 64 bit FSBNO
+address space, thereby allowing even filesystems with small AG sizes to reach
+exabyte capacities.
+
+The reality of supporting millions of AGs, however, means this simply isn't
+feasible. For example, mkfs requires several million IOs to initialise all the
+AG headers, mount might have to read all those headers, etc.
+
+This is the fundamental problem with the existing filesystem growing mechanism
+when we start with very small AGs. Growing to thousands of AGs results in the
+FSBNO address space becoming so fragmented and requiring so many headers on disk
+to index it that the per-AG management algorithms do not work efficiently
+anymore.
+
+To solve this problem, we need a way of preventing the FSBNO space from
+becoming excessively fragmented when we grow from small to very large. The
+solution, surprisingly, lies in the very fact that the FSBNO space is sparse,
+fragmented and fixed for the life of the filesystem.
+
+
+Exploiting Sparse FSBNO Addressing
+==================================
+
+The FSBNO address encoding is fixed at mkfs.xfs time because it is very
+difficult to change once the filesystem starts to index objects in FSBNO
+encoding. This is because the LBA addresses of the objects don't change, but
+the FSBNO encoding of those locations changes.
+
+Hence changing sb_agblklog would require finding and re-encoding every
+FSBNO and inode number in the filesystem. While it is possible we could do
+this offline via xfs_repair, it would not be fast and there is no possibility
+it could be done transparently online. Changing inode numbers also has other
+downsides such as invalidating various long term backup strategies.
+
+However, there is a way we can change the AG size without needing to change
+sb_agblklog.  Earlier we showed how the FSBNO and AGBNO spaces are related, but
+now we need to know how the FSBNO address space relates to the DADDR (block
+device LBA) address space.
+
+Recall the earlier diagram showing that the FSBNO space is made up of
+available and unavailable AGBNO space across multiple AGs. Now let's lay out
+those 4 AGs and AGBNO space over the DADDR space. If "X" is the size of an AG
+in the DADDR space, we get something like this::
+
+	DADDR	0	X	2X	3X	4X
+	AG 0	+-------+...+
+	AG 1		+-------+...+
+	AG 2			+-------+...+
+	AG 3				+-------+...+
+
+
+The available AGBNO space in each AG is laid nose to tail over the DADDR space,
+whilst the unavailable AGBNO address space for each AG overlays the available
+AGBNO space of the next AG. For the last AG, the unavailable address space
+extends beyond the end of the filesystem and the available DADDR address space.
+
+Given this layout, it should now be obvious why it is very difficult to change
+the physical size of an AG and why the existing grow mechanism simply appends
+new AGs to the filesystem. Changing the size of an AG means that we have to
+move all the higher AGs either up or down in the DADDR space. That
+requires physically moving data around in the block device LBA space.
+
+Hence if we wanted to expand the AG size out to the max defined by sb_agblklog
+(call that size Y), we'd have to move AG 3 from 3X to 3Y, AG 2 from 2X to 2Y, etc,
+until we ended up with this::
+
+	DADDR	0	X	2X	3X	4X
+	DADDR	0	    Y		2Y	    3Y		4Y
+	AG 0	+-------+...+
+	AG 1		    +-------+...+
+	AG 2				+-------+...+
+	AG 3					    +-------+...+
+
+And then we can change the AG size (sb_agblocks) to match the size defined
+by sb_agblklog. That would then give us::
+
+	DADDR	0	    Y		2Y	    3Y		4Y
+	AG 0	+-----------+
+	AG 1		    +-----------+
+	AG 2				+-----------+
+	AG 3					    +-----------+
+
+None of the FSBNO encodings have changed in this process. We have physically
+changed the location of the start of every AG so that they are still laid out
+nose to tail, but they are now maximally sized for the filesystem's defined
+sb_agblklog value.
+
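+To see why no encodings change, consider how a FSBNO is translated to a
+physical location. The sketch below is illustrative (the kernel's
+XFS_FSB_TO_DADDR() does essentially this): the physical translation uses
+sb_agblocks, the real AG size, while the FSBNO encoding uses only
+sb_agblklog. Hence sb_agblocks can change without invalidating any stored
+FSBNO or inode number:
+
+.. code-block:: c
+
+	xfs_daddr_t fsb_to_daddr(xfs_fsblock_t fsbno)
+	{
+		xfs_agnumber_t	agno = fsbno >> sb_agblklog;
+		xfs_agblock_t	agbno = fsbno & ((1ULL << sb_agblklog) - 1);
+
+		/* blkbb_log converts filesystem blocks to 512 byte sectors */
+		return ((xfs_daddr_t)agno * sb_agblocks + agbno) << blkbb_log;
+	}
+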
+That, however, is the downside of this mechanism: We can only grow the AG size
+to the maximum defined by sb_agblklog. And we already know that mkfs.xfs rounds
+that value up to the next highest power of 2. Hence this mechanism can, at best,
+only double the size of an AG in an existing filesystem. That's not enough to
+solve the problem we are trying to address.
+
+The solution should already be obvious: we can exploit the sparseness of FSBNO
+addressing to allow AGs to grow to 1TB (maximum size) simply by configuring
+sb_agblklog appropriately at mkfs.xfs time. Hence if we have 16MB AGs (minimum
+size) and sb_agblklog = 30 (1TB max AG size), we can expand the AGs up to
+their maximum size before we start appending new AGs.
+
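+A userspace expansion tool might choose the new AG size along the lines of
+the sketch below. This is purely illustrative - the variable names and the
+policy are assumptions, not part of this design - but it shows that the new
+AG size is bounded both by the sb_agblklog ceiling set at mkfs time and by
+the size of the target device:
+
+.. code-block:: c
+
+	/* largest AG the fixed FSBNO encoding can address */
+	max_agblocks = 1ULL << sb_agblklog;
+
+	/* no point making AGs larger than an even split of the target */
+	new_agblocks = min(max_agblocks,
+			   (target_blocks + sb_agcount - 1) / sb_agcount);
+
+	/* AGs can never shrink; leave the geometry alone if nothing to do */
+	if (new_agblocks < sb_agblocks)
+		new_agblocks = sb_agblocks;
+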
+
+Optimising Physical AG Realignment
+==================================
+
+The elephant in the room at this point in time is the fact that we have to
+physically move data around to expand AGs. While this makes AG size expansion
+prohibitive for large filesystems, they should already have large AGs and so
+using the existing grow mechanism will continue to be the right tool to use for
+expanding them.
+
+However, for small filesystems and filesystem images in the order of hundreds of
+MB to a few GB in size, the cost of moving data around is much more tolerable.
+If we can optimise the IO patterns to be purely sequential, offload the movement
+to the hardware, or even use address space manipulation APIs to minimise the
+cost of this movement, then resizing AGs via realignment becomes even more
+appealing.
+
+Realigning AGs must avoid overwriting parts of AGs that have not yet been
+realigned. That means we can't realign the AGs from AG 1 upwards - doing so will
+overwrite parts of AG 2 before we've realigned that data. Hence realignment must
+be done from the highest AG first, and work downwards.
+
+Moving the data within an AG could be optimised to be space usage aware, similar
+to what xfs_copy does to build sparse filesystem images. However, the space
+optimised filesystem images aren't going to have a lot of free space in them,
+and what there is may be quite fragmented. Hence doing free space aware copying
+of relatively full small AGs may be IOPS intensive. Given we are talking about
+AGs in the typical size range from 64-512MB, doing a sequential copy of the
+entire AG isn't going to take very long on any storage. If we have to do several
+hundred seeks in that range to skip free space, then copying the free space will
+cost less than the seeks and the partial RAID stripe writes that small IOs will
+cause.
+
+Hence the simplest, sequentially optimised data moving algorithm will be:
+
+.. code-block:: c
+
+	/* Work from the highest AG down so unmoved AGs are never overwritten. */
+	for (agno = sb_agcount - 1; agno > 0; agno--) {
+		off_t	src = (off_t)agno * sb_agblocks * blocksize;
+		off_t	dst = (off_t)agno * new_agblocks * blocksize;
+
+		/* fd is the image; error and overlapping range handling elided */
+		copy_file_range(fd, &src, fd, &dst,
+				(size_t)sb_agblocks * blocksize, 0);
+	}
+
+This also leads to optimisation via server side or block device copy offload
+infrastructure. Instead of streaming the data through kernel buffers, the copy
+is handed to the server/hardware to move the data internally as quickly as
+possible.
+
+For filesystem images held in files and, potentially, on sparse storage devices
+like dm-thinp, we don't even need to copy the data.  We can simply insert holes
+into the underlying mapping at the appropriate place.  For filesystem images,
+this is:
+
+.. code-block:: c
+
+	/*
+	 * Insert from the highest AG down so that earlier inserts do not
+	 * shift the offsets of AGs that have yet to be processed.
+	 */
+	len = (off_t)(new_agblocks - sb_agblocks) * blocksize;
+	for (agno = sb_agcount - 1; agno > 0; agno--) {
+		off_t	src = (off_t)agno * sb_agblocks * blocksize;
+
+		fallocate(fd, FALLOC_FL_INSERT_RANGE, src, len);
+	}
+
+Then the filesystem image can be copied to the destination block device in an
+efficient manner (i.e. skipping holes in the image file).
+
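+One way to do that is sketched below: walk the allocated extents of the image
+file with SEEK_DATA/SEEK_HOLE and copy only those ranges to the target.
+copy_range() is a stand-in for whatever read/write loop or copy offload
+mechanism the deployment tooling prefers:
+
+.. code-block:: c
+
+	off_t	pos = 0;
+
+	/* lseek() fails with ENXIO once no allocated data remains */
+	while ((pos = lseek(img_fd, pos, SEEK_DATA)) >= 0) {
+		off_t	end = lseek(img_fd, pos, SEEK_HOLE);
+
+		copy_range(img_fd, dev_fd, pos, end - pos);
+		pos = end;
+	}
+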
+Hence there are several different realignment strategies that can be used to
+optimise the expansion of the filesystem. The optimal strategy will ultimately
+depend on how the orchestration software sets up the filesystem for
+configuration at first boot. The userspace xfs expansion tool should be able to
+support all these mechanisms directly so that higher level infrastructure
+can simply select the option that best suits the installation being performed.
+
+
+Limitations
+===========
+
+This document describes an offline mechanism for expanding the filesystem
+geometry. It doesn't add new AGs, just expands the existing AGs. If the
+filesystem needs to be made larger than maximally sized AGs can address, then
+a subsequent online xfs_growfs operation is still required.
+
+For container/vm orchestration software, this isn't a huge issue as they
+generally grow the image from within the initramfs context on first boot. That
+is currently a "mount; xfs_growfs" operation pair; adding expansion simply
+requires running it before the mount, i.e. first boot becomes an
+"xfs_expand; mount; xfs_growfs" sequence. Depending on the eventual size of
+the target filesystem, the xfs_growfs operation may be a no-op.
+
+Whether expansion can be done online is an open question. AG expansion changes
+fundamental constants that are calculated at mount time (e.g. maximum AG btree
+heights), and so an online expand would need to recalculate many internal
+constants that are used throughout the codebase. This seems like a complex
+problem to solve and isn't really necessary for the use case we need to address,
+so online expansion remains a potential future enhancement that requires a lot
+more thought.