diff mbox series

[1/5] xfs: don't treat append-only files as having preallocations

Message ID 171821431777.3202459.4876836906447539030.stgit@frogsfrogsfrogs (mailing list archive)
State New
Headers show
Series [1/5] xfs: don't treat append-only files as having preallocations | expand

Commit Message

Darrick J. Wong June 12, 2024, 5:46 p.m. UTC
From: Christoph Hellwig <hch@lst.de>

The XFS XFS_DIFLAG_APPEND maps to the VFS S_APPEND flag, which forbids
writes that don't append at the current EOF.

But the commit originally adding XFS_DIFLAG_APPEND support (commit
a23321e766d in xfs xfs-import repository) also checked it to skip
releasing speculative preallocations, which doesn't make any sense.

Another commit (dd9f438e3290 in the xfs-import repository) late extended
that flag to also report these speculation preallocations which should
not exist in getbmap.

Remove these checks as nothing XFS_DIFLAG_APPEND implies that
preallocations beyond EOF should exist.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_bmap_util.c |    9 ++++-----
 fs/xfs/xfs_icache.c    |    2 +-
 2 files changed, 5 insertions(+), 6 deletions(-)

Comments

Dave Chinner June 13, 2024, 6:03 a.m. UTC | #1
On Wed, Jun 12, 2024 at 10:46:48AM -0700, Darrick J. Wong wrote:
> From: Christoph Hellwig <hch@lst.de>
> 
> The XFS XFS_DIFLAG_APPEND maps to the VFS S_APPEND flag, which forbids
> writes that don't append at the current EOF.
> 
> But the commit originally adding XFS_DIFLAG_APPEND support (commit
> a23321e766d in xfs xfs-import repository) also checked it to skip
> releasing speculative preallocations, which doesn't make any sense.

I disagree, there was a very good reason for this behaviour:
preventing append-only log files from getting excessively fragmented
because speculative prealloc would get removed on close().

i.e. applications that slowly log messages to append only files
with the pattern open(O_APPEND); write(a single line to the log);
close(); caused worst case file fragmentation because the close()
always removed the speculative prealloc beyond EOF.

The fix for this pessimisitic XFS behaviour is for the application
to use chattr +A (like they would for ext3/4) hence triggering the
existence of XFS_DIFLAG_APPEND and that avoided the removal
speculative delalloc removed when the file is closed. hence the
fragmentation problems went away.

Note that fragmentation issue didn't affect the log writes - it
badly affected log reads because it turned them into worse case
random read workloads instead of sequential reads.

As such, I think the justification for this change is wrong and that
it removes a longstanding feature that prevents severe fragmentation
of append only log files. I think we should be leaving this code as
it currently stands.

-Dave.
Christoph Hellwig June 13, 2024, 8:28 a.m. UTC | #2
On Thu, Jun 13, 2024 at 04:03:53PM +1000, Dave Chinner wrote:
> I disagree, there was a very good reason for this behaviour:
> preventing append-only log files from getting excessively fragmented
> because speculative prealloc would get removed on close().

Where is that very clear intent documented?  Not in the original
commit message (which is very sparse) and no where in any documentation
I can find.

> i.e. applications that slowly log messages to append only files
> with the pattern open(O_APPEND); write(a single line to the log);
> close(); caused worst case file fragmentation because the close()
> always removed the speculative prealloc beyond EOF.

That case should be covered by the XFS_IDIRTY_RELEASE, at least
except for O_SYNC workloads. 

> 
> The fix for this pessimisitic XFS behaviour is for the application
> to use chattr +A (like they would for ext3/4) hence triggering the
> existence of XFS_DIFLAG_APPEND and that avoided the removal
> speculative delalloc removed when the file is closed. hence the
> fragmentation problems went away.

For ext4 the EXT4_APPEND_FL flag does not cause any difference
in allocation behavior.  For the historic ext2 driver it apparently
did just, with an XXX comment marking this as a bug, but for ext3 it
also never did looking back quite a bit in history.
Dave Chinner June 17, 2024, 5:03 a.m. UTC | #3
On Thu, Jun 13, 2024 at 10:28:55AM +0200, Christoph Hellwig wrote:
> On Thu, Jun 13, 2024 at 04:03:53PM +1000, Dave Chinner wrote:
> > I disagree, there was a very good reason for this behaviour:
> > preventing append-only log files from getting excessively fragmented
> > because speculative prealloc would get removed on close().
> 
> Where is that very clear intent documented?  Not in the original
> commit message (which is very sparse) and no where in any documentation
> I can find.

We've lost all the internal SGI bug databases, so there's little to
know evidence I can point at. But at the time, it was a well known
problem amongst Irix XFS engineers that append-only log files would
regularly get horribly fragmented.

There'd been several escalations over that behaviour over the years
w.r.t. large remote servers (think of facilities that "don't trust
the logs on client machines because they might be compromised"). In
general, the fixes for these applications tended to require the
loggin server application to use F_RESVSP to do the append-only log
file initialisation.  That got XFS_DIFLAG_PREALLOC set on the files,
so then anything allocated by appending writes beyond EOF was left
alone. That small change was largely sufficient to mitigate worst
case log file fragmentation on Irix-XFS.

So when adding a flag on disk for Linux-XFS to say "this is an
append only file" it made lots of sense to make it behave like
XFS_DIFLAG_PREALLOC had already been set on the inode without
requring the application to do anything to set that up.

I'll note that the patches sent to the list by Ethan Benson to
originally implement XFS_DIFLAG_APPEND (and others) is not exactly
what was committed in this commit:

https://marc.info/?l=linux-xfs&m=106360278223548&w=2

The last version posted on the list was this:

https://marc.info/?l=linux-xfs&m=106109662212214&w=2

but the version committed had lots of things renamed, sysctls for
sync and nodump inheritance and other bits and pieces including
the EOF freeing changes to skip if DIFLAG_APPEND was set.

It is clear that there was internal SGI discussion, modification and
review of the original proposed patch set, and none of that internal
discussion is on open mailing lists. We might have the historical
XFS code and Linux mailing list archives, but that doesn't always
tell us what institutional knowledge was behind subtle changes to
publicly proposed patches like this....

> > i.e. applications that slowly log messages to append only files
> > with the pattern open(O_APPEND); write(a single line to the log);
> > close(); caused worst case file fragmentation because the close()
> > always removed the speculative prealloc beyond EOF.
> 
> That case should be covered by the XFS_IDIRTY_RELEASE, at least
> except for O_SYNC workloads. 

Ah, so I fixed the problem independently 7 or 8 years later to fix
Linux NFS server performance issues. Ok, that makes removing the
flag less bad, but I still don't see the harm in keeping it there
given that behaviour has existed for the past 20 years....

> > The fix for this pessimisitic XFS behaviour is for the application
> > to use chattr +A (like they would for ext3/4) hence triggering the
> > existence of XFS_DIFLAG_APPEND and that avoided the removal
> > speculative delalloc removed when the file is closed. hence the
> > fragmentation problems went away.
> 
> For ext4 the EXT4_APPEND_FL flag does not cause any difference
> in allocation behavior.

Sure, but ext4 doesn't have speculative preallocation beyond EOF to
prevent fragmentation, either.

> For the historic ext2 driver it apparently
> did just, with an XXX comment marking this as a bug, but for ext3 it
> also never did looking back quite a bit in history.

Ditto - when the filesystem isn't allocating anything beyond EOF,
there's little point in trying to removing blocks beyond EOF that
can't exist on final close()...

-Dave.
Christoph Hellwig June 17, 2024, 6:46 a.m. UTC | #4
On Mon, Jun 17, 2024 at 03:03:28PM +1000, Dave Chinner wrote:
> > That case should be covered by the XFS_IDIRTY_RELEASE, at least
> > except for O_SYNC workloads. 
> 
> Ah, so I fixed the problem independently 7 or 8 years later to fix
> Linux NFS server performance issues. Ok, that makes removing the
> flag less bad, but I still don't see the harm in keeping it there
> given that behaviour has existed for the past 20 years....

I'm really kinda worried about these unaccounted preallocations lingering
around basically forever.  Note that in current mainline there actually
is a path removing them more or less accidentally when there are
delalloc blocks in a can_free_eofblocks path with force == true,
but that's going away with the next patch.
Dave Chinner June 17, 2024, 11:28 p.m. UTC | #5
On Mon, Jun 17, 2024 at 08:46:03AM +0200, Christoph Hellwig wrote:
> On Mon, Jun 17, 2024 at 03:03:28PM +1000, Dave Chinner wrote:
> > > That case should be covered by the XFS_IDIRTY_RELEASE, at least
> > > except for O_SYNC workloads. 
> > 
> > Ah, so I fixed the problem independently 7 or 8 years later to fix
> > Linux NFS server performance issues. Ok, that makes removing the
> > flag less bad, but I still don't see the harm in keeping it there
> > given that behaviour has existed for the past 20 years....
> 
> I'm really kinda worried about these unaccounted preallocations lingering
> around basically forever.

How are they "unaccounted"? They are accounted to the inode, they
are visible in statx and so du reports them.

Maybe you meant "unreclaimable"?

But that's not true, either, because a truncate to the same size or
a hole punch from EOF to -1 will remove the post-EOF blocks. But
that's what the blockgc ioctls are supposed to be doing for these
files, so....

> Note that in current mainline there actually
> is a path removing them more or less accidentally when there are
> delalloc blocks in a can_free_eofblocks path with force == true,
> but that's going away with the next patch.

... fix the blockgc walk to ignore DIFLAG_APPEND when doing it's
passes. The files are not marked with DIFLAG_PREALLOC, so blockgc
should trim them, just like it does with all other files that have
had post-eof prealloc that is currently unused.

In short: Don't remove the optimisation that prevents worst case
fragmentation in known workloads. Instead, fix the garbage
collection to do the right thing when space is low and we are
optimising for allocation success rather than optimal file layout.

-Dave.
diff mbox series

Patch

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index ac2e77ebb54c..eb8056b1c906 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -331,8 +331,7 @@  xfs_getbmap(
 		}
 
 		if (xfs_get_extsz_hint(ip) ||
-		    (ip->i_diflags &
-		     (XFS_DIFLAG_PREALLOC | XFS_DIFLAG_APPEND)))
+		    (ip->i_diflags & XFS_DIFLAG_PREALLOC))
 			max_len = mp->m_super->s_maxbytes;
 		else
 			max_len = XFS_ISIZE(ip);
@@ -526,10 +525,10 @@  xfs_can_free_eofblocks(
 		return false;
 
 	/*
-	 * Do not free real preallocated or append-only files unless the file
-	 * has delalloc blocks and we are forced to remove them.
+	 * Do not free real extents in preallocated files unless the file has
+	 * delalloc blocks and we are forced to remove them.
 	 */
-	if (ip->i_diflags & (XFS_DIFLAG_PREALLOC | XFS_DIFLAG_APPEND))
+	if (ip->i_diflags & XFS_DIFLAG_PREALLOC)
 		if (!force || ip->i_delayed_blks == 0)
 			return false;
 
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 0953163a2d84..41b8a5c4dd69 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1158,7 +1158,7 @@  xfs_inode_free_eofblocks(
 	if (xfs_can_free_eofblocks(ip, false))
 		return xfs_free_eofblocks(ip);
 
-	/* inode could be preallocated or append-only */
+	/* inode could be preallocated */
 	trace_xfs_inode_free_eofblocks_invalid(ip);
 	xfs_inode_clear_eofblocks_tag(ip);
 	return 0;