
[RFC,0/3] : Extreme fragmentation ahoy!

Message ID 20190207050813.24271-1-david@fromorbit.com (mailing list archive)

Message

Dave Chinner Feb. 7, 2019, 5:08 a.m. UTC
Hi folks,

I've just finished analysing an IO trace from an application
generating an extreme filesystem fragmentation problem that started
with extent size hints and ended with spurious ENOSPC reports due to
massively fragmented files and free space. While the ENOSPC issue
looks to have previously been solved, I still wanted to understand
how the application had so comprehensively defeated extent size
hints as a method of avoiding file fragmentation.

The key behaviour that I discovered was that specific "append write
only" files that had extent size hints to prevent fragmentation
weren't actually write only.  The application didn't do a lot of
writes to the file, but it kept the file open and appended to the
file (from the traces I have) in chunks of between ~3000 bytes and
~160000 bytes. This didn't explain the problem. I did notice that
the files were opened O_SYNC, however.

I then found another process that, once every second, opened the
log file O_RDONLY, read 28 bytes from offset zero, then closed the
file. Every second. IOWs, between every appending write that would
allocate an extent size hint worth of space beyond EOF and then
write a small chunk of it, there were numerous open/read/close
cycles being done on the same file.
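That antagonist is trivial to sketch in shell. The loop below mimics the
watcher process described above; the demo file and its 28-byte contents
are invented for illustration (the real application's file is unknown):

```shell
# Sketch of the "log watcher": open the file read-only, read 28 bytes
# from offset zero, close, once per second. Every close() goes through
# ->release, which is where XFS decides whether to trim post-EOF blocks.
# The demo file and its contents are invented for illustration.
logfile=$(mktemp)
printf 'XFS-log-header-28-bytes-long' > "$logfile"   # exactly 28 bytes
for ((i = 0; i < 3; i++)); do
	# dd opens the file O_RDONLY, reads one 28-byte block, closes it
	header=$(dd if="$logfile" bs=28 count=1 2>/dev/null)
	sleep 1
done
echo "read header: $header"
rm -f "$logfile"
```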

And what do we do on close()? We call xfs_release() and that can
truncate away blocks beyond EOF. For some reason the close wasn't
triggering the IDIRTY_RELEASE heuristic that prevented close from
removing EOF blocks prematurely. Then I realised that O_SYNC writes
don't leave delayed allocation blocks behind - they are always
converted in the context of the write. That's why it wasn't
triggering, and that meant that the open/read/close cycle was
removing the extent size hint allocation beyond EOF prematurely.

Then it occurred to me that extent size hints don't use delalloc
either, so they behave the same way as O_SYNC writes in this
situation.
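For anyone who wants to poke at this by hand, here is a hedged sketch
against an XFS scratch directory. SCRATCH and the file name are
assumptions; the script skips itself when xfs_io/xfs_bmap or the
directory are unavailable, and whether the preallocation survives the
writer's own close depends on the kernel and on the fixes in this
series:

```shell
# Observe post-EOF speculative preallocation around an open/read/close
# cycle. SCRATCH is an assumption - point it at any XFS directory.
# Skips quietly when the XFS tools or the directory are absent.
SCRATCH=${SCRATCH:-/mnt/scratch}
status=skipped
if command -v xfs_io >/dev/null 2>&1 && command -v xfs_bmap >/dev/null 2>&1 \
   && [ -d "$SCRATCH" ]; then
	f=$SCRATCH/prealloc-demo
	rm -f "$f"
	# extent size hint + small write: allocation extends past EOF
	xfs_io -f -c "extsize 16m" -c "pwrite -S 0x61 0 4k" "$f" >/dev/null
	echo "bmap after write:"
	xfs_bmap -vp "$f"
	# one open/read/close cycle runs xfs_release() on the inode;
	# on kernels without these fixes the preallocation may already
	# have been trimmed by the writer's own close above
	dd if="$f" of=/dev/null bs=28 count=1 2>/dev/null
	echo "bmap after open/read/close:"
	xfs_bmap -vp "$f"
	rm -f "$f"
	status=done
fi
echo "demo: $status"
```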

Oh, and we remove EOF blocks on O_RDONLY file close, too. i.e. we
modify the file without having write permissions.

I suspect there are more cases like this when combined with repeated
open/<do_something>/close operations on a file that is being
written, but the patches address just the ones I talked
about. The test script to reproduce them is below. Fragmentation
reduction results are in the commit descriptions. It's running
through fstests for a couple of hours now, no issues have been
noticed yet.

FWIW, I suspect we need to have a good hard think about whether we
should be trimming EOF blocks on close by default, or whether we
should only be doing it in very limited situations....

Comments, thoughts, flames welcome.

-Dave.


#!/bin/bash
#
# Test 1
#
# Write multiple files in parallel using synchronous buffered writes. Aim is to
# interleave allocations to fragment the files. Synchronous writes defeat the
# open/write/close heuristics in xfs_release() that prevent EOF block removal,
# so this should fragment badly.

workdir=/mnt/scratch
nfiles=8
wsize=4096
wcnt=1000

echo
echo "Test 1: sync write fragmentation counts"
echo
write_sync_file()
{
	idx=$1

	for ((cnt=0; cnt<$wcnt; cnt++)); do
		xfs_io -f -s -c "pwrite $((cnt * wsize)) $wsize" $workdir/file.$idx
	done
}

rm -f $workdir/file*
for ((n=0; n<$nfiles; n++)); do
	write_sync_file $n > /dev/null 2>&1 &
done
wait

sync

for ((n=0; n<$nfiles; n++)); do
	echo -n "$workdir/file.$n: "
	xfs_bmap -vp $workdir/file.$n | wc -l
done;


# Test 2
#
# Same as test 1, but instead of sync writes, use extent size hints to defeat
# the open/write/close heuristic

extent_size=16m

echo
echo "Test 2: Extent size hint fragmentation counts"
echo

write_extsz_file()
{
	idx=$1

	xfs_io -f -c "extsize $extent_size" $workdir/file.$idx
	for ((cnt=0; cnt<$wcnt; cnt++)); do
		xfs_io -f -c "pwrite $((cnt * wsize)) $wsize" $workdir/file.$idx
	done
}

rm -f $workdir/file*
for ((n=0; n<$nfiles; n++)); do
	write_extsz_file $n > /dev/null 2>&1 &
done
wait

sync

for ((n=0; n<$nfiles; n++)); do
	echo -n "$workdir/file.$n: "
	xfs_bmap -vp $workdir/file.$n | wc -l
done;



# Test 3
#
# Same as test 2, but instead of extent size hints, use open/read/close loops
# on the files to remove EOF blocks.

echo
echo "Test 3: Open/read/close loop fragmentation counts"
echo

write_file()
{
	idx=$1

	xfs_io -f -s -c "pwrite -b 64k 0 50m" $workdir/file.$idx
}

read_file()
{
	idx=$1

	for ((cnt=0; cnt<$wcnt; cnt++)); do
		xfs_io -f -r -c "pread 0 28" $workdir/file.$idx
	done
}

rm -f $workdir/file*
for ((n=0; n<$((nfiles * 4)); n++)); do
	write_file $n > /dev/null 2>&1 &
	read_file $n > /dev/null 2>&1 &
done
wait

sync

for ((n=0; n<$nfiles; n++)); do
	echo -n "$workdir/file.$n: "
	xfs_bmap -vp $workdir/file.$n | wc -l
done;

Comments

Darrick J. Wong Feb. 7, 2019, 5:21 a.m. UTC | #1
On Thu, Feb 07, 2019 at 04:08:10PM +1100, Dave Chinner wrote:
> Hi folks,
> 
> I've just finished analysing an IO trace from an application
> generating an extreme filesystem fragmentation problem that started
> with extent size hints and ended with spurious ENOSPC reports due to
> massively fragmented files and free space. While the ENOSPC issue
> looks to have previously been solved, I still wanted to understand
> how the application had so comprehensively defeated extent size
> hints as a method of avoiding file fragmentation.
> 
> The key behaviour that I discovered was that specific "append write
> only" files that had extent size hints to prevent fragmentation
> weren't actually write only.  The application didn't do a lot of
> writes to the file, but it kept the file open and appended to the
> file (from the traces I have) in chunks of between ~3000 bytes and
> ~160000 bytes. This didn't explain the problem. I did notice that
> the files were opened O_SYNC, however.
> 
> I then found another process that, once every second, opened the
> log file O_RDONLY, read 28 bytes from offset zero, then closed the
> file. Every second. IOWs, between every appending write that would
> allocate an extent size hint worth of space beyond EOF and then
> write a small chunk of it, there were numerous open/read/close
> cycles being done on the same file.
> 
> And what do we do on close()? We call xfs_release() and that can
> truncate away blocks beyond EOF. For some reason the close wasn't
> triggering the IDIRTY_RELEASE heuristic that prevented close from
> removing EOF blocks prematurely. Then I realised that O_SYNC writes
> don't leave delayed allocation blocks behind - they are always
> converted in the context of the write. That's why it wasn't
> triggering, and that meant that the open/read/close cycle was
> removing the extent size hint allocation beyond EOF prematurely.

<urk>

> Then it occurred to me that extent size hints don't use delalloc
> either, so they behave the same way as O_SYNC writes in this
> situation.
> 
> Oh, and we remove EOF blocks on O_RDONLY file close, too. i.e. we
> modify the file without having write permissions.

Yikes!

> I suspect there are more cases like this when combined with repeated
> open/<do_something>/close operations on a file that is being
> written, but the patches address just the ones I talked
> about. The test script to reproduce them is below. Fragmentation
> reduction results are in the commit descriptions. It's running
> through fstests for a couple of hours now, no issues have been
> noticed yet.
> 
> FWIW, I suspect we need to have a good hard think about whether we
> should be trimming EOF blocks on close by default, or whether we
> should only be doing it in very limited situations....
> 
> Comments, thoughts, flames welcome.
> 
> -Dave.
> 
> 
> #!/bin/bash
> #
> # Test 1

Can you please turn these into fstests to cause the maintainer maximal
immediate pain^W^W^Wmake everyone pay attention^W^W^W^Westablish a basis
for regression testing and finding whatever other problems we can find
from digging deeper? :)

--D

> #
> # Write multiple files in parallel using synchronous buffered writes. Aim is to
> # interleave allocations to fragment the files. Synchronous writes defeat the
> # open/write/close heuristics in xfs_release() that prevent EOF block removal,
> # so this should fragment badly.
> 
> workdir=/mnt/scratch
> nfiles=8
> wsize=4096
> wcnt=1000
> 
> echo
> echo "Test 1: sync write fragmentation counts"
> echo
> write_sync_file()
> {
> 	idx=$1
> 
> 	for ((cnt=0; cnt<$wcnt; cnt++)); do
> 		xfs_io -f -s -c "pwrite $((cnt * wsize)) $wsize" $workdir/file.$idx
> 	done
> }
> 
> rm -f $workdir/file*
> for ((n=0; n<$nfiles; n++)); do
> 	write_sync_file $n > /dev/null 2>&1 &
> done
> wait
> 
> sync
> 
> for ((n=0; n<$nfiles; n++)); do
> 	echo -n "$workdir/file.$n: "
> 	xfs_bmap -vp $workdir/file.$n | wc -l
> done;
> 
> 
> # Test 2
> #
> # Same as test 1, but instead of sync writes, use extent size hints to defeat
> # the open/write/close heuristic
> 
> extent_size=16m
> 
> echo
> echo "Test 2: Extent size hint fragmentation counts"
> echo
> 
> write_extsz_file()
> {
> 	idx=$1
> 
> 	xfs_io -f -c "extsize $extent_size" $workdir/file.$idx
> 	for ((cnt=0; cnt<$wcnt; cnt++)); do
> 		xfs_io -f -c "pwrite $((cnt * wsize)) $wsize" $workdir/file.$idx
> 	done
> }
> 
> rm -f $workdir/file*
> for ((n=0; n<$nfiles; n++)); do
> 	write_extsz_file $n > /dev/null 2>&1 &
> done
> wait
> 
> sync
> 
> for ((n=0; n<$nfiles; n++)); do
> 	echo -n "$workdir/file.$n: "
> 	xfs_bmap -vp $workdir/file.$n | wc -l
> done;
> 
> 
> 
> # Test 3
> #
> # Same as test 2, but instead of extent size hints, use open/read/close loops
> # on the files to remove EOF blocks.
> 
> echo
> echo "Test 3: Open/read/close loop fragmentation counts"
> echo
> 
> write_file()
> {
> 	idx=$1
> 
> 	xfs_io -f -s -c "pwrite -b 64k 0 50m" $workdir/file.$idx
> }
> 
> read_file()
> {
> 	idx=$1
> 
> 	for ((cnt=0; cnt<$wcnt; cnt++)); do
> 		xfs_io -f -r -c "pread 0 28" $workdir/file.$idx
> 	done
> }
> 
> rm -f $workdir/file*
> for ((n=0; n<$((nfiles * 4)); n++)); do
> 	write_file $n > /dev/null 2>&1 &
> 	read_file $n > /dev/null 2>&1 &
> done
> wait
> 
> sync
> 
> for ((n=0; n<$nfiles; n++)); do
> 	echo -n "$workdir/file.$n: "
> 	xfs_bmap -vp $workdir/file.$n | wc -l
> done;
> 
>
Dave Chinner Feb. 7, 2019, 5:39 a.m. UTC | #2
On Wed, Feb 06, 2019 at 09:21:14PM -0800, Darrick J. Wong wrote:
> On Thu, Feb 07, 2019 at 04:08:10PM +1100, Dave Chinner wrote:
> > Hi folks,
> > 
> > I've just finished analysing an IO trace from an application
> > generating an extreme filesystem fragmentation problem that started
> > with extent size hints and ended with spurious ENOSPC reports due to
> > massively fragmented files and free space. While the ENOSPC issue
> > looks to have previously been solved, I still wanted to understand
> > how the application had so comprehensively defeated extent size
> > hints as a method of avoiding file fragmentation.
> > 
> > The key behaviour that I discovered was that specific "append write
> > only" files that had extent size hints to prevent fragmentation
> > weren't actually write only.  The application didn't do a lot of
> > writes to the file, but it kept the file open and appended to the
> > file (from the traces I have) in chunks of between ~3000 bytes and
> > ~160000 bytes. This didn't explain the problem. I did notice that
> > the files were opened O_SYNC, however.
> > 
> > I then found another process that, once every second, opened the
> > log file O_RDONLY, read 28 bytes from offset zero, then closed the
> > file. Every second. IOWs, between every appending write that would
> > allocate an extent size hint worth of space beyond EOF and then
> > write a small chunk of it, there were numerous open/read/close
> > cycles being done on the same file.
> > 
> > And what do we do on close()? We call xfs_release() and that can
> > truncate away blocks beyond EOF. For some reason the close wasn't
> > triggering the IDIRTY_RELEASE heuristic that prevented close from
> > removing EOF blocks prematurely. Then I realised that O_SYNC writes
> > don't leave delayed allocation blocks behind - they are always
> > converted in the context of the write. That's why it wasn't
> > triggering, and that meant that the open/read/close cycle was
> > removing the extent size hint allocation beyond EOF prematurely.
> 
> <urk>
> 
> > Then it occurred to me that extent size hints don't use delalloc
> > either, so they behave the same way as O_SYNC writes in this
> > situation.
> > 
> > Oh, and we remove EOF blocks on O_RDONLY file close, too. i.e. we
> > modify the file without having write permissions.
> 
> Yikes!
> 
> > I suspect there are more cases like this when combined with repeated
> > open/<do_something>/close operations on a file that is being
> > written, but the patches address just the ones I talked
> > about. The test script to reproduce them is below. Fragmentation
> > reduction results are in the commit descriptions. It's running
> > through fstests for a couple of hours now, no issues have been
> > noticed yet.
> > 
> > FWIW, I suspect we need to have a good hard think about whether we
> > should be trimming EOF blocks on close by default, or whether we
> > should only be doing it in very limited situations....
> > 
> > Comments, thoughts, flames welcome.
> > 
> > -Dave.
> > 
> > 
> > #!/bin/bash
> > #
> > # Test 1
> 
> Can you please turn these into fstests to cause the maintainer maximal
> immediate pain^W^W^Wmake everyone pay attention^W^W^W^Westablish a basis
> for regression testing and finding whatever other problems we can find
> from digging deeper? :)

I will, but not today - I only understood the cause well enough to
write a prototype reproducer about 4 hours ago. The rest of the time
since then has been fixing the issues and running smoke tests. My
brain is about fried now....

FWIW, I think the scope of the problem is quite widespread -
anything that does open/something/close repeatedly on a file that is
being written to with O_DSYNC or O_DIRECT appending writes will kill
the post-eof extent size hint allocated space. That's why I suspect
we need to think about not trimming by default and trying to
enumerate only the cases that need to trim eof blocks.

e.g. I closed the O_RDONLY case, but O_RDWR/read/close in a loop
will still trigger removal of post EOF extent size hint
preallocation and hence severe fragmentation.

Cheers,

Dave.
Brian Foster Feb. 7, 2019, 3:52 p.m. UTC | #3
On Thu, Feb 07, 2019 at 04:39:41PM +1100, Dave Chinner wrote:
> On Wed, Feb 06, 2019 at 09:21:14PM -0800, Darrick J. Wong wrote:
> > On Thu, Feb 07, 2019 at 04:08:10PM +1100, Dave Chinner wrote:
> > > Hi folks,
> > > 
> > > I've just finished analysing an IO trace from an application
> > > generating an extreme filesystem fragmentation problem that started
> > > with extent size hints and ended with spurious ENOSPC reports due to
> > > massively fragmented files and free space. While the ENOSPC issue
> > > looks to have previously been solved, I still wanted to understand
> > > how the application had so comprehensively defeated extent size
> > > hints as a method of avoiding file fragmentation.
> > > 
> > > The key behaviour that I discovered was that specific "append write
> > > only" files that had extent size hints to prevent fragmentation
> > > weren't actually write only.  The application didn't do a lot of
> > > writes to the file, but it kept the file open and appended to the
> > > file (from the traces I have) in chunks of between ~3000 bytes and
> > > ~160000 bytes. This didn't explain the problem. I did notice that
> > > the files were opened O_SYNC, however.
> > > 
> > > I then found another process that, once every second, opened the
> > > log file O_RDONLY, read 28 bytes from offset zero, then closed the
> > > file. Every second. IOWs, between every appending write that would
> > > allocate an extent size hint worth of space beyond EOF and then
> > > write a small chunk of it, there were numerous open/read/close
> > > cycles being done on the same file.
> > > 
> > > And what do we do on close()? We call xfs_release() and that can
> > > truncate away blocks beyond EOF. For some reason the close wasn't
> > > triggering the IDIRTY_RELEASE heuristic that prevented close from
> > > removing EOF blocks prematurely. Then I realised that O_SYNC writes
> > > don't leave delayed allocation blocks behind - they are always
> > > converted in the context of the write. That's why it wasn't
> > > triggering, and that meant that the open/read/close cycle was
> > > removing the extent size hint allocation beyond EOF prematurely.
> > 
> > <urk>
> > 
> > > Then it occurred to me that extent size hints don't use delalloc
> > > either, so they behave the same way as O_SYNC writes in this
> > > situation.
> > > 
> > > Oh, and we remove EOF blocks on O_RDONLY file close, too. i.e. we
> > > modify the file without having write permissions.
> > 
> > Yikes!
> > 
> > > I suspect there are more cases like this when combined with repeated
> > > open/<do_something>/close operations on a file that is being
> > > written, but the patches address just the ones I talked
> > > about. The test script to reproduce them is below. Fragmentation
> > > reduction results are in the commit descriptions. It's running
> > > through fstests for a couple of hours now, no issues have been
> > > noticed yet.
> > > 
> > > FWIW, I suspect we need to have a good hard think about whether we
> > > should be trimming EOF blocks on close by default, or whether we
> > > should only be doing it in very limited situations....
> > > 
> > > Comments, thoughts, flames welcome.
> > > 
> > > -Dave.
> > > 
> > > 
> > > #!/bin/bash
> > > #
> > > # Test 1
> > 
> > Can you please turn these into fstests to cause the maintainer maximal
> > immediate pain^W^W^Wmake everyone pay attention^W^W^W^Westablish a basis
> > for regression testing and finding whatever other problems we can find
> > from digging deeper? :)
> 
> I will, but not today - I only understood the cause well enough to
> write a prototype reproducer about 4 hours ago. The rest of the time
> since then has been fixing the issues and running smoke tests. My
> brain is about fried now....
> 
> FWIW, I think the scope of the problem is quite widespread -
> anything that does open/something/close repeatedly on a file that is
> being written to with O_DSYNC or O_DIRECT appending writes will kill
> the post-eof extent size hint allocated space. That's why I suspect
> we need to think about not trimming by default and trying to
> enumerate only the cases that need to trim eof blocks.
> 

To further this point.. I think the eofblocks scanning stuff came long
after the speculative preallocation code and associated release time
post-eof truncate. I think the background scanning was initially an
enhancement to deal with things like the dirty release optimization
leaving these blocks around longer and being able to free up this
accumulated space when we're at -ENOSPC conditions. Now that we have the
scanning mechanism in place (and a 5 minute default background scan,
which really isn't all that long), it might be reasonable to just drop
the release time truncate completely and only trim post-eof blocks via
the bg scan or reclaim paths.
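(The 5 minute default referred to above is the speculative
preallocation lifetime; on mainline kernels it is exposed in seconds
through procfs. A small sketch - the path is the mainline XFS sysctl,
and it is simply reported as missing elsewhere:)

```shell
# Report the background EOF-blocks reap interval (default 300s = 5min).
# The path is the mainline XFS sysctl; it is absent on kernels without
# XFS or on older kernels, in which case we just say so.
knob=/proc/sys/fs/xfs/speculative_prealloc_lifetime
if [ -r "$knob" ]; then
	msg="background prealloc reap interval: $(cat "$knob")s"
else
	msg="$knob not present on this kernel"
fi
echo "$msg"
```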

Brian

> e.g. I closed the O_RDONLY case, but O_RDWR/read/close in a loop
> will still trigger removal of post EOF extent size hint
> preallocation and hence severe fragmentation.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Feb. 8, 2019, 2:47 a.m. UTC | #4
On Thu, Feb 07, 2019 at 10:52:43AM -0500, Brian Foster wrote:
> On Thu, Feb 07, 2019 at 04:39:41PM +1100, Dave Chinner wrote:
> > On Wed, Feb 06, 2019 at 09:21:14PM -0800, Darrick J. Wong wrote:
> > > On Thu, Feb 07, 2019 at 04:08:10PM +1100, Dave Chinner wrote:
> > > > Hi folks,
> > > > 
> > > > I've just finished analysing an IO trace from an application
> > > > generating an extreme filesystem fragmentation problem that started
> > > > with extent size hints and ended with spurious ENOSPC reports due to
> > > > massively fragmented files and free space. While the ENOSPC issue
> > > > looks to have previously been solved, I still wanted to understand
> > > > how the application had so comprehensively defeated extent size
> > > > hints as a method of avoiding file fragmentation.
....
> > FWIW, I think the scope of the problem is quite widespread -
> > anything that does open/something/close repeatedly on a file that is
> > being written to with O_DSYNC or O_DIRECT appending writes will kill
> > the post-eof extent size hint allocated space. That's why I suspect
> > we need to think about not trimming by default and trying to
> > enumerate only the cases that need to trim eof blocks.
> > 
> 
> To further this point.. I think the eofblocks scanning stuff came long
> after the speculative preallocation code and associated release time
> post-eof truncate.

Yes, I cribbed a bit of the history of the xfs_release() behaviour
on #xfs yesterday afternoon:

 <djwong>	dchinner: feel free to ignore this until tomorrow if you want, but /me wonders why we'd want to free the eofblocks at close time at all, instead of waiting for inactivation/enospc/background reaper to do it?
 <dchinner>	historic. People doing operations then complaining du didn't match ls
 <dchinner>	stuff like that
 <dchinner>	There used to be an open file cache in XFS - we'd know exactly when the last reference went away and trim it then
 <dchinner>	but that went away when NFS and the dcache got smarter about file handle conversion
 <dchinner>	(i.e. that's how we used to make nfs not suck)
 <dchinner>	that's when we started doing work in ->release
 <dchinner>	it was close enough to "last close" for most workloads it made no difference.
 <dchinner>	Except for concurrent NFS writes into the same directory
 <dchinner>	and now there's another pathological application that triggers problems
 <dchinner>	The NFS exception was prior to having the background reaper
 <dchinner>	as these things go the background reaper is relatively recent functionality
 <dchinner>	so perhaps we should just leave it to "inode cache expiry or background reaping" and not do it on close at all

> I think the background scanning was initially an
> enhancement to deal with things like the dirty release optimization
> leaving these blocks around longer and being able to free up this
> accumulated space when we're at -ENOSPC conditions.

Yes, amongst other things like slow writes keeping the file open
forever.....

> Now that we have the
> scanning mechanism in place (and a 5 minute default background scan,
> which really isn't all that long), it might be reasonable to just drop
> the release time truncate completely and only trim post-eof blocks via
> the bg scan or reclaim paths.

Yeah, that's kinda the question I'm asking here. What's the likely
impact of not trimming EOF blocks at least on close apart from
people complaining about df/ls not matching du?

I don't really care about that anymore because, well, reflink/dedupe
completely break any remaining assumption that du reported space
consumption is related to the file size (if sparse files weren't
enough of a hint already)....
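The sparse-file point can be demonstrated with nothing but coreutils;
a self-contained sketch:

```shell
# A sparse file: 1GiB of apparent size, (almost) nothing allocated.
# du could not infer file size from block counts long before reflink
# and dedupe made the mismatch routine.
f=$(mktemp)
truncate -s 1G "$f"                                 # extend without allocating
apparent=$(du --apparent-size -k "$f" | cut -f1)    # KiB of file size
actual=$(du -k "$f" | cut -f1)                      # KiB actually allocated
echo "apparent=${apparent}KiB allocated=${actual}KiB"
rm -f "$f"
```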

Cheers,

Dave.
Brian Foster Feb. 8, 2019, 12:34 p.m. UTC | #5
On Fri, Feb 08, 2019 at 01:47:30PM +1100, Dave Chinner wrote:
> On Thu, Feb 07, 2019 at 10:52:43AM -0500, Brian Foster wrote:
> > On Thu, Feb 07, 2019 at 04:39:41PM +1100, Dave Chinner wrote:
> > > On Wed, Feb 06, 2019 at 09:21:14PM -0800, Darrick J. Wong wrote:
> > > > On Thu, Feb 07, 2019 at 04:08:10PM +1100, Dave Chinner wrote:
> > > > > Hi folks,
> > > > > 
> > > > > I've just finished analysing an IO trace from an application
> > > > > generating an extreme filesystem fragmentation problem that started
> > > > > with extent size hints and ended with spurious ENOSPC reports due to
> > > > > massively fragmented files and free space. While the ENOSPC issue
> > > > > looks to have previously been solved, I still wanted to understand
> > > > > how the application had so comprehensively defeated extent size
> > > > > hints as a method of avoiding file fragmentation.
> ....
> > > FWIW, I think the scope of the problem is quite widespread -
> > > anything that does open/something/close repeatedly on a file that is
> > > being written to with O_DSYNC or O_DIRECT appending writes will kill
> > > the post-eof extent size hint allocated space. That's why I suspect
> > > we need to think about not trimming by default and trying to
> > > enumerate only the cases that need to trim eof blocks.
> > > 
> > 
> > To further this point.. I think the eofblocks scanning stuff came long
> > after the speculative preallocation code and associated release time
> > post-eof truncate.
> 
> Yes, I cribbed a bit of the history of the xfs_release() behaviour
> on #xfs yesterday afternoon:
> 
>  <djwong>	dchinner: feel free to ignore this until tomorrow if you want, but /me wonders why we'd want to free the eofblocks at close time at all, instead of waiting for inactivation/enospc/background reaper to do it?
>  <dchinner>	historic. People doing operations then complaining du didn't match ls
>  <dchinner>	stuff like that
>  <dchinner>	There used to be an open file cache in XFS - we'd know exactly when the last reference went away and trim it then
>  <dchinner>	but that went away when NFS and the dcache got smarter about file handle conversion
>  <dchinner>	(i.e. that's how we used to make nfs not suck)
>  <dchinner>	that's when we started doing work in ->release
>  <dchinner>	it was close enough to "last close" for most workloads it made no difference.
>  <dchinner>	Except for concurrent NFS writes into the same directory
>  <dchinner>	and now there's another pathological application that triggers problems
>  <dchinner>	The NFS exception was prior to having the background reaper
>  <dchinner>	as these things go the background reaper is relatively recent functionality
>  <dchinner>	so perhaps we should just leave it to "inode cache expiry or background reaping" and not do it on close at all
> 

Thanks.

> > I think the background scanning was initially an
> > enhancement to deal with things like the dirty release optimization
> > leaving these blocks around longer and being able to free up this
> > accumulated space when we're at -ENOSPC conditions.
> 
> Yes, amongst other things like slow writes keeping the file open
> forever.....
> 
> > Now that we have the
> > scanning mechanism in place (and a 5 minute default background scan,
> > which really isn't all that long), it might be reasonable to just drop
> > the release time truncate completely and only trim post-eof blocks via
> > the bg scan or reclaim paths.
> 
> Yeah, that's kinda the question I'm asking here. What's the likely
> impact of not trimming EOF blocks at least on close apart from
> people complaining about df/ls not matching du?
> 

Ok. ISTM it's just a continuation of the same "might confuse some users"
scenario that pops up occasionally. It also seems that kind of thing has
died down as either most people don't really know or care about the
transient state or are just more familiar with it at this point. IME,
complex applications that depend on block ownership stats (userspace
filesystems for example) already have to account for speculative
preallocation with XFS, so tweaking the semantics of the optimization
shouldn't really have much of an impact that I can tell so long as the
broader/long-term behavior doesn't change[1].

I suppose there are all kinds of other applications that are technically
affected by dropping the release time trim (simple file copies, archive
extraction, etc.), but it's not clear to me that matters so long as we
have effective bg and -ENOSPC scans. The only thing I can think of so
far is whether we should consider changes to the bg scan heuristics to
accommodate scenarios currently covered by the release time trim. For
example, the release time scan doesn't consider whether the file is
dirty or not while the bg scan always skips "active" files.

Brian

[1] Which isn't the case for a change to not bg trim files with extent
size hints.

> I don't really care about that anymore because, well, reflink/dedupe
> completely break any remaining assumption that du reported space
> consumption is related to the file size (if sparse files weren't
> enough of a hint already)....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Darrick J. Wong Feb. 8, 2019, 4:29 p.m. UTC | #6
On Fri, Feb 08, 2019 at 01:47:30PM +1100, Dave Chinner wrote:
> On Thu, Feb 07, 2019 at 10:52:43AM -0500, Brian Foster wrote:
> > On Thu, Feb 07, 2019 at 04:39:41PM +1100, Dave Chinner wrote:
> > > On Wed, Feb 06, 2019 at 09:21:14PM -0800, Darrick J. Wong wrote:
> > > > On Thu, Feb 07, 2019 at 04:08:10PM +1100, Dave Chinner wrote:
> > > > > Hi folks,
> > > > > 
> > > > > I've just finished analysing an IO trace from an application
> > > > > generating an extreme filesystem fragmentation problem that started
> > > > > with extent size hints and ended with spurious ENOSPC reports due to
> > > > > massively fragmented files and free space. While the ENOSPC issue
> > > > > looks to have previously been solved, I still wanted to understand
> > > > > how the application had so comprehensively defeated extent size
> > > > > hints as a method of avoiding file fragmentation.
> ....
> > > FWIW, I think the scope of the problem is quite widespread -
> > > anything that does open/something/close repeatedly on a file that is
> > > being written to with O_DSYNC or O_DIRECT appending writes will kill
> > > the post-eof extent size hint allocated space. That's why I suspect
> > > we need to think about not trimming by default and trying to
> > > enumerate only the cases that need to trim eof blocks.
> > > 
> > 
> > To further this point.. I think the eofblocks scanning stuff came long
> > after the speculative preallocation code and associated release time
> > post-eof truncate.
> 
> Yes, I cribbed a bit of the history of the xfs_release() behaviour
> on #xfs yesterday afternoon:
> 
>  <djwong>	dchinner: feel free to ignore this until tomorrow if you want, but /me wonders why we'd want to free the eofblocks at close time at all, instead of waiting for inactivation/enospc/background reaper to do it?
>  <dchinner>	historic. People doing operations then complaining du didn't match ls
>  <dchinner>	stuff like that
>  <dchinner>	There used to be an open file cache in XFS - we'd know exactly when the last reference went away and trim it then
>  <dchinner>	but that went away when NFS and the dcache got smarter about file handle conversion
>  <dchinner>	(i.e. that's how we used to make nfs not suck)
>  <dchinner>	that's when we started doing work in ->release
>  <dchinner>	it was close enough to "last close" for most workloads it made no difference.
>  <dchinner>	Except for concurrent NFS writes into the same directory
>  <dchinner>	and now there's another pathological application that triggers problems
>  <dchinner>	The NFS exception was prior to having the background reaper
>  <dchinner>	as these things go the background reaper is relatively recent functionality
>  <dchinner>	so perhaps we should just leave it to "inode cache expiry or background reaping" and not do it on close at all
> 
> > I think the background scanning was initially an
> > enhancement to deal with things like the dirty release optimization
> > leaving these blocks around longer and being able to free up this
> > accumulated space when we're at -ENOSPC conditions.
> 
> Yes, amongst other things like slow writes keeping the file open
> forever.....
> 
> > Now that we have the
> > scanning mechanism in place (and a 5 minute default background scan,
> > which really isn't all that long), it might be reasonable to just drop
> > the release time truncate completely and only trim post-eof blocks via
> > the bg scan or reclaim paths.
> 
> Yeah, that's kinda the question I'm asking here. What's the likely
> impact of not trimming EOF blocks at least on close apart from
> people complaining about df/ls not matching du?
> 
> I don't really care about that anymore because, well, reflink/dedupe
> completely break any remaining assumption that du reported space
> consumption is related to the file size (if sparse files wasn't
> enough of a hint already)....

Not to mention the deferred inactivation series tracks "space we could
free if we did a bunch of inactivation work" so that we can lie to
statfs and pretend we already did the work.  It wouldn't be hard to
include speculative posteof blocks in that too.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Darrick J. Wong Feb. 12, 2019, 1:13 a.m. UTC | #7
On Fri, Feb 08, 2019 at 07:34:33AM -0500, Brian Foster wrote:
> On Fri, Feb 08, 2019 at 01:47:30PM +1100, Dave Chinner wrote:
> > On Thu, Feb 07, 2019 at 10:52:43AM -0500, Brian Foster wrote:
> > > On Thu, Feb 07, 2019 at 04:39:41PM +1100, Dave Chinner wrote:
> > > > On Wed, Feb 06, 2019 at 09:21:14PM -0800, Darrick J. Wong wrote:
> > > > > On Thu, Feb 07, 2019 at 04:08:10PM +1100, Dave Chinner wrote:
> > > > > > Hi folks,
> > > > > > 
> > > > > > I've just finished analysing an IO trace from an application
> > > > > > generating an extreme filesystem fragmentation problem that started
> > > > > > with extent size hints and ended with spurious ENOSPC reports due to
> > > > > > massively fragmented files and free space. While the ENOSPC issue
> > > > > > looks to have previously been solved, I still wanted to understand
> > > > > > how the application had so comprehensively defeated extent size
> > > > > > hints as a method of avoiding file fragmentation.
> > ....
> > > > FWIW, I think the scope of the problem is quite widespread -
> > > > anything that does open/something/close repeatedly on a file that is
> > > > being written to with O_DSYNC or O_DIRECT appending writes will kill
> > > > the post-eof extent size hint allocated space. That's why I suspect
> > > > we need to think about not trimming by default and trying to
> > > > enumerate only the cases that need to trim eof blocks.
> > > > 
> > > 
> > > To further this point.. I think the eofblocks scanning stuff came long
> > > after the speculative preallocation code and associated release time
> > > post-eof truncate.
> > 
> > Yes, I cribbed a bit of the history of the xfs_release() behaviour
> > on #xfs yesterday afternoon:
> > 
> > ....
> > 
> 
> Thanks.
> 
> > ....
> > 
> 
> Ok. ISTM it's just a continuation of the same "might confuse some users"
> scenario that pops up occasionally. It also seems that kind of thing has
> died down as either most people don't really know or care about the
> transient state or are just more familiar with it at this point. IME,
> complex applications that depend on block ownership stats (userspace
> filesystems for example) already have to account for speculative
> preallocation with XFS, so tweaking the semantics of the optimization
> shouldn't really have much of an impact that I can tell so long as the
> broader/long-term behavior doesn't change[1].
> 
> I suppose there are all kinds of other applications that are technically
> affected by dropping the release time trim (simple file copies, archive
> extraction, etc.), but it's not clear to me that matters so long as we
> have effective bg and -ENOSPC scans. The only thing I can think of so
> far is whether we should consider changes to the bg scan heuristics to
> accommodate scenarios currently covered by the release time trim. For
> example, the release time scan doesn't consider whether the file is
> dirty or not while the bg scan always skips "active" files.

I wrote a quick and dirty fstest that writes 999 files between 128k and
256k in size, to simulate untarring onto a filesystem.  No fancy
preallocation, just buffered writes.  I patched my kernel to skip the
posteof block freeing in xfs_release, so the preallocations get freed by
inode inactivation.  Then the freespace histogram looks like:

+   from      to extents  blocks    pct
+      1       1      36      36   0.00
+      2       3      69     175   0.01
+      4       7     122     698   0.02
+      8      15     237    2691   0.08
+     16      31       1      16   0.00
+     32      63     500   27843   0.88
+ 524288  806272       4 3141225  99.01

Pretty gnarly. :)  By comparison, a stock upstream kernel:

+   from      to extents  blocks    pct
+ 524288  806272       4 3172579 100.00

That's 969 free extents vs. 4, on a fs with 999 new files... which is
> pretty bad.  Dave also suggested on IRC that maybe this should be a
little smarter -- possibly skipping the posteof removal only if the
filesystem has sunit/swidth set, or if the inode has extent size hints,
or whatever. :)

--D

> Brian
> 
> [1] Which isn't the case for a change to not bg trim files with extent
> size hints.
> 
> > I don't really care about that anymore because, well, reflink/dedupe
> > completely break any remaining assumption that du reported space
> > consumption is related to the file size (if sparse files wasn't
> > enough of a hint already)....
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
Brian Foster Feb. 12, 2019, 11:46 a.m. UTC | #8
On Mon, Feb 11, 2019 at 05:13:33PM -0800, Darrick J. Wong wrote:
> On Fri, Feb 08, 2019 at 07:34:33AM -0500, Brian Foster wrote:
> > On Fri, Feb 08, 2019 at 01:47:30PM +1100, Dave Chinner wrote:
> > > On Thu, Feb 07, 2019 at 10:52:43AM -0500, Brian Foster wrote:
> > > > On Thu, Feb 07, 2019 at 04:39:41PM +1100, Dave Chinner wrote:
> > > > > On Wed, Feb 06, 2019 at 09:21:14PM -0800, Darrick J. Wong wrote:
> > > > > > On Thu, Feb 07, 2019 at 04:08:10PM +1100, Dave Chinner wrote:
> > > ....
> > > 
> > 
> > Thanks.
> > 
> > ....
> 
> I wrote a quick and dirty fstest that writes 999 files between 128k and
> 256k in size, to simulate untarring onto a filesystem.  No fancy
> preallocation, just buffered writes.  I patched my kernel to skip the
> posteof block freeing in xfs_release, so the preallocations get freed by
> inode inactivation.  Then the freespace histogram looks like:
> 

You didn't mention whether you disabled background eofb trims. Are you
just rendering that irrelevant by disabling the release time trim and
doing a mount cycle?

> +   from      to extents  blocks    pct
> +      1       1      36      36   0.00
> +      2       3      69     175   0.01
> +      4       7     122     698   0.02
> +      8      15     237    2691   0.08
> +     16      31       1      16   0.00
> +     32      63     500   27843   0.88
> + 524288  806272       4 3141225  99.01
> 
> Pretty gnarly. :)  By comparison, a stock upstream kernel:
> 

Indeed, that's a pretty rapid degradation. Thanks for testing that.

> +   from      to extents  blocks    pct
> + 524288  806272       4 3172579 100.00
> 
> That's 969 free extents vs. 4, on a fs with 999 new files... which is
> > pretty bad.  Dave also suggested on IRC that maybe this should be a
> little smarter -- possibly skipping the posteof removal only if the
> filesystem has sunit/swidth set, or if the inode has extent size hints,
> or whatever. :)
> 

This test implies that there's a significant difference between eofb
trims prior to delalloc conversion vs. after, which I suspect is the
primary difference between doing so on close vs. some time later. Is
there any good way to confirm that with your test? If that is the case,
it makes me wonder whether we should think about more generalized logic
as opposed to a battery of whatever particular inode state checks that
we've determined in practice contribute to free space fragmentation.

For example, extent size hints just happen to skip delayed allocation.
I don't recall the exact history, but I don't think this was always the
case for extsz hints. So would the close time eofb trim be as
problematic as for extsz hint files if the behavior of the latter
changed back to using delayed allocation? I think a patch for that was
proposed fairly recently, but it depended on delalloc -> unwritten
functionality which still had unresolved issues (IIRC).

From another angle, would a system that held files open for a
significant amount of time relative to a one-time write such that close
consistently occurred after writeback (and thus delalloc conversion) be
susceptible to the same level of free space fragmentation as shown
above? (I'm not sure if/why anybody would actually do that... userspace
fs with an fd cache perhaps? It's somewhat beside the point,
anyway...).

More testing and thought is probably required. I _was_ wondering if we
should consider something like always waiting as long as possible to
eofb trim already converted post-eof blocks, but I'm not totally
convinced that actually has value. For files that are not going to see
any further appends, we may have already lost since the real post-eof
blocks will end up truncated just the same whether it happens sooner or
not until inode reclaim.

Hmm, maybe we actually need to think about how to be smarter about when
to introduce speculative preallocation as opposed to how/when to reclaim
it. We currently limit speculative prealloc to files of a minimum size
(64k IIRC). Just thinking out loud, but what if we restricted
preallocation to files that have been appended after at least one
writeback cycle, for example?

Brian

> --D
> 
> > Brian
> > 
> > [1] Which isn't the case for a change to not bg trim files with extent
> > size hints.
> > 
> > > I don't really care about that anymore because, well, reflink/dedupe
> > > completely break any remaining assumption that du reported space
> > > consumption is related to the file size (if sparse files wasn't
> > > enough of a hint already)....
> > > 
> > > Cheers,
> > > 
> > > Dave.
> > > -- 
> > > Dave Chinner
> > > david@fromorbit.com
Dave Chinner Feb. 12, 2019, 8:21 p.m. UTC | #9
On Tue, Feb 12, 2019 at 06:46:31AM -0500, Brian Foster wrote:
> On Mon, Feb 11, 2019 at 05:13:33PM -0800, Darrick J. Wong wrote:
> > On Fri, Feb 08, 2019 at 07:34:33AM -0500, Brian Foster wrote:
> > > On Fri, Feb 08, 2019 at 01:47:30PM +1100, Dave Chinner wrote:
> > > > On Thu, Feb 07, 2019 at 10:52:43AM -0500, Brian Foster wrote:
> > > > > On Thu, Feb 07, 2019 at 04:39:41PM +1100, Dave Chinner wrote:
> > > > > > On Wed, Feb 06, 2019 at 09:21:14PM -0800, Darrick J. Wong wrote:
> > > > > > > On Thu, Feb 07, 2019 at 04:08:10PM +1100, Dave Chinner wrote:
> > > > ....
> > 
> > I wrote a quick and dirty fstest that writes 999 files between 128k and
> > 256k in size, to simulate untarring onto a filesystem.  No fancy
> > preallocation, just buffered writes.  I patched my kernel to skip the
> > posteof block freeing in xfs_release, so the preallocations get freed by
> > inode inactivation.  Then the freespace histogram looks like:
> > 
> 
> You didn't mention whether you disabled background eofb trims. Are you
> just rendering that irrelevant by disabling the release time trim and
> doing a mount cycle?
> 
> > +   from      to extents  blocks    pct
> > +      1       1      36      36   0.00
> > +      2       3      69     175   0.01
> > +      4       7     122     698   0.02
> > +      8      15     237    2691   0.08
> > +     16      31       1      16   0.00
> > +     32      63     500   27843   0.88
> > + 524288  806272       4 3141225  99.01
> > 
> > Pretty gnarly. :)  By comparison, a stock upstream kernel:
> > 
> 
> Indeed, that's a pretty rapid degradation. Thanks for testing that.
> 
> > +   from      to extents  blocks    pct
> > + 524288  806272       4 3172579 100.00
> > 
> > That's 969 free extents vs. 4, on a fs with 999 new files... which is
> > pretty bad.  Dave also suggested on IRC that maybe this should be a
> > little smarter -- possibly skipping the posteof removal only if the
> > filesystem has sunit/swidth set, or if the inode has extent size hints,
> > or whatever. :)
> > 
> 
> This test implies that there's a significant difference between eofb
> trims prior to delalloc conversion vs. after, which I suspect is the
> primary difference between doing so on close vs. some time later.

Yes, it's the difference between trimming the excess off the
delalloc extent and trimming the excess off an allocated extent
after writeback. In the latter case, we end up fragmenting free space
because, while writeback is packing as tightly as it can, there is
unused space between the end of one file and the start of the
next that ends up as free space.

> Is
> there any good way to confirm that with your test? If that is the case,
> it makes me wonder whether we should think about more generalized logic
> as opposed to a battery of whatever particular inode state checks that
> we've determined in practice contribute to free space fragmentation.

Yeah, we talked about that on #xfs, and it seems to me that the best
heuristic we can come up with is "trim on first close; if there are
multiple closes, treat it as a repeated open/write/close workload,
apply the IDIRTY_RELEASE heuristic to it, and don't remove the
prealloc on closes after the first".

> For example, extent size hints just happen to skip delayed allocation.
> I don't recall the exact history, but I don't think this was always the
> case for extsz hints.

It wasn't, but extent size hints + delalloc never played nicely and
could corrupt or expose stale data and caused all sorts of problems
at ENOSPC because delalloc reservations are unaware of alignment
requirements for extent size hints. Hence to make extent size hints
work for buffered writes, I simply made them work the same way as
direct writes (i.e. immediate allocation w/ unwritten extents).

> So would the close time eofb trim be as
> problematic as for extsz hint files if the behavior of the latter
> changed back to using delayed allocation?

Yes, but if it's a write-once file that doesn't matter. If it's
write-many, then we'd retain the post-eof blocks...

> I think a patch for that was
> proposed fairly recently, but it depended on delalloc -> unwritten
> functionality which still had unresolved issues (IIRC).

*nod*

> From another angle, would a system that held files open for a
> significant amount of time relative to a one-time write such that close
> consistently occurred after writeback (and thus delalloc conversion) be
> susceptible to the same level of free space fragmentation as shown
> above?

If the file is held open for writing for a long time, we have to
assume that they are going to write again (and again) so we should
leave the EOF blocks there. If they are writing slower than the eofb
gc, then there's nothing more we can really do in that case...

> (I'm not sure if/why anybody would actually do that... userspace
> fs with an fd cache perhaps? It's somewhat beside the point,
> anyway...).

*nod*

> More testing and thought is probably required. I _was_ wondering if we
> should consider something like always waiting as long as possible to
> eofb trim already converted post-eof blocks, but I'm not totally
> convinced that actually has value. For files that are not going to see
> any further appends, we may have already lost since the real post-eof
> blocks will end up truncated just the same whether it happens sooner or
> not until inode reclaim.

If the writes are far enough apart, then we lose any IO optimisation
advantage of retaining post-eof blocks (induces seeks because the
location of new writes is fixed ahead of time). Then it just becomes
a fragmentation avoidance mechanism. If the writes are slow enough,
fragmentation really doesn't matter a whole lot - it's when writes
are frequent and we trash the post-eof blocks quickly that it
matters.

> Hmm, maybe we actually need to think about how to be smarter about when
> to introduce speculative preallocation as opposed to how/when to reclaim
> it. We currently limit speculative prealloc to files of a minimum size
> (64k IIRC). Just thinking out loud, but what if we restricted
> preallocation to files that have been appended after at least one
> writeback cycle, for example?

Speculative delalloc for write-once large files also has a massive
impact on things like allocation overhead - we can write gigabytes
into the page cache before writeback begins. If we take away the
speculative delalloc for these first-write files, then we are
essentially doing an extent manipulation (extending the delalloc
extent) on every write() call we make.

Right now we only do that extent btree work when we hit the end of
the current speculative delalloc extent, so the normal write case is
just extending the in-memory EOF location rather than running the
entirety of xfs_bmapi_reserve_delalloc() and doing space accounting,
etc.

/me points at his 2006 OLS paper about scaling write performance
as an example of just how important keeping delalloc overhead down
is for high throughput write performance:

https://www.kernel.org/doc/ols/2006/ols2006v1-pages-177-192.pdf

IOWs, speculative prealloc beyond EOF is not just about preventing
fragmentation - it also helps minimise the per-write CPU overhead of
delalloc space accounting (i.e. allows faster write rates into
cache). So, for anything more than really small files, we want
to be doing speculative delalloc from the first time the file is
written to.

Cheers,

Dave.
Brian Foster Feb. 13, 2019, 1:50 p.m. UTC | #10
On Wed, Feb 13, 2019 at 07:21:51AM +1100, Dave Chinner wrote:
> On Tue, Feb 12, 2019 at 06:46:31AM -0500, Brian Foster wrote:
> > On Mon, Feb 11, 2019 at 05:13:33PM -0800, Darrick J. Wong wrote:
> > > On Fri, Feb 08, 2019 at 07:34:33AM -0500, Brian Foster wrote:
> > > > On Fri, Feb 08, 2019 at 01:47:30PM +1100, Dave Chinner wrote:
> > > > > On Thu, Feb 07, 2019 at 10:52:43AM -0500, Brian Foster wrote:
> > > > > > On Thu, Feb 07, 2019 at 04:39:41PM +1100, Dave Chinner wrote:
> > > > > > > On Wed, Feb 06, 2019 at 09:21:14PM -0800, Darrick J. Wong wrote:
> > > > > > > > On Thu, Feb 07, 2019 at 04:08:10PM +1100, Dave Chinner wrote:
> > > > > > > > > Hi folks,
> > > > > > > > > 
> > > > > > > > > I've just finished analysing an IO trace from a application
> > > > > > > > > generating an extreme filesystem fragmentation problem that started
> > > > > > > > > with extent size hints and ended with spurious ENOSPC reports due to
> > > > > > > > > massively fragmented files and free space. While the ENOSPC issue
> > > > > > > > > looks to have previously been solved, I still wanted to understand
> > > > > > > > > how the application had so comprehensively defeated extent size
> > > > > > > > > hints as a method of avoiding file fragmentation.
> > > > > ....
> > > > > > > FWIW, I think the scope of the problem is quite widespread -
> > > > > > > anything that does open/something/close repeatedly on a file that is
> > > > > > > being written to with O_DSYNC or O_DIRECT appending writes will kill
> > > > > > > the post-eof extent size hint allocated space. That's why I suspect
> > > > > > > we need to think about not trimming by default and trying to
> > > > > > > enumerating only the cases that need to trim eof blocks.
> > > > > > > 
> > > > > > 
> > > > > > To further this point.. I think the eofblocks scanning stuff came long
> > > > > > after the speculative preallocation code and associated release time
> > > > > > post-eof truncate.
> > > > > 
> > > > > Yes, I cribed a bit of the history of the xfs_release() behaviour
> > > > > on #xfs yesterday afternoon:
> > > > > 
> > > > >  <djwong>	dchinner: feel free to ignore this until tomorrow if you want, but /me wonders why we'd want to free the eofblocks at close time at all, instead of waiting for inactivation/enospc/background reaper to do it?
> > > > >  <dchinner>	historic. People doing operations then complaining du didn't match ls
> > > > >  <dchinner>	stuff like that
> > > > >  <dchinner>	There used to be a open file cache in XFS - we'd know exactly when the last reference went away and trim it then
> > > > >  <dchinner>	but that went away when NFS and the dcache got smarter about file handle conversion
> > > > >  <dchinner>	(i.e. that's how we used to make nfs not suck)
> > > > >  <dchinner>	that's when we started doing work in ->release
> > > > >  <dchinner>	it was close enough to "last close" for most workloads it made no difference.
> > > > >  <dchinner>	Except for concurrent NFS writes into the same directory
> > > > >  <dchinner>	and now there's another pathological application that triggers problems
> > > > >  <dchinner>	The NFS exception was prior to having the background reaper
> > > > >  <dchinner>	as these things go the background reaper is relatively recent functionality
> > > > >  <dchinner>	so perhaps we should just leave it to "inode cache expiry or background reaping" and not do it on close at all
> > > > > 
> > > > 
> > > > Thanks.
> > > > 
> > > > > > I think the background scanning was initially an
> > > > > > enhancement to deal with things like the dirty release optimization
> > > > > > leaving these blocks around longer and being able to free up this
> > > > > > accumulated space when we're at -ENOSPC conditions.
> > > > > 
> > > > > Yes, amongst other things like slow writes keeping the file open
> > > > > forever.....
> > > > > 
> > > > > > Now that we have the
> > > > > > scanning mechanism in place (and a 5 minute default background scan,
> > > > > > which really isn't all that long), it might be reasonable to just drop
> > > > > > the release time truncate completely and only trim post-eof blocks via
> > > > > > the bg scan or reclaim paths.
> > > > > 
> > > > > Yeah, that's kinda the question I'm asking here. What's the likely
> > > > > impact of not trimming EOF blocks at least on close apart from
> > > > > people complaining about df/ls not matching du?
> > > > > 
> > > > 
> > > > Ok. ISTM it's just a continuation of the same "might confuse some users"
> > > > scenario that pops up occasionally. It also seems that kind of thing has
> > > > died down as either most people don't really know or care about the
> > > > transient state or are just more familiar with it at this point. IME,
> > > > complex applications that depend on block ownership stats (userspace
> > > > filesystems for example) already have to account for speculative
> > > > preallocation with XFS, so tweaking the semantics of the optimization
> > > > shouldn't really have much of an impact that I can tell so long as the
> > > > broader/long-term behavior doesn't change[1].
> > > > 
> > > > I suppose there are all kinds of other applications that are technically
> > > > affected by dropping the release time trim (simple file copies, archive
> > > > extraction, etc.), but it's not clear to me that matters so long as we
> > > > have effective bg and -ENOSPC scans. The only thing I can think of so
> > > > far is whether we should consider changes to the bg scan heuristics to
> > > > accommodate scenarios currently covered by the release time trim. For
> > > > example, the release time scan doesn't consider whether the file is
> > > > dirty or not while the bg scan always skips "active" files.
> > > 
> > > I wrote a quick and dirty fstest that writes 999 files between 128k and
> > > 256k in size, to simulate untarring onto a filesystem.  No fancy
> > > preallocation, just buffered writes.  I patched my kernel to skip the
> > > posteof block freeing in xfs_release, so the preallocations get freed by
> > > inode inactivation.  Then the freespace histogram looks like:
> > > 
> > 
> > You didn't mention whether you disabled background eofb trims. Are you
> > just rendering that irrelevant by disabling the release time trim and
> > doing a mount cycle?
> > 
> > > +   from      to extents  blocks    pct
> > > +      1       1      36      36   0.00
> > > +      2       3      69     175   0.01
> > > +      4       7     122     698   0.02
> > > +      8      15     237    2691   0.08
> > > +     16      31       1      16   0.00
> > > +     32      63     500   27843   0.88
> > > + 524288  806272       4 3141225  99.01
> > > 
> > > Pretty gnarly. :)  By comparison, a stock upstream kernel:
> > > 
> > 
> > Indeed, that's a pretty rapid degradation. Thanks for testing that.
> > 
> > > +   from      to extents  blocks    pct
> > > + 524288  806272       4 3172579 100.00
> > > 
> > > That's 969 free extents vs. 4, on a fs with 999 new files... which is
> > > pretty bad.  Dave also suggested on IRC that maybe this should be a
> > > little smarter -- possibly skipping the posteof removal only if the
> > > filesystem has sunit/swidth set, or if the inode has extent size hints,
> > > or whatever. :)
> > > 
> > 
> > This test implies that there's a significant difference between eofb
> > trims prior to delalloc conversion vs. after, which I suspect is the
> > primary difference between doing so on close vs. some time later.
> 
> Yes, it's the difference between trimming the excess off the
> delalloc extent and trimming the excess off an allocated extent
> after writeback. In the later case, we end up fragmenting free space
> because, while writeback is packing as tightly as it can, there is
> unused space between the end of the one file and the start of the
> next that ends up as free space.
> 

Makes sense, kind of what I expected. I think it does raise the question
of the value of small-ish speculative preallocations.
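To make the difference in Darrick's histograms concrete, here is a toy linear-allocator model (mine, not XFS's real allocator; file and prealloc sizes are invented, and freed holes are assumed never reused):

```python
def count_free_extents(trim_before_writeback, nfiles=999,
                       data_blocks=32, prealloc_blocks=32,
                       disk_blocks=1_000_000):
    """Toy model: files are packed one after another on disk.

    trim_before_writeback=True models trimming delalloc at close, so
    only real data ever reaches the allocator.  False models trimming
    after the post-EOF prealloc has been physically allocated, which
    leaves a small free hole between each pair of files.
    """
    cursor = 0   # next free block in a simple linear allocator
    holes = 0    # isolated free extents left behind by late trims
    for _ in range(nfiles):
        if trim_before_writeback:
            cursor += data_blocks
        else:
            cursor += data_blocks + prealloc_blocks
            holes += 1  # prealloc trimmed after allocation -> free hole
    tail = disk_blocks - cursor
    return holes + (1 if tail > 0 else 0)
```

Trimming before writeback leaves a single free extent; trimming after leaves one small free extent per file plus the tail, which is the same shape as the 969-vs-4 free extent numbers above.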

> > Is
> > there any good way to confirm that with your test? If that is the case,
> > it makes me wonder whether we should think about more generalized logic
> > as opposed to a battery of whatever particular inode state checks that
> > we've determined in practice contribute to free space fragmentation.
> 
> Yeah, we talked about that on #xfs, and it seems to me that the best
> heuristic we can come up with is "trim on first close, if there are
> multiple closes treat it as a repeated open/write/close workload and
> apply the IDIRTY_RELEASE heuristic to it and don't remove the
> prealloc on closes after the first".
> 

Darrick mentioned this yesterday on irc. This sounds like a reasonable
variation of the change to me. It filters out the write once use case
that we clearly can't unconditionally defer to background trim based on
the test results above.
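A minimal sketch of that one-time-trim heuristic (a userspace toy model; the flag and method names are stand-ins, not the actual xfs_release() code):

```python
class InodeReleaseState:
    """Toy model of a 'trim post-EOF blocks only on first close' rule."""

    def __init__(self):
        self.closed_before = False  # stand-in for a per-inode state flag

    def should_trim_on_release(self):
        if not self.closed_before:
            # First close: assume write-once and reclaim the speculative
            # preallocation while it is (hopefully) still delalloc.
            self.closed_before = True
            return True
        # A later close implies a repeated open/write/close pattern;
        # keep the prealloc and leave cleanup to the background scan.
        return False
```

Under this rule, an open/read/close polling loop like the log reader in the original report would only cost the file its prealloc once, on the first close.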

> > For example, extent size hints just happen to skip delayed allocation.
> > I don't recall the exact history, but I don't think this was always the
> > case for extsz hints.
> 
> It wasn't, but extent size hints + delalloc never played nicely and
> could corrupt or expose stale data and caused all sorts of problems
> at ENOSPC because delalloc reservations are unaware of alignment
> requirements for extent size hints. Hence to make extent size hints
> work for buffered writes, I simply made them work the same way as
> direct writes (i.e. immediate allocation w/ unwritten extents).
> 
> > So would the close time eofb trim be as
> > problematic as for extsz hint files if the behavior of the latter
> > changed back to using delayed allocation?
> 
> Yes, but if it's a write-once file that doesn't matter. If it's
> write-many, then we'd retain the post-eof blocks...
> 
> > I think a patch for that was
> > proposed fairly recently, but it depended on delalloc -> unwritten
> > functionality which still had unresolved issues (IIRC).
> 
> *nod*
> 
> > From another angle, would a system that held files open for a
> > significant amount of time relative to a one-time write such that close
> > consistently occurred after writeback (and thus delalloc conversion) be
> > susceptible to the same level of free space fragmentation as shown
> > above?
> 
> If the file is held open for writing for a long time, we have to
> assume that they are going to write again (and again) so we should
> leave the EOF blocks there. If they are writing slower than the eofb
> gc, then there's nothing more we can really do in that case...
> 

I'm not necessarily sure that this condition is always a matter of
writing too slow. It very well may be true, but I'm curious if there are
parallel copy scenarios (perhaps under particular cpu/RAM configs)
where we could end up doing a large number of one time file writes and
not doing the release time trim until an underlying extent (with
post-eof blocks) has been converted in more cases than not.

Perhaps this is more of a question of writeback behavior than anything
in XFS, but if that can occur with files on the smaller side it's a
clear path to free space fragmentation. I think it's at least worth
considering whether we can improve things, but of course this requires
significantly more thought and analysis to determine whether that's an
actual problem and if the cure is worse than the disease. :P

> > (I'm not sure if/why anybody would actually do that.. userspace
> > fs with an fd cache perhaps? It's somewhat besides the point,
> > anyways...)).
> 
> *nod*
> 
> > More testing and thought is probably required. I _was_ wondering if we
> > should consider something like always waiting as long as possible to
> > eofb trim already converted post-eof blocks, but I'm not totally
> > convinced that actually has value. For files that are not going to see
> > any further appends, we may have already lost since the real post-eof
> > blocks will end up truncated just the same whether it happens sooner or
> > not until inode reclaim.
> 
> If the writes are far enough apart, then we lose any IO optimisation
> advantage of retaining post-eof blocks (induces seeks because
> location of new writes is fixed ahead of time). Then it just becomes
> a fragmentation avoidance mechanism. If the writes are slow enough,
> fragmentation really doesn't matter a whole lot - it's when writes
> are frequent and we trash the post-eof blocks quickly that it
> matters.
> 

It sounds like what you're saying is that it doesn't really matter
either way at this point. There's no perf advantage to keeping the eof
blocks in this scenario, but there's also no real harm in deferring the
eofb trim of physical post-eof blocks because any future free space
fragmentation damage has already been done (assuming no more writes come
in).

The thought above was tip-toeing around the idea of (in addition to the
one-time trim heuristic you mentioned above) never doing a release time
trim of non-delalloc post-eof blocks.

> > Hmm, maybe we actually need to think about how to be smarter about when
> > to introduce speculative preallocation as opposed to how/when to reclaim
> > it. We currently limit speculative prealloc to files of a minimum size
> > (64k IIRC). Just thinking out loud, but what if we restricted
> > preallocation to files that have been appended after at least one
> > writeback cycle, for example?
> 
> Speculative delalloc for write once large files also has a massive
> impact on things like allocation overhead - we can write gigabytes
> into the page cache before writeback begins. If we take away the
> speculative delalloc for these first-write files, then we are
> essentially doing an extent manipulation (extending delalloc extent)
> on every write() call we make.
> 

I was kind of banking on the ability to write large amounts of data to
cache before a particular file sees writeback activity. That means we'll
allocate as large as possible extents for actual file data and only
start doing preallocations on the larger side as well.

I hadn't considered the angle of preallocation reducing (del)allocation
overhead, however..
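That amortisation effect is easy to model (a toy count, not the real XFS prealloc growth policy, which is more involved):

```python
def reservation_ops(nwrites, prealloc=True):
    """Count how often a toy delalloc path must do reservation work.

    Without speculative prealloc, every append extends the delalloc
    extent; with a doubling prealloc beyond EOF, most writes only bump
    the in-memory EOF and reservation work becomes logarithmic.
    """
    size = reserved = ops = 0
    grow = 1  # blocks of post-EOF prealloc to add, doubled each time
    for _ in range(nwrites):
        size += 1  # one block appended per write, for simplicity
        if size > reserved:
            ops += 1  # space accounting + extent-tree manipulation
            if prealloc:
                grow *= 2
                reserved = size + grow
            else:
                reserved = size  # exact reservation on every write
    return ops
```

For 1024 single-block appends the exact-reservation path does reservation work on every write, while the doubling-prealloc path does it only a handful of times, which is the per-write CPU saving being described.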

> Right now we only do that extent btree work when we hit the end of
> the current speculative delalloc extent, so the normal write case is
> just extending the in-memory EOF location rather than running the
> entirety of xfs_bmapi_reserve_delalloc() and doing space accounting,
> etc.
> 
> /me points at his 2006 OLS paper about scaling write performance
> as an example of just how important keeping delalloc overhead down
> is for high throughput write performance:
> 
> https://www.kernel.org/doc/ols/2006/ols2006v1-pages-177-192.pdf
> 

Interesting work, thanks.

> IOWs, speculative prealloc beyond EOF is not just about preventing
> fragmentation - it also helps minimise the per-write CPU overhead of
> delalloc space accounting. (i.e. allows faster write rates into
> cache). IOWs, for anything more than really small files, we want
> to be doing speculative delalloc on the first time the file is
> written to.
> 

Ok, this paper refers to CPU overhead as it contributes to lack of
scalability. It certainly makes sense that a delayed allocation buffered
write does more work than a non-alloc write, but this document isn't
necessarily claiming that as evidence of measurably faster or slower
single threaded buffered writes. Rather, that overhead becomes
measurable with enough parallel buffered writes contending on a global
allocation critical section (section 5.1 "Spinlocks in Hot Paths").

Subsequently, section 6.1.2 addresses that particular issue via the
introduction of per-cpu accounting. It looks like there is further
discussion around the general link between efficiency and performance,
which makes sense, but I don't think that draws a definite conclusion
that speculative preallocation needs to be introduced immediately on
sequential buffered writes. What I think it suggests is that we need to
consider the potential scalability impact of any prospective change in
speculative preallocation behavior (along with the other tradeoffs
associated with preallocation) because less aggressive preallocation
means more buffered write overhead.

BTW for historical context.. was speculative preallocation a thing when
this paper was written? The doc suggests per-page block allocation needs
to be reduced on XFS, but also indicates that XFS had minimal
fragmentation compared to other fs'. AFAICT, section 5.2 attributes XFS
fragmentation avoidance to delayed allocation (not necessarily
preallocation).

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Feb. 13, 2019, 10:27 p.m. UTC | #11
On Wed, Feb 13, 2019 at 08:50:22AM -0500, Brian Foster wrote:
> On Wed, Feb 13, 2019 at 07:21:51AM +1100, Dave Chinner wrote:
> > On Tue, Feb 12, 2019 at 06:46:31AM -0500, Brian Foster wrote:
> > > On Mon, Feb 11, 2019 at 05:13:33PM -0800, Darrick J. Wong wrote:
> > > > On Fri, Feb 08, 2019 at 07:34:33AM -0500, Brian Foster wrote:
> > > For example, extent size hints just happen to skip delayed allocation.
> > > I don't recall the exact history, but I don't think this was always the
> > > case for extsz hints.
> > 
> > It wasn't, but extent size hints + delalloc never played nicely and
> > could corrupt or expose stale data and caused all sorts of problems
> > at ENOSPC because delalloc reservations are unaware of alignment
> > requirements for extent size hints. Hence to make extent size hints
> > work for buffered writes, I simply made them work the same way as
> > direct writes (i.e. immediate allocation w/ unwritten extents).
> > 
> > > So would the close time eofb trim be as
> > > problematic as for extsz hint files if the behavior of the latter
> > > changed back to using delayed allocation?
> > 
> > Yes, but if it's a write-once file that doesn't matter. If it's
> > write-many, then we'd retain the post-eof blocks...
> > 
> > > I think a patch for that was
> > > proposed fairly recently, but it depended on delalloc -> unwritten
> > > functionality which still had unresolved issues (IIRC).
> > 
> > *nod*
> > 
> > > From another angle, would a system that held files open for a
> > > significant amount of time relative to a one-time write such that close
> > > consistently occurred after writeback (and thus delalloc conversion) be
> > > susceptible to the same level of free space fragmentation as shown
> > > above?
> > 
> > If the file is held open for writing for a long time, we have to
> > assume that they are going to write again (and again) so we should
> > leave the EOF blocks there. If they are writing slower than the eofb
> > gc, then there's nothing more we can really do in that case...
> > 
> 
> I'm not necessarily sure that this condition is always a matter of
> writing too slow. It very well may be true, but I'm curious if there are
> parallel copy scenarios (perhaps under particular cpu/RAM configs)
> where we could end up doing a large number of one time file writes and
> not doing the release time trim until an underlying extent (with
> post-eof blocks) has been converted in more cases than not.

I'm not sure I follow what sort of workload and situation you are
describing here. Are you talking about the effect of an EOFB gc pass
during ongoing writes?

> Perhaps this is more of a question of writeback behavior than anything
> in XFS, but if that can occur with files on the smaller side it's a
> clear path to free space fragmentation. I think it's at least worth
> considering whether we can improve things, but of course this requires
> significantly more thought and analysis to determine whether that's an
> actual problem and if the cure is worse than the disease. :P
> 
> > > (I'm not sure if/why anybody would actually do that.. userspace
> > > fs with an fd cache perhaps? It's somewhat besides the point,
> > > anyways...)).
> > 
> > *nod*
> > 
> > > More testing and thought is probably required. I _was_ wondering if we
> > > should consider something like always waiting as long as possible to
> > > eofb trim already converted post-eof blocks, but I'm not totally
> > > convinced that actually has value. For files that are not going to see
> > > any further appends, we may have already lost since the real post-eof
> > > blocks will end up truncated just the same whether it happens sooner or
> > > not until inode reclaim.
> > 
> > If the writes are far enough apart, then we lose any IO optimisation
> > advantage of retaining post-eof blocks (induces seeks because
> > location of new writes is fixed ahead of time). Then it just becomes
> > a fragmentation avoidance mechanism. If the writes are slow enough,
> > fragmentation really doesn't matter a whole lot - it's when writes
> > are frequent and we trash the post-eof blocks quickly that it
> > matters.
> > 
> 
> It sounds like what you're saying is that it doesn't really matter
> either way at this point. There's no perf advantage to keeping the eof
> blocks in this scenario, but there's also no real harm in deferring the
> eofb trim of physical post-eof blocks because any future free space
> fragmentation damage has already been done (assuming no more writes come
> in).

Essentially, yes.

> The thought above was tip-toeing around the idea of (in addition to the
> one-time trim heuristic you mentioned above) never doing a release time
> trim of non-delalloc post-eof blocks.

Except we want to trim blocks in the cases where it's a write
once file that has been fsync()d or written by O_DIRECT w/ really
large extent size hints....

It seems to me that once we start making exceptions to the general
rule, we end up in a world of corner cases and, most likely,
unreliable heuristics.

> > IOWs, speculative prealloc beyond EOF is not just about preventing
> > fragmentation - it also helps minimise the per-write CPU overhead of
> > delalloc space accounting. (i.e. allows faster write rates into
> > cache). IOWs, for anything more than really small files, we want
> > to be doing speculative delalloc on the first time the file is
> > written to.
> > 
> 
> Ok, this paper refers to CPU overhead as it contributes to lack of
> scalability.

Well, those were the experiments being performed. I'm using
it as an example of how per-write overhead is actually important to
throughput. Ignore the "global lock caused overall throughput
issues" because we don't have that problem any more, and instead
look at it as a demonstration of "anything that slows down a write()
reduces per-thread throughput".

> It certainly makes sense that a delayed allocation buffered
> write does more work than a non-alloc write, but this document isn't
> necessarily claiming that as evidence of measurably faster or slower
> single threaded buffered writes. Rather, that overhead becomes
> measurable with enough parallel buffered writes contending on a global
> allocation critical section (section 5.1 "Spinlocks in Hot Paths").

It's just an example of how increasing per-write overhead impacts
performance. The same can be said for all the memory reclaim
mods - they also reduced per-write overhead because memory
allocation for the page cache was much, much faster and required
much less CPU, so each write spent less time allocating pages to
cache the data to be written. Hence, faster writes....

> Subsequently, section 6.1.2 addresses that particular issue via the
> introduction of per-cpu accounting. It looks like there is further
> discussion around the general link between efficiency and performance,
> which makes sense, but I don't think that draws a definite conclusion
> that speculative preallocation needs to be introduced immediately on
> sequential buffered writes.

Sure, because the post-EOF preallocation that was already happening
wasn't the problem, and we couldn't easily address the per-page
block mapping overhead with generic code changes that would be
useful in any circumstance (Linux community at the time was hostile
towards bringing anything useful to XFS into the core kernel code).

And this was only a small machine (24 CPUs) so we had to address the
locking problem because increasing speculative prealloc wouldn't
help solve the global lock contention problems on the 2048p machines
that SGI was shipping at the time....

> What I think it suggests is that we need to
> consider the potential scalability impact of any prospective change in
> speculative preallocation behavior (along with the other tradeoffs
> associated with preallocation) because less aggressive preallocation
> means more buffered write overhead.

It's not a scalability problem now because of the fixes made back
then - it's now just a per-write thread latency and throughput
issue.

What is not obvious from the paper is that I designed the XFS
benchmarks so that it wrote each file into a separate directories to
place them into different AGs so that they didn't fragment due to
interleaving writes. i.e. so we could directly control the number of
sequential write streams going to the disks.

We had to do that to minimise the seek overhead of the disks so
they'd run at full bandwidth rather than being seek bound - we had
to keep the 256 disks at >97% bandwidth utilisation to get to
10GB/S, so even one extra seek per disk per second and we wouldn't
get there.

IOWs, we'd carefully taken anything to do with fragmentation out of
the picture entirely, both for direct and buffered writes. We were
just focussed on write efficiency and maximising total throughput
and that largely made post-EOF preallocation irrelevant to us.

> BTW for historical context.. was speculative preallocation a thing when
> this paper was written?

Yes. Speculative prealloc goes way back into the 90s from Irix.  It
was first made configurable in XFS via the biosize mount option
added with v3 superblocks in 1997, but the initial linux port only
allowed up to 64k.

In 2005, the linux mount option allowed biosize to be extended to
1GB, which made sense because >4GB allocation groups (mkfs enabled
them late 2003) were now starting to be widely used and so users
were reporting new large AG fragmentation issues that had never been
seen before. i.e.  it was now practical to have contiguous multi-GB
extents in files and the delalloc code was struggling to create
them, so having EOF-prealloc be able to make use of that capability
was needed....

And then auto-tuning made sense because more and more people were
having to use the mount option in more general workloads to avoid
fragmentation.

And ever since then it's been about tweaking the algorithms to stop
userspace from defeating it and re-introducing fragmentation
vectors...

> The doc suggests per-page block allocation needs
> to be reduced on XFS,

Yup. It still stands - that was the specific design premise that
led to the iomap infrastructure we now have, i.e. do block mapping
once per write() call, not once per page per write call. Early
prototypes got us a 10-20% increase in single threaded buffered write
throughput on slow CPUs, and with the current code some high
throughput buffered write cases either went up by 30-40% or CPU
usage went down by that amount.
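The per-page vs per-write difference is simple arithmetic (toy numbers, assuming 4 KiB pages; not the actual iomap code path):

```python
import math

def mapping_lookups(write_bytes, page_size=4096, per_write=False):
    """Block-mapping calls needed to service one buffered write().

    The legacy path mapped blocks once per dirtied page; the iomap
    design does a single mapping per write() call.
    """
    if per_write:
        return 1
    return math.ceil(write_bytes / page_size)
```

A 1 MiB buffered write drops from 256 mapping calls to 1, which is where the per-write CPU savings come from.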

> but also indicates that XFS had minimal
> fragmentation compared to other fs'. AFAICT, section 5.2 attributes XFS
> fragmentation avoidance to delayed allocation (not necessarily
> preallocation).

Right, because none of the other filesystems had delayed allocation
and they were interleaving concurent allocations at write() time.
i.e. they serialised write() for long periods of time while they did
global fs allocation, whilst XFS had a low overhead delalloc path
with a tiny critical global section. i.e. delalloc allowed us to
defer allocation to writeback where it isn't so throughput critical
and better allocation decisions can be made.

IOWs, delalloc was the primary difference in both the fragmentation
and scalability differences reported in the paper. That XFS did
small amounts of in-memory-only prealloc beyond EOF during delalloc
was irrelevant - it's just part of the delalloc mechanism in this
context....

Cheers,

Dave.
Brian Foster Feb. 14, 2019, 1 p.m. UTC | #12
On Thu, Feb 14, 2019 at 09:27:26AM +1100, Dave Chinner wrote:
> On Wed, Feb 13, 2019 at 08:50:22AM -0500, Brian Foster wrote:
> > On Wed, Feb 13, 2019 at 07:21:51AM +1100, Dave Chinner wrote:
> > > On Tue, Feb 12, 2019 at 06:46:31AM -0500, Brian Foster wrote:
> > > > On Mon, Feb 11, 2019 at 05:13:33PM -0800, Darrick J. Wong wrote:
> > > > > On Fri, Feb 08, 2019 at 07:34:33AM -0500, Brian Foster wrote:
> > > > For example, extent size hints just happen to skip delayed allocation.
> > > > I don't recall the exact history, but I don't think this was always the
> > > > case for extsz hints.
> > > 
> > > It wasn't, but extent size hints + delalloc never played nicely and
> > > could corrupt or expose stale data and caused all sorts of problems
> > > at ENOSPC because delalloc reservations are unaware of alignment
> > > requirements for extent size hints. Hence to make extent size hints
> > > work for buffered writes, I simply made them work the same way as
> > > direct writes (i.e. immediate allocation w/ unwritten extents).
> > > 
> > > > So would the close time eofb trim be as
> > > > problematic as for extsz hint files if the behavior of the latter
> > > > changed back to using delayed allocation?
> > > 
> > > Yes, but if it's a write-once file that doesn't matter. If it's
> > > write-many, then we'd retain the post-eof blocks...
> > > 
> > > > I think a patch for that was
> > > > proposed fairly recently, but it depended on delalloc -> unwritten
> > > > functionality which still had unresolved issues (IIRC).
> > > 
> > > *nod*
> > > 
> > > > From another angle, would a system that held files open for a
> > > > significant amount of time relative to a one-time write such that close
> > > > consistently occurred after writeback (and thus delalloc conversion) be
> > > > susceptible to the same level of free space fragmentation as shown
> > > > above?
> > > 
> > > If the file is held open for writing for a long time, we have to
> > > assume that they are going to write again (and again) so we should
> > > leave the EOF blocks there. If they are writing slower than the eofb
> > > gc, then there's nothing more we can really do in that case...
> > > 
> > 
> > I'm not necessarily sure that this condition is always a matter of
> > writing too slow. It very well may be true, but I'm curious if there are
> > parallel copy scenarios (perhaps under particular cpu/RAM configs)
> > where we could end up doing a large number of one time file writes and
> > not doing the release time trim until an underlying extent (with
> > post-eof blocks) has been converted in more cases than not.
> 
> I'm not sure I follow what sort of workload and situation you are
> describing here. Are you talking about the effect of an EOFB gc pass
> during ongoing writes?
> 

I'm not sure if it's an actual reproducible situation.. I'm just
wondering out loud if there are normal workloads that might still defeat
a one-time trim at release time. For example, copy enough files in
parallel such that writeback touches most of them before the copies
complete and we end up trimming physical blocks rather than delalloc
blocks.

This is not so much of a problem if those files are large, I think,
because then the preallocs and the resulting trimmed free space is on
the larger side as well. If we're copying a bunch of little files with
small preallocs, however, then we put ourselves in the pathological
situation shown in Darrick's test.

I was originally thinking about whether this could happen or not on a
highly parallel small file copy workload, but having thought about it a
bit more I think there is a more simple example. What about an untar
like workload that creates small files and calls fsync() before each fd
is released? Wouldn't that still defeat a one-time release heuristic and
produce the same layout issues as shown above? We'd prealloc,
writeback/convert then trim small/spurious fragments of post-eof space
back to the allocator.
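To spell out why fsync-before-close defeats a first-close rule, here is a toy event classifier (my model, not kernel code): the damaging case is a first close that lands after delalloc conversion.

```python
def first_close_trims_physical(events):
    """Return True if the first close would trim *converted* (physical)
    post-EOF blocks under a trim-on-first-close heuristic -- the case
    that fragments free space.  Toy model of per-file event ordering.
    """
    converted = False
    for ev in events:
        if ev in ("fsync", "writeback"):
            converted = True  # delalloc beyond EOF becomes real blocks
        elif ev == "close":
            return converted  # first close always trims; was it physical?
    return False

# untar with fsync: the trim frees physical post-EOF blocks -> holes
# plain copy: the trim frees delalloc before writeback -> harmless
```

So an untar that fsync()s each file before closing it still ends up returning small fragments of converted post-EOF space to the allocator, same as Darrick's test.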

> > Perhaps this is more of a question of writeback behavior than anything
> > in XFS, but if that can occur with files on the smaller side it's a
> > clear path to free space fragmentation. I think it's at least worth
> > considering whether we can improve things, but of course this requires
> > significantly more thought and analysis to determine whether that's an
> > actual problem and if the cure is worse than the disease. :P
> > 
> > > > (I'm not sure if/why anybody would actually do that.. userspace
> > > > fs with an fd cache perhaps? It's somewhat besides the point,
> > > > anyways...)).
> > > 
> > > *nod*
> > > 
> > > > More testing and thought is probably required. I _was_ wondering if we
> > > > should consider something like always waiting as long as possible to
> > > > eofb trim already converted post-eof blocks, but I'm not totally
> > > > convinced that actually has value. For files that are not going to see
> > > > any further appends, we may have already lost since the real post-eof
> > > > blocks will end up truncated just the same whether it happens sooner or
> > > > not until inode reclaim.
> > > 
> > > If the writes are far enough apart, then we lose any IO optimisation
> > > advantage of retaining post-eof blocks (induces seeks because
> > > location of new writes is fixed ahead of time). Then it just becomes
> > > a fragmentation avoidance mechanism. If the writes are slow enough,
> > > fragmentation really doesn't matter a whole lot - it's when writes
> > > are frequent and we trash the post-eof blocks quickly that it
> > > matters.
> > > 
> > 
> > It sounds like what you're saying is that it doesn't really matter
> > either way at this point. There's no perf advantage to keeping the eof
> > blocks in this scenario, but there's also no real harm in deferring the
> > eofb trim of physical post-eof blocks because any future free space
> > fragmentation damage has already been done (assuming no more writes come
> > in).
> 
> Essentially, yes.
> 
> > The thought above was tip-toeing around the idea of (in addition to the
> > one-time trim heuristic you mentioned above) never doing a release time
> > trim of non-delalloc post-eof blocks.
> 
> Except we want to trim blocks in the cases where it's a write
> once file that has been fsync()d or written by O_DIRECT w/ really
> large extent size hints....
> 

We still would trim it. The one time write case is essentially
unaffected because the only advantage of that heuristic is to trim eof
blocks before they are converted. If the eof blocks are real, the
release time heuristic has already failed (i.e., it hasn't provided any
benefit that background trim doesn't already provide).

IOW, what we really want to avoid is trimming (small batches of) unused
physical eof blocks.

> It seems to me that once we start making exceptions to the general
> rule, we end up in a world of corner cases and, most likely,
> unreliable heuristics.
> 

I agree in principle, but my purpose for this discussion is to try and
think about whether we can come up with a better set of rules/behaviors.

> > > IOWs, speculative prealloc beyond EOF is not just about preventing
> > > fragmentation - it also helps minimise the per-write CPU overhead of
> > > delalloc space accounting. (i.e. allows faster write rates into
> > > cache). IOWs, for anything more than really small files, we want
> > > to be doing speculative delalloc on the first time the file is
> > > written to.
> > > 
> > 
> > Ok, this paper refers to CPU overhead as it contributes to lack of
> > scalability.
> 
> Well, those were the experiments that were being performed. I'm using
> it as an example of how per-write overhead is actually important to
> throughput. Ignore the "global lock caused overall throughput
> issues" because we don't have that problem any more, and instead
> look at it as a demonstration of "anything that slows down a write()
> reduces per-thread throughput".
> 

Makes sense, and that's how I took it after reading through the paper.
My point was just that I think this is more of a tradeoff and caveat to
consider than something that outright rules out doing less aggressive
preallocation in certain cases.

I ran a few tests yesterday out of curiosity and was able to measure (a
small) difference in single-threaded buffered writes to cache with and
without preallocation. What I found a bit interesting was that my
original attempt to test this actually showed _faster_ throughput
without preallocation because the mechanism I happened to use to bypass
preallocation was an up front truncate. I eventually realized that this
had other effects (i.e., one copy doing size updates vs. the other not
doing so) and just compared a fixed, full size (1G) preallocation with a
fixed 4k preallocation to reproduce the boost provided by the former.
The point here is that while there is such a boost, there are also other
workload dependent factors that are out of our control. For example,
somebody who today cares about preallocation only for a boost on writes
to cache can apparently achieve a greater benefit by truncating the file
up front and disabling preallocation entirely.

Moving beyond the truncate thing, I also saw the benefit of
preallocation diminish as write buffer size was increased. As the write
size increases from 4k to around 64k, the pure performance benefit of
preallocation trailed off to zero. Part of that could also be effects of
less frequent size updates and whatnot due to the larger writes, but I
also don't think that's an uncommon thing in practice.

My only point here is that I don't think it's so cut and dry that we
absolutely need dynamic speculative preallocation for write to cache
performance. There is definitely an existing benefit and
perf/scalability caveats to consider before changing anything as such.
TBH it's probably not worth getting too much into the weeds here because
I don't really have any great solution/alternative in mind at the moment
beyond what has already been proposed. It's much easier to reason about
(and test) an actual prototype of some sort as opposed to handwavy and
poorly described ideas :P, but I also don't think it's reasonable or
wise to restrict ourselves to a particular implementation for what seem
like secondary benefits.

> > It certainly makes sense that a delayed allocation buffered
> > write does more work than a non-alloc write, but this document isn't
> > necessarily claiming that as evidence of measurably faster or slower
> > single threaded buffered writes. Rather, that overhead becomes
> > measurable with enough parallel buffered writes contending on a global
> > allocation critical section (section 5.1 "Spinlocks in Hot Paths").
> 
> It's just an example of how increasing per-write overhead impacts
> performance. The same can be said for all the memory reclaim
> mods - they also reduced per-write overhead because memory
> allocation for the page cache was much, much faster and required
> much less CPU, so each write spent less time allocating pages to
> cache the data to be written. hence, faster writes....
> 
> > Subsequently, section 6.1.2 addresses that particular issue via the
> > introduction of per-cpu accounting. It looks like there is further
> > discussion around the general link between efficiency and performance,
> > which makes sense, but I don't think that draws a definite conclusion
> > that speculative preallocation needs to be introduced immediately on
> > sequential buffered writes.
> 
> Sure, because the post-EOF preallocation that was already happening
> wasn't the problem, and we couldn't easily address the per-page
> block mapping overhead with generic code changes that would be
> useful in any circumstance (Linux community at the time was hostile
> towards bringing anything useful to XFS into the core kernel code).
> 
> And this was only a small machine (24 CPUs) so we had to address the
> locking problem because increasing speculative prealloc wouldn't
> help solve the global lock contention problems on the 2048p machines
> that SGI was shipping at the time....
> 

Makes sense.

> > What I think it suggests is that we need to
> > consider the potential scalability impact of any prospective change in
> > speculative preallocation behavior (along with the other tradeoffs
> > associated with preallocation) because less aggressive preallocation
> > means more buffered write overhead.
> 
> It's not a scalability problem now because of the fixes made back
> then - it's now just a per-write thread latency and throughput
> issue.
> 
> What is not obvious from the paper is that I designed the XFS
> benchmarks so that it wrote each file into a separate directory to
> place them into different AGs so that they didn't fragment due to
> interleaving writes. i.e. so we could directly control the number of
> sequential write streams going to the disks.
> 
> We had to do that to minimise the seek overhead of the disks so
> they'd run at full bandwidth rather than being seek bound - we had
> to keep the 256 disks at >97% bandwidth utilisation to get to
> 10GB/s, so even one extra seek per disk per second and we wouldn't
> get there.
> 
> IOWs, we'd carefully taken anything to do with fragmentation out of
> the picture entirely, both for direct and buffered writes. We were
> just focussed on write efficiency and maximising total throughput
> and that largely made post-EOF preallocation irrelevant to us.
> 

Ok.

> > BTW for historical context.. was speculative preallocation a thing when
> > this paper was written?
> 
> Yes. Speculative prealloc goes way back into the 90s from Irix.  It
> was first made configurable in XFS via the biosize mount option
> added with v3 superblocks in 1997, but the initial linux port only
> allowed up to 64k.
> 

Hmm, Ok.. so it was originally speculative preallocation without the
"dynamic sizing" logic that we have today. Thanks for the background.

> In 2005, the linux mount option allowed biosize to be extended to
> 1GB, which made sense because >4GB allocation groups (mkfs enabled
> them late 2003) were now starting to be widely used and so users
> were reporting new large AG fragmentation issues that had never been
> seen before. i.e.  it was now practical to have contiguous multi-GB
> extents in files and the delalloc code was struggling to create
> them, so having EOF-prealloc be able to make use of that capability
> was needed....
> 
> And then auto-tuning made sense because more and more people were
> having to use the mount option in more general workloads to avoid
> fragmentation.
> 

"auto-tuning" means "dynamic sizing" here, yes? FWIW, much of this
discussion also makes me wonder how appropriate the current size limit
(64k) on preallocation is for today's filesystems and systems (e.g. RAM
availability), as opposed to something larger (on the order of hundreds
of MBs for example, perhaps 256-512MB).

Brian

> And even since then it's been about tweaking the algorithms to stop
> userspace from defeating it and re-introducing fragmentation
> vectors...
> 
> > The doc suggests per-page block allocation needs
> > to be reduced on XFS,
> 
> Yup. It still stands - that was the specific design premise that
> led to the iomap infrastructure we now have. i.e. do block mapping
> once per write() call, not once per page per write call. Early
> prototypes got us a 10-20% increase in single-threaded buffered write
> throughput on slow CPUs, and with the current code some high
> throughput buffered write cases either went up by 30-40% or CPU
> usage went down by that amount.
> 
> > but also indicates that XFS had minimal
> > fragmentation compared to other fs'. AFAICT, section 5.2 attributes XFS
> > fragmentation avoidance to delayed allocation (not necessarily
> > preallocation).
> 
> Right, because none of the other filesystems had delayed allocation
> and they were interleaving concurrent allocations at write() time.
> i.e. they serialised write() for long periods of time while they did
> global fs allocation, whilst XFS had a low overhead delalloc path
> with a tiny critical global section. i.e. delalloc allowed us to
> defer allocation to writeback where it isn't so throughput critical
> and better allocation decisions can be made.
> 
> IOWs, delalloc was the primary difference in both the fragmentation
> and scalability differences reported in the paper. That XFS did
> small amounts of in-memory-only prealloc beyond EOF during delalloc
> was irrelevant - it's just part of the delalloc mechanism in this
> context....
> 
> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Feb. 14, 2019, 9:51 p.m. UTC | #13
On Thu, Feb 14, 2019 at 08:00:14AM -0500, Brian Foster wrote:
> On Thu, Feb 14, 2019 at 09:27:26AM +1100, Dave Chinner wrote:
> > On Wed, Feb 13, 2019 at 08:50:22AM -0500, Brian Foster wrote:
> > > On Wed, Feb 13, 2019 at 07:21:51AM +1100, Dave Chinner wrote:
> > > > On Tue, Feb 12, 2019 at 06:46:31AM -0500, Brian Foster wrote:
> > > > > On Mon, Feb 11, 2019 at 05:13:33PM -0800, Darrick J. Wong wrote:
> > > > > > On Fri, Feb 08, 2019 at 07:34:33AM -0500, Brian Foster wrote:
> > > > > For example, extent size hints just happen to skip delayed allocation.
> > > > > I don't recall the exact history, but I don't think this was always the
> > > > > case for extsz hints.
> > > > 
> > > > It wasn't, but extent size hints + delalloc never played nicely and
> > > > could corrupt or expose stale data and caused all sorts of problems
> > > > at ENOSPC because delalloc reservations are unaware of alignment
> > > > requirements for extent size hints. Hence to make extent size hints
> > > > work for buffered writes, I simply made them work the same way as
> > > > direct writes (i.e. immediate allocation w/ unwritten extents).
> > > > 
> > > > > So would the close time eofb trim be as
> > > > > problematic as for extsz hint files if the behavior of the latter
> > > > > changed back to using delayed allocation?
> > > > 
> > > > Yes, but if it's a write-once file that doesn't matter. If it's
> > > > write-many, then we'd retain the post-eof blocks...
> > > > 
> > > > > I think a patch for that was
> > > > > proposed fairly recently, but it depended on delalloc -> unwritten
> > > > > functionality which still had unresolved issues (IIRC).
> > > > 
> > > > *nod*
> > > > 
> > > > > From another angle, would a system that held files open for a
> > > > > significant amount of time relative to a one-time write such that close
> > > > > consistently occurred after writeback (and thus delalloc conversion) be
> > > > > susceptible to the same level of free space fragmentation as shown
> > > > > above?
> > > > 
> > > > If the file is held open for writing for a long time, we have to
> > > > assume that they are going to write again (and again) so we should
> > > > leave the EOF blocks there. If they are writing slower than the eofb
> > > > gc, then there's nothing more we can really do in that case...
> > > > 
> > > 
> > > I'm not necessarily sure that this condition is always a matter of
> > > writing too slow. It very well may be true, but I'm curious if there are
> > > parallel copy scenarios (perhaps under particular cpu/RAM configs)
> > > where we could end up doing a large number of one time file writes and
> > > not doing the release time trim until an underlying extent (with
> > > post-eof blocks) has been converted in more cases than not.
> > 
> > I'm not sure I follow what sort of workload and situation you are
> > describing here. Are you talking about the effect of an EOFB gc pass
> > during ongoing writes?
> > 
> 
> I'm not sure if it's an actual reproducible situation.. I'm just
> wondering out loud if there are normal workloads that might still defeat
> a one-time trim at release time. For example, copy enough files in
> parallel such that writeback touches most of them before the copies
> complete and we end up trimming physical blocks rather than delalloc
> blocks.
> 
> This is not so much of a problem if those files are large, I think,
> because then the preallocs and the resulting trimmed free space is on
> the larger side as well. If we're copying a bunch of little files with
> small preallocs, however, then we put ourselves in the pathological
> situation shown in Darrick's test.
> 
> I was originally thinking about whether this could happen or not on a
> highly parallel small file copy workload, but having thought about it a
> bit more I think there is a more simple example. What about an untar
> like workload that creates small files and calls fsync() before each fd
> is released?

Which is the same as an O_SYNC write if it's the same fd, which
means we'll trim allocated blocks on close. i.e. it's no different
to the current behaviour.  If such files are written in parallel then,
again, it is no different to the existing behaviour. i.e. it
largely depends on the timing of allocation in writeback and EOF
block clearing in close(). If close happens before the next
allocation in that AG, then they'll pack because there's no EOF
blocks that push out the new allocation.  If it's the other way
around, we get some level of freespace fragmentation.

This is mitigated by the fact that workloads like this tend to be
separated into separate AGs - allocation first attempts to be
non-blocking which means if it races with a free in progress it will
skip to the next AG and it won't pack sequentially, anyway. So I
don't think this is a major concern in terms of free space
fragmentation - the allocator AG selection makes it impossible to
predict what happens when racing alloc/frees occur in a single AG
and we really try to avoid that as much as possible, anyway.

If it's a separate open/fsync/close pass after all the writes (and
close) then the behaviour is the same with either the existing code
or the "trim on first close" behaviour - and the delalloc blocks
beyond EOF will get killed on the first close and not be present for
the writeback, and all is good.

> Wouldn't that still defeat a one-time release heuristic and
> produce the same layout issues as shown above? We'd prealloc,
> writeback/convert then trim small/spurious fragments of post-eof space
> back to the allocator.

Yes, but as far as I can reason (sanely), "trim on first close" is
no worse than the existing heuristic in these cases, whilst being
significantly better in others we have observed in the wild that
cause problems...

I'm not looking for perfect here, just "better with no obvious
regressions". We can't predict every situation, so if it deals with
all the problems we've had reported and a few similar cases we don't
curently handle as well, then we should run with that and not really
worry about the cases that it (or the existing code) does not
solve until we have evidence that those workloads exist and are
causing real world problems. It's a difficult enough issue to reason
about without making it more complex by playing "what about" games..

> > > It sounds like what you're saying is that it doesn't really matter
> > > either way at this point. There's no perf advantage to keeping the eof
> > > blocks in this scenario, but there's also no real harm in deferring the
> > > eofb trim of physical post-eof blocks because any future free space
> > > fragmentation damage has already been done (assuming no more writes come
> > > in).
> > 
> > Essentially, yes.
> > 
> > > The thought above was tip-toeing around the idea of (in addition to the
> > > one-time trim heuristic you mentioned above) never doing a release time
> > > trim of non-delalloc post-eof blocks.
> > 
> > Except we want to trim blocks in the cases where it's a write-once
> > file that has been fsync()d or written by O_DIRECT w/ really
> > large extent size hints....
> 
> We still would trim it. The one time write case is essentially
> unaffected because the only advantage of that heuristic is to trim eof
> blocks before they are converted. If the eof blocks are real, the
> release time heuristic has already failed (i.e., it hasn't provided any
> benefit that background trim doesn't already provide).

No, all it means is that the blocks were allocated before the fd was
closed, not that the release time heuristic failed. The release time
heuristic is deciding what to do /after/ the writes have been
completed, whatever the post-eof situation is. It /can't fail/ if it
hasn't been triggered before physical allocation has been done, it
can only decide what to do about those extents once it is called...

In which case, if it's a write-once file we want to kill the blocks,
no matter whether they are allocated or not. And if it's a repeated
open/X/close workload, then a single removal of EOF blocks won't
greatly impact the file layout because subsequent closes won't
trigger.

IOWs, in the general case it does the right thing, and when it's
wrong the impact is negated by the fact it will do the right thing
on all future closes on that inode....

> IOW, what we really want to avoid is trimming (small batches of) unused
> physical eof blocks.

For the general case, I disagree. :)

> > > > IOWs, speculative prealloc beyond EOF is not just about preventing
> > > > fragmentation - it also helps minimise the per-write CPU overhead of
> > > > delalloc space accounting. (i.e. allows faster write rates into
> > > > cache). IOWs, for anything more than really small files, we want
> > > > to be doing speculative delalloc on the first time the file is
> > > > written to.
> > > > 
> > > 
> > > Ok, this paper refers to CPU overhead as it contributes to lack of
> > > scalability.
> > 
> > Well, those were the experiments that were being performed. I'm using
> > it as an example of how per-write overhead is actually important to
> > throughput. Ignore the "global lock caused overall throughput
> > issues" because we don't have that problem any more, and instead
> > look at it as a demonstration of "anything that slows down a write()
> > reduces per-thread throughput".
> > 
> 
> Makes sense, and that's how I took it after reading through the paper.
> My point was just that I think this is more of a tradeoff and caveat to
> consider than something that outright rules out doing less aggressive
> preallocation in certain cases.
> 
> I ran a few tests yesterday out of curiosity and was able to measure (a
> small) difference in single-threaded buffered writes to cache with and
> without preallocation. What I found a bit interesting was that my
> original attempt to test this actually showed _faster_ throughput
> without preallocation because the mechanism I happened to use to bypass
> preallocation was an up front truncate.

So, like:

$ xfs_io -f -t -c "pwrite 0 1g" /mnt/scratch/testfile
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 262144 ops; 0.6582 sec (1.519 GiB/sec and 398224.4691 ops/sec)
$

Vs:

$ xfs_io -f -t -c "truncate 1g" -c "pwrite 0 1g" -c "fsync" /mnt/scratch/testfile
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 262144 ops; 0.7159 sec (1.397 GiB/sec and 366147.4512 ops/sec)
$

I get an average of 1.53GiB/s write rate into cache with
speculative prealloc active, and only 1.36GiB/s write rate in cache
without it. I've run these 10 times each, ranges are
1.519-1.598GiB/s for prealloc and 1.291-1.361GiB/s when using
truncate to prevent prealloc.

IOWs, on my test machine, the write rate into cache using 4kB writes
is over 10% faster with prealloc enabled. And to point out the
diminishing returns, with a write size of 1MB:

$  xfs_io -f -t -c "truncate 1g" -c "pwrite 0 1g -b 1M" -c "fsync" /mnt/scratch/testfile
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 1024 ops; 0.4975 sec (2.010 GiB/sec and 2058.1922 ops/sec)

The throughput with or without prealloc average out just over
2GiB/s and the difference is noise (2.03GiB/s vs 2.05GiB/s)

Note: I'm just measuring write-in rates here, and I've isolated them
from writeback completely because the file size is small enough
that dirty throttling (20% of memory) isn't kicking in because I
have 16GB of RAM on this machine. If I go over 3GB in file size,
dirty throttling and writeback kicks in and the results are a
complete crap-shoot because it's dependent on when writeback kicks
in and how the writeback rate ramps up in the first second of
writeback.

> I eventually realized that this
> had other effects (i.e., one copy doing size updates vs. the other not
> doing so) and just compared a fixed, full size (1G) preallocation with a
> fixed 4k preallocation to reproduce the boost provided by the former.

That only matters for *writeback overhead*, not ingest efficiency.
Indeed, using a preallocated extent:

$ time sudo xfs_io -f -t -c "falloc 0 1g" -c "pwrite 0 1g" -c "fsync" /mnt/scratch/testfile
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 262144 ops; 0.6610 sec (1.513 GiB/sec and 396557.5926 ops/sec)


$ xfs_io -f -t -c "falloc 0 1g" -c "pwrite 0 1g -b 1M" -c "fsync" /mnt/scratch/testfile
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 1024 ops; 0.4946 sec (2.022 GiB/sec and 2070.0795 ops/sec)

The write-in rates are identical to when speculative prealloc is
enabled (always hitting a preexisting extent and not having to
extend an extent on every iomap lookup). If I time the fsync, then
things are different because there's additional writeback overhead
(i.e. unwritten extent conversion), but this does not affect the
per-write cache ingest CPU overhead.

> The point here is that while there is such a boost, there are also other
> workload dependent factors that are out of our control. For example,
> somebody who today cares about preallocation only for a boost on writes
> to cache can apparently achieve a greater benefit by truncating the file
> up front and disabling preallocation entirely.

Can you post your tests and results, because I'm getting very
different results from running what I think are the same tests.

> Moving beyond the truncate thing, I also saw the benefit of
> preallocation diminish as write buffer size was increased. As the write
> size increases from 4k to around 64k, the pure performance benefit of
> preallocation trailed off to zero.

Yes, because we've optimised the overhead out of larger writes with
the iomap infrastructure. This used to be a whole lot worse when we
did a delalloc mapping for every page instead of one per write()
call. IOWs, a significant part of the 30-35% speed increase you see
in my numbers above going from 4kB writes to 1MB writes is a result
of iomap dropping
the number of delalloc mapping calls by a factor of 256....

> Part of that could also be effects of
> less frequent size updates and whatnot due to the larger writes, but I
> also don't think that's an uncommon thing in practice.

Again, size updates are only done on writeback, not ingest....

> My only point here is that I don't think it's so cut and dry that we
> absolutely need dynamic speculative preallocation for write to cache
> performance.

History, experience with both embedded and high end NAS machines
(where per-write CPU usage really matters!) and my own experiments
tell me a different story :/

> > > BTW for historical context.. was speculative preallocation a thing when
> > > this paper was written?
> > 
> > Yes. Speculative prealloc goes way back into the 90s from Irix.  It
> > was first made configurable in XFS via the biosize mount option
> > added with v3 superblocks in 1997, but the initial linux port only
> > allowed up to 64k.
> > 
> 
> Hmm, Ok.. so it was originally speculative preallocation without the
> "dynamic sizing" logic that we have today. Thanks for the background.
> 
> > In 2005, the linux mount option allowed biosize to be extended to
> > 1GB, which made sense because >4GB allocation groups (mkfs enabled
> > them late 2003) were now starting to be widely used and so users
> > were reporting new large AG fragmentation issues that had never been
> > seen before. i.e.  it was now practical to have contiguous multi-GB
> > extents in files and the delalloc code was struggling to create
> > them, so having EOF-prealloc be able to make use of that capability
> > was needed....
> > 
> > And then auto-tuning made sense because more and more people were
> > having to use the mount option in more general workloads to avoid
> > fragmentation.
> > 
> 
> "auto-tuning" means "dynamic sizing" here, yes?

Yes.

> FWIW, much of this
> discussion also makes me wonder how appropriate the current size limit
> (64k) on preallocation is for today's filesystems and systems (e.g. RAM
> availability), as opposed to something larger (on the order of hundreds
> of MBs for example, perhaps 256-512MB).

RAM size is irrelevant. What matters is file size and the impact
of allocation patterns on writeback IO patterns. i.e. the size limit
is about optimising writeback, not preventing fragmentation or
making more efficient use of memory, etc.

i.e. when we have lots of small files, we want writeback to pack
them so we get multiple-file sequentialisation of the write stream -
this makes things like untarring a kernel tarball (which is a large
number of small files) a sequential write workload rather than a
seek-per-individual-file-write workload. That makes a massive
difference to performance on spinning disks, and that's what the 64k
threshold (and post-EOF block removal on close for larger files)
tries to preserve.

Realistically, we should probably change that 64k threshold to match
sunit if it is set, so that we really do end up trying to pack
any write-once file smaller than sunit, as the allocator won't try
to align them, anyway....

Cheers,

Dave.
Brian Foster Feb. 15, 2019, 2:35 a.m. UTC | #14
On Fri, Feb 15, 2019 at 08:51:24AM +1100, Dave Chinner wrote:
> On Thu, Feb 14, 2019 at 08:00:14AM -0500, Brian Foster wrote:
> > On Thu, Feb 14, 2019 at 09:27:26AM +1100, Dave Chinner wrote:
> > > On Wed, Feb 13, 2019 at 08:50:22AM -0500, Brian Foster wrote:
> > > > On Wed, Feb 13, 2019 at 07:21:51AM +1100, Dave Chinner wrote:
> > > > > On Tue, Feb 12, 2019 at 06:46:31AM -0500, Brian Foster wrote:
> > > > > > On Mon, Feb 11, 2019 at 05:13:33PM -0800, Darrick J. Wong wrote:
> > > > > > > On Fri, Feb 08, 2019 at 07:34:33AM -0500, Brian Foster wrote:
> > > > > > For example, extent size hints just happen to skip delayed allocation.
> > > > > > I don't recall the exact history, but I don't think this was always the
> > > > > > case for extsz hints.
> > > > > 
> > > > > It wasn't, but extent size hints + delalloc never played nicely and
> > > > > could corrupt or expose stale data and caused all sorts of problems
> > > > > at ENOSPC because delalloc reservations are unaware of alignment
> > > > > requirements for extent size hints. Hence to make extent size hints
> > > > > work for buffered writes, I simply made them work the same way as
> > > > > direct writes (i.e. immediate allocation w/ unwritten extents).
> > > > > 
> > > > > > So would the close time eofb trim be as
> > > > > > problematic as for extsz hint files if the behavior of the latter
> > > > > > changed back to using delayed allocation?
> > > > > 
> > > > > Yes, but if it's a write-once file that doesn't matter. If it's
> > > > > write-many, then we'd retain the post-eof blocks...
> > > > > 
> > > > > > I think a patch for that was
> > > > > > proposed fairly recently, but it depended on delalloc -> unwritten
> > > > > > functionality which still had unresolved issues (IIRC).
> > > > > 
> > > > > *nod*
> > > > > 
> > > > > > From another angle, would a system that held files open for a
> > > > > > significant amount of time relative to a one-time write such that close
> > > > > > consistently occurred after writeback (and thus delalloc conversion) be
> > > > > > susceptible to the same level of free space fragmentation as shown
> > > > > > above?
> > > > > 
> > > > > If the file is held open for writing for a long time, we have to
> > > > > assume that they are going to write again (and again) so we should
> > > > > leave the EOF blocks there. If they are writing slower than the eofb
> > > > > gc, then there's nothing more we can really do in that case...
> > > > > 
> > > > 
> > > > I'm not necessarily sure that this condition is always a matter of
> > > > writing too slow. It very well may be true, but I'm curious if there are
> > > > parallel copy scenarios (perhaps under particular cpu/RAM configs)
> > > > where we could end up doing a large number of one time file writes and
> > > > not doing the release time trim until an underlying extent (with
> > > > post-eof blocks) has been converted in more cases than not.
> > > 
> > > I'm not sure I follow what sort of workload and situation you are
> > > describing here. Are you talking about the effect of an EOFB gc pass
> > > during ongoing writes?
> > > 
> > 
> > I'm not sure if it's an actual reproducible situation.. I'm just
> > wondering out loud if there are normal workloads that might still defeat
> > a one-time trim at release time. For example, copy enough files in
> > parallel such that writeback touches most of them before the copies
> > complete and we end up trimming physical blocks rather than delalloc
> > blocks.
> > 
> > This is not so much of a problem if those files are large, I think,
> > because then the preallocs and the resulting trimmed free space is on
> > the larger side as well. If we're copying a bunch of little files with
> > small preallocs, however, then we put ourselves in the pathological
> > situation shown in Darrick's test.
> > 
> > I was originally thinking about whether this could happen or not on a
> > highly parallel small file copy workload, but having thought about it a
> > bit more I think there is a more simple example. What about an untar
> > like workload that creates small files and calls fsync() before each fd
> > is released?
> 
> Which is the same as an O_SYNC write if it's the same fd, which
> means we'll trim allocated blocks on close. i.e. it's no different
> to the current behaviour.  If such files are written in parallel then,
> again, it is no different to the existing behaviour. i.e. it
> largely depends on the timing of allocation in writeback and EOF
> block clearing in close(). If close happens before the next
> allocation in that AG, then they'll pack because there's no EOF
> blocks that push out the new allocation.  If it's the other way
> around, we get some level of freespace fragmentation.
> 

I know it's the same as the current behavior. ;P I think we're talking
past each other on this. What I'm saying is that the downside to the
current behavior is that a simple copy file -> fsync -> copy next file
workload fragments free space.

Darrick demonstrated this better in his random size test with the
release time trim removed, but a simple loop to write one thousand 100k
files (xfs_io -fc "pwrite 0 100k" -c fsync ...) demonstrates similar
behavior:

# xfs_db -c "freesp -s" /dev/fedora_rhs-srv-19/tmp
   from      to extents  blocks    pct
      1       1      18      18   0.00
     16      31       1      25   0.00
     32      63       1      58   0.00
 131072  262143     924 242197739  24.97
 262144  524287       1  365696   0.04
134217728 242588672       3 727292183  74.99

vs. the same test without the fsync:

# xfs_db -c "freesp -s" /dev/fedora_rhs-srv-19/tmp 
   from      to extents  blocks    pct
      1       1      16      16   0.00
     16      31       1      20   0.00
4194304 8388607       2 16752060   1.73
134217728 242588672       4 953103627  98.27

Clearly there is an advantage to trimming before delalloc conversion.
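As a toy model of that difference (plain Python, not XFS code; the
block counts are invented for illustration):

```python
# Toy model of why trimming post-EOF blocks *before* delalloc
# conversion packs files tightly, while trimming *after* writeback has
# converted them leaves a small free space fragment behind each file.

FILE_BLOCKS = 25   # ~100k file in 4k blocks
PREALLOC = 7       # hypothetical post-EOF speculative prealloc, blocks
NFILES = 1000

def layout(trim_before_conversion):
    next_free = 0
    holes = 0          # free fragments left between file extents
    for _ in range(NFILES):
        if trim_before_conversion:
            # post-EOF delalloc is trimmed before writeback allocates
            # anything, so only the file data is ever allocated
            next_free += FILE_BLOCKS
        else:
            # writeback allocates data + prealloc; the prealloc is
            # freed on close, but the next allocation has already
            # skipped past it, so a small free fragment remains
            next_free += FILE_BLOCKS + PREALLOC
            holes += 1
    return holes

print(layout(False), layout(True))  # 1000 free fragments vs 0
```

The model is crude, but it matches the shape of the freesp output
above: one small leftover free extent per file in the fsync case.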

Random thought: perhaps a one time trim at writeback (where writeback
would convert delalloc across eof) _or_ release time, whichever happens
first on an inode with preallocation, might help mitigate this problem.

> This is mitigated by the fact that workloads like this tend to be
> separated into separate AGs - allocation first attempts to be
> non-blocking which means if it races with a free in progress it will
> skip to the next AG and it won't pack sequentially, anyway. So I
> don't think this is a major concern in terms of free space
> fragmentation - the allocator AG selection makes it impossible to
> predict what happens when racing alloc/frees occur in a single AG
> and we really try to avoid that as much as possible, anyway.
> 

Indeed..

> If it's a separate open/fsync/close pass after all the writes (and
> close) then the behaviour is the same with either the existing code
> or the "trim on first close" behaviour - and the delalloc blocks
> beyond EOF will get killed on the first close and not be present for
> the writeback, and all is good.
> 
> > Wouldn't that still defeat a one-time release heuristic and
> > produce the same layout issues as shown above? We'd prealloc,
> > writeback/convert then trim small/spurious fragments of post-eof space
> > back to the allocator.
> 
> Yes, but as far as I can reason (sanely), "trim on first close" is
> no worse that the existing heuristic in these cases, whilst being
> significantly better in others we have observed in the wild that
> cause problems...
> 

Agreed. Again, I'm thinking about this assuming that fix is already in
place.

> I'm not looking for perfect here, just "better with no obvious
> regressions". We can't predict every situation, so if it deals with
> all the problems we've had reported and a few similar cases we don't
> currently handle as well, then we should run with that and not really
> worry about the cases that it (or the existing code) does not
> solve until we have evidence that those workloads exist and are
> causing real world problems. It's a difficult enough issue to reason
> about without making it more complex by playing "what about" games..
> 

That's fair, but we have had users run into this situation. The whole
sparse inodes thing is partially a workaround for side effects of this
problem (free space fragmentation being so bad we can't allocate
inodes). Granted, some of those users may have also been able to avoid
that problem with better fs usage/configuration.

> > > > It sounds like what you're saying is that it doesn't really matter
> > > > either way at this point. There's no perf advantage to keeping the eof
> > > > blocks in this scenario, but there's also no real harm in deferring the
> > > > eofb trim of physical post-eof blocks because any future free space
> > > > fragmentation damage has already been done (assuming no more writes come
> > > > in).
> > > 
> > > Essentially, yes.
> > > 
> > > > The thought above was tip-toeing around the idea of (in addition to the
> > > > one-time trim heuristic you mentioned above) never doing a release time
> > > > trim of non-delalloc post-eof blocks.
> > > 
> > > Except we want to trim blocks in the cases where it's a write
> > > once file that has been fsync()d or written by O_DIRECT w/ really
> > > large extent size hints....
> > 
> > We still would trim it. The one time write case is essentially
> > unaffected because the only advantage of that heuristic is to trim eof
> > blocks before they are converted. If the eof blocks are real, the
> > release time heuristic has already failed (i.e., it hasn't provided any
> > benefit that background trim doesn't already provide).
> 
> No, all it means is that the blocks were allocated before the fd was
> closed, not that the release time heuristic failed. The release time
> heuristic is deciding what to do /after/ the writes have been
> completed, whatever the post-eof situation is. It /can't fail/ if it
> hasn't been triggered before physical allocation has been done, it
> can only decide what to do about those extents once it is called...
> 

Confused. By "failed," I mean we physically allocated blocks that were
never intended to be used. This is basically referring to the negative
effect of delalloc conversion -> eof trim behavior on once written files
demonstrated above. If this negative effect didn't exist, we wouldn't
need the release time trim at all and could just rely on background
trim.

> In which case, if it's a write-once file we want to kill the blocks,
> no matter whether they are allocated or not. And if it's a repeated
> open/X/close workload, then a single removal of EOF blocks won't
> greatly impact the file layout because subsequent closes won't
> trigger.
> 
> IOWs, in the general case it does the right thing, and when it's
> wrong the impact is negated by the fact it will do the right thing
> on all future closes on that inode....
> 
> > IOW, what we really want to avoid is trimming (small batches of) unused
> > physical eof blocks.
> 
> For the general case, I disagree. :)
> 
> > > > > IOWs, speculative prealloc beyond EOF is not just about preventing
> > > > > fragmentation - it also helps minimise the per-write CPU overhead of
> > > > > delalloc space accounting. (i.e. allows faster write rates into
> > > > > cache). IOWs, for anything more than a really small files, we want
> > > > > to be doing speculative delalloc on the first time the file is
> > > > > written to.
> > > > > 
> > > > 
> > > > Ok, this paper refers to CPU overhead as it contributes to lack of
> > > > scalability.
> > > 
> > > Well, that was the experiments that were being performed. I'm using
> > > it as an example of how per-write overhead is actually important to
> > > throughput. Ignore the "global lock caused overall throughput
> > > issues" because we don't have that problem any more, and instead
> > > look at it as a demonstration of "anything that slows down a write()
> > > reduces per-thread throughput".
> > > 
> > 
> > Makes sense, and that's how I took it after reading through the paper.
> > My point was just that I think this is more of a tradeoff and caveat to
> > consider than something that outright rules out doing less aggressive
> > preallocation in certain cases.
> > 
> > I ran a few tests yesterday out of curiosity and was able to measure (a
> > small) difference in single-threaded buffered writes to cache with and
> > without preallocation. What I found a bit interesting was that my
> > original attempt to test this actually showed _faster_ throughput
> > without preallocation because the mechanism I happened to use to bypass
> > preallocation was an up front truncate.
> 
> So, like:
> 
> $ xfs_io -f -t -c "pwrite 0 1g" /mnt/scratch/testfile
> wrote 1073741824/1073741824 bytes at offset 0
> 1 GiB, 262144 ops; 0.6582 sec (1.519 GiB/sec and 398224.4691 ops/sec)
> $
> 
> Vs:
> 
> $ xfs_io -f -t -c "truncate 1g" -c "pwrite 0 1g" -c "fsync" /mnt/scratch/testfile
> wrote 1073741824/1073741824 bytes at offset 0
> 1 GiB, 262144 ops; 0.7159 sec (1.397 GiB/sec and 366147.4512 ops/sec)
> $
> 
> I get an average of 1.53GiB/s write rate into cache with
> speculative prealloc active, and only 1.36GiB/s write rate in cache
> without it. I've run these 10 times each, ranges are
> 1.519-1.598GiB/s for prealloc and 1.291-1.361GiB/s when using
> truncate to prevent prealloc.
> 

Yes, that's basically the same test. I repeated it again and saw pretty
much the same numbers here. Hmm, not sure what happened there. Perhaps I
fat fingered something or mixed in a run with a different buffer size
for the truncate case. I recall seeing right around ~1.5GB/s for the
base case and closer to ~1.6GB/s for the truncate case somehow or
another, but I don't have a record of it to figure out what happened.
Anyways, sorry for the noise.. I did ultimately reproduce the prealloc
boost and still see the diminishing returns at about 64k or so (around
~2.7GB/s with or without preallocation).

> IOWs, on my test machine, the write rate into cache using 4kB writes
> is over 10% faster with prealloc enabled. And to point out the
> dminishing returns, with a write size of 1MB:
> 
> $  xfs_io -f -t -c "truncate 1g" -c "pwrite 0 1g -b 1M" -c "fsync" /mnt/scratch/testfile
> wrote 1073741824/1073741824 bytes at offset 0
> 1 GiB, 1024 ops; 0.4975 sec (2.010 GiB/sec and 2058.1922 ops/sec)
> 
> The throughput with or without prealloc average out just over
> 2GiB/s and the difference is noise (2.03GiB/s vs 2.05GiB/s)
> 
> Note: I'm just measuring write-in rates here, and I've isolated them
> from writeback completely because the file size is small enough
> that dirty throttling (20% of memory) isn't kicking in because I
> have 16GB of RAM on this machine. If I go over 3GB in file size,
> dirty throttling and writeback kicks in and the results are a
> complete crap-shoot because it's dependent on when writeback kicks
> in and how the writeback rate ramps up in the first second of
> writeback.
> 

*nod*

> > I eventually realized that this
> > had other effects (i.e., one copy doing size updates vs. the other not
> > doing so) and just compared a fixed, full size (1G) preallocation with a
> > fixed 4k preallocation to reproduce the boost provided by the former.
> 
> That only matters for *writeback overhead*, not ingest efficiency.
> Indeed, using a preallocated extent:
> 

Not sure how we got into physical preallocation here. Assume any
reference to "preallocation" in this thread by me refers to post-eof
speculative preallocation. The above preallocation tweaks were
controlled in my tests via the allocsize mount option, not physical
block preallocation.

> $ time sudo xfs_io -f -t -c "falloc 0 1g" -c "pwrite 0 1g" -c "fsync" /mnt/scratch/testfile
> wrote 1073741824/1073741824 bytes at offset 0
> 1 GiB, 262144 ops; 0.6610 sec (1.513 GiB/sec and 396557.5926 ops/sec)
> 
> 
> $ xfs_io -f -t -c "falloc 0 1g" -c "pwrite 0 1g -b 1M" -c "fsync" /mnt/scratch/testfile
> wrote 1073741824/1073741824 bytes at offset 0
> 1 GiB, 1024 ops; 0.4946 sec (2.022 GiB/sec and 2070.0795 ops/sec)
> 
> The write-in rates are identical to when speculative prealloc is
> enabled (always hitting a preexisting extent and not having to
> extend an extent on every iomap lookup). If I time the fsync, then
> things are different because there's additional writeback overhead
> (i.e. unwritten extent conversion), but this does not affect the
> per-write cache ingest CPU overhead.
> 
> > The point here is that while there is such a boost, there are also other
> > workload dependent factors that are out of our control. For example,
> > somebody who today cares about preallocation only for a boost on writes
> > to cache can apparently achieve a greater benefit by truncating the file
> > up front and disabling preallocation entirely.
> 
> Can you post your tests and results, because I'm getting very
> different results from running what I think are the same tests.
> 
> > Moving beyond the truncate thing, I also saw the benefit of
> > preallocation diminish as write buffer size was increased. As the write
> > size increases from 4k to around 64k, the pure performance benefit of
> > preallocation trailed off to zero.
> 
> Yes, because we've optimised the overhead out of larger writes with
> the iomap infrastructure. This used to be a whole lot worse when we
> did a delalloc mapping for every page instead of one per write()
> call. IOWs, a significant part of the 30-35% speed increase you see in my numbers above
> going from 4kB writes to 1MB writes is a result of iomap dropping
> the number of delalloc mapping calls by a factor of 256....
> 
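Sanity-checking the arithmetic on that factor of 256 (it is just the
ratio of write() counts for the same 1GiB of data):

```python
# One delalloc mapping lookup per write() call, pre-iomap-batching:
GiB = 1 << 30
calls_4k = GiB // (4 << 10)   # 4kB writes
calls_1m = GiB // (1 << 20)   # 1MB writes
print(calls_4k, calls_1m, calls_4k // calls_1m)  # 262144 1024 256
```

These match the op counts in the xfs_io output above (262144 ops vs
1024 ops).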
> > Part of that could also be effects of
> > less frequent size updates and whatnot due to the larger writes, but I
> > also don't think that's an uncommon thing in practice.
> 
> Again, size updates are only done on writeback, not ingest....
> 
> > My only point here is that I don't think it's so cut and dry that we
> > absolutely need dynamic speculative preallocation for write to cache
> > performance.
> 
> History, experience with both embedded and high end NAS machines
> (where per-write CPU usage really matters!) and my own experiements
> tell me a different story :/
> 
> > > > BTW for historical context.. was speculative preallocation a thing when
> > > > this paper was written?
> > > 
> > > Yes. Speculative prealloc goes way back into the 90s from Irix.  It
> > > was first made configurable in XFS via the biosize mount option
> > > added with v3 superblocks in 1997, but the initial linux port only
> > > allowed up to 64k.
> > > 
> > 
> > Hmm, Ok.. so it was originally speculative preallocation without the
> > "dynamic sizing" logic that we have today. Thanks for the background.
> > 
> > > In 2005, the linux mount option allowed biosize to be extended to
> > > 1GB, which made sense because >4GB allocation groups (mkfs enabled
> > > them late 2003) were now starting to be widely used and so users
> > > were reporting new large AG fragmentation issues that had never been
> > > seen before. i.e.  it was now practical to have contiguous multi-GB
> > > extents in files and the delalloc code was struggling to create
> > > them, so having EOF-prealloc be able to make use of that capability
> > > was needed....
> > > 
> > > And then auto-tuning made sense because more and more people were
> > > having to use the mount option in more general workloads to avoid
> > > fragmentation.
> > > 
> > 
> > "auto-tuning" means "dynamic sizing" here, yes?
> 
> Yes.
> 
> > FWIW, much of this
> > discussion also makes me wonder how appropriate the current size limit
> > (64k) on preallocation is for today's filesystems and systems (e.g. RAM
> > availability), as opposed to something larger (on the order of hundreds
> > of MBs for example, perhaps 256-512MB).
> 
> RAM size is irrelevant. What matters is file size and the impact
> of allocation patterns on writeback IO patterns. i.e. the size limit
> is about optimising writeback, not preventing fragmentation or
> making more efficient use of memory, etc.
> 

I'm just suggesting that the more RAM that is available, the more we're
able to write into cache before writeback starts and thus the larger
physical extents we're able to allocate independent of speculative
preallocation (perf issues notwithstanding).

> i.e. when we have lots of small files, we want writeback to pack
> them so we get multiple-file sequentialisation of the write stream -
> this makes things like untarring a kernel tarball (which is a large
> number of small files) a sequential write workload rather than a
> seek-per-individual-file-write workload. That makes a massive
> difference to performance on spinning disks, and that's what the 64k
> threshold (and post-EOF block removal on close for larger files)
> tries to preserve.
> 

Sure, but what's the downside to increasing that threshold to even
something on the order of MBs? Wouldn't that at least help us leave
larger free extents around in those workloads/patterns that do fragment
free space?

> Realistically, we should probably change that 64k threshold to match
> sunit if it is set, so that we really do end up trying to pack
> any write-once file smaller than sunit as the allocator won't try
> to align them, anyway....
> 

That makes sense.
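A minimal sketch of that sunit-based threshold (hypothetical function
and parameter names, sizes in 4k filesystem blocks):

```python
def release_trim_threshold(sunit_blocks, default_blocks=16):
    # Hypothetical sketch of the suggestion above: keep the historical
    # 64k threshold (16 x 4k blocks) unless a stripe unit is
    # configured, in which case pack any write-once file smaller than
    # sunit, since the allocator won't stripe-align those anyway.
    return max(default_blocks, sunit_blocks)

print(release_trim_threshold(0))    # unstriped: 16 blocks (64k)
print(release_trim_threshold(256))  # 1MB stripe unit: 256 blocks
```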

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Feb. 15, 2019, 7:23 a.m. UTC | #15
On Thu, Feb 14, 2019 at 09:35:21PM -0500, Brian Foster wrote:
> On Fri, Feb 15, 2019 at 08:51:24AM +1100, Dave Chinner wrote:
> > On Thu, Feb 14, 2019 at 08:00:14AM -0500, Brian Foster wrote:
> > > On Thu, Feb 14, 2019 at 09:27:26AM +1100, Dave Chinner wrote:
> > > > On Wed, Feb 13, 2019 at 08:50:22AM -0500, Brian Foster wrote:
> > > > > On Wed, Feb 13, 2019 at 07:21:51AM +1100, Dave Chinner wrote:
> > > > > > On Tue, Feb 12, 2019 at 06:46:31AM -0500, Brian Foster wrote:
> > > > > > > On Mon, Feb 11, 2019 at 05:13:33PM -0800, Darrick J. Wong wrote:
> > > > > > > > On Fri, Feb 08, 2019 at 07:34:33AM -0500, Brian Foster wrote:
> > > > > > > For example, extent size hints just happen to skip delayed allocation.
> > > > > > > I don't recall the exact history, but I don't think this was always the
> > > > > > > case for extsz hints.
> > > > > > 
> > > > > > It wasn't, but extent size hints + delalloc never played nicely and
> > > > > > could corrupt or expose stale data and caused all sorts of problems
> > > > > > at ENOSPC because delalloc reservations are unaware of alignment
> > > > > > requirements for extent size hints. Hence to make extent size hints
> > > > > > work for buffered writes, I simply made them work the same way as
> > > > > > direct writes (i.e. immediate allocation w/ unwritten extents).
> > > > > > 
> > > > > > > So would the close time eofb trim be as
> > > > > > > problematic as for extsz hint files if the behavior of the latter
> > > > > > > changed back to using delayed allocation?
> > > > > > 
> > > > > > Yes, but if it's a write-once file that doesn't matter. If it's
> > > > > > write-many, then we'd retain the post-eof blocks...
> > > > > > 
> > > > > > > I think a patch for that was
> > > > > > > proposed fairly recently, but it depended on delalloc -> unwritten
> > > > > > > functionality which still had unresolved issues (IIRC).
> > > > > > 
> > > > > > *nod*
> > > > > > 
> > > > > > > From another angle, would a system that held files open for a
> > > > > > > significant amount of time relative to a one-time write such that close
> > > > > > > consistently occurred after writeback (and thus delalloc conversion) be
> > > > > > > susceptible to the same level of free space fragmentation as shown
> > > > > > > above?
> > > > > > 
> > > > > > If the file is held open for writing for a long time, we have to
> > > > > > assume that they are going to write again (and again) so we should
> > > > > > leave the EOF blocks there. If they are writing slower than the eofb
> > > > > > gc, then there's nothing more we can really do in that case...
> > > > > > 
> > > > > 
> > > > > I'm not necessarily sure that this condition is always a matter of
> > > > > writing too slow. It very well may be true, but I'm curious if there are
> > > > > parallel copy scenarios (perhaps under particular cpu/RAM configs)
> > > > > where we could end up doing a large number of one time file writes and
> > > > > not doing the release time trim until an underlying extent (with
> > > > > post-eof blocks) has been converted in more cases than not.
> > > > 
> > > > I'm not sure I follow what sort of workload and situation you are
> > > > describing here. Are you talking about the effect of an EOFB gc pass
> > > > during ongoing writes?
> > > > 
> > > 
> > > I'm not sure if it's an actual reproducible situation.. I'm just
> > > wondering out loud if there are normal workloads that might still defeat
> > > a one-time trim at release time. For example, copy enough files in
> > > parallel such that writeback touches most of them before the copies
> > > complete and we end up trimming physical blocks rather than delalloc
> > > blocks.
> > > 
> > > This is not so much of a problem if those files are large, I think,
> > > because then the preallocs and the resulting trimmed free space is on
> > > the larger side as well. If we're copying a bunch of little files with
> > > small preallocs, however, then we put ourselves in the pathological
> > > situation shown in Darrick's test.
> > > 
> > > I was originally thinking about whether this could happen or not on a
> > > highly parallel small file copy workload, but having thought about it a
> > > bit more I think there is a more simple example. What about an untar
> > > like workload that creates small files and calls fsync() before each fd
> > > is released?
> > 
> > Which is the same as an O_SYNC write if it's the same fd, which
> > means we'll trim allocated blocks on close. i.e. it's no different
> > to the current behaviour.  If such files are written in parallel then,
> > again, it is no different to the existing behaviour. i.e. it
> > largely depends on the timing of allocation in writeback and EOF
> > block clearing in close(). If close happens before the next
> > allocation in that AG, then they'll pack because there's no EOF
> > blocks that push out the new allocation.  If it's the other way
> > around, we get some level of freespace fragmentation.
> > 
> 
> I know it's the same as the current behavior. ;P I think we're talking
> past each other on this.

Probably :P

> What I'm saying is that the downside to the
> current behavior is that a simple copy file -> fsync -> copy next file
> workload fragments free space.

Yes. But it's also one of the cases that "always release on first
close" fixes.

> Darrick demonstrated this better in his random size test with the
> release time trim removed, but a simple loop to write one thousand 100k
> files (xfs_io -fc "pwrite 0 100k" -c fsync ...) demonstrates similar
> behavior:
> 
> # xfs_db -c "freesp -s" /dev/fedora_rhs-srv-19/tmp
>    from      to extents  blocks    pct
>       1       1      18      18   0.00
>      16      31       1      25   0.00
>      32      63       1      58   0.00
>  131072  262143     924 242197739  24.97
>  262144  524287       1  365696   0.04
> 134217728 242588672       3 727292183  74.99

That should not be leaving freespace fragments behind - it should be
trimming the EOF blocks on close after fsync() and the next
allocation should pack tightly.

/me goes off to trace it because it's not doing what he knows it
should be doing.

Ngggh. Busy extent handling.

Basically, the extent we trimmed is busy because it is a user data
extent that has been freed and not yet committed (even though it was
not used), so it gets trimmed out of the free space range that is
allocated.

IOWs, how we handle busy extents results in this behaviour, not the
speculative prealloc which has already been removed and returned to
the free space pool....
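A toy first-fit sketch (not the XFS allocator) of how a busy freed
extent defeats packing; block numbers are invented:

```python
def first_fit(free_extents, busy, want):
    """First-fit, but busy (freed-but-uncommitted) ranges are carved
    out of candidate free extents, as described above."""
    for start, length in free_extents:
        for b_start, b_len in busy:
            if b_start == start:      # busy prefix: trim it out
                start += b_len
                length -= b_len
        if length >= want:
            return start
    return None

# Previous file's data ended at block 25; its 7-block post-EOF
# prealloc (blocks 25..31) was freed on close, but the freeing
# transaction hasn't committed, so the extent is still busy.
print(first_fit([(25, 1000)], busy=[(25, 7)], want=25))  # 32, not 25
print(first_fit([(25, 1000)], busy=[], want=25))         # 25: packs
```

So even though the prealloc was returned to the free space pool, the
next allocation lands beyond it, leaving the small fragment behind.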

> vs. the same test without the fsync:
> 
> # xfs_db -c "freesp -s" /dev/fedora_rhs-srv-19/tmp 
>    from      to extents  blocks    pct
>       1       1      16      16   0.00
>      16      31       1      20   0.00
> 4194304 8388607       2 16752060   1.73
> 134217728 242588672       4 953103627  98.27
> 
> Clearly there is an advantage to trimming before delalloc conversion.

The advantage is in avoiding busy extents by trimming before
physical allocation occurs. I think we need to fix the busy extent
handling here, not the speculative prealloc...

> Random thought: perhaps a one time trim at writeback (where writeback
> would convert delalloc across eof) _or_ release time, whichever happens
> first on an inode with preallocation, might help mitigate this problem.

Not sure how you'd do that reliably - there's no serialisation with
incoming writes, and no context as to what is changing EOF. Hence I
don't know how we'd decide that trimming was required or not...

> > I'm not looking for perfect here, just "better with no obvious
> > regressions". We can't predict every situation, so if it deals with
> > all the problems we've had reported and a few similar cases we don't
> > currently handle as well, then we should run with that and not really
> > worry about the cases that it (or the existing code) does not
> > solve until we have evidence that those workloads exist and are
> > causing real world problems. It's a difficult enough issue to reason
> > about without making it more complex by playing "what about" games..
> > 
> 
> That's fair, but we have had users run into this situation. The whole
> sparse inodes thing is partially a workaround for side effects of this
> problem (free space fragmentation being so bad we can't allocate
> inodes). Granted, some of those users may have also been able to avoid
> that problem with better fs usage/configuration.

I think the systems that required sparse inodes are orthogonal - those
issues could be caused just with well packed small files and large
inodes, and had nothing in common with the workload we've recently
seen. Indeed, in the cases other than the specific small file
workload gluster used to fragment free space to prevent inode
allocation, we never got to the bottom of what caused the free space
fragmentation. All we could do is make the fs more tolerant of
freespace fragmentation.

This time, we have direct evidence of what caused freespace
fragmentation on this specific system. It was caused by the app
doing something bizarre and we can extrapolate several similar
behaviours from that workload.

But beyond that, we're well into "whatabout" and "whatif" territory.

> > > We still would trim it. The one time write case is essentially
> > > unaffected because the only advantage of that heuristic is to trim eof
> > > blocks before they are converted. If the eof blocks are real, the
> > > release time heuristic has already failed (i.e., it hasn't provided any
> > > benefit that background trim doesn't already provide).
> > 
> > No, all it means is that the blocks were allocated before the fd was
> > closed, not that the release time heuristic failed. The release time
> > heuristic is deciding what to do /after/ the writes have been
> > completed, whatever the post-eof situation is. It /can't fail/ if it
> > hasn't been triggered before physical allocation has been done, it
> > can only decide what to do about those extents once it is called...
> > 
> 
> Confused. By "failed," I mean we physically allocated blocks that were
> never intended to be used.

How can we know they are never intended to be used at writeback
time?

> This is basically referring to the negative
> effect of delalloc conversion -> eof trim behavior on once written files
> demonstrated above. If this negative effect didn't exist, we wouldn't
> need the release time trim at all and could just rely on background
> trim.

We also cannot predict what the application intends. Hence we have
heuristics that trigger once the application signals that it is
"done with this file". i.e. it has closed the fd.

> > > I eventually realized that this
> > > had other effects (i.e., one copy doing size updates vs. the other not
> > > doing so) and just compared a fixed, full size (1G) preallocation with a
> > > fixed 4k preallocation to reproduce the boost provided by the former.
> > 
> > That only matters for *writeback overhead*, not ingest efficiency.
> > Indeed, using a preallocated extent:
> > 
> 
> Not sure how we got into physical preallocation here. Assume any
> reference to "preallocation" in this thread by me refers to post-eof
> speculative preallocation. The above preallocation tweaks were
> controlled in my tests via the allocsize mount option, not physical
> block preallocation.

I'm just using it to demonstrate the difference is in continually
extending the delalloc extent in memory. I could have just done an
overwrite - it's the same thing.

> > > FWIW, much of this
> > > discussion also makes me wonder how appropriate the current size limit
> > > (64k) on preallocation is for today's filesystems and systems (e.g. RAM
> > > availability), as opposed to something larger (on the order of hundreds
> > > of MBs for example, perhaps 256-512MB).
> > 
> > RAM size is irrelevant. What matters is file size and the impact
> > of allocation patterns on writeback IO patterns. i.e. the size limit
> > is about optimising writeback, not preventing fragmentation or
> > making more efficient use of memory, etc.
> 
> I'm just suggesting that the more RAM that is available, the more we're
> able to write into cache before writeback starts and thus the larger
> physical extents we're able to allocate independent of speculative
> preallocation (perf issues notwithstanding).

ISTR I looked at that years ago and couldn't get it to work
reliably. It works well for initial ingest, but once the writes go
on for long enough dirty throttling starts chopping ingest up in
smaller and smaller chunks as it rotors writeback bandwidth around
all the processes dirtying the page cache. This chunking happens
regardless of the size of the file being written. And so the more
processes that are dirtying the page cache, the smaller the file
fragments get because each file gets a smaller amount of the overall
writeback bandwidth each time it writeback occurs. i.e.
fragmentation increases as memory pressure, load and concurrency
increases, which are exactly the conditions we want to be avoiding
fragmentation as much as possible...

The only way I found to prevent this in a fair and predictable
manner is the auto-grow algorithm we have now.  There are very few
real world corner cases where it breaks down, so we do not need
fundamental changes here. We've found one corner case where it is
defeated, so let's address that corner case with the minimal change
that is necessary but otherwise leave the underlying algorithm
alone so we can observe the longer term effects of the tweak we
need to make....
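For reference, the shape of that auto-grow behaviour as a rough
sketch (toy Python; not the actual xfs_iomap_prealloc_size() logic or
its clamps):

```python
def prealloc_blocks(prev_extent_blocks, min_blocks=16, max_blocks=2 << 20):
    # Sketch of an auto-grow heuristic: the speculative post-EOF
    # prealloc roughly doubles with the size of what has already been
    # written, clamped to a floor (64k) and an arbitrary ceiling.
    return min(max(2 * prev_extent_blocks, min_blocks), max_blocks)

# File size (in blocks) after each append-and-grow cycle:
sizes = []
extent = 16
for _ in range(5):
    extent += prealloc_blocks(extent)
    sizes.append(extent)
print(sizes)  # grows geometrically: [48, 144, 432, 1296, 3888]
```

The point being that the prealloc size adapts to the file as it is
written, rather than being fixed by RAM size or a mount option.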

> > i.e. when we have lots of small files, we want writeback to pack
> > them so we get multiple-file sequentialisation of the write stream -
> > this makes things like untarring a kernel tarball (which is a large
> > number of small files) a sequential write workload rather than a
> > seek-per-individual-file-write workload. That makes a massive
> > difference to performance on spinning disks, and that's what the 64k
> > threshold (and post-EOF block removal on close for larger files)
> > tries to preserve.
> 
> Sure, but what's the downside to increasing that threshold to even
> something on the order of MBs? Wouldn't that at least help us leave
> larger free extents around in those workloads/patterns that do fragment
> free space?

Because even for files in the "few MB" size range, worst case
fragmentation is thousands of extents and IO performance that
absolutely sucks. Especially on RAID5/6 devices. We have to ensure
file fragmentation at its worst does not affect filesystem
throughput, and that means in the general case less than ~1MB sized
extents is just not acceptable even for smallish files....

Cheers,

Dave.
Brian Foster Feb. 15, 2019, 8:33 p.m. UTC | #16
On Fri, Feb 15, 2019 at 06:23:32PM +1100, Dave Chinner wrote:
> On Thu, Feb 14, 2019 at 09:35:21PM -0500, Brian Foster wrote:
> > On Fri, Feb 15, 2019 at 08:51:24AM +1100, Dave Chinner wrote:
> > > On Thu, Feb 14, 2019 at 08:00:14AM -0500, Brian Foster wrote:
> > > > On Thu, Feb 14, 2019 at 09:27:26AM +1100, Dave Chinner wrote:
> > > > > On Wed, Feb 13, 2019 at 08:50:22AM -0500, Brian Foster wrote:
> > > > > > On Wed, Feb 13, 2019 at 07:21:51AM +1100, Dave Chinner wrote:
> > > > > > > On Tue, Feb 12, 2019 at 06:46:31AM -0500, Brian Foster wrote:
> > > > > > > > On Mon, Feb 11, 2019 at 05:13:33PM -0800, Darrick J. Wong wrote:
> > > > > > > > > On Fri, Feb 08, 2019 at 07:34:33AM -0500, Brian Foster wrote:
> > > > > > > > For example, extent size hints just happen to skip delayed allocation.
> > > > > > > > I don't recall the exact history, but I don't think this was always the
> > > > > > > > case for extsz hints.
> > > > > > > 
> > > > > > > It wasn't, but extent size hints + delalloc never played nicely and
> > > > > > > could corrupt or expose stale data and caused all sorts of problems
> > > > > > > at ENOSPC because delalloc reservations are unaware of alignment
> > > > > > > requirements for extent size hints. Hence to make extent size hints
> > > > > > > work for buffered writes, I simply made them work the same way as
> > > > > > > direct writes (i.e. immediate allocation w/ unwritten extents).
> > > > > > > 
> > > > > > > > So would the close time eofb trim be as
> > > > > > > > problematic as for extsz hint files if the behavior of the latter
> > > > > > > > changed back to using delayed allocation?
> > > > > > > 
> > > > > > > Yes, but if it's a write-once file that doesn't matter. If it's
> > > > > > > write-many, then we'd retain the post-eof blocks...
> > > > > > > 
> > > > > > > > I think a patch for that was
> > > > > > > > proposed fairly recently, but it depended on delalloc -> unwritten
> > > > > > > > functionality which still had unresolved issues (IIRC).
> > > > > > > 
> > > > > > > *nod*
> > > > > > > 
> > > > > > > > From another angle, would a system that held files open for a
> > > > > > > > significant amount of time relative to a one-time write such that close
> > > > > > > > consistently occurred after writeback (and thus delalloc conversion) be
> > > > > > > > susceptible to the same level of free space fragmentation as shown
> > > > > > > > above?
> > > > > > > 
> > > > > > > If the file is held open for writing for a long time, we have to
> > > > > > > assume that they are going to write again (and again) so we should
> > > > > > > leave the EOF blocks there. If they are writing slower than the eofb
> > > > > > > gc, then there's nothing more we can really do in that case...
> > > > > > > 
> > > > > > 
> > > > > > I'm not necessarily sure that this condition is always a matter of
> > > > > > writing too slow. It very well may be true, but I'm curious if there are
> > > > > > parallel copy scenarios (perhaps under particular cpu/RAM configs)
> > > > > > where we could end up doing a large number of one time file writes and
> > > > > > not doing the release time trim until an underlying extent (with
> > > > > > post-eof blocks) has been converted in more cases than not.
> > > > > 
> > > > > I'm not sure I follow what sort of workload and situation you are
> > > > > describing here. Are you talking about the effect of an EOFB gc pass
> > > > > during ongoing writes?
> > > > > 
> > > > 
> > > > I'm not sure if it's an actual reproducible situation.. I'm just
> > > > wondering out loud if there are normal workloads that might still defeat
> > > > a one-time trim at release time. For example, copy enough files in
> > > > parallel such that writeback touches most of them before the copies
> > > > complete and we end up trimming physical blocks rather than delalloc
> > > > blocks.
> > > > 
> > > > This is not so much of a problem if those files are large, I think,
> > > > because then the preallocs and the resulting trimmed free space is on
> > > > the larger side as well. If we're copying a bunch of little files with
> > > > small preallocs, however, then we put ourselves in the pathological
> > > > situation shown in Darrick's test.
> > > > 
> > > > I was originally thinking about whether this could happen or not on a
> > > > highly parallel small file copy workload, but having thought about it a
> > > > bit more I think there is a more simple example. What about an untar
> > > > like workload that creates small files and calls fsync() before each fd
> > > > is released?
> > > 
> > > Which is the same as an O_SYNC write if it's the same fd, which
> > > means we'll trim allocated blocks on close. i.e. it's no different
> > > to the current behaviour.  If such files are written in parallel then,
> > > again, it is no different to the existing behaviour. i.e. it
> > > largely depends on the timing of allocation in writeback and EOF
> > > block clearing in close(). If close happens before the next
> > > allocation in that AG, then they'll pack because there's no EOF
> > > blocks that push out the new allocation.  If it's the other way
> > > around, we get some level of freespace fragmentation.
> > > 
> > 
> > I know it's the same as the current behavior. ;P I think we're talking
> > past each other on this.
> 
> Probably :P
> 
> > What I'm saying is that the downside to the
> > current behavior is that a simple copy file -> fsync -> copy next file
> > workload fragments free space.
> 
> Yes. But it's also one of the cases that "always release on first
> close" fixes.
> 
> > Darrick demonstrated this better in his random size test with the
> > release time trim removed, but a simple loop to write one thousand 100k
> > files (xfs_io -fc "pwrite 0 100k" -c fsync ...) demonstrates similar
> > behavior:
> > 
> > # xfs_db -c "freesp -s" /dev/fedora_rhs-srv-19/tmp
> >    from      to extents  blocks    pct
> >       1       1      18      18   0.00
> >      16      31       1      25   0.00
> >      32      63       1      58   0.00
> >  131072  262143     924 242197739  24.97
> >  262144  524287       1  365696   0.04
> > 134217728 242588672       3 727292183  74.99
> 
> That should not be leaving freespace fragments behind - it should be
> trimming the EOF blocks on close after fsync() and the next
> allocation should pack tightly.
> 
> /me goes off to trace it because it's not doing what he knows it
> should be doing.
> 
> Ngggh. Busy extent handling.
> 
> Basically, the extent we trimmed is busy because it is a user data
> extent that has been freed and not yet committed (even though it was
> not used), so it gets trimmed out of the free space range that is
> allocated.
> 
> IOWs, how we handle busy extents results in this behaviour, not the
> speculative prealloc which has already been removed and returned to
> the free space pool....
> 
> > vs. the same test without the fsync:
> > 
> > # xfs_db -c "freesp -s" /dev/fedora_rhs-srv-19/tmp 
> >    from      to extents  blocks    pct
> >       1       1      16      16   0.00
> >      16      31       1      20   0.00
> > 4194304 8388607       2 16752060   1.73
> > 134217728 242588672       4 953103627  98.27
> > 
> > Clearly there is an advantage to trimming before delalloc conversion.
> 
> The advantage is in avoiding busy extents by trimming before
> physical allocation occurs. I think we need to fix the busy extent
> handling here, not the speculative prealloc...
> 

Ok, that might be a reasonable approach for this particular pattern. It
might not help for other (probably less common) patterns. In any event,
I hope this at least shows why I'm fishing for further improvements..
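
To make the busy-extent effect concrete, here's a toy allocator model.
This is pure illustration -- the function names, block counts, and the
"skip busy range" behaviour are invented for the sketch, not XFS
internals:

```python
# Toy model of the busy-extent behaviour described above. Each "file"
# writes 25 blocks (100k of 4k blocks) and gets a 16-block (64k)
# speculative preallocation. fsync converts everything, including the
# post-EOF blocks, to a physical extent; close() then frees the 16
# post-EOF blocks, but until the freeing transaction commits they are
# "busy" and the next allocation must skip over them.

def run(files, data_blocks=25, prealloc=16, busy_skipped=True):
    next_free = 0          # next-free-block allocator cursor
    holes = []             # freed-but-skipped ranges (freespace fragments)
    for _ in range(files):
        # allocate data + converted speculative prealloc at the cursor
        start = next_free
        next_free = start + data_blocks + prealloc
        # close() frees the post-EOF blocks ...
        freed = (start + data_blocks, start + data_blocks + prealloc)
        if busy_skipped:
            # ... but they are busy, so the next file's allocation
            # skips them, leaving a freespace fragment behind
            holes.append(freed)
        else:
            # if the freed range were reusable immediately, the
            # cursor could back up and allocations would pack
            next_free = freed[0]
    return holes

# With busy-extent skipping, every file leaves a 16-block hole;
# without it, allocations pack tightly and no holes remain.
print(len(run(1000)), len(run(1000, busy_skipped=False)))
```

In this model the write+fsync loop leaves one small freespace fragment
per file, which is the shape of the freesp output above; fixing busy
extent handling corresponds to the `busy_skipped=False` case.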

> > Random thought: perhaps a one time trim at writeback (where writeback
> > would convert delalloc across eof) _or_ release time, whichever happens
> > first on an inode with preallocation, might help mitigate this problem.
> 
> Not sure how you'd do that reliably - there's no serialisation with
> incoming writes, and no context as to what is changing EOF. Hence I
> don't know how we'd decide that trimming was required or not...
> 

That's not that difficult to work around. If writeback can't reliably
add/remove blocks, it can make decisions about how many blocks to
convert in particular situations. A writeback time trim could be
implemented as a trimmed allocation/conversion that would also allow (or
refuse) a subsequent release to still perform the actual post-eof trim.
IOW, consider the approach as more of having a smarter post-eof
writeback delalloc conversion policy rather than exclusively attempting
to address post-eof blocks at release time.

A simple example could be something like not converting post-eof blocks
if the current writeback cycle is attempting to convert the only
delalloc extent in the file and/or if the extent extends from offset
zero to i_size (or whatever heuristic best describes a write once file),
otherwise behave exactly as we do today. That filters out smallish,
write-once files without requiring any changes to the buffered write
time speculative prealloc algorithm. It also could eliminate the need
for a release time trim because unused post-eof blocks are unlikely to
ever turn into physical blocks for small write once files, but I'd have
to think about that angle a bit more.
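
As a sketch only, the heuristic described above might look something
like this -- the function name and the in-memory extent representation
are invented for illustration, not the XFS implementation:

```python
# Sketch of a writeback-time post-EOF trim policy: if the inode looks
# like a write-once file, convert only up to EOF and leave the
# post-EOF delalloc blocks alone; otherwise behave exactly as today.

def writeback_should_trim_posteof(delalloc_extents, isize):
    """Decide whether this writeback cycle should skip converting
    post-EOF delalloc blocks.

    delalloc_extents: list of (start_byte, end_byte) delalloc ranges.
    isize: current file size in bytes.
    """
    if len(delalloc_extents) != 1:
        return False                 # multiple dirty ranges: no change
    start, end = delalloc_extents[0]
    # A single delalloc extent covering 0..i_size (and beyond, due to
    # speculative prealloc) looks like a write-once file.
    return start == 0 and end >= isize

# Write-once 100k file with 64k of speculative prealloc past EOF:
assert writeback_should_trim_posteof([(0, 164 * 1024)], 100 * 1024)
# Appending log file with several dirty ranges: leave prealloc alone:
assert not writeback_should_trim_posteof([(0, 4096), (8192, 16384)], 16384)
```
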

A tradeoff would be potential for higher fragmentation in cases where
writeback occurs on a single delalloc extent file before the file
finishes being copied, but we're only talking an extra extent or so if
it's a one time "writeback allocation trim," and the file is probably
large for that to occur in the first place. IOW, the extra
extent should be sized at or around the amount of data we were able
to ingest into pagecache before writeback touched the inode.

This is of course just an idea and still subject to prototyping and
testing/experimentation and whatnot...

> > > I'm not looking for perfect here, just "better with no obvious
> > > regressions". We can't predict every situation, so if it deals with
> > > all the problems we've had reported and a few similar cases we don't
> > > currently handle as well, then we should run with that and not really
> > > worry about the cases that it (or the existing code) does not
> > > solve until we have evidence that those workloads exist and are
> > > causing real world problems. It's a difficult enough issue to reason
> > > about without making it more complex by playing "what about" games..
> > > 
> > 
> > That's fair, but we have had users run into this situation. The whole
> > sparse inodes thing is partially a workaround for side effects of this
> > problem (free space fragmentation being so bad we can't allocate
> > inodes). Granted, some of those users may have also been able to avoid
> > that problem with better fs usage/configuration.
> 
> I think systems that required sparse inodes are orthogonal - those
> issues could be caused just with well packed small files and large
> inodes, and had nothing in common with the workload we've recently
> seen. Indeed, in the cases other than the specific small file
> workload gluster used to fragment free space to prevent inode
> allocation, we never got to the bottom of what caused the free space
> fragmentation. All we could do is make the fs more tolerant of
> freespace fragmentation.
> 
> This time, we have direct evidence of what caused freespace
> fragmentation on this specific system. It was caused by the app
> doing something bizarre and we can extrapolate several similar
> behaviours from that workload.
> 
> But beyond that, we're well into "whatabout" and "whatif" territory.
> 
> > > > We still would trim it. The one time write case is essentially
> > > > unaffected because the only advantage of that heuristic is to trim eof
> > > > blocks before they are converted. If the eof blocks are real, the
> > > > release time heuristic has already failed (i.e., it hasn't provided any
> > > > benefit that background trim doesn't already provide).
> > > 
> > > No, all it means is that the blocks were allocated before the fd was
> > > closed, not that the release time heuristic failed. The release time
> > > heuristic is deciding what to do /after/ the writes have been
> > > completed, whatever the post-eof situation is. It /can't fail/ if it
> > > hasn't been triggered before physical allocation has been done, it
> > > can only decide what to do about those extents once it is called...
> > > 
> > 
> > Confused. By "failed," I mean we physically allocated blocks that were
> > never intended to be used.
> 
> How can we know they are never intended to be used at writeback
> time?
> 

We never really know if the blocks will be used or not. That's not the
point. I'm using the term "failed" simply to refer to the case where
post-eof blocks are unused and end up converted to physical blocks
before they are trimmed. I chose the term because that's the scenario
where the current heuristic contributes to free space fragmentation.
This has been demonstrated by Darrick's test and my more simplistic test
above. Feel free to choose a different term for that scenario, but it's
clearly a case that could stand to improve in XFS (whether it be via
something like the writeback time trim, busy extent fixes, etc.).

> > This is basically referring to the negative
> > effect of delalloc conversion -> eof trim behavior on once written files
> > demonstrated above. If this negative effect didn't exist, we wouldn't
> > need the release time trim at all and could just rely on background
> > trim.
> 
> We also cannot predict what the application intends. Hence we have
> heuristics that trigger once the application signals that it is
> "done with this file". i.e. it has closed the fd.
> 

Of course, that doesn't mean we can't try to make it smarter or dampen
negative side effects of the "failure" case.

> > > > I eventually realized that this
> > > > had other effects (i.e., one copy doing size updates vs. the other not
> > > > doing so) and just compared a fixed, full size (1G) preallocation with a
> > > > fixed 4k preallocation to reproduce the boost provided by the former.
> > > 
> > > That only matters for *writeback overhead*, not ingest efficiency.
> > > Indeed, using a preallocated extent:
> > > 
> > 
> > Not sure how we got into physical preallocation here. Assume any
> > reference to "preallocation" in this thread by me refers to post-eof
> > speculative preallocation. The above preallocation tweaks were
> > controlled in my tests via the allocsize mount option, not physical
> > block preallocation.
> 
> I'm just using it to demonstrate the difference is in continually
> extending the delalloc extent in memory. I could have just done an
> overwrite - it's the same thing.
> 
> > > > FWIW, much of this
> > > > discussion also makes me wonder how appropriate the current size limit
> > > > (64k) on preallocation is for today's filesystems and systems (e.g. RAM
> > > > availability), as opposed to something larger (on the order of hundreds
> > > > of MBs for example, perhaps 256-512MB).
> > > 
> > > RAM size is irrelevant. What matters is file size and the impact
> > > of allocation patterns on writeback IO patterns. i.e. the size limit
> > > is about optimising writeback, not preventing fragmentation or
> > > making more efficient use of memory, etc.
> > 
> > I'm just suggesting that the more RAM that is available, the more we're
> > able to write into cache before writeback starts and thus the larger
> > physical extents we're able to allocate independent of speculative
> > preallocation (perf issues notwithstanding).
> 
> ISTR I looked at that years ago and couldn't get it to work
> reliably. It works well for initial ingest, but once the writes go
> on for long enough dirty throttling starts chopping ingest up in
> smaller and smaller chunks as it rotors writeback bandwidth around
> all the processes dirtying the page cache. This chunking happens
> regardless of the size of the file being written. And so the more
> processes that are dirtying the page cache, the smaller the file
> fragments get because each file gets a smaller amount of the overall
> writeback bandwidth each time writeback occurs. i.e.
> fragmentation increases as memory pressure, load and concurrency
> increases, which are exactly the conditions we want to be avoiding
> fragmentation as much as possible...
> 

Ok, well that suggests less aggressive preallocation is at least worth
considering/investigating.
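
For reference, the "auto-grow" behaviour under discussion can be
modeled very roughly as follows. This is a simplification invented for
illustration -- the constants, the throttle thresholds, and the
function name are not the actual xfs_iomap_prealloc_size() logic:

```python
# Toy model: speculative prealloc grows roughly with what is already
# in the file (doubling), is capped, and is throttled (halved) as free
# space drops below a series of low-space thresholds.

def prealloc_blocks(existing_blocks, free_pct, cap=65536):
    """Blocks of speculative prealloc to place past EOF (toy model)."""
    if existing_blocks == 0:
        return 0
    want = existing_blocks * 2               # grow with the file
    # Throttle: halve once per low-free-space threshold crossed.
    for threshold_pct in (5, 4, 3, 2, 1):
        if free_pct < threshold_pct:
            want //= 2
    return min(want, cap)

# Prealloc scales up with file size ...
assert prealloc_blocks(16, 50) < prealloc_blocks(1024, 50)
# ... and is squeezed toward zero as the fs approaches ENOSPC.
assert prealloc_blocks(1024, 0) < prealloc_blocks(1024, 50)
```

The point of the model is just that prealloc size is driven by file
size and free space, not by available RAM or writeback timing.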

> The only way I found to prevent this in a fair and predictable
> manner is the auto-grow algorithm we have now.  There are very few
> real world corner cases where it breaks down, so we do not need
> fundamental changes here. We've found one corner case where it is
> defeated, so let's address that corner case with the minimal change
> that is necessary but otherwise leave the underlying algorithm
> alone so we can observe the longer term effects of the tweak we
> need to make....
> 

I agree that an incremental fix is fine for now. I don't agree that the
current implementation is only defeated by corner cases. The write+fsync
example above is a pretty straightforward usage pattern. Userspace
filesystems are also increasingly common and may implement things like
open file caches that distort the filesystem's perception of how it's
being used, and thus could defeat a release time trim heuristic in
exactly the same manner. I recall that gluster had something like this,
but I'd have to dig into it to see exactly if it applies here..

> > > i.e. when we have lots of small files, we want writeback to pack
> > > them so we get multiple-file sequentialisation of the write stream -
> > > this makes things like untarring a kernel tarball (which is a large
> > > number of small files) a sequential write workload rather than a
> > > seek-per-individual-file-write workload. That makes a massive
> > > difference to performance on spinning disks, and that's what the 64k
> > > threshold (and post-EOF block removal on close for larger files)
> > > tries to preserve.
> > 
> > Sure, but what's the downside to increasing that threshold to even
> > something on the order of MBs? Wouldn't that at least help us leave
> > larger free extents around in those workloads/patterns that do fragment
> > free space?
> 
> Because even for files in the "few MB" size range, worst case
> fragmentation is thousands of extents and IO performance that
> absolutely sucks. Especially on RAID5/6 devices. We have to ensure
> file fragmentation at its worst does not affect filesystem
> throughput, and that means in the general case less than ~1MB sized
> extents is just not acceptable even for smallish files....
> 

Sure, fragmentation is bad. That doesn't answer my question though.
Under what conditions would you expect less aggressive speculative
preallocation in the form of, say, a 1MB threshold to manifest in a level
of fragmentation "where performance absolutely sucks?" Highly parallel
small file (<1MB) copies in a low memory environment perhaps? If so, how
much is low memory? I would think that if we had gobs of RAM, a good
number of 1MB file copies would land in cache completely before files
currently being written start ever being affected by writeback, but I
could be mistaken about reclaim/writeback behavior..

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com