[RFC,00/16] xfs: Block size > PAGE_SIZE support
mbox series

Message ID 20181107063127.3902-1-david@fromorbit.com
Headers show
Series
  • xfs: Block size > PAGE_SIZE support
Related show

Message

Dave Chinner Nov. 7, 2018, 6:31 a.m. UTC
Hi folks,

We've had a fair number of problems reported on 64k block size
filesystems of late, but none of the XFS developers have Power or
ARM machines handy to reproduce them or even really test the fixes.

The iomap infrastructure we introduced a while back was designed
with the capabity of block size > page size support in mind, but we
hadn't tried to implement it.

So after another 64k block size bug report late last week I said to
Darrick "How hard could it be"?

About 6 billion (yes, B) fsx ops later, I have most of the XFS
functionality working on 64k block sizes on x86_64.  Buffered
read/write, mmap read/write and direct IO read/write all work. All
the fallocate() operations work correctly, as does truncate. xfsdump
and xfs_restore are happy with it, as is xfs_repair. xfs-scrub
needed some help, but I've tested Darrick's fixes for that quite a
bit over the past few days.

It passes most of xfstests - there's some test failures that I have
to determine whether they are code bugs or test problems (i.e. some
tests don't deal with 64k block sizes correctly or assume block size
<= page size).

What I haven't tested yet is shared extents - the COW path,
clone_file_range and dedupe_file_range. I discovered earlier today
that fsx doesn't support copy/clone/dedupe_file_operations
operations, so before I go any further I need to enxpahnce fsx. Then
fix all the bugs it uncovers on block size <= page size filesystems.
And then I'll move onto adding the rest of the functionality this
patch set requires.

The main addition to the iomap code to support this functionality is
the "zero-around" capability. When the filesystem is mapping a new
block, a delalloc range or an unwritten extent, it sets the
IOMAP_F_ZERO_AROUND flag in the iomap it returns. This tells the
iomap code that it needs to expand the range of the operation being
performed to cover entire blocks. i.e. if the data being written
doesn't span the filesystem block, it needs to instantiate and zero
pages in the page cache to cover the portions of the block the data
doesn't cover.

Because the page cache may already hold data for the regions (e.g.
read over a hole/unwritten extent) the zero-around code does not
zero pages that are already marked up to date. It is assumed that
whatever put those pages into the page cache has already done the
right thing with them.

Yes, this means the unit of page cache IO is still individual pages.
There are no mm/ changes at all, no page cache changes, nothing.
That all still just operates on individual pages and is oblivious to
the fact the filesystem/iomap codeis now processing gangs of pages
at a time instead of just one.

Actually, I stretch the truth there a little - there is one change
to XFS that is important to note here. I removed ->writepage from
XFS (patches 1 and 2). We can't use it for large block sizes because
we want to write whole blocks at a time if they are newly allocated
or unwritten. And really, it's just a nasty hack that gets in the
way of background writeback doing an efficient job of cleaning dirty
pages. So I killed it.

We also need to expose the large block size to stat(2). If we don't,
the applications that use stat.st_bsize for operations that require
block size alignment (e.g. various fallocate ops) then fail because
4k is not the block size of a 64k block size filesystem.

A number of latent bugs in existing code were uncovered as I worked
through this - patches 3-5 fix bugs in XFS and iomap that can be
triggered on existing systems but it's somewhat hard to expose them.

Patches 6-12 introduce all the iomap infrastructure needed to
support block size > page size.

Patches 13-16 introduce the necessary functionality to trigger the
iompa infrastructure, tell userspace the right thing, make sub-block
fsync ranges do the right thing and finally remove the code that
prevents large block size filesystems from mounting on small page
size machines.

It works, it seems pretty robust and runs enough of fstests that
I've already used it to find, fix and test a 64k block size bug in
XFS:

https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git/commit/?h=for-next&id=837514f7a4ca4aca06aec5caa5ff56d33ef06976

I think this is the last of the XFS Irix features we haven't
implemented in Linux XFS - it's only taken us 20 years to
get here, but the end of the tunnel is in sight.

Nah, it's probably a train. Or maybe a flame. :)

Anyway, I'm interested to see what people think of the approach.

Cheers,

Dave.

Comments

Darrick J. Wong Nov. 7, 2018, 5:14 p.m. UTC | #1
On Wed, Nov 07, 2018 at 05:31:11PM +1100, Dave Chinner wrote:
> Hi folks,
> 
> We've had a fair number of problems reported on 64k block size
> filesystems of late, but none of the XFS developers have Power or
> ARM machines handy to reproduce them or even really test the fixes.
> 
> The iomap infrastructure we introduced a while back was designed
> with the capabity of block size > page size support in mind, but we
> hadn't tried to implement it.
> 
> So after another 64k block size bug report late last week I said to
> Darrick "How hard could it be"?

"Nothing is ever simple" :)

> About 6 billion (yes, B) fsx ops later, I have most of the XFS
> functionality working on 64k block sizes on x86_64.  Buffered
> read/write, mmap read/write and direct IO read/write all work. All
> the fallocate() operations work correctly, as does truncate. xfsdump
> and xfs_restore are happy with it, as is xfs_repair. xfs-scrub
> needed some help, but I've tested Darrick's fixes for that quite a
> bit over the past few days.
> 
> It passes most of xfstests - there's some test failures that I have
> to determine whether they are code bugs or test problems (i.e. some
> tests don't deal with 64k block sizes correctly or assume block size
> <= page size).
> 
> What I haven't tested yet is shared extents - the COW path,
> clone_file_range and dedupe_file_range. I discovered earlier today
> that fsx doesn't support copy/clone/dedupe_file_operations
> operations, so before I go any further I need to enxpahnce fsx. Then

I assume that means you only tested this on reflink=0 filesystems?

Looking at fsstress, it looks like we don't test copy_file_range either.
I can try adding the missing clone/dedupe/copy to both programs, but
maybe you've already done that while I was asleep?

--D

> fix all the bugs it uncovers on block size <= page size filesystems.
> And then I'll move onto adding the rest of the functionality this
> patch set requires.
> 
> The main addition to the iomap code to support this functionality is
> the "zero-around" capability. When the filesystem is mapping a new
> block, a delalloc range or an unwritten extent, it sets the
> IOMAP_F_ZERO_AROUND flag in the iomap it returns. This tells the
> iomap code that it needs to expand the range of the operation being
> performed to cover entire blocks. i.e. if the data being written
> doesn't span the filesystem block, it needs to instantiate and zero
> pages in the page cache to cover the portions of the block the data
> doesn't cover.
> 
> Because the page cache may already hold data for the regions (e.g.
> read over a hole/unwritten extent) the zero-around code does not
> zero pages that are already marked up to date. It is assumed that
> whatever put those pages into the page cache has already done the
> right thing with them.
> 
> Yes, this means the unit of page cache IO is still individual pages.
> There are no mm/ changes at all, no page cache changes, nothing.
> That all still just operates on individual pages and is oblivious to
> the fact the filesystem/iomap codeis now processing gangs of pages
> at a time instead of just one.
> 
> Actually, I stretch the truth there a little - there is one change
> to XFS that is important to note here. I removed ->writepage from
> XFS (patches 1 and 2). We can't use it for large block sizes because
> we want to write whole blocks at a time if they are newly allocated
> or unwritten. And really, it's just a nasty hack that gets in the
> way of background writeback doing an efficient job of cleaning dirty
> pages. So I killed it.
> 
> We also need to expose the large block size to stat(2). If we don't,
> the applications that use stat.st_bsize for operations that require
> block size alignment (e.g. various fallocate ops) then fail because
> 4k is not the block size of a 64k block size filesystem.
> 
> A number of latent bugs in existing code were uncovered as I worked
> through this - patches 3-5 fix bugs in XFS and iomap that can be
> triggered on existing systems but it's somewhat hard to expose them.
> 
> Patches 6-12 introduce all the iomap infrastructure needed to
> support block size > page size.
> 
> Patches 13-16 introduce the necessary functionality to trigger the
> iompa infrastructure, tell userspace the right thing, make sub-block
> fsync ranges do the right thing and finally remove the code that
> prevents large block size filesystems from mounting on small page
> size machines.
> 
> It works, it seems pretty robust and runs enough of fstests that
> I've already used it to find, fix and test a 64k block size bug in
> XFS:
> 
> https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git/commit/?h=for-next&id=837514f7a4ca4aca06aec5caa5ff56d33ef06976
> 
> I think this is the last of the XFS Irix features we haven't
> implemented in Linux XFS - it's only taken us 20 years to
> get here, but the end of the tunnel is in sight.
> 
> Nah, it's probably a train. Or maybe a flame. :)
> 
> Anyway, I'm interested to see what people think of the approach.
> 
> Cheers,
> 
> Dave.
> 
>
Dave Chinner Nov. 7, 2018, 10:04 p.m. UTC | #2
On Wed, Nov 07, 2018 at 09:14:05AM -0800, Darrick J. Wong wrote:
> On Wed, Nov 07, 2018 at 05:31:11PM +1100, Dave Chinner wrote:
> > Hi folks,
> > 
> > We've had a fair number of problems reported on 64k block size
> > filesystems of late, but none of the XFS developers have Power or
> > ARM machines handy to reproduce them or even really test the fixes.
> > 
> > The iomap infrastructure we introduced a while back was designed
> > with the capabity of block size > page size support in mind, but we
> > hadn't tried to implement it.
> > 
> > So after another 64k block size bug report late last week I said to
> > Darrick "How hard could it be"?
> 
> "Nothing is ever simple" :)

"It'll only take a couple of minutes!"

> > About 6 billion (yes, B) fsx ops later, I have most of the XFS
> > functionality working on 64k block sizes on x86_64.  Buffered
> > read/write, mmap read/write and direct IO read/write all work. All
> > the fallocate() operations work correctly, as does truncate. xfsdump
> > and xfs_restore are happy with it, as is xfs_repair. xfs-scrub
> > needed some help, but I've tested Darrick's fixes for that quite a
> > bit over the past few days.
> > 
> > It passes most of xfstests - there's some test failures that I have
> > to determine whether they are code bugs or test problems (i.e. some
> > tests don't deal with 64k block sizes correctly or assume block size
> > <= page size).
> > 
> > What I haven't tested yet is shared extents - the COW path,
> > clone_file_range and dedupe_file_range. I discovered earlier today
> > that fsx doesn't support copy/clone/dedupe_file_operations
> > operations, so before I go any further I need to enxpahnce fsx. Then
> 
> I assume that means you only tested this on reflink=0 filesystems?

Correct.

> Looking at fsstress, it looks like we don't test copy_file_range either.
> I can try adding the missing clone/dedupe/copy to both programs, but
> maybe you've already done that while I was asleep?

No, I haven't started on this yet. I've been sleeping. :P

Cheers,

Dave.
Darrick J. Wong Nov. 8, 2018, 1:38 a.m. UTC | #3
On Thu, Nov 08, 2018 at 09:04:41AM +1100, Dave Chinner wrote:
> On Wed, Nov 07, 2018 at 09:14:05AM -0800, Darrick J. Wong wrote:
> > On Wed, Nov 07, 2018 at 05:31:11PM +1100, Dave Chinner wrote:
> > > Hi folks,
> > > 
> > > We've had a fair number of problems reported on 64k block size
> > > filesystems of late, but none of the XFS developers have Power or
> > > ARM machines handy to reproduce them or even really test the fixes.
> > > 
> > > The iomap infrastructure we introduced a while back was designed
> > > with the capabity of block size > page size support in mind, but we
> > > hadn't tried to implement it.
> > > 
> > > So after another 64k block size bug report late last week I said to
> > > Darrick "How hard could it be"?
> > 
> > "Nothing is ever simple" :)
> 
> "It'll only take a couple of minutes!"
> 
> > > About 6 billion (yes, B) fsx ops later, I have most of the XFS
> > > functionality working on 64k block sizes on x86_64.  Buffered
> > > read/write, mmap read/write and direct IO read/write all work. All
> > > the fallocate() operations work correctly, as does truncate. xfsdump
> > > and xfs_restore are happy with it, as is xfs_repair. xfs-scrub
> > > needed some help, but I've tested Darrick's fixes for that quite a
> > > bit over the past few days.
> > > 
> > > It passes most of xfstests - there's some test failures that I have
> > > to determine whether they are code bugs or test problems (i.e. some
> > > tests don't deal with 64k block sizes correctly or assume block size
> > > <= page size).
> > > 
> > > What I haven't tested yet is shared extents - the COW path,
> > > clone_file_range and dedupe_file_range. I discovered earlier today
> > > that fsx doesn't support copy/clone/dedupe_file_operations
> > > operations, so before I go any further I need to enxpahnce fsx. Then
> > 
> > I assume that means you only tested this on reflink=0 filesystems?
> 
> Correct.
> 
> > Looking at fsstress, it looks like we don't test copy_file_range either.
> > I can try adding the missing clone/dedupe/copy to both programs, but
> > maybe you've already done that while I was asleep?
> 
> No, I haven't started on this yet. I've been sleeping. :P

I started wondering if we were missing anything from not having fsx
support clone/dedupe and ended up with:

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fsstress-clone

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Nov. 8, 2018, 9:04 a.m. UTC | #4
On Wed, Nov 07, 2018 at 05:38:43PM -0800, Darrick J. Wong wrote:
> On Thu, Nov 08, 2018 at 09:04:41AM +1100, Dave Chinner wrote:
> > On Wed, Nov 07, 2018 at 09:14:05AM -0800, Darrick J. Wong wrote:
> > > On Wed, Nov 07, 2018 at 05:31:11PM +1100, Dave Chinner wrote:
> > > > Hi folks,
> > > > 
> > > > We've had a fair number of problems reported on 64k block size
> > > > filesystems of late, but none of the XFS developers have Power or
> > > > ARM machines handy to reproduce them or even really test the fixes.
> > > > 
> > > > The iomap infrastructure we introduced a while back was designed
> > > > with the capabity of block size > page size support in mind, but we
> > > > hadn't tried to implement it.
> > > > 
> > > > So after another 64k block size bug report late last week I said to
> > > > Darrick "How hard could it be"?
> > > 
> > > "Nothing is ever simple" :)
> > 
> > "It'll only take a couple of minutes!"
> > 
> > > > About 6 billion (yes, B) fsx ops later, I have most of the XFS
> > > > functionality working on 64k block sizes on x86_64.  Buffered
> > > > read/write, mmap read/write and direct IO read/write all work. All
> > > > the fallocate() operations work correctly, as does truncate. xfsdump
> > > > and xfs_restore are happy with it, as is xfs_repair. xfs-scrub
> > > > needed some help, but I've tested Darrick's fixes for that quite a
> > > > bit over the past few days.
> > > > 
> > > > It passes most of xfstests - there's some test failures that I have
> > > > to determine whether they are code bugs or test problems (i.e. some
> > > > tests don't deal with 64k block sizes correctly or assume block size
> > > > <= page size).
> > > > 
> > > > What I haven't tested yet is shared extents - the COW path,
> > > > clone_file_range and dedupe_file_range. I discovered earlier today
> > > > that fsx doesn't support copy/clone/dedupe_file_operations
> > > > operations, so before I go any further I need to enxpahnce fsx. Then
> > > 
> > > I assume that means you only tested this on reflink=0 filesystems?
> > 
> > Correct.
> > 
> > > Looking at fsstress, it looks like we don't test copy_file_range either.
> > > I can try adding the missing clone/dedupe/copy to both programs, but
> > > maybe you've already done that while I was asleep?
> > 
> > No, I haven't started on this yet. I've been sleeping. :P
> 
> I started wondering if we were missing anything from not having fsx
> support clone/dedupe and ended up with:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fsstress-clone

Some fixes to that below.

I haven't got to testing dedupe or clone - copy_file_range explodes
in under 40 operations in on generic/263. do_splice_direct() looks
to be broken in several different waysat this point.

Cheers,

Dave.
Darrick J. Wong Nov. 8, 2018, 10:17 p.m. UTC | #5
On Thu, Nov 08, 2018 at 08:04:32PM +1100, Dave Chinner wrote:
> On Wed, Nov 07, 2018 at 05:38:43PM -0800, Darrick J. Wong wrote:
> > On Thu, Nov 08, 2018 at 09:04:41AM +1100, Dave Chinner wrote:
> > > On Wed, Nov 07, 2018 at 09:14:05AM -0800, Darrick J. Wong wrote:
> > > > On Wed, Nov 07, 2018 at 05:31:11PM +1100, Dave Chinner wrote:
> > > > > Hi folks,
> > > > > 
> > > > > We've had a fair number of problems reported on 64k block size
> > > > > filesystems of late, but none of the XFS developers have Power or
> > > > > ARM machines handy to reproduce them or even really test the fixes.
> > > > > 
> > > > > The iomap infrastructure we introduced a while back was designed
> > > > > with the capabity of block size > page size support in mind, but we
> > > > > hadn't tried to implement it.
> > > > > 
> > > > > So after another 64k block size bug report late last week I said to
> > > > > Darrick "How hard could it be"?
> > > > 
> > > > "Nothing is ever simple" :)
> > > 
> > > "It'll only take a couple of minutes!"
> > > 
> > > > > About 6 billion (yes, B) fsx ops later, I have most of the XFS
> > > > > functionality working on 64k block sizes on x86_64.  Buffered
> > > > > read/write, mmap read/write and direct IO read/write all work. All
> > > > > the fallocate() operations work correctly, as does truncate. xfsdump
> > > > > and xfs_restore are happy with it, as is xfs_repair. xfs-scrub
> > > > > needed some help, but I've tested Darrick's fixes for that quite a
> > > > > bit over the past few days.
> > > > > 
> > > > > It passes most of xfstests - there's some test failures that I have
> > > > > to determine whether they are code bugs or test problems (i.e. some
> > > > > tests don't deal with 64k block sizes correctly or assume block size
> > > > > <= page size).
> > > > > 
> > > > > What I haven't tested yet is shared extents - the COW path,
> > > > > clone_file_range and dedupe_file_range. I discovered earlier today
> > > > > that fsx doesn't support copy/clone/dedupe_file_operations
> > > > > operations, so before I go any further I need to enxpahnce fsx. Then
> > > > 
> > > > I assume that means you only tested this on reflink=0 filesystems?
> > > 
> > > Correct.
> > > 
> > > > Looking at fsstress, it looks like we don't test copy_file_range either.
> > > > I can try adding the missing clone/dedupe/copy to both programs, but
> > > > maybe you've already done that while I was asleep?
> > > 
> > > No, I haven't started on this yet. I've been sleeping. :P
> > 
> > I started wondering if we were missing anything from not having fsx
> > support clone/dedupe and ended up with:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fsstress-clone
> 
> Some fixes to that below.
> 
> I haven't got to testing dedupe or clone - copy_file_range explodes
> in under 40 operations in on generic/263. do_splice_direct() looks
> to be broken in several different waysat this point.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> fsx: clean up copy/dedupe file range support.
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> copy_file_range() needs to obey read/write constraints otherwise is
> blows up when direct IO is used.
> 
> FIDEDUPERANGE has a completely screwed up API for error reporting.
> The ioctl succeeds even if dedupe fails, so you have to check
> every individual dedupe operations for failure. Without this, dedupe
> "succeeds" on kernels filesystems that don't even support dedupe...
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  ltp/fsx.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/ltp/fsx.c b/ltp/fsx.c
> index fad50e0022af..b51910b8b2e1 100644
> --- a/ltp/fsx.c
> +++ b/ltp/fsx.c
> @@ -1382,7 +1382,11 @@ do_dedupe_range(unsigned offset, unsigned length, unsigned dest)
>  	fdr->info[0].dest_fd = fd;
>  	fdr->info[0].dest_offset = dest;
>  
> -	if (ioctl(fd, FIDEDUPERANGE, fdr) == -1) {
> +	if (ioctl(fd, FIDEDUPERANGE, fdr) == -1 ||
> +	    fdr->info[0].status < 0) {
> +		if (fdr->info[0].status < 0)
> +			errno = -fdr->info[0].status;
> +
>  		if (errno == EOPNOTSUPP || errno == ENOTTY) {
>  			if (!quiet && testcalls > simulatedopcount)
>  				prt("skipping unsupported dedupe range\n");
> @@ -1416,6 +1420,11 @@ do_copy_range(unsigned offset, unsigned length, unsigned dest)
>  	loff_t o1, o2;
>  	ssize_t nr;
>  
> +	offset -= offset % readbdy;
> +	dest -= dest % writebdy;
> +	if (o_direct)
> +		length -= length % readbdy;

Don't we want byte-granularity copies if we're doing buffered copies?

('Want' is such a strong word, maybe I don't want to find out what other
skeletons are lurking in do_splice_direct...)

--D

> +
>  	if (length == 0) {
>  		if (!quiet && testcalls > simulatedopcount)
>  			prt("skipping zero length copy range\n");
Dave Chinner Nov. 8, 2018, 10:22 p.m. UTC | #6
On Thu, Nov 08, 2018 at 02:17:56PM -0800, Darrick J. Wong wrote:
> On Thu, Nov 08, 2018 at 08:04:32PM +1100, Dave Chinner wrote:
> > On Wed, Nov 07, 2018 at 05:38:43PM -0800, Darrick J. Wong wrote:
> > @@ -1416,6 +1420,11 @@ do_copy_range(unsigned offset, unsigned length, unsigned dest)
> >  	loff_t o1, o2;
> >  	ssize_t nr;
> >  
> > +	offset -= offset % readbdy;
> > +	dest -= dest % writebdy;
> > +	if (o_direct)
> > +		length -= length % readbdy;
> 
> Don't we want byte-granularity copies if we're doing buffered copies?

Yes, just don't set -r/-w values when doing buffered IO. O_DIRECT
requires the length to be aligned, too, but buffered IO doesn't,
which is why the "if (o_direct)" case is there.

Cheers,

Dave.