Message ID | 20181107063127.3902-1-david@fromorbit.com (mailing list archive) |
Series | xfs: Block size > PAGE_SIZE support |
On Wed, Nov 07, 2018 at 05:31:11PM +1100, Dave Chinner wrote:
> Hi folks,
>
> We've had a fair number of problems reported on 64k block size
> filesystems of late, but none of the XFS developers have Power or
> ARM machines handy to reproduce them or even really test the fixes.
>
> The iomap infrastructure we introduced a while back was designed
> with the capability of block size > page size support in mind, but
> we hadn't tried to implement it.
>
> So after another 64k block size bug report late last week I said to
> Darrick "How hard could it be"?

"Nothing is ever simple" :)

> About 6 billion (yes, B) fsx ops later, I have most of the XFS
> functionality working on 64k block sizes on x86_64. Buffered
> read/write, mmap read/write and direct IO read/write all work. All
> the fallocate() operations work correctly, as does truncate. xfsdump
> and xfsrestore are happy with it, as is xfs_repair. xfs_scrub
> needed some help, but I've tested Darrick's fixes for that quite a
> bit over the past few days.
>
> It passes most of xfstests - there are some test failures that I have
> to determine whether they are code bugs or test problems (i.e. some
> tests don't deal with 64k block sizes correctly or assume block size
> <= page size).
>
> What I haven't tested yet is shared extents - the COW path,
> clone_file_range and dedupe_file_range. I discovered earlier today
> that fsx doesn't support copy/clone/dedupe_file_range
> operations, so before I go any further I need to enhance fsx. Then

I assume that means you only tested this on reflink=0 filesystems?

Looking at fsstress, it looks like we don't test copy_file_range
either. I can try adding the missing clone/dedupe/copy to both
programs, but maybe you've already done that while I was asleep?

--D

> fix all the bugs it uncovers on block size <= page size filesystems.
> And then I'll move onto adding the rest of the functionality this
> patch set requires.
> The main addition to the iomap code to support this functionality is
> the "zero-around" capability. When the filesystem is mapping a new
> block, a delalloc range or an unwritten extent, it sets the
> IOMAP_F_ZERO_AROUND flag in the iomap it returns. This tells the
> iomap code that it needs to expand the range of the operation being
> performed to cover entire blocks. i.e. if the data being written
> doesn't span the filesystem block, it needs to instantiate and zero
> pages in the page cache to cover the portions of the block the data
> doesn't cover.
>
> Because the page cache may already hold data for the regions (e.g.
> read over a hole/unwritten extent) the zero-around code does not
> zero pages that are already marked up to date. It is assumed that
> whatever put those pages into the page cache has already done the
> right thing with them.
>
> Yes, this means the unit of page cache IO is still individual pages.
> There are no mm/ changes at all, no page cache changes, nothing.
> That all still just operates on individual pages and is oblivious to
> the fact the filesystem/iomap code is now processing gangs of pages
> at a time instead of just one.
>
> Actually, I stretch the truth there a little - there is one change
> to XFS that is important to note here. I removed ->writepage from
> XFS (patches 1 and 2). We can't use it for large block sizes because
> we want to write whole blocks at a time if they are newly allocated
> or unwritten. And really, it's just a nasty hack that gets in the
> way of background writeback doing an efficient job of cleaning dirty
> pages. So I killed it.
>
> We also need to expose the large block size to stat(2). If we don't,
> applications that use stat.st_blksize for operations that require
> block size alignment (e.g. various fallocate ops) fail because
> 4k is not the block size of a 64k block size filesystem.
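[Editorial aside: the zero-around expansion described above amounts to rounding the written byte range outward to filesystem block boundaries before instantiating and zeroing the surrounding page-cache pages. A minimal sketch of that arithmetic follows; the helper name and signature are illustrative, not the actual iomap code, and a power-of-two block size is assumed.]

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch only - not the actual iomap implementation.
 * Expand a write of [pos, pos + len) outward to filesystem block
 * boundaries, as zero-around must do before zeroing the surrounding
 * page-cache pages. Assumes blocksize is a power of two. */
static void
zero_around_range(uint64_t pos, uint64_t len, uint64_t blocksize,
		  uint64_t *start, uint64_t *end)
{
	*start = pos & ~(blocksize - 1);                        /* round down */
	*end = (pos + len + blocksize - 1) & ~(blocksize - 1);  /* round up */
}
```

Any page in [start, end) that is already up to date is left alone, per the rule above; only the not-up-to-date pages get zeroed.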
> A number of latent bugs in existing code were uncovered as I worked
> through this - patches 3-5 fix bugs in XFS and iomap that can be
> triggered on existing systems, but it's somewhat hard to expose them.
>
> Patches 6-12 introduce all the iomap infrastructure needed to
> support block size > page size.
>
> Patches 13-16 introduce the necessary functionality to trigger the
> iomap infrastructure, tell userspace the right thing, make sub-block
> fsync ranges do the right thing, and finally remove the code that
> prevents large block size filesystems from mounting on small page
> size machines.
>
> It works, it seems pretty robust, and it runs enough of fstests that
> I've already used it to find, fix and test a 64k block size bug in
> XFS:
>
> https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git/commit/?h=for-next&id=837514f7a4ca4aca06aec5caa5ff56d33ef06976
>
> I think this is the last of the XFS Irix features we haven't
> implemented in Linux XFS - it's only taken us 20 years to
> get here, but the end of the tunnel is in sight.
>
> Nah, it's probably a train. Or maybe a flame. :)
>
> Anyway, I'm interested to see what people think of the approach.
>
> Cheers,
>
> Dave.
On Wed, Nov 07, 2018 at 09:14:05AM -0800, Darrick J. Wong wrote:
> On Wed, Nov 07, 2018 at 05:31:11PM +1100, Dave Chinner wrote:
> > So after another 64k block size bug report late last week I said to
> > Darrick "How hard could it be"?
>
> "Nothing is ever simple" :)

"It'll only take a couple of minutes!"

[...]

> I assume that means you only tested this on reflink=0 filesystems?

Correct.

> Looking at fsstress, it looks like we don't test copy_file_range either.
> I can try adding the missing clone/dedupe/copy to both programs, but
> maybe you've already done that while I was asleep?

No, I haven't started on this yet. I've been sleeping. :P

Cheers,

Dave.
On Thu, Nov 08, 2018 at 09:04:41AM +1100, Dave Chinner wrote:
> On Wed, Nov 07, 2018 at 09:14:05AM -0800, Darrick J. Wong wrote:
> > On Wed, Nov 07, 2018 at 05:31:11PM +1100, Dave Chinner wrote:
[...]
> > I can try adding the missing clone/dedupe/copy to both programs, but
> > maybe you've already done that while I was asleep?
>
> No, I haven't started on this yet. I've been sleeping. :P

I started wondering if we were missing anything from not having fsx
support clone/dedupe and ended up with:

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fsstress-clone

--D

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
On Wed, Nov 07, 2018 at 05:38:43PM -0800, Darrick J. Wong wrote:
> On Thu, Nov 08, 2018 at 09:04:41AM +1100, Dave Chinner wrote:
> > On Wed, Nov 07, 2018 at 09:14:05AM -0800, Darrick J. Wong wrote:
[...]
> I started wondering if we were missing anything from not having fsx
> support clone/dedupe and ended up with:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fsstress-clone

Some fixes to that below.

I haven't got to testing dedupe or clone - copy_file_range explodes
in under 40 operations on generic/263. do_splice_direct() looks
to be broken in several different ways at this point.

Cheers,

Dave.
On Thu, Nov 08, 2018 at 08:04:32PM +1100, Dave Chinner wrote:
> On Wed, Nov 07, 2018 at 05:38:43PM -0800, Darrick J. Wong wrote:
[...]
> > I started wondering if we were missing anything from not having fsx
> > support clone/dedupe and ended up with:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fsstress-clone
>
> Some fixes to that below.
>
> I haven't got to testing dedupe or clone - copy_file_range explodes
> in under 40 operations on generic/263. do_splice_direct() looks
> to be broken in several different ways at this point.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
> fsx: clean up copy/dedupe file range support.
>
> From: Dave Chinner <dchinner@redhat.com>
>
> copy_file_range() needs to obey read/write constraints, otherwise it
> blows up when direct IO is used.
>
> FIDEDUPERANGE has a completely screwed up API for error reporting.
> The ioctl succeeds even if dedupe fails, so you have to check
> every individual dedupe operation for failure. Without this, dedupe
> "succeeds" on kernels/filesystems that don't even support dedupe...
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  ltp/fsx.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/ltp/fsx.c b/ltp/fsx.c
> index fad50e0022af..b51910b8b2e1 100644
> --- a/ltp/fsx.c
> +++ b/ltp/fsx.c
> @@ -1382,7 +1382,11 @@ do_dedupe_range(unsigned offset, unsigned length, unsigned dest)
>  	fdr->info[0].dest_fd = fd;
>  	fdr->info[0].dest_offset = dest;
>
> -	if (ioctl(fd, FIDEDUPERANGE, fdr) == -1) {
> +	if (ioctl(fd, FIDEDUPERANGE, fdr) == -1 ||
> +	    fdr->info[0].status < 0) {
> +		if (fdr->info[0].status < 0)
> +			errno = -fdr->info[0].status;
> +
>  		if (errno == EOPNOTSUPP || errno == ENOTTY) {
>  			if (!quiet && testcalls > simulatedopcount)
>  				prt("skipping unsupported dedupe range\n");
> @@ -1416,6 +1420,11 @@ do_copy_range(unsigned offset, unsigned length, unsigned dest)
>  	loff_t o1, o2;
>  	ssize_t nr;
>
> +	offset -= offset % readbdy;
> +	dest -= dest % writebdy;
> +	if (o_direct)
> +		length -= length % readbdy;

Don't we want byte-granularity copies if we're doing buffered copies?

('Want' is such a strong word, maybe I don't want to find out what
other skeletons are lurking in do_splice_direct...)

--D

> +
>  	if (length == 0) {
>  		if (!quiet && testcalls > simulatedopcount)
>  			prt("skipping zero length copy range\n");
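[Editorial aside: the FIDEDUPERANGE quirk Dave's patch works around is that the ioctl can return 0 while individual ranges failed, with the per-range result reported in `info[i].status` (negative errno, or a "data differs" code). A standalone sketch of that status-translation logic follows; the struct is a minimal local stand-in for the fields of the kernel's `struct file_dedupe_range_info` (the real definition lives in `<linux/fs.h>`), and mapping "differs" to EINVAL is an illustrative choice, not kernel behaviour.]

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Local stand-in for the relevant fields of the kernel's
 * struct file_dedupe_range_info. */
struct dedupe_info {
	int32_t status;   /* 0 = deduped, 1 = data differed, <0 = -errno */
};

#define DEDUPE_RANGE_SAME	0
#define DEDUPE_RANGE_DIFFERS	1

/* FIDEDUPERANGE reports per-range errors in info[i].status even when
 * the ioctl itself returns 0, so every entry must be checked - this is
 * the pattern Dave's fsx fix applies to info[0]. Returns 0 on success,
 * -1 with errno set otherwise. */
static int
check_dedupe_status(int ioctl_ret, const struct dedupe_info *info, int n)
{
	int i;

	if (ioctl_ret == -1)
		return -1;	/* errno already set by ioctl() */

	for (i = 0; i < n; i++) {
		if (info[i].status < 0) {
			errno = -info[i].status;  /* e.g. -EOPNOTSUPP */
			return -1;
		}
		if (info[i].status == DEDUPE_RANGE_DIFFERS) {
			errno = EINVAL;		  /* illustrative choice */
			return -1;
		}
	}
	return 0;
}
```

Without the status check, a kernel or filesystem with no dedupe support at all can appear to "succeed", exactly as the patch description says.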
On Thu, Nov 08, 2018 at 02:17:56PM -0800, Darrick J. Wong wrote:
> On Thu, Nov 08, 2018 at 08:04:32PM +1100, Dave Chinner wrote:
> > On Wed, Nov 07, 2018 at 05:38:43PM -0800, Darrick J. Wong wrote:
> > @@ -1416,6 +1420,11 @@ do_copy_range(unsigned offset, unsigned length, unsigned dest)
> >  	loff_t o1, o2;
> >  	ssize_t nr;
> >
> > +	offset -= offset % readbdy;
> > +	dest -= dest % writebdy;
> > +	if (o_direct)
> > +		length -= length % readbdy;
>
> Don't we want byte-granularity copies if we're doing buffered copies?

Yes, just don't set -r/-w values when doing buffered IO. O_DIRECT
requires the length to be aligned, too, but buffered IO doesn't, which
is why the "if (o_direct)" case is there.

Cheers,

Dave.
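[Editorial aside: the alignment rule Dave describes - offsets always trimmed to the -r/-w boundaries, length trimmed only for O_DIRECT, since buffered IO accepts byte-granular lengths - can be sketched standalone. The function name is made up; the variable names mirror fsx's readbdy/writebdy alignment options.]

```c
#include <assert.h>

/* Illustrative sketch of the alignment trimming in the fsx hunk above.
 * Offsets are rounded down to their respective IO boundaries; the
 * length is only rounded down when O_DIRECT is in use. */
static void
align_copy_range(unsigned *offset, unsigned *dest, unsigned *length,
		 unsigned readbdy, unsigned writebdy, int o_direct)
{
	*offset -= *offset % readbdy;	/* source read alignment */
	*dest -= *dest % writebdy;	/* destination write alignment */
	if (o_direct)
		*length -= *length % readbdy;
}
```

With readbdy/writebdy left at 1 (the default when -r/-w aren't set), the trimming is a no-op, matching Dave's "just don't set -r/-w values when doing buffered IO".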