[06/10] dax: provide an iomap based fault handler

Message ID 20160910073646.GA18547@lst.de (mailing list archive)
State New, archived

Commit Message

Christoph Hellwig Sept. 10, 2016, 7:36 a.m. UTC
On Sat, Sep 10, 2016 at 08:55:57AM +1000, Dave Chinner wrote:
> The errors from the above two cases are not acted on. They are
> immediately overwritten by:

Yes, Robert also pointed this out.  Fix below.

> Is there a missing "if (error) goto out;" check somewhere here?

Just the one above.

> I'm also wondering if you've looked at supporting the PMD fault case
> with iomap?

PMD faults currently don't work at all.  Ross has a series to resurrect
them, but we'll need to coordinate between the two series somehow.  My
preference would be to not resurrect them for the bh path and only do
it for the iomap version.


--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Ross Zwisler Sept. 13, 2016, 3:51 p.m. UTC | #1
On Sat, Sep 10, 2016 at 09:36:46AM +0200, Christoph Hellwig wrote:
> On Sat, Sep 10, 2016 at 08:55:57AM +1000, Dave Chinner wrote:
> > The errors from the above two cases are not acted on. They are
> > immediately overwritten by:
> 
> Yes, Robert also pointed this out.  Fix below.
> 
> > Is there a missing "if (error) goto out;" check somewhere here?
> 
> Just the one above.
> 
> > I'm also wondering if you've looked at supporting the PMD fault case
> > with iomap?
> 
> PMD faults currently don't work at all.  Ross has a series to resurrect
> them, but we'll need to coordinate between the two series somehow.  My
> preference would be to not resurrect them for the bh path and only do
> it for the iomap version.

I'm working on this right now.  I expect most/all of the infrastructure
between the bh+get_block_t version and the iomap version to be shared; it'll
just be a matter of having a PMD version of the iomap fault handler.  This
should be pretty minor.

Let's see how it goes, but right now my plan is to have both - I'd like to
keep feature parity between ext2/ext4 and XFS, and that means having PMD
faults in ext4 via bh+get_block_t until they move over to iomap.

Regarding coordination, the PMD v2 series hasn't gotten much review so far, so
I'm not sure it'll go in for v4.9.  At this point I'm planning on just
rebasing on top of your iomap series, though if it gets taken sooner I
wouldn't object.

> diff --git a/fs/dax.c b/fs/dax.c
> index a170a94..5534594 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1440,6 +1440,7 @@ int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  	default:
>  		WARN_ON_ONCE(1);
>  		error = -EIO;
> +		goto unlock_entry;
>  	}
>  
>  	/* Filesystem should not return unwritten buffers to us! */
> _______________________________________________
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm
Christoph Hellwig Sept. 14, 2016, 7:06 a.m. UTC | #2
On Tue, Sep 13, 2016 at 09:51:26AM -0600, Ross Zwisler wrote:
> I'm working on this right now.  I expect most/all of the infrastructure
> between the bh+get_block_t version and the iomap version to be shared; it'll
> just be a matter of having a PMD version of the iomap fault handler.  This
> should be pretty minor.

Yes, I looked at it (although I didn't do any work yet), and the work
should be fairly easy.

> Let's see how it goes, but right now my plan is to have both - I'd like to
> keep feature parity between ext2/ext4 and XFS, and that means having PMD
> faults in ext4 via bh+get_block_t until they move over to iomap.
> 
> Regarding coordination, the PMD v2 series hasn't gotten much review so far, so
> I'm not sure it'll go in for v4.9.  At this point I'm planning on just
> rebasing on top of your iomap series, though if it gets taken sooner I
> wouldn't object.

So let's do iomap first.  I've got stable ext2 support, as well as support
for the block device, although I'm not sure what the proper testing
protocol for that is.  I've started ext4 and read / zero was easy, but
now I'm stuck in the convoluted mess that is the ext4 direct I/O and
DAX path.

Maybe we should get the iomap work into 4.9 and then convert over ext4
as well as adding PMD fault support in the next release.
Christoph Hellwig Sept. 14, 2016, 9:53 a.m. UTC | #3
On Wed, Sep 14, 2016 at 09:06:33AM +0200, Christoph Hellwig wrote:
> So let's do iomap first.  I've got stable ext2 support, as well as support
> for the block device, although I'm not sure what the proper testing
> protocol for that is.

Well, turns out DAX on block devices actually is dead despite a few
leftovers and there is no way to actually test it.  That was easy :)
Ross Zwisler Sept. 23, 2016, 9:02 p.m. UTC | #4
On Wed, Sep 14, 2016 at 09:06:33AM +0200, Christoph Hellwig wrote:
> On Tue, Sep 13, 2016 at 09:51:26AM -0600, Ross Zwisler wrote:
> > I'm working on this right now.  I expect most/all of the infrastructure
> > between the bh+get_block_t version and the iomap version to be shared; it'll
> > just be a matter of having a PMD version of the iomap fault handler.  This
> > should be pretty minor.
> 
> Yes, I looked at it (although I didn't do any work yet), and the work
> should be fairly easy.
> 
> > Let's see how it goes, but right now my plan is to have both - I'd like to
> > keep feature parity between ext2/ext4 and XFS, and that means having PMD
> > faults in ext4 via bh+get_block_t until they move over to iomap.
> > 
> > Regarding coordination, the PMD v2 series hasn't gotten much review so far, so
> > I'm not sure it'll go in for v4.9.  At this point I'm planning on just
> > rebasing on top of your iomap series, though if it gets taken sooner I
> > wouldn't object.
> 
> So let's do iomap first.  I've got stable ext2 support, as well as support
> for the block device, although I'm not sure what the proper testing
> protocol for that is.  I've started ext4 and read / zero was easy, but
> now I'm stuck in the convoluted mess that is the ext4 direct I/O and
> DAX path.
> 
> Maybe we should get the iomap work into 4.9 and then convert over ext4
> as well as adding PMD fault support in the next release.

I was doing some testing of my PMD patches, and was surprised to see that
ext4 + PMDs + generic/074 now takes 23 minutes to complete.  With ext4 and
PTE faults this test takes ~50 seconds in the same setup, and XFS + PMDs
takes 27 seconds.

The root cause is that ext4 is using the direct I/O path for reads and writes,
and that the DIO path thinks it needs to flush dirty data from the radix tree
on each I/O.

Each read ends up writing back PMDs via:

  vfs_read()
    __vfs_read()
      new_sync_read
        generic_file_read_iter()
          filemap_write_and_wait_range()

I believe we have an analogous problem for writes.

This results in us flushing 12 TiB worth of data during the generic/074 test,
one cache line at a time...
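
To put that figure in perspective, here is a rough back-of-the-envelope
sketch.  It assumes 64-byte cache lines, 4 KiB PTE mappings and 2 MiB PMD
mappings (typical for x86-64, but not stated in this thread); the helper
names are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Number of flushes needed to write back `bytes` one cache line at a time. */
uint64_t cache_line_flushes(uint64_t bytes, uint64_t cache_line_size)
{
	return bytes / cache_line_size;
}

/* How many times more data one PMD mapping covers than one PTE mapping. */
uint64_t pmd_to_pte_ratio(uint64_t pmd_size, uint64_t pte_size)
{
	return pmd_size / pte_size;
}
```

Under those assumptions, cache_line_flushes(12ULL << 40, 64) comes to
206158430208, i.e. roughly 2 x 10^11 individual flushes, and
pmd_to_pte_ratio(2ULL << 20, 4096) is 512, so writing back one dirty PMD
entry covers 512 times as much data as writing back one dirty PTE entry.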

With your recent changes to detangle the DAX and DIO paths in XFS, we
avoid this because XFS no longer uses generic_file_read_iter(), but instead
uses xfs_file_write_iter() which skips the writeback and instead just calls
xfs_file_dax_write() to do the I/O.

This is obviously an existing issue in ext4 that we need to address.  Even
with the PTE path we are doing tons of unnecessary flushing, but the move from
flushing PTEs to flushing PMDs is what killed us.

I can just add a hack to hop over the writeback in generic_file_read_iter(),
but I hesitate to do this because it seems like the correct thing to do is to
separate the ext4 DAX & DIO paths, which I think you are already doing.

I believe that my DAX PMD patches are ready to go, but because of this issue
they currently only support XFS.  I'm tempted to send them out as they are
right now since they add a bunch of complexity to DAX that we need to review,
and that review can fully happen with only XFS support. We can add ext4
support back in later when it's ready.

Thoughts?
Christoph Hellwig Sept. 26, 2016, 12:08 a.m. UTC | #5
On Fri, Sep 23, 2016 at 03:02:37PM -0600, Ross Zwisler wrote:
> I can just add a hack to hop over the writeback in generic_file_read_iter(),
> but I hesitate to do this because it seems like the correct thing to do is to
> separate the ext4 DAX & DIO paths, which I think you are already doing.

I've started on it, and the read path works fine.  I'm a bit stuck on
the write side as my attempts at it seem to corrupt data fairly easily
when running xfstests.

> I believe that my DAX PMD patches are ready to go, but because of this issue
> they currently only support XFS.  I'm tempted to send them out as they are
> right now since they add a bunch of complexity to DAX that we need to review,
> and that review can fully happen with only XFS support. We can add ext4
> support back in later when it's ready.

Yes, please send them out.  We really should move ext4 over to iomap for
DAX.  I'd love to get some help from someone more familiar with ext4,
and can send my code dump to get started.
Jan Kara Sept. 26, 2016, 2:28 p.m. UTC | #6
On Mon 26-09-16 02:08:05, Christoph Hellwig wrote:
> On Fri, Sep 23, 2016 at 03:02:37PM -0600, Ross Zwisler wrote:
> > I can just add a hack to hop over the writeback in generic_file_read_iter(),
> > but I hesitate to do this because it seems like the correct thing to do is to
> > separate the ext4 DAX & DIO paths, which I think you are already doing.
> 
> I've started on it, and the read path works fine.  I'm a bit stuck on
> the write side as my attempts at it seem to corrupt data fairly easily
> when running xfstests.

Well, I have a bunch of data corruption fixes for DAX sitting in my tree
for bugs I was able to trigger. E.g. there are some missing invalidations of
blockdev aliases, and also some races between the fault and write paths that
can lead to data loss. But at least the second fix requires clearing of dirty
bits in the radix tree for DAX, and that is a big pile of patches I have to
push through the mm people (and the related mm parts are in constant churn
due to the PMD pages in pagecache work, so it is a mess).

> > I believe that my DAX PMD patches are ready to go, but because of this issue
> > they currently only support XFS.  I'm tempted to send them out as they are
> > right now since they add a bunch of complexity to DAX that we need to review,
> > and that review can fully happen with only XFS support. We can add ext4
> > support back in later when it's ready.
> 
> Yes, please send them out.  We really should move ext4 over to iomap for
> DAX.  I'd love to get some help from someone more familiar with ext4,
> and can send my code dump to get started.

Yeah, I'd also just implement it for the iomap path and XFS as a start, and
then I'll port ext4 over to the iomap infrastructure for DAX. There should be
no problem in principle.

									Honza
Patch

diff --git a/fs/dax.c b/fs/dax.c
index a170a94..5534594 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1440,6 +1440,7 @@  int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	default:
 		WARN_ON_ONCE(1);
 		error = -EIO;
+		goto unlock_entry;
 	}
 
 	/* Filesystem should not return unwritten buffers to us! */
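
The effect of the one-line fix can be illustrated outside the kernel.  The
following is a minimal sketch, not the actual fs/dax.c code: the enum, the
later_step() helper and both function names are hypothetical stand-ins.  It
only shows how an error set in a switch default case is lost when a later
assignment overwrites it, and how jumping to the exit label preserves it.

```c
#include <errno.h>

/* Hypothetical mapping types; stand-ins for the iomap types in fs/dax.c. */
enum map_type { MAP_MAPPED, MAP_HOLE, MAP_BOGUS };

/* Hypothetical later step that always succeeds and returns 0. */
int later_step(void)
{
	return 0;
}

/* Without the goto: -EIO is set, then overwritten by the next assignment. */
int fault_without_goto(enum map_type type)
{
	int error = 0;

	switch (type) {
	case MAP_MAPPED:
	case MAP_HOLE:
		break;
	default:
		error = -EIO;
	}

	error = later_step();	/* clobbers -EIO from the default case */
	return error;
}

/* With the goto: the error skips the later step and reaches the caller. */
int fault_with_goto(enum map_type type)
{
	int error = 0;

	switch (type) {
	case MAP_MAPPED:
	case MAP_HOLE:
		break;
	default:
		error = -EIO;
		goto unlock_entry;
	}

	error = later_step();
unlock_entry:
	return error;
}
```

In this sketch, fault_without_goto(MAP_BOGUS) returns 0 even though the
default case was hit, while fault_with_goto(MAP_BOGUS) returns -EIO.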