[4/9] btrfs: Switch to iomap_dio_rw() for dio
diff mbox series

Message ID 20200326210254.17647-5-rgoldwyn@suse.de
State New
Headers show
Series
  • btrfs direct-io using iomap
Related show

Commit Message

Goldwyn Rodrigues March 26, 2020, 9:02 p.m. UTC
From: Goldwyn Rodrigues <rgoldwyn@suse.com>

Switch from __blockdev_direct_IO() to iomap_dio_rw().
Rename btrfs_get_blocks_direct() to btrfs_dio_iomap_begin() and use it
as iomap_begin() for iomap direct I/O functions. This function
allocates and locks all the blocks required for the I/O.
btrfs_submit_direct() is used as the submit_io() hook for direct I/O
ops.

Since we need direct I/O reads to go through iomap_dio_rw(), we change
file_operations.read_iter() to a btrfs_file_read_iter() which calls
btrfs_direct_IO() for direct reads and falls back to
generic_file_buffered_read() for incomplete reads and buffered reads.

We don't need address_space.direct_IO() anymore so set it to noop.
Similarly, we don't need flags used in __blockdev_direct_IO(). iomap is
capable of direct I/O reads from a hole, so we don't need to return
-ENOENT.

BTRFS direct I/O is now done under i_rwsem, shared in case of
reads and exclusive in case of writes. This guards against simultaneous
truncates.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/btrfs/ctree.h |   1 +
 fs/btrfs/file.c  |  21 +++++-
 fs/btrfs/inode.c | 190 ++++++++++++++++++++++-------------------------
 3 files changed, 109 insertions(+), 103 deletions(-)

Comments

Christoph Hellwig March 27, 2020, 8:10 a.m. UTC | #1
On Thu, Mar 26, 2020 at 04:02:49PM -0500, Goldwyn Rodrigues wrote:
> BTRFS direct I/O is now done under i_rwsem, shared in case of
> reads and exclusive in case of writes. This guards against simultaneous
> truncates.

Btw, you really want to add the optimization of only taking it shared
for all the easy write cases similar to what XFS has done for ages
and what ext4 picked up now.  Without that performance on someworkloads
is going to be horrible.  That could be a patch on top of this one,
though.

> +/*
> + * btrfs_direct_IO - perform direct I/O
> + * inode->i_rwsem must be locked before calling this function, shared or exclusive.
> + * @iocb - kernel iocb
> + * @iter - iter to/from data is copied

This adds a way too long line.  Also kerneldoc comments go below the
arguments.  Last but not least a lockdep_assert_held is much more useful
than comments trying to document locking patterns..
Goldwyn Rodrigues March 27, 2020, 4:13 p.m. UTC | #2
On  1:10 27/03, Christoph Hellwig wrote:
> On Thu, Mar 26, 2020 at 04:02:49PM -0500, Goldwyn Rodrigues wrote:
> > BTRFS direct I/O is now done under i_rwsem, shared in case of
> > reads and exclusive in case of writes. This guards against simultaneous
> > truncates.
> 
> Btw, you really want to add the optimization of only taking it shared
> for all the easy write cases similar to what XFS has done for ages
> and what ext4 picked up now.  Without that performance on someworkloads
> is going to be horrible.  That could be a patch on top of this one,
> though.

Yes, I will work on that. The idea of dropping the lock seems a little
weird.
Christoph Hellwig May 7, 2020, 6:14 a.m. UTC | #3
What is the status of this series?  I haven't really seen it posted
any time recently, and it would be sad to miss 5.8..
David Sterba May 7, 2020, 11:37 a.m. UTC | #4
On Wed, May 06, 2020 at 11:14:30PM -0700, Christoph Hellwig wrote:
> What is the status of this series?  I haven't really seen it posted
> any time recently, and it would be sad to miss 5.8..

I've been testing it and reporting to Goldwyn, but there are still
problems that don't seem to be fixable for 5.8 target.
Christoph Hellwig May 7, 2020, 12:10 p.m. UTC | #5
On Thu, May 07, 2020 at 01:37:41PM +0200, David Sterba wrote:
> On Wed, May 06, 2020 at 11:14:30PM -0700, Christoph Hellwig wrote:
> > What is the status of this series?  I haven't really seen it posted
> > any time recently, and it would be sad to miss 5.8..
> 
> I've been testing it and reporting to Goldwyn, but there are still
> problems that don't seem to be fixable for 5.8 target.

What are the issues, and how can we help to resolve them?
Goldwyn Rodrigues May 7, 2020, 1:44 p.m. UTC | #6
On  5:10 07/05, Christoph Hellwig wrote:
> On Thu, May 07, 2020 at 01:37:41PM +0200, David Sterba wrote:
> > On Wed, May 06, 2020 at 11:14:30PM -0700, Christoph Hellwig wrote:
> > > What is the status of this series?  I haven't really seen it posted
> > > any time recently, and it would be sad to miss 5.8..
> > 
> > I've been testing it and reporting to Goldwyn, but there are still
> > problems that don't seem to be fixable for 5.8 target.
> 
> What are the issues, and how can we help to resolve them?

The issue we are facing is with generic/475

The problems is that in case of a device error, there are still remnants in
the inode reservation which causes the WARN_ON() during
btrfs_destroy_inode(). I am trying to figure out the precise conditions
that are causing these remnants in only some of the inodes.
Goldwyn Rodrigues May 8, 2020, 3:14 a.m. UTC | #7
On  5:10 07/05, Christoph Hellwig wrote:
> On Thu, May 07, 2020 at 01:37:41PM +0200, David Sterba wrote:
> > On Wed, May 06, 2020 at 11:14:30PM -0700, Christoph Hellwig wrote:
> > > What is the status of this series?  I haven't really seen it posted
> > > any time recently, and it would be sad to miss 5.8..
> > 
> > I've been testing it and reporting to Goldwyn, but there are still
> > problems that don't seem to be fixable for 5.8 target.
> 
> What are the issues, and how can we help to resolve them?

I investigated further and here are my findings.

geenric/475 fails because there are reservations left in the inode's
block reservations system which are not cleared out. So the system
triggers WARN_ON() while performing destroy_inode.

The problem is similar to described in:
50745b0a7f46 ("Btrfs: Direct I/O: Fix space accounting")

To test the theory, I framed an ugly patch of using an extra field
in current-> task_struct to store a number which carries the reservation
currently remaining like the patch does and it works. So what we need is
a way to carry reservation information from btrfs_direct_write() to
iomap's direct ops ->submit_io() where the reservations are consumed.

We cannot use a similar solution of using current->journal_info
because fdatawrite sequence in iomap_dio_rw() uses
current->journal_info.

We cannot perform data reservations and release in iomap_begin() and
iomap_end() for performance and accounting issues.

Ideas welcome.
Christoph Hellwig May 9, 2020, 1:59 p.m. UTC | #8
On Thu, May 07, 2020 at 10:14:05PM -0500, Goldwyn Rodrigues wrote:
> geenric/475 fails because there are reservations left in the inode's
> block reservations system which are not cleared out. So the system
> triggers WARN_ON() while performing destroy_inode.
> 
> The problem is similar to described in:
> 50745b0a7f46 ("Btrfs: Direct I/O: Fix space accounting")
> 
> To test the theory, I framed an ugly patch of using an extra field
> in current-> task_struct to store a number which carries the reservation
> currently remaining like the patch does and it works. So what we need is
> a way to carry reservation information from btrfs_direct_write() to
> iomap's direct ops ->submit_io() where the reservations are consumed.
> 
> We cannot use a similar solution of using current->journal_info
> because fdatawrite sequence in iomap_dio_rw() uses
> current->journal_info.
> 
> We cannot perform data reservations and release in iomap_begin() and
> iomap_end() for performance and accounting issues.

So just drop "btrfs: Use ->iomap_end() instead of btrfs_dio_data"
from the series and be done with it?
Goldwyn Rodrigues May 10, 2020, 4:06 a.m. UTC | #9
On  6:59 09/05, Christoph Hellwig wrote:
> On Thu, May 07, 2020 at 10:14:05PM -0500, Goldwyn Rodrigues wrote:
> > geenric/475 fails because there are reservations left in the inode's
> > block reservations system which are not cleared out. So the system
> > triggers WARN_ON() while performing destroy_inode.
> > 
> > The problem is similar to described in:
> > 50745b0a7f46 ("Btrfs: Direct I/O: Fix space accounting")
> > 
> > To test the theory, I framed an ugly patch of using an extra field
> > in current-> task_struct to store a number which carries the reservation
> > currently remaining like the patch does and it works. So what we need is
> > a way to carry reservation information from btrfs_direct_write() to
> > iomap's direct ops ->submit_io() where the reservations are consumed.
> > 
> > We cannot use a similar solution of using current->journal_info
> > because fdatawrite sequence in iomap_dio_rw() uses
> > current->journal_info.
> > 
> > We cannot perform data reservations and release in iomap_begin() and
> > iomap_end() for performance and accounting issues.
> 
> So just drop "btrfs: Use ->iomap_end() instead of btrfs_dio_data"
> from the series and be done with it?

We are using current->journal_info for fdatawrite sequence hence using
that as a temporary pointer does not work since iomap_dio_rw() performs
the fdatawrite sequence.
Christoph Hellwig May 12, 2020, 2:58 p.m. UTC | #10
On Sat, May 09, 2020 at 11:06:01PM -0500, Goldwyn Rodrigues wrote:
> > > We cannot perform data reservations and release in iomap_begin() and
> > > iomap_end() for performance and accounting issues.
> > 
> > So just drop "btrfs: Use ->iomap_end() instead of btrfs_dio_data"
> > from the series and be done with it?
> 
> We are using current->journal_info for fdatawrite sequence hence using
> that as a temporary pointer does not work since iomap_dio_rw() performs
> the fdatawrite sequence.

Ok. but in that case they never really should have been separate patches.

Can someone help me to understand who consumes the reservation create by
btrfs_delalloc_reserve_space?  Most importantly if this is done by
something called from btrfs_dio_iomap_begin or from btrfs_submit_direct.
Goldwyn Rodrigues May 12, 2020, 5:19 p.m. UTC | #11
On  7:58 12/05, Christoph Hellwig wrote:
> On Sat, May 09, 2020 at 11:06:01PM -0500, Goldwyn Rodrigues wrote:
> > > > We cannot perform data reservations and release in iomap_begin() and
> > > > iomap_end() for performance and accounting issues.
> > > 
> > > So just drop "btrfs: Use ->iomap_end() instead of btrfs_dio_data"
> > > from the series and be done with it?
> > 
> > We are using current->journal_info for fdatawrite sequence hence using
> > that as a temporary pointer does not work since iomap_dio_rw() performs
> > the fdatawrite sequence.
> 
> Ok. but in that case they never really should have been separate patches.
> 

Yes, I realized it when I was dealing with this problem.

> Can someone help me to understand who consumes the reservation create by
> btrfs_delalloc_reserve_space?  Most importantly if this is done by
> something called from btrfs_dio_iomap_begin or from btrfs_submit_direct.

It is consumed in
btrfs_finish_ordered_io()->..btrfs_cow_block()->..btrfs_use_block_rsv().
So, it is a queued work from __end_write_update_ordered().

I am also understanding the way this reservation system works so I may
not be 100% correct.

More details are in the starting comments of fs/btrfs/block-rsv.c
Christoph Hellwig May 15, 2020, 2:13 p.m. UTC | #12
FYI, generic/475 always fail on me for btrfs, due to the warnings on
transaction abort.

Anyway, I have come up with a version that seems to mostly work.

The main change is that btrfs_sync_file stashes away the journal handle.
I also had to merge parts of the ->iomap_end patch into the main iomap
one.  I also did some cleanups to my iomap changes while looking over it.
Let me know what you thing, the tree is here:

    git://git.infradead.org/users/hch/misc.git btrfs-dio

Gitweb:

    http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/btrfs-dio
Goldwyn Rodrigues May 18, 2020, 2:36 p.m. UTC | #13
On  7:13 15/05, Christoph Hellwig wrote:
> 
> FYI, generic/475 always fail on me for btrfs, due to the warnings on
> transaction abort.
> 
> Anyway, I have come up with a version that seems to mostly work.
> 
> The main change is that btrfs_sync_file stashes away the journal handle.

Unfortunately, this will not fly. I tested this week and found
transactions begin in writepage() etc. We could work on stashing journal
handle in those functions as well, but it looks hackish. We might revert
to it as a last resort.

In the meantime, I am trying to revert the allocations in case of an
error.

> I also had to merge parts of the ->iomap_end patch into the main iomap
> one.  I also did some cleanups to my iomap changes while looking over it.
> Let me know what you thing, the tree is here:
> 
>     git://git.infradead.org/users/hch/misc.git btrfs-dio
> 
> Gitweb:
> 
>     http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/btrfs-dio
Goldwyn Rodrigues May 19, 2020, 8:11 p.m. UTC | #14
Hi,

On  7:13 15/05, Christoph Hellwig wrote:
> 
> FYI, generic/475 always fail on me for btrfs, due to the warnings on
> transaction abort.
> 
> Anyway, I have come up with a version that seems to mostly work.
> 
> The main change is that btrfs_sync_file stashes away the journal handle.
> I also had to merge parts of the ->iomap_end patch into the main iomap
> one.  I also did some cleanups to my iomap changes while looking over it.
> Let me know what you thing, the tree is here:
> 
>     git://git.infradead.org/users/hch/misc.git btrfs-dio
> 
> Gitweb:
> 
>     http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/btrfs-dio

I finally managed to fix the reservation issue and the final tree based
on Dave's for-next is at:
https://github.com/goldwynr/linux/tree/dio-merge

I will test it thoroughly and send another patchset.
I will still need that iomap->private!
Christoph Hellwig May 20, 2020, 6:11 a.m. UTC | #15
On Tue, May 19, 2020 at 03:11:16PM -0500, Goldwyn Rodrigues wrote:
> I finally managed to fix the reservation issue and the final tree based
> on Dave's for-next is at:
> https://github.com/goldwynr/linux/tree/dio-merge
> 
> I will test it thoroughly and send another patchset.
> I will still need that iomap->private!

FYI, the reason why I killed ->private is that:

 - it turns out the current pointer one is not used at all
 - the integer one as used in your patch set is maybe a bit cleaner
   handled by a counter maintained in the aio code, as in the version
   of the submit hook patch in my tree.

No a biggie, and we can sort that out later.
David Sterba May 22, 2020, 11:36 a.m. UTC | #16
On Tue, May 19, 2020 at 03:11:16PM -0500, Goldwyn Rodrigues wrote:
> On  7:13 15/05, Christoph Hellwig wrote:
> > FYI, generic/475 always fail on me for btrfs, due to the warnings on
> > transaction abort.
> > 
> > Anyway, I have come up with a version that seems to mostly work.
> > 
> > The main change is that btrfs_sync_file stashes away the journal handle.
> > I also had to merge parts of the ->iomap_end patch into the main iomap
> > one.  I also did some cleanups to my iomap changes while looking over it.
> > Let me know what you thing, the tree is here:
> > 
> >     git://git.infradead.org/users/hch/misc.git btrfs-dio
> > 
> > Gitweb:
> > 
> >     http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/btrfs-dio
> 
> I finally managed to fix the reservation issue and the final tree based
> on Dave's for-next is at:
> https://github.com/goldwynr/linux/tree/dio-merge
> 
> I will test it thoroughly and send another patchset.
> I will still need that iomap->private!

With the updated top commit 6cbb7a0c7b33d33e6 it passes fstests in my
setup, so that's the minimum for inclusion.

Regarding merge, I'm willing to add it to 5.8 queue still. In total it's
7 patches, 6 of which are preparatory or cleanups that have been
reviewed by several people. The switch to iomap is one patch and not a
huge one.

Sending the latest version proably makes sense so we have it in the
mailinglist, I can add the patches to misc-next right away so it gets
more testing exposure.

There have been other changes to our direct IO code so the testing focus
will be there anyway and reverting one or two patches as fallback is an
option, I think the risk of including the patches that close to merge
window is manageable.
Goldwyn Rodrigues May 22, 2020, 12:08 p.m. UTC | #17
On 13:36 22/05, David Sterba wrote:
> 
> With the updated top commit 6cbb7a0c7b33d33e6 it passes fstests in my
> setup, so that's the minimum for inclusion.
> 
> Regarding merge, I'm willing to add it to 5.8 queue still. In total it's
> 7 patches, 6 of which are preparatory or cleanups that have been
> reviewed by several people. The switch to iomap is one patch and not a
> huge one.
> 
> Sending the latest version proably makes sense so we have it in the
> mailinglist, I can add the patches to misc-next right away so it gets
> more testing exposure.
> 

Yes, sending them in some time. Thanks for testing.

Patch
diff mbox series

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 36df977b64d9..0cf65cf1d84f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2949,6 +2949,7 @@  int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
 void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
 					  u64 end, int uptodate);
 extern const struct dentry_operations btrfs_dentry_operations;
+ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter);
 
 /* ioctl.c */
 long btrfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index a16da274c9aa..e76be6472652 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1831,7 +1831,7 @@  static ssize_t __btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	loff_t endbyte;
 	int err;
 
-	written = generic_file_direct_write(iocb, from);
+	written = btrfs_direct_IO(iocb, from);
 
 	if (written < 0 || !iov_iter_count(from))
 		return written;
@@ -3443,9 +3443,26 @@  static int btrfs_file_open(struct inode *inode, struct file *filp)
 	return generic_file_open(inode, filp);
 }
 
+static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+	ssize_t ret = 0;
+
+	if (iocb->ki_flags & IOCB_DIRECT) {
+		struct inode *inode = file_inode(iocb->ki_filp);
+
+		inode_lock_shared(inode);
+		ret = btrfs_direct_IO(iocb, to);
+		inode_unlock_shared(inode);
+		if (ret < 0)
+			return ret;
+	}
+
+	return generic_file_buffered_read(iocb, to, ret);
+}
+
 const struct file_operations btrfs_file_operations = {
 	.llseek		= btrfs_file_llseek,
-	.read_iter      = generic_file_read_iter,
+	.read_iter      = btrfs_file_read_iter,
 	.splice_read	= generic_file_splice_read,
 	.write_iter	= btrfs_file_write_iter,
 	.mmap		= btrfs_file_mmap,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d267eb5caa7b..bb7b3cfd24e8 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -29,6 +29,7 @@ 
 #include <linux/iversion.h>
 #include <linux/swap.h>
 #include <linux/sched/mm.h>
+#include <linux/iomap.h>
 #include <asm/unaligned.h>
 #include "misc.h"
 #include "ctree.h"
@@ -6943,7 +6944,7 @@  noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
 }
 
 static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
-			      struct extent_state **cached_state, int writing)
+			      struct extent_state **cached_state, bool writing)
 {
 	struct btrfs_ordered_extent *ordered;
 	int ret = 0;
@@ -7081,30 +7082,7 @@  static struct extent_map *create_io_em(struct inode *inode, u64 start, u64 len,
 }
 
 
-static int btrfs_get_blocks_direct_read(struct extent_map *em,
-					struct buffer_head *bh_result,
-					struct inode *inode,
-					u64 start, u64 len)
-{
-	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
-
-	if (em->block_start == EXTENT_MAP_HOLE ||
-			test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
-		return -ENOENT;
-
-	len = min(len, em->len - (start - em->start));
-
-	bh_result->b_blocknr = (em->block_start + (start - em->start)) >>
-		inode->i_blkbits;
-	bh_result->b_size = len;
-	bh_result->b_bdev = fs_info->fs_devices->latest_bdev;
-	set_buffer_mapped(bh_result);
-
-	return 0;
-}
-
 static int btrfs_get_blocks_direct_write(struct extent_map **map,
-					 struct buffer_head *bh_result,
 					 struct inode *inode,
 					 struct btrfs_dio_data *dio_data,
 					 u64 start, u64 len)
@@ -7166,7 +7144,6 @@  static int btrfs_get_blocks_direct_write(struct extent_map **map,
 	}
 
 	/* this will cow the extent */
-	len = bh_result->b_size;
 	free_extent_map(em);
 	*map = em = btrfs_new_extent_direct(inode, start, len);
 	if (IS_ERR(em)) {
@@ -7177,15 +7154,6 @@  static int btrfs_get_blocks_direct_write(struct extent_map **map,
 	len = min(len, em->len - (start - em->start));
 
 skip_cow:
-	bh_result->b_blocknr = (em->block_start + (start - em->start)) >>
-		inode->i_blkbits;
-	bh_result->b_size = len;
-	bh_result->b_bdev = fs_info->fs_devices->latest_bdev;
-	set_buffer_mapped(bh_result);
-
-	if (!test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
-		set_buffer_new(bh_result);
-
 	/*
 	 * Need to update the i_size under the extent lock so buffered
 	 * readers will get the updated i_size when we unlock.
@@ -7201,24 +7169,37 @@  static int btrfs_get_blocks_direct_write(struct extent_map **map,
 	return ret;
 }
 
-static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock,
-				   struct buffer_head *bh_result, int create)
+static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
+		loff_t length, unsigned flags, struct iomap *iomap,
+		struct iomap *srcmap)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct extent_map *em;
 	struct extent_state *cached_state = NULL;
 	struct btrfs_dio_data *dio_data = NULL;
-	u64 start = iblock << inode->i_blkbits;
 	u64 lockstart, lockend;
-	u64 len = bh_result->b_size;
+	bool write = !!(flags & IOMAP_WRITE);
 	int ret = 0;
+	u64 len = length;
+	bool unlock_extents = false;
 
-	if (!create)
+	if (!write)
 		len = min_t(u64, len, fs_info->sectorsize);
 
 	lockstart = start;
 	lockend = start + len - 1;
 
+	/*
+	 * The generic stuff only does filemap_write_and_wait_range, which
+	 * isn't enough if we've written compressed pages to this area, so
+	 * we need to flush the dirty pages again to make absolutely sure
+	 * that any outstanding dirty pages are on disk.
+	 */
+	if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
+		     &BTRFS_I(inode)->runtime_flags))
+		ret = filemap_fdatawrite_range(inode->i_mapping, start,
+					 start + length - 1);
+
 	if (current->journal_info) {
 		/*
 		 * Need to pull our outstanding extents and set journal_info to NULL so
@@ -7234,7 +7215,7 @@  static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock,
 	 * this range and we need to fallback to buffered.
 	 */
 	if (lock_extent_direct(inode, lockstart, lockend, &cached_state,
-			       create)) {
+			       write)) {
 		ret = -ENOTBLK;
 		goto err;
 	}
@@ -7266,36 +7247,53 @@  static int btrfs_get_blocks_direct(struct inode *inode, sector_t iblock,
 		goto unlock_err;
 	}
 
-	if (create) {
-		ret = btrfs_get_blocks_direct_write(&em, bh_result, inode,
+	len = min(len, em->len - (start - em->start));
+	if (write) {
+		ret = btrfs_get_blocks_direct_write(&em, inode,
 						    dio_data, start, len);
 		if (ret < 0)
 			goto unlock_err;
-
-		unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart,
-				     lockend, &cached_state);
+		unlock_extents = true;
+		/* Recalc len in case the new em is smaller than requested */
+		len = min(len, em->len - (start - em->start));
+	} else if (em->block_start == EXTENT_MAP_HOLE ||
+			test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) {
+		/* Unlock in case of direct reading from a hole */
+		unlock_extents = true;
 	} else {
-		ret = btrfs_get_blocks_direct_read(em, bh_result, inode,
-						   start, len);
-		/* Can be negative only if we read from a hole */
-		if (ret < 0) {
-			ret = 0;
-			free_extent_map(em);
-			goto unlock_err;
-		}
 		/*
 		 * We need to unlock only the end area that we aren't using.
 		 * The rest is going to be unlocked by the endio routine.
 		 */
-		lockstart = start + bh_result->b_size;
-		if (lockstart < lockend) {
-			unlock_extent_cached(&BTRFS_I(inode)->io_tree,
-					     lockstart, lockend, &cached_state);
-		} else {
-			free_extent_state(cached_state);
-		}
+		lockstart = start + len;
+		if (lockstart < lockend)
+			unlock_extents = true;
 	}
 
+	if (unlock_extents)
+		unlock_extent_cached(&BTRFS_I(inode)->io_tree,
+				lockstart, lockend, &cached_state);
+	else
+		free_extent_state(cached_state);
+
+	/*
+	 * Translate extent map information to iomap
+	 * We trim the extents (and move the addr) even though
+	 * iomap code does that, since we have locked only the parts
+	 * we are performing I/O in.
+	 */
+	if ((em->block_start == EXTENT_MAP_HOLE) ||
+	    (test_bit(EXTENT_FLAG_PREALLOC, &em->flags) && !write)) {
+		iomap->addr = IOMAP_NULL_ADDR;
+		iomap->type = IOMAP_HOLE;
+	} else {
+		iomap->addr = em->block_start + (start - em->start);
+		iomap->type = IOMAP_MAPPED;
+	}
+	iomap->offset = start;
+	iomap->bdev = fs_info->fs_devices->latest_bdev;
+	iomap->length = len;
+
 	free_extent_map(em);
 
 	return 0;
@@ -7662,9 +7660,9 @@  static void btrfs_endio_direct_read(struct bio *bio)
 
 	kfree(dip);
 
-	dio_bio->bi_status = err;
-	dio_end_io(dio_bio);
 	btrfs_io_bio_free_csum(io_bio);
+	dio_bio->bi_status = err;
+	bio_endio(dio_bio);
 	bio_put(bio);
 }
 
@@ -7722,7 +7720,7 @@  static void btrfs_endio_direct_write(struct bio *bio)
 	kfree(dip);
 
 	dio_bio->bi_status = bio->bi_status;
-	dio_end_io(dio_bio);
+	bio_endio(dio_bio);
 	bio_put(bio);
 }
 
@@ -7957,8 +7955,9 @@  static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip)
 	return 0;
 }
 
-static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
-				loff_t file_offset)
+static blk_qc_t btrfs_submit_direct(struct inode *inode,
+		struct iomap *iomap, struct bio *dio_bio,
+		loff_t file_offset)
 {
 	struct btrfs_dio_private *dip = NULL;
 	struct bio *bio = NULL;
@@ -8010,7 +8009,7 @@  static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 
 	ret = btrfs_submit_direct_hook(dip);
 	if (!ret)
-		return;
+		return BLK_QC_T_NONE;
 
 	btrfs_io_bio_free_csum(io_bio);
 
@@ -8029,7 +8028,7 @@  static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 		/*
 		 * The end io callbacks free our dip, do the final put on bio
 		 * and all the cleanup and final put for dio_bio (through
-		 * dio_end_io()).
+		 * end_io()).
 		 */
 		dip = NULL;
 		bio = NULL;
@@ -8044,15 +8043,12 @@  static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 			      file_offset + dio_bio->bi_iter.bi_size - 1);
 
 		dio_bio->bi_status = BLK_STS_IOERR;
-		/*
-		 * Releases and cleans up our dio_bio, no need to bio_put()
-		 * nor bio_endio()/bio_io_error() against dio_bio.
-		 */
-		dio_end_io(dio_bio);
+		bio_endio(dio_bio);
 	}
 	if (bio)
 		bio_put(bio);
 	kfree(dip);
+	return BLK_QC_T_NONE;
 }
 
 static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
@@ -8088,7 +8084,23 @@  static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
 	return retval;
 }
 
-static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
+static const struct iomap_ops btrfs_dio_iomap_ops = {
+	.iomap_begin            = btrfs_dio_iomap_begin,
+};
+
+static const struct iomap_dio_ops btrfs_dops = {
+	.submit_io		= btrfs_submit_direct,
+};
+
+
+/*
+ * btrfs_direct_IO - perform direct I/O
+ * inode->i_rwsem must be locked before calling this function, shared or exclusive.
+ * @iocb - kernel iocb
+ * @iter - iter to/from data is copied
+ */
+
+ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file->f_mapping->host;
@@ -8097,28 +8109,13 @@  static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	struct extent_changeset *data_reserved = NULL;
 	loff_t offset = iocb->ki_pos;
 	size_t count = 0;
-	int flags = 0;
-	bool wakeup = true;
 	bool relock = false;
 	ssize_t ret;
 
 	if (check_direct_IO(fs_info, iter, offset))
 		return 0;
 
-	inode_dio_begin(inode);
-
-	/*
-	 * The generic stuff only does filemap_write_and_wait_range, which
-	 * isn't enough if we've written compressed pages to this area, so
-	 * we need to flush the dirty pages again to make absolutely sure
-	 * that any outstanding dirty pages are on disk.
-	 */
 	count = iov_iter_count(iter);
-	if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
-		     &BTRFS_I(inode)->runtime_flags))
-		filemap_fdatawrite_range(inode->i_mapping, offset,
-					 offset + count - 1);
-
 	if (iov_iter_rw(iter) == WRITE) {
 		/*
 		 * If the write DIO is beyond the EOF, we need update
@@ -8149,17 +8146,11 @@  static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 		dio_data.unsubmitted_oe_range_end = (u64)offset;
 		current->journal_info = &dio_data;
 		down_read(&BTRFS_I(inode)->dio_sem);
-	} else if (test_bit(BTRFS_INODE_READDIO_NEED_LOCK,
-				     &BTRFS_I(inode)->runtime_flags)) {
-		inode_dio_end(inode);
-		flags = DIO_LOCKING | DIO_SKIP_HOLES;
-		wakeup = false;
 	}
 
-	ret = __blockdev_direct_IO(iocb, inode,
-				   fs_info->fs_devices->latest_bdev,
-				   iter, btrfs_get_blocks_direct, NULL,
-				   btrfs_submit_direct, flags);
+	ret = iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops, &btrfs_dops,
+			is_sync_kiocb(iocb));
+
 	if (iov_iter_rw(iter) == WRITE) {
 		up_read(&BTRFS_I(inode)->dio_sem);
 		current->journal_info = NULL;
@@ -8186,11 +8177,8 @@  static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 		btrfs_delalloc_release_extents(BTRFS_I(inode), count);
 	}
 out:
-	if (wakeup)
-		inode_dio_end(inode);
 	if (relock)
 		inode_lock(inode);
-
 	extent_changeset_free(data_reserved);
 	return ret;
 }
@@ -10471,7 +10459,7 @@  static const struct address_space_operations btrfs_aops = {
 	.writepage	= btrfs_writepage,
 	.writepages	= btrfs_writepages,
 	.readpages	= btrfs_readpages,
-	.direct_IO	= btrfs_direct_IO,
+	.direct_IO	= noop_direct_IO,
 	.invalidatepage = btrfs_invalidatepage,
 	.releasepage	= btrfs_releasepage,
 	.set_page_dirty	= btrfs_set_page_dirty,