[4/5] iomap: implement direct I/O

Message ID 1478276479-10749-5-git-send-email-hch@lst.de (mailing list archive)
State Superseded, archived

Commit Message

Christoph Hellwig Nov. 4, 2016, 4:21 p.m. UTC
This adds a full fledged direct I/O implementation using the iomap
interface. Full fledged in this case means all features are supported:
AIO, vectored I/O, any iov_iter type including kernel pointers, bvecs
and pipes, support for hole filling and async appending writes.  It does
not mean supporting all the warts of the old generic code.  We expect
i_rwsem to be held over the duration of the call, and we expect to
maintain i_dio_count ourselves, and we pass on any kind of mapping
to the file system for now.

The algorithm used is very simple: We use iomap_apply to iterate over
the range of the I/O, and then we use the new bio_iov_iter_get_pages
helper to lock down the user range for the size of the extent.
bio_iov_iter_get_pages can currently lock down twice as many pages as
the old direct I/O code did, which means that we will have a better
batch factor for everything but overwrites of badly fragmented files.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap.c            | 358 ++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/iomap.h |   8 ++
 2 files changed, 366 insertions(+)
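
For orientation, here is a condensed sketch of the submission loop the
commit message describes, boiled down from the iomap_dio_rw() hunk in
the patch below (locking, plugging, page-cache invalidation and error
handling are trimmed; dio_loop_sketch is an illustrative name, not a
function from the patch):

	static ssize_t
	dio_loop_sketch(struct inode *inode, struct iov_iter *iter, loff_t pos,
			unsigned flags, struct iomap_ops *ops, struct iomap_dio *dio)
	{
		size_t count = iov_iter_count(iter);
		loff_t ret;

		do {
			/*
			 * Map one extent, then let the actor lock down the
			 * user pages and submit bios for exactly that extent.
			 */
			ret = iomap_apply(inode, pos, count, flags, ops, dio,
					iomap_dio_actor);
			if (ret <= 0)
				break;	/* error, or -ENOTBLK buffered fallback */
			pos += ret;	/* advance past the submitted extent */
		} while ((count = iov_iter_count(iter)) > 0);

		return ret;
	}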

Comments

Jens Axboe Nov. 4, 2016, 8:56 p.m. UTC | #1
On Fri, Nov 04 2016, Christoph Hellwig wrote:
> This adds a full fledged direct I/O implementation using the iomap
> interface. Full fledged in this case means all features are supported:
> AIO, vectored I/O, any iov_iter type including kernel pointers, bvecs
> and pipes, support for hole filling and async appending writes.  It does
> not mean supporting all the warts of the old generic code.  We expect
> i_rwsem to be held over the duration of the call, and we expect to
> maintain i_dio_count ourselves, and we pass on any kind of mapping
> to the file system for now.
> 
> The algorithm used is very simple: We use iomap_apply to iterate over
> the range of the I/O, and then we use the new bio_iov_iter_get_pages
> helper to lock down the user range for the size of the extent.
> bio_iov_iter_get_pages can currently lock down twice as many pages as
> the old direct I/O code did, which means that we will have a better
> batch factor for everything but overwrites of badly fragmented files.

This looks pretty good, I'll give it a whirl on my fast(er) devices.

One question, since it isn't commented and isn't immediately obvious to
me - what's the purpose of using the cmpxchg() on dio->error?
Kent Overstreet Nov. 5, 2016, 3:10 p.m. UTC | #2
On Fri, Nov 04, 2016 at 10:21:18AM -0600, Christoph Hellwig wrote:
> This adds a full fledged direct I/O implementation using the iomap
> interface. Full fledged in this case means all features are supported:
> AIO, vectored I/O, any iov_iter type including kernel pointers, bvecs
> and pipes, support for hole filling and async appending writes.  It does
> not mean supporting all the warts of the old generic code.  We expect
> i_rwsem to be held over the duration of the call, and we expect to
> maintain i_dio_count ourselves, and we pass on any kind of mapping
> to the file system for now.
> 
> The algorithm used is very simple: We use iomap_apply to iterate over
> the range of the I/O, and then we use the new bio_iov_iter_get_pages
> helper to lock down the user range for the size of the extent.
> bio_iov_iter_get_pages can currently lock down twice as many pages as
> the old direct I/O code did, which means that we will have a better
> batch factor for everything but overwrites of badly fragmented files.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com>
Avi Kivity Nov. 5, 2016, 6:14 p.m. UTC | #3
Hello,


On 11/04/2016 06:21 PM, Christoph Hellwig wrote:
> This adds a full fledged direct I/O implementation using the iomap
> interface. Full fledged in this case means all features are supported:
> AIO, vectored I/O, any iov_iter type including kernel pointers, bvecs
> and pipes, support for hole filling and async appending writes.

Does this include support for more than one concurrent appending write 
for the same file?

If so, that's great, but please make this feature discoverable. Right 
now applications have to benchmark or guess how fully asynchronous the 
aio implementation is.

Christoph Hellwig Nov. 6, 2016, 4:36 p.m. UTC | #4
> Does this include support for more than one concurrent appending write for 
> the same file?

No.
Dave Chinner Nov. 6, 2016, 10:40 p.m. UTC | #5
On Fri, Nov 04, 2016 at 10:21:18AM -0600, Christoph Hellwig wrote:
> This adds a full fledged direct I/O implementation using the iomap
> interface. Full fledged in this case means all features are supported:
> AIO, vectored I/O, any iov_iter type including kernel pointers, bvecs
> and pipes, support for hole filling and async appending writes.  It does
> not mean supporting all the warts of the old generic code.  We expect
> i_rwsem to be held over the duration of the call, and we expect to
> maintain i_dio_count ourselves, and we pass on any kind of mapping
> to the file system for now.
> 
> The algorithm used is very simple: We use iomap_apply to iterate over
> the range of the I/O, and then we use the new bio_iov_iter_get_pages
> helper to lock down the user range for the size of the extent.
> bio_iov_iter_get_pages can currently lock down twice as many pages as
> the old direct I/O code did, which means that we will have a better
> batch factor for everything but overwrites of badly fragmented files.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
.....
> +static loff_t
> +iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
> +		void *data, struct iomap *iomap)
> +{
> +	struct iomap_dio *dio = data;
> +	unsigned blkbits = blksize_bits(bdev_logical_block_size(iomap->bdev));
> +	unsigned fs_block_size = (1 << inode->i_blkbits), pad;
> +	struct iov_iter iter = *dio->submit.iter;
> +	struct bio *bio;
> +	bool may_zero = false;
> +	int nr_pages, ret;
> +
> +	if ((pos | length | iov_iter_alignment(&iter)) & ((1 << blkbits) - 1))
> +		return -EINVAL;
> +
> +	switch (iomap->type) {
> +	case IOMAP_HOLE:
> +		/*
> +		 * We return -ENOTBLK to fall back to buffered I/O for file
> +		 * systems that can't fill holes from direct writes.
> +		 */
> +		if (dio->flags & IOMAP_DIO_WRITE)
> +			return -ENOTBLK;
> +		/*FALLTHRU*/

This is preventing direct IO writes from being done into holes for
all filesystems.

> +	case IOMAP_UNWRITTEN:
> +		if (!(dio->flags & IOMAP_DIO_WRITE)) {
> +			iov_iter_zero(length, dio->submit.iter);
> +			dio->size += length;
> +			return length;
> +		}
> +		dio->flags |= IOMAP_DIO_UNWRITTEN;
> +		may_zero = true;
> +		break;
> +	case IOMAP_MAPPED:
> +		if (iomap->flags & IOMAP_F_SHARED)
> +			dio->flags |= IOMAP_DIO_COW;
> +		if (iomap->flags & IOMAP_F_NEW)
> +			may_zero = true;
> +		break;
> +	default:
> +		WARN_ON_ONCE(1);
> +		return -EIO;
> +	}
> +
> +	iov_iter_truncate(&iter, length);

Won't this truncate the entire DIO down to the length of the first
extent that is mapped? 

> +	if (may_zero) {
> +		pad = pos & (fs_block_size - 1);
> +		if (pad)
> +			iomap_dio_zero(dio, iomap, pos, fs_block_size - pad);
> +	}

Repeated zeroing code. Helper function?

> +	inode_dio_begin(inode);
> +
> +	blk_start_plug(&plug);
> +	do {
> +		ret = iomap_apply(inode, pos, count, flags, ops, dio,
> +				iomap_dio_actor);
> +		if (ret <= 0) {
> +			/* magic error code to fall back to buffered I/O */
> +			if (ret == -ENOTBLK)
> +				ret = 0;
> +			break;
> +		}
> +		pos += ret;
> +	} while ((count = iov_iter_count(iter)) > 0);
> +	blk_finish_plug(&plug);
> +
> +	if (ret < 0)
> +		cmpxchg(&dio->error, 0, ret);

Why cmpxchg? What are we racing with here? Helper (e.g.
dio_set_error())?

> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -49,6 +49,7 @@ struct iomap {
>  #define IOMAP_WRITE		(1 << 0) /* writing, must allocate blocks */
>  #define IOMAP_ZERO		(1 << 1) /* zeroing operation, may skip holes */
>  #define IOMAP_REPORT		(1 << 2) /* report extent status, e.g. FIEMAP */
> +#define IOMAP_DIRECT		(1 << 3)

Comment describing use?

>  struct iomap_ops {
>  	/*
> @@ -82,4 +83,11 @@ int iomap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
>  int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
>  		loff_t start, loff_t len, struct iomap_ops *ops);
>  
> +#define IOMAP_DIO_UNWRITTEN	(1 << 0)
> +#define IOMAP_DIO_COW		(1 << 1)
> +typedef int (iomap_dio_end_io_t)(struct kiocb *iocb, ssize_t ret,
> +		unsigned flags);
> +ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> +		struct iomap_ops *ops, iomap_dio_end_io_t end_io);

Comment on the context the new flags are used under and what they
mean?

Cheers,

Dave.
Christoph Hellwig Nov. 7, 2016, 3:08 p.m. UTC | #6
On Mon, Nov 07, 2016 at 09:40:49AM +1100, Dave Chinner wrote:
> > +	case IOMAP_HOLE:
> > +		/*
> > +		 * We return -ENOTBLK to fall back to buffered I/O for file
> > +		 * systems that can't fill holes from direct writes.
> > +		 */
> > +		if (dio->flags & IOMAP_DIO_WRITE)
> > +			return -ENOTBLK;
> > +		/*FALLTHRU*/
> 
> This is preventing direct IO writes from being done into holes for
> all filesystems.

It's not.  Hint:  the whole iomap code very much assumes a file system
fills holes before applying the actor on writes.

That being said I should remove this check - as-is it's dead, untested
code that I only used for my aborted ext2 conversion, so we're better
off not having it.

> > +	iov_iter_truncate(&iter, length);
> 
> Won't this truncate the entire DIO down to the length of the first
> extent that is mapped? 

It truncates a copy of the main iter down to the length of the extent
we're working on.  That allows us to limit all the iov_iter based helpers
(most importantly get_user_pages) to only operate on a given extent.
Later in the function we then advance the primary iter when moving to
the next extent.
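
In code, the pattern looks like this (condensed from iomap_dio_actor()
in the patch; the comments are added here for illustration):

	struct iov_iter iter = *dio->submit.iter;	/* local copy */

	iov_iter_truncate(&iter, length);	/* bound the copy to this extent */

	/* ... bio_iov_iter_get_pages() and friends consume only &iter ... */

	iov_iter_advance(dio->submit.iter, length);	/* move the primary iter */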

> 
> > +	if (may_zero) {
> > +		pad = pos & (fs_block_size - 1);
> > +		if (pad)
> > +			iomap_dio_zero(dio, iomap, pos, fs_block_size - pad);
> > +	}
> 
> Repeated zeroing code. helper function?

The actual repeated code is in iomap_dio_zero.  Because we zero the
beginning of a block in one case and the end of it in the other, the
arithmetic looks similar but is actually different.  We could use a
trick like the end parameter to dio_zero_block in the old dio code to
save a line or two, but I think that's highly confusing to the reader.
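
For reference, the two computations from the patch side by side; the
masking is identical, but the zeroed ranges are not (head of the fs
block before the data, tail of the fs block after it):

	/* zero from the start of the fs block up to pos */
	pad = pos & (fs_block_size - 1);
	if (pad)
		iomap_dio_zero(dio, iomap, pos - pad, pad);

	/* ... submit the data bios, advancing pos ... */

	/* zero from pos to the end of the fs block */
	pad = pos & (fs_block_size - 1);
	if (pad)
		iomap_dio_zero(dio, iomap, pos, fs_block_size - pad);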

> > +	do {
> > +		ret = iomap_apply(inode, pos, count, flags, ops, dio,
> > +				iomap_dio_actor);
> > +		if (ret <= 0) {
> > +			/* magic error code to fall back to buffered I/O */
> > +			if (ret == -ENOTBLK)
> > +				ret = 0;
> > +			break;
> > +		}
> > +		pos += ret;
> > +	} while ((count = iov_iter_count(iter)) > 0);
> > +	blk_finish_plug(&plug);
> > +
> > +	if (ret < 0)
> > +		cmpxchg(&dio->error, 0, ret);
> 
> Why cmpxchg? What are we racing with here? Helper (e.g.
> dio_set_error())?

The submission thread against I/O completions (which in the worst
case could come from multiple threads as well).  Same reason as
the one in xfs_buf_bio_end_io added in commit 9bdd9bd69b
("xfs: buffer ->bi_end_io function requires irq-safe lock")

> Comment describing use?

Sure.

> Comment on the context the new flags are used under and what they
> mean?

Ok.
Dave Chinner Nov. 8, 2016, 1:38 a.m. UTC | #7
On Mon, Nov 07, 2016 at 04:08:07PM +0100, Christoph Hellwig wrote:
> On Mon, Nov 07, 2016 at 09:40:49AM +1100, Dave Chinner wrote:
> > > +	case IOMAP_HOLE:
> > > +		/*
> > > +		 * We return -ENOTBLK to fall back to buffered I/O for file
> > > +		 * systems that can't fill holes from direct writes.
> > > +		 */
> > > +		if (dio->flags & IOMAP_DIO_WRITE)
> > > +			return -ENOTBLK;
> > > +		/*FALLTHRU*/
> > 
> > This is preventing direct IO writes from being done into holes for
> > all filesystems.
> 
> It's not.  Hint:  the whole iomap code very much assumes a file system
> fills holes before applying the actor on writes.
> 
> That being said I should remove this check - as-is it's dead, untested
> code that I only used for my aborted ext2 conversion, so we're better
> off not having it.

Yup, agreed.

> > > +	iov_iter_truncate(&iter, length);
> > 
> > Won't this truncate the entire DIO down to the length of the first
> > extent that is mapped? 
> 
> It truncates a copy of the main iter down to the length of the extent
> we're working on.  That allows us to limit all the iov_iter based helper
> (most importantly get_user_pages) to only operate on a given extent.
> Later in the function we then advance the primary iter when moving to
> the next extent.

Hmmm, I must be missing something here. iomap_dio_rw() stores a
pointer to the primary iter in the dio structure, and that gets
passed to the actor function, and then it....

Oh, bloody hell, Christoph! :/ You hid a structure copy in the
variable initialisations and used the same variable name for the
copy as the primary pointer:

	struct iov_iter iter = *dio->submit.iter;

That's really subtle and easy for idiots like me to miss when
reading the code. Please make it clear that we're working on
a copy of the primary iter here, not the primary iter itself.
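
One way to make the copy explicit at the declaration site (illustrative
comment, not from a posted revision):

	/*
	 * Take a snapshot of the primary iter; everything below operates on
	 * this bounded copy, and the primary iter is only advanced once the
	 * extent has been fully submitted.
	 */
	struct iov_iter iter = *dio->submit.iter;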

> > > +		if (ret <= 0) {
> > > +			/* magic error code to fall back to buffered I/O */
> > > +			if (ret == -ENOTBLK)
> > > +				ret = 0;
> > > +			break;
> > > +		}
> > > +		pos += ret;
> > > +	} while ((count = iov_iter_count(iter)) > 0);
> > > +	blk_finish_plug(&plug);
> > > +
> > > +	if (ret < 0)
> > > +		cmpxchg(&dio->error, 0, ret);
> > 
> > Why cmpxchg? What are we racing with here? Helper (e.g.
> > dio_set_error())?
> 
> The submission thread against I/O completions (which in the worst
> case could come from multiple threads as well).  Same reason as
> the one in xfs_buf_bio_end_io added in commit 9bdd9bd69b
> ("xfs: buffer ->bi_end_io function requires irq-safe lock")

Yup, that's what I suspected - a comment is needed at least, though
IMO a helper w/ comment is the most maintainable approach here.

Cheers,

Dave.

Patch

diff --git a/fs/iomap.c b/fs/iomap.c
index a8ee8c3..630ab0e 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -24,6 +24,7 @@ 
 #include <linux/uio.h>
 #include <linux/backing-dev.h>
 #include <linux/buffer_head.h>
+#include <linux/task_io_accounting_ops.h>
 #include <linux/dax.h>
 #include "internal.h"
 
@@ -583,3 +584,360 @@  int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fi,
 	return 0;
 }
 EXPORT_SYMBOL_GPL(iomap_fiemap);
+
+/*
+ * Private flags for iomap_dio, must not overlap with the public ones in
+ * iomap.h:
+ */
+#define IOMAP_DIO_WRITE		(1 << 30)
+#define IOMAP_DIO_DIRTY	(1 << 31)
+
+struct iomap_dio {
+	struct kiocb		*iocb;
+	iomap_dio_end_io_t	*end_io;
+	loff_t			i_size;
+	loff_t			size;
+	atomic_t		ref;
+	unsigned		flags;
+	int			error;
+
+	union {
+		/* used during submission and for synchronous completion: */
+		struct {
+			struct iov_iter		*iter;
+			struct task_struct	*waiter;
+			struct request_queue	*last_queue;
+			blk_qc_t		cookie;
+		} submit;
+
+		/* used for aio completion: */
+		struct {
+			struct work_struct	work;
+		} aio;
+	};
+};
+
+static ssize_t iomap_dio_complete(struct iomap_dio *dio)
+{
+	struct kiocb *iocb = dio->iocb;
+	ssize_t ret;
+
+	if (dio->end_io) {
+		ret = dio->end_io(iocb,
+				dio->error ? dio->error : dio->size,
+				dio->flags);
+	} else {
+		ret = dio->error;
+	}
+
+	if (likely(!ret)) {
+		ret = dio->size;
+		/* check for short read */
+		if (iocb->ki_pos + ret > dio->i_size &&
+		    !(dio->flags & IOMAP_DIO_WRITE))
+			ret = dio->i_size - iocb->ki_pos;
+		iocb->ki_pos += ret;
+	}
+
+	inode_dio_end(file_inode(iocb->ki_filp));
+	kfree(dio);
+
+	return ret;
+}
+
+static void iomap_dio_complete_work(struct work_struct *work)
+{
+	struct iomap_dio *dio = container_of(work, struct iomap_dio, aio.work);
+	struct kiocb *iocb = dio->iocb;
+	bool is_write = (dio->flags & IOMAP_DIO_WRITE);
+	ssize_t ret;
+
+	ret = iomap_dio_complete(dio);
+	if (is_write && ret > 0)
+		ret = generic_write_sync(iocb, ret);
+	iocb->ki_complete(iocb, ret, 0);
+}
+
+static void iomap_dio_bio_end_io(struct bio *bio)
+{
+	struct iomap_dio *dio = bio->bi_private;
+	bool should_dirty = (dio->flags & IOMAP_DIO_DIRTY);
+
+	if (bio->bi_error)
+		cmpxchg(&dio->error, 0, bio->bi_error);
+
+	if (atomic_dec_and_test(&dio->ref)) {
+		if (is_sync_kiocb(dio->iocb)) {
+			struct task_struct *waiter = dio->submit.waiter;
+
+			WRITE_ONCE(dio->submit.waiter, NULL);
+			wake_up_process(waiter);
+		} else if (dio->flags & IOMAP_DIO_WRITE) {
+			struct inode *inode = file_inode(dio->iocb->ki_filp);
+
+			INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
+			queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
+		} else {
+			iomap_dio_complete_work(&dio->aio.work);
+		}
+	}
+
+	if (should_dirty) {
+		bio_check_pages_dirty(bio);
+	} else {
+		struct bio_vec *bvec;
+		int i;
+
+		bio_for_each_segment_all(bvec, bio, i)
+			put_page(bvec->bv_page);
+		bio_put(bio);
+	}
+}
+
+static blk_qc_t
+iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
+		unsigned len)
+{
+	struct page *page = ZERO_PAGE(0);
+	struct bio *bio;
+
+	bio = bio_alloc(GFP_KERNEL, 1);
+	bio->bi_bdev = iomap->bdev;
+	bio->bi_iter.bi_sector =
+		iomap->blkno + ((pos - iomap->offset) >> 9);
+	bio->bi_private = dio;
+	bio->bi_end_io = iomap_dio_bio_end_io;
+
+	get_page(page);
+	if (bio_add_page(bio, page, len, 0) != len)
+		BUG();
+	bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE;
+
+	atomic_inc(&dio->ref);
+	return submit_bio(bio);
+}
+
+static loff_t
+iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
+		void *data, struct iomap *iomap)
+{
+	struct iomap_dio *dio = data;
+	unsigned blkbits = blksize_bits(bdev_logical_block_size(iomap->bdev));
+	unsigned fs_block_size = (1 << inode->i_blkbits), pad;
+	struct iov_iter iter = *dio->submit.iter;
+	struct bio *bio;
+	bool may_zero = false;
+	int nr_pages, ret;
+
+	if ((pos | length | iov_iter_alignment(&iter)) & ((1 << blkbits) - 1))
+		return -EINVAL;
+
+	switch (iomap->type) {
+	case IOMAP_HOLE:
+		/*
+		 * We return -ENOTBLK to fall back to buffered I/O for file
+		 * systems that can't fill holes from direct writes.
+		 */
+		if (dio->flags & IOMAP_DIO_WRITE)
+			return -ENOTBLK;
+		/*FALLTHRU*/
+	case IOMAP_UNWRITTEN:
+		if (!(dio->flags & IOMAP_DIO_WRITE)) {
+			iov_iter_zero(length, dio->submit.iter);
+			dio->size += length;
+			return length;
+		}
+		dio->flags |= IOMAP_DIO_UNWRITTEN;
+		may_zero = true;
+		break;
+	case IOMAP_MAPPED:
+		if (iomap->flags & IOMAP_F_SHARED)
+			dio->flags |= IOMAP_DIO_COW;
+		if (iomap->flags & IOMAP_F_NEW)
+			may_zero = true;
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		return -EIO;
+	}
+
+	iov_iter_truncate(&iter, length);
+	nr_pages = iov_iter_npages(&iter, BIO_MAX_PAGES);
+	if (nr_pages <= 0)
+		return nr_pages;
+
+	if (may_zero) {
+		pad = pos & (fs_block_size - 1);
+		if (pad)
+			iomap_dio_zero(dio, iomap, pos - pad, pad);
+	}
+
+	do {
+		if (dio->error)
+			return 0;
+
+		bio = bio_alloc(GFP_KERNEL, nr_pages);
+		bio->bi_bdev = iomap->bdev;
+		bio->bi_iter.bi_sector =
+			iomap->blkno + ((pos - iomap->offset) >> 9);
+		bio->bi_private = dio;
+		bio->bi_end_io = iomap_dio_bio_end_io;
+
+		ret = bio_iov_iter_get_pages(bio, &iter);
+		if (unlikely(ret)) {
+			bio_put(bio);
+			return ret;
+		}
+
+		if (dio->flags & IOMAP_DIO_WRITE) {
+			bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE;
+			task_io_account_write(bio->bi_iter.bi_size);
+		} else {
+			bio->bi_opf = REQ_OP_READ;
+			if (dio->flags & IOMAP_DIO_DIRTY)
+				bio_set_pages_dirty(bio);
+		}
+
+		dio->size += bio->bi_iter.bi_size;
+		pos += bio->bi_iter.bi_size;
+
+		nr_pages = iov_iter_npages(&iter, BIO_MAX_PAGES);
+
+		atomic_inc(&dio->ref);
+
+		dio->submit.last_queue = bdev_get_queue(iomap->bdev);
+		dio->submit.cookie = submit_bio(bio);
+	} while (nr_pages);
+
+	if (may_zero) {
+		pad = pos & (fs_block_size - 1);
+		if (pad)
+			iomap_dio_zero(dio, iomap, pos, fs_block_size - pad);
+	}
+
+	iov_iter_advance(dio->submit.iter, length);
+	return length;
+}
+
+ssize_t
+iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, struct iomap_ops *ops,
+		iomap_dio_end_io_t end_io)
+{
+	struct address_space *mapping = iocb->ki_filp->f_mapping;
+	struct inode *inode = file_inode(iocb->ki_filp);
+	size_t count = iov_iter_count(iter);
+	loff_t pos = iocb->ki_pos, end = iocb->ki_pos + count - 1, ret = 0;
+	unsigned int flags = IOMAP_DIRECT;
+	struct blk_plug plug;
+	struct iomap_dio *dio;
+
+	lockdep_assert_held(&inode->i_rwsem);
+
+	if (!count)
+		return 0;
+
+	dio = kmalloc(sizeof(*dio), GFP_KERNEL);
+	if (!dio)
+		return -ENOMEM;
+
+	dio->iocb = iocb;
+	atomic_set(&dio->ref, 1);
+	dio->size = 0;
+	dio->i_size = i_size_read(inode);
+	dio->end_io = end_io;
+	dio->error = 0;
+	dio->flags = 0;
+
+	dio->submit.iter = iter;
+	if (is_sync_kiocb(iocb)) {
+		dio->submit.waiter = current;
+		dio->submit.cookie = BLK_QC_T_NONE;
+		dio->submit.last_queue = NULL;
+	}
+
+	if (iov_iter_rw(iter) == READ) {
+		if (pos >= dio->i_size)
+			goto out_free_dio;
+
+		if (iter->type == ITER_IOVEC)
+			dio->flags |= IOMAP_DIO_DIRTY;
+	} else {
+		dio->flags |= IOMAP_DIO_WRITE;
+		flags |= IOMAP_WRITE;
+	}
+
+	if (mapping->nrpages) {
+		ret = filemap_write_and_wait_range(mapping, iocb->ki_pos, end);
+		if (ret)
+			goto out_free_dio;
+
+		ret = invalidate_inode_pages2_range(mapping,
+				iocb->ki_pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
+		WARN_ON_ONCE(ret);
+		ret = 0;
+	}
+
+	inode_dio_begin(inode);
+
+	blk_start_plug(&plug);
+	do {
+		ret = iomap_apply(inode, pos, count, flags, ops, dio,
+				iomap_dio_actor);
+		if (ret <= 0) {
+			/* magic error code to fall back to buffered I/O */
+			if (ret == -ENOTBLK)
+				ret = 0;
+			break;
+		}
+		pos += ret;
+	} while ((count = iov_iter_count(iter)) > 0);
+	blk_finish_plug(&plug);
+
+	if (ret < 0)
+		cmpxchg(&dio->error, 0, ret);
+
+	if (ret >= 0 && iov_iter_rw(iter) == WRITE && !is_sync_kiocb(iocb) &&
+			!inode->i_sb->s_dio_done_wq) {
+		ret = sb_init_dio_done_wq(inode->i_sb);
+		if (ret < 0)
+			cmpxchg(&dio->error, 0, ret);
+	}
+
+	if (!atomic_dec_and_test(&dio->ref)) {
+		if (!is_sync_kiocb(iocb))
+			return -EIOCBQUEUED;
+
+		for (;;) {
+			set_current_state(TASK_UNINTERRUPTIBLE);
+			if (!READ_ONCE(dio->submit.waiter))
+				break;
+
+			if (!(iocb->ki_flags & IOCB_HIPRI) ||
+			    !dio->submit.last_queue ||
+			    !blk_poll(dio->submit.last_queue,
+					dio->submit.cookie))
+				io_schedule();
+		}
+		__set_current_state(TASK_RUNNING);
+	}
+
+	/*
+	 * Try again to invalidate clean pages which might have been cached by
+	 * non-direct readahead, or faulted in by get_user_pages() if the source
+	 * of the write was an mmap'ed region of the file we're writing.  Either
+	 * one is a pretty crazy thing to do, so we don't support it 100%.  If
+	 * this invalidation fails, tough, the write still worked...
+	 */
+	if (iov_iter_rw(iter) == WRITE && mapping->nrpages) {
+		ret = invalidate_inode_pages2_range(mapping,
+				iocb->ki_pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
+		WARN_ON_ONCE(ret);
+	}
+
+	return iomap_dio_complete(dio);
+
+out_free_dio:
+	kfree(dio);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iomap_dio_rw);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 7892f55..1b53109 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -49,6 +49,7 @@  struct iomap {
 #define IOMAP_WRITE		(1 << 0) /* writing, must allocate blocks */
 #define IOMAP_ZERO		(1 << 1) /* zeroing operation, may skip holes */
 #define IOMAP_REPORT		(1 << 2) /* report extent status, e.g. FIEMAP */
+#define IOMAP_DIRECT		(1 << 3)
 
 struct iomap_ops {
 	/*
@@ -82,4 +83,11 @@  int iomap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 		loff_t start, loff_t len, struct iomap_ops *ops);
 
+#define IOMAP_DIO_UNWRITTEN	(1 << 0)
+#define IOMAP_DIO_COW		(1 << 1)
+typedef int (iomap_dio_end_io_t)(struct kiocb *iocb, ssize_t ret,
+		unsigned flags);
+ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
+		struct iomap_ops *ops, iomap_dio_end_io_t end_io);
+
 #endif /* LINUX_IOMAP_H */