[v3,5/6] ext4: introduce direct IO write path using iomap infrastructure

Message ID db33705f9ba35ccbe20fc19b8ecbbf2078beff08.1568282664.git.mbobrowski@mbobrowski.org (mailing list archive)
State New, archived
Series ext4: port direct IO to iomap infrastructure

Commit Message

Matthew Bobrowski Sept. 12, 2019, 11:04 a.m. UTC
This patch introduces a new direct IO write code path implementation
that makes use of the iomap infrastructure.

All direct IO write operations are now passed from the ->write_iter()
callback to the new function ext4_dio_write_iter(). This function is
responsible for calling into iomap infrastructure via
iomap_dio_rw(). Snippets of the direct IO code from within
ext4_file_write_iter(), such as checking whether the IO request is
unaligned asynchronous IO, or whether it will be overwriting
allocated and initialized blocks, have been moved out and into
ext4_dio_write_iter().

The block mapping flags that are passed to ext4_map_blocks() from
within ext4_dio_get_block() and friends have effectively been taken
out and introduced within the ext4_iomap_begin(). If ext4_map_blocks()
happens to have instantiated blocks beyond the i_size, then we attempt
to place the inode onto the orphan list. Despite being able to perform
i_size extension checking earlier on in the direct IO code path, it
makes most sense to perform this bit post successful block allocation.
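
As a rough illustration of the flag selection described above, here is a
user-space sketch (not kernel code). The enum values, struct inode_model
and pick_map_mode() are hypothetical stand-ins; only the order of the
checks is taken from the ext4_iomap_begin() hunk in this patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the EXT4_GET_BLOCKS_* flags; the values are illustrative only. */
enum map_mode {
	MAP_NONE = 0,		/* lookup only; a hole then leads to -ENOTBLK */
	MAP_CREATE_ZERO,	/* models EXT4_GET_BLOCKS_CREATE_ZERO */
	MAP_CREATE,		/* models EXT4_GET_BLOCKS_CREATE */
	MAP_IO_CREATE_EXT,	/* models EXT4_GET_BLOCKS_IO_CREATE_EXT */
};

/* Hypothetical, simplified view of the inode state consulted by the patch. */
struct inode_model {
	bool is_dax;
	bool uses_extents;
	uint64_t i_size;
	uint32_t blocksize;	/* assumed to be a power of two */
};

static enum map_mode pick_map_mode(const struct inode_model *inode,
				   uint64_t offset)
{
	/* Equivalent of round_down(offset, i_blocksize(inode)). */
	uint64_t block_start = offset & ~((uint64_t)inode->blocksize - 1);

	if (inode->is_dax)
		return MAP_CREATE_ZERO;
	if (block_start >= inode->i_size)
		return MAP_CREATE;
	if (inode->uses_extents)
		return MAP_IO_CREATE_EXT;
	/* Indirect-mapped inode writing inside i_size: no allocation. */
	return MAP_NONE;
}
```

In this model, MAP_NONE corresponds to the case where the patch passes no
create flag to ext4_map_blocks() and returns -ENOTBLK if nothing is
mapped, so that the write falls back to buffered IO.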

The ->end_io() callback ext4_dio_write_end_io() is responsible for
removing the inode from the orphan list and determining if we should
truncate a failed write in the case of an error. We also convert a
range of unwritten extents to written if IOMAP_DIO_UNWRITTEN is set
and perform the necessary i_size/i_disksize extension if the
iocb->ki_pos + dio->size > i_size_read(inode).

In the instance of a short write, we fall back to buffered IO and
complete whatever is left in the 'iter'. Any blocks that may have been
allocated in preparation for direct IO will be reused by buffered IO,
so there's no issue with leaving allocated blocks beyond EOF.

Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/file.c  | 219 +++++++++++++++++++++++++++++++++---------------
 fs/ext4/inode.c |  57 ++++++++++---
 2 files changed, 196 insertions(+), 80 deletions(-)

Comments

Ritesh Harjani Sept. 16, 2019, 4:37 a.m. UTC | #1
Hello,

On 9/12/19 4:34 PM, Matthew Bobrowski wrote:
> This patch introduces a new direct IO write code path implementation
> that makes use of the iomap infrastructure.
> 
> All direct IO write operations are now passed from the ->write_iter()
> callback to the new function ext4_dio_write_iter(). This function is
> responsible for calling into iomap infrastructure via
> iomap_dio_rw(). Snippets of the direct IO code from within
> ext4_file_write_iter(), such as checking whether the IO request is
> unaligned asynchronous IO, or whether it will ber overwriting
> allocated and initialized blocks has been moved out and into
> ext4_dio_write_iter().
> 
> The block mapping flags that are passed to ext4_map_blocks() from
> within ext4_dio_get_block() and friends have effectively been taken
> out and introduced within the ext4_iomap_begin(). If ext4_map_blocks()
> happens to have instantiated blocks beyond the i_size, then we attempt
> to place the inode onto the orphan list. Despite being able to perform
> i_size extension checking earlier on in the direct IO code path, it
> makes most sense to perform this bit post successful block allocation.
> 
> The ->end_io() callback ext4_dio_write_end_io() is responsible for
> removing the inode from the orphan list and determining if we should
> truncate a failed write in the case of an error. We also convert a
> range of unwritten extents to written if IOMAP_DIO_UNWRITTEN is set
> and perform the necessary i_size/i_disksize extension if the
> iocb->ki_pos + dio->size > i_size_read(inode).
> 
> In the instance of a short write, we fallback to buffered IO and
> complete whatever is left the 'iter'. Any blocks that may have been
> allocated in preparation for direct IO will be reused by buffered IO,
> so there's no issue with leaving allocated blocks beyond EOF.
> 
> Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
> ---
>   fs/ext4/file.c  | 219 +++++++++++++++++++++++++++++++++---------------
>   fs/ext4/inode.c |  57 ++++++++++---
>   2 files changed, 196 insertions(+), 80 deletions(-)
> 
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 6457c629b8cf..413c7895aa9e 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -29,6 +29,7 @@
>   #include <linux/pagevec.h>
>   #include <linux/uio.h>
>   #include <linux/mman.h>
> +#include <linux/backing-dev.h>
>   #include "ext4.h"
>   #include "ext4_jbd2.h"
>   #include "xattr.h"
> @@ -213,12 +214,16 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
>   	struct inode *inode = file_inode(iocb->ki_filp);
>   	ssize_t ret;
> 
> +	if (unlikely(IS_IMMUTABLE(inode)))
> +		return -EPERM;
> +
>   	ret = generic_write_checks(iocb, from);
>   	if (ret <= 0)
>   		return ret;
> 
> -	if (unlikely(IS_IMMUTABLE(inode)))
> -		return -EPERM;
> +	ret = file_modified(iocb->ki_filp);
> +	if (ret)
> +		return 0;

Why not return ret directly? Otherwise we will be returning the wrong
error code to user space. Thoughts?
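
To illustrate the concern, here is a small user-space sketch (not kernel
code). fake_file_modified(), the two write_checks_*() variants and
write_iter() are hypothetical stand-ins, and -EIO is just an example
error; the point is only how a caller that tests "ret <= 0" sees each
variant.

```c
#include <errno.h>

/* Hypothetical stand-in for file_modified(): 0 on success, -errno on failure. */
static int fake_file_modified(int fail)
{
	return fail ? -EIO : 0;
}

/* As posted: a failure is turned into 0, i.e. "nothing to write". */
static long write_checks_posted(long count, int fail)
{
	int ret = fake_file_modified(fail);

	if (ret)
		return 0;
	return count;
}

/* As suggested: the error code is passed through unchanged. */
static long write_checks_suggested(long count, int fail)
{
	int ret = fake_file_modified(fail);

	if (ret)
		return ret;
	return count;
}

/* Models a caller that bails out on ret <= 0 and returns ret to user space. */
static long write_iter(long (*checks)(long, int), long count, int fail)
{
	long ret = checks(count, fail);

	if (ret <= 0)
		return ret;
	return count;	/* pretend the whole request was written */
}
```

With the posted variant, a failing check makes write_iter() return 0, so
the caller cannot tell the failure apart from a zero-length write. With
the suggested variant the error code reaches the caller.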

Do you think simplification/restructuring of this API
"ext4_write_checks" can be a separate patch, so that this patch
only focuses on conversion of DIO write path to iomap?


Also, I think we can restructure ext4_write_checks() as shown below.
This way we call file_modified() in one place, and only after we have
checked the write limits.

   static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
   {
           struct inode *inode = file_inode(iocb->ki_filp);
           ssize_t ret;

           if (unlikely(IS_IMMUTABLE(inode)))
                   return -EPERM;

           ret = generic_write_checks(iocb, from);
           if (ret <= 0)
                   return ret;
           /*
            * If we have encountered a bitmap-format file, the size limit
            * is smaller than s_maxbytes, which is for extent-mapped files.
            */
           if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
                   struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);

                   if (iocb->ki_pos >= sbi->s_bitmap_maxbytes)
                           return -EFBIG;
                   iov_iter_truncate(from, sbi->s_bitmap_maxbytes - iocb->ki_pos);
           }
+
+         ret = file_modified(iocb->ki_filp);
+         if (ret)
+                 return ret;
+
           return iov_iter_count(from);
   }


-ritesh


> 
>   	/*
>   	 * If we have encountered a bitmap-format file, the size limit
> @@ -234,6 +239,32 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
>   	return iov_iter_count(from);
>   }
> 
> +static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> +					struct iov_iter *from)
> +{
> +	ssize_t ret;
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +
> +	if (iocb->ki_flags & IOCB_NOWAIT)
> +		return -EOPNOTSUPP;
> +
> +	inode_lock(inode);
> +	ret = ext4_write_checks(iocb, from);
> +	if (ret <= 0)
> +		goto out;
> +
> +	current->backing_dev_info = inode_to_bdi(inode);
> +	ret = generic_perform_write(iocb->ki_filp, from, iocb->ki_pos);
> +	current->backing_dev_info = NULL;
> +out:
> +	inode_unlock(inode);
> +	if (likely(ret > 0)) {
> +		iocb->ki_pos += ret;
> +		ret = generic_write_sync(iocb, ret);
> +	}
> +	return ret;
> +}
> +
>   static int ext4_handle_inode_extension(struct inode *inode, loff_t offset,
>   				       ssize_t len, size_t count)
>   {
> @@ -310,6 +341,120 @@ static int ext4_handle_failed_inode_extension(struct inode *inode, loff_t size)
>   	return 0;
>   }
> 
> +/*
> + * For a write that extends the inode size, ext4_dio_write_iter() will
> + * wait for the write to complete. Consequently, operations performed
> + * within this function are still covered by the inode_lock(). On
> + * success, this function returns 0.
> + */
> +static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size, int error,
> +				 unsigned int flags)
> +{
> +	int ret;
> +	loff_t offset = iocb->ki_pos;
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +
> +	if (error) {
> +		ret = ext4_handle_failed_inode_extension(inode, offset + size);
> +		return ret ? ret : error;
> +	}
> +
> +	if (flags & IOMAP_DIO_UNWRITTEN) {
> +		ret = ext4_convert_unwritten_extents(NULL, inode,
> +						     offset, size);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	if (offset + size > i_size_read(inode)) {
> +		ret = ext4_handle_inode_extension(inode, offset, size, 0);
> +		if (ret)
> +			return ret;
> +	}
> +	return 0;
> +}
> +
> +static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> +{
> +	ssize_t ret;
> +	size_t count;
> +	loff_t offset = iocb->ki_pos;
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	bool extend = false, overwrite = false, unaligned_aio = false;
> +
> +	if (!inode_trylock(inode)) {
> +		if (iocb->ki_flags & IOCB_NOWAIT)
> +			return -EAGAIN;
> +		inode_lock(inode);
> +	}
> +
> +	if (!ext4_dio_checks(inode)) {
> +		inode_unlock(inode);
> +		/*
> +		 * Fallback to buffered IO if the operation on the
> +		 * inode is not supported by direct IO.
> +		 */
> +		return ext4_buffered_write_iter(iocb, from);
> +	}
> +
> +	ret = ext4_write_checks(iocb, from);
> +	if (ret <= 0) {
> +		inode_unlock(inode);
> +		return ret;
> +	}
> +
> +	/*
> +	 * Unaligned direct AIO must be serialized among each other as
> +	 * the zeroing of partial blocks of two competing unaligned
> +	 * AIOs can result in data corruption.
> +	 */
> +	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
> +	    !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) {
> +		unaligned_aio = true;
> +		inode_dio_wait(inode);
> +	}
> +
> +	/*
> +	 * Determine whether the IO operation will overwrite allocated
> +	 * and initialized blocks. If so, check to see whether it is
> +	 * possible to take the dioread_nolock path.
> +	 */
> +	count = iov_iter_count(from);
> +	if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
> +	    ext4_should_dioread_nolock(inode)) {
> +		overwrite = true;
> +		downgrade_write(&inode->i_rwsem);
> +	}
> +
> +	if (offset + count > i_size_read(inode) ||
> +	    offset + count > EXT4_I(inode)->i_disksize) {
> +		ext4_update_i_disksize(inode, inode->i_size);
> +		extend = true;
> +	}
> +
> +	ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, ext4_dio_write_end_io);
> +
> +	/*
> +	 * Unaligned direct AIO must be the only IO in flight or else
> +	 * any overlapping aligned IO after unaligned IO might result
> +	 * in data corruption. We also need to wait here in the case
> +	 * where the inode is being extended so that inode extension
> +	 * routines in ext4_dio_write_end_io() are covered by the
> +	 * inode_lock().
> +	 */
> +	if (ret == -EIOCBQUEUED && (unaligned_aio || extend))
> +		inode_dio_wait(inode);
> +
> +	if (overwrite)
> +		inode_unlock_shared(inode);
> +	else
> +		inode_unlock(inode);
> +
> +	if (ret >= 0 && iov_iter_count(from))
> +		return ext4_buffered_write_iter(iocb, from);
> +	return ret;
> +}
> +
>   #ifdef CONFIG_FS_DAX
>   static ssize_t
>   ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
> @@ -324,15 +469,10 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
>   			return -EAGAIN;
>   		inode_lock(inode);
>   	}
> +
>   	ret = ext4_write_checks(iocb, from);
>   	if (ret <= 0)
>   		goto out;
> -	ret = file_remove_privs(iocb->ki_filp);
> -	if (ret)
> -		goto out;
> -	ret = file_update_time(iocb->ki_filp);
> -	if (ret)
> -		goto out;
> 
>   	offset = iocb->ki_pos;
>   	ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
> @@ -358,73 +498,16 @@ static ssize_t
>   ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>   {
>   	struct inode *inode = file_inode(iocb->ki_filp);
> -	int o_direct = iocb->ki_flags & IOCB_DIRECT;
> -	int unaligned_aio = 0;
> -	int overwrite = 0;
> -	ssize_t ret;
> 
>   	if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
>   		return -EIO;
> 
> -#ifdef CONFIG_FS_DAX
>   	if (IS_DAX(inode))
>   		return ext4_dax_write_iter(iocb, from);
> -#endif
> -	if (!o_direct && (iocb->ki_flags & IOCB_NOWAIT))
> -		return -EOPNOTSUPP;
> 
> -	if (!inode_trylock(inode)) {
> -		if (iocb->ki_flags & IOCB_NOWAIT)
> -			return -EAGAIN;
> -		inode_lock(inode);
> -	}
> -
> -	ret = ext4_write_checks(iocb, from);
> -	if (ret <= 0)
> -		goto out;
> -
> -	/*
> -	 * Unaligned direct AIO must be serialized among each other as zeroing
> -	 * of partial blocks of two competing unaligned AIOs can result in data
> -	 * corruption.
> -	 */
> -	if (o_direct && ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
> -	    !is_sync_kiocb(iocb) &&
> -	    ext4_unaligned_aio(inode, from, iocb->ki_pos)) {
> -		unaligned_aio = 1;
> -		ext4_unwritten_wait(inode);
> -	}
> -
> -	iocb->private = &overwrite;
> -	/* Check whether we do a DIO overwrite or not */
> -	if (o_direct && !unaligned_aio) {
> -		if (ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from))) {
> -			if (ext4_should_dioread_nolock(inode))
> -				overwrite = 1;
> -		} else if (iocb->ki_flags & IOCB_NOWAIT) {
> -			ret = -EAGAIN;
> -			goto out;
> -		}
> -	}
> -
> -	ret = __generic_file_write_iter(iocb, from);
> -	/*
> -	 * Unaligned direct AIO must be the only IO in flight. Otherwise
> -	 * overlapping aligned IO after unaligned might result in data
> -	 * corruption.
> -	 */
> -	if (ret == -EIOCBQUEUED && unaligned_aio)
> -		ext4_unwritten_wait(inode);
> -	inode_unlock(inode);
> -
> -	if (ret > 0)
> -		ret = generic_write_sync(iocb, ret);
> -
> -	return ret;
> -
> -out:
> -	inode_unlock(inode);
> -	return ret;
> +	if (iocb->ki_flags & IOCB_DIRECT)
> +		return ext4_dio_write_iter(iocb, from);
> +	return ext4_buffered_write_iter(iocb, from);
>   }
> 
>   #ifdef CONFIG_FS_DAX
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index efb184928e51..f52ad3065236 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3513,11 +3513,13 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>   			}
>   		}
>   	} else if (flags & IOMAP_WRITE) {
> -		int dio_credits;
>   		handle_t *handle;
> -		int retries = 0;
> +		int dio_credits, retries = 0, m_flags = 0;
> 
> -		/* Trim mapping request to maximum we can map at once for DIO */
> +		/*
> +		 * Trim mapping request to maximum we can map at once
> +		 * for DIO.
> +		 */
>   		if (map.m_len > DIO_MAX_BLOCKS)
>   			map.m_len = DIO_MAX_BLOCKS;
>   		dio_credits = ext4_chunk_trans_blocks(inode, map.m_len);
> @@ -3533,8 +3535,30 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>   		if (IS_ERR(handle))
>   			return PTR_ERR(handle);
> 
> -		ret = ext4_map_blocks(handle, inode, &map,
> -				      EXT4_GET_BLOCKS_CREATE_ZERO);
> +		/*
> +		 * DAX and direct IO are the only two operations that
> +		 * are currently supported with IOMAP_WRITE.
> +		 */
> +		WARN_ON(!IS_DAX(inode) && !(flags & IOMAP_DIRECT));
> +		if (IS_DAX(inode))
> +			m_flags = EXT4_GET_BLOCKS_CREATE_ZERO;
> +		else if (round_down(offset, i_blocksize(inode)) >=
> +			 i_size_read(inode))
> +			m_flags = EXT4_GET_BLOCKS_CREATE;
> +		else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> +			m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
> +
> +		ret = ext4_map_blocks(handle, inode, &map, m_flags);
> +
> +		/*
> +		 * We cannot fill holes in indirect tree based inodes
> +		 * as that could expose stale data in the case of a
> +		 * crash. Use the magic error code to fallback to
> +		 * buffered IO.
> +		 */
> +		if (!m_flags && !ret)
> +			ret = -ENOTBLK;
> +
>   		if (ret < 0) {
>   			ext4_journal_stop(handle);
>   			if (ret == -ENOSPC &&
> @@ -3544,13 +3568,14 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>   		}
> 
>   		/*
> -		 * If we added blocks beyond i_size, we need to make sure they
> -		 * will get truncated if we crash before updating i_size in
> -		 * ext4_iomap_end(). For faults we don't need to do that (and
> -		 * even cannot because for orphan list operations inode_lock is
> -		 * required) - if we happen to instantiate block beyond i_size,
> -		 * it is because we race with truncate which has already added
> -		 * the inode to the orphan list.
> +		 * If we added blocks beyond i_size, we need to make
> +		 * sure they will get truncated if we crash before
> +		 * updating the i_size. For faults we don't need to do
> +		 * that (and even cannot because for orphan list
> +		 * operations inode_lock is required) - if we happen
> +		 * to instantiate block beyond i_size, it is because
> +		 * we race with truncate which has already added the
> +		 * inode to the orphan list.
>   		 */
>   		if (!(flags & IOMAP_FAULT) && first_block + map.m_len >
>   		    (i_size_read(inode) + (1 << blkbits) - 1) >> blkbits) {
> @@ -3612,6 +3637,14 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>   static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
>   			  ssize_t written, unsigned flags, struct iomap *iomap)
>   {
> +	/*
> +	 * Check to see whether an error occurred while writing data
> +	 * out to allocated blocks. If so, return the magic error code
> +	 * so that we fallback to buffered IO and reuse the blocks
> +	 * that were allocated in preparation for the direct IO write.
> +	 */
> +	if (flags & IOMAP_DIRECT && written == 0)
> +		return -ENOTBLK;
>   	return 0;
>   }
>
Matthew Bobrowski Sept. 16, 2019, 10:14 a.m. UTC | #2
On Mon, Sep 16, 2019 at 10:07:41AM +0530, Ritesh Harjani wrote:
> > @@ -213,12 +214,16 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
> >   	struct inode *inode = file_inode(iocb->ki_filp);
> >   	ssize_t ret;
> > 
> > +	if (unlikely(IS_IMMUTABLE(inode)))
> > +		return -EPERM;
> > +
> >   	ret = generic_write_checks(iocb, from);
> >   	if (ret <= 0)
> >   		return ret;
> > 
> > -	if (unlikely(IS_IMMUTABLE(inode)))
> > -		return -EPERM;
> > +	ret = file_modified(iocb->ki_filp);
> > +	if (ret)
> > +		return 0;
> 
> Why not return ret directly, otherwise we will be returning the wrong
> error code to user space. Thoughts?

You're right. I can't remember exactly why I decided to return '0', however
looking at the code once again I don't see a reason why we don't just return
'ret', as any value other than '0' represents a failure in this case
anyway. Thanks for picking that up.
 
> Do you think simplification/restructuring of this API
> "ext4_write_checks" can be a separate patch, so that this patch
> only focuses on conversion of DIO write path to iomap?

Hm, if we split it up so that it comes before this patch then it becomes hairy
in the sense that a whole bunch of other changes would also need to come with
what looks to be such a minuscule modification
i.e. ext4_buffered_write_iter(), ext4_file_write_iter(), etc. Splitting it to
come after just doesn't make any sense. To be honest, I don't really have any
strong opinions around why we shouldn't split it up, nor do I have a strong
opinion around why we should, so I think we should just leave it for now.

> Also, I think we can make the function (ext4_write_checks())
> like below. This way we call for file_modified() only after we
> have checked for write limits, at one place.

No objections and I think it's a good idea.

>   static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
>   {
>           struct inode *inode = file_inode(iocb->ki_filp);
>           ssize_t ret;
> 
>           if (unlikely(IS_IMMUTABLE(inode)))
>                   return -EPERM;
> 
>           ret = generic_write_checks(iocb, from);
>           if (ret <= 0)
>                   return ret;
>           /*
>            * If we have encountered a bitmap-format file, the size limit
>            * is smaller than s_maxbytes, which is for extent-mapped files.
>            */
>           if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
>                   struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> 
>                   if (iocb->ki_pos >= sbi->s_bitmap_maxbytes)
>                           return -EFBIG;
>                   iov_iter_truncate(from, sbi->s_bitmap_maxbytes - iocb->ki_pos);
>           }
> +
> +         ret = file_modified(iocb->ki_filp);
> +         if (ret)
> +                 return ret;
> +
>           return iov_iter_count(from);
>   }

--<M>--
Christoph Hellwig Sept. 16, 2019, 12:12 p.m. UTC | #3
On Thu, Sep 12, 2019 at 09:04:46PM +1000, Matthew Bobrowski wrote:
> @@ -213,12 +214,16 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
>  	struct inode *inode = file_inode(iocb->ki_filp);
>  	ssize_t ret;
>  
> +	if (unlikely(IS_IMMUTABLE(inode)))
> +		return -EPERM;
> +
>  	ret = generic_write_checks(iocb, from);
>  	if (ret <= 0)
>  		return ret;
>  
> -	if (unlikely(IS_IMMUTABLE(inode)))
> -		return -EPERM;
> +	ret = file_modified(iocb->ki_filp);
> +	if (ret)
> +		return 0;
>  
>  	/*
>  	 * If we have encountered a bitmap-format file, the size limit

Independent of the error return issue you probably want to split
modifying ext4_write_checks into a separate preparation patch.

> +/*
> + * For a write that extends the inode size, ext4_dio_write_iter() will
> + * wait for the write to complete. Consequently, operations performed
> + * within this function are still covered by the inode_lock(). On
> + * success, this function returns 0.
> + */
> +static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size, int error,
> +				 unsigned int flags)
> +{
> +	int ret;
> +	loff_t offset = iocb->ki_pos;
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +
> +	if (error) {
> +		ret = ext4_handle_failed_inode_extension(inode, offset + size);
> +		return ret ? ret : error;
> +	}

Just a personal opinion, but I find the use of the ternary operator
here a little weird.

A plain old:

	ret = ext4_handle_failed_inode_extension(inode, offset + size);
	if (ret)
		return ret;
	return error;

flows much easier.

> +	if (!inode_trylock(inode)) {
> +		if (iocb->ki_flags & IOCB_NOWAIT)
> +			return -EAGAIN;
> +		inode_lock(inode);
> +	}
> +
> +	if (!ext4_dio_checks(inode)) {
> +		inode_unlock(inode);
> +		/*
> +		 * Fallback to buffered IO if the operation on the
> +		 * inode is not supported by direct IO.
> +		 */
> +		return ext4_buffered_write_iter(iocb, from);

I think you want to lift the locking into the caller of this function
so that you don't have to unlock and relock for the buffered write
fallback.
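
One possible reading of this suggestion, sketched in user space (not
kernel code): the top-level write entry point takes the lock once, and
both the direct and buffered helpers assume it is already held. All
names below are hypothetical, the lock is modelled as a plain flag, and
the IOCB_NOWAIT and shared-lock downgrade cases are left out.

```c
#include <stdbool.h>

/* Hypothetical model of the inode; "locked" stands in for the inode lock. */
struct inode_model {
	bool locked;
	bool dio_supported;	/* stands in for the ext4_dio_checks() result */
	int lock_count;		/* how many times the lock was taken */
};

static void model_lock(struct inode_model *inode)
{
	inode->locked = true;
	inode->lock_count++;
}

static void model_unlock(struct inode_model *inode)
{
	inode->locked = false;
}

/* Buffered helper: expects the caller to hold the lock. */
static long buffered_write_locked(struct inode_model *inode, long count)
{
	return inode->locked ? count : -1;
}

/* Direct helper: falls back without dropping and retaking the lock. */
static long dio_write_locked(struct inode_model *inode, long count)
{
	if (!inode->locked)
		return -1;
	if (!inode->dio_supported)
		return buffered_write_locked(inode, count);
	return count;
}

/* Top-level entry point: the only place that locks and unlocks. */
static long file_write(struct inode_model *inode, long count)
{
	long ret;

	model_lock(inode);
	ret = dio_write_locked(inode, count);
	model_unlock(inode);
	return ret;
}
```

In this arrangement the fallback path runs under the same lock
acquisition as the direct IO attempt, so lock and unlock stay paired in
a single function.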

> +	if (offset + count > i_size_read(inode) ||
> +	    offset + count > EXT4_I(inode)->i_disksize) {
> +		ext4_update_i_disksize(inode, inode->i_size);
> +		extend = true;

Doesn't the ext4_update_i_disksize need to be under an open journal
handle?
Matthew Bobrowski Sept. 16, 2019, 10:37 p.m. UTC | #4
On Mon, Sep 16, 2019 at 05:12:48AM -0700, Christoph Hellwig wrote:
> On Thu, Sep 12, 2019 at 09:04:46PM +1000, Matthew Bobrowski wrote:
> > @@ -213,12 +214,16 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
> >  	struct inode *inode = file_inode(iocb->ki_filp);
> >  	ssize_t ret;
> >  
> > +	if (unlikely(IS_IMMUTABLE(inode)))
> > +		return -EPERM;
> > +
> >  	ret = generic_write_checks(iocb, from);
> >  	if (ret <= 0)
> >  		return ret;
> >  
> > -	if (unlikely(IS_IMMUTABLE(inode)))
> > -		return -EPERM;
> > +	ret = file_modified(iocb->ki_filp);
> > +	if (ret)
> > +		return 0;
> >  
> >  	/*
> >  	 * If we have encountered a bitmap-format file, the size limit
> 
> Independent of the error return issue you probably want to split
> modifying ext4_write_checks into a separate preparation patch.

Provided that there are no objections to introducing a possible performance
change with this separate preparation patch (overhead of calling
file_remove_privs/file_update_time twice), then I have no issues in doing so.

> > +/*
> > + * For a write that extends the inode size, ext4_dio_write_iter() will
> > + * wait for the write to complete. Consequently, operations performed
> > + * within this function are still covered by the inode_lock(). On
> > + * success, this function returns 0.
> > + */
> > +static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size, int error,
> > +				 unsigned int flags)
> > +{
> > +	int ret;
> > +	loff_t offset = iocb->ki_pos;
> > +	struct inode *inode = file_inode(iocb->ki_filp);
> > +
> > +	if (error) {
> > +		ret = ext4_handle_failed_inode_extension(inode, offset + size);
> > +		return ret ? ret : error;
> > +	}
> 
> Just a personal opinion, but I find the use of the ternary operator
> here a little weird.
> 
> A plain old:
> 
> 	ret = ext4_handle_failed_inode_extension(inode, offset + size);
> 	if (ret)
> 		return ret;
> 	return error;
> 
> flow much easier.

Agree, much cleaner.
 
> > +	if (!inode_trylock(inode)) {
> > +		if (iocb->ki_flags & IOCB_NOWAIT)
> > +			return -EAGAIN;
> > +		inode_lock(inode);
> > +	}
> > +
> > +	if (!ext4_dio_checks(inode)) {
> > +		inode_unlock(inode);
> > +		/*
> > +		 * Fallback to buffered IO if the operation on the
> > +		 * inode is not supported by direct IO.
> > +		 */
> > +		return ext4_buffered_write_iter(iocb, from);
> 
> I think you want to lift the locking into the caller of this function
> so that you don't have to unlock and relock for the buffered write
> fallback.

I don't exactly know what you really mean by "lift the locking into the caller
of this function". I'm interpreting that as moving the inode_unlock()
operation into ext4_buffered_write_iter(), but I can't see how that would be
any different from doing it directly here? Wouldn't this also run the risk of
the locks becoming unbalanced as we'd need to add checks around whether the
resource is being contended? Maybe I'm misunderstanding something here...

> > +	if (offset + count > i_size_read(inode) ||
> > +	    offset + count > EXT4_I(inode)->i_disksize) {
> > +		ext4_update_i_disksize(inode, inode->i_size);
> > +		extend = true;
> 
> Doesn't the ext4_update_i_disksize need to be under an open journal
> handle?

After all, it is a metadata update, which should go through an open journal
handle.

Thank you for the review Christoph!

--<M>--
Ritesh Harjani Sept. 17, 2019, 9 a.m. UTC | #5
Hello,

On 9/17/19 4:07 AM, Matthew Bobrowski wrote:
> On Mon, Sep 16, 2019 at 05:12:48AM -0700, Christoph Hellwig wrote:
>> On Thu, Sep 12, 2019 at 09:04:46PM +1000, Matthew Bobrowski wrote:
>>> @@ -213,12 +214,16 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
>>>   	struct inode *inode = file_inode(iocb->ki_filp);
>>>   	ssize_t ret;
>>>   
>>> +	if (unlikely(IS_IMMUTABLE(inode)))
>>> +		return -EPERM;
>>> +
>>>   	ret = generic_write_checks(iocb, from);
>>>   	if (ret <= 0)
>>>   		return ret;
>>>   
>>> -	if (unlikely(IS_IMMUTABLE(inode)))
>>> -		return -EPERM;
>>> +	ret = file_modified(iocb->ki_filp);
>>> +	if (ret)
>>> +		return 0;
>>>   
>>>   	/*
>>>   	 * If we have encountered a bitmap-format file, the size limit
>>
>> Independent of the error return issue you probably want to split
>> modifying ext4_write_checks into a separate preparation patch.
> 
> Providing that there's no objections to introducing a possible performance
> change with this separate preparation patch (overhead of calling
> file_remove_privs/file_update_time twice), then I have no issues in doing so.
> 
>>> +/*
>>> + * For a write that extends the inode size, ext4_dio_write_iter() will
>>> + * wait for the write to complete. Consequently, operations performed
>>> + * within this function are still covered by the inode_lock(). On
>>> + * success, this function returns 0.
>>> + */
>>> +static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size, int error,
>>> +				 unsigned int flags)
>>> +{
>>> +	int ret;
>>> +	loff_t offset = iocb->ki_pos;
>>> +	struct inode *inode = file_inode(iocb->ki_filp);
>>> +
>>> +	if (error) {
>>> +		ret = ext4_handle_failed_inode_extension(inode, offset + size);
>>> +		return ret ? ret : error;
>>> +	}
>>
>> Just a personal opinion, but I find the use of the ternary operator
>> here a little weird.
>>
>> A plain old:
>>
>> 	ret = ext4_handle_failed_inode_extension(inode, offset + size);
>> 	if (ret)
>> 		return ret;
>> 	return error;
>>
>> flow much easier.
> 
> Agree, much cleaner.
> 
>>> +	if (!inode_trylock(inode)) {
>>> +		if (iocb->ki_flags & IOCB_NOWAIT)
>>> +			return -EAGAIN;
>>> +		inode_lock(inode);
>>> +	}
>>> +
>>> +	if (!ext4_dio_checks(inode)) {
>>> +		inode_unlock(inode);
>>> +		/*
>>> +		 * Fallback to buffered IO if the operation on the
>>> +		 * inode is not supported by direct IO.
>>> +		 */
>>> +		return ext4_buffered_write_iter(iocb, from);
>>
>> I think you want to lift the locking into the caller of this function
>> so that you don't have to unlock and relock for the buffered write
>> fallback.
> 
> I don't exactly know what you really mean by "lift the locking into the caller
> of this function". I'm interpreting that as moving the inode_unlock()
> operation into ext4_buffered_write_iter(), but I can't see how that would be
> any different from doing it directly here? Wouldn't this also run the risk of
> the locks becoming unbalanced as we'd need to add checks around whether the
> resource is being contended? Maybe I'm misunderstanding something here...
> 
>>> +	if (offset + count > i_size_read(inode) ||
>>> +	    offset + count > EXT4_I(inode)->i_disksize) {
>>> +		ext4_update_i_disksize(inode, inode->i_size);
>>> +		extend = true;
>>
>> Doesn't the ext4_update_i_disksize need to be under an open journal
>> handle?
> 
> After all, it is a metadata update, which should go through an open journal
> handle.

Hmmm, this seems like a race. But I am not sure whether it is due only
to not updating i_disksize under an open journal handle.


Consider a delayed buffered write to a file. In that case we first
update only inode->i_size, and i_disksize is updated later, at
writeback time (i.e. during block allocation). If ext4_dio_write_iter()
is then called while offset + len > i_disksize, it calls
ext4_update_i_disksize().

Now suppose that writeback fails for some reason, and the system crashes
during the DIO write after the blocks have been allocated. Then on
reboot we may have an inconsistent inode, since we did not add the inode
to the orphan list before we updated inode->i_disksize, and journal
replay may not succeed.

1. Can the above actually happen? I am still not able to figure out the
    race/inconsistency completely.
2. Can you please help explain in what other cases
    it was necessary to call ext4_update_i_disksize() in DIO write paths?
3. When will i_disksize be out of sync with i_size during DIO writes?


-ritesh
Christoph Hellwig Sept. 17, 2019, 9:02 a.m. UTC | #6
On Tue, Sep 17, 2019 at 02:30:15PM +0530, Ritesh Harjani wrote:
> So if we have a delayed buffered write to a file,
> in that case we first only update inode->i_size and update
> i_disksize at writeback time
> (i.e. during block allocation).
> In that case when we call for ext4_dio_write_iter
> since offset + len > i_disksize, we call for ext4_update_i_disksize().
> 
> Now if writeback for some reason failed. And the system crashes, during the
> DIO writes, after the blocks are allocated. Then during reboot we may have
> an inconsistent inode, since we did not add the inode into the
> orphan list before we updated the inode->i_disksize. And journal replay
> may not succeed.
> 
> 1. Can above actually happen? I am still not able to figure out the
>    race/inconsistency completely.
> 2. Can you please help explain under what other cases
>    it was necessary to call ext4_update_i_disksize() in DIO write paths?
> 3. When will i_disksize be out-of-sync with i_size during DIO writes?

None of the above seems new in this patchset, does it?  That being said
I found the early size update odd.  XFS updates the on-disk size only
at I/O completion time to deal with various races including the
potential exposure of stale data.
Christoph Hellwig Sept. 17, 2019, 9:06 a.m. UTC | #7
On Tue, Sep 17, 2019 at 08:37:41AM +1000, Matthew Bobrowski wrote:
> > Independent of the error return issue you probably want to split
> > modifying ext4_write_checks into a separate preparation patch.
> 
> Providing that there's no objections to introducing a possible performance
> change with this separate preparation patch (overhead of calling
> file_remove_privs/file_update_time twice), then I have no issues in doing so.

Well, we should avoid calling it twice.  But what caught my eye is that
the buffered I/O path also called this function, so we are changing it as
well here.  If that actually is safe (I didn't review these bits carefully
and don't know ext4 that well) the overall refactoring of the write
flow might belong into a separate prep patch (that is not relying
on ->direct_IO, the checks changes, etc).

> > > +	if (!inode_trylock(inode)) {
> > > +		if (iocb->ki_flags & IOCB_NOWAIT)
> > > +			return -EAGAIN;
> > > +		inode_lock(inode);
> > > +	}
> > > +
> > > +	if (!ext4_dio_checks(inode)) {
> > > +		inode_unlock(inode);
> > > +		/*
> > > +		 * Fallback to buffered IO if the operation on the
> > > +		 * inode is not supported by direct IO.
> > > +		 */
> > > +		return ext4_buffered_write_iter(iocb, from);
> > 
> > I think you want to lift the locking into the caller of this function
> > so that you don't have to unlock and relock for the buffered write
> > fallback.
> 
> I don't exactly know what you really mean by "lift the locking into the caller
> of this function". I'm interpreting that as moving the inode_unlock()
> operation into ext4_buffered_write_iter(), but I can't see how that would be
> any different from doing it directly here? Wouldn't this also run the risk of
> the locks becoming unbalanced as we'd need to add checks around whether the
> resource is being contended? Maybe I'm misunderstanding something here...

With that I mean to acquire the inode lock in ext4_file_write_iter
instead of the low-level buffered I/O or direct I/O routines.
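
A minimal userspace sketch of the locking shape suggested above, assuming
toy stand-ins for the kernel primitives: struct toy_inode and every toy_*
function are hypothetical and exist only for illustration. It leaves out the
IOCB_NOWAIT trylock and the lock downgrade that the real direct I/O path
performs. The point is only that the caller takes the lock once, so the
buffered fallback needs no unlock/relock.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for an inode; none of these names exist in the kernel. */
struct toy_inode {
	bool locked;		/* models the exclusive inode lock */
	int lock_count;		/* number of times the lock was taken */
	bool dio_supported;	/* models the result of ext4_dio_checks() */
};

static void toy_inode_lock(struct toy_inode *inode)
{
	assert(!inode->locked);
	inode->locked = true;
	inode->lock_count++;
}

static void toy_inode_unlock(struct toy_inode *inode)
{
	assert(inode->locked);
	inode->locked = false;
}

/* Low-level routines only assert that the caller already holds the lock. */
static long toy_buffered_write(struct toy_inode *inode, long len)
{
	assert(inode->locked);
	return len;
}

static long toy_dio_write(struct toy_inode *inode, long len)
{
	assert(inode->locked);
	if (!inode->dio_supported)
		return toy_buffered_write(inode, len);	/* no unlock/relock */
	return len;
}

/* Models ext4_file_write_iter(): the single place that locks and unlocks. */
long toy_file_write(struct toy_inode *inode, long len, bool direct)
{
	long ret;

	toy_inode_lock(inode);
	ret = direct ? toy_dio_write(inode, len) : toy_buffered_write(inode, len);
	toy_inode_unlock(inode);
	return ret;
}
```

With this shape, a direct write that falls back to the buffered path still
takes the lock exactly once.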
Ritesh Harjani Sept. 17, 2019, 10:12 a.m. UTC | #8
On 9/17/19 2:32 PM, Christoph Hellwig wrote:
> On Tue, Sep 17, 2019 at 02:30:15PM +0530, Ritesh Harjani wrote:
>> So if we have a delayed buffered write to a file,
>> in that case we first only update inode->i_size and update
>> i_disksize at writeback time
>> (i.e. during block allocation).
>> In that case when we call for ext4_dio_write_iter
>> since offset + len > i_disksize, we call for ext4_update_i_disksize().
>>
>> Now if writeback for some reason failed. And the system crashes, during the
>> DIO writes, after the blocks are allocated. Then during reboot we may have
>> an inconsistent inode, since we did not add the inode into the
>> orphan list before we updated the inode->i_disksize. And journal replay
>> may not succeed.
>>
>> 1. Can above actually happen? I am still not able to figure out the
>>     race/inconsistency completely.
>> 2. Can you please help explain under what other cases
>>     it was necessary to call ext4_update_i_disksize() in DIO write paths?
>> 3. When will i_disksize be out-of-sync with i_size during DIO writes?
> 
> None of the above seems new in this patchset, does it?  That being said


In the original code, before updating i_disksize in ext4_direct_IO_write(),
we used to add the inode to the orphan list (which marks the iloc dirty
and also updates the on-disk inode size). Only then did we update
i_disksize to inode->i_size (although I still don't understand why that
has to happen inside an open journal handle).

So if a crash happens, then during recovery we can replay the journal and
truncate any extra blocks beyond i_size (ext4_orphan_cleanup()).


In the new iomap implementation (i.e. this patchset), we are doing this in
reverse.

We first call ext4_update_i_disksize() in ext4_dio_write_iter(), and only
then, in ext4_iomap_begin() after ext4_map_blocks(), do we add the inode
to the orphan list. I am not really sure whether that is consistent with
the on-disk size.



> I found the early size update odd.  XFS updates the on-disk size only
> at I/O completion time to deal with various races including the
> potential exposure of stale data.
> 

Yes, can't really say why it is the case in ext4.
That's mostly what I wanted to understand from previous queries.


-ritesh
Matthew Bobrowski Sept. 17, 2019, 11:31 a.m. UTC | #9
On Tue, Sep 17, 2019 at 02:06:13AM -0700, Christoph Hellwig wrote:
> On Tue, Sep 17, 2019 at 08:37:41AM +1000, Matthew Bobrowski wrote:
> > > Independent of the error return issue you probably want to split
> > > modifying ext4_write_checks into a separate preparation patch.
> > 
> > Providing that there's no objections to introducing a possible performance
> > change with this separate preparation patch (overhead of calling
> > file_remove_privs/file_update_time twice), then I have no issues in doing so.
> 
> Well, we should avoid calling it twice.  But what caught my eye is that
> the buffered I/O path also called this function, so we are changing it as
> well here.  If that actually is safe (I didn't review these bits carefully
> and don't know ext4 that well) the overall refactoring of the write
> flow might belong into a separate prep patch (that is not relying
> on ->direct_IO, the checks changes, etc).

Yeah, I see what you're saying. From memory, in order to get this right, there
was a whole bunch of additional changes that needed to be done that would
effectively be removed in a subsequent patch. But, let me revisit this again
and see what I can do.

> > > > +	if (!inode_trylock(inode)) {
> > > > +		if (iocb->ki_flags & IOCB_NOWAIT)
> > > > +			return -EAGAIN;
> > > > +		inode_lock(inode);
> > > > +	}
> > > > +
> > > > +	if (!ext4_dio_checks(inode)) {
> > > > +		inode_unlock(inode);
> > > > +		/*
> > > > +		 * Fallback to buffered IO if the operation on the
> > > > +		 * inode is not supported by direct IO.
> > > > +		 */
> > > > +		return ext4_buffered_write_iter(iocb, from);
> > > 
> > > I think you want to lift the locking into the caller of this function
> > > so that you don't have to unlock and relock for the buffered write
> > > fallback.
> > 
> > I don't exactly know what you really mean by "lift the locking into the caller
> > of this function". I'm interpreting that as moving the inode_unlock()
> > operation into ext4_buffered_write_iter(), but I can't see how that would be
> > any different from doing it directly here? Wouldn't this also run the risk of
> > the locks becoming unbalanced as we'd need to add checks around whether the
> > resource is being contended? Maybe I'm misunderstanding something here...
> 
> With that I mean to acquire the inode lock in ext4_file_write_iter
> instead of the low-level buffered I/O or direct I/O routines.

Oh, I didn't think of that! But yes, that would in fact be nice and I cannot
see why we shouldn't be doing that at this point. It also helps with reducing
all the code duplication going on in the low-level buffered, direct, dax I/O
routines.

--<M>--
Matthew Bobrowski Sept. 17, 2019, 12:39 p.m. UTC | #10
On Tue, Sep 17, 2019 at 02:02:33AM -0700, Christoph Hellwig wrote:
> On Tue, Sep 17, 2019 at 02:30:15PM +0530, Ritesh Harjani wrote:
> > So if we have a delayed buffered write to a file,
> > in that case we first only update inode->i_size and update
> > i_disksize at writeback time
> > (i.e. during block allocation).
> > In that case when we call for ext4_dio_write_iter
> > since offset + len > i_disksize, we call for ext4_update_i_disksize().
> > 
> > Now if writeback for some reason failed. And the system crashes, during the
> > DIO writes, after the blocks are allocated. Then during reboot we may have
> > an inconsistent inode, since we did not add the inode into the
> > orphan list before we updated the inode->i_disksize. And journal replay
> > may not succeed.
> > 
> > 1. Can above actually happen? I am still not able to figure out the
> >    race/inconsistency completely.
> > 2. Can you please help explain under what other cases
> >    it was necessary to call ext4_update_i_disksize() in DIO write paths?
> > 3. When will i_disksize be out-of-sync with i_size during DIO writes?
> 
> None of the above seems new in this patchset, does it?

That's correct.

*Ritesh - FWIW, I think you'll find the answers to your questions above by
 referring to the following commits:

 1) 73fdad00b208b
 2) 45d8ec4d9fd54

If you drop the check (offset + count > EXT4_I(inode)->i_disksize) and the
call to ext4_update_i_disksize(), under some workloads i.e. "generic/475"
you'll generally end up with metadata corruption.

> That being said I found the early size update odd. XFS updates the on-disk
> size only at I/O completion time to deal with various races including the
> potential exposure of stale data.

Indeed a little odd. But, I think delalloc/writeback implementation is
possibly to blame here (based on what's detailed in 45d8ec4d9fd54)? Ideally, I
had the same approach as XFS in mind, but I couldn't do it.

--<M>--
Matthew Bobrowski Sept. 20, 2019, 1:24 p.m. UTC | #11
On Tue, Sep 17, 2019 at 02:06:13AM -0700, Christoph Hellwig wrote:
> On Tue, Sep 17, 2019 at 08:37:41AM +1000, Matthew Bobrowski wrote:
> > > Independent of the error return issue you probably want to split
> > > modifying ext4_write_checks into a separate preparation patch.
> > 
> > Providing that there's no objections to introducing a possible performance
> > change with this separate preparation patch (overhead of calling
> > file_remove_privs/file_update_time twice), then I have no issues in doing so.
> 
> Well, we should avoid calling it twice.  But what caught my eye is that
> the buffered I/O path also called this function, so we are changing it as
> well here.  If that actually is safe (I didn't review these bits carefully
> and don't know ext4 that well) the overall refactoring of the write
> flow might belong into a separate prep patch (that is not relying
> on ->direct_IO, the checks changes, etc).

Yeah, no. Revisiting this again now and trying to implement the
ext4_write_checks() modifications as a pre-patch is a nightmare so to
speak. This is purely due to the way that ext4_file_write_iter() is currently
written and how both the current buffered I/O and direct I/O paths traverse
through and make use of it.

If anything, the changes applied to ext4_write_checks() should be a separate
patch that comes *after* the refactoring of the buffered and direct I/O write
flow. However, even then, there'd be code that we essentially introduce in the
write flow changes and then subsequently remove after the fact. Provided
that's OK, then sure, I can put this within a separate patch.

--<M>--
Jan Kara Sept. 23, 2019, 9:10 p.m. UTC | #12
I'll try to comment just on top of refactoring Christoph has suggested...

On Thu 12-09-19 21:04:46, Matthew Bobrowski wrote:
> @@ -310,6 +341,120 @@ static int ext4_handle_failed_inode_extension(struct inode *inode, loff_t size)
>  	return 0;
>  }
>  
> +/*
> + * For a write that extends the inode size, ext4_dio_write_iter() will
> + * wait for the write to complete. Consequently, operations performed
> + * within this function are still covered by the inode_lock(). On
> + * success, this function returns 0.
> + */
> +static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size, int error,
> +				 unsigned int flags)
> +{
> +	int ret;
> +	loff_t offset = iocb->ki_pos;
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +
> +	if (error) {
> +		ret = ext4_handle_failed_inode_extension(inode, offset + size);
> +		return ret ? ret : error;
> +	}
> +
> +	if (flags & IOMAP_DIO_UNWRITTEN) {
> +		ret = ext4_convert_unwritten_extents(NULL, inode,
> +						     offset, size);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	if (offset + size > i_size_read(inode)) {
> +		ret = ext4_handle_inode_extension(inode, offset, size, 0);
> +		if (ret)
> +			return ret;
> +	}

With the suggestions I made to your patch 3/6 this could be simplified to:

	if (!error && flags & IOMAP_DIO_UNWRITTEN) {
		error = ext4_convert_unwritten_extents(NULL, inode, offset,
						       size);
	}
	return ext4_handle_inode_extension(inode, offset, error ? : size, size);


Note the change that when ext4_convert_unwritten_extents() fails (although
this should not really happen unless there's some corruption going on), we
do properly truncate possible extents beyond i_size.
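
For readers unfamiliar with the "error ? : size" spelling used in the
suggestion above: it is the GNU C conditional with an omitted middle operand,
which the kernel uses freely. The sketch below only shows what that
expression evaluates to; the function names are hypothetical, and how
ext4_handle_inode_extension() interprets the value is defined by the patch
3/6 discussion, not here.

```c
#include <assert.h>

/* GNU C extension: "x ? : y" yields x if x is non-zero, otherwise y. */
long toy_written_or_error(int error, long size)
{
	return error ? : size;
}

/* Portable ISO C spelling of the same expression. */
long toy_written_or_error_iso(int error, long size)
{
	return error ? error : size;
}
```

So with error == 0 the call receives size, and with a non-zero error it
receives the error value itself.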

> +static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> +{
> +	ssize_t ret;
> +	size_t count;
> +	loff_t offset = iocb->ki_pos;
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	bool extend = false, overwrite = false, unaligned_aio = false;
> +
> +	if (!inode_trylock(inode)) {
> +		if (iocb->ki_flags & IOCB_NOWAIT)
> +			return -EAGAIN;
> +		inode_lock(inode);
> +	}
> +
> +	if (!ext4_dio_checks(inode)) {
> +		inode_unlock(inode);
> +		/*
> +		 * Fallback to buffered IO if the operation on the
> +		 * inode is not supported by direct IO.
> +		 */
> +		return ext4_buffered_write_iter(iocb, from);
> +	}
> +
> +	ret = ext4_write_checks(iocb, from);
> +	if (ret <= 0) {
> +		inode_unlock(inode);
> +		return ret;
> +	}
> +
> +	/*
> +	 * Unaligned direct AIO must be serialized among each other as
> +	 * the zeroing of partial blocks of two competing unaligned
> +	 * AIOs can result in data corruption.
> +	 */
> +	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
> +	    !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) {
> +		unaligned_aio = true;
> +		inode_dio_wait(inode);
> +	}
> +
> +	/*
> +	 * Determine whether the IO operation will overwrite allocated
> +	 * and initialized blocks. If so, check to see whether it is
> +	 * possible to take the dioread_nolock path.
> +	 */
> +	count = iov_iter_count(from);
> +	if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
> +	    ext4_should_dioread_nolock(inode)) {
> +		overwrite = true;
> +		downgrade_write(&inode->i_rwsem);
> +	}
> +
> +	if (offset + count > i_size_read(inode) ||
> +	    offset + count > EXT4_I(inode)->i_disksize) {
> +		ext4_update_i_disksize(inode, inode->i_size);
> +		extend = true;
> +	}

This call to ext4_update_i_disksize() is definitely wrong. If nothing else,
you need to also have transaction started and call ext4_mark_inode_dirty()
to actually journal the change of i_disksize (ext4_update_i_disksize()
updates only the in-memory copy of the entry). Also the direct IO code
needs to add the inode to the orphan list so that in case of crash, blocks
allocated beyond EOF get truncated on next mount. That is the whole point
of this exercise with i_disksize after all.

But I'm wondering if i_disksize update is needed. Truncate cannot be in
progress (we hold i_rwsem) and dirty pages will be flushed by
iomap_dio_rw() before we start to allocate any blocks. So it should be
enough to have here:

	if (offset + count > i_size_read(inode)) {
		/*
		 * Add inode to orphan list so that blocks allocated beyond
		 * EOF get properly truncated in case of crash.
		 */
		start transaction handle
		add inode to orphan list
		stop transaction handle
	}

And just leave i_disksize at whatever it currently is.

> +
> +	ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, ext4_dio_write_end_io);
> +
> +	/*
> +	 * Unaligned direct AIO must be the only IO in flight or else
> +	 * any overlapping aligned IO after unaligned IO might result
> +	 * in data corruption. We also need to wait here in the case
> +	 * where the inode is being extended so that inode extension
> +	 * routines in ext4_dio_write_end_io() are covered by the
> +	 * inode_lock().
> +	 */
> +	if (ret == -EIOCBQUEUED && (unaligned_aio || extend))
> +		inode_dio_wait(inode);
> +
> +	if (overwrite)
> +		inode_unlock_shared(inode);
> +	else
> +		inode_unlock(inode);
> +
> +	if (ret >= 0 && iov_iter_count(from))
> +		return ext4_buffered_write_iter(iocb, from);
> +	return ret;
> +}
> +
...
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index efb184928e51..f52ad3065236 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3513,11 +3513,13 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>  			}
>  		}
>  	} else if (flags & IOMAP_WRITE) {
> -		int dio_credits;
>  		handle_t *handle;
> -		int retries = 0;
> +		int dio_credits, retries = 0, m_flags = 0;
>  
> -		/* Trim mapping request to maximum we can map at once for DIO */
> +		/*
> +		 * Trim mapping request to maximum we can map at once
> +		 * for DIO.
> +		 */
>  		if (map.m_len > DIO_MAX_BLOCKS)
>  			map.m_len = DIO_MAX_BLOCKS;
>  		dio_credits = ext4_chunk_trans_blocks(inode, map.m_len);
> @@ -3533,8 +3535,30 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>  		if (IS_ERR(handle))
>  			return PTR_ERR(handle);
>  
> -		ret = ext4_map_blocks(handle, inode, &map,
> -				      EXT4_GET_BLOCKS_CREATE_ZERO);
> +		/*
> +		 * DAX and direct IO are the only two operations that
> +		 * are currently supported with IOMAP_WRITE.
> +		 */
> +		WARN_ON(!IS_DAX(inode) && !(flags & IOMAP_DIRECT));
> +		if (IS_DAX(inode))
> +			m_flags = EXT4_GET_BLOCKS_CREATE_ZERO;
> +		else if (round_down(offset, i_blocksize(inode)) >=
> +			 i_size_read(inode))
> +			m_flags = EXT4_GET_BLOCKS_CREATE;
> +		else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> +			m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
> +
> +		ret = ext4_map_blocks(handle, inode, &map, m_flags);
> +
> +		/*
> +		 * We cannot fill holes in indirect tree based inodes
> +		 * as that could expose stale data in the case of a
> +		 * crash. Use the magic error code to fallback to
> +		 * buffered IO.
> +		 */
> +		if (!m_flags && !ret)
> +			ret = -ENOTBLK;
> +
>  		if (ret < 0) {
>  			ext4_journal_stop(handle);
>  			if (ret == -ENOSPC &&
> @@ -3544,13 +3568,14 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>  		}
>  
>  		/*
> -		 * If we added blocks beyond i_size, we need to make sure they
> -		 * will get truncated if we crash before updating i_size in
> -		 * ext4_iomap_end(). For faults we don't need to do that (and
> -		 * even cannot because for orphan list operations inode_lock is
> -		 * required) - if we happen to instantiate block beyond i_size,
> -		 * it is because we race with truncate which has already added
> -		 * the inode to the orphan list.
> +		 * If we added blocks beyond i_size, we need to make
> +		 * sure they will get truncated if we crash before
> +		 * updating the i_size. For faults we don't need to do
> +		 * that (and even cannot because for orphan list
> +		 * operations inode_lock is required) - if we happen
> +		 * to instantiate block beyond i_size, it is because
> +		 * we race with truncate which has already added the
> +		 * inode to the orphan list.
>  		 */

Just a nit but it would be nice to use full width of 80 columns when
formatting comments so that they don't get unnecessarily long.

								Honza
Matthew Bobrowski Sept. 24, 2019, 10:29 a.m. UTC | #13
On Mon, Sep 23, 2019 at 11:10:11PM +0200, Jan Kara wrote:
> On Thu 12-09-19 21:04:46, Matthew Bobrowski wrote:
> > +/*
> > + * For a write that extends the inode size, ext4_dio_write_iter() will
> > + * wait for the write to complete. Consequently, operations performed
> > + * within this function are still covered by the inode_lock(). On
> > + * success, this function returns 0.
> > + */
> > +static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size, int error,
> > +				 unsigned int flags)
> > +{
> > +	int ret;
> > +	loff_t offset = iocb->ki_pos;
> > +	struct inode *inode = file_inode(iocb->ki_filp);
> > +
> > +	if (error) {
> > +		ret = ext4_handle_failed_inode_extension(inode, offset + size);
> > +		return ret ? ret : error;
> > +	}
> > +
> > +	if (flags & IOMAP_DIO_UNWRITTEN) {
> > +		ret = ext4_convert_unwritten_extents(NULL, inode,
> > +						     offset, size);
> > +		if (ret)
> > +			return ret;
> > +	}
> > +
> > +	if (offset + size > i_size_read(inode)) {
> > +		ret = ext4_handle_inode_extension(inode, offset, size, 0);
> > +		if (ret)
> > +			return ret;
> > +	}
> 
> With the suggestions I made to your patch 3/6 this could be simplified to:
> 
> 	if (!error && flags & IOMAP_DIO_UNWRITTEN) {
> 		error = ext4_convert_unwritten_extents(NULL, inode, offset,
> 						       size);
> 	}
> 	return ext4_handle_inode_extension(inode, offset, error ? : size, size);
> 
> 
> Note the change that when ext4_convert_unwritten_extents() fails (although
> this should not really happen unless there's some corruption going on), we
> do properly truncate possible extents beyond i_size.

This sounds good to me. I like this idea.
 
> > +static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > +{
> > +	ssize_t ret;
> > +	size_t count;
> > +	loff_t offset = iocb->ki_pos;
> > +	struct inode *inode = file_inode(iocb->ki_filp);
> > +	bool extend = false, overwrite = false, unaligned_aio = false;
> > +
> > +	if (!inode_trylock(inode)) {
> > +		if (iocb->ki_flags & IOCB_NOWAIT)
> > +			return -EAGAIN;
> > +		inode_lock(inode);
> > +	}
> > +
> > +	if (!ext4_dio_checks(inode)) {
> > +		inode_unlock(inode);
> > +		/*
> > +		 * Fallback to buffered IO if the operation on the
> > +		 * inode is not supported by direct IO.
> > +		 */
> > +		return ext4_buffered_write_iter(iocb, from);
> > +	}
> > +
> > +	ret = ext4_write_checks(iocb, from);
> > +	if (ret <= 0) {
> > +		inode_unlock(inode);
> > +		return ret;
> > +	}
> > +
> > +	/*
> > +	 * Unaligned direct AIO must be serialized among each other as
> > +	 * the zeroing of partial blocks of two competing unaligned
> > +	 * AIOs can result in data corruption.
> > +	 */
> > +	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
> > +	    !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) {
> > +		unaligned_aio = true;
> > +		inode_dio_wait(inode);
> > +	}
> > +
> > +	/*
> > +	 * Determine whether the IO operation will overwrite allocated
> > +	 * and initialized blocks. If so, check to see whether it is
> > +	 * possible to take the dioread_nolock path.
> > +	 */
> > +	count = iov_iter_count(from);
> > +	if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
> > +	    ext4_should_dioread_nolock(inode)) {
> > +		overwrite = true;
> > +		downgrade_write(&inode->i_rwsem);
> > +	}
> > +
> > +	if (offset + count > i_size_read(inode) ||
> > +	    offset + count > EXT4_I(inode)->i_disksize) {
> > +		ext4_update_i_disksize(inode, inode->i_size);
> > +		extend = true;
> > +	}
> 
> This call to ext4_update_i_disksize() is definitely wrong. If nothing else,
> you need to also have transaction started and call ext4_mark_inode_dirty()
> to actually journal the change of i_disksize (ext4_update_i_disksize()
> updates only the in-memory copy of the entry). Also the direct IO code
> needs to add the inode to the orphan list so that in case of crash, blocks
> allocated beyond EOF get truncated on next mount. That is the whole point
> of this exercise with i_disksize after all.
> 
> But I'm wondering if i_disksize update is needed. Truncate cannot be in
> progress (we hold i_rwsem) and dirty pages will be flushed by
> iomap_dio_rw() before we start to allocate any blocks. So it should be
> enough to have here:

Well, I initially thought the same, however doing some research shows that we
have the following edge case:
     - 45d8ec4d9fd54
     and
     - 73fdad00b208b

In fact you can reproduce the exact same i_size corruption issue by running
the generic/475 xfstests multiple times, as articulated within
45d8ec4d9fd54. So with that, I'm kind of confused and thinking that there may
be a problem that resides elsewhere that may need addressing?
 
> 	if (offset + count > i_size_read(inode)) {
> 		/*
> 		 * Add inode to orphan list so that blocks allocated beyond
> 		 * EOF get properly truncated in case of crash.
> 		 */
> 		start transaction handle
> 		add inode to orphan list
> 		stop transaction handle
> 	}
> 
> And just leave i_disksize at whatever it currently is.

I originally had the code which added the inode to the orphan list here, but
then I thought to myself that it'd make more sense to actually do this step
closer to the point where we've managed to successfully allocate the required
blocks for the write. This prevents the need to spray orphan list clean up
code all over the place just to cover the case that a write which had intended
to extend the inode beyond i_size had failed prematurely (i.e. before block
allocation). That is why I thought having it in ext4_iomap_begin() would make
more sense: at that point in the write path we have much more assurance about
whether the write extending beyond i_size can in fact be performed, and
consequently whether the inode should be placed onto the orphan list.

Ideally I'd like to turn this statement into:

	if (offset + count > i_size_read(inode))
	        extend = true;

Maybe I'm missing something here and there's actually a really good reason for
doing this nice and early? What are your thoughts about what I've mentioned
above?

> > -		 * If we added blocks beyond i_size, we need to make sure they
> > -		 * will get truncated if we crash before updating i_size in
> > -		 * ext4_iomap_end(). For faults we don't need to do that (and
> > -		 * even cannot because for orphan list operations inode_lock is
> > -		 * required) - if we happen to instantiate block beyond i_size,
> > -		 * it is because we race with truncate which has already added
> > -		 * the inode to the orphan list.
> > +		 * If we added blocks beyond i_size, we need to make
> > +		 * sure they will get truncated if we crash before
> > +		 * updating the i_size. For faults we don't need to do
> > +		 * that (and even cannot because for orphan list
> > +		 * operations inode_lock is required) - if we happen
> > +		 * to instantiate block beyond i_size, it is because
> > +		 * we race with truncate which has already added the
> > +		 * inode to the orphan list.
> >  		 */
> 
> Just a nit but it would be nice to use full width of 80 columns when
> formatting comments so that they don't get unnecessarily long.

Sure. I'm blaming emacs for this. :P

--<M>--
Jan Kara Sept. 24, 2019, 10:57 a.m. UTC | #14
On Tue 17-09-19 14:30:15, Ritesh Harjani wrote:
> On 9/17/19 4:07 AM, Matthew Bobrowski wrote:
> > On Mon, Sep 16, 2019 at 05:12:48AM -0700, Christoph Hellwig wrote:
> > > > +	if (offset + count > i_size_read(inode) ||
> > > > +	    offset + count > EXT4_I(inode)->i_disksize) {
> > > > +		ext4_update_i_disksize(inode, inode->i_size);
> > > > +		extend = true;
> > > 
> > > Doesn't the ext4_update_i_disksize need to be under an open journal
> > > handle?
> > 
> > After all, it is a metadata update, which should go through an open journal
> > handle.
> 
> Hmmm, it seems like a race here. But I am not sure if this is just due to
> not updating i_disksize under open journal handle.
> 
> 
> So if we have a delayed buffered write to a file,
> in that case we first only update inode->i_size and update
> i_disksize at writeback time
> (i.e. during block allocation).
> In that case when we call for ext4_dio_write_iter
> since offset + len > i_disksize, we call for ext4_update_i_disksize().
> 
> Now if writeback for some reason failed. And the system crashes, during the
> DIO writes, after the blocks are allocated. Then during reboot we may have
> an inconsistent inode, since we did not add the inode into the
> orphan list before we updated the inode->i_disksize. And journal replay
> may not succeed.

OK, let me clear up some confusion here.

> 1. Can above actually happen? I am still not able to figure out the
>    race/inconsistency completely.
> 2. Can you please help explain under what other cases
>    it was necessary to call ext4_update_i_disksize() in DIO write paths?
> 3. When will i_disksize be out-of-sync with i_size during DIO writes?

So as I commented in my other reply to this patch, the code is definitely
wrong as is. The update of i_disksize in the original DIO code is connected
with the addition to the orphan list - we want to make sure orphan cleanup
in case of crash will truncate inode to the current i_size so we set
i_disksize to i_size. That being said I cannot find a reason why the update
would be actually necessary because at that point i_size should be equal to
i_disksize anyway. So the best theory I have is that it was added "just to
be sure" (the code got inherited from original ext3 codebase from before
git times).

To answer your other questions: If we decided to leave i_disksize update
in, the proper use would look like:

	handle = ext4_journal_start(...);
	if (ext4_update_i_disksize(inode))
		ext4_mark_inode_dirty(handle, inode);
	ext4_orphan_add(handle, inode);
	ext4_journal_stop();

Now the race you ask about in 1) can happen but it would be harmless.
As you mention buffered writeback does update i_disksize after submitting
IO to freshly allocated blocks (in the same transaction as block allocation
is) but stale data exposure is impossible because transaction commit will
wait for page writeback to complete.

To answer 3), i_disksize is not in sync with i_size either during truncate
(or similar operations generally protected by i_rwsem for writing) or
when there are pending delayed-allocation writes.

Now to answer 2) I don't think update of i_disksize is ever needed for DIO
path - once iomap_dio_rw() flushes pending dirty pages, i_disksize should
be equal to i_size since we hold i_rwsem at least for reading.
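
To make the answer to 3) concrete, here is a toy userspace model of the two
sizes, assuming a much simplified inode: struct toy_inode and the toy_*
helpers are hypothetical names, and toy_flush_all() writes back everything
that is pending rather than a sub-range. It is an illustration of the
invariant, not ext4 code.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_inode {
	long i_size;		/* in-memory file size */
	long i_disksize;	/* size recorded for the on-disk inode */
	long delalloc_end;	/* end of data still waiting for writeback */
};

/* Delayed-allocation buffered write: only the in-memory size moves. */
void toy_delalloc_write(struct toy_inode *inode, long pos, long len)
{
	long end = pos + len;

	if (end > inode->i_size)
		inode->i_size = end;
	if (end > inode->delalloc_end)
		inode->delalloc_end = end;
}

/* Writeback of everything pending: i_disksize catches up with the data. */
void toy_flush_all(struct toy_inode *inode)
{
	if (inode->delalloc_end > inode->i_disksize)
		inode->i_disksize = inode->delalloc_end;
	inode->delalloc_end = 0;
}

bool toy_sizes_in_sync(const struct toy_inode *inode)
{
	return inode->i_size == inode->i_disksize;
}
```

In this model the sizes differ only between toy_delalloc_write() and
toy_flush_all(), which mirrors the pending-delalloc window described above.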

								Honza
Jan Kara Sept. 24, 2019, 2:13 p.m. UTC | #15
On Tue 24-09-19 20:29:26, Matthew Bobrowski wrote:
> On Mon, Sep 23, 2019 at 11:10:11PM +0200, Jan Kara wrote:
> > On Thu 12-09-19 21:04:46, Matthew Bobrowski wrote:
> > > +static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > > +{
> > > +	ssize_t ret;
> > > +	size_t count;
> > > +	loff_t offset = iocb->ki_pos;
> > > +	struct inode *inode = file_inode(iocb->ki_filp);
> > > +	bool extend = false, overwrite = false, unaligned_aio = false;
> > > +
> > > +	if (!inode_trylock(inode)) {
> > > +		if (iocb->ki_flags & IOCB_NOWAIT)
> > > +			return -EAGAIN;
> > > +		inode_lock(inode);
> > > +	}
> > > +
> > > +	if (!ext4_dio_checks(inode)) {
> > > +		inode_unlock(inode);
> > > +		/*
> > > +		 * Fallback to buffered IO if the operation on the
> > > +		 * inode is not supported by direct IO.
> > > +		 */
> > > +		return ext4_buffered_write_iter(iocb, from);
> > > +	}
> > > +
> > > +	ret = ext4_write_checks(iocb, from);
> > > +	if (ret <= 0) {
> > > +		inode_unlock(inode);
> > > +		return ret;
> > > +	}
> > > +
> > > +	/*
> > > +	 * Unaligned direct AIO must be serialized among each other as
> > > +	 * the zeroing of partial blocks of two competing unaligned
> > > +	 * AIOs can result in data corruption.
> > > +	 */
> > > +	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
> > > +	    !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) {
> > > +		unaligned_aio = true;
> > > +		inode_dio_wait(inode);
> > > +	}
> > > +
> > > +	/*
> > > +	 * Determine whether the IO operation will overwrite allocated
> > > +	 * and initialized blocks. If so, check to see whether it is
> > > +	 * possible to take the dioread_nolock path.
> > > +	 */
> > > +	count = iov_iter_count(from);
> > > +	if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
> > > +	    ext4_should_dioread_nolock(inode)) {
> > > +		overwrite = true;
> > > +		downgrade_write(&inode->i_rwsem);
> > > +	}
> > > +
> > > +	if (offset + count > i_size_read(inode) ||
> > > +	    offset + count > EXT4_I(inode)->i_disksize) {
> > > +		ext4_update_i_disksize(inode, inode->i_size);
> > > +		extend = true;
> > > +	}
> > 
> > This call to ext4_update_i_disksize() is definitely wrong. If nothing else,
> > you need to also have transaction started and call ext4_mark_inode_dirty()
> > to actually journal the change of i_disksize (ext4_update_i_disksize()
> > updates only the in-memory copy of the entry). Also the direct IO code
> > needs to add the inode to the orphan list so that in case of crash, blocks
> > allocated beyond EOF get truncated on next mount. That is the whole point
> > > of this exercise with i_disksize after all.
> > 
> > But I'm wondering if i_disksize update is needed. Truncate cannot be in
> > progress (we hold i_rwsem) and dirty pages will be flushed by
> > iomap_dio_rw() before we start to allocate any blocks. So it should be
> > enough to have here:
> 
> Well, I initially thought the same, however doing some research shows that we
> have the following edge case:
>      - 45d8ec4d9fd54
>      and
>      - 73fdad00b208b
> 
> In fact you can reproduce the exact same i_size corruption issue by running
> the generic/475 xfstests multiple times, as articulated within
> 45d8ec4d9fd54. So with that, I'm kind of confused and thinking that there may
> be a problem that resides elsewhere that may need addressing?

Right, I forgot about the special case explained in 45d8ec4d9fd54 where
there's an unwritten delalloc write beyond the range where the DIO write happens.

> > 	if (offset + count > i_size_read(inode)) {
> > 		/*
> > 		 * Add inode to orphan list so that blocks allocated beyond
> > 		 * EOF get properly truncated in case of crash.
> > 		 */
> > 		start transaction handle
> > 		add inode to orphan list
> > 		stop transaction handle
> > 	}
> > 
> > And just leave i_disksize at whatever it currently is.
> 
> I originally had the code which added the inode to the orphan list here, but
> then I thought to myself that it'd make more sense to actually do this step
> closer to the point where we've managed to successfully allocate the required
> blocks for the write. This prevents the need to spray orphan list clean up
> code all over the place just to cover the case that a write which had intended
> to extend the inode beyond i_size had failed prematurely (i.e. before block
> allocation). So, hence the reason why I thought having it in
> ext4_iomap_begin() would make more sense, because at that point in the write
> path, there is enough/or more assurance to make the call around whether we
> will in fact be able to perform the write which will be extending beyond
> i_size, or not and consequently whether the inode should be placed onto the
> orphan list?
> 
> Ideally I'd like to turn this statement into:
> 
> 	if (offset + count > i_size_read(inode))
> 	        extend = true;
> 
> Maybe I'm missing something here and there's actually a really good reason for
> doing this nice and early? What are your thoughts about what I've mentioned
> above?

Well, the slight trouble with adding inode to orphan list in
ext4_iomap_begin() is that then it is somewhat difficult to tell whether
you need to remove it when IO is done because there's no way to
propagate that information from ext4_iomap_begin() and checking against
i_disksize is unreliable because it can change (due to writeback of
delalloc pages) while direct IO is running. But I think we can overcome
that by splitting our end_io functions into two - ext4_dio_write_end_io() and
ext4_dio_extend_write_end_io(). So:

	WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize);
	/*
	 * Need to check against i_disksize as there may be delalloc writes
	 * pending.
	 */
 	if (offset + count > EXT4_I(inode)->i_disksize)
		extend = true;

	...
	iomap_dio_rw(...,
		extend ? ext4_dio_extend_write_end_io : ext4_dio_write_end_io);

and ext4_dio_write_end_io() will just take care of conversion of unwritten
extents on successful IO completion, while ext4_dio_extend_write_end_io()
will take care of all the complex stuff with orphan handling, extension
of inode size, and truncation of blocks beyond EOF - and it can do that
because it is guaranteed to run under the protection of i_rwsem held in
ext4_dio_write_iter().

Alternatively, we could also just pass NULL instead of
ext4_dio_extend_write_end_io() and just do all the work explicitly in
ext4_dio_write_iter() in the 'extend' case. That might actually be the most
transparent option...
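
As a rough user-space illustration of the split suggested above (not kernel code), the handler choice could be modelled as below. Everything here is hypothetical: struct model_inode, plain_end_io(), extend_end_io() and pick_end_io() are names made up for the sketch, and the handlers only mimic the size bookkeeping, not journalling, orphan list updates or truncation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two sizes ext4 tracks per inode. */
struct model_inode {
	int64_t i_size;      /* size as seen by user space */
	int64_t i_disksize;  /* size recorded on disk */
};

typedef int (*end_io_fn)(struct model_inode *inode, int64_t offset,
			 int64_t size, int error);

/* Non-extending completion: would only convert unwritten extents. */
static int plain_end_io(struct model_inode *inode, int64_t offset,
			int64_t size, int error)
{
	(void)inode; (void)offset; (void)size;
	return error;
}

/* Extending completion: the real one would also handle the orphan list. */
static int extend_end_io(struct model_inode *inode, int64_t offset,
			 int64_t size, int error)
{
	if (error)
		return error;  /* real code would truncate blocks past EOF */
	if (offset + size > inode->i_disksize)
		inode->i_disksize = offset + size;
	if (offset + size > inode->i_size)
		inode->i_size = offset + size;
	return 0;
}

/* Decide once, at submission time, which completion handler to use. */
static end_io_fn pick_end_io(const struct model_inode *inode,
			     int64_t offset, int64_t count)
{
	return offset + count > inode->i_disksize ? extend_end_io
						  : plain_end_io;
}
```

The point of the sketch is only that the decision is taken before submission and is carried by the choice of function pointer, so completion does not need to re-read i_disksize.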

But at this point there are so many suggestions in flight that I need to
see the current state of the code again to be able to tell anything useful :).

								Honza
Matthew Bobrowski Sept. 25, 2019, 7:14 a.m. UTC | #16
On Tue, Sep 24, 2019 at 04:13:21PM +0200, Jan Kara wrote:
> On Tue 24-09-19 20:29:26, Matthew Bobrowski wrote:
> > On Mon, Sep 23, 2019 at 11:10:11PM +0200, Jan Kara wrote:
> > > On Thu 12-09-19 21:04:46, Matthew Bobrowski wrote:
> > > > +	if (offset + count > i_size_read(inode) ||
> > > > +	    offset + count > EXT4_I(inode)->i_disksize) {
> > > > +		ext4_update_i_disksize(inode, inode->i_size);
> > > > +		extend = true;
> > > > +	}
> > > 
> > > This call to ext4_update_i_disksize() is definitely wrong. If nothing else,
> > > you need to also have transaction started and call ext4_mark_inode_dirty()
> > > to actually journal the change of i_disksize (ext4_update_i_disksize()
> > > updates only the in-memory copy of the entry). Also the direct IO code
> > > needs to add the inode to the orphan list so that in case of crash, blocks
> > > allocated beyond EOF get truncated on next mount. That is the whole point
> > > > of this exercise with i_disksize after all.
> > > 
> > > But I'm wondering if i_disksize update is needed. Truncate cannot be in
> > > progress (we hold i_rwsem) and dirty pages will be flushed by
> > > iomap_dio_rw() before we start to allocate any blocks. So it should be
> > > enough to have here:
> > 
> > Well, I initially thought the same, however doing some research shows that we
> > have the following edge case:
> >      - 45d8ec4d9fd54
> >      and
> >      - 73fdad00b208b
> > 
> > In fact you can reproduce the exact same i_size corruption issue by running
> > the generic/475 xfstests multiple times, as articulated within
> > 45d8ec4d9fd54. So with that, I'm kind of confused and thinking that there may
> > be a problem that resides elsewhere that may need addressing?
> 
> Right, I forgot about the special case explained in 45d8ec4d9fd54 where
> there's an unwritten delalloc write beyond the range where the DIO write happens.
> 
> > > 	if (offset + count > i_size_read(inode)) {
> > > 		/*
> > > 		 * Add inode to orphan list so that blocks allocated beyond
> > > 		 * EOF get properly truncated in case of crash.
> > > 		 */
> > > 		start transaction handle
> > > 		add inode to orphan list
> > > 		stop transaction handle
> > > 	}
> > > 
> > > And just leave i_disksize at whatever it currently is.
> > 
> > I originally had the code which added the inode to the orphan list here, but
> > then I thought to myself that it'd make more sense to actually do this step
> > closer to the point where we've managed to successfully allocate the required
> > blocks for the write. This prevents the need to spray orphan list clean up
> > code all over the place just to cover the case that a write which had intended
> > to extend the inode beyond i_size had failed prematurely (i.e. before block
> > allocation). So, hence the reason why I thought having it in
> > ext4_iomap_begin() would make more sense, because at that point in the write
> > path, there is enough/or more assurance to make the call around whether we
> > will in fact be able to perform the write which will be extending beyond
> > i_size, or not and consequently whether the inode should be placed onto the
> > orphan list?
> > 
> > Ideally I'd like to turn this statement into:
> > 
> > 	if (offset + count > i_size_read(inode))
> > 	        extend = true;
> > 
> > Maybe I'm missing something here and there's actually a really good reason for
> > doing this nice and early? What are your thoughts about what I've mentioned
> > above?
> 
> Well, the slight trouble with adding inode to orphan list in
> ext4_iomap_begin() is that then it is somewhat difficult to tell whether
> you need to remove it when IO is done because there's no way to
> propagate that information from ext4_iomap_begin() and checking against
> i_disksize is unreliable because it can change (due to writeback of
> delalloc pages) while direct IO is running. But I think we can overcome
> that by splitting our end_io functions into two - ext4_dio_write_end_io() and
> ext4_dio_extend_write_end_io(). So:
> 
> 	WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize);
> 	/*
> 	 * Need to check against i_disksize as there may be delalloc writes
> 	 * pending.
> 	 */
>  	if (offset + count > EXT4_I(inode)->i_disksize)
> 		extend = true;

Hm... I'm not entirely convinced that EXT4_I(inode)->i_disksize is what we
should be using to determine whether an extension will be performed or
not? Because, my understanding is that i_size is what holds the actual value
of what the file size is expected to be and hence the reason why we previously
updated the i_disksize to i_size using ext4_update_i_disksize().

Also, there are cases where offset + count > EXT4_I(inode)->i_disksize,
however offset + count < i_size_read(inode). So in that case we may take an
incorrect path somewhere i.e. below where extend clause is true. Also, I feel
as though we should try to stick to using one value as the reference to determine
whether we're performing an extension and not use i_disksize here and then
i_size over there kind of thing as that leads to unnecessary confusion?

> 	...
> 	iomap_dio_rw(...,
> 		extend ? ext4_dio_extend_write_end_io : ext4_dio_write_end_io);
> 
> and ext4_dio_write_end_io() will just take care of conversion of unwritten
> extents on successful IO completion, while ext4_dio_extend_write_end_io()
> will take care of all the complex stuff with orphan handling, extension
> of inode size, and truncation of blocks beyond EOF - and it can do that
> because it is guaranteed to run under the protection of i_rwsem held in
> ext4_dio_write_iter().
> 
> Alternatively, we could also just pass NULL instead of
> ext4_dio_extend_write_end_io() and just do all the work explicitly in
> ext4_dio_write_iter() in the 'extend' case. That might actually be the most
> transparent option...

Well, with the changes to ext4_handle_inode_extension() conditions that you
recommended in patch 2/6, then I can't see why we'd need two separate
->end_io() handlers as we'd just abort early if we're not extending?

> But at this point there are so many suggestions in flight that I need to
> see the current state of the code again to be able to tell anything useful :).

Heh, true. I will post through an updated patch series taking into account
most of the recommendations put forward for this series version and then we
can have a discussion based on that. :)

--<M>--
Jan Kara Sept. 25, 2019, 8:40 a.m. UTC | #17
On Wed 25-09-19 17:14:29, Matthew Bobrowski wrote:
> On Tue, Sep 24, 2019 at 04:13:21PM +0200, Jan Kara wrote:
> > On Tue 24-09-19 20:29:26, Matthew Bobrowski wrote:
> > > On Mon, Sep 23, 2019 at 11:10:11PM +0200, Jan Kara wrote:
> > > > On Thu 12-09-19 21:04:46, Matthew Bobrowski wrote:
> > > > > +	if (offset + count > i_size_read(inode) ||
> > > > > +	    offset + count > EXT4_I(inode)->i_disksize) {
> > > > > +		ext4_update_i_disksize(inode, inode->i_size);
> > > > > +		extend = true;
> > > > > +	}
> > > > 
> > > > This call to ext4_update_i_disksize() is definitely wrong. If nothing else,
> > > > you need to also have transaction started and call ext4_mark_inode_dirty()
> > > > to actually journal the change of i_disksize (ext4_update_i_disksize()
> > > > updates only the in-memory copy of the entry). Also the direct IO code
> > > > needs to add the inode to the orphan list so that in case of crash, blocks
> > > > allocated beyond EOF get truncated on next mount. That is the whole point
> > > > > of this exercise with i_disksize after all.
> > > > 
> > > > But I'm wondering if i_disksize update is needed. Truncate cannot be in
> > > > progress (we hold i_rwsem) and dirty pages will be flushed by
> > > > iomap_dio_rw() before we start to allocate any blocks. So it should be
> > > > enough to have here:
> > > 
> > > Well, I initially thought the same, however doing some research shows that we
> > > have the following edge case:
> > >      - 45d8ec4d9fd54
> > >      and
> > >      - 73fdad00b208b
> > > 
> > > In fact you can reproduce the exact same i_size corruption issue by running
> > > the generic/475 xfstests multiple times, as articulated within
> > > 45d8ec4d9fd54. So with that, I'm kind of confused and thinking that there may
> > > be a problem that resides elsewhere that may need addressing?
> > 
> > Right, I forgot about the special case explained in 45d8ec4d9fd54 where
> > there's an unwritten delalloc write beyond the range where the DIO write happens.
> > 
> > > > 	if (offset + count > i_size_read(inode)) {
> > > > 		/*
> > > > 		 * Add inode to orphan list so that blocks allocated beyond
> > > > 		 * EOF get properly truncated in case of crash.
> > > > 		 */
> > > > 		start transaction handle
> > > > 		add inode to orphan list
> > > > 		stop transaction handle
> > > > 	}
> > > > 
> > > > And just leave i_disksize at whatever it currently is.
> > > 
> > > I originally had the code which added the inode to the orphan list here, but
> > > then I thought to myself that it'd make more sense to actually do this step
> > > closer to the point where we've managed to successfully allocate the required
> > > blocks for the write. This prevents the need to spray orphan list clean up
> > > code all over the place just to cover the case that a write which had intended
> > > to extend the inode beyond i_size had failed prematurely (i.e. before block
> > > allocation). So, hence the reason why I thought having it in
> > > ext4_iomap_begin() would make more sense, because at that point in the write
> > > path, there is enough/or more assurance to make the call around whether we
> > > will in fact be able to perform the write which will be extending beyond
> > > i_size, or not and consequently whether the inode should be placed onto the
> > > orphan list?
> > > 
> > > Ideally I'd like to turn this statement into:
> > > 
> > > 	if (offset + count > i_size_read(inode))
> > > 	        extend = true;
> > > 
> > > Maybe I'm missing something here and there's actually a really good reason for
> > > doing this nice and early? What are your thoughts about what I've mentioned
> > > above?
> > 
> > Well, the slight trouble with adding inode to orphan list in
> > ext4_iomap_begin() is that then it is somewhat difficult to tell whether
> > you need to remove it when IO is done because there's no way to
> > propagate that information from ext4_iomap_begin() and checking against
> > i_disksize is unreliable because it can change (due to writeback of
> > delalloc pages) while direct IO is running. But I think we can overcome
> > that by splitting our end_io functions into two - ext4_dio_write_end_io() and
> > ext4_dio_extend_write_end_io(). So:
> > 
> > 	WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize);
> > 	/*
> > 	 * Need to check against i_disksize as there may be delalloc writes
> > 	 * pending.
> > 	 */
> >  	if (offset + count > EXT4_I(inode)->i_disksize)
> > 		extend = true;
> 
> Hm... I'm not entirely convinced that EXT4_I(inode)->i_disksize is what we
> should be using to determine whether an extension will be performed or
> not? Because, my understanding is that i_size is what holds the actual value
> of what the file size is expected to be and hence the reason why we previously
> updated the i_disksize to i_size using ext4_update_i_disksize().

So i_size is the inode size as it appears to the user, while i_disksize is
what the inode size really is on disk. And because orphan handling and
similar operations require i_rwsem protection, we need to use the slow path
that waits for DIO to complete whenever the direct IO goes beyond the
on-disk version of the inode size. Possibly there are other places in the
DIO path that need to use i_disksize instead of i_size as well - I'll check
that once I see the new version of your patches.
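
For illustration only, the "slow path" condition described above could be written as the small predicate below. This is a user-space sketch, not the patch itself: must_wait_for_dio() is a made-up helper, and MODEL_EIOCBQUEUED is a local stand-in for the kernel-internal "queued, completion arrives later" return code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-in for the kernel-internal -EIOCBQUEUED return value. */
#define MODEL_EIOCBQUEUED 529

/*
 * Decide whether the submitter has to wait for in-flight direct IO
 * before dropping the inode lock. Waiting keeps the completion work
 * (orphan list handling, size updates) under the lock whenever the
 * write ends beyond the on-disk size, or when it is unaligned AIO.
 */
static bool must_wait_for_dio(long ret, int64_t offset, int64_t count,
			      int64_t i_disksize, bool unaligned_aio)
{
	bool extend = offset + count > i_disksize;

	return ret == -MODEL_EIOCBQUEUED && (unaligned_aio || extend);
}
```

A write that completed synchronously (ret is not the "queued" code) never needs the wait, since its completion work has already run in the submitter's context.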

> Also, there are cases where offset + count > EXT4_I(inode)->i_disksize,
> however offset + count < i_size_read(inode). So in that case we may take an
> incorrect path somewhere i.e. below where extend clause is true. Also, I feel
> as though we should try to stick to using one value as the reference to determine
> whether we're performing an extension and not use i_disksize here and then
> i_size over there kind of thing as that leads to unnecessary confusion?

Well, when DIO extends past i_disksize, it needs to add the inode to the
orphan list, call truncate in case of a failed write, etc. So extension of
i_disksize is what actually matters. I was speaking in the past about
i_size because I thought i_size == i_disksize for the cases we care about,
but as you correctly pointed out that isn't necessarily the case.
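
To make the three cases concrete, here is a small user-space sketch (an illustration, not code from the series). It assumes i_size >= i_disksize, which is what the WARN_ON_ONCE() in the earlier snippet asserts; enum write_kind, classify_write() and needs_orphan_handling() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum write_kind {
	WRITE_WITHIN_DISKSIZE,     /* ends at or below i_disksize */
	WRITE_PAST_DISKSIZE_ONLY,  /* above i_disksize, not above i_size */
	WRITE_PAST_I_SIZE,         /* above both sizes */
};

/* Classify a write by where its end falls relative to the two sizes. */
static enum write_kind classify_write(int64_t i_size, int64_t i_disksize,
				      int64_t offset, int64_t count)
{
	int64_t end = offset + count;

	if (end > i_size)
		return WRITE_PAST_I_SIZE;
	if (end > i_disksize)
		return WRITE_PAST_DISKSIZE_ONLY;
	return WRITE_WITHIN_DISKSIZE;
}

/* Anything that moves i_disksize needs the orphan list treatment. */
static bool needs_orphan_handling(enum write_kind kind)
{
	return kind != WRITE_WITHIN_DISKSIZE;
}
```

With a pending delalloc write that already raised i_size to 20480 while i_disksize is still 0, a 4096-byte direct write at offset 4096 lands in the middle case: a check against i_size alone would not treat it as an extension.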

> > 	...
> > 	iomap_dio_rw(...,
> > 		extend ? ext4_dio_extend_write_end_io : ext4_dio_write_end_io);
> > 
> > and ext4_dio_write_end_io() will just take care of conversion of unwritten
> > extents on successful IO completion, while ext4_dio_extend_write_end_io()
> > will take care of all the complex stuff with orphan handling, extension
> > of inode size, and truncation of blocks beyond EOF - and it can do that
> > because it is guaranteed to run under the protection of i_rwsem held in
> > ext4_dio_write_iter().
> > 
> > Alternatively, we could also just pass NULL instead of
> > ext4_dio_extend_write_end_io() and just do all the work explicitly in
> > ext4_dio_write_iter() in the 'extend' case. That might actually be the most
> > transparent option...
> 
> Well, with the changes to ext4_handle_inode_extension() conditions that you
> recommended in patch 2/6, then I can't see why we'd need two separate
> ->end_io() handlers as we'd just abort early if we're not extending?

The problem is that the condition I've suggested for patch 2/6 will
actually be racy if we use i_disksize for comparison. Consider the following
situation:

CPU1					CPU2
fd1 = open("file");			fd2 = open("file", O_DIRECT);
/* Delalloc write */
pwrite(fd1, buf, 4096, 16384);
					/* O_DIRECT write */
					pwrite(fd2, buf, 4096, 4096)
					  i_disksize == 0 so we have to add
					  inode to orphan list
					    submit DIO
writeback happens, i_disksize extended
to 20480.
					    DIO completes ->
					    ext4_dio_write_end_io() - sees
					    big i_disksize so does not
					    touch orphan list and inode is
					    wrongly left there.


And we cannot remove the inode unconditionally from the orphan list in
ext4_dio_write_end_io() because when DIO starts already below i_disksize,
ext4_dio_write_end_io() may get called without i_rwsem protection.
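
The interleaving above can be replayed in a single-threaded user-space model. This is only a sketch under simplifying assumptions: struct model_inode and all four helpers are hypothetical, and the completion functions model nothing but the orphan list decision.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct model_inode {
	int64_t i_disksize;
	bool on_orphan_list;
};

/* Submission: a write ending past i_disksize puts the inode on the list. */
static bool dio_submit(struct model_inode *inode, int64_t off, int64_t cnt)
{
	bool extend = off + cnt > inode->i_disksize;

	if (extend)
		inode->on_orphan_list = true;
	return extend;
}

/* Writeback of delalloc data raises i_disksize behind the DIO's back. */
static void delalloc_writeback(struct model_inode *inode, int64_t end)
{
	if (end > inode->i_disksize)
		inode->i_disksize = end;
}

/* Racy completion: re-derives "was extending" from the current i_disksize. */
static void dio_complete_racy(struct model_inode *inode, int64_t off,
			      int64_t cnt)
{
	if (off + cnt > inode->i_disksize)
		inode->on_orphan_list = false;
}

/* Completion that trusts the decision recorded at submission time. */
static void dio_complete_with_flag(struct model_inode *inode, bool extend)
{
	if (extend)
		inode->on_orphan_list = false;
}
```

Replaying the sequence (submit at offset 4096, writeback up to 20480, then complete) leaves the inode on the orphan list with the racy variant, while the variant that remembers the submission-time decision removes it.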

								Honza
Patch

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 6457c629b8cf..413c7895aa9e 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -29,6 +29,7 @@ 
 #include <linux/pagevec.h>
 #include <linux/uio.h>
 #include <linux/mman.h>
+#include <linux/backing-dev.h>
 #include "ext4.h"
 #include "ext4_jbd2.h"
 #include "xattr.h"
@@ -213,12 +214,16 @@  static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
 	struct inode *inode = file_inode(iocb->ki_filp);
 	ssize_t ret;
 
+	if (unlikely(IS_IMMUTABLE(inode)))
+		return -EPERM;
+
 	ret = generic_write_checks(iocb, from);
 	if (ret <= 0)
 		return ret;
 
-	if (unlikely(IS_IMMUTABLE(inode)))
-		return -EPERM;
+	ret = file_modified(iocb->ki_filp);
+	if (ret)
+		return ret;
 
 	/*
 	 * If we have encountered a bitmap-format file, the size limit
@@ -234,6 +239,32 @@  static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
 	return iov_iter_count(from);
 }
 
+static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
+					struct iov_iter *from)
+{
+	ssize_t ret;
+	struct inode *inode = file_inode(iocb->ki_filp);
+
+	if (iocb->ki_flags & IOCB_NOWAIT)
+		return -EOPNOTSUPP;
+
+	inode_lock(inode);
+	ret = ext4_write_checks(iocb, from);
+	if (ret <= 0)
+		goto out;
+
+	current->backing_dev_info = inode_to_bdi(inode);
+	ret = generic_perform_write(iocb->ki_filp, from, iocb->ki_pos);
+	current->backing_dev_info = NULL;
+out:
+	inode_unlock(inode);
+	if (likely(ret > 0)) {
+		iocb->ki_pos += ret;
+		ret = generic_write_sync(iocb, ret);
+	}
+	return ret;
+}
+
 static int ext4_handle_inode_extension(struct inode *inode, loff_t offset,
 				       ssize_t len, size_t count)
 {
@@ -310,6 +341,120 @@  static int ext4_handle_failed_inode_extension(struct inode *inode, loff_t size)
 	return 0;
 }
 
+/*
+ * For a write that extends the inode size, ext4_dio_write_iter() will
+ * wait for the write to complete. Consequently, operations performed
+ * within this function are still covered by the inode_lock(). On
+ * success, this function returns 0.
+ */
+static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size, int error,
+				 unsigned int flags)
+{
+	int ret;
+	loff_t offset = iocb->ki_pos;
+	struct inode *inode = file_inode(iocb->ki_filp);
+
+	if (error) {
+		ret = ext4_handle_failed_inode_extension(inode, offset + size);
+		return ret ? ret : error;
+	}
+
+	if (flags & IOMAP_DIO_UNWRITTEN) {
+		ret = ext4_convert_unwritten_extents(NULL, inode,
+						     offset, size);
+		if (ret)
+			return ret;
+	}
+
+	if (offset + size > i_size_read(inode)) {
+		ret = ext4_handle_inode_extension(inode, offset, size, 0);
+		if (ret)
+			return ret;
+	}
+	return 0;
+}
+
+static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+	ssize_t ret;
+	size_t count;
+	loff_t offset = iocb->ki_pos;
+	struct inode *inode = file_inode(iocb->ki_filp);
+	bool extend = false, overwrite = false, unaligned_aio = false;
+
+	if (!inode_trylock(inode)) {
+		if (iocb->ki_flags & IOCB_NOWAIT)
+			return -EAGAIN;
+		inode_lock(inode);
+	}
+
+	if (!ext4_dio_checks(inode)) {
+		inode_unlock(inode);
+		/*
+		 * Fallback to buffered IO if the operation on the
+		 * inode is not supported by direct IO.
+		 */
+		return ext4_buffered_write_iter(iocb, from);
+	}
+
+	ret = ext4_write_checks(iocb, from);
+	if (ret <= 0) {
+		inode_unlock(inode);
+		return ret;
+	}
+
+	/*
+	 * Unaligned direct AIO must be serialized among each other as
+	 * the zeroing of partial blocks of two competing unaligned
+	 * AIOs can result in data corruption.
+	 */
+	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
+	    !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) {
+		unaligned_aio = true;
+		inode_dio_wait(inode);
+	}
+
+	/*
+	 * Determine whether the IO operation will overwrite allocated
+	 * and initialized blocks. If so, check to see whether it is
+	 * possible to take the dioread_nolock path.
+	 */
+	count = iov_iter_count(from);
+	if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
+	    ext4_should_dioread_nolock(inode)) {
+		overwrite = true;
+		downgrade_write(&inode->i_rwsem);
+	}
+
+	if (offset + count > i_size_read(inode) ||
+	    offset + count > EXT4_I(inode)->i_disksize) {
+		ext4_update_i_disksize(inode, inode->i_size);
+		extend = true;
+	}
+
+	ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, ext4_dio_write_end_io);
+
+	/*
+	 * Unaligned direct AIO must be the only IO in flight or else
+	 * any overlapping aligned IO after unaligned IO might result
+	 * in data corruption. We also need to wait here in the case
+	 * where the inode is being extended so that inode extension
+	 * routines in ext4_dio_write_end_io() are covered by the
+	 * inode_lock().
+	 */
+	if (ret == -EIOCBQUEUED && (unaligned_aio || extend))
+		inode_dio_wait(inode);
+
+	if (overwrite)
+		inode_unlock_shared(inode);
+	else
+		inode_unlock(inode);
+
+	if (ret >= 0 && iov_iter_count(from))
+		return ext4_buffered_write_iter(iocb, from);
+	return ret;
+}
+
 #ifdef CONFIG_FS_DAX
 static ssize_t
 ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
@@ -324,15 +469,10 @@  ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
 			return -EAGAIN;
 		inode_lock(inode);
 	}
+
 	ret = ext4_write_checks(iocb, from);
 	if (ret <= 0)
 		goto out;
-	ret = file_remove_privs(iocb->ki_filp);
-	if (ret)
-		goto out;
-	ret = file_update_time(iocb->ki_filp);
-	if (ret)
-		goto out;
 
 	offset = iocb->ki_pos;
 	ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
@@ -358,73 +498,16 @@  static ssize_t
 ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
-	int o_direct = iocb->ki_flags & IOCB_DIRECT;
-	int unaligned_aio = 0;
-	int overwrite = 0;
-	ssize_t ret;
 
 	if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
 		return -EIO;
 
-#ifdef CONFIG_FS_DAX
 	if (IS_DAX(inode))
 		return ext4_dax_write_iter(iocb, from);
-#endif
-	if (!o_direct && (iocb->ki_flags & IOCB_NOWAIT))
-		return -EOPNOTSUPP;
 
-	if (!inode_trylock(inode)) {
-		if (iocb->ki_flags & IOCB_NOWAIT)
-			return -EAGAIN;
-		inode_lock(inode);
-	}
-
-	ret = ext4_write_checks(iocb, from);
-	if (ret <= 0)
-		goto out;
-
-	/*
-	 * Unaligned direct AIO must be serialized among each other as zeroing
-	 * of partial blocks of two competing unaligned AIOs can result in data
-	 * corruption.
-	 */
-	if (o_direct && ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
-	    !is_sync_kiocb(iocb) &&
-	    ext4_unaligned_aio(inode, from, iocb->ki_pos)) {
-		unaligned_aio = 1;
-		ext4_unwritten_wait(inode);
-	}
-
-	iocb->private = &overwrite;
-	/* Check whether we do a DIO overwrite or not */
-	if (o_direct && !unaligned_aio) {
-		if (ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from))) {
-			if (ext4_should_dioread_nolock(inode))
-				overwrite = 1;
-		} else if (iocb->ki_flags & IOCB_NOWAIT) {
-			ret = -EAGAIN;
-			goto out;
-		}
-	}
-
-	ret = __generic_file_write_iter(iocb, from);
-	/*
-	 * Unaligned direct AIO must be the only IO in flight. Otherwise
-	 * overlapping aligned IO after unaligned might result in data
-	 * corruption.
-	 */
-	if (ret == -EIOCBQUEUED && unaligned_aio)
-		ext4_unwritten_wait(inode);
-	inode_unlock(inode);
-
-	if (ret > 0)
-		ret = generic_write_sync(iocb, ret);
-
-	return ret;
-
-out:
-	inode_unlock(inode);
-	return ret;
+	if (iocb->ki_flags & IOCB_DIRECT)
+		return ext4_dio_write_iter(iocb, from);
+	return ext4_buffered_write_iter(iocb, from);
 }
 
 #ifdef CONFIG_FS_DAX
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index efb184928e51..f52ad3065236 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3513,11 +3513,13 @@  static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 			}
 		}
 	} else if (flags & IOMAP_WRITE) {
-		int dio_credits;
 		handle_t *handle;
-		int retries = 0;
+		int dio_credits, retries = 0, m_flags = 0;
 
-		/* Trim mapping request to maximum we can map at once for DIO */
+		/*
+		 * Trim mapping request to maximum we can map at once
+		 * for DIO.
+		 */
 		if (map.m_len > DIO_MAX_BLOCKS)
 			map.m_len = DIO_MAX_BLOCKS;
 		dio_credits = ext4_chunk_trans_blocks(inode, map.m_len);
@@ -3533,8 +3535,30 @@  static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 		if (IS_ERR(handle))
 			return PTR_ERR(handle);
 
-		ret = ext4_map_blocks(handle, inode, &map,
-				      EXT4_GET_BLOCKS_CREATE_ZERO);
+		/*
+		 * DAX and direct IO are the only two operations that
+		 * are currently supported with IOMAP_WRITE.
+		 */
+		WARN_ON(!IS_DAX(inode) && !(flags & IOMAP_DIRECT));
+		if (IS_DAX(inode))
+			m_flags = EXT4_GET_BLOCKS_CREATE_ZERO;
+		else if (round_down(offset, i_blocksize(inode)) >=
+			 i_size_read(inode))
+			m_flags = EXT4_GET_BLOCKS_CREATE;
+		else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
+			m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
+
+		ret = ext4_map_blocks(handle, inode, &map, m_flags);
+
+		/*
+		 * We cannot fill holes in indirect tree based inodes
+		 * as that could expose stale data in the case of a
+		 * crash. Use the magic error code to fallback to
+		 * buffered IO.
+		 */
+		if (!m_flags && !ret)
+			ret = -ENOTBLK;
+
 		if (ret < 0) {
 			ext4_journal_stop(handle);
 			if (ret == -ENOSPC &&
@@ -3544,13 +3568,14 @@  static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 		}
 
 		/*
-		 * If we added blocks beyond i_size, we need to make sure they
-		 * will get truncated if we crash before updating i_size in
-		 * ext4_iomap_end(). For faults we don't need to do that (and
-		 * even cannot because for orphan list operations inode_lock is
-		 * required) - if we happen to instantiate block beyond i_size,
-		 * it is because we race with truncate which has already added
-		 * the inode to the orphan list.
+		 * If we added blocks beyond i_size, we need to make
+		 * sure they will get truncated if we crash before
+		 * updating the i_size. For faults we don't need to do
+		 * that (and even cannot because for orphan list
+		 * operations inode_lock is required) - if we happen
+		 * to instantiate block beyond i_size, it is because
+		 * we race with truncate which has already added the
+		 * inode to the orphan list.
 		 */
 		if (!(flags & IOMAP_FAULT) && first_block + map.m_len >
 		    (i_size_read(inode) + (1 << blkbits) - 1) >> blkbits) {
@@ -3612,6 +3637,14 @@  static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
 			  ssize_t written, unsigned flags, struct iomap *iomap)
 {
+	/*
+	 * Check to see whether an error occurred while writing data
+	 * out to allocated blocks. If so, return the magic error code
+	 * so that we fallback to buffered IO and reuse the blocks
+	 * that were allocated in preparation for the direct IO write.
+	 */
+	if (flags & IOMAP_DIRECT && written == 0)
+		return -ENOTBLK;
 	return 0;
 }