[3/9] f2fs: rework write preallocations

Message ID 20210716143919.44373-4-ebiggers@kernel.org (mailing list archive)
State New, archived
Series f2fs: use iomap for direct I/O

Commit Message

Eric Biggers July 16, 2021, 2:39 p.m. UTC
From: Eric Biggers <ebiggers@google.com>

f2fs_write_begin() assumes that all blocks were preallocated by
default unless FI_NO_PREALLOC is explicitly set.  This invites data
corruption, as there are cases in which not all blocks are preallocated.
Commit 47501f87c61a ("f2fs: preallocate DIO blocks when forcing
buffered_io") fixed one case, but there are others remaining.

Fix up this logic by replacing this flag with FI_PREALLOCATED_ALL, which
only gets set if all blocks for the current write were preallocated.

Also clean up f2fs_preallocate_blocks(), move it to file.c, and make it
handle some of the logic that was previously in write_iter() directly.

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 fs/f2fs/data.c |  55 ++--------------------
 fs/f2fs/f2fs.h |   3 +-
 fs/f2fs/file.c | 123 ++++++++++++++++++++++++++++++++-----------------
 3 files changed, 84 insertions(+), 97 deletions(-)

Comments

Chao Yu July 25, 2021, 10:50 a.m. UTC | #1
On 2021/7/16 22:39, Eric Biggers wrote:
> From: Eric Biggers <ebiggers@google.com>
> 
> f2fs_write_begin() assumes that all blocks were preallocated by
> default unless FI_NO_PREALLOC is explicitly set.  This invites data
> corruption, as there are cases in which not all blocks are preallocated.
> Commit 47501f87c61a ("f2fs: preallocate DIO blocks when forcing
> buffered_io") fixed one case, but there are others remaining.

Could you please explain which cases we previously failed to handle?
Then I can check the related logic before and after the rework.

> -			/*
> -			 * If force_buffere_io() is true, we have to allocate
> -			 * blocks all the time, since f2fs_direct_IO will fall
> -			 * back to buffered IO.
> -			 */
> -			if (!f2fs_force_buffered_io(inode, iocb, from) &&
> -					f2fs_lfs_mode(F2FS_I_SB(inode)))

We should keep this OPU (out-of-place update) DIO logic; otherwise, in LFS
mode, a direct write will always allocate two block addresses for each 4k
append I/O.

I just tested this on the latest f2fs dev-test branch.

rm /mnt/f2fs/dio
dd if=/dev/zero  of=/mnt/f2fs/dio bs=4k count=4 oflag=direct

           <...>-763176  [001] ...1 177258.793370: f2fs_map_blocks: dev = (259,1), ino = 6, file offset = 0, start blkaddr = 0xe1a2e, len = 0x1, flags = 48,seg_type = 1, may_create = 1, err = 0
            <...>-763176  [001] ...1 177258.793462: f2fs_map_blocks: dev = (259,1), ino = 6, file offset = 0, start blkaddr = 0xe1a2f, len = 0x1, flags = 16,seg_type = 1, may_create = 1, err = 0
               dd-763176  [001] ...1 177258.793575: f2fs_map_blocks: dev = (259,1), ino = 6, file offset = 1, start blkaddr = 0xe1a30, len = 0x1, flags = 48,seg_type = 1, may_create = 1, err = 0
               dd-763176  [001] ...1 177258.793599: f2fs_map_blocks: dev = (259,1), ino = 6, file offset = 1, start blkaddr = 0xe1a31, len = 0x1, flags = 16,seg_type = 1, may_create = 1, err = 0
               dd-763176  [001] ...1 177258.793735: f2fs_map_blocks: dev = (259,1), ino = 6, file offset = 2, start blkaddr = 0xe1a32, len = 0x1, flags = 48,seg_type = 1, may_create = 1, err = 0
               dd-763176  [001] ...1 177258.793769: f2fs_map_blocks: dev = (259,1), ino = 6, file offset = 2, start blkaddr = 0xe1a33, len = 0x1, flags = 16,seg_type = 1, may_create = 1, err = 0
               dd-763176  [001] ...1 177258.793859: f2fs_map_blocks: dev = (259,1), ino = 6, file offset = 3, start blkaddr = 0xe1a34, len = 0x1, flags = 48,seg_type = 1, may_create = 1, err = 0
               dd-763176  [001] ...1 177258.793885: f2fs_map_blocks: dev = (259,1), ino = 6, file offset = 3, start blkaddr = 0xe1a35, len = 0x1, flags = 16,seg_type = 1, may_create = 1, err = 0

Thanks,
Jaegeuk Kim July 25, 2021, 3:35 p.m. UTC | #2
Note that this patch is failing generic/250.

On 07/16, Eric Biggers wrote:
> From: Eric Biggers <ebiggers@google.com>
> 
> f2fs_write_begin() assumes that all blocks were preallocated by
> default unless FI_NO_PREALLOC is explicitly set.  This invites data
> corruption, as there are cases in which not all blocks are preallocated.
> Commit 47501f87c61a ("f2fs: preallocate DIO blocks when forcing
> buffered_io") fixed one case, but there are others remaining.
> 
> Fix up this logic by replacing this flag with FI_PREALLOCATED_ALL, which
> only gets set if all blocks for the current write were preallocated.
> 
> Also clean up f2fs_preallocate_blocks(), move it to file.c, and make it
> handle some of the logic that was previously in write_iter() directly.
> 
> Signed-off-by: Eric Biggers <ebiggers@google.com>
> ---
>  fs/f2fs/data.c |  55 ++--------------------
>  fs/f2fs/f2fs.h |   3 +-
>  fs/f2fs/file.c | 123 ++++++++++++++++++++++++++++++++-----------------
>  3 files changed, 84 insertions(+), 97 deletions(-)
> 
> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
> index 18cb28a514e6..cdadaa9daf55 100644
> --- a/fs/f2fs/data.c
> +++ b/fs/f2fs/data.c
> @@ -1370,53 +1370,6 @@ static int __allocate_data_block(struct dnode_of_data *dn, int seg_type)
>  	return 0;
>  }
>  
> -int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from)
> -{
> -	struct inode *inode = file_inode(iocb->ki_filp);
> -	struct f2fs_map_blocks map;
> -	int flag;
> -	int err = 0;
> -	bool direct_io = iocb->ki_flags & IOCB_DIRECT;
> -
> -	map.m_lblk = F2FS_BLK_ALIGN(iocb->ki_pos);
> -	map.m_len = F2FS_BYTES_TO_BLK(iocb->ki_pos + iov_iter_count(from));
> -	if (map.m_len > map.m_lblk)
> -		map.m_len -= map.m_lblk;
> -	else
> -		map.m_len = 0;
> -
> -	map.m_next_pgofs = NULL;
> -	map.m_next_extent = NULL;
> -	map.m_seg_type = NO_CHECK_TYPE;
> -	map.m_may_create = true;
> -
> -	if (direct_io) {
> -		map.m_seg_type = f2fs_rw_hint_to_seg_type(iocb->ki_hint);
> -		flag = f2fs_force_buffered_io(inode, iocb, from) ?
> -					F2FS_GET_BLOCK_PRE_AIO :
> -					F2FS_GET_BLOCK_PRE_DIO;
> -		goto map_blocks;
> -	}
> -	if (iocb->ki_pos + iov_iter_count(from) > MAX_INLINE_DATA(inode)) {
> -		err = f2fs_convert_inline_inode(inode);
> -		if (err)
> -			return err;
> -	}
> -	if (f2fs_has_inline_data(inode))
> -		return err;
> -
> -	flag = F2FS_GET_BLOCK_PRE_AIO;
> -
> -map_blocks:
> -	err = f2fs_map_blocks(inode, &map, 1, flag);
> -	if (map.m_len > 0 && err == -ENOSPC) {
> -		if (!direct_io)
> -			set_inode_flag(inode, FI_NO_PREALLOC);
> -		err = 0;
> -	}
> -	return err;
> -}
> -
>  void f2fs_do_map_lock(struct f2fs_sb_info *sbi, int flag, bool lock)
>  {
>  	if (flag == F2FS_GET_BLOCK_PRE_AIO) {
> @@ -3210,12 +3163,10 @@ static int prepare_write_begin(struct f2fs_sb_info *sbi,
>  	int flag;
>  
>  	/*
> -	 * we already allocated all the blocks, so we don't need to get
> -	 * the block addresses when there is no need to fill the page.
> +	 * If a whole page is being written and we already preallocated all the
> +	 * blocks, then there is no need to get a block address now.
>  	 */
> -	if (!f2fs_has_inline_data(inode) && len == PAGE_SIZE &&
> -	    !is_inode_flag_set(inode, FI_NO_PREALLOC) &&
> -	    !f2fs_verity_in_progress(inode))
> +	if (len == PAGE_SIZE && is_inode_flag_set(inode, FI_PREALLOCATED_ALL))
>  		return 0;
>  
>  	/* f2fs_lock_op avoids race between write CP and convert_inline_page */
> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> index ad7c1b94e23a..da1da3111f18 100644
> --- a/fs/f2fs/f2fs.h
> +++ b/fs/f2fs/f2fs.h
> @@ -699,7 +699,7 @@ enum {
>  	FI_INLINE_DOTS,		/* indicate inline dot dentries */
>  	FI_DO_DEFRAG,		/* indicate defragment is running */
>  	FI_DIRTY_FILE,		/* indicate regular/symlink has dirty pages */
> -	FI_NO_PREALLOC,		/* indicate skipped preallocated blocks */
> +	FI_PREALLOCATED_ALL,	/* all blocks for write were preallocated */
>  	FI_HOT_DATA,		/* indicate file is hot */
>  	FI_EXTRA_ATTR,		/* indicate file has extra attribute */
>  	FI_PROJ_INHERIT,	/* indicate file inherits projectid */
> @@ -3604,7 +3604,6 @@ void f2fs_update_data_blkaddr(struct dnode_of_data *dn, block_t blkaddr);
>  int f2fs_reserve_new_blocks(struct dnode_of_data *dn, blkcnt_t count);
>  int f2fs_reserve_new_block(struct dnode_of_data *dn);
>  int f2fs_get_block(struct dnode_of_data *dn, pgoff_t index);
> -int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from);
>  int f2fs_reserve_block(struct dnode_of_data *dn, pgoff_t index);
>  struct page *f2fs_get_read_data_page(struct inode *inode, pgoff_t index,
>  			int op_flags, bool for_write);
> diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
> index b1cb5b50faac..9b12004e78c6 100644
> --- a/fs/f2fs/file.c
> +++ b/fs/f2fs/file.c
> @@ -4218,10 +4218,72 @@ static ssize_t f2fs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
>  	return ret;
>  }
>  
> +/*
> + * Preallocate blocks for a write request, if it is possible and helpful to do
> + * so.  Returns a positive number if blocks may have been preallocated, 0 if no
> + * blocks were preallocated, or a negative errno value if something went
> + * seriously wrong.  Also sets FI_PREALLOCATED_ALL on the inode if *all* the
> + * requested blocks (not just some of them) have been allocated.
> + */
> +static int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *iter)
> +{
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
> +	const loff_t pos = iocb->ki_pos;
> +	const size_t count = iov_iter_count(iter);
> +	struct f2fs_map_blocks map = {};
> +	bool dio = (iocb->ki_flags & IOCB_DIRECT) &&
> +		   !f2fs_force_buffered_io(inode, iocb, iter);
> +	int flag;
> +	int ret;
> +
> +	/* If it will be an in-place direct write, don't bother. */
> +	if (dio && !f2fs_lfs_mode(sbi))
> +		return 0;
> +
> +	/* No-wait I/O can't allocate blocks. */
> +	if (iocb->ki_flags & IOCB_NOWAIT)
> +		return 0;
> +
> +	/* If it will be a short write, don't bother. */
> +	if (iov_iter_fault_in_readable(iter, count) != 0)
> +		return 0;
> +
> +	if (f2fs_has_inline_data(inode)) {
> +		/* If the data will fit inline, don't bother. */
> +		if (pos + count <= MAX_INLINE_DATA(inode))
> +			return 0;
> +		ret = f2fs_convert_inline_inode(inode);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	map.m_lblk = (pos >> inode->i_blkbits);
> +	map.m_len = ((pos + count - 1) >> inode->i_blkbits) - map.m_lblk + 1;
> +	map.m_may_create = true;
> +	if (dio) {
> +		map.m_seg_type = f2fs_rw_hint_to_seg_type(inode->i_write_hint);
> +		flag = F2FS_GET_BLOCK_PRE_DIO;
> +	} else {
> +		map.m_seg_type = NO_CHECK_TYPE;
> +		flag = F2FS_GET_BLOCK_PRE_AIO;
> +	}
> +
> +	ret = f2fs_map_blocks(inode, &map, 1, flag);
> +	/* -ENOSPC is only a fatal error if no blocks could be allocated. */
> +	if (ret < 0 && !(ret == -ENOSPC && map.m_len > 0))
> +		return ret;
> +	if (ret == 0)
> +		set_inode_flag(inode, FI_PREALLOCATED_ALL);
> +	return map.m_len;
> +}
> +
>  static ssize_t f2fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  {
>  	struct file *file = iocb->ki_filp;
>  	struct inode *inode = file_inode(file);
> +	loff_t target_size;
> +	int preallocated;
>  	ssize_t ret;
>  
>  	if (unlikely(f2fs_cp_error(F2FS_I_SB(inode)))) {
> @@ -4245,84 +4307,59 @@ static ssize_t f2fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  
>  	if (unlikely(IS_IMMUTABLE(inode))) {
>  		ret = -EPERM;
> -		goto unlock;
> +		goto out_unlock;
>  	}
>  
>  	if (is_inode_flag_set(inode, FI_COMPRESS_RELEASED)) {
>  		ret = -EPERM;
> -		goto unlock;
> +		goto out_unlock;
>  	}
>  
>  	ret = generic_write_checks(iocb, from);
>  	if (ret > 0) {
> -		bool preallocated = false;
> -		size_t target_size = 0;
> -		int err;
> -
> -		if (iov_iter_fault_in_readable(from, iov_iter_count(from)))
> -			set_inode_flag(inode, FI_NO_PREALLOC);
> -
> -		if ((iocb->ki_flags & IOCB_NOWAIT)) {
> +		if (iocb->ki_flags & IOCB_NOWAIT) {
>  			if (!f2fs_overwrite_io(inode, iocb->ki_pos,
>  						iov_iter_count(from)) ||
>  				f2fs_has_inline_data(inode) ||
>  				f2fs_force_buffered_io(inode, iocb, from)) {
> -				clear_inode_flag(inode, FI_NO_PREALLOC);
> -				inode_unlock(inode);
>  				ret = -EAGAIN;
> -				goto out;
> +				goto out_unlock;
>  			}
> -			goto write;
>  		}
> -
> -		if (is_inode_flag_set(inode, FI_NO_PREALLOC))
> -			goto write;
> -
>  		if (iocb->ki_flags & IOCB_DIRECT) {
>  			/*
>  			 * Convert inline data for Direct I/O before entering
>  			 * f2fs_direct_IO().
>  			 */
> -			err = f2fs_convert_inline_inode(inode);
> -			if (err)
> -				goto out_err;
> -			/*
> -			 * If force_buffere_io() is true, we have to allocate
> -			 * blocks all the time, since f2fs_direct_IO will fall
> -			 * back to buffered IO.
> -			 */
> -			if (!f2fs_force_buffered_io(inode, iocb, from) &&
> -					f2fs_lfs_mode(F2FS_I_SB(inode)))
> -				goto write;
> +			ret = f2fs_convert_inline_inode(inode);
> +			if (ret)
> +				goto out_unlock;
>  		}
> -		preallocated = true;
> -		target_size = iocb->ki_pos + iov_iter_count(from);
>  
> -		err = f2fs_preallocate_blocks(iocb, from);
> -		if (err) {
> -out_err:
> -			clear_inode_flag(inode, FI_NO_PREALLOC);
> -			inode_unlock(inode);
> -			ret = err;
> -			goto out;
> +		/* Possibly preallocate the blocks for the write. */
> +		target_size = iocb->ki_pos + iov_iter_count(from);
> +		preallocated = f2fs_preallocate_blocks(iocb, from);
> +		if (preallocated < 0) {
> +			ret = preallocated;
> +			goto out_unlock;
>  		}
> -write:
> +
>  		ret = __generic_file_write_iter(iocb, from);
> -		clear_inode_flag(inode, FI_NO_PREALLOC);
>  
> -		/* if we couldn't write data, we should deallocate blocks. */
> -		if (preallocated && i_size_read(inode) < target_size) {
> +		/* Don't leave any preallocated blocks around past i_size. */
> +		if (preallocated > 0 && inode->i_size < target_size) {
>  			down_write(&F2FS_I(inode)->i_gc_rwsem[WRITE]);
>  			down_write(&F2FS_I(inode)->i_mmap_sem);
>  			f2fs_truncate(inode);
>  			up_write(&F2FS_I(inode)->i_mmap_sem);
>  			up_write(&F2FS_I(inode)->i_gc_rwsem[WRITE]);
>  		}
> +		clear_inode_flag(inode, FI_PREALLOCATED_ALL);
>  
>  		if (ret > 0)
>  			f2fs_update_iostat(F2FS_I_SB(inode), APP_WRITE_IO, ret);
>  	}
> -unlock:
> +out_unlock:
>  	inode_unlock(inode);
>  out:
>  	trace_f2fs_file_write_iter(inode, iocb->ki_pos,
> -- 
> 2.32.0
Jaegeuk Kim July 25, 2021, 3:47 p.m. UTC | #3
On 07/25, Jaegeuk Kim wrote:
> Note that this patch is failing generic/250.

Correction: it fails on 4.14 and 4.19 after a simple cherry-pick, but
does not fail on 5.4, 5.10, or mainline.

Eric Biggers July 25, 2021, 5:57 p.m. UTC | #4
On Sun, Jul 25, 2021 at 06:50:51PM +0800, Chao Yu wrote:
> On 2021/7/16 22:39, Eric Biggers wrote:
> > From: Eric Biggers <ebiggers@google.com>
> > 
> > f2fs_write_begin() assumes that all blocks were preallocated by
> > default unless FI_NO_PREALLOC is explicitly set.  This invites data
> > corruption, as there are cases in which not all blocks are preallocated.
> > Commit 47501f87c61a ("f2fs: preallocate DIO blocks when forcing
> > buffered_io") fixed one case, but there are others remaining.
> 
> Could you please explain which cases we previously failed to handle?
> Then I can check the related logic before and after the rework.

Any case where a buffered write happens while not all blocks were preallocated
but FI_NO_PREALLOC wasn't set.  For example when ENOSPC was hit in the middle of
the preallocations for a direct write that will fall back to a buffered write,
e.g. due to f2fs_force_buffered_io() or page cache invalidation failure.

> 
> > -			/*
> > -			 * If force_buffere_io() is true, we have to allocate
> > -			 * blocks all the time, since f2fs_direct_IO will fall
> > -			 * back to buffered IO.
> > -			 */
> > -			if (!f2fs_force_buffered_io(inode, iocb, from) &&
> > -					f2fs_lfs_mode(F2FS_I_SB(inode)))
> > -				goto write;
> 
> We should keep this OPU (out-of-place update) DIO logic; otherwise, in LFS
> mode, a direct write will always allocate two block addresses for each 4k
> append I/O.
> 
> I just tested this on the latest f2fs dev-test branch.

Yes, I had misread that due to the weird goto and misleading comment and
translated it into:

        /* If it will be an in-place direct write, don't bother. */
        if (dio && !f2fs_lfs_mode(sbi))
                return 0;

It should be:

        if (dio && f2fs_lfs_mode(sbi))
                return 0;

Do you have a proper explanation for why preallocations shouldn't be done in
this case?  Note that preallocations are still done for buffered writes, which
may be out-of-place as well; how are those different?

- Eric
Eric Biggers July 25, 2021, 6:01 p.m. UTC | #5
On Sun, Jul 25, 2021 at 08:47:51AM -0700, Jaegeuk Kim wrote:
> On 07/25, Jaegeuk Kim wrote:
> > Note that this patch is failing generic/250.
> 
> Correction: it fails on 4.14 and 4.19 after a simple cherry-pick, but
> does not fail on 5.4, 5.10, or mainline.
> 

For me, generic/250 fails on both mainline and f2fs/dev without my changes.
So it isn't a regression.

- Eric
Jaegeuk Kim July 26, 2021, 7:04 p.m. UTC | #6
On 07/25, Eric Biggers wrote:
> On Sun, Jul 25, 2021 at 08:47:51AM -0700, Jaegeuk Kim wrote:
> > On 07/25, Jaegeuk Kim wrote:
> > > Note that this patch is failing generic/250.
> > 
> > Correction: it fails on 4.14 and 4.19 after a simple cherry-pick, but
> > does not fail on 5.4, 5.10, or mainline.
> > 
> 
> For me, generic/250 fails on both mainline and f2fs/dev without my changes.
> So it isn't a regression.

FYI, I had to change generic/250 like this to make it pass. I'm digging into the patch.
https://github.com/jaegeuk/xfstests-f2fs/commit/99c11b6550a2a24f831018d2e019eed86e517d44.

> 
> - Eric
Jaegeuk Kim July 27, 2021, 2 a.m. UTC | #7
On 07/25, Eric Biggers wrote:
> On Sun, Jul 25, 2021 at 06:50:51PM +0800, Chao Yu wrote:
> > On 2021/7/16 22:39, Eric Biggers wrote:
> > > From: Eric Biggers <ebiggers@google.com>
> > > 
> > > f2fs_write_begin() assumes that all blocks were preallocated by
> > > default unless FI_NO_PREALLOC is explicitly set.  This invites data
> > > corruption, as there are cases in which not all blocks are preallocated.
> > > Commit 47501f87c61a ("f2fs: preallocate DIO blocks when forcing
> > > buffered_io") fixed one case, but there are others remaining.
> > 
> > Could you please explain which cases we previously failed to handle?
> > Then I can check the related logic before and after the rework.
> 
> Any case where a buffered write happens while not all blocks were preallocated
> but FI_NO_PREALLOC wasn't set.  For example when ENOSPC was hit in the middle of
> the preallocations for a direct write that will fall back to a buffered write,
> e.g. due to f2fs_force_buffered_io() or page cache invalidation failure.
> 
> > 
> > > -			/*
> > > -			 * If force_buffere_io() is true, we have to allocate
> > > -			 * blocks all the time, since f2fs_direct_IO will fall
> > > -			 * back to buffered IO.
> > > -			 */
> > > -			if (!f2fs_force_buffered_io(inode, iocb, from) &&
> > > -					f2fs_lfs_mode(F2FS_I_SB(inode)))
> > > -				goto write;
> > 
> > We should keep this OPU (out-of-place update) DIO logic; otherwise, in LFS
> > mode, a direct write will always allocate two block addresses for each 4k
> > append I/O.
> > 
> > I just tested this on the latest f2fs dev-test branch.
> 
> Yes, I had misread that due to the weird goto and misleading comment and
> translated it into:
> 
>         /* If it will be an in-place direct write, don't bother. */
>         if (dio && !f2fs_lfs_mode(sbi))
>                 return 0;
> 
> It should be:
> 
>         if (dio && f2fs_lfs_mode(sbi))
>                 return 0;

Hmm, this addresses my generic/250 failure. And I think the commit below
explains the case.

commit 47501f87c61ad2aa234add63e1ae231521dbc3f5
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Tue Nov 26 15:01:42 2019 -0800

    f2fs: preallocate DIO blocks when forcing buffered_io

    The previous preallocation and DIO decision like below.

                             allow_outplace_dio              !allow_outplace_dio
    f2fs_force_buffered_io   (*) No_Prealloc / Buffered_IO   Prealloc / Buffered_IO
    !f2fs_force_buffered_io  No_Prealloc / DIO               Prealloc / DIO

    But, Javier reported Case (*) where zoned device bypassed preallocation but
    fell back to buffered writes in f2fs_direct_IO(), resulting in stale data
    being read.

    In order to fix the issue, actually we need to preallocate blocks whenever
    we fall back to buffered IO like this. No change is made in the other cases.

                             allow_outplace_dio              !allow_outplace_dio
    f2fs_force_buffered_io   (*) Prealloc / Buffered_IO      Prealloc / Buffered_IO
    !f2fs_force_buffered_io  No_Prealloc / DIO               Prealloc / DIO

    Reported-and-tested-by: Javier Gonzalez <javier@javigon.com>
    Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
    Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
    Reviewed-by: Chao Yu <yuchao0@huawei.com>
    Reviewed-by: Javier González <javier@javigon.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>


> 
> Do you have a proper explanation for why preallocations shouldn't be done in
> this case?  Note that preallocations are still done for buffered writes, which
> may be out-of-place as well; how are those different?
> 
> - Eric
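
The corrected check and the decision table in commit 47501f87c61a can be
condensed into a small userspace model. This is only a sketch under
simplifying assumptions: choose_prealloc() and enum prealloc_mode are
hypothetical names, not f2fs code; "allow_outplace_dio" is approximated by LFS
mode alone; and the NOWAIT, inline-data and short-write checks are left out.

```c
#include <stdbool.h>

/* Hypothetical labels for the three outcomes discussed in the thread. */
enum prealloc_mode {
	PREALLOC_NONE,		/* skip preallocation entirely */
	PREALLOC_PRE_AIO,	/* buffered-style preallocation */
	PREALLOC_PRE_DIO,	/* direct-I/O-style preallocation */
};

/*
 * Model of the corrected decision:
 *   direct         - the write was issued with direct I/O
 *   force_buffered - the filesystem will fall back to buffered I/O
 *   lfs_mode       - stands in for "allow_outplace_dio" (an assumption)
 */
static enum prealloc_mode choose_prealloc(bool direct, bool force_buffered,
					  bool lfs_mode)
{
	/* A write is only a real direct write if it will not fall back. */
	bool dio = direct && !force_buffered;

	/* Out-of-place direct write: the write path allocates by itself. */
	if (dio && lfs_mode)
		return PREALLOC_NONE;

	return dio ? PREALLOC_PRE_DIO : PREALLOC_PRE_AIO;
}
```

In this model, a direct write in LFS mode that is not forced to buffered I/O
skips preallocation, while the same write forced to buffered I/O is
preallocated like any other buffered write, which is the (*) case in the table.
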
Chao Yu July 27, 2021, 3:23 a.m. UTC | #8
On 2021/7/27 10:00, Jaegeuk Kim wrote:
> On 07/25, Eric Biggers wrote:
>> On Sun, Jul 25, 2021 at 06:50:51PM +0800, Chao Yu wrote:
>>> On 2021/7/16 22:39, Eric Biggers wrote:
>>>> From: Eric Biggers <ebiggers@google.com>
>>>>
>>>> f2fs_write_begin() assumes that all blocks were preallocated by
>>>> default unless FI_NO_PREALLOC is explicitly set.  This invites data
>>>> corruption, as there are cases in which not all blocks are preallocated.
>>>> Commit 47501f87c61a ("f2fs: preallocate DIO blocks when forcing
>>>> buffered_io") fixed one case, but there are others remaining.
>>>
>>> Could you please explain which cases we missed to handle previously?
>>> then I can check those related logic before and after the rework.
>>
>> Any case where a buffered write happens while not all blocks were preallocated
>> but FI_NO_PREALLOC wasn't set.  For example when ENOSPC was hit in the middle of
>> the preallocations for a direct write that will fall back to a buffered write,
>> e.g. due to f2fs_force_buffered_io() or page cache invalidation failure.

Indeed. IIUC, the buggy code is below: if any preallocation fails, we need to
set the FI_NO_PREALLOC flag, but the direct I/O case skips that.

map_blocks:
	err = f2fs_map_blocks(inode, &map, 1, flag);
	if (map.m_len > 0 && err == -ENOSPC) {
		if (!direct_io)         <----
			set_inode_flag(inode, FI_NO_PREALLOC);
		err = 0;
	}

BTW, would it be better to include the issue details you explained above in the
commit message?

>>
>>>
>>>> -			/*
>>>> -			 * If force_buffere_io() is true, we have to allocate
>>>> -			 * blocks all the time, since f2fs_direct_IO will fall
>>>> -			 * back to buffered IO.
>>>> -			 */
>>>> -			if (!f2fs_force_buffered_io(inode, iocb, from) &&
>>>> -					f2fs_lfs_mode(F2FS_I_SB(inode)))
>>>> -				goto write;
>>>
>>> We should keep this OPU DIO logic, otherwise, in lfs mode, write dio
>>> will always allocate two block addresses for each 4k append IO.
>>>
>>> I just tested based on the code of the last f2fs dev-test branch.
>>
>> Yes, I had misread that due to the weird goto and misleading comment and
>> translated it into:
>>
>>          /* If it will be an in-place direct write, don't bother. */
>>          if (dio && !f2fs_lfs_mode(sbi))
>>                  return 0;
>>
>> It should be:
>>
>>          if (dio && f2fs_lfs_mode(sbi))
>>                  return 0;
> 
> Hmm, this addresses my 250 failure. And, I think the below commit can explain
> the case.
> 
> commit 47501f87c61ad2aa234add63e1ae231521dbc3f5
> Author: Jaegeuk Kim <jaegeuk@kernel.org>
> Date:   Tue Nov 26 15:01:42 2019 -0800
> 
>      f2fs: preallocate DIO blocks when forcing buffered_io
> 
>      The previous preallocation and DIO decision like below.
> 
>                               allow_outplace_dio              !allow_outplace_dio
>      f2fs_force_buffered_io   (*) No_Prealloc / Buffered_IO   Prealloc / Buffered_IO
>      !f2fs_force_buffered_io  No_Prealloc / DIO               Prealloc / DIO
> 
>      But, Javier reported Case (*) where zoned device bypassed preallocation but
>      fell back to buffered writes in f2fs_direct_IO(), resulting in stale data
>      being read.
> 
>      In order to fix the issue, actually we need to preallocate blocks whenever
>      we fall back to buffered IO like this. No change is made in the other cases.
> 
>                               allow_outplace_dio              !allow_outplace_dio
>      f2fs_force_buffered_io   (*) Prealloc / Buffered_IO      Prealloc / Buffered_IO
>      !f2fs_force_buffered_io  No_Prealloc / DIO               Prealloc / DIO
> 
>      Reported-and-tested-by: Javier Gonzalez <javier@javigon.com>
>      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
>      Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
>      Reviewed-by: Chao Yu <yuchao0@huawei.com>
>      Reviewed-by: Javier González <javier@javigon.com>
>      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
> 

Thanks for the explanation.

> 
>>
>> Do you have a proper explanation for why preallocations shouldn't be done in

See commit f847c699cff3 ("f2fs: allow out-place-update for direct IO in LFS mode"):
the f2fs_map_blocks() logic was changed to force allocation of a new block address,
whether or not a previous block address exists, when it is called from the DIO write
path. So, in that case, if we preallocate a new block address in
f2fs_file_write_iter(), we will hit the problem that my trace indicates.

>> this case?  Note that preallocations are still done for buffered writes, which
>> may be out-of-place as well; how are those different?
Got your concern.

For buffered IO, we use F2FS_GET_BLOCK_PRE_AIO. In this mode, we just reserve the
filesystem block count and tag NEW_ADDR in the dnode block, so it's fine: double
new block address allocation won't happen during data page writeback.

For direct IO, we use F2FS_GET_BLOCK_PRE_DIO. In this mode, we allocate a
physical block address, so if we fall back to buffered IO after preallocation,
we may suffer the double new block address allocation issue... IIUC.

Well, can we relocate preallocation into f2fs_direct_IO(), after all the cases
that may cause DIO to fall back to buffered IO?

Thanks,

>>
>> - Eric
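
The difference between the two preallocation modes described above can be
illustrated with a toy model of a single logical block. This is a sketch, not
f2fs code: struct toy_fs and the toy_*() helpers are hypothetical, the starting
address 100 is arbitrary, and the only behavior modeled is what the message
states, namely that the buffered-style step merely tags NEW_ADDR while the
direct-I/O-style step and the LFS-mode DIO write path each pick a physical
block.

```c
#include <stdint.h>

#define NULL_ADDR UINT32_C(0)	/* hole: nothing reserved or allocated */
#define NEW_ADDR  UINT32_MAX	/* space reserved, no physical block yet */

/* Toy state for a single logical block of one file. */
struct toy_fs {
	uint32_t slot;			/* address recorded for the block */
	uint32_t next_free;		/* next physical address in the log */
	unsigned int allocations;	/* physical blocks consumed so far */
};

static void toy_init(struct toy_fs *fs)
{
	fs->slot = NULL_ADDR;
	fs->next_free = 100;
	fs->allocations = 0;
}

/* PRE_AIO-like step: account for the block, but pick no address yet. */
static void toy_reserve(struct toy_fs *fs)
{
	if (fs->slot == NULL_ADDR)
		fs->slot = NEW_ADDR;
}

/* Out-of-place allocation: always takes a fresh physical block. */
static void toy_allocate(struct toy_fs *fs)
{
	fs->slot = fs->next_free++;
	fs->allocations++;
}

/* Buffered write: reserve up front, allocate once at writeback. */
static unsigned int toy_buffered_write(void)
{
	struct toy_fs fs;

	toy_init(&fs);
	toy_reserve(&fs);
	toy_allocate(&fs);	/* writeback picks the physical block */
	return fs.allocations;
}

/* LFS-mode direct write, optionally preceded by a PRE_DIO-like step. */
static unsigned int toy_lfs_dio_write(int preallocate)
{
	struct toy_fs fs;

	toy_init(&fs);
	if (preallocate)
		toy_allocate(&fs);	/* preallocation picks a block */
	toy_allocate(&fs);		/* the DIO write path allocates again */
	return fs.allocations;
}
```

In this model the buffered path consumes one physical block, and the LFS-mode
direct write consumes two when it is preallocated but only one when
preallocation is skipped.
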
Eric Biggers July 27, 2021, 7:38 a.m. UTC | #9
On Tue, Jul 27, 2021 at 11:23:03AM +0800, Chao Yu wrote:
> > > 
> > > Do you have a proper explanation for why preallocations shouldn't be done in
> 
> See commit f847c699cff3 ("f2fs: allow out-place-update for direct IO in LFS mode"):
> the f2fs_map_blocks() logic was changed to force allocation of a new block address,
> whether or not a previous block address exists, when it is called from the DIO write
> path. So, in that case, if we preallocate a new block address in
> f2fs_file_write_iter(), we will hit the problem that my trace indicates.
> 
> > > this case?  Note that preallocations are still done for buffered writes, which
> > > may be out-of-place as well; how are those different?
> Got your concern.
> 
> For buffered IO, we use F2FS_GET_BLOCK_PRE_AIO. In this mode, we just reserve the
> filesystem block count and tag NEW_ADDR in the dnode block, so it's fine: double
> new block address allocation won't happen during data page writeback.
> 
> For direct IO, we use F2FS_GET_BLOCK_PRE_DIO. In this mode, we allocate a
> physical block address, so if we fall back to buffered IO after preallocation,
> we may suffer the double new block address allocation issue... IIUC.
> 
> Well, can we relocate preallocation into f2fs_direct_IO(), after all the cases
> that may cause DIO to fall back to buffered IO?
> 

That's somewhat helpful, but I've been doing some more investigation and now I'm
even more confused.  How can f2fs support non-overwrite DIO writes at all
(meaning DIO writes in LFS mode as well as DIO writes to holes in non-LFS mode),
given that it has no support for unwritten extents?  AFAICS, as-is users can
easily leak uninitialized disk contents on f2fs by issuing a DIO write that
won't complete fully (or might not complete fully), then reading back the blocks
that got allocated but not written to.

I think that f2fs will have to take the ext2 approach of not allowing
non-overwrite DIO writes at all...

- Eric
Chao Yu July 27, 2021, 8:30 a.m. UTC | #10
On 2021/7/27 15:38, Eric Biggers wrote:
> That's somewhat helpful, but I've been doing some more investigation and now I'm
> even more confused.  How can f2fs support non-overwrite DIO writes at all
> (meaning DIO writes in LFS mode as well as DIO writes to holes in non-LFS mode),
> given that it has no support for unwritten extents?  AFAICS, as-is users can

I'm trying to pick up the DAX support patch created by Qiuyang from Huawei, and it
looks like it faces the same issue, so it tries to fix this by calling
sb_issue_zeroout() in f2fs_map_blocks() before it returns.

> easily leak uninitialized disk contents on f2fs by issuing a DIO write that
> won't complete fully (or might not complete fully), then reading back the blocks
> that got allocated but not written to.
> 
> I think that f2fs will have to take the ext2 approach of not allowing
> non-overwrite DIO writes at all...
Yes,

Another option is to enhance the scalability of f2fs metadata, which requires updating
the layout of the dnode block or SSA block; after that we can record the status of
unwritten data blocks there... it's a big change though...

Thanks,

> 
> - Eric
>
Darrick J. Wong July 27, 2021, 3:33 p.m. UTC | #11
On Tue, Jul 27, 2021 at 04:30:16PM +0800, Chao Yu wrote:
> On 2021/7/27 15:38, Eric Biggers wrote:
> > That's somewhat helpful, but I've been doing some more investigation and now I'm
> > even more confused.  How can f2fs support non-overwrite DIO writes at all
> > (meaning DIO writes in LFS mode as well as DIO writes to holes in non-LFS mode),
> > given that it has no support for unwritten extents?  AFAICS, as-is users can
> 
> I'm trying to pick up the DAX support patch created by Qiuyang from Huawei, and it
> looks like it faces the same issue, so it tries to fix this by calling
> sb_issue_zeroout() in f2fs_map_blocks() before it returns.

I really hope you don't, because zeroing the region before memcpy'ing it
is absurd.  I don't know if f2fs can do that (xfs can't really) without
pinning resources during a potentially lengthy memcpy operation, but you
/could/ allocate the space in ->iomap_begin, attach some record of that
to iomap->private, and only commit the mapping update in ->iomap_end.

--D

> > easily leak uninitialized disk contents on f2fs by issuing a DIO write that
> > won't complete fully (or might not complete fully), then reading back the blocks
> > that got allocated but not written to.
> > 
> > I think that f2fs will have to take the ext2 approach of not allowing
> > non-overwrite DIO writes at all...
> Yes,
> 
> Another option is to enhance the scalability of f2fs metadata, which requires updating
> the layout of the dnode block or SSA block; after that we can record the status of
> unwritten data blocks there... it's a big change though...
> 
> Thanks,
> 
> > 
> > - Eric
> >
Jaegeuk Kim July 28, 2021, 2:29 a.m. UTC | #12
On 07/26, Jaegeuk Kim wrote:
> On 07/25, Eric Biggers wrote:
> > On Sun, Jul 25, 2021 at 06:50:51PM +0800, Chao Yu wrote:
> > > On 2021/7/16 22:39, Eric Biggers wrote:
> > > > From: Eric Biggers <ebiggers@google.com>
> > > > 
> > > > f2fs_write_begin() assumes that all blocks were preallocated by
> > > > default unless FI_NO_PREALLOC is explicitly set.  This invites data
> > > > corruption, as there are cases in which not all blocks are preallocated.
> > > > Commit 47501f87c61a ("f2fs: preallocate DIO blocks when forcing
> > > > buffered_io") fixed one case, but there are others remaining.
> > > 
> > > Could you please explain which cases we missed to handle previously?
> > > then I can check those related logic before and after the rework.
> > 
> > Any case where a buffered write happens while not all blocks were preallocated
> > but FI_NO_PREALLOC wasn't set.  For example when ENOSPC was hit in the middle of
> > the preallocations for a direct write that will fall back to a buffered write,
> > e.g. due to f2fs_force_buffered_io() or page cache invalidation failure.
> > 
> > > 
> > > > -			/*
> > > > -			 * If force_buffere_io() is true, we have to allocate
> > > > -			 * blocks all the time, since f2fs_direct_IO will fall
> > > > -			 * back to buffered IO.
> > > > -			 */
> > > > -			if (!f2fs_force_buffered_io(inode, iocb, from) &&
> > > > -					f2fs_lfs_mode(F2FS_I_SB(inode)))
> > > > -				goto write;
> > > 
> > > We should keep this OPU DIO logic, otherwise, in lfs mode, write dio
> > > will always allocate two block addresses for each 4k append IO.
> > > 
> > > > I just tested based on the code of the last f2fs dev-test branch.
> > 
> > Yes, I had misread that due to the weird goto and misleading comment and
> > translated it into:
> > 
> >         /* If it will be an in-place direct write, don't bother. */
> >         if (dio && !f2fs_lfs_mode(sbi))
> >                 return 0;
> > 
> > It should be:
> > 
> >         if (dio && f2fs_lfs_mode(sbi))
> >                 return 0;
> 
> Hmm, this addresses my 250 failure. And, I think the below commit can explain
> the case.

In addition to this, I got a failure on generic/263, and the below change fixes
it. (I didn't look at it deeply, though.)

--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -4344,8 +4344,13 @@ static int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *iter)
                        return ret;
        }

-       map.m_lblk = (pos >> inode->i_blkbits);
-       map.m_len = ((pos + count - 1) >> inode->i_blkbits) - map.m_lblk + 1;
+       map.m_lblk = F2FS_BLK_ALIGN(pos);
+       map.m_len = F2FS_BYTES_TO_BLK(pos + count);
+       if (map.m_len > map.m_lblk)
+               map.m_len -= map.m_lblk;
+       else
+               map.m_len = 0;
+
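
The two block-range computations in the change above give different results
for unaligned writes, which a small userspace comparison makes visible. This
is a sketch under stated assumptions: 4 KiB blocks, F2FS_BLK_ALIGN() rounding
a byte offset up to a block number, and F2FS_BYTES_TO_BLK() rounding down.
struct blk_range, range_touched() and range_covered() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define BLKBITS 12u				/* assumed 4 KiB blocks */
#define BLKSIZE (UINT64_C(1) << BLKBITS)

struct blk_range {
	uint64_t lblk;	/* first logical block */
	uint64_t len;	/* number of blocks */
};

/* Every block touched by [pos, pos + count), as in the posted patch. */
static struct blk_range range_touched(uint64_t pos, uint64_t count)
{
	struct blk_range r;

	assert(count > 0);	/* count - 1 below must not wrap */
	r.lblk = pos >> BLKBITS;
	r.len = ((pos + count - 1) >> BLKBITS) - r.lblk + 1;
	return r;
}

/* Only blocks fully covered: start rounded up, end rounded down. */
static struct blk_range range_covered(uint64_t pos, uint64_t count)
{
	struct blk_range r;
	uint64_t end = (pos + count) >> BLKBITS;

	r.lblk = (pos + BLKSIZE - 1) >> BLKBITS;
	r.len = end > r.lblk ? end - r.lblk : 0;
	return r;
}
```

For a write of 5000 bytes at offset 1000, range_touched() yields two blocks
starting at block 0, while range_covered() yields zero blocks starting at
block 1. For a block-aligned write the two agree.
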
Chao Yu July 29, 2021, 12:26 a.m. UTC | #13
On 2021/7/27 23:33, Darrick J. Wong wrote:
> On Tue, Jul 27, 2021 at 04:30:16PM +0800, Chao Yu wrote:
>> On 2021/7/27 15:38, Eric Biggers wrote:
>>> That's somewhat helpful, but I've been doing some more investigation and now I'm
>>> even more confused.  How can f2fs support non-overwrite DIO writes at all
>>> (meaning DIO writes in LFS mode as well as DIO writes to holes in non-LFS mode),
>>> given that it has no support for unwritten extents?  AFAICS, as-is users can
>>
>> I'm trying to pick up the DAX support patch created by Qiuyang from Huawei, and it
>> looks like it faces the same issue, so it tries to fix this by calling
>> sb_issue_zeroout() in f2fs_map_blocks() before it returns.
> 
> I really hope you don't, because zeroing the region before memcpy'ing it
> is absurd.  I don't know if f2fs can do that (xfs can't really) without
> pinning resources during a potentially lengthy memcpy operation, but you
> /could/ allocate the space in ->iomap_begin, attach some record of that
> to iomap->private, and only commit the mapping update in ->iomap_end.

Thanks for the suggestion; let me check this a little later, since right now I'm
just trying to stabilize the code...

Thanks,

> 
> --D
> 
>>> easily leak uninitialized disk contents on f2fs by issuing a DIO write that
>>> won't complete fully (or might not complete fully), then reading back the blocks
>>> that got allocated but not written to.
>>>
>>> I think that f2fs will have to take the ext2 approach of not allowing
>>> non-overwrite DIO writes at all...
>> Yes,
>>
>> Another option is to enhance the scalability of f2fs metadata, which requires updating
>> the layout of the dnode block or SSA block; after that we can record the status of
>> unwritten data blocks there... it's a big change though...
>>
>> Thanks,
>>
>>>
>>> - Eric
>>>

Patch

diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 18cb28a514e6..cdadaa9daf55 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -1370,53 +1370,6 @@  static int __allocate_data_block(struct dnode_of_data *dn, int seg_type)
 	return 0;
 }
 
-int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from)
-{
-	struct inode *inode = file_inode(iocb->ki_filp);
-	struct f2fs_map_blocks map;
-	int flag;
-	int err = 0;
-	bool direct_io = iocb->ki_flags & IOCB_DIRECT;
-
-	map.m_lblk = F2FS_BLK_ALIGN(iocb->ki_pos);
-	map.m_len = F2FS_BYTES_TO_BLK(iocb->ki_pos + iov_iter_count(from));
-	if (map.m_len > map.m_lblk)
-		map.m_len -= map.m_lblk;
-	else
-		map.m_len = 0;
-
-	map.m_next_pgofs = NULL;
-	map.m_next_extent = NULL;
-	map.m_seg_type = NO_CHECK_TYPE;
-	map.m_may_create = true;
-
-	if (direct_io) {
-		map.m_seg_type = f2fs_rw_hint_to_seg_type(iocb->ki_hint);
-		flag = f2fs_force_buffered_io(inode, iocb, from) ?
-					F2FS_GET_BLOCK_PRE_AIO :
-					F2FS_GET_BLOCK_PRE_DIO;
-		goto map_blocks;
-	}
-	if (iocb->ki_pos + iov_iter_count(from) > MAX_INLINE_DATA(inode)) {
-		err = f2fs_convert_inline_inode(inode);
-		if (err)
-			return err;
-	}
-	if (f2fs_has_inline_data(inode))
-		return err;
-
-	flag = F2FS_GET_BLOCK_PRE_AIO;
-
-map_blocks:
-	err = f2fs_map_blocks(inode, &map, 1, flag);
-	if (map.m_len > 0 && err == -ENOSPC) {
-		if (!direct_io)
-			set_inode_flag(inode, FI_NO_PREALLOC);
-		err = 0;
-	}
-	return err;
-}
-
 void f2fs_do_map_lock(struct f2fs_sb_info *sbi, int flag, bool lock)
 {
 	if (flag == F2FS_GET_BLOCK_PRE_AIO) {
@@ -3210,12 +3163,10 @@  static int prepare_write_begin(struct f2fs_sb_info *sbi,
 	int flag;
 
 	/*
-	 * we already allocated all the blocks, so we don't need to get
-	 * the block addresses when there is no need to fill the page.
+	 * If a whole page is being written and we already preallocated all the
+	 * blocks, then there is no need to get a block address now.
 	 */
-	if (!f2fs_has_inline_data(inode) && len == PAGE_SIZE &&
-	    !is_inode_flag_set(inode, FI_NO_PREALLOC) &&
-	    !f2fs_verity_in_progress(inode))
+	if (len == PAGE_SIZE && is_inode_flag_set(inode, FI_PREALLOCATED_ALL))
 		return 0;
 
 	/* f2fs_lock_op avoids race between write CP and convert_inline_page */
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index ad7c1b94e23a..da1da3111f18 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -699,7 +699,7 @@  enum {
 	FI_INLINE_DOTS,		/* indicate inline dot dentries */
 	FI_DO_DEFRAG,		/* indicate defragment is running */
 	FI_DIRTY_FILE,		/* indicate regular/symlink has dirty pages */
-	FI_NO_PREALLOC,		/* indicate skipped preallocated blocks */
+	FI_PREALLOCATED_ALL,	/* all blocks for write were preallocated */
 	FI_HOT_DATA,		/* indicate file is hot */
 	FI_EXTRA_ATTR,		/* indicate file has extra attribute */
 	FI_PROJ_INHERIT,	/* indicate file inherits projectid */
@@ -3604,7 +3604,6 @@  void f2fs_update_data_blkaddr(struct dnode_of_data *dn, block_t blkaddr);
 int f2fs_reserve_new_blocks(struct dnode_of_data *dn, blkcnt_t count);
 int f2fs_reserve_new_block(struct dnode_of_data *dn);
 int f2fs_get_block(struct dnode_of_data *dn, pgoff_t index);
-int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from);
 int f2fs_reserve_block(struct dnode_of_data *dn, pgoff_t index);
 struct page *f2fs_get_read_data_page(struct inode *inode, pgoff_t index,
 			int op_flags, bool for_write);
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index b1cb5b50faac..9b12004e78c6 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -4218,10 +4218,72 @@  static ssize_t f2fs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 	return ret;
 }
 
+/*
+ * Preallocate blocks for a write request, if it is possible and helpful to do
+ * so.  Returns a positive number if blocks may have been preallocated, 0 if no
+ * blocks were preallocated, or a negative errno value if something went
+ * seriously wrong.  Also sets FI_PREALLOCATED_ALL on the inode if *all* the
+ * requested blocks (not just some of them) have been allocated.
+ */
+static int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *iter)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
+	const loff_t pos = iocb->ki_pos;
+	const size_t count = iov_iter_count(iter);
+	struct f2fs_map_blocks map = {};
+	bool dio = (iocb->ki_flags & IOCB_DIRECT) &&
+		   !f2fs_force_buffered_io(inode, iocb, iter);
+	int flag;
+	int ret;
+
+	/* If it will be an in-place direct write, don't bother. */
+	if (dio && !f2fs_lfs_mode(sbi))
+		return 0;
+
+	/* No-wait I/O can't allocate blocks. */
+	if (iocb->ki_flags & IOCB_NOWAIT)
+		return 0;
+
+	/* If it will be a short write, don't bother. */
+	if (iov_iter_fault_in_readable(iter, count) != 0)
+		return 0;
+
+	if (f2fs_has_inline_data(inode)) {
+		/* If the data will fit inline, don't bother. */
+		if (pos + count <= MAX_INLINE_DATA(inode))
+			return 0;
+		ret = f2fs_convert_inline_inode(inode);
+		if (ret)
+			return ret;
+	}
+
+	map.m_lblk = (pos >> inode->i_blkbits);
+	map.m_len = ((pos + count - 1) >> inode->i_blkbits) - map.m_lblk + 1;
+	map.m_may_create = true;
+	if (dio) {
+		map.m_seg_type = f2fs_rw_hint_to_seg_type(inode->i_write_hint);
+		flag = F2FS_GET_BLOCK_PRE_DIO;
+	} else {
+		map.m_seg_type = NO_CHECK_TYPE;
+		flag = F2FS_GET_BLOCK_PRE_AIO;
+	}
+
+	ret = f2fs_map_blocks(inode, &map, 1, flag);
+	/* -ENOSPC is only a fatal error if no blocks could be allocated. */
+	if (ret < 0 && !(ret == -ENOSPC && map.m_len > 0))
+		return ret;
+	if (ret == 0)
+		set_inode_flag(inode, FI_PREALLOCATED_ALL);
+	return map.m_len;
+}
+
 static ssize_t f2fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file_inode(file);
+	loff_t target_size;
+	int preallocated;
 	ssize_t ret;
 
 	if (unlikely(f2fs_cp_error(F2FS_I_SB(inode)))) {
@@ -4245,84 +4307,59 @@  static ssize_t f2fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 
 	if (unlikely(IS_IMMUTABLE(inode))) {
 		ret = -EPERM;
-		goto unlock;
+		goto out_unlock;
 	}
 
 	if (is_inode_flag_set(inode, FI_COMPRESS_RELEASED)) {
 		ret = -EPERM;
-		goto unlock;
+		goto out_unlock;
 	}
 
 	ret = generic_write_checks(iocb, from);
 	if (ret > 0) {
-		bool preallocated = false;
-		size_t target_size = 0;
-		int err;
-
-		if (iov_iter_fault_in_readable(from, iov_iter_count(from)))
-			set_inode_flag(inode, FI_NO_PREALLOC);
-
-		if ((iocb->ki_flags & IOCB_NOWAIT)) {
+		if (iocb->ki_flags & IOCB_NOWAIT) {
 			if (!f2fs_overwrite_io(inode, iocb->ki_pos,
 						iov_iter_count(from)) ||
 				f2fs_has_inline_data(inode) ||
 				f2fs_force_buffered_io(inode, iocb, from)) {
-				clear_inode_flag(inode, FI_NO_PREALLOC);
-				inode_unlock(inode);
 				ret = -EAGAIN;
-				goto out;
+				goto out_unlock;
 			}
-			goto write;
 		}
-
-		if (is_inode_flag_set(inode, FI_NO_PREALLOC))
-			goto write;
-
 		if (iocb->ki_flags & IOCB_DIRECT) {
 			/*
 			 * Convert inline data for Direct I/O before entering
 			 * f2fs_direct_IO().
 			 */
-			err = f2fs_convert_inline_inode(inode);
-			if (err)
-				goto out_err;
-			/*
-			 * If force_buffere_io() is true, we have to allocate
-			 * blocks all the time, since f2fs_direct_IO will fall
-			 * back to buffered IO.
-			 */
-			if (!f2fs_force_buffered_io(inode, iocb, from) &&
-					f2fs_lfs_mode(F2FS_I_SB(inode)))
-				goto write;
+			ret = f2fs_convert_inline_inode(inode);
+			if (ret)
+				goto out_unlock;
 		}
-		preallocated = true;
-		target_size = iocb->ki_pos + iov_iter_count(from);
 
-		err = f2fs_preallocate_blocks(iocb, from);
-		if (err) {
-out_err:
-			clear_inode_flag(inode, FI_NO_PREALLOC);
-			inode_unlock(inode);
-			ret = err;
-			goto out;
+		/* Possibly preallocate the blocks for the write. */
+		target_size = iocb->ki_pos + iov_iter_count(from);
+		preallocated = f2fs_preallocate_blocks(iocb, from);
+		if (preallocated < 0) {
+			ret = preallocated;
+			goto out_unlock;
 		}
-write:
+
 		ret = __generic_file_write_iter(iocb, from);
-		clear_inode_flag(inode, FI_NO_PREALLOC);
 
-		/* if we couldn't write data, we should deallocate blocks. */
-		if (preallocated && i_size_read(inode) < target_size) {
+		/* Don't leave any preallocated blocks around past i_size. */
+		if (preallocated > 0 && inode->i_size < target_size) {
 			down_write(&F2FS_I(inode)->i_gc_rwsem[WRITE]);
 			down_write(&F2FS_I(inode)->i_mmap_sem);
 			f2fs_truncate(inode);
 			up_write(&F2FS_I(inode)->i_mmap_sem);
 			up_write(&F2FS_I(inode)->i_gc_rwsem[WRITE]);
 		}
+		clear_inode_flag(inode, FI_PREALLOCATED_ALL);
 
 		if (ret > 0)
 			f2fs_update_iostat(F2FS_I_SB(inode), APP_WRITE_IO, ret);
 	}
-unlock:
+out_unlock:
 	inode_unlock(inode);
 out:
 	trace_f2fs_file_write_iter(inode, iocb->ki_pos,