diff mbox series

[v4,03/10] ext4: don't write back data before punch hole in nojournal mode

Message ID 20241216013915.3392419-4-yi.zhang@huaweicloud.com (mailing list archive)
State New
Headers show
Series ext4: clean up and refactor fallocate | expand

Commit Message

Zhang Yi Dec. 16, 2024, 1:39 a.m. UTC
From: Zhang Yi <yi.zhang@huawei.com>

There is no need to write back all data before punching a hole in
non-journaled mode since it will be dropped soon after removing space.
Therefore, the call to filemap_write_and_wait_range() can be eliminated.
Besides, similar to ext4_zero_range(), we must address the case of
partially punched folios when block size < page size. It is essential to
remove writable userspace mappings to ensure that the folio can be
faulted again during subsequent mmap write access.

In journaled mode, we need to write dirty pages out before discarding
page cache in case of crash before committing the freeing data
transaction, which could expose old, stale data, even if synchronization
has been performed.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/inode.c | 18 +++++-------------
 1 file changed, 5 insertions(+), 13 deletions(-)

Comments

Jan Kara Dec. 16, 2024, 3:02 p.m. UTC | #1
On Mon 16-12-24 09:39:08, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> There is no need to write back all data before punching a hole in
> non-journaled mode since it will be dropped soon after removing space.
> Therefore, the call to filemap_write_and_wait_range() can be eliminated.
> Besides, similar to ext4_zero_range(), we must address the case of
> partially punched folios when block size < page size. It is essential to
> remove writable userspace mappings to ensure that the folio can be
> faulted again during subsequent mmap write access.
> 
> In journaled mode, we need to write dirty pages out before discarding
> page cache in case of crash before committing the freeing data
> transaction, which could expose old, stale data, even if synchronization
> has been performed.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/inode.c | 18 +++++-------------
>  1 file changed, 5 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index bf735d06b621..a5ba2b71d508 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4018,17 +4018,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
>  
>  	trace_ext4_punch_hole(inode, offset, length, 0);
>  
> -	/*
> -	 * Write out all dirty pages to avoid race conditions
> -	 * Then release them.
> -	 */
> -	if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> -		ret = filemap_write_and_wait_range(mapping, offset,
> -						   offset + length - 1);
> -		if (ret)
> -			return ret;
> -	}
> -
>  	inode_lock(inode);
>  
>  	/* No need to punch hole beyond i_size */
> @@ -4090,8 +4079,11 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
>  		ret = ext4_update_disksize_before_punch(inode, offset, length);
>  		if (ret)
>  			goto out_dio;
> -		truncate_pagecache_range(inode, first_block_offset,
> -					 last_block_offset);
> +
> +		ret = ext4_truncate_page_cache_block_range(inode,
> +				first_block_offset, last_block_offset + 1);
> +		if (ret)
> +			goto out_dio;
>  	}
>  
>  	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> -- 
> 2.46.1
>
Ojaswin Mujoo Dec. 17, 2024, 2:31 p.m. UTC | #2
On Mon, Dec 16, 2024 at 09:39:08AM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> There is no need to write back all data before punching a hole in
> non-journaled mode since it will be dropped soon after removing space.
> Therefore, the call to filemap_write_and_wait_range() can be eliminated.

Hi, sorry I'm a bit late to this however following the discussion here
[1], I believe the initial concern was that we don't in PATCH v1 01/10 
was that after truncating the pagecache, the ext4_alloc_file_blocks()
call might fail with errors like EIO, ENOMEM etc leading to inconsistent
data. 

Is my understanding correct that  we realised that these are very rare
cases and are not worth the performance penalty of writeback? In which
case, is it really okay to just let the scope for corruption exist even
though its rare. There might be some other error cases we might be
missing which might be more easier to hit. For eg I think we can also
fail ext4_alloc_file_blocks() with ENOSPC in case there is a written to
unwritten extent conversion causing an extent split leading to  extent
tree node allocation. (Maybe can be avoided by using PRE_IO with
EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT in the first ext4_alloc_file_blocks() call)

So does it make sense to retain the writeback behavior or am I just
being paranoid :) 

Regards,
ojaswin

> Besides, similar to ext4_zero_range(), we must address the case of
> partially punched folios when block size < page size. It is essential to
> remove writable userspace mappings to ensure that the folio can be
> faulted again during subsequent mmap write access.
> 
> In journaled mode, we need to write dirty pages out before discarding
> page cache in case of crash before committing the freeing data
> transaction, which could expose old, stale data, even if synchronization
> has been performed.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
>  fs/ext4/inode.c | 18 +++++-------------
>  1 file changed, 5 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index bf735d06b621..a5ba2b71d508 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4018,17 +4018,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
>  
>  	trace_ext4_punch_hole(inode, offset, length, 0);
>  
> -	/*
> -	 * Write out all dirty pages to avoid race conditions
> -	 * Then release them.
> -	 */
> -	if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> -		ret = filemap_write_and_wait_range(mapping, offset,
> -						   offset + length - 1);
> -		if (ret)
> -			return ret;
> -	}
> -
>  	inode_lock(inode);
>  
>  	/* No need to punch hole beyond i_size */
> @@ -4090,8 +4079,11 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
>  		ret = ext4_update_disksize_before_punch(inode, offset, length);
>  		if (ret)
>  			goto out_dio;
> -		truncate_pagecache_range(inode, first_block_offset,
> -					 last_block_offset);
> +
> +		ret = ext4_truncate_page_cache_block_range(inode,
> +				first_block_offset, last_block_offset + 1);
> +		if (ret)
> +			goto out_dio;
>  	}
>  
>  	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> -- 
> 2.46.1
>
Ojaswin Mujoo Dec. 17, 2024, 2:50 p.m. UTC | #3
On Tue, Dec 17, 2024 at 08:01:26PM +0530, Ojaswin Mujoo wrote:
> On Mon, Dec 16, 2024 at 09:39:08AM +0800, Zhang Yi wrote:
> > From: Zhang Yi <yi.zhang@huawei.com>
> > 
> > There is no need to write back all data before punching a hole in
> > non-journaled mode since it will be dropped soon after removing space.
> > Therefore, the call to filemap_write_and_wait_range() can be eliminated.
> 
> Hi, sorry I'm a bit late to this however following the discussion here
> [1], I believe the initial concern was that we don't in PATCH v1 01/10 
> was that after truncating the pagecache, the ext4_alloc_file_blocks()
> call might fail with errors like EIO, ENOMEM etc leading to inconsistent
> data. 
> 
> Is my understanding correct that  we realised that these are very rare
> cases and are not worth the performance penalty of writeback? In which
> case, is it really okay to just let the scope for corruption exist even
> though its rare. There might be some other error cases we might be
> missing which might be more easier to hit. For eg I think we can also
> fail ext4_alloc_file_blocks() with ENOSPC in case there is a written to
> unwritten extent conversion causing an extent split leading to  extent
> tree node allocation. (Maybe can be avoided by using PRE_IO with
> EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT in the first ext4_alloc_file_blocks() call)
> 
> So does it make sense to retain the writeback behavior or am I just
> being paranoid :) 
> 
> Regards,
> ojaswin

[1]
https://lore.kernel.org/linux-ext4/20240917165007.j5dywaekvnirfffm@quack3/
> 
> > Besides, similar to ext4_zero_range(), we must address the case of
> > partially punched folios when block size < page size. It is essential to
> > remove writable userspace mappings to ensure that the folio can be
> > faulted again during subsequent mmap write access.
> > 
> > In journaled mode, we need to write dirty pages out before discarding
> > page cache in case of crash before committing the freeing data
> > transaction, which could expose old, stale data, even if synchronization
> > has been performed.
> > 
> > Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> > ---
> >  fs/ext4/inode.c | 18 +++++-------------
> >  1 file changed, 5 insertions(+), 13 deletions(-)
> > 
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index bf735d06b621..a5ba2b71d508 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -4018,17 +4018,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
> >  
> >  	trace_ext4_punch_hole(inode, offset, length, 0);
> >  
> > -	/*
> > -	 * Write out all dirty pages to avoid race conditions
> > -	 * Then release them.
> > -	 */
> > -	if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> > -		ret = filemap_write_and_wait_range(mapping, offset,
> > -						   offset + length - 1);
> > -		if (ret)
> > -			return ret;
> > -	}
> > -
> >  	inode_lock(inode);
> >  
> >  	/* No need to punch hole beyond i_size */
> > @@ -4090,8 +4079,11 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
> >  		ret = ext4_update_disksize_before_punch(inode, offset, length);
> >  		if (ret)
> >  			goto out_dio;
> > -		truncate_pagecache_range(inode, first_block_offset,
> > -					 last_block_offset);
> > +
> > +		ret = ext4_truncate_page_cache_block_range(inode,
> > +				first_block_offset, last_block_offset + 1);
> > +		if (ret)
> > +			goto out_dio;
> >  	}
> >  
> >  	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> > -- 
> > 2.46.1
> >
Zhang Yi Dec. 18, 2024, 7:10 a.m. UTC | #4
On 2024/12/17 22:31, Ojaswin Mujoo wrote:
> On Mon, Dec 16, 2024 at 09:39:08AM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> There is no need to write back all data before punching a hole in
>> non-journaled mode since it will be dropped soon after removing space.
>> Therefore, the call to filemap_write_and_wait_range() can be eliminated.
> 
> Hi, sorry I'm a bit late to this however following the discussion here
> [1], I believe the initial concern was that we don't in PATCH v1 01/10 
> was that after truncating the pagecache, the ext4_alloc_file_blocks()
> call might fail with errors like EIO, ENOMEM etc leading to inconsistent
> data. 
> 
> Is my understanding correct that  we realised that these are very rare
> cases and are not worth the performance penalty of writeback? In which
> case, is it really okay to just let the scope for corruption exist even
> though its rare. There might be some other error cases we might be
> missing which might be more easier to hit. For eg I think we can also
> fail ext4_alloc_file_blocks() with ENOSPC in case there is a written to
> unwritten extent conversion causing an extent split leading to  extent
> tree node allocation. (Maybe can be avoided by using PRE_IO with
> EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT in the first ext4_alloc_file_blocks() call)
> 
> So does it make sense to retain the writeback behavior or am I just
> being paranoid :) 
> 

Hi, Ojaswin!

Yeah, from my point of view, ENOSPC could happen, and it may be more
likely to happen if we intentionally create conditions for it. However,
all the efforts we can make at this point are merely best efforts and
reduce the probability. We cannot 100% guarantee it will not happen,
even if we write back the whole range before manipulating extents and
blocks. This is because we do not accurately reserve space for extents
split. Additionally, In ext4_punch_hole(), we have used 'nofail' flag
while freeing blocks to reduce the possibility of ENOSPC. So I suppose
it's fine by now, but we may need to implement additional measures if
we truly want to resolve the issue completely.

Thanks,
Yi.

> 
>> Besides, similar to ext4_zero_range(), we must address the case of
>> partially punched folios when block size < page size. It is essential to
>> remove writable userspace mappings to ensure that the folio can be
>> faulted again during subsequent mmap write access.
>>
>> In journaled mode, we need to write dirty pages out before discarding
>> page cache in case of crash before committing the freeing data
>> transaction, which could expose old, stale data, even if synchronization
>> has been performed.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
>>  fs/ext4/inode.c | 18 +++++-------------
>>  1 file changed, 5 insertions(+), 13 deletions(-)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index bf735d06b621..a5ba2b71d508 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -4018,17 +4018,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
>>  
>>  	trace_ext4_punch_hole(inode, offset, length, 0);
>>  
>> -	/*
>> -	 * Write out all dirty pages to avoid race conditions
>> -	 * Then release them.
>> -	 */
>> -	if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
>> -		ret = filemap_write_and_wait_range(mapping, offset,
>> -						   offset + length - 1);
>> -		if (ret)
>> -			return ret;
>> -	}
>> -
>>  	inode_lock(inode);
>>  
>>  	/* No need to punch hole beyond i_size */
>> @@ -4090,8 +4079,11 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
>>  		ret = ext4_update_disksize_before_punch(inode, offset, length);
>>  		if (ret)
>>  			goto out_dio;
>> -		truncate_pagecache_range(inode, first_block_offset,
>> -					 last_block_offset);
>> +
>> +		ret = ext4_truncate_page_cache_block_range(inode,
>> +				first_block_offset, last_block_offset + 1);
>> +		if (ret)
>> +			goto out_dio;
>>  	}
>>  
>>  	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
>> -- 
>> 2.46.1
>>
Ojaswin Mujoo Dec. 18, 2024, 10:13 a.m. UTC | #5
On Wed, Dec 18, 2024 at 03:10:36PM +0800, Zhang Yi wrote:
> On 2024/12/17 22:31, Ojaswin Mujoo wrote:
> > On Mon, Dec 16, 2024 at 09:39:08AM +0800, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@huawei.com>
> >>
> >> There is no need to write back all data before punching a hole in
> >> non-journaled mode since it will be dropped soon after removing space.
> >> Therefore, the call to filemap_write_and_wait_range() can be eliminated.
> > 
> > Hi, sorry I'm a bit late to this however following the discussion here
> > [1], I believe the initial concern was that we don't in PATCH v1 01/10 
> > was that after truncating the pagecache, the ext4_alloc_file_blocks()
> > call might fail with errors like EIO, ENOMEM etc leading to inconsistent
> > data. 
> > 
> > Is my understanding correct that  we realised that these are very rare
> > cases and are not worth the performance penalty of writeback? In which
> > case, is it really okay to just let the scope for corruption exist even
> > though its rare. There might be some other error cases we might be
> > missing which might be more easier to hit. For eg I think we can also
> > fail ext4_alloc_file_blocks() with ENOSPC in case there is a written to
> > unwritten extent conversion causing an extent split leading to  extent
> > tree node allocation. (Maybe can be avoided by using PRE_IO with
> > EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT in the first ext4_alloc_file_blocks() call)
> > 
> > So does it make sense to retain the writeback behavior or am I just
> > being paranoid :) 
> > 
> 
> Hi, Ojaswin!
> 
> Yeah, from my point of view, ENOSPC could happen, and it may be more
> likely to happen if we intentionally create conditions for it. However,
> all the efforts we can make at this point are merely best efforts and
> reduce the probability. We cannot 100% guarantee it will not happen,
> even if we write back the whole range before manipulating extents and
> blocks. This is because we do not accurately reserve space for extents
> split. Additionally, In ext4_punch_hole(), we have used 'nofail' flag

Right, rechecking the ext4_map_blocks code, seems like we can also result
in a failure after unwrit extents have successfully been allocated so 
either ways we can't be sure that we'll retain old data on failure even
with writeback. 

> while freeing blocks to reduce the possibility of ENOSPC. So I suppose
> it's fine by now, but we may need to implement additional measures if
> we truly want to resolve the issue completely.

Sure I agree that in that case we should ideally have something more
robust to handle these edge cases. For now, this change looks good.

Feel free to add:

Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

> 
> Thanks,
> Yi.
> 
> > 
> >> Besides, similar to ext4_zero_range(), we must address the case of
> >> partially punched folios when block size < page size. It is essential to
> >> remove writable userspace mappings to ensure that the folio can be
> >> faulted again during subsequent mmap write access.
> >>
> >> In journaled mode, we need to write dirty pages out before discarding
> >> page cache in case of crash before committing the freeing data
> >> transaction, which could expose old, stale data, even if synchronization
> >> has been performed.
> >>
> >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> >> ---
> >>  fs/ext4/inode.c | 18 +++++-------------
> >>  1 file changed, 5 insertions(+), 13 deletions(-)
> >>
> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >> index bf735d06b621..a5ba2b71d508 100644
> >> --- a/fs/ext4/inode.c
> >> +++ b/fs/ext4/inode.c
> >> @@ -4018,17 +4018,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
> >>  
> >>  	trace_ext4_punch_hole(inode, offset, length, 0);
> >>  
> >> -	/*
> >> -	 * Write out all dirty pages to avoid race conditions
> >> -	 * Then release them.
> >> -	 */
> >> -	if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> >> -		ret = filemap_write_and_wait_range(mapping, offset,
> >> -						   offset + length - 1);
> >> -		if (ret)
> >> -			return ret;
> >> -	}
> >> -
> >>  	inode_lock(inode);
> >>  
> >>  	/* No need to punch hole beyond i_size */
> >> @@ -4090,8 +4079,11 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
> >>  		ret = ext4_update_disksize_before_punch(inode, offset, length);
> >>  		if (ret)
> >>  			goto out_dio;
> >> -		truncate_pagecache_range(inode, first_block_offset,
> >> -					 last_block_offset);
> >> +
> >> +		ret = ext4_truncate_page_cache_block_range(inode,
> >> +				first_block_offset, last_block_offset + 1);
> >> +		if (ret)
> >> +			goto out_dio;
> >>  	}
> >>  
> >>  	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> >> -- 
> >> 2.46.1
> >>
>
diff mbox series

Patch

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index bf735d06b621..a5ba2b71d508 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4018,17 +4018,6 @@  int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
 
 	trace_ext4_punch_hole(inode, offset, length, 0);
 
-	/*
-	 * Write out all dirty pages to avoid race conditions
-	 * Then release them.
-	 */
-	if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
-		ret = filemap_write_and_wait_range(mapping, offset,
-						   offset + length - 1);
-		if (ret)
-			return ret;
-	}
-
 	inode_lock(inode);
 
 	/* No need to punch hole beyond i_size */
@@ -4090,8 +4079,11 @@  int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
 		ret = ext4_update_disksize_before_punch(inode, offset, length);
 		if (ret)
 			goto out_dio;
-		truncate_pagecache_range(inode, first_block_offset,
-					 last_block_offset);
+
+		ret = ext4_truncate_page_cache_block_range(inode,
+				first_block_offset, last_block_offset + 1);
+		if (ret)
+			goto out_dio;
 	}
 
 	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))