Message ID | 20241216013915.3392419-4-yi.zhang@huaweicloud.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | ext4: clean up and refactor fallocate | expand |
On Mon 16-12-24 09:39:08, Zhang Yi wrote: > From: Zhang Yi <yi.zhang@huawei.com> > > There is no need to write back all data before punching a hole in > non-journaled mode since it will be dropped soon after removing space. > Therefore, the call to filemap_write_and_wait_range() can be eliminated. > Besides, similar to ext4_zero_range(), we must address the case of > partially punched folios when block size < page size. It is essential to > remove writable userspace mappings to ensure that the folio can be > faulted again during subsequent mmap write access. > > In journaled mode, we need to write dirty pages out before discarding > page cache in case of crash before committing the freeing data > transaction, which could expose old, stale data, even if synchronization > has been performed. > > Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Looks good. Feel free to add: Reviewed-by: Jan Kara <jack@suse.cz> Honza > --- > fs/ext4/inode.c | 18 +++++------------- > 1 file changed, 5 insertions(+), 13 deletions(-) > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index bf735d06b621..a5ba2b71d508 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -4018,17 +4018,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length) > > trace_ext4_punch_hole(inode, offset, length, 0); > > - /* > - * Write out all dirty pages to avoid race conditions > - * Then release them. > - */ > - if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) { > - ret = filemap_write_and_wait_range(mapping, offset, > - offset + length - 1); > - if (ret) > - return ret; > - } > - > inode_lock(inode); > > /* No need to punch hole beyond i_size */ > @@ -4090,8 +4079,11 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length) > ret = ext4_update_disksize_before_punch(inode, offset, length); > if (ret) > goto out_dio; > - truncate_pagecache_range(inode, first_block_offset, > - last_block_offset); > + > + ret = ext4_truncate_page_cache_block_range(inode, > + first_block_offset, last_block_offset + 1); > + if (ret) > + goto out_dio; > } > > if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) > -- > 2.46.1 >
On Mon, Dec 16, 2024 at 09:39:08AM +0800, Zhang Yi wrote: > From: Zhang Yi <yi.zhang@huawei.com> > > There is no need to write back all data before punching a hole in > non-journaled mode since it will be dropped soon after removing space. > Therefore, the call to filemap_write_and_wait_range() can be eliminated. Hi, sorry I'm a bit late to this however following the discussion here [1], I believe the initial concern was that we don't in PATCH v1 01/10 was that after truncating the pagecache, the ext4_alloc_file_blocks() call might fail with errors like EIO, ENOMEM etc leading to inconsistent data. Is my understanding correct that we realised that these are very rare cases and are not worth the performance penalty of writeback? In which case, is it really okay to just let the scope for corruption exist even though its rare. There might be some other error cases we might be missing which might be more easier to hit. For eg I think we can also fail ext4_alloc_file_blocks() with ENOSPC in case there is a written to unwritten extent conversion causing an extent split leading to extent tree node allocation. (Maybe can be avoided by using PRE_IO with EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT in the first ext4_alloc_file_blocks() call) So does it make sense to retain the writeback behavior or am I just being paranoid :) Regards, ojaswin > Besides, similar to ext4_zero_range(), we must address the case of > partially punched folios when block size < page size. It is essential to > remove writable userspace mappings to ensure that the folio can be > faulted again during subsequent mmap write access. > > In journaled mode, we need to write dirty pages out before discarding > page cache in case of crash before committing the freeing data > transaction, which could expose old, stale data, even if synchronization > has been performed. > > Signed-off-by: Zhang Yi <yi.zhang@huawei.com> > --- > fs/ext4/inode.c | 18 +++++------------- > 1 file changed, 5 insertions(+), 13 deletions(-) > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index bf735d06b621..a5ba2b71d508 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -4018,17 +4018,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length) > > trace_ext4_punch_hole(inode, offset, length, 0); > > - /* > - * Write out all dirty pages to avoid race conditions > - * Then release them. > - */ > - if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) { > - ret = filemap_write_and_wait_range(mapping, offset, > - offset + length - 1); > - if (ret) > - return ret; > - } > - > inode_lock(inode); > > /* No need to punch hole beyond i_size */ > @@ -4090,8 +4079,11 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length) > ret = ext4_update_disksize_before_punch(inode, offset, length); > if (ret) > goto out_dio; > - truncate_pagecache_range(inode, first_block_offset, > - last_block_offset); > + > + ret = ext4_truncate_page_cache_block_range(inode, > + first_block_offset, last_block_offset + 1); > + if (ret) > + goto out_dio; > } > > if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) > -- > 2.46.1 >
On Tue, Dec 17, 2024 at 08:01:26PM +0530, Ojaswin Mujoo wrote: > On Mon, Dec 16, 2024 at 09:39:08AM +0800, Zhang Yi wrote: > > From: Zhang Yi <yi.zhang@huawei.com> > > > > There is no need to write back all data before punching a hole in > > non-journaled mode since it will be dropped soon after removing space. > > Therefore, the call to filemap_write_and_wait_range() can be eliminated. > > Hi, sorry I'm a bit late to this however following the discussion here > [1], I believe the initial concern was that we don't in PATCH v1 01/10 > was that after truncating the pagecache, the ext4_alloc_file_blocks() > call might fail with errors like EIO, ENOMEM etc leading to inconsistent > data. > > Is my understanding correct that we realised that these are very rare > cases and are not worth the performance penalty of writeback? In which > case, is it really okay to just let the scope for corruption exist even > though its rare. There might be some other error cases we might be > missing which might be more easier to hit. For eg I think we can also > fail ext4_alloc_file_blocks() with ENOSPC in case there is a written to > unwritten extent conversion causing an extent split leading to extent > tree node allocation. (Maybe can be avoided by using PRE_IO with > EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT in the first ext4_alloc_file_blocks() call) > > So does it make sense to retain the writeback behavior or am I just > being paranoid :) > > Regards, > ojaswin [1] https://lore.kernel.org/linux-ext4/20240917165007.j5dywaekvnirfffm@quack3/ > > > Besides, similar to ext4_zero_range(), we must address the case of > > partially punched folios when block size < page size. It is essential to > > remove writable userspace mappings to ensure that the folio can be > > faulted again during subsequent mmap write access. > > > > In journaled mode, we need to write dirty pages out before discarding > > page cache in case of crash before committing the freeing data > > transaction, which could expose old, stale data, even if synchronization > > has been performed. > > > > Signed-off-by: Zhang Yi <yi.zhang@huawei.com> > > --- > > fs/ext4/inode.c | 18 +++++------------- > > 1 file changed, 5 insertions(+), 13 deletions(-) > > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > > index bf735d06b621..a5ba2b71d508 100644 > > --- a/fs/ext4/inode.c > > +++ b/fs/ext4/inode.c > > @@ -4018,17 +4018,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length) > > > > trace_ext4_punch_hole(inode, offset, length, 0); > > > > - /* > > - * Write out all dirty pages to avoid race conditions > > - * Then release them. > > - */ > > - if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) { > > - ret = filemap_write_and_wait_range(mapping, offset, > > - offset + length - 1); > > - if (ret) > > - return ret; > > - } > > - > > inode_lock(inode); > > > > /* No need to punch hole beyond i_size */ > > @@ -4090,8 +4079,11 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length) > > ret = ext4_update_disksize_before_punch(inode, offset, length); > > if (ret) > > goto out_dio; > > - truncate_pagecache_range(inode, first_block_offset, > > - last_block_offset); > > + > > + ret = ext4_truncate_page_cache_block_range(inode, > > + first_block_offset, last_block_offset + 1); > > + if (ret) > > + goto out_dio; > > } > > > > if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) > > -- > > 2.46.1 > >
On 2024/12/17 22:31, Ojaswin Mujoo wrote: > On Mon, Dec 16, 2024 at 09:39:08AM +0800, Zhang Yi wrote: >> From: Zhang Yi <yi.zhang@huawei.com> >> >> There is no need to write back all data before punching a hole in >> non-journaled mode since it will be dropped soon after removing space. >> Therefore, the call to filemap_write_and_wait_range() can be eliminated. > > Hi, sorry I'm a bit late to this however following the discussion here > [1], I believe the initial concern was that we don't in PATCH v1 01/10 > was that after truncating the pagecache, the ext4_alloc_file_blocks() > call might fail with errors like EIO, ENOMEM etc leading to inconsistent > data. > > Is my understanding correct that we realised that these are very rare > cases and are not worth the performance penalty of writeback? In which > case, is it really okay to just let the scope for corruption exist even > though its rare. There might be some other error cases we might be > missing which might be more easier to hit. For eg I think we can also > fail ext4_alloc_file_blocks() with ENOSPC in case there is a written to > unwritten extent conversion causing an extent split leading to extent > tree node allocation. (Maybe can be avoided by using PRE_IO with > EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT in the first ext4_alloc_file_blocks() call) > > So does it make sense to retain the writeback behavior or am I just > being paranoid :) > Hi, Ojaswin! Yeah, from my point of view, ENOSPC could happen, and it may be more likely to happen if we intentionally create conditions for it. However, all the efforts we can make at this point are merely best efforts and reduce the probability. We cannot 100% guarantee it will not happen, even if we write back the whole range before manipulating extents and blocks. This is because we do not accurately reserve space for extents split. Additionally, In ext4_punch_hole(), we have used 'nofail' flag while freeing blocks to reduce the possibility of ENOSPC. So I suppose it's fine by now, but we may need to implement additional measures if we truly want to resolve the issue completely. Thanks, Yi. > >> Besides, similar to ext4_zero_range(), we must address the case of >> partially punched folios when block size < page size. It is essential to >> remove writable userspace mappings to ensure that the folio can be >> faulted again during subsequent mmap write access. >> >> In journaled mode, we need to write dirty pages out before discarding >> page cache in case of crash before committing the freeing data >> transaction, which could expose old, stale data, even if synchronization >> has been performed. >> >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> >> --- >> fs/ext4/inode.c | 18 +++++------------- >> 1 file changed, 5 insertions(+), 13 deletions(-) >> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c >> index bf735d06b621..a5ba2b71d508 100644 >> --- a/fs/ext4/inode.c >> +++ b/fs/ext4/inode.c >> @@ -4018,17 +4018,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length) >> >> trace_ext4_punch_hole(inode, offset, length, 0); >> >> - /* >> - * Write out all dirty pages to avoid race conditions >> - * Then release them. >> - */ >> - if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) { >> - ret = filemap_write_and_wait_range(mapping, offset, >> - offset + length - 1); >> - if (ret) >> - return ret; >> - } >> - >> inode_lock(inode); >> >> /* No need to punch hole beyond i_size */ >> @@ -4090,8 +4079,11 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length) >> ret = ext4_update_disksize_before_punch(inode, offset, length); >> if (ret) >> goto out_dio; >> - truncate_pagecache_range(inode, first_block_offset, >> - last_block_offset); >> + >> + ret = ext4_truncate_page_cache_block_range(inode, >> + first_block_offset, last_block_offset + 1); >> + if (ret) >> + goto out_dio; >> } >> >> if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) >> -- >> 2.46.1 >>
On Wed, Dec 18, 2024 at 03:10:36PM +0800, Zhang Yi wrote: > On 2024/12/17 22:31, Ojaswin Mujoo wrote: > > On Mon, Dec 16, 2024 at 09:39:08AM +0800, Zhang Yi wrote: > >> From: Zhang Yi <yi.zhang@huawei.com> > >> > >> There is no need to write back all data before punching a hole in > >> non-journaled mode since it will be dropped soon after removing space. > >> Therefore, the call to filemap_write_and_wait_range() can be eliminated. > > > > Hi, sorry I'm a bit late to this however following the discussion here > > [1], I believe the initial concern was that we don't in PATCH v1 01/10 > > was that after truncating the pagecache, the ext4_alloc_file_blocks() > > call might fail with errors like EIO, ENOMEM etc leading to inconsistent > > data. > > > > Is my understanding correct that we realised that these are very rare > > cases and are not worth the performance penalty of writeback? In which > > case, is it really okay to just let the scope for corruption exist even > > though its rare. There might be some other error cases we might be > > missing which might be more easier to hit. For eg I think we can also > > fail ext4_alloc_file_blocks() with ENOSPC in case there is a written to > > unwritten extent conversion causing an extent split leading to extent > > tree node allocation. (Maybe can be avoided by using PRE_IO with > > EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT in the first ext4_alloc_file_blocks() call) > > > > So does it make sense to retain the writeback behavior or am I just > > being paranoid :) > > > > Hi, Ojaswin! > > Yeah, from my point of view, ENOSPC could happen, and it may be more > likely to happen if we intentionally create conditions for it. However, > all the efforts we can make at this point are merely best efforts and > reduce the probability. We cannot 100% guarantee it will not happen, > even if we write back the whole range before manipulating extents and > blocks. This is because we do not accurately reserve space for extents > split. Additionally, In ext4_punch_hole(), we have used 'nofail' flag Right, rechecking the ext4_map_blocks code, seems like we can also result in a failure after unwrit extents have successfully been allocated so either ways we can't be sure that we'll retain old data on failure even with writeback. > while freeing blocks to reduce the possibility of ENOSPC. So I suppose > it's fine by now, but we may need to implement additional measures if > we truly want to resolve the issue completely. Sure I agree that in that case we should ideally have something more robust to handle these edge cases. For now, this change looks good. Feel free to add: Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> > > Thanks, > Yi. > > > > >> Besides, similar to ext4_zero_range(), we must address the case of > >> partially punched folios when block size < page size. It is essential to > >> remove writable userspace mappings to ensure that the folio can be > >> faulted again during subsequent mmap write access. > >> > >> In journaled mode, we need to write dirty pages out before discarding > >> page cache in case of crash before committing the freeing data > >> transaction, which could expose old, stale data, even if synchronization > >> has been performed. > >> > >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> > >> --- > >> fs/ext4/inode.c | 18 +++++------------- > >> 1 file changed, 5 insertions(+), 13 deletions(-) > >> > >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > >> index bf735d06b621..a5ba2b71d508 100644 > >> --- a/fs/ext4/inode.c > >> +++ b/fs/ext4/inode.c > >> @@ -4018,17 +4018,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length) > >> > >> trace_ext4_punch_hole(inode, offset, length, 0); > >> > >> - /* > >> - * Write out all dirty pages to avoid race conditions > >> - * Then release them. > >> - */ > >> - if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) { > >> - ret = filemap_write_and_wait_range(mapping, offset, > >> - offset + length - 1); > >> - if (ret) > >> - return ret; > >> - } > >> - > >> inode_lock(inode); > >> > >> /* No need to punch hole beyond i_size */ > >> @@ -4090,8 +4079,11 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length) > >> ret = ext4_update_disksize_before_punch(inode, offset, length); > >> if (ret) > >> goto out_dio; > >> - truncate_pagecache_range(inode, first_block_offset, > >> - last_block_offset); > >> + > >> + ret = ext4_truncate_page_cache_block_range(inode, > >> + first_block_offset, last_block_offset + 1); > >> + if (ret) > >> + goto out_dio; > >> } > >> > >> if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) > >> -- > >> 2.46.1 > >> >
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index bf735d06b621..a5ba2b71d508 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -4018,17 +4018,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length) trace_ext4_punch_hole(inode, offset, length, 0); - /* - * Write out all dirty pages to avoid race conditions - * Then release them. - */ - if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) { - ret = filemap_write_and_wait_range(mapping, offset, - offset + length - 1); - if (ret) - return ret; - } - inode_lock(inode); /* No need to punch hole beyond i_size */ @@ -4090,8 +4079,11 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length) ret = ext4_update_disksize_before_punch(inode, offset, length); if (ret) goto out_dio; - truncate_pagecache_range(inode, first_block_offset, - last_block_offset); + + ret = ext4_truncate_page_cache_block_range(inode, + first_block_offset, last_block_offset + 1); + if (ret) + goto out_dio; } if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))