Message ID | 20241209114241.3725722-2-leo.lilong@huawei.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | iomap: fix zero padding data issue in concurrent append writes | expand |
On Mon, Dec 09, 2024 at 07:42:39PM +0800, Long Li wrote: > This is a preparatory patch for fixing zero padding issues in concurrent > append write scenarios. In the following patches, we need to obtain > byte-granular writeback end position for io_size trimming after EOF > handling. > > Due to concurrent writeback and truncate operations, inode size may > shrink. Resampling inode size would force writeback code to handle the > newly appeared post-EOF blocks, which is undesirable. As Dave > explained in [1]: > > "Really, the issue is that writeback mappings have to be able to > handle the range being mapped suddenly appear to be beyond EOF. > This behaviour is a longstanding writeback constraint, and is what > iomap_writepage_handle_eof() is attempting to handle. > > We handle this by only sampling i_size_read() whilst we have the > folio locked and can determine the action we should take with that > folio (i.e. nothing, partial zeroing, or skip altogether). Once > we've made the decision that the folio is within EOF and taken > action on it (i.e. moved the folio to writeback state), we cannot > then resample the inode size because a truncate may have started > and changed the inode size." > > To avoid resampling inode size after EOF handling, we convert end_pos > to byte-granular writeback position and return it from EOF handling > function. > > Since iomap_set_range_dirty() can handle unaligned lengths, this > conversion has no impact on it. However, iomap_find_dirty_range() > requires aligned start and end range to find dirty blocks within the > given range, so the end position needs to be rounded up when passed > to it. > > LINK [1]: https://lore.kernel.org/linux-xfs/Z1Gg0pAa54MoeYME@localhost.localdomain/ > Signed-off-by: Long Li <leo.lilong@huawei.com> > --- > fs/iomap/buffered-io.c | 21 ++++++++++++--------- > 1 file changed, 12 insertions(+), 9 deletions(-) > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > index 955f19e27e47..bcc7831d03af 100644 > --- a/fs/iomap/buffered-io.c > +++ b/fs/iomap/buffered-io.c ... > @@ -1914,6 +1915,7 @@ static int iomap_writepage_map(struct iomap_writepage_ctx *wpc, > struct inode *inode = folio->mapping->host; > u64 pos = folio_pos(folio); > u64 end_pos = pos + folio_size(folio); > + u64 end_aligned = 0; > unsigned count = 0; > int error = 0; > u32 rlen; > @@ -1955,9 +1957,10 @@ static int iomap_writepage_map(struct iomap_writepage_ctx *wpc, > /* > * Walk through the folio to find dirty areas to write back. > */ > - while ((rlen = iomap_find_dirty_range(folio, &pos, end_pos))) { > + end_aligned = round_up(end_pos, i_blocksize(inode)); So do I follow correctly that the set_range_dirty() path doesn't need the alignment because it uses inclusive first_blk/last_blk logic, whereas this find_dirty_range() path does the opposite and thus does require the round_up? If so, presumably that means if we fixed up the find path we wouldn't need end_aligned at all anymore? If I follow the reasoning correctly, then this looks Ok to me: Reviewed-by: Brian Foster <bfoster@redhat.com> ... but as a followup exercise it might be nice to clean up the iomap_find_dirty_range() path to either do the rounding itself or be more consistent with set_range_dirty(). Brian > + while ((rlen = iomap_find_dirty_range(folio, &pos, end_aligned))) { > error = iomap_writepage_map_blocks(wpc, wbc, folio, inode, > - pos, rlen, &count); > + pos, end_pos, rlen, &count); > if (error) > break; > pos += rlen; > -- > 2.39.2 > >
On Mon, Dec 09, 2024 at 09:06:14AM -0500, Brian Foster wrote: > On Mon, Dec 09, 2024 at 07:42:39PM +0800, Long Li wrote: > > This is a preparatory patch for fixing zero padding issues in concurrent > > append write scenarios. In the following patches, we need to obtain > > byte-granular writeback end position for io_size trimming after EOF > > handling. > > > > Due to concurrent writeback and truncate operations, inode size may > > shrink. Resampling inode size would force writeback code to handle the > > newly appeared post-EOF blocks, which is undesirable. As Dave > > explained in [1]: > > > > "Really, the issue is that writeback mappings have to be able to > > handle the range being mapped suddenly appear to be beyond EOF. > > This behaviour is a longstanding writeback constraint, and is what > > iomap_writepage_handle_eof() is attempting to handle. > > > > We handle this by only sampling i_size_read() whilst we have the > > folio locked and can determine the action we should take with that > > folio (i.e. nothing, partial zeroing, or skip altogether). Once > > we've made the decision that the folio is within EOF and taken > > action on it (i.e. moved the folio to writeback state), we cannot > > then resample the inode size because a truncate may have started > > and changed the inode size." > > > > To avoid resampling inode size after EOF handling, we convert end_pos > > to byte-granular writeback position and return it from EOF handling > > function. > > > > Since iomap_set_range_dirty() can handle unaligned lengths, this > > conversion has no impact on it. However, iomap_find_dirty_range() > > requires aligned start and end range to find dirty blocks within the > > given range, so the end position needs to be rounded up when passed > > to it. > > > > LINK [1]: https://lore.kernel.org/linux-xfs/Z1Gg0pAa54MoeYME@localhost.localdomain/ > > Signed-off-by: Long Li <leo.lilong@huawei.com> > > --- > > fs/iomap/buffered-io.c | 21 ++++++++++++--------- > > 1 file changed, 12 insertions(+), 9 deletions(-) > > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > > index 955f19e27e47..bcc7831d03af 100644 > > --- a/fs/iomap/buffered-io.c > > +++ b/fs/iomap/buffered-io.c > ... > > @@ -1914,6 +1915,7 @@ static int iomap_writepage_map(struct iomap_writepage_ctx *wpc, > > struct inode *inode = folio->mapping->host; > > u64 pos = folio_pos(folio); > > u64 end_pos = pos + folio_size(folio); > > + u64 end_aligned = 0; > > unsigned count = 0; > > int error = 0; > > u32 rlen; > > @@ -1955,9 +1957,10 @@ static int iomap_writepage_map(struct iomap_writepage_ctx *wpc, > > /* > > * Walk through the folio to find dirty areas to write back. > > */ > > - while ((rlen = iomap_find_dirty_range(folio, &pos, end_pos))) { > > + end_aligned = round_up(end_pos, i_blocksize(inode)); > > So do I follow correctly that the set_range_dirty() path doesn't need > the alignment because it uses inclusive first_blk/last_blk logic, > whereas this find_dirty_range() path does the opposite and thus does > require the round_up? If so, presumably that means if we fixed up the > find path we wouldn't need end_aligned at all anymore? > Agreed with you. > If I follow the reasoning correctly, then this looks Ok to me: > > Reviewed-by: Brian Foster <bfoster@redhat.com> > > ... but as a followup exercise it might be nice to clean up the > iomap_find_dirty_range() path to either do the rounding itself or be > more consistent with set_range_dirty(). > > Brian Yes, I think we can handle the cleanup through a separate patch later? Thanks, Long Li
On Tue, Dec 10, 2024 at 04:09:26PM +0800, Long Li wrote: > On Mon, Dec 09, 2024 at 09:06:14AM -0500, Brian Foster wrote: > > On Mon, Dec 09, 2024 at 07:42:39PM +0800, Long Li wrote: > > > This is a preparatory patch for fixing zero padding issues in concurrent > > > append write scenarios. In the following patches, we need to obtain > > > byte-granular writeback end position for io_size trimming after EOF > > > handling. > > > > > > Due to concurrent writeback and truncate operations, inode size may > > > shrink. Resampling inode size would force writeback code to handle the > > > newly appeared post-EOF blocks, which is undesirable. As Dave > > > explained in [1]: > > > > > > "Really, the issue is that writeback mappings have to be able to > > > handle the range being mapped suddenly appear to be beyond EOF. > > > This behaviour is a longstanding writeback constraint, and is what > > > iomap_writepage_handle_eof() is attempting to handle. > > > > > > We handle this by only sampling i_size_read() whilst we have the > > > folio locked and can determine the action we should take with that > > > folio (i.e. nothing, partial zeroing, or skip altogether). Once > > > we've made the decision that the folio is within EOF and taken > > > action on it (i.e. moved the folio to writeback state), we cannot > > > then resample the inode size because a truncate may have started > > > and changed the inode size." > > > > > > To avoid resampling inode size after EOF handling, we convert end_pos > > > to byte-granular writeback position and return it from EOF handling > > > function. > > > > > > Since iomap_set_range_dirty() can handle unaligned lengths, this > > > conversion has no impact on it. However, iomap_find_dirty_range() > > > requires aligned start and end range to find dirty blocks within the > > > given range, so the end position needs to be rounded up when passed > > > to it. > > > > > > LINK [1]: https://lore.kernel.org/linux-xfs/Z1Gg0pAa54MoeYME@localhost.localdomain/ > > > Signed-off-by: Long Li <leo.lilong@huawei.com> > > > --- > > > fs/iomap/buffered-io.c | 21 ++++++++++++--------- > > > 1 file changed, 12 insertions(+), 9 deletions(-) > > > > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > > > index 955f19e27e47..bcc7831d03af 100644 > > > --- a/fs/iomap/buffered-io.c > > > +++ b/fs/iomap/buffered-io.c > > ... > > > @@ -1914,6 +1915,7 @@ static int iomap_writepage_map(struct iomap_writepage_ctx *wpc, > > > struct inode *inode = folio->mapping->host; > > > u64 pos = folio_pos(folio); > > > u64 end_pos = pos + folio_size(folio); > > > + u64 end_aligned = 0; > > > unsigned count = 0; > > > int error = 0; > > > u32 rlen; > > > @@ -1955,9 +1957,10 @@ static int iomap_writepage_map(struct iomap_writepage_ctx *wpc, > > > /* > > > * Walk through the folio to find dirty areas to write back. > > > */ > > > - while ((rlen = iomap_find_dirty_range(folio, &pos, end_pos))) { > > > + end_aligned = round_up(end_pos, i_blocksize(inode)); > > > > So do I follow correctly that the set_range_dirty() path doesn't need > > the alignment because it uses inclusive first_blk/last_blk logic, > > whereas this find_dirty_range() path does the opposite and thus does > > require the round_up? If so, presumably that means if we fixed up the > > find path we wouldn't need end_aligned at all anymore? > > > > Agreed with you. > > > If I follow the reasoning correctly, then this looks Ok to me: > > > > Reviewed-by: Brian Foster <bfoster@redhat.com> > > > > ... but as a followup exercise it might be nice to clean up the > > iomap_find_dirty_range() path to either do the rounding itself or be > > more consistent with set_range_dirty(). > > > > Brian > > Yes, I think we can handle the cleanup through a separate patch later? > Yep, thanks. Brian > Thanks, > Long Li >
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c index 955f19e27e47..bcc7831d03af 100644 --- a/fs/iomap/buffered-io.c +++ b/fs/iomap/buffered-io.c @@ -1774,7 +1774,8 @@ static bool iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t pos) */ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc, struct writeback_control *wbc, struct folio *folio, - struct inode *inode, loff_t pos, unsigned len) + struct inode *inode, loff_t pos, loff_t end_pos, + unsigned len) { struct iomap_folio_state *ifs = folio->private; size_t poff = offset_in_folio(folio, pos); @@ -1800,8 +1801,8 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc, static int iomap_writepage_map_blocks(struct iomap_writepage_ctx *wpc, struct writeback_control *wbc, struct folio *folio, - struct inode *inode, u64 pos, unsigned dirty_len, - unsigned *count) + struct inode *inode, u64 pos, u64 end_pos, + unsigned dirty_len, unsigned *count) { int error; @@ -1826,7 +1827,7 @@ static int iomap_writepage_map_blocks(struct iomap_writepage_ctx *wpc, break; default: error = iomap_add_to_ioend(wpc, wbc, folio, inode, pos, - map_len); + end_pos, map_len); if (!error) (*count)++; break; @@ -1897,11 +1898,11 @@ static bool iomap_writepage_handle_eof(struct folio *folio, struct inode *inode, * remaining memory is zeroed when mapped, and writes to that * region are not written out to the file. * - * Also adjust the writeback range to skip all blocks entirely - * beyond i_size. + * Also adjust the end_pos to the end of file and skip writeback + * for all blocks entirely beyond i_size. */ folio_zero_segment(folio, poff, folio_size(folio)); - *end_pos = round_up(isize, i_blocksize(inode)); + *end_pos = isize; } return true; @@ -1914,6 +1915,7 @@ static int iomap_writepage_map(struct iomap_writepage_ctx *wpc, struct inode *inode = folio->mapping->host; u64 pos = folio_pos(folio); u64 end_pos = pos + folio_size(folio); + u64 end_aligned = 0; unsigned count = 0; int error = 0; u32 rlen; @@ -1955,9 +1957,10 @@ static int iomap_writepage_map(struct iomap_writepage_ctx *wpc, /* * Walk through the folio to find dirty areas to write back. */ - while ((rlen = iomap_find_dirty_range(folio, &pos, end_pos))) { + end_aligned = round_up(end_pos, i_blocksize(inode)); + while ((rlen = iomap_find_dirty_range(folio, &pos, end_aligned))) { error = iomap_writepage_map_blocks(wpc, wbc, folio, inode, - pos, rlen, &count); + pos, end_pos, rlen, &count); if (error) break; pos += rlen;
This is a preparatory patch for fixing zero padding issues in concurrent append write scenarios. In the following patches, we need to obtain byte-granular writeback end position for io_size trimming after EOF handling. Due to concurrent writeback and truncate operations, inode size may shrink. Resampling inode size would force writeback code to handle the newly appeared post-EOF blocks, which is undesirable. As Dave explained in [1]: "Really, the issue is that writeback mappings have to be able to handle the range being mapped suddenly appear to be beyond EOF. This behaviour is a longstanding writeback constraint, and is what iomap_writepage_handle_eof() is attempting to handle. We handle this by only sampling i_size_read() whilst we have the folio locked and can determine the action we should take with that folio (i.e. nothing, partial zeroing, or skip altogether). Once we've made the decision that the folio is within EOF and taken action on it (i.e. moved the folio to writeback state), we cannot then resample the inode size because a truncate may have started and changed the inode size." To avoid resampling inode size after EOF handling, we convert end_pos to byte-granular writeback position and return it from EOF handling function. Since iomap_set_range_dirty() can handle unaligned lengths, this conversion has no impact on it. However, iomap_find_dirty_range() requires aligned start and end range to find dirty blocks within the given range, so the end position needs to be rounded up when passed to it. LINK [1]: https://lore.kernel.org/linux-xfs/Z1Gg0pAa54MoeYME@localhost.localdomain/ Signed-off-by: Long Li <leo.lilong@huawei.com> --- fs/iomap/buffered-io.c | 21 ++++++++++++--------- 1 file changed, 12 insertions(+), 9 deletions(-)