diff mbox series

[v6,1/3] iomap: pass byte granular end position to iomap_add_to_ioend

Message ID 20241209114241.3725722-2-leo.lilong@huawei.com (mailing list archive)
State Not Applicable, archived
Headers show
Series iomap: fix zero padding data issue in concurrent append writes | expand

Commit Message

Long Li Dec. 9, 2024, 11:42 a.m. UTC
This is a preparatory patch for fixing zero padding issues in concurrent
append write scenarios. In the following patches, we need to obtain
byte-granular writeback end position for io_size trimming after EOF
handling.

Due to concurrent writeback and truncate operations, inode size may
shrink. Resampling inode size would force writeback code to handle the
newly appeared post-EOF blocks, which is undesirable. As Dave
explained in [1]:

"Really, the issue is that writeback mappings have to be able to
handle the range being mapped suddenly appear to be beyond EOF.
This behaviour is a longstanding writeback constraint, and is what
iomap_writepage_handle_eof() is attempting to handle.

We handle this by only sampling i_size_read() whilst we have the
folio locked and can determine the action we should take with that
folio (i.e. nothing, partial zeroing, or skip altogether). Once
we've made the decision that the folio is within EOF and taken
action on it (i.e. moved the folio to writeback state), we cannot
then resample the inode size because a truncate may have started
and changed the inode size."

To avoid resampling inode size after EOF handling, we convert end_pos
to byte-granular writeback position and return it from EOF handling
function.

Since iomap_set_range_dirty() can handle unaligned lengths, this
conversion has no impact on it. However, iomap_find_dirty_range()
requires aligned start and end range to find dirty blocks within the
given range, so the end position needs to be rounded up when passed
to it.

LINK [1]: https://lore.kernel.org/linux-xfs/Z1Gg0pAa54MoeYME@localhost.localdomain/
Signed-off-by: Long Li <leo.lilong@huawei.com>
---
 fs/iomap/buffered-io.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

Comments

Brian Foster Dec. 9, 2024, 2:06 p.m. UTC | #1
On Mon, Dec 09, 2024 at 07:42:39PM +0800, Long Li wrote:
> This is a preparatory patch for fixing zero padding issues in concurrent
> append write scenarios. In the following patches, we need to obtain
> byte-granular writeback end position for io_size trimming after EOF
> handling.
> 
> Due to concurrent writeback and truncate operations, inode size may
> shrink. Resampling inode size would force writeback code to handle the
> newly appeared post-EOF blocks, which is undesirable. As Dave
> explained in [1]:
> 
> "Really, the issue is that writeback mappings have to be able to
> handle the range being mapped suddenly appear to be beyond EOF.
> This behaviour is a longstanding writeback constraint, and is what
> iomap_writepage_handle_eof() is attempting to handle.
> 
> We handle this by only sampling i_size_read() whilst we have the
> folio locked and can determine the action we should take with that
> folio (i.e. nothing, partial zeroing, or skip altogether). Once
> we've made the decision that the folio is within EOF and taken
> action on it (i.e. moved the folio to writeback state), we cannot
> then resample the inode size because a truncate may have started
> and changed the inode size."
> 
> To avoid resampling inode size after EOF handling, we convert end_pos
> to byte-granular writeback position and return it from EOF handling
> function.
> 
> Since iomap_set_range_dirty() can handle unaligned lengths, this
> conversion has no impact on it. However, iomap_find_dirty_range()
> requires aligned start and end range to find dirty blocks within the
> given range, so the end position needs to be rounded up when passed
> to it.
> 
> LINK [1]: https://lore.kernel.org/linux-xfs/Z1Gg0pAa54MoeYME@localhost.localdomain/
> Signed-off-by: Long Li <leo.lilong@huawei.com>
> ---
>  fs/iomap/buffered-io.c | 21 ++++++++++++---------
>  1 file changed, 12 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 955f19e27e47..bcc7831d03af 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
...
> @@ -1914,6 +1915,7 @@ static int iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  	struct inode *inode = folio->mapping->host;
>  	u64 pos = folio_pos(folio);
>  	u64 end_pos = pos + folio_size(folio);
> +	u64 end_aligned = 0;
>  	unsigned count = 0;
>  	int error = 0;
>  	u32 rlen;
> @@ -1955,9 +1957,10 @@ static int iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>  	/*
>  	 * Walk through the folio to find dirty areas to write back.
>  	 */
> -	while ((rlen = iomap_find_dirty_range(folio, &pos, end_pos))) {
> +	end_aligned = round_up(end_pos, i_blocksize(inode));

So do I follow correctly that the set_range_dirty() path doesn't need
the alignment because it uses inclusive first_blk/last_blk logic,
whereas this find_dirty_range() path does the opposite and thus does
require the round_up? If so, presumably that means if we fixed up the
find path we wouldn't need end_aligned at all anymore?

If I follow the reasoning correctly, then this looks Ok to me:

Reviewed-by: Brian Foster <bfoster@redhat.com>

... but as a followup exercise it might be nice to clean up the
iomap_find_dirty_range() path to either do the rounding itself or be
more consistent with set_range_dirty().

Brian

> +	while ((rlen = iomap_find_dirty_range(folio, &pos, end_aligned))) {
>  		error = iomap_writepage_map_blocks(wpc, wbc, folio, inode,
> -				pos, rlen, &count);
> +				pos, end_pos, rlen, &count);
>  		if (error)
>  			break;
>  		pos += rlen;
> -- 
> 2.39.2
> 
>
Long Li Dec. 10, 2024, 8:09 a.m. UTC | #2
On Mon, Dec 09, 2024 at 09:06:14AM -0500, Brian Foster wrote:
> On Mon, Dec 09, 2024 at 07:42:39PM +0800, Long Li wrote:
> > This is a preparatory patch for fixing zero padding issues in concurrent
> > append write scenarios. In the following patches, we need to obtain
> > byte-granular writeback end position for io_size trimming after EOF
> > handling.
> > 
> > Due to concurrent writeback and truncate operations, inode size may
> > shrink. Resampling inode size would force writeback code to handle the
> > newly appeared post-EOF blocks, which is undesirable. As Dave
> > explained in [1]:
> > 
> > "Really, the issue is that writeback mappings have to be able to
> > handle the range being mapped suddenly appear to be beyond EOF.
> > This behaviour is a longstanding writeback constraint, and is what
> > iomap_writepage_handle_eof() is attempting to handle.
> > 
> > We handle this by only sampling i_size_read() whilst we have the
> > folio locked and can determine the action we should take with that
> > folio (i.e. nothing, partial zeroing, or skip altogether). Once
> > we've made the decision that the folio is within EOF and taken
> > action on it (i.e. moved the folio to writeback state), we cannot
> > then resample the inode size because a truncate may have started
> > and changed the inode size."
> > 
> > To avoid resampling inode size after EOF handling, we convert end_pos
> > to byte-granular writeback position and return it from EOF handling
> > function.
> > 
> > Since iomap_set_range_dirty() can handle unaligned lengths, this
> > conversion has no impact on it. However, iomap_find_dirty_range()
> > requires aligned start and end range to find dirty blocks within the
> > given range, so the end position needs to be rounded up when passed
> > to it.
> > 
> > LINK [1]: https://lore.kernel.org/linux-xfs/Z1Gg0pAa54MoeYME@localhost.localdomain/
> > Signed-off-by: Long Li <leo.lilong@huawei.com>
> > ---
> >  fs/iomap/buffered-io.c | 21 ++++++++++++---------
> >  1 file changed, 12 insertions(+), 9 deletions(-)
> > 
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index 955f19e27e47..bcc7831d03af 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> ...
> > @@ -1914,6 +1915,7 @@ static int iomap_writepage_map(struct iomap_writepage_ctx *wpc,
> >  	struct inode *inode = folio->mapping->host;
> >  	u64 pos = folio_pos(folio);
> >  	u64 end_pos = pos + folio_size(folio);
> > +	u64 end_aligned = 0;
> >  	unsigned count = 0;
> >  	int error = 0;
> >  	u32 rlen;
> > @@ -1955,9 +1957,10 @@ static int iomap_writepage_map(struct iomap_writepage_ctx *wpc,
> >  	/*
> >  	 * Walk through the folio to find dirty areas to write back.
> >  	 */
> > -	while ((rlen = iomap_find_dirty_range(folio, &pos, end_pos))) {
> > +	end_aligned = round_up(end_pos, i_blocksize(inode));
> 
> So do I follow correctly that the set_range_dirty() path doesn't need
> the alignment because it uses inclusive first_blk/last_blk logic,
> whereas this find_dirty_range() path does the opposite and thus does
> require the round_up? If so, presumably that means if we fixed up the
> find path we wouldn't need end_aligned at all anymore?
> 

Agreed with you.

> If I follow the reasoning correctly, then this looks Ok to me:
> 
> Reviewed-by: Brian Foster <bfoster@redhat.com>
>
> ... but as a followup exercise it might be nice to clean up the
> iomap_find_dirty_range() path to either do the rounding itself or be
> more consistent with set_range_dirty().
> 
> Brian

Yes, I think we can handle the cleanup through a separate patch later?                                                                            

Thanks,
Long Li
Brian Foster Dec. 10, 2024, 11:50 a.m. UTC | #3
On Tue, Dec 10, 2024 at 04:09:26PM +0800, Long Li wrote:
> On Mon, Dec 09, 2024 at 09:06:14AM -0500, Brian Foster wrote:
> > On Mon, Dec 09, 2024 at 07:42:39PM +0800, Long Li wrote:
> > > This is a preparatory patch for fixing zero padding issues in concurrent
> > > append write scenarios. In the following patches, we need to obtain
> > > byte-granular writeback end position for io_size trimming after EOF
> > > handling.
> > > 
> > > Due to concurrent writeback and truncate operations, inode size may
> > > shrink. Resampling inode size would force writeback code to handle the
> > > newly appeared post-EOF blocks, which is undesirable. As Dave
> > > explained in [1]:
> > > 
> > > "Really, the issue is that writeback mappings have to be able to
> > > handle the range being mapped suddenly appear to be beyond EOF.
> > > This behaviour is a longstanding writeback constraint, and is what
> > > iomap_writepage_handle_eof() is attempting to handle.
> > > 
> > > We handle this by only sampling i_size_read() whilst we have the
> > > folio locked and can determine the action we should take with that
> > > folio (i.e. nothing, partial zeroing, or skip altogether). Once
> > > we've made the decision that the folio is within EOF and taken
> > > action on it (i.e. moved the folio to writeback state), we cannot
> > > then resample the inode size because a truncate may have started
> > > and changed the inode size."
> > > 
> > > To avoid resampling inode size after EOF handling, we convert end_pos
> > > to byte-granular writeback position and return it from EOF handling
> > > function.
> > > 
> > > Since iomap_set_range_dirty() can handle unaligned lengths, this
> > > conversion has no impact on it. However, iomap_find_dirty_range()
> > > requires aligned start and end range to find dirty blocks within the
> > > given range, so the end position needs to be rounded up when passed
> > > to it.
> > > 
> > > LINK [1]: https://lore.kernel.org/linux-xfs/Z1Gg0pAa54MoeYME@localhost.localdomain/
> > > Signed-off-by: Long Li <leo.lilong@huawei.com>
> > > ---
> > >  fs/iomap/buffered-io.c | 21 ++++++++++++---------
> > >  1 file changed, 12 insertions(+), 9 deletions(-)
> > > 
> > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > index 955f19e27e47..bcc7831d03af 100644
> > > --- a/fs/iomap/buffered-io.c
> > > +++ b/fs/iomap/buffered-io.c
> > ...
> > > @@ -1914,6 +1915,7 @@ static int iomap_writepage_map(struct iomap_writepage_ctx *wpc,
> > >  	struct inode *inode = folio->mapping->host;
> > >  	u64 pos = folio_pos(folio);
> > >  	u64 end_pos = pos + folio_size(folio);
> > > +	u64 end_aligned = 0;
> > >  	unsigned count = 0;
> > >  	int error = 0;
> > >  	u32 rlen;
> > > @@ -1955,9 +1957,10 @@ static int iomap_writepage_map(struct iomap_writepage_ctx *wpc,
> > >  	/*
> > >  	 * Walk through the folio to find dirty areas to write back.
> > >  	 */
> > > -	while ((rlen = iomap_find_dirty_range(folio, &pos, end_pos))) {
> > > +	end_aligned = round_up(end_pos, i_blocksize(inode));
> > 
> > So do I follow correctly that the set_range_dirty() path doesn't need
> > the alignment because it uses inclusive first_blk/last_blk logic,
> > whereas this find_dirty_range() path does the opposite and thus does
> > require the round_up? If so, presumably that means if we fixed up the
> > find path we wouldn't need end_aligned at all anymore?
> > 
> 
> Agreed with you.
> 
> > If I follow the reasoning correctly, then this looks Ok to me:
> > 
> > Reviewed-by: Brian Foster <bfoster@redhat.com>
> >
> > ... but as a followup exercise it might be nice to clean up the
> > iomap_find_dirty_range() path to either do the rounding itself or be
> > more consistent with set_range_dirty().
> > 
> > Brian
> 
> Yes, I think we can handle the cleanup through a separate patch later?                                                                            
> 

Yep, thanks.

Brian

> Thanks,
> Long Li
>
diff mbox series

Patch

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 955f19e27e47..bcc7831d03af 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1774,7 +1774,8 @@  static bool iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t pos)
  */
 static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
 		struct writeback_control *wbc, struct folio *folio,
-		struct inode *inode, loff_t pos, unsigned len)
+		struct inode *inode, loff_t pos, loff_t end_pos,
+		unsigned len)
 {
 	struct iomap_folio_state *ifs = folio->private;
 	size_t poff = offset_in_folio(folio, pos);
@@ -1800,8 +1801,8 @@  static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
 
 static int iomap_writepage_map_blocks(struct iomap_writepage_ctx *wpc,
 		struct writeback_control *wbc, struct folio *folio,
-		struct inode *inode, u64 pos, unsigned dirty_len,
-		unsigned *count)
+		struct inode *inode, u64 pos, u64 end_pos,
+		unsigned dirty_len, unsigned *count)
 {
 	int error;
 
@@ -1826,7 +1827,7 @@  static int iomap_writepage_map_blocks(struct iomap_writepage_ctx *wpc,
 			break;
 		default:
 			error = iomap_add_to_ioend(wpc, wbc, folio, inode, pos,
-					map_len);
+					end_pos, map_len);
 			if (!error)
 				(*count)++;
 			break;
@@ -1897,11 +1898,11 @@  static bool iomap_writepage_handle_eof(struct folio *folio, struct inode *inode,
 		 *    remaining memory is zeroed when mapped, and writes to that
 		 *    region are not written out to the file.
 		 *
-		 * Also adjust the writeback range to skip all blocks entirely
-		 * beyond i_size.
+		 * Also adjust the end_pos to the end of file and skip writeback
+		 * for all blocks entirely beyond i_size.
 		 */
 		folio_zero_segment(folio, poff, folio_size(folio));
-		*end_pos = round_up(isize, i_blocksize(inode));
+		*end_pos = isize;
 	}
 
 	return true;
@@ -1914,6 +1915,7 @@  static int iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	struct inode *inode = folio->mapping->host;
 	u64 pos = folio_pos(folio);
 	u64 end_pos = pos + folio_size(folio);
+	u64 end_aligned = 0;
 	unsigned count = 0;
 	int error = 0;
 	u32 rlen;
@@ -1955,9 +1957,10 @@  static int iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	/*
 	 * Walk through the folio to find dirty areas to write back.
 	 */
-	while ((rlen = iomap_find_dirty_range(folio, &pos, end_pos))) {
+	end_aligned = round_up(end_pos, i_blocksize(inode));
+	while ((rlen = iomap_find_dirty_range(folio, &pos, end_aligned))) {
 		error = iomap_writepage_map_blocks(wpc, wbc, folio, inode,
-				pos, rlen, &count);
+				pos, end_pos, rlen, &count);
 		if (error)
 			break;
 		pos += rlen;