[13/14] iomap: map multiple blocks at a time

Message ID 20231207072710.176093-14-hch@lst.de (mailing list archive)
State New, archived
Series [01/14] iomap: clear the per-folio dirty bits on all writeback failures

Commit Message

Christoph Hellwig Dec. 7, 2023, 7:27 a.m. UTC
The ->map_blocks interface returns a valid range for writeback, but we
still call back into it for every block, which is a bit inefficient.

Change iomap_writepage_map to use the valid range in the mapping until
the end of the folio or the end of the dirty range inside the folio,
instead of calling back for every block.

Note that the range is not reused across folio boundaries, as we need to
be able to check the mapping sequence count under the folio lock.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap/buffered-io.c | 116 ++++++++++++++++++++++++++++-------------
 include/linux/iomap.h  |   7 +++
 2 files changed, 88 insertions(+), 35 deletions(-)
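
In rough outline, iomap_writepage_map changes from one ->map_blocks call
per block to one call per mapping inside each contiguous dirty range (a
simplified sketch of the before/after loop structure, condensed from the
patch below):

	/* before: walk block by block, calling ->map_blocks for each one */
	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
		if (ifs && !ifs_block_is_dirty(folio, ifs, i))
			continue;
		error = iomap_writepage_map_blocks(wpc, wbc, folio, inode,
				pos, &count);
		if (error)
			break;
	}

	/* after: walk contiguous dirty ranges; iomap_writepage_map_blocks
	 * then loops over the ->map_blocks mappings covering each range */
	while ((rlen = iomap_find_dirty_range(folio, &pos, end_pos))) {
		error = iomap_writepage_map_blocks(wpc, wbc, folio, inode,
				pos, rlen, &count);
		if (error)
			break;
		pos += rlen;
	}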

Comments

Zhang Yi Dec. 7, 2023, 1:39 p.m. UTC | #1
Hi, Christoph.

On 2023/12/7 15:27, Christoph Hellwig wrote:
> The ->map_blocks interface returns a valid range for writeback, but we
> still call back into it for every block, which is a bit inefficient.
> 
> Change iomap_writepage_map to use the valid range in the mapping until
> the end of the folio or the end of the dirty range inside the folio,
> instead of calling back for every block.
> 
> Note that the range is not reused across folio boundaries, as we need to
> be able to check the mapping sequence count under the folio lock.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/iomap/buffered-io.c | 116 ++++++++++++++++++++++++++++-------------
>  include/linux/iomap.h  |   7 +++
>  2 files changed, 88 insertions(+), 35 deletions(-)
> 
[..]
> @@ -1738,29 +1775,41 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
>  
>  static int iomap_writepage_map_blocks(struct iomap_writepage_ctx *wpc,
>  		struct writeback_control *wbc, struct folio *folio,
> -		struct inode *inode, u64 pos, unsigned *count)
> +		struct inode *inode, u64 pos, unsigned dirty_len,
> +		unsigned *count)
>  {
>  	int error;
>  
> -	error = wpc->ops->map_blocks(wpc, inode, pos);
> -	if (error)
> -		goto fail;
> -	trace_iomap_writepage_map(inode, &wpc->iomap);
> -
> -	switch (wpc->iomap.type) {
> -	case IOMAP_INLINE:
> -		WARN_ON_ONCE(1);
> -		error = -EIO;
> -		break;
> -	case IOMAP_HOLE:
> -		break;
> -	default:
> -		error = iomap_add_to_ioend(wpc, wbc, folio, inode, pos);
> -		if (!error)
> -			(*count)++;
> -	}
> +	do {
> +		unsigned map_len;
> +
> +		error = wpc->ops->map_blocks(wpc, inode, pos);
> +		if (error)
> +			break;
> +		trace_iomap_writepage_map(inode, &wpc->iomap);
> +
> +		map_len = min_t(u64, dirty_len,
> +			wpc->iomap.offset + wpc->iomap.length - pos);
> +		WARN_ON_ONCE(!folio->private && map_len < dirty_len);

While I was debugging this series on ext4, I found it would be more
convenient to add map_len or dirty_len to this trace point.

> +
> +		switch (wpc->iomap.type) {
> +		case IOMAP_INLINE:
> +			WARN_ON_ONCE(1);
> +			error = -EIO;
> +			break;
> +		case IOMAP_HOLE:
> +			break;

BTW, I want to ask a question unrelated to this patch series. Would you
agree to adding an IOMAP_DELALLOC case that re-dirties the folio here? The
background is that on ext4, the jbd2 thread calls
ext4_normal_submit_inode_data_buffers() to submit data blocks in
data=ordered mode, but it can only submit mapped blocks; currently we skip
unmapped blocks and re-dirty folios in
ext4_do_writepages()->mpage_prepare_extent_to_map()->..->ext4_bio_write_folio().
So we have to inherit this logic when converting to iomap. I suppose
ext4's ->map_blocks() would return IOMAP_DELALLOC for this case, and iomap
would do something like:

+               case IOMAP_DELALLOC:
+                       iomap_set_range_dirty(folio, offset_in_folio(folio, pos),
+                                             map_len);
+                       folio_redirty_for_writepage(wbc, folio);
+                       break;

Thanks,
Yi.

> +		default:
> +			error = iomap_add_to_ioend(wpc, wbc, folio, inode, pos,
> +					map_len);
> +			if (!error)
> +				(*count)++;
> +			break;
> +		}
> +		dirty_len -= map_len;
> +		pos += map_len;
> +	} while (dirty_len && !error);
>  
> -fail:
>  	/*
>  	 * We cannot cancel the ioend directly here on error.  We may have
>  	 * already set other pages under writeback and hence we have to run I/O
> @@ -1827,7 +1876,7 @@ static bool iomap_writepage_handle_eof(struct folio *folio, struct inode *inode,
>  		 * beyond i_size.
>  		 */
>  		folio_zero_segment(folio, poff, folio_size(folio));
> -		*end_pos = isize;
> +		*end_pos = round_up(isize, i_blocksize(inode));
>  	}
>  
>  	return true;
Christoph Hellwig Dec. 7, 2023, 3:03 p.m. UTC | #2
On Thu, Dec 07, 2023 at 09:39:44PM +0800, Zhang Yi wrote:
> > +	do {
> > +		unsigned map_len;
> > +
> > +		error = wpc->ops->map_blocks(wpc, inode, pos);
> > +		if (error)
> > +			break;
> > +		trace_iomap_writepage_map(inode, &wpc->iomap);
> > +
> > +		map_len = min_t(u64, dirty_len,
> > +			wpc->iomap.offset + wpc->iomap.length - pos);
> > +		WARN_ON_ONCE(!folio->private && map_len < dirty_len);
> 
> While I was debugging this series on ext4, I found it would be more
> convenient to add map_len or dirty_len to this trace point.

That does seem useful, but it means we need to have an entirely new
event class.  Can I offload this to you for inclusion in your ext4
series? :)
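
For reference, a dedicated event carrying the extra length fields might
look roughly like this (a sketch only; the event name, field set, and
format string are illustrative assumptions, not an existing
fs/iomap/trace.h definition):

TRACE_EVENT(iomap_writepage_map_range,
	TP_PROTO(struct inode *inode, u64 pos, unsigned int dirty_len,
		 struct iomap *iomap),
	TP_ARGS(inode, pos, dirty_len, iomap),
	TP_STRUCT__entry(
		__field(dev_t, dev)
		__field(u64, ino)
		__field(u64, pos)
		__field(unsigned int, dirty_len)
		__field(u64, offset)
		__field(u64, length)
		__field(u16, type)
	),
	TP_fast_assign(
		__entry->dev = inode->i_sb->s_dev;
		__entry->ino = inode->i_ino;
		__entry->pos = pos;
		__entry->dirty_len = dirty_len;
		__entry->offset = iomap->offset;
		__entry->length = iomap->length;
		__entry->type = iomap->type;
	),
	TP_printk("dev %d:%d ino 0x%llx pos 0x%llx dirty_len 0x%x "
		  "offset 0x%llx length 0x%llx type %u",
		  MAJOR(__entry->dev), MINOR(__entry->dev),
		  __entry->ino, __entry->pos, __entry->dirty_len,
		  __entry->offset, __entry->length, __entry->type)
);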

> > +		case IOMAP_HOLE:
> > +			break;
> 
> BTW, I want to ask a question unrelated to this patch series. Would you
> agree to adding an IOMAP_DELALLOC case that re-dirties the folio here? The
> background is that on ext4, the jbd2 thread calls
> ext4_normal_submit_inode_data_buffers() to submit data blocks in
> data=ordered mode, but it can only submit mapped blocks; currently we skip
> unmapped blocks and re-dirty folios in
> ext4_do_writepages()->mpage_prepare_extent_to_map()->..->ext4_bio_write_folio().
> So we have to inherit this logic when converting to iomap. I suppose
> ext4's ->map_blocks() would return IOMAP_DELALLOC for this case, and iomap
> would do something like:
> 
> +               case IOMAP_DELALLOC:
> +                       iomap_set_range_dirty(folio, offset_in_folio(folio, pos),
> +                                             map_len);
> +                       folio_redirty_for_writepage(wbc, folio);
> +                       break;

I guess we could add it, but it feels pretty quirky to me, so it would at
least need a very big comment.

But I think Ted mentioned a while ago that dropping the classic
data=ordered mode for ext4 might be a good idea eventually now that ext4
can update the inode size at I/O completion time (Ted, correct me if
I'm wrong).  If that's the case it might make sense to just drop the
ordered mode instead of adding these quirks to iomap.
Zhang Yi Dec. 8, 2023, 7:33 a.m. UTC | #3
On 2023/12/7 23:03, Christoph Hellwig wrote:
> On Thu, Dec 07, 2023 at 09:39:44PM +0800, Zhang Yi wrote:
>>> +	do {
>>> +		unsigned map_len;
>>> +
>>> +		error = wpc->ops->map_blocks(wpc, inode, pos);
>>> +		if (error)
>>> +			break;
>>> +		trace_iomap_writepage_map(inode, &wpc->iomap);
>>> +
>>> +		map_len = min_t(u64, dirty_len,
>>> +			wpc->iomap.offset + wpc->iomap.length - pos);
>>> +		WARN_ON_ONCE(!folio->private && map_len < dirty_len);
>>
>> While I was debugging this series on ext4, I found it would be more
>> convenient to add map_len or dirty_len to this trace point.
> 
> That does seem useful, but it means we need to have an entirely new
> event class.  Can I offload this to you for inclusion in your ext4
> series? :)
> 

Sure, I'm glad to do it.

>>> +		case IOMAP_HOLE:
>>> +			break;
>>
>> BTW, I want to ask a question unrelated to this patch series. Would you
>> agree to adding an IOMAP_DELALLOC case that re-dirties the folio here? The
>> background is that on ext4, the jbd2 thread calls
>> ext4_normal_submit_inode_data_buffers() to submit data blocks in
>> data=ordered mode, but it can only submit mapped blocks; currently we skip
>> unmapped blocks and re-dirty folios in
>> ext4_do_writepages()->mpage_prepare_extent_to_map()->..->ext4_bio_write_folio().
>> So we have to inherit this logic when converting to iomap. I suppose
>> ext4's ->map_blocks() would return IOMAP_DELALLOC for this case, and iomap
>> would do something like:
>>
>> +               case IOMAP_DELALLOC:
>> +                       iomap_set_range_dirty(folio, offset_in_folio(folio, pos),
>> +                                             map_len);
>> +                       folio_redirty_for_writepage(wbc, folio);
>> +                       break;
> 
> I guess we could add it, but it feels pretty quirky to me, so it would at
> least need a very big comment.
> 
> But I think Ted mentioned a while ago that dropping the classic
> data=ordered mode for ext4 might be a good idea eventually now that ext4
> can update the inode size at I/O completion time (Ted, correct me if
> I'm wrong).  If that's the case it might make sense to just drop the
> ordered mode instead of adding these quirks to iomap.
> 

Yeah, that makes sense; we could remove these quirks after ext4 drops
data=ordered mode. For now, let me implement it using this temporary
approach.

Thanks,
Yi.

Patch

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 4339bc422b245d..d8f56968962b97 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1,7 +1,7 @@ 
 // SPDX-License-Identifier: GPL-2.0
 /*
  * Copyright (C) 2010 Red Hat, Inc.
- * Copyright (C) 2016-2019 Christoph Hellwig.
+ * Copyright (C) 2016-2023 Christoph Hellwig.
  */
 #include <linux/module.h>
 #include <linux/compiler.h>
@@ -95,6 +95,44 @@  static inline bool ifs_block_is_dirty(struct folio *folio,
 	return test_bit(block + blks_per_folio, ifs->state);
 }
 
+static unsigned ifs_find_dirty_range(struct folio *folio,
+		struct iomap_folio_state *ifs, u64 *range_start, u64 range_end)
+{
+	struct inode *inode = folio->mapping->host;
+	unsigned start_blk =
+		offset_in_folio(folio, *range_start) >> inode->i_blkbits;
+	unsigned end_blk = min_not_zero(
+		offset_in_folio(folio, range_end) >> inode->i_blkbits,
+		i_blocks_per_folio(inode, folio));
+	unsigned nblks = 1;
+
+	while (!ifs_block_is_dirty(folio, ifs, start_blk))
+		if (++start_blk == end_blk)
+			return 0;
+
+	while (start_blk + nblks < end_blk) {
+		if (!ifs_block_is_dirty(folio, ifs, start_blk + nblks))
+			break;
+		nblks++;
+	}
+
+	*range_start = folio_pos(folio) + (start_blk << inode->i_blkbits);
+	return nblks << inode->i_blkbits;
+}
+
+static unsigned iomap_find_dirty_range(struct folio *folio, u64 *range_start,
+		u64 range_end)
+{
+	struct iomap_folio_state *ifs = folio->private;
+
+	if (*range_start >= range_end)
+		return 0;
+
+	if (ifs)
+		return ifs_find_dirty_range(folio, ifs, range_start, range_end);
+	return range_end - *range_start;
+}
+
 static void ifs_clear_range_dirty(struct folio *folio,
 		struct iomap_folio_state *ifs, size_t off, size_t len)
 {
@@ -1711,10 +1749,9 @@  static bool iomap_can_add_to_ioend(struct iomap_writepage_ctx *wpc, loff_t pos)
  */
 static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
 		struct writeback_control *wbc, struct folio *folio,
-		struct inode *inode, loff_t pos)
+		struct inode *inode, loff_t pos, unsigned len)
 {
 	struct iomap_folio_state *ifs = folio->private;
-	unsigned len = i_blocksize(inode);
 	size_t poff = offset_in_folio(folio, pos);
 	int error;
 
@@ -1738,29 +1775,41 @@  static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
 
 static int iomap_writepage_map_blocks(struct iomap_writepage_ctx *wpc,
 		struct writeback_control *wbc, struct folio *folio,
-		struct inode *inode, u64 pos, unsigned *count)
+		struct inode *inode, u64 pos, unsigned dirty_len,
+		unsigned *count)
 {
 	int error;
 
-	error = wpc->ops->map_blocks(wpc, inode, pos);
-	if (error)
-		goto fail;
-	trace_iomap_writepage_map(inode, &wpc->iomap);
-
-	switch (wpc->iomap.type) {
-	case IOMAP_INLINE:
-		WARN_ON_ONCE(1);
-		error = -EIO;
-		break;
-	case IOMAP_HOLE:
-		break;
-	default:
-		error = iomap_add_to_ioend(wpc, wbc, folio, inode, pos);
-		if (!error)
-			(*count)++;
-	}
+	do {
+		unsigned map_len;
+
+		error = wpc->ops->map_blocks(wpc, inode, pos);
+		if (error)
+			break;
+		trace_iomap_writepage_map(inode, &wpc->iomap);
+
+		map_len = min_t(u64, dirty_len,
+			wpc->iomap.offset + wpc->iomap.length - pos);
+		WARN_ON_ONCE(!folio->private && map_len < dirty_len);
+
+		switch (wpc->iomap.type) {
+		case IOMAP_INLINE:
+			WARN_ON_ONCE(1);
+			error = -EIO;
+			break;
+		case IOMAP_HOLE:
+			break;
+		default:
+			error = iomap_add_to_ioend(wpc, wbc, folio, inode, pos,
+					map_len);
+			if (!error)
+				(*count)++;
+			break;
+		}
+		dirty_len -= map_len;
+		pos += map_len;
+	} while (dirty_len && !error);
 
-fail:
 	/*
 	 * We cannot cancel the ioend directly here on error.  We may have
 	 * already set other pages under writeback and hence we have to run I/O
@@ -1827,7 +1876,7 @@  static bool iomap_writepage_handle_eof(struct folio *folio, struct inode *inode,
 		 * beyond i_size.
 		 */
 		folio_zero_segment(folio, poff, folio_size(folio));
-		*end_pos = isize;
+		*end_pos = round_up(isize, i_blocksize(inode));
 	}
 
 	return true;
@@ -1838,12 +1887,11 @@  static int iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 {
 	struct iomap_folio_state *ifs = folio->private;
 	struct inode *inode = folio->mapping->host;
-	unsigned len = i_blocksize(inode);
-	unsigned nblocks = i_blocks_per_folio(inode, folio);
 	u64 pos = folio_pos(folio);
 	u64 end_pos = pos + folio_size(folio);
 	unsigned count = 0;
-	int error = 0, i;
+	int error = 0;
+	u32 rlen;
 
 	WARN_ON_ONCE(!folio_test_locked(folio));
 	WARN_ON_ONCE(folio_test_dirty(folio));
@@ -1857,7 +1905,7 @@  static int iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	}
 	WARN_ON_ONCE(end_pos <= pos);
 
-	if (nblocks > 1) {
+	if (i_blocks_per_folio(inode, folio) > 1) {
 		if (!ifs) {
 			ifs = ifs_alloc(inode, folio, 0);
 			iomap_set_range_dirty(folio, 0, end_pos - pos);
@@ -1880,18 +1928,16 @@  static int iomap_writepage_map(struct iomap_writepage_ctx *wpc,
 	folio_start_writeback(folio);
 
 	/*
-	 * Walk through the folio to find areas to write back. If we
-	 * run off the end of the current map or find the current map
-	 * invalid, grab a new one.
+	 * Walk through the folio to find dirty areas to write back.
 	 */
-	for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
-		if (ifs && !ifs_block_is_dirty(folio, ifs, i))
-			continue;
-		error = iomap_writepage_map_blocks(wpc, wbc, folio, inode, pos,
-				&count);
+	while ((rlen = iomap_find_dirty_range(folio, &pos, end_pos))) {
+		error = iomap_writepage_map_blocks(wpc, wbc, folio, inode,
+				pos, rlen, &count);
 		if (error)
 			break;
+		pos += rlen;
 	}
+
 	if (count)
 		wpc->nr_folios++;
 
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index b8d3b658ad2b03..49d93f53878565 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -309,6 +309,13 @@  struct iomap_writeback_ops {
 	/*
 	 * Required, maps the blocks so that writeback can be performed on
 	 * the range starting at offset.
+	 *
+	 * Can return arbitrarily large regions, but we need to call into it at
+	 * least once per folio to allow the file systems to synchronize with
+	 * the write path that could be invalidating mappings.
+	 *
+	 * An existing mapping from a previous call to this method can be reused
+	 * by the file system if it is still valid.
 	 */
 	int (*map_blocks)(struct iomap_writepage_ctx *wpc, struct inode *inode,
 				loff_t offset);
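
As an illustration of the reuse rule described in that comment, a
filesystem's ->map_blocks implementation could be structured along these
lines (a hypothetical sketch; the example_* helpers stand in for
filesystem-specific extent lookup and revalidation, loosely modeled on
what xfs_map_blocks does):

static int example_map_blocks(struct iomap_writepage_ctx *wpc,
		struct inode *inode, loff_t offset)
{
	/*
	 * Fast path: the mapping returned by a previous call still covers
	 * this offset and has not been invalidated, so reuse it.
	 */
	if (offset >= wpc->iomap.offset &&
	    offset < wpc->iomap.offset + wpc->iomap.length &&
	    example_mapping_still_valid(wpc))	/* hypothetical check */
		return 0;

	/*
	 * Slow path: look up a new extent under the filesystem's locks and
	 * store it in wpc->iomap, where iomap_writepage_map_blocks() will
	 * pick it up.
	 */
	return example_lookup_extent(inode, offset, &wpc->iomap);
}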