
iomap: elide zero range flush from partial eof zeroing

Message ID 20241023143029.11275-1-bfoster@redhat.com (mailing list archive)

Commit Message

Brian Foster Oct. 23, 2024, 2:30 p.m. UTC
iomap zero range performs a pagecache flush upon seeing unwritten
extents with dirty pagecache in order to determine accurate
subranges that require direct zeroing. This is to support an
optimization where clean, unwritten ranges are skipped as they are
already zero on-disk.

Certain use cases for zero range are more sensitive to flush latency
than others. The kernel test robot recently reported a regression in
the following stress-ng workload on XFS:

  stress-ng --timeout 60 --times --verify --metrics --no-rand-seed --metamix 64

This workload involves a series of small, strided, file-extending
writes. On XFS, this produces a pattern of allocating post-eof
speculative preallocation, converting preallocation to unwritten on
zero range calls, dirtying pagecache over the converted mapping, and
then repeating the sequence again from the updated EOF. This
effectively produces one pagecache flush per zero range call for
the partial EOF block zeroing use case.

To mitigate this problem, special case the EOF block zeroing use
case to prefer zeroing over a pagecache flush when the EOF folio is
already dirty. This brings most of the performance back by avoiding
flushes on write and truncate extension operations, while preserving
the ability for iomap to flush and properly process larger ranges.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---

Hi iomap maintainers,

This is an incremental optimization for the regression reported by the
test robot here[1]. I'm not totally convinced this is necessary as an
immediate fix, but the discussion on that thread was enough to suggest
it could be. I don't really love the factoring, but I had to play a bit
of whack-a-mole between fstests and stress-ng to restore performance and
still maintain behavior expectations for some of the tests.

On a positive note, exploring this gave me what I think is a better idea
for dealing with zero range overall, so I'm working on a followup to
this that reworks it by splitting zero range across block alignment
boundaries (similar to how something like truncate page range works, for
example). This simplifies things by isolating the dirty range check to a
single folio on an unaligned start offset, which lets the _iter() call
do a skip or zero (i.e. no more flush_and_stale()), and then
unconditionally flush the aligned portion to end-of-range. The latter
flush should be a no-op for every use case I've seen so far, so this
might entirely avoid the need for anything more complex for zero range.

In summary, I'm posting this as an optional and more "stable-worthy"
patch for reference and for the maintainers to consider as they like. I
think it's reasonable to include if we are concerned about this
particular stress-ng test and are OK with it as a transient solution.
But if it were up to me, I'd probably sit on it for a bit to determine
if a more practical user/workload is affected by this, particularly
knowing that I'm trying to rework it. This could always be applied as a
stable fix if really needed, but I just don't think the slightly more
invasive rework is appropriate for -rc.

Thoughts, reviews, flames appreciated.

Brian

[1] https://lore.kernel.org/linux-xfs/202410141536.1167190b-oliver.sang@intel.com/

 fs/iomap/buffered-io.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)
Patch

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index aa587b2142e2..8fd25b14d120 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1372,6 +1372,7 @@  static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
 	loff_t pos = iter->pos;
 	loff_t length = iomap_length(iter);
 	loff_t written = 0;
+	bool eof_zero = false;
 
 	/*
 	 * We must zero subranges of unwritten mappings that might be dirty in
@@ -1391,12 +1392,23 @@  static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
 	 * triggers writeback time post-eof zeroing.
 	 */
 	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) {
-		if (*range_dirty) {
+		/* range is clean and already zeroed, nothing to do */
+		if (!*range_dirty)
+			return length;
+
+		/* flush for anything other than partial eof zeroing */
+		if (pos != i_size_read(iter->inode) ||
+		   (pos % i_blocksize(iter->inode)) == 0) {
 			*range_dirty = false;
 			return iomap_zero_iter_flush_and_stale(iter);
 		}
-		/* range is clean and already zeroed, nothing to do */
-		return length;
+		/*
+		 * Special case partial EOF zeroing. Since we know the EOF
+		 * folio is dirty, prefer in-memory zeroing for it. This avoids
+		 * excessive flush latency on frequent file size extending
+		 * operations.
+		 */
+		eof_zero = true;
 	}
 
 	do {
@@ -1415,6 +1427,8 @@  static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
 		offset = offset_in_folio(folio, pos);
 		if (bytes > folio_size(folio) - offset)
 			bytes = folio_size(folio) - offset;
+		if (eof_zero && length > bytes)
+			length = bytes;
 
 		folio_zero_range(folio, offset, bytes);
 		folio_mark_accessed(folio);