diff mbox series

[5/9] iomap: buffered write failure should not truncate the page cache

Message ID 20221123055812.747923-6-david@fromorbit.com (mailing list archive)
State Accepted
Headers show
Series xfs, iomap: fix data corrupton due to stale cached iomaps | expand

Commit Message

Dave Chinner Nov. 23, 2022, 5:58 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

iomap_file_buffered_write_punch_delalloc() currently invalidates the
page cache over the unused range of the delalloc extent that was
allocated. While the write allocated the delalloc extent, it does
not own it exclusively as the write does not hold any locks that
prevent either writeback or mmap page faults from changing the state
of either the page cache or the extent state backing this range.

Whilst xfs_bmap_punch_delalloc_range() already handles races in
extent conversion - it will only punch out delalloc extents and it
ignores any other type of extent - the page cache truncate does not
discriminate between data written by this write or some other task.
As a result, truncating the page cache can result in data corruption
if the write races with mmap modifications to the file over the same
range.

generic/346 exercises this workload, and if we randomly fail writes
(as will happen when iomap gets stale iomap detection later in the
patchset), it will randomly corrupt the file data because it removes
data written by mmap() in the same page as the write() that failed.

Hence we do not want to punch out the page cache over the range of
the extent we failed to write to - what we actually need to do is
detect the ranges that have dirty data in cache over them and *not
punch them out*.

To do this, we have to walk the page cache over the range of the
delalloc extent we want to remove. This is made complex by the fact
we have to handle partially up-to-date folios correctly and this can
happen even when the FSB size == PAGE_SIZE because we now support
multi-page folios in the page cache.

Because we are only interested in discovering the edges of data
ranges in the page cache (i.e. hole-data boundaries) we can make use
of mapping_seek_hole_data() to find those transitions in the page
cache. As we hold the invalidate_lock, we know that the boundaries
are not going to change while we walk the range. This interface is
also byte-based and is sub-page block aware, so we can find the data
ranges in the cache based on byte offsets rather than page, folio or
fs block sized chunks. This greatly simplifies the logic of finding
dirty cached ranges in the page cache.

Once we've identified a range that contains cached data, we can then
iterate the range folio by folio. This allows us to determine if the
data is dirty and hence perform the correct delalloc extent punching
operations. The seek interface we use to iterate data ranges will
give us sub-folio start/end granularity, so we may end up looking up
the same folio multiple times as the seek interface iterates across
each discontiguous data region in the folio.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/iomap/buffered-io.c | 194 ++++++++++++++++++++++++++++++++++++++---
 1 file changed, 180 insertions(+), 14 deletions(-)

Comments

kernel test robot Nov. 23, 2022, 1:24 p.m. UTC | #1
Hi Dave,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on xfs-linux/for-next]
[also build test WARNING on linus/master v6.1-rc6 next-20221123]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Dave-Chinner/xfs-iomap-fix-data-corrupton-due-to-stale-cached-iomaps/20221123-135956
base:   https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git for-next
patch link:    https://lore.kernel.org/r/20221123055812.747923-6-david%40fromorbit.com
patch subject: [PATCH 5/9] iomap: buffered write failure should not truncate the page cache
config: x86_64-allyesconfig
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
        # https://github.com/intel-lab-lkp/linux/commit/9f88e3438e8c0e63b4d3610d2ea7d1900a6b4981
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Dave-Chinner/xfs-iomap-fix-data-corrupton-due-to-stale-cached-iomaps/20221123-135956
        git checkout 9f88e3438e8c0e63b4d3610d2ea7d1900a6b4981
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash fs/iomap/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   fs/iomap/buffered-io.c: In function 'iomap_file_buffered_write_punch_delalloc':
>> fs/iomap/buffered-io.c:1032:33: warning: unused variable 'error' [-Wunused-variable]
    1032 |         int                     error = 0;
         |                                 ^~~~~


vim +/error +1032 fs/iomap/buffered-io.c

9f88e3438e8c0e6 Dave Chinner 2022-11-23   993  
dfd153a8d32fb12 Dave Chinner 2022-11-23   994  /*
dfd153a8d32fb12 Dave Chinner 2022-11-23   995   * When a short write occurs, the filesystem may need to remove reserved space
dfd153a8d32fb12 Dave Chinner 2022-11-23   996   * that was allocated in ->iomap_begin from it's ->iomap_end method. For
dfd153a8d32fb12 Dave Chinner 2022-11-23   997   * filesystems that use delayed allocation, we need to punch out delalloc
dfd153a8d32fb12 Dave Chinner 2022-11-23   998   * extents from the range that are not dirty in the page cache. As the write can
dfd153a8d32fb12 Dave Chinner 2022-11-23   999   * race with page faults, there can be dirty pages over the delalloc extent
dfd153a8d32fb12 Dave Chinner 2022-11-23  1000   * outside the range of a short write but still within the delalloc extent
dfd153a8d32fb12 Dave Chinner 2022-11-23  1001   * allocated for this iomap.
dfd153a8d32fb12 Dave Chinner 2022-11-23  1002   *
dfd153a8d32fb12 Dave Chinner 2022-11-23  1003   * This function uses [start_byte, end_byte) intervals (i.e. open ended) to
9f88e3438e8c0e6 Dave Chinner 2022-11-23  1004   * simplify range iterations.
9f88e3438e8c0e6 Dave Chinner 2022-11-23  1005   *
9f88e3438e8c0e6 Dave Chinner 2022-11-23  1006   * The punch() callback *must* only punch delalloc extents in the range passed
9f88e3438e8c0e6 Dave Chinner 2022-11-23  1007   * to it. It must skip over all other types of extents in the range and leave
9f88e3438e8c0e6 Dave Chinner 2022-11-23  1008   * them completely unchanged. It must do this punch atomically with respect to
9f88e3438e8c0e6 Dave Chinner 2022-11-23  1009   * other extent modifications.
9f88e3438e8c0e6 Dave Chinner 2022-11-23  1010   *
9f88e3438e8c0e6 Dave Chinner 2022-11-23  1011   * The punch() callback may be called with a folio locked to prevent writeback
9f88e3438e8c0e6 Dave Chinner 2022-11-23  1012   * extent allocation racing at the edge of the range we are currently punching.
9f88e3438e8c0e6 Dave Chinner 2022-11-23  1013   * The locked folio may or may not cover the range being punched, so it is not
9f88e3438e8c0e6 Dave Chinner 2022-11-23  1014   * safe for the punch() callback to lock folios itself.
9f88e3438e8c0e6 Dave Chinner 2022-11-23  1015   *
9f88e3438e8c0e6 Dave Chinner 2022-11-23  1016   * Lock order is:
9f88e3438e8c0e6 Dave Chinner 2022-11-23  1017   *
9f88e3438e8c0e6 Dave Chinner 2022-11-23  1018   * inode->i_rwsem (shared or exclusive)
9f88e3438e8c0e6 Dave Chinner 2022-11-23  1019   *   inode->i_mapping->invalidate_lock (exclusive)
9f88e3438e8c0e6 Dave Chinner 2022-11-23  1020   *     folio_lock()
9f88e3438e8c0e6 Dave Chinner 2022-11-23  1021   *       ->punch
9f88e3438e8c0e6 Dave Chinner 2022-11-23  1022   *         internal filesystem allocation lock
dfd153a8d32fb12 Dave Chinner 2022-11-23  1023   */
dfd153a8d32fb12 Dave Chinner 2022-11-23  1024  int iomap_file_buffered_write_punch_delalloc(struct inode *inode,
dfd153a8d32fb12 Dave Chinner 2022-11-23  1025  		struct iomap *iomap, loff_t pos, loff_t length,
dfd153a8d32fb12 Dave Chinner 2022-11-23  1026  		ssize_t written,
dfd153a8d32fb12 Dave Chinner 2022-11-23  1027  		int (*punch)(struct inode *inode, loff_t pos, loff_t length))
dfd153a8d32fb12 Dave Chinner 2022-11-23  1028  {
dfd153a8d32fb12 Dave Chinner 2022-11-23  1029  	loff_t			start_byte;
dfd153a8d32fb12 Dave Chinner 2022-11-23  1030  	loff_t			end_byte;
dfd153a8d32fb12 Dave Chinner 2022-11-23  1031  	int			blocksize = i_blocksize(inode);
dfd153a8d32fb12 Dave Chinner 2022-11-23 @1032  	int			error = 0;
dfd153a8d32fb12 Dave Chinner 2022-11-23  1033  
dfd153a8d32fb12 Dave Chinner 2022-11-23  1034  	if (iomap->type != IOMAP_DELALLOC)
dfd153a8d32fb12 Dave Chinner 2022-11-23  1035  		return 0;
dfd153a8d32fb12 Dave Chinner 2022-11-23  1036  
dfd153a8d32fb12 Dave Chinner 2022-11-23  1037  	/* If we didn't reserve the blocks, we're not allowed to punch them. */
dfd153a8d32fb12 Dave Chinner 2022-11-23  1038  	if (!(iomap->flags & IOMAP_F_NEW))
dfd153a8d32fb12 Dave Chinner 2022-11-23  1039  		return 0;
dfd153a8d32fb12 Dave Chinner 2022-11-23  1040  
dfd153a8d32fb12 Dave Chinner 2022-11-23  1041  	/*
dfd153a8d32fb12 Dave Chinner 2022-11-23  1042  	 * start_byte refers to the first unused block after a short write. If
dfd153a8d32fb12 Dave Chinner 2022-11-23  1043  	 * nothing was written, round offset down to point at the first block in
dfd153a8d32fb12 Dave Chinner 2022-11-23  1044  	 * the range.
dfd153a8d32fb12 Dave Chinner 2022-11-23  1045  	 */
dfd153a8d32fb12 Dave Chinner 2022-11-23  1046  	if (unlikely(!written))
dfd153a8d32fb12 Dave Chinner 2022-11-23  1047  		start_byte = round_down(pos, blocksize);
dfd153a8d32fb12 Dave Chinner 2022-11-23  1048  	else
dfd153a8d32fb12 Dave Chinner 2022-11-23  1049  		start_byte = round_up(pos + written, blocksize);
dfd153a8d32fb12 Dave Chinner 2022-11-23  1050  	end_byte = round_up(pos + length, blocksize);
dfd153a8d32fb12 Dave Chinner 2022-11-23  1051  
dfd153a8d32fb12 Dave Chinner 2022-11-23  1052  	/* Nothing to do if we've written the entire delalloc extent */
dfd153a8d32fb12 Dave Chinner 2022-11-23  1053  	if (start_byte >= end_byte)
dfd153a8d32fb12 Dave Chinner 2022-11-23  1054  		return 0;
dfd153a8d32fb12 Dave Chinner 2022-11-23  1055  
9f88e3438e8c0e6 Dave Chinner 2022-11-23  1056  	return iomap_write_delalloc_release(inode, start_byte, end_byte,
9f88e3438e8c0e6 Dave Chinner 2022-11-23  1057  					punch);
dfd153a8d32fb12 Dave Chinner 2022-11-23  1058  }
dfd153a8d32fb12 Dave Chinner 2022-11-23  1059  EXPORT_SYMBOL_GPL(iomap_file_buffered_write_punch_delalloc);
dfd153a8d32fb12 Dave Chinner 2022-11-23  1060
diff mbox series

Patch

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 734b761a1e4a..29be510c3b15 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -832,6 +832,165 @@  iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
 }
 EXPORT_SYMBOL_GPL(iomap_file_buffered_write);
 
+/*
+ * Scan the data range passed to us for dirty page cache folios. If we find a
+ * dirty folio, punch out the preceeding range and update the offset from which
+ * the next punch will start from.
+ *
+ * We can punch out storage reservations under clean pages because they either
+ * contain data that has been written back - in which case the delalloc punch
+ * over that range is a no-op - or they have been read faults in which case they
+ * contain zeroes and we can remove the delalloc backing range and any new
+ * writes to those pages will do the normal hole filling operation...
+ *
+ * This makes the logic simple: we only need to keep the delalloc extents only
+ * over the dirty ranges of the page cache.
+ *
+ * This function uses [start_byte, end_byte) intervals (i.e. open ended) to
+ * simplify range iterations.
+ */
+static int iomap_write_delalloc_scan(struct inode *inode,
+		loff_t *punch_start_byte, loff_t start_byte, loff_t end_byte,
+		int (*punch)(struct inode *inode, loff_t offset, loff_t length))
+{
+	while (start_byte < end_byte) {
+		struct folio	*folio;
+
+		/* grab locked page */
+		folio = filemap_lock_folio(inode->i_mapping,
+				start_byte >> PAGE_SHIFT);
+		if (!folio) {
+			start_byte = ALIGN_DOWN(start_byte, PAGE_SIZE) +
+					PAGE_SIZE;
+			continue;
+		}
+
+		/* if dirty, punch up to offset */
+		if (folio_test_dirty(folio)) {
+			if (start_byte > *punch_start_byte) {
+				int	error;
+
+				error = punch(inode, *punch_start_byte,
+						start_byte - *punch_start_byte);
+				if (error) {
+					folio_unlock(folio);
+					folio_put(folio);
+					return error;
+				}
+			}
+
+			/*
+			 * Make sure the next punch start is correctly bound to
+			 * the end of this data range, not the end of the folio.
+			 */
+			*punch_start_byte = min_t(loff_t, end_byte,
+					folio_next_index(folio) << PAGE_SHIFT);
+		}
+
+		/* move offset to start of next folio in range */
+		start_byte = folio_next_index(folio) << PAGE_SHIFT;
+		folio_unlock(folio);
+		folio_put(folio);
+	}
+	return 0;
+}
+
+/*
+ * Punch out all the delalloc blocks in the range given except for those that
+ * have dirty data still pending in the page cache - those are going to be
+ * written and so must still retain the delalloc backing for writeback.
+ *
+ * As we are scanning the page cache for data, we don't need to reimplement the
+ * wheel - mapping_seek_hole_data() does exactly what we need to identify the
+ * start and end of data ranges correctly even for sub-folio block sizes. This
+ * byte range based iteration is especially convenient because it means we
+ * don't have to care about variable size folios, nor where the start or end of
+ * the data range lies within a folio, if they lie within the same folio or even
+ * if there are multiple discontiguous data ranges within the folio.
+ *
+ * It should be noted that mapping_seek_hole_data() is not aware of EOF, and so
+ * can return data ranges that exist in the cache beyond EOF. e.g. a page fault
+ * spanning EOF will initialise the post-EOF data to zeroes and mark it up to
+ * date. A write page fault can then mark it dirty. If we then fail a write()
+ * beyond EOF into that up to date cached range, we allocate a delalloc block
+ * beyond EOF and then have to punch it out. Because the range is up to date,
+ * mapping_seek_hole_data() will return it, and we will skip the punch because
+ * the folio is dirty. THis is incorrect - we always need to punch out delalloc
+ * beyond EOF in this case as writeback will never write back and covert that
+ * delalloc block beyond EOF. Hence we limit the cached data scan range to EOF,
+ * resulting in always punching out the range from the EOF to the end of the
+ * range the iomap spans.
+ *
+ * Intervals are of the form [start_byte, end_byte) (i.e. open ended) because it
+ * matches the intervals returned by mapping_seek_hole_data(). i.e. SEEK_DATA
+ * returns the start of a data range (start_byte), and SEEK_HOLE(start_byte)
+ * returns the end of the data range (data_end). Using closed intervals would
+ * require sprinkling this code with magic "+ 1" and "- 1" arithmetic and expose
+ * the code to subtle off-by-one bugs....
+ */
+static int iomap_write_delalloc_release(struct inode *inode,
+		loff_t start_byte, loff_t end_byte,
+		int (*punch)(struct inode *inode, loff_t pos, loff_t length))
+{
+	loff_t punch_start_byte = start_byte;
+	loff_t scan_end_byte = min(i_size_read(inode), end_byte);
+	int error = 0;
+
+	/*
+	 * Lock the mapping to avoid races with page faults re-instantiating
+	 * folios and dirtying them via ->page_mkwrite whilst we walk the
+	 * cache and perform delalloc extent removal. Failing to do this can
+	 * leave dirty pages with no space reservation in the cache.
+	 */
+	filemap_invalidate_lock(inode->i_mapping);
+	while (start_byte < scan_end_byte) {
+		loff_t		data_end;
+
+		start_byte = mapping_seek_hole_data(inode->i_mapping,
+				start_byte, scan_end_byte, SEEK_DATA);
+		/*
+		 * If there is no more data to scan, all that is left is to
+		 * punch out the remaining range.
+		 */
+		if (start_byte == -ENXIO || start_byte == scan_end_byte)
+			break;
+		if (start_byte < 0) {
+			error = start_byte;
+			goto out_unlock;
+		}
+		WARN_ON_ONCE(start_byte < punch_start_byte);
+		WARN_ON_ONCE(start_byte > scan_end_byte);
+
+		/*
+		 * We find the end of this contiguous cached data range by
+		 * seeking from start_byte to the beginning of the next hole.
+		 */
+		data_end = mapping_seek_hole_data(inode->i_mapping, start_byte,
+				scan_end_byte, SEEK_HOLE);
+		if (data_end < 0) {
+			error = data_end;
+			goto out_unlock;
+		}
+		WARN_ON_ONCE(data_end <= start_byte);
+		WARN_ON_ONCE(data_end > scan_end_byte);
+
+		error = iomap_write_delalloc_scan(inode, &punch_start_byte,
+				start_byte, data_end, punch);
+		if (error)
+			goto out_unlock;
+
+		/* The next data search starts at the end of this one. */
+		start_byte = data_end;
+	}
+
+	if (punch_start_byte < end_byte)
+		error = punch(inode, punch_start_byte,
+				end_byte - punch_start_byte);
+out_unlock:
+	filemap_invalidate_unlock(inode->i_mapping);
+	return error;
+}
+
 /*
  * When a short write occurs, the filesystem may need to remove reserved space
  * that was allocated in ->iomap_begin from it's ->iomap_end method. For
@@ -842,8 +1001,25 @@  EXPORT_SYMBOL_GPL(iomap_file_buffered_write);
  * allocated for this iomap.
  *
  * This function uses [start_byte, end_byte) intervals (i.e. open ended) to
- * simplify range iterations, but converts them back to {offset,len} tuples for
- * the punch callback.
+ * simplify range iterations.
+ *
+ * The punch() callback *must* only punch delalloc extents in the range passed
+ * to it. It must skip over all other types of extents in the range and leave
+ * them completely unchanged. It must do this punch atomically with respect to
+ * other extent modifications.
+ *
+ * The punch() callback may be called with a folio locked to prevent writeback
+ * extent allocation racing at the edge of the range we are currently punching.
+ * The locked folio may or may not cover the range being punched, so it is not
+ * safe for the punch() callback to lock folios itself.
+ *
+ * Lock order is:
+ *
+ * inode->i_rwsem (shared or exclusive)
+ *   inode->i_mapping->invalidate_lock (exclusive)
+ *     folio_lock()
+ *       ->punch
+ *         internal filesystem allocation lock
  */
 int iomap_file_buffered_write_punch_delalloc(struct inode *inode,
 		struct iomap *iomap, loff_t pos, loff_t length,
@@ -877,18 +1053,8 @@  int iomap_file_buffered_write_punch_delalloc(struct inode *inode,
 	if (start_byte >= end_byte)
 		return 0;
 
-	/*
-	 * Lock the mapping to avoid races with page faults re-instantiating
-	 * folios and dirtying them via ->page_mkwrite between the page cache
-	 * truncation and the delalloc extent removal. Failing to do this can
-	 * leave dirty pages with no space reservation in the cache.
-	 */
-	filemap_invalidate_lock(inode->i_mapping);
-	truncate_pagecache_range(inode, start_byte, end_byte - 1);
-	error = punch(inode, start_byte, end_byte - start_byte);
-	filemap_invalidate_unlock(inode->i_mapping);
-
-	return error;
+	return iomap_write_delalloc_release(inode, start_byte, end_byte,
+					punch);
 }
 EXPORT_SYMBOL_GPL(iomap_file_buffered_write_punch_delalloc);