Message ID | c9d7a03ee9730e1d864cb6fbe2d511dd8899a953.1715798440.git.fdmanana@suse.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | btrfs: fix a bug in the direct IO write path for COW writes | expand |
在 2024/5/16 04:21, fdmanana@kernel.org 写道: > From: Filipe Manana <fdmanana@suse.com> > > If a write path in COW mode fails, either before submitting a bio for the > new extents or an actual IO error happens, we can end up allowing a fast > fsync to log file extent items that point to unwritten extents. > > This is because the ordered extent completion for a failed write is > executed in a work queue. This means that once the write path unlocks the > inode, a fast fsync can come and log the extent maps created by the write > attempt before the work queue completes the ordered extent. > > For example consider a direct IO write, in COW mode, that fails at > btrfs_dio_submit_io() because btrfs_extract_ordered_extent() returned an > error: > > 1) We call btrfs_finish_ordered_extent() with the 'uptodate' parameter > set to false, meaning an error happened; > > 2) That results in marking the ordered extent with the BTRFS_ORDERED_IOERR > flag; > > 3) btrfs_finish_ordered_extent() queues the completion of the ordered > extent - so that btrfs_finish_one_ordered() will be executed later in > a work queue. That function will drop extents maps in the range when > it's executed, since the extent maps point to unwritten locations > (signaled by the BTRFS_ORDERED_IOERR flag); > > 4) After calling btrfs_finish_ordered_extent() we keep going down the > write path and unlock the inode; > > 5) After that a fast fsync starts and locks the inode; > > 6) Before the work queue executes btrfs_finish_one_ordered(), the fsync > task sees the extent maps that point to the unwritten locations and > logs file extent items based on them - it does not know they are > unwritten, and the fast fsync path does not wait for ordered extents > to complete in order to reduce latency. > > So to fix this make btrfs_finish_ordered_extent() drop the extent maps > in the range if an error happened for a COW write. > > Note that this issues of using extent maps that point to unwritten > locations can not happen for reads, because in read paths we start by > locking the extent range and wait for any ordered extents in the range > to complete before looking for extent maps. > > Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Thanks for the detailed explanation on the issue. Thanks, Qu > --- > fs/btrfs/ordered-data.c | 27 +++++++++++++++++++++++++++ > 1 file changed, 27 insertions(+) > > diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c > index 304d94f6d29b..3a3f21da6eb7 100644 > --- a/fs/btrfs/ordered-data.c > +++ b/fs/btrfs/ordered-data.c > @@ -388,6 +388,33 @@ bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered, > ret = can_finish_ordered_extent(ordered, page, file_offset, len, uptodate); > spin_unlock_irqrestore(&inode->ordered_tree_lock, flags); > > + /* > + * If this is a COW write it means we created new extent maps for the > + * range and they point to an unwritten location if we got an error > + * either before submitting a bio or during IO. > + * > + * We have marked the ordered extent with BTRFS_ORDERED_IOERR, and we > + * are queuing its completion below. During completion, at > + * btrfs_finish_one_ordered(), we will drop the extent maps for the > + * unwritten extents. > + * > + * However because completion runs in a work queue we can end up > + * unlocking the inode before the ordered extent is completed. > + * > + * That means that a fast fsync can happen before the work queue > + * executes the completion of the ordered extent, and in that case > + * the fsync will use the extent maps that point to unwritten extents, > + * resulting in logging file extent items that point to unwritten > + * locations. Unlike read paths, a fast fsync doesn't wait for ordered > + * extent completion before proceeding (intentional to reduce latency). > + * > + * To be safe drop the new extent maps in the range (if are doing COW) > + * right here before we unlock the inode and allow a fsync to run. > + */ > + if (!uptodate && !test_bit(BTRFS_ORDERED_NOCOW, &ordered->flags)) > + btrfs_drop_extent_map_range(inode, file_offset, > + file_offset + len - 1, false); > + > if (ret) > btrfs_queue_ordered_fn(ordered); > return ret;
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c index 304d94f6d29b..3a3f21da6eb7 100644 --- a/fs/btrfs/ordered-data.c +++ b/fs/btrfs/ordered-data.c @@ -388,6 +388,33 @@ bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered, ret = can_finish_ordered_extent(ordered, page, file_offset, len, uptodate); spin_unlock_irqrestore(&inode->ordered_tree_lock, flags); + /* + * If this is a COW write it means we created new extent maps for the + * range and they point to an unwritten location if we got an error + * either before submitting a bio or during IO. + * + * We have marked the ordered extent with BTRFS_ORDERED_IOERR, and we + * are queuing its completion below. During completion, at + * btrfs_finish_one_ordered(), we will drop the extent maps for the + * unwritten extents. + * + * However because completion runs in a work queue we can end up + * unlocking the inode before the ordered extent is completed. + * + * That means that a fast fsync can happen before the work queue + * executes the completion of the ordered extent, and in that case + * the fsync will use the extent maps that point to unwritten extents, + * resulting in logging file extent items that point to unwritten + * locations. Unlike read paths, a fast fsync doesn't wait for ordered + * extent completion before proceeding (intentional to reduce latency). + * + * To be safe drop the new extent maps in the range (if are doing COW) + * right here before we unlock the inode and allow a fsync to run. + */ + if (!uptodate && !test_bit(BTRFS_ORDERED_NOCOW, &ordered->flags)) + btrfs_drop_extent_map_range(inode, file_offset, + file_offset + len - 1, false); + if (ret) btrfs_queue_ordered_fn(ordered); return ret;