diff mbox series

xfs: only remap the written blocks in xfs_reflink_end_cow_extent

Message ID 20231016152852.1021679-1-hch@lst.de (mailing list archive)
State Superseded
Headers show
Series xfs: only remap the written blocks in xfs_reflink_end_cow_extent | expand

Commit Message

Christoph Hellwig Oct. 16, 2023, 3:28 p.m. UTC
xfs_reflink_end_cow_extent looks up the COW extent and the data fork
extent at offset_fsb, and then proceeds to remap the common subset
between the two.

It does however not limit the remapped extent to the passed in
[*offset_fsbm end_fsb] range and thus potentially remaps more blocks than
the one handled by the current I/O completion.  This means that with
sufficiently large data and COW extents we could be remapping COW fork
mappings that have not been written to, leading to a stale data exposure
on a powerfail event.

We use to have a xfs_trim_range to make the remap fit the I/O completion
range, but that got (apparently accidentally) removed in commit
df2fd88f8ac7 ("xfs: rewrite xfs_reflink_end_cow to use intents").

Note that I've only found this by code inspection, and a test case would
probably require very specific delay and error injection.

Fixes: df2fd88f8ac7 ("xfs: rewrite xfs_reflink_end_cow to use intents")
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_reflink.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Darrick J. Wong Oct. 16, 2023, 3:48 p.m. UTC | #1
On Mon, Oct 16, 2023 at 05:28:52PM +0200, Christoph Hellwig wrote:
> xfs_reflink_end_cow_extent looks up the COW extent and the data fork
> extent at offset_fsb, and then proceeds to remap the common subset
> between the two.
> 
> It does however not limit the remapped extent to the passed in
> [*offset_fsbm end_fsb] range and thus potentially remaps more blocks than
> the one handled by the current I/O completion.  This means that with
> sufficiently large data and COW extents we could be remapping COW fork
> mappings that have not been written to, leading to a stale data exposure
> on a powerfail event.
> 
> We use to have a xfs_trim_range to make the remap fit the I/O completion
> range, but that got (apparently accidentally) removed in commit
> df2fd88f8ac7 ("xfs: rewrite xfs_reflink_end_cow to use intents").
> 
> Note that I've only found this by code inspection, and a test case would
> probably require very specific delay and error injection.

Hmm.  xfs_prepare_ioend converts unwritten cowfork extents to written
prior to submit_bio.  So I guess you'd have to trick writeback into
issuing totally separate bios for the single mapping.  Then you'd have
to delay the bio for the higher offset part of the mapping while
allowing the bio for the lower part to complete, at which point it would
convey the entire mapping to the data fork.  Then you'd have to convince
the kernel to reread the contents from disk.  I think that would be hard
since the folios for the incomplete writeback are still uptodate and
marked for writeback.  directio will block trying to flush and
invalidate the cache, and buffered io will read the pagecache.

So I don't think there's an actual exposure vector here, but others can
chime in about whatever I've missed. ;)

> Fixes: df2fd88f8ac7 ("xfs: rewrite xfs_reflink_end_cow to use intents")
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Yep, that was a goof, thanks for catching that.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_reflink.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index eb9102453affbf..0611af06771589 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -784,6 +784,7 @@ xfs_reflink_end_cow_extent(
>  		}
>  	}
>  	del = got;
> +	xfs_trim_extent(&del, *offset_fsb, end_fsb - *offset_fsb);
>  
>  	/* Grab the corresponding mapping in the data fork. */
>  	nmaps = 1;
> -- 
> 2.39.2
>
Christoph Hellwig Oct. 16, 2023, 4:10 p.m. UTC | #2
On Mon, Oct 16, 2023 at 08:48:27AM -0700, Darrick J. Wong wrote:
> Hmm.  xfs_prepare_ioend converts unwritten cowfork extents to written
> prior to submit_bio.  So I guess you'd have to trick writeback into
> issuing totally separate bios for the single mapping.

Yes.  Hitting IOEND_BATCH_SIZE seems like the least difficult one
to hit, but even that would require work.

> Then you'd have
> to delay the bio for the higher offset part of the mapping while
> allowing the bio for the lower part to complete, at which point it would
> convey the entire mapping to the data fork.

Shouldn't really matter which side is faster.

> Then you'd have to convince
> the kernel to reread the contents from disk.  I think that would be hard
> since the folios for the incomplete writeback are still uptodate and
> marked for writeback.  directio will block trying to flush and
> invalidate the cache, and buffered io will read the pagecache.

I don't think on a live kernel it is possible.  But if one of the
two bios completed before the other one, and power failed just inbetween.
Darrick J. Wong Oct. 17, 2023, 12:56 a.m. UTC | #3
On Mon, Oct 16, 2023 at 06:10:19PM +0200, Christoph Hellwig wrote:
> On Mon, Oct 16, 2023 at 08:48:27AM -0700, Darrick J. Wong wrote:
> > Hmm.  xfs_prepare_ioend converts unwritten cowfork extents to written
> > prior to submit_bio.  So I guess you'd have to trick writeback into
> > issuing totally separate bios for the single mapping.
> 
> Yes.  Hitting IOEND_BATCH_SIZE seems like the least difficult one
> to hit, but even that would require work.
> 
> > Then you'd have
> > to delay the bio for the higher offset part of the mapping while
> > allowing the bio for the lower part to complete, at which point it would
> > convey the entire mapping to the data fork.
> 
> Shouldn't really matter which side is faster.
> 
> > Then you'd have to convince
> > the kernel to reread the contents from disk.  I think that would be hard
> > since the folios for the incomplete writeback are still uptodate and
> > marked for writeback.  directio will block trying to flush and
> > invalidate the cache, and buffered io will read the pagecache.
> 
> I don't think on a live kernel it is possible.  But if one of the
> two bios completed before the other one, and power failed just inbetween.

Ooooh, yeah.  That could happen if the ioend metadata update gets
written to the log device and the system crashes before that other bio
even gets a chance to execute.

--D
diff mbox series

Patch

diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index eb9102453affbf..0611af06771589 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -784,6 +784,7 @@  xfs_reflink_end_cow_extent(
 		}
 	}
 	del = got;
+	xfs_trim_extent(&del, *offset_fsb, end_fsb - *offset_fsb);
 
 	/* Grab the corresponding mapping in the data fork. */
 	nmaps = 1;