diff mbox series

[1/2] iomap: don't skip reading in !uptodate folios when unsharing a range

Message ID 169507872536.772278.18183365318216726644.stgit@frogsfrogsfrogs (mailing list archive)
State New, archived
Headers show
Series iomap: fix unshare data corruption bug | expand

Commit Message

Darrick J. Wong Sept. 18, 2023, 11:12 p.m. UTC
From: Darrick J. Wong <djwong@kernel.org>

Prior to commit a01b8f225248e, we would always read in the contents of a
!uptodate folio prior to writing userspace data into the folio,
allocated a folio state object, etc.  Ritesh introduced an optimization
that skips all of that if the write would cover the entire folio.

Unfortunately, the optimization misses the unshare case, where we always
have to read in the folio contents since there isn't a data buffer
supplied by userspace.  This can result in stale kernel memory exposure
if userspace issues a FALLOC_FL_UNSHARE_RANGE call on part of a shared
file that isn't already cached.

This was caught by observing fstests regressions in the "unshare around"
mechanism that is used for unaligned writes to a reflinked realtime
volume when the realtime extent size is larger than 1FSB, though I think
it applies to any shared file.

Cc: ritesh.list@gmail.com, willy@infradead.org
Fixes: a01b8f225248e ("iomap: Allocate ifs in ->write_begin() early")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/iomap/buffered-io.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Comments

Ritesh Harjani (IBM) Sept. 19, 2023, 4:42 a.m. UTC | #1
"Darrick J. Wong" <djwong@kernel.org> writes:

> From: Darrick J. Wong <djwong@kernel.org>
>
> Prior to commit a01b8f225248e, we would always read in the contents of a
> !uptodate folio prior to writing userspace data into the folio,
> allocated a folio state object, etc.  Ritesh introduced an optimization
> that skips all of that if the write would cover the entire folio.
>
> Unfortunately, the optimization misses the unshare case, where we always
> have to read in the folio contents since there isn't a data buffer
> supplied by userspace.  This can result in stale kernel memory exposure
> if userspace issues a FALLOC_FL_UNSHARE_RANGE call on part of a shared
> file that isn't already cached.
>
> This was caught by observing fstests regressions in the "unshare around"
> mechanism that is used for unaligned writes to a reflinked realtime
> volume when the realtime extent size is larger than 1FSB, though I think
> it applies to any shared file.
>
> Cc: ritesh.list@gmail.com, willy@infradead.org
> Fixes: a01b8f225248e ("iomap: Allocate ifs in ->write_begin() early")
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/iomap/buffered-io.c |    6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)

Thanks for catching this case. Fix for this looks good to me. 
I have verified on my setup. w/o this patch it indeed can cause
corruption in the unshare case, since we don't read the disk contents
and we might end up writing garbage from the page cache.

Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>


>
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index ae8673ce08b1..0350830fc989 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -640,11 +640,13 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
>  	size_t poff, plen;
>  
>  	/*
> -	 * If the write completely overlaps the current folio, then
> +	 * If the write or zeroing completely overlaps the current folio, then
>  	 * entire folio will be dirtied so there is no need for
>  	 * per-block state tracking structures to be attached to this folio.
> +	 * For the unshare case, we must read in the ondisk contents because we
> +	 * are not changing pagecache contents.
>  	 */
> -	if (pos <= folio_pos(folio) &&
> +	if (!(iter->flags & IOMAP_UNSHARE) && pos <= folio_pos(folio) &&
>  	    pos + len >= folio_pos(folio) + folio_size(folio))
>  		return 0;
>
Ritesh Harjani (IBM) Sept. 19, 2023, 5:14 a.m. UTC | #2
"Darrick J. Wong" <djwong@kernel.org> writes:

> From: Darrick J. Wong <djwong@kernel.org>
>
> Prior to commit a01b8f225248e, we would always read in the contents of a
> !uptodate folio prior to writing userspace data into the folio,
> allocated a folio state object, etc.  Ritesh introduced an optimization
> that skips all of that if the write would cover the entire folio.
>
> Unfortunately, the optimization misses the unshare case, where we always
> have to read in the folio contents since there isn't a data buffer
> supplied by userspace.  This can result in stale kernel memory exposure
> if userspace issues a FALLOC_FL_UNSHARE_RANGE call on part of a shared
> file that isn't already cached.
>
> This was caught by observing fstests regressions in the "unshare around"
> mechanism that is used for unaligned writes to a reflinked realtime
> volume when the realtime extent size is larger than 1FSB,

I was wondering what is testcase that you are referring here to? 
Can you please tell the testcase no. and the mkfs / mount config options
which I can use to observe the regression please?

> though I think it applies to any shared file.
>
> Cc: ritesh.list@gmail.com, willy@infradead.org
> Fixes: a01b8f225248e ("iomap: Allocate ifs in ->write_begin() early")
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/iomap/buffered-io.c |    6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index ae8673ce08b1..0350830fc989 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -640,11 +640,13 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
>  	size_t poff, plen;
>  
>  	/*
> -	 * If the write completely overlaps the current folio, then
> +	 * If the write or zeroing completely overlaps the current folio, then
>  	 * entire folio will be dirtied so there is no need for
>  	 * per-block state tracking structures to be attached to this folio.
> +	 * For the unshare case, we must read in the ondisk contents because we
> +	 * are not changing pagecache contents.
>  	 */
> -	if (pos <= folio_pos(folio) &&
> +	if (!(iter->flags & IOMAP_UNSHARE) && pos <= folio_pos(folio) &&
>  	    pos + len >= folio_pos(folio) + folio_size(folio))
>  		return 0;
>
Darrick J. Wong Sept. 19, 2023, 5:24 a.m. UTC | #3
On Tue, Sep 19, 2023 at 10:44:58AM +0530, Ritesh Harjani wrote:
> "Darrick J. Wong" <djwong@kernel.org> writes:
> 
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Prior to commit a01b8f225248e, we would always read in the contents of a
> > !uptodate folio prior to writing userspace data into the folio,
> > allocated a folio state object, etc.  Ritesh introduced an optimization
> > that skips all of that if the write would cover the entire folio.
> >
> > Unfortunately, the optimization misses the unshare case, where we always
> > have to read in the folio contents since there isn't a data buffer
> > supplied by userspace.  This can result in stale kernel memory exposure
> > if userspace issues a FALLOC_FL_UNSHARE_RANGE call on part of a shared
> > file that isn't already cached.
> >
> > This was caught by observing fstests regressions in the "unshare around"
> > mechanism that is used for unaligned writes to a reflinked realtime
> > volume when the realtime extent size is larger than 1FSB,
> 
> I was wondering what is testcase that you are referring here to? 
> Can you please tell the testcase no. and the mkfs / mount config options
> which I can use to observe the regression please?

https://lore.kernel.org/linux-fsdevel/169507871947.772278.5767091361086740046.stgit@frogsfrogsfrogs/T/#m8081f74f4f1fcb862399aa1544be082aabe56765

(any xfs config with reflink enabled)

--D

> > though I think it applies to any shared file.
> >
> > Cc: ritesh.list@gmail.com, willy@infradead.org
> > Fixes: a01b8f225248e ("iomap: Allocate ifs in ->write_begin() early")
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/iomap/buffered-io.c |    6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >
> >
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index ae8673ce08b1..0350830fc989 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -640,11 +640,13 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
> >  	size_t poff, plen;
> >  
> >  	/*
> > -	 * If the write completely overlaps the current folio, then
> > +	 * If the write or zeroing completely overlaps the current folio, then
> >  	 * entire folio will be dirtied so there is no need for
> >  	 * per-block state tracking structures to be attached to this folio.
> > +	 * For the unshare case, we must read in the ondisk contents because we
> > +	 * are not changing pagecache contents.
> >  	 */
> > -	if (pos <= folio_pos(folio) &&
> > +	if (!(iter->flags & IOMAP_UNSHARE) && pos <= folio_pos(folio) &&
> >  	    pos + len >= folio_pos(folio) + folio_size(folio))
> >  		return 0;
> >
Darrick J. Wong Sept. 19, 2023, 5:32 a.m. UTC | #4
On Mon, Sep 18, 2023 at 10:24:34PM -0700, Darrick J. Wong wrote:
> On Tue, Sep 19, 2023 at 10:44:58AM +0530, Ritesh Harjani wrote:
> > "Darrick J. Wong" <djwong@kernel.org> writes:
> > 
> > > From: Darrick J. Wong <djwong@kernel.org>
> > >
> > > Prior to commit a01b8f225248e, we would always read in the contents of a
> > > !uptodate folio prior to writing userspace data into the folio,
> > > allocated a folio state object, etc.  Ritesh introduced an optimization
> > > that skips all of that if the write would cover the entire folio.
> > >
> > > Unfortunately, the optimization misses the unshare case, where we always
> > > have to read in the folio contents since there isn't a data buffer
> > > supplied by userspace.  This can result in stale kernel memory exposure
> > > if userspace issues a FALLOC_FL_UNSHARE_RANGE call on part of a shared
> > > file that isn't already cached.
> > >
> > > This was caught by observing fstests regressions in the "unshare around"
> > > mechanism that is used for unaligned writes to a reflinked realtime
> > > volume when the realtime extent size is larger than 1FSB,
> > 
> > I was wondering what is testcase that you are referring here to? 
> > Can you please tell the testcase no. and the mkfs / mount config options
> > which I can use to observe the regression please?
> 
> https://lore.kernel.org/linux-fsdevel/169507871947.772278.5767091361086740046.stgit@frogsfrogsfrogs/T/#m8081f74f4f1fcb862399aa1544be082aabe56765
> 
> (any xfs config with reflink enabled)

*OH* you meant which testcase in the realtime reflink patchset.

This testcase:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/commit/tests/xfs/1919?h=djwong-wtf&id=56538e8882ac52e606882cfcab7e46dcb64d2a62

And this tag:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/tag/?h=realtime-reflink-extsize_2023-09-12
If you rebase this branch against 6.6-rc1.

Then you need this xfsprogs:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/tag/?h=realtime-reflink-extsize_2023-09-12

and ... MKFS_OPTIONS='-d rtinherit=1, -n parent=1, -r extsize=28k,rtgroups=1'
along with a SCRATCH_RTDE.

I'm basically done porting djwong-dev to 6.6 and will likely have an
initial patchbomb of more online fsck stuff for 6.7 in a few days.

--D

> --D
> 
> > > though I think it applies to any shared file.
> > >
> > > Cc: ritesh.list@gmail.com, willy@infradead.org
> > > Fixes: a01b8f225248e ("iomap: Allocate ifs in ->write_begin() early")
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > >  fs/iomap/buffered-io.c |    6 ++++--
> > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > >
> > >
> > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > index ae8673ce08b1..0350830fc989 100644
> > > --- a/fs/iomap/buffered-io.c
> > > +++ b/fs/iomap/buffered-io.c
> > > @@ -640,11 +640,13 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
> > >  	size_t poff, plen;
> > >  
> > >  	/*
> > > -	 * If the write completely overlaps the current folio, then
> > > +	 * If the write or zeroing completely overlaps the current folio, then
> > >  	 * entire folio will be dirtied so there is no need for
> > >  	 * per-block state tracking structures to be attached to this folio.
> > > +	 * For the unshare case, we must read in the ondisk contents because we
> > > +	 * are not changing pagecache contents.
> > >  	 */
> > > -	if (pos <= folio_pos(folio) &&
> > > +	if (!(iter->flags & IOMAP_UNSHARE) && pos <= folio_pos(folio) &&
> > >  	    pos + len >= folio_pos(folio) + folio_size(folio))
> > >  		return 0;
> > >
Ritesh Harjani (IBM) Sept. 19, 2023, 9:24 a.m. UTC | #5
Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:

> "Darrick J. Wong" <djwong@kernel.org> writes:
>
>> From: Darrick J. Wong <djwong@kernel.org>
>>
>> Prior to commit a01b8f225248e, we would always read in the contents of a
>> !uptodate folio prior to writing userspace data into the folio,
>> allocated a folio state object, etc.  Ritesh introduced an optimization
>> that skips all of that if the write would cover the entire folio.
>>
>> Unfortunately, the optimization misses the unshare case, where we always
>> have to read in the folio contents since there isn't a data buffer
>> supplied by userspace.  This can result in stale kernel memory exposure
>> if userspace issues a FALLOC_FL_UNSHARE_RANGE call on part of a shared
>> file that isn't already cached.
>>
>> This was caught by observing fstests regressions in the "unshare around"
>> mechanism that is used for unaligned writes to a reflinked realtime
>> volume when the realtime extent size is larger than 1FSB, though I think
>> it applies to any shared file.
>>
>> Cc: ritesh.list@gmail.com, willy@infradead.org
>> Fixes: a01b8f225248e ("iomap: Allocate ifs in ->write_begin() early")
>> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
>> ---
>>  fs/iomap/buffered-io.c |    6 ++++--
>>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> Thanks for catching this case. Fix for this looks good to me. 
> I have verified on my setup. w/o this patch it indeed can cause
> corruption in the unshare case, since we don't read the disk contents
> and we might end up writing garbage from the page cache.

To add more info to my above review. iomap_write_begin() is used by 
1. iomap_write_iter()
2. iomap_zero_iter()
3. iomap_unshare_iter()

And looks like out of the 3, iomap_unshare_iter() is the only one which
will not write anything to the folio in the foliocache, & we
definitely need to read the extent in folio cache in iomap_write_begin()
for unsharing.

Hence I believe iomap_unshare_iter() should be the only path to be
fixed, which this patch does by checking IOMAP_UNSHARE flag in
__iomap_write_begin().

>
> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>
>

-ritesh
diff mbox series

Patch

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index ae8673ce08b1..0350830fc989 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -640,11 +640,13 @@  static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 	size_t poff, plen;
 
 	/*
-	 * If the write completely overlaps the current folio, then
+	 * If the write or zeroing completely overlaps the current folio, then
 	 * entire folio will be dirtied so there is no need for
 	 * per-block state tracking structures to be attached to this folio.
+	 * For the unshare case, we must read in the ondisk contents because we
+	 * are not changing pagecache contents.
 	 */
-	if (pos <= folio_pos(folio) &&
+	if (!(iter->flags & IOMAP_UNSHARE) && pos <= folio_pos(folio) &&
 	    pos + len >= folio_pos(folio) + folio_size(folio))
 		return 0;