diff mbox

direct-io: fix stale data exposure from concurrent buffered read

Message ID 1462033162-21837-1-git-send-email-guaneryu@gmail.com (mailing list archive)
State New, archived
Headers show

Commit Message

Eryu Guan April 30, 2016, 4:19 p.m. UTC
Direct writes inside i_size on a DIO_SKIP_HOLES filesystem are not
allowed to allocate blocks(get_more_blocks() sets 'create' to 0 before
calling get_bkicl() callback), if it's a sparse file, direct writes fall
back to buffered writes to avoid stale data exposure from concurrent
buffered read.

But the detection for "writing inside i_size" is not correct, writes can
be treated as "extending writes" wrongly, which results in block
allocation for holes and could expose stale data. This is because we're
checking on the fs blocks not the actual offset and i_size in bytes.

For example, direct write 1FSB to a 1FSB(or smaller) sparse file on
ext2/3/4, starting from offset 0, in this case it's writing inside
i_size, but 'create' is non-zero, because 'sdio->block_in_file' and
'(i_size_read(dio->inode) >> sdio->blkbits' are both zero.

This can be demonstrated by running ltp-aiodio test ADSP045 many times.
When testing on extN filesystems, I see test failures occasionally,
buffered read could read non-zero (stale) data.

ADSP045: dio_sparse -a 4k -w 4k -s 2k -n 2

dio_sparse    0  TINFO  :  Dirtying free blocks
dio_sparse    0  TINFO  :  Starting I/O tests
non zero buffer at buf[0] => 0xffffffaa,ffffffaa,ffffffaa,ffffffaa
non-zero read at offset 0
dio_sparse    0  TINFO  :  Killing childrens(s)
dio_sparse    1  TFAIL  :  dio_sparse.c:191: 1 children(s) exited abnormally

Fix it by checking on the actual offset and i_size in bytes, not the fs
blocks where offset and i_size are in, to make sure it's really writing
into the file. Also introduce some local variables to make the code
easier to read a little bit.

Signed-off-by: Eryu Guan <guaneryu@gmail.com>
---
 fs/direct-io.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

Comments

Jeff Moyer May 5, 2016, 7:39 p.m. UTC | #1
Hi, Eryu,

Thanks for the great description of the problem!  I have some comments
below.

Eryu Guan <guaneryu@gmail.com> writes:

> Direct writes inside i_size on a DIO_SKIP_HOLES filesystem are not
> allowed to allocate blocks(get_more_blocks() sets 'create' to 0 before
> calling get_bkicl() callback), if it's a sparse file, direct writes fall
> back to buffered writes to avoid stale data exposure from concurrent
> buffered read.
>
> But the detection for "writing inside i_size" is not correct, writes can
> be treated as "extending writes" wrongly, which results in block
> allocation for holes and could expose stale data. This is because we're
> checking on the fs blocks not the actual offset and i_size in bytes.
>
> For example, direct write 1FSB to a 1FSB(or smaller) sparse file on
> ext2/3/4, starting from offset 0, in this case it's writing inside
> i_size, but 'create' is non-zero, because 'sdio->block_in_file' and
> '(i_size_read(dio->inode) >> sdio->blkbits' are both zero.
>
> This can be demonstrated by running ltp-aiodio test ADSP045 many times.
> When testing on extN filesystems, I see test failures occasionally,
> buffered read could read non-zero (stale) data.
>
> ADSP045: dio_sparse -a 4k -w 4k -s 2k -n 2
>
> dio_sparse    0  TINFO  :  Dirtying free blocks
> dio_sparse    0  TINFO  :  Starting I/O tests
> non zero buffer at buf[0] => 0xffffffaa,ffffffaa,ffffffaa,ffffffaa
> non-zero read at offset 0
> dio_sparse    0  TINFO  :  Killing childrens(s)
> dio_sparse    1  TFAIL  :  dio_sparse.c:191: 1 children(s) exited abnormally

OK, so in this case, block_in_file is 0, i_size_read(inode) is 2048, and
i_blkbits is 12.

> Fix it by checking on the actual offset and i_size in bytes, not the fs
> blocks where offset and i_size are in, to make sure it's really writing
> into the file.

I think this code operates on blocks for a reason: we're trying to
determine if we'll trigger block allocation, right?  For example,
consider a sparse file with i_size of 2k, and a write to offset 2k into
the file, with a file system block size of 4k.  Should that have create
set or not?

Ted or Jan, can you answer that question?

> Also introduce some local variables to make the code
> easier to read a little bit.

Please don't do this.  You're only making the change harder to review.
Just submit the minimal fix.  You can submit cleanups as a follow-on.

Thanks,
Jeff

> Signed-off-by: Eryu Guan <guaneryu@gmail.com>
> ---
>  fs/direct-io.c | 20 ++++++++++----------
>  1 file changed, 10 insertions(+), 10 deletions(-)
>
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index 4720377..ca0c9bc 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -607,9 +607,12 @@ static int get_more_blocks(struct dio *dio, struct dio_submit *sdio,
>  	int ret;
>  	sector_t fs_startblk;	/* Into file, in filesystem-sized blocks */
>  	sector_t fs_endblk;	/* Into file, in filesystem-sized blocks */
> +	sector_t block_in_file = sdio->block_in_file;
>  	unsigned long fs_count;	/* Number of filesystem-sized blocks */
>  	int create;
> -	unsigned int i_blkbits = sdio->blkbits + sdio->blkfactor;
> +	unsigned int blkbits = sdio->blkbits;
> +	unsigned int blkfactor = sdio->blkfactor;
> +	unsigned int i_blkbits = blkbits + blkfactor;
>  
>  	/*
>  	 * If there was a memory error and we've overwritten all the
> @@ -617,10 +620,9 @@ static int get_more_blocks(struct dio *dio, struct dio_submit *sdio,
>  	 */
>  	ret = dio->page_errors;
>  	if (ret == 0) {
> -		BUG_ON(sdio->block_in_file >= sdio->final_block_in_request);
> -		fs_startblk = sdio->block_in_file >> sdio->blkfactor;
> -		fs_endblk = (sdio->final_block_in_request - 1) >>
> -					sdio->blkfactor;
> +		BUG_ON(block_in_file >= sdio->final_block_in_request);
> +		fs_startblk = block_in_file >> blkfactor;
> +		fs_endblk = (sdio->final_block_in_request - 1) >> blkfactor;
>  		fs_count = fs_endblk - fs_startblk + 1;
>  
>  		map_bh->b_state = 0;
> @@ -638,11 +640,9 @@ static int get_more_blocks(struct dio *dio, struct dio_submit *sdio,
>  		 * buffer head.
>  		 */
>  		create = dio->rw & WRITE;
> -		if (dio->flags & DIO_SKIP_HOLES) {
> -			if (sdio->block_in_file < (i_size_read(dio->inode) >>
> -							sdio->blkbits))
> -				create = 0;
> -		}
> +		if ((dio->flags & DIO_SKIP_HOLES) &&
> +		    ((block_in_file << blkbits) < i_size_read(dio->inode)))
> +			create = 0;
>  
>  		ret = (*sdio->get_block)(dio->inode, fs_startblk,
>  						map_bh, create);
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eryu Guan May 6, 2016, 10:46 a.m. UTC | #2
On Thu, May 05, 2016 at 03:39:29PM -0400, Jeff Moyer wrote:
> Hi, Eryu,
> 
> Thanks for the great description of the problem!  I have some comments
> below.
> 
> Eryu Guan <guaneryu@gmail.com> writes:
> 
> > Direct writes inside i_size on a DIO_SKIP_HOLES filesystem are not
> > allowed to allocate blocks(get_more_blocks() sets 'create' to 0 before
> > calling get_bkicl() callback), if it's a sparse file, direct writes fall
> > back to buffered writes to avoid stale data exposure from concurrent
> > buffered read.
> >
> > But the detection for "writing inside i_size" is not correct, writes can
> > be treated as "extending writes" wrongly, which results in block
> > allocation for holes and could expose stale data. This is because we're
> > checking on the fs blocks not the actual offset and i_size in bytes.
> >
> > For example, direct write 1FSB to a 1FSB(or smaller) sparse file on
> > ext2/3/4, starting from offset 0, in this case it's writing inside
> > i_size, but 'create' is non-zero, because 'sdio->block_in_file' and
> > '(i_size_read(dio->inode) >> sdio->blkbits' are both zero.
> >
> > This can be demonstrated by running ltp-aiodio test ADSP045 many times.
> > When testing on extN filesystems, I see test failures occasionally,
> > buffered read could read non-zero (stale) data.
> >
> > ADSP045: dio_sparse -a 4k -w 4k -s 2k -n 2
> >
> > dio_sparse    0  TINFO  :  Dirtying free blocks
> > dio_sparse    0  TINFO  :  Starting I/O tests
> > non zero buffer at buf[0] => 0xffffffaa,ffffffaa,ffffffaa,ffffffaa
> > non-zero read at offset 0
> > dio_sparse    0  TINFO  :  Killing childrens(s)
> > dio_sparse    1  TFAIL  :  dio_sparse.c:191: 1 children(s) exited abnormally
> 
> OK, so in this case, block_in_file is 0, i_size_read(inode) is 2048, and
> i_blkbits is 12.
> 
> > Fix it by checking on the actual offset and i_size in bytes, not the fs
> > blocks where offset and i_size are in, to make sure it's really writing
> > into the file.
> 
> I think this code operates on blocks for a reason: we're trying to
> determine if we'll trigger block allocation, right?  For example,
> consider a sparse file with i_size of 2k, and a write to offset 2k into
> the file, with a file system block size of 4k.  Should that have create
> set or not?

Thanks for pointing this out! I think 'create' should be 0 in this case,
my test failed in this case, with both 4.6-rc6 stock kernel and my
patched kernel.

I'm testing an updated patch now, hopefully it's doing the right thing.
It's basiclly something like:

if (offset < i_size)
	create = 0;
else if ((block_in_file >> blkfactor) == (i_size >> (blkbits + blkfactor)) &&
	 (i_size & ((1 << (blkbits + blkfactor)) - 1)))
	create = 0;

So in addition to the (offset < i_size) check, it also checks that
block_in_file and i_size are falling into the same fs block && i_size is
not fs block size aligned.

> 
> Ted or Jan, can you answer that question?
> 
> > Also introduce some local variables to make the code
> > easier to read a little bit.
> 
> Please don't do this.  You're only making the change harder to review.
> Just submit the minimal fix.  You can submit cleanups as a follow-on.

I think it's not a pure cleanup, it's needed as things like
'sdio->block_in_file' are referenced multiple times in the function, and
they are making the lines too long to read/write. Maybe I should have
made it clear in the first place.

Thanks for the review!

Eryu
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jeff Moyer May 6, 2016, 2:13 p.m. UTC | #3
Eryu Guan <guaneryu@gmail.com> writes:

> On Thu, May 05, 2016 at 03:39:29PM -0400, Jeff Moyer wrote:
>> I think this code operates on blocks for a reason: we're trying to
>> determine if we'll trigger block allocation, right?  For example,
>> consider a sparse file with i_size of 2k, and a write to offset 2k into
>> the file, with a file system block size of 4k.  Should that have create
>> set or not?
>
> Thanks for pointing this out! I think 'create' should be 0 in this case,
> my test failed in this case, with both 4.6-rc6 stock kernel and my
> patched kernel.
>
> I'm testing an updated patch now, hopefully it's doing the right thing.
> It's basiclly something like:
>
> if (offset < i_size)
> 	create = 0;
> else if ((block_in_file >> blkfactor) == (i_size >> (blkbits + blkfactor)) &&
> 	 (i_size & ((1 << (blkbits + blkfactor)) - 1)))
> 	create = 0;

I think that can be simplified to a single check;  something like:

	if (block_in_file < total_blocks_in_file)
		create = 0;

>> > Also introduce some local variables to make the code
>> > easier to read a little bit.
>> 
>> Please don't do this.  You're only making the change harder to review.
>> Just submit the minimal fix.  You can submit cleanups as a follow-on.
>
> I think it's not a pure cleanup, it's needed as things like
> 'sdio->block_in_file' are referenced multiple times in the function, and
> they are making the lines too long to read/write. Maybe I should have
> made it clear in the first place.

I still view that as a cleanup.  If you had submitted the minimal patch,
I would have to look at a couple lines of change.  In code this tricky,
I'd rather not have to stare at all the code movement to make sure
you got that part right, too.

But do what you feel is right, I'll review it either way.  ;-)

Thanks, Eryu!
Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eryu Guan May 7, 2016, 3:04 a.m. UTC | #4
On Fri, May 06, 2016 at 10:13:39AM -0400, Jeff Moyer wrote:
> Eryu Guan <guaneryu@gmail.com> writes:
> 
> > On Thu, May 05, 2016 at 03:39:29PM -0400, Jeff Moyer wrote:
> >> I think this code operates on blocks for a reason: we're trying to
> >> determine if we'll trigger block allocation, right?  For example,
> >> consider a sparse file with i_size of 2k, and a write to offset 2k into
> >> the file, with a file system block size of 4k.  Should that have create
> >> set or not?
> >
> > Thanks for pointing this out! I think 'create' should be 0 in this case,
> > my test failed in this case, with both 4.6-rc6 stock kernel and my
> > patched kernel.
> >
> > I'm testing an updated patch now, hopefully it's doing the right thing.
> > It's basiclly something like:
> >
> > if (offset < i_size)
> > 	create = 0;
> > else if ((block_in_file >> blkfactor) == (i_size >> (blkbits + blkfactor)) &&
> > 	 (i_size & ((1 << (blkbits + blkfactor)) - 1)))
> > 	create = 0;
> 
> I think that can be simplified to a single check;  something like:
> 
> 	if (block_in_file < total_blocks_in_file)
> 		create = 0;

I may miss something, but this doesn't seem right to me. Still take your
example, on a 4k block size & 512 sector size filesystem

	xfs_io -f -c "truncate 2k" testfile
	xfs_io -d -c "pwrite 2k 2k" testfile

block_in_file is 4 (dio block size is 512 in this case, 4 blocks for 2k
size), total_blocks_in_file is 0, so 'create' is set, but it should be 0

> 
> >> > Also introduce some local variables to make the code
> >> > easier to read a little bit.
> >> 
> >> Please don't do this.  You're only making the change harder to review.
> >> Just submit the minimal fix.  You can submit cleanups as a follow-on.
> >
> > I think it's not a pure cleanup, it's needed as things like
> > 'sdio->block_in_file' are referenced multiple times in the function, and
> > they are making the lines too long to read/write. Maybe I should have
> > made it clear in the first place.
> 
> I still view that as a cleanup.  If you had submitted the minimal patch,
> I would have to look at a couple lines of change.  In code this tricky,
> I'd rather not have to stare at all the code movement to make sure
> you got that part right, too.
> 
> But do what you feel is right, I'll review it either way.  ;-)

Thanks very much! I'll split it to two patches, first one is a cleanup,
has no function change, second one is the real fix. This should make the
review easier.

Eryu
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jeff Moyer May 9, 2016, 1:55 p.m. UTC | #5
Eryu Guan <guaneryu@gmail.com> writes:

> On Fri, May 06, 2016 at 10:13:39AM -0400, Jeff Moyer wrote:
>> Eryu Guan <guaneryu@gmail.com> writes:
>> 
>> > On Thu, May 05, 2016 at 03:39:29PM -0400, Jeff Moyer wrote:
>> >> I think this code operates on blocks for a reason: we're trying to
>> >> determine if we'll trigger block allocation, right?  For example,
>> >> consider a sparse file with i_size of 2k, and a write to offset 2k into
>> >> the file, with a file system block size of 4k.  Should that have create
>> >> set or not?
>> >
>> > Thanks for pointing this out! I think 'create' should be 0 in this case,
>> > my test failed in this case, with both 4.6-rc6 stock kernel and my
>> > patched kernel.
>> >
>> > I'm testing an updated patch now, hopefully it's doing the right thing.
>> > It's basiclly something like:
>> >
>> > if (offset < i_size)
>> > 	create = 0;
>> > else if ((block_in_file >> blkfactor) == (i_size >> (blkbits + blkfactor)) &&
>> > 	 (i_size & ((1 << (blkbits + blkfactor)) - 1)))
>> > 	create = 0;
>> 
>> I think that can be simplified to a single check;  something like:
>> 
>> 	if (block_in_file < total_blocks_in_file)
>> 		create = 0;
>
> I may miss something, but this doesn't seem right to me. Still take your
> example, on a 4k block size & 512 sector size filesystem

... where blocks are in file system block size units.  So:

if (fs_block_in_file < total_fs_blocks_in_file)


> Thanks very much! I'll split it to two patches, first one is a cleanup,
> has no function change, second one is the real fix. This should make the
> review easier.

Typically the mininmal fix goes first (for ease of backporting to
stable), and then the cleanup.  As I said, though, this isn't critical,
I'll take a look.

Thanks!
Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara May 11, 2016, 4:57 p.m. UTC | #6
On Mon 09-05-16 09:55:26, Jeff Moyer wrote:
> Eryu Guan <guaneryu@gmail.com> writes:
> 
> > On Fri, May 06, 2016 at 10:13:39AM -0400, Jeff Moyer wrote:
> >> Eryu Guan <guaneryu@gmail.com> writes:
> >> 
> >> > On Thu, May 05, 2016 at 03:39:29PM -0400, Jeff Moyer wrote:
> >> >> I think this code operates on blocks for a reason: we're trying to
> >> >> determine if we'll trigger block allocation, right?  For example,
> >> >> consider a sparse file with i_size of 2k, and a write to offset 2k into
> >> >> the file, with a file system block size of 4k.  Should that have create
> >> >> set or not?
> >> >
> >> > Thanks for pointing this out! I think 'create' should be 0 in this case,
> >> > my test failed in this case, with both 4.6-rc6 stock kernel and my
> >> > patched kernel.
> >> >
> >> > I'm testing an updated patch now, hopefully it's doing the right thing.
> >> > It's basiclly something like:
> >> >
> >> > if (offset < i_size)
> >> > 	create = 0;
> >> > else if ((block_in_file >> blkfactor) == (i_size >> (blkbits + blkfactor)) &&
> >> > 	 (i_size & ((1 << (blkbits + blkfactor)) - 1)))
> >> > 	create = 0;
> >> 
> >> I think that can be simplified to a single check;  something like:
> >> 
> >> 	if (block_in_file < total_blocks_in_file)
> >> 		create = 0;
> >
> > I may miss something, but this doesn't seem right to me. Still take your
> > example, on a 4k block size & 512 sector size filesystem
> 
> ... where blocks are in file system block size units.  So:
> 
> if (fs_block_in_file < total_fs_blocks_in_file)

Agreed. The test should be:

	if (fs_startblk < (i_size_read(dio->inode) >>
					(sdio->blkbits + sdio->blkfactor)))

Sorry for not having a look earlier.

								Honza
diff mbox

Patch

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 4720377..ca0c9bc 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -607,9 +607,12 @@  static int get_more_blocks(struct dio *dio, struct dio_submit *sdio,
 	int ret;
 	sector_t fs_startblk;	/* Into file, in filesystem-sized blocks */
 	sector_t fs_endblk;	/* Into file, in filesystem-sized blocks */
+	sector_t block_in_file = sdio->block_in_file;
 	unsigned long fs_count;	/* Number of filesystem-sized blocks */
 	int create;
-	unsigned int i_blkbits = sdio->blkbits + sdio->blkfactor;
+	unsigned int blkbits = sdio->blkbits;
+	unsigned int blkfactor = sdio->blkfactor;
+	unsigned int i_blkbits = blkbits + blkfactor;
 
 	/*
 	 * If there was a memory error and we've overwritten all the
@@ -617,10 +620,9 @@  static int get_more_blocks(struct dio *dio, struct dio_submit *sdio,
 	 */
 	ret = dio->page_errors;
 	if (ret == 0) {
-		BUG_ON(sdio->block_in_file >= sdio->final_block_in_request);
-		fs_startblk = sdio->block_in_file >> sdio->blkfactor;
-		fs_endblk = (sdio->final_block_in_request - 1) >>
-					sdio->blkfactor;
+		BUG_ON(block_in_file >= sdio->final_block_in_request);
+		fs_startblk = block_in_file >> blkfactor;
+		fs_endblk = (sdio->final_block_in_request - 1) >> blkfactor;
 		fs_count = fs_endblk - fs_startblk + 1;
 
 		map_bh->b_state = 0;
@@ -638,11 +640,9 @@  static int get_more_blocks(struct dio *dio, struct dio_submit *sdio,
 		 * buffer head.
 		 */
 		create = dio->rw & WRITE;
-		if (dio->flags & DIO_SKIP_HOLES) {
-			if (sdio->block_in_file < (i_size_read(dio->inode) >>
-							sdio->blkbits))
-				create = 0;
-		}
+		if ((dio->flags & DIO_SKIP_HOLES) &&
+		    ((block_in_file << blkbits) < i_size_read(dio->inode)))
+			create = 0;
 
 		ret = (*sdio->get_block)(dio->inode, fs_startblk,
 						map_bh, create);