Message ID | 1462033162-21837-1-git-send-email-guaneryu@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Hi, Eryu, Thanks for the great description of the problem! I have some comments below. Eryu Guan <guaneryu@gmail.com> writes: > Direct writes inside i_size on a DIO_SKIP_HOLES filesystem are not > allowed to allocate blocks(get_more_blocks() sets 'create' to 0 before > calling get_bkicl() callback), if it's a sparse file, direct writes fall > back to buffered writes to avoid stale data exposure from concurrent > buffered read. > > But the detection for "writing inside i_size" is not correct, writes can > be treated as "extending writes" wrongly, which results in block > allocation for holes and could expose stale data. This is because we're > checking on the fs blocks not the actual offset and i_size in bytes. > > For example, direct write 1FSB to a 1FSB(or smaller) sparse file on > ext2/3/4, starting from offset 0, in this case it's writing inside > i_size, but 'create' is non-zero, because 'sdio->block_in_file' and > '(i_size_read(dio->inode) >> sdio->blkbits' are both zero. > > This can be demonstrated by running ltp-aiodio test ADSP045 many times. > When testing on extN filesystems, I see test failures occasionally, > buffered read could read non-zero (stale) data. > > ADSP045: dio_sparse -a 4k -w 4k -s 2k -n 2 > > dio_sparse 0 TINFO : Dirtying free blocks > dio_sparse 0 TINFO : Starting I/O tests > non zero buffer at buf[0] => 0xffffffaa,ffffffaa,ffffffaa,ffffffaa > non-zero read at offset 0 > dio_sparse 0 TINFO : Killing childrens(s) > dio_sparse 1 TFAIL : dio_sparse.c:191: 1 children(s) exited abnormally OK, so in this case, block_in_file is 0, i_size_read(inode) is 2048, and i_blkbits is 12. > Fix it by checking on the actual offset and i_size in bytes, not the fs > blocks where offset and i_size are in, to make sure it's really writing > into the file. I think this code operates on blocks for a reason: we're trying to determine if we'll trigger block allocation, right? For example, consider a sparse file with i_size of 2k, and a write to offset 2k into the file, with a file system block size of 4k. Should that have create set or not? Ted or Jan, can you answer that question? > Also introduce some local variables to make the code > easier to read a little bit. Please don't do this. You're only making the change harder to review. Just submit the minimal fix. You can submit cleanups as a follow-on. Thanks, Jeff > Signed-off-by: Eryu Guan <guaneryu@gmail.com> > --- > fs/direct-io.c | 20 ++++++++++---------- > 1 file changed, 10 insertions(+), 10 deletions(-) > > diff --git a/fs/direct-io.c b/fs/direct-io.c > index 4720377..ca0c9bc 100644 > --- a/fs/direct-io.c > +++ b/fs/direct-io.c > @@ -607,9 +607,12 @@ static int get_more_blocks(struct dio *dio, struct dio_submit *sdio, > int ret; > sector_t fs_startblk; /* Into file, in filesystem-sized blocks */ > sector_t fs_endblk; /* Into file, in filesystem-sized blocks */ > + sector_t block_in_file = sdio->block_in_file; > unsigned long fs_count; /* Number of filesystem-sized blocks */ > int create; > - unsigned int i_blkbits = sdio->blkbits + sdio->blkfactor; > + unsigned int blkbits = sdio->blkbits; > + unsigned int blkfactor = sdio->blkfactor; > + unsigned int i_blkbits = blkbits + blkfactor; > > /* > * If there was a memory error and we've overwritten all the > @@ -617,10 +620,9 @@ static int get_more_blocks(struct dio *dio, struct dio_submit *sdio, > */ > ret = dio->page_errors; > if (ret == 0) { > - BUG_ON(sdio->block_in_file >= sdio->final_block_in_request); > - fs_startblk = sdio->block_in_file >> sdio->blkfactor; > - fs_endblk = (sdio->final_block_in_request - 1) >> > - sdio->blkfactor; > + BUG_ON(block_in_file >= sdio->final_block_in_request); > + fs_startblk = block_in_file >> blkfactor; > + fs_endblk = (sdio->final_block_in_request - 1) >> blkfactor; > fs_count = fs_endblk - fs_startblk + 1; > > map_bh->b_state = 0; > @@ -638,11 +640,9 @@ static int get_more_blocks(struct dio *dio, struct dio_submit *sdio, > * buffer head. > */ > create = dio->rw & WRITE; > - if (dio->flags & DIO_SKIP_HOLES) { > - if (sdio->block_in_file < (i_size_read(dio->inode) >> > - sdio->blkbits)) > - create = 0; > - } > + if ((dio->flags & DIO_SKIP_HOLES) && > + ((block_in_file << blkbits) < i_size_read(dio->inode))) > + create = 0; > > ret = (*sdio->get_block)(dio->inode, fs_startblk, > map_bh, create); -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, May 05, 2016 at 03:39:29PM -0400, Jeff Moyer wrote: > Hi, Eryu, > > Thanks for the great description of the problem! I have some comments > below. > > Eryu Guan <guaneryu@gmail.com> writes: > > > Direct writes inside i_size on a DIO_SKIP_HOLES filesystem are not > > allowed to allocate blocks(get_more_blocks() sets 'create' to 0 before > > calling get_bkicl() callback), if it's a sparse file, direct writes fall > > back to buffered writes to avoid stale data exposure from concurrent > > buffered read. > > > > But the detection for "writing inside i_size" is not correct, writes can > > be treated as "extending writes" wrongly, which results in block > > allocation for holes and could expose stale data. This is because we're > > checking on the fs blocks not the actual offset and i_size in bytes. > > > > For example, direct write 1FSB to a 1FSB(or smaller) sparse file on > > ext2/3/4, starting from offset 0, in this case it's writing inside > > i_size, but 'create' is non-zero, because 'sdio->block_in_file' and > > '(i_size_read(dio->inode) >> sdio->blkbits' are both zero. > > > > This can be demonstrated by running ltp-aiodio test ADSP045 many times. > > When testing on extN filesystems, I see test failures occasionally, > > buffered read could read non-zero (stale) data. > > > > ADSP045: dio_sparse -a 4k -w 4k -s 2k -n 2 > > > > dio_sparse 0 TINFO : Dirtying free blocks > > dio_sparse 0 TINFO : Starting I/O tests > > non zero buffer at buf[0] => 0xffffffaa,ffffffaa,ffffffaa,ffffffaa > > non-zero read at offset 0 > > dio_sparse 0 TINFO : Killing childrens(s) > > dio_sparse 1 TFAIL : dio_sparse.c:191: 1 children(s) exited abnormally > > OK, so in this case, block_in_file is 0, i_size_read(inode) is 2048, and > i_blkbits is 12. > > > Fix it by checking on the actual offset and i_size in bytes, not the fs > > blocks where offset and i_size are in, to make sure it's really writing > > into the file. > > I think this code operates on blocks for a reason: we're trying to > determine if we'll trigger block allocation, right? For example, > consider a sparse file with i_size of 2k, and a write to offset 2k into > the file, with a file system block size of 4k. Should that have create > set or not? Thanks for pointing this out! I think 'create' should be 0 in this case, my test failed in this case, with both 4.6-rc6 stock kernel and my patched kernel. I'm testing an updated patch now, hopefully it's doing the right thing. It's basiclly something like: if (offset < i_size) create = 0; else if ((block_in_file >> blkfactor) == (i_size >> (blkbits + blkfactor)) && (i_size & ((1 << (blkbits + blkfactor)) - 1))) create = 0; So in addition to the (offset < i_size) check, it also checks that block_in_file and i_size are falling into the same fs block && i_size is not fs block size aligned. > > Ted or Jan, can you answer that question? > > > Also introduce some local variables to make the code > > easier to read a little bit. > > Please don't do this. You're only making the change harder to review. > Just submit the minimal fix. You can submit cleanups as a follow-on. I think it's not a pure cleanup, it's needed as things like 'sdio->block_in_file' are referenced multiple times in the function, and they are making the lines too long to read/write. Maybe I should have made it clear in the first place. Thanks for the review! Eryu -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Eryu Guan <guaneryu@gmail.com> writes: > On Thu, May 05, 2016 at 03:39:29PM -0400, Jeff Moyer wrote: >> I think this code operates on blocks for a reason: we're trying to >> determine if we'll trigger block allocation, right? For example, >> consider a sparse file with i_size of 2k, and a write to offset 2k into >> the file, with a file system block size of 4k. Should that have create >> set or not? > > Thanks for pointing this out! I think 'create' should be 0 in this case, > my test failed in this case, with both 4.6-rc6 stock kernel and my > patched kernel. > > I'm testing an updated patch now, hopefully it's doing the right thing. > It's basiclly something like: > > if (offset < i_size) > create = 0; > else if ((block_in_file >> blkfactor) == (i_size >> (blkbits + blkfactor)) && > (i_size & ((1 << (blkbits + blkfactor)) - 1))) > create = 0; I think that can be simplified to a single check; something like: if (block_in_file < total_blocks_in_file) create = 0; >> > Also introduce some local variables to make the code >> > easier to read a little bit. >> >> Please don't do this. You're only making the change harder to review. >> Just submit the minimal fix. You can submit cleanups as a follow-on. > > I think it's not a pure cleanup, it's needed as things like > 'sdio->block_in_file' are referenced multiple times in the function, and > they are making the lines too long to read/write. Maybe I should have > made it clear in the first place. I still view that as a cleanup. If you had submitted the minimal patch, I would have to look at a couple lines of change. In code this tricky, I'd rather not have to stare at all the code movement to make sure you got that part right, too. But do what you feel is right, I'll review it either way. ;-) Thanks, Eryu! Jeff -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, May 06, 2016 at 10:13:39AM -0400, Jeff Moyer wrote: > Eryu Guan <guaneryu@gmail.com> writes: > > > On Thu, May 05, 2016 at 03:39:29PM -0400, Jeff Moyer wrote: > >> I think this code operates on blocks for a reason: we're trying to > >> determine if we'll trigger block allocation, right? For example, > >> consider a sparse file with i_size of 2k, and a write to offset 2k into > >> the file, with a file system block size of 4k. Should that have create > >> set or not? > > > > Thanks for pointing this out! I think 'create' should be 0 in this case, > > my test failed in this case, with both 4.6-rc6 stock kernel and my > > patched kernel. > > > > I'm testing an updated patch now, hopefully it's doing the right thing. > > It's basiclly something like: > > > > if (offset < i_size) > > create = 0; > > else if ((block_in_file >> blkfactor) == (i_size >> (blkbits + blkfactor)) && > > (i_size & ((1 << (blkbits + blkfactor)) - 1))) > > create = 0; > > I think that can be simplified to a single check; something like: > > if (block_in_file < total_blocks_in_file) > create = 0; I may miss something, but this doesn't seem right to me. Still take your example, on a 4k block size & 512 sector size filesystem xfs_io -f -c "truncate 2k" testfile xfs_io -d -c "pwrite 2k 2k" testfile block_in_file is 4 (dio block size is 512 in this case, 4 blocks for 2k size), total_blocks_in_file is 0, so 'create' is set, but it should be 0 > > >> > Also introduce some local variables to make the code > >> > easier to read a little bit. > >> > >> Please don't do this. You're only making the change harder to review. > >> Just submit the minimal fix. You can submit cleanups as a follow-on. > > > > I think it's not a pure cleanup, it's needed as things like > > 'sdio->block_in_file' are referenced multiple times in the function, and > > they are making the lines too long to read/write. Maybe I should have > > made it clear in the first place. > > I still view that as a cleanup. If you had submitted the minimal patch, > I would have to look at a couple lines of change. In code this tricky, > I'd rather not have to stare at all the code movement to make sure > you got that part right, too. > > But do what you feel is right, I'll review it either way. ;-) Thanks very much! I'll split it to two patches, first one is a cleanup, has no function change, second one is the real fix. This should make the review easier. Eryu -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Eryu Guan <guaneryu@gmail.com> writes: > On Fri, May 06, 2016 at 10:13:39AM -0400, Jeff Moyer wrote: >> Eryu Guan <guaneryu@gmail.com> writes: >> >> > On Thu, May 05, 2016 at 03:39:29PM -0400, Jeff Moyer wrote: >> >> I think this code operates on blocks for a reason: we're trying to >> >> determine if we'll trigger block allocation, right? For example, >> >> consider a sparse file with i_size of 2k, and a write to offset 2k into >> >> the file, with a file system block size of 4k. Should that have create >> >> set or not? >> > >> > Thanks for pointing this out! I think 'create' should be 0 in this case, >> > my test failed in this case, with both 4.6-rc6 stock kernel and my >> > patched kernel. >> > >> > I'm testing an updated patch now, hopefully it's doing the right thing. >> > It's basiclly something like: >> > >> > if (offset < i_size) >> > create = 0; >> > else if ((block_in_file >> blkfactor) == (i_size >> (blkbits + blkfactor)) && >> > (i_size & ((1 << (blkbits + blkfactor)) - 1))) >> > create = 0; >> >> I think that can be simplified to a single check; something like: >> >> if (block_in_file < total_blocks_in_file) >> create = 0; > > I may miss something, but this doesn't seem right to me. Still take your > example, on a 4k block size & 512 sector size filesystem ... where blocks are in file system block size units. So: if (fs_block_in_file < total_fs_blocks_in_file) > Thanks very much! I'll split it to two patches, first one is a cleanup, > has no function change, second one is the real fix. This should make the > review easier. Typically the mininmal fix goes first (for ease of backporting to stable), and then the cleanup. As I said, though, this isn't critical, I'll take a look. Thanks! Jeff -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon 09-05-16 09:55:26, Jeff Moyer wrote: > Eryu Guan <guaneryu@gmail.com> writes: > > > On Fri, May 06, 2016 at 10:13:39AM -0400, Jeff Moyer wrote: > >> Eryu Guan <guaneryu@gmail.com> writes: > >> > >> > On Thu, May 05, 2016 at 03:39:29PM -0400, Jeff Moyer wrote: > >> >> I think this code operates on blocks for a reason: we're trying to > >> >> determine if we'll trigger block allocation, right? For example, > >> >> consider a sparse file with i_size of 2k, and a write to offset 2k into > >> >> the file, with a file system block size of 4k. Should that have create > >> >> set or not? > >> > > >> > Thanks for pointing this out! I think 'create' should be 0 in this case, > >> > my test failed in this case, with both 4.6-rc6 stock kernel and my > >> > patched kernel. > >> > > >> > I'm testing an updated patch now, hopefully it's doing the right thing. > >> > It's basiclly something like: > >> > > >> > if (offset < i_size) > >> > create = 0; > >> > else if ((block_in_file >> blkfactor) == (i_size >> (blkbits + blkfactor)) && > >> > (i_size & ((1 << (blkbits + blkfactor)) - 1))) > >> > create = 0; > >> > >> I think that can be simplified to a single check; something like: > >> > >> if (block_in_file < total_blocks_in_file) > >> create = 0; > > > > I may miss something, but this doesn't seem right to me. Still take your > > example, on a 4k block size & 512 sector size filesystem > > ... where blocks are in file system block size units. So: > > if (fs_block_in_file < total_fs_blocks_in_file) Agreed. The test should be: if (fs_startblk < (i_size_read(dio->inode) >> (sdio->blkbits + sdio->blkfactor))) Sorry for not having a look earlier. Honza
diff --git a/fs/direct-io.c b/fs/direct-io.c index 4720377..ca0c9bc 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -607,9 +607,12 @@ static int get_more_blocks(struct dio *dio, struct dio_submit *sdio, int ret; sector_t fs_startblk; /* Into file, in filesystem-sized blocks */ sector_t fs_endblk; /* Into file, in filesystem-sized blocks */ + sector_t block_in_file = sdio->block_in_file; unsigned long fs_count; /* Number of filesystem-sized blocks */ int create; - unsigned int i_blkbits = sdio->blkbits + sdio->blkfactor; + unsigned int blkbits = sdio->blkbits; + unsigned int blkfactor = sdio->blkfactor; + unsigned int i_blkbits = blkbits + blkfactor; /* * If there was a memory error and we've overwritten all the @@ -617,10 +620,9 @@ static int get_more_blocks(struct dio *dio, struct dio_submit *sdio, */ ret = dio->page_errors; if (ret == 0) { - BUG_ON(sdio->block_in_file >= sdio->final_block_in_request); - fs_startblk = sdio->block_in_file >> sdio->blkfactor; - fs_endblk = (sdio->final_block_in_request - 1) >> - sdio->blkfactor; + BUG_ON(block_in_file >= sdio->final_block_in_request); + fs_startblk = block_in_file >> blkfactor; + fs_endblk = (sdio->final_block_in_request - 1) >> blkfactor; fs_count = fs_endblk - fs_startblk + 1; map_bh->b_state = 0; @@ -638,11 +640,9 @@ static int get_more_blocks(struct dio *dio, struct dio_submit *sdio, * buffer head. */ create = dio->rw & WRITE; - if (dio->flags & DIO_SKIP_HOLES) { - if (sdio->block_in_file < (i_size_read(dio->inode) >> - sdio->blkbits)) - create = 0; - } + if ((dio->flags & DIO_SKIP_HOLES) && + ((block_in_file << blkbits) < i_size_read(dio->inode))) + create = 0; ret = (*sdio->get_block)(dio->inode, fs_startblk, map_bh, create);
Direct writes inside i_size on a DIO_SKIP_HOLES filesystem are not allowed to allocate blocks(get_more_blocks() sets 'create' to 0 before calling get_bkicl() callback), if it's a sparse file, direct writes fall back to buffered writes to avoid stale data exposure from concurrent buffered read. But the detection for "writing inside i_size" is not correct, writes can be treated as "extending writes" wrongly, which results in block allocation for holes and could expose stale data. This is because we're checking on the fs blocks not the actual offset and i_size in bytes. For example, direct write 1FSB to a 1FSB(or smaller) sparse file on ext2/3/4, starting from offset 0, in this case it's writing inside i_size, but 'create' is non-zero, because 'sdio->block_in_file' and '(i_size_read(dio->inode) >> sdio->blkbits' are both zero. This can be demonstrated by running ltp-aiodio test ADSP045 many times. When testing on extN filesystems, I see test failures occasionally, buffered read could read non-zero (stale) data. ADSP045: dio_sparse -a 4k -w 4k -s 2k -n 2 dio_sparse 0 TINFO : Dirtying free blocks dio_sparse 0 TINFO : Starting I/O tests non zero buffer at buf[0] => 0xffffffaa,ffffffaa,ffffffaa,ffffffaa non-zero read at offset 0 dio_sparse 0 TINFO : Killing childrens(s) dio_sparse 1 TFAIL : dio_sparse.c:191: 1 children(s) exited abnormally Fix it by checking on the actual offset and i_size in bytes, not the fs blocks where offset and i_size are in, to make sure it's really writing into the file. Also introduce some local variables to make the code easier to read a little bit. Signed-off-by: Eryu Guan <guaneryu@gmail.com> --- fs/direct-io.c | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-)