| Message ID | 20231026140832.1089824-1-kernel@pankajraghav.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | iomap: fix iomap_dio_zero() for fs bs > system page size |
On Thu, Oct 26, 2023 at 04:08:32PM +0200, Pankaj Raghav wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
> < fs block size. iomap_dio_zero() has an implicit assumption that fs
> block size < page_size. This is true for most filesystems at the moment.
>
> If the block size > page size (Large block sizes)[1], this will send the
> contents of the page next to the zero page (as len > PAGE_SIZE) to the
> underlying block device, causing FS corruption.
>
> iomap is a generic infrastructure and it should not make any assumptions
> about the fs block size and the page size of the system.
>
> Fixes: db074436f421 ("iomap: move the direct IO code into a separate file")
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
>
> [1] https://lore.kernel.org/lkml/20230915183848.1018717-1-kernel@pankajraghav.com/
> ---
> I had initially planned on sending this as part of the LBS patches, but
> I think this can go as a standalone patch as it is a generic fix to
> iomap.
>
> @Dave Chinner: This fixes the corruption issue you were seeing in
> generic/091 for bs=64k in [2].
>
> [2] https://lore.kernel.org/lkml/ZQfbHloBUpDh+zCg@dread.disaster.area/
>
>  fs/iomap/direct-io.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index bcd3f8cf5ea4..04f6c5548136 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -239,14 +239,23 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
>  	struct page *page = ZERO_PAGE(0);
>  	struct bio *bio;
>
> -	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> +	WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));

How can that happen here? Max fsb size will be 64kB for the
foreseeable future, the bio can hold 256 pages so it will have a
minimum size capability of 1MB.

FWIW, as a general observation, I think this is the wrong place to
be checking that a filesystem block is larger than can fit in a
single bio. There are going to be problems all over the place if we
can't do fsb sized IO in a single bio. i.e. I think this sort of
validation should be performed during filesystem mount, not
sporadically checked with WARN_ON() checks in random places in the
IO path...

> +
> +	bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
> +				  REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
>  	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
>  				  GFP_KERNEL);
> +
>  	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
>  	bio->bi_private = dio;
>  	bio->bi_end_io = iomap_dio_bio_end_io;
>
> -	__bio_add_page(bio, page, len, 0);
> +	while (len) {
> +		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
> +
> +		__bio_add_page(bio, page, io_len, 0);
> +		len -= io_len;
> +	}
>  	iomap_dio_submit_bio(iter, dio, bio, pos);

/me wonders if we should have a set of ZERO_FOLIO()s that contain a
folio of each possible size. Then we just pull the ZERO_FOLIO of the
correct size and use __bio_add_folio(). i.e. no need for looping over
the bio to repeatedly add the ZERO_PAGE, etc., and the code is then
identical for all cases of page size vs FSB size.

-Dave.
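To make the corruption mechanism concrete: pre-patch, a fs block larger
than a page became a single bio_vec whose length exceeds the one page
actually backing it. A sketch with illustrative values (4K PAGE_SIZE,
64K fs block) — this is not kernel source, just what the old call
effectively built:

	/* Pre-patch, one 64K zeroing request became one oversized vector: */
	struct bio_vec bad = {
		.bv_page   = ZERO_PAGE(0),	/* one 4K page... */
		.bv_len    = SZ_64K,		/* ...claimed to span 64K */
		.bv_offset = 0,
	};
	/*
	 * The block layer DMAs 64K starting at the zero page's physical
	 * address. Only the first 4K is guaranteed zero; the remaining
	 * 60K is whatever kernel memory follows empty_zero_page, and
	 * that is what reached the disk — the FS corruption the commit
	 * message describes.
	 */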
On Thu, Oct 26, 2023 at 04:08:32PM +0200, Pankaj Raghav wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
> < fs block size. iomap_dio_zero() has an implicit assumption that fs
> block size < page_size. This is true for most filesystems at the moment.
>
> If the block size > page size (Large block sizes)[1], this will send the
> contents of the page next to the zero page (as len > PAGE_SIZE) to the
> underlying block device, causing FS corruption.
>
> iomap is a generic infrastructure and it should not make any assumptions
> about the fs block size and the page size of the system.
>
> Fixes: db074436f421 ("iomap: move the direct IO code into a separate file")

Nice!

> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
>
> [1] https://lore.kernel.org/lkml/20230915183848.1018717-1-kernel@pankajraghav.com/

This URL just needs to go above the Fixes tag.

  Luis
>
> -	__bio_add_page(bio, page, len, 0);
> +	while (len) {
> +		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
> +
> +		__bio_add_page(bio, page, io_len, 0);
> +		len -= io_len;
> +	}

Maybe out of self-interest, but shouldn't we replace ZERO_PAGE with a
sufficiently larger ZERO_FOLIO? Right now I have a case where I have to
have a zero padding of up to MAX_PAGECACHE_ORDER minus block size, so
having a MAX_PAGECACHE_ORDER folio would have been really helpful for
me, but I suspect there are many other such cases as well.
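What Christoph is describing could be as simple as one pre-zeroed folio
of the maximum pagecache order, reserved at boot so it can never fail at
IO time. A minimal sketch — the names zero_folio and zero_folio_init are
made up for illustration; no such object exists in the tree at this
point:

	#include <linux/gfp.h>
	#include <linux/pagemap.h>	/* MAX_PAGECACHE_ORDER */

	/* Hypothetical: one statically reserved zero folio, big enough
	 * to pad any fs block size the page cache can support. */
	static struct folio *zero_folio;

	static int __init zero_folio_init(void)
	{
		zero_folio = folio_alloc(GFP_KERNEL | __GFP_ZERO,
					 MAX_PAGECACHE_ORDER);
		return zero_folio ? 0 : -ENOMEM;
	}
	early_initcall(zero_folio_init);

With something like that in place, iomap_dio_zero() could add its
padding with a single bio_add_folio() call regardless of block size,
since len could never exceed folio_size(zero_folio).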
>> -	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
>> +	WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));
>
> How can that happen here? Max fsb size will be 64kB for the
> foreseeable future, the bio can hold 256 pages so it will have a
> minimum size capability of 1MB.
>

I just added it as a pathological check. This will not trigger anytime
in the near future.

> FWIW, as a general observation, I think this is the wrong place to
> be checking that a filesystem block is larger than can fit in a
> single bio. There are going to be problems all over the place if we
> can't do fsb sized IO in a single bio. i.e. I think this sort of
> validation should be performed during filesystem mount, not
> sporadically checked with WARN_ON() checks in random places in the
> IO path...
>

I agree that it should be a check at a higher level.

As iomap can be used by any filesystem, shouldn't the check on the FSB
limitation be a part of iomap? I don't see any explicit document/comment
that states the iomap limitations on FSB, etc.

>>
>> -	__bio_add_page(bio, page, len, 0);
>> +	while (len) {
>> +		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
>> +
>> +		__bio_add_page(bio, page, io_len, 0);
>> +		len -= io_len;
>> +	}
>>  	iomap_dio_submit_bio(iter, dio, bio, pos);
>
> /me wonders if we should have a set of ZERO_FOLIO()s that contain a
> folio of each possible size. Then we just pull the ZERO_FOLIO of the
> correct size and use __bio_add_folio(). i.e. no need for looping over
> the bio to repeatedly add the ZERO_PAGE, etc., and the code is then
> identical for all cases of page size vs FSB size.
>

I was exactly looking for ZERO_FOLIOs. Instead of having a ZERO_PAGE,
can we reserve a ZERO_HUGE_PAGE so that it can be used directly by
bio_add_folio_nofail()?
On 27/10/2023 07:18, Christoph Hellwig wrote:
>>
>> -	__bio_add_page(bio, page, len, 0);
>> +	while (len) {
>> +		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
>> +
>> +		__bio_add_page(bio, page, io_len, 0);
>> +		len -= io_len;
>> +	}
>
> Maybe out of self-interest, but shouldn't we replace ZERO_PAGE with a
> sufficiently larger ZERO_FOLIO? Right now I have a case where I have
> to have a zero padding of up to MAX_PAGECACHE_ORDER minus block size,
> so having a MAX_PAGECACHE_ORDER folio would have been really helpful
> for me, but I suspect there are many other such cases as well.

I think that would definitely be useful. I also noticed this pattern in
fscrypt_zeroout_range_inline_crypt(). Probably there are more places
which could use a ZERO_FOLIO directly instead of iterating with
ZERO_PAGE.

Chinner also had a similar comment. It would be nice if we can reserve
a zero huge page that is the size of MAX_PAGECACHE_ORDER and add it as
one folio to the bio.
On Fri, Oct 27, 2023 at 10:03:15AM +0200, Pankaj Raghav wrote:
> I also noticed this pattern in fscrypt_zeroout_range_inline_crypt().
> Probably there are more places which could use a ZERO_FOLIO directly
> instead of iterating with ZERO_PAGE.
>
> Chinner also had a similar comment. It would be nice if we can reserve
> a zero huge page that is the size of MAX_PAGECACHE_ORDER and add it as
> one folio to the bio.

i'm on holiday atm.  start looking at mm_get_huge_zero_page()
On 27/10/2023 12:47, Matthew Wilcox wrote:
> On Fri, Oct 27, 2023 at 10:03:15AM +0200, Pankaj Raghav wrote:
>> I also noticed this pattern in fscrypt_zeroout_range_inline_crypt().
>> Probably there are more places which could use a ZERO_FOLIO directly
>> instead of iterating with ZERO_PAGE.
>>
>> Chinner also had a similar comment. It would be nice if we can reserve
>> a zero huge page that is the size of MAX_PAGECACHE_ORDER and add it as
>> one folio to the bio.
>
> i'm on holiday atm.  start looking at mm_get_huge_zero_page()

Thanks for the pointer. I made a rough version of how it might look if
I use that API:

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index bcd3f8cf5ea4..6ae21bd16dbe 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -236,17 +236,43 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
 		loff_t pos, unsigned len)
 {
 	struct inode *inode = file_inode(dio->iocb->ki_filp);
-	struct page *page = ZERO_PAGE(0);
+	struct page *page = NULL;
+	bool huge_page = false;
+	bool fallback = false;
 	struct bio *bio;
 
-	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
+	if (len > PAGE_SIZE) {
+		page = mm_get_huge_zero_page(current->mm);
+		if (likely(page))
+			huge_page = true;
+	}
+
+	if (!huge_page)
+		page = ZERO_PAGE(0);
+
+	fallback = ((len > PAGE_SIZE) && !huge_page);
+
+	bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
+				  REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
 	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
 				  GFP_KERNEL);
+
 	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
-	__bio_add_page(bio, page, len, 0);
+	if (!fallback) {
+		bio_add_folio_nofail(bio, page_folio(page), len, 0);
+	} else {
+		while (len) {
+			unsigned int io_len =
+				min_t(unsigned int, len, PAGE_SIZE);
+
+			__bio_add_page(bio, page, io_len, 0);
+			len -= io_len;
+		}
+	}
+
 	iomap_dio_submit_bio(iter, dio, bio, pos);
 }

The only issue with mm_get_huge_zero_page() is that it can fail, and we
need to fall back to iteration if it cannot allocate a huge folio.

PS: Enjoy your holidays :)
On Fri, Oct 27, 2023 at 09:53:19AM +0200, Pankaj Raghav wrote:
> >> -	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> >> +	WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));
> >
> > How can that happen here? Max fsb size will be 64kB for the
> > foreseeable future, the bio can hold 256 pages so it will have a
> > minimum size capability of 1MB.
> >
>
> I just added it as a pathological check. This will not trigger anytime
> in the near future.

Yeah, exactly my point - it should never happen, it's a fundamental
developer stuff-up bug, not a runtime bug, and so shouldn't be checked
at runtime on every bio....

> > FWIW, as a general observation, I think this is the wrong place to
> > be checking that a filesystem block is larger than can fit in a
> > single bio. There are going to be problems all over the place if we
> > can't do fsb sized IO in a single bio. i.e. I think this sort of
> > validation should be performed during filesystem mount, not
> > sporadically checked with WARN_ON() checks in random places in the
> > IO path...
> >
>
> I agree that it should be a check at a higher level.
>
> As iomap can be used by any filesystem, shouldn't the check on the FSB
> limitation be a part of iomap? I don't see any explicit document/comment
> that states the iomap limitations on FSB, etc.

No, it should be part of the filesystem that *uses the bio
infrastructure*. We already do that with the page cache - filesystems
check at mount time that the FSB is <= PAGE_SIZE and reject the mount
if this check fails, because the page cache simply cannot handle that
situation.

This is no different: if we are going to allow FSB > PAGE_SIZE, then
we need to ensure somewhere, even as a BUILD_BUG_ON(), that the max
supported FSB the filesystem has is smaller than what we can pack in a
bio.

-Dave.
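The kind of mount-time gate Dave is describing might look like the
sketch below. The helper name is illustrative, not an existing kernel
interface; the constants are real:

	#include <linux/bio.h>
	#include <linux/sizes.h>

	/* Illustrative only: reject, at mount time, any fs block size
	 * that a single bio of page-sized vecs cannot cover. */
	static int example_validate_blocksize(unsigned int blocksize)
	{
		/* Dave's BUILD_BUG_ON() flavour: assuming the fs caps
		 * its block size at 64K, prove at compile time that a
		 * block always fits in one bio. */
		BUILD_BUG_ON(SZ_64K > BIO_MAX_VECS * PAGE_SIZE);

		if (blocksize > BIO_MAX_VECS * PAGE_SIZE)
			return -EINVAL;		/* fail the mount */
		return 0;
	}

This mirrors the existing convention where filesystems reject
FSB > PAGE_SIZE at mount because the page cache cannot handle it; the
IO path then never needs a WARN_ON().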
On Thu, Oct 26, 2023 at 04:08:32PM +0200, Pankaj Raghav wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
> < fs block size. iomap_dio_zero() has an implicit assumption that fs
> block size < page_size. This is true for most filesystems at the moment.
>
> If the block size > page size (Large block sizes)[1], this will send the
> contents of the page next to the zero page (as len > PAGE_SIZE) to the
> underlying block device, causing FS corruption.
>
> iomap is a generic infrastructure and it should not make any assumptions
> about the fs block size and the page size of the system.
>
> Fixes: db074436f421 ("iomap: move the direct IO code into a separate file")

I forgot to mention - this Fixes tag is completely bogus. That's just
a commit that moves the code from one file to another; there's no
actual functional change at all.

Further, this isn't a change that "fixes" a bug or regression - it is
a change to support new functionality that doesn't yet exist upstream,
and so there is no bug in the existing kernels for it to fix....

-Dave.
On Fri, Oct 27, 2023 at 05:41:10PM +0200, Pankaj Raghav wrote:
> On 27/10/2023 12:47, Matthew Wilcox wrote:
> > On Fri, Oct 27, 2023 at 10:03:15AM +0200, Pankaj Raghav wrote:
> >> I also noticed this pattern in fscrypt_zeroout_range_inline_crypt().
> >> Probably there are more places which could use a ZERO_FOLIO directly
> >> instead of iterating with ZERO_PAGE.
> >>
> >> Chinner also had a similar comment. It would be nice if we can reserve
> >> a zero huge page that is the size of MAX_PAGECACHE_ORDER and add it as
> >> one folio to the bio.
> >
> > i'm on holiday atm.  start looking at mm_get_huge_zero_page()
>
> Thanks for the pointer. I made a rough version of how it might look if
> I use that API:

useful thing to do.  i think this shows we need a new folio api wrapping
it.  happy to do that when i'm back, or you can have a crack at it.

your point about it possibly failing is correct.  so i think we need an
api which definitely returns a folio, but it might be of arbitrary
order.

> 	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
> 	bio->bi_private = dio;
> 	bio->bi_end_io = iomap_dio_bio_end_io;
>
> -	__bio_add_page(bio, page, len, 0);
> +	if (!fallback) {
> +		bio_add_folio_nofail(bio, page_folio(page), len, 0);
> +	} else {
> +		while (len) {
> +			unsigned int io_len =
> +				min_t(unsigned int, len, PAGE_SIZE);
> +
> +			__bio_add_page(bio, page, io_len, 0);
> +			len -= io_len;
> +		}
> +	}
> +
> 	iomap_dio_submit_bio(iter, dio, bio, pos);

then this can look something like:

	while (len) {
		size_t size = min(len, folio_size(folio));

		__bio_add_folio(bio, folio, size, 0);
		len -= size;
	}

> PS: Enjoy your holidays :)

cheers ;-)
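Putting Willy's two suggestions together, the eventual shape might be
something like this sketch. get_zero_folio() is an assumed name for the
"always returns a folio, possibly of lower order" API he describes — it
does not exist at this point:

	/* Assumed helper: never fails, falling back to the order-0
	 * zero page wrapped as a folio when no huge zero page is
	 * available. */
	static struct folio *get_zero_folio(void)
	{
		struct page *page = mm_get_huge_zero_page(current->mm);

		return page ? page_folio(page) : page_folio(ZERO_PAGE(0));
	}

	static void zero_pad(struct bio *bio, size_t len)
	{
		struct folio *folio = get_zero_folio();

		/* Willy's loop: a single iteration when the folio covers
		 * the whole fs block, multiple 4K adds otherwise. */
		while (len) {
			size_t size = min(len, folio_size(folio));

			bio_add_folio_nofail(bio, folio, size, 0);
			len -= size;
		}
	}

Either way the caller no longer branches on whether the huge zero page
allocation succeeded, which is the fallback branching Hannes objects to
below.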
On 10/27/23 17:41, Pankaj Raghav wrote:
> On 27/10/2023 12:47, Matthew Wilcox wrote:
>> On Fri, Oct 27, 2023 at 10:03:15AM +0200, Pankaj Raghav wrote:
>>> I also noticed this pattern in fscrypt_zeroout_range_inline_crypt().
>>> Probably there are more places which could use a ZERO_FOLIO directly
>>> instead of iterating with ZERO_PAGE.
>>>
>>> Chinner also had a similar comment. It would be nice if we can reserve
>>> a zero huge page that is the size of MAX_PAGECACHE_ORDER and add it as
>>> one folio to the bio.
>>
>> i'm on holiday atm.  start looking at mm_get_huge_zero_page()
>
> Thanks for the pointer. I made a rough version of how it might look if
> I use that API:
>
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index bcd3f8cf5ea4..6ae21bd16dbe 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -236,17 +236,43 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
>  		loff_t pos, unsigned len)
>  {
>  	struct inode *inode = file_inode(dio->iocb->ki_filp);
> -	struct page *page = ZERO_PAGE(0);
> +	struct page *page = NULL;
> +	bool huge_page = false;
> +	bool fallback = false;
>  	struct bio *bio;
>
> -	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> +	if (len > PAGE_SIZE) {
> +		page = mm_get_huge_zero_page(current->mm);
> +		if (likely(page))
> +			huge_page = true;
> +	}
> +
> +	if (!huge_page)
> +		page = ZERO_PAGE(0);
> +
> +	fallback = ((len > PAGE_SIZE) && !huge_page);
> +

That is pointless. Bios can handle pages larger than PAGE_SIZE.

> +	bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
> +				  REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);

Similar here. Just allocate space for a single folio.

>  	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
>  				  GFP_KERNEL);
> +
>  	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
>  	bio->bi_private = dio;
>  	bio->bi_end_io = iomap_dio_bio_end_io;
>
> -	__bio_add_page(bio, page, len, 0);
> +	if (!fallback) {
> +		bio_add_folio_nofail(bio, page_folio(page), len, 0);
> +	} else {
> +		while (len) {
> +			unsigned int io_len =
> +				min_t(unsigned int, len, PAGE_SIZE);
> +
> +			__bio_add_page(bio, page, io_len, 0);
> +			len -= io_len;
> +		}
> +	}

See above. Kill the 'fallback' case.

Cheers,

Hannes
>> -	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
>> +	if (len > PAGE_SIZE) {
>> +		page = mm_get_huge_zero_page(current->mm);
>> +		if (likely(page))
>> +			huge_page = true;
>> +	}
>> +
>> +	if (!huge_page)
>> +		page = ZERO_PAGE(0);
>> +
>> +	fallback = ((len > PAGE_SIZE) && !huge_page);
>> +
> That is pointless.
> Bios can handle pages larger than PAGE_SIZE.

Yes, I know BIOs can handle pages larger than PAGE_SIZE. But the point
here is that if we fail to allocate a huge zero page that can cover the
complete large FSB (> page size), then we need to use the statically
allocated ZERO_PAGE (see the original patch) at multiple offsets
covering the range.

Unless we have an API that can return a zero folio of arbitrary order
(see also the reply from Willy), we can't use a bio with one vec for
LBS support.

--
Pankaj
>> Fixes: db074436f421 ("iomap: move the direct IO code into a separate file")
>
> I forgot to mention - this Fixes tag is completely bogus. That's just
> a commit that moves the code from one file to another; there's no
> actual functional change at all.
>
> Further, this isn't a change that "fixes" a bug or regression - it is
> a change to support new functionality that doesn't yet exist upstream,
> and so there is no bug in the existing kernels for it to fix....

I agree; I will remove the Fixes tag. I added it because the iomap
infrastructure does not explicitly state that it does not support the
LBS use case. But I agree that this is a change to support a new
feature, and it shouldn't go in as a fix.
>> Thanks for the pointer. I made a rough version of how it might look if
>> I use that API:
>
> useful thing to do.  i think this shows we need a new folio api wrapping
> it.  happy to do that when i'm back, or you can have a crack at it.
>

A folio wrapper for mm_get_huge_zero_page()? I looked at all the callers
of mm_get_huge_zero_page() and I don't see them doing the folio <-> page
dance. Of course, if we end up using mm_get_huge_zero_page() in iomap,
then making a folio API might be worth it!

> your point about it possibly failing is correct.  so i think we need an
> api which definitely returns a folio, but it might be of arbitrary
> order.
>

That would be really nice, given that many parts of the kernel could use
such an API to get away from iterating with ZERO_PAGE().

>> +
>>  	iomap_dio_submit_bio(iter, dio, bio, pos);
>
> then this can look something like:
>
> 	while (len) {
> 		size_t size = min(len, folio_size(folio));
>
> 		__bio_add_folio(bio, folio, size, 0);
> 		len -= size;
> 	}
>

Ah, this looks good!

>> PS: Enjoy your holidays :)
>
> cheers ;-)
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index bcd3f8cf5ea4..04f6c5548136 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -239,14 +239,23 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
 	struct page *page = ZERO_PAGE(0);
 	struct bio *bio;
 
-	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
+	WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));
+
+	bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
+				  REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
 	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
 				  GFP_KERNEL);
+
 	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
-	__bio_add_page(bio, page, len, 0);
+	while (len) {
+		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
+
+		__bio_add_page(bio, page, io_len, 0);
+		len -= io_len;
+	}
 	iomap_dio_submit_bio(iter, dio, bio, pos);
 }