diff mbox series

iomap: fix iomap_dio_zero() for fs bs > system page size

Message ID 20231026140832.1089824-1-kernel@pankajraghav.com (mailing list archive)
State Superseded
Headers show
Series iomap: fix iomap_dio_zero() for fs bs > system page size | expand

Commit Message

Pankaj Raghav (Samsung) Oct. 26, 2023, 2:08 p.m. UTC
From: Pankaj Raghav <p.raghav@samsung.com>

iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
< fs block size. iomap_dio_zero() has an implicit assumption that fs block
size < page_size. This is true for most filesystems at the moment.

If the block size > page size (Large block sizes)[1], this will send the
contents of the page next to zero page(as len > PAGE_SIZE) to the
underlying block device, causing FS corruption.

iomap is a generic infrastructure and it should not make any assumptions
about the fs block size and the page size of the system.

Fixes: db074436f421 ("iomap: move the direct IO code into a separate file")
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>

[1] https://lore.kernel.org/lkml/20230915183848.1018717-1-kernel@pankajraghav.com/
---
I had initially planned on sending this as a part of LBS patches but I                                                                                                                                                                                                                                                  
think this can go as a standalone patch as it is a generic fix to iomap.                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                        
@Dave chinner This fixes the corruption issue you were seeing in                                                                                                                                                                                                                                                        
generic/091 for bs=64k in [2]                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                                        
[2] https://lore.kernel.org/lkml/ZQfbHloBUpDh+zCg@dread.disaster.area/

 fs/iomap/direct-io.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)


base-commit: 05d3ef8bba77c1b5f98d941d8b2d4aeab8118ef1

Comments

Dave Chinner Oct. 26, 2023, 10:10 p.m. UTC | #1
On Thu, Oct 26, 2023 at 04:08:32PM +0200, Pankaj Raghav wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
> < fs block size. iomap_dio_zero() has an implicit assumption that fs block
> size < page_size. This is true for most filesystems at the moment.
> 
> If the block size > page size (Large block sizes)[1], this will send the
> contents of the page next to zero page(as len > PAGE_SIZE) to the
> underlying block device, causing FS corruption.
> 
> iomap is a generic infrastructure and it should not make any assumptions
> about the fs block size and the page size of the system.
> 
> Fixes: db074436f421 ("iomap: move the direct IO code into a separate file")
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> 
> [1] https://lore.kernel.org/lkml/20230915183848.1018717-1-kernel@pankajraghav.com/
> ---
> I had initially planned on sending this as a part of LBS patches but I                                                                                                                                                                                                                                                  
> think this can go as a standalone patch as it is a generic fix to iomap.                                                                                                                                                                                                                                                
>                                                                                                                                                                                                                                                                                                                         
> @Dave chinner This fixes the corruption issue you were seeing in                                                                                                                                                                                                                                                        
> generic/091 for bs=64k in [2]                                                                                                                                                                                                                                                                                           
>                                                                                                                                                                                                                                                                                                                         
> [2] https://lore.kernel.org/lkml/ZQfbHloBUpDh+zCg@dread.disaster.area/
> 
>  fs/iomap/direct-io.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index bcd3f8cf5ea4..04f6c5548136 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -239,14 +239,23 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
>  	struct page *page = ZERO_PAGE(0);
>  	struct bio *bio;
>  
> -	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> +	WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));

How can that happen here? Max fsb size will be 64kB for the
foreseeable future, the bio can hold 256 pages so it will have a
minimum size capability of 1MB.

FWIW, as a general observation, I think this is the wrong place to
be checking that a filesystem block is larger than can be fit in a
single bio. There's going to be problems all over the place if we
can't do fsb sized IO in a single bio. i.e. I think this sort of
validation should be performed during filesystem mount, not
sporadically checked with WARN_ON() checks in random places in the
IO path...

> +
> +	bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
> +				  REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
>  	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
>  				  GFP_KERNEL);
> +
>  	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
>  	bio->bi_private = dio;
>  	bio->bi_end_io = iomap_dio_bio_end_io;
>  
> -	__bio_add_page(bio, page, len, 0);
> +	while (len) {
> +		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
> +
> +		__bio_add_page(bio, page, io_len, 0);
> +		len -= io_len;
> +	}
>  	iomap_dio_submit_bio(iter, dio, bio, pos);

/me wonders if we should have a set of ZERO_FOLIO()s that contain a
folio of each possible size. Then we just pull the ZERO_FOLIO of the
correct size and use __bio_add_folio(). i.e. no need for
looping over the bio to repeatedly add the ZERO_PAGE, etc, and the
code is then identical for all cases of page size vs FSB size.

-Dave.
Luis Chamberlain Oct. 26, 2023, 11:20 p.m. UTC | #2
On Thu, Oct 26, 2023 at 04:08:32PM +0200, Pankaj Raghav wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
> < fs block size. iomap_dio_zero() has an implicit assumption that fs block
> size < page_size. This is true for most filesystems at the moment.
> 
> If the block size > page size (Large block sizes)[1], this will send the
> contents of the page next to zero page(as len > PAGE_SIZE) to the
> underlying block device, causing FS corruption.
> 
> iomap is a generic infrastructure and it should not make any assumptions
> about the fs block size and the page size of the system.
> 
> Fixes: db074436f421 ("iomap: move the direct IO code into a separate file")

Nice!

> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> 
> [1] https://lore.kernel.org/lkml/20230915183848.1018717-1-kernel@pankajraghav.com/

This URL jus tneeds to go above the Fixes tag.

   Luis
Christoph Hellwig Oct. 27, 2023, 5:18 a.m. UTC | #3
>  
> -	__bio_add_page(bio, page, len, 0);
> +	while (len) {
> +		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
> +
> +		__bio_add_page(bio, page, io_len, 0);
> +		len -= io_len;
> +	}

Maybe out of self-interest, but shouldn't we replace ZERO_PAGE with a
sufficiently larger ZERO_FOLIO?  Right now I have a case where I have
to have a zero padding of up to MAX_PAGECACHE_ORDER minus block size,
so having a MAX_PAGECACHE_ORDER folio would have been really helpful
for me, but I suspect there are many other such cases as well.
Pankaj Raghav Oct. 27, 2023, 7:53 a.m. UTC | #4
>> -	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
>> +	WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));
> 
> How can that happen here? Max fsb size will be 64kB for the
> foreseeable future, the bio can hold 256 pages so it will have a
> minimum size capability of 1MB.
> 

I just added it as a pathological check. This will not trigger
anytime in the near future.

> FWIW, as a general observation, I think this is the wrong place to
> be checking that a filesystem block is larger than can be fit in a
> single bio. There's going to be problems all over the place if we
> can't do fsb sized IO in a single bio. i.e. I think this sort of
> validation should be performed during filesystem mount, not
> sporadically checked with WARN_ON() checks in random places in the
> IO path...
> 

I agree that it should be a check at a higher level.

As iomap can be used by any filesystem, the check on FSB limitation
should be a part iomap right? I don't see any explicit document/comment
that states the iomap limitations on FSB, etc.

>>  
>> -	__bio_add_page(bio, page, len, 0);
>> +	while (len) {
>> +		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
>> +
>> +		__bio_add_page(bio, page, io_len, 0);
>> +		len -= io_len;
>> +	}
>>  	iomap_dio_submit_bio(iter, dio, bio, pos);
> 
> /me wonders if we should have a set of ZERO_FOLIO()s that contain a
> folio of each possible size. Then we just pull the ZERO_FOLIO of the
> correct size and use __bio_add_folio(). i.e. no need for
> looping over the bio to repeatedly add the ZERO_PAGE, etc, and the
> code is then identical for all cases of page size vs FSB size.
> 

I was exactly looking for ZERO_FOLIOs. Instead of having a ZERO_PAGE, can
we reserve a ZERO_HUGE_PAGE so that it can be used directly by
bio_add_folio_nofail()?
Pankaj Raghav Oct. 27, 2023, 8:03 a.m. UTC | #5
On 27/10/2023 07:18, Christoph Hellwig wrote:
>>  
>> -	__bio_add_page(bio, page, len, 0);
>> +	while (len) {
>> +		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
>> +
>> +		__bio_add_page(bio, page, io_len, 0);
>> +		len -= io_len;
>> +	}
> 
> Maybe out of self-interest, but shouldn't we replace ZERO_PAGE with a
> sufficiently larger ZERO_FOLIO?  Right now I have a case where I have
> to have a zero padding of up to MAX_PAGECACHE_ORDER minus block size,
> so having a MAX_PAGECACHE_ORDER folio would have been really helpful
> for me, but I suspect there are many other such cases as well.

I think that would definitely be useful.

I also noticed this pattern in fscrypt_zeroout_range_inline_crypt().
Probably there are more places which could use a ZERO_FOLIO directly
instead of iterating with ZERO_PAGE.

Chinner also had a similar comment. It would be nice if we can reserve
a zero huge page that is the size of MAX_PAGECACHE_ORDER and add it as
one folio to the bio.
Matthew Wilcox Oct. 27, 2023, 10:47 a.m. UTC | #6
On Fri, Oct 27, 2023 at 10:03:15AM +0200, Pankaj Raghav wrote:
> I also noticed this pattern in fscrypt_zeroout_range_inline_crypt().
> Probably there are more places which could use a ZERO_FOLIO directly
> instead of iterating with ZERO_PAGE.
> 
> Chinner also had a similar comment. It would be nice if we can reserve
> a zero huge page that is the size of MAX_PAGECACHE_ORDER and add it as
> one folio to the bio.

i'm on holiday atm.  start looking at mm_get_huge_zero_page()
Dave Chinner Oct. 27, 2023, 10:07 p.m. UTC | #7
On Fri, Oct 27, 2023 at 09:53:19AM +0200, Pankaj Raghav wrote:
> >> -	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> >> +	WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));
> > 
> > How can that happen here? Max fsb size will be 64kB for the
> > foreseeable future, the bio can hold 256 pages so it will have a
> > minimum size capability of 1MB.
> > 
> 
> I just added it as a pathological check. This will not trigger
> anytime in the near future.

Yeah, exactly my point - it should never happen, it's a fundamental
developer stuff-up bug, not a runtime bug, and so shouldn't be
checked at runtime on every bio....

> > FWIW, as a general observation, I think this is the wrong place to
> > be checking that a filesystem block is larger than can be fit in a
> > single bio. There's going to be problems all over the place if we
> > can't do fsb sized IO in a single bio. i.e. I think this sort of
> > validation should be performed during filesystem mount, not
> > sporadically checked with WARN_ON() checks in random places in the
> > IO path...
> > 
> 
> I agree that it should be a check at a higher level.
> 
> As iomap can be used by any filesystem, the check on FSB limitation
> should be a part iomap right? I don't see any explicit document/comment
> that states the iomap limitations on FSB, etc.

No, it should be part of the filesystem that *uses the bio
infrastrucure*.

We already do that with the page cache - filesystems check at mount
time that the FSB is <= PAGE_SIZE and reject the mount if this check
fails because the page cache simply cannot handle this situation.

This is no different: if we are going to allow FSB > PAGE_SIZE, then
we need to ensure somewhere, even as a BUILD_BUG_ON(), that the max
support FSB the filesystem has is smaller than what we can pack in a
bio.

-Dave.
Dave Chinner Oct. 27, 2023, 10:10 p.m. UTC | #8
On Thu, Oct 26, 2023 at 04:08:32PM +0200, Pankaj Raghav wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
> < fs block size. iomap_dio_zero() has an implicit assumption that fs block
> size < page_size. This is true for most filesystems at the moment.
> 
> If the block size > page size (Large block sizes)[1], this will send the
> contents of the page next to zero page(as len > PAGE_SIZE) to the
> underlying block device, causing FS corruption.
> 
> iomap is a generic infrastructure and it should not make any assumptions
> about the fs block size and the page size of the system.
> 
> Fixes: db074436f421 ("iomap: move the direct IO code into a separate file")

I forgot to mention - this fixes tag is completely bogus. That's
just a commit that moves the code from one file to another and
there's no actual functional change at all.

Further, this isn't a change that "fixes" a bug or regression - it
is a change to support new functionality that doesnt' yet exist
upstream, and so there is no bug in the existing kernels for it to
fix....

-Dave.
Matthew Wilcox Oct. 27, 2023, 10:59 p.m. UTC | #9
On Fri, Oct 27, 2023 at 05:41:10PM +0200, Pankaj Raghav wrote:
> On 27/10/2023 12:47, Matthew Wilcox wrote:
> > On Fri, Oct 27, 2023 at 10:03:15AM +0200, Pankaj Raghav wrote:
> >> I also noticed this pattern in fscrypt_zeroout_range_inline_crypt().
> >> Probably there are more places which could use a ZERO_FOLIO directly
> >> instead of iterating with ZERO_PAGE.
> >>
> >> Chinner also had a similar comment. It would be nice if we can reserve
> >> a zero huge page that is the size of MAX_PAGECACHE_ORDER and add it as
> >> one folio to the bio.
> > 
> > i'm on holiday atm.  start looking at mm_get_huge_zero_page()
> 
> Thanks for the pointer. I made a rough version of how it might
> look like if I use that API:

useful thing to do.  i think this shows we need a new folio api wrapping
it.  happy to do that when i'm back, or you can have a crack at it.

your point about it possibly failing is correct.  so i think we need an
api which definitely returns a folio, but it might be of arbitrary
order.

>         bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
>         bio->bi_private = dio;
>         bio->bi_end_io = iomap_dio_bio_end_io;
> 
> -       __bio_add_page(bio, page, len, 0);
> +       if (!fallback) {
> +               bio_add_folio_nofail(bio, page_folio(page), len, 0);
> +       } else {
> +               while (len) {
> +                       unsigned int io_len =
> +                               min_t(unsigned int, len, PAGE_SIZE);
> +
> +                       __bio_add_page(bio, page, io_len, 0);
> +                       len -= io_len;
> +               }
> +       }
> +
>         iomap_dio_submit_bio(iter, dio, bio, pos);

then this can look something like:

	while (len) {
		size_t size = min(len, folio_size(folio));

		__bio_add_folio(bio, folio, size, 0);
		len -= size;
	}

> PS: Enjoy your holidays :)

cheers ;-)
Hannes Reinecke Oct. 28, 2023, 1:17 p.m. UTC | #10
On 10/27/23 17:41, Pankaj Raghav wrote:
> On 27/10/2023 12:47, Matthew Wilcox wrote:
>> On Fri, Oct 27, 2023 at 10:03:15AM +0200, Pankaj Raghav wrote:
>>> I also noticed this pattern in fscrypt_zeroout_range_inline_crypt().
>>> Probably there are more places which could use a ZERO_FOLIO directly
>>> instead of iterating with ZERO_PAGE.
>>>
>>> Chinner also had a similar comment. It would be nice if we can reserve
>>> a zero huge page that is the size of MAX_PAGECACHE_ORDER and add it as
>>> one folio to the bio.
>>
>> i'm on holiday atm.  start looking at mm_get_huge_zero_page()
> 
> Thanks for the pointer. I made a rough version of how it might
> look like if I use that API:
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index bcd3f8cf5ea4..6ae21bd16dbe 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -236,17 +236,43 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
>                  loff_t pos, unsigned len)
>   {
>          struct inode *inode = file_inode(dio->iocb->ki_filp);
> -       struct page *page = ZERO_PAGE(0);
> +       struct page *page = NULL;
> +       bool huge_page = false;
> +       bool fallback = false;
>          struct bio *bio;
> 
> -       bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> +       if (len > PAGE_SIZE) {
> +               page = mm_get_huge_zero_page(current->mm);
> +               if (likely(page))
> +                       huge_page = true;
> +       }
> +
> +       if (!huge_page)
> +               page = ZERO_PAGE(0);
> +
> +       fallback = ((len > PAGE_SIZE) && !huge_page);
> +
That is pointless.
Bios can handle pages larger than PAGE_SIZE.

> +       bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
> +                                 REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
Similar here. Just allocate space for a single folio.

>          fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
>                                    GFP_KERNEL);
> +
>          bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
>          bio->bi_private = dio;
>          bio->bi_end_io = iomap_dio_bio_end_io;
> 
> -       __bio_add_page(bio, page, len, 0);
> +       if (!fallback) {
> +               bio_add_folio_nofail(bio, page_folio(page), len, 0);
> +       } else {
> +               while (len) {
> +                       unsigned int io_len =
> +                               min_t(unsigned int, len, PAGE_SIZE);
> +
> +                       __bio_add_page(bio, page, io_len, 0);
> +                       len -= io_len;
> +               }
> +       }
See above. Kill the 'fallback' case.

Cheers,

Hannes
Pankaj Raghav Oct. 28, 2023, 4:57 p.m. UTC | #11
>> -       bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
>> +       if (len > PAGE_SIZE) {
>> +               page = mm_get_huge_zero_page(current->mm);
>> +               if (likely(page))
>> +                       huge_page = true;
>> +       }
>> +
>> +       if (!huge_page)
>> +               page = ZERO_PAGE(0);
>> +
>> +       fallback = ((len > PAGE_SIZE) && !huge_page);
>> +
> That is pointless.
> Bios can handle pages larger than PAGE_SIZE.

Yes, I know BIOs can handle large pages. But the point here is if we fail
to allocate a huge zero page that can cover the complete large FSB ( > page size),
then we need to use the statically allocated ZERO_PAGE (see the original patch)
for multiple offsets covering the range.

Unless we have an API that can return a zero folio with arbitrary order(see also the
reply from Willy), we can't use a bio with one vec for LBS support.

--
Pankaj
Pankaj Raghav Oct. 28, 2023, 5:20 p.m. UTC | #12
>> Fixes: db074436f421 ("iomap: move the direct IO code into a separate file")
> 
> I forgot to mention - this fixes tag is completely bogus. That's
> just a commit that moves the code from one file to another and
> there's no actual functional change at all.
> 
> Further, this isn't a change that "fixes" a bug or regression - it
> is a change to support new functionality that doesnt' yet exist
> upstream, and so there is no bug in the existing kernels for it to
> fix....

I agree. I will remove the Fixes tag.

I added it as iomap infra does not explicitly tell it does not
support the LBS usecase. But I agree that it is a change to support
a new feature, and it shouldn't go as a fix.
Pankaj Raghav Oct. 28, 2023, 7:57 p.m. UTC | #13
>> Thanks for the pointer. I made a rough version of how it might
>> look like if I use that API:
> 
> useful thing to do.  i think this shows we need a new folio api wrapping
> it.  happy to do that when i'm back, or you can have a crack at it.
> 
A folio wrapping for mm_get_huge_zero_page()?

I tried to look for all the callers of mm_get_huge_zero_page() and I don't
see them doing the folio <-> page dance. Of course, if we end up using
mm_get_huge_zero_page() in iomap, then getting making a folio API might be
worth it!

> your point about it possibly failing is correct.  so i think we need an
> api which definitely returns a folio, but it might be of arbitrary
> order.
> 
That will be really nice, given that many parts of the kernel might
use that API to get away with iterating with ZERO_PAGE().

>> +
>>         iomap_dio_submit_bio(iter, dio, bio, pos);
> 
> then this can look something like:
> 
> 	while (len) {
> 		size_t size = min(len, folio_size(folio));
> 
> 		__bio_add_folio(bio, folio, size, 0);
> 		len -= size;
> 	}
>

Ah, this looks good!

>> PS: Enjoy your holidays :)
> 
> cheers ;-)
diff mbox series

Patch

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index bcd3f8cf5ea4..04f6c5548136 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -239,14 +239,23 @@  static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
 	struct page *page = ZERO_PAGE(0);
 	struct bio *bio;
 
-	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
+	WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));
+
+	bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
+				  REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
 	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
 				  GFP_KERNEL);
+
 	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
-	__bio_add_page(bio, page, len, 0);
+	while (len) {
+		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
+
+		__bio_add_page(bio, page, io_len, 0);
+		len -= io_len;
+	}
 	iomap_dio_submit_bio(iter, dio, bio, pos);
 }