
[RFC,v2,10/14] iomap: fix iomap_dio_zero() for fs bs > system page size

Message ID 20240213093713.1753368-11-kernel@pankajraghav.com (mailing list archive)
State: New
Series: enable bs > ps in XFS

Commit Message

Pankaj Raghav (Samsung) Feb. 13, 2024, 9:37 a.m. UTC
From: Pankaj Raghav <p.raghav@samsung.com>

iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
is smaller than the fs block size. It implicitly assumes that the fs
block size does not exceed the page size, which is true for most
filesystems at the moment.

If the block size > page size, this will send the contents of the pages
adjacent to the zero page (as len > PAGE_SIZE) to the underlying block
device, causing FS corruption.

iomap is generic infrastructure and it should not make any assumptions
about the fs block size or the system page size.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 fs/iomap/direct-io.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)
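
For illustration, the failure mode comes down to simple arithmetic: any
sub-block direct write on a filesystem whose block size exceeds the page
size leaves more than PAGE_SIZE bytes to zero. A minimal user-space
sketch of that calculation (the 16 KiB block size and 4 KiB page size
are assumed example values, not taken from the patch):

#include <stdio.h>

/* Assumed example sizes, for illustration only. */
#define EX_FS_BLOCK_SIZE	(16 * 1024)	/* 16 KiB fs block */
#define EX_PAGE_SIZE		(4 * 1024)	/* 4 KiB system page */

int main(void)
{
	unsigned int dio_len = 4 * 1024;		/* sub-block direct write */
	unsigned int pad = EX_FS_BLOCK_SIZE - dio_len;	/* bytes to zero */

	/*
	 * Before the fix, this pad was added as a single bio vector starting
	 * at ZERO_PAGE(0); everything past the first page of that vector is
	 * whatever happens to sit next to the zero page in physical memory.
	 */
	printf("pad = %u bytes, i.e. %u pages worth of zeroes\n",
	       pad, (pad + EX_PAGE_SIZE - 1) / EX_PAGE_SIZE);
	return 0;
}

With the patch, the same pad is instead split into PAGE_SIZE-sized
chunks of ZERO_PAGE(0), one bio vector per chunk.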

Comments

Hannes Reinecke Feb. 13, 2024, 3:06 p.m. UTC | #1
On 2/13/24 10:37, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
> < fs block size. iomap_dio_zero() has an implicit assumption that fs block
> size < page_size. This is true for most filesystems at the moment.
> 
> If the block size > page size, this will send the contents of the page
> next to zero page(as len > PAGE_SIZE) to the underlying block device,
> causing FS corruption.
> 
> iomap is a generic infrastructure and it should not make any assumptions
> about the fs block size and the page size of the system.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>   fs/iomap/direct-io.c | 13 +++++++++++--
>   1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index bcd3f8cf5ea4..04f6c5548136 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -239,14 +239,23 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
>   	struct page *page = ZERO_PAGE(0);
>   	struct bio *bio;
>   
> -	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> +	WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));
> +
> +	bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
> +				  REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
>   	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
>   				  GFP_KERNEL);
> +
>   	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
>   	bio->bi_private = dio;
>   	bio->bi_end_io = iomap_dio_bio_end_io;
>   
> -	__bio_add_page(bio, page, len, 0);
> +	while (len) {
> +		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
> +
> +		__bio_add_page(bio, page, io_len, 0);
> +		len -= io_len;
> +	}
>   	iomap_dio_submit_bio(iter, dio, bio, pos);
>   }
>   
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
Darrick J. Wong Feb. 13, 2024, 4:30 p.m. UTC | #2
On Tue, Feb 13, 2024 at 10:37:09AM +0100, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
> < fs block size. iomap_dio_zero() has an implicit assumption that fs block
> size < page_size. This is true for most filesystems at the moment.
> 
> If the block size > page size, this will send the contents of the page
> next to zero page(as len > PAGE_SIZE) to the underlying block device,
> causing FS corruption.
> 
> iomap is a generic infrastructure and it should not make any assumptions
> about the fs block size and the page size of the system.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  fs/iomap/direct-io.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index bcd3f8cf5ea4..04f6c5548136 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -239,14 +239,23 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
>  	struct page *page = ZERO_PAGE(0);
>  	struct bio *bio;
>  
> -	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> +	WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));
> +
> +	bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
> +				  REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
>  	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
>  				  GFP_KERNEL);
> +
>  	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
>  	bio->bi_private = dio;
>  	bio->bi_end_io = iomap_dio_bio_end_io;
>  
> -	__bio_add_page(bio, page, len, 0);
> +	while (len) {
> +		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);

What was the result of all that discussion about using the PMD-sized
zero-folio the last time this patch was submitted?  Did that prove to be
unwieldy, or did it require enough extra surgery to become its own
series?

(The code here looks good to me.)

--D

> +
> +		__bio_add_page(bio, page, io_len, 0);
> +		len -= io_len;
> +	}
>  	iomap_dio_submit_bio(iter, dio, bio, pos);
>  }
>  
> -- 
> 2.43.0
> 
>
Pankaj Raghav (Samsung) Feb. 13, 2024, 9:27 p.m. UTC | #3
On Tue, Feb 13, 2024 at 08:30:37AM -0800, Darrick J. Wong wrote:
> On Tue, Feb 13, 2024 at 10:37:09AM +0100, Pankaj Raghav (Samsung) wrote:
> > From: Pankaj Raghav <p.raghav@samsung.com>
> > 
> > iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
> > < fs block size. iomap_dio_zero() has an implicit assumption that fs block
> > size < page_size. This is true for most filesystems at the moment.
> > 
> > If the block size > page size, this will send the contents of the page
> > next to zero page(as len > PAGE_SIZE) to the underlying block device,
> > causing FS corruption.
> > 
> > iomap is a generic infrastructure and it should not make any assumptions
> > about the fs block size and the page size of the system.
> > 
> > Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> > ---
> >  fs/iomap/direct-io.c | 13 +++++++++++--
> >  1 file changed, 11 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index bcd3f8cf5ea4..04f6c5548136 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -239,14 +239,23 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
> >  	struct page *page = ZERO_PAGE(0);
> >  	struct bio *bio;
> >  
> > -	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> > +	WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));
> > +
> > +	bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
> > +				  REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> >  	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
> >  				  GFP_KERNEL);
> > +
> >  	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
> >  	bio->bi_private = dio;
> >  	bio->bi_end_io = iomap_dio_bio_end_io;
> >  
> > -	__bio_add_page(bio, page, len, 0);
> > +	while (len) {
> > +		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
> 
> What was the result of all that discussion about using the PMD-sized
> zero-folio the last time this patch was submitted?  Did that prove to be
> unwieldly, or did it require enough extra surgery to become its own
> series?
> 

It proved a bit unwieldy to me, at least, as I did not know any
straightforward way to do it at the time. So I decided to keep this
approach as it is and add support for the PMD-sized zero folio as a
later improvement.

> (The code here looks good to me.)

Thanks!
> 
> --D
> 
> > +
> > +		__bio_add_page(bio, page, io_len, 0);
> > +		len -= io_len;
> > +	}
> >  	iomap_dio_submit_bio(iter, dio, bio, pos);
> >  }
> >  
> > -- 
> > 2.43.0
> > 
> >
Darrick J. Wong Feb. 13, 2024, 9:30 p.m. UTC | #4
On Tue, Feb 13, 2024 at 10:27:32PM +0100, Pankaj Raghav (Samsung) wrote:
> On Tue, Feb 13, 2024 at 08:30:37AM -0800, Darrick J. Wong wrote:
> > On Tue, Feb 13, 2024 at 10:37:09AM +0100, Pankaj Raghav (Samsung) wrote:
> > > From: Pankaj Raghav <p.raghav@samsung.com>
> > > 
> > > iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
> > > < fs block size. iomap_dio_zero() has an implicit assumption that fs block
> > > size < page_size. This is true for most filesystems at the moment.
> > > 
> > > If the block size > page size, this will send the contents of the page
> > > next to zero page(as len > PAGE_SIZE) to the underlying block device,
> > > causing FS corruption.
> > > 
> > > iomap is a generic infrastructure and it should not make any assumptions
> > > about the fs block size and the page size of the system.
> > > 
> > > Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> > > ---
> > >  fs/iomap/direct-io.c | 13 +++++++++++--
> > >  1 file changed, 11 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > index bcd3f8cf5ea4..04f6c5548136 100644
> > > --- a/fs/iomap/direct-io.c
> > > +++ b/fs/iomap/direct-io.c
> > > @@ -239,14 +239,23 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
> > >  	struct page *page = ZERO_PAGE(0);
> > >  	struct bio *bio;
> > >  
> > > -	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> > > +	WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));
> > > +
> > > +	bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
> > > +				  REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> > >  	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
> > >  				  GFP_KERNEL);
> > > +
> > >  	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
> > >  	bio->bi_private = dio;
> > >  	bio->bi_end_io = iomap_dio_bio_end_io;
> > >  
> > > -	__bio_add_page(bio, page, len, 0);
> > > +	while (len) {
> > > +		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
> > 
> > What was the result of all that discussion about using the PMD-sized
> > zero-folio the last time this patch was submitted?  Did that prove to be
> > unwieldly, or did it require enough extra surgery to become its own
> > series?
> > 
> 
> It proved a bit unwieldly to me at least as I did not know any straight
> forward way to do it at the time. So I thought I will keep this approach
> as it is, and add support for the PMD-sized zero folio for later
> improvement.
> 
> > (The code here looks good to me.)
> 
> Thanks!

In that case I'll throw it on the testing pile and let's ask brauner to
merge this for 6.9 if nothing blows up.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> > 
> > --D
> > 
> > > +
> > > +		__bio_add_page(bio, page, io_len, 0);
> > > +		len -= io_len;
> > > +	}
> > >  	iomap_dio_submit_bio(iter, dio, bio, pos);
> > >  }
> > >  
> > > -- 
> > > 2.43.0
> > > 
> > > 
>
Pankaj Raghav (Samsung) Feb. 14, 2024, 3:13 p.m. UTC | #5
> 
> In that case I'll throw it on the testing pile and let's ask brauner to
> merge this for 6.9 if nothing blows up.
> 
Sounds good. Thanks.

> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> 
> --D
>

Patch

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index bcd3f8cf5ea4..04f6c5548136 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -239,14 +239,23 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
 	struct page *page = ZERO_PAGE(0);
 	struct bio *bio;
 
-	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
+	WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));
+
+	bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
+				  REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
 	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
 				  GFP_KERNEL);
+
 	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
-	__bio_add_page(bio, page, len, 0);
+	while (len) {
+		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
+
+		__bio_add_page(bio, page, io_len, 0);
+		len -= io_len;
+	}
 	iomap_dio_submit_bio(iter, dio, bio, pos);
 }
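
For context on the WARN_ON_ONCE() bound above: BIO_MAX_VECS is 256, so
the loop can cover up to BIO_MAX_VECS * PAGE_SIZE bytes of zero padding
(1 MiB with 4 KiB pages), comfortably more than the sub-block pad
needed for the block sizes this series targets. A minimal sketch of how
that invariant could be stated at build time (MAX_FSB_SIZE is a
hypothetical placeholder, not an identifier from the series; the
fragment assumes <linux/bio.h> and <linux/build_bug.h> are in scope):

/*
 * Hypothetical bound check: the zero pad is always smaller than one fs
 * block, so it must fit in a bio built from BIO_MAX_VECS zero-page
 * vectors.  MAX_FSB_SIZE stands in for the largest block size the
 * series intends to support (assumed to be 64 KiB here).
 */
#define MAX_FSB_SIZE	(64 * 1024)

static_assert(MAX_FSB_SIZE <= BIO_MAX_VECS * PAGE_SIZE,
	      "sub-block zero pad must fit in a single bio");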