
[v4,07/11] iomap: fix iomap_dio_zero() for fs bs > system page size

Message ID 20240425113746.335530-8-kernel@pankajraghav.com (mailing list archive)
State New
Series enable bs > ps in XFS

Commit Message

Pankaj Raghav (Samsung) April 25, 2024, 11:37 a.m. UTC
From: Pankaj Raghav <p.raghav@samsung.com>

iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
is less than the fs block size. iomap_dio_zero() has an implicit
assumption that the fs block size is less than the page size, which is
true for most filesystems at the moment.

If the block size is larger than the page size, this will send the
contents of the page next to the zero page (as len > PAGE_SIZE) to the
underlying block device, causing FS corruption.

iomap is generic infrastructure and should not make any assumptions
about the fs block size and the page size of the system.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/iomap/direct-io.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)
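
To make the failure mode concrete (illustration only; 4k PAGE_SIZE and
a 16k fs block size are assumed here), the pre-patch path issues a
single oversized bvec:

	/* pre-patch: one bvec of 'len' bytes starting at the zero page */
	__bio_add_page(bio, ZERO_PAGE(0), len, 0);

Once len exceeds PAGE_SIZE, that one bvec also covers whatever happens
to sit physically after ZERO_PAGE(0), and those bytes end up on the
block device.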

Comments

Christoph Hellwig April 26, 2024, 6:22 a.m. UTC | #1
On Thu, Apr 25, 2024 at 01:37:42PM +0200, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
> < fs block size. iomap_dio_zero() has an implicit assumption that fs block
> size < page_size. This is true for most filesystems at the moment.
> 
> If the block size > page size, this will send the contents of the page
> next to zero page(as len > PAGE_SIZE) to the underlying block device,
> causing FS corruption.
> 
> iomap is a generic infrastructure and it should not make any assumptions
> about the fs block size and the page size of the system.

So what happened to the plan of making huge_zero_page a folio and
having it available for non-hugetlb setups?  Not only would this be cleaner
and more efficient, but it would actually work for the case where you'd
have to zero more than 1MB on a 4k PAGE_SIZE system, which doesn't
seem impossible with 2MB folios.
Pankaj Raghav (Samsung) April 26, 2024, 11:43 a.m. UTC | #2
On Thu, Apr 25, 2024 at 11:22:35PM -0700, Christoph Hellwig wrote:
> On Thu, Apr 25, 2024 at 01:37:42PM +0200, Pankaj Raghav (Samsung) wrote:
> > From: Pankaj Raghav <p.raghav@samsung.com>
> > 
> > iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
> > < fs block size. iomap_dio_zero() has an implicit assumption that fs block
> > size < page_size. This is true for most filesystems at the moment.
> > 
> > If the block size > page size, this will send the contents of the page
> > next to zero page(as len > PAGE_SIZE) to the underlying block device,
> > causing FS corruption.
> > 
> > iomap is a generic infrastructure and it should not make any assumptions
> > about the fs block size and the page size of the system.
> 
> So what happened to the plan to making huge_zero_page a folio and have
> it available for non-hugetlb setups?  Not only would this be cleaner
> and more efficient, but it would actually work for the case where you'd
> have to zero more than 1MB on a 4k PAGE_SIZE system, which doesn't
> seem impossible with 2MB folios.

I mentioned to Darrick in one of the older series [1] that it was
proving to be a bit complicated (at least for me) to add that support.

Currently, we reserve the ZERO_PAGE during kernel startup (arch/x86/kernel/head_64.S).

Do we go about it the same way, by reserving 1 PMD (512 PTEs with the
base page size) at kernel startup, so that a zeroed 2MB region (on x86)
is always at our disposal for zeroing out?
Allocating it at runtime would defeat the purpose.
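
A rough sketch of what I mean (purely illustrative, the names are made
up, and it assumes THP is configured so HPAGE_PMD_ORDER is defined):

	/* hypothetical boot-time reservation of a PMD-sized zero folio */
	static struct folio *boot_huge_zero_folio __ro_after_init;

	static int __init huge_zero_reserve_init(void)
	{
		boot_huge_zero_folio = folio_alloc(GFP_KERNEL | __GFP_ZERO,
						   HPAGE_PMD_ORDER);
		return boot_huge_zero_folio ? 0 : -ENOMEM;
	}
	early_initcall(huge_zero_reserve_init);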

Let me know what you think.

In any case, I would like to pursue the huge_zero_page folio work
separately from this series. Also, iomap_dio_zero() only pads a fs
block with zeroes, which should never be > 64k for XFS.
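
(For reference, with 4k pages the WARN_ON_ONCE() in this patch allows up
to BIO_MAX_VECS * PAGE_SIZE = 256 * 4k = 1M of padding, while padding a
64k block needs strictly less than 64k, i.e. at most 16 single-page
bvecs, so the warning should never trigger for the block sizes this
series enables.)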

[1] https://lore.kernel.org/linux-fsdevel/5kodxnrvjq5dsjgjfeps6wte774c2sl75bn3fg3hh46q3wkwk5@2tru4htvqmrq/
Matthew Wilcox April 27, 2024, 3:26 a.m. UTC | #3
On Thu, Apr 25, 2024 at 11:22:35PM -0700, Christoph Hellwig wrote:
> On Thu, Apr 25, 2024 at 01:37:42PM +0200, Pankaj Raghav (Samsung) wrote:
> > From: Pankaj Raghav <p.raghav@samsung.com>
> > 
> > iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
> > < fs block size. iomap_dio_zero() has an implicit assumption that fs block
> > size < page_size. This is true for most filesystems at the moment.
> > 
> > If the block size > page size, this will send the contents of the page
> > next to zero page(as len > PAGE_SIZE) to the underlying block device,
> > causing FS corruption.
> > 
> > iomap is a generic infrastructure and it should not make any assumptions
> > about the fs block size and the page size of the system.
> 
> So what happened to the plan to making huge_zero_page a folio and have

There's a series of commits in linux-mm with the titles:

      sparc: use is_huge_zero_pmd()
      mm: add is_huge_zero_folio()
      mm: add pmd_folio()
      mm: convert migrate_vma_collect_pmd to use a folio
      mm: convert huge_zero_page to huge_zero_folio
      mm: convert do_huge_pmd_anonymous_page to huge_zero_folio
      dax: use huge_zero_folio
      mm: rename mm_put_huge_zero_page to mm_put_huge_zero_folio

> it available for non-hugetlb setups?  Not only would this be cleaner
> and more efficient, but it would actually work for the case where you'd
> have to zero more than 1MB on a 4k PAGE_SIZE system, which doesn't
> seem impossible with 2MB folios.

It is available for non-hugetlb setups.  It is however allocated on
demand, so it might not be available.
Christoph Hellwig April 27, 2024, 4:52 a.m. UTC | #4
On Sat, Apr 27, 2024 at 04:26:44AM +0100, Matthew Wilcox wrote:
> There's a series of commits in linux-mm with the titles:
> 
>       sparc: use is_huge_zero_pmd()
>       mm: add is_huge_zero_folio()
>       mm: add pmd_folio()
>       mm: convert migrate_vma_collect_pmd to use a folio
>       mm: convert huge_zero_page to huge_zero_folio
>       mm: convert do_huge_pmd_anonymous_page to huge_zero_folio
>       dax: use huge_zero_folio
>       mm: rename mm_put_huge_zero_page to mm_put_huge_zero_folio
> 
> > it available for non-hugetlb setups?  Not only would this be cleaner
> > and more efficient, but it would actually work for the case where you'd
> > have to zero more than 1MB on a 4k PAGE_SIZE system, which doesn't
> > seem impossible with 2MB folios.
> 
> It is available for non-hugetlb setups.  It is however allocated on
> demand, so it might not be available.

We could just export get_huge_zero_page/put_huge_zero_page and make
sure it is available for block size > PAGE_SIZE file systems, or is
there a good argument against that?
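
Roughly something like this (sketch only, not the patch below; it
assumes the export happens, that huge_zero_folio from the linux-mm
series above is pinned at mount time for such filesystems, and that
len never exceeds HPAGE_PMD_SIZE):

	static void iomap_dio_zero(const struct iomap_iter *iter,
				   struct iomap_dio *dio, loff_t pos,
				   unsigned int len)
	{
		struct inode *inode = file_inode(dio->iocb->ki_filp);
		struct bio *bio;

		bio = iomap_dio_alloc_bio(iter, dio, 1,
					  REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
		fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
					  GFP_KERNEL);
		bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
		bio->bi_private = dio;
		bio->bi_end_io = iomap_dio_bio_end_io;

		/* one bvec covers the whole pad, no per-page loop needed */
		bio_add_folio_nofail(bio, huge_zero_folio, len, 0);
		iomap_dio_submit_bio(iter, dio, bio, pos);
	}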
Christoph Hellwig April 27, 2024, 5:12 a.m. UTC | #5
On Fri, Apr 26, 2024 at 11:43:01AM +0000, Pankaj Raghav (Samsung) wrote:
> Because allocating it during runtime will defeat the purpose.

Well, what runtime?  Either way it seems like we have the infrastructure
now based on the comment from willy.

> In anycase, I would like to pursue huge_zero_page folio separately
> from this series. Also iomap_dio_zero() only pads a fs block with
> zeroes, which should never be > 64k for XFS.

Only if you are limited to 64k block size.
Pankaj Raghav (Samsung) April 29, 2024, 9:02 p.m. UTC | #6
On Fri, Apr 26, 2024 at 10:12:01PM -0700, Christoph Hellwig wrote:
> On Fri, Apr 26, 2024 at 11:43:01AM +0000, Pankaj Raghav (Samsung) wrote:
> > Because allocating it during runtime will defeat the purpose.
> 
> Well, what runtime?  Either way it seems like we have the infrastructure
> now based on the comment from willy.

As willy pointed out in that reply, it is allocated on demand, so it
might still fail and we might have to fall back to looping.
And if we end up using the huge zero page, we should also make sure to
decrement the reference in iomap_dio_bio_end_io(), which is not needed
when we use ZERO_PAGE.
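
i.e. something like this on the completion side (only a sketch; the
flag is made up and put_huge_zero_page() is not exported today):

	void iomap_dio_bio_end_io(struct bio *bio)
	{
		struct iomap_dio *dio = bio->bi_private;

		/* hypothetical: drop the pin taken for the huge zero pad */
		if (dio->flags & IOMAP_DIO_HUGE_ZERO_REF)
			put_huge_zero_page();

		/* ... the rest of the existing completion path is unchanged ... */
	}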

FWIW, I did a prototype with mm_huge_zero_page() in one of the older
series [1] (I forgot to decrement the reference in end_io()), but I did
not get a final response on whether that is the direction we want to go.

Let me know your thoughts.

[1] https://lore.kernel.org/linux-xfs/3pqmgrlewo6ctcwakdvbvjqixac5en6irlipe5aiz6vkylfyni@2luhrs36ke5r/#r
> 
> > In anycase, I would like to pursue huge_zero_page folio separately
> > from this series. Also iomap_dio_zero() only pads a fs block with
> > zeroes, which should never be > 64k for XFS.
> 
> Only if you are limited to 64k block size.
>

Patch

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f3b43d223a46..5f481068de5b 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -239,14 +239,23 @@  static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
 	struct page *page = ZERO_PAGE(0);
 	struct bio *bio;
 
-	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
+	WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));
+
+	bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
+				  REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
 	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
 				  GFP_KERNEL);
+
 	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
-	__bio_add_page(bio, page, len, 0);
+	while (len) {
+		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
+
+		__bio_add_page(bio, page, io_len, 0);
+		len -= io_len;
+	}
 	iomap_dio_submit_bio(iter, dio, bio, pos);
 }