[06/16] iomap: support block size > page size for direct IO

Message ID: 20181107063127.3902-7-david@fromorbit.com (mailing list archive)
State: New, archived
Series: xfs: Block size > PAGE_SIZE support

Commit Message

Dave Chinner Nov. 7, 2018, 6:31 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

iomap_dio_rw() has all the infrastructure in place to support block
size > page size filesystems because it is essentially just
sub-block DIO. It needs help, however, with the sub-block zeroing
code (which needs multi-page IOs) and with page cache invalidation
over the block being written.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/iomap.c | 65 ++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 49 insertions(+), 16 deletions(-)

Comments

Nikolay Borisov Nov. 8, 2018, 11:28 a.m. UTC | #1
On 7.11.18 8:31, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> iomap_dio_rw() has all the infrastructure in place to support block
> size > page size filesystems because it is essentially just
> sub-block DIO. It needs help, however, with the sub-block zeroing
> code (which needs multi-page IOs) and with page cache invalidation
> over the block being written.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/iomap.c | 65 ++++++++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 49 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 16d16596b00f..8878b1f1f9c7 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -1548,21 +1548,34 @@ static void iomap_dio_bio_end_io(struct bio *bio)
>  	}
>  }
>  
> +/*
> + * With zeroing for block size larger than page size, the zeroing length can
> + * span multiple pages.
> + */
> +#define howmany(x, y) (((x)+((y)-1))/(y))

nit: This could be replaced by DIV_ROUND_UP
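
For reference, an untested sketch of that replacement; DIV_ROUND_UP comes
from <linux/kernel.h> and expands to the same arithmetic as the local
howmany() macro:

	/* DIV_ROUND_UP(n, d) is ((n) + (d) - 1) / (d) */
	int npages = DIV_ROUND_UP(len, PAGE_SIZE);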

>  static blk_qc_t
>  iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
>  		unsigned len)
>  {
>  	struct page *page = ZERO_PAGE(0);
>  	struct bio *bio;
> +	int npages = howmany(len, PAGE_SIZE);
> +
> +	WARN_ON_ONCE(npages > 16);
>  
> -	bio = bio_alloc(GFP_KERNEL, 1);
> +	bio = bio_alloc(GFP_KERNEL, npages);
>  	bio_set_dev(bio, iomap->bdev);
>  	bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
>  	bio->bi_private = dio;
>  	bio->bi_end_io = iomap_dio_bio_end_io;
>  
> -	get_page(page);
> -	__bio_add_page(bio, page, len, 0);
> +	while (npages-- > 0) {
> +		unsigned plen = min_t(unsigned, PAGE_SIZE, len);
> +		get_page(page);
> +		__bio_add_page(bio, page, plen, 0);
> +		len -= plen;
> +	}
> +	WARN_ON(len != 0);
>  	bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC | REQ_IDLE);
>  
>  	atomic_inc(&dio->ref);
> @@ -1752,6 +1765,38 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
>  	}
>  }
>  
> +/*
> + * This is lifted almost straight from xfs_flush_unmap_range(). Need a generic
> + * version of the block size rounding for these purposes.
> + */
> +static int
> +iomap_flush_unmap_range(struct file *f, loff_t offset, loff_t len)
> +{
> +	struct inode *inode = file_inode(f);
> +	loff_t rounding, start, end;
> +	int ret;
> +
> +	rounding = max_t(loff_t, i_blocksize(inode), PAGE_SIZE);
> +	start = round_down(offset, rounding);
> +	end = round_up(offset + len, rounding) - 1;
> +
> +	ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
> +	if (ret)
> +		return ret;
> +
> +	/*
> +	 * Try to invalidate cache pages for the range we're direct
> +	 * writing.  If this invalidation fails, tough, the write will
> +	 * still work, but racing two incompatible write paths is a
> +	 * pretty crazy thing to do, so we don't support it 100%.
> +	 */
> +	ret = invalidate_inode_pages2_range(inode->i_mapping,
> +			start >> PAGE_SHIFT, end >> PAGE_SHIFT);
> +	if (ret)
> +		dio_warn_stale_pagecache(f);
> +	return 0;
> +}
> +
>  /*
>   * iomap_dio_rw() always completes O_[D]SYNC writes regardless of whether the IO
>   * is being issued as AIO or not.  This allows us to optimise pure data writes
> @@ -1829,22 +1874,10 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  		flags |= IOMAP_NOWAIT;
>  	}
>  
> -	ret = filemap_write_and_wait_range(mapping, start, end);
> +	ret = iomap_flush_unmap_range(iocb->ki_filp, start, end);
>  	if (ret)
>  		goto out_free_dio;
>  
> -	/*
> -	 * Try to invalidate cache pages for the range we're direct
> -	 * writing.  If this invalidation fails, tough, the write will
> -	 * still work, but racing two incompatible write paths is a
> -	 * pretty crazy thing to do, so we don't support it 100%.
> -	 */
> -	ret = invalidate_inode_pages2_range(mapping,
> -			start >> PAGE_SHIFT, end >> PAGE_SHIFT);
> -	if (ret)
> -		dio_warn_stale_pagecache(iocb->ki_filp);
> -	ret = 0;
> -
>  	if (iov_iter_rw(iter) == WRITE && !dio->wait_for_completion &&
>  	    !inode->i_sb->s_dio_done_wq) {
>  		ret = sb_init_dio_done_wq(inode->i_sb);
>
Christoph Hellwig Nov. 9, 2018, 3:18 p.m. UTC | #2
>  static blk_qc_t
>  iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
>  		unsigned len)
>  {
>  	struct page *page = ZERO_PAGE(0);
>  	struct bio *bio;
> +	int npages = howmany(len, PAGE_SIZE);
> +
> +	WARN_ON_ONCE(npages > 16);

Where does this magic 16 come from?

> +	WARN_ON(len != 0);

WARN_ON_ONCE please to avoid making the log unreadable if it ever
triggers.

> +/*
> + * This is lifted almost straight from xfs_flush_unmap_range(). Need a generic
> + * version of the block size rounding for these purposes.
> + */

Can you just add a generic version of this in a separate patch and
also switch XFS over to it?

> +static int
> +iomap_flush_unmap_range(struct file *f, loff_t offset, loff_t len)

Can we please spell out file in the parameter name?
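
i.e., as a sketch of just the renamed signature, with the body otherwise
unchanged:

	static int
	iomap_flush_unmap_range(struct file *file, loff_t offset, loff_t len)
	{
		struct inode *inode = file_inode(file);
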
Dave Chinner Nov. 11, 2018, 1:12 a.m. UTC | #3
On Fri, Nov 09, 2018 at 07:18:19AM -0800, Christoph Hellwig wrote:
> >  static blk_qc_t
> >  iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
> >  		unsigned len)
> >  {
> >  	struct page *page = ZERO_PAGE(0);
> >  	struct bio *bio;
> > +	int npages = howmany(len, PAGE_SIZE);
> > +
> > +	WARN_ON_ONCE(npages > 16);
> 
> Where does this magic 16 come from?

4k page size, 64k block size. Debug code, essentially.
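
That is, howmany(65536, 4096) == 16. If the check were to stay, the limit
could be spelled out; an untested sketch, where IOMAP_ZERO_MAX_PAGES is a
made-up name and SZ_64K comes from <linux/sizes.h>:

	/* Largest supported fs block is 64k, i.e. 16 pages at 4k PAGE_SIZE. */
	#define IOMAP_ZERO_MAX_PAGES	howmany(SZ_64K, PAGE_SIZE)

	WARN_ON_ONCE(npages > IOMAP_ZERO_MAX_PAGES);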

> > +	WARN_ON(len != 0);
> 
> WARN_ON_ONCE please to avoid making the log unreadable if it ever
> triggers.

Debug code, too, so it'll get removed eventually.

> > +/*
> > + * This is lifted almost straight from xfs_flush_unmap_range(). Need a generic
> > + * version of the block size rounding for these purposes.
> > + */
> 
> Can you just add a generic version of this in a separate patch and
> also switch XFS over to it?

Well, they do different things. The xfs code must truncate the page
cache over the range (because we are removing the underlying
storage) while this just attempts to invalidate the pages and simply
says "have a nice day" if it fails. 

So they really are two different functions. The comment was written
when I expected that I was going to need to do lots more block size
rounding for invalidation in the generic code, but it seems that
may actually not be necessary...

Cheers,

Dave.

Patch

diff --git a/fs/iomap.c b/fs/iomap.c
index 16d16596b00f..8878b1f1f9c7 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1548,21 +1548,34 @@ static void iomap_dio_bio_end_io(struct bio *bio)
 	}
 }
 
+/*
+ * With zeroing for block size larger than page size, the zeroing length can
+ * span multiple pages.
+ */
+#define howmany(x, y) (((x)+((y)-1))/(y))
 static blk_qc_t
 iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
 		unsigned len)
 {
 	struct page *page = ZERO_PAGE(0);
 	struct bio *bio;
+	int npages = howmany(len, PAGE_SIZE);
+
+	WARN_ON_ONCE(npages > 16);
 
-	bio = bio_alloc(GFP_KERNEL, 1);
+	bio = bio_alloc(GFP_KERNEL, npages);
 	bio_set_dev(bio, iomap->bdev);
 	bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
-	get_page(page);
-	__bio_add_page(bio, page, len, 0);
+	while (npages-- > 0) {
+		unsigned plen = min_t(unsigned, PAGE_SIZE, len);
+		get_page(page);
+		__bio_add_page(bio, page, plen, 0);
+		len -= plen;
+	}
+	WARN_ON(len != 0);
 	bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC | REQ_IDLE);
 
 	atomic_inc(&dio->ref);
@@ -1752,6 +1765,38 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 	}
 }
 
+/*
+ * This is lifted almost straight from xfs_flush_unmap_range(). Need a generic
+ * version of the block size rounding for these purposes.
+ */
+static int
+iomap_flush_unmap_range(struct file *f, loff_t offset, loff_t len)
+{
+	struct inode *inode = file_inode(f);
+	loff_t rounding, start, end;
+	int ret;
+
+	rounding = max_t(loff_t, i_blocksize(inode), PAGE_SIZE);
+	start = round_down(offset, rounding);
+	end = round_up(offset + len, rounding) - 1;
+
+	ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
+	if (ret)
+		return ret;
+
+	/*
+	 * Try to invalidate cache pages for the range we're direct
+	 * writing.  If this invalidation fails, tough, the write will
+	 * still work, but racing two incompatible write paths is a
+	 * pretty crazy thing to do, so we don't support it 100%.
+	 */
+	ret = invalidate_inode_pages2_range(inode->i_mapping,
+			start >> PAGE_SHIFT, end >> PAGE_SHIFT);
+	if (ret)
+		dio_warn_stale_pagecache(f);
+	return 0;
+}
+
 /*
  * iomap_dio_rw() always completes O_[D]SYNC writes regardless of whether the IO
  * is being issued as AIO or not.  This allows us to optimise pure data writes
@@ -1829,22 +1874,10 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		flags |= IOMAP_NOWAIT;
 	}
 
-	ret = filemap_write_and_wait_range(mapping, start, end);
+	ret = iomap_flush_unmap_range(iocb->ki_filp, start, end);
 	if (ret)
 		goto out_free_dio;
 
-	/*
-	 * Try to invalidate cache pages for the range we're direct
-	 * writing.  If this invalidation fails, tough, the write will
-	 * still work, but racing two incompatible write paths is a
-	 * pretty crazy thing to do, so we don't support it 100%.
-	 */
-	ret = invalidate_inode_pages2_range(mapping,
-			start >> PAGE_SHIFT, end >> PAGE_SHIFT);
-	if (ret)
-		dio_warn_stale_pagecache(iocb->ki_filp);
-	ret = 0;
-
 	if (iov_iter_rw(iter) == WRITE && !dio->wait_for_completion &&
 	    !inode->i_sb->s_dio_done_wq) {
 		ret = sb_init_dio_done_wq(inode->i_sb);