diff mbox series

[PATCHv4,8/9] block: relax direct io memory alignment

Message ID 20220526010613.4016118-9-kbusch@fb.com (mailing list archive)
State New, archived
Headers show
Series direct io dma alignment | expand

Commit Message

Keith Busch May 26, 2022, 1:06 a.m. UTC
From: Keith Busch <kbusch@kernel.org>

Use the address alignment requirements from the hardware for direct io
instead of requiring addresses be aligned to the block size. User space
can discover the alignment requirements from the dma_alignment queue
attribute.

User space can specify any hardware compatible DMA offset for each
segment, but every segment length is still required to be a multiple of
the block size.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 block/bio.c  | 12 ++++++++++++
 block/fops.c | 14 +++++++++++---
 2 files changed, 23 insertions(+), 3 deletions(-)

Comments

Damien Le Moal May 26, 2022, 2:03 a.m. UTC | #1
On 2022/05/26 10:06, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> Use the address alignment requirements from the hardware for direct io
> instead of requiring addresses be aligned to the block size. User space
> can discover the alignment requirements from the dma_alignment queue
> attribute.
> 
> User space can specify any hardware compatible DMA offset for each
> segment, but every segment length is still required to be a multiple of
> the block size.
> 
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
>  block/bio.c  | 12 ++++++++++++
>  block/fops.c | 14 +++++++++++---
>  2 files changed, 23 insertions(+), 3 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 55d2a9c4e312..c492881959d1 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1219,7 +1219,19 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
>  	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
>  
> +	/*
> +	 * Each segment in the iov is required to be a block size multiple.
> +	 * However, we may not be able to get the entire segment if it spans
> +	 * more pages than bi_max_vecs allows, so we have to ALIGN_DOWN the
> +	 * result to ensure the bio's total size is correct. The remainder of
> +	 * the iov data will be picked up in the next bio iteration.
> +	 *
> +	 * If the result is ever 0, that indicates the iov fails the segment
> +	 * size requirement and is an error.
> +	 */
>  	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
> +	if (size > 0)
> +		size = ALIGN_DOWN(size, bdev_logical_block_size(bio->bi_bdev));
>  	if (unlikely(size <= 0))
>  		return size ? size : -EFAULT;
>  
> diff --git a/block/fops.c b/block/fops.c
> index bd6c2e13a4e3..6ecbccc552b9 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -45,10 +45,10 @@ static unsigned int dio_bio_write_op(struct kiocb *iocb)
>  static int blkdev_dio_aligned(struct block_device *bdev, loff_t pos,
>  			      struct iov_iter *iter)
>  {
> -	if ((pos | iov_iter_alignment(iter)) &
> -	    (bdev_logical_block_size(bdev) - 1))
> +	if ((pos | iov_iter_count(iter)) & (bdev_logical_block_size(bdev) - 1))
> +		return -EINVAL;
> +	if (iov_iter_alignment(iter) & bdev_dma_alignment(bdev))
>  		return -EINVAL;
> -
>  	return 0;
>  }
>  
> @@ -88,6 +88,10 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
>  	bio.bi_ioprio = iocb->ki_ioprio;
>  
>  	ret = bio_iov_iter_get_pages(&bio, iter);
> +
> +	/* check if iov is not aligned */
> +	if (unlikely(!ret && iov_iter_count(iter)))
> +		ret = -EINVAL;
>  	if (unlikely(ret))
>  		goto out;
>  	ret = bio.bi_iter.bi_size;
> @@ -333,6 +337,10 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
>  		bio_iov_bvec_set(bio, iter);
>  	} else {
>  		ret = bio_iov_iter_get_pages(bio, iter);
> +
> +		/* check if iov is not aligned */
> +		if (unlikely(!ret && iov_iter_count(iter)))
> +			ret = -EINVAL;
>  		if (unlikely(ret)) {
>  			bio_put(bio);
>  			return ret;

Looks OK to me.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Eric Biggers May 26, 2022, 7:29 a.m. UTC | #2
On Wed, May 25, 2022 at 06:06:12PM -0700, Keith Busch wrote:
> +	/*
> +	 * Each segment in the iov is required to be a block size multiple.

Where is this enforced?

- Eric
Keith Busch May 26, 2022, 1:50 p.m. UTC | #3
On Thu, May 26, 2022 at 12:29:41AM -0700, Eric Biggers wrote:
> On Wed, May 25, 2022 at 06:06:12PM -0700, Keith Busch wrote:
> > +	/*
> > +	 * Each segment in the iov is required to be a block size multiple.
> 
> Where is this enforced?

Right below the comment. If it isn't a block size multiple, then ALIGN_DOWN
will eventually result in 0 and -EFAULT is returned.
Eric Biggers May 26, 2022, 6:13 p.m. UTC | #4
On Thu, May 26, 2022 at 07:50:45AM -0600, Keith Busch wrote:
> On Thu, May 26, 2022 at 12:29:41AM -0700, Eric Biggers wrote:
> > On Wed, May 25, 2022 at 06:06:12PM -0700, Keith Busch wrote:
> > > +	/*
> > > +	 * Each segment in the iov is required to be a block size multiple.
> > 
> > Where is this enforced?
> 
> Right below the comment. If it isn't a block size multiple, then ALIGN_DOWN
> will eventually result in 0 and -EFAULT is returned. 

That's interesting, I would have expected it to be checked in
blkdev_dio_aligned().

EFAULT isn't the correct error code for this case; it should be EINVAL as is
normally the case for bad alignment.  See the man pages for read and write.

- Eric
Keith Busch May 26, 2022, 6:55 p.m. UTC | #5
On Thu, May 26, 2022 at 11:13:52AM -0700, Eric Biggers wrote:
> On Thu, May 26, 2022 at 07:50:45AM -0600, Keith Busch wrote:
> > On Thu, May 26, 2022 at 12:29:41AM -0700, Eric Biggers wrote:
> > > On Wed, May 25, 2022 at 06:06:12PM -0700, Keith Busch wrote:
> > > > +	/*
> > > > +	 * Each segment in the iov is required to be a block size multiple.
> > > 
> > > Where is this enforced?
> > 
> > Right below the comment. If it isn't a block size multiple, then ALIGN_DOWN
> > will eventually result in 0 and -EFAULT is returned. 
> 
> That's interesting, I would have expected it to be checked in
> blkdev_dio_aligned().

That would require a change to the iov_iter_alignment API, or a new function
entirely.
 
> EFAULT isn't the correct error code for this case; it should be EINVAL as is
> normally the case for bad alignment.  See the man pages for read and write.

The EFAULT was just to get the do-while loop to break out. The callers in this
patch still return EINVAL when it sees the iov_iter hasn't advanced to the end.

But there are some cases where the EFAULT would be returned to the user. Let me
see how difficult it would be catch it early in blkdev_dio_aligned().
Keith Busch May 26, 2022, 8:32 p.m. UTC | #6
On Thu, May 26, 2022 at 12:55:50PM -0600, Keith Busch wrote:
> 
> Let me see how difficult it would be catch it early in blkdev_dio_aligned().

Something like this appears to work:

---
diff --git a/block/fops.c b/block/fops.c
index 6ecbccc552b9..20763a143991 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -47,7 +47,8 @@ static int blkdev_dio_aligned(struct block_device *bdev, loff_t pos,
 {
 	if ((pos | iov_iter_count(iter)) & (bdev_logical_block_size(bdev) - 1))
 		return -EINVAL;
-	if (iov_iter_alignment(iter) & bdev_dma_alignment(bdev))
+	if (!iov_iter_aligned(iter, bdev_dma_alignment(bdev),
+			      bdev_logical_block_size(bdev) - 1))
 		return -EINVAL;
 	return 0;
 }
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 739285fe5a2f..a9cbf90d8dcd 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -219,6 +219,8 @@ size_t _copy_mc_to_iter(const void *addr, size_t bytes, struct iov_iter *i);
 #endif
 
 size_t iov_iter_zero(size_t bytes, struct iov_iter *);
+bool iov_iter_aligned(const struct iov_iter *i, unsigned addr_mask,
+			unsigned len_mask);
 unsigned long iov_iter_alignment(const struct iov_iter *i);
 unsigned long iov_iter_gap_alignment(const struct iov_iter *i);
 void iov_iter_init(struct iov_iter *i, unsigned int direction, const struct iovec *iov,
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 6dd5330f7a99..6a98bbd56408 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1268,6 +1268,89 @@ void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count)
 }
 EXPORT_SYMBOL(iov_iter_discard);
 
+static bool iov_iter_aligned_iovec(const struct iov_iter *i, unsigned addr_mask,
+				   unsigned len_mask)
+{
+	size_t size = i->count;
+	size_t skip = i->iov_offset;
+	unsigned k;
+
+	for (k = 0; k < i->nr_segs; k++, skip = 0) {
+		size_t len = i->iov[k].iov_len - skip;
+
+		if (len > size)
+			len = size;
+		if (len & len_mask)
+			return false;
+		if ((unsigned long)(i->iov[k].iov_base + skip) & addr_mask)
+			return false;
+
+		size -= len;
+		if (!size)
+			break;
+	}
+	return true;
+}
+
+static bool iov_iter_aligned_bvec(const struct iov_iter *i, unsigned addr_mask,
+				  unsigned len_mask)
+{
+	size_t size = i->count;
+	unsigned skip = i->iov_offset;
+	unsigned k;
+
+	for (k = 0; k < i->nr_segs; k++, skip = 0) {
+		size_t len = i->bvec[k].bv_len - skip;
+
+		if (len > size)
+			len = size;
+		if (len & len_mask)
+			return false;
+		if ((unsigned long)(i->bvec[k].bv_offset + skip) & addr_mask)
+			return false;
+
+		size -= len;
+		if (!size)
+			break;
+	}
+	return true;
+}
+
+bool iov_iter_aligned(const struct iov_iter *i, unsigned addr_mask,
+		      unsigned len_mask)
+{
+	if (likely(iter_is_iovec(i) || iov_iter_is_kvec(i)))
+		return iov_iter_aligned_iovec(i, addr_mask, len_mask);
+
+	if (iov_iter_is_bvec(i))
+		return iov_iter_aligned_bvec(i, addr_mask, len_mask);
+
+	if (iov_iter_is_pipe(i)) {
+		unsigned int p_mask = i->pipe->ring_size - 1;
+		size_t size = i->count;
+
+		if (size & len_mask)
+			return false;
+		if (size && allocated(&i->pipe->bufs[i->head & p_mask])) {
+			if (i->iov_offset & addr_mask)
+				return false;
+		}
+
+		return true;
+	}
+
+	if (iov_iter_is_xarray(i)) {
+		if (i->count & len_mask)
+			return false;
+		if ((i->xarray_start + i->iov_offset) & addr_mask)
+			return false;
+	}
+
+	return true;
+}
+EXPORT_SYMBOL_GPL(iov_iter_aligned);
+
 static unsigned long iov_iter_alignment_iovec(const struct iov_iter *i)
 {
 	unsigned long res = 0;
--
Christoph Hellwig May 31, 2022, 6:30 a.m. UTC | #7
On Thu, May 26, 2022 at 02:32:12PM -0600, Keith Busch wrote:
> On Thu, May 26, 2022 at 12:55:50PM -0600, Keith Busch wrote:
> > 
> > Let me see how difficult it would be catch it early in blkdev_dio_aligned().
> 
> Something like this appears to work:

This looks reasonable to me.  Nits below:

> +	if (!iov_iter_aligned(iter, bdev_dma_alignment(bdev),
> +			      bdev_logical_block_size(bdev) - 1))

This probably wants a helper so that it can also be reused by e.g.
iomap.
diff mbox series

Patch

diff --git a/block/bio.c b/block/bio.c
index 55d2a9c4e312..c492881959d1 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1219,7 +1219,19 @@  static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
 	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
 
+	/*
+	 * Each segment in the iov is required to be a block size multiple.
+	 * However, we may not be able to get the entire segment if it spans
+	 * more pages than bi_max_vecs allows, so we have to ALIGN_DOWN the
+	 * result to ensure the bio's total size is correct. The remainder of
+	 * the iov data will be picked up in the next bio iteration.
+	 *
+	 * If the result is ever 0, that indicates the iov fails the segment
+	 * size requirement and is an error.
+	 */
 	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
+	if (size > 0)
+		size = ALIGN_DOWN(size, bdev_logical_block_size(bio->bi_bdev));
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
 
diff --git a/block/fops.c b/block/fops.c
index bd6c2e13a4e3..6ecbccc552b9 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -45,10 +45,10 @@  static unsigned int dio_bio_write_op(struct kiocb *iocb)
 static int blkdev_dio_aligned(struct block_device *bdev, loff_t pos,
 			      struct iov_iter *iter)
 {
-	if ((pos | iov_iter_alignment(iter)) &
-	    (bdev_logical_block_size(bdev) - 1))
+	if ((pos | iov_iter_count(iter)) & (bdev_logical_block_size(bdev) - 1))
+		return -EINVAL;
+	if (iov_iter_alignment(iter) & bdev_dma_alignment(bdev))
 		return -EINVAL;
-
 	return 0;
 }
 
@@ -88,6 +88,10 @@  static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 	bio.bi_ioprio = iocb->ki_ioprio;
 
 	ret = bio_iov_iter_get_pages(&bio, iter);
+
+	/* check if iov is not aligned */
+	if (unlikely(!ret && iov_iter_count(iter)))
+		ret = -EINVAL;
 	if (unlikely(ret))
 		goto out;
 	ret = bio.bi_iter.bi_size;
@@ -333,6 +337,10 @@  static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 		bio_iov_bvec_set(bio, iter);
 	} else {
 		ret = bio_iov_iter_get_pages(bio, iter);
+
+		/* check if iov is not aligned */
+		if (unlikely(!ret && iov_iter_count(iter)))
+			ret = -EINVAL;
 		if (unlikely(ret)) {
 			bio_put(bio);
 			return ret;