diff mbox series

block: respect virtual boundary mask in bvecs

Message ID 20181105102301.9752-1-jthumshirn@suse.de (mailing list archive)
State New, archived
Headers show
Series block: respect virtual boundary mask in bvecs | expand

Commit Message

Johannes Thumshirn Nov. 5, 2018, 10:23 a.m. UTC
With drivers like iSer we are seeing a lot of bio splitting and smaller I/Os
being submitted to the driver.

The root cause of this issue that the virtual boundary mask code does not take
into consideration that some of the memory segments in the SG list may have
come from a huge memory page that is being managed in the SG list as 4K
blocks.  This means that many of the segments in the SG list will have an
offset into the page that is not 0 but will be a multiple of 4K.

Cc: Jan Kara <jack@suse.cz>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
---
 block/blk-merge.c | 2 +-
 block/blk.h       | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

Comments

Ming Lei Nov. 5, 2018, 10:55 a.m. UTC | #1
On Mon, Nov 05, 2018 at 11:23:01AM +0100, Johannes Thumshirn wrote:
> With drivers like iSer we are seeing a lot of bio splitting and smaller I/Os
> being submitted to the driver.
> 
> The root cause of this issue that the virtual boundary mask code does not take
> into consideration that some of the memory segments in the SG list may have
> come from a huge memory page that is being managed in the SG list as 4K

I guess you mean something like 64K PAGE_SIZE, instead of huge page.

> blocks.  This means that many of the segments in the SG list will have an
> offset into the page that is not 0 but will be a multiple of 4K.
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Sagi Grimberg <sagi@grimberg.me>
> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
> ---
>  block/blk-merge.c | 2 +-
>  block/blk.h       | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 6b5ad275ed56..208658a901c6 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -46,7 +46,7 @@ static inline bool bio_will_gap(struct request_queue *q,
>  		bio_get_first_bvec(prev_rq->bio, &pb);
>  	else
>  		bio_get_first_bvec(prev, &pb);
> -	if (pb.bv_offset)
> +	if (pb.bv_offset & queue_virt_boundary(q))
>  		return true;

The change should only make difference in case that PAGE_SIZE is bigger
than 4K. If yes, please add the description in commit log.


Thanks,
Ming
Johannes Thumshirn Nov. 5, 2018, 11:50 a.m. UTC | #2
On 05/11/2018 11:55, Ming Lei wrote:
> On Mon, Nov 05, 2018 at 11:23:01AM +0100, Johannes Thumshirn wrote:
>> With drivers like iSer we are seeing a lot of bio splitting and smaller I/Os
>> being submitted to the driver.
>>
>> The root cause of this issue that the virtual boundary mask code does not take
>> into consideration that some of the memory segments in the SG list may have
>> come from a huge memory page that is being managed in the SG list as 4K
> 
> I guess you mean something like 64K PAGE_SIZE, instead of huge page.

No I mean like a 2M page from an upper layer.
Ming Lei Nov. 5, 2018, 12:01 p.m. UTC | #3
On Mon, Nov 05, 2018 at 12:50:50PM +0100, Johannes Thumshirn wrote:
> On 05/11/2018 11:55, Ming Lei wrote:
> > On Mon, Nov 05, 2018 at 11:23:01AM +0100, Johannes Thumshirn wrote:
> >> With drivers like iSer we are seeing a lot of bio splitting and smaller I/Os
> >> being submitted to the driver.
> >>
> >> The root cause of this issue that the virtual boundary mask code does not take
> >> into consideration that some of the memory segments in the SG list may have
> >> come from a huge memory page that is being managed in the SG list as 4K
> > 
> > I guess you mean something like 64K PAGE_SIZE, instead of huge page.
> 
> No I mean like a 2M page from an upper layer.

If you mean the real huge page, this patch shouldn't have made a difference
because bio_vec->bv_offset is in [0, PAGE_SIZE), and iSer sets virt
boundary as 4K - 1.

However, things will change after multipage bvec is introduced.

Thanks,
Ming
Johannes Thumshirn Nov. 6, 2018, 12:34 p.m. UTC | #4
On 05/11/2018 13:01, Ming Lei wrote:
> If you mean the real huge page, this patch shouldn't have made a difference
> because bio_vec->bv_offset is in [0, PAGE_SIZE), and iSer sets virt
> boundary as 4K - 1.
> 
> However, things will change after multipage bvec is introduced.


Hi Ming,

I've received a blktrace from our customer showing the issue [1].

In this example trace they've submitted (contiguous) 64K I/Os and
without this patch, they're seeing a lot of splits as indicated by the
trace.

With the patch applied the I/O is directly issued to the LLDD without
the splits.

[1] http://beta.suse.com/private/jthumshirn/blktrace.txt

Byte,
	Johannes
Keith Busch Nov. 6, 2018, 2:31 p.m. UTC | #5
On Mon, Nov 05, 2018 at 11:23:01AM +0100, Johannes Thumshirn wrote:
> With drivers like iSer we are seeing a lot of bio splitting and smaller I/Os
> being submitted to the driver.
> 
> The root cause of this issue that the virtual boundary mask code does not take
> into consideration that some of the memory segments in the SG list may have
> come from a huge memory page that is being managed in the SG list as 4K
> blocks.  This means that many of the segments in the SG list will have an
> offset into the page that is not 0 but will be a multiple of 4K.

I probably got this wrong, but I thought a 2M huge page was 512 regular
pages with a compound head, and offsets were from those regular pages
rather than from the head.

Overall though, the patch makes sense to me for this and other reasons.

Acked-by: Keith Busch <keith.busch@intel.com>

 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Sagi Grimberg <sagi@grimberg.me>
> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
> ---
>  block/blk-merge.c | 2 +-
>  block/blk.h       | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 6b5ad275ed56..208658a901c6 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -46,7 +46,7 @@ static inline bool bio_will_gap(struct request_queue *q,
>  		bio_get_first_bvec(prev_rq->bio, &pb);
>  	else
>  		bio_get_first_bvec(prev, &pb);
> -	if (pb.bv_offset)
> +	if (pb.bv_offset & queue_virt_boundary(q))
>  		return true;
>  
>  	/*
> diff --git a/block/blk.h b/block/blk.h
> index a1841b8ff129..c85e53f21cdd 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -169,7 +169,7 @@ static inline bool biovec_phys_mergeable(struct request_queue *q,
>  static inline bool __bvec_gap_to_prev(struct request_queue *q,
>  		struct bio_vec *bprv, unsigned int offset)
>  {
> -	return offset ||
> +	return (offset & queue_virt_boundary(q)) ||
>  		((bprv->bv_offset + bprv->bv_len) & queue_virt_boundary(q));
>  }
>  
> --
Bart Van Assche Nov. 6, 2018, 2:53 p.m. UTC | #6
On 11/5/18 2:23 AM, Johannes Thumshirn wrote:
> @@ -169,7 +169,7 @@ static inline bool biovec_phys_mergeable(struct request_queue *q,
>   static inline bool __bvec_gap_to_prev(struct request_queue *q,
>   		struct bio_vec *bprv, unsigned int offset)
>   {
> -	return offset ||
> +	return (offset & queue_virt_boundary(q)) ||
>   		((bprv->bv_offset + bprv->bv_len) & queue_virt_boundary(q));
>   }

How about changing that expression into the following to make it easier 
for the compiler to optimize this code?

(offset | (bprv->bv_offset + bprv->bv_len)) & queue_virt_boundary(q)

Thanks,

Bart.
Johannes Thumshirn Nov. 6, 2018, 2:56 p.m. UTC | #7
On 06/11/2018 15:53, Bart Van Assche wrote:
> How about changing that expression into the following to make it easier
> for the compiler to optimize this code?
> 
> (offset | (bprv->bv_offset + bprv->bv_len)) & queue_virt_boundary(q)

Uhm I have to admit I'm not really able to parse the above expression.
Sure GCC will do it but I think it's less readable (at least for me).
Let's see what other's think.

Thanks,
	Johannes
Ming Lei Nov. 6, 2018, 2:56 p.m. UTC | #8
On Tue, Nov 6, 2018 at 8:35 PM Johannes Thumshirn <jthumshirn@suse.de> wrote:
>
> On 05/11/2018 13:01, Ming Lei wrote:
> > If you mean the real huge page, this patch shouldn't have made a difference
> > because bio_vec->bv_offset is in [0, PAGE_SIZE), and iSer sets virt
> > boundary as 4K - 1.
> >
> > However, things will change after multipage bvec is introduced.
>
>
> Hi Ming,
>
> I've received a blktrace from our customer showing the issue [1].
>
> In this example trace they've submitted (contiguous) 64K I/Os and
> without this patch, they're seeing a lot of splits as indicated by the
> trace.
>
> With the patch applied the I/O is directly issued to the LLDD without
> the splits.
>
> [1] http://beta.suse.com/private/jthumshirn/blktrace.txt

blktrace won't help on this issue because .bv_offset isn't recorded.

This patch makes sense on >4KB PAGE_SIZE. If your issue happens on
ARCH with 4K PAGE_SIZE, maybe you should root cause why it makes a
difference on iSer. And it is highly possible there is bug somewhere.

As I mentioned, the description of huge page part in the commit log is
misleading,
and it has to be fixed. Otherwise, the patch itself is fine:

Reviewed-by: Ming Lei <ming.lei@redhat.com>

Thanks
Ming Lei
Keith Busch Nov. 6, 2018, 3:14 p.m. UTC | #9
On Tue, Nov 06, 2018 at 10:56:52PM +0800, Ming Lei wrote:
> 
> This patch makes sense on >4KB PAGE_SIZE. 

Granted existing blk_queue_virt_boundary() users all appear to be at
least 4k, the block API doesn't enforce that. Some unlikely device may
require an even smaller alignment in the future, which would benefit
from this patch.

But yes, the changelog description is a bit confusing to me if the
sighting came from a 4K arch.
Sagi Grimberg Nov. 7, 2018, 3:11 a.m. UTC | #10
>> If you mean the real huge page, this patch shouldn't have made a difference
>> because bio_vec->bv_offset is in [0, PAGE_SIZE), and iSer sets virt
>> boundary as 4K - 1.
>>
>> However, things will change after multipage bvec is introduced.
> 
> 
> Hi Ming,
> 
> I've received a blktrace from our customer showing the issue [1].
> 
> In this example trace they've submitted (contiguous) 64K I/Os and
> without this patch, they're seeing a lot of splits as indicated by the
> trace.
> 
> With the patch applied the I/O is directly issued to the LLDD without
> the splits.
> 
> [1] http://beta.suse.com/private/jthumshirn/blktrace.txt
> 

This patch makes sense to me. However I agree that we need to clarify
that this addresses cases where PAGE_SIZE-1 > queue_virt_boundary(q).

And also, lets replace "like iser" with "drivers that sets a virtual
boundary constraint" as nvme-rdma does the same thing.
Sagi Grimberg Nov. 7, 2018, 3:14 a.m. UTC | #11
> blktrace won't help on this issue because .bv_offset isn't recorded.
> 
> This patch makes sense on >4KB PAGE_SIZE. If your issue happens on
> ARCH with 4K PAGE_SIZE, maybe you should root cause why it makes a
> difference on iSer. And it is highly possible there is bug somewhere.

I don't suspect that this is the case. unless all the vec elements
happen to have an offset but if that is the case I don't think that
this patch would've solved this.

It can't really be an iser issue because iser does not control bio
splitting/merging... it can only ask the block layer to obay the
virt_boundary.
Sagi Grimberg Nov. 7, 2018, 3:30 a.m. UTC | #12
>> How about changing that expression into the following to make it easier
>> for the compiler to optimize this code?
>>
>> (offset | (bprv->bv_offset + bprv->bv_len)) & queue_virt_boundary(q)
> 
> Uhm I have to admit I'm not really able to parse the above expression.
> Sure GCC will do it but I think it's less readable (at least for me).
> Let's see what other's think.

I personally not a huge fan of decoding complicated expressions. But if
others are fine with it then I am too...
Bart Van Assche Nov. 7, 2018, 4:10 p.m. UTC | #13
On Tue, 2018-11-06 at 19:30 -0800, Sagi Grimberg wrote:
> > > How about changing that expression into the following to make it easier
> > > for the compiler to optimize this code?
> > > 
> > > (offset | (bprv->bv_offset + bprv->bv_len)) & queue_virt_boundary(q)
> > 
> > Uhm I have to admit I'm not really able to parse the above expression.
> > Sure GCC will do it but I think it's less readable (at least for me).
> > Let's see what other's think.
> 
> I personally not a huge fan of decoding complicated expressions. But if
> others are fine with it then I am too...

What I proposed is not a new pattern. It is a pattern that is already used
elsewhere in the Linux kernel. A few examples:

From dmabounce.c:

		/* Figure out if we need to bounce from the DMA mask. */
		if ((dma_addr | (dma_addr + size - 1)) & ~mask)
			return 1;

From dma-direct.h:

	if ((addr | (addr + size - 1)) & ~mask)
		return 0;

Bart.
Christoph Hellwig Nov. 8, 2018, 3:04 p.m. UTC | #14
On Wed, Nov 07, 2018 at 08:10:19AM -0800, Bart Van Assche wrote:
> > I personally not a huge fan of decoding complicated expressions. But if
> > others are fine with it then I am too...
> 
> What I proposed is not a new pattern. It is a pattern that is already used
> elsewhere in the Linux kernel. A few examples:
> 
> >From dmabounce.c:
> 
> 		/* Figure out if we need to bounce from the DMA mask. */
> 		if ((dma_addr | (dma_addr + size - 1)) & ~mask)
> 			return 1;
> 
> >From dma-direct.h:
> 
> 	if ((addr | (addr + size - 1)) & ~mask)
> 		return 0;

Time for a well documented helper :)
diff mbox series

Patch

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 6b5ad275ed56..208658a901c6 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -46,7 +46,7 @@  static inline bool bio_will_gap(struct request_queue *q,
 		bio_get_first_bvec(prev_rq->bio, &pb);
 	else
 		bio_get_first_bvec(prev, &pb);
-	if (pb.bv_offset)
+	if (pb.bv_offset & queue_virt_boundary(q))
 		return true;
 
 	/*
diff --git a/block/blk.h b/block/blk.h
index a1841b8ff129..c85e53f21cdd 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -169,7 +169,7 @@  static inline bool biovec_phys_mergeable(struct request_queue *q,
 static inline bool __bvec_gap_to_prev(struct request_queue *q,
 		struct bio_vec *bprv, unsigned int offset)
 {
-	return offset ||
+	return (offset & queue_virt_boundary(q)) ||
 		((bprv->bv_offset + bprv->bv_len) & queue_virt_boundary(q));
 }