| Message ID | 20181105102301.9752-1-jthumshirn@suse.de (mailing list archive) |
|---|---|
| State | New, archived |
| Series | block: respect virtual boundary mask in bvecs |
On Mon, Nov 05, 2018 at 11:23:01AM +0100, Johannes Thumshirn wrote:
> With drivers like iSer we are seeing a lot of bio splitting and smaller I/Os
> being submitted to the driver.
>
> The root cause of this issue that the virtual boundary mask code does not take
> into consideration that some of the memory segments in the SG list may have
> come from a huge memory page that is being managed in the SG list as 4K

I guess you mean something like 64K PAGE_SIZE, instead of huge page.

> blocks. This means that many of the segments in the SG list will have an
> offset into the page that is not 0 but will be a multiple of 4K.
>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Sagi Grimberg <sagi@grimberg.me>
> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
> ---
>  block/blk-merge.c | 2 +-
>  block/blk.h       | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 6b5ad275ed56..208658a901c6 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -46,7 +46,7 @@ static inline bool bio_will_gap(struct request_queue *q,
>  		bio_get_first_bvec(prev_rq->bio, &pb);
>  	else
>  		bio_get_first_bvec(prev, &pb);
> -	if (pb.bv_offset)
> +	if (pb.bv_offset & queue_virt_boundary(q))
>  		return true;

The change should only make a difference when PAGE_SIZE is bigger than
4K. If so, please add that description to the commit log.

Thanks,
Ming
On 05/11/2018 11:55, Ming Lei wrote:
> On Mon, Nov 05, 2018 at 11:23:01AM +0100, Johannes Thumshirn wrote:
>> With drivers like iSer we are seeing a lot of bio splitting and smaller I/Os
>> being submitted to the driver.
>>
>> The root cause of this issue that the virtual boundary mask code does not take
>> into consideration that some of the memory segments in the SG list may have
>> come from a huge memory page that is being managed in the SG list as 4K
>
> I guess you mean something like 64K PAGE_SIZE, instead of huge page.

No, I mean like a 2M page from an upper layer.
On Mon, Nov 05, 2018 at 12:50:50PM +0100, Johannes Thumshirn wrote:
> On 05/11/2018 11:55, Ming Lei wrote:
> > On Mon, Nov 05, 2018 at 11:23:01AM +0100, Johannes Thumshirn wrote:
> >> With drivers like iSer we are seeing a lot of bio splitting and smaller I/Os
> >> being submitted to the driver.
> >>
> >> The root cause of this issue that the virtual boundary mask code does not take
> >> into consideration that some of the memory segments in the SG list may have
> >> come from a huge memory page that is being managed in the SG list as 4K
> >
> > I guess you mean something like 64K PAGE_SIZE, instead of huge page.
>
> No I mean like a 2M page from an upper layer.

If you mean a real huge page, this patch shouldn't have made a difference,
because bio_vec->bv_offset is in [0, PAGE_SIZE), and iSer sets the virt
boundary as 4K - 1.

However, things will change after multipage bvec is introduced.

Thanks,
Ming
On 05/11/2018 13:01, Ming Lei wrote:
> If you mean the real huge page, this patch shouldn't have made a difference
> because bio_vec->bv_offset is in [0, PAGE_SIZE), and iSer sets virt
> boundary as 4K - 1.
>
> However, things will change after multipage bvec is introduced.

Hi Ming,

I've received a blktrace from our customer showing the issue [1].

In this example trace they've submitted (contiguous) 64K I/Os and
without this patch, they're seeing a lot of splits as indicated by the
trace.

With the patch applied the I/O is directly issued to the LLDD without
the splits.

[1] http://beta.suse.com/private/jthumshirn/blktrace.txt

Byte,
	Johannes
On Mon, Nov 05, 2018 at 11:23:01AM +0100, Johannes Thumshirn wrote:
> With drivers like iSer we are seeing a lot of bio splitting and smaller I/Os
> being submitted to the driver.
>
> The root cause of this issue that the virtual boundary mask code does not take
> into consideration that some of the memory segments in the SG list may have
> come from a huge memory page that is being managed in the SG list as 4K
> blocks. This means that many of the segments in the SG list will have an
> offset into the page that is not 0 but will be a multiple of 4K.

I probably got this wrong, but I thought a 2M huge page was 512 regular
pages with a compound head, and offsets were from those regular pages
rather than from the head.

Overall though, the patch makes sense to me for this and other reasons.

Acked-by: Keith Busch <keith.busch@intel.com>

> Cc: Jan Kara <jack@suse.cz>
> Cc: Sagi Grimberg <sagi@grimberg.me>
> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
> ---
>  block/blk-merge.c | 2 +-
>  block/blk.h       | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 6b5ad275ed56..208658a901c6 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -46,7 +46,7 @@ static inline bool bio_will_gap(struct request_queue *q,
>  		bio_get_first_bvec(prev_rq->bio, &pb);
>  	else
>  		bio_get_first_bvec(prev, &pb);
> -	if (pb.bv_offset)
> +	if (pb.bv_offset & queue_virt_boundary(q))
>  		return true;
>
>  	/*
> diff --git a/block/blk.h b/block/blk.h
> index a1841b8ff129..c85e53f21cdd 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -169,7 +169,7 @@ static inline bool biovec_phys_mergeable(struct request_queue *q,
>  static inline bool __bvec_gap_to_prev(struct request_queue *q,
>  		struct bio_vec *bprv, unsigned int offset)
>  {
> -	return offset ||
> +	return (offset & queue_virt_boundary(q)) ||
>  		((bprv->bv_offset + bprv->bv_len) & queue_virt_boundary(q));
>  }
>
> --
On 11/5/18 2:23 AM, Johannes Thumshirn wrote:
> @@ -169,7 +169,7 @@ static inline bool biovec_phys_mergeable(struct request_queue *q,
>  static inline bool __bvec_gap_to_prev(struct request_queue *q,
>  		struct bio_vec *bprv, unsigned int offset)
>  {
> -	return offset ||
> +	return (offset & queue_virt_boundary(q)) ||
>  		((bprv->bv_offset + bprv->bv_len) & queue_virt_boundary(q));
>  }

How about changing that expression into the following to make it easier
for the compiler to optimize this code?

	(offset | (bprv->bv_offset + bprv->bv_len)) & queue_virt_boundary(q)

Thanks,

Bart.
On 06/11/2018 15:53, Bart Van Assche wrote:
> How about changing that expression into the following to make it easier
> for the compiler to optimize this code?
>
> (offset | (bprv->bv_offset + bprv->bv_len)) & queue_virt_boundary(q)

Uhm, I have to admit I'm not really able to parse the above expression.
Sure, GCC will do it, but I think it's less readable (at least for me).
Let's see what others think.

Thanks,
	Johannes
On Tue, Nov 6, 2018 at 8:35 PM Johannes Thumshirn <jthumshirn@suse.de> wrote:
>
> On 05/11/2018 13:01, Ming Lei wrote:
> > If you mean the real huge page, this patch shouldn't have made a difference
> > because bio_vec->bv_offset is in [0, PAGE_SIZE), and iSer sets virt
> > boundary as 4K - 1.
> >
> > However, things will change after multipage bvec is introduced.
>
> Hi Ming,
>
> I've received a blktrace from our customer showing the issue [1].
>
> In this example trace they've submitted (contiguous) 64K I/Os and
> without this patch, they're seeing a lot of splits as indicated by the
> trace.
>
> With the patch applied the I/O is directly issued to the LLDD without
> the splits.
>
> [1] http://beta.suse.com/private/jthumshirn/blktrace.txt

blktrace won't help on this issue because .bv_offset isn't recorded.

This patch makes sense on >4KB PAGE_SIZE. If your issue happens on an
arch with 4K PAGE_SIZE, maybe you should root cause why it makes a
difference on iSer; it is highly possible there is a bug somewhere.

As I mentioned, the huge page part of the commit log description is
misleading, and it has to be fixed. Otherwise, the patch itself is fine:

Reviewed-by: Ming Lei <ming.lei@redhat.com>

Thanks,
Ming Lei
On Tue, Nov 06, 2018 at 10:56:52PM +0800, Ming Lei wrote:
>
> This patch makes sense on >4KB PAGE_SIZE.

Granted, existing blk_queue_virt_boundary() users all appear to be at
least 4k, but the block API doesn't enforce that. Some unlikely device
may require an even smaller alignment in the future, which would
benefit from this patch.

But yes, the changelog description is a bit confusing to me if the
sighting came from a 4K arch.
>> If you mean the real huge page, this patch shouldn't have made a difference
>> because bio_vec->bv_offset is in [0, PAGE_SIZE), and iSer sets virt
>> boundary as 4K - 1.
>>
>> However, things will change after multipage bvec is introduced.
>
> Hi Ming,
>
> I've received a blktrace from our customer showing the issue [1].
>
> In this example trace they've submitted (contiguous) 64K I/Os and
> without this patch, they're seeing a lot of splits as indicated by the
> trace.
>
> With the patch applied the I/O is directly issued to the LLDD without
> the splits.
>
> [1] http://beta.suse.com/private/jthumshirn/blktrace.txt

This patch makes sense to me. However, I agree that we need to clarify
that this addresses cases where PAGE_SIZE - 1 > queue_virt_boundary(q).

Also, let's replace "like iSer" with "drivers that set a virtual
boundary constraint", as nvme-rdma does the same thing.
> blktrace won't help on this issue because .bv_offset isn't recorded.
>
> This patch makes sense on >4KB PAGE_SIZE. If your issue happens on
> ARCH with 4K PAGE_SIZE, maybe you should root cause why it makes a
> difference on iSer. And it is highly possible there is bug somewhere.

I don't suspect that this is the case. Unless all the vec elements
happen to have an offset, but if that were the case I don't think this
patch would have solved it.

It can't really be an iSer issue, because iSer does not control bio
splitting/merging; it can only ask the block layer to obey the
virt_boundary.
>> How about changing that expression into the following to make it easier
>> for the compiler to optimize this code?
>>
>> (offset | (bprv->bv_offset + bprv->bv_len)) & queue_virt_boundary(q)
>
> Uhm I have to admit I'm not really able to parse the above expression.
> Sure GCC will do it but I think it's less readable (at least for me).
> Let's see what other's think.

I'm personally not a huge fan of decoding complicated expressions. But
if others are fine with it, then I am too...
On Tue, 2018-11-06 at 19:30 -0800, Sagi Grimberg wrote:
> > > How about changing that expression into the following to make it easier
> > > for the compiler to optimize this code?
> > >
> > > (offset | (bprv->bv_offset + bprv->bv_len)) & queue_virt_boundary(q)
> >
> > Uhm I have to admit I'm not really able to parse the above expression.
> > Sure GCC will do it but I think it's less readable (at least for me).
> > Let's see what other's think.
>
> I personally not a huge fan of decoding complicated expressions. But if
> others are fine with it then I am too...

What I proposed is not a new pattern. It is a pattern that is already
used elsewhere in the Linux kernel. A few examples:

From dmabounce.c:

	/* Figure out if we need to bounce from the DMA mask. */
	if ((dma_addr | (dma_addr + size - 1)) & ~mask)
		return 1;

From dma-direct.h:

	if ((addr | (addr + size - 1)) & ~mask)
		return 0;

Bart.
On Wed, Nov 07, 2018 at 08:10:19AM -0800, Bart Van Assche wrote:
> > I personally not a huge fan of decoding complicated expressions. But if
> > others are fine with it then I am too...
>
> What I proposed is not a new pattern. It is a pattern that is already used
> elsewhere in the Linux kernel. A few examples:
>
> From dmabounce.c:
>
> 	/* Figure out if we need to bounce from the DMA mask. */
> 	if ((dma_addr | (dma_addr + size - 1)) & ~mask)
> 		return 1;
>
> From dma-direct.h:
>
> 	if ((addr | (addr + size - 1)) & ~mask)
> 		return 0;

Time for a well documented helper :)
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 6b5ad275ed56..208658a901c6 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -46,7 +46,7 @@ static inline bool bio_will_gap(struct request_queue *q,
 		bio_get_first_bvec(prev_rq->bio, &pb);
 	else
 		bio_get_first_bvec(prev, &pb);
-	if (pb.bv_offset)
+	if (pb.bv_offset & queue_virt_boundary(q))
 		return true;

 	/*
diff --git a/block/blk.h b/block/blk.h
index a1841b8ff129..c85e53f21cdd 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -169,7 +169,7 @@ static inline bool biovec_phys_mergeable(struct request_queue *q,
 static inline bool __bvec_gap_to_prev(struct request_queue *q,
 		struct bio_vec *bprv, unsigned int offset)
 {
-	return offset ||
+	return (offset & queue_virt_boundary(q)) ||
 		((bprv->bv_offset + bprv->bv_len) & queue_virt_boundary(q));
 }
With drivers like iSer we are seeing a lot of bio splitting and smaller
I/Os being submitted to the driver.

The root cause of this issue is that the virtual boundary mask code does
not take into consideration that some of the memory segments in the SG
list may have come from a huge memory page that is being managed in the
SG list as 4K blocks. This means that many of the segments in the SG
list will have an offset into the page that is not 0 but will be a
multiple of 4K.

Cc: Jan Kara <jack@suse.cz>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
---
 block/blk-merge.c | 2 +-
 block/blk.h       | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)