Message ID | CACVXFVNAkf5+XFyf2JkcfeEPO2sv9bqHMR+N+=kKFzrtXpFEnA@mail.gmail.com (mailing list archive) |
---|---|
State | Not Applicable |
> Hi Sagi,

Hey,

>> But I don't think simply not cloning the biovecs is the right thing
>> to do in the end. This must be something with the bvec iterators.
>
> I agree with Christoph, and there might be issues somewhere.

Me too, it was just an isolation step...

>> From the log:
>> iser: sg[0] dma_addr:0x85FC06000 off:0x0 sz:0x200 dma_len:0x200
>> iser: sg[1] dma_addr:0x860334000 off:0x0 sz:0x200 dma_len:0x200 <-- gap
>> iser: sg[2] dma_addr:0x860335000 off:0x0 sz:0x200 dma_len:0x200 <-- gap
>
> The above gaps shouldn't appear, since blk_bio_segment_split() splits
> out a new bio when a gap is detected.
>
> Something like the following code can be added in the driver or in the
> prep_fn to check whether the bvecs of the rq are correct:
>
> rq_for_each_segment(bvec, sc->request, iter) {
>         /* check if there is a gap between bvecs */
> }

I added this indication and the gap detection does trigger a bio
split.

> I don't know how to use iser, but everything looks fine after I set
> the virt boundary to 4095 for null_blk with the attached patch.

That's probably because it's artificial and there is no HW with a real
limitation...

>> Full quote for Ming:
>>
>> On Sun, Feb 14, 2016 at 04:02:18PM +0200, Sagi Grimberg wrote:
>>>
>>>>> I'm bisecting now, there are a couple of patches from Ming in
>>>>> the area of the bio splitting code...
>>>>>
>>>>> CC'ing Ming, Linux-block and Linux-nvme as iser is identical to nvme
>>>>> wrt the virtual boundary so I think nvme will break as well.
>
> The bisected commit was merged in v4.3, and there seems to be no such
> report from nvme.

I'm wondering how that can be... iser is clearly seeing gaps and, like
nvme, it can't handle them. Maybe this is scsi-specific?
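Fleshed out, the check suggested above might look like the sketch below
(assuming a v4.4-era kernel; the helper name rq_has_bvec_gap() is
hypothetical, while rq_for_each_segment() and bvec_gap_to_prev() are
existing block layer helpers). In the log above, sg[0] ends 0x200 bytes
into a page rather than on the 4095 virt boundary, so the transition to
sg[1] is exactly the kind of gap this test flags:

static bool rq_has_bvec_gap(struct request *rq)
{
	struct req_iterator iter;
	struct bio_vec bvec, prev;
	bool has_prev = false;

	rq_for_each_segment(bvec, rq, iter) {
		/*
		 * A gap exists when the previous bvec does not end on
		 * the queue's virt boundary or the current one does
		 * not start on it.
		 */
		if (has_prev &&
		    bvec_gap_to_prev(rq->q, &prev, bvec.bv_offset))
			return true;
		prev = bvec;
		has_prev = true;
	}
	return false;
}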
On Mon, Feb 15, 2016 at 12:29 AM, Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
>> Hi Sagi,
>
> Hey,
>
>>> But I don't think simply not cloning the biovecs is the right thing
>>> to do in the end. This must be something with the bvec iterators.
>>
>> I agree with Christoph, and there might be issues somewhere.
>
> Me too, it was just an isolation step...
>
>>> From the log:
>>> iser: sg[0] dma_addr:0x85FC06000 off:0x0 sz:0x200 dma_len:0x200
>>> iser: sg[1] dma_addr:0x860334000 off:0x0 sz:0x200 dma_len:0x200 <-- gap
>>> iser: sg[2] dma_addr:0x860335000 off:0x0 sz:0x200 dma_len:0x200 <-- gap
>>
>> The above gaps shouldn't appear, since blk_bio_segment_split() splits
>> out a new bio when a gap is detected.
>>
>> Something like the following code can be added in the driver or in the
>> prep_fn to check whether the bvecs of the rq are correct:
>>
>> rq_for_each_segment(bvec, sc->request, iter) {
>>         /* check if there is a gap between bvecs */
>> }
>
> I added this indication and the gap detection does trigger a bio
> split.

Then I suggest adding a similar check inside blk_queue_split() as well,
to see whether the resulting bio (the 'res' bio in that function) has a
gap.

>> I don't know how to use iser, but everything looks fine after I set
>> the virt boundary to 4095 for null_blk with the attached patch.
>
> That's probably because it's artificial and there is no HW with a real
> limitation...

I did check the trace_printk log, and basically one rq includes only a
single segment if I set bs to 512 in the fio test.

Thanks,

>>> Full quote for Ming:
>>>
>>> On Sun, Feb 14, 2016 at 04:02:18PM +0200, Sagi Grimberg wrote:
>>>>
>>>>>> I'm bisecting now, there are a couple of patches from Ming in
>>>>>> the area of the bio splitting code...
>>>>>>
>>>>>> CC'ing Ming, Linux-block and Linux-nvme as iser is identical to nvme
>>>>>> wrt the virtual boundary so I think nvme will break as well.
>>
>> The bisected commit was merged in v4.3, and there seems to be no such
>> report from nvme.
>
> I'm wondering how that can be... iser is clearly seeing gaps and, like
> nvme, it can't handle them. Maybe this is scsi-specific?
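One way to wire up that suggestion, sketched against the v4.4-era
blk_queue_split() signature (bio_has_gap() is a hypothetical debug
helper; bio_for_each_segment() and bvec_gap_to_prev() are existing
helpers):

/* Hypothetical debug helper: scan one bio for virt-boundary gaps. */
static bool bio_has_gap(struct request_queue *q, struct bio *bio)
{
	struct bvec_iter iter;
	struct bio_vec bv, prev;
	bool has_prev = false;

	bio_for_each_segment(bv, bio, iter) {
		if (has_prev && bvec_gap_to_prev(q, &prev, bv.bv_offset))
			return true;
		prev = bv;
		has_prev = true;
	}
	return false;
}

	/* e.g. in the caller, right after blk_queue_split(q, &bio, bs): */
	WARN_ONCE(bio_has_gap(q, bio),
		  "bio still has a virt-boundary gap after split\n");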
On Mon, Feb 15, 2016 at 12:29 AM, Sagi Grimberg <sagig@dev.mellanox.co.il> wrote:
>> Hi Sagi,
>
> Hey,
>
>>> But I don't think simply not cloning the biovecs is the right thing
>>> to do in the end. This must be something with the bvec iterators.
>>
>> I agree with Christoph, and there might be issues somewhere.
>
> Me too, it was just an isolation step...
>
>>> From the log:
>>> iser: sg[0] dma_addr:0x85FC06000 off:0x0 sz:0x200 dma_len:0x200
>>> iser: sg[1] dma_addr:0x860334000 off:0x0 sz:0x200 dma_len:0x200 <-- gap
>>> iser: sg[2] dma_addr:0x860335000 off:0x0 sz:0x200 dma_len:0x200 <-- gap
>>
>> The above gaps shouldn't appear, since blk_bio_segment_split() splits
>> out a new bio when a gap is detected.
>>
>> Something like the following code can be added in the driver or in the
>> prep_fn to check whether the bvecs of the rq are correct:
>>
>> rq_for_each_segment(bvec, sc->request, iter) {
>>         /* check if there is a gap between bvecs */
>> }
>
> I added this indication and the gap detection does trigger a bio
> split.
>
>> I don't know how to use iser, but everything looks fine after I set
>> the virt boundary to 4095 for null_blk with the attached patch.
>
> That's probably because it's artificial and there is no HW with a real
> limitation...
>
>>> Full quote for Ming:
>>>
>>> On Sun, Feb 14, 2016 at 04:02:18PM +0200, Sagi Grimberg wrote:
>>>>
>>>>>> I'm bisecting now, there are a couple of patches from Ming in
>>>>>> the area of the bio splitting code...
>>>>>>
>>>>>> CC'ing Ming, Linux-block and Linux-nvme as iser is identical to nvme
>>>>>> wrt the virtual boundary so I think nvme will break as well.
>>
>> The bisected commit was merged in v4.3, and there seems to be no such
>> report from nvme.
>
> I'm wondering how that can be... iser is clearly seeing gaps and, like
> nvme, it can't handle them. Maybe this is scsi-specific?

I can reproduce the issue now, and it is easy to trigger via your test
code on a scsi device, but a bit difficult to get on null_blk.

It turns out to be a block core issue, in bio_will_gap(), which gets
the last bvec via 'bi_io_vec[prev->bi_vcnt - 1]' directly. I have
posted a patchset fixing the issue:

http://marc.info/?l=linux-kernel&m=145551975429092&w=2

Thanks,
Ming
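The failure mode Ming points at: a bio produced by splitting shares
bi_io_vec[] with its parent, and bi_vcnt is not adjusted for the child,
so bi_io_vec[bi_vcnt - 1] can name a bvec beyond the child's real
extent, which is defined by bi_iter. Fetching the true last bvec through
the iterator looks roughly like this (an illustrative sketch only; see
the linked patchset for the actual fix):

static void get_last_bvec(struct bio *bio, struct bio_vec *bv)
{
	struct bvec_iter iter;
	struct bio_vec tmp;

	/*
	 * Walk the bio through its iterator so the result honours
	 * bi_iter's current offset and size instead of trusting the
	 * raw vector table, which a split bio shares with its parent.
	 */
	bio_for_each_segment(tmp, bio, iter)
		*bv = tmp;
}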
> I can reproduce the issue now, and it is easy to trigger via your test
> code on a scsi device, but a bit difficult to get on null_blk.
>
> It turns out to be a block core issue, in bio_will_gap(), which gets
> the last bvec via 'bi_io_vec[prev->bi_vcnt - 1]' directly. I have
> posted a patchset fixing the issue:
>
> http://marc.info/?l=linux-kernel&m=145551975429092&w=2

Thanks Ming! I'll give it a shot...
diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 8ba1e97..5328f3c 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -10,6 +10,8 @@
 #include <linux/hrtimer.h>
 #include <linux/lightnvm.h>
 
+#define MAX_SG 128
+
 struct nullb_cmd {
 	struct list_head list;
 	struct llist_node ll_list;
@@ -19,6 +21,7 @@ struct nullb_cmd {
 	unsigned int tag;
 	struct nullb_queue *nq;
 	struct hrtimer timer;
+	struct scatterlist sg[MAX_SG];
 };
 
 struct nullb_queue {
@@ -351,6 +354,23 @@ static void null_request_fn(struct request_queue *q)
 	}
 }
 
+static int handle_sg(struct nullb_cmd *cmd)
+{
+	int num, i;
+	struct request *rq = cmd->rq;
+
+	num = blk_rq_map_sg(rq->q, rq, cmd->sg);
+	trace_printk("%s dump sg for %p(%d)\n", __func__, rq, rq->tag);
+	for (i = 0; i < num; i++) {
+		struct scatterlist *sg = &cmd->sg[i];
+		trace_printk("\t %4d: %u %u %llu\n",
+			i, sg->offset, sg->length,
+			(unsigned long long)sg->dma_address);
+	}
+
+	return 0;
+}
+
 static int null_queue_rq(struct blk_mq_hw_ctx *hctx,
 		const struct blk_mq_queue_data *bd)
 {
@@ -365,6 +385,8 @@ static int null_queue_rq(struct blk_mq_hw_ctx *hctx,
 
 	blk_mq_start_request(bd->rq);
 
+	handle_sg(cmd);
+
 	null_handle_cmd(cmd);
 	return BLK_MQ_RQ_QUEUE_OK;
 }
@@ -707,6 +729,8 @@ static int null_add_dev(void)
 
 	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, nullb->q);
 	queue_flag_clear_unlocked(QUEUE_FLAG_ADD_RANDOM, nullb->q);
+	blk_queue_virt_boundary(nullb->q, (1UL << 12) - 1);
+	blk_queue_max_segments(nullb->q, MAX_SG);
 
 	mutex_lock(&lock);
 	list_add_tail(&nullb->list, &nullb_list);