
4.5-rc iser issues

Message ID CACVXFVNAkf5+XFyf2JkcfeEPO2sv9bqHMR+N+=kKFzrtXpFEnA@mail.gmail.com (mailing list archive)
State Not Applicable

Commit Message

Ming Lei Feb. 14, 2016, 4:20 p.m. UTC
Hi Sagi,

On Sun, Feb 14, 2016 at 11:22 PM, Christoph Hellwig <hch@infradead.org> wrote:
> Adding Ming to Cc.
>
> But I don't think simply not cloning the biovecs is the right thing
> to do in the end.  This must be something with the bvec iterators.

I agree with Christoph, and there might be issues somewhere.

> From the log:
> iser: sg[0] dma_addr:0x85FC06000 off:0x0 sz:0x200 dma_len:0x200
> iser: sg[1] dma_addr:0x860334000 off:0x0 sz:0x200 dma_len:0x200 <-- gap
> iser: sg[2] dma_addr:0x860335000 off:0x0 sz:0x200 dma_len:0x200 <-- gap

Those gaps shouldn't appear, since blk_bio_segment_split() splits out a
new bio whenever a gap is detected.

Something like the following can be added in the driver or its prep_fn to
check whether the bvecs of the request are correct:

rq_for_each_segment(bvec, sc->request, iter) {
     //check if there is gap between bvec
}
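
A fuller sketch of such a check (the helper name and warning text are just
placeholders; the gap test mirrors the virt boundary logic in blk-merge):

#include <linux/blkdev.h>

/* Placeholder helper: warn if two adjacent bvecs of a request straddle
 * the queue's virt boundary, i.e. form a "gap" the HW can't handle. */
static void check_rq_for_gaps(struct request *rq)
{
        struct request_queue *q = rq->q;
        unsigned long mask = queue_virt_boundary(q);
        struct req_iterator iter;
        struct bio_vec bvec, prev = { };
        bool has_prev = false;

        if (!mask)
                return;         /* no virt boundary set, nothing to check */

        rq_for_each_segment(bvec, rq, iter) {
                /* gap if prev doesn't end, or bvec doesn't start, on the boundary */
                if (has_prev &&
                    ((bvec.bv_offset & mask) ||
                     ((prev.bv_offset + prev.bv_len) & mask)))
                        pr_warn("rq %p: gap between bvecs\n", rq);
                prev = bvec;
                has_prev = true;
        }
}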

I don't know how to use iser, but everything looks fine after I set the
virt boundary to 4095 for null_blk with the attached patch.

> Full quote for Ming:
>
> On Sun, Feb 14, 2016 at 04:02:18PM +0200, Sagi Grimberg wrote:
>>
>> >>I'm bisecting now, there are a couple of patches from Ming in
>> >>the area of the bio splitting code...
>> >>
>> >>CC'ing Ming, Linux-block and Linux-nvme as iser is identical to nvme
>> >>wrt the virtual boundary so I think nvme will break as well.

The bisected commit was merged in v4.3, and there has been no such
report from nvme.

>> >
>> >Bisection reveals that this one is the culprit:
>> >
>> >commit 52cc6eead9095e2faf2ec7afc013aa3af1f01ac5
>> >Author: Ming Lei <ming.lei@canonical.com>
>> >Date:   Thu Sep 17 09:58:38 2015 -0600
>> >
>> >     block: blk-merge: fast-clone bio when splitting rw bios
>> >
>> >     biovecs has become immutable since v3.13, so it isn't necessary
>> >     to allocate biovecs for the new cloned bios, then we can save
>> >     one extra biovecs allocation/copy, and the allocation is often
>> >     not fixed-length and a bit more expensive.
>> >
>> >     For example, if the 'max_sectors_kb' of null blk's queue is set
>> >     as 16(32 sectors) via sysfs just for making more splits, this patch
>> >     can increase throughput about ~70% in the sequential read test over
>> >     null_blk(direct io, bs: 1M).
>> >
>> >     Cc: Christoph Hellwig <hch@infradead.org>
>> >     Cc: Kent Overstreet <kent.overstreet@gmail.com>
>> >     Cc: Ming Lin <ming.l@ssi.samsung.com>
>> >     Cc: Dongsu Park <dpark@posteo.net>
>> >     Signed-off-by: Ming Lei <ming.lei@canonical.com>
>> >
>> >     This fixes a performance regression introduced by commit 54efd50bfd,
>> >     and allows us to take full advantage of the fact that we have
>> >immutable
>> >     bio_vecs. Hand applied, as it rejected violently with commit
>> >     5014c311baa2.
>> >
>> >     Signed-off-by: Jens Axboe <axboe@fb.com>
>> >--
>>
>> Looks like there is a problem with bio_clone_fast()
>>
>> This change makes the problem go away:
>> --
>> diff --git a/block/bio.c b/block/bio.c
>> index dbabd48..5e93733 100644
>> --- a/block/bio.c
>> +++ b/block/bio.c
>> @@ -1791,7 +1791,7 @@ struct bio *bio_split(struct bio *bio, int sectors,
>>          * Discards need a mutable bio_vec to accommodate the payload
>>          * required by the DSM TRIM and UNMAP commands.
>>          */
>> -       if (bio->bi_rw & REQ_DISCARD)
>> +       if (1 || bio->bi_rw & REQ_DISCARD)
>>                 split = bio_clone_bioset(bio, gfp, bs);
>>         else
>>                 split = bio_clone_fast(bio, gfp, bs);
>> --
>>
>> Any thoughts?

I don't think bio_clone_fast() is wrong: it reuses the bvec table of the
original bio, drivers are not allowed to change that table, and they
should just use the standard iterator helpers to access the bvecs.

Also, bio_for_each_segment_all() can't be used to iterate over a cloned
bio.
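
For illustration only (the helper is a made-up name), the difference is
roughly:

#include <linux/bio.h>

/* A fast-cloned bio shares ->bi_io_vec with its parent and only has its
 * own ->bi_iter, so it must be walked through the iterator. */
static void walk_cloned_bio(struct bio *clone)
{
        struct bio_vec bv;
        struct bvec_iter iter;

        /* OK: bio_for_each_segment() honours clone->bi_iter */
        bio_for_each_segment(bv, clone, iter)
                pr_info("seg: page %p off %u len %u\n",
                        bv.bv_page, bv.bv_offset, bv.bv_len);

        /*
         * Not OK for a clone: bio_for_each_segment_all() walks the raw
         * bi_io_vec table (0 .. bi_vcnt - 1) and ignores bi_iter, so it
         * would see segments that don't belong to this clone.
         */
}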

Thanks,

Comments

Sagi Grimberg Feb. 14, 2016, 4:29 p.m. UTC | #1
> Hi Sagi,

Hey,

>> But I don't think simply not cloning the biovecs is the right thing
>> to do in the end.  This must be something with the bvec iterators.
>
> I agree with Christoph, and there might be issues somewhere.
>

Me too, it was just an isolation step...

>>  From the log:
>> iser: sg[0] dma_addr:0x85FC06000 off:0x0 sz:0x200 dma_len:0x200
>> iser: sg[1] dma_addr:0x860334000 off:0x0 sz:0x200 dma_len:0x200 <-- gap
>> iser: sg[2] dma_addr:0x860335000 off:0x0 sz:0x200 dma_len:0x200 <-- gap
>
> Those gaps shouldn't appear, since blk_bio_segment_split() splits out a
> new bio whenever a gap is detected.
>
> Something like the following can be added in the driver or its prep_fn to
> check whether the bvecs of the request are correct:
>
> rq_for_each_segment(bvec, sc->request, iter) {
>       //check if there is gap between bvec
> }

I added this indication and the gap detection does trigger a bio
split.

>
> I don't know how to use iser, but everything looks fine after I set the
> virt boundary to 4095 for null_blk with the attached patch.

That's probably because it's artificial and there is no HW with a real
limitation...

>
>> Full quote for Ming:
>>
>> On Sun, Feb 14, 2016 at 04:02:18PM +0200, Sagi Grimberg wrote:
>>>
>>>>> I'm bisecting now, there are a couple of patches from Ming in
>>>>> the area of the bio splitting code...
>>>>>
>>>>> CC'ing Ming, Linux-block and Linux-nvme as iser is identical to nvme
>>>>> wrt the virtual boundary so I think nvme will break as well.
>
> The bisected commit was merged in v4.3, and there has been no such
> report from nvme.

I'm wondering how that can be... because iser is clearly seeing gaps
that, like nvme, it can't handle. Maybe this is SCSI specific?
Ming Lei Feb. 14, 2016, 4:39 p.m. UTC | #2
On Mon, Feb 15, 2016 at 12:29 AM, Sagi Grimberg
<sagig@dev.mellanox.co.il> wrote:
>> Hi Sagi,
>
>
> Hey,
>
>>> But I don't think simply not cloning the biovecs is the right thing
>>> to do in the end.  This must be something with the bvec iterators.
>>
>>
>> I agree with Christoph, and there might be issues somewhere.
>>
>
> Me too, it was just an isolation step...
>
>>>  From the log:
>>> iser: sg[0] dma_addr:0x85FC06000 off:0x0 sz:0x200 dma_len:0x200
>>> iser: sg[1] dma_addr:0x860334000 off:0x0 sz:0x200 dma_len:0x200 <-- gap
>>> iser: sg[2] dma_addr:0x860335000 off:0x0 sz:0x200 dma_len:0x200 <-- gap
>>
>>
>> Those gaps shouldn't appear, since blk_bio_segment_split() splits out a
>> new bio whenever a gap is detected.
>>
>> Something like the following can be added in the driver or its prep_fn to
>> check whether the bvecs of the request are correct:
>>
>> rq_for_each_segment(bvec, sc->request, iter) {
>>       //check if there is gap between bvec
>> }
>
>
> I added this indication and the gap detection does trigger a bio
> split.

Then I'd suggest adding a similar check inside blk_queue_split() to see
whether the resulting bio (the 'res' bio in the function) has a gap.
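
Something along these lines (helper name made up; it is the bio-level
twin of the request-level check above):

#include <linux/bio.h>
#include <linux/blkdev.h>

/* Placeholder debug helper: walk a bio's segments through its iterator
 * and warn if two adjacent bvecs straddle the queue's virt boundary. */
static void debug_check_bio_gaps(struct request_queue *q, struct bio *bio)
{
        unsigned long mask = queue_virt_boundary(q);
        struct bvec_iter iter;
        struct bio_vec bv, prev = { };
        bool has_prev = false;

        if (!mask)
                return;

        bio_for_each_segment(bv, bio, iter) {
                if (has_prev &&
                    ((bv.bv_offset & mask) ||
                     ((prev.bv_offset + prev.bv_len) & mask)))
                        pr_warn("bio %p: gap after blk_queue_split()\n", bio);
                prev = bv;
                has_prev = true;
        }
}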

>
>>
>> I don't know how to use iser, but everything looks fine after I set the
>> virt boundary to 4095 for null_blk with the attached patch.
>
>
> That's probably because it's artificial and there is no HW with a real
> limitation...

I did check the trace_printk log, and basically each rq includes only
one segment when I set bs to 512 in the fio test.

Thanks,

>
>>
>>> Full quote for Ming:
>>>
>>> On Sun, Feb 14, 2016 at 04:02:18PM +0200, Sagi Grimberg wrote:
>>>>
>>>>
>>>>>> I'm bisecting now, there are a couple of patches from Ming in
>>>>>> the area of the bio splitting code...
>>>>>>
>>>>>> CC'ing Ming, Linux-block and Linux-nvme as iser is identical to nvme
>>>>>> wrt the virtual boundary so I think nvme will break as well.
>>
>>
>> The bisected commit was merged in v4.3, and there has been no such
>> report from nvme.
>
>
> I'm wondering how that can be... because iser is clearly seeing gaps
> that, like nvme, it can't handle. Maybe this is SCSI specific?
Ming Lei Feb. 15, 2016, 7:11 a.m. UTC | #3
On Mon, Feb 15, 2016 at 12:29 AM, Sagi Grimberg
<sagig@dev.mellanox.co.il> wrote:
>> Hi Sagi,
>
>
> Hey,
>
>>> But I don't think simply not cloning the biovecs is the right thing
>>> to do in the end.  This must be something with the bvec iterators.
>>
>>
>> I agree with Christoph, and there might be issues somewhere.
>>
>
> Me too, it was just an isolation step...
>
>>>  From the log:
>>> iser: sg[0] dma_addr:0x85FC06000 off:0x0 sz:0x200 dma_len:0x200
>>> iser: sg[1] dma_addr:0x860334000 off:0x0 sz:0x200 dma_len:0x200 <-- gap
>>> iser: sg[2] dma_addr:0x860335000 off:0x0 sz:0x200 dma_len:0x200 <-- gap
>>
>>
>> Those gaps shouldn't appear, since blk_bio_segment_split() splits out a
>> new bio whenever a gap is detected.
>>
>> Something like the following can be added in the driver or its prep_fn to
>> check whether the bvecs of the request are correct:
>>
>> rq_for_each_segment(bvec, sc->request, iter) {
>>       //check if there is gap between bvec
>> }
>
>
> I added this indication and the gap detection does trigger a bio
> split.
>
>>
>> I don't know how to use iser, but everything looks fine after I set the
>> virt boundary to 4095 for null_blk with the attached patch.
>
>
> That's probably because it's artificial and there is no HW with a real
> limitation...
>
>>
>>> Full quote for Ming:
>>>
>>> On Sun, Feb 14, 2016 at 04:02:18PM +0200, Sagi Grimberg wrote:
>>>>
>>>>
>>>>>> I'm bisecting now, there are a couple of patches from Ming in
>>>>>> the area of the bio splitting code...
>>>>>>
>>>>>> CC'ing Ming, Linux-block and Linux-nvme as iser is identical to nvme
>>>>>> wrt the virtual boundary so I think nvme will break as well.
>>
>>
>> The bisected commit was merged in v4.3, and there has been no such
>> report from nvme.
>
>
> I'm wondering how that can be... because iser is clearly seeing gaps
> that, like nvme, it can't handle. Maybe this is SCSI specific?

I can reproduce the issue now; it is easy to trigger with your test code
on a SCSI device, but a bit difficult to hit on null_blk.

It turns out to be a block core issue: bio_will_gap() gets the last bvec
via 'bi_io_vec[prev->bi_vcnt - 1]' directly. I have posted a patchset
fixing the issue:

http://marc.info/?l=linux-kernel&m=145551975429092&w=2
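
Roughly, the problem is the following (a sketch of the bug, not the
actual patch):

/*
 * For a fast-cloned or split bio, ->bi_io_vec points at the parent's
 * full bvec table and ->bi_vcnt does not reflect this bio's ->bi_iter
 * window, so indexing the table with bi_vcnt - 1 does not necessarily
 * give this bio's last segment.
 */

/* What bio_will_gap() effectively did for the 'prev' bio: trusts
 * bi_vcnt, so for a clone it may pick the wrong bvec entirely. */
static inline void last_bvec_direct(struct bio *bio, struct bio_vec *bv)
{
        *bv = bio->bi_io_vec[bio->bi_vcnt - 1];
}

/* Iterator-based: walk bi_iter and keep the last segment seen (the real
 * fix uses a more direct helper; this only shows the semantic difference). */
static inline void last_bvec_via_iter(struct bio *bio, struct bio_vec *bv)
{
        struct bvec_iter iter;
        struct bio_vec tmp;

        bio_for_each_segment(tmp, bio, iter)
                *bv = tmp;
}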

Thanks,
Ming
Sagi Grimberg Feb. 15, 2016, 7:45 a.m. UTC | #4
> I can reproduce the issue now; it is easy to trigger with your test code
> on a SCSI device, but a bit difficult to hit on null_blk.
>
> It turns out to be a block core issue: bio_will_gap() gets the last bvec
> via 'bi_io_vec[prev->bi_vcnt - 1]' directly. I have posted a patchset
> fixing the issue:
>
> http://marc.info/?l=linux-kernel&m=145551975429092&w=2

Thanks Ming!

I'll give it a shot...

Patch

diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 8ba1e97..5328f3c 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -10,6 +10,8 @@ 
 #include <linux/hrtimer.h>
 #include <linux/lightnvm.h>
 
+#define MAX_SG	128
+
 struct nullb_cmd {
 	struct list_head list;
 	struct llist_node ll_list;
@@ -19,6 +21,7 @@  struct nullb_cmd {
 	unsigned int tag;
 	struct nullb_queue *nq;
 	struct hrtimer timer;
+	struct scatterlist sg[MAX_SG];
 };
 
 struct nullb_queue {
@@ -351,6 +354,23 @@  static void null_request_fn(struct request_queue *q)
 	}
 }
 
+static int handle_sg(struct nullb_cmd *cmd)
+{
+	int num, i;
+	struct request *rq = cmd->rq;
+
+	num = blk_rq_map_sg(rq->q, rq, cmd->sg);
+	trace_printk("%s dump sg for %p(%d)\n", __func__, rq, rq->tag);
+	for (i = 0; i < num; i++) {
+		struct scatterlist *sg = &cmd->sg[i];
+		trace_printk("\t %4d: %u %u %llu\n",
+			     i, sg->offset, sg->length,
+			     (unsigned long long)sg->dma_address);
+	}
+
+	return 0;
+}
+
 static int null_queue_rq(struct blk_mq_hw_ctx *hctx,
 			 const struct blk_mq_queue_data *bd)
 {
@@ -365,6 +385,8 @@  static int null_queue_rq(struct blk_mq_hw_ctx *hctx,
 
 	blk_mq_start_request(bd->rq);
 
+	handle_sg(cmd);
+
 	null_handle_cmd(cmd);
 	return BLK_MQ_RQ_QUEUE_OK;
 }
@@ -707,6 +729,8 @@  static int null_add_dev(void)
 	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, nullb->q);
 	queue_flag_clear_unlocked(QUEUE_FLAG_ADD_RANDOM, nullb->q);
 
+	blk_queue_virt_boundary(nullb->q, (1UL << 12) - 1);
+	blk_queue_max_segments(nullb->q, MAX_SG);
 
 	mutex_lock(&lock);
 	list_add_tail(&nullb->list, &nullb_list);