kernel BUG at drivers/scsi/scsi_lib.c:1096!

Message ID	565579A2.4000005@suse.de (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-block-owner@kernel.org> Message-ID: <565579A2.4000005@suse.de> Date: Wed, 25 Nov 2015 10:04:34 +0100 From: Hannes Reinecke <hare@suse.de> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.2.0 MIME-Version: 1.0 To: emilne@redhat.com CC: Christoph Hellwig <hch@infradead.org>, Michael Ellerman <mpe@ellerman.id.au>, Mark Salter <msalter@redhat.com>, "James E. J. Bottomley" <JBottomley@odin.com>, brking <brking@us.ibm.com>, linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, linux-block@vger.kernel.org, Mike Snitzer <snitzer@redhat.com>, Jun'ichi Nomura <j-nomura@ce.jp.nec.com>, Jens Axboe <axboe@fb.com> Subject: Re: kernel BUG at drivers/scsi/scsi_lib.c:1096! References: <1447838334.1564.2.camel@ellerman.id.au> <1447855399.3974.24.camel@redhat.com> <1447894964.15206.0.camel@ellerman.id.au> <20151119082325.GA11419@infradead.org> <564DEC41.5010600@suse.de> <1448030316.4067.18.camel@localhost.localdomain> <564F3453.9040603@suse.de> <1448033323.4067.24.camel@localhost.localdomain> In-Reply-To: <1448033323.4067.24.camel@localhost.localdomain> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit Sender: linux-block-owner@vger.kernel.org Precedence: bulk

Hannes Reinecke Nov. 25, 2015, 9:04 a.m. UTC

On 11/20/2015 04:28 PM, Ewan Milne wrote:
> On Fri, 2015-11-20 at 15:55 +0100, Hannes Reinecke wrote:
>> Can't we have a joint effort here?
>> I've been spending a _LOT_ of time trying to debug things here, but
>> none of the ideas I've come up with have been able to fix anything.
> 
> Yes.  I'm not the one primarily looking at it, and we don't have a
> reproducer in-house.  We just have the one dump right now.
> 
>>
>> I'm almost tempted to increase the count from scsi_alloc_sgtable()
>> by one and be done with ...
>>
> 
> That might not fix it if it is a problem with the merge code, though.
> 
And indeed, it doesn't.
Seems I finally found the culprit.

What happens is this:
We have two paths, with these seg_boundary_masks:

path-1:    seg_boundary_mask = 65535,
path-2:    seg_boundary_mask = 4294967295,

consequently the DM request queue has this:

md-1:    seg_boundary_mask = 65535,

What happens now is that a request is being formatted, and sent
to path 2. During submission req->nr_phys_segments is formatted
with the limits of path 2, arriving at a count of 3.
Now the request gets retried on path 1, but as the NOMERGE request
flag is set req->nr_phys_segments is never updated.
But blk_rq_map_sg() ignores all counters, and just uses the
bi_vec directly, resulting in a count of 4 -> boom.

So the culprit here is the NOMERGE flag, which is evaluated
via
->dm_dispatch_request()
  ->blk_insert_cloned_request()
    ->blk_rq_check_limits()

If the above assessment is correct, the following patch should
fix it:

rq->cmd_flags)) {

Mike? Jens?
Can you comment on it?

Cheers,

Hannes

Jens Axboe Nov. 25, 2015, 5:56 p.m. UTC | #1

On 11/25/2015 02:04 AM, Hannes Reinecke wrote:
> On 11/20/2015 04:28 PM, Ewan Milne wrote:
>> On Fri, 2015-11-20 at 15:55 +0100, Hannes Reinecke wrote:
>>> Can't we have a joint effort here?
>>> I've been spending a _LOT_ of time trying to debug things here, but
>>> none of the ideas I've come up with have been able to fix anything.
>>
>> Yes.  I'm not the one primarily looking at it, and we don't have a
>> reproducer in-house.  We just have the one dump right now.
>>
>>>
>>> I'm almost tempted to increase the count from scsi_alloc_sgtable()
>>> by one and be done with ...
>>>
>>
>> That might not fix it if it is a problem with the merge code, though.
>>
> And indeed, it doesn't.
> Seems I finally found the culprit.
>
> What happens is this:
> We have two paths, with these seg_boundary_masks:
>
> path-1:    seg_boundary_mask = 65535,
> path-2:    seg_boundary_mask = 4294967295,
>
> consequently the DM request queue has this:
>
> md-1:    seg_boundary_mask = 65535,
>
> What happens now is that a request is being formatted, and sent
> to path 2. During submission req->nr_phys_segments is formatted
> with the limits of path 2, arriving at a count of 3.
> Now the request gets retried on path 1, but as the NOMERGE request
> flag is set req->nr_phys_segments is never updated.
> But blk_rq_map_sg() ignores all counters, and just uses the
> bi_vec directly, resulting in a count of 4 -> boom.
>
> So the culprit here is the NOMERGE flag, which is evaluated
> via
> ->dm_dispatch_request()
>    ->blk_insert_cloned_request()
>      ->blk_rq_check_limits()
>
> If the above assessment is correct, the following patch should
> fix it:
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 801ced7..12cccd6 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1928,7 +1928,7 @@ EXPORT_SYMBOL(submit_bio);
>    */
>   int blk_rq_check_limits(struct request_queue *q, struct request *rq)
>   {
> -       if (!rq_mergeable(rq))
> +       if (rq->cmd_type != REQ_TYPE_FS)
>                  return 0;
>
>          if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q,
> rq->cmd_flags)) {
>
>
> Mike? Jens?
> Can you comment on it?

We only support merging on REQ_TYPE_FS already, so how is the above 
making it any different? In general, NOMERGE being set or not should not 
make a difference. It's only a hint that we need not check further if we 
should be merging on this request, since we already tried it once, found 
we'd exceed various limits, then set NOMERGE to reflect that.

Mike Snitzer Nov. 25, 2015, 6:01 p.m. UTC | #2

On Wed, Nov 25 2015 at  4:04am -0500,
Hannes Reinecke <hare@suse.de> wrote:

> On 11/20/2015 04:28 PM, Ewan Milne wrote:
> > On Fri, 2015-11-20 at 15:55 +0100, Hannes Reinecke wrote:
> >> Can't we have a joint effort here?
> >> I've been spending a _LOT_ of time trying to debug things here, but
> >> none of the ideas I've come up with have been able to fix anything.
> > 
> > Yes.  I'm not the one primarily looking at it, and we don't have a
> > reproducer in-house.  We just have the one dump right now.
> > 
> >>
> >> I'm almost tempted to increase the count from scsi_alloc_sgtable()
> >> by one and be done with ...
> >>
> > 
> > That might not fix it if it is a problem with the merge code, though.
> > 
> And indeed, it doesn't.

How did you arrive at that?  Do you have a reproducer now?

> Seems I finally found the culprit.
> 
> What happens is this:
> We have two paths, with these seg_boundary_masks:
> 
> path-1:    seg_boundary_mask = 65535,
> path-2:    seg_boundary_mask = 4294967295,
> 
> consequently the DM request queue has this:
> 
> md-1:    seg_boundary_mask = 65535,
> 
> What happens now is that a request is being formatted, and sent
> to path 2. During submission req->nr_phys_segments is formatted
> with the limits of path 2, arriving at a count of 3.
> Now the request gets retried on path 1, but as the NOMERGE request
> flag is set req->nr_phys_segments is never updated.
> But blk_rq_map_sg() ignores all counters, and just uses the
> bi_vec directly, resulting in a count of 4 -> boom.
> 
> So the culprit here is the NOMERGE flag,

NOMERGE is always set in __blk_rq_prep_clone() for cloned requests.

> which is evaluated via
> ->dm_dispatch_request()
>   ->blk_insert_cloned_request()
>     ->blk_rq_check_limits()

blk_insert_cloned_request() is the only caller of blk_rq_check_limits();
anyway after reading your mail I'm still left wondering if your proposed
patch is correct.

> If the above assessment is correct, the following patch should
> fix it:
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 801ced7..12cccd6 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1928,7 +1928,7 @@ EXPORT_SYMBOL(submit_bio);
>   */
>  int blk_rq_check_limits(struct request_queue *q, struct request *rq)
>  {
> -       if (!rq_mergeable(rq))
> +       if (rq->cmd_type != REQ_TYPE_FS)
>                 return 0;
> 
>         if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q,
> rq->cmd_flags)) {
> 
> 
> Mike? Jens?
> Can you comment on it?

You're not explaining the actual change in the patch very well; I think
you're correct but you're leaving the justification as an exercise to 
the reviewer:

blk_rq_check_limits() will call blk_recalc_rq_segments() after the
!rq_mergeable() check but you're saying for this case in question we
never get there -- due to the cloned request having NOMERGE set.

So in blk_rq_check_limits() you've unrolled rq_mergeable() and
open-coded the lone remaining check (rq->cmd_type != REQ_TYPE_FS)

I agree that the (rq->cmd_flags & REQ_NOMERGE_FLAGS) check in
the blk_insert_cloned_request() call-chain (via rq_mergeable()) makes no
sense for cloned requests that always have NOMERGE set.

So you're saying that by having blk_rq_check_limits() go on to call
blk_recalc_rq_segments() this bug will be fixed?

BTW, I think blk_rq_check_limits()'s export should be removed and the
function made static and renamed to blk_clone_rq_check_limits(), again:
blk_insert_cloned_request() is the only caller of blk_rq_check_limits()

Seems prudent to make that change now to be clear that this code is only
used by cloned requests.

Thanks,
Mike
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Hannes Reinecke Nov. 25, 2015, 7:01 p.m. UTC | #3

On 11/25/2015 07:01 PM, Mike Snitzer wrote:
> On Wed, Nov 25 2015 at  4:04am -0500,
> Hannes Reinecke <hare@suse.de> wrote:
>
>> On 11/20/2015 04:28 PM, Ewan Milne wrote:
>>> On Fri, 2015-11-20 at 15:55 +0100, Hannes Reinecke wrote:
>>>> Can't we have a joint effort here?
>>>> I've been spending a _LOT_ of time trying to debug things here, but
>>>> none of the ideas I've come up with have been able to fix anything.
>>>
>>> Yes.  I'm not the one primarily looking at it, and we don't have a
>>> reproducer in-house.  We just have the one dump right now.
>>>
>>>>
>>>> I'm almost tempted to increase the count from scsi_alloc_sgtable()
>>>> by one and be done with ...
>>>>
>>>
>>> That might not fix it if it is a problem with the merge code, though.
>>>
>> And indeed, it doesn't.
>
> How did you arrive at that?  Do you have a reproducer now?
>
Not a reproducer, but several dumps for analysis.

>> Seems I finally found the culprit.
>>
>> What happens is this:
>> We have two paths, with these seg_boundary_masks:
>>
>> path-1:    seg_boundary_mask = 65535,
>> path-2:    seg_boundary_mask = 4294967295,
>>
>> consequently the DM request queue has this:
>>
>> md-1:    seg_boundary_mask = 65535,
>>
>> What happens now is that a request is being formatted, and sent
>> to path 2. During submission req->nr_phys_segments is formatted
>> with the limits of path 2, arriving at a count of 3.
>> Now the request gets retried on path 1, but as the NOMERGE request
>> flag is set req->nr_phys_segments is never updated.
>> But blk_rq_map_sg() ignores all counters, and just uses the
>> bi_vec directly, resulting in a count of 4 -> boom.
>>
>> So the culprit here is the NOMERGE flag,
>
> NOMERGE is always set in __blk_rq_prep_clone() for cloned requests.
>
Yes.

>> which is evaluated via
>> ->dm_dispatch_request()
>>    ->blk_insert_cloned_request()
>>      ->blk_rq_check_limits()
>
> blk_insert_cloned_request() is the only caller of blk_rq_check_limits();
> anyway after reading your mail I'm still left wondering if your proposed
> patch is correct.
>
>> If the above assessment is correct, the following patch should
>> fix it:
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 801ced7..12cccd6 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -1928,7 +1928,7 @@ EXPORT_SYMBOL(submit_bio);
>>    */
>>   int blk_rq_check_limits(struct request_queue *q, struct request *rq)
>>   {
>> -       if (!rq_mergeable(rq))
>> +       if (rq->cmd_type != REQ_TYPE_FS)
>>                  return 0;
>>
>>          if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q,
>> rq->cmd_flags)) {
>>
>>
>> Mike? Jens?
>> Can you comment on it?
>
> You're not explaining the actual change in the patch very well; I think
> you're correct but you're leaving the justification as an exercise to
> the reviewer:
>
> blk_rq_check_limits() will call blk_recalc_rq_segments() after the
> !rq_mergeable() check but you're saying for this case in question we
> never get there -- due to the cloned request having NOMERGE set.
>
> So in blk_rq_check_limits() you've unrolled rq_mergeable() and
> open-coded the lone remaining check (rq->cmd_type != REQ_TYPE_FS)
>
> I agree that the (rq->cmd_flags & REQ_NOMERGE_FLAGS) check in
> the blk_insert_cloned_request() call-chain (via rq_mergeable()) makes no
> sense for cloned requests that always have NOMERGE set.
>
> So you're saying that by having blk_rq_check_limits() go on to call
> blk_recalc_rq_segments() this bug will be fixed?
>
That is the idea.

I've already established that in all instances I have seen so far
req->nr_phys_segments is _less_ than req->bio->bi_phys_segments.

As it turns out, req->nr_phys_segemnts _would_ have been updated in
blk_rq_check_limits(), but isn't due to the NOMERGE flag being set
for the cloned request.
So each cloned request inherits the values from the original request,
despite the fact that req->nr_phys_segments _has_ to be evaluated in
the final request_queue context, as the queue limits _might_ be 
different from the original (merged) queue limits of the multipath
request queue.

> BTW, I think blk_rq_check_limits()'s export should be removed and the
> function made static and renamed to blk_clone_rq_check_limits(), again:
> blk_insert_cloned_request() is the only caller of blk_rq_check_limits()
>
Actually, seeing Jens' last comment the check for REQ_TYPE_FS is 
pointless, too, so we might as well remove the entire if-clause.

> Seems prudent to make that change now to be clear that this code is only
> used by cloned requests.
>
Yeah, that would make sense. I'll be preparing a patch.
With a more detailed description :-)

Cheers,

Hannes

Hannes Reinecke Nov. 25, 2015, 7:10 p.m. UTC | #4

On 11/25/2015 06:56 PM, Jens Axboe wrote:
> On 11/25/2015 02:04 AM, Hannes Reinecke wrote:
>> On 11/20/2015 04:28 PM, Ewan Milne wrote:
>>> On Fri, 2015-11-20 at 15:55 +0100, Hannes Reinecke wrote:
>>>> Can't we have a joint effort here?
>>>> I've been spending a _LOT_ of time trying to debug things here, but
>>>> none of the ideas I've come up with have been able to fix anything.
>>>
>>> Yes.  I'm not the one primarily looking at it, and we don't have a
>>> reproducer in-house.  We just have the one dump right now.
>>>
>>>>
>>>> I'm almost tempted to increase the count from scsi_alloc_sgtable()
>>>> by one and be done with ...
>>>>
>>>
>>> That might not fix it if it is a problem with the merge code, though.
>>>
>> And indeed, it doesn't.
>> Seems I finally found the culprit.
>>
>> What happens is this:
>> We have two paths, with these seg_boundary_masks:
>>
>> path-1:    seg_boundary_mask = 65535,
>> path-2:    seg_boundary_mask = 4294967295,
>>
>> consequently the DM request queue has this:
>>
>> md-1:    seg_boundary_mask = 65535,
>>
>> What happens now is that a request is being formatted, and sent
>> to path 2. During submission req->nr_phys_segments is formatted
>> with the limits of path 2, arriving at a count of 3.
>> Now the request gets retried on path 1, but as the NOMERGE request
>> flag is set req->nr_phys_segments is never updated.
>> But blk_rq_map_sg() ignores all counters, and just uses the
>> bi_vec directly, resulting in a count of 4 -> boom.
>>
>> So the culprit here is the NOMERGE flag, which is evaluated
>> via
>> ->dm_dispatch_request()
>>    ->blk_insert_cloned_request()
>>      ->blk_rq_check_limits()
>>
>> If the above assessment is correct, the following patch should
>> fix it:
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 801ced7..12cccd6 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -1928,7 +1928,7 @@ EXPORT_SYMBOL(submit_bio);
>>    */
>>   int blk_rq_check_limits(struct request_queue *q, struct request *rq)
>>   {
>> -       if (!rq_mergeable(rq))
>> +       if (rq->cmd_type != REQ_TYPE_FS)
>>                  return 0;
>>
>>          if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q,
>> rq->cmd_flags)) {
>>
>>
>> Mike? Jens?
>> Can you comment on it?
>
> We only support merging on REQ_TYPE_FS already, so how is the above
> making it any different? In general, NOMERGE being set or not should not
> make a difference. It's only a hint that we need not check further if we
> should be merging on this request, since we already tried it once, found
> we'd exceed various limits, then set NOMERGE to reflect that.
>
The problem is that NOMERGE does too much, as it inhibits _any_ merging.

Unfortunately, the req->nr_phys_segments value is evaluated in the final
_driver_ context _after_ the merging happend; cf 
scsi_lib.c:scsi_init_sgtable().
As nr_phys_segments is inherited from the original request (and never
recalculated with the new request queue limits) the following 
blk_rq_map_sg() call might end up at a different calculation, especially
after retrying a request on another path.

Cheers,

Hannes

Jens Axboe Nov. 25, 2015, 7:24 p.m. UTC | #5

On 11/25/2015 12:10 PM, Hannes Reinecke wrote:
> On 11/25/2015 06:56 PM, Jens Axboe wrote:
>> On 11/25/2015 02:04 AM, Hannes Reinecke wrote:
>>> On 11/20/2015 04:28 PM, Ewan Milne wrote:
>>>> On Fri, 2015-11-20 at 15:55 +0100, Hannes Reinecke wrote:
>>>>> Can't we have a joint effort here?
>>>>> I've been spending a _LOT_ of time trying to debug things here, but
>>>>> none of the ideas I've come up with have been able to fix anything.
>>>>
>>>> Yes.  I'm not the one primarily looking at it, and we don't have a
>>>> reproducer in-house.  We just have the one dump right now.
>>>>
>>>>>
>>>>> I'm almost tempted to increase the count from scsi_alloc_sgtable()
>>>>> by one and be done with ...
>>>>>
>>>>
>>>> That might not fix it if it is a problem with the merge code, though.
>>>>
>>> And indeed, it doesn't.
>>> Seems I finally found the culprit.
>>>
>>> What happens is this:
>>> We have two paths, with these seg_boundary_masks:
>>>
>>> path-1:    seg_boundary_mask = 65535,
>>> path-2:    seg_boundary_mask = 4294967295,
>>>
>>> consequently the DM request queue has this:
>>>
>>> md-1:    seg_boundary_mask = 65535,
>>>
>>> What happens now is that a request is being formatted, and sent
>>> to path 2. During submission req->nr_phys_segments is formatted
>>> with the limits of path 2, arriving at a count of 3.
>>> Now the request gets retried on path 1, but as the NOMERGE request
>>> flag is set req->nr_phys_segments is never updated.
>>> But blk_rq_map_sg() ignores all counters, and just uses the
>>> bi_vec directly, resulting in a count of 4 -> boom.
>>>
>>> So the culprit here is the NOMERGE flag, which is evaluated
>>> via
>>> ->dm_dispatch_request()
>>>    ->blk_insert_cloned_request()
>>>      ->blk_rq_check_limits()
>>>
>>> If the above assessment is correct, the following patch should
>>> fix it:
>>>
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index 801ced7..12cccd6 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -1928,7 +1928,7 @@ EXPORT_SYMBOL(submit_bio);
>>>    */
>>>   int blk_rq_check_limits(struct request_queue *q, struct request *rq)
>>>   {
>>> -       if (!rq_mergeable(rq))
>>> +       if (rq->cmd_type != REQ_TYPE_FS)
>>>                  return 0;
>>>
>>>          if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q,
>>> rq->cmd_flags)) {
>>>
>>>
>>> Mike? Jens?
>>> Can you comment on it?
>>
>> We only support merging on REQ_TYPE_FS already, so how is the above
>> making it any different? In general, NOMERGE being set or not should not
>> make a difference. It's only a hint that we need not check further if we
>> should be merging on this request, since we already tried it once, found
>> we'd exceed various limits, then set NOMERGE to reflect that.
>>
> The problem is that NOMERGE does too much, as it inhibits _any_ merging.

Right, that is the point of the flag from the block layer view, where it 
was originally added for the case mentioned.

> Unfortunately, the req->nr_phys_segments value is evaluated in the final
> _driver_ context _after_ the merging happend; cf
> scsi_lib.c:scsi_init_sgtable().
> As nr_phys_segments is inherited from the original request (and never
> recalculated with the new request queue limits) the following
> blk_rq_map_sg() call might end up at a different calculation, especially
> after retrying a request on another path.

That all sounds pretty horrible. Why is blk_rq_check_limits() checking 
for mergeable at all? If merging is disabled on the request, I'm 
assuming that's an attempt at an optimization since we know it won't 
change. But that should be tracked separately, like how it's done on the 
bio.

Mike Snitzer Nov. 25, 2015, 8:23 p.m. UTC | #6

On Wed, Nov 25 2015 at  2:24pm -0500,
Jens Axboe <axboe@fb.com> wrote:

> On 11/25/2015 12:10 PM, Hannes Reinecke wrote:
> >On 11/25/2015 06:56 PM, Jens Axboe wrote:
> >>On 11/25/2015 02:04 AM, Hannes Reinecke wrote:
> >>>On 11/20/2015 04:28 PM, Ewan Milne wrote:
> >>>>On Fri, 2015-11-20 at 15:55 +0100, Hannes Reinecke wrote:
> >>>>>Can't we have a joint effort here?
> >>>>>I've been spending a _LOT_ of time trying to debug things here, but
> >>>>>none of the ideas I've come up with have been able to fix anything.
> >>>>
> >>>>Yes.  I'm not the one primarily looking at it, and we don't have a
> >>>>reproducer in-house.  We just have the one dump right now.
> >>>>
> >>>>>
> >>>>>I'm almost tempted to increase the count from scsi_alloc_sgtable()
> >>>>>by one and be done with ...
> >>>>>
> >>>>
> >>>>That might not fix it if it is a problem with the merge code, though.
> >>>>
> >>>And indeed, it doesn't.
> >>>Seems I finally found the culprit.
> >>>
> >>>What happens is this:
> >>>We have two paths, with these seg_boundary_masks:
> >>>
> >>>path-1:    seg_boundary_mask = 65535,
> >>>path-2:    seg_boundary_mask = 4294967295,
> >>>
> >>>consequently the DM request queue has this:
> >>>
> >>>md-1:    seg_boundary_mask = 65535,
> >>>
> >>>What happens now is that a request is being formatted, and sent
> >>>to path 2. During submission req->nr_phys_segments is formatted
> >>>with the limits of path 2, arriving at a count of 3.
> >>>Now the request gets retried on path 1, but as the NOMERGE request
> >>>flag is set req->nr_phys_segments is never updated.
> >>>But blk_rq_map_sg() ignores all counters, and just uses the
> >>>bi_vec directly, resulting in a count of 4 -> boom.
> >>>
> >>>So the culprit here is the NOMERGE flag, which is evaluated
> >>>via
> >>>->dm_dispatch_request()
> >>>   ->blk_insert_cloned_request()
> >>>     ->blk_rq_check_limits()
> >>>
> >>>If the above assessment is correct, the following patch should
> >>>fix it:
> >>>
> >>>diff --git a/block/blk-core.c b/block/blk-core.c
> >>>index 801ced7..12cccd6 100644
> >>>--- a/block/blk-core.c
> >>>+++ b/block/blk-core.c
> >>>@@ -1928,7 +1928,7 @@ EXPORT_SYMBOL(submit_bio);
> >>>   */
> >>>  int blk_rq_check_limits(struct request_queue *q, struct request *rq)
> >>>  {
> >>>-       if (!rq_mergeable(rq))
> >>>+       if (rq->cmd_type != REQ_TYPE_FS)
> >>>                 return 0;
> >>>
> >>>         if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q,
> >>>rq->cmd_flags)) {
> >>>
> >>>
> >>>Mike? Jens?
> >>>Can you comment on it?
> >>
> >>We only support merging on REQ_TYPE_FS already, so how is the above
> >>making it any different? In general, NOMERGE being set or not should not
> >>make a difference. It's only a hint that we need not check further if we
> >>should be merging on this request, since we already tried it once, found
> >>we'd exceed various limits, then set NOMERGE to reflect that.
> >>
> >The problem is that NOMERGE does too much, as it inhibits _any_ merging.
> 
> Right, that is the point of the flag from the block layer view,
> where it was originally added for the case mentioned.

And we really don't want _any_ merging.  The merging, if any, will have
already happened in upper DM-multipath's elevator.  So there should be
no need to have the underlying SCSI paths do any merging.

> >Unfortunately, the req->nr_phys_segments value is evaluated in the final
> >_driver_ context _after_ the merging happend; cf
> >scsi_lib.c:scsi_init_sgtable().
> >As nr_phys_segments is inherited from the original request (and never
> >recalculated with the new request queue limits) the following
> >blk_rq_map_sg() call might end up at a different calculation, especially
> >after retrying a request on another path.
> 
> That all sounds pretty horrible. Why is blk_rq_check_limits()
> checking for mergeable at all? If merging is disabled on the
> request, I'm assuming that's an attempt at an optimization since we
> know it won't change. But that should be tracked separately, like
> how it's done on the bio.

Not clear to me why it was checking for merging...
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Mike Snitzer Nov. 25, 2015, 9:20 p.m. UTC | #7

On Wed, Nov 25 2015 at  3:23pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Wed, Nov 25 2015 at  2:24pm -0500,
> Jens Axboe <axboe@fb.com> wrote:
> 
> > On 11/25/2015 12:10 PM, Hannes Reinecke wrote:
> > >The problem is that NOMERGE does too much, as it inhibits _any_ merging.
> > 
> > Right, that is the point of the flag from the block layer view,
> > where it was originally added for the case mentioned.
> 
> And we really don't want _any_ merging.  The merging, if any, will have
> already happened in upper DM-multipath's elevator.  So there should be
> no need to have the underlying SCSI paths do any merging.
> 
> > >Unfortunately, the req->nr_phys_segments value is evaluated in the final
> > >_driver_ context _after_ the merging happend; cf
> > >scsi_lib.c:scsi_init_sgtable().
> > >As nr_phys_segments is inherited from the original request (and never
> > >recalculated with the new request queue limits) the following
> > >blk_rq_map_sg() call might end up at a different calculation, especially
> > >after retrying a request on another path.
> > 
> > That all sounds pretty horrible. Why is blk_rq_check_limits()
> > checking for mergeable at all? If merging is disabled on the
> > request, I'm assuming that's an attempt at an optimization since we
> > know it won't change. But that should be tracked separately, like
> > how it's done on the bio.
> 
> Not clear to me why it was checking for merging...

Ewan pointed out that blk_rq_check_limits()'s call to rq_mergable() was
introduced as part of Martin's DISCARD cleanup that prepared for
WRITE_SAME, see: commit e2a60da74 ("block: Clean up special command handling logic")

And prior to that, blk_rq_check_limits()'s (rq->cmd_flags & REQ_DISCARD)
check was introduced by some guy crazy doppelganger name "Ike Snitzer",
see commit: 3383977f ("block: update request stacking methods to support discards")

So basically we probably never needed the extra check in
blk_rq_check_limits() to begin with.. Ike was probably paranoid. ;)
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Takashi Iwai Dec. 4, 2015, 4:59 p.m. UTC | #8

On Wed, 25 Nov 2015 20:01:47 +0100,
Hannes Reinecke wrote:
> 
> On 11/25/2015 07:01 PM, Mike Snitzer wrote:
> > On Wed, Nov 25 2015 at  4:04am -0500,
> > Hannes Reinecke <hare@suse.de> wrote:
> >
> >> On 11/20/2015 04:28 PM, Ewan Milne wrote:
> >>> On Fri, 2015-11-20 at 15:55 +0100, Hannes Reinecke wrote:
> >>>> Can't we have a joint effort here?
> >>>> I've been spending a _LOT_ of time trying to debug things here, but
> >>>> none of the ideas I've come up with have been able to fix anything.
> >>>
> >>> Yes.  I'm not the one primarily looking at it, and we don't have a
> >>> reproducer in-house.  We just have the one dump right now.
> >>>
> >>>>
> >>>> I'm almost tempted to increase the count from scsi_alloc_sgtable()
> >>>> by one and be done with ...
> >>>>
> >>>
> >>> That might not fix it if it is a problem with the merge code, though.
> >>>
> >> And indeed, it doesn't.
> >
> > How did you arrive at that?  Do you have a reproducer now?
> >
> Not a reproducer, but several dumps for analysis.
> 
> >> Seems I finally found the culprit.
> >>
> >> What happens is this:
> >> We have two paths, with these seg_boundary_masks:
> >>
> >> path-1:    seg_boundary_mask = 65535,
> >> path-2:    seg_boundary_mask = 4294967295,
> >>
> >> consequently the DM request queue has this:
> >>
> >> md-1:    seg_boundary_mask = 65535,
> >>
> >> What happens now is that a request is being formatted, and sent
> >> to path 2. During submission req->nr_phys_segments is formatted
> >> with the limits of path 2, arriving at a count of 3.
> >> Now the request gets retried on path 1, but as the NOMERGE request
> >> flag is set req->nr_phys_segments is never updated.
> >> But blk_rq_map_sg() ignores all counters, and just uses the
> >> bi_vec directly, resulting in a count of 4 -> boom.
> >>
> >> So the culprit here is the NOMERGE flag,
> >
> > NOMERGE is always set in __blk_rq_prep_clone() for cloned requests.
> >
> Yes.
> 
> >> which is evaluated via
> >> ->dm_dispatch_request()
> >>    ->blk_insert_cloned_request()
> >>      ->blk_rq_check_limits()
> >
> > blk_insert_cloned_request() is the only caller of blk_rq_check_limits();
> > anyway after reading your mail I'm still left wondering if your proposed
> > patch is correct.
> >
> >> If the above assessment is correct, the following patch should
> >> fix it:
> >>
> >> diff --git a/block/blk-core.c b/block/blk-core.c
> >> index 801ced7..12cccd6 100644
> >> --- a/block/blk-core.c
> >> +++ b/block/blk-core.c
> >> @@ -1928,7 +1928,7 @@ EXPORT_SYMBOL(submit_bio);
> >>    */
> >>   int blk_rq_check_limits(struct request_queue *q, struct request *rq)
> >>   {
> >> -       if (!rq_mergeable(rq))
> >> +       if (rq->cmd_type != REQ_TYPE_FS)
> >>                  return 0;
> >>
> >>          if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q,
> >> rq->cmd_flags)) {
> >>
> >>
> >> Mike? Jens?
> >> Can you comment on it?
> >
> > You're not explaining the actual change in the patch very well; I think
> > you're correct but you're leaving the justification as an exercise to
> > the reviewer:
> >
> > blk_rq_check_limits() will call blk_recalc_rq_segments() after the
> > !rq_mergeable() check but you're saying for this case in question we
> > never get there -- due to the cloned request having NOMERGE set.
> >
> > So in blk_rq_check_limits() you've unrolled rq_mergeable() and
> > open-coded the lone remaining check (rq->cmd_type != REQ_TYPE_FS)
> >
> > I agree that the (rq->cmd_flags & REQ_NOMERGE_FLAGS) check in
> > the blk_insert_cloned_request() call-chain (via rq_mergeable()) makes no
> > sense for cloned requests that always have NOMERGE set.
> >
> > So you're saying that by having blk_rq_check_limits() go on to call
> > blk_recalc_rq_segments() this bug will be fixed?
> >
> That is the idea.
> 
> I've already established that in all instances I have seen so far
> req->nr_phys_segments is _less_ than req->bio->bi_phys_segments.
> 
> As it turns out, req->nr_phys_segemnts _would_ have been updated in
> blk_rq_check_limits(), but isn't due to the NOMERGE flag being set
> for the cloned request.
> So each cloned request inherits the values from the original request,
> despite the fact that req->nr_phys_segments _has_ to be evaluated in
> the final request_queue context, as the queue limits _might_ be 
> different from the original (merged) queue limits of the multipath
> request queue.
> 
> > BTW, I think blk_rq_check_limits()'s export should be removed and the
> > function made static and renamed to blk_clone_rq_check_limits(), again:
> > blk_insert_cloned_request() is the only caller of blk_rq_check_limits()
> >
> Actually, seeing Jens' last comment the check for REQ_TYPE_FS is 
> pointless, too, so we might as well remove the entire if-clause.
> 
> > Seems prudent to make that change now to be clear that this code is only
> > used by cloned requests.
> >
> Yeah, that would make sense. I'll be preparing a patch.
> With a more detailed description :-)

Do we have already a fix?  Right now I got (likely) this kernel BUG()
on the almost latest Linus tree (commit 25364a9e54fb8296).  It
happened while I started a KVM right after a fresh boot.  The machine
paniced even before that, so I hit this twice today.

I have a crash dump, so if anything is needed, let me know.
(But the system had to be restarted since it's my main workstation.)


thanks,

Takashi

===
[  137.211255] ------------[ cut here ]------------
[  137.211265] WARNING: CPU: 0 PID: 391 at block/blk-merge.c:435 blk_rq_map_sg+0x430/0x4d0()
[  137.211267] Modules linked in: nfsv3 nfs_acl auth_rpcgss oid_registry nfsv4 dns_resolver nfs lockd grace sunrpc fscache parport_pc ppdev parport fuse vmw_vsock_vmci_transport vsock vmw_vmci iscsi_ibft iscsi_boot_sysfs snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm i915 irqbypass crct10dif_pclmul crc32_pclmul snd_emu10k1 snd_hda_codec_ca0132 snd_util_mem aesni_intel snd_ac97_codec snd_hda_intel aes_x86_64 dcdbas snd_hda_codec glue_helper dm_mod ac97_bus lrw gf128mul ablk_helper cryptd snd_rawmidi snd_hda_core i2c_algo_bit snd_seq_device drm_kms_helper snd_hwdep syscopyarea snd_pcm sysfillrect sysimgblt snd_timer fb_sys_fops iTCO_wdt i2c_i801 drm snd iTCO_vendor_support emu10k1_gp battery lpc_ich gameport mfd_core soundcore
[  137.211326]  mei_me video mei thermal button usbhid crc32c_intel sr_mod ehci_pci cdrom serio_raw xhci_pci ehci_hcd e1000e xhci_hcd ptp pps_core usbcore usb_common sg
[  137.211342] CPU: 0 PID: 391 Comm: kworker/0:1H Not tainted 4.4.0-rc3-test2+ #183
[  137.211344] Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
[  137.211349] Workqueue: kblockd bfq_kick_queue
[  137.211352]  ffffffff81a6b805 ffff880214307ab8 ffffffff813b1cc9 0000000000000000
[  137.211357]  ffff880214307af0 ffffffff8106d9a2 ffff88019e7d49c8 0000000000000000
[  137.211362]  0000000000000000 0000000000001000 0000000000000020 ffff880214307b00
[  137.211367] Call Trace:
[  137.211373]  [<ffffffff813b1cc9>] dump_stack+0x4b/0x72
[  137.211378]  [<ffffffff8106d9a2>] warn_slowpath_common+0x82/0xc0
[  137.211383]  [<ffffffff8106da9a>] warn_slowpath_null+0x1a/0x20
[  137.211387]  [<ffffffff81381340>] blk_rq_map_sg+0x430/0x4d0
[  137.211393]  [<ffffffff81514a60>] ? scsi_init_cmd_errh+0x90/0x90
[  137.211398]  [<ffffffff81514fc4>] scsi_init_sgtable+0x44/0x80
[  137.211402]  [<ffffffff81515045>] scsi_init_io+0x45/0x1d0
[  137.211408]  [<ffffffff815214a8>] sd_setup_read_write_cmnd+0x38/0xa20
[  137.211413]  [<ffffffff810c141d>] ? trace_hardirqs_off+0xd/0x10
[  137.211416]  [<ffffffff81524fc7>] sd_init_command+0x27/0xa0
[  137.211420]  [<ffffffff815152c0>] scsi_setup_cmnd+0xf0/0x150
[  137.211424]  [<ffffffff81515471>] scsi_prep_fn+0xd1/0x170
[  137.211428]  [<ffffffff8137b87c>] blk_peek_request+0x18c/0x320
[  137.211433]  [<ffffffff81516cb2>] scsi_request_fn+0x42/0x610
[  137.211438]  [<ffffffff81374943>] __blk_run_queue+0x33/0x40
[  137.211442]  [<ffffffff813a8015>] bfq_kick_queue+0x25/0x40
[  137.211447]  [<ffffffff8108b18d>] process_one_work+0x1ed/0x660
[  137.211451]  [<ffffffff8108b0fc>] ? process_one_work+0x15c/0x660
[  137.211456]  [<ffffffff8108b64e>] worker_thread+0x4e/0x450
[  137.211460]  [<ffffffff8108b600>] ? process_one_work+0x660/0x660
[  137.211464]  [<ffffffff8108b600>] ? process_one_work+0x660/0x660
[  137.211468]  [<ffffffff8109206e>] kthread+0xfe/0x120
[  137.211473]  [<ffffffff81091f70>] ? kthread_create_on_node+0x230/0x230
[  137.211478]  [<ffffffff816d19ef>] ret_from_fork+0x3f/0x70
[  137.211482]  [<ffffffff81091f70>] ? kthread_create_on_node+0x230/0x230
[  137.211485] ---[ end trace b39a62f95b1cad19 ]---
[  137.211499] ------------[ cut here ]------------
[  137.211506] kernel BUG at drivers/scsi/scsi_lib.c:1096!
[  137.211511] invalid opcode: 0000 [#1] PREEMPT SMP 
[  137.211519] Modules linked in: nfsv3 nfs_acl auth_rpcgss oid_registry nfsv4 dns_resolver nfs lockd grace sunrpc fscache parport_pc ppdev parport fuse vmw_vsock_vmci_transport vsock vmw_vmci iscsi_ibft iscsi_boot_sysfs snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm i915 irqbypass crct10dif_pclmul crc32_pclmul snd_emu10k1 snd_hda_codec_ca0132 snd_util_mem aesni_intel snd_ac97_codec snd_hda_intel aes_x86_64 dcdbas snd_hda_codec glue_helper dm_mod ac97_bus lrw gf128mul ablk_helper cryptd snd_rawmidi snd_hda_core i2c_algo_bit snd_seq_device drm_kms_helper snd_hwdep syscopyarea snd_pcm sysfillrect sysimgblt snd_timer fb_sys_fops iTCO_wdt i2c_i801 drm snd iTCO_vendor_support emu10k1_gp battery lpc_ich gameport mfd_core soundcore
[  137.211630]  mei_me video mei thermal button usbhid crc32c_intel sr_mod ehci_pci cdrom serio_raw xhci_pci ehci_hcd e1000e xhci_hcd ptp pps_core usbcore usb_common sg
[  137.211660] CPU: 0 PID: 391 Comm: kworker/0:1H Tainted: G        W       4.4.0-rc3-test2+ #183
[  137.211668] Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
[  137.211675] Workqueue: kblockd bfq_kick_queue
[  137.211681] task: ffff880214754a40 ti: ffff880214304000 task.ti: ffff880214304000
[  137.211713] RIP: 0010:[<ffffffff81514ff2>]  [<ffffffff81514ff2>] scsi_init_sgtable+0x72/0x80
[  137.211747] RSP: 0018:ffff880214307b90  EFLAGS: 00010097
[  137.211777] RAX: 000000000000000e RBX: ffff8802147e68b0 RCX: 0000000000000006
[  137.211807] RDX: 0000000000000024 RSI: 0000000000000000 RDI: 0000000000000009
[  137.211838] RBP: ffff880214307ba8 R08: 0000000000000001 R09: 0000000000000000
[  137.211868] R10: ffff8802147e68b0 R11: 0000000000000001 R12: ffff88003711c320
[  137.211899] R13: 0000000000000000 R14: ffff8800370be000 R15: ffff8802147e68b0
[  137.211930] FS:  0000000000000000(0000) GS:ffff88021e200000(0000) knlGS:0000000000000000
[  137.211961] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  137.211991] CR2: 000000000041bee0 CR3: 00000000d85ac000 CR4: 00000000001426f0
[  137.212021] Stack:
[  137.212048]  ffff88003711c200 000000000003b000 0000000000000000 ffff880214307bf0
[  137.212082]  ffffffff81515045 ffff880214307c28 0000000000000092 ffff8802147e68b0
[  137.212115]  000000000003b000 ffff8800d8a90800 ffff8800370be000 0000000000000020
[  137.212148] Call Trace:
[  137.212178]  [<ffffffff81515045>] scsi_init_io+0x45/0x1d0
[  137.212205]  [<ffffffff815214a8>] sd_setup_read_write_cmnd+0x38/0xa20
[  137.212232]  [<ffffffff810c141d>] ? trace_hardirqs_off+0xd/0x10
[  137.212258]  [<ffffffff81524fc7>] sd_init_command+0x27/0xa0
[  137.212285]  [<ffffffff815152c0>] scsi_setup_cmnd+0xf0/0x150
[  137.212312]  [<ffffffff81515471>] scsi_prep_fn+0xd1/0x170
[  137.212338]  [<ffffffff8137b87c>] blk_peek_request+0x18c/0x320
[  137.212365]  [<ffffffff81516cb2>] scsi_request_fn+0x42/0x610
[  137.212392]  [<ffffffff81374943>] __blk_run_queue+0x33/0x40
[  137.212418]  [<ffffffff813a8015>] bfq_kick_queue+0x25/0x40
[  137.212444]  [<ffffffff8108b18d>] process_one_work+0x1ed/0x660
[  137.212471]  [<ffffffff8108b0fc>] ? process_one_work+0x15c/0x660
[  137.212498]  [<ffffffff8108b64e>] worker_thread+0x4e/0x450
[  137.212525]  [<ffffffff8108b600>] ? process_one_work+0x660/0x660
[  137.212552]  [<ffffffff8108b600>] ? process_one_work+0x660/0x660
[  137.212578]  [<ffffffff8109206e>] kthread+0xfe/0x120
[  137.212605]  [<ffffffff81091f70>] ? kthread_create_on_node+0x230/0x230
[  137.212631]  [<ffffffff816d19ef>] ret_from_fork+0x3f/0x70
[  137.212657]  [<ffffffff81091f70>] ? kthread_create_on_node+0x230/0x230
[  137.212683] Code: 24 08 72 27 41 89 44 24 08 8b 43 5c 5b 41 89 44 24 10 44 89 e8 41 5c 41 5d 5d c3 41 bd 02 00 00 00 5b 44 89 e8 41 5c 41 5d 5d c3 <0f> 0b 66 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 
[  137.212798] RIP  [<ffffffff81514ff2>] scsi_init_sgtable+0x72/0x80
[  137.212826]  RSP <ffff880214307b90>
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jens Axboe Dec. 4, 2015, 5:02 p.m. UTC | #9

On 12/04/2015 09:59 AM, Takashi Iwai wrote:
> On Wed, 25 Nov 2015 20:01:47 +0100,
> Hannes Reinecke wrote:
>>
>> On 11/25/2015 07:01 PM, Mike Snitzer wrote:
>>> On Wed, Nov 25 2015 at  4:04am -0500,
>>> Hannes Reinecke <hare@suse.de> wrote:
>>>
>>>> On 11/20/2015 04:28 PM, Ewan Milne wrote:
>>>>> On Fri, 2015-11-20 at 15:55 +0100, Hannes Reinecke wrote:
>>>>>> Can't we have a joint effort here?
>>>>>> I've been spending a _LOT_ of time trying to debug things here, but
>>>>>> none of the ideas I've come up with have been able to fix anything.
>>>>>
>>>>> Yes.  I'm not the one primarily looking at it, and we don't have a
>>>>> reproducer in-house.  We just have the one dump right now.
>>>>>
>>>>>>
>>>>>> I'm almost tempted to increase the count from scsi_alloc_sgtable()
>>>>>> by one and be done with ...
>>>>>>
>>>>>
>>>>> That might not fix it if it is a problem with the merge code, though.
>>>>>
>>>> And indeed, it doesn't.
>>>
>>> How did you arrive at that?  Do you have a reproducer now?
>>>
>> Not a reproducer, but several dumps for analysis.
>>
>>>> Seems I finally found the culprit.
>>>>
>>>> What happens is this:
>>>> We have two paths, with these seg_boundary_masks:
>>>>
>>>> path-1:    seg_boundary_mask = 65535,
>>>> path-2:    seg_boundary_mask = 4294967295,
>>>>
>>>> consequently the DM request queue has this:
>>>>
>>>> md-1:    seg_boundary_mask = 65535,
>>>>
>>>> What happens now is that a request is being formatted, and sent
>>>> to path 2. During submission req->nr_phys_segments is formatted
>>>> with the limits of path 2, arriving at a count of 3.
>>>> Now the request gets retried on path 1, but as the NOMERGE request
>>>> flag is set req->nr_phys_segments is never updated.
>>>> But blk_rq_map_sg() ignores all counters, and just uses the
>>>> bi_vec directly, resulting in a count of 4 -> boom.
>>>>
>>>> So the culprit here is the NOMERGE flag,
>>>
>>> NOMERGE is always set in __blk_rq_prep_clone() for cloned requests.
>>>
>> Yes.
>>
>>>> which is evaluated via
>>>> ->dm_dispatch_request()
>>>>     ->blk_insert_cloned_request()
>>>>       ->blk_rq_check_limits()
>>>
>>> blk_insert_cloned_request() is the only caller of blk_rq_check_limits();
>>> anyway after reading your mail I'm still left wondering if your proposed
>>> patch is correct.
>>>
>>>> If the above assessment is correct, the following patch should
>>>> fix it:
>>>>
>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>> index 801ced7..12cccd6 100644
>>>> --- a/block/blk-core.c
>>>> +++ b/block/blk-core.c
>>>> @@ -1928,7 +1928,7 @@ EXPORT_SYMBOL(submit_bio);
>>>>     */
>>>>    int blk_rq_check_limits(struct request_queue *q, struct request *rq)
>>>>    {
>>>> -       if (!rq_mergeable(rq))
>>>> +       if (rq->cmd_type != REQ_TYPE_FS)
>>>>                   return 0;
>>>>
>>>>           if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q,
>>>> rq->cmd_flags)) {
>>>>
>>>>
>>>> Mike? Jens?
>>>> Can you comment on it?
>>>
>>> You're not explaining the actual change in the patch very well; I think
>>> you're correct but you're leaving the justification as an exercise to
>>> the reviewer:
>>>
>>> blk_rq_check_limits() will call blk_recalc_rq_segments() after the
>>> !rq_mergeable() check but you're saying for this case in question we
>>> never get there -- due to the cloned request having NOMERGE set.
>>>
>>> So in blk_rq_check_limits() you've unrolled rq_mergeable() and
>>> open-coded the lone remaining check (rq->cmd_type != REQ_TYPE_FS)
>>>
>>> I agree that the (rq->cmd_flags & REQ_NOMERGE_FLAGS) check in
>>> the blk_insert_cloned_request() call-chain (via rq_mergeable()) makes no
>>> sense for cloned requests that always have NOMERGE set.
>>>
>>> So you're saying that by having blk_rq_check_limits() go on to call
>>> blk_recalc_rq_segments() this bug will be fixed?
>>>
>> That is the idea.
>>
>> I've already established that in all instances I have seen so far
>> req->nr_phys_segments is _less_ than req->bio->bi_phys_segments.
>>
>> As it turns out, req->nr_phys_segemnts _would_ have been updated in
>> blk_rq_check_limits(), but isn't due to the NOMERGE flag being set
>> for the cloned request.
>> So each cloned request inherits the values from the original request,
>> despite the fact that req->nr_phys_segments _has_ to be evaluated in
>> the final request_queue context, as the queue limits _might_ be
>> different from the original (merged) queue limits of the multipath
>> request queue.
>>
>>> BTW, I think blk_rq_check_limits()'s export should be removed and the
>>> function made static and renamed to blk_clone_rq_check_limits(), again:
>>> blk_insert_cloned_request() is the only caller of blk_rq_check_limits()
>>>
>> Actually, seeing Jens' last comment the check for REQ_TYPE_FS is
>> pointless, too, so we might as well remove the entire if-clause.
>>
>>> Seems prudent to make that change now to be clear that this code is only
>>> used by cloned requests.
>>>
>> Yeah, that would make sense. I'll be preparing a patch.
>> With a more detailed description :-)
>
> Do we have already a fix?  Right now I got (likely) this kernel BUG()
> on the almost latest Linus tree (commit 25364a9e54fb8296).  It
> happened while I started a KVM right after a fresh boot.  The machine
> paniced even before that, so I hit this twice today.

Update to the tree as-of yesterday (or today) and it should work. 
25364a9e54fb8296 doesn't include the latest block fixes that were sent 
in yesterday, that should fix it. You need commit a88d32af18b8 or newer.

Takashi Iwai Dec. 4, 2015, 5:09 p.m. UTC | #10

On Fri, 04 Dec 2015 18:02:58 +0100,
Jens Axboe wrote:
> 
> On 12/04/2015 09:59 AM, Takashi Iwai wrote:
> > On Wed, 25 Nov 2015 20:01:47 +0100,
> > Hannes Reinecke wrote:
> >>
> >> On 11/25/2015 07:01 PM, Mike Snitzer wrote:
> >>> On Wed, Nov 25 2015 at  4:04am -0500,
> >>> Hannes Reinecke <hare@suse.de> wrote:
> >>>
> >>>> On 11/20/2015 04:28 PM, Ewan Milne wrote:
> >>>>> On Fri, 2015-11-20 at 15:55 +0100, Hannes Reinecke wrote:
> >>>>>> Can't we have a joint effort here?
> >>>>>> I've been spending a _LOT_ of time trying to debug things here, but
> >>>>>> none of the ideas I've come up with have been able to fix anything.
> >>>>>
> >>>>> Yes.  I'm not the one primarily looking at it, and we don't have a
> >>>>> reproducer in-house.  We just have the one dump right now.
> >>>>>
> >>>>>>
> >>>>>> I'm almost tempted to increase the count from scsi_alloc_sgtable()
> >>>>>> by one and be done with ...
> >>>>>>
> >>>>>
> >>>>> That might not fix it if it is a problem with the merge code, though.
> >>>>>
> >>>> And indeed, it doesn't.
> >>>
> >>> How did you arrive at that?  Do you have a reproducer now?
> >>>
> >> Not a reproducer, but several dumps for analysis.
> >>
> >>>> Seems I finally found the culprit.
> >>>>
> >>>> What happens is this:
> >>>> We have two paths, with these seg_boundary_masks:
> >>>>
> >>>> path-1:    seg_boundary_mask = 65535,
> >>>> path-2:    seg_boundary_mask = 4294967295,
> >>>>
> >>>> consequently the DM request queue has this:
> >>>>
> >>>> md-1:    seg_boundary_mask = 65535,
> >>>>
> >>>> What happens now is that a request is being formatted, and sent
> >>>> to path 2. During submission req->nr_phys_segments is formatted
> >>>> with the limits of path 2, arriving at a count of 3.
> >>>> Now the request gets retried on path 1, but as the NOMERGE request
> >>>> flag is set req->nr_phys_segments is never updated.
> >>>> But blk_rq_map_sg() ignores all counters, and just uses the
> >>>> bi_vec directly, resulting in a count of 4 -> boom.
> >>>>
> >>>> So the culprit here is the NOMERGE flag,
> >>>
> >>> NOMERGE is always set in __blk_rq_prep_clone() for cloned requests.
> >>>
> >> Yes.
> >>
> >>>> which is evaluated via
> >>>> ->dm_dispatch_request()
> >>>>     ->blk_insert_cloned_request()
> >>>>       ->blk_rq_check_limits()
> >>>
> >>> blk_insert_cloned_request() is the only caller of blk_rq_check_limits();
> >>> anyway after reading your mail I'm still left wondering if your proposed
> >>> patch is correct.
> >>>
> >>>> If the above assessment is correct, the following patch should
> >>>> fix it:
> >>>>
> >>>> diff --git a/block/blk-core.c b/block/blk-core.c
> >>>> index 801ced7..12cccd6 100644
> >>>> --- a/block/blk-core.c
> >>>> +++ b/block/blk-core.c
> >>>> @@ -1928,7 +1928,7 @@ EXPORT_SYMBOL(submit_bio);
> >>>>     */
> >>>>    int blk_rq_check_limits(struct request_queue *q, struct request *rq)
> >>>>    {
> >>>> -       if (!rq_mergeable(rq))
> >>>> +       if (rq->cmd_type != REQ_TYPE_FS)
> >>>>                   return 0;
> >>>>
> >>>>           if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q,
> >>>> rq->cmd_flags)) {
> >>>>
> >>>>
> >>>> Mike? Jens?
> >>>> Can you comment on it?
> >>>
> >>> You're not explaining the actual change in the patch very well; I think
> >>> you're correct but you're leaving the justification as an exercise to
> >>> the reviewer:
> >>>
> >>> blk_rq_check_limits() will call blk_recalc_rq_segments() after the
> >>> !rq_mergeable() check but you're saying for this case in question we
> >>> never get there -- due to the cloned request having NOMERGE set.
> >>>
> >>> So in blk_rq_check_limits() you've unrolled rq_mergeable() and
> >>> open-coded the lone remaining check (rq->cmd_type != REQ_TYPE_FS)
> >>>
> >>> I agree that the (rq->cmd_flags & REQ_NOMERGE_FLAGS) check in
> >>> the blk_insert_cloned_request() call-chain (via rq_mergeable()) makes no
> >>> sense for cloned requests that always have NOMERGE set.
> >>>
> >>> So you're saying that by having blk_rq_check_limits() go on to call
> >>> blk_recalc_rq_segments() this bug will be fixed?
> >>>
> >> That is the idea.
> >>
> >> I've already established that in all instances I have seen so far
> >> req->nr_phys_segments is _less_ than req->bio->bi_phys_segments.
> >>
> >> As it turns out, req->nr_phys_segemnts _would_ have been updated in
> >> blk_rq_check_limits(), but isn't due to the NOMERGE flag being set
> >> for the cloned request.
> >> So each cloned request inherits the values from the original request,
> >> despite the fact that req->nr_phys_segments _has_ to be evaluated in
> >> the final request_queue context, as the queue limits _might_ be
> >> different from the original (merged) queue limits of the multipath
> >> request queue.
> >>
> >>> BTW, I think blk_rq_check_limits()'s export should be removed and the
> >>> function made static and renamed to blk_clone_rq_check_limits(), again:
> >>> blk_insert_cloned_request() is the only caller of blk_rq_check_limits()
> >>>
> >> Actually, seeing Jens' last comment the check for REQ_TYPE_FS is
> >> pointless, too, so we might as well remove the entire if-clause.
> >>
> >>> Seems prudent to make that change now to be clear that this code is only
> >>> used by cloned requests.
> >>>
> >> Yeah, that would make sense. I'll be preparing a patch.
> >> With a more detailed description :-)
> >
> > Do we have already a fix?  Right now I got (likely) this kernel BUG()
> > on the almost latest Linus tree (commit 25364a9e54fb8296).  It
> > happened while I started a KVM right after a fresh boot.  The machine
> > paniced even before that, so I hit this twice today.
> 
> Update to the tree as-of yesterday (or today) and it should work. 
> 25364a9e54fb8296 doesn't include the latest block fixes that were sent 
> in yesterday, that should fix it. You need commit a88d32af18b8 or newer.

Alright, I'll give it a try.  Thanks for information!


Takashi
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

kernel BUG at drivers/scsi/scsi_lib.c:1096!

Commit Message

Comments

Patch