
[V2] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag

Message ID: CAHsXFKHSjBHZy9tyhdEjYJcLJeE6SqdeBPcoeq=bCaRfzvMqgw@mail.gmail.com (mailing list archive)
State: Changes Requested
Series: [V2] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag

Commit Message

Kashyap Desai Dec. 18, 2018, 7:08 a.m. UTC
V1 -> V2:
Added a fix in __blk_mq_free_request() around blk_mq_put_tag() for
non-internal tags

Problem statement:
Whenever the driver tries to get an outstanding request via
scsi_host_find_tag(), the block layer may return stale entries instead
of the actual outstanding request. The kernel panics if the stale entry
is inaccessible or its memory has been reused.
Fix:
Undo the request mapping in blk_mq_put_driver_tag() once the request is
returned.

More detail:
Whenever an SDEV entry is created, the block layer allocates separate
tags and static requests for it. Those requests are no longer valid
after the SDEV is deleted from the system. On the fly, the block layer
maps static rqs to rqs as below in blk_mq_get_driver_tag():

data.hctx->tags->rqs[rq->tag] = rq;

The above mapping tracks active, in-use requests, and it is the same
mapping that is dereferenced in scsi_host_find_tag().
After running some IOs, "data.hctx->tags->rqs[rq->tag]" will contain
entries that are never reset by the block layer.
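
For reference, the lookup path that can hit such a stale entry looks
roughly like this (simplified from blk_mq_tag_to_rq(), which
scsi_host_find_tag() resolves tags through; not the exact kernel source):

        struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags,
                                         unsigned int tag)
        {
                if (tag < tags->nr_tags)
                        return tags->rqs[tag];  /* may be a stale pointer */
                return NULL;
        }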

A kernel panic occurs if a request pointed to by
"data.hctx->tags->rqs[rq->tag]" belongs to an sdev that has been
removed: as part of the removal, all memory allocated for requests
associated with that sdev may be reused or become inaccessible to the
driver.
Kernel panic snippet -

BUG: unable to handle kernel paging request at ffffff8000000010
IP: [<ffffffffc048306c>] mpt3sas_scsih_scsi_lookup_get+0x6c/0xc0 [mpt3sas]
PGD aa4414067 PUD 0
Oops: 0000 [#1] SMP
Call Trace:
 [<ffffffffc046f72f>] mpt3sas_get_st_from_smid+0x1f/0x60 [mpt3sas]
 [<ffffffffc047e125>] scsih_shutdown+0x55/0x100 [mpt3sas]

Cc: <stable@vger.kernel.org>
Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Sreekanth Reddy <sreekanth.reddy@broadcom.com>

---
 block/blk-mq.c | 4 +++-
 block/blk-mq.h | 1 +
 2 files changed, 4 insertions(+), 1 deletion(-)

Comments

Hannes Reinecke Dec. 18, 2018, 8:27 a.m. UTC | #1
On 12/18/18 8:08 AM, Kashyap Desai wrote:
> V1 -> V2:
> Added a fix in __blk_mq_free_request() around blk_mq_put_tag() for
> non-internal tags
> 
> Problem statement:
> Whenever the driver tries to get an outstanding request via
> scsi_host_find_tag(), the block layer may return stale entries instead
> of the actual outstanding request. The kernel panics if the stale entry
> is inaccessible or its memory has been reused.
> Fix:
> Undo the request mapping in blk_mq_put_driver_tag() once the request is
> returned.
> 
> More detail:
> Whenever an SDEV entry is created, the block layer allocates separate
> tags and static requests for it. Those requests are no longer valid
> after the SDEV is deleted from the system. On the fly, the block layer
> maps static rqs to rqs as below in blk_mq_get_driver_tag():
> 
> data.hctx->tags->rqs[rq->tag] = rq;
> 
> The above mapping tracks active, in-use requests, and it is the same
> mapping that is dereferenced in scsi_host_find_tag().
> After running some IOs, "data.hctx->tags->rqs[rq->tag]" will contain
> entries that are never reset by the block layer.
> 
> A kernel panic occurs if a request pointed to by
> "data.hctx->tags->rqs[rq->tag]" belongs to an sdev that has been
> removed: as part of the removal, all memory allocated for requests
> associated with that sdev may be reused or become inaccessible to the
> driver.
> Kernel panic snippet -
> 
> BUG: unable to handle kernel paging request at ffffff8000000010
> IP: [<ffffffffc048306c>] mpt3sas_scsih_scsi_lookup_get+0x6c/0xc0 [mpt3sas]
> PGD aa4414067 PUD 0
> Oops: 0000 [#1] SMP
> Call Trace:
>   [<ffffffffc046f72f>] mpt3sas_get_st_from_smid+0x1f/0x60 [mpt3sas]
>   [<ffffffffc047e125>] scsih_shutdown+0x55/0x100 [mpt3sas]
> 
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
> Signed-off-by: Sreekanth Reddy <sreekanth.reddy@broadcom.com>
> 
> ---
>   block/blk-mq.c | 4 +++-
>   block/blk-mq.h | 1 +
>   2 files changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 6a75662..88d1e92 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -477,8 +477,10 @@ static void __blk_mq_free_request(struct request *rq)
>       const int sched_tag = rq->internal_tag;
> 
>       blk_pm_mark_last_busy(rq);
> -    if (rq->tag != -1)
> +    if (rq->tag != -1) {
> +        hctx->tags->rqs[rq->tag] = NULL;
>           blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
> +    }
>       if (sched_tag != -1)
>           blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
>       blk_mq_sched_restart(hctx);
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 9497b47..57432be 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -175,6 +175,7 @@ static inline bool
> blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
>   static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
>                          struct request *rq)
>   {
> +    hctx->tags->rqs[rq->tag] = NULL;
>       blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
>       rq->tag = -1;
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
Bart Van Assche Dec. 18, 2018, 4:15 p.m. UTC | #2
On Tue, 2018-12-18 at 12:38 +0530, Kashyap Desai wrote:
> V1 -> V2:
> Added a fix in __blk_mq_free_request() around blk_mq_put_tag() for
> non-internal tags
> 
> Problem statement:
> Whenever the driver tries to get an outstanding request via
> scsi_host_find_tag(), the block layer may return stale entries instead
> of the actual outstanding request. The kernel panics if the stale entry
> is inaccessible or its memory has been reused.
> Fix:
> Undo the request mapping in blk_mq_put_driver_tag() once the request is
> returned.
> 
> More detail:
> Whenever an SDEV entry is created, the block layer allocates separate
> tags and static requests for it. Those requests are no longer valid
> after the SDEV is deleted from the system. On the fly, the block layer
> maps static rqs to rqs as below in blk_mq_get_driver_tag():
> 
> data.hctx->tags->rqs[rq->tag] = rq;
> 
> The above mapping tracks active, in-use requests, and it is the same
> mapping that is dereferenced in scsi_host_find_tag().
> After running some IOs, "data.hctx->tags->rqs[rq->tag]" will contain
> entries that are never reset by the block layer.
> 
> A kernel panic occurs if a request pointed to by
> "data.hctx->tags->rqs[rq->tag]" belongs to an sdev that has been
> removed: as part of the removal, all memory allocated for requests
> associated with that sdev may be reused or become inaccessible to the
> driver.
> Kernel panic snippet -
> 
> BUG: unable to handle kernel paging request at ffffff8000000010
> IP: [<ffffffffc048306c>] mpt3sas_scsih_scsi_lookup_get+0x6c/0xc0 [mpt3sas]
> PGD aa4414067 PUD 0
> Oops: 0000 [#1] SMP
> Call Trace:
>  [<ffffffffc046f72f>] mpt3sas_get_st_from_smid+0x1f/0x60 [mpt3sas]
>  [<ffffffffc047e125>] scsih_shutdown+0x55/0x100 [mpt3sas]

Other block drivers (e.g. ib_srp, skd) do not need this to work reliably.
It has been explained to you that the bug that you reported can be fixed
by modifying the mpt3sas driver. So why fix this by modifying the block
layer? Additionally, what prevents a race condition between
the block layer clearing hctx->tags->rqs[rq->tag] and scsi_host_find_tag()
reading that same array element? I'm afraid that this is an attempt to
paper over a real problem instead of fixing the root cause.
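
A hypothetical interleaving of the race in question (with the proposed
patch applied; this is an illustration, not an observed trace):

        CPU 0: hctx->tags->rqs[rq->tag] = NULL; /* __blk_mq_put_driver_tag() */
        CPU 1: rq = tags->rqs[tag];             /* scsi_host_find_tag(); may
                                                   still see the old rq, NULL,
                                                   or a recycled request */

Nothing orders the store on CPU 0 against the load on CPU 1.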

Bart.
Jens Axboe Dec. 18, 2018, 4:18 p.m. UTC | #3
On 12/18/18 9:15 AM, Bart Van Assche wrote:
> On Tue, 2018-12-18 at 12:38 +0530, Kashyap Desai wrote:
>> V1 -> V2:
>> Added a fix in __blk_mq_free_request() around blk_mq_put_tag() for
>> non-internal tags
>>
>> Problem statement:
>> Whenever the driver tries to get an outstanding request via
>> scsi_host_find_tag(), the block layer may return stale entries instead
>> of the actual outstanding request. The kernel panics if the stale entry
>> is inaccessible or its memory has been reused.
>> Fix:
>> Undo the request mapping in blk_mq_put_driver_tag() once the request is
>> returned.
>>
>> More detail:
>> Whenever an SDEV entry is created, the block layer allocates separate
>> tags and static requests for it. Those requests are no longer valid
>> after the SDEV is deleted from the system. On the fly, the block layer
>> maps static rqs to rqs as below in blk_mq_get_driver_tag():
>>
>> data.hctx->tags->rqs[rq->tag] = rq;
>>
>> The above mapping tracks active, in-use requests, and it is the same
>> mapping that is dereferenced in scsi_host_find_tag().
>> After running some IOs, "data.hctx->tags->rqs[rq->tag]" will contain
>> entries that are never reset by the block layer.
>>
>> A kernel panic occurs if a request pointed to by
>> "data.hctx->tags->rqs[rq->tag]" belongs to an sdev that has been
>> removed: as part of the removal, all memory allocated for requests
>> associated with that sdev may be reused or become inaccessible to the
>> driver.
>> Kernel panic snippet -
>>
>> BUG: unable to handle kernel paging request at ffffff8000000010
>> IP: [<ffffffffc048306c>] mpt3sas_scsih_scsi_lookup_get+0x6c/0xc0 [mpt3sas]
>> PGD aa4414067 PUD 0
>> Oops: 0000 [#1] SMP
>> Call Trace:
>>  [<ffffffffc046f72f>] mpt3sas_get_st_from_smid+0x1f/0x60 [mpt3sas]
>>  [<ffffffffc047e125>] scsih_shutdown+0x55/0x100 [mpt3sas]
> 
> Other block drivers (e.g. ib_srp, skd) do not need this to work reliably.
> It has been explained to you that the bug that you reported can be fixed
> by modifying the mpt3sas driver. So why to fix this by modifying the block
> layer? Additionally, what prevents that a race condition occurs between
> the block layer clearing hctx->tags->rqs[rq->tag] and scsi_host_find_tag()
> reading that same array element? I'm afraid that this is an attempt to
> paper over a real problem instead of fixing the root cause.

I have to agree with Bart here, I just don't see how the mpt3sas use case
is special. The change will paper over the issue in any case.
Kashyap Desai Dec. 18, 2018, 5:08 p.m. UTC | #4
> >
> > Other block drivers (e.g. ib_srp, skd) do not need this to work
> > reliably.
> > It has been explained to you that the bug that you reported can be
> > fixed by modifying the mpt3sas driver. So why fix this by modifying
> > the block layer? Additionally, what prevents a race condition between
> > the block layer clearing hctx->tags->rqs[rq->tag] and
> > scsi_host_find_tag() reading that same array element? I'm afraid that
> > this is an attempt to paper over a real problem instead of fixing the
> > root cause.
>
> I have to agree with Bart here, I just don't see how the mpt3sas use
> case is special. The change will paper over the issue in any case.

Hi Jens, Bart

One of the key requirements for iterating the whole tagset using
scsi_host_find_tag() is to block the scsi host. Once that is done, we
should be good; no race condition is possible if that part is taken care
of. Without this patch, the driver may still receive a scsi command from
hctx->tags->rqs which is not really outstanding. I am finding this is a
common issue for many scsi low level drivers.
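
Roughly, the pattern we rely on is the following (an illustrative sketch
using the existing scsi_block_requests() interface; the handling of each
command is driver specific):

        struct scsi_cmnd *sc;
        int tag;

        /* block new submissions before walking the tag space */
        scsi_block_requests(shost);
        for (tag = 0; tag < shost->can_queue; tag++) {
                sc = scsi_host_find_tag(shost, tag);
                if (!sc)
                        continue;
                /* driver-specific handling of the outstanding command */
        }
        scsi_unblock_requests(shost);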

Just for example, <fnic>'s fnic_is_abts_pending() function has the below
code -

        for (tag = 0; tag < fnic->fnic_max_tag_id; tag++) {
                sc = scsi_host_find_tag(fnic->lport->host, tag);
                /*
                 * ignore this lun reset cmd or cmds that do not belong to
                 * this lun
                 */
                if (!sc || (lr_sc && (sc->device != lun_dev || sc == lr_sc)))
                        continue;

The above code has a similar exposure to kernel panic as the <mpt3sas>
driver while accessing sc->device.

The panic is more obvious if scsi devices are added/removed before
looping through scsi_host_find_tag().

Avoiding block layer changes was also attempted in <mpt3sas>, but our
problem is keeping that code common for non-mq and mq.
Temporarily, to unblock this issue, we have fixed <mpt3sas> using the
driver-internal scsiio_tracker() instead of piggybacking on the
scsi_command.

Kashyap

>
> --
> Jens Axboe
Jens Axboe Dec. 18, 2018, 5:13 p.m. UTC | #5
On 12/18/18 10:08 AM, Kashyap Desai wrote:
>>>
>>> Other block drivers (e.g. ib_srp, skd) do not need this to work
>>> reliably.
>>> It has been explained to you that the bug that you reported can be
>>> fixed by modifying the mpt3sas driver. So why fix this by modifying
>>> the block layer? Additionally, what prevents a race condition between
>>> the block layer clearing hctx->tags->rqs[rq->tag] and
>>> scsi_host_find_tag() reading that same array element? I'm afraid that
>>> this is an attempt to paper over a real problem instead of fixing the
>>> root cause.
>>
>> I have to agree with Bart here, I just don't see how the mpt3sas use
>> case is special. The change will paper over the issue in any case.
> 
> Hi Jens, Bart
> 
> One of the key requirements for iterating the whole tagset using
> scsi_host_find_tag() is to block the scsi host. Once that is done, we
> should be good; no race condition is possible if that part is taken care
> of. Without this patch, the driver may still receive a scsi command from
> hctx->tags->rqs which is not really outstanding. I am finding this is a
> common issue for many scsi low level drivers.
> 
> Just for example, <fnic>'s fnic_is_abts_pending() function has the below
> code -
> 
>         for (tag = 0; tag < fnic->fnic_max_tag_id; tag++) {
>                 sc = scsi_host_find_tag(fnic->lport->host, tag);
>                 /*
>                  * ignore this lun reset cmd or cmds that do not belong to
>                  * this lun
>                  */
>                 if (!sc || (lr_sc && (sc->device != lun_dev || sc == lr_sc)))
>                         continue;
> 
> The above code has a similar exposure to kernel panic as the <mpt3sas>
> driver while accessing sc->device.
> 
> The panic is more obvious if scsi devices are added/removed before
> looping through scsi_host_find_tag().
> 
> Avoiding block layer changes was also attempted in <mpt3sas>, but our
> problem is keeping that code common for non-mq and mq.
> Temporarily, to unblock this issue, we have fixed <mpt3sas> using the
> driver-internal scsiio_tracker() instead of piggybacking on the
> scsi_command.

For mq, the requests never go out of scope, they are always valid. So
the key question here is WHY they have been freed. If the queue gets killed,
then one potential solution would be to clear pointers in the tag map
belonging to that queue. That also takes it out of the hot path.
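
Something along these lines, perhaps (a hypothetical slow-path helper;
the name and placement are illustrative, not existing code):

        static void blk_mq_clear_rq_mapping(struct blk_mq_tags *tags,
                                            struct request_queue *q)
        {
                unsigned int i;

                /*
                 * On queue teardown, drop any ->rqs[] entries that still
                 * point into this queue's static requests.
                 */
                for (i = 0; i < tags->nr_tags; i++) {
                        struct request *rq = tags->rqs[i];

                        if (rq && rq->q == q)
                                tags->rqs[i] = NULL;
                }
        }
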
Kashyap Desai Dec. 18, 2018, 5:48 p.m. UTC | #6
>
> On 12/18/18 10:08 AM, Kashyap Desai wrote:
> >>>
> >>> Other block drivers (e.g. ib_srp, skd) do not need this to work
> >>> reliably.
> >>> It has been explained to you that the bug that you reported can be
> >>> fixed by modifying the mpt3sas driver. So why fix this by
> >>> modifying the block layer? Additionally, what prevents a race
> >>> condition between the block layer clearing
> >>> hctx->tags->rqs[rq->tag] and
> >>> scsi_host_find_tag() reading that same array element? I'm afraid
> >>> that this is an attempt to paper over a real problem instead of
> >>> fixing the root cause.
> >>
> >> I have to agree with Bart here, I just don't see how the mpt3sas use
> >> case is special. The change will paper over the issue in any case.
> >
> > Hi Jens, Bart
> >
> > One of the key requirements for iterating the whole tagset using
> > scsi_host_find_tag() is to block the scsi host. Once that is done, we
> > should be good; no race condition is possible if that part is taken
> > care of. Without this patch, the driver may still receive a scsi
> > command from hctx->tags->rqs which is not really outstanding. I am
> > finding this is a common issue for many scsi low level drivers.
> >
> > Just for example, <fnic>'s fnic_is_abts_pending() function has the
> > below code -
> >
> >         for (tag = 0; tag < fnic->fnic_max_tag_id; tag++) {
> >                 sc = scsi_host_find_tag(fnic->lport->host, tag);
> >                 /*
> >                  * ignore this lun reset cmd or cmds that do not belong
> >                  * to this lun
> >                  */
> >                 if (!sc || (lr_sc && (sc->device != lun_dev || sc == lr_sc)))
> >                         continue;
> >
> > The above code has a similar exposure to kernel panic as the <mpt3sas>
> > driver while accessing sc->device.
> >
> > The panic is more obvious if scsi devices are added/removed before
> > looping through scsi_host_find_tag().
> >
> > Avoiding block layer changes was also attempted in <mpt3sas>, but our
> > problem is keeping that code common for non-mq and mq.
> > Temporarily, to unblock this issue, we have fixed <mpt3sas> using the
> > driver-internal scsiio_tracker() instead of piggybacking on the
> > scsi_command.
>
> For mq, the requests never go out of scope, they are always valid. So
> the key question here is WHY they have been freed. If the queue gets
> killed, then one potential solution would be to clear pointers in the
> tag map belonging to that queue. That also takes it out of the hot path.

At driver load, whenever the driver calls scsi_add_host_with_dma(), it
follows the below code path in the block layer:

scsi_mq_setup_tags
  ->blk_mq_alloc_tag_set
          -> blk_mq_alloc_rq_maps
                     -> __blk_mq_alloc_rq_maps

SML creates two sets of request pools: one per HBA and the other per
SDEV. I was confused about why SML creates a request pool per HBA.

Example - if the HBA queue depth is 1K and there are 8 devices behind
that HBA, the total request pool created is 1K + 8 * the scsi_device
queue depth. The 1K pool is always static, but the other request pools
are managed as scsi devices are added/removed.

I never observed the requests allocated per HBA being used in the IO
path; it is always the requests allocated per scsi device that are
active. Also, what I observed is that whenever a scsi_device is deleted,
the associated requests are also deleted. What is missing is: "the
deleted request is still available in hctx->tags->rqs[rq->tag]."

If there is an assurance that all requests will be valid as long as the
hctx is available, this patch is not correct. I posted the patch based on
the assumption that requests per hctx can be removed whenever a scsi
device is removed.

Kashyap

>
> --
> Jens Axboe
Jens Axboe Dec. 18, 2018, 5:50 p.m. UTC | #7
On 12/18/18 10:48 AM, Kashyap Desai wrote:
>>
>> On 12/18/18 10:08 AM, Kashyap Desai wrote:
>>>>>
>>>>> Other block drivers (e.g. ib_srp, skd) do not need this to work
>>>>> reliably.
>>>>> It has been explained to you that the bug that you reported can be
>>>>> fixed by modifying the mpt3sas driver. So why fix this by
>>>>> modifying the block layer? Additionally, what prevents a race
>>>>> condition between the block layer clearing
>>>>> hctx->tags->rqs[rq->tag] and
>>>>> scsi_host_find_tag() reading that same array element? I'm afraid
>>>>> that this is an attempt to paper over a real problem instead of
>>>>> fixing the root cause.
>>>>
>>>> I have to agree with Bart here, I just don't see how the mpt3sas use
>>>> case is special. The change will paper over the issue in any case.
>>>
>>> Hi Jens, Bart
>>>
>>> One of the key requirements for iterating the whole tagset using
>>> scsi_host_find_tag() is to block the scsi host. Once that is done, we
>>> should be good; no race condition is possible if that part is taken
>>> care of. Without this patch, the driver may still receive a scsi
>>> command from hctx->tags->rqs which is not really outstanding. I am
>>> finding this is a common issue for many scsi low level drivers.
>>>
>>> Just for example, <fnic>'s fnic_is_abts_pending() function has the
>>> below code -
>>>
>>>         for (tag = 0; tag < fnic->fnic_max_tag_id; tag++) {
>>>                 sc = scsi_host_find_tag(fnic->lport->host, tag);
>>>                 /*
>>>                  * ignore this lun reset cmd or cmds that do not belong
>>>                  * to this lun
>>>                  */
>>>                 if (!sc || (lr_sc && (sc->device != lun_dev || sc == lr_sc)))
>>>                         continue;
>>>
>>> The above code has a similar exposure to kernel panic as the <mpt3sas>
>>> driver while accessing sc->device.
>>>
>>> The panic is more obvious if scsi devices are added/removed before
>>> looping through scsi_host_find_tag().
>>>
>>> Avoiding block layer changes was also attempted in <mpt3sas>, but our
>>> problem is keeping that code common for non-mq and mq.
>>> Temporarily, to unblock this issue, we have fixed <mpt3sas> using the
>>> driver-internal scsiio_tracker() instead of piggybacking on the
>>> scsi_command.
>>
>> For mq, the requests never go out of scope, they are always valid. So
>> the key question here is WHY they have been freed. If the queue gets
>> killed, then one potential solution would be to clear pointers in the
>> tag map belonging to that queue. That also takes it out of the hot path.
> 
> At driver load, whenever the driver calls scsi_add_host_with_dma(), it
> follows the below code path in the block layer:
> 
> scsi_mq_setup_tags
>   ->blk_mq_alloc_tag_set
>           -> blk_mq_alloc_rq_maps
>                      -> __blk_mq_alloc_rq_maps
> 
> SML creates two sets of request pools: one per HBA and the other per
> SDEV. I was confused about why SML creates a request pool per HBA.
> 
> Example - if the HBA queue depth is 1K and there are 8 devices behind
> that HBA, the total request pool created is 1K + 8 * the scsi_device
> queue depth. The 1K pool is always static, but the other request pools
> are managed as scsi devices are added/removed.
> 
> I never observed the requests allocated per HBA being used in the IO
> path; it is always the requests allocated per scsi device that are
> active. Also, what I observed is that whenever a scsi_device is deleted,
> the associated requests are also deleted. What is missing is: "the
> deleted request is still available in hctx->tags->rqs[rq->tag]."

So that sounds like the issue. If the device is deleted and its requests
go away, those pointers should be cleared. That's what your patch should
do, not do it for each IO.
Kashyap Desai Dec. 18, 2018, 6:08 p.m. UTC | #8
> On 12/18/18 10:48 AM, Kashyap Desai wrote:
> >>
> >> On 12/18/18 10:08 AM, Kashyap Desai wrote:
> >>>>>
> >>>>> Other block drivers (e.g. ib_srp, skd) do not need this to work
> >>>>> reliably.
> >>>>> It has been explained to you that the bug that you reported can be
> >>>>> fixed by modifying the mpt3sas driver. So why fix this by
> >>>>> modifying the block layer? Additionally, what prevents a race
> >>>>> condition between the block layer clearing
> >>>>> hctx->tags->rqs[rq->tag] and
> >>>>> scsi_host_find_tag() reading that same array element? I'm afraid
> >>>>> that this is an attempt to paper over a real problem instead of
> >>>>> fixing the root cause.
> >>>>
> >>>> I have to agree with Bart here, I just don't see how the mpt3sas
> >>>> use case is special. The change will paper over the issue in any
> >>>> case.
> >>>
> >>> Hi Jens, Bart
> >>>
> >>> One of the key requirements for iterating the whole tagset using
> >>> scsi_host_find_tag() is to block the scsi host. Once that is done,
> >>> we should be good; no race condition is possible if that part is
> >>> taken care of. Without this patch, the driver may still receive a
> >>> scsi command from hctx->tags->rqs which is not really outstanding.
> >>> I am finding this is a common issue for many scsi low level drivers.
> >>>
> >>> Just for example, <fnic>'s fnic_is_abts_pending() function has the
> >>> below code -
> >>>
> >>>         for (tag = 0; tag < fnic->fnic_max_tag_id; tag++) {
> >>>                 sc = scsi_host_find_tag(fnic->lport->host, tag);
> >>>                 /*
> >>>                  * ignore this lun reset cmd or cmds that do not
> >>>                  * belong to this lun
> >>>                  */
> >>>                 if (!sc || (lr_sc && (sc->device != lun_dev || sc == lr_sc)))
> >>>                         continue;
> >>>
> >>> The above code has a similar exposure to kernel panic as the
> >>> <mpt3sas> driver while accessing sc->device.
> >>>
> >>> The panic is more obvious if scsi devices are added/removed before
> >>> looping through scsi_host_find_tag().
> >>>
> >>> Avoiding block layer changes was also attempted in <mpt3sas>, but
> >>> our problem is keeping that code common for non-mq and mq.
> >>> Temporarily, to unblock this issue, we have fixed <mpt3sas> using
> >>> the driver-internal scsiio_tracker() instead of piggybacking on the
> >>> scsi_command.
> >>
> >> For mq, the requests never go out of scope, they are always valid. So
> >> the key question here is WHY they have been freed. If the queue gets
> >> killed, then one potential solution would be to clear pointers in the
> >> tag map belonging to that queue. That also takes it out of the hot
> >> path.
> >
> > At driver load, whenever the driver calls scsi_add_host_with_dma(),
> > it follows the below code path in the block layer:
> >
> > scsi_mq_setup_tags
> >   ->blk_mq_alloc_tag_set
> >           -> blk_mq_alloc_rq_maps
> >                      -> __blk_mq_alloc_rq_maps
> >
> > SML creates two sets of request pools: one per HBA and the other per
> > SDEV. I was confused about why SML creates a request pool per HBA.
> >
> > Example - if the HBA queue depth is 1K and there are 8 devices behind
> > that HBA, the total request pool created is 1K + 8 * the scsi_device
> > queue depth. The 1K pool is always static, but the other request pools
> > are managed as scsi devices are added/removed.
> >
> > I never observed the requests allocated per HBA being used in the IO
> > path; it is always the requests allocated per scsi device that are
> > active. Also, what I observed is that whenever a scsi_device is
> > deleted, the associated requests are also deleted. What is missing is:
> > "the deleted request is still available in hctx->tags->rqs[rq->tag]."
>
> So that sounds like the issue. If the device is deleted and its requests
> go away, those pointers should be cleared. That's what your patch should
> do, not do it for each IO.

At the time of device removal, reverse traversal is required: find out
whether each request associated with the sdev is present in
hctx->tags->rqs[] and clear that entry.
I am not sure about atomic traversal if more than one device removal is
happening in parallel. That may be more error prone?

Just wondering - either way we will be removing invalid requests from
the array. Are you suspecting a performance issue if we do it per IO?

Kashyap

>
>
> --
> Jens Axboe
Jens Axboe Dec. 18, 2018, 6:11 p.m. UTC | #9
On 12/18/18 11:08 AM, Kashyap Desai wrote:
>> On 12/18/18 10:48 AM, Kashyap Desai wrote:
>>>>
>>>> On 12/18/18 10:08 AM, Kashyap Desai wrote:
>>>>>>>
>>>>>>> Other block drivers (e.g. ib_srp, skd) do not need this to work
>>>>>>> reliably.
>>>>>>> It has been explained to you that the bug that you reported can be
>>>>>>> fixed by modifying the mpt3sas driver. So why fix this by
>>>>>>> modifying the block layer? Additionally, what prevents a race
>>>>>>> condition between the block layer clearing
>>>>>>> hctx->tags->rqs[rq->tag] and
>>>>>>> scsi_host_find_tag() reading that same array element? I'm afraid
>>>>>>> that this is an attempt to paper over a real problem instead of
>>>>>>> fixing the root cause.
>>>>>>
>>>>>> I have to agree with Bart here, I just don't see how the mpt3sas
>>>>>> use case is special. The change will paper over the issue in any
>>>>>> case.
>>>>>
>>>>> Hi Jens, Bart
>>>>>
>>>>> One of the key requirements for iterating the whole tagset using
>>>>> scsi_host_find_tag() is to block the scsi host. Once that is done,
>>>>> we should be good; no race condition is possible if that part is
>>>>> taken care of. Without this patch, the driver may still receive a
>>>>> scsi command from hctx->tags->rqs which is not really outstanding.
>>>>> I am finding this is a common issue for many scsi low level drivers.
>>>>>
>>>>> Just for example, <fnic>'s fnic_is_abts_pending() function has the
>>>>> below code -
>>>>>
>>>>>         for (tag = 0; tag < fnic->fnic_max_tag_id; tag++) {
>>>>>                 sc = scsi_host_find_tag(fnic->lport->host, tag);
>>>>>                 /*
>>>>>                  * ignore this lun reset cmd or cmds that do not
>>>>>                  * belong to this lun
>>>>>                  */
>>>>>                 if (!sc || (lr_sc && (sc->device != lun_dev || sc == lr_sc)))
>>>>>                         continue;
>>>>>
>>>>> The above code has a similar exposure to kernel panic as the
>>>>> <mpt3sas> driver while accessing sc->device.
>>>>>
>>>>> The panic is more obvious if scsi devices are added/removed before
>>>>> looping through scsi_host_find_tag().
>>>>>
>>>>> Avoiding block layer changes was also attempted in <mpt3sas>, but
>>>>> our problem is keeping that code common for non-mq and mq.
>>>>> Temporarily, to unblock this issue, we have fixed <mpt3sas> using
>>>>> the driver-internal scsiio_tracker() instead of piggybacking on the
>>>>> scsi_command.
>>>>
>>>> For mq, the requests never go out of scope, they are always valid. So
>>>> the key question here is WHY they have been freed. If the queue gets
>>>> killed, then one potential solution would be to clear pointers in the
>>>> tag map belonging to that queue. That also takes it out of the hot
>>>> path.
>>>
>>> At driver load, whenever the driver calls scsi_add_host_with_dma(),
>>> it follows the below code path in the block layer:
>>>
>>> scsi_mq_setup_tags
>>>   ->blk_mq_alloc_tag_set
>>>           -> blk_mq_alloc_rq_maps
>>>                      -> __blk_mq_alloc_rq_maps
>>>
>>> SML creates two sets of request pools: one per HBA and the other per
>>> SDEV. I was confused about why SML creates a request pool per HBA.
>>>
>>> Example - if the HBA queue depth is 1K and there are 8 devices behind
>>> that HBA, the total request pool created is 1K + 8 * the scsi_device
>>> queue depth. The 1K pool is always static, but the other request pools
>>> are managed as scsi devices are added/removed.
>>>
>>> I never observed the requests allocated per HBA being used in the IO
>>> path; it is always the requests allocated per scsi device that are
>>> active. Also, what I observed is that whenever a scsi_device is
>>> deleted, the associated requests are also deleted. What is missing is:
>>> "the deleted request is still available in hctx->tags->rqs[rq->tag]."
>>
>> So that sounds like the issue. If the device is deleted and its
>> requests go away, those pointers should be cleared. That's what your
>> patch should do, not do it for each IO.
> 
> At the time of device removal, reverse traversal is required: find out
> whether each request associated with the sdev is present in
> hctx->tags->rqs[] and clear that entry.
> I am not sure about atomic traversal if more than one device removal is
> happening in parallel. That may be more error prone?
> 
> Just wondering - either way we will be removing invalid requests from
> the array. Are you suspecting a performance issue if we do it per IO?

It's an extra store, and it's a store to an area that's now shared
between issue and completion. Those are never a good idea. Besides, it's
the kind of issue you solve in the SLOW path, not in the fast path. Since
that's doable, it would be silly to do it for every IO.

This might not matter on mpt3sas, but on more efficient hw it definitely
will.

I'm still trying to convince myself that this issue even exists. I can see
having stale entries, but those should never be busy. Why are you finding
them with the tag iteration? It must be because the tag is reused, and
you are finding it before it's re-assigned?
Kashyap Desai Dec. 18, 2018, 6:22 p.m. UTC | #10
> >
> > At the time of device removal, reverse traversal is required: find
> > out whether each request associated with the sdev is present in
> > hctx->tags->rqs[] and clear that entry.
> > I am not sure about atomic traversal if more than one device removal
> > is happening in parallel. That may be more error prone?
> >
> > Just wondering - either way we will be removing invalid requests from
> > the array. Are you suspecting a performance issue if we do it per IO?
>
> It's an extra store, and it's a store to an area that's now shared
> between issue and completion. Those are never a good idea. Besides,
> it's the kind of issue you solve in the SLOW path, not in the fast
> path. Since that's doable, it would be silly to do it for every IO.
>
> This might not matter on mpt3sas, but on more efficient hw it
> definitely will.

Understood - your primary concern is to avoid doing this per IO, unless
there is no better way.

> I'm still trying to convince myself that this issue even exists. I can
> see having stale entries, but those should never be busy. Why are you
> finding them with the tag iteration? It must be because the tag is
> reused, and you are finding it before it's re-assigned?


Stale entries stay around forever once we remove scsi devices; it is not
a timing issue. If the memory associated with a request (freed due to
device removal) is reused, a kernel panic occurs.
We have 24 drives behind an expander and perform an expander reset,
which removes all 24 drives and adds them back. The add and removal of
all the drives happens quickly.
As part of the expander reset, the <mpt3sas> driver processes a
broadcast primitive event, and that requires finding all outstanding
scsi commands.

In some cases we need a firmware restart, and that path also requires
tag iteration.


>
> --
> Jens Axboe
Jens Axboe Dec. 18, 2018, 6:28 p.m. UTC | #11
On 12/18/18 11:22 AM, Kashyap Desai wrote:
>>>
>>> At the time of device removal, reverse traversal is required: find
>>> out whether each request associated with the sdev is present in
>>> hctx->tags->rqs[] and clear that entry.
>>> I am not sure about atomic traversal if more than one device removal
>>> is happening in parallel. That may be more error prone?
>>>
>>> Just wondering - either way we will be removing invalid requests from
>>> the array. Are you suspecting a performance issue if we do it per IO?
>>
>> It's an extra store, and it's a store to an area that's now shared
>> between issue and completion. Those are never a good idea. Besides,
>> it's the kind of issue you solve in the SLOW path, not in the fast
>> path. Since that's doable, it would be silly to do it for every IO.
>>
>> This might not matter on mpt3sas, but on more efficient hw it
>> definitely will.
> 
> Understood - your primary concern is to avoid doing this per IO, unless
> there is no better way.
> 
>> I'm still trying to convince myself that this issue even exists. I can
>> see having stale entries, but those should never be busy. Why are you
>> finding them with the tag iteration? It must be because the tag is
>> reused, and you are finding it before it's re-assigned?
> 
> 
> Stale entries stay around forever once we remove scsi devices; it is
> not a timing issue. If the memory associated with a request (freed due
> to device removal) is reused, a kernel panic occurs.
> We have 24 drives behind an expander and perform an expander reset,
> which removes all 24 drives and adds them back. The add and removal of
> all the drives happens quickly.
> As part of the expander reset, the <mpt3sas> driver processes a
> broadcast primitive event, and that requires finding all outstanding
> scsi commands.
> 
> In some cases we need a firmware restart, and that path also requires
> tag iteration.

I actually took a look at scsi_host_find_tag() - what I think needs
fixing here is that it should not return a tag that isn't allocated.
You're just looking up random stuff, that is a recipe for disaster.
But even with that, there's no guarantee that the tag isn't going away.

The mpt3sas use case is crap. It's iterating every tag, just in case it
needs to do something to it.

My suggestion would be to scrap that bad implementation and have
something available for iterating busy tags instead. That'd be more
appropriate and a lot more efficient than a random loop from 0..depth.
If you are flushing running commands, looking up tags that aren't even
active is silly and counterproductive.
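
For illustration, with the existing blk_mq_tagset_busy_iter() interface
that could look roughly like this (callback name and body are
hypothetical; the signature matches the API of this era):

        static void mpt3sas_flush_running_cmd(struct request *rq, void *data,
                                              bool reserved)
        {
                struct scsi_cmnd *scmd = blk_mq_rq_to_pdu(rq);

                /*
                 * Only started requests are visited, so scmd is never a
                 * stale mapping.
                 */
                /* ... driver-specific flush/abort handling of scmd ... */
        }

        blk_mq_tagset_busy_iter(&shost->tag_set, mpt3sas_flush_running_cmd,
                                NULL);
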
Kashyap Desai Dec. 18, 2018, 7:04 p.m. UTC | #12
>
> I actually took a look at scsi_host_find_tag() - what I think needs
> fixing here is that it should not return a tag that isn't allocated.
> You're just looking up random stuff, that is a recipe for disaster.
> But even with that, there's no guarantee that the tag isn't going away.

Got your point. Let us fix it in the <mpt3sas> driver.

>
> The mpt3sas use case is crap. It's iterating every tag, just in case it
> needs to do something to it.

Many drivers in the scsi layer have similar trouble; maybe they are just
less exposed. That was the main reason I thought to provide a common fix
in the block layer.

>
> My suggestion would be to scrap that bad implementation and have
> something available for iterating busy tags instead. That'd be more
> appropriate and a lot more efficient than a random loop from 0..depth.
> If you are flushing running commands, looking up tags that aren't even
> active is silly and counterproductive.

We will address this issue through <mpt3sas> driver changes in two steps:
1. Use the driver's internal memory and do not rely on the request/scsi
command. The tag 0..depth loop is not in the main IO path, so what we
need is contention-free access to the list. Keeping the driver's own
memory and array provides that control (see the sketch after this list).
2. As you suggested, the best way is to use busy tag iteration (blk-mq
stack only).
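
A minimal sketch of the step-1 idea (the fields shown are illustrative,
not the exact mpt3sas structure):

        /*
         * Per-smid tracker owned by the driver, so lookups never depend
         * on block layer request lifetime.
         */
        struct scsiio_tracker {
                u16                     smid;        /* firmware tag */
                struct scsi_cmnd        *scmd;       /* NULL unless outstanding */
                struct list_head        chain_list;
        };

Walking the driver's own tracker array then never dereferences a freed
block layer request.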


Thanks for your feedback.

Kashyap

>
> --
> Jens Axboe

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6a75662..88d1e92 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -477,8 +477,10 @@  static void __blk_mq_free_request(struct request *rq)
     const int sched_tag = rq->internal_tag;

     blk_pm_mark_last_busy(rq);
-    if (rq->tag != -1)
+    if (rq->tag != -1) {
+        hctx->tags->rqs[rq->tag] = NULL;
         blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
+    }
     if (sched_tag != -1)
         blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
     blk_mq_sched_restart(hctx);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 9497b47..57432be 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -175,6 +175,7 @@  static inline bool
blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
 static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
                        struct request *rq)
 {
+    hctx->tags->rqs[rq->tag] = NULL;
     blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
     rq->tag = -1;