mbox series

[RFC,0/2] Serialize timeout handling and done callback.

Message ID 20210507123103.10265-1-dwagner@suse.de (mailing list archive)
Headers show
Series Serialize timeout handling and done callback. | expand

Message

Daniel Wagner May 7, 2021, 12:31 p.m. UTC
Hi,

We got a customer report where qla2xxx was crashing only if the kernel
was booting and ql2xextended_error_logging was set. Loading the module
with the log option didn't trigger the crash.

After starring for a long time at the crash report I figured the
problem might be a race between the timeout handler and done callback.
I've come up with these patches here but unfortunatly, our customer is
not able to reproduce the problem in the lab anymore (it was caused by
a hardware issue which got fixed). So for these patches I don't have
any feedback.

Maybe they make sense to add the driver even if I don't have prove it
really address the mentioned bug hence this is marked as RFC.

Thanks,
Daniel

Daniel Wagner (2):
  qla2xxx: Refactor asynchronous command initialization
  qla2xxx: Do not free resource to early in qla24xx_async_gpsc_sp_done()

 drivers/scsi/qla2xxx/qla_def.h    |  5 ++
 drivers/scsi/qla2xxx/qla_gbl.h    |  4 +-
 drivers/scsi/qla2xxx/qla_gs.c     | 86 ++++++++++-------------------
 drivers/scsi/qla2xxx/qla_init.c   | 91 +++++++++++++------------------
 drivers/scsi/qla2xxx/qla_iocb.c   | 54 +++++++++++++-----
 drivers/scsi/qla2xxx/qla_mbx.c    | 11 ++--
 drivers/scsi/qla2xxx/qla_mid.c    |  5 +-
 drivers/scsi/qla2xxx/qla_mr.c     |  7 +--
 drivers/scsi/qla2xxx/qla_target.c |  6 +-
 9 files changed, 127 insertions(+), 142 deletions(-)

Comments

Daniel Wagner May 7, 2021, 12:31 p.m. UTC | #1
Hi,

We got a customer report where qla2xxx was crashing only if the kernel
was booting and ql2xextended_error_logging was set. Loading the module
with the log option didn't trigger the crash.

After starring for a long time at the crash report I figured the
problem might be a race between the timeout handler and done callback.
I've come up with these patches here but unfortunatly, our customer is
not able to reproduce the problem in the lab anymore (it was caused by
a hardware issue which got fixed). So for these patches I don't have
any feedback.

Maybe they make sense to add the driver even if I don't have prove it
really address the mentioned bug hence this is marked as RFC.

Thanks,
Daniel

Daniel Wagner (2):
  qla2xxx: Refactor asynchronous command initialization
  qla2xxx: Do not free resource to early in qla24xx_async_gpsc_sp_done()

 drivers/scsi/qla2xxx/qla_def.h    |  5 ++
 drivers/scsi/qla2xxx/qla_gbl.h    |  4 +-
 drivers/scsi/qla2xxx/qla_gs.c     | 86 ++++++++++-------------------
 drivers/scsi/qla2xxx/qla_init.c   | 91 +++++++++++++------------------
 drivers/scsi/qla2xxx/qla_iocb.c   | 54 +++++++++++++-----
 drivers/scsi/qla2xxx/qla_mbx.c    | 11 ++--
 drivers/scsi/qla2xxx/qla_mid.c    |  5 +-
 drivers/scsi/qla2xxx/qla_mr.c     |  7 +--
 drivers/scsi/qla2xxx/qla_target.c |  6 +-
 9 files changed, 127 insertions(+), 142 deletions(-)
Daniel Wagner May 25, 2021, 1:26 p.m. UTC | #2
Hi,

On Fri, May 07, 2021 at 02:31:00PM +0200, Daniel Wagner wrote:
> Maybe they make sense to add the driver even if I don't have prove it
> really address the mentioned bug hence this is marked as RFC.

Any feedback?

Thanks,
Daniel
Arun Easi May 26, 2021, 12:56 a.m. UTC | #3
On Tue, 25 May 2021, 6:26am, Daniel Wagner wrote:

> External Email
> 
> ----------------------------------------------------------------------
> Hi,
> 
> On Fri, May 07, 2021 at 02:31:00PM +0200, Daniel Wagner wrote:
> > Maybe they make sense to add the driver even if I don't have prove it
> > really address the mentioned bug hence this is marked as RFC.
> 
> Any feedback?
> 

Apologies for the delay, Daniel. I will review this by end of this week.

Regards,
-Arun
Arun Easi May 31, 2021, 9:06 a.m. UTC | #4
On Fri, 7 May 2021, 5:31am, Daniel Wagner wrote:

> External Email
> 
> ----------------------------------------------------------------------
> Hi,
> 
> We got a customer report where qla2xxx was crashing only if the kernel
> was booting and ql2xextended_error_logging was set. Loading the module
> with the log option didn't trigger the crash.
> 
> After starring for a long time at the crash report I figured the
> problem might be a race between the timeout handler and done callback.
> I've come up with these patches here but unfortunatly, our customer is
> not able to reproduce the problem in the lab anymore (it was caused by
> a hardware issue which got fixed). So for these patches I don't have
> any feedback.

Thanks Daniel for the report and your effort here. Agree, this needs to be 
fixed.

> 
> Maybe they make sense to add the driver even if I don't have prove it
> really address the mentioned bug hence this is marked as RFC.

If you do not mind, can I take this from here? This touches quite a 
number of paths, and I would like to have this go through a full 
regression cycle before this is merged.

That said, I had some general comments:

1. I see sp->lock was introduced, but could not locate where it was used.
2. I did not see a release of lock after a successful kref_put_lock, maybe 
   that piece was missed out.
3. The sp->done should be invoked only once, and I do not see if this is
   taken care of.
4. sp->cmd_sp could be a SCSI IO too, where sp is not allocated 
   separately, so qla2x00_sp_release shall not be called for it.

Regards,
-Arun

> 
> Thanks,
> Daniel
> 
> Daniel Wagner (2):
>   qla2xxx: Refactor asynchronous command initialization
>   qla2xxx: Do not free resource to early in qla24xx_async_gpsc_sp_done()
> 
>  drivers/scsi/qla2xxx/qla_def.h    |  5 ++
>  drivers/scsi/qla2xxx/qla_gbl.h    |  4 +-
>  drivers/scsi/qla2xxx/qla_gs.c     | 86 ++++++++++-------------------
>  drivers/scsi/qla2xxx/qla_init.c   | 91 +++++++++++++------------------
>  drivers/scsi/qla2xxx/qla_iocb.c   | 54 +++++++++++++-----
>  drivers/scsi/qla2xxx/qla_mbx.c    | 11 ++--
>  drivers/scsi/qla2xxx/qla_mid.c    |  5 +-
>  drivers/scsi/qla2xxx/qla_mr.c     |  7 +--
>  drivers/scsi/qla2xxx/qla_target.c |  6 +-
>  9 files changed, 127 insertions(+), 142 deletions(-)
> 
>
Daniel Wagner May 31, 2021, 9:55 a.m. UTC | #5
Hi Arun,

On Mon, May 31, 2021 at 02:06:24AM -0700, Arun Easi wrote:
> Thanks Daniel for the report and your effort here. Agree, this needs to be 
> fixed.

Good to hear!

> If you do not mind, can I take this from here? This touches quite a 
> number of paths, and I would like to have this go through a full 
> regression cycle before this is merged.

Sure, that is what I hoped for. It is an invasive change and this needs
to be properly tested with a few different setups. Something I can't really
do. So I would be glad if you could pick up the patches and fix them up.

> That said, I had some general comments:
> 
> 1. I see sp->lock was introduced, but could not locate where it was
> used.

I thought I needed it for serializing the kref operations. The lock
itself is not used in the driver. After re-reading the documentation,
the lock is not necessary as kref_put() is able to serialize the ref
counter inc/dec operation correctly. The lock would only be useful to
serialize the kref_put() with something which runs in the driver
concurrently.

> 2. I did not see a release of lock after a successful kref_put_lock, maybe 
>    that piece was missed out.

I think you got it right. The lock is not necessary.

> 3. The sp->done should be invoked only once, and I do not see if this is
>    taken care of.

qla2x00_sp_release() will only be called when the ref counter gets 0.
This makes sure we only call sp->done() once.

> 4. sp->cmd_sp could be a SCSI IO too, where sp is not allocated 
>    separately, so qla2x00_sp_release shall not be called for it.

Okay, didn't realize this.

Thanks,
Daniel