diff mbox

[dm-devel] dm-mq and end_clone_request()

Message ID 1494059467.386778.1470762767417.JavaMail.zimbra@redhat.com (mailing list archive)
State Not Applicable, archived
Headers show

Commit Message

Laurence Oberman Aug. 9, 2016, 5:12 p.m. UTC
----- Original Message -----
> From: "Bart Van Assche" <bart.vanassche@sandisk.com>
> To: "Laurence Oberman" <loberman@redhat.com>
> Cc: dm-devel@redhat.com, "Mike Snitzer" <snitzer@redhat.com>, linux-scsi@vger.kernel.org, "Johannes Thumshirn"
> <jthumshirn@suse.de>
> Sent: Tuesday, August 9, 2016 11:51:00 AM
> Subject: Re: [dm-devel] dm-mq and end_clone_request()
> 
> On 08/08/2016 05:09 PM, Laurence Oberman wrote:
> > So now back to a 10 LUN dual path (ramdisk backed) two-server
> > configuration I am unable to reproduce the dm issue.
> > Recovery is very fast with the servers connected back to back.
> > This is using your kernel and this multipath.conf
> > 
> > [ ... ]
> > 
> > Mikes patches have definitely stabilized this issue for me on this
> > configuration.
> > 
> > I will see if I can move to a larger target server that has more
> > memory and allocate more mpath devices. I feel this issue in large
> > configurations is now rooted in multipath not bringing back maps
> > sometimes even when the actual paths are back via srp_daemon.
> > I am still tracking that down.
> > 
> > If you recall, last week I caused some of our own issues by
> > forgetting I had a no_path_retry 12 hiding in my multipath.conf.
> > Since removing that and spending most of the weekend testing on
> > the DDN array (had to give that back today), most of my issues
> > were either the sporadic host delete race or multipath not
> > re-instantiating paths.
> > 
> > I dont know if this helps, but since applying your latest patch I
> > have not seen the host delete race.
> 
> Hello Laurence,
> 
> My latest SCSI core patch adds additional instrumentation to the SCSI
> core but does not change the behavior of the SCSI core. So it cannot
> fix the scsi_forget_host() crash you had reported.
> 
> On my setup, with the kernel code from the srp-initiator-for-next
> branch and with CONFIG_DM_MQ_DEFAULT=n, I still see that when I run the
> srp-test software that fio reports I/O errors every now and then. What
> I see in syslog seems to indicate that these I/O errors are generated
> by dm-mpath:
> 
> Aug  9 08:45:39 ion-dev-ib-ini kernel: mpath 254:1: queue_if_no_path 1 -> 0
> Aug  9 08:45:39 ion-dev-ib-ini kernel: must_push_back: 107 callbacks
> suppressed
> Aug  9 08:45:39 ion-dev-ib-ini kernel: device-mapper: multipath:
> must_push_back: queue_if_no_path=0 suspend_active=1 suspending=0
> Aug  9 08:45:39 ion-dev-ib-ini kernel: __multipath_map(): (a) returning -5
> Aug  9 08:45:39 ion-dev-ib-ini kernel: map_request(): clone_and_map_rq()
> returned -5
> Aug  9 08:45:39 ion-dev-ib-ini kernel: dm_complete_request: error = -5
> Aug  9 08:45:39 ion-dev-ib-ini kernel: dm_softirq_done: dm-1 tio->error = -5
> 
> Bart.
> 
> 
Hello Bart

I was talking about this patch

Comments

Bart Van Assche Aug. 9, 2016, 5:16 p.m. UTC | #1
On 08/09/2016 10:12 AM, Laurence Oberman wrote:
> I was talking about this patch
>
> --- a/drivers/scsi/scsi_scan.c
> +++ b/drivers/scsi/scsi_scan.c
> @@ -1890,10 +1890,11 @@ void scsi_forget_host(struct Scsi_Host *shost)
>   restart:
>          spin_lock_irqsave(shost->host_lock, flags);
>          list_for_each_entry(sdev, &shost->__devices, siblings) {
> -                if (sdev->sdev_state == SDEV_DEL)
> +                if (sdev->sdev_state == SDEV_DEL || scsi_device_get(sdev) < 0)
>                          continue;
>                  spin_unlock_irqrestore(shost->host_lock, flags);
>                  __scsi_remove_device(sdev);
> +                scsi_device_put(sdev);
>                  goto restart;
>          }
>          spin_unlock_irqrestore(shost->host_lock, flags);

Hello Laurence,

Did you run your tests with that patch applied? If so, it would help if 
you could rerun your tests without that patch. If the above patch makes 
a difference it means that it can happen that __scsi_remove_device() 
does not change the device state into SDEV_DEL. That's a bug and we need 
to know whether or not __scsi_remove_device() behaves correctly.

Thanks,

Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Laurence Oberman Aug. 9, 2016, 5:21 p.m. UTC | #2
----- Original Message -----
> From: "Bart Van Assche" <bart.vanassche@sandisk.com>
> To: "Laurence Oberman" <loberman@redhat.com>
> Cc: dm-devel@redhat.com, linux-scsi@vger.kernel.org, "Mike Snitzer" <snitzer@redhat.com>, "Johannes Thumshirn"
> <jthumshirn@suse.de>
> Sent: Tuesday, August 9, 2016 1:16:52 PM
> Subject: Re: [dm-devel] dm-mq and end_clone_request()
> 
> On 08/09/2016 10:12 AM, Laurence Oberman wrote:
> > I was talking about this patch
> >
> > --- a/drivers/scsi/scsi_scan.c
> > +++ b/drivers/scsi/scsi_scan.c
> > @@ -1890,10 +1890,11 @@ void scsi_forget_host(struct Scsi_Host *shost)
> >   restart:
> >          spin_lock_irqsave(shost->host_lock, flags);
> >          list_for_each_entry(sdev, &shost->__devices, siblings) {
> > -                if (sdev->sdev_state == SDEV_DEL)
> > +                if (sdev->sdev_state == SDEV_DEL || scsi_device_get(sdev)
> > < 0)
> >                          continue;
> >                  spin_unlock_irqrestore(shost->host_lock, flags);
> >                  __scsi_remove_device(sdev);
> > +                scsi_device_put(sdev);
> >                  goto restart;
> >          }
> >          spin_unlock_irqrestore(shost->host_lock, flags);
> 
> Hello Laurence,
> 
> Did you run your tests with that patch applied? If so, it would help if
> you could rerun your tests without that patch. If the above patch makes
> a difference it means that it can happen that __scsi_remove_device()
> does not change the device state into SDEV_DEL. That's a bug and we need
> to know whether or not __scsi_remove_device() behaves correctly.
> 
> Thanks,
> 
> Bart.
> 
Yes Sir, I ran all yesterdays tests on your kernel with that patch applied.
Of course it may well just be luck/coincidence that the host delete race is no longer happening
so I agree we need to re-run the tests so I will revert and re-run.
I will probably only get back to you tomorrow with the results.

Thanks
Laurence
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Laurence Oberman Aug. 10, 2016, 9:38 p.m. UTC | #3
----- Original Message -----
> From: "Laurence Oberman" <loberman@redhat.com>
> To: "Bart Van Assche" <bart.vanassche@sandisk.com>
> Cc: dm-devel@redhat.com, linux-scsi@vger.kernel.org, "Mike Snitzer" <snitzer@redhat.com>, "Johannes Thumshirn"
> <jthumshirn@suse.de>
> Sent: Tuesday, August 9, 2016 1:21:15 PM
> Subject: Re: [dm-devel] dm-mq and end_clone_request()
> 
> 
> 
> ----- Original Message -----
> > From: "Bart Van Assche" <bart.vanassche@sandisk.com>
> > To: "Laurence Oberman" <loberman@redhat.com>
> > Cc: dm-devel@redhat.com, linux-scsi@vger.kernel.org, "Mike Snitzer"
> > <snitzer@redhat.com>, "Johannes Thumshirn"
> > <jthumshirn@suse.de>
> > Sent: Tuesday, August 9, 2016 1:16:52 PM
> > Subject: Re: [dm-devel] dm-mq and end_clone_request()
> > 
> > On 08/09/2016 10:12 AM, Laurence Oberman wrote:
> > > I was talking about this patch
> > >
> > > --- a/drivers/scsi/scsi_scan.c
> > > +++ b/drivers/scsi/scsi_scan.c
> > > @@ -1890,10 +1890,11 @@ void scsi_forget_host(struct Scsi_Host *shost)
> > >   restart:
> > >          spin_lock_irqsave(shost->host_lock, flags);
> > >          list_for_each_entry(sdev, &shost->__devices, siblings) {
> > > -                if (sdev->sdev_state == SDEV_DEL)
> > > +                if (sdev->sdev_state == SDEV_DEL ||
> > > scsi_device_get(sdev)
> > > < 0)
> > >                          continue;
> > >                  spin_unlock_irqrestore(shost->host_lock, flags);
> > >                  __scsi_remove_device(sdev);
> > > +                scsi_device_put(sdev);
> > >                  goto restart;
> > >          }
> > >          spin_unlock_irqrestore(shost->host_lock, flags);
> > 
> > Hello Laurence,
> > 
> > Did you run your tests with that patch applied? If so, it would help if
> > you could rerun your tests without that patch. If the above patch makes
> > a difference it means that it can happen that __scsi_remove_device()
> > does not change the device state into SDEV_DEL. That's a bug and we need
> > to know whether or not __scsi_remove_device() behaves correctly.
> > 
> > Thanks,
> > 
> > Bart.
> > 
> Yes Sir, I ran all yesterdays tests on your kernel with that patch applied.
> Of course it may well just be luck/coincidence that the host delete race is
> no longer happening
> so I agree we need to re-run the tests so I will revert and re-run.
> I will probably only get back to you tomorrow with the results.
> 
> Thanks
> Laurence
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
Hello Bart

I only just got time now to revert that patch and build a kernel.
Will test this tonight and let you know if I am back to seeing panics sporadically without the patch.
As already mentioned, this is a different configuration to what I had when I was able to reproduce the panic.
This means the lack of hitting this stack trace and panic may turn out to have nothing to do with the patch I applied.

Thanks
Laurence 
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Laurence Oberman Aug. 11, 2016, 4:51 p.m. UTC | #4
----- Original Message -----
> From: "Laurence Oberman" <loberman@redhat.com>
> To: "Bart Van Assche" <bart.vanassche@sandisk.com>
> Cc: dm-devel@redhat.com, linux-scsi@vger.kernel.org, "Mike Snitzer" <snitzer@redhat.com>, "Johannes Thumshirn"
> <jthumshirn@suse.de>
> Sent: Wednesday, August 10, 2016 5:38:16 PM
> Subject: Re: [dm-devel] dm-mq and end_clone_request()
> 
> 
> 
> ----- Original Message -----
> > From: "Laurence Oberman" <loberman@redhat.com>
> > To: "Bart Van Assche" <bart.vanassche@sandisk.com>
> > Cc: dm-devel@redhat.com, linux-scsi@vger.kernel.org, "Mike Snitzer"
> > <snitzer@redhat.com>, "Johannes Thumshirn"
> > <jthumshirn@suse.de>
> > Sent: Tuesday, August 9, 2016 1:21:15 PM
> > Subject: Re: [dm-devel] dm-mq and end_clone_request()
> > 
> > 
> > 
> > ----- Original Message -----
> > > From: "Bart Van Assche" <bart.vanassche@sandisk.com>
> > > To: "Laurence Oberman" <loberman@redhat.com>
> > > Cc: dm-devel@redhat.com, linux-scsi@vger.kernel.org, "Mike Snitzer"
> > > <snitzer@redhat.com>, "Johannes Thumshirn"
> > > <jthumshirn@suse.de>
> > > Sent: Tuesday, August 9, 2016 1:16:52 PM
> > > Subject: Re: [dm-devel] dm-mq and end_clone_request()
> > > 
> > > On 08/09/2016 10:12 AM, Laurence Oberman wrote:
> > > > I was talking about this patch
> > > >
> > > > --- a/drivers/scsi/scsi_scan.c
> > > > +++ b/drivers/scsi/scsi_scan.c
> > > > @@ -1890,10 +1890,11 @@ void scsi_forget_host(struct Scsi_Host *shost)
> > > >   restart:
> > > >          spin_lock_irqsave(shost->host_lock, flags);
> > > >          list_for_each_entry(sdev, &shost->__devices, siblings) {
> > > > -                if (sdev->sdev_state == SDEV_DEL)
> > > > +                if (sdev->sdev_state == SDEV_DEL ||
> > > > scsi_device_get(sdev)
> > > > < 0)
> > > >                          continue;
> > > >                  spin_unlock_irqrestore(shost->host_lock, flags);
> > > >                  __scsi_remove_device(sdev);
> > > > +                scsi_device_put(sdev);
> > > >                  goto restart;
> > > >          }
> > > >          spin_unlock_irqrestore(shost->host_lock, flags);
> > > 
> > > Hello Laurence,
> > > 
> > > Did you run your tests with that patch applied? If so, it would help if
> > > you could rerun your tests without that patch. If the above patch makes
> > > a difference it means that it can happen that __scsi_remove_device()
> > > does not change the device state into SDEV_DEL. That's a bug and we need
> > > to know whether or not __scsi_remove_device() behaves correctly.
> > > 
> > > Thanks,
> > > 
> > > Bart.
> > > 
> > Yes Sir, I ran all yesterdays tests on your kernel with that patch applied.
> > Of course it may well just be luck/coincidence that the host delete race is
> > no longer happening
> > so I agree we need to re-run the tests so I will revert and re-run.
> > I will probably only get back to you tomorrow with the results.
> > 
> > Thanks
> > Laurence
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> Hello Bart
> 
> I only just got time now to revert that patch and build a kernel.
> Will test this tonight and let you know if I am back to seeing panics
> sporadically without the patch.
> As already mentioned, this is a different configuration to what I had when I
> was able to reproduce the panic.
> This means the lack of hitting this stack trace and panic may turn out to
> have nothing to do with the patch I applied.
> 
> Thanks
> Laurence
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
Hello Bart

I can no longer reproduce the stack even with the patch reverted so its behaving as you expected and the patch is as you already said, not valid.
I ran about 30 fio tests with your kernel and multiple host deletions and and did experience only one hard fio error.
My tests now produce the same results as you are seeing.

The single fio errors was with many more executions of the test so its not easy to get these fio errors.

Away from tomorrow on vacation for 10 days

Thanks
Laurence
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -1890,10 +1890,11 @@  void scsi_forget_host(struct Scsi_Host *shost)
  restart:
         spin_lock_irqsave(shost->host_lock, flags);
         list_for_each_entry(sdev, &shost->__devices, siblings) {
-                if (sdev->sdev_state == SDEV_DEL)
+                if (sdev->sdev_state == SDEV_DEL || scsi_device_get(sdev) < 0)
                         continue;
                 spin_unlock_irqrestore(shost->host_lock, flags);
                 __scsi_remove_device(sdev);
+                scsi_device_put(sdev);
                 goto restart;
         }
         spin_unlock_irqrestore(shost->host_lock, flags);