Message ID | 20160727200939.GA82654@redhat.com (mailing list archive) |
---|---|
State | Not Applicable, archived |
On 07/27/2016 01:09 PM, Mike Snitzer wrote:
> In addition to the above patch, please apply this patch and retest your
> 4.7 kernel:
>
> diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> index 287caa7..16583c1 100644
> --- a/drivers/md/dm-mpath.c
> +++ b/drivers/md/dm-mpath.c
> @@ -1416,12 +1416,14 @@ static void multipath_postsuspend(struct dm_target *ti)
>  static void multipath_resume(struct dm_target *ti)
>  {
>  	struct multipath *m = ti->private;
> +	unsigned long flags;
>
> +	spin_lock_irqsave(&m->lock, flags);
>  	if (test_bit(MPATHF_SAVED_QUEUE_IF_NO_PATH, &m->flags))
>  		set_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags);
>  	else
>  		clear_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags);
> -	smp_mb__after_atomic();
> +	spin_unlock_irqrestore(&m->lock, flags);
>  }
>
>  /*

Hello Mike,

Thanks again for having made this patch available. I will test it as
soon as I have the time. BTW, in the meantime I ran a few tests with
DM_MQ_DEFAULT=n since until now I ran all tests with
DM_MQ_DEFAULT=y. The result of these tests is as follows:
* v4.6.0, v4.6.5 and v4.7.0 with DM_MQ_DEFAULT=y: first simulated
  path removal triggers I/O errors.
* v4.6.4, v4.6.5 and v4.7.0 with DM_MQ_DEFAULT=n: test passes more
  than 100 iterations.

I have not yet run any tests with kernel v4.5.x because in the test
I ran the ib_srp and ib_srpt drivers are loaded on the same system
and because I need five v4.7 LIO patches to make this test pass but
unfortunately these patches do not apply cleanly on the v4.5.x code
base.

Please let me know if you need more information.

Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

On Wed, Jul 27 2016 at 7:05pm -0400,
Bart Van Assche <bart.vanassche@sandisk.com> wrote:

> On 07/27/2016 01:09 PM, Mike Snitzer wrote:
> > In addition to the above patch, please apply this patch and retest your
> > 4.7 kernel:
> >
> > diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> > index 287caa7..16583c1 100644
> > --- a/drivers/md/dm-mpath.c
> > +++ b/drivers/md/dm-mpath.c
> > @@ -1416,12 +1416,14 @@ static void multipath_postsuspend(struct dm_target *ti)
> >  static void multipath_resume(struct dm_target *ti)
> >  {
> >  	struct multipath *m = ti->private;
> > +	unsigned long flags;
> >
> > +	spin_lock_irqsave(&m->lock, flags);
> >  	if (test_bit(MPATHF_SAVED_QUEUE_IF_NO_PATH, &m->flags))
> >  		set_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags);
> >  	else
> >  		clear_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags);
> > -	smp_mb__after_atomic();
> > +	spin_unlock_irqrestore(&m->lock, flags);
> >  }
> >
> >  /*
>
> Hello Mike,
>
> Thanks again for having made this patch available. I will test it as
> soon as I have the time. BTW, in the meantime I ran a few tests with
> DM_MQ_DEFAULT=n since until now I ran all tests with
> DM_MQ_DEFAULT=y. The result of these tests is as follows:
> * v4.6.0, v4.6.5 and v4.7.0 with DM_MQ_DEFAULT=y: first simulated
>   path removal triggers I/O errors.
> * v4.6.4, v4.6.5 and v4.7.0 with DM_MQ_DEFAULT=n: test passes more
>   than 100 iterations.

I think this may point to an SRP issue then. Is the synthetic "cable
pull" (by writing to /sys/class/srp_remote_ports/port-*/delete)
representative of what actually happens if a cable is physically pulled?

Or is your synthetic method hitting the device way harder than would
happen with an actual production fault?

Again, there hasn't been any report of failures (EIO or otherwise) with
extensive scsi-mq and dm-mq testing on a larger FC testbed.

> I have not yet run any tests with kernel v4.5.x because in the test
> I ran the ib_srp and ib_srpt drivers are loaded on the same system
> and because I need five v4.7 LIO patches to make this test pass but
> unfortunately these patches do not apply cleanly on the v4.5.x code
> base.
>
> Please let me know if you need more information.

Can the target core be made to use SRP in loopback (local test machine)
mode? The mptest harness currently defaults to using tcmloop. It would be
great if I could somehow exercise the SRP code without needing a
full-blown IB setup.

But if there isn't a way to achieve that test coverage I can
probably/hopefully get access to a subset of a larger IB/SRP testbed.

Please advise, thanks.
Mike

On 07/28/2016 06:33 AM, Mike Snitzer wrote:
> On Wed, Jul 27 2016 at 7:05pm -0400,
> Bart Van Assche <bart.vanassche@sandisk.com> wrote:
>> Thanks again for having made this patch available. I will test it as
>> soon as I have the time. BTW, in the meantime I ran a few tests with
>> DM_MQ_DEFAULT=n since until now I ran all tests with
>> DM_MQ_DEFAULT=y. The result of these tests is as follows:
>> * v4.6.0, v4.6.5 and v4.7.0 with DM_MQ_DEFAULT=y: first simulated
>>   path removal triggers I/O errors.
>> * v4.6.4, v4.6.5 and v4.7.0 with DM_MQ_DEFAULT=n: test passes more
>>   than 100 iterations.
>
> I think this may point to an SRP issue then. Is the synthetic "cable
> pull" (by writing to /sys/class/srp_remote_ports/port-*/delete)
> representative of what actually happens if a cable is physically pulled?
>
> Or is your synthetic method hitting the device way harder than would
> happen with an actual production fault?
>
> Again, there hasn't been any report of failures (EIO or otherwise) with
> extensive scsi-mq and dm-mq testing on a larger FC testbed.

Hello Mike,

Sorry but I disagree that the ib_srp driver would be causing the EIO
errors because:
* All tests, including the tests that pass, were run with
  CONFIG_SCSI_MQ_DEFAULT=y in the kernel config. The same code paths
  were triggered in the ib_srp driver by all the tests
  (CONFIG_DM_MQ_DEFAULT=y and CONFIG_DM_MQ_DEFAULT=n).
* In my previous e-mails I have shown that the EIO error code is
  generated by the dm-mpath driver after all (SRP) paths have gone. So
  how could the ib_srp driver be involved?

There is an important difference between the SCSI FC drivers and
ib_srp: after dev_loss_tmo expires FC drivers call
scsi_remove_target() while the SRP transport layer triggers a call
of scsi_remove_host().

Both writing into /sys/class/srp_remote_ports/*/delete and pulling a
cable make the ib_srp driver call scsi_remove_host(). The only
difference is the timing. With the former method it is more likely
that the time between submitting I/O and calling scsi_remove_host()
is small.

>> I have not yet run any tests with kernel v4.5.x because in the test
>> I ran the ib_srp and ib_srpt drivers are loaded on the same system
>> and because I need five v4.7 LIO patches to make this test pass but
>> unfortunately these patches do not apply cleanly on the v4.5.x code
>> base.
>>
>> Please let me know if you need more information.
>
> Can the target core be made to use SRP in loopback (local test machine)
> mode? The mptest harness currently defaults to using tcmloop. Would be
> great if I could somehow exercise the SRP code without needing a
> fullblown IB setup.
>
> But if there isn't a way to achieve that test coverage I can
> probably/hopefully get access to a subset of a larger IB/SRP testbed.

All InfiniBand HCAs that I have encountered so far support loopback as
long as at least one HCA port is up (either connected to a switch or
connected to another HCA port and opensm is running against one of
these two ports). The scripts I used to test the ib_srp driver are
available at https://github.com/bvanassche/srp-test.

Thanks,

Bart.

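As a rough illustration of the synthetic path removal described in the message above, the following userspace C sketch writes to every file matching a glob pattern such as /sys/class/srp_remote_ports/port-*/delete. The helper name `simulate_path_removal` and the value "1" written to the attribute are assumptions made for this example and are not taken from the thread; the srp-test scripts referenced above remain the authoritative procedure.

```c
#include <assert.h>
#include <glob.h>
#include <stdio.h>

/* Pattern quoted in the thread; only meaningful where ib_srp is loaded. */
#define SRP_DELETE_PATTERN "/sys/class/srp_remote_ports/port-*/delete"

/*
 * Hypothetical helper: write "1" to every file matching `pattern`.
 * Returns the number of files written successfully, 0 if nothing
 * matched, or -1 if glob() itself failed.
 */
int simulate_path_removal(const char *pattern)
{
	glob_t g = {0};
	int written = 0;
	int rc;

	assert(pattern != NULL);

	rc = glob(pattern, 0, NULL, &g);
	if (rc != 0) {
		globfree(&g);
		return rc == GLOB_NOMATCH ? 0 : -1;
	}

	for (size_t i = 0; i < g.gl_pathc; i++) {
		FILE *f = fopen(g.gl_pathv[i], "w");
		int ok;

		if (!f)
			continue; /* port already gone, or no permission */

		/* Assumption: any write triggers the removal; "1" is a guess. */
		ok = fputs("1\n", f) >= 0;
		if (fclose(f) == 0 && ok)
			written++;
	}

	globfree(&g);
	return written;
}
```

Calling `simulate_path_removal(SRP_DELETE_PATTERN)` as root on a host with active SRP sessions would remove all remote ports at once, so it belongs on a test machine only.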
On Thu, Jul 28 2016 at 11:23am -0400,
Bart Van Assche <bart.vanassche@sandisk.com> wrote:

> On 07/28/2016 06:33 AM, Mike Snitzer wrote:
> >On Wed, Jul 27 2016 at 7:05pm -0400,
> >Bart Van Assche <bart.vanassche@sandisk.com> wrote:
> >>Thanks again for having made this patch available. I will test it as
> >>soon as I have the time. BTW, in the meantime I ran a few tests with
> >>DM_MQ_DEFAULT=n since until now I ran all tests with
> >>DM_MQ_DEFAULT=y. The result of these tests is as follows:
> >>* v4.6.0, v4.6.5 and v4.7.0 with DM_MQ_DEFAULT=y: first simulated
> >>  path removal triggers I/O errors.
> >>* v4.6.4, v4.6.5 and v4.7.0 with DM_MQ_DEFAULT=n: test passes more
> >>  than 100 iterations.
> >
> >I think this may point to an SRP issue then. Is the synthetic "cable
> >pull" (by writing to /sys/class/srp_remote_ports/port-*/delete)
> >representative of what actually happens if a cable is physically pulled?
> >
> >Or is your synthetic method hitting the device way harder than would
> >happen with an actual production fault?
> >
> >Again, there hasn't been any report of failures (EIO or otherwise) with
> >extensive scsi-mq and dm-mq testing on a larger FC testbed.
>
> Hello Mike,
>
> Sorry but I disagree that the ib_srp driver would be causing the EIO
> errors because:
> * All tests, including the tests that pass, were run with
>   CONFIG_SCSI_MQ_DEFAULT=y in the kernel config. The same code paths
>   were triggered in the ib_srp driver by all the tests
>   (CONFIG_DM_MQ_DEFAULT=y and CONFIG_DM_MQ_DEFAULT=n).
> * In my previous e-mails I have shown that the EIO error code is
>   generated by the dm-mpath driver after all (SRP) paths have gone. So
>   how could the ib_srp driver be involved?
>
> There is an important difference between the SCSI FC drivers and
> ib_srp: after dev_loss_tmo expires FC drivers call
> scsi_remove_target() while the SRP transport layer triggers a call
> of scsi_remove_host().
>
> Both writing into /sys/class/srp_remote_ports/*/delete and pulling a
> cable make the ib_srp driver call scsi_remove_host(). The only
> difference is the timing. With the former method it is more likely
> that the time between submitting I/O and calling scsi_remove_host()
> is small.

Reality is I just need a testbed to reproduce. This back and forth
isn't really helping us converge on _why_ must_push_back() is returning
false for your case. I need to know what exactly is causing that method
to return false in your case.

As is, it is hard to see why blk-mq vs .request_fn interface for the DM
mpath device would cause must_push_back() to return false vs true.

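To make the question about must_push_back() concrete, here is a simplified userspace model of the decision it feeds into. It is a sketch based on one reading of the v4.7 dm-mpath.c source and should be checked against the real code: the struct `mpath_model`, the `model_*` function names and the `MODEL_MAPIO_*` constants are all hypothetical stand-ins, and the booleans replace the kernel's atomic flag bits and the dm_noflush_suspending() call.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant state of struct multipath. */
struct mpath_model {
	bool queue_if_no_path;       /* models MPATHF_QUEUE_IF_NO_PATH */
	bool saved_queue_if_no_path; /* models MPATHF_SAVED_QUEUE_IF_NO_PATH */
	bool noflush_suspending;     /* models dm_noflush_suspending(ti) */
	unsigned int nr_valid_paths; /* usable paths left */
};

/* Hypothetical return codes for the model; not the kernel's values. */
#define MODEL_MAPIO_REMAPPED 0
#define MODEL_MAPIO_REQUEUE  1

/*
 * Push back (requeue) if queueing is enabled, or if the two flags
 * differ (we are between presuspend and resume) during a noflush
 * suspend.
 */
bool model_must_push_back(const struct mpath_model *m)
{
	assert(m != NULL);
	return m->queue_if_no_path ||
	       (m->queue_if_no_path != m->saved_queue_if_no_path &&
		m->noflush_suspending);
}

/* With no usable path, the request is either requeued or failed. */
int model_map_request(const struct mpath_model *m)
{
	assert(m != NULL);
	if (m->nr_valid_paths > 0)
		return MODEL_MAPIO_REMAPPED;
	return model_must_push_back(m) ? MODEL_MAPIO_REQUEUE : -EIO;
}
```

In this model, -EIO with no paths left requires `queue_if_no_path` to be false and, in addition, either matching flags or no noflush suspend in progress. That is the state the thread is trying to explain for the dm-mq case.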
On 07/28/2016 05:40 PM, Mike Snitzer wrote:
> On Thu, Jul 28 2016 at 11:23am -0400,
> Bart Van Assche <bart.vanassche@sandisk.com> wrote:
>
>> On 07/28/2016 06:33 AM, Mike Snitzer wrote:
[ .. ]
>
> Reality is I just need a testbed to reproduce. This back and forth
> isn't really helping us converge on _why_ must_push_back() is returning
> false for your case. I need to know what exactly is causing that method
> to return false in your case.
>
> As is, hard to see why blk-mq vs .request_fn interface for DM mpath
> device would cause must_push_back() to return false vs true.
>

I wonder if that isn't the same issue I've seen (and tried to discuss at
LSF), hitting the printk in blk_cloned_rq_check_limits().

If I had to hazard a guess I'd say that the queue limits become
temporarily invalidated during failover, and we're managing to submit
an I/O at just that time.

I am currently working on getting FCoE to run over virtio; if that works
we should have a good synthetic testbed for reproducing.

Cheers,

Hannes

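The guess above can be sketched as a small userspace model of the kind of check blk_cloned_rq_check_limits() performs. This is an approximation from memory of the v4.7 block layer, not a copy of it: the structs `model_limits` and `model_request` and the function `model_cloned_rq_check_limits` are hypothetical, and the real function also recalculates the segment count and logs a message before failing.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the queue limits of the underlying path. */
struct model_limits {
	unsigned int max_sectors;
	unsigned int max_segments;
};

/* Hypothetical stand-in for the cloned request sent down a path. */
struct model_request {
	unsigned int nr_sectors;
	unsigned int nr_segments;
};

/*
 * A clone built against the limits of the multipath device is
 * rejected with -EIO if it exceeds the limits currently in effect
 * on the lower queue.
 */
int model_cloned_rq_check_limits(const struct model_limits *lim,
				 const struct model_request *rq)
{
	assert(lim != NULL && rq != NULL);

	if (rq->nr_sectors > lim->max_sectors)
		return -EIO; /* kernel logs "over max size limit" here */

	if (rq->nr_segments > lim->max_segments)
		return -EIO; /* kernel logs "over max segment number" here */

	return 0;
}
```

If the limits were briefly smaller than usual while paths are being torn down, a request that was valid a moment earlier would fail this check, which would match an EIO that only shows up under tight timing. Whether that actually happens here is exactly what the thread leaves open.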
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 287caa7..16583c1 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1416,12 +1416,14 @@ static void multipath_postsuspend(struct dm_target *ti)
 static void multipath_resume(struct dm_target *ti)
 {
 	struct multipath *m = ti->private;
+	unsigned long flags;
 
+	spin_lock_irqsave(&m->lock, flags);
 	if (test_bit(MPATHF_SAVED_QUEUE_IF_NO_PATH, &m->flags))
 		set_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags);
 	else
 		clear_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags);
-	smp_mb__after_atomic();
+	spin_unlock_irqrestore(&m->lock, flags);
 }
 
 /*