Message ID: 20160801204628.GA94704@redhat.com (mailing list archive)
State: Not Applicable, archived
On 08/01/2016 01:46 PM, Mike Snitzer wrote:
> Please retry both variants (CONFIG_DM_MQ_DEFAULT=y first) with this patch
> applied. Interested to see if things look better for you (WARN_ON_ONCEs
> added just to see if we hit the corresponding suspend/stopped state
> while mapping requests -- if so this speaks to an inherently racy
> problem that will need further investigation for a proper fix but
> results from this should let us know if we're closer).
>
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 1b2f962..0e0f6e0 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -2007,6 +2007,9 @@ static int map_request(struct dm_rq_target_io *tio, struct request *rq,
>  	struct dm_target *ti = tio->ti;
>  	struct request *clone = NULL;
>
> +	if (WARN_ON_ONCE(unlikely(dm_suspended_md(md))))
> +		return DM_MAPIO_REQUEUE;
> +
>  	if (tio->clone) {
>  		clone = tio->clone;
>  		r = ti->type->map_rq(ti, clone, &tio->info);
> @@ -2722,6 +2725,9 @@ static int dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
>  		dm_put_live_table(md, srcu_idx);
>  	}
>
> +	if (WARN_ON_ONCE(unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->state))))
> +		return BLK_MQ_RQ_QUEUE_BUSY;
> +
>  	if (ti->type->busy && ti->type->busy(ti))
>  		return BLK_MQ_RQ_QUEUE_BUSY;

Hello Mike,

The test results with this patch, and also the three other patches that
have been posted in the context of this e-mail thread, applied on top of
kernel v4.7 are as follows:

(1) CONFIG_DM_MQ_DEFAULT=y and fio running on top of XFS:

From the system log:

[ ... ]
mpath 254:0: queue_if_no_path 0 -> 1
executing DM ioctl DEV_SUSPEND on mpathbe
mpath 254:0: queue_if_no_path 1 -> 0
__multipath_map(): (a) returning -5
map_request(): clone_and_map_rq() returned -5
dm_complete_request: error = -5
dm_softirq_done: dm-0 tio->error = -5
blk_update_request: I/O error (-5), dev dm-0, sector 311960
[ ... ]

After this test finished, "dmsetup remove_all" failed and the following
message appeared in the system log: "device-mapper: ioctl: remove_all
left 1 open device(s)". Note: when I reran this test after a reboot,
"dmsetup remove_all" succeeded.

(2) CONFIG_DM_MQ_DEFAULT=y and fio running on top of ext4:

From the system log:

[ ... ]
[  146.023067] WARNING: CPU: 2 PID: 482 at drivers/md/dm.c:2748 dm_mq_queue_rq+0xc1/0x150 [dm_mod]
[  146.026073] Workqueue: kblockd blk_mq_run_work_fn
[  146.026083] Call Trace:
[  146.026087]  [<ffffffff81320047>] dump_stack+0x68/0xa1
[  146.026090]  [<ffffffff81061c46>] __warn+0xc6/0xe0
[  146.026092]  [<ffffffff81061d18>] warn_slowpath_null+0x18/0x20
[  146.026098]  [<ffffffffa0286791>] dm_mq_queue_rq+0xc1/0x150 [dm_mod]
[  146.026100]  [<ffffffff81306f7a>] __blk_mq_run_hw_queue+0x1da/0x350
[  146.026102]  [<ffffffff813076c0>] blk_mq_run_work_fn+0x10/0x20
[  146.026105]  [<ffffffff8107efe9>] process_one_work+0x1f9/0x6a0
[  146.026109]  [<ffffffff8107f4d9>] worker_thread+0x49/0x490
[  146.026116]  [<ffffffff81085cda>] kthread+0xea/0x100
[  146.026119]  [<ffffffff81624fbf>] ret_from_fork+0x1f/0x40
[ ... ]
[  146.269194] mpath 254:1: queue_if_no_path 0 -> 1
[  146.276502] executing DM ioctl DEV_SUSPEND on mpathbf
[  146.276556] mpath 254:1: queue_if_no_path 1 -> 0
[  146.276560] __multipath_map(): (a) returning -5
[  146.276561] map_request(): clone_and_map_rq() returned -5
[  146.276562] dm_complete_request: error = -5
[  146.276563] dm_softirq_done: dm-1 tio->error = -5
[  146.276566] blk_update_request: I/O error (-5), dev dm-1, sector 2097144
[ ... ]

After this test finished, running "dmsetup remove_all" and unloading
ib_srp succeeded.

(3) CONFIG_DM_MQ_DEFAULT=n and fio running on top of XFS:

The first run of this test passed. During the second run fio reported an
I/O error. From the system log:

[ ... ]
[ 1290.010886] mpath 254:0: queue_if_no_path 0 -> 1
[ 1290.026905] executing DM ioctl DEV_SUSPEND on mpathbe
[ 1290.026960] mpath 254:0: queue_if_no_path 1 -> 0
[ 1290.027001] __multipath_map(): (a) returning -5
[ 1290.027002] map_request(): clone_and_map_rq() returned -5
[ 1290.027003] dm_complete_request: error = -5
[ ... ]

(4) CONFIG_DM_MQ_DEFAULT=n and fio running on top of ext4:

The first two runs of this test passed. After the second run "dmsetup
remove_all" failed and the following error message appeared in the
system log: "device-mapper: ioctl: remove_all left 1 open device(s)".
The following kernel thread might be the one that was holding open
/dev/dm-0:

# ps aux | grep dio/
root  5306  0.0  0.0  0  0 ?  S<  15:24  0:00 [dio/dm-0]

Please let me know if you need more information.

Bart.
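(For reference, the recurring error value -5 in these logs is -EIO; on a
typical Linux system this can be confirmed from the kernel's errno
definitions -- the header path below may vary by distribution:

# grep -w EIO /usr/include/asm-generic/errno-base.h
#define EIO              5      /* I/O error */

i.e. what fio and the filesystems above are seeing is a plain I/O error
propagated up from dm-mpath.)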
On Mon, Aug 01 2016 at 6:41pm -0400,
Bart Van Assche <bart.vanassche@sandisk.com> wrote:

> On 08/01/2016 01:46 PM, Mike Snitzer wrote:
> > Please retry both variants (CONFIG_DM_MQ_DEFAULT=y first) with this patch
> > applied. Interested to see if things look better for you (WARN_ON_ONCEs
> > added just to see if we hit the corresponding suspend/stopped state
> > while mapping requests -- if so this speaks to an inherently racy
> > problem that will need further investigation for a proper fix but
> > results from this should let us know if we're closer).
> >
> > [patch quoted above]
>
> Hello Mike,
>
> The test results with this patch, and also the three other patches that
> have been posted in the context of this e-mail thread, applied on top of
> kernel v4.7 are as follows:

Hi Bart,

Please do these same tests against a v4.7 kernel with the 4 patches from
this branch applied (no need for your other debug patches):
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.7-mpath-fixes

I've had good results with my blk-mq SRP based testing.

Thanks,
Mike
On 08/02/2016 10:45 AM, Mike Snitzer wrote:
> Please do these same tests against a v4.7 kernel with the 4 patches from
> this branch applied (no need for your other debug patches):
> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.7-mpath-fixes
>
> I've had good results with my blk-mq SRP based testing.

Hello Mike,

Thanks again for having made these patches available. The results of my
tests are as follows:

(1) CONFIG_DM_MQ_DEFAULT=y, fio running on top of xfs

The first simulated cable pull caused the following messages to appear:

[  428.716566] mpath 254:1: queue_if_no_path 1 -> 0
[  428.729671] __multipath_map(): (a) returning -5
[  428.729730] map_request(): clone_and_map_rq() returned -5
[  428.729788] dm_complete_request: error = -5
[  428.729846] dm_softirq_done: dm-1 tio->error = -5
[  428.729904] blk_update_request: 880 callbacks suppressed
[  428.729970] blk_update_request: I/O error (-5), dev dm-1, sector 2097024

(2) CONFIG_DM_MQ_DEFAULT=y, fio running on top of ext4

The first simulated cable pull caused the following messages to appear:

[  162.894737] mpath 254:0: queue_if_no_path 0 -> 1
[  162.903155] executing DM ioctl DEV_SUSPEND on mpathbe
[  162.903207] mpath 254:0: queue_if_no_path 1 -> 0
[  162.903255] device-mapper: multipath: must_push_back: queue_if_no_path=0 suspend_active=1 suspending=0
[  162.903256] __multipath_map(): (a) returning -5
[  162.903257] map_request(): clone_and_map_rq() returned -5
[  162.903258] dm_complete_request: error = -5
[  162.903259] dm_softirq_done: dm-0 tio->error = -5
[  162.903261] blk_update_request: I/O error (-5), dev dm-0, sector 263424
[  162.903284] Buffer I/O error on dev dm-0, logical block 32928, lost sync page write

(3) CONFIG_DM_MQ_DEFAULT=n, fio running on top of xfs

This test passed once, but on the second run fio reported "bad magic
header" after a large number of iterations. I'm still analyzing the logs.

(4) CONFIG_DM_MQ_DEFAULT=n, fio running on top of ext4

Ran this test three times. The first two runs passed, but during the
third run fio again reported I/O errors and I found the following in the
kernel log:

[  954.048860] __multipath_map(): (a) returning -5
[  954.048861] map_request(): clone_and_map_rq() returned -5
[  954.048862] dm_complete_request: error = -5
[  954.048870] dm_softirq_done: dm-0 tio->error = -5
[  954.048873] blk_update_request: 15 callbacks suppressed
[  954.048874] blk_update_request: I/O error (-5), dev dm-0, sector 159976

Bart.
On Tue, Aug 02 2016 at 8:19pm -0400,
Bart Van Assche <bart.vanassche@sandisk.com> wrote:

> On 08/02/2016 10:45 AM, Mike Snitzer wrote:
> > Please do these same tests against a v4.7 kernel with the 4 patches from
> > this branch applied (no need for your other debug patches):
> > https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.7-mpath-fixes
> >
> > I've had good results with my blk-mq SRP based testing.
>
> Hello Mike,
>
> Thanks again for having made these patches available. The results of my
> tests are as follows:

Disappointing. But I asked you to run the v4.7 kernel patches I
pointed to _without_ any of your debug patches.

I cannot reproduce on our SRP testbed with the fixes I provided. We're
now in a place where there would appear to be something very unique to
your environment causing these failures.
Hi Bart

I simplified the test to 2 simple scripts, running against only one XFS
file system. Can you validate these and tell me if it's enough to emulate
what you are doing? Perhaps our test-suite is too simple.

Start the test:

# cat run_test.sh
#!/bin/bash
logger "Starting Bart's test"
#for i in `seq 1 10`
for i in 1
do
fio --verify=md5 -rw=randwrite --size=10M --bs=4K --loops=$((10**6)) \
    --iodepth=64 --group_reporting --sync=1 --direct=1 --ioengine=libaio \
    --directory="/data-$i" --name=data-integrity-test --thread --numjobs=16 \
    --runtime=600 --output=fio-output.txt >/dev/null &
done

Delete the hosts; I wait 10s in between host deletions.
But I also tested with 3s, and still it's stable with Mike's patches.

#!/bin/bash
for i in /sys/class/srp_remote_ports/*
do
    echo "Deleting host $i, it will re-connect via srp_daemon"
    echo 1 > $i/delete
    sleep 10
done

Check for I/O errors affecting XFS: we now have none with the patches
Mike provided. After recovery I can create files in the xfs mount with no
issues.

Can you use my scripts and 1 mount and see if it still fails for you?

Thanks
Laurence

----- Original Message -----
> From: "Mike Snitzer" <snitzer@redhat.com>
> To: "Bart Van Assche" <bart.vanassche@sandisk.com>
> Cc: dm-devel@redhat.com, "Laurence Oberman" <loberman@redhat.com>, linux-scsi@vger.kernel.org
> Sent: Tuesday, August 2, 2016 8:40:14 PM
> Subject: Re: dm-mq and end_clone_request()
>
> On Tue, Aug 02 2016 at 8:19pm -0400,
> Bart Van Assche <bart.vanassche@sandisk.com> wrote:
>
> > On 08/02/2016 10:45 AM, Mike Snitzer wrote:
> > > Please do these same tests against a v4.7 kernel with the 4 patches from
> > > this branch applied (no need for your other debug patches):
> > > https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.7-mpath-fixes
> > >
> > > I've had good results with my blk-mq SRP based testing.
> >
> > Hello Mike,
> >
> > Thanks again for having made these patches available. The results of my
> > tests are as follows:
>
> Disappointing. But I asked you to run the v4.7 kernel patches I
> pointed to _without_ any of your debug patches.
>
> I cannot reproduce on our SRP testbed with the fixes I provided. We're
> now in a place where there would appear to be something very unique to
> your environment causing these failures.
On Tue, Aug 02 2016 at 9:33pm -0400,
Laurence Oberman <loberman@redhat.com> wrote:

> Hi Bart
>
> I simplified the test to 2 simple scripts, running against only one XFS
> file system. Can you validate these and tell me if it's enough to emulate
> what you are doing? Perhaps our test-suite is too simple.
>
> [test scripts quoted above]
>
> Check for I/O errors affecting XFS: we now have none with the patches
> Mike provided. After recovery I can create files in the xfs mount with no
> issues.
>
> Can you use my scripts and 1 mount and see if it still fails for you?

In parallel we can try Bart's testsuite that he shared earlier in this
thread: https://github.com/bvanassche/srp-test

README.md says:
"Running these tests manually is tedious. Hence this test suite that
tests the SRP initiator and target drivers by loading both drivers on
the same server, by logging in using the IB loopback functionality and
by sending I/O through the SRP initiator driver to a RAM disk exported
by the SRP target driver."

This could explain why Bart is still seeing issues. He isn't testing
real hardware -- as such he is using ramdisk to expose races, etc.

Mike
----- Original Message -----
> From: "Mike Snitzer" <snitzer@redhat.com>
> To: "Laurence Oberman" <loberman@redhat.com>
> Cc: "Bart Van Assche" <bart.vanassche@sandisk.com>, dm-devel@redhat.com, linux-scsi@vger.kernel.org
> Sent: Tuesday, August 2, 2016 10:10:12 PM
> Subject: Re: dm-mq and end_clone_request()
>
> On Tue, Aug 02 2016 at 9:33pm -0400,
> Laurence Oberman <loberman@redhat.com> wrote:
>
> > [test scripts quoted above]
>
> In parallel we can try Bart's testsuite that he shared earlier in this
> thread: https://github.com/bvanassche/srp-test
>
> README.md says:
> "Running these tests manually is tedious. Hence this test suite that
> tests the SRP initiator and target drivers by loading both drivers on
> the same server, by logging in using the IB loopback functionality and
> by sending I/O through the SRP initiator driver to a RAM disk exported
> by the SRP target driver."
>
> This could explain why Bart is still seeing issues. He isn't testing
> real hardware -- as such he is using ramdisk to expose races, etc.
>
> Mike

Hi Mike,

I looked at Bart's scripts; they looked fine, but I wanted a more
simplified way to bring the error out.
Using ramdisk is not uncommon as an LIO backend via ib_srpt to serve
LUNs. That is the same way I do it when I am not connected to a large
array, as it is the only way I can get EDR-like speeds.

I don't think it's racing due to the ramdisk back-end, but maybe we need
to ramp ours up to run more in parallel in a loop.

I will run 21 parallel runs and see if it makes a difference tonight and
report back tomorrow.
Clearly, prior to your final patches, we were escaping back to the FS
layer with errors, but since your patches, at least in our test harness,
that is resolved.

Thanks
Laurence
----- Original Message -----
> From: "Laurence Oberman" <loberman@redhat.com>
> To: "Mike Snitzer" <snitzer@redhat.com>
> Cc: "Bart Van Assche" <bart.vanassche@sandisk.com>, dm-devel@redhat.com, linux-scsi@vger.kernel.org
> Sent: Tuesday, August 2, 2016 10:18:30 PM
> Subject: Re: dm-mq and end_clone_request()
>
> [...]
>
> I will run 21 parallel runs and see if it makes a difference tonight and
> report back tomorrow.
> Clearly, prior to your final patches, we were escaping back to the FS
> layer with errors, but since your patches, at least in our test harness,
> that is resolved.

Hello

I ran 20 parallel runs with 3 loops through host deletion, and in each
case fio survived with no hard error escaping to the FS layer.
It's solid in our test bed.
Keep in mind we have no ib_srpt loaded, as we have a hardware-based array
and are connected directly to the array with EDR 100.
I am also not removing and reloading modules like is happening in Bart's
scripts, and also not trying to delete mpath maps etc.

I focused only on the I/O error that was escaping up to the FS layer.
I will check in with Bart tomorrow.

Thanks
Laurence
----- Original Message -----
> From: "Laurence Oberman" <loberman@redhat.com>
> To: "Mike Snitzer" <snitzer@redhat.com>
> Cc: "Bart Van Assche" <bart.vanassche@sandisk.com>, dm-devel@redhat.com, linux-scsi@vger.kernel.org
> Sent: Tuesday, August 2, 2016 10:55:59 PM
> Subject: Re: dm-mq and end_clone_request()
>
> [...]
>
> I ran 20 parallel runs with 3 loops through host deletion, and in each
> case fio survived with no hard error escaping to the FS layer.
> It's solid in our test bed.
> Keep in mind we have no ib_srpt loaded, as we have a hardware-based array
> and are connected directly to the array with EDR 100.
> I am also not removing and reloading modules like is happening in Bart's
> scripts, and also not trying to delete mpath maps etc.
>
> I focused only on the I/O error that was escaping up to the FS layer.
> I will check in with Bart tomorrow.

Hi Bart

Looking back at your email: I also get these, but those are expected, as
we are in the process of doing I/O when we yank the hosts and that
affects in-flight I/O.

Aug  2 22:41:23 jumpclient kernel: device-mapper: multipath: Failing path 8:192.
Aug  2 22:41:23 jumpclient kernel: blk_update_request: I/O error, dev sdm, sector 258504
Aug  2 22:41:23 jumpclient kernel: blk_update_request: I/O error, dev sdm, sector 60320

However, I never get any of these any more (with the patches applied)
that you show:

[  162.903284] Buffer I/O error on dev dm-0, logical block 32928, lost sync page write

I will work with you to understand why, with Mike's patches, it's now
stable here but not in your configuration.

Thanks
Laurence
On 08/02/2016 06:33 PM, Laurence Oberman wrote:
> #!/bin/bash
> for i in /sys/class/srp_remote_ports/*
> do
>     echo "Deleting host $i, it will re-connect via srp_daemon"
>     echo 1 > $i/delete
>     sleep 10
> done

Hello Laurence,

Sorry but the above looks wrong to me. There should be a second loop
around this loop, and the sleep statement should be moved from the inner
loop to the outer loop. The above code logs out one (initiator, target)
port pair at a time instead of logging out all paths at once.

Bart.
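(A minimal sketch of the restructuring Bart describes, assuming the same
sysfs layout as Laurence's script; the iteration count of 3 is an
arbitrary example, not from the thread:

#!/bin/bash
# Outer loop: repeat the all-paths-down event several times.
for iteration in 1 2 3
do
    # Inner loop: delete every SRP remote port back-to-back, with no
    # sleep, so all paths are lost at (nearly) the same time.
    for i in /sys/class/srp_remote_ports/*
    do
        echo "Deleting host $i, it will re-connect via srp_daemon"
        echo 1 > $i/delete
    done
    # Sleep in the outer loop only: give srp_daemon time to re-establish
    # the logins before triggering the next all-paths-down event.
    sleep 10
done

This loses all paths at once, which is the scenario Bart's test suite
exercises.)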
On 08/02/2016 05:40 PM, Mike Snitzer wrote:
> But I asked you to run the v4.7 kernel patches I
> pointed to _without_ any of your debug patches.

I need several patches to fix bugs that are not related to the device
mapper, e.g. "sched: Avoid that __wait_on_bit_lock() hangs"
(https://lkml.org/lkml/2016/8/3/289).

Bart.
----- Original Message -----
> From: "Bart Van Assche" <bart.vanassche@sandisk.com>
> To: "Laurence Oberman" <loberman@redhat.com>, "Mike Snitzer" <snitzer@redhat.com>
> Cc: dm-devel@redhat.com, linux-scsi@vger.kernel.org
> Sent: Wednesday, August 3, 2016 12:06:17 PM
> Subject: Re: [dm-devel] dm-mq and end_clone_request()
>
> On 08/02/2016 06:33 PM, Laurence Oberman wrote:
> > [deletion script quoted above]
>
> Hello Laurence,
>
> Sorry but the above looks wrong to me. There should be a second loop
> around this loop, and the sleep statement should be moved from the inner
> loop to the outer loop. The above code logs out one (initiator, target)
> port pair at a time instead of logging out all paths at once.
>
> Bart.

Hi Bart

It logs out each host in turn with a 10s sleep in between. I actually
reduced the sleep to 3s last night.
We do end up with all paths lost, but not at precisely the same second.
Are you saying we have to lose all paths at the same time?
That is easy to fix, and I was running it that way in the beginning; I
will re-test.

Thanks
Laurence
----- Original Message -----
> From: "Bart Van Assche" <bart.vanassche@sandisk.com>
> To: "Laurence Oberman" <loberman@redhat.com>, "Mike Snitzer" <snitzer@redhat.com>
> Cc: dm-devel@redhat.com, linux-scsi@vger.kernel.org
> Sent: Wednesday, August 3, 2016 12:06:17 PM
> Subject: Re: [dm-devel] dm-mq and end_clone_request()
>
> [...]
>
> Sorry but the above looks wrong to me. There should be a second loop
> around this loop, and the sleep statement should be moved from the inner
> loop to the outer loop. The above code logs out one (initiator, target)
> port pair at a time instead of logging out all paths at once.

Hi Bart

Latest tests are still good on our side. I am now taking both paths out
at the same time, but still we seem stable here.
The first test removed the sleep and we still had a delay; the second
test added a background (&) so the deletions ran as close as possible to
the same time. Both tests passed. I will email the messages log just to
you.

With no sleep we still have a 9s gap when we delete paths, and we are
good:

Aug  3 13:41:21 jumpclient multipathd: 360001ff0b035d000000000008d700001: remaining active paths: 1
Aug  3 13:41:22 jumpclient multipathd: 360001ff0b035d000000000028d720003: remaining active paths: 1
Aug  3 13:41:22 jumpclient multipathd: 360001ff0b035d000000000048d740005: remaining active paths: 1
Aug  3 13:41:22 jumpclient multipathd: 360001ff0b035d000000000068d760007: remaining active paths: 1
Aug  3 13:41:23 jumpclient multipathd: 360001ff0b035d0000000000b8d7b000c: remaining active paths: 1
Aug  3 13:41:23 jumpclient multipathd: 360001ff0b035d0000000000d8d7d000e: remaining active paths: 1
Aug  3 13:41:23 jumpclient multipathd: 360001ff0b035d000000000118d810012: remaining active paths: 1
Aug  3 13:41:24 jumpclient multipathd: 360001ff0b035d000000000138d830014: remaining active paths: 1
Aug  3 13:41:24 jumpclient multipathd: 360001ff0b035d000000000158d850016: remaining active paths: 1
Aug  3 13:41:25 jumpclient multipathd: 360001ff0b035d000000000178d870018: remaining active paths: 1
Aug  3 13:41:25 jumpclient multipathd: 360001ff0b035d000000000198d89001a: remaining active paths: 1
Aug  3 13:41:25 jumpclient multipathd: 360001ff0b035d0000000001a8d8a001b: remaining active paths: 1
Aug  3 13:41:25 jumpclient multipathd: 360001ff0b035d0000000001c8d8c001d: remaining active paths: 1
Aug  3 13:41:26 jumpclient multipathd: 360001ff0b035d0000000001e8d8e001f: remaining active paths: 1
Aug  3 13:41:26 jumpclient multipathd: 360001ff0b035d0000000001f8d8f0020: remaining active paths: 1
Aug  3 13:41:26 jumpclient multipathd: 360001ff0b035d000000000208d900021: remaining active paths: 1
Aug  3 13:41:26 jumpclient multipathd: 360001ff0b035d000000000228d920023: remaining active paths: 1
Aug  3 13:41:28 jumpclient multipathd: 360001ff0b035d000000000248d940025: remaining active paths: 1
Aug  3 13:41:29 jumpclient multipathd: 360001ff0b035d000000000268d960027: remaining active paths: 1
Aug  3 13:41:29 jumpclient multipathd: 360001ff0b035d000000000278d970028: remaining active paths: 1
Aug  3 13:41:30 jumpclient multipathd: 360001ff0b035d000000000288d980029: remaining active paths: 1
Aug  3 13:41:35 jumpclient multipathd: 360001ff0b035d000000000008d700001: remaining active paths: 0
Aug  3 13:41:36 jumpclient multipathd: 360001ff0b035d000000000028d720003: remaining active paths: 0
Aug  3 13:41:37 jumpclient multipathd: 360001ff0b035d000000000048d740005: remaining active paths: 0
Aug  3 13:41:37 jumpclient multipathd: 360001ff0b035d000000000068d760007: remaining active paths: 0
Aug  3 13:41:38 jumpclient multipathd: 360001ff0b035d0000000000b8d7b000c: remaining active paths: 0
Aug  3 13:41:38 jumpclient multipathd: 360001ff0b035d0000000000d8d7d000e: remaining active paths: 0
Aug  3 13:41:38 jumpclient multipathd: 360001ff0b035d000000000108d800011: remaining active paths: 0
Aug  3 13:41:38 jumpclient multipathd: 360001ff0b035d000000000118d810012: remaining active paths: 0
Aug  3 13:41:38 jumpclient multipathd: 360001ff0b035d000000000138d830014: remaining active paths: 0
Aug  3 13:41:39 jumpclient multipathd: 360001ff0b035d000000000158d850016: remaining active paths: 0
Aug  3 13:41:39 jumpclient multipathd: 360001ff0b035d000000000178d870018: remaining active paths: 0
Aug  3 13:41:39 jumpclient multipathd: 360001ff0b035d000000000198d89001a: remaining active paths: 0
Aug  3 13:41:39 jumpclient multipathd: 360001ff0b035d0000000001a8d8a001b: remaining active paths: 0
Aug  3 13:41:39 jumpclient multipathd: 360001ff0b035d0000000001c8d8c001d: remaining active paths: 0
Aug  3 13:41:39 jumpclient multipathd: 360001ff0b035d0000000001e8d8e001f: remaining active paths: 0
Aug  3 13:41:41 jumpclient multipathd: 360001ff0b035d0000000001f8d8f0020: remaining active paths: 0
Aug  3 13:41:41 jumpclient multipathd: 360001ff0b035d000000000208d900021: remaining active paths: 0
Aug  3 13:41:43 jumpclient multipathd: 360001ff0b035d000000000248d940025: remaining active paths: 0
Aug  3 13:41:43 jumpclient multipathd: 360001ff0b035d000000000268d960027: remaining active paths: 0
Aug  3 13:41:44 jumpclient multipathd: 360001ff0b035d000000000288d980029: remaining active paths: 0
Aug  3 13:42:44 jumpclient multipathd: 360001ff0b035d000000000138d830014: remaining active paths: 2
Aug  3 13:42:44 jumpclient multipathd: 360001ff0b035d000000000158d850016: remaining active paths: 2

These are the only errors, and they are expected:

Aug  3 13:41:22 jumpclient kernel: blk_update_request: I/O error, dev sdd, sector 31141264880
Aug  3 13:41:22 jumpclient kernel: blk_update_request: I/O error, dev sdd, sector 79928
Aug  3 13:41:22 jumpclient kernel: blk_update_request: I/O error, dev sdd, sector 65264
Aug  3 13:41:22 jumpclient kernel: blk_update_request: I/O error, dev sdd, sector 55232
Aug  3 13:41:22 jumpclient kernel: blk_update_request: I/O error, dev sdd, sector 14152
Aug  3 13:41:22 jumpclient kernel: blk_update_request: I/O error, dev sdd, sector 168880
Aug  3 13:41:22 jumpclient kernel: blk_update_request: I/O error, dev sdd, sector 269392
Aug  3 13:41:22 jumpclient kernel: blk_update_request: I/O error, dev sdd, sector 309200
Aug  3 13:41:22 jumpclient kernel: blk_update_request: I/O error, dev sdd, sector 87520
Aug  3 13:41:22 jumpclient kernel: blk_update_request: I/O error, dev sdd, sector 7744
Aug  3 13:41:28 jumpclient kernel: blk_update_request: I/O error, dev sdca, sector 119984
Aug  3 13:41:29 jumpclient kernel: blk_update_request: I/O error, dev sdcd, sector 31139908984
Aug  3 13:41:29 jumpclient kernel: blk_update_request: I/O error, dev sdcd, sector 131136
Aug  3 13:41:29 jumpclient kernel: blk_update_request: I/O error, dev sdcd, sector 97536
Aug  3 13:41:29 jumpclient kernel: blk_update_request: I/O error, dev sdcd, sector 123264
Aug  3 13:41:29 jumpclient kernel: blk_update_request: I/O error, dev sdcd, sector 110336
Aug  3 13:41:29 jumpclient kernel: blk_update_request: I/O error, dev sdcd, sector 158136
Aug  3 13:41:29 jumpclient kernel: blk_update_request: I/O error, dev sdcd, sector 156136
Aug  3 13:41:29 jumpclient kernel: blk_update_request: I/O error, dev sdcd, sector 173072
Aug  3 13:41:29 jumpclient kernel: blk_update_request: I/O error, dev sdcd, sector 6984
Aug  3 13:41:35 jumpclient kernel: blk_update_request: I/O error, dev sdc, sector 130224
Aug  3 13:41:35 jumpclient kernel: blk_update_request: I/O error, dev sdc, sector 225816
Aug  3 13:41:36 jumpclient kernel: blk_update_request: I/O error, dev sdc, sector 248120
Aug  3 13:41:36 jumpclient kernel: blk_update_request: I/O error, dev sdc, sector 242528
Aug  3 13:41:36 jumpclient kernel: blk_update_request: I/O error, dev sdk, sector 251248
Aug  3 13:41:36 jumpclient kernel: blk_update_request: I/O error, dev sdk, sector 242032
Aug  3 13:41:36 jumpclient kernel: blk_update_request: I/O error, dev sdk, sector 203736
Aug  3 13:41:36 jumpclient kernel: blk_update_request: I/O error, dev sdk, sector 31141107808
Aug  3 13:41:36 jumpclient kernel: blk_update_request: I/O error, dev sdk, sector 233336
Aug  3 13:41:36 jumpclient kernel: blk_update_request: I/O error, dev sdk, sector 187944
Aug  3 13:41:41 jumpclient kernel: blk_update_request: I/O error, dev sdbl, sector 85800
Aug  3 13:41:41 jumpclient kernel: blk_update_request: I/O error, dev sdbl, sector 74120
Aug  3 13:41:41 jumpclient kernel: blk_update_request: I/O error, dev sdbl, sector 78216
Aug  3 13:41:41 jumpclient kernel: blk_update_request: I/O error, dev sdbl, sector 79976
Aug  3 13:41:41 jumpclient kernel: blk_update_request: I/O error, dev sdbl, sector 79552
Aug  3 13:41:41 jumpclient kernel: blk_update_request: I/O error, dev sdbl, sector 87888
Aug  3 13:41:43 jumpclient kernel: blk_update_request: I/O error, dev sdbt, sector 274368
Aug  3 13:41:43 jumpclient kernel: blk_update_request: I/O error, dev sdbt, sector 31139814080
Aug  3 13:41:43 jumpclient kernel: blk_update_request: I/O error, dev sdbx, sector 6776
Aug  3 13:41:43 jumpclient kernel: blk_update_request: I/O error, dev sdbx, sector 302152

Changing the script to add & we take both paths away at the same time,
but still we seem to survive here.

This is my configuration:

360001ff0b035d000000000078d770008 dm-9 DDN ,SFA14K
size=29T features='3 queue_if_no_path queue_mode mq' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=90 status=active
| `- 1:0:0:7 sday 67:32 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  `- 2:0:0:7 sdj  8:144 active ready running

device {
        vendor "DDN"
        product "SFA14K"
        path_grouping_policy group_by_prio
        prio alua
        path_selector "round-robin 0"
        path_checker tur
        failback 2
        rr_weight uniform
        no_path_retry 12
        dev_loss_tmo 10
        fast_io_fail_tmo 5
        features "1 queue_if_no_path"
}

With &:

#!/bin/bash
for i in /sys/class/srp_remote_ports/*
do
    echo "Deleting host $i, it will re-connect via srp_daemon"
    echo 1 > $i/delete &
    #sleep 3
done

Here we lose both paths at the same time:

[root@jumpclient bart_tests]# grep remaining messages
Aug  3 13:49:38 jumpclient multipathd: 360001ff0b035d000000000008d700001: remaining active paths: 0
Aug  3 13:49:38 jumpclient multipathd: 360001ff0b035d000000000028d720003: remaining active paths: 0
Aug  3 13:49:38 jumpclient multipathd: 360001ff0b035d000000000048d740005: remaining active paths: 0
Aug  3 13:49:41 jumpclient multipathd: 360001ff0b035d000000000068d760007: remaining active paths: 0
Aug  3 13:49:42 jumpclient multipathd: 360001ff0b035d0000000000d8d7d000e: remaining active paths: 0
Aug  3 13:49:45 jumpclient multipathd: 360001ff0b035d000000000118d810012: remaining active paths: 0
Aug  3 13:49:45 jumpclient multipathd: 360001ff0b035d000000000108d800011: remaining active paths: 0
Aug  3 13:49:47 jumpclient multipathd: 360001ff0b035d000000000158d850016: remaining active paths: 0
Aug  3 13:49:48 jumpclient multipathd: 360001ff0b035d000000000178d870018: remaining active paths: 0
Aug  3 13:49:48 jumpclient multipathd: 360001ff0b035d000000000198d89001a: remaining active paths: 0
Aug  3 13:49:48 jumpclient multipathd: 360001ff0b035d0000000001a8d8a001b: remaining active paths: 0
Aug  3 13:49:55 jumpclient multipathd: 360001ff0b035d0000000001e8d8e001f: remaining active paths: 0
Aug  3 13:49:55 jumpclient multipathd: 360001ff0b035d0000000001f8d8f0020: remaining active paths: 0
Aug  3 13:49:58 jumpclient multipathd: 360001ff0b035d000000000248d940025: remaining active paths: 0
Aug  3 13:49:59 jumpclient multipathd: 360001ff0b035d000000000268d960027: remaining active paths: 0
Aug  3 13:50:00 jumpclient multipathd: 360001ff0b035d000000000288d980029: remaining active paths: 0
Aug  3 13:51:17 jumpclient multipathd: 360001ff0b035d000000000038d730004: remaining active paths: 2
Aug  3 13:51:17 jumpclient multipathd: 360001ff0b035d000000000028d720003: remaining active paths: 2
Aug  3 13:51:19 jumpclient multipathd: 360001ff0b035d000000000078d770008: remaining active paths: 2
Aug  3 13:51:20 jumpclient multipathd: 360001ff0b035d0000000000a8d7a000b: remaining active paths: 2
Aug  3 13:51:23 jumpclient multipathd: 360001ff0b035d0000000000d8d7d000e: remaining active paths: 2
Aug  3 13:51:24 jumpclient multipathd: 360001ff0b035d000000000108d800011: remaining active paths: 2
Aug  3 13:51:25 jumpclient multipathd: 360001ff0b035d0000000000f8d7f0010: remaining active paths: 2
Aug  3 13:51:26 jumpclient multipathd: 360001ff0b035d000000000128d820013: remaining active paths: 2
Aug  3 13:51:29 jumpclient multipathd: 360001ff0b035d0000000001c8d8c001d: remaining active paths: 2
Aug  3 13:51:33 jumpclient multipathd: 360001ff0b035d000000000228d920023: remaining active paths: 2
Aug  3 13:51:34 jumpclient multipathd: 360001ff0b035d000000000238d930024: remaining active paths: 2

We still survive:

[root@jumpclient bart_tests]# grep -i error messages
Aug  3 13:49:38 jumpclient kernel: blk_update_request: I/O error, dev sdc, sector 98288
Aug  3 13:49:38 jumpclient kernel: blk_update_request: I/O error, dev sdc, sector 98320
Aug  3 13:49:38 jumpclient kernel: blk_update_request: I/O error, dev sdc, sector 46976
Aug  3 13:49:38 jumpclient kernel: blk_update_request: I/O error, dev sde, sector 216720
Aug  3 13:49:38 jumpclient kernel: blk_update_request: I/O error, dev sdg, sector 130672
Aug  3 13:49:41 jumpclient kernel: blk_update_request: I/O error, dev sdi, sector 56984
Aug  3 13:49:41 jumpclient kernel: blk_update_request: I/O error, dev sdi, sector 56120
Aug  3 13:49:41 jumpclient kernel: blk_update_request: I/O error, dev sdi, sector 62112
Aug  3 13:49:42 jumpclient kernel: blk_update_request: I/O error, dev sdp, sector 156944
Aug  3 13:49:42 jumpclient kernel: blk_update_request: I/O error, dev sdp, sector 31140975496
Aug  3 13:49:44 jumpclient kernel: blk_update_request: I/O error, dev sdt, sector 207392
Aug  3 13:49:44 jumpclient kernel: blk_update_request: I/O error, dev sdt, sector 200568
Aug  3 13:49:44 jumpclient kernel: blk_update_request: I/O error, dev sdt, sector 251048
Aug  3 13:49:44 jumpclient kernel: blk_update_request: I/O error, dev sdt, sector 247616
Aug  3 13:49:44 jumpclient kernel: blk_update_request: I/O error, dev sdt, sector 210592
Aug  3 13:49:44 jumpclient kernel: blk_update_request: I/O error, dev sdt, sector 200120
Aug  3 13:49:44 jumpclient kernel: blk_update_request: I/O error, dev sdt, sector 203000
Aug  3 13:49:44 jumpclient kernel: blk_update_request: I/O error, dev sdt, sector 248640
Aug  3 13:49:47 jumpclient kernel: blk_update_request: I/O error, dev sdx, sector 48232
Aug  3 13:49:48 jumpclient kernel: blk_update_request: I/O error, dev sdz, sector 9984
Aug  3 13:49:55 jumpclient kernel: blk_update_request: I/O error, dev sdag, sector 130512
Aug  3 13:49:58 jumpclient kernel: blk_update_request: I/O error, dev sdai, sector 39040
Aug  3 13:49:58 jumpclient kernel: blk_update_request: I/O error, dev sdam, sector 31140570528
Aug  3 13:49:59 jumpclient kernel: blk_update_request: I/O error, dev sdao, sector 204552
Aug  3 13:50:00 jumpclient kernel: blk_update_request: I/O error, dev sdaq, sector 31142052904
On Wed, Aug 03 2016 at 12:55pm -0400,
Bart Van Assche <bart.vanassche@sandisk.com> wrote:

> On 08/02/2016 05:40 PM, Mike Snitzer wrote:
> > But I asked you to run the v4.7 kernel patches I
> > pointed to _without_ any of your debug patches.
>
> I need several patches to fix bugs that are not related to the
> device mapper, e.g. "sched: Avoid that __wait_on_bit_lock() hangs"
> (https://lkml.org/lkml/2016/8/3/289).

OK, but you have way more changes than seem needed. In particular the
blk-mq error handling changes look suspect. I'm also not sure what
REQ_FAIL_IF_NO_PATH is all about (vaguely recall seeing it before; and
suggesting you use SCSI's more traditional differentiated IO errors).

Anyway, at this point you're having us test too many changes that aren't
yet upstream:

$ git diff bart/srp-initiator-for-next dm/dm-4.7-mpath-fixes -- drivers block include kernel | diffstat
 block/bio-integrity.c                   |    1
 block/blk-cgroup.c                      |    4
 block/blk-core.c                        |   16 ---
 block/blk-mq.c                          |   16 ---
 block/partition-generic.c               |    3
 drivers/acpi/acpica/nswalk.c            |    1
 drivers/infiniband/core/rw.c            |   24 +++--
 drivers/infiniband/core/verbs.c         |    9 --
 drivers/infiniband/hw/hfi1/Kconfig      |    1
 drivers/infiniband/hw/mlx4/qp.c         |    6 -
 drivers/infiniband/sw/rdmavt/Kconfig    |    1
 drivers/infiniband/ulp/isert/ib_isert.c |    2
 drivers/infiniband/ulp/isert/ib_isert.h |    1
 drivers/infiniband/ulp/srp/ib_srp.c     |  131 --------------------------------
 drivers/infiniband/ulp/srp/ib_srp.h     |    5 -
 drivers/infiniband/ulp/srpt/ib_srpt.c   |   10 +-
 drivers/infiniband/ulp/srpt/ib_srpt.h   |    6 -
 drivers/md/dm-crypt.c                   |    4
 drivers/md/dm-ioctl.c                   |   77 +++++++++---------
 drivers/md/dm-mpath.c                   |   32 -------
 drivers/md/dm.c                         |   22 -----
 drivers/scsi/scsi_lib.c                 |   36 +-------
 drivers/scsi/scsi_priv.h                |    2
 drivers/scsi/scsi_scan.c                |    2
 drivers/scsi/scsi_sysfs.c               |   48 -----------
 drivers/scsi/sd.c                       |    6 -
 drivers/scsi/sg.c                       |    3
 include/linux/blk-mq.h                  |    3
 include/linux/blk_types.h               |    5 -
 include/linux/blkdev.h                  |    1
 include/linux/dmar.h                    |    2
 include/rdma/ib_verbs.h                 |    6 -
 include/scsi/scsi_device.h              |    2
 kernel/sched/wait.c                     |    2
 34 files changed, 106 insertions(+), 384 deletions(-)
On 08/04/2016 09:10 AM, Mike Snitzer wrote:
> Anyway, at this point you're having us test too many changes that aren't
> yet upstream:
>
> $ git diff bart/srp-initiator-for-next dm/dm-4.7-mpath-fixes -- drivers block include kernel | diffstat
> [diffstat quoted above]

Hello Mike,

Most of the changes you are referring to either are already upstream,
are expected to arrive in Linus' tree later this week, or only add
debugging pr_info() statements. The changes that either are already
upstream or that are expected to be upstream soon are:

$ for b in origin/master dledford-rdma/k.o/for-4.8-1 dledford-rdma/k.o/for-4.8-2; do git log v4.7..$b --author="Bart Van Assche" | grep ^commit -A4 | sed -n 's/^ //p'; done
block: Fix spelling in a source code comment
dm ioctl: Simplify parameter buffer management code
dm crypt: Fix sparse complaints
block/blk-cgroup.c: Declare local symbols static
block/bio-integrity.c: Add #include "blk.h"
block/partition-generic.c: Remove a set-but-not-used variable
IB/hfi1: Disable by default
IB/rdmavt: Disable by default
IB/isert: Remove an unused member variable
IB/srpt: Simplify srpt_queue_response()
IB/srpt: Limit the number of SG elements per work request
IB/core, RDMA RW API: Do not exceed QP SGE send limit
IB/core: Make rdma_rw_ctx_init() initialize all used fields

Bart.
I've staged another fix, Laurence is seeing success with this added:
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.8&id=d50a6450104c237db1dc75314d17b78c990a8c05

I'll be sending all the fixes I've queued to Linus tonight or early
tomorrow (since I'll then be on vacation until Monday 8/15).
----- Original Message -----
> From: "Mike Snitzer" <snitzer@redhat.com>
> To: "Bart Van Assche" <bart.vanassche@sandisk.com>
> Cc: dm-devel@redhat.com, "Laurence Oberman" <loberman@redhat.com>, linux-scsi@vger.kernel.org
> Sent: Thursday, August 4, 2016 7:58:50 PM
> Subject: Re: dm-mq and end_clone_request()
>
> I've staged another fix, Laurence is seeing success with this added:
> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.8&id=d50a6450104c237db1dc75314d17b78c990a8c05
>
> I'll be sending all the fixes I've queued to Linus tonight or early
> tomorrow (since I'll then be on vacation until Monday 8/15).

Hello Bart,

I applied that patch to your kernel, and while I still obviously see all
the debug logging, it's no longer failing fio for me.
I ran 8 loops with 20 parallel fio runs. This was on a different server
from the one I had been testing on.

However, I am concerned about timing playing a part here, so let us know
what you find.

Thanks
Laurence
----- Original Message -----
> From: "Laurence Oberman" <loberman@redhat.com>
> To: "Mike Snitzer" <snitzer@redhat.com>
> Cc: "Bart Van Assche" <bart.vanassche@sandisk.com>, dm-devel@redhat.com, linux-scsi@vger.kernel.org
> Sent: Thursday, August 4, 2016 9:07:28 PM
> Subject: Re: dm-mq and end_clone_request()
>
> [...]
>
> I applied that patch to your kernel, and while I still obviously see all
> the debug logging, it's no longer failing fio for me.
> I ran 8 loops with 20 parallel fio runs. This was on a different server
> from the one I had been testing on.
>
> However, I am concerned about timing playing a part here, so let us know
> what you find.

Replying to my own message:

Hi Bart, Mike

Further testing has shown we are still exposed here, so more
investigation is necessary.
The above patch seems to help, but I still see sporadic cases of errors
escaping up the stack.

I expect you will see the same, so more work to do here to figure this
out.

Thanks
Laurence
----- Original Message -----
> From: "Laurence Oberman" <loberman@redhat.com>
> To: "Mike Snitzer" <snitzer@redhat.com>
> Cc: "Bart Van Assche" <bart.vanassche@sandisk.com>, dm-devel@redhat.com, linux-scsi@vger.kernel.org
> Sent: Friday, August 5, 2016 7:43:30 AM
> Subject: Re: dm-mq and end_clone_request()
>
> [...]
>
> Further testing has shown we are still exposed here, so more
> investigation is necessary.
> The above patch seems to help, but I still see sporadic cases of errors
> escaping up the stack.
>
> I expect you will see the same, so more work to do here to figure this
> out.

Hello Bart

I completely forgot I had set no_path_retry=12, so after 12 retries it
will error out. This is likely why I had different results seemingly
affected by timing. Mike reminded me of it this morning.

What do you have set for no_path_retry? When I set it to queue, it
blocks the paths coming back for some reason.
I am now investigating why that is happening :).
I see now I need to add "simultaneous all paths lost" scenarios to my QA
testing, as it's not a common scenario.

Thanks
Laurence
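(For context, a sketch of the difference between the two no_path_retry
settings as multipath.conf fragments; the ~60s figure assumes the
default polling_interval of 5 seconds, and the elided lines stand for
the rest of the device section shown earlier in this thread:

device {
        ...
        no_path_retry 12     # queue I/O for 12 checker intervals
                             # (12 x polling_interval, ~60s), then fail
                             # the map, so I/O errors eventually reach
                             # the filesystem -- as in Laurence's runs
}

device {
        ...
        no_path_retry queue  # queue I/O indefinitely until a path
                             # comes back; nothing errors out, but I/O
                             # hangs for as long as all paths are down
}

With a numeric value, whether fio sees errors depends on how long the
all-paths-down window lasts relative to that timeout, which matches the
timing-dependent results described above.)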
On 08/05/2016 08:39 AM, Laurence Oberman wrote:
> I completely forgot I had set no_path_retry=12, so after 12 retries it will error out.
> This is likely why I had different results that seemed to be affected by timing.
> Mike reminded me of it this morning.
>
> What do you have set for no_path_retry? When I set it to queue, it blocks the paths coming back for some reason.
> I am now investigating why that is happening :).
> I see now I need to add "simultaneous all paths lost" scenarios to my QA testing, as it is not a common scenario.

Hello Laurence,

I'm using the following multipath.conf file for the tests I run:

defaults {
	user_friendly_names	yes
	queue_without_daemon	no
}
blacklist {
	device {
		vendor	"ATA"
		product	".*"
	}
}
devices {
	device {
		vendor			"SCST_BIO|LIO-ORG"
		product			".*"
		features		"3 queue_if_no_path pg_init_retries 50"
		path_grouping_policy	group_by_prio
		path_selector		"queue-length 0"
		path_checker		tur
	}
}
blacklist_exceptions {
	property ".*"
}

Bart.
On 08/04/2016 04:58 PM, Mike Snitzer wrote:
> I've staged another fix, Laurence is seeing success with this added:
> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.8&id=d50a6450104c237db1dc75314d17b78c990a8c05

Thanks Mike. I have started testing that fix this morning.

Bart.
On 08/05/2016 04:43 AM, Laurence Oberman wrote:
> Further testing has shown we are still exposed here, so more investigation is necessary.
> The above patch seems to help, but I still see sporadic cases of errors escaping up the stack.
>
> I expect you will see the same, so there is more work to do here to figure this out.

Hello Laurence,

Unfortunately I also still see sporadic I/O errors when testing
all-paths-down with CONFIG_DM_MQ_DEFAULT=n (I have not yet tried to
retest with CONFIG_DM_MQ_DEFAULT=y).

Bart.
----- Original Message -----
> From: "Bart Van Assche" <bart.vanassche@sandisk.com>
> To: "Laurence Oberman" <loberman@redhat.com>, "Mike Snitzer" <snitzer@redhat.com>
> Cc: dm-devel@redhat.com, linux-scsi@vger.kernel.org
> Sent: Friday, August 5, 2016 2:42:49 PM
> Subject: Re: [dm-devel] dm-mq and end_clone_request()
>
> On 08/05/2016 04:43 AM, Laurence Oberman wrote:
> > Further testing has shown we are still exposed here, so more investigation
> > is necessary.
> > The above patch seems to help, but I still see sporadic cases of errors
> > escaping up the stack.
> >
> > I expect you will see the same, so there is more work to do here to figure this out.
>
> Hello Laurence,
>
> Unfortunately I also still see sporadic I/O errors when testing
> all-paths-down with CONFIG_DM_MQ_DEFAULT=n (I have not yet tried to
> retest with CONFIG_DM_MQ_DEFAULT=y).
>
> Bart.

Hello Bart,

I am still debugging this, now that I have no_path_retry=queue and not a count :)
I am often hitting the host delete race. Have you seen this in your testing
during debugging?
I am using your kernel built from your git tree that has Mike's patches
applied: 4.7.0bart

[66813.896159] Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015
[66813.933246] Workqueue: srp_remove srp_remove_work [ib_srp]
[66813.964703] 0000000000000086 00000000d185b9ce ffff88060fa03d20 ffffffff813456df
[66814.007292] 0000000000000000 0000000000000000 ffff88060fa03d60 ffffffff81089fb1
[66814.049336] 0000007da067604b ffff880c01643d80 0000000000017ec0 ffff880c016447dc
[66814.091725] Call Trace:
[66814.104775] <IRQ> [<ffffffff813456df>] dump_stack+0x63/0x84
[66814.136507] [<ffffffff81089fb1>] __warn+0xd1/0xf0
[66814.163118] [<ffffffff8108a0ed>] warn_slowpath_null+0x1d/0x20
[66814.195409] [<ffffffff8104fd7e>] native_smp_send_reschedule+0x3e/0x40
[66814.231954] [<ffffffff810b47db>] try_to_wake_up+0x30b/0x390
[66814.263661] [<ffffffff810b4912>] default_wake_function+0x12/0x20
[66814.297713] [<ffffffff810ccb05>] __wake_up_common+0x55/0x90
[66814.330021] [<ffffffff810ccb53>] __wake_up_locked+0x13/0x20
[66814.361906] [<ffffffff81261179>] ep_poll_callback+0xb9/0x200
[66814.392784] [<ffffffff810ccb05>] __wake_up_common+0x55/0x90
[66814.424908] [<ffffffff810ccc59>] __wake_up+0x39/0x50
[66814.454327] [<ffffffff810e1f80>] wake_up_klogd_work_func+0x40/0x60
[66814.490152] [<ffffffff81177b6d>] irq_work_run_list+0x4d/0x70
[66814.523007] [<ffffffff810710d0>] ? do_flush_tlb_all+0x50/0x50
[66814.556161] [<ffffffff81177bbc>] irq_work_run+0x2c/0x30
[66814.586677] [<ffffffff8110ab5f>] flush_smp_call_function_queue+0x8f/0x160
[66814.625667] [<ffffffff8110b613>] generic_smp_call_function_single_interrupt+0x13/0x60
[66814.669276] [<ffffffff81050167>] smp_call_function_interrupt+0x27/0x40
[66814.706255] [<ffffffff816c7e9c>] call_function_interrupt+0x8c/0xa0
[66814.741406] <EOI> [<ffffffff8118e733>] ? panic+0x1ef/0x233
[66814.772851] [<ffffffff8118e72f>] ? panic+0x1eb/0x233
[66814.800207] [<ffffffff810308f8>] oops_end+0xb8/0xd0
[66814.827454] [<ffffffff8106977e>] no_context+0x13e/0x3a0
[66814.858368] [<ffffffff811f3feb>] ? __slab_free+0x9b/0x280
[66814.890365] [<ffffffff81069ace>] __bad_area_nosemaphore+0xee/0x1d0
[66814.926508] [<ffffffff81069bc4>] bad_area_nosemaphore+0x14/0x20
[66814.959939] [<ffffffff8106a269>] __do_page_fault+0x89/0x4a0
[66814.992039] [<ffffffff811f3feb>] ? __slab_free+0x9b/0x280
[66815.023052] [<ffffffff8106a6b0>] do_page_fault+0x30/0x80
[66815.053368] [<ffffffff816c8b88>] page_fault+0x28/0x30
[66815.083196] [<ffffffff814ae4e9>] ? __scsi_remove_device+0x79/0x160
[66815.117444] [<ffffffff814ae5c2>] ? __scsi_remove_device+0x152/0x160
[66815.152051] [<ffffffff814ac790>] scsi_forget_host+0x60/0x70
[66815.183939] [<ffffffff814a0137>] scsi_remove_host+0x77/0x110
[66815.216152] [<ffffffffa0677be0>] srp_remove_work+0x90/0x200 [ib_srp]
[66815.253221] [<ffffffff810a2e72>] process_one_work+0x152/0x400
[66815.286221] [<ffffffff810a3765>] worker_thread+0x125/0x4b0
[66815.317313] [<ffffffff810a3640>] ? rescuer_thread+0x380/0x380
[66815.349770] [<ffffffff810a9298>] kthread+0xd8/0xf0
[66815.376082] [<ffffffff816c6b3f>] ret_from_fork+0x1f/0x40
[66815.404767] [<ffffffff810a91c0>] ? kthread_park+0x60/0x60
[66815.436448] ---[ end trace bfaf79198d0976f5 ]---
On 08/06/16 07:47, Laurence Oberman wrote:
> [66813.933246] Workqueue: srp_remove srp_remove_work [ib_srp]
> [ ... ]
> [66815.152051] [<ffffffff814ac790>] scsi_forget_host+0x60/0x70
> [66815.183939] [<ffffffff814a0137>] scsi_remove_host+0x77/0x110
> [66815.216152] [<ffffffffa0677be0>] srp_remove_work+0x90/0x200 [ib_srp]
> [66815.253221] [<ffffffff810a2e72>] process_one_work+0x152/0x400
> [66815.286221] [<ffffffff810a3765>] worker_thread+0x125/0x4b0
> [66815.317313] [<ffffffff810a3640>] ? rescuer_thread+0x380/0x380
> [66815.349770] [<ffffffff810a9298>] kthread+0xd8/0xf0
> [66815.376082] [<ffffffff816c6b3f>] ret_from_fork+0x1f/0x40
> [66815.404767] [<ffffffff810a91c0>] ? kthread_park+0x60/0x60

Hello Laurence,

This is a callstack I have not yet encountered myself during any test.
Please provide the output of the following commands:

$ gdb /lib/modules/$(uname -r)/build/vmlinux
(gdb) list *(scsi_forget_host+0x60)

Thanks,

Bart.
----- Original Message -----
> From: "Bart Van Assche" <bvanassche@acm.org>
> To: "Laurence Oberman" <loberman@redhat.com>
> Cc: "Mike Snitzer" <snitzer@redhat.com>, dm-devel@redhat.com, linux-scsi@vger.kernel.org
> Sent: Sunday, August 7, 2016 6:31:11 PM
> Subject: Re: [dm-devel] dm-mq and end_clone_request()
>
> On 08/06/16 07:47, Laurence Oberman wrote:
> > [66813.933246] Workqueue: srp_remove srp_remove_work [ib_srp]
> > [ ... ]
> > [66815.152051] [<ffffffff814ac790>] scsi_forget_host+0x60/0x70
> > [66815.183939] [<ffffffff814a0137>] scsi_remove_host+0x77/0x110
> > [66815.216152] [<ffffffffa0677be0>] srp_remove_work+0x90/0x200 [ib_srp]
> > [66815.253221] [<ffffffff810a2e72>] process_one_work+0x152/0x400
> > [66815.286221] [<ffffffff810a3765>] worker_thread+0x125/0x4b0
> > [66815.317313] [<ffffffff810a3640>] ? rescuer_thread+0x380/0x380
> > [66815.349770] [<ffffffff810a9298>] kthread+0xd8/0xf0
> > [66815.376082] [<ffffffff816c6b3f>] ret_from_fork+0x1f/0x40
> > [66815.404767] [<ffffffff810a91c0>] ? kthread_park+0x60/0x60
>
> Hello Laurence,
>
> This is a callstack I have not yet encountered myself during any test.
> Please provide the output of the following commands:
>
> $ gdb /lib/modules/$(uname -r)/build/vmlinux
> (gdb) list *(scsi_forget_host+0x60)
>
> Thanks,
>
> Bart.

[loberman@jumptest1 linux]$ gdb vmlinux
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/loberman/bart/linux/vmlinux...done.
(gdb) list *(scsi_forget_host+0x60)
0xffffffff814ac790 is in scsi_forget_host (drivers/scsi/scsi_scan.c:1895).
1890		list_for_each_entry(sdev, &shost->__devices, siblings) {
1891			if (sdev->sdev_state == SDEV_DEL)
1892				continue;
1893			spin_unlock_irqrestore(shost->host_lock, flags);
1894			__scsi_remove_device(sdev);
1895			goto restart;
1896		}
1897		spin_unlock_irqrestore(shost->host_lock, flags);
1898	}
1899
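For readers following along, below is the surrounding function, reconstructed from the gdb listing above together with the upstream drivers/scsi/scsi_scan.c of that era (the restart label sits just above the lines gdb printed); treat it as a reference sketch rather than an authoritative copy. Note that host_lock is dropped around every __scsi_remove_device() call and the traversal then restarts from scratch, so a device list that changes while the lock is released is exactly where a race like the one in the trace could bite:

/* Sketch of scsi_forget_host(), reconstructed for reference (circa v4.7) */
void scsi_forget_host(struct Scsi_Host *shost)
{
	struct scsi_device *sdev;
	unsigned long flags;

 restart:
	spin_lock_irqsave(shost->host_lock, flags);
	list_for_each_entry(sdev, &shost->__devices, siblings) {
		if (sdev->sdev_state == SDEV_DEL)
			continue;
		/* host_lock is dropped here, so shost->__devices can be
		 * modified by another context before the goto below
		 * re-acquires it; the loop restarts to cope with that. */
		spin_unlock_irqrestore(shost->host_lock, flags);
		__scsi_remove_device(sdev);
		goto restart;
	}
	spin_unlock_irqrestore(shost->host_lock, flags);
}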