diff mbox

[0/4] iscsi target: Fix oops during relogin

Message ID 58E2C237.3050809@redhat.com (mailing list archive)
State New, archived
Headers show

Commit Message

Mike Christie April 3, 2017, 9:44 p.m. UTC
On 03/31/2017 11:54 AM, Mike Christie wrote:
> On 03/31/2017 01:00 AM, Nicholas A. Bellinger wrote:
>> Hey Mike,
>>
>> On Tue, 2017-03-21 at 09:14 -0400, Mike Christie wrote:
>>> On 02/26/2017 10:03 PM, Mike Christie wrote:
>>>> The following patches fix a oops that occurs when the initiator
>>>> is trying to relogin to a iscsi target. The problem occurs, when
>>>> the initiator has sent a command that is stuck running on some
>>>> backend, and the initiator has sent TMFs and eventually escalated
>>>> to session level recovery.
>>>>
>>>> During the relogin operation, the target will wait for the stuck
>>>> command to complete, and the initiator may time out the relogin
>>>> request and drop that tcp connection and retry. The target will
>>>> then free the iscsi connection structs from under the np_thread
>>>> and we will crash.
>>>>
>>>> Patches were made over the target-pending for-next branch.
>>>>
>>>
>>> Hey Nick,
>>>
>>> Were these patches ok?
>>>
>>> I can rebuild them against your current tree, but before I do that, I
>>> was thinking there might be a cleaner alternative you know about.
>>>
>>> I think my patches are a little ugly. The behavior for the modified
>>> functions is now more difficult to follow because you can sleep, not
>>> sleep and now interruptible sleep, and you can end up retrying them and
>>> going down different branches on the retry.
>>>
>>> I think the alternative is some sort ref counting based teardown in that
>>> login error path.
>>
>> I haven't forgot about these, and apologies on the extended delay for a
>> proper follow-up.
> 
> He He, I was a bad initiator maintainer in the end so I can't complain :)
> 
>>
>> What I'm confused by is the particular scenario described in the patch.
>> That is, it's the same scenario DATERA Q/A and automation routinely
>> tests on v4.1.y and v3.14.y with a few thousand active volumes.  So far
>> we've not triggered a reproduction like the one described above.
>>
>> Namely, where a backend driver takes an extended amount of time to
>> complete an outstanding se_cmd, resulting in ABORT_TASK and LUN_RESET,
>> followed by a session reinstatement that occurs while se_cmd is still
>> outstanding to backend driver code.
>>
>> If a session reinstatement fails due to it's login attempt taking longer
>> than TA_LOGIN_TIMEOUT=15 seconds since the se_cmd in question still
>> didn't complete, iscsi_handle_login_thread_timeout() fires and sends
>> SIGINT to iscsi_np->np_thread.
>>
>> If iscsi_check_for_session_reinstatement() is already blocked on
>> iscsi_stop_session() -> wait_for_completion(), it will wait indefinitely
>> until the se_cmd in question is completed back to target-core before
>> allowing login to make forward progress, or fail due to the login
>> timeout.
> 
> This is where we hit the problem.
> 
> At this time while in the wait, the initiator gives up (normally hit a
> iscsi login timeout on the initiator side) on the login attempt and just
> drops the tcp/ip connection. On the target side we detect this and
> iscsi_target_sk_state_change runs and iscsi_target_do_cleanup which
> frees the iscsi_login related resources.
> 
> When the command eventually completes, we wake from the
> wait_for_completion and try to access the freed iscsi_login struct.
> 
> The problem is that iscsi_check_for_session_reinstatement ->
> iscsi_target_check_for_existing_instances will return 0 after the
> command has completed so the login thread does not know that login has
> failed due to the tcp/ip connection getting dropped and the iscsi_login
> struct has been freed. It will then try to access the freed iscsi_login
> struct and proceed with the login process.
> 
> 
> 
> 
>>
>> If iscsi_check_for_session_reinstatement hasn't been reached yet or
>> hasn't blocked on wait_for_completion(), the SIGINT should fail the
>> connection the next time it attempts to do socket I/O.
>>
>> From what I can gather from the original problem statement, you are
>> hitting something different than these two cases, right..?
>>
>> So I'd really like to reproduce what you've seen to trigger the
>> scenario, and jump into kgdb and see what's going on.  Would you mind
>> giving me more details wrt you've been have to reproduce this, and even
>> better, some debug code to reproduce at will..?
>>
> 
> I will send a patch for scsi_debug that can simulate the problem.
> 

Attached is a patch to scsi_debug, scsi-debug-hang-abort.patch, which
will hang the abort process so you can simulate commands that get stuck.
Just export the scsi_debug /dev/sdX as a pscsi backend device and use
these settings for scsi_debug:

1. every_nth = 30 (set this after the initial login through sysfs on the
target side /sys/module/scsi_debug/parameters/every_nth, so you do not
hit scanning related issues)
2. abort_sleep = 120 (you might need to increase this depending on your
timeouts below)
3. opts = 0x4

On the initiator side, use these settings to speed up the failure:

1. Set /sys/block/sdX/device/timeout to 5.
2. node.session.timeo.replacement_timeout = 5
3. node.conn[0].timeo.login_timeout = 30
4. node.conn[0].timeo.noop_out_timeout and
node.conn[0].timeo.noop_out_interval = 5
5. node.session.err_timeo.abort_timeout = 5

On the target side, I am using the default settings.

Then just do some simple IO until you hit the every_nth limit. Do
something like

dd if=/dev/sdX of=/dev/null iflag=direct count=1

A couple times until you hit the every_nth setting, so you do not end up
with a lot of stuck IO on the target side.

I also attached the oops I see in the attachment iscsi-relogin-bug. It
was made against master in target-pending.
[  387.250623] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[  387.250637] IP: iscsi_encode_text_output+0x32/0x5e0 [iscsi_target_mod]
[  387.250638] PGD 0 

[  387.250641] Oops: 0000 [#1] SMP
[  387.250642] Modules linked in: iscsi_target_mod target_core_pscsi target_core_file target_core_iblock target_core_user uio target_core_mod configfs fuse nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) dm_mirror dm_region_hash dm_log coretemp ext4 x86_pkg_temp_thermal kvm_intel jbd2 mbcache kvm snd_hda_codec_hdmi irqbypass crc32_pclmul ghash_clmulni_intel pcbc snd_hda_codec_realtek snd_hda_codec_generic aesni_intel aes_x86_64 crypto_simd snd_hda_intel cryptd snd_hda_codec glue_helper snd_hda_core iTCO_wdt snd_hwdep iTCO_vendor_support snd_seq dm_mod dell_smm_hwmon snd_seq_device dcdbas snd_pcm snd_timer i2c_i801 snd lpc_ich pcspkr
[  387.250682]  soundcore sg i2c_core shpchp mfd_core wmi hed nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc ip_tables autofs4 xfs libcrc32c sr_mod sd_mod cdrom crc32c_intel ixgbe mdio xhci_pci hwmon qla2xxx e1000e ahci dca xhci_hcd libahci ptp scsi_transport_fc pps_core ipv6 crc_ccitt
[  387.250706] CPU: 18 PID: 10413 Comm: iscsi_np Tainted: G           O    4.11.0-rc1+ #1
[  387.250707] Hardware name: Dell Inc. Precision Tower 7810/0GWHMW, BIOS A02 09/05/2014
[  387.250708] task: ffff88044aa65a80 task.stack: ffffc90004330000
[  387.250713] RIP: 0010:iscsi_encode_text_output+0x32/0x5e0 [iscsi_target_mod]
[  387.250714] RSP: 0018:ffffc90004333d80 EFLAGS: 00010286
[  387.250715] RAX: ffff8804492ce000 RBX: ffff880442213800 RCX: ffff88044d590cdc
[  387.250716] RDX: ffff8804492ce000 RSI: 0000000000000002 RDI: 0000000000000006
[  387.250717] RBP: ffffc90004333df8 R08: 0000000000000000 R09: 0000000000000000
[  387.250718] R10: 0000000000000000 R11: ffff880439bae029 R12: 0000000000000008
[  387.250718] R13: ffff88044d590cdc R14: ffff88044d590cc0 R15: ffff880449360820
[  387.250720] FS:  0000000000000000(0000) GS:ffff88045fc80000(0000) knlGS:0000000000000000
[  387.250721] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  387.250721] CR2: 0000000000000008 CR3: 0000000001c09000 CR4: 00000000001406e0
[  387.250722] Call Trace:
[  387.250730]  iscsi_target_do_login+0x143/0x610 [iscsi_target_mod]
[  387.250736]  iscsi_target_start_negotiation+0x4e/0xa0 [iscsi_target_mod]
[  387.250741]  iscsi_target_login_thread+0xa73/0xfc0 [iscsi_target_mod]
[  387.250744]  ? __schedule+0x2bb/0x950
[  387.250746]  kthread+0x101/0x140
[  387.250751]  ? iscsi_target_login_sess_out+0x250/0x250 [iscsi_target_mod]
[  387.250752]  ? queued_spin_unlock+0x20/0x20
[  387.250754]  ret_from_fork+0x2c/0x40
[  387.250755] Code: 89 e5 41 57 41 56 41 55 49 89 cd 41 54 4d 8d 60 08 53 48 83 ec 50 8b 01 48 89 55 b8 4c 89 45 a0 40 88 75 97 48 01 d0 48 89 45 98 <49> 8b 40 08 49 39 c4 48 8d 58 e0 0f 84 59 02 00 00 45 31 ff 31 
[  387.250781] RIP: iscsi_encode_text_output+0x32/0x5e0 [iscsi_target_mod] RSP: ffffc90004333d80
[  387.250781] CR2: 0000000000000008
[  387.250783] ---[ end trace d5c6a015e9661bc0 ]---

Comments

Nicholas A. Bellinger April 12, 2017, 5:53 a.m. UTC | #1
Hi MNC,

On Mon, 2017-04-03 at 16:44 -0500, Mike Christie wrote:
> On 03/31/2017 11:54 AM, Mike Christie wrote:
> > On 03/31/2017 01:00 AM, Nicholas A. Bellinger wrote:

<SNIP>

> > 
> > This is where we hit the problem.
> > 
> > At this time while in the wait, the initiator gives up (normally hit a
> > iscsi login timeout on the initiator side) on the login attempt and just
> > drops the tcp/ip connection. On the target side we detect this and
> > iscsi_target_sk_state_change runs and iscsi_target_do_cleanup which
> > frees the iscsi_login related resources.
> > 
> > When the command eventually completes, we wake from the
> > wait_for_completion and try to access the freed iscsi_login struct.
> > 
> > The problem is that iscsi_check_for_session_reinstatement ->
> > iscsi_target_check_for_existing_instances will return 0 after the
> > command has completed so the login thread does not know that login has
> > failed due to the tcp/ip connection getting dropped and the iscsi_login
> > struct has been freed. It will then try to access the freed iscsi_login
> > struct and proceed with the login process.
> > 
> > 
> > 
> > 
> >>
> >> If iscsi_check_for_session_reinstatement hasn't been reached yet or
> >> hasn't blocked on wait_for_completion(), the SIGINT should fail the
> >> connection the next time it attempts to do socket I/O.
> >>
> >> From what I can gather from the original problem statement, you are
> >> hitting something different than these two cases, right..?
> >>
> >> So I'd really like to reproduce what you've seen to trigger the
> >> scenario, and jump into kgdb and see what's going on.  Would you mind
> >> giving me more details wrt you've been have to reproduce this, and even
> >> better, some debug code to reproduce at will..?
> >>
> > 
> > I will send a patch for scsi_debug that can simulate the problem.
> > 
> 
> Attached is a patch to scsi_debug, scsi-debug-hang-abort.patch, which
> will hang the abort process so you can simulate commands that get stuck.
> Just export the scsi_debug /dev/sdX as a pscsi backend device and use
> these settings for scsi_debug:
> 
> 1. every_nth = 30 (set this after the initial login through sysfs on the
> target side /sys/module/scsi_debug/parameters/every_nth, so you do not
> hit scanning related issues)
> 2. abort_sleep = 120 (you might need to increase this depending on your
> timeouts below)
> 3. opts = 0x4
> 
> On the initiator side, use these settings to speed up the failure:
> 
> 1. Set /sys/block/sdX/device/timeout to 5.
> 2. node.session.timeo.replacement_timeout = 5
> 3. node.conn[0].timeo.login_timeout = 30
> 4. node.conn[0].timeo.noop_out_timeout and
> node.conn[0].timeo.noop_out_interval = 5
> 5. node.session.err_timeo.abort_timeout = 5
> 
> On the target side, I am using the default settings.
> 
> Then just do some simple IO until you hit the every_nth limit. Do
> something like
> 
> dd if=/dev/sdX of=/dev/null iflag=direct count=1
> 
> A couple times until you hit the every_nth setting, so you do not end up
> with a lot of stuck IO on the target side.
> 
> I also attached the oops I see in the attachment iscsi-relogin-bug. It
> was made against master in target-pending.
> 
> 

Thanks alot for the reproduction details.

Still thinking how this patch should look, but introducing a
iscsi_conn->login_kref to ensure the existing logic in
iscsi_target_do_cleanup() can only be called on final put vs. active
iscsi_target_do_login_rx() is probably going to be the most
straight-forward approach.

In any event, this is at the top of my list and I should have something
to start testing by the end of the week.

Thank you.

--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nicholas A. Bellinger May 26, 2017, 5:49 a.m. UTC | #2
Hey MNC,

Apologies for taking so long to get back to this one.  Comments below.

(Adding HARE, Sagi & Varun CC').

On Mon, 2017-04-03 at 16:44 -0500, Mike Christie wrote:
> On 03/31/2017 11:54 AM, Mike Christie wrote:
> > On 03/31/2017 01:00 AM, Nicholas A. Bellinger wrote:

<SNIP>

> >>
> >> What I'm confused by is the particular scenario described in the patch.
> >> That is, it's the same scenario DATERA Q/A and automation routinely
> >> tests on v4.1.y and v3.14.y with a few thousand active volumes.  So far
> >> we've not triggered a reproduction like the one described above.
> >>
> >> Namely, where a backend driver takes an extended amount of time to
> >> complete an outstanding se_cmd, resulting in ABORT_TASK and LUN_RESET,
> >> followed by a session reinstatement that occurs while se_cmd is still
> >> outstanding to backend driver code.
> >>
> >> If a session reinstatement fails due to it's login attempt taking longer
> >> than TA_LOGIN_TIMEOUT=15 seconds since the se_cmd in question still
> >> didn't complete, iscsi_handle_login_thread_timeout() fires and sends
> >> SIGINT to iscsi_np->np_thread.
> >>
> >> If iscsi_check_for_session_reinstatement() is already blocked on
> >> iscsi_stop_session() -> wait_for_completion(), it will wait indefinitely
> >> until the se_cmd in question is completed back to target-core before
> >> allowing login to make forward progress, or fail due to the login
> >> timeout.
> > 
> > This is where we hit the problem.
> > 
> > At this time while in the wait, the initiator gives up (normally hit a
> > iscsi login timeout on the initiator side) on the login attempt and just
> > drops the tcp/ip connection. On the target side we detect this and
> > iscsi_target_sk_state_change runs and iscsi_target_do_cleanup which
> > frees the iscsi_login related resources.
> > 
> > When the command eventually completes, we wake from the
> > wait_for_completion and try to access the freed iscsi_login struct.
> > 
> > The problem is that iscsi_check_for_session_reinstatement ->
> > iscsi_target_check_for_existing_instances will return 0 after the
> > command has completed so the login thread does not know that login has
> > failed due to the tcp/ip connection getting dropped and the iscsi_login
> > struct has been freed. It will then try to access the freed iscsi_login
> > struct and proceed with the login process.
> > 
> > 
> > 
> > 
> >>
> >> If iscsi_check_for_session_reinstatement hasn't been reached yet or
> >> hasn't blocked on wait_for_completion(), the SIGINT should fail the
> >> connection the next time it attempts to do socket I/O.
> >>
> >> From what I can gather from the original problem statement, you are
> >> hitting something different than these two cases, right..?
> >>
> >> So I'd really like to reproduce what you've seen to trigger the
> >> scenario, and jump into kgdb and see what's going on.  Would you mind
> >> giving me more details wrt you've been have to reproduce this, and even
> >> better, some debug code to reproduce at will..?
> >>
> > 
> > I will send a patch for scsi_debug that can simulate the problem.
> > 
> 
> Attached is a patch to scsi_debug, scsi-debug-hang-abort.patch, which
> will hang the abort process so you can simulate commands that get stuck.
> Just export the scsi_debug /dev/sdX as a pscsi backend device and use
> these settings for scsi_debug:
> 
> 1. every_nth = 30 (set this after the initial login through sysfs on the
> target side /sys/module/scsi_debug/parameters/every_nth, so you do not
> hit scanning related issues)
> 2. abort_sleep = 120 (you might need to increase this depending on your
> timeouts below)
> 3. opts = 0x4
> 
> On the initiator side, use these settings to speed up the failure:
> 
> 1. Set /sys/block/sdX/device/timeout to 5.
> 2. node.session.timeo.replacement_timeout = 5
> 3. node.conn[0].timeo.login_timeout = 30
> 4. node.conn[0].timeo.noop_out_timeout and
> node.conn[0].timeo.noop_out_interval = 5
> 5. node.session.err_timeo.abort_timeout = 5
> 
> On the target side, I am using the default settings.
> 
> Then just do some simple IO until you hit the every_nth limit. Do
> something like
> 
> dd if=/dev/sdX of=/dev/null iflag=direct count=1
> 
> A couple times until you hit the every_nth setting, so you do not end up
> with a lot of stuck IO on the target side.
> 
> I also attached the oops I see in the attachment iscsi-relogin-bug. It
> was made against master in target-pending.
> 
> 

OK, so I've finally posted a patch to address the root cause.

iscsi-target: Fix initial login PDU asynchronous socket close OOPs
https://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending.git/commit/?id=61d3d4c9abc1d22cfd444e46837bda090f1343ce

It addresses the original case where iscsi_np process context is blocked
waiting for session reinstatement to complete while processing the
initial login request PDU, but the TCP connection is closed
asynchronously by the initiator.

It just sets LOGIN_FLAGS_CLOSED in iscsi_target_sk_state_change(), and
let's iscsi_np process context detect the failure once session
reinstatement has completed, and perform the associated connection
cleanup for the failed login attempt.

For subsequent login request PDUs handled from delayed workqueue context
in iscsi_target_do_login_rx(), when a TCP connection is closed
asynchronously from iscsi_target_sk_state_change() just kicks
schedule_delayed_work(&conn->login_work, 0) and let's delayed workqueue
process context in iscsi_target_do_login_rx() handle the failure
detection and associated connection cleanup.

That said, I've been testing these particular two scenarios on v4.12-rc1
with the patch using settings you described above, and both cases are
working as expected in small scale vm-tests.

Also, the same patch has been backported to v4.1.y and the DATERA Q/A
team is currently putting it through the scale and error injection
regression test on physical hardware.

So please have a look (Hannes please review) and verify on your end it
addresses the original bug, and doesn't break anything else.

For Sagi + Varun, AFAICT this doesn't break anything wrt to iser-target
and cxgbit code, but if you can give it a quick spin on your setup it
would be appreciated.

--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 17249c3..498a7f2 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -43,6 +43,7 @@ 
 #include <linux/hrtimer.h>
 #include <linux/uuid.h>
 #include <linux/t10-pi.h>
+#include <linux/delay.h>
 
 #include <net/checksum.h>
 
@@ -110,6 +111,7 @@ 
 #define DEF_DIF 0
 #define DEF_DIX 0
 #define DEF_D_SENSE   0
+#define DEF_ABORT_SLEEP 0
 #define DEF_EVERY_NTH   0
 #define DEF_FAKE_RW	0
 #define DEF_GUARD 0
@@ -576,6 +578,7 @@  enum sdeb_opcode_index {
 static int sdebug_dif = DEF_DIF;
 static int sdebug_dix = DEF_DIX;
 static int sdebug_dsense = DEF_D_SENSE;
+static int sdebug_abort_sleep = DEF_ABORT_SLEEP;
 static int sdebug_every_nth = DEF_EVERY_NTH;
 static int sdebug_fake_rw = DEF_FAKE_RW;
 static unsigned int sdebug_guard = DEF_GUARD;
@@ -3782,6 +3785,16 @@  static int scsi_debug_abort(struct scsi_cmnd *SCpnt)
 {
 	bool ok;
 
+	if (sdebug_abort_sleep > 0) {
+		sdev_printk(KERN_INFO, SCpnt->device,
+			    "scsi_debug_abort sleeping for %d seconds\n",
+			    sdebug_abort_sleep);
+		ssleep(sdebug_abort_sleep);
+		sdev_printk(KERN_INFO, SCpnt->device,
+			    "scsi_debug_abort done sleeping for %d seconds\n",
+			    sdebug_abort_sleep);
+	}
+
 	++num_aborts;
 	if (SCpnt) {
 		ok = stop_queued_cmnd(SCpnt);
@@ -4148,6 +4161,7 @@  static int schedule_resp(struct scsi_cmnd *cmnd, struct sdebug_dev_info *devip,
 module_param_named(dif, sdebug_dif, int, S_IRUGO);
 module_param_named(dix, sdebug_dix, int, S_IRUGO);
 module_param_named(dsense, sdebug_dsense, int, S_IRUGO | S_IWUSR);
+module_param_named(abort_sleep, sdebug_abort_sleep, int, S_IRUGO | S_IWUSR);
 module_param_named(every_nth, sdebug_every_nth, int, S_IRUGO | S_IWUSR);
 module_param_named(fake_rw, sdebug_fake_rw, int, S_IRUGO | S_IWUSR);
 module_param_named(guard, sdebug_guard, uint, S_IRUGO);