diff mbox series

[v2,1/1] smb: client: Fix hang in smb2_reconnect

Message ID 20240409112758.467112-2-rickaran@axis.com (mailing list archive)
State New, archived
Headers show
Series smb client hang | expand

Commit Message

Rickard x Andersson April 9, 2024, 11:27 a.m. UTC
From: Rickard x Andersson <rickaran@axis.com>

Test case:
mount -t cifs //192.168.0.1/test y -o
  port=19999,ro,vers=2.1,sec=none,echo_interval=1
kill smbd with SIGSTOP
umount /tmp/y

Gives the following error:
 INFO: task umount:466 blocked for more than 122 seconds.
       Not tainted 6.8.2-axis9-devel #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
 message.
 task:umount state:D stack:0 pid:466 tgid:466 ppid:464 flags:0x00000004
  __schedule from schedule+0x34/0x108
  schedule from schedule_preempt_disabled+0x24/0x34
  schedule_preempt_disabled from __mutex_lock.constprop.0+0x330/0x8b0
  __mutex_lock.constprop.0 from smb2_reconnect+0x278/0x8fc [cifs]
  smb2_reconnect [cifs] from SMB2_open_init+0x54/0x9f4 [cifs]
  SMB2_open_init [cifs] from smb2_query_info_compound+0x1a0/0x500[cifs]
  smb2_query_info_compound [cifs] from smb2_queryfs+0x64/0x134 [cifs]
  smb2_queryfs [cifs] from cifs_statfs+0xc8/0x318 [cifs]
  cifs_statfs [cifs] from statfs_by_dentry+0x60/0x84
  statfs_by_dentry from fd_statfs+0x30/0x74
  fd_statfs from sys_fstatfs64+0x40/0x6c
  sys_fstatfs64 from ret_fast_syscall+0x0/0x54

The umount task is blocked waiting on the session mutex. The reason it
never gets the session mutex is because 'kworker/0:3' is holding the
mutex and is waiting for response (see line 1209 in
fs/smb/client/smb2pdu.c.

Stack trace of 'kworker/0:3' just before calling wait_for_response:
 CPU: 0 PID: 220 Comm: kworker/0:3 Not tainted 6.8.2-axis9-devel #1
 Hardware name: Freescale i.MX6 SoloX (Device Tree)
 Workqueue: cifsiod smb2_reconnect_server [cifs]
  unwind_backtrace from show_stack+0x18/0x1c
  show_stack from dump_stack_lvl+0x24/0x2c
  dump_stack_lvl from compound_send_recv+0x7bc/0xac8 [cifs]
  compound_send_recv [cifs] from cifs_send_recv+0x34/0x3c [cifs]
  cifs_send_recv [cifs] from SMB2_negotiate+0x410/0x13dc [cifs]
  SMB2_negotiate [cifs] from smb2_negotiate+0x4c/0x58 [cifs]
  smb2_negotiate [cifs] from cifs_negotiate_protocol+0x9c/0x100 [cifs]
  cifs_negotiate_protocol [cifs] from smb2_reconnect+0x418/0x8fc [cifs]
  smb2_reconnect [cifs] from smb2_reconnect_server+0x1dc/0x514 [cifs]
  smb2_reconnect_server [cifs] from process_one_work+0x188/0x3ec
  process_one_work from worker_thread+0x1fc/0x430
  worker_thread from kthread+0x110/0x130
  kthread from ret_from_fork+0x14/0x28

Change-Id: I53439ffb007c9c51d77ce40fb655a34e5ca291ec
Signed-off-by: Rickard x Andersson <rickaran@axis.com>
---
 fs/smb/client/transport.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

Comments

Steve French April 9, 2024, 5:19 p.m. UTC | #1
I tried this on 6.8-rc7 and current 6.9-rc3 and it didn't fail on either.

I tried (to Samba, and using the mount parms you suggested for
cifs.ko) with both "killall -s SIGSTOP smbd" and also with "killall
smbd" - but unmount worked fine in both cases.

Do you know if this fails to ksmbd or Samba or ...?  Other ideas for
how to repro this?

On Tue, Apr 9, 2024 at 6:29 AM Rickard Andersson <rickaran@axis.com> wrote:
>
> From: Rickard x Andersson <rickaran@axis.com>
>
> Test case:
> mount -t cifs //192.168.0.1/test y -o
>   port=19999,ro,vers=2.1,sec=none,echo_interval=1
> kill smbd with SIGSTOP
> umount /tmp/y
>
> Gives the following error:
>  INFO: task umount:466 blocked for more than 122 seconds.
>        Not tainted 6.8.2-axis9-devel #1
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
>  message.
>  task:umount state:D stack:0 pid:466 tgid:466 ppid:464 flags:0x00000004
>   __schedule from schedule+0x34/0x108
>   schedule from schedule_preempt_disabled+0x24/0x34
>   schedule_preempt_disabled from __mutex_lock.constprop.0+0x330/0x8b0
>   __mutex_lock.constprop.0 from smb2_reconnect+0x278/0x8fc [cifs]
>   smb2_reconnect [cifs] from SMB2_open_init+0x54/0x9f4 [cifs]
>   SMB2_open_init [cifs] from smb2_query_info_compound+0x1a0/0x500[cifs]
>   smb2_query_info_compound [cifs] from smb2_queryfs+0x64/0x134 [cifs]
>   smb2_queryfs [cifs] from cifs_statfs+0xc8/0x318 [cifs]
>   cifs_statfs [cifs] from statfs_by_dentry+0x60/0x84
>   statfs_by_dentry from fd_statfs+0x30/0x74
>   fd_statfs from sys_fstatfs64+0x40/0x6c
>   sys_fstatfs64 from ret_fast_syscall+0x0/0x54
>
> The umount task is blocked waiting on the session mutex. The reason it
> never gets the session mutex is because 'kworker/0:3' is holding the
> mutex and is waiting for response (see line 1209 in
> fs/smb/client/smb2pdu.c.
>
> Stack trace of 'kworker/0:3' just before calling wait_for_response:
>  CPU: 0 PID: 220 Comm: kworker/0:3 Not tainted 6.8.2-axis9-devel #1
>  Hardware name: Freescale i.MX6 SoloX (Device Tree)
>  Workqueue: cifsiod smb2_reconnect_server [cifs]
>   unwind_backtrace from show_stack+0x18/0x1c
>   show_stack from dump_stack_lvl+0x24/0x2c
>   dump_stack_lvl from compound_send_recv+0x7bc/0xac8 [cifs]
>   compound_send_recv [cifs] from cifs_send_recv+0x34/0x3c [cifs]
>   cifs_send_recv [cifs] from SMB2_negotiate+0x410/0x13dc [cifs]
>   SMB2_negotiate [cifs] from smb2_negotiate+0x4c/0x58 [cifs]
>   smb2_negotiate [cifs] from cifs_negotiate_protocol+0x9c/0x100 [cifs]
>   cifs_negotiate_protocol [cifs] from smb2_reconnect+0x418/0x8fc [cifs]
>   smb2_reconnect [cifs] from smb2_reconnect_server+0x1dc/0x514 [cifs]
>   smb2_reconnect_server [cifs] from process_one_work+0x188/0x3ec
>   process_one_work from worker_thread+0x1fc/0x430
>   worker_thread from kthread+0x110/0x130
>   kthread from ret_from_fork+0x14/0x28
>
> Change-Id: I53439ffb007c9c51d77ce40fb655a34e5ca291ec
> Signed-off-by: Rickard x Andersson <rickaran@axis.com>
> ---
>  fs/smb/client/transport.c | 15 +++++++++------
>  1 file changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/fs/smb/client/transport.c b/fs/smb/client/transport.c
> index 994d70193432..96b8f8757ddc 100644
> --- a/fs/smb/client/transport.c
> +++ b/fs/smb/client/transport.c
> @@ -32,6 +32,8 @@
>  /* Max number of iovectors we can use off the stack when sending requests. */
>  #define CIFS_MAX_IOV_SIZE 8
>
> +#define RESPONSE_TIMEOUT_SECS 50
> +
>  void
>  cifs_wake_up_task(struct mid_q_entry *mid)
>  {
> @@ -735,13 +737,14 @@ static int allocate_mid(struct cifs_ses *ses, struct smb_hdr *in_buf,
>  static int
>  wait_for_response(struct TCP_Server_Info *server, struct mid_q_entry *midQ)
>  {
> -       int error;
> +       int ret;
> +
> +       ret = wait_event_killable_timeout(server->response_q,
> +                                         midQ->mid_state != MID_REQUEST_SUBMITTED &&
> +                                         midQ->mid_state != MID_RESPONSE_RECEIVED,
> +                                         RESPONSE_TIMEOUT_SECS*HZ);
>
> -       error = wait_event_state(server->response_q,
> -                                midQ->mid_state != MID_REQUEST_SUBMITTED &&
> -                                midQ->mid_state != MID_RESPONSE_RECEIVED,
> -                                (TASK_KILLABLE|TASK_FREEZABLE_UNSAFE));
> -       if (error < 0)
> +       if ((ret < 0) || (ret == 0))
>                 return -ERESTARTSYS;
>
>         return 0;
> --
> 2.30.2
>
>
Steve French April 13, 2024, 9:02 a.m. UTC | #2
Shyam and I tried some experiments and there are two cases where the
patch breaks:
1) ChangeNotify will time out
2) Certainly byte range lock calls (they can be allowed to block) will timeout

An obvious alternative would be to not make this change for the
commands like ChangeNotify and blocking locks but allow it for the
others.

On Tue, Apr 9, 2024 at 6:29 AM Rickard Andersson <rickaran@axis.com> wrote:
>
> From: Rickard x Andersson <rickaran@axis.com>
>
> Test case:
> mount -t cifs //192.168.0.1/test y -o
>   port=19999,ro,vers=2.1,sec=none,echo_interval=1
> kill smbd with SIGSTOP
> umount /tmp/y
>
> Gives the following error:
>  INFO: task umount:466 blocked for more than 122 seconds.
>        Not tainted 6.8.2-axis9-devel #1
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
>  message.
>  task:umount state:D stack:0 pid:466 tgid:466 ppid:464 flags:0x00000004
>   __schedule from schedule+0x34/0x108
>   schedule from schedule_preempt_disabled+0x24/0x34
>   schedule_preempt_disabled from __mutex_lock.constprop.0+0x330/0x8b0
>   __mutex_lock.constprop.0 from smb2_reconnect+0x278/0x8fc [cifs]
>   smb2_reconnect [cifs] from SMB2_open_init+0x54/0x9f4 [cifs]
>   SMB2_open_init [cifs] from smb2_query_info_compound+0x1a0/0x500[cifs]
>   smb2_query_info_compound [cifs] from smb2_queryfs+0x64/0x134 [cifs]
>   smb2_queryfs [cifs] from cifs_statfs+0xc8/0x318 [cifs]
>   cifs_statfs [cifs] from statfs_by_dentry+0x60/0x84
>   statfs_by_dentry from fd_statfs+0x30/0x74
>   fd_statfs from sys_fstatfs64+0x40/0x6c
>   sys_fstatfs64 from ret_fast_syscall+0x0/0x54
>
> The umount task is blocked waiting on the session mutex. The reason it
> never gets the session mutex is because 'kworker/0:3' is holding the
> mutex and is waiting for response (see line 1209 in
> fs/smb/client/smb2pdu.c.
>
> Stack trace of 'kworker/0:3' just before calling wait_for_response:
>  CPU: 0 PID: 220 Comm: kworker/0:3 Not tainted 6.8.2-axis9-devel #1
>  Hardware name: Freescale i.MX6 SoloX (Device Tree)
>  Workqueue: cifsiod smb2_reconnect_server [cifs]
>   unwind_backtrace from show_stack+0x18/0x1c
>   show_stack from dump_stack_lvl+0x24/0x2c
>   dump_stack_lvl from compound_send_recv+0x7bc/0xac8 [cifs]
>   compound_send_recv [cifs] from cifs_send_recv+0x34/0x3c [cifs]
>   cifs_send_recv [cifs] from SMB2_negotiate+0x410/0x13dc [cifs]
>   SMB2_negotiate [cifs] from smb2_negotiate+0x4c/0x58 [cifs]
>   smb2_negotiate [cifs] from cifs_negotiate_protocol+0x9c/0x100 [cifs]
>   cifs_negotiate_protocol [cifs] from smb2_reconnect+0x418/0x8fc [cifs]
>   smb2_reconnect [cifs] from smb2_reconnect_server+0x1dc/0x514 [cifs]
>   smb2_reconnect_server [cifs] from process_one_work+0x188/0x3ec
>   process_one_work from worker_thread+0x1fc/0x430
>   worker_thread from kthread+0x110/0x130
>   kthread from ret_from_fork+0x14/0x28
>
> Change-Id: I53439ffb007c9c51d77ce40fb655a34e5ca291ec
> Signed-off-by: Rickard x Andersson <rickaran@axis.com>
> ---
>  fs/smb/client/transport.c | 15 +++++++++------
>  1 file changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/fs/smb/client/transport.c b/fs/smb/client/transport.c
> index 994d70193432..96b8f8757ddc 100644
> --- a/fs/smb/client/transport.c
> +++ b/fs/smb/client/transport.c
> @@ -32,6 +32,8 @@
>  /* Max number of iovectors we can use off the stack when sending requests. */
>  #define CIFS_MAX_IOV_SIZE 8
>
> +#define RESPONSE_TIMEOUT_SECS 50
> +
>  void
>  cifs_wake_up_task(struct mid_q_entry *mid)
>  {
> @@ -735,13 +737,14 @@ static int allocate_mid(struct cifs_ses *ses, struct smb_hdr *in_buf,
>  static int
>  wait_for_response(struct TCP_Server_Info *server, struct mid_q_entry *midQ)
>  {
> -       int error;
> +       int ret;
> +
> +       ret = wait_event_killable_timeout(server->response_q,
> +                                         midQ->mid_state != MID_REQUEST_SUBMITTED &&
> +                                         midQ->mid_state != MID_RESPONSE_RECEIVED,
> +                                         RESPONSE_TIMEOUT_SECS*HZ);
>
> -       error = wait_event_state(server->response_q,
> -                                midQ->mid_state != MID_REQUEST_SUBMITTED &&
> -                                midQ->mid_state != MID_RESPONSE_RECEIVED,
> -                                (TASK_KILLABLE|TASK_FREEZABLE_UNSAFE));
> -       if (error < 0)
> +       if ((ret < 0) || (ret == 0))
>                 return -ERESTARTSYS;
>
>         return 0;
> --
> 2.30.2
>
>
Rickard x Andersson April 15, 2024, 12:05 p.m. UTC | #3
On 4/13/24 21:40, Steve French wrote:
> Is there any way to verify if the same problem can be reproduced with 
> 6.9-rc3 or later


I have now reproduced the problem also on 6.9.0-rc3:


  INFO: task umount:511 blocked for more than 614 seconds.
        Not tainted 6.9.0-rc3-axis9-devel #1
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  task:umount          state:D stack:0     pid:511   tgid:511   ppid:509 
    flags:0x00000004
  Call trace:
   __schedule from schedule+0x34/0x14c
   schedule from schedule_preempt_disabled+0x24/0x34
  schedule_preempt_disabled from __mutex_lock.constprop.0+0x330/0x8b0
   __mutex_lock.constprop.0 from smb2_reconnect+0x114/0x8fc [cifs]
   smb2_reconnect [cifs] from SMB2_open_init+0x58/0xa5c [cifs]
   SMB2_open_init [cifs] from smb2_query_info_compound+0x1a4/0x50c [cifs]
   smb2_query_info_compound [cifs] from smb2_queryfs+0x64/0x134 [cifs]
   smb2_queryfs [cifs] from cifs_statfs+0xc8/0x318 [cifs]
   cifs_statfs [cifs] from statfs_by_dentry+0x60/0x84
   statfs_by_dentry from fd_statfs+0x30/0x74
   fd_statfs from sys_fstatfs64+0x40/0x6c
   sys_fstatfs64 from ret_fast_syscall+0x0/0x54
  Exception stack(0xc749ffa8 to 0xc749fff0)
  ffa0:                   01716218 00000000 00000003 00000058 beb62b90 
00000000
  ffc0: 01716218 00000000 00000003 0000010b b6fbd138 01716448 017162a4 
beb62b90
  ffe0: 0000010b beb62ad4 b6f0861f b6e87866


Best regards,
Rickard
Rickard x Andersson April 15, 2024, 2:35 p.m. UTC | #4
On 4/13/24 11:02, Steve French wrote:
> Shyam and I tried some experiments and there are two cases where the
> patch breaks:
> 1) ChangeNotify will time out
> 2) Certainly byte range lock calls (they can be allowed to block) will timeout
> 
> An obvious alternative would be to not make this change for the
> commands like ChangeNotify and blocking locks but allow it for the
> others.
> 

Would it make sense to make the patch less intrusive by doing something 
like the following?:

@@ -1209,7 +1216,12 @@ compound_send_recv(const unsigned int xid, struct 
cifs_ses *ses,
         spin_unlock(&ses->ses_lock);

         for (i = 0; i < num_rqst; i++) {
-               rc = wait_for_response(server, midQ[i]);
+               if (flags == CIFS_NEG_OP) {

+                       rc = wait_for_response(server, midQ[i], 50*1000);
+               }
+               else
+                       rc = wait_for_response(server, midQ[i], -1);
                 if (rc != 0)
                         break;



So, we are just waiting with timeout in the case it is a "CIFS_NEG_OP".

Note that I am not familiar at all with this code base.

Best regards,
Rickard
diff mbox series

Patch

diff --git a/fs/smb/client/transport.c b/fs/smb/client/transport.c
index 994d70193432..96b8f8757ddc 100644
--- a/fs/smb/client/transport.c
+++ b/fs/smb/client/transport.c
@@ -32,6 +32,8 @@ 
 /* Max number of iovectors we can use off the stack when sending requests. */
 #define CIFS_MAX_IOV_SIZE 8
 
+#define RESPONSE_TIMEOUT_SECS 50
+
 void
 cifs_wake_up_task(struct mid_q_entry *mid)
 {
@@ -735,13 +737,14 @@  static int allocate_mid(struct cifs_ses *ses, struct smb_hdr *in_buf,
 static int
 wait_for_response(struct TCP_Server_Info *server, struct mid_q_entry *midQ)
 {
-	int error;
+	int ret;
+
+	ret = wait_event_killable_timeout(server->response_q,
+					  midQ->mid_state != MID_REQUEST_SUBMITTED &&
+					  midQ->mid_state != MID_RESPONSE_RECEIVED,
+					  RESPONSE_TIMEOUT_SECS*HZ);
 
-	error = wait_event_state(server->response_q,
-				 midQ->mid_state != MID_REQUEST_SUBMITTED &&
-				 midQ->mid_state != MID_RESPONSE_RECEIVED,
-				 (TASK_KILLABLE|TASK_FREEZABLE_UNSAFE));
-	if (error < 0)
+	if ((ret < 0) || (ret == 0))
 		return -ERESTARTSYS;
 
 	return 0;