diff mbox series

smb: client: fix hang in wait_for_response() for negproto

Message ID 20240901014734.141366-1-pc@manguebit.com (mailing list archive)
State New, archived
Headers show
Series smb: client: fix hang in wait_for_response() for negproto | expand

Commit Message

Paulo Alcantara Sept. 1, 2024, 1:47 a.m. UTC
Call cifs_reconnect() to wake up processes waiting on negotiate
protocol to handle the case where server abruptly shut down and had no
chance to properly close the socket.

Simple reproducer:

  ssh 192.168.2.100 pkill -STOP smbd
  mount.cifs //192.168.2.100/test /mnt -o ... [never returns]

Cc: Rickard Andersson <rickaran@axis.com>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
---
 fs/smb/client/connect.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

Comments

Steve French Sept. 1, 2024, 3:42 a.m. UTC | #1
when I tried this - it took much longer than expected to time out.
How long do you expect mount to hang with your patch?

On Sat, Aug 31, 2024 at 8:47 PM Paulo Alcantara <pc@manguebit.com> wrote:
>
> Call cifs_reconnect() to wake up processes waiting on negotiate
> protocol to handle the case where server abruptly shut down and had no
> chance to properly close the socket.
>
> Simple reproducer:
>
>   ssh 192.168.2.100 pkill -STOP smbd
>   mount.cifs //192.168.2.100/test /mnt -o ... [never returns]
>
> Cc: Rickard Andersson <rickaran@axis.com>
> Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
> ---
>  fs/smb/client/connect.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/fs/smb/client/connect.c b/fs/smb/client/connect.c
> index c1c14274930a..e004b515e321 100644
> --- a/fs/smb/client/connect.c
> +++ b/fs/smb/client/connect.c
> @@ -656,6 +656,19 @@ allocate_buffers(struct TCP_Server_Info *server)
>  static bool
>  server_unresponsive(struct TCP_Server_Info *server)
>  {
> +       /*
> +        * If we're in the process of mounting a share or reconnecting a session
> +        * and the server abruptly shut down (e.g. socket wasn't closed properly),
> +        * wait for at least an echo interval (+7s from rcvtimeo) when attempting
> +        * to negotiate protocol.
> +        */
> +       spin_lock(&server->srv_lock);
> +       if (server->tcpStatus == CifsInNegotiate &&
> +           time_after(jiffies, server->lstrp + server->echo_interval)) {
> +               spin_unlock(&server->srv_lock);
> +               cifs_reconnect(server, false);
> +               return true;
> +       }
>         /*
>          * We need to wait 3 echo intervals to make sure we handle such
>          * situations right:
> @@ -667,7 +680,6 @@ server_unresponsive(struct TCP_Server_Info *server)
>          * 65s kernel_recvmsg times out, and we see that we haven't gotten
>          *     a response in >60s.
>          */
> -       spin_lock(&server->srv_lock);
>         if ((server->tcpStatus == CifsGood ||
>             server->tcpStatus == CifsNeedNegotiate) &&
>             (!server->ops->can_echo || server->ops->can_echo(server)) &&
> --
> 2.46.0
>
Steve French Sept. 1, 2024, 4:37 a.m. UTC | #2
tentatively merged to cifs-2.6.git for-next pending testing and
additional review

I am a little worried that the timeout is too long though (for mount
to fail in your example)

On Sat, Aug 31, 2024 at 8:47 PM Paulo Alcantara <pc@manguebit.com> wrote:
>
> Call cifs_reconnect() to wake up processes waiting on negotiate
> protocol to handle the case where server abruptly shut down and had no
> chance to properly close the socket.
>
> Simple reproducer:
>
>   ssh 192.168.2.100 pkill -STOP smbd
>   mount.cifs //192.168.2.100/test /mnt -o ... [never returns]
>
> Cc: Rickard Andersson <rickaran@axis.com>
> Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
> ---
>  fs/smb/client/connect.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/fs/smb/client/connect.c b/fs/smb/client/connect.c
> index c1c14274930a..e004b515e321 100644
> --- a/fs/smb/client/connect.c
> +++ b/fs/smb/client/connect.c
> @@ -656,6 +656,19 @@ allocate_buffers(struct TCP_Server_Info *server)
>  static bool
>  server_unresponsive(struct TCP_Server_Info *server)
>  {
> +       /*
> +        * If we're in the process of mounting a share or reconnecting a session
> +        * and the server abruptly shut down (e.g. socket wasn't closed properly),
> +        * wait for at least an echo interval (+7s from rcvtimeo) when attempting
> +        * to negotiate protocol.
> +        */
> +       spin_lock(&server->srv_lock);
> +       if (server->tcpStatus == CifsInNegotiate &&
> +           time_after(jiffies, server->lstrp + server->echo_interval)) {
> +               spin_unlock(&server->srv_lock);
> +               cifs_reconnect(server, false);
> +               return true;
> +       }
>         /*
>          * We need to wait 3 echo intervals to make sure we handle such
>          * situations right:
> @@ -667,7 +680,6 @@ server_unresponsive(struct TCP_Server_Info *server)
>          * 65s kernel_recvmsg times out, and we see that we haven't gotten
>          *     a response in >60s.
>          */
> -       spin_lock(&server->srv_lock);
>         if ((server->tcpStatus == CifsGood ||
>             server->tcpStatus == CifsNeedNegotiate) &&
>             (!server->ops->can_echo || server->ops->can_echo(server)) &&
> --
> 2.46.0
>
Paulo Alcantara Sept. 1, 2024, 6:16 p.m. UTC | #3
Steve French <smfrench@gmail.com> writes:

> when I tried this - it took much longer than expected to time out.

For me the mount process never returns as the server is no longer
responding to any requests and demultiplex thread is unable to detect
server's unresponsiviness, therefore leaving the process stuck in
wait_for_response() by not marking the mid for retry and waking it up.

# ps -e -o stat,pid,comm,cmd |grep ^D
D+       726 mount.cifs      mount.cifs //192.168.2.100/scratch /mnt/1 -o vers=3.1.1,username=testuser,password=xyz,echo_interval=5
# cat /proc/726/stack
[<0>] wait_for_response+0x147/0x1a0 [cifs]
[<0>] compound_send_recv+0x6b4/0x1150 [cifs]
[<0>] cifs_send_recv+0x23/0x30 [cifs]
[<0>] SMB2_negotiate+0x8b0/0x1c10 [cifs]
[<0>] smb2_negotiate+0x52/0x70 [cifs]
[<0>] cifs_negotiate_protocol+0x129/0x1c0 [cifs]
[<0>] cifs_get_smb_ses+0x807/0x1150 [cifs]
[<0>] cifs_mount_get_session+0x8a/0x210 [cifs]
[<0>] dfs_mount_share+0x281/0x1040 [cifs]
[<0>] cifs_mount+0xbd/0x3d0 [cifs]
[<0>] cifs_smb3_do_mount+0x1e2/0xc80 [cifs]
[<0>] smb3_get_tree+0x1bf/0x330 [cifs]
[<0>] vfs_get_tree+0x4a/0x160
[<0>] path_mount+0x3c1/0xfb0
[<0>] __x64_sys_mount+0x1a6/0x1e0
[<0>] do_syscall_64+0xbb/0x1d0
[<0>] entry_SYSCALL_64_after_hwframe+0x77/0x7f

If this happens after mounting the share, either smb2_reconnect_server()
or any subsequent I/O will end up stuck in smb2_reconnect() under
cifs_ses::session_mutex attempting to reconnect the session.  Next I/O
will block indefinitely under cifs_ses::session_mutex as it will need to
reconnect the session as well.

> How long do you expect mount to hang with your patch?

@echo_interval seconds.
Enzo Matsumiya Sept. 6, 2024, 8:37 p.m. UTC | #4
On 08/31, Paulo Alcantara wrote:
>Call cifs_reconnect() to wake up processes waiting on negotiate
>protocol to handle the case where server abruptly shut down and had no
>chance to properly close the socket.
>
>Simple reproducer:
>
>  ssh 192.168.2.100 pkill -STOP smbd
>  mount.cifs //192.168.2.100/test /mnt -o ... [never returns]
>
>Cc: Rickard Andersson <rickaran@axis.com>
>Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
>---
> fs/smb/client/connect.c | 14 +++++++++++++-
> 1 file changed, 13 insertions(+), 1 deletion(-)
>
>diff --git a/fs/smb/client/connect.c b/fs/smb/client/connect.c
>index c1c14274930a..e004b515e321 100644
>--- a/fs/smb/client/connect.c
>+++ b/fs/smb/client/connect.c
>@@ -656,6 +656,19 @@ allocate_buffers(struct TCP_Server_Info *server)
> static bool
> server_unresponsive(struct TCP_Server_Info *server)
> {
>+	/*
>+	 * If we're in the process of mounting a share or reconnecting a session
>+	 * and the server abruptly shut down (e.g. socket wasn't closed properly),
>+	 * wait for at least an echo interval (+7s from rcvtimeo) when attempting
>+	 * to negotiate protocol.
>+	 */
>+	spin_lock(&server->srv_lock);
>+	if (server->tcpStatus == CifsInNegotiate &&
>+	    time_after(jiffies, server->lstrp + server->echo_interval)) {
>+		spin_unlock(&server->srv_lock);
>+		cifs_reconnect(server, false);
>+		return true;
>+	}
> 	/*
> 	 * We need to wait 3 echo intervals to make sure we handle such
> 	 * situations right:
>@@ -667,7 +680,6 @@ server_unresponsive(struct TCP_Server_Info *server)
> 	 * 65s kernel_recvmsg times out, and we see that we haven't gotten
> 	 *     a response in >60s.
> 	 */
>-	spin_lock(&server->srv_lock);
> 	if ((server->tcpStatus == CifsGood ||
> 	    server->tcpStatus == CifsNeedNegotiate) &&
> 	    (!server->ops->can_echo || server->ops->can_echo(server)) &&

Maybe, for extra precaution, also worth adding a check in
smb2_reconnect() that could catch other cases:

--- a/fs/smb/client/smb2pdu.c
+++ b/fs/smb/client/smb2pdu.c
@@ -278,6 +278,9 @@ smb2_reconnect(__le16 smb2_command, struct cifs_tcon *tcon,
                         spin_unlock(&server->srv_lock);
                         return -EAGAIN;
                 }
+       } else if (server->tcpStatus == CifsInNegotiate) {
+               spin_unlock(&server->srv_lock);
+               return -EAGAIN;
         }

         /* if server is marked for termination, cifsd will cleanup */


FTR we hit this exact same bug a few months ago and fixed
downstream with the above check, combined with using
wait_event_timeout() in wait_for_response() (using @echo_interval
seconds as well for timeout).

Never sent upstream because it looked too much like a hack and the bug
was only reproducible on our v5.14-based kernel.

Anyway, HTH.


Cheers,

Enzo
diff mbox series

Patch

diff --git a/fs/smb/client/connect.c b/fs/smb/client/connect.c
index c1c14274930a..e004b515e321 100644
--- a/fs/smb/client/connect.c
+++ b/fs/smb/client/connect.c
@@ -656,6 +656,19 @@  allocate_buffers(struct TCP_Server_Info *server)
 static bool
 server_unresponsive(struct TCP_Server_Info *server)
 {
+	/*
+	 * If we're in the process of mounting a share or reconnecting a session
+	 * and the server abruptly shut down (e.g. socket wasn't closed properly),
+	 * wait for at least an echo interval (+7s from rcvtimeo) when attempting
+	 * to negotiate protocol.
+	 */
+	spin_lock(&server->srv_lock);
+	if (server->tcpStatus == CifsInNegotiate &&
+	    time_after(jiffies, server->lstrp + server->echo_interval)) {
+		spin_unlock(&server->srv_lock);
+		cifs_reconnect(server, false);
+		return true;
+	}
 	/*
 	 * We need to wait 3 echo intervals to make sure we handle such
 	 * situations right:
@@ -667,7 +680,6 @@  server_unresponsive(struct TCP_Server_Info *server)
 	 * 65s kernel_recvmsg times out, and we see that we haven't gotten
 	 *     a response in >60s.
 	 */
-	spin_lock(&server->srv_lock);
 	if ((server->tcpStatus == CifsGood ||
 	    server->tcpStatus == CifsNeedNegotiate) &&
 	    (!server->ops->can_echo || server->ops->can_echo(server)) &&