Message ID | 20240901014734.141366-1-pc@manguebit.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | smb: client: fix hang in wait_for_response() for negproto | expand |
when I tried this - it took much longer than expected to time out. How long do you expect mount to hang with your patch? On Sat, Aug 31, 2024 at 8:47 PM Paulo Alcantara <pc@manguebit.com> wrote: > > Call cifs_reconnect() to wake up processes waiting on negotiate > protocol to handle the case where server abruptly shut down and had no > chance to properly close the socket. > > Simple reproducer: > > ssh 192.168.2.100 pkill -STOP smbd > mount.cifs //192.168.2.100/test /mnt -o ... [never returns] > > Cc: Rickard Andersson <rickaran@axis.com> > Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> > --- > fs/smb/client/connect.c | 14 +++++++++++++- > 1 file changed, 13 insertions(+), 1 deletion(-) > > diff --git a/fs/smb/client/connect.c b/fs/smb/client/connect.c > index c1c14274930a..e004b515e321 100644 > --- a/fs/smb/client/connect.c > +++ b/fs/smb/client/connect.c > @@ -656,6 +656,19 @@ allocate_buffers(struct TCP_Server_Info *server) > static bool > server_unresponsive(struct TCP_Server_Info *server) > { > + /* > + * If we're in the process of mounting a share or reconnecting a session > + * and the server abruptly shut down (e.g. socket wasn't closed properly), > + * wait for at least an echo interval (+7s from rcvtimeo) when attempting > + * to negotiate protocol. > + */ > + spin_lock(&server->srv_lock); > + if (server->tcpStatus == CifsInNegotiate && > + time_after(jiffies, server->lstrp + server->echo_interval)) { > + spin_unlock(&server->srv_lock); > + cifs_reconnect(server, false); > + return true; > + } > /* > * We need to wait 3 echo intervals to make sure we handle such > * situations right: > @@ -667,7 +680,6 @@ server_unresponsive(struct TCP_Server_Info *server) > * 65s kernel_recvmsg times out, and we see that we haven't gotten > * a response in >60s. > */ > - spin_lock(&server->srv_lock); > if ((server->tcpStatus == CifsGood || > server->tcpStatus == CifsNeedNegotiate) && > (!server->ops->can_echo || server->ops->can_echo(server)) && > -- > 2.46.0 >
tentatively merged to cifs-2.6.git for-next pending testing and additional review I am a little worried that the timeout is too long though (for mount to fail in your example) On Sat, Aug 31, 2024 at 8:47 PM Paulo Alcantara <pc@manguebit.com> wrote: > > Call cifs_reconnect() to wake up processes waiting on negotiate > protocol to handle the case where server abruptly shut down and had no > chance to properly close the socket. > > Simple reproducer: > > ssh 192.168.2.100 pkill -STOP smbd > mount.cifs //192.168.2.100/test /mnt -o ... [never returns] > > Cc: Rickard Andersson <rickaran@axis.com> > Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> > --- > fs/smb/client/connect.c | 14 +++++++++++++- > 1 file changed, 13 insertions(+), 1 deletion(-) > > diff --git a/fs/smb/client/connect.c b/fs/smb/client/connect.c > index c1c14274930a..e004b515e321 100644 > --- a/fs/smb/client/connect.c > +++ b/fs/smb/client/connect.c > @@ -656,6 +656,19 @@ allocate_buffers(struct TCP_Server_Info *server) > static bool > server_unresponsive(struct TCP_Server_Info *server) > { > + /* > + * If we're in the process of mounting a share or reconnecting a session > + * and the server abruptly shut down (e.g. socket wasn't closed properly), > + * wait for at least an echo interval (+7s from rcvtimeo) when attempting > + * to negotiate protocol. > + */ > + spin_lock(&server->srv_lock); > + if (server->tcpStatus == CifsInNegotiate && > + time_after(jiffies, server->lstrp + server->echo_interval)) { > + spin_unlock(&server->srv_lock); > + cifs_reconnect(server, false); > + return true; > + } > /* > * We need to wait 3 echo intervals to make sure we handle such > * situations right: > @@ -667,7 +680,6 @@ server_unresponsive(struct TCP_Server_Info *server) > * 65s kernel_recvmsg times out, and we see that we haven't gotten > * a response in >60s. > */ > - spin_lock(&server->srv_lock); > if ((server->tcpStatus == CifsGood || > server->tcpStatus == CifsNeedNegotiate) && > (!server->ops->can_echo || server->ops->can_echo(server)) && > -- > 2.46.0 >
Steve French <smfrench@gmail.com> writes: > when I tried this - it took much longer than expected to time out. For me the mount process never returns as the server is no longer responding to any requests and demultiplex thread is unable to detect server's unresponsiviness, therefore leaving the process stuck in wait_for_response() by not marking the mid for retry and waking it up. # ps -e -o stat,pid,comm,cmd |grep ^D D+ 726 mount.cifs mount.cifs //192.168.2.100/scratch /mnt/1 -o vers=3.1.1,username=testuser,password=xyz,echo_interval=5 # cat /proc/726/stack [<0>] wait_for_response+0x147/0x1a0 [cifs] [<0>] compound_send_recv+0x6b4/0x1150 [cifs] [<0>] cifs_send_recv+0x23/0x30 [cifs] [<0>] SMB2_negotiate+0x8b0/0x1c10 [cifs] [<0>] smb2_negotiate+0x52/0x70 [cifs] [<0>] cifs_negotiate_protocol+0x129/0x1c0 [cifs] [<0>] cifs_get_smb_ses+0x807/0x1150 [cifs] [<0>] cifs_mount_get_session+0x8a/0x210 [cifs] [<0>] dfs_mount_share+0x281/0x1040 [cifs] [<0>] cifs_mount+0xbd/0x3d0 [cifs] [<0>] cifs_smb3_do_mount+0x1e2/0xc80 [cifs] [<0>] smb3_get_tree+0x1bf/0x330 [cifs] [<0>] vfs_get_tree+0x4a/0x160 [<0>] path_mount+0x3c1/0xfb0 [<0>] __x64_sys_mount+0x1a6/0x1e0 [<0>] do_syscall_64+0xbb/0x1d0 [<0>] entry_SYSCALL_64_after_hwframe+0x77/0x7f If this happens after mounting the share, either smb2_reconnect_server() or any subsequent I/O will end up stuck in smb2_reconnect() under cifs_ses::session_mutex attempting to reconnect the session. Next I/O will block indefinitely under cifs_ses::session_mutex as it will need to reconnect the session as well. > How long do you expect mount to hang with your patch? @echo_interval seconds.
On 08/31, Paulo Alcantara wrote: >Call cifs_reconnect() to wake up processes waiting on negotiate >protocol to handle the case where server abruptly shut down and had no >chance to properly close the socket. > >Simple reproducer: > > ssh 192.168.2.100 pkill -STOP smbd > mount.cifs //192.168.2.100/test /mnt -o ... [never returns] > >Cc: Rickard Andersson <rickaran@axis.com> >Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> >--- > fs/smb/client/connect.c | 14 +++++++++++++- > 1 file changed, 13 insertions(+), 1 deletion(-) > >diff --git a/fs/smb/client/connect.c b/fs/smb/client/connect.c >index c1c14274930a..e004b515e321 100644 >--- a/fs/smb/client/connect.c >+++ b/fs/smb/client/connect.c >@@ -656,6 +656,19 @@ allocate_buffers(struct TCP_Server_Info *server) > static bool > server_unresponsive(struct TCP_Server_Info *server) > { >+ /* >+ * If we're in the process of mounting a share or reconnecting a session >+ * and the server abruptly shut down (e.g. socket wasn't closed properly), >+ * wait for at least an echo interval (+7s from rcvtimeo) when attempting >+ * to negotiate protocol. >+ */ >+ spin_lock(&server->srv_lock); >+ if (server->tcpStatus == CifsInNegotiate && >+ time_after(jiffies, server->lstrp + server->echo_interval)) { >+ spin_unlock(&server->srv_lock); >+ cifs_reconnect(server, false); >+ return true; >+ } > /* > * We need to wait 3 echo intervals to make sure we handle such > * situations right: >@@ -667,7 +680,6 @@ server_unresponsive(struct TCP_Server_Info *server) > * 65s kernel_recvmsg times out, and we see that we haven't gotten > * a response in >60s. > */ >- spin_lock(&server->srv_lock); > if ((server->tcpStatus == CifsGood || > server->tcpStatus == CifsNeedNegotiate) && > (!server->ops->can_echo || server->ops->can_echo(server)) && Maybe, for extra precaution, also worth adding a check in smb2_reconnect() that could catch other cases: --- a/fs/smb/client/smb2pdu.c +++ b/fs/smb/client/smb2pdu.c @@ -278,6 +278,9 @@ smb2_reconnect(__le16 smb2_command, struct cifs_tcon *tcon, spin_unlock(&server->srv_lock); return -EAGAIN; } + } else if (server->tcpStatus == CifsInNegotiate) { + spin_unlock(&server->srv_lock); + return -EAGAIN; } /* if server is marked for termination, cifsd will cleanup */ FTR we hit this exact same bug a few months ago and fixed downstream with the above check, combined with using wait_event_timeout() in wait_for_response() (using @echo_interval seconds as well for timeout). Never sent upstream because it looked too much like a hack and the bug was only reproducible on our v5.14-based kernel. Anyway, HTH. Cheers, Enzo
diff --git a/fs/smb/client/connect.c b/fs/smb/client/connect.c index c1c14274930a..e004b515e321 100644 --- a/fs/smb/client/connect.c +++ b/fs/smb/client/connect.c @@ -656,6 +656,19 @@ allocate_buffers(struct TCP_Server_Info *server) static bool server_unresponsive(struct TCP_Server_Info *server) { + /* + * If we're in the process of mounting a share or reconnecting a session + * and the server abruptly shut down (e.g. socket wasn't closed properly), + * wait for at least an echo interval (+7s from rcvtimeo) when attempting + * to negotiate protocol. + */ + spin_lock(&server->srv_lock); + if (server->tcpStatus == CifsInNegotiate && + time_after(jiffies, server->lstrp + server->echo_interval)) { + spin_unlock(&server->srv_lock); + cifs_reconnect(server, false); + return true; + } /* * We need to wait 3 echo intervals to make sure we handle such * situations right: @@ -667,7 +680,6 @@ server_unresponsive(struct TCP_Server_Info *server) * 65s kernel_recvmsg times out, and we see that we haven't gotten * a response in >60s. */ - spin_lock(&server->srv_lock); if ((server->tcpStatus == CifsGood || server->tcpStatus == CifsNeedNegotiate) && (!server->ops->can_echo || server->ops->can_echo(server)) &&
Call cifs_reconnect() to wake up processes waiting on negotiate protocol to handle the case where server abruptly shut down and had no chance to properly close the socket. Simple reproducer: ssh 192.168.2.100 pkill -STOP smbd mount.cifs //192.168.2.100/test /mnt -o ... [never returns] Cc: Rickard Andersson <rickaran@axis.com> Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> --- fs/smb/client/connect.c | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-)