diff mbox series

[1/4] cifs: Fix server re-repick on subrequest retry

Message ID 20240719140907.1598372-2-dhowells@redhat.com (mailing list archive)
State New
Headers show
Series cifs: Miscellaneous fixes and a trace point | expand

Commit Message

David Howells July 19, 2024, 2:09 p.m. UTC
When a subrequest is marked for needing retry, netfs will call
cifs_prepare_write() which will make cifs repick the server for the op
before renegotiating credits; it then calls cifs_issue_write() which
invokes smb2_async_writev() - which re-repicks the server.

If a different server is then selected, this causes the increment of
server->in_flight to happen against one record and the decrement to happen
against another, leading to misaccounting.

Fix this by just removing the repick code in smb2_async_writev().  As this
is only called from netfslib-driven code, cifs_prepare_write() should
always have been called first, and so server should never be NULL and the
preparatory step is repeated in the event that we do a retry.

The problem manifests as a warning looking something like:

 WARNING: CPU: 4 PID: 72896 at fs/smb/client/smb2ops.c:97 smb2_add_credits+0x3f0/0x9e0 [cifs]
 ...
 RIP: 0010:smb2_add_credits+0x3f0/0x9e0 [cifs]
 ...
  smb2_writev_callback+0x334/0x560 [cifs]
  cifs_demultiplex_thread+0x77a/0x11b0 [cifs]
  kthread+0x187/0x1d0
  ret_from_fork+0x34/0x60
  ret_from_fork_asm+0x1a/0x30

Which may be triggered by a number of different xfstests running against an
Azure server in multichannel mode.  generic/249 seems the most repeatable,
but generic/215, generic/249 and generic/308 may also show it.

Fixes: 3ee1a1fc3981 ("cifs: Cut over to using netfslib")
Reported-by: Steve French <sfrench@samba.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Aurelien Aptel <aaptel@suse.com>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
---
 fs/smb/client/smb2pdu.c | 3 ---
 1 file changed, 3 deletions(-)

Comments

Tom Talpey July 19, 2024, 2:27 p.m. UTC | #1
On 7/19/2024 10:09 AM, David Howells wrote:
> When a subrequest is marked for needing retry, netfs will call
> cifs_prepare_write() which will make cifs repick the server for the op
> before renegotiating credits; it then calls cifs_issue_write() which
> invokes smb2_async_writev() - which re-repicks the server.
> 
> If a different server is then selected, this causes the increment of
> server->in_flight to happen against one record and the decrement to happen
> against another, leading to misaccounting.
> 
> Fix this by just removing the repick code in smb2_async_writev().  As this
> is only called from netfslib-driven code, cifs_prepare_write() should
> always have been called first, and so server should never be NULL and the
> preparatory step is repeated in the event that we do a retry.
> 
> The problem manifests as a warning looking something like:
> 
>   WARNING: CPU: 4 PID: 72896 at fs/smb/client/smb2ops.c:97 smb2_add_credits+0x3f0/0x9e0 [cifs]
>   ...
>   RIP: 0010:smb2_add_credits+0x3f0/0x9e0 [cifs]
>   ...
>    smb2_writev_callback+0x334/0x560 [cifs]
>    cifs_demultiplex_thread+0x77a/0x11b0 [cifs]
>    kthread+0x187/0x1d0
>    ret_from_fork+0x34/0x60
>    ret_from_fork_asm+0x1a/0x30
> 
> Which may be triggered by a number of different xfstests running against an
> Azure server in multichannel mode.  generic/249 seems the most repeatable,
> but generic/215, generic/249 and generic/308 may also show it.

Nice fix, and good explanation. So, is this the negative-credits issue
we've been looking to fix? Or just one instance?

I'm very curious why it manifested when testing with Azure Files. Do
connection errors disappear with the fix, or do they still occur but
are now recoverable?

Feel free to add...

Acked-by: Tom Talpey <tom@talpey.com>

Tom.

> Fixes: 3ee1a1fc3981 ("cifs: Cut over to using netfslib")
> Reported-by: Steve French <sfrench@samba.org>
> Signed-off-by: David Howells <dhowells@redhat.com>
> Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
> cc: Jeff Layton <jlayton@kernel.org>
> cc: Aurelien Aptel <aaptel@suse.com>
> cc: linux-cifs@vger.kernel.org
> cc: netfs@lists.linux.dev
> cc: linux-fsdevel@vger.kernel.org
> ---
>   fs/smb/client/smb2pdu.c | 3 ---
>   1 file changed, 3 deletions(-)
> 
> diff --git a/fs/smb/client/smb2pdu.c b/fs/smb/client/smb2pdu.c
> index 2ae2dbb6202b..bb84a89e5905 100644
> --- a/fs/smb/client/smb2pdu.c
> +++ b/fs/smb/client/smb2pdu.c
> @@ -4859,9 +4859,6 @@ smb2_async_writev(struct cifs_io_subrequest *wdata)
>   	struct cifs_io_parms *io_parms = NULL;
>   	int credit_request;
>   
> -	if (!wdata->server || test_bit(NETFS_SREQ_RETRYING, &wdata->subreq.flags))
> -		server = wdata->server = cifs_pick_channel(tcon->ses);
> -
>   	/*
>   	 * in future we may get cifs_io_parms passed in from the caller,
>   	 * but for now we construct it here...
> 
> 
>
Steve French July 19, 2024, 3:45 p.m. UTC | #2
On Fri, Jul 19, 2024 at 9:27 AM Tom Talpey <tom@talpey.com> wrote:
>
> On 7/19/2024 10:09 AM, David Howells wrote:
> > When a subrequest is marked for needing retry, netfs will call
> > cifs_prepare_write() which will make cifs repick the server for the op
> > before renegotiating credits; it then calls cifs_issue_write() which
> > invokes smb2_async_writev() - which re-repicks the server.
> >
> > If a different server is then selected, this causes the increment of
> > server->in_flight to happen against one record and the decrement to happen
> > against another, leading to misaccounting.
> >
> > Fix this by just removing the repick code in smb2_async_writev().  As this
> > is only called from netfslib-driven code, cifs_prepare_write() should
> > always have been called first, and so server should never be NULL and the
> > preparatory step is repeated in the event that we do a retry.
> >
> > The problem manifests as a warning looking something like:
> >
> >   WARNING: CPU: 4 PID: 72896 at fs/smb/client/smb2ops.c:97 smb2_add_credits+0x3f0/0x9e0 [cifs]
> >   ...
> >   RIP: 0010:smb2_add_credits+0x3f0/0x9e0 [cifs]
> >   ...
> >    smb2_writev_callback+0x334/0x560 [cifs]
> >    cifs_demultiplex_thread+0x77a/0x11b0 [cifs]
> >    kthread+0x187/0x1d0
> >    ret_from_fork+0x34/0x60
> >    ret_from_fork_asm+0x1a/0x30
> >
> > Which may be triggered by a number of different xfstests running against an
> > Azure server in multichannel mode.  generic/249 seems the most repeatable,
> > but generic/215, generic/249 and generic/308 may also show it.
>
> Nice fix, and good explanation. So, is this the negative-credits issue
> we've been looking to fix? Or just one instance?

Yes it fixes the only recent crediting issues that I had been seeing.

> Feel free to add...
>
> Acked-by: Tom Talpey <tom@talpey.com>

Done
diff mbox series

Patch

diff --git a/fs/smb/client/smb2pdu.c b/fs/smb/client/smb2pdu.c
index 2ae2dbb6202b..bb84a89e5905 100644
--- a/fs/smb/client/smb2pdu.c
+++ b/fs/smb/client/smb2pdu.c
@@ -4859,9 +4859,6 @@  smb2_async_writev(struct cifs_io_subrequest *wdata)
 	struct cifs_io_parms *io_parms = NULL;
 	int credit_request;
 
-	if (!wdata->server || test_bit(NETFS_SREQ_RETRYING, &wdata->subreq.flags))
-		server = wdata->server = cifs_pick_channel(tcon->ses);
-
 	/*
 	 * in future we may get cifs_io_parms passed in from the caller,
 	 * but for now we construct it here...