diff mbox series

[5/6] net/rds: rds_tcp_accept_one ought to not discard messages

Message ID 20250227042638.82553-6-allison.henderson@oracle.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series RDS bug fix collection | expand

Checks

Context Check Description
netdev/series_format warning Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next, async
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers warning 6 maintainers not CCed: linux-rdma@vger.kernel.org rds-devel@oss.oracle.com kuba@kernel.org horms@kernel.org edumazet@google.com pabeni@redhat.com
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 12 this patch: 12
netdev/checkpatch warning WARNING: line length of 81 exceeds 80 columns WARNING: line length of 82 exceeds 80 columns WARNING: line length of 83 exceeds 80 columns WARNING: line length of 84 exceeds 80 columns WARNING: line length of 87 exceeds 80 columns WARNING: memory barrier without comment WARNING: struct ctl_table should normally be const
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc fail Errors and warnings before: 30 this patch: 31
netdev/source_inline success Was 0 now: 0

Commit Message

Allison Henderson Feb. 27, 2025, 4:26 a.m. UTC
From: Gerd Rausch <gerd.rausch@oracle.com>

RDS/TCP differs from RDS/RDMA in that message acknowledgment
is done based on TCP sequence numbers:
As soon as the last byte of a message has been acknowledged
by the TCP stack of a peer, "rds_tcp_write_space()" goes on
to discard prior messages from the send queue.

That is fine, as long as the receiver never throws any messages away.

Unfortunately, that is *not* the case since the introduction of MPRDS:
commit "RDS: TCP: Enable multipath RDS for TCP"

A new function, "rds_tcp_accept_one_path", was introduced,
which may return "NULL" if no connection path is currently
available.

Unfortunately, this happens after the "->accept()" call, and the new socket
often already contains messages, since the peer has already transitioned
to "RDS_CONN_UP" in response to "TCP_ESTABLISHED".

That is still the case after this [1]:
commit "RDS: TCP: Force every connection to be initiated by numerically
smaller IP address"

which tried to address the situation of pending data by only transitioning
connections from a smaller IP address to "RDS_CONN_UP".

But even in those cases, in particular when the "RDS_EXTHDR_NPATHS"
handshake has not yet occurred and we are therefore working with
"c_npaths <= 1", "c_conn[0]" may be in a state other than
"RDS_CONN_DOWN", in which case all messages on the just-accepted socket
are tossed away.

This fix changes "rds_tcp_accept_one":

* If connected from a peer with a larger IP address, the new socket
  will continue to get closed right away.
  With commit [1] above, there should not be any messages
  in the socket receive buffer, since the peer never transitioned
  to "RDS_CONN_UP".
  Therefore it should be okay not to make any effort to dispatch
  the socket receive buffer.

* If connected from a peer with a smaller IP address,
  we call "rds_tcp_accept_one_path" to find a free slot/"path".
  If found, business goes on as usual.
  If none was found, we save/stash the newly accepted socket
  into "rds_tcp_accepted_sock", in order to not lose any
  messages that may have arrived already.
  We then return from "rds_tcp_accept_one" with "-ENOBUFS".
  Later on, when a slot/"path" becomes available again
  (e.g. a path transitioned to "RDS_CONN_DOWN",
   or an HS extension header was received with "c_npaths > 1"),
  we call "rds_tcp_conn_slots_available", which simply re-issues
  the "rds_tcp_accept_one" worker callback and picks
  up the new socket from "rds_tcp_accepted_sock", thereby
  continuing where it left off with "-ENOBUFS" last time.
  Since a new slot has become available, those messages
  are not lost: processing proceeds as if the slot
  had been available the first time around.

Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
---
 net/rds/rds.h        |   1 +
 net/rds/recv.c       |   4 ++
 net/rds/tcp.c        |  27 +++-----
 net/rds/tcp.h        |  22 +++++-
 net/rds/tcp_listen.c | 160 ++++++++++++++++++++++++++++++-------------
 5 files changed, 148 insertions(+), 66 deletions(-)

Comments

Jakub Kicinski March 1, 2025, 12:21 a.m. UTC | #1
On Wed, 26 Feb 2025 21:26:37 -0700 allison.henderson@oracle.com wrote:
> diff --git a/net/rds/rds.h b/net/rds/rds.h
> index 85b47ce52266..422d5e26410e 100644
> --- a/net/rds/rds.h
> +++ b/net/rds/rds.h
> @@ -548,6 +548,7 @@ struct rds_transport {
>  			   __u32 scope_id);
>  	int (*conn_alloc)(struct rds_connection *conn, gfp_t gfp);
>  	void (*conn_free)(void *data);
> +	void (*conn_slots_available)(struct rds_connection *conn);
>  	int (*conn_path_connect)(struct rds_conn_path *cp);
>  	void (*conn_path_shutdown)(struct rds_conn_path *conn);
>  	void (*xmit_path_prepare)(struct rds_conn_path *cp);

This struct has a kdoc, you need to document the new member.
Or make the comment not a kdoc, if full documentation isn't necessary.
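A minimal way to satisfy this (the wording below is only a suggestion, not from the patch) is to add the new member to the struct's kernel-doc block:

```c
/**
 * struct rds_transport - transport-specific operations
 * ...
 * @conn_slots_available: called when a connection path slot becomes
 *	available again, so the transport can resume processing a
 *	previously stashed accepted socket
 * ...
 */
```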
kernel test robot March 1, 2025, 11:22 p.m. UTC | #2
Hi,

kernel test robot noticed the following build warnings:

[auto build test WARNING on net/main]
[also build test WARNING on net-next/main linus/master v6.14-rc4 next-20250228]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/allison-henderson-oracle-com/net-rds-Avoid-queuing-superfluous-send-and-recv-work/20250227-123038
base:   net/main
patch link:    https://lore.kernel.org/r/20250227042638.82553-6-allison.henderson%40oracle.com
patch subject: [PATCH 5/6] net/rds: rds_tcp_accept_one ought to not discard messages
config: riscv-randconfig-002-20250302 (https://download.01.org/0day-ci/archive/20250302/202503020739.xWWyG7zO-lkp@intel.com/config)
compiler: clang version 16.0.6 (https://github.com/llvm/llvm-project 7cbf1a2591520c2491aa35339f227775f4d3adf6)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250302/202503020739.xWWyG7zO-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202503020739.xWWyG7zO-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> net/rds/tcp_listen.c:137:6: warning: variable 'inet' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
           if (!new_sock) {
               ^~~~~~~~~
   net/rds/tcp_listen.c:179:19: note: uninitialized use occurs here
                    my_addr, ntohs(inet->inet_sport),
                                   ^~~~
   include/linux/byteorder/generic.h:142:27: note: expanded from macro 'ntohs'
   #define ntohs(x) ___ntohs(x)
                             ^
   include/linux/byteorder/generic.h:137:35: note: expanded from macro '___ntohs'
   #define ___ntohs(x) __be16_to_cpu(x)
                                     ^
   include/uapi/linux/byteorder/little_endian.h:43:59: note: expanded from macro '__be16_to_cpu'
   #define __be16_to_cpu(x) __swab16((__force __u16)(__be16)(x))
                                                             ^
   include/uapi/linux/swab.h:105:31: note: expanded from macro '__swab16'
           (__u16)(__builtin_constant_p(x) ?       \
                                        ^
   net/rds/tcp_listen.c:137:2: note: remove the 'if' if its condition is always true
           if (!new_sock) {
           ^~~~~~~~~~~~~~~
   net/rds/tcp_listen.c:115:24: note: initialize the variable 'inet' to silence this warning
           struct inet_sock *inet;
                                 ^
                                  = NULL
   1 warning generated.


vim +137 net/rds/tcp_listen.c

   108	
   109	int rds_tcp_accept_one(struct rds_tcp_net *rtn)
   110	{
   111		struct socket *listen_sock = rtn->rds_tcp_listen_sock;
   112		struct socket *new_sock = NULL;
   113		struct rds_connection *conn;
   114		int ret;
   115		struct inet_sock *inet;
   116		struct rds_tcp_connection *rs_tcp = NULL;
   117		int conn_state;
   118		struct rds_conn_path *cp;
   119		struct in6_addr *my_addr, *peer_addr;
   120		struct proto_accept_arg arg = {
   121			.flags = O_NONBLOCK,
   122			.kern = true,
   123		};
   124	#if !IS_ENABLED(CONFIG_IPV6)
   125		struct in6_addr saddr, daddr;
   126	#endif
   127		int dev_if = 0;
   128	
   129		mutex_lock(&rtn->rds_tcp_accept_lock);
   130	
   131		if (!listen_sock)
   132			return -ENETUNREACH;
   133	
   134		new_sock = rtn->rds_tcp_accepted_sock;
   135		rtn->rds_tcp_accepted_sock = NULL;
   136	
 > 137		if (!new_sock) {
   138			ret = sock_create_lite(listen_sock->sk->sk_family,
   139					       listen_sock->sk->sk_type,
   140					       listen_sock->sk->sk_protocol,
   141					       &new_sock);
   142			if (ret)
   143				goto out;
   144	
   145			ret = listen_sock->ops->accept(listen_sock, new_sock, &arg);
   146			if (ret < 0)
   147				goto out;
   148	
   149			/* sock_create_lite() does not get a hold on the owner module so we
   150			 * need to do it here.  Note that sock_release() uses sock->ops to
   151			 * determine if it needs to decrement the reference count.  So set
   152			 * sock->ops after calling accept() in case that fails.  And there's
   153			 * no need to do try_module_get() as the listener should have a hold
   154			 * already.
   155			 */
   156			new_sock->ops = listen_sock->ops;
   157			__module_get(new_sock->ops->owner);
   158	
   159			rds_tcp_keepalive(new_sock);
   160			if (!rds_tcp_tune(new_sock)) {
   161				ret = -EINVAL;
   162				goto out;
   163			}
   164	
   165			inet = inet_sk(new_sock->sk);
   166		}
   167
Dan Carpenter March 4, 2025, 10:28 a.m. UTC | #3
Hi,

kernel test robot noticed the following build warnings:

https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/allison-henderson-oracle-com/net-rds-Avoid-queuing-superfluous-send-and-recv-work/20250227-123038
base:   net/main
patch link:    https://lore.kernel.org/r/20250227042638.82553-6-allison.henderson%40oracle.com
patch subject: [PATCH 5/6] net/rds: rds_tcp_accept_one ought to not discard messages
config: x86_64-randconfig-161-20250304 (https://download.01.org/0day-ci/archive/20250304/202503041749.YwkRic8W-lkp@intel.com/config)
compiler: clang version 19.1.7 (https://github.com/llvm/llvm-project cd708029e0b2869e80abe31ddb175f7c35361f90)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
| Closes: https://lore.kernel.org/r/202503041749.YwkRic8W-lkp@intel.com/

smatch warnings:
net/rds/tcp_listen.c:295 rds_tcp_accept_one() warn: inconsistent returns '&rtn->rds_tcp_accept_lock'.

vim +/new_sock +156 net/rds/tcp_listen.c

112b9a7a012048 Gerd Rausch       2025-02-26  109  int rds_tcp_accept_one(struct rds_tcp_net *rtn)
70041088e3b976 Andy Grover       2009-08-21  110  {
112b9a7a012048 Gerd Rausch       2025-02-26  111  	struct socket *listen_sock = rtn->rds_tcp_listen_sock;
70041088e3b976 Andy Grover       2009-08-21  112  	struct socket *new_sock = NULL;
70041088e3b976 Andy Grover       2009-08-21  113  	struct rds_connection *conn;
70041088e3b976 Andy Grover       2009-08-21  114  	int ret;
70041088e3b976 Andy Grover       2009-08-21  115  	struct inet_sock *inet;
bd7c5f983f3185 Sowmini Varadhan  2016-05-02  116  	struct rds_tcp_connection *rs_tcp = NULL;
bd7c5f983f3185 Sowmini Varadhan  2016-05-02  117  	int conn_state;
ea3b1ea5393087 Sowmini Varadhan  2016-06-30  118  	struct rds_conn_path *cp;
1e2b44e78eead7 Ka-Cheong Poon    2018-07-23  119  	struct in6_addr *my_addr, *peer_addr;
92ef0fd55ac80d Jens Axboe        2024-05-09  120  	struct proto_accept_arg arg = {
92ef0fd55ac80d Jens Axboe        2024-05-09  121  		.flags = O_NONBLOCK,
92ef0fd55ac80d Jens Axboe        2024-05-09  122  		.kern = true,
92ef0fd55ac80d Jens Axboe        2024-05-09  123  	};
e65d4d96334e3f Ka-Cheong Poon    2018-07-30  124  #if !IS_ENABLED(CONFIG_IPV6)
e65d4d96334e3f Ka-Cheong Poon    2018-07-30  125  	struct in6_addr saddr, daddr;
e65d4d96334e3f Ka-Cheong Poon    2018-07-30  126  #endif
e65d4d96334e3f Ka-Cheong Poon    2018-07-30  127  	int dev_if = 0;
70041088e3b976 Andy Grover       2009-08-21  128  
112b9a7a012048 Gerd Rausch       2025-02-26  129  	mutex_lock(&rtn->rds_tcp_accept_lock);
                                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

112b9a7a012048 Gerd Rausch       2025-02-26  130  
112b9a7a012048 Gerd Rausch       2025-02-26  131  	if (!listen_sock)
37e14f4fe2991f Sowmini Varadhan  2016-05-18  132  		return -ENETUNREACH;
                                                                ^^^^^^^^^^^^^^^^^^^^
Do this before taking the lock?

37e14f4fe2991f Sowmini Varadhan  2016-05-18  133  
112b9a7a012048 Gerd Rausch       2025-02-26  134  	new_sock = rtn->rds_tcp_accepted_sock;
112b9a7a012048 Gerd Rausch       2025-02-26  135  	rtn->rds_tcp_accepted_sock = NULL;
112b9a7a012048 Gerd Rausch       2025-02-26  136  
112b9a7a012048 Gerd Rausch       2025-02-26 @137  	if (!new_sock) {
112b9a7a012048 Gerd Rausch       2025-02-26  138  		ret = sock_create_lite(listen_sock->sk->sk_family,
112b9a7a012048 Gerd Rausch       2025-02-26  139  				       listen_sock->sk->sk_type,
112b9a7a012048 Gerd Rausch       2025-02-26  140  				       listen_sock->sk->sk_protocol,
d5a8ac28a7ff2f Sowmini Varadhan  2015-08-05  141  				       &new_sock);
70041088e3b976 Andy Grover       2009-08-21  142  		if (ret)
70041088e3b976 Andy Grover       2009-08-21  143  			goto out;
70041088e3b976 Andy Grover       2009-08-21  144  
112b9a7a012048 Gerd Rausch       2025-02-26  145  		ret = listen_sock->ops->accept(listen_sock, new_sock, &arg);
70041088e3b976 Andy Grover       2009-08-21  146  		if (ret < 0)
70041088e3b976 Andy Grover       2009-08-21  147  			goto out;
70041088e3b976 Andy Grover       2009-08-21  148  
84eef2b2187ed7 Ka-Cheong Poon    2018-03-01  149  		/* sock_create_lite() does not get a hold on the owner module so we
84eef2b2187ed7 Ka-Cheong Poon    2018-03-01  150  		 * need to do it here.  Note that sock_release() uses sock->ops to
84eef2b2187ed7 Ka-Cheong Poon    2018-03-01  151  		 * determine if it needs to decrement the reference count.  So set
84eef2b2187ed7 Ka-Cheong Poon    2018-03-01  152  		 * sock->ops after calling accept() in case that fails.  And there's
84eef2b2187ed7 Ka-Cheong Poon    2018-03-01  153  		 * no need to do try_module_get() as the listener should have a hold
84eef2b2187ed7 Ka-Cheong Poon    2018-03-01  154  		 * already.
84eef2b2187ed7 Ka-Cheong Poon    2018-03-01  155  		 */
112b9a7a012048 Gerd Rausch       2025-02-26 @156  		new_sock->ops = listen_sock->ops;
84eef2b2187ed7 Ka-Cheong Poon    2018-03-01  157  		__module_get(new_sock->ops->owner);
84eef2b2187ed7 Ka-Cheong Poon    2018-03-01  158  
480aeb9639d6a0 Christoph Hellwig 2020-05-28  159  		rds_tcp_keepalive(new_sock);
6997fbd7a3dafa Tetsuo Handa      2022-05-05  160  		if (!rds_tcp_tune(new_sock)) {
6997fbd7a3dafa Tetsuo Handa      2022-05-05  161  			ret = -EINVAL;
6997fbd7a3dafa Tetsuo Handa      2022-05-05  162  			goto out;
6997fbd7a3dafa Tetsuo Handa      2022-05-05  163  		}
70041088e3b976 Andy Grover       2009-08-21  164  
70041088e3b976 Andy Grover       2009-08-21  165  		inet = inet_sk(new_sock->sk);
112b9a7a012048 Gerd Rausch       2025-02-26  166  	}
70041088e3b976 Andy Grover       2009-08-21  167  
e65d4d96334e3f Ka-Cheong Poon    2018-07-30  168  #if IS_ENABLED(CONFIG_IPV6)
1e2b44e78eead7 Ka-Cheong Poon    2018-07-23  169  	my_addr = &new_sock->sk->sk_v6_rcv_saddr;
1e2b44e78eead7 Ka-Cheong Poon    2018-07-23  170  	peer_addr = &new_sock->sk->sk_v6_daddr;
e65d4d96334e3f Ka-Cheong Poon    2018-07-30  171  #else
e65d4d96334e3f Ka-Cheong Poon    2018-07-30  172  	ipv6_addr_set_v4mapped(inet->inet_saddr, &saddr);
e65d4d96334e3f Ka-Cheong Poon    2018-07-30  173  	ipv6_addr_set_v4mapped(inet->inet_daddr, &daddr);
e65d4d96334e3f Ka-Cheong Poon    2018-07-30  174  	my_addr = &saddr;
e65d4d96334e3f Ka-Cheong Poon    2018-07-30  175  	peer_addr = &daddr;
e65d4d96334e3f Ka-Cheong Poon    2018-07-30  176  #endif
e65d4d96334e3f Ka-Cheong Poon    2018-07-30  177  	rdsdebug("accepted family %d tcp %pI6c:%u -> %pI6c:%u\n",
112b9a7a012048 Gerd Rausch       2025-02-26  178  		 listen_sock->sk->sk_family,
1e2b44e78eead7 Ka-Cheong Poon    2018-07-23  179  		 my_addr, ntohs(inet->inet_sport),
1e2b44e78eead7 Ka-Cheong Poon    2018-07-23  180  		 peer_addr, ntohs(inet->inet_dport));
70041088e3b976 Andy Grover       2009-08-21  181  
e65d4d96334e3f Ka-Cheong Poon    2018-07-30  182  #if IS_ENABLED(CONFIG_IPV6)
1e2b44e78eead7 Ka-Cheong Poon    2018-07-23  183  	/* sk_bound_dev_if is not set if the peer address is not link local
1e2b44e78eead7 Ka-Cheong Poon    2018-07-23  184  	 * address.  In this case, it happens that mcast_oif is set.  So
1e2b44e78eead7 Ka-Cheong Poon    2018-07-23  185  	 * just use it.
1e2b44e78eead7 Ka-Cheong Poon    2018-07-23  186  	 */
1e2b44e78eead7 Ka-Cheong Poon    2018-07-23  187  	if ((ipv6_addr_type(my_addr) & IPV6_ADDR_LINKLOCAL) &&
1e2b44e78eead7 Ka-Cheong Poon    2018-07-23  188  	    !(ipv6_addr_type(peer_addr) & IPV6_ADDR_LINKLOCAL)) {
1e2b44e78eead7 Ka-Cheong Poon    2018-07-23  189  		struct ipv6_pinfo *inet6;
1e2b44e78eead7 Ka-Cheong Poon    2018-07-23  190  
1e2b44e78eead7 Ka-Cheong Poon    2018-07-23  191  		inet6 = inet6_sk(new_sock->sk);
d2f011a0bf28c0 Eric Dumazet      2023-12-08  192  		dev_if = READ_ONCE(inet6->mcast_oif);
1e2b44e78eead7 Ka-Cheong Poon    2018-07-23  193  	} else {
1e2b44e78eead7 Ka-Cheong Poon    2018-07-23  194  		dev_if = new_sock->sk->sk_bound_dev_if;
1e2b44e78eead7 Ka-Cheong Poon    2018-07-23  195  	}
e65d4d96334e3f Ka-Cheong Poon    2018-07-30  196  #endif
e65d4d96334e3f Ka-Cheong Poon    2018-07-30  197  
112b9a7a012048 Gerd Rausch       2025-02-26  198  	if (!rds_tcp_laddr_check(sock_net(listen_sock->sk), peer_addr, dev_if)) {
aced3ce57cd37b Rao Shoaib        2021-05-21  199  		/* local address connection is only allowed via loopback */
aced3ce57cd37b Rao Shoaib        2021-05-21  200  		ret = -EOPNOTSUPP;
aced3ce57cd37b Rao Shoaib        2021-05-21  201  		goto out;
aced3ce57cd37b Rao Shoaib        2021-05-21  202  	}
aced3ce57cd37b Rao Shoaib        2021-05-21  203  
112b9a7a012048 Gerd Rausch       2025-02-26  204  	conn = rds_conn_create(sock_net(listen_sock->sk),
e65d4d96334e3f Ka-Cheong Poon    2018-07-30  205  			       my_addr, peer_addr,
3eb450367d0823 Santosh Shilimkar 2018-10-23  206  			       &rds_tcp_transport, 0, GFP_KERNEL, dev_if);
eee2fa6ab32251 Ka-Cheong Poon    2018-07-23  207  
70041088e3b976 Andy Grover       2009-08-21  208  	if (IS_ERR(conn)) {
70041088e3b976 Andy Grover       2009-08-21  209  		ret = PTR_ERR(conn);
70041088e3b976 Andy Grover       2009-08-21  210  		goto out;
70041088e3b976 Andy Grover       2009-08-21  211  	}
f711a6ae062cae Sowmini Varadhan  2015-05-05  212  	/* An incoming SYN request came in, and TCP just accepted it.
f711a6ae062cae Sowmini Varadhan  2015-05-05  213  	 *
f711a6ae062cae Sowmini Varadhan  2015-05-05  214  	 * If the client reboots, this conn will need to be cleaned up.
f711a6ae062cae Sowmini Varadhan  2015-05-05  215  	 * rds_tcp_state_change() will do that cleanup
f711a6ae062cae Sowmini Varadhan  2015-05-05  216  	 */
112b9a7a012048 Gerd Rausch       2025-02-26  217  	if (rds_addr_cmp(&conn->c_faddr, &conn->c_laddr) < 0) {
112b9a7a012048 Gerd Rausch       2025-02-26  218  		/* Try to obtain a free connection slot.
112b9a7a012048 Gerd Rausch       2025-02-26  219  		 * If unsuccessful, we need to preserve "new_sock"
112b9a7a012048 Gerd Rausch       2025-02-26  220  		 * that we just accepted, since its "sk_receive_queue"
112b9a7a012048 Gerd Rausch       2025-02-26  221  		 * may contain messages already that have been acknowledged
112b9a7a012048 Gerd Rausch       2025-02-26  222  		 * to and discarded by the sender.
112b9a7a012048 Gerd Rausch       2025-02-26  223  		 * We must not throw those away!
112b9a7a012048 Gerd Rausch       2025-02-26  224  		 */
5916e2c1554f3e Sowmini Varadhan  2016-07-14  225  		rs_tcp = rds_tcp_accept_one_path(conn);
112b9a7a012048 Gerd Rausch       2025-02-26  226  		if (!rs_tcp) {
112b9a7a012048 Gerd Rausch       2025-02-26  227  			/* It's okay to stash "new_sock", since
112b9a7a012048 Gerd Rausch       2025-02-26  228  			 * "rds_tcp_conn_slots_available" triggers "rds_tcp_accept_one"
112b9a7a012048 Gerd Rausch       2025-02-26  229  			 * again as soon as one of the connection slots
112b9a7a012048 Gerd Rausch       2025-02-26  230  			 * becomes available again
112b9a7a012048 Gerd Rausch       2025-02-26  231  			 */
112b9a7a012048 Gerd Rausch       2025-02-26  232  			rtn->rds_tcp_accepted_sock = new_sock;
112b9a7a012048 Gerd Rausch       2025-02-26  233  			new_sock = NULL;
112b9a7a012048 Gerd Rausch       2025-02-26  234  			ret = -ENOBUFS;
112b9a7a012048 Gerd Rausch       2025-02-26  235  			goto out;
112b9a7a012048 Gerd Rausch       2025-02-26  236  		}
112b9a7a012048 Gerd Rausch       2025-02-26  237  	} else {
112b9a7a012048 Gerd Rausch       2025-02-26  238  		/* This connection request came from a peer with
112b9a7a012048 Gerd Rausch       2025-02-26  239  		 * a larger address.
112b9a7a012048 Gerd Rausch       2025-02-26  240  		 * Function "rds_tcp_state_change" makes sure
112b9a7a012048 Gerd Rausch       2025-02-26  241  		 * that the connection doesn't transition
112b9a7a012048 Gerd Rausch       2025-02-26  242  		 * to state "RDS_CONN_UP", and therefore
112b9a7a012048 Gerd Rausch       2025-02-26  243  		 * we should not have received any messages
112b9a7a012048 Gerd Rausch       2025-02-26  244  		 * on this socket yet.
112b9a7a012048 Gerd Rausch       2025-02-26  245  		 * This is the only case where it's okay to
112b9a7a012048 Gerd Rausch       2025-02-26  246  		 * not dequeue messages from "sk_receive_queue".
112b9a7a012048 Gerd Rausch       2025-02-26  247  		 */
112b9a7a012048 Gerd Rausch       2025-02-26  248  		if (conn->c_npaths <= 1)
112b9a7a012048 Gerd Rausch       2025-02-26  249  			rds_conn_path_connect_if_down(&conn->c_path[0]);
112b9a7a012048 Gerd Rausch       2025-02-26  250  		rs_tcp = NULL;
5916e2c1554f3e Sowmini Varadhan  2016-07-14  251  		goto rst_nsk;
112b9a7a012048 Gerd Rausch       2025-02-26  252  	}
112b9a7a012048 Gerd Rausch       2025-02-26  253  
02105b2ccdd634 Sowmini Varadhan  2016-06-30  254  	mutex_lock(&rs_tcp->t_conn_path_lock);
5916e2c1554f3e Sowmini Varadhan  2016-07-14  255  	cp = rs_tcp->t_cpath;
5916e2c1554f3e Sowmini Varadhan  2016-07-14  256  	conn_state = rds_conn_path_state(cp);
1a0e100fb2c966 Sowmini Varadhan  2016-11-16  257  	WARN_ON(conn_state == RDS_CONN_UP);
112b9a7a012048 Gerd Rausch       2025-02-26  258  	if (conn_state != RDS_CONN_CONNECTING && conn_state != RDS_CONN_ERROR) {
112b9a7a012048 Gerd Rausch       2025-02-26  259  		rds_conn_path_drop(cp, 0);
bd7c5f983f3185 Sowmini Varadhan  2016-05-02  260  		goto rst_nsk;
112b9a7a012048 Gerd Rausch       2025-02-26  261  	}
eb192840266fab Sowmini Varadhan  2016-05-02  262  	if (rs_tcp->t_sock) {
41500c3e2a19ff Sowmini Varadhan  2017-06-15  263  		/* Duelling SYN has been handled in rds_tcp_accept_one() */
ea3b1ea5393087 Sowmini Varadhan  2016-06-30  264  		rds_tcp_reset_callbacks(new_sock, cp);
9c79440e2c5e25 Sowmini Varadhan  2016-06-04  265  		/* rds_connect_path_complete() marks RDS_CONN_UP */
ea3b1ea5393087 Sowmini Varadhan  2016-06-30  266  		rds_connect_path_complete(cp, RDS_CONN_RESETTING);
335b48d980f631 Sowmini Varadhan  2016-06-04  267  	} else {
ea3b1ea5393087 Sowmini Varadhan  2016-06-30  268  		rds_tcp_set_callbacks(new_sock, cp);
ea3b1ea5393087 Sowmini Varadhan  2016-06-30  269  		rds_connect_path_complete(cp, RDS_CONN_CONNECTING);
e70845e6ecd132 Ka-Cheong Poon    2025-02-26  270  		wake_up(&cp->cp_up_waitq);
335b48d980f631 Sowmini Varadhan  2016-06-04  271  	}
70041088e3b976 Andy Grover       2009-08-21  272  	new_sock = NULL;
70041088e3b976 Andy Grover       2009-08-21  273  	ret = 0;
69b92b5b741984 Sowmini Varadhan  2017-06-21  274  	if (conn->c_npaths == 0)
69b92b5b741984 Sowmini Varadhan  2017-06-21  275  		rds_send_ping(cp->cp_conn, cp->cp_index);
bd7c5f983f3185 Sowmini Varadhan  2016-05-02  276  	goto out;
bd7c5f983f3185 Sowmini Varadhan  2016-05-02  277  rst_nsk:
10beea7d7408d0 Sowmini Varadhan  2017-06-15  278  	/* reset the newly returned accept sock and bail.
10beea7d7408d0 Sowmini Varadhan  2017-06-15  279  	 * It is safe to set linger on new_sock because the RDS connection
10beea7d7408d0 Sowmini Varadhan  2017-06-15  280  	 * has not been brought up on new_sock, so no RDS-level data could
10beea7d7408d0 Sowmini Varadhan  2017-06-15  281  	 * be pending on it. By setting linger, we achieve the side-effect
10beea7d7408d0 Sowmini Varadhan  2017-06-15  282  	 * of avoiding TIME_WAIT state on new_sock.
10beea7d7408d0 Sowmini Varadhan  2017-06-15  283  	 */
c433594c07457d Christoph Hellwig 2020-05-28  284  	sock_no_linger(new_sock->sk);
335b48d980f631 Sowmini Varadhan  2016-06-04  285  	kernel_sock_shutdown(new_sock, SHUT_RDWR);
bd7c5f983f3185 Sowmini Varadhan  2016-05-02  286  	ret = 0;
70041088e3b976 Andy Grover       2009-08-21  287  out:
bd7c5f983f3185 Sowmini Varadhan  2016-05-02  288  	if (rs_tcp)
02105b2ccdd634 Sowmini Varadhan  2016-06-30  289  		mutex_unlock(&rs_tcp->t_conn_path_lock);
70041088e3b976 Andy Grover       2009-08-21  290  	if (new_sock)
70041088e3b976 Andy Grover       2009-08-21  291  		sock_release(new_sock);
112b9a7a012048 Gerd Rausch       2025-02-26  292  
112b9a7a012048 Gerd Rausch       2025-02-26  293  	mutex_unlock(&rtn->rds_tcp_accept_lock);
112b9a7a012048 Gerd Rausch       2025-02-26  294  
70041088e3b976 Andy Grover       2009-08-21 @295  	return ret;
70041088e3b976 Andy Grover       2009-08-21  296  }
Allison Henderson March 5, 2025, 12:39 a.m. UTC | #4
On Tue, 2025-03-04 at 13:28 +0300, Dan Carpenter wrote:
> Hi,
> 
> kernel test robot noticed the following build warnings:
> 
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/allison-henderson-oracle-com/net-rds-Avoid-queuing-superfluous-send-and-recv-work/20250227-123038
> base:   net/main
> patch link:    https://lore.kernel.org/r/20250227042638.82553-6-allison.henderson%40oracle.com
> patch subject: [PATCH 5/6] net/rds: rds_tcp_accept_one ought to not discard messages
> config: x86_64-randconfig-161-20250304 (https://download.01.org/0day-ci/archive/20250304/202503041749.YwkRic8W-lkp@intel.com/config)
> compiler: clang version 19.1.7 (https://github.com/llvm/llvm-project cd708029e0b2869e80abe31ddb175f7c35361f90)
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> > Reported-by: kernel test robot <lkp@intel.com>
> > Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
> > Closes: https://lore.kernel.org/r/202503041749.YwkRic8W-lkp@intel.com/
> 
> smatch warnings:
> net/rds/tcp_listen.c:295 rds_tcp_accept_one() warn: inconsistent returns '&rtn->rds_tcp_accept_lock'.
> 
> vim +/new_sock +156 net/rds/tcp_listen.c
> 
> 112b9a7a012048 Gerd Rausch       2025-02-26  109  int rds_tcp_accept_one(struct rds_tcp_net *rtn)
> 70041088e3b976 Andy Grover       2009-08-21  110  {
> 112b9a7a012048 Gerd Rausch       2025-02-26  111  	struct socket *listen_sock = rtn->rds_tcp_listen_sock;
> 70041088e3b976 Andy Grover       2009-08-21  112  	struct socket *new_sock = NULL;
> 70041088e3b976 Andy Grover       2009-08-21  113  	struct rds_connection *conn;
> 70041088e3b976 Andy Grover       2009-08-21  114  	int ret;
> 70041088e3b976 Andy Grover       2009-08-21  115  	struct inet_sock *inet;
> bd7c5f983f3185 Sowmini Varadhan  2016-05-02  116  	struct rds_tcp_connection *rs_tcp = NULL;
> bd7c5f983f3185 Sowmini Varadhan  2016-05-02  117  	int conn_state;
> ea3b1ea5393087 Sowmini Varadhan  2016-06-30  118  	struct rds_conn_path *cp;
> 1e2b44e78eead7 Ka-Cheong Poon    2018-07-23  119  	struct in6_addr *my_addr, *peer_addr;
> 92ef0fd55ac80d Jens Axboe        2024-05-09  120  	struct proto_accept_arg arg = {
> 92ef0fd55ac80d Jens Axboe        2024-05-09  121  		.flags = O_NONBLOCK,
> 92ef0fd55ac80d Jens Axboe        2024-05-09  122  		.kern = true,
> 92ef0fd55ac80d Jens Axboe        2024-05-09  123  	};
> e65d4d96334e3f Ka-Cheong Poon    2018-07-30  124  #if !IS_ENABLED(CONFIG_IPV6)
> e65d4d96334e3f Ka-Cheong Poon    2018-07-30  125  	struct in6_addr saddr, daddr;
> e65d4d96334e3f Ka-Cheong Poon    2018-07-30  126  #endif
> e65d4d96334e3f Ka-Cheong Poon    2018-07-30  127  	int dev_if = 0;
> 70041088e3b976 Andy Grover       2009-08-21  128  
> 112b9a7a012048 Gerd Rausch       2025-02-26  129  	mutex_lock(&rtn->rds_tcp_accept_lock);
>                                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> 112b9a7a012048 Gerd Rausch       2025-02-26  130  
> 112b9a7a012048 Gerd Rausch       2025-02-26  131  	if (!listen_sock)
> 37e14f4fe2991f Sowmini Varadhan  2016-05-18  132  		return -ENETUNREACH;
>                                                                 ^^^^^^^^^^^^^^^^^^^^
> Do this before taking the lock?
Yep, I think you're right.  Will fix.  Thank you!

Allison

> 70041088e3b976 Andy Grover       2009-08-21  165  		inet = inet_sk(new_sock->sk);
> 112b9a7a012048 Gerd Rausch       2025-02-26  166  	}
> 70041088e3b976 Andy Grover       2009-08-21  167  
> e65d4d96334e3f Ka-Cheong Poon    2018-07-30  168  #if IS_ENABLED(CONFIG_IPV6)
> 1e2b44e78eead7 Ka-Cheong Poon    2018-07-23  169  	my_addr = &new_sock->sk->sk_v6_rcv_saddr;
> 1e2b44e78eead7 Ka-Cheong Poon    2018-07-23  170  	peer_addr = &new_sock->sk->sk_v6_daddr;
> e65d4d96334e3f Ka-Cheong Poon    2018-07-30  171  #else
> e65d4d96334e3f Ka-Cheong Poon    2018-07-30  172  	ipv6_addr_set_v4mapped(inet->inet_saddr, &saddr);
> e65d4d96334e3f Ka-Cheong Poon    2018-07-30  173  	ipv6_addr_set_v4mapped(inet->inet_daddr, &daddr);
> e65d4d96334e3f Ka-Cheong Poon    2018-07-30  174  	my_addr = &saddr;
> e65d4d96334e3f Ka-Cheong Poon    2018-07-30  175  	peer_addr = &daddr;
> e65d4d96334e3f Ka-Cheong Poon    2018-07-30  176  #endif
> e65d4d96334e3f Ka-Cheong Poon    2018-07-30  177  	rdsdebug("accepted family %d tcp %pI6c:%u -> %pI6c:%u\n",
> 112b9a7a012048 Gerd Rausch       2025-02-26  178  		 listen_sock->sk->sk_family,
> 1e2b44e78eead7 Ka-Cheong Poon    2018-07-23  179  		 my_addr, ntohs(inet->inet_sport),
> 1e2b44e78eead7 Ka-Cheong Poon    2018-07-23  180  		 peer_addr, ntohs(inet->inet_dport));
> 70041088e3b976 Andy Grover       2009-08-21  181  
> e65d4d96334e3f Ka-Cheong Poon    2018-07-30  182  #if IS_ENABLED(CONFIG_IPV6)
> 1e2b44e78eead7 Ka-Cheong Poon    2018-07-23  183  	/* sk_bound_dev_if is not set if the peer address is not link local
> 1e2b44e78eead7 Ka-Cheong Poon    2018-07-23  184  	 * address.  In this case, it happens that mcast_oif is set.  So
> 1e2b44e78eead7 Ka-Cheong Poon    2018-07-23  185  	 * just use it.
> 1e2b44e78eead7 Ka-Cheong Poon    2018-07-23  186  	 */
> 1e2b44e78eead7 Ka-Cheong Poon    2018-07-23  187  	if ((ipv6_addr_type(my_addr) & IPV6_ADDR_LINKLOCAL) &&
> 1e2b44e78eead7 Ka-Cheong Poon    2018-07-23  188  	    !(ipv6_addr_type(peer_addr) & IPV6_ADDR_LINKLOCAL)) {
> 1e2b44e78eead7 Ka-Cheong Poon    2018-07-23  189  		struct ipv6_pinfo *inet6;
> 1e2b44e78eead7 Ka-Cheong Poon    2018-07-23  190  
> 1e2b44e78eead7 Ka-Cheong Poon    2018-07-23  191  		inet6 = inet6_sk(new_sock->sk);
> d2f011a0bf28c0 Eric Dumazet      2023-12-08  192  		dev_if = READ_ONCE(inet6->mcast_oif);
> 1e2b44e78eead7 Ka-Cheong Poon    2018-07-23  193  	} else {
> 1e2b44e78eead7 Ka-Cheong Poon    2018-07-23  194  		dev_if = new_sock->sk->sk_bound_dev_if;
> 1e2b44e78eead7 Ka-Cheong Poon    2018-07-23  195  	}
> e65d4d96334e3f Ka-Cheong Poon    2018-07-30  196  #endif
> e65d4d96334e3f Ka-Cheong Poon    2018-07-30  197  
> 112b9a7a012048 Gerd Rausch       2025-02-26  198  	if (!rds_tcp_laddr_check(sock_net(listen_sock->sk), peer_addr, dev_if)) {
> aced3ce57cd37b Rao Shoaib        2021-05-21  199  		/* local address connection is only allowed via loopback */
> aced3ce57cd37b Rao Shoaib        2021-05-21  200  		ret = -EOPNOTSUPP;
> aced3ce57cd37b Rao Shoaib        2021-05-21  201  		goto out;
> aced3ce57cd37b Rao Shoaib        2021-05-21  202  	}
> aced3ce57cd37b Rao Shoaib        2021-05-21  203  
> 112b9a7a012048 Gerd Rausch       2025-02-26  204  	conn = rds_conn_create(sock_net(listen_sock->sk),
> e65d4d96334e3f Ka-Cheong Poon    2018-07-30  205  			       my_addr, peer_addr,
> 3eb450367d0823 Santosh Shilimkar 2018-10-23  206  			       &rds_tcp_transport, 0, GFP_KERNEL, dev_if);
> eee2fa6ab32251 Ka-Cheong Poon    2018-07-23  207  
> 70041088e3b976 Andy Grover       2009-08-21  208  	if (IS_ERR(conn)) {
> 70041088e3b976 Andy Grover       2009-08-21  209  		ret = PTR_ERR(conn);
> 70041088e3b976 Andy Grover       2009-08-21  210  		goto out;
> 70041088e3b976 Andy Grover       2009-08-21  211  	}
> f711a6ae062cae Sowmini Varadhan  2015-05-05  212  	/* An incoming SYN request came in, and TCP just accepted it.
> f711a6ae062cae Sowmini Varadhan  2015-05-05  213  	 *
> f711a6ae062cae Sowmini Varadhan  2015-05-05  214  	 * If the client reboots, this conn will need to be cleaned up.
> f711a6ae062cae Sowmini Varadhan  2015-05-05  215  	 * rds_tcp_state_change() will do that cleanup
> f711a6ae062cae Sowmini Varadhan  2015-05-05  216  	 */
> 112b9a7a012048 Gerd Rausch       2025-02-26  217  	if (rds_addr_cmp(&conn->c_faddr, &conn->c_laddr) < 0) {
> 112b9a7a012048 Gerd Rausch       2025-02-26  218  		/* Try to obtain a free connection slot.
> 112b9a7a012048 Gerd Rausch       2025-02-26  219  		 * If unsuccessful, we need to preserve "new_sock"
> 112b9a7a012048 Gerd Rausch       2025-02-26  220  		 * that we just accepted, since its "sk_receive_queue"
> 112b9a7a012048 Gerd Rausch       2025-02-26  221  		 * may contain messages already that have been acknowledged
> 112b9a7a012048 Gerd Rausch       2025-02-26  222  		 * to and discarded by the sender.
> 112b9a7a012048 Gerd Rausch       2025-02-26  223  		 * We must not throw those away!
> 112b9a7a012048 Gerd Rausch       2025-02-26  224  		 */
> 5916e2c1554f3e Sowmini Varadhan  2016-07-14  225  		rs_tcp = rds_tcp_accept_one_path(conn);
> 112b9a7a012048 Gerd Rausch       2025-02-26  226  		if (!rs_tcp) {
> 112b9a7a012048 Gerd Rausch       2025-02-26  227  			/* It's okay to stash "new_sock", since
> 112b9a7a012048 Gerd Rausch       2025-02-26  228  			 * "rds_tcp_conn_slots_available" triggers "rds_tcp_accept_one"
> 112b9a7a012048 Gerd Rausch       2025-02-26  229  			 * again as soon as one of the connection slots
> 112b9a7a012048 Gerd Rausch       2025-02-26  230  			 * becomes available again
> 112b9a7a012048 Gerd Rausch       2025-02-26  231  			 */
> 112b9a7a012048 Gerd Rausch       2025-02-26  232  			rtn->rds_tcp_accepted_sock = new_sock;
> 112b9a7a012048 Gerd Rausch       2025-02-26  233  			new_sock = NULL;
> 112b9a7a012048 Gerd Rausch       2025-02-26  234  			ret = -ENOBUFS;
> 112b9a7a012048 Gerd Rausch       2025-02-26  235  			goto out;
> 112b9a7a012048 Gerd Rausch       2025-02-26  236  		}
> 112b9a7a012048 Gerd Rausch       2025-02-26  237  	} else {
> 112b9a7a012048 Gerd Rausch       2025-02-26  238  		/* This connection request came from a peer with
> 112b9a7a012048 Gerd Rausch       2025-02-26  239  		 * a larger address.
> 112b9a7a012048 Gerd Rausch       2025-02-26  240  		 * Function "rds_tcp_state_change" makes sure
> 112b9a7a012048 Gerd Rausch       2025-02-26  241  		 * that the connection doesn't transition
> 112b9a7a012048 Gerd Rausch       2025-02-26  242  		 * to state "RDS_CONN_UP", and therefore
> 112b9a7a012048 Gerd Rausch       2025-02-26  243  		 * we should not have received any messages
> 112b9a7a012048 Gerd Rausch       2025-02-26  244  		 * on this socket yet.
> 112b9a7a012048 Gerd Rausch       2025-02-26  245  		 * This is the only case where it's okay to
> 112b9a7a012048 Gerd Rausch       2025-02-26  246  		 * not dequeue messages from "sk_receive_queue".
> 112b9a7a012048 Gerd Rausch       2025-02-26  247  		 */
> 112b9a7a012048 Gerd Rausch       2025-02-26  248  		if (conn->c_npaths <= 1)
> 112b9a7a012048 Gerd Rausch       2025-02-26  249  			rds_conn_path_connect_if_down(&conn->c_path[0]);
> 112b9a7a012048 Gerd Rausch       2025-02-26  250  		rs_tcp = NULL;
> 5916e2c1554f3e Sowmini Varadhan  2016-07-14  251  		goto rst_nsk;
> 112b9a7a012048 Gerd Rausch       2025-02-26  252  	}
> 112b9a7a012048 Gerd Rausch       2025-02-26  253  
> 02105b2ccdd634 Sowmini Varadhan  2016-06-30  254  	mutex_lock(&rs_tcp->t_conn_path_lock);
> 5916e2c1554f3e Sowmini Varadhan  2016-07-14  255  	cp = rs_tcp->t_cpath;
> 5916e2c1554f3e Sowmini Varadhan  2016-07-14  256  	conn_state = rds_conn_path_state(cp);
> 1a0e100fb2c966 Sowmini Varadhan  2016-11-16  257  	WARN_ON(conn_state == RDS_CONN_UP);
> 112b9a7a012048 Gerd Rausch       2025-02-26  258  	if (conn_state != RDS_CONN_CONNECTING && conn_state != RDS_CONN_ERROR) {
> 112b9a7a012048 Gerd Rausch       2025-02-26  259  		rds_conn_path_drop(cp, 0);
> bd7c5f983f3185 Sowmini Varadhan  2016-05-02  260  		goto rst_nsk;
> 112b9a7a012048 Gerd Rausch       2025-02-26  261  	}
> eb192840266fab Sowmini Varadhan  2016-05-02  262  	if (rs_tcp->t_sock) {
> 41500c3e2a19ff Sowmini Varadhan  2017-06-15  263  		/* Duelling SYN has been handled in rds_tcp_accept_one() */
> ea3b1ea5393087 Sowmini Varadhan  2016-06-30  264  		rds_tcp_reset_callbacks(new_sock, cp);
> 9c79440e2c5e25 Sowmini Varadhan  2016-06-04  265  		/* rds_connect_path_complete() marks RDS_CONN_UP */
> ea3b1ea5393087 Sowmini Varadhan  2016-06-30  266  		rds_connect_path_complete(cp, RDS_CONN_RESETTING);
> 335b48d980f631 Sowmini Varadhan  2016-06-04  267  	} else {
> ea3b1ea5393087 Sowmini Varadhan  2016-06-30  268  		rds_tcp_set_callbacks(new_sock, cp);
> ea3b1ea5393087 Sowmini Varadhan  2016-06-30  269  		rds_connect_path_complete(cp, RDS_CONN_CONNECTING);
> e70845e6ecd132 Ka-Cheong Poon    2025-02-26  270  		wake_up(&cp->cp_up_waitq);
> 335b48d980f631 Sowmini Varadhan  2016-06-04  271  	}
> 70041088e3b976 Andy Grover       2009-08-21  272  	new_sock = NULL;
> 70041088e3b976 Andy Grover       2009-08-21  273  	ret = 0;
> 69b92b5b741984 Sowmini Varadhan  2017-06-21  274  	if (conn->c_npaths == 0)
> 69b92b5b741984 Sowmini Varadhan  2017-06-21  275  		rds_send_ping(cp->cp_conn, cp->cp_index);
> bd7c5f983f3185 Sowmini Varadhan  2016-05-02  276  	goto out;
> bd7c5f983f3185 Sowmini Varadhan  2016-05-02  277  rst_nsk:
> 10beea7d7408d0 Sowmini Varadhan  2017-06-15  278  	/* reset the newly returned accept sock and bail.
> 10beea7d7408d0 Sowmini Varadhan  2017-06-15  279  	 * It is safe to set linger on new_sock because the RDS connection
> 10beea7d7408d0 Sowmini Varadhan  2017-06-15  280  	 * has not been brought up on new_sock, so no RDS-level data could
> 10beea7d7408d0 Sowmini Varadhan  2017-06-15  281  	 * be pending on it. By setting linger, we achieve the side-effect
> 10beea7d7408d0 Sowmini Varadhan  2017-06-15  282  	 * of avoiding TIME_WAIT state on new_sock.
> 10beea7d7408d0 Sowmini Varadhan  2017-06-15  283  	 */
> c433594c07457d Christoph Hellwig 2020-05-28  284  	sock_no_linger(new_sock->sk);
> 335b48d980f631 Sowmini Varadhan  2016-06-04  285  	kernel_sock_shutdown(new_sock, SHUT_RDWR);
> bd7c5f983f3185 Sowmini Varadhan  2016-05-02  286  	ret = 0;
> 70041088e3b976 Andy Grover       2009-08-21  287  out:
> bd7c5f983f3185 Sowmini Varadhan  2016-05-02  288  	if (rs_tcp)
> 02105b2ccdd634 Sowmini Varadhan  2016-06-30  289  		mutex_unlock(&rs_tcp->t_conn_path_lock);
> 70041088e3b976 Andy Grover       2009-08-21  290  	if (new_sock)
> 70041088e3b976 Andy Grover       2009-08-21  291  		sock_release(new_sock);
> 112b9a7a012048 Gerd Rausch       2025-02-26  292  
> 112b9a7a012048 Gerd Rausch       2025-02-26  293  	mutex_unlock(&rtn->rds_tcp_accept_lock);
> 112b9a7a012048 Gerd Rausch       2025-02-26  294  
> 70041088e3b976 Andy Grover       2009-08-21 @295  	return ret;
> 70041088e3b976 Andy Grover       2009-08-21  296  }
>
Allison Henderson March 5, 2025, 12:43 a.m. UTC | #5
On Sun, 2025-03-02 at 07:22 +0800, kernel test robot wrote:
> Hi,
> 
> kernel test robot noticed the following build warnings:
> 
> [auto build test WARNING on net/main]
> [also build test WARNING on net-next/main linus/master v6.14-rc4 next-20250228]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information ]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/allison-henderson-oracle-com/net-rds-Avoid-queuing-superfluous-send-and-recv-work/20250227-123038
> base:   net/main
> patch link:    https://lore.kernel.org/r/20250227042638.82553-6-allison.henderson@oracle.com
> patch subject: [PATCH 5/6] net/rds: rds_tcp_accept_one ought to not discard messages
> config: riscv-randconfig-002-20250302 (https://download.01.org/0day-ci/archive/20250302/202503020739.xWWyG7zO-lkp@intel.com/config)
> compiler: clang version 16.0.6 (https://github.com/llvm/llvm-project 7cbf1a2591520c2491aa35339f227775f4d3adf6)
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250302/202503020739.xWWyG7zO-lkp@intel.com/reproduce)
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> > Reported-by: kernel test robot <lkp@intel.com>
> > Closes: https://lore.kernel.org/oe-kbuild-all/202503020739.xWWyG7zO-lkp@intel.com/
> 
> All warnings (new ones prefixed by >>):
> 
> > > net/rds/tcp_listen.c:137:6: warning: variable 'inet' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
>            if (!new_sock) {
>                ^~~~~~~~~
>    net/rds/tcp_listen.c:179:19: note: uninitialized use occurs here
>                     my_addr, ntohs(inet->inet_sport),
>                                    ^~~~
>    include/linux/byteorder/generic.h:142:27: note: expanded from macro 'ntohs'
>    #define ntohs(x) ___ntohs(x)
>                              ^
>    include/linux/byteorder/generic.h:137:35: note: expanded from macro '___ntohs'
>    #define ___ntohs(x) __be16_to_cpu(x)
>                                      ^
>    include/uapi/linux/byteorder/little_endian.h:43:59: note: expanded from macro '__be16_to_cpu'
>    #define __be16_to_cpu(x) __swab16((__force __u16)(__be16)(x))
>                                                              ^
>    include/uapi/linux/swab.h:105:31: note: expanded from macro '__swab16'
>            (__u16)(__builtin_constant_p(x) ?       \
>                                         ^
>    net/rds/tcp_listen.c:137:2: note: remove the 'if' if its condition is always true
>            if (!new_sock) {
>            ^~~~~~~~~~~~~~~
>    net/rds/tcp_listen.c:115:24: note: initialize the variable 'inet' to silence this warning
>            struct inet_sock *inet;
>                                  ^
>                                   = NULL
>    1 warning generated.
> 
> 
> vim +137 net/rds/tcp_listen.c
> 
>    108	
>    109	int rds_tcp_accept_one(struct rds_tcp_net *rtn)
>    110	{
>    111		struct socket *listen_sock = rtn->rds_tcp_listen_sock;
>    112		struct socket *new_sock = NULL;
>    113		struct rds_connection *conn;
>    114		int ret;
>    115		struct inet_sock *inet;
>    116		struct rds_tcp_connection *rs_tcp = NULL;
>    117		int conn_state;
>    118		struct rds_conn_path *cp;
>    119		struct in6_addr *my_addr, *peer_addr;
>    120		struct proto_accept_arg arg = {
>    121			.flags = O_NONBLOCK,
>    122			.kern = true,
>    123		};
>    124	#if !IS_ENABLED(CONFIG_IPV6)
>    125		struct in6_addr saddr, daddr;
>    126	#endif
>    127		int dev_if = 0;
>    128	
>    129		mutex_lock(&rtn->rds_tcp_accept_lock);
>    130	
>    131		if (!listen_sock)
>    132			return -ENETUNREACH;
>    133	
>    134		new_sock = rtn->rds_tcp_accepted_sock;
>    135		rtn->rds_tcp_accepted_sock = NULL;
>    136	
>  > 137		if (!new_sock) {
>    138			ret = sock_create_lite(listen_sock->sk->sk_family,
>    139					       listen_sock->sk->sk_type,
>    140					       listen_sock->sk->sk_protocol,
>    141					       &new_sock);
>    142			if (ret)
>    143				goto out;
>    144	
>    145			ret = listen_sock->ops->accept(listen_sock, new_sock, &arg);
>    146			if (ret < 0)
>    147				goto out;
>    148	
>    149			/* sock_create_lite() does not get a hold on the owner module so we
>    150			 * need to do it here.  Note that sock_release() uses sock->ops to
>    151			 * determine if it needs to decrement the reference count.  So set
>    152			 * sock->ops after calling accept() in case that fails.  And there's
>    153			 * no need to do try_module_get() as the listener should have a hold
>    154			 * already.
>    155			 */
>    156			new_sock->ops = listen_sock->ops;
>    157			__module_get(new_sock->ops->owner);
>    158	
>    159			rds_tcp_keepalive(new_sock);
>    160			if (!rds_tcp_tune(new_sock)) {
>    161				ret = -EINVAL;
>    162				goto out;
>    163			}
>    164	
>    165			inet = inet_sk(new_sock->sk);
>    166		}
I think just moving the inet assignment below the closing brace of the if block should correct this warning.  Will fix :-)

Thanks!
Allison

>    167	
>
Allison Henderson March 6, 2025, 4:41 p.m. UTC | #6
On Fri, 2025-02-28 at 16:21 -0800, Jakub Kicinski wrote:
> On Wed, 26 Feb 2025 21:26:37 -0700 allison.henderson@oracle.com wrote:
> > diff --git a/net/rds/rds.h b/net/rds/rds.h
> > index 85b47ce52266..422d5e26410e 100644
> > --- a/net/rds/rds.h
> > +++ b/net/rds/rds.h
> > @@ -548,6 +548,7 @@ struct rds_transport {
> >  			   __u32 scope_id);
> >  	int (*conn_alloc)(struct rds_connection *conn, gfp_t gfp);
> >  	void (*conn_free)(void *data);
> > +	void (*conn_slots_available)(struct rds_connection *conn);
> >  	int (*conn_path_connect)(struct rds_conn_path *cp);
> >  	void (*conn_path_shutdown)(struct rds_conn_path *conn);
> >  	void (*xmit_path_prepare)(struct rds_conn_path *cp);
> 
> This struct has a kdoc, you need to document the new member.
> Or make the comment not a kdoc, if full documentation isn't necessary.

Hi Jakub,

Sure, how about I break the kdoc into comments for their respective members and then add a comment for the new function
pointer.  How does the new comment below sound:

/*
 * conn_slots_available is invoked when a previously unavailable connection slot
 * becomes available again. rds_tcp_accept_one may return -ENOBUFS if it
 * cannot find an available slot, in which case it stashes the new socket in
 * "rds_tcp_accepted_sock". This function re-issues "rds_tcp_accept_one",
 * which picks up the stashed socket and continues where it left off when it
 * last returned -ENOBUFS. This ensures messages received on the new socket
 * are not discarded when no connection path was available at the time.
 */

Let me know what you think.  Thanks!
Allison

Patch

diff --git a/net/rds/rds.h b/net/rds/rds.h
index 85b47ce52266..422d5e26410e 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -548,6 +548,7 @@  struct rds_transport {
 			   __u32 scope_id);
 	int (*conn_alloc)(struct rds_connection *conn, gfp_t gfp);
 	void (*conn_free)(void *data);
+	void (*conn_slots_available)(struct rds_connection *conn);
 	int (*conn_path_connect)(struct rds_conn_path *cp);
 	void (*conn_path_shutdown)(struct rds_conn_path *conn);
 	void (*xmit_path_prepare)(struct rds_conn_path *cp);
diff --git a/net/rds/recv.c b/net/rds/recv.c
index 5627f80013f8..c0a89af29d1b 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -230,6 +230,10 @@  static void rds_recv_hs_exthdrs(struct rds_header *hdr,
 	conn->c_npaths = max_t(int, conn->c_npaths, 1);
 	conn->c_ping_triggered = 0;
 	rds_conn_peer_gen_update(conn, new_peer_gen_num);
+
+	if (conn->c_npaths > 1 &&
+	    conn->c_trans->conn_slots_available)
+		conn->c_trans->conn_slots_available(conn);
 }
 
 /* rds_start_mprds() will synchronously start multiple paths when appropriate.
diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index b3f2c6e27b59..915b35136693 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -221,6 +221,8 @@  void rds_tcp_set_callbacks(struct socket *sock, struct rds_conn_path *cp)
 		sock->sk->sk_data_ready = sock->sk->sk_user_data;
 
 	tc->t_sock = sock;
+	if (!tc->t_rtn)
+		tc->t_rtn = net_generic(sock_net(sock->sk), rds_tcp_netid);
 	tc->t_cpath = cp;
 	tc->t_orig_data_ready = sock->sk->sk_data_ready;
 	tc->t_orig_write_space = sock->sk->sk_write_space;
@@ -386,6 +388,7 @@  static int rds_tcp_conn_alloc(struct rds_connection *conn, gfp_t gfp)
 		}
 		mutex_init(&tc->t_conn_path_lock);
 		tc->t_sock = NULL;
+		tc->t_rtn = NULL;
 		tc->t_tinc = NULL;
 		tc->t_tinc_hdr_rem = sizeof(struct rds_header);
 		tc->t_tinc_data_rem = 0;
@@ -466,6 +469,7 @@  struct rds_transport rds_tcp_transport = {
 	.recv_path		= rds_tcp_recv_path,
 	.conn_alloc		= rds_tcp_conn_alloc,
 	.conn_free		= rds_tcp_conn_free,
+	.conn_slots_available	= rds_tcp_conn_slots_available,
 	.conn_path_connect	= rds_tcp_conn_path_connect,
 	.conn_path_shutdown	= rds_tcp_conn_path_shutdown,
 	.inc_copy_to_user	= rds_tcp_inc_copy_to_user,
@@ -481,17 +485,7 @@  struct rds_transport rds_tcp_transport = {
 	.t_unloading		= rds_tcp_is_unloading,
 };
 
-static unsigned int rds_tcp_netid;
-
-/* per-network namespace private data for this module */
-struct rds_tcp_net {
-	struct socket *rds_tcp_listen_sock;
-	struct work_struct rds_tcp_accept_w;
-	struct ctl_table_header *rds_tcp_sysctl;
-	struct ctl_table *ctl_table;
-	int sndbuf_size;
-	int rcvbuf_size;
-};
+int rds_tcp_netid;
 
 /* All module specific customizations to the RDS-TCP socket should be done in
  * rds_tcp_tune() and applied after socket creation.
@@ -538,15 +532,12 @@  static void rds_tcp_accept_worker(struct work_struct *work)
 					       struct rds_tcp_net,
 					       rds_tcp_accept_w);
 
-	while (rds_tcp_accept_one(rtn->rds_tcp_listen_sock) == 0)
+	while (rds_tcp_accept_one(rtn) == 0)
 		cond_resched();
 }
 
-void rds_tcp_accept_work(struct sock *sk)
+void rds_tcp_accept_work(struct rds_tcp_net *rtn)
 {
-	struct net *net = sock_net(sk);
-	struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
-
 	queue_work(rds_wq, &rtn->rds_tcp_accept_w);
 }
 
@@ -558,6 +549,8 @@  static __net_init int rds_tcp_init_net(struct net *net)
 
 	memset(rtn, 0, sizeof(*rtn));
 
+	mutex_init(&rtn->rds_tcp_accept_lock);
+
 	/* {snd, rcv}buf_size default to 0, which implies we let the
 	 * stack pick the value, and permit auto-tuning of buffer size.
 	 */
@@ -621,6 +614,8 @@  static void rds_tcp_kill_sock(struct net *net)
 
 	rtn->rds_tcp_listen_sock = NULL;
 	rds_tcp_listen_stop(lsock, &rtn->rds_tcp_accept_w);
+	if (rtn->rds_tcp_accepted_sock)
+		sock_release(rtn->rds_tcp_accepted_sock);
 	spin_lock_irq(&rds_tcp_conn_lock);
 	list_for_each_entry_safe(tc, _tc, &rds_tcp_conn_list, t_tcp_node) {
 		struct net *c_net = read_pnet(&tc->t_cpath->cp_conn->c_net);
diff --git a/net/rds/tcp.h b/net/rds/tcp.h
index 053aa7da87ef..2000f4acd57a 100644
--- a/net/rds/tcp.h
+++ b/net/rds/tcp.h
@@ -4,6 +4,21 @@ 
 
 #define RDS_TCP_PORT	16385
 
+/* per-network namespace private data for this module */
+struct rds_tcp_net {
+	/* serialize "rds_tcp_accept_one" with "rds_tcp_accept_lock"
+	 * to protect "rds_tcp_accepted_sock"
+	 */
+	struct mutex		rds_tcp_accept_lock;
+	struct socket		*rds_tcp_listen_sock;
+	struct socket		*rds_tcp_accepted_sock;
+	struct work_struct	rds_tcp_accept_w;
+	struct ctl_table_header	*rds_tcp_sysctl;
+	struct ctl_table	*ctl_table;
+	int			sndbuf_size;
+	int			rcvbuf_size;
+};
+
 struct rds_tcp_incoming {
 	struct rds_incoming	ti_inc;
 	struct sk_buff_head	ti_skb_list;
@@ -19,6 +34,7 @@  struct rds_tcp_connection {
 	 */
 	struct mutex		t_conn_path_lock;
 	struct socket		*t_sock;
+	struct rds_tcp_net	*t_rtn;
 	void			*t_orig_write_space;
 	void			*t_orig_data_ready;
 	void			*t_orig_state_change;
@@ -49,6 +65,7 @@  struct rds_tcp_statistics {
 };
 
 /* tcp.c */
+extern int rds_tcp_netid;
 bool rds_tcp_tune(struct socket *sock);
 void rds_tcp_set_callbacks(struct socket *sock, struct rds_conn_path *cp);
 void rds_tcp_reset_callbacks(struct socket *sock, struct rds_conn_path *cp);
@@ -57,7 +74,7 @@  void rds_tcp_restore_callbacks(struct socket *sock,
 u32 rds_tcp_write_seq(struct rds_tcp_connection *tc);
 u32 rds_tcp_snd_una(struct rds_tcp_connection *tc);
 extern struct rds_transport rds_tcp_transport;
-void rds_tcp_accept_work(struct sock *sk);
+void rds_tcp_accept_work(struct rds_tcp_net *rtn);
 int rds_tcp_laddr_check(struct net *net, const struct in6_addr *addr,
 			__u32 scope_id);
 /* tcp_connect.c */
@@ -69,7 +86,8 @@  void rds_tcp_state_change(struct sock *sk);
 struct socket *rds_tcp_listen_init(struct net *net, bool isv6);
 void rds_tcp_listen_stop(struct socket *sock, struct work_struct *acceptor);
 void rds_tcp_listen_data_ready(struct sock *sk);
-int rds_tcp_accept_one(struct socket *sock);
+void rds_tcp_conn_slots_available(struct rds_connection *conn);
+int rds_tcp_accept_one(struct rds_tcp_net *rtn);
 void rds_tcp_keepalive(struct socket *sock);
 void *rds_tcp_listen_sock_def_readable(struct net *net);
 
diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
index 886b5373843e..e44384f0adf7 100644
--- a/net/rds/tcp_listen.c
+++ b/net/rds/tcp_listen.c
@@ -35,6 +35,8 @@ 
 #include <linux/in.h>
 #include <net/tcp.h>
 #include <trace/events/sock.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
 
 #include "rds.h"
 #include "tcp.h"
@@ -66,32 +68,47 @@  struct rds_tcp_connection *rds_tcp_accept_one_path(struct rds_connection *conn)
 	int i;
 	int npaths = max_t(int, 1, conn->c_npaths);
 
-	/* for mprds, all paths MUST be initiated by the peer
-	 * with the smaller address.
-	 */
-	if (rds_addr_cmp(&conn->c_faddr, &conn->c_laddr) >= 0) {
-		/* Make sure we initiate at least one path if this
-		 * has not already been done; rds_start_mprds() will
-		 * take care of additional paths, if necessary.
-		 */
-		if (npaths == 1)
-			rds_conn_path_connect_if_down(&conn->c_path[0]);
-		return NULL;
-	}
-
 	for (i = 0; i < npaths; i++) {
 		struct rds_conn_path *cp = &conn->c_path[i];
 
 		if (rds_conn_path_transition(cp, RDS_CONN_DOWN,
-					     RDS_CONN_CONNECTING)) {
+					     RDS_CONN_CONNECTING))
 			return cp->cp_transport_data;
-		}
 	}
 	return NULL;
 }
 
-int rds_tcp_accept_one(struct socket *sock)
+void rds_tcp_conn_slots_available(struct rds_connection *conn)
+{
+	struct rds_tcp_connection *tc;
+	struct rds_tcp_net *rtn;
+
+	smp_rmb();
+	if (test_bit(RDS_DESTROY_PENDING, &conn->c_path->cp_flags))
+		return;
+
+	tc = conn->c_path->cp_transport_data;
+	rtn = tc->t_rtn;
+	if (!rtn)
+		return;
+
+	/* As soon as a connection went down,
+	 * it is safe to schedule a "rds_tcp_accept_one"
+	 * attempt even if there are no connections pending:
+	 * Function "rds_tcp_accept_one" won't block
+	 * but simply return -EAGAIN in that case.
+	 *
+	 * Doing so is necessary to address the case where an
+	 * incoming connection on "rds_tcp_listen_sock" is ready
+	 * to be accepted prior to a free slot being available:
+	 * the -ENOBUFS case in "rds_tcp_accept_one".
+	 */
+	rds_tcp_accept_work(rtn);
+}
+
+int rds_tcp_accept_one(struct rds_tcp_net *rtn)
 {
+	struct socket *listen_sock = rtn->rds_tcp_listen_sock;
 	struct socket *new_sock = NULL;
 	struct rds_connection *conn;
 	int ret;
@@ -109,37 +126,45 @@  int rds_tcp_accept_one(struct socket *sock)
 #endif
 	int dev_if = 0;
 
-	if (!sock) /* module unload or netns delete in progress */
-		return -ENETUNREACH;
+	mutex_lock(&rtn->rds_tcp_accept_lock);
 
-	ret = sock_create_lite(sock->sk->sk_family,
-			       sock->sk->sk_type, sock->sk->sk_protocol,
-			       &new_sock);
-	if (ret)
-		goto out;
+	if (!listen_sock)
+		return -ENETUNREACH;
 
-	ret = sock->ops->accept(sock, new_sock, &arg);
-	if (ret < 0)
-		goto out;
+	new_sock = rtn->rds_tcp_accepted_sock;
+	rtn->rds_tcp_accepted_sock = NULL;
+
+	if (!new_sock) {
+		ret = sock_create_lite(listen_sock->sk->sk_family,
+				       listen_sock->sk->sk_type,
+				       listen_sock->sk->sk_protocol,
+				       &new_sock);
+		if (ret)
+			goto out;
+
+		ret = listen_sock->ops->accept(listen_sock, new_sock, &arg);
+		if (ret < 0)
+			goto out;
+
+		/* sock_create_lite() does not get a hold on the owner module so we
+		 * need to do it here.  Note that sock_release() uses sock->ops to
+		 * determine if it needs to decrement the reference count.  So set
+		 * sock->ops after calling accept() in case that fails.  And there's
+		 * no need to do try_module_get() as the listener should have a hold
+		 * already.
+		 */
+		new_sock->ops = listen_sock->ops;
+		__module_get(new_sock->ops->owner);
 
-	/* sock_create_lite() does not get a hold on the owner module so we
-	 * need to do it here.  Note that sock_release() uses sock->ops to
-	 * determine if it needs to decrement the reference count.  So set
-	 * sock->ops after calling accept() in case that fails.  And there's
-	 * no need to do try_module_get() as the listener should have a hold
-	 * already.
-	 */
-	new_sock->ops = sock->ops;
-	__module_get(new_sock->ops->owner);
+		rds_tcp_keepalive(new_sock);
+		if (!rds_tcp_tune(new_sock)) {
+			ret = -EINVAL;
+			goto out;
+		}
 
-	rds_tcp_keepalive(new_sock);
-	if (!rds_tcp_tune(new_sock)) {
-		ret = -EINVAL;
-		goto out;
+		inet = inet_sk(new_sock->sk);
 	}
 
-	inet = inet_sk(new_sock->sk);
-
 #if IS_ENABLED(CONFIG_IPV6)
 	my_addr = &new_sock->sk->sk_v6_rcv_saddr;
 	peer_addr = &new_sock->sk->sk_v6_daddr;
@@ -150,7 +175,7 @@  int rds_tcp_accept_one(struct socket *sock)
 	peer_addr = &daddr;
 #endif
 	rdsdebug("accepted family %d tcp %pI6c:%u -> %pI6c:%u\n",
-		 sock->sk->sk_family,
+		 listen_sock->sk->sk_family,
 		 my_addr, ntohs(inet->inet_sport),
 		 peer_addr, ntohs(inet->inet_dport));
 
@@ -170,13 +195,13 @@  int rds_tcp_accept_one(struct socket *sock)
 	}
 #endif
 
-	if (!rds_tcp_laddr_check(sock_net(sock->sk), peer_addr, dev_if)) {
+	if (!rds_tcp_laddr_check(sock_net(listen_sock->sk), peer_addr, dev_if)) {
 		/* local address connection is only allowed via loopback */
 		ret = -EOPNOTSUPP;
 		goto out;
 	}
 
-	conn = rds_conn_create(sock_net(sock->sk),
+	conn = rds_conn_create(sock_net(listen_sock->sk),
 			       my_addr, peer_addr,
 			       &rds_tcp_transport, 0, GFP_KERNEL, dev_if);
 
@@ -189,15 +214,51 @@  int rds_tcp_accept_one(struct socket *sock)
 	 * If the client reboots, this conn will need to be cleaned up.
 	 * rds_tcp_state_change() will do that cleanup
 	 */
-	rs_tcp = rds_tcp_accept_one_path(conn);
-	if (!rs_tcp)
+	if (rds_addr_cmp(&conn->c_faddr, &conn->c_laddr) < 0) {
+		/* Try to obtain a free connection slot.
+		 * If unsuccessful, we need to preserve "new_sock"
+		 * that we just accepted, since its "sk_receive_queue"
+		 * may already contain messages that have been acknowledged
+		 * to, and discarded by, the sender.
+		 * We must not throw those away!
+		 */
+		rs_tcp = rds_tcp_accept_one_path(conn);
+		if (!rs_tcp) {
+			/* It's okay to stash "new_sock", since
+			 * "rds_tcp_conn_slots_available" triggers "rds_tcp_accept_one"
+			 * again as soon as one of the connection slots
+			 * becomes available
+			 */
+			rtn->rds_tcp_accepted_sock = new_sock;
+			new_sock = NULL;
+			ret = -ENOBUFS;
+			goto out;
+		}
+	} else {
+		/* This connection request came from a peer with
+		 * a larger address.
+		 * Function "rds_tcp_state_change" makes sure
+		 * that the connection doesn't transition
+		 * to state "RDS_CONN_UP", and therefore
+		 * we should not have received any messages
+		 * on this socket yet.
+		 * This is the only case where it's okay to
+		 * not dequeue messages from "sk_receive_queue".
+		 */
+		if (conn->c_npaths <= 1)
+			rds_conn_path_connect_if_down(&conn->c_path[0]);
+		rs_tcp = NULL;
 		goto rst_nsk;
+	}
+
 	mutex_lock(&rs_tcp->t_conn_path_lock);
 	cp = rs_tcp->t_cpath;
 	conn_state = rds_conn_path_state(cp);
 	WARN_ON(conn_state == RDS_CONN_UP);
-	if (conn_state != RDS_CONN_CONNECTING && conn_state != RDS_CONN_ERROR)
+	if (conn_state != RDS_CONN_CONNECTING && conn_state != RDS_CONN_ERROR) {
+		rds_conn_path_drop(cp, 0);
 		goto rst_nsk;
+	}
 	if (rs_tcp->t_sock) {
 		/* Duelling SYN has been handled in rds_tcp_accept_one() */
 		rds_tcp_reset_callbacks(new_sock, cp);
@@ -228,6 +289,9 @@  int rds_tcp_accept_one(struct socket *sock)
 		mutex_unlock(&rs_tcp->t_conn_path_lock);
 	if (new_sock)
 		sock_release(new_sock);
+
+	mutex_unlock(&rtn->rds_tcp_accept_lock);
+
 	return ret;
 }
 
@@ -255,7 +319,7 @@  void rds_tcp_listen_data_ready(struct sock *sk)
 	 * the listen socket is being torn down.
 	 */
 	if (sk->sk_state == TCP_LISTEN)
-		rds_tcp_accept_work(sk);
+		rds_tcp_accept_work(net_generic(sock_net(sk), rds_tcp_netid));
 	else
 		ready = rds_tcp_listen_sock_def_readable(sock_net(sk));