Message ID: 20211228130611.19124-1-tonylu@linux.alibaba.com (mailing list archive)
Series: RDMA device net namespace support for SMC
Hello:

This series was applied to netdev/net-next.git (master) by David S. Miller <davem@davemloft.net>:

On Tue, 28 Dec 2021 21:06:08 +0800 you wrote:
> This patch set introduces net namespace support for linkgroups.
>
> Patch 1 is the main approach to implement net ns support.
>
> Patches 2 - 4 are additional modifications that let us know the netns.
> Also, I will submit changes of smc-tools to github later.
>
> [...]

Here is the summary with links:
  - [1/4] net/smc: Introduce net namespace support for linkgroup
    https://git.kernel.org/netdev/net-next/c/0237a3a683e4
  - [2/4] net/smc: Add netlink net namespace support
    https://git.kernel.org/netdev/net-next/c/79d39fc503b4
  - [3/4] net/smc: Print net namespace in log
    https://git.kernel.org/netdev/net-next/c/de2fea7b39bf
  - [4/4] net/smc: Add net namespace for tracepoints
    https://git.kernel.org/netdev/net-next/c/a838f5084828

You are awesome, thank you!
On Tue, 2021-12-28 at 21:06 +0800, Tony Lu wrote:
> This patch set introduces net namespace support for linkgroups.
>
> Patch 1 is the main approach to implement net ns support.
>
> Patches 2 - 4 are additional modifications that let us know the netns.
> Also, I will submit changes of smc-tools to github later.
>
> Currently, smc doesn't support net namespace isolation. The ibdevs
> registered to smc are shared by all linkgroups and connections. When
> running applications in different net namespaces, such as container
> environments, applications should only use the ibdevs that belong to
> the same net namespace.
>
> This adds a new field, net, to the smc linkgroup struct. During first
> contact, it checks for a linkgroup with the same net namespace; if
> none is found, it creates one and initializes the net field with the
> first link's ibdev net namespace. When finding the rdma devices, it
> also checks the sk net device's and the ibdev's net namespaces. After
> the net namespace is destroyed, the net device and ibdev move back to
> the root net namespace; the old linkgroups won't be matched any more
> and wait to be freed.
>
> If rdma net namespace exclusive mode is not enabled, it behaves as
> before.
>
> Steps to enable and test net namespaces:
>
> 1. enable RDMA device net namespace exclusive support
>    rdma system set netns exclusive    # default is shared
>
> 2. create a new net namespace, move and initialize devices
>    ip netns add test1
>    rdma dev set mlx5_1 netns test1
>    ip link set dev eth2 netns test1
>    ip netns exec test1 ip link set eth2 up
>    ip netns exec test1 ip addr add ${HOST_IP}/26 dev eth2
>
> 3. set up server and client, connect N <-> M
>    ip netns exec test1 smc_run sockperf server --tcp                # server
>    ip netns exec test1 smc_run sockperf pp --tcp -i ${SERVER_IP}    # client
>
> 4. netns isolated linkgroups (2 * 2 mesh) with their own linkgroups
>    - server

Hi Tony,

I'm having a bit of trouble getting this to work for me and was
wondering if you could test my scenario or help me figure out what's
wrong.
I'm using network namespacing to be able to test traffic between two
VFs of the same card/port on a single Linux system. With one VF in each
of a client and a server namespace, traffic doesn't shortcut via
loopback. This works great for TCP, and with "rdma system set netns
exclusive" I can also verify that RDMA with "qperf -cm1 ... rc_bw" only
works once the respective RDMA device is also added to each namespace.

To try the same with SMC-R I used:

ip netns exec server smc_run qperf &
ip netns exec client smc_run qperf <ip_server> tcp_bw

With that, however, I only see fallback TCP connections in "ip netns
exec client watch smc_dbg". It doesn't seem to be an "smc_dbg" problem
either, since the performance with and without smc_run is the same. I
also do have the same PNET_ID set on the interfaces.

As an aside, do you know how to gracefully put the RDMA devices back
into the default namespace? For network interfaces I can use
"ip -n <ns> link set dev <iface> netns 1", but the equivalent
"ip netns exec <ns> rdma dev set <rdmadev> netns 1" doesn't work
because there is no PID variant. Deleting the namespace and killing
processes using the RDMA device does seem to get it back, but with some
delay.

Thanks,
Niklas
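[Editorial sketch: the two-namespace VF scenario described above can be reproduced with the commands from the cover letter. All names below (eth0/eth1, mlx5_0/mlx5_1, the addresses) are illustrative placeholders, not taken from either system in this thread; the commands are wrapped in a function so the file can be sourced without root.]

```shell
# Sketch of the client/server VF setup; names and addresses are made up.
# Source this file, then run setup_smc_netns as root on a suitable system.
setup_smc_netns() {
    rdma system set netns exclusive        # default is shared

    ip netns add server
    ip netns add client

    # One VF (netdev plus matching RDMA device) per namespace.
    rdma dev set mlx5_0 netns server
    ip link set dev eth0 netns server
    rdma dev set mlx5_1 netns client
    ip link set dev eth1 netns client

    ip netns exec server ip link set eth0 up
    ip netns exec server ip addr add 10.10.93.12/26 dev eth0
    ip netns exec client ip link set eth1 up
    ip netns exec client ip addr add 10.10.93.11/26 dev eth1
}
```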
On Thu, Feb 17, 2022 at 12:33:06PM +0100, Niklas Schnelle wrote:
> On Tue, 2021-12-28 at 21:06 +0800, Tony Lu wrote:
> > ---8<---
>
> Hi Tony,
>
> I'm having a bit of trouble getting this to work for me and was
> wondering if you could test my scenario or help me figure out what's
> wrong.
>
> I'm using network namespacing to be able to test traffic between two
> VFs of the same card/port with a single Linux system. By having one VF
> in each of a client and server namespace, traffic doesn't shortcut via
> loopback. This works great for TCP and with "rdma system set netns
> exclusive" I can also verify that RDMA with "qperf -cm1 ... rc_bw" only
> works once the respective RDMA device is also added to each namespace.
>
> When I try the same with SMC-R I tried:
>
> ip netns exec server smc_run qperf &
> ip netns exec client smc_run qperf <ip_server> tcp_bw
>
> With that however I only see fallback TCP connections in "ip netns exec
> client watch smc_dbg". It doesn't seem to be an "smc_dbg" problem
> either since the performance with and without smc_run is the same. I
> also do have the same PNET_ID set on the interfaces.

Hi Niklas,

I understand your problem. This connection falls back to TCP for
unknown reasons. You can find out the fallback reason of this
connection; it can help us find the root cause of the fallback. For
example, if SMC_CLC_DECL_MEM (0x01010000) occurs on this connection,
it means that there is not enough memory (smc_init_info, sndbuf, RMB,
proposal buf, clc msg).

Before you dig out the fallback reason, here are some potential
possibilities based on your environment. You can check this list:

- RDMA device availability in the netns. Run "ip netns exec server
  rdma dev" to check the RDMA devices on both server and client. If
  exclusive mode is set, different netns should see different devices.

- SMC-R device availability in the netns. Run "ip netns exec server
  smcr d" to check the SMC device availability list. Only if the eth
  name shows up in the list can the device be accessed from this netns.
  smc-tools matches the ethernet NIC with the RDMA device; it can only
  find the name of an eth NIC in the current netns, so the name is
  empty if the eth NIC doesn't belong to this netns.

  Net-Dev   IB-Dev   IB-P  IB-State  Type           Crit  #Links  PNET-ID
            mlx5_0   1     ACTIVE    RoCE_Express2  No    0
  eth2      mlx5_1   1     ACTIVE    RoCE_Express2  No    0

  This output shows we have ONE available RDMA device in this netns.

- Misc checks, such as memory usage, loopback connections and so on.

Also, you can check dmesg for device operations if you moved an RDMA
device to another netns. Every device operation is logged in dmesg.

# SMC module init, adds two RDMA devices.
[  +0.000512] smc: adding ib device mlx5_0 with port count 1
[  +0.000534] smc: ib device mlx5_0 port 1 has pnetid
[  +0.000516] smc: adding ib device mlx5_1 with port count 1
[  +0.000525] smc: ib device mlx5_1 port 1 has pnetid

# Move one RDMA device to another netns.
[Feb21 14:16] smc: removing ib device mlx5_1
[  +0.015723] smc: adding ib device mlx5_1 with port count 1
[  +0.000600] smc: ib device mlx5_1 port 1 has pnetid

> As an aside do you know how to gracefully put the RDMA devices back
> into the default namespace? For network interfaces I can use "ip -n
> <ns> link set dev <iface> netns 1" but the equivalent "ip netns exec
> <ns> rdma dev set <rdmadev> netns 1" doesn't work because there is no
> PID variant. Deleting the namespace and killing processes using the
> RDMA device does seem to get it back but with some delay.

Yes; after removing the net namespace, we need to wait for all
connections to shut down, because every sock holds a refcount on its
netns. I haven't moved a device back gracefully before, because in our
setup the lifetime of a container matches that of its RDMA device. But
you reminded me of this: after reading the iproute2 implementation, I
believe it's because iproute2 doesn't implement this (based on nsid)
for RDMA devices. RDMA core accepts RDMA_NLDEV_NET_NS_FD via netlink,
but iproute2 only handles a netns name (a string) in this function,
referring to the file created earlier by the ip command.
// iproute2/rdma/dev.c
static int dev_set_netns(struct rd *rd)
{
	char *netns_path;
	uint32_t seq;
	int netns;
	int ret;

	if (rd_no_arg(rd)) {
		pr_err("Please provide device name.\n");
		return -EINVAL;
	}

	// netns_path is created before by the ip command.
	// The file is located in /var/run/netns/{NS_NAME}, such as
	// /var/run/netns/server.
	if (asprintf(&netns_path, "%s/%s", NETNS_RUN_DIR, rd_argv(rd)) < 0)
		return -ENOMEM;

	netns = open(netns_path, O_RDONLY | O_CLOEXEC);
	if (netns < 0) {
		fprintf(stderr, "Cannot open network namespace \"%s\": %s\n",
			rd_argv(rd), strerror(errno));
		ret = -EINVAL;
		goto done;
	}

	rd_prepare_msg(rd, RDMA_NLDEV_CMD_SET, &seq,
		       (NLM_F_REQUEST | NLM_F_ACK));
	mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_DEV_INDEX, rd->dev_idx);
	// based on the fd of the target netns.
	mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_NET_NS_FD, netns);
	ret = rd_sendrecv_msg(rd, seq);
	close(netns);
done:
	free(netns_path);
	return ret;
}

I don't know if there are other tools that can do this for an RDMA
device, but we can do it ourselves by sending a netlink message with
RDMA_NLDEV_NET_NS_FD set to the fd of the desired netns, such as
/proc/1/ns/net.

Hope this information helps.

Best regards,
Tony Lu
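[Editorial note: a possible workaround for the aside, without writing netlink code, might be to give the initial netns a name, since iproute2 (v5.9 and later) can bind an existing PID's netns to a name via "ip netns attach". The device and namespace names below are examples from this thread; this sketch is untested here and is wrapped in a function so sourcing it needs no root.]

```shell
# Name init's netns, then move the RDMA device back to it by that name.
# Assumes iproute2 >= 5.9 for "ip netns attach"; run move_rdma_back as root.
move_rdma_back() {
    ip netns attach default 1                       # bind /proc/1/ns/net to the name "default"
    ip netns exec server rdma dev set roceP9p0s0 netns default
    ip netns delete default                         # drop the name; init's netns is unaffected
}
```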
On Mon, 2022-02-21 at 14:54 +0800, Tony Lu wrote:
> On Thu, Feb 17, 2022 at 12:33:06PM +0100, Niklas Schnelle wrote:
> > On Tue, 2021-12-28 at 21:06 +0800, Tony Lu wrote:
> > > ---8<---
> > ---8<---
>
> Hi Niklas,
>
> I understood your problem. This connection falls back to TCP for
> unknown reasons. You can find out the fallback reason of this
> connection. It can help us find out the root cause of fallbacks. For
> example, if SMC_CLC_DECL_MEM (0x01010000) occurs on this connection,
> it means that there is not enough memory (smc_init_info, sndbuf, RMB,
> proposal buf, clc msg).

Regarding the fallback reason: it seems to be that the RDMA device is
not found (0x03030000). In smc_dbg I see the following lines:

Server:
State   UID    Inode    Local Address            Peer Address             Intf  Mode  Shutd  Token  Sndbuf ..
LISTEN  00000  0103804  0.0.0.0:37373
ACTIVE  00000  0112895  ::ffff:10.10.93..:46093  ::ffff:10.10.93..:54474  0000  TCP   0x03030000
ACTIVE  00000  0112701  ::ffff:10.10.93..:19765  ::ffff:10.10.93..:51934  0000  TCP   0x03030000
LISTEN  00000  0112699  0.0.0.0:19765

Client:
State   UID    Inode    Local Address      Peer Address        Intf  Mode  Shutd  Token  Sndbuf ...
ACTIVE  00000  0116203  10.10.93.11:54474  10.10.93.12:46093   0000  TCP   0x05000000/0x03030000
ACTIVE  00000  0116201  10.10.93.11:51934  10.10.93.12:19765   0000  TCP   0x05000000/0x03030000

However, this doesn't match what I'm seeing in the other commands below.

> Before you dig out the fallback reason, here are some potential
> possibilities based on your environment. You can check this list:
>
> - RDMA device availability in the netns. Run "ip netns exec server
>   rdma dev" to check the RDMA devices on both server and client. If
>   exclusive mode is set, different netns should see different devices.

I get the following output, which looks as expected to me:

Server:
2: roceP9p0s0: node_type ca fw 14.25.1020 node_guid 1d82:ff9b:1bfe:2c28 sys_image_guid 282c:001b:9b03:9803
Client:
4: roceP11p0s0: node_type ca fw 14.25.1020 node_guid 0982:ff9b:63fe:64e7 sys_image_guid e764:0063:9b03:9803

> - SMC-R device availability in the netns. Run "ip netns exec server
>   smcr d" to check the SMC device availability list. Only if the eth
>   name shows up in the list can the device be accessed from this
>   netns. smc-tools matches the ethernet NIC with the RDMA device; it
>   can only find the name of an eth NIC in the current netns, so the
>   name is empty if the eth NIC doesn't belong to this netns.
>
>   Net-Dev   IB-Dev   IB-P  IB-State  Type           Crit  #Links  PNET-ID
>             mlx5_0   1     ACTIVE    RoCE_Express2  No    0
>   eth2      mlx5_1   1     ACTIVE    RoCE_Express2  No    0
>
>   This output shows we have ONE available RDMA device in this netns.

Here too things look good to me:

Server:

Net-Dev   IB-Dev    IB-P  IB-State  Type           Crit  #Links  PNET-ID
...
          roceP12p  1     ACTIVE    RoCE_Express2  No    0       NET26
          roceP11p  1     ACTIVE    RoCE_Express2  No    0       NET25
ens2076   roceP9p0  1     ACTIVE    RoCE_Express2  No    0       NET25

Client:

Net-Dev   IB-Dev    IB-P  IB-State  Type           Crit  #Links  PNET-ID
...
          roceP12p  1     ACTIVE    RoCE_Express2  No    0       NET26
ens1296   roceP11p  1     ACTIVE    RoCE_Express2  No    0       NET25
          roceP9p0  1     ACTIVE    RoCE_Express2  No    0       NET25

And I again confirmed that a pure RDMA workload ("qperf -cm1 ... rc_bw")
works with the RDMA namespacing set to exclusive, but only if I add the
RDMA devices to the namespaces. I do wonder why the other RDMA devices
are still visible in the above output though?

> - Misc checks, such as memory usage, loopback connections and so on.
>   Also, you can check dmesg for device operations if you moved an RDMA
>   device to another netns. Every device operation is logged in dmesg.
>
> # SMC module init, adds two RDMA devices.
> [  +0.000512] smc: adding ib device mlx5_0 with port count 1
> [  +0.000534] smc: ib device mlx5_0 port 1 has pnetid
> [  +0.000516] smc: adding ib device mlx5_1 with port count 1
> [  +0.000525] smc: ib device mlx5_1 port 1 has pnetid
>
> # Move one RDMA device to another netns.
> [Feb21 14:16] smc: removing ib device mlx5_1
> [  +0.015723] smc: adding ib device mlx5_1 with port count 1
> [  +0.000600] smc: ib device mlx5_1 port 1 has pnetid

There is no memory pressure, and SMC-R between two systems works.
I also see the smc add/remove messages in dmesg as you describe:

smc: removing ib device roceP11p0s0
smc: adding ib device roceP11p0s0 with port count 1
smc: ib device roceP11p0s0 port 1 has pnetid NET25
smc: removing ib device roceP9p0s0
smc: adding ib device roceP9p0s0 with port count 1
smc: ib device roceP9p0s0 port 1 has pnetid NET25
mlx5_core 000b:00:00.0 ens1296: Link up
mlx5_core 0009:00:00.0 ens2076: Link up
IPv6: ADDRCONF(NETDEV_CHANGE): ens2076: link becomes ready
smc: removing ib device roceP11p0s0
smc: adding ib device roceP11p0s0 with port count 1
smc: ib device roceP11p0s0 port 1 has pnetid NET25
mlx5_core 000b:00:00.0 ens1296: Link up
mlx5_core 0009:00:00.0 ens2076: Link up
smc: removing ib device roceP9p0s0
smc: adding ib device roceP9p0s0 with port count 1
smc: ib device roceP9p0s0 port 1 has pnetid NET25
IPv6: ADDRCONF(NETDEV_CHANGE): ens1296: link becomes ready

(The PCI addresses and resulting names are normal for s390.)

One thing I notice is that you don't seem to have a pnetid set in your
output. Did you redact those, or are you dealing differently with
PNETIDs? Maybe there is an issue with matching PNETIDs between RDMA
devices and network devices when namespaced?

I also tested with smc_chk instead of qperf to make sure it's not a
problem with LD_PRELOAD or anything like that. With that it simply
doesn't connect.
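[Editorial note: the hexadecimal values shown by smc_dbg are CLC decline reason codes. A tiny decoder for the three codes that appear in this thread could look as follows; the constant names are taken from the kernel's net/smc/smc_clc.h as I understand it, so treat the mapping as a sketch and verify against your kernel's headers.]

```python
# Decline codes as they appear in this thread; names are assumptions
# based on net/smc/smc_clc.h -- verify against your kernel version.
SMC_DECLINE_REASONS = {
    0x01010000: "SMC_CLC_DECL_MEM: insufficient memory resources",
    0x03030000: "SMC_CLC_DECL_NOSMCRDEV: no SMC-R (RDMA) device found",
    0x05000000: "SMC_CLC_DECL_PEERDECL: peer declined during handshake",
}

def decode_fallback(field):
    """Decode an smc_dbg fallback field such as '0x05000000/0x03030000'
    (local reason, optionally followed by the peer's reason)."""
    return [SMC_DECLINE_REASONS.get(int(p, 16), "unknown (%s)" % p)
            for p in field.split("/")]
```

For example, the client-side field 0x05000000/0x03030000 above reads as "peer declined" locally because the server side found no RDMA device.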
On Mon, Feb 21, 2022 at 04:30:32PM +0100, Niklas Schnelle wrote:
> On Mon, 2022-02-21 at 14:54 +0800, Tony Lu wrote:
> > On Thu, Feb 17, 2022 at 12:33:06PM +0100, Niklas Schnelle wrote:
> > > On Tue, 2021-12-28 at 21:06 +0800, Tony Lu wrote:
> > > > ---8<---
> > > ---8<---
> >
> > Hi Niklas,
> >
> > I understood your problem. This connection falls back to TCP for
> > unknown reasons. You can find out the fallback reason of this
> > connection. It can help us find out the root cause of fallbacks. For
> > example, if SMC_CLC_DECL_MEM (0x01010000) occurs on this connection,
> > it means that there is not enough memory (smc_init_info, sndbuf,
> > RMB, proposal buf, clc msg).
>
> Regarding the fallback reason: it seems to be that the RDMA device is
> not found (0x03030000). In smc_dbg I see the following lines:
>
> Server:
> State   UID    Inode    Local Address            Peer Address             Intf  Mode  Shutd  Token  Sndbuf ..
> LISTEN  00000  0103804  0.0.0.0:37373
> ACTIVE  00000  0112895  ::ffff:10.10.93..:46093  ::ffff:10.10.93..:54474  0000  TCP   0x03030000
> ACTIVE  00000  0112701  ::ffff:10.10.93..:19765  ::ffff:10.10.93..:51934  0000  TCP   0x03030000
> LISTEN  00000  0112699  0.0.0.0:19765
>
> Client:
> State   UID    Inode    Local Address      Peer Address        Intf  Mode  Shutd  Token  Sndbuf ...
> ACTIVE  00000  0116203  10.10.93.11:54474  10.10.93.12:46093   0000  TCP   0x05000000/0x03030000
> ACTIVE  00000  0116201  10.10.93.11:51934  10.10.93.12:19765   0000  TCP   0x05000000/0x03030000
>
> However, this doesn't match what I'm seeing in the other commands below.

Based on the fallback reason, the server didn't find a proper RDMA
device to start SMC-R with, so it fell back.

> > Before you dig out the fallback reason, here are some potential
> > possibilities based on your environment. You can check this list:
> >
> > - RDMA device availability in the netns. Run "ip netns exec server
> >   rdma dev" to check the RDMA devices on both server and client. If
> >   exclusive mode is set, different netns should see different
> >   devices.
>
> I get the following output, which looks as expected to me:
>
> Server:
> 2: roceP9p0s0: node_type ca fw 14.25.1020 node_guid 1d82:ff9b:1bfe:2c28 sys_image_guid 282c:001b:9b03:9803
> Client:
> 4: roceP11p0s0: node_type ca fw 14.25.1020 node_guid 0982:ff9b:63fe:64e7 sys_image_guid e764:0063:9b03:9803

It looks good for now.

> > - SMC-R device availability in the netns. Run "ip netns exec server
> >   smcr d" to check the SMC device availability list. Only if the eth
> >   name shows up in the list can the device be accessed from this
> >   netns. smc-tools matches the ethernet NIC with the RDMA device; it
> >   can only find the name of an eth NIC in the current netns, so the
> >   name is empty if the eth NIC doesn't belong to this netns.
> >
> >   Net-Dev   IB-Dev   IB-P  IB-State  Type           Crit  #Links  PNET-ID
> >             mlx5_0   1     ACTIVE    RoCE_Express2  No    0
> >   eth2      mlx5_1   1     ACTIVE    RoCE_Express2  No    0
> >
> >   This output shows we have ONE available RDMA device in this netns.
> Here too things look good to me:
>
> Server:
>
> Net-Dev   IB-Dev    IB-P  IB-State  Type           Crit  #Links  PNET-ID
> ...
>           roceP12p  1     ACTIVE    RoCE_Express2  No    0       NET26
>           roceP11p  1     ACTIVE    RoCE_Express2  No    0       NET25
> ens2076   roceP9p0  1     ACTIVE    RoCE_Express2  No    0       NET25
>
> Client:
>
> Net-Dev   IB-Dev    IB-P  IB-State  Type           Crit  #Links  PNET-ID
> ...
>           roceP12p  1     ACTIVE    RoCE_Express2  No    0       NET26
> ens1296   roceP11p  1     ACTIVE    RoCE_Express2  No    0       NET25
>           roceP9p0  1     ACTIVE    RoCE_Express2  No    0       NET25
>
> And I again confirmed that a pure RDMA workload ("qperf -cm1 ... rc_bw")
> works with the RDMA namespacing set to exclusive, but only if I add the
> RDMA devices to the namespaces. I do wonder why the other RDMA devices
> are still visible in the above output though?

SMC maintains its own list of ib devices, which is independent of the
rdma command. SMC registers handlers for ib device events: when an ib
device is removed or added, an event is triggered and SMC removes or
adds the device in its list. "smcr d" dumps the whole list; it is not
filtered by netns.

> > - Misc checks, such as memory usage, loopback connections and so on.
> >   Also, you can check dmesg for device operations if you moved an
> >   RDMA device to another netns. Every device operation is logged in
> >   dmesg.
> >
> > # SMC module init, adds two RDMA devices.
> > [  +0.000512] smc: adding ib device mlx5_0 with port count 1
> > [  +0.000534] smc: ib device mlx5_0 port 1 has pnetid
> > [  +0.000516] smc: adding ib device mlx5_1 with port count 1
> > [  +0.000525] smc: ib device mlx5_1 port 1 has pnetid
> >
> > # Move one RDMA device to another netns.
> > [Feb21 14:16] smc: removing ib device mlx5_1
> > [  +0.015723] smc: adding ib device mlx5_1 with port count 1
> > [  +0.000600] smc: ib device mlx5_1 port 1 has pnetid
>
> There is no memory pressure, and SMC-R between two systems works.
> I also see the smc add/remove messages in dmesg as you describe:
>
> smc: removing ib device roceP11p0s0
> smc: adding ib device roceP11p0s0 with port count 1
> smc: ib device roceP11p0s0 port 1 has pnetid NET25

It looks like s390 provides a pnetid by itself; other systems don't
implement that, and the user has to set the pnetid manually. Your dmesg
shows that you get the pnetid directly without setting it.

> smc: removing ib device roceP9p0s0
> smc: adding ib device roceP9p0s0 with port count 1
> smc: ib device roceP9p0s0 port 1 has pnetid NET25
> mlx5_core 000b:00:00.0 ens1296: Link up
> mlx5_core 0009:00:00.0 ens2076: Link up
> IPv6: ADDRCONF(NETDEV_CHANGE): ens2076: link becomes ready
> smc: removing ib device roceP11p0s0
> smc: adding ib device roceP11p0s0 with port count 1
> smc: ib device roceP11p0s0 port 1 has pnetid NET25
> mlx5_core 000b:00:00.0 ens1296: Link up
> mlx5_core 0009:00:00.0 ens2076: Link up
> smc: removing ib device roceP9p0s0
> smc: adding ib device roceP9p0s0 with port count 1
> smc: ib device roceP9p0s0 port 1 has pnetid NET25
> IPv6: ADDRCONF(NETDEV_CHANGE): ens1296: link becomes ready
>
> (The PCI addresses and resulting names are normal for s390.)
>
> One thing I notice is that you don't seem to have a pnetid set in your
> output. Did you redact those, or are you dealing differently with
> PNETIDs? Maybe there is an issue with matching PNETIDs between RDMA
> devices and network devices when namespaced?

It works okay if I set the pnetid in the different netns; the pnet
handling logic is untouched in my test environment.
$ ip netns exec test1 smcr d      # mlx5_1 with pnetid TEST1
Net-Dev   IB-Dev   IB-P  IB-State  Type           Crit  #Links  PNET-ID
          mlx5_0   1     ACTIVE    RoCE_Express2  No    0       *TEST0
eth2      mlx5_1   1     ACTIVE    RoCE_Express2  Yes   1       *TEST1

$ ip netns exec test1 smcss       # runs in SMCR mode
State   UID    Inode    Local Address      Peer Address        Intf  Mode
ACTIVE  00993  0045755  11.213.45.7:8091   11.213.45.19:48884  0000  SMCR

Based on the dmesg output and the fallback reason, you can check
whether the eth and ib devices are added to the pnet table correctly.
SMC tries to find a proper RDMA device in the pnet table, matched by
pnetid. Currently, the pnet table is per-netns, so the entries should
be added in the current netns.

If the arch doesn't enable CONFIG_HAVE_PNETID (s390 enables it), SMC
tries to use the handshake device when the pnet table is empty. If it
is enabled, SMC tries to find a device in the pnet table by pnetid;
with an empty pnet table no RDMA device is found, and the connection
falls back to TCP. So the default behavior differs when the table is
empty.

After investigating the pnet logic, I found some things that could be
improved in the original implementation, outside the scope of this
netns patch set, such as the limit of init_net in pnet_enter and
remove. I will start a discussion if needed.

Thanks,
Tony Lu
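[Editorial sketch: Tony's description of the empty-pnet-table behavior can be modeled roughly as below. This is a simplification for illustration, not kernel code; the function and parameter names are invented.]

```python
def pick_smcr_device(pnettable, arch_has_pnetid, handshake_dev, pnetid):
    """Model of the device selection described above (simplified).

    pnettable:       dict mapping pnetid -> RDMA device name (per-netns)
    arch_has_pnetid: True when the arch enables CONFIG_HAVE_PNETID (s390)
    handshake_dev:   RDMA device associated with the TCP handshake path
    Returns the chosen device name, or None (meaning: fall back to TCP).
    """
    if not pnettable:
        # Empty per-netns pnet table: non-PNETID arches fall back to the
        # handshake device; PNETID arches find nothing and decline.
        return None if arch_has_pnetid else handshake_dev
    # Non-empty table: match strictly by pnetid.
    return pnettable.get(pnetid)
```

In this model, a netns with no pnet entries on a non-PNETID arch still works via the handshake device, while the same situation on s390 yields no device, matching the 0x03030000 fallback Niklas observed.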