Message ID | 20230508075636.352138-1-yanjun.zhu@intel.com (mailing list archive) |
---|---|
Series | Fix the problem that rxe can not work in net namespace |
On 5/8/23 02:56, Zhu Yanjun wrote:
> From: Zhu Yanjun <yanjun.zhu@linux.dev>
>
> When the "rdma link add" command is run to add an rxe rdma link in a net
> namespace, this rxe rdma link normally cannot work in that net
> namespace.
>
> The root cause is that a sock listening on udp port 4791 is created
> in init_net when the rdma_rxe module is loaded into the kernel. It is
> difficult for other net namespaces to use this sock.
>
> The following commits solve this problem.
>
> In the first commit, the creation of the sock listening on udp port 4791
> is moved from the module_init function to the rdma link creating functions.
> That is, after the module rdma_rxe is loaded, the sock is not created.
> When the "rdma link add ..." command is run, the sock is created. So
> when an rdma link is created in a net namespace, the sock is
> created in this net namespace.
>
> In the second commit, the functions udp4_lib_lookup and udp6_lib_lookup
> check whether the sock exists in the net namespace. If it does, the rdma
> link increases the reference count of this sock, then continues with other
> jobs instead of creating a new sock to listen on udp port 4791. Since the
> network notifier is global, this notifier is registered when the module
> rdma_rxe is loaded.
>
> After the rdma link is created, the command "rdma link del" deletes the
> rdma link and, at the same time, checks the sock. If the reference
> count of this sock is greater than the sock reference count needed by the
> udp tunnel, the sock reference count is decreased by one. If it is equal,
> this rdma link is the last one. As such, the udp tunnel
> is shut down and the sock is closed. The above work should be
> implemented in a dellink function. But currently there is no dellink
> function in rxe. So the 3rd commit adds the dellink function pointer.
> And the 4th commit implements the dellink function in rxe.
>
> At this point, it is no longer necessary to keep a global variable to store
> the sock listening on udp port 4791. This global variable can be replaced
> entirely by the functions udp4_lib_lookup and udp6_lib_lookup. Because the
> function udp6_lib_lookup is in the fast path, a member variable l_sk6
> is added to store the sock. If l_sk6 is NULL, udp6_lib_lookup is called
> to look up the sock, and the sock is then stored in l_sk6 so that it
> can be used directly in the future.
>
> All the above work has been done in init_net, and it can also work in
> other net namespaces. So init_net is replaced by the individual net
> namespace. This is what the 6th commit does. Because an rxe device
> depends on the net device and the sock listening on udp port 4791,
> every rxe device is in exclusive mode in its individual net namespace.
> Other rdma netns operations will be considered in the future.
>
> In the 7th commit, the register_pernet_subsys/unregister_pernet_subsys
> functions are added. When a new net namespace is created, the init
> function initializes the sk4 and sk6 socks. The 2 socks are
> released when the net namespace is destroyed. The functions
> rxe_ns_pernet_sk4/rxe_ns_pernet_set_sk4 get and set sk4 in the net
> namespace. The functions rxe_ns_pernet_sk6/rxe_ns_pernet_set_sk6
> handle sk6. sk4 and sk6 are then used in the previous commits.
>
> As the sk4 and sk6 in the pernet namespace can be accessed, it is not
> necessary to add a new l_sk6. As such, in the 8th commit, l_sk6 is
> replaced with the sk6 in the pernet namespace.
>
> Test steps:
> 1) Suppose that 2 NICs are in 2 different net namespaces.
>
> # ip netns exec net0 ip link
> 3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
>     link/ether 00:1e:67:a0:22:3f brd ff:ff:ff:ff:ff:ff
>     altname enp5s0
>
> # ip netns exec net1 ip link
> 4: eno3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel
>     link/ether f8:e4:3b:3b:e4:10 brd ff:ff:ff:ff:ff:ff
>
> 2) Add rdma link in the different net namespace
> net0:
> # ip netns exec net0 rdma link add rxe0 type rxe netdev eno2
>
> net1:
> # ip netns exec net1 rdma link add rxe1 type rxe netdev eno3
>
> 3) Run rping test.
> net0
> # ip netns exec net0 rping -s -a 192.168.2.1 -C 1&
> [1] 1737
> # ip netns exec net1 rping -c -a 192.168.2.1 -d -v -C 1
> verbose
> count 1
> ...
> ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
> ...
>
> 4) Remove the rdma links from the net namespaces.
> net0:
> # ip netns exec net0 ss -lu
> State  Recv-Q Send-Q Local Address:Port Peer Address:Port Process
> UNCONN 0      0      0.0.0.0:4791       0.0.0.0:*
> UNCONN 0      0      [::]:4791          [::]:*
>
> # ip netns exec net0 rdma link del rxe0
>
> # ip netns exec net0 ss -lu
> State  Recv-Q Send-Q Local Address:Port Peer Address:Port Process
>
> net1:
> # ip netns exec net0 ss -lu
> State  Recv-Q Send-Q Local Address:Port Peer Address:Port Process
> UNCONN 0      0      0.0.0.0:4791       0.0.0.0:*
> UNCONN 0      0      [::]:4791          [::]:*
>
> # ip netns exec net1 rdma link del rxe1
>
> # ip netns exec net0 ss -lu
> State  Recv-Q Send-Q Local Address:Port Peer Address:Port Process
>
> V4->V5: Rebase the commits to V6.4-rc1
>
> V3->V4: Rebase the commits to rdma-next;
>
> V2->V3: 1) Add "rdma link del" example in the cover letter, and use "ss -lu" to
>            verify rdma link is removed.
>         2) Add register_pernet_subsys/unregister_pernet_subsys net namespace
>         3) Replace l_sk6 with sk6 of pernet_name_space
>
> V1->V2: Add the explicit initialization of sk6.
>
> Zhu Yanjun (8):
>   RDMA/rxe: Creating listening sock in newlink function
>   RDMA/rxe: Support more rdma links in init_net
>   RDMA/nldev: Add dellink function pointer
>   RDMA/rxe: Implement dellink in rxe
>   RDMA/rxe: Replace global variable with sock lookup functions
>   RDMA/rxe: add the support of net namespace
>   RDMA/rxe: Add the support of net namespace notifier
>   RDMA/rxe: Replace l_sk6 with sk6 in net namespace
>
>  drivers/infiniband/core/nldev.c     |   6 ++
>  drivers/infiniband/sw/rxe/Makefile  |   3 +-
>  drivers/infiniband/sw/rxe/rxe.c     |  35 +++++++-
>  drivers/infiniband/sw/rxe/rxe_net.c | 113 +++++++++++++++++------
>  drivers/infiniband/sw/rxe/rxe_net.h |   9 +-
>  drivers/infiniband/sw/rxe/rxe_ns.c  | 134 ++++++++++++++++++++++++++++
>  drivers/infiniband/sw/rxe/rxe_ns.h  |  17 ++++
>  include/rdma/rdma_netlink.h         |   2 +
>  8 files changed, 279 insertions(+), 40 deletions(-)
>  create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.c
>  create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.h
>

Zhu,

I did some simple experiments on netns functionality.

With your patch set applied and rxe0 created on enp6s0 and rxe1 created on lo in the default namespace

# sudo ip netns add test
# ip netns
test
# sudo ip netns exec test ip link
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
# sudo ip netns exec test ip link set dev lo up
# sudo ip netns exec test ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
# sudo ip netns exec test ip addr add dev lo fe80::0200:00ff:fe00:0000/64
[rxe doesn't work unless this IPV6 address is set]
# sudo ip netns exec test ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 fe80::200:ff:fe00:0/64 scope link
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
# sudo ip netns exec test ls /sys/class/infiniband
rxe0 rxe1
[These show up even though the ndevs do *not* belong to the test namespace! Probably OK.]
# sudo ip netns exec test rdma link add rxe2 type rxe netdev lo
# ls /sys/class/infiniband
rxe0 rxe1 rxe2
[The new rxe device shows up in the default namespace. At least we're consistent.]
# ib_send_bw -d rxe0 ... 192.168.0.27
[Works. Didn't break the existing rxe devices. Expected]
# ib_send_bw -d rxe1 ... 127.0.0.1
[Works. Expected]
# ib_send_bw -d rxe2 ... 127.0.0.1
IB device rxe2 not found
Unable to find the Infiniband/RoCE device
[Not work. Expected.]
# sudo ip netns exec test ib_send_bw -d rxe2 ... 127.0.0.1
IB device rxe2 not found
Unable to find the Infiniband/RoCE device
[Also not work. Turns out rxe2 device is gone after failure. Not expected.]
# sudo ip netns exec test rdma link add rxe2 type rxe netdev lo
# ls /sys/class/infiniband
rxe0 rxe1 rxe2
[Good. It's back]
# sudo ip netns exec test ib_send_bw -d rxe2 ... 127.0.0.1
[Works in test namespace! Expected.]
# sudo ip netns exec test ib_send_bw -d rxe1 ... 127.0.0.1
[Also works. Definitely not expected.]

My take, it sort of works. But there are some serious issues. You shouldn't be able to use the
rxe2 device in the default namespace. It would be nice if you couldn't see the rxe devices in each
other's namespaces (Like ip link or ip addr hide other namespace's devices.)

Bob
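
Step 4 of the quoted cover letter checks by eye, with "ss -lu", that the UDP listeners on port 4791 go away once the last rdma link in a namespace is deleted. A minimal sketch of how that check might be scripted is below. The function names count_roce_listeners and check_netns are made up for illustration and are not part of the patch set, and the parsing assumes the "ss -lu" column layout shown in the cover letter.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the patch set) for the "ss -lu" check.

ROCE_PORT=4791   # UDP port that rxe listens on, per the cover letter

# Read "ss -lu" output on stdin and print how many sockets are bound to
# UDP port $ROCE_PORT. Assumes the local address:port is the 4th
# whitespace-separated field, as in the output quoted in the cover letter.
count_roce_listeners() {
    awk -v port="$ROCE_PORT" '
        $4 ~ (":" port "$") { n++ }
        END { print n + 0 }
    '
}

# Print the listener count for one named net namespace (needs root).
check_netns() {
    ip netns exec "$1" ss -lu | count_roce_listeners
}
```

With the namespaces from the cover letter, "check_netns net0" would be expected to print 2 before "rdma link del rxe0" (one IPv4 and one IPv6 listener) and 0 afterwards, matching the quoted "ss -lu" output.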
On 6/21/23 16:09, Bob Pearson wrote:
> [...]
>
> My take, it sort of works. But there are some serious issues. You shouldn't be able to use the
> rxe2 device in the default namespace. It would be nice if you couldn't see the rxe devices in each
> other's namespaces (Like ip link or ip addr hide other namespace's devices.)
>
> Bob

Forgot to mention. It also is definitely not good that a process in the default namespace
can destroy a rxe device in the test namespace by trying to use it.

Bob
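
The visibility concern raised in this thread (rxe devices listed under /sys/class/infiniband in more than one namespace) can be checked mechanically. The sketch below is only an illustration: shared_rdma_devs is a hypothetical helper name, and it simply intersects two device listings such as the "ls /sys/class/infiniband" outputs shown in the test log.

```shell
#!/bin/sh
# Hypothetical helper (not from the patch set): print every device name
# that appears in both whitespace-separated listings passed as $1 and $2.
shared_rdma_devs() {
    for dev in $1; do
        for other in $2; do
            if [ "$dev" = "$other" ]; then
                echo "$dev"
            fi
        done
    done
}

# Example use (needs root, namespace name "test" as in the log above):
#   default_ns=$(ls /sys/class/infiniband)
#   test_ns=$(ip netns exec test ls /sys/class/infiniband)
#   shared_rdma_devs "$default_ns" "$test_ns"
```

Any name printed is visible from both namespaces. If rxe devices were hidden per namespace, as suggested above, the expectation would be empty output for devices created in different namespaces.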
On 2023/6/22 5:09, Bob Pearson wrote:
> [...]
>
> My take, it sort of works. But there are some serious issues. You shouldn't be able to use the
> rxe2 device in the default namespace. It would be nice if you couldn't see the rxe devices in each
> other's namespaces (Like ip link or ip addr hide other namespace's devices.)

Thanks, Bob. I will delve into your tests and reply to you tomorrow.

Zhu Yanjun

>
> Bob
在 2023/6/22 5:09, Bob Pearson 写道: > On 5/8/23 02:56, Zhu Yanjun wrote: >> From: Zhu Yanjun <yanjun.zhu@linux.dev> >> >> When run "ip link add" command to add a rxe rdma link in a net >> namespace, normally this rxe rdma link can not work in a net >> name space. >> >> The root cause is that a sock listening on udp port 4791 is created >> in init_net when the rdma_rxe module is loaded into kernel. That is, >> the sock listening on udp port 4791 is created in init_net. Other net >> namespace is difficult to use this sock. >> >> The following commits will solve this problem. >> >> In the first commit, move the creating sock listening on udp port 4791 >> from module_init function to rdma link creating functions. That is, >> after the module rdma_rxe is loaded, the sock will not be created. >> When run "rdma link add ..." command, the sock will be created. So >> when creating a rdma link in the net namespace, the sock will be >> created in this net namespace. >> >> In the second commit, the functions udp4_lib_lookup and udp6_lib_lookup >> will check the sock exists in the net namespace or not. If yes, rdma >> link will increase the reference count of this sock, then continue other >> jobs instead of creating a new sock to listen on udp port 4791. Since the >> network notifier is global, when the module rdma_rxe is loaded, this >> notifier will be registered. >> >> After the rdma link is created, the command "rdma link del" is to >> delete rdma link at the same time the sock is checked. If the reference >> count of this sock is greater than the sock reference count needed by >> udp tunnel, the sock reference count is decreased by one. If equal, it >> indicates that this rdma link is the last one. As such, the udp tunnel >> is shut down and the sock is closed. The above work should be >> implemented in linkdel function. But currently no dellink function in >> rxe. So the 3rd commit addes dellink function pointer. 
And the 4th >> commit implements the dellink function in rxe. >> >> To now, it is not necessary to keep a global variable to store the sock >> listening udp port 4791. This global variable can be replaced by the >> functions udp4_lib_lookup and udp6_lib_lookup totally. Because the >> function udp6_lib_lookup is in the fast path, a member variable l_sk6 >> is added to store the sock. If l_sk6 is NULL, udp6_lib_lookup is called >> to lookup the sock, then the sock is stored in l_sk6, in the future,it >> can be used directly. >> >> All the above work has been done in init_net. And it can also work in >> the net namespace. So the init_net is replaced by the individual net >> namespace. This is what the 6th commit does. Because rxe device is >> dependent on the net device and the sock listening on udp port 4791, >> every rxe device is in exclusive mode in the individual net namespace. >> Other rdma netns operations will be considerred in the future. >> >> In the 7th commit, the register_pernet_subsys/unregister_pernet_subsys >> functions are added. When a new net namespace is created, the init >> function will initialize the sk4 and sk6 socks. Then the 2 socks will >> be released when the net namespace is destroyed. The functions >> rxe_ns_pernet_sk4/rxe_ns_pernet_set_sk4 will get and set sk4 in the net >> namespace. The functions rxe_ns_pernet_sk6/rxe_ns_pernet_set_sk6 will >> handle sk6. Then sk4 and sk6 are used in the previous commits. >> >> As the sk4 and sk6 in pernet namespace can be accessed, it is not >> necessary to add a new l_sk6. As such, in the 8th commit, the l_sk6 is >> replaced with the sk6 in pernet namespace. >> >> Test steps: >> 1) Suppose that 2 NICs are in 2 different net namespaces. 
>> >> # ip netns exec net0 ip link >> 3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP >> link/ether 00:1e:67:a0:22:3f brd ff:ff:ff:ff:ff:ff >> altname enp5s0 >> >> # ip netns exec net1 ip link >> 4: eno3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel >> link/ether f8:e4:3b:3b:e4:10 brd ff:ff:ff:ff:ff:ff >> >> 2) Add rdma link in the different net namespace >> net0: >> # ip netns exec net0 rdma link add rxe0 type rxe netdev eno2 >> >> net1: >> # ip netns exec net1 rdma link add rxe1 type rxe netdev eno3 >> >> 3) Run rping test. >> net0 >> # ip netns exec net0 rping -s -a 192.168.2.1 -C 1& >> [1] 1737 >> # ip netns exec net1 rping -c -a 192.168.2.1 -d -v -C 1 >> verbose >> count 1 >> ... >> ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr >> ... >> >> 4) Remove the rdma links from the net namespaces. >> net0: >> # ip netns exec net0 ss -lu >> State Recv-Q Send-Q Local Address:Port Peer Address:Port Process >> UNCONN 0 0 0.0.0.0:4791 0.0.0.0:* >> UNCONN 0 0 [::]:4791 [::]:* >> >> # ip netns exec net0 rdma link del rxe0 >> >> # ip netns exec net0 ss -lu >> State Recv-Q Send-Q Local Address:Port Peer Address:Port Process >> >> net1: >> # ip netns exec net0 ss -lu >> State Recv-Q Send-Q Local Address:Port Peer Address:Port Process >> UNCONN 0 0 0.0.0.0:4791 0.0.0.0:* >> UNCONN 0 0 [::]:4791 [::]:* >> >> # ip netns exec net1 rdma link del rxe1 >> >> # ip netns exec net0 ss -lu >> State Recv-Q Send-Q Local Address:Port Peer Address:Port Process >> >> V4->V5: Rebase the commits to V6.4-rc1 >> >> V3->V4: Rebase the commits to rdma-next; >> >> V2->V3: 1) Add "rdma link del" example in the cover letter, and use "ss -lu" to >> verify rdma link is removed. >> 2) Add register_pernet_subsys/unregister_pernet_subsys net namespace >> 3) Replace l_sk6 with sk6 of pernet_name_space >> >> V1->V2: Add the explicit initialization of sk6. 
>>
>> Zhu Yanjun (8):
>>   RDMA/rxe: Creating listening sock in newlink function
>>   RDMA/rxe: Support more rdma links in init_net
>>   RDMA/nldev: Add dellink function pointer
>>   RDMA/rxe: Implement dellink in rxe
>>   RDMA/rxe: Replace global variable with sock lookup functions
>>   RDMA/rxe: add the support of net namespace
>>   RDMA/rxe: Add the support of net namespace notifier
>>   RDMA/rxe: Replace l_sk6 with sk6 in net namespace
>>
>>  drivers/infiniband/core/nldev.c     |   6 ++
>>  drivers/infiniband/sw/rxe/Makefile  |   3 +-
>>  drivers/infiniband/sw/rxe/rxe.c     |  35 +++++++-
>>  drivers/infiniband/sw/rxe/rxe_net.c | 113 +++++++++++++++++------
>>  drivers/infiniband/sw/rxe/rxe_net.h |   9 +-
>>  drivers/infiniband/sw/rxe/rxe_ns.c  | 134 ++++++++++++++++++++++++++++
>>  drivers/infiniband/sw/rxe/rxe_ns.h  |  17 ++++
>>  include/rdma/rdma_netlink.h         |   2 +
>>  8 files changed, 279 insertions(+), 40 deletions(-)
>>  create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.c
>>  create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.h
>>
> Zhu,
>
> I did some simple experiments on netns functionality.
>
> With your patch set applied and rxe0 created on enp6s0 and rxe1 created on lo in the default namespace
>
> # sudo ip netns add test
> # ip netns
> test
> # sudo ip netns exec test ip link
> 1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> # sudo ip netns exec test ip link set dev lo up
> # sudo ip netns exec test ip link
> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> # sudo ip netns exec test ip addr add dev lo fe80::0200:00ff:fe00:0000/64
> [rxe doesn't work unless this IPv6 address is set]
> # sudo ip netns exec test ip addr
> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>     inet 127.0.0.1/8 scope host lo
>        valid_lft forever preferred_lft forever
>     inet6 fe80::200:ff:fe00:0/64 scope link
>        valid_lft forever preferred_lft forever
>     inet6 ::1/128 scope host
>        valid_lft forever preferred_lft forever
> # sudo ip netns exec test ls /sys/class/infiniband
> rxe0 rxe1
> [These show up even though the ndevs do *not* belong to the test namespace! Probably OK.]
> # sudo ip netns exec test rdma link add rxe2 type rxe netdev lo
> # ls /sys/class/infiniband
> rxe0 rxe1 rxe2
> [The new rxe device shows up in the default namespace. At least we're consistent.]
> # ib_send_bw -d rxe0 ... 192.168.0.27
> [Works. Didn't break the existing rxe devices. Expected.]
> # ib_send_bw -d rxe1 ... 127.0.0.1
> [Works. Expected.]
> # ib_send_bw -d rxe2 ... 127.0.0.1
> IB device rxe2 not found
> Unable to find the Infiniband/RoCE device
> [Does not work. Expected.]
> # sudo ip netns exec test ib_send_bw -d rxe2 ... 127.0.0.1
> IB device rxe2 not found
> Unable to find the Infiniband/RoCE device
> [Also does not work. Turns out the rxe2 device is gone after the failure. Not expected.]
> # sudo ip netns exec test rdma link add rxe2 type rxe netdev lo
> # ls /sys/class/infiniband
> rxe0 rxe1 rxe2
> [Good. It's back]
> # sudo ip netns exec test ib_send_bw -d rxe2 ... 127.0.0.1
> [Works in test namespace! Expected.]
> # sudo ip netns exec test ib_send_bw -d rxe1 ... 127.0.0.1
> [Also works. Definitely not expected.]
>
> My take, it sort of works. But there are some serious issues. You shouldn't be able to use the
> rxe2 device in the default namespace. It would be nice if you couldn't see the rxe devices in each
> other's namespaces (Like ip link or ip addr hide other namespace's devices.)

Thanks a lot for your testing, Bob.

There are 2 modes in the rdma net namespace support. One is shared mode, the other is exclusive mode.
Shared mode is the default mode, so what you have found is normal.

The following explains both the shared and exclusive modes.

"rdma system set netns shared|exclusive" specifies the RDMA subsystem mode, either exclusive or shared.

When a user wants to assign a dedicated RDMA device to a particular network namespace, exclusive mode should be set before creating any network namespace. If there are active network namespaces and one or more RDMA devices exist, changing the mode from shared to exclusive returns error code EBUSY.

When the RDMA subsystem is in shared mode, an RDMA device is accessible in all network namespaces. When RDMA device isolation among multiple network namespaces is not needed, shared mode can be used.

It is preferred not to change the subsystem mode when there is active RDMA traffic running, even though it is supported.

Zhu Yanjun

>
> Bob
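The mode check described above can be sketched in shell. This is only a minimal sketch: it assumes an iproute2 `rdma` binary whose `rdma system show` prints a line of the form `netns shared ...` or `netns exclusive ...`, and the helper names `parse_netns_mode` and `ensure_exclusive` are hypothetical, not part of any tool.

```shell
#!/bin/sh
# Print the word that follows "netns" in `rdma system show`-style output
# read from stdin. Assumption: the line looks like "netns shared ..." or
# "netns exclusive ...".
parse_netns_mode() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "netns") { print $(i + 1); exit } }'
}

# Hypothetical helper: switch the RDMA subsystem to exclusive mode unless
# it is already exclusive. Needs root. As described above, the switch is
# expected to fail with EBUSY when net namespaces and RDMA devices already
# exist, so it should run before any "ip netns add".
ensure_exclusive() {
    mode=$(rdma system show | parse_netns_mode)
    if [ "$mode" != "exclusive" ]; then
        rdma system set netns exclusive
    fi
}
```

With these definitions, running `ensure_exclusive` as root before `ip netns add test` would be one way to repeat the experiment with isolation enabled; in shared mode the cross-namespace visibility reported above is the expected result.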
On 2023/6/22 5:27, Bob Pearson wrote:
> Forgot to mention. It also is definitely not good that a process in the default namespace can destroy
> a rxe device in the test namespace by trying to use it.

Thanks a lot.

I am not sure whether it is correct to destroy a rxe device from outside its net namespace.

With irdma/mlx5 rdma devices, we can also destroy them with the command "modprobe -v irdma/mlx5..." from outside the net namespace.

I am not sure if this is correct or not.

Zhu Yanjun

>
> Bob
On 6/23/23 02:15, Zhu Yanjun wrote:
> On 2023/6/22 5:27, Bob Pearson wrote:
>> Forgot to mention. It also is definitely not good that a process in the default namespace can destroy
>> a rxe device in the test namespace by trying to use it.
>
> Thanks a lot.
>
> I am not sure whether it is correct to destroy a rxe device from outside its net namespace.
>
> With irdma/mlx5 rdma devices, we can also destroy them with the command "modprobe -v irdma/mlx5..." from outside the net namespace.
>
> I am not sure if this is correct or not.
>
> Zhu Yanjun

I didn't intentionally destroy lo2. I just tried to access the rxe device but it failed.
The rxe device was destroyed as a side effect of failing to open it.

Bob
On 2023/6/23 20:59, Bob Pearson wrote:
> I didn't intentionally destroy lo2. I just tried to access the rxe device but it failed.
> The rxe device was destroyed as a side effect of failing to open it.

The GID of rxe can not be generated with lo. This is a problem, and Chuck Lever <cel@kernel.org> is going to fix it.

I am not sure whether the problem that you ran into is related to this.

Please run the tests again with a physical NIC.

Thanks a lot.
Zhu Yanjun

>
> Bob
-----Original Message-----
From: Zhu Yanjun <yanjun.zhu@linux.dev>
Sent: Friday, June 23, 2023 6:51 PM
To: Bob Pearson <rpearsonhpe@gmail.com>; Zhu Yanjun <yanjun.zhu@intel.com>; zyjzyj2000@gmail.com; jgg@ziepe.ca; leon@kernel.org; linux-rdma@vger.kernel.org; parav@nvidia.com; lehrer@gmail.com
Subject: Re: [PATCH v6.4-rc1 v5 0/8] Fix the problem that rxe can not work in net namespace

On 2023/6/23 20:59, Bob Pearson wrote:
> I didn't intentionally destroy lo2. I just tried to access the rxe device but it failed.
> The rxe device was destroyed as a side effect of failing to open it.

The GID of rxe can not be generated with lo. This is a problem. Now Chuck Lever <cel@kernel.org> will fix it.

Not sure if the problem that you confronted is related with this. Please use physical NIC to make tests again.

Thanks a lot.

Zhu Yanjun

>
> Bob

That was why I added the IPV6 address by hand. That created the gid table entry. This is
also a problem for all ethernet devices for distros that mangle the MAC address when creating the
IPV6 address as a security measure. These include Ubuntu which I use. So I have to always add
an IPV6 address based on the MAC address for any ethernet device.

Bob
> The GID of rxe can not be generated with lo. This is a problem.

I agree, and would like to see a fix. It's obviously going to be a very
useful use case for CI environments for upper layer storage protocols
such as NFS and SMB, for instance.

> Now Chuck Lever <cel@kernel.org> will fix it.

My understanding is that, because RoCE allows more than one port per egress
device, the mechanism for enabling rxe-on-lo is going to be different than
it is for iWARP -- or it might not be possible at all. That's why my siw
patches do not implement a fix for rxe.

Jason needs to outline a mechanism for it so we can see what needs to be
done. At which point, any interested party should be able to fix it.

--
Chuck Lever
On 2023/6/25 1:03, Pearson, Robert B wrote:
> That was why I added the IPV6 address by hand. That created the gid table entry. This is
> also a problem for all ethernet devices for distros that mangle the MAC address when creating the
> IPV6 address as a security measure. These include Ubuntu which I use. So I have to always add
> an IPV6 address based on the MAC address for any ethernet device.

Chuck Lever <cel@kernel.org> will fix this problem. Before the fix for this problem is merged, it is not good to make tests with rxe-on-lo.

It is difficult for us to tell the failure from lo or net name space. So please use physical NIC instead of lo now.

Zhu Yanjun

>
> Bob
On 2023/6/25 1:32, Chuck Lever III wrote:
>> The GID of rxe can not be generated with lo. This is a problem.
> I agree, and would like to see a fix. It's obviously going to be a very
> useful use case for CI environments for upper layer storage protocols
> such as NFS and SMB, for instance.
>
>
>> Now Chuck Lever <cel@kernel.org> will fix it.
> My understanding is that, because RoCE allows more than one port per egress
> device, the mechanism for enabling rxe-on-lo is going to be different than
> it is for iWARP -- or it might not be possible at all. That's why my siw
> patches do not implement a fix for rxe.
>
> Jason needs to outline a mechanism for it so we can see what needs to be
> done. At which point, any interested party should be able to fix it.

Looking forward to the fix.

Zhu Yanjun

>
>
> --
> Chuck Lever
>
>
From: Zhu Yanjun <yanjun.zhu@linux.dev> When run "ip link add" command to add a rxe rdma link in a net namespace, normally this rxe rdma link can not work in a net name space. The root cause is that a sock listening on udp port 4791 is created in init_net when the rdma_rxe module is loaded into kernel. That is, the sock listening on udp port 4791 is created in init_net. Other net namespace is difficult to use this sock. The following commits will solve this problem. In the first commit, move the creating sock listening on udp port 4791 from module_init function to rdma link creating functions. That is, after the module rdma_rxe is loaded, the sock will not be created. When run "rdma link add ..." command, the sock will be created. So when creating a rdma link in the net namespace, the sock will be created in this net namespace. In the second commit, the functions udp4_lib_lookup and udp6_lib_lookup will check the sock exists in the net namespace or not. If yes, rdma link will increase the reference count of this sock, then continue other jobs instead of creating a new sock to listen on udp port 4791. Since the network notifier is global, when the module rdma_rxe is loaded, this notifier will be registered. After the rdma link is created, the command "rdma link del" is to delete rdma link at the same time the sock is checked. If the reference count of this sock is greater than the sock reference count needed by udp tunnel, the sock reference count is decreased by one. If equal, it indicates that this rdma link is the last one. As such, the udp tunnel is shut down and the sock is closed. The above work should be implemented in linkdel function. But currently no dellink function in rxe. So the 3rd commit addes dellink function pointer. And the 4th commit implements the dellink function in rxe. To now, it is not necessary to keep a global variable to store the sock listening udp port 4791. 
This global variable can be replaced by the functions udp4_lib_lookup and udp6_lib_lookup totally. Because the function udp6_lib_lookup is in the fast path, a member variable l_sk6 is added to store the sock. If l_sk6 is NULL, udp6_lib_lookup is called to lookup the sock, then the sock is stored in l_sk6, in the future,it can be used directly. All the above work has been done in init_net. And it can also work in the net namespace. So the init_net is replaced by the individual net namespace. This is what the 6th commit does. Because rxe device is dependent on the net device and the sock listening on udp port 4791, every rxe device is in exclusive mode in the individual net namespace. Other rdma netns operations will be considerred in the future. In the 7th commit, the register_pernet_subsys/unregister_pernet_subsys functions are added. When a new net namespace is created, the init function will initialize the sk4 and sk6 socks. Then the 2 socks will be released when the net namespace is destroyed. The functions rxe_ns_pernet_sk4/rxe_ns_pernet_set_sk4 will get and set sk4 in the net namespace. The functions rxe_ns_pernet_sk6/rxe_ns_pernet_set_sk6 will handle sk6. Then sk4 and sk6 are used in the previous commits. As the sk4 and sk6 in pernet namespace can be accessed, it is not necessary to add a new l_sk6. As such, in the 8th commit, the l_sk6 is replaced with the sk6 in pernet namespace. Test steps: 1) Suppose that 2 NICs are in 2 different net namespaces. # ip netns exec net0 ip link 3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP link/ether 00:1e:67:a0:22:3f brd ff:ff:ff:ff:ff:ff altname enp5s0 # ip netns exec net1 ip link 4: eno3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel link/ether f8:e4:3b:3b:e4:10 brd ff:ff:ff:ff:ff:ff 2) Add rdma link in the different net namespace net0: # ip netns exec net0 rdma link add rxe0 type rxe netdev eno2 net1: # ip netns exec net1 rdma link add rxe1 type rxe netdev eno3 3) Run rping test. 
   net0:
   # ip netns exec net0 rping -s -a 192.168.2.1 -C 1&
   [1] 1737
   # ip netns exec net1 rping -c -a 192.168.2.1 -d -v -C 1
   verbose
   count 1
   ...
   ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
   ...

4) Remove the rdma links from the net namespaces.
   net0:
   # ip netns exec net0 ss -lu
   State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port  Process
   UNCONN  0       0             0.0.0.0:4791       0.0.0.0:*
   UNCONN  0       0                [::]:4791          [::]:*

   # ip netns exec net0 rdma link del rxe0

   # ip netns exec net0 ss -lu
   State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port  Process

   net1:
   # ip netns exec net1 ss -lu
   State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port  Process
   UNCONN  0       0             0.0.0.0:4791       0.0.0.0:*
   UNCONN  0       0                [::]:4791          [::]:*

   # ip netns exec net1 rdma link del rxe1

   # ip netns exec net1 ss -lu
   State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port  Process

V4->V5: Rebase the commits to v6.4-rc1.

V3->V4: Rebase the commits to rdma-next.

V2->V3: 1) Add an "rdma link del" example to the cover letter, and use
           "ss -lu" to verify that the rdma link is removed.
        2) Add register_pernet_subsys/unregister_pernet_subsys for
           net namespaces.
        3) Replace l_sk6 with the sk6 of the pernet namespace.

V1->V2: Add the explicit initialization of sk6.
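
The reference counting described above for "rdma link add" and "rdma link del" can be pictured with a small userspace model. This is only an illustrative sketch, not the rxe code: struct model_sock, struct model_netns, TUNNEL_BASE_REFS and the model_* functions are hypothetical names, and the assumption that the udp tunnel itself needs exactly one reference is made only for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define RXE_UDP_PORT     4791 /* udp port named in the cover letter */
#define TUNNEL_BASE_REFS 1    /* assumption: references the udp tunnel needs */

/* Hypothetical stand-in for the listening sock. */
struct model_sock {
	int port;
	int refcnt;
};

/* Hypothetical stand-in for a net namespace: at most one listener. */
struct model_netns {
	struct model_sock *listener;
};

/* Models the udp4_lib_lookup step: return the existing listener, if any. */
struct model_sock *model_lookup(struct model_netns *ns)
{
	return ns->listener;
}

/* Models "rdma link add": reuse the listener or create it on first use. */
int model_link_add(struct model_netns *ns)
{
	struct model_sock *sk = model_lookup(ns);

	if (sk) {
		sk->refcnt++;           /* another link shares the sock */
		return 0;
	}

	sk = malloc(sizeof(*sk));
	if (!sk)
		return -1;
	sk->port = RXE_UDP_PORT;
	sk->refcnt = TUNNEL_BASE_REFS;  /* first link: only the base count */
	ns->listener = sk;
	return 0;
}

/* Models "rdma link del": drop one reference or close the sock. */
void model_link_del(struct model_netns *ns)
{
	struct model_sock *sk = model_lookup(ns);

	if (!sk)
		return;
	assert(sk->refcnt >= TUNNEL_BASE_REFS);

	if (sk->refcnt > TUNNEL_BASE_REFS) {
		sk->refcnt--;           /* other links still use the sock */
	} else {
		free(sk);               /* last link: close the sock */
		ns->listener = NULL;
	}
}
```

In this model, adding two links in one namespace leaves the count at TUNNEL_BASE_REFS + 1, and the listener disappears only after the second model_link_del() call, which corresponds to the empty "ss -lu" output in step 4. A second namespace keeps its own listener and is not affected.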
Zhu Yanjun (8):
  RDMA/rxe: Creating listening sock in newlink function
  RDMA/rxe: Support more rdma links in init_net
  RDMA/nldev: Add dellink function pointer
  RDMA/rxe: Implement dellink in rxe
  RDMA/rxe: Replace global variable with sock lookup functions
  RDMA/rxe: add the support of net namespace
  RDMA/rxe: Add the support of net namespace notifier
  RDMA/rxe: Replace l_sk6 with sk6 in net namespace

 drivers/infiniband/core/nldev.c     |   6 ++
 drivers/infiniband/sw/rxe/Makefile  |   3 +-
 drivers/infiniband/sw/rxe/rxe.c     |  35 +++++++-
 drivers/infiniband/sw/rxe/rxe_net.c | 113 +++++++++++++++++------
 drivers/infiniband/sw/rxe/rxe_net.h |   9 +-
 drivers/infiniband/sw/rxe/rxe_ns.c  | 134 ++++++++++++++++++++++++++++
 drivers/infiniband/sw/rxe/rxe_ns.h  |  17 ++++
 include/rdma/rdma_netlink.h         |   2 +
 8 files changed, 279 insertions(+), 40 deletions(-)
 create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.c
 create mode 100644 drivers/infiniband/sw/rxe/rxe_ns.h
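
For readers who want a picture of the per-namespace sk4/sk6 storage that the 7th commit adds in rxe_ns.c, the following userspace model may help. It is a sketch under assumptions, not the code in this series: struct model_sock, struct model_pernet and the model_pernet_* functions are hypothetical names that only mimic the described behaviour of the init function, the get/set helpers and the release on namespace destruction.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel sock. */
struct model_sock {
	int family; /* 4 for the IPv4 listener, 6 for the IPv6 listener */
	int open;   /* 1 while usable, 0 once released */
};

/* Hypothetical per-namespace storage: one instance per net namespace. */
struct model_pernet {
	struct model_sock *sk4;
	struct model_sock *sk6;
};

/* Models the pernet init function: both socks start out unset. */
void model_pernet_init(struct model_pernet *pn)
{
	assert(pn != NULL);
	pn->sk4 = NULL;
	pn->sk6 = NULL;
}

/* Getters and setters, in the spirit of rxe_ns_pernet_sk4 and friends. */
struct model_sock *model_pernet_sk4(const struct model_pernet *pn)
{
	return pn->sk4;
}

void model_pernet_set_sk4(struct model_pernet *pn, struct model_sock *sk)
{
	pn->sk4 = sk;
}

struct model_sock *model_pernet_sk6(const struct model_pernet *pn)
{
	return pn->sk6;
}

void model_pernet_set_sk6(struct model_pernet *pn, struct model_sock *sk)
{
	pn->sk6 = sk;
}

/* Models the pernet exit function: release whatever is still stored.
 * Returns the number of socks released. */
int model_pernet_exit(struct model_pernet *pn)
{
	int released = 0;

	if (pn->sk4) {
		pn->sk4->open = 0;
		pn->sk4 = NULL;
		released++;
	}
	if (pn->sk6) {
		pn->sk6->open = 0;
		pn->sk6 = NULL;
		released++;
	}
	return released;
}
```

The point of the model is that each namespace owns its own pair of pointers, so setting sk4 and sk6 in one namespace leaves every other namespace untouched, and destroying a namespace releases only the socks stored in it.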