Message ID   | 20230601161407.GA9253@debian (mailing list archive)
State        | Accepted
Commit       | 7b355b76e2b32cc516969c01984efdf49b11fc81
Delegated to | Netdev Maintainers
Series       | gro: decrease size of CB
On Thu, Jun 1, 2023 at 6:14 PM Richard Gobert <richardbgobert@gmail.com> wrote:
>
> The GRO control block (NAPI_GRO_CB) is currently at its maximum size.
> This commit reduces its size by putting two groups of fields that are
> used only at different times into a union.
>
> Specifically, the fields frag0 and frag0_len are the fields that make up
> the frag0 optimisation mechanism, which is used during the initial
> parsing of the SKB.
>
> The fields last and age are used after the initial parsing, while the
> SKB is stored in the GRO list, waiting for other packets to arrive.
>
> There was one location in dev_gro_receive that modified the frag0 fields
> after setting last and age. I changed this accordingly without altering
> the code behaviour.
>
> Signed-off-by: Richard Gobert <richardbgobert@gmail.com>

Reviewed-by: Eric Dumazet <edumazet@google.com>
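A side note on the mechanics of the change: the saving comes from C11 anonymous unions. The standalone userspace sketch below (hypothetical names `cb_before`/`cb_after`, with field types reduced so it compiles outside the kernel; not the kernel definition itself) shows the size effect of overlaying the two groups; on a typical LP64 ABI it prints 32 bytes before versus 16 after.

```
#include <stdio.h>

/* Reduced stand-ins for the CB layout; `last` is a struct sk_buff *
 * in the kernel, a plain void * here so this compiles in userspace. */
struct cb_before {                /* all four fields side by side */
        void *frag0;
        unsigned int frag0_len;
        unsigned long age;
        void *last;
};

struct cb_after {                 /* the two groups overlay each other */
        union {
                struct {          /* live only while parsing the incoming skb */
                        void *frag0;
                        unsigned int frag0_len;
                };
                struct {          /* live only once the skb sits on the GRO list */
                        void *last;
                        unsigned long age;
                };
        };
};

int main(void)
{
        printf("before: %zu bytes, after: %zu bytes\n",
               sizeof(struct cb_before), sizeof(struct cb_after));
        return 0;
}
```

The overlay is only safe because, as the commit message argues, the two groups are never live at the same time.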
On 01/06/2023 19:14, Richard Gobert wrote:
> The GRO control block (NAPI_GRO_CB) is currently at its maximum size.
> This commit reduces its size by putting two groups of fields that are
> used only at different times into a union.
>
> Specifically, the fields frag0 and frag0_len are the fields that make up
> the frag0 optimisation mechanism, which is used during the initial
> parsing of the SKB.
>
> The fields last and age are used after the initial parsing, while the
> SKB is stored in the GRO list, waiting for other packets to arrive.
>
> There was one location in dev_gro_receive that modified the frag0 fields
> after setting last and age. I changed this accordingly without altering
> the code behaviour.
>
> Signed-off-by: Richard Gobert <richardbgobert@gmail.com>

Hello Richard,

I believe this commit broke gro over udp tunnels.
I'm running iperf tcp traffic over geneve interfaces and the bandwidth
is pretty much zero.

Turning off gro on the receiving side (or reverting this commit)
resolves the issue.
On 6/26/23 2:55 AM, Gal Pressman wrote:
> I believe this commit broke gro over udp tunnels.
> I'm running iperf tcp traffic over geneve interfaces and the bandwidth
> is pretty much zero.

Could you add a test script to tools/testing/selftests/net? It will help
catch future regressions.
On 6/28/23 6:42 AM, Gal Pressman wrote:
> On 27/06/2023 17:21, David Ahern wrote:
>> On 6/26/23 2:55 AM, Gal Pressman wrote:
>>> I believe this commit broke gro over udp tunnels.
>>> I'm running iperf tcp traffic over geneve interfaces and the bandwidth
>>> is pretty much zero.
>>
>> Could you add a test script to tools/testing/selftests/net? It will help
>> catch future regressions.
>
> I'm checking internally, someone from the team might be able to work on
> this, though I'm not sure that a test that verifies bandwidth makes much
> sense as a selftest.

With veth and namespaces I expect up to 25-30G performance levels,
depending on the test. When something fundamental breaks like this patch
a drop to < 1G would be a red flag, so there is value to the test.
> On 01/06/2023 19:14, Richard Gobert wrote:
> > The GRO control block (NAPI_GRO_CB) is currently at its maximum size.
> > This commit reduces its size by putting two groups of fields that are
> > used only at different times into a union.
> >
> > Specifically, the fields frag0 and frag0_len are the fields that make up
> > the frag0 optimisation mechanism, which is used during the initial
> > parsing of the SKB.
> >
> > The fields last and age are used after the initial parsing, while the
> > SKB is stored in the GRO list, waiting for other packets to arrive.
> >
> > There was one location in dev_gro_receive that modified the frag0 fields
> > after setting last and age. I changed this accordingly without altering
> > the code behaviour.
> >
> > Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
>
> Hello Richard,
>
> I believe this commit broke gro over udp tunnels.
> I'm running iperf tcp traffic over geneve interfaces and the bandwidth
> is pretty much zero.
>
> Turning off gro on the receiving side (or reverting this commit)
> resolves the issue.

Sorry for the late response. I am starting to look into it right now.

Can you please share more details about your setup?
- I'd like to see the output of these commands:
    ethtool -k
    sysctl net
- The iperf command
- Your network topology
I haven't been able to reproduce it yet, I tried two different setups:
- 2 VMs running locally on my PC, and a geneve interface for each. Over
  these geneve interfaces, I sent tcp traffic with a similar iperf
  command as yours.
- A geneve tunnel over veth peers inside two separate namespaces as
  David suggested.

The throughput looked fine and identical with and without my patch in both
setups.

Although I did validate it while working on the patch, a problem may arise
from:
- Packing CB members into a union, which could've led to some sort of
  corruption.
- Calling `gro_pull_from_frag0` on the current skb before inserting it
  into `gro_list`.

Could I ask you to run some tests:
- Running the script I attached here on one machine and checking whether
  it reproduces the problem.
- Reverting part of my commit:
  - Reverting the change to CB struct while keeping the changes to
    `gro_pull_from_frag0`.
  - Checking whether the regression remains.

Also, could you give me some more details:
- The VMs' NIC and driver. Are you using Qemu?
- iperf results.
- The exact kernel versions (commit hashes) you are using.
- Did you run the commands (sysctl/ethtool) on the receiving VM?

Here are the commands I used for the namespaces test's setup:
```
ip netns add ns1

ip link add veth0 type veth peer name veth1
ip link set veth1 netns ns1

ip a add 192.168.1.1/32 dev veth0
ip link set veth0 up
ip r add 192.168.1.0/24 dev veth0

ip netns exec ns1 ip a add 192.168.1.2/32 dev veth1
ip netns exec ns1 ip link set veth1 up
ip netns exec ns1 ip r add 192.168.1.0/24 dev veth1

ip link add name gnv0 type geneve id 1000 remote 192.168.1.2
ip a add 10.0.0.1/32 dev gnv0
ip link set gnv0 up
ip r add 10.0.1.1/32 dev gnv0

ip netns exec ns1 ip link add name gnv0 type geneve id 1000 remote 192.168.1.1
ip netns exec ns1 ip a add 10.0.1.1/32 dev gnv0
ip netns exec ns1 ip link set gnv0 up
ip netns exec ns1 ip r add 10.0.0.1/32 dev gnv0

ethtool -K veth0 generic-receive-offload off
ip netns exec ns1 ethtool -K veth1 generic-receive-offload off

# quick way to enable gro on veth devices
ethtool -K veth0 tcp-segmentation-offload off
ip netns exec ns1 ethtool -K veth1 tcp-segmentation-offload off
```

I'll continue looking into it on Monday. It would be great if someone from
your team can write a test that reproduces this issue.

Thanks.
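To make the union-corruption hypothesis above concrete: once the two field groups alias, a write to `last`/`age` destroys `frag0`/`frag0_len`. Below is a minimal userspace sketch of that failure mode (reduced, hypothetical types; `void *` stands in for `struct sk_buff *`, and this is an illustration, not the kernel code path itself):

```
#include <stdio.h>

/* Same overlay as the patch introduces into NAPI_GRO_CB. */
struct cb {
        union {
                struct {
                        void *frag0;
                        unsigned int frag0_len;
                };
                struct {
                        void *last;
                        unsigned long age;
                };
        };
};

int main(void)
{
        char payload[64];
        struct cb cb = { .frag0 = payload, .frag0_len = sizeof(payload) };

        /* The skb is queued on the GRO list: last/age get written... */
        cb.last = &cb;   /* stands in for NAPI_GRO_CB(skb)->last = skb */
        cb.age = 1000;   /* stands in for jiffies */

        /* ...so any later frag0 access reads last's bits instead. */
        printf("frag0 %s\n",
               cb.frag0 == (void *)payload ? "intact" : "clobbered");
        return 0;
}
```

This is why the patch moved the `gro_pull_from_frag0` call in `dev_gro_receive` to before `age` and `last` are set; any path that still reads `frag0` after that point would see garbage.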
On 30/06/2023 18:39, Richard Gobert wrote:
> I haven't been able to reproduce it yet, I tried two different setups:
> - 2 VMs running locally on my PC, and a geneve interface for each. Over
>   these geneve interfaces, I sent tcp traffic with a similar iperf
>   command as yours.
> - A geneve tunnel over veth peers inside two separate namespaces as
>   David suggested.
>
> The throughput looked fine and identical with and without my patch in both
> setups.
>
> Although I did validate it while working on the patch, a problem may arise
> from:
> - Packing CB members into a union, which could've led to some sort of
>   corruption.
> - Calling `gro_pull_from_frag0` on the current skb before inserting it
>   into `gro_list`.
>
> Could I ask you to run some tests:
> - Running the script I attached here on one machine and checking whether
>   it reproduces the problem.
> - Reverting part of my commit:
>   - Reverting the change to CB struct while keeping the changes to
>     `gro_pull_from_frag0`.
>   - Checking whether the regression remains.
>
> Also, could you give me some more details:
> - The VMs' NIC and driver. Are you using Qemu?
> - iperf results.
> - The exact kernel versions (commit hashes) you are using.
> - Did you run the commands (sysctl/ethtool) on the receiving VM?
>
> Here are the commands I used for the namespaces test's setup:
> [...]
>
> I'll continue looking into it on Monday. It would be great if someone from
> your team can write a test that reproduces this issue.
>
> Thanks.

Hey,

I don't have an answer for all of your questions yet, but it turns out I
left out an important detail, the issue reproduces when outer ipv6 is used.

I'm using ConnectX-6 Dx, with these scripts:

Server:
ip addr add 194.236.5.246/16 dev eth2
ip addr add ::12:236:5:246/96 dev eth2
ip link set dev eth2 up

ip link add p1_g464 type geneve id 464 remote ::12:236:4:245
ip link set dev p1_g464 up
ip addr add 196.236.5.1/16 dev p1_g464

Client:
ip addr add 194.236.4.245/16 dev eth2
ip addr add ::12:236:4:245/96 dev eth2
ip link set dev eth2 up

ip link add p0_g464 type geneve id 464 remote ::12:236:5:246
ip link set dev p0_g464 up
ip addr add 196.236.4.2/16 dev p0_g464

Once everything is set up, iperf -s on the server and
iperf -c 196.236.5.1 -i1 -t1000
on the client, should do the work.

Unfortunately, I haven't been able to reproduce the same issue with veth
interfaces.

Reverting the napi_gro_cb part indeed resolves the issue.

Thanks for taking a look!
On 02/07/2023 17:41, Gal Pressman wrote:
> On 30/06/2023 18:39, Richard Gobert wrote:
>> I haven't been able to reproduce it yet, I tried two different setups:
>> [...]
>
> Hey,
>
> I don't have an answer for all of your questions yet, but it turns out I
> left out an important detail, the issue reproduces when outer ipv6 is used.
> [...]
>
> Unfortunately, I haven't been able to reproduce the same issue with veth
> interfaces.
>
> Reverting the napi_gro_cb part indeed resolves the issue.
>
> Thanks for taking a look!

BTW, all testing is done after checking out to your commit:
7b355b76e2b3 ("gro: decrease size of CB")
Thank you for replying. I will check it out and update once there is something new.
I managed to reproduce it and found the bug that explains the problem you're experiencing. I submitted a bugfix here: https://lore.kernel.org/netdev/20230707121650.GA17677@debian/ Thanks!
On 07/07/2023 15:31, Richard Gobert wrote:
> I managed to reproduce it and found the bug that explains the problem
> you're experiencing.
> I submitted a bugfix here: https://lore.kernel.org/netdev/20230707121650.GA17677@debian/
> Thanks!

Thanks Richard!
Will test it and update.

BTW, did you manage to reproduce the issue with veth?
On 28/06/2023 17:19, David Ahern wrote:
> On 6/28/23 6:42 AM, Gal Pressman wrote:
>> On 27/06/2023 17:21, David Ahern wrote:
>>> On 6/26/23 2:55 AM, Gal Pressman wrote:
>>>> I believe this commit broke gro over udp tunnels.
>>>> I'm running iperf tcp traffic over geneve interfaces and the bandwidth
>>>> is pretty much zero.
>>>
>>> Could you add a test script to tools/testing/selftests/net? It will help
>>> catch future regressions.
>>
>> I'm checking internally, someone from the team might be able to work on
>> this, though I'm not sure that a test that verifies bandwidth makes much
>> sense as a selftest.
>
> With veth and namespaces I expect up to 25-30G performance levels,
> depending on the test. When something fundamental breaks like this patch
> a drop to < 1G would be a red flag, so there is value to the test.

Circling back to this, I believe such test already exists:
tools/testing/selftests/net/udpgro_fwd.sh

And it indeed fails before Richard's fix.

I guess all that's left is to actually run these tests :)?
On 8/23/23 7:43 AM, Gal Pressman wrote:
>> With veth and namespaces I expect up to 25-30G performance levels,
>> depending on the test. When something fundamental breaks like this patch
>> a drop to < 1G would be a red flag, so there is value to the test.
>
> Circling back to this, I believe such test already exists:
> tools/testing/selftests/net/udpgro_fwd.sh
>
> And it indeed fails before Richard's fix.
>
> I guess all that's left is to actually run these tests
diff --git a/include/net/gro.h b/include/net/gro.h
index a4fab706240d..7b47dd6ce94f 100644
--- a/include/net/gro.h
+++ b/include/net/gro.h
@@ -11,11 +11,23 @@
 #include <net/udp.h>
 
 struct napi_gro_cb {
-        /* Virtual address of skb_shinfo(skb)->frags[0].page + offset. */
-        void *frag0;
+        union {
+                struct {
+                        /* Virtual address of skb_shinfo(skb)->frags[0].page + offset. */
+                        void *frag0;
 
-        /* Length of frag0. */
-        unsigned int frag0_len;
+                        /* Length of frag0. */
+                        unsigned int frag0_len;
+                };
+
+                struct {
+                        /* used in skb_gro_receive() slow path */
+                        struct sk_buff *last;
+
+                        /* jiffies when first packet was created/queued */
+                        unsigned long age;
+                };
+        };
 
         /* This indicates where we are processing relative to skb->data. */
         int data_offset;
@@ -32,9 +44,6 @@ struct napi_gro_cb {
         /* Used in ipv6_gro_receive() and foo-over-udp */
         u16 proto;
 
-        /* jiffies when first packet was created/queued */
-        unsigned long age;
-
         /* Used in napi_gro_cb::free */
 #define NAPI_GRO_FREE 1
 #define NAPI_GRO_FREE_STOLEN_HEAD 2
@@ -77,9 +86,6 @@ struct napi_gro_cb {
 
         /* used to support CHECKSUM_COMPLETE for tunneling protocols */
         __wsum csum;
-
-        /* used in skb_gro_receive() slow path */
-        struct sk_buff *last;
 };
 
 #define NAPI_GRO_CB(skb) ((struct napi_gro_cb *)(skb)->cb)
diff --git a/net/core/gro.c b/net/core/gro.c
index 2d84165cb4f1..a709155994ad 100644
--- a/net/core/gro.c
+++ b/net/core/gro.c
@@ -460,6 +460,14 @@ static void gro_pull_from_frag0(struct sk_buff *skb, int grow)
         }
 }
 
+static void gro_try_pull_from_frag0(struct sk_buff *skb)
+{
+        int grow = skb_gro_offset(skb) - skb_headlen(skb);
+
+        if (grow > 0)
+                gro_pull_from_frag0(skb, grow);
+}
+
 static void gro_flush_oldest(struct napi_struct *napi, struct list_head *head)
 {
         struct sk_buff *oldest;
@@ -489,7 +497,6 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
         struct sk_buff *pp = NULL;
         enum gro_result ret;
         int same_flow;
-        int grow;
 
         if (netif_elide_gro(skb->dev))
                 goto normal;
@@ -564,17 +571,14 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
         else
                 gro_list->count++;
 
+        /* Must be called before setting NAPI_GRO_CB(skb)->{age|last} */
+        gro_try_pull_from_frag0(skb);
         NAPI_GRO_CB(skb)->age = jiffies;
         NAPI_GRO_CB(skb)->last = skb;
         if (!skb_is_gso(skb))
                 skb_shinfo(skb)->gso_size = skb_gro_len(skb);
         list_add(&skb->list, &gro_list->list);
         ret = GRO_HELD;
-
-pull:
-        grow = skb_gro_offset(skb) - skb_headlen(skb);
-        if (grow > 0)
-                gro_pull_from_frag0(skb, grow);
 ok:
         if (gro_list->count) {
                 if (!test_bit(bucket, &napi->gro_bitmask))
@@ -587,7 +591,8 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
 
 normal:
         ret = GRO_NORMAL;
-        goto pull;
+        gro_try_pull_from_frag0(skb);
+        goto ok;
 }
 
 struct packet_offload *gro_find_receive_by_type(__be16 type)
The GRO control block (NAPI_GRO_CB) is currently at its maximum size.
This commit reduces its size by putting two groups of fields that are
used only at different times into a union.

Specifically, the fields frag0 and frag0_len are the fields that make up
the frag0 optimisation mechanism, which is used during the initial
parsing of the SKB.

The fields last and age are used after the initial parsing, while the
SKB is stored in the GRO list, waiting for other packets to arrive.

There was one location in dev_gro_receive that modified the frag0 fields
after setting last and age. I changed this accordingly without altering
the code behaviour.

Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
---
 include/net/gro.h | 26 ++++++++++++++++----------
 net/core/gro.c    | 19 ++++++++++++-------
 2 files changed, 28 insertions(+), 17 deletions(-)