[v3,1/1] gro: decrease size of CB

Message ID 20230601161407.GA9253@debian (mailing list archive)
State Accepted
Commit 7b355b76e2b32cc516969c01984efdf49b11fc81
Delegated to: Netdev Maintainers
Series gro: decrease size of CB

Checks

Context                         Check    Description
netdev/series_format            warning  Target tree name not specified in the subject
netdev/tree_selection           success  Guessed tree name to be net-next
netdev/fixes_present            success  Fixes tag not required for -next series
netdev/header_inline            success  No static functions without inline keyword in header files
netdev/build_32bit              success  Errors and warnings before: 28; this patch: 28
netdev/cc_maintainers           success  CCed 6 of 6 maintainers
netdev/build_clang              success  Errors and warnings before: 8; this patch: 8
netdev/verify_signedoff         success  Signed-off-by tag matches author and committer
netdev/deprecated_api           success  None detected
netdev/check_selftest           success  No net selftest shell script
netdev/verify_fixes             success  No Fixes tag
netdev/build_allmodconfig_warn  success  Errors and warnings before: 28; this patch: 28
netdev/checkpatch               warning  WARNING: line length of 89 exceeds 80 columns
netdev/kdoc                     success  Errors and warnings before: 0; this patch: 0
netdev/source_inline            success  Was: 0; now: 0

Commit Message

Richard Gobert June 1, 2023, 4:14 p.m. UTC
The GRO control block (NAPI_GRO_CB) is currently at its maximum size.
This commit reduces its size by putting two groups of fields that are
used only at different times into a union.

Specifically, the fields frag0 and frag0_len are the fields that make up
the frag0 optimisation mechanism, which is used during the initial
parsing of the SKB.

The fields last and age are used after the initial parsing, while the
SKB is stored in the GRO list, waiting for other packets to arrive.

There was one location in dev_gro_receive that modified the frag0 fields
after setting last and age. I changed this accordingly without altering
the code behaviour.
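
To see the saving concretely, here is a minimal userspace model of the
change (simplified types and field set, not the real kernel struct; the
CB lives in skb->cb, which is 48 bytes, and the sizes below assume a
typical LP64 target):

```c
#include <stdio.h>

/* Before: all four fields coexist. */
struct cb_before {
	void *frag0;               /* fast path: initial parsing */
	unsigned int frag0_len;
	unsigned long age;         /* slow path: while on the GRO list */
	struct cb_before *last;
};

/* After: the two groups share storage, since their lifetimes are disjoint. */
struct cb_after {
	union {
		struct {           /* used only during initial parsing */
			void *frag0;
			unsigned int frag0_len;
		};
		struct {           /* used only while on the GRO list */
			struct cb_after *last;
			unsigned long age;
		};
	};
};

int main(void)
{
	/* Prints "before: 32, after: 16" on LP64. */
	printf("before: %zu, after: %zu\n",
	       sizeof(struct cb_before), sizeof(struct cb_after));
	return 0;
}
```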

Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
---
 include/net/gro.h | 26 ++++++++++++++++----------
 net/core/gro.c    | 19 ++++++++++++-------
 2 files changed, 28 insertions(+), 17 deletions(-)

Comments

Eric Dumazet June 6, 2023, 7:25 a.m. UTC | #1
On Thu, Jun 1, 2023 at 6:14 PM Richard Gobert <richardbgobert@gmail.com> wrote:
>
> The GRO control block (NAPI_GRO_CB) is currently at its maximum size.
> This commit reduces its size by putting two groups of fields that are
> used only at different times into a union.
>
> Specifically, the fields frag0 and frag0_len are the fields that make up
> the frag0 optimisation mechanism, which is used during the initial
> parsing of the SKB.
>
> The fields last and age are used after the initial parsing, while the
> SKB is stored in the GRO list, waiting for other packets to arrive.
>
> There was one location in dev_gro_receive that modified the frag0 fields
> after setting last and age. I changed this accordingly without altering
> the code behaviour.
>
> Signed-off-by: Richard Gobert <richardbgobert@gmail.com>

Reviewed-by: Eric Dumazet <edumazet@google.com>
Gal Pressman June 26, 2023, 8:55 a.m. UTC | #2
On 01/06/2023 19:14, Richard Gobert wrote:
> The GRO control block (NAPI_GRO_CB) is currently at its maximum size.
> This commit reduces its size by putting two groups of fields that are
> used only at different times into a union.
> 
> Specifically, the fields frag0 and frag0_len are the fields that make up
> the frag0 optimisation mechanism, which is used during the initial
> parsing of the SKB.
> 
> The fields last and age are used after the initial parsing, while the
> SKB is stored in the GRO list, waiting for other packets to arrive.
> 
> There was one location in dev_gro_receive that modified the frag0 fields
> after setting last and age. I changed this accordingly without altering
> the code behaviour.
> 
> Signed-off-by: Richard Gobert <richardbgobert@gmail.com>

Hello Richard,

I believe this commit broke gro over udp tunnels.
I'm running iperf tcp traffic over geneve interfaces and the bandwidth
is pretty much zero.

Turning off gro on the receiving side (or reverting this commit)
resolves the issue.
David Ahern June 27, 2023, 2:21 p.m. UTC | #3
On 6/26/23 2:55 AM, Gal Pressman wrote:
> I believe this commit broke gro over udp tunnels.
> I'm running iperf tcp traffic over geneve interfaces and the bandwidth
> is pretty much zero.
> 

Could you add a test script to tools/testing/selftests/net? It will help
catch future regressions.
David Ahern June 28, 2023, 2:19 p.m. UTC | #4
On 6/28/23 6:42 AM, Gal Pressman wrote:
> On 27/06/2023 17:21, David Ahern wrote:
>> On 6/26/23 2:55 AM, Gal Pressman wrote:
>>> I believe this commit broke gro over udp tunnels.
>>> I'm running iperf tcp traffic over geneve interfaces and the bandwidth
>>> is pretty much zero.
>>>
>>
>> Could you add a test script to tools/testing/selftests/net? It will help
>> catch future regressions.
>>
> 
> I'm checking internally; someone from the team might be able to work on
> this, though I'm not sure that a test that verifies bandwidth makes much
> sense as a selftest.
> 

With veth and namespaces I expect up to 25-30G performance levels,
depending on the test. When something fundamental breaks, as with this
patch, a drop to < 1G would be a red flag, so there is value in the test.
Richard Gobert June 29, 2023, 12:36 p.m. UTC | #5
> On 01/06/2023 19:14, Richard Gobert wrote:
> > The GRO control block (NAPI_GRO_CB) is currently at its maximum size.
> > This commit reduces its size by putting two groups of fields that are
> > used only at different times into a union.
> > 
> > Specifically, the fields frag0 and frag0_len are the fields that make up
> > the frag0 optimisation mechanism, which is used during the initial
> > parsing of the SKB.
> > 
> > The fields last and age are used after the initial parsing, while the
> > SKB is stored in the GRO list, waiting for other packets to arrive.
> > 
> > There was one location in dev_gro_receive that modified the frag0 fields
> > after setting last and age. I changed this accordingly without altering
> > the code behaviour.
> > 
> > Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
> 
> Hello Richard,
> 
> I believe this commit broke gro over udp tunnels.
> I'm running iperf tcp traffic over geneve interfaces and the bandwidth
> is pretty much zero.
> 
> Turning off gro on the receiving side (or reverting this commit)
> resolves the issue.

Sorry for the late response.
I am starting to look into it right now. Can you please share more details about your setup?
- I'd like to see the output of these commands:
  ethtool -k
  sysctl net
- The iperf command
- Your network topology
Richard Gobert June 30, 2023, 3:39 p.m. UTC | #6
I haven't been able to reproduce it yet; I tried two different setups:
    - 2 VMs running locally on my PC, and a geneve interface for each. Over
      these geneve interfaces, I sent tcp traffic with a similar iperf
      command as yours.
    - A geneve tunnel over veth peers inside two separate namespaces as
      David suggested.

The throughput looked fine and identical with and without my patch in both
setups.

Although I did validate it while working on the patch, a problem may arise
from:
    - Packing CB members into a union, which could've led to some sort of
      corruption.
    - Calling `gro_pull_from_frag0` on the current skb before inserting it
      into `gro_list`.
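
The first possibility, union aliasing, can be seen in a minimal userspace
model (illustrative only, not the kernel code; little-endian LP64 assumed):

```c
#include <stdio.h>

/* Minimal stand-in for the unionized CB. */
struct cb {
	union {
		struct {
			void *frag0;
			unsigned int frag0_len;
		};
		struct {
			void *last;
			unsigned long age;
		};
	};
};

int main(void)
{
	char payload[16] = "frag0 data";
	struct cb cb = { .frag0 = payload, .frag0_len = sizeof(payload) };

	/* Setting the slow-path fields reuses the same bytes... */
	cb.last = &cb;     /* overwrites frag0 */
	cb.age = 12345;    /* overwrites frag0_len */

	/* ...so any later fast-path access reads garbage. */
	printf("frag0 now %p (was %p), frag0_len now %u\n",
	       cb.frag0, (void *)payload, cb.frag0_len);
	return 0;
}
```

This is why the patch pulls from frag0 before the `last`/`age` assignments;
if any path still touched frag0 afterwards, it would see the aliased values.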

Could I ask you to run some tests:
    - Running the script I attached here on one machine and checking whether
      it reproduces the problem. 
    - Reverting part of my commit: 
        - Reverting the change to CB struct while keeping the changes to
          `gro_pull_from_frag0`.
        - Checking whether the regression remains.

Also, could you give me some more details:
    - The VMs' NIC and driver. Are you using Qemu? 
    - iperf results.
    - The exact kernel versions (commit hashes) you are using.
    - Did you run the commands (sysctl/ethtool) on the receiving VM?


Here are the commands I used for the namespace test setup:
```
ip netns add ns1

ip link add veth0 type veth peer name veth1
ip link set veth1 netns ns1

ip a add 192.168.1.1/32 dev veth0
ip link set veth0 up
ip r add 192.168.1.0/24 dev veth0

ip netns exec ns1 ip a add 192.168.1.2/32 dev veth1
ip netns exec ns1 ip link set veth1 up
ip netns exec ns1 ip r add 192.168.1.0/24 dev veth1

ip link add name gnv0 type geneve id 1000 remote 192.168.1.2
ip a add 10.0.0.1/32 dev gnv0
ip link set gnv0 up
ip r add 10.0.1.1/32 dev gnv0

ip netns exec ns1 ip link add name gnv0 type geneve id 1000 remote 192.168.1.1
ip netns exec ns1 ip a add 10.0.1.1/32 dev gnv0
ip netns exec ns1 ip link set gnv0 up
ip netns exec ns1 ip r add 10.0.0.1/32 dev gnv0

ethtool -K veth0 generic-receive-offload off
ip netns exec ns1 ethtool -K veth1 generic-receive-offload off

# quick way to enable gro on veth devices
ethtool -K veth0 tcp-segmentation-offload off
ip netns exec ns1 ethtool -K veth1 tcp-segmentation-offload off
```

I'll continue looking into it on Monday. It would be great if someone from
your team can write a test that reproduces this issue.

Thanks.
Gal Pressman July 2, 2023, 2:41 p.m. UTC | #7
On 30/06/2023 18:39, Richard Gobert wrote:
> I haven't been able to reproduce it yet; I tried two different setups:
>     - 2 VMs running locally on my PC, and a geneve interface for each. Over
>       these geneve interfaces, I sent tcp traffic with a similar iperf
>       command as yours.
>     - A geneve tunnel over veth peers inside two separate namespaces as
>       David suggested.
> 
> The throughput looked fine and identical with and without my patch in both
> setups.
> 
> Although I did validate it while working on the patch, a problem may arise
> from:
>     - Packing CB members into a union, which could've led to some sort of
>       corruption.
>     - Calling `gro_pull_from_frag0` on the current skb before inserting it
>       into `gro_list`.
> 
> Could I ask you to run some tests:
>     - Running the script I attached here on one machine and checking whether
>       it reproduces the problem. 
>     - Reverting part of my commit: 
>         - Reverting the change to CB struct while keeping the changes to
>           `gro_pull_from_frag0`.
>         - Checking whether the regression remains.
> 
> Also, could you give me some more details:
>     - The VMs' NIC and driver. Are you using Qemu? 
>     - iperf results.
>     - The exact kernel versions (commit hashes) you are using.
>     - Did you run the commands (sysctl/ethtool) on the receiving VM?
> 
> 
> Here are the commands I used for the namespace test setup:
> ```
> ip netns add ns1
> 
> ip link add veth0 type veth peer name veth1
> ip link set veth1 netns ns1
> 
> ip a add 192.168.1.1/32 dev veth0
> ip link set veth0 up
> ip r add 192.168.1.0/24 dev veth0
> 
> ip netns exec ns1 ip a add 192.168.1.2/32 dev veth1
> ip netns exec ns1 ip link set veth1 up
> ip netns exec ns1 ip r add 192.168.1.0/24 dev veth1
> 
> ip link add name gnv0 type geneve id 1000 remote 192.168.1.2
> ip a add 10.0.0.1/32 dev gnv0
> ip link set gnv0 up
> ip r add 10.0.1.1/32 dev gnv0
> 
> ip netns exec ns1 ip link add name gnv0 type geneve id 1000 remote 192.168.1.1
> ip netns exec ns1 ip a add 10.0.1.1/32 dev gnv0
> ip netns exec ns1 ip link set gnv0 up
> ip netns exec ns1 ip r add 10.0.0.1/32 dev gnv0
> 
> ethtool -K veth0 generic-receive-offload off
> ip netns exec ns1 ethtool -K veth1 generic-receive-offload off
> 
> # quick way to enable gro on veth devices
> ethtool -K veth0 tcp-segmentation-offload off
> ip netns exec ns1 ethtool -K veth1 tcp-segmentation-offload off
> ```
> 
> I'll continue looking into it on Monday. It would be great if someone from
> your team can write a test that reproduces this issue.
> 
> Thanks.

Hey,

I don't have an answer for all of your questions yet, but it turns out I
left out an important detail: the issue reproduces when outer IPv6 is used.

I'm using ConnectX-6 Dx, with these scripts:

Server:
ip addr add 194.236.5.246/16 dev eth2
ip addr add ::12:236:5:246/96 dev eth2
ip link set dev eth2 up

ip link add p1_g464 type geneve id 464 remote ::12:236:4:245
ip link set dev p1_g464 up
ip addr add 196.236.5.1/16 dev p1_g464

Client:
ip addr add 194.236.4.245/16 dev eth2
ip addr add ::12:236:4:245/96 dev eth2
ip link set dev eth2 up

ip link add p0_g464 type geneve id 464 remote ::12:236:5:246
ip link set dev p0_g464 up
ip addr add 196.236.4.2/16 dev p0_g464

Once everything is set up, running iperf -s on the server and
iperf -c 196.236.5.1 -i1 -t1000
on the client should do the work.

Unfortunately, I haven't been able to reproduce the same issue with veth
interfaces.

Reverting the napi_gro_cb part indeed resolves the issue.

Thanks for taking a look!
Gal Pressman July 2, 2023, 2:46 p.m. UTC | #8
On 02/07/2023 17:41, Gal Pressman wrote:
> On 30/06/2023 18:39, Richard Gobert wrote:
>> I haven't been able to reproduce it yet; I tried two different setups:
>>     - 2 VMs running locally on my PC, and a geneve interface for each. Over
>>       these geneve interfaces, I sent tcp traffic with a similar iperf
>>       command as yours.
>>     - A geneve tunnel over veth peers inside two separate namespaces as
>>       David suggested.
>>
>> The throughput looked fine and identical with and without my patch in both
>> setups.
>>
>> Although I did validate it while working on the patch, a problem may arise
>> from:
>>     - Packing CB members into a union, which could've led to some sort of
>>       corruption.
>>     - Calling `gro_pull_from_frag0` on the current skb before inserting it
>>       into `gro_list`.
>>
>> Could I ask you to run some tests:
>>     - Running the script I attached here on one machine and checking whether
>>       it reproduces the problem. 
>>     - Reverting part of my commit: 
>>         - Reverting the change to CB struct while keeping the changes to
>>           `gro_pull_from_frag0`.
>>         - Checking whether the regression remains.
>>
>> Also, could you give me some more details:
>>     - The VMs' NIC and driver. Are you using Qemu? 
>>     - iperf results.
>>     - The exact kernel versions (commit hashes) you are using.
>>     - Did you run the commands (sysctl/ethtool) on the receiving VM?
>>
>>
>> Here are the commands I used for the namespace test setup:
>> ```
>> ip netns add ns1
>>
>> ip link add veth0 type veth peer name veth1
>> ip link set veth1 netns ns1
>>
>> ip a add 192.168.1.1/32 dev veth0
>> ip link set veth0 up
>> ip r add 192.168.1.0/24 dev veth0
>>
>> ip netns exec ns1 ip a add 192.168.1.2/32 dev veth1
>> ip netns exec ns1 ip link set veth1 up
>> ip netns exec ns1 ip r add 192.168.1.0/24 dev veth1
>>
>> ip link add name gnv0 type geneve id 1000 remote 192.168.1.2
>> ip a add 10.0.0.1/32 dev gnv0
>> ip link set gnv0 up
>> ip r add 10.0.1.1/32 dev gnv0
>>
>> ip netns exec ns1 ip link add name gnv0 type geneve id 1000 remote 192.168.1.1
>> ip netns exec ns1 ip a add 10.0.1.1/32 dev gnv0
>> ip netns exec ns1 ip link set gnv0 up
>> ip netns exec ns1 ip r add 10.0.0.1/32 dev gnv0
>>
>> ethtool -K veth0 generic-receive-offload off
>> ip netns exec ns1 ethtool -K veth1 generic-receive-offload off
>>
>> # quick way to enable gro on veth devices
>> ethtool -K veth0 tcp-segmentation-offload off
>> ip netns exec ns1 ethtool -K veth1 tcp-segmentation-offload off
>> ```
>>
>> I'll continue looking into it on Monday. It would be great if someone from
>> your team can write a test that reproduces this issue.
>>
>> Thanks.
> 
> Hey,
> 
> I don't have an answer for all of your questions yet, but it turns out I
> left out an important detail: the issue reproduces when outer IPv6 is used.
> 
> I'm using ConnectX-6 Dx, with these scripts:
> 
> Server:
> ip addr add 194.236.5.246/16 dev eth2
> ip addr add ::12:236:5:246/96 dev eth2
> ip link set dev eth2 up
> 
> ip link add p1_g464 type geneve id 464 remote ::12:236:4:245
> ip link set dev p1_g464 up
> ip addr add 196.236.5.1/16 dev p1_g464
> 
> Client:
> ip addr add 194.236.4.245/16 dev eth2
> ip addr add ::12:236:4:245/96 dev eth2
> ip link set dev eth2 up
> 
> ip link add p0_g464 type geneve id 464 remote ::12:236:5:246
> ip link set dev p0_g464 up
> ip addr add 196.236.4.2/16 dev p0_g464
> 
> Once everything is set up, running iperf -s on the server and
> iperf -c 196.236.5.1 -i1 -t1000
> on the client should do the work.
> 
> Unfortunately, I haven't been able to reproduce the same issue with veth
> interfaces.
> 
> Reverting the napi_gro_cb part indeed resolves the issue.
> 
> Thanks for taking a look!

BTW, all testing is done after checking out your commit:
7b355b76e2b3 ("gro: decrease size of CB")
Richard Gobert July 3, 2023, 2:23 p.m. UTC | #9
Thank you for replying.
I will check it out and update once there is something new.
Richard Gobert July 7, 2023, 12:31 p.m. UTC | #10
I managed to reproduce it and found the bug that explains the problem
you're experiencing.
I submitted a bugfix here: https://lore.kernel.org/netdev/20230707121650.GA17677@debian/
Thanks!
Gal Pressman July 9, 2023, 6:55 a.m. UTC | #11
On 07/07/2023 15:31, Richard Gobert wrote:
> I managed to reproduce it and found the bug that explains the problem
> you're experiencing.
> I submitted a bugfix here: https://lore.kernel.org/netdev/20230707121650.GA17677@debian/
> Thanks!

Thanks Richard!
Will test it and update.

BTW, did you manage to reproduce the issue with veth?
Gal Pressman Aug. 23, 2023, 2:43 p.m. UTC | #12
On 28/06/2023 17:19, David Ahern wrote:
> On 6/28/23 6:42 AM, Gal Pressman wrote:
>> On 27/06/2023 17:21, David Ahern wrote:
>>> On 6/26/23 2:55 AM, Gal Pressman wrote:
>>>> I believe this commit broke gro over udp tunnels.
>>>> I'm running iperf tcp traffic over geneve interfaces and the bandwidth
>>>> is pretty much zero.
>>>>
>>>
>>> Could you add a test script to tools/testing/selftests/net? It will help
>>> catch future regressions.
>>>
>>
>> I'm checking internally; someone from the team might be able to work on
>> this, though I'm not sure that a test that verifies bandwidth makes much
>> sense as a selftest.
>>
> 
> With veth and namespaces I expect up to 25-30G performance levels,
> depending on the test. When something fundamental breaks, as with this
> patch, a drop to < 1G would be a red flag, so there is value in the test.

Circling back to this, I believe such a test already exists:
tools/testing/selftests/net/udpgro_fwd.sh

And it indeed fails before Richard's fix.

I guess all that's left is to actually run these tests :)?
David Ahern Aug. 24, 2023, 3:31 a.m. UTC | #13
On 8/23/23 7:43 AM, Gal Pressman wrote:
>> With veth and namespaces I expect up to 25-30G performance levels,
>> depending on the test. When something fundamental breaks, as with this
>> patch, a drop to < 1G would be a red flag, so there is value in the test.
> Circling back to this, I believe such a test already exists:
> tools/testing/selftests/net/udpgro_fwd.sh
> 
> And it indeed fails before Richard's fix.
> 
> I guess all that's left is to actually run these tests 

Patch

diff --git a/include/net/gro.h b/include/net/gro.h
index a4fab706240d..7b47dd6ce94f 100644
--- a/include/net/gro.h
+++ b/include/net/gro.h
@@ -11,11 +11,23 @@ 
 #include <net/udp.h>
 
 struct napi_gro_cb {
-	/* Virtual address of skb_shinfo(skb)->frags[0].page + offset. */
-	void	*frag0;
+	union {
+		struct {
+			/* Virtual address of skb_shinfo(skb)->frags[0].page + offset. */
+			void	*frag0;
 
-	/* Length of frag0. */
-	unsigned int frag0_len;
+			/* Length of frag0. */
+			unsigned int frag0_len;
+		};
+
+		struct {
+			/* used in skb_gro_receive() slow path */
+			struct sk_buff *last;
+
+			/* jiffies when first packet was created/queued */
+			unsigned long age;
+		};
+	};
 
 	/* This indicates where we are processing relative to skb->data. */
 	int	data_offset;
@@ -32,9 +44,6 @@  struct napi_gro_cb {
 	/* Used in ipv6_gro_receive() and foo-over-udp */
 	u16	proto;
 
-	/* jiffies when first packet was created/queued */
-	unsigned long age;
-
 /* Used in napi_gro_cb::free */
 #define NAPI_GRO_FREE             1
 #define NAPI_GRO_FREE_STOLEN_HEAD 2
@@ -77,9 +86,6 @@  struct napi_gro_cb {
 
 	/* used to support CHECKSUM_COMPLETE for tunneling protocols */
 	__wsum	csum;
-
-	/* used in skb_gro_receive() slow path */
-	struct sk_buff *last;
 };
 
 #define NAPI_GRO_CB(skb) ((struct napi_gro_cb *)(skb)->cb)
diff --git a/net/core/gro.c b/net/core/gro.c
index 2d84165cb4f1..a709155994ad 100644
--- a/net/core/gro.c
+++ b/net/core/gro.c
@@ -460,6 +460,14 @@  static void gro_pull_from_frag0(struct sk_buff *skb, int grow)
 	}
 }
 
+static void gro_try_pull_from_frag0(struct sk_buff *skb)
+{
+	int grow = skb_gro_offset(skb) - skb_headlen(skb);
+
+	if (grow > 0)
+		gro_pull_from_frag0(skb, grow);
+}
+
 static void gro_flush_oldest(struct napi_struct *napi, struct list_head *head)
 {
 	struct sk_buff *oldest;
@@ -489,7 +497,6 @@  static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
 	struct sk_buff *pp = NULL;
 	enum gro_result ret;
 	int same_flow;
-	int grow;
 
 	if (netif_elide_gro(skb->dev))
 		goto normal;
@@ -564,17 +571,14 @@  static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
 	else
 		gro_list->count++;
 
+	/* Must be called before setting NAPI_GRO_CB(skb)->{age|last} */
+	gro_try_pull_from_frag0(skb);
 	NAPI_GRO_CB(skb)->age = jiffies;
 	NAPI_GRO_CB(skb)->last = skb;
 	if (!skb_is_gso(skb))
 		skb_shinfo(skb)->gso_size = skb_gro_len(skb);
 	list_add(&skb->list, &gro_list->list);
 	ret = GRO_HELD;
-
-pull:
-	grow = skb_gro_offset(skb) - skb_headlen(skb);
-	if (grow > 0)
-		gro_pull_from_frag0(skb, grow);
 ok:
 	if (gro_list->count) {
 		if (!test_bit(bucket, &napi->gro_bitmask))
@@ -587,7 +591,8 @@  static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
 
 normal:
 	ret = GRO_NORMAL;
-	goto pull;
+	gro_try_pull_from_frag0(skb);
+	goto ok;
 }
 
 struct packet_offload *gro_find_receive_by_type(__be16 type)