[net-next] net: dsa: add GRO support via gro_cells

Message ID 20200406105910.32339-1-79537434260@yandex.com (mailing list archive)
State Mainlined
Commit e131a5634830047923c694b4ce0c3b31745ff01b

Commit Message

Alexander Lobakin April 6, 2020, 10:59 a.m. UTC
The gro_cells library is used by various encapsulating netdevices, such as
geneve, macsec and vxlan, to speed up processing of decapsulated traffic.
A CPU tag is a sort of "encapsulation", so we can use the same mechanism to
greatly improve overall DSA performance.
skbs are passed to the GRO layer after their CPU tags are removed, so we don't
need any new packet offload types, as I originally proposed in
the first GRO-over-DSA variant [1].

The size of struct gro_cells is sizeof(void *), so the hot struct
dsa_slave_priv becomes only 4/8 bytes bigger, and all critical fields
remain in one 32-byte cacheline.
The other positive side effect is that drivers for network devices
that can be shipped as CPU ports of DSA-driven switches can now use
napi_gro_frags() to pass skbs to the kernel. Packets built that way are
completely non-linear and are likely to be dropped without GRO.
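
The size argument above can be illustrated with a small user-space sketch. This is only a model under stated assumptions: struct gro_cells_model stands in for the kernel's struct gro_cells (a single pointer-sized member), and priv_before/priv_after with their xmit and stats fields are hypothetical stand-ins, not the real dsa_slave_priv layout.

```c
#include <stddef.h>

/* Stand-in for struct gro_cells: one pointer-sized member (assumption). */
struct gro_cells_model {
	void *cells;
};

/* Hypothetical hot fields of a private struct before the change. */
struct priv_before {
	void *xmit;
	void *stats;
};

/* The same hypothetical struct with the GRO cells handle embedded. */
struct priv_after {
	void *xmit;
	void *stats;
	struct gro_cells_model gcells;
};

/* Bytes added by embedding the handle: 4 on 32-bit, 8 on 64-bit targets. */
size_t priv_growth(void)
{
	return sizeof(struct priv_after) - sizeof(struct priv_before);
}

/* Non-zero if the modeled hot fields still fit in one cache line. */
int fits_in_cacheline(size_t cacheline_bytes)
{
	return sizeof(struct priv_after) <= cacheline_bytes;
}
```

With only pointer members in the model, priv_growth() equals sizeof(void *), and fits_in_cacheline(32) holds on both 32-bit and 64-bit targets. The real struct has different fields, so this only shows the shape of the reasoning.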

This was tested on a to-be-mainlined-soon Ethernet driver that uses
napi_gro_frags(), and the overall performance was on par with the
variant from [1], sometimes even better due to minimal overhead.
Tuning net.core.gro_normal_batch may help to push it to the limit
on particular setups and platforms.

[1] https://lore.kernel.org/netdev/20191230143028.27313-1-alobakin@dlink.ru/

Signed-off-by: Alexander Lobakin <79537434260@yandex.com>
---
 net/dsa/Kconfig    |  1 +
 net/dsa/dsa.c      |  2 +-
 net/dsa/dsa_priv.h |  3 +++
 net/dsa/slave.c    | 10 +++++++++-
 4 files changed, 14 insertions(+), 2 deletions(-)

Comments

Andrew Lunn April 6, 2020, 2:47 p.m. UTC | #1
Hi Alexander

net-next is closed at the moment, so you should have posted this with an
RFC prefix.

The implementation looks nice and simple. But it would be nice to have
some performance figures.

     Andrew
Alexander Lobakin April 6, 2020, 3:21 p.m. UTC | #2
06.04.2020, 17:48, "Andrew Lunn" <andrew@lunn.ch>:
> Hi Alexander

Hi Andrew!

> net-next is closed at the moment, so you should have posted this with an
> RFC prefix.

I saw that it's closed, but didn't know about "RFC" tags for that period,
sorry.

> The implementation looks nice and simple. But it would be nice to have
> some performance figures.

Will do, sure. I think I'll collect the stats with the various main receive
functions in the Ethernet driver (napi_gro_frags(), napi_gro_receive(),
netif_receive_skb(), netif_receive_skb_list()), both with and without this
patch, to make them as complete as possible.

Alexander Lobakin April 6, 2020, 5:34 p.m. UTC | #3
OK, so here we go.

My device is a 1.2 GHz 4-core MIPS32 R2. The Ethernet controller representing
the CPU port is capable of S/G, fraglist S/G, TSO4/6 and GSO UDP L4.
The tests were run through a simple IPoE VLAN NAT forwarding setup
(port0 <-> port1.218) with iperf3 in TCP mode.
net.core.gro_normal_batch is always set to 16, as that value seems to be
the most effective for this particular hardware and drivers.

Packet counters on eth0 are the real numbers of frames passing through the
interface. Counters on portX are purely software and are updated inside the
networking stack.

---------------------------------------------------------------------

netif_receive_skb() in Eth driver, no patch:

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-120.01 sec  9.00 GBytes   644 Mbits/sec  413  sender
[  5]   0.00-120.00 sec  8.99 GBytes   644 Mbits/sec       receiver

eth0
RX packets:7097731 errors:0 dropped:0 overruns:0 frame:0
TX packets:7097702 errors:0 dropped:0 overruns:0 carrier:0

port0
RX packets:426050 errors:0 dropped:0 overruns:0 frame:0
TX packets:6671829 errors:0 dropped:0 overruns:0 carrier:0

port1
RX packets:6671681 errors:0 dropped:0 overruns:0 frame:0
TX packets:425862 errors:0 dropped:0 overruns:0 carrier:0

port1.218
RX packets:6671677 errors:0 dropped:0 overruns:0 frame:0
TX packets:425851 errors:0 dropped:0 overruns:0 carrier:0

---------------------------------------------------------------------

netif_receive_skb_list() in Eth driver, no patch:

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-120.01 sec  9.48 GBytes   679 Mbits/sec  129  sender
[  5]   0.00-120.00 sec  9.48 GBytes   679 Mbits/sec       receiver

eth0
RX packets:7448098 errors:0 dropped:0 overruns:0 frame:0
TX packets:7448073 errors:0 dropped:0 overruns:0 carrier:0

port0
RX packets:416115 errors:0 dropped:0 overruns:0 frame:0
TX packets:7032121 errors:0 dropped:0 overruns:0 carrier:0

port1
RX packets:7031983 errors:0 dropped:0 overruns:0 frame:0
TX packets:415941 errors:0 dropped:0 overruns:0 carrier:0

port1.218
RX packets:7031978 errors:0 dropped:0 overruns:0 frame:0
TX packets:415930 errors:0 dropped:0 overruns:0 carrier:0

---------------------------------------------------------------------

napi_gro_receive() in Eth driver, no patch:

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-120.01 sec  10.0 GBytes   718 Mbits/sec  107  sender
[  5]   0.00-120.00 sec  10.0 GBytes   718 Mbits/sec       receiver

eth0
RX packets:7868281 errors:0 dropped:0 overruns:0 frame:0
TX packets:7868267 errors:0 dropped:0 overruns:0 carrier:0

port0
RX packets:429082 errors:0 dropped:0 overruns:0 frame:0
TX packets:7439343 errors:0 dropped:0 overruns:0 carrier:0

port1
RX packets:7439199 errors:0 dropped:0 overruns:0 frame:0
TX packets:428913 errors:0 dropped:0 overruns:0 carrier:0

port1.218
RX packets:7439195 errors:0 dropped:0 overruns:0 frame:0
TX packets:428902 errors:0 dropped:0 overruns:0 carrier:0

=====================================================================

netif_receive_skb() in Eth driver + patch:

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-120.01 sec  12.2 GBytes   870 Mbits/sec  2267 sender
[  5]   0.00-120.00 sec  12.2 GBytes   870 Mbits/sec       receiver

eth0
RX packets:9474792 errors:0 dropped:0 overruns:0 frame:0
TX packets:9474777 errors:0 dropped:0 overruns:0 carrier:0

port0
RX packets:455200 errors:0 dropped:0 overruns:0 frame:0
TX packets:353288 errors:0 dropped:0 overruns:0 carrier:0

port1
RX packets:9019592 errors:0 dropped:0 overruns:0 frame:0
TX packets:455035 errors:0 dropped:0 overruns:0 carrier:0

port1.218
RX packets:353144 errors:0 dropped:0 overruns:0 frame:0
TX packets:455024 errors:0 dropped:0 overruns:0 carrier:0

---------------------------------------------------------------------

netif_receive_skb_list() in Eth driver + patch:

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-120.01 sec  11.6 GBytes   827 Mbits/sec  2224 sender
[  5]   0.00-120.00 sec  11.5 GBytes   827 Mbits/sec       receiver

eth0
RX packets:8981651 errors:0 dropped:0 overruns:0 frame:0
TX packets:898187 errors:0 dropped:0 overruns:0 carrier:0

port0
RX packets:436159 errors:0 dropped:0 overruns:0 frame:0
TX packets:335665 errors:0 dropped:0 overruns:0 carrier:0

port1
RX packets:8545492 errors:0 dropped:0 overruns:0 frame:0
TX packets:436071 errors:0 dropped:0 overruns:0 carrier:0

port1.218
RX packets:335593 errors:0 dropped:0 overruns:0 frame:0
TX packets:436065 errors:0 dropped:0 overruns:0 carrier:0

-----------------------------------------------------------

napi_gro_receive() in Eth driver + patch:

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-120.01 sec  11.8 GBytes   855 Mbits/sec  122  sender
[  5]   0.00-120.00 sec  11.8 GBytes   855 Mbits/sec       receiver

eth0
RX packets:9292214 errors:0 dropped:0 overruns:0 frame:0
TX packets:9292190 errors:0 dropped:0 overruns:0 carrier:0

port0
RX packets:438516 errors:0 dropped:0 overruns:0 frame:0
TX packets:347236 errors:0 dropped:0 overruns:0 carrier:0

port1
RX packets:8853698 errors:0 dropped:0 overruns:0 frame:0
TX packets:438331 errors:0 dropped:0 overruns:0 carrier:0

port1.218
RX packets:347082 errors:0 dropped:0 overruns:0 frame:0
TX packets:438320 errors:0 dropped:0 overruns:0 carrier:0

-----------------------------------------------------------

The main goal is achieved: we get a performance boost of about 100-200 Mbps,
while the number of in-stack skbs is greatly reduced, from ~8-9 million to
~350000 (compare port0 TX and port1 RX without the patch and with it).
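
As a quick cross-check of the figures above, the following sketch recomputes the per-function bitrate gain and the frame-to-skb ratio from the posted iperf3 and counter output. The struct and function names (run, gain_mbps, coalesce_ratio) are illustrative only and not part of the patch. The numbers are copied from the tables: bitrates without and with the patch, and the port1 RX and port1.218 RX counters with the patch.

```c
#include <assert.h>

/* One benchmark pair: the same Eth driver receive function, no patch vs. patch. */
struct run {
	const char *rx_func;
	int mbps_before;           /* iperf3 bitrate without the patch */
	int mbps_after;            /* iperf3 bitrate with the patch */
	unsigned long frames_in;   /* port1 RX with the patch (per-frame) */
	unsigned long skbs_out;    /* port1.218 RX with the patch (after GRO) */
};

const struct run runs[] = {
	{ "netif_receive_skb()",      644, 870, 9019592UL, 353144UL },
	{ "netif_receive_skb_list()", 679, 827, 8545492UL, 335593UL },
	{ "napi_gro_receive()",       718, 855, 8853698UL, 347082UL },
};

/* Throughput gain in Mbits/sec for one pair of runs. */
int gain_mbps(const struct run *r)
{
	return r->mbps_after - r->mbps_before;
}

/* Average number of frames merged into one skb seen above the DSA port. */
double coalesce_ratio(const struct run *r)
{
	assert(r->skbs_out != 0);
	return (double)r->frames_in / (double)r->skbs_out;
}
```

For these inputs the gains come out as 226, 148 and 137 Mbits/sec, and the ratio is a little over 25 frames per skb in all three cases.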

The main bottleneck in the gro_cells setup is that the GRO layer starts to
work only after skbs have been processed by the DSA stack, so they
travel frame by frame until that point (RX counter on port1).

If one day we change the way incoming packets are handled (not
through a fake packet_type), we could avoid that by unblocking GRO
processing between the Eth driver and the DSA core.
With my custom packet_offload for ETH_P_XDSA, which works only for
my CPU tag format, I get about 910-920 Mbps on the same platform.
That approach doesn't fit mainline code of course, so I'm working on
alternative Rx paths for DSA, e.g. through net_device::rx_handler()
etc.

Until then, gro_cells really improve things a lot while the actual
patch is tiny.

Florian Fainelli April 6, 2020, 5:57 p.m. UTC | #4
On 4/6/2020 10:34 AM, Alexander Lobakin wrote:
> The main goal is achieved: we have about 100-200 Mbps of performance
> boost while in-stack skbs are greatly reduced from ~8-9 millions to
> ~350000 (compare port0 TX and port1 RX without patch and with it).

And the number of TCP retries is also lower, which likely means that we
are making better use of the flow control built into the hardware/driver
here?

BTW do you know why you have so many retries though? It sounds like your
flow control is missing a few edge cases, or that you have an incorrect
configuration of your TX admission queue.

Alexander Lobakin April 6, 2020, 7:11 p.m. UTC | #5
06.04.2020, 20:57, "Florian Fainelli" <f.fainelli@gmail.com>:
>>  The main goal is achieved: we have about 100-200 Mbps of performance
>>  boost while in-stack skbs are greatly reduced from ~8-9 millions to
>>  ~350000 (compare port0 TX and port1 RX without patch and with it).
>
> And the number of TCP retries is also lower, which likely means that we
> are making better use of the flow control built into the hardware/driver
> here?
>
> BTW do you know why you have so many retries though? It sounds like your
> flow control is missing a few edge cases, or that you have an incorrect
> configuration of your TX admission queue.

Well, I have the same question TBH. In the ~1.5 years that I've been
working on these switches, I've seen a pretty chaotic number of TCP
retransmissions each time I change something in the code. They are
less likely to happen when the average CPU load is lower, but ~100
is the best result I ever got.
Seems like I should stop trying to push software throughput to
the max for a while, pay more attention to this and to the hardware
configuration instead, and check whether I'm missing something :)

>>  The main bottleneck in gro_cells setup is that GRO layer starts to
>>  work only after skb are being processed by DSA stack, so they are
>>  going frame-by-frame until that moment (RX counter on port1).
>>
>>  If one day we change the way of handling incoming packets (not
>>  through fake packet_type), we could avoid that by unblocking GRO
>>  processing in between Eth driver and DSA core.
>>  With my custom packet_offload for ETH_P_XDSA that works only for
>>  my CPU tag format I have about ~910-920 Mbps on the same platform.
>>  This way doesn't fit mainline code of course, so I'm working on
>>  alternative Rx paths for DSA, e.g. through net_device::rx_handler()
>>  etc.
>>
>>  Until then, gro_cells really improve things a lot while the actual
>>  patch is tiny.
> --
> Florian
Florian Fainelli April 6, 2020, 8:16 p.m. UTC | #6
On 4/6/2020 12:11 PM, Alexander Lobakin wrote:
> 06.04.2020, 20:57, "Florian Fainelli" <f.fainelli@gmail.com>:
>> On 4/6/2020 10:34 AM, Alexander Lobakin wrote:
>>>  06.04.2020, 18:21, "Alexander Lobakin" <bloodyreaper@yandex.ru>:
>>>>  06.04.2020, 17:48, "Andrew Lunn" <andrew@lunn.ch>:
>>>>>   On Mon, Apr 06, 2020 at 01:59:10PM +0300, Alexander Lobakin wrote:
>>>>>>    gro_cells lib is used by different encapsulating netdevices, such as
>>>>>>    geneve, macsec, vxlan etc. to speed up decapsulated traffic processing.
>>>>>>    CPU tag is a sort of "encapsulation", and we can use the same mechs to
>>>>>>    greatly improve overall DSA performance.
>>>>>>    skbs are passed to the GRO layer after removing CPU tags, so we don't
>>>>>>    need any new packet offload types as it was firstly proposed by me in
>>>>>>    the first GRO-over-DSA variant [1].
>>>>>>
>>>>>>    The size of struct gro_cells is sizeof(void *), so hot struct
>>>>>>    dsa_slave_priv becomes only 4/8 bytes bigger, and all critical fields
>>>>>>    remain in one 32-byte cacheline.
>>>>>>    The other positive side effect is that drivers for network devices
>>>>>>    that can be shipped as CPU ports of DSA-driven switches can now use
>>>>>>    napi_gro_frags() to pass skbs to kernel. Packets built that way are
>>>>>>    completely non-linear and are likely being dropped without GRO.
>>>>>>
>>>>>>    This was tested on to-be-mainlined-soon Ethernet driver that uses
>>>>>>    napi_gro_frags(), and the overall performance was on par with the
>>>>>>    variant from [1], sometimes even better due to minimal overhead.
>>>>>>    net.core.gro_normal_batch tuning may help to push it to the limit
>>>>>>    on particular setups and platforms.
>>>>>>
>>>>>>    [1] https://lore.kernel.org/netdev/20191230143028.27313-1-alobakin@dlink.ru/
>>>>>
>>>>>   Hi Alexander
>>>>
>>>>  Hi Andrew!
>>>>
>>>>>   net-next is closed at the moment. So you should of posted this with an
>>>>>   RFC prefix.
>>>>
>>>>  I saw that it's closed, but didn't knew about "RFC" tags for that period,
>>>>  sorry.
>>>>
>>>>>   The implementation looks nice and simple. But it would be nice to have
>>>>>   some performance figures.
>>>>
>>>>  I'll do, sure. I think I'll collect the stats with various main receiving
>>>>  functions in Ethernet driver (napi_gro_frags(), napi_gro_receive(),
>>>>  netif_receive_skb(), netif_receive_skb_list()), and with and without this
>>>>  patch to make them as complete as possible.
>>>
>>>  OK, so here we go.
>>>
>>>  My device is 1.2 GHz 4-core MIPS32 R2. Ethernet controller representing
>>>  the CPU port is capable of S/G, fraglists S/G, TSO4/6 and GSO UDP L4.
>>>  Tests are performed through simple IPoE VLAN NAT forwarding setup
>>>  (port0 <-> port1.218) with iperf3 in TCP mode.
>>>  net.core.gro_normal_batch is always set to 16 as that value seems to be
>>>  the most effective for that particular hardware and drivers.
>>>
>>>  Packet counters on eth0 are the real numbers of ongoing frames. Counters
>>>  on portX are pure-software and are updated inside networking stack.
>>>
>>>  ---------------------------------------------------------------------
>>>
>>>  netif_receive_skb() in Eth driver, no patch:
>>>
>>>  [ ID] Interval Transfer Bitrate Retr
>>>  [ 5] 0.00-120.01 sec 9.00 GBytes 644 Mbits/sec 413 sender
>>>  [ 5] 0.00-120.00 sec 8.99 GBytes 644 Mbits/sec receiver
>>>
>>>  eth0
>>>  RX packets:7097731 errors:0 dropped:0 overruns:0 frame:0
>>>  TX packets:7097702 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>>  port0
>>>  RX packets:426050 errors:0 dropped:0 overruns:0 frame:0
>>>  TX packets:6671829 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>>  port1
>>>  RX packets:6671681 errors:0 dropped:0 overruns:0 carrier:0
>>>  TX packets:425862 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>>  port1.218
>>>  RX packets:6671677 errors:0 dropped:0 overruns:0 frame:0
>>>  TX packets:425851 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>>  ---------------------------------------------------------------------
>>>
>>>  netif_receive_skb_list() in Eth driver, no patch:
>>>
>>>  [ ID] Interval Transfer Bitrate Retr
>>>  [ 5] 0.00-120.01 sec 9.48 GBytes 679 Mbits/sec 129 sender
>>>  [ 5] 0.00-120.00 sec 9.48 GBytes 679 Mbits/sec receiver
>>>
>>>  eth0
>>>  RX packets:7448098 errors:0 dropped:0 overruns:0 frame:0
>>>  TX packets:7448073 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>>  port0
>>>  RX packets:416115 errors:0 dropped:0 overruns:0 frame:0
>>>  TX packets:7032121 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>>  port1
>>>  RX packets:7031983 errors:0 dropped:0 overruns:0 frame:0
>>>  TX packets:415941 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>>  port1.218
>>>  RX packets:7031978 errors:0 dropped:0 overruns:0 frame:0
>>>  TX packets:415930 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>>  ---------------------------------------------------------------------
>>>
>>>  napi_gro_receive() in Eth driver, no patch:
>>>
>>>  [ ID] Interval Transfer Bitrate Retr
>>>  [ 5] 0.00-120.01 sec 10.0 GBytes 718 Mbits/sec 107 sender
>>>  [ 5] 0.00-120.00 sec 10.0 GBytes 718 Mbits/sec receiver
>>>
>>>  eth0
>>>  RX packets:7868281 errors:0 dropped:0 overruns:0 frame:0
>>>  TX packets:7868267 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>>  port0
>>>  RX packets:429082 errors:0 dropped:0 overruns:0 frame:0
>>>  TX packets:7439343 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>>  port1
>>>  RX packets:7439199 errors:0 dropped:0 overruns:0 frame:0
>>>  TX packets:428913 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>>  port1.218
>>>  RX packets:7439195 errors:0 dropped:0 overruns:0 frame:0
>>>  TX packets:428902 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>>  =====================================================================
>>>
>>>  netif_receive_skb() in Eth driver + patch:
>>>
>>>  [ ID] Interval Transfer Bitrate Retr
>>>  [ 5] 0.00-120.01 sec 12.2 GBytes 870 Mbits/sec 2267 sender
>>>  [ 5] 0.00-120.00 sec 12.2 GBytes 870 Mbits/sec receiver
>>>
>>>  eth0
>>>  RX packets:9474792 errors:0 dropped:0 overruns:0 frame:0
>>>  TX packets:9474777 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>>  port0
>>>  RX packets:455200 errors:0 dropped:0 overruns:0 frame:0
>>>  TX packets:353288 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>>  port1
>>>  RX packets:9019592 errors:0 dropped:0 overruns:0 frame:0
>>>  TX packets:455035 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>>  port1.218
>>>  RX packets:353144 errors:0 dropped:0 overruns:0 frame:0
>>>  TX packets:455024 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>>  ---------------------------------------------------------------------
>>>
>>>  netif_receive_skb_list() in Eth driver + patch:
>>>
>>>  [ ID] Interval Transfer Bitrate Retr
>>>  [ 5] 0.00-120.01 sec 11.6 GBytes 827 Mbits/sec 2224 sender
>>>  [ 5] 0.00-120.00 sec 11.5 GBytes 827 Mbits/sec receiver
>>>
>>>  eth0
>>>  RX packets:8981651 errors:0 dropped:0 overruns:0 frame:0
>>>  TX packets:898187 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>>  port0
>>>  RX packets:436159 errors:0 dropped:0 overruns:0 frame:0
>>>  TX packets:335665 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>>  port1
>>>  RX packets:8545492 errors:0 dropped:0 overruns:0 frame:0
>>>  TX packets:436071 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>>  port1.218
>>>  RX packets:335593 errors:0 dropped:0 overruns:0 frame:0
>>>  TX packets:436065 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>>  -----------------------------------------------------------
>>>
>>>  napi_gro_receive() in Eth driver + patch:
>>>
>>>  [ ID] Interval Transfer Bitrate Retr
>>>  [ 5] 0.00-120.01 sec 11.8 GBytes 855 Mbits/sec 122 sender
>>>  [ 5] 0.00-120.00 sec 11.8 GBytes 855 Mbits/sec receiver
>>>
>>>  eth0
>>>  RX packets:9292214 errors:0 dropped:0 overruns:0 frame:0
>>>  TX packets:9292190 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>>  port0
>>>  RX packets:438516 errors:0 dropped:0 overruns:0 frame:0
>>>  TX packets:347236 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>>  port1
>>>  RX packets:8853698 errors:0 dropped:0 overruns:0 frame:0
>>>  TX packets:438331 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>>  port1.218
>>>  RX packets:347082 errors:0 dropped:0 overruns:0 frame:0
>>>  TX packets:438320 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>>  -----------------------------------------------------------
>>>
>>>  The main goal is achieved: we have about 100-200 Mbps of performance
>>>  boost while in-stack skbs are greatly reduced from ~8-9 millions to
>>>  ~350000 (compare port0 TX and port1 RX without patch and with it).
>>
>> And the number of TCP retries is also lower, which likely means that we
>> are making better use of the flow control built into the hardware/driver
>> here?
>>
>> BTW do you know why you have so many retries though? It sounds like your
>> flow control is missing a few edge cases, or that you have an incorrect
>> configuration of your TX admission queue.
> 
> Well, I have the same question TBH. In the ~1.5 years that I've been
> working on these switches, I've seen a pretty chaotic number of TCP
> retransmissions each time I change something in the code. They are
> less likely to happen when the average CPU load is lower, but ~100
> is the best result I ever got.
> Seems like I should stop trying to push software throughput to
> the max for a while, pay more attention to this and to the hardware
> configuration instead, and check whether I'm missing something :)

I have had to debug such a problem on some of our systems recently, and
it came down to a couple of things on those systems:

- as a receiver, we could trigger fast re-transmissions on the sender
side because of packet loss, which happened because the switch is able to
push packets faster than the DSA master can write them to DRAM. One
way to work around this is to clock the Ethernet MAC higher, at the cost
of power consumption.

- as a sender, we could have fast re-transmissions when we were
ourselves a "fast" CPU (1.7GHz or higher for Gigabit throughput). That
part is still being root-caused, but I think it comes down to flow
control being incorrectly set up in hardware, which means you could lose
packets after your ndo_start_xmit() when the software TXQ does not
assert XON/XOFF properly.

So in both cases, packet loss is responsible for those fast
re-transmissions, but the losses are barely observable (case #1 was,
since the switch port counter did not match the Ethernet MAC MIB
counters) since you have a black hole effect.
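
A minimal sketch of the counter comparison mentioned for case #1 above,
assuming both counters have already been read out (for example from the
switch port statistics and the Ethernet MAC MIB). The helper name and its
parameters are hypothetical and do not come from any driver:

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of frames the switch transmitted towards
 * the CPU port that the Ethernet MAC never counted as received.
 * A persistently non-zero result hints at loss between the switch and
 * the DSA master.
 */
uint64_t frames_lost_before_mac(uint64_t switch_cpu_port_tx, uint64_t mac_rx)
{
	/*
	 * The two counters are sampled at different moments, so the MAC
	 * counter may briefly be ahead; clamp to zero instead of letting
	 * the unsigned subtraction wrap around.
	 */
	if (switch_cpu_port_tx <= mac_rx)
		return 0;

	return switch_cpu_port_tx - mac_rx;
}
```

A single sample proves little because of the sampling skew; a difference
that keeps growing between two reads is the interesting signal.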
Alexander Lobakin April 6, 2020, 9:24 p.m. UTC | #7
On Mon, 04/06/2020 at 13:16 -0700, Florian Fainelli wrote:
> On 4/6/2020 12:11 PM, Alexander Lobakin wrote:
> > 06.04.2020, 20:57, "Florian Fainelli" <f.fainelli@gmail.com>:
> > > On 4/6/2020 10:34 AM, Alexander Lobakin wrote:
> > > >  06.04.2020, 18:21, "Alexander Lobakin" <bloodyreaper@yandex.ru>:
> > > > >  06.04.2020, 17:48, "Andrew Lunn" <andrew@lunn.ch>:
> > > > > >   On Mon, Apr 06, 2020 at 01:59:10PM +0300, Alexander Lobakin wrote:
> > > > > > >    gro_cells lib is used by different encapsulating netdevices, such as
> > > > > > >    geneve, macsec, vxlan etc. to speed up decapsulated traffic processing.
> > > > > > >    CPU tag is a sort of "encapsulation", and we can use the same mechs to
> > > > > > >    greatly improve overall DSA performance.
> > > > > > >    skbs are passed to the GRO layer after removing CPU tags, so we don't
> > > > > > >    need any new packet offload types as it was firstly proposed by me in
> > > > > > >    the first GRO-over-DSA variant [1].
> > > > > > > 
> > > > > > >    The size of struct gro_cells is sizeof(void *), so hot struct
> > > > > > >    dsa_slave_priv becomes only 4/8 bytes bigger, and all critical fields
> > > > > > >    remain in one 32-byte cacheline.
> > > > > > >    The other positive side effect is that drivers for network devices
> > > > > > >    that can be shipped as CPU ports of DSA-driven switches can now use
> > > > > > >    napi_gro_frags() to pass skbs to kernel. Packets built that way are
> > > > > > >    completely non-linear and are likely being dropped without GRO.
> > > > > > > 
> > > > > > >    This was tested on to-be-mainlined-soon Ethernet driver that uses
> > > > > > >    napi_gro_frags(), and the overall performance was on par with the
> > > > > > >    variant from [1], sometimes even better due to minimal overhead.
> > > > > > >    net.core.gro_normal_batch tuning may help to push it to the limit
> > > > > > >    on particular setups and platforms.
> > > > > > > 
> > > > > > >    [1] https://lore.kernel.org/netdev/20191230143028.27313-1-alobakin@dlink.ru/
> > > > > > 
> > > > > >   Hi Alexander
> > > > > 
> > > > >  Hi Andrew!
> > > > > 
> > > > > >   net-next is closed at the moment. So you should of posted this with an
> > > > > >   RFC prefix.
> > > > > 
> > > > >  I saw that it's closed, but didn't knew about "RFC" tags for that period,
> > > > >  sorry.
> > > > > 
> > > > > >   The implementation looks nice and simple. But it would be nice to have
> > > > > >   some performance figures.
> > > > > 
> > > > >  I'll do, sure. I think I'll collect the stats with various main receiving
> > > > >  functions in Ethernet driver (napi_gro_frags(), napi_gro_receive(),
> > > > >  netif_receive_skb(), netif_receive_skb_list()), and with and without this
> > > > >  patch to make them as complete as possible.
> > > > 
> > > >  OK, so here we go.
> > > > 
> > > >  My device is 1.2 GHz 4-core MIPS32 R2. Ethernet controller representing
> > > >  the CPU port is capable of S/G, fraglists S/G, TSO4/6 and GSO UDP L4.
> > > >  Tests are performed through simple IPoE VLAN NAT forwarding setup
> > > >  (port0 <-> port1.218) with iperf3 in TCP mode.
> > > >  net.core.gro_normal_batch is always set to 16 as that value seems to be
> > > >  the most effective for that particular hardware and drivers.
> > > > 
> > > >  Packet counters on eth0 are the real numbers of ongoing frames. Counters
> > > >  on portX are pure-software and are updated inside networking stack.
> > > > 
> > > >  ---------------------------------------------------------------------
> > > > 
> > > >  netif_receive_skb() in Eth driver, no patch:
> > > > 
> > > >  [ ID] Interval Transfer Bitrate Retr
> > > >  [ 5] 0.00-120.01 sec 9.00 GBytes 644 Mbits/sec 413 sender
> > > >  [ 5] 0.00-120.00 sec 8.99 GBytes 644 Mbits/sec receiver
> > > > 
> > > >  eth0
> > > >  RX packets:7097731 errors:0 dropped:0 overruns:0 frame:0
> > > >  TX packets:7097702 errors:0 dropped:0 overruns:0 carrier:0
> > > > 
> > > >  port0
> > > >  RX packets:426050 errors:0 dropped:0 overruns:0 frame:0
> > > >  TX packets:6671829 errors:0 dropped:0 overruns:0 carrier:0
> > > > 
> > > >  port1
> > > >  RX packets:6671681 errors:0 dropped:0 overruns:0 carrier:0
> > > >  TX packets:425862 errors:0 dropped:0 overruns:0 carrier:0
> > > > 
> > > >  port1.218
> > > >  RX packets:6671677 errors:0 dropped:0 overruns:0 frame:0
> > > >  TX packets:425851 errors:0 dropped:0 overruns:0 carrier:0
> > > > 
> > > >  ---------------------------------------------------------------------
> > > > 
> > > >  netif_receive_skb_list() in Eth driver, no patch:
> > > > 
> > > >  [ ID] Interval Transfer Bitrate Retr
> > > >  [ 5] 0.00-120.01 sec 9.48 GBytes 679 Mbits/sec 129 sender
> > > >  [ 5] 0.00-120.00 sec 9.48 GBytes 679 Mbits/sec receiver
> > > > 
> > > >  eth0
> > > >  RX packets:7448098 errors:0 dropped:0 overruns:0 frame:0
> > > >  TX packets:7448073 errors:0 dropped:0 overruns:0 carrier:0
> > > > 
> > > >  port0
> > > >  RX packets:416115 errors:0 dropped:0 overruns:0 frame:0
> > > >  TX packets:7032121 errors:0 dropped:0 overruns:0 carrier:0
> > > > 
> > > >  port1
> > > >  RX packets:7031983 errors:0 dropped:0 overruns:0 frame:0
> > > >  TX packets:415941 errors:0 dropped:0 overruns:0 carrier:0
> > > > 
> > > >  port1.218
> > > >  RX packets:7031978 errors:0 dropped:0 overruns:0 frame:0
> > > >  TX packets:415930 errors:0 dropped:0 overruns:0 carrier:0
> > > > 
> > > >  ---------------------------------------------------------------------
> > > > 
> > > >  napi_gro_receive() in Eth driver, no patch:
> > > > 
> > > >  [ ID] Interval Transfer Bitrate Retr
> > > >  [ 5] 0.00-120.01 sec 10.0 GBytes 718 Mbits/sec 107 sender
> > > >  [ 5] 0.00-120.00 sec 10.0 GBytes 718 Mbits/sec receiver
> > > > 
> > > >  eth0
> > > >  RX packets:7868281 errors:0 dropped:0 overruns:0 frame:0
> > > >  TX packets:7868267 errors:0 dropped:0 overruns:0 carrier:0
> > > > 
> > > >  port0
> > > >  RX packets:429082 errors:0 dropped:0 overruns:0 frame:0
> > > >  TX packets:7439343 errors:0 dropped:0 overruns:0 carrier:0
> > > > 
> > > >  port1
> > > >  RX packets:7439199 errors:0 dropped:0 overruns:0 frame:0
> > > >  TX packets:428913 errors:0 dropped:0 overruns:0 carrier:0
> > > > 
> > > >  port1.218
> > > >  RX packets:7439195 errors:0 dropped:0 overruns:0 frame:0
> > > >  TX packets:428902 errors:0 dropped:0 overruns:0 carrier:0
> > > > 
> > > >  =====================================================================
> > > > 
> > > >  netif_receive_skb() in Eth driver + patch:
> > > > 
> > > >  [ ID] Interval Transfer Bitrate Retr
> > > >  [ 5] 0.00-120.01 sec 12.2 GBytes 870 Mbits/sec 2267 sender
> > > >  [ 5] 0.00-120.00 sec 12.2 GBytes 870 Mbits/sec receiver
> > > > 
> > > >  eth0
> > > >  RX packets:9474792 errors:0 dropped:0 overruns:0 frame:0
> > > >  TX packets:9474777 errors:0 dropped:0 overruns:0 carrier:0
> > > > 
> > > >  port0
> > > >  RX packets:455200 errors:0 dropped:0 overruns:0 frame:0
> > > >  TX packets:353288 errors:0 dropped:0 overruns:0 carrier:0
> > > > 
> > > >  port1
> > > >  RX packets:9019592 errors:0 dropped:0 overruns:0 frame:0
> > > >  TX packets:455035 errors:0 dropped:0 overruns:0 carrier:0
> > > > 
> > > >  port1.218
> > > >  RX packets:353144 errors:0 dropped:0 overruns:0 frame:0
> > > >  TX packets:455024 errors:0 dropped:0 overruns:0 carrier:0
> > > > 
> > > >  ---------------------------------------------------------------------
> > > > 
> > > >  netif_receive_skb_list() in Eth driver + patch:
> > > > 
> > > >  [ ID] Interval Transfer Bitrate Retr
> > > >  [ 5] 0.00-120.01 sec 11.6 GBytes 827 Mbits/sec 2224 sender
> > > >  [ 5] 0.00-120.00 sec 11.5 GBytes 827 Mbits/sec receiver
> > > > 
> > > >  eth0
> > > >  RX packets:8981651 errors:0 dropped:0 overruns:0 frame:0
> > > >  TX packets:898187 errors:0 dropped:0 overruns:0 carrier:0
> > > > 
> > > >  port0
> > > >  RX packets:436159 errors:0 dropped:0 overruns:0 frame:0
> > > >  TX packets:335665 errors:0 dropped:0 overruns:0 carrier:0
> > > > 
> > > >  port1
> > > >  RX packets:8545492 errors:0 dropped:0 overruns:0 frame:0
> > > >  TX packets:436071 errors:0 dropped:0 overruns:0 carrier:0
> > > > 
> > > >  port1.218
> > > >  RX packets:335593 errors:0 dropped:0 overruns:0 frame:0
> > > >  TX packets:436065 errors:0 dropped:0 overruns:0 carrier:0
> > > > 
> > > >  -----------------------------------------------------------
> > > > 
> > > >  napi_gro_receive() in Eth driver + patch:
> > > > 
> > > >  [ ID] Interval Transfer Bitrate Retr
> > > >  [ 5] 0.00-120.01 sec 11.8 GBytes 855 Mbits/sec 122 sender
> > > >  [ 5] 0.00-120.00 sec 11.8 GBytes 855 Mbits/sec receiver
> > > > 
> > > >  eth0
> > > >  RX packets:9292214 errors:0 dropped:0 overruns:0 frame:0
> > > >  TX packets:9292190 errors:0 dropped:0 overruns:0 carrier:0
> > > > 
> > > >  port0
> > > >  RX packets:438516 errors:0 dropped:0 overruns:0 frame:0
> > > >  TX packets:347236 errors:0 dropped:0 overruns:0 carrier:0
> > > > 
> > > >  port1
> > > >  RX packets:8853698 errors:0 dropped:0 overruns:0 frame:0
> > > >  TX packets:438331 errors:0 dropped:0 overruns:0 carrier:0
> > > > 
> > > >  port1.218
> > > >  RX packets:347082 errors:0 dropped:0 overruns:0 frame:0
> > > >  TX packets:438320 errors:0 dropped:0 overruns:0 carrier:0
> > > > 
> > > >  -----------------------------------------------------------
> > > > 
> > > >  The main goal is achieved: we have about 100-200 Mbps of performance
> > > >  boost while in-stack skbs are greatly reduced from ~8-9 millions to
> > > >  ~350000 (compare port0 TX and port1 RX without patch and with it).
> > > 
> > > And the number of TCP retries is also lower, which likely means that we
> > > are making better use of the flow control built into the hardware/driver
> > > here?
> > > 
> > > BTW do you know why you have so many retries though? It sounds like your
> > > flow control is missing a few edge cases, or that you have an incorrect
> > > configuration of your TX admission queue.
> > 
> > Well, I have the same question TBH. In the ~1.5 years that I've been
> > working on these switches, I've seen a pretty chaotic number of TCP
> > retransmissions each time I change something in the code. They are
> > less likely to happen when the average CPU load is lower, but ~100
> > is the best result I ever got.
> > Seems like I should stop trying to push software throughput to
> > the max for a while, pay more attention to this and to the hardware
> > configuration instead, and check whether I'm missing something :)
> 
> I have had to debug such a problem on some of our systems recently, and
> it came down to a couple of things on those systems:
> 
> - as a receiver, we could trigger fast re-transmissions on the sender
> side because of packet loss, which happened because the switch is able to
> push packets faster than the DSA master can write them to DRAM. One
> way to work around this is to clock the Ethernet MAC higher, at the cost
> of power consumption.
> 
> - as a sender, we could have fast re-transmissions when we were
> ourselves a "fast" CPU (1.7GHz or higher for Gigabit throughput). That
> part is still being root-caused, but I think it comes down to flow
> control being incorrectly set up in hardware, which means you could lose
> packets after your ndo_start_xmit() when the software TXQ does not
> assert XON/XOFF properly.
> 
> So in both cases, packet loss is responsible for those fast
> re-transmissions, but the losses are barely observable (case #1 was,
> since the switch port counter did not match the Ethernet MAC MIB
> counters) since you have a black hole effect.

Thank you for such a detailed response! I suppose both of these might
be present on my system; I'll have a look at this soon.
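
For reference, the skb reduction reported in the quoted results can be
expressed as an average number of wire frames merged per skb. This is a
rough sketch only: it assumes that port1 RX counts frames before GRO and
port1.218 RX counts skbs after GRO, as the description of the counters
suggests. The helper name is hypothetical and is not part of the patch:

```c
#include <stdint.h>

/*
 * Hypothetical helper: average number of frames merged into one skb,
 * scaled by 100 to stay in integer arithmetic (2554 means about 25.54
 * frames per skb).
 */
uint64_t frames_per_skb_x100(uint64_t frames, uint64_t skbs)
{
	if (skbs == 0)
		return 0;	/* nothing delivered, no ratio to report */

	return frames * 100 / skbs;
}
```

With the counters from the "netif_receive_skb() in Eth driver + patch"
run (port1 RX 9019592, port1.218 RX 353144) this yields 2554, i.e. about
25.5 frames per skb, while the matching run without the patch
(6671681 and 6671677) yields 100, i.e. no aggregation.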

Patch

diff --git a/net/dsa/Kconfig b/net/dsa/Kconfig
index 92663dcb3aa2..739613070d07 100644
--- a/net/dsa/Kconfig
+++ b/net/dsa/Kconfig
@@ -9,6 +9,7 @@  menuconfig NET_DSA
 	tristate "Distributed Switch Architecture"
 	depends on HAVE_NET_DSA
 	depends on BRIDGE || BRIDGE=n
+	select GRO_CELLS
 	select NET_SWITCHDEV
 	select PHYLINK
 	select NET_DEVLINK
diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
index ee2610c4d46a..0384a911779e 100644
--- a/net/dsa/dsa.c
+++ b/net/dsa/dsa.c
@@ -234,7 +234,7 @@  static int dsa_switch_rcv(struct sk_buff *skb, struct net_device *dev,
 	if (dsa_skb_defer_rx_timestamp(p, skb))
 		return 0;
 
-	netif_receive_skb(skb);
+	gro_cells_receive(&p->gcells, skb);
 
 	return 0;
 }
diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
index 904cc7c9b882..6d9a1ef65fa0 100644
--- a/net/dsa/dsa_priv.h
+++ b/net/dsa/dsa_priv.h
@@ -11,6 +11,7 @@ 
 #include <linux/netdevice.h>
 #include <linux/netpoll.h>
 #include <net/dsa.h>
+#include <net/gro_cells.h>
 
 enum {
 	DSA_NOTIFIER_AGEING_TIME,
@@ -77,6 +78,8 @@  struct dsa_slave_priv {
 
 	struct pcpu_sw_netstats	*stats64;
 
+	struct gro_cells	gcells;
+
 	/* DSA port data, such as switch, port index, etc. */
 	struct dsa_port		*dp;
 
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 5390ff541658..36c7491e8e5f 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -1762,6 +1762,11 @@  int dsa_slave_create(struct dsa_port *port)
 		free_netdev(slave_dev);
 		return -ENOMEM;
 	}
+
+	ret = gro_cells_init(&p->gcells, slave_dev);
+	if (ret)
+		goto out_free;
+
 	p->dp = port;
 	INIT_LIST_HEAD(&p->mall_tc_list);
 	p->xmit = cpu_dp->tag_ops->xmit;
@@ -1781,7 +1786,7 @@  int dsa_slave_create(struct dsa_port *port)
 	ret = dsa_slave_phy_setup(slave_dev);
 	if (ret) {
 		netdev_err(master, "error %d setting up slave phy\n", ret);
-		goto out_free;
+		goto out_gcells;
 	}
 
 	dsa_slave_notify(slave_dev, DSA_PORT_REGISTER);
@@ -1800,6 +1805,8 @@  int dsa_slave_create(struct dsa_port *port)
 	phylink_disconnect_phy(p->dp->pl);
 	rtnl_unlock();
 	phylink_destroy(p->dp->pl);
+out_gcells:
+	gro_cells_destroy(&p->gcells);
 out_free:
 	free_percpu(p->stats64);
 	free_netdev(slave_dev);
@@ -1820,6 +1827,7 @@  void dsa_slave_destroy(struct net_device *slave_dev)
 	dsa_slave_notify(slave_dev, DSA_PORT_UNREGISTER);
 	unregister_netdev(slave_dev);
 	phylink_destroy(dp->pl);
+	gro_cells_destroy(&p->gcells);
 	free_percpu(p->stats64);
 	free_netdev(slave_dev);
 }