[net-next,00/10] net: support ipv4 big tcp

Message ID cover.1673666803.git.lucien.xin@gmail.com (mailing list archive)

Message

Xin Long Jan. 14, 2023, 3:31 a.m. UTC
This is similar to the BIG TCP patchset added by Eric for IPv6:

  https://lwn.net/Articles/895398/

Unlike IPv6, the IPv4 tot_len field is only 16 bits long, and the IPv4
header has no exthdrs (options) that carry the length of a BIG TCP
packet. To keep it simple, as David and Paolo suggested, we set IPv4
tot_len to 0 to indicate that this might be a BIG TCP packet and use
skb->len as the real IPv4 total length.

This is safe because all BIG TCP packets are GSO/GRO packets and are
processed on the same host on which they were created. There is no
padding in GSO/GRO packets, so skb->len - network_offset is exactly the
IPv4 packet total length. Also, before the feature is implemented, all
the places that may read iph tot_len from BIG TCP packets are converted
to some new APIs:

Patch 1 adds helpers for setting and getting iph tot_len. Patches 2-7
use them in all the places that IPv4 BIG TCP packets may reach. Patch 8
implements the feature, and Patch 10 adds a selftest for it. Patch 9 is
a fix in netfilter length_mt6 so that the selftest can also cover IPv6
BIG TCP.
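
The tot_len convention described above can be sketched in userspace C.
This is only an illustration, not the code from Patch 1: the mock_*
structs and functions are hypothetical stand-ins for struct iphdr,
struct sk_buff and the new helpers, and the is_gso_tcp flag is an
assumed simplification of the kernel's GSO checks.

```c
#include <stdbool.h>
#include <stdint.h>
#include <arpa/inet.h>   /* htons(), ntohs() */

/* Largest value that fits in the 16-bit tot_len field. */
#define MOCK_IP_MAX_LEN 0xFFFFu

/* Hypothetical, heavily simplified stand-in for struct iphdr. */
struct mock_iphdr {
	uint16_t tot_len;            /* network byte order */
};

/* Hypothetical, heavily simplified stand-in for struct sk_buff. */
struct mock_skb {
	unsigned int len;            /* bytes of data in the buffer */
	unsigned int network_offset; /* offset of the IPv4 header */
	bool is_gso_tcp;             /* assumed flag: packet is TCP GSO/GRO */
	struct mock_iphdr iph;
};

/* Store the length, or 0 when it does not fit in 16 bits. */
static inline void mock_set_totlen(struct mock_iphdr *iph, unsigned int len)
{
	iph->tot_len = len <= MOCK_IP_MAX_LEN ? htons((uint16_t)len) : 0;
}

/*
 * Read the length. A zero tot_len on a TCP GSO/GRO packet is treated
 * as "possibly BIG TCP", and the length is derived from the buffer.
 */
static inline unsigned int mock_get_totlen(const struct mock_skb *skb)
{
	unsigned int len = ntohs(skb->iph.tot_len);

	if (len || !skb->is_gso_tcp)
		return len;
	return skb->len - skb->network_offset;
}
```

In this sketch a 128000-byte packet stores tot_len as 0 and reads back
as skb->len minus the network offset, while an ordinary 1500-byte
packet round-trips through the 16-bit field unchanged.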

Note that changes similar to those in Patches 2-7 are also needed for
IPv6 BIG TCP packets; they will be addressed in another patchset.

A similar performance test was done for IPv4 BIG TCP with a 25Gbit NIC
and 1.5K MTU:

No BIG TCP:
for i in {1..10}; do netperf -t TCP_RR -H 192.168.100.1 -- -r80000,80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT|tail -1; done
168          322          337          3776.49
143          236          277          4654.67
128          258          288          4772.83
171          229          278          4645.77
175          228          243          4678.93
149          239          279          4599.86
164          234          268          4606.94
155          276          289          4235.82
180          255          268          4418.95
168          241          249          4417.82

Enable BIG TCP:
ip link set dev ens1f0np0 gro_max_size 128000 gso_max_size 128000
for i in {1..10}; do netperf -t TCP_RR -H 192.168.100.1 -- -r80000,80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT|tail -1; done
161          241          252          4821.73
174          205          217          5098.28
167          208          220          5001.43
164          228          249          4883.98
150          233          249          4914.90
180          233          244          4819.66
154          208          219          5004.92
157          209          247          4999.78
160          218          246          4842.31
174          206          217          5080.99
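
To compare the two runs, the ten THROUGHPUT samples from each can be
averaged. The sketch below is only a convenience for reading the numbers
above (the array and function names are made up for illustration). The
means come out to about 4480.8 without BIG TCP and about 4946.8 with it,
roughly a 10% gain. With only ten noisy samples per configuration this
is an indication, not a rigorous measurement.

```c
#include <stddef.h>

/* THROUGHPUT column of the "No BIG TCP" run, copied from above. */
static const double tput_no_big_tcp[10] = {
	3776.49, 4654.67, 4772.83, 4645.77, 4678.93,
	4599.86, 4606.94, 4235.82, 4418.95, 4417.82,
};

/* THROUGHPUT column of the "Enable BIG TCP" run, copied from above. */
static const double tput_big_tcp[10] = {
	4821.73, 5098.28, 5001.43, 4883.98, 4914.90,
	4819.66, 5004.92, 4999.78, 4842.31, 5080.99,
};

/* Arithmetic mean of n samples; returns 0.0 for an empty array. */
static inline double mean(const double *v, size_t n)
{
	double sum = 0.0;

	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return n ? sum / (double)n : 0.0;
}

/* Relative throughput gain of BIG TCP over the baseline (0.10 == 10%). */
static inline double relative_gain(void)
{
	return mean(tput_big_tcp, 10) / mean(tput_no_big_tcp, 10) - 1.0;
}
```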

Xin Long (10):
  net: add a couple of helpers for iph tot_len
  bridge: use skb_ip_totlen in br netfilter
  openvswitch: use skb_ip_totlen in conntrack
  net: sched: use skb_ip_totlen and iph_totlen
  netfilter: use skb_ip_totlen and iph_totlen
  cipso_ipv4: use iph_set_totlen in skbuff_setattr
  ipvlan: use skb_ip_totlen in ipvlan_get_L3_hdr
  net: add support for ipv4 big tcp
  netfilter: get ipv6 pktlen properly in length_mt6
  selftests: add a selftest for big tcp

 drivers/net/ipvlan/ipvlan_core.c           |   2 +-
 include/linux/ip.h                         |  20 +++
 include/linux/ipv6.h                       |   9 ++
 include/net/netfilter/nf_tables_ipv4.h     |   4 +-
 include/net/route.h                        |   3 -
 net/bridge/br_netfilter_hooks.c            |   2 +-
 net/bridge/netfilter/nf_conntrack_bridge.c |   4 +-
 net/core/gro.c                             |   6 +-
 net/core/sock.c                            |  11 +-
 net/ipv4/af_inet.c                         |   7 +-
 net/ipv4/cipso_ipv4.c                      |   2 +-
 net/ipv4/ip_input.c                        |   2 +-
 net/ipv4/ip_output.c                       |   2 +-
 net/netfilter/ipvs/ip_vs_xmit.c            |   2 +-
 net/netfilter/nf_log_syslog.c              |   2 +-
 net/netfilter/xt_length.c                  |   5 +-
 net/openvswitch/conntrack.c                |   2 +-
 net/sched/act_ct.c                         |   2 +-
 net/sched/sch_cake.c                       |   2 +-
 tools/testing/selftests/net/Makefile       |   1 +
 tools/testing/selftests/net/big_tcp.sh     | 157 +++++++++++++++++++++
 21 files changed, 212 insertions(+), 35 deletions(-)
 create mode 100755 tools/testing/selftests/net/big_tcp.sh

Comments

David Ahern Jan. 15, 2023, 3:45 p.m. UTC | #1
On 1/13/23 8:31 PM, Xin Long wrote:
> [...]

Thanks for working on this and writing the selftests.

A couple of years ago I was experimenting with a simpler version of this
change (I only changed what I needed to run tests). tcpdump (as an
example of a packet socket app) was confused about the packet length and
reported truncated packet errors. As I recall, that is the only really
tricky part of getting large packets to work for IPv4.

Eric Dumazet Jan. 15, 2023, 4:04 p.m. UTC | #2
On Sat, Jan 14, 2023 at 4:31 AM Xin Long <lucien.xin@gmail.com> wrote:
> [...]

This is broken, sorry.

There are reasons BIG TCP was implemented for IPv6 only; it seems you
missed a lot of them.

Networking needs observability and diagnostic tools.

Until you come back with a proper way for tcpdump to not mess things up,
there is no way I can ACK these changes.

Xin Long Jan. 15, 2023, 5:33 p.m. UTC | #3
On Sun, Jan 15, 2023 at 11:05 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Sat, Jan 14, 2023 at 4:31 AM Xin Long <lucien.xin@gmail.com> wrote:
> > [...]
>
> This is broken, sorry.
>
> There are reasons BIG TCP was implemented for IPv6 only; it seems you
> missed a lot of them.
>
> Networking needs observability and diagnostic tools.
>
> Until you come back with a proper way for tcpdump to not mess things up,
> there is no way I can ACK these changes.

The installed tcpdump just parses iph->tot_len from the raw packet, and
I'm not sure how to make that work without any change to tcpdump. But:

https://github.com/the-tcpdump-group/tcpdump/commit/c8623960f050cb81c12b31107070021f27f14b18

As this is already in tcpdump, we can build tcpdump with "-DGUESS_TSO":

# ./configure CFLAGS=-DGUESS_TSO

It seems someone had already hit this problem even without IPv4 BIG TCP,
though I'm not sure whether on Linux or another OS.
Now it's time to enable this CFLAG. :-)

Thanks.
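
For illustration, the kind of fallback a capture tool could apply when
it sees tot_len == 0 can be sketched as below. This is not tcpdump's
code: guess_ipv4_total_len() and its parameters are hypothetical, and
the sketch assumes the tool knows the on-wire length of the packet
starting at the IPv4 header.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: return the IPv4 total length to use for parsing.
 *
 * ip       - pointer to the start of the IPv4 header
 * wire_len - on-wire bytes from the IPv4 header to the end of the
 *            packet, as reported by the capture mechanism (assumption)
 *
 * Returns 0 if the header is too short to contain tot_len.
 */
static inline size_t guess_ipv4_total_len(const uint8_t *ip, size_t wire_len)
{
	size_t tot_len;

	if (wire_len < 4)
		return 0;

	/* tot_len occupies bytes 2-3 of the IPv4 header, big endian. */
	tot_len = ((size_t)ip[2] << 8) | ip[3];

	/* A zero tot_len is presumed to be a TSO/GSO (or BIG TCP) packet. */
	if (tot_len == 0)
		return wire_len;

	return tot_len;
}
```

With a fallback of this shape, a packet whose tot_len is 0 is sized
from the capture metadata instead of being reported as truncated.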