diff mbox series

[net] openvswitch: Set the skbuff pkt_type for proper pmtud support.

Message ID 20240322190603.251831-1-aconole@redhat.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series [net] openvswitch: Set the skbuff pkt_type for proper pmtud support. | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag present in non-next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 944 this patch: 944
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 6 of 7 maintainers
netdev/build_clang success Errors and warnings before: 955 this patch: 955
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 955 this patch: 955
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 8 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-03-23--00-00 (tests: 943)

Commit Message

Aaron Conole March 22, 2024, 7:06 p.m. UTC
Open vSwitch is originally intended to switch at layer 2, only dealing with
Ethernet frames.  With the introduction of l3 tunnels support, it crossed
into the realm of needing to care a bit about some routing details when
making forwarding decisions.  If an oversized packet would need to be
fragmented during this forwarding decision, there is a chance for pmtu
to get involved and generate a routing exception.  This is gated by the
skbuff->pkt_type field.

When a flow is already loaded into the openvswitch module this field is
set up and transitioned properly as a packet moves from one port to
another.  In the case that a packet execute is invoked after a flow is
newly installed this field is not properly initialized.  This causes the
pmtud mechanism to omit sending the required exception messages across
the tunnel boundary and a second attempt needs to be made to make sure
that the routing exception is properly setup.  To fix this, we set the
outgoing packet's pkt_type to PACKET_OUTGOING, since it can only get
to the openvswitch module via a port device or packet command.

This issue is periodically encountered in complex setups, such as large
openshift deployments, where multiple sets of tunnel traversal occurs.
A way to recreate this is with the ovn-heater project that can setup
a networking environment which mimics such large deployments.  In that
environment, without this patch, we can see:

  ./ovn_cluster.sh start
  podman exec ovn-chassis-1 ip r a 170.168.0.5/32 dev eth1 mtu 1200
  podman exec ovn-chassis-1 ip netns exec sw01p1  ip r flush cache
  podman exec ovn-chassis-1 ip netns exec sw01p1 ping 21.0.0.3 -M do -s 1300 -c2
  PING 21.0.0.3 (21.0.0.3) 1300(1328) bytes of data.
  From 21.0.0.3 icmp_seq=2 Frag needed and DF set (mtu = 1142)

  --- 21.0.0.3 ping statistics ---
  2 packets transmitted, 0 received, +1 errors, 100% packet loss, time 1017ms

Using tcpdump, we can also see the expected ICMP FRAG_NEEDED message is not
sent into the server.

With this patch, setting the pkt_type, we see the following:

  podman exec ovn-chassis-1 ip netns exec sw01p1 ping 21.0.0.3 -M do -s 1300 -c2
  PING 21.0.0.3 (21.0.0.3) 1300(1328) bytes of data.
  From 21.0.0.3 icmp_seq=1 Frag needed and DF set (mtu = 1222)
  ping: local error: message too long, mtu=1222

  --- 21.0.0.3 ping statistics ---
  2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1061ms

In this case, the first ping request receives the FRAG_NEEDED message and
a local routing exception is created.

Reported-at: https://issues.redhat.com/browse/FDP-164
Fixes: 58264848a5a7 ("openvswitch: Add vxlan tunneling support.")
Signed-off-by: Aaron Conole <aconole@redhat.com>
---
NOTE: An alternate approach would be to add a netlink attribute to preserve
      pkt_type across the kernel->user boundary, but that does require some
      userspace cooperation.

 net/openvswitch/actions.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Eelco Chaudron March 25, 2024, 8:44 a.m. UTC | #1
On 22 Mar 2024, at 20:06, Aaron Conole wrote:

> Open vSwitch is originally intended to switch at layer 2, only dealing with
> Ethernet frames.  With the introduction of l3 tunnels support, it crossed
> into the realm of needing to care a bit about some routing details when
> making forwarding decisions.  If an oversized packet would need to be
> fragmented during this forwarding decision, there is a chance for pmtu
> to get involved and generate a routing exception.  This is gated by the
> skbuff->pkt_type field.
>
> When a flow is already loaded into the openvswitch module this field is
> set up and transitioned properly as a packet moves from one port to
> another.  In the case that a packet execute is invoked after a flow is
> newly installed this field is not properly initialized.  This causes the
> pmtud mechanism to omit sending the required exception messages across
> the tunnel boundary and a second attempt needs to be made to make sure
> that the routing exception is properly setup.  To fix this, we set the
> outgoing packet's pkt_type to PACKET_OUTGOING, since it can only get
> to the openvswitch module via a port device or packet command.

Is this not a problem when the packet comes from the bridge port in the kernel?

> This issue is periodically encountered in complex setups, such as large
> openshift deployments, where multiple sets of tunnel traversal occurs.
> A way to recreate this is with the ovn-heater project that can setup
> a networking environment which mimics such large deployments.  In that
> environment, without this patch, we can see:
>
>   ./ovn_cluster.sh start
>   podman exec ovn-chassis-1 ip r a 170.168.0.5/32 dev eth1 mtu 1200
>   podman exec ovn-chassis-1 ip netns exec sw01p1  ip r flush cache
>   podman exec ovn-chassis-1 ip netns exec sw01p1 ping 21.0.0.3 -M do -s 1300 -c2
>   PING 21.0.0.3 (21.0.0.3) 1300(1328) bytes of data.
>   From 21.0.0.3 icmp_seq=2 Frag needed and DF set (mtu = 1142)
>
>   --- 21.0.0.3 ping statistics ---
>   2 packets transmitted, 0 received, +1 errors, 100% packet loss, time 1017ms
>
> Using tcpdump, we can also see the expected ICMP FRAG_NEEDED message is not
> sent into the server.
>
> With this patch, setting the pkt_type, we see the following:
>
>   podman exec ovn-chassis-1 ip netns exec sw01p1 ping 21.0.0.3 -M do -s 1300 -c2
>   PING 21.0.0.3 (21.0.0.3) 1300(1328) bytes of data.
>   From 21.0.0.3 icmp_seq=1 Frag needed and DF set (mtu = 1222)
>   ping: local error: message too long, mtu=1222
>
>   --- 21.0.0.3 ping statistics ---
>   2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1061ms
>
> In this case, the first ping request receives the FRAG_NEEDED message and
> a local routing exception is created.
>
> Reported-at: https://issues.redhat.com/browse/FDP-164
> Fixes: 58264848a5a7 ("openvswitch: Add vxlan tunneling support.")
> Signed-off-by: Aaron Conole <aconole@redhat.com>
> ---
> NOTE: An alternate approach would be to add a netlink attribute to preserve
>       pkt_type across the kernel->user boundary, but that does require some
>       userspace cooperation.

I prefer the method in this patch, as it requires no userspace change, i.e. it will work even with older versions of OVS without the need for backports.

>  net/openvswitch/actions.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
> index 6fcd7e2ca81fe..952c6292100d0 100644
> --- a/net/openvswitch/actions.c
> +++ b/net/openvswitch/actions.c
> @@ -936,6 +936,8 @@ static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port,
>  				pskb_trim(skb, ovs_mac_header_len(key));
>  		}
>
> +		skb->pkt_type = PACKET_OUTGOING;
> +

Maybe add a comment based on the large explanation above?

>  		if (likely(!mru ||
>  		           (skb->len <= mru + vport->dev->hard_header_len))) {
>  			ovs_vport_send(vport, skb, ovs_key_mac_proto(key));
> -- 
> 2.41.0
Aaron Conole March 25, 2024, 12:22 p.m. UTC | #2
Eelco Chaudron <echaudro@redhat.com> writes:

> On 22 Mar 2024, at 20:06, Aaron Conole wrote:
>
>> Open vSwitch is originally intended to switch at layer 2, only dealing with
>> Ethernet frames.  With the introduction of l3 tunnels support, it crossed
>> into the realm of needing to care a bit about some routing details when
>> making forwarding decisions.  If an oversized packet would need to be
>> fragmented during this forwarding decision, there is a chance for pmtu
>> to get involved and generate a routing exception.  This is gated by the
>> skbuff->pkt_type field.
>>
>> When a flow is already loaded into the openvswitch module this field is
>> set up and transitioned properly as a packet moves from one port to
>> another.  In the case that a packet execute is invoked after a flow is
>> newly installed this field is not properly initialized.  This causes the
>> pmtud mechanism to omit sending the required exception messages across
>> the tunnel boundary and a second attempt needs to be made to make sure
>> that the routing exception is properly setup.  To fix this, we set the
>> outgoing packet's pkt_type to PACKET_OUTGOING, since it can only get
>> to the openvswitch module via a port device or packet command.
>
> Is this not a problem when the packet comes from the bridge port in the kernel?

It very well may be an issue there as well, but the recommendation is to
operate with the bridge port down as far as I know, so I don't know if
this issue has been observed happening from the bridge port.

Since I will spin a v2 with a comment, do you want me to mention
something about the bridge port?

>> This issue is periodically encountered in complex setups, such as large
>> openshift deployments, where multiple sets of tunnel traversal occurs.
>> A way to recreate this is with the ovn-heater project that can setup
>> a networking environment which mimics such large deployments.  In that
>> environment, without this patch, we can see:
>>
>>   ./ovn_cluster.sh start
>>   podman exec ovn-chassis-1 ip r a 170.168.0.5/32 dev eth1 mtu 1200
>>   podman exec ovn-chassis-1 ip netns exec sw01p1  ip r flush cache
>>   podman exec ovn-chassis-1 ip netns exec sw01p1 ping 21.0.0.3 -M do -s 1300 -c2
>>   PING 21.0.0.3 (21.0.0.3) 1300(1328) bytes of data.
>>   From 21.0.0.3 icmp_seq=2 Frag needed and DF set (mtu = 1142)
>>
>>   --- 21.0.0.3 ping statistics ---
>>   2 packets transmitted, 0 received, +1 errors, 100% packet loss, time 1017ms
>>
>> Using tcpdump, we can also see the expected ICMP FRAG_NEEDED message is not
>> sent into the server.
>>
>> With this patch, setting the pkt_type, we see the following:
>>
>>   podman exec ovn-chassis-1 ip netns exec sw01p1 ping 21.0.0.3 -M do -s 1300 -c2
>>   PING 21.0.0.3 (21.0.0.3) 1300(1328) bytes of data.
>>   From 21.0.0.3 icmp_seq=1 Frag needed and DF set (mtu = 1222)
>>   ping: local error: message too long, mtu=1222
>>
>>   --- 21.0.0.3 ping statistics ---
>>   2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1061ms
>>
>> In this case, the first ping request receives the FRAG_NEEDED message and
>> a local routing exception is created.
>>
>> Reported-at: https://issues.redhat.com/browse/FDP-164
>> Fixes: 58264848a5a7 ("openvswitch: Add vxlan tunneling support.")
>> Signed-off-by: Aaron Conole <aconole@redhat.com>
>> ---
>> NOTE: An alternate approach would be to add a netlink attribute to preserve
>>       pkt_type across the kernel->user boundary, but that does require some
>>       userspace cooperation.
>
> I prefer the method in this patch, as it requires no userspace change,
> i.e. it will work even with older versions of OVS without the need for
> backports.

Yes - that was my thinking as well.

>>  net/openvswitch/actions.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
>> index 6fcd7e2ca81fe..952c6292100d0 100644
>> --- a/net/openvswitch/actions.c
>> +++ b/net/openvswitch/actions.c
>> @@ -936,6 +936,8 @@ static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port,
>>  				pskb_trim(skb, ovs_mac_header_len(key));
>>  		}
>>
>> +		skb->pkt_type = PACKET_OUTGOING;
>> +
>
> Maybe add a comment based on the large explanation above?

Okay - I can add one.

>>  		if (likely(!mru ||
>>  		           (skb->len <= mru + vport->dev->hard_header_len))) {
>>  			ovs_vport_send(vport, skb, ovs_key_mac_proto(key));
>> -- 
>> 2.41.0
Ilya Maximets March 25, 2024, 12:37 p.m. UTC | #3
On 3/25/24 13:22, Aaron Conole wrote:
> Eelco Chaudron <echaudro@redhat.com> writes:
> 
>> On 22 Mar 2024, at 20:06, Aaron Conole wrote:
>>
>>> Open vSwitch is originally intended to switch at layer 2, only dealing with
>>> Ethernet frames.  With the introduction of l3 tunnels support, it crossed
>>> into the realm of needing to care a bit about some routing details when
>>> making forwarding decisions.  If an oversized packet would need to be
>>> fragmented during this forwarding decision, there is a chance for pmtu
>>> to get involved and generate a routing exception.  This is gated by the
>>> skbuff->pkt_type field.
>>>
>>> When a flow is already loaded into the openvswitch module this field is
>>> set up and transitioned properly as a packet moves from one port to
>>> another.  In the case that a packet execute is invoked after a flow is
>>> newly installed this field is not properly initialized.  This causes the
>>> pmtud mechanism to omit sending the required exception messages across
>>> the tunnel boundary and a second attempt needs to be made to make sure
>>> that the routing exception is properly setup.  To fix this, we set the
>>> outgoing packet's pkt_type to PACKET_OUTGOING, since it can only get
>>> to the openvswitch module via a port device or packet command.
>>
>> Is this not a problem when the packet comes from the bridge port in the kernel?
> 
> It very well may be an issue there as well, but the recommendation is to
> operate with the bridge port down as far as I know, so I don't know if
> this issue has been observed happening from the bridge port.

FWIW, bridge ports are typically used as an entry point for tunneled
traffic so it can egress from a physical port attached to OVS.  It means
they are pretty much always UP in most common setups like OpenStack or
ovn-kubernetes and handle a decent amount of traffic.  They are also used
to direct some other types of traffic to the host kernel.

Unless I misunderstood which ports we're talking about here.

Best regards, Ilya Maximets.
Eelco Chaudron March 25, 2024, 12:57 p.m. UTC | #4
On 25 Mar 2024, at 13:37, Ilya Maximets wrote:

> On 3/25/24 13:22, Aaron Conole wrote:
>> Eelco Chaudron <echaudro@redhat.com> writes:
>>
>>> On 22 Mar 2024, at 20:06, Aaron Conole wrote:
>>>
>>>> Open vSwitch is originally intended to switch at layer 2, only dealing with
>>>> Ethernet frames.  With the introduction of l3 tunnels support, it crossed
>>>> into the realm of needing to care a bit about some routing details when
>>>> making forwarding decisions.  If an oversized packet would need to be
>>>> fragmented during this forwarding decision, there is a chance for pmtu
>>>> to get involved and generate a routing exception.  This is gated by the
>>>> skbuff->pkt_type field.
>>>>
>>>> When a flow is already loaded into the openvswitch module this field is
>>>> set up and transitioned properly as a packet moves from one port to
>>>> another.  In the case that a packet execute is invoked after a flow is
>>>> newly installed this field is not properly initialized.  This causes the
>>>> pmtud mechanism to omit sending the required exception messages across
>>>> the tunnel boundary and a second attempt needs to be made to make sure
>>>> that the routing exception is properly setup.  To fix this, we set the
>>>> outgoing packet's pkt_type to PACKET_OUTGOING, since it can only get
>>>> to the openvswitch module via a port device or packet command.
>>>
>>> Is this not a problem when the packet comes from the bridge port in the kernel?
>>
>> It very well may be an issue there as well, but the recommendation is to
>> operate with the bridge port down as far as I know, so I don't know if
>> this issue has been observed happening from the bridge port.
>
> FWIW, bridge ports are typically used as an entry point for tunneled
> traffic so it can egress from a physical port attached to OVS.  It means
> they are pretty much always UP in most common setups like OpenStack or
> ovn-kubernetes and handle a decent amount of traffic.  They are also used
> to direct some other types of traffic to the host kernel.

+1 here, I’m talking about the same port. I think we only advise having this down for userspace bridges, but not in the case the bridge is the tunnel endpoint.

> Unless I misunderstood which ports we're talking about here.
>
> Best regards, Ilya Maximets.
Aaron Conole March 27, 2024, 5:17 p.m. UTC | #5
Eelco Chaudron <echaudro@redhat.com> writes:

> On 25 Mar 2024, at 13:37, Ilya Maximets wrote:
>
>> On 3/25/24 13:22, Aaron Conole wrote:
>>> Eelco Chaudron <echaudro@redhat.com> writes:
>>>
>>>> On 22 Mar 2024, at 20:06, Aaron Conole wrote:
>>>>
>>>>> Open vSwitch is originally intended to switch at layer 2, only dealing with
>>>>> Ethernet frames.  With the introduction of l3 tunnels support, it crossed
>>>>> into the realm of needing to care a bit about some routing details when
>>>>> making forwarding decisions.  If an oversized packet would need to be
>>>>> fragmented during this forwarding decision, there is a chance for pmtu
>>>>> to get involved and generate a routing exception.  This is gated by the
>>>>> skbuff->pkt_type field.
>>>>>
>>>>> When a flow is already loaded into the openvswitch module this field is
>>>>> set up and transitioned properly as a packet moves from one port to
>>>>> another.  In the case that a packet execute is invoked after a flow is
>>>>> newly installed this field is not properly initialized.  This causes the
>>>>> pmtud mechanism to omit sending the required exception messages across
>>>>> the tunnel boundary and a second attempt needs to be made to make sure
>>>>> that the routing exception is properly setup.  To fix this, we set the
>>>>> outgoing packet's pkt_type to PACKET_OUTGOING, since it can only get
>>>>> to the openvswitch module via a port device or packet command.
>>>>
>>>> Is this not a problem when the packet comes from the bridge port in the kernel?
>>>
>>> It very well may be an issue there as well, but the recommendation is to
>>> operate with the bridge port down as far as I know, so I don't know if
>>> this issue has been observed happening from the bridge port.
>>
>> FWIW, bridge ports are typically used as an entry point for tunneled
>> traffic so it can egress from a physical port attached to OVS.  It means
>> they are pretty much always UP in most common setups like OpenStack or
>> ovn-kubernetes and handle a decent amount of traffic.  They are also used
>> to direct some other types of traffic to the host kernel.
>
> +1 here, I’m talking about the same port. I think we only advise
> having this down for userspace bridges, but not in the case the bridge
> is the tunnel endpoint.

Okay, I'll confirm about up/down, but it seems like it shouldn't matter
and we should be setting the outgoing type.

>> Unless I misunderstood which ports we're talking about here.
>>
>> Best regards, Ilya Maximets.
diff mbox series

Patch

diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 6fcd7e2ca81fe..952c6292100d0 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -936,6 +936,8 @@  static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port,
 				pskb_trim(skb, ovs_mac_header_len(key));
 		}
 
+		skb->pkt_type = PACKET_OUTGOING;
+
 		if (likely(!mru ||
 		           (skb->len <= mru + vport->dev->hard_header_len))) {
 			ovs_vport_send(vport, skb, ovs_key_mac_proto(key));