mbox series

[net-next,0/6] net: reset MAC header consistently across L3 virtual devices

Message ID cover.1624572003.git.gnault@redhat.com (mailing list archive)
Headers show
Series net: reset MAC header consistently across L3 virtual devices | expand

Message

Guillaume Nault June 25, 2021, 1:32 p.m. UTC
Some virtual L3 devices, like vxlan-gpe and gre (in collect_md mode),
reset the MAC header pointer after they parsed the outer headers. This
accurately reflects the fact that the decapsulated packet is pure L3
packet, as that makes the MAC header 0 bytes long (the MAC and network
header pointers are equal).

However, many L3 devices only adjust the network header after
decapsulation and leave the MAC header pointer to its original value.
This can confuse other parts of the networking stack, like TC, which
then considers the outer headers as one big MAC header.

This patch series makes the following L3 tunnels behave like VXLAN-GPE:
bareudp, ipip, sit, gre, ip6gre, ip6tnl, gtp.

The case of gre is a bit special. It already resets the MAC header
pointer in collect_md mode, so only the classical mode needs to be
adjusted. However, gre also has a special case that expects the MAC
header pointer to keep pointing to the outer header even after
decapsulation. Therefore, patch 4 keeps an exception for this case.

Ideally, we'd centralise the call to skb_reset_mac_header() in
ip_tunnel_rcv(), to avoid manual calls in ipip (patch 2),
sit (patch 3) and gre (patch 4). That's unfortunately not feasible
currently, because of the gre special case discussed above that
precludes us from resetting the MAC header unconditionally.

The original motivation is to redirect bareudp packets to Ethernet
devices (as described in patch 1). The rest of this series aims at
bringing consistency across all L3 devices (apart from gre's special
case unfortunately).

Note: the gtp patch results from pure code inspection and has been
compiled tested only.


Guillaume Nault (6):
  bareudp: allow redirecting bareudp packets to eth devices
  ipip: allow redirecting ipip and mplsip packets to eth devices
  sit: allow redirecting ip6ip, ipip and mplsip packets to eth devices
  gre: let mac_header point to outer header only when necessary
  ip6_tunnel: allow redirecting ip6gre and ipxip6 packets to eth devices
  gtp: reset mac_header after decap

 drivers/net/bareudp.c | 1 +
 drivers/net/gtp.c     | 1 +
 net/ipv4/ip_gre.c     | 7 ++++++-
 net/ipv4/ipip.c       | 2 ++
 net/ipv6/ip6_tunnel.c | 1 +
 net/ipv6/sit.c        | 4 ++++
 6 files changed, 15 insertions(+), 1 deletion(-)

Comments

David Ahern June 26, 2021, 5:50 p.m. UTC | #1
On 6/25/21 7:32 AM, Guillaume Nault wrote:
> Some virtual L3 devices, like vxlan-gpe and gre (in collect_md mode),
> reset the MAC header pointer after they parsed the outer headers. This
> accurately reflects the fact that the decapsulated packet is pure L3
> packet, as that makes the MAC header 0 bytes long (the MAC and network
> header pointers are equal).
> 
> However, many L3 devices only adjust the network header after
> decapsulation and leave the MAC header pointer to its original value.
> This can confuse other parts of the networking stack, like TC, which
> then considers the outer headers as one big MAC header.
> 
> This patch series makes the following L3 tunnels behave like VXLAN-GPE:
> bareudp, ipip, sit, gre, ip6gre, ip6tnl, gtp.
> 
> The case of gre is a bit special. It already resets the MAC header
> pointer in collect_md mode, so only the classical mode needs to be
> adjusted. However, gre also has a special case that expects the MAC
> header pointer to keep pointing to the outer header even after
> decapsulation. Therefore, patch 4 keeps an exception for this case.
> 
> Ideally, we'd centralise the call to skb_reset_mac_header() in
> ip_tunnel_rcv(), to avoid manual calls in ipip (patch 2),
> sit (patch 3) and gre (patch 4). That's unfortunately not feasible
> currently, because of the gre special case discussed above that
> precludes us from resetting the MAC header unconditionally.

What about adding a flag to ip_tunnel indicating if it can be done (or
should not be done since doing it is the most common)?

> 
> The original motivation is to redirect bareudp packets to Ethernet
> devices (as described in patch 1). The rest of this series aims at
> bringing consistency across all L3 devices (apart from gre's special
> case unfortunately).

Can you add a selftests that covers the use cases you mention in the
commit logs?
Guillaume Nault June 26, 2021, 8:53 p.m. UTC | #2
On Sat, Jun 26, 2021 at 11:50:19AM -0600, David Ahern wrote:
> On 6/25/21 7:32 AM, Guillaume Nault wrote:
> > Some virtual L3 devices, like vxlan-gpe and gre (in collect_md mode),
> > reset the MAC header pointer after they parsed the outer headers. This
> > accurately reflects the fact that the decapsulated packet is pure L3
> > packet, as that makes the MAC header 0 bytes long (the MAC and network
> > header pointers are equal).
> > 
> > However, many L3 devices only adjust the network header after
> > decapsulation and leave the MAC header pointer to its original value.
> > This can confuse other parts of the networking stack, like TC, which
> > then considers the outer headers as one big MAC header.
> > 
> > This patch series makes the following L3 tunnels behave like VXLAN-GPE:
> > bareudp, ipip, sit, gre, ip6gre, ip6tnl, gtp.
> > 
> > The case of gre is a bit special. It already resets the MAC header
> > pointer in collect_md mode, so only the classical mode needs to be
> > adjusted. However, gre also has a special case that expects the MAC
> > header pointer to keep pointing to the outer header even after
> > decapsulation. Therefore, patch 4 keeps an exception for this case.
> > 
> > Ideally, we'd centralise the call to skb_reset_mac_header() in
> > ip_tunnel_rcv(), to avoid manual calls in ipip (patch 2),
> > sit (patch 3) and gre (patch 4). That's unfortunately not feasible
> > currently, because of the gre special case discussed above that
> > precludes us from resetting the MAC header unconditionally.
> 
> What about adding a flag to ip_tunnel indicating if it can be done (or
> should not be done since doing it is the most common)?

That's feasible. I didn't do it here because I wanted to keep the
patch series focused on L3 tunnels. Modifying ip_tunnel_rcv()'s
prototype would also require updating erspan_rcv(), which isn't L3
(erspan carries Ethernet frames). I was feeling such consolidation
would be best done in a follow up patch series.

I can repost if you feel strongly about it. Otherwise, I'll follow up
with the ip_tunnel_rcv() consolidation in a later patch. Just let me
know if you have any preference.

> > The original motivation is to redirect bareudp packets to Ethernet
> > devices (as described in patch 1). The rest of this series aims at
> > bringing consistency across all L3 devices (apart from gre's special
> > case unfortunately).
> 
> Can you add a selftests that covers the use cases you mention in the
> commit logs?

I'm already having a selftests for most of the tunnels (and their
different operating modes), gtp being the main exception. But it's not
yet ready for upstream, as I'm trying to move the topology in its own
.sh file, to keep the main selftests as simple as possible.
I'll post it as soon as I get it in good shape.

Thanks for the rewiew.
David Ahern June 27, 2021, 3:56 p.m. UTC | #3
On 6/26/21 2:53 PM, Guillaume Nault wrote:
> On Sat, Jun 26, 2021 at 11:50:19AM -0600, David Ahern wrote:
>> On 6/25/21 7:32 AM, Guillaume Nault wrote:
>>> Some virtual L3 devices, like vxlan-gpe and gre (in collect_md mode),
>>> reset the MAC header pointer after they parsed the outer headers. This
>>> accurately reflects the fact that the decapsulated packet is pure L3
>>> packet, as that makes the MAC header 0 bytes long (the MAC and network
>>> header pointers are equal).
>>>
>>> However, many L3 devices only adjust the network header after
>>> decapsulation and leave the MAC header pointer to its original value.
>>> This can confuse other parts of the networking stack, like TC, which
>>> then considers the outer headers as one big MAC header.
>>>
>>> This patch series makes the following L3 tunnels behave like VXLAN-GPE:
>>> bareudp, ipip, sit, gre, ip6gre, ip6tnl, gtp.
>>>
>>> The case of gre is a bit special. It already resets the MAC header
>>> pointer in collect_md mode, so only the classical mode needs to be
>>> adjusted. However, gre also has a special case that expects the MAC
>>> header pointer to keep pointing to the outer header even after
>>> decapsulation. Therefore, patch 4 keeps an exception for this case.
>>>
>>> Ideally, we'd centralise the call to skb_reset_mac_header() in
>>> ip_tunnel_rcv(), to avoid manual calls in ipip (patch 2),
>>> sit (patch 3) and gre (patch 4). That's unfortunately not feasible
>>> currently, because of the gre special case discussed above that
>>> precludes us from resetting the MAC header unconditionally.
>>
>> What about adding a flag to ip_tunnel indicating if it can be done (or
>> should not be done since doing it is the most common)?
> 
> That's feasible. I didn't do it here because I wanted to keep the
> patch series focused on L3 tunnels. Modifying ip_tunnel_rcv()'s
> prototype would also require updating erspan_rcv(), which isn't L3
> (erspan carries Ethernet frames). I was feeling such consolidation
> would be best done in a follow up patch series.

I was thinking a flag in 'struct ip_tunnel'. It's the private data for
those netdevices, so a per-instance setting. I haven't walked through
the details to know if it would work.

> 
> I can repost if you feel strongly about it. Otherwise, I'll follow up
> with the ip_tunnel_rcv() consolidation in a later patch. Just let me
> know if you have any preference.
> 
>>> The original motivation is to redirect bareudp packets to Ethernet
>>> devices (as described in patch 1). The rest of this series aims at
>>> bringing consistency across all L3 devices (apart from gre's special
>>> case unfortunately).
>>
>> Can you add a selftests that covers the use cases you mention in the
>> commit logs?
> 
> I'm already having a selftests for most of the tunnels (and their
> different operating modes), gtp being the main exception. But it's not
> yet ready for upstream, as I'm trying to move the topology in its own
> .sh file, to keep the main selftests as simple as possible.
> I'll post it as soon as I get it in good shape.
> 

That works. Thanks,
Guillaume Nault June 28, 2021, 3:04 p.m. UTC | #4
On Sun, Jun 27, 2021 at 09:56:53AM -0600, David Ahern wrote:
> On 6/26/21 2:53 PM, Guillaume Nault wrote:
> > On Sat, Jun 26, 2021 at 11:50:19AM -0600, David Ahern wrote:
> >> On 6/25/21 7:32 AM, Guillaume Nault wrote:
> >>> Some virtual L3 devices, like vxlan-gpe and gre (in collect_md mode),
> >>> reset the MAC header pointer after they parsed the outer headers. This
> >>> accurately reflects the fact that the decapsulated packet is pure L3
> >>> packet, as that makes the MAC header 0 bytes long (the MAC and network
> >>> header pointers are equal).
> >>>
> >>> However, many L3 devices only adjust the network header after
> >>> decapsulation and leave the MAC header pointer to its original value.
> >>> This can confuse other parts of the networking stack, like TC, which
> >>> then considers the outer headers as one big MAC header.
> >>>
> >>> This patch series makes the following L3 tunnels behave like VXLAN-GPE:
> >>> bareudp, ipip, sit, gre, ip6gre, ip6tnl, gtp.
> >>>
> >>> The case of gre is a bit special. It already resets the MAC header
> >>> pointer in collect_md mode, so only the classical mode needs to be
> >>> adjusted. However, gre also has a special case that expects the MAC
> >>> header pointer to keep pointing to the outer header even after
> >>> decapsulation. Therefore, patch 4 keeps an exception for this case.
> >>>
> >>> Ideally, we'd centralise the call to skb_reset_mac_header() in
> >>> ip_tunnel_rcv(), to avoid manual calls in ipip (patch 2),
> >>> sit (patch 3) and gre (patch 4). That's unfortunately not feasible
> >>> currently, because of the gre special case discussed above that
> >>> precludes us from resetting the MAC header unconditionally.
> >>
> >> What about adding a flag to ip_tunnel indicating if it can be done (or
> >> should not be done since doing it is the most common)?
> > 
> > That's feasible. I didn't do it here because I wanted to keep the
> > patch series focused on L3 tunnels. Modifying ip_tunnel_rcv()'s
> > prototype would also require updating erspan_rcv(), which isn't L3
> > (erspan carries Ethernet frames). I was feeling such consolidation
> > would be best done in a follow up patch series.
> 
> I was thinking a flag in 'struct ip_tunnel'. It's the private data for
> those netdevices, so a per-instance setting. I haven't walked through
> the details to know if it would work.

I didn't think about that. Good idea, that looks perfectly doable. But
I'd still prefer to centralise the skb_reset_mac_header() call in a
dedicated patch set. I we did it here, we'd have to not reset the mac
header by default, to guarantee that unrelated tunnels wouldn't be
affected.
However, I think that the default behaviour should be to reset the mac
header and that only special cases, like the one in ip_gre, should
explicitely turn that off. Therefore, we'd need a follow up patch
anyway, to change the way this "reset_mac" flag would be set.

IMHO, the current approach has the advantage of clearly separating the
new feature from the refactoring. But if you feel strongly about using
a flag in struct ip_tunnel, I can rework this series.
David Ahern June 28, 2021, 6:46 p.m. UTC | #5
On 6/28/21 9:04 AM, Guillaume Nault wrote:
> On Sun, Jun 27, 2021 at 09:56:53AM -0600, David Ahern wrote:
>> On 6/26/21 2:53 PM, Guillaume Nault wrote:
>>> On Sat, Jun 26, 2021 at 11:50:19AM -0600, David Ahern wrote:
>>>> On 6/25/21 7:32 AM, Guillaume Nault wrote:
>>>>> Some virtual L3 devices, like vxlan-gpe and gre (in collect_md mode),
>>>>> reset the MAC header pointer after they parsed the outer headers. This
>>>>> accurately reflects the fact that the decapsulated packet is pure L3
>>>>> packet, as that makes the MAC header 0 bytes long (the MAC and network
>>>>> header pointers are equal).
>>>>>
>>>>> However, many L3 devices only adjust the network header after
>>>>> decapsulation and leave the MAC header pointer to its original value.
>>>>> This can confuse other parts of the networking stack, like TC, which
>>>>> then considers the outer headers as one big MAC header.
>>>>>
>>>>> This patch series makes the following L3 tunnels behave like VXLAN-GPE:
>>>>> bareudp, ipip, sit, gre, ip6gre, ip6tnl, gtp.
>>>>>
>>>>> The case of gre is a bit special. It already resets the MAC header
>>>>> pointer in collect_md mode, so only the classical mode needs to be
>>>>> adjusted. However, gre also has a special case that expects the MAC
>>>>> header pointer to keep pointing to the outer header even after
>>>>> decapsulation. Therefore, patch 4 keeps an exception for this case.
>>>>>
>>>>> Ideally, we'd centralise the call to skb_reset_mac_header() in
>>>>> ip_tunnel_rcv(), to avoid manual calls in ipip (patch 2),
>>>>> sit (patch 3) and gre (patch 4). That's unfortunately not feasible
>>>>> currently, because of the gre special case discussed above that
>>>>> precludes us from resetting the MAC header unconditionally.
>>>>
>>>> What about adding a flag to ip_tunnel indicating if it can be done (or
>>>> should not be done since doing it is the most common)?
>>>
>>> That's feasible. I didn't do it here because I wanted to keep the
>>> patch series focused on L3 tunnels. Modifying ip_tunnel_rcv()'s
>>> prototype would also require updating erspan_rcv(), which isn't L3
>>> (erspan carries Ethernet frames). I was feeling such consolidation
>>> would be best done in a follow up patch series.
>>
>> I was thinking a flag in 'struct ip_tunnel'. It's the private data for
>> those netdevices, so a per-instance setting. I haven't walked through
>> the details to know if it would work.
> 
> I didn't think about that. Good idea, that looks perfectly doable. But
> I'd still prefer to centralise the skb_reset_mac_header() call in a
> dedicated patch set. I we did it here, we'd have to not reset the mac
> header by default, to guarantee that unrelated tunnels wouldn't be
> affected.
> However, I think that the default behaviour should be to reset the mac
> header and that only special cases, like the one in ip_gre, should
> explicitely turn that off. Therefore, we'd need a follow up patch
> anyway, to change the way this "reset_mac" flag would be set.
> 
> IMHO, the current approach has the advantage of clearly separating the
> new feature from the refactoring. But if you feel strongly about using
> a flag in struct ip_tunnel, I can rework this series.
> 

I am accustomed to doing refactoring first to add some new feature in
the simplest and clearest way.
patchwork-bot+netdevbpf@kernel.org June 28, 2021, 8:10 p.m. UTC | #6
Hello:

This series was applied to netdev/net-next.git (refs/heads/master):

On Fri, 25 Jun 2021 15:32:59 +0200 you wrote:
> Some virtual L3 devices, like vxlan-gpe and gre (in collect_md mode),
> reset the MAC header pointer after they parsed the outer headers. This
> accurately reflects the fact that the decapsulated packet is pure L3
> packet, as that makes the MAC header 0 bytes long (the MAC and network
> header pointers are equal).
> 
> However, many L3 devices only adjust the network header after
> decapsulation and leave the MAC header pointer to its original value.
> This can confuse other parts of the networking stack, like TC, which
> then considers the outer headers as one big MAC header.
> 
> [...]

Here is the summary with links:
  - [net-next,1/6] bareudp: allow redirecting bareudp packets to eth devices
    https://git.kernel.org/netdev/net-next/c/99c8719b7981
  - [net-next,2/6] ipip: allow redirecting ipip and mplsip packets to eth devices
    https://git.kernel.org/netdev/net-next/c/7ad136fd288c
  - [net-next,3/6] sit: allow redirecting ip6ip, ipip and mplsip packets to eth devices
    https://git.kernel.org/netdev/net-next/c/730eed2772e7
  - [net-next,4/6] gre: let mac_header point to outer header only when necessary
    https://git.kernel.org/netdev/net-next/c/aab1e898c26c
  - [net-next,5/6] ip6_tunnel: allow redirecting ip6gre and ipxip6 packets to eth devices
    https://git.kernel.org/netdev/net-next/c/da5a2e49f064
  - [net-next,6/6] gtp: reset mac_header after decap
    https://git.kernel.org/netdev/net-next/c/b2d898c8a523

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
Guillaume Nault June 28, 2021, 8:40 p.m. UTC | #7
On Mon, Jun 28, 2021 at 12:46:24PM -0600, David Ahern wrote:
> On 6/28/21 9:04 AM, Guillaume Nault wrote:
> > On Sun, Jun 27, 2021 at 09:56:53AM -0600, David Ahern wrote:
> >> On 6/26/21 2:53 PM, Guillaume Nault wrote:
> >>> On Sat, Jun 26, 2021 at 11:50:19AM -0600, David Ahern wrote:
> >>>> On 6/25/21 7:32 AM, Guillaume Nault wrote:
> >>>>> Some virtual L3 devices, like vxlan-gpe and gre (in collect_md mode),
> >>>>> reset the MAC header pointer after they parsed the outer headers. This
> >>>>> accurately reflects the fact that the decapsulated packet is pure L3
> >>>>> packet, as that makes the MAC header 0 bytes long (the MAC and network
> >>>>> header pointers are equal).
> >>>>>
> >>>>> However, many L3 devices only adjust the network header after
> >>>>> decapsulation and leave the MAC header pointer to its original value.
> >>>>> This can confuse other parts of the networking stack, like TC, which
> >>>>> then considers the outer headers as one big MAC header.
> >>>>>
> >>>>> This patch series makes the following L3 tunnels behave like VXLAN-GPE:
> >>>>> bareudp, ipip, sit, gre, ip6gre, ip6tnl, gtp.
> >>>>>
> >>>>> The case of gre is a bit special. It already resets the MAC header
> >>>>> pointer in collect_md mode, so only the classical mode needs to be
> >>>>> adjusted. However, gre also has a special case that expects the MAC
> >>>>> header pointer to keep pointing to the outer header even after
> >>>>> decapsulation. Therefore, patch 4 keeps an exception for this case.
> >>>>>
> >>>>> Ideally, we'd centralise the call to skb_reset_mac_header() in
> >>>>> ip_tunnel_rcv(), to avoid manual calls in ipip (patch 2),
> >>>>> sit (patch 3) and gre (patch 4). That's unfortunately not feasible
> >>>>> currently, because of the gre special case discussed above that
> >>>>> precludes us from resetting the MAC header unconditionally.
> >>>>
> >>>> What about adding a flag to ip_tunnel indicating if it can be done (or
> >>>> should not be done since doing it is the most common)?
> >>>
> >>> That's feasible. I didn't do it here because I wanted to keep the
> >>> patch series focused on L3 tunnels. Modifying ip_tunnel_rcv()'s
> >>> prototype would also require updating erspan_rcv(), which isn't L3
> >>> (erspan carries Ethernet frames). I was feeling such consolidation
> >>> would be best done in a follow up patch series.
> >>
> >> I was thinking a flag in 'struct ip_tunnel'. It's the private data for
> >> those netdevices, so a per-instance setting. I haven't walked through
> >> the details to know if it would work.
> > 
> > I didn't think about that. Good idea, that looks perfectly doable. But
> > I'd still prefer to centralise the skb_reset_mac_header() call in a
> > dedicated patch set. I we did it here, we'd have to not reset the mac
> > header by default, to guarantee that unrelated tunnels wouldn't be
> > affected.
> > However, I think that the default behaviour should be to reset the mac
> > header and that only special cases, like the one in ip_gre, should
> > explicitely turn that off. Therefore, we'd need a follow up patch
> > anyway, to change the way this "reset_mac" flag would be set.
> > 
> > IMHO, the current approach has the advantage of clearly separating the
> > new feature from the refactoring. But if you feel strongly about using
> > a flag in struct ip_tunnel, I can rework this series.
> > 
> 
> I am accustomed to doing refactoring first to add some new feature in
> the simplest and clearest way.

I agree on the general statement. Anyway, I've just received the
patchwork-bot message saying the series has been applied. I'll follow
up with the struct ip_tunnel flag once net-next reopens (as I guess it's
about to close).