[v3] veth: Drop MTU check when forwarding packets

Message ID 20240808070428.13643-1-djduanjiong@gmail.com (mailing list archive)
State Rejected
Delegated to: Netdev Maintainers
Series [v3] veth: Drop MTU check when forwarding packets

Checks

Context Check Description
netdev/series_format warning Single patches do not need cover letters; Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next, async
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 68 this patch: 68
netdev/build_tools success Errors and warnings before: 9 this patch: 9
netdev/cc_maintainers fail 8 maintainers not CCed: edumazet@google.com hawk@kernel.org kuba@kernel.org daniel@iogearbox.net bpf@vger.kernel.org john.fastabend@gmail.com ast@kernel.org pabeni@redhat.com
netdev/build_clang success Errors and warnings before: 115 this patch: 115
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 4085 this patch: 4085
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 27 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 101 this patch: 101
netdev/source_inline success Was 0 now: 0
netdev/contest warning net-next-2024-08-08--15-00 (tests: 705)

Commit Message

Duan Jiong Aug. 8, 2024, 7:04 a.m. UTC
When the two ends of a veth pair are configured with different MTUs,
there is no need to check the MTU when forwarding packets; it should
be permissible for a device to receive packets larger than its own MTU.

Signed-off-by: Duan Jiong <djduanjiong@gmail.com>
---
 drivers/net/veth.c        | 2 +-
 include/linux/netdevice.h | 1 +
 net/core/dev.c            | 6 ++++++
 3 files changed, 8 insertions(+), 1 deletion(-)

Comments

Toke Høiland-Jørgensen Aug. 8, 2024, 9:23 a.m. UTC | #1
Duan Jiong <djduanjiong@gmail.com> writes:

> When the two ends of a veth pair are configured with different MTUs,
> there is no need to check the MTU when forwarding packets; it should
> be permissible for a device to receive packets larger than its own MTU.

Erm, huh? The MTU check is against the receiving interface, so AFAICT
your patch just disables MTU checking entirely for veth devices, which
seems ill-advised.

What's the issue you are trying to fix, exactly?

-Toke
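
For reference, the check in question is the one __dev_forward_skb() applies
via is_skb_forwardable() in net/core/dev.c. The following is a paraphrased
sketch of its logic (simplified, not the literal mainline code), showing that
the limit is derived from the receiving device's MTU:

#include <linux/if_vlan.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Paraphrased sketch of the size check __dev_forward_skb() applies via
 * is_skb_forwardable() in net/core/dev.c; simplified for illustration.
 */
static bool skb_fits_rx_dev(const struct net_device *dev,
			    const struct sk_buff *skb)
{
	unsigned int limit;

	if (!(dev->flags & IFF_UP))
		return false;

	/* The ceiling is the receiving device's MTU plus link-layer
	 * headroom, which is why a smaller MTU on the veth peer gets
	 * the packet dropped.
	 */
	limit = dev->mtu + dev->hard_header_len + VLAN_HLEN;
	if (skb->len <= limit)
		return true;

	/* GSO packets are allowed through; they can be segmented later
	 * to the correct size.
	 */
	return skb_is_gso(skb);
}
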
Duan Jiong Aug. 8, 2024, 9:55 a.m. UTC | #2
Consider a veth pair with vethA on the host side and vethB on the
container side. When vethA sends data to vethB and vethB is configured
with a smaller MTU, the data is discarded. In my opinion there is no
problem with receiving that data, and I would like it to be accepted.
When vethB sends data, the kernel stack still segments it according to
vethB's MTU.

On Thu, Aug 8, 2024 at 5:23 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>
> Duan Jiong <djduanjiong@gmail.com> writes:
>
> > When the two ends of a veth pair are configured with different MTUs,
> > there is no need to check the MTU when forwarding packets; it should
> > be permissible for a device to receive packets larger than its own MTU.
>
> Erm, huh? The MTU check is against the receiving interface, so AFAICT
> your patch just disables MTU checking entirely for veth devices, which
> seems ill-advised.
>
> What's the issue you are trying to fix, exactly?
>
> -Toke
Toke Høiland-Jørgensen Aug. 8, 2024, 10:21 a.m. UTC | #3
Duan Jiong <djduanjiong@gmail.com> writes:

> Consider a veth pair with vethA on the host side and vethB on the
> container side. When vethA sends data to vethB and vethB is configured
> with a smaller MTU, the data is discarded. In my opinion there is no
> problem with receiving that data, and I would like it to be accepted.
> When vethB sends data, the kernel stack still segments it according to
> vethB's MTU.

Well, yes, if the MTU is too small, the packet will be dropped. That's
what the MTU setting is for :)

If you want to send bigger packets just, y'know, raise the MTU? What is
the reason you can't do that?

-Toke
Toke Høiland-Jørgensen Aug. 8, 2024, 12:24 p.m. UTC | #4
djduanjiong@gmail.com writes:

> This is similar to a virtual machine that receives packets larger than
> its own mtu, regardless of the mtu configured in the guest.  Or, to
> put it another way, what are the negative effects of this change?

Well, it's changing long-standing behaviour (the MTU check has been
there throughout the history of veth). Changing it will mean that
applications that set the MTU and rely on the fact that they will never
receive packets higher than the MTU, will potentially break in
interesting ways.

You still haven't answered what's keeping you from setting the MTU
correctly on the veth devices you're using?

-Toke
Willem de Bruijn Aug. 8, 2024, 7:38 p.m. UTC | #5
Toke Høiland-Jørgensen wrote:
> djduanjiong@gmail.com writes:
> 
> > This is similar to a virtual machine that receives packets larger than
> > its own mtu, regardless of the mtu configured in the guest.  Or, to
> > put it another way, what are the negative effects of this change?
> 
> Well, it's changing long-standing behaviour (the MTU check has been
> there throughout the history of veth). Changing it will mean that
> applications that set the MTU and rely on the fact that they will never
> receive packets higher than the MTU, will potentially break in
> interesting ways.

That this works is very veth specific, though?

In general this max *transfer* unit configuration makes no assurances
on the size of packets arriving. Though posted rx buffer size does,
for which veth has no equivalent.
 
> You still haven't answered what's keeping you from setting the MTU
> correctly on the veth devices you're using?

Agreed that it has a risk, so some justification is in order. Similar
to how commit 5f7d57280c19 ("bpf: Drop MTU check when doing TC-BPF
redirect to ingress") addressed a specific need.
Toke Høiland-Jørgensen Aug. 9, 2024, 9:47 a.m. UTC | #6
Willem de Bruijn <willemdebruijn.kernel@gmail.com> writes:

> Toke Høiland-Jørgensen wrote:
>> djduanjiong@gmail.com writes:
>> 
>> > This is similar to a virtual machine that receives packets larger than
>> > its own mtu, regardless of the mtu configured in the guest.  Or, to
>> > put it another way, what are the negative effects of this change?
>> 
>> Well, it's changing long-standing behaviour (the MTU check has been
>> there throughout the history of veth). Changing it will mean that
>> applications that set the MTU and rely on the fact that they will never
>> receive packets higher than the MTU, will potentially break in
>> interesting ways.
>
> That this works is very veth specific, though?
>
> In general this max *transfer* unit configuration makes no assurances
> on the size of packets arriving. Though posted rx buffer size does,
> for which veth has no equivalent.

Well, in practice we do use the MTU to limit the RX size as well. See
for instance the limit on MTU sizes if an XDP program without
multibuffer support is loaded. And I don't think having an asymmetric
MTU setting on a physical point-to-point Ethernet link generally works
either. So in that sense it does make sense that veth has this
limitation, given that it's basically emulating an ethernet wire.
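
The XDP limit mentioned above is applied when a program is attached: veth
refuses peer MTUs that would not fit a single page-sized buffer unless the
program declares fragment (multi-buffer) support. A simplified sketch of
that check, with driver-internal details abridged (so not the literal
drivers/net/veth.c code):

#include <linux/bpf.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Simplified sketch of the ceiling veth applies at XDP attach time; the
 * real check lives in veth_xdp_set() and also reports the error via
 * extack. The driver-internal VETH_XDP_HEADROOM constant is passed in
 * here instead of being hard-coded.
 */
static int veth_xdp_mtu_ok(const struct net_device *peer,
			   const struct bpf_prog *prog,
			   unsigned int xdp_headroom)
{
	unsigned int max_mtu;

	/* One page of buffer, minus XDP headroom and link-layer header. */
	max_mtu = SKB_WITH_OVERHEAD(PAGE_SIZE - xdp_headroom) -
		  peer->hard_header_len;

	/* Multi-buffer (frags-aware) programs get a larger ceiling. */
	if (prog->aux->xdp_has_frags)
		max_mtu += PAGE_SIZE * MAX_SKB_FRAGS;

	return peer->mtu > max_mtu ? -ERANGE : 0;
}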

I do see your point that a virtual device doesn't really *have* to
respect MTU, though. So if we were implementing a new driver this
argument would be a lot easier to make. In fact, AFAICT the newer netkit
driver doesn't check the MTU setting before forwarding, so there's
already some inconsistency there.

>> You still haven't answered what's keeping you from setting the MTU
>> correctly on the veth devices you're using?
>
> Agreed that it has a risk, so some justification is in order. Similar
> to how commit 5f7d57280c19 (" bpf: Drop MTU check when doing TC-BPF
> redirect to ingress") addressed a specific need.

Exactly :)

And cf the above, using netkit may be an alternative that doesn't carry
this risk (assuming that's compatible with the use case).

-Toke
Duan Jiong Aug. 12, 2024, 9:11 a.m. UTC | #7
Sorry for responding to the email so late.

>
> I do see your point that a virtual device doesn't really *have* to
> respect MTU, though. So if we were implementing a new driver this
> argument would be a lot easier to make. In fact, AFAICT the newer netkit
> driver doesn't check the MTU setting before forwarding, so there's
> already some inconsistency there.
>
> >> You still haven't answered what's keeping you from setting the MTU
> >> correctly on the veth devices you're using?
> >

vm1(mtu 1600)---ovs---ipsec vpn1(mtu 1500)---ipsec vpn2(mtu 1500)---ovs---vm2(mtu 1600)

In my scenario, two VMs communicate through IPsec VPN gateways. The two
VPN gateways are interconnected over the public network, and each
gateway has only one NIC (single-arm mode). The VPN gateway MTU is
generally 1500, but the packets the VMs send to the gateway may be
larger than 1500, and with the existing veth driver those packets are
discarded. If the gateway were allowed to receive the larger packets,
it could ESP-encapsulate them and then fragment the result, so
end-to-end connectivity would not be affected.

> > Agreed that it has a risk, so some justification is in order. Similar
> > to how commit 5f7d57280c19 (" bpf: Drop MTU check when doing TC-BPF
> > redirect to ingress") addressed a specific need.
>
> Exactly :)
>
> And cf the above, using netkit may be an alternative that doesn't carry
> this risk (assuming that's compatible with the use case).
>
> -Toke


I can see how there could be a potential risk here. Could we consider
adding a switchable option to control this behavior?
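
For illustration, a minimal sketch of what such a knob could look like at
veth's forwarding point, assuming a hypothetical ignore_peer_mtu flag; the
flag and the netlink or sysctl plumbing to set it do not exist, and the
sketch only reuses the helpers from the patch below:

/* Hypothetical illustration only: keep the MTU check by default and
 * bypass it when a per-device knob is set. Neither the knob nor any
 * UAPI for configuring it exists in mainline.
 */
static int veth_forward_skb_toggle(struct net_device *dev,
				   struct sk_buff *skb,
				   struct veth_rq *rq, bool xdp,
				   bool ignore_peer_mtu)
{
	int ret = ignore_peer_mtu ? __dev_forward_skb_nomtu(dev, skb) :
				    __dev_forward_skb(dev, skb);

	return ret ?: (xdp ? veth_xdp_rx(rq, skb) : __netif_rx(skb));
}

In practice such a flag would presumably live in veth's private data and be
exposed through a new netlink attribute, which is exactly the UAPI and
maintenance cost discussed in the follow-up.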
Toke Høiland-Jørgensen Aug. 13, 2024, 11:40 a.m. UTC | #8
Duan Jiong <djduanjiong@gmail.com> writes:

> Sorry for responding to the email so late.
>
>>
>> I do see your point that a virtual device doesn't really *have* to
>> respect MTU, though. So if we were implementing a new driver this
>> argument would be a lot easier to make. In fact, AFAICT the newer netkit
>> driver doesn't check the MTU setting before forwarding, so there's
>> already some inconsistency there.
>>
>> >> You still haven't answered what's keeping you from setting the MTU
>> >> correctly on the veth devices you're using?
>> >
>
> vm1(mtu 1600)---ovs---ipsec vpn1(mtu 1500)---ipsec vpn2(mtu 1500)---ovs---vm2(mtu 1600)

Where's the veth device in this setup?

> In my scenario, two VMs communicate through IPsec VPN gateways. The two
> VPN gateways are interconnected over the public network, and each
> gateway has only one NIC (single-arm mode). The VPN gateway MTU is
> generally 1500, but the packets the VMs send to the gateway may be
> larger than 1500, and with the existing veth driver those packets are
> discarded. If the gateway were allowed to receive the larger packets,
> it could ESP-encapsulate them and then fragment the result, so
> end-to-end connectivity would not be affected.

I'm not sure I quite get the setup; it sounds like you want a subset of
the traffic to adhere to one MTU, and another subset to adhere to a
different MTU, on the same interface? Could you not divide the traffic
over two different interfaces (with different MTUs) instead?

>> > Agreed that it has a risk, so some justification is in order. Similar
>> > to how commit 5f7d57280c19 (" bpf: Drop MTU check when doing TC-BPF
>> > redirect to ingress") addressed a specific need.
>>
>> Exactly :)
>>
>> And cf the above, using netkit may be an alternative that doesn't carry
>> this risk (assuming that's compatible with the use case).
>>
>> -Toke
>
>
> I can see how there could be a potential risk here. Could we consider
> adding a switchable option to control this behavior?

Hmm, a toggle has its own cost in terms of complexity and overhead. Plus
it's adding new UAPI. It may be that this is the least bad option in the
end, but before going that route we should be very sure that there's not
another way to solve your problem (cf the above).

This has been discussed before, BTW, most recently five-and-some
years ago:

https://patchwork.ozlabs.org/project/netdev/patch/CAMJ5cBHZ4DqjE6Md-0apA8aaLLk9Hpiypfooo7ud-p9XyFyeng@mail.gmail.com/

-Toke
Duan Jiong Aug. 14, 2024, 6:42 a.m. UTC | #9
On Tue, Aug 13, 2024 at 7:40 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>
> Duan Jiong <djduanjiong@gmail.com> writes:
>
> >> >
> >
> > vm1(mtu 1600)---ovs---ipsec vpn1(mtu 1500)---ipsec vpn2(mtu 1500)---ovs---vm2(mtu 1600)
>
> Where's the veth device in this setup?
>

The veth device is what the IPsec VPN container uses to connect to OVS,
and traffic both before and after ESP encapsulation goes through this NIC.


> > In my scenario, two VMs communicate through IPsec VPN gateways. The two
> > VPN gateways are interconnected over the public network, and each
> > gateway has only one NIC (single-arm mode). The VPN gateway MTU is
> > generally 1500, but the packets the VMs send to the gateway may be
> > larger than 1500, and with the existing veth driver those packets are
> > discarded. If the gateway were allowed to receive the larger packets,
> > it could ESP-encapsulate them and then fragment the result, so
> > end-to-end connectivity would not be affected.
>
> I'm not sure I quite get the setup; it sounds like you want a subset of
> the traffic to adhere to one MTU, and another subset to adhere to a
> different MTU, on the same interface? Could you not divide the traffic
> over two different interfaces (with different MTUs) instead?
>

This is indeed a viable option, but it's not easy to change our own
implementation right now, so we're just seeing if it's feasible to skip
the veth mtu check.


> >> > Agreed that it has a risk, so some justification is in order. Similar
> >> > to how commit 5f7d57280c19 (" bpf: Drop MTU check when doing TC-BPF
> >> > redirect to ingress") addressed a specific need.
> >>
> >> Exactly :)
> >>
> >> And cf the above, using netkit may be an alternative that doesn't carry
> >> this risk (assuming that's compatible with the use case).
> >>
> >> -Toke
> >
> >
> > I can see how there could be a potential risk here. Could we consider
> > adding a switchable option to control this behavior?
>
> Hmm, a toggle has its own cost in terms of complexity and overhead. Plus
> it's adding new UAPI. It may be that this is the least bad option in the
> end, but before going that route we should be very sure that there's not
> another way to solve your problem (cf the above).
>
> This has been discussed before, BTW, most recently five-and-some
> years ago:
>
> https://patchwork.ozlabs.org/project/netdev/patch/CAMJ5cBHZ4DqjE6Md-0apA8aaLLk9Hpiypfooo7ud-p9XyFyeng@mail.gmail.com/
>
> -Toke
Toke Høiland-Jørgensen Aug. 16, 2024, 9:09 p.m. UTC | #10
Duan Jiong <djduanjiong@gmail.com> writes:

> On Tue, Aug 13, 2024 at 7:40 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>>
>> Duan Jiong <djduanjiong@gmail.com> writes:
>>
>> >> >
>> >
>> > vm1(mtu 1600)---ovs---ipsec vpn1(mtu 1500)---ipsec vpn2(mtu 1500)---ovs---vm2(mtu 1600)
>>
>> Where's the veth device in this setup?
>>
>
> The veth device is what the IPsec VPN container uses to connect to OVS,
> and traffic both before and after ESP encapsulation goes through this NIC.
>
>
>> > In my scenario, two VMs communicate through IPsec VPN gateways. The two
>> > VPN gateways are interconnected over the public network, and each
>> > gateway has only one NIC (single-arm mode). The VPN gateway MTU is
>> > generally 1500, but the packets the VMs send to the gateway may be
>> > larger than 1500, and with the existing veth driver those packets are
>> > discarded. If the gateway were allowed to receive the larger packets,
>> > it could ESP-encapsulate them and then fragment the result, so
>> > end-to-end connectivity would not be affected.
>>
>> I'm not sure I quite get the setup; it sounds like you want a subset of
>> the traffic to adhere to one MTU, and another subset to adhere to a
>> different MTU, on the same interface? Could you not divide the traffic
>> over two different interfaces (with different MTUs) instead?
>>
>
> This is indeed a viable option, but it's not easy to change our own
> implementation right now, so we're just seeing if it's feasible to skip
> the veth mtu check.

Hmm, well, that basically means asking the kernel community to take on
an extra maintenance burden so that you won't have to do that yourself.
And since there's a workaround, my inclination would be to say that the
use case is not sufficiently compelling that it's worth doing so.

However, I don't feel that strongly about it, and if others think it's
worth adding a knob to disable the MTU check on forward, I won't object.

-Toke

Patch

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 426e68a95067..f505fe2a55c1 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -317,7 +317,7 @@  static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
 static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
 			    struct veth_rq *rq, bool xdp)
 {
-	return __dev_forward_skb(dev, skb) ?: xdp ?
+	return __dev_forward_skb_nomtu(dev, skb) ?: xdp ?
 		veth_xdp_rx(rq, skb) :
 		__netif_rx(skb);
 }
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d20c6c99eb88..8cee9b40e50e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3943,6 +3943,7 @@  int bpf_xdp_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
 u8 dev_xdp_prog_count(struct net_device *dev);
 u32 dev_xdp_prog_id(struct net_device *dev, enum bpf_xdp_mode mode);
 
+int __dev_forward_skb_nomtu(struct net_device *dev, struct sk_buff *skb);
 int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
 int dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
 int dev_forward_skb_nomtu(struct net_device *dev, struct sk_buff *skb);
diff --git a/net/core/dev.c b/net/core/dev.c
index e1bb6d7856d9..acd740f78b1c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2190,6 +2190,12 @@  static int __dev_forward_skb2(struct net_device *dev, struct sk_buff *skb,
 	return ret;
 }
 
+int __dev_forward_skb_nomtu(struct net_device *dev, struct sk_buff *skb)
+{
+	return __dev_forward_skb2(dev, skb, false);
+}
+EXPORT_SYMBOL_GPL(__dev_forward_skb_nomtu);
+
 int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
 {
 	return __dev_forward_skb2(dev, skb, true);