[v3] veth: Drop MTU check when forwarding packets

Message ID 20240808070428.13643-1-djduanjiong@gmail.com (mailing list archive)
State Rejected
Delegated to: Netdev Maintainers
Series [v3] veth: Drop MTU check when forwarding packets

Checks

Context Check Description
netdev/series_format warning Single patches do not need cover letters; Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next, async
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 68 this patch: 68
netdev/build_tools success Errors and warnings before: 9 this patch: 9
netdev/cc_maintainers fail 8 maintainers not CCed: edumazet@google.com hawk@kernel.org kuba@kernel.org daniel@iogearbox.net bpf@vger.kernel.org john.fastabend@gmail.com ast@kernel.org pabeni@redhat.com
netdev/build_clang success Errors and warnings before: 115 this patch: 115
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 4085 this patch: 4085
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 27 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 101 this patch: 101
netdev/source_inline success Was 0 now: 0
netdev/contest warning net-next-2024-08-08--15-00 (tests: 705)

Commit Message

Duan Jiong Aug. 8, 2024, 7:04 a.m. UTC
When the two ends of a veth pair are configured with different MTUs,
there is no need to check the MTU when forwarding packets; it should
be permissible for a device to receive packets larger than its own MTU.

Signed-off-by: Duan Jiong <djduanjiong@gmail.com>
---
 drivers/net/veth.c        | 2 +-
 include/linux/netdevice.h | 1 +
 net/core/dev.c            | 6 ++++++
 3 files changed, 8 insertions(+), 1 deletion(-)

Comments

Toke Høiland-Jørgensen Aug. 8, 2024, 9:23 a.m. UTC | #1
Duan Jiong <djduanjiong@gmail.com> writes:

> When the two ends of a veth pair are configured with different MTUs,
> there is no need to check the MTU when forwarding packets; it should
> be permissible for a device to receive packets larger than its own MTU.

Erm, huh? The MTU check is against the receiving interface, so AFAICT
your patch just disables MTU checking entirely for veth devices, which
seems ill-advised.

What's the issue you are trying to fix, exactly?

-Toke
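
For reference, the check in question is the one __dev_forward_skb() applies
via is_skb_forwardable() in net/core/dev.c. The following is a paraphrased
sketch of its logic (simplified, not the literal mainline code), showing that
the limit is derived from the receiving device's MTU:

#include <linux/if_vlan.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Paraphrased sketch of the size check __dev_forward_skb() applies via
 * is_skb_forwardable() in net/core/dev.c; simplified for illustration.
 */
static bool skb_fits_rx_dev(const struct net_device *dev,
			    const struct sk_buff *skb)
{
	unsigned int limit;

	if (!(dev->flags & IFF_UP))
		return false;

	/* The ceiling is the receiving device's MTU plus link-layer
	 * headroom, which is why a smaller MTU on the veth peer gets
	 * the packet dropped.
	 */
	limit = dev->mtu + dev->hard_header_len + VLAN_HLEN;
	if (skb->len <= limit)
		return true;

	/* GSO packets are allowed through; they can be segmented later
	 * to the correct size.
	 */
	return skb_is_gso(skb);
}
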
Duan Jiong Aug. 8, 2024, 9:55 a.m. UTC | #2
Consider a veth pair with vethA on the host side and vethB on the
container side. When vethA sends data to vethB and vethB is configured
with a smaller MTU, the data is discarded. In my opinion there is no
problem with receiving that data, and I would like it to be accepted.
When vethB sends data, the kernel stack still segments it according to
vethB's MTU.

On Thu, Aug 8, 2024 at 5:23 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>
> Duan Jiong <djduanjiong@gmail.com> writes:
>
> > When the two ends of a veth pair are configured with different MTUs,
> > there is no need to check the MTU when forwarding packets; it should
> > be permissible for a device to receive packets larger than its own MTU.
>
> Erm, huh? The MTU check is against the receiving interface, so AFAICT
> your patch just disables MTU checking entirely for veth devices, which
> seems ill-advised.
>
> What's the issue you are trying to fix, exactly?
>
> -Toke
Toke Høiland-Jørgensen Aug. 8, 2024, 10:21 a.m. UTC | #3
Duan Jiong <djduanjiong@gmail.com> writes:

> Consider a veth pair with vethA on the host side and vethB on the
> container side. When vethA sends data to vethB and vethB is configured
> with a smaller MTU, the data is discarded. In my opinion there is no
> problem with receiving that data, and I would like it to be accepted.
> When vethB sends data, the kernel stack still segments it according to
> vethB's MTU.

Well, yes, if the MTU is too small, the packet will be dropped. That's
what the MTU setting is for :)

If you want to send bigger packets just, y'know, raise the MTU? What is
the reason you can't do that?

-Toke
Toke Høiland-Jørgensen Aug. 8, 2024, 12:24 p.m. UTC | #4
djduanjiong@gmail.com writes:

> This is similar to a virtual machine that receives packets larger than
> its own mtu, regardless of the mtu configured in the guest.  Or, to
> put it another way, what are the negative effects of this change?

Well, it's changing long-standing behaviour (the MTU check has been
there throughout the history of veth). Changing it will mean that
applications that set the MTU and rely on the fact that they will never
receive packets higher than the MTU, will potentially break in
interesting ways.

You still haven't answered what's keeping you from setting the MTU
correctly on the veth devices you're using?

-Toke
Willem de Bruijn Aug. 8, 2024, 7:38 p.m. UTC | #5
Toke Høiland-Jørgensen wrote:
> djduanjiong@gmail.com writes:
> 
> > This is similar to a virtual machine that receives packets larger than
> > its own mtu, regardless of the mtu configured in the guest.  Or, to
> > put it another way, what are the negative effects of this change?
> 
> Well, it's changing long-standing behaviour (the MTU check has been
> there throughout the history of veth). Changing it will mean that
> applications that set the MTU and rely on the fact that they will never
> receive packets higher than the MTU, will potentially break in
> interesting ways.

That this works is very veth specific, though?

In general this max *transfer* unit configuration makes no assurances
on the size of packets arriving. Though posted rx buffer size does,
for which veth has no equivalent.
 
> You still haven't answered what's keeping you from setting the MTU
> correctly on the veth devices you're using?

Agreed that it has a risk, so some justification is in order. Similar
to how commit 5f7d57280c19 ("bpf: Drop MTU check when doing TC-BPF
redirect to ingress") addressed a specific need.
Toke Høiland-Jørgensen Aug. 9, 2024, 9:47 a.m. UTC | #6
Willem de Bruijn <willemdebruijn.kernel@gmail.com> writes:

> Toke Høiland-Jørgensen wrote:
>> djduanjiong@gmail.com writes:
>> 
>> > This is similar to a virtual machine that receives packets larger than
>> > its own mtu, regardless of the mtu configured in the guest.  Or, to
>> > put it another way, what are the negative effects of this change?
>> 
>> Well, it's changing long-standing behaviour (the MTU check has been
>> there throughout the history of veth). Changing it will mean that
>> applications that set the MTU and rely on the fact that they will never
>> receive packets higher than the MTU, will potentially break in
>> interesting ways.
>
> That this works is very veth specific, though?
>
> In general this max *transfer* unit configuration makes no assurances
> on the size of packets arriving. Though posted rx buffer size does,
> for which veth has no equivalent.

Well, in practice we do use the MTU to limit the RX size as well. See
for instance the limit on MTU sizes if an XDP program without
multibuffer support is loaded. And I don't think having an asymmetric
MTU setting on a physical point-to-point Ethernet link generally works
either. So in that sense it does make sense that veth has this
limitation, given that it's basically emulating an ethernet wire.
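
The XDP limit mentioned above is applied when a program is attached: veth
refuses peer MTUs that would not fit a single page-sized buffer unless the
program declares fragment (multi-buffer) support. A simplified sketch of
that check, with driver-internal details abridged (so not the literal
drivers/net/veth.c code):

#include <linux/bpf.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Simplified sketch of the ceiling veth applies at XDP attach time; the
 * real check lives in veth_xdp_set() and also reports the error via
 * extack. The driver-internal VETH_XDP_HEADROOM constant is passed in
 * here instead of being hard-coded.
 */
static int veth_xdp_mtu_ok(const struct net_device *peer,
			   const struct bpf_prog *prog,
			   unsigned int xdp_headroom)
{
	unsigned int max_mtu;

	/* One page of buffer, minus XDP headroom and link-layer header. */
	max_mtu = SKB_WITH_OVERHEAD(PAGE_SIZE - xdp_headroom) -
		  peer->hard_header_len;

	/* Multi-buffer (frags-aware) programs get a larger ceiling. */
	if (prog->aux->xdp_has_frags)
		max_mtu += PAGE_SIZE * MAX_SKB_FRAGS;

	return peer->mtu > max_mtu ? -ERANGE : 0;
}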

I do see your point that a virtual device doesn't really *have* to
respect MTU, though. So if we were implementing a new driver this
argument would be a lot easier to make. In fact, AFAICT the newer netkit
driver doesn't check the MTU setting before forwarding, so there's
already some inconsistency there.

>> You still haven't answered what's keeping you from setting the MTU
>> correctly on the veth devices you're using?
>
> Agreed that it has a risk, so some justification is in order. Similar
> to how commit 5f7d57280c19 (" bpf: Drop MTU check when doing TC-BPF
> redirect to ingress") addressed a specific need.

Exactly :)

And cf the above, using netkit may be an alternative that doesn't carry
this risk (assuming that's compatible with the use case).

-Toke
Duan Jiong Aug. 12, 2024, 9:11 a.m. UTC | #7
Sorry for responding to the email so late.

>
> I do see your point that a virtual device doesn't really *have* to
> respect MTU, though. So if we were implementing a new driver this
> argument would be a lot easier to make. In fact, AFAICT the newer netkit
> driver doesn't check the MTU setting before forwarding, so there's
> already some inconsistency there.
>
> >> You still haven't answered what's keeping you from setting the MTU
> >> correctly on the veth devices you're using?
> >

vm1(mtu 1600)---ovs---ipsec vpn1(mtu 1500)---ipsec vpn2(mtu 1500)---ovs---vm2(mtu 1600)

In my scenario, two VMs communicate through IPsec VPN gateways. The two
VPN gateways are interconnected over the public network, and each
gateway has only one NIC (single-arm mode). The VPN gateway MTU is
generally 1500, but the packets the VMs send to the gateway may be
larger than 1500, and with the existing veth driver those packets are
discarded. If the gateway were allowed to receive the larger packets,
it could ESP-encapsulate them and then fragment the result, so
end-to-end connectivity would not be affected.

> > Agreed that it has a risk, so some justification is in order. Similar
> > to how commit 5f7d57280c19 (" bpf: Drop MTU check when doing TC-BPF
> > redirect to ingress") addressed a specific need.
>
> Exactly :)
>
> And cf the above, using netkit may be an alternative that doesn't carry
> this risk (assuming that's compatible with the use case).
>
> -Toke


I can see how there could be a potential risk here. Could we consider
adding a switchable option to control this behavior?
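
For illustration, a minimal sketch of what such a knob could look like at
veth's forwarding point, assuming a hypothetical ignore_peer_mtu flag; the
flag and the netlink or sysctl plumbing to set it do not exist, and the
sketch only reuses the helpers from the patch below:

/* Hypothetical illustration only: keep the MTU check by default and
 * bypass it when a per-device knob is set. Neither the knob nor any
 * UAPI for configuring it exists in mainline.
 */
static int veth_forward_skb_toggle(struct net_device *dev,
				   struct sk_buff *skb,
				   struct veth_rq *rq, bool xdp,
				   bool ignore_peer_mtu)
{
	int ret = ignore_peer_mtu ? __dev_forward_skb_nomtu(dev, skb) :
				    __dev_forward_skb(dev, skb);

	return ret ?: (xdp ? veth_xdp_rx(rq, skb) : __netif_rx(skb));
}

In practice such a flag would presumably live in veth's private data and be
exposed through a new netlink attribute, which is exactly the UAPI and
maintenance cost discussed in the follow-up.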
Toke Høiland-Jørgensen Aug. 13, 2024, 11:40 a.m. UTC | #8
Duan Jiong <djduanjiong@gmail.com> writes:

> Sorry for responding to the email so late.
>
>>
>> I do see your point that a virtual device doesn't really *have* to
>> respect MTU, though. So if we were implementing a new driver this
>> argument would be a lot easier to make. In fact, AFAICT the newer netkit
>> driver doesn't check the MTU setting before forwarding, so there's
>> already some inconsistency there.
>>
>> >> You still haven't answered what's keeping you from setting the MTU
>> >> correctly on the veth devices you're using?
>> >
>
> vm1(mtu 1600)---ovs---ipsec vpn1(mtu 1500)---ipsec vpn2(mtu 1500)---ovs---vm2(mtu 1600)

Where's the veth device in this setup?

> In my scenario, two VMs communicate through IPsec VPN gateways. The two
> VPN gateways are interconnected over the public network, and each
> gateway has only one NIC (single-arm mode). The VPN gateway MTU is
> generally 1500, but the packets the VMs send to the gateway may be
> larger than 1500, and with the existing veth driver those packets are
> discarded. If the gateway were allowed to receive the larger packets,
> it could ESP-encapsulate them and then fragment the result, so
> end-to-end connectivity would not be affected.

I'm not sure I quite get the setup; it sounds like you want a subset of
the traffic to adhere to one MTU, and another subset to adhere to a
different MTU, on the same interface? Could you not divide the traffic
over two different interfaces (with different MTUs) instead?

>> > Agreed that it has a risk, so some justification is in order. Similar
>> > to how commit 5f7d57280c19 (" bpf: Drop MTU check when doing TC-BPF
>> > redirect to ingress") addressed a specific need.
>>
>> Exactly :)
>>
>> And cf the above, using netkit may be an alternative that doesn't carry
>> this risk (assuming that's compatible with the use case).
>>
>> -Toke
>
>
> I can see how there could be a potential risk here. Could we consider
> adding a switchable option to control this behavior?

Hmm, a toggle has its own cost in terms of complexity and overhead. Plus
it's adding new UAPI. It may be that this is the least bad option in the
end, but before going that route we should be very sure that there's not
another way to solve your problem (cf the above).

This has been discussed before, BTW, most recently five-and-some
years ago:

https://patchwork.ozlabs.org/project/netdev/patch/CAMJ5cBHZ4DqjE6Md-0apA8aaLLk9Hpiypfooo7ud-p9XyFyeng@mail.gmail.com/

-Toke
Duan Jiong Aug. 14, 2024, 6:42 a.m. UTC | #9
On Tue, Aug 13, 2024 at 7:40 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>
> Duan Jiong <djduanjiong@gmail.com> writes:
>
> >> >
> >
> > vm1(mtu 1600)---ovs---ipsec vpn1(mtu 1500)---ipsec vpn2(mtu 1500)---ovs---vm2(mtu 1600)
>
> Where's the veth device in this setup?
>

The veth device is what the IPsec VPN container uses to connect to OVS,
and traffic both before and after ESP encapsulation goes through this NIC.


> > In my scenario, two VMs communicate through IPsec VPN gateways. The two
> > VPN gateways are interconnected over the public network, and each
> > gateway has only one NIC (single-arm mode). The VPN gateway MTU is
> > generally 1500, but the packets the VMs send to the gateway may be
> > larger than 1500, and with the existing veth driver those packets are
> > discarded. If the gateway were allowed to receive the larger packets,
> > it could ESP-encapsulate them and then fragment the result, so
> > end-to-end connectivity would not be affected.
>
> I'm not sure I quite get the setup; it sounds like you want a subset of
> the traffic to adhere to one MTU, and another subset to adhere to a
> different MTU, on the same interface? Could you not divide the traffic
> over two different interfaces (with different MTUs) instead?
>

This is indeed a viable option, but it's not easy to change our own
implementation right now, so we're just seeing if it's feasible to skip
the veth mtu check.


> >> > Agreed that it has a risk, so some justification is in order. Similar
> >> > to how commit 5f7d57280c19 (" bpf: Drop MTU check when doing TC-BPF
> >> > redirect to ingress") addressed a specific need.
> >>
> >> Exactly :)
> >>
> >> And cf the above, using netkit may be an alternative that doesn't carry
> >> this risk (assuming that's compatible with the use case).
> >>
> >> -Toke
> >
> >
> > I can see how there could be a potential risk here. Could we consider
> > adding a switchable option to control this behavior?
>
> Hmm, a toggle has its own cost in terms of complexity and overhead. Plus
> it's adding new UAPI. It may be that this is the least bad option in the
> end, but before going that route we should be very sure that there's not
> another way to solve your problem (cf the above).
>
> This has been discussed before, BTW, most recently five-and-some
> years ago:
>
> https://patchwork.ozlabs.org/project/netdev/patch/CAMJ5cBHZ4DqjE6Md-0apA8aaLLk9Hpiypfooo7ud-p9XyFyeng@mail.gmail.com/
>
> -Toke
Toke Høiland-Jørgensen Aug. 16, 2024, 9:09 p.m. UTC | #10
Duan Jiong <djduanjiong@gmail.com> writes:

> On Tue, Aug 13, 2024 at 7:40 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>>
>> Duan Jiong <djduanjiong@gmail.com> writes:
>>
>> >> >
>> >
>> > vm1(mtu 1600)---ovs---ipsec vpn1(mtu 1500)---ipsec vpn2(mtu 1500)---ovs---vm2(mtu 1600)
>>
>> Where's the veth device in this setup?
>>
>
> The veth device is what the IPsec VPN container uses to connect to OVS,
> and traffic both before and after ESP encapsulation goes through this NIC.
>
>
>> > In my scenario, two VMs communicate through IPsec VPN gateways. The two
>> > VPN gateways are interconnected over the public network, and each
>> > gateway has only one NIC (single-arm mode). The VPN gateway MTU is
>> > generally 1500, but the packets the VMs send to the gateway may be
>> > larger than 1500, and with the existing veth driver those packets are
>> > discarded. If the gateway were allowed to receive the larger packets,
>> > it could ESP-encapsulate them and then fragment the result, so
>> > end-to-end connectivity would not be affected.
>>
>> I'm not sure I quite get the setup; it sounds like you want a subset of
>> the traffic to adhere to one MTU, and another subset to adhere to a
>> different MTU, on the same interface? Could you not divide the traffic
>> over two different interfaces (with different MTUs) instead?
>>
>
> This is indeed a viable option, but it's not easy to change our own
> implementation right now, so we're just seeing if it's feasible to skip
> the veth mtu check.

Hmm, well, that basically means asking the kernel community to take on
an extra maintenance burden so that you won't have to do that yourself.
And since there's a workaround, my inclination would be to say that the
use case is not sufficiently compelling that it's worth doing so.

However, I don't feel that strongly about it, and if others think it's
worth adding a knob to disable the MTU check on forward, I won't object.

-Toke

Patch

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 426e68a95067..f505fe2a55c1 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -317,7 +317,7 @@  static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
 static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
 			    struct veth_rq *rq, bool xdp)
 {
-	return __dev_forward_skb(dev, skb) ?: xdp ?
+	return __dev_forward_skb_nomtu(dev, skb) ?: xdp ?
 		veth_xdp_rx(rq, skb) :
 		__netif_rx(skb);
 }
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d20c6c99eb88..8cee9b40e50e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3943,6 +3943,7 @@  int bpf_xdp_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
 u8 dev_xdp_prog_count(struct net_device *dev);
 u32 dev_xdp_prog_id(struct net_device *dev, enum bpf_xdp_mode mode);
 
+int __dev_forward_skb_nomtu(struct net_device *dev, struct sk_buff *skb);
 int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
 int dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
 int dev_forward_skb_nomtu(struct net_device *dev, struct sk_buff *skb);
diff --git a/net/core/dev.c b/net/core/dev.c
index e1bb6d7856d9..acd740f78b1c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2190,6 +2190,12 @@  static int __dev_forward_skb2(struct net_device *dev, struct sk_buff *skb,
 	return ret;
 }
 
+int __dev_forward_skb_nomtu(struct net_device *dev, struct sk_buff *skb)
+{
+	return __dev_forward_skb2(dev, skb, false);
+}
+EXPORT_SYMBOL_GPL(__dev_forward_skb_nomtu);
+
 int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
 {
 	return __dev_forward_skb2(dev, skb, true);