| Message ID | 20240808070428.13643-1-djduanjiong@gmail.com (mailing list archive) |
|---|---|
| State | Rejected |
| Delegated to | Netdev Maintainers |
| Series | [v3] veth: Drop MTU check when forwarding packets |
Duan Jiong <djduanjiong@gmail.com> writes:

> When the mtu of the veth card is not the same at both ends, there
> is no need to check the mtu when forwarding packets, and it should
> be a permissible behavior to allow receiving packets with larger
> mtu than your own.

Erm, huh? The MTU check is against the receiving interface, so AFAICT
your patch just disables MTU checking entirely for veth devices, which
seems ill-advised.

What's the issue you are trying to fix, exactly?

-Toke

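For context on the check being discussed here: veth's transmit path hands the skb to the *peer* device, and the generic forward helper then validates the packet length against that receiving device's MTU. The snippet below is a simplified paraphrase of the relevant path in drivers/net/veth.c, not the literal kernel code; veth_xmit_sketch() and veth_peer_of() are made-up names (the real driver looks up the peer via rcu_dereference(priv->peer)), and XDP, statistics and error handling are omitted.

/* Sketch: why the MTU check applies to the receiving side. The peer
 * device (rcv) is what gets passed to the forward helper, so the MTU
 * compared against is rcv->mtu, not the MTU of the sending device.
 */
static netdev_tx_t veth_xmit_sketch(struct sk_buff *skb, struct net_device *dev)
{
	struct net_device *rcv = veth_peer_of(dev);	/* hypothetical helper */

	/* __dev_forward_skb() drops and frees the skb itself if it does
	 * not fit the receiving device's MTU (or the device is down).
	 */
	if (__dev_forward_skb(rcv, skb))
		return NET_XMIT_DROP;

	__netif_rx(skb);
	return NETDEV_TX_OK;
}
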
Now there are veth device pairs, one vethA on the host side and one
vethB on the container side, when vethA sends data to vethB, at this
time if vethB is configured with a smaller MTU then the data will be
discarded, at this time there is no problem for the data to be
received in my opinion, and I would like it to be so. When vethB sends
data, the kernel stack is sharded according to the MTU of vethB.

On Thu, Aug 8, 2024 at 17:23, Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>
> Duan Jiong <djduanjiong@gmail.com> writes:
>
> > When the mtu of the veth card is not the same at both ends, there
> > is no need to check the mtu when forwarding packets, and it should
> > be a permissible behavior to allow receiving packets with larger
> > mtu than your own.
>
> Erm, huh? The MTU check is against the receiving interface, so AFAICT
> your patch just disables MTU checking entirely for veth devices, which
> seems ill-advised.
>
> What's the issue you are trying to fix, exactly?
>
> -Toke

Duan Jiong <djduanjiong@gmail.com> writes:

> Now there are veth device pairs, one vethA on the host side and one
> vethB on the container side, when vethA sends data to vethB, at this
> time if vethB is configured with a smaller MTU then the data will be
> discarded, at this time there is no problem for the data to be
> received in my opinion, and I would like it to be so. When vethB sends
> data, the kernel stack is sharded according to the MTU of vethB.

Well, yes, if the MTU is too small, the packet will be dropped. That's
what the MTU setting is for :)

If you want to send bigger packets just, y'know, raise the MTU? What is
the reason you can't do that?

-Toke

djduanjiong@gmail.com writes:

> This is similar to a virtual machine that receives packets larger than
> its own mtu, regardless of the mtu configured in the guest. Or, to
> put it another way, what are the negative effects of this change?

Well, it's changing long-standing behaviour (the MTU check has been
there throughout the history of veth). Changing it will mean that
applications that set the MTU and rely on the fact that they will never
receive packets higher than the MTU, will potentially break in
interesting ways.

You still haven't answered what's keeping you from setting the MTU
correctly on the veth devices you're using?

-Toke

Toke Høiland-Jørgensen wrote:
> djduanjiong@gmail.com writes:
>
> > This is similar to a virtual machine that receives packets larger than
> > its own mtu, regardless of the mtu configured in the guest. Or, to
> > put it another way, what are the negative effects of this change?
>
> Well, it's changing long-standing behaviour (the MTU check has been
> there throughout the history of veth). Changing it will mean that
> applications that set the MTU and rely on the fact that they will never
> receive packets higher than the MTU, will potentially break in
> interesting ways.

That this works is very veth specific, though?

In general this max *transfer* unit configuration makes no assurances
on the size of packets arriving. Though posted rx buffer size does,
for which veth has no equivalent.

> You still haven't answered what's keeping you from setting the MTU
> correctly on the veth devices you're using?

Agreed that it has a risk, so some justification is in order. Similar
to how commit 5f7d57280c19 ("bpf: Drop MTU check when doing TC-BPF
redirect to ingress") addressed a specific need.

Willem de Bruijn <willemdebruijn.kernel@gmail.com> writes:

> Toke Høiland-Jørgensen wrote:
>> djduanjiong@gmail.com writes:
>>
>> > This is similar to a virtual machine that receives packets larger than
>> > its own mtu, regardless of the mtu configured in the guest. Or, to
>> > put it another way, what are the negative effects of this change?
>>
>> Well, it's changing long-standing behaviour (the MTU check has been
>> there throughout the history of veth). Changing it will mean that
>> applications that set the MTU and rely on the fact that they will never
>> receive packets higher than the MTU, will potentially break in
>> interesting ways.
>
> That this works is very veth specific, though?
>
> In general this max *transfer* unit configuration makes no assurances
> on the size of packets arriving. Though posted rx buffer size does,
> for which veth has no equivalent.

Well, in practice we do use the MTU to limit the RX size as well. See
for instance the limit on MTU sizes if an XDP program without
multibuffer support is loaded. And I don't think having an asymmetric
MTU setting on a physical point-to-point Ethernet link generally works
either. So in that sense it does make sense that veth has this
limitation, given that it's basically emulating an ethernet wire.

I do see your point that a virtual device doesn't really *have* to
respect MTU, though. So if we were implementing a new driver this
argument would be a lot easier to make. In fact, AFAICT the newer netkit
driver doesn't check the MTU setting before forwarding, so there's
already some inconsistency there.

>> You still haven't answered what's keeping you from setting the MTU
>> correctly on the veth devices you're using?
>
> Agreed that it has a risk, so some justification is in order. Similar
> to how commit 5f7d57280c19 ("bpf: Drop MTU check when doing TC-BPF
> redirect to ingress") addressed a specific need.

Exactly :)

And cf the above, using netkit may be an alternative that doesn't carry
this risk (assuming that's compatible with the use case).

-Toke

Sorry for responding to the email so late.

>
> I do see your point that a virtual device doesn't really *have* to
> respect MTU, though. So if we were implementing a new driver this
> argument would be a lot easier to make. In fact, AFAICT the newer netkit
> driver doesn't check the MTU setting before forwarding, so there's
> already some inconsistency there.
>
> >> You still haven't answered what's keeping you from setting the MTU
> >> correctly on the veth devices you're using?
>
> vm1(mtu 1600)---ovs---ipsec vpn1(mtu 1500)---ipsec vpn2(mtu 1500)---ovs---vm2(mtu 1600)

My scenario is that two vms are communicating via ipsec vpn gateway,
the two vpn gateways are interconnected via public network, the vpn
gateway has only one NIC, single arm mode. vpn gateway mtu will be
1500 in general, but the packets sent by the vm's to the vpn gateway
may be more than 1500, and at this time, if implemented according to
the existing veth driver, the packets sent by the vm's will be
discarded. If allowed to receive large packets, the vpn gateway can
actually accept large packets then esp encapsulate them and then
fragment so that in the end it doesn't affect the connectivity of the
network.

>
> Agreed that it has a risk, so some justification is in order. Similar
> to how commit 5f7d57280c19 ("bpf: Drop MTU check when doing TC-BPF
> redirect to ingress") addressed a specific need.
>
> Exactly :)
>
> And cf the above, using netkit may be an alternative that doesn't carry
> this risk (assuming that's compatible with the use case).
>
> -Toke

I can see how there could be a potential risk here, can we consider
adding a switchable option to control this behavior?

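To make the arithmetic behind the scenario above concrete, here is a small self-contained example. All overhead values are assumptions for illustration (ESP tunnel mode over IPv4 with roughly AES-GCM-sized headers and ICV); the real numbers depend on the negotiated SA. It shows why a 1600-byte inner packet that is let through the veth pair can still traverse the 1500-byte public network after encapsulation and IPv4 fragmentation.

/* Rough worked example: ESP-encapsulated size of a 1600-byte packet and
 * the number of IPv4 fragments needed on a 1500-byte MTU link. Overheads
 * are assumed, not taken from any particular SA. Build: cc -o esp_frag esp_frag.c
 */
#include <stdio.h>

int main(void)
{
	const int inner_pkt   = 1600;  /* packet from vm1 (mtu 1600)            */
	const int outer_iphdr = 20;    /* new outer IPv4 header                 */
	const int esp_hdr     = 8;     /* SPI + sequence number                 */
	const int esp_iv      = 8;     /* assumed IV size                       */
	const int esp_trailer = 4;     /* assumed pad + pad len + next header   */
	const int esp_icv     = 16;    /* assumed ICV size                      */
	const int link_mtu    = 1500;  /* vpn gateway NIC / public network      */

	int esp_pkt = outer_iphdr + esp_hdr + esp_iv + inner_pkt
		    + esp_trailer + esp_icv;

	/* IPv4 fragmentation of the ESP packet: each fragment carries its
	 * own 20-byte IP header, and fragment payloads are multiples of 8.
	 */
	int payload_per_frag = (link_mtu - outer_iphdr) & ~7;
	int esp_payload = esp_pkt - outer_iphdr;
	int frags = (esp_payload + payload_per_frag - 1) / payload_per_frag;

	printf("ESP-encapsulated size: %d bytes -> %d fragments on a %d MTU link\n",
	       esp_pkt, frags, link_mtu);
	return 0;
}

With these assumed overheads the encapsulated packet is 1656 bytes and splits into two fragments, which is why the gateway can still deliver it once the veth pair stops dropping the oversized inner packet.
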
Duan Jiong <djduanjiong@gmail.com> writes:

> Sorry for responding to the email so late.
>
>>
>> I do see your point that a virtual device doesn't really *have* to
>> respect MTU, though. So if we were implementing a new driver this
>> argument would be a lot easier to make. In fact, AFAICT the newer netkit
>> driver doesn't check the MTU setting before forwarding, so there's
>> already some inconsistency there.
>>
>> >> You still haven't answered what's keeping you from setting the MTU
>> >> correctly on the veth devices you're using?
>>
>
> vm1(mtu 1600)---ovs---ipsec vpn1(mtu 1500)---ipsec vpn2(mtu
> 1500)---ovs---vm2(mtu 1600)

Where's the veth device in this setup?

> My scenario is that two vms are communicating via ipsec vpn gateway,
> the two vpn gateways are interconnected via public network, the vpn
> gateway has only one NIC, single arm mode. vpn gateway mtu will be
> 1500 in general, but the packets sent by the vm's to the vpn gateway
> may be more than 1500, and at this time, if implemented according to
> the existing veth driver, the packets sent by the vm's will be
> discarded. If allowed to receive large packets, the vpn gateway can
> actually accept large packets then esp encapsulate them and then
> fragment so that in the end it doesn't affect the connectivity of the
> network.

I'm not sure I quite get the setup; it sounds like you want a subset of
the traffic to adhere to one MTU, and another subset to adhere to a
different MTU, on the same interface? Could you not divide the traffic
over two different interfaces (with different MTUs) instead?

>> > Agreed that it has a risk, so some justification is in order. Similar
>> > to how commit 5f7d57280c19 ("bpf: Drop MTU check when doing TC-BPF
>> > redirect to ingress") addressed a specific need.
>>
>> Exactly :)
>>
>> And cf the above, using netkit may be an alternative that doesn't carry
>> this risk (assuming that's compatible with the use case).
>>
>> -Toke
>
>
> I can see how there could be a potential risk here, can we consider
> adding a switchable option to control this behavior?

Hmm, a toggle has its own cost in terms of complexity and overhead. Plus
it's adding new UAPI. It may be that this is the least bad option in the
end, but before going that route we should be very sure that there's not
another way to solve your problem (cf the above).

This has been discussed before, BTW, most recently five-and-some
years ago:

https://patchwork.ozlabs.org/project/netdev/patch/CAMJ5cBHZ4DqjE6Md-0apA8aaLLk9Hpiypfooo7ud-p9XyFyeng@mail.gmail.com/

-Toke

On Tue, Aug 13, 2024 at 7:40 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>
> Duan Jiong <djduanjiong@gmail.com> writes:
>
> >
> > vm1(mtu 1600)---ovs---ipsec vpn1(mtu 1500)---ipsec vpn2(mtu
> > 1500)---ovs---vm2(mtu 1600)
>
> Where's the veth device in this setup?
>

The veth device is used for ipsec vpn containers to connect to ovs, and
traffic before and after esp encapsulation goes to this NIC.

> > My scenario is that two vms are communicating via ipsec vpn gateway,
> > the two vpn gateways are interconnected via public network, the vpn
> > gateway has only one NIC, single arm mode. vpn gateway mtu will be
> > 1500 in general, but the packets sent by the vm's to the vpn gateway
> > may be more than 1500, and at this time, if implemented according to
> > the existing veth driver, the packets sent by the vm's will be
> > discarded. If allowed to receive large packets, the vpn gateway can
> > actually accept large packets then esp encapsulate them and then
> > fragment so that in the end it doesn't affect the connectivity of the
> > network.
>
> I'm not sure I quite get the setup; it sounds like you want a subset of
> the traffic to adhere to one MTU, and another subset to adhere to a
> different MTU, on the same interface? Could you not divide the traffic
> over two different interfaces (with different MTUs) instead?
>

This is indeed a viable option, but it's not easy to change our own
implementation right now, so we're just seeing if it's feasible to skip
the veth mtu check.

> >> > Agreed that it has a risk, so some justification is in order. Similar
> >> > to how commit 5f7d57280c19 ("bpf: Drop MTU check when doing TC-BPF
> >> > redirect to ingress") addressed a specific need.
> >>
> >> Exactly :)
> >>
> >> And cf the above, using netkit may be an alternative that doesn't carry
> >> this risk (assuming that's compatible with the use case).
> >>
> >> -Toke
> >
> >
> > I can see how there could be a potential risk here, can we consider
> > adding a switchable option to control this behavior?
>
> Hmm, a toggle has its own cost in terms of complexity and overhead. Plus
> it's adding new UAPI. It may be that this is the least bad option in the
> end, but before going that route we should be very sure that there's not
> another way to solve your problem (cf the above).
>
> This has been discussed before, BTW, most recently five-and-some
> years ago:
>
> https://patchwork.ozlabs.org/project/netdev/patch/CAMJ5cBHZ4DqjE6Md-0apA8aaLLk9Hpiypfooo7ud-p9XyFyeng@mail.gmail.com/
>
> -Toke

Duan Jiong <djduanjiong@gmail.com> writes:

> On Tue, Aug 13, 2024 at 7:40 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>>
>> Duan Jiong <djduanjiong@gmail.com> writes:
>>
>> >
>> > vm1(mtu 1600)---ovs---ipsec vpn1(mtu 1500)---ipsec vpn2(mtu
>> > 1500)---ovs---vm2(mtu 1600)
>>
>> Where's the veth device in this setup?
>>
>
> The veth device is used for ipsec vpn containers to connect to ovs, and
> traffic before and after esp encapsulation goes to this NIC.
>
>> > My scenario is that two vms are communicating via ipsec vpn gateway,
>> > the two vpn gateways are interconnected via public network, the vpn
>> > gateway has only one NIC, single arm mode. vpn gateway mtu will be
>> > 1500 in general, but the packets sent by the vm's to the vpn gateway
>> > may be more than 1500, and at this time, if implemented according to
>> > the existing veth driver, the packets sent by the vm's will be
>> > discarded. If allowed to receive large packets, the vpn gateway can
>> > actually accept large packets then esp encapsulate them and then
>> > fragment so that in the end it doesn't affect the connectivity of the
>> > network.
>>
>> I'm not sure I quite get the setup; it sounds like you want a subset of
>> the traffic to adhere to one MTU, and another subset to adhere to a
>> different MTU, on the same interface? Could you not divide the traffic
>> over two different interfaces (with different MTUs) instead?
>>
>
> This is indeed a viable option, but it's not easy to change our own
> implementation right now, so we're just seeing if it's feasible to skip
> the veth mtu check.

Hmm, well, that basically means asking the kernel community to take on
an extra maintenance burden so that you won't have to do that yourself.
And since there's a workaround, my inclination would be to say that the
use case is not sufficiently compelling that it's worth doing so.

However, I don't feel that strongly about it, and if others think it's
worth adding a knob to disable the MTU check on forward, I won't object.

-Toke

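The "switchable option" debated above was never implemented (the series was rejected), but for illustration, a per-call knob layered on the __dev_forward_skb_nomtu() helper introduced by the patch below might look roughly like this. The flag name and how it would be exposed to userspace (netlink attribute, ethtool private flag, sysctl) are invented for this sketch only and are exactly the new UAPI cost Toke refers to.

/* Purely hypothetical sketch of a knob to disable the MTU check on
 * forward. No such option exists upstream; "ignore_peer_mtu" is an
 * invented name, and configuration plumbing is intentionally left out.
 */
static int veth_forward_skb_sketch(struct net_device *rcv, struct sk_buff *skb,
				   bool ignore_peer_mtu /* hypothetical knob */)
{
	/* Knob on: skip the length-vs-MTU validation via the nomtu helper
	 * added by this patch. Knob off: keep today's behaviour.
	 */
	if (ignore_peer_mtu)
		return __dev_forward_skb_nomtu(rcv, skb) ?: __netif_rx(skb);

	return __dev_forward_skb(rcv, skb) ?: __netif_rx(skb);
}
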
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 426e68a95067..f505fe2a55c1 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -317,7 +317,7 @@ static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
 static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
 			    struct veth_rq *rq, bool xdp)
 {
-	return __dev_forward_skb(dev, skb) ?: xdp ?
+	return __dev_forward_skb_nomtu(dev, skb) ?: xdp ?
 		veth_xdp_rx(rq, skb) : __netif_rx(skb);
 }
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d20c6c99eb88..8cee9b40e50e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3943,6 +3943,7 @@ int bpf_xdp_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
 u8 dev_xdp_prog_count(struct net_device *dev);
 u32 dev_xdp_prog_id(struct net_device *dev, enum bpf_xdp_mode mode);
 
+int __dev_forward_skb_nomtu(struct net_device *dev, struct sk_buff *skb);
 int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
 int dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
 int dev_forward_skb_nomtu(struct net_device *dev, struct sk_buff *skb);
diff --git a/net/core/dev.c b/net/core/dev.c
index e1bb6d7856d9..acd740f78b1c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2190,6 +2190,12 @@ static int __dev_forward_skb2(struct net_device *dev, struct sk_buff *skb,
 	return ret;
 }
 
+int __dev_forward_skb_nomtu(struct net_device *dev, struct sk_buff *skb)
+{
+	return __dev_forward_skb2(dev, skb, false);
+}
+EXPORT_SYMBOL_GPL(__dev_forward_skb_nomtu);
+
 int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
 {
 	return __dev_forward_skb2(dev, skb, true);

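As the patch above shows, __dev_forward_skb() and the new __dev_forward_skb_nomtu() both funnel into __dev_forward_skb2(), whose check_mtu argument decides whether the skb length is validated against the receiving device's MTU. The snippet below is a simplified paraphrase of that check as it exists in net/core/dev.c; the real code also handles VLAN header length, drop statistics and skb scrubbing, so treat it as a sketch rather than the exact implementation.

/* Simplified paraphrase of the validation that check_mtu=false bypasses. */
static bool skb_fits_dev_mtu_sketch(const struct net_device *dev,
				    const struct sk_buff *skb,
				    bool check_mtu)
{
	if (!(dev->flags & IFF_UP))
		return false;

	if (!check_mtu)		/* the path __dev_forward_skb_nomtu() takes */
		return true;

	/* GSO packets are let through; they are segmented further down. */
	if (skb_is_gso(skb))
		return true;

	return skb->len <= dev->mtu + dev->hard_header_len;
}
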
When the mtu of the veth card is not the same at both ends, there
is no need to check the mtu when forwarding packets, and it should
be a permissible behavior to allow receiving packets with larger
mtu than your own.

Signed-off-by: Duan Jiong <djduanjiong@gmail.com>
---
 drivers/net/veth.c        | 2 +-
 include/linux/netdevice.h | 1 +
 net/core/dev.c            | 6 ++++++
 3 files changed, 8 insertions(+), 1 deletion(-)