Message ID | 160216609656.882446.16642490462568561112.stgit@firesoul (mailing list archive) |
---|---|
Headers | show |
Series | bpf: New approach for BPF MTU handling | expand |
On Thu, 08 Oct 2020 16:08:57 +0200 Jesper Dangaard Brouer wrote:
> V3: Drop enforcement of MTU in net-core, leave it to drivers
Sorry for being late to the discussion.
I absolutely disagree. We had cases in the past where HW would lock up
if it was sent a frame with bad geometry.
We will not be sprinkling validation checks across the drivers because
some reconfiguration path may occasionally yield a bad packet, or it's
hard to do something right with BPF.
Jakub Kicinski wrote: > On Thu, 08 Oct 2020 16:08:57 +0200 Jesper Dangaard Brouer wrote: > > V3: Drop enforcement of MTU in net-core, leave it to drivers > > Sorry for being late to the discussion. > > I absolutely disagree. We had cases in the past where HW would lock up > if it was sent a frame with bad geometry. > > We will not be sprinkling validation checks across the drivers because > some reconfiguration path may occasionally yield a bad packet, or it's > hard to do something right with BPF. This is a driver bug then. As it stands today drivers may get hit with skb with MTU greater than set MTU as best I can tell. Generally I expect drivers use MTU to configure RX buffers not sure how it is going to be used on TX side? Any examples? I just poked around through the driver source to see and seems to confirm its primarily for RX side configuration with some drivers throwing the event down to the firmware for something that I can't see in the code? I'm not suggestiong sprinkling validation checks across the drivers. I'm suggesting if the drivers hang we fix them.
On Fri, Oct 09, 2020 at 01:49:14PM -0700, John Fastabend wrote: > Jakub Kicinski wrote: > > On Thu, 08 Oct 2020 16:08:57 +0200 Jesper Dangaard Brouer wrote: > > > V3: Drop enforcement of MTU in net-core, leave it to drivers > > > > Sorry for being late to the discussion. > > > > I absolutely disagree. We had cases in the past where HW would lock up > > if it was sent a frame with bad geometry. > > > > We will not be sprinkling validation checks across the drivers because > > some reconfiguration path may occasionally yield a bad packet, or it's > > hard to do something right with BPF. > > This is a driver bug then. As it stands today drivers may get hit with > skb with MTU greater than set MTU as best I can tell. Generally I > expect drivers use MTU to configure RX buffers not sure how it is going > to be used on TX side? Any examples? I just poked around through the > driver source to see and seems to confirm its primarily for RX side > configuration with some drivers throwing the event down to the firmware > for something that I can't see in the code? > > I'm not suggestiong sprinkling validation checks across the drivers. > I'm suggesting if the drivers hang we fix them. +1 I've seen HW that hangs when certain sizes of the packet. Like < 68 byte TX where size is one specific constant. I don't think it's a job of the stack or the driver to deal with that. It's firmware/hw bug.
On Fri, Oct 9, 2020 at 2:07 PM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Fri, Oct 09, 2020 at 01:49:14PM -0700, John Fastabend wrote: > > Jakub Kicinski wrote: > > > On Thu, 08 Oct 2020 16:08:57 +0200 Jesper Dangaard Brouer wrote: > > > > V3: Drop enforcement of MTU in net-core, leave it to drivers > > > > > > Sorry for being late to the discussion. > > > > > > I absolutely disagree. We had cases in the past where HW would lock up > > > if it was sent a frame with bad geometry. > > > > > > We will not be sprinkling validation checks across the drivers because > > > some reconfiguration path may occasionally yield a bad packet, or it's > > > hard to do something right with BPF. > > > > This is a driver bug then. As it stands today drivers may get hit with > > skb with MTU greater than set MTU as best I can tell. Generally I > > expect drivers use MTU to configure RX buffers not sure how it is going > > to be used on TX side? Any examples? I just poked around through the > > driver source to see and seems to confirm its primarily for RX side > > configuration with some drivers throwing the event down to the firmware > > for something that I can't see in the code? > > > > I'm not suggestiong sprinkling validation checks across the drivers. > > I'm suggesting if the drivers hang we fix them. > > +1 > > I've seen HW that hangs when certain sizes of the packet. > Like < 68 byte TX where size is one specific constant. > I don't think it's a job of the stack or the driver to deal with that. > It's firmware/hw bug. +1 It's not the job of the core stack, but it *is* the job of the driver to deal with firmware/hw bugs like this. Sure fix in hw when you can (next rev), in fw if you can't (and have fw, can release it, rev it, distribute it), but ultimately that's why drivers have quirks for various revisions of hw/fw... so you very much fix bugs like this in driver if needed.
On Fri, 09 Oct 2020 13:49:14 -0700 John Fastabend wrote: > Jakub Kicinski wrote: > > On Thu, 08 Oct 2020 16:08:57 +0200 Jesper Dangaard Brouer wrote: > > > V3: Drop enforcement of MTU in net-core, leave it to drivers > > > > Sorry for being late to the discussion. > > > > I absolutely disagree. We had cases in the past where HW would lock up > > if it was sent a frame with bad geometry. > > > > We will not be sprinkling validation checks across the drivers because > > some reconfiguration path may occasionally yield a bad packet, or it's > > hard to do something right with BPF. > > This is a driver bug then. As it stands today drivers may get hit with > skb with MTU greater than set MTU as best I can tell. You're talking about taking it from "maybe this can happen, but will still be at most jumbo" to "it's going to be very easy to trigger and length may be > MAX_U16". > Generally I expect drivers use MTU to configure RX buffers not sure > how it is going to be used on TX side? Any examples? I just poked > around through the driver source to see and seems to confirm its > primarily for RX side configuration with some drivers throwing the > event down to the firmware for something that I can't see in the code? Right, but that could just be because nobody expects to get over sized frames from the stack. We actively encourage drivers to remove paranoid checks. It's really not going to be a great experience for driver authors where they need to consult a list of things they should and shouldn't check. If we want to do this, the driver interface must most definitely say MRU and not MTU. > I'm not suggestiong sprinkling validation checks across the drivers. > I'm suggesting if the drivers hang we fix them. We both know the level of testing drivers get, it's unlikely this will be validated. It's packet of death waiting to happen. And all this for what? Saving 2 cycles on a branch that will almost never be taken?
On Fri, 9 Oct 2020 16:00:10 -0700 Jakub Kicinski <kuba@kernel.org> wrote: > On Fri, 09 Oct 2020 13:49:14 -0700 John Fastabend wrote: > > Jakub Kicinski wrote: > > > On Thu, 08 Oct 2020 16:08:57 +0200 Jesper Dangaard Brouer wrote: > > > > V3: Drop enforcement of MTU in net-core, leave it to drivers > > > > > > Sorry for being late to the discussion. > > > > > > I absolutely disagree. We had cases in the past where HW would lock up > > > if it was sent a frame with bad geometry. I agree with Jakub here. I do find it risky not to do these MTU check in net-core. > > > We will not be sprinkling validation checks across the drivers because > > > some reconfiguration path may occasionally yield a bad packet, or it's > > > hard to do something right with BPF. > > > > This is a driver bug then. As it stands today drivers may get hit with > > skb with MTU greater than set MTU as best I can tell. > > You're talking about taking it from "maybe this can happen, but will > still be at most jumbo" to "it's going to be very easy to trigger and > length may be > MAX_U16". It is interesting that a misbehaving BPF program can easily trigger this. Next week, I will looking writing such a BPF-prog and then test it on the hardware I have avail in my testlab. > > Generally I expect drivers use MTU to configure RX buffers not sure > > how it is going to be used on TX side? Any examples? I just poked > > around through the driver source to see and seems to confirm its > > primarily for RX side configuration with some drivers throwing the > > event down to the firmware for something that I can't see in the code? > > Right, but that could just be because nobody expects to get over sized > frames from the stack. > > We actively encourage drivers to remove paranoid checks. It's really > not going to be a great experience for driver authors where they need > to consult a list of things they should and shouldn't check. > > If we want to do this, the driver interface must most definitely say > MRU and not MTU. What is MRU? > > I'm not suggestiong sprinkling validation checks across the drivers. > > I'm suggesting if the drivers hang we fix them. > > We both know the level of testing drivers get, it's unlikely this will > be validated. It's packet of death waiting to happen. > > And all this for what? Saving 2 cycles on a branch that will almost > never be taken? I do think it is risky not to do this simple MTU check in net-core. I also believe the overhead is very very low. Hint, I'm basically just moving the MTU check from one place to another. (And last patch in patchset is an optimization that inlines and save cycles when doing these kind of MTU checks).
On Sat, 10 Oct 2020 12:44:02 +0200 Jesper Dangaard Brouer wrote: > > > > We will not be sprinkling validation checks across the drivers because > > > > some reconfiguration path may occasionally yield a bad packet, or it's > > > > hard to do something right with BPF. > > > > > > This is a driver bug then. As it stands today drivers may get hit with > > > skb with MTU greater than set MTU as best I can tell. > > > > You're talking about taking it from "maybe this can happen, but will > > still be at most jumbo" to "it's going to be very easy to trigger and > > length may be > MAX_U16". > > It is interesting that a misbehaving BPF program can easily trigger this. > Next week, I will looking writing such a BPF-prog and then test it on > the hardware I have avail in my testlab. FWIW I took a quick swing at testing it with the HW I have and it did exactly what hardware should do. The TX unit entered an error state and then the driver detected that and reset it a few seconds later. Hardware is almost always designed to behave like that. If some NIC actually cleanly drops over sized TX frames, I'd bet it's done in FW, or some other software piece. There was also a statement earlier in the thread that we can put a large frame on the wire and "let the switch drop it". I don't believe that's possible either (as I mentioned previously BPF could generate frames above jumbo size). My phy knowledge is very rudimentary and rusty but from what I heard Ethernet PHYs have a hard design limit on the length of a frame they can put of a wire (or pull from it), because of symbol encoding, electrical charges on the wire etc. reasons. There needs to be a bunch of idle symbols every now and then. And obviously if one actually manages to get a longer frame to the PHY it will fault, see above. > > > Generally I expect drivers use MTU to configure RX buffers not sure > > > how it is going to be used on TX side? Any examples? I just poked > > > around through the driver source to see and seems to confirm its > > > primarily for RX side configuration with some drivers throwing the > > > event down to the firmware for something that I can't see in the code? > > > > Right, but that could just be because nobody expects to get over sized > > frames from the stack. > > > > We actively encourage drivers to remove paranoid checks. It's really > > not going to be a great experience for driver authors where they need > > to consult a list of things they should and shouldn't check. > > > > If we want to do this, the driver interface must most definitely say > > MRU and not MTU. > > What is MRU? Max Receive Unit, Jesse and others have been talking about how we should separate the TX config from RX config for drivers. Right now drivers configure RX filters based on the max transmission unit, which is weird, and nobody is sure whether that's actually desired. > > > I'm not suggestiong sprinkling validation checks across the drivers. > > > I'm suggesting if the drivers hang we fix them. > > > > We both know the level of testing drivers get, it's unlikely this will > > be validated. It's packet of death waiting to happen. > > > > And all this for what? Saving 2 cycles on a branch that will almost > > never be taken? > > I do think it is risky not to do this simple MTU check in net-core. I > also believe the overhead is very very low. Hint, I'm basically just > moving the MTU check from one place to another. (And last patch in > patchset is an optimization that inlines and save cycles when doing > these kind of MTU checks).
Jakub Kicinski wrote: > On Sat, 10 Oct 2020 12:44:02 +0200 Jesper Dangaard Brouer wrote: > > > > > We will not be sprinkling validation checks across the drivers because > > > > > some reconfiguration path may occasionally yield a bad packet, or it's > > > > > hard to do something right with BPF. > > > > > > > > This is a driver bug then. As it stands today drivers may get hit with > > > > skb with MTU greater than set MTU as best I can tell. > > > > > > You're talking about taking it from "maybe this can happen, but will > > > still be at most jumbo" to "it's going to be very easy to trigger and > > > length may be > MAX_U16". > > > > It is interesting that a misbehaving BPF program can easily trigger this. > > Next week, I will looking writing such a BPF-prog and then test it on > > the hardware I have avail in my testlab. > > FWIW I took a quick swing at testing it with the HW I have and it did > exactly what hardware should do. The TX unit entered an error state > and then the driver detected that and reset it a few seconds later. Ths seems like the right thing to do in my opinion. If the stack gives the NIC garbage entering error state and reset sounds expected. Thanks for actually trying it by the way. We might have come to different conclusions though from my side the conclusion is, good nothing horrible happened no MTU check needed. If the user spews garbage at the nic from the BPF program great it gets dropped and causes the driver/nic to punish you a bit by staying hung. Fix your BPF program. Now if the nic hangs and doesn't ever come back I would care. But, we have watchdog logic for this. I don't really feel like we need to guard bad BPF programs from doing dumb things, setting MTU in this case, but other things might be nested vlans that wont fly, overwriting checksums, corrupting mac headers, etc. > > Hardware is almost always designed to behave like that. If some NIC > actually cleanly drops over sized TX frames, I'd bet it's done in FW, > or some other software piece. Agree. > > There was also a statement earlier in the thread that we can put a large > frame on the wire and "let the switch drop it". I don't believe > that's possible either (as I mentioned previously BPF could generate > frames above jumbo size). My phy knowledge is very rudimentary and I think that was something I said, what I meant is if the hardware sent a jumbo frame to a switch with a 1500MRU set I would expect the receiver to drop it. On the hardware side I would guess the error is it doesn't fit in the receive buffer. I think if you sent a very large frame, something much larger than 9k (without TSO), the sender itself will hang or abort and reset just like above. From what I've seen mostly the maximum receive frame size mirrors the MTU because no one has an explicit MRU to configure. > rusty but from what I heard Ethernet PHYs have a hard design limit on > the length of a frame they can put of a wire (or pull from it), because > of symbol encoding, electrical charges on the wire etc. reasons. There > needs to be a bunch of idle symbols every now and then. And obviously > if one actually manages to get a longer frame to the PHY it will fault, > see above. Yes, I've seen this before on some hardware. > > > > > Generally I expect drivers use MTU to configure RX buffers not sure > > > > how it is going to be used on TX side? Any examples? I just poked > > > > around through the driver source to see and seems to confirm its > > > > primarily for RX side configuration with some drivers throwing the > > > > event down to the firmware for something that I can't see in the code? > > > > > > Right, but that could just be because nobody expects to get over sized > > > frames from the stack. > > > > > > We actively encourage drivers to remove paranoid checks. It's really > > > not going to be a great experience for driver authors where they need > > > to consult a list of things they should and shouldn't check. > > > > > > If we want to do this, the driver interface must most definitely say > > > MRU and not MTU. > > > > What is MRU? > > Max Receive Unit, Jesse and others have been talking about how we > should separate the TX config from RX config for drivers. Right now > drivers configure RX filters based on the max transmission unit, > which is weird, and nobody is sure whether that's actually desired. Agree. But, its a reasonable default I think. An explicit MRU would be a nice addition. > > > > > I'm not suggestiong sprinkling validation checks across the drivers. > > > > I'm suggesting if the drivers hang we fix them. > > > > > > We both know the level of testing drivers get, it's unlikely this will > > > be validated. It's packet of death waiting to happen. > > > We could write some selftests for driver writers to run? I think any selftests we could provide would be welcome. ./test_bpf_driver eth0 Test large frame Test small frame Test corrupted checksum ... > > > And all this for what? Saving 2 cycles on a branch that will almost > > > never be taken? 2 cycles here and 2 cycles there .... plus complexity to think about it. Eventually it all adds up. At the risk of entering bike shedding territory maybe. > > > > I do think it is risky not to do this simple MTU check in net-core. I > > also believe the overhead is very very low. Hint, I'm basically just > > moving the MTU check from one place to another. (And last patch in > > patchset is an optimization that inlines and save cycles when doing > > these kind of MTU checks).
On Sat, 10 Oct 2020 16:52:48 -0700 John Fastabend wrote: > Jakub Kicinski wrote: > > FWIW I took a quick swing at testing it with the HW I have and it did > > exactly what hardware should do. The TX unit entered an error state > > and then the driver detected that and reset it a few seconds later. > > Ths seems like the right thing to do in my opinion. If the > stack gives the NIC garbage entering error state and reset > sounds expected. Thanks for actually trying it by the way. > > We might have come to different conclusions though from my side > the conclusion is, good nothing horrible happened no MTU check needed. > If the user spews garbage at the nic from the BPF program great it > gets dropped and causes the driver/nic to punish you a bit by staying > hung. Fix your BPF program. Right probably difference of perspective. I understand that from Cilium's POV you can probably feel pretty confident about the BPF programs that are running. I bet Maciej is even more confident with Android! But in principle BPF was supposed to make the kernel end user programmable. We have to ensure it's safe. > > > > And all this for what? Saving 2 cycles on a branch that will almost > > > > never be taken? > > 2 cycles here and 2 cycles there .... plus complexity to think about > it. Eventually it all adds up. At the risk of entering bike shedding > territory maybe. Not sure it's a bike shedding territory but I doubt you want to be making either the complexity or the performance argument to a fellow TLS maintainer.. cough cough.. ;)
On Sat, 10 Oct 2020 09:32:12 -0700 Jakub Kicinski <kuba@kernel.org> wrote: > On Sat, 10 Oct 2020 12:44:02 +0200 Jesper Dangaard Brouer wrote: > > > > > We will not be sprinkling validation checks across the drivers because > > > > > some reconfiguration path may occasionally yield a bad packet, or it's > > > > > hard to do something right with BPF. > > > > > > > > This is a driver bug then. As it stands today drivers may get hit with > > > > skb with MTU greater than set MTU as best I can tell. > > > > > > You're talking about taking it from "maybe this can happen, but will > > > still be at most jumbo" to "it's going to be very easy to trigger and > > > length may be > MAX_U16". > > > > It is interesting that a misbehaving BPF program can easily trigger this. > > Next week, I will looking writing such a BPF-prog and then test it on > > the hardware I have avail in my testlab. I've tested sending different packet sizes that exceed the MTU on different hardware. They all silently drop the transmitted packet. mlx5 and i40e configured to (L3) MTU 1500, will lets through upto 1504, while ixgbe will drop size 1504. Packets can be observed locally with tcpdump, but the other end doesn't receive the packet. I didn't find any counters (including ethtool -S) indicating these packets were dropped at hardware/firmware level, which were a little concerning for later troubleshooting. Another observation is that size increases (with bpf_skb_adjust_room) above 4096 + e.g 128 will likely fail, even-though I have the 64K limit in this kernel. > FWIW I took a quick swing at testing it with the HW I have and it did > exactly what hardware should do. The TX unit entered an error state > and then the driver detected that and reset it a few seconds later. The drivers (i40e, mlx5, ixgbe) I tested with didn't entered an error state, when getting packets exceeding the MTU. I didn't go much above 4K, so maybe I didn't trigger those cases. > Hardware is almost always designed to behave like that. If some NIC > actually cleanly drops over sized TX frames, I'd bet it's done in FW, > or some other software piece.
On Tue, 13 Oct 2020 22:40:09 +0200 Jesper Dangaard Brouer wrote: > > FWIW I took a quick swing at testing it with the HW I have and it did > > exactly what hardware should do. The TX unit entered an error state > > and then the driver detected that and reset it a few seconds later. > > The drivers (i40e, mlx5, ixgbe) I tested with didn't entered an error > state, when getting packets exceeding the MTU. I didn't go much above > 4K, so maybe I didn't trigger those cases. You probably need to go above 16k to get out of the acceptable jumbo frame size. I tested ixgbe by converting TSO frames to large TCP frames, at low probability.
On Tue, Oct 13, 2020 at 04:07:26PM -0700, Jakub Kicinski wrote: > On Tue, 13 Oct 2020 22:40:09 +0200 Jesper Dangaard Brouer wrote: > > > FWIW I took a quick swing at testing it with the HW I have and it did > > > exactly what hardware should do. The TX unit entered an error state > > > and then the driver detected that and reset it a few seconds later. > > > > The drivers (i40e, mlx5, ixgbe) I tested with didn't entered an error > > state, when getting packets exceeding the MTU. I didn't go much above > > 4K, so maybe I didn't trigger those cases. > > You probably need to go above 16k to get out of the acceptable jumbo > frame size. I tested ixgbe by converting TSO frames to large TCP frames, > at low probability. how about we set __bpf_skb_max_len() to jumbo like 8k and be done with it. I guess some badly written driver/fw may still hang with <= 8k skb that bpf redirected from one netdev with mtu=jumbo to another netdev with mtu=1500, but then it's really a job of the driver/fw to deal with it cleanly. I think checking skb->tx_dev->mtu for every xmited packet is not great. For typical load balancer it would be good to have MRU 1500 and MTU 15xx. Especially if it's internet facing. Just to drop all known big packets in hw via MRU check. But the stack doesn't have MRU vs MTU distinction and XDP_TX doesn't adhere to MTU. xdp_data_hard_end is the limit. So xdp already allows growing the packet beyond MTU. I think upgrading artificial limit in __bpf_skb_max_len() to 8k will keep it safe enough for all practical cases and will avoid unnecessary checks and complexity in xmit path.
> how about we set __bpf_skb_max_len() to jumbo like 8k and be done with it. 8k is still far too small. A lot of places do 9K or 16K jumbo frames. You'd need at least a full 16K for it to be real jumbo compatible. That said, if we're ever willing to ignore device mtu, then I see no reason why an 8K or 16K or 32K limit is any better than 64K. (which is at least max IP packet size compatible [let's ignore ipv6 jumbograms as not realistic]) If something in the firmware/driver fails at 64K it'll probably fail at 8K as well. Since the 'bad' hardware is most likely old and only ~1500 (or 1 pagesize) capable anyway... In practice driver limitations maybe more around the number of pages or sg sections, then rather on the max packet size anyway... so failures may depend on individual skb layout... And as a reminder there are interfaces (like lo) that default to 64K mtu. (and I have veth setups with 64K mtu as well) btw. our GCE folks tell us they occasionally see (and now discard) >mtu packets from Linux VMs (using the virtio-net driver), we've not had time to debug this (the VMs in question have some pretty funky routing and for privacy reason I've not been able to get actual dumps of the problematic frames), but gut feeling is >mtu packets occasionally leak into the drivers (probably from the tcp stack). > I guess some badly written driver/fw may still hang with <= 8k skb > that bpf redirected from one netdev with mtu=jumbo to another > netdev with mtu=1500, but then it's really a job of the driver/fw > to deal with it cleanly. > > I think checking skb->tx_dev->mtu for every xmited packet is not great. > For typical load balancer it would be good to have MRU 1500 and MTU 15xx. > Especially if it's internet facing. Just to drop all known big > packets in hw via MRU check. > But the stack doesn't have MRU vs MTU distinction and XDP_TX doesn't > adhere to MTU. xdp_data_hard_end is the limit. > So xdp already allows growing the packet beyond MTU. > I think upgrading artificial limit in __bpf_skb_max_len() to 8k will > keep it safe enough for all practical cases and will avoid unnecessary > checks and complexity in xmit path.