[bpf-next,V3,0/6] bpf: New approach for BPF MTU handling

Message ID	160216609656.882446.16642490462568561112.stgit@firesoul (mailing list archive)
Headers	show Return-Path: <SRS0=0bGE=DP=vger.kernel.org=bpf-owner@kernel.org> Subject: [PATCH bpf-next V3 0/6] bpf: New approach for BPF MTU handling From: Jesper Dangaard Brouer <brouer@redhat.com> To: bpf@vger.kernel.org Cc: Jesper Dangaard Brouer <brouer@redhat.com>, netdev@vger.kernel.org, Daniel Borkmann <borkmann@iogearbox.net>, Alexei Starovoitov <alexei.starovoitov@gmail.com>, maze@google.com, lmb@cloudflare.com, shaun@tigera.io, Lorenzo Bianconi <lorenzo@kernel.org>, marek@cloudflare.com, John Fastabend <john.fastabend@gmail.com>, Jakub Kicinski <kuba@kernel.org>, eyal.birger@gmail.com Date: Thu, 08 Oct 2020 16:08:57 +0200 Message-ID: <160216609656.882446.16642490462568561112.stgit@firesoul> User-Agent: StGit/0.19 MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Precedence: bulk
Series	bpf: New approach for BPF MTU handling \| expand [bpf-next,V3,0/6] bpf: New approach for BPF MTU handling [bpf-next,V3,1/6] bpf: Remove MTU check in __bpf_skb_max_len [bpf-next,V3,2/6] bpf: bpf_fib_lookup return MTU value as output when looked up [bpf-next,V3,3/6] bpf: add BPF-helper for MTU checking [bpf-next,V3,4/6] bpf: make it possible to identify BPF redirected SKBs [bpf-next,V3,5/6] bpf: drop MTU check when doing TC-BPF redirect to ingress [bpf-next,V3,6/6] net: inline and splitup is_skb_forwardable

Jesper Dangaard Brouer Oct. 8, 2020, 2:08 p.m. UTC

This patchset drops all the MTU checks in TC BPF-helpers that limits
growing the packet size. This is done because these BPF-helpers doesn't
take redirect into account, which can result in their MTU check being done
against the wrong netdev.

The new approach is to give BPF-programs knowledge about the MTU on a
netdev (via ifindex) and fib route lookup level. Meaning some BPF-helpers
are added and extended to make it possible to do MTU checks in the
BPF-code.

If BPF-prog doesn't comply with the MTU then the packet will eventually
get dropped as some other layer. In some cases the existing kernel MTU
checks will drop the packet, but there are also cases where BPF can bypass
these checks. Specifically doing TC-redirect from ingress step
(sch_handle_ingress) into egress code path (basically calling
dev_queue_xmit()). It is left up to driver code to handle these kind of
MTU violations.

One advantage of this approach is that it ingress-to-egress BPF-prog can
send information via packet data. With the MTU checks removed in the
helpers, and also not done in skb_do_redirect() call, this allows for an
ingress BPF-prog to communicate with an egress BPF-prog via packet data,
as long as egress BPF-prog remove this prior to transmitting packet.

This patchset is primarily focused on TC-BPF, but I've made sure that the
MTU BPF-helpers also works for XDP BPF-programs.

V2: Change BPF-helper API from lookup to check
V3: Drop enforcement of MTU in net-core, leave it to drivers

---

Jesper Dangaard Brouer (6):
      bpf: Remove MTU check in __bpf_skb_max_len
      bpf: bpf_fib_lookup return MTU value as output when looked up
      bpf: add BPF-helper for MTU checking
      bpf: make it possible to identify BPF redirected SKBs
      bpf: drop MTU check when doing TC-BPF redirect to ingress
      net: inline and splitup is_skb_forwardable


 include/linux/netdevice.h      |   32 +++++++-
 include/uapi/linux/bpf.h       |   74 +++++++++++++++++-
 net/core/dev.c                 |   25 +-----
 net/core/filter.c              |  166 ++++++++++++++++++++++++++++++++++++----
 net/sched/Kconfig              |    1 
 tools/include/uapi/linux/bpf.h |   74 +++++++++++++++++-
 6 files changed, 326 insertions(+), 46 deletions(-)

--
Signature

Jakub Kicinski Oct. 9, 2020, 4:33 p.m. UTC | #1

On Thu, 08 Oct 2020 16:08:57 +0200 Jesper Dangaard Brouer wrote:
> V3: Drop enforcement of MTU in net-core, leave it to drivers

Sorry for being late to the discussion.

I absolutely disagree. We had cases in the past where HW would lock up
if it was sent a frame with bad geometry.

We will not be sprinkling validation checks across the drivers because
some reconfiguration path may occasionally yield a bad packet, or it's
hard to do something right with BPF.

John Fastabend Oct. 9, 2020, 8:49 p.m. UTC | #2

Jakub Kicinski wrote:
> On Thu, 08 Oct 2020 16:08:57 +0200 Jesper Dangaard Brouer wrote:
> > V3: Drop enforcement of MTU in net-core, leave it to drivers
> 
> Sorry for being late to the discussion.
> 
> I absolutely disagree. We had cases in the past where HW would lock up
> if it was sent a frame with bad geometry.
> 
> We will not be sprinkling validation checks across the drivers because
> some reconfiguration path may occasionally yield a bad packet, or it's
> hard to do something right with BPF.

This is a driver bug then. As it stands today drivers may get hit with
skb with MTU greater than set MTU as best I can tell. Generally I
expect drivers use MTU to configure RX buffers not sure how it is going
to be used on TX side? Any examples? I just poked around through the
driver source to see and seems to confirm its primarily for RX side
configuration with some drivers throwing the event down to the firmware
for something that I can't see in the code?

I'm not suggestiong sprinkling validation checks across the drivers.
I'm suggesting if the drivers hang we fix them.

Alexei Starovoitov Oct. 9, 2020, 9:07 p.m. UTC | #3

On Fri, Oct 09, 2020 at 01:49:14PM -0700, John Fastabend wrote:
> Jakub Kicinski wrote:
> > On Thu, 08 Oct 2020 16:08:57 +0200 Jesper Dangaard Brouer wrote:
> > > V3: Drop enforcement of MTU in net-core, leave it to drivers
> > 
> > Sorry for being late to the discussion.
> > 
> > I absolutely disagree. We had cases in the past where HW would lock up
> > if it was sent a frame with bad geometry.
> > 
> > We will not be sprinkling validation checks across the drivers because
> > some reconfiguration path may occasionally yield a bad packet, or it's
> > hard to do something right with BPF.
> 
> This is a driver bug then. As it stands today drivers may get hit with
> skb with MTU greater than set MTU as best I can tell. Generally I
> expect drivers use MTU to configure RX buffers not sure how it is going
> to be used on TX side? Any examples? I just poked around through the
> driver source to see and seems to confirm its primarily for RX side
> configuration with some drivers throwing the event down to the firmware
> for something that I can't see in the code?
> 
> I'm not suggestiong sprinkling validation checks across the drivers.
> I'm suggesting if the drivers hang we fix them.

+1

I've seen HW that hangs when certain sizes of the packet.
Like < 68 byte TX where size is one specific constant.
I don't think it's a job of the stack or the driver to deal with that.
It's firmware/hw bug.

Maciej Żenczykowski Oct. 9, 2020, 9:57 p.m. UTC | #4

On Fri, Oct 9, 2020 at 2:07 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Oct 09, 2020 at 01:49:14PM -0700, John Fastabend wrote:
> > Jakub Kicinski wrote:
> > > On Thu, 08 Oct 2020 16:08:57 +0200 Jesper Dangaard Brouer wrote:
> > > > V3: Drop enforcement of MTU in net-core, leave it to drivers
> > >
> > > Sorry for being late to the discussion.
> > >
> > > I absolutely disagree. We had cases in the past where HW would lock up
> > > if it was sent a frame with bad geometry.
> > >
> > > We will not be sprinkling validation checks across the drivers because
> > > some reconfiguration path may occasionally yield a bad packet, or it's
> > > hard to do something right with BPF.
> >
> > This is a driver bug then. As it stands today drivers may get hit with
> > skb with MTU greater than set MTU as best I can tell. Generally I
> > expect drivers use MTU to configure RX buffers not sure how it is going
> > to be used on TX side? Any examples? I just poked around through the
> > driver source to see and seems to confirm its primarily for RX side
> > configuration with some drivers throwing the event down to the firmware
> > for something that I can't see in the code?
> >
> > I'm not suggestiong sprinkling validation checks across the drivers.
> > I'm suggesting if the drivers hang we fix them.
>
> +1
>
> I've seen HW that hangs when certain sizes of the packet.
> Like < 68 byte TX where size is one specific constant.
> I don't think it's a job of the stack or the driver to deal with that.
> It's firmware/hw bug.

+1
It's not the job of the core stack, but it *is* the job of the driver
to deal with firmware/hw bugs like this.
Sure fix in hw when you can (next rev), in fw if you can't (and have
fw, can release it, rev it, distribute it), but ultimately that's why
drivers have quirks for various revisions of hw/fw... so you very much
fix bugs like this in driver if needed.

Jakub Kicinski Oct. 9, 2020, 11 p.m. UTC | #5

On Fri, 09 Oct 2020 13:49:14 -0700 John Fastabend wrote:
> Jakub Kicinski wrote:
> > On Thu, 08 Oct 2020 16:08:57 +0200 Jesper Dangaard Brouer wrote:  
> > > V3: Drop enforcement of MTU in net-core, leave it to drivers  
> > 
> > Sorry for being late to the discussion.
> > 
> > I absolutely disagree. We had cases in the past where HW would lock up
> > if it was sent a frame with bad geometry.
> > 
> > We will not be sprinkling validation checks across the drivers because
> > some reconfiguration path may occasionally yield a bad packet, or it's
> > hard to do something right with BPF.  
> 
> This is a driver bug then. As it stands today drivers may get hit with
> skb with MTU greater than set MTU as best I can tell.

You're talking about taking it from "maybe this can happen, but will
still be at most jumbo" to "it's going to be very easy to trigger and
length may be > MAX_U16".

> Generally I expect drivers use MTU to configure RX buffers not sure
> how it is going to be used on TX side? Any examples? I just poked
> around through the driver source to see and seems to confirm its
> primarily for RX side configuration with some drivers throwing the
> event down to the firmware for something that I can't see in the code?

Right, but that could just be because nobody expects to get over sized
frames from the stack.

We actively encourage drivers to remove paranoid checks. It's really
not going to be a great experience for driver authors where they need
to consult a list of things they should and shouldn't check.

If we want to do this, the driver interface must most definitely say 
MRU and not MTU.

> I'm not suggestiong sprinkling validation checks across the drivers.
> I'm suggesting if the drivers hang we fix them.

We both know the level of testing drivers get, it's unlikely this will
be validated. It's packet of death waiting to happen. 

And all this for what? Saving 2 cycles on a branch that will almost
never be taken?

Jesper Dangaard Brouer Oct. 10, 2020, 10:44 a.m. UTC | #6

On Fri, 9 Oct 2020 16:00:10 -0700
Jakub Kicinski <kuba@kernel.org> wrote:

> On Fri, 09 Oct 2020 13:49:14 -0700 John Fastabend wrote:
> > Jakub Kicinski wrote:  
> > > On Thu, 08 Oct 2020 16:08:57 +0200 Jesper Dangaard Brouer wrote:    
> > > > V3: Drop enforcement of MTU in net-core, leave it to drivers    
> > > 
> > > Sorry for being late to the discussion.
> > > 
> > > I absolutely disagree. We had cases in the past where HW would lock up
> > > if it was sent a frame with bad geometry.

I agree with Jakub here.  I do find it risky not to do these MTU check
in net-core.

> > > We will not be sprinkling validation checks across the drivers because
> > > some reconfiguration path may occasionally yield a bad packet, or it's
> > > hard to do something right with BPF.    
> > 
> > This is a driver bug then. As it stands today drivers may get hit with
> > skb with MTU greater than set MTU as best I can tell.  
> 
> You're talking about taking it from "maybe this can happen, but will
> still be at most jumbo" to "it's going to be very easy to trigger and
> length may be > MAX_U16".

It is interesting that a misbehaving BPF program can easily trigger this.
Next week, I will looking writing such a BPF-prog and then test it on
the hardware I have avail in my testlab.


> > Generally I expect drivers use MTU to configure RX buffers not sure
> > how it is going to be used on TX side? Any examples? I just poked
> > around through the driver source to see and seems to confirm its
> > primarily for RX side configuration with some drivers throwing the
> > event down to the firmware for something that I can't see in the code?  
> 
> Right, but that could just be because nobody expects to get over sized
> frames from the stack.
> 
> We actively encourage drivers to remove paranoid checks. It's really
> not going to be a great experience for driver authors where they need
> to consult a list of things they should and shouldn't check.
> 
> If we want to do this, the driver interface must most definitely say 
> MRU and not MTU.

What is MRU?

 
> > I'm not suggestiong sprinkling validation checks across the drivers.
> > I'm suggesting if the drivers hang we fix them.  
> 
> We both know the level of testing drivers get, it's unlikely this will
> be validated. It's packet of death waiting to happen. 
> 
> And all this for what? Saving 2 cycles on a branch that will almost
> never be taken?

I do think it is risky not to do this simple MTU check in net-core.  I
also believe the overhead is very very low.  Hint, I'm basically just
moving the MTU check from one place to another.  (And last patch in
patchset is an optimization that inlines and save cycles when doing
these kind of MTU checks).

Jakub Kicinski Oct. 10, 2020, 4:32 p.m. UTC | #7

On Sat, 10 Oct 2020 12:44:02 +0200 Jesper Dangaard Brouer wrote:
> > > > We will not be sprinkling validation checks across the drivers because
> > > > some reconfiguration path may occasionally yield a bad packet, or it's
> > > > hard to do something right with BPF.      
> > > 
> > > This is a driver bug then. As it stands today drivers may get hit with
> > > skb with MTU greater than set MTU as best I can tell.    
> > 
> > You're talking about taking it from "maybe this can happen, but will
> > still be at most jumbo" to "it's going to be very easy to trigger and
> > length may be > MAX_U16".  
> 
> It is interesting that a misbehaving BPF program can easily trigger this.
> Next week, I will looking writing such a BPF-prog and then test it on
> the hardware I have avail in my testlab.

FWIW I took a quick swing at testing it with the HW I have and it did
exactly what hardware should do. The TX unit entered an error state 
and then the driver detected that and reset it a few seconds later.

Hardware is almost always designed to behave like that. If some NIC
actually cleanly drops over sized TX frames, I'd bet it's done in FW,
or some other software piece.

There was also a statement earlier in the thread that we can put a large
frame on the wire and "let the switch drop it". I don't believe
that's possible either (as I mentioned previously BPF could generate
frames above jumbo size). My phy knowledge is very rudimentary and
rusty but from what I heard Ethernet PHYs have a hard design limit on
the length of a frame they can put of a wire (or pull from it), because
of symbol encoding, electrical charges on the wire etc. reasons. There
needs to be a bunch of idle symbols every now and then. And obviously
if one actually manages to get a longer frame to the PHY it will fault,
see above.

> > > Generally I expect drivers use MTU to configure RX buffers not sure
> > > how it is going to be used on TX side? Any examples? I just poked
> > > around through the driver source to see and seems to confirm its
> > > primarily for RX side configuration with some drivers throwing the
> > > event down to the firmware for something that I can't see in the code?    
> > 
> > Right, but that could just be because nobody expects to get over sized
> > frames from the stack.
> > 
> > We actively encourage drivers to remove paranoid checks. It's really
> > not going to be a great experience for driver authors where they need
> > to consult a list of things they should and shouldn't check.
> > 
> > If we want to do this, the driver interface must most definitely say 
> > MRU and not MTU.  
> 
> What is MRU?

Max Receive Unit, Jesse and others have been talking about how we 
should separate the TX config from RX config for drivers. Right now
drivers configure RX filters based on the max transmission unit, 
which is weird, and nobody is sure whether that's actually desired.

> > > I'm not suggestiong sprinkling validation checks across the drivers.
> > > I'm suggesting if the drivers hang we fix them.    
> > 
> > We both know the level of testing drivers get, it's unlikely this will
> > be validated. It's packet of death waiting to happen. 
> > 
> > And all this for what? Saving 2 cycles on a branch that will almost
> > never be taken?  
> 
> I do think it is risky not to do this simple MTU check in net-core.  I
> also believe the overhead is very very low.  Hint, I'm basically just
> moving the MTU check from one place to another.  (And last patch in
> patchset is an optimization that inlines and save cycles when doing
> these kind of MTU checks).

John Fastabend Oct. 10, 2020, 11:52 p.m. UTC | #8

Jakub Kicinski wrote:
> On Sat, 10 Oct 2020 12:44:02 +0200 Jesper Dangaard Brouer wrote:
> > > > > We will not be sprinkling validation checks across the drivers because
> > > > > some reconfiguration path may occasionally yield a bad packet, or it's
> > > > > hard to do something right with BPF.      
> > > > 
> > > > This is a driver bug then. As it stands today drivers may get hit with
> > > > skb with MTU greater than set MTU as best I can tell.    
> > > 
> > > You're talking about taking it from "maybe this can happen, but will
> > > still be at most jumbo" to "it's going to be very easy to trigger and
> > > length may be > MAX_U16".  
> > 
> > It is interesting that a misbehaving BPF program can easily trigger this.
> > Next week, I will looking writing such a BPF-prog and then test it on
> > the hardware I have avail in my testlab.
> 
> FWIW I took a quick swing at testing it with the HW I have and it did
> exactly what hardware should do. The TX unit entered an error state 
> and then the driver detected that and reset it a few seconds later.

Ths seems like the right thing to do in my opinion. If the
stack gives the NIC garbage entering error state and reset
sounds expected. Thanks for actually trying it by the way.

We might have come to different conclusions though from my side
the conclusion is, good nothing horrible happened no MTU check needed.
If the user spews garbage at the nic from the BPF program great it
gets dropped and causes the driver/nic to punish you a bit by staying
hung. Fix your BPF program.

Now if the nic hangs and doesn't ever come back I would care. But,
we have watchdog logic for this.

I don't really feel like we need to guard bad BPF programs from
doing dumb things, setting MTU in this case, but other things might
be nested vlans that wont fly, overwriting checksums, corrupting
mac headers, etc.

> 
> Hardware is almost always designed to behave like that. If some NIC
> actually cleanly drops over sized TX frames, I'd bet it's done in FW,
> or some other software piece.

Agree.

> 
> There was also a statement earlier in the thread that we can put a large
> frame on the wire and "let the switch drop it". I don't believe
> that's possible either (as I mentioned previously BPF could generate
> frames above jumbo size). My phy knowledge is very rudimentary and

I think that was something I said, what I meant is if the hardware
sent a jumbo frame to a switch with a 1500MRU set I would expect
the receiver to drop it. On the hardware side I would guess the
error is it doesn't fit in the receive buffer. I think if you sent
a very large frame, something much larger than 9k (without TSO), the
sender itself will hang or abort and reset just like above.

From what I've seen mostly the maximum receive frame size mirrors
the MTU because no one has an explicit MRU to configure.

> rusty but from what I heard Ethernet PHYs have a hard design limit on
> the length of a frame they can put of a wire (or pull from it), because
> of symbol encoding, electrical charges on the wire etc. reasons. There
> needs to be a bunch of idle symbols every now and then. And obviously
> if one actually manages to get a longer frame to the PHY it will fault,
> see above.

Yes, I've seen this before on some hardware.

> 
> > > > Generally I expect drivers use MTU to configure RX buffers not sure
> > > > how it is going to be used on TX side? Any examples? I just poked
> > > > around through the driver source to see and seems to confirm its
> > > > primarily for RX side configuration with some drivers throwing the
> > > > event down to the firmware for something that I can't see in the code?    
> > > 
> > > Right, but that could just be because nobody expects to get over sized
> > > frames from the stack.
> > > 
> > > We actively encourage drivers to remove paranoid checks. It's really
> > > not going to be a great experience for driver authors where they need
> > > to consult a list of things they should and shouldn't check.
> > > 
> > > If we want to do this, the driver interface must most definitely say 
> > > MRU and not MTU.  
> > 
> > What is MRU?
> 
> Max Receive Unit, Jesse and others have been talking about how we 
> should separate the TX config from RX config for drivers. Right now
> drivers configure RX filters based on the max transmission unit, 
> which is weird, and nobody is sure whether that's actually desired.

Agree. But, its a reasonable default I think. An explicit MRU would
be a nice addition.

> 
> > > > I'm not suggestiong sprinkling validation checks across the drivers.
> > > > I'm suggesting if the drivers hang we fix them.    
> > > 
> > > We both know the level of testing drivers get, it's unlikely this will
> > > be validated. It's packet of death waiting to happen. 
> > > 

We could write some selftests for driver writers to run? I think any
selftests we could provide would be welcome.

 ./test_bpf_driver eth0
  Test large frame
  Test small frame
  Test corrupted checksum
  ...

> > > And all this for what? Saving 2 cycles on a branch that will almost
> > > never be taken?  

2 cycles here and 2 cycles there .... plus complexity to think about
it. Eventually it all adds up. At the risk of entering bike shedding
territory maybe.

> > 
> > I do think it is risky not to do this simple MTU check in net-core.  I
> > also believe the overhead is very very low.  Hint, I'm basically just
> > moving the MTU check from one place to another.  (And last patch in
> > patchset is an optimization that inlines and save cycles when doing
> > these kind of MTU checks).

Jakub Kicinski Oct. 11, 2020, 11:30 p.m. UTC | #9

On Sat, 10 Oct 2020 16:52:48 -0700 John Fastabend wrote:
> Jakub Kicinski wrote:
> > FWIW I took a quick swing at testing it with the HW I have and it did
> > exactly what hardware should do. The TX unit entered an error state 
> > and then the driver detected that and reset it a few seconds later.  
> 
> Ths seems like the right thing to do in my opinion. If the
> stack gives the NIC garbage entering error state and reset
> sounds expected. Thanks for actually trying it by the way.
> 
> We might have come to different conclusions though from my side
> the conclusion is, good nothing horrible happened no MTU check needed.
> If the user spews garbage at the nic from the BPF program great it
> gets dropped and causes the driver/nic to punish you a bit by staying
> hung. Fix your BPF program.

Right probably difference of perspective. I understand that from
Cilium's POV you can probably feel pretty confident about the BPF
programs that are running. I bet Maciej is even more confident with
Android!

But in principle BPF was supposed to make the kernel end user
programmable. We have to ensure it's safe.

> > > > And all this for what? Saving 2 cycles on a branch that will almost
> > > > never be taken?    
> 
> 2 cycles here and 2 cycles there .... plus complexity to think about
> it. Eventually it all adds up. At the risk of entering bike shedding
> territory maybe.

Not sure it's a bike shedding territory but I doubt you want to be
making either the complexity or the performance argument to a fellow 
TLS maintainer.. cough cough.. ;)

Jesper Dangaard Brouer Oct. 13, 2020, 8:40 p.m. UTC | #10

On Sat, 10 Oct 2020 09:32:12 -0700
Jakub Kicinski <kuba@kernel.org> wrote:

> On Sat, 10 Oct 2020 12:44:02 +0200 Jesper Dangaard Brouer wrote:
> > > > > We will not be sprinkling validation checks across the drivers because
> > > > > some reconfiguration path may occasionally yield a bad packet, or it's
> > > > > hard to do something right with BPF.        
> > > > 
> > > > This is a driver bug then. As it stands today drivers may get hit with
> > > > skb with MTU greater than set MTU as best I can tell.      
> > > 
> > > You're talking about taking it from "maybe this can happen, but will
> > > still be at most jumbo" to "it's going to be very easy to trigger and
> > > length may be > MAX_U16".    
> > 
> > It is interesting that a misbehaving BPF program can easily trigger this.
> > Next week, I will looking writing such a BPF-prog and then test it on
> > the hardware I have avail in my testlab.  

I've tested sending different packet sizes that exceed the MTU on
different hardware. They all silently drop the transmitted packet. mlx5
and i40e configured to (L3) MTU 1500, will lets through upto 1504, while
ixgbe will drop size 1504.

Packets can be observed locally with tcpdump, but the other end doesn't
receive the packet. I didn't find any counters (including ethtool -S)
indicating these packets were dropped at hardware/firmware level, which
were a little concerning for later troubleshooting.

Another observation is that size increases (with bpf_skb_adjust_room)
above 4096 + e.g 128 will likely fail, even-though I have the 64K limit in
this kernel.

> FWIW I took a quick swing at testing it with the HW I have and it did
> exactly what hardware should do. The TX unit entered an error state 
> and then the driver detected that and reset it a few seconds later.

The drivers (i40e, mlx5, ixgbe) I tested with didn't entered an error
state, when getting packets exceeding the MTU.  I didn't go much above
4K, so maybe I didn't trigger those cases.

> Hardware is almost always designed to behave like that. If some NIC
> actually cleanly drops over sized TX frames, I'd bet it's done in FW,
> or some other software piece.

Jakub Kicinski Oct. 13, 2020, 11:07 p.m. UTC | #11

On Tue, 13 Oct 2020 22:40:09 +0200 Jesper Dangaard Brouer wrote:
> > FWIW I took a quick swing at testing it with the HW I have and it did
> > exactly what hardware should do. The TX unit entered an error state 
> > and then the driver detected that and reset it a few seconds later.  
> 
> The drivers (i40e, mlx5, ixgbe) I tested with didn't entered an error
> state, when getting packets exceeding the MTU.  I didn't go much above
> 4K, so maybe I didn't trigger those cases.

You probably need to go above 16k to get out of the acceptable jumbo
frame size. I tested ixgbe by converting TSO frames to large TCP frames,
at low probability.

Alexei Starovoitov Oct. 13, 2020, 11:37 p.m. UTC | #12

On Tue, Oct 13, 2020 at 04:07:26PM -0700, Jakub Kicinski wrote:
> On Tue, 13 Oct 2020 22:40:09 +0200 Jesper Dangaard Brouer wrote:
> > > FWIW I took a quick swing at testing it with the HW I have and it did
> > > exactly what hardware should do. The TX unit entered an error state 
> > > and then the driver detected that and reset it a few seconds later.  
> > 
> > The drivers (i40e, mlx5, ixgbe) I tested with didn't entered an error
> > state, when getting packets exceeding the MTU.  I didn't go much above
> > 4K, so maybe I didn't trigger those cases.
> 
> You probably need to go above 16k to get out of the acceptable jumbo
> frame size. I tested ixgbe by converting TSO frames to large TCP frames,
> at low probability.

how about we set __bpf_skb_max_len() to jumbo like 8k and be done with it.

I guess some badly written driver/fw may still hang with <= 8k skb
that bpf redirected from one netdev with mtu=jumbo to another
netdev with mtu=1500, but then it's really a job of the driver/fw
to deal with it cleanly.

I think checking skb->tx_dev->mtu for every xmited packet is not great.
For typical load balancer it would be good to have MRU 1500 and MTU 15xx.
Especially if it's internet facing. Just to drop all known big
packets in hw via MRU check.
But the stack doesn't have MRU vs MTU distinction and XDP_TX doesn't
adhere to MTU. xdp_data_hard_end is the limit.
So xdp already allows growing the packet beyond MTU.
I think upgrading artificial limit in __bpf_skb_max_len() to 8k will
keep it safe enough for all practical cases and will avoid unnecessary
checks and complexity in xmit path.

Maciej Żenczykowski Oct. 13, 2020, 11:54 p.m. UTC | #13

> how about we set __bpf_skb_max_len() to jumbo like 8k and be done with it.

8k is still far too small.  A lot of places do 9K or 16K jumbo frames.
You'd need at least a full 16K for it to be real jumbo compatible.

That said, if we're ever willing to ignore device mtu, then I see no
reason why an 8K or 16K or 32K limit is any better than 64K.
(which is at least max IP packet size compatible [let's ignore ipv6
jumbograms as not realistic])

If something in the firmware/driver fails at 64K it'll probably fail
at 8K as well.
Since the 'bad' hardware is most likely old and only ~1500 (or 1
pagesize) capable anyway...

In practice driver limitations maybe more around the number of pages
or sg sections, then rather on the max packet size anyway...
so failures may depend on individual skb layout...

And as a reminder there are interfaces (like lo) that default to 64K mtu.
(and I have veth setups with 64K mtu as well)

btw. our GCE folks tell us they occasionally see (and now discard)
>mtu packets from Linux VMs (using the virtio-net driver),
we've not had time to debug this (the VMs in question have some pretty
funky routing and for privacy reason I've not been able to get actual
dumps of the problematic frames), but gut feeling is >mtu packets
occasionally leak into the drivers (probably from the tcp stack).

> I guess some badly written driver/fw may still hang with <= 8k skb
> that bpf redirected from one netdev with mtu=jumbo to another
> netdev with mtu=1500, but then it's really a job of the driver/fw
> to deal with it cleanly.
>
> I think checking skb->tx_dev->mtu for every xmited packet is not great.
> For typical load balancer it would be good to have MRU 1500 and MTU 15xx.
> Especially if it's internet facing. Just to drop all known big
> packets in hw via MRU check.
> But the stack doesn't have MRU vs MTU distinction and XDP_TX doesn't
> adhere to MTU. xdp_data_hard_end is the limit.
> So xdp already allows growing the packet beyond MTU.
> I think upgrading artificial limit in __bpf_skb_max_len() to 8k will
> keep it safe enough for all practical cases and will avoid unnecessary
> checks and complexity in xmit path.

[bpf-next,V3,0/6] bpf: New approach for BPF MTU handling

Message

Comments