diff mbox series

[net] net: gso: fix panic on frag_list with mixed head alloc types

Message ID 559cea869928e169240d74c386735f3f95beca32.1666858629.git.jbenc@redhat.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Series [net] net: gso: fix panic on frag_list with mixed head alloc types

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net
netdev/fixes_present success Fixes tag present in non-next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 2 this patch: 2
netdev/cc_maintainers fail 3 blamed authors not CCed: alexanderduyck@fb.com willemb@google.com davem@davemloft.net; 7 maintainers not CCed: kuba@kernel.org davem@davemloft.net bpf@vger.kernel.org alexanderduyck@fb.com edumazet@google.com willemb@google.com pabeni@redhat.com
netdev/build_clang success Errors and warnings before: 5 this patch: 5
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 2 this patch: 2
netdev/checkpatch warning WARNING: line length of 84 exceeds 80 columns WARNING: line length of 86 exceeds 80 columns WARNING: line length of 87 exceeds 80 columns WARNING: line length of 88 exceeds 80 columns WARNING: line length of 89 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Jiri Benc Oct. 27, 2022, 8:20 a.m. UTC
Since commit 3dcbdb134f32 ("net: gso: Fix skb_segment splat when
splitting gso_size mangled skb having linear-headed frag_list"), it is
allowed to change gso_size of a GRO packet. However, that commit assumes
that "checking the first list_skb member suffices; i.e if either of the
list_skb members have non head_frag head, then the first one has too".

It turns out this assumption does not hold. We've seen BUG_ON being hit
in skb_segment when skbs on the frag_list had differing head_frag. That
particular case was with vmxnet3; looking at the driver, it indeed uses
different skb allocation strategies based on the packet size. The last
packet in frag_list can thus be kmalloced if it is sufficiently small.
And there's nothing preventing drivers from mixing things even more
freely.

There are three different locations where this can be fixed:

(1) We could check head_frag in GRO and not allow GROing skbs with
    different head_frag. However, that would lead to performance
    regression (at least on vmxnet3) on normal forward paths with
    unmodified gso_size, where mixed head_frag is not a problem.

(2) Set a flag in bpf_skb_net_grow and bpf_skb_net_shrink indicating
    that NETIF_F_SG is undesirable. That would need to eat a bit in
    sk_buff. Furthermore, that flag can be unset when all skbs on the
    frag_list are page backed. To retain good performance,
    bpf_skb_net_grow/shrink would have to walk the frag_list.

(3) Walk the frag_list in skb_segment when determining whether
    NETIF_F_SG should be cleared. This of course slows things down.

This patch implements (3). To limit the performance impact in
skb_segment, the list is walked only for skbs with SKB_GSO_DODGY set
that have gso_size changed. Normal paths thus will not hit it.

Fixes: 3dcbdb134f32 ("net: gso: Fix skb_segment splat when splitting gso_size mangled skb having linear-headed frag_list")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
---
 net/core/skbuff.c | 36 +++++++++++++++++++-----------------
 1 file changed, 19 insertions(+), 17 deletions(-)

Comments

Jakub Kicinski Oct. 29, 2022, 4:41 a.m. UTC | #1
On Thu, 27 Oct 2022 10:20:56 +0200 Jiri Benc wrote:
> Since commit 3dcbdb134f32 ("net: gso: Fix skb_segment splat when
> splitting gso_size mangled skb having linear-headed frag_list"), it is
> allowed to change gso_size of a GRO packet. However, that commit assumes
> that "checking the first list_skb member suffices; i.e if either of the
> list_skb members have non head_frag head, then the first one has too".
> 
> It turns out this assumption does not hold. We've seen BUG_ON being hit
> in skb_segment when skbs on the frag_list had differing head_frag. That
> particular case was with vmxnet3; looking at the driver, it indeed uses
> different skb allocation strategies based on the packet size.

Where are you looking? I'm not seeing it TBH.

I don't think the driver is that important, tho, __napi_alloc_skb() 
will select page backing or kmalloc, all by itself.
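
For reference, the kmalloc-vs-page-frag choice Jakub mentions is made inside
the allocator itself. Below is a simplified sketch of that decision, with the
threshold and GFP checks approximated from __napi_alloc_skb() around v6.0
(__netdev_alloc_skb() makes a similar choice); it is not verbatim kernel code:

/* Sketch only: small requests get a kmalloc-backed head (head_frag == 0),
 * larger ones a page fragment (head_frag == 1). Threshold is approximate.
 */
static bool napi_rx_head_is_page_frag(unsigned int len, gfp_t gfp_mask)
{
	if (len <= SKB_WITH_OVERHEAD(1024) ||		/* small: __alloc_skb(), kmalloc head */
	    len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||	/* too large for a page frag */
	    (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA)))
		return false;				/* skb->head_frag == 0 */

	return true;	/* page_frag_alloc() + __napi_build_skb(): skb->head_frag == 1 */
}
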

The patch LGTM, adding more CCs in case I'm missing something.

> The last packet in frag_list can thus be kmalloced if it is
> sufficiently small. And there's nothing preventing drivers from
> mixing things even more freely.
> 
> There are three different locations where this can be fixed:
> 
> (1) We could check head_frag in GRO and not allow GROing skbs with
>     different head_frag. However, that would lead to performance
>     regression (at least on vmxnet3) on normal forward paths with
>     unmodified gso_size, where mixed head_frag is not a problem.
> 
> (2) Set a flag in bpf_skb_net_grow and bpf_skb_net_shrink indicating
>     that NETIF_F_SG is undesirable. That would need to eat a bit in
>     sk_buff. Furthermore, that flag can be unset when all skbs on the
>     frag_list are page backed. To retain good performance,
>     bpf_skb_net_grow/shrink would have to walk the frag_list.
> 
> (3) Walk the frag_list in skb_segment when determining whether
>     NETIF_F_SG should be cleared. This of course slows things down.
> 
> This patch implements (3). To limit the performance impact in
> skb_segment, the list is walked only for skbs with SKB_GSO_DODGY set
> that have gso_size changed. Normal paths thus will not hit it.
> 
> Fixes: 3dcbdb134f32 ("net: gso: Fix skb_segment splat when splitting
> gso_size mangled skb having linear-headed frag_list") Signed-off-by:
> Jiri Benc <jbenc@redhat.com>

> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 1d9719e72f9d..bbf3acff44c6 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -4134,23 +4134,25 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>  	int i = 0;
>  	int pos;
>  
> -	if (list_skb && !list_skb->head_frag && skb_headlen(list_skb) &&
> -	    (skb_shinfo(head_skb)->gso_type & SKB_GSO_DODGY)) {
> -		/* gso_size is untrusted, and we have a frag_list with a linear
> -		 * non head_frag head.
> -		 *
> -		 * (we assume checking the first list_skb member suffices;
> -		 * i.e if either of the list_skb members have non head_frag
> -		 * head, then the first one has too).
> -		 *
> -		 * If head_skb's headlen does not fit requested gso_size, it
> -		 * means that the frag_list members do NOT terminate on exact
> -		 * gso_size boundaries. Hence we cannot perform skb_frag_t page
> -		 * sharing. Therefore we must fallback to copying the frag_list
> -		 * skbs; we do so by disabling SG.
> -		 */
> -		if (mss != GSO_BY_FRAGS && mss != skb_headlen(head_skb))
> -			features &= ~NETIF_F_SG;
> +	if ((skb_shinfo(head_skb)->gso_type & SKB_GSO_DODGY) &&
> +	    mss != GSO_BY_FRAGS && mss != skb_headlen(head_skb)) {
> +		struct sk_buff *check_skb;
> +
> +		for (check_skb = list_skb; check_skb; check_skb = check_skb->next) {
> +			if (skb_headlen(check_skb) && !check_skb->head_frag) {
> +				/* gso_size is untrusted, and we have a frag_list with
> +				 * a linear non head_frag item.
> +				 *
> +				 * If head_skb's headlen does not fit requested gso_size,
> +				 * it means that the frag_list members do NOT terminate
> +				 * on exact gso_size boundaries. Hence we cannot perform
> +				 * skb_frag_t page sharing. Therefore we must fallback to
> +				 * copying the frag_list skbs; we do so by disabling SG.
> +				 */
> +				features &= ~NETIF_F_SG;
> +				break;
> +			}
> +		}
>  	}
>  
>  	__skb_push(head_skb, doffset);
Shmulik Ladkani Oct. 29, 2022, 7:41 a.m. UTC | #2
On Thu, 27 Oct 2022 10:20:56 +0200
Jiri Benc <jbenc@redhat.com> wrote:

> It turns out this assumption does not hold. We've seen BUG_ON being hit
> in skb_segment when skbs on the frag_list had differing head_frag. That
> particular case was with vmxnet3; looking at the driver, it indeed uses
> different skb allocation strategies based on the packet size. The last
> packet in frag_list can thus be kmalloced if it is sufficiently small.
> And there's nothing preventing drivers from mixing things even more
> freely.

Hi Jiri,

One of my early attempts to fix the original BUG was to also detect:

> - some frag in the frag_list has a linear part that is NOT head_frag,
>   or length not equal to the requested gso_size

See [0], see skb_is_nonlinear_equal_frags() there

(Note that your current suggestion implements the "some frag in the
 frag_list has a linear part that is NOT head_frag" condition, but not
 "length not equal to the requested gso_size")

As a response, Willem suggested:

> My suggestion only tested the first frag_skb length. If a list can be
> created where the first frag_skb is head_frag but a later one is not,
> it will fail short. I kind of doubt that.

See [1]

So we eventually settled on testing just
  !list_skb->head_frag && skb_headlen(list_skb)
and not every skb on the frag_list.

Maybe Willem can elaborate on that.


[0] https://lore.kernel.org/netdev/20190903185121.56906d31@pixies/
[1] https://lore.kernel.org/netdev/CA+FuTScE=pyopY=3f5E4JGx1zyGqT+XS+8ss13UN4if4TZ2NbA@mail.gmail.com/
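
For illustration, here is a rough sketch of the stricter variant described
above, i.e. also rejecting linear parts whose length differs from the
requested gso_size. This is not the code from [0]; the helper name and exact
conditions are hypothetical:

static bool frag_list_ok_for_page_sharing(const struct sk_buff *list_skb,
					  unsigned int mss)
{
	const struct sk_buff *p;

	for (p = list_skb; p; p = p->next) {
		/* linear part backed by kmalloc rather than a page frag */
		if (skb_headlen(p) && !p->head_frag)
			return false;
		/* linear part does not end on a gso_size boundary */
		if (skb_headlen(p) && skb_headlen(p) != mss)
			return false;
	}
	return true;
}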
Willem de Bruijn Oct. 29, 2022, 2:10 p.m. UTC | #3
On Sat, Oct 29, 2022 at 3:41 AM Shmulik Ladkani
<shmulik.ladkani@gmail.com> wrote:
>
> On Thu, 27 Oct 2022 10:20:56 +0200
> Jiri Benc <jbenc@redhat.com> wrote:
>
> > It turns out this assumption does not hold. We've seen BUG_ON being hit
> > in skb_segment when skbs on the frag_list had differing head_frag. That
> > particular case was with vmxnet3; looking at the driver, it indeed uses
> > different skb allocation strategies based on the packet size. The last
> > packet in frag_list can thus be kmalloced if it is sufficiently small.
> > And there's nothing preventing drivers from mixing things even more
> > freely.
>
> Hi Jiri,
>
> One of my early attempts to fix the original BUG was to also detect:
>
> > - some frag in the frag_list has a linear part that is NOT head_frag,
> >   or length not equal to the requested gso_size
>
> See [0], see skb_is_nonlinear_equal_frags() there
>
> (Note that your current suggestion implements the "some frag in the
>  frag_list has a linear part that is NOT head_frag" condition, but not
>  "length not equal to the requested gso_size")
>
> As a response, Willem suggested:
>
> > My suggestion only tested the first frag_skb length. If a list can be
> > created where the first frag_skb is head_frag but a later one is not,
> > it will fail short. I kind of doubt that.
>
> See [1]
>
> So we eventually settled on testing just
>   !list_skb->head_frag && skb_headlen(list_skb)
> and not every skb on the frag_list.
>
> Maybe Willem can elaborate on that.

Clearly I was wrong :)

If a device has different allocation strategies depending on packet
size, and GRO coalesces those using a list, then indeed this does not
have to hold. GRO requires the same packet size and thus allocation
strategy to coalesce -- except for the last segment.

I don't see any allocation in vmxnet3 that uses a head frag, though.
There is a small packet path (rxDataRingUsed), but both small and
large allocate using a kmalloc-backed skb->data as far as I can tell.

In any case, iterating over all frags is more robust. This is an edge
case, fine to incur the cost there.
Jiri Benc Oct. 31, 2022, 3:54 p.m. UTC | #4
On Fri, 28 Oct 2022 21:41:23 -0700, Jakub Kicinski wrote:
> On Thu, 27 Oct 2022 10:20:56 +0200 Jiri Benc wrote:
> > It turns out this assumption does not hold. We've seen BUG_ON being hit
> > in skb_segment when skbs on the frag_list had differing head_frag. That
> > particular case was with vmxnet3; looking at the driver, it indeed uses
> > different skb allocation strategies based on the packet size.  
> 
> Where are you looking? I'm not seeing it TBH.

Looking at the code again, I think I misread it.

> I don't think the driver is that important, tho, __napi_alloc_skb() 
> will select page backing or kmalloc, all by itself.

In this case, it was __netdev_alloc_skb. I'll fix the description.

Thanks!

 Jiri
Jiri Benc Oct. 31, 2022, 4:52 p.m. UTC | #5
On Sat, 29 Oct 2022 10:10:03 -0400, Willem de Bruijn wrote:
> If a device has different allocation strategies depending on packet
> size, and GRO coalesces those using a list, then indeed this does not
> have to hold. GRO requires the same packet size and thus allocation
> strategy to coalesce -- except for the last segment.

That's exactly what I saw: the last segment was different.

However, I don't see anything in the GRO code that enforces that. It
appears that currently, it just usually happens that way. When there's
a burst of packets for the given flow on the wire, only the last
segment is small (and thus kmalloced) and there's no immediate packet
following for the same flow. What would happen if (for whatever reason)
there was such a packet following?

> I don't see any allocation in vmxnet3 that uses a head frag, though.
> There is a small packet path (rxDataRingUsed), but both small and
> large allocate using a kmalloc-backed skb->data as far as I can tell.

I believe the logic is that for rxDataRingUsed,
netdev_alloc_skb_ip_align is called to allocate an skb to copy the data
into, passing it the actual packet length. If that length is small
enough, __netdev_alloc_skb will kmalloc the data. However, for
!rxDataRingUsed, the skb for the DMA buffer is allocated with a larger
length and __netdev_alloc_skb will use page_frag_alloc.

I admit I've not spent that much time understanding the logic in the
driver. I was satisfied when the perceived logic matched what I saw in
the kernel memory dump. I may have easily missed something, such as
Jakub's point that it's not actually the driver deciding on the
allocation strategy but rather __netdev_alloc_skb on its own. But the
outcome still seems to be that small packets are kmalloced, while
larger packets are page backed. Am I wrong?

> In any case, iterating over all frags is more robust. This is an edge
> case, fine to incur the cost there.

Thanks! We might get a minor speedup if we check only the last segment;
but first, I'd like to be proven wrong about GRO not enforcing this.
Plus, I wonder whether the speedup would be measurable if we have to
iterate through the list to find the last segment anyway.

Unless there are objections or clarifications (and unless I'm wrong
above), I'll send a v2 with the commit message corrected and with the
same code.

Thanks to all!

 Jiri
Willem de Bruijn Oct. 31, 2022, 9:16 p.m. UTC | #6
On Mon, Oct 31, 2022 at 12:53 PM Jiri Benc <jbenc@redhat.com> wrote:
>
> On Sat, 29 Oct 2022 10:10:03 -0400, Willem de Bruijn wrote:
> > If a device has different allocation strategies depending on packet
> > size, and GRO coalesces those using a list, then indeed this does not
> > have to hold. GRO requires the same packet size and thus allocation
> > strategy to coalesce -- except for the last segment.
>
> That's exactly what I saw: the last segment was different.
>
> However, I don't see anything in the GRO code that enforces that. It
> appears that currently, it just usually happens that way. When there's
> a burst of packets for the given flow on the wire, only the last
> segment is small (and thus malloced) and there's no immediate packet
> following for the same flow. What would happen if (for whatever reason)
> there was such packet following?

This is enforced. In tcp_gro_receive:

        /* If skb is a GRO packet, make sure its gso_size matches prior
         * packet mss.
         * If it is a single frame, do not aggregate it if its length
         * is bigger than our mss.
         */
        if (unlikely(skb_is_gso(skb)))
                flush |= (mss != skb_shinfo(skb)->gso_size);
        else
                flush |= (len - 1) >= mss;

        [..]

        /* Force a flush if last segment is smaller than mss. */
        if (unlikely(skb_is_gso(skb)))
                flush = len != NAPI_GRO_CB(skb)->count * skb_shinfo(skb)->gso_size;
        else
                flush = len < mss;

That branch and the comment are very new, introduced with GRO support
for HW-GRO packets. But the else clauses are not new.

>
> > I don't see any allocation in vmxnet3 that uses a head frag, though.
> > There is a small packet path (rxDataRingUsed), but both small and
> > large allocate using a kmalloc-backed skb->data as far as I can tell.
>
> I believe the logic is that for rxDataRingUsed,
> netdev_alloc_skb_ip_align is called to alloc skb to copy data into,
> passing to it the actual packet length. If it's small enough,
> __netdev_alloc_skb will kmalloc the data. However, for !rxDataRingUsed,
> the skb for dma buffer is allocated with a larger length and
> __netdev_alloc_skb will use page_frag_alloc.

That explains perfectly.

> I admit I've not spent that much time understanding the logic in the
> driver. I was satisfied when the perceived logic matched what I saw in
> the kernel memory dump. I may have easily missed something, such as
> Jakub's point that it's not actually the driver deciding on the
> allocation strategy but rather __netdev_alloc_skb on its own. But the
> outcome still seems to be that small packets are kmalloced, while
> larger packets are page backed. Am I wrong?
>
> > In any case, iterating over all frags is more robust. This is an edge
> > case, fine to incur the cost there.
>
> Thanks! We might get a minor speedup if we check only the last segment;
> but first, I'd like to be proven wrong about GRO not enforcing this.
> Plus, I wonder whether the speedup would be measurable if we have to
> iterate through the list to find the last segment anyway.
>
> Unless there are objections or clarifications (and unless I'm wrong
> above), I'll send a v2 with the commit message corrected and with the
> same code.

Sounds good to me, thanks.

>
> Thanks to all!
>
>  Jiri
>

Patch

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 1d9719e72f9d..bbf3acff44c6 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4134,23 +4134,25 @@  struct sk_buff *skb_segment(struct sk_buff *head_skb,
 	int i = 0;
 	int pos;
 
-	if (list_skb && !list_skb->head_frag && skb_headlen(list_skb) &&
-	    (skb_shinfo(head_skb)->gso_type & SKB_GSO_DODGY)) {
-		/* gso_size is untrusted, and we have a frag_list with a linear
-		 * non head_frag head.
-		 *
-		 * (we assume checking the first list_skb member suffices;
-		 * i.e if either of the list_skb members have non head_frag
-		 * head, then the first one has too).
-		 *
-		 * If head_skb's headlen does not fit requested gso_size, it
-		 * means that the frag_list members do NOT terminate on exact
-		 * gso_size boundaries. Hence we cannot perform skb_frag_t page
-		 * sharing. Therefore we must fallback to copying the frag_list
-		 * skbs; we do so by disabling SG.
-		 */
-		if (mss != GSO_BY_FRAGS && mss != skb_headlen(head_skb))
-			features &= ~NETIF_F_SG;
+	if ((skb_shinfo(head_skb)->gso_type & SKB_GSO_DODGY) &&
+	    mss != GSO_BY_FRAGS && mss != skb_headlen(head_skb)) {
+		struct sk_buff *check_skb;
+
+		for (check_skb = list_skb; check_skb; check_skb = check_skb->next) {
+			if (skb_headlen(check_skb) && !check_skb->head_frag) {
+				/* gso_size is untrusted, and we have a frag_list with
+				 * a linear non head_frag item.
+				 *
+				 * If head_skb's headlen does not fit requested gso_size,
+				 * it means that the frag_list members do NOT terminate
+				 * on exact gso_size boundaries. Hence we cannot perform
+				 * skb_frag_t page sharing. Therefore we must fallback to
+				 * copying the frag_list skbs; we do so by disabling SG.
+				 */
+				features &= ~NETIF_F_SG;
+				break;
+			}
+		}
 	}
 
 	__skb_push(head_skb, doffset);