Message ID | 559cea869928e169240d74c386735f3f95beca32.1666858629.git.jbenc@redhat.com |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Series | [net] net: gso: fix panic on frag_list with mixed head alloc types |
On Thu, 27 Oct 2022 10:20:56 +0200 Jiri Benc wrote:
> Since commit 3dcbdb134f32 ("net: gso: Fix skb_segment splat when
> splitting gso_size mangled skb having linear-headed frag_list"), it is
> allowed to change gso_size of a GRO packet. However, that commit assumes
> that "checking the first list_skb member suffices; i.e if either of the
> list_skb members have non head_frag head, then the first one has too".
>
> It turns out this assumption does not hold. We've seen BUG_ON being hit
> in skb_segment when skbs on the frag_list had differing head_frag. That
> particular case was with vmxnet3; looking at the driver, it indeed uses
> different skb allocation strategies based on the packet size.

Where are you looking? I'm not seeing it TBH.

I don't think the driver is that important, tho, __napi_alloc_skb()
will select page backing or kmalloc, all by itself.

The patch LGTM, adding more CCs in case I'm missing something.

> The last packet in frag_list can thus be kmalloced if it is
> sufficiently small. And there's nothing preventing drivers from
> mixing things even more freely.
>
> There are three different locations where this can be fixed:
>
> (1) We could check head_frag in GRO and not allow GROing skbs with
>     different head_frag. However, that would lead to performance
>     regression (at least on vmxnet3) on normal forward paths with
>     unmodified gso_size, where mixed head_frag is not a problem.
>
> (2) Set a flag in bpf_skb_net_grow and bpf_skb_net_shrink indicating
>     that NETIF_F_SG is undesirable. That would need to eat a bit in
>     sk_buff. Furthermore, that flag can be unset when all skbs on the
>     frag_list are page backed. To retain good performance,
>     bpf_skb_net_grow/shrink would have to walk the frag_list.
>
> (3) Walk the frag_list in skb_segment when determining whether
>     NETIF_F_SG should be cleared. This of course slows things down.
>
> This patch implements (3). To limit the performance impact in
> skb_segment, the list is walked only for skbs with SKB_GSO_DODGY set
> that have gso_size changed. Normal paths thus will not hit it.
>
> Fixes: 3dcbdb134f32 ("net: gso: Fix skb_segment splat when splitting
> gso_size mangled skb having linear-headed frag_list")
> Signed-off-by: Jiri Benc <jbenc@redhat.com>
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 1d9719e72f9d..bbf3acff44c6 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -4134,23 +4134,25 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>  	int i = 0;
>  	int pos;
>  
> -	if (list_skb && !list_skb->head_frag && skb_headlen(list_skb) &&
> -	    (skb_shinfo(head_skb)->gso_type & SKB_GSO_DODGY)) {
> -		/* gso_size is untrusted, and we have a frag_list with a linear
> -		 * non head_frag head.
> -		 *
> -		 * (we assume checking the first list_skb member suffices;
> -		 * i.e if either of the list_skb members have non head_frag
> -		 * head, then the first one has too).
> -		 *
> -		 * If head_skb's headlen does not fit requested gso_size, it
> -		 * means that the frag_list members do NOT terminate on exact
> -		 * gso_size boundaries. Hence we cannot perform skb_frag_t page
> -		 * sharing. Therefore we must fallback to copying the frag_list
> -		 * skbs; we do so by disabling SG.
> -		 */
> -		if (mss != GSO_BY_FRAGS && mss != skb_headlen(head_skb))
> -			features &= ~NETIF_F_SG;
> +	if ((skb_shinfo(head_skb)->gso_type & SKB_GSO_DODGY) &&
> +	    mss != GSO_BY_FRAGS && mss != skb_headlen(head_skb)) {
> +		struct sk_buff *check_skb;
> +
> +		for (check_skb = list_skb; check_skb; check_skb = check_skb->next) {
> +			if (skb_headlen(check_skb) && !check_skb->head_frag) {
> +				/* gso_size is untrusted, and we have a frag_list with
> +				 * a linear non head_frag item.
> +				 *
> +				 * If head_skb's headlen does not fit requested gso_size,
> +				 * it means that the frag_list members do NOT terminate
> +				 * on exact gso_size boundaries. Hence we cannot perform
> +				 * skb_frag_t page sharing. Therefore we must fallback to
> +				 * copying the frag_list skbs; we do so by disabling SG.
> +				 */
> +				features &= ~NETIF_F_SG;
> +				break;
> +			}
> +		}
>  	}
>  
>  	__skb_push(head_skb, doffset);
On Thu, 27 Oct 2022 10:20:56 +0200
Jiri Benc <jbenc@redhat.com> wrote:

> It turns out this assumption does not hold. We've seen BUG_ON being hit
> in skb_segment when skbs on the frag_list had differing head_frag. That
> particular case was with vmxnet3; looking at the driver, it indeed uses
> different skb allocation strategies based on the packet size. The last
> packet in frag_list can thus be kmalloced if it is sufficiently small.
> And there's nothing preventing drivers from mixing things even more
> freely.

Hi Jiri,

One of my early attempts to fix the original BUG was to also detect:

> - some frag in the frag_list has a linear part that is NOT head_frag,
>   or length not equal to the requested gso_size

See [0], see skb_is_nonlinear_equal_frags() there.

(Note that your current suggestion implements the "some frag in the
frag_list has a linear part that is NOT head_frag" condition, but not
"length not equal to the requested gso_size".)

As a response, Willem suggested:

> My suggestion only tested the first frag_skb length. If a list can be
> created where the first frag_skb is head_frag but a later one is not,
> it will fail short. I kind of doubt that.

See [1]

So we eventually concluded testing just

  !list_skb->head_frag && skb_headlen(list_skb)

and not every frag in frag_list.

Maybe Willem can elaborate on that.

[0] https://lore.kernel.org/netdev/20190903185121.56906d31@pixies/
[1] https://lore.kernel.org/netdev/CA+FuTScE=pyopY=3f5E4JGx1zyGqT+XS+8ss13UN4if4TZ2NbA@mail.gmail.com/
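
As a side note, a minimal sketch of the stricter per-frag check described in
the mail above; this is an illustration of the two conditions as quoted, not
the actual skb_is_nonlinear_equal_frags() code from [0], and the helper name
and the reading of "length" as the linear length are assumptions:

	#include <linux/skbuff.h>

	/* Illustration only: walk the frag_list and flag any member whose
	 * linear part is either not head_frag or not exactly gso_size long.
	 */
	static bool frag_list_needs_copy(struct sk_buff *head_skb,
					 unsigned int gso_size)
	{
		struct sk_buff *iter;

		for (iter = skb_shinfo(head_skb)->frag_list; iter; iter = iter->next) {
			unsigned int headlen = skb_headlen(iter);

			if (headlen && (!iter->head_frag || headlen != gso_size))
				return true;
		}
		return false;
	}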
On Sat, Oct 29, 2022 at 3:41 AM Shmulik Ladkani <shmulik.ladkani@gmail.com> wrote:
>
> On Thu, 27 Oct 2022 10:20:56 +0200
> Jiri Benc <jbenc@redhat.com> wrote:
>
> > It turns out this assumption does not hold. We've seen BUG_ON being hit
> > in skb_segment when skbs on the frag_list had differing head_frag. That
> > particular case was with vmxnet3; looking at the driver, it indeed uses
> > different skb allocation strategies based on the packet size. The last
> > packet in frag_list can thus be kmalloced if it is sufficiently small.
> > And there's nothing preventing drivers from mixing things even more
> > freely.
>
> Hi Jiri,
>
> One of my early attempts to fix the original BUG was to also detect:
>
> > - some frag in the frag_list has a linear part that is NOT head_frag,
> >   or length not equal to the requested gso_size
>
> See [0], see skb_is_nonlinear_equal_frags() there
>
> (Note that your current suggestion implements the "some frag in the
> frag_list has a linear part that is NOT head_frag" condition, but not
> "length not equal to the requested gso_size")
>
> As a response, Willem suggested:
>
> > My suggestion only tested the first frag_skb length. If a list can be
> > created where the first frag_skb is head_frag but a later one is not,
> > it will fail short. I kind of doubt that.
>
> See [1]
>
> So we eventually concluded testing just
> !list_skb->head_frag && skb_headlen(list_skb)
> and not every frag in frag_list.
>
> Maybe Willem can elaborate on that.

Clearly I was wrong :)

If a device has different allocation strategies depending on packet
size, and GRO coalesces those using a list, then indeed this does not
have to hold. GRO requires the same packet size and thus allocation
strategy to coalesce -- except for the last segment.

I don't see any allocation in vmxnet3 that uses a head frag, though.
There is a small packet path (rxDataRingUsed), but both small and
large allocate using a kmalloc-backed skb->data as far as I can tell.

In any case, iterating over all frags is more robust. This is an edge
case, fine to incur the cost there.
On Fri, 28 Oct 2022 21:41:23 -0700, Jakub Kicinski wrote:
> On Thu, 27 Oct 2022 10:20:56 +0200 Jiri Benc wrote:
> > It turns out this assumption does not hold. We've seen BUG_ON being hit
> > in skb_segment when skbs on the frag_list had differing head_frag. That
> > particular case was with vmxnet3; looking at the driver, it indeed uses
> > different skb allocation strategies based on the packet size.
>
> Where are you looking? I'm not seeing it TBH.

Looking at the code again, I think I misread it.

> I don't think the driver is that important, tho, __napi_alloc_skb()
> will select page backing or kmalloc, all by itself.

In this case, it was __netdev_alloc_skb. I'll fix the description.

Thanks!

 Jiri
On Sat, 29 Oct 2022 10:10:03 -0400, Willem de Bruijn wrote:
> If a device has different allocation strategies depending on packet
> size, and GRO coalesces those using a list, then indeed this does not
> have to hold. GRO requires the same packet size and thus allocation
> strategy to coalesce -- except for the last segment.

That's exactly what I saw: the last segment was different.

However, I don't see anything in the GRO code that enforces that. It
appears that currently, it just usually happens that way. When there's
a burst of packets for the given flow on the wire, only the last
segment is small (and thus kmalloced) and there's no immediate packet
following for the same flow. What would happen if (for whatever reason)
there was such a packet following?

> I don't see any allocation in vmxnet3 that uses a head frag, though.
> There is a small packet path (rxDataRingUsed), but both small and
> large allocate using a kmalloc-backed skb->data as far as I can tell.

I believe the logic is that for rxDataRingUsed,
netdev_alloc_skb_ip_align is called to alloc an skb to copy data into,
passing to it the actual packet length. If it's small enough,
__netdev_alloc_skb will kmalloc the data. However, for !rxDataRingUsed,
the skb for the DMA buffer is allocated with a larger length and
__netdev_alloc_skb will use page_frag_alloc.

I admit I've not spent that much time understanding the logic in the
driver. I was satisfied when the perceived logic matched what I saw in
the kernel memory dump. I may have easily missed something, such as
Jakub's point that it's not actually the driver deciding on the
allocation strategy but rather __netdev_alloc_skb on its own. But the
outcome still seems to be that small packets are kmalloced, while
larger packets are page backed. Am I wrong?

> In any case, iterating over all frags is more robust. This is an edge
> case, fine to incur the cost there.

Thanks! We might get a minor speedup if we check only the last segment;
but first, I'd like to be proven wrong about GRO not enforcing this.
Plus, I wonder whether the speedup would be measurable if we have to
iterate through the list to find the last segment anyway.

Unless there are objections or clarifications (and unless I'm wrong
above), I'll send a v2 with the commit message corrected and with the
same code.

Thanks to all!

 Jiri
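
For reference, a minimal, standalone sketch of the size-based choice of head
backing described above; the threshold values, the overhead constant and the
helper name are illustrative assumptions, not verbatim __netdev_alloc_skb()
source:

	/* Illustration only: small (or oversized) requests end up with a
	 * kmalloc-backed skb->data (head_frag == 0), mid-sized requests get a
	 * page-frag-backed head (head_frag == 1).
	 */
	#include <stdbool.h>
	#include <stddef.h>
	#include <stdio.h>

	#define PAGE_SIZE 4096
	/* stand-in for SKB_WITH_OVERHEAD(): usable bytes after skb_shared_info */
	#define SKB_WITH_OVERHEAD(x) ((x) - 320)

	static bool head_is_kmalloced(size_t len)
	{
		return len <= SKB_WITH_OVERHEAD(1024) ||   /* small packets */
		       len > SKB_WITH_OVERHEAD(PAGE_SIZE); /* too big for a frag */
	}

	int main(void)
	{
		/* a short last GRO segment vs. a full-size DMA buffer */
		printf("64 bytes   -> kmalloced: %d\n", (int)head_is_kmalloced(64));
		printf("2048 bytes -> kmalloced: %d\n", (int)head_is_kmalloced(2048));
		return 0;
	}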
On Mon, Oct 31, 2022 at 12:53 PM Jiri Benc <jbenc@redhat.com> wrote:
>
> On Sat, 29 Oct 2022 10:10:03 -0400, Willem de Bruijn wrote:
> > If a device has different allocation strategies depending on packet
> > size, and GRO coalesces those using a list, then indeed this does not
> > have to hold. GRO requires the same packet size and thus allocation
> > strategy to coalesce -- except for the last segment.
>
> That's exactly what I saw: the last segment was different.
>
> However, I don't see anything in the GRO code that enforces that. It
> appears that currently, it just usually happens that way. When there's
> a burst of packets for the given flow on the wire, only the last
> segment is small (and thus kmalloced) and there's no immediate packet
> following for the same flow. What would happen if (for whatever reason)
> there was such a packet following?

This is enforced. In tcp_gro_receive:

        /* If skb is a GRO packet, make sure its gso_size matches prior packet mss.
         * If it is a single frame, do not aggregate it if its length
         * is bigger than our mss.
         */
        if (unlikely(skb_is_gso(skb)))
                flush |= (mss != skb_shinfo(skb)->gso_size);
        else
                flush |= (len - 1) >= mss;

        [..]

        /* Force a flush if last segment is smaller than mss. */
        if (unlikely(skb_is_gso(skb)))
                flush = len != NAPI_GRO_CB(skb)->count * skb_shinfo(skb)->gso_size;
        else
                flush = len < mss;

That branch and the comment are very new, introduced with GRO support
for HW-GRO packets. But the else clauses are not new.

> > I don't see any allocation in vmxnet3 that uses a head frag, though.
> > There is a small packet path (rxDataRingUsed), but both small and
> > large allocate using a kmalloc-backed skb->data as far as I can tell.
>
> I believe the logic is that for rxDataRingUsed,
> netdev_alloc_skb_ip_align is called to alloc an skb to copy data into,
> passing to it the actual packet length. If it's small enough,
> __netdev_alloc_skb will kmalloc the data. However, for !rxDataRingUsed,
> the skb for the DMA buffer is allocated with a larger length and
> __netdev_alloc_skb will use page_frag_alloc.

That explains it perfectly.

> I admit I've not spent that much time understanding the logic in the
> driver. I was satisfied when the perceived logic matched what I saw in
> the kernel memory dump. I may have easily missed something, such as
> Jakub's point that it's not actually the driver deciding on the
> allocation strategy but rather __netdev_alloc_skb on its own. But the
> outcome still seems to be that small packets are kmalloced, while
> larger packets are page backed. Am I wrong?
>
> > In any case, iterating over all frags is more robust. This is an edge
> > case, fine to incur the cost there.
>
> Thanks! We might get a minor speedup if we check only the last segment;
> but first, I'd like to be proven wrong about GRO not enforcing this.
> Plus, I wonder whether the speedup would be measurable if we have to
> iterate through the list to find the last segment anyway.
>
> Unless there are objections or clarifications (and unless I'm wrong
> above), I'll send a v2 with the commit message corrected and with the
> same code.

Sounds good to me, thanks.

> Thanks to all!
>
>  Jiri
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 1d9719e72f9d..bbf3acff44c6 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4134,23 +4134,25 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 	int i = 0;
 	int pos;
 
-	if (list_skb && !list_skb->head_frag && skb_headlen(list_skb) &&
-	    (skb_shinfo(head_skb)->gso_type & SKB_GSO_DODGY)) {
-		/* gso_size is untrusted, and we have a frag_list with a linear
-		 * non head_frag head.
-		 *
-		 * (we assume checking the first list_skb member suffices;
-		 * i.e if either of the list_skb members have non head_frag
-		 * head, then the first one has too).
-		 *
-		 * If head_skb's headlen does not fit requested gso_size, it
-		 * means that the frag_list members do NOT terminate on exact
-		 * gso_size boundaries. Hence we cannot perform skb_frag_t page
-		 * sharing. Therefore we must fallback to copying the frag_list
-		 * skbs; we do so by disabling SG.
-		 */
-		if (mss != GSO_BY_FRAGS && mss != skb_headlen(head_skb))
-			features &= ~NETIF_F_SG;
+	if ((skb_shinfo(head_skb)->gso_type & SKB_GSO_DODGY) &&
+	    mss != GSO_BY_FRAGS && mss != skb_headlen(head_skb)) {
+		struct sk_buff *check_skb;
+
+		for (check_skb = list_skb; check_skb; check_skb = check_skb->next) {
+			if (skb_headlen(check_skb) && !check_skb->head_frag) {
+				/* gso_size is untrusted, and we have a frag_list with
+				 * a linear non head_frag item.
+				 *
+				 * If head_skb's headlen does not fit requested gso_size,
+				 * it means that the frag_list members do NOT terminate
+				 * on exact gso_size boundaries. Hence we cannot perform
+				 * skb_frag_t page sharing. Therefore we must fallback to
+				 * copying the frag_list skbs; we do so by disabling SG.
+				 */
+				features &= ~NETIF_F_SG;
+				break;
+			}
+		}
 	}
 
 	__skb_push(head_skb, doffset);
Since commit 3dcbdb134f32 ("net: gso: Fix skb_segment splat when
splitting gso_size mangled skb having linear-headed frag_list"), it is
allowed to change gso_size of a GRO packet. However, that commit assumes
that "checking the first list_skb member suffices; i.e if either of the
list_skb members have non head_frag head, then the first one has too".

It turns out this assumption does not hold. We've seen BUG_ON being hit
in skb_segment when skbs on the frag_list had differing head_frag. That
particular case was with vmxnet3; looking at the driver, it indeed uses
different skb allocation strategies based on the packet size. The last
packet in frag_list can thus be kmalloced if it is sufficiently small.
And there's nothing preventing drivers from mixing things even more
freely.

There are three different locations where this can be fixed:

(1) We could check head_frag in GRO and not allow GROing skbs with
    different head_frag. However, that would lead to performance
    regression (at least on vmxnet3) on normal forward paths with
    unmodified gso_size, where mixed head_frag is not a problem.

(2) Set a flag in bpf_skb_net_grow and bpf_skb_net_shrink indicating
    that NETIF_F_SG is undesirable. That would need to eat a bit in
    sk_buff. Furthermore, that flag can be unset when all skbs on the
    frag_list are page backed. To retain good performance,
    bpf_skb_net_grow/shrink would have to walk the frag_list.

(3) Walk the frag_list in skb_segment when determining whether
    NETIF_F_SG should be cleared. This of course slows things down.

This patch implements (3). To limit the performance impact in
skb_segment, the list is walked only for skbs with SKB_GSO_DODGY set
that have gso_size changed. Normal paths thus will not hit it.

Fixes: 3dcbdb134f32 ("net: gso: Fix skb_segment splat when splitting
gso_size mangled skb having linear-headed frag_list")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
---
 net/core/skbuff.c | 36 +++++++++++++++++++----------------
 1 file changed, 19 insertions(+), 17 deletions(-)
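
For context, a minimal tc-BPF sketch of the kind of gso_size mangling the
walk above guards against: growing the network header with
bpf_skb_adjust_room() shrinks gso_size on a GSO skb and marks it
SKB_GSO_DODGY. The section name and the 8-byte delta are illustrative
assumptions, not taken from this thread:

	// Illustration only: a tc-BPF program that makes an skb's gso_size
	// untrusted by growing the network header.
	#include <linux/bpf.h>
	#include <linux/pkt_cls.h>
	#include <bpf/bpf_helpers.h>

	SEC("tc")
	int grow_net_header(struct __sk_buff *skb)
	{
		/* reserve 8 bytes below the network header, e.g. for encapsulation;
		 * for GSO TCP skbs this decreases gso_size and sets SKB_GSO_DODGY
		 */
		if (bpf_skb_adjust_room(skb, 8, BPF_ADJ_ROOM_NET, 0))
			return TC_ACT_SHOT;

		return TC_ACT_OK;
	}

	char _license[] SEC("license") = "GPL";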