Message ID | 20230828233210.36532-1-mkhalfella@purestorage.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | skbuff: skb_segment, Update nfrags after calling zero copy functions | expand |
Mohamed Khalfella wrote: > We have seen kernel panic with stacktrace below. This is 5.15.123 > LTS kernel running a qemu VM with virtio network interface and > vhost=on. When enabling packet corruption, with command below, we see > the kernel panic. > > tc qdisc add dev eth2 root netem corrupt 0.7065335155074846% > > [ 193.894380] BUG: kernel NULL pointer dereference, address: 00000000000000bc > [ 193.894635] #PF: supervisor read access in kernel mode > [ 193.894828] #PF: error_code(0x0000) - not-present page > [ 193.895027] PGD 0 P4D 0 > [ 193.895140] Oops: 0000 [#1] SMP > [ 193.895273] CPU: 13 PID: 18164 Comm: vh-net-17428 Kdump: loaded Tainted: G O 5.15.123+ #26 > [ 193.895602] Hardware name: > [ 193.903919] RIP: 0010:skb_segment+0xb0e/0x12f0 > [ 193.908176] Code: 45 a8 50 e8 54 46 be ff 44 8b 5d 80 41 01 c7 48 83 c4 18 44 89 7d b4 44 3b 5d b0 0f 8c f0 00 00 00 4d 85 e4 0f 84 65 07 00 00 <45> 8b b4 24 bc 00 00 00 48 8b b5 70 ff ff ff 4d 03 b4 24 c0 00 00 > [ 193.921099] RSP: 0018:ffffc9002bf2f770 EFLAGS: 00010282 > [ 193.925552] RAX: 00000000000000ee RBX: ffff88baab5092c8 RCX: 0000000000000003 > [ 193.934308] RDX: 0000000000000000 RSI: 00000000fff7ffff RDI: 0000000000000001 > [ 193.943281] RBP: ffffc9002bf2f850 R08: 0000000000000000 R09: c0000000fff7ffff > [ 193.952658] R10: 0000000000000029 R11: ffffc9002bf2f310 R12: 0000000000000000 > [ 193.962423] R13: ffff88abc2291e00 R14: ffff88abc2291d00 R15: ffff88abc2290600 > [ 193.972593] FS: 0000000000000000(0000) GS:ffff88c07f840000(0000) knlGS:0000000000000000 > [ 193.983302] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 193.988891] CR2: 00000000000000bc CR3: 0000001dccb4b006 CR4: 00000000001726e0 > [ 193.999925] MSR 198h IA32 perf status 0x00001ba000001a00 > [ 194.005488] MSR 19Ch IA32 thermal status 0x0000000088370000 > [ 194.010983] MSR 1B1h IA32 package thermal status 0x0000000088330000 > [ 194.021892] Call Trace: > [ 194.027422] <TASK> > [ 194.032838] ? __die_body+0x1a/0x60 > [ 194.038172] ? page_fault_oops+0x12d/0x4d0 > [ 194.043395] ? skb_segment+0xb0e/0x12f0 > [ 194.048501] ? search_bpf_extables+0x59/0x60 > [ 194.053547] ? fixup_exception+0x1d/0x250 > [ 194.058537] ? exc_page_fault+0x67/0x140 > [ 194.063382] ? asm_exc_page_fault+0x1f/0x30 > [ 194.068171] ? skb_segment+0xb0e/0x12f0 > [ 194.072861] tcp_gso_segment+0x107/0x540 > [ 194.077507] ? sk_common_release+0xe0/0xe0 > [ 194.082031] inet_gso_segment+0x15c/0x3d0 > [ 194.086441] ? __skb_get_hash_symmetric+0x190/0x190 > [ 194.090783] skb_mac_gso_segment+0x9f/0x110 > [ 194.095016] __skb_gso_segment+0xc1/0x190 > [ 194.099124] ? netif_skb_features+0xb5/0x280 > [ 194.103131] netem_enqueue+0x290/0xb10 [sch_netem] > [ 194.107071] dev_qdisc_enqueue+0x16/0x70 > [ 194.110884] __dev_queue_xmit+0x63b/0xb30 > [ 194.114569] ? inet_gso_segment+0x15c/0x3d0 > [ 194.118160] ? bond_start_xmit+0x159/0x380 [bonding] > [ 194.121670] bond_start_xmit+0x159/0x380 [bonding] > [ 194.125101] ? skb_mac_gso_segment+0xa7/0x110 > [ 194.128506] dev_hard_start_xmit+0xc3/0x1e0 > [ 194.131787] __dev_queue_xmit+0x8a0/0xb30 > [ 194.134977] ? macvlan_start_xmit+0x4f/0x100 [macvlan] > [ 194.138225] macvlan_start_xmit+0x4f/0x100 [macvlan] > [ 194.141477] dev_hard_start_xmit+0xc3/0x1e0 > [ 194.144622] sch_direct_xmit+0xe3/0x280 > [ 194.147748] __dev_queue_xmit+0x54a/0xb30 > [ 194.150924] ? tap_get_user+0x2a8/0x9c0 [tap] > [ 194.154131] tap_get_user+0x2a8/0x9c0 [tap] > [ 194.157358] tap_sendmsg+0x52/0x8e0 [tap] > [ 194.160565] ? get_tx_bufs+0x42/0x1d0 [vhost_net] > [ 194.163815] ? get_tx_bufs+0x16a/0x1d0 [vhost_net] > [ 194.167049] handle_tx_zerocopy+0x14e/0x4c0 [vhost_net] > [ 194.170351] ? add_range+0x11/0x30 > [ 194.173631] handle_tx+0xcd/0xe0 [vhost_net] > [ 194.176959] vhost_worker+0x76/0xb0 [vhost] > [ 194.180299] ? vhost_flush_work+0x10/0x10 [vhost] > [ 194.183667] kthread+0x118/0x140 > [ 194.187007] ? set_kthread_struct+0x40/0x40 > [ 194.190358] ret_from_fork+0x1f/0x30 > [ 194.193670] </TASK> > > I have narrowed the issue down to these lines in skb_segment() > > 4247 if (skb_orphan_frags(frag_skb, GFP_ATOMIC) || > 4248 skb_zerocopy_clone(nskb, frag_skb, GFP_ATOMIC)) > 4249 goto err; > > It is possible for skb_orphan_frags() or skb_zerocopy_clone() to update > `nr_frags` as both functions may call skb_copy_ubufs(). If `nr_frags` > gets updated, the local copy in `nfrags` might end up stale and cause > this panic. In particular it is possible the while loop below hits > `i >= nrfrags` prematurely and tries to move to next `frag_skb` by using > `list_skb`. If `list_skb` is NULL then we hit the panic above. > > The naive way to fix this is to update `nfrags` as shown in this patch. > This way we get the correct number of fragments and while loop can > find all fragments without needing to move to next skbuff. > > I wanted to share this with the list and see what people think before > submitting a patch. Specially that I have not tested this on master > branch. > > Fixes: bf5c25d60861 ("skbuff: in skb_segment, call zerocopy functions once per nskb") > Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com> > --- > net/core/skbuff.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index a298992060e6..864cc8ad1969 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -4567,6 +4567,8 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb, > if (skb_orphan_frags(frag_skb, GFP_ATOMIC) || > skb_zerocopy_clone(nskb, frag_skb, GFP_ATOMIC)) > goto err; > + /* Update nfrags in case skb_copy_ubufs() updates nr_frags */ > + nfrags = skb_shinfo(frag_skb)->nr_frags; > > while (pos < offset + len) { > if (i >= nfrags) { > @@ -4587,6 +4589,8 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb, > skb_zerocopy_clone(nskb, frag_skb, > GFP_ATOMIC)) > goto err; > + /* Update nfrags in case skb_copy_ubufs() updates nr_frags */ > + nfrags = skb_shinfo(frag_skb)->nr_frags; > > list_skb = list_skb->next; > } In principle this makes sense to me. skb_copy_ubufs will create new frags, which do not map 1:1 on to the original frags. It originally allocated the number of order 0 pages required to store __skb_pagelen(skb). Recently it switched to using high order pages. The two calls to skb_orphan_frags and skb_zerocopy clone are called for each new segment skb. But only on skb_orphan_frags will it possibly call skb_copy_ubufs, an then always for the first call of the input skb (head_skb or a list_skb). From commit "skbuff: in skb_segment, call zerocopy functions once per nskb": "When segmenting skbs with user frags, either the user frags must be replaced with private copies and uarg released, or the uarg must have its refcount increased for each new skb. skb_orphan_frags does the first, except for cases that can handle reference counting. skb_zerocopy_clone then does the second." skb_copy_ubufs only called on the first check for an input simplifies. A possible realloc in the middle of processing a series of frags would be much harder. Small point: nfrags is not the only state that needs to be refreshed after a fags realloc, also frag. Thanks for the report. I'm traveling likely without internet until the weekend. Apologies if it takes a while for me to follow up.
On 2023-08-28 21:18:16 -0700, willemjdebruijn wrote: > Small point: nfrags is not the only state that needs to be refreshed > after a fags realloc, also frag. I am new to this code. Can you help me understand why frag needs to be updated too? My reading of this code is that frag points to frags array in shared info. As long as shared info pointer remain the same frag pointer should remain valid. Am I missing something? > > Thanks for the report. I'm traveling likely without internet until the > weekend. Apologies if it takes a while for me to follow up. No problem. Thanks for the quick response!
On Tue, Aug 29, 2023 at 8:50 AM Mohamed Khalfella <mkhalfella@purestorage.com> wrote: > > On 2023-08-28 21:18:16 -0700, willemjdebruijn wrote: > > Small point: nfrags is not the only state that needs to be refreshed > > after a fags realloc, also frag. > > I am new to this code. Can you help me understand why frag needs to be > updated too? My reading of this code is that frag points to frags array > in shared info. As long as shared info pointer remain the same frag > pointer should remain valid. > skb_copy_ubufs() could actually call skb_unclone() and thus skb->head could be re-allocated. I guess that if you run your patch (and a repro of the bug ?) with KASAN enabled kernel, you should see a possible use-after-free ? To force the skb_unclone() path, having a tcpdump catching all packets would be enough I think. > Am I missing something? > > > > Thanks for the report. I'm traveling likely without internet until the > > weekend. Apologies if it takes a while for me to follow up. > No problem. Thanks for the quick response!
On 2023-08-29 10:07:59 +0200, Eric Dumazet wrote: > On Tue, Aug 29, 2023 at 8:50 AM Mohamed Khalfella > <mkhalfella@purestorage.com> wrote: > > > > On 2023-08-28 21:18:16 -0700, willemjdebruijn wrote: > > > Small point: nfrags is not the only state that needs to be refreshed > > > after a fags realloc, also frag. > > > > I am new to this code. Can you help me understand why frag needs to be > > updated too? My reading of this code is that frag points to frags array > > in shared info. As long as shared info pointer remain the same frag > > pointer should remain valid. > > > > skb_copy_ubufs() could actually call skb_unclone() and thus skb->head > could be re-allocated. > > I guess that if you run your patch (and a repro of the bug ?) with > KASAN enabled kernel, you should see a possible use-after-free ? > > To force the skb_unclone() path, having a tcpdump catching all packets > would be enough I think. > Okay, I see it now. I have not tested this patch with tcpdump capturing packets at the same time. Also, during my testing I have not seen the value of skb->head changnig. Now you are mentioning it it, I will make sure to test with tcpdump running and see skb->head changing. Thank you for pointing that out. For frag, I guess something like frag = &skb_shinfo(list_skb)->frags[i]; should do the job. I have not tested it though. I will need to do more testing before posting updated patch.
On Tue, Aug 29, 2023 at 11:31 AM Mohamed Khalfella <mkhalfella@purestorage.com> wrote: > > On 2023-08-29 10:07:59 +0200, Eric Dumazet wrote: > > On Tue, Aug 29, 2023 at 8:50 AM Mohamed Khalfella > > <mkhalfella@purestorage.com> wrote: > > > > > > On 2023-08-28 21:18:16 -0700, willemjdebruijn wrote: > > > > Small point: nfrags is not the only state that needs to be refreshed > > > > after a fags realloc, also frag. > > > > > > I am new to this code. Can you help me understand why frag needs to be > > > updated too? My reading of this code is that frag points to frags array > > > in shared info. As long as shared info pointer remain the same frag > > > pointer should remain valid. > > > > > > > skb_copy_ubufs() could actually call skb_unclone() and thus skb->head > > could be re-allocated. > > > > I guess that if you run your patch (and a repro of the bug ?) with > > KASAN enabled kernel, you should see a possible use-after-free ? > > > > To force the skb_unclone() path, having a tcpdump catching all packets > > would be enough I think. > > > > Okay, I see it now. I have not tested this patch with tcpdump capturing > packets at the same time. Also, during my testing I have not seen the > value of skb->head changnig. Now you are mentioning it it, I will make > sure to test with tcpdump running and see skb->head changing. Thank you > for pointing that out. > > For frag, I guess something like frag = &skb_shinfo(list_skb)->frags[i]; > should do the job. I have not tested it though. I will need to do more > testing before posting updated patch. Another way to test this path for certain (without tcpdump having to race) is to add a temporary/debug patch like this one: diff --git a/net/core/skbuff.c b/net/core/skbuff.c index a298992060e6efdecb87c7ffc8290eafe330583f..20cc42be5e81cdca567515f2a886af4ada0fbe0a 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -1749,7 +1749,8 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask) int i, order, psize, new_frags; u32 d_off; - if (skb_shared(skb) || skb_unclone(skb, gfp_mask)) + if (skb_shared(skb) || + pskb_expand_head(skb, 0, 0, gfp_mask)) return -EINVAL; if (!num_frags) Note that this might catch other bugs :/
On 2023-08-29 12:09:15 +0200, Eric Dumazet wrote: > Another way to test this path for certain (without tcpdump having to race) > is to add a temporary/debug patch like this one: > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index a298992060e6efdecb87c7ffc8290eafe330583f..20cc42be5e81cdca567515f2a886af4ada0fbe0a > 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -1749,7 +1749,8 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask) > int i, order, psize, new_frags; > u32 d_off; > > - if (skb_shared(skb) || skb_unclone(skb, gfp_mask)) > + if (skb_shared(skb) || > + pskb_expand_head(skb, 0, 0, gfp_mask)) > return -EINVAL; > > if (!num_frags) > > Note that this might catch other bugs :/ I was not able to make it allocate a new frags by running tcpdump while reproing the problem. However, I was able to do it with your patch.
On Wed, Aug 30, 2023 at 12:24 AM Mohamed Khalfella <mkhalfella@purestorage.com> wrote: > > On 2023-08-29 12:09:15 +0200, Eric Dumazet wrote: > > Another way to test this path for certain (without tcpdump having to race) > > is to add a temporary/debug patch like this one: > > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > > index a298992060e6efdecb87c7ffc8290eafe330583f..20cc42be5e81cdca567515f2a886af4ada0fbe0a > > 100644 > > --- a/net/core/skbuff.c > > +++ b/net/core/skbuff.c > > @@ -1749,7 +1749,8 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask) > > int i, order, psize, new_frags; > > u32 d_off; > > > > - if (skb_shared(skb) || skb_unclone(skb, gfp_mask)) > > + if (skb_shared(skb) || > > + pskb_expand_head(skb, 0, 0, gfp_mask)) > > return -EINVAL; > > > > if (!num_frags) > > > > Note that this might catch other bugs :/ > > I was not able to make it allocate a new frags by running tcpdump while > reproing the problem. However, I was able to do it with your patch. I am glad this worked, and looking forward to a v2 of your patch, thanks !
diff --git a/net/core/skbuff.c b/net/core/skbuff.c index a298992060e6..864cc8ad1969 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -4567,6 +4567,8 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb, if (skb_orphan_frags(frag_skb, GFP_ATOMIC) || skb_zerocopy_clone(nskb, frag_skb, GFP_ATOMIC)) goto err; + /* Update nfrags in case skb_copy_ubufs() updates nr_frags */ + nfrags = skb_shinfo(frag_skb)->nr_frags; while (pos < offset + len) { if (i >= nfrags) { @@ -4587,6 +4589,8 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb, skb_zerocopy_clone(nskb, frag_skb, GFP_ATOMIC)) goto err; + /* Update nfrags in case skb_copy_ubufs() updates nr_frags */ + nfrags = skb_shinfo(frag_skb)->nr_frags; list_skb = list_skb->next; }
We have seen kernel panic with stacktrace below. This is 5.15.123 LTS kernel running a qemu VM with virtio network interface and vhost=on. When enabling packet corruption, with command below, we see the kernel panic. tc qdisc add dev eth2 root netem corrupt 0.7065335155074846% [ 193.894380] BUG: kernel NULL pointer dereference, address: 00000000000000bc [ 193.894635] #PF: supervisor read access in kernel mode [ 193.894828] #PF: error_code(0x0000) - not-present page [ 193.895027] PGD 0 P4D 0 [ 193.895140] Oops: 0000 [#1] SMP [ 193.895273] CPU: 13 PID: 18164 Comm: vh-net-17428 Kdump: loaded Tainted: G O 5.15.123+ #26 [ 193.895602] Hardware name: [ 193.903919] RIP: 0010:skb_segment+0xb0e/0x12f0 [ 193.908176] Code: 45 a8 50 e8 54 46 be ff 44 8b 5d 80 41 01 c7 48 83 c4 18 44 89 7d b4 44 3b 5d b0 0f 8c f0 00 00 00 4d 85 e4 0f 84 65 07 00 00 <45> 8b b4 24 bc 00 00 00 48 8b b5 70 ff ff ff 4d 03 b4 24 c0 00 00 [ 193.921099] RSP: 0018:ffffc9002bf2f770 EFLAGS: 00010282 [ 193.925552] RAX: 00000000000000ee RBX: ffff88baab5092c8 RCX: 0000000000000003 [ 193.934308] RDX: 0000000000000000 RSI: 00000000fff7ffff RDI: 0000000000000001 [ 193.943281] RBP: ffffc9002bf2f850 R08: 0000000000000000 R09: c0000000fff7ffff [ 193.952658] R10: 0000000000000029 R11: ffffc9002bf2f310 R12: 0000000000000000 [ 193.962423] R13: ffff88abc2291e00 R14: ffff88abc2291d00 R15: ffff88abc2290600 [ 193.972593] FS: 0000000000000000(0000) GS:ffff88c07f840000(0000) knlGS:0000000000000000 [ 193.983302] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 193.988891] CR2: 00000000000000bc CR3: 0000001dccb4b006 CR4: 00000000001726e0 [ 193.999925] MSR 198h IA32 perf status 0x00001ba000001a00 [ 194.005488] MSR 19Ch IA32 thermal status 0x0000000088370000 [ 194.010983] MSR 1B1h IA32 package thermal status 0x0000000088330000 [ 194.021892] Call Trace: [ 194.027422] <TASK> [ 194.032838] ? __die_body+0x1a/0x60 [ 194.038172] ? page_fault_oops+0x12d/0x4d0 [ 194.043395] ? skb_segment+0xb0e/0x12f0 [ 194.048501] ? search_bpf_extables+0x59/0x60 [ 194.053547] ? fixup_exception+0x1d/0x250 [ 194.058537] ? exc_page_fault+0x67/0x140 [ 194.063382] ? asm_exc_page_fault+0x1f/0x30 [ 194.068171] ? skb_segment+0xb0e/0x12f0 [ 194.072861] tcp_gso_segment+0x107/0x540 [ 194.077507] ? sk_common_release+0xe0/0xe0 [ 194.082031] inet_gso_segment+0x15c/0x3d0 [ 194.086441] ? __skb_get_hash_symmetric+0x190/0x190 [ 194.090783] skb_mac_gso_segment+0x9f/0x110 [ 194.095016] __skb_gso_segment+0xc1/0x190 [ 194.099124] ? netif_skb_features+0xb5/0x280 [ 194.103131] netem_enqueue+0x290/0xb10 [sch_netem] [ 194.107071] dev_qdisc_enqueue+0x16/0x70 [ 194.110884] __dev_queue_xmit+0x63b/0xb30 [ 194.114569] ? inet_gso_segment+0x15c/0x3d0 [ 194.118160] ? bond_start_xmit+0x159/0x380 [bonding] [ 194.121670] bond_start_xmit+0x159/0x380 [bonding] [ 194.125101] ? skb_mac_gso_segment+0xa7/0x110 [ 194.128506] dev_hard_start_xmit+0xc3/0x1e0 [ 194.131787] __dev_queue_xmit+0x8a0/0xb30 [ 194.134977] ? macvlan_start_xmit+0x4f/0x100 [macvlan] [ 194.138225] macvlan_start_xmit+0x4f/0x100 [macvlan] [ 194.141477] dev_hard_start_xmit+0xc3/0x1e0 [ 194.144622] sch_direct_xmit+0xe3/0x280 [ 194.147748] __dev_queue_xmit+0x54a/0xb30 [ 194.150924] ? tap_get_user+0x2a8/0x9c0 [tap] [ 194.154131] tap_get_user+0x2a8/0x9c0 [tap] [ 194.157358] tap_sendmsg+0x52/0x8e0 [tap] [ 194.160565] ? get_tx_bufs+0x42/0x1d0 [vhost_net] [ 194.163815] ? get_tx_bufs+0x16a/0x1d0 [vhost_net] [ 194.167049] handle_tx_zerocopy+0x14e/0x4c0 [vhost_net] [ 194.170351] ? add_range+0x11/0x30 [ 194.173631] handle_tx+0xcd/0xe0 [vhost_net] [ 194.176959] vhost_worker+0x76/0xb0 [vhost] [ 194.180299] ? vhost_flush_work+0x10/0x10 [vhost] [ 194.183667] kthread+0x118/0x140 [ 194.187007] ? set_kthread_struct+0x40/0x40 [ 194.190358] ret_from_fork+0x1f/0x30 [ 194.193670] </TASK> I have narrowed the issue down to these lines in skb_segment() 4247 if (skb_orphan_frags(frag_skb, GFP_ATOMIC) || 4248 skb_zerocopy_clone(nskb, frag_skb, GFP_ATOMIC)) 4249 goto err; It is possible for skb_orphan_frags() or skb_zerocopy_clone() to update `nr_frags` as both functions may call skb_copy_ubufs(). If `nr_frags` gets updated, the local copy in `nfrags` might end up stale and cause this panic. In particular it is possible the while loop below hits `i >= nrfrags` prematurely and tries to move to next `frag_skb` by using `list_skb`. If `list_skb` is NULL then we hit the panic above. The naive way to fix this is to update `nfrags` as shown in this patch. This way we get the correct number of fragments and while loop can find all fragments without needing to move to next skbuff. I wanted to share this with the list and see what people think before submitting a patch. Specially that I have not tested this on master branch. Fixes: bf5c25d60861 ("skbuff: in skb_segment, call zerocopy functions once per nskb") Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com> --- net/core/skbuff.c | 4 ++++ 1 file changed, 4 insertions(+)