| Message ID | 5a2eae5c-9ad7-aca5-4927-665db946cfb2@gmail.com (mailing list archive) |
|---|---|
| State | Changes Requested |
| Delegated to: | Netdev Maintainers |
| Series | kernel panics with Big TCP and Tx ZC with hugepages |

| Context | Check | Description |
|---|---|---|
| netdev/tree_selection | success | Guessing tree name failed - patch did not apply |
David Ahern wrote:
> This has been on the back burner for too long now and with v6.3 released
> we should get it resolved before reports start rolling in. I am throwing
> this data dump out to the mailing list hoping someone else can provide
> more insights.
>
> Big TCP (both IPv6 and IPv4 versions are affected) can cause a variety
> of panics when combined with the Tx ZC API and buffers backed by
> hugepages. I have seen this with mlx5, a driver under development, and
> veth, so it seems to be a problem with the core stack.
>
> A quick reproducer:
>
> #!/bin/bash
> #
> # make sure ip is from top of tree iproute2
>
> ip netns add peer
> ip li add eth0 type veth peer eth1
> ip li set eth0 mtu 3400 up
> ip addr add dev eth0 172.16.253.1/24
> ip addr add dev eth0 2001:db8:1::1/64
>
> ip li set eth1 netns peer mtu 3400 up
> ip -netns peer addr add dev eth1 172.16.253.2/24
> ip -netns peer addr add dev eth1 2001:db8:1::2/64
>
> ip netns exec peer iperf3 -s -D
>
> ip li set dev eth0 gso_ipv4_max_size $((510*1024)) gro_ipv4_max_size $((510*1024)) gso_max_size $((510*1024)) gro_max_size $((510*1024))
>
> ip -netns peer li set dev eth1 gso_ipv4_max_size $((510*1024)) gro_ipv4_max_size $((510*1024)) gso_max_size $((510*1024)) gro_max_size $((510*1024))
>
> sysctl -w vm.nr_hugepages=2
>
> cat <<EOF
> Run either:
>
>   iperf3 -c 172.16.253.2 --zc_api
>   iperf3 -c 2001:db8:1::2 --zc_api
>
> where iperf3 is from https://github.com/dsahern/iperf mods-3.10
> EOF
>
> iperf3 in my tree has support for buffers using hugepages when using the
> Tx ZC API (--zc_api arg above).
>
> I have seen various backtraces based on platform and configuration, but
> skb_release_data is typically in the path. This is a common one for the
> veth reproducer above (saw it with both v4 and v6):
>
> [ 32.167294] general protection fault, probably for non-canonical address 0xdd8672069ea377b2: 0000 [#1] PREEMPT SMP
> [ 32.167569] CPU: 5 PID: 635 Comm: iperf3 Not tainted 6.3.0+ #4
> [ 32.167742] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
> [ 32.168039] RIP: 0010:skb_release_data+0xf4/0x180
> [ 32.168208] Code: 7e 57 48 89 d8 48 c1 e0 04 4d 8b 64 05 30 41 f6 c4 01 75 e1 41 80 7e 76 00 4d 89 e7 79 0c 4c 89 e7 e8 90 f
> [ 32.168869] RSP: 0018:ffffc900001a4eb0 EFLAGS: 00010202
> [ 32.169025] RAX: 00000000000001c0 RBX: 000000000000001c RCX: 0000000000000000
> [ 32.169265] RDX: 0000000000000102 RSI: 000000000000068f RDI: 00000000ffffffff
> [ 32.169475] RBP: ffffc900001a4ee0 R08: 0000000000000000 R09: ffff88807fd77ec0
> [ 32.169708] R10: ffffea0000173430 R11: 0000000000000000 R12: dd8672069ea377aa
> [ 32.169915] R13: ffff8880069cf100 R14: ffff888011910ae0 R15: dd8672069ea377aa
> [ 32.170126] FS: 0000000001720880(0000) GS:ffff88807fd40000(0000) knlGS:0000000000000000
> [ 32.170398] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 32.170586] CR2: 00007f0f04400000 CR3: 0000000004caa000 CR4: 0000000000750ee0
> [ 32.170796] PKRU: 55555554
> [ 32.170888] Call Trace:
> [ 32.170975] <IRQ>
> [ 32.171039] skb_release_all+0x2e/0x40
> [ 32.171152] napi_consume_skb+0x62/0xf0
> [ 32.171281] net_rx_action+0xf6/0x250
> [ 32.171394] __do_softirq+0xdf/0x2c0
> [ 32.171506] do_softirq+0x81/0xa0
> [ 32.171608] </IRQ>
>
>
> Xin came up with this patch a couple of months ago that resolves the
> panic but it has a big impact on performance:
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 0fbd5c85155f..6c2c8d09fd89 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1717,7 +1717,7 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
>  {
>  	int num_frags = skb_shinfo(skb)->nr_frags;
>  	struct page *page, *head = NULL;
> -	int i, new_frags;
> +	int i, new_frags, pagelen;
>  	u32 d_off;
>
>  	if (skb_shared(skb) || skb_unclone(skb, gfp_mask))
> @@ -1733,7 +1733,16 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
>  		return 0;
>  	}
>
> -	new_frags = (__skb_pagelen(skb) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> +	pagelen = __skb_pagelen(skb);
> +	if (pagelen > GSO_LEGACY_MAX_SIZE) {
> +		/* without hugepages, skb frags can only hold 65536 data. */

This is with CONFIG_MAX_SKB_FRAGS 17 I suppose.

So is the issue just that new_frags ends up indexing out of bounds
in frags[MAX_SKB_FRAGS]?

GSO_LEGACY_MAX_SIZE happens to match the value, but is not the
right constant, as that is a max on the packet length, regardless
of whether in linear or frags.

> +		if (!__pskb_pull_tail(skb, pagelen - GSO_LEGACY_MAX_SIZE))
> +			return -ENOMEM;
> +		pagelen = GSO_LEGACY_MAX_SIZE;
> +		num_frags = skb_shinfo(skb)->nr_frags;
> +	}
> +	new_frags = (pagelen + PAGE_SIZE - 1) >> PAGE_SHIFT;
> +
>  	for (i = 0; i < new_frags; i++) {
>  		page = alloc_page(gfp_mask);
>  		if (!page) {
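For context, here is a quick standalone sketch (not from the thread; PAGE_SIZE 4096 and the default MAX_SKB_FRAGS of 17 are assumed) of the arithmetic behind the out-of-bounds question: a Big TCP zerocopy skb sized like the reproducer's needs far more order-0 pages than the frag array can hold.

```c
/* Standalone sketch, not kernel code: how many order-0 pages
 * skb_copy_ubufs() would need for a reproducer-sized skb, versus
 * the frag array bound. PAGE_SIZE/MAX_SKB_FRAGS values are assumed.
 */
#include <stdio.h>

#define PAGE_SIZE     4096u
#define PAGE_SHIFT    12
#define MAX_SKB_FRAGS 17u   /* default when CONFIG_MAX_SKB_FRAGS is unset */

int main(void)
{
	unsigned int pagelen = 510 * 1024;  /* gso_max_size from the reproducer */
	unsigned int new_frags = (pagelen + PAGE_SIZE - 1) >> PAGE_SHIFT;

	/* prints new_frags=128 max=17: every frag past index 16 writes
	 * beyond skb_shared_info's frags[] array */
	printf("new_frags=%u max=%u\n", new_frags, MAX_SKB_FRAGS);
	return 0;
}
```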
On 4/26/23 3:52 PM, Willem de Bruijn wrote:
> David Ahern wrote:
>> This has been on the back burner for too long now and with v6.3 released
>> we should get it resolved before reports start rolling in.
>> [snip]
>> -	new_frags = (__skb_pagelen(skb) + PAGE_SIZE - 1) >> PAGE_SHIFT;
>> +	pagelen = __skb_pagelen(skb);
>> +	if (pagelen > GSO_LEGACY_MAX_SIZE) {
>> +		/* without hugepages, skb frags can only hold 65536 data. */
>
> This is with CONFIG_MAX_SKB_FRAGS 17 I suppose.

correct, I did not enable that config so it defaults to 17.

> So is the issue just that new_frags ends up indexing out of bounds
> in frags[MAX_SKB_FRAGS]?

yes, I see nr_frags at 32 which is clearly way out of bounds, and it
seems to be the skb_copy_ubufs function causing it.

> GSO_LEGACY_MAX_SIZE happens to match the value, but is not the
> right constant, as that is a max on the packet length, regardless
> of whether in linear or frags.
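As an aside on why Xin's workaround costs so much: __pskb_pull_tail() has to memcpy everything above the 64 KB cap into the skb's linear area before skb_copy_ubufs() then does its own page-by-page copy. A rough sketch of the extra volume under the reproducer's sizes (my arithmetic, not a measurement from the thread):

```c
/* Rough sketch of the extra copy volume the __pskb_pull_tail()
 * workaround adds per skb; sizes match the reproducer, not measured.
 */
#include <stdio.h>

#define GSO_LEGACY_MAX_SIZE (64 * 1024u)

int main(void)
{
	unsigned int pagelen = 510 * 1024;

	/* ~446 KB memcpy'd into the linear area per 510 KB skb, on top
	 * of the copy skb_copy_ubufs() performs anyway */
	printf("pulled per skb: %u bytes\n", pagelen - GSO_LEGACY_MAX_SIZE);
	return 0;
}
```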
On 4/27/23 00:24, David Ahern wrote:
> On 4/26/23 3:52 PM, Willem de Bruijn wrote:
>> [snip]
>> So is the issue just that new_frags ends up indexing out of bounds
>> in frags[MAX_SKB_FRAGS]?
>
> yes, I see nr_frags at 32 which is clearly way out of bounds, and it
> seems to be the skb_copy_ubufs function causing it.

Solution is to sense which page order is necessary in order to fit the
skb length into MAX_SKB_FRAGS pages: alloc_page() -> alloc_pages(order).
alloc_skb_with_frags() has a similar strategy (but different goals).

I can prepare a patch.
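Eric's "sense the page order" idea, pulled out as a standalone sketch (constants assumed for illustration; the kernel patch he posts next uses the same loop shape): pick the smallest order such that MAX_SKB_FRAGS compound pages of that order cover the skb's frag data.

```c
/* Standalone sketch: smallest page order such that MAX_SKB_FRAGS
 * compound pages cover the skb's frag data. Constants are assumed.
 */
#include <stdio.h>

#define PAGE_SIZE     4096UL
#define MAX_SKB_FRAGS 17

static unsigned int frag_fit_order(unsigned long pagelen)
{
	unsigned int order = 0;

	while ((PAGE_SIZE << order) * MAX_SKB_FRAGS < pagelen)
		order++;
	return order;
}

int main(void)
{
	unsigned long pagelen = 510 * 1024;   /* reproducer's gso_max_size */
	unsigned int order = frag_fit_order(pagelen);

	/* order 3 -> 32 KB pages -> ceil(510K / 32K) = 16 frags <= 17 */
	printf("order=%u psize=%lu\n", order, PAGE_SIZE << order);
	return 0;
}
```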
On Thu, Apr 27, 2023 at 5:32 PM Eric Dumazet <erdnetdev@gmail.com> wrote:
>
> [snip]
>
> Solution is to sense which page order is necessary in order to fit the
> skb length into MAX_SKB_FRAGS pages: alloc_page() -> alloc_pages(order).
> alloc_skb_with_frags() has a similar strategy (but different goals).
>
> I can prepare a patch.

I will test the following:

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 2112146092bfe24061b24b9e4684bcc031a045b9..891ecca40b29af1e15a89745f7fc630b19ea0202 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1758,7 +1758,7 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 {
 	int num_frags = skb_shinfo(skb)->nr_frags;
 	struct page *page, *head = NULL;
-	int i, new_frags;
+	int i, order, psize, new_frags;
 	u32 d_off;
 
 	if (skb_shared(skb) || skb_unclone(skb, gfp_mask))
@@ -1767,9 +1767,17 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 	if (!num_frags)
 		goto release;
 
-	new_frags = (__skb_pagelen(skb) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	/* We might have to allocate high order pages, so compute what minimum
+	 * page order is needed.
+	 */
+	order = 0;
+	while ((PAGE_SIZE << order) * MAX_SKB_FRAGS < __skb_pagelen(skb))
+		order++;
+	psize = (PAGE_SIZE << order);
+
+	new_frags = (__skb_pagelen(skb) + psize - 1) >> (PAGE_SHIFT + order);
 	for (i = 0; i < new_frags; i++) {
-		page = alloc_page(gfp_mask);
+		page = alloc_pages(gfp_mask, order);
 		if (!page) {
 			while (head) {
 				struct page *next = (struct page *)page_private(head);
@@ -1796,11 +1804,11 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 		vaddr = kmap_atomic(p);
 
 		while (done < p_len) {
-			if (d_off == PAGE_SIZE) {
+			if (d_off == psize) {
 				d_off = 0;
 				page = (struct page *)page_private(page);
 			}
-			copy = min_t(u32, PAGE_SIZE - d_off, p_len - done);
+			copy = min_t(u32, psize - d_off, p_len - done);
 			memcpy(page_address(page) + d_off,
 			       vaddr + p_off + done, copy);
 			done += copy;
@@ -1816,7 +1824,7 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 
 	/* skb frags point to kernel buffers */
 	for (i = 0; i < new_frags - 1; i++) {
-		__skb_fill_page_desc(skb, i, head, 0, PAGE_SIZE);
+		__skb_fill_page_desc(skb, i, head, 0, psize);
 		head = (struct page *)page_private(head);
 	}
 	__skb_fill_page_desc(skb, new_frags - 1, head, 0, d_off);
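Plugging the reproducer's numbers into the patch above (my arithmetic, not test output): for a __skb_pagelen() of 510 KB the loop settles on order 3, so psize is 32 KB and new_frags becomes 16, back under MAX_SKB_FRAGS. The repacking loop then strides the destination by psize rather than PAGE_SIZE. A userspace sketch of that loop shape, with toy sizes:

```c
/* Userspace sketch of the patched copy loop: source data is repacked
 * into destination "pages" of psize bytes each (psize = PAGE_SIZE <<
 * order). Simplified from the patch above; sizes are illustrative.
 */
#include <stdio.h>
#include <string.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

int main(void)
{
	enum { PSIZE = 32768, SRC_LEN = 100000 };  /* order-3 psize, toy src */
	static char src[SRC_LEN], dst[4][PSIZE];
	size_t d_off = 0, done = 0, page = 0;

	memset(src, 'x', sizeof(src));
	while (done < sizeof(src)) {
		size_t copy;

		if (d_off == PSIZE) {   /* current dst page full: advance */
			d_off = 0;
			page++;
		}
		copy = MIN(PSIZE - d_off, sizeof(src) - done);
		memcpy(dst[page] + d_off, src + done, copy);
		done += copy;
		d_off += copy;
	}
	/* prints: pages used=4, last d_off=1696 */
	printf("pages used=%zu, last d_off=%zu\n", page + 1, d_off);
	return 0;
}
```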