kernel panics with Big TCP and Tx ZC with hugepages

Message ID 5a2eae5c-9ad7-aca5-4927-665db946cfb2@gmail.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers

Commit Message

David Ahern April 26, 2023, 7:07 p.m. UTC
This has been on the back burner for too long now and with v6.3 released
we should get it resolved before reports start rolling in. I am throwing
this data dump out to the mailing list hoping someone else can provide
more insights.

Big TCP (both IPv6 and IPv4 versions are affected) can cause a variety
of panics when combined with the Tx ZC API and buffers backed by
hugepages. I have seen this with mlx5, a driver under development, and
veth, so it seems to be a problem with the core stack.

A quick reproducer:

#!/bin/bash
#
# make sure ip is from top of tree iproute2

ip netns add peer
ip li add eth0 type veth peer eth1
ip li set eth0 mtu 3400 up
ip addr add dev eth0 172.16.253.1/24
ip addr add dev eth0 2001:db8:1::1/64

ip li set eth1 netns peer mtu 3400 up
ip -netns peer addr add dev eth1 172.16.253.2/24
ip -netns peer addr add dev eth1 2001:db8:1::2/64

ip netns exec peer iperf3 -s -D

ip li set dev eth0 gso_ipv4_max_size $((510*1024)) gro_ipv4_max_size $((510*1024)) \
    gso_max_size $((510*1024)) gro_max_size $((510*1024))

ip -netns peer li set dev eth1 gso_ipv4_max_size $((510*1024)) gro_ipv4_max_size $((510*1024)) \
    gso_max_size $((510*1024)) gro_max_size $((510*1024))

sysctl -w vm.nr_hugepages=2

cat <<EOF
Run either:

    iperf3 -c 172.16.253.2 --zc_api
    iperf3 -c 2001:db8:1::2 --zc_api

where iperf3 is from https://github.com/dsahern/iperf mods-3.10
EOF
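
(Note: with the default 2 MB hugepage size on x86-64, vm.nr_hugepages=2
reserves 4 MB, which is more than enough for the zerocopy buffers used
here; other hugepage sizes will need the count adjusted.)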

iperf3 in my tree has support for hugepage-backed buffers when using
the Tx ZC API (the --zc_api arg above); a rough sketch of that
userspace pattern is below.
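
To be clear about what the test exercises, the pattern is roughly the
following. This is a minimal illustrative sketch assuming an anonymous
MAP_HUGETLB mapping -- not the actual iperf3 code; error handling,
completion draining and cleanup are omitted:

#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>

/* Sketch only: hugepage-backed buffer + MSG_ZEROCOPY send.
 * len is assumed to already be a multiple of the hugepage size.
 */
static int send_zc_hugepage(int fd, size_t len)
{
        int one = 1;
        void *buf;

        /* opt the socket in to zerocopy transmission */
        if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0)
                return -1;

        /* back the send buffer with hugetlb pages (needs vm.nr_hugepages) */
        buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (buf == MAP_FAILED)
                return -1;

        memset(buf, 'x', len);

        /* the kernel pins these pages until the zerocopy completion
         * arrives on the error queue (MSG_ERRQUEUE, not shown here)
         */
        return send(fd, buf, len, MSG_ZEROCOPY) < 0 ? -1 : 0;
}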

I have seen various backtraces depending on platform and configuration,
but skb_release_data is typically in the path. This is a common one for
the veth reproducer above (seen with both v4 and v6):

[   32.167294] general protection fault, probably for non-canonical address 0xdd8672069ea377b2: 0000 [#1] PREEMPT SMP
[   32.167569] CPU: 5 PID: 635 Comm: iperf3 Not tainted 6.3.0+ #4
[   32.167742] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
[   32.168039] RIP: 0010:skb_release_data+0xf4/0x180
[   32.168208] Code: 7e 57 48 89 d8 48 c1 e0 04 4d 8b 64 05 30 41 f6 c4 01 75 e1 41 80 7e 76 00 4d 89 e7 79 0c 4c 89 e7 e8 90 f
[   32.168869] RSP: 0018:ffffc900001a4eb0 EFLAGS: 00010202
[   32.169025] RAX: 00000000000001c0 RBX: 000000000000001c RCX: 0000000000000000
[   32.169265] RDX: 0000000000000102 RSI: 000000000000068f RDI: 00000000ffffffff
[   32.169475] RBP: ffffc900001a4ee0 R08: 0000000000000000 R09: ffff88807fd77ec0
[   32.169708] R10: ffffea0000173430 R11: 0000000000000000 R12: dd8672069ea377aa
[   32.169915] R13: ffff8880069cf100 R14: ffff888011910ae0 R15: dd8672069ea377aa
[   32.170126] FS:  0000000001720880(0000) GS:ffff88807fd40000(0000) knlGS:0000000000000000
[   32.170398] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   32.170586] CR2: 00007f0f04400000 CR3: 0000000004caa000 CR4: 0000000000750ee0
[   32.170796] PKRU: 55555554
[   32.170888] Call Trace:
[   32.170975]  <IRQ>
[   32.171039]  skb_release_all+0x2e/0x40
[   32.171152]  napi_consume_skb+0x62/0xf0
[   32.171281]  net_rx_action+0xf6/0x250
[   32.171394]  __do_softirq+0xdf/0x2c0
[   32.171506]  do_softirq+0x81/0xa0
[   32.171608]  </IRQ>
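
For context on why it blows up in skb_release_data: the free path walks
the frag array up to nr_frags and drops a page reference for each
entry, roughly like this (heavily simplified; the real code also
handles pp_recycle, zerocopy notifications and frag_list):

        for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
                __skb_frag_unref(&skb_shinfo(skb)->frags[i], false);

so if something writes nr_frags past the end of frags[], this loop
walks off the end of the array and puts whatever stale pointers it
finds there -- which lines up with the non-canonical value in R12/R15
above.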


Xin came up with the patch below a couple of months ago; it resolves
the panic but has a big impact on performance:

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 0fbd5c85155f..6c2c8d09fd89 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1717,7 +1717,7 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 {
        int num_frags = skb_shinfo(skb)->nr_frags;
        struct page *page, *head = NULL;
-       int i, new_frags;
+       int i, new_frags, pagelen;
        u32 d_off;

        if (skb_shared(skb) || skb_unclone(skb, gfp_mask))
@@ -1733,7 +1733,16 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
                return 0;
        }

-       new_frags = (__skb_pagelen(skb) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+       pagelen = __skb_pagelen(skb);
+       if (pagelen > GSO_LEGACY_MAX_SIZE) {
+               /* without hugepages, skb frags can only hold 65536 data. */
+               if (!__pskb_pull_tail(skb, pagelen - GSO_LEGACY_MAX_SIZE))
+                       return -ENOMEM;
+               pagelen = GSO_LEGACY_MAX_SIZE;
+               num_frags = skb_shinfo(skb)->nr_frags;
+       }
+       new_frags = (pagelen + PAGE_SIZE - 1) >> PAGE_SHIFT;
+
        for (i = 0; i < new_frags; i++) {
                page = alloc_page(gfp_mask);
                if (!page) {
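
(A likely reason for the performance hit, if I read the patch right:
__pskb_pull_tail() has to pull the excess into the linear area, which
means pskb_expand_head() allocating a large linear buffer and then
memcpy'ing up to 522240 - 65536 = 456704 bytes (~446 KB) per oversized
skb before the function's own page-by-page copy loop even runs.)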

Comments

Willem de Bruijn April 26, 2023, 9:52 p.m. UTC | #1
David Ahern wrote:
> This has been on the back burner for too long now and with v6.3 released
> we should get it resolved before reports start rolling in. I am throwing
> this data dump out to the mailing list hoping someone else can provide
> more insights.
> 
> Big TCP (both IPv6 and IPv4 versions are affected) can cause a variety
> of panics when combined with the Tx ZC API and buffers backed by
> hugepages. I have seen this with mlx5, a driver under development and
> veth, so it seems to be a problem with the core stack.
> 
> A quick reproducer:
> 
> #!/bin/bash
> #
> # make sure ip is from top of tree iproute2
> 
> ip netns add peer
> ip li add eth0 type veth peer eth1
> ip li set eth0 mtu 3400 up
> ip addr add dev eth0 172.16.253.1/24
> ip addr add dev eth0 2001:db8:1::1/64
> 
> ip li set eth1 netns peer mtu 3400 up
> ip -netns peer addr add dev eth1 172.16.253.2/24
> ip -netns peer addr add dev eth1 2001:db8:1::2/64
> 
> ip netns exec peer iperf3 -s -D
> 
> ip li set dev eth0 gso_ipv4_max_size $((510*1024)) gro_ipv4_max_size
> $((510*1024)) gso_max_size $((510*1024)) gro_max_size  $((510*1024))
> 
> ip -netns peer li set dev eth1 gso_ipv4_max_size $((510*1024))
> gro_ipv4_max_size  $((510*1024)) gso_max_size $((510*1024)) gro_max_size
>  $((510*1024))
> 
> sysctl -w vm.nr_hugepages=2
> 
> cat <<EOF
> Run either:
> 
>     iperf3 -c 172.16.253.2 --zc_api
>     iperf3 -c 2001:db8:1::2 --zc_api
> 
> where iperf3 is from https://github.com/dsahern/iperf mods-3.10
> EOF
> 
> iperf3 in my tree has support for buffers using hugepages when using the
> Tx ZC API (--zc_api arg above).
> 
> I have seen various backtraces based on platform and configuration, but
> skb_release_data is typically in the path. This is a common one for the
> veth reproducer above (saw it with both v4 and v6):
> 
> [   32.167294] general protection fault, probably for non-canonical
> address 0xdd8672069ea377b2: 0000 [#1] PREEMPT SMP
> [   32.167569] CPU: 5 PID: 635 Comm: iperf3 Not tainted 6.3.0+ #4
> [   32.167742] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
> 1.13.0-1ubuntu1.1 04/01/2014
> [   32.168039] RIP: 0010:skb_release_data+0xf4/0x180
> [   32.168208] Code: 7e 57 48 89 d8 48 c1 e0 04 4d 8b 64 05 30 41 f6 c4
> 01 75 e1 41 80 7e 76 00 4d 89 e7 79 0c 4c 89 e7 e8 90 f
> [   32.168869] RSP: 0018:ffffc900001a4eb0 EFLAGS: 00010202
> [   32.169025] RAX: 00000000000001c0 RBX: 000000000000001c RCX:
> 0000000000000000
> [   32.169265] RDX: 0000000000000102 RSI: 000000000000068f RDI:
> 00000000ffffffff
> [   32.169475] RBP: ffffc900001a4ee0 R08: 0000000000000000 R09:
> ffff88807fd77ec0
> [   32.169708] R10: ffffea0000173430 R11: 0000000000000000 R12:
> dd8672069ea377aa
> [   32.169915] R13: ffff8880069cf100 R14: ffff888011910ae0 R15:
> dd8672069ea377aa
> [   32.170126] FS:  0000000001720880(0000) GS:ffff88807fd40000(0000)
> knlGS:0000000000000000
> [   32.170398] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   32.170586] CR2: 00007f0f04400000 CR3: 0000000004caa000 CR4:
> 0000000000750ee0
> [   32.170796] PKRU: 55555554
> [   32.170888] Call Trace:
> [   32.170975]  <IRQ>
> [   32.171039]  skb_release_all+0x2e/0x40
> [   32.171152]  napi_consume_skb+0x62/0xf0
> [   32.171281]  net_rx_action+0xf6/0x250
> [   32.171394]  __do_softirq+0xdf/0x2c0
> [   32.171506]  do_softirq+0x81/0xa0
> [   32.171608]  </IRQ>
> 
> 
> Xin came up with this patch a couple of months ago that resolves the
> panic but it has a big impact on performance:
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 0fbd5c85155f..6c2c8d09fd89 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1717,7 +1717,7 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t
> gfp_mask)
>  {
>         int num_frags = skb_shinfo(skb)->nr_frags;
>         struct page *page, *head = NULL;
> -       int i, new_frags;
> +       int i, new_frags, pagelen;
>         u32 d_off;
> 
>         if (skb_shared(skb) || skb_unclone(skb, gfp_mask))
> @@ -1733,7 +1733,16 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t
> gfp_mask)
>                 return 0;
>         }
> 
> -       new_frags = (__skb_pagelen(skb) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> +       pagelen = __skb_pagelen(skb);
> +       if (pagelen > GSO_LEGACY_MAX_SIZE) {
> +               /* without hugepages, skb frags can only hold 65536 data. */

This is with CONFIG_MAX_SKB_FRAGS 17 I suppose.

So is the issue just that new_frags ends up indexing out of bounds
in frags[MAX_SKB_FRAGS]?
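
As a quick sanity check on that (assuming 4 KB pages and the 510 KB
settings from the reproducer): __skb_pagelen() can then be up to
510 * 1024 = 522240 bytes, so new_frags = DIV_ROUND_UP(522240, 4096) =
128 entries, written into a frags[] array sized for 17.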

GSO_LEGACY_MAX_SIZE happens to match the value, but is not the
right constant, as that is a max on the packet length, regardless
of whether it is in linear or frags.

> +               if (!__pskb_pull_tail(skb, pagelen - GSO_LEGACY_MAX_SIZE))
> +                       return -ENOMEM;
> +               pagelen = GSO_LEGACY_MAX_SIZE;
> +               num_frags = skb_shinfo(skb)->nr_frags;
> +       }
> +       new_frags = (pagelen + PAGE_SIZE - 1) >> PAGE_SHIFT;
> +
>         for (i = 0; i < new_frags; i++) {
>                 page = alloc_page(gfp_mask);
>                 if (!page) {
>
David Ahern April 26, 2023, 10:24 p.m. UTC | #2
On 4/26/23 3:52 PM, Willem de Bruijn wrote:
> David Ahern wrote:
>> This has been on the back burner for too long now and with v6.3 released
>> we should get it resolved before reports start rolling in. I am throwing
>> this data dump out to the mailing list hoping someone else can provide
>> more insights.
>>
>> Big TCP (both IPv6 and IPv4 versions are affected) can cause a variety
>> of panics when combined with the Tx ZC API and buffers backed by
>> hugepages. I have seen this with mlx5, a driver under development and
>> veth, so it seems to be a problem with the core stack.
>>
>> A quick reproducer:
>>
>> #!/bin/bash
>> #
>> # make sure ip is from top of tree iproute2
>>
>> ip netns add peer
>> ip li add eth0 type veth peer eth1
>> ip li set eth0 mtu 3400 up
>> ip addr add dev eth0 172.16.253.1/24
>> ip addr add dev eth0 2001:db8:1::1/64
>>
>> ip li set eth1 netns peer mtu 3400 up
>> ip -netns peer addr add dev eth1 172.16.253.2/24
>> ip -netns peer addr add dev eth1 2001:db8:1::2/64
>>
>> ip netns exec peer iperf3 -s -D
>>
>> ip li set dev eth0 gso_ipv4_max_size $((510*1024)) gro_ipv4_max_size
>> $((510*1024)) gso_max_size $((510*1024)) gro_max_size  $((510*1024))
>>
>> ip -netns peer li set dev eth1 gso_ipv4_max_size $((510*1024))
>> gro_ipv4_max_size  $((510*1024)) gso_max_size $((510*1024)) gro_max_size
>>  $((510*1024))
>>
>> sysctl -w vm.nr_hugepages=2
>>
>> cat <<EOF
>> Run either:
>>
>>     iperf3 -c 172.16.253.2 --zc_api
>>     iperf3 -c 2001:db8:1::2 --zc_api
>>
>> where iperf3 is from https://github.com/dsahern/iperf mods-3.10
>> EOF
>>
>> iperf3 in my tree has support for buffers using hugepages when using the
>> Tx ZC API (--zc_api arg above).
>>
>> I have seen various backtraces based on platform and configuration, but
>> skb_release_data is typically in the path. This is a common one for the
>> veth reproducer above (saw it with both v4 and v6):
>>
>> [   32.167294] general protection fault, probably for non-canonical
>> address 0xdd8672069ea377b2: 0000 [#1] PREEMPT SMP
>> [   32.167569] CPU: 5 PID: 635 Comm: iperf3 Not tainted 6.3.0+ #4
>> [   32.167742] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
>> 1.13.0-1ubuntu1.1 04/01/2014
>> [   32.168039] RIP: 0010:skb_release_data+0xf4/0x180
>> [   32.168208] Code: 7e 57 48 89 d8 48 c1 e0 04 4d 8b 64 05 30 41 f6 c4
>> 01 75 e1 41 80 7e 76 00 4d 89 e7 79 0c 4c 89 e7 e8 90 f
>> [   32.168869] RSP: 0018:ffffc900001a4eb0 EFLAGS: 00010202
>> [   32.169025] RAX: 00000000000001c0 RBX: 000000000000001c RCX:
>> 0000000000000000
>> [   32.169265] RDX: 0000000000000102 RSI: 000000000000068f RDI:
>> 00000000ffffffff
>> [   32.169475] RBP: ffffc900001a4ee0 R08: 0000000000000000 R09:
>> ffff88807fd77ec0
>> [   32.169708] R10: ffffea0000173430 R11: 0000000000000000 R12:
>> dd8672069ea377aa
>> [   32.169915] R13: ffff8880069cf100 R14: ffff888011910ae0 R15:
>> dd8672069ea377aa
>> [   32.170126] FS:  0000000001720880(0000) GS:ffff88807fd40000(0000)
>> knlGS:0000000000000000
>> [   32.170398] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [   32.170586] CR2: 00007f0f04400000 CR3: 0000000004caa000 CR4:
>> 0000000000750ee0
>> [   32.170796] PKRU: 55555554
>> [   32.170888] Call Trace:
>> [   32.170975]  <IRQ>
>> [   32.171039]  skb_release_all+0x2e/0x40
>> [   32.171152]  napi_consume_skb+0x62/0xf0
>> [   32.171281]  net_rx_action+0xf6/0x250
>> [   32.171394]  __do_softirq+0xdf/0x2c0
>> [   32.171506]  do_softirq+0x81/0xa0
>> [   32.171608]  </IRQ>
>>
>>
>> Xin came up with this patch a couple of months ago that resolves the
>> panic but it has a big impact on performance:
>>
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index 0fbd5c85155f..6c2c8d09fd89 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -1717,7 +1717,7 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t
>> gfp_mask)
>>  {
>>         int num_frags = skb_shinfo(skb)->nr_frags;
>>         struct page *page, *head = NULL;
>> -       int i, new_frags;
>> +       int i, new_frags, pagelen;
>>         u32 d_off;
>>
>>         if (skb_shared(skb) || skb_unclone(skb, gfp_mask))
>> @@ -1733,7 +1733,16 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t
>> gfp_mask)
>>                 return 0;
>>         }
>>
>> -       new_frags = (__skb_pagelen(skb) + PAGE_SIZE - 1) >> PAGE_SHIFT;
>> +       pagelen = __skb_pagelen(skb);
>> +       if (pagelen > GSO_LEGACY_MAX_SIZE) {
>> +               /* without hugepages, skb frags can only hold 65536 data. */
> 
> This is with CONFIG_MAX_SKB_FRAGS 17 I suppose.

correct, I did not enable that config so it defaults to 17.

> 
> So is the issue just that new_frags ends up indexing out of bounds
> in frags[MAX_SKB_FRAGS]?

yes, I see nr_frags at 32, which is clearly way out of bounds, and it
seems to be the skb_copy_ubufs function causing it.

> 
> GSO_LEGACY_MAX_SIZE happens to match the value, but is not the
> right constant, as that is a max on the packet length, regardless
> of whether it is in linear or frags.
> 
>> +               if (!__pskb_pull_tail(skb, pagelen - GSO_LEGACY_MAX_SIZE))
>> +                       return -ENOMEM;
>> +               pagelen = GSO_LEGACY_MAX_SIZE;
>> +               num_frags = skb_shinfo(skb)->nr_frags;
>> +       }
>> +       new_frags = (pagelen + PAGE_SIZE - 1) >> PAGE_SHIFT;
>> +
>>         for (i = 0; i < new_frags; i++) {
>>                 page = alloc_page(gfp_mask);
>>                 if (!page) {
>>
> 
>
Eric Dumazet April 27, 2023, 3:32 p.m. UTC | #3
On 4/27/23 00:24, David Ahern wrote:
> On 4/26/23 3:52 PM, Willem de Bruijn wrote:
>> David Ahern wrote:
>>> This has been on the back burner for too long now and with v6.3 released
>>> we should get it resolved before reports start rolling in. I am throwing
>>> this data dump out to the mailing list hoping someone else can provide
>>> more insights.
>>>
>>> Big TCP (both IPv6 and IPv4 versions are affected) can cause a variety
>>> of panics when combined with the Tx ZC API and buffers backed by
>>> hugepages. I have seen this with mlx5, a driver under development and
>>> veth, so it seems to be a problem with the core stack.
>>>
>>> A quick reproducer:
>>>
>>> #!/bin/bash
>>> #
>>> # make sure ip is from top of tree iproute2
>>>
>>> ip netns add peer
>>> ip li add eth0 type veth peer eth1
>>> ip li set eth0 mtu 3400 up
>>> ip addr add dev eth0 172.16.253.1/24
>>> ip addr add dev eth0 2001:db8:1::1/64
>>>
>>> ip li set eth1 netns peer mtu 3400 up
>>> ip -netns peer addr add dev eth1 172.16.253.2/24
>>> ip -netns peer addr add dev eth1 2001:db8:1::2/64
>>>
>>> ip netns exec peer iperf3 -s -D
>>>
>>> ip li set dev eth0 gso_ipv4_max_size $((510*1024)) gro_ipv4_max_size
>>> $((510*1024)) gso_max_size $((510*1024)) gro_max_size  $((510*1024))
>>>
>>> ip -netns peer li set dev eth1 gso_ipv4_max_size $((510*1024))
>>> gro_ipv4_max_size  $((510*1024)) gso_max_size $((510*1024)) gro_max_size
>>>   $((510*1024))
>>>
>>> sysctl -w vm.nr_hugepages=2
>>>
>>> cat <<EOF
>>> Run either:
>>>
>>>      iperf3 -c 172.16.253.2 --zc_api
>>>      iperf3 -c 2001:db8:1::2 --zc_api
>>>
>>> where iperf3 is from https://github.com/dsahern/iperf mods-3.10
>>> EOF
>>>
>>> iperf3 in my tree has support for buffers using hugepages when using the
>>> Tx ZC API (--zc_api arg above).
>>>
>>> I have seen various backtraces based on platform and configuration, but
>>> skb_release_data is typically in the path. This is a common one for the
>>> veth reproducer above (saw it with both v4 and v6):
>>>
>>> [   32.167294] general protection fault, probably for non-canonical
>>> address 0xdd8672069ea377b2: 0000 [#1] PREEMPT SMP
>>> [   32.167569] CPU: 5 PID: 635 Comm: iperf3 Not tainted 6.3.0+ #4
>>> [   32.167742] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
>>> 1.13.0-1ubuntu1.1 04/01/2014
>>> [   32.168039] RIP: 0010:skb_release_data+0xf4/0x180
>>> [   32.168208] Code: 7e 57 48 89 d8 48 c1 e0 04 4d 8b 64 05 30 41 f6 c4
>>> 01 75 e1 41 80 7e 76 00 4d 89 e7 79 0c 4c 89 e7 e8 90 f
>>> [   32.168869] RSP: 0018:ffffc900001a4eb0 EFLAGS: 00010202
>>> [   32.169025] RAX: 00000000000001c0 RBX: 000000000000001c RCX:
>>> 0000000000000000
>>> [   32.169265] RDX: 0000000000000102 RSI: 000000000000068f RDI:
>>> 00000000ffffffff
>>> [   32.169475] RBP: ffffc900001a4ee0 R08: 0000000000000000 R09:
>>> ffff88807fd77ec0
>>> [   32.169708] R10: ffffea0000173430 R11: 0000000000000000 R12:
>>> dd8672069ea377aa
>>> [   32.169915] R13: ffff8880069cf100 R14: ffff888011910ae0 R15:
>>> dd8672069ea377aa
>>> [   32.170126] FS:  0000000001720880(0000) GS:ffff88807fd40000(0000)
>>> knlGS:0000000000000000
>>> [   32.170398] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [   32.170586] CR2: 00007f0f04400000 CR3: 0000000004caa000 CR4:
>>> 0000000000750ee0
>>> [   32.170796] PKRU: 55555554
>>> [   32.170888] Call Trace:
>>> [   32.170975]  <IRQ>
>>> [   32.171039]  skb_release_all+0x2e/0x40
>>> [   32.171152]  napi_consume_skb+0x62/0xf0
>>> [   32.171281]  net_rx_action+0xf6/0x250
>>> [   32.171394]  __do_softirq+0xdf/0x2c0
>>> [   32.171506]  do_softirq+0x81/0xa0
>>> [   32.171608]  </IRQ>
>>>
>>>
>>> Xin came up with this patch a couple of months ago that resolves the
>>> panic but it has a big impact on performance:
>>>
>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>> index 0fbd5c85155f..6c2c8d09fd89 100644
>>> --- a/net/core/skbuff.c
>>> +++ b/net/core/skbuff.c
>>> @@ -1717,7 +1717,7 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t
>>> gfp_mask)
>>>   {
>>>          int num_frags = skb_shinfo(skb)->nr_frags;
>>>          struct page *page, *head = NULL;
>>> -       int i, new_frags;
>>> +       int i, new_frags, pagelen;
>>>          u32 d_off;
>>>
>>>          if (skb_shared(skb) || skb_unclone(skb, gfp_mask))
>>> @@ -1733,7 +1733,16 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t
>>> gfp_mask)
>>>                  return 0;
>>>          }
>>>
>>> -       new_frags = (__skb_pagelen(skb) + PAGE_SIZE - 1) >> PAGE_SHIFT;
>>> +       pagelen = __skb_pagelen(skb);
>>> +       if (pagelen > GSO_LEGACY_MAX_SIZE) {
>>> +               /* without hugepages, skb frags can only hold 65536 data. */
>> This is with CONFIG_MAX_SKB_FRAGS 17 I suppose.
> correct, I did not enable that config so it defaults to 17.
>
>> So is the issue just that new_frags ends up indexing out of bounds
>> in frags[MAX_SKB_FRAGS]?
> yes, I see nr_frags at 32 which is clearly way out of bounds and it
> seems to be the skb_copy_ubufs function causing it.


The solution is to sense which page order is necessary to fit the skb
length into MAX_SKB_FRAGS pages: alloc_page() -> alloc_pages(order).
alloc_skb_with_frags() has a similar strategy (but different goals).

I can prepare a patch.
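
For reference, the per-chunk strategy in alloc_skb_with_frags() is
roughly the following (paraphrased from memory and simplified -- treat
it as a sketch, not the exact code): try a high order first and degrade
gracefully under memory pressure.

        struct page *page = NULL;

        while (order) {
                /* opportunistic high-order attempt; do not reclaim or warn */
                page = alloc_pages((gfp_mask & ~__GFP_DIRECT_RECLAIM) |
                                   __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY,
                                   order);
                if (page)
                        break;
                order--;
        }
        if (!page)
                page = alloc_page(gfp_mask);

The difference for skb_copy_ubufs() is that the order computed there is
a hard minimum -- anything smaller overflows frags[] again -- so the
order-0 fallback part does not apply.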

>> GSO_LEGACY_MAX_SIZE happens to match the value, but is not the
>> right constant, as that is a max on the packet length, regardless
>> of whether it is in linear or frags.
>>
>>> +               if (!__pskb_pull_tail(skb, pagelen - GSO_LEGACY_MAX_SIZE))
>>> +                       return -ENOMEM;
>>> +               pagelen = GSO_LEGACY_MAX_SIZE;
>>> +               num_frags = skb_shinfo(skb)->nr_frags;
>>> +       }
>>> +       new_frags = (pagelen + PAGE_SIZE - 1) >> PAGE_SHIFT;
>>> +
>>>          for (i = 0; i < new_frags; i++) {
>>>                  page = alloc_page(gfp_mask);
>>>                  if (!page) {
>>>
>>
Eric Dumazet April 27, 2023, 4:03 p.m. UTC | #4
On Thu, Apr 27, 2023 at 5:32 PM Eric Dumazet <erdnetdev@gmail.com> wrote:
>
>
> On 4/27/23 00:24, David Ahern wrote:
> > On 4/26/23 3:52 PM, Willem de Bruijn wrote:
> >> David Ahern wrote:
> >>> This has been on the back burner for too long now and with v6.3 released
> >>> we should get it resolved before reports start rolling in. I am throwing
> >>> this data dump out to the mailing list hoping someone else can provide
> >>> more insights.
> >>>
> >>> Big TCP (both IPv6 and IPv4 versions are affected) can cause a variety
> >>> of panics when combined with the Tx ZC API and buffers backed by
> >>> hugepages. I have seen this with mlx5, a driver under development and
> >>> veth, so it seems to be a problem with the core stack.
> >>>
> >>> A quick reproducer:
> >>>
> >>> #!/bin/bash
> >>> #
> >>> # make sure ip is from top of tree iproute2
> >>>
> >>> ip netns add peer
> >>> ip li add eth0 type veth peer eth1
> >>> ip li set eth0 mtu 3400 up
> >>> ip addr add dev eth0 172.16.253.1/24
> >>> ip addr add dev eth0 2001:db8:1::1/64
> >>>
> >>> ip li set eth1 netns peer mtu 3400 up
> >>> ip -netns peer addr add dev eth1 172.16.253.2/24
> >>> ip -netns peer addr add dev eth1 2001:db8:1::2/64
> >>>
> >>> ip netns exec peer iperf3 -s -D
> >>>
> >>> ip li set dev eth0 gso_ipv4_max_size $((510*1024)) gro_ipv4_max_size
> >>> $((510*1024)) gso_max_size $((510*1024)) gro_max_size  $((510*1024))
> >>>
> >>> ip -netns peer li set dev eth1 gso_ipv4_max_size $((510*1024))
> >>> gro_ipv4_max_size  $((510*1024)) gso_max_size $((510*1024)) gro_max_size
> >>>   $((510*1024))
> >>>
> >>> sysctl -w vm.nr_hugepages=2
> >>>
> >>> cat <<EOF
> >>> Run either:
> >>>
> >>>      iperf3 -c 172.16.253.2 --zc_api
> >>>      iperf3 -c 2001:db8:1::2 --zc_api
> >>>
> >>> where iperf3 is from https://github.com/dsahern/iperf mods-3.10
> >>> EOF
> >>>
> >>> iperf3 in my tree has support for buffers using hugepages when using the
> >>> Tx ZC API (--zc_api arg above).
> >>>
> >>> I have seen various backtraces based on platform and configuration, but
> >>> skb_release_data is typically in the path. This is a common one for the
> >>> veth reproducer above (saw it with both v4 and v6):
> >>>
> >>> [   32.167294] general protection fault, probably for non-canonical
> >>> address 0xdd8672069ea377b2: 0000 [#1] PREEMPT SMP
> >>> [   32.167569] CPU: 5 PID: 635 Comm: iperf3 Not tainted 6.3.0+ #4
> >>> [   32.167742] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
> >>> 1.13.0-1ubuntu1.1 04/01/2014
> >>> [   32.168039] RIP: 0010:skb_release_data+0xf4/0x180
> >>> [   32.168208] Code: 7e 57 48 89 d8 48 c1 e0 04 4d 8b 64 05 30 41 f6 c4
> >>> 01 75 e1 41 80 7e 76 00 4d 89 e7 79 0c 4c 89 e7 e8 90 f
> >>> [   32.168869] RSP: 0018:ffffc900001a4eb0 EFLAGS: 00010202
> >>> [   32.169025] RAX: 00000000000001c0 RBX: 000000000000001c RCX:
> >>> 0000000000000000
> >>> [   32.169265] RDX: 0000000000000102 RSI: 000000000000068f RDI:
> >>> 00000000ffffffff
> >>> [   32.169475] RBP: ffffc900001a4ee0 R08: 0000000000000000 R09:
> >>> ffff88807fd77ec0
> >>> [   32.169708] R10: ffffea0000173430 R11: 0000000000000000 R12:
> >>> dd8672069ea377aa
> >>> [   32.169915] R13: ffff8880069cf100 R14: ffff888011910ae0 R15:
> >>> dd8672069ea377aa
> >>> [   32.170126] FS:  0000000001720880(0000) GS:ffff88807fd40000(0000)
> >>> knlGS:0000000000000000
> >>> [   32.170398] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>> [   32.170586] CR2: 00007f0f04400000 CR3: 0000000004caa000 CR4:
> >>> 0000000000750ee0
> >>> [   32.170796] PKRU: 55555554
> >>> [   32.170888] Call Trace:
> >>> [   32.170975]  <IRQ>
> >>> [   32.171039]  skb_release_all+0x2e/0x40
> >>> [   32.171152]  napi_consume_skb+0x62/0xf0
> >>> [   32.171281]  net_rx_action+0xf6/0x250
> >>> [   32.171394]  __do_softirq+0xdf/0x2c0
> >>> [   32.171506]  do_softirq+0x81/0xa0
> >>> [   32.171608]  </IRQ>
> >>>
> >>>
> >>> Xin came up with this patch a couple of months ago that resolves the
> >>> panic but it has a big impact on performance:
> >>>
> >>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> >>> index 0fbd5c85155f..6c2c8d09fd89 100644
> >>> --- a/net/core/skbuff.c
> >>> +++ b/net/core/skbuff.c
> >>> @@ -1717,7 +1717,7 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t
> >>> gfp_mask)
> >>>   {
> >>>          int num_frags = skb_shinfo(skb)->nr_frags;
> >>>          struct page *page, *head = NULL;
> >>> -       int i, new_frags;
> >>> +       int i, new_frags, pagelen;
> >>>          u32 d_off;
> >>>
> >>>          if (skb_shared(skb) || skb_unclone(skb, gfp_mask))
> >>> @@ -1733,7 +1733,16 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t
> >>> gfp_mask)
> >>>                  return 0;
> >>>          }
> >>>
> >>> -       new_frags = (__skb_pagelen(skb) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> >>> +       pagelen = __skb_pagelen(skb);
> >>> +       if (pagelen > GSO_LEGACY_MAX_SIZE) {
> >>> +               /* without hugepages, skb frags can only hold 65536 data. */
> >> This is with CONFIG_MAX_SKB_FRAGS 17 I suppose.
> > correct, I did not enable that config so it defaults to 17.
> >
> >> So is the issue just that new_frags ends up indexing out of bounds
> >> in frags[MAX_SKB_FRAGS]?
> > yes, I see nr_frags at 32 which is clearly way out of bounds and it
> > seems to be the skb_copy_ubufs function causing it.
>
>
> Solution is to sense which page order is necessary in order to fit the
> skb length into MAX_SKB_FRAGS pages. alloc_page() -> alloc_pages(order);
> alloc_skb_with_frags() has a similar strategy (but different goals)
>
> I can prepare a patch.
>

I will test the following:

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 2112146092bfe24061b24b9e4684bcc031a045b9..891ecca40b29af1e15a89745f7fc630b19ea0202 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1758,7 +1758,7 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 {
        int num_frags = skb_shinfo(skb)->nr_frags;
        struct page *page, *head = NULL;
-       int i, new_frags;
+       int i, order, psize, new_frags;
        u32 d_off;

        if (skb_shared(skb) || skb_unclone(skb, gfp_mask))
@@ -1767,9 +1767,17 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
        if (!num_frags)
                goto release;

-       new_frags = (__skb_pagelen(skb) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+       /* We might have to allocate high order pages, so compute what minimum
+        * page order is needed.
+        */
+       order = 0;
+       while ((PAGE_SIZE << order) * MAX_SKB_FRAGS < __skb_pagelen(skb))
+               order++;
+       psize = (PAGE_SIZE << order);
+
+       new_frags = (__skb_pagelen(skb) + psize - 1) >> (PAGE_SHIFT + order);
        for (i = 0; i < new_frags; i++) {
-               page = alloc_page(gfp_mask);
+               page = alloc_pages(gfp_mask, order);
                if (!page) {
                        while (head) {
                                struct page *next = (struct page *)page_private(head);
@@ -1796,11 +1804,11 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
                        vaddr = kmap_atomic(p);

                        while (done < p_len) {
-                               if (d_off == PAGE_SIZE) {
+                               if (d_off == psize) {
                                        d_off = 0;
                                        page = (struct page *)page_private(page);
                                }
-                               copy = min_t(u32, PAGE_SIZE - d_off, p_len - done);
+                               copy = min_t(u32, psize - d_off, p_len - done);
                                memcpy(page_address(page) + d_off,
                                       vaddr + p_off + done, copy);
                                done += copy;
@@ -1816,7 +1824,7 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)

        /* skb frags point to kernel buffers */
        for (i = 0; i < new_frags - 1; i++) {
-               __skb_fill_page_desc(skb, i, head, 0, PAGE_SIZE);
+               __skb_fill_page_desc(skb, i, head, 0, psize);
                head = (struct page *)page_private(head);
        }
        __skb_fill_page_desc(skb, new_frags - 1, head, 0, d_off);
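
For the reproducer's numbers (510 * 1024 = 522240 bytes of page data,
4 KB pages, MAX_SKB_FRAGS = 17) that loop should settle on order 3:
16 KB * 17 = 278528 is still short of 522240, while 32 KB * 17 = 557056
fits, giving new_frags = DIV_ROUND_UP(522240, 32768) = 16, safely back
under MAX_SKB_FRAGS.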

Patch

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 0fbd5c85155f..6c2c8d09fd89 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1717,7 +1717,7 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 {
        int num_frags = skb_shinfo(skb)->nr_frags;
        struct page *page, *head = NULL;
-       int i, new_frags;
+       int i, new_frags, pagelen;
        u32 d_off;