Message ID: 20240424165646.1625690-2-dtatulea@nvidia.com
State: RFC
Delegated to: Netdev Maintainers
Series: [RFC] net: Fix one page_pool page leak from skb_frag_unref
On Wed, 2024-04-24 at 19:56 +0300, Dragos Tatulea wrote:
> The patch referenced in the fixes tag introduced the skb_frag_unref API.
This is wrong actually. Only skb_frag_ref was introduced. Sorry for the
confusion.

> This API still has a dark corner: when skb->pp_recycle is false and a
> fragment being referenced is backed by a page_pool page, skb_frag_unref
> can leak the page out of the page_pool, like in the following example:
>
> BUG: Bad page state in process swapper/4 pfn:103423
> page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x103423000 pfn:0x103423
> flags: 0x200000000000000(node=0|zone=2)
> page_type: 0xffffffff()
> raw: 0200000000000000 dead000000000040 ffff888106f38000 0000000000000000
> raw: 0000000103423000 0000000000000041 00000000ffffffff 0000000000000000
> page dumped because: page_pool leak
> Modules linked in: act_mirred act_csum act_pedit act_gact cls_flower
> act_ct nf_flow_table sch_ingress xt_conntrack xt_MASQUERADE
> nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat
> br_netfilter overlay rpcrdma rdma_ucm ib_iser libiscsi
> scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib
> ib_uverbs ib_core zram zsmalloc mlx5_core fuse
> CPU: 4 PID: 0 Comm: swapper/4 Not tainted 6.9.0-rc4+ #2
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
> Call Trace:
>  <IRQ>
>  dump_stack_lvl+0x53/0x70
>  bad_page+0x6f/0xf0
>  free_unref_page_prepare+0x271/0x420
>  free_unref_page+0x38/0x120
>  ___pskb_trim+0x261/0x390
>  skb_segment+0xf60/0x1040
>  tcp_gso_segment+0xe8/0x4e0
>  inet_gso_segment+0x155/0x3d0
>  skb_mac_gso_segment+0x99/0x100
>  __skb_gso_segment+0xb4/0x160
>  ? netif_skb_features+0x95/0x2f0
>  validate_xmit_skb+0x16c/0x330
>  validate_xmit_skb_list+0x4c/0x70
>  sch_direct_xmit+0x23e/0x350
>  __dev_queue_xmit+0x334/0xbc0
>  tcf_mirred_to_dev+0x2a5/0x3e0 [act_mirred]
>  tcf_mirred_act+0xd7/0x4b0 [act_mirred]
>  ? tcf_pedit_act+0x6f/0x540 [act_pedit]
>  tcf_action_exec+0x82/0x170
>  fl_classify+0x1ee/0x200 [cls_flower]
>  ? tcf_mirred_to_dev+0x2a5/0x3e0 [act_mirred]
>  ? mlx5e_completion_event+0x39/0x40 [mlx5_core]
>  ? mlx5_eq_comp_int+0x1d7/0x1f0 [mlx5_core]
>  tcf_classify+0x26a/0x470
>  tc_run+0xa2/0x120
>  ? handle_irq_event+0x53/0x80
>  ? kvm_clock_get_cycles+0x11/0x20
>  __netif_receive_skb_core.constprop.0+0x932/0xee0
>  __netif_receive_skb_list_core+0xfe/0x1f0
>  netif_receive_skb_list_internal+0x1b5/0x2b0
>  napi_gro_complete.constprop.0+0xee/0x120
>  dev_gro_receive+0x3f4/0x710
>  napi_gro_receive+0x7d/0x220
>  mlx5e_handle_rx_cqe_mpwrq+0x10d/0x1d0 [mlx5_core]
>  mlx5e_poll_rx_cq+0x8b/0x6f0 [mlx5_core]
>  mlx5e_napi_poll+0xdc/0x6c0 [mlx5_core]
>  __napi_poll+0x25/0x1b0
>  net_rx_action+0x2c1/0x330
>  __do_softirq+0xcb/0x28c
>  irq_exit_rcu+0x69/0x90
>  common_interrupt+0x85/0xa0
>  </IRQ>
>  <TASK>
>  asm_common_interrupt+0x26/0x40
> RIP: 0010:default_idle+0x17/0x20
> Code: 00 4d 29 c8 4c 01 c7 4c 29 c2 e9 76 ff ff ff cc cc cc cc f3 0f 1e
> fa 8b 05 76 3f 0a 01 85 c0 7e 07 0f 00 2d 1d 74 41 00 fb f4 <fa> c3 0f
> 1f 80 00 00 00 00 f3 0f 1e fa 65 48 8b 35 04 b3 42 7e f0
> RSP: 0018:ffff88810087bed8 EFLAGS: 00000246
> RAX: 0000000000000000 RBX: ffff8881008415c0 RCX: 000000e116d359fb
> RDX: 0000000000000000 RSI: ffffffff8223e1d1 RDI: 000000000003f214
> RBP: 0000000000000004 R08: 000000000003f214 R09: 000000e116d359fb
> R10: 000000e116d359fb R11: 000000000005dfee R12: 0000000000000004
> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>  default_idle_call+0x3d/0xf0
>  do_idle+0x1ce/0x1e0
>  cpu_startup_entry+0x29/0x30
>  start_secondary+0x109/0x130
>  common_startup_64+0x129/0x138
>  </TASK>
>
> How it happens:
> -> skb_segment
>    -> clone skb into nskb
>    -> call __pskb_trim(nskb)
>       -> call pskb_expand_head(nskb) because nskb is cloned
>          -> call skb_release_data(nskb) because nskb is cloned
>             -> nskb->pp_recycle = 0
>       -> skb_unref(nskb->frag[i], nskb)
>          -> nskb->pp_recycle is false, frag is page_pool page
>             -> Fragment is released via put_page(frag page), hence leaking
>                from the page_pool.
>
> Something tells me that this is not the only issue of this kind...
>
> The patch itself contains a suggested fix for this specific bug: it
> forces the unref in ___pskb_trim to recycle to the page_pool. If the
> page is not a page_pool page, it will be released as a regular page.
>
> The alternative would be to save the skb->pp_recycle flag before
> pskb_expand_head and use it later during the unref.
>
One more interesting point is the comment in skb_release_data [1] and its
commit message [2] from Ilias.

[1] https://elixir.bootlin.com/linux/v6.9-rc5/source/net/core/skbuff.c#L1137
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2cc3aeb5ecccec0d266813172fcd82b4b5fa5803

> Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
> Co-developed-by: Jianbo Liu <jianbol@nvidia.com>
> Fixes: a580ea994fd3 ("net: mirror skb frag ref/unref helpers")
> Cc: Mina Almasry <almasrymina@google.com>
> ---
>  net/core/skbuff.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 37c858dc11a6..ab75b4f876ce 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -2634,7 +2634,7 @@ int ___pskb_trim(struct sk_buff *skb, unsigned int len)
>  	skb_shinfo(skb)->nr_frags = i;
>
>  	for (; i < nfrags; i++)
> -		skb_frag_unref(skb, i);
> +		__skb_frag_unref(&skb_shinfo(skb)->frags[i], true);
>
>  	if (skb_has_frag_list(skb))
>  		skb_drop_fraglist(skb);
On Wed, Apr 24, 2024 at 10:25 AM Dragos Tatulea <dtatulea@nvidia.com> wrote:
>
> On Wed, 2024-04-24 at 19:56 +0300, Dragos Tatulea wrote:
> > The patch referenced in the fixes tag introduced the skb_frag_unref API.
> This is wrong actually. Only skb_frag_ref was introduced. Sorry for the
> confusion.
>
> > This API still has a dark corner: when skb->pp_recycle is false and a
> > fragment being referenced is backed by a page_pool page, skb_frag_unref
> > can leak the page out of the page_pool, like in the following example:
> >
> > [...]
> >
> > How it happens:
> > -> skb_segment
> >    -> clone skb into nskb
> >    -> call __pskb_trim(nskb)
> >       -> call pskb_expand_head(nskb) because nskb is cloned
> >          -> call skb_release_data(nskb) because nskb is cloned
> >             -> nskb->pp_recycle = 0
> >       -> skb_unref(nskb->frag[i], nskb)
> >          -> nskb->pp_recycle is false, frag is page_pool page
> >             -> Fragment is released via put_page(frag page), hence leaking
> >                from the page_pool.
> >
> > Something tells me that this is not the only issue of this kind...
> >
> > The patch itself contains a suggested fix for this specific bug: it
> > forces the unref in ___pskb_trim to recycle to the page_pool. If the
> > page is not a page_pool page, it will be released as a regular page.
> >
> > The alternative would be to save the skb->pp_recycle flag before
> > pskb_expand_head and use it later during the unref.
> >
> One more interesting point is the comment in skb_release_data [1] and its
> commit message [2] from Ilias.
>

Sorry for the bug. I don't think this is the right fix to be honest
though. As you mention this is only 1 such instance of a general
underlying issue.

The underlying issue is that before the fixes commit, skb_frag_ref()
grabbed a regular page ref. After the fixes commit, skb_frag_ref()
grabs a page-pool ref. The unref path always dropped a regular page
ref, thanks to this commit as you point out:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2cc3aeb5ecccec0d266813172fcd82b4b5fa5803

AFAICT the correct fix is to actually revert commit 2cc3aeb5eccc
("skbuff: Fix a potential race while recycling page_pool packets").
The reason is that now that skb_frag_ref() can grab page-pool refs, we
don't need to make sure there is only 1 SKB that triggers the recycle
path anymore. All the skb and its clones can obtain page-pool refs,
and in the unref path we drop the page-pool refs. page_pool_put_page()
detects correctly that the last page-pool ref is put and recycles the
page only then.

I'm unable to repro this issue. Do you need to do anything special? Or
is 1 flow that hits skb_segment() good enough?

Reverting commit 2cc3aeb5eccc ("skbuff: Fix a potential race while
recycling page_pool packets") shows no regressions for me. Is it
possible to try that out?

If that doesn't work, I think I prefer reverting a580ea994fd3 ("net:
mirror skb frag ref/unref helpers") rather than merging this fix to
make sure we removed the underlying cause of the issue.

> [1] https://elixir.bootlin.com/linux/v6.9-rc5/source/net/core/skbuff.c#L1137
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2cc3aeb5ecccec0d266813172fcd82b4b5fa5803
>
> > Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
> > [...]
On Wed, 2024-04-24 at 15:08 -0700, Mina Almasry wrote:
> On Wed, Apr 24, 2024 at 10:25 AM Dragos Tatulea <dtatulea@nvidia.com> wrote:
> >
> > On Wed, 2024-04-24 at 19:56 +0300, Dragos Tatulea wrote:
> > > [...]
> >
> > One more interesting point is the comment in skb_release_data [1] and its
> > commit message [2] from Ilias.
> >
>
> Sorry for the bug. I don't think this is the right fix to be honest
> though. As you mention this is only 1 such instance of a general
> underlying issue.
>
Right. It was a conversation starter :).

> The underlying issue is that before the fixes commit, skb_frag_ref()
> grabbed a regular page ref. After the fixes commit, skb_frag_ref()
> grabs a page-pool ref.
>
It is still possible for skb_frag_ref() to grab a regular page ref for a
page_pool page: for example in skb_segment() when a nskb is created (not
cloned, so pp_recycle is off) and frags with page_pool pages are added to
it. I was seeing this in my test as well.

If the intention is to always use page_pool ref/unref for page_pool pages,
then the recycle flag check doesn't belong in skb_frag_ref/unref().

> The unref path always dropped a regular page
> ref, thanks to this commit as you point out:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2cc3aeb5ecccec0d266813172fcd82b4b5fa5803
>
> AFAICT the correct fix is to actually revert commit 2cc3aeb5eccc
> ("skbuff: Fix a potential race while recycling page_pool packets").
> The reason is that now that skb_frag_ref() can grab page-pool refs, we
> don't need to make sure there is only 1 SKB that triggers the recycle
> path anymore. All the skb and its clones can obtain page-pool refs,
> and in the unref path we drop the page-pool refs. page_pool_put_page()
> detects correctly that the last page-pool ref is put and recycles the
> page only then.
>
I don't think this is a good way forward. For example, skb->pp_recycle is
used as a hint in skb_gro_receive to avoid coalescing skbs with different
pp_recycle flag states. Reverting could interfere with that.

> I'm unable to repro this issue. Do you need to do anything special? Or
> is 1 flow that hits skb_segment() good enough?
>
I don't have a very easy repro. But the test is a forwarding one.

> Reverting commit 2cc3aeb5eccc ("skbuff: Fix a potential race while
> recycling page_pool packets") shows no regressions for me. Is it
> possible to try that out?
>
I can try it out, it might work. But again, I fear that simply reverting
this change can trigger other unexpected behaviours. Also, this doesn't
handle the behaviour of skb_frag_ref mentioned above.

> If that doesn't work, I think I prefer
> reverting a580ea994fd3 ("net: mirror skb frag ref/unref helpers")
> rather than merging this fix to make sure we removed the underlying
> cause of the issue.
>
This is the safest bet.

So, to recap, I see 2 possibilities:

1) Revert a580ea994fd3 ("net: mirror skb frag ref/unref helpers"): safe,
   but it will probably have to come back in one way or another.
2) Drop the recycle checks from skb_frag_ref/unref: this enforces the rule
   of always referencing/dereferencing pages based on their type (page_pool
   or normal).

Thanks,
Dragos

> > [1] https://elixir.bootlin.com/linux/v6.9-rc5/source/net/core/skbuff.c#L1137
> > [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2cc3aeb5ecccec0d266813172fcd82b4b5fa5803
> >
> > > [...]
On Thu, Apr 25, 2024 at 1:17 AM Dragos Tatulea <dtatulea@nvidia.com> wrote:
>
> On Wed, 2024-04-24 at 15:08 -0700, Mina Almasry wrote:
> > If that doesn't work, I think I prefer
> > reverting a580ea994fd3 ("net: mirror skb frag ref/unref helpers")
> > rather than merging this fix to make sure we removed the underlying
> > cause of the issue.
> This is the safest bet.
>
> So, to recap, I see 2 possibilities:
>
> 1) Revert a580ea994fd3 ("net: mirror skb frag ref/unref helpers"): safe,
>    but it will probably have to come back in one way or another.
> 2) Drop the recycle checks from skb_frag_ref/unref: this enforces the rule
>    of always referencing/dereferencing pages based on their type (page_pool
>    or normal).
>

If this works, I would be very happy. I personally think ref/unref
should be done based on the page type. For me the net stack using the
regular {get|put}_page on a pp page isn't great. It requires special
handling to make sure the ref + unref are in sync. Also, if the last pp
ref is dropped while there are pending regular refs, the
__page_pool_page_can_be_recycled() check will fail and the page will
not be recycled.

On the other hand, since 0a149ab78ee2 ("page_pool: transition to
reference count management after page draining") I'm not sure there is
any reason to continue to use get/put_page on pp-pages; we can use the
new pp-ref instead.

I don't see any regressions with this diff (needs cleanup), but your
test setup seems much much better than mine (I think this is the
second reffing issue you managed to repro):

diff --git a/include/linux/skbuff_ref.h b/include/linux/skbuff_ref.h
index 4dcdbe9fbc5f..4c72227dce1b 100644
--- a/include/linux/skbuff_ref.h
+++ b/include/linux/skbuff_ref.h
@@ -31,7 +31,7 @@ static inline bool napi_pp_get_page(struct page *page)
 static inline void skb_page_ref(struct page *page, bool recycle)
 {
 #ifdef CONFIG_PAGE_POOL
-	if (recycle && napi_pp_get_page(page))
+	if (napi_pp_get_page(page))
 		return;
 #endif
 	get_page(page);
@@ -69,7 +69,7 @@ static inline void
 skb_page_unref(struct page *page, bool recycle)
 {
 #ifdef CONFIG_PAGE_POOL
-	if (recycle && napi_pp_put_page(page))
+	if (napi_pp_put_page(page))
 		return;
 #endif
 	put_page(page);
On Thu, 2024-04-25 at 12:20 -0700, Mina Almasry wrote:
> On Thu, Apr 25, 2024 at 1:17 AM Dragos Tatulea <dtatulea@nvidia.com> wrote:
> >
> > So, to recap, I see 2 possibilities:
> >
> > 1) Revert a580ea994fd3 ("net: mirror skb frag ref/unref helpers"): safe,
> >    but it will probably have to come back in one way or another.
> > 2) Drop the recycle checks from skb_frag_ref/unref: this enforces the rule
> >    of always referencing/dereferencing pages based on their type (page_pool
> >    or normal).
> >
>
> If this works, I would be very happy. I personally think ref/unref
> should be done based on the page type.
>
> [...]
>
> I don't see any regressions with this diff (needs cleanup), but your
> test setup seems much much better than mine (I think this is the
> second reffing issue you managed to repro):
>
> [...]
>
This is option 2. I thought this would fix everything. But I just tested
and it's not the case: we are now reaching a negative pp_ref_count. So
probably somewhere a regular page reference is still being taken...

Thanks,
Dragos
On Thu, Apr 25, 2024 at 12:48 PM Dragos Tatulea <dtatulea@nvidia.com> wrote:
>
> On Thu, 2024-04-25 at 12:20 -0700, Mina Almasry wrote:
> > [...]
> >
> This is option 2. I thought this would fix everything. But I just tested
> and it's not the case: we are now reaching a negative pp_ref_count. So
> probably somewhere a regular page reference is still being taken...
>

I would guess the most likely root cause of this would be a call site
that does get_page() instead of skb_frag_ref(), right?

The other possibility would be something like:

- page is not a pp page
- skb_page_ref(page) // obtains a regular reference
- page is converted to a pp page
- skb_page_unref(page) // drops a pp reference

But I'm not aware of non-pp pages ever being converted to pp pages.

You probably figured this out already, but if you would like to dig
further instead of reverting the offending patch, this diff would
probably catch the get_page() callsite, no? (On my test setup this
debug code doesn't trigger.)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7b0ee64225de..a22a676f4b6b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1473,8 +1473,14 @@ static inline void folio_get(struct folio *folio)
 	folio_ref_inc(folio);
 }

+static inline bool debug_is_pp_page(struct page *page)
+{
+	return (page->pp_magic & ~0x3UL) == PP_SIGNATURE;
+}
+
 static inline void get_page(struct page *page)
 {
+	WARN_ON_ONCE(debug_is_pp_page(page));
 	folio_get(page_folio(page));
 }

@@ -1569,6 +1575,8 @@ static inline void put_page(struct page *page)
 {
 	struct folio *folio = page_folio(page);

+	WARN_ON_ONCE(debug_is_pp_page(page));
+
 	/*
 	 * For some devmap managed pages we need to catch refcount transition
 	 * from 2 to 1:
On Thu, 2024-04-25 at 13:42 -0700, Mina Almasry wrote: > On Thu, Apr 25, 2024 at 12:48 PM Dragos Tatulea <dtatulea@nvidia.com> wrote: > > > > On Thu, 2024-04-25 at 12:20 -0700, Mina Almasry wrote: > > > On Thu, Apr 25, 2024 at 1:17 AM Dragos Tatulea <dtatulea@nvidia.com> wrote: > > > > > > > > On Wed, 2024-04-24 at 15:08 -0700, Mina Almasry wrote: > > > > > If that doesn't work, I think I prefer > > > > > reverting a580ea994fd3 ("net: mirror skb frag ref/unref helpers") > > > > > rather than merging this fix to make sure we removed the underlying > > > > > cause of the issue. > > > > This is the safest bet. > > > > > > > > So, to recap, I see 2 possibilities: > > > > > > > > 1) Revert a580ea994fd3 ("net: mirror skb frag ref/unref helpers"): safe, but it > > > > will probably have to come back in one way or another. > > > > 2) Drop the recycle checks from skb_frag_ref/unref: this enforces the rule of > > > > always referencing/dereferencing pages based on their type (page_pool or > > > > normal). > > > > > > > > > > If this works, I would be very happy. I personally think ref/unref > > > should be done based on the page type. For me the net stack using the > > > regular {get|put}_page on a pp page isn't great. It requires special > > > handling to make sure the ref + unref are in sync. Also if the last pp > > > ref is dropped while there are pending regular refs, > > > __page_pool_page_can_be_recycled() check will fail and the page will > > > not be recycled. > > > > > > On the other hand, since 0a149ab78ee2 ("page_pool: transition to > > > reference count management after page draining") I'm not sure there is > > > any reason to continue to use get/put_page on pp-pages, we can use the > > > new pp-ref instead. 
> > >
> > > I don't see any regressions with this diff (needs cleanup), but your
> > > test setup seems much much better than mine (I think this is the
> > > second reffing issue you managed to repro):
> > >
> > > diff --git a/include/linux/skbuff_ref.h b/include/linux/skbuff_ref.h
> > > index 4dcdbe9fbc5f..4c72227dce1b 100644
> > > --- a/include/linux/skbuff_ref.h
> > > +++ b/include/linux/skbuff_ref.h
> > > @@ -31,7 +31,7 @@ static inline bool napi_pp_get_page(struct page *page)
> > >  static inline void skb_page_ref(struct page *page, bool recycle)
> > >  {
> > >  #ifdef CONFIG_PAGE_POOL
> > > -	if (recycle && napi_pp_get_page(page))
> > > +	if (napi_pp_get_page(page))
> > >  		return;
> > >  #endif
> > >  	get_page(page);
> > > @@ -69,7 +69,7 @@ static inline void
> > >  skb_page_unref(struct page *page, bool recycle)
> > >  {
> > >  #ifdef CONFIG_PAGE_POOL
> > > -	if (recycle && napi_pp_put_page(page))
> > > +	if (napi_pp_put_page(page))
> > >  		return;
> > >  #endif
> > >  	put_page(page);
> > >
> >
> > This is option 2. I thought this would fix everything. But I just tested and
> > it's not the case: we are now reaching a negative pp_ref_count.
> >
I was tired and botched the revert of the code in this RFC when testing.
Dropping the recycle flag works as expected. Do we need an RFC v2 or is this
non-RFC material?

Thanks,
Dragos

> > So probably
> > somewhere a regular page reference is still being taken...
> >
>
> I would guess the most likely root cause of this would be a call site
> that does get_page() instead of skb_frag_ref(), right?
>
> The other possibility would be if something like:
>
> - page is not pp_page
> - skb_page_ref(page) // obtains a regular reference.
> - page is converted to pp_page
> - skb_page_unref(page) // drops a pp reference.
>
> But I'm not aware of non-pp pages ever being converted to pp pages.
>
> You probably figured this out already, but if you would like to dig
> further instead of reverting the offending patch, this diff would
> probably catch the get_page() callsite, no? (on my test setup this
> debug code doesn't trigger).
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 7b0ee64225de..a22a676f4b6b 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1473,8 +1473,14 @@ static inline void folio_get(struct folio *folio)
>  	folio_ref_inc(folio);
>  }
>
> +static inline bool debug_is_pp_page(struct page *page)
> +{
> +	return (page->pp_magic & ~0x3UL) == PP_SIGNATURE;
> +}
> +
>  static inline void get_page(struct page *page)
>  {
> +	WARN_ON_ONCE(debug_is_pp_page(page));
>  	folio_get(page_folio(page));
>  }
>
> @@ -1569,6 +1575,8 @@ static inline void put_page(struct page *page)
>  {
>  	struct folio *folio = page_folio(page);
>
> +	WARN_ON_ONCE(debug_is_pp_page(page));
> +
>  	/*
>  	 * For some devmap managed pages we need to catch refcount transition
>  	 * from 2 to 1:
>
On Fri, Apr 26, 2024 at 7:59 AM Dragos Tatulea <dtatulea@nvidia.com> wrote: > > On Thu, 2024-04-25 at 13:42 -0700, Mina Almasry wrote: > > On Thu, Apr 25, 2024 at 12:48 PM Dragos Tatulea <dtatulea@nvidia.com> wrote: > > > > > > On Thu, 2024-04-25 at 12:20 -0700, Mina Almasry wrote: > > > > diff --git a/include/linux/skbuff_ref.h b/include/linux/skbuff_ref.h > > > > index 4dcdbe9fbc5f..4c72227dce1b 100644 > > > > --- a/include/linux/skbuff_ref.h > > > > +++ b/include/linux/skbuff_ref.h > > > > @@ -31,7 +31,7 @@ static inline bool napi_pp_get_page(struct page *page) > > > > static inline void skb_page_ref(struct page *page, bool recycle) > > > > { > > > > #ifdef CONFIG_PAGE_POOL > > > > - if (recycle && napi_pp_get_page(page)) > > > > + if (napi_pp_get_page(page)) > > > > return; > > > > #endif > > > > get_page(page); > > > > @@ -69,7 +69,7 @@ static inline void > > > > skb_page_unref(struct page *page, bool recycle) > > > > { > > > > #ifdef CONFIG_PAGE_POOL > > > > - if (recycle && napi_pp_put_page(page)) > > > > + if (napi_pp_put_page(page)) > > > > return; > > > > #endif > > > > put_page(page); > > > > > > > > > > > This is option 2. I thought this would fix everything. But I just tested and > > > it's not the case: we are now reaching a negative pp_ref_count. > > > > I was tired and botched the revert of the code in this RFC when testing. > Dropping the recycle flag works as expected. Do we need an RFC v2 or is this non > RFC material? > > Thanks, > Dragos > IMO, it needs to be cleaned up to remove the bool recycle flag dead code, but other than that I think it's good as a non RFC. To save you some time I did the said clean up and here is a patch passing make allmodconfig build + checkpatch/kdoc checks + tests on my end: https://pastebin.com/bX5KcHTb I could submit it to the list if you prefer. FWIW, if this approach shows no regressions, I think we may be able to deprecate skb->pp_recycle flag entirely, but I can look into that as a separate change. 
--
Thanks,
Mina
On Thu, 25 Apr 2024 08:17:28 +0000 Dragos Tatulea wrote: > > The unref path always dropped a regular page > > ref, thanks to this commit as you point out: > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2cc3aeb5ecccec0d266813172fcd82b4b5fa5803 > > > > AFAICT the correct fix is to actually revert commit 2cc3aeb5eccc > > ("skbuff: Fix a potential race while recycling page_pool packets"). > > The reason is that now that skb_frag_ref() can grab page-pool refs, we > > don't need to make sure there is only 1 SKB that triggers the recycle > > path anymore. All the skb and its clones can obtain page-pool refs, > > and in the unref path we drop the page-pool refs. page_pool_put_page() > > detects correctly that the last page-pool ref is put and recycles the > > page only then. > > > I don't think this is a good way forward. For example, skb->pp_recycle is used > as a hint in skb_gro_receive to avoid coalescing skbs with different pp_recycle > flag states. This could interfere with that. That's a bit speculative, right? The simple invariant we are trying to hold is that if skb->pp_recycle && skb_frag_is_pp(skb, i) then the reference skb is holding on that frag is a pp reference, not page reference. skb_gro_receive() needs to maintain that invariant, if it doesn't we need to fix it..
On Wed, 24 Apr 2024 15:08:42 -0700 Mina Almasry wrote: > I'm unable to repro this issue. Do you need to do anything special? Or > is 1 flow that hits skb_segment() good enough? At some point we may want to start writing unit tests to make sure we don't regress the 5 corner cases we found previously.. ;) Should be easy to operate on skbs under kunit.
On Thu, 25 Apr 2024 12:20:59 -0700 Mina Almasry wrote:
> -	if (recycle && napi_pp_get_page(page))
> +	if (napi_pp_get_page(page))

Pretty sure you can't do that. The "recycle" here is a concurrency
guarantee. A guarantee someone is holding a pp ref on that page,
a ref which will not go away while napi_pp_get_page() is executing.
On Fri, Apr 26, 2024 at 4:09 PM Jakub Kicinski <kuba@kernel.org> wrote: > > On Thu, 25 Apr 2024 12:20:59 -0700 Mina Almasry wrote: > > - if (recycle && napi_pp_get_page(page)) > > + if (napi_pp_get_page(page)) > > Pretty sure you can't do that. The "recycle" here is a concurrency > guarantee. A guarantee someone is holding a pp ref on that page, > a ref which will not go away while napi_pp_get_page() is executing. I don't mean to argue, but I think the get_page()/put_page() pair we do in the page ref path is susceptible to the same issue. AFAIU it's not safe to get_page() if another CPU can be dropping the last ref, get_page_unless_zero() should be used instead. Since get_page() is good in the page ref path without some guarantee, it's not obvious to me why we need this guarantee in the pp ref path, but I could be missing some subtlety. At any rate, if you prefer us going down the road of reverting commit 2cc3aeb5eccc ("skbuff: Fix a potential race while recycling page_pool packets"), I think that could also fix the issue.
On Fri, 2024-04-26 at 16:05 -0700, Jakub Kicinski wrote: > On Thu, 25 Apr 2024 08:17:28 +0000 Dragos Tatulea wrote: > > > The unref path always dropped a regular page > > > ref, thanks to this commit as you point out: > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2cc3aeb5ecccec0d266813172fcd82b4b5fa5803 > > > > > > AFAICT the correct fix is to actually revert commit 2cc3aeb5eccc > > > ("skbuff: Fix a potential race while recycling page_pool packets"). > > > The reason is that now that skb_frag_ref() can grab page-pool refs, we > > > don't need to make sure there is only 1 SKB that triggers the recycle > > > path anymore. All the skb and its clones can obtain page-pool refs, > > > and in the unref path we drop the page-pool refs. page_pool_put_page() > > > detects correctly that the last page-pool ref is put and recycles the > > > page only then. > > > > > I don't think this is a good way forward. For example, skb->pp_recycle is used > > as a hint in skb_gro_receive to avoid coalescing skbs with different pp_recycle > > flag states. This could interfere with that. > > That's a bit speculative, right? The simple invariant we are trying to > hold is that if skb->pp_recycle && skb_frag_is_pp(skb, i) then the > reference skb is holding on that frag is a pp reference, not page > reference. > Yes, it was a speculative statement. After re-reading it and the code of skb_gro_receive() it makes less sense now. Mina's suggestion to revert commit 2cc3aeb5eccc ("skbuff: Fix a potential race while recycling page_pool packets") seems less scary now. I just hope we don't bump into too many scenarios similar to the ipsec one... > skb_gro_receive() needs to maintain that invariant, if it doesn't > we need to fix it.. > Thanks, Dragos
On Fri, 26 Apr 2024 21:24:09 -0700 Mina Almasry wrote: > On Fri, Apr 26, 2024 at 4:09 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > On Thu, 25 Apr 2024 12:20:59 -0700 Mina Almasry wrote: > > > - if (recycle && napi_pp_get_page(page)) > > > + if (napi_pp_get_page(page)) > > > > Pretty sure you can't do that. The "recycle" here is a concurrency > > guarantee. A guarantee someone is holding a pp ref on that page, > > a ref which will not go away while napi_pp_get_page() is executing. > > I don't mean to argue, but I think the get_page()/put_page() pair we > do in the page ref path is susceptible to the same issue. AFAIU it's > not safe to get_page() if another CPU can be dropping the last ref, > get_page_unless_zero() should be used instead. Whoever gave us the pointer to operate on has a reference, so the page can't disappear. get_page() is safe. The problem with pp is that we don't know whether the caller has a pp ref or a page ref. IOW the pp ref may not be owned by whoever called us. I guess the situation may actually be worse and we can only pp-ref a page if both "source" and "destination" skb has pp_recycle = 1 :S > Since get_page() is good in the page ref path without some guarantee, > it's not obvious to me why we need this guarantee in the pp ref path, > but I could be missing some subtlety. At any rate, if you prefer us > going down the road of reverting commit 2cc3aeb5eccc ("skbuff: Fix a > potential race while recycling page_pool packets"), I think that could > also fix the issue.
On Mon, 2024-04-29 at 09:39 +0200, Dragos Tatulea wrote: > On Fri, 2024-04-26 at 16:05 -0700, Jakub Kicinski wrote: > > On Thu, 25 Apr 2024 08:17:28 +0000 Dragos Tatulea wrote: > > > > The unref path always dropped a regular page > > > > ref, thanks to this commit as you point out: > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2cc3aeb5ecccec0d266813172fcd82b4b5fa5803 > > > > > > > > AFAICT the correct fix is to actually revert commit 2cc3aeb5eccc > > > > ("skbuff: Fix a potential race while recycling page_pool packets"). > > > > The reason is that now that skb_frag_ref() can grab page-pool refs, we > > > > don't need to make sure there is only 1 SKB that triggers the recycle > > > > path anymore. All the skb and its clones can obtain page-pool refs, > > > > and in the unref path we drop the page-pool refs. page_pool_put_page() > > > > detects correctly that the last page-pool ref is put and recycles the > > > > page only then. > > > > > > > I don't think this is a good way forward. For example, skb->pp_recycle is used > > > as a hint in skb_gro_receive to avoid coalescing skbs with different pp_recycle > > > flag states. This could interfere with that. > > > > That's a bit speculative, right? The simple invariant we are trying to > > hold is that if skb->pp_recycle && skb_frag_is_pp(skb, i) then the > > reference skb is holding on that frag is a pp reference, not page > > reference. > > > Yes, it was a speculative statement. After re-reading it and the code of > skb_gro_receive() it makes less sense now. > > Mina's suggestion to revert commit 2cc3aeb5eccc ("skbuff: Fix a potential race > while recycling page_pool packets") seems less scary now. I just hope we don't > bump into too many scenarios similar to the ipsec one... > > > skb_gro_receive() needs to maintain that invariant, if it doesn't > > we need to fix it.. > > > Gentle ping. 
Not sure how to proceed with this: 1) Revert commit 2cc3aeb5eccc ("skbuff: Fix a potential race while recycling page_pool packets"). I tested this btw and it works (for this specific scenario). 2) Revert Mina's commit a580ea994fd3 ("net: mirror skb frag ref/unref helpers") for now. Thanks, Dragos
On Tue, Apr 30, 2024 at 11:20 PM Dragos Tatulea <dtatulea@nvidia.com> wrote: > > On Mon, 2024-04-29 at 09:39 +0200, Dragos Tatulea wrote: > > On Fri, 2024-04-26 at 16:05 -0700, Jakub Kicinski wrote: > > > On Thu, 25 Apr 2024 08:17:28 +0000 Dragos Tatulea wrote: > > > > > The unref path always dropped a regular page > > > > > ref, thanks to this commit as you point out: > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2cc3aeb5ecccec0d266813172fcd82b4b5fa5803 > > > > > > > > > > AFAICT the correct fix is to actually revert commit 2cc3aeb5eccc > > > > > ("skbuff: Fix a potential race while recycling page_pool packets"). > > > > > The reason is that now that skb_frag_ref() can grab page-pool refs, we > > > > > don't need to make sure there is only 1 SKB that triggers the recycle > > > > > path anymore. All the skb and its clones can obtain page-pool refs, > > > > > and in the unref path we drop the page-pool refs. page_pool_put_page() > > > > > detects correctly that the last page-pool ref is put and recycles the > > > > > page only then. > > > > > > > > > I don't think this is a good way forward. For example, skb->pp_recycle is used > > > > as a hint in skb_gro_receive to avoid coalescing skbs with different pp_recycle > > > > flag states. This could interfere with that. > > > > > > That's a bit speculative, right? The simple invariant we are trying to > > > hold is that if skb->pp_recycle && skb_frag_is_pp(skb, i) then the > > > reference skb is holding on that frag is a pp reference, not page > > > reference. > > > > > Yes, it was a speculative statement. After re-reading it and the code of > > skb_gro_receive() it makes less sense now. > > > > Mina's suggestion to revert commit 2cc3aeb5eccc ("skbuff: Fix a potential race > > while recycling page_pool packets") seems less scary now. I just hope we don't > > bump into too many scenarios similar to the ipsec one... 
> >
> > > skb_gro_receive() needs to maintain that invariant, if it doesn't
> > > we need to fix it..
> > >
> Gentle ping. Not sure how to proceed with this:
>
> 1) Revert commit 2cc3aeb5eccc
> ("skbuff: Fix a potential race while recycling page_pool packets"). I tested
> this btw and it works (for this specific scenario).
>
> 2) Revert Mina's commit a580ea994fd3 ("net: mirror skb frag ref/unref helpers")
> for now.
>

I vote for #1, and IIUC Jakub's feedback, he seems to prefer this as
well. If we continue to run into edge cases after the revert of #1, I
think we may want to do #2 and I can look to reland it with the kunit
tests that Jakub suggested that reproduce these edge cases.

I can upload #1 in the morning if there are no objections. I don't see
any regressions with #1 but I was never able to repro this issue.
On Wed, 2024-05-01 at 00:48 -0700, Mina Almasry wrote: > On Tue, Apr 30, 2024 at 11:20 PM Dragos Tatulea <dtatulea@nvidia.com> wrote: > > > > On Mon, 2024-04-29 at 09:39 +0200, Dragos Tatulea wrote: > > > On Fri, 2024-04-26 at 16:05 -0700, Jakub Kicinski wrote: > > > > On Thu, 25 Apr 2024 08:17:28 +0000 Dragos Tatulea wrote: > > > > > > The unref path always dropped a regular page > > > > > > ref, thanks to this commit as you point out: > > > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2cc3aeb5ecccec0d266813172fcd82b4b5fa5803 > > > > > > > > > > > > AFAICT the correct fix is to actually revert commit 2cc3aeb5eccc > > > > > > ("skbuff: Fix a potential race while recycling page_pool packets"). > > > > > > The reason is that now that skb_frag_ref() can grab page-pool refs, we > > > > > > don't need to make sure there is only 1 SKB that triggers the recycle > > > > > > path anymore. All the skb and its clones can obtain page-pool refs, > > > > > > and in the unref path we drop the page-pool refs. page_pool_put_page() > > > > > > detects correctly that the last page-pool ref is put and recycles the > > > > > > page only then. > > > > > > > > > > > I don't think this is a good way forward. For example, skb->pp_recycle is used > > > > > as a hint in skb_gro_receive to avoid coalescing skbs with different pp_recycle > > > > > flag states. This could interfere with that. > > > > > > > > That's a bit speculative, right? The simple invariant we are trying to > > > > hold is that if skb->pp_recycle && skb_frag_is_pp(skb, i) then the > > > > reference skb is holding on that frag is a pp reference, not page > > > > reference. > > > > > > > Yes, it was a speculative statement. After re-reading it and the code of > > > skb_gro_receive() it makes less sense now. > > > > > > Mina's suggestion to revert commit 2cc3aeb5eccc ("skbuff: Fix a potential race > > > while recycling page_pool packets") seems less scary now. 
I just hope we don't > > > bump into too many scenarios similar to the ipsec one... > > > > > > > skb_gro_receive() needs to maintain that invariant, if it doesn't > > > > we need to fix it.. > > > > > > > > > Gentle ping. Not sure how to proceed with this: > > > > 1) Revert commit 2cc3aeb5eccc > > ("skbuff: Fix a potential race while recycling page_pool packets"). I tested > > this btw and it works (for this specific scenario). > > > > 2) Revert Mina's commit a580ea994fd3 ("net: mirror skb frag ref/unref helpers") > > for now. > > > > I vote for #1, and IIUC Jakub's feedback, he seems to prefer this as > well. If we continue to run into edge cases after the revert of #1, I > think we may want to do #2 and I can look to reland it with the kunit > tests that Jakub suggested that reproduce these edge cases. > > I can upload #1 in the morning if there are no objections. I don't see > any regressions with #1 but I was never able to repo this issue. > Yes please. And once you post it we can also take it internally to check all the other tests that were failing in the same place. Thanks, Dragos
On Wed, 1 May 2024 00:48:43 -0700 Mina Almasry wrote:
> > 1) Revert commit 2cc3aeb5eccc
> > ("skbuff: Fix a potential race while recycling page_pool packets"). I tested
> > this btw and it works (for this specific scenario).
> >
> > 2) Revert Mina's commit a580ea994fd3 ("net: mirror skb frag ref/unref helpers")
> > for now.
>
> I vote for #1, and IIUC Jakub's feedback, he seems to prefer this as
> well.

I vote #2, actually :( Or #3 make page pool ref safe to acquire
concurrently, but that plus fixing all the places where we do crazy
things may be tricky.

Even taking the ref is not as simple as using atomic_long_inc_not_zero()
sadly, partly because we try to keep the refcount at one, in an apparent
attempt to avoid dirtying the cache line twice.

So maybe a partial revert to stop the bleeding and retry after more
testing is the way to go?

I had a quick look at the code and there is also a bunch of functions
which "shift" frags from one skb to another, without checking whether
the pp_recycle state matches.
On Wed, 1 May 2024 07:24:34 -0700 Jakub Kicinski wrote:
> I vote #2, actually :( Or #3 make page pool ref safe to acquire
> concurrently, but that plus fixing all the places where we do crazy
> things may be tricky.
>
> Even taking the ref is not as simple as using atomic_long_inc_not_zero()
> sadly, partly because we try to keep the refcount at one, in an apparent
> attempt to avoid dirtying the cache line twice.
>
> So maybe a partial revert to stop the bleeding and retry after more
> testing is the way to go?
>
> I had a quick look at the code and there is also a bunch of functions
> which "shift" frags from one skb to another, without checking whether
> the pp_recycle state matches.

BTW these two refs seem to look at the wrong skb:

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 0c8b82750000..afd3336928d0 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2148,7 +2148,7 @@ struct sk_buff *__pskb_copy_fclone(struct sk_buff *skb, int headroom,
 		}
 		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 			skb_shinfo(n)->frags[i] = skb_shinfo(skb)->frags[i];
-			skb_frag_ref(skb, i);
+			skb_frag_ref(n, i);
 		}
 		skb_shinfo(n)->nr_frags = i;
 	}
@@ -5934,7 +5934,7 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
 	 * since we set nr_frags to 0.
 	 */
 	for (i = 0; i < from_shinfo->nr_frags; i++)
-		__skb_frag_ref(&from_shinfo->frags[i], from->pp_recycle);
+		__skb_frag_ref(&from_shinfo->frags[i], to->pp_recycle);
 
 	to->truesize += delta;
 	to->len += len;
On Wed, May 1, 2024 at 7:24 AM Jakub Kicinski <kuba@kernel.org> wrote: > > On Wed, 1 May 2024 00:48:43 -0700 Mina Almasry wrote: > > > 1) Revert commit 2cc3aeb5eccc > > > ("skbuff: Fix a potential race while recycling page_pool packets"). I tested > > > this btw and it works (for this specific scenario). > > > > > > 2) Revert Mina's commit a580ea994fd3 ("net: mirror skb frag ref/unref helpers") > > > for now. > > > > I vote for #1, and IIUC Jakub's feedback, he seems to prefer this as > > well. > > I vote #2, actually :( Or #3 make page pool ref safe to acquire > concurrently, but that plus fixing all the places where we do crazy > things may be tricky. > > Even taking the ref is not as simple as using atomic_long_inc_not_zero() > sadly, partly because we try to keep the refcount at one, in an apparent > attempt to avoid dirtying the cache line twice. > > So maybe partial revert to stop be bleeding and retry after more testing > is the way to go? > OK, I will upload a revert sometime today. > I had a quick look at the code and there is also a bunch of functions > which "shift" frags from one skb to another, without checking whether > the pp_recycle state matches. You posted a diff, I will pick it up in a separate patch.
On Mon, Apr 29, 2024 at 8:00 AM Jakub Kicinski <kuba@kernel.org> wrote: > > On Fri, 26 Apr 2024 21:24:09 -0700 Mina Almasry wrote: > > On Fri, Apr 26, 2024 at 4:09 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > > > On Thu, 25 Apr 2024 12:20:59 -0700 Mina Almasry wrote: > > > > - if (recycle && napi_pp_get_page(page)) > > > > + if (napi_pp_get_page(page)) > > > > > > Pretty sure you can't do that. The "recycle" here is a concurrency > > > guarantee. A guarantee someone is holding a pp ref on that page, > > > a ref which will not go away while napi_pp_get_page() is executing. > > > > I don't mean to argue, but I think the get_page()/put_page() pair we > > do in the page ref path is susceptible to the same issue. AFAIU it's > > not safe to get_page() if another CPU can be dropping the last ref, > > get_page_unless_zero() should be used instead. > I uploaded a revert for review, but to reland I perhaps need to understand a bit more the concern here. AFAICT that diff you're responding to is safe and it works very well with devmem so it would be my preferred approach to reland (but there are other options if you are convinced it's bad). FWIW my thoughts: > Whoever gave us the pointer to operate on has a reference, so the page > can't disappear. get_page() is safe. Agreed. > The problem with pp is that we > don't know whether the caller has a pp ref or a page ref. IOW the pp > ref may not be owned by whoever called us. > OK, this is where I'm not sure anymore. The diff you're replying to attempts to enforce the invariant: "if anyone wants a reference on an skb_frag, skb_frag_ref will be a pp ref on pp frags (is_pp_page==true), and page refs on non-pp frags (is_pp_page==false)". 
Additionally the page doesn't transition from pp to non-pp and vice versa while anyone is holding a pp ref, because page_pool_set_pp_info() is called right after the page is obtained from the buddy allocator (before released from the page pool) and page_pool_clear_pp_info() is called only after all the pp refs are dropped. So: 1. We know the caller has a ref (otherwise get_page() wouldn't be safe in the non-pp case). 2. We know that the page has not transitioned from pp to non-pp or vice versa since the caller obtained the ref (from code inspection, pp info is not changed until all the refs are dropped for pp pages). 3. AFAICT, it follows that if the page is pp, then the caller has a pp ref, and if the page is non-pp, then the caller has a page ref. 4. So, if is_pp_page==true, then the caller has a pp ref, then napi_pp_get_page() should be concurrently safe. AFAICT the only way my mental model is broken is if there is code doing a raw get_page() rather than a skb_frag_ref() in core net stack. I would like to get rid of these call sites if they exist. They would not interact well with devmem I think (but could be made to work with some effort). -- Thanks, Mina
On Thu, 2 May 2024 13:08:23 -0700 Mina Almasry wrote: > OK, this is where I'm not sure anymore. The diff you're replying to > attempts to enforce the invariant: "if anyone wants a reference on an > skb_frag, skb_frag_ref will be a pp ref on pp frags > (is_pp_page==true), and page refs on non-pp frags > (is_pp_page==false)". > > Additionally the page doesn't transition from pp to non-pp and vice > versa while anyone is holding a pp ref, because > page_pool_set_pp_info() is called right after the page is obtained > from the buddy allocator (before released from the page pool) and > page_pool_clear_pp_info() is called only after all the pp refs are > dropped. > > So: > > 1. We know the caller has a ref (otherwise get_page() wouldn't be safe > in the non-pp case). > 2. We know that the page has not transitioned from pp to non-pp or > vice versa since the caller obtained the ref (from code inspection, pp How do we know that? > info is not changed until all the refs are dropped for pp pages). > 3. AFAICT, it follows that if the page is pp, then the caller has a pp > ref, and if the page is non-pp, then the caller has a page ref. If that's true we have nothing to worry about. > 4. So, if is_pp_page==true, then the caller has a pp ref, then > napi_pp_get_page() should be concurrently safe. > > AFAICT the only way my mental model is broken is if there is code > doing a raw get_page() rather than a skb_frag_ref() in core net stack. Not just. We also get pages fed from the outside, which may be PP pages. Right now it'd be okay because "from outside" pages would end up in Tx. Tx always allocates skbs with ->pp_recycle = 0, so we'll hold full refs. > I would like to get rid of these call sites if they exist. They would > not interact well with devmem I think (but could be made to work with > some effort). Maybe if we convert more code to operate on netmem_ref it will become clearer where raw pages get fed in, and therefore were we have to be very careful?
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 37c858dc11a6..ab75b4f876ce 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2634,7 +2634,7 @@ int ___pskb_trim(struct sk_buff *skb, unsigned int len)
 	skb_shinfo(skb)->nr_frags = i;
 
 	for (; i < nfrags; i++)
-		skb_frag_unref(skb, i);
+		__skb_frag_unref(&skb_shinfo(skb)->frags[i], true);
 
 	if (skb_has_frag_list(skb))
 		skb_drop_fraglist(skb);
The patch referenced in the fixes tag introduced the skb_frag_unref API.
This API still has a dark corner: when skb->pp_recycle is false and a
fragment being referenced is backed by a page_pool page, skb_frag_unref
can leak the page out of the page_pool, like in the following example:

BUG: Bad page state in process swapper/4  pfn:103423
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x103423000 pfn:0x103423
flags: 0x200000000000000(node=0|zone=2)
page_type: 0xffffffff()
raw: 0200000000000000 dead000000000040 ffff888106f38000 0000000000000000
raw: 0000000103423000 0000000000000041 00000000ffffffff 0000000000000000
page dumped because: page_pool leak
Modules linked in: act_mirred act_csum act_pedit act_gact cls_flower
 act_ct nf_flow_table sch_ingress xt_conntrack xt_MASQUERADE
 nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat
 br_netfilter overlay rpcrdma rdma_ucm ib_iser libiscsi
 scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib
 ib_uverbs ib_core zram zsmalloc mlx5_core fuse
CPU: 4 PID: 0 Comm: swapper/4 Not tainted 6.9.0-rc4+ #2
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
 <IRQ>
 dump_stack_lvl+0x53/0x70
 bad_page+0x6f/0xf0
 free_unref_page_prepare+0x271/0x420
 free_unref_page+0x38/0x120
 ___pskb_trim+0x261/0x390
 skb_segment+0xf60/0x1040
 tcp_gso_segment+0xe8/0x4e0
 inet_gso_segment+0x155/0x3d0
 skb_mac_gso_segment+0x99/0x100
 __skb_gso_segment+0xb4/0x160
 ? netif_skb_features+0x95/0x2f0
 validate_xmit_skb+0x16c/0x330
 validate_xmit_skb_list+0x4c/0x70
 sch_direct_xmit+0x23e/0x350
 __dev_queue_xmit+0x334/0xbc0
 tcf_mirred_to_dev+0x2a5/0x3e0 [act_mirred]
 tcf_mirred_act+0xd7/0x4b0 [act_mirred]
 ? tcf_pedit_act+0x6f/0x540 [act_pedit]
 tcf_action_exec+0x82/0x170
 fl_classify+0x1ee/0x200 [cls_flower]
 ? tcf_mirred_to_dev+0x2a5/0x3e0 [act_mirred]
 ? mlx5e_completion_event+0x39/0x40 [mlx5_core]
 ? mlx5_eq_comp_int+0x1d7/0x1f0 [mlx5_core]
 tcf_classify+0x26a/0x470
 tc_run+0xa2/0x120
 ? handle_irq_event+0x53/0x80
 ? kvm_clock_get_cycles+0x11/0x20
 __netif_receive_skb_core.constprop.0+0x932/0xee0
 __netif_receive_skb_list_core+0xfe/0x1f0
 netif_receive_skb_list_internal+0x1b5/0x2b0
 napi_gro_complete.constprop.0+0xee/0x120
 dev_gro_receive+0x3f4/0x710
 napi_gro_receive+0x7d/0x220
 mlx5e_handle_rx_cqe_mpwrq+0x10d/0x1d0 [mlx5_core]
 mlx5e_poll_rx_cq+0x8b/0x6f0 [mlx5_core]
 mlx5e_napi_poll+0xdc/0x6c0 [mlx5_core]
 __napi_poll+0x25/0x1b0
 net_rx_action+0x2c1/0x330
 __do_softirq+0xcb/0x28c
 irq_exit_rcu+0x69/0x90
 common_interrupt+0x85/0xa0
 </IRQ>
 <TASK>
 asm_common_interrupt+0x26/0x40
RIP: 0010:default_idle+0x17/0x20
Code: 00 4d 29 c8 4c 01 c7 4c 29 c2 e9 76 ff ff ff cc cc cc cc f3 0f 1e fa 8b 05 76 3f 0a 01 85 c0 7e 07 0f 00 2d 1d 74 41 00 fb f4 <fa> c3 0f 1f 80 00 00 00 00 f3 0f 1e fa 65 48 8b 35 04 b3 42 7e f0
RSP: 0018:ffff88810087bed8 EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff8881008415c0 RCX: 000000e116d359fb
RDX: 0000000000000000 RSI: ffffffff8223e1d1 RDI: 000000000003f214
RBP: 0000000000000004 R08: 000000000003f214 R09: 000000e116d359fb
R10: 000000e116d359fb R11: 000000000005dfee R12: 0000000000000004
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
 default_idle_call+0x3d/0xf0
 do_idle+0x1ce/0x1e0
 cpu_startup_entry+0x29/0x30
 start_secondary+0x109/0x130
 common_startup_64+0x129/0x138
 </TASK>

How it happens:
-> skb_segment
  -> clone skb into nskb
  -> call __pskb_trim(nskb)
    -> call pskb_expand_head(nskb) because nskb is cloned
      -> call skb_release_data(nskb) because nskb is cloned
        -> nskb->pp_recycle = 0
    -> skb_unref(nskb->frag[i], nskb)
      -> nskb->pp_recycle is false, frag is page_pool page
      -> Fragment is released via put_page(frag page), hence leaking
         from the page_pool.

Something tells me that this is not the only issue of this kind...
The patch itself contains a suggested fix for this specific bug: it forces
the unref in ___pskb_trim to recycle to the page_pool. If the page is not a
page_pool page, its reference will still be dropped as for a regular page.
The alternative would be to save the skb->pp_recycle flag before
pskb_expand_head and use it later during the unref.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Co-developed-by: Jianbo Liu <jianbol@nvidia.com>
Fixes: a580ea994fd3 ("net: mirror skb frag ref/unref helpers")
Cc: Mina Almasry <almasrymina@google.com>
---
 net/core/skbuff.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)