diff mbox series

[RFC] net: esp: fix bad handling of pages from page_pool

Message ID 20240304094950.761233-1-dtatulea@nvidia.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series [RFC] net: esp: fix bad handling of pages from page_pool | expand

Checks

Context Check Description
netdev/series_format warning Single patches do not need cover letters; Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 940 this patch: 940
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 7 of 7 maintainers
netdev/build_clang success Errors and warnings before: 956 this patch: 956
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 956 this patch: 956
netdev/checkpatch fail ERROR: Remove Gerrit Change-Id's before submitting upstream
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Dragos Tatulea March 4, 2024, 9:48 a.m. UTC
When the skb is reorganized during esp_output (!esp->inline), the pages
coming from the original skb fragments are supposed to be released back
to the system through put_page. But if the skb fragment pages are
originating from a page_pool, calling put_page on them will trigger a
page_pool leak which will eventually result in a crash.

This leak can be easily observed when using CONFIG_DEBUG_VM and doing
ipsec + gre (non offloaded) forwarding:

  BUG: Bad page state in process ksoftirqd/16  pfn:1451b6
  page:00000000de2b8d32 refcount:0 mapcount:0 mapping:0000000000000000 index:0x1451b6000 pfn:0x1451b6
  flags: 0x200000000000000(node=0|zone=2)
  page_type: 0xffffffff()
  raw: 0200000000000000 dead000000000040 ffff88810d23c000 0000000000000000
  raw: 00000001451b6000 0000000000000001 00000000ffffffff 0000000000000000
  page dumped because: page_pool leak
  Modules linked in: ip_gre gre mlx5_ib mlx5_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat nf_nat xt_addrtype br_netfilter rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm ib_uverbs ib_core overlay zram zsmalloc fuse [last unloaded: mlx5_core]
  CPU: 16 PID: 96 Comm: ksoftirqd/16 Not tainted 6.8.0-rc4+ #22
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  Call Trace:
   <TASK>
   dump_stack_lvl+0x36/0x50
   bad_page+0x70/0xf0
   free_unref_page_prepare+0x27a/0x460
   free_unref_page+0x38/0x120
   esp_ssg_unref.isra.0+0x15f/0x200
   esp_output_tail+0x66d/0x780
   esp_xmit+0x2c5/0x360
   validate_xmit_xfrm+0x313/0x370
   ? validate_xmit_skb+0x1d/0x330
   validate_xmit_skb_list+0x4c/0x70
   sch_direct_xmit+0x23e/0x350
   __dev_queue_xmit+0x337/0xba0
   ? nf_hook_slow+0x3f/0xd0
   ip_finish_output2+0x25e/0x580
   iptunnel_xmit+0x19b/0x240
   ip_tunnel_xmit+0x5fb/0xb60
   ipgre_xmit+0x14d/0x280 [ip_gre]
   dev_hard_start_xmit+0xc3/0x1c0
   __dev_queue_xmit+0x208/0xba0
   ? nf_hook_slow+0x3f/0xd0
   ip_finish_output2+0x1ca/0x580
   ip_sublist_rcv_finish+0x32/0x40
   ip_sublist_rcv+0x1b2/0x1f0
   ? ip_rcv_finish_core.constprop.0+0x460/0x460
   ip_list_rcv+0x103/0x130
   __netif_receive_skb_list_core+0x181/0x1e0
   netif_receive_skb_list_internal+0x1b3/0x2c0
   napi_gro_receive+0xc8/0x200
   gro_cell_poll+0x52/0x90
   __napi_poll+0x25/0x1a0
   net_rx_action+0x28e/0x300
   __do_softirq+0xc3/0x276
   ? sort_range+0x20/0x20
   run_ksoftirqd+0x1e/0x30
   smpboot_thread_fn+0xa6/0x130
   kthread+0xcd/0x100
   ? kthread_complete_and_exit+0x20/0x20
   ret_from_fork+0x31/0x50
   ? kthread_complete_and_exit+0x20/0x20
   ret_from_fork_asm+0x11/0x20
   </TASK>

The suggested fix is to use the page_pool release API first and then fallback
to put_page.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reported-by: Anatoli N.Chechelnickiy <Anatoli.Chechelnickiy@m.interpipe.biz>
Reported-by: Ian Kumlien <ian.kumlien@gmail.com>
Change-Id: I263cf91c1d13c2736a58927e8e0fc51296759450
---
 net/ipv4/esp4.c | 11 ++++++++---
 net/ipv6/esp6.c | 11 ++++++++---
 2 files changed, 16 insertions(+), 6 deletions(-)

Comments

Jakub Kicinski March 6, 2024, 3:04 a.m. UTC | #1
On Mon, 4 Mar 2024 11:48:52 +0200 Dragos Tatulea wrote:
> When the skb is reorganized during esp_output (!esp->inline), the pages
> coming from the original skb fragments are supposed to be released back
> to the system through put_page. But if the skb fragment pages are
> originating from a page_pool, calling put_page on them will trigger a
> page_pool leak which will eventually result in a crash.

So it just does: skb_shinfo(skb)->nr_frags = 1;
and assumes that's equivalent to owning a page ref on all the frags?

Fix looks more or less good, we would need a new wrapper to avoid
build issues without PAGE_POOL, but I wonder if we wouldn't be better
off changing the other side. Instead of "cutting off" the frags -
walking them and dealing with various page types. Because Mina and co.
will step onto this landmine as well.
Dragos Tatulea March 6, 2024, 1:05 p.m. UTC | #2
On Tue, 2024-03-05 at 19:04 -0800, Jakub Kicinski wrote:
> On Mon, 4 Mar 2024 11:48:52 +0200 Dragos Tatulea wrote:
> > When the skb is reorganized during esp_output (!esp->inline), the pages
> > coming from the original skb fragments are supposed to be released back
> > to the system through put_page. But if the skb fragment pages are
> > originating from a page_pool, calling put_page on them will trigger a
> > page_pool leak which will eventually result in a crash.
> 
> So it just does: skb_shinfo(skb)->nr_frags = 1;
> and assumes that's equivalent to owning a page ref on all the frags?
> 
My understanding is different: it sets nr_frags to 1 because it's swapping out
the old page frag in fragment 0 with the new xfrag page frag and will use this
"new" skb from here. It does take a page reference for the xfrag page frag.

> Fix looks more or less good, we would need a new wrapper to avoid
> build issues without PAGE_POOL, 
> 
Ack. Which component would be best location for this wrapper: page_pool?

> but I wonder if we wouldn't be better
> off changing the other side. Instead of "cutting off" the frags -
> walking them and dealing with various page types. Because Mina and co.
> will step onto this landmine as well.
The page frags are still stored and used in the sg scatterlist. If we release
them at the moment when the skb is "cut off", the pages in the sg will be
invalid. At least that's my understanding.

Thanks,
Dragos
Jakub Kicinski March 6, 2024, 3:22 p.m. UTC | #3
On Wed, 6 Mar 2024 13:05:14 +0000 Dragos Tatulea wrote:
> On Tue, 2024-03-05 at 19:04 -0800, Jakub Kicinski wrote:
> > On Mon, 4 Mar 2024 11:48:52 +0200 Dragos Tatulea wrote:  
> > > When the skb is reorganized during esp_output (!esp->inline), the pages
> > > coming from the original skb fragments are supposed to be released back
> > > to the system through put_page. But if the skb fragment pages are
> > > originating from a page_pool, calling put_page on them will trigger a
> > > page_pool leak which will eventually result in a crash.  
> > 
> > So it just does: skb_shinfo(skb)->nr_frags = 1;
> > and assumes that's equivalent to owning a page ref on all the frags?
> >   
> My understanding is different: it sets nr_frags to 1 because it's swapping out
> the old page frag in fragment 0 with the new xfrag page frag and will use this
> "new" skb from here. It does take a page reference for the xfrag page frag.

Same understanding, I'm just bad at explaining :)

> > Fix looks more or less good, we would need a new wrapper to avoid
> > build issues without PAGE_POOL, 
> >   
> Ack. Which component would be best location for this wrapper: page_pool?

Hm, that's a judgment call.
Part of me wants to put it next to napi_frag_unref(), since we
basically need to factor out the insides of this function.
When you post the patch the page pool crowd will give us
their opinions.

> > but I wonder if we wouldn't be better
> > off changing the other side. Instead of "cutting off" the frags -
> > walking them and dealing with various page types. Because Mina and co.
> > will step onto this landmine as well.  
> The page frags are still stored and used in the sg scatterlist. If we release
> them at the moment when the skb is "cut off", the pages in the sg will be
> invalid. At least that's my understanding.

I was thinking something along the lines of:

	for each frag()
		if (is_pp_page()) {
			get_page();
			page_pool_unref_page(1);
		}

so that it's trivial to insert another check for "is this a zero-copy"
page in there, and error our. But on reflection the zero copy check may
be better placed in __skb_to_sgvec(), so ignore this. Just respin
what you got with a new helper.
Dragos Tatulea March 6, 2024, 4 p.m. UTC | #4
On Wed, 2024-03-06 at 07:22 -0800, Jakub Kicinski wrote:
> On Wed, 6 Mar 2024 13:05:14 +0000 Dragos Tatulea wrote:
> > On Tue, 2024-03-05 at 19:04 -0800, Jakub Kicinski wrote:
> > > On Mon, 4 Mar 2024 11:48:52 +0200 Dragos Tatulea wrote:  
> > > > When the skb is reorganized during esp_output (!esp->inline), the pages
> > > > coming from the original skb fragments are supposed to be released back
> > > > to the system through put_page. But if the skb fragment pages are
> > > > originating from a page_pool, calling put_page on them will trigger a
> > > > page_pool leak which will eventually result in a crash.  
> > > 
> > > So it just does: skb_shinfo(skb)->nr_frags = 1;
> > > and assumes that's equivalent to owning a page ref on all the frags?
> > >   
> > My understanding is different: it sets nr_frags to 1 because it's swapping out
> > the old page frag in fragment 0 with the new xfrag page frag and will use this
> > "new" skb from here. It does take a page reference for the xfrag page frag.
> 
> Same understanding, I'm just bad at explaining :)
> 
> > > Fix looks more or less good, we would need a new wrapper to avoid
> > > build issues without PAGE_POOL, 
> > >   
> > Ack. Which component would be best location for this wrapper: page_pool?
> 
> Hm, that's a judgment call.
> Part of me wants to put it next to napi_frag_unref(), since we
> basically need to factor out the insides of this function.
> When you post the patch the page pool crowd will give us
> their opinions.
> 
Why not have napi_pp_put_page simply return false if CONFIG_PAGE_POOL is not
set?

Regarding stable would I need to send a separate fix that does the raw pp page
check without the API?

> > > but I wonder if we wouldn't be better
> > > off changing the other side. Instead of "cutting off" the frags -
> > > walking them and dealing with various page types. Because Mina and co.
> > > will step onto this landmine as well.  
> > The page frags are still stored and used in the sg scatterlist. If we release
> > them at the moment when the skb is "cut off", the pages in the sg will be
> > invalid. At least that's my understanding.
> 
> I was thinking something along the lines of:
> 
> 	for each frag()
> 		if (is_pp_page()) {
> 			get_page();
> 			page_pool_unref_page(1);
> 		}
> 
> so that it's trivial to insert another check for "is this a zero-copy"
> page in there, and error our. But on reflection the zero copy check may
> be better placed in __skb_to_sgvec(), so ignore this. Just respin
> what you got with a new helper.
> 
Ignored. I was hoping we wouldn't go in that direction :).

Thanks,
Dragos
Jakub Kicinski March 6, 2024, 4:09 p.m. UTC | #5
On Wed, 6 Mar 2024 16:00:46 +0000 Dragos Tatulea wrote:
> > Hm, that's a judgment call.
> > Part of me wants to put it next to napi_frag_unref(), since we
> > basically need to factor out the insides of this function.
> > When you post the patch the page pool crowd will give us
> > their opinions.
>
> Why not have napi_pp_put_page simply return false if CONFIG_PAGE_POOL is not
> set?

Without LTO it may still be a function call.
Plus, subjectively, I think that it's a bit too much logic to encode in
the caller (you must also check skb->pp_recycle, AFAIU)
Maybe we should make skb_pp_recycle() take struct page and move it to
skbuff.h ? Rename it to skb_page_unref() ?

> Regarding stable would I need to send a separate fix that does the raw pp page
> check without the API?

You can put them in one patch, I reckon.
Mina Almasry March 6, 2024, 4:40 p.m. UTC | #6
On Wed, Mar 6, 2024 at 8:09 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 6 Mar 2024 16:00:46 +0000 Dragos Tatulea wrote:
> > > Hm, that's a judgment call.
> > > Part of me wants to put it next to napi_frag_unref(), since we
> > > basically need to factor out the insides of this function.
> > > When you post the patch the page pool crowd will give us
> > > their opinions.
> >
> > Why not have napi_pp_put_page simply return false if CONFIG_PAGE_POOL is not
> > set?
>
> Without LTO it may still be a function call.
> Plus, subjectively, I think that it's a bit too much logic to encode in
> the caller (you must also check skb->pp_recycle, AFAIU)
> Maybe we should make skb_pp_recycle() take struct page and move it to
> skbuff.h ? Rename it to skb_page_unref() ?
>

Does the caller need to check skb->pp_recycle? pp_recycle seems like a
redundant bit. We can tell whether the page is pp by checking
is_pp_page(page). the pages in the frag must be pp pages when
skb->pp_recycle is set and must be non pp pages when the
skb->pp_recycle is not set, so it all seems redundant to me.

My fix would be something like:

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index d577e0bee18d..cc737b7b9860 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3507,17 +3507,25 @@ int skb_cow_data_for_xdp(struct page_pool
*pool, struct sk_buff **pskb,
 bool napi_pp_put_page(struct page *page, bool napi_safe);

 static inline void
-napi_frag_unref(skb_frag_t *frag, bool recycle, bool napi_safe)
+napi_page_unref(struct page *page, bool napi_safe)
 {
-       struct page *page = skb_frag_page(frag);
-
 #ifdef CONFIG_PAGE_POOL
-       if (recycle && napi_pp_put_page(page, napi_safe))
+       if (napi_pp_put_page(page, napi_safe))
                return;
 #endif
        put_page(page);
 }

+static inline void
+napi_frag_unref(skb_frag_t *frag, bool recycle, bool napi_safe)
+{
+       struct page *page = skb_frag_page(frag);
+
+       DEBUG_NET_WARN_ON(recycle != is_pp_page(page));
+
+       napi_page_unref(page);
+}
+

And then use napi_page_unref() in the callers to handle page pool &
non-page pool gracefully without leaking page pool internals to the
callers.

> > Regarding stable would I need to send a separate fix that does the raw pp page
> > check without the API?
>
> You can put them in one patch, I reckon.



--
Thanks,
Mina
Dragos Tatulea March 6, 2024, 5:09 p.m. UTC | #7
On Wed, 2024-03-06 at 08:40 -0800, Mina Almasry wrote:
> On Wed, Mar 6, 2024 at 8:09 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > 
> > On Wed, 6 Mar 2024 16:00:46 +0000 Dragos Tatulea wrote:
> > > > Hm, that's a judgment call.
> > > > Part of me wants to put it next to napi_frag_unref(), since we
> > > > basically need to factor out the insides of this function.
> > > > When you post the patch the page pool crowd will give us
> > > > their opinions.
> > > 
> > > Why not have napi_pp_put_page simply return false if CONFIG_PAGE_POOL is not
> > > set?
> > 
> > Without LTO it may still be a function call.
> > Plus, subjectively, I think that it's a bit too much logic to encode in
> > the caller (you must also check skb->pp_recycle, AFAIU)
> > Maybe we should make skb_pp_recycle() take struct page and move it to
> > skbuff.h ? Rename it to skb_page_unref() ?
> > 
> 
> Does the caller need to check skb->pp_recycle? pp_recycle seems like a
> redundant bit. We can tell whether the page is pp by checking
> is_pp_page(page). the pages in the frag must be pp pages when
> skb->pp_recycle is set and must be non pp pages when the
> skb->pp_recycle is not set, so it all seems redundant to me.
> 
AFAIU we don't have to check for pp_recycle, at least not in this specific case.

> My fix would be something like:
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index d577e0bee18d..cc737b7b9860 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -3507,17 +3507,25 @@ int skb_cow_data_for_xdp(struct page_pool
> *pool, struct sk_buff **pskb,
>  bool napi_pp_put_page(struct page *page, bool napi_safe);
> 
>  static inline void
> -napi_frag_unref(skb_frag_t *frag, bool recycle, bool napi_safe)
> +napi_page_unref(struct page *page, bool napi_safe)
>  {
> -       struct page *page = skb_frag_page(frag);
> -
>  #ifdef CONFIG_PAGE_POOL
> -       if (recycle && napi_pp_put_page(page, napi_safe))
> +       if (napi_pp_put_page(page, napi_safe))
>                 return;
>  #endif
>         put_page(page);
>  }
> 
> +static inline void
> +napi_frag_unref(skb_frag_t *frag, bool recycle, bool napi_safe)
> +{
> +       struct page *page = skb_frag_page(frag);
> +
> +       DEBUG_NET_WARN_ON(recycle != is_pp_page(page));
> +
> +       napi_page_unref(page);
> +}
> +
> 
> And then use napi_page_unref() in the callers to handle page pool &
> non-page pool gracefully without leaking page pool internals to the
> callers.
> 
We'd also need to add is_pp_page() in the header with the changes above...

On that line of thought, unless these new APIs are useful for other use-cases,
why not keep it simple:
- Move is_pp_page() to skbuff.h.
- Do a simple is_pp_page(page) ? page_pool_put_full_page(page):put_page(page) in
the caller? Checking skb->pp_recycle would not be needed. 

Thanks,
Dragos

> > > Regarding stable would I need to send a separate fix that does the raw pp page
> > > check without the API?
> > 
> > You can put them in one patch, I reckon.
>
Mina Almasry March 6, 2024, 5:27 p.m. UTC | #8
On Wed, Mar 6, 2024 at 9:10 AM Dragos Tatulea <dtatulea@nvidia.com> wrote:
>
> On Wed, 2024-03-06 at 08:40 -0800, Mina Almasry wrote:
> > On Wed, Mar 6, 2024 at 8:09 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > >
> > > On Wed, 6 Mar 2024 16:00:46 +0000 Dragos Tatulea wrote:
> > > > > Hm, that's a judgment call.
> > > > > Part of me wants to put it next to napi_frag_unref(), since we
> > > > > basically need to factor out the insides of this function.
> > > > > When you post the patch the page pool crowd will give us
> > > > > their opinions.
> > > >
> > > > Why not have napi_pp_put_page simply return false if CONFIG_PAGE_POOL is not
> > > > set?
> > >
> > > Without LTO it may still be a function call.
> > > Plus, subjectively, I think that it's a bit too much logic to encode in
> > > the caller (you must also check skb->pp_recycle, AFAIU)
> > > Maybe we should make skb_pp_recycle() take struct page and move it to
> > > skbuff.h ? Rename it to skb_page_unref() ?
> > >
> >
> > Does the caller need to check skb->pp_recycle? pp_recycle seems like a
> > redundant bit. We can tell whether the page is pp by checking
> > is_pp_page(page). the pages in the frag must be pp pages when
> > skb->pp_recycle is set and must be non pp pages when the
> > skb->pp_recycle is not set, so it all seems redundant to me.
> >
> AFAIU we don't have to check for pp_recycle, at least not in this specific case.
>
> > My fix would be something like:
> >
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index d577e0bee18d..cc737b7b9860 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -3507,17 +3507,25 @@ int skb_cow_data_for_xdp(struct page_pool
> > *pool, struct sk_buff **pskb,
> >  bool napi_pp_put_page(struct page *page, bool napi_safe);
> >
> >  static inline void
> > -napi_frag_unref(skb_frag_t *frag, bool recycle, bool napi_safe)
> > +napi_page_unref(struct page *page, bool napi_safe)
> >  {
> > -       struct page *page = skb_frag_page(frag);
> > -
> >  #ifdef CONFIG_PAGE_POOL
> > -       if (recycle && napi_pp_put_page(page, napi_safe))
> > +       if (napi_pp_put_page(page, napi_safe))
> >                 return;
> >  #endif
> >         put_page(page);
> >  }
> >
> > +static inline void
> > +napi_frag_unref(skb_frag_t *frag, bool recycle, bool napi_safe)
> > +{
> > +       struct page *page = skb_frag_page(frag);
> > +
> > +       DEBUG_NET_WARN_ON(recycle != is_pp_page(page));
> > +
> > +       napi_page_unref(page);
> > +}
> > +
> >
> > And then use napi_page_unref() in the callers to handle page pool &
> > non-page pool gracefully without leaking page pool internals to the
> > callers.
> >
> We'd also need to add is_pp_page() in the header with the changes above...
>

Gah, I did not realize skbuff.h doesn't have is_pp_page(). Sorry!

> On that line of thought, unless these new APIs are useful for other use-cases,
> why not keep it simple:
> - Move is_pp_page() to skbuff.h.

I prefer this. is_pp_page() is a one-liner anyway.

> - Do a simple is_pp_page(page) ? page_pool_put_full_page(page):put_page(page) in
> the caller? Checking skb->pp_recycle would not be needed.
>

IMHO page pool internals should not leak to the callers like this.
There should be a helper that does the right thing for pp & non-pp
pages so that the caller can use it without worrying about subtle pp
concepts that they may miss at some callsites, but I'm fine either way
if there is strong disagreement from others.

> Thanks,
> Dragos
>
> > > > Regarding stable would I need to send a separate fix that does the raw pp page
> > > > check without the API?
> > >
> > > You can put them in one patch, I reckon.
> >
>
Jakub Kicinski March 6, 2024, 5:41 p.m. UTC | #9
On Wed, 6 Mar 2024 17:09:57 +0000 Dragos Tatulea wrote:
> > Does the caller need to check skb->pp_recycle? pp_recycle seems like a
> > redundant bit. We can tell whether the page is pp by checking
> > is_pp_page(page). the pages in the frag must be pp pages when
> > skb->pp_recycle is set and must be non pp pages when the
> > skb->pp_recycle is not set, so it all seems redundant to me.
> >   
> AFAIU we don't have to check for pp_recycle, at least not in this specific case.

Definitely not something we assuming in a fix that needs to go 
to stable.

So far, AFAIU, it's legal to have an skb without skb->pp_recycle
set, which holds full page refs to a PP page.
Mina Almasry March 6, 2024, 6:41 p.m. UTC | #10
On Wed, Mar 6, 2024 at 9:41 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 6 Mar 2024 17:09:57 +0000 Dragos Tatulea wrote:
> > > Does the caller need to check skb->pp_recycle? pp_recycle seems like a
> > > redundant bit. We can tell whether the page is pp by checking
> > > is_pp_page(page). the pages in the frag must be pp pages when
> > > skb->pp_recycle is set and must be non pp pages when the
> > > skb->pp_recycle is not set, so it all seems redundant to me.
> > >
> > AFAIU we don't have to check for pp_recycle, at least not in this specific case.
>
> Definitely not something we assuming in a fix that needs to go
> to stable.
>
> So far, AFAIU, it's legal to have an skb without skb->pp_recycle
> set, which holds full page refs to a PP page.

Interesting. I apologize then I did not realize this combination is
legal. But I have a follow up question: what is the ref code supposed
to do in this combination? AFAIU:

- skb->pp_recycle && is_pp_page()
ref via page_pool_ref_page()
unref via page_pool_unref_page()

- !skb->pp_recycle && !is_pp_page()
ref via get_page()
unref via put_page()

- !skb->pp_recycle && is_pp_page()
ref via ?
unref via?

Also is this combination also legal you think? and if so what to do?
- skb->pp_recycle && !is_pp_page()
ref via ?
unref via?

I'm asking because I'm starting to wonder if this patch has some issue in it:
https://patchwork.kernel.org/project/netdevbpf/patch/20231215033011.12107-4-liangchen.linux@gmail.com/

Because in this patch, if we have a !skb->pp_recycle & is_pp_page()
combination we end up doing in skb_try_coalesce():
ref via page_pool_ref_page()
unref via put_page() via eventual napi_frag_unref()

which seems like an issue, no?
Mina Almasry March 6, 2024, 6:46 p.m. UTC | #11
On Wed, Mar 6, 2024 at 10:41 AM Mina Almasry <almasrymina@google.com> wrote:
>
> On Wed, Mar 6, 2024 at 9:41 AM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Wed, 6 Mar 2024 17:09:57 +0000 Dragos Tatulea wrote:
> > > > Does the caller need to check skb->pp_recycle? pp_recycle seems like a
> > > > redundant bit. We can tell whether the page is pp by checking
> > > > is_pp_page(page). the pages in the frag must be pp pages when
> > > > skb->pp_recycle is set and must be non pp pages when the
> > > > skb->pp_recycle is not set, so it all seems redundant to me.
> > > >
> > > AFAIU we don't have to check for pp_recycle, at least not in this specific case.
> >
> > Definitely not something we assuming in a fix that needs to go
> > to stable.
> >
> > So far, AFAIU, it's legal to have an skb without skb->pp_recycle
> > set, which holds full page refs to a PP page.
>
> Interesting. I apologize then I did not realize this combination is
> legal. But I have a follow up question: what is the ref code supposed
> to do in this combination? AFAIU:
>
> - skb->pp_recycle && is_pp_page()
> ref via page_pool_ref_page()
> unref via page_pool_unref_page()
>
> - !skb->pp_recycle && !is_pp_page()
> ref via get_page()
> unref via put_page()
>
> - !skb->pp_recycle && is_pp_page()
> ref via ?
> unref via?
>
> Also is this combination also legal you think? and if so what to do?
> - skb->pp_recycle && !is_pp_page()
> ref via ?
> unref via?
>
> I'm asking because I'm starting to wonder if this patch has some issue in it:
> https://patchwork.kernel.org/project/netdevbpf/patch/20231215033011.12107-4-liangchen.linux@gmail.com/
>
> Because in this patch, if we have a !skb->pp_recycle & is_pp_page()
> combination we end up doing in skb_try_coalesce():
> ref via page_pool_ref_page()
> unref via put_page() via eventual napi_frag_unref()
>
> which seems like an issue, no?
>

Gah, nevermind, skb_pp_frag_ref() actually returns -EINVAL if
!skb->pp_recycle, and in the call site we do a skb_frag_ref() on this
error, so all in all we end up doing a get_page/put_page pair. Sorry
for the noise.

So we're supposed to:
- !skb->pp_recycle && is_pp_page()
ref via get_page
unref via put_page

Very subtle stuff (for me at least). I'll try to propose some cleanup
to make this a bit simpler using helpers that handle all these subtle
details internally so that the call sites don't have to do this
special handling.
Jakub Kicinski March 7, 2024, 2:16 a.m. UTC | #12
On Wed, 6 Mar 2024 10:46:45 -0800 Mina Almasry wrote:
> Gah, nevermind, skb_pp_frag_ref() actually returns -EINVAL if
> !skb->pp_recycle, and in the call site we do a skb_frag_ref() on this
> error, so all in all we end up doing a get_page/put_page pair. Sorry
> for the noise.
> 
> So we're supposed to:
> - !skb->pp_recycle && is_pp_page()
> ref via get_page
> unref via put_page
> 
> Very subtle stuff (for me at least). I'll try to propose some cleanup
> to make this a bit simpler using helpers that handle all these subtle
> details internally so that the call sites don't have to do this
> special handling.

Sure, although complexity is complexity, we can only do so much to hide
it.

For pp_recycle - the problem is when we added page pool pages, hardly
anything in the upper layers of the stack was made pp aware. So we can
end up with someone doing

	get_page(page);
	skb_fill_page_desc(skb, page);

on a PP page.

You probably inspected a lot of those cases for the ZC work, and they
can be fixed up to do a "pp-aware get()", but until then we need
skb->pp_recycle. pp_recycle kinda denotes whether whoever constructed
the skb was PP aware.

So, yes, definitely a good long term goal, but not in a fix :)
diff mbox series

Patch

diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 4dd9e5040672..3e07d78c887d 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -112,9 +112,14 @@  static void esp_ssg_unref(struct xfrm_state *x, void *tmp)
 	/* Unref skb_frag_pages in the src scatterlist if necessary.
 	 * Skip the first sg which comes from skb->data.
 	 */
-	if (req->src != req->dst)
-		for (sg = sg_next(req->src); sg; sg = sg_next(sg))
-			put_page(sg_page(sg));
+	if (req->src != req->dst) {
+		for (sg = sg_next(req->src); sg; sg = sg_next(sg)) {
+			struct page *page = sg_page(sg);
+
+			if (!napi_pp_put_page(page, false))
+				put_page(page);
+		}
+	}
 }
 
 #ifdef CONFIG_INET_ESPINTCP
diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index 6e6efe026cdc..b73f5773139d 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -129,9 +129,14 @@  static void esp_ssg_unref(struct xfrm_state *x, void *tmp)
 	/* Unref skb_frag_pages in the src scatterlist if necessary.
 	 * Skip the first sg which comes from skb->data.
 	 */
-	if (req->src != req->dst)
-		for (sg = sg_next(req->src); sg; sg = sg_next(sg))
-			put_page(sg_page(sg));
+	if (req->src != req->dst) {
+		for (sg = sg_next(req->src); sg; sg = sg_next(sg)) {
+			struct page *page = sg_page(sg);
+
+			if (!napi_pp_put_page(page, false))
+				put_page(page);
+		}
+	}
 }
 
 #ifdef CONFIG_INET6_ESPINTCP