
[net-next,v7,4/4] skbuff: Optimization of SKB coalescing for page pool

Message ID 20231206105419.27952-5-liangchen.linux@gmail.com (mailing list archive)
State New
Series skbuff: Optimize SKB coalescing for page pool

Commit Message

Liang Chen Dec. 6, 2023, 10:54 a.m. UTC
In order to address the issues encountered with commit 1effe8ca4e34
("skbuff: fix coalescing for page_pool fragment recycling"), the
following combination of conditions was excluded from skb coalescing:

from->pp_recycle = 1
from->cloned = 1
to->pp_recycle = 1

However, in page pool environments, the aforementioned combination can
be quite common (e.g. NetworkManager may cause an additional
packet_type to be registered, and thus the cloning). In scenarios with
a large number of small packets, this can significantly reduce the
coalescing success rate. For example, for packets of 256 bytes in
size, our comparison of the coalescing success rate is as follows:

Without page pool: 70%
With page pool: 13%

Consequently, this has an impact on performance:

Without page pool: 2.57 Gbits/sec
With page pool: 2.26 Gbits/sec

Therefore, it seems worthwhile to optimize this scenario and enable
coalescing of this particular combination. To achieve this, we need to
ensure the correct increment of the "from" SKB page's page pool
reference count (pp_ref_count).

Following this optimization, the success rate of coalescing measured in
our environment has improved as follows:

With page pool: 60%

This success rate is close to the rate achieved without page pool,
and the performance has also been improved:

With page pool: 2.52 Gbits/sec

Below is the performance comparison for small packets before and after
this optimization. We observe no impact on packets larger than 4K.

packet size     before      after       improved
(bytes)         (Gbits/sec) (Gbits/sec)
128             1.19        1.27        7.13%
256             2.26        2.52        11.75%
512             4.13        4.81        16.50%
1024            6.17        6.73        9.05%
2048            14.54       15.47       6.45%
4096            25.44       27.87       9.52%

Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com>
Suggested-by: Jason Wang <jasowang@redhat.com>
---
 include/net/page_pool/helpers.h |  5 ++++
 net/core/skbuff.c               | 41 +++++++++++++++++++++++----------
 2 files changed, 34 insertions(+), 12 deletions(-)

Comments

Mina Almasry Dec. 9, 2023, 2:18 a.m. UTC | #1
On Wed, Dec 6, 2023 at 2:54 AM Liang Chen <liangchen.linux@gmail.com> wrote:
>
> In order to address the issues encountered with commit 1effe8ca4e34
> ("skbuff: fix coalescing for page_pool fragment recycling"), the
> combination of the following condition was excluded from skb coalescing:
>
> from->pp_recycle = 1
> from->cloned = 1
> to->pp_recycle = 1
>
> However, with page pool environments, the aforementioned combination can
> be quite common(ex. NetworkMananger may lead to the additional
> packet_type being registered, thus the cloning). In scenarios with a
> higher number of small packets, it can significantly affect the success
> rate of coalescing. For example, considering packets of 256 bytes size,
> our comparison of coalescing success rate is as follows:
>
> Without page pool: 70%
> With page pool: 13%
>
> Consequently, this has an impact on performance:
>
> Without page pool: 2.57 Gbits/sec
> With page pool: 2.26 Gbits/sec
>
> Therefore, it seems worthwhile to optimize this scenario and enable
> coalescing of this particular combination. To achieve this, we need to
> ensure the correct increment of the "from" SKB page's page pool
> reference count (pp_ref_count).
>
> Following this optimization, the success rate of coalescing measured in
> our environment has improved as follows:
>
> With page pool: 60%
>
> This success rate is approaching the rate achieved without using page
> pool, and the performance has also been improved:
>
> With page pool: 2.52 Gbits/sec
>
> Below is the performance comparison for small packets before and after
> this optimization. We observe no impact to packets larger than 4K.
>
> packet size     before      after       improved
> (bytes)         (Gbits/sec) (Gbits/sec)
> 128             1.19        1.27        7.13%
> 256             2.26        2.52        11.75%
> 512             4.13        4.81        16.50%
> 1024            6.17        6.73        9.05%
> 2048            14.54       15.47       6.45%
> 4096            25.44       27.87       9.52%
>
> Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
> Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com>
> Suggested-by: Jason Wang <jasowang@redhat.com>
> ---
>  include/net/page_pool/helpers.h |  5 ++++
>  net/core/skbuff.c               | 41 +++++++++++++++++++++++----------
>  2 files changed, 34 insertions(+), 12 deletions(-)
>
> diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
> index 9dc8eaf8a959..268bc9d9ffd3 100644
> --- a/include/net/page_pool/helpers.h
> +++ b/include/net/page_pool/helpers.h
> @@ -278,6 +278,11 @@ static inline long page_pool_unref_page(struct page *page, long nr)
>         return ret;
>  }
>
> +static inline void page_pool_ref_page(struct page *page)
> +{
> +       atomic_long_inc(&page->pp_ref_count);
> +}
> +
>  static inline bool page_pool_is_last_ref(struct page *page)
>  {
>         /* If page_pool_unref_page() returns 0, we were the last user */
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 7e26b56cda38..3c2515a29376 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -947,6 +947,24 @@ static bool skb_pp_recycle(struct sk_buff *skb, void *data, bool napi_safe)
>         return napi_pp_put_page(virt_to_page(data), napi_safe);
>  }
>
> +/**
> + * skb_pp_frag_ref() - Increase fragment reference count of a page
> + * @page:      page of the fragment on which to increase a reference
> + *
> + * Increase fragment reference count (pp_ref_count) on a page, but if it is
> + * not a page pool page, fallback to increase a reference(_refcount) on a
> + * normal page.
> + */
> +static void skb_pp_frag_ref(struct page *page)
> +{
> +       struct page *head_page = compound_head(page);
> +
> +       if (likely(is_pp_page(head_page)))
> +               page_pool_ref_page(head_page);
> +       else
> +               page_ref_inc(head_page);
> +}
> +

I am confused by this: why add a new helper instead of modifying the
existing helper, skb_frag_ref()?

My mental model is that if the net stack wants to acquire a reference
on a frag, it calls skb_frag_ref(), and if it wants to drop a
reference on a frag, it should call skb_frag_unref(). Internally
skb_frag_ref/unref() can do all sorts of checking to decide whether to
increment page->_refcount or page->pp_ref_count. I can't wrap my head
around the introduction of skb_pp_frag_ref() without an equivalent
skb_pp_frag_unref().

But even if skb_pp_frag_unref() was added, when should the net stack
use skb_frag_ref/unref, and when should the stack use
skb_pp_ref/unref? The docs currently describe what the function does,
but not when a programmer unfamiliar with the page pool should use it.

>  static void skb_kfree_head(void *head, unsigned int end_offset)
>  {
>         if (end_offset == SKB_SMALL_HEAD_HEADROOM)
> @@ -5769,17 +5787,12 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
>                 return false;
>
>         /* In general, avoid mixing page_pool and non-page_pool allocated
> -        * pages within the same SKB. Additionally avoid dealing with clones
> -        * with page_pool pages, in case the SKB is using page_pool fragment
> -        * references (page_pool_alloc_frag()). Since we only take full page
> -        * references for cloned SKBs at the moment that would result in
> -        * inconsistent reference counts.
> -        * In theory we could take full references if @from is cloned and
> -        * !@to->pp_recycle but its tricky (due to potential race with
> -        * the clone disappearing) and rare, so not worth dealing with.
> +        * pages within the same SKB. In theory we could take full
> +        * references if @from is cloned and !@to->pp_recycle but its
> +        * tricky (due to potential race with the clone disappearing) and
> +        * rare, so not worth dealing with.
>          */
> -       if (to->pp_recycle != from->pp_recycle ||
> -           (from->pp_recycle && skb_cloned(from)))
> +       if (to->pp_recycle != from->pp_recycle)
>                 return false;
>
>         if (len <= skb_tailroom(to)) {
> @@ -5836,8 +5849,12 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
>         /* if the skb is not cloned this does nothing
>          * since we set nr_frags to 0.
>          */
> -       for (i = 0; i < from_shinfo->nr_frags; i++)
> -               __skb_frag_ref(&from_shinfo->frags[i]);
> +       if (from->pp_recycle)
> +               for (i = 0; i < from_shinfo->nr_frags; i++)
> +                       skb_pp_frag_ref(skb_frag_page(&from_shinfo->frags[i]));
> +       else
> +               for (i = 0; i < from_shinfo->nr_frags; i++)
> +                       __skb_frag_ref(&from_shinfo->frags[i]);

You added a check here to use skb_pp_frag_ref() instead of
skb_frag_ref(), but it's not clear to me why other callsites of
skb_frag_ref() don't need to be modified in the same way after your
patch.

After your patch:

skb_frag_ref() will always increment page->_refcount
skb_frag_unref() will either decrement page->_refcount or decrement
page->pp_ref_count (depending on the value of skb->pp_recycle).
skb_pp_frag_ref() will either increment page->_refcount or increment
page->pp_ref_count (depending on the value of is_pp_page(), not
skb->pp_recycle).
skb_pp_frag_unref() doesn't exist.

Is this not confusing? Can we streamline things:

skb_frag_ref() increments page->pp_ref_count for skb->pp_recycle,
page->_refcount otherwise.
skb_frag_unref() decrements page->pp_ref_count for skb->pp_recycle,
page->_refcount otherwise.

Or am I missing something that causes us to require this asymmetric
reference counting?
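
To make that concrete, here is a rough sketch of the semantics I have
in mind (illustration only, not compile-tested; names are made up and
it glosses over compound pages and the napi_safe handling the real
recycle path needs):

        /* Sketch of the streamlined helpers suggested above, keyed off
         * skb->pp_recycle rather than is_pp_page().
         */
        static inline void skb_frag_ref_sketch(struct sk_buff *skb, int f)
        {
                struct page *page = skb_frag_page(&skb_shinfo(skb)->frags[f]);

                if (skb->pp_recycle)
                        atomic_long_inc(&page->pp_ref_count);   /* pp_ref_count */
                else
                        get_page(page);                         /* _refcount */
        }

        static inline void skb_frag_unref_sketch(struct sk_buff *skb, int f)
        {
                struct page *page = skb_frag_page(&skb_shinfo(skb)->frags[f]);

                if (skb->pp_recycle)
                        napi_pp_put_page(page, false);  /* drops pp_ref_count, may recycle */
                else
                        put_page(page);                 /* drops _refcount */
        }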

>
>         to->truesize += delta;
>         to->len += len;
> --
> 2.31.1
>
>
Liang Chen Dec. 11, 2023, 3:38 a.m. UTC | #2
On Sat, Dec 9, 2023 at 10:18 AM Mina Almasry <almasrymina@google.com> wrote:
>
> On Wed, Dec 6, 2023 at 2:54 AM Liang Chen <liangchen.linux@gmail.com> wrote:
> >
> > In order to address the issues encountered with commit 1effe8ca4e34
> > ("skbuff: fix coalescing for page_pool fragment recycling"), the
> > combination of the following condition was excluded from skb coalescing:
> >
> > from->pp_recycle = 1
> > from->cloned = 1
> > to->pp_recycle = 1
> >
> > However, with page pool environments, the aforementioned combination can
> > be quite common(ex. NetworkMananger may lead to the additional
> > packet_type being registered, thus the cloning). In scenarios with a
> > higher number of small packets, it can significantly affect the success
> > rate of coalescing. For example, considering packets of 256 bytes size,
> > our comparison of coalescing success rate is as follows:
> >
> > Without page pool: 70%
> > With page pool: 13%
> >
> > Consequently, this has an impact on performance:
> >
> > Without page pool: 2.57 Gbits/sec
> > With page pool: 2.26 Gbits/sec
> >
> > Therefore, it seems worthwhile to optimize this scenario and enable
> > coalescing of this particular combination. To achieve this, we need to
> > ensure the correct increment of the "from" SKB page's page pool
> > reference count (pp_ref_count).
> >
> > Following this optimization, the success rate of coalescing measured in
> > our environment has improved as follows:
> >
> > With page pool: 60%
> >
> > This success rate is approaching the rate achieved without using page
> > pool, and the performance has also been improved:
> >
> > With page pool: 2.52 Gbits/sec
> >
> > Below is the performance comparison for small packets before and after
> > this optimization. We observe no impact to packets larger than 4K.
> >
> > packet size     before      after       improved
> > (bytes)         (Gbits/sec) (Gbits/sec)
> > 128             1.19        1.27        7.13%
> > 256             2.26        2.52        11.75%
> > 512             4.13        4.81        16.50%
> > 1024            6.17        6.73        9.05%
> > 2048            14.54       15.47       6.45%
> > 4096            25.44       27.87       9.52%
> >
> > Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
> > Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com>
> > Suggested-by: Jason Wang <jasowang@redhat.com>
> > ---
> >  include/net/page_pool/helpers.h |  5 ++++
> >  net/core/skbuff.c               | 41 +++++++++++++++++++++++----------
> >  2 files changed, 34 insertions(+), 12 deletions(-)
> >
> > diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
> > index 9dc8eaf8a959..268bc9d9ffd3 100644
> > --- a/include/net/page_pool/helpers.h
> > +++ b/include/net/page_pool/helpers.h
> > @@ -278,6 +278,11 @@ static inline long page_pool_unref_page(struct page *page, long nr)
> >         return ret;
> >  }
> >
> > +static inline void page_pool_ref_page(struct page *page)
> > +{
> > +       atomic_long_inc(&page->pp_ref_count);
> > +}
> > +
> >  static inline bool page_pool_is_last_ref(struct page *page)
> >  {
> >         /* If page_pool_unref_page() returns 0, we were the last user */
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 7e26b56cda38..3c2515a29376 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -947,6 +947,24 @@ static bool skb_pp_recycle(struct sk_buff *skb, void *data, bool napi_safe)
> >         return napi_pp_put_page(virt_to_page(data), napi_safe);
> >  }
> >
> > +/**
> > + * skb_pp_frag_ref() - Increase fragment reference count of a page
> > + * @page:      page of the fragment on which to increase a reference
> > + *
> > + * Increase fragment reference count (pp_ref_count) on a page, but if it is
> > + * not a page pool page, fallback to increase a reference(_refcount) on a
> > + * normal page.
> > + */
> > +static void skb_pp_frag_ref(struct page *page)
> > +{
> > +       struct page *head_page = compound_head(page);
> > +
> > +       if (likely(is_pp_page(head_page)))
> > +               page_pool_ref_page(head_page);
> > +       else
> > +               page_ref_inc(head_page);
> > +}
> > +
>
> I am confused by this, why add a new helper instead of modifying the
> existing helper, skb_frag_ref()?
>
> My mental model is that if the net stack wants to acquire a reference
> on a frag, it calls skb_frag_ref(), and if it wants to drop a
> reference on a frag, it should call skb_frag_unref(). Internally
> skb_frag_ref/unref() can do all sorts of checking to decide whether to
> increment page->refcount or page->pp_ref_count. I can't wrap my head
> around the introduction of skb_pp_frag_ref(), but no equivalent
> skb_pp_frag_unref().
>
> But even if skb_pp_frag_unref() was added, when should the net stack
> use skb_frag_ref/unref, and when should the stack use
> skb_pp_ref/unref? The docs currently describe what the function does,
> but when a program unfamiliar with the page pool should use it.
>
> >  static void skb_kfree_head(void *head, unsigned int end_offset)
> >  {
> >         if (end_offset == SKB_SMALL_HEAD_HEADROOM)
> > @@ -5769,17 +5787,12 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
> >                 return false;
> >
> >         /* In general, avoid mixing page_pool and non-page_pool allocated
> > -        * pages within the same SKB. Additionally avoid dealing with clones
> > -        * with page_pool pages, in case the SKB is using page_pool fragment
> > -        * references (page_pool_alloc_frag()). Since we only take full page
> > -        * references for cloned SKBs at the moment that would result in
> > -        * inconsistent reference counts.
> > -        * In theory we could take full references if @from is cloned and
> > -        * !@to->pp_recycle but its tricky (due to potential race with
> > -        * the clone disappearing) and rare, so not worth dealing with.
> > +        * pages within the same SKB. In theory we could take full
> > +        * references if @from is cloned and !@to->pp_recycle but its
> > +        * tricky (due to potential race with the clone disappearing) and
> > +        * rare, so not worth dealing with.
> >          */
> > -       if (to->pp_recycle != from->pp_recycle ||
> > -           (from->pp_recycle && skb_cloned(from)))
> > +       if (to->pp_recycle != from->pp_recycle)
> >                 return false;
> >
> >         if (len <= skb_tailroom(to)) {
> > @@ -5836,8 +5849,12 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
> >         /* if the skb is not cloned this does nothing
> >          * since we set nr_frags to 0.
> >          */
> > -       for (i = 0; i < from_shinfo->nr_frags; i++)
> > -               __skb_frag_ref(&from_shinfo->frags[i]);
> > +       if (from->pp_recycle)
> > +               for (i = 0; i < from_shinfo->nr_frags; i++)
> > +                       skb_pp_frag_ref(skb_frag_page(&from_shinfo->frags[i]));
> > +       else
> > +               for (i = 0; i < from_shinfo->nr_frags; i++)
> > +                       __skb_frag_ref(&from_shinfo->frags[i]);
>
> You added a check here to use skb_pp_frag_ref() instead of
> skb_frag_ref() here, but it's not clear to me why other callsites of
> skb_frag_ref() don't need to be modified in the same way after your
> patch.
>
> After your patch:
>
> skb_frag_ref() will always increment page->_refcount
> skb_frag_unref() will either decrement page->_refcount or decrement
> page->pp_ref_count (depending on the value of skb->pp_recycle).
> skb_pp_frag_ref() will either increment page->_refcount or increment
> page->pp_ref_count (depending on the value of is_pp_page(), not
> skb->pp_recycle).
> skb_pp_frag_unref() doesn't exist.
>
> Is this not confusing? Can we streamline things:
>
> skb_frag_ref() increments page->pp_ref_count for skb->pp_recycle,
> page->_refcount otherwise.
> skb_frag_unref() decrement page->pp_ref_count for skb->pp_recycle,
> page->_refcount otherwise.
>
> Or am I missing something that causes us to require this asymmetric
> reference counting?
>

This idea was previously implemented, as shown here:
https://lore.kernel.org/all/20211009093724.10539-5-linyunsheng@huawei.com/.
But implementing this would add some unnecessary overhead, since
currently 'skb_try_coalesce' is the only place where the page pool
reference count of an skb frag might be increased. I would prefer to
move the logic into '__skb_frag_ref' once such a need becomes more
common. Thanks!
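
For illustration, a rough sketch of what that follow-up might look
like (not part of this series; the extra "recycle" argument is
hypothetical, and the added branch on every caller is the overhead
mentioned above):

        /* Hypothetical follow-up only: fold the pp-aware increment into
         * __skb_frag_ref(), similar in spirit to the series linked above.
         */
        static inline void __skb_frag_ref(skb_frag_t *frag, bool recycle)
        {
                struct page *page = compound_head(skb_frag_page(frag));

                if (recycle && is_pp_page(page))
                        page_pool_ref_page(page);       /* pp_ref_count */
                else
                        page_ref_inc(page);             /* _refcount */
        }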

> >
> >         to->truesize += delta;
> >         to->len += len;
> > --
> > 2.31.1
> >
> >
>
>
> --
> Thanks,
> Mina
Mina Almasry Dec. 11, 2023, 4:21 a.m. UTC | #3
On Sun, Dec 10, 2023 at 7:38 PM Liang Chen <liangchen.linux@gmail.com> wrote:
>
> On Sat, Dec 9, 2023 at 10:18 AM Mina Almasry <almasrymina@google.com> wrote:
> >
> > On Wed, Dec 6, 2023 at 2:54 AM Liang Chen <liangchen.linux@gmail.com> wrote:
> > >
> > > In order to address the issues encountered with commit 1effe8ca4e34
> > > ("skbuff: fix coalescing for page_pool fragment recycling"), the
> > > combination of the following condition was excluded from skb coalescing:
> > >
> > > from->pp_recycle = 1
> > > from->cloned = 1
> > > to->pp_recycle = 1
> > >
> > > However, with page pool environments, the aforementioned combination can
> > > be quite common(ex. NetworkMananger may lead to the additional
> > > packet_type being registered, thus the cloning). In scenarios with a
> > > higher number of small packets, it can significantly affect the success
> > > rate of coalescing. For example, considering packets of 256 bytes size,
> > > our comparison of coalescing success rate is as follows:
> > >
> > > Without page pool: 70%
> > > With page pool: 13%
> > >
> > > Consequently, this has an impact on performance:
> > >
> > > Without page pool: 2.57 Gbits/sec
> > > With page pool: 2.26 Gbits/sec
> > >
> > > Therefore, it seems worthwhile to optimize this scenario and enable
> > > coalescing of this particular combination. To achieve this, we need to
> > > ensure the correct increment of the "from" SKB page's page pool
> > > reference count (pp_ref_count).
> > >
> > > Following this optimization, the success rate of coalescing measured in
> > > our environment has improved as follows:
> > >
> > > With page pool: 60%
> > >
> > > This success rate is approaching the rate achieved without using page
> > > pool, and the performance has also been improved:
> > >
> > > With page pool: 2.52 Gbits/sec
> > >
> > > Below is the performance comparison for small packets before and after
> > > this optimization. We observe no impact to packets larger than 4K.
> > >
> > > packet size     before      after       improved
> > > (bytes)         (Gbits/sec) (Gbits/sec)
> > > 128             1.19        1.27        7.13%
> > > 256             2.26        2.52        11.75%
> > > 512             4.13        4.81        16.50%
> > > 1024            6.17        6.73        9.05%
> > > 2048            14.54       15.47       6.45%
> > > 4096            25.44       27.87       9.52%
> > >
> > > Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
> > > Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com>
> > > Suggested-by: Jason Wang <jasowang@redhat.com>
> > > ---
> > >  include/net/page_pool/helpers.h |  5 ++++
> > >  net/core/skbuff.c               | 41 +++++++++++++++++++++++----------
> > >  2 files changed, 34 insertions(+), 12 deletions(-)
> > >
> > > diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
> > > index 9dc8eaf8a959..268bc9d9ffd3 100644
> > > --- a/include/net/page_pool/helpers.h
> > > +++ b/include/net/page_pool/helpers.h
> > > @@ -278,6 +278,11 @@ static inline long page_pool_unref_page(struct page *page, long nr)
> > >         return ret;
> > >  }
> > >
> > > +static inline void page_pool_ref_page(struct page *page)
> > > +{
> > > +       atomic_long_inc(&page->pp_ref_count);
> > > +}
> > > +
> > >  static inline bool page_pool_is_last_ref(struct page *page)
> > >  {
> > >         /* If page_pool_unref_page() returns 0, we were the last user */
> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > index 7e26b56cda38..3c2515a29376 100644
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > > @@ -947,6 +947,24 @@ static bool skb_pp_recycle(struct sk_buff *skb, void *data, bool napi_safe)
> > >         return napi_pp_put_page(virt_to_page(data), napi_safe);
> > >  }
> > >
> > > +/**
> > > + * skb_pp_frag_ref() - Increase fragment reference count of a page
> > > + * @page:      page of the fragment on which to increase a reference
> > > + *
> > > + * Increase fragment reference count (pp_ref_count) on a page, but if it is
> > > + * not a page pool page, fallback to increase a reference(_refcount) on a
> > > + * normal page.
> > > + */
> > > +static void skb_pp_frag_ref(struct page *page)
> > > +{
> > > +       struct page *head_page = compound_head(page);
> > > +
> > > +       if (likely(is_pp_page(head_page)))
> > > +               page_pool_ref_page(head_page);
> > > +       else
> > > +               page_ref_inc(head_page);
> > > +}
> > > +
> >
> > I am confused by this, why add a new helper instead of modifying the
> > existing helper, skb_frag_ref()?
> >
> > My mental model is that if the net stack wants to acquire a reference
> > on a frag, it calls skb_frag_ref(), and if it wants to drop a
> > reference on a frag, it should call skb_frag_unref(). Internally
> > skb_frag_ref/unref() can do all sorts of checking to decide whether to
> > increment page->refcount or page->pp_ref_count. I can't wrap my head
> > around the introduction of skb_pp_frag_ref(), but no equivalent
> > skb_pp_frag_unref().
> >
> > But even if skb_pp_frag_unref() was added, when should the net stack
> > use skb_frag_ref/unref, and when should the stack use
> > skb_pp_ref/unref? The docs currently describe what the function does,
> > but when a program unfamiliar with the page pool should use it.
> >
> > >  static void skb_kfree_head(void *head, unsigned int end_offset)
> > >  {
> > >         if (end_offset == SKB_SMALL_HEAD_HEADROOM)
> > > @@ -5769,17 +5787,12 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
> > >                 return false;
> > >
> > >         /* In general, avoid mixing page_pool and non-page_pool allocated
> > > -        * pages within the same SKB. Additionally avoid dealing with clones
> > > -        * with page_pool pages, in case the SKB is using page_pool fragment
> > > -        * references (page_pool_alloc_frag()). Since we only take full page
> > > -        * references for cloned SKBs at the moment that would result in
> > > -        * inconsistent reference counts.
> > > -        * In theory we could take full references if @from is cloned and
> > > -        * !@to->pp_recycle but its tricky (due to potential race with
> > > -        * the clone disappearing) and rare, so not worth dealing with.
> > > +        * pages within the same SKB. In theory we could take full
> > > +        * references if @from is cloned and !@to->pp_recycle but its
> > > +        * tricky (due to potential race with the clone disappearing) and
> > > +        * rare, so not worth dealing with.
> > >          */
> > > -       if (to->pp_recycle != from->pp_recycle ||
> > > -           (from->pp_recycle && skb_cloned(from)))
> > > +       if (to->pp_recycle != from->pp_recycle)
> > >                 return false;
> > >
> > >         if (len <= skb_tailroom(to)) {
> > > @@ -5836,8 +5849,12 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
> > >         /* if the skb is not cloned this does nothing
> > >          * since we set nr_frags to 0.
> > >          */
> > > -       for (i = 0; i < from_shinfo->nr_frags; i++)
> > > -               __skb_frag_ref(&from_shinfo->frags[i]);
> > > +       if (from->pp_recycle)
> > > +               for (i = 0; i < from_shinfo->nr_frags; i++)
> > > +                       skb_pp_frag_ref(skb_frag_page(&from_shinfo->frags[i]));
> > > +       else
> > > +               for (i = 0; i < from_shinfo->nr_frags; i++)
> > > +                       __skb_frag_ref(&from_shinfo->frags[i]);
> >
> > You added a check here to use skb_pp_frag_ref() instead of
> > skb_frag_ref() here, but it's not clear to me why other callsites of
> > skb_frag_ref() don't need to be modified in the same way after your
> > patch.
> >
> > After your patch:
> >
> > skb_frag_ref() will always increment page->_refcount
> > skb_frag_unref() will either decrement page->_refcount or decrement
> > page->pp_ref_count (depending on the value of skb->pp_recycle).
> > skb_pp_frag_ref() will either increment page->_refcount or increment
> > page->pp_ref_count (depending on the value of is_pp_page(), not
> > skb->pp_recycle).
> > skb_pp_frag_unref() doesn't exist.
> >
> > Is this not confusing? Can we streamline things:
> >
> > skb_frag_ref() increments page->pp_ref_count for skb->pp_recycle,
> > page->_refcount otherwise.
> > skb_frag_unref() decrement page->pp_ref_count for skb->pp_recycle,
> > page->_refcount otherwise.
> >
> > Or am I missing something that causes us to require this asymmetric
> > reference counting?
> >
>
> This idea was previously implemented, as shown here:
> https://lore.kernel.org/all/20211009093724.10539-5-linyunsheng@huawei.com/.
> But implementing this would result in some unnecessary overhead, since
> currently, 'skb_try_coalesce' is the only place where the page pool
> reference count for skb frag might be increased. I would prefer to
> move the logic to '__skb_frag_ref' when such a need becomes more
> common. Thanks!
>

Is it possible/desirable to add a comment to skb_frag_ref() that it
should not be used with skb->pp_recycle? At least I was tripped up by
this, but maybe it's considered obvious somehow.

But I feel like this maybe needs to be fixed. Why does the page_pool
need a separate page->pp_ref_count? Why not use page->_refcount like
the rest of the code? Is there a history here behind this decision
that you can point me to? It seems to me that
incrementing/decrementing page->pp_ref_count may be equivalent to
doing the same on page->_refcount.

> > >
> > >         to->truesize += delta;
> > >         to->len += len;
> > > --
> > > 2.31.1
> > >
> > >
> >
> >
> > --
> > Thanks,
> > Mina
Jakub Kicinski Dec. 11, 2023, 7:32 p.m. UTC | #4
On Sun, 10 Dec 2023 20:21:21 -0800 Mina Almasry wrote:
> Is it possible/desirable to add a comment to skb_frag_ref() that it
> should not be used with skb->pp_recycle? At least I was tripped by
> this, but maybe it's considered obvious somehow.
> 
> But I feel like this maybe needs to be fixed. Why does the page_pool
> need a separate page->pp_ref_count? Why not use page->_refcount like
> the rest of the code? Is there a history here behind this decision
> that you can point me to? It seems to me that
> incrementing/decrementing page->pp_ref_count may be equivalent to
> doing the same on page->_refcount.

Does reading the contents of the comment I proposed here:
https://lore.kernel.org/all/20231208173816.2f32ad0f@kernel.org/
elucidate it? The pp_ref_count means the holder is aware that 
they can't release the reference by calling put_page().
Because (a) we may need to clean up the pp state, unmap DMA etc.
and (b) one day it may not even be a real page (your work).
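
As a sketch of that invariant (illustration only, assuming the
pp-aware put helper used in skb_pp_recycle() is callable from the
call site):

        /* A holder of a pp_ref_count must release it through the
         * page_pool aware path; a bare put_page() would skip the pp
         * state cleanup / DMA unmap.  Function name is made up, and the
         * page is assumed to be the head page of a page_pool page.
         */
        static void pp_ref_cycle_sketch(struct page *page)
        {
                /* take an extra fragment reference on the pp page */
                page_pool_ref_page(page);

                /* ... use the fragment ... */

                /* correct release: lets the last ref recycle the page or
                 * unmap its DMA before it is freed
                 */
                napi_pp_put_page(page, false);
        }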

TBH I'm partial to the rename from patch 1, so I wouldn't delay this
work any more :) But you have a point that we should inspect the code
and consider making the semantics of skb_frag_ref() stronger all by
itself, without the need to add a new flavor of the helper.
Are you okay with leaving that as a follow-up, or do you reckon it's
easy enough that we should push for it now?
Mina Almasry Dec. 11, 2023, 8:07 p.m. UTC | #5
On Mon, Dec 11, 2023 at 11:32 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Sun, 10 Dec 2023 20:21:21 -0800 Mina Almasry wrote:
> > Is it possible/desirable to add a comment to skb_frag_ref() that it
> > should not be used with skb->pp_recycle? At least I was tripped by
> > this, but maybe it's considered obvious somehow.
> >
> > But I feel like this maybe needs to be fixed. Why does the page_pool
> > need a separate page->pp_ref_count? Why not use page->_refcount like
> > the rest of the code? Is there a history here behind this decision
> > that you can point me to? It seems to me that
> > incrementing/decrementing page->pp_ref_count may be equivalent to
> > doing the same on page->_refcount.
>
> Does reading the contents of the comment I proposed here:
> https://lore.kernel.org/all/20231208173816.2f32ad0f@kernel.org/
> elucidate it? The pp_ref_count means the holder is aware that
> they can't release the reference by calling put_page().
> Because (a) we may need to clean up the pp state, unmap DMA etc.
> and (b) one day it may not even be a real page (your work).
>

Thank you, that makes sense.

> TBH I'm partial to the rename from patch 1, so I wouldn't delay this
> work any more :) But you have a point that we should inspect the code
> and consider making the semantics of skb_frag_ref() stronger all by
> itself, without the need to add a new flavor of the helper..
> Are you okay with leaving that as a follow up or do you reckon it's
> easy enough we should push for it now?

I think the rename from pp_frag_count -> pp_ref_count is a huge
improvement, and I think the fact that the netstack has a way to
obtain a reference on a pp frag is also a huge improvement. Please go
ahead and merge this if you like; I was asking questions for my own
education and follow-up work to consider.

Patch

diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
index 9dc8eaf8a959..268bc9d9ffd3 100644
--- a/include/net/page_pool/helpers.h
+++ b/include/net/page_pool/helpers.h
@@ -278,6 +278,11 @@  static inline long page_pool_unref_page(struct page *page, long nr)
 	return ret;
 }
 
+static inline void page_pool_ref_page(struct page *page)
+{
+	atomic_long_inc(&page->pp_ref_count);
+}
+
 static inline bool page_pool_is_last_ref(struct page *page)
 {
 	/* If page_pool_unref_page() returns 0, we were the last user */
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7e26b56cda38..3c2515a29376 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -947,6 +947,24 @@  static bool skb_pp_recycle(struct sk_buff *skb, void *data, bool napi_safe)
 	return napi_pp_put_page(virt_to_page(data), napi_safe);
 }
 
+/**
+ * skb_pp_frag_ref() - Increase fragment reference count of a page
+ * @page:	page of the fragment on which to increase a reference
+ *
+ * Increase fragment reference count (pp_ref_count) on a page, but if it is
+ * not a page pool page, fallback to increase a reference(_refcount) on a
+ * normal page.
+ */
+static void skb_pp_frag_ref(struct page *page)
+{
+	struct page *head_page = compound_head(page);
+
+	if (likely(is_pp_page(head_page)))
+		page_pool_ref_page(head_page);
+	else
+		page_ref_inc(head_page);
+}
+
 static void skb_kfree_head(void *head, unsigned int end_offset)
 {
 	if (end_offset == SKB_SMALL_HEAD_HEADROOM)
@@ -5769,17 +5787,12 @@  bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
 		return false;
 
 	/* In general, avoid mixing page_pool and non-page_pool allocated
-	 * pages within the same SKB. Additionally avoid dealing with clones
-	 * with page_pool pages, in case the SKB is using page_pool fragment
-	 * references (page_pool_alloc_frag()). Since we only take full page
-	 * references for cloned SKBs at the moment that would result in
-	 * inconsistent reference counts.
-	 * In theory we could take full references if @from is cloned and
-	 * !@to->pp_recycle but its tricky (due to potential race with
-	 * the clone disappearing) and rare, so not worth dealing with.
+	 * pages within the same SKB. In theory we could take full
+	 * references if @from is cloned and !@to->pp_recycle but its
+	 * tricky (due to potential race with the clone disappearing) and
+	 * rare, so not worth dealing with.
 	 */
-	if (to->pp_recycle != from->pp_recycle ||
-	    (from->pp_recycle && skb_cloned(from)))
+	if (to->pp_recycle != from->pp_recycle)
 		return false;
 
 	if (len <= skb_tailroom(to)) {
@@ -5836,8 +5849,12 @@  bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
 	/* if the skb is not cloned this does nothing
 	 * since we set nr_frags to 0.
 	 */
-	for (i = 0; i < from_shinfo->nr_frags; i++)
-		__skb_frag_ref(&from_shinfo->frags[i]);
+	if (from->pp_recycle)
+		for (i = 0; i < from_shinfo->nr_frags; i++)
+			skb_pp_frag_ref(skb_frag_page(&from_shinfo->frags[i]));
+	else
+		for (i = 0; i < from_shinfo->nr_frags; i++)
+			__skb_frag_ref(&from_shinfo->frags[i]);
 
 	to->truesize += delta;
 	to->len += len;