
[net-next] skbuff: Optimize SKB coalescing for page pool case

Message ID 20230628121150.47778-1-liangchen.linux@gmail.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Series [net-next] skbuff: Optimize SKB coalescing for page pool case

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 5150 this patch: 5150
netdev/cc_maintainers success CCed 7 of 7 maintainers
netdev/build_clang success Errors and warnings before: 1979 this patch: 1979
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 5379 this patch: 5379
netdev/checkpatch warning CHECK: Alignment should match open parenthesis WARNING: line length of 82 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Liang Chen June 28, 2023, 12:11 p.m. UTC
In order to address the issues encountered with commit 1effe8ca4e34
("skbuff: fix coalescing for page_pool fragment recycling"), the
following combination of conditions was excluded from skb coalescing:

from->pp_recycle = 1
from->cloned = 1
to->pp_recycle = 1

However, in page pool environments, the aforementioned combination can
be quite common. In scenarios with a high volume of small packets, it
can significantly reduce the success rate of coalescing. For example,
for 256-byte packets, our comparison of the coalescing success rate is
as follows:

Without page pool: 70%
With page pool: 13%

Consequently, this has a negative impact on performance:

Without page pool: 2.64 Gbits/sec
With page pool: 2.41 Gbits/sec

Therefore, it seems worthwhile to optimize this scenario and enable
coalescing for this particular combination. To achieve this, we need to
correctly increment the page pool fragment count (pp_frag_count) of the
"from" SKB's frag pages.

Following this optimization, the success rate of coalescing measured in our
environment has improved as follows:

With page pool: 60%

This success rate approaches the rate achieved without the page pool,
and performance has also improved:

With page pool: 2.61 Gbits/sec

Below is the performance comparison for small packets before and after
this optimization. We observe no impact on packets larger than 4K.

without page pool fragments (PP_FLAG_PAGE_FRAG)
packet size     before      after
(bytes)         (Gbits/sec) (Gbits/sec)
128             1.28        1.37
256             2.41        2.61
512             4.56        4.87
1024            7.69        8.21
2048            12.85       13.41

with page pool fragments (PP_FLAG_PAGE_FRAG)
packet size     before      after
(bytes)         (Gbits/sec) (Gbits/sec)
128             1.28        1.37
256             2.35        2.62
512             4.37        4.86
1024            7.62        8.41
2048            13.07       13.53

with page pool fragments (PP_FLAG_PAGE_FRAG) and high order (order = 3)
packet size     before      after
(bytes)         (Gbits/sec) (Gbits/sec)
128             1.28        1.41
256             2.41        2.74
512             4.57        5.25
1024            8.61        9.71
2048            14.81       16.78

Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
---
 include/net/page_pool.h | 21 +++++++++++++++++++++
 net/core/skbuff.c       | 11 +++++++----
 2 files changed, 28 insertions(+), 4 deletions(-)

Comments

Yunsheng Lin June 29, 2023, 6:53 a.m. UTC | #1
On 2023/6/28 20:11, Liang Chen wrote:
> In order to address the issues encountered with commit 1effe8ca4e34
> ("skbuff: fix coalescing for page_pool fragment recycling"), the
> combination of the following condition was excluded from skb coalescing:
> 
> from->pp_recycle = 1
> from->cloned = 1
> to->pp_recycle = 1
> 
> However, with page pool environments, the aforementioned combination can
> be quite common. In scenarios with a higher number of small packets, it
> can significantly affect the success rate of coalescing. For example,
> when considering packets of 256 bytes size, our comparison of coalescing
> success rate is as follows:

skb_try_coalesce() only allows coalescing when the 'to' skb is not cloned.

Could you give more details about the testing where we have a non-cloned
'to' skb and a cloned 'from' skb? Both of them should belong to the
same flow.

I had the patchset below trying to do something similar to what this patch does:
https://lore.kernel.org/all/20211009093724.10539-5-linyunsheng@huawei.com/

It seems this patch is only trying to optimize a specific case of skb
coalescing. So if skb coalescing between non-cloned and cloned skbs is a
common case, then it might be worth optimizing.


> 
> Without page pool: 70%
> With page pool: 13%
> 

...

> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> index 126f9e294389..05e5d8ead63b 100644
> --- a/include/net/page_pool.h
> +++ b/include/net/page_pool.h
> @@ -399,4 +399,25 @@ static inline void page_pool_nid_changed(struct page_pool *pool, int new_nid)
>  		page_pool_update_nid(pool, new_nid);
>  }
>  
> +static inline bool page_pool_is_pp_page(struct page *page)
> +{
> +	return (page->pp_magic & ~0x3UL) == PP_SIGNATURE;
> +}
> +
> +static inline bool page_pool_is_pp_page_frag(struct page *page)
> +{
> +	return !!(page->pp->p.flags & PP_FLAG_PAGE_FRAG);
> +}
> +
> +static inline void page_pool_page_ref(struct page *page)
> +{
> +	struct page *head_page = compound_head(page);

It seems we could avoid adding head_page here:
page = compound_head(page);

> +
> +	if (page_pool_is_pp_page(head_page) &&
> +			page_pool_is_pp_page_frag(head_page))
> +		atomic_long_inc(&head_page->pp_frag_count);
> +	else
> +		get_page(head_page);

page_ref_inc() should be enough here instead of get_page(),
as compound_head() has been called.

> +}
> +
>  #endif /* _NET_PAGE_POOL_H */
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 6c5915efbc17..9806b091f0f6 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -5666,8 +5666,7 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
>  	 * !@to->pp_recycle but its tricky (due to potential race with
>  	 * the clone disappearing) and rare, so not worth dealing with.
>  	 */
> -	if (to->pp_recycle != from->pp_recycle ||
> -	    (from->pp_recycle && skb_cloned(from)))
> +	if (to->pp_recycle != from->pp_recycle)
>  		return false;
>  
>  	if (len <= skb_tailroom(to)) {
> @@ -5724,8 +5723,12 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
>  	/* if the skb is not cloned this does nothing
>  	 * since we set nr_frags to 0.
>  	 */
> -	for (i = 0; i < from_shinfo->nr_frags; i++)
> -		__skb_frag_ref(&from_shinfo->frags[i]);
> +	if (from->pp_recycle)
> +		for (i = 0; i < from_shinfo->nr_frags; i++)
> +			page_pool_page_ref(skb_frag_page(&from_shinfo->frags[i]));
> +	else
> +		for (i = 0; i < from_shinfo->nr_frags; i++)
> +			__skb_frag_ref(&from_shinfo->frags[i]);
>  
>  	to->truesize += delta;
>  	to->len += len;
>
Liang Chen June 29, 2023, 12:17 p.m. UTC | #2
On Thu, Jun 29, 2023 at 2:53 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2023/6/28 20:11, Liang Chen wrote:
> > In order to address the issues encountered with commit 1effe8ca4e34
> > ("skbuff: fix coalescing for page_pool fragment recycling"), the
> > combination of the following condition was excluded from skb coalescing:
> >
> > from->pp_recycle = 1
> > from->cloned = 1
> > to->pp_recycle = 1
> >
> > However, with page pool environments, the aforementioned combination can
> > be quite common. In scenarios with a higher number of small packets, it
> > can significantly affect the success rate of coalescing. For example,
> > when considering packets of 256 bytes size, our comparison of coalescing
> > success rate is as follows:
>
> As skb_try_coalesce() only allow coaleascing when 'to' skb is not cloned.
>
> Could you give more detailed about the testing when we have a non-cloned
> 'to' skb and a cloned 'from' skb? As both of them should be belong to the
> same flow.
>
> I had the below patchset trying to do something similar as this patch does:
> https://lore.kernel.org/all/20211009093724.10539-5-linyunsheng@huawei.com/
>
> It seems this patch is only trying to optimize a specific case for skb
> coalescing, So if skb coalescing between non-cloned and cloned skb is a
> common case, then it might worth optimizing.
>

Sure, thanks for the information! The testing is just a common iperf
test, as below.

iperf3 -c <server IP> -i 5 -f g -t 0 -l 128

We observed the frequency of each combination of the pp (page pool)
and clone conditions when entering skb_try_coalesce. The results
motivated us to propose this optimization, as we noticed that case
11 (from pp/clone=1/1 and to pp/clone=1/0) occurs quite often.

+-------------+--------------+--------------+--------------+--------------+
|   from/to   | pp/clone=0/0 | pp/clone=0/1 | pp/clone=1/0 | pp/clone=1/1 |
+-------------+--------------+--------------+--------------+--------------+
|pp/clone=0/0 | 0            | 1            | 2            | 3            |
|pp/clone=0/1 | 4            | 5            | 6            | 7            |
|pp/clone=1/0 | 8            | 9            | 10           | 11           |
|pp/clone=1/1 | 12           | 13           | 14           | 15           |
+-------------+--------------+--------------+--------------+--------------+


packet size 128:
total : 152903
0 : 0            (0%)
1 : 0            (0%)
2 : 0            (0%)
3 : 0            (0%)
4 : 0            (0%)
5 : 0            (0%)
6 : 0            (0%)
7 : 0            (0%)
8 : 0            (0%)
9 : 0            (0%)
10 : 20681       (13%)
11 : 90136       (58%)
12 : 0           (0%)
13 : 0           (0%)
14 : 0           (0%)
15 : 42086       (27%)

Thanks,
Liang


>
> >
> > Without page pool: 70%
> > With page pool: 13%
> >
>
> ...
>
> > diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> > index 126f9e294389..05e5d8ead63b 100644
> > --- a/include/net/page_pool.h
> > +++ b/include/net/page_pool.h
> > @@ -399,4 +399,25 @@ static inline void page_pool_nid_changed(struct page_pool *pool, int new_nid)
> >               page_pool_update_nid(pool, new_nid);
> >  }
> >
> > +static inline bool page_pool_is_pp_page(struct page *page)
> > +{
> > +     return (page->pp_magic & ~0x3UL) == PP_SIGNATURE;
> > +}
> > +
> > +static inline bool page_pool_is_pp_page_frag(struct page *page)
> > +{
> > +     return !!(page->pp->p.flags & PP_FLAG_PAGE_FRAG);
> > +}
> > +
> > +static inline void page_pool_page_ref(struct page *page)
> > +{
> > +     struct page *head_page = compound_head(page);
>
> It seems we could avoid adding head_page here:
> page = compound_head(page);
>
> > +
> > +     if (page_pool_is_pp_page(head_page) &&
> > +                     page_pool_is_pp_page_frag(head_page))
> > +             atomic_long_inc(&head_page->pp_frag_count);
> > +     else
> > +             get_page(head_page);
>
> page_ref_inc() should be enough here instead of get_page()
> as compound_head() have been called.
>
> > +}
> > +
> >  #endif /* _NET_PAGE_POOL_H */
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 6c5915efbc17..9806b091f0f6 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -5666,8 +5666,7 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
> >        * !@to->pp_recycle but its tricky (due to potential race with
> >        * the clone disappearing) and rare, so not worth dealing with.
> >        */
> > -     if (to->pp_recycle != from->pp_recycle ||
> > -         (from->pp_recycle && skb_cloned(from)))
> > +     if (to->pp_recycle != from->pp_recycle)
> >               return false;
> >
> >       if (len <= skb_tailroom(to)) {
> > @@ -5724,8 +5723,12 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
> >       /* if the skb is not cloned this does nothing
> >        * since we set nr_frags to 0.
> >        */
> > -     for (i = 0; i < from_shinfo->nr_frags; i++)
> > -             __skb_frag_ref(&from_shinfo->frags[i]);
> > +     if (from->pp_recycle)
> > +             for (i = 0; i < from_shinfo->nr_frags; i++)
> > +                     page_pool_page_ref(skb_frag_page(&from_shinfo->frags[i]));
> > +     else
> > +             for (i = 0; i < from_shinfo->nr_frags; i++)
> > +                     __skb_frag_ref(&from_shinfo->frags[i]);
> >
> >       to->truesize += delta;
> >       to->len += len;
> >
Liang Chen June 29, 2023, 12:19 p.m. UTC | #3
On Thu, Jun 29, 2023 at 8:17 PM Liang Chen <liangchen.linux@gmail.com> wrote:
>
> On Thu, Jun 29, 2023 at 2:53 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >
> > On 2023/6/28 20:11, Liang Chen wrote:
> > > In order to address the issues encountered with commit 1effe8ca4e34
> > > ("skbuff: fix coalescing for page_pool fragment recycling"), the
> > > combination of the following condition was excluded from skb coalescing:
> > >
> > > from->pp_recycle = 1
> > > from->cloned = 1
> > > to->pp_recycle = 1
> > >
> > > However, with page pool environments, the aforementioned combination can
> > > be quite common. In scenarios with a higher number of small packets, it
> > > can significantly affect the success rate of coalescing. For example,
> > > when considering packets of 256 bytes size, our comparison of coalescing
> > > success rate is as follows:
> >
> > As skb_try_coalesce() only allow coaleascing when 'to' skb is not cloned.
> >
> > Could you give more detailed about the testing when we have a non-cloned
> > 'to' skb and a cloned 'from' skb? As both of them should be belong to the
> > same flow.
> >
> > I had the below patchset trying to do something similar as this patch does:
> > https://lore.kernel.org/all/20211009093724.10539-5-linyunsheng@huawei.com/
> >
> > It seems this patch is only trying to optimize a specific case for skb
> > coalescing, So if skb coalescing between non-cloned and cloned skb is a
> > common case, then it might worth optimizing.
> >
>
> Sure, Thanks for the information! The testing is just a common iperf
> test as below.
>
> iperf3 -c <server IP> -i 5 -f g -t 0 -l 128
>
> We observed the frequency of each combination of the pp (page pool)
> and clone condition when entering skb_try_coalesce. The results
> motivated us to propose such an optimization, as we noticed that case
> 11 (from pp/clone=1/1 and to pp/clone = 1/0) occurs quite often.
>
> +-------------+--------------+--------------+--------------+--------------+
> |   from/to   | pp/clone=0/0 | pp/clone=0/1 | pp/clone=1/0 | pp/clone=1/1 |
> +-------------+--------------+--------------+--------------+--------------+
> |pp/clone=0/0 | 0            | 1            | 2            | 3            |
> |pp/clone=0/1 | 4            | 5            | 6            | 7            |
> |pp/clone=1/0 | 8            | 9            | 10           | 11           |
> |pp/clone=1/1 | 12           | 13           | 14           | 15           |
> +-------------+--------------+--------------+--------------+--------------+
>
>
> packet size 128:
> total : 152903
> 0 : 0            (0%)
> 1 : 0            (0%)
> 2 : 0            (0%)
> 3 : 0            (0%)
> 4 : 0            (0%)
> 5 : 0            (0%)
> 6 : 0            (0%)
> 7 : 0            (0%)
> 8 : 0            (0%)
> 9 : 0            (0%)
> 10 : 20681       (13%)
> 11 : 90136       (58%)
> 12 : 0           (0%)
> 13 : 0           (0%)
> 14 : 0           (0%)
> 15 : 42086       (27%)
>
> Thanks,
> Liang
>
>
> >
> > >
> > > Without page pool: 70%
> > > With page pool: 13%
> > >
> >
> > ...
> >
> > > diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> > > index 126f9e294389..05e5d8ead63b 100644
> > > --- a/include/net/page_pool.h
> > > +++ b/include/net/page_pool.h
> > > @@ -399,4 +399,25 @@ static inline void page_pool_nid_changed(struct page_pool *pool, int new_nid)
> > >               page_pool_update_nid(pool, new_nid);
> > >  }
> > >
> > > +static inline bool page_pool_is_pp_page(struct page *page)
> > > +{
> > > +     return (page->pp_magic & ~0x3UL) == PP_SIGNATURE;
> > > +}
> > > +
> > > +static inline bool page_pool_is_pp_page_frag(struct page *page)
> > > +{
> > > +     return !!(page->pp->p.flags & PP_FLAG_PAGE_FRAG);
> > > +}
> > > +
> > > +static inline void page_pool_page_ref(struct page *page)
> > > +{
> > > +     struct page *head_page = compound_head(page);
> >
> > It seems we could avoid adding head_page here:
> > page = compound_head(page);
> >

Sure.

> > > +
> > > +     if (page_pool_is_pp_page(head_page) &&
> > > +                     page_pool_is_pp_page_frag(head_page))
> > > +             atomic_long_inc(&head_page->pp_frag_count);
> > > +     else
> > > +             get_page(head_page);
> >
> > page_ref_inc() should be enough here instead of get_page()
> > as compound_head() have been called.
> >

Yeah, it will be changed to page_ref_inc() in v2.

> > > +}
> > > +
> > >  #endif /* _NET_PAGE_POOL_H */
> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > index 6c5915efbc17..9806b091f0f6 100644
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > > @@ -5666,8 +5666,7 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
> > >        * !@to->pp_recycle but its tricky (due to potential race with
> > >        * the clone disappearing) and rare, so not worth dealing with.
> > >        */
> > > -     if (to->pp_recycle != from->pp_recycle ||
> > > -         (from->pp_recycle && skb_cloned(from)))
> > > +     if (to->pp_recycle != from->pp_recycle)
> > >               return false;
> > >
> > >       if (len <= skb_tailroom(to)) {
> > > @@ -5724,8 +5723,12 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
> > >       /* if the skb is not cloned this does nothing
> > >        * since we set nr_frags to 0.
> > >        */
> > > -     for (i = 0; i < from_shinfo->nr_frags; i++)
> > > -             __skb_frag_ref(&from_shinfo->frags[i]);
> > > +     if (from->pp_recycle)
> > > +             for (i = 0; i < from_shinfo->nr_frags; i++)
> > > +                     page_pool_page_ref(skb_frag_page(&from_shinfo->frags[i]));
> > > +     else
> > > +             for (i = 0; i < from_shinfo->nr_frags; i++)
> > > +                     __skb_frag_ref(&from_shinfo->frags[i]);
> > >
> > >       to->truesize += delta;
> > >       to->len += len;
> > >
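
For reference, a minimal sketch of the helper with both review points above
folded in (reusing the page variable after compound_head() and bumping the
reference with page_ref_inc()) might look as follows; this is only an
illustration of the agreed direction, not the posted v2 code:

/* Illustrative sketch only, not the submitted patch: page_pool_page_ref()
 * with both review suggestions applied.
 */
static inline void page_pool_page_ref(struct page *page)
{
        page = compound_head(page);

        if (page_pool_is_pp_page(page) &&
            page_pool_is_pp_page_frag(page))
                atomic_long_inc(&page->pp_frag_count);
        else
                /* already the head page, so a plain ref bump is enough */
                page_ref_inc(page);
}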
Yunsheng Lin June 30, 2023, 11:52 a.m. UTC | #4
On 2023/6/29 20:19, Liang Chen wrote:
> On Thu, Jun 29, 2023 at 8:17 PM Liang Chen <liangchen.linux@gmail.com> wrote:
>>
>> On Thu, Jun 29, 2023 at 2:53 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>>
>>> On 2023/6/28 20:11, Liang Chen wrote:
>>>> In order to address the issues encountered with commit 1effe8ca4e34
>>>> ("skbuff: fix coalescing for page_pool fragment recycling"), the
>>>> combination of the following condition was excluded from skb coalescing:
>>>>
>>>> from->pp_recycle = 1
>>>> from->cloned = 1
>>>> to->pp_recycle = 1
>>>>
>>>> However, with page pool environments, the aforementioned combination can
>>>> be quite common. In scenarios with a higher number of small packets, it
>>>> can significantly affect the success rate of coalescing. For example,
>>>> when considering packets of 256 bytes size, our comparison of coalescing
>>>> success rate is as follows:
>>>
>>> As skb_try_coalesce() only allow coaleascing when 'to' skb is not cloned.
>>>
>>> Could you give more detailed about the testing when we have a non-cloned
>>> 'to' skb and a cloned 'from' skb? As both of them should be belong to the
>>> same flow.
>>>
>>> I had the below patchset trying to do something similar as this patch does:
>>> https://lore.kernel.org/all/20211009093724.10539-5-linyunsheng@huawei.com/
>>>
>>> It seems this patch is only trying to optimize a specific case for skb
>>> coalescing, So if skb coalescing between non-cloned and cloned skb is a
>>> common case, then it might worth optimizing.
>>>
>>
>> Sure, Thanks for the information! The testing is just a common iperf
>> test as below.
>>
>> iperf3 -c <server IP> -i 5 -f g -t 0 -l 128
>>
>> We observed the frequency of each combination of the pp (page pool)
>> and clone condition when entering skb_try_coalesce. The results
>> motivated us to propose such an optimization, as we noticed that case
>> 11 (from pp/clone=1/1 and to pp/clone = 1/0) occurs quite often.
>>
>> +-------------+--------------+--------------+--------------+--------------+
>> |   from/to   | pp/clone=0/0 | pp/clone=0/1 | pp/clone=1/0 | pp/clone=1/1 |
>> +-------------+--------------+--------------+--------------+--------------+
>> |pp/clone=0/0 | 0            | 1            | 2            | 3            |
>> |pp/clone=0/1 | 4            | 5            | 6            | 7            |
>> |pp/clone=1/0 | 8            | 9            | 10           | 11           |
>> |pp/clone=1/1 | 12           | 13           | 14           | 15           |
>> |+------------+--------------+--------------+--------------+--------------+


I ran the iperf test, and it seems there is only one skb_clone() call for
each round (I was using 'iperf', not 'iperf3').
Is there any app like tcpdump running? It seems odd that an skb from the rx
path needs to be cloned for a common iperf test. Which app or configuration
is causing the cloning?

Maybe use ftrace to see the skb_clone() calls?
echo skb_clone > set_ftrace_filter
echo function > current_tracer
Jakub Kicinski June 30, 2023, 11:07 p.m. UTC | #5
On Wed, 28 Jun 2023 20:11:50 +0800 Liang Chen wrote:
> +static inline void page_pool_page_ref(struct page *page)
> +{
> +	struct page *head_page = compound_head(page);
> +
> +	if (page_pool_is_pp_page(head_page) &&
> +			page_pool_is_pp_page_frag(head_page))
> +		atomic_long_inc(&head_page->pp_frag_count);
> +	else
> +		get_page(head_page);

AFAICT this is not correct: you cannot take an extra *pp* reference
on a pp page unless it is a frag. If the @to skb is a pp skb (and it
must be, if we get here) we will have two skbs with pp_recycle = 1,
which means if they both get freed at the same time they will both
try to return / release the page.

I haven't looked at Yunsheng's v5 yet, but that's why I was asking 
to rename the pp_frag_count to pp_ref_count. pp_frag_count is a true
refcount (rather than skb->pp_recycle acting as a poor man's single
shot refcount), so in case of frag'd pages / after Yunsheng's work
we will be able to take new refs.

Long story short, please wait until Yunsheng's changes are finalized.
Liang Chen July 3, 2023, 9:09 a.m. UTC | #6
On Fri, Jun 30, 2023 at 7:52 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2023/6/29 20:19, Liang Chen wrote:
> > On Thu, Jun 29, 2023 at 8:17 PM Liang Chen <liangchen.linux@gmail.com> wrote:
> >>
> >> On Thu, Jun 29, 2023 at 2:53 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>>
> >>> On 2023/6/28 20:11, Liang Chen wrote:
> >>>> In order to address the issues encountered with commit 1effe8ca4e34
> >>>> ("skbuff: fix coalescing for page_pool fragment recycling"), the
> >>>> combination of the following condition was excluded from skb coalescing:
> >>>>
> >>>> from->pp_recycle = 1
> >>>> from->cloned = 1
> >>>> to->pp_recycle = 1
> >>>>
> >>>> However, with page pool environments, the aforementioned combination can
> >>>> be quite common. In scenarios with a higher number of small packets, it
> >>>> can significantly affect the success rate of coalescing. For example,
> >>>> when considering packets of 256 bytes size, our comparison of coalescing
> >>>> success rate is as follows:
> >>>
> >>> As skb_try_coalesce() only allow coaleascing when 'to' skb is not cloned.
> >>>
> >>> Could you give more detailed about the testing when we have a non-cloned
> >>> 'to' skb and a cloned 'from' skb? As both of them should be belong to the
> >>> same flow.
> >>>
> >>> I had the below patchset trying to do something similar as this patch does:
> >>> https://lore.kernel.org/all/20211009093724.10539-5-linyunsheng@huawei.com/
> >>>
> >>> It seems this patch is only trying to optimize a specific case for skb
> >>> coalescing, So if skb coalescing between non-cloned and cloned skb is a
> >>> common case, then it might worth optimizing.
> >>>
> >>
> >> Sure, Thanks for the information! The testing is just a common iperf
> >> test as below.
> >>
> >> iperf3 -c <server IP> -i 5 -f g -t 0 -l 128
> >>
> >> We observed the frequency of each combination of the pp (page pool)
> >> and clone condition when entering skb_try_coalesce. The results
> >> motivated us to propose such an optimization, as we noticed that case
> >> 11 (from pp/clone=1/1 and to pp/clone = 1/0) occurs quite often.
> >>
> >> +-------------+--------------+--------------+--------------+--------------+
> >> |   from/to   | pp/clone=0/0 | pp/clone=0/1 | pp/clone=1/0 | pp/clone=1/1 |
> >> +-------------+--------------+--------------+--------------+--------------+
> >> |pp/clone=0/0 | 0            | 1            | 2            | 3            |
> >> |pp/clone=0/1 | 4            | 5            | 6            | 7            |
> >> |pp/clone=1/0 | 8            | 9            | 10           | 11           |
> >> |pp/clone=1/1 | 12           | 13           | 14           | 15           |
> >> |+------------+--------------+--------------+--------------+--------------+
>
>
> I run the iperf test, it seems there is only one skb_clone() calling for each
> round, and I was using 'iperf', not 'iperf3'.
> Is there any app like tcpdump running? It seems odd that the skb from the rx
> need to be cloned for a common iperf test, which app or configuration is causing
> the cloning?
>
> Maybe using the ftrace to see the skb_clone() calling?
> echo skb_clone > set_ftrace_filter
> echo function > current_tracer

Thanks for raising the concern. We did some investigation into the
cause of the skb cloning. It turns out that in our environment (Fedora 37
default network setup) NetworkManager creates a SOCK_DGRAM packet socket,
which leads to an additional packet_type (struct packet_sock.prot_hook)
being registered and thus to the cloning, since
__netif_receive_skb_core iterates through orig_dev->ptype_specific for
all possible skb delivery targets and increases skb->users
accordingly.

We will update the commit message to include this information and to point
out that the figures are specific to our environment. But there are
many possible reasons skbs can be cloned, and improvements in this
code path can still bring benefits.

Thanks,
Liang
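
For anyone who wants to reproduce the same condition without NetworkManager,
a minimal userspace sketch (a hypothetical reproducer, not code taken from
NetworkManager) that registers an extra packet_type via
packet_sock.prot_hook by opening an AF_PACKET SOCK_DGRAM socket bound to an
interface could look like:

/* Hypothetical reproducer: binding an AF_PACKET SOCK_DGRAM socket to a
 * device registers a packet_type (packet_sock.prot_hook), so inbound skbs
 * on that device gain an extra delivery target in
 * __netif_receive_skb_core() and end up cloned, as described above.
 * Requires CAP_NET_RAW.
 */
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <net/if.h>
#include <sys/socket.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>

int main(int argc, char **argv)
{
        const char *ifname = argc > 1 ? argv[1] : "eth0";
        struct sockaddr_ll sll = {
                .sll_family   = AF_PACKET,
                .sll_protocol = htons(ETH_P_IP),
                .sll_ifindex  = if_nametoindex(ifname),
        };
        int fd = socket(AF_PACKET, SOCK_DGRAM, htons(ETH_P_IP));

        if (fd < 0 || !sll.sll_ifindex ||
            bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
                perror("packet socket setup");
                return 1;
        }

        printf("prot_hook registered on %s; rx skbs will now be cloned\n",
               ifname);
        pause();        /* keep the packet_type registered while testing */
        close(fd);
        return 0;
}

Running this alongside the iperf test should show the cloned-'from' cases in
the histogram above.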
Liang Chen July 3, 2023, 9:12 a.m. UTC | #7
On Sat, Jul 1, 2023 at 7:07 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 28 Jun 2023 20:11:50 +0800 Liang Chen wrote:
> > +static inline void page_pool_page_ref(struct page *page)
> > +{
> > +     struct page *head_page = compound_head(page);
> > +
> > +     if (page_pool_is_pp_page(head_page) &&
> > +                     page_pool_is_pp_page_frag(head_page))
> > +             atomic_long_inc(&head_page->pp_frag_count);
> > +     else
> > +             get_page(head_page);
>
> AFAICT this is not correct, you cannot take an extra *pp* reference
> on a pp page, unless it is a frag. If the @to skb is a pp skb (and it
> must be, if we get here) we will have two skbs with pp_recycle = 1.
> Which means if they both get freed at the same time they will both
> try to return / release the page.
>
> I haven't looked at Yunsheng's v5 yet, but that's why I was asking
> to rename the pp_frag_count to pp_ref_count. pp_frag_count is a true
> refcount (rather than skb->pp_recycle acting as a poor man's single
> shot refcount), so in case of frag'd pages / after Yunsheng's work
> we will be able to take new refs.
>
> Long story short please wait until Yunsheng's changes are finalized.

Sure, thanks for the information. I will wait until Yunsheng's changes
are finalized before proposing v2.

As for the "pp" reference, the code tests
page_pool_is_pp_page_frag(head_page) first, so for a non-frag pp page
it will be a get_page() call.

Thanks,
Liang
Jakub Kicinski July 3, 2023, 6:53 p.m. UTC | #8
On Mon, 3 Jul 2023 17:12:46 +0800 Liang Chen wrote:
> As for the "pp" reference, it has the test
> page_pool_is_pp_page_frag(head_page) there. So for a non-frag pp page,
> it will be a get_page call.

You don't understand - you can't put a page from a page pool in two
skbs with pp_recycle set, unless the page is frag'ed.
Yunsheng Lin July 4, 2023, 12:50 a.m. UTC | #9
On 2023/7/4 2:53, Jakub Kicinski wrote:
> On Mon, 3 Jul 2023 17:12:46 +0800 Liang Chen wrote:
>> As for the "pp" reference, it has the test
>> page_pool_is_pp_page_frag(head_page) there. So for a non-frag pp page,
>> it will be a get_page call.
> 
> You don't understand - you can't put a page from a page pool in two
> skbs with pp_recycle set, unless the page is frag'ed.

Agreed. I think we should disallow skb coalescing for non-frag pp pages
instead of calling get_page(), as there is a data race when
page_pool_return_skb_page() is called concurrently for the same non-frag
pp page.

Even with my patchset, it may break arches with
PAGE_POOL_DMA_USE_PP_FRAG_COUNT being true.

> .
>
Liang Chen July 4, 2023, 4:54 a.m. UTC | #10
On Tue, Jul 4, 2023 at 8:50 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2023/7/4 2:53, Jakub Kicinski wrote:
> > On Mon, 3 Jul 2023 17:12:46 +0800 Liang Chen wrote:
> >> As for the "pp" reference, it has the test
> >> page_pool_is_pp_page_frag(head_page) there. So for a non-frag pp page,
> >> it will be a get_page call.
> >
> > You don't understand - you can't put a page from a page pool in two
> > skbs with pp_recycle set, unless the page is frag'ed.
>
> Agreed. I think we should disallow skb coaleasing for non-frag pp page
> instead of calling get_page(), as there is data race when calling
> page_pool_return_skb_page() concurrently for the same non-frag pp page.
>
> Even with my patchset, it may break the arch with
> PAGE_POOL_DMA_USE_PP_FRAG_COUNT being true.
>

Yeah, that's my fault. I thought __page_pool_put_page() would handle the
elevated refcnt, but overlooked that the second skb release would enter
page_pool_return_skb_page() again instead of directly calling put_page().

Disallowing skb coalescing for non-frag pp pages sounds good; we will
handle this problem in v2 after Yunsheng's changes are finalized.

> > .
> >
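
To make the agreed direction concrete, one possible shape of the v2 check in
skb_try_coalesce() is sketched below. This is only an illustration written
before Yunsheng's series is finalized; the helper name
skb_pp_frags_can_coalesce() and its placement are assumptions, not part of
any posted patch:

/* Sketch of the discussed v2 direction (assumed helper, not the posted
 * patch): a cloned pp 'from' skb may only be coalesced when every frag
 * page is a fragmented pp page, so the extra reference can be taken via
 * pp_frag_count instead of creating a second pp_recycle owner of a
 * non-frag pp page.
 */
static bool skb_pp_frags_can_coalesce(struct sk_buff *from)
{
        struct skb_shared_info *shinfo = skb_shinfo(from);
        int i;

        for (i = 0; i < shinfo->nr_frags; i++) {
                struct page *page =
                        compound_head(skb_frag_page(&shinfo->frags[i]));

                if (!page_pool_is_pp_page(page) ||
                    !page_pool_is_pp_page_frag(page))
                        return false;
        }

        return true;
}

/* ... and the early check in skb_try_coalesce() would then become: */
        if (to->pp_recycle != from->pp_recycle ||
            (from->pp_recycle && skb_cloned(from) &&
             !skb_pp_frags_can_coalesce(from)))
                return false;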

Patch

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 126f9e294389..05e5d8ead63b 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -399,4 +399,25 @@  static inline void page_pool_nid_changed(struct page_pool *pool, int new_nid)
 		page_pool_update_nid(pool, new_nid);
 }
 
+static inline bool page_pool_is_pp_page(struct page *page)
+{
+	return (page->pp_magic & ~0x3UL) == PP_SIGNATURE;
+}
+
+static inline bool page_pool_is_pp_page_frag(struct page *page)
+{
+	return !!(page->pp->p.flags & PP_FLAG_PAGE_FRAG);
+}
+
+static inline void page_pool_page_ref(struct page *page)
+{
+	struct page *head_page = compound_head(page);
+
+	if (page_pool_is_pp_page(head_page) &&
+			page_pool_is_pp_page_frag(head_page))
+		atomic_long_inc(&head_page->pp_frag_count);
+	else
+		get_page(head_page);
+}
+
 #endif /* _NET_PAGE_POOL_H */
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6c5915efbc17..9806b091f0f6 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5666,8 +5666,7 @@  bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
 	 * !@to->pp_recycle but its tricky (due to potential race with
 	 * the clone disappearing) and rare, so not worth dealing with.
 	 */
-	if (to->pp_recycle != from->pp_recycle ||
-	    (from->pp_recycle && skb_cloned(from)))
+	if (to->pp_recycle != from->pp_recycle)
 		return false;
 
 	if (len <= skb_tailroom(to)) {
@@ -5724,8 +5723,12 @@  bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
 	/* if the skb is not cloned this does nothing
 	 * since we set nr_frags to 0.
 	 */
-	for (i = 0; i < from_shinfo->nr_frags; i++)
-		__skb_frag_ref(&from_shinfo->frags[i]);
+	if (from->pp_recycle)
+		for (i = 0; i < from_shinfo->nr_frags; i++)
+			page_pool_page_ref(skb_frag_page(&from_shinfo->frags[i]));
+	else
+		for (i = 0; i < from_shinfo->nr_frags; i++)
+			__skb_frag_ref(&from_shinfo->frags[i]);
 
 	to->truesize += delta;
 	to->len += len;