Message ID | 20230628121150.47778-1-liangchen.linux@gmail.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Series | [net-next] skbuff: Optimize SKB coalescing for page pool case |
On 2023/6/28 20:11, Liang Chen wrote:
> In order to address the issues encountered with commit 1effe8ca4e34
> ("skbuff: fix coalescing for page_pool fragment recycling"), the
> combination of the following condition was excluded from skb coalescing:
>
> from->pp_recycle = 1
> from->cloned = 1
> to->pp_recycle = 1
>
> However, with page pool environments, the aforementioned combination can
> be quite common. In scenarios with a higher number of small packets, it
> can significantly affect the success rate of coalescing. For example,
> when considering packets of 256 bytes size, our comparison of coalescing
> success rate is as follows:

As skb_try_coalesce() only allows coalescing when the 'to' skb is not cloned,
could you give more details about the testing when we have a non-cloned
'to' skb and a cloned 'from' skb? Both of them should belong to the
same flow.

I had the below patchset trying to do something similar as this patch does:
https://lore.kernel.org/all/20211009093724.10539-5-linyunsheng@huawei.com/

It seems this patch is only trying to optimize a specific case for skb
coalescing. So if skb coalescing between non-cloned and cloned skbs is a
common case, then it might be worth optimizing.

>
> Without page pool: 70%
> With page pool: 13%
> ...
> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> index 126f9e294389..05e5d8ead63b 100644
> --- a/include/net/page_pool.h
> +++ b/include/net/page_pool.h
> @@ -399,4 +399,25 @@ static inline void page_pool_nid_changed(struct page_pool *pool, int new_nid)
> 	page_pool_update_nid(pool, new_nid);
> }
>
> +static inline bool page_pool_is_pp_page(struct page *page)
> +{
> +	return (page->pp_magic & ~0x3UL) == PP_SIGNATURE;
> +}
> +
> +static inline bool page_pool_is_pp_page_frag(struct page *page)
> +{
> +	return !!(page->pp->p.flags & PP_FLAG_PAGE_FRAG);
> +}
> +
> +static inline void page_pool_page_ref(struct page *page)
> +{
> +	struct page *head_page = compound_head(page);

It seems we could avoid adding head_page here:
	page = compound_head(page);

> +
> +	if (page_pool_is_pp_page(head_page) &&
> +	    page_pool_is_pp_page_frag(head_page))
> +		atomic_long_inc(&head_page->pp_frag_count);
> +	else
> +		get_page(head_page);

page_ref_inc() should be enough here instead of get_page(),
as compound_head() has been called.

> +}
> +
> #endif /* _NET_PAGE_POOL_H */
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 6c5915efbc17..9806b091f0f6 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -5666,8 +5666,7 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
> 	 * !@to->pp_recycle but its tricky (due to potential race with
> 	 * the clone disappearing) and rare, so not worth dealing with.
> 	 */
> -	if (to->pp_recycle != from->pp_recycle ||
> -	    (from->pp_recycle && skb_cloned(from)))
> +	if (to->pp_recycle != from->pp_recycle)
> 		return false;
>
> 	if (len <= skb_tailroom(to)) {
> @@ -5724,8 +5723,12 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
> 	/* if the skb is not cloned this does nothing
> 	 * since we set nr_frags to 0.
> 	 */
> -	for (i = 0; i < from_shinfo->nr_frags; i++)
> -		__skb_frag_ref(&from_shinfo->frags[i]);
> +	if (from->pp_recycle)
> +		for (i = 0; i < from_shinfo->nr_frags; i++)
> +			page_pool_page_ref(skb_frag_page(&from_shinfo->frags[i]));
> +	else
> +		for (i = 0; i < from_shinfo->nr_frags; i++)
> +			__skb_frag_ref(&from_shinfo->frags[i]);
>
> 	to->truesize += delta;
> 	to->len += len;
>
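For reference, below is a minimal sketch of the helper with the two review
comments above applied, i.e. reusing the page parameter instead of a
separate head_page local and using page_ref_inc() instead of get_page().
It only illustrates the suggestions and is not the actual v2 code:

/* Illustration only: the posted helper reshaped per the two comments above. */
static inline void page_pool_page_ref(struct page *page)
{
	/* Work on the head page directly; no extra local variable needed. */
	page = compound_head(page);

	if (page_pool_is_pp_page(page) &&
	    page_pool_is_pp_page_frag(page)) {
		/* Fragmented page_pool page: take a pp_frag_count reference. */
		atomic_long_inc(&page->pp_frag_count);
	} else {
		/* compound_head() has already been applied, so page_ref_inc()
		 * is sufficient; get_page() would redo that work.
		 */
		page_ref_inc(page);
	}
}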
On Thu, Jun 29, 2023 at 2:53 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2023/6/28 20:11, Liang Chen wrote:
> > In order to address the issues encountered with commit 1effe8ca4e34
> > ("skbuff: fix coalescing for page_pool fragment recycling"), the
> > combination of the following condition was excluded from skb coalescing:
> >
> > from->pp_recycle = 1
> > from->cloned = 1
> > to->pp_recycle = 1
> >
> > However, with page pool environments, the aforementioned combination can
> > be quite common. In scenarios with a higher number of small packets, it
> > can significantly affect the success rate of coalescing. For example,
> > when considering packets of 256 bytes size, our comparison of coalescing
> > success rate is as follows:
>
> As skb_try_coalesce() only allows coalescing when the 'to' skb is not cloned,
> could you give more details about the testing when we have a non-cloned
> 'to' skb and a cloned 'from' skb? Both of them should belong to the
> same flow.
>
> I had the below patchset trying to do something similar as this patch does:
> https://lore.kernel.org/all/20211009093724.10539-5-linyunsheng@huawei.com/
>
> It seems this patch is only trying to optimize a specific case for skb
> coalescing. So if skb coalescing between non-cloned and cloned skbs is a
> common case, then it might be worth optimizing.
>

Sure, Thanks for the information! The testing is just a common iperf
test as below.

iperf3 -c <server IP> -i 5 -f g -t 0 -l 128

We observed the frequency of each combination of the pp (page pool)
and clone condition when entering skb_try_coalesce. The results
motivated us to propose such an optimization, as we noticed that case
11 (from pp/clone=1/1 and to pp/clone = 1/0) occurs quite often.

+-------------+--------------+--------------+--------------+--------------+
| from/to     | pp/clone=0/0 | pp/clone=0/1 | pp/clone=1/0 | pp/clone=1/1 |
+-------------+--------------+--------------+--------------+--------------+
|pp/clone=0/0 |      0       |      1       |      2       |      3       |
|pp/clone=0/1 |      4       |      5       |      6       |      7       |
|pp/clone=1/0 |      8       |      9       |      10      |      11      |
|pp/clone=1/1 |      12      |      13      |      14      |      15      |
+-------------+--------------+--------------+--------------+--------------+

packet size 128:
total : 152903
0  : 0      (0%)
1  : 0      (0%)
2  : 0      (0%)
3  : 0      (0%)
4  : 0      (0%)
5  : 0      (0%)
6  : 0      (0%)
7  : 0      (0%)
8  : 0      (0%)
9  : 0      (0%)
10 : 20681  (13%)
11 : 90136  (58%)
12 : 0      (0%)
13 : 0      (0%)
14 : 0      (0%)
15 : 42086  (27%)

Thanks,
Liang

> >
> > Without page pool: 70%
> > With page pool: 13%
> > ...
> > diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> > index 126f9e294389..05e5d8ead63b 100644
> > --- a/include/net/page_pool.h
> > +++ b/include/net/page_pool.h
> > @@ -399,4 +399,25 @@ static inline void page_pool_nid_changed(struct page_pool *pool, int new_nid)
> > 	page_pool_update_nid(pool, new_nid);
> > }
> >
> > +static inline bool page_pool_is_pp_page(struct page *page)
> > +{
> > +	return (page->pp_magic & ~0x3UL) == PP_SIGNATURE;
> > +}
> > +
> > +static inline bool page_pool_is_pp_page_frag(struct page *page)
> > +{
> > +	return !!(page->pp->p.flags & PP_FLAG_PAGE_FRAG);
> > +}
> > +
> > +static inline void page_pool_page_ref(struct page *page)
> > +{
> > +	struct page *head_page = compound_head(page);
>
> It seems we could avoid adding head_page here:
> 	page = compound_head(page);
>
> > +
> > +	if (page_pool_is_pp_page(head_page) &&
> > +	    page_pool_is_pp_page_frag(head_page))
> > +		atomic_long_inc(&head_page->pp_frag_count);
> > +	else
> > +		get_page(head_page);
>
> page_ref_inc() should be enough here instead of get_page(),
> as compound_head() has been called.
>
> > +}
> > +
> > #endif /* _NET_PAGE_POOL_H */
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 6c5915efbc17..9806b091f0f6 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -5666,8 +5666,7 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
> > 	 * !@to->pp_recycle but its tricky (due to potential race with
> > 	 * the clone disappearing) and rare, so not worth dealing with.
> > 	 */
> > -	if (to->pp_recycle != from->pp_recycle ||
> > -	    (from->pp_recycle && skb_cloned(from)))
> > +	if (to->pp_recycle != from->pp_recycle)
> > 		return false;
> >
> > 	if (len <= skb_tailroom(to)) {
> > @@ -5724,8 +5723,12 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
> > 	/* if the skb is not cloned this does nothing
> > 	 * since we set nr_frags to 0.
> > 	 */
> > -	for (i = 0; i < from_shinfo->nr_frags; i++)
> > -		__skb_frag_ref(&from_shinfo->frags[i]);
> > +	if (from->pp_recycle)
> > +		for (i = 0; i < from_shinfo->nr_frags; i++)
> > +			page_pool_page_ref(skb_frag_page(&from_shinfo->frags[i]));
> > +	else
> > +		for (i = 0; i < from_shinfo->nr_frags; i++)
> > +			__skb_frag_ref(&from_shinfo->frags[i]);
> >
> > 	to->truesize += delta;
> > 	to->len += len;
> >
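The thread does not show how the per-case counters above were gathered.
Purely as an illustration, one hypothetical way to produce such a histogram
at the entry of skb_try_coalesce() is sketched below. The case numbering is
an assumption, chosen so that case 11 matches the reading given in the mail
above (from pp/clone = 1/1, to pp/clone = 1/0):

/* Hypothetical instrumentation, not part of the posted patch: count how
 * often each combination of pp_recycle/cloned is seen for the 'from' and
 * 'to' skbs when entering skb_try_coalesce().  Assumed encoding:
 * state = 2 * pp_recycle + cloned, case = 4 * to_state + from_state.
 */
static atomic_long_t coalesce_case_hist[16];

static void coalesce_case_count(const struct sk_buff *to,
				const struct sk_buff *from)
{
	int to_state = 2 * to->pp_recycle + skb_cloned(to);
	int from_state = 2 * from->pp_recycle + skb_cloned(from);

	atomic_long_inc(&coalesce_case_hist[4 * to_state + from_state]);
}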
On Thu, Jun 29, 2023 at 8:17 PM Liang Chen <liangchen.linux@gmail.com> wrote:
>
> On Thu, Jun 29, 2023 at 2:53 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >
> > On 2023/6/28 20:11, Liang Chen wrote:
> > > In order to address the issues encountered with commit 1effe8ca4e34
> > > ("skbuff: fix coalescing for page_pool fragment recycling"), the
> > > combination of the following condition was excluded from skb coalescing:
> > >
> > > from->pp_recycle = 1
> > > from->cloned = 1
> > > to->pp_recycle = 1
> > >
> > > However, with page pool environments, the aforementioned combination can
> > > be quite common. In scenarios with a higher number of small packets, it
> > > can significantly affect the success rate of coalescing. For example,
> > > when considering packets of 256 bytes size, our comparison of coalescing
> > > success rate is as follows:
> >
> > As skb_try_coalesce() only allows coalescing when the 'to' skb is not cloned,
> > could you give more details about the testing when we have a non-cloned
> > 'to' skb and a cloned 'from' skb? Both of them should belong to the
> > same flow.
> >
> > I had the below patchset trying to do something similar as this patch does:
> > https://lore.kernel.org/all/20211009093724.10539-5-linyunsheng@huawei.com/
> >
> > It seems this patch is only trying to optimize a specific case for skb
> > coalescing. So if skb coalescing between non-cloned and cloned skbs is a
> > common case, then it might be worth optimizing.
> >
>
> Sure, Thanks for the information! The testing is just a common iperf
> test as below.
>
> iperf3 -c <server IP> -i 5 -f g -t 0 -l 128
>
> We observed the frequency of each combination of the pp (page pool)
> and clone condition when entering skb_try_coalesce. The results
> motivated us to propose such an optimization, as we noticed that case
> 11 (from pp/clone=1/1 and to pp/clone = 1/0) occurs quite often.
>
> +-------------+--------------+--------------+--------------+--------------+
> | from/to     | pp/clone=0/0 | pp/clone=0/1 | pp/clone=1/0 | pp/clone=1/1 |
> +-------------+--------------+--------------+--------------+--------------+
> |pp/clone=0/0 |      0       |      1       |      2       |      3       |
> |pp/clone=0/1 |      4       |      5       |      6       |      7       |
> |pp/clone=1/0 |      8       |      9       |      10      |      11      |
> |pp/clone=1/1 |      12      |      13      |      14      |      15      |
> +-------------+--------------+--------------+--------------+--------------+
>
> packet size 128:
> total : 152903
> 0  : 0      (0%)
> 1  : 0      (0%)
> 2  : 0      (0%)
> 3  : 0      (0%)
> 4  : 0      (0%)
> 5  : 0      (0%)
> 6  : 0      (0%)
> 7  : 0      (0%)
> 8  : 0      (0%)
> 9  : 0      (0%)
> 10 : 20681  (13%)
> 11 : 90136  (58%)
> 12 : 0      (0%)
> 13 : 0      (0%)
> 14 : 0      (0%)
> 15 : 42086  (27%)
>
> Thanks,
> Liang
>
> > >
> > > Without page pool: 70%
> > > With page pool: 13%
> > > ...
> > > diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> > > index 126f9e294389..05e5d8ead63b 100644
> > > --- a/include/net/page_pool.h
> > > +++ b/include/net/page_pool.h
> > > @@ -399,4 +399,25 @@ static inline void page_pool_nid_changed(struct page_pool *pool, int new_nid)
> > > 	page_pool_update_nid(pool, new_nid);
> > > }
> > >
> > > +static inline bool page_pool_is_pp_page(struct page *page)
> > > +{
> > > +	return (page->pp_magic & ~0x3UL) == PP_SIGNATURE;
> > > +}
> > > +
> > > +static inline bool page_pool_is_pp_page_frag(struct page *page)
> > > +{
> > > +	return !!(page->pp->p.flags & PP_FLAG_PAGE_FRAG);
> > > +}
> > > +
> > > +static inline void page_pool_page_ref(struct page *page)
> > > +{
> > > +	struct page *head_page = compound_head(page);
> >
> > It seems we could avoid adding head_page here:
> > 	page = compound_head(page);

Sure.

> > > +
> > > +	if (page_pool_is_pp_page(head_page) &&
> > > +	    page_pool_is_pp_page_frag(head_page))
> > > +		atomic_long_inc(&head_page->pp_frag_count);
> > > +	else
> > > +		get_page(head_page);
> >
> > page_ref_inc() should be enough here instead of get_page(),
> > as compound_head() has been called.

Yeah, it will be changed to page_ref_inc on v2.

> > > +}
> > > +
> > > #endif /* _NET_PAGE_POOL_H */
> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > index 6c5915efbc17..9806b091f0f6 100644
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > > @@ -5666,8 +5666,7 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
> > > 	 * !@to->pp_recycle but its tricky (due to potential race with
> > > 	 * the clone disappearing) and rare, so not worth dealing with.
> > > 	 */
> > > -	if (to->pp_recycle != from->pp_recycle ||
> > > -	    (from->pp_recycle && skb_cloned(from)))
> > > +	if (to->pp_recycle != from->pp_recycle)
> > > 		return false;
> > >
> > > 	if (len <= skb_tailroom(to)) {
> > > @@ -5724,8 +5723,12 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
> > > 	/* if the skb is not cloned this does nothing
> > > 	 * since we set nr_frags to 0.
> > > 	 */
> > > -	for (i = 0; i < from_shinfo->nr_frags; i++)
> > > -		__skb_frag_ref(&from_shinfo->frags[i]);
> > > +	if (from->pp_recycle)
> > > +		for (i = 0; i < from_shinfo->nr_frags; i++)
> > > +			page_pool_page_ref(skb_frag_page(&from_shinfo->frags[i]));
> > > +	else
> > > +		for (i = 0; i < from_shinfo->nr_frags; i++)
> > > +			__skb_frag_ref(&from_shinfo->frags[i]);
> > >
> > > 	to->truesize += delta;
> > > 	to->len += len;
> > >
On 2023/6/29 20:19, Liang Chen wrote:
> On Thu, Jun 29, 2023 at 8:17 PM Liang Chen <liangchen.linux@gmail.com> wrote:
>>
>> On Thu, Jun 29, 2023 at 2:53 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>>
>>> On 2023/6/28 20:11, Liang Chen wrote:
>>>> In order to address the issues encountered with commit 1effe8ca4e34
>>>> ("skbuff: fix coalescing for page_pool fragment recycling"), the
>>>> combination of the following condition was excluded from skb coalescing:
>>>>
>>>> from->pp_recycle = 1
>>>> from->cloned = 1
>>>> to->pp_recycle = 1
>>>>
>>>> However, with page pool environments, the aforementioned combination can
>>>> be quite common. In scenarios with a higher number of small packets, it
>>>> can significantly affect the success rate of coalescing. For example,
>>>> when considering packets of 256 bytes size, our comparison of coalescing
>>>> success rate is as follows:
>>>
>>> As skb_try_coalesce() only allows coalescing when the 'to' skb is not cloned,
>>> could you give more details about the testing when we have a non-cloned
>>> 'to' skb and a cloned 'from' skb? Both of them should belong to the
>>> same flow.
>>>
>>> I had the below patchset trying to do something similar as this patch does:
>>> https://lore.kernel.org/all/20211009093724.10539-5-linyunsheng@huawei.com/
>>>
>>> It seems this patch is only trying to optimize a specific case for skb
>>> coalescing. So if skb coalescing between non-cloned and cloned skbs is a
>>> common case, then it might be worth optimizing.
>>>
>>
>> Sure, Thanks for the information! The testing is just a common iperf
>> test as below.
>>
>> iperf3 -c <server IP> -i 5 -f g -t 0 -l 128
>>
>> We observed the frequency of each combination of the pp (page pool)
>> and clone condition when entering skb_try_coalesce. The results
>> motivated us to propose such an optimization, as we noticed that case
>> 11 (from pp/clone=1/1 and to pp/clone = 1/0) occurs quite often.
>>
>> +-------------+--------------+--------------+--------------+--------------+
>> | from/to     | pp/clone=0/0 | pp/clone=0/1 | pp/clone=1/0 | pp/clone=1/1 |
>> +-------------+--------------+--------------+--------------+--------------+
>> |pp/clone=0/0 |      0       |      1       |      2       |      3       |
>> |pp/clone=0/1 |      4       |      5       |      6       |      7       |
>> |pp/clone=1/0 |      8       |      9       |      10      |      11      |
>> |pp/clone=1/1 |      12      |      13      |      14      |      15      |
>> +-------------+--------------+--------------+--------------+--------------+

I ran the iperf test, and it seems there is only one skb_clone() call per
round; also, I was using 'iperf', not 'iperf3'.

Is there any app like tcpdump running? It seems odd that the skb from the rx
path needs to be cloned for a common iperf test. Which app or configuration
is causing the cloning?

Maybe use ftrace to see the skb_clone() calls:
echo skb_clone > set_ftrace_filter
echo function > current_tracer
On Wed, 28 Jun 2023 20:11:50 +0800 Liang Chen wrote:
> +static inline void page_pool_page_ref(struct page *page)
> +{
> +	struct page *head_page = compound_head(page);
> +
> +	if (page_pool_is_pp_page(head_page) &&
> +	    page_pool_is_pp_page_frag(head_page))
> +		atomic_long_inc(&head_page->pp_frag_count);
> +	else
> +		get_page(head_page);

AFAICT this is not correct, you cannot take an extra *pp* reference
on a pp page, unless it is a frag. If the @to skb is a pp skb (and it
must be, if we get here) we will have two skbs with pp_recycle = 1.
Which means if they both get freed at the same time they will both
try to return / release the page.

I haven't looked at Yunsheng's v5 yet, but that's why I was asking
to rename the pp_frag_count to pp_ref_count. pp_frag_count is a true
refcount (rather than skb->pp_recycle acting as a poor man's single
shot refcount), so in case of frag'd pages / after Yunsheng's work
we will be able to take new refs.

Long story short please wait until Yunsheng's changes are finalized.
On Fri, Jun 30, 2023 at 7:52 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2023/6/29 20:19, Liang Chen wrote:
> > On Thu, Jun 29, 2023 at 8:17 PM Liang Chen <liangchen.linux@gmail.com> wrote:
> >>
> >> On Thu, Jun 29, 2023 at 2:53 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>>
> >>> On 2023/6/28 20:11, Liang Chen wrote:
> >>>> In order to address the issues encountered with commit 1effe8ca4e34
> >>>> ("skbuff: fix coalescing for page_pool fragment recycling"), the
> >>>> combination of the following condition was excluded from skb coalescing:
> >>>>
> >>>> from->pp_recycle = 1
> >>>> from->cloned = 1
> >>>> to->pp_recycle = 1
> >>>>
> >>>> However, with page pool environments, the aforementioned combination can
> >>>> be quite common. In scenarios with a higher number of small packets, it
> >>>> can significantly affect the success rate of coalescing. For example,
> >>>> when considering packets of 256 bytes size, our comparison of coalescing
> >>>> success rate is as follows:
> >>>
> >>> As skb_try_coalesce() only allows coalescing when the 'to' skb is not cloned,
> >>> could you give more details about the testing when we have a non-cloned
> >>> 'to' skb and a cloned 'from' skb? Both of them should belong to the
> >>> same flow.
> >>>
> >>> I had the below patchset trying to do something similar as this patch does:
> >>> https://lore.kernel.org/all/20211009093724.10539-5-linyunsheng@huawei.com/
> >>>
> >>> It seems this patch is only trying to optimize a specific case for skb
> >>> coalescing. So if skb coalescing between non-cloned and cloned skbs is a
> >>> common case, then it might be worth optimizing.
> >>>
> >>
> >> Sure, Thanks for the information! The testing is just a common iperf
> >> test as below.
> >>
> >> iperf3 -c <server IP> -i 5 -f g -t 0 -l 128
> >>
> >> We observed the frequency of each combination of the pp (page pool)
> >> and clone condition when entering skb_try_coalesce. The results
> >> motivated us to propose such an optimization, as we noticed that case
> >> 11 (from pp/clone=1/1 and to pp/clone = 1/0) occurs quite often.
> >>
> >> +-------------+--------------+--------------+--------------+--------------+
> >> | from/to     | pp/clone=0/0 | pp/clone=0/1 | pp/clone=1/0 | pp/clone=1/1 |
> >> +-------------+--------------+--------------+--------------+--------------+
> >> |pp/clone=0/0 |      0       |      1       |      2       |      3       |
> >> |pp/clone=0/1 |      4       |      5       |      6       |      7       |
> >> |pp/clone=1/0 |      8       |      9       |      10      |      11      |
> >> |pp/clone=1/1 |      12      |      13      |      14      |      15      |
> >> +-------------+--------------+--------------+--------------+--------------+
>
> I ran the iperf test, and it seems there is only one skb_clone() call per
> round; also, I was using 'iperf', not 'iperf3'.
>
> Is there any app like tcpdump running? It seems odd that the skb from the rx
> path needs to be cloned for a common iperf test. Which app or configuration
> is causing the cloning?
>
> Maybe use ftrace to see the skb_clone() calls:
> echo skb_clone > set_ftrace_filter
> echo function > current_tracer

Thanks for raising the concerns. We did some investigation into the
cause of skb cloning. The result is that in our environment (Fedora 37
default network setup) NetworkManager creates a SOCK_DGRAM socket, which
eventually leads to an additional packet_type (struct packet_sock.prot_hook)
being registered, and thus to the cloning, since __netif_receive_skb_core
iterates through orig_dev->ptype_specific for all possible skb delivery
targets and increases skb->users accordingly.

We will update the commit message to include this information and point
out that the figures are specific to our environment. But there are many
possible reasons skbs can be cloned, and improvements in this code path
can still bring benefits.

Thanks,
Liang
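As a rough illustration of the mechanism described above, the sketch below
paraphrases the relevant kernel logic; it is simplified, not verbatim
source, and the second function name is invented for this sketch.
Delivering the skb to the extra packet_type takes an skb->users reference,
and the AF_PACKET receive handler then clones the now-shared skb, which
also marks the original skb as cloned by the time it reaches TCP and
skb_try_coalesce():

/* 1) __netif_receive_skb_core() delivers the skb to the packet socket's
 *    packet_type found on orig_dev->ptype_specific, taking an extra
 *    skb->users reference (simplified; the real helper also handles
 *    frag orphaning):
 */
static inline int deliver_skb(struct sk_buff *skb,
			      struct packet_type *pt_prev,
			      struct net_device *orig_dev)
{
	refcount_inc(&skb->users);	/* the skb is now shared */
	return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
}

/* 2) The AF_PACKET receive handler sees a shared skb and clones it; cloning
 *    marks the original skb as cloned too, and that original is what later
 *    shows up as a cloned 'from' skb in skb_try_coalesce().  The function
 *    name is made up for this sketch of the relevant packet_rcv() logic:
 */
static struct sk_buff *packet_rcv_clone_if_shared(struct sk_buff *skb)
{
	if (skb_shared(skb)) {
		struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);

		if (!nskb)
			return NULL;
		consume_skb(skb);	/* drop the delivery reference */
		skb = nskb;
	}
	/* ... queue skb to the packet socket ... */
	return skb;
}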
On Sat, Jul 1, 2023 at 7:07 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 28 Jun 2023 20:11:50 +0800 Liang Chen wrote:
> > +static inline void page_pool_page_ref(struct page *page)
> > +{
> > +	struct page *head_page = compound_head(page);
> > +
> > +	if (page_pool_is_pp_page(head_page) &&
> > +	    page_pool_is_pp_page_frag(head_page))
> > +		atomic_long_inc(&head_page->pp_frag_count);
> > +	else
> > +		get_page(head_page);
>
> AFAICT this is not correct, you cannot take an extra *pp* reference
> on a pp page, unless it is a frag. If the @to skb is a pp skb (and it
> must be, if we get here) we will have two skbs with pp_recycle = 1.
> Which means if they both get freed at the same time they will both
> try to return / release the page.
>
> I haven't looked at Yunsheng's v5 yet, but that's why I was asking
> to rename the pp_frag_count to pp_ref_count. pp_frag_count is a true
> refcount (rather than skb->pp_recycle acting as a poor man's single
> shot refcount), so in case of frag'd pages / after Yunsheng's work
> we will be able to take new refs.
>
> Long story short please wait until Yunsheng's changes are finalized.

Sure, thanks for the information. I will wait until Yunsheng's changes
are finalized before proposing v2.

As for the "pp" reference, it has the test
page_pool_is_pp_page_frag(head_page) there. So for a non-frag pp page,
it will be a get_page call.

Thanks,
Liang
On Mon, 3 Jul 2023 17:12:46 +0800 Liang Chen wrote:
> As for the "pp" reference, it has the test
> page_pool_is_pp_page_frag(head_page) there. So for a non-frag pp page,
> it will be a get_page call.

You don't understand - you can't put a page from a page pool in two
skbs with pp_recycle set, unless the page is frag'ed.
On 2023/7/4 2:53, Jakub Kicinski wrote:
> On Mon, 3 Jul 2023 17:12:46 +0800 Liang Chen wrote:
>> As for the "pp" reference, it has the test
>> page_pool_is_pp_page_frag(head_page) there. So for a non-frag pp page,
>> it will be a get_page call.
>
> You don't understand - you can't put a page from a page pool in two
> skbs with pp_recycle set, unless the page is frag'ed.

Agreed. I think we should disallow skb coalescing for non-frag pp pages
instead of calling get_page(), as there is a data race when calling
page_pool_return_skb_page() concurrently for the same non-frag pp page.

Even with my patchset, it may break the arch with
PAGE_POOL_DMA_USE_PP_FRAG_COUNT being true.
On Tue, Jul 4, 2023 at 8:50 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2023/7/4 2:53, Jakub Kicinski wrote:
> > On Mon, 3 Jul 2023 17:12:46 +0800 Liang Chen wrote:
> >> As for the "pp" reference, it has the test
> >> page_pool_is_pp_page_frag(head_page) there. So for a non-frag pp page,
> >> it will be a get_page call.
> >
> > You don't understand - you can't put a page from a page pool in two
> > skbs with pp_recycle set, unless the page is frag'ed.
>
> Agreed. I think we should disallow skb coalescing for non-frag pp pages
> instead of calling get_page(), as there is a data race when calling
> page_pool_return_skb_page() concurrently for the same non-frag pp page.
>
> Even with my patchset, it may break the arch with
> PAGE_POOL_DMA_USE_PP_FRAG_COUNT being true.
>

Yeah, that's my fault. I thought __page_pool_put_page() handled the
elevated refcnt, but overlooked that the second skb release would enter
page_pool_return_skb_page() again rather than directly calling put_page().
Disallowing skb coalescing for non-frag pp pages sounds good; we will
handle this problem in v2 after Yunsheng's changes are finalized.
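To make the direction agreed on above concrete, here is a minimal sketch of
what such a guard could look like, written against the helpers from the
posted patch. It is purely illustrative and not the eventual v2 change:

/* Illustrative sketch only, not the actual v2 patch: keep the relaxed
 * pp_recycle check, but still refuse to coalesce when a cloned pp 'from'
 * skb carries non-fragmented page_pool pages, since such a page cannot
 * safely be referenced from two pp_recycle skbs.
 */
static bool skb_pp_frags_safe_to_share(const struct sk_buff *from)
{
	const struct skb_shared_info *shinfo = skb_shinfo(from);
	int i;

	if (!skb_cloned(from))
		return true;	/* frags get transplanted, no double reference */

	for (i = 0; i < shinfo->nr_frags; i++) {
		struct page *page = compound_head(skb_frag_page(&shinfo->frags[i]));

		if (page_pool_is_pp_page(page) &&
		    !page_pool_is_pp_page_frag(page))
			return false;	/* non-frag pp page: do not coalesce */
	}

	return true;
}

/* Then, in skb_try_coalesce():
 *
 *	if (to->pp_recycle != from->pp_recycle ||
 *	    (from->pp_recycle && !skb_pp_frags_safe_to_share(from)))
 *		return false;
 */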
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 126f9e294389..05e5d8ead63b 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -399,4 +399,25 @@ static inline void page_pool_nid_changed(struct page_pool *pool, int new_nid)
 	page_pool_update_nid(pool, new_nid);
 }
 
+static inline bool page_pool_is_pp_page(struct page *page)
+{
+	return (page->pp_magic & ~0x3UL) == PP_SIGNATURE;
+}
+
+static inline bool page_pool_is_pp_page_frag(struct page *page)
+{
+	return !!(page->pp->p.flags & PP_FLAG_PAGE_FRAG);
+}
+
+static inline void page_pool_page_ref(struct page *page)
+{
+	struct page *head_page = compound_head(page);
+
+	if (page_pool_is_pp_page(head_page) &&
+	    page_pool_is_pp_page_frag(head_page))
+		atomic_long_inc(&head_page->pp_frag_count);
+	else
+		get_page(head_page);
+}
+
 #endif /* _NET_PAGE_POOL_H */
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6c5915efbc17..9806b091f0f6 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5666,8 +5666,7 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
 	 * !@to->pp_recycle but its tricky (due to potential race with
 	 * the clone disappearing) and rare, so not worth dealing with.
 	 */
-	if (to->pp_recycle != from->pp_recycle ||
-	    (from->pp_recycle && skb_cloned(from)))
+	if (to->pp_recycle != from->pp_recycle)
 		return false;
 
 	if (len <= skb_tailroom(to)) {
@@ -5724,8 +5723,12 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
 	/* if the skb is not cloned this does nothing
 	 * since we set nr_frags to 0.
 	 */
-	for (i = 0; i < from_shinfo->nr_frags; i++)
-		__skb_frag_ref(&from_shinfo->frags[i]);
+	if (from->pp_recycle)
+		for (i = 0; i < from_shinfo->nr_frags; i++)
+			page_pool_page_ref(skb_frag_page(&from_shinfo->frags[i]));
+	else
+		for (i = 0; i < from_shinfo->nr_frags; i++)
+			__skb_frag_ref(&from_shinfo->frags[i]);
 
 	to->truesize += delta;
 	to->len += len;
In order to address the issues encountered with commit 1effe8ca4e34
("skbuff: fix coalescing for page_pool fragment recycling"), the
combination of the following condition was excluded from skb coalescing:

from->pp_recycle = 1
from->cloned = 1
to->pp_recycle = 1

However, with page pool environments, the aforementioned combination can
be quite common. In scenarios with a higher number of small packets, it
can significantly affect the success rate of coalescing. For example,
when considering packets of 256 bytes size, our comparison of coalescing
success rate is as follows:

Without page pool: 70%
With page pool: 13%

Consequently, this has an impact on performance:

Without page pool: 2.64 Gbits/sec
With page pool: 2.41 Gbits/sec

Therefore, it seems worthwhile to optimize this scenario and enable
coalescing of this particular combination. To achieve this, we need to
ensure the correct increment of the "from" SKB page's page pool fragment
count (pp_frag_count).

Following this optimization, the success rate of coalescing measured in
our environment has improved as follows:

With page pool: 60%

This success rate is approaching the rate achieved without using page
pool, and the performance has also been improved:

With page pool: 2.61 Gbits/sec

Below is the performance comparison for small packets before and after
this optimization. We observe no impact to packets larger than 4K.

without page pool fragment (PP_FLAG_PAGE_FRAG)
packet size    before       after
    (bytes)  (Gbits/sec)  (Gbits/sec)
        128      1.28         1.37
        256      2.41         2.61
        512      4.56         4.87
       1024      7.69         8.21
       2048     12.85        13.41

with page pool fragment (PP_FLAG_PAGE_FRAG)
packet size    before       after
    (bytes)  (Gbits/sec)  (Gbits/sec)
        128      1.28         1.37
        256      2.35         2.62
        512      4.37         4.86
       1024      7.62         8.41
       2048     13.07        13.53

with page pool fragment (PP_FLAG_PAGE_FRAG) and high order (order = 3)
packet size    before       after
    (bytes)  (Gbits/sec)  (Gbits/sec)
        128      1.28         1.41
        256      2.41         2.74
        512      4.57         5.25
       1024      8.61         9.71
       2048     14.81        16.78

Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
---
 include/net/page_pool.h | 21 +++++++++++++++++++++
 net/core/skbuff.c       | 11 +++++++----
 2 files changed, 28 insertions(+), 4 deletions(-)