Message ID | 20230714170853.866018-3-aleksander.lobakin@intel.com (mailing list archive)
---|---
State | RFC
Delegated to | Netdev Maintainers
Series | net: page_pool: a couple of assorted optimizations
Context | Check | Description
---|---|---
netdev/tree_selection | success | Clearly marked for net-next, async
netdev/apply | fail | Patch does not apply to net-next
From: Alexander Lobakin <aleksander.lobakin@intel.com>
Date: Fri, 14 Jul 2023 19:08:45 +0200

> On x86_64, frag_* fields of struct page_pool are scattered across two
> cachelines despite the summary size of 24 bytes. The last field,
> ::frag_users, is pushed out to the next one, sharing it with
> ::alloc_stats.

There are two copies now of #2 and #3 as I reordered them and forgot to
remove obsolete files ._.
It's an RFC anyway, so please just ignore the duplicates. Sorry.

[...]

Thanks,
Olek
On 14/07/2023 19.08, Alexander Lobakin wrote:
> On x86_64, frag_* fields of struct page_pool are scattered across two
> cachelines despite the summary size of 24 bytes. The last field,
> ::frag_users, is pushed out to the next one, sharing it with
> ::alloc_stats.
> All three fields are used in pretty much the same places. There are some
> holes and cold members to move around. Move frag_* one block up, placing
> them right after &page_pool_params perfectly at the beginning of CL2.
> This doesn't do any meaningful to the second block, as those are some
> destroy-path cold structures, and doesn't do anything to ::alloc_stats,
> which still starts at 200-byte offset, 8 bytes after CL3 (still fitting
> into 1 cacheline).
> On my setup, this yields 1-2% of Mpps when using PP frags actively.
> When it comes to 32-bit architectures with 32-byte CL: &page_pool_params
> plus ::pad is 44 bytes, the block taken care of is 16 bytes within one
> CL, so there should be at least no regressions from the actual change.
>
> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> ---
>  include/net/page_pool.h | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> index 829dc1f8ba6b..212d72b5cfec 100644
> --- a/include/net/page_pool.h
> +++ b/include/net/page_pool.h
> @@ -130,16 +130,16 @@ static inline u64 *page_pool_ethtool_stats_get(u64 *data, void *stats)
>  struct page_pool {
>          struct page_pool_params p;
>
> +        long frag_users;
> +        struct page *frag_page;
> +        unsigned int frag_offset;
> +        u32 pages_state_hold_cnt;

I think this is okay, but I want to highlight that:
 - pages_state_hold_cnt and pages_state_release_cnt
   need to be kept on separate cache-lines.

> +
>          struct delayed_work release_dw;
>          void (*disconnect)(void *);
>          unsigned long defer_start;
>          unsigned long defer_warn;
>
> -        u32 pages_state_hold_cnt;
> -        unsigned int frag_offset;
> -        struct page *frag_page;
> -        long frag_users;
> -
>  #ifdef CONFIG_PAGE_POOL_STATS
>          /* these stats are incremented while in softirq context */
>          struct page_pool_alloc_stats alloc_stats;
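The separate-cacheline rule Jesper highlights is about false sharing: pages_state_hold_cnt is bumped when pages enter the pool and pages_state_release_cnt when they are returned, usually from different contexts, so a shared cacheline would keep bouncing between the two writers. Below is a minimal userspace sketch of that rule, assuming a 64-byte cacheline and using made-up struct and field names rather than the real page_pool definition:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define CACHELINE 64    /* assumed L1 line size for this demo */

struct demo_counters {
        /* bumped on the allocation path */
        uint32_t hold_cnt;
        /* explicit spacer for the demo; the kernel normally relies on
         * intervening members or ____cacheline_aligned_in_smp instead
         */
        char pad[CACHELINE - sizeof(uint32_t)];
        /* bumped on the release path, possibly from another CPU */
        uint32_t release_cnt;
};

/* Refuse to build if the two counters ever land on the same line. */
_Static_assert(offsetof(struct demo_counters, hold_cnt) / CACHELINE !=
               offsetof(struct demo_counters, release_cnt) / CACHELINE,
               "hold/release counters must not share a cacheline");

int main(void)
{
        printf("hold_cnt at offset %zu, release_cnt at offset %zu\n",
               offsetof(struct demo_counters, hold_cnt),
               offsetof(struct demo_counters, release_cnt));
        return 0;
}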
From: Jesper Dangaard Brouer <jbrouer@redhat.com>
Date: Fri, 14 Jul 2023 20:37:39 +0200

>
>
> On 14/07/2023 19.08, Alexander Lobakin wrote:
>> On x86_64, frag_* fields of struct page_pool are scattered across two
>> cachelines despite the summary size of 24 bytes. The last field,
>> ::frag_users, is pushed out to the next one, sharing it with
>> ::alloc_stats.
>> All three fields are used in pretty much the same places. There are some
>> holes and cold members to move around. Move frag_* one block up, placing
>> them right after &page_pool_params perfectly at the beginning of CL2.
>> This doesn't do any meaningful to the second block, as those are some
>> destroy-path cold structures, and doesn't do anything to ::alloc_stats,
>> which still starts at 200-byte offset, 8 bytes after CL3 (still fitting
>> into 1 cacheline).
>> On my setup, this yields 1-2% of Mpps when using PP frags actively.
>> When it comes to 32-bit architectures with 32-byte CL: &page_pool_params
>> plus ::pad is 44 bytes, the block taken care of is 16 bytes within one
>> CL, so there should be at least no regressions from the actual change.
>>
>> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
>> ---
>>  include/net/page_pool.h | 10 +++++-----
>>  1 file changed, 5 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
>> index 829dc1f8ba6b..212d72b5cfec 100644
>> --- a/include/net/page_pool.h
>> +++ b/include/net/page_pool.h
>> @@ -130,16 +130,16 @@ static inline u64
>> *page_pool_ethtool_stats_get(u64 *data, void *stats)
>>  struct page_pool {
>>          struct page_pool_params p;
>> +        long frag_users;
>> +        struct page *frag_page;
>> +        unsigned int frag_offset;
>> +        u32 pages_state_hold_cnt;
>
> I think this is okay, but I want to highlight that:
>  - pages_state_hold_cnt and pages_state_release_cnt
>    need to be kept on separate cache-lines.

They're pretty far away from each other. I moved hold_cnt here as well
to keep it stacked with frag_offset and avoid introducing 32-bit holes.

>
>
>> +
>>          struct delayed_work release_dw;
>>          void (*disconnect)(void *);
>>          unsigned long defer_start;
>>          unsigned long defer_warn;
>> -        u32 pages_state_hold_cnt;
>> -        unsigned int frag_offset;
>> -        struct page *frag_page;
>> -        long frag_users;
>> -
>>  #ifdef CONFIG_PAGE_POOL_STATS
>>          /* these stats are incremented while in softirq context */
>>          struct page_pool_alloc_stats alloc_stats;
>

Thanks,
Olek
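The "32-bit holes" being avoided here are ordinary struct padding: a lone 4-byte field sitting in front of an 8-byte-aligned member forces the compiler to insert 4 bytes of padding, whereas pairing two 4-byte fields (frag_offset and pages_state_hold_cnt) lets them share one 8-byte slot. A standalone sketch of the effect, with stand-in types and names rather than the real page_pool members:

#include <stdint.h>
#include <stdio.h>

/* A lone u32 in front of an 8-byte-aligned member leaves a 4-byte hole,
 * and the second u32 parked further down wastes yet another slot.
 */
struct lone_u32 {
        void *frag_page;        /* stand-in for struct page *frag_page */
        uint32_t frag_offset;
        /* 4-byte hole here */
        void *next_member;      /* stand-in for the following 8-byte member */
        uint32_t hold_cnt;
        /* 4 bytes of tail padding */
};

/* Stacking the two u32s lets them share one 8-byte slot: no holes. */
struct paired_u32 {
        void *frag_page;
        uint32_t frag_offset;
        uint32_t hold_cnt;
        void *next_member;
};

int main(void)
{
        /* Prints 32 vs 24 on LP64 targets such as x86_64. */
        printf("lone_u32:   %zu bytes\n", sizeof(struct lone_u32));
        printf("paired_u32: %zu bytes\n", sizeof(struct paired_u32));
        return 0;
}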
Apologies for the late reply, I was on vacation and start going
through my email piles...

On Tue, 18 Jul 2023 at 16:52, Alexander Lobakin
<aleksander.lobakin@intel.com> wrote:
>
> From: Jesper Dangaard Brouer <jbrouer@redhat.com>
> Date: Fri, 14 Jul 2023 20:37:39 +0200
>
> >
> >
> > On 14/07/2023 19.08, Alexander Lobakin wrote:
> >> On x86_64, frag_* fields of struct page_pool are scattered across two
> >> cachelines despite the summary size of 24 bytes. The last field,
> >> ::frag_users, is pushed out to the next one, sharing it with
> >> ::alloc_stats.
> >> All three fields are used in pretty much the same places. There are some
> >> holes and cold members to move around. Move frag_* one block up, placing
> >> them right after &page_pool_params perfectly at the beginning of CL2.
> >> This doesn't do any meaningful to the second block, as those are some
> >> destroy-path cold structures, and doesn't do anything to ::alloc_stats,
> >> which still starts at 200-byte offset, 8 bytes after CL3 (still fitting
> >> into 1 cacheline).
> >> On my setup, this yields 1-2% of Mpps when using PP frags actively.
> >> When it comes to 32-bit architectures with 32-byte CL: &page_pool_params
> >> plus ::pad is 44 bytes, the block taken care of is 16 bytes within one
> >> CL, so there should be at least no regressions from the actual change.
> >>
> >> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> >> ---
> >>  include/net/page_pool.h | 10 +++++-----
> >>  1 file changed, 5 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> >> index 829dc1f8ba6b..212d72b5cfec 100644
> >> --- a/include/net/page_pool.h
> >> +++ b/include/net/page_pool.h
> >> @@ -130,16 +130,16 @@ static inline u64
> >> *page_pool_ethtool_stats_get(u64 *data, void *stats)
> >>  struct page_pool {
> >>          struct page_pool_params p;
> >> +        long frag_users;
> >> +        struct page *frag_page;
> >> +        unsigned int frag_offset;
> >> +        u32 pages_state_hold_cnt;
> >
> > I think this is okay, but I want to highlight that:
> >  - pages_state_hold_cnt and pages_state_release_cnt
> >    need to be kept on separate cache-lines.
>
> They're pretty far away from each other. I moved hold_cnt here as well
> to keep it stacked with frag_offset and avoid introducing 32-bit holes.

This is to prevent cache line bouncing and/or false sharing right?
The change seems fine to me as well but mind adding a comment about
this when you resend?

Thanks
/Ilias

>
> >
> >
> >> +
> >>          struct delayed_work release_dw;
> >>          void (*disconnect)(void *);
> >>          unsigned long defer_start;
> >>          unsigned long defer_warn;
> >> -        u32 pages_state_hold_cnt;
> >> -        unsigned int frag_offset;
> >> -        struct page *frag_page;
> >> -        long frag_users;
> >> -
> >>  #ifdef CONFIG_PAGE_POOL_STATS
> >>          /* these stats are incremented while in softirq context */
> >>          struct page_pool_alloc_stats alloc_stats;
> >
>
> Thanks,
> Olek
From: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Date: Wed, 26 Jul 2023 11:13:26 +0300

> Apologies for the late reply, I was on vacation and start going
> through my email piles...

No worries. I remember having to grind through hundreds of mails after
each vacation :s :D

>
> On Tue, 18 Jul 2023 at 16:52, Alexander Lobakin
> <aleksander.lobakin@intel.com> wrote:
>>
>> From: Jesper Dangaard Brouer <jbrouer@redhat.com>
>> Date: Fri, 14 Jul 2023 20:37:39 +0200
>>
>>>
>>>
>>> On 14/07/2023 19.08, Alexander Lobakin wrote:
>>>> On x86_64, frag_* fields of struct page_pool are scattered across two
>>>> cachelines despite the summary size of 24 bytes. The last field,
>>>> ::frag_users, is pushed out to the next one, sharing it with
>>>> ::alloc_stats.
>>>> All three fields are used in pretty much the same places. There are some
>>>> holes and cold members to move around. Move frag_* one block up, placing
>>>> them right after &page_pool_params perfectly at the beginning of CL2.
>>>> This doesn't do any meaningful to the second block, as those are some
>>>> destroy-path cold structures, and doesn't do anything to ::alloc_stats,
>>>> which still starts at 200-byte offset, 8 bytes after CL3 (still fitting
>>>> into 1 cacheline).
>>>> On my setup, this yields 1-2% of Mpps when using PP frags actively.
>>>> When it comes to 32-bit architectures with 32-byte CL: &page_pool_params
>>>> plus ::pad is 44 bytes, the block taken care of is 16 bytes within one
>>>> CL, so there should be at least no regressions from the actual change.
>>>>
>>>> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
>>>> ---
>>>>  include/net/page_pool.h | 10 +++++-----
>>>>  1 file changed, 5 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
>>>> index 829dc1f8ba6b..212d72b5cfec 100644
>>>> --- a/include/net/page_pool.h
>>>> +++ b/include/net/page_pool.h
>>>> @@ -130,16 +130,16 @@ static inline u64
>>>> *page_pool_ethtool_stats_get(u64 *data, void *stats)
>>>>  struct page_pool {
>>>>          struct page_pool_params p;
>>>> +        long frag_users;
>>>> +        struct page *frag_page;
>>>> +        unsigned int frag_offset;
>>>> +        u32 pages_state_hold_cnt;
>>>
>>> I think this is okay, but I want to highlight that:
>>>  - pages_state_hold_cnt and pages_state_release_cnt
>>>    need to be kept on separate cache-lines.
>>
>> They're pretty far away from each other. I moved hold_cnt here as well
>> to keep it stacked with frag_offset and avoid introducing 32-bit holes.
>
> This is to prevent cache line bouncing and/or false sharing right?
> The change seems fine to me as well but mind adding a comment about
> this when you resend?

Right. Sure, why not.

>
> Thanks
> /Ilias
>>
>>>
>>>
>>>> +
>>>>          struct delayed_work release_dw;
>>>>          void (*disconnect)(void *);
>>>>          unsigned long defer_start;
>>>>          unsigned long defer_warn;
>>>> -        u32 pages_state_hold_cnt;
>>>> -        unsigned int frag_offset;
>>>> -        struct page *frag_page;
>>>> -        long frag_users;
>>>> -
>>>>  #ifdef CONFIG_PAGE_POOL_STATS
>>>>          /* these stats are incremented while in softirq context */
>>>>          struct page_pool_alloc_stats alloc_stats;
>>>
>>
>> Thanks,
>> Olek

Thanks,
Olek
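For illustration only, the comment Ilias asks for could be as small as a one-liner next to the counter. The hunk below is a hypothetical sketch of what such a follow-up might look like, not something that was actually posted in this thread, and the line context is elided on purpose:

--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ ... @@ struct page_pool {
         long frag_users;
         struct page *frag_page;
         unsigned int frag_offset;
+        /* Hot-path only; keep on a different cacheline than
+         * pages_state_release_cnt below to avoid false sharing.
+         */
         u32 pages_state_hold_cnt;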
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 829dc1f8ba6b..212d72b5cfec 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -130,16 +130,16 @@ static inline u64 *page_pool_ethtool_stats_get(u64 *data, void *stats)
 struct page_pool {
         struct page_pool_params p;
 
+        long frag_users;
+        struct page *frag_page;
+        unsigned int frag_offset;
+        u32 pages_state_hold_cnt;
+
         struct delayed_work release_dw;
         void (*disconnect)(void *);
         unsigned long defer_start;
         unsigned long defer_warn;
 
-        u32 pages_state_hold_cnt;
-        unsigned int frag_offset;
-        struct page *frag_page;
-        long frag_users;
-
 #ifdef CONFIG_PAGE_POOL_STATS
         /* these stats are incremented while in softirq context */
         struct page_pool_alloc_stats alloc_stats;
On x86_64, the frag_* fields of struct page_pool are scattered across two
cachelines despite their total size of 24 bytes. The last field,
::frag_users, is pushed out to the next cacheline, sharing it with
::alloc_stats.

All three fields are used in pretty much the same places. There are some
holes and cold members to move around. Move frag_* one block up, placing
them right after &page_pool_params, perfectly at the beginning of CL2.
This doesn't do anything meaningful to the second block, as those are some
destroy-path cold structures, and it doesn't do anything to ::alloc_stats,
which still starts at a 200-byte offset, 8 bytes past the end of CL3 (and
still fits into one cacheline).

On my setup, this yields 1-2% more Mpps when using PP frags actively.

When it comes to 32-bit architectures with a 32-byte CL: &page_pool_params
plus ::pad is 44 bytes, and the reordered block is 16 bytes within one CL,
so there should be at least no regressions from the actual change.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 include/net/page_pool.h | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)
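To sanity-check the numbers above (assuming, as the message implies, that struct page_pool_params fills the first 64-byte cacheline exactly): the moved block then occupies bytes 64-87, entirely inside CL2, and offset 200 sits 8 bytes past byte 191, where CL3 ends. A throwaway snippet that merely reproduces this arithmetic from the offsets quoted in the commit message; nothing here is read from the real struct:

#include <stdio.h>

int main(void)
{
        const unsigned int cl = 64;            /* x86_64 cacheline size */
        const unsigned int frag_start = 64;    /* right after page_pool_params */
        const unsigned int frag_size = 24;     /* moved block: long + pointer + 2 * u32 */
        const unsigned int alloc_stats = 200;  /* offset quoted in the message */

        /* 1-indexed cacheline numbers, matching the message's CL2/CL3 naming. */
        printf("moved block: bytes %u-%u, CL%u through CL%u\n",
               frag_start, frag_start + frag_size - 1,
               frag_start / cl + 1, (frag_start + frag_size - 1) / cl + 1);
        printf("alloc_stats: byte %u, %u bytes into CL%u\n",
               alloc_stats, alloc_stats % cl, alloc_stats / cl + 1);
        return 0;
}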