[net-next,06/11] net: page_pool: avoid calling no-op externals when possible

Message ID 20230516161841.37138-7-aleksander.lobakin@intel.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Series net: intel: start The Great Code Dedup + Page Pool for iavf

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 5176 this patch: 5176
netdev/cc_maintainers success CCed 7 of 7 maintainers
netdev/build_clang success Errors and warnings before: 1979 this patch: 1979
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 5407 this patch: 5407
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 31 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Alexander Lobakin May 16, 2023, 4:18 p.m. UTC
It turned out that page_pool_put{,_full}_page() can burn quite a few
cycles even on DMA-coherent platforms (like x86) with no active IOMMU or
swiotlb, just for the call ladder.
Indeed, it's:

page_pool_put_page()
  page_pool_put_defragged_page()                  <- external
    __page_pool_put_page()
      page_pool_dma_sync_for_device()             <- non-inline
        dma_sync_single_range_for_device()
          dma_sync_single_for_device()            <- external
            dma_direct_sync_single_for_device()
              dev_is_dma_coherent()               <- exit

For the inline functions, there are no guarantees the compiler won't
uninline them (they're clearly not one-liners, and sometimes compilers
uninline even 2 + 2). The first external call is necessary, but the
remaining two or more are done for nothing each time, plus a bunch of
checks here and there.
Since Page Pool mappings are long-term and, for one "device + addr" pair,
dma_need_sync() will always return the same value (basically, whether the
address belongs to an swiotlb pool), addresses can be tested once, right
after they're obtained, and the result can be reused until the page is
unmapped.
Define a new PP flag which means "do DMA syncs for the device, but only
when needed", and turn it on by default when the driver asks to sync
pages. When a page is mapped, check whether it needs syncs, and if so,
flip that "sync only when needed" back to "always do syncs" globally for
the whole pool (better safe than sorry). As long as a pool has no pages
requiring DMA syncs, this cuts off a good chunk of calls and checks.
On my x86_64, this gives a 2-5% performance benefit, with no negative
impact in cases where the IOMMU is on and the shortcut can't be used.
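
For reference, the recycle-path check that the shortcut targets looks
roughly like this (simplified from __page_pool_put_page(); not part of
this diff):

	/* with the pool kept in the "maybe sync" state, DMA_SYNC_DEV stays
	 * clear, this test fails and the whole
	 * dma_sync_single_range_for_device() ladder above is never entered
	 */
	if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
		page_pool_dma_sync_for_device(pool, page, dma_sync_size);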

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 include/net/page_pool.h |  3 +++
 net/core/page_pool.c    | 10 ++++++++++
 2 files changed, 13 insertions(+)

Comments

Christoph Hellwig May 17, 2023, 8:14 a.m. UTC | #1
So while this looks ok, Eric Dumazet had looked into some optimization
for this touching the core code, and promised me to come up with an
even better version a while ago.  Eric, what's the state of your
optimizations for no-op DMA syncs?
Jakub Kicinski May 18, 2023, 4:08 a.m. UTC | #2
On Tue, 16 May 2023 18:18:36 +0200 Alexander Lobakin wrote:
> +		/* Try to avoid calling no-op syncs */
> +		pool->p.flags |= PP_FLAG_DMA_MAYBE_SYNC;
> +		pool->p.flags &= ~PP_FLAG_DMA_SYNC_DEV;
>  	}
>  
>  	if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT &&
> @@ -323,6 +327,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
>  
>  	page_pool_set_dma_addr(page, dma);
>  
> +	if ((pool->p.flags & PP_FLAG_DMA_MAYBE_SYNC) &&
> +	    dma_need_sync(pool->p.dev, dma)) {
> +		pool->p.flags |= PP_FLAG_DMA_SYNC_DEV;
> +		pool->p.flags &= ~PP_FLAG_DMA_MAYBE_SYNC;
> +	}

is it just me or does it feel cleaner to allocate a page at init,
and throw it into the cache, rather than adding a condition to a
fast(ish) path?
Yunsheng Lin May 18, 2023, 4:54 a.m. UTC | #3
On 2023/5/18 12:08, Jakub Kicinski wrote:
> On Tue, 16 May 2023 18:18:36 +0200 Alexander Lobakin wrote:
>> +		/* Try to avoid calling no-op syncs */
>> +		pool->p.flags |= PP_FLAG_DMA_MAYBE_SYNC;
>> +		pool->p.flags &= ~PP_FLAG_DMA_SYNC_DEV;
>>  	}
>>  
>>  	if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT &&
>> @@ -323,6 +327,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
>>  
>>  	page_pool_set_dma_addr(page, dma);
>>  
>> +	if ((pool->p.flags & PP_FLAG_DMA_MAYBE_SYNC) &&
>> +	    dma_need_sync(pool->p.dev, dma)) {
>> +		pool->p.flags |= PP_FLAG_DMA_SYNC_DEV;
>> +		pool->p.flags &= ~PP_FLAG_DMA_MAYBE_SYNC;
>> +	}
> 
> is it just me or does it feel cleaner to allocate a page at init,
> and throw it into the cache, rather than adding a condition to a
> fast(ish) path?

Is dma_need_sync() not reliable until a DMA mapping is done?
Is there any reason not to just clear PP_FLAG_DMA_SYNC_DEV when
dma_need_sync() is false, instead of introducing the PP_FLAG_DMA_MAYBE_SYNC
flag?

> 
> .
>
Alexander Lobakin May 18, 2023, 1:26 p.m. UTC | #4
From: Christoph Hellwig <hch@lst.de>
Date: Wed, 17 May 2023 10:14:58 +0200

> So while this looks ok, Eric Dumazet had looked into some optimization
> for this touching the core code, and promised me to come up with an
> even better version a while ago.  Eric, what's the state of your
> optimizations for no-op DMA syncs?
> 

Hmm, his last proposals were about avoiding indirect calls when the IOMMU
is on, but in fact we don't need to synchronize anything there -- the DMA
IOMMU on x86_64 also usually doesn't synchronize anything, yet you don't
know that prior to doing an indirect call, and dma_need_sync() won't help.
I was thinking of adding a .dma_need_sync() callback to the DMA ops, which
could also be called only once, at page allocation, like in this patch.
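
Something along these lines (a purely hypothetical sketch -- no such
callback exists in struct dma_map_ops today; the name and signature are
made up for illustration):

	/* hypothetical addition to struct dma_map_ops, illustration only */
	struct dma_map_ops {
		/* ...existing callbacks... */
		bool	(*dma_need_sync)(struct device *dev, dma_addr_t dma_addr);
	};

so that the generic dma_need_sync() could ask the actual DMA ops whether a
given address really needs syncing, instead of conservatively answering
"yes" for any ops-based setup, while Page Pool would still only call it
once per page at map time.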

Also want to hear how it goes for Eric :)

Thanks,
Olek
Alexander Lobakin May 18, 2023, 1:29 p.m. UTC | #5
From: Yunsheng Lin <linyunsheng@huawei.com>
Date: Thu, 18 May 2023 12:54:37 +0800

> On 2023/5/18 12:08, Jakub Kicinski wrote:
>> On Tue, 16 May 2023 18:18:36 +0200 Alexander Lobakin wrote:
>>> +		/* Try to avoid calling no-op syncs */
>>> +		pool->p.flags |= PP_FLAG_DMA_MAYBE_SYNC;
>>> +		pool->p.flags &= ~PP_FLAG_DMA_SYNC_DEV;
>>>  	}
>>>  
>>>  	if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT &&
>>> @@ -323,6 +327,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
>>>  
>>>  	page_pool_set_dma_addr(page, dma);
>>>  
>>> +	if ((pool->p.flags & PP_FLAG_DMA_MAYBE_SYNC) &&
>>> +	    dma_need_sync(pool->p.dev, dma)) {
>>> +		pool->p.flags |= PP_FLAG_DMA_SYNC_DEV;
>>> +		pool->p.flags &= ~PP_FLAG_DMA_MAYBE_SYNC;
>>> +	}
>>
>> is it just me or does it feel cleaner to allocate a page at init,
>> and throw it into the cache, rather than adding a condition to a
>> fast(ish) path?
> 
> Is dma_need_sync() not reliable until a DMA mapping is done?
> Is there any reason not to just clear PP_FLAG_DMA_SYNC_DEV when
> dma_need_sync() is false, instead of introducing the PP_FLAG_DMA_MAYBE_SYNC
> flag?

We can't just clear the flag, because some drivers don't want PP to
synchronize DMA at all. Without a new flag, we can't distinguish those two
cases. Example:

1) The driver doesn't set DMA_SYNC_DEV
2) We check dma_need_sync() and it returns true
3) As a result, we set DMA_SYNC_DEV, although the driver does the syncs on
   its own
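
In other words, without the extra flag, the map-time check would be
ambiguous -- a rough sketch of what that rejected variant would look like:

	/* without MAYBE_SYNC, !DMA_SYNC_DEV can mean either "driver syncs
	 * on its own" or "shortcut active", so this would wrongly
	 * re-enable PP syncs for drivers that opted out
	 */
	if (!(pool->p.flags & PP_FLAG_DMA_SYNC_DEV) &&
	    dma_need_sync(pool->p.dev, dma))
		pool->p.flags |= PP_FLAG_DMA_SYNC_DEV;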

Thanks,
Olek
Alexander Lobakin May 18, 2023, 1:34 p.m. UTC | #6
From: Jakub Kicinski <kuba@kernel.org>
Date: Wed, 17 May 2023 21:08:04 -0700

> On Tue, 16 May 2023 18:18:36 +0200 Alexander Lobakin wrote:
>> +		/* Try to avoid calling no-op syncs */
>> +		pool->p.flags |= PP_FLAG_DMA_MAYBE_SYNC;
>> +		pool->p.flags &= ~PP_FLAG_DMA_SYNC_DEV;
>>  	}
>>  
>>  	if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT &&
>> @@ -323,6 +327,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
>>  
>>  	page_pool_set_dma_addr(page, dma);
>>  
>> +	if ((pool->p.flags & PP_FLAG_DMA_MAYBE_SYNC) &&
>> +	    dma_need_sync(pool->p.dev, dma)) {
>> +		pool->p.flags |= PP_FLAG_DMA_SYNC_DEV;
>> +		pool->p.flags &= ~PP_FLAG_DMA_MAYBE_SYNC;
>> +	}
> 
> is it just me or does it feel cleaner to allocate a page at init,
> and throw it into the cache, rather than adding a condition to a
> fast(ish) path?

When recycling is on, that path is not that fast -- new allocations occur
relatively rarely, and it's the allocation path anyway, so one `if` doesn't
change anything there.

And it seems I didn't get the sentence regarding "allocate and throw"
:s This condition just disables the shortcut if any new page suddenly
requires real DMA syncs (and if one does, the sync a line below will
take way more time than one more `if` anyway).

Thanks,
Olek
Jakub Kicinski May 18, 2023, 3:02 p.m. UTC | #7
On Thu, 18 May 2023 15:34:44 +0200 Alexander Lobakin wrote:
> And seems like I didn't get the sentence regarding "allocate and throw"
> :s This condition just disables the shortcut if any new page suddenly
> requires real DMA syncs (and if it does, then the sync a line below will
> take way more time than 1 more `if`  anyway).

Right, I misread.

Patch

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index c8ec2f34722b..8435013de06e 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -46,6 +46,9 @@ 
 					* device driver responsibility
 					*/
 #define PP_FLAG_PAGE_FRAG	BIT(2) /* for page frag feature */
+#define PP_FLAG_DMA_MAYBE_SYNC	BIT(3) /* Internal, should not be used in the
+					* drivers
+					*/
 #define PP_FLAG_ALL		(PP_FLAG_DMA_MAP |\
 				 PP_FLAG_DMA_SYNC_DEV |\
 				 PP_FLAG_PAGE_FRAG)
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index e212e9d7edcb..57f323dee6c4 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -175,6 +175,10 @@  static int page_pool_init(struct page_pool *pool,
 		/* pool->p.offset has to be set according to the address
 		 * offset used by the DMA engine to start copying rx data
 		 */
+
+		/* Try to avoid calling no-op syncs */
+		pool->p.flags |= PP_FLAG_DMA_MAYBE_SYNC;
+		pool->p.flags &= ~PP_FLAG_DMA_SYNC_DEV;
 	}
 
 	if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT &&
@@ -323,6 +327,12 @@  static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
 
 	page_pool_set_dma_addr(page, dma);
 
+	if ((pool->p.flags & PP_FLAG_DMA_MAYBE_SYNC) &&
+	    dma_need_sync(pool->p.dev, dma)) {
+		pool->p.flags |= PP_FLAG_DMA_SYNC_DEV;
+		pool->p.flags &= ~PP_FLAG_DMA_MAYBE_SYNC;
+	}
+
 	if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
 		page_pool_dma_sync_for_device(pool, page, pool->p.max_len);