
[net-next,v5,03/14] page_pool: avoid calling no-op externals when possible

Message ID 20231124154732.1623518-4-aleksander.lobakin@intel.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Series net: intel: start The Great Code Dedup + Page Pool for iavf

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/codegen success Generated files up to date
netdev/tree_selection success Clearly marked for net-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1226 this patch: 1226
netdev/cc_maintainers success CCed 6 of 6 maintainers
netdev/build_clang success Errors and warnings before: 1143 this patch: 1143
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 1261 this patch: 1261
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 128 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Alexander Lobakin Nov. 24, 2023, 3:47 p.m. UTC
It turns out that page_pool_put{,_full}_page() can burn quite a few
cycles even on DMA-coherent platforms (like x86) with no active IOMMU or
swiotlb, just due to the call ladder.
Indeed, it's

page_pool_put_page()
  page_pool_put_defragged_page()                  <- external
    __page_pool_put_page()
      page_pool_dma_sync_for_device()             <- non-inline
        dma_sync_single_range_for_device()
          dma_sync_single_for_device()            <- external
            dma_direct_sync_single_for_device()
              dev_is_dma_coherent()               <- exit

For the inline functions, there are no guarantees the compiler won't
uninline them (they're clearly not one-liners, and sometimes compilers
manage to uninline even 2 + 2). The first external call is necessary,
but the remaining 2+ are made for nothing each time, plus a bunch of
checks here and there.
Since Page Pool mappings are long-term, and for one "device + addr" pair
dma_need_sync() will always return the same value (basically, whether it
belongs to a swiotlb pool), addresses can be tested once right after
they're obtained and the result can be reused until the page is
unmapped.
Do this test after a successful DMA mapping and reuse the lowest bit of
page::dma_addr to store the result, since Page Pool addresses are always
page-aligned and the lowest bits are unused.
page_pool_{g,s}et_dma_addr() will then take this bit into account and
exclude it from the actual address. Call page_pool_dma_sync_for_device()
only if the bit is set, shaving off several function calls back and
forth. If pool::dma_sync is not set, i.e. the driver didn't ask to
perform syncs, don't do this test and never touch the lowest bit.
On my x86_64, this gives a 2-5% performance benefit, with no negative
impact in cases where the IOMMU is on and the shortcut can't be used.

Suggested-by: Yunsheng Lin <linyunsheng@huawei.com> # per-page flag
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
---
 include/net/page_pool/helpers.h | 60 +++++++++++++++++++++++++++++----
 net/core/page_pool.c            | 10 +++---
 2 files changed, 58 insertions(+), 12 deletions(-)
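
As an illustration of the scheme described above, here is a minimal sketch
with simplified names (not the actual page_pool code; the real helpers are
in the diff below). It assumes the DMA address is page-aligned, so bit 0 is
free to cache the dma_need_sync() result:

#include <linux/bits.h>
#include <linux/dma-mapping.h>

#define NEED_SYNC	BIT(0)

/* Called once, right after a successful DMA mapping. */
static unsigned long stash_dma_addr(struct device *dev, dma_addr_t addr)
{
	unsigned long val = addr;

	if (dma_need_sync(dev, addr))
		val |= NEED_SYNC;

	return val;
}

/* Called on every recycle: the whole sync call ladder is skipped
 * when the cached bit says the mapping doesn't need it.
 */
static void sync_for_device(struct device *dev, unsigned long val, u32 len)
{
	if (val & NEED_SYNC)
		dma_sync_single_for_device(dev, val & ~NEED_SYNC, len,
					   DMA_BIDIRECTIONAL);
}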

Comments

Yunsheng Lin Nov. 25, 2023, 1:04 p.m. UTC | #1
On 2023/11/24 23:47, Alexander Lobakin wrote:
> Turned out page_pool_put{,_full}_page() can burn quite a bunch of cycles
> even on DMA-coherent platforms (like x86) with no active IOMMU or
> swiotlb, just for the call ladder.
> Indeed, it's
> 
> page_pool_put_page()
>   page_pool_put_defragged_page()                  <- external
>     __page_pool_put_page()
>       page_pool_dma_sync_for_device()             <- non-inline
>         dma_sync_single_range_for_device()
>           dma_sync_single_for_device()            <- external
>             dma_direct_sync_single_for_device()
>               dev_is_dma_coherent()               <- exit
> 
> For the inline functions, no guarantees the compiler won't uninline them
> (they're clearly not one-liners and sometimes the compilers manage to
> uninline 2 + 2). The first external call is necessary, but the rest 2+
> are done for nothing each time, plus a bunch of checks here and there.
> Since Page Pool mappings are long-term and for one "device + addr" pair
> dma_need_sync() will always return the same value (basically, whether it
> belongs to an swiotlb pool), addresses can be tested once right after
> they're obtained and the result can be reused until the page is
> unmapped.
> Do this test after a successful DMA mapping and reuse the lowest bit of
> page::dma_addr to store the result, since Page Pool addresses are always
> page-aligned and the lowest bits are unused.
> page_pool_{g,s}et_dma_addr() will then take this bit into account and
> exclude it from the actual address. Call page_pool_dma_sync_for_device()
> only if the bit is set, shaving off several function calls back and
> forth. If pool::dma_sync is not set, i.e. the driver didn't ask to
> perform syncs, don't do this test and never touch the lowest bit.
> On my x86_64, this gives from 2% to 5% performance benefit with no
> negative impact for cases when IOMMU is on and the shortcut can't be
> used.
> 
> Suggested-by: Yunsheng Lin <linyunsheng@huawei.com> # per-page flag

Sorry for not remembering the suggestion:(

If it is mostly the case that some pages need syncing and some pages do
not for the same device, then perhaps a per-page flag is more beneficial;
if not, maybe a per-device flag is enough too?

What I wonder is whether there is any reason this kind of trick isn't
done in the dma layer, instead of each subsystem or driver using the dma
API duplicating it?

At least xsk_buff seems to be using a similar trick?

You may have explained that already; it has been a little while between
v4 and v5, so I may have forgotten about it.


> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
> ---
>  include/net/page_pool/helpers.h | 60 +++++++++++++++++++++++++++++----
>  net/core/page_pool.c            | 10 +++---
>  2 files changed, 58 insertions(+), 12 deletions(-)
> 
> diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
> index 4ebd544ae977..28bec368b8e9 100644
> --- a/include/net/page_pool/helpers.h
> +++ b/include/net/page_pool/helpers.h
> @@ -52,6 +52,8 @@
>  #ifndef _NET_PAGE_POOL_HELPERS_H
>  #define _NET_PAGE_POOL_HELPERS_H
>  
> +#include <linux/dma-mapping.h>
> +
>  #include <net/page_pool/types.h>
>  
>  #ifdef CONFIG_PAGE_POOL_STATS
> @@ -354,6 +356,11 @@ static inline void page_pool_free_va(struct page_pool *pool, void *va,
>  	page_pool_put_page(pool, virt_to_head_page(va), -1, allow_direct);
>  }
>  
> +/* Occupies page::dma_addr[0:0] and indicates whether the mapping needs to be
> + * synchronized (either for device, for CPU, or both).
> + */
> +#define PAGE_POOL_DMA_ADDR_NEED_SYNC		BIT(0)

It seems we have less dma space on 32-bit arches with 64-bit dma addresses:(
But from my last search on Google, it seems we still have enough bits for
that:

arm: Large Physical Address Extension or LPAE, 40 bits of phys addr.
arc: Physical Address Extension or PAE, 40 bits of phys addr.
mips: eXtended Physical Addressing or XPA, 40 bits of phys addr.
powerpc: does not seem to have a name for the feature, and has 36
         bits of phys addr.
riscv: large physical address, 34 bits of phys addr.
x86: Physical Address Extension or PAE, 36 bits of phys addr.

> +
>  /**
>   * page_pool_get_dma_addr() - Retrieve the stored DMA address.
>   * @page:	page allocated from a page pool
> @@ -363,27 +370,66 @@ static inline void page_pool_free_va(struct page_pool *pool, void *va,
>   */
>  static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
>  {
> -	dma_addr_t ret = page->dma_addr;
> +	dma_addr_t ret = page->dma_addr & ~PAGE_POOL_DMA_ADDR_NEED_SYNC;
>  
>  	if (PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA)
> -		ret <<= PAGE_SHIFT;
> +		ret <<= PAGE_SHIFT - 1;
>  
>  	return ret;
>  }
>  
> -static inline bool page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
> +/**
> + * page_pool_set_dma_addr - save the obtained DMA address for @page
> + * @pool: &page_pool this page belongs to and was mapped by
> + * @page: page to save the address for
> + * @addr: DMA address to save
> + *
> + * If the pool was created with the DMA synchronization enabled, it will test
> + * whether the address needs to be synced and store the result as well to
> + * conserve CPU cycles.
> + *
> + * Return: false if the address doesn't fit into the corresponding field and
> + * thus can't be stored, true on success.
> + */
> +static inline bool page_pool_set_dma_addr(const struct page_pool *pool,
> +					  struct page *page,
> +					  dma_addr_t addr)
>  {
> +	unsigned long val = addr;
> +
> +	if (unlikely(!addr)) {
> +		page->dma_addr = 0;
> +		return true;
> +	}

The above seems like an unrelated change?

> +
>  	if (PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA) {
> -		page->dma_addr = addr >> PAGE_SHIFT;
> +		val = addr >> (PAGE_SHIFT - 1);
>  
>  		/* We assume page alignment to shave off bottom bits,
>  		 * if this "compression" doesn't work we need to drop.
>  		 */
> -		return addr != (dma_addr_t)page->dma_addr << PAGE_SHIFT;
> +		if (addr != (dma_addr_t)val << (PAGE_SHIFT - 1))
> +			return false;
>  	}
>  
> -	page->dma_addr = addr;
> -	return false;
> +	if (pool->dma_sync && dma_need_sync(pool->p.dev, addr))
> +		val |= PAGE_POOL_DMA_ADDR_NEED_SYNC;
> +
> +	page->dma_addr = val;
> +
> +	return true;
> +}
> +
Alexander Lobakin Nov. 27, 2023, 2:32 p.m. UTC | #2
From: Yunsheng Lin <linyunsheng@huawei.com>
Date: Sat, 25 Nov 2023 21:04:28 +0800

> On 2023/11/24 23:47, Alexander Lobakin wrote:
>> Turned out page_pool_put{,_full}_page() can burn quite a bunch of cycles
>> even on DMA-coherent platforms (like x86) with no active IOMMU or
>> swiotlb, just for the call ladder.
>> Indeed, it's
>>
>> page_pool_put_page()
>>   page_pool_put_defragged_page()                  <- external
>>     __page_pool_put_page()
>>       page_pool_dma_sync_for_device()             <- non-inline
>>         dma_sync_single_range_for_device()
>>           dma_sync_single_for_device()            <- external
>>             dma_direct_sync_single_for_device()
>>               dev_is_dma_coherent()               <- exit
>>
>> For the inline functions, no guarantees the compiler won't uninline them
>> (they're clearly not one-liners and sometimes the compilers manage to
>> uninline 2 + 2). The first external call is necessary, but the rest 2+
>> are done for nothing each time, plus a bunch of checks here and there.
>> Since Page Pool mappings are long-term and for one "device + addr" pair
>> dma_need_sync() will always return the same value (basically, whether it
>> belongs to an swiotlb pool), addresses can be tested once right after
>> they're obtained and the result can be reused until the page is
>> unmapped.
>> Do this test after a successful DMA mapping and reuse the lowest bit of
>> page::dma_addr to store the result, since Page Pool addresses are always
>> page-aligned and the lowest bits are unused.
>> page_pool_{g,s}et_dma_addr() will then take this bit into account and
>> exclude it from the actual address. Call page_pool_dma_sync_for_device()
>> only if the bit is set, shaving off several function calls back and
>> forth. If pool::dma_sync is not set, i.e. the driver didn't ask to
>> perform syncs, don't do this test and never touch the lowest bit.
>> On my x86_64, this gives from 2% to 5% performance benefit with no
>> negative impact for cases when IOMMU is on and the shortcut can't be
>> used.
>>
>> Suggested-by: Yunsheng Lin <linyunsheng@huawei.com> # per-page flag
> 
> Sorry for not remembering the suggestion:(

In the previous versions of this change, I used a global flag for the
whole page_pool, just like XSk does for the whole XSk buff pool; then you
proposed to use the lowest bit of ::dma_addr and store it per page, so
that it would be more granular/precise. I tested it, and it doesn't
perform worse than the global flag, but in some cases it may be beneficial.

> 
> If most of cases is that some pages need syncing and some page does not
> need syncing for the same device, then perhaps per-page flag is more
> beneficial, if not, maybe per-device flag is enough too?

A per-device flag is not enough.
You never know when dma_alloc_* or dma_map_* will return an address
that needs syncing. I realize that in most cases it will always be the
same, but since there is a possibility of a more precise optimization,
why not use it :>

> 
> What i wonder is that is there any reason this kind trick isn't done
> in the dma layer instead of each subsystem or driver using dma api
> duplicating this kind of trick?
> 
> At least xsk_buff seems to using the similar trick?
> 
> You may have explained that, it has been a litte long between v4 and v5,
> so I may have forgotten about that.

Yeah, as I wrote previously, there's currently no easy way to combine
these two optimizations.

I can't even use this patch in the XSk layer, since it doesn't operate
on struct page, but rather on xsk_buff etc., which may not have a direct
pointer to struct page. And in the case of the DMA sync functions, no
pointers to xsk_buff are passed there, only xsk_buff_pool and dma_addr_t,
with no direct association with a particular page.
Also note that xsk_buff_pool still needs to be converted to use the
dma_buf API. This should be done soon and will involve quite big changes
regarding DMA; it makes no sense to touch anything there before that.

A field in struct device may work out, but in some cases such a global
gate might disable this optimization where it could still be enabled for
page_pool or xsk_buff_pool.

Chris, any thoughts on a global flag for skipping DMA syncs ladder?

> 
> 
>> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
>> ---
>>  include/net/page_pool/helpers.h | 60 +++++++++++++++++++++++++++++----
>>  net/core/page_pool.c            | 10 +++---
>>  2 files changed, 58 insertions(+), 12 deletions(-)
>>
>> diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
>> index 4ebd544ae977..28bec368b8e9 100644
>> --- a/include/net/page_pool/helpers.h
>> +++ b/include/net/page_pool/helpers.h
>> @@ -52,6 +52,8 @@
>>  #ifndef _NET_PAGE_POOL_HELPERS_H
>>  #define _NET_PAGE_POOL_HELPERS_H
>>  
>> +#include <linux/dma-mapping.h>
>> +
>>  #include <net/page_pool/types.h>
>>  
>>  #ifdef CONFIG_PAGE_POOL_STATS
>> @@ -354,6 +356,11 @@ static inline void page_pool_free_va(struct page_pool *pool, void *va,
>>  	page_pool_put_page(pool, virt_to_head_page(va), -1, allow_direct);
>>  }
>>  
>> +/* Occupies page::dma_addr[0:0] and indicates whether the mapping needs to be
>> + * synchronized (either for device, for CPU, or both).
>> + */
>> +#define PAGE_POOL_DMA_ADDR_NEED_SYNC		BIT(0)
> 
> It seems we have less dma space from 32 bit arch with 64 bit dma address:(
> But from my last searching in google, it seems we still have enough bit for
> that:
> 
> arm: Large Physical Address Extension or LPAE, 40 bits of phys addr.
> arc: Physical Address Extension or PAE, 40 bits of phys addr.
> mips: eXtended Physical Addressing or PXA, 40 bits of phys addr.
> powerpc: does not seems to have a name for the feature, and have 36
>          bits of phys addr.
> riscv: large physical address, 34 bits of phys addr.
> x86: Physical Address Extension or PAE, 36 bits of phys addr.

Yeah, 1 bit less, but from what I've read, in most cases it will still
work. For 4 KiB pages, the shift is 12, so we're able to address 44 bits.
Eating one bit for the DMA sync flag shrinks that to 43 -> still enough
for the cases you described above. Also note that some of these arches
may have bigger pages, where the shift is higher -> even more available bits.
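
A back-of-the-envelope check of the above, as a small illustrative
user-space snippet (assuming a 32-bit unsigned long and 4 KiB pages,
i.e. PAGE_SHIFT == 12):

#include <stdio.h>

int main(void)
{
	const unsigned int bits_per_long = 32;
	const unsigned int page_shift = 12;

	/* Old scheme: page->dma_addr = addr >> PAGE_SHIFT */
	printf("addressable bits without the flag: %u\n",
	       bits_per_long + page_shift);		/* 44 */

	/* New scheme: one bit reserved, addr >> (PAGE_SHIFT - 1) */
	printf("addressable bits with the flag:    %u\n",
	       bits_per_long + page_shift - 1);		/* 43 */

	return 0;
}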

> 
>> +
>>  /**
>>   * page_pool_get_dma_addr() - Retrieve the stored DMA address.
>>   * @page:	page allocated from a page pool
>> @@ -363,27 +370,66 @@ static inline void page_pool_free_va(struct page_pool *pool, void *va,
>>   */
>>  static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
>>  {
>> -	dma_addr_t ret = page->dma_addr;
>> +	dma_addr_t ret = page->dma_addr & ~PAGE_POOL_DMA_ADDR_NEED_SYNC;
>>  
>>  	if (PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA)
>> -		ret <<= PAGE_SHIFT;
>> +		ret <<= PAGE_SHIFT - 1;
>>  
>>  	return ret;
>>  }
>>  
>> -static inline bool page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
>> +/**
>> + * page_pool_set_dma_addr - save the obtained DMA address for @page
>> + * @pool: &page_pool this page belongs to and was mapped by
>> + * @page: page to save the address for
>> + * @addr: DMA address to save
>> + *
>> + * If the pool was created with the DMA synchronization enabled, it will test
>> + * whether the address needs to be synced and store the result as well to
>> + * conserve CPU cycles.
>> + *
>> + * Return: false if the address doesn't fit into the corresponding field and
>> + * thus can't be stored, true on success.
>> + */
>> +static inline bool page_pool_set_dma_addr(const struct page_pool *pool,
>> +					  struct page *page,
>> +					  dma_addr_t addr)
>>  {
>> +	unsigned long val = addr;
>> +
>> +	if (unlikely(!addr)) {
>> +		page->dma_addr = 0;
>> +		return true;
>> +	}
> 
> The above seems unrelated change?

Related. We use page_pool_set_dma_addr() to clear ::dma_addr as well
(grep for it in page_pool.c). In this case, we don't want
dma_need_sync() to be called, as we explicitly pass zero. This check
zeroes the field and exits as quickly as possible.
For the call mentioned above, zero is a compile-time constant there, so
this little branch will be inlined and the rest dropped.
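
As a sketch of what that means in practice (the helper name below is
hypothetical, only meant to show what the constant-zero call collapses to
once the early branch is resolved at compile time):

#include <linux/mm_types.h>

/* Roughly what page_pool_set_dma_addr(pool, page, 0) boils down to
 * after inlining: the dma_need_sync() test and the shift logic are
 * dropped entirely.
 */
static inline void page_pool_clear_dma_addr(struct page *page)
{
	page->dma_addr = 0;
}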

> 
>> +
>>  	if (PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA) {
>> -		page->dma_addr = addr >> PAGE_SHIFT;
>> +		val = addr >> (PAGE_SHIFT - 1);
>>  
>>  		/* We assume page alignment to shave off bottom bits,
>>  		 * if this "compression" doesn't work we need to drop.
>>  		 */
>> -		return addr != (dma_addr_t)page->dma_addr << PAGE_SHIFT;
>> +		if (addr != (dma_addr_t)val << (PAGE_SHIFT - 1))
>> +			return false;
>>  	}
>>  
>> -	page->dma_addr = addr;
>> -	return false;
>> +	if (pool->dma_sync && dma_need_sync(pool->p.dev, addr))
>> +		val |= PAGE_POOL_DMA_ADDR_NEED_SYNC;
>> +
>> +	page->dma_addr = val;
>> +
>> +	return true;
>> +}
>> +
> 

Thanks,
Olek
Jakub Kicinski Nov. 27, 2023, 6:17 p.m. UTC | #3
On Mon, 27 Nov 2023 15:32:19 +0100 Alexander Lobakin wrote:
> > Sorry for not remembering the suggestion:(  
> 
> In the previous versions of this change I used a global flag per whole
> page_pool, just like XSk does for the whole XSk buff pool, then you
> proposed to use the lowest bit of ::dma_addr and store it per page, so
> that it would be more granular/precise. I tested it and it doesn't
> perform worse than global, but in some cases may be beneficial.

FWIW I'd vote to stick to a per-page_pool flag. You seem to handle the
sizeof(dma_addr_t) > sizeof(long) case correctly, but the code is
growing in complexity while providing no known/measurable benefit.
We can always do this later, but for now it seems like a premature
optimization to me.
Alexander Lobakin Nov. 28, 2023, 4:50 p.m. UTC | #4
From: Jakub Kicinski <kuba@kernel.org>
Date: Mon, 27 Nov 2023 10:17:20 -0800

> On Mon, 27 Nov 2023 15:32:19 +0100 Alexander Lobakin wrote:
>>> Sorry for not remembering the suggestion:(  
>>
>> In the previous versions of this change I used a global flag per whole
>> page_pool, just like XSk does for the whole XSk buff pool, then you
>> proposed to use the lowest bit of ::dma_addr and store it per page, so
>> that it would be more granular/precise. I tested it and it doesn't
>> perform worse than global, but in some cases may be beneficial.
> 
> FWIW I'd vote to stick to per-page pool. You seem to handle the
> sizeof(dma_addr_t) > sizeof(long) case correctly but the code is
> growing in complexity, providing no known/measurable benefit.
> We can always do this later but for now it seems like a premature
> optimization to me.

Yeah, this also seems more logical and optimal to me. Will wait a bit
for a possible reply from Chris and then send the next rev.

Thanks,
Olek
Yunsheng Lin Nov. 29, 2023, 3:17 a.m. UTC | #5
On 2023/11/27 22:32, Alexander Lobakin wrote:
> 
> Chris, any thoughts on a global flag for skipping DMA syncs ladder?

It seems there was one already in the past:

https://lore.kernel.org/netdev/7c55a4d7-b4aa-25d4-1917-f6f355bd722e@arm.com/T/

> 
>>
>>

>>> +static inline bool page_pool_set_dma_addr(const struct page_pool *pool,
>>> +					  struct page *page,
>>> +					  dma_addr_t addr)
>>>  {
>>> +	unsigned long val = addr;
>>> +
>>> +	if (unlikely(!addr)) {
>>> +		page->dma_addr = 0;
>>> +		return true;
>>> +	}
>>
>> The above seems unrelated change?
> 
> Related. We use page_put_set_dma_addr() to clear ::dma_addr as well
> (grep for it in page_pool.c). In this case, we don't want
> dma_need_sync() to be called as we explicitly pass zero. This check
> zeroes the field and exits as quickly as possible.

The question seems to be about whether we need to ensure the LSB of
page->dma_addr is not set when page_pool releases a page back to the
page allocator?

> In case with the call mentioned above, zero is a compile-time constant
> there, so that this little branch will be inlined with the rest dropped.
>
Alexander Lobakin Nov. 29, 2023, 1:17 p.m. UTC | #6
From: Yunsheng Lin <linyunsheng@huawei.com>
Date: Wed, 29 Nov 2023 11:17:50 +0800

> On 2023/11/27 22:32, Alexander Lobakin wrote:
>>
>> Chris, any thoughts on a global flag for skipping DMA syncs ladder?
> 
> It seems there was one already in the past:
> 
> https://lore.kernel.org/netdev/7c55a4d7-b4aa-25d4-1917-f6f355bd722e@arm.com/T/

It addresses a different problem (meaningless indirect calls), while
this one addresses meaningless direct calls :>
When the above gets merged, we could combine the two into one global
solution, but Eric wasn't active with his patch and I remember there
were some problems, so I wouldn't count on it arriving soon.

> 
>>
>>>
>>>
> 
>>>> +static inline bool page_pool_set_dma_addr(const struct page_pool *pool,
>>>> +					  struct page *page,
>>>> +					  dma_addr_t addr)
>>>>  {
>>>> +	unsigned long val = addr;
>>>> +
>>>> +	if (unlikely(!addr)) {
>>>> +		page->dma_addr = 0;
>>>> +		return true;
>>>> +	}
>>>
>>> The above seems unrelated change?
>>
>> Related. We use page_put_set_dma_addr() to clear ::dma_addr as well
>> (grep for it in page_pool.c). In this case, we don't want
>> dma_need_sync() to be called as we explicitly pass zero. This check
>> zeroes the field and exits as quickly as possible.
> 
> The question seems to be about if we need to ensure the LSB of
> page->dma_addr is not set when page_pool releases a page back to page
> allocator?

But why do we need to call dma_need_sync(0) when freeing a page,
wasting CPU cycles on a relatively hot path?

> 
>> In case with the call mentioned above, zero is a compile-time constant
>> there, so that this little branch will be inlined with the rest dropped.

Thanks,
Olek
Yunsheng Lin Nov. 30, 2023, 8:46 a.m. UTC | #7
On 2023/11/29 21:17, Alexander Lobakin wrote:
> From: Yunsheng Lin <linyunsheng@huawei.com>
> Date: Wed, 29 Nov 2023 11:17:50 +0800
> 
>> On 2023/11/27 22:32, Alexander Lobakin wrote:
>>>
>>> Chris, any thoughts on a global flag for skipping DMA syncs ladder?
>>
>> It seems there was one already in the past:
>>
>> https://lore.kernel.org/netdev/7c55a4d7-b4aa-25d4-1917-f6f355bd722e@arm.com/T/
> 
> It addresses a different problem, meaningless indirect calls, while this
> one addresses meaningless direct calls :>
> When the above gets merged, we could combine these two into one global,
> but Eric wasn't active with his patch and I remember there were some
> problems, so I wouldn't count on that it will arrive soon.

I went through the above patch. It seems to me that there was no
fundamental problem stopping us from implementing it in the dma layer
based on Eric's patch, if Eric is no longer interested in working on a
newer version?

It is just that if we allow every subsystem and driver using the dma API
to do its own trick of skipping dma sync, then there is less willingness
to implement it in the dma layer.
Alexander Lobakin Nov. 30, 2023, 11:58 a.m. UTC | #8
From: Yunsheng Lin <linyunsheng@huawei.com>
Date: Thu, 30 Nov 2023 16:46:11 +0800

> On 2023/11/29 21:17, Alexander Lobakin wrote:
>> From: Yunsheng Lin <linyunsheng@huawei.com>
>> Date: Wed, 29 Nov 2023 11:17:50 +0800
>>
>>> On 2023/11/27 22:32, Alexander Lobakin wrote:
>>>>
>>>> Chris, any thoughts on a global flag for skipping DMA syncs ladder?
>>>
>>> It seems there was one already in the past:
>>>
>>> https://lore.kernel.org/netdev/7c55a4d7-b4aa-25d4-1917-f6f355bd722e@arm.com/T/
>>
>> It addresses a different problem, meaningless indirect calls, while this
>> one addresses meaningless direct calls :>
>> When the above gets merged, we could combine these two into one global,
>> but Eric wasn't active with his patch and I remember there were some
>> problems, so I wouldn't count on that it will arrive soon.
> 
> I went through the above patch, It seems to me that there was no
> fundamental problem that stopping us from implementing it in the dma
> layer basing on Eric' patch if Eric is no longer interested in working
> on a newer version?

I'm somewhat interested in continuing to work on Eric's patch, but not
now. I have some urgent projects to work on; I could take this up in
January, I guess...
This PP-specific shortcut was done earlier and gives good boosts. It
would be trivial to remove it together with the XSk shortcut once a
generic DMA shortcut for both indirect and direct calls lands.
Does this sound good / justified enough? Or would you and other
reviewers/maintainers prefer to wait for the generic one without taking
this patch?

> 
> It is just that if we allow every subsystem and driver using dma API
> doing their own trick of skipping dma sync, then there is less willing
> to implement it in the dma layer.

Thanks,
Olek
Yunsheng Lin Nov. 30, 2023, 12:20 p.m. UTC | #9
On 2023/11/30 19:58, Alexander Lobakin wrote:
> From: Yunsheng Lin <linyunsheng@huawei.com>
> Date: Thu, 30 Nov 2023 16:46:11 +0800
> 
>> On 2023/11/29 21:17, Alexander Lobakin wrote:
>>> From: Yunsheng Lin <linyunsheng@huawei.com>
>>> Date: Wed, 29 Nov 2023 11:17:50 +0800
>>>
>>>> On 2023/11/27 22:32, Alexander Lobakin wrote:
>>>>>
>>>>> Chris, any thoughts on a global flag for skipping DMA syncs ladder?
>>>>
>>>> It seems there was one already in the past:
>>>>
>>>> https://lore.kernel.org/netdev/7c55a4d7-b4aa-25d4-1917-f6f355bd722e@arm.com/T/
>>>
>>> It addresses a different problem, meaningless indirect calls, while this
>>> one addresses meaningless direct calls :>
>>> When the above gets merged, we could combine these two into one global,
>>> but Eric wasn't active with his patch and I remember there were some
>>> problems, so I wouldn't count on that it will arrive soon.
>>
>> I went through the above patch, It seems to me that there was no
>> fundamental problem that stopping us from implementing it in the dma
>> layer basing on Eric' patch if Eric is no longer interested in working
>> on a newer version?
> 
> I'm somewhat interested in continuing working on Eric's patch, but not
> now. Have some urgent projects to work on, I could take this in January
> I guess...
> This PP-specific shortcut was done earlier and gives good boosts. It
> would be trivial to remove it together with the XSk shortcut once a
> generic both indirect and direct call DMA shortcut lands.
> Does this sounds good / justified enough? Or you and other
> reviewers/maintainers would prefer to wait for the generic one without
> taking this patch?
> 

I would prefer to wait for the generic one, as there is only about one
month between now and January :)
Alexander Lobakin Dec. 1, 2023, 2:37 p.m. UTC | #10
From: Yunsheng Lin <linyunsheng@huawei.com>
Date: Thu, 30 Nov 2023 20:20:44 +0800

> On 2023/11/30 19:58, Alexander Lobakin wrote:
>> From: Yunsheng Lin <linyunsheng@huawei.com>
>> Date: Thu, 30 Nov 2023 16:46:11 +0800
>>
>>> On 2023/11/29 21:17, Alexander Lobakin wrote:
>>>> From: Yunsheng Lin <linyunsheng@huawei.com>
>>>> Date: Wed, 29 Nov 2023 11:17:50 +0800
>>>>
>>>>> On 2023/11/27 22:32, Alexander Lobakin wrote:
>>>>>>
>>>>>> Chris, any thoughts on a global flag for skipping DMA syncs ladder?
>>>>>
>>>>> It seems there was one already in the past:
>>>>>
>>>>> https://lore.kernel.org/netdev/7c55a4d7-b4aa-25d4-1917-f6f355bd722e@arm.com/T/
>>>>
>>>> It addresses a different problem, meaningless indirect calls, while this
>>>> one addresses meaningless direct calls :>
>>>> When the above gets merged, we could combine these two into one global,
>>>> but Eric wasn't active with his patch and I remember there were some
>>>> problems, so I wouldn't count on that it will arrive soon.
>>>
>>> I went through the above patch, It seems to me that there was no
>>> fundamental problem that stopping us from implementing it in the dma
>>> layer basing on Eric' patch if Eric is no longer interested in working
>>> on a newer version?
>>
>> I'm somewhat interested in continuing working on Eric's patch, but not
>> now. Have some urgent projects to work on, I could take this in January
>> I guess...
>> This PP-specific shortcut was done earlier and gives good boosts. It
>> would be trivial to remove it together with the XSk shortcut once a
>> generic both indirect and direct call DMA shortcut lands.
>> Does this sounds good / justified enough? Or you and other
>> reviewers/maintainers would prefer to wait for the generic one without
>> taking this patch?
>>
> 
> I would prefer we could wait for the generic one as there is only about one
> month between now and january:)

Ok, let's do it this way. I'll try to revive Eric's idea soon and get it
taken if he doesn't mind :>

Thanks,
Olek
Christoph Hellwig Dec. 12, 2023, 3:25 p.m. UTC | #11
On Thu, Nov 30, 2023 at 08:20:44PM +0800, Yunsheng Lin wrote:
> I would prefer we could wait for the generic one as there is only about one
> month between now and january:)

Same here.  Please fix this properly for everyone instead of adding
a pile of hacks.

Patch

diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
index 4ebd544ae977..28bec368b8e9 100644
--- a/include/net/page_pool/helpers.h
+++ b/include/net/page_pool/helpers.h
@@ -52,6 +52,8 @@ 
 #ifndef _NET_PAGE_POOL_HELPERS_H
 #define _NET_PAGE_POOL_HELPERS_H
 
+#include <linux/dma-mapping.h>
+
 #include <net/page_pool/types.h>
 
 #ifdef CONFIG_PAGE_POOL_STATS
@@ -354,6 +356,11 @@  static inline void page_pool_free_va(struct page_pool *pool, void *va,
 	page_pool_put_page(pool, virt_to_head_page(va), -1, allow_direct);
 }
 
+/* Occupies page::dma_addr[0:0] and indicates whether the mapping needs to be
+ * synchronized (either for device, for CPU, or both).
+ */
+#define PAGE_POOL_DMA_ADDR_NEED_SYNC		BIT(0)
+
 /**
  * page_pool_get_dma_addr() - Retrieve the stored DMA address.
  * @page:	page allocated from a page pool
@@ -363,27 +370,66 @@  static inline void page_pool_free_va(struct page_pool *pool, void *va,
  */
 static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
 {
-	dma_addr_t ret = page->dma_addr;
+	dma_addr_t ret = page->dma_addr & ~PAGE_POOL_DMA_ADDR_NEED_SYNC;
 
 	if (PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA)
-		ret <<= PAGE_SHIFT;
+		ret <<= PAGE_SHIFT - 1;
 
 	return ret;
 }
 
-static inline bool page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
+/**
+ * page_pool_set_dma_addr - save the obtained DMA address for @page
+ * @pool: &page_pool this page belongs to and was mapped by
+ * @page: page to save the address for
+ * @addr: DMA address to save
+ *
+ * If the pool was created with the DMA synchronization enabled, it will test
+ * whether the address needs to be synced and store the result as well to
+ * conserve CPU cycles.
+ *
+ * Return: false if the address doesn't fit into the corresponding field and
+ * thus can't be stored, true on success.
+ */
+static inline bool page_pool_set_dma_addr(const struct page_pool *pool,
+					  struct page *page,
+					  dma_addr_t addr)
 {
+	unsigned long val = addr;
+
+	if (unlikely(!addr)) {
+		page->dma_addr = 0;
+		return true;
+	}
+
 	if (PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA) {
-		page->dma_addr = addr >> PAGE_SHIFT;
+		val = addr >> (PAGE_SHIFT - 1);
 
 		/* We assume page alignment to shave off bottom bits,
 		 * if this "compression" doesn't work we need to drop.
 		 */
-		return addr != (dma_addr_t)page->dma_addr << PAGE_SHIFT;
+		if (addr != (dma_addr_t)val << (PAGE_SHIFT - 1))
+			return false;
 	}
 
-	page->dma_addr = addr;
-	return false;
+	if (pool->dma_sync && dma_need_sync(pool->p.dev, addr))
+		val |= PAGE_POOL_DMA_ADDR_NEED_SYNC;
+
+	page->dma_addr = val;
+
+	return true;
+}
+
+/**
+ * page_pool_dma_addr_need_sync - check whether a mapped page needs DMA syncs
+ * @page: page to check
+ *
+ * Return: the result previously saved by page_pool_set_dma_addr() to skip
+ * synchronization when not needed.
+ */
+static inline bool page_pool_dma_addr_need_sync(const struct page *page)
+{
+	return page->dma_addr & PAGE_POOL_DMA_ADDR_NEED_SYNC;
 }
 
 static inline bool page_pool_put(struct page_pool *pool)
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index a8da3ba3b1e8..09bdfdfc4788 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -366,10 +366,10 @@  static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
 	if (dma_mapping_error(pool->p.dev, dma))
 		return false;
 
-	if (page_pool_set_dma_addr(page, dma))
+	if (!page_pool_set_dma_addr(pool, page, dma))
 		goto unmap_failed;
 
-	if (pool->dma_sync)
+	if (page_pool_dma_addr_need_sync(page))
 		page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
 
 	return true;
@@ -545,7 +545,7 @@  static void page_pool_return_page(struct page_pool *pool, struct page *page)
 	dma_unmap_page_attrs(pool->p.dev, dma,
 			     PAGE_SIZE << pool->p.order, pool->p.dma_dir,
 			     DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING);
-	page_pool_set_dma_addr(page, 0);
+	page_pool_set_dma_addr(pool, page, 0);
 skip_dma_unmap:
 	page_pool_clear_pp_info(page);
 
@@ -622,7 +622,7 @@  __page_pool_put_page(struct page_pool *pool, struct page *page,
 	if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
 		/* Read barrier done in page_ref_count / READ_ONCE */
 
-		if (pool->dma_sync)
+		if (page_pool_dma_addr_need_sync(page))
 			page_pool_dma_sync_for_device(pool, page,
 						      dma_sync_size);
 
@@ -735,7 +735,7 @@  static struct page *page_pool_drain_frag(struct page_pool *pool,
 		return NULL;
 
 	if (page_ref_count(page) == 1 && !page_is_pfmemalloc(page)) {
-		if (pool->dma_sync)
+		if (page_pool_dma_addr_need_sync(page))
 			page_pool_dma_sync_for_device(pool, page, -1);
 
 		return page;