
[net-next,v2,1/3] page_pool: unify frag page and non-frag page handling

Message ID 20230529092840.40413-2-linyunsheng@huawei.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Series support non-frag page for page_pool_alloc_frag()

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 5156 this patch: 5156
netdev/cc_maintainers success CCed 7 of 7 maintainers
netdev/build_clang success Errors and warnings before: 1977 this patch: 1977
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 5387 this patch: 5387
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 85 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Yunsheng Lin May 29, 2023, 9:28 a.m. UTC
Currently page_pool_dev_alloc_pages() cannot be called
when PP_FLAG_PAGE_FRAG is set, because it does not use
the frag reference counting.

As we are already doing an optimization by not updating
page->pp_frag_count in page_pool_defrag_page() for the
last frag user, and a non-frag page only has one user,
we utilize that to unify frag page and non-frag page
handling, so that page_pool_dev_alloc_pages() can also
be called with PP_FLAG_PAGE_FRAG set.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
CC: Lorenzo Bianconi <lorenzo@kernel.org>
CC: Alexander Duyck <alexander.duyck@gmail.com>
---
 include/net/page_pool.h | 38 +++++++++++++++++++++++++++++++-------
 net/core/page_pool.c    |  1 +
 2 files changed, 32 insertions(+), 7 deletions(-)
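As a standalone illustration of the lifecycle the commit message describes, the sketch below models the patched pp_frag_count handling in userspace C11. The mock_* names are hypothetical and plain C11 atomics stand in for the kernel's atomic_long_t; this is a sketch of the idea, not the in-tree implementation: every page is handed out with a count of one, and the last defrag call resets the count back to one so the recycled page is immediately valid as a non-frag page.

```c
#include <assert.h>
#include <stdatomic.h>

/* Userspace model of the patched behaviour; mock_* names are
 * made up for illustration, atomic_long stands in for atomic_long_t.
 */
struct mock_page {
	atomic_long pp_frag_count;
};

/* Pages are handed out with a frag count pre-set (to 1 in the
 * non-frag case, as page_pool_set_pp_info() does after this patch).
 */
static void mock_fragment_page(struct mock_page *page, long nr)
{
	atomic_store(&page->pp_frag_count, nr);
}

static long mock_defrag_page(struct mock_page *page, long nr)
{
	long ret;

	if (atomic_load(&page->pp_frag_count) == nr) {
		/* Last user: reset to one instead of leaving a stale
		 * value, so a later non-frag allocation needs no write.
		 */
		atomic_store(&page->pp_frag_count, 1);
		return 0;
	}

	ret = atomic_fetch_sub(&page->pp_frag_count, nr) - nr;
	if (!ret)
		atomic_store(&page->pp_frag_count, 1);
	return ret;
}
```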

Comments

Alexander Duyck May 30, 2023, 3:07 p.m. UTC | #1
On Mon, 2023-05-29 at 17:28 +0800, Yunsheng Lin wrote:
> Currently page_pool_dev_alloc_pages() cannot be called
> when PP_FLAG_PAGE_FRAG is set, because it does not use
> the frag reference counting.
> 
> As we are already doing an optimization by not updating
> page->pp_frag_count in page_pool_defrag_page() for the
> last frag user, and a non-frag page only has one user,
> we utilize that to unify frag page and non-frag page
> handling, so that page_pool_dev_alloc_pages() can also
> be called with PP_FLAG_PAGE_FRAG set.
> 
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> CC: Lorenzo Bianconi <lorenzo@kernel.org>
> CC: Alexander Duyck <alexander.duyck@gmail.com>

I'm not really a huge fan of the approach. Basically it looks like you
are trying to turn every page pool page into a fragmented page. Why not
just stick to keeping the fragmented pages and have a special case
that just generates a mono-frag page for your allocator instead.

The problem is there are some architectures where we just cannot
support having pp_frag_count due to the DMA size. So it makes sense to
leave those with just basic page pool instead of trying to fake that it
is a fragmented page.

> ---
>  include/net/page_pool.h | 38 +++++++++++++++++++++++++++++++-------
>  net/core/page_pool.c    |  1 +
>  2 files changed, 32 insertions(+), 7 deletions(-)
> 
> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> index c8ec2f34722b..ea7a0c0592a5 100644
> --- a/include/net/page_pool.h
> +++ b/include/net/page_pool.h
> @@ -50,6 +50,9 @@
>  				 PP_FLAG_DMA_SYNC_DEV |\
>  				 PP_FLAG_PAGE_FRAG)
>  
> +#define PAGE_POOL_DMA_USE_PP_FRAG_COUNT \
> +		(sizeof(dma_addr_t) > sizeof(unsigned long))
> +
>  /*
>   * Fast allocation side cache array/stack
>   *
> @@ -295,13 +298,20 @@ void page_pool_put_defragged_page(struct page_pool *pool, struct page *page,
>   */
>  static inline void page_pool_fragment_page(struct page *page, long nr)
>  {
> -	atomic_long_set(&page->pp_frag_count, nr);
> +	if (!PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
> +		atomic_long_set(&page->pp_frag_count, nr);
>  }
>  
> +/* We need to reset frag_count back to 1 for the last user to allow
> + * only one user in case the page is recycled and allocated as non-frag
> + * page.
> + */
>  static inline long page_pool_defrag_page(struct page *page, long nr)
>  {
>  	long ret;
>  
> +	BUILD_BUG_ON(__builtin_constant_p(nr) && nr != 1);
> +

What is the point of this line? It doesn't make much sense to me. Are
you just trying to force an optimization? You would be better off just
taking the BUILD_BUG_ON() contents and feeding them into an if statement
below, since the statement will compile out anyway.

It seems like what you would want here is:
	BUG_ON(!PAGE_POOL_DMA_USE_PP_FRAG_COUNT);

Otherwise you are potentially writing to a variable that shouldn't
exist.

>  	/* If nr == pp_frag_count then we have cleared all remaining
>  	 * references to the page. No need to actually overwrite it, instead
>  	 * we can leave this to be overwritten by the calling function.
> @@ -311,19 +321,36 @@ static inline long page_pool_defrag_page(struct page *page, long nr)
>  	 * especially when dealing with a page that may be partitioned
>  	 * into only 2 or 3 pieces.
>  	 */
> -	if (atomic_long_read(&page->pp_frag_count) == nr)
> +	if (atomic_long_read(&page->pp_frag_count) == nr) {
> +		/* As we have ensured nr is always one for constant case
> +		 * using the BUILD_BUG_ON() as above, only need to handle
> +		 * the non-constant case here for frag count draining.
> +		 */
> +		if (!__builtin_constant_p(nr))
> +			atomic_long_set(&page->pp_frag_count, 1);
> +
>  		return 0;
> +	}
>  
>  	ret = atomic_long_sub_return(nr, &page->pp_frag_count);
>  	WARN_ON(ret < 0);
> +
> +	/* Reset frag count back to 1, this should be the rare case when
> +	 * two users call page_pool_defrag_page() currently.
> +	 */
> +	if (!ret)
> +		atomic_long_set(&page->pp_frag_count, 1);
> +
>  	return ret;
>  }
>  
>  static inline bool page_pool_is_last_frag(struct page_pool *pool,
>  					  struct page *page)
>  {
> -	/* If fragments aren't enabled or count is 0 we were the last user */
> -	return !(pool->p.flags & PP_FLAG_PAGE_FRAG) ||
> +	/* When dma_addr_upper is overlapped with pp_frag_count
> +	 * or we were the last page frag user.
> +	 */
> +	return PAGE_POOL_DMA_USE_PP_FRAG_COUNT ||
>  	       (page_pool_defrag_page(page, 1) == 0);
>  }
>  
> @@ -357,9 +384,6 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,
>  	page_pool_put_full_page(pool, page, true);
>  }
>  
> -#define PAGE_POOL_DMA_USE_PP_FRAG_COUNT	\
> -		(sizeof(dma_addr_t) > sizeof(unsigned long))
> -
>  static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
>  {
>  	dma_addr_t ret = page->dma_addr;
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index e212e9d7edcb..0868aa8f6323 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -334,6 +334,7 @@ static void page_pool_set_pp_info(struct page_pool *pool,
>  {
>  	page->pp = pool;
>  	page->pp_magic |= PP_SIGNATURE;
> +	page_pool_fragment_page(page, 1);
>  	if (pool->p.init_callback)
>  		pool->p.init_callback(page, pool->p.init_arg);
>  }

Again, you are adding statements here that now have to be stripped out
under specific circumstances. In my opinion it would be better to not
modify base page pool pages and instead just have your allocator
provide a 1-frag page pool page via a special-case allocator rather
than messing with all the other drivers.
Yunsheng Lin May 31, 2023, 11:55 a.m. UTC | #2
On 2023/5/30 23:07, Alexander H Duyck wrote:
> On Mon, 2023-05-29 at 17:28 +0800, Yunsheng Lin wrote:
>> Currently page_pool_dev_alloc_pages() cannot be called
>> when PP_FLAG_PAGE_FRAG is set, because it does not use
>> the frag reference counting.
>>
>> As we are already doing an optimization by not updating
>> page->pp_frag_count in page_pool_defrag_page() for the
>> last frag user, and a non-frag page only has one user,
>> we utilize that to unify frag page and non-frag page
>> handling, so that page_pool_dev_alloc_pages() can also
>> be called with PP_FLAG_PAGE_FRAG set.
>>
>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>> CC: Lorenzo Bianconi <lorenzo@kernel.org>
>> CC: Alexander Duyck <alexander.duyck@gmail.com>
> 
> I'm not really a huge fan of the approach. Basically it looks like you
> are trying to turn every page pool page into a fragmented page. Why not
> just stick to keeping the fragmented pages and have a special case
> that just generates a mono-frag page for your allocator instead.

Let me try to describe what this patch tries to do and how it
does that, in more detail, in order to reach a more common understanding.

Before this patch:

As we use PP_FLAG_PAGE_FRAG to decide whether to check the frag count
in page_pool_is_last_frag() when a page is returned back to the page
pool, the cases are:

1. PAGE_POOL_DMA_USE_PP_FRAG_COUNT being true case: page_pool_create()
   fails when it is called with PP_FLAG_PAGE_FRAG set, if user calls
   page_pool_alloc_frag(), then we warn about it and return NULL.

2. PAGE_POOL_DMA_USE_PP_FRAG_COUNT being false case:
   (1). page_pool_create() is called with PP_FLAG_PAGE_FRAG set:
        page_pool_alloc_pages() is not allowed to be called as the
        page->pp_frag_count is not setup correctly for the case.
   (2). page_pool_create() is called without PP_FLAG_PAGE_FRAG set:
        page_pool_alloc_frag() is not allowed to be called as the
        page->pp_frag_count is not checked in page_pool_is_last_frag().

and mlx5 uses a mix of the above:
page_pool_create() is called with PP_FLAG_PAGE_FRAG,
page_pool_dev_alloc_pages() is called to allocate a page, and
page_pool_fragment_page() is called to set up page->pp_frag_count
correctly so that page_pool_is_last_frag() can see the correct
page->pp_frag_count. The mlx5 driver's handling of the frag count is
linked below; it is complicated, and I am not sure there is any added
benefit that can justify the complication yet:
https://www.spinics.net/lists/netdev/msg892893.html

There are use cases for veth and virtio_net to use frag support
in page pool to reduce memory usage, and they may request different
frag sizes depending on the head/tail room space for xdp_frame/shinfo
and the mtu/packet size. When the requested frag size is large enough
that a single page cannot be split into more than one frag, using
frag support only has a performance penalty because of the extra frag
count handling.
So to avoid the driver handling page->pp_frag_count directly, and the
driver calling different page pool APIs according to memory size, we
need a way to unify page_pool_is_last_frag() for frag and non-frag pages.

1. https://patchwork.kernel.org/project/netdevbpf/patch/d3ae6bd3537fbce379382ac6a42f67e22f27ece2.1683896626.git.lorenzo@kernel.org/
2. https://patchwork.kernel.org/project/netdevbpf/patch/20230526054621.18371-3-liangchen.linux@gmail.com/

After this patch:
This patch ensures the pp_frag_count of a page from pool->alloc/pool->ring,
or newly allocated from the page allocator, is one, which means we assume
that all pages have one frag user initially, so that we can have unified
handling for frag and non-frag pages in page_pool_is_last_frag().

So the key point in this patch is about unified handling in
page_pool_is_last_frag(), which is the free/put side of page pool,
not the alloc/generate side.

Utilizing page->pp_frag_count being one initially for every page
is the least costly way to do that, as best as I can think of.
It only adds the cost of page_pool_fragment_page() for the non-frag
page case, as you mention below, which I think is negligible as we
are already dirtying the same cache line in page_pool_set_pp_info().
And for a frag page, we avoid resetting page->pp_frag_count to one by
utilizing the optimization of not updating page->pp_frag_count in
page_pool_defrag_page() for the last frag user.

Please let me know if the above makes sense, or if misunderstood your
concern here.
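To make the unified free-path check concrete, here is a minimal userspace sketch (hypothetical mock_* names; C11 atomics in place of the kernel's atomic_long_t; not the in-tree code) showing that, once every page starts with pp_frag_count == 1, a non-frag page and a frag page can take the exact same last-user path with no PP_FLAG_PAGE_FRAG branch:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Minimal model: names are made up for illustration. */
struct mock_page {
	atomic_long pp_frag_count;	/* pre-set to 1 on alloc */
};

static long mock_defrag_page(struct mock_page *page, long nr)
{
	if (atomic_load(&page->pp_frag_count) == nr) {
		/* Last user: reset for the next (possibly non-frag) user. */
		atomic_store(&page->pp_frag_count, 1);
		return 0;
	}
	return atomic_fetch_sub(&page->pp_frag_count, nr) - nr;
}

/* One check for both frag and non-frag pages. */
static bool mock_is_last_frag(struct mock_page *page)
{
	return mock_defrag_page(page, 1) == 0;
}
```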

> 
> The problem is there are some architectures where we just cannot
> support having pp_frag_count due to the DMA size. So it makes sense to
> leave those with just basic page pool instead of trying to fake that it
> is a fragmented page.

It kind of depends on how you view it; this patch views it as only supporting
one frag when we can't support having pp_frag_count, so I would not call it
faking.

> 
>> ---
>>  include/net/page_pool.h | 38 +++++++++++++++++++++++++++++++-------
>>  net/core/page_pool.c    |  1 +
>>  2 files changed, 32 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
>> index c8ec2f34722b..ea7a0c0592a5 100644
>> --- a/include/net/page_pool.h
>> +++ b/include/net/page_pool.h
>> @@ -50,6 +50,9 @@
>>  				 PP_FLAG_DMA_SYNC_DEV |\
>>  				 PP_FLAG_PAGE_FRAG)
>>  
>> +#define PAGE_POOL_DMA_USE_PP_FRAG_COUNT \
>> +		(sizeof(dma_addr_t) > sizeof(unsigned long))
>> +
>>  /*
>>   * Fast allocation side cache array/stack
>>   *
>> @@ -295,13 +298,20 @@ void page_pool_put_defragged_page(struct page_pool *pool, struct page *page,
>>   */
>>  static inline void page_pool_fragment_page(struct page *page, long nr)
>>  {
>> -	atomic_long_set(&page->pp_frag_count, nr);
>> +	if (!PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
>> +		atomic_long_set(&page->pp_frag_count, nr);
>>  }
>>  
>> +/* We need to reset frag_count back to 1 for the last user to allow
>> + * only one user in case the page is recycled and allocated as non-frag
>> + * page.
>> + */
>>  static inline long page_pool_defrag_page(struct page *page, long nr)
>>  {
>>  	long ret;
>>  
>> +	BUILD_BUG_ON(__builtin_constant_p(nr) && nr != 1);
>> +
> 
> What is the point of this line? It doesn't make much sense to me. Are
> you just trying to force an optimization? You would be better off just
> taking the BUILD_BUG_ON() contents and feeding them into an if statement
> below, since the statement will compile out anyway.

If the "if statement" you mentioned refers to the one below, then yes.

>> +		if (!__builtin_constant_p(nr))
>> +			atomic_long_set(&page->pp_frag_count, 1);

But it is a *BUILD*_BUG_ON(); isn't it compiled out wherever we put it?

Will move it down anyway to avoid confusion.

> 
> It seems like what you would want here is:
> 	BUG_ON(!PAGE_POOL_DMA_USE_PP_FRAG_COUNT);
> 
> Otherwise you are potentially writing to a variable that shouldn't
> exist.

Not if the driver uses the page_pool_alloc_frag() API instead of manipulating
page->pp_frag_count directly using page_pool_defrag_page() like mlx5 does.
mlx5 calls page_pool_create() with PP_FLAG_PAGE_FRAG set, and it does not
seem to have a fallback for the PAGE_POOL_DMA_USE_PP_FRAG_COUNT case, so
we may need to keep PP_FLAG_PAGE_FRAG for it. That's why we need to shield
drivers from the implementation details (pp_frag_count handling
specifically) of frag support unless we have a very good reason.

> 
>>  	/* If nr == pp_frag_count then we have cleared all remaining
>>  	 * references to the page. No need to actually overwrite it, instead
>>  	 * we can leave this to be overwritten by the calling function.
>> @@ -311,19 +321,36 @@ static inline long page_pool_defrag_page(struct page *page, long nr)
>>  	 * especially when dealing with a page that may be partitioned
>>  	 * into only 2 or 3 pieces.
>>  	 */
>> -	if (atomic_long_read(&page->pp_frag_count) == nr)
>> +	if (atomic_long_read(&page->pp_frag_count) == nr) {
>> +		/* As we have ensured nr is always one for constant case
>> +		 * using the BUILD_BUG_ON() as above, only need to handle
>> +		 * the non-constant case here for frag count draining.
>> +		 */
>> +		if (!__builtin_constant_p(nr))
>> +			atomic_long_set(&page->pp_frag_count, 1);
>> +
>>  		return 0;
>> +	}
>>  
>>  	ret = atomic_long_sub_return(nr, &page->pp_frag_count);
>>  	WARN_ON(ret < 0);
>> +
>> +	/* Reset frag count back to 1, this should be the rare case when
>> +	 * two users call page_pool_defrag_page() currently.
>> +	 */
>> +	if (!ret)
>> +		atomic_long_set(&page->pp_frag_count, 1);
>> +
>>  	return ret;
>>  }
>>  
>>  static inline bool page_pool_is_last_frag(struct page_pool *pool,
>>  					  struct page *page)
>>  {
>> -	/* If fragments aren't enabled or count is 0 we were the last user */
>> -	return !(pool->p.flags & PP_FLAG_PAGE_FRAG) ||
>> +	/* When dma_addr_upper is overlapped with pp_frag_count
>> +	 * or we were the last page frag user.
>> +	 */
>> +	return PAGE_POOL_DMA_USE_PP_FRAG_COUNT ||
>>  	       (page_pool_defrag_page(page, 1) == 0);
>>  }
>>  
>> @@ -357,9 +384,6 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,
>>  	page_pool_put_full_page(pool, page, true);
>>  }
>>  
>> -#define PAGE_POOL_DMA_USE_PP_FRAG_COUNT	\
>> -		(sizeof(dma_addr_t) > sizeof(unsigned long))
>> -
>>  static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
>>  {
>>  	dma_addr_t ret = page->dma_addr;
>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>> index e212e9d7edcb..0868aa8f6323 100644
>> --- a/net/core/page_pool.c
>> +++ b/net/core/page_pool.c
>> @@ -334,6 +334,7 @@ static void page_pool_set_pp_info(struct page_pool *pool,
>>  {
>>  	page->pp = pool;
>>  	page->pp_magic |= PP_SIGNATURE;
>> +	page_pool_fragment_page(page, 1);
>>  	if (pool->p.init_callback)
>>  		pool->p.init_callback(page, pool->p.init_arg);
>>  }
> 
> Again, you are adding statements here that now have to be stripped out
> under specific circumstances. In my opinion it would be better to not
> modify base page pool pages and instead just have your allocator
> provide a 1-frag page pool page via a special-case allocator rather
> than messing with all the other drivers.

As above, it is about unifying the handling for frag and non-frag pages in
page_pool_is_last_frag(). Please let me know if there is any better way
to do it without adding statements here.

> .
>
Alexander Duyck June 2, 2023, 4:37 p.m. UTC | #3
On Wed, 2023-05-31 at 19:55 +0800, Yunsheng Lin wrote:
> On 2023/5/30 23:07, Alexander H Duyck wrote:
> > On Mon, 2023-05-29 at 17:28 +0800, Yunsheng Lin wrote:
> > > Currently page_pool_dev_alloc_pages() cannot be called
> > > when PP_FLAG_PAGE_FRAG is set, because it does not use
> > > the frag reference counting.
> > > 
> > > As we are already doing an optimization by not updating
> > > page->pp_frag_count in page_pool_defrag_page() for the
> > > last frag user, and a non-frag page only has one user,
> > > we utilize that to unify frag page and non-frag page
> > > handling, so that page_pool_dev_alloc_pages() can also
> > > be called with PP_FLAG_PAGE_FRAG set.
> > > 
> > > Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> > > CC: Lorenzo Bianconi <lorenzo@kernel.org>
> > > CC: Alexander Duyck <alexander.duyck@gmail.com>
> > 
> > I'm not really a huge fan of the approach. Basically it looks like you
> > are trying to turn every page pool page into a fragmented page. Why not
> > just stick to keeping the fragmented pages and have a special case
> > that just generates a mono-frag page for your allocator instead.
> 
> Let me try to describe what this patch tries to do and how it
> does that, in more detail, in order to reach a more common understanding.
> 
> Before this patch:
> 
> As we use PP_FLAG_PAGE_FRAG to decide whether to check the frag count
> in page_pool_is_last_frag() when a page is returned back to the page
> pool, the cases are:
> 
> 1. PAGE_POOL_DMA_USE_PP_FRAG_COUNT being true case: page_pool_create()
>    fails when it is called with PP_FLAG_PAGE_FRAG set, if user calls
>    page_pool_alloc_frag(), then we warn about it and return NULL.
> 
> 2. PAGE_POOL_DMA_USE_PP_FRAG_COUNT being false case:
>    (1). page_pool_create() is called with PP_FLAG_PAGE_FRAG set:
>         page_pool_alloc_pages() is not allowed to be called as the
>         page->pp_frag_count is not setup correctly for the case.
>    (2). page_pool_create() is called without PP_FLAG_PAGE_FRAG set:
>         page_pool_alloc_frag() is not allowed to be called as the
>         page->pp_frag_count is not checked in page_pool_is_last_frag().
> 
> and mlx5 uses a mix of the above:
> page_pool_create() is called with PP_FLAG_PAGE_FRAG,
> page_pool_dev_alloc_pages() is called to allocate a page, and
> page_pool_fragment_page() is called to set up page->pp_frag_count
> correctly so that page_pool_is_last_frag() can see the correct
> page->pp_frag_count. The mlx5 driver's handling of the frag count is
> linked below; it is complicated, and I am not sure there is any added
> benefit that can justify the complication yet:
> https://www.spinics.net/lists/netdev/msg892893.html
> 
> There are use cases for veth and virtio_net to use frag support
> in page pool to reduce memory usage, and they may request different
> frag sizes depending on the head/tail room space for xdp_frame/shinfo
> and the mtu/packet size. When the requested frag size is large enough
> that a single page cannot be split into more than one frag, using
> frag support only has a performance penalty because of the extra frag
> count handling.
> So to avoid the driver handling page->pp_frag_count directly, and the
> driver calling different page pool APIs according to memory size, we
> need a way to unify page_pool_is_last_frag() for frag and non-frag pages.
> 
> 1. https://patchwork.kernel.org/project/netdevbpf/patch/d3ae6bd3537fbce379382ac6a42f67e22f27ece2.1683896626.git.lorenzo@kernel.org/
> 2. https://patchwork.kernel.org/project/netdevbpf/patch/20230526054621.18371-3-liangchen.linux@gmail.com/
> 
> After this patch:
> This patch ensures the pp_frag_count of a page from pool->alloc/pool->ring,
> or newly allocated from the page allocator, is one, which means we assume
> that all pages have one frag user initially, so that we can have unified
> handling for frag and non-frag pages in page_pool_is_last_frag().
> 
> So the key point in this patch is about unified handling in
> page_pool_is_last_frag(), which is the free/put side of page pool,
> not the alloc/generate side.
> 
> Utilizing page->pp_frag_count being one initially for every page
> is the least costly way to do that, as best as I can think of.
> It only adds the cost of page_pool_fragment_page() for the non-frag
> page case, as you mention below, which I think is negligible as we
> are already dirtying the same cache line in page_pool_set_pp_info().
> And for a frag page, we avoid resetting page->pp_frag_count to one by
> utilizing the optimization of not updating page->pp_frag_count in
> page_pool_defrag_page() for the last frag user.
> 
> Please let me know if the above makes sense, or if misunderstood your
> concern here.

So my main concern is that what this is doing is masking things so that
the veth and virtio_net drivers can essentially lie about the truesize
of the memory they are using in order to improve their performance by
misleading the socket layer about how much memory it is actually
holding onto.

We have historically had an issue with reporting the truesize of
fragments, but generally the underestimation was kept to something less
than 100% because pages were generally split by at least half. Where it
would get messy is if a misbehaviing socket held onto packets for an
exceedingly long time.

What this patch set is doing is enabling explicit lying about the
truesize, and it compounds that by allowing for mixing small
allocations w/ large ones.
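For a rough feel of the gap being described, here is a toy calculation (the sizes are made-up illustrative values, not numbers from the patch set) comparing the memory actually pinned per frag with what gets reported:

```c
#include <assert.h>

/* Toy illustration of the truesize gap discussed above; the sizes
 * and function are invented for this example only.
 */
static unsigned int truesize_gap(unsigned int page_size,
				 unsigned int reported_truesize,
				 unsigned int frags_per_page)
{
	/* bytes actually held per frag minus bytes reported */
	return page_size / frags_per_page - reported_truesize;
}
```

With a page split in half and truesize reported as the frag size, the gap is zero; with one large frag monopolizing the page, the whole remainder of the page goes unreported.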

> > 
> > The problem is there are some architectures where we just cannot
> > support having pp_frag_count due to the DMA size. So it makes sense to
> > leave those with just basic page pool instead of trying to fake that it
> > is a fragmented page.
> 
> It kind of depend on how you veiw it, this patch view it as only supporting
> one frag when we can't support having pp_frag_count, so I would not call it
> faking.

So the big thing that makes it "faking" is the truesize underestimation
that will occur with these frames.

> 
> > 
> > > ---
> > >  include/net/page_pool.h | 38 +++++++++++++++++++++++++++++++-------
> > >  net/core/page_pool.c    |  1 +
> > >  2 files changed, 32 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> > > index c8ec2f34722b..ea7a0c0592a5 100644
> > > --- a/include/net/page_pool.h
> > > +++ b/include/net/page_pool.h
> > > @@ -50,6 +50,9 @@
> > >  				 PP_FLAG_DMA_SYNC_DEV |\
> > >  				 PP_FLAG_PAGE_FRAG)
> > >  
> > > +#define PAGE_POOL_DMA_USE_PP_FRAG_COUNT \
> > > +		(sizeof(dma_addr_t) > sizeof(unsigned long))
> > > +
> > >  /*
> > >   * Fast allocation side cache array/stack
> > >   *
> > > @@ -295,13 +298,20 @@ void page_pool_put_defragged_page(struct page_pool *pool, struct page *page,
> > >   */
> > >  static inline void page_pool_fragment_page(struct page *page, long nr)
> > >  {
> > > -	atomic_long_set(&page->pp_frag_count, nr);
> > > +	if (!PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
> > > +		atomic_long_set(&page->pp_frag_count, nr);
> > >  }
> > >  
> > > +/* We need to reset frag_count back to 1 for the last user to allow
> > > + * only one user in case the page is recycled and allocated as non-frag
> > > + * page.
> > > + */
> > >  static inline long page_pool_defrag_page(struct page *page, long nr)
> > >  {
> > >  	long ret;
> > >  
> > > +	BUILD_BUG_ON(__builtin_constant_p(nr) && nr != 1);
> > > +
> > 
> > What is the point of this line? It doesn't make much sense to me. Are
> > you just trying to force an optimization? You would be better off just
> > taking the BUILD_BUG_ON() contents and feeding them into an if statement
> > below, since the statement will compile out anyway.
> 
> If the "if statement" you mentioned refers to the one below, then yes.
> 
> > > +		if (!__builtin_constant_p(nr))
> > > +			atomic_long_set(&page->pp_frag_count, 1);
> 
> But it is a *BUILD*_BUG_ON(); isn't it compiled out wherever we put it?
> 
> Will move it down anyway to avoid confusion.

Actually, now that I look at this more, it is even more confusing. The
whole point of this function was that we were supposed to be getting
pp_frag_count to 0. However you are setting it to 1.

This is seriously flawed. If we are going to treat non-fragmented pages
as mono-frags then that is what we should do. We should be pulling this
accounting into all of the page pool freeing paths, not trying to force
the value up to 1 for the non-fragmented case.

> > 
> > It seems like what you would want here is:
> > 	BUG_ON(!PAGE_POOL_DMA_USE_PP_FRAG_COUNT);
> > 
> > Otherwise you are potentially writing to a variable that shouldn't
> > exist.
> 
> Not if the driver uses the page_pool_alloc_frag() API instead of manipulating
> page->pp_frag_count directly using page_pool_defrag_page() like mlx5 does.
> mlx5 calls page_pool_create() with PP_FLAG_PAGE_FRAG set, and it does not
> seem to have a fallback for the PAGE_POOL_DMA_USE_PP_FRAG_COUNT case, so
> we may need to keep PP_FLAG_PAGE_FRAG for it. That's why we need to shield
> drivers from the implementation details (pp_frag_count handling
> specifically) of frag support unless we have a very good reason.
> 

Getting the truesize is that "very good reason". The fact is the
drivers were doing this long before page pool came around. Trying to
pull that away from them is the wrong way to go in my opinion.

> > >  	/* If nr == pp_frag_count then we have cleared all remaining
> > >  	 * references to the page. No need to actually overwrite it, instead
> > >  	 * we can leave this to be overwritten by the calling function.
> > > @@ -311,19 +321,36 @@ static inline long page_pool_defrag_page(struct page *page, long nr)
> > >  	 * especially when dealing with a page that may be partitioned
> > >  	 * into only 2 or 3 pieces.
> > >  	 */
> > > -	if (atomic_long_read(&page->pp_frag_count) == nr)
> > > +	if (atomic_long_read(&page->pp_frag_count) == nr) {
> > > +		/* As we have ensured nr is always one for constant case
> > > +		 * using the BUILD_BUG_ON() as above, only need to handle
> > > +		 * the non-constant case here for frag count draining.
> > > +		 */
> > > +		if (!__builtin_constant_p(nr))
> > > +			atomic_long_set(&page->pp_frag_count, 1);
> > > +
> > >  		return 0;
> > > +	}
> > >  

The optimization here was already the comparison since we didn't have
to do anything if pp_frag_count == nr. The whole point of pp_frag_count
going to 0 is that is considered non-fragmented in that case and ready
to be freed. By resetting it to 1 you are implying that there is still
one *other* user that is holding a fragment so the page cannot be
freed.

We weren't bothering with writing the value since the page is in the
free path and this value is going to be unused until the page is
reallocated anyway.
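The pre-patch semantics being described here, where the last defrag call skips the write entirely and whatever stale count remains is simply rewritten at allocation time, can be modelled with this userspace sketch (hypothetical mock_* names; C11 atomics in place of atomic_long_t; not the kernel code):

```c
#include <assert.h>
#include <stdatomic.h>

/* Model of the original (pre-patch) behaviour: illustrative only. */
struct mock_page {
	atomic_long pp_frag_count;
};

static long mock_defrag_page_old(struct mock_page *page, long nr)
{
	/* Last user: return 0 without touching the count; whatever
	 * value is left behind is overwritten on the next allocation.
	 */
	if (atomic_load(&page->pp_frag_count) == nr)
		return 0;
	return atomic_fetch_sub(&page->pp_frag_count, nr) - nr;
}
```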

> > >  	ret = atomic_long_sub_return(nr, &page->pp_frag_count);
> > >  	WARN_ON(ret < 0);
> > > +
> > > +	/* Reset frag count back to 1, this should be the rare case when
> > > +	 * two users call page_pool_defrag_page() currently.
> > > +	 */
> > > +	if (!ret)
> > > +		atomic_long_set(&page->pp_frag_count, 1);
> > > +
> > >  	return ret;
> > >  }
> > >  
> > >  static inline bool page_pool_is_last_frag(struct page_pool *pool,
> > >  					  struct page *page)
> > >  {
> > > -	/* If fragments aren't enabled or count is 0 we were the last user */
> > > -	return !(pool->p.flags & PP_FLAG_PAGE_FRAG) ||
> > > +	/* When dma_addr_upper is overlapped with pp_frag_count
> > > +	 * or we were the last page frag user.
> > > +	 */
> > > +	return PAGE_POOL_DMA_USE_PP_FRAG_COUNT ||
> > >  	       (page_pool_defrag_page(page, 1) == 0);
> > >  }
> > >  
> > > @@ -357,9 +384,6 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,
> > >  	page_pool_put_full_page(pool, page, true);
> > >  }
> > >  
> > > -#define PAGE_POOL_DMA_USE_PP_FRAG_COUNT	\
> > > -		(sizeof(dma_addr_t) > sizeof(unsigned long))
> > > -
> > >  static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
> > >  {
> > >  	dma_addr_t ret = page->dma_addr;
> > > diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> > > index e212e9d7edcb..0868aa8f6323 100644
> > > --- a/net/core/page_pool.c
> > > +++ b/net/core/page_pool.c
> > > @@ -334,6 +334,7 @@ static void page_pool_set_pp_info(struct page_pool *pool,
> > >  {
> > >  	page->pp = pool;
> > >  	page->pp_magic |= PP_SIGNATURE;
> > > +	page_pool_fragment_page(page, 1);
> > >  	if (pool->p.init_callback)
> > >  		pool->p.init_callback(page, pool->p.init_arg);
> > >  }
> > 
> > Again, you are adding statements here that now have to be stripped out
> > under specific circumstances. In my opinion it would be better to not
> > modify base page pool pages and instead just have your allocator
> > provide a 1-frag page pool page via a special-case allocator rather
> > than messing with all the other drivers.
> 
> As above, it is about unifying the handling for frag and non-frag pages in
> page_pool_is_last_frag(). Please let me know if there is any better way
> to do it without adding statements here.

I get what you are trying to get at, but I feel like the implementation
is going to cause more problems than it helps. The problem is it is
going to hurt base page pool performance, and it just makes the
fragmented pages that much more confusing to deal with.

My advice would be, as a first step, to look at solving how to enable
PP_FLAG_PAGE_FRAG mode when PAGE_POOL_DMA_USE_PP_FRAG_COUNT is true.
That should be creating mono-frags, as we are calling them, and we
should have a way to get the truesize for those so we know when we are
consuming a significant amount of memory.

Once that is solved then we can look at what it would take to apply
mono-frags to the standard page pool case. Ideally we would need to
find a way to do it with minimal overhead.
Yunsheng Lin June 3, 2023, 12:59 p.m. UTC | #4
On 2023/6/3 0:37, Alexander H Duyck wrote:
...

>>
>> Please let me know if the above makes sense, or if misunderstood your
>> concern here.
> 
> So my main concern is that what this is doing is masking things so that
> the veth and virtio_net drivers can essentially lie about the truesize
> of the memory they are using in order to improve their performance by
> misleading the socket layer about how much memory it is actually
> holding onto.
> 
> We have historically had an issue with reporting the truesize of
> fragments, but generally the underestimation was kept to something less
> than 100% because pages were generally split by at least half. Where it
> would get messy is if a misbehaving socket held onto packets for an
> exceedingly long time.
> 
> What this patch set is doing is enabling explicit lying about the
> truesize, and it compounds that by allowing for mixing small
> allocations w/ large ones.
> 
>>>
>>> The problem is there are some architectures where we just cannot
>>> support having pp_frag_count due to the DMA size. So it makes sense to
>>> leave those with just basic page pool instead of trying to fake that it
>>> is a fragmented page.
>>
>> It kind of depends on how you view it: this patch views it as only supporting
>> one frag when we can't support having pp_frag_count, so I would not call it
>> faking.
> 
> So the big thing that makes it "faking" is the truesize underestimation
> that will occur with these frames.

Let's discuss truesize issue in patch 2 instead of here.
Personally, I still believe that if the driver can compute the
truesize correctly by manipulating the page->pp_frag_count and
frag offset directly, the page pool can do that too.

> 
>>
>>>
>>>> ---

...

>>>
>>> What is the point of this line? It doesn't make much sense to me. Are
>>> you just trying to force an optimization? You would be better off just
>>> taking the BUILD_BUG_ON contents and feeding them into an if statement
>>> below since the statement will compile out anyway.
>>
>> if the "if statement" you said refers to the below, then yes.
>>
>>>> +		if (!__builtin_constant_p(nr))
>>>> +			atomic_long_set(&page->pp_frag_count, 1);
>>
>> But it is a *BUILD*_BUG_ON(); isn't it compiled out wherever we put it?
>>
>> Will move it down anyway to avoid confusion.
> 
> Actually now that I look at this more it is even more confusing. The
> whole point of this function was that we were supposed to be getting
> pp_frag_count to 0. However you are setting it to 1.
> 
> This is seriously flawed. If we are going to treat non-fragmented pages
> as mono-frags then that is what we should do. We should be pulling this
> accounting into all of the page pool freeing paths, not trying to force
> the value up to 1 for the non-fragmented case.

I am not sure I understand what you mean by 'non-fragmented',
'mono-frags', 'page pool freeing paths' and 'non-fragmented case'
here. Maybe describe it in more detail with something like
pseudocode?

> 
>>>
>>> It seems like what you would want here is:
>>> 	BUG_ON(!PAGE_POOL_DMA_USE_PP_FRAG_COUNT);
>>>
>>> Otherwise you are potentially writing to a variable that shouldn't
>>> exist.
>>
>> Not if the driver uses the page_pool_alloc_frag() API instead of manipulating
>> page->pp_frag_count directly via page_pool_defrag_page() like mlx5 does.
>> mlx5 calls page_pool_create() with PP_FLAG_PAGE_FRAG set, and it does not
>> seem to have a fallback for the PAGE_POOL_DMA_USE_PP_FRAG_COUNT case, so we
>> may need to keep PP_FLAG_PAGE_FRAG for it. That's why we need to shield the
>> driver from the implementation details (pp_frag_count handling specifically)
>> of frag support unless we have a very good reason.
>>
> 
> Getting the truesize is that "very good reason". The fact is the
> drivers were doing this long before page pool came around. Trying to
> pull that away from them is the wrong way to go in my opinion.

If truesize is really the concern here, I think it makes more
sense to enforce it in the page pool instead of each driver doing
its own trick. I also think we can do better by handling
pp_frag_count in the page pool instead of in the driver, so
let's continue the truesize discussion in patch 2 to see if we
can come up with something better there.

> 
>>>>  	/* If nr == pp_frag_count then we have cleared all remaining
>>>>  	 * references to the page. No need to actually overwrite it, instead
>>>>  	 * we can leave this to be overwritten by the calling function.
>>>> @@ -311,19 +321,36 @@ static inline long page_pool_defrag_page(struct page *page, long nr)
>>>>  	 * especially when dealing with a page that may be partitioned
>>>>  	 * into only 2 or 3 pieces.
>>>>  	 */
>>>> -	if (atomic_long_read(&page->pp_frag_count) == nr)
>>>> +	if (atomic_long_read(&page->pp_frag_count) == nr) {
>>>> +		/* As we have ensured nr is always one for constant case
>>>> +		 * using the BUILD_BUG_ON() as above, only need to handle
>>>> +		 * the non-constant case here for frag count draining.
>>>> +		 */
>>>> +		if (!__builtin_constant_p(nr))
>>>> +			atomic_long_set(&page->pp_frag_count, 1);
>>>> +
>>>>  		return 0;
>>>> +	}
>>>>  
> 
> The optimization here was already the comparison since we didn't have
> to do anything if pp_frag_count == nr. The whole point of pp_frag_count
going to 0 is that the page is then considered non-fragmented and ready
> to be freed. By resetting it to 1 you are implying that there is still
> one *other* user that is holding a fragment so the page cannot be
> freed.
> 
> We weren't bothering with writing the value since the page is in the
> free path and this value is going to be unused until the page is
> reallocated anyway.

I am not sure what you meant above,
but I will describe what this patch is trying to do again:
PP_FLAG_PAGE_FRAG is a per-page-pool flag, not a per-page one, so when
it is set, page_pool_alloc_pages() is not allowed to be called because
page->pp_frag_count is not set up correctly for that case.

So in order to allow calling page_pool_alloc_pages(), as best as I
can think of, either we need a per page flag/bit to decide whether
to do something like dec_and_test for page->pp_frag_count in
page_pool_is_last_frag(), or we unify the page->pp_frag_count handling
in page_pool_is_last_frag() so that we don't need a per page flag/bit.

This patch utilizes the optimization you mentioned above to unify the
page->pp_frag_count handling.

> 
>>>>  	ret = atomic_long_sub_return(nr, &page->pp_frag_count);
>>>>  	WARN_ON(ret < 0);
>>>> +
>>>> +	/* Reset frag count back to 1, this should be the rare case when
>>>> +	 * two users call page_pool_defrag_page() concurrently.
>>>> +	 */
>>>> +	if (!ret)
>>>> +		atomic_long_set(&page->pp_frag_count, 1);
>>>> +
>>>>  	return ret;
>>>>  }
>>>>

...

>> As above, it is about unifying handling for frag and non-frag page in
>> page_pool_is_last_frag(). please let me know if there is any better way
>> to do it without adding statements here.
> 
> I get what you are trying to get at but I feel like the implementation
> is going to cause more problems than it helps. The problem is it is
> going to hurt base page pool performance and it just makes the
> fragmented pages that much more confusing to deal with.

For base page pool performance, as I mentioned before:
it removes the PP_FLAG_PAGE_FRAG checking and only adds the cost of
page_pool_fragment_page() in page_pool_set_pp_info(), which I
think is negligible as we are already dirtying the same cache
line in page_pool_set_pp_info().

As for the confusion, sometimes it is about personal taste, so I am
not going to argue with it:) But it would be good to provide a
non-confusing way to do that with minimal overhead. I feel like
you provided it at the beginning, but I am not able to understand
it yet.

> 
> My advice as a first step would be to look at first solving how to
> enable the PP_FLAG_PAGE_FRAG mode when you have
> PAGE_POOL_DMA_USE_PP_FRAG_COUNT as true. That should be creating mono-
> frags as we are calling them, and we should have a way to get the
> truesize for those so we know when we are consuming significant amount
> of memory.

Does the way to get the truesize in the below RFC make sense to you?
https://patchwork.kernel.org/project/netdevbpf/patch/20230516124801.2465-4-linyunsheng@huawei.com/

> 
> Once that is solved then we can look at what it would take to apply
> mono-frags to the standard page pool case. Ideally we would need to
> find a way to do it with minimal overhead.
> 
>
Alexander Duyck June 5, 2023, 2:58 p.m. UTC | #5
On Sat, Jun 3, 2023 at 5:59 AM Yunsheng Lin <yunshenglin0825@gmail.com> wrote:
>
> On 2023/6/3 0:37, Alexander H Duyck wrote:
> ...
>
> >>
> >> Please let me know if the above makes sense, or if I misunderstood your
> >> concern here.
> >
> > So my main concern is that what this is doing is masking things so that
> > the veth and virtio_net drivers can essentially lie about the truesize
> > of the memory they are using in order to improve their performance by
> > misleading the socket layer about how much memory it is actually
> > holding onto.
> >
> > We have historically had an issue with reporting the truesize of
> > fragments, but generally the underestimation was kept to something less
> > than 100% because pages were generally split by at least half. Where it
> > would get messy is if a misbehaving socket held onto packets for an
> > exceedingly long time.
> >
> > What this patch set is doing is enabling explicit lying about the
> > truesize, and it compounds that by allowing for mixing small
> > allocations w/ large ones.
> >
> >>>
> >>> The problem is there are some architectures where we just cannot
> >>> support having pp_frag_count due to the DMA size. So it makes sense to
> >>> leave those with just basic page pool instead of trying to fake that it
> >>> is a fragmented page.
> >>
> >> It kind of depends on how you view it: this patch views it as only supporting
> >> one frag when we can't support having pp_frag_count, so I would not call it
> >> faking.
> >
> > So the big thing that makes it "faking" is the truesize underestimation
> > that will occur with these frames.
>
> Let's discuss truesize issue in patch 2 instead of here.
> Personally, I still believe that if the driver can compute the
> truesize correctly by manipulating the page->pp_frag_count and
> frag offset directly, the page pool can do that too.
>
> >
> >>
> >>>
> >>>> ---
>
> ...
>
> >>>
> >>> What is the point of this line? It doesn't make much sense to me. Are
> >>> you just trying to force an optimization? You would be better off just
> >>> taking the BUILD_BUG_ON contents and feeding them into an if statement
> >>> below since the statement will compile out anyway.
> >>
> >> if the "if statement" you said refers to the below, then yes.
> >>
> >>>> +          if (!__builtin_constant_p(nr))
> >>>> +                  atomic_long_set(&page->pp_frag_count, 1);
> >>
> >> But it is a *BUILD*_BUG_ON(); isn't it compiled out wherever we put it?
> >>
> >> Will move it down anyway to avoid confusion.
> >
> > Actually now that I look at this more it is even more confusing. The
> > whole point of this function was that we were supposed to be getting
> > pp_frag_count to 0. However you are setting it to 1.
> >
> > This is seriously flawed. If we are going to treat non-fragmented pages
> > as mono-frags then that is what we should do. We should be pulling this
> > accounting into all of the page pool freeing paths, not trying to force
> > the value up to 1 for the non-fragmented case.
>
> I am not sure I understand what you mean by 'non-fragmented',
> 'mono-frags', 'page pool freeing paths' and 'non-fragmented case'
> here. Maybe describe it in more detail with something like
> pseudocode?

What you are attempting to generate are "mono-frags" where a page pool
page has a frag count of 1. I refer to "non-fragmented pages" as the
legacy page pool page without pp_frags set.

The "page-pool freeing paths" are the ones outside of the fragmented
bits here. Basically __page_pool_put_page and the like. What you
should be doing is pushing the reference counting code down deeper
into the page pool logic. Currently it is more of a surface setup.

The whole point I am getting at with this is that we should see the
number of layers reduced for the fragmented pages, and by converting
the non-fragmented pages to mono-frags we should see that maintain its
current performance and total number of layers instead of having more
layers added to it.

> >
> >>>
> >>> It seems like what you would want here is:
> >>>     BUG_ON(!PAGE_POOL_DMA_USE_PP_FRAG_COUNT);
> >>>
> >>> Otherwise you are potentially writing to a variable that shouldn't
> >>> exist.
> >>
> >> Not if the driver uses the page_pool_alloc_frag() API instead of manipulating
> >> page->pp_frag_count directly via page_pool_defrag_page() like mlx5 does.
> >> mlx5 calls page_pool_create() with PP_FLAG_PAGE_FRAG set, and it does not
> >> seem to have a fallback for the PAGE_POOL_DMA_USE_PP_FRAG_COUNT case, so we
> >> may need to keep PP_FLAG_PAGE_FRAG for it. That's why we need to shield the
> >> driver from the implementation details (pp_frag_count handling specifically)
> >> of frag support unless we have a very good reason.
> >>
> >
> > Getting the truesize is that "very good reason". The fact is the
> > drivers were doing this long before page pool came around. Trying to
> > pull that away from them is the wrong way to go in my opinion.
>
> If truesize is really the concern here, I think it makes more
> sense to enforce it in the page pool instead of each driver doing
> its own trick. I also think we can do better by handling
> pp_frag_count in the page pool instead of in the driver, so
> let's continue the truesize discussion in patch 2 to see if we
> can come up with something better there.

The problem is we don't free the page until the next allocation so the
truesize will be false as the remainder of the page should be added to
the truesize. The drivers tend to know what they are doing with the
page and when they are freeing it. We don't have that sort of
knowledge when we are doing the allocation.

> >
> >>>>    /* If nr == pp_frag_count then we have cleared all remaining
> >>>>     * references to the page. No need to actually overwrite it, instead
> >>>>     * we can leave this to be overwritten by the calling function.
> >>>> @@ -311,19 +321,36 @@ static inline long page_pool_defrag_page(struct page *page, long nr)
> >>>>     * especially when dealing with a page that may be partitioned
> >>>>     * into only 2 or 3 pieces.
> >>>>     */
> >>>> -  if (atomic_long_read(&page->pp_frag_count) == nr)
> >>>> +  if (atomic_long_read(&page->pp_frag_count) == nr) {
> >>>> +          /* As we have ensured nr is always one for constant case
> >>>> +           * using the BUILD_BUG_ON() as above, only need to handle
> >>>> +           * the non-constant case here for frag count draining.
> >>>> +           */
> >>>> +          if (!__builtin_constant_p(nr))
> >>>> +                  atomic_long_set(&page->pp_frag_count, 1);
> >>>> +
> >>>>            return 0;
> >>>> +  }
> >>>>
> >
> > The optimization here was already the comparison since we didn't have
> > to do anything if pp_frag_count == nr. The whole point of pp_frag_count
> > going to 0 is that the page is then considered non-fragmented and ready
> > to be freed. By resetting it to 1 you are implying that there is still
> > one *other* user that is holding a fragment so the page cannot be
> > freed.
> >
> > We weren't bothering with writing the value since the page is in the
> > free path and this value is going to be unused until the page is
> > reallocated anyway.
>
> I am not sure what you meant above,
> but I will describe what this patch is trying to do again:
> PP_FLAG_PAGE_FRAG is a per-page-pool flag, not a per-page one, so when
> it is set, page_pool_alloc_pages() is not allowed to be called because
> page->pp_frag_count is not set up correctly for that case.
>
> So in order to allow calling page_pool_alloc_pages(), as best as I
> can think of, either we need a per page flag/bit to decide whether
> to do something like dec_and_test for page->pp_frag_count in
> page_pool_is_last_frag(), or we unify the page->pp_frag_count handling
> in page_pool_is_last_frag() so that we don't need a per page flag/bit.
>
> This patch utilizes the optimization you mentioned above to unify the
> page->pp_frag_count handling.

Basically what should be happening if all page-pool pages are to be
considered "fragmented" is that we should be folding this into the
freeing logic. What we have now is a 2-stage setup where we drop
the count to 0, then rebound it and set it back to 1. If we are
going to have all page pool pages fragmented then the freeing path for
page pool pages should just be handling frag count directly instead of
hacking on it here and ignoring it in the actual freeing paths.

> >
> >>>>    ret = atomic_long_sub_return(nr, &page->pp_frag_count);
> >>>>    WARN_ON(ret < 0);
> >>>> +
> >>>> +  /* Reset frag count back to 1, this should be the rare case when
> >>>> +   * two users call page_pool_defrag_page() concurrently.
> >>>> +   */
> >>>> +  if (!ret)
> >>>> +          atomic_long_set(&page->pp_frag_count, 1);
> >>>> +
> >>>>    return ret;
> >>>>  }
> >>>>
>
> ...
>
> >> As above, it is about unifying handling for frag and non-frag page in
> >> page_pool_is_last_frag(). please let me know if there is any better way
> >> to do it without adding statements here.
> >
> > I get what you are trying to get at but I feel like the implementation
> > is going to cause more problems than it helps. The problem is it is
> > going to hurt base page pool performance and it just makes the
> > fragmented pages that much more confusing to deal with.
>
> For base page pool performance, as I mentioned before:
> it removes the PP_FLAG_PAGE_FRAG checking and only adds the cost of
> page_pool_fragment_page() in page_pool_set_pp_info(), which I
> think is negligible as we are already dirtying the same cache
> line in page_pool_set_pp_info().

I have no problem with getting rid of the flag.

> As for the confusion, sometimes it is about personal taste, so I am
> not going to argue with it:) But it would be good to provide a
> non-confusing way to do that with minimal overhead. I feel like
> you provided it at the beginning, but I am not able to understand
> it yet.

The problem here is that instead of treating all page pool pages as
fragmented, what the patch set has done is add a shim layer so that
you are layering fragmentation on top of page pool pages which was
already the case.

That is why I have suggested making page pool pages a "mono-frag" as
your first patch. Basically it is going to force you to have to set
the pp_frag value for these pages, and verify it is 1 when you are
freeing it.

Then you are going to have to modify the fragmented cases to make use
of lower level calls because now instead of us defragging a fragmented
page, and then freeing it the two operations essentially have to be
combined into one operation.

> >
> > My advice as a first step would be to look at first solving how to
> > enable the PP_FLAG_PAGE_FRAG mode when you have
> > PAGE_POOL_DMA_USE_PP_FRAG_COUNT as true. That should be creating mono-
> > frags as we are calling them, and we should have a way to get the
> > truesize for those so we know when we are consuming significant amount
> > of memory.
>
> Does the way to get the truesize in the below RFC make sense to you?
> https://patchwork.kernel.org/project/netdevbpf/patch/20230516124801.2465-4-linyunsheng@huawei.com/

It doesn't add any value. All you are doing is passing the "size"
value as "truesize". The whole point of the "truesize" would be to
report the actual size. So a step in that direction would be to bump
truesize to include the remainder that isn't used when you decide it
is time to allocate a new page. The problem is that it requires some
fore-knowledge of what the next requested size is going to be. That is
why it is better to just have the drivers manage this since they know
what size they typically request and when they are going to close
pages.

Like I said, if you are wanting to go down this path you are better
off starting with page pool and making all regular page pool pages
into mono-frags. Then from there we can start building things out.

With that you could then let drivers like the Mellanox one handle its
own fragmenting knowing it has to return things to a mono-frag in
order for it to be working correctly.
Yunsheng Lin June 6, 2023, 12:41 p.m. UTC | #6
On 2023/6/5 22:58, Alexander Duyck wrote:

...

>>
>> I am not sure I understand what you mean by 'non-fragmented',
>> 'mono-frags', 'page pool freeing paths' and 'non-fragmented case'
>> here. Maybe describe it in more detail with something like
>> pseudocode?
> 
> What you are attempting to generate are "mono-frags" where a page pool
> page has a frag count of 1. I refer to "non-fragmented pages" as the
> legacy page pool page without pp_frags set.
> 
> The "page-pool freeing paths" are the ones outside of the fragmented
> bits here. Basically __page_pool_put_page and the like. What you
> should be doing is pushing the reference counting code down deeper
> into the page pool logic. Currently it is more of a surface setup.
> 
> The whole point I am getting at with this is that we should see the
> number of layers reduced for the fragmented pages, and by converting
> the non-fragmented pages to mono-frags we should see that maintain its
> current performance and total number of layers instead of having more
> layers added to it.

Do you mean reducing the number of layers for the fragmented pages by
moving the page->pp_frag_count handling from page_pool_defrag_page()
to __page_pool_put_page() where page->_refcount is checked?

Or merge page->pp_frag_count into page->_refcount so that we don't
need page->pp_frag_count anymore?

As I understand it, when a page from the page pool is passed to the stack
to be processed, the stack may hold onto that page by taking a
page->_refcount reference too, which means the page pool has no control
over who holds the page or when that reference will be released. That is
why the page pool does the "page_ref_count(page) == 1" check in
__page_pool_put_page(): if it is not true, the page pool can't recycle
the page. So pp_frag_count and _refcount have different meanings and
serve different purposes; merging them doesn't work, and moving them to
one place doesn't make much sense either?

Or is there another obvious consideration that I missed?

>>>
>> I am not sure what you meant above,
>> but I will describe what this patch is trying to do again:
>> PP_FLAG_PAGE_FRAG is a per-page-pool flag, not a per-page one, so when
>> it is set, page_pool_alloc_pages() is not allowed to be called because
>> page->pp_frag_count is not set up correctly for that case.
>>
>> So in order to allow calling page_pool_alloc_pages(), as best as I
>> can think of, either we need a per page flag/bit to decide whether
>> to do something like dec_and_test for page->pp_frag_count in
>> page_pool_is_last_frag(), or we unify the page->pp_frag_count handling
>> in page_pool_is_last_frag() so that we don't need a per page flag/bit.
>>
>> This patch utilizes the optimization you mentioned above to unify the
>> page->pp_frag_count handling.
> 
> Basically what should be happening if all page-pool pages are to be
> considered "fragmented" is that we should be folding this into the
> freeing logic. What we now have a 2 stage setup where we are dropping
> the count to 0, then rebounding it and setting it back to 1. If we are
> going to have all page pool pages fragmented then the freeing path for
> page pool pages should just be handling frag count directly instead of
> hacking on it here and ignoring it in the actual freeing paths.

Do you mean doing something like below? Isn't it dirtying the cache line
of 'struct page' whenever a page is recycled, which means we may not be
able to maintain the current performance for the non-fragmented or
mono-frag case?

--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -583,6 +583,10 @@ static __always_inline struct page *
 __page_pool_put_page(struct page_pool *pool, struct page *page,
                     unsigned int dma_sync_size, bool allow_direct)
 {
+
+       if (!page_pool_defrag_page(page, 1))
+               return NULL;
+
        /* This allocator is optimized for the XDP mode that uses
         * one-frame-per-page, but have fallbacks that act like the
         * regular page allocator APIs.
@@ -594,6 +598,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
         */
        if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
                /* Read barrier done in page_ref_count / READ_ONCE */
+               page_pool_fragment_page(page, 1);

                if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
                        page_pool_dma_sync_for_device(pool, page,



> 
>>>
>>>>>>    ret = atomic_long_sub_return(nr, &page->pp_frag_count);
>>>>>>    WARN_ON(ret < 0);
>>>>>> +
>>>>>> +  /* Reset frag count back to 1, this should be the rare case when
>>>>>> +   * two users call page_pool_defrag_page() concurrently.
>>>>>> +   */
>>>>>> +  if (!ret)
>>>>>> +          atomic_long_set(&page->pp_frag_count, 1);
>>>>>> +
>>>>>>    return ret;
>>>>>>  }
>>>>>>
>>
>> ...
>>
>>>> As above, it is about unifying handling for frag and non-frag page in
>>>> page_pool_is_last_frag(). please let me know if there is any better way
>>>> to do it without adding statements here.
>>>
>>> I get what you are trying to get at but I feel like the implementation
>>> is going to cause more problems than it helps. The problem is it is
>>> going to hurt base page pool performance and it just makes the
>>> fragmented pages that much more confusing to deal with.
>>
>> For base page pool performance, as I mentioned before:
>> it removes the PP_FLAG_PAGE_FRAG checking and only adds the cost of
>> page_pool_fragment_page() in page_pool_set_pp_info(), which I
>> think is negligible as we are already dirtying the same cache
>> line in page_pool_set_pp_info().
> 
> I have no problem with getting rid of the flag.
> 
>> As for the confusion, sometimes it is about personal taste, so I am
>> not going to argue with it:) But it would be good to provide a
>> non-confusing way to do that with minimal overhead. I feel like
>> you provided it at the beginning, but I am not able to understand
>> it yet.
> 
> The problem here is that instead of treating all page pool pages as
> fragmented, what the patch set has done is added a shim layer so that
> you are layering fragmentation on top of page pool pages which was
> already the case.
> 
>> That is why I have suggested making page pool pages a "mono-frag" as
> your first patch. Basically it is going to force you to have to set
> the pp_frag value for these pages, and verify it is 1 when you are
> freeing it.

It seems that is basically what this patch does, with minimal
overhead for the previous users.

Let me try again with what this patch mainly does:

Currently, when page_pool_create() is called with the
PP_FLAG_PAGE_FRAG flag, page_pool_alloc_pages() is only
allowed to be called under the below constraints:
1. page_pool_fragment_page() needs to be called to set up
   page->pp_frag_count immediately.
2. page_pool_defrag_page() often needs to be called to
   drain page->pp_frag_count when no more users will be
   holding on to that page.

Those constraints exist in order to support splitting a
page into multiple frags.

And those constraints have some overhead because of the
cache line dirtying/bouncing and atomic update.

Those constraints are unavoidable for the case when we need
a page to be split into more than one frag, but there are
also cases where we want to avoid those constraints and
their overhead because a page can't be split, as it can
only hold one big frag as requested by the user. The
different use cases are:
use case 1: allocate page without page splitting.
use case 2: allocate page with page splitting.
use case 3: allocate page with or without page splitting
            depending on the frag size.

Currently the page pool only provides the page_pool_alloc_pages()
and page_pool_alloc_frag() APIs to enable the above 1 & 2
separately; we cannot use a combination of 1 & 2 to enable 3,
because of the per-page_pool flag PP_FLAG_PAGE_FRAG.

So in order to allow allocating unsplit pages without
the overhead of split pages, while still allowing split
pages to be allocated, we need to remove the per-page_pool
flag from page_pool_is_last_frag(). As best as I can think
of, there are two methods:
1. Add a per-page flag/bit to indicate whether a page is
   split, which means we might need to update that flag/bit
   every time the page is recycled, dirtying the cache line
   of 'struct page' for use case 1.
2. Unify the page->pp_frag_count handling for both split
   and unsplit pages by assuming all pages in the page
   pool are initially split into one big frag.

Because we want to support the above use case 3 with minimal
overhead, especially without adding any noticeable overhead
for use case 1, and we are already doing an optimization by
not updating pp_frag_count in page_pool_defrag_page() for
the last frag user, this patch chooses to unify the
pp_frag_count handling.

Let me know if this is making any sense.

> 
> Then you are going to have to modify the fragmented cases to make use
> of lower level calls because now instead of us defragging a fragmented
> page, and then freeing it the two operations essentially have to be
> combined into one operation.

Does 'defragging a fragmented page' mean decrementing pp_frag_count?
Does "freeing it" mean calling put_page()? What does 'combined' really
mean here?

> 
>>>
>>> My advice as a first step would be to look at first solving how to
>>> enable the PP_FLAG_PAGE_FRAG mode when you have
>>> PAGE_POOL_DMA_USE_PP_FRAG_COUNT as true. That should be creating mono-
>>> frags as we are calling them, and we should have a way to get the
>>> truesize for those so we know when we are consuming significant amount
>>> of memory.
>>
>> Does the way to get the truesize in the below RFC make sense to you?
>> https://patchwork.kernel.org/project/netdevbpf/patch/20230516124801.2465-4-linyunsheng@huawei.com/
> 
> It doesn't add any value. All you are doing is passing the "size"
> value as "truesize". The whole point of the "truesize" would be to
> report the actual size. So a step in that direction would be to bump
> truesize to include the remainder that isn't used when you decide it
> is time to allocate a new page. The problem is that it requires some
> fore-knowledge of what the next requested size is going to be. That is
> why it is better to just have the drivers manage this since they know
> what size they typically request and when they are going to close
> pages.
> 
> Like I said, if you are wanting to go down this path you are better
> off starting with page pool and making all regular page pool pages
> into mono-frags. Then from there we can start building things out.

'mono-frag' means a page with pp_frag_count being one? If yes, then I
feel like we have the same goal here, but may have different opinions
on how to implement it.

> 
> With that you could then let drivers like the Mellanox one handle its
> own fragmenting knowing it has to return things to a mono-frag in
> order for it to be working correctly.

I still don't really see how it would be better for mlx5 to
handle its own fragmenting yet.

+cc Dragos & Saeed to share some info here, so that we can see
if the page pool can learn from it.

> 
> .
>
Alexander Duyck June 6, 2023, 3:33 p.m. UTC | #7
On Tue, Jun 6, 2023 at 5:41 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2023/6/5 22:58, Alexander Duyck wrote:
>
> ...
>
> >>
> >> I am not sure I understand what you mean by 'non-fragmented',
> >> 'mono-frags', 'page pool freeing paths' and 'non-fragmented case'
> >> here. Maybe describe it in more detail with something like
> >> pseudocode?
> >
> > What you are attempting to generate are "mono-frags" where a page pool
> > page has a frag count of 1. I refer to "non-fragmented pages" as the
> > legacy page pool page without pp_frags set.
> >
> > The "page-pool freeing paths" are the ones outside of the fragmented
> > bits here. Basically __page_pool_put_page and the like. What you
> > should be doing is pushing the reference counting code down deeper
> > into the page pool logic. Currently it is more of a surface setup.
> >
> > The whole point I am getting at with this is that we should see the
> > number of layers reduced for the fragmented pages, and by converting
> > the non-fragmented pages to mono-frags we should see that maintain its
> > current performance and total number of layers instead of having more
> > layers added to it.
>
> Do you mean reducing the number of layers for the fragmented pages by
> moving the page->pp_frag_count handling from page_pool_defrag_page()
> to __page_pool_put_page() where page->_refcount is checked?

I was thinking you could move pp_frag_count down into
__page_pool_put_page(). Basically it is doing the static check against
1, and if it isn't 1 we will have to subtract 1 from it.

> Or merge page->pp_frag_count into page->_refcount so that we don't
> need page->pp_frag_count anymore?
>
> As I understand it, when a page from the page pool is passed to the
> stack to be processed, the stack may hold onto that page by taking
> page->_refcount too, which means the page pool has no control over who
> will hold onto it and when that reference will be released. That is why
> the page pool does the "page_ref_count(page) == 1" checking in
> __page_pool_put_page(); if it is not true, the page pool can't recycle
> the page. So pp_frag_count and _refcount have different meanings and
> serve different purposes; merging them doesn't work, and moving them to
> one place doesn't make much sense either?
>
> Or is there other obvious consideration that I missed?

You have the right understanding. Basically we cannot recycle it until
the frag count is one, or reaches 0 after subtraction, and the refcount
is 1. If the frag count hits 0 but the refcount is != 1 we have to free
the page instead of recycling it.
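
The rule above can be modeled in userspace C (stdatomic stand-ins for
the kernel atomics; `can_recycle` and `struct mock_page` are made-up
names for illustration, not page pool API):

```c
#include <stdatomic.h>
#include <stdbool.h>

struct mock_page {
	atomic_long pp_frag_count; /* stand-in for page->pp_frag_count */
	atomic_long refcount;      /* stand-in for page->_refcount */
};

/*
 * Model of the recycle rule: the page is only recyclable once the
 * caller is the last frag user (count == nr, or the subtraction
 * reaches 0) AND the stack holds no extra page references.
 */
static bool can_recycle(struct mock_page *page, long nr)
{
	if (atomic_load(&page->pp_frag_count) != nr &&
	    atomic_fetch_sub(&page->pp_frag_count, nr) - nr != 0)
		return false; /* other frag users remain */

	return atomic_load(&page->refcount) == 1;
}
```

If the frag count drains to 0 but the refcount check fails, the real
code has to free the page rather than recycle it.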

> >>>
> >> I am not sure what you meant above.
> >> But I will describe what is this patch trying to do again:
> >> PP_FLAG_PAGE_FRAG is a per-page-pool flag, not a per-page one, so
> >> when it is set page_pool_alloc_pages() is not allowed to be called,
> >> as page->pp_frag_count is not set up correctly for that case.
> >>
> >> So in order to allow calling page_pool_alloc_pages(), as best as I
> >> can think of, either we need a per page flag/bit to decide whether
> >> to do something like dec_and_test for page->pp_frag_count in
> >> page_pool_is_last_frag(), or we unify the page->pp_frag_count handling
> >> in page_pool_is_last_frag() so that we don't need a per page flag/bit.
> >>
> >> This patch utilizes the optimization you mentioned above to unify the
> >> page->pp_frag_count handling.
> >
> > Basically what should be happening if all page-pool pages are to be
> > considered "fragmented" is that we should be folding this into the
> > freeing logic. What we have now is a 2-stage setup where we are dropping
> > the count to 0, then rebounding it and setting it back to 1. If we are
> > going to have all page pool pages fragmented then the freeing path for
> > page pool pages should just be handling frag count directly instead of
> > hacking on it here and ignoring it in the actual freeing paths.
>
> Do you mean doing something like below? Isn't it dirtying the cache line
> of 'struct page' whenever a page is recycled, which means we may not be
> able to maintain the current performance for the non-fragmented or
> mono-frag case?
>
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -583,6 +583,10 @@ static __always_inline struct page *
>  __page_pool_put_page(struct page_pool *pool, struct page *page,
>                      unsigned int dma_sync_size, bool allow_direct)
>  {
> +
> +       if (!page_pool_defrag_page(page, 1))
> +               return NULL;
> +

Yes, that is pretty much it. This would be your standard case page
pool put path. Basically it allows us to start getting rid of a bunch
of noise in the fragmented path.

>         /* This allocator is optimized for the XDP mode that uses
>          * one-frame-per-page, but have fallbacks that act like the
>          * regular page allocator APIs.
> @@ -594,6 +598,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
>          */
>         if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
>                 /* Read barrier done in page_ref_count / READ_ONCE */
> +               page_pool_fragment_page(page, 1);

I wouldn't bother resetting this to 1 until after you have recycled it
and pulled it back out again as an allocation. Basically when the
pages are sitting in the pool the frag_count should be 0. That way it
makes it easier to track and is similar to how the memory allocator
actually deals with the page reference count. Basically if the page is
sitting in the pool the frag_count is 0, once it comes out it should
be 1 or more indicating it is in use.

>
>                 if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
>                         page_pool_dma_sync_for_device(pool, page,
>
>
>
> >
> >>>
> >>>>>>    ret = atomic_long_sub_return(nr, &page->pp_frag_count);
> >>>>>>    WARN_ON(ret < 0);
> >>>>>> +
> >>>>>> +  /* Reset frag count back to 1, this should be the rare case when
> >>>>>> +   * two users call page_pool_defrag_page() currently.
> >>>>>> +   */
> >>>>>> +  if (!ret)
> >>>>>> +          atomic_long_set(&page->pp_frag_count, 1);
> >>>>>> +
> >>>>>>    return ret;
> >>>>>>  }
> >>>>>>
> >>
> >> ...
> >>
> >>>> As above, it is about unifying handling for frag and non-frag page in
> >>>> page_pool_is_last_frag(). please let me know if there is any better way
> >>>> to do it without adding statements here.
> >>>
> >>> I get what you are trying to get at but I feel like the implementation
> >>> is going to cause more problems than it helps. The problem is it is
> >>> going to hurt base page pool performance and it just makes the
> >>> fragmented pages that much more confusing to deal with.
> >>
> >> For base page pool performance, as I mentioned before:
> >> it removes the PP_FLAG_PAGE_FRAG checking and only adds the cost of
> >> page_pool_fragment_page() in page_pool_set_pp_info(), which I
> >> think is negligible as we are already dirtying the same cache
> >> line in page_pool_set_pp_info().
> >
> > I have no problem with getting rid of the flag.
> >
> >> As for the confusion, sometimes it is about personal taste, so I am
> >> not going to argue with it:) But it would be good to provide a
> >> non-confusing way to do that with minimal overhead. I feel like
> >> you provided it at the beginning, but I am not able to understand
> >> it yet.
> >
> > The problem here is that instead of treating all page pool pages as
> > fragmented, what the patch set has done is added a shim layer so that
> > you are layering fragmentation on top of page pool pages which was
> > already the case.
> >
> > That is why I have suggested make page pool pages a "mono-frag" as
> > your first patch. Basically it is going to force you to have to set
> > the pp_frag value for these pages, and verify it is 1 when you are
> > freeing it.
>
It seems it is basically what this patch does, with minimal
overhead for the previous users.
>
> Let me try again with what this patch mainly do:
>
> Currently when page_pool_create() is called with
> PP_FLAG_PAGE_FRAG flag, page_pool_alloc_pages() is only
> allowed to be called under the below constraints:
> 1. page_pool_fragment_page() need to be called to setup
>    page->pp_frag_count immediately.
> 2. page_pool_defrag_page() often need to be called to
>    drain the page->pp_frag_count when there is no more
>    user will be holding on to that page.

Right. Basically you will need to assign a much larger value in
page_pool_fragment_page(), assuming you have a frag count of 1 after
allocation.

On free you should be able to do an atomic sub and verify it is
non-zero when you have to defrag at the end of fragmenting a page.

> Those constraints exist in order to support splitting a page
> into multiple frags.

Right. However it isn't much different from how we were dealing with
the page reference count in drivers such as ixgbe. You can take a look
at ixgbe_alloc_mapped_page() for an example of that.

> And those constraints have some overhead because of the
> cache line dirtying/bouncing and atomic update.

We already have dirtied it on allocation and we were already dirtying
it on freeing as well.

> Those constraints are unavoidable for the case when we need
> a page to be split into more than one frag, but there
> is also the case where we want to avoid the above constraints
> and their overhead when a page can't be split, as it
> can only hold one big frag as requested by the user, depending
> on different use cases:
> use case 1: allocate page without page splitting.
> use case 2: allocate page with page splitting.
> use case 3: allocate page with or without page splitting
>             depending on the frag size.
>
> Currently page pool only provide page_pool_alloc_pages()
> and page_pool_alloc_frag() API to enable the above 1 & 2
> separately, so we can not use a combination of 1 & 2 to
> enable 3, it is not possible yet because of the per
> page_pool flag PP_FLAG_PAGE_FRAG.

I get that we need to get rid of the flag. The general idea here is
one step at a time. If we want to get rid of the flag then we have to
make the page pool set the frag_count in all cases where it can. In
the cases where we run into the DMA issue the functions that
frag/defrag pages have to succeed w/ only support for 1 frag. Most
likely we will need to wrap the inline helpers for that.

What it means is that the "DMA_USES" case will make the frag allocator
synonymous with the non-fragmented allocator so both will be providing
full pages.

> So in order to allow allocating an unsplit page without
> the overhead of a split page, while still allowing split-page
> allocation, we need to remove the per-page_pool flag
> in page_pool_is_last_frag(). As best as I can think of,
> there are two methods, as below:
> 1. Add a per-page flag/bit to indicate whether a page is split
>    or not, which means we might need to update that flag/bit
>    every time the page is recycled, dirtying the cache line
>    of 'struct page' for use case 1.
> 2. Unify the page->pp_frag_count handling for both split
>    and unsplit pages by assuming all pages in the page
>    pool are split into one big frag initially.

I am in support of 2. It is the simplest approach here. Basically if
we cannot support it due to the DMA variable definition then we should
have auto-magically succeeding versions of the defrag and fragment
functions that only support allowing 1 fragment per page.

> Because we want to support the above use case 3 with minimal
> overhead, especially not adding any noticeable overhead for
> use case 1, and we are already doing an optimization by not
> updating pp_frag_count in page_pool_defrag_page() for the
> last frag user, this patch chooses to unify the pp_frag_count
> handling to support the above use case 3.
>
> Let me know if it is making any sense here.

Yes, that is pretty much it.
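
The three use cases above can be sketched as one allocation helper (a
hypothetical userspace model of the direction this series takes;
`pool_alloc`, `struct mock_pool` and the half-page threshold are
illustrative assumptions, not the actual page pool API):

```c
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SZ 4096

struct mock_pool {
	size_t frag_offset; /* next free offset in the current frag page */
	long frag_users;    /* frags handed out from the current page */
};

/*
 * Model of "use case 3": small requests are served by splitting a
 * page (use case 2), while a request larger than half a page gets a
 * whole page of its own (use case 1), with no per-pool flag needed.
 */
static bool pool_alloc(struct mock_pool *pool, size_t size, size_t *offset)
{
	if (size > PAGE_SZ / 2) { /* can't split: whole-page allocation */
		*offset = 0;
		return false;     /* false: not a frag of a shared page */
	}

	if (pool->frag_offset + size > PAGE_SZ) { /* open a new frag page */
		pool->frag_offset = 0;
		pool->frag_users = 0;
	}

	*offset = pool->frag_offset;
	pool->frag_offset += size;
	pool->frag_users++;
	return true;              /* true: a frag sharing a page */
}
```

The point of unifying the pp_frag_count handling is that both return
paths of such a helper can then share one freeing path.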

> >
> > Then you are going to have to modify the fragmented cases to make use
> > of lower level calls because now instead of us defragging a fragmented
> > page, and then freeing it the two operations essentially have to be
> > combined into one operation.
>
> Does 'defragging a fragmented page' mean decrementing pp_frag_count?
> Does "freeing it" mean calling put_page()? What does 'combined' really
> mean here?

The change is that the code would do the subtraction and if it hit 0
it was freeing the page. That is the one piece that gets more
complicated because we really should be hitting 1. So we may be adding
a few more operations to that case.

> >
> >>>
> >>> My advice as a first step would be to look at first solving how to
> >>> enable the PP_FLAG_PAGE_FRAG mode when you have
> >>> PAGE_POOL_DMA_USE_PP_FRAG_COUNT as true. That should be creating mono-
> >>> frags as we are calling them, and we should have a way to get the
> >>> truesize for those so we know when we are consuming significant amount
> >>> of memory.
> >>
> >> Does the way to get the truesize in the below RFC make sense to you?
> >> https://patchwork.kernel.org/project/netdevbpf/patch/20230516124801.2465-4-linyunsheng@huawei.com/
> >
> > It doesn't add any value. All you are doing is passing the "size"
> > value as "truesize". The whole point of the "truesize" would be to
> > report the actual size. So a step in that direction would be to bump
> > truesize to include the remainder that isn't used when you decide it
> > is time to allocate a new page. The problem is that it requires some
> > fore-knowledge of what the next requested size is going to be. That is
> > why it is better to just have the drivers manage this since they know
> > what size they typically request and when they are going to close
> > pages.
> >
> > Like I said, if you are wanting to go down this path you are better
> > off starting with page pool and making all regular page pool pages
> > into mono-frags. Then from there we can start building things out.
>
> 'mono-frag' means page with pp_frag_count being one. If yes, then I
> feel like we have the same goal here, but may have different opinion
> on how to implement it.

Yeah, I think it is mostly implementation differences. I thought back
when we did this I had advocated for just frag counting all the pages
right from the start.

> >
> > With that you could then let drivers like the Mellanox one handle its
> > own fragmenting knowing it has to return things to a mono-frag in
> > order for it to be working correctly.
>
> I still don't see how it would be better for mlx5 to handle its
> own fragmenting yet.
>
> +cc Dragos & Saeed to share some info here, so that we can see
> if page pool can learn from it.

It has more to do with the fact that the driver knows what it is going
to do beforehand. In many cases it can look at the page and know that
it isn't going to reuse it again so it can just report the truesize
being the length from the current pointer to the end of the page.

You can think of it as the performance advantage of a purpose built
ASIC versus a general purpose CPU. The fact is we are able to cut out
much of the unnecessary overhead if we know exactly how we are going
to use the memory in the driver versus having to guess at it in the
page pool API.
Yunsheng Lin June 7, 2023, 12:46 p.m. UTC | #8
On 2023/6/6 23:33, Alexander Duyck wrote:
>> Do you mean doing something like below? Isn't it dirtying the cache line
>> of 'struct page' whenever a page is recycled, which means we may not be
>> able to maintain the current performance for the non-fragmented or
>> mono-frag case?
>>
>> --- a/net/core/page_pool.c
>> +++ b/net/core/page_pool.c
>> @@ -583,6 +583,10 @@ static __always_inline struct page *
>>  __page_pool_put_page(struct page_pool *pool, struct page *page,
>>                      unsigned int dma_sync_size, bool allow_direct)
>>  {
>> +
>> +       if (!page_pool_defrag_page(page, 1))
>> +               return NULL;
>> +
> 
> Yes, that is pretty much it. This would be your standard case page
> pool put path. Basically it allows us to start getting rid of a bunch
> of noise in the fragmented path.
> 
>>         /* This allocator is optimized for the XDP mode that uses
>>          * one-frame-per-page, but have fallbacks that act like the
>>          * regular page allocator APIs.
>> @@ -594,6 +598,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
>>          */
>>         if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
>>                 /* Read barrier done in page_ref_count / READ_ONCE */
>> +               page_pool_fragment_page(page, 1);
> 
> I wouldn't bother resetting this to 1 until after you have recycled it
> and pulled it back out again as an allocation. Basically when the
> pages are sitting in the pool the frag_count should be 0. That way it
> makes it easier to track and is similar to how the memory allocator
> actually deals with the page reference count. Basically if the page is
> sitting in the pool the frag_count is 0, once it comes out it should
> be 1 or more indicating it is in use.

Let's be more specific about what we want to do here:

For a specific page without splitting, the journey that it will go
through is as below before this patch:
1. It is allocated from the page allocator.
2. It is initialized in page_pool_set_pp_info().
3. It is passed to the driver by the page pool.
4. It is passed to the stack by the driver.
5. It is passed back to the page pool by the stack, depending on the
   page_ref_count() checking:
   5.1 page_ref_count() being one, the page is now owned by the page
       pool, and may be passed to the driver by going to step 3.
   5.2 page_ref_count() not being one, the page is released by the page
       pool, cleaning up resources like the dma mapping and put_page().

So a page may go through steps 3 ~ 5.1 many times without dirtying
the cache line of 'struct page', as I understand it.


If I follow your suggestion here, it seems a specific page without
splitting may go through:
1. It is allocated from the page allocator.
2. It is initialized in page_pool_set_pp_info().
3. Its pp_frag_count is set to one.
4. It is passed to the driver by the page pool.
5. It is passed to the stack by the driver.
6. It is passed back to the page pool by the stack, depending on the
   pp_frag_count and page_ref_count() checking:
   6.1 pp_frag_count and page_ref_count() being one, the page is now
       owned by the page pool, and may be passed to the driver by
       going to step 3.
   6.2 otherwise the page is released by the page pool, cleaning up
       resources like the dma mapping and put_page().

Aren't we dirtying the cache line of 'struct page' every time the page
is recycled? Or did I miss something obvious here?

For my implementation, a specific page without splitting may
go through:
1. It is allocated from the page allocator.
2. It is initialized in page_pool_set_pp_info(), with its pp_frag_count
   set to one.
3. It is passed to the driver by the page pool.
4. It is passed to the stack by the driver.
5. It is passed back to the page pool by the stack, depending on the
   pp_frag_count and page_ref_count() checking:
   5.1 pp_frag_count and page_ref_count() being one, the page is now
       owned by the page pool, and thanks to the optimization of not
       updating page->pp_frag_count in page_pool_defrag_page() for the
       last frag user, it can be passed to the driver by going to step
       3 without resetting the pp_frag_count to 1, which would dirty
       the cache line of 'struct page'.
   5.2 otherwise the page is released by the page pool, cleaning up
       resources like the dma mapping and put_page().

Does it make any sense, or does it not really matter that we are
dirtying the cache line of 'struct page' whenever an unsplit page is
recycled?
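
The difference between the two journeys can be illustrated with a small
model (illustrative userspace code, not page pool internals; the
`recycle_*` names and store counter are made up for this sketch):

```c
/*
 * Model comparing stores to 'struct page' per recycle for an unsplit
 * page: "reset each time" writes pp_frag_count on every recycle,
 * while the last-frag optimization above leaves it at 1, so the
 * fast path never dirties the cache line.
 */
struct model { long pp_frag_count; long stores; };

static void recycle_reset_each_time(struct model *m)
{
	m->pp_frag_count = 1; /* unconditional store dirties the line */
	m->stores++;
}

static void recycle_last_frag_opt(struct model *m)
{
	if (m->pp_frag_count != 1) { /* only split pages need the store */
		m->pp_frag_count = 1;
		m->stores++;
	}
}
```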


>>
>> Does 'defragging a fragmented page' mean decrementing pp_frag_count?
>> Does "freeing it" mean calling put_page()? What does 'combined' really
>> mean here?
> 
> The change is that the code would do the subtraction and if it hit 0
> it was freeing the page. That is the one piece that gets more
> complicated because we really should be hitting 1. So we may be adding
> a few more operations to that case.
> 
>>>

I am not sure I understand it. Does 'gets more complicated' mean doing
some optimization like page_pool_defrag_page() does?
https://elixir.bootlin.com/linux/v6.4-rc5/source/include/net/page_pool.h#L314

> 
>>>
>>> With that you could then let drivers like the Mellanox one handle its
>>> own fragmenting knowing it has to return things to a mono-frag in
>>> order for it to be working correctly.
>>
>> I still don't see how it would be better for mlx5 to handle its
>> own fragmenting yet.
>>
>> +cc Dragos & Saeed to share some info here, so that we can see
>> if page pool can learn from it.
> 
> It has more to do with the fact that the driver knows what it is going
> to do beforehand. In many cases it can look at the page and know that
> it isn't going to reuse it again so it can just report the truesize
> being the length from the current pointer to the end of the page.
> 
> You can think of it as the performance advantage of a purpose built
> ASIC versus a general purpose CPU. The fact is we are able to cut out
> much of the unnecessary overhead if we know exactly how we are going
> to use the memory in the driver versus having to guess at it in the
> page pool API.

In general, I would agree with that.
But for the specific case of mlx5, I am not sure about that; that's
why I am curious about the exact reason for it doing the
complicated frag_count handling in the driver instead of improving
the page pool to support its use case. If it is about the last-frag
truesize problem here, we can do something like virtio_net does in the
page pool too.

Alexander Duyck June 7, 2023, 3:05 p.m. UTC | #9
On Wed, Jun 7, 2023 at 5:46 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2023/6/6 23:33, Alexander Duyck wrote:
> >> Do you mean doing something like below? Isn't it dirtying the cache line
> >> of 'struct page' whenever a page is recycled, which means we may not be
> >> able to maintain the current performance for the non-fragmented or
> >> mono-frag case?
> >>
> >> --- a/net/core/page_pool.c
> >> +++ b/net/core/page_pool.c
> >> @@ -583,6 +583,10 @@ static __always_inline struct page *
> >>  __page_pool_put_page(struct page_pool *pool, struct page *page,
> >>                      unsigned int dma_sync_size, bool allow_direct)
> >>  {
> >> +
> >> +       if (!page_pool_defrag_page(page, 1))
> >> +               return NULL;
> >> +
> >
> > Yes, that is pretty much it. This would be your standard case page
> > pool put path. Basically it allows us to start getting rid of a bunch
> > of noise in the fragmented path.
> >
> >>         /* This allocator is optimized for the XDP mode that uses
> >>          * one-frame-per-page, but have fallbacks that act like the
> >>          * regular page allocator APIs.
> >> @@ -594,6 +598,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
> >>          */
> >>         if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
> >>                 /* Read barrier done in page_ref_count / READ_ONCE */
> >> +               page_pool_fragment_page(page, 1);
> >
> > I wouldn't bother resetting this to 1 until after you have recycled it
> > and pulled it back out again as an allocation. Basically when the
> > pages are sitting in the pool the frag_count should be 0. That way it
> > makes it easier to track and is similar to how the memory allocator
> > actually deals with the page reference count. Basically if the page is
> > sitting in the pool the frag_count is 0, once it comes out it should
> > be 1 or more indicating it is in use.
>
> Let's be more specific about what we want to do here:
>
> For a specific page without splitting, the journey that it will go
> through is as below before this patch:
> 1. It is allocated from the page allocator.
> 2. It is initialized in page_pool_set_pp_info().
> 3. It is passed to driver by page pool.
> 4. It is passed to stack by the driver.
> 5. It is passed back to the page pool by the stack, depending on the
>    page_ref_count() checking:
>    5.1 page_ref_count() being one, the page is now owned by the page
>        pool, and may be passed to the driver by going to step 3.
>    5.2 page_ref_count() not being one, the page is released by the page
>        pool, cleaning up resources like the dma mapping and put_page().
>
> So a page may go through steps 3 ~ 5.1 many times without dirtying
> the cache line of 'struct page', as I understand it.
>
>
> If I follow your suggestion here, It seems for a specific page without
> splitting, it may go through:
> 1. It is allocated from the page allocator.
> 2. It is initialized in page_pool_set_pp_info().
> 3. Its pp_frag_count is set to one.
> 4. It is passed to driver by page pool.
> 5. It is passed to stack by the driver.
> 6. It is passed back to the page pool by the stack, depending on the
>    pp_frag_count and page_ref_count() checking:
>    6.1 pp_frag_count and page_ref_count() being one, the page is now
>        owned by the page pool, and may be passed to the driver by
>        going to step 3.
>    6.2 otherwise the page is released by the page pool, cleaning up
>        resources like the dma mapping and put_page().
>
> Aren't we dirtying the cache line of 'struct page' every time the page
> is recycled? Or did I miss something obvious here?

What you are stating makes sense. So we would probably want to keep
the pp_frag_count at 1 when it is sitting in the pool.

> For my implementation, for a specific page without splitting, it may
> go through:
> 1. It is allocated from the page allocator.
> 2. It is initialized in page_pool_set_pp_info(), with its pp_frag_count
>    set to one.
> 3. It is passed to driver by page pool.
> 4. It is passed to stack by the driver.
> 5. It is passed back to the page pool by the stack, depending on the
>    page_ref_count() checking:
>    5.1 pp_frag_count and page_ref_count() being one, the page is now
>        owned by the page pool, and thanks to the optimization of not
>        updating page->pp_frag_count in page_pool_defrag_page() for the
>        last frag user, it can be passed to the driver by going to step
>        3 without resetting the pp_frag_count to 1, which would dirty
>        the cache line of 'struct page'.
>    5.2 otherwise the page is released by the page pool, cleaning up
>        resources like the dma mapping and put_page().
>
> Does it make any sense, or does it not really matter that we are
> dirtying the cache line of 'struct page' whenever an unsplit page is
> recycled?

No, that makes sense. No point in dirtying a cache line if we don't have to.

> >>
> >> Does 'defragging a fragmented page' mean decrementing pp_frag_count?
> >> Does "freeing it" mean calling put_page()? What does 'combined' really
> >> mean here?
> >
> > The change is that the code would do the subtraction and if it hit 0
> > it was freeing the page. That is the one piece that gets more
> > complicated because we really should be hitting 1. So we may be adding
> > a few more operations to that case.
> >
> >>>
>
> I am not sure I understand it. Does 'gets more complicated' mean doing
> some optimization like page_pool_defrag_page() does?
> https://elixir.bootlin.com/linux/v6.4-rc5/source/include/net/page_pool.h#L314

Our standard case is to decrement by 1. We will need the code
that is doing your step 5.1 to handle a case where we are removing
multiple frag references. That is what I was getting at with the "more
complicated" comment. Basically if we push it to 0 then we either have
to free or recycle the page by resetting the fragments.
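
A userspace sketch of that subtraction-and-reset logic (stdatomic
stand-ins for the kernel atomics; `defrag_page` here is a simplified
model of page_pool_defrag_page(), not the kernel function itself):

```c
#include <stdatomic.h>

/*
 * Simplified model of page_pool_defrag_page() with the reset
 * behaviour discussed above: whenever the count is drained, it is
 * put back to 1 so the page can be recycled as a mono-frag page.
 */
static long defrag_page(atomic_long *pp_frag_count, long nr)
{
	long ret;

	if (atomic_load(pp_frag_count) == nr) {
		/* Last user: reset to 1 unless it already is 1. */
		if (nr != 1)
			atomic_store(pp_frag_count, 1);
		return 0;
	}

	/* atomic_fetch_sub() returns the old value, so subtract nr to
	 * get the new count (the kernel's atomic_long_sub_return()
	 * returns the new value directly).
	 */
	ret = atomic_fetch_sub(pp_frag_count, nr) - nr;

	/* Rare case: a concurrent caller drained the count under us;
	 * reset it to 1 for the next allocation.
	 */
	if (!ret)
		atomic_store(pp_frag_count, 1);

	return ret;
}
```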

> >
> >>>
> >>> With that you could then let drivers like the Mellanox one handle its
> >>> own fragmenting knowing it has to return things to a mono-frag in
> >>> order for it to be working correctly.
> >>
> >> I still don't see how it would be better for mlx5 to handle its
> >> own fragmenting yet.
> >>
> >> +cc Dragos & Saeed to share some info here, so that we can see
> >> if page pool can learn from it.
> >
> > It has more to do with the fact that the driver knows what it is going
> > to do beforehand. In many cases it can look at the page and know that
> > it isn't going to reuse it again so it can just report the truesize
> > being the length from the current pointer to the end of the page.
> >
> > You can think of it as the performance advantage of a purpose built
> > ASIC versus a general purpose CPU. The fact is we are able to cut out
> > much of the unnecessary overhead if we know exactly how we are going
> > to use the memory in the driver versus having to guess at it in the
> > page pool API.
>
> In general, I would agree with that.
> But for the specific case of mlx5, I am not sure about that; that's
> why I am curious about the exact reason for it doing the
> complicated frag_count handling in the driver instead of improving
> the page pool to support its use case. If it is about the last-frag
> truesize problem here, we can do something like virtio_net does in the
> page pool too.

I suspect it has to do with their hardware doing the fragmentation of
the page. As I recall some of the Mellanox parts support Rx packing so
it is likely that their hardware is packing multiple frames into a
single page and so they are marking it pre-fragmented, and then when
the hardware completes the DMA they go through and record the offsets
for the individual fragments and pass them up the stack.

Patch

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index c8ec2f34722b..ea7a0c0592a5 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -50,6 +50,9 @@ 
 				 PP_FLAG_DMA_SYNC_DEV |\
 				 PP_FLAG_PAGE_FRAG)
 
+#define PAGE_POOL_DMA_USE_PP_FRAG_COUNT \
+		(sizeof(dma_addr_t) > sizeof(unsigned long))
+
 /*
  * Fast allocation side cache array/stack
  *
@@ -295,13 +298,20 @@  void page_pool_put_defragged_page(struct page_pool *pool, struct page *page,
  */
 static inline void page_pool_fragment_page(struct page *page, long nr)
 {
-	atomic_long_set(&page->pp_frag_count, nr);
+	if (!PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
+		atomic_long_set(&page->pp_frag_count, nr);
 }
 
+/* We need to reset frag_count back to 1 for the last user to allow
+ * only one user in case the page is recycled and allocated as non-frag
+ * page.
+ */
 static inline long page_pool_defrag_page(struct page *page, long nr)
 {
 	long ret;
 
+	BUILD_BUG_ON(__builtin_constant_p(nr) && nr != 1);
+
 	/* If nr == pp_frag_count then we have cleared all remaining
 	 * references to the page. No need to actually overwrite it, instead
 	 * we can leave this to be overwritten by the calling function.
@@ -311,19 +321,36 @@  static inline long page_pool_defrag_page(struct page *page, long nr)
 	 * especially when dealing with a page that may be partitioned
 	 * into only 2 or 3 pieces.
 	 */
-	if (atomic_long_read(&page->pp_frag_count) == nr)
+	if (atomic_long_read(&page->pp_frag_count) == nr) {
+		/* As we have ensured nr is always one for constant case
+		 * using the BUILD_BUG_ON() as above, only need to handle
+		 * the non-constant case here for frag count draining.
+		 */
+		if (!__builtin_constant_p(nr))
+			atomic_long_set(&page->pp_frag_count, 1);
+
 		return 0;
+	}
 
 	ret = atomic_long_sub_return(nr, &page->pp_frag_count);
 	WARN_ON(ret < 0);
+
+	/* Reset frag count back to 1, this should be the rare case when
+	 * two users call page_pool_defrag_page() currently.
+	 */
+	if (!ret)
+		atomic_long_set(&page->pp_frag_count, 1);
+
 	return ret;
 }
 
 static inline bool page_pool_is_last_frag(struct page_pool *pool,
 					  struct page *page)
 {
-	/* If fragments aren't enabled or count is 0 we were the last user */
-	return !(pool->p.flags & PP_FLAG_PAGE_FRAG) ||
+	/* When dma_addr_upper is overlapped with pp_frag_count
+	 * or we were the last page frag user.
+	 */
+	return PAGE_POOL_DMA_USE_PP_FRAG_COUNT ||
 	       (page_pool_defrag_page(page, 1) == 0);
 }
 
@@ -357,9 +384,6 @@  static inline void page_pool_recycle_direct(struct page_pool *pool,
 	page_pool_put_full_page(pool, page, true);
 }
 
-#define PAGE_POOL_DMA_USE_PP_FRAG_COUNT	\
-		(sizeof(dma_addr_t) > sizeof(unsigned long))
-
 static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
 {
 	dma_addr_t ret = page->dma_addr;
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index e212e9d7edcb..0868aa8f6323 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -334,6 +334,7 @@  static void page_pool_set_pp_info(struct page_pool *pool,
 {
 	page->pp = pool;
 	page->pp_magic |= PP_SIGNATURE;
+	page_pool_fragment_page(page, 1);
 	if (pool->p.init_callback)
 		pool->p.init_callback(page, pool->p.init_arg);
 }