diff mbox series

[RFC,v2,06/11] page-pool: add device memory support

Message ID 20230810015751.3297321-7-almasrymina@google.com (mailing list archive)
State New, archived
Headers show
Series Device Memory TCP | expand

Commit Message

Mina Almasry Aug. 10, 2023, 1:57 a.m. UTC
Overload the LSB of struct page* to indicate that it's a page_pool_iov.

Refactor mm calls on struct page * into helpers, and add page_pool_iov
handling on those helpers. Modify callers of these mm APIs with calls to
these helpers instead.

In areas where struct page* is dereferenced, add a check for special
handling of page_pool_iov.

The memory providers producing page_pool_iov can set the LSB on the
struct page* returned to the page pool.

Note that instead of overloading the LSB of page pointers, we can
instead define a new union between struct page & struct page_pool_iov and
compact it in a new type. However, we'd need to implement the code churn
to modify the page_pool & drivers to use this new type. For this POC
that is not implemented (feedback welcome).

I have a sample implementation of adding a new page_pool_token type
in the page_pool to give a general idea here:
https://github.com/torvalds/linux/commit/3a7628700eb7fd02a117db036003bca50779608d

Full branch here:
https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-pp-tokens

(In the branches above, page_pool_iov is called devmem_slice).

Could also add static_branch to speed up the checks in page_pool_iov
memory providers are being used.

Signed-off-by: Mina Almasry <almasrymina@google.com>
---
 include/net/page_pool.h | 74 ++++++++++++++++++++++++++++++++++-
 net/core/page_pool.c    | 85 ++++++++++++++++++++++++++++-------------
 2 files changed, 131 insertions(+), 28 deletions(-)

--
2.41.0.640.ga95def55d0-goog

Comments

Jesper Dangaard Brouer Aug. 19, 2023, 9:51 a.m. UTC | #1
On 10/08/2023 03.57, Mina Almasry wrote:
> Overload the LSB of struct page* to indicate that it's a page_pool_iov.
> 
> Refactor mm calls on struct page * into helpers, and add page_pool_iov
> handling on those helpers. Modify callers of these mm APIs with calls to
> these helpers instead.
> 

I don't like of this approach.
This is adding code to the PP (page_pool) fast-path in multiple places.

I've not had time to run my usual benchmarks, which are here:
 
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c

But I'm sure it will affect performance.

Regardless of performance, this approach is using ptr-LSB-bits, to hide
that page-pointer are not really struct-pages, feels like force feeding
a solution just to use the page_pool APIs.


> In areas where struct page* is dereferenced, add a check for special
> handling of page_pool_iov.
> 
> The memory providers producing page_pool_iov can set the LSB on the
> struct page* returned to the page pool.
> 
> Note that instead of overloading the LSB of page pointers, we can
> instead define a new union between struct page & struct page_pool_iov and
> compact it in a new type. However, we'd need to implement the code churn
> to modify the page_pool & drivers to use this new type. For this POC
> that is not implemented (feedback welcome).
> 

I've said before, that I prefer multiplexing on page->pp_magic.
For your page_pool_iov the layout would have to match the offset of
pp_magic, to do this. (And if insisting on using PP infra the refcnt
would also need to align).

On the allocation side, all drivers already use a driver helper
page_pool_dev_alloc_pages() or we could add another (better named)
helper to multiplex between other types of allocators, e.g. a devmem
allocator.

On free/return/recycle the functions napi_pp_put_page or skb_pp_recycle
could multiplex on pp_magic and call another API.  The API could be an
extension to PP helpers, but it could also be a devmap allocator helper.

IMHO forcing/piggy-bagging everything into page_pool is not the right 
solution.  I really think netstack need to support different allocator 
types. The page pool have been leading the way, yes, but perhaps it is 
time to add an API layer that e.g. could be named netmem, that gives us 
the multiplexing between allocators.  In that process some of page_pool 
APIs would be lifted out as common blocks and others remain.

--Jesper

> I have a sample implementation of adding a new page_pool_token type
> in the page_pool to give a general idea here:
> https://github.com/torvalds/linux/commit/3a7628700eb7fd02a117db036003bca50779608d
> 
> Full branch here:
> https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-pp-tokens
> 
> (In the branches above, page_pool_iov is called devmem_slice).
> 
> Could also add static_branch to speed up the checks in page_pool_iov
> memory providers are being used.
> 
> Signed-off-by: Mina Almasry <almasrymina@google.com>
> ---
>   include/net/page_pool.h | 74 ++++++++++++++++++++++++++++++++++-
>   net/core/page_pool.c    | 85 ++++++++++++++++++++++++++++-------------
>   2 files changed, 131 insertions(+), 28 deletions(-)
> 
> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> index 537eb36115ed..f08ca230d68e 100644
> --- a/include/net/page_pool.h
> +++ b/include/net/page_pool.h
> @@ -282,6 +282,64 @@ static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page)
>   	return NULL;
>   }
> 
> +static inline int page_pool_page_ref_count(struct page *page)
> +{
> +	if (page_is_page_pool_iov(page))
> +		return page_pool_iov_refcount(page_to_page_pool_iov(page));
> +
> +	return page_ref_count(page);
> +}
> +
> +static inline void page_pool_page_get_many(struct page *page,
> +					   unsigned int count)
> +{
> +	if (page_is_page_pool_iov(page))
> +		return page_pool_iov_get_many(page_to_page_pool_iov(page),
> +					      count);
> +
> +	return page_ref_add(page, count);
> +}
> +
> +static inline void page_pool_page_put_many(struct page *page,
> +					   unsigned int count)
> +{
> +	if (page_is_page_pool_iov(page))
> +		return page_pool_iov_put_many(page_to_page_pool_iov(page),
> +					      count);
> +
> +	if (count > 1)
> +		page_ref_sub(page, count - 1);
> +
> +	put_page(page);
> +}
> +
> +static inline bool page_pool_page_is_pfmemalloc(struct page *page)
> +{
> +	if (page_is_page_pool_iov(page))
> +		return false;
> +
> +	return page_is_pfmemalloc(page);
> +}
> +
> +static inline bool page_pool_page_is_pref_nid(struct page *page, int pref_nid)
> +{
> +	/* Assume page_pool_iov are on the preferred node without actually
> +	 * checking...
> +	 *
> +	 * This check is only used to check for recycling memory in the page
> +	 * pool's fast paths. Currently the only implementation of page_pool_iov
> +	 * is dmabuf device memory. It's a deliberate decision by the user to
> +	 * bind a certain dmabuf to a certain netdev, and the netdev rx queue
> +	 * would not be able to reallocate memory from another dmabuf that
> +	 * exists on the preferred node, so, this check doesn't make much sense
> +	 * in this case. Assume all page_pool_iovs can be recycled for now.
> +	 */
> +	if (page_is_page_pool_iov(page))
> +		return true;
> +
> +	return page_to_nid(page) == pref_nid;
> +}
> +
>   struct page_pool {
>   	struct page_pool_params p;
> 
> @@ -434,6 +492,9 @@ static inline long page_pool_defrag_page(struct page *page, long nr)
>   {
>   	long ret;
> 
> +	if (page_is_page_pool_iov(page))
> +		return -EINVAL;
> +
>   	/* If nr == pp_frag_count then we have cleared all remaining
>   	 * references to the page. No need to actually overwrite it, instead
>   	 * we can leave this to be overwritten by the calling function.
> @@ -494,7 +555,12 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,
> 
>   static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
>   {
> -	dma_addr_t ret = page->dma_addr;
> +	dma_addr_t ret;
> +
> +	if (page_is_page_pool_iov(page))
> +		return page_pool_iov_dma_addr(page_to_page_pool_iov(page));
> +
> +	ret = page->dma_addr;
> 
>   	if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
>   		ret |= (dma_addr_t)page->dma_addr_upper << 16 << 16;
> @@ -504,6 +570,12 @@ static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
> 
>   static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
>   {
> +	/* page_pool_iovs are mapped and their dma-addr can't be modified. */
> +	if (page_is_page_pool_iov(page)) {
> +		DEBUG_NET_WARN_ON_ONCE(true);
> +		return;
> +	}
> +
>   	page->dma_addr = addr;
>   	if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
>   		page->dma_addr_upper = upper_32_bits(addr);
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 0a7c08d748b8..20c1f74fd844 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -318,7 +318,7 @@ static struct page *page_pool_refill_alloc_cache(struct page_pool *pool)
>   		if (unlikely(!page))
>   			break;
> 
> -		if (likely(page_to_nid(page) == pref_nid)) {
> +		if (likely(page_pool_page_is_pref_nid(page, pref_nid))) {
>   			pool->alloc.cache[pool->alloc.count++] = page;
>   		} else {
>   			/* NUMA mismatch;
> @@ -363,7 +363,15 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
>   					  struct page *page,
>   					  unsigned int dma_sync_size)
>   {
> -	dma_addr_t dma_addr = page_pool_get_dma_addr(page);
> +	dma_addr_t dma_addr;
> +
> +	/* page_pool_iov memory provider do not support PP_FLAG_DMA_SYNC_DEV */
> +	if (page_is_page_pool_iov(page)) {
> +		DEBUG_NET_WARN_ON_ONCE(true);
> +		return;
> +	}
> +
> +	dma_addr = page_pool_get_dma_addr(page);
> 
>   	dma_sync_size = min(dma_sync_size, pool->p.max_len);
>   	dma_sync_single_range_for_device(pool->p.dev, dma_addr,
> @@ -375,6 +383,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
>   {
>   	dma_addr_t dma;
> 
> +	if (page_is_page_pool_iov(page)) {
> +		/* page_pool_iovs are already mapped */
> +		DEBUG_NET_WARN_ON_ONCE(true);
> +		return true;
> +	}
> +
>   	/* Setup DMA mapping: use 'struct page' area for storing DMA-addr
>   	 * since dma_addr_t can be either 32 or 64 bits and does not always fit
>   	 * into page private data (i.e 32bit cpu with 64bit DMA caps)
> @@ -398,14 +412,24 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
>   static void page_pool_set_pp_info(struct page_pool *pool,
>   				  struct page *page)
>   {
> -	page->pp = pool;
> -	page->pp_magic |= PP_SIGNATURE;
> +	if (!page_is_page_pool_iov(page)) {
> +		page->pp = pool;
> +		page->pp_magic |= PP_SIGNATURE;
> +	} else {
> +		page_to_page_pool_iov(page)->pp = pool;
> +	}
> +
>   	if (pool->p.init_callback)
>   		pool->p.init_callback(page, pool->p.init_arg);
>   }
> 
>   static void page_pool_clear_pp_info(struct page *page)
>   {
> +	if (page_is_page_pool_iov(page)) {
> +		page_to_page_pool_iov(page)->pp = NULL;
> +		return;
> +	}
> +
>   	page->pp_magic = 0;
>   	page->pp = NULL;
>   }
> @@ -615,7 +639,7 @@ static bool page_pool_recycle_in_cache(struct page *page,
>   		return false;
>   	}
> 
> -	/* Caller MUST have verified/know (page_ref_count(page) == 1) */
> +	/* Caller MUST have verified/know (page_pool_page_ref_count(page) == 1) */
>   	pool->alloc.cache[pool->alloc.count++] = page;
>   	recycle_stat_inc(pool, cached);
>   	return true;
> @@ -638,9 +662,10 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
>   	 * refcnt == 1 means page_pool owns page, and can recycle it.
>   	 *
>   	 * page is NOT reusable when allocated when system is under
> -	 * some pressure. (page_is_pfmemalloc)
> +	 * some pressure. (page_pool_page_is_pfmemalloc)
>   	 */
> -	if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
> +	if (likely(page_pool_page_ref_count(page) == 1 &&
> +		   !page_pool_page_is_pfmemalloc(page))) {
>   		/* Read barrier done in page_ref_count / READ_ONCE */
> 
>   		if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> @@ -741,7 +766,8 @@ static struct page *page_pool_drain_frag(struct page_pool *pool,
>   	if (likely(page_pool_defrag_page(page, drain_count)))
>   		return NULL;
> 
> -	if (page_ref_count(page) == 1 && !page_is_pfmemalloc(page)) {
> +	if (page_pool_page_ref_count(page) == 1 &&
> +	    !page_pool_page_is_pfmemalloc(page)) {
>   		if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
>   			page_pool_dma_sync_for_device(pool, page, -1);
> 
> @@ -818,9 +844,9 @@ static void page_pool_empty_ring(struct page_pool *pool)
>   	/* Empty recycle ring */
>   	while ((page = ptr_ring_consume_bh(&pool->ring))) {
>   		/* Verify the refcnt invariant of cached pages */
> -		if (!(page_ref_count(page) == 1))
> +		if (!(page_pool_page_ref_count(page) == 1))
>   			pr_crit("%s() page_pool refcnt %d violation\n",
> -				__func__, page_ref_count(page));
> +				__func__, page_pool_page_ref_count(page));
> 
>   		page_pool_return_page(pool, page);
>   	}
> @@ -977,19 +1003,24 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe)
>   	struct page_pool *pp;
>   	bool allow_direct;
> 
> -	page = compound_head(page);
> +	if (!page_is_page_pool_iov(page)) {
> +		page = compound_head(page);
> 
> -	/* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
> -	 * in order to preserve any existing bits, such as bit 0 for the
> -	 * head page of compound page and bit 1 for pfmemalloc page, so
> -	 * mask those bits for freeing side when doing below checking,
> -	 * and page_is_pfmemalloc() is checked in __page_pool_put_page()
> -	 * to avoid recycling the pfmemalloc page.
> -	 */
> -	if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
> -		return false;
> +		/* page->pp_magic is OR'ed with PP_SIGNATURE after the
> +		 * allocation in order to preserve any existing bits, such as
> +		 * bit 0 for the head page of compound page and bit 1 for
> +		 * pfmemalloc page, so mask those bits for freeing side when
> +		 * doing below checking, and page_pool_page_is_pfmemalloc() is
> +		 * checked in __page_pool_put_page() to avoid recycling the
> +		 * pfmemalloc page.
> +		 */
> +		if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
> +			return false;
> 
> -	pp = page->pp;
> +		pp = page->pp;
> +	} else {
> +		pp = page_to_page_pool_iov(page)->pp;
> +	}
> 
>   	/* Allow direct recycle if we have reasons to believe that we are
>   	 * in the same context as the consumer would run, so there's
> @@ -1273,9 +1304,9 @@ static bool mp_huge_busy(struct mp_huge *hu, unsigned int idx)
> 
>   	for (j = 0; j < (1 << MP_HUGE_ORDER); j++) {
>   		page = hu->page[idx] + j;
> -		if (page_ref_count(page) != 1) {
> +		if (page_pool_page_ref_count(page) != 1) {
>   			pr_warn("Page with ref count %d at %u, %u. Can't safely destory, leaking memory!\n",
> -				page_ref_count(page), idx, j);
> +				page_pool_page_ref_count(page), idx, j);
>   			return true;
>   		}
>   	}
> @@ -1330,7 +1361,7 @@ static struct page *mp_huge_alloc_pages(struct page_pool *pool, gfp_t gfp)
>   			continue;
> 
>   		if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
> -		    page_ref_count(page) != 1) {
> +		    page_pool_page_ref_count(page) != 1) {
>   			atomic_inc(&mp_huge_ins_b);
>   			continue;
>   		}
> @@ -1458,9 +1489,9 @@ static void mp_huge_1g_destroy(struct page_pool *pool)
>   	free = true;
>   	for (i = 0; i < MP_HUGE_1G_CNT; i++) {
>   		page = hu->page + i;
> -		if (page_ref_count(page) != 1) {
> +		if (page_pool_page_ref_count(page) != 1) {
>   			pr_warn("Page with ref count %d at %u. Can't safely destory, leaking memory!\n",
> -				page_ref_count(page), i);
> +				page_pool_page_ref_count(page), i);
>   			free = false;
>   			break;
>   		}
> @@ -1489,7 +1520,7 @@ static struct page *mp_huge_1g_alloc_pages(struct page_pool *pool, gfp_t gfp)
>   		page = hu->page + page_i;
> 
>   		if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
> -		    page_ref_count(page) != 1) {
> +		    page_pool_page_ref_count(page) != 1) {
>   			atomic_inc(&mp_huge_ins_b);
>   			continue;
>   		}
> --
> 2.41.0.640.ga95def55d0-goog
>
Willem de Bruijn Aug. 19, 2023, 2:08 p.m. UTC | #2
On Sat, Aug 19, 2023 at 5:51 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
>
> On 10/08/2023 03.57, Mina Almasry wrote:
> > Overload the LSB of struct page* to indicate that it's a page_pool_iov.
> >
> > Refactor mm calls on struct page * into helpers, and add page_pool_iov
> > handling on those helpers. Modify callers of these mm APIs with calls to
> > these helpers instead.
> >
>
> I don't like of this approach.
> This is adding code to the PP (page_pool) fast-path in multiple places.
>
> I've not had time to run my usual benchmarks, which are here:
>
> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
>
> But I'm sure it will affect performance.
>
> Regardless of performance, this approach is using ptr-LSB-bits, to hide
> that page-pointer are not really struct-pages, feels like force feeding
> a solution just to use the page_pool APIs.
>
>
> > In areas where struct page* is dereferenced, add a check for special
> > handling of page_pool_iov.
> >
> > The memory providers producing page_pool_iov can set the LSB on the
> > struct page* returned to the page pool.
> >
> > Note that instead of overloading the LSB of page pointers, we can
> > instead define a new union between struct page & struct page_pool_iov and
> > compact it in a new type. However, we'd need to implement the code churn
> > to modify the page_pool & drivers to use this new type. For this POC
> > that is not implemented (feedback welcome).
> >
>
> I've said before, that I prefer multiplexing on page->pp_magic.
> For your page_pool_iov the layout would have to match the offset of
> pp_magic, to do this. (And if insisting on using PP infra the refcnt
> would also need to align).

Perhaps I misunderstand, but this suggests continuing to using
struct page to demultiplex memory type?

I think the feedback has been strong to not multiplex yet another
memory type into that struct, that is not a real page. Which is why
we went into this direction. This latest series limits the impact largely
to networking structures and code.

One way or another, there will be a branch and multiplexing. Whether
that is in struct page, the page pool or a new netdev mem type as you
propose.

Any regression in page pool can be avoided in the common case that
does not use device mem by placing that behind a static_branch. Would
that address your performance concerns?

>
> On the allocation side, all drivers already use a driver helper
> page_pool_dev_alloc_pages() or we could add another (better named)
> helper to multiplex between other types of allocators, e.g. a devmem
> allocator.
>
> On free/return/recycle the functions napi_pp_put_page or skb_pp_recycle
> could multiplex on pp_magic and call another API.  The API could be an
> extension to PP helpers, but it could also be a devmap allocator helper.
>
> IMHO forcing/piggy-bagging everything into page_pool is not the right
> solution.  I really think netstack need to support different allocator
> types.

To me this is lifting page_pool into such a netstack alloctator pool.

Not sure adding another explicit layer of indirection would be cleaner
or faster (potentially more indirect calls).

As for the LSB trick: that avoided adding a lot of boilerplate churn
with new type and helper functions.



> The page pool have been leading the way, yes, but perhaps it is
> time to add an API layer that e.g. could be named netmem, that gives us
> the multiplexing between allocators.  In that process some of page_pool
> APIs would be lifted out as common blocks and others remain.
>
> --Jesper
>
> > I have a sample implementation of adding a new page_pool_token type
> > in the page_pool to give a general idea here:
> > https://github.com/torvalds/linux/commit/3a7628700eb7fd02a117db036003bca50779608d
> >
> > Full branch here:
> > https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-pp-tokens
> >
> > (In the branches above, page_pool_iov is called devmem_slice).
> >
> > Could also add static_branch to speed up the checks in page_pool_iov
> > memory providers are being used.
> >
> > Signed-off-by: Mina Almasry <almasrymina@google.com>
> > ---
> >   include/net/page_pool.h | 74 ++++++++++++++++++++++++++++++++++-
> >   net/core/page_pool.c    | 85 ++++++++++++++++++++++++++++-------------
> >   2 files changed, 131 insertions(+), 28 deletions(-)
> >
> > diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> > index 537eb36115ed..f08ca230d68e 100644
> > --- a/include/net/page_pool.h
> > +++ b/include/net/page_pool.h
> > @@ -282,6 +282,64 @@ static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page)
> >       return NULL;
> >   }
> >
> > +static inline int page_pool_page_ref_count(struct page *page)
> > +{
> > +     if (page_is_page_pool_iov(page))
> > +             return page_pool_iov_refcount(page_to_page_pool_iov(page));
> > +
> > +     return page_ref_count(page);
> > +}
> > +
> > +static inline void page_pool_page_get_many(struct page *page,
> > +                                        unsigned int count)
> > +{
> > +     if (page_is_page_pool_iov(page))
> > +             return page_pool_iov_get_many(page_to_page_pool_iov(page),
> > +                                           count);
> > +
> > +     return page_ref_add(page, count);
> > +}
> > +
> > +static inline void page_pool_page_put_many(struct page *page,
> > +                                        unsigned int count)
> > +{
> > +     if (page_is_page_pool_iov(page))
> > +             return page_pool_iov_put_many(page_to_page_pool_iov(page),
> > +                                           count);
> > +
> > +     if (count > 1)
> > +             page_ref_sub(page, count - 1);
> > +
> > +     put_page(page);
> > +}
> > +
> > +static inline bool page_pool_page_is_pfmemalloc(struct page *page)
> > +{
> > +     if (page_is_page_pool_iov(page))
> > +             return false;
> > +
> > +     return page_is_pfmemalloc(page);
> > +}
> > +
> > +static inline bool page_pool_page_is_pref_nid(struct page *page, int pref_nid)
> > +{
> > +     /* Assume page_pool_iov are on the preferred node without actually
> > +      * checking...
> > +      *
> > +      * This check is only used to check for recycling memory in the page
> > +      * pool's fast paths. Currently the only implementation of page_pool_iov
> > +      * is dmabuf device memory. It's a deliberate decision by the user to
> > +      * bind a certain dmabuf to a certain netdev, and the netdev rx queue
> > +      * would not be able to reallocate memory from another dmabuf that
> > +      * exists on the preferred node, so, this check doesn't make much sense
> > +      * in this case. Assume all page_pool_iovs can be recycled for now.
> > +      */
> > +     if (page_is_page_pool_iov(page))
> > +             return true;
> > +
> > +     return page_to_nid(page) == pref_nid;
> > +}
> > +
> >   struct page_pool {
> >       struct page_pool_params p;
> >
> > @@ -434,6 +492,9 @@ static inline long page_pool_defrag_page(struct page *page, long nr)
> >   {
> >       long ret;
> >
> > +     if (page_is_page_pool_iov(page))
> > +             return -EINVAL;
> > +
> >       /* If nr == pp_frag_count then we have cleared all remaining
> >        * references to the page. No need to actually overwrite it, instead
> >        * we can leave this to be overwritten by the calling function.
> > @@ -494,7 +555,12 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,
> >
> >   static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
> >   {
> > -     dma_addr_t ret = page->dma_addr;
> > +     dma_addr_t ret;
> > +
> > +     if (page_is_page_pool_iov(page))
> > +             return page_pool_iov_dma_addr(page_to_page_pool_iov(page));
> > +
> > +     ret = page->dma_addr;
> >
> >       if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
> >               ret |= (dma_addr_t)page->dma_addr_upper << 16 << 16;
> > @@ -504,6 +570,12 @@ static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
> >
> >   static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
> >   {
> > +     /* page_pool_iovs are mapped and their dma-addr can't be modified. */
> > +     if (page_is_page_pool_iov(page)) {
> > +             DEBUG_NET_WARN_ON_ONCE(true);
> > +             return;
> > +     }
> > +
> >       page->dma_addr = addr;
> >       if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
> >               page->dma_addr_upper = upper_32_bits(addr);
> > diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> > index 0a7c08d748b8..20c1f74fd844 100644
> > --- a/net/core/page_pool.c
> > +++ b/net/core/page_pool.c
> > @@ -318,7 +318,7 @@ static struct page *page_pool_refill_alloc_cache(struct page_pool *pool)
> >               if (unlikely(!page))
> >                       break;
> >
> > -             if (likely(page_to_nid(page) == pref_nid)) {
> > +             if (likely(page_pool_page_is_pref_nid(page, pref_nid))) {
> >                       pool->alloc.cache[pool->alloc.count++] = page;
> >               } else {
> >                       /* NUMA mismatch;
> > @@ -363,7 +363,15 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
> >                                         struct page *page,
> >                                         unsigned int dma_sync_size)
> >   {
> > -     dma_addr_t dma_addr = page_pool_get_dma_addr(page);
> > +     dma_addr_t dma_addr;
> > +
> > +     /* page_pool_iov memory provider do not support PP_FLAG_DMA_SYNC_DEV */
> > +     if (page_is_page_pool_iov(page)) {
> > +             DEBUG_NET_WARN_ON_ONCE(true);
> > +             return;
> > +     }
> > +
> > +     dma_addr = page_pool_get_dma_addr(page);
> >
> >       dma_sync_size = min(dma_sync_size, pool->p.max_len);
> >       dma_sync_single_range_for_device(pool->p.dev, dma_addr,
> > @@ -375,6 +383,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
> >   {
> >       dma_addr_t dma;
> >
> > +     if (page_is_page_pool_iov(page)) {
> > +             /* page_pool_iovs are already mapped */
> > +             DEBUG_NET_WARN_ON_ONCE(true);
> > +             return true;
> > +     }
> > +
> >       /* Setup DMA mapping: use 'struct page' area for storing DMA-addr
> >        * since dma_addr_t can be either 32 or 64 bits and does not always fit
> >        * into page private data (i.e 32bit cpu with 64bit DMA caps)
> > @@ -398,14 +412,24 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
> >   static void page_pool_set_pp_info(struct page_pool *pool,
> >                                 struct page *page)
> >   {
> > -     page->pp = pool;
> > -     page->pp_magic |= PP_SIGNATURE;
> > +     if (!page_is_page_pool_iov(page)) {
> > +             page->pp = pool;
> > +             page->pp_magic |= PP_SIGNATURE;
> > +     } else {
> > +             page_to_page_pool_iov(page)->pp = pool;
> > +     }
> > +
> >       if (pool->p.init_callback)
> >               pool->p.init_callback(page, pool->p.init_arg);
> >   }
> >
> >   static void page_pool_clear_pp_info(struct page *page)
> >   {
> > +     if (page_is_page_pool_iov(page)) {
> > +             page_to_page_pool_iov(page)->pp = NULL;
> > +             return;
> > +     }
> > +
> >       page->pp_magic = 0;
> >       page->pp = NULL;
> >   }
> > @@ -615,7 +639,7 @@ static bool page_pool_recycle_in_cache(struct page *page,
> >               return false;
> >       }
> >
> > -     /* Caller MUST have verified/know (page_ref_count(page) == 1) */
> > +     /* Caller MUST have verified/know (page_pool_page_ref_count(page) == 1) */
> >       pool->alloc.cache[pool->alloc.count++] = page;
> >       recycle_stat_inc(pool, cached);
> >       return true;
> > @@ -638,9 +662,10 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
> >        * refcnt == 1 means page_pool owns page, and can recycle it.
> >        *
> >        * page is NOT reusable when allocated when system is under
> > -      * some pressure. (page_is_pfmemalloc)
> > +      * some pressure. (page_pool_page_is_pfmemalloc)
> >        */
> > -     if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
> > +     if (likely(page_pool_page_ref_count(page) == 1 &&
> > +                !page_pool_page_is_pfmemalloc(page))) {
> >               /* Read barrier done in page_ref_count / READ_ONCE */
> >
> >               if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> > @@ -741,7 +766,8 @@ static struct page *page_pool_drain_frag(struct page_pool *pool,
> >       if (likely(page_pool_defrag_page(page, drain_count)))
> >               return NULL;
> >
> > -     if (page_ref_count(page) == 1 && !page_is_pfmemalloc(page)) {
> > +     if (page_pool_page_ref_count(page) == 1 &&
> > +         !page_pool_page_is_pfmemalloc(page)) {
> >               if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> >                       page_pool_dma_sync_for_device(pool, page, -1);
> >
> > @@ -818,9 +844,9 @@ static void page_pool_empty_ring(struct page_pool *pool)
> >       /* Empty recycle ring */
> >       while ((page = ptr_ring_consume_bh(&pool->ring))) {
> >               /* Verify the refcnt invariant of cached pages */
> > -             if (!(page_ref_count(page) == 1))
> > +             if (!(page_pool_page_ref_count(page) == 1))
> >                       pr_crit("%s() page_pool refcnt %d violation\n",
> > -                             __func__, page_ref_count(page));
> > +                             __func__, page_pool_page_ref_count(page));
> >
> >               page_pool_return_page(pool, page);
> >       }
> > @@ -977,19 +1003,24 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe)
> >       struct page_pool *pp;
> >       bool allow_direct;
> >
> > -     page = compound_head(page);
> > +     if (!page_is_page_pool_iov(page)) {
> > +             page = compound_head(page);
> >
> > -     /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
> > -      * in order to preserve any existing bits, such as bit 0 for the
> > -      * head page of compound page and bit 1 for pfmemalloc page, so
> > -      * mask those bits for freeing side when doing below checking,
> > -      * and page_is_pfmemalloc() is checked in __page_pool_put_page()
> > -      * to avoid recycling the pfmemalloc page.
> > -      */
> > -     if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
> > -             return false;
> > +             /* page->pp_magic is OR'ed with PP_SIGNATURE after the
> > +              * allocation in order to preserve any existing bits, such as
> > +              * bit 0 for the head page of compound page and bit 1 for
> > +              * pfmemalloc page, so mask those bits for freeing side when
> > +              * doing below checking, and page_pool_page_is_pfmemalloc() is
> > +              * checked in __page_pool_put_page() to avoid recycling the
> > +              * pfmemalloc page.
> > +              */
> > +             if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
> > +                     return false;
> >
> > -     pp = page->pp;
> > +             pp = page->pp;
> > +     } else {
> > +             pp = page_to_page_pool_iov(page)->pp;
> > +     }
> >
> >       /* Allow direct recycle if we have reasons to believe that we are
> >        * in the same context as the consumer would run, so there's
> > @@ -1273,9 +1304,9 @@ static bool mp_huge_busy(struct mp_huge *hu, unsigned int idx)
> >
> >       for (j = 0; j < (1 << MP_HUGE_ORDER); j++) {
> >               page = hu->page[idx] + j;
> > -             if (page_ref_count(page) != 1) {
> > +             if (page_pool_page_ref_count(page) != 1) {
> >                       pr_warn("Page with ref count %d at %u, %u. Can't safely destory, leaking memory!\n",
> > -                             page_ref_count(page), idx, j);
> > +                             page_pool_page_ref_count(page), idx, j);
> >                       return true;
> >               }
> >       }
> > @@ -1330,7 +1361,7 @@ static struct page *mp_huge_alloc_pages(struct page_pool *pool, gfp_t gfp)
> >                       continue;
> >
> >               if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
> > -                 page_ref_count(page) != 1) {
> > +                 page_pool_page_ref_count(page) != 1) {
> >                       atomic_inc(&mp_huge_ins_b);
> >                       continue;
> >               }
> > @@ -1458,9 +1489,9 @@ static void mp_huge_1g_destroy(struct page_pool *pool)
> >       free = true;
> >       for (i = 0; i < MP_HUGE_1G_CNT; i++) {
> >               page = hu->page + i;
> > -             if (page_ref_count(page) != 1) {
> > +             if (page_pool_page_ref_count(page) != 1) {
> >                       pr_warn("Page with ref count %d at %u. Can't safely destory, leaking memory!\n",
> > -                             page_ref_count(page), i);
> > +                             page_pool_page_ref_count(page), i);
> >                       free = false;
> >                       break;
> >               }
> > @@ -1489,7 +1520,7 @@ static struct page *mp_huge_1g_alloc_pages(struct page_pool *pool, gfp_t gfp)
> >               page = hu->page + page_i;
> >
> >               if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
> > -                 page_ref_count(page) != 1) {
> > +                 page_pool_page_ref_count(page) != 1) {
> >                       atomic_inc(&mp_huge_ins_b);
> >                       continue;
> >               }
> > --
> > 2.41.0.640.ga95def55d0-goog
> >
>
Jesper Dangaard Brouer Aug. 19, 2023, 3:22 p.m. UTC | #3
On 19/08/2023 16.08, Willem de Bruijn wrote:
> On Sat, Aug 19, 2023 at 5:51 AM Jesper Dangaard Brouer
> <jbrouer@redhat.com> wrote:
>>
>>
>>
>> On 10/08/2023 03.57, Mina Almasry wrote:
>>> Overload the LSB of struct page* to indicate that it's a page_pool_iov.
>>>
>>> Refactor mm calls on struct page * into helpers, and add page_pool_iov
>>> handling on those helpers. Modify callers of these mm APIs with calls to
>>> these helpers instead.
>>>
>>
>> I don't like of this approach.
>> This is adding code to the PP (page_pool) fast-path in multiple places.
>>
>> I've not had time to run my usual benchmarks, which are here:
>>
>> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
>>
>> But I'm sure it will affect performance.
>>
>> Regardless of performance, this approach is using ptr-LSB-bits, to hide
>> that page-pointer are not really struct-pages, feels like force feeding
>> a solution just to use the page_pool APIs.
>>
>>
>>> In areas where struct page* is dereferenced, add a check for special
>>> handling of page_pool_iov.
>>>
>>> The memory providers producing page_pool_iov can set the LSB on the
>>> struct page* returned to the page pool.
>>>
>>> Note that instead of overloading the LSB of page pointers, we can
>>> instead define a new union between struct page & struct page_pool_iov and
>>> compact it in a new type. However, we'd need to implement the code churn
>>> to modify the page_pool & drivers to use this new type. For this POC
>>> that is not implemented (feedback welcome).
>>>
>>
>> I've said before, that I prefer multiplexing on page->pp_magic.
>> For your page_pool_iov the layout would have to match the offset of
>> pp_magic, to do this. (And if insisting on using PP infra the refcnt
>> would also need to align).
> 
> Perhaps I misunderstand, but this suggests continuing to using
> struct page to demultiplex memory type?
> 

(Perhaps we are misunderstanding each-other and my use of the words 
multiplexing and demultiplex are wrong, I'm sorry, as English isn't my 
native language.)

I do see the problem of depending on having a struct page, as the 
page_pool_iov isn't related to struct page.  Having "page" in the name 
of "page_pool_iov" is also confusing (hardest problem is CS is naming, 
as we all know).

To support more allocator types, perhaps skb->pp_recycle bit need to 
grow another bit (and be renamed skb->recycle), so we can tell allocator 
types apart, those that are page based and those whom are not.


> I think the feedback has been strong to not multiplex yet another
> memory type into that struct, that is not a real page. Which is why
> we went into this direction. This latest series limits the impact largely
> to networking structures and code.
> 

Some what related what I'm objecting to: the "page_pool_iov" is not a 
real page, but this getting recycled into something called "page_pool", 
which funny enough deals with struct-pages internally and depend on the 
struct-page-refcnt.

Given the approach changed way from using struct page, then I also don't 
see the connection with the page_pool. Sorry.

> One way or another, there will be a branch and multiplexing. Whether
> that is in struct page, the page pool or a new netdev mem type as you
> propose.
> 

I'm asking to have this branch/multiplexing done a the call sites.

(IMHO not changing the drivers is a pipe-dream.)

> Any regression in page pool can be avoided in the common case that
> does not use device mem by placing that behind a static_branch. Would
> that address your performance concerns?
> 

No. This will not help.

The problem is that every where in the page_pool code it is getting 
polluted with:

   if (page_is_page_pool_iov(page))
     call-some-iov-func-instead()

Like: the very central piece of getting the refcnt:

+static inline int page_pool_page_ref_count(struct page *page)
+{
+	if (page_is_page_pool_iov(page))
+		return page_pool_iov_refcount(page_to_page_pool_iov(page));
+
+	return page_ref_count(page);
+}


The fast-path of the PP is used for XDP_DROP scenarios, and is currently 
around 14 cycles (tsc). Thus, any extra code in this code patch will 
change the fast-path.


>>
>> On the allocation side, all drivers already use a driver helper
>> page_pool_dev_alloc_pages() or we could add another (better named)
>> helper to multiplex between other types of allocators, e.g. a devmem
>> allocator.
>>
>> On free/return/recycle the functions napi_pp_put_page or skb_pp_recycle
>> could multiplex on pp_magic and call another API.  The API could be an
>> extension to PP helpers, but it could also be a devmap allocator helper.
>>
>> IMHO forcing/piggy-bagging everything into page_pool is not the right
>> solution.  I really think netstack need to support different allocator
>> types.
> 
> To me this is lifting page_pool into such a netstack alloctator pool.
> 

This is should be renamed as it is not longer dealing with pages.

> Not sure adding another explicit layer of indirection would be cleaner
> or faster (potentially more indirect calls).
> 

It seems we are talking past each-other.  The layer of indirection I'm 
talking about is likely a simple header file (e.g. named netmem.h) that 
will get inline compiled so there is no overhead. It will be used by 
driver, such that we can avoid touching driver again when introducing 
new memory allocator types.


> As for the LSB trick: that avoided adding a lot of boilerplate churn
> with new type and helper functions.
> 

Says the lazy programmer :-P ... sorry could not resist ;-)

> 
> 
>> The page pool have been leading the way, yes, but perhaps it is
>> time to add an API layer that e.g. could be named netmem, that gives us
>> the multiplexing between allocators.  In that process some of page_pool
>> APIs would be lifted out as common blocks and others remain.
>>
>> --Jesper
>>
>>> I have a sample implementation of adding a new page_pool_token type
>>> in the page_pool to give a general idea here:
>>> https://github.com/torvalds/linux/commit/3a7628700eb7fd02a117db036003bca50779608d
>>>
>>> Full branch here:
>>> https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-pp-tokens
>>>
>>> (In the branches above, page_pool_iov is called devmem_slice).
>>>
>>> Could also add static_branch to speed up the checks in page_pool_iov
>>> memory providers are being used.
>>>
>>> Signed-off-by: Mina Almasry <almasrymina@google.com>
>>> ---
>>>    include/net/page_pool.h | 74 ++++++++++++++++++++++++++++++++++-
>>>    net/core/page_pool.c    | 85 ++++++++++++++++++++++++++++-------------
>>>    2 files changed, 131 insertions(+), 28 deletions(-)
>>>
>>> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
>>> index 537eb36115ed..f08ca230d68e 100644
>>> --- a/include/net/page_pool.h
>>> +++ b/include/net/page_pool.h
>>> @@ -282,6 +282,64 @@ static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page)
>>>        return NULL;
>>>    }
>>>
>>> +static inline int page_pool_page_ref_count(struct page *page)
>>> +{
>>> +     if (page_is_page_pool_iov(page))
>>> +             return page_pool_iov_refcount(page_to_page_pool_iov(page));
>>> +
>>> +     return page_ref_count(page);
>>> +}
>>> +
>>> +static inline void page_pool_page_get_many(struct page *page,
>>> +                                        unsigned int count)
>>> +{
>>> +     if (page_is_page_pool_iov(page))
>>> +             return page_pool_iov_get_many(page_to_page_pool_iov(page),
>>> +                                           count);
>>> +
>>> +     return page_ref_add(page, count);
>>> +}
>>> +
>>> +static inline void page_pool_page_put_many(struct page *page,
>>> +                                        unsigned int count)
>>> +{
>>> +     if (page_is_page_pool_iov(page))
>>> +             return page_pool_iov_put_many(page_to_page_pool_iov(page),
>>> +                                           count);
>>> +
>>> +     if (count > 1)
>>> +             page_ref_sub(page, count - 1);
>>> +
>>> +     put_page(page);
>>> +}
>>> +
>>> +static inline bool page_pool_page_is_pfmemalloc(struct page *page)
>>> +{
>>> +     if (page_is_page_pool_iov(page))
>>> +             return false;
>>> +
>>> +     return page_is_pfmemalloc(page);
>>> +}
>>> +
>>> +static inline bool page_pool_page_is_pref_nid(struct page *page, int pref_nid)
>>> +{
>>> +     /* Assume page_pool_iov are on the preferred node without actually
>>> +      * checking...
>>> +      *
>>> +      * This check is only used to check for recycling memory in the page
>>> +      * pool's fast paths. Currently the only implementation of page_pool_iov
>>> +      * is dmabuf device memory. It's a deliberate decision by the user to
>>> +      * bind a certain dmabuf to a certain netdev, and the netdev rx queue
>>> +      * would not be able to reallocate memory from another dmabuf that
>>> +      * exists on the preferred node, so, this check doesn't make much sense
>>> +      * in this case. Assume all page_pool_iovs can be recycled for now.
>>> +      */
>>> +     if (page_is_page_pool_iov(page))
>>> +             return true;
>>> +
>>> +     return page_to_nid(page) == pref_nid;
>>> +}
>>> +
>>>    struct page_pool {
>>>        struct page_pool_params p;
>>>
>>> @@ -434,6 +492,9 @@ static inline long page_pool_defrag_page(struct page *page, long nr)
>>>    {
>>>        long ret;
>>>
>>> +     if (page_is_page_pool_iov(page))
>>> +             return -EINVAL;
>>> +
>>>        /* If nr == pp_frag_count then we have cleared all remaining
>>>         * references to the page. No need to actually overwrite it, instead
>>>         * we can leave this to be overwritten by the calling function.
>>> @@ -494,7 +555,12 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,
>>>
>>>    static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
>>>    {
>>> -     dma_addr_t ret = page->dma_addr;
>>> +     dma_addr_t ret;
>>> +
>>> +     if (page_is_page_pool_iov(page))
>>> +             return page_pool_iov_dma_addr(page_to_page_pool_iov(page));
>>> +
>>> +     ret = page->dma_addr;
>>>
>>>        if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
>>>                ret |= (dma_addr_t)page->dma_addr_upper << 16 << 16;
>>> @@ -504,6 +570,12 @@ static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
>>>
>>>    static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
>>>    {
>>> +     /* page_pool_iovs are mapped and their dma-addr can't be modified. */
>>> +     if (page_is_page_pool_iov(page)) {
>>> +             DEBUG_NET_WARN_ON_ONCE(true);
>>> +             return;
>>> +     }
>>> +
>>>        page->dma_addr = addr;
>>>        if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
>>>                page->dma_addr_upper = upper_32_bits(addr);
>>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>>> index 0a7c08d748b8..20c1f74fd844 100644
>>> --- a/net/core/page_pool.c
>>> +++ b/net/core/page_pool.c
>>> @@ -318,7 +318,7 @@ static struct page *page_pool_refill_alloc_cache(struct page_pool *pool)
>>>                if (unlikely(!page))
>>>                        break;
>>>
>>> -             if (likely(page_to_nid(page) == pref_nid)) {
>>> +             if (likely(page_pool_page_is_pref_nid(page, pref_nid))) {
>>>                        pool->alloc.cache[pool->alloc.count++] = page;
>>>                } else {
>>>                        /* NUMA mismatch;
>>> @@ -363,7 +363,15 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
>>>                                          struct page *page,
>>>                                          unsigned int dma_sync_size)
>>>    {
>>> -     dma_addr_t dma_addr = page_pool_get_dma_addr(page);
>>> +     dma_addr_t dma_addr;
>>> +
>>> +     /* page_pool_iov memory provider do not support PP_FLAG_DMA_SYNC_DEV */
>>> +     if (page_is_page_pool_iov(page)) {
>>> +             DEBUG_NET_WARN_ON_ONCE(true);
>>> +             return;
>>> +     }
>>> +
>>> +     dma_addr = page_pool_get_dma_addr(page);
>>>
>>>        dma_sync_size = min(dma_sync_size, pool->p.max_len);
>>>        dma_sync_single_range_for_device(pool->p.dev, dma_addr,
>>> @@ -375,6 +383,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
>>>    {
>>>        dma_addr_t dma;
>>>
>>> +     if (page_is_page_pool_iov(page)) {
>>> +             /* page_pool_iovs are already mapped */
>>> +             DEBUG_NET_WARN_ON_ONCE(true);
>>> +             return true;
>>> +     }
>>> +
>>>        /* Setup DMA mapping: use 'struct page' area for storing DMA-addr
>>>         * since dma_addr_t can be either 32 or 64 bits and does not always fit
>>>         * into page private data (i.e 32bit cpu with 64bit DMA caps)
>>> @@ -398,14 +412,24 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
>>>    static void page_pool_set_pp_info(struct page_pool *pool,
>>>                                  struct page *page)
>>>    {
>>> -     page->pp = pool;
>>> -     page->pp_magic |= PP_SIGNATURE;
>>> +     if (!page_is_page_pool_iov(page)) {
>>> +             page->pp = pool;
>>> +             page->pp_magic |= PP_SIGNATURE;
>>> +     } else {
>>> +             page_to_page_pool_iov(page)->pp = pool;
>>> +     }
>>> +
>>>        if (pool->p.init_callback)
>>>                pool->p.init_callback(page, pool->p.init_arg);
>>>    }
>>>
>>>    static void page_pool_clear_pp_info(struct page *page)
>>>    {
>>> +     if (page_is_page_pool_iov(page)) {
>>> +             page_to_page_pool_iov(page)->pp = NULL;
>>> +             return;
>>> +     }
>>> +
>>>        page->pp_magic = 0;
>>>        page->pp = NULL;
>>>    }
>>> @@ -615,7 +639,7 @@ static bool page_pool_recycle_in_cache(struct page *page,
>>>                return false;
>>>        }
>>>
>>> -     /* Caller MUST have verified/know (page_ref_count(page) == 1) */
>>> +     /* Caller MUST have verified/know (page_pool_page_ref_count(page) == 1) */
>>>        pool->alloc.cache[pool->alloc.count++] = page;
>>>        recycle_stat_inc(pool, cached);
>>>        return true;
>>> @@ -638,9 +662,10 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
>>>         * refcnt == 1 means page_pool owns page, and can recycle it.
>>>         *
>>>         * page is NOT reusable when allocated when system is under
>>> -      * some pressure. (page_is_pfmemalloc)
>>> +      * some pressure. (page_pool_page_is_pfmemalloc)
>>>         */
>>> -     if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
>>> +     if (likely(page_pool_page_ref_count(page) == 1 &&
>>> +                !page_pool_page_is_pfmemalloc(page))) {
>>>                /* Read barrier done in page_ref_count / READ_ONCE */
>>>
>>>                if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
>>> @@ -741,7 +766,8 @@ static struct page *page_pool_drain_frag(struct page_pool *pool,
>>>        if (likely(page_pool_defrag_page(page, drain_count)))
>>>                return NULL;
>>>
>>> -     if (page_ref_count(page) == 1 && !page_is_pfmemalloc(page)) {
>>> +     if (page_pool_page_ref_count(page) == 1 &&
>>> +         !page_pool_page_is_pfmemalloc(page)) {
>>>                if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
>>>                        page_pool_dma_sync_for_device(pool, page, -1);
>>>
>>> @@ -818,9 +844,9 @@ static void page_pool_empty_ring(struct page_pool *pool)
>>>        /* Empty recycle ring */
>>>        while ((page = ptr_ring_consume_bh(&pool->ring))) {
>>>                /* Verify the refcnt invariant of cached pages */
>>> -             if (!(page_ref_count(page) == 1))
>>> +             if (!(page_pool_page_ref_count(page) == 1))
>>>                        pr_crit("%s() page_pool refcnt %d violation\n",
>>> -                             __func__, page_ref_count(page));
>>> +                             __func__, page_pool_page_ref_count(page));
>>>
>>>                page_pool_return_page(pool, page);
>>>        }
>>> @@ -977,19 +1003,24 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe)
>>>        struct page_pool *pp;
>>>        bool allow_direct;
>>>
>>> -     page = compound_head(page);
>>> +     if (!page_is_page_pool_iov(page)) {
>>> +             page = compound_head(page);
>>>
>>> -     /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
>>> -      * in order to preserve any existing bits, such as bit 0 for the
>>> -      * head page of compound page and bit 1 for pfmemalloc page, so
>>> -      * mask those bits for freeing side when doing below checking,
>>> -      * and page_is_pfmemalloc() is checked in __page_pool_put_page()
>>> -      * to avoid recycling the pfmemalloc page.
>>> -      */
>>> -     if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
>>> -             return false;
>>> +             /* page->pp_magic is OR'ed with PP_SIGNATURE after the
>>> +              * allocation in order to preserve any existing bits, such as
>>> +              * bit 0 for the head page of compound page and bit 1 for
>>> +              * pfmemalloc page, so mask those bits for freeing side when
>>> +              * doing below checking, and page_pool_page_is_pfmemalloc() is
>>> +              * checked in __page_pool_put_page() to avoid recycling the
>>> +              * pfmemalloc page.
>>> +              */
>>> +             if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
>>> +                     return false;
>>>
>>> -     pp = page->pp;
>>> +             pp = page->pp;
>>> +     } else {
>>> +             pp = page_to_page_pool_iov(page)->pp;
>>> +     }
>>>
>>>        /* Allow direct recycle if we have reasons to believe that we are
>>>         * in the same context as the consumer would run, so there's
>>> @@ -1273,9 +1304,9 @@ static bool mp_huge_busy(struct mp_huge *hu, unsigned int idx)
>>>
>>>        for (j = 0; j < (1 << MP_HUGE_ORDER); j++) {
>>>                page = hu->page[idx] + j;
>>> -             if (page_ref_count(page) != 1) {
>>> +             if (page_pool_page_ref_count(page) != 1) {
>>>                        pr_warn("Page with ref count %d at %u, %u. Can't safely destory, leaking memory!\n",
>>> -                             page_ref_count(page), idx, j);
>>> +                             page_pool_page_ref_count(page), idx, j);
>>>                        return true;
>>>                }
>>>        }
>>> @@ -1330,7 +1361,7 @@ static struct page *mp_huge_alloc_pages(struct page_pool *pool, gfp_t gfp)
>>>                        continue;
>>>
>>>                if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
>>> -                 page_ref_count(page) != 1) {
>>> +                 page_pool_page_ref_count(page) != 1) {
>>>                        atomic_inc(&mp_huge_ins_b);
>>>                        continue;
>>>                }
>>> @@ -1458,9 +1489,9 @@ static void mp_huge_1g_destroy(struct page_pool *pool)
>>>        free = true;
>>>        for (i = 0; i < MP_HUGE_1G_CNT; i++) {
>>>                page = hu->page + i;
>>> -             if (page_ref_count(page) != 1) {
>>> +             if (page_pool_page_ref_count(page) != 1) {
>>>                        pr_warn("Page with ref count %d at %u. Can't safely destory, leaking memory!\n",
>>> -                             page_ref_count(page), i);
>>> +                             page_pool_page_ref_count(page), i);
>>>                        free = false;
>>>                        break;
>>>                }
>>> @@ -1489,7 +1520,7 @@ static struct page *mp_huge_1g_alloc_pages(struct page_pool *pool, gfp_t gfp)
>>>                page = hu->page + page_i;
>>>
>>>                if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
>>> -                 page_ref_count(page) != 1) {
>>> +                 page_pool_page_ref_count(page) != 1) {
>>>                        atomic_inc(&mp_huge_ins_b);
>>>                        continue;
>>>                }
>>> --
>>> 2.41.0.640.ga95def55d0-goog
>>>
>>
>
David Ahern Aug. 19, 2023, 3:49 p.m. UTC | #4
On 8/19/23 9:22 AM, Jesper Dangaard Brouer wrote:
> 
> I do see the problem of depending on having a struct page, as the
> page_pool_iov isn't related to struct page.  Having "page" in the name
> of "page_pool_iov" is also confusing (hardest problem is CS is naming,
> as we all know).
> 
> To support more allocator types, perhaps skb->pp_recycle bit need to
> grow another bit (and be renamed skb->recycle), so we can tell allocator
> types apart, those that are page based and those whom are not.
> 
> 
>> I think the feedback has been strong to not multiplex yet another
>> memory type into that struct, that is not a real page. Which is why
>> we went into this direction. This latest series limits the impact largely
>> to networking structures and code.
>>
> 
> Some what related what I'm objecting to: the "page_pool_iov" is not a
> real page, but this getting recycled into something called "page_pool",
> which funny enough deals with struct-pages internally and depend on the
> struct-page-refcnt.
> 
> Given the approach changed way from using struct page, then I also don't
> see the connection with the page_pool. Sorry.

I do not care for the page_pool_iov name either; I presumed it was least
change to prove an idea and the name and details would evolve.

How about something like buffer_pool or netdev_buf_pool that can operate
with either pages, dma addresses, or something else in the future?

> 
>> As for the LSB trick: that avoided adding a lot of boilerplate churn
>> with new type and helper functions.
>>
> 
> Says the lazy programmer :-P ... sorry could not resist ;-)

Use of the LSB (or bits depending on alignment expectations) is a common
trick and already done in quite a few places in the networking stack.
This trick is essential to any realistic change here to incorporate gpu
memory; way too much code will have unnecessary churn without it.

I do prefer my earlier suggestion though where the skb_frag_t has a
union of relevant types though. Instead of `struct page *page` it could
be `void *addr` with the helpers indicating page, iov, or other.
Willem de Bruijn Aug. 19, 2023, 4:11 p.m. UTC | #5
> > Any regression in page pool can be avoided in the common case that
> > does not use device mem by placing that behind a static_branch. Would
> > that address your performance concerns?
> >
>
> No. This will not help.
>
> The problem is that every where in the page_pool code it is getting
> polluted with:
>
>    if (page_is_page_pool_iov(page))
>      call-some-iov-func-instead()
>
> Like: the very central piece of getting the refcnt:
>
> +static inline int page_pool_page_ref_count(struct page *page)
> +{
> +       if (page_is_page_pool_iov(page))
> +               return page_pool_iov_refcount(page_to_page_pool_iov(page));
> +
> +       return page_ref_count(page);
> +}
>
>
> The fast-path of the PP is used for XDP_DROP scenarios, and is currently
> around 14 cycles (tsc). Thus, any extra code in this code patch will
> change the fast-path.

With static_branch disabled, it would only insert a NOP?
Willem de Bruijn Aug. 19, 2023, 4:12 p.m. UTC | #6
On Sat, Aug 19, 2023 at 11:49 AM David Ahern <dsahern@kernel.org> wrote:
>
> On 8/19/23 9:22 AM, Jesper Dangaard Brouer wrote:
> >
> > I do see the problem of depending on having a struct page, as the
> > page_pool_iov isn't related to struct page.  Having "page" in the name
> > of "page_pool_iov" is also confusing (hardest problem is CS is naming,
> > as we all know).
> >
> > To support more allocator types, perhaps skb->pp_recycle bit need to
> > grow another bit (and be renamed skb->recycle), so we can tell allocator
> > types apart, those that are page based and those whom are not.
> >
> >
> >> I think the feedback has been strong to not multiplex yet another
> >> memory type into that struct, that is not a real page. Which is why
> >> we went into this direction. This latest series limits the impact largely
> >> to networking structures and code.
> >>
> >
> > Some what related what I'm objecting to: the "page_pool_iov" is not a
> > real page, but this getting recycled into something called "page_pool",
> > which funny enough deals with struct-pages internally and depend on the
> > struct-page-refcnt.
> >
> > Given the approach changed way from using struct page, then I also don't
> > see the connection with the page_pool. Sorry.
>
> I do not care for the page_pool_iov name either; I presumed it was least
> change to prove an idea and the name and details would evolve.
>
> How about something like buffer_pool or netdev_buf_pool that can operate
> with either pages, dma addresses, or something else in the future?

Sounds good. I suggested this name, but I see how using page in the
name is not very clear.

> >
> >> As for the LSB trick: that avoided adding a lot of boilerplate churn
> >> with new type and helper functions.
> >>
> >
> > Says the lazy programmer :-P ... sorry could not resist ;-)

:-) For the record, there is a prior version that added a separate type.

I did not like the churn it brought and asked for this.

>
> Use of the LSB (or bits depending on alignment expectations) is a common
> trick and already done in quite a few places in the networking stack.
> This trick is essential to any realistic change here to incorporate gpu
> memory; way too much code will have unnecessary churn without it.
>
> I do prefer my earlier suggestion though where the skb_frag_t has a
> union of relevant types though. Instead of `struct page *page` it could
> be `void *addr` with the helpers indicating page, iov, or other.

Okay. I think that is how we did it previously.
Mina Almasry Aug. 19, 2023, 8:24 p.m. UTC | #7
On Sat, Aug 19, 2023 at 8:22 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
>
> On 19/08/2023 16.08, Willem de Bruijn wrote:
> > On Sat, Aug 19, 2023 at 5:51 AM Jesper Dangaard Brouer
> > <jbrouer@redhat.com> wrote:
> >>
> >>
> >>
> >> On 10/08/2023 03.57, Mina Almasry wrote:
> >>> Overload the LSB of struct page* to indicate that it's a page_pool_iov.
> >>>
> >>> Refactor mm calls on struct page * into helpers, and add page_pool_iov
> >>> handling on those helpers. Modify callers of these mm APIs with calls to
> >>> these helpers instead.
> >>>
> >>
> >> I don't like of this approach.
> >> This is adding code to the PP (page_pool) fast-path in multiple places.
> >>
> >> I've not had time to run my usual benchmarks, which are here:
> >>
> >> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
> >>

Thank you for linking that, I'll try to run these against the next submission.

> >> But I'm sure it will affect performance.
> >>
> >> Regardless of performance, this approach is using ptr-LSB-bits, to hide
> >> that page-pointer are not really struct-pages, feels like force feeding
> >> a solution just to use the page_pool APIs.
> >>
> >>
> >>> In areas where struct page* is dereferenced, add a check for special
> >>> handling of page_pool_iov.
> >>>
> >>> The memory providers producing page_pool_iov can set the LSB on the
> >>> struct page* returned to the page pool.
> >>>
> >>> Note that instead of overloading the LSB of page pointers, we can
> >>> instead define a new union between struct page & struct page_pool_iov and
> >>> compact it in a new type. However, we'd need to implement the code churn
> >>> to modify the page_pool & drivers to use this new type. For this POC
> >>> that is not implemented (feedback welcome).
> >>>
> >>
> >> I've said before, that I prefer multiplexing on page->pp_magic.
> >> For your page_pool_iov the layout would have to match the offset of
> >> pp_magic, to do this. (And if insisting on using PP infra the refcnt
> >> would also need to align).
> >
> > Perhaps I misunderstand, but this suggests continuing to using
> > struct page to demultiplex memory type?
> >
>
> (Perhaps we are misunderstanding each-other and my use of the words
> multiplexing and demultiplex are wrong, I'm sorry, as English isn't my
> native language.)
>
> I do see the problem of depending on having a struct page, as the
> page_pool_iov isn't related to struct page.  Having "page" in the name
> of "page_pool_iov" is also confusing (hardest problem is CS is naming,
> as we all know).
>
> To support more allocator types, perhaps skb->pp_recycle bit need to
> grow another bit (and be renamed skb->recycle), so we can tell allocator
> types apart, those that are page based and those whom are not.
>
>
> > I think the feedback has been strong to not multiplex yet another
> > memory type into that struct, that is not a real page. Which is why
> > we went into this direction. This latest series limits the impact largely
> > to networking structures and code.
> >
>
> Some what related what I'm objecting to: the "page_pool_iov" is not a
> real page, but this getting recycled into something called "page_pool",
> which funny enough deals with struct-pages internally and depend on the
> struct-page-refcnt.
>
> Given the approach changed way from using struct page, then I also don't
> see the connection with the page_pool. Sorry.
>
> > One way or another, there will be a branch and multiplexing. Whether
> > that is in struct page, the page pool or a new netdev mem type as you
> > propose.
> >
>
> I'm asking to have this branch/multiplexing done a the call sites.
>
> (IMHO not changing the drivers is a pipe-dream.)
>

I think I understand what Jesper is saying. I think Jesper wants the
page_pool to remain unchanged, and another layer on top of it to do
the multiplexing, i.e.:

driver ---> new_layer (does multiplexing) ---> page_pool -------> mm layer
                                \------------------------------>
devmem_pool --> dma-buf layer

But, I think, Jakub wants the page_pool to be the front end, and for
the multiplexing to happen in the page pool (I think, Jakub did not
write this in an email, but this is how I interpret his comments from
[1], and his memory provider RFC). So I think Jakub wants:

driver --> page_pool ---> memory_provider (does multiplexing) --->
basic_provider -------> mm layer

\----------------------------------------> devmem_provider --> dma-buf
layer

That is the approach in this RFC.

I think proper naming that makes sense can be figured out, and is not
a huge issue. I think in both cases we can minimize the changes to the
drivers, maybe. In the first approach the driver will need to use the
APIs created by the new layer. In the second approach, the driver
continues to use page_pool APIs.

I think we need to converge on a path between the 2 approaches (or
maybe 3rd approach to do). For me the pros/cons of each approach
(please add):

multiplexing in new_layer:
- Pro: maybe better for performance? Not sure if static_branch can
achieve the same perf. I can verify with Jesper's perf tests.
- Pro: doesn't add complexity in the page_pool (but adds complexity in
terms of adding new pools like devmem_pool)
- Con: the devmem_pool & page_pool will end up being duplicated code,
I suspect, because they largely do similar things (both need to
recycle memory for example).

multiplexing via memory_provider:
- Pro: no code duplication.
- Pro: less changes to the drivers, I think. The drivers can continue
to use the page_pool API, no need to introduce calls to 'new_layer'.
- Con: adds complexity to the page_pool (needs to handle devmem).
- Con: probably careful handling via static_branch needs to be done to
achieve performance.

[1] https://lore.kernel.org/netdev/20230619110705.106ec599@kernel.org/

> > Any regression in page pool can be avoided in the common case that
> > does not use device mem by placing that behind a static_branch. Would
> > that address your performance concerns?
> >
>
> No. This will not help.
>
> The problem is that every where in the page_pool code it is getting
> polluted with:
>
>    if (page_is_page_pool_iov(page))
>      call-some-iov-func-instead()
>
> Like: the very central piece of getting the refcnt:
>
> +static inline int page_pool_page_ref_count(struct page *page)
> +{
> +       if (page_is_page_pool_iov(page))
> +               return page_pool_iov_refcount(page_to_page_pool_iov(page));
> +
> +       return page_ref_count(page);
> +}
>
>
> The fast-path of the PP is used for XDP_DROP scenarios, and is currently
> around 14 cycles (tsc). Thus, any extra code in this code patch will
> change the fast-path.
>
>
> >>
> >> On the allocation side, all drivers already use a driver helper
> >> page_pool_dev_alloc_pages() or we could add another (better named)
> >> helper to multiplex between other types of allocators, e.g. a devmem
> >> allocator.
> >>
> >> On free/return/recycle the functions napi_pp_put_page or skb_pp_recycle
> >> could multiplex on pp_magic and call another API.  The API could be an
> >> extension to PP helpers, but it could also be a devmap allocator helper.
> >>
> >> IMHO forcing/piggy-bagging everything into page_pool is not the right
> >> solution.  I really think netstack need to support different allocator
> >> types.
> >
> > To me this is lifting page_pool into such a netstack alloctator pool.
> >
>
> This is should be renamed as it is not longer dealing with pages.
>
> > Not sure adding another explicit layer of indirection would be cleaner
> > or faster (potentially more indirect calls).
> >
>
> It seems we are talking past each-other.  The layer of indirection I'm
> talking about is likely a simple header file (e.g. named netmem.h) that
> will get inline compiled so there is no overhead. It will be used by
> driver, such that we can avoid touching driver again when introducing
> new memory allocator types.
>
>
> > As for the LSB trick: that avoided adding a lot of boilerplate churn
> > with new type and helper functions.
> >
>
> Says the lazy programmer :-P ... sorry could not resist ;-)
>
> >
> >
> >> The page pool have been leading the way, yes, but perhaps it is
> >> time to add an API layer that e.g. could be named netmem, that gives us
> >> the multiplexing between allocators.  In that process some of page_pool
> >> APIs would be lifted out as common blocks and others remain.
> >>
> >> --Jesper
> >>
> >>> I have a sample implementation of adding a new page_pool_token type
> >>> in the page_pool to give a general idea here:
> >>> https://github.com/torvalds/linux/commit/3a7628700eb7fd02a117db036003bca50779608d
> >>>
> >>> Full branch here:
> >>> https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-pp-tokens
> >>>
> >>> (In the branches above, page_pool_iov is called devmem_slice).
> >>>
> >>> Could also add static_branch to speed up the checks in page_pool_iov
> >>> memory providers are being used.
> >>>
> >>> Signed-off-by: Mina Almasry <almasrymina@google.com>
> >>> ---
> >>>    include/net/page_pool.h | 74 ++++++++++++++++++++++++++++++++++-
> >>>    net/core/page_pool.c    | 85 ++++++++++++++++++++++++++++-------------
> >>>    2 files changed, 131 insertions(+), 28 deletions(-)
> >>>
> >>> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> >>> index 537eb36115ed..f08ca230d68e 100644
> >>> --- a/include/net/page_pool.h
> >>> +++ b/include/net/page_pool.h
> >>> @@ -282,6 +282,64 @@ static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page)
> >>>        return NULL;
> >>>    }
> >>>
> >>> +static inline int page_pool_page_ref_count(struct page *page)
> >>> +{
> >>> +     if (page_is_page_pool_iov(page))
> >>> +             return page_pool_iov_refcount(page_to_page_pool_iov(page));
> >>> +
> >>> +     return page_ref_count(page);
> >>> +}
> >>> +
> >>> +static inline void page_pool_page_get_many(struct page *page,
> >>> +                                        unsigned int count)
> >>> +{
> >>> +     if (page_is_page_pool_iov(page))
> >>> +             return page_pool_iov_get_many(page_to_page_pool_iov(page),
> >>> +                                           count);
> >>> +
> >>> +     return page_ref_add(page, count);
> >>> +}
> >>> +
> >>> +static inline void page_pool_page_put_many(struct page *page,
> >>> +                                        unsigned int count)
> >>> +{
> >>> +     if (page_is_page_pool_iov(page))
> >>> +             return page_pool_iov_put_many(page_to_page_pool_iov(page),
> >>> +                                           count);
> >>> +
> >>> +     if (count > 1)
> >>> +             page_ref_sub(page, count - 1);
> >>> +
> >>> +     put_page(page);
> >>> +}
> >>> +
> >>> +static inline bool page_pool_page_is_pfmemalloc(struct page *page)
> >>> +{
> >>> +     if (page_is_page_pool_iov(page))
> >>> +             return false;
> >>> +
> >>> +     return page_is_pfmemalloc(page);
> >>> +}
> >>> +
> >>> +static inline bool page_pool_page_is_pref_nid(struct page *page, int pref_nid)
> >>> +{
> >>> +     /* Assume page_pool_iov are on the preferred node without actually
> >>> +      * checking...
> >>> +      *
> >>> +      * This check is only used to check for recycling memory in the page
> >>> +      * pool's fast paths. Currently the only implementation of page_pool_iov
> >>> +      * is dmabuf device memory. It's a deliberate decision by the user to
> >>> +      * bind a certain dmabuf to a certain netdev, and the netdev rx queue
> >>> +      * would not be able to reallocate memory from another dmabuf that
> >>> +      * exists on the preferred node, so, this check doesn't make much sense
> >>> +      * in this case. Assume all page_pool_iovs can be recycled for now.
> >>> +      */
> >>> +     if (page_is_page_pool_iov(page))
> >>> +             return true;
> >>> +
> >>> +     return page_to_nid(page) == pref_nid;
> >>> +}
> >>> +
> >>>    struct page_pool {
> >>>        struct page_pool_params p;
> >>>
> >>> @@ -434,6 +492,9 @@ static inline long page_pool_defrag_page(struct page *page, long nr)
> >>>    {
> >>>        long ret;
> >>>
> >>> +     if (page_is_page_pool_iov(page))
> >>> +             return -EINVAL;
> >>> +
> >>>        /* If nr == pp_frag_count then we have cleared all remaining
> >>>         * references to the page. No need to actually overwrite it, instead
> >>>         * we can leave this to be overwritten by the calling function.
> >>> @@ -494,7 +555,12 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,
> >>>
> >>>    static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
> >>>    {
> >>> -     dma_addr_t ret = page->dma_addr;
> >>> +     dma_addr_t ret;
> >>> +
> >>> +     if (page_is_page_pool_iov(page))
> >>> +             return page_pool_iov_dma_addr(page_to_page_pool_iov(page));
> >>> +
> >>> +     ret = page->dma_addr;
> >>>
> >>>        if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
> >>>                ret |= (dma_addr_t)page->dma_addr_upper << 16 << 16;
> >>> @@ -504,6 +570,12 @@ static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
> >>>
> >>>    static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
> >>>    {
> >>> +     /* page_pool_iovs are mapped and their dma-addr can't be modified. */
> >>> +     if (page_is_page_pool_iov(page)) {
> >>> +             DEBUG_NET_WARN_ON_ONCE(true);
> >>> +             return;
> >>> +     }
> >>> +
> >>>        page->dma_addr = addr;
> >>>        if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
> >>>                page->dma_addr_upper = upper_32_bits(addr);
> >>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> >>> index 0a7c08d748b8..20c1f74fd844 100644
> >>> --- a/net/core/page_pool.c
> >>> +++ b/net/core/page_pool.c
> >>> @@ -318,7 +318,7 @@ static struct page *page_pool_refill_alloc_cache(struct page_pool *pool)
> >>>                if (unlikely(!page))
> >>>                        break;
> >>>
> >>> -             if (likely(page_to_nid(page) == pref_nid)) {
> >>> +             if (likely(page_pool_page_is_pref_nid(page, pref_nid))) {
> >>>                        pool->alloc.cache[pool->alloc.count++] = page;
> >>>                } else {
> >>>                        /* NUMA mismatch;
> >>> @@ -363,7 +363,15 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
> >>>                                          struct page *page,
> >>>                                          unsigned int dma_sync_size)
> >>>    {
> >>> -     dma_addr_t dma_addr = page_pool_get_dma_addr(page);
> >>> +     dma_addr_t dma_addr;
> >>> +
> >>> +     /* page_pool_iov memory provider do not support PP_FLAG_DMA_SYNC_DEV */
> >>> +     if (page_is_page_pool_iov(page)) {
> >>> +             DEBUG_NET_WARN_ON_ONCE(true);
> >>> +             return;
> >>> +     }
> >>> +
> >>> +     dma_addr = page_pool_get_dma_addr(page);
> >>>
> >>>        dma_sync_size = min(dma_sync_size, pool->p.max_len);
> >>>        dma_sync_single_range_for_device(pool->p.dev, dma_addr,
> >>> @@ -375,6 +383,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
> >>>    {
> >>>        dma_addr_t dma;
> >>>
> >>> +     if (page_is_page_pool_iov(page)) {
> >>> +             /* page_pool_iovs are already mapped */
> >>> +             DEBUG_NET_WARN_ON_ONCE(true);
> >>> +             return true;
> >>> +     }
> >>> +
> >>>        /* Setup DMA mapping: use 'struct page' area for storing DMA-addr
> >>>         * since dma_addr_t can be either 32 or 64 bits and does not always fit
> >>>         * into page private data (i.e 32bit cpu with 64bit DMA caps)
> >>> @@ -398,14 +412,24 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
> >>>    static void page_pool_set_pp_info(struct page_pool *pool,
> >>>                                  struct page *page)
> >>>    {
> >>> -     page->pp = pool;
> >>> -     page->pp_magic |= PP_SIGNATURE;
> >>> +     if (!page_is_page_pool_iov(page)) {
> >>> +             page->pp = pool;
> >>> +             page->pp_magic |= PP_SIGNATURE;
> >>> +     } else {
> >>> +             page_to_page_pool_iov(page)->pp = pool;
> >>> +     }
> >>> +
> >>>        if (pool->p.init_callback)
> >>>                pool->p.init_callback(page, pool->p.init_arg);
> >>>    }
> >>>
> >>>    static void page_pool_clear_pp_info(struct page *page)
> >>>    {
> >>> +     if (page_is_page_pool_iov(page)) {
> >>> +             page_to_page_pool_iov(page)->pp = NULL;
> >>> +             return;
> >>> +     }
> >>> +
> >>>        page->pp_magic = 0;
> >>>        page->pp = NULL;
> >>>    }
> >>> @@ -615,7 +639,7 @@ static bool page_pool_recycle_in_cache(struct page *page,
> >>>                return false;
> >>>        }
> >>>
> >>> -     /* Caller MUST have verified/know (page_ref_count(page) == 1) */
> >>> +     /* Caller MUST have verified/know (page_pool_page_ref_count(page) == 1) */
> >>>        pool->alloc.cache[pool->alloc.count++] = page;
> >>>        recycle_stat_inc(pool, cached);
> >>>        return true;
> >>> @@ -638,9 +662,10 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
> >>>         * refcnt == 1 means page_pool owns page, and can recycle it.
> >>>         *
> >>>         * page is NOT reusable when allocated when system is under
> >>> -      * some pressure. (page_is_pfmemalloc)
> >>> +      * some pressure. (page_pool_page_is_pfmemalloc)
> >>>         */
> >>> -     if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
> >>> +     if (likely(page_pool_page_ref_count(page) == 1 &&
> >>> +                !page_pool_page_is_pfmemalloc(page))) {
> >>>                /* Read barrier done in page_ref_count / READ_ONCE */
> >>>
> >>>                if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> >>> @@ -741,7 +766,8 @@ static struct page *page_pool_drain_frag(struct page_pool *pool,
> >>>        if (likely(page_pool_defrag_page(page, drain_count)))
> >>>                return NULL;
> >>>
> >>> -     if (page_ref_count(page) == 1 && !page_is_pfmemalloc(page)) {
> >>> +     if (page_pool_page_ref_count(page) == 1 &&
> >>> +         !page_pool_page_is_pfmemalloc(page)) {
> >>>                if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> >>>                        page_pool_dma_sync_for_device(pool, page, -1);
> >>>
> >>> @@ -818,9 +844,9 @@ static void page_pool_empty_ring(struct page_pool *pool)
> >>>        /* Empty recycle ring */
> >>>        while ((page = ptr_ring_consume_bh(&pool->ring))) {
> >>>                /* Verify the refcnt invariant of cached pages */
> >>> -             if (!(page_ref_count(page) == 1))
> >>> +             if (!(page_pool_page_ref_count(page) == 1))
> >>>                        pr_crit("%s() page_pool refcnt %d violation\n",
> >>> -                             __func__, page_ref_count(page));
> >>> +                             __func__, page_pool_page_ref_count(page));
> >>>
> >>>                page_pool_return_page(pool, page);
> >>>        }
> >>> @@ -977,19 +1003,24 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe)
> >>>        struct page_pool *pp;
> >>>        bool allow_direct;
> >>>
> >>> -     page = compound_head(page);
> >>> +     if (!page_is_page_pool_iov(page)) {
> >>> +             page = compound_head(page);
> >>>
> >>> -     /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
> >>> -      * in order to preserve any existing bits, such as bit 0 for the
> >>> -      * head page of compound page and bit 1 for pfmemalloc page, so
> >>> -      * mask those bits for freeing side when doing below checking,
> >>> -      * and page_is_pfmemalloc() is checked in __page_pool_put_page()
> >>> -      * to avoid recycling the pfmemalloc page.
> >>> -      */
> >>> -     if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
> >>> -             return false;
> >>> +             /* page->pp_magic is OR'ed with PP_SIGNATURE after the
> >>> +              * allocation in order to preserve any existing bits, such as
> >>> +              * bit 0 for the head page of compound page and bit 1 for
> >>> +              * pfmemalloc page, so mask those bits for freeing side when
> >>> +              * doing below checking, and page_pool_page_is_pfmemalloc() is
> >>> +              * checked in __page_pool_put_page() to avoid recycling the
> >>> +              * pfmemalloc page.
> >>> +              */
> >>> +             if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
> >>> +                     return false;
> >>>
> >>> -     pp = page->pp;
> >>> +             pp = page->pp;
> >>> +     } else {
> >>> +             pp = page_to_page_pool_iov(page)->pp;
> >>> +     }
> >>>
> >>>        /* Allow direct recycle if we have reasons to believe that we are
> >>>         * in the same context as the consumer would run, so there's
> >>> @@ -1273,9 +1304,9 @@ static bool mp_huge_busy(struct mp_huge *hu, unsigned int idx)
> >>>
> >>>        for (j = 0; j < (1 << MP_HUGE_ORDER); j++) {
> >>>                page = hu->page[idx] + j;
> >>> -             if (page_ref_count(page) != 1) {
> >>> +             if (page_pool_page_ref_count(page) != 1) {
> >>>                        pr_warn("Page with ref count %d at %u, %u. Can't safely destory, leaking memory!\n",
> >>> -                             page_ref_count(page), idx, j);
> >>> +                             page_pool_page_ref_count(page), idx, j);
> >>>                        return true;
> >>>                }
> >>>        }
> >>> @@ -1330,7 +1361,7 @@ static struct page *mp_huge_alloc_pages(struct page_pool *pool, gfp_t gfp)
> >>>                        continue;
> >>>
> >>>                if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
> >>> -                 page_ref_count(page) != 1) {
> >>> +                 page_pool_page_ref_count(page) != 1) {
> >>>                        atomic_inc(&mp_huge_ins_b);
> >>>                        continue;
> >>>                }
> >>> @@ -1458,9 +1489,9 @@ static void mp_huge_1g_destroy(struct page_pool *pool)
> >>>        free = true;
> >>>        for (i = 0; i < MP_HUGE_1G_CNT; i++) {
> >>>                page = hu->page + i;
> >>> -             if (page_ref_count(page) != 1) {
> >>> +             if (page_pool_page_ref_count(page) != 1) {
> >>>                        pr_warn("Page with ref count %d at %u. Can't safely destory, leaking memory!\n",
> >>> -                             page_ref_count(page), i);
> >>> +                             page_pool_page_ref_count(page), i);
> >>>                        free = false;
> >>>                        break;
> >>>                }
> >>> @@ -1489,7 +1520,7 @@ static struct page *mp_huge_1g_alloc_pages(struct page_pool *pool, gfp_t gfp)
> >>>                page = hu->page + page_i;
> >>>
> >>>                if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
> >>> -                 page_ref_count(page) != 1) {
> >>> +                 page_pool_page_ref_count(page) != 1) {
> >>>                        atomic_inc(&mp_huge_ins_b);
> >>>                        continue;
> >>>                }
> >>> --
> >>> 2.41.0.640.ga95def55d0-goog
> >>>
> >>
> >
>


--
Thanks,
Mina
Mina Almasry Aug. 19, 2023, 8:27 p.m. UTC | #8
On Sat, Aug 19, 2023 at 1:24 PM Mina Almasry <almasrymina@google.com> wrote:
>
> On Sat, Aug 19, 2023 at 8:22 AM Jesper Dangaard Brouer
> <jbrouer@redhat.com> wrote:
> >
> >
> >
> > On 19/08/2023 16.08, Willem de Bruijn wrote:
> > > On Sat, Aug 19, 2023 at 5:51 AM Jesper Dangaard Brouer
> > > <jbrouer@redhat.com> wrote:
> > >>
> > >>
> > >>
> > >> On 10/08/2023 03.57, Mina Almasry wrote:
> > >>> Overload the LSB of struct page* to indicate that it's a page_pool_iov.
> > >>>
> > >>> Refactor mm calls on struct page * into helpers, and add page_pool_iov
> > >>> handling on those helpers. Modify callers of these mm APIs with calls to
> > >>> these helpers instead.
> > >>>
> > >>
> > >> I don't like of this approach.
> > >> This is adding code to the PP (page_pool) fast-path in multiple places.
> > >>
> > >> I've not had time to run my usual benchmarks, which are here:
> > >>
> > >> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
> > >>
>
> Thank you for linking that, I'll try to run these against the next submission.
>
> > >> But I'm sure it will affect performance.
> > >>
> > >> Regardless of performance, this approach is using ptr-LSB-bits, to hide
> > >> that page-pointer are not really struct-pages, feels like force feeding
> > >> a solution just to use the page_pool APIs.
> > >>
> > >>
> > >>> In areas where struct page* is dereferenced, add a check for special
> > >>> handling of page_pool_iov.
> > >>>
> > >>> The memory providers producing page_pool_iov can set the LSB on the
> > >>> struct page* returned to the page pool.
> > >>>
> > >>> Note that instead of overloading the LSB of page pointers, we can
> > >>> instead define a new union between struct page & struct page_pool_iov and
> > >>> compact it in a new type. However, we'd need to implement the code churn
> > >>> to modify the page_pool & drivers to use this new type. For this POC
> > >>> that is not implemented (feedback welcome).
> > >>>
> > >>
> > >> I've said before, that I prefer multiplexing on page->pp_magic.
> > >> For your page_pool_iov the layout would have to match the offset of
> > >> pp_magic, to do this. (And if insisting on using PP infra the refcnt
> > >> would also need to align).
> > >
> > > Perhaps I misunderstand, but this suggests continuing to using
> > > struct page to demultiplex memory type?
> > >
> >
> > (Perhaps we are misunderstanding each-other and my use of the words
> > multiplexing and demultiplex are wrong, I'm sorry, as English isn't my
> > native language.)
> >
> > I do see the problem of depending on having a struct page, as the
> > page_pool_iov isn't related to struct page.  Having "page" in the name
> > of "page_pool_iov" is also confusing (hardest problem is CS is naming,
> > as we all know).
> >
> > To support more allocator types, perhaps skb->pp_recycle bit need to
> > grow another bit (and be renamed skb->recycle), so we can tell allocator
> > types apart, those that are page based and those whom are not.
> >
> >
> > > I think the feedback has been strong to not multiplex yet another
> > > memory type into that struct, that is not a real page. Which is why
> > > we went into this direction. This latest series limits the impact largely
> > > to networking structures and code.
> > >
> >
> > Some what related what I'm objecting to: the "page_pool_iov" is not a
> > real page, but this getting recycled into something called "page_pool",
> > which funny enough deals with struct-pages internally and depend on the
> > struct-page-refcnt.
> >
> > Given the approach changed way from using struct page, then I also don't
> > see the connection with the page_pool. Sorry.
> >
> > > One way or another, there will be a branch and multiplexing. Whether
> > > that is in struct page, the page pool or a new netdev mem type as you
> > > propose.
> > >
> >
> > I'm asking to have this branch/multiplexing done a the call sites.
> >
> > (IMHO not changing the drivers is a pipe-dream.)
> >
>
> I think I understand what Jesper is saying. I think Jesper wants the
> page_pool to remain unchanged, and another layer on top of it to do
> the multiplexing, i.e.:
>
> driver ---> new_layer (does multiplexing) ---> page_pool -------> mm layer
>                                 \------------------------------>
> devmem_pool --> dma-buf layer
>

Gmail mangled this :/ Let me try again. This should read:

driver -> new_layer -> page_pool ------> mm layer
                      \--------->devmem_pool -> dma-buf

> But, I think, Jakub wants the page_pool to be the front end, and for
> the multiplexing to happen in the page pool (I think, Jakub did not
> write this in an email, but this is how I interpret his comments from
> [1], and his memory provider RFC). So I think Jakub wants:
>
> driver --> page_pool ---> memory_provider (does multiplexing) --->
> basic_provider -------> mm layer
>
> \----------------------------------------> devmem_provider --> dma-buf
> layer
>

This should read:
driver -> pp -> memory provider -> basic provider -> mm
                                   \--------------> devmem provider -> dma-buf

Sorry for the spam!

> That is the approach in this RFC.
>
> I think proper naming that makes sense can be figured out, and is not
> a huge issue. I think in both cases we can minimize the changes to the
> drivers, maybe. In the first approach the driver will need to use the
> APIs created by the new layer. In the second approach, the driver
> continues to use page_pool APIs.
>
> I think we need to converge on a path between the 2 approaches (or
> maybe 3rd approach to do). For me the pros/cons of each approach
> (please add):
>
> multiplexing in new_layer:
> - Pro: maybe better for performance? Not sure if static_branch can
> achieve the same perf. I can verify with Jesper's perf tests.
> - Pro: doesn't add complexity in the page_pool (but adds complexity in
> terms of adding new pools like devmem_pool)
> - Con: the devmem_pool & page_pool will end up being duplicated code,
> I suspect, because they largely do similar things (both need to
> recycle memory for example).
>
> multiplexing via memory_provider:
> - Pro: no code duplication.
> - Pro: less changes to the drivers, I think. The drivers can continue
> to use the page_pool API, no need to introduce calls to 'new_layer'.
> - Con: adds complexity to the page_pool (needs to handle devmem).
> - Con: probably careful handling via static_branch needs to be done to
> achieve performance.
>
> [1] https://lore.kernel.org/netdev/20230619110705.106ec599@kernel.org/
>
> > > Any regression in page pool can be avoided in the common case that
> > > does not use device mem by placing that behind a static_branch. Would
> > > that address your performance concerns?
> > >
> >
> > No. This will not help.
> >
> > The problem is that every where in the page_pool code it is getting
> > polluted with:
> >
> >    if (page_is_page_pool_iov(page))
> >      call-some-iov-func-instead()
> >
> > Like: the very central piece of getting the refcnt:
> >
> > +static inline int page_pool_page_ref_count(struct page *page)
> > +{
> > +       if (page_is_page_pool_iov(page))
> > +               return page_pool_iov_refcount(page_to_page_pool_iov(page));
> > +
> > +       return page_ref_count(page);
> > +}
> >
> >
> > The fast-path of the PP is used for XDP_DROP scenarios, and is currently
> > around 14 cycles (tsc). Thus, any extra code in this code patch will
> > change the fast-path.
> >
> >
> > >>
> > >> On the allocation side, all drivers already use a driver helper
> > >> page_pool_dev_alloc_pages() or we could add another (better named)
> > >> helper to multiplex between other types of allocators, e.g. a devmem
> > >> allocator.
> > >>
> > >> On free/return/recycle the functions napi_pp_put_page or skb_pp_recycle
> > >> could multiplex on pp_magic and call another API.  The API could be an
> > >> extension to PP helpers, but it could also be a devmap allocator helper.
> > >>
> > >> IMHO forcing/piggy-bagging everything into page_pool is not the right
> > >> solution.  I really think netstack need to support different allocator
> > >> types.
> > >
> > > To me this is lifting page_pool into such a netstack alloctator pool.
> > >
> >
> > This is should be renamed as it is not longer dealing with pages.
> >
> > > Not sure adding another explicit layer of indirection would be cleaner
> > > or faster (potentially more indirect calls).
> > >
> >
> > It seems we are talking past each-other.  The layer of indirection I'm
> > talking about is likely a simple header file (e.g. named netmem.h) that
> > will get inline compiled so there is no overhead. It will be used by
> > driver, such that we can avoid touching driver again when introducing
> > new memory allocator types.
> >
> >
> > > As for the LSB trick: that avoided adding a lot of boilerplate churn
> > > with new type and helper functions.
> > >
> >
> > Says the lazy programmer :-P ... sorry could not resist ;-)
> >
> > >
> > >
> > >> The page pool have been leading the way, yes, but perhaps it is
> > >> time to add an API layer that e.g. could be named netmem, that gives us
> > >> the multiplexing between allocators.  In that process some of page_pool
> > >> APIs would be lifted out as common blocks and others remain.
> > >>
> > >> --Jesper
> > >>
> > >>> I have a sample implementation of adding a new page_pool_token type
> > >>> in the page_pool to give a general idea here:
> > >>> https://github.com/torvalds/linux/commit/3a7628700eb7fd02a117db036003bca50779608d
> > >>>
> > >>> Full branch here:
> > >>> https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-pp-tokens
> > >>>
> > >>> (In the branches above, page_pool_iov is called devmem_slice).
> > >>>
> > >>> Could also add static_branch to speed up the checks in page_pool_iov
> > >>> memory providers are being used.
> > >>>
> > >>> Signed-off-by: Mina Almasry <almasrymina@google.com>
> > >>> ---
> > >>>    include/net/page_pool.h | 74 ++++++++++++++++++++++++++++++++++-
> > >>>    net/core/page_pool.c    | 85 ++++++++++++++++++++++++++++-------------
> > >>>    2 files changed, 131 insertions(+), 28 deletions(-)
> > >>>
> > >>> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> > >>> index 537eb36115ed..f08ca230d68e 100644
> > >>> --- a/include/net/page_pool.h
> > >>> +++ b/include/net/page_pool.h
> > >>> @@ -282,6 +282,64 @@ static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page)
> > >>>        return NULL;
> > >>>    }
> > >>>
> > >>> +static inline int page_pool_page_ref_count(struct page *page)
> > >>> +{
> > >>> +     if (page_is_page_pool_iov(page))
> > >>> +             return page_pool_iov_refcount(page_to_page_pool_iov(page));
> > >>> +
> > >>> +     return page_ref_count(page);
> > >>> +}
> > >>> +
> > >>> +static inline void page_pool_page_get_many(struct page *page,
> > >>> +                                        unsigned int count)
> > >>> +{
> > >>> +     if (page_is_page_pool_iov(page))
> > >>> +             return page_pool_iov_get_many(page_to_page_pool_iov(page),
> > >>> +                                           count);
> > >>> +
> > >>> +     return page_ref_add(page, count);
> > >>> +}
> > >>> +
> > >>> +static inline void page_pool_page_put_many(struct page *page,
> > >>> +                                        unsigned int count)
> > >>> +{
> > >>> +     if (page_is_page_pool_iov(page))
> > >>> +             return page_pool_iov_put_many(page_to_page_pool_iov(page),
> > >>> +                                           count);
> > >>> +
> > >>> +     if (count > 1)
> > >>> +             page_ref_sub(page, count - 1);
> > >>> +
> > >>> +     put_page(page);
> > >>> +}
> > >>> +
> > >>> +static inline bool page_pool_page_is_pfmemalloc(struct page *page)
> > >>> +{
> > >>> +     if (page_is_page_pool_iov(page))
> > >>> +             return false;
> > >>> +
> > >>> +     return page_is_pfmemalloc(page);
> > >>> +}
> > >>> +
> > >>> +static inline bool page_pool_page_is_pref_nid(struct page *page, int pref_nid)
> > >>> +{
> > >>> +     /* Assume page_pool_iov are on the preferred node without actually
> > >>> +      * checking...
> > >>> +      *
> > >>> +      * This check is only used to check for recycling memory in the page
> > >>> +      * pool's fast paths. Currently the only implementation of page_pool_iov
> > >>> +      * is dmabuf device memory. It's a deliberate decision by the user to
> > >>> +      * bind a certain dmabuf to a certain netdev, and the netdev rx queue
> > >>> +      * would not be able to reallocate memory from another dmabuf that
> > >>> +      * exists on the preferred node, so, this check doesn't make much sense
> > >>> +      * in this case. Assume all page_pool_iovs can be recycled for now.
> > >>> +      */
> > >>> +     if (page_is_page_pool_iov(page))
> > >>> +             return true;
> > >>> +
> > >>> +     return page_to_nid(page) == pref_nid;
> > >>> +}
> > >>> +
> > >>>    struct page_pool {
> > >>>        struct page_pool_params p;
> > >>>
> > >>> @@ -434,6 +492,9 @@ static inline long page_pool_defrag_page(struct page *page, long nr)
> > >>>    {
> > >>>        long ret;
> > >>>
> > >>> +     if (page_is_page_pool_iov(page))
> > >>> +             return -EINVAL;
> > >>> +
> > >>>        /* If nr == pp_frag_count then we have cleared all remaining
> > >>>         * references to the page. No need to actually overwrite it, instead
> > >>>         * we can leave this to be overwritten by the calling function.
> > >>> @@ -494,7 +555,12 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,
> > >>>
> > >>>    static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
> > >>>    {
> > >>> -     dma_addr_t ret = page->dma_addr;
> > >>> +     dma_addr_t ret;
> > >>> +
> > >>> +     if (page_is_page_pool_iov(page))
> > >>> +             return page_pool_iov_dma_addr(page_to_page_pool_iov(page));
> > >>> +
> > >>> +     ret = page->dma_addr;
> > >>>
> > >>>        if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
> > >>>                ret |= (dma_addr_t)page->dma_addr_upper << 16 << 16;
> > >>> @@ -504,6 +570,12 @@ static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
> > >>>
> > >>>    static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
> > >>>    {
> > >>> +     /* page_pool_iovs are mapped and their dma-addr can't be modified. */
> > >>> +     if (page_is_page_pool_iov(page)) {
> > >>> +             DEBUG_NET_WARN_ON_ONCE(true);
> > >>> +             return;
> > >>> +     }
> > >>> +
> > >>>        page->dma_addr = addr;
> > >>>        if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
> > >>>                page->dma_addr_upper = upper_32_bits(addr);
> > >>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> > >>> index 0a7c08d748b8..20c1f74fd844 100644
> > >>> --- a/net/core/page_pool.c
> > >>> +++ b/net/core/page_pool.c
> > >>> @@ -318,7 +318,7 @@ static struct page *page_pool_refill_alloc_cache(struct page_pool *pool)
> > >>>                if (unlikely(!page))
> > >>>                        break;
> > >>>
> > >>> -             if (likely(page_to_nid(page) == pref_nid)) {
> > >>> +             if (likely(page_pool_page_is_pref_nid(page, pref_nid))) {
> > >>>                        pool->alloc.cache[pool->alloc.count++] = page;
> > >>>                } else {
> > >>>                        /* NUMA mismatch;
> > >>> @@ -363,7 +363,15 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
> > >>>                                          struct page *page,
> > >>>                                          unsigned int dma_sync_size)
> > >>>    {
> > >>> -     dma_addr_t dma_addr = page_pool_get_dma_addr(page);
> > >>> +     dma_addr_t dma_addr;
> > >>> +
> > >>> +     /* page_pool_iov memory provider do not support PP_FLAG_DMA_SYNC_DEV */
> > >>> +     if (page_is_page_pool_iov(page)) {
> > >>> +             DEBUG_NET_WARN_ON_ONCE(true);
> > >>> +             return;
> > >>> +     }
> > >>> +
> > >>> +     dma_addr = page_pool_get_dma_addr(page);
> > >>>
> > >>>        dma_sync_size = min(dma_sync_size, pool->p.max_len);
> > >>>        dma_sync_single_range_for_device(pool->p.dev, dma_addr,
> > >>> @@ -375,6 +383,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
> > >>>    {
> > >>>        dma_addr_t dma;
> > >>>
> > >>> +     if (page_is_page_pool_iov(page)) {
> > >>> +             /* page_pool_iovs are already mapped */
> > >>> +             DEBUG_NET_WARN_ON_ONCE(true);
> > >>> +             return true;
> > >>> +     }
> > >>> +
> > >>>        /* Setup DMA mapping: use 'struct page' area for storing DMA-addr
> > >>>         * since dma_addr_t can be either 32 or 64 bits and does not always fit
> > >>>         * into page private data (i.e 32bit cpu with 64bit DMA caps)
> > >>> @@ -398,14 +412,24 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
> > >>>    static void page_pool_set_pp_info(struct page_pool *pool,
> > >>>                                  struct page *page)
> > >>>    {
> > >>> -     page->pp = pool;
> > >>> -     page->pp_magic |= PP_SIGNATURE;
> > >>> +     if (!page_is_page_pool_iov(page)) {
> > >>> +             page->pp = pool;
> > >>> +             page->pp_magic |= PP_SIGNATURE;
> > >>> +     } else {
> > >>> +             page_to_page_pool_iov(page)->pp = pool;
> > >>> +     }
> > >>> +
> > >>>        if (pool->p.init_callback)
> > >>>                pool->p.init_callback(page, pool->p.init_arg);
> > >>>    }
> > >>>
> > >>>    static void page_pool_clear_pp_info(struct page *page)
> > >>>    {
> > >>> +     if (page_is_page_pool_iov(page)) {
> > >>> +             page_to_page_pool_iov(page)->pp = NULL;
> > >>> +             return;
> > >>> +     }
> > >>> +
> > >>>        page->pp_magic = 0;
> > >>>        page->pp = NULL;
> > >>>    }
> > >>> @@ -615,7 +639,7 @@ static bool page_pool_recycle_in_cache(struct page *page,
> > >>>                return false;
> > >>>        }
> > >>>
> > >>> -     /* Caller MUST have verified/know (page_ref_count(page) == 1) */
> > >>> +     /* Caller MUST have verified/know (page_pool_page_ref_count(page) == 1) */
> > >>>        pool->alloc.cache[pool->alloc.count++] = page;
> > >>>        recycle_stat_inc(pool, cached);
> > >>>        return true;
> > >>> @@ -638,9 +662,10 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
> > >>>         * refcnt == 1 means page_pool owns page, and can recycle it.
> > >>>         *
> > >>>         * page is NOT reusable when allocated when system is under
> > >>> -      * some pressure. (page_is_pfmemalloc)
> > >>> +      * some pressure. (page_pool_page_is_pfmemalloc)
> > >>>         */
> > >>> -     if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
> > >>> +     if (likely(page_pool_page_ref_count(page) == 1 &&
> > >>> +                !page_pool_page_is_pfmemalloc(page))) {
> > >>>                /* Read barrier done in page_ref_count / READ_ONCE */
> > >>>
> > >>>                if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> > >>> @@ -741,7 +766,8 @@ static struct page *page_pool_drain_frag(struct page_pool *pool,
> > >>>        if (likely(page_pool_defrag_page(page, drain_count)))
> > >>>                return NULL;
> > >>>
> > >>> -     if (page_ref_count(page) == 1 && !page_is_pfmemalloc(page)) {
> > >>> +     if (page_pool_page_ref_count(page) == 1 &&
> > >>> +         !page_pool_page_is_pfmemalloc(page)) {
> > >>>                if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> > >>>                        page_pool_dma_sync_for_device(pool, page, -1);
> > >>>
> > >>> @@ -818,9 +844,9 @@ static void page_pool_empty_ring(struct page_pool *pool)
> > >>>        /* Empty recycle ring */
> > >>>        while ((page = ptr_ring_consume_bh(&pool->ring))) {
> > >>>                /* Verify the refcnt invariant of cached pages */
> > >>> -             if (!(page_ref_count(page) == 1))
> > >>> +             if (!(page_pool_page_ref_count(page) == 1))
> > >>>                        pr_crit("%s() page_pool refcnt %d violation\n",
> > >>> -                             __func__, page_ref_count(page));
> > >>> +                             __func__, page_pool_page_ref_count(page));
> > >>>
> > >>>                page_pool_return_page(pool, page);
> > >>>        }
> > >>> @@ -977,19 +1003,24 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe)
> > >>>        struct page_pool *pp;
> > >>>        bool allow_direct;
> > >>>
> > >>> -     page = compound_head(page);
> > >>> +     if (!page_is_page_pool_iov(page)) {
> > >>> +             page = compound_head(page);
> > >>>
> > >>> -     /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
> > >>> -      * in order to preserve any existing bits, such as bit 0 for the
> > >>> -      * head page of compound page and bit 1 for pfmemalloc page, so
> > >>> -      * mask those bits for freeing side when doing below checking,
> > >>> -      * and page_is_pfmemalloc() is checked in __page_pool_put_page()
> > >>> -      * to avoid recycling the pfmemalloc page.
> > >>> -      */
> > >>> -     if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
> > >>> -             return false;
> > >>> +             /* page->pp_magic is OR'ed with PP_SIGNATURE after the
> > >>> +              * allocation in order to preserve any existing bits, such as
> > >>> +              * bit 0 for the head page of compound page and bit 1 for
> > >>> +              * pfmemalloc page, so mask those bits for freeing side when
> > >>> +              * doing below checking, and page_pool_page_is_pfmemalloc() is
> > >>> +              * checked in __page_pool_put_page() to avoid recycling the
> > >>> +              * pfmemalloc page.
> > >>> +              */
> > >>> +             if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
> > >>> +                     return false;
> > >>>
> > >>> -     pp = page->pp;
> > >>> +             pp = page->pp;
> > >>> +     } else {
> > >>> +             pp = page_to_page_pool_iov(page)->pp;
> > >>> +     }
> > >>>
> > >>>        /* Allow direct recycle if we have reasons to believe that we are
> > >>>         * in the same context as the consumer would run, so there's
> > >>> @@ -1273,9 +1304,9 @@ static bool mp_huge_busy(struct mp_huge *hu, unsigned int idx)
> > >>>
> > >>>        for (j = 0; j < (1 << MP_HUGE_ORDER); j++) {
> > >>>                page = hu->page[idx] + j;
> > >>> -             if (page_ref_count(page) != 1) {
> > >>> +             if (page_pool_page_ref_count(page) != 1) {
> > >>>                        pr_warn("Page with ref count %d at %u, %u. Can't safely destory, leaking memory!\n",
> > >>> -                             page_ref_count(page), idx, j);
> > >>> +                             page_pool_page_ref_count(page), idx, j);
> > >>>                        return true;
> > >>>                }
> > >>>        }
> > >>> @@ -1330,7 +1361,7 @@ static struct page *mp_huge_alloc_pages(struct page_pool *pool, gfp_t gfp)
> > >>>                        continue;
> > >>>
> > >>>                if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
> > >>> -                 page_ref_count(page) != 1) {
> > >>> +                 page_pool_page_ref_count(page) != 1) {
> > >>>                        atomic_inc(&mp_huge_ins_b);
> > >>>                        continue;
> > >>>                }
> > >>> @@ -1458,9 +1489,9 @@ static void mp_huge_1g_destroy(struct page_pool *pool)
> > >>>        free = true;
> > >>>        for (i = 0; i < MP_HUGE_1G_CNT; i++) {
> > >>>                page = hu->page + i;
> > >>> -             if (page_ref_count(page) != 1) {
> > >>> +             if (page_pool_page_ref_count(page) != 1) {
> > >>>                        pr_warn("Page with ref count %d at %u. Can't safely destory, leaking memory!\n",
> > >>> -                             page_ref_count(page), i);
> > >>> +                             page_pool_page_ref_count(page), i);
> > >>>                        free = false;
> > >>>                        break;
> > >>>                }
> > >>> @@ -1489,7 +1520,7 @@ static struct page *mp_huge_1g_alloc_pages(struct page_pool *pool, gfp_t gfp)
> > >>>                page = hu->page + page_i;
> > >>>
> > >>>                if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
> > >>> -                 page_ref_count(page) != 1) {
> > >>> +                 page_pool_page_ref_count(page) != 1) {
> > >>>                        atomic_inc(&mp_huge_ins_b);
> > >>>                        continue;
> > >>>                }
> > >>> --
> > >>> 2.41.0.640.ga95def55d0-goog
> > >>>
> > >>
> > >
> >
>
>
> --
> Thanks,
> Mina
Jakub Kicinski Aug. 21, 2023, 9:31 p.m. UTC | #9
On Sat, 19 Aug 2023 12:12:16 -0400 Willem de Bruijn wrote:
> :-) For the record, there is a prior version that added a separate type.
> 
> I did not like the churn it brought and asked for this.

It does end up looking cleaner that I personally expected, FWIW.

> > Use of the LSB (or bits depending on alignment expectations) is a common
> > trick and already done in quite a few places in the networking stack.
> > This trick is essential to any realistic change here to incorporate gpu
> > memory; way too much code will have unnecessary churn without it.

We'll end up needing the LSB trick either way, right? The only question
is whether the "if" is part of page pool or the caller of page pool.

Having seen zctap I'm afraid if we push this out of pp every provider
will end up re-implementing page pool's recycling/caching functionality
:(

Maybe we need to "fork" the API? The device memory "ifs" are only needed
for data pages. Which means that we can retain a faster, "if-less" API
for headers and XDP. Or is that too much duplication?
Willem de Bruijn Aug. 22, 2023, 12:58 a.m. UTC | #10
On Mon, Aug 21, 2023 at 5:31 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Sat, 19 Aug 2023 12:12:16 -0400 Willem de Bruijn wrote:
> > :-) For the record, there is a prior version that added a separate type.
> >
> > I did not like the churn it brought and asked for this.
>
> It does end up looking cleaner that I personally expected, FWIW.
>
> > > Use of the LSB (or bits depending on alignment expectations) is a common
> > > trick and already done in quite a few places in the networking stack.
> > > This trick is essential to any realistic change here to incorporate gpu
> > > memory; way too much code will have unnecessary churn without it.
>
> We'll end up needing the LSB trick either way, right? The only question
> is whether the "if" is part of page pool or the caller of page pool.

Indeed. Adding layering does not remove this.

> Having seen zctap I'm afraid if we push this out of pp every provider
> will end up re-implementing page pool's recycling/caching functionality
> :(
>
> Maybe we need to "fork" the API? The device memory "ifs" are only needed
> for data pages. Which means that we can retain a faster, "if-less" API
> for headers and XDP. Or is that too much duplication?

I don't think that would be faster. Just a different structuring of
the code. We still need to take one of multiple paths for, say, page
allocation (page_pool_alloc_pages).

If having a struct page_pool and struct mem_pool, there would still be
a type-based branch. But now either in every caller (yech), or in some
thin shim layer. Why not just have it behind the existing API? That is
what your memory provider does. The only difference now is that one of
the providers really does not deal with pages.

I think this multiplexing does not have to introduce performance
regressions with well placed static_branch. Benchmarks and
side-by-side comparison of assembly will have to verify that.

Indirect function calls need to be avoided, of course, in favor of a
type based switch. And unless the non_default_providers static branch
is enabled, the default path is taken unconditionally.

If it no longer is a pure page pool, maybe it can be renamed. I
personally don't care.
Mina Almasry Aug. 22, 2023, 6:05 a.m. UTC | #11
On Sat, Aug 19, 2023 at 2:51 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
>
> On 10/08/2023 03.57, Mina Almasry wrote:
> > Overload the LSB of struct page* to indicate that it's a page_pool_iov.
> >
> > Refactor mm calls on struct page * into helpers, and add page_pool_iov
> > handling on those helpers. Modify callers of these mm APIs with calls to
> > these helpers instead.
> >
>
> I don't like of this approach.
> This is adding code to the PP (page_pool) fast-path in multiple places.
>
> I've not had time to run my usual benchmarks, which are here:
>
> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
>

I ported over this benchmark to my tree and ran it, my results:

net-next @ b44693495af8
https://pastebin.com/raw/JuU7UQXe

+ Jakub's memory-provider APIs:
https://pastebin.com/raw/StMBhetn

+ devmem TCP changes:
https://pastebin.com/raw/mY1L6U4r

+ intentional regression just to make sure the benchmark is working:
https://pastebin.com/raw/wqWhcJdG

I don't seem to be able to detect a regression with this series as-is,
but I'm not that familiar with the test and may be doing something
wrong or misinterpreting the results. Does this look ok to you?

> But I'm sure it will affect performance.
>
> Regardless of performance, this approach is using ptr-LSB-bits, to hide
> that page-pointer are not really struct-pages, feels like force feeding
> a solution just to use the page_pool APIs.
>
>
> > In areas where struct page* is dereferenced, add a check for special
> > handling of page_pool_iov.
> >
> > The memory providers producing page_pool_iov can set the LSB on the
> > struct page* returned to the page pool.
> >
> > Note that instead of overloading the LSB of page pointers, we can
> > instead define a new union between struct page & struct page_pool_iov and
> > compact it in a new type. However, we'd need to implement the code churn
> > to modify the page_pool & drivers to use this new type. For this POC
> > that is not implemented (feedback welcome).
> >
>
> I've said before, that I prefer multiplexing on page->pp_magic.
> For your page_pool_iov the layout would have to match the offset of
> pp_magic, to do this. (And if insisting on using PP infra the refcnt
> would also need to align).
>
> On the allocation side, all drivers already use a driver helper
> page_pool_dev_alloc_pages() or we could add another (better named)
> helper to multiplex between other types of allocators, e.g. a devmem
> allocator.
>
> On free/return/recycle the functions napi_pp_put_page or skb_pp_recycle
> could multiplex on pp_magic and call another API.  The API could be an
> extension to PP helpers, but it could also be a devmap allocator helper.
>
> IMHO forcing/piggy-bagging everything into page_pool is not the right
> solution.  I really think netstack need to support different allocator
> types. The page pool have been leading the way, yes, but perhaps it is
> time to add an API layer that e.g. could be named netmem, that gives us
> the multiplexing between allocators.  In that process some of page_pool
> APIs would be lifted out as common blocks and others remain.
>
> --Jesper
>
> > I have a sample implementation of adding a new page_pool_token type
> > in the page_pool to give a general idea here:
> > https://github.com/torvalds/linux/commit/3a7628700eb7fd02a117db036003bca50779608d
> >
> > Full branch here:
> > https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-pp-tokens
> >
> > (In the branches above, page_pool_iov is called devmem_slice).
> >
> > Could also add static_branch to speed up the checks in page_pool_iov
> > memory providers are being used.
> >
> > Signed-off-by: Mina Almasry <almasrymina@google.com>
> > ---
> >   include/net/page_pool.h | 74 ++++++++++++++++++++++++++++++++++-
> >   net/core/page_pool.c    | 85 ++++++++++++++++++++++++++++-------------
> >   2 files changed, 131 insertions(+), 28 deletions(-)
> >
> > diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> > index 537eb36115ed..f08ca230d68e 100644
> > --- a/include/net/page_pool.h
> > +++ b/include/net/page_pool.h
> > @@ -282,6 +282,64 @@ static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page)
> >       return NULL;
> >   }
> >
> > +static inline int page_pool_page_ref_count(struct page *page)
> > +{
> > +     if (page_is_page_pool_iov(page))
> > +             return page_pool_iov_refcount(page_to_page_pool_iov(page));
> > +
> > +     return page_ref_count(page);
> > +}
> > +
> > +static inline void page_pool_page_get_many(struct page *page,
> > +                                        unsigned int count)
> > +{
> > +     if (page_is_page_pool_iov(page))
> > +             return page_pool_iov_get_many(page_to_page_pool_iov(page),
> > +                                           count);
> > +
> > +     return page_ref_add(page, count);
> > +}
> > +
> > +static inline void page_pool_page_put_many(struct page *page,
> > +                                        unsigned int count)
> > +{
> > +     if (page_is_page_pool_iov(page))
> > +             return page_pool_iov_put_many(page_to_page_pool_iov(page),
> > +                                           count);
> > +
> > +     if (count > 1)
> > +             page_ref_sub(page, count - 1);
> > +
> > +     put_page(page);
> > +}
> > +
> > +static inline bool page_pool_page_is_pfmemalloc(struct page *page)
> > +{
> > +     if (page_is_page_pool_iov(page))
> > +             return false;
> > +
> > +     return page_is_pfmemalloc(page);
> > +}
> > +
> > +static inline bool page_pool_page_is_pref_nid(struct page *page, int pref_nid)
> > +{
> > +     /* Assume page_pool_iov are on the preferred node without actually
> > +      * checking...
> > +      *
> > +      * This check is only used to check for recycling memory in the page
> > +      * pool's fast paths. Currently the only implementation of page_pool_iov
> > +      * is dmabuf device memory. It's a deliberate decision by the user to
> > +      * bind a certain dmabuf to a certain netdev, and the netdev rx queue
> > +      * would not be able to reallocate memory from another dmabuf that
> > +      * exists on the preferred node, so, this check doesn't make much sense
> > +      * in this case. Assume all page_pool_iovs can be recycled for now.
> > +      */
> > +     if (page_is_page_pool_iov(page))
> > +             return true;
> > +
> > +     return page_to_nid(page) == pref_nid;
> > +}
> > +
> >   struct page_pool {
> >       struct page_pool_params p;
> >
> > @@ -434,6 +492,9 @@ static inline long page_pool_defrag_page(struct page *page, long nr)
> >   {
> >       long ret;
> >
> > +     if (page_is_page_pool_iov(page))
> > +             return -EINVAL;
> > +
> >       /* If nr == pp_frag_count then we have cleared all remaining
> >        * references to the page. No need to actually overwrite it, instead
> >        * we can leave this to be overwritten by the calling function.
> > @@ -494,7 +555,12 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,
> >
> >   static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
> >   {
> > -     dma_addr_t ret = page->dma_addr;
> > +     dma_addr_t ret;
> > +
> > +     if (page_is_page_pool_iov(page))
> > +             return page_pool_iov_dma_addr(page_to_page_pool_iov(page));
> > +
> > +     ret = page->dma_addr;
> >
> >       if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
> >               ret |= (dma_addr_t)page->dma_addr_upper << 16 << 16;
> > @@ -504,6 +570,12 @@ static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
> >
> >   static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
> >   {
> > +     /* page_pool_iovs are mapped and their dma-addr can't be modified. */
> > +     if (page_is_page_pool_iov(page)) {
> > +             DEBUG_NET_WARN_ON_ONCE(true);
> > +             return;
> > +     }
> > +
> >       page->dma_addr = addr;
> >       if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
> >               page->dma_addr_upper = upper_32_bits(addr);
> > diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> > index 0a7c08d748b8..20c1f74fd844 100644
> > --- a/net/core/page_pool.c
> > +++ b/net/core/page_pool.c
> > @@ -318,7 +318,7 @@ static struct page *page_pool_refill_alloc_cache(struct page_pool *pool)
> >               if (unlikely(!page))
> >                       break;
> >
> > -             if (likely(page_to_nid(page) == pref_nid)) {
> > +             if (likely(page_pool_page_is_pref_nid(page, pref_nid))) {
> >                       pool->alloc.cache[pool->alloc.count++] = page;
> >               } else {
> >                       /* NUMA mismatch;
> > @@ -363,7 +363,15 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
> >                                         struct page *page,
> >                                         unsigned int dma_sync_size)
> >   {
> > -     dma_addr_t dma_addr = page_pool_get_dma_addr(page);
> > +     dma_addr_t dma_addr;
> > +
> > +     /* page_pool_iov memory provider do not support PP_FLAG_DMA_SYNC_DEV */
> > +     if (page_is_page_pool_iov(page)) {
> > +             DEBUG_NET_WARN_ON_ONCE(true);
> > +             return;
> > +     }
> > +
> > +     dma_addr = page_pool_get_dma_addr(page);
> >
> >       dma_sync_size = min(dma_sync_size, pool->p.max_len);
> >       dma_sync_single_range_for_device(pool->p.dev, dma_addr,
> > @@ -375,6 +383,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
> >   {
> >       dma_addr_t dma;
> >
> > +     if (page_is_page_pool_iov(page)) {
> > +             /* page_pool_iovs are already mapped */
> > +             DEBUG_NET_WARN_ON_ONCE(true);
> > +             return true;
> > +     }
> > +
> >       /* Setup DMA mapping: use 'struct page' area for storing DMA-addr
> >        * since dma_addr_t can be either 32 or 64 bits and does not always fit
> >        * into page private data (i.e 32bit cpu with 64bit DMA caps)
> > @@ -398,14 +412,24 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
> >   static void page_pool_set_pp_info(struct page_pool *pool,
> >                                 struct page *page)
> >   {
> > -     page->pp = pool;
> > -     page->pp_magic |= PP_SIGNATURE;
> > +     if (!page_is_page_pool_iov(page)) {
> > +             page->pp = pool;
> > +             page->pp_magic |= PP_SIGNATURE;
> > +     } else {
> > +             page_to_page_pool_iov(page)->pp = pool;
> > +     }
> > +
> >       if (pool->p.init_callback)
> >               pool->p.init_callback(page, pool->p.init_arg);
> >   }
> >
> >   static void page_pool_clear_pp_info(struct page *page)
> >   {
> > +     if (page_is_page_pool_iov(page)) {
> > +             page_to_page_pool_iov(page)->pp = NULL;
> > +             return;
> > +     }
> > +
> >       page->pp_magic = 0;
> >       page->pp = NULL;
> >   }
> > @@ -615,7 +639,7 @@ static bool page_pool_recycle_in_cache(struct page *page,
> >               return false;
> >       }
> >
> > -     /* Caller MUST have verified/know (page_ref_count(page) == 1) */
> > +     /* Caller MUST have verified/know (page_pool_page_ref_count(page) == 1) */
> >       pool->alloc.cache[pool->alloc.count++] = page;
> >       recycle_stat_inc(pool, cached);
> >       return true;
> > @@ -638,9 +662,10 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
> >        * refcnt == 1 means page_pool owns page, and can recycle it.
> >        *
> >        * page is NOT reusable when allocated when system is under
> > -      * some pressure. (page_is_pfmemalloc)
> > +      * some pressure. (page_pool_page_is_pfmemalloc)
> >        */
> > -     if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
> > +     if (likely(page_pool_page_ref_count(page) == 1 &&
> > +                !page_pool_page_is_pfmemalloc(page))) {
> >               /* Read barrier done in page_ref_count / READ_ONCE */
> >
> >               if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> > @@ -741,7 +766,8 @@ static struct page *page_pool_drain_frag(struct page_pool *pool,
> >       if (likely(page_pool_defrag_page(page, drain_count)))
> >               return NULL;
> >
> > -     if (page_ref_count(page) == 1 && !page_is_pfmemalloc(page)) {
> > +     if (page_pool_page_ref_count(page) == 1 &&
> > +         !page_pool_page_is_pfmemalloc(page)) {
> >               if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> >                       page_pool_dma_sync_for_device(pool, page, -1);
> >
> > @@ -818,9 +844,9 @@ static void page_pool_empty_ring(struct page_pool *pool)
> >       /* Empty recycle ring */
> >       while ((page = ptr_ring_consume_bh(&pool->ring))) {
> >               /* Verify the refcnt invariant of cached pages */
> > -             if (!(page_ref_count(page) == 1))
> > +             if (!(page_pool_page_ref_count(page) == 1))
> >                       pr_crit("%s() page_pool refcnt %d violation\n",
> > -                             __func__, page_ref_count(page));
> > +                             __func__, page_pool_page_ref_count(page));
> >
> >               page_pool_return_page(pool, page);
> >       }
> > @@ -977,19 +1003,24 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe)
> >       struct page_pool *pp;
> >       bool allow_direct;
> >
> > -     page = compound_head(page);
> > +     if (!page_is_page_pool_iov(page)) {
> > +             page = compound_head(page);
> >
> > -     /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
> > -      * in order to preserve any existing bits, such as bit 0 for the
> > -      * head page of compound page and bit 1 for pfmemalloc page, so
> > -      * mask those bits for freeing side when doing below checking,
> > -      * and page_is_pfmemalloc() is checked in __page_pool_put_page()
> > -      * to avoid recycling the pfmemalloc page.
> > -      */
> > -     if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
> > -             return false;
> > +             /* page->pp_magic is OR'ed with PP_SIGNATURE after the
> > +              * allocation in order to preserve any existing bits, such as
> > +              * bit 0 for the head page of compound page and bit 1 for
> > +              * pfmemalloc page, so mask those bits for freeing side when
> > +              * doing below checking, and page_pool_page_is_pfmemalloc() is
> > +              * checked in __page_pool_put_page() to avoid recycling the
> > +              * pfmemalloc page.
> > +              */
> > +             if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
> > +                     return false;
> >
> > -     pp = page->pp;
> > +             pp = page->pp;
> > +     } else {
> > +             pp = page_to_page_pool_iov(page)->pp;
> > +     }
> >
> >       /* Allow direct recycle if we have reasons to believe that we are
> >        * in the same context as the consumer would run, so there's
> > @@ -1273,9 +1304,9 @@ static bool mp_huge_busy(struct mp_huge *hu, unsigned int idx)
> >
> >       for (j = 0; j < (1 << MP_HUGE_ORDER); j++) {
> >               page = hu->page[idx] + j;
> > -             if (page_ref_count(page) != 1) {
> > +             if (page_pool_page_ref_count(page) != 1) {
> >                       pr_warn("Page with ref count %d at %u, %u. Can't safely destory, leaking memory!\n",
> > -                             page_ref_count(page), idx, j);
> > +                             page_pool_page_ref_count(page), idx, j);
> >                       return true;
> >               }
> >       }
> > @@ -1330,7 +1361,7 @@ static struct page *mp_huge_alloc_pages(struct page_pool *pool, gfp_t gfp)
> >                       continue;
> >
> >               if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
> > -                 page_ref_count(page) != 1) {
> > +                 page_pool_page_ref_count(page) != 1) {
> >                       atomic_inc(&mp_huge_ins_b);
> >                       continue;
> >               }
> > @@ -1458,9 +1489,9 @@ static void mp_huge_1g_destroy(struct page_pool *pool)
> >       free = true;
> >       for (i = 0; i < MP_HUGE_1G_CNT; i++) {
> >               page = hu->page + i;
> > -             if (page_ref_count(page) != 1) {
> > +             if (page_pool_page_ref_count(page) != 1) {
> >                       pr_warn("Page with ref count %d at %u. Can't safely destory, leaking memory!\n",
> > -                             page_ref_count(page), i);
> > +                             page_pool_page_ref_count(page), i);
> >                       free = false;
> >                       break;
> >               }
> > @@ -1489,7 +1520,7 @@ static struct page *mp_huge_1g_alloc_pages(struct page_pool *pool, gfp_t gfp)
> >               page = hu->page + page_i;
> >
> >               if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
> > -                 page_ref_count(page) != 1) {
> > +                 page_pool_page_ref_count(page) != 1) {
> >                       atomic_inc(&mp_huge_ins_b);
> >                       continue;
> >               }
> > --
> > 2.41.0.640.ga95def55d0-goog
> >
>


--
Thanks,
Mina
Jesper Dangaard Brouer Aug. 22, 2023, 12:24 p.m. UTC | #12
On 22/08/2023 08.05, Mina Almasry wrote:
> On Sat, Aug 19, 2023 at 2:51 AM Jesper Dangaard Brouer
> <jbrouer@redhat.com> wrote:
>>
>> On 10/08/2023 03.57, Mina Almasry wrote:
>>> Overload the LSB of struct page* to indicate that it's a page_pool_iov.
>>>
>>> Refactor mm calls on struct page * into helpers, and add page_pool_iov
>>> handling on those helpers. Modify callers of these mm APIs with calls to
>>> these helpers instead.
>>>
>>
>> I don't like of this approach.
>> This is adding code to the PP (page_pool) fast-path in multiple places.
>>
>> I've not had time to run my usual benchmarks, which are here:
>>
>> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
>>
> 
> I ported over this benchmark to my tree and ran it, my results:
> 

What CPU is this and GHz?  (I guess 2.6 GHz based on results).

(It looks like this CPU is more efficient, instructions per cycles, than 
my E5-1650 v4 @ 3.60GHz).

> net-next @ b44693495af8
> https://pastebin.com/raw/JuU7UQXe
> 
> + Jakub's memory-provider APIs:
> https://pastebin.com/raw/StMBhetn
> 
> + devmem TCP changes:
> https://pastebin.com/raw/mY1L6U4r
> 

Only a single cycle slowdown for "page_pool01_fast_path".
 From 10 cycles to 11 cycles.

> + intentional regression just to make sure the benchmark is working:
> https://pastebin.com/raw/wqWhcJdG
> 
> I don't seem to be able to detect a regression with this series as-is,
> but I'm not that familiar with the test and may be doing something
> wrong or misinterpreting the results. Does this look ok to you?
> 

The performance results are better than I expected.  The small
regression from 10 cycles to 11 cycles is actually 10%, but I expect
with some likely/unlikely instrumentation we can "likely" remove this again.

So, this change actually looks acceptable from a performance PoV.
I still think this page_pool_iov is very invasive to page_pool, but
maybe it is better to hide this "uglyness" inside page_pool.

The test primarily tests fast-path, and you also add "if" statements to
all the DMA operations, which is not part of this benchmark.  Perhaps we 
can add unlikely statements, or inspect (objdump) the ASM to check code 
priorities the original page based "provider".

>> But I'm sure it will affect performance.
>>

Guess, I was wrong ;-)

--Jesper


>> Regardless of performance, this approach is using ptr-LSB-bits, to hide
>> that page-pointer are not really struct-pages, feels like force feeding
>> a solution just to use the page_pool APIs.
>>
>>
>>> In areas where struct page* is dereferenced, add a check for special
>>> handling of page_pool_iov.
>>>
>>> The memory providers producing page_pool_iov can set the LSB on the
>>> struct page* returned to the page pool.
>>>
>>> Note that instead of overloading the LSB of page pointers, we can
>>> instead define a new union between struct page & struct page_pool_iov and
>>> compact it in a new type. However, we'd need to implement the code churn
>>> to modify the page_pool & drivers to use this new type. For this POC
>>> that is not implemented (feedback welcome).
>>>
>>
>> I've said before, that I prefer multiplexing on page->pp_magic.
>> For your page_pool_iov the layout would have to match the offset of
>> pp_magic, to do this. (And if insisting on using PP infra the refcnt
>> would also need to align).
>>
>> On the allocation side, all drivers already use a driver helper
>> page_pool_dev_alloc_pages() or we could add another (better named)
>> helper to multiplex between other types of allocators, e.g. a devmem
>> allocator.
>>
>> On free/return/recycle the functions napi_pp_put_page or skb_pp_recycle
>> could multiplex on pp_magic and call another API.  The API could be an
>> extension to PP helpers, but it could also be a devmap allocator helper.
>>
>> IMHO forcing/piggy-bagging everything into page_pool is not the right
>> solution.  I really think netstack need to support different allocator
>> types. The page pool have been leading the way, yes, but perhaps it is
>> time to add an API layer that e.g. could be named netmem, that gives us
>> the multiplexing between allocators.  In that process some of page_pool
>> APIs would be lifted out as common blocks and others remain.
>>
>> --Jesper
>>
>>> I have a sample implementation of adding a new page_pool_token type
>>> in the page_pool to give a general idea here:
>>> https://github.com/torvalds/linux/commit/3a7628700eb7fd02a117db036003bca50779608d
>>>
>>> Full branch here:
>>> https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-pp-tokens
>>>
>>> (In the branches above, page_pool_iov is called devmem_slice).
>>>
>>> Could also add static_branch to speed up the checks in page_pool_iov
>>> memory providers are being used.
>>>
>>> Signed-off-by: Mina Almasry <almasrymina@google.com>
>>> ---
>>>    include/net/page_pool.h | 74 ++++++++++++++++++++++++++++++++++-
>>>    net/core/page_pool.c    | 85 ++++++++++++++++++++++++++++-------------
>>>    2 files changed, 131 insertions(+), 28 deletions(-)
>>>
>>> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
>>> index 537eb36115ed..f08ca230d68e 100644
>>> --- a/include/net/page_pool.h
>>> +++ b/include/net/page_pool.h
>>> @@ -282,6 +282,64 @@ static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page)
>>>        return NULL;
>>>    }
>>>
>>> +static inline int page_pool_page_ref_count(struct page *page)
>>> +{
>>> +     if (page_is_page_pool_iov(page))
>>> +             return page_pool_iov_refcount(page_to_page_pool_iov(page));
>>> +
>>> +     return page_ref_count(page);
>>> +}
>>> +
>>> +static inline void page_pool_page_get_many(struct page *page,
>>> +                                        unsigned int count)
>>> +{
>>> +     if (page_is_page_pool_iov(page))
>>> +             return page_pool_iov_get_many(page_to_page_pool_iov(page),
>>> +                                           count);
>>> +
>>> +     return page_ref_add(page, count);
>>> +}
>>> +
>>> +static inline void page_pool_page_put_many(struct page *page,
>>> +                                        unsigned int count)
>>> +{
>>> +     if (page_is_page_pool_iov(page))
>>> +             return page_pool_iov_put_many(page_to_page_pool_iov(page),
>>> +                                           count);
>>> +
>>> +     if (count > 1)
>>> +             page_ref_sub(page, count - 1);
>>> +
>>> +     put_page(page);
>>> +}
>>> +
>>> +static inline bool page_pool_page_is_pfmemalloc(struct page *page)
>>> +{
>>> +     if (page_is_page_pool_iov(page))
>>> +             return false;
>>> +
>>> +     return page_is_pfmemalloc(page);
>>> +}
>>> +
>>> +static inline bool page_pool_page_is_pref_nid(struct page *page, int pref_nid)
>>> +{
>>> +     /* Assume page_pool_iov are on the preferred node without actually
>>> +      * checking...
>>> +      *
>>> +      * This check is only used to check for recycling memory in the page
>>> +      * pool's fast paths. Currently the only implementation of page_pool_iov
>>> +      * is dmabuf device memory. It's a deliberate decision by the user to
>>> +      * bind a certain dmabuf to a certain netdev, and the netdev rx queue
>>> +      * would not be able to reallocate memory from another dmabuf that
>>> +      * exists on the preferred node, so, this check doesn't make much sense
>>> +      * in this case. Assume all page_pool_iovs can be recycled for now.
>>> +      */
>>> +     if (page_is_page_pool_iov(page))
>>> +             return true;
>>> +
>>> +     return page_to_nid(page) == pref_nid;
>>> +}
>>> +
>>>    struct page_pool {
>>>        struct page_pool_params p;
>>>
>>> @@ -434,6 +492,9 @@ static inline long page_pool_defrag_page(struct page *page, long nr)
>>>    {
>>>        long ret;
>>>
>>> +     if (page_is_page_pool_iov(page))
>>> +             return -EINVAL;
>>> +
>>>        /* If nr == pp_frag_count then we have cleared all remaining
>>>         * references to the page. No need to actually overwrite it, instead
>>>         * we can leave this to be overwritten by the calling function.
>>> @@ -494,7 +555,12 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,
>>>
>>>    static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
>>>    {
>>> -     dma_addr_t ret = page->dma_addr;
>>> +     dma_addr_t ret;
>>> +
>>> +     if (page_is_page_pool_iov(page))
>>> +             return page_pool_iov_dma_addr(page_to_page_pool_iov(page));
>>> +
>>> +     ret = page->dma_addr;
>>>
>>>        if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
>>>                ret |= (dma_addr_t)page->dma_addr_upper << 16 << 16;
>>> @@ -504,6 +570,12 @@ static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
>>>
>>>    static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
>>>    {
>>> +     /* page_pool_iovs are mapped and their dma-addr can't be modified. */
>>> +     if (page_is_page_pool_iov(page)) {
>>> +             DEBUG_NET_WARN_ON_ONCE(true);
>>> +             return;
>>> +     }
>>> +
>>>        page->dma_addr = addr;
>>>        if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
>>>                page->dma_addr_upper = upper_32_bits(addr);
>>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>>> index 0a7c08d748b8..20c1f74fd844 100644
>>> --- a/net/core/page_pool.c
>>> +++ b/net/core/page_pool.c
>>> @@ -318,7 +318,7 @@ static struct page *page_pool_refill_alloc_cache(struct page_pool *pool)
>>>                if (unlikely(!page))
>>>                        break;
>>>
>>> -             if (likely(page_to_nid(page) == pref_nid)) {
>>> +             if (likely(page_pool_page_is_pref_nid(page, pref_nid))) {
>>>                        pool->alloc.cache[pool->alloc.count++] = page;
>>>                } else {
>>>                        /* NUMA mismatch;
>>> @@ -363,7 +363,15 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
>>>                                          struct page *page,
>>>                                          unsigned int dma_sync_size)
>>>    {
>>> -     dma_addr_t dma_addr = page_pool_get_dma_addr(page);
>>> +     dma_addr_t dma_addr;
>>> +
>>> +     /* page_pool_iov memory provider do not support PP_FLAG_DMA_SYNC_DEV */
>>> +     if (page_is_page_pool_iov(page)) {
>>> +             DEBUG_NET_WARN_ON_ONCE(true);
>>> +             return;
>>> +     }
>>> +
>>> +     dma_addr = page_pool_get_dma_addr(page);
>>>
>>>        dma_sync_size = min(dma_sync_size, pool->p.max_len);
>>>        dma_sync_single_range_for_device(pool->p.dev, dma_addr,
>>> @@ -375,6 +383,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
>>>    {
>>>        dma_addr_t dma;
>>>
>>> +     if (page_is_page_pool_iov(page)) {
>>> +             /* page_pool_iovs are already mapped */
>>> +             DEBUG_NET_WARN_ON_ONCE(true);
>>> +             return true;
>>> +     }
>>> +
>>>        /* Setup DMA mapping: use 'struct page' area for storing DMA-addr
>>>         * since dma_addr_t can be either 32 or 64 bits and does not always fit
>>>         * into page private data (i.e 32bit cpu with 64bit DMA caps)
>>> @@ -398,14 +412,24 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
>>>    static void page_pool_set_pp_info(struct page_pool *pool,
>>>                                  struct page *page)
>>>    {
>>> -     page->pp = pool;
>>> -     page->pp_magic |= PP_SIGNATURE;
>>> +     if (!page_is_page_pool_iov(page)) {
>>> +             page->pp = pool;
>>> +             page->pp_magic |= PP_SIGNATURE;
>>> +     } else {
>>> +             page_to_page_pool_iov(page)->pp = pool;
>>> +     }
>>> +
>>>        if (pool->p.init_callback)
>>>                pool->p.init_callback(page, pool->p.init_arg);
>>>    }
>>>
>>>    static void page_pool_clear_pp_info(struct page *page)
>>>    {
>>> +     if (page_is_page_pool_iov(page)) {
>>> +             page_to_page_pool_iov(page)->pp = NULL;
>>> +             return;
>>> +     }
>>> +
>>>        page->pp_magic = 0;
>>>        page->pp = NULL;
>>>    }
>>> @@ -615,7 +639,7 @@ static bool page_pool_recycle_in_cache(struct page *page,
>>>                return false;
>>>        }
>>>
>>> -     /* Caller MUST have verified/know (page_ref_count(page) == 1) */
>>> +     /* Caller MUST have verified/know (page_pool_page_ref_count(page) == 1) */
>>>        pool->alloc.cache[pool->alloc.count++] = page;
>>>        recycle_stat_inc(pool, cached);
>>>        return true;
>>> @@ -638,9 +662,10 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
>>>         * refcnt == 1 means page_pool owns page, and can recycle it.
>>>         *
>>>         * page is NOT reusable when allocated when system is under
>>> -      * some pressure. (page_is_pfmemalloc)
>>> +      * some pressure. (page_pool_page_is_pfmemalloc)
>>>         */
>>> -     if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
>>> +     if (likely(page_pool_page_ref_count(page) == 1 &&
>>> +                !page_pool_page_is_pfmemalloc(page))) {
>>>                /* Read barrier done in page_ref_count / READ_ONCE */
>>>
>>>                if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
>>> @@ -741,7 +766,8 @@ static struct page *page_pool_drain_frag(struct page_pool *pool,
>>>        if (likely(page_pool_defrag_page(page, drain_count)))
>>>                return NULL;
>>>
>>> -     if (page_ref_count(page) == 1 && !page_is_pfmemalloc(page)) {
>>> +     if (page_pool_page_ref_count(page) == 1 &&
>>> +         !page_pool_page_is_pfmemalloc(page)) {
>>>                if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
>>>                        page_pool_dma_sync_for_device(pool, page, -1);
>>>
>>> @@ -818,9 +844,9 @@ static void page_pool_empty_ring(struct page_pool *pool)
>>>        /* Empty recycle ring */
>>>        while ((page = ptr_ring_consume_bh(&pool->ring))) {
>>>                /* Verify the refcnt invariant of cached pages */
>>> -             if (!(page_ref_count(page) == 1))
>>> +             if (!(page_pool_page_ref_count(page) == 1))
>>>                        pr_crit("%s() page_pool refcnt %d violation\n",
>>> -                             __func__, page_ref_count(page));
>>> +                             __func__, page_pool_page_ref_count(page));
>>>
>>>                page_pool_return_page(pool, page);
>>>        }
>>> @@ -977,19 +1003,24 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe)
>>>        struct page_pool *pp;
>>>        bool allow_direct;
>>>
>>> -     page = compound_head(page);
>>> +     if (!page_is_page_pool_iov(page)) {
>>> +             page = compound_head(page);
>>>
>>> -     /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
>>> -      * in order to preserve any existing bits, such as bit 0 for the
>>> -      * head page of compound page and bit 1 for pfmemalloc page, so
>>> -      * mask those bits for freeing side when doing below checking,
>>> -      * and page_is_pfmemalloc() is checked in __page_pool_put_page()
>>> -      * to avoid recycling the pfmemalloc page.
>>> -      */
>>> -     if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
>>> -             return false;
>>> +             /* page->pp_magic is OR'ed with PP_SIGNATURE after the
>>> +              * allocation in order to preserve any existing bits, such as
>>> +              * bit 0 for the head page of compound page and bit 1 for
>>> +              * pfmemalloc page, so mask those bits for freeing side when
>>> +              * doing below checking, and page_pool_page_is_pfmemalloc() is
>>> +              * checked in __page_pool_put_page() to avoid recycling the
>>> +              * pfmemalloc page.
>>> +              */
>>> +             if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
>>> +                     return false;
>>>
>>> -     pp = page->pp;
>>> +             pp = page->pp;
>>> +     } else {
>>> +             pp = page_to_page_pool_iov(page)->pp;
>>> +     }
>>>
>>>        /* Allow direct recycle if we have reasons to believe that we are
>>>         * in the same context as the consumer would run, so there's
>>> @@ -1273,9 +1304,9 @@ static bool mp_huge_busy(struct mp_huge *hu, unsigned int idx)
>>>
>>>        for (j = 0; j < (1 << MP_HUGE_ORDER); j++) {
>>>                page = hu->page[idx] + j;
>>> -             if (page_ref_count(page) != 1) {
>>> +             if (page_pool_page_ref_count(page) != 1) {
>>>                        pr_warn("Page with ref count %d at %u, %u. Can't safely destory, leaking memory!\n",
>>> -                             page_ref_count(page), idx, j);
>>> +                             page_pool_page_ref_count(page), idx, j);
>>>                        return true;
>>>                }
>>>        }
>>> @@ -1330,7 +1361,7 @@ static struct page *mp_huge_alloc_pages(struct page_pool *pool, gfp_t gfp)
>>>                        continue;
>>>
>>>                if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
>>> -                 page_ref_count(page) != 1) {
>>> +                 page_pool_page_ref_count(page) != 1) {
>>>                        atomic_inc(&mp_huge_ins_b);
>>>                        continue;
>>>                }
>>> @@ -1458,9 +1489,9 @@ static void mp_huge_1g_destroy(struct page_pool *pool)
>>>        free = true;
>>>        for (i = 0; i < MP_HUGE_1G_CNT; i++) {
>>>                page = hu->page + i;
>>> -             if (page_ref_count(page) != 1) {
>>> +             if (page_pool_page_ref_count(page) != 1) {
>>>                        pr_warn("Page with ref count %d at %u. Can't safely destory, leaking memory!\n",
>>> -                             page_ref_count(page), i);
>>> +                             page_pool_page_ref_count(page), i);
>>>                        free = false;
>>>                        break;
>>>                }
>>> @@ -1489,7 +1520,7 @@ static struct page *mp_huge_1g_alloc_pages(struct page_pool *pool, gfp_t gfp)
>>>                page = hu->page + page_i;
>>>
>>>                if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
>>> -                 page_ref_count(page) != 1) {
>>> +                 page_pool_page_ref_count(page) != 1) {
>>>                        atomic_inc(&mp_huge_ins_b);
>>>                        continue;
>>>                }
>>> --
>>> 2.41.0.640.ga95def55d0-goog
>>>
>>
> 
> 
> --
> Thanks,
> Mina
>
Mina Almasry Aug. 22, 2023, 11:33 p.m. UTC | #13
On Tue, Aug 22, 2023 at 5:24 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
>
> On 22/08/2023 08.05, Mina Almasry wrote:
> > On Sat, Aug 19, 2023 at 2:51 AM Jesper Dangaard Brouer
> > <jbrouer@redhat.com> wrote:
> >>
> >> On 10/08/2023 03.57, Mina Almasry wrote:
> >>> Overload the LSB of struct page* to indicate that it's a page_pool_iov.
> >>>
> >>> Refactor mm calls on struct page * into helpers, and add page_pool_iov
> >>> handling on those helpers. Modify callers of these mm APIs with calls to
> >>> these helpers instead.
> >>>
> >>
> >> I don't like of this approach.
> >> This is adding code to the PP (page_pool) fast-path in multiple places.
> >>
> >> I've not had time to run my usual benchmarks, which are here:
> >>
> >> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
> >>
> >
> > I ported over this benchmark to my tree and ran it, my results:
> >
>
> What CPU is this and GHz?  (I guess 2.6 GHz based on results).
>
> (It looks like this CPU is more efficient, instructions per cycles, than
> my E5-1650 v4 @ 3.60GHz).
>

cat /proc/cpuinfo
...
vendor_id       : GenuineIntel
cpu family      : 6
model           : 143
model name      : Intel(R) Xeon(R) Platinum 8481C CPU @ 2.70GHz
stepping        : 8
microcode       : 0xffffffff
cpu MHz         : 2699.998
```

This is a vCPU on the Google Cloud A3 VMs.

> > net-next @ b44693495af8
> > https://pastebin.com/raw/JuU7UQXe
> >
> > + Jakub's memory-provider APIs:
> > https://pastebin.com/raw/StMBhetn
> >
> > + devmem TCP changes:
> > https://pastebin.com/raw/mY1L6U4r
> >
>
> Only a single cycle slowdown for "page_pool01_fast_path".
>  From 10 cycles to 11 cycles.
>
> > + intentional regression just to make sure the benchmark is working:
> > https://pastebin.com/raw/wqWhcJdG
> >
> > I don't seem to be able to detect a regression with this series as-is,
> > but I'm not that familiar with the test and may be doing something
> > wrong or misinterpreting the results. Does this look ok to you?
> >
>
> The performance results are better than I expected.  The small
> regression from 10 cycles to 11 cycles is actually 10%, but I expect
> with some likely/unlikely instrumentation we can "likely" remove this again.
>

So the patch is already optimized carefully (I hope) to put all the
devmem processing in the default unlikely path. Willem showed me that:

if (page_pool_iov())
   return handle_page_pool_iov();

return handle_page();

The handle_page() will be 'likely' by default, which removes the need
for explicit likely/unlikely. I'm not sure we can get better perf with
explicit likely/unlikey, but I can try.

> So, this change actually looks acceptable from a performance PoV.
> I still think this page_pool_iov is very invasive to page_pool, but
> maybe it is better to hide this "uglyness" inside page_pool.
>
> The test primarily tests fast-path, and you also add "if" statements to
> all the DMA operations, which is not part of this benchmark.  Perhaps we
> can add unlikely statements, or inspect (objdump) the ASM to check code
> priorities the original page based "provider".
>
> >> But I'm sure it will affect performance.
> >>
>
> Guess, I was wrong ;-)
>
> --Jesper
>
>
> >> Regardless of performance, this approach is using ptr-LSB-bits, to hide
> >> that page-pointer are not really struct-pages, feels like force feeding
> >> a solution just to use the page_pool APIs.
> >>
> >>
> >>> In areas where struct page* is dereferenced, add a check for special
> >>> handling of page_pool_iov.
> >>>
> >>> The memory providers producing page_pool_iov can set the LSB on the
> >>> struct page* returned to the page pool.
> >>>
> >>> Note that instead of overloading the LSB of page pointers, we can
> >>> instead define a new union between struct page & struct page_pool_iov and
> >>> compact it in a new type. However, we'd need to implement the code churn
> >>> to modify the page_pool & drivers to use this new type. For this POC
> >>> that is not implemented (feedback welcome).
> >>>
> >>
> >> I've said before, that I prefer multiplexing on page->pp_magic.
> >> For your page_pool_iov the layout would have to match the offset of
> >> pp_magic, to do this. (And if insisting on using PP infra the refcnt
> >> would also need to align).
> >>
> >> On the allocation side, all drivers already use a driver helper
> >> page_pool_dev_alloc_pages() or we could add another (better named)
> >> helper to multiplex between other types of allocators, e.g. a devmem
> >> allocator.
> >>
> >> On free/return/recycle the functions napi_pp_put_page or skb_pp_recycle
> >> could multiplex on pp_magic and call another API.  The API could be an
> >> extension to PP helpers, but it could also be a devmap allocator helper.
> >>
> >> IMHO forcing/piggy-bagging everything into page_pool is not the right
> >> solution.  I really think netstack need to support different allocator
> >> types. The page pool have been leading the way, yes, but perhaps it is
> >> time to add an API layer that e.g. could be named netmem, that gives us
> >> the multiplexing between allocators.  In that process some of page_pool
> >> APIs would be lifted out as common blocks and others remain.
> >>
> >> --Jesper
> >>
> >>> I have a sample implementation of adding a new page_pool_token type
> >>> in the page_pool to give a general idea here:
> >>> https://github.com/torvalds/linux/commit/3a7628700eb7fd02a117db036003bca50779608d
> >>>
> >>> Full branch here:
> >>> https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-pp-tokens
> >>>
> >>> (In the branches above, page_pool_iov is called devmem_slice).
> >>>
> >>> Could also add static_branch to speed up the checks in page_pool_iov
> >>> memory providers are being used.
> >>>
> >>> Signed-off-by: Mina Almasry <almasrymina@google.com>
> >>> ---
> >>>    include/net/page_pool.h | 74 ++++++++++++++++++++++++++++++++++-
> >>>    net/core/page_pool.c    | 85 ++++++++++++++++++++++++++++-------------
> >>>    2 files changed, 131 insertions(+), 28 deletions(-)
> >>>
> >>> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> >>> index 537eb36115ed..f08ca230d68e 100644
> >>> --- a/include/net/page_pool.h
> >>> +++ b/include/net/page_pool.h
> >>> @@ -282,6 +282,64 @@ static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page)
> >>>        return NULL;
> >>>    }
> >>>
> >>> +static inline int page_pool_page_ref_count(struct page *page)
> >>> +{
> >>> +     if (page_is_page_pool_iov(page))
> >>> +             return page_pool_iov_refcount(page_to_page_pool_iov(page));
> >>> +
> >>> +     return page_ref_count(page);
> >>> +}
> >>> +
> >>> +static inline void page_pool_page_get_many(struct page *page,
> >>> +                                        unsigned int count)
> >>> +{
> >>> +     if (page_is_page_pool_iov(page))
> >>> +             return page_pool_iov_get_many(page_to_page_pool_iov(page),
> >>> +                                           count);
> >>> +
> >>> +     return page_ref_add(page, count);
> >>> +}
> >>> +
> >>> +static inline void page_pool_page_put_many(struct page *page,
> >>> +                                        unsigned int count)
> >>> +{
> >>> +     if (page_is_page_pool_iov(page))
> >>> +             return page_pool_iov_put_many(page_to_page_pool_iov(page),
> >>> +                                           count);
> >>> +
> >>> +     if (count > 1)
> >>> +             page_ref_sub(page, count - 1);
> >>> +
> >>> +     put_page(page);
> >>> +}
> >>> +
> >>> +static inline bool page_pool_page_is_pfmemalloc(struct page *page)
> >>> +{
> >>> +     if (page_is_page_pool_iov(page))
> >>> +             return false;
> >>> +
> >>> +     return page_is_pfmemalloc(page);
> >>> +}
> >>> +
> >>> +static inline bool page_pool_page_is_pref_nid(struct page *page, int pref_nid)
> >>> +{
> >>> +     /* Assume page_pool_iov are on the preferred node without actually
> >>> +      * checking...
> >>> +      *
> >>> +      * This check is only used to check for recycling memory in the page
> >>> +      * pool's fast paths. Currently the only implementation of page_pool_iov
> >>> +      * is dmabuf device memory. It's a deliberate decision by the user to
> >>> +      * bind a certain dmabuf to a certain netdev, and the netdev rx queue
> >>> +      * would not be able to reallocate memory from another dmabuf that
> >>> +      * exists on the preferred node, so, this check doesn't make much sense
> >>> +      * in this case. Assume all page_pool_iovs can be recycled for now.
> >>> +      */
> >>> +     if (page_is_page_pool_iov(page))
> >>> +             return true;
> >>> +
> >>> +     return page_to_nid(page) == pref_nid;
> >>> +}
> >>> +
> >>>    struct page_pool {
> >>>        struct page_pool_params p;
> >>>
> >>> @@ -434,6 +492,9 @@ static inline long page_pool_defrag_page(struct page *page, long nr)
> >>>    {
> >>>        long ret;
> >>>
> >>> +     if (page_is_page_pool_iov(page))
> >>> +             return -EINVAL;
> >>> +
> >>>        /* If nr == pp_frag_count then we have cleared all remaining
> >>>         * references to the page. No need to actually overwrite it, instead
> >>>         * we can leave this to be overwritten by the calling function.
> >>> @@ -494,7 +555,12 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,
> >>>
> >>>    static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
> >>>    {
> >>> -     dma_addr_t ret = page->dma_addr;
> >>> +     dma_addr_t ret;
> >>> +
> >>> +     if (page_is_page_pool_iov(page))
> >>> +             return page_pool_iov_dma_addr(page_to_page_pool_iov(page));
> >>> +
> >>> +     ret = page->dma_addr;
> >>>
> >>>        if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
> >>>                ret |= (dma_addr_t)page->dma_addr_upper << 16 << 16;
> >>> @@ -504,6 +570,12 @@ static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
> >>>
> >>>    static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
> >>>    {
> >>> +     /* page_pool_iovs are mapped and their dma-addr can't be modified. */
> >>> +     if (page_is_page_pool_iov(page)) {
> >>> +             DEBUG_NET_WARN_ON_ONCE(true);
> >>> +             return;
> >>> +     }
> >>> +
> >>>        page->dma_addr = addr;
> >>>        if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
> >>>                page->dma_addr_upper = upper_32_bits(addr);
> >>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> >>> index 0a7c08d748b8..20c1f74fd844 100644
> >>> --- a/net/core/page_pool.c
> >>> +++ b/net/core/page_pool.c
> >>> @@ -318,7 +318,7 @@ static struct page *page_pool_refill_alloc_cache(struct page_pool *pool)
> >>>                if (unlikely(!page))
> >>>                        break;
> >>>
> >>> -             if (likely(page_to_nid(page) == pref_nid)) {
> >>> +             if (likely(page_pool_page_is_pref_nid(page, pref_nid))) {
> >>>                        pool->alloc.cache[pool->alloc.count++] = page;
> >>>                } else {
> >>>                        /* NUMA mismatch;
> >>> @@ -363,7 +363,15 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
> >>>                                          struct page *page,
> >>>                                          unsigned int dma_sync_size)
> >>>    {
> >>> -     dma_addr_t dma_addr = page_pool_get_dma_addr(page);
> >>> +     dma_addr_t dma_addr;
> >>> +
> >>> +     /* page_pool_iov memory provider do not support PP_FLAG_DMA_SYNC_DEV */
> >>> +     if (page_is_page_pool_iov(page)) {
> >>> +             DEBUG_NET_WARN_ON_ONCE(true);
> >>> +             return;
> >>> +     }
> >>> +
> >>> +     dma_addr = page_pool_get_dma_addr(page);
> >>>
> >>>        dma_sync_size = min(dma_sync_size, pool->p.max_len);
> >>>        dma_sync_single_range_for_device(pool->p.dev, dma_addr,
> >>> @@ -375,6 +383,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
> >>>    {
> >>>        dma_addr_t dma;
> >>>
> >>> +     if (page_is_page_pool_iov(page)) {
> >>> +             /* page_pool_iovs are already mapped */
> >>> +             DEBUG_NET_WARN_ON_ONCE(true);
> >>> +             return true;
> >>> +     }
> >>> +
> >>>        /* Setup DMA mapping: use 'struct page' area for storing DMA-addr
> >>>         * since dma_addr_t can be either 32 or 64 bits and does not always fit
> >>>         * into page private data (i.e 32bit cpu with 64bit DMA caps)
> >>> @@ -398,14 +412,24 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
> >>>    static void page_pool_set_pp_info(struct page_pool *pool,
> >>>                                  struct page *page)
> >>>    {
> >>> -     page->pp = pool;
> >>> -     page->pp_magic |= PP_SIGNATURE;
> >>> +     if (!page_is_page_pool_iov(page)) {
> >>> +             page->pp = pool;
> >>> +             page->pp_magic |= PP_SIGNATURE;
> >>> +     } else {
> >>> +             page_to_page_pool_iov(page)->pp = pool;
> >>> +     }
> >>> +
> >>>        if (pool->p.init_callback)
> >>>                pool->p.init_callback(page, pool->p.init_arg);
> >>>    }
> >>>
> >>>    static void page_pool_clear_pp_info(struct page *page)
> >>>    {
> >>> +     if (page_is_page_pool_iov(page)) {
> >>> +             page_to_page_pool_iov(page)->pp = NULL;
> >>> +             return;
> >>> +     }
> >>> +
> >>>        page->pp_magic = 0;
> >>>        page->pp = NULL;
> >>>    }
> >>> @@ -615,7 +639,7 @@ static bool page_pool_recycle_in_cache(struct page *page,
> >>>                return false;
> >>>        }
> >>>
> >>> -     /* Caller MUST have verified/know (page_ref_count(page) == 1) */
> >>> +     /* Caller MUST have verified/know (page_pool_page_ref_count(page) == 1) */
> >>>        pool->alloc.cache[pool->alloc.count++] = page;
> >>>        recycle_stat_inc(pool, cached);
> >>>        return true;
> >>> @@ -638,9 +662,10 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
> >>>         * refcnt == 1 means page_pool owns page, and can recycle it.
> >>>         *
> >>>         * page is NOT reusable when allocated when system is under
> >>> -      * some pressure. (page_is_pfmemalloc)
> >>> +      * some pressure. (page_pool_page_is_pfmemalloc)
> >>>         */
> >>> -     if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
> >>> +     if (likely(page_pool_page_ref_count(page) == 1 &&
> >>> +                !page_pool_page_is_pfmemalloc(page))) {
> >>>                /* Read barrier done in page_ref_count / READ_ONCE */
> >>>
> >>>                if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> >>> @@ -741,7 +766,8 @@ static struct page *page_pool_drain_frag(struct page_pool *pool,
> >>>        if (likely(page_pool_defrag_page(page, drain_count)))
> >>>                return NULL;
> >>>
> >>> -     if (page_ref_count(page) == 1 && !page_is_pfmemalloc(page)) {
> >>> +     if (page_pool_page_ref_count(page) == 1 &&
> >>> +         !page_pool_page_is_pfmemalloc(page)) {
> >>>                if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> >>>                        page_pool_dma_sync_for_device(pool, page, -1);
> >>>
> >>> @@ -818,9 +844,9 @@ static void page_pool_empty_ring(struct page_pool *pool)
> >>>        /* Empty recycle ring */
> >>>        while ((page = ptr_ring_consume_bh(&pool->ring))) {
> >>>                /* Verify the refcnt invariant of cached pages */
> >>> -             if (!(page_ref_count(page) == 1))
> >>> +             if (!(page_pool_page_ref_count(page) == 1))
> >>>                        pr_crit("%s() page_pool refcnt %d violation\n",
> >>> -                             __func__, page_ref_count(page));
> >>> +                             __func__, page_pool_page_ref_count(page));
> >>>
> >>>                page_pool_return_page(pool, page);
> >>>        }
> >>> @@ -977,19 +1003,24 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe)
> >>>        struct page_pool *pp;
> >>>        bool allow_direct;
> >>>
> >>> -     page = compound_head(page);
> >>> +     if (!page_is_page_pool_iov(page)) {
> >>> +             page = compound_head(page);
> >>>
> >>> -     /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
> >>> -      * in order to preserve any existing bits, such as bit 0 for the
> >>> -      * head page of compound page and bit 1 for pfmemalloc page, so
> >>> -      * mask those bits for freeing side when doing below checking,
> >>> -      * and page_is_pfmemalloc() is checked in __page_pool_put_page()
> >>> -      * to avoid recycling the pfmemalloc page.
> >>> -      */
> >>> -     if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
> >>> -             return false;
> >>> +             /* page->pp_magic is OR'ed with PP_SIGNATURE after the
> >>> +              * allocation in order to preserve any existing bits, such as
> >>> +              * bit 0 for the head page of compound page and bit 1 for
> >>> +              * pfmemalloc page, so mask those bits for freeing side when
> >>> +              * doing below checking, and page_pool_page_is_pfmemalloc() is
> >>> +              * checked in __page_pool_put_page() to avoid recycling the
> >>> +              * pfmemalloc page.
> >>> +              */
> >>> +             if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
> >>> +                     return false;
> >>>
> >>> -     pp = page->pp;
> >>> +             pp = page->pp;
> >>> +     } else {
> >>> +             pp = page_to_page_pool_iov(page)->pp;
> >>> +     }
> >>>
> >>>        /* Allow direct recycle if we have reasons to believe that we are
> >>>         * in the same context as the consumer would run, so there's
> >>> @@ -1273,9 +1304,9 @@ static bool mp_huge_busy(struct mp_huge *hu, unsigned int idx)
> >>>
> >>>        for (j = 0; j < (1 << MP_HUGE_ORDER); j++) {
> >>>                page = hu->page[idx] + j;
> >>> -             if (page_ref_count(page) != 1) {
> >>> +             if (page_pool_page_ref_count(page) != 1) {
> >>>                        pr_warn("Page with ref count %d at %u, %u. Can't safely destory, leaking memory!\n",
> >>> -                             page_ref_count(page), idx, j);
> >>> +                             page_pool_page_ref_count(page), idx, j);
> >>>                        return true;
> >>>                }
> >>>        }
> >>> @@ -1330,7 +1361,7 @@ static struct page *mp_huge_alloc_pages(struct page_pool *pool, gfp_t gfp)
> >>>                        continue;
> >>>
> >>>                if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
> >>> -                 page_ref_count(page) != 1) {
> >>> +                 page_pool_page_ref_count(page) != 1) {
> >>>                        atomic_inc(&mp_huge_ins_b);
> >>>                        continue;
> >>>                }
> >>> @@ -1458,9 +1489,9 @@ static void mp_huge_1g_destroy(struct page_pool *pool)
> >>>        free = true;
> >>>        for (i = 0; i < MP_HUGE_1G_CNT; i++) {
> >>>                page = hu->page + i;
> >>> -             if (page_ref_count(page) != 1) {
> >>> +             if (page_pool_page_ref_count(page) != 1) {
> >>>                        pr_warn("Page with ref count %d at %u. Can't safely destory, leaking memory!\n",
> >>> -                             page_ref_count(page), i);
> >>> +                             page_pool_page_ref_count(page), i);
> >>>                        free = false;
> >>>                        break;
> >>>                }
> >>> @@ -1489,7 +1520,7 @@ static struct page *mp_huge_1g_alloc_pages(struct page_pool *pool, gfp_t gfp)
> >>>                page = hu->page + page_i;
> >>>
> >>>                if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
> >>> -                 page_ref_count(page) != 1) {
> >>> +                 page_pool_page_ref_count(page) != 1) {
> >>>                        atomic_inc(&mp_huge_ins_b);
> >>>                        continue;
> >>>                }
> >>> --
> >>> 2.41.0.640.ga95def55d0-goog
> >>>
> >>
> >
> >
> > --
> > Thanks,
> > Mina
> >
>


--
Thanks,
Mina
David Wei Sept. 8, 2023, 2:32 a.m. UTC | #14
On 19/08/2023 13:24, Mina Almasry wrote:
> On Sat, Aug 19, 2023 at 8:22 AM Jesper Dangaard Brouer
> <jbrouer@redhat.com> wrote:
>>
>>
>>
>> On 19/08/2023 16.08, Willem de Bruijn wrote:
>>> On Sat, Aug 19, 2023 at 5:51 AM Jesper Dangaard Brouer
>>> <jbrouer@redhat.com> wrote:
>>>>
>>>>
>>>>
>>>> On 10/08/2023 03.57, Mina Almasry wrote:
>>>>> Overload the LSB of struct page* to indicate that it's a page_pool_iov.
>>>>>
>>>>> Refactor mm calls on struct page * into helpers, and add page_pool_iov
>>>>> handling on those helpers. Modify callers of these mm APIs with calls to
>>>>> these helpers instead.
>>>>>
>>>>
>>>> I don't like of this approach.
>>>> This is adding code to the PP (page_pool) fast-path in multiple places.
>>>>
>>>> I've not had time to run my usual benchmarks, which are here:
>>>>
>>>> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
>>>>
> 
> Thank you for linking that, I'll try to run these against the next submission.
> 
>>>> But I'm sure it will affect performance.
>>>>
>>>> Regardless of performance, this approach is using ptr-LSB-bits, to hide
>>>> that page-pointer are not really struct-pages, feels like force feeding
>>>> a solution just to use the page_pool APIs.
>>>>
>>>>
>>>>> In areas where struct page* is dereferenced, add a check for special
>>>>> handling of page_pool_iov.
>>>>>
>>>>> The memory providers producing page_pool_iov can set the LSB on the
>>>>> struct page* returned to the page pool.
>>>>>
>>>>> Note that instead of overloading the LSB of page pointers, we can
>>>>> instead define a new union between struct page & struct page_pool_iov and
>>>>> compact it in a new type. However, we'd need to implement the code churn
>>>>> to modify the page_pool & drivers to use this new type. For this POC
>>>>> that is not implemented (feedback welcome).
>>>>>
>>>>
>>>> I've said before, that I prefer multiplexing on page->pp_magic.
>>>> For your page_pool_iov the layout would have to match the offset of
>>>> pp_magic, to do this. (And if insisting on using PP infra the refcnt
>>>> would also need to align).
>>>
>>> Perhaps I misunderstand, but this suggests continuing to using
>>> struct page to demultiplex memory type?
>>>
>>
>> (Perhaps we are misunderstanding each-other and my use of the words
>> multiplexing and demultiplex are wrong, I'm sorry, as English isn't my
>> native language.)
>>
>> I do see the problem of depending on having a struct page, as the
>> page_pool_iov isn't related to struct page.  Having "page" in the name
>> of "page_pool_iov" is also confusing (hardest problem is CS is naming,
>> as we all know).
>>
>> To support more allocator types, perhaps skb->pp_recycle bit need to
>> grow another bit (and be renamed skb->recycle), so we can tell allocator
>> types apart, those that are page based and those whom are not.
>>
>>
>>> I think the feedback has been strong to not multiplex yet another
>>> memory type into that struct, that is not a real page. Which is why
>>> we went into this direction. This latest series limits the impact largely
>>> to networking structures and code.
>>>
>>
>> Some what related what I'm objecting to: the "page_pool_iov" is not a
>> real page, but this getting recycled into something called "page_pool",
>> which funny enough deals with struct-pages internally and depend on the
>> struct-page-refcnt.
>>
>> Given the approach changed way from using struct page, then I also don't
>> see the connection with the page_pool. Sorry.
>>
>>> One way or another, there will be a branch and multiplexing. Whether
>>> that is in struct page, the page pool or a new netdev mem type as you
>>> propose.
>>>
>>
>> I'm asking to have this branch/multiplexing done a the call sites.
>>
>> (IMHO not changing the drivers is a pipe-dream.)
>>
> 
> I think I understand what Jesper is saying. I think Jesper wants the
> page_pool to remain unchanged, and another layer on top of it to do
> the multiplexing, i.e.:
> 
> driver ---> new_layer (does multiplexing) ---> page_pool -------> mm layer
>                                 \------------------------------>
> devmem_pool --> dma-buf layer
> 
> But, I think, Jakub wants the page_pool to be the front end, and for
> the multiplexing to happen in the page pool (I think, Jakub did not
> write this in an email, but this is how I interpret his comments from
> [1], and his memory provider RFC). So I think Jakub wants:
> 
> driver --> page_pool ---> memory_provider (does multiplexing) --->
> basic_provider -------> mm layer
> 
> \----------------------------------------> devmem_provider --> dma-buf
> layer
> 
> That is the approach in this RFC.
> 
> I think proper naming that makes sense can be figured out, and is not
> a huge issue. I think in both cases we can minimize the changes to the
> drivers, maybe. In the first approach the driver will need to use the
> APIs created by the new layer. In the second approach, the driver
> continues to use page_pool APIs.
> 
> I think we need to converge on a path between the 2 approaches (or
> maybe 3rd approach to do). For me the pros/cons of each approach
> (please add):
> 
> multiplexing in new_layer:
> - Pro: maybe better for performance? Not sure if static_branch can
> achieve the same perf. I can verify with Jesper's perf tests.
> - Pro: doesn't add complexity in the page_pool (but adds complexity in
> terms of adding new pools like devmem_pool)
> - Con: the devmem_pool & page_pool will end up being duplicated code,
> I suspect, because they largely do similar things (both need to
> recycle memory for example).

Hi all, for our ZC RX into userspace host memory implementation, we opted for
the 1st approach [1] that was initially suggested by Kuba. The multiplexing was
added into a thin shim layer that we named "data_pool", but is functionally
identical to devmem_pool.

Seeing Mina's work in this patchset, I would also prefer the 2nd approach i.e.
multiplexing via memory_provider within a page_pool. In addition to the
pros/cons Mina mentioned, in our case the pages are actually pages so it makes
sense to me to put it inside the page_pool.

[1]: https://lore.kernel.org/io-uring/20230826011954.1801099-11-dw@davidwei.uk/

> 
> multiplexing via memory_provider:
> - Pro: no code duplication.
> - Pro: less changes to the drivers, I think. The drivers can continue
> to use the page_pool API, no need to introduce calls to 'new_layer'.
> - Con: adds complexity to the page_pool (needs to handle devmem).
> - Con: probably careful handling via static_branch needs to be done to
> achieve performance.
> 
> [1] https://lore.kernel.org/netdev/20230619110705.106ec599@kernel.org/
> 
>>> Any regression in page pool can be avoided in the common case that
>>> does not use device mem by placing that behind a static_branch. Would
>>> that address your performance concerns?
>>>
>>
>> No. This will not help.
>>
>> The problem is that every where in the page_pool code it is getting
>> polluted with:
>>
>>    if (page_is_page_pool_iov(page))
>>      call-some-iov-func-instead()
>>
>> Like: the very central piece of getting the refcnt:
>>
>> +static inline int page_pool_page_ref_count(struct page *page)
>> +{
>> +       if (page_is_page_pool_iov(page))
>> +               return page_pool_iov_refcount(page_to_page_pool_iov(page));
>> +
>> +       return page_ref_count(page);
>> +}
>>
>>
>> The fast-path of the PP is used for XDP_DROP scenarios, and is currently
>> around 14 cycles (tsc). Thus, any extra code in this code patch will
>> change the fast-path.
>>
>>
>>>>
>>>> On the allocation side, all drivers already use a driver helper
>>>> page_pool_dev_alloc_pages() or we could add another (better named)
>>>> helper to multiplex between other types of allocators, e.g. a devmem
>>>> allocator.
>>>>
>>>> On free/return/recycle the functions napi_pp_put_page or skb_pp_recycle
>>>> could multiplex on pp_magic and call another API.  The API could be an
>>>> extension to PP helpers, but it could also be a devmap allocator helper.
>>>>
>>>> IMHO forcing/piggy-bagging everything into page_pool is not the right
>>>> solution.  I really think netstack need to support different allocator
>>>> types.
>>>
>>> To me this is lifting page_pool into such a netstack alloctator pool.
>>>
>>
>> This is should be renamed as it is not longer dealing with pages.
>>
>>> Not sure adding another explicit layer of indirection would be cleaner
>>> or faster (potentially more indirect calls).
>>>
>>
>> It seems we are talking past each-other.  The layer of indirection I'm
>> talking about is likely a simple header file (e.g. named netmem.h) that
>> will get inline compiled so there is no overhead. It will be used by
>> driver, such that we can avoid touching driver again when introducing
>> new memory allocator types.
>>
>>
>>> As for the LSB trick: that avoided adding a lot of boilerplate churn
>>> with new type and helper functions.
>>>
>>
>> Says the lazy programmer :-P ... sorry could not resist ;-)
>>
>>>
>>>
>>>> The page pool have been leading the way, yes, but perhaps it is
>>>> time to add an API layer that e.g. could be named netmem, that gives us
>>>> the multiplexing between allocators.  In that process some of page_pool
>>>> APIs would be lifted out as common blocks and others remain.
>>>>
>>>> --Jesper
>>>>
>>>>> I have a sample implementation of adding a new page_pool_token type
>>>>> in the page_pool to give a general idea here:
>>>>> https://github.com/torvalds/linux/commit/3a7628700eb7fd02a117db036003bca50779608d
>>>>>
>>>>> Full branch here:
>>>>> https://github.com/torvalds/linux/compare/master...mina:linux:tcpdevmem-pp-tokens
>>>>>
>>>>> (In the branches above, page_pool_iov is called devmem_slice).
>>>>>
>>>>> Could also add static_branch to speed up the checks in page_pool_iov
>>>>> memory providers are being used.
>>>>>
>>>>> Signed-off-by: Mina Almasry <almasrymina@google.com>
>>>>> ---
>>>>>    include/net/page_pool.h | 74 ++++++++++++++++++++++++++++++++++-
>>>>>    net/core/page_pool.c    | 85 ++++++++++++++++++++++++++++-------------
>>>>>    2 files changed, 131 insertions(+), 28 deletions(-)
>>>>>
>>>>> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
>>>>> index 537eb36115ed..f08ca230d68e 100644
>>>>> --- a/include/net/page_pool.h
>>>>> +++ b/include/net/page_pool.h
>>>>> @@ -282,6 +282,64 @@ static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page)
>>>>>        return NULL;
>>>>>    }
>>>>>
>>>>> +static inline int page_pool_page_ref_count(struct page *page)
>>>>> +{
>>>>> +     if (page_is_page_pool_iov(page))
>>>>> +             return page_pool_iov_refcount(page_to_page_pool_iov(page));
>>>>> +
>>>>> +     return page_ref_count(page);
>>>>> +}
>>>>> +
>>>>> +static inline void page_pool_page_get_many(struct page *page,
>>>>> +                                        unsigned int count)
>>>>> +{
>>>>> +     if (page_is_page_pool_iov(page))
>>>>> +             return page_pool_iov_get_many(page_to_page_pool_iov(page),
>>>>> +                                           count);
>>>>> +
>>>>> +     return page_ref_add(page, count);
>>>>> +}
>>>>> +
>>>>> +static inline void page_pool_page_put_many(struct page *page,
>>>>> +                                        unsigned int count)
>>>>> +{
>>>>> +     if (page_is_page_pool_iov(page))
>>>>> +             return page_pool_iov_put_many(page_to_page_pool_iov(page),
>>>>> +                                           count);
>>>>> +
>>>>> +     if (count > 1)
>>>>> +             page_ref_sub(page, count - 1);
>>>>> +
>>>>> +     put_page(page);
>>>>> +}
>>>>> +
>>>>> +static inline bool page_pool_page_is_pfmemalloc(struct page *page)
>>>>> +{
>>>>> +     if (page_is_page_pool_iov(page))
>>>>> +             return false;
>>>>> +
>>>>> +     return page_is_pfmemalloc(page);
>>>>> +}
>>>>> +
>>>>> +static inline bool page_pool_page_is_pref_nid(struct page *page, int pref_nid)
>>>>> +{
>>>>> +     /* Assume page_pool_iov are on the preferred node without actually
>>>>> +      * checking...
>>>>> +      *
>>>>> +      * This check is only used to check for recycling memory in the page
>>>>> +      * pool's fast paths. Currently the only implementation of page_pool_iov
>>>>> +      * is dmabuf device memory. It's a deliberate decision by the user to
>>>>> +      * bind a certain dmabuf to a certain netdev, and the netdev rx queue
>>>>> +      * would not be able to reallocate memory from another dmabuf that
>>>>> +      * exists on the preferred node, so, this check doesn't make much sense
>>>>> +      * in this case. Assume all page_pool_iovs can be recycled for now.
>>>>> +      */
>>>>> +     if (page_is_page_pool_iov(page))
>>>>> +             return true;
>>>>> +
>>>>> +     return page_to_nid(page) == pref_nid;
>>>>> +}
>>>>> +
>>>>>    struct page_pool {
>>>>>        struct page_pool_params p;
>>>>>
>>>>> @@ -434,6 +492,9 @@ static inline long page_pool_defrag_page(struct page *page, long nr)
>>>>>    {
>>>>>        long ret;
>>>>>
>>>>> +     if (page_is_page_pool_iov(page))
>>>>> +             return -EINVAL;
>>>>> +
>>>>>        /* If nr == pp_frag_count then we have cleared all remaining
>>>>>         * references to the page. No need to actually overwrite it, instead
>>>>>         * we can leave this to be overwritten by the calling function.
>>>>> @@ -494,7 +555,12 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,
>>>>>
>>>>>    static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
>>>>>    {
>>>>> -     dma_addr_t ret = page->dma_addr;
>>>>> +     dma_addr_t ret;
>>>>> +
>>>>> +     if (page_is_page_pool_iov(page))
>>>>> +             return page_pool_iov_dma_addr(page_to_page_pool_iov(page));
>>>>> +
>>>>> +     ret = page->dma_addr;
>>>>>
>>>>>        if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
>>>>>                ret |= (dma_addr_t)page->dma_addr_upper << 16 << 16;
>>>>> @@ -504,6 +570,12 @@ static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
>>>>>
>>>>>    static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
>>>>>    {
>>>>> +     /* page_pool_iovs are mapped and their dma-addr can't be modified. */
>>>>> +     if (page_is_page_pool_iov(page)) {
>>>>> +             DEBUG_NET_WARN_ON_ONCE(true);
>>>>> +             return;
>>>>> +     }
>>>>> +
>>>>>        page->dma_addr = addr;
>>>>>        if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
>>>>>                page->dma_addr_upper = upper_32_bits(addr);
>>>>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>>>>> index 0a7c08d748b8..20c1f74fd844 100644
>>>>> --- a/net/core/page_pool.c
>>>>> +++ b/net/core/page_pool.c
>>>>> @@ -318,7 +318,7 @@ static struct page *page_pool_refill_alloc_cache(struct page_pool *pool)
>>>>>                if (unlikely(!page))
>>>>>                        break;
>>>>>
>>>>> -             if (likely(page_to_nid(page) == pref_nid)) {
>>>>> +             if (likely(page_pool_page_is_pref_nid(page, pref_nid))) {
>>>>>                        pool->alloc.cache[pool->alloc.count++] = page;
>>>>>                } else {
>>>>>                        /* NUMA mismatch;
>>>>> @@ -363,7 +363,15 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
>>>>>                                          struct page *page,
>>>>>                                          unsigned int dma_sync_size)
>>>>>    {
>>>>> -     dma_addr_t dma_addr = page_pool_get_dma_addr(page);
>>>>> +     dma_addr_t dma_addr;
>>>>> +
>>>>> +     /* page_pool_iov memory provider do not support PP_FLAG_DMA_SYNC_DEV */
>>>>> +     if (page_is_page_pool_iov(page)) {
>>>>> +             DEBUG_NET_WARN_ON_ONCE(true);
>>>>> +             return;
>>>>> +     }
>>>>> +
>>>>> +     dma_addr = page_pool_get_dma_addr(page);
>>>>>
>>>>>        dma_sync_size = min(dma_sync_size, pool->p.max_len);
>>>>>        dma_sync_single_range_for_device(pool->p.dev, dma_addr,
>>>>> @@ -375,6 +383,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
>>>>>    {
>>>>>        dma_addr_t dma;
>>>>>
>>>>> +     if (page_is_page_pool_iov(page)) {
>>>>> +             /* page_pool_iovs are already mapped */
>>>>> +             DEBUG_NET_WARN_ON_ONCE(true);
>>>>> +             return true;
>>>>> +     }
>>>>> +
>>>>>        /* Setup DMA mapping: use 'struct page' area for storing DMA-addr
>>>>>         * since dma_addr_t can be either 32 or 64 bits and does not always fit
>>>>>         * into page private data (i.e 32bit cpu with 64bit DMA caps)
>>>>> @@ -398,14 +412,24 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
>>>>>    static void page_pool_set_pp_info(struct page_pool *pool,
>>>>>                                  struct page *page)
>>>>>    {
>>>>> -     page->pp = pool;
>>>>> -     page->pp_magic |= PP_SIGNATURE;
>>>>> +     if (!page_is_page_pool_iov(page)) {
>>>>> +             page->pp = pool;
>>>>> +             page->pp_magic |= PP_SIGNATURE;
>>>>> +     } else {
>>>>> +             page_to_page_pool_iov(page)->pp = pool;
>>>>> +     }
>>>>> +
>>>>>        if (pool->p.init_callback)
>>>>>                pool->p.init_callback(page, pool->p.init_arg);
>>>>>    }
>>>>>
>>>>>    static void page_pool_clear_pp_info(struct page *page)
>>>>>    {
>>>>> +     if (page_is_page_pool_iov(page)) {
>>>>> +             page_to_page_pool_iov(page)->pp = NULL;
>>>>> +             return;
>>>>> +     }
>>>>> +
>>>>>        page->pp_magic = 0;
>>>>>        page->pp = NULL;
>>>>>    }
>>>>> @@ -615,7 +639,7 @@ static bool page_pool_recycle_in_cache(struct page *page,
>>>>>                return false;
>>>>>        }
>>>>>
>>>>> -     /* Caller MUST have verified/know (page_ref_count(page) == 1) */
>>>>> +     /* Caller MUST have verified/know (page_pool_page_ref_count(page) == 1) */
>>>>>        pool->alloc.cache[pool->alloc.count++] = page;
>>>>>        recycle_stat_inc(pool, cached);
>>>>>        return true;
>>>>> @@ -638,9 +662,10 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
>>>>>         * refcnt == 1 means page_pool owns page, and can recycle it.
>>>>>         *
>>>>>         * page is NOT reusable when allocated when system is under
>>>>> -      * some pressure. (page_is_pfmemalloc)
>>>>> +      * some pressure. (page_pool_page_is_pfmemalloc)
>>>>>         */
>>>>> -     if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
>>>>> +     if (likely(page_pool_page_ref_count(page) == 1 &&
>>>>> +                !page_pool_page_is_pfmemalloc(page))) {
>>>>>                /* Read barrier done in page_ref_count / READ_ONCE */
>>>>>
>>>>>                if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
>>>>> @@ -741,7 +766,8 @@ static struct page *page_pool_drain_frag(struct page_pool *pool,
>>>>>        if (likely(page_pool_defrag_page(page, drain_count)))
>>>>>                return NULL;
>>>>>
>>>>> -     if (page_ref_count(page) == 1 && !page_is_pfmemalloc(page)) {
>>>>> +     if (page_pool_page_ref_count(page) == 1 &&
>>>>> +         !page_pool_page_is_pfmemalloc(page)) {
>>>>>                if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
>>>>>                        page_pool_dma_sync_for_device(pool, page, -1);
>>>>>
>>>>> @@ -818,9 +844,9 @@ static void page_pool_empty_ring(struct page_pool *pool)
>>>>>        /* Empty recycle ring */
>>>>>        while ((page = ptr_ring_consume_bh(&pool->ring))) {
>>>>>                /* Verify the refcnt invariant of cached pages */
>>>>> -             if (!(page_ref_count(page) == 1))
>>>>> +             if (!(page_pool_page_ref_count(page) == 1))
>>>>>                        pr_crit("%s() page_pool refcnt %d violation\n",
>>>>> -                             __func__, page_ref_count(page));
>>>>> +                             __func__, page_pool_page_ref_count(page));
>>>>>
>>>>>                page_pool_return_page(pool, page);
>>>>>        }
>>>>> @@ -977,19 +1003,24 @@ bool page_pool_return_skb_page(struct page *page, bool napi_safe)
>>>>>        struct page_pool *pp;
>>>>>        bool allow_direct;
>>>>>
>>>>> -     page = compound_head(page);
>>>>> +     if (!page_is_page_pool_iov(page)) {
>>>>> +             page = compound_head(page);
>>>>>
>>>>> -     /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
>>>>> -      * in order to preserve any existing bits, such as bit 0 for the
>>>>> -      * head page of compound page and bit 1 for pfmemalloc page, so
>>>>> -      * mask those bits for freeing side when doing below checking,
>>>>> -      * and page_is_pfmemalloc() is checked in __page_pool_put_page()
>>>>> -      * to avoid recycling the pfmemalloc page.
>>>>> -      */
>>>>> -     if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
>>>>> -             return false;
>>>>> +             /* page->pp_magic is OR'ed with PP_SIGNATURE after the
>>>>> +              * allocation in order to preserve any existing bits, such as
>>>>> +              * bit 0 for the head page of compound page and bit 1 for
>>>>> +              * pfmemalloc page, so mask those bits for freeing side when
>>>>> +              * doing below checking, and page_pool_page_is_pfmemalloc() is
>>>>> +              * checked in __page_pool_put_page() to avoid recycling the
>>>>> +              * pfmemalloc page.
>>>>> +              */
>>>>> +             if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
>>>>> +                     return false;
>>>>>
>>>>> -     pp = page->pp;
>>>>> +             pp = page->pp;
>>>>> +     } else {
>>>>> +             pp = page_to_page_pool_iov(page)->pp;
>>>>> +     }
>>>>>
>>>>>        /* Allow direct recycle if we have reasons to believe that we are
>>>>>         * in the same context as the consumer would run, so there's
>>>>> @@ -1273,9 +1304,9 @@ static bool mp_huge_busy(struct mp_huge *hu, unsigned int idx)
>>>>>
>>>>>        for (j = 0; j < (1 << MP_HUGE_ORDER); j++) {
>>>>>                page = hu->page[idx] + j;
>>>>> -             if (page_ref_count(page) != 1) {
>>>>> +             if (page_pool_page_ref_count(page) != 1) {
>>>>>                        pr_warn("Page with ref count %d at %u, %u. Can't safely destory, leaking memory!\n",
>>>>> -                             page_ref_count(page), idx, j);
>>>>> +                             page_pool_page_ref_count(page), idx, j);
>>>>>                        return true;
>>>>>                }
>>>>>        }
>>>>> @@ -1330,7 +1361,7 @@ static struct page *mp_huge_alloc_pages(struct page_pool *pool, gfp_t gfp)
>>>>>                        continue;
>>>>>
>>>>>                if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
>>>>> -                 page_ref_count(page) != 1) {
>>>>> +                 page_pool_page_ref_count(page) != 1) {
>>>>>                        atomic_inc(&mp_huge_ins_b);
>>>>>                        continue;
>>>>>                }
>>>>> @@ -1458,9 +1489,9 @@ static void mp_huge_1g_destroy(struct page_pool *pool)
>>>>>        free = true;
>>>>>        for (i = 0; i < MP_HUGE_1G_CNT; i++) {
>>>>>                page = hu->page + i;
>>>>> -             if (page_ref_count(page) != 1) {
>>>>> +             if (page_pool_page_ref_count(page) != 1) {
>>>>>                        pr_warn("Page with ref count %d at %u. Can't safely destory, leaking memory!\n",
>>>>> -                             page_ref_count(page), i);
>>>>> +                             page_pool_page_ref_count(page), i);
>>>>>                        free = false;
>>>>>                        break;
>>>>>                }
>>>>> @@ -1489,7 +1520,7 @@ static struct page *mp_huge_1g_alloc_pages(struct page_pool *pool, gfp_t gfp)
>>>>>                page = hu->page + page_i;
>>>>>
>>>>>                if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
>>>>> -                 page_ref_count(page) != 1) {
>>>>> +                 page_pool_page_ref_count(page) != 1) {
>>>>>                        atomic_inc(&mp_huge_ins_b);
>>>>>                        continue;
>>>>>                }
>>>>> --
>>>>> 2.41.0.640.ga95def55d0-goog
>>>>>
>>>>
>>>
>>
> 
> 
> --
> Thanks,
> Mina
diff mbox series

Patch

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 537eb36115ed..f08ca230d68e 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -282,6 +282,64 @@  static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page)
 	return NULL;
 }

+static inline int page_pool_page_ref_count(struct page *page)
+{
+	if (page_is_page_pool_iov(page))
+		return page_pool_iov_refcount(page_to_page_pool_iov(page));
+
+	return page_ref_count(page);
+}
+
+static inline void page_pool_page_get_many(struct page *page,
+					   unsigned int count)
+{
+	if (page_is_page_pool_iov(page))
+		return page_pool_iov_get_many(page_to_page_pool_iov(page),
+					      count);
+
+	return page_ref_add(page, count);
+}
+
+static inline void page_pool_page_put_many(struct page *page,
+					   unsigned int count)
+{
+	if (page_is_page_pool_iov(page))
+		return page_pool_iov_put_many(page_to_page_pool_iov(page),
+					      count);
+
+	if (count > 1)
+		page_ref_sub(page, count - 1);
+
+	put_page(page);
+}
+
+static inline bool page_pool_page_is_pfmemalloc(struct page *page)
+{
+	if (page_is_page_pool_iov(page))
+		return false;
+
+	return page_is_pfmemalloc(page);
+}
+
+static inline bool page_pool_page_is_pref_nid(struct page *page, int pref_nid)
+{
+	/* Assume page_pool_iov are on the preferred node without actually
+	 * checking...
+	 *
+	 * This check is only used to check for recycling memory in the page
+	 * pool's fast paths. Currently the only implementation of page_pool_iov
+	 * is dmabuf device memory. It's a deliberate decision by the user to
+	 * bind a certain dmabuf to a certain netdev, and the netdev rx queue
+	 * would not be able to reallocate memory from another dmabuf that
+	 * exists on the preferred node, so, this check doesn't make much sense
+	 * in this case. Assume all page_pool_iovs can be recycled for now.
+	 */
+	if (page_is_page_pool_iov(page))
+		return true;
+
+	return page_to_nid(page) == pref_nid;
+}
+
 struct page_pool {
 	struct page_pool_params p;

@@ -434,6 +492,9 @@  static inline long page_pool_defrag_page(struct page *page, long nr)
 {
 	long ret;

+	if (page_is_page_pool_iov(page))
+		return -EINVAL;
+
 	/* If nr == pp_frag_count then we have cleared all remaining
 	 * references to the page. No need to actually overwrite it, instead
 	 * we can leave this to be overwritten by the calling function.
@@ -494,7 +555,12 @@  static inline void page_pool_recycle_direct(struct page_pool *pool,

 static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
 {
-	dma_addr_t ret = page->dma_addr;
+	dma_addr_t ret;
+
+	if (page_is_page_pool_iov(page))
+		return page_pool_iov_dma_addr(page_to_page_pool_iov(page));
+
+	ret = page->dma_addr;

 	if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
 		ret |= (dma_addr_t)page->dma_addr_upper << 16 << 16;
@@ -504,6 +570,12 @@  static inline dma_addr_t page_pool_get_dma_addr(struct page *page)

 static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
 {
+	/* page_pool_iovs are mapped and their dma-addr can't be modified. */
+	if (page_is_page_pool_iov(page)) {
+		DEBUG_NET_WARN_ON_ONCE(true);
+		return;
+	}
+
 	page->dma_addr = addr;
 	if (PAGE_POOL_DMA_USE_PP_FRAG_COUNT)
 		page->dma_addr_upper = upper_32_bits(addr);
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 0a7c08d748b8..20c1f74fd844 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -318,7 +318,7 @@  static struct page *page_pool_refill_alloc_cache(struct page_pool *pool)
 		if (unlikely(!page))
 			break;

-		if (likely(page_to_nid(page) == pref_nid)) {
+		if (likely(page_pool_page_is_pref_nid(page, pref_nid))) {
 			pool->alloc.cache[pool->alloc.count++] = page;
 		} else {
 			/* NUMA mismatch;
@@ -363,7 +363,15 @@  static void page_pool_dma_sync_for_device(struct page_pool *pool,
 					  struct page *page,
 					  unsigned int dma_sync_size)
 {
-	dma_addr_t dma_addr = page_pool_get_dma_addr(page);
+	dma_addr_t dma_addr;
+
+	/* page_pool_iov memory provider do not support PP_FLAG_DMA_SYNC_DEV */
+	if (page_is_page_pool_iov(page)) {
+		DEBUG_NET_WARN_ON_ONCE(true);
+		return;
+	}
+
+	dma_addr = page_pool_get_dma_addr(page);

 	dma_sync_size = min(dma_sync_size, pool->p.max_len);
 	dma_sync_single_range_for_device(pool->p.dev, dma_addr,
@@ -375,6 +383,12 @@  static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
 {
 	dma_addr_t dma;

+	if (page_is_page_pool_iov(page)) {
+		/* page_pool_iovs are already mapped */
+		DEBUG_NET_WARN_ON_ONCE(true);
+		return true;
+	}
+
 	/* Setup DMA mapping: use 'struct page' area for storing DMA-addr
 	 * since dma_addr_t can be either 32 or 64 bits and does not always fit
 	 * into page private data (i.e 32bit cpu with 64bit DMA caps)
@@ -398,14 +412,24 @@  static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
 static void page_pool_set_pp_info(struct page_pool *pool,
 				  struct page *page)
 {
-	page->pp = pool;
-	page->pp_magic |= PP_SIGNATURE;
+	if (!page_is_page_pool_iov(page)) {
+		page->pp = pool;
+		page->pp_magic |= PP_SIGNATURE;
+	} else {
+		page_to_page_pool_iov(page)->pp = pool;
+	}
+
 	if (pool->p.init_callback)
 		pool->p.init_callback(page, pool->p.init_arg);
 }

 static void page_pool_clear_pp_info(struct page *page)
 {
+	if (page_is_page_pool_iov(page)) {
+		page_to_page_pool_iov(page)->pp = NULL;
+		return;
+	}
+
 	page->pp_magic = 0;
 	page->pp = NULL;
 }
@@ -615,7 +639,7 @@  static bool page_pool_recycle_in_cache(struct page *page,
 		return false;
 	}

-	/* Caller MUST have verified/know (page_ref_count(page) == 1) */
+	/* Caller MUST have verified/know (page_pool_page_ref_count(page) == 1) */
 	pool->alloc.cache[pool->alloc.count++] = page;
 	recycle_stat_inc(pool, cached);
 	return true;
@@ -638,9 +662,10 @@  __page_pool_put_page(struct page_pool *pool, struct page *page,
 	 * refcnt == 1 means page_pool owns page, and can recycle it.
 	 *
 	 * page is NOT reusable when allocated when system is under
-	 * some pressure. (page_is_pfmemalloc)
+	 * some pressure. (page_pool_page_is_pfmemalloc)
 	 */
-	if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
+	if (likely(page_pool_page_ref_count(page) == 1 &&
+		   !page_pool_page_is_pfmemalloc(page))) {
 		/* Read barrier done in page_ref_count / READ_ONCE */

 		if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
@@ -741,7 +766,8 @@  static struct page *page_pool_drain_frag(struct page_pool *pool,
 	if (likely(page_pool_defrag_page(page, drain_count)))
 		return NULL;

-	if (page_ref_count(page) == 1 && !page_is_pfmemalloc(page)) {
+	if (page_pool_page_ref_count(page) == 1 &&
+	    !page_pool_page_is_pfmemalloc(page)) {
 		if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
 			page_pool_dma_sync_for_device(pool, page, -1);

@@ -818,9 +844,9 @@  static void page_pool_empty_ring(struct page_pool *pool)
 	/* Empty recycle ring */
 	while ((page = ptr_ring_consume_bh(&pool->ring))) {
 		/* Verify the refcnt invariant of cached pages */
-		if (!(page_ref_count(page) == 1))
+		if (!(page_pool_page_ref_count(page) == 1))
 			pr_crit("%s() page_pool refcnt %d violation\n",
-				__func__, page_ref_count(page));
+				__func__, page_pool_page_ref_count(page));

 		page_pool_return_page(pool, page);
 	}
@@ -977,19 +1003,24 @@  bool page_pool_return_skb_page(struct page *page, bool napi_safe)
 	struct page_pool *pp;
 	bool allow_direct;

-	page = compound_head(page);
+	if (!page_is_page_pool_iov(page)) {
+		page = compound_head(page);

-	/* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
-	 * in order to preserve any existing bits, such as bit 0 for the
-	 * head page of compound page and bit 1 for pfmemalloc page, so
-	 * mask those bits for freeing side when doing below checking,
-	 * and page_is_pfmemalloc() is checked in __page_pool_put_page()
-	 * to avoid recycling the pfmemalloc page.
-	 */
-	if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
-		return false;
+		/* page->pp_magic is OR'ed with PP_SIGNATURE after the
+		 * allocation in order to preserve any existing bits, such as
+		 * bit 0 for the head page of compound page and bit 1 for
+		 * pfmemalloc page, so mask those bits for freeing side when
+		 * doing below checking, and page_pool_page_is_pfmemalloc() is
+		 * checked in __page_pool_put_page() to avoid recycling the
+		 * pfmemalloc page.
+		 */
+		if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
+			return false;

-	pp = page->pp;
+		pp = page->pp;
+	} else {
+		pp = page_to_page_pool_iov(page)->pp;
+	}

 	/* Allow direct recycle if we have reasons to believe that we are
 	 * in the same context as the consumer would run, so there's
@@ -1273,9 +1304,9 @@  static bool mp_huge_busy(struct mp_huge *hu, unsigned int idx)

 	for (j = 0; j < (1 << MP_HUGE_ORDER); j++) {
 		page = hu->page[idx] + j;
-		if (page_ref_count(page) != 1) {
+		if (page_pool_page_ref_count(page) != 1) {
 			pr_warn("Page with ref count %d at %u, %u. Can't safely destory, leaking memory!\n",
-				page_ref_count(page), idx, j);
+				page_pool_page_ref_count(page), idx, j);
 			return true;
 		}
 	}
@@ -1330,7 +1361,7 @@  static struct page *mp_huge_alloc_pages(struct page_pool *pool, gfp_t gfp)
 			continue;

 		if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
-		    page_ref_count(page) != 1) {
+		    page_pool_page_ref_count(page) != 1) {
 			atomic_inc(&mp_huge_ins_b);
 			continue;
 		}
@@ -1458,9 +1489,9 @@  static void mp_huge_1g_destroy(struct page_pool *pool)
 	free = true;
 	for (i = 0; i < MP_HUGE_1G_CNT; i++) {
 		page = hu->page + i;
-		if (page_ref_count(page) != 1) {
+		if (page_pool_page_ref_count(page) != 1) {
 			pr_warn("Page with ref count %d at %u. Can't safely destory, leaking memory!\n",
-				page_ref_count(page), i);
+				page_pool_page_ref_count(page), i);
 			free = false;
 			break;
 		}
@@ -1489,7 +1520,7 @@  static struct page *mp_huge_1g_alloc_pages(struct page_pool *pool, gfp_t gfp)
 		page = hu->page + page_i;

 		if ((page->pp_magic & ~0x3UL) == PP_SIGNATURE ||
-		    page_ref_count(page) != 1) {
+		    page_pool_page_ref_count(page) != 1) {
 			atomic_inc(&mp_huge_ins_b);
 			continue;
 		}