
[RFC,v3,07/12] page-pool: device memory support

Message ID 20231106024413.2801438-8-almasrymina@google.com
State New, archived
Series Device Memory TCP

Commit Message

Mina Almasry Nov. 6, 2023, 2:44 a.m. UTC
Overload the LSB of struct page* to indicate that it's a page_pool_iov.

Refactor mm calls on struct page* into helpers, and add page_pool_iov
handling on those helpers. Modify callers of these mm APIs with calls to
these helpers instead.

In areas where struct page* is dereferenced, add a check for special
handling of page_pool_iov.
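
For reference, the LSB-tagging scheme this patch builds on looks roughly like
the below (the two helpers are introduced earlier in the series; the PP_IOV
mask name is an assumption here):

#define PP_IOV 0x01UL

static inline bool page_is_page_pool_iov(const struct page *page)
{
	/* Real struct page pointers are aligned, so their LSB is clear;
	 * a page_pool_iov pointer is stored with the LSB set.
	 */
	return (unsigned long)page & PP_IOV;
}

static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page)
{
	if (page_is_page_pool_iov(page))
		return (struct page_pool_iov *)((unsigned long)page & ~PP_IOV);

	DEBUG_NET_WARN_ON_ONCE(true);
	return NULL;
}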

Signed-off-by: Mina Almasry <almasrymina@google.com>

---
 include/net/page_pool/helpers.h | 74 ++++++++++++++++++++++++++++++++-
 net/core/page_pool.c            | 63 ++++++++++++++++++++--------
 2 files changed, 118 insertions(+), 19 deletions(-)

Comments

Yunsheng Lin Nov. 7, 2023, 8 a.m. UTC | #1
On 2023/11/6 10:44, Mina Almasry wrote:
> Overload the LSB of struct page* to indicate that it's a page_pool_iov.
> 
> Refactor mm calls on struct page* into helpers, and add page_pool_iov
> handling on those helpers. Modify callers of these mm APIs with calls to
> these helpers instead.
> 
> In areas where struct page* is dereferenced, add a check for special
> handling of page_pool_iov.
> 
> Signed-off-by: Mina Almasry <almasrymina@google.com>
> 
> ---
>  include/net/page_pool/helpers.h | 74 ++++++++++++++++++++++++++++++++-
>  net/core/page_pool.c            | 63 ++++++++++++++++++++--------
>  2 files changed, 118 insertions(+), 19 deletions(-)
> 
> diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
> index b93243c2a640..08f1a2cc70d2 100644
> --- a/include/net/page_pool/helpers.h
> +++ b/include/net/page_pool/helpers.h
> @@ -151,6 +151,64 @@ static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page)
>  	return NULL;
>  }
>  
> +static inline int page_pool_page_ref_count(struct page *page)
> +{
> +	if (page_is_page_pool_iov(page))
> +		return page_pool_iov_refcount(page_to_page_pool_iov(page));

We have added a lot of 'if's for the devmem case; it would be better to
make it more generic so that we can have more unified metadata handling
for normal page and devmem. If we add another memory type here, do we
need another 'if' here?
That is part of the reason I suggested using a more unified metadata for
all the types of memory chunks used by page_pool.
Mina Almasry Nov. 7, 2023, 9:56 p.m. UTC | #2
On Tue, Nov 7, 2023 at 12:00 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2023/11/6 10:44, Mina Almasry wrote:
> > Overload the LSB of struct page* to indicate that it's a page_pool_iov.
> >
> > Refactor mm calls on struct page* into helpers, and add page_pool_iov
> > handling on those helpers. Modify callers of these mm APIs with calls to
> > these helpers instead.
> >
> > In areas where struct page* is dereferenced, add a check for special
> > handling of page_pool_iov.
> >
> > Signed-off-by: Mina Almasry <almasrymina@google.com>
> >
> > ---
> >  include/net/page_pool/helpers.h | 74 ++++++++++++++++++++++++++++++++-
> >  net/core/page_pool.c            | 63 ++++++++++++++++++++--------
> >  2 files changed, 118 insertions(+), 19 deletions(-)
> >
> > diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
> > index b93243c2a640..08f1a2cc70d2 100644
> > --- a/include/net/page_pool/helpers.h
> > +++ b/include/net/page_pool/helpers.h
> > @@ -151,6 +151,64 @@ static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page)
> >       return NULL;
> >  }
> >
> > +static inline int page_pool_page_ref_count(struct page *page)
> > +{
> > +     if (page_is_page_pool_iov(page))
> > +             return page_pool_iov_refcount(page_to_page_pool_iov(page));
>
> We have added a lot of 'if's for the devmem case; it would be better to
> make it more generic so that we can have more unified metadata handling
> for normal page and devmem. If we add another memory type here, do we
> need another 'if' here?

Maybe, not sure. I'm guessing new memory types will either be pages or
iovs, so maybe no new if statements needed.

> That is part of the reason I suggested using a more unified metadata for
> all the types of memory chunks used by page_pool.

I think your suggestion was to use struct pages for devmem. That was
thoroughly considered and intensely argued about in the initial
conversations regarding devmem and the initial RFC, and from the
conclusions there it's extremely clear to me that devmem struct pages
are categorically a no-go.

--
Thanks,
Mina
Yunsheng Lin Nov. 8, 2023, 10:56 a.m. UTC | #3
On 2023/11/8 5:56, Mina Almasry wrote:
> On Tue, Nov 7, 2023 at 12:00 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> On 2023/11/6 10:44, Mina Almasry wrote:
>>> Overload the LSB of struct page* to indicate that it's a page_pool_iov.
>>>
>>> Refactor mm calls on struct page* into helpers, and add page_pool_iov
>>> handling on those helpers. Modify callers of these mm APIs with calls to
>>> these helpers instead.
>>>
>>> In areas where struct page* is dereferenced, add a check for special
>>> handling of page_pool_iov.
>>>
>>> Signed-off-by: Mina Almasry <almasrymina@google.com>
>>>
>>> ---
>>>  include/net/page_pool/helpers.h | 74 ++++++++++++++++++++++++++++++++-
>>>  net/core/page_pool.c            | 63 ++++++++++++++++++++--------
>>>  2 files changed, 118 insertions(+), 19 deletions(-)
>>>
>>> diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
>>> index b93243c2a640..08f1a2cc70d2 100644
>>> --- a/include/net/page_pool/helpers.h
>>> +++ b/include/net/page_pool/helpers.h
>>> @@ -151,6 +151,64 @@ static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page)
>>>       return NULL;
>>>  }
>>>
>>> +static inline int page_pool_page_ref_count(struct page *page)
>>> +{
>>> +     if (page_is_page_pool_iov(page))
>>> +             return page_pool_iov_refcount(page_to_page_pool_iov(page));
>>
>> We have added a lot of 'if's for the devmem case; it would be better to
>> make it more generic so that we can have more unified metadata handling
>> for normal page and devmem. If we add another memory type here, do we
>> need another 'if' here?
> 
> Maybe, not sure. I'm guessing new memory types will either be pages or
> iovs, so maybe no new if statements needed.
> 
>> That is part of the reason I suggested using a more unified metadata for
>> all the types of memory chunks used by page_pool.
> 
> I think your suggestion was to use struct pages for devmem. That was
> thoroughly considered and intensely argued about in the initial
> conversations regarding devmem and the initial RFC, and from the
> conclusions there it's extremely clear to me that devmem struct pages
> are categorically a no-go.

Not exactly, I was wondering if adding a more abstract structure specifically
for page pool makes any sense, and each mem type can add its own specific
fields; the net stack only sees and handles the common fields so that it does not
care about the specific mem type, and each provider only sees and handles the
specific fields belonging to it most of the time.

Ideally something like below:

struct netmem {
	/* common fields */
	refcount_t refcount;
	struct page_pool *pp;
	......

	union {
		struct devmem{
			struct dmabuf_genpool_chunk_owner *owner;
		};

		struct other_mem{
			...
			...
		};
	};
};

But until we completely decouple 'struct page' from the net stack,
the above seems undoable in the near term.
But we might be able to do something like what folio is doing now: the mm subsystem
is still seeing 'struct folio/page', but other subsystems like slab are using
'struct slab', and there are still some common fields shared between
'struct folio' and 'struct slab'.
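
For reference, the slab precedent overlays a dedicated struct on 'struct page'
and statically asserts that the shared fields line up; abridged from mm/slab.h:

struct slab {
	unsigned long __page_flags;	/* must alias page->flags */
	struct kmem_cache *slab_cache;
	/* ... slab-private fields elided ... */
	atomic_t __page_refcount;	/* must alias page->_refcount */
};

#define SLAB_MATCH(pg, sl)						\
	static_assert(offsetof(struct page, pg) == offsetof(struct slab, sl))
SLAB_MATCH(flags, __page_flags);
SLAB_MATCH(_refcount, __page_refcount);
#undef SLAB_MATCH
static_assert(sizeof(struct slab) <= sizeof(struct page));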

As in the netmem patchset, is devmem able to reuse the below 'struct netmem'
and rename it to 'struct page_pool_iov'? So that 'struct page' for normal
memory and 'struct page_pool_iov' for devmem share the common fields used
by page pool and net stack? And we might be able to reuse the 'flags',
'_pp_mapping_pad' and '_mapcount' for specific mem provider, which is enough
for the devmem only requiring a single pointer to point to its
owner?

https://lkml.kernel.org/netdev/20230105214631.3939268-2-willy@infradead.org/

+/**
+ * struct netmem - A memory allocation from a &struct page_pool.
+ * @flags: The same as the page flags.  Do not use directly.
+ * @pp_magic: Magic value to avoid recycling non page_pool allocated pages.
+ * @pp: The page pool this netmem was allocated from.
+ * @dma_addr: Call netmem_get_dma_addr() to read this value.
+ * @dma_addr_upper: Might need to be 64-bit on 32-bit architectures.
+ * @pp_frag_count: For frag page support, not supported in 32-bit
+ *   architectures with 64-bit DMA.
+ * @_mapcount: Do not access this member directly.
+ * @_refcount: Do not access this member directly.  Read it using
+ *   netmem_ref_count() and manipulate it with netmem_get() and netmem_put().
+ *
+ * This struct overlays struct page for now.  Do not modify without a
+ * good understanding of the issues.
+ */
+struct netmem {
+	unsigned long flags;
+	unsigned long pp_magic;
+	struct page_pool *pp;
+	/* private: no need to document this padding */
+	unsigned long _pp_mapping_pad;	/* aliases with folio->mapping */
+	/* public: */
+	unsigned long dma_addr;
+	union {
+		unsigned long dma_addr_upper;
+		atomic_long_t pp_frag_count;
+	};
+	atomic_t _mapcount;
+	atomic_t _refcount;
+};

If we do that, it seems we might be able to allow net stack and page pool to see
the metadata for devmem chunk as 'struct page', and may be able to avoid most of
the 'if' checking in net stack and page pool?

> 
> --
> Thanks,
> Mina
> 
> .
>
Mina Almasry Nov. 9, 2023, 3:20 a.m. UTC | #4
On Wed, Nov 8, 2023 at 2:56 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2023/11/8 5:56, Mina Almasry wrote:
> > On Tue, Nov 7, 2023 at 12:00 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>
> >> On 2023/11/6 10:44, Mina Almasry wrote:
> >>> Overload the LSB of struct page* to indicate that it's a page_pool_iov.
> >>>
> >>> Refactor mm calls on struct page* into helpers, and add page_pool_iov
> >>> handling on those helpers. Modify callers of these mm APIs with calls to
> >>> these helpers instead.
> >>>
> >>> In areas where struct page* is dereferenced, add a check for special
> >>> handling of page_pool_iov.
> >>>
> >>> Signed-off-by: Mina Almasry <almasrymina@google.com>
> >>>
> >>> ---
> >>>  include/net/page_pool/helpers.h | 74 ++++++++++++++++++++++++++++++++-
> >>>  net/core/page_pool.c            | 63 ++++++++++++++++++++--------
> >>>  2 files changed, 118 insertions(+), 19 deletions(-)
> >>>
> >>> diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
> >>> index b93243c2a640..08f1a2cc70d2 100644
> >>> --- a/include/net/page_pool/helpers.h
> >>> +++ b/include/net/page_pool/helpers.h
> >>> @@ -151,6 +151,64 @@ static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page)
> >>>       return NULL;
> >>>  }
> >>>
> >>> +static inline int page_pool_page_ref_count(struct page *page)
> >>> +{
> >>> +     if (page_is_page_pool_iov(page))
> >>> +             return page_pool_iov_refcount(page_to_page_pool_iov(page));
> >>
> >> We have added a lot of 'if's for the devmem case; it would be better to
> >> make it more generic so that we can have more unified metadata handling
> >> for normal page and devmem. If we add another memory type here, do we
> >> need another 'if' here?
> >
> > Maybe, not sure. I'm guessing new memory types will either be pages or
> > iovs, so maybe no new if statements needed.
> >
> >> That is part of the reason I suggested using a more unified metadata for
> >> all the types of memory chunks used by page_pool.
> >
> > I think your suggestion was to use struct pages for devmem. That was
> > thoroughly considered and intensely argued about in the initial
> > conversations regarding devmem and the initial RFC, and from the
> > conclusions there it's extremely clear to me that devmem struct pages
> > are categorically a no-go.
>
> Not exactly, I was wondering if adding a more abstract structure specifically
> for page pool makes any sense, and each mem type can add its own specific
> fields; the net stack only sees and handles the common fields so that it does not
> care about the specific mem type, and each provider only sees and handles the
> specific fields belonging to it most of the time.
>
> Ideally something like below:
>
> struct netmem {
>         /* common fields */
>         refcount_t refcount;
>         struct page_pool *pp;
>         ......
>
>         union {
>                 struct devmem{
>                         struct dmabuf_genpool_chunk_owner *owner;
>                 };
>
>                 struct other_mem{
>                         ...
>                         ...
>                 };
>         };
> };
>
> But until we completely decouple 'struct page' from the net stack,
> the above seems undoable in the near term.

Agreed everything above is undoable.

> But we might be able to do something like what folio is doing now: the mm subsystem
> is still seeing 'struct folio/page', but other subsystems like slab are using
> 'struct slab', and there are still some common fields shared between
> 'struct folio' and 'struct slab'.
>

In my eyes this is almost exactly what I suggested in RFC v1 and got
immediately nacked with no room to negotiate. What we did for v1 is to
allocate struct pages for dma-buf to make dma-bufs look like struct
page to mm subsystem. Almost exactly what you're describing above.
It's a no-go. I don't think renaming struct page to netmem is going to
move the needle (it also re-introduces code-churn). What I feel like I
learnt is that dma-bufs are not struct pages and can't be made to look
like one, I think.

> As in the netmem patchset, is devmem able to reuse the below 'struct netmem'
> and rename it to 'struct page_pool_iov'?

I don't think so. For the reasons above, but also practically it
immediately falls apart. Consider this field in netmem:

+ * @flags: The same as the page flags.  Do not use directly.

dma-bufs don't have or support page-flags, and making dma-bufs look
like they support page flags or any page-like features (other than
dma_addr) seems extremely unacceptable to mm folks.

> So that 'struct page' for normal
> memory and 'struct page_pool_iov' for devmem share the common fields used
> by page pool and net stack?

Are you suggesting that we'd cast a netmem* to a page* and call core
mm APIs on it? It's basically what was happening with RFC v1, where
things that are not struct pages were made to look like struct pages.

Also, there isn't much upside for what you're suggesting, I think. For
example I can align the refcount variable in struct page_pool_iov with
the refcount in struct page so that this works:

put_page((struct page*)ppiov);

but it's a disaster. Because put_page() will call __put_page() if the
page is freed, and __put_page() will try to return the page to the
buddy allocator!

>  And we might be able to reuse the 'flags',
> '_pp_mapping_pad' and '_mapcount' for specific mem provider, which is enough
> for the devmem only requiring a single pointer to point to its
> owner?
>

All the above seems quite similar to RFC v1 again, using netmem
instead of struct page. In RFC v1 we re-used zone_device_data() for
the dma-buf owner equivalent.

> https://lkml.kernel.org/netdev/20230105214631.3939268-2-willy@infradead.org/
>
> +/**
> + * struct netmem - A memory allocation from a &struct page_pool.
> + * @flags: The same as the page flags.  Do not use directly.
> + * @pp_magic: Magic value to avoid recycling non page_pool allocated pages.
> + * @pp: The page pool this netmem was allocated from.
> + * @dma_addr: Call netmem_get_dma_addr() to read this value.
> + * @dma_addr_upper: Might need to be 64-bit on 32-bit architectures.
> + * @pp_frag_count: For frag page support, not supported in 32-bit
> + *   architectures with 64-bit DMA.
> + * @_mapcount: Do not access this member directly.
> + * @_refcount: Do not access this member directly.  Read it using
> + *   netmem_ref_count() and manipulate it with netmem_get() and netmem_put().
> + *
> + * This struct overlays struct page for now.  Do not modify without a
> + * good understanding of the issues.
> + */
> +struct netmem {
> +       unsigned long flags;
> +       unsigned long pp_magic;
> +       struct page_pool *pp;
> +       /* private: no need to document this padding */
> +       unsigned long _pp_mapping_pad;  /* aliases with folio->mapping */
> +       /* public: */
> +       unsigned long dma_addr;
> +       union {
> +               unsigned long dma_addr_upper;
> +               atomic_long_t pp_frag_count;
> +       };
> +       atomic_t _mapcount;
> +       atomic_t _refcount;
> +};
>
> If we do that, it seems we might be able to allow net stack and page pool to see
> the metadata for devmem chunk as 'struct page', and may be able to avoid most of
> the 'if' checking in net stack and page pool?
>
> >
> > --
> > Thanks,
> > Mina
> >
> > .
> >
Paolo Abeni Nov. 9, 2023, 9:01 a.m. UTC | #5
On Sun, 2023-11-05 at 18:44 -0800, Mina Almasry wrote:
> Overload the LSB of struct page* to indicate that it's a page_pool_iov.
> 
> Refactor mm calls on struct page* into helpers, and add page_pool_iov
> handling on those helpers. Modify callers of these mm APIs with calls to
> these helpers instead.
> 
> In areas where struct page* is dereferenced, add a check for special
> handling of page_pool_iov.
> 
> Signed-off-by: Mina Almasry <almasrymina@google.com>
> 
> ---
>  include/net/page_pool/helpers.h | 74 ++++++++++++++++++++++++++++++++-
>  net/core/page_pool.c            | 63 ++++++++++++++++++++--------
>  2 files changed, 118 insertions(+), 19 deletions(-)
> 
> diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
> index b93243c2a640..08f1a2cc70d2 100644
> --- a/include/net/page_pool/helpers.h
> +++ b/include/net/page_pool/helpers.h
> @@ -151,6 +151,64 @@ static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page)
>  	return NULL;
>  }
>  
> +static inline int page_pool_page_ref_count(struct page *page)
> +{
> +	if (page_is_page_pool_iov(page))
> +		return page_pool_iov_refcount(page_to_page_pool_iov(page));
> +
> +	return page_ref_count(page);
> +}
> +
> +static inline void page_pool_page_get_many(struct page *page,
> +					   unsigned int count)
> +{
> +	if (page_is_page_pool_iov(page))
> +		return page_pool_iov_get_many(page_to_page_pool_iov(page),
> +					      count);
> +
> +	return page_ref_add(page, count);
> +}
> +
> +static inline void page_pool_page_put_many(struct page *page,
> +					   unsigned int count)
> +{
> +	if (page_is_page_pool_iov(page))
> +		return page_pool_iov_put_many(page_to_page_pool_iov(page),
> +					      count);
> +
> +	if (count > 1)
> +		page_ref_sub(page, count - 1);
> +
> +	put_page(page);
> +}
> +
> +static inline bool page_pool_page_is_pfmemalloc(struct page *page)
> +{
> +	if (page_is_page_pool_iov(page))
> +		return false;
> +
> +	return page_is_pfmemalloc(page);
> +}
> +
> +static inline bool page_pool_page_is_pref_nid(struct page *page, int pref_nid)
> +{
> +	/* Assume page_pool_iov are on the preferred node without actually
> +	 * checking...
> +	 *
> +	 * This check is only used to check for recycling memory in the page
> +	 * pool's fast paths. Currently the only implementation of page_pool_iov
> +	 * is dmabuf device memory. It's a deliberate decision by the user to
> +	 * bind a certain dmabuf to a certain netdev, and the netdev rx queue
> +	 * would not be able to reallocate memory from another dmabuf that
> +	 * exists on the preferred node, so, this check doesn't make much sense
> +	 * in this case. Assume all page_pool_iovs can be recycled for now.
> +	 */
> +	if (page_is_page_pool_iov(page))
> +		return true;
> +
> +	return page_to_nid(page) == pref_nid;
> +}
> +
>  /**
>   * page_pool_dev_alloc_pages() - allocate a page.
>   * @pool:	pool from which to allocate
> @@ -301,6 +359,9 @@ static inline long page_pool_defrag_page(struct page *page, long nr)
>  {
>  	long ret;
>  
> +	if (page_is_page_pool_iov(page))
> +		return -EINVAL;
> +
>  	/* If nr == pp_frag_count then we have cleared all remaining
>  	 * references to the page:
>  	 * 1. 'n == 1': no need to actually overwrite it.
> @@ -431,7 +492,12 @@ static inline void page_pool_free_va(struct page_pool *pool, void *va,
>   */
>  static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
>  {
> -	dma_addr_t ret = page->dma_addr;
> +	dma_addr_t ret;
> +
> +	if (page_is_page_pool_iov(page))
> +		return page_pool_iov_dma_addr(page_to_page_pool_iov(page));

Should the above conditional be guarded by the page_pool_mem_providers
static key? This looks like a fast path. Same question for the refcount
helper above.

Minor nit: possibly cache 'page_is_page_pool_iov(page)' to make the
code more readable.
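
Something like the below, perhaps -- a sketch; it assumes the
page_pool_mem_providers static key from earlier in the series is made
visible to these helpers:

static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
{
	dma_addr_t ret;

	/* The static key keeps the extra branch off the fast path when
	 * no memory provider is in use.
	 */
	if (static_branch_unlikely(&page_pool_mem_providers) &&
	    page_is_page_pool_iov(page))
		return page_pool_iov_dma_addr(page_to_page_pool_iov(page));

	ret = page->dma_addr;

	if (PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA)
		ret <<= PAGE_SHIFT;

	return ret;
}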

> +
> +	ret = page->dma_addr;
>  
>  	if (PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA)
>  		ret <<= PAGE_SHIFT;
> @@ -441,6 +507,12 @@ static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
>  
>  static inline bool page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
>  {
> +	/* page_pool_iovs are mapped and their dma-addr can't be modified. */
> +	if (page_is_page_pool_iov(page)) {
> +		DEBUG_NET_WARN_ON_ONCE(true);
> +		return false;
> +	}

Quickly skimming over the page_pool code, it looks like
page_pool_set_dma_addr() usage is guarded by the PP_FLAG_DMA_MAP page
pool flag. Could the device mem provider enforce such flag being
cleared on the page pool?
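
E.g. something along these lines at pool-init time -- a sketch, with
'mp_ops' as the provider hook referenced in this thread:

/* Provider memory comes pre-mapped, so the pool must not try to map or
 * sync it itself.
 */
if (pool->mp_ops &&
    (pool->p.flags & (PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV)))
	return -EINVAL;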

> +
>  	if (PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA) {
>  		page->dma_addr = addr >> PAGE_SHIFT;
>  
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 138ddea0b28f..d211996d423b 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -317,7 +317,7 @@ static struct page *page_pool_refill_alloc_cache(struct page_pool *pool)
>  		if (unlikely(!page))
>  			break;
>  
> -		if (likely(page_to_nid(page) == pref_nid)) {
> +		if (likely(page_pool_page_is_pref_nid(page, pref_nid))) {
>  			pool->alloc.cache[pool->alloc.count++] = page;
>  		} else {
>  			/* NUMA mismatch;
> @@ -362,7 +362,15 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
>  					  struct page *page,
>  					  unsigned int dma_sync_size)
>  {
> -	dma_addr_t dma_addr = page_pool_get_dma_addr(page);
> +	dma_addr_t dma_addr;
> +
> +	/* page_pool_iov memory provider does not support PP_FLAG_DMA_SYNC_DEV */
> +	if (page_is_page_pool_iov(page)) {
> +		DEBUG_NET_WARN_ON_ONCE(true);
> +		return;
> +	}

Similar to the above point, mutatis mutandis.

> +
> +	dma_addr = page_pool_get_dma_addr(page);
>  
>  	dma_sync_size = min(dma_sync_size, pool->p.max_len);
>  	dma_sync_single_range_for_device(pool->p.dev, dma_addr,
> @@ -374,6 +382,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
>  {
>  	dma_addr_t dma;
>  
> +	if (page_is_page_pool_iov(page)) {
> +		/* page_pool_iovs are already mapped */
> +		DEBUG_NET_WARN_ON_ONCE(true);
> +		return true;
> +	}

Ditto.

Cheers,

Paolo
Yunsheng Lin Nov. 9, 2023, 9:30 a.m. UTC | #6
On 2023/11/9 11:20, Mina Almasry wrote:
> On Wed, Nov 8, 2023 at 2:56 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:

> 
> Agreed everything above is undoable.
> 
>> But we might be able to do something like what folio is doing now: the mm subsystem
>> is still seeing 'struct folio/page', but other subsystems like slab are using
>> 'struct slab', and there are still some common fields shared between
>> 'struct folio' and 'struct slab'.
>>
> 
> In my eyes this is almost exactly what I suggested in RFC v1 and got
> immediately nacked with no room to negotiate. What we did for v1 is to
> allocate struct pages for dma-buf to make dma-bufs look like struct
> page to mm subsystem. Almost exactly what you're describing above.

Maybe the above is where we have disagreement:
Do we still need to make dma-bufs look like struct page to the mm subsystem?
IMHO, the answer is no. We might only need to make dma-bufs look like
struct page to net stack and page pool subsystem. I think that is already
what this patchset is trying to do; what I am suggesting is just to make
it more like 'struct page' to net stack and page pool subsystem, in order
to try to avoid most of the 'if' checking in net stack and page pool
subsystem.

> It's a no-go. I don't think renaming struct page to netmem is going to
> move the needle (it also re-introduces code-churn). What I feel like I
> learnt is that dma-bufs are not struct pages and can't be made to look
> like one, I think.
> 
>> As in the netmem patchset, is devmem able to reuse the below 'struct netmem'
>> and rename it to 'struct page_pool_iov'?
> 
> I don't think so. For the reasons above, but also practically it
> immediately falls apart. Consider this field in netmem:
> 
> + * @flags: The same as the page flags.  Do not use directly.
> 
> dma-bufs don't have or support page-flags, and making dma-bufs look
> like they support page flags or any page-like features (other than
> dma_addr) seems extremely unacceptable to mm folks.

As far as I can tell, as we limit the devmem usage in netstack, the below
are the related mm function calls for 'struct page' for devmem:
page_ref_*(): page->_refcount does not need changing
page_is_pfmemalloc(): which corresponds to page->pp_magic, and
                      devmem provider can set/unset it in its 'alloc_pages'
                      ops.
page_to_nid(): we may need to handle it differently somewhat like this
               patch does as page_to_nid() may have a different implementation
               based on different configuration.
page_pool_iov_put_many(): as mentioned in other thread, if net stack is not
                          calling page_pool_page_put_many() directly, we
                          can reuse napi_pp_put_page() for devmem too, and
                          handle the special case for devmem in 'release_page'
                          ops.

> 
>> So that 'struct page' for normal
>> memory and 'struct page_pool_iov' for devmem share the common fields used
>> by page pool and net stack?
> 
> Are you suggesting that we'd cast a netmem* to a page* and call core
> mm APIs on it? It's basically what was happening with RFC v1, where
> things that are not struct pages were made to look like struct pages.
> 
> Also, there isn't much upside for what you're suggesting, I think. For
> example I can align the refcount variable in struct page_pool_iov with
> the refcount in struct page so that this works:
> 
> put_page((struct page*)ppiov);
> 
> but it's a disaster. Because put_page() will call __put_page() if the
> page is freed, and __put_page() will try to return the page to the
> buddy allocator!

As I suggested above, can we handle this in the devmem provider's
'release_page' ops instead of calling put_page() directly for devmem?

> 
>>  And we might be able to reuse the 'flags',
>> '_pp_mapping_pad' and '_mapcount' for specific mem provider, which is enough
>> for the devmem only requiring a single pointer to point to its
>> owner?
>>
> 
> All the above seems quite similar to RFC v1 again, using netmem
> instead of struct page. In RFC v1 we re-used zone_device_data() for
> the dma-buf owner equivalent.

As we have added a few checks to limit 'struct page' for devmem to
be only used in net stack, we can decouple 'struct page' for devmem
from mm subsystem, zone_device_data() is not really needed, right?

If we can decouple 'struct page' for normal memory from mm subsystem
through the folio work in the future, then we may define a more abstract
structure for page pool and net stack instead of reusing 'struct page'
from mm.

>
Mina Almasry Nov. 9, 2023, 12:20 p.m. UTC | #7
On Thu, Nov 9, 2023 at 1:30 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2023/11/9 11:20, Mina Almasry wrote:
> > On Wed, Nov 8, 2023 at 2:56 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> >
> > Agreed everything above is undoable.
> >
> >> But we might be able to do something like what folio is doing now: the mm subsystem
> >> is still seeing 'struct folio/page', but other subsystems like slab are using
> >> 'struct slab', and there are still some common fields shared between
> >> 'struct folio' and 'struct slab'.
> >>
> >
> > In my eyes this is almost exactly what I suggested in RFC v1 and got
> > immediately nacked with no room to negotiate. What we did for v1 is to
> > allocate struct pages for dma-buf to make dma-bufs look like struct
> > page to mm subsystem. Almost exactly what you're describing above.
>
> Maybe the above is where we have disagreement:
> Do we still need to make dma-bufs look like struct page to the mm subsystem?
> IMHO, the answer is no. We might only need to make dma-bufs look like
> struct page to net stack and page pool subsystem. I think that is already
> what this patchset is trying to do; what I am suggesting is just to make
> it more like 'struct page' to net stack and page pool subsystem, in order
> to try to avoid most of the 'if' checking in net stack and page pool
> subsystem.
>

First, most of the checking in the net stack is
skb_frag_not_readable(). dma-bufs are fundamentally not kmap()able and
not readable. So we can't remove those, no matter what we do, I think.
Can we agree on that? If so, let's discuss removing most of the ifs in
the page pool, only.

> > It's a no-go. I don't think renaming struct page to netmem is going to
> > move the needle (it also re-introduces code-churn). What I feel like I
> > learnt is that dma-bufs are not struct pages and can't be made to look
> > like one, I think.
> >
> >> As in the netmem patchset, is devmem able to reuse the below 'struct netmem'
> >> and rename it to 'struct page_pool_iov'?
> >
> > I don't think so. For the reasons above, but also practically it
> > immediately falls apart. Consider this field in netmem:
> >
> > + * @flags: The same as the page flags.  Do not use directly.
> >
> > dma-bufs don't have or support page-flags, and making dma-bufs look
> > like they support page flags or any page-like features (other than
> > dma_addr) seems extremely unacceptable to mm folks.
>
> As far as I can tell, as we limit the devmem usage in netstack, the below
> are the related mm function calls for 'struct page' for devmem:
> page_ref_*(): page->_refcount does not need changing

Sorry, I don't understand. Are you suggesting we call page_ref_add() &
page_ref_sub() on page_pool_iov? That is basically making
page_pool_iov look like struct page to the mm stack, since page_ref_*
are mm calls, which you say above we don't need to do. We will still
need to special case this, no?

> page_is_pfmemalloc(): which corresponds to page->pp_magic, and
>                       devmem provider can set/unset it in its 'alloc_pages'
>                       ops.

page_is_pfmemalloc() has nothing to do with page->pp_magic. It checks
page->lru.next to figure out if this is a pfmemalloc. page_pool_iov
has no page->lru.next. Still need to special case this?

> page_to_nid(): we may need to handle it differently somewhat like this
>                patch does as page_to_nid() may have a different implementation
>                based on different configuration.

So you're saying we need to handle page_to_nid() differently for
devmem? So we're not going to be able to avoid the if statement.

> page_pool_iov_put_many(): as mentioned in other thread, if net stack is not
>                           calling page_pool_page_put_many() directly, we
>                           can reuse napi_pp_put_page() for devmem too, and
>                           handle the special case for devmem in 'release_page'
>                           ops.
>

page_pool_iov_put_many()/page_pool_iov_get_many() are called to do
refcounting before the page is released back to the provider. I'm not
seeing how we can handle the special case inside of 'release_page' -
that's too late, as far as I can tell.

The only way to remove the if statements in the page pool is to
implement what you said was not feasible in an earlier email. We would
define this struct:

struct netmem {
        /* common fields */
        refcount_t refcount;
        bool is_pfmemalloc;
        int nid;
        ......
        union {
                struct devmem{
                        struct dmabuf_genpool_chunk_owner *owner;
                };

                struct page * page;
        };
};

Then, we would require all memory providers to allocate struct netmem
for the memory and set the common fields, including ones that have
struct pages. For devmem, netmem->page will be NULL, because netmem
has no page.

If we do that, the page pool can ignore whether the underlying memory
is page or devmem, because it can use the common fields, example:

/* page_ref_count replacement */
netmem_ref_count(struct netmem* netmem) {
    return netmem->refcount;
}

/* page_ref_add replacement */
netmem_ref_add(struct netmem* netmem) {
   atomic_inc(&netmem->refcount);
}

/* page_to_nid replacement */
netmem_nid(struct netmem* netmem) {
    return netmem->nid;
}

/* page_is_pfmemalloc() replacement */
netmem_is_pfmemalloc(struct netmem* netmem) {
    return netmem->is_pfmemalloc;
}

/* page_ref_sub replacement */
netmem_ref_sub(struct netmem* netmem) {
    atomic_sub(1, &netmem->refcount);
    if (netmem->refcount == 0) {
                  /* release page to the memory provider.
                   * struct page memory provider will do put_page(),
                   * devmem will do something else */
           }
     }


I think this MAY BE technically feasible, but I'm not sure it's better:

1. It is a huge refactor to the page pool, lots of code churn. While
the page pool currently uses page*, it needs to be completely
refactored to use netmem*.
2. It causes extra memory usage. struct netmem needs to be allocated
for every struct page.
3. It has minimal perf upside. The page_is_page_pool_iov() checks
currently have minimal perf impact, and I demonstrated that to Jesper
in RFC v2.
4. It also may not be technically feasible. I'm not sure how netmem
interacts with skb_frag_t. I guess we replace struct page* bv_page
with struct netmem* bv_page, and add changes there.
5. Drivers need to be refactored to use netmem* instead of page*,
unless we cast netmem* to page* before returning to the driver.

Possibly other downsides, these are what I could immediately think of.

If I'm still misunderstanding your suggestion, it may be time to send
me a concrete code snippet of what you have in mind. I'm a bit
confused at the moment because the only avenue I see to remove the if
statements in the page pool is to define the struct that we agreed is
not feasible in earlier emails.

> >
> >> So that 'struct page' for normal
> >> memory and 'struct page_pool_iov' for devmem share the common fields used
> >> by page pool and net stack?
> >
> > Are you suggesting that we'd cast a netmem* to a page* and call core
> > mm APIs on it? It's basically what was happening with RFC v1, where
> > things that are not struct pages were made to look like struct pages.
> >
> > Also, there isn't much upside for what you're suggesting, I think. For
> > example I can align the refcount variable in struct page_pool_iov with
> > the refcount in struct page so that this works:
> >
> > put_page((struct page*)ppiov);
> >
> > but it's a disaster. Because put_page() will call __put_page() if the
> > page is freed, and __put_page() will try to return the page to the
> > buddy allocator!
>
> As I suggested above, can we handle this in the devmem provider's
> 'release_page' ops instead of calling put_page() directly for devmem?
>
> >
> >>  And we might be able to reuse the 'flags',
> >> '_pp_mapping_pad' and '_mapcount' for specific mem provider, which is enough
> >> for the devmem only requiring a single pointer to point to its
> >> owner?
> >>
> >
> > All the above seems quite similar to RFC v1 again, using netmem
> > instead of struct page. In RFC v1 we re-used zone_device_data() for
> > the dma-buf owner equivalent.
>
> As we have added a few checks to limit 'struct page' for devmem to
> be only used in net stack, we can decouple 'struct page' for devmem
> from mm subsystem, zone_device_data() is not really needed, right?
>
> If we can decouple 'struct page' for normal memory from mm subsystem
> through the folio work in the future, then we may define a more abstract
> structure for page pool and net stack instead of reusing 'struct page'
> from mm.
>
> >



--
Thanks,
Mina
Yunsheng Lin Nov. 9, 2023, 1:23 p.m. UTC | #8
On 2023/11/9 20:20, Mina Almasry wrote:
> On Thu, Nov 9, 2023 at 1:30 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> On 2023/11/9 11:20, Mina Almasry wrote:
>>> On Wed, Nov 8, 2023 at 2:56 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>>>
>>> Agreed everything above is undoable.
>>>
>>>> But we might be able to do something like what folio is doing now: the mm subsystem
>>>> is still seeing 'struct folio/page', but other subsystems like slab are using
>>>> 'struct slab', and there are still some common fields shared between
>>>> 'struct folio' and 'struct slab'.
>>>>
>>>
>>> In my eyes this is almost exactly what I suggested in RFC v1 and got
>>> immediately nacked with no room to negotiate. What we did for v1 is to
>>> allocate struct pages for dma-buf to make dma-bufs look like struct
>>> page to mm subsystem. Almost exactly what you're describing above.
>>
>> Maybe the above is where we have disagreement:
>> Do we still need to make dma-bufs look like struct page to the mm subsystem?
>> IMHO, the answer is no. We might only need to make dma-bufs look like
>> struct page to net stack and page pool subsystem. I think that is already
>> what this patchset is trying to do; what I am suggesting is just to make
>> it more like 'struct page' to net stack and page pool subsystem, in order
>> to try to avoid most of the 'if' checking in net stack and page pool
>> subsystem.
>>
> 
> First, most of the checking in the net stack is
> skb_frag_not_readable(). dma-bufs are fundamentally not kmap()able and
> not readable. So we can't remove those, no matter what we do, I think.
> Can we agree on that? If so, let's discuss removing most of the ifs in
> the page pool, only.

Agreed on the 'not kmap()able and not readable' checking part.

> 
>>> It's a no-go. I don't think renaming struct page to netmem is going to
>>> move the needle (it also re-introduces code-churn). What I feel like I
>>> learnt is that dma-bufs are not struct pages and can't be made to look
>>> like one, I think.
>>>
>>>> As in the netmem patchset, is devmem able to reuse the below 'struct netmem'
>>>> and rename it to 'struct page_pool_iov'?
>>>
>>> I don't think so. For the reasons above, but also practically it
>>> immediately falls apart. Consider this field in netmem:
>>>
>>> + * @flags: The same as the page flags.  Do not use directly.
>>>
>>> dma-bufs don't have or support page-flags, and making dma-bufs look
>>> like they support page flags or any page-like features (other than
>>> dma_addr) seems extremely unacceptable to mm folks.
>>
>> As far as I can tell, as we limit the devmem usage in netstack, the below
>> are the related mm function calls for 'struct page' for devmem:
>> page_ref_*(): page->_refcount does not need changing
> 
> Sorry, I don't understand. Are you suggesting we call page_ref_add() &
> page_ref_sub() on page_pool_iov? That is basically making
> page_pool_iov look like struct page to the mm stack, since page_ref_*
> are mm calls, which you say above we don't need to do. We will still
> need to special case this, no?

As we are reusing 'struct page' for devmem, page->_refcount for
devmem and page->_refcount for normal memory should be the same, right?
We may need to ensure that 'struct page' for devmem always looks like a head
page of a compound page, or a base page, to the net stack, as we use get_page()
in __skb_frag_ref().
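
For reference, the call in question (include/linux/skbuff.h):

static inline void __skb_frag_ref(skb_frag_t *frag)
{
	/* get_page() expects a real base or head page */
	get_page(skb_frag_page(frag));
}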

We can choose not to call page_ref_sub() for a page from devmem; we can
call napi_pp_put_page(), and we may be able to handle the special case for
the devmem page in the devmem provider's 'release_page' ops in napi_pp_put_page().

> 
>> page_is_pfmemalloc(): which corresponds to page->pp_magic, and
>>                       devmem provider can set/unset it in its 'alloc_pages'
>>                       ops.
> 
> page_is_pfmemalloc() has nothing to do with page->pp_magic. It checks
> page->lru.next to figure out if this is a pfmemalloc. page_pool_iov
> has no page->lru.next. Still need to special case this?

See the comment in napi_pp_put_page():

	/* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
	 * in order to preserve any existing bits, such as bit 0 for the
	 * head page of compound page and bit 1 for pfmemalloc page, so
	 * mask those bits for freeing side when doing below checking,
	 * and page_is_pfmemalloc() is checked in __page_pool_put_page()
	 * to avoid recycling the pfmemalloc page.
	 */

There is a union in struct page; page->lru.next and page->pp_magic are
actually pointing to the same thing, as I understand it.
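
For reference, abridged from include/linux/mm_types.h and include/linux/mm.h
(unrelated fields trimmed):

struct page {
	unsigned long flags;
	union {
		struct {	/* page_pool used by netstack */
			unsigned long pp_magic;	/* same word as lru.next */
			struct page_pool *pp;
			unsigned long _pp_mapping_pad;
			unsigned long dma_addr;
		};
		struct {
			struct list_head lru;
			/* ... */
		};
	};
	/* ... */
};

static inline bool page_is_pfmemalloc(const struct page *page)
{
	/* bit 1 of the word shared by lru.next and pp_magic */
	return (uintptr_t)page->lru.next & BIT(1);
}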


> 
>> page_to_nid(): we may need to handle it differently somewhat like this
>>                patch does as page_to_nid() may have a different implementation
>>                based on different configuration.
> 
> So you're saying we need to handle page_to_nid() differently for
> devmem? So we're not going to be able to avoid the if statement.

Yes, it seems to be the only place I see so far that might need special
handling.

> 
>> page_pool_iov_put_many(): as mentioned in other thread, if net stack is not
>>                           calling page_pool_page_put_many() directly, we
>>                           can reuse napi_pp_put_page() for devmem too, and
>>                           handle the special case for devmem in 'release_page'
>>                           ops.
>>
> 
> page_pool_iov_put_many()/page_pool_iov_get_many() are called to do

Can we remove the page_pool_iov_put_many()/page_pool_iov_get_many()
calls?

> refcounting before the page is released back to the provider. I'm not
> seeing how we can handle the special case inside of 'release_page' -
> that's too late, as far as I can tell.

And handle the special case in page_pool_return_page(), mainly
replacing put_page() with 'release_page' for the devmem page?
https://elixir.free-electrons.com/linux/v6.6-rc1/source/net/core/page_pool.c#L537
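
I.e., something like the below sketch ('release_page' and 'mp_ops' as
referenced in this thread; the hook placement is the suggestion here, and the
signatures are approximate):

static void page_pool_return_page(struct page_pool *pool, struct page *page)
{
	page_pool_clear_pp_info(page);

	/* Let the provider take the memory back instead of the pool
	 * calling put_page() directly.
	 */
	if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops)
		pool->mp_ops->release_page(pool, page);
	else
		put_page(page);
}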

> 
> The only way to remove the if statements in the page pool is to
> implement what you said was not feasible in an earlier email. We would
> define this struct:
> 
> struct netmem {
>         /* common fields */
>         refcount_t refcount;
>         bool is_pfmemalloc;
>         int nid;
>         ......
>         union {
>                 struct devmem{
>                         struct dmabuf_genpool_chunk_owner *owner;
>                 };
> 
>                 struct page * page;
>         };
> };
> 
> Then, we would require all memory providers to allocate struct netmem
> for the memory and set the common fields, including ones that have
> struct pages. For devmem, netmem->page will be NULL, because netmem
> has no page.

That is not what I have in mind.

> 
> If we do that, the page pool can ignore whether the underlying memory
> is page or devmem, because it can use the common fields, example:
> 
> /* page_ref_count replacement */
> netmem_ref_count(struct netmem* netmem) {
>     return netmem->refcount;
> }
> 
> /* page_ref_add replacement */
> netmem_ref_add(struct netmem* netmem) {
>    atomic_inc(&netmem->refcount);
> }
> 
> /* page_to_nid replacement */
> netmem_nid(struct netmem* netmem) {
>     return netmem->nid;
> }
> 
> /* page_is_pfmemalloc() replacement */
> netmem_is_pfmemalloc(struct netmem* netmem) {
>     return netmem->is_pfmemalloc;
> }
> 
> /* page_ref_sub replacement */
> netmem_ref_sub(struct netmem* netmem) {
>     atomic_sub(1, &netmem->refcount);
>     if (netmem->refcount == 0) {
>                   /* release page to the memory provider.
>                    * struct page memory provider will do put_page(),
>                    * devmem will do something else */
>            }
>      }
> 
> 
> I think this MAY BE technically feasible, but I'm not sure it's better:
> 
> 1. It is a huge refactor to the page pool, lots of code churn. While
> the page pool currently uses page*, it needs to be completely
> refactored to use netmem*.
> 2. It causes extra memory usage. struct netmem needs to be allocated
> for every struct page.
> 3. It has minimal perf upside. The page_is_page_pool_iov() checks
> currently have minimal perf impact, and I demonstrated that to Jesper
> in RFC v2.
> 4. It also may not be technically feasible. I'm not sure how netmem
> interacts with skb_frag_t. I guess we replace struct page* bv_page
> with struct netmem* bv_page, and add changes there.
> 5. Drivers need to be refactored to use netmem* instead of page*,
> unless we cast netmem* to page* before returning to the driver.
> 
> Possibly other downsides, these are what I could immediately think of.
> 
> If I'm still misunderstanding your suggestion, it may be time to send
> me a concrete code snippet of what you have in mind. I'm a bit
> confused at the moment because the only avenue I see to remove the if
> statements in the page pool is to define the struct that we agreed is
> not feasible in earlier emails.
> 

I might be able to do it at the weekend if it is still not making any
sense to you.

> 
> --
> Thanks,
> Mina
> 
> .
>
Christian König Nov. 9, 2023, 2:23 p.m. UTC | #9
On 09.11.23 at 04:20, Mina Almasry wrote:
> [SNIP]
>> But we might be able to do something like what folio is doing now: the mm subsystem
>> is still seeing 'struct folio/page', but other subsystems like slab are using
>> 'struct slab', and there are still some common fields shared between
>> 'struct folio' and 'struct slab'.
>>
> In my eyes this is almost exactly what I suggested in RFC v1 and got
> immediately nacked with no room to negotiate. What we did for v1 is to
> allocate struct pages for dma-buf to make dma-bufs look like struct
> page to mm subsystem. Almost exactly what you're describing above.
> It's a no-go. I don't think renaming struct page to netmem is going to
> move the needle (it also re-introduces code-churn). What I feel like I
> learnt is that dma-bufs are not struct pages and can't be made to look
> like one, I think.

Yeah, that pretty much hits the nail on the head. What was not mentioned 
yet, and what you could potentially try, is to go the other way around.

In other words instead of importing a DMA-buf file descriptor into the 
page-pool, you take some netdev page-pool pages and export them as 
DMA-buf handle.

All you then need to do is to implement the necessary DMA-buf interface, 
e.g. wait for DMA-fences before accessing stuff; when you have an async 
operation, install a DMA-fence so that others can wait for your operation 
to finish, etc.
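
For reference, the exporter side described above is the standard
dma_buf_export() flow -- a rough sketch with hypothetical netdev_* names and
the mapping callbacks stubbed out:

static struct sg_table *netdev_dmabuf_map(struct dma_buf_attachment *attach,
					  enum dma_data_direction dir)
{
	/* Build and DMA-map an sg_table over the exported pool pages for
	 * the attached device; stubbed out in this sketch.
	 */
	return ERR_PTR(-EOPNOTSUPP);
}

static void netdev_dmabuf_unmap(struct dma_buf_attachment *attach,
				struct sg_table *sgt,
				enum dma_data_direction dir)
{
}

static void netdev_dmabuf_release(struct dma_buf *dmabuf)
{
	/* Drop whatever reference was taken on the pool memory. */
}

static const struct dma_buf_ops netdev_dmabuf_ops = {
	.map_dma_buf	= netdev_dmabuf_map,
	.unmap_dma_buf	= netdev_dmabuf_unmap,
	.release	= netdev_dmabuf_release,
};

struct dma_buf *netdev_export_pool_mem(void *priv, size_t size)
{
	DEFINE_DMA_BUF_EXPORT_INFO(exp_info);

	exp_info.ops = &netdev_dmabuf_ops;
	exp_info.size = size;
	exp_info.flags = O_RDWR;
	exp_info.priv = priv;

	return dma_buf_export(&exp_info);
}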

This still doesn't give you peer-to-peer over, for example, the PCIe bus, 
but it would give you zero copy in the sense that a GPU or other 
acceleration device directly writes stuff into memory a network device 
can work with.

We already have some similar functionality for at least Intel and AMD hw 
where a userptr mechanism is used to make malloc'ed memory (backed by 
normal struct pages) available to the GPU. The problem with this 
approach is that most GPUs currently can't do good recoverable page 
faults, which in turn makes the whole MMU-notifier-based approach just 
horribly inefficient.

Just giving a few more pointers you might want to look into...

Cheers,
Christian.

Patch

diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
index b93243c2a640..08f1a2cc70d2 100644
--- a/include/net/page_pool/helpers.h
+++ b/include/net/page_pool/helpers.h
@@ -151,6 +151,64 @@  static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page)
 	return NULL;
 }
 
+static inline int page_pool_page_ref_count(struct page *page)
+{
+	if (page_is_page_pool_iov(page))
+		return page_pool_iov_refcount(page_to_page_pool_iov(page));
+
+	return page_ref_count(page);
+}
+
+static inline void page_pool_page_get_many(struct page *page,
+					   unsigned int count)
+{
+	if (page_is_page_pool_iov(page))
+		return page_pool_iov_get_many(page_to_page_pool_iov(page),
+					      count);
+
+	return page_ref_add(page, count);
+}
+
+static inline void page_pool_page_put_many(struct page *page,
+					   unsigned int count)
+{
+	if (page_is_page_pool_iov(page))
+		return page_pool_iov_put_many(page_to_page_pool_iov(page),
+					      count);
+
+	if (count > 1)
+		page_ref_sub(page, count - 1);
+
+	put_page(page);
+}
+
+static inline bool page_pool_page_is_pfmemalloc(struct page *page)
+{
+	if (page_is_page_pool_iov(page))
+		return false;
+
+	return page_is_pfmemalloc(page);
+}
+
+static inline bool page_pool_page_is_pref_nid(struct page *page, int pref_nid)
+{
+	/* Assume page_pool_iov are on the preferred node without actually
+	 * checking...
+	 *
+	 * This check is only used to check for recycling memory in the page
+	 * pool's fast paths. Currently the only implementation of page_pool_iov
+	 * is dmabuf device memory. It's a deliberate decision by the user to
+	 * bind a certain dmabuf to a certain netdev, and the netdev rx queue
+	 * would not be able to reallocate memory from another dmabuf that
+	 * exists on the preferred node, so, this check doesn't make much sense
+	 * in this case. Assume all page_pool_iovs can be recycled for now.
+	 */
+	if (page_is_page_pool_iov(page))
+		return true;
+
+	return page_to_nid(page) == pref_nid;
+}
+
 /**
  * page_pool_dev_alloc_pages() - allocate a page.
  * @pool:	pool from which to allocate
@@ -301,6 +359,9 @@  static inline long page_pool_defrag_page(struct page *page, long nr)
 {
 	long ret;
 
+	if (page_is_page_pool_iov(page))
+		return -EINVAL;
+
 	/* If nr == pp_frag_count then we have cleared all remaining
 	 * references to the page:
 	 * 1. 'n == 1': no need to actually overwrite it.
@@ -431,7 +492,12 @@  static inline void page_pool_free_va(struct page_pool *pool, void *va,
  */
 static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
 {
-	dma_addr_t ret = page->dma_addr;
+	dma_addr_t ret;
+
+	if (page_is_page_pool_iov(page))
+		return page_pool_iov_dma_addr(page_to_page_pool_iov(page));
+
+	ret = page->dma_addr;
 
 	if (PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA)
 		ret <<= PAGE_SHIFT;
@@ -441,6 +507,12 @@  static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
 
 static inline bool page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
 {
+	/* page_pool_iovs are mapped and their dma-addr can't be modified. */
+	if (page_is_page_pool_iov(page)) {
+		DEBUG_NET_WARN_ON_ONCE(true);
+		return false;
+	}
+
 	if (PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA) {
 		page->dma_addr = addr >> PAGE_SHIFT;
 
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 138ddea0b28f..d211996d423b 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -317,7 +317,7 @@  static struct page *page_pool_refill_alloc_cache(struct page_pool *pool)
 		if (unlikely(!page))
 			break;
 
-		if (likely(page_to_nid(page) == pref_nid)) {
+		if (likely(page_pool_page_is_pref_nid(page, pref_nid))) {
 			pool->alloc.cache[pool->alloc.count++] = page;
 		} else {
 			/* NUMA mismatch;
@@ -362,7 +362,15 @@  static void page_pool_dma_sync_for_device(struct page_pool *pool,
 					  struct page *page,
 					  unsigned int dma_sync_size)
 {
-	dma_addr_t dma_addr = page_pool_get_dma_addr(page);
+	dma_addr_t dma_addr;
+
+	/* page_pool_iov memory provider does not support PP_FLAG_DMA_SYNC_DEV */
+	if (page_is_page_pool_iov(page)) {
+		DEBUG_NET_WARN_ON_ONCE(true);
+		return;
+	}
+
+	dma_addr = page_pool_get_dma_addr(page);
 
 	dma_sync_size = min(dma_sync_size, pool->p.max_len);
 	dma_sync_single_range_for_device(pool->p.dev, dma_addr,
@@ -374,6 +382,12 @@  static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
 {
 	dma_addr_t dma;
 
+	if (page_is_page_pool_iov(page)) {
+		/* page_pool_iovs are already mapped */
+		DEBUG_NET_WARN_ON_ONCE(true);
+		return true;
+	}
+
 	/* Setup DMA mapping: use 'struct page' area for storing DMA-addr
 	 * since dma_addr_t can be either 32 or 64 bits and does not always fit
 	 * into page private data (i.e 32bit cpu with 64bit DMA caps)
@@ -405,22 +419,33 @@  static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
 static void page_pool_set_pp_info(struct page_pool *pool,
 				  struct page *page)
 {
-	page->pp = pool;
-	page->pp_magic |= PP_SIGNATURE;
-
-	/* Ensuring all pages have been split into one fragment initially:
-	 * page_pool_set_pp_info() is only called once for every page when it
-	 * is allocated from the page allocator and page_pool_fragment_page()
-	 * is dirtying the same cache line as the page->pp_magic above, so
-	 * the overhead is negligible.
-	 */
-	page_pool_fragment_page(page, 1);
+	if (!page_is_page_pool_iov(page)) {
+		page->pp = pool;
+		page->pp_magic |= PP_SIGNATURE;
+
+		/* Ensuring all pages have been split into one fragment
+		 * initially:
+		 * page_pool_set_pp_info() is only called once for every page
+		 * when it is allocated from the page allocator and
+		 * page_pool_fragment_page() is dirtying the same cache line as
+		 * the page->pp_magic above, so the overhead is negligible.
+		 */
+		page_pool_fragment_page(page, 1);
+	} else {
+		page_to_page_pool_iov(page)->pp = pool;
+	}
+
 	if (pool->p.init_callback)
 		pool->p.init_callback(page, pool->p.init_arg);
 }
 
 static void page_pool_clear_pp_info(struct page *page)
 {
+	if (page_is_page_pool_iov(page)) {
+		page_to_page_pool_iov(page)->pp = NULL;
+		return;
+	}
+
 	page->pp_magic = 0;
 	page->pp = NULL;
 }
@@ -630,7 +655,7 @@  static bool page_pool_recycle_in_cache(struct page *page,
 		return false;
 	}
 
-	/* Caller MUST have verified/know (page_ref_count(page) == 1) */
+	/* Caller MUST have verified/know (page_pool_page_ref_count(page) == 1) */
 	pool->alloc.cache[pool->alloc.count++] = page;
 	recycle_stat_inc(pool, cached);
 	return true;
@@ -655,9 +680,10 @@  __page_pool_put_page(struct page_pool *pool, struct page *page,
 	 * refcnt == 1 means page_pool owns page, and can recycle it.
 	 *
 	 * page is NOT reusable when allocated when system is under
-	 * some pressure. (page_is_pfmemalloc)
+	 * some pressure. (page_pool_page_is_pfmemalloc)
 	 */
-	if (likely(page_ref_count(page) == 1 && !page_is_pfmemalloc(page))) {
+	if (likely(page_pool_page_ref_count(page) == 1 &&
+		   !page_pool_page_is_pfmemalloc(page))) {
 		/* Read barrier done in page_ref_count / READ_ONCE */
 
 		if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
@@ -772,7 +798,8 @@  static struct page *page_pool_drain_frag(struct page_pool *pool,
 	if (likely(page_pool_defrag_page(page, drain_count)))
 		return NULL;
 
-	if (page_ref_count(page) == 1 && !page_is_pfmemalloc(page)) {
+	if (page_pool_page_ref_count(page) == 1 &&
+	    !page_pool_page_is_pfmemalloc(page)) {
 		if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
 			page_pool_dma_sync_for_device(pool, page, -1);
 
@@ -848,9 +875,9 @@  static void page_pool_empty_ring(struct page_pool *pool)
 	/* Empty recycle ring */
 	while ((page = ptr_ring_consume_bh(&pool->ring))) {
 		/* Verify the refcnt invariant of cached pages */
-		if (!(page_ref_count(page) == 1))
+		if (!(page_pool_page_ref_count(page) == 1))
 			pr_crit("%s() page_pool refcnt %d violation\n",
-				__func__, page_ref_count(page));
+				__func__, page_pool_page_ref_count(page));
 
 		page_pool_return_page(pool, page);
 	}