diff mbox series

[RFC,net-next,v6,02/15] net: page_pool: create hooks for custom page providers

Message ID 20240305020153.2787423-3-almasrymina@google.com (mailing list archive)
State New
Headers show
Series Device Memory TCP | expand

Commit Message

Mina Almasry March 5, 2024, 2:01 a.m. UTC
From: Jakub Kicinski <kuba@kernel.org>

The page providers which try to reuse the same pages will
need to hold onto the ref, even if page gets released from
the pool - as in releasing the page from the pp just transfers
the "ownership" reference from pp to the provider, and provider
will wait for other references to be gone before feeding this
page back into the pool.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Mina Almasry <almasrymina@google.com>

---

This is implemented by Jakub in his RFC:
https://lore.kernel.org/netdev/f8270765-a27b-6ccf-33ea-cda097168d79@redhat.com/T/

I take no credit for the idea or implementation; I only added minor
edits to make this workable with device memory TCP, and removed some
hacky test code. This is a critical dependency of device memory TCP
and thus I'm pulling it into this series to make it revewable and
mergeable.

RFC v3 -> v1
- Removed unusued mem_provider. (Yunsheng).
- Replaced memory_provider & mp_priv with netdev_rx_queue (Jakub).

---
 include/net/page_pool/types.h | 12 ++++++++++
 net/core/page_pool.c          | 43 +++++++++++++++++++++++++++++++----
 2 files changed, 50 insertions(+), 5 deletions(-)

Comments

David Wei March 5, 2024, 9:54 p.m. UTC | #1
On 2024-03-04 18:01, Mina Almasry wrote:
> +struct memory_provider_ops {
> +	int (*init)(struct page_pool *pool);
> +	void (*destroy)(struct page_pool *pool);
> +	struct page *(*alloc_pages)(struct page_pool *pool, gfp_t gfp);
> +	bool (*release_page)(struct page_pool *pool, struct page *page);

For ZC Rx we added a scrub() function to memory_provider_ops that is
called from page_pool_scrub(). Does TCP devmem not custom behaviour
waiting for all netmem_refs to return before destroying the page pool?
What happens if e.g. application crashes?
Mina Almasry March 5, 2024, 10:36 p.m. UTC | #2
On Tue, Mar 5, 2024 at 1:55 PM David Wei <dw@davidwei.uk> wrote:
>
> On 2024-03-04 18:01, Mina Almasry wrote:
> > +struct memory_provider_ops {
> > +     int (*init)(struct page_pool *pool);
> > +     void (*destroy)(struct page_pool *pool);
> > +     struct page *(*alloc_pages)(struct page_pool *pool, gfp_t gfp);
> > +     bool (*release_page)(struct page_pool *pool, struct page *page);
>
> For ZC Rx we added a scrub() function to memory_provider_ops that is
> called from page_pool_scrub(). Does TCP devmem not custom behaviour
> waiting for all netmem_refs to return before destroying the page pool?
> What happens if e.g. application crashes?

(sorry for the long reply, but he refcounting is pretty complicated to
explain and I feel like we need to agree on how things currently work)

Yeah, the addition of the page_pool_scrub() function is a bit of a
head scratcher for me. Here is how the (complicated) refcounting works
for devmem TCP (assuming the driver is not doing its own recycling
logic which complicates things further):

1. When a netmem_ref is allocated by the page_pool (from dmabuf or
page), the netmem_get_pp_ref_count_ref()==1 and belongs to the page
pool as long as the netmem is waiting in the pool for driver
allocation.

2. When a netmem is allocated by the driver, no refcounting is
changed, but the ownership of the netmem_get_pp_ref_count_ref() is
implicitly transferred from the page pool to the driver. i.e. the ref
now belongs to the driver until an skb is formed.

3. When the driver forms an skb using skb_rx_add_frag_netmem(), no
refcounting is changed, but the ownership of the
netmem_get_pp_ref_count_ref() is transferred from the driver to the
TCP stack.

4. When the TCP stack hands the skb to the application, the TCP stack
obtains an additional refcount, so netmem_get_pp_ref_count_ref()==2,
and frees the skb using skb_frag_unref(), which drops the
netmem_get_pp_ref_count_ref()==1.

5. When the user is done with the skb, the user calls the
DEVMEM_DONTNEED setsockopt which calls napi_pp_put_netmem() which
recycles the netmem back to the page pool. This doesn't modify any
refcounting, but the refcount ownership transfers from the userspace
back to the page pool, and we're back at step 1.

So all in all netmem can belong either to (a) the page pool, or (b)
the driver, or (c) the TCP stack, or (d) the application depending on
where exactly it is in the RX path.

When an application running devmem TCP crashes, the netmem that belong
to the page pool or driver are not touched, because the page pool is
not tied to the application in our case really. However, the TCP stack
notices the devmem socket of the application close, and when it does,
the TCP stack will:

1. Free all the skbs in the sockets receive queue. This is not custom
behavior for devmem TCP, it's just standard for TCP to free all skbs
waiting to be received by the application.
2. The TCP stack will free references that belong to the application.
Since the application crashed, it will not call the DEVMEM_DONTNEED
setsockopt, so we need to free those on behalf of the application.
This is done in this diff:

@@ -2498,6 +2498,15 @@ static void tcp_md5sig_info_free_rcu(struct
rcu_head *head)
 void tcp_v4_destroy_sock(struct sock *sk)
 {
  struct tcp_sock *tp = tcp_sk(sk);
+ __maybe_unused unsigned long index;
+ __maybe_unused void *netmem;
+
+#ifdef CONFIG_PAGE_POOL
+ xa_for_each(&sk->sk_user_frags, index, netmem)
+ WARN_ON_ONCE(!napi_pp_put_page((__force netmem_ref)netmem, false));
+#endif
+
+ xa_destroy(&sk->sk_user_frags);

  trace_tcp_destroy_sock(sk);

To be honest, I think it makes sense for the TCP stack to be
responsible for putting the references that belong to it and the
application. To me, it does not make much sense for the page pool to
be responsible for putting the reference that belongs to the TCP stack
or driver via a page_pool_scrub() function, as those references do not
belong to the page pool really. I'm not sure why there is a diff
between our use cases here because I'm not an io_uring expert. Why do
you need to scrub all the references on page pool destruction? Don't
these belong to non-page pool components like io_uring stack or TCP
stack ol otherwise?
Pavel Begunkov March 6, 2024, 2:29 p.m. UTC | #3
On 3/5/24 22:36, Mina Almasry wrote:
> On Tue, Mar 5, 2024 at 1:55 PM David Wei <dw@davidwei.uk> wrote:
>>
>> On 2024-03-04 18:01, Mina Almasry wrote:
>>> +struct memory_provider_ops {
>>> +     int (*init)(struct page_pool *pool);
>>> +     void (*destroy)(struct page_pool *pool);
>>> +     struct page *(*alloc_pages)(struct page_pool *pool, gfp_t gfp);
>>> +     bool (*release_page)(struct page_pool *pool, struct page *page);
>>
>> For ZC Rx we added a scrub() function to memory_provider_ops that is
>> called from page_pool_scrub(). Does TCP devmem not custom behaviour
>> waiting for all netmem_refs to return before destroying the page pool?
>> What happens if e.g. application crashes?
> 
> (sorry for the long reply, but he refcounting is pretty complicated to
> explain and I feel like we need to agree on how things currently work)
> 
> Yeah, the addition of the page_pool_scrub() function is a bit of a
> head scratcher for me. Here is how the (complicated) refcounting works
> for devmem TCP (assuming the driver is not doing its own recycling
> logic which complicates things further):
> 
> 1. When a netmem_ref is allocated by the page_pool (from dmabuf or
> page), the netmem_get_pp_ref_count_ref()==1 and belongs to the page
> pool as long as the netmem is waiting in the pool for driver
> allocation.
> 
> 2. When a netmem is allocated by the driver, no refcounting is
> changed, but the ownership of the netmem_get_pp_ref_count_ref() is
> implicitly transferred from the page pool to the driver. i.e. the ref
> now belongs to the driver until an skb is formed.
> 
> 3. When the driver forms an skb using skb_rx_add_frag_netmem(), no
> refcounting is changed, but the ownership of the
> netmem_get_pp_ref_count_ref() is transferred from the driver to the
> TCP stack.
> 
> 4. When the TCP stack hands the skb to the application, the TCP stack
> obtains an additional refcount, so netmem_get_pp_ref_count_ref()==2,
> and frees the skb using skb_frag_unref(), which drops the
> netmem_get_pp_ref_count_ref()==1.
> 
> 5. When the user is done with the skb, the user calls the
> DEVMEM_DONTNEED setsockopt which calls napi_pp_put_netmem() which
> recycles the netmem back to the page pool. This doesn't modify any
> refcounting, but the refcount ownership transfers from the userspace
> back to the page pool, and we're back at step 1.
> 
> So all in all netmem can belong either to (a) the page pool, or (b)
> the driver, or (c) the TCP stack, or (d) the application depending on
> where exactly it is in the RX path.
> 
> When an application running devmem TCP crashes, the netmem that belong
> to the page pool or driver are not touched, because the page pool is
> not tied to the application in our case really. However, the TCP stack
> notices the devmem socket of the application close, and when it does,
> the TCP stack will:
> 
> 1. Free all the skbs in the sockets receive queue. This is not custom
> behavior for devmem TCP, it's just standard for TCP to free all skbs
> waiting to be received by the application.
> 2. The TCP stack will free references that belong to the application.
> Since the application crashed, it will not call the DEVMEM_DONTNEED
> setsockopt, so we need to free those on behalf of the application.
> This is done in this diff:
> 
> @@ -2498,6 +2498,15 @@ static void tcp_md5sig_info_free_rcu(struct
> rcu_head *head)
>   void tcp_v4_destroy_sock(struct sock *sk)
>   {
>    struct tcp_sock *tp = tcp_sk(sk);
> + __maybe_unused unsigned long index;
> + __maybe_unused void *netmem;
> +
> +#ifdef CONFIG_PAGE_POOL
> + xa_for_each(&sk->sk_user_frags, index, netmem)
> + WARN_ON_ONCE(!napi_pp_put_page((__force netmem_ref)netmem, false));
> +#endif
> +
> + xa_destroy(&sk->sk_user_frags);
> 
>    trace_tcp_destroy_sock(sk);
> 
> To be honest, I think it makes sense for the TCP stack to be
> responsible for putting the references that belong to it and the
> application. To me, it does not make much sense for the page pool to
> be responsible for putting the reference that belongs to the TCP stack
> or driver via a page_pool_scrub() function, as those references do not
> belong to the page pool really. I'm not sure why there is a diff
> between our use cases here because I'm not an io_uring expert. Why do
> you need to scrub all the references on page pool destruction? Don't
> these belong to non-page pool components like io_uring stack or TCP
> stack ol otherwise?

That one is about cleaning buffers that are in b/w 4 and 5, i.e.
owned by the user, which devmem does at sock destruction. io_uring
could get by without scrub, dropping user refs while unregistering
ifq, but then it'd need to wait for all requests to finish so there
is no step 4 in the meantime. Might change, can be useful, but it
was much easier to hook into the pp release loop.

Another concern is who and when can reset ifq / kill pp outside
of io_uring/devmem. I assume it can happen on a whim, which is
hard to handle gracefully.
Mina Almasry March 6, 2024, 5:04 p.m. UTC | #4
On Wed, Mar 6, 2024 at 6:30 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 3/5/24 22:36, Mina Almasry wrote:
> > On Tue, Mar 5, 2024 at 1:55 PM David Wei <dw@davidwei.uk> wrote:
> >>
> >> On 2024-03-04 18:01, Mina Almasry wrote:
> >>> +struct memory_provider_ops {
> >>> +     int (*init)(struct page_pool *pool);
> >>> +     void (*destroy)(struct page_pool *pool);
> >>> +     struct page *(*alloc_pages)(struct page_pool *pool, gfp_t gfp);
> >>> +     bool (*release_page)(struct page_pool *pool, struct page *page);
> >>
> >> For ZC Rx we added a scrub() function to memory_provider_ops that is
> >> called from page_pool_scrub(). Does TCP devmem not custom behaviour
> >> waiting for all netmem_refs to return before destroying the page pool?
> >> What happens if e.g. application crashes?
> >
> > (sorry for the long reply, but he refcounting is pretty complicated to
> > explain and I feel like we need to agree on how things currently work)
> >
> > Yeah, the addition of the page_pool_scrub() function is a bit of a
> > head scratcher for me. Here is how the (complicated) refcounting works
> > for devmem TCP (assuming the driver is not doing its own recycling
> > logic which complicates things further):
> >
> > 1. When a netmem_ref is allocated by the page_pool (from dmabuf or
> > page), the netmem_get_pp_ref_count_ref()==1 and belongs to the page
> > pool as long as the netmem is waiting in the pool for driver
> > allocation.
> >
> > 2. When a netmem is allocated by the driver, no refcounting is
> > changed, but the ownership of the netmem_get_pp_ref_count_ref() is
> > implicitly transferred from the page pool to the driver. i.e. the ref
> > now belongs to the driver until an skb is formed.
> >
> > 3. When the driver forms an skb using skb_rx_add_frag_netmem(), no
> > refcounting is changed, but the ownership of the
> > netmem_get_pp_ref_count_ref() is transferred from the driver to the
> > TCP stack.
> >
> > 4. When the TCP stack hands the skb to the application, the TCP stack
> > obtains an additional refcount, so netmem_get_pp_ref_count_ref()==2,
> > and frees the skb using skb_frag_unref(), which drops the
> > netmem_get_pp_ref_count_ref()==1.
> >
> > 5. When the user is done with the skb, the user calls the
> > DEVMEM_DONTNEED setsockopt which calls napi_pp_put_netmem() which
> > recycles the netmem back to the page pool. This doesn't modify any
> > refcounting, but the refcount ownership transfers from the userspace
> > back to the page pool, and we're back at step 1.
> >
> > So all in all netmem can belong either to (a) the page pool, or (b)
> > the driver, or (c) the TCP stack, or (d) the application depending on
> > where exactly it is in the RX path.
> >
> > When an application running devmem TCP crashes, the netmem that belong
> > to the page pool or driver are not touched, because the page pool is
> > not tied to the application in our case really. However, the TCP stack
> > notices the devmem socket of the application close, and when it does,
> > the TCP stack will:
> >
> > 1. Free all the skbs in the sockets receive queue. This is not custom
> > behavior for devmem TCP, it's just standard for TCP to free all skbs
> > waiting to be received by the application.
> > 2. The TCP stack will free references that belong to the application.
> > Since the application crashed, it will not call the DEVMEM_DONTNEED
> > setsockopt, so we need to free those on behalf of the application.
> > This is done in this diff:
> >
> > @@ -2498,6 +2498,15 @@ static void tcp_md5sig_info_free_rcu(struct
> > rcu_head *head)
> >   void tcp_v4_destroy_sock(struct sock *sk)
> >   {
> >    struct tcp_sock *tp = tcp_sk(sk);
> > + __maybe_unused unsigned long index;
> > + __maybe_unused void *netmem;
> > +
> > +#ifdef CONFIG_PAGE_POOL
> > + xa_for_each(&sk->sk_user_frags, index, netmem)
> > + WARN_ON_ONCE(!napi_pp_put_page((__force netmem_ref)netmem, false));
> > +#endif
> > +
> > + xa_destroy(&sk->sk_user_frags);
> >
> >    trace_tcp_destroy_sock(sk);
> >
> > To be honest, I think it makes sense for the TCP stack to be
> > responsible for putting the references that belong to it and the
> > application. To me, it does not make much sense for the page pool to
> > be responsible for putting the reference that belongs to the TCP stack
> > or driver via a page_pool_scrub() function, as those references do not
> > belong to the page pool really. I'm not sure why there is a diff
> > between our use cases here because I'm not an io_uring expert. Why do
> > you need to scrub all the references on page pool destruction? Don't
> > these belong to non-page pool components like io_uring stack or TCP
> > stack ol otherwise?
>
> That one is about cleaning buffers that are in b/w 4 and 5, i.e.
> owned by the user, which devmem does at sock destruction. io_uring
> could get by without scrub, dropping user refs while unregistering
> ifq, but then it'd need to wait for all requests to finish so there
> is no step 4 in the meantime. Might change, can be useful, but it
> was much easier to hook into the pp release loop.
>
> Another concern is who and when can reset ifq / kill pp outside
> of io_uring/devmem. I assume it can happen on a whim, which is
> hard to handle gracefully.
>

If this is about dropping application refs in step 4 & step 5, then
from devmem TCP perspective it must be done on socket close & skb
freeing AFAIU, and not delayed until page_pool destruction. Think
about a stupid or malicious user that does something like:

1. Set up dmabuf binding using netlink api.
2. While (100000):
3.   create devmem TCP socket.
4.   receive some devmem data on TCP socket.
5.   close TCP socket without calling DEVMEM_DONTNEED.
6. clean up dmabuf binding using netlink api.

In this case, we need to drop the references in step 5 when the socket
is destroyed, so the memory is freed to the page pool and available
for the next socket in step 3. We cannot delay the freeing until step
6 when the rx queue is recreated and the page pool is destroyed,
otherwise the net_iovs would leak in the loop and eventually the NIC
would fail to find available memory. The same bug would be
reproducible with io_uring unless you're creating a new page pool for
each new io_uring socket equivalent.

But even outside of this, I think it's a bit semantically off to ask
the page_pool to drop references that belong to the application IMO,
because those references are not the page_pool's.

> --
> Pavel Begunkov
Pavel Begunkov March 6, 2024, 7:12 p.m. UTC | #5
On 3/6/24 17:04, Mina Almasry wrote:
> On Wed, Mar 6, 2024 at 6:30 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> On 3/5/24 22:36, Mina Almasry wrote:
>>> On Tue, Mar 5, 2024 at 1:55 PM David Wei <dw@davidwei.uk> wrote:
>>>>
>>>> On 2024-03-04 18:01, Mina Almasry wrote:
>>>>> +struct memory_provider_ops {
>>>>> +     int (*init)(struct page_pool *pool);
>>>>> +     void (*destroy)(struct page_pool *pool);
>>>>> +     struct page *(*alloc_pages)(struct page_pool *pool, gfp_t gfp);
>>>>> +     bool (*release_page)(struct page_pool *pool, struct page *page);
>>>>
>>>> For ZC Rx we added a scrub() function to memory_provider_ops that is
>>>> called from page_pool_scrub(). Does TCP devmem not custom behaviour
>>>> waiting for all netmem_refs to return before destroying the page pool?
>>>> What happens if e.g. application crashes?
>>>
>>> (sorry for the long reply, but he refcounting is pretty complicated to
>>> explain and I feel like we need to agree on how things currently work)
>>>
>>> Yeah, the addition of the page_pool_scrub() function is a bit of a
>>> head scratcher for me. Here is how the (complicated) refcounting works
>>> for devmem TCP (assuming the driver is not doing its own recycling
>>> logic which complicates things further):
>>>
>>> 1. When a netmem_ref is allocated by the page_pool (from dmabuf or
>>> page), the netmem_get_pp_ref_count_ref()==1 and belongs to the page
>>> pool as long as the netmem is waiting in the pool for driver
>>> allocation.
>>>
>>> 2. When a netmem is allocated by the driver, no refcounting is
>>> changed, but the ownership of the netmem_get_pp_ref_count_ref() is
>>> implicitly transferred from the page pool to the driver. i.e. the ref
>>> now belongs to the driver until an skb is formed.
>>>
>>> 3. When the driver forms an skb using skb_rx_add_frag_netmem(), no
>>> refcounting is changed, but the ownership of the
>>> netmem_get_pp_ref_count_ref() is transferred from the driver to the
>>> TCP stack.
>>>
>>> 4. When the TCP stack hands the skb to the application, the TCP stack
>>> obtains an additional refcount, so netmem_get_pp_ref_count_ref()==2,
>>> and frees the skb using skb_frag_unref(), which drops the
>>> netmem_get_pp_ref_count_ref()==1.
>>>
>>> 5. When the user is done with the skb, the user calls the
>>> DEVMEM_DONTNEED setsockopt which calls napi_pp_put_netmem() which
>>> recycles the netmem back to the page pool. This doesn't modify any
>>> refcounting, but the refcount ownership transfers from the userspace
>>> back to the page pool, and we're back at step 1.
>>>
>>> So all in all netmem can belong either to (a) the page pool, or (b)
>>> the driver, or (c) the TCP stack, or (d) the application depending on
>>> where exactly it is in the RX path.
>>>
>>> When an application running devmem TCP crashes, the netmem that belong
>>> to the page pool or driver are not touched, because the page pool is
>>> not tied to the application in our case really. However, the TCP stack
>>> notices the devmem socket of the application close, and when it does,
>>> the TCP stack will:
>>>
>>> 1. Free all the skbs in the sockets receive queue. This is not custom
>>> behavior for devmem TCP, it's just standard for TCP to free all skbs
>>> waiting to be received by the application.
>>> 2. The TCP stack will free references that belong to the application.
>>> Since the application crashed, it will not call the DEVMEM_DONTNEED
>>> setsockopt, so we need to free those on behalf of the application.
>>> This is done in this diff:
>>>
>>> @@ -2498,6 +2498,15 @@ static void tcp_md5sig_info_free_rcu(struct
>>> rcu_head *head)
>>>    void tcp_v4_destroy_sock(struct sock *sk)
>>>    {
>>>     struct tcp_sock *tp = tcp_sk(sk);
>>> + __maybe_unused unsigned long index;
>>> + __maybe_unused void *netmem;
>>> +
>>> +#ifdef CONFIG_PAGE_POOL
>>> + xa_for_each(&sk->sk_user_frags, index, netmem)
>>> + WARN_ON_ONCE(!napi_pp_put_page((__force netmem_ref)netmem, false));
>>> +#endif
>>> +
>>> + xa_destroy(&sk->sk_user_frags);
>>>
>>>     trace_tcp_destroy_sock(sk);
>>>
>>> To be honest, I think it makes sense for the TCP stack to be
>>> responsible for putting the references that belong to it and the
>>> application. To me, it does not make much sense for the page pool to
>>> be responsible for putting the reference that belongs to the TCP stack
>>> or driver via a page_pool_scrub() function, as those references do not
>>> belong to the page pool really. I'm not sure why there is a diff
>>> between our use cases here because I'm not an io_uring expert. Why do
>>> you need to scrub all the references on page pool destruction? Don't
>>> these belong to non-page pool components like io_uring stack or TCP
>>> stack ol otherwise?
>>
>> That one is about cleaning buffers that are in b/w 4 and 5, i.e.
>> owned by the user, which devmem does at sock destruction. io_uring
>> could get by without scrub, dropping user refs while unregistering
>> ifq, but then it'd need to wait for all requests to finish so there
>> is no step 4 in the meantime. Might change, can be useful, but it
>> was much easier to hook into the pp release loop.
>>
>> Another concern is who and when can reset ifq / kill pp outside
>> of io_uring/devmem. I assume it can happen on a whim, which is
>> hard to handle gracefully.
>>
> 
> If this is about dropping application refs in step 4 & step 5, then
> from devmem TCP perspective it must be done on socket close & skb
> freeing AFAIU, and not delayed until page_pool destruction. 

Right, something in the kernel should take care of it. You temporarily
attach it to the socket, which is fine. And you could've also stored
it in the netlink socket or some other object. In case of zcrx io_uring
impl, it's bound to io_uring, io_uring is responsible for cleaning them
up. And we do it before __page_pool_destroy, otherwise there would be
a ref dependency.

A side note, attaching to netlink or some other global object sounds
conceptually better, as once you return a buffer to the user, the
socket should not have any further business with the buffer. FWIW,
that better resembles io_uring approach. For example allows to:

recv(sock);
close(sock);
process_rx_buffers();

or to return (i.e. DEVMEM_DONTNEED) buffers from different sockets
in one call. However, I don't think it's important for devmem and
perhaps more implementation dictated.

> Think
> about a stupid or malicious user that does something like:
> 
> 1. Set up dmabuf binding using netlink api.
> 2. While (100000):
> 3.   create devmem TCP socket.
> 4.   receive some devmem data on TCP socket.
> 5.   close TCP socket without calling DEVMEM_DONTNEED.
> 6. clean up dmabuf binding using netlink api.
> 
> In this case, we need to drop the references in step 5 when the socket
> is destroyed, so the memory is freed to the page pool and available
> for the next socket in step 3. We cannot delay the freeing until step
> 6 when the rx queue is recreated and the page pool is destroyed,
> otherwise the net_iovs would leak in the loop and eventually the NIC
> would fail to find available memory. The same bug would be

By "would leak" you probably mean until step 6, right? There are
always many ways to shoot yourself in the leg. Even if you clean
up in 5, the user can just leak the socket and get the same result
with pp starvation. I see it not as a requirement but rather a
uapi choice, that's assuming netlink would be cleaned as a normal
socket when the task exits.

> reproducible with io_uring unless you're creating a new page pool for
> each new io_uring socket equivalent.

Surely we don't, but it's still the user's responsibility to
return buffers back. And in case of io_uring buffers returned
to the user are not attached to a socket, so even the
scope / lifetime is a bit different.

> But even outside of this, I think it's a bit semantically off to ask
> the page_pool to drop references that belong to the application IMO,
> because those references are not the page_pool's.

Completely agree with you, which is why it was in a callback,
totally controlled by io_uring.
Mina Almasry March 6, 2024, 9:59 p.m. UTC | #6
On Wed, Mar 6, 2024 at 11:14 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 3/6/24 17:04, Mina Almasry wrote:
> > On Wed, Mar 6, 2024 at 6:30 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
> >>
> >> On 3/5/24 22:36, Mina Almasry wrote:
> >>> On Tue, Mar 5, 2024 at 1:55 PM David Wei <dw@davidwei.uk> wrote:
> >>>>
> >>>> On 2024-03-04 18:01, Mina Almasry wrote:
> >>>>> +struct memory_provider_ops {
> >>>>> +     int (*init)(struct page_pool *pool);
> >>>>> +     void (*destroy)(struct page_pool *pool);
> >>>>> +     struct page *(*alloc_pages)(struct page_pool *pool, gfp_t gfp);
> >>>>> +     bool (*release_page)(struct page_pool *pool, struct page *page);
> >>>>
> >>>> For ZC Rx we added a scrub() function to memory_provider_ops that is
> >>>> called from page_pool_scrub(). Does TCP devmem not custom behaviour
> >>>> waiting for all netmem_refs to return before destroying the page pool?
> >>>> What happens if e.g. application crashes?
> >>>
> >>> (sorry for the long reply, but he refcounting is pretty complicated to
> >>> explain and I feel like we need to agree on how things currently work)
> >>>
> >>> Yeah, the addition of the page_pool_scrub() function is a bit of a
> >>> head scratcher for me. Here is how the (complicated) refcounting works
> >>> for devmem TCP (assuming the driver is not doing its own recycling
> >>> logic which complicates things further):
> >>>
> >>> 1. When a netmem_ref is allocated by the page_pool (from dmabuf or
> >>> page), the netmem_get_pp_ref_count_ref()==1 and belongs to the page
> >>> pool as long as the netmem is waiting in the pool for driver
> >>> allocation.
> >>>
> >>> 2. When a netmem is allocated by the driver, no refcounting is
> >>> changed, but the ownership of the netmem_get_pp_ref_count_ref() is
> >>> implicitly transferred from the page pool to the driver. i.e. the ref
> >>> now belongs to the driver until an skb is formed.
> >>>
> >>> 3. When the driver forms an skb using skb_rx_add_frag_netmem(), no
> >>> refcounting is changed, but the ownership of the
> >>> netmem_get_pp_ref_count_ref() is transferred from the driver to the
> >>> TCP stack.
> >>>
> >>> 4. When the TCP stack hands the skb to the application, the TCP stack
> >>> obtains an additional refcount, so netmem_get_pp_ref_count_ref()==2,
> >>> and frees the skb using skb_frag_unref(), which drops the
> >>> netmem_get_pp_ref_count_ref()==1.
> >>>
> >>> 5. When the user is done with the skb, the user calls the
> >>> DEVMEM_DONTNEED setsockopt which calls napi_pp_put_netmem() which
> >>> recycles the netmem back to the page pool. This doesn't modify any
> >>> refcounting, but the refcount ownership transfers from the userspace
> >>> back to the page pool, and we're back at step 1.
> >>>
> >>> So all in all netmem can belong either to (a) the page pool, or (b)
> >>> the driver, or (c) the TCP stack, or (d) the application depending on
> >>> where exactly it is in the RX path.
> >>>
> >>> When an application running devmem TCP crashes, the netmem that belong
> >>> to the page pool or driver are not touched, because the page pool is
> >>> not tied to the application in our case really. However, the TCP stack
> >>> notices the devmem socket of the application close, and when it does,
> >>> the TCP stack will:
> >>>
> >>> 1. Free all the skbs in the sockets receive queue. This is not custom
> >>> behavior for devmem TCP, it's just standard for TCP to free all skbs
> >>> waiting to be received by the application.
> >>> 2. The TCP stack will free references that belong to the application.
> >>> Since the application crashed, it will not call the DEVMEM_DONTNEED
> >>> setsockopt, so we need to free those on behalf of the application.
> >>> This is done in this diff:
> >>>
> >>> @@ -2498,6 +2498,15 @@ static void tcp_md5sig_info_free_rcu(struct
> >>> rcu_head *head)
> >>>    void tcp_v4_destroy_sock(struct sock *sk)
> >>>    {
> >>>     struct tcp_sock *tp = tcp_sk(sk);
> >>> + __maybe_unused unsigned long index;
> >>> + __maybe_unused void *netmem;
> >>> +
> >>> +#ifdef CONFIG_PAGE_POOL
> >>> + xa_for_each(&sk->sk_user_frags, index, netmem)
> >>> + WARN_ON_ONCE(!napi_pp_put_page((__force netmem_ref)netmem, false));
> >>> +#endif
> >>> +
> >>> + xa_destroy(&sk->sk_user_frags);
> >>>
> >>>     trace_tcp_destroy_sock(sk);
> >>>
> >>> To be honest, I think it makes sense for the TCP stack to be
> >>> responsible for putting the references that belong to it and the
> >>> application. To me, it does not make much sense for the page pool to
> >>> be responsible for putting the reference that belongs to the TCP stack
> >>> or driver via a page_pool_scrub() function, as those references do not
> >>> belong to the page pool really. I'm not sure why there is a diff
> >>> between our use cases here because I'm not an io_uring expert. Why do
> >>> you need to scrub all the references on page pool destruction? Don't
> >>> these belong to non-page pool components like io_uring stack or TCP
> >>> stack ol otherwise?
> >>
> >> That one is about cleaning buffers that are in b/w 4 and 5, i.e.
> >> owned by the user, which devmem does at sock destruction. io_uring
> >> could get by without scrub, dropping user refs while unregistering
> >> ifq, but then it'd need to wait for all requests to finish so there
> >> is no step 4 in the meantime. Might change, can be useful, but it
> >> was much easier to hook into the pp release loop.
> >>
> >> Another concern is who and when can reset ifq / kill pp outside
> >> of io_uring/devmem. I assume it can happen on a whim, which is
> >> hard to handle gracefully.
> >>
> >
> > If this is about dropping application refs in step 4 & step 5, then
> > from devmem TCP perspective it must be done on socket close & skb
> > freeing AFAIU, and not delayed until page_pool destruction.
>
> Right, something in the kernel should take care of it. You temporarily
> attach it to the socket, which is fine. And you could've also stored
> it in the netlink socket or some other object. In case of zcrx io_uring
> impl, it's bound to io_uring, io_uring is responsible for cleaning them
> up. And we do it before __page_pool_destroy, otherwise there would be
> a ref dependency.
>

AFAIU today the page_pool_release() waits until there are no more
pages in flight before calling __page_pool_destroy(), and in existing
use cases it's common for the page_pool to stay alive after
page_pool_destroy() is called until all the skbs waiting in the
receive queue to be recvmsg()'d are received and the page_pool is
freed. I just didn't modify that behavior.

> A side note, attaching to netlink or some other global object sounds
> conceptually better, as once you return a buffer to the user, the
> socket should not have any further business with the buffer. FWIW,
> that better resembles io_uring approach. For example allows to:
>
> recv(sock);
> close(sock);
> process_rx_buffers();
>
> or to return (i.e. DEVMEM_DONTNEED) buffers from different sockets
> in one call. However, I don't think it's important for devmem and
> perhaps more implementation dictated.
>

Hmm I think this may be a sockets vs io_uring difference? For sockets
there is no way to recvmsg() new buffers after the sock close and
there is no way to release buffers to the kernel via the setsockopt()
after the sock close.

But I don't think we need to align on everything here, right? If
page_pool_scrub makes more sense in your use case because the lifetime
of the io_uring buffers is different I don't see any issue with you
extending the ops with page_pool_scrub(), and not define it for the
dmabuf_devmem provider which doesn't need a scrub, right?

> > Think
> > about a stupid or malicious user that does something like:
> >
> > 1. Set up dmabuf binding using netlink api.
> > 2. While (100000):
> > 3.   create devmem TCP socket.
> > 4.   receive some devmem data on TCP socket.
> > 5.   close TCP socket without calling DEVMEM_DONTNEED.
> > 6. clean up dmabuf binding using netlink api.
> >
> > In this case, we need to drop the references in step 5 when the socket
> > is destroyed, so the memory is freed to the page pool and available
> > for the next socket in step 3. We cannot delay the freeing until step
> > 6 when the rx queue is recreated and the page pool is destroyed,
> > otherwise the net_iovs would leak in the loop and eventually the NIC
> > would fail to find available memory. The same bug would be
>
> By "would leak" you probably mean until step 6, right? There are

Yes, sorry I wasn't clear!

> always many ways to shoot yourself in the leg. Even if you clean
> up in 5, the user can just leak the socket and get the same result
> with pp starvation. I see it not as a requirement but rather a
> uapi choice, that's assuming netlink would be cleaned as a normal
> socket when the task exits.
>

Yes, thanks for pointing out. The above was a pathological example
meant to describe the point, but I think this generates a realistic
edge case I may run into production. I don't know if you care about
the specifics, but FWIW we split our userspace into an orchestrator
that allocates dma-bufs and binds them via netlink and the ML
application that creates tcp connections. We do this because then the
orchestrator needs CAP_NET_ADMIN for netlink but the ML applications
do not. If we delay dropping references until page_pool_destroy then
we delay dropping references until the orchestrator exists, i.e. we
risk one ML application crashing, leaving references unfreed, and the
next application (that reuses the buffers) seeing a smaller address
space because the previous application did not get to release them
before crash and so on.

Now of course it's possible to work around this by making sure we
don't reuse bound buffers (when they should be reusable for the same
user), but in general I think in the socket use case it's a bit
unnatural IMO for one socket to leave state behind like this and this
would be a subtlety that the userspace needs to take care of, but like
you said, maybe a uapi or buffer lifetime choice.

> > reproducible with io_uring unless you're creating a new page pool for
> > each new io_uring socket equivalent.
>
> Surely we don't, but it's still the user's responsibility to
> return buffers back. And in case of io_uring buffers returned
> to the user are not attached to a socket, so even the
> scope / lifetime is a bit different.
>

Yes, sorry, without understanding the specifics it seems your lifetime
management is different. IMO it's not an issue if we diverge in this
aspect.

> > But even outside of this, I think it's a bit semantically off to ask
> > the page_pool to drop references that belong to the application IMO,
> > because those references are not the page_pool's.
>
> Completely agree with you, which is why it was in a callback,
> totally controlled by io_uring.
>
> --
> Pavel Begunkov
Pavel Begunkov March 7, 2024, 2:25 p.m. UTC | #7
On 3/6/24 21:59, Mina Almasry wrote:
> On Wed, Mar 6, 2024 at 11:14 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>> On 3/6/24 17:04, Mina Almasry wrote:
>>> On Wed, Mar 6, 2024 at 6:30 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>>> On 3/5/24 22:36, Mina Almasry wrote:
...
>>>>> To be honest, I think it makes sense for the TCP stack to be
>>>>> responsible for putting the references that belong to it and the
>>>>> application. To me, it does not make much sense for the page pool to
>>>>> be responsible for putting the reference that belongs to the TCP stack
>>>>> or driver via a page_pool_scrub() function, as those references do not
>>>>> belong to the page pool really. I'm not sure why there is a diff
>>>>> between our use cases here because I'm not an io_uring expert. Why do
>>>>> you need to scrub all the references on page pool destruction? Don't
>>>>> these belong to non-page pool components like io_uring stack or TCP
>>>>> stack ol otherwise?
>>>>
>>>> That one is about cleaning buffers that are in b/w 4 and 5, i.e.
>>>> owned by the user, which devmem does at sock destruction. io_uring
>>>> could get by without scrub, dropping user refs while unregistering
>>>> ifq, but then it'd need to wait for all requests to finish so there
>>>> is no step 4 in the meantime. Might change, can be useful, but it
>>>> was much easier to hook into the pp release loop.
>>>>
>>>> Another concern is who and when can reset ifq / kill pp outside
>>>> of io_uring/devmem. I assume it can happen on a whim, which is
>>>> hard to handle gracefully.
>>>>
>>>
>>> If this is about dropping application refs in step 4 & step 5, then
>>> from devmem TCP perspective it must be done on socket close & skb
>>> freeing AFAIU, and not delayed until page_pool destruction.
>>
>> Right, something in the kernel should take care of it. You temporarily
>> attach it to the socket, which is fine. And you could've also stored
>> it in the netlink socket or some other object. In case of zcrx io_uring
>> impl, it's bound to io_uring, io_uring is responsible for cleaning them
>> up. And we do it before __page_pool_destroy, otherwise there would be
>> a ref dependency.
>>
> 
> AFAIU today the page_pool_release() waits until there are no more
> pages in flight before calling __page_pool_destroy(), and in existing
> use cases it's common for the page_pool to stay alive after
> page_pool_destroy() is called until all the skbs waiting in the
> receive queue to be recvmsg()'d are received and the page_pool is
> freed. I just didn't modify that behavior.
> 
>> A side note, attaching to netlink or some other global object sounds
>> conceptually better, as once you return a buffer to the user, the
>> socket should not have any further business with the buffer. FWIW,
>> that better resembles io_uring approach. For example allows to:
>>
>> recv(sock);
>> close(sock);
>> process_rx_buffers();
>>
>> or to return (i.e. DEVMEM_DONTNEED) buffers from different sockets
>> in one call. However, I don't think it's important for devmem and
>> perhaps more implementation dictated.
>>
> 
> Hmm I think this may be a sockets vs io_uring difference? For sockets
> there is no way to recvmsg() new buffers after the sock close and

That is true for io_uring as well, but io_uring can't wait for socket
tear down as it holds socket refs

> there is no way to release buffers to the kernel via the setsockopt()
> after the sock close.

If, for example, you store the xarray with userspace buffers in a
netlink socket and implement DONT_NEED setsockopt against it, you
would be able to return bufs after the TCP socket is closed. FWIW,
I'm not saying you should or even want to have it this way.

> But I don't think we need to align on everything here, right? If

Absolutely, mentioned just to entertain with a design consideration

> page_pool_scrub makes more sense in your use case because the lifetime
> of the io_uring buffers is different I don't see any issue with you
> extending the ops with page_pool_scrub(), and not define it for the
> dmabuf_devmem provider which doesn't need a scrub, right?

Yes, and it's a slow path, but I'll look at removing it anyway in
later rfcs.

>>> Think
>>> about a stupid or malicious user that does something like:
>>>
>>> 1. Set up dmabuf binding using netlink api.
>>> 2. While (100000):
>>> 3.   create devmem TCP socket.
>>> 4.   receive some devmem data on TCP socket.
>>> 5.   close TCP socket without calling DEVMEM_DONTNEED.
>>> 6. clean up dmabuf binding using netlink api.
>>>
>>> In this case, we need to drop the references in step 5 when the socket
>>> is destroyed, so the memory is freed to the page pool and available
>>> for the next socket in step 3. We cannot delay the freeing until step
>>> 6 when the rx queue is recreated and the page pool is destroyed,
>>> otherwise the net_iovs would leak in the loop and eventually the NIC
>>> would fail to find available memory. The same bug would be
>>
>> By "would leak" you probably mean until step 6, right? There are
> 
> Yes, sorry I wasn't clear!
> 
>> always many ways to shoot yourself in the leg. Even if you clean
>> up in 5, the user can just leak the socket and get the same result
>> with pp starvation. I see it not as a requirement but rather a
>> uapi choice, that's assuming netlink would be cleaned as a normal
>> socket when the task exits.
>>
> 
> Yes, thanks for pointing out. The above was a pathological example
> meant to describe the point, but I think this generates a realistic
> edge case I may run into production. I don't know if you care about
> the specifics, but FWIW we split our userspace into an orchestrator
> that allocates dma-bufs and binds them via netlink and the ML
> application that creates tcp connections. We do this because then the
> orchestrator needs CAP_NET_ADMIN for netlink but the ML applications
> do not. If we delay dropping references until page_pool_destroy then
> we delay dropping references until the orchestrator exists, i.e. we
> risk one ML application crashing, leaving references unfreed, and the
> next application (that reuses the buffers) seeing a smaller address
> space because the previous application did not get to release them
> before crash and so on.

Makes sense

> Now of course it's possible to work around this by making sure we
> don't reuse bound buffers (when they should be reusable for the same
> user), but in general I think in the socket use case it's a bit
> unnatural IMO for one socket to leave state behind like this and this
> would be a subtlety that the userspace needs to take care of, but like
> you said, maybe a uapi or buffer lifetime choice.
> 
>>> reproducible with io_uring unless you're creating a new page pool for
>>> each new io_uring socket equivalent.
>>
>> Surely we don't, but it's still the user's responsibility to
>> return buffers back. And in case of io_uring buffers returned
>> to the user are not attached to a socket, so even the
>> scope / lifetime is a bit different.
>>
> 
> Yes, sorry, without understanding the specifics it seems your lifetime
> management is different. IMO it's not an issue if we diverge in this
> aspect.

In terms of devmem that would be if you attach userspace buffers
to the netlink socket instead of a TCP socket as mentioned above,
not that much different
David Wei March 8, 2024, 4:57 a.m. UTC | #8
On 2024-03-04 18:01, Mina Almasry wrote:
> From: Jakub Kicinski <kuba@kernel.org>
> 
> The page providers which try to reuse the same pages will
> need to hold onto the ref, even if page gets released from
> the pool - as in releasing the page from the pp just transfers
> the "ownership" reference from pp to the provider, and provider
> will wait for other references to be gone before feeding this
> page back into the pool.
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Mina Almasry <almasrymina@google.com>
> 
> ---
> 
> This is implemented by Jakub in his RFC:
> https://lore.kernel.org/netdev/f8270765-a27b-6ccf-33ea-cda097168d79@redhat.com/T/
> 
> I take no credit for the idea or implementation; I only added minor
> edits to make this workable with device memory TCP, and removed some
> hacky test code. This is a critical dependency of device memory TCP
> and thus I'm pulling it into this series to make it revewable and
> mergeable.
> 
> RFC v3 -> v1
> - Removed unusued mem_provider. (Yunsheng).
> - Replaced memory_provider & mp_priv with netdev_rx_queue (Jakub).
> 
> ---
>  include/net/page_pool/types.h | 12 ++++++++++
>  net/core/page_pool.c          | 43 +++++++++++++++++++++++++++++++----
>  2 files changed, 50 insertions(+), 5 deletions(-)
> 
> diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
> index 5e43a08d3231..ffe5f31fb0da 100644
> --- a/include/net/page_pool/types.h
> +++ b/include/net/page_pool/types.h
> @@ -52,6 +52,7 @@ struct pp_alloc_cache {
>   * @dev:	device, for DMA pre-mapping purposes
>   * @netdev:	netdev this pool will serve (leave as NULL if none or multiple)
>   * @napi:	NAPI which is the sole consumer of pages, otherwise NULL
> + * @queue:	struct netdev_rx_queue this page_pool is being created for.
>   * @dma_dir:	DMA mapping direction
>   * @max_len:	max DMA sync memory size for PP_FLAG_DMA_SYNC_DEV
>   * @offset:	DMA sync address offset for PP_FLAG_DMA_SYNC_DEV
> @@ -64,6 +65,7 @@ struct page_pool_params {
>  		int		nid;
>  		struct device	*dev;
>  		struct napi_struct *napi;
> +		struct netdev_rx_queue *queue;
>  		enum dma_data_direction dma_dir;
>  		unsigned int	max_len;
>  		unsigned int	offset;
> @@ -126,6 +128,13 @@ struct page_pool_stats {
>  };
>  #endif
>  
> +struct memory_provider_ops {
> +	int (*init)(struct page_pool *pool);
> +	void (*destroy)(struct page_pool *pool);
> +	struct page *(*alloc_pages)(struct page_pool *pool, gfp_t gfp);
> +	bool (*release_page)(struct page_pool *pool, struct page *page);
> +};

Separate question as I try to adapt bnxt to this and your queue
configuration API.

How does GVE handle the need to allocate kernel pages for headers and
dmabuf for payloads?

Reading the code, struct gve_rx_ring is the main per-ring object with a
page pool. gve_queue_page_lists are filled with page pool netmem
allocations from the page pool in gve_alloc_queue_page_list(). Are these
strictly used for payloads only?

I found a struct gve_header_buf in both gve_rx_ring and struct
gve_per_rx_queue_mem_dpo. This is allocated in gve_rx_queue_mem_alloc()
using dma_alloc_coherent(). Is this where GVE stores headers?

IOW, GVE only uses page pool to allocate memory for QPLs, and QPLs are
used by the device for split payloads. Is my understanding correct?

> +
>  struct page_pool {
>  	struct page_pool_params_fast p;
>  
> @@ -176,6 +185,9 @@ struct page_pool {
>  	 */
>  	struct ptr_ring ring;
>  
> +	void *mp_priv;
> +	const struct memory_provider_ops *mp_ops;
> +
>  #ifdef CONFIG_PAGE_POOL_STATS
>  	/* recycle stats are per-cpu to avoid locking */
>  	struct page_pool_recycle_stats __percpu *recycle_stats;
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index d706fe5548df..8776fcad064a 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -25,6 +25,8 @@
>  
>  #include "page_pool_priv.h"
>  
> +static DEFINE_STATIC_KEY_FALSE(page_pool_mem_providers);
> +
>  #define DEFER_TIME (msecs_to_jiffies(1000))
>  #define DEFER_WARN_INTERVAL (60 * HZ)
>  
> @@ -177,6 +179,7 @@ static int page_pool_init(struct page_pool *pool,
>  			  int cpuid)
>  {
>  	unsigned int ring_qsize = 1024; /* Default */
> +	int err;
>  
>  	memcpy(&pool->p, &params->fast, sizeof(pool->p));
>  	memcpy(&pool->slow, &params->slow, sizeof(pool->slow));
> @@ -248,10 +251,25 @@ static int page_pool_init(struct page_pool *pool,
>  	/* Driver calling page_pool_create() also call page_pool_destroy() */
>  	refcount_set(&pool->user_cnt, 1);
>  
> +	if (pool->mp_ops) {
> +		err = pool->mp_ops->init(pool);
> +		if (err) {
> +			pr_warn("%s() mem-provider init failed %d\n",
> +				__func__, err);
> +			goto free_ptr_ring;
> +		}
> +
> +		static_branch_inc(&page_pool_mem_providers);
> +	}
> +
>  	if (pool->p.flags & PP_FLAG_DMA_MAP)
>  		get_device(pool->p.dev);
>  
>  	return 0;
> +
> +free_ptr_ring:
> +	ptr_ring_cleanup(&pool->ring, NULL);
> +	return err;
>  }
>  
>  static void page_pool_uninit(struct page_pool *pool)
> @@ -546,7 +564,10 @@ struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp)
>  		return page;
>  
>  	/* Slow-path: cache empty, do real allocation */
> -	page = __page_pool_alloc_pages_slow(pool, gfp);
> +	if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops)
> +		page = pool->mp_ops->alloc_pages(pool, gfp);
> +	else
> +		page = __page_pool_alloc_pages_slow(pool, gfp);
>  	return page;
>  }
>  EXPORT_SYMBOL(page_pool_alloc_pages);
> @@ -603,10 +624,13 @@ void __page_pool_release_page_dma(struct page_pool *pool, struct page *page)
>  void page_pool_return_page(struct page_pool *pool, struct page *page)
>  {
>  	int count;
> +	bool put;
>  
> -	__page_pool_release_page_dma(pool, page);
> -
> -	page_pool_clear_pp_info(page);
> +	put = true;
> +	if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops)
> +		put = pool->mp_ops->release_page(pool, page);
> +	else
> +		__page_pool_release_page_dma(pool, page);
>  
>  	/* This may be the last page returned, releasing the pool, so
>  	 * it is not safe to reference pool afterwards.
> @@ -614,7 +638,10 @@ void page_pool_return_page(struct page_pool *pool, struct page *page)
>  	count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt);
>  	trace_page_pool_state_release(pool, page, count);
>  
> -	put_page(page);
> +	if (put) {
> +		page_pool_clear_pp_info(page);
> +		put_page(page);
> +	}
>  	/* An optimization would be to call __free_pages(page, pool->p.order)
>  	 * knowing page is not part of page-cache (thus avoiding a
>  	 * __page_cache_release() call).
> @@ -884,6 +911,12 @@ static void __page_pool_destroy(struct page_pool *pool)
>  
>  	page_pool_unlist(pool);
>  	page_pool_uninit(pool);
> +
> +	if (pool->mp_ops) {
> +		pool->mp_ops->destroy(pool);
> +		static_branch_dec(&page_pool_mem_providers);
> +	}
> +
>  	kfree(pool);
>  }
>
Mina Almasry March 8, 2024, 7:53 p.m. UTC | #9
On Thu, Mar 7, 2024 at 8:57 PM David Wei <dw@davidwei.uk> wrote:
>
> On 2024-03-04 18:01, Mina Almasry wrote:
> > From: Jakub Kicinski <kuba@kernel.org>
> >
> > The page providers which try to reuse the same pages will
> > need to hold onto the ref, even if page gets released from
> > the pool - as in releasing the page from the pp just transfers
> > the "ownership" reference from pp to the provider, and provider
> > will wait for other references to be gone before feeding this
> > page back into the pool.
> >
> > Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> > Signed-off-by: Mina Almasry <almasrymina@google.com>
> >
> > ---
> >
> > This is implemented by Jakub in his RFC:
> > https://lore.kernel.org/netdev/f8270765-a27b-6ccf-33ea-cda097168d79@redhat.com/T/
> >
> > I take no credit for the idea or implementation; I only added minor
> > edits to make this workable with device memory TCP, and removed some
> > hacky test code. This is a critical dependency of device memory TCP
> > and thus I'm pulling it into this series to make it revewable and
> > mergeable.
> >
> > RFC v3 -> v1
> > - Removed unusued mem_provider. (Yunsheng).
> > - Replaced memory_provider & mp_priv with netdev_rx_queue (Jakub).
> >
> > ---
> >  include/net/page_pool/types.h | 12 ++++++++++
> >  net/core/page_pool.c          | 43 +++++++++++++++++++++++++++++++----
> >  2 files changed, 50 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
> > index 5e43a08d3231..ffe5f31fb0da 100644
> > --- a/include/net/page_pool/types.h
> > +++ b/include/net/page_pool/types.h
> > @@ -52,6 +52,7 @@ struct pp_alloc_cache {
> >   * @dev:     device, for DMA pre-mapping purposes
> >   * @netdev:  netdev this pool will serve (leave as NULL if none or multiple)
> >   * @napi:    NAPI which is the sole consumer of pages, otherwise NULL
> > + * @queue:   struct netdev_rx_queue this page_pool is being created for.
> >   * @dma_dir: DMA mapping direction
> >   * @max_len: max DMA sync memory size for PP_FLAG_DMA_SYNC_DEV
> >   * @offset:  DMA sync address offset for PP_FLAG_DMA_SYNC_DEV
> > @@ -64,6 +65,7 @@ struct page_pool_params {
> >               int             nid;
> >               struct device   *dev;
> >               struct napi_struct *napi;
> > +             struct netdev_rx_queue *queue;
> >               enum dma_data_direction dma_dir;
> >               unsigned int    max_len;
> >               unsigned int    offset;
> > @@ -126,6 +128,13 @@ struct page_pool_stats {
> >  };
> >  #endif
> >
> > +struct memory_provider_ops {
> > +     int (*init)(struct page_pool *pool);
> > +     void (*destroy)(struct page_pool *pool);
> > +     struct page *(*alloc_pages)(struct page_pool *pool, gfp_t gfp);
> > +     bool (*release_page)(struct page_pool *pool, struct page *page);
> > +};
>
> Separate question as I try to adapt bnxt to this and your queue
> configuration API.
>
> How does GVE handle the need to allocate kernel pages for headers and
> dmabuf for payloads?
>
> Reading the code, struct gve_rx_ring is the main per-ring object with a
> page pool. gve_queue_page_lists are filled with page pool netmem
> allocations from the page pool in gve_alloc_queue_page_list(). Are these
> strictly used for payloads only?
>

You're almost correct. We actually don't use the gve queue page lists
for devmem TCP, that's an unrelated GVE feature/code path for low
memory VMs. The code in effect is the !qpl code. In that code, for
incoming RX packets we allocate a new or recycled netmem from the page
pool in gve_alloc_page_dqo(). These buffers are used for payload only
in the case where header split is enabled. In the case header split is
disabled, these buffers are used for the entire incoming packet.

> I found a struct gve_header_buf in both gve_rx_ring and struct
> gve_per_rx_queue_mem_dpo. This is allocated in gve_rx_queue_mem_alloc()
> using dma_alloc_coherent(). Is this where GVE stores headers?
>

Yes, this is where GVE stores headers.

> IOW, GVE only uses page pool to allocate memory for QPLs, and QPLs are
> used by the device for split payloads. Is my understanding correct?
>
Christoph Hellwig March 18, 2024, 2:02 a.m. UTC | #10
On Mon, Mar 04, 2024 at 06:01:37PM -0800, Mina Almasry wrote:
> From: Jakub Kicinski <kuba@kernel.org>
> 
> The page providers which try to reuse the same pages will
> need to hold onto the ref, even if page gets released from
> the pool - as in releasing the page from the pp just transfers
> the "ownership" reference from pp to the provider, and provider
> will wait for other references to be gone before feeding this
> page back into the pool.

The word hook always rings a giant warning bell for me, and looking into
this series I am concerned indeed.

The only provider provided here is the dma-buf one, and that basically
is the only sensible one for the documented design.  So instead of
adding hooks that random proprietary crap can hook into, why not hard
code the dma buf provide and just use a flag?  That'll also avoid
expensive indirect calls.
David Wei March 18, 2024, 2:49 a.m. UTC | #11
On 2024-03-17 19:02, Christoph Hellwig wrote:
> On Mon, Mar 04, 2024 at 06:01:37PM -0800, Mina Almasry wrote:
>> From: Jakub Kicinski <kuba@kernel.org>
>>
>> The page providers which try to reuse the same pages will
>> need to hold onto the ref, even if page gets released from
>> the pool - as in releasing the page from the pp just transfers
>> the "ownership" reference from pp to the provider, and provider
>> will wait for other references to be gone before feeding this
>> page back into the pool.
> 
> The word hook always rings a giant warning bell for me, and looking into
> this series I am concerned indeed.
> 
> The only provider provided here is the dma-buf one, and that basically
> is the only sensible one for the documented design.  So instead of
> adding hooks that random proprietary crap can hook into, why not hard
> code the dma buf provide and just use a flag?  That'll also avoid
> expensive indirect calls.

I'm working on a similar proposal for zero copy Rx but to host memory
and depend on this memory provider API.

Jakub also designed this API for hugepages too IIRC. Basically there's
going to be at least three fancy ways of providing pages (one of which
isn't actually pages, hence the merged netmem_t series) to drivers.

>
Christoph Hellwig March 18, 2024, 11:22 p.m. UTC | #12
On Sun, Mar 17, 2024 at 07:49:43PM -0700, David Wei wrote:
> I'm working on a similar proposal for zero copy Rx but to host memory
> and depend on this memory provider API.

How do you need a different provider for that vs just udmabuf?

> Jakub also designed this API for hugepages too IIRC. Basically there's
> going to be at least three fancy ways of providing pages (one of which
> isn't actually pages, hence the merged netmem_t series) to drivers.

How do hugepages different from a normal page allocation?  They should
just a different ordered passed to the page allocator.
Mina Almasry March 22, 2024, 5:40 p.m. UTC | #13
Hi Christoph,

Sorry for the late reply, I've been out for a few days.

On Mon, Mar 18, 2024 at 4:22 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Sun, Mar 17, 2024 at 07:49:43PM -0700, David Wei wrote:
> > I'm working on a similar proposal for zero copy Rx but to host memory
> > and depend on this memory provider API.
>
> How do you need a different provider for that vs just udmabuf?
>

This was discussed on the io_uring ZC RFC in one of the earliest RFCs.
Here is a link to Pavel's response:

https://patchwork.kernel.org/project/netdevbpf/patch/20231106024413.2801438-6-almasrymina@google.com/#25589471

The UAPI of wrapping io_uring memory into a udmabuf just to use it
with devmem TCP only for the user to have to unwrap it is undesirable
to him.

> > Jakub also designed this API for hugepages too IIRC. Basically there's
> > going to be at least three fancy ways of providing pages (one of which
> > isn't actually pages, hence the merged netmem_t series) to drivers.
>
> How do hugepages different from a normal page allocation?  They should
> just a different ordered passed to the page allocator.
>

Yes, that's more-or-less what's what the hugepage memory provider
Jakub proposed does. The memory provider would allocate a hugepage and
hold a reference to it. Then when the page_pool needs a page, it would
allocate a PAGE_SIZE page from said hugepage region and provide it to
the page_pool, and the pool back to the driver. This allows the
hugepages to work without the page_pool and driver to be hugepage
aware and to insert huge page specific processing in it.

Other designs for this hugepage use case are possible, I'm just
describing Jakub's idea for it as a potential use-case for these
hooks. For example technically the page_pool at the moment does
support non-0 order allocations, but most drivers simply set the order
to 0 and use the page pool only for PAGE_SIZE allocations. An
alternative design could be to use this support in the page pool, but
that requires every driver to adopt this rather than a core networking
change that can apply transparently (to a large extent) to all
page_pool drivers.
Mina Almasry March 22, 2024, 5:54 p.m. UTC | #14
On Sun, Mar 17, 2024 at 7:03 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Mon, Mar 04, 2024 at 06:01:37PM -0800, Mina Almasry wrote:
> > From: Jakub Kicinski <kuba@kernel.org>
> >
> > The page providers which try to reuse the same pages will
> > need to hold onto the ref, even if page gets released from
> > the pool - as in releasing the page from the pp just transfers
> > the "ownership" reference from pp to the provider, and provider
> > will wait for other references to be gone before feeding this
> > page back into the pool.
>
> The word hook always rings a giant warning bell for me, and looking into
> this series I am concerned indeed.
>
> The only provider provided here is the dma-buf one, and that basically
> is the only sensible one for the documented design.

Sorry I don't mean to argue but as David mentioned, there are some
plans in the works and ones not in the works to extend this to other
memory types. David mentioned io_uring & Jakub's huge page use cases
which may want to re-use this design. I have an additional one in
mind, which is extending devmem TCP for storage devices. Currently
storage devices do not support dmabuf and my understanding is that
it's very hard to do so, and NVMe uses pci_p2pdma instead. I wonder if
it's possible to extend devmem TCP in the future to support pci_p2pdma
to support nvme devices in the future.

Additionally I've been thinking about a use case of limiting the
amount of memory the net stack can use. Currently the page pool is
free to allocate as much memory as it wants from the buddy allocator.
This may be undesirable in very low memory setups such as overcommited
VMs. We can imagine a memory provider that allows allocation only if
the page_pool is below a certain limit. We can also imagine a memory
provider that preallocates memory and only uses that pinned pool. None
of these are in the works at the moment, but are examples of how this
can be (reasonably?) extended.

>  So instead of
> adding hooks that random proprietary crap can hook into,

To be completely honest I'm unsure how to design hooks for proprietary
code to hook into. I think that would be done on the basis of
EXPORTED_SYMBOL? We do not export these hooks, nor plan to at the
moment.

> why not hard
> code the dma buf provide and just use a flag?  That'll also avoid
> expensive indirect calls.
>

Thankfully the indirect calls do not seem to be an issue. We've been
able to hit 95% line rate with devmem TCP and I think the remaining 5%
are a bottleneck unrelated to the indirect calls. Page_pool benchmarks
show a very minor degradation in the fast path, so small it may be
just noise in the measurement (may!):

https://lore.kernel.org/netdev/20240305020153.2787423-1-almasrymina@google.com/T/#m1c308df9665724879947a345c4b1ec3b51ff6856

This is because the code path that does indirect allocations is the
slow path. The page_pool recycles netmem aggressively.
Jakub Kicinski March 22, 2024, 11:19 p.m. UTC | #15
On Fri, 22 Mar 2024 10:40:26 -0700 Mina Almasry wrote:
> Other designs for this hugepage use case are possible, I'm just
> describing Jakub's idea for it as a potential use-case for these
> hooks. 

I made it ops because I had 4 different implementations with different
recycling algorithms. I think it's a fairly reasonable piece of code.
Christoph Hellwig March 24, 2024, 11:35 p.m. UTC | #16
On Fri, Mar 22, 2024 at 10:40:26AM -0700, Mina Almasry wrote:
> Hi Christoph,
> 
> Sorry for the late reply, I've been out for a few days.
> 
> On Mon, Mar 18, 2024 at 4:22 PM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Sun, Mar 17, 2024 at 07:49:43PM -0700, David Wei wrote:
> > > I'm working on a similar proposal for zero copy Rx but to host memory
> > > and depend on this memory provider API.
> >
> > How do you need a different provider for that vs just udmabuf?
> >
> 
> This was discussed on the io_uring ZC RFC in one of the earliest RFCs.
> Here is a link to Pavel's response:
> 
> https://patchwork.kernel.org/project/netdevbpf/patch/20231106024413.2801438-6-almasrymina@google.com/#25589471

Undesirable is not a good argument.  We need one proper API that
different subsystems share for this use case (this is the same Feedback
I gave Keith for the similar block proposal btw, not picking on the net
folks here).

If dmabuf/udmabuf doesn't work for that we need to enhance or replace
it, but not come up with little subsystem specific side channels.
Christoph Hellwig March 24, 2024, 11:35 p.m. UTC | #17
On Fri, Mar 22, 2024 at 04:19:44PM -0700, Jakub Kicinski wrote:
> On Fri, 22 Mar 2024 10:40:26 -0700 Mina Almasry wrote:
> > Other designs for this hugepage use case are possible, I'm just
> > describing Jakub's idea for it as a potential use-case for these
> > hooks. 
> 
> I made it ops because I had 4 different implementations with different
> recycling algorithms. I think it's a fairly reasonable piece of code.

Assuming we need 4 different implementation.  And I strongly question
that.
Christoph Hellwig March 24, 2024, 11:37 p.m. UTC | #18
On Fri, Mar 22, 2024 at 10:54:54AM -0700, Mina Almasry wrote:
> Sorry I don't mean to argue but as David mentioned, there are some
> plans in the works and ones not in the works to extend this to other
> memory types. David mentioned io_uring & Jakub's huge page use cases
> which may want to re-use this design. I have an additional one in
> mind, which is extending devmem TCP for storage devices. Currently
> storage devices do not support dmabuf and my understanding is that
> it's very hard to do so, and NVMe uses pci_p2pdma instead. I wonder if
> it's possible to extend devmem TCP in the future to support pci_p2pdma
> to support nvme devices in the future.

The block layer needs to suppotr dmabuf for this kind of I/O.
Any special netdev to block side channel will be NAKed before you can
even send it out.
Mina Almasry March 26, 2024, 8:19 p.m. UTC | #19
On Sun, Mar 24, 2024 at 4:37 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Fri, Mar 22, 2024 at 10:54:54AM -0700, Mina Almasry wrote:
> > Sorry I don't mean to argue but as David mentioned, there are some
> > plans in the works and ones not in the works to extend this to other
> > memory types. David mentioned io_uring & Jakub's huge page use cases
> > which may want to re-use this design. I have an additional one in
> > mind, which is extending devmem TCP for storage devices. Currently
> > storage devices do not support dmabuf and my understanding is that
> > it's very hard to do so, and NVMe uses pci_p2pdma instead. I wonder if
> > it's possible to extend devmem TCP in the future to support pci_p2pdma
> > to support nvme devices in the future.
>
> The block layer needs to suppotr dmabuf for this kind of I/O.
> Any special netdev to block side channel will be NAKed before you can
> even send it out.

Thanks, a few questions if you have time to help me understand the
potential of extending this to storage devices.

Are you envisioning that dmabuf support would be added to the block
layer (which I understand is part of the VFS and not driver specific),
or as part of the specific storage driver (like nvme for example)? If
we can add dmabuf support to the block layer itself that sounds
awesome. We may then be able to do devmem TCP on all/most storage
devices without having to modify each individual driver.

In your estimation, is adding dmabuf support to the block layer
something technically feasible & acceptable upstream? I notice you
suggested it so I'm guessing yes to both, but I thought I'd confirm.

Worthy of note this is all pertaining to potential follow up use
cases, nothing in this particular proposal is trying to do any of this
yet.
Christoph Hellwig March 28, 2024, 7:31 a.m. UTC | #20
On Tue, Mar 26, 2024 at 01:19:20PM -0700, Mina Almasry wrote:
> 
> Are you envisioning that dmabuf support would be added to the block
> layer

Yes.

> (which I understand is part of the VFS and not driver specific),

The block layer isn't really the VFS, it's just another core stack
like the network stack.

> or as part of the specific storage driver (like nvme for example)? If
> we can add dmabuf support to the block layer itself that sounds
> awesome. We may then be able to do devmem TCP on all/most storage
> devices without having to modify each individual driver.

I suspect we'll still need to touch the drivers to understand it,
but hopefully all the main infrastructure can live in the block layer.

> In your estimation, is adding dmabuf support to the block layer
> something technically feasible & acceptable upstream? I notice you
> suggested it so I'm guessing yes to both, but I thought I'd confirm.

I think so, and I know there has been quite some interest to at least
pre-register userspace memory so that the iommu overhead can be
pre-loaded.  It also is a much better interface for Peer to Peer
transfers than what we currently have.
Mina Almasry April 1, 2024, 7:22 p.m. UTC | #21
On Thu, Mar 28, 2024 at 12:31 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Tue, Mar 26, 2024 at 01:19:20PM -0700, Mina Almasry wrote:
> >
> > Are you envisioning that dmabuf support would be added to the block
> > layer
>
> Yes.
>
> > (which I understand is part of the VFS and not driver specific),
>
> The block layer isn't really the VFS, it's just another core stack
> like the network stack.
>
> > or as part of the specific storage driver (like nvme for example)? If
> > we can add dmabuf support to the block layer itself that sounds
> > awesome. We may then be able to do devmem TCP on all/most storage
> > devices without having to modify each individual driver.
>
> I suspect we'll still need to touch the drivers to understand it,
> but hopefully all the main infrastructure can live in the block layer.
>
> > In your estimation, is adding dmabuf support to the block layer
> > something technically feasible & acceptable upstream? I notice you
> > suggested it so I'm guessing yes to both, but I thought I'd confirm.
>
> I think so, and I know there has been quite some interest to at least
> pre-register userspace memory so that the iommu overhead can be
> pre-loaded.  It also is a much better interface for Peer to Peer
> transfers than what we currently have.
>

I think this is positively thrilling news for me. I was worried that
adding devmemTCP support to storage devices would involve using a
non-dmabuf standard of buffer sharing like pci_p2pdma_
(drivers/pci/p2pdma.c) and that would require messy changes to
pci_p2pdma_ that would get nacked. Also it would require adding
pci_p2pdma_ support to devmem TCP, which is a can of worms. If adding
dma-buf support to storage devices is feasible and desirable, that's a
much better approach IMO. (a) it will maybe work with devmem TCP
without any changes needed on the netdev side of things and (b)
dma-buf support may be generically useful and a good contribution even
outside of devmem TCP.

I don't have a concrete user for devmem TCP for storage devices but
the use case is very similar to GPU and I imagine the benefits in perf
can be significant in some setups.

Christoph, if you have any hints or rough specific design in mind for
how dma-buf support can be added to the block layer, please do let us
know and we'll follow your hints to investigate. But I don't want to
use up too much of your time. Marc and I can definitely read enough
code to figure out how to do it ourselves :-)

Marc, please review and consider this thread and work, this could be a
good project for you and I. I imagine the work would be:

1. Investigate how to add dma-buf support to the block layer (maybe
write a prototype code, and maybe even test it with devmem TCP).
2. Share a code or no-code proposal with netdev/fs/block layer mailing
list and try to work through concerns/nacks.
3. Finally share RFC through merging etc.

--
Thanks,
Mina
Cong Wang April 8, 2024, 3:34 p.m. UTC | #22
On Mon, Apr 01, 2024 at 12:22:24PM -0700, Mina Almasry wrote:
> On Thu, Mar 28, 2024 at 12:31 AM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Tue, Mar 26, 2024 at 01:19:20PM -0700, Mina Almasry wrote:
> > >
> > > Are you envisioning that dmabuf support would be added to the block
> > > layer
> >
> > Yes.
> >
> > > (which I understand is part of the VFS and not driver specific),
> >
> > The block layer isn't really the VFS, it's just another core stack
> > like the network stack.
> >
> > > or as part of the specific storage driver (like nvme for example)? If
> > > we can add dmabuf support to the block layer itself that sounds
> > > awesome. We may then be able to do devmem TCP on all/most storage
> > > devices without having to modify each individual driver.
> >
> > I suspect we'll still need to touch the drivers to understand it,
> > but hopefully all the main infrastructure can live in the block layer.
> >
> > > In your estimation, is adding dmabuf support to the block layer
> > > something technically feasible & acceptable upstream? I notice you
> > > suggested it so I'm guessing yes to both, but I thought I'd confirm.
> >
> > I think so, and I know there has been quite some interest to at least
> > pre-register userspace memory so that the iommu overhead can be
> > pre-loaded.  It also is a much better interface for Peer to Peer
> > transfers than what we currently have.
> >

Thanks for copying me on this. This sounds really great. 

Also P2PDMA requires PCI root complex to support this kind of direct transfer,
and IIUC dmabuf does not have such hardware dependency.

> 
> I think this is positively thrilling news for me. I was worried that
> adding devmemTCP support to storage devices would involve using a
> non-dmabuf standard of buffer sharing like pci_p2pdma_
> (drivers/pci/p2pdma.c) and that would require messy changes to
> pci_p2pdma_ that would get nacked. Also it would require adding
> pci_p2pdma_ support to devmem TCP, which is a can of worms. If adding
> dma-buf support to storage devices is feasible and desirable, that's a
> much better approach IMO. (a) it will maybe work with devmem TCP
> without any changes needed on the netdev side of things and (b)
> dma-buf support may be generically useful and a good contribution even
> outside of devmem TCP.

I think the major difference is its interface, which exposes an mmap memory
region instead of fd: https://lwn.net/Articles/906092/.

> 
> I don't have a concrete user for devmem TCP for storage devices but
> the use case is very similar to GPU and I imagine the benefits in perf
> can be significant in some setups.

We have storage use cases at ByteDance, we use NVME SSD to cache videos
transferred through network, so moving data directly from SSD to NIC
would help a lot.

Thanks!
diff mbox series

Patch

diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index 5e43a08d3231..ffe5f31fb0da 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -52,6 +52,7 @@  struct pp_alloc_cache {
  * @dev:	device, for DMA pre-mapping purposes
  * @netdev:	netdev this pool will serve (leave as NULL if none or multiple)
  * @napi:	NAPI which is the sole consumer of pages, otherwise NULL
+ * @queue:	struct netdev_rx_queue this page_pool is being created for.
  * @dma_dir:	DMA mapping direction
  * @max_len:	max DMA sync memory size for PP_FLAG_DMA_SYNC_DEV
  * @offset:	DMA sync address offset for PP_FLAG_DMA_SYNC_DEV
@@ -64,6 +65,7 @@  struct page_pool_params {
 		int		nid;
 		struct device	*dev;
 		struct napi_struct *napi;
+		struct netdev_rx_queue *queue;
 		enum dma_data_direction dma_dir;
 		unsigned int	max_len;
 		unsigned int	offset;
@@ -126,6 +128,13 @@  struct page_pool_stats {
 };
 #endif
 
+struct memory_provider_ops {
+	int (*init)(struct page_pool *pool);
+	void (*destroy)(struct page_pool *pool);
+	struct page *(*alloc_pages)(struct page_pool *pool, gfp_t gfp);
+	bool (*release_page)(struct page_pool *pool, struct page *page);
+};
+
 struct page_pool {
 	struct page_pool_params_fast p;
 
@@ -176,6 +185,9 @@  struct page_pool {
 	 */
 	struct ptr_ring ring;
 
+	void *mp_priv;
+	const struct memory_provider_ops *mp_ops;
+
 #ifdef CONFIG_PAGE_POOL_STATS
 	/* recycle stats are per-cpu to avoid locking */
 	struct page_pool_recycle_stats __percpu *recycle_stats;
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index d706fe5548df..8776fcad064a 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -25,6 +25,8 @@ 
 
 #include "page_pool_priv.h"
 
+static DEFINE_STATIC_KEY_FALSE(page_pool_mem_providers);
+
 #define DEFER_TIME (msecs_to_jiffies(1000))
 #define DEFER_WARN_INTERVAL (60 * HZ)
 
@@ -177,6 +179,7 @@  static int page_pool_init(struct page_pool *pool,
 			  int cpuid)
 {
 	unsigned int ring_qsize = 1024; /* Default */
+	int err;
 
 	memcpy(&pool->p, &params->fast, sizeof(pool->p));
 	memcpy(&pool->slow, &params->slow, sizeof(pool->slow));
@@ -248,10 +251,25 @@  static int page_pool_init(struct page_pool *pool,
 	/* Driver calling page_pool_create() also call page_pool_destroy() */
 	refcount_set(&pool->user_cnt, 1);
 
+	if (pool->mp_ops) {
+		err = pool->mp_ops->init(pool);
+		if (err) {
+			pr_warn("%s() mem-provider init failed %d\n",
+				__func__, err);
+			goto free_ptr_ring;
+		}
+
+		static_branch_inc(&page_pool_mem_providers);
+	}
+
 	if (pool->p.flags & PP_FLAG_DMA_MAP)
 		get_device(pool->p.dev);
 
 	return 0;
+
+free_ptr_ring:
+	ptr_ring_cleanup(&pool->ring, NULL);
+	return err;
 }
 
 static void page_pool_uninit(struct page_pool *pool)
@@ -546,7 +564,10 @@  struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp)
 		return page;
 
 	/* Slow-path: cache empty, do real allocation */
-	page = __page_pool_alloc_pages_slow(pool, gfp);
+	if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops)
+		page = pool->mp_ops->alloc_pages(pool, gfp);
+	else
+		page = __page_pool_alloc_pages_slow(pool, gfp);
 	return page;
 }
 EXPORT_SYMBOL(page_pool_alloc_pages);
@@ -603,10 +624,13 @@  void __page_pool_release_page_dma(struct page_pool *pool, struct page *page)
 void page_pool_return_page(struct page_pool *pool, struct page *page)
 {
 	int count;
+	bool put;
 
-	__page_pool_release_page_dma(pool, page);
-
-	page_pool_clear_pp_info(page);
+	put = true;
+	if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops)
+		put = pool->mp_ops->release_page(pool, page);
+	else
+		__page_pool_release_page_dma(pool, page);
 
 	/* This may be the last page returned, releasing the pool, so
 	 * it is not safe to reference pool afterwards.
@@ -614,7 +638,10 @@  void page_pool_return_page(struct page_pool *pool, struct page *page)
 	count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt);
 	trace_page_pool_state_release(pool, page, count);
 
-	put_page(page);
+	if (put) {
+		page_pool_clear_pp_info(page);
+		put_page(page);
+	}
 	/* An optimization would be to call __free_pages(page, pool->p.order)
 	 * knowing page is not part of page-cache (thus avoiding a
 	 * __page_cache_release() call).
@@ -884,6 +911,12 @@  static void __page_pool_destroy(struct page_pool *pool)
 
 	page_pool_unlist(pool);
 	page_pool_uninit(pool);
+
+	if (pool->mp_ops) {
+		pool->mp_ops->destroy(pool);
+		static_branch_dec(&page_pool_mem_providers);
+	}
+
 	kfree(pool);
 }