diff mbox series

[net-next,v19,06/13] memory-provider: dmabuf devmem memory provider

Message ID 20240813211317.3381180-7-almasrymina@google.com (mailing list archive)
State Handled Elsewhere
Headers show
Series [net-next,v19,01/13] netdev: add netdev_rx_queue_restart() | expand

Commit Message

Mina Almasry Aug. 13, 2024, 9:13 p.m. UTC
Implement a memory provider that allocates dmabuf devmem in the form of
net_iov.

The provider receives a reference to the struct netdev_dmabuf_binding
via the pool->mp_priv pointer. The driver needs to set this pointer for
the provider in the net_iov.

The provider obtains a reference on the netdev_dmabuf_binding which
guarantees the binding and the underlying mapping remains alive until
the provider is destroyed.

Usage of PP_FLAG_DMA_MAP is required for this memory provide such that
the page_pool can provide the driver with the dma-addrs of the devmem.

Support for PP_FLAG_DMA_SYNC_DEV is omitted for simplicity & p.order !=
0.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>

---

v19:
- Add PP_FLAG_ALLOW_UNREADABLE_NETMEM flag. It serves 2 purposes, (a)
  it guards drivers that don't support unreadable netmem (net_iov
  backed) from accidentally getting exposed to it, and (b) drivers that
  wish to create header pools can unset it for that pool to force
  readable netmem.
- Add page_pool_check_memory_provider, which verifies that the driver
  has created a page_pool with the expected configuration. This is used
  to report to the user if the mp configuration succeeded, and also
  verify that the driver is doing the right thing.
- Don't reset niov->dma_addr on allocation/free.

v17:
- Use ASSERT_RTNL (Jakub)

v16:
- Add DEBUG_NET_WARN_ON_ONCE(!rtnl_is_locked()), to catch cases if
  page_pool_init without rtnl_locking when the queue is provided. In
  this case, the queue configuration may be changed while we're initing
  the page_pool, which could be a race.

v13:
- Return on warning (Pavel).
- Fixed pool->recycle_stats not being freed on error (Pavel).
- Applied reviewed-by from Pavel.

v11:
- Rebase to not use the ops. (Christoph)

v8:
- Use skb_frag_size instead of frag->bv_len to fix patch-by-patch build
  error

v6:
- refactor new memory provider functions into net/core/devmem.c (Pavel)

v2:
- Disable devmem for p.order != 0

v1:
- static_branch check in page_is_page_pool_iov() (Willem & Paolo).
- PP_DEVMEM -> PP_IOV (David).
- Require PP_FLAG_DMA_MAP (Jakub).

---
 include/net/mp_dmabuf_devmem.h |  44 ++++++++++++++
 include/net/page_pool/types.h  |  16 ++++-
 net/core/devmem.c              |  70 ++++++++++++++++++++++
 net/core/page_pool.c           | 103 ++++++++++++++++++++++++---------
 net/core/page_pool_priv.h      |   6 ++
 net/core/page_pool_user.c      |  26 +++++++++
 6 files changed, 237 insertions(+), 28 deletions(-)
 create mode 100644 include/net/mp_dmabuf_devmem.h

Comments

Pavel Begunkov Aug. 14, 2024, 2:12 p.m. UTC | #1
On 8/13/24 22:13, Mina Almasry wrote:
> Implement a memory provider that allocates dmabuf devmem in the form of
> net_iov.
> 
> The provider receives a reference to the struct netdev_dmabuf_binding
> via the pool->mp_priv pointer. The driver needs to set this pointer for
> the provider in the net_iov.
> 
> The provider obtains a reference on the netdev_dmabuf_binding which
> guarantees the binding and the underlying mapping remains alive until
> the provider is destroyed.
> 
> Usage of PP_FLAG_DMA_MAP is required for this memory provide such that
> the page_pool can provide the driver with the dma-addrs of the devmem.
> 
> Support for PP_FLAG_DMA_SYNC_DEV is omitted for simplicity & p.order !=
> 0.
> 
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
> Signed-off-by: Mina Almasry <almasrymina@google.com>
> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
> 
> ---
> 
> v19:
> - Add PP_FLAG_ALLOW_UNREADABLE_NETMEM flag. It serves 2 purposes, (a)
>    it guards drivers that don't support unreadable netmem (net_iov
>    backed) from accidentally getting exposed to it, and (b) drivers that
>    wish to create header pools can unset it for that pool to force
>    readable netmem.
> - Add page_pool_check_memory_provider, which verifies that the driver
>    has created a page_pool with the expected configuration. This is used
>    to report to the user if the mp configuration succeeded, and also
>    verify that the driver is doing the right thing.
> - Don't reset niov->dma_addr on allocation/free.
> 
> v17:
> - Use ASSERT_RTNL (Jakub)
> 
> v16:
> - Add DEBUG_NET_WARN_ON_ONCE(!rtnl_is_locked()), to catch cases if
>    page_pool_init without rtnl_locking when the queue is provided. In
>    this case, the queue configuration may be changed while we're initing
>    the page_pool, which could be a race.
> 
> v13:
> - Return on warning (Pavel).
> - Fixed pool->recycle_stats not being freed on error (Pavel).
> - Applied reviewed-by from Pavel.
> 
> v11:
> - Rebase to not use the ops. (Christoph)
> 
> v8:
> - Use skb_frag_size instead of frag->bv_len to fix patch-by-patch build
>    error
> 
> v6:
> - refactor new memory provider functions into net/core/devmem.c (Pavel)
> 
> v2:
> - Disable devmem for p.order != 0
> 
> v1:
> - static_branch check in page_is_page_pool_iov() (Willem & Paolo).
> - PP_DEVMEM -> PP_IOV (David).
> - Require PP_FLAG_DMA_MAP (Jakub).
> 
...
> diff --git a/net/core/devmem.c b/net/core/devmem.c
> index 301f4250ca82..2f2a7f4dee4c 100644
> --- a/net/core/devmem.c
> +++ b/net/core/devmem.c
> @@ -17,6 +17,7 @@
>   #include <linux/genalloc.h>
>   #include <linux/dma-buf.h>
>   #include <net/devmem.h>
> +#include <net/mp_dmabuf_devmem.h>
>   #include <net/netdev_queues.h>
>   
>   #include "page_pool_priv.h"
> @@ -153,6 +154,10 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
>   	if (err)
>   		goto err_xa_erase;
>   
> +	err = page_pool_check_memory_provider(dev, rxq, binding);

Frankly, I pretty much don't like it.

1. We do it after reconfiguring the queue just to fail and reconfigure
it again.

2. It should be a part of the common path like netdev_rx_queue_restart(),
not specific to devmem TCP.

These two can be fixed by moving the check into
netdev_rx_queue_restart() just after ->ndo_queue_mem_alloc, assuming
that the callback where we init page pools.

3. That implicit check gives me bad feeling, instead of just getting
direct feedback from the driver, either it's a flag or an error
returned, we have to try to figure what exactly the driver did, with
a high chance this inference will fail us at some point.

And page_pool_check_memory_provider() is not that straightforward,
it doesn't walk through pools of a queue. Not looking too deep,
but it seems like the nested loop can be moved out with the same
effect, so it first looks for a pool in the device and the follows
with the bound_rxqs. And seems the bound_rxqs check would always turn
true, you set the binding into the map in
net_devmem_bind_dmabuf_to_queue() before the restart and it'll be there
after restart for page_pool_check_memory_provider(). Maybe I missed
something, but it's not super clear.

4. And the last thing Jakub mentioned is that we need to be prepared
to expose a flag to the userspace for whether a queue supports
netiov. Not really doable in a sane manner with such implicit
post configuration checks.

And that brings us back to the first approach I mentioned, where
we have a flag in the queue structure, drivers set it, and
netdev_rx_queue_restart() checks it before any callback. That's
where the thread with Jakub stopped, and it reads like at least
he's not against the idea.


> +	if (err)
> +		goto err_xa_erase;
> +
>   	return 0;
>   
>   err_xa_erase:
> @@ -305,4 +310,69 @@ void dev_dmabuf_uninstall(struct net_device *dev)
>   				xa_erase(&binding->bound_rxqs, xa_idx);
>   	}
>   }
> +
...
> diff --git a/net/core/page_pool_user.c b/net/core/page_pool_user.c
> index 3a3277ba167b..cbc54ee4f670 100644
> --- a/net/core/page_pool_user.c
> +++ b/net/core/page_pool_user.c
> @@ -344,6 +344,32 @@ void page_pool_unlist(struct page_pool *pool)
>   	mutex_unlock(&page_pools_lock);
>   }
>   
> +int page_pool_check_memory_provider(struct net_device *dev,
> +				    struct netdev_rx_queue *rxq,
> +				    struct net_devmem_dmabuf_binding *binding)
> +{
> +	struct netdev_rx_queue *binding_rxq;
> +	struct page_pool *pool;
> +	struct hlist_node *n;
> +	unsigned long xa_idx;
> +
> +	mutex_lock(&page_pools_lock);
> +	hlist_for_each_entry_safe(pool, n, &dev->page_pools, user.list) {
> +		if (pool->mp_priv != binding)
> +			continue;
> +
> +		xa_for_each(&binding->bound_rxqs, xa_idx, binding_rxq) {
> +			if (rxq != binding_rxq)
> +				continue;
> +
> +			mutex_unlock(&page_pools_lock);
> +			return 0;
> +		}
> +	}
> +	mutex_unlock(&page_pools_lock);
> +	return -ENODATA;
> +}
> +
>   static void page_pool_unreg_netdev_wipe(struct net_device *netdev)
>   {
>   	struct page_pool *pool;
Mina Almasry Aug. 14, 2024, 2:55 p.m. UTC | #2
On Wed, Aug 14, 2024 at 10:11 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
...
> > diff --git a/net/core/devmem.c b/net/core/devmem.c
> > index 301f4250ca82..2f2a7f4dee4c 100644
> > --- a/net/core/devmem.c
> > +++ b/net/core/devmem.c
> > @@ -17,6 +17,7 @@
> >   #include <linux/genalloc.h>
> >   #include <linux/dma-buf.h>
> >   #include <net/devmem.h>
> > +#include <net/mp_dmabuf_devmem.h>
> >   #include <net/netdev_queues.h>
> >
> >   #include "page_pool_priv.h"
> > @@ -153,6 +154,10 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
> >       if (err)
> >               goto err_xa_erase;
> >
> > +     err = page_pool_check_memory_provider(dev, rxq, binding);
>
> Frankly, I pretty much don't like it.
>
> 1. We do it after reconfiguring the queue just to fail and reconfigure
> it again.
>

I don't see an issue with that? Or is it just me?

> 2. It should be a part of the common path like netdev_rx_queue_restart(),
> not specific to devmem TCP.
>
> These two can be fixed by moving the check into
> netdev_rx_queue_restart() just after ->ndo_queue_mem_alloc, assuming
> that the callback where we init page pools.
>

The only reason is that the page_pool_check_memory_provider() needs to
know the memory provider to check for. Separating them keep
netdev_rx_queue_restart() usable for other future use cases that don't
expect a memory provider to be bound, but you are correct in that this
can be easily resolved by passing the binding to
netdev_rx_queue_restart() and doing the
page_pool_check_memory_providers() check inside of that function.

> 3. That implicit check gives me bad feeling, instead of just getting
> direct feedback from the driver, either it's a flag or an error
> returned, we have to try to figure what exactly the driver did, with
> a high chance this inference will fail us at some point.
>

This is where I get a bit confused. Jakub did mention that it is
desirable for core to verify that the driver did the right thing,
instead of trusting that a driver did the right thing without
verifying. Relying on a flag from the driver opens the door for the
driver to say "I support this" but actually not create the mp
page_pool. In my mind the explicit check is superior to getting
feedback from the driver.

Additionally this approach lets us detect support in core using 10
lines of code or so, rather than ask every driver that wants to
support mp to add boilerplate code to declare support (and run into
subtle bugs when this boilerplate is missing). There are minor pros
and cons to each approach; I don't see a showstopping reason to go
with one over the other.

> And page_pool_check_memory_provider() is not that straightforward,
> it doesn't walk through pools of a queue.

Right, we don't save the pp of a queue, only a netdev. The outer loop
checks all the pps of the netdev to find one with the correct binding,
and the inner loop checks that this binding is attached to the correct
queue.

> Not looking too deep,
> but it seems like the nested loop can be moved out with the same
> effect, so it first looks for a pool in the device and the follows
> with the bound_rxqs. And seems the bound_rxqs check would always turn
> true, you set the binding into the map in
> net_devmem_bind_dmabuf_to_queue() before the restart and it'll be there
> after restart for page_pool_check_memory_provider(). Maybe I missed
> something, but it's not super clear.
>
> 4. And the last thing Jakub mentioned is that we need to be prepared
> to expose a flag to the userspace for whether a queue supports
> netiov. Not really doable in a sane manner with such implicit
> post configuration checks.
>

I don't see a very strong reason to expose the flag to the userspace
now. userspace can try to bind dmabuf and get an EOPNOTSUPP if the
operation is not supported, right? In the future if passing the flag
to userspace becomes needed for some usecase, we do need feedback from
the driver, and it would be trivial to add similarly to what you
suggested.

> And that brings us back to the first approach I mentioned, where
> we have a flag in the queue structure, drivers set it, and
> netdev_rx_queue_restart() checks it before any callback. That's
> where the thread with Jakub stopped, and it reads like at least
> he's not against the idea.

Hmm, the netdev_rx_queue array is created in core, not by the driver,
does the driver set this flag during initialization? We could run into
subtle bugs with races if a code path checks for support after core
has allocated the netdev_rx_queue array but before the driver has had
a chance to declare support, right? Maybe a minor issue. Instead we
could add an ndo to the queue API that lets the driver tell us that it
could support binding on a given rx queue, and check that in
net_devmem_bind_dmabuf_to_queue() right before we do the bind?

But this is only if declaring support to userspace becomes needed for
some use case. At the moment I'm under the impression that verifying
in core that the driver did the right thing is preferred, and I'd like
to minimize the boilerplate the driver needs to implement if possible.

Additionally this series is big and blocks multiple interesting follow
up work; maybe going forward with an approach that works - and can
easily be iterated on later if we run into issues - could be wise. I
do not see an issue with adding a driver signal in the future (if
needed) and deprecating the core check (if needed), right?

--
Thanks,
Mina
Pavel Begunkov Aug. 14, 2024, 4:32 p.m. UTC | #3
On 8/14/24 15:55, Mina Almasry wrote:
> On Wed, Aug 14, 2024 at 10:11 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
> ...
>>> diff --git a/net/core/devmem.c b/net/core/devmem.c
>>> index 301f4250ca82..2f2a7f4dee4c 100644
>>> --- a/net/core/devmem.c
>>> +++ b/net/core/devmem.c
>>> @@ -17,6 +17,7 @@
>>>    #include <linux/genalloc.h>
>>>    #include <linux/dma-buf.h>
>>>    #include <net/devmem.h>
>>> +#include <net/mp_dmabuf_devmem.h>
>>>    #include <net/netdev_queues.h>
>>>
>>>    #include "page_pool_priv.h"
>>> @@ -153,6 +154,10 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
>>>        if (err)
>>>                goto err_xa_erase;
>>>
>>> +     err = page_pool_check_memory_provider(dev, rxq, binding);
>>
>> Frankly, I pretty much don't like it.
>>
>> 1. We do it after reconfiguring the queue just to fail and reconfigure
>> it again.
>>
> 
> I don't see an issue with that? Or is it just me?

Not a bug, just over excessive harassing of the interface,
which can be easily be avoided.


>> 2. It should be a part of the common path like netdev_rx_queue_restart(),
>> not specific to devmem TCP.
>>
>> These two can be fixed by moving the check into
>> netdev_rx_queue_restart() just after ->ndo_queue_mem_alloc, assuming
>> that the callback where we init page pools.
>>
> 
> The only reason is that the page_pool_check_memory_provider() needs to
> know the memory provider to check for. Separating them keep
> netdev_rx_queue_restart() usable for other future use cases that don't
> expect a memory provider to be bound, but you are correct in that this
> can be easily resolved by passing the binding to
> netdev_rx_queue_restart() and doing the
> page_pool_check_memory_providers() check inside of that function.

It's already passed inside the queue.

netdev_rx_queue_restart() {
	if (rxq->mp_params && !rxq->netiov_supported)
		fail;
}

>> 3. That implicit check gives me bad feeling, instead of just getting
>> direct feedback from the driver, either it's a flag or an error
>> returned, we have to try to figure what exactly the driver did, with
>> a high chance this inference will fail us at some point.
>>
> 
> This is where I get a bit confused. Jakub did mention that it is
> desirable for core to verify that the driver did the right thing,
> instead of trusting that a driver did the right thing without
> verifying. Relying on a flag from the driver opens the door for the
> driver to say "I support this" but actually not create the mp
> page_pool. In my mind the explicit check is superior to getting
> feedback from the driver.

You can apply the same argument to anything, but not like
after each for example ->ndo_start_xmit we dig into the
interface's pending queue to make sure it was actually queued.

And even if you check that there is a page pool, the driver
can just create an empty pool that it'll never use. There
are always ways to make it wrong.

Yes, there is a difference, and I'm not against it as a
WARN_ON_ONCE after failing it in a more explicit way.

Jakub might have a different opinion on how it should look
like, and we can clarify on that, but I do believe it's a
confusing interface that can be easily made better.

> Additionally this approach lets us detect support in core using 10
> lines of code or so, rather than ask every driver that wants to
> support mp to add boilerplate code to declare support (and run into
> subtle bugs when this boilerplate is missing). There are minor pros

Right, 10 lines of some odd code, which not even clear off the
bat why it's there if we get an error code from the restart /
callbacks, vs one line of "boilerplate" per driver a la

rxq->netiov_supported = true;

that can be added while implementing the queue api. If it's
missing the user gets back not a subtle error.


> and cons to each approach; I don't see a showstopping reason to go
> with one over the other.
> 
>> And page_pool_check_memory_provider() is not that straightforward,
>> it doesn't walk through pools of a queue.
> 
> Right, we don't save the pp of a queue, only a netdev. The outer loop
> checks all the pps of the netdev to find one with the correct binding,
> and the inner loop checks that this binding is attached to the correct
> queue.

That's the thing, I doubt about the second part.

net_devmem_bind_dmabuf_to_queue() {
	err = xa_alloc(&binding->bound_rxqs, &xa_idx, rxq);
	if (err)
		return err;

	netdev_rx_queue_restart();

	// page_pool_check_memory_provider
	...
	xa_for_each(&binding->bound_rxqs, xa_idx, binding_rxq) {
		if (rxq == binding_rxq)
			return success;
}

Can't b4 the patches for some reason, but that's the highlight
from the patchset, correct me if I'm wrong. That xa_for_each
check is always true because you put the queue in there right
before it, and I don't that anyone could've erased it.

The problem here is that it seems the ->bound_rxqs state doesn't
depend on what page pools were actually created and with what mp.

So if you try to add a binding to 2 queues of the same interface,
the first succeeds and the second silently fails, then the
following net_devmem_bind_dmabuf_to_queue() for the second queue
will say that everything is fine, because there is a pool for
the first queue with the binding and the queue check is just
true.

>> Not looking too deep,
>> but it seems like the nested loop can be moved out with the same
>> effect, so it first looks for a pool in the device and the follows
>> with the bound_rxqs. And seems the bound_rxqs check would always turn
>> true, you set the binding into the map in
>> net_devmem_bind_dmabuf_to_queue() before the restart and it'll be there
>> after restart for page_pool_check_memory_provider(). Maybe I missed
>> something, but it's not super clear.
>>
>> 4. And the last thing Jakub mentioned is that we need to be prepared
>> to expose a flag to the userspace for whether a queue supports
>> netiov. Not really doable in a sane manner with such implicit
>> post configuration checks.
>>
> 
> I don't see a very strong reason to expose the flag to the userspace

I'll leave that for Jakub to answer

> now. userspace can try to bind dmabuf and get an EOPNOTSUPP if the
> operation is not supported, right? In the future if passing the flag
> to userspace becomes needed for some usecase, we do need feedback from
> the driver, and it would be trivial to add similarly to what you
> suggested.

Doable, since it wouldn't change the user api, but that means
refactoring millions drivers (ok, ok, 4-5) instead of preparing
for it from the beginning.

>> And that brings us back to the first approach I mentioned, where
>> we have a flag in the queue structure, drivers set it, and
>> netdev_rx_queue_restart() checks it before any callback. That's
>> where the thread with Jakub stopped, and it reads like at least
>> he's not against the idea.
> 
> Hmm, the netdev_rx_queue array is created in core, not by the driver,
> does the driver set this flag during initialization? We could run into
> subtle bugs with races if a code path checks for support after core
> has allocated the netdev_rx_queue array but before the driver has had
> a chance to declare support, right? Maybe a minor issue. Instead we

Which is fine, it'd just fail, how are we going to attach to a
queue that hasn't been initialised by the driver yet. Regardless,
I doubt we expose the interface before the driver has a chance
to init it, but I'd need to look it up (or defer to Jakub) to
say for sure.

> could add an ndo to the queue API that lets the driver tell us that it
> could support binding on a given rx queue, and check that in
> net_devmem_bind_dmabuf_to_queue() right before we do the bind?
> 
> But this is only if declaring support to userspace becomes needed for
> some use case. At the moment I'm under the impression that verifying
> in core that the driver did the right thing is preferred, and I'd like
> to minimize the boilerplate the driver needs to implement if possible.
> 
> Additionally this series is big and blocks multiple interesting follow
> up work; maybe going forward with an approach that works - and can
> easily be iterated on later if we run into issues - could be wise. I
> do not see an issue with adding a driver signal in the future (if
> needed) and deprecating the core check (if needed), right?
Jakub Kicinski Aug. 16, 2024, 12:55 a.m. UTC | #4
On Tue, 13 Aug 2024 21:13:08 +0000 Mina Almasry wrote:
> +EXPORT_SYMBOL(page_pool_mem_providers);

not sure if this export is needed, but it doesn't appear to be needed
by this patch?
Jakub Kicinski Aug. 16, 2024, 1:22 a.m. UTC | #5
On Wed, 14 Aug 2024 17:32:53 +0100 Pavel Begunkov wrote:
> > This is where I get a bit confused. Jakub did mention that it is
> > desirable for core to verify that the driver did the right thing,
> > instead of trusting that a driver did the right thing without
> > verifying. Relying on a flag from the driver opens the door for the
> > driver to say "I support this" but actually not create the mp
> > page_pool. In my mind the explicit check is superior to getting
> > feedback from the driver.  
> 
> You can apply the same argument to anything, but not like
> after each for example ->ndo_start_xmit we dig into the
> interface's pending queue to make sure it was actually queued.
> 
> And even if you check that there is a page pool, the driver
> can just create an empty pool that it'll never use. There
> are always ways to make it wrong.
> 
> Yes, there is a difference, and I'm not against it as a
> WARN_ON_ONCE after failing it in a more explicit way.
> 
> Jakub might have a different opinion on how it should look
> like, and we can clarify on that, but I do believe it's a
> confusing interface that can be easily made better.

My queue API RFC patches had configuration arguments, not sure if this
is the right version but you'll get the idea:
https://github.com/kuba-moo/linux/blob/qcfg/include/net/netdev_cfg.h#L43-L50
This way we can _tell_ the driver what the config should be. That part
got lost somewhere along the way, because perhaps in its embryonic form
it doesn't make sense.

We can bring it back, add HDS with threshold of 0, to it, and a bit for
non-readable memory. On top of that "capability bits" in struct
netdev_queue_mgmt_ops to mark that the driver pays attention to particular
fields of the config.

Not sure if it should block the series, but that'd be the way I'd do it
(for now?)

I'd keep the current check with a WARN_ON_ONCE(), tho.
Given the absence of tests driver developers can use.
Especially those who _aren't_ supporting the feature.

> > and cons to each approach; I don't see a showstopping reason to go
> > with one over the other.
> >   
> >> And page_pool_check_memory_provider() is not that straightforward,
> >> it doesn't walk through pools of a queue.  
> > 
> > Right, we don't save the pp of a queue, only a netdev. The outer loop
> > checks all the pps of the netdev to find one with the correct binding,
> > and the inner loop checks that this binding is attached to the correct
> > queue.  
> 
> That's the thing, I doubt about the second part.
> 
> net_devmem_bind_dmabuf_to_queue() {
> 	err = xa_alloc(&binding->bound_rxqs, &xa_idx, rxq);
> 	if (err)
> 		return err;
> 
> 	netdev_rx_queue_restart();
> 
> 	// page_pool_check_memory_provider
> 	...
> 	xa_for_each(&binding->bound_rxqs, xa_idx, binding_rxq) {
> 		if (rxq == binding_rxq)
> 			return success;
> }
> 
> Can't b4 the patches for some reason, but that's the highlight
> from the patchset, correct me if I'm wrong. That xa_for_each
> check is always true because you put the queue in there right
> before it, and I don't that anyone could've erased it.
> 
> The problem here is that it seems the ->bound_rxqs state doesn't
> depend on what page pools were actually created and with what mp.

FWIW I don't understand the point of walking the xa either.
Just check the queue number of the pp you found matches,
page pool params are saved in the page pool. No?
Mina Almasry Aug. 16, 2024, 12:20 p.m. UTC | #6
On Thu, Aug 15, 2024 at 9:22 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 14 Aug 2024 17:32:53 +0100 Pavel Begunkov wrote:
> > > This is where I get a bit confused. Jakub did mention that it is
> > > desirable for core to verify that the driver did the right thing,
> > > instead of trusting that a driver did the right thing without
> > > verifying. Relying on a flag from the driver opens the door for the
> > > driver to say "I support this" but actually not create the mp
> > > page_pool. In my mind the explicit check is superior to getting
> > > feedback from the driver.
> >
> > You can apply the same argument to anything, but not like
> > after each for example ->ndo_start_xmit we dig into the
> > interface's pending queue to make sure it was actually queued.
> >
> > And even if you check that there is a page pool, the driver
> > can just create an empty pool that it'll never use. There
> > are always ways to make it wrong.
> >
> > Yes, there is a difference, and I'm not against it as a
> > WARN_ON_ONCE after failing it in a more explicit way.
> >
> > Jakub might have a different opinion on how it should look
> > like, and we can clarify on that, but I do believe it's a
> > confusing interface that can be easily made better.
>
> My queue API RFC patches had configuration arguments, not sure if this
> is the right version but you'll get the idea:
> https://github.com/kuba-moo/linux/blob/qcfg/include/net/netdev_cfg.h#L43-L50
> This way we can _tell_ the driver what the config should be. That part
> got lost somewhere along the way, because perhaps in its embryonic form
> it doesn't make sense.
>
> We can bring it back, add HDS with threshold of 0, to it, and a bit for
> non-readable memory. On top of that "capability bits" in struct
> netdev_queue_mgmt_ops to mark that the driver pays attention to particular
> fields of the config.
>
> Not sure if it should block the series, but that'd be the way I'd do it
> (for now?)
>

I'm not sure I want to go into a rabbit hole of adding configuration
via the queue API, blocking this series . We had discussed this months
back and figured that it's a significant undertaking on its own. I'm
not sure GVE has HDS threshold capability for example, and I'm also
not sure how to coexist header split negotiability via the queue API
when an ethtool API exists alongside it. I think this is worthy of
separating in its own follow up series.

For now detecting that the driver was able to create the page_pool
with the correct memory provider in core should be sufficient. Also
asking the driver to set a
netdev_rx_queue->unreadable_netmem_supported flag should also be
sufficient. I've implemented both locally and they work well.

> I'd keep the current check with a WARN_ON_ONCE(), tho.
> Given the absence of tests driver developers can use.
> Especially those who _aren't_ supporting the feature.
>

Yes what I have locally is the driver setting
netdev_rx_queue->unreadable_netmem_supported when header split is
turned on, and additionally a WARN_ON_ONCE around the check in core. I
was about to send that when I read your email. I'm hoping we don't
have to go through the scope creep of adding configuration via the
queue API, which I think is a very significant undertaking.

> > > and cons to each approach; I don't see a showstopping reason to go
> > > with one over the other.
> > >
> > >> And page_pool_check_memory_provider() is not that straightforward,
> > >> it doesn't walk through pools of a queue.
> > >
> > > Right, we don't save the pp of a queue, only a netdev. The outer loop
> > > checks all the pps of the netdev to find one with the correct binding,
> > > and the inner loop checks that this binding is attached to the correct
> > > queue.
> >
> > That's the thing, I doubt about the second part.
> >
> > net_devmem_bind_dmabuf_to_queue() {
> >       err = xa_alloc(&binding->bound_rxqs, &xa_idx, rxq);
> >       if (err)
> >               return err;
> >
> >       netdev_rx_queue_restart();
> >
> >       // page_pool_check_memory_provider
> >       ...
> >       xa_for_each(&binding->bound_rxqs, xa_idx, binding_rxq) {
> >               if (rxq == binding_rxq)
> >                       return success;
> > }
> >
> > Can't b4 the patches for some reason, but that's the highlight
> > from the patchset, correct me if I'm wrong. That xa_for_each
> > check is always true because you put the queue in there right
> > before it, and I don't that anyone could've erased it.
> >
> > The problem here is that it seems the ->bound_rxqs state doesn't
> > depend on what page pools were actually created and with what mp.
>
> FWIW I don't understand the point of walking the xa either.
> Just check the queue number of the pp you found matches,
> page pool params are saved in the page pool. No?
>

Yes, I changed this check to check pool->p.queue, and it works fine.
Jakub Kicinski Aug. 16, 2024, 3:35 p.m. UTC | #7
On Fri, 16 Aug 2024 08:20:44 -0400 Mina Almasry wrote:
> > I'd keep the current check with a WARN_ON_ONCE(), tho.
> > Given the absence of tests driver developers can use.
> > Especially those who _aren't_ supporting the feature.
> 
> Yes what I have locally is the driver setting
> netdev_rx_queue->unreadable_netmem_supported when header split is
> turned on, and additionally a WARN_ON_ONCE around the check in core. I
> was about to send that when I read your email. I'm hoping we don't
> have to go through the scope creep of adding configuration via the
> queue API, which I think is a very significant undertaking.

I don't like adding more and more transient stuff to netdev_rx_queue.
It's one thing if we create a temporary solution in the core, which
we can easily redo later. It's another altogether when we expect drivers
to keep some bit up to date across all the reconfiguration paths they
have. Just to then got an replace that with another API.

If the post-check works let's go with that for now.
diff mbox series

Patch

diff --git a/include/net/mp_dmabuf_devmem.h b/include/net/mp_dmabuf_devmem.h
new file mode 100644
index 000000000000..300a2356eed0
--- /dev/null
+++ b/include/net/mp_dmabuf_devmem.h
@@ -0,0 +1,44 @@ 
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Dmabuf device memory provider.
+ *
+ * Authors:	Mina Almasry <almasrymina@google.com>
+ *
+ */
+#ifndef _NET_MP_DMABUF_DEVMEM_H
+#define _NET_MP_DMABUF_DEVMEM_H
+
+#include <net/netmem.h>
+
+#if defined(CONFIG_DMA_SHARED_BUFFER) && defined(CONFIG_GENERIC_ALLOCATOR)
+int mp_dmabuf_devmem_init(struct page_pool *pool);
+
+netmem_ref mp_dmabuf_devmem_alloc_netmems(struct page_pool *pool, gfp_t gfp);
+
+void mp_dmabuf_devmem_destroy(struct page_pool *pool);
+
+bool mp_dmabuf_devmem_release_page(struct page_pool *pool, netmem_ref netmem);
+#else
+static inline int mp_dmabuf_devmem_init(struct page_pool *pool)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline netmem_ref mp_dmabuf_devmem_alloc_netmems(struct page_pool *pool,
+							gfp_t gfp)
+{
+	return 0;
+}
+
+static inline void mp_dmabuf_devmem_destroy(struct page_pool *pool)
+{
+}
+
+static inline bool mp_dmabuf_devmem_release_page(struct page_pool *pool,
+						 netmem_ref netmem)
+{
+	return false;
+}
+#endif
+
+#endif /* _NET_MP_DMABUF_DEVMEM_H */
diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index 4afd6dd56351..e221521cd7f3 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -20,8 +20,17 @@ 
 					* device driver responsibility
 					*/
 #define PP_FLAG_SYSTEM_POOL	BIT(2) /* Global system page_pool */
+#define PP_FLAG_ALLOW_UNREADABLE_NETMEM	BIT(3) /* Allow unreadable (net_iov
+						* backed) netmem in this
+						* page_pool. Drivers setting
+						* this must be able to support
+						* unreadable netmem, where
+						* netmem_address() would return
+						* NULL. This flag should not be
+						* set for header page_pools.
+						*/
 #define PP_FLAG_ALL		(PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV | \
-				 PP_FLAG_SYSTEM_POOL)
+				 PP_FLAG_SYSTEM_POOL | PP_FLAG_ALLOW_UNREADABLE_NETMEM)
 
 /*
  * Fast allocation side cache array/stack
@@ -52,12 +61,14 @@  struct pp_alloc_cache {
  * @nid:	NUMA node id to allocate from pages from
  * @dev:	device, for DMA pre-mapping purposes
  * @napi:	NAPI which is the sole consumer of pages, otherwise NULL
+ * @queue:	struct netdev_rx_queue this page_pool is being created for.
  * @dma_dir:	DMA mapping direction
  * @max_len:	max DMA sync memory size for PP_FLAG_DMA_SYNC_DEV
  * @offset:	DMA sync address offset for PP_FLAG_DMA_SYNC_DEV
  * @slow:	params with slowpath access only (initialization and Netlink)
  * @netdev:	netdev this pool will serve (leave as NULL if none or multiple)
- * @flags:	PP_FLAG_DMA_MAP, PP_FLAG_DMA_SYNC_DEV, PP_FLAG_SYSTEM_POOL
+ * @flags:	PP_FLAG_DMA_MAP, PP_FLAG_DMA_SYNC_DEV, PP_FLAG_SYSTEM_POOL,
+ *		PP_FLAG_ALLOW_UNREADABLE_NETMEM.
  */
 struct page_pool_params {
 	struct_group_tagged(page_pool_params_fast, fast,
@@ -66,6 +77,7 @@  struct page_pool_params {
 		int		nid;
 		struct device	*dev;
 		struct napi_struct *napi;
+		struct netdev_rx_queue *queue;
 		enum dma_data_direction dma_dir;
 		unsigned int	max_len;
 		unsigned int	offset;
diff --git a/net/core/devmem.c b/net/core/devmem.c
index 301f4250ca82..2f2a7f4dee4c 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -17,6 +17,7 @@ 
 #include <linux/genalloc.h>
 #include <linux/dma-buf.h>
 #include <net/devmem.h>
+#include <net/mp_dmabuf_devmem.h>
 #include <net/netdev_queues.h>
 
 #include "page_pool_priv.h"
@@ -153,6 +154,10 @@  int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
 	if (err)
 		goto err_xa_erase;
 
+	err = page_pool_check_memory_provider(dev, rxq, binding);
+	if (err)
+		goto err_xa_erase;
+
 	return 0;
 
 err_xa_erase:
@@ -305,4 +310,69 @@  void dev_dmabuf_uninstall(struct net_device *dev)
 				xa_erase(&binding->bound_rxqs, xa_idx);
 	}
 }
+
+/*** "Dmabuf devmem memory provider" ***/
+
+int mp_dmabuf_devmem_init(struct page_pool *pool)
+{
+	struct net_devmem_dmabuf_binding *binding = pool->mp_priv;
+
+	if (!binding)
+		return -EINVAL;
+
+	if (!pool->dma_map)
+		return -EOPNOTSUPP;
+
+	if (pool->dma_sync)
+		return -EOPNOTSUPP;
+
+	if (pool->p.order != 0)
+		return -E2BIG;
+
+	net_devmem_dmabuf_binding_get(binding);
+	return 0;
+}
+
+netmem_ref mp_dmabuf_devmem_alloc_netmems(struct page_pool *pool, gfp_t gfp)
+{
+	struct net_devmem_dmabuf_binding *binding = pool->mp_priv;
+	netmem_ref netmem;
+	struct net_iov *niov;
+
+	niov = net_devmem_alloc_dmabuf(binding);
+	if (!niov)
+		return 0;
+
+	netmem = net_iov_to_netmem(niov);
+
+	page_pool_set_pp_info(pool, netmem);
+
+	pool->pages_state_hold_cnt++;
+	trace_page_pool_state_hold(pool, netmem, pool->pages_state_hold_cnt);
+	return netmem;
+}
+
+void mp_dmabuf_devmem_destroy(struct page_pool *pool)
+{
+	struct net_devmem_dmabuf_binding *binding = pool->mp_priv;
+
+	net_devmem_dmabuf_binding_put(binding);
+}
+
+bool mp_dmabuf_devmem_release_page(struct page_pool *pool, netmem_ref netmem)
+{
+	if (WARN_ON_ONCE(!netmem_is_net_iov(netmem)))
+		return false;
+
+	if (WARN_ON_ONCE(atomic_long_read(netmem_get_pp_ref_count_ref(netmem)) !=
+		     1))
+		return false;
+
+	page_pool_clear_pp_info(netmem);
+
+	net_devmem_free_dmabuf(netmem_to_net_iov(netmem));
+
+	/* We don't want the page pool put_page()ing our net_iovs. */
+	return false;
+}
 #endif
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 13277f05aebd..25be3327561b 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -13,6 +13,7 @@ 
 
 #include <net/page_pool/helpers.h>
 #include <net/xdp.h>
+#include <net/netdev_rx_queue.h>
 
 #include <linux/dma-direction.h>
 #include <linux/dma-mapping.h>
@@ -21,6 +22,9 @@ 
 #include <linux/poison.h>
 #include <linux/ethtool.h>
 #include <linux/netdevice.h>
+#include <linux/genalloc.h>
+#include <net/devmem.h>
+#include <net/mp_dmabuf_devmem.h>
 
 #include <trace/events/page_pool.h>
 
@@ -28,6 +32,7 @@ 
 #include "netmem_priv.h"
 
 DEFINE_STATIC_KEY_FALSE(page_pool_mem_providers);
+EXPORT_SYMBOL(page_pool_mem_providers);
 
 #define DEFER_TIME (msecs_to_jiffies(1000))
 #define DEFER_WARN_INTERVAL (60 * HZ)
@@ -190,6 +195,7 @@  static int page_pool_init(struct page_pool *pool,
 			  int cpuid)
 {
 	unsigned int ring_qsize = 1024; /* Default */
+	int err;
 
 	page_pool_struct_check();
 
@@ -271,7 +277,36 @@  static int page_pool_init(struct page_pool *pool,
 	if (pool->dma_map)
 		get_device(pool->p.dev);
 
+	if (pool->p.queue &&
+	    pool->slow.flags & PP_FLAG_ALLOW_UNREADABLE_NETMEM) {
+		/* We rely on rtnl_lock()ing to make sure netdev_rx_queue
+		 * configuration doesn't change while we're initializing the
+		 * page_pool.
+		 */
+		ASSERT_RTNL();
+		pool->mp_priv = pool->p.queue->mp_params.mp_priv;
+	}
+
+	if (pool->mp_priv) {
+		err = mp_dmabuf_devmem_init(pool);
+		if (err) {
+			pr_warn("%s() mem-provider init failed %d\n", __func__,
+				err);
+			goto free_ptr_ring;
+		}
+
+		static_branch_inc(&page_pool_mem_providers);
+	}
+
 	return 0;
+
+free_ptr_ring:
+	ptr_ring_cleanup(&pool->ring, NULL);
+#ifdef CONFIG_PAGE_POOL_STATS
+	if (!pool->system)
+		free_percpu(pool->recycle_stats);
+#endif
+	return err;
 }
 
 static void page_pool_uninit(struct page_pool *pool)
@@ -455,28 +490,6 @@  static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
 	return false;
 }
 
-static void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem)
-{
-	netmem_set_pp(netmem, pool);
-	netmem_or_pp_magic(netmem, PP_SIGNATURE);
-
-	/* Ensuring all pages have been split into one fragment initially:
-	 * page_pool_set_pp_info() is only called once for every page when it
-	 * is allocated from the page allocator and page_pool_fragment_page()
-	 * is dirtying the same cache line as the page->pp_magic above, so
-	 * the overhead is negligible.
-	 */
-	page_pool_fragment_netmem(netmem, 1);
-	if (pool->has_init_callback)
-		pool->slow.init_callback(netmem, pool->slow.init_arg);
-}
-
-static void page_pool_clear_pp_info(netmem_ref netmem)
-{
-	netmem_clear_pp_magic(netmem);
-	netmem_set_pp(netmem, NULL);
-}
-
 static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
 						 gfp_t gfp)
 {
@@ -572,7 +585,10 @@  netmem_ref page_pool_alloc_netmem(struct page_pool *pool, gfp_t gfp)
 		return netmem;
 
 	/* Slow-path: cache empty, do real allocation */
-	netmem = __page_pool_alloc_pages_slow(pool, gfp);
+	if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_priv)
+		netmem = mp_dmabuf_devmem_alloc_netmems(pool, gfp);
+	else
+		netmem = __page_pool_alloc_pages_slow(pool, gfp);
 	return netmem;
 }
 EXPORT_SYMBOL(page_pool_alloc_netmem);
@@ -608,6 +624,28 @@  s32 page_pool_inflight(const struct page_pool *pool, bool strict)
 	return inflight;
 }
 
+void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem)
+{
+	netmem_set_pp(netmem, pool);
+	netmem_or_pp_magic(netmem, PP_SIGNATURE);
+
+	/* Ensuring all pages have been split into one fragment initially:
+	 * page_pool_set_pp_info() is only called once for every page when it
+	 * is allocated from the page allocator and page_pool_fragment_page()
+	 * is dirtying the same cache line as the page->pp_magic above, so
+	 * the overhead is negligible.
+	 */
+	page_pool_fragment_netmem(netmem, 1);
+	if (pool->has_init_callback)
+		pool->slow.init_callback(netmem, pool->slow.init_arg);
+}
+
+void page_pool_clear_pp_info(netmem_ref netmem)
+{
+	netmem_clear_pp_magic(netmem);
+	netmem_set_pp(netmem, NULL);
+}
+
 static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
 							 netmem_ref netmem)
 {
@@ -636,8 +674,13 @@  static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
 void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
 {
 	int count;
+	bool put;
 
-	__page_pool_release_page_dma(pool, netmem);
+	put = true;
+	if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_priv)
+		put = mp_dmabuf_devmem_release_page(pool, netmem);
+	else
+		__page_pool_release_page_dma(pool, netmem);
 
 	/* This may be the last page returned, releasing the pool, so
 	 * it is not safe to reference pool afterwards.
@@ -645,8 +688,10 @@  void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
 	count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt);
 	trace_page_pool_state_release(pool, netmem, count);
 
-	page_pool_clear_pp_info(netmem);
-	put_page(netmem_to_page(netmem));
+	if (put) {
+		page_pool_clear_pp_info(netmem);
+		put_page(netmem_to_page(netmem));
+	}
 	/* An optimization would be to call __free_pages(page, pool->p.order)
 	 * knowing page is not part of page-cache (thus avoiding a
 	 * __page_cache_release() call).
@@ -965,6 +1010,12 @@  static void __page_pool_destroy(struct page_pool *pool)
 
 	page_pool_unlist(pool);
 	page_pool_uninit(pool);
+
+	if (pool->mp_priv) {
+		mp_dmabuf_devmem_destroy(pool);
+		static_branch_dec(&page_pool_mem_providers);
+	}
+
 	kfree(pool);
 }
 
diff --git a/net/core/page_pool_priv.h b/net/core/page_pool_priv.h
index 581501b5cd8c..fcd9c1a227bc 100644
--- a/net/core/page_pool_priv.h
+++ b/net/core/page_pool_priv.h
@@ -33,4 +33,10 @@  static inline bool page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
 	return page_pool_set_dma_addr_netmem(page_to_netmem(page), addr);
 }
 
+void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem);
+void page_pool_clear_pp_info(netmem_ref netmem);
+int page_pool_check_memory_provider(struct net_device *dev,
+				    struct netdev_rx_queue *rxq,
+				    struct net_devmem_dmabuf_binding *binding);
+
 #endif
diff --git a/net/core/page_pool_user.c b/net/core/page_pool_user.c
index 3a3277ba167b..cbc54ee4f670 100644
--- a/net/core/page_pool_user.c
+++ b/net/core/page_pool_user.c
@@ -344,6 +344,32 @@  void page_pool_unlist(struct page_pool *pool)
 	mutex_unlock(&page_pools_lock);
 }
 
+int page_pool_check_memory_provider(struct net_device *dev,
+				    struct netdev_rx_queue *rxq,
+				    struct net_devmem_dmabuf_binding *binding)
+{
+	struct netdev_rx_queue *binding_rxq;
+	struct page_pool *pool;
+	struct hlist_node *n;
+	unsigned long xa_idx;
+
+	mutex_lock(&page_pools_lock);
+	hlist_for_each_entry_safe(pool, n, &dev->page_pools, user.list) {
+		if (pool->mp_priv != binding)
+			continue;
+
+		xa_for_each(&binding->bound_rxqs, xa_idx, binding_rxq) {
+			if (rxq != binding_rxq)
+				continue;
+
+			mutex_unlock(&page_pools_lock);
+			return 0;
+		}
+	}
+	mutex_unlock(&page_pools_lock);
+	return -ENODATA;
+}
+
 static void page_pool_unreg_netdev_wipe(struct net_device *netdev)
 {
 	struct page_pool *pool;