[RFC,net-next,v8,02/14] net: page_pool: create hooks for custom page providers

Message ID 20240403002053.2376017-3-almasrymina@google.com
State Superseded
Series Device Memory TCP

Commit Message

Mina Almasry April 3, 2024, 12:20 a.m. UTC
From: Jakub Kicinski <kuba@kernel.org>

The page providers which try to reuse the same pages will
need to hold onto the ref, even if page gets released from
the pool - as in releasing the page from the pp just transfers
the "ownership" reference from pp to the provider, and provider
will wait for other references to be gone before feeding this
page back into the pool.
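
To illustrate (not part of this patch): a reuse-style provider's
->release_page() hook could look roughly like this; everything except
the hook signature below is hypothetical:

static bool mp_example_release_page(struct page_pool *pool,
				    struct page *page)
{
	struct example_provider *p = pool->mp_priv;

	/* Keep the "ownership" ref instead of letting the pp drop it;
	 * the provider refeeds the page once other references are gone.
	 */
	p->pending[p->nr_pending++] = page;
	return false;	/* false: the pp skips the final put_page() */
}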

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Mina Almasry <almasrymina@google.com>

---

This is implemented by Jakub in his RFC:
https://lore.kernel.org/netdev/f8270765-a27b-6ccf-33ea-cda097168d79@redhat.com/T/

I take no credit for the idea or implementation; I only added minor
edits to make this workable with device memory TCP, and removed some
hacky test code. This is a critical dependency of device memory TCP
and thus I'm pulling it into this series to make it reviewable and
mergeable.

RFC v3 -> v1
- Removed unused mem_provider. (Yunsheng).
- Replaced memory_provider & mp_priv with netdev_rx_queue (Jakub).

---
 include/net/page_pool/types.h | 12 ++++++++++
 net/core/page_pool.c          | 43 +++++++++++++++++++++++++++++++----
 2 files changed, 50 insertions(+), 5 deletions(-)

Comments

Christoph Hellwig May 1, 2024, 7:54 a.m. UTC | #1
Still NAK to creating arbitrary hooks here.  This should be a page or
dmabuf pool and not an indirect call abstraction allowing random
crap to hook into it.
Mina Almasry May 3, 2024, 8:10 p.m. UTC | #2
Sorry for the late reply.

On Wed, May 1, 2024 at 12:55 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> Still NAK to creating arbitrary hooks here.

Is the concern still that folks may be able to hook proprietary stuff
into this like you mentioned before[1]?

I don't see how that can be done as currently written. The page_pool
grabs the memory_provider_ops from the netdev_rx_queue struct managed
by core net stack and not really overridable by external modules. When
the netdev creates the page_pool, it gets the core-managed
netdev_rx_queue via something like __netif_get_rx_queue() and passes
that to page_pool_create().
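
As a rough sketch of that flow (the driver side here is illustrative,
not taken from any in-tree driver; only the .queue param and the ops
lookup come from this series):

	struct netdev_rx_queue *rxq = __netif_get_rx_queue(netdev, qid);
	struct page_pool_params pp_params = {
		.dev	= dev,
		.napi	= &rq->napi,
		.queue	= rxq,	/* core-managed; core installs any mp_ops */
	};
	struct page_pool *pool = page_pool_create(&pp_params);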

We could make the memory_provider_ops even more opaque by only
allowing the device to pass in the netdev + queue num to the
page_pool_create, and have the page_pool_create query the
netdev_rx_queue struct, to make sure we're getting the one managed by
core.

Long story short is that as currently written I think it's pretty much
impossible for someone to plug in a proprietary out-of-tree memory
provider using these hooks, and if desired I can change the code
slightly to make it even more difficult (but maybe that's pointless, I
don't think it's possible even in the current iteration). The only way
to get a memory_provider_ops in is to seek to merge it as part of the
kernel with community approval. Is there something I'm missing here?

> This should be a page or
> dmabuf pool and not an indirect call abstraction allowing random
> crap to hook into it.
>

What is the suggested fix here? I do something like:

cp net/core/page_pool.c net/core/dmabuf_pool.c

and then modify it such that the net stack maintains 2 page_pools?
There are a lot of cons to that:

1. Code duplication/maintenance (page_pool.c + dmabuf_pool.c will look
very similar).

2. The hooks enable more use cases than dmabuf_pool + standard pages.
In addition to those, I'm thinking of (but not working on):
a. Limited memory pools. I.e. a page_pool limited to a certain amount
of memory (for overcommitted VMs); see the sketch after this list.
b. dmabuf pools with GPU virtual addresses. Currently we seek to
support dmabuf memory where the virtual address is an offset into the
dmabuf for CPU access. For GPU memory accessible to the GPU we need
dmabuf memory where the virtual address is the GPU virtual address.

3. Support for multiple page_pools is actually more proprietary
friendly IMO. Currently the page_pool is internal to core. If we start
adding additional pools we need to have some uniform behavior between
all the pools so core can operate on memory that originated from any
one of them. In that case it becomes actually easier for someone to
develop an out of tree pool and use it from their out-of-tree driver
and as long as their out of tree page_pool behaves similarly enough to
the decided uniform behavior, it may be able to fool core into
thinking it's an in-tree pool...
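
For illustration, the limited pool from (2a) could be built on exactly
these hooks. A minimal hypothetical sketch, assuming it lives next to
page_pool.c so it can call __page_pool_alloc_pages_slow(); only
struct memory_provider_ops below is from this patch:

struct capped_pool {
	atomic_t	allocated;
	int		max_pages;
};

static int mp_capped_init(struct page_pool *pool)
{
	struct capped_pool *cp = kzalloc(sizeof(*cp), GFP_KERNEL);

	if (!cp)
		return -ENOMEM;
	cp->max_pages = 1024;	/* would come from configuration */
	pool->mp_priv = cp;
	return 0;
}

static void mp_capped_destroy(struct page_pool *pool)
{
	kfree(pool->mp_priv);
}

static struct page *mp_capped_alloc_pages(struct page_pool *pool, gfp_t gfp)
{
	struct capped_pool *cp = pool->mp_priv;

	if (atomic_inc_return(&cp->allocated) > cp->max_pages) {
		atomic_dec(&cp->allocated);
		return NULL;	/* over budget, fail the slow path */
	}
	return __page_pool_alloc_pages_slow(pool, gfp);
}

static bool mp_capped_release_page(struct page_pool *pool, struct page *page)
{
	struct capped_pool *cp = pool->mp_priv;

	atomic_dec(&cp->allocated);
	return true;	/* let the pp do the final put_page() */
}

static const struct memory_provider_ops capped_ops = {
	.init		= mp_capped_init,
	.destroy	= mp_capped_destroy,
	.alloc_pages	= mp_capped_alloc_pages,
	.release_page	= mp_capped_release_page,
};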

[1] https://lore.kernel.org/linux-kernel/ZfegzB341oNc_Ocz@infradead.org/


--
Thanks,
Mina
Christoph Hellwig May 6, 2024, 12:04 p.m. UTC | #3
On Fri, May 03, 2024 at 01:10:44PM -0700, Mina Almasry wrote:
> Is the concern still that folks may be able to hook proprietary stuff
> into this like you mentioned before[1]?

That is one concern.  The other is that people will do stupid stuff
even in tree if you give them enough rope, and they should not have
that rope when the only sensible options are page/folio based kernel
memory (including large/huge folios) and dmabuf.

> cp net/core/page_pool.c net/core/dmabuf_pool.c
> 
> and then modify it such that the net stack maintains 2 page_pools?
> There are a lot of cons to that:

No.  Just have branches for page based vs dmabuf in a few places.
Pavel Begunkov May 7, 2024, 4:05 p.m. UTC | #4
On 5/6/24 13:04, Christoph Hellwig wrote:
> On Fri, May 03, 2024 at 01:10:44PM -0700, Mina Almasry wrote:
>> Is the concern still that folks may be able to hook proprietary stuff
>> into this like you mentioned before[1]?
> 
> That is one concern.  The other is that people will do stupid stuff

We're not now banning all virtual function tables because there is
a chance someone might probably conceivably do perhaps something
proprietary, aren't we? Let's just patch up all ways they might
use it if there is any left.

> even in tree if you give them enough rope, and they should not have
> that rope when the only sensible options are page/folio based kernel
> memory (including large/huge folios) and dmabuf.

I believe there is at least one deep confusion here, considering you
previously mentioned Keith's pre-mapping patches. The "hooks" are not
so much about the format in which you pass memory; that's arguably the
least interesting part for the page pool, more or less it'd circulate
whatever is given. It's more about how to have better control over
buffer lifetime and implement a buffer pool passing data to users and
empty buffers back.
It's different from storage, where a user passes a buffer before it
initiates a read and data is sent by the other socket end. TCP devmem
uses dmabuf, io_uring keeps pages (of normal user memory).


>> cp net/core/page_pool.c net/core/dmabuf_pool.c
>>
>> and then modify it such that the net stack maintains 2 page_pools?
>> There are a lot of cons to that:
> 
> No.  Just have branches for page based vs dmabuf in a few places.
>
Jason Gunthorpe May 7, 2024, 4:18 p.m. UTC | #5
On Tue, May 07, 2024 at 05:05:12PM +0100, Pavel Begunkov wrote:
> > even in tree if you give them enough rope, and they should not have
> > that rope when the only sensible options are page/folio based kernel
> > memory (including large/huge folios) and dmabuf.
> 
> I believe there is at least one deep confusion here, considering you
> previously mentioned Keith's pre-mapping patches. The "hooks" are not
> so much about the format in which you pass memory; that's arguably the
> least interesting part for the page pool, more or less it'd circulate
> whatever is given. It's more about how to have better control over
> buffer lifetime and implement a buffer pool passing data to users and
> empty buffers back.

Isn't that more or less exactly what dmabuf is? Why do you need
another almost dma-buf thing for another project?

Jason
Christoph Hellwig May 7, 2024, 4:23 p.m. UTC | #6
On Tue, May 07, 2024 at 01:18:57PM -0300, Jason Gunthorpe wrote:
> On Tue, May 07, 2024 at 05:05:12PM +0100, Pavel Begunkov wrote:
> > > even in tree if you give them enough rope, and they should not have
> > > that rope when the only sensible options are page/folio based kernel
> > > memory (including large/huge folios) and dmabuf.
> > 
> > I believe there is at least one deep confusion here, considering you
> > previously mentioned Keith's pre-mapping patches. The "hooks" are not
> > so much about the format in which you pass memory; that's arguably the
> > least interesting part for the page pool, more or less it'd circulate
> > whatever is given. It's more about how to have better control over
> > buffer lifetime and implement a buffer pool passing data to users and
> > empty buffers back.
> 
> Isn't that more or less exactly what dmabuf is? Why do you need
> another almost dma-buf thing for another project?

That's the exact point I've been making since the last round of
the series.  We don't need to reinvent dmabuf poorly in every
subsystem, but instead fix the odd parts in it and make it suitable
for everyone.
Mina Almasry May 7, 2024, 4:42 p.m. UTC | #7
On Tue, May 7, 2024 at 9:24 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Tue, May 07, 2024 at 01:18:57PM -0300, Jason Gunthorpe wrote:
> > On Tue, May 07, 2024 at 05:05:12PM +0100, Pavel Begunkov wrote:
> > > > even in tree if you give them enough rope, and they should not have
> > > > that rope when the only sensible options are page/folio based kernel
> > > > memory (including large/huge folios) and dmabuf.
> > >
> > > I believe there is at least one deep confusion here, considering you
> > > previously mentioned Keith's pre-mapping patches. The "hooks" are not
> > > so much about the format in which you pass memory; that's arguably the
> > > least interesting part for the page pool, more or less it'd circulate
> > > whatever is given. It's more about how to have better control over
> > > buffer lifetime and implement a buffer pool passing data to users and
> > > empty buffers back.
> >
> > Isn't that more or less exactly what dmabuf is? Why do you need
> > another almost dma-buf thing for another project?
>
> That's the exact point I've been making since the last round of
> the series.  We don't need to reinvent dmabuf poorly in every
> subsystem, but instead fix the odd parts in it and make it suitable
> for everyone.
>


FWIW the change Christoph is requesting is straightforward from my
POV and doesn't really hurt the devmem use case. I'd basically remove
the ops and add an if statement in the slow path where the ops are
being used to alloc/free from dmabuf instead of alloc_pages().
Something like (very rough, doesn't compile):

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 92be1aaf18ccc..2cc986455bce6 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -557,8 +557,8 @@ netmem_ref page_pool_alloc_netmem(struct page_pool
*pool, gfp_t gfp)
                return netmem;

        /* Slow-path: cache empty, do real allocation */
-       if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops)
-               netmem = pool->mp_ops->alloc_pages(pool, gfp);
+       if (page_pool_is_dmabuf(pool))
+               netmem = mp_dmabuf_devmem_alloc_pages();
        else
                netmem = __page_pool_alloc_pages_slow(pool, gfp);
        return netmem;


The folks that will be negatively impacted by this are
Jakub/Pavel/David. I think all were planning to extend the hooks for
io_uring or other memory types.

Pavel/David, AFAICT you have these options here (but maybe you can
think of more):

1. Align with devmem TCP to use udmabuf for your io_uring memory. I
think in the past you said it's a uapi you don't like but in the face
of this pushback you may want to reconsider.

2. Follow the example of devmem TCP and add another if statement to
alloc from io_uring, so something like:

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 92be1aaf18ccc..3545bb82c7d05 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -557,8 +557,10 @@ netmem_ref page_pool_alloc_netmem(struct
page_pool *pool, gfp_t gfp)
                return netmem;

        /* Slow-path: cache empty, do real allocation */
-       if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops)
-               netmem = pool->mp_ops->alloc_pages(pool, gfp);
+       if (page_pool_is_dmabuf(pool))
+               netmem = mp_dmabuf_devmem_alloc_pages();
+       else if (page_pool_is_io_uring(pool))
+               netmem = mp_io_uring_alloc_pages();
        else
                netmem = __page_pool_alloc_pages_slow(pool, gfp);
        return netmem;

Note that Christoph/Jason may not like you adding non-dmabuf io_uring
backing memory in the first place, so there may be pushback against
this approach.

3. Pushback on the nack on this thread. It seems you're already
discussing this. I'll see what happens.

To be honest the GVE queue-API has just been merged I think, so I'm
now unblocked on sending non-RFCs of this work and I'm hoping to send
the next version soon. I may apply these changes on the next version
for more discussion or leave as is and carry the nack until the
conversation converges.
Jason Gunthorpe May 7, 2024, 4:48 p.m. UTC | #8
On Tue, May 07, 2024 at 09:42:05AM -0700, Mina Almasry wrote:

> 1. Align with devmem TCP to use udmabuf for your io_uring memory. I
> think in the past you said it's a uapi you don't like but in the face
> of this pushback you may want to reconsider.

dmabuf does not force a uapi, you can acquire your pages however you
want and wrap them up in a dmabuf. No uapi at all.

The point is that dmabuf already provides ops that do basically what
is needed here. We don't need ops calling ops just because dmabuf's
ops are not understood or not perfect. Fixup dmabuf.

If io_uring wants to take its existing memory pre-registration it can
wrap that in a dmabuf, and somehow pass it to the netstack. Userspace
doesn't need to know a dmabuf is being used in the background.

Jason
Pavel Begunkov May 7, 2024, 4:55 p.m. UTC | #9
On 5/7/24 17:23, Christoph Hellwig wrote:
> On Tue, May 07, 2024 at 01:18:57PM -0300, Jason Gunthorpe wrote:
>> On Tue, May 07, 2024 at 05:05:12PM +0100, Pavel Begunkov wrote:
>>>> even in tree if you give them enough rope, and they should not have
>>>> that rope when the only sensible options are page/folio based kernel
>>>> memory (including large/huge folios) and dmabuf.
>>>
>>> I believe there is at least one deep confusion here, considering you
>>> previously mentioned Keith's pre-mapping patches. The "hooks" are not
>>> so much about the format in which you pass memory; that's arguably the
>>> least interesting part for the page pool, more or less it'd circulate
>>> whatever is given. It's more about how to have better control over
>>> buffer lifetime and implement a buffer pool passing data to users and
>>> empty buffers back.
>>
>> Isn't that more or less exactly what dmabuf is? Why do you need
>> another almost dma-buf thing for another project?
> 
> That's the exact point I've been making since the last round of
> the series.  We don't need to reinvent dmabuf poorly in every
> subsystem, but instead fix the odd parts in it and make it suitable
> for everyone.

Someone would need to elaborate how dma-buf is like that addition
to page pool infra. The granularity here is usually 4K or less
(hw dictated), and what the user receives cannot be guaranteed to be
contiguous in memory. Having thousands of dma-buf instances is
not an option, so a completion would need to include a range
where data sits. Then who controls lifetime of buffers? If it's
dma-buf, then at least it needs to track what sub-buffers are
handed to the user and what are currently in the kernel. How would it
be accounted? ioctl_return_subrange(dmabuf, [range]) sounds like
a bad idea for performance. To cover user memory it'd also need
to be read from userspace, ioctl here wouldn't be an option, but
let's say it's somehow done in the kernel.

That's not all the list, but in short, even though I haven't been
following dma-buf developments too closely, I have a hard time seeing
how it can be a replacement here.
Mina Almasry May 7, 2024, 5:15 p.m. UTC | #10
On Tue, May 7, 2024 at 9:55 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 5/7/24 17:23, Christoph Hellwig wrote:
> > On Tue, May 07, 2024 at 01:18:57PM -0300, Jason Gunthorpe wrote:
> >> On Tue, May 07, 2024 at 05:05:12PM +0100, Pavel Begunkov wrote:
> > >>>> even in tree if you give them enough rope, and they should not have
> > >>>> that rope when the only sensible options are page/folio based kernel
> > >>>> memory (including large/huge folios) and dmabuf.
> > >>>
> > >>> I believe there is at least one deep confusion here, considering you
> > >>> previously mentioned Keith's pre-mapping patches. The "hooks" are not
> > >>> so much about the format in which you pass memory; that's arguably the
> > >>> least interesting part for the page pool, more or less it'd circulate
> > >>> whatever is given. It's more about how to have better control over
> > >>> buffer lifetime and implement a buffer pool passing data to users and
> > >>> empty buffers back.
> >>
> >> Isn't that more or less exactly what dmabuf is? Why do you need
> >> another almost dma-buf thing for another project?
> >
> > That's the exact point I've been making since the last round of
> > the series.  We don't need to reinvent dmabuf poorly in every
> > subsystem, but instead fix the odd parts in it and make it suitable
> > for everyone.
>
> Someone would need to elaborate how dma-buf is like that addition
> to page pool infra.

I think I understand what Jason is requesting here, and I'll take a
shot at elaborating. AFAICT what he's saying is technically feasible
and addresses the nack while giving you the uapi you want. It just
requires a bit (a lot?) of work on your end unfortunately.

CONFIG_UDMABUF takes in a memfd, converts it to a dmabuf, and returns
it to userspace. See udmabuf_create().

I think what Jason is saying here, is that you can write similar code
to udmabuf_create() that takes in an io_uring memory region, and
converts it to a dmabuf inside the kernel.

I haven't looked at your series too closely yet (sorry!), but I assume
you currently have a netlink API that binds an io_uring memory region
to the NIC rx-queue page_pool, right? That netlink API would need to
be changed to:

1. Take in the io_uring memory.
2. Convert it to a dmabuf like udmabuf_create() does.
3. Bind the resulting dmabuf to the rx-queue page_pool.

There would be more changes needed vis-a-vis the clean up path and
lifetime management, but I think this is the general idea.
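
For step (2), a hedged sketch of the in-kernel wrapping, loosely
modeled on udmabuf_create(); the io_uring-side names here are
placeholders:

static struct dma_buf *io_region_to_dmabuf(struct io_zc_region *reg)
{
	DEFINE_DMA_BUF_EXPORT_INFO(exp_info);

	exp_info.ops  = &io_zc_dmabuf_ops;	/* attach/map/release */
	exp_info.size = reg->nr_pages << PAGE_SHIFT;
	exp_info.priv = reg;			/* pages already pinned */
	return dma_buf_export(&exp_info);
}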

This would give you the uapi you want, while the page_pool never sees
non-dmabuf memory (addresses the nack, I think).
Pavel Begunkov May 7, 2024, 5:17 p.m. UTC | #11
On 5/7/24 17:42, Mina Almasry wrote:
> On Tue, May 7, 2024 at 9:24 AM Christoph Hellwig <hch@infradead.org> wrote:
>>
>> On Tue, May 07, 2024 at 01:18:57PM -0300, Jason Gunthorpe wrote:
>>> On Tue, May 07, 2024 at 05:05:12PM +0100, Pavel Begunkov wrote:
>>>>> even in tree if you give them enough rope, and they should not have
>>>>> that rope when the only sensible options are page/folio based kernel
>>>>> memory (including large/huge folios) and dmabuf.
>>>>
>>>> I believe there is at least one deep confusion here, considering you
>>>> previously mentioned Keith's pre-mapping patches. The "hooks" are not
>>>> so much about the format in which you pass memory; that's arguably the
>>>> least interesting part for the page pool, more or less it'd circulate
>>>> whatever is given. It's more about how to have better control over
>>>> buffer lifetime and implement a buffer pool passing data to users and
>>>> empty buffers back.
>>>
>>> Isn't that more or less exactly what dmabuf is? Why do you need
>>> another almost dma-buf thing for another project?
>>
>> That's the exact point I've been making since the last round of
>> the series.  We don't need to reinvent dmabuf poorly in every
>> subsystem, but instead fix the odd parts in it and make it suitable
>> for everyone.
>>
> 
> 
> FWIW the change Christoph is requesting is straightforward from my
> POV and doesn't really hurt the devmem use case. I'd basically remove
> the ops and add an if statement in the slow path where the ops are
> being used to alloc/free from dmabuf instead of alloc_pages().
> Something like (very rough, doesn't compile):
> 
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 92be1aaf18ccc..2cc986455bce6 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -557,8 +557,8 @@ netmem_ref page_pool_alloc_netmem(struct page_pool
> *pool, gfp_t gfp)
>                  return netmem;
> 
>          /* Slow-path: cache empty, do real allocation */
> -       if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops)
> -               netmem = pool->mp_ops->alloc_pages(pool, gfp);
> +       if (page_pool_is_dmabuf(pool))
> +               netmem = mp_dmabuf_devmem_alloc_pages();
>          else
>                  netmem = __page_pool_alloc_pages_slow(pool, gfp);
>          return netmem;
> 
> 
> The folks that will be negatively impacted by this are
> Jakub/Pavel/David. I think all were planning to extend the hooks for
> io_uring or other memory types.
> 
> Pavel/David, AFAICT you have these options here (but maybe you can
> think of more):
> 
> 1. Align with devmem TCP to use udmabuf for your io_uring memory. I
> think in the past you said it's a uapi you don't like but in the face
> of this pushback you may want to reconsider.

If the argument is that we have to switch to a less efficient
and less consistent api for io_uring (fast path handing used buffers
back to the kernel) just because it has to have dmabuf and without direct
relation to dmabuf, then no, it's not the way anything can be sanely
developed.

> 2. Follow the example of devmem TCP and add another if statement to
> alloc from io_uring, so something like:
> 
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 92be1aaf18ccc..3545bb82c7d05 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -557,8 +557,10 @@ netmem_ref page_pool_alloc_netmem(struct
> page_pool *pool, gfp_t gfp)
>                  return netmem;
> 
>          /* Slow-path: cache empty, do real allocation */
> -       if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops)
> -               netmem = pool->mp_ops->alloc_pages(pool, gfp);
> +       if (page_pool_is_dmabuf(pool))
> +               netmem = mp_dmabuf_devmem_alloc_pages();
> +       else if (page_pool_is_io_uring(pool))
> +               netmem = mp_io_uring_alloc_pages();
>          else
>                  netmem = __page_pool_alloc_pages_slow(pool, gfp);

I don't see why we'd do that instead of having a
well-made function table, which is equivalent.

>          return netmem;
> 
> Note that Christoph/Jason may not like you adding non-dmabuf io_uring
> backing memory in the first place, so there may be pushback against
> this approach.

Christoph mentioned pages, we're using pages, I don't think it's
too fancy. I don't believe that's it, which would be equivalent to
"let's remove user pointers from the kernel and mandate passing
dmabuf only".


> 3. Pushback on the nack on this thread. It seems you're already
> discussing this. I'll see what happens.
> 
> To be honest the GVE queue-API has just been merged I think, so I'm
> now unblocked on sending non-RFCs of this work and I'm hoping to send
> the next version soon. I may apply these changes on the next version
> for more discussion or leave as is and carry the nack until the
> conversation converges.
>
Daniel Vetter May 7, 2024, 5:19 p.m. UTC | #12
On Tue, May 07, 2024 at 01:48:38PM -0300, Jason Gunthorpe wrote:
> On Tue, May 07, 2024 at 09:42:05AM -0700, Mina Almasry wrote:
> 
> > 1. Align with devmem TCP to use udmabuf for your io_uring memory. I
> > think in the past you said it's a uapi you don't like but in the face
> > of this pushback you may want to reconsider.
> 
> dmabuf does not force a uapi, you can acquire your pages however you
> want and wrap them up in a dmabuf. No uapi at all.
> 
> The point is that dmabuf already provides ops that do basically what
> is needed here. We don't need ops calling ops just because dmabuf's
> ops are not understood or not perfect. Fixup dmabuf.
> 
> If io_uring wants to take its existing memory pre-registration it can
> wrap that in a dmabuf, and somehow pass it to the netstack. Userspace
> doesn't need to know a dmabuf is being used in the background.

So roughly the current dma-buf design considerations for the users of the
dma-api interfaces:

- It's a memory buffer of fixed length.

- Either that memory is permanently nailed into place with dma_buf_pin
  (and if we add more users than just drm display then we should probably
  figure out the mlock accounting question for these). For locking hierarchy
  dma_buf_pin uses dma_resv_lock which nests within mmap_sem/vma locks but
  outside of any reclaim/alloc contexts.

- Or the memory is more dynamic, in which case you need to be able to
  dma_resv_lock when you need the memory and make a promise (as a
  dma_fence) that you'll release the memory within finite time and without
  any further allocations once you've unlocked the dma_buf (because
  dma_fence is in GFP_NORECLAIM). That promise can be waiting for memory
  access to finish, but it can also be a pte invalidate+tlb flush, or some
  kind of preemption, or whatever your hw can do really.

  Also, if you do this dynamic model and need to atomically reserve more
  than one dma_buf, you get to do the wait/wound mutex dance, but that's
  really just a bunch of funny looking error handling code and not really
  impacting the overall design or locking hierarchy.

Everything else we can adjust, but I think the above three are not really
changeable or dma-buf becomes unusable for gpu drivers.
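
To make the pinned model concrete, the importer side is roughly the
following sketch; attachment via dma_buf_attach() assumed, error
handling omitted:

	att = dma_buf_attach(dmabuf, dev);
	dma_resv_lock(dmabuf->resv, NULL);
	dma_buf_pin(att);		/* memory nailed into place */
	sgt = dma_buf_map_attachment(att, DMA_BIDIRECTIONAL);
	dma_resv_unlock(dmabuf->resv);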

Note that exporters of dma-buf can pretty much do whatever they feel like,
including rejecting all the generic interfaces/ops, because we also use
dma-buf as userspace handles for some really special memory.
-Sima
Pavel Begunkov May 7, 2024, 5:25 p.m. UTC | #13
On 5/7/24 17:48, Jason Gunthorpe wrote:
> On Tue, May 07, 2024 at 09:42:05AM -0700, Mina Almasry wrote:
> 
>> 1. Align with devmem TCP to use udmabuf for your io_uring memory. I
>> think in the past you said it's a uapi you don't like but in the face
>> of this pushback you may want to reconsider.
> 
> dmabuf does not force a uapi, you can acquire your pages however you
> want and wrap them up in a dmabuf. No uapi at all.
> 
> The point is that dmabuf already provides ops that do basically what
> is needed here. We don't need ops calling ops just because dmabuf's
> ops are not understood or not perfect. Fixup dmabuf.

Those ops, for example, are used to efficiently return used buffers
back to the kernel, which is uapi, I don't see how dmabuf can be
fixed up to cover it.

> If io_uring wants to take its existing memory pre-registration it can
> wrap that in a dmabuf, and somehow pass it to the netstack. Userspace
> doesn't need to know a dmabuf is being used in the background.

io_uring's pre-registered memory is just pages, but even that is
going to be replaced with just a normal user buffer pointer.
Regardless, io_uring can wrap pages into a dmabuf, but it's not
a direct replacement for the ops, it'd mandate a uapi change in an
undesirable way.
Pavel Begunkov May 7, 2024, 5:34 p.m. UTC | #14
On 5/7/24 18:15, Mina Almasry wrote:
> On Tue, May 7, 2024 at 9:55 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> On 5/7/24 17:23, Christoph Hellwig wrote:
>>> On Tue, May 07, 2024 at 01:18:57PM -0300, Jason Gunthorpe wrote:
>>>> On Tue, May 07, 2024 at 05:05:12PM +0100, Pavel Begunkov wrote:
>>>>>> even in tree if you give them enough rope, and they should not have
>>>>>> that rope when the only sensible options are page/folio based kernel
>>>>>> memory (including large/huge folios) and dmabuf.
>>>>>
>>>>> I believe there is at least one deep confusion here, considering you
>>>>> previously mentioned Keith's pre-mapping patches. The "hooks" are not
>>>>> so much about the format in which you pass memory; that's arguably the
>>>>> least interesting part for the page pool, more or less it'd circulate
>>>>> whatever is given. It's more about how to have better control over
>>>>> buffer lifetime and implement a buffer pool passing data to users and
>>>>> empty buffers back.
>>>>
>>>> Isn't that more or less exactly what dmabuf is? Why do you need
>>>> another almost dma-buf thing for another project?
>>>
>>> That's the exact point I've been making since the last round of
>>> the series.  We don't need to reinvent dmabuf poorly in every
>>> subsystem, but instead fix the odd parts in it and make it suitable
>>> for everyone.
>>
>> Someone would need to elaborate how dma-buf is like that addition
>> to page pool infra.
> 
> I think I understand what Jason is requesting here, and I'll take a
> shot at elaborating. AFAICT what he's saying is technically feasible
> and addresses the nack while giving you the uapi you want. It just
> requires a bit (a lot?) of work on your end unfortunately.
> 
> CONFIG_UDMABUF takes in a memfd, converts it to a dmabuf, and returns
> it to userspace. See udmabuf_create().
> 
> I think what Jason is saying here, is that you can write similar code
> to udmabuf_creat() that takes in a io_uring memory region, and
> converts it to a dmabuf inside the kernel.
> 
> I haven't looked at your series yet too closely (sorry!), but I assume
> you currently have a netlink API that binds an io_uring memory region
> to the NIC rx-queue page_pool, right? That netlink API would need to
> be changed to:

No, it's different. I'll skip details, but the main problem is
that those callbacks are used to implement the user api returning
buffers via a ring, where the callback grabs them (in napi context)
and feeds them into the page pool. That replaces SO_DEVMEM_DONTNEED
and the need for ioctl/setsockopt.
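
In shape, something like this (all names hypothetical, just sketching
the above; not code from the actual series):

static struct page *mp_zc_alloc_pages(struct page_pool *pool, gfp_t gfp)
{
	struct zc_ifq *ifq = pool->mp_priv;
	u32 tail = smp_load_acquire(&ifq->rq_ring->tail);

	/* napi context: we're the only consumer, so buffers the user
	 * returned via the shared refill ring can be taken locklessly.
	 */
	if (ifq->cached_head == tail)
		return NULL;
	return zc_get_buf_page(ifq, ifq->cached_head++);
}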

> 1. Take in the io_uring memory.
> 2. Convert it to a dmabuf like udmabuf_create() does.
> 3. Bind the resulting dmabuf to the rx-queue page_pool.
> 
> There would be more changes needed vis-a-vis the clean up path and
> lifetime management, but I think this is the general idea.
> 
> This would give you the uapi you want, while the page_pool never seen
> non-dmabuf memory (addresses the nack, I think).
Jason Gunthorpe May 7, 2024, 5:56 p.m. UTC | #15
On Tue, May 07, 2024 at 06:25:52PM +0100, Pavel Begunkov wrote:
> On 5/7/24 17:48, Jason Gunthorpe wrote:
> > On Tue, May 07, 2024 at 09:42:05AM -0700, Mina Almasry wrote:
> > 
> > > 1. Align with devmem TCP to use udmabuf for your io_uring memory. I
> > > think in the past you said it's a uapi you don't like but in the face
> > > of this pushback you may want to reconsider.
> > 
> > dmabuf does not force a uapi, you can acquire your pages however you
> > want and wrap them up in a dmabuf. No uapi at all.
> > 
> > The point is that dmabuf already provides ops that do basically what
> > is needed here. We don't need ops calling ops just because dmabuf's
> > ops are not understood or not perfect. Fixup dmabuf.
> 
> Those ops, for example, are used to efficiently return used buffers
> back to the kernel, which is uapi, I don't see how dmabuf can be
> fixed up to cover it.

Sure, but that doesn't mean you can't use dma buf for the other parts
of the flow. The per-page lifetime is a different topic than the
refcounting and access of the entire bulk of memory.

Jason
Pavel Begunkov May 7, 2024, 7:35 p.m. UTC | #16
On 5/7/24 18:56, Jason Gunthorpe wrote:
> On Tue, May 07, 2024 at 06:25:52PM +0100, Pavel Begunkov wrote:
>> On 5/7/24 17:48, Jason Gunthorpe wrote:
>>> On Tue, May 07, 2024 at 09:42:05AM -0700, Mina Almasry wrote:
>>>
>>>> 1. Align with devmem TCP to use udmabuf for your io_uring memory. I
>>>> think in the past you said it's a uapi you don't like but in the face
>>>> of this pushback you may want to reconsider.
>>>
>>> dmabuf does not force a uapi, you can acquire your pages however you
>>> want and wrap them up in a dmabuf. No uapi at all.
>>>
>>> The point is that dmabuf already provides ops that do basically what
>>> is needed here. We don't need ops calling ops just because dmabuf's
>>> ops are not understood or not perfect. Fixup dmabuf.
>>
>> Those ops, for example, are used to efficiently return used buffers
>> back to the kernel, which is uapi, I don't see how dmabuf can be
>> fixed up to cover it.
> 
> Sure, but that doesn't mean you can't use dma buf for the other parts
> of the flow. The per-page lifetime is a different topic than the
> refcounting and access of the entire bulk of memory.

Ok, so if we're leaving uapi (and ops) and keep per page/sub-buffer as
is, the rest is resolving uptr -> pages, and passing it to page pool in
a convenient to page pool format (net_iov). I don't see how dmabuf would
help here. Adding dmabuf in the middle (internally wrapping pages) would
add more setup code with the same final result, that is a format that
page pool can work with. And for io_uring it's normal user memory. We'll
have to use dmabuf when we'd want to extend to peer-to-peer and all that
fun, but that's a small fraction of it, and we'll hopefully reuse some
setup helpers from devmem tcp.
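
The uptr -> pages step itself is just a long-term pin, e.g. this
sketch, mirroring what io_uring does for registered buffers:

	pages = kvmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return -ENOMEM;
	pinned = pin_user_pages_fast(uptr, nr_pages,
				     FOLL_WRITE | FOLL_LONGTERM, pages);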
Jason Gunthorpe May 7, 2024, 11:32 p.m. UTC | #17
On Tue, May 07, 2024 at 08:35:37PM +0100, Pavel Begunkov wrote:
> On 5/7/24 18:56, Jason Gunthorpe wrote:
> > On Tue, May 07, 2024 at 06:25:52PM +0100, Pavel Begunkov wrote:
> > > On 5/7/24 17:48, Jason Gunthorpe wrote:
> > > > On Tue, May 07, 2024 at 09:42:05AM -0700, Mina Almasry wrote:
> > > > 
> > > > > 1. Align with devmem TCP to use udmabuf for your io_uring memory. I
> > > > > think in the past you said it's a uapi you don't like but in the face
> > > > > of this pushback you may want to reconsider.
> > > > 
> > > > dmabuf does not force a uapi, you can acquire your pages however you
> > > > want and wrap them up in a dmabuf. No uapi at all.
> > > > 
> > > > The point is that dmabuf already provides ops that do basically what
> > > > is needed here. We don't need ops calling ops just because dmabuf's
> > > > ops are not understood or not perfect. Fixup dmabuf.
> > > 
> > > Those ops, for example, are used to efficiently return used buffers
> > > back to the kernel, which is uapi, I don't see how dmabuf can be
> > > fixed up to cover it.
> > 
> > Sure, but that doesn't mean you can't use dma buf for the other parts
> > of the flow. The per-page lifetime is a different topic than the
> > refcounting and access of the entire bulk of memory.
> 
> Ok, so if we're leaving uapi (and ops) and keep per page/sub-buffer as
> is, the rest is resolving uptr -> pages, and passing it to page pool in
> a convenient to page pool format (net_iov).

I'm not going to pretend to know about page pool details, but dmabuf
is the way to get the bulk of pages into a pool within the net stack's
allocator and keep that bulk properly refcounted while in use.

An object like dmabuf is needed for the general case because there are
not going to be per-page references or otherwise available.

What you seem to want is to alter how the actual allocation flow works
from that bulk of memory and delay the free. It seems like a different
topic to me, and honestly hacking into the allocator free function
seems a bit weird.

Jason
Daniel Vetter May 8, 2024, 7:16 a.m. UTC | #18
On Tue, May 07, 2024 at 08:32:47PM -0300, Jason Gunthorpe wrote:
> On Tue, May 07, 2024 at 08:35:37PM +0100, Pavel Begunkov wrote:
> > On 5/7/24 18:56, Jason Gunthorpe wrote:
> > > On Tue, May 07, 2024 at 06:25:52PM +0100, Pavel Begunkov wrote:
> > > > On 5/7/24 17:48, Jason Gunthorpe wrote:
> > > > > On Tue, May 07, 2024 at 09:42:05AM -0700, Mina Almasry wrote:
> > > > > 
> > > > > > 1. Align with devmem TCP to use udmabuf for your io_uring memory. I
> > > > > > think in the past you said it's a uapi you don't like but in the face
> > > > > > of this pushback you may want to reconsider.
> > > > > 
> > > > > dmabuf does not force a uapi, you can acquire your pages however you
> > > > > want and wrap them up in a dmabuf. No uapi at all.
> > > > > 
> > > > > The point is that dmabuf already provides ops that do basically what
> > > > > is needed here. We don't need ops calling ops just because dmabuf's
> > > > > ops are not understood or not perfect. Fixup dmabuf.
> > > > 
> > > > Those ops, for example, are used to efficiently return used buffers
> > > > back to the kernel, which is uapi, I don't see how dmabuf can be
> > > > fixed up to cover it.
> > > 
> > > Sure, but that doesn't mean you can't use dma buf for the other parts
> > > of the flow. The per-page lifetime is a different topic than the
> > > refcounting and access of the entire bulk of memory.
> > 
> > Ok, so if we're leaving uapi (and ops) and keep per page/sub-buffer as
> > is, the rest is resolving uptr -> pages, and passing it to page pool in
> > a convenient to page pool format (net_iov).
> 
> I'm not going to pretend to know about page pool details, but dmabuf
> is the way to get the bulk of pages into a pool within the net stack's
> allocator and keep that bulk properly refcounted while in use.
> 
> An object like dmabuf is needed for the general case because there are
> not going to be per-page references or otherwise available.
> 
> What you seem to want is to alter how the actual allocation flow works
> from that bulk of memory and delay the free. It seems like a different
> topic to me, and honestly hacking into the allocator free function
> seems a bit weird.

Also I don't see how it's an argument against dma-buf as the interface for
all these, because e.g. ttm internally does have a page pool because
depending upon allocator, that's indeed beneficial. Other drm drivers have
more buffer-based concepts for opportunistically keeping memory around, usually
by marking buffers that are just kept as cache as purgeable (which is a
concept that goes all the way to opengl/vulkan).

But these are all internals of the dma-buf exporter, the dma-buf api users
don't ever need to care.
-Sima
Pavel Begunkov May 8, 2024, 11:30 a.m. UTC | #19
On 5/8/24 00:32, Jason Gunthorpe wrote:
> On Tue, May 07, 2024 at 08:35:37PM +0100, Pavel Begunkov wrote:
>> On 5/7/24 18:56, Jason Gunthorpe wrote:
>>> On Tue, May 07, 2024 at 06:25:52PM +0100, Pavel Begunkov wrote:
>>>> On 5/7/24 17:48, Jason Gunthorpe wrote:
>>>>> On Tue, May 07, 2024 at 09:42:05AM -0700, Mina Almasry wrote:
>>>>>
>>>>>> 1. Align with devmem TCP to use udmabuf for your io_uring memory. I
>>>>>> think in the past you said it's a uapi you don't like but in the face
>>>>>> of this pushback you may want to reconsider.
>>>>>
>>>>> dmabuf does not force a uapi, you can acquire your pages however you
>>>>> want and wrap them up in a dmabuf. No uapi at all.
>>>>>
>>>>> The point is that dmabuf already provides ops that do basically what
>>>>> is needed here. We don't need ops calling ops just because dmabuf's
>>>>> ops are not understood or not perfect. Fixup dmabuf.
>>>>
>>>> Those ops, for example, are used to efficiently return used buffers
>>>> back to the kernel, which is uapi, I don't see how dmabuf can be
>>>> fixed up to cover it.
>>>
>>> Sure, but that doesn't mean you can't use dma buf for the other parts
>>> of the flow. The per-page lifetime is a different topic than the
>>> refcounting and access of the entire bulk of memory.
>>
>> Ok, so if we're leaving uapi (and ops) and keep per page/sub-buffer as
>> is, the rest is resolving uptr -> pages, and passing it to page pool in
>> a convenient to page pool format (net_iov).
> 
> I'm not going to pretend to know about page pool details, but dmabuf
> is the way to get the bulk of pages into a pool within the net stack's
> allocator and keep that bulk properly refcounted while in use.
> 
> An object like dmabuf is needed for the general case because there are
> not going to be per-page references or otherwise available.

They are already pinned, memory is owned by the provider, io_uring
in this case, and it should not be freed circumventing io_uring,
and at this stage calling release_pages() is not such a hassle,
especially compared to introducing an additional object.

My question is how having an intermediary dmabuf benefits the net
stack or io_uring? For now IMO it doesn't solve anything but adds
extra complexity. Adding dmabuf for the sake of adding dmabuf is
not a great choice.

> What you seem to want is to alter how the actual allocation flow works
> from that bulk of memory and delay the free. It seems like a different
For people who jumped here without looking at what this patchset is
about, that's the entire point of the io_uring zero copy approach
as well as this set. Instead of using kernel private pages that you
have no other option but to copy/mmap (and then free), it hands
buffers to the user while using memory accessible/visible in some
way by the user.

That "delay free" is taking a reference while user is reading data
(slightly different for devmem tcp). And note, it's not a page/dmabuf
reference, kernel can forcibly take it back and release pages.

> topic to me, and honestly hacking into the allocator free function
> seems a bit weird.

Do you also think that DMA_BUF_IOCTL_SYNC is a weird hack, because
it "delays free" by pinning the dmabuf object and letting the user
read memory instead of copying it? I can find many examples.
Pavel Begunkov May 8, 2024, 11:35 a.m. UTC | #20
On 5/8/24 08:16, Daniel Vetter wrote:
> On Tue, May 07, 2024 at 08:32:47PM -0300, Jason Gunthorpe wrote:
>> On Tue, May 07, 2024 at 08:35:37PM +0100, Pavel Begunkov wrote:
>>> On 5/7/24 18:56, Jason Gunthorpe wrote:
>>>> On Tue, May 07, 2024 at 06:25:52PM +0100, Pavel Begunkov wrote:
>>>>> On 5/7/24 17:48, Jason Gunthorpe wrote:
>>>>>> On Tue, May 07, 2024 at 09:42:05AM -0700, Mina Almasry wrote:
>>>>>>
>>>>>>> 1. Align with devmem TCP to use udmabuf for your io_uring memory. I
>>>>>>> think in the past you said it's a uapi you don't like but in the face
>>>>>>> of this pushback you may want to reconsider.
>>>>>>
>>>>>> dmabuf does not force a uapi, you can acquire your pages however you
>>>>>> want and wrap them up in a dmabuf. No uapi at all.
>>>>>>
>>>>>> The point is that dmabuf already provides ops that do basically what
>>>>>> is needed here. We don't need ops calling ops just because dmabuf's
>>>>>> ops are not understood or not perfect. Fixup dmabuf.
>>>>>
>>>>> Those ops, for example, are used to efficiently return used buffers
>>>>> back to the kernel, which is uapi, I don't see how dmabuf can be
>>>>> fixed up to cover it.
>>>>
>>>> Sure, but that doesn't mean you can't use dma buf for the other parts
>>>> of the flow. The per-page lifetime is a different topic than the
>>>> refcounting and access of the entire bulk of memory.
>>>
>>> Ok, so if we're leaving uapi (and ops) and keep per page/sub-buffer as
>>> is, the rest is resolving uptr -> pages, and passing it to page pool in
>>> a convenient to page pool format (net_iov).
>>
>> I'm not going to pretend to know about page pool details, but dmabuf
>> is the way to get the bulk of pages into a pool within the net stack's
>> allocator and keep that bulk properly refcounted while in use.
>>
>> An object like dmabuf is needed for the general case because there are
>> not going to be per-page references or otherwise available.
>>
>> What you seem to want is to alter how the actual allocation flow works
>> from that bulk of memory and delay the free. It seems like a different
>> topic to me, and honestly hacking into the allocator free function
>> seems a bit weird.
> 
> Also I don't see how it's an argument against dma-buf as the interface for

It's not, nor did I say it is, but it is an argument against removing
the network's page pool ops.

> all these, because e.g. ttm internally does have a page pool because
> depending upon allocator, that's indeed beneficial. Other drm drivers have
> more buffer-based concepts for opportunistically keeping memory around, usually
> by marking buffers that are just kept as cache as purgeable (which is a
> concept that goes all the way to opengl/vulkan).

Because in this case it solves nothing and helps with nothing, quite
the opposite. Just as well we can ask why NVMe doesn't wrap user pages
into a dmabuf while doing IO.

> But these are all internals of the dma-buf exporter, the dma-buf api users
> don't ever need to care.
> -Sima
Jason Gunthorpe May 8, 2024, 2:25 p.m. UTC | #21
On Wed, May 08, 2024 at 12:30:07PM +0100, Pavel Begunkov wrote:

> > I'm not going to pretend to know about page pool details, but dmabuf
> > is the way to get the bulk of pages into a pool within the net stack's
> > allocator and keep that bulk properly refcounted while in use.
> >
> > An object like
> > dmabuf is needed for the general case because there are
> > not going to be per-page references or otherwise available.
> 
> They are already pinned, memory is owned by the provider, io_uring
> in this case, and it should not be freed circumventing io_uring,
> and at this stage calling release_pages() is not such a hassle,
> especially compared to introducing an additional object.

Something needs to co-ordinate when the net stack's allocator is done
with the bulk of pages and when io_uring can do the final
put_user_page() to free it. DMABUF is not an unreasonable choice for
this.

> > topic to me, and honestly hacking into the allocator free function
> > seems a bit weird..
> 
> Do you also think that DMA_BUF_IOCTL_SYNC is a weird hack, because
> it "delays free" by pinning the dmabuf object and letting the user
> read memory instead of copying it? I can find many examples.

It seems to me the flow you want is for the driver to allocate a page,
put it on an rx ring, process it through the netstack, and deliver it
to io_uring. io_uring would then sit on the allocation until userspace
is done and return it back to the netstack allocator.

Hooking the free of the netstack allocator and then deferring it seems
like a weird and indirect way to get there. Why can't io_uring just be
the entity that does the final free and not mess with the
allocator logic?

Jason
Daniel Vetter May 8, 2024, 3:34 p.m. UTC | #22
On Wed, May 08, 2024 at 12:35:52PM +0100, Pavel Begunkov wrote:
> On 5/8/24 08:16, Daniel Vetter wrote:
> > On Tue, May 07, 2024 at 08:32:47PM -0300, Jason Gunthorpe wrote:
> > > On Tue, May 07, 2024 at 08:35:37PM +0100, Pavel Begunkov wrote:
> > > > On 5/7/24 18:56, Jason Gunthorpe wrote:
> > > > > On Tue, May 07, 2024 at 06:25:52PM +0100, Pavel Begunkov wrote:
> > > > > > On 5/7/24 17:48, Jason Gunthorpe wrote:
> > > > > > > On Tue, May 07, 2024 at 09:42:05AM -0700, Mina Almasry wrote:
> > > > > > > 
> > > > > > > > 1. Align with devmem TCP to use udmabuf for your io_uring memory. I
> > > > > > > > think in the past you said it's a uapi you don't like but in the face
> > > > > > > > of this pushback you may want to reconsider.
> > > > > > > 
> > > > > > > dmabuf does not force a uapi, you can acquire your pages however you
> > > > > > > want and wrap them up in a dmabuf. No uapi at all.
> > > > > > > 
> > > > > > > The point is that dmabuf already provides ops that do basically what
> > > > > > > is needed here. We don't need ops calling ops just because dmabuf's
> > > > > > > ops are not understood or not perfect. Fixup dmabuf.
> > > > > > 
> > > > > > Those ops, for example, are used to efficiently return used buffers
> > > > > > back to the kernel, which is uapi, I don't see how dmabuf can be
> > > > > > fixed up to cover it.
> > > > > 
> > > > > Sure, but that doesn't mean you can't use dma buf for the other parts
> > > > > of the flow. The per-page lifetime is a different topic than the
> > > > > refcounting and access of the entire bulk of memory.
> > > > 
> > > > Ok, so if we're leaving uapi (and ops) and keep per page/sub-buffer as
> > > > is, the rest is resolving uptr -> pages, and passing it to page pool in
> > > > a convenient to page pool format (net_iov).
> > > 
> > > I'm not going to pretend to know about page pool details, but dmabuf
> > > is the way to get the bulk of pages into a pool within the net stack's
> > > allocator and keep that bulk properly refcounted while in use.
> > > 
> > > An object like dmabuf is needed for the general case because there are
> > > not going to be per-page references or otherwise available.
> > > 
> > > What you seem to want is to alter how the actual allocation flow works
> > > from that bulk of memory and delay the free. It seems like a different
> > > topic to me, and honestly hacking into the allocator free function
> > > seems a bit weird.
> > 
> > Also I don't see how it's an argument against dma-buf as the interface for
> 
> It's not, nor did I say it is, but it is an argument against removing
> the network's page pool ops.
> 
> > all these, because e.g. ttm internally does have a page pool because
> > depending upon allocator, that's indeed beneficial. Other drm drivers have
> > more buffer-based concepts for opportunistically keeping memory around, usually
> > by marking buffers that are just kept as cache as purgeable (which is a
> > concept that goes all the way to opengl/vulkan).
> 
> Because in this case it solves nothing and helps with nothing, quite
> the opposite. Just as well we can ask why NVMe doesn't wrap user pages
> into a dmabuf while doing IO.

Because the rules around memory reclaim, gfp nesting and guaranteed
forward progress don't match up for block i/o. I looked quite a bit into
gluing direct i/o into dma-buf because there's vulkan extensions for that,
and it's an absolute mess.
-Sima

> 
> > But these are all internals of the dma-buf exporter, the dma-buf api users
> > don't ever need to care.
> > -Sima
> 
> -- 
> Pavel Begunkov
Pavel Begunkov May 8, 2024, 3:44 p.m. UTC | #23
On 5/8/24 15:25, Jason Gunthorpe wrote:
> On Wed, May 08, 2024 at 12:30:07PM +0100, Pavel Begunkov wrote:
> 
>>> I'm not going to pretend to know about page pool details, but dmabuf
>>> is the way to get the bulk of pages into a pool within the net stack's
>>> allocator and keep that bulk properly refcounted while in use.
>>>
>>> An object like
>>> dmabuf is needed for the general case because there are
>>> not going to be per-page references or otherwise available.
>>
>> They are already pinned, memory is owned by the provider, io_uring
>> in this case, and it should not be freed circumventing io_uring,
>> and at this stage calling release_pages() is not such a hassle,
>> especially compared to introducing an additional object.
> 
> Something needs to co-ordinate when the net stack's allocator is done
> with the bulk of pages and when io_uring can do the final
> put_user_page() to free it. DMABUF is not an unreasonable choice for
> this.

When a page pool dies, it notifies io_uring via the ->destroy callback.
Vice versa, when io_uring wants to terminate zerocopy, it releases
the interface queue, which kills the page pool, ending in
->destroy again.
  
>>> topic to me, and honestly hacking into the allocator free function
>>> seems a bit weird.
>>
>> Do you also think that DMA_BUF_IOCTL_SYNC is a weird hack, because
>> it "delays free" by pinning the dmabuf object and letting the user
>> read memory instead of copying it? I can find many examples.
> 
> It seems to me the flow you want is for the driver to allocate a page,
> put it on an rx ring, process it through the netstack, and deliver it
> to io_uring. io_uring would then sit on the allocation until userspace
> is done and return it back to the netstack allocator.

That's right, with a note that "driver allocating a page" is
not an alloc_page() but grabbing a user page/frag that the user
registered beforehand.

> Hooking the free of the netstack allocator and then deferring it seems

FWIW, it's not about page pool's ->release_page, it's a slow
path and in an ideal world wouldn't be called outside of tear
down.

> like a weird and indirect way to get there. Why can't io_uring just be
> the entity that does the final free and not mess with the
> allocator logic?

Then the user has to do a syscall (e.g. via io_uring) to return pages,
and there we'd need to care how to put the pages efficiently, i.e.
hitting the page pool's fast path, e.g. by hoping napi is scheduled and
scheduled for the CPU we're running on, or maybe transferring the pages
to the right CPU first.

Compare it with userspace putting pages into a ring, and the allocator
taking from there when needed without any extra synchronisation and
hassle just because it's a sole consumer.
Christoph Hellwig May 8, 2024, 3:51 p.m. UTC | #24
On Wed, May 08, 2024 at 12:35:52PM +0100, Pavel Begunkov wrote:
> > all these, because e.g. ttm internally does have a page pool because
> > depending upon allocator, that's indeed beneficial. Other drm drivers have
> > more buffer-based concepts for opportunistically keeping memory around, usually
> > by marking buffers that are just kept as cache as purgeable (which is a
> > concept that goes all the way to opengl/vulkan).
> 
> Because in this case it solves nothing and helps with nothing, quite
> the opposite. Just as well we can ask why NVMe doesn't wrap user pages
> into a dmabuf while doing IO.

You seem to have confused totally unrelated things.

For short-term pins, that is pin_user_pages without FOLL_LONGTERM, there
would never be any point in using a dmabuf as the pin is transient.  For
long-term pins, dmabufs in the block layer absolutely make sense, and I
wish io_uring would have just implemented them from the start instead of
the current fixed buffers that are not quite as useful by not
pre-mapping DMA and not supporting P2P.
Jason Gunthorpe May 8, 2024, 3:58 p.m. UTC | #25
On Wed, May 08, 2024 at 04:44:32PM +0100, Pavel Begunkov wrote:

> > like a weird and indirect way to get there. Why can't io_uring just be
> > the entity that does the final free and not mess with the
> > allocator logic?
> 
> Then the user has to do a syscall (e.g. via io_uring) to return pages,
> and there we'd need to care how to put the pages efficiently, i.e.
> hitting the page pool's fast path, e.g. by hoping napi is scheduled and
> scheduled for the CPU we're running on, or maybe transferring the pages
> to the right CPU first.
> 
> Compare it with userspace putting pages into a ring, and the allocator
> taking from there when needed without any extra synchronisation and
> hassle just because it's a sole consumer.

Wow, that sounds a bit terrifying for security, but I guess I can see
your point.

You are replacing the whole allocator logic if you are effectively
putting the free list in userspace memory.

Jason
Pavel Begunkov May 8, 2024, 4:13 p.m. UTC | #26
On 5/8/24 16:58, Jason Gunthorpe wrote:
> On Wed, May 08, 2024 at 04:44:32PM +0100, Pavel Begunkov wrote:
> 
>>> like a weird and indirect way to get there. Why can't io_uring just be
>>> the entity that does the final free and not mess with the
>>> allocator logic?
>>
>> Then the user has to do a syscall (e.g. via io_uring) to return pages,
>> and there we'd need to care how to put the pages efficiently, i.e.
>> hitting the page pool's fast path, e.g. by hoping napi is scheduled and
>> scheduled for the CPU we're running on, or maybe transferring the pages
>> to the right CPU first.
>>
>> Compare it with userspace putting pages into a ring, and the allocator
>> taking from there when needed without any extra synchronisation and
>> hassle just because it's a sole consumer.
> 
> Wow, that sounds a bit terrifying for security, but I guess I can see
> your point.

Mind elaborating about security? "No synchronisation" is for grabbing
from the ring, it's napi exclusive, but it does refcounting to make sure
there are no previous net users left and the userspace doesn't try
anything funny like returning a page twice. And it's not even a page
but rather a separately refcounted buffer represented by an offset
from the userspace POV. It doesn't even have to be page sized, hw
benefits from smaller chunks.

> You are replacing the whole allocator logic if you are effectively
> putting the free list in userspace memory.
> 
> Jason
Pavel Begunkov May 8, 2024, 5:02 p.m. UTC | #27
On 5/8/24 16:51, Christoph Hellwig wrote:
> On Wed, May 08, 2024 at 12:35:52PM +0100, Pavel Begunkov wrote:
>>> all these, because e.g. ttm internally does have a page pool because
>>> depending upon allocator, that's indeed beneficial. Other drm drivers have
>>> more buffer-based concepts for opportunistically keeping memory around, usually
>>> by marking buffers that are just kept as cache as purgeable (which is a
>>> concept that goes all the way to opengl/vulkan).
>>
>> Because in this case it solves nothing and helps with nothing, quite
>> the opposite. Just as well we can ask why NVMe doesn't wrap user pages
>> into a dmabuf while doing IO.
> 
> You seem to have confused totally unrelated things.
>
> For short-term pins, that is pin_user_pages without FOLL_LONGTERM, there
> would never be any point in using a dmabuf as the pin is transient.  For
> long-term pins, dmabufs in the block layer absolutely make sense, and I

Well, the example fell flat, but you don't use dmabuf when there are
no upsides from using it. For instance, when you already have pinned
pages, you're going to use pages, and there are no other refcounting
concerns. Unless there is an advantage of dmabufs over FOLL_LONGTERM
that I don't know about when used with normal user pages.

> wish io_uring would have just implemented them from the start instead of
> the current fixed buffers that are not quite as useful by not
> pre-mapping DMA and not supporting P2P.

fdget(dmabuf) would be horrible, I assume that's not the suggestion.
But then it's really about kernel internals, and theoretically can
be patched to wrap user pages into a dmabuf and pass it in. The tricky
part is how that "pass it in" should look like. Keith tried to cover
both pre-mapping and p2p by registering dmabuf and passing the mapped
addresses in an iter IIRC.

Anyway, this discussion should better move to block/fs lists,
if there is any interest.
Christoph Hellwig May 9, 2024, 4:49 a.m. UTC | #28
On Wed, May 08, 2024 at 06:02:14PM +0100, Pavel Begunkov wrote:
> Well, the example fell flat, but you don't use dmabuf when there are
> no upsides from using it. For instance, when you already have pinned
> pages, you're going to use pages, and there are no other refcounting
> concerns.

Sure.

> Unless there is an advantage of dmabufs over FOLL_LONGTERM
> that I don't know about when used with normal user pages.

The advantages of using a dma-buf over FOLL_LONGTERM are:

 a) you pre-dma map, which is a significant performance advantage for
    IOMMU-based setups
 b) you support any dma-buf exporter and not just user memory.  This
    is primarily important for PCIe P2P, but there might be other
    useful exporters as well
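
For (a), the importer maps once at setup and then reuses the sg_table
for every I/O; a sketch, assuming the _unlocked attachment helpers:

	/* once, at buffer registration */
	sgt = dma_buf_map_attachment_unlocked(attach, DMA_BIDIRECTIONAL);

	/* ...many I/Os reuse sgt, no per-request dma_map_* calls... */

	/* once, at deregistration */
	dma_buf_unmap_attachment_unlocked(attach, sgt, DMA_BIDIRECTIONAL);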

> > wish io_uring would have just implemented them from the start instead of
> > the current fixed buffers that are not quite as useful by not
> > pre-mapping DMA and not supporting P2P.
> 
> fdget(dmabuf) would be horrible, I assume that's not the suggestion.

I'm not even sure what you mean by that.

Patch

diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index 5e43a08d3231..ffe5f31fb0da 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -52,6 +52,7 @@  struct pp_alloc_cache {
  * @dev:	device, for DMA pre-mapping purposes
  * @netdev:	netdev this pool will serve (leave as NULL if none or multiple)
  * @napi:	NAPI which is the sole consumer of pages, otherwise NULL
+ * @queue:	struct netdev_rx_queue this page_pool is being created for.
  * @dma_dir:	DMA mapping direction
  * @max_len:	max DMA sync memory size for PP_FLAG_DMA_SYNC_DEV
  * @offset:	DMA sync address offset for PP_FLAG_DMA_SYNC_DEV
@@ -64,6 +65,7 @@  struct page_pool_params {
 		int		nid;
 		struct device	*dev;
 		struct napi_struct *napi;
+		struct netdev_rx_queue *queue;
 		enum dma_data_direction dma_dir;
 		unsigned int	max_len;
 		unsigned int	offset;
@@ -126,6 +128,13 @@  struct page_pool_stats {
 };
 #endif
 
+struct memory_provider_ops {
+	int (*init)(struct page_pool *pool);
+	void (*destroy)(struct page_pool *pool);
+	struct page *(*alloc_pages)(struct page_pool *pool, gfp_t gfp);
+	bool (*release_page)(struct page_pool *pool, struct page *page);
+};
+
 struct page_pool {
 	struct page_pool_params_fast p;
 
@@ -176,6 +185,9 @@  struct page_pool {
 	 */
 	struct ptr_ring ring;
 
+	void *mp_priv;
+	const struct memory_provider_ops *mp_ops;
+
 #ifdef CONFIG_PAGE_POOL_STATS
 	/* recycle stats are per-cpu to avoid locking */
 	struct page_pool_recycle_stats __percpu *recycle_stats;
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index dd364d738c00..795b7ff1c01f 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -25,6 +25,8 @@ 
 
 #include "page_pool_priv.h"
 
+static DEFINE_STATIC_KEY_FALSE(page_pool_mem_providers);
+
 #define DEFER_TIME (msecs_to_jiffies(1000))
 #define DEFER_WARN_INTERVAL (60 * HZ)
 
@@ -177,6 +179,7 @@  static int page_pool_init(struct page_pool *pool,
 			  int cpuid)
 {
 	unsigned int ring_qsize = 1024; /* Default */
+	int err;
 
 	memcpy(&pool->p, &params->fast, sizeof(pool->p));
 	memcpy(&pool->slow, &params->slow, sizeof(pool->slow));
@@ -248,10 +251,25 @@  static int page_pool_init(struct page_pool *pool,
 	/* Driver calling page_pool_create() also call page_pool_destroy() */
 	refcount_set(&pool->user_cnt, 1);
 
+	if (pool->mp_ops) {
+		err = pool->mp_ops->init(pool);
+		if (err) {
+			pr_warn("%s() mem-provider init failed %d\n", __func__,
+				err);
+			goto free_ptr_ring;
+		}
+
+		static_branch_inc(&page_pool_mem_providers);
+	}
+
 	if (pool->p.flags & PP_FLAG_DMA_MAP)
 		get_device(pool->p.dev);
 
 	return 0;
+
+free_ptr_ring:
+	ptr_ring_cleanup(&pool->ring, NULL);
+	return err;
 }
 
 static void page_pool_uninit(struct page_pool *pool)
@@ -546,7 +564,10 @@  struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp)
 		return page;
 
 	/* Slow-path: cache empty, do real allocation */
-	page = __page_pool_alloc_pages_slow(pool, gfp);
+	if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops)
+		page = pool->mp_ops->alloc_pages(pool, gfp);
+	else
+		page = __page_pool_alloc_pages_slow(pool, gfp);
 	return page;
 }
 EXPORT_SYMBOL(page_pool_alloc_pages);
@@ -603,10 +624,13 @@  void __page_pool_release_page_dma(struct page_pool *pool, struct page *page)
 void page_pool_return_page(struct page_pool *pool, struct page *page)
 {
 	int count;
+	bool put;
 
-	__page_pool_release_page_dma(pool, page);
-
-	page_pool_clear_pp_info(page);
+	put = true;
+	if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops)
+		put = pool->mp_ops->release_page(pool, page);
+	else
+		__page_pool_release_page_dma(pool, page);
 
 	/* This may be the last page returned, releasing the pool, so
 	 * it is not safe to reference pool afterwards.
@@ -614,7 +638,10 @@  void page_pool_return_page(struct page_pool *pool, struct page *page)
 	count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt);
 	trace_page_pool_state_release(pool, page, count);
 
-	put_page(page);
+	if (put) {
+		page_pool_clear_pp_info(page);
+		put_page(page);
+	}
 	/* An optimization would be to call __free_pages(page, pool->p.order)
 	 * knowing page is not part of page-cache (thus avoiding a
 	 * __page_cache_release() call).
@@ -889,6 +916,12 @@  static void __page_pool_destroy(struct page_pool *pool)
 
 	page_pool_unlist(pool);
 	page_pool_uninit(pool);
+
+	if (pool->mp_ops) {
+		pool->mp_ops->destroy(pool);
+		static_branch_dec(&page_pool_mem_providers);
+	}
+
 	kfree(pool);
 }