[v2,0/2] iommu: Allow passing custom allocators to pgtable drivers

Message ID 20231110094352.565347-1-boris.brezillon@collabora.com (mailing list archive)
Series iommu: Allow passing custom allocators to pgtable drivers

Message

Boris Brezillon Nov. 10, 2023, 9:43 a.m. UTC
Hello,

This patchset is an attempt at making page table allocation
customizable. This is useful to some GPU drivers for various reasons:

- speed-up upcoming page table allocations by managing a pool of free
  pages
- batch page table allocation instead of allocating one page at a time
- pre-reserve pages for page tables needed for map/unmap operations and
  return the unused page tables to some pool

The first and last reasons are particularly important for GPU drivers
wanting to implement asynchronous VM_BIND. Asynchronous VM_BIND requires
that any page table needed for a map/unmap operation to succeed be
allocated at VM_BIND job creation time. At the time of the job creation,
we don't know what the VM will look like when we get to execute the
map/unmap, and can't guess how many page tables we will need. Because
of that, we have to over-provision page tables for the worst case
scenario (page table tree is empty), which means we will allocate/free
a lot. Having a pool of free pages is crucial if we want to
speed-up VM_BIND requests.

There might also be other good reasons to want custom allocators, like
fine-grained memory accounting and resource limiting.
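
For an idea of the shape of the API, here is a rough sketch of the kind
of hooks this series adds to io-pgtable, together with a trivial
pool-backed implementation on the driver side (names, signatures and
the gpu_vm_* helpers below are illustrative, not the actual code):

struct io_pgtable_cfg {
        /* ... existing fields ... */

        /*
         * Optional custom allocator hooks (illustrative): when set, the
         * pgtable driver uses these instead of the page allocator for
         * its table memory. 'cookie' is the io-pgtable user's cookie.
         */
        void *(*alloc)(void *cookie, size_t size, gfp_t gfp);
        void (*free)(void *cookie, void *pages, size_t size);
};

/* Hypothetical driver-side hooks backed by a pool of free pages. */
static void *gpu_vm_pt_alloc(void *cookie, size_t size, gfp_t gfp)
{
        struct gpu_vm *vm = cookie;     /* hypothetical driver object */

        if (WARN_ON(size != SZ_4K))
                return NULL;

        /* grab a pre-allocated, zeroed page from the VM's pool */
        return gpu_vm_pool_get(vm);     /* hypothetical pool helper */
}

static void gpu_vm_pt_free(void *cookie, void *pages, size_t size)
{
        /* return the page table to the pool instead of freeing it */
        gpu_vm_pool_put(cookie, pages); /* hypothetical pool helper */
}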

A real example of how such custom allocators can be used is available
here[1]. v2 of the Panthor driver is approaching submission, and I
figured I'd try to upstream the dependencies separately, which is
why I submit this series now, even though the user of this new API
will come afterwards. If you'd prefer to have those patches submitted
along with the Panthor driver, let me know.

Regards,

Boris

[1]https://gitlab.freedesktop.org/panfrost/linux/-/blob/panthor/drivers/gpu/drm/panthor/panthor_mmu.c#L441

Boris Brezillon (2):
  iommu: Allow passing custom allocators to pgtable drivers
  iommu: Extend LPAE page table format to support custom allocators

 drivers/iommu/io-pgtable-arm.c | 55 ++++++++++++++++++++++++----------
 drivers/iommu/io-pgtable.c     | 23 ++++++++++++++
 include/linux/io-pgtable.h     | 31 +++++++++++++++++++
 3 files changed, 93 insertions(+), 16 deletions(-)

Comments

Gaurav Kohli Nov. 10, 2023, 10:47 a.m. UTC | #1
On 11/10/2023 3:13 PM, Boris Brezillon wrote:
> Hello,
> 
> This patchset is an attempt at making page table allocation
> customizable. This is useful to some GPU drivers for various reasons:
> 
> - speed-up upcoming page table allocations by managing a pool of free
>    pages
> - batch page table allocation instead of allocating one page at a time
> - pre-reserve pages for page tables needed for map/unmap operations and
>    return the unused page tables to some pool
> 
> The first and last reasons are particularly important for GPU drivers
> wanting to implement asynchronous VM_BIND. Asynchronous VM_BIND requires
> that any page table needed for a map/unmap operation to succeed be
> allocated at VM_BIND job creation time. At the time of the job creation,
> we don't know what the VM will look like when we get to execute the
> map/unmap, and can't guess how many page tables we will need. Because
> of that, we have to over-provision page tables for the worst case
> scenario (page table tree is empty), which means we will allocate/free
> a lot. Having a pool of free pages is crucial if we want to
> speed-up VM_BIND requests.
> 
> There might also be other good reasons to want custom allocators, like
> fine-grained memory accounting and resource limiting.
> 
> A real example of how such custom allocators can be used is available
> here[1]. v2 of the Panthor driver is approaching submission, and I
> figured I'd try to upstream the dependencies separately, which is
> why I submit this series now, even though the user of this new API
> will come afterwards. If you'd prefer to have those patches submitted
> along with the Panthor driver, let me know.
> 
> Regards,
> 
> Boris
> 
> [1]https://gitlab.freedesktop.org/panfrost/linux/-/blob/panthor/drivers/gpu/drm/panthor/panthor_mmu.c#L441
> 
> Boris Brezillon (2):
>    iommu: Allow passing custom allocators to pgtable drivers
>    iommu: Extend LPAE page table format to support custom allocators
> 
>   drivers/iommu/io-pgtable-arm.c | 55 ++++++++++++++++++++++++----------
>   drivers/iommu/io-pgtable.c     | 23 ++++++++++++++
>   include/linux/io-pgtable.h     | 31 +++++++++++++++++++
>   3 files changed, 93 insertions(+), 16 deletions(-)
> 

Tested and reviewed the patches; both look good. Please feel free to
add the following when merging:

Reviewed-by: Gaurav Kohli <quic_gkohli@quicinc.com>
Tested-by: Gaurav Kohli <quic_gkohli@quicinc.com>
Jason Gunthorpe Nov. 10, 2023, 3:14 p.m. UTC | #2
On Fri, Nov 10, 2023 at 10:43:50AM +0100, Boris Brezillon wrote:
> Hello,
> 
> This patchset is an attempt at making page table allocation
> customizable. This is useful to some GPU drivers for various reasons:
> 
> - speed-up upcoming page table allocations by managing a pool of free
>   pages
> - batch page table allocation instead of allocating one page at a time
> - pre-reserve pages for page tables needed for map/unmap operations and
>   return the unused page tables to some pool

Why would these topics be unique to GPU drivers as a user?

Shouldn't improving the allocator in the io page table be done
generically?

> A real example of how such custom allocators can be used is available
> here[1]. v2 of the Panthor driver is approaching submission, and I
> figured I'd try to upstream the dependencies separately, which is
> why I submit this series now, even though the user of this new API
> will come afterwards. If you'd prefer to have those patches submitted
> along with the Panthor driver, let me know.

Patches like this should come with users, but I think you should
refocus this effort to improving the io pagetable itself not allowing
a GPU driver to replace its insides. That seems like a cop-out.

Jason
Boris Brezillon Nov. 10, 2023, 3:48 p.m. UTC | #3
Hi Jason,

On Fri, 10 Nov 2023 11:14:28 -0400
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Fri, Nov 10, 2023 at 10:43:50AM +0100, Boris Brezillon wrote:
> > Hello,
> > 
> > This patchset is an attempt at making page table allocation
> > customizable. This is useful to some GPU drivers for various reasons:
> > 
> > - speed-up upcoming page table allocations by managing a pool of free
> >   pages
> > - batch page table allocation instead of allocating one page at a time
> > - pre-reserve pages for page tables needed for map/unmap operations and
> >   return the unused page tables to some pool  
> 
> Why would these topics be unique to GPU drivers as a user?

They are not, it's just that the primary use case for this extension
was a GPU driver, and that other GPU driver maintainers have expressed
interest in this solution too. Gaurav has a different use case that has
nothing to do with caching MMU page table allocations.

> 
> Shouldn't improving the allocator in the io page table be done
> generically?

While most of it could be made generic, the pre-reservation is a bit
special for VM_BIND: we need to pre-reserve page tables without knowing
the state of the page table tree (over-reservation), because page table
updates are executed asynchronously (the state of the VM when we
prepare the request might differ from its state when we execute it). We
also need to make sure no other pre-reservation requests steal pages
from the pool of pages we reserved for requests that were not executed
yet.

I'm not saying this is impossible to implement, but it sounds too
specific for a generic io-pgtable cache.
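
To illustrate the kind of reservation scheme I have in mind (everything
below is made up for the example; the panthor code linked in the cover
letter does something along these lines):

struct pt_pool {
        struct mutex lock;
        struct list_head free_pages;    /* pages usable for page tables     */
        size_t nr_free;                 /* number of pages on the free list */
        size_t reserved;                /* pages promised to pending jobs   */
};

/* Called at VM_BIND job creation time: provision for the worst case. */
static int pt_pool_reserve(struct pt_pool *pool, size_t worst_case)
{
        int ret = 0;

        mutex_lock(&pool->lock);
        /* Grow the pool until this job's worst case is fully covered... */
        while (pool->nr_free < pool->reserved + worst_case) {
                struct page *p = alloc_page(GFP_KERNEL | __GFP_ZERO);

                if (!p) {
                        ret = -ENOMEM;
                        break;
                }
                list_add(&p->lru, &pool->free_pages);
                pool->nr_free++;
        }
        /* ...and record the reservation so later jobs can't steal them. */
        if (!ret)
                pool->reserved += worst_case;
        mutex_unlock(&pool->lock);
        return ret;
}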

> 
> > A real example of how such custom allocators can be used is available
> > here[1]. v2 of the Panthor driver is approaching submission, and I

That's actually v3, not v2.

> > figured I'd try to upstream the dependencies separately, which is
> > why I submit this series now, even though the user of this new API
> > will come afterwards. If you'd prefer to have those patches submitted
> > along with the Panthor driver, let me know.  
> 
> Patches like this should come with users,

Fine, I can submit those two patches as part of the panthor patch
series if you prefer.

> but I think you should
> refocus this effort to improving the io pagetable itself not allowing
> a GPU driver to replace its insides.

These two patches don't come out of the blue. This has been discussed
with Robin, who suggested going for this simple solution where
allocations are deferred to the io-pgtable user. It turns out Gaurav
came up with a different use case where these custom page table
allocators would be useful.

Regards,

Boris
Jason Gunthorpe Nov. 10, 2023, 4:12 p.m. UTC | #4
On Fri, Nov 10, 2023 at 04:48:09PM +0100, Boris Brezillon wrote:

> > Shouldn't improving the allocator in the io page table be done
> > generically?
> 
> While most of it could be made generic, the pre-reservation is a bit
> special for VM_BIND: we need to pre-reserve page tables without knowing
> the state of the page table tree (over-reservation), because page table
> updates are executed asynchronously (the state of the VM when we
> prepare the request might differ from its state when we execute it). We
> also need to make sure no other pre-reservation requests steal pages
> from the pool of pages we reserved for requests that were not executed
> yet.
> 
> I'm not saying this is impossible to implement, but it sounds too
> specific for a generic io-pgtable cache.

It is quite easy, and indeed much better to do it internally.

struct page allocations like the io page table uses get a few pointers
of data to be used by the caller in the struct page *.

You can put a refcounter in that data per-page to count how many
callers have reserved the page. Add a new "allocate VA" API to
allocate and install page table levels that cover a VA range in the
radix tree and increment all the refcounts on all the impacted struct
pages.

Now you can be guaranteed that future map in that VA range will be
fully non-allocating, and future unmap will be fully non-freeing.

Some "unallocate VA" will decrement the refcounts and free the page
table levels within that VA range.

Precompute the number of required pages at the start of allocate and
you can trivially do batch allocations. Ditto for unallocate, it can
trivially do batch freeing.
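
To make that concrete, a rough sketch with invented names (assuming the
caller serializes reservation changes):

/*
 * One reservation count per table page, kept in page_private()
 * (assumed to be zeroed when the table page is allocated).
 */
static void iopt_table_get(struct page *table)
{
        set_page_private(table, page_private(table) + 1);
}

/* Returns true once the level has no reservations left and may be freed. */
static bool iopt_table_put(struct page *table)
{
        set_page_private(table, page_private(table) - 1);
        return page_private(table) == 0;
}

"Allocate VA" would then walk (and populate) every level covering the
requested range and call iopt_table_get() on each table it touches;
"unallocate VA" does the reverse and frees the levels whose count drops
to zero.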

Way better and more generically useful than allocator ops!

I'd be interested in something like this for iommufd too; we greatly
suffer from poor iommu driver performance during map, and in general we
lack a robust way to actually fully unmap all page table levels.

A new domain API to prepare all the ioptes more efficiently would be a
great general improvement!

Jason
Boris Brezillon Nov. 10, 2023, 7:16 p.m. UTC | #5
On Fri, 10 Nov 2023 12:12:29 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Fri, Nov 10, 2023 at 04:48:09PM +0100, Boris Brezillon wrote:
> 
> > > Shouldn't improving the allocator in the io page table be done
> > > generically?  
> > 
> > While most of it could be made generic, the pre-reservation is a bit
> > special for VM_BIND: we need to pre-reserve page tables without knowing
> > the state of the page table tree (over-reservation), because page table
> > updates are executed asynchronously (the state of the VM when we
> > prepare the request might differ from its state when we execute it). We
> > also need to make sure no other pre-reservation requests steal pages
> > from the pool of pages we reserved for requests that were not executed
> > yet.
> > 
> > I'm not saying this is impossible to implement, but it sounds too
> > specific for a generic io-pgtable cache.  
> 
> It is quite easy, and indeed much better to do it internally.
> 
> struct page allocations like the io page table uses get a few pointers
> of data to be used by the caller in the struct page *.

Ah, right. I didn't even consider that given how volatile page fields
are (not even sure which ones we're allowed to use for private data
tbh).

> You can put a refcounter in that data per-page to count how many
> callers have reserved the page. Add a new "allocate VA" API to
> allocate and install page table levels that cover a VA range in the
> radix tree and increment all the refcounts on all the impacted struct
> pages.

I like the general idea, but it starts to get tricky when:

1. you have a page table format supporting a dynamic number of levels.
For instance, on ARM MMUs, you can get rid of the last level if you
have portions of your buffer that are physically contiguous and aligned
on the upper PTE granularity (and the VA is aligned too, of course).
I'm assuming we want to optimize mem consumption by merging physically
contiguous regions in that case. If we accept to keep a static
granularity, there should be no issue.

and

2. your future MMU requests are unordered. That's the case for
VM_BIND, if you have multiple async queues, or if you want to fast
track synchronous requests.

In that case, I guess we can keep the leaf page tables around until all
pending requests have been executed, and get rid of them if we have no
remaining users at the end.

> 
> Now you can be guaranteed that future map in that VA range will be
> fully non-allocating, and future unmap will be fully non-freeing.

You mean fully non-freeing if there are no known remaining users to
come, right?

> 
> Some "unallocate VA" will decrement the refcounts and free the page
> table levels within that VA range.
> 
> Precompute the number of required pages at the start of allocate and
> you can trivially do batch allocations. Ditto for unallocate, it can
> trivially do batch freeing.
> 
> Way better and more generically useful than allocator ops!
> 
> I'd be interested in something like this for iommufd too, we greatly
> suffer from poor iommu driver performance during map, and in general we
> lack a robust way to actually fully unmap all page table levels.

Yeah, that might become a problem for us too (being able to tear down
all unused levels when you only unmap a portion of it, and the rest was
already empty). That, and also the ability to atomically update a
portion of the tree (even if I already have a workaround in mind for
that case).

> 
> A new domain API to prepare all the ioptes more efficiently would be a
> great general improvement!

If there are incentives to get this caching mechanism up and running,
I'm happy to help in any way you think would be useful, but I'd really
like to have a temporary solution until we have this solution ready.
Given custom allocators seem to be useful for other use cases, I'm
tempted to get it merged, and I'll happily port panthor to the new
caching system when it's ready.

Regards,

Boris
Jason Gunthorpe Nov. 10, 2023, 7:42 p.m. UTC | #6
On Fri, Nov 10, 2023 at 08:16:52PM +0100, Boris Brezillon wrote:
> On Fri, 10 Nov 2023 12:12:29 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Fri, Nov 10, 2023 at 04:48:09PM +0100, Boris Brezillon wrote:
> > 
> > > > Shouldn't improving the allocator in the io page table be done
> > > > generically?  
> > > 
> > > While most of it could be made generic, the pre-reservation is a bit
> > > special for VM_BIND: we need to pre-reserve page tables without knowing
> > > the state of the page table tree (over-reservation), because page table
> > > updates are executed asynchronously (the state of the VM when we
> > > prepare the request might differ from its state when we execute it). We
> > > also need to make sure no other pre-reservation requests steal pages
> > > from the pool of pages we reserved for requests that were not executed
> > > yet.
> > > 
> > > I'm not saying this is impossible to implement, but it sounds too
> > > specific for a generic io-pgtable cache.  
> > 
> > It is quite easy, and indeed much better to do it internally.
> > 
> > struct page allocations like the io page table uses get a few pointers
> > of data to be used by the caller in the struct page *.
> 
> Ah, right. I didn't even consider that given how volatile page fields
> are (not even sure which ones we're allowed to use for private data
> tbh).

It is much more orderly now, eg look at the slab and net folio
conversions

> > You can put a refcounter in that data per-page to count how many
> > callers have reserved the page. Add a new "allocate VA" API to
> > allocate and install page table levels that cover a VA range in the
> > radix tree and increment all the refcounts on all the impacted struct
> > pages.
> 
> I like the general idea, but it starts to get tricky when:
> 
> 1. you have a page table format supporting a dynamic number of levels.
> For instance, on ARM MMUs, you can get rid of the last level if you
> have portions of your buffer that are physically contiguous and aligned
> on the upper PTE granularity (and the VA is aligned too, of course).
> I'm assuming we want to optimize mem consumption by merging physically
> contiguous regions in that case. If we accept to keep a static
> granularity, there should be no issue.

If the last level(s) get chopped you'd have to stick the pages into a
linked list instead of freeing it, yes.

> 2. your future MMU requests are unordered. That's the case for
> VM_BIND, if you have multiple async queues, or if you want to fast
> track synchronous requests.

Don't really understand this?
 
> In that case, I guess we can keep the leaf page tables around until all
> pending requests have been executed, and get rid of them if we have no
> remaining users at the end.

I assumed you preallocated an IOVA window at some point and then the
BIND is just changing the mapping. The IOVA allocation would pin down
all the radix tree memory so that any map in the preallocated
IOVA range cannot fail.

> > Now you can be guaranteed that future map in that VA range will be
> > fully non-allocating, and future unmap will be fully non-freeing.
> 
> You mean fully non-freeing if there are no known remaining users to
> come, right?

unmap of allocated IOVA would be non-freeing. Free would happen on
allocate

> > A new domain API to prepare all the ioptes more efficiently would be a
> > great general improvement!
> 
> If there are incentives to get this caching mechanism up and running,
> I'm happy to help in any way you think would be useful, but I'd really
> like to have a temporary solution until we have this solution ready.
> Given custom allocators seem to be useful for other use cases, I'm
> tempted to get it merged, and I'll happily port panthor to the new
> caching system when it's ready.

My experience with GPU land is these hacky temporary things become
permanent and then a total pain for everyone else :( By the time
someone comes to fix it you will be gone and nobody will be willing to
help do changes to the GPU driver.

Jason
Boris Brezillon Nov. 13, 2023, 9:11 a.m. UTC | #7
On Fri, 10 Nov 2023 15:42:15 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> > > You can put a refcounter in that data per-page to count how many
> > > callers have reserved the page. Add a new "allocate VA" API to
> > > allocate and install page table levels that cover a VA range in the
> > > radix tree and increment all the refcounts on all the impacted struct
> > > pages.  
> > 
> > I like the general idea, but it starts to get tricky when:
> > 
> > 1. you have a page table format supporting a dynamic number of levels.
> > For instance, on ARM MMUs, you can get rid of the last level if you
> > have portions of your buffer that are physically contiguous and aligned
> > on the upper PTE granularity (and the VA is aligned too, of course).
> > I'm assuming we want to optimize mem consumption by merging physically
> > contiguous regions in that case. If we accept to keep a static
> > granularity, there should be no issue.  
> 
> If the last level(s) get chopped you'd have to stick the pages into a
> linked list instead of freeing it, yes.
> 
> > 2. your future MMU requests are unordered. That's the case for
> > VM_BIND, if you have multiple async queues, or if you want to fast
> > track synchronous requests.  
> 
> Don't really understand this?

I'm just saying that you don't know when BIND requests will be
executed, and combined with #1, it makes it more complicated to guess
what should stay around/be released.

>  
> > In that case, I guess we can keep the leaf page tables around until all
> > pending requests have been executed, and get rid of them if we have no
> > remaining users at the end.  
> 
> I assumed you preallocated an IOVA window at some point and then the
> BIND is just changing the mapping.

There's no concept of a pre-allocated IOVA range in panthor, at least not
yet. We just map/unmap on demand without trying to attach these
operations to a higher-level entity that has an IOVA range attached to
it (I assume this would be the VkImage/VkBuffer for Vulkan).

> The IOVA allocation would pin down
> all the radix tree memory so that any map in the preallocated
> IOVA range cannot fail.

Question is, when would you do the IOVA range allocation? So far, I was
assuming that every BIND request was a combination of:

1/ Pre-allocate enough resources for this specific map/unmap to always
succeed

2/ Execute this BIND operation when time comes

IIUC, you're suggesting doing things differently (there's a rough
sketch in code after the list):

1/ Reserve/pre-allocate the IOVA range for your higher-level
entity/object (through an explicit ioctl, I guess)

2/ BIND requests just map/unmap stuff in this pre-allocated/reserved
IOVA range. All page tables have been allocated during #1, so there's
no allocation happening here.

3/ When your higher level object is destroyed, release the IOVA range,
which, as a result, unmaps everything in that range, and frees up the
IOMMU page tables (and any other resources attached to this IOVA range).
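
Roughly, in code (everything below is invented, just to spell out that
flow):

/* 1/ at object creation: pre-populate every page-table level covering
 *    the object's VA range (invented helpers)
 */
static int gpu_vm_obj_va_init(struct gpu_vm *vm, struct gpu_vm_obj *obj)
{
        return vm_reserve_va_range(vm, obj->va, obj->size);
}

/* 2/ in between, BIND map/unmap within [obj->va, obj->va + obj->size)
 *    only updates PTEs, so it can neither allocate nor fail
 */

/* 3/ at object destruction: unmap whatever is left and free the
 *    now-unused page tables
 */
static void gpu_vm_obj_va_fini(struct gpu_vm *vm, struct gpu_vm_obj *obj)
{
        vm_release_va_range(vm, obj->va, obj->size);
}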

> 
> > > Now you can be guaranteed that future map in that VA range will be
> > > fully non-allocating, and future unmap will be fully non-freeing.  
> > 
> > You mean fully non-freeing if there are no known remaining users to
> > come, right?  
> 
> unmap of allocated IOVA would be non-freeing. Free would happen on
> allocate

Does that mean resources stay around until someone else tries to
allocate an IOVA range overlapping this previously existing IOVA
range? With your IOVA range solution, I'd expect resources to be
released when the IOVA range is released/destroyed.

> 
> > > A new domain API to prepare all the ioptes more efficiently would be a
> > > great general improvement!  
> > 
> > If there are incentives to get this caching mechanism up and running,
> > I'm happy to help in any way you think would be useful, but I'd really
> > like to have a temporary solution until we have this solution ready.
> > Given custom allocators seem to be useful for other use cases, I'm
> > tempted to get it merged, and I'll happily port panthor to the new
> > caching system when it's ready.  
> 
> My experience with GPU land is these hacky temporary things become
> permanent and then a total pain for everyone else :( By the time
> someone comes to fix it you will be gone and nobody will be willing to
> help do changes to the GPU driver.

Hm, that doesn't match my recent experience with DRM drivers, where
internal DRM APIs get changed pretty regularly, and reviewed by DRM
driver maintainers in a timely manner...

Anyway, given you already thought it through, can I ask you to provide
a preliminary implementation for this IOVA range mechanism so I can
play with it and adjust panthor accordingly. And if you don't have the
time, can you at least give me extra details about the implementation
you had in mind, so I don't have to guess and come back with something
that's not matching what you had in mind.

Regards,

Boris
Jason Gunthorpe Nov. 14, 2023, 4:27 p.m. UTC | #8
On Mon, Nov 13, 2023 at 10:11:03AM +0100, Boris Brezillon wrote:

> > The IOVA allocation would pin down
> > all the radix tree memory so that any map in the preallocated
> > IOVA range cannot fail.
> 
> Question is, when would you do the IOVA range allocation? So far, I was
> assuming that every BIND request was a combination of:
> 
> 1/ Pre-allocate enough resources for this specific map/unmap to always
> succeed
> 
> 2/ Execute this BIND operation when time comes
> 
> IIUC, you're suggesting doing things differently:
> 
> 1/ Reserve/pre-allocate the IOVA range for your higher-level
> entity/object (through an explicit ioctl, I guess)
> 
> 2/ BIND requests just map/unmap stuff in this pre-allocated/reserved
> IOVA range. All page tables have been allocated during #1, so there's
> no allocation happening here.
> 
> 3/ When your higher level object is destroyed, release the IOVA range,
> which, as a result, unmaps everything in that range, and frees up the
> IOMMU page tables (and any other resources attached to this IOVA range).

I don't really know anything about Vulkan so I can't really comment
well, but it seems to me what you outline makes sense; also, you
could make #1 allocate the IOVA as part of the preallocation??
 
> > > > Now you can be guaranteed that future map in that VA range will be
> > > > fully non-allocating, and future unmap will be fully non-freeing.  
> > > 
> > > You mean fully non-freeing if there are no known remaining users to
> > > come, right?  
> > 
> > unmap of allocated IOVA would be non-freeing. Free would happen on
> > allocate
> 
> Does that mean resources stay around until someone else tries to
> allocate an IOVA range overlapping this previously existing IOVA
> range? With your IOVA range solution, I'd expect resources to be
> released when the IOVA range is released/destroyed.

Sorry, I mistyped; I meant deallocate. Yes, when the IOVA is
deallocated then the pinned-down radix leaves could become freed. It
would be a logical time to do the freeing.

> > My experience with GPU land is these hacky temporary things become
> > permanent and then a total pain for everyone else :( By the time
> > someone comes to fix it you will be gone and nobody will be willing to
> > help do changes to the GPU driver.
> 
> Hm, that doesn't match my recent experience with DRM drivers, where
> internal DRM APIs get changed pretty regularly, and reviewed by DRM
> driver maintainers in a timely manner...

If the DRM maintainers push it then it happens :) Ask Robin about
his iommu_present() removal

> Anyway, given you already thought it through, can I ask you to provide
> a preliminary implementation for this IOVA range mechanism so I can
> play with it and adjust panthor accordingly. And if you don't have the
> time, can you at least give me extra details about the implementation
> you had in mind, so I don't have to guess and come back with something
> that's not matching what you had in mind.

Oh, I don't know if I can manage patches in any reasonable time frame,
though I think it is pretty straightforward really:

 - Patch to introduce some 'struct iopte_page' (see struct slab)
 - Adjust io pagetable implementations to consume it
 - Do RCU freeing of iopte_page
 - Add a reserve/unreserve io page table ops
 - Implement reserve/unreserve in arm by manipulating a new refcount
   in iopte_page. Rely on RCU to protect the derefs
 - Modify iommufd to call reserve/unreserve around areas attachment
   to have an intree user.

Some of this is a bit interesting, like reserving probably will
ideally want to invoke the batch allocator for efficiency which means
computing the number of radix levels required to fully populate the
current empty level - that should be general code somehow
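
For illustration, the first couple of bullets might look something like
this (layout and names are all guesses, not a design):

/*
 * Hypothetical overlay on struct page for pages used as IO page tables,
 * in the spirit of struct slab.
 */
struct iopte_page {
        unsigned long flags;            /* must line up with page->flags  */
        struct io_pgtable *iop;         /* owning page-table instance     */
        refcount_t reservations;        /* "reserve VA" pins on the level */
        struct rcu_head rcu;            /* deferred freeing of the level  */
};

/* New optional ops next to map_pages()/unmap_pages() (signatures guessed). */
struct io_pgtable_ops {
        /* ... existing map_pages/unmap_pages/iova_to_phys ... */
        int (*reserve)(struct io_pgtable_ops *ops, unsigned long iova,
                       size_t size, gfp_t gfp);
        void (*unreserve)(struct io_pgtable_ops *ops, unsigned long iova,
                          size_t size);
};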

Jason
Jason Gunthorpe Nov. 20, 2023, 2:04 p.m. UTC | #9
On Tue, Nov 14, 2023 at 12:27:48PM -0400, Jason Gunthorpe wrote:

> > Anyway, given you already thought it through, can I ask you to provide
> > a preliminary implementation for this IOVA range mechanism so I can
> > play with it and adjust panthor accordingly. And if you don't have the
> > time, can you at least give me extra details about the implementation
> > you had in mind, so I don't have to guess and come back with something
> > that's not matching what you had in mind.
> 
> Oh, I don't know if I can manage patches in any reasonable time frame,
> though I think it is pretty straightforward really:
> 
>  - Patch to introduce some 'struct iopte_page' (see struct slab)
>  - Adjust io pagetable implementations to consume it
>  - Do RCU freeing of iopte_page
>  - Add a reserve/unreserve io page table ops
>  - Implement reserve/unreserve in arm by manipulating a new refcount
>    in iopte_page. Rely on RCU to protect the derefs
>  - Modify iommufd to call reserve/unreserve around areas attachment
>    to have an intree user.
> 
> Some of this is a bit interesting, like reserving probably will
> ideally want to invoke the batch allocator for efficiency which means
> computing the number of radix levels required to fully populate the
> current empty level - that should be general code somehow

At LPC there was quite a lot of interest in improving the io page
table stuff to work better. Based on that I'm even more against adding
an external allocator at this time. :\

I may try to write a sketch of something but I'm not sure when..

Jason
Boris Brezillon Nov. 20, 2023, 2:38 p.m. UTC | #10
Hi Jason,

On Mon, 20 Nov 2023 10:04:25 -0400
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Tue, Nov 14, 2023 at 12:27:48PM -0400, Jason Gunthorpe wrote:
> 
> > > Anyway, given you already thought it through, can I ask you to provide
> > > a preliminary implementation for this IOVA range mechanism so I can
> > > play with it and adjust panthor accordingly. And if you don't have the
> > > time, can you at least give me extra details about the implementation
> > > you had in mind, so I don't have to guess and come back with something
> > > that's not matching what you had in mind.  
> > 
> > Oh, I don't know if I can manage patches in any reasonable time frame,
> > though I think it is pretty straightforward really:
> > 
> >  - Patch to introduce some 'struct iopte_page' (see struct slab)
> >  - Adjust io pagetable implementations to consume it
> >  - Do RCU freeing of iopte_page
> >  - Add a reserve/unreserve io page table ops
> >  - Implement reserve/unreserve in arm by manipulating a new refcount
> >    in iopte_page. Rely on RCU to protect the derefs
> >  - Modify iommufd to call reserve/unreserve around areas attachment
> >    to have an intree user.
> > 
> > Some of this is a bit interesting, like reserving probably will
> > ideally want to invoke the batch allocator for efficiency which means
> > computing the number of radix levels required to fully populate the
> > current empty level - that should be general code somehow  
> 
> At LPC there was quite a lot of interest in improving the io page
> table stuff to work better. Based on that I'm even more against adding
> an external allocator at this time. :\

I'm sure this has been discussed with the IOMMU maintainers, but I'd
really like to hear it from them if you don't mind. Especially since,
as I mentioned, the custom allocator idea comes from Robin, not me...

> 
> I may try to write a sketch of something but I'm not sure when..

Yeah, that's the main problem here. Making panthor (the GPU driver for
newer Mali GPUs) depend on some new stuff with no clear estimate of
when this work will actually happen is not something I want to
do. We've already had to wait a few months for some DRM deps to land
(and those were actively being worked on), and I was hoping the fact
this custom allocator idea originated from Robin (who's maintaining the
subsystem with others), plus the fact there was no real pushback on
v1, was a sign this was good to go. If the final call is to reject the
custom allocator approach, and there's no updates on your generic
solution in the next couple weeks, I'll probably copy the pgtable code
in panthor so I can implement my own stuff, and later switch back to
the generic io-pgtable implementation when things have settled down.

Nothing against you or your solution (I actually commit to move to the
new approach when it's ready), but we have different priorities, and I
can't really delay the driver merging by several months unless there's
a very good reason to do it.

Best Regards,

Boris
Jason Gunthorpe Nov. 20, 2023, 2:46 p.m. UTC | #11
On Mon, Nov 20, 2023 at 03:38:38PM +0100, Boris Brezillon wrote:

> > At LPC there was quite a lot of interest in improving the io page
> > table stuff to work better. Based on that I'm even more against adding
> > an external allocator at this time. :\
> 
> I'm sure this has been discussed with the IOMMU maintainers, but I'd
> really like to hear it from them if you don't mind. Especially since,
> as I mentioned, the custom allocator idea comes from Robin, not me...

It is a community effort - this is my opinion. We now understand this
io page table area needs a lot of work to meet all the new needs. I'm
against hacking it up like this and making that work harder.

> v1, was a sign this was good to go. If the final call is to reject the
> custom allocator approach, and there's no updates on your generic
> solution in the next couple weeks, I'll probably copy the pgtable code
> in panthor so I can implement my own stuff, and later switch back to
> the generic io-pgtable implementation when things have settled down.

That's horrible, DRM should refuse that too, yet I expect you will
probably end up merging like that.

I would not expect a resolution in a few weeks.

> Nothing against you or your solution (I actually commit to move to the
> new approach when it's ready), but we have different priorities, and I
> can't really delay the driver merging by several months unless there's
> a very good reason to do it.

Welcome to Linux, nothing runs fast :(

Jason
Boris Brezillon Nov. 20, 2023, 3:14 p.m. UTC | #12
On Mon, 20 Nov 2023 10:46:04 -0400
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Mon, Nov 20, 2023 at 03:38:38PM +0100, Boris Brezillon wrote:
> 
> > > At LPC there was quite a lot of interest in improving the io page
> > > table stuff to work better. Based on that I'm even more against adding
> > > an external allocator at this time. :\  
> > 
> > I'm sure this has been discussed with the IOMMU maintainers, but I'd
> > really like to hear it from them if you don't mind. Especially since,
> > as I mentioned, the custom allocator idea comes from Robin, not me...  
> 
> It is a community effort - this is my opinion. We now understand this
> io page table area needs a lot of work to meet all the new needs. I'm
> against hacking it up like this and making that work harder.

Consider it a hack if you like, but the thing is pretty self-contained,
and I doubt it will get to a point where things become unmaintainable
before you get the generic caching system working. Not to mention
Gaurav's use case, which can't really be solved with your caching
solution.

> 
> > v1, was a sign this was good to go. If the final call is to reject the
> > custom allocator approach, and there's no updates on your generic
> > solution in the next couple weeks, I'll probably copy the pgtable code
> > in panthor so I can implement my own stuff, and later switch back to
> > the generic io-pgtable implementation when things have settled down.  
> 
> That's horrible, DRM should refuse that too,

Most GPU drivers have their own MMU/pgtable logic, partly because they
are the only users, and also because most of them delegate the page
table updates to the GPU (page tables live in VRAM). I was actually
advised to go for this approach when I submitted the first
version of the driver, but after further discussions with Robin and
Rob we agreed this was probably not a good idea (because the code
exists already, and having a second implementation is never a good
thing, and also because what's needed for discrete GPUs with VRAM
doesn't necessarily apply to GPUs that share the memory with the rest
of the system).

> yet I expect you will
> probably end up merging like that.
> 
> I would not expect a resolution in a few weeks.

OOC, what's the time frame?

> 
> > Nothing against you or your solution (I actually commit to move to the
> > new approach when it's ready), but we have different priorities, and I
> > can't really delay the driver merging by several months unless there's
> > a very good reason to do it.  
> 
> Welcome to Linux, nothing runs fast :(

I've maintained a subsystem for a few years, so I know what it means
to contribute upstream and to ask, or be asked, to extend the core to do
something the right way. I'm generally patient and try to follow
maintainers' recommendations, but there comes a point where waiting is
no longer an option and you have to be more pragmatic, because there's
always a better way of doing things, and perfection is often the enemy
of good...
Jason Gunthorpe Nov. 20, 2023, 3:45 p.m. UTC | #13
On Mon, Nov 20, 2023 at 04:14:18PM +0100, Boris Brezillon wrote:
> On Mon, 20 Nov 2023 10:46:04 -0400
> Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> > On Mon, Nov 20, 2023 at 03:38:38PM +0100, Boris Brezillon wrote:
> > 
> > > > At LPC there was quite a lot of interest in improving the io page
> > > > table stuff to work better. Based on that I'm even more against adding
> > > > an external allocator at this time. :\  
> > > 
> > > I'm sure this has been discussed with the IOMMU maintainers, but I'd
> > > really like to hear it from them if you don't mind. Especially since,
> > > as I mentioned, the custom allocator idea comes from Robin, not me...  
> > 
> > It is a community effort - this is my opinion. We now understand this
> > io page table area needs a lot of work to meet all the new needs. I'm
> > against hacking it up like this and making that work harder.
> 
> Consider it a hack if you like, but the thing is pretty self-contained,
> and I doubt it will get to a point where things become unmaintainable
> before you get the generic caching system working. Not to mention
> Gaurav's use case, which can't really be solved with your caching
> solution.

Gaurav doesn't need a custom allocator, he needs a way to manage
sharing page table memory with the hypervisor. A different set of ops
that don't replace the entire allocator would be fine for them.

The issue with taking away the alloc_page is that it complicates the
ability for the common code to use the struct page fields and more. We
already know we are going to need those to do some of the things that
are being asked for, eg RCU freeing and refcounting.

> > I would not expect a resolution in a few weeks.
> 
> OOC, what's the time frame?

The larger need that has been discussed is considerable work; I would be
surprised if something could be merged in less than 6 months,
especially with the upcoming holiday season.

> I've maintained a subsystem for a few years, so I know what it means
> to contribute upstream and ask or being asked to extend the core to do
> something the right way. I'm generally patient and try to follow
> maintainers' recommendations, but there's some point where waiting is
> no more an option, and you have to be more pragmatic, because there's
> always a better way of doing things, and perfection is often the enemy
> of good...

Here I am not talking about perfection; I am concerned about messing
up work that I now see as important for other iommu-driver-related
things by making a hack for a DRM driver (as I've said, unravelling
these hacks has been traumatic for many of us historically).

So while I would not like you copying the code into DRM, I think it
would be less troublesome for everyone in the longer term.

Jason
Robin Murphy Nov. 22, 2023, 5:23 p.m. UTC | #14
On 2023-11-20 3:45 pm, Jason Gunthorpe wrote:
> On Mon, Nov 20, 2023 at 04:14:18PM +0100, Boris Brezillon wrote:
>> On Mon, 20 Nov 2023 10:46:04 -0400
>> Jason Gunthorpe <jgg@ziepe.ca> wrote:
>>
>>> On Mon, Nov 20, 2023 at 03:38:38PM +0100, Boris Brezillon wrote:
>>>
>>>>> At LPC there was quite a lot of interest in improving the io page
>>>>> table stuff to work better. Based on that I'm even more against adding
>>>>> an external allocator at this time. :\
>>>>
>>>> I'm sure this has been discussed with the IOMMU maintainers, but I'd
>>>> really like to hear it from them if you don't mind. Especially since,
>>>> as I mentioned, the custom allocator idea comes from Robin, not me...
>>>
>>> It is a community effort - this is my opinion. We now understand this
>>> io page table area needs a lot of work to meet all the new needs. I'm
>>> against hacking it up like this and making that work harder.
>>
>> Consider it a hack if you like, but the thing is pretty self-contained,
>> and I doubt it will get to a point where things become unmaintainable
>> before you get the generic caching system working. Not to mention
>> Gaurav's use case, which can't really be solved with your caching
>> solution.
> 
> Gaurav doesn't need a custom allocator, he needs a way to manage
> sharing page table memory with the hypervisor. A different set of ops
> that don't replace the entire allocator would be fine for them.
> 
> The issue with taking away the alloc_page is that it complicates the
> ability for the common code to use the struct page fields and more. We
> already know we are going to need those to do some of the things that
> are being asked for, eg RCU freeing and refcounting.

Oh come on, for all the things you claim are simple, you don't think 
it's trivial to make the fancy stuff conditional? (Hint: we already have 
control flow all over which is conditional on the cfg and/or calling 
context, and that future stuff may well end up wanting to be conditional 
in its own right anyway since not all io-pgtable-arm users will need or 
want it).

Having already 90% prototyped deferred freeing (oh, to be able to get 
back to fun things like that...), I see no significant impediment to 
fitting that in around this. Certainly no worse than fitting it around 
the !coherent_walk DMA API bits - in fact if anything, punting all that 
lot into custom allocator hooks starts to look like an attractive idea 
once we have the means (it really wants to be using dma_alloc_pages() 
these days).

>>> I would not expect a resolution in a few weeks.
>>
>> OOC, what's the time frame?
> 
> The larger need that has been discussed is considerable work, I would be
> surprised if something could be merged in less than 6 months,
> especially with the upcoming holiday season
> 
>> I've maintained a subsystem for a few years, so I know what it means
>> to contribute upstream and ask or being asked to extend the core to do
>> something the right way. I'm generally patient and try to follow
>> maintainers' recommendations, but there's some point where waiting is
>> no more an option, and you have to be more pragmatic, because there's
>> always a better way of doing things, and perfection is often the enemy
>> of good...
> 
> Here I am not talking about perfection, I am concerned about messing
> up work that I now see as important for other iommu drivers related
> things by making a hack for a DRM driver (which as I've said has,
> unravelling these hacks has been traumatic for many of us
> historically)

In your own words, this is a community effort. You do not own this code, 
and you do not get to prevent the community from solving multiple 
problems they have now, in a way which is reasonable now, just because 
*you* don't like it, and *you* believe it might possibly make it harder 
to do something at some indeterminate point in future. Your objection is 
noted, but as above, I for one do not see a convincing basis for it anyway.

You know what *the community* sees as important? Not having to maintain 
multiple copies of near-identical pagetable code, because maintenance is 
an established and very real concern. This is why we have the io-pgtable 
library in the first place, and why we adapted io-pgtable-arm to support 
panfrost when that came along. If supporting panfrost and panthor as 
io-pgtable users really does ever become a practical problem at some 
point in the future, we will work to address that problem at that point. 
In the meantime, we can get on with the common goal of making Linux do 
the things people want it to do, in the best way that the maintainers 
are happy to maintain. This is not "a hack for a DRM driver", it's a 
simple and convenient generalisation of part of io-pgtable, in keeping 
with the design of io-pgtable (user-provided callbacks are nothing new), 
and straightforwardly serving at least two completely different 
use-cases already.

Thanks,
Robin.
Jason Gunthorpe Nov. 22, 2023, 5:50 p.m. UTC | #15
On Wed, Nov 22, 2023 at 05:23:34PM +0000, Robin Murphy wrote:
> > Gaurav doesn't need a custom allocator, he needs a way to manage
> > sharing page table memory with the hypervisor. A different set of ops
> > that don't replace the entire allocator would be fine for them.
> > 
> > The issue with taking away the alloc_page is that it complicates the
> > ability for the common code to use the struct page fields and more. We
> > already know we are going to need those to do some of the things that
> > are being asked for, eg RCU freeing and refcounting.
> 
> Oh come on, for all the things you claim are simple, you don't think it's
> trivial to make the fancy stuff conditional? (Hint: we already have control
> flow all over which is conditional on the cfg and/or calling context, and
> that future stuff may well end up wanting to be conditional in its own right
> anyway since not all io-pgtable-arm users will need or want it).

I really don't want to make it conditional! I'd want to have one
common set of radix algorithms shared by the enterprise IOMMUs, so we
can actually improve those drivers and get all the special features
people want without writing everything 5 times.

> > Here I am not talking about perfection, I am concerned about messing
> > up work that I now see as important for other iommu drivers related
> > things by making a hack for a DRM driver (which as I've said has,
> > unravelling these hacks has been traumatic for many of us
> > historically)
> 
> In your own words, this is a community effort. You do not own this code, and
> you do not get to prevent the community from solving multiple problems they
> have now, in a way which is reasonable now, just because *you* don't like
> it, and *you* believe it might possibly make it harder to do something at
> some indeterminate point in future. Your objection is noted, but as above, I
> for one do not see a convincing basis for it anyway.

I've said already it is my opinion; if you want to bring another one
then you can participate too. Thank you for noting my objection.

> You know what *the community* sees as important? Not having to maintain
> multiple copies of near-identical pagetable code, because maintenance is an
> established and very real concern. 

What I heard at LPC is we badly need to stop maintaining independent
radix algorithms across at least the enterprise iommu drivers. This
was a need clearly articulated by many different people *from the
community*.

> This is why we have the io-pgtable
> library in the first place, and why we adapted io-pgtable-arm to support
> panfrost when that came along. If supporting panfrost and panthor as
> io-pgtable users really does ever become a practical problem at some point
> in the future, we will work to address that problem at that point.

You mean Joao or myself will get stuck with it. Not "we". :(

> meantime, we can get on with the common goal of making Linux do the things
> people want it to do, in the best way that the maintainers are happy to
> maintain. This is not "a hack for a DRM driver", it's a simple and
> convenient generalisation of part of io-pgtable, in keeping with the design
> of io-pgtable (user-provided callbacks are nothing new), and
> straightforwardly serving at least two completely different use-cases
> already.

I've already explained to Boris, and he agreed, that it is the *wrong*
generalization because it is implementing a generally useful algorithm
in the DRM driver and keeping it out of the core code.

If we deliberately do the wrong thing then, yes, it is a hack.

Jason
Boris Brezillon Nov. 23, 2023, 8:51 a.m. UTC | #16
Hi Jason,

On Wed, 22 Nov 2023 13:50:55 -0400
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Wed, Nov 22, 2023 at 05:23:34PM +0000, Robin Murphy wrote:
> > > Gaurav doesn't need a custom allocator, he needs a way to manage
> > > sharing page table memory with the hypervisor. A different set of ops
> > > that don't replace the entire allocator would be fine for them.
> > > 
> > > The issue with taking away the alloc_page is that it complicates the
> > > ability for the common code to use the struct page fields and more. We
> > > already know we are going to need those to do some of the things that
> > > are being asked for, eg RCU freeing and refcounting.  
> > 
> > Oh come on, for all the things you claim are simple, you don't think it's
> > trivial to make the fancy stuff conditional? (Hint: we already have control
> > flow all over which is conditional on the cfg and/or calling context, and
> > that future stuff may well end up wanting to be conditional in its own right
> > anyway since not all io-pgtable-arm users will need or want it).  
> 
> I really don't want to make it conditional! I'd want to have one
> common set of radix algorithms shared by the enterprise IOMMUs, so we
> can actually improve those drivers and get all the special features
> people want without writing everything 5 times.

Just because it's optional doesn't mean you can't have a common
implementation shared across drivers. Feels like you're stuck on an
all-or-nothing mindset instead of trying to progressively push stuff to
more drivers, until you end up with everyone using your caching system,
at which point you can decide to make it part of the mandatory/core
features.

> 
> > > Here I am not talking about perfection, I am concerned about messing
> > > up work that I now see as important for other iommu drivers related
> > > things by making a hack for a DRM driver (which as I've said has,
> > > unravelling these hacks has been traumatic for many of us
> > > historically)  
> > 
> > In your own words, this is a community effort. You do not own this code, and
> > you do not get to prevent the community from solving multiple problems they
> > have now, in a way which is reasonable now, just because *you* don't like
> > it, and *you* believe it might possibly make it harder to do something at
> > some indeterminate point in future. Your objection is noted, but as above, I
> > for one do not see a convincing basis for it anyway.  
> 
> I've said already it is my opinion, if you want to bring another one
> then you can participate too. Thank you for noting my objection.
> 
> > You know what *the community* sees as important? Not having to maintain
> > multiple copies of near-identical pagetable code, because maintenance is an
> > established and very real concern.   
> 
> What I heard at LPC is we badly need to stop maintaining independent
> radix algorithms across at least the enterprise iommu drivers. This
> was clearly an articulated need for many different people *from the
> community*.

Can these people please raise their voice here? Or maybe LPC's IOMMU
discussions were summarized somewhere in an email sent to the IOMMU ML,
where we could discuss that, instead of this thread.

> 
> > This is why we have the io-pgtable
> > library in the first place, and why we adapted io-pgtable-arm to support
> > panfrost when that came along. If supporting panfrost and panthor as
> > io-pgtable users really does ever become a practical problem at some point
> > in the future, we will work to address that problem at that point.  
> 
> You mean Joao or myself will get stuck with it. Not "we". :(

Oh come on, I'm tired of hearing that argument. I offered my help, and
am still happy to help in any way I can, including reviewing DRM
changes, or helping patch DRM drivers, if that's the main blocker.
If there's a set of specific DRM drivers you had problems with, please
name them, so we can discuss it with the rest of the DRM community in
order to try and find a solution.

> 
> > meantime, we can get on with the common goal of making Linux do the things
> > people want it to do, in the best way that the maintainers are happy to
> > maintain. This is not "a hack for a DRM driver", it's a simple and
> > convenient generalisation of part of io-pgtable, in keeping with the design
> > of io-pgtable (user-provided callbacks are nothing new), and
> > straightforwardly serving at least two completely different use-cases
> > already.  
> 
> I've already explained to Boris, and he agreed, that it is the *wrong*
> generalization because it is implementing a generally useful algorithm
> in the DRM driver and keeping it out of the core code.

Sorry, but I never said that. I said your approach was interesting (and
I keep thinking it is), tried to see how it'd fit in the panthor
picture, and said I wouldn't mind porting panthor to it when it's
ready. I still strongly disagree on the form though. Blocking
driver-merging/framework-evolutions for months just because you decided
how you want it done, and that nothing else can happen in the meantime
because it might get in the way, sounds like a single man decision, not
something widely agreed across the iommu community.

Besides, I think your objection that custom allocators are in the way of
this caching generalization is a bit of an overstatement, especially if
custom allocators provide the same guarantees you get from existing
allocations (page-backed, no fields used in the page, etc). Worst case
scenario, you disable the generic cache for all users that pass a custom
allocator, and add a pr_warn() (or WARN_ON() if you want to be pushy)
to make sure driver owners take that into consideration quickly.

Last but not least, IIUC your approach requires explicit VA range
allocation/deallocation to work (through new API calls), which means
you'll need to patch all io-pgtable users and get acks from their
maintainers. That includes IOMMU drivers living in drivers/iommu, which,
if I read between the lines, should be easy peasy, but also external
io-pgtable users like DRM drivers, which, according to you, are
unresponsive and reluctant to change how they use the APIs. So, if I
follow your way of thinking, you'll just be stuck in the same place,
waiting for DRM drivers to accept the transversal changes you intend to
push.

> 
> If we deliberately do the wrong thing then, yes, it is a hack.

We don't deliberately do the 'wrong' thing. We do something that's
working but that you consider wrong/inappropriate, and we commit to
revisiting our implementation when the new/fancy thing is out. I don't
see how I can be more constructive here.

What I strongly object to, though, is having someone say 'I have the
perfect solution for you', and then tell me, 'but you have to wait
another six months, maybe more, to see what it looks like'. I already
see your counterargument coming, that I can pick it up and try to
implement it myself. Problem is, in this sort of situation, the person
offering to help (me) rarely has the same vision of what the
implementation should look like, and ends up spending a considerable
amount of time working on something that, at best, barely matches
the original idea.

Regards,

Boris
Jason Gunthorpe Nov. 23, 2023, 1:48 p.m. UTC | #17
On Thu, Nov 23, 2023 at 09:51:41AM +0100, Boris Brezillon wrote:
> Besides, I think your objection that custom allocators are in the way of
> this caching generalization is a bit of an overstatement, especially if
> custom allocators provide the same guarantees you get from existing
> allocations (page-backed, no fields used in the page, etc). Worst case
> scenario, you disable the generic cache for all users that pass a custom
> allocator, and add a pr_warn() (or WARN_ON() if you want to be pushy)
> to make sure driver owners take that into consideration quickly.

I want to have common algorithms to manage the radix system, like mm
does, and then have some fairly hairy stuff done in the new common
code to address all the needs we now understand people have. This is
not just "caching".

One of the important things on this list is RCU freeing of the table
memory.

RCU freeing requires precious space in the struct page.

External ops mean we must have the io page table instance available
inside the RCU callback so it knows what op to call to free.

This means the struct page needs to store both a rcu_head and another
pointer.

Currently the similar page table algorithms in the mm side consume all
the free struct page space. I don't expect the future io page table
version of the same stuff to be much different.

So it is not just that the allocator side can't use the struct page
memory. It is that an external allocator inherently demands *more*
space in the struct page, space we may not have to give.
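
To spell out the space problem with a purely illustrative sketch: this
is roughly the per-page state an RCU-based free needs once external ops
are in the picture. The structure and callback names are invented; the
point is that an rcu_head plus an owner pointer is already more than
struct page can spare:

  #include <linux/io-pgtable.h>
  #include <linux/rcupdate.h>
  #include <linux/slab.h>

  /*
   * With the built-in allocator, an rcu_head alone could be squeezed
   * into struct page. With external ops we also need to remember which
   * io_pgtable instance owns the page, so the callback can call the
   * right ->free() hook.
   */
  struct iopt_deferred_free {
          struct rcu_head rcu;
          struct io_pgtable *iop;   /* owner, to pick the right ->free() */
          void *table;              /* the page-table memory itself */
  };

  static void iopt_free_table_rcu(struct rcu_head *rcu)
  {
          struct iopt_deferred_free *df =
                  container_of(rcu, struct iopt_deferred_free, rcu);

          /* Custom allocator hook as proposed in this series. */
          df->iop->cfg.free(df->iop->cookie, df->table, PAGE_SIZE);
          kfree(df);
  }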

Worse, the entire idea of RCU free seems to be incompatible with what
you are trying to achieve since pages could be unavailable for use
while they are waiting in RCU.

I'm looking at this and thinking we lose the option to do RCU in the
generic code to accommodate external ops. "make RCU optional" is
basically saying to duplicate a lot of stuff because RCU becomes pretty
fundamental to features and algorithm design.

Do you understand my alarm about this direction better now?  I'm sorry if
I haven't explained this clearly enough. (I'm coming at this from the
MM perspective where the sorts of algorithms and problems here are
already well known)

> unresponsive and reluctant to change how they use the APIs. So, if I
> follow your way of thinking, you'll just be stuck in the same place,
> waiting for DRM drivers to accept the cross-cutting changes you intend
> to push.

I wasn't imagining a forced API change. I think the current API can
work fine along with an optional pre-allocation scheme.
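
In other words, something in the spirit of the two helpers below, which
existing drivers can simply ignore (the names are invented and nothing
here is a committed interface): a driver that needs async VM_BIND
guarantees tops up a per-domain reserve before the job runs, while
everyone else keeps calling map/unmap exactly as they do today.

  #include <linux/io-pgtable.h>

  /*
   * Hypothetical, optional pre-allocation scheme. Unlike a mandatory
   * reserve/release bracket, callers that don't opt in need no changes.
   */
  int io_pgtable_prealloc(struct io_pgtable_ops *ops,
                          unsigned int max_tables, gfp_t gfp);
  void io_pgtable_prealloc_flush(struct io_pgtable_ops *ops);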

> What I strongly object to, though, is having someone say 'I have the
> perfect solution for you', and then tell me, 'but you have to wait
> another six months, maybe more, to see what it looks like'.

I never said you should be blocked. I agreed with duplicating if
necessary to be unblocked.

I also outlined and encouraged you to solve the general problem, but I
fully understand you don't want to take it on. That's fine too.

You are both mis-characterizing my position as trying to block the DRM
driver :(

I'm trying to accommodate the work I see we need to do soon on the
enterprise iommu side that touches this same code.

Look, if we go down this road then we may end up duplicating the page
table code so the iommu drivers can use the improved versions until
the DRM driver can be changed. Maybe doing duplication later is better
than sooner.

Jason
Robin Murphy Nov. 23, 2023, 4:49 p.m. UTC | #18
On 23/11/2023 1:48 pm, Jason Gunthorpe wrote:
> On Thu, Nov 23, 2023 at 09:51:41AM +0100, Boris Brezillon wrote:
>> Besides, I think your objection that custom allocators are in the way of
>> this caching generalization is a bit of an overstatement, especially if
>> custom allocators provide the same guarantees you get from existing
>> allocations (page-backed, no fields used in the page, etc.). Worst-case
>> scenario, you disable the generic cache for all users that pass a custom
>> allocator, and add a pr_warn() (or WARN_ON() if you want to be pushy)
>> to make sure driver owners take that into consideration quickly.
> 
> I want to have common algorithms to manage the radix system, like mm
> does, and then have some fairly hairy stuff done in the new common
> code to address all the needs we now understand people have. This is
> not just "caching".
> 
> One of the important things on this list is RCU freeing of the table
> memory.
> 
> RCU freeing requires precious space in the struct page.
> 
> External ops mean we must have the io page table instance available
> inside the RCU callback so it knows what op to call to free.
> 
> This means the struct page needs to store both a rcu_head and another
> pointer.
> 
> Currently the similar page table algorithms in the mm side consume all
> the free struct page space. I don't expect the future io page table
> version of the same stuff to be much different.
> 
> So it is not just that the allocator side can't use the struct page
> memory. It is that an external allocator inherently demands *more*
> space in the struct page, space we may not have to give.
> 
> Worse, the entire idea of RCU free seems to be incompatible with what
> you are trying to achieve since pages could be unavailable for use
> while they are waiting in RCU.
> 
> I'm looking at this and thinking we lose the option to do RCU in the
> generic code to accommodate external ops. "make RCU optional" is
> basically saying to duplicate a lot of stuff because RCU becomes pretty
> fundamental to features and algorithm design.
> 
> Do you understand my alarm about this direction better now?  I'm sorry if
> I haven't explained this clearly enough. (I'm coming at this from the
> MM perspective where the sorts of algorithms and problems here are
> already well known)
> 
>> unresponsive and reluctant to change how they use the APIs. So, if I
>> follow your way of thinking, you'll just be stuck in the same place,
>> waiting for DRM drivers to accept the cross-cutting changes you intend
>> to push.
> 
> I wasn't imagining a forced API change. I think the current API can
> work fine along with an optional pre-allocation scheme.
> 
>> What I strongly object to, though, is having someone say 'I have the
>> perfect solution for you', and then tell me, 'but you have to wait
>> another six months, maybe more, to see what it looks like'.
> 
> I never said you should be blocked. I agreed with duplicating if
> necessary to be unblocked.
> 
> I also outlined and encouraged you to solve the general problem, but I
> fully understand you don't want to take it on. That's fine too.
> 
> You are both mis-characterizing my position as trying to block the DRM
> driver :(
> 
> I'm trying to accommodate the work I see we need to do soon on the
> enterprise iommu side that touches this same code.
> 
> Look, if we go down this road then we may end up duplicating the page
> table code so the iommu drivers can use the improved versions until
> the DRM driver can be changed. Maybe doing duplication later is better
> than sooner.

Oh, so essentially your plan is to replace io-pgtable-arm with something 
else entirely. That's fair enough; happy to review that when I see it, 
but as I pointed out, realistically this change hardly makes that any 
worse than what's *already* present in io-pgtable-arm that you're going 
to have to deal with anyway.

I would instinctively assume that an "enterprise" optimised version 
would want to be a clean-slate fork itself since I can't imagine it 
wants to be burdened with having to handle non-coherent pagetables, 
Armv7 LPAE format, or any of the quirks for embedded and mobile IOMMUs. 
I have no issue with moving SMMUv3 to a shiny new pagetable 
implementation in future, but that is quite clearly not at all 
incompatible with leaving the existing implementation in place for all 
the *other* decidedly non-enterprise io-pgtable-arm users, both IOMMU 
and GPU, not to mention GPU-via-IOMMU in the case of Adreno.

Thanks,
Robin.
Jason Gunthorpe Nov. 23, 2023, 4:59 p.m. UTC | #19
On Thu, Nov 23, 2023 at 04:49:21PM +0000, Robin Murphy wrote:

> Oh, so essentially your plan is to replace io-pgtable-arm with something
> else entirely. That's fair enough; happy to review that when I see it, but
> as I pointed out, realistically this change hardly makes that any worse than
> what's *already* present in io-pgtable-arm that you're going to have to deal
> with anyway.
> 
> I would instinctively assume that an "enterprise" optimised version would
> want to be a clean-slate fork itself since I can't imagine it wants to be
> burdened with having to handle non-coherent pagetables, Armv7 LPAE format,
> or any of the quirks for embedded and mobile IOMMUs. I have no issue with
> moving SMMUv3 to a shiny new pagetable implementation in future, but that is
> quite clearly not at all incompatible with leaving the existing
> implementation in place for all the *other* decidedly non-enterprise
> io-pgtable-arm users, both IOMMU and GPU, not to mention GPU-via-IOMMU in
> the case of Adreno.

Oh, okay! Great, then we are all in agreement!

I was not thinking about a fork, per se, but yes, I can see how that
would be a fine approach to take for everyone concerned.

Thanks,
Jason