mbox series

[RFC,RESEND,00/16] Split IOMMU DMA mapping operation to two steps

Message ID cover.1709635535.git.leon@kernel.org (mailing list archive)
Headers show
Series Split IOMMU DMA mapping operation to two steps | expand

Message

Leon Romanovsky March 5, 2024, 11:18 a.m. UTC
This is complimentary part to the proposed LSF/MM topic.
https://lore.kernel.org/linux-rdma/22df55f8-cf64-4aa8-8c0b-b556c867b926@linux.dev/T/#m85672c860539fdbbc8fe0f5ccabdc05b40269057

This is posted as RFC to get a feedback on proposed split, but RDMA, VFIO and
DMA patches are ready for review and inclusion, the NVMe patches are still in
progress as they require agreement on API first.

Thanks

-------------------------------------------------------------------------------
The DMA mapping operation performs two steps at one same time: allocates
IOVA space and actually maps DMA pages to that space. This one shot
operation works perfectly for non-complex scenarios, where callers use
that DMA API in control path when they setup hardware.

However in more complex scenarios, when DMA mapping is needed in data
path and especially when some sort of specific datatype is involved,
such one shot approach has its drawbacks.

That approach pushes developers to introduce new DMA APIs for specific
datatype. For example existing scatter-gather mapping functions, or
latest Chuck's RFC series to add biovec related DMA mapping [1] and
probably struct folio will need it too.

These advanced DMA mapping APIs are needed to calculate IOVA size to
allocate it as one chunk and some sort of offset calculations to know
which part of IOVA to map.

Instead of teaching DMA to know these specific datatypes, let's separate
existing DMA mapping routine to two steps and give an option to advanced
callers (subsystems) perform all calculations internally in advance and
map pages later when it is needed.

In this series, three users are converted and each of such conversion
presents different positive gain:
1. RDMA simplifies and speeds up its pagefault handling for
   on-demand-paging (ODP) mode.
2. VFIO PCI live migration code saves huge chunk of memory.
3. NVMe PCI avoids intermediate SG table manipulation and operates
   directly on BIOs.

Thanks

[1] https://lore.kernel.org/all/169772852492.5232.17148564580779995849.stgit@klimt.1015granger.net

Chaitanya Kulkarni (2):
  block: add dma_link_range() based API
  nvme-pci: use blk_rq_dma_map() for NVMe SGL

Leon Romanovsky (14):
  mm/hmm: let users to tag specific PFNs
  dma-mapping: provide an interface to allocate IOVA
  dma-mapping: provide callbacks to link/unlink pages to specific IOVA
  iommu/dma: Provide an interface to allow preallocate IOVA
  iommu/dma: Prepare map/unmap page functions to receive IOVA
  iommu/dma: Implement link/unlink page callbacks
  RDMA/umem: Preallocate and cache IOVA for UMEM ODP
  RDMA/umem: Store ODP access mask information in PFN
  RDMA/core: Separate DMA mapping to caching IOVA and page linkage
  RDMA/umem: Prevent UMEM ODP creation with SWIOTLB
  vfio/mlx5: Explicitly use number of pages instead of allocated length
  vfio/mlx5: Rewrite create mkey flow to allow better code reuse
  vfio/mlx5: Explicitly store page list
  vfio/mlx5: Convert vfio to use DMA link API

 Documentation/core-api/dma-attributes.rst |   7 +
 block/blk-merge.c                         | 156 ++++++++++++++
 drivers/infiniband/core/umem_odp.c        | 219 +++++++------------
 drivers/infiniband/hw/mlx5/mlx5_ib.h      |   1 +
 drivers/infiniband/hw/mlx5/odp.c          |  59 +++--
 drivers/iommu/dma-iommu.c                 | 129 ++++++++---
 drivers/nvme/host/pci.c                   | 220 +++++--------------
 drivers/vfio/pci/mlx5/cmd.c               | 252 ++++++++++++----------
 drivers/vfio/pci/mlx5/cmd.h               |  22 +-
 drivers/vfio/pci/mlx5/main.c              | 136 +++++-------
 include/linux/blk-mq.h                    |   9 +
 include/linux/dma-map-ops.h               |  13 ++
 include/linux/dma-mapping.h               |  39 ++++
 include/linux/hmm.h                       |   3 +
 include/rdma/ib_umem_odp.h                |  22 +-
 include/rdma/ib_verbs.h                   |  54 +++++
 kernel/dma/debug.h                        |   2 +
 kernel/dma/direct.h                       |   7 +-
 kernel/dma/mapping.c                      |  91 ++++++++
 mm/hmm.c                                  |  34 +--
 20 files changed, 870 insertions(+), 605 deletions(-)

Comments

Robin Murphy March 5, 2024, 12:05 p.m. UTC | #1
On 2024-03-05 11:18 am, Leon Romanovsky wrote:
> This is complimentary part to the proposed LSF/MM topic.
> https://lore.kernel.org/linux-rdma/22df55f8-cf64-4aa8-8c0b-b556c867b926@linux.dev/T/#m85672c860539fdbbc8fe0f5ccabdc05b40269057
> 
> This is posted as RFC to get a feedback on proposed split, but RDMA, VFIO and
> DMA patches are ready for review and inclusion, the NVMe patches are still in
> progress as they require agreement on API first.
> 
> Thanks
> 
> -------------------------------------------------------------------------------
> The DMA mapping operation performs two steps at one same time: allocates
> IOVA space and actually maps DMA pages to that space. This one shot
> operation works perfectly for non-complex scenarios, where callers use
> that DMA API in control path when they setup hardware.
> 
> However in more complex scenarios, when DMA mapping is needed in data
> path and especially when some sort of specific datatype is involved,
> such one shot approach has its drawbacks.
> 
> That approach pushes developers to introduce new DMA APIs for specific
> datatype. For example existing scatter-gather mapping functions, or
> latest Chuck's RFC series to add biovec related DMA mapping [1] and
> probably struct folio will need it too.
> 
> These advanced DMA mapping APIs are needed to calculate IOVA size to
> allocate it as one chunk and some sort of offset calculations to know
> which part of IOVA to map.

I don't follow this part at all - at *some* point, something must know a 
range of memory addresses involved in a DMA transfer, so that's where it 
should map that range for DMA. Even in a badly-designed system where the 
point it's most practical to make the mapping is further out and only 
knows that DMA will touch some subset of a buffer, but doesn't know 
exactly what subset yet, you'd usually just map the whole buffer. I 
don't see why the DMA API would ever need to know about anything other 
than pages/PFNs and dma_addr_ts (yes, it does also accept them being 
wrapped together in scatterlists; yes, scatterlists are awful and it 
would be nice to replace them with a better general DMA descriptor; that 
is a whole other subject of its own).

> Instead of teaching DMA to know these specific datatypes, let's separate
> existing DMA mapping routine to two steps and give an option to advanced
> callers (subsystems) perform all calculations internally in advance and
> map pages later when it is needed.

 From a brief look, this is clearly an awkward reinvention of the IOMMU 
API. If IOMMU-aware drivers/subsystems want to explicitly manage IOMMU 
address spaces then they can and should use the IOMMU API. Perhaps 
there's room for some quality-of-life additions to the IOMMU API to help 
with common usage patterns, but the generic DMA mapping API is 
absolutely not the place for it.

Thanks,
Robin.

> In this series, three users are converted and each of such conversion
> presents different positive gain:
> 1. RDMA simplifies and speeds up its pagefault handling for
>     on-demand-paging (ODP) mode.
> 2. VFIO PCI live migration code saves huge chunk of memory.
> 3. NVMe PCI avoids intermediate SG table manipulation and operates
>     directly on BIOs.
> 
> Thanks
> 
> [1] https://lore.kernel.org/all/169772852492.5232.17148564580779995849.stgit@klimt.1015granger.net
> 
> Chaitanya Kulkarni (2):
>    block: add dma_link_range() based API
>    nvme-pci: use blk_rq_dma_map() for NVMe SGL
> 
> Leon Romanovsky (14):
>    mm/hmm: let users to tag specific PFNs
>    dma-mapping: provide an interface to allocate IOVA
>    dma-mapping: provide callbacks to link/unlink pages to specific IOVA
>    iommu/dma: Provide an interface to allow preallocate IOVA
>    iommu/dma: Prepare map/unmap page functions to receive IOVA
>    iommu/dma: Implement link/unlink page callbacks
>    RDMA/umem: Preallocate and cache IOVA for UMEM ODP
>    RDMA/umem: Store ODP access mask information in PFN
>    RDMA/core: Separate DMA mapping to caching IOVA and page linkage
>    RDMA/umem: Prevent UMEM ODP creation with SWIOTLB
>    vfio/mlx5: Explicitly use number of pages instead of allocated length
>    vfio/mlx5: Rewrite create mkey flow to allow better code reuse
>    vfio/mlx5: Explicitly store page list
>    vfio/mlx5: Convert vfio to use DMA link API
> 
>   Documentation/core-api/dma-attributes.rst |   7 +
>   block/blk-merge.c                         | 156 ++++++++++++++
>   drivers/infiniband/core/umem_odp.c        | 219 +++++++------------
>   drivers/infiniband/hw/mlx5/mlx5_ib.h      |   1 +
>   drivers/infiniband/hw/mlx5/odp.c          |  59 +++--
>   drivers/iommu/dma-iommu.c                 | 129 ++++++++---
>   drivers/nvme/host/pci.c                   | 220 +++++--------------
>   drivers/vfio/pci/mlx5/cmd.c               | 252 ++++++++++++----------
>   drivers/vfio/pci/mlx5/cmd.h               |  22 +-
>   drivers/vfio/pci/mlx5/main.c              | 136 +++++-------
>   include/linux/blk-mq.h                    |   9 +
>   include/linux/dma-map-ops.h               |  13 ++
>   include/linux/dma-mapping.h               |  39 ++++
>   include/linux/hmm.h                       |   3 +
>   include/rdma/ib_umem_odp.h                |  22 +-
>   include/rdma/ib_verbs.h                   |  54 +++++
>   kernel/dma/debug.h                        |   2 +
>   kernel/dma/direct.h                       |   7 +-
>   kernel/dma/mapping.c                      |  91 ++++++++
>   mm/hmm.c                                  |  34 +--
>   20 files changed, 870 insertions(+), 605 deletions(-)
>
Leon Romanovsky March 5, 2024, 12:29 p.m. UTC | #2
On Tue, Mar 05, 2024 at 12:05:23PM +0000, Robin Murphy wrote:
> On 2024-03-05 11:18 am, Leon Romanovsky wrote:
> > This is complimentary part to the proposed LSF/MM topic.
> > https://lore.kernel.org/linux-rdma/22df55f8-cf64-4aa8-8c0b-b556c867b926@linux.dev/T/#m85672c860539fdbbc8fe0f5ccabdc05b40269057
> > 
> > This is posted as RFC to get a feedback on proposed split, but RDMA, VFIO and
> > DMA patches are ready for review and inclusion, the NVMe patches are still in
> > progress as they require agreement on API first.
> > 
> > Thanks
> > 
> > -------------------------------------------------------------------------------
> > The DMA mapping operation performs two steps at one same time: allocates
> > IOVA space and actually maps DMA pages to that space. This one shot
> > operation works perfectly for non-complex scenarios, where callers use
> > that DMA API in control path when they setup hardware.
> > 
> > However in more complex scenarios, when DMA mapping is needed in data
> > path and especially when some sort of specific datatype is involved,
> > such one shot approach has its drawbacks.
> > 
> > That approach pushes developers to introduce new DMA APIs for specific
> > datatype. For example existing scatter-gather mapping functions, or
> > latest Chuck's RFC series to add biovec related DMA mapping [1] and
> > probably struct folio will need it too.
> > 
> > These advanced DMA mapping APIs are needed to calculate IOVA size to
> > allocate it as one chunk and some sort of offset calculations to know
> > which part of IOVA to map.
> 
> I don't follow this part at all - at *some* point, something must know a
> range of memory addresses involved in a DMA transfer, so that's where it
> should map that range for DMA. 

In all presented cases in this series, the overall DMA size is known in
advance. In RDMA case, it is known when user registers the memory, in
VFIO, when live migration is happening and in NVMe, when BIO is created.

So once we allocated IOVA, we will need to link ranges, which si the
same as map but without IOVA allocation.

> Even in a badly-designed system where the
> point it's most practical to make the mapping is further out and only knows
> that DMA will touch some subset of a buffer, but doesn't know exactly what
> subset yet, you'd usually just map the whole buffer. I don't see why the DMA
> API would ever need to know about anything other than pages/PFNs and
> dma_addr_ts (yes, it does also accept them being wrapped together in
> scatterlists; yes, scatterlists are awful and it would be nice to replace
> them with a better general DMA descriptor; that is a whole other subject of
> its own).

This is exactly what was done here, we got rid of scatterlists.

> 
> > Instead of teaching DMA to know these specific datatypes, let's separate
> > existing DMA mapping routine to two steps and give an option to advanced
> > callers (subsystems) perform all calculations internally in advance and
> > map pages later when it is needed.
> 
> From a brief look, this is clearly an awkward reinvention of the IOMMU API.
> If IOMMU-aware drivers/subsystems want to explicitly manage IOMMU address
> spaces then they can and should use the IOMMU API. Perhaps there's room for
> some quality-of-life additions to the IOMMU API to help with common usage
> patterns, but the generic DMA mapping API is absolutely not the place for
> it.

DMA mapping gives nice abstraction from IOMMU, and allows us to have
same flow for IOMMU and non-IOMMU flows without duplicating code, while
you suggest to teach almost every part in the kernel to know about IOMMU.

In this series, we changed RDMA, VFIO and NVMe, and in all cases we
removed more code than added. From what I saw, VDPA and virito-blk will
benefit from proposed API too.

Even in this RFC, where Chaitanya did partial job and didn't convert
whole driver, the gain is pretty obvious:
https://lore.kernel.org/linux-rdma/016fc02cbfa9be3c156a6f74df38def1e09c08f1.1709635535.git.leon@kernel.org/T/#u

 drivers/nvme/host/pci.c | 220 ++++++++++++++++++++++++++++++++++++++++++++++----------------------------------------------------------------------------------------------------------------------------------------------------------------
 1 file changed, 49 insertions(+), 171 deletions(-)


Thanks

> 
> Thanks,
> Robin.
> 
> > In this series, three users are converted and each of such conversion
> > presents different positive gain:
> > 1. RDMA simplifies and speeds up its pagefault handling for
> >     on-demand-paging (ODP) mode.
> > 2. VFIO PCI live migration code saves huge chunk of memory.
> > 3. NVMe PCI avoids intermediate SG table manipulation and operates
> >     directly on BIOs.
> > 
> > Thanks
> > 
> > [1] https://lore.kernel.org/all/169772852492.5232.17148564580779995849.stgit@klimt.1015granger.net
> > 
> > Chaitanya Kulkarni (2):
> >    block: add dma_link_range() based API
> >    nvme-pci: use blk_rq_dma_map() for NVMe SGL
> > 
> > Leon Romanovsky (14):
> >    mm/hmm: let users to tag specific PFNs
> >    dma-mapping: provide an interface to allocate IOVA
> >    dma-mapping: provide callbacks to link/unlink pages to specific IOVA
> >    iommu/dma: Provide an interface to allow preallocate IOVA
> >    iommu/dma: Prepare map/unmap page functions to receive IOVA
> >    iommu/dma: Implement link/unlink page callbacks
> >    RDMA/umem: Preallocate and cache IOVA for UMEM ODP
> >    RDMA/umem: Store ODP access mask information in PFN
> >    RDMA/core: Separate DMA mapping to caching IOVA and page linkage
> >    RDMA/umem: Prevent UMEM ODP creation with SWIOTLB
> >    vfio/mlx5: Explicitly use number of pages instead of allocated length
> >    vfio/mlx5: Rewrite create mkey flow to allow better code reuse
> >    vfio/mlx5: Explicitly store page list
> >    vfio/mlx5: Convert vfio to use DMA link API
> > 
> >   Documentation/core-api/dma-attributes.rst |   7 +
> >   block/blk-merge.c                         | 156 ++++++++++++++
> >   drivers/infiniband/core/umem_odp.c        | 219 +++++++------------
> >   drivers/infiniband/hw/mlx5/mlx5_ib.h      |   1 +
> >   drivers/infiniband/hw/mlx5/odp.c          |  59 +++--
> >   drivers/iommu/dma-iommu.c                 | 129 ++++++++---
> >   drivers/nvme/host/pci.c                   | 220 +++++--------------
> >   drivers/vfio/pci/mlx5/cmd.c               | 252 ++++++++++++----------
> >   drivers/vfio/pci/mlx5/cmd.h               |  22 +-
> >   drivers/vfio/pci/mlx5/main.c              | 136 +++++-------
> >   include/linux/blk-mq.h                    |   9 +
> >   include/linux/dma-map-ops.h               |  13 ++
> >   include/linux/dma-mapping.h               |  39 ++++
> >   include/linux/hmm.h                       |   3 +
> >   include/rdma/ib_umem_odp.h                |  22 +-
> >   include/rdma/ib_verbs.h                   |  54 +++++
> >   kernel/dma/debug.h                        |   2 +
> >   kernel/dma/direct.h                       |   7 +-
> >   kernel/dma/mapping.c                      |  91 ++++++++
> >   mm/hmm.c                                  |  34 +--
> >   20 files changed, 870 insertions(+), 605 deletions(-)
> >
Christoph Hellwig March 6, 2024, 2:44 p.m. UTC | #3
On Tue, Mar 05, 2024 at 02:29:35PM +0200, Leon Romanovsky wrote:
> > > These advanced DMA mapping APIs are needed to calculate IOVA size to
> > > allocate it as one chunk and some sort of offset calculations to know
> > > which part of IOVA to map.
> > 
> > I don't follow this part at all - at *some* point, something must know a
> > range of memory addresses involved in a DMA transfer, so that's where it
> > should map that range for DMA. 
> 
> In all presented cases in this series, the overall DMA size is known in
> advance. In RDMA case, it is known when user registers the memory, in
> VFIO, when live migration is happening and in NVMe, when BIO is created.
> 
> So once we allocated IOVA, we will need to link ranges, which si the
> same as map but without IOVA allocation.

But above you say:

"These advanced DMA mapping APIs are needed to calculate IOVA size to
allocate it as one chunk and some sort of offset calculations to know
which part of IOVA to map."

this suggests you need helpers to calculate the len and offset.  I
can't see where that would ever make sense.  The total transfer
size should just be passed in by the callers and be known, and
there should be no offset.

> > > Instead of teaching DMA to know these specific datatypes, let's separate
> > > existing DMA mapping routine to two steps and give an option to advanced
> > > callers (subsystems) perform all calculations internally in advance and
> > > map pages later when it is needed.
> > 
> > From a brief look, this is clearly an awkward reinvention of the IOMMU API.
> > If IOMMU-aware drivers/subsystems want to explicitly manage IOMMU address
> > spaces then they can and should use the IOMMU API. Perhaps there's room for
> > some quality-of-life additions to the IOMMU API to help with common usage
> > patterns, but the generic DMA mapping API is absolutely not the place for
> > it.
> 
> DMA mapping gives nice abstraction from IOMMU, and allows us to have
> same flow for IOMMU and non-IOMMU flows without duplicating code, while
> you suggest to teach almost every part in the kernel to know about IOMMU.

Except that the flows are fundamentally different for the "can coalesce"
vs "can't coalesce" case.  In the former we have one dma_addr_t range,
and in the latter as many as there are input vectors (this is ignoring
the weird iommu merging case where we we coalesce some but not all
segments, but I'd rather not have that in a new API).

So if we want to efficiently be able to handle these cases we need
two APIs in the driver and a good framework to switch between them.
Robins makes a point here that the iommu API handles the can coalesce
case and he has a point as that's exactly how the IOMMU API works.
I'd still prefer to wrap it with dma callers to handle things like
swiotlb and maybe Xen grant tables and to avoid the type confusion
between dma_addr_t and then untyped iova in the iommu layer, but
having this layer or not is probably worth a discussion.

> 
> In this series, we changed RDMA, VFIO and NVMe, and in all cases we
> removed more code than added. From what I saw, VDPA and virito-blk will
> benefit from proposed API too.
> 
> Even in this RFC, where Chaitanya did partial job and didn't convert
> whole driver, the gain is pretty obvious:
> https://lore.kernel.org/linux-rdma/016fc02cbfa9be3c156a6f74df38def1e09c08f1.1709635535.git.leon@kernel.org/T/#u
> 

I have no idea how that nvme patch is even supposed to work.  It removes
the PRP path in nvme-pci, which not only is the most common I/O path
but actually required for the admin queue as NVMe doesn't support
SGLs for the admin queue.
Jason Gunthorpe March 6, 2024, 3:43 p.m. UTC | #4
On Wed, Mar 06, 2024 at 03:44:16PM +0100, Christoph Hellwig wrote:

> Except that the flows are fundamentally different for the "can coalesce"
> vs "can't coalesce" case.  In the former we have one dma_addr_t range,
> and in the latter as many as there are input vectors (this is ignoring
> the weird iommu merging case where we we coalesce some but not all
> segments, but I'd rather not have that in a new API).

I don't think they are so fundamentally different, at least in our
past conversations I never came out with the idea we should burden the
driver with two different flows based on what kind of alignment the
transfer happens to have.

Certainly if we split the API to focus one API on doing only
page-aligned transfers the aligned part does become a little.

At least the RDMA drivers could productively use just a page aligned
interface. But I didn't think this would make BIO users happy so never
even thought about it..

> The total transfer size should just be passed in by the callers and
> be known, and there should be no offset.

The API needs the caller to figure out the total number of IOVA pages
it needs, rounding up the CPU ranges to full aligned pages. That
becomes the IOVA allocation.

offset is something that arises to support non-aligned transfers.

> So if we want to efficiently be able to handle these cases we need
> two APIs in the driver and a good framework to switch between them.

But, what does the non-page-aligned version look like? Doesn't it
still look basically like this?

And what is the actual difference if the input is aligned? The caller
can assume it doesn't need to provide a per-range dma_addr_t during
unmap.

It still can't assume the HW programming will be linear due to the P2P
!ACS support.

And it still has to call an API per-cpu range to actually program the
IOMMU.

So are they really so different to want different APIs? That strikes
me as a big driver cost.

> I'd still prefer to wrap it with dma callers to handle things like
> swiotlb and maybe Xen grant tables and to avoid the type confusion
> between dma_addr_t and then untyped iova in the iommu layer, but
> having this layer or not is probably worth a discussion.

I'm surprised by the idea of random drivers reaching past dma-iommu.c
and into the iommu layer to setup DMA directly on the DMA API's
iommu_domain?? That seems like completely giving up on the DMA API
abstraction to me. :(

IMHO, it needs to be wrapped, the wrapper needs to do all the special
P2P stuff, at a minimum. The wrapper should multiplex to all the
non-iommu cases for the driver too.

We still need to achieve some kind of abstraction here that doesn't
bruden every driver with different code paths for each DMA back end!
Don't we??

Jason
Christoph Hellwig March 6, 2024, 4:20 p.m. UTC | #5
On Wed, Mar 06, 2024 at 11:43:28AM -0400, Jason Gunthorpe wrote:
> I don't think they are so fundamentally different, at least in our
> past conversations I never came out with the idea we should burden the
> driver with two different flows based on what kind of alignment the
> transfer happens to have.

Then we talked past each other..

> At least the RDMA drivers could productively use just a page aligned
> interface. But I didn't think this would make BIO users happy so never
> even thought about it..

page aligned is generally the right thing for the block layer.  NVMe
for example already requires that anyway due to PRPs.

> > The total transfer size should just be passed in by the callers and
> > be known, and there should be no offset.
> 
> The API needs the caller to figure out the total number of IOVA pages
> it needs, rounding up the CPU ranges to full aligned pages. That
> becomes the IOVA allocation.

Yes, it's a basic align up to the granularity asuming we don't bother
with non-aligned transfers.

> 
> > So if we want to efficiently be able to handle these cases we need
> > two APIs in the driver and a good framework to switch between them.
> 
> But, what does the non-page-aligned version look like? Doesn't it
> still look basically like this?

I'd just rather have the non-aligned case for those who really need
it be the loop over map single region that is needed for the direct
mapping anyway.

> 
> And what is the actual difference if the input is aligned? The caller
> can assume it doesn't need to provide a per-range dma_addr_t during
> unmap.

A per-range dma_addr_t doesn't really make sense for the aligned and
coalesced case.

> It still can't assume the HW programming will be linear due to the P2P
> !ACS support.
> 
> And it still has to call an API per-cpu range to actually program the
> IOMMU.
> 
> So are they really so different to want different APIs? That strikes
> me as a big driver cost.

To not have to store a dma_address range per CPU range that doesn't
actually get used at all.
Jason Gunthorpe March 6, 2024, 5:44 p.m. UTC | #6
On Wed, Mar 06, 2024 at 05:20:22PM +0100, Christoph Hellwig wrote:
> On Wed, Mar 06, 2024 at 11:43:28AM -0400, Jason Gunthorpe wrote:
> > I don't think they are so fundamentally different, at least in our
> > past conversations I never came out with the idea we should burden the
> > driver with two different flows based on what kind of alignment the
> > transfer happens to have.
> 
> Then we talked past each other..

Well, we never talked to such detail

> > > So if we want to efficiently be able to handle these cases we need
> > > two APIs in the driver and a good framework to switch between them.
> > 
> > But, what does the non-page-aligned version look like? Doesn't it
> > still look basically like this?
> 
> I'd just rather have the non-aligned case for those who really need
> it be the loop over map single region that is needed for the direct
> mapping anyway.

There is a list of interesting cases this has to cover:

 1. Direct map. No dma_addr_t at unmap, multiple HW SGLs
 2. IOMMU aligned map, no P2P. Only IOVA range at unmap, single HW SGLs
 3. IOMMU aligned map, P2P. Only IOVA range at unmap, multiple HW SGLs
 4. swiotlb single range. Only IOVA range at unmap, single HW SGL
 5. swiotlb multi-range. All dma_addr_t's at unmap, multiple HW SGLs.
 6. Unaligned IOMMU. Only IOVA range at unmap, multiple HW SGLs

I think we agree that 1 and 2 should be optimized highly as they are
the common case. That mainly means no dma_addr_t storage in either

5 is the slowest and has the most overhead.

4 is basically the same as 2 from the driver's viewpoint

3 is quite similar to 1, but it has the IOVA range at unmap.

6 doesn't have to be optimal, from the driver perspective it can be
like 5

That is three basic driver flows 1/3, 2/4 and 5/6

So are you thinking something more like a driver flow of:

  .. extent IO and get # aligned pages and know if there is P2P ..
  dma_init_io(state, num_pages, p2p_flag)
  if (dma_io_single_range(state)) {
       // #2, #4
       for each io()
	    dma_link_aligned_pages(state, io range)
       hw_sgl = (state->iova, state->len)
  } else {
       // #1, #3, #5, #6
       hw_sgls = alloc_hw_sgls(num_ios)
       if (dma_io_needs_dma_addr_unmap(state))
	   dma_addr_storage = alloc_num_ios(); // #5 only
       for each io()
	    hw_sgl[i] = dma_map_single(state, io range)
	    if (dma_addr_storage)
	       dma_addr_storage[i] = hw_sgl[i]; // #5 only
  }

?

This is not quite what you said, we split the driver flow based on
needing 1 HW SGL vs need many HW SGL.

> > So are they really so different to want different APIs? That strikes
> > me as a big driver cost.
> 
> To not have to store a dma_address range per CPU range that doesn't
> actually get used at all.

Right, that is a nice optimization we should reach for.

Jason
Christoph Hellwig March 6, 2024, 10:14 p.m. UTC | #7
On Wed, Mar 06, 2024 at 01:44:56PM -0400, Jason Gunthorpe wrote:
> There is a list of interesting cases this has to cover:
> 
>  1. Direct map. No dma_addr_t at unmap, multiple HW SGLs
>  2. IOMMU aligned map, no P2P. Only IOVA range at unmap, single HW SGLs
>  3. IOMMU aligned map, P2P. Only IOVA range at unmap, multiple HW SGLs
>  4. swiotlb single range. Only IOVA range at unmap, single HW SGL
>  5. swiotlb multi-range. All dma_addr_t's at unmap, multiple HW SGLs.
>  6. Unaligned IOMMU. Only IOVA range at unmap, multiple HW SGLs
> 
> I think we agree that 1 and 2 should be optimized highly as they are
> the common case. That mainly means no dma_addr_t storage in either

I don't think you can do without dma_addr_t storage.  In most cases
your can just store the dma_addr_t in the LE/BE encoded hardware
SGL, so no extra storage should be needed though.

> 3 is quite similar to 1, but it has the IOVA range at unmap.

Can you explain what P2P case you mean?  The switch one with the
bus address is indeed basically the same, just with potentioally a
different offset, while the through host bridge case is the same
as a normal iommu map.

> 
> 4 is basically the same as 2 from the driver's viewpoint

I'd actually treat it the same as one.

> 5 is the slowest and has the most overhead.

and 5 could be broken into multiple 4s at least for now.  Or do you
have a different dfinition of range here?

> So are you thinking something more like a driver flow of:
> 
>   .. extent IO and get # aligned pages and know if there is P2P ..
>   dma_init_io(state, num_pages, p2p_flag)
>   if (dma_io_single_range(state)) {
>        // #2, #4
>        for each io()
> 	    dma_link_aligned_pages(state, io range)
>        hw_sgl = (state->iova, state->len)
>   } else {

I think what you have a dma_io_single_range should become before
the dma_init_io.  If we know we can't coalesce it really just is a
dma_map_{single,page,bvec} loop, no need for any extra state.

And we're back to roughly the proposal I sent out years ago.

> This is not quite what you said, we split the driver flow based on
> needing 1 HW SGL vs need many HW SGL.

That's at least what I intended to say, and I'm a little curious as what
it came across.
Jason Gunthorpe March 7, 2024, midnight UTC | #8
On Wed, Mar 06, 2024 at 11:14:00PM +0100, Christoph Hellwig wrote:
> On Wed, Mar 06, 2024 at 01:44:56PM -0400, Jason Gunthorpe wrote:
> > There is a list of interesting cases this has to cover:
> > 
> >  1. Direct map. No dma_addr_t at unmap, multiple HW SGLs
> >  2. IOMMU aligned map, no P2P. Only IOVA range at unmap, single HW SGLs
> >  3. IOMMU aligned map, P2P. Only IOVA range at unmap, multiple HW SGLs
> >  4. swiotlb single range. Only IOVA range at unmap, single HW SGL
> >  5. swiotlb multi-range. All dma_addr_t's at unmap, multiple HW SGLs.
> >  6. Unaligned IOMMU. Only IOVA range at unmap, multiple HW SGLs
> > 
> > I think we agree that 1 and 2 should be optimized highly as they are
> > the common case. That mainly means no dma_addr_t storage in either
> 
> I don't think you can do without dma_addr_t storage.  In most cases
> your can just store the dma_addr_t in the LE/BE encoded hardware
> SGL, so no extra storage should be needed though.

RDMA (and often DRM too) generally doesn't work like that, the driver
copies the page table into the device and then the only reason to have
a dma_addr_t storage is to pass that to the dma unmap API. Optionally
eliminating long term dma_addr_t storage would be a worthwhile memory
savings for large long lived user space memory registrations.

> > 3 is quite similar to 1, but it has the IOVA range at unmap.
> 
> Can you explain what P2P case you mean?  The switch one with the
> bus address is indeed basically the same, just with potentioally a
> different offset, while the through host bridge case is the same
> as a normal iommu map.

Yes, the bus address case. The IOMMU is turned on, ACS on a local
switch is off.

All pages go through the IOMMU in the normal way except P2P pages
between devices on the same switch. (ie the dma_addr_t is CPU physical
of the P2P plus an offset). RDMA must support a mixture of IOVA and
P2P addresses in the same IO operation.

I suppose it would make more sense to say it is similar to 6.

> > 5 is the slowest and has the most overhead.
> 
> and 5 could be broken into multiple 4s at least for now.  Or do you
> have a different dfinition of range here?

I wrote the list as from a single IO operation perspective, so all but
5 need to store a single IOVA range that could be stored in some
simple non-dynamic memory along with whatever HW SGLs/etc are needed.

The point of 5 being different is because the driver has to provide a
dynamically sized list of dma_addr_t's as storage until unmap. 5 is
the only case that requires that full list.

So yes, 5 could be broken up into multiple IOs, but then the
specialness of 5 is the driver must keep track of multiple IOs..

> > So are you thinking something more like a driver flow of:
> > 
> >   .. extent IO and get # aligned pages and know if there is P2P ..
> >   dma_init_io(state, num_pages, p2p_flag)
> >   if (dma_io_single_range(state)) {
> >        // #2, #4
> >        for each io()
> > 	    dma_link_aligned_pages(state, io range)
> >        hw_sgl = (state->iova, state->len)
> >   } else {
> 
> I think what you have a dma_io_single_range should become before
> the dma_init_io.  If we know we can't coalesce it really just is a
> dma_map_{single,page,bvec} loop, no need for any extra state.

I imagine dma_io_single_range() to just check a flag in state.

I still want to call dma_init_io() for the non-coalescing cases
because all the flows, regardless of composition, should be about as
fast as dma_map_sg is today.

That means we need to always pre-allocate the IOVA in any case where
the IOMMU might be active - even on a non-coalescing flow.

IOW, dma_init_io() always pre-allocates IOVA if the iommu is going to
be used and we can't just call today's dma_map_page() in a loop on the
non-coalescing side and pay the overhead of Nx IOVA allocations.

In large part this is for RDMA, were a single P2P page in a large
multi-gigabyte user memory registration shouldn't drastically harm the
registration performance by falling down to doing dma_map_page, and an
IOVA allocation, on a 4k page by page basis.

The other thing that got hand waved here is how does dma_init_io()
know which of the 6 states we are looking at? I imagine we probably
want to do something like:

   struct dma_io_summarize summary = {};
   for each io()
        dma_io_summarize_range(&summary, io range)
   dma_init_io(dev, &state, &summary);
   if (state->single_range) {
   } else {
   }
   dma_io_done_mapping(&state); <-- flush IOTLB once

At least this way the DMA API still has some decent opportunity for
abstraction and future growth using state to pass bits of information
between the API family.

There is some swiotlb complexity that needs something like this, a
system with iommu can still fail to coalesce if the pages are
encrypted and the device doesn't support DMA from encrypted pages. We
need to check for P2P pages, encrypted memory pages, and who knows
what else.

> And we're back to roughly the proposal I sent out years ago.

Well, all of this is roughly your original proposal, just with
different optimization choices and some enhancement to also cover
hmm_range_fault() users.

Enhancing the single sgl case is not a big change, I think. It does
seem simplifying for the driver to not have to coalesce SGLs to detect
the single-SGL fast-path.

> > This is not quite what you said, we split the driver flow based on
> > needing 1 HW SGL vs need many HW SGL.
> 
> That's at least what I intended to say, and I'm a little curious as what
> it came across.

Ok, I was reading the discussion more about as alignment than single
HW SGL, I think you ment alignment as implying coalescing behavior
implying single HW SGL..

Jason
Zhu Yanjun March 7, 2024, 6:01 a.m. UTC | #9
在 2024/3/5 12:18, Leon Romanovsky 写道:
> This is complimentary part to the proposed LSF/MM topic.
> https://lore.kernel.org/linux-rdma/22df55f8-cf64-4aa8-8c0b-b556c867b926@linux.dev/T/#m85672c860539fdbbc8fe0f5ccabdc05b40269057

I am interested in this topic. Hope I can join the meeting to discuss 
this topic.

Zhu Yanjun

> 
> This is posted as RFC to get a feedback on proposed split, but RDMA, VFIO and
> DMA patches are ready for review and inclusion, the NVMe patches are still in
> progress as they require agreement on API first.
> 
> Thanks
> 
> -------------------------------------------------------------------------------
> The DMA mapping operation performs two steps at one same time: allocates
> IOVA space and actually maps DMA pages to that space. This one shot
> operation works perfectly for non-complex scenarios, where callers use
> that DMA API in control path when they setup hardware.
> 
> However in more complex scenarios, when DMA mapping is needed in data
> path and especially when some sort of specific datatype is involved,
> such one shot approach has its drawbacks.
> 
> That approach pushes developers to introduce new DMA APIs for specific
> datatype. For example existing scatter-gather mapping functions, or
> latest Chuck's RFC series to add biovec related DMA mapping [1] and
> probably struct folio will need it too.
> 
> These advanced DMA mapping APIs are needed to calculate IOVA size to
> allocate it as one chunk and some sort of offset calculations to know
> which part of IOVA to map.
> 
> Instead of teaching DMA to know these specific datatypes, let's separate
> existing DMA mapping routine to two steps and give an option to advanced
> callers (subsystems) perform all calculations internally in advance and
> map pages later when it is needed.
> 
> In this series, three users are converted and each of such conversion
> presents different positive gain:
> 1. RDMA simplifies and speeds up its pagefault handling for
>     on-demand-paging (ODP) mode.
> 2. VFIO PCI live migration code saves huge chunk of memory.
> 3. NVMe PCI avoids intermediate SG table manipulation and operates
>     directly on BIOs.
> 
> Thanks
> 
> [1] https://lore.kernel.org/all/169772852492.5232.17148564580779995849.stgit@klimt.1015granger.net
> 
> Chaitanya Kulkarni (2):
>    block: add dma_link_range() based API
>    nvme-pci: use blk_rq_dma_map() for NVMe SGL
> 
> Leon Romanovsky (14):
>    mm/hmm: let users to tag specific PFNs
>    dma-mapping: provide an interface to allocate IOVA
>    dma-mapping: provide callbacks to link/unlink pages to specific IOVA
>    iommu/dma: Provide an interface to allow preallocate IOVA
>    iommu/dma: Prepare map/unmap page functions to receive IOVA
>    iommu/dma: Implement link/unlink page callbacks
>    RDMA/umem: Preallocate and cache IOVA for UMEM ODP
>    RDMA/umem: Store ODP access mask information in PFN
>    RDMA/core: Separate DMA mapping to caching IOVA and page linkage
>    RDMA/umem: Prevent UMEM ODP creation with SWIOTLB
>    vfio/mlx5: Explicitly use number of pages instead of allocated length
>    vfio/mlx5: Rewrite create mkey flow to allow better code reuse
>    vfio/mlx5: Explicitly store page list
>    vfio/mlx5: Convert vfio to use DMA link API
> 
>   Documentation/core-api/dma-attributes.rst |   7 +
>   block/blk-merge.c                         | 156 ++++++++++++++
>   drivers/infiniband/core/umem_odp.c        | 219 +++++++------------
>   drivers/infiniband/hw/mlx5/mlx5_ib.h      |   1 +
>   drivers/infiniband/hw/mlx5/odp.c          |  59 +++--
>   drivers/iommu/dma-iommu.c                 | 129 ++++++++---
>   drivers/nvme/host/pci.c                   | 220 +++++--------------
>   drivers/vfio/pci/mlx5/cmd.c               | 252 ++++++++++++----------
>   drivers/vfio/pci/mlx5/cmd.h               |  22 +-
>   drivers/vfio/pci/mlx5/main.c              | 136 +++++-------
>   include/linux/blk-mq.h                    |   9 +
>   include/linux/dma-map-ops.h               |  13 ++
>   include/linux/dma-mapping.h               |  39 ++++
>   include/linux/hmm.h                       |   3 +
>   include/rdma/ib_umem_odp.h                |  22 +-
>   include/rdma/ib_verbs.h                   |  54 +++++
>   kernel/dma/debug.h                        |   2 +
>   kernel/dma/direct.h                       |   7 +-
>   kernel/dma/mapping.c                      |  91 ++++++++
>   mm/hmm.c                                  |  34 +--
>   20 files changed, 870 insertions(+), 605 deletions(-)
>
Christoph Hellwig March 7, 2024, 3:05 p.m. UTC | #10
On Wed, Mar 06, 2024 at 08:00:36PM -0400, Jason Gunthorpe wrote:
> > 
> > I don't think you can do without dma_addr_t storage.  In most cases
> > your can just store the dma_addr_t in the LE/BE encoded hardware
> > SGL, so no extra storage should be needed though.
> 
> RDMA (and often DRM too) generally doesn't work like that, the driver
> copies the page table into the device and then the only reason to have
> a dma_addr_t storage is to pass that to the dma unmap API. Optionally
> eliminating long term dma_addr_t storage would be a worthwhile memory
> savings for large long lived user space memory registrations.

It's just kinda hard to do.  For aligned IOMMU mapping you'd only
have one dma_addr_t mappings (or maybe a few if P2P regions are
involved), so this probably doesn't matter.  For direct mappings
you'd have a few, but maybe the better answer is to use THP
more aggressively and reduce the number of segments.

> I wrote the list as from a single IO operation perspective, so all but
> 5 need to store a single IOVA range that could be stored in some
> simple non-dynamic memory along with whatever HW SGLs/etc are needed.
> 
> The point of 5 being different is because the driver has to provide a
> dynamically sized list of dma_addr_t's as storage until unmap. 5 is
> the only case that requires that full list.

No, all cases need to store one or more ranges.

> > > So are you thinking something more like a driver flow of:
> > > 
> > >   .. extent IO and get # aligned pages and know if there is P2P ..
> > >   dma_init_io(state, num_pages, p2p_flag)
> > >   if (dma_io_single_range(state)) {
> > >        // #2, #4
> > >        for each io()
> > > 	    dma_link_aligned_pages(state, io range)
> > >        hw_sgl = (state->iova, state->len)
> > >   } else {
> > 
> > I think what you have a dma_io_single_range should become before
> > the dma_init_io.  If we know we can't coalesce it really just is a
> > dma_map_{single,page,bvec} loop, no need for any extra state.
> 
> I imagine dma_io_single_range() to just check a flag in state.
> 
> I still want to call dma_init_io() for the non-coalescing cases
> because all the flows, regardless of composition, should be about as
> fast as dma_map_sg is today.

If all flows includes multiple non-coalesced regions that just makes
things very complicated, and that's exactly what I'd want to avoid.

> That means we need to always pre-allocate the IOVA in any case where
> the IOMMU might be active - even on a non-coalescing flow.
> 
> IOW, dma_init_io() always pre-allocates IOVA if the iommu is going to
> be used and we can't just call today's dma_map_page() in a loop on the
> non-coalescing side and pay the overhead of Nx IOVA allocations.
> 
> In large part this is for RDMA, were a single P2P page in a large
> multi-gigabyte user memory registration shouldn't drastically harm the
> registration performance by falling down to doing dma_map_page, and an
> IOVA allocation, on a 4k page by page basis.

But that P2P page needs to be handled very differently, as with it
we can't actually use a single iova range.  So I'm not sure how that
is even supposed to work.  If you have

 +-------+-----+-------+
 | local | P2P | local |
 +-------+-----+-------+

you need at least 3 hw SGL entries, as the IOVA won't be contigous.

> The other thing that got hand waved here is how does dma_init_io()
> know which of the 6 states we are looking at? I imagine we probably
> want to do something like:
> 
>    struct dma_io_summarize summary = {};
>    for each io()
>         dma_io_summarize_range(&summary, io range)
>    dma_init_io(dev, &state, &summary);
>    if (state->single_range) {
>    } else {
>    }
>    dma_io_done_mapping(&state); <-- flush IOTLB once

That's why I really just want 2 cases.  If the caller guarantees the
range is coalescable and there is an IOMMU use the iommu-API like
API, else just iter over map_single/page.

> Enhancing the single sgl case is not a big change, I think. It does
> seem simplifying for the driver to not have to coalesce SGLs to detect
> the single-SGL fast-path.
> 
> > > This is not quite what you said, we split the driver flow based on
> > > needing 1 HW SGL vs need many HW SGL.
> > 
> > That's at least what I intended to say, and I'm a little curious as what
> > it came across.
> 
> Ok, I was reading the discussion more about as alignment than single
> HW SGL, I think you ment alignment as implying coalescing behavior
> implying single HW SGL..

Yes.
Jason Gunthorpe March 7, 2024, 9:01 p.m. UTC | #11
On Thu, Mar 07, 2024 at 04:05:05PM +0100, Christoph Hellwig wrote:
> On Wed, Mar 06, 2024 at 08:00:36PM -0400, Jason Gunthorpe wrote:
> > > 
> > > I don't think you can do without dma_addr_t storage.  In most cases
> > > your can just store the dma_addr_t in the LE/BE encoded hardware
> > > SGL, so no extra storage should be needed though.
> > 
> > RDMA (and often DRM too) generally doesn't work like that, the driver
> > copies the page table into the device and then the only reason to have
> > a dma_addr_t storage is to pass that to the dma unmap API. Optionally
> > eliminating long term dma_addr_t storage would be a worthwhile memory
> > savings for large long lived user space memory registrations.
> 
> It's just kinda hard to do.  For aligned IOMMU mapping you'd only
> have one dma_addr_t mappings (or maybe a few if P2P regions are
> involved), so this probably doesn't matter.  For direct mappings
> you'd have a few, but maybe the better answer is to use THP
> more aggressively and reduce the number of segments.

Right, those things have all been done. 100GB of huge pages is still
using a fair amount of memory for storing dma_addr_t's.

It is hard to do perfectly, but I think it is not so bad if we focus
on the direct only case and simple systems that can exclude swiotlb
early on.

> > > > So are you thinking something more like a driver flow of:
> > > > 
> > > >   .. extent IO and get # aligned pages and know if there is P2P ..
> > > >   dma_init_io(state, num_pages, p2p_flag)
> > > >   if (dma_io_single_range(state)) {
> > > >        // #2, #4
> > > >        for each io()
> > > > 	    dma_link_aligned_pages(state, io range)
> > > >        hw_sgl = (state->iova, state->len)
> > > >   } else {
> > > 
> > > I think what you have a dma_io_single_range should become before
> > > the dma_init_io.  If we know we can't coalesce it really just is a
> > > dma_map_{single,page,bvec} loop, no need for any extra state.
> > 
> > I imagine dma_io_single_range() to just check a flag in state.
> > 
> > I still want to call dma_init_io() for the non-coalescing cases
> > because all the flows, regardless of composition, should be about as
> > fast as dma_map_sg is today.
> 
> If all flows includes multiple non-coalesced regions that just makes
> things very complicated, and that's exactly what I'd want to avoid.

I don't see how to avoid it unless we say RDMA shouldn't use this API,
which is kind of the whole point from my perspective..

I want an API that can handle all the same complexity as dma_map_sg()
without forcing the use of scatterlist. Instead "bring your own
datastructure". This is the essence of what we discussed.

An API that is inferior to dma_map_sg() is really problematic to use
with RDMA.

> > That means we need to always pre-allocate the IOVA in any case where
> > the IOMMU might be active - even on a non-coalescing flow.
> > 
> > IOW, dma_init_io() always pre-allocates IOVA if the iommu is going to
> > be used and we can't just call today's dma_map_page() in a loop on the
> > non-coalescing side and pay the overhead of Nx IOVA allocations.
> > 
> > In large part this is for RDMA, were a single P2P page in a large
> > multi-gigabyte user memory registration shouldn't drastically harm the
> > registration performance by falling down to doing dma_map_page, and an
> > IOVA allocation, on a 4k page by page basis.
> 
> But that P2P page needs to be handled very differently, as with it
> we can't actually use a single iova range.  So I'm not sure how that
> is even supposed to work.  If you have
> 
>  +-------+-----+-------+
>  | local | P2P | local |
>  +-------+-----+-------+
> 
> you need at least 3 hw SGL entries, as the IOVA won't be contigous.

Sure, 3 SGL entries is fine, that isn't what I'm pointing at

I'm saying that today if you give such a scatterlist to dma_map_sg()
it scans it and computes the IOVA space need, allocates one IOVA
space, then subdivides that single space up into the 3 HW SGLs you
show.

If you don't preserve that then we are calling, 4k at a time, a
dma_map_page() which is not anywhere close to the same outcome as what
dma_map_sg did. I may not get contiguous IOVA, I may not get 3 SGLs,
and we call into the IOVA allocator a huge number of times.

It needs to work following the same basic structure of dma_map_sg,
unfolding that logic into helpers so that the driver can provide
the data structure:

 - Scan the io ranges and figure out how much IOVA needed
   (dma_io_summarize_range)
 - Allocate the IOVA (dma_init_io)
 - Scan the io ranges again generate the final HW SGL
   (dma_io_link_page)
 - Finish the iommu batch (dma_io_done_mapping)

And you can make that pattern work for all the other cases too.

So I don't see this as particularly worse, calling some other API
instead of dma_map_page is not really a complexity on the
driver. Calling dma_init_io every time is also not a complexity. The
DMA API side is a bit more, but not substantively different logic from
what dma_map_sg already does.

Otherwise what is the alternative? How do I keep these complex things
working in RDMA and remove scatterlist?

> > The other thing that got hand waved here is how does dma_init_io()
> > know which of the 6 states we are looking at? I imagine we probably
> > want to do something like:
> > 
> >    struct dma_io_summarize summary = {};
> >    for each io()
> >         dma_io_summarize_range(&summary, io range)
> >    dma_init_io(dev, &state, &summary);
> >    if (state->single_range) {
> >    } else {
> >    }
> >    dma_io_done_mapping(&state); <-- flush IOTLB once
> 
> That's why I really just want 2 cases.  If the caller guarantees the
> range is coalescable and there is an IOMMU use the iommu-API like
> API, else just iter over map_single/page.

But how does the caller even know if it is coalescable? Other than the
trivial case of a single CPU range, that is a complicated detail based
on what pages are inside the range combined with the capability of the
device doing DMA. I don't see a simple way for the caller to figure
this out. You need to sweep every page and collect some information on
it. The above is to abstract that detail.

It was simpler before the confidential compute stuff :(

Jason
Christoph Hellwig March 8, 2024, 4:49 p.m. UTC | #12
On Thu, Mar 07, 2024 at 05:01:16PM -0400, Jason Gunthorpe wrote:
> > 
> > It's just kinda hard to do.  For aligned IOMMU mapping you'd only
> > have one dma_addr_t mappings (or maybe a few if P2P regions are
> > involved), so this probably doesn't matter.  For direct mappings
> > you'd have a few, but maybe the better answer is to use THP
> > more aggressively and reduce the number of segments.
> 
> Right, those things have all been done. 100GB of huge pages is still
> using a fair amount of memory for storing dma_addr_t's.
> 
> It is hard to do perfectly, but I think it is not so bad if we focus
> on the direct only case and simple systems that can exclude swiotlb
> early on.

Even with direct mappings only we still need to take care of
cache synchronization.

> > If all flows includes multiple non-coalesced regions that just makes
> > things very complicated, and that's exactly what I'd want to avoid.
> 
> I don't see how to avoid it unless we say RDMA shouldn't use this API,
> which is kind of the whole point from my perspective..

The DMA API callers really need to know what is P2P or not for
various reasons.  And they should generally have that information
available, either from pin_user_pages that needs to special case
it or from the in-kernel I/O submitter that build it from P2P and
normal memory.

> Sure, 3 SGL entries is fine, that isn't what I'm pointing at
> 
> I'm saying that today if you give such a scatterlist to dma_map_sg()
> it scans it and computes the IOVA space need, allocates one IOVA
> space, then subdivides that single space up into the 3 HW SGLs you
> show.
> 
> If you don't preserve that then we are calling, 4k at a time, a
> dma_map_page() which is not anywhere close to the same outcome as what
> dma_map_sg did. I may not get contiguous IOVA, I may not get 3 SGLs,
> and we call into the IOVA allocator a huge number of times.

Again, your callers must know what is a P2P region and what is not.
I don't think it is a hard burdern to do mappings at that granularity,
and we can encapsulate this in nice helpes for say the block layer
and pin_user_pages callers to start.

> 
> It needs to work following the same basic structure of dma_map_sg,
> unfolding that logic into helpers so that the driver can provide
> the data structure:
> 
>  - Scan the io ranges and figure out how much IOVA needed
>    (dma_io_summarize_range)

That is in general a function of the upper layer and not the DMA code.

>  - Allocate the IOVA (dma_init_io)

And this step is only needed for the iommu case.

> > That's why I really just want 2 cases.  If the caller guarantees the
> > range is coalescable and there is an IOMMU use the iommu-API like
> > API, else just iter over map_single/page.
> 
> But how does the caller even know if it is coalescable? Other than the
> trivial case of a single CPU range, that is a complicated detail based
> on what pages are inside the range combined with the capability of the
> device doing DMA. I don't see a simple way for the caller to figure
> this out. You need to sweep every page and collect some information on
> it. The above is to abstract that detail.

dma_get_merge_boundary already provides this information in terms
of the device capabilities.  And given that the callers knows what
is P2P and what is not we have all the information that is needed.
Jason Gunthorpe March 8, 2024, 8:23 p.m. UTC | #13
On Fri, Mar 08, 2024 at 05:49:20PM +0100, Christoph Hellwig wrote:
> On Thu, Mar 07, 2024 at 05:01:16PM -0400, Jason Gunthorpe wrote:
> > > 
> > > It's just kinda hard to do.  For aligned IOMMU mapping you'd only
> > > have one dma_addr_t mappings (or maybe a few if P2P regions are
> > > involved), so this probably doesn't matter.  For direct mappings
> > > you'd have a few, but maybe the better answer is to use THP
> > > more aggressively and reduce the number of segments.
> > 
> > Right, those things have all been done. 100GB of huge pages is still
> > using a fair amount of memory for storing dma_addr_t's.
> > 
> > It is hard to do perfectly, but I think it is not so bad if we focus
> > on the direct only case and simple systems that can exclude swiotlb
> > early on.
> 
> Even with direct mappings only we still need to take care of
> cache synchronization.

Yes, we still have to unmap, but the unmap for cache synchronization
doesn't need the dma_addr_t to flush the CPU cache.

> > > If all flows includes multiple non-coalesced regions that just makes
> > > things very complicated, and that's exactly what I'd want to avoid.
> > 
> > I don't see how to avoid it unless we say RDMA shouldn't use this API,
> > which is kind of the whole point from my perspective..
> 
> The DMA API callers really need to know what is P2P or not for
> various reasons.  And they should generally have that information
> available, either from pin_user_pages that needs to special case
> it or from the in-kernel I/O submitter that build it from P2P and
> normal memory.

I think that is a BIO thing. RDMA just calls with FOLL_PCI_P2PDMA and
shoves the resulting page list into in a scattertable. It never checks
if any returned page is P2P - it has no reason to care. dma_map_sg()
does all the work.

That is the kind of abstraction I am coming to this problem with.

You are looking at BIO where you already needed to split things up for
other reasons, but I think that is a uniquely block thing that will
not be shared in other subsystems.

> > If you don't preserve that then we are calling, 4k at a time, a
> > dma_map_page() which is not anywhere close to the same outcome as what
> > dma_map_sg did. I may not get contiguous IOVA, I may not get 3 SGLs,
> > and we call into the IOVA allocator a huge number of times.
> 
> Again, your callers must know what is a P2P region and what is not.

I don't see this at all. We don't do this today in RDMA. There is no
"P2P region".

> > > That's why I really just want 2 cases.  If the caller guarantees the
> > > range is coalescable and there is an IOMMU use the iommu-API like
> > > API, else just iter over map_single/page.
> > 
> > But how does the caller even know if it is coalescable? Other than the
> > trivial case of a single CPU range, that is a complicated detail based
> > on what pages are inside the range combined with the capability of the
> > device doing DMA. I don't see a simple way for the caller to figure
> > this out. You need to sweep every page and collect some information on
> > it. The above is to abstract that detail.
> 
> dma_get_merge_boundary already provides this information in terms
> of the device capabilities.  And given that the callers knows what
> is P2P and what is not we have all the information that is needed.

Encrypted memory too.

RDMA also doesn't call dma_get_merge_boundary(). It doesn't keep track
of P2P regions. It doesn't break out encrypted memory. It has no
purpose to do any of those things.

You fundamentally cannot subdivide a memory registration.

So we could artificially introduce the concept of limited coalescing
into RDMA, dmabuf and others just to drive this new API - but really
that feels much much worse than just making the DMA API still able to
do IOMMU coalescing in more cases.

Even if we did that, it will still be less efficient than today where
we just call dma_map_sg() on the jumble of pages.

Jason
Christoph Hellwig March 9, 2024, 4:14 p.m. UTC | #14
On Fri, Mar 08, 2024 at 04:23:42PM -0400, Jason Gunthorpe wrote:
> > The DMA API callers really need to know what is P2P or not for
> > various reasons.  And they should generally have that information
> > available, either from pin_user_pages that needs to special case
> > it or from the in-kernel I/O submitter that build it from P2P and
> > normal memory.
> 
> I think that is a BIO thing. RDMA just calls with FOLL_PCI_P2PDMA and
> shoves the resulting page list into in a scattertable. It never checks
> if any returned page is P2P - it has no reason to care. dma_map_sg()
> does all the work.

Right now it does, but that's not really a good interface.  If we have
a pin_user_pages variant that only pins until the next relevant P2P
boundary and tells you about we can significantly simplify the overall
interface.
Leon Romanovsky March 10, 2024, 9:35 a.m. UTC | #15
On Sat, Mar 09, 2024 at 05:14:18PM +0100, Christoph Hellwig wrote:
> On Fri, Mar 08, 2024 at 04:23:42PM -0400, Jason Gunthorpe wrote:
> > > The DMA API callers really need to know what is P2P or not for
> > > various reasons.  And they should generally have that information
> > > available, either from pin_user_pages that needs to special case
> > > it or from the in-kernel I/O submitter that build it from P2P and
> > > normal memory.
> > 
> > I think that is a BIO thing. RDMA just calls with FOLL_PCI_P2PDMA and
> > shoves the resulting page list into in a scattertable. It never checks
> > if any returned page is P2P - it has no reason to care. dma_map_sg()
> > does all the work.
> 
> Right now it does, but that's not really a good interface.  If we have
> a pin_user_pages variant that only pins until the next relevant P2P
> boundary and tells you about we can significantly simplify the overall
> interface.

And you will need to have a way to instruct that pin_user_pages() variant
to continue anyway, because you asked for FOLL_PCI_P2PDMA. Without that
force, you will have !FOLL_PCI_P2PDMA behaviour.

When you say "simplify the overall interface", which interface do you mean?

Thanks
Christoph Hellwig March 12, 2024, 9:28 p.m. UTC | #16
On Sun, Mar 10, 2024 at 11:35:13AM +0200, Leon Romanovsky wrote:
> And you will need to have a way to instruct that pin_user_pages() variant
> to continue anyway, because you asked for FOLL_PCI_P2PDMA. Without that
> force, you will have !FOLL_PCI_P2PDMA behaviour.

I don't understand what you mean.

> When you say "simplify the overall interface", which interface do you mean?

Primarily the dma mapping interface.  Secondarily also everything around
it.
Leon Romanovsky March 13, 2024, 7:46 a.m. UTC | #17
On Tue, Mar 12, 2024 at 10:28:44PM +0100, Christoph Hellwig wrote:
> On Sun, Mar 10, 2024 at 11:35:13AM +0200, Leon Romanovsky wrote:
> > And you will need to have a way to instruct that pin_user_pages() variant
> > to continue anyway, because you asked for FOLL_PCI_P2PDMA. Without that
> > force, you will have !FOLL_PCI_P2PDMA behaviour.
> 
> I don't understand what you mean.

Jason talked about the need to call to pin_user_pages(..., gup_flags | FOLL_PCI_P2PDMA, ...),
but in your proposal this call won't be possible anymore.

> 
> > When you say "simplify the overall interface", which interface do you mean?
> 
> Primarily the dma mapping interface.  Secondarily also everything around
> it.

OK, thanks
Christoph Hellwig March 13, 2024, 9:44 p.m. UTC | #18
On Wed, Mar 13, 2024 at 09:46:36AM +0200, Leon Romanovsky wrote:
> On Tue, Mar 12, 2024 at 10:28:44PM +0100, Christoph Hellwig wrote:
> > On Sun, Mar 10, 2024 at 11:35:13AM +0200, Leon Romanovsky wrote:
> > > And you will need to have a way to instruct that pin_user_pages() variant
> > > to continue anyway, because you asked for FOLL_PCI_P2PDMA. Without that
> > > force, you will have !FOLL_PCI_P2PDMA behaviour.
> > 
> > I don't understand what you mean.
> 
> Jason talked about the need to call to pin_user_pages(..., gup_flags | FOLL_PCI_P2PDMA, ...),
> but in your proposal this call won't be possible anymore.

Why?
Jason Gunthorpe March 19, 2024, 3:36 p.m. UTC | #19
On Sat, Mar 09, 2024 at 05:14:18PM +0100, Christoph Hellwig wrote:
> On Fri, Mar 08, 2024 at 04:23:42PM -0400, Jason Gunthorpe wrote:
> > > The DMA API callers really need to know what is P2P or not for
> > > various reasons.  And they should generally have that information
> > > available, either from pin_user_pages that needs to special case
> > > it or from the in-kernel I/O submitter that build it from P2P and
> > > normal memory.
> > 
> > I think that is a BIO thing. RDMA just calls with FOLL_PCI_P2PDMA and
> > shoves the resulting page list into in a scattertable. It never checks
> > if any returned page is P2P - it has no reason to care. dma_map_sg()
> > does all the work.
> 
> Right now it does, but that's not really a good interface.  If we have
> a pin_user_pages variant that only pins until the next relevant P2P
> boundary and tells you about we can significantly simplify the overall
> interface.

Sorry for the delay, I was away..

I kind of understand your thinking on the DMA side, but I don't see
how this is good for users of the API beyond BIO.

How will this make RDMA better? We have one MR, the MR has pages, the
HW doesn't care about the SW distinction of p2p, swiotlb, direct,
encrypted, iommu, etc. It needs to create one HW page list for
whatever user VA range was given.

Or worse, whatever thing is inside a DMABUF from a DRM
driver. DMABUF's can have a (dynamic!) mixture of P2P and regular
AFAIK based on the GPU's migration behavior.

Or triple worse, ODP can dynamically change on a page by page basis
the type depending on what hmm_range_fault() returns.

So I take it as a requirement that RDMA MUST make single MR's out of a
hodgepodge of page types. RDMA MRs cannot be split. Multiple MR's are
not a functional replacement for a single MR.

Go back to the start of what are we trying to do here:
 1) Make a DMA API that can support hmm_range_fault() users in a
    sensible and performant way
 2) Make a DMA API that can support RDMA MR's backed by DMABUF's, and
    user VA's without restriction
 3) Allow to remove scatterlist from BIO paths
 4) Provide a DMABUF API that is not scatterlist that can feed into
    the new DMA API - again supporting DMABUF's hodgepodge of types.

I'd like to do all of these things. I know 3 is your highest priority,
but it is my lowest :)

So, if the new API can only do uniformity I immediately loose #1 -
hmm_range_fault() can't guarentee anything, so it looses the IOVA
optimization that Leon's patches illustrate.

For uniformity #2 probably needs multiple DMA API "transactions". This
is doable, but it is less performant than one "transaction".

#3 is perfectly happy because BIO already creates uniformity

#4 is like #2, there is not guarenteed uniformity inside DMABUF so
every DMABUF importer needs to take some complexity to deal with
it. There are many DMABUF importers so this feels like a poor API
abstraction if we force everyone there to take on complexity.

So I'm just not seeing why this would be better. I think Leon's series
shows the cost of non-uniformity support is actually pretty
small. Still, we could do better, if the caller can optionally
indicate it knows it has uniformity then that can be optimized futher.

I'd like to find something that works well for all of the above, and I
think abstracting non-uniformity at the API level is important for the
above reasons.

Can we tweak what Leon has done to keep the hmm_range_fault support
and non-uniformity for RDMA but add a uniformity optimized flow for
BIO?

Jason
Leon Romanovsky March 20, 2024, 8:55 a.m. UTC | #20
On Tue, Mar 19, 2024 at 12:36:20PM -0300, Jason Gunthorpe wrote:
> On Sat, Mar 09, 2024 at 05:14:18PM +0100, Christoph Hellwig wrote:
> > On Fri, Mar 08, 2024 at 04:23:42PM -0400, Jason Gunthorpe wrote:
> > > > The DMA API callers really need to know what is P2P or not for
> > > > various reasons.  And they should generally have that information
> > > > available, either from pin_user_pages that needs to special case
> > > > it or from the in-kernel I/O submitter that build it from P2P and
> > > > normal memory.
> > > 
> > > I think that is a BIO thing. RDMA just calls with FOLL_PCI_P2PDMA and
> > > shoves the resulting page list into in a scattertable. It never checks
> > > if any returned page is P2P - it has no reason to care. dma_map_sg()
> > > does all the work.
> > 
> > Right now it does, but that's not really a good interface.  If we have
> > a pin_user_pages variant that only pins until the next relevant P2P
> > boundary and tells you about we can significantly simplify the overall
> > interface.
> 
> Sorry for the delay, I was away..

<...>

> Can we tweak what Leon has done to keep the hmm_range_fault support
> and non-uniformity for RDMA but add a uniformity optimized flow for
> BIO?

Something like this will do the trick.

From 45e739e7073fb04bc168624f77320130bb3f9267 Mon Sep 17 00:00:00 2001
Message-ID: <45e739e7073fb04bc168624f77320130bb3f9267.1710924764.git.leonro@nvidia.com>
From: Leon Romanovsky <leonro@nvidia.com>
Date: Mon, 18 Mar 2024 11:16:41 +0200
Subject: [PATCH] mm/gup: add strict interface to pin user pages according to
 FOLL flag

All pin_user_pages*() and get_user_pages*() callbacks allocate user
pages by partially taking into account their p2p vs. non-p2p properties.

In case, user sets FOLL_PCI_P2PDMA flag, the allocated pages will include
both p2p and "regular" pages, while if FOLL_PCI_P2PDMA flag is not provided,
only regular pages are returned.

In order to make sure that with FOLL_PCI_P2PDMA flag, only p2p pages are
returned, let's introduce new internal FOLL_STRICT flag and provide special
pin_user_pages_fast_strict() API call.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 include/linux/mm.h |  3 +++
 mm/gup.c           | 36 +++++++++++++++++++++++++++++++++++-
 mm/internal.h      |  4 +++-
 3 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f5a97dec5169..910b65dde24a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2491,6 +2491,9 @@ int pin_user_pages_fast(unsigned long start, int nr_pages,
 			unsigned int gup_flags, struct page **pages);
 void folio_add_pin(struct folio *folio);
 
+int pin_user_pages_fast_strict(unsigned long start, int nr_pages,
+			       unsigned int gup_flags, struct page **pages);
+
 int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc);
 int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
 			struct task_struct *task, bool bypass_rlim);
diff --git a/mm/gup.c b/mm/gup.c
index df83182ec72d..11b5c626a4ab 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -133,6 +133,10 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
 	if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
 		return NULL;
 
+	if (flags & FOLL_STRICT)
+		if (flags & FOLL_PCI_P2PDMA && !is_pci_p2pdma_page(page))
+			return NULL;
+
 	if (flags & FOLL_GET)
 		return try_get_folio(page, refs);
 
@@ -232,6 +236,10 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
 	if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
 		return -EREMOTEIO;
 
+	if (flags & FOLL_STRICT)
+		if (flags & FOLL_PCI_P2PDMA && !is_pci_p2pdma_page(page))
+			return -EREMOTEIO;
+
 	if (flags & FOLL_GET)
 		folio_ref_inc(folio);
 	else if (flags & FOLL_PIN) {
@@ -2243,6 +2251,8 @@ static bool is_valid_gup_args(struct page **pages, int *locked,
 	 * - FOLL_TOUCH/FOLL_PIN/FOLL_TRIED/FOLL_FAST_ONLY are internal only
 	 * - FOLL_REMOTE is internal only and used on follow_page()
 	 * - FOLL_UNLOCKABLE is internal only and used if locked is !NULL
+	 * - FOLL_STRICT is internal only and used to distinguish between p2p
+	 *   and "regular" pages.
 	 */
 	if (WARN_ON_ONCE(gup_flags & INTERNAL_GUP_FLAGS))
 		return false;
@@ -3187,7 +3197,8 @@ static int internal_get_user_pages_fast(unsigned long start,
 	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
 				       FOLL_FORCE | FOLL_PIN | FOLL_GET |
 				       FOLL_FAST_ONLY | FOLL_NOFAULT |
-				       FOLL_PCI_P2PDMA | FOLL_HONOR_NUMA_FAULT)))
+				       FOLL_PCI_P2PDMA | FOLL_HONOR_NUMA_FAULT |
+				       FOLL_STRICT)))
 		return -EINVAL;
 
 	if (gup_flags & FOLL_PIN)
@@ -3322,6 +3333,29 @@ int pin_user_pages_fast(unsigned long start, int nr_pages,
 }
 EXPORT_SYMBOL_GPL(pin_user_pages_fast);
 
+/**
+ * pin_user_pages_fast_strict() - this is pin_user_pages_fast() variant, which
+ * makes sure that only pages with same properties are pinned.
+ *
+ * @start:      starting user address
+ * @nr_pages:   number of pages from start to pin
+ * @gup_flags:  flags modifying pin behaviour
+ * @pages:      array that receives pointers to the pages pinned.
+ *              Should be at least nr_pages long.
+ *
+ * Nearly the same as pin_user_pages_fastt(), except that FOLL_STRICT is set.
+ *
+ * FOLL_STRICT means that the pages are allocated with specific FOLL_* properties.
+ */
+int pin_user_pages_fast_strict(unsigned long start, int nr_pages,
+			       unsigned int gup_flags, struct page **pages)
+{
+	if (!is_valid_gup_args(pages, NULL, &gup_flags, FOLL_PIN | FOLL_STRICT))
+		return -EINVAL;
+	return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
+}
+EXPORT_SYMBOL_GPL(pin_user_pages_fast_strict);
+
 /**
  * pin_user_pages_remote() - pin pages of a remote process
  *
diff --git a/mm/internal.h b/mm/internal.h
index f309a010d50f..7578837a0444 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1031,10 +1031,12 @@ enum {
 	FOLL_FAST_ONLY = 1 << 20,
 	/* allow unlocking the mmap lock */
 	FOLL_UNLOCKABLE = 1 << 21,
+	/* don't mix pages with different properties, e.g. p2p with "regular" ones */
+	FOLL_STRICT = 1 << 22,
 };
 
 #define INTERNAL_GUP_FLAGS (FOLL_TOUCH | FOLL_TRIED | FOLL_REMOTE | FOLL_PIN | \
-			    FOLL_FAST_ONLY | FOLL_UNLOCKABLE)
+			    FOLL_FAST_ONLY | FOLL_UNLOCKABLE | FOLL_STRICT)
 
 /*
  * Indicates for which pages that are write-protected in the page table,
Christoph Hellwig March 21, 2024, 10:39 p.m. UTC | #21
On Tue, Mar 19, 2024 at 12:36:20PM -0300, Jason Gunthorpe wrote:
> I kind of understand your thinking on the DMA side, but I don't see
> how this is good for users of the API beyond BIO.
> 
> How will this make RDMA better? We have one MR, the MR has pages, the
> HW doesn't care about the SW distinction of p2p, swiotlb, direct,
> encrypted, iommu, etc. It needs to create one HW page list for
> whatever user VA range was given.

Well, the hardware (as in the PCIe card) never cares.  But the setup
path for the IOMMU does, and something in the OS needs to know about
it.  So unless we want to stash away a 'is this P2P' flag in every
page / SG entry / bvec, or a do a lookup to find that out for each
of them we need to manage chunks at these boundaries.  And that's
what I'm proposing.

> Or worse, whatever thing is inside a DMABUF from a DRM
> driver. DMABUF's can have a (dynamic!) mixture of P2P and regular
> AFAIK based on the GPU's migration behavior.

And that's fine.  We just need to track it efficiently.

> 
> Or triple worse, ODP can dynamically change on a page by page basis
> the type depending on what hmm_range_fault() returns.

Same.  If this changes all the time you need to track it.  And we
should find a way to shared the code if we have multiple users for it.

But most DMA API consumers will never see P2P, and when they see it
it will be static.  So don't build the DMA API to automically do
the (not exactly super cheap) checks and add complexity for it.

> So I take it as a requirement that RDMA MUST make single MR's out of a
> hodgepodge of page types. RDMA MRs cannot be split. Multiple MR's are
> not a functional replacement for a single MR.

But MRs consolidate multiple dma addresses anyway.

> Go back to the start of what are we trying to do here:
>  1) Make a DMA API that can support hmm_range_fault() users in a
>     sensible and performant way
>  2) Make a DMA API that can support RDMA MR's backed by DMABUF's, and
>     user VA's without restriction
>  3) Allow to remove scatterlist from BIO paths
>  4) Provide a DMABUF API that is not scatterlist that can feed into
>     the new DMA API - again supporting DMABUF's hodgepodge of types.
> 
> I'd like to do all of these things. I know 3 is your highest priority,
> but it is my lowest :)

Well, 3 an 4.  And 3 is not just limited to bio, but all the other
pointless scatterlist uses.
Christoph Hellwig March 21, 2024, 10:40 p.m. UTC | #22
On Wed, Mar 20, 2024 at 10:55:36AM +0200, Leon Romanovsky wrote:
> Something like this will do the trick.

As far as I can tell it totally misses the point.  Which is not to never
return non-P2P if the flag is set, but to return either all P2P or non-P2
P and not create a boundary in the single call.
Leon Romanovsky March 22, 2024, 5:46 p.m. UTC | #23
On Thu, Mar 21, 2024 at 11:40:13PM +0100, Christoph Hellwig wrote:
> On Wed, Mar 20, 2024 at 10:55:36AM +0200, Leon Romanovsky wrote:
> > Something like this will do the trick.
> 
> As far as I can tell it totally misses the point.  Which is not to never
> return non-P2P if the flag is set, but to return either all P2P or non-P2
> P and not create a boundary in the single call.

You are treating FOLL_PCI_P2PDMA as a hint, but in iov_iter_extract_user_pages()
you set it only for p2p queues. I was under impression that you want only p2p pages
in these queues.

Anyway, I can prepare other patch that will return or p2p or non-p2p pages in one shot.

Thanks
Jason Gunthorpe March 22, 2024, 6:43 p.m. UTC | #24
On Thu, Mar 21, 2024 at 11:39:10PM +0100, Christoph Hellwig wrote:
> On Tue, Mar 19, 2024 at 12:36:20PM -0300, Jason Gunthorpe wrote:
> > I kind of understand your thinking on the DMA side, but I don't see
> > how this is good for users of the API beyond BIO.
> > 
> > How will this make RDMA better? We have one MR, the MR has pages, the
> > HW doesn't care about the SW distinction of p2p, swiotlb, direct,
> > encrypted, iommu, etc. It needs to create one HW page list for
> > whatever user VA range was given.
> 
> Well, the hardware (as in the PCIe card) never cares.  But the setup
> path for the IOMMU does, and something in the OS needs to know about
> it.  So unless we want to stash away a 'is this P2P' flag in every
> page / SG entry / bvec, or a do a lookup to find that out for each
> of them we need to manage chunks at these boundaries.  And that's
> what I'm proposing.

Okay, if we look at the struct-page-less world (which we want for
DMABUF) then we need to keep track for sure. What I had drafted was to
keep track in the new "per-SG entry" because that seemed easiest to
migrate existing code into.

Though the datastructure could also be written to be a list of uniform
memory types and then a list of SG entries. (more like how bio is
organized)

No idea right now which is better, and I'm happy make it go either
way.

But Leon's series is not quite getting to this, it it still struct
page based and struct page itself has all the metadata - though as you
say it is a bit expensive to access.

> > Or worse, whatever thing is inside a DMABUF from a DRM
> > driver. DMABUF's can have a (dynamic!) mixture of P2P and regular
> > AFAIK based on the GPU's migration behavior.
> 
> And that's fine.  We just need to track it efficiently.

Right, DMABUF/etc will return a something that has a list of physical
addresses and some meta-data to indicate the "p2p memory provider" for
the P2P part.

Perhaps it could be as simple as 1 bit in the physical address/length
and a global "P2P memory provider" pointer for the entire DMA
BUF. Unclear to me right now, but sure.

> > Or triple worse, ODP can dynamically change on a page by page basis
> > the type depending on what hmm_range_fault() returns.
> 
> Same.  If this changes all the time you need to track it.  And we
> should find a way to shared the code if we have multiple users for it.

ODP (for at least the forseeable furture) is simpler because it is
always struct page based so we don't need more metadata if we pay the
cost to reach into the struct page. I suspect that is the right trade
off for hmm_range_fault users.

> But most DMA API consumers will never see P2P, and when they see it
> it will be static.  So don't build the DMA API to automically do
> the (not exactly super cheap) checks and add complexity for it.

Okay, I think I get what you'd like to see.

If we are going to make caller provided uniformity a requirement, lets
imagine a formal memory type idea to help keep this a little
abstracted?

 DMA_MEMORY_TYPE_NORMAL
 DMA_MEMORY_TYPE_P2P_NOT_ACS
 DMA_MEMORY_TYPE_ENCRYPTED
 DMA_MEMORY_TYPE_BOUNCE_BUFFER  // ??

Then maybe the driver flow looks like:

	if (transaction.memory_type == DMA_MEMORY_TYPE_NORMAL && dma_api_has_iommu(dev)) {
		struct dma_api_iommu_state state;

		dma_api_iommu_start(&state, transaction.num_pages);
		for_each_range(transaction, range)
			dma_api_iommu_map_range(&state, range.start_page, range.length);
		num_hwsgls = 1;
		hwsgl.addr = state.iova;
		hwsgl.length = transaction.length
		dma_api_iommu_batch_done(&state);
	} else if (transaction.memory_type == DMA_MEMORY_TYPE_P2P_NOT_ACS) {
		num_hwsgls = transcation.num_sgls;
		for_each_range(transaction, range) {
			hwsgl[i].addr = dma_api_p2p_not_acs_map(range.start_physical, range.length, p2p_memory_provider);
			hwsgl[i].len = range.size;
		}
	} else {
		/* Must be DMA_MEMORY_TYPE_NORMAL, DMA_MEMORY_TYPE_ENCRYPTED, DMA_MEMORY_TYPE_BOUNCE_BUFFER? */
		num_hwsgls = transcation.num_sgls;
		for_each_range(transaction, range) {
			hwsgl[i].addr = dma_api_map_cpu_page(range.start_page, range.length);
			hwsgl[i].len = range.size;
		}
	}

And the hmm_range_fault case is sort of like:

		struct dma_api_iommu_state state;
		dma_api_iommu_start(&state, mr.num_pages);

		[..]
		hmm_range_fault(...)
		if (present)
			dma_link_page(&state, faulting_address_offset, page);
		else
			dma_unlink_page(&state, faulting_address_offset, page);

Is this looking closer?

> > So I take it as a requirement that RDMA MUST make single MR's out of a
> > hodgepodge of page types. RDMA MRs cannot be split. Multiple MR's are
> > not a functional replacement for a single MR.
> 
> But MRs consolidate multiple dma addresses anyway.

I'm not sure I understand this?
 
> > Go back to the start of what are we trying to do here:
> >  1) Make a DMA API that can support hmm_range_fault() users in a
> >     sensible and performant way
> >  2) Make a DMA API that can support RDMA MR's backed by DMABUF's, and
> >     user VA's without restriction
> >  3) Allow to remove scatterlist from BIO paths
> >  4) Provide a DMABUF API that is not scatterlist that can feed into
> >     the new DMA API - again supporting DMABUF's hodgepodge of types.
> > 
> > I'd like to do all of these things. I know 3 is your highest priority,
> > but it is my lowest :)
> 
> Well, 3 an 4.  And 3 is not just limited to bio, but all the other
> pointless scatterlist uses.

Well, I didn't write a '5) remove all the other pointless scatterlist
case' :)

Anyhow, I think we all agree on the high level objective, we just need
to get to an API that fuses all of these goals together.

To go back to my main thesis - I would like a high performance low
level DMA API that is capable enough that it could implement
scatterlist dma_map_sg() and thus also implement any future
scatterlist_v2, bio, hmm_range_fault or any other thing we come up
with on top of it. This is broadly what I thought we agreed to at LSF
last year.

Jason
Christoph Hellwig March 24, 2024, 11:16 p.m. UTC | #25
On Fri, Mar 22, 2024 at 07:46:17PM +0200, Leon Romanovsky wrote:
> > As far as I can tell it totally misses the point.  Which is not to never
> > return non-P2P if the flag is set, but to return either all P2P or non-P2
> > P and not create a boundary in the single call.
> 
> You are treating FOLL_PCI_P2PDMA as a hint, but in iov_iter_extract_user_pages()
> you set it only for p2p queues. I was under impression that you want only p2p pages
> in these queues.

FOLL_PCI_P2PDMA is an indicator that the caller can cope with P2P
pages.  Most callers simply can't.
Christoph Hellwig March 24, 2024, 11:22 p.m. UTC | #26
On Fri, Mar 22, 2024 at 03:43:30PM -0300, Jason Gunthorpe wrote:
> If we are going to make caller provided uniformity a requirement, lets
> imagine a formal memory type idea to help keep this a little
> abstracted?
> 
>  DMA_MEMORY_TYPE_NORMAL
>  DMA_MEMORY_TYPE_P2P_NOT_ACS
>  DMA_MEMORY_TYPE_ENCRYPTED
>  DMA_MEMORY_TYPE_BOUNCE_BUFFER  // ??
> 
> Then maybe the driver flow looks like:
> 
> 	if (transaction.memory_type == DMA_MEMORY_TYPE_NORMAL && dma_api_has_iommu(dev)) {

Add a nice helper to make this somewhat readable, but yes.

> 	} else if (transaction.memory_type == DMA_MEMORY_TYPE_P2P_NOT_ACS) {
> 		num_hwsgls = transcation.num_sgls;
> 		for_each_range(transaction, range) {
> 			hwsgl[i].addr = dma_api_p2p_not_acs_map(range.start_physical, range.length, p2p_memory_provider);
> 			hwsgl[i].len = range.size;
> 		}
> 	} else {
> 		/* Must be DMA_MEMORY_TYPE_NORMAL, DMA_MEMORY_TYPE_ENCRYPTED, DMA_MEMORY_TYPE_BOUNCE_BUFFER? */
> 		num_hwsgls = transcation.num_sgls;
> 		for_each_range(transaction, range) {
> 			hwsgl[i].addr = dma_api_map_cpu_page(range.start_page, range.length);
> 			hwsgl[i].len = range.size;
> 		}
>

And these two are really the same except that we call a different map
helper underneath.  So I think as far as the driver is concerned
they should be the same, the DMA API just needs to key off the
memory tap.

> And the hmm_range_fault case is sort of like:
> 
> 		struct dma_api_iommu_state state;
> 		dma_api_iommu_start(&state, mr.num_pages);
> 
> 		[..]
> 		hmm_range_fault(...)
> 		if (present)
> 			dma_link_page(&state, faulting_address_offset, page);
> 		else
> 			dma_unlink_page(&state, faulting_address_offset, page);
> 
> Is this looking closer?

Yes.

> > > So I take it as a requirement that RDMA MUST make single MR's out of a
> > > hodgepodge of page types. RDMA MRs cannot be split. Multiple MR's are
> > > not a functional replacement for a single MR.
> > 
> > But MRs consolidate multiple dma addresses anyway.
> 
> I'm not sure I understand this?

The RDMA MRs take a a list of PFNish address, (or SGLs with the
enhanced MRs from Mellanox) and give you back a single rkey/lkey.

> To go back to my main thesis - I would like a high performance low
> level DMA API that is capable enough that it could implement
> scatterlist dma_map_sg() and thus also implement any future
> scatterlist_v2, bio, hmm_range_fault or any other thing we come up
> with on top of it. This is broadly what I thought we agreed to at LSF
> last year.

I think the biggest underlying problem of the scatterlist based
DMA implementation for IOMMUs is that it's trying to handle to much,
that is magic coalescing even if the segments boundaries don't align
with the IOMMU page size.  If we can get rid of that misfeature I
think we'd greatly simply the API and implementation.
Jason Gunthorpe March 27, 2024, 5:14 p.m. UTC | #27
On Mon, Mar 25, 2024 at 12:22:15AM +0100, Christoph Hellwig wrote:
> On Fri, Mar 22, 2024 at 03:43:30PM -0300, Jason Gunthorpe wrote:
> > If we are going to make caller provided uniformity a requirement, lets
> > imagine a formal memory type idea to help keep this a little
> > abstracted?
> > 
> >  DMA_MEMORY_TYPE_NORMAL
> >  DMA_MEMORY_TYPE_P2P_NOT_ACS
> >  DMA_MEMORY_TYPE_ENCRYPTED
> >  DMA_MEMORY_TYPE_BOUNCE_BUFFER  // ??
> > 
> > Then maybe the driver flow looks like:
> > 
> > 	if (transaction.memory_type == DMA_MEMORY_TYPE_NORMAL && dma_api_has_iommu(dev)) {
> 
> Add a nice helper to make this somewhat readable, but yes.
> 
> > 	} else if (transaction.memory_type == DMA_MEMORY_TYPE_P2P_NOT_ACS) {
> > 		num_hwsgls = transcation.num_sgls;
> > 		for_each_range(transaction, range) {
> > 			hwsgl[i].addr = dma_api_p2p_not_acs_map(range.start_physical, range.length, p2p_memory_provider);
> > 			hwsgl[i].len = range.size;
> > 		}
> > 	} else {
> > 		/* Must be DMA_MEMORY_TYPE_NORMAL, DMA_MEMORY_TYPE_ENCRYPTED, DMA_MEMORY_TYPE_BOUNCE_BUFFER? */
> > 		num_hwsgls = transcation.num_sgls;
> > 		for_each_range(transaction, range) {
> > 			hwsgl[i].addr = dma_api_map_cpu_page(range.start_page, range.length);
> > 			hwsgl[i].len = range.size;
> > 		}
> >
> 
> And these two are really the same except that we call a different map
> helper underneath.  So I think as far as the driver is concerned
> they should be the same, the DMA API just needs to key off the
> memory tap.

Yeah.. If the caller is going to have compute the memory type of the
range then lets pass it to the helper

dma_api_map_memory_type(transaction.memory_type, range.start_page, range.length);

Then we can just hide all the differences under the API without doing
duplicated work.

Function names need some work ...

> > > > So I take it as a requirement that RDMA MUST make single MR's out of a
> > > > hodgepodge of page types. RDMA MRs cannot be split. Multiple MR's are
> > > > not a functional replacement for a single MR.
> > > 
> > > But MRs consolidate multiple dma addresses anyway.
> > 
> > I'm not sure I understand this?
> 
> The RDMA MRs take a a list of PFNish address, (or SGLs with the
> enhanced MRs from Mellanox) and give you back a single rkey/lkey.

Yes, that is the desire.
 
> > To go back to my main thesis - I would like a high performance low
> > level DMA API that is capable enough that it could implement
> > scatterlist dma_map_sg() and thus also implement any future
> > scatterlist_v2, bio, hmm_range_fault or any other thing we come up
> > with on top of it. This is broadly what I thought we agreed to at LSF
> > last year.
> 
> I think the biggest underlying problem of the scatterlist based
> DMA implementation for IOMMUs is that it's trying to handle to much,
> that is magic coalescing even if the segments boundaries don't align
> with the IOMMU page size.  If we can get rid of that misfeature I
> think we'd greatly simply the API and implementation.

Yeah, that stuff is not easy at all and takes extra computation to
figure out. I always assumed it was there for block...

Leon & Chaitanya will make a RFC v2 along these lines, lets see how it
goes.

Thanks,
Jason
Zhu Yanjun April 9, 2024, 8:39 p.m. UTC | #28
在 2024/3/7 7:01, Zhu Yanjun 写道:
> 在 2024/3/5 12:18, Leon Romanovsky 写道:
>> This is complimentary part to the proposed LSF/MM topic.
>> https://lore.kernel.org/linux-rdma/22df55f8-cf64-4aa8-8c0b-b556c867b926@linux.dev/T/#m85672c860539fdbbc8fe0f5ccabdc05b40269057
> 
> I am interested in this topic. Hope I can join the meeting to discuss 
> this topic.
> 

With the same idea, in the IDPF driver, the function dma_alloc_coherent 
which is called in the IDPF driver can be devided into the following 2 
functions:

iommu_dma_alloc_pages

and

iommu_dma_map_page

So the function iommu_dma_alloc_pages allocates pages, 
iommu_dma_map_page makes mapping between pages and IOVA.

Now the above idea is implemented in the NIC driver. Currently it can 
work well.

Next the above idea will be implemented in the block device. Hope this 
can increase the performance of the block device.

Best Regards,
Zhu Yanjun

> Zhu Yanjun
> 
>>
>> This is posted as RFC to get a feedback on proposed split, but RDMA, 
>> VFIO and
>> DMA patches are ready for review and inclusion, the NVMe patches are 
>> still in
>> progress as they require agreement on API first.
>>
>> Thanks
>>
>> -------------------------------------------------------------------------------
>> The DMA mapping operation performs two steps at one same time: allocates
>> IOVA space and actually maps DMA pages to that space. This one shot
>> operation works perfectly for non-complex scenarios, where callers use
>> that DMA API in control path when they setup hardware.
>>
>> However in more complex scenarios, when DMA mapping is needed in data
>> path and especially when some sort of specific datatype is involved,
>> such one shot approach has its drawbacks.
>>
>> That approach pushes developers to introduce new DMA APIs for specific
>> datatype. For example existing scatter-gather mapping functions, or
>> latest Chuck's RFC series to add biovec related DMA mapping [1] and
>> probably struct folio will need it too.
>>
>> These advanced DMA mapping APIs are needed to calculate IOVA size to
>> allocate it as one chunk and some sort of offset calculations to know
>> which part of IOVA to map.
>>
>> Instead of teaching DMA to know these specific datatypes, let's separate
>> existing DMA mapping routine to two steps and give an option to advanced
>> callers (subsystems) perform all calculations internally in advance and
>> map pages later when it is needed.
>>
>> In this series, three users are converted and each of such conversion
>> presents different positive gain:
>> 1. RDMA simplifies and speeds up its pagefault handling for
>>     on-demand-paging (ODP) mode.
>> 2. VFIO PCI live migration code saves huge chunk of memory.
>> 3. NVMe PCI avoids intermediate SG table manipulation and operates
>>     directly on BIOs.
>>
>> Thanks
>>
>> [1] 
>> https://lore.kernel.org/all/169772852492.5232.17148564580779995849.stgit@klimt.1015granger.net
>>
>> Chaitanya Kulkarni (2):
>>    block: add dma_link_range() based API
>>    nvme-pci: use blk_rq_dma_map() for NVMe SGL
>>
>> Leon Romanovsky (14):
>>    mm/hmm: let users to tag specific PFNs
>>    dma-mapping: provide an interface to allocate IOVA
>>    dma-mapping: provide callbacks to link/unlink pages to specific IOVA
>>    iommu/dma: Provide an interface to allow preallocate IOVA
>>    iommu/dma: Prepare map/unmap page functions to receive IOVA
>>    iommu/dma: Implement link/unlink page callbacks
>>    RDMA/umem: Preallocate and cache IOVA for UMEM ODP
>>    RDMA/umem: Store ODP access mask information in PFN
>>    RDMA/core: Separate DMA mapping to caching IOVA and page linkage
>>    RDMA/umem: Prevent UMEM ODP creation with SWIOTLB
>>    vfio/mlx5: Explicitly use number of pages instead of allocated length
>>    vfio/mlx5: Rewrite create mkey flow to allow better code reuse
>>    vfio/mlx5: Explicitly store page list
>>    vfio/mlx5: Convert vfio to use DMA link API
>>
>>   Documentation/core-api/dma-attributes.rst |   7 +
>>   block/blk-merge.c                         | 156 ++++++++++++++
>>   drivers/infiniband/core/umem_odp.c        | 219 +++++++------------
>>   drivers/infiniband/hw/mlx5/mlx5_ib.h      |   1 +
>>   drivers/infiniband/hw/mlx5/odp.c          |  59 +++--
>>   drivers/iommu/dma-iommu.c                 | 129 ++++++++---
>>   drivers/nvme/host/pci.c                   | 220 +++++--------------
>>   drivers/vfio/pci/mlx5/cmd.c               | 252 ++++++++++++----------
>>   drivers/vfio/pci/mlx5/cmd.h               |  22 +-
>>   drivers/vfio/pci/mlx5/main.c              | 136 +++++-------
>>   include/linux/blk-mq.h                    |   9 +
>>   include/linux/dma-map-ops.h               |  13 ++
>>   include/linux/dma-mapping.h               |  39 ++++
>>   include/linux/hmm.h                       |   3 +
>>   include/rdma/ib_umem_odp.h                |  22 +-
>>   include/rdma/ib_verbs.h                   |  54 +++++
>>   kernel/dma/debug.h                        |   2 +
>>   kernel/dma/direct.h                       |   7 +-
>>   kernel/dma/mapping.c                      |  91 ++++++++
>>   mm/hmm.c                                  |  34 +--
>>   20 files changed, 870 insertions(+), 605 deletions(-)
>>
>
Zeng, Oak May 2, 2024, 11:32 p.m. UTC | #29
Hi Leon, Jason

> -----Original Message-----
> From: Leon Romanovsky <leon@kernel.org>
> Sent: Tuesday, March 5, 2024 6:19 AM
> To: Christoph Hellwig <hch@lst.de>; Robin Murphy
> <robin.murphy@arm.com>; Marek Szyprowski
> <m.szyprowski@samsung.com>; Joerg Roedel <joro@8bytes.org>; Will
> Deacon <will@kernel.org>; Jason Gunthorpe <jgg@ziepe.ca>; Chaitanya
> Kulkarni <chaitanyak@nvidia.com>
> Cc: Jonathan Corbet <corbet@lwn.net>; Jens Axboe <axboe@kernel.dk>;
> Keith Busch <kbusch@kernel.org>; Sagi Grimberg <sagi@grimberg.me>;
> Yishai Hadas <yishaih@nvidia.com>; Shameer Kolothum
> <shameerali.kolothum.thodi@huawei.com>; Kevin Tian
> <kevin.tian@intel.com>; Alex Williamson <alex.williamson@redhat.com>;
> Jérôme Glisse <jglisse@redhat.com>; Andrew Morton <akpm@linux-
> foundation.org>; linux-doc@vger.kernel.org; linux-kernel@vger.kernel.org;
> linux-block@vger.kernel.org; linux-rdma@vger.kernel.org;
> iommu@lists.linux.dev; linux-nvme@lists.infradead.org;
> kvm@vger.kernel.org; linux-mm@kvack.org; Bart Van Assche
> <bvanassche@acm.org>; Damien Le Moal
> <damien.lemoal@opensource.wdc.com>; Amir Goldstein
> <amir73il@gmail.com>; josef@toxicpanda.com; Martin K. Petersen
> <martin.petersen@oracle.com>; daniel@iogearbox.net; Dan Williams
> <dan.j.williams@intel.com>; jack@suse.com; Leon Romanovsky
> <leonro@nvidia.com>; Zhu Yanjun <zyjzyj2000@gmail.com>
> Subject: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two
> steps
> 
> This is complimentary part to the proposed LSF/MM topic.
> https://lore.kernel.org/linux-rdma/22df55f8-cf64-4aa8-8c0b-
> b556c867b926@linux.dev/T/#m85672c860539fdbbc8fe0f5ccabdc05b40269057
> 
> This is posted as RFC to get a feedback on proposed split, but RDMA, VFIO
> and
> DMA patches are ready for review and inclusion, the NVMe patches are still
> in
> progress as they require agreement on API first.
> 
> Thanks
> 
> -------------------------------------------------------------------------------
> The DMA mapping operation performs two steps at one same time: allocates
> IOVA space and actually maps DMA pages to that space. This one shot
> operation works perfectly for non-complex scenarios, where callers use
> that DMA API in control path when they setup hardware.
> 
> However in more complex scenarios, when DMA mapping is needed in data
> path and especially when some sort of specific datatype is involved,
> such one shot approach has its drawbacks.
> 
> That approach pushes developers to introduce new DMA APIs for specific
> datatype. For example existing scatter-gather mapping functions, or
> latest Chuck's RFC series to add biovec related DMA mapping [1] and
> probably struct folio will need it too.
> 
> These advanced DMA mapping APIs are needed to calculate IOVA size to
> allocate it as one chunk and some sort of offset calculations to know
> which part of IOVA to map.
> 
> Instead of teaching DMA to know these specific datatypes, let's separate
> existing DMA mapping routine to two steps and give an option to advanced
> callers (subsystems) perform all calculations internally in advance and
> map pages later when it is needed.

I looked into how this scheme can be applied to DRM subsystem and GPU drivers. 

I figured RDMA can apply this scheme because RDMA can calculate the iova size. Per my limited knowledge of rdma, user can register a memory region (the reg_user_mr vfunc) and memory region's sized is used to pre-allocate iova space. And in the RDMA use case, it seems the user registered region can be very big, e.g., 512MiB or even GiB

In GPU driver, we have a few use cases where we need dma-mapping. Just name two:

1) userptr: it is user malloc'ed/mmap'ed memory and registers to gpu (in Intel's driver it is through a vm_bind api, similar to mmap). A userptr can be of any random size, depending on user malloc size. Today we use dma-map-sg for this use case. The down side of our approach is, during userptr invalidation, even if user only munmap partially of an userptr, we invalidate the whole userptr from gpu page table, because there is no way for us to partially dma-unmap the whole sg list. I think we can try your new API in this case. The main benefit of the new approach is the partial munmap case.

We will have to pre-allocate iova for each userptr, and we have many userptrs of random size... So we might be not as efficient as RDMA case where I assume user register a few big memory regions.  

2) system allocator: it is malloc'ed/mmap'ed memory be used for GPU program directly, without any other extra driver API call. We call this use case system allocator.

For system allocator, driver have no knowledge of which virtual address range is valid in advance. So when GPU access a malloc'ed/mmap'ed address, we have a page fault. We then look up a CPU vma which contains the fault address. I guess we can use the CPU vma size to allocate the iova space of the same size?

But there will be a true difficulty to apply your scheme to this use case. It is related to the STICKY flag. As I understand it, the sticky flag is designed for driver to mark "this page/pfn has been populated, no need to re-populate again", roughly...Unlike userptr and RDMA use cases where the backing store of a buffer is always in system memory, in the system allocator use case, the backing store can be changing b/t system memory and GPU's device private memory. Even worse, we have to assume the data migration b/t system and GPU is dynamic. When data is migrated to GPU, we don't need dma-map. And when migration happens to a pfn with STICKY flag, we still need to repopulate this pfn. So you can see, it is not easy to apply this scheme to this use case. At least I can't see an obvious way.


Oak


> 
> In this series, three users are converted and each of such conversion
> presents different positive gain:
> 1. RDMA simplifies and speeds up its pagefault handling for
>    on-demand-paging (ODP) mode.
> 2. VFIO PCI live migration code saves huge chunk of memory.
> 3. NVMe PCI avoids intermediate SG table manipulation and operates
>    directly on BIOs.
> 
> Thanks
> 
> [1]
> https://lore.kernel.org/all/169772852492.5232.17148564580779995849.stgit@
> klimt.1015granger.net
> 
> Chaitanya Kulkarni (2):
>   block: add dma_link_range() based API
>   nvme-pci: use blk_rq_dma_map() for NVMe SGL
> 
> Leon Romanovsky (14):
>   mm/hmm: let users to tag specific PFNs
>   dma-mapping: provide an interface to allocate IOVA
>   dma-mapping: provide callbacks to link/unlink pages to specific IOVA
>   iommu/dma: Provide an interface to allow preallocate IOVA
>   iommu/dma: Prepare map/unmap page functions to receive IOVA
>   iommu/dma: Implement link/unlink page callbacks
>   RDMA/umem: Preallocate and cache IOVA for UMEM ODP
>   RDMA/umem: Store ODP access mask information in PFN
>   RDMA/core: Separate DMA mapping to caching IOVA and page linkage
>   RDMA/umem: Prevent UMEM ODP creation with SWIOTLB
>   vfio/mlx5: Explicitly use number of pages instead of allocated length
>   vfio/mlx5: Rewrite create mkey flow to allow better code reuse
>   vfio/mlx5: Explicitly store page list
>   vfio/mlx5: Convert vfio to use DMA link API
> 
>  Documentation/core-api/dma-attributes.rst |   7 +
>  block/blk-merge.c                         | 156 ++++++++++++++
>  drivers/infiniband/core/umem_odp.c        | 219 +++++++------------
>  drivers/infiniband/hw/mlx5/mlx5_ib.h      |   1 +
>  drivers/infiniband/hw/mlx5/odp.c          |  59 +++--
>  drivers/iommu/dma-iommu.c                 | 129 ++++++++---
>  drivers/nvme/host/pci.c                   | 220 +++++--------------
>  drivers/vfio/pci/mlx5/cmd.c               | 252 ++++++++++++----------
>  drivers/vfio/pci/mlx5/cmd.h               |  22 +-
>  drivers/vfio/pci/mlx5/main.c              | 136 +++++-------
>  include/linux/blk-mq.h                    |   9 +
>  include/linux/dma-map-ops.h               |  13 ++
>  include/linux/dma-mapping.h               |  39 ++++
>  include/linux/hmm.h                       |   3 +
>  include/rdma/ib_umem_odp.h                |  22 +-
>  include/rdma/ib_verbs.h                   |  54 +++++
>  kernel/dma/debug.h                        |   2 +
>  kernel/dma/direct.h                       |   7 +-
>  kernel/dma/mapping.c                      |  91 ++++++++
>  mm/hmm.c                                  |  34 +--
>  20 files changed, 870 insertions(+), 605 deletions(-)
> 
> --
> 2.44.0
Zhu Yanjun May 3, 2024, 11:57 a.m. UTC | #30
On 03.05.24 01:32, Zeng, Oak wrote:
> Hi Leon, Jason
>
>> -----Original Message-----
>> From: Leon Romanovsky <leon@kernel.org>
>> Sent: Tuesday, March 5, 2024 6:19 AM
>> To: Christoph Hellwig <hch@lst.de>; Robin Murphy
>> <robin.murphy@arm.com>; Marek Szyprowski
>> <m.szyprowski@samsung.com>; Joerg Roedel <joro@8bytes.org>; Will
>> Deacon <will@kernel.org>; Jason Gunthorpe <jgg@ziepe.ca>; Chaitanya
>> Kulkarni <chaitanyak@nvidia.com>
>> Cc: Jonathan Corbet <corbet@lwn.net>; Jens Axboe <axboe@kernel.dk>;
>> Keith Busch <kbusch@kernel.org>; Sagi Grimberg <sagi@grimberg.me>;
>> Yishai Hadas <yishaih@nvidia.com>; Shameer Kolothum
>> <shameerali.kolothum.thodi@huawei.com>; Kevin Tian
>> <kevin.tian@intel.com>; Alex Williamson <alex.williamson@redhat.com>;
>> Jérôme Glisse <jglisse@redhat.com>; Andrew Morton <akpm@linux-
>> foundation.org>; linux-doc@vger.kernel.org; linux-kernel@vger.kernel.org;
>> linux-block@vger.kernel.org; linux-rdma@vger.kernel.org;
>> iommu@lists.linux.dev; linux-nvme@lists.infradead.org;
>> kvm@vger.kernel.org; linux-mm@kvack.org; Bart Van Assche
>> <bvanassche@acm.org>; Damien Le Moal
>> <damien.lemoal@opensource.wdc.com>; Amir Goldstein
>> <amir73il@gmail.com>; josef@toxicpanda.com; Martin K. Petersen
>> <martin.petersen@oracle.com>; daniel@iogearbox.net; Dan Williams
>> <dan.j.williams@intel.com>; jack@suse.com; Leon Romanovsky
>> <leonro@nvidia.com>; Zhu Yanjun <zyjzyj2000@gmail.com>
>> Subject: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two
>> steps
>>
>> This is complimentary part to the proposed LSF/MM topic.
>> https://lore.kernel.org/linux-rdma/22df55f8-cf64-4aa8-8c0b-
>> b556c867b926@linux.dev/T/#m85672c860539fdbbc8fe0f5ccabdc05b40269057
>>
>> This is posted as RFC to get a feedback on proposed split, but RDMA, VFIO
>> and
>> DMA patches are ready for review and inclusion, the NVMe patches are still
>> in
>> progress as they require agreement on API first.
>>
>> Thanks
>>
>> -------------------------------------------------------------------------------
>> The DMA mapping operation performs two steps at one same time: allocates
>> IOVA space and actually maps DMA pages to that space. This one shot
>> operation works perfectly for non-complex scenarios, where callers use
>> that DMA API in control path when they setup hardware.
>>
>> However in more complex scenarios, when DMA mapping is needed in data
>> path and especially when some sort of specific datatype is involved,
>> such one shot approach has its drawbacks.
>>
>> That approach pushes developers to introduce new DMA APIs for specific
>> datatype. For example existing scatter-gather mapping functions, or
>> latest Chuck's RFC series to add biovec related DMA mapping [1] and
>> probably struct folio will need it too.
>>
>> These advanced DMA mapping APIs are needed to calculate IOVA size to
>> allocate it as one chunk and some sort of offset calculations to know
>> which part of IOVA to map.
>>
>> Instead of teaching DMA to know these specific datatypes, let's separate
>> existing DMA mapping routine to two steps and give an option to advanced
>> callers (subsystems) perform all calculations internally in advance and
>> map pages later when it is needed.
> I looked into how this scheme can be applied to DRM subsystem and GPU drivers.
>
> I figured RDMA can apply this scheme because RDMA can calculate the iova size. Per my limited knowledge of rdma, user can register a memory region (the reg_user_mr vfunc) and memory region's sized is used to pre-allocate iova space. And in the RDMA use case, it seems the user registered region can be very big, e.g., 512MiB or even GiB
>
> In GPU driver, we have a few use cases where we need dma-mapping. Just name two:
>
> 1) userptr: it is user malloc'ed/mmap'ed memory and registers to gpu (in Intel's driver it is through a vm_bind api, similar to mmap). A userptr can be of any random size, depending on user malloc size. Today we use dma-map-sg for this use case. The down side of our approach is, during userptr invalidation, even if user only munmap partially of an userptr, we invalidate the whole userptr from gpu page table, because there is no way for us to partially dma-unmap the whole sg list. I think we can try your new API in this case. The main benefit of the new approach is the partial munmap case.
>
> We will have to pre-allocate iova for each userptr, and we have many userptrs of random size... So we might be not as efficient as RDMA case where I assume user register a few big memory regions.
>
> 2) system allocator: it is malloc'ed/mmap'ed memory be used for GPU program directly, without any other extra driver API call. We call this use case system allocator.
>
> For system allocator, driver have no knowledge of which virtual address range is valid in advance. So when GPU access a malloc'ed/mmap'ed address, we have a page fault. We then look up a CPU vma which contains the fault address. I guess we can use the CPU vma size to allocate the iova space of the same size?
>
> But there will be a true difficulty to apply your scheme to this use case. It is related to the STICKY flag. As I understand it, the sticky flag is designed for driver to mark "this page/pfn has been populated, no need to re-populate again", roughly...Unlike userptr and RDMA use cases where the backing store of a buffer is always in system memory, in the system allocator use case, the backing store can be changing b/t system memory and GPU's device private memory. Even worse, we have to assume the data migration b/t system and GPU is dynamic. When data is migrated to GPU, we don't need dma-map. And when migration happens to a pfn with STICKY flag, we still need to repopulate this pfn. So you can see, it is not easy to apply this scheme to this use case. At least I can't see an obvious way.

Not sure if GPU peer to peer dma mapping GPU memory for use can use this 
scheme or not. If I remember it correctly, Intel Gaudi GPU supports peer 
2 peer dma mapping in GPU Direct RDMA. Not sure if this scheme can be 
applied in that place or not.

Just my 2 cent suggestions.

Zhu Yanjun

>
>
> Oak
>
>
>> In this series, three users are converted and each of such conversion
>> presents different positive gain:
>> 1. RDMA simplifies and speeds up its pagefault handling for
>>     on-demand-paging (ODP) mode.
>> 2. VFIO PCI live migration code saves huge chunk of memory.
>> 3. NVMe PCI avoids intermediate SG table manipulation and operates
>>     directly on BIOs.
>>
>> Thanks
>>
>> [1]
>> https://lore.kernel.org/all/169772852492.5232.17148564580779995849.stgit@
>> klimt.1015granger.net
>>
>> Chaitanya Kulkarni (2):
>>    block: add dma_link_range() based API
>>    nvme-pci: use blk_rq_dma_map() for NVMe SGL
>>
>> Leon Romanovsky (14):
>>    mm/hmm: let users to tag specific PFNs
>>    dma-mapping: provide an interface to allocate IOVA
>>    dma-mapping: provide callbacks to link/unlink pages to specific IOVA
>>    iommu/dma: Provide an interface to allow preallocate IOVA
>>    iommu/dma: Prepare map/unmap page functions to receive IOVA
>>    iommu/dma: Implement link/unlink page callbacks
>>    RDMA/umem: Preallocate and cache IOVA for UMEM ODP
>>    RDMA/umem: Store ODP access mask information in PFN
>>    RDMA/core: Separate DMA mapping to caching IOVA and page linkage
>>    RDMA/umem: Prevent UMEM ODP creation with SWIOTLB
>>    vfio/mlx5: Explicitly use number of pages instead of allocated length
>>    vfio/mlx5: Rewrite create mkey flow to allow better code reuse
>>    vfio/mlx5: Explicitly store page list
>>    vfio/mlx5: Convert vfio to use DMA link API
>>
>>   Documentation/core-api/dma-attributes.rst |   7 +
>>   block/blk-merge.c                         | 156 ++++++++++++++
>>   drivers/infiniband/core/umem_odp.c        | 219 +++++++------------
>>   drivers/infiniband/hw/mlx5/mlx5_ib.h      |   1 +
>>   drivers/infiniband/hw/mlx5/odp.c          |  59 +++--
>>   drivers/iommu/dma-iommu.c                 | 129 ++++++++---
>>   drivers/nvme/host/pci.c                   | 220 +++++--------------
>>   drivers/vfio/pci/mlx5/cmd.c               | 252 ++++++++++++----------
>>   drivers/vfio/pci/mlx5/cmd.h               |  22 +-
>>   drivers/vfio/pci/mlx5/main.c              | 136 +++++-------
>>   include/linux/blk-mq.h                    |   9 +
>>   include/linux/dma-map-ops.h               |  13 ++
>>   include/linux/dma-mapping.h               |  39 ++++
>>   include/linux/hmm.h                       |   3 +
>>   include/rdma/ib_umem_odp.h                |  22 +-
>>   include/rdma/ib_verbs.h                   |  54 +++++
>>   kernel/dma/debug.h                        |   2 +
>>   kernel/dma/direct.h                       |   7 +-
>>   kernel/dma/mapping.c                      |  91 ++++++++
>>   mm/hmm.c                                  |  34 +--
>>   20 files changed, 870 insertions(+), 605 deletions(-)
>>
>> --
>> 2.44.0
Jason Gunthorpe May 3, 2024, 4:42 p.m. UTC | #31
On Thu, May 02, 2024 at 11:32:55PM +0000, Zeng, Oak wrote:

> > Instead of teaching DMA to know these specific datatypes, let's separate
> > existing DMA mapping routine to two steps and give an option to advanced
> > callers (subsystems) perform all calculations internally in advance and
> > map pages later when it is needed.
> 
> I looked into how this scheme can be applied to DRM subsystem and GPU drivers. 
> 
> I figured RDMA can apply this scheme because RDMA can calculate the
> iova size. Per my limited knowledge of rdma, user can register a
> memory region (the reg_user_mr vfunc) and memory region's sized is
> used to pre-allocate iova space. And in the RDMA use case, it seems
> the user registered region can be very big, e.g., 512MiB or even GiB

In RDMA the iova would be linked to the SVA granual we discussed
previously.

> In GPU driver, we have a few use cases where we need dma-mapping. Just name two:
> 
> 1) userptr: it is user malloc'ed/mmap'ed memory and registers to gpu
> (in Intel's driver it is through a vm_bind api, similar to mmap). A
> userptr can be of any random size, depending on user malloc
> size. Today we use dma-map-sg for this use case. The down side of
> our approach is, during userptr invalidation, even if user only
> munmap partially of an userptr, we invalidate the whole userptr from
> gpu page table, because there is no way for us to partially
> dma-unmap the whole sg list. I think we can try your new API in this
> case. The main benefit of the new approach is the partial munmap
> case.

Yes, this is one of the main things it will improve.
 
> We will have to pre-allocate iova for each userptr, and we have many
> userptrs of random size... So we might be not as efficient as RDMA
> case where I assume user register a few big memory regions.

You are already doing this. dma_map_sg() does exactly the same IOVA
allocation under the covers.
 
> 2) system allocator: it is malloc'ed/mmap'ed memory be used for GPU
> program directly, without any other extra driver API call. We call
> this use case system allocator.
 
> For system allocator, driver have no knowledge of which virtual
> address range is valid in advance. So when GPU access a
> malloc'ed/mmap'ed address, we have a page fault. We then look up a
> CPU vma which contains the fault address. I guess we can use the CPU
> vma size to allocate the iova space of the same size?

No. You'd follow what we discussed in the other thread.

If you do a full SVA then you'd split your MM space into granuals and
when a fault hits a granual you'd allocate the IOVA for the whole
granual. RDMA ODP is using a 512M granual currently.

If you are doing sub ranges then you'd probably allocate the IOVA for
the well defined sub range (assuming the typical use case isn't huge)

> But there will be a true difficulty to apply your scheme to this use
> case. It is related to the STICKY flag. As I understand it, the
> sticky flag is designed for driver to mark "this page/pfn has been
> populated, no need to re-populate again", roughly...Unlike userptr
> and RDMA use cases where the backing store of a buffer is always in
> system memory, in the system allocator use case, the backing store
> can be changing b/t system memory and GPU's device private
> memory. Even worse, we have to assume the data migration b/t system
> and GPU is dynamic. When data is migrated to GPU, we don't need
> dma-map. And when migration happens to a pfn with STICKY flag, we
> still need to repopulate this pfn. So you can see, it is not easy to
> apply this scheme to this use case. At least I can't see an obvious
> way.

You are already doing this today, you are keeping the sg list around
until you unmap it.

Instead of keeping the sg list you'd keep a much smaller datastructure
per-granual. The sticky bit is simply a convient way for ODP to manage
the smaller data structure, you don't have to use it.

But you do need to keep track of what pages in the granual have been
DMA mapped - sg list was doing this before. This could be a simple
bitmap array matching the granual size.

Looking (far) forward we may be able to have a "replace" API that
allows installing a new page unconditionally regardless of what is
already there.

Jason
Zeng, Oak May 3, 2024, 8:59 p.m. UTC | #32
> -----Original Message-----
> From: Jason Gunthorpe <jgg@ziepe.ca>
> Sent: Friday, May 3, 2024 12:43 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: leon@kernel.org; Christoph Hellwig <hch@lst.de>; Robin Murphy
> <robin.murphy@arm.com>; Marek Szyprowski
> <m.szyprowski@samsung.com>; Joerg Roedel <joro@8bytes.org>; Will
> Deacon <will@kernel.org>; Chaitanya Kulkarni <chaitanyak@nvidia.com>;
> Brost, Matthew <matthew.brost@intel.com>; Hellstrom, Thomas
> <thomas.hellstrom@intel.com>; Jonathan Corbet <corbet@lwn.net>; Jens
> Axboe <axboe@kernel.dk>; Keith Busch <kbusch@kernel.org>; Sagi
> Grimberg <sagi@grimberg.me>; Yishai Hadas <yishaih@nvidia.com>;
> Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>; Tian, Kevin
> <kevin.tian@intel.com>; Alex Williamson <alex.williamson@redhat.com>;
> Jérôme Glisse <jglisse@redhat.com>; Andrew Morton <akpm@linux-
> foundation.org>; linux-doc@vger.kernel.org; linux-kernel@vger.kernel.org;
> linux-block@vger.kernel.org; linux-rdma@vger.kernel.org;
> iommu@lists.linux.dev; linux-nvme@lists.infradead.org;
> kvm@vger.kernel.org; linux-mm@kvack.org; Bart Van Assche
> <bvanassche@acm.org>; Damien Le Moal
> <damien.lemoal@opensource.wdc.com>; Amir Goldstein
> <amir73il@gmail.com>; josef@toxicpanda.com; Martin K. Petersen
> <martin.petersen@oracle.com>; daniel@iogearbox.net; Williams, Dan J
> <dan.j.williams@intel.com>; jack@suse.com; Leon Romanovsky
> <leonro@nvidia.com>; Zhu Yanjun <zyjzyj2000@gmail.com>
> Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to
> two steps
> 
> On Thu, May 02, 2024 at 11:32:55PM +0000, Zeng, Oak wrote:
> 
> > > Instead of teaching DMA to know these specific datatypes, let's separate
> > > existing DMA mapping routine to two steps and give an option to
> advanced
> > > callers (subsystems) perform all calculations internally in advance and
> > > map pages later when it is needed.
> >
> > I looked into how this scheme can be applied to DRM subsystem and GPU
> drivers.
> >
> > I figured RDMA can apply this scheme because RDMA can calculate the
> > iova size. Per my limited knowledge of rdma, user can register a
> > memory region (the reg_user_mr vfunc) and memory region's sized is
> > used to pre-allocate iova space. And in the RDMA use case, it seems
> > the user registered region can be very big, e.g., 512MiB or even GiB
> 
> In RDMA the iova would be linked to the SVA granual we discussed
> previously.

I need to learn more of this scheme. 

Let's say 512MiB granual... In a 57-bit virtual address machine, the use space can address space can be up to 56 bit (e.g.,  half-half split b/t kernel and user)

So you would end up with  134,217, 728 sub-regions (2 to the power of 27), which is huge...

Is that RDMA use a much smaller virtual address space?

With 512MiB granual, do you fault-in or map 512MiB virtual address range to RDMA page table? E.g., when page fault happens at address A, do you fault-in the whole 512MiB region to RDMA page table? How do you make sure all addresses in this 512MiB region are valid virtual address?  



> 
> > In GPU driver, we have a few use cases where we need dma-mapping. Just
> name two:
> >
> > 1) userptr: it is user malloc'ed/mmap'ed memory and registers to gpu
> > (in Intel's driver it is through a vm_bind api, similar to mmap). A
> > userptr can be of any random size, depending on user malloc
> > size. Today we use dma-map-sg for this use case. The down side of
> > our approach is, during userptr invalidation, even if user only
> > munmap partially of an userptr, we invalidate the whole userptr from
> > gpu page table, because there is no way for us to partially
> > dma-unmap the whole sg list. I think we can try your new API in this
> > case. The main benefit of the new approach is the partial munmap
> > case.
> 
> Yes, this is one of the main things it will improve.
> 
> > We will have to pre-allocate iova for each userptr, and we have many
> > userptrs of random size... So we might be not as efficient as RDMA
> > case where I assume user register a few big memory regions.
> 
> You are already doing this. dma_map_sg() does exactly the same IOVA
> allocation under the covers.

Sure. Then we can replace our _sg with your new DMA Api once it is merged. We will gain a benefit with a little more codes

> 
> > 2) system allocator: it is malloc'ed/mmap'ed memory be used for GPU
> > program directly, without any other extra driver API call. We call
> > this use case system allocator.
> 
> > For system allocator, driver have no knowledge of which virtual
> > address range is valid in advance. So when GPU access a
> > malloc'ed/mmap'ed address, we have a page fault. We then look up a
> > CPU vma which contains the fault address. I guess we can use the CPU
> > vma size to allocate the iova space of the same size?
> 
> No. You'd follow what we discussed in the other thread.
> 
> If you do a full SVA then you'd split your MM space into granuals and
> when a fault hits a granual you'd allocate the IOVA for the whole
> granual. RDMA ODP is using a 512M granual currently.

Per system allocator requirement, we have to do full SVA (which means ANY valid CPU virtual address is a valid GPU virtual address). 

Per my above calculation, with 512M granual, we will end up a huge number of sub-regions....

> 
> If you are doing sub ranges then you'd probably allocate the IOVA for
> the well defined sub range (assuming the typical use case isn't huge)

Can you explain what is sub ranges? Is that device only mirror partially of the CPU virtual address space?

How do we decide which part to mirror?


> 
> > But there will be a true difficulty to apply your scheme to this use
> > case. It is related to the STICKY flag. As I understand it, the
> > sticky flag is designed for driver to mark "this page/pfn has been
> > populated, no need to re-populate again", roughly...Unlike userptr
> > and RDMA use cases where the backing store of a buffer is always in
> > system memory, in the system allocator use case, the backing store
> > can be changing b/t system memory and GPU's device private
> > memory. Even worse, we have to assume the data migration b/t system
> > and GPU is dynamic. When data is migrated to GPU, we don't need
> > dma-map. And when migration happens to a pfn with STICKY flag, we
> > still need to repopulate this pfn. So you can see, it is not easy to
> > apply this scheme to this use case. At least I can't see an obvious
> > way.
> 
> You are already doing this today, you are keeping the sg list around
> until you unmap it.
> 
> Instead of keeping the sg list you'd keep a much smaller datastructure
> per-granual. The sticky bit is simply a convient way for ODP to manage
> the smaller data structure, you don't have to use it.
> 
> But you do need to keep track of what pages in the granual have been
> DMA mapped - sg list was doing this before. This could be a simple
> bitmap array matching the granual size.

Make sense. We can try once your API is ready. 

I still don't figure out the granular scheme. Please help with above questions.

Thanks,
Oak


> 
> Looking (far) forward we may be able to have a "replace" API that
> allows installing a new page unconditionally regardless of what is
> already there.
> 
> Jason