[RFC,v1,00/18] Provide a new two step DMA API mapping API

Message ID: cover.1719909395.git.leon@kernel.org

Message

Leon Romanovsky July 2, 2024, 9:09 a.m. UTC
Changelog:
v1:
 * Rewrote cover letter
 * Changed to the API as proposed in
   https://lore.kernel.org/linux-rdma/20240322184330.GL66976@ziepe.ca/
 * Removed IB DMA wrappers and use DMA API directly
v0: https://lore.kernel.org/all/cover.1709635535.git.leon@kernel.org
-------------------------------------------------------------------------

Currently the only efficient way to map a complex memory description through
the DMA API is by using the scatterlist APIs. The SG APIs are unique in that
they efficiently combine the two fundamental operations of sizing and allocating
a large IOVA window from the IOMMU and processing all the per-address
swiotlb/flushing/p2p/map details.

This uniqueness has been a long-standing pain point, as the scatterlist API
is mandatory but expensive to use. It prevents any kind of optimization or
feature improvement (such as avoiding struct page for P2P) because the
scatterlist itself cannot be improved.

Several approaches have been explored to expand the DMA API with additional
scatterlist-like structures (BIO[1], rlist[2]); this series instead splits up
the DMA API to allow callers to bring their own data structure.

The API is split up into parts:
 - dma_alloc_iova() / dma_free_iova()
   To do any required pre-allocation, based on the caller supplying
   some details about how much IOMMU address space it would need in the
   worst case.
 - dma_link_range() / dma_unlink_range()
   Perform the actual mapping into the pre-allocated IOVA. This is very
   similar to dma_map_page().

A driver will determine its mapping size using its own data structure, such
as a BIO, and request the required IOVA. Then it will iterate directly over
its data structure to DMA map each range. The result can then be stored
directly into the HW specific DMA list; no intermediate scatterlist is
required. A rough usage sketch follows.
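
A minimal usage sketch, assuming the API above (illustrative only: a
single "iova" object is shown here, while the patches actually split it
into dma_iova_attrs/dma_iova_state, and the exact signatures may
differ):

struct dma_iova_attrs iova = {
        .dev  = dev,
        .size = (u64)nr_pages << PAGE_SHIFT,    /* worst-case IOVA space */
};
int i, ret;

/* Step 1: size and allocate the IOVA window once. */
ret = dma_alloc_iova(&iova);
if (ret)
        return ret;

/* Step 2: walk the driver's own data structure and link each range;
 * the results feed the HW DMA list directly.
 */
for (i = 0; i < nr_pages; i++) {
        ret = dma_link_range(&iova, page_to_phys(pages[i]), PAGE_SIZE);
        if (ret) {
                /* unwind: dma_unlink_range() each linked page, then
                 * dma_free_iova(&iova)
                 */
                break;
        }
}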

In this series, examples of three users are converted to the new API to show
the benefits. Each user has a unique flow:
 1. RDMA ODP is an example of "SVA mirroring" using HMM that needs to
    dynamically map/unmap large numbers of single pages. This becomes
    significantly faster in the IOMMU case as the map/unmap is now just
    a page table walk and the IOVA allocation is pre-computed once.
    Significant amounts of memory are saved as there is no longer a need
    to store the dma_addr_t of each page (see the sketch after this
    list).
 2. VFIO PCI live migration code builds a very large "page list"
    for the device. Instead of allocating a scatterlist entry per
    allocated page it can just allocate an array of 'struct page *',
    saving a large amount of memory.
 3. NVMe PCI demonstrates how a BIO can be converted to a HW scatter
    list without having to allocate then populate an intermediate SG table.
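
A rough illustration of the per-page tagging in example 1 (a minimal
sketch: the flag name and bit position are assumptions, not the series'
actual definition, and unwinding is omitted):

/* Tag HMM pfns as DMA-mapped instead of keeping a parallel dma_addr_t
 * array; HMM_PFN_DMA_MAPPED is hypothetical here.
 */
#define HMM_PFN_DMA_MAPPED      (1UL << (BITS_PER_LONG - 7))

for (i = 0; i < npages; i++) {
        if (pfns[i] & HMM_PFN_DMA_MAPPED)
                continue;       /* already linked into the IOVA */
        ret = dma_link_range(&iova,
                             page_to_phys(hmm_pfn_to_page(pfns[i])),
                             PAGE_SIZE);
        if (ret)
                break;
        pfns[i] |= HMM_PFN_DMA_MAPPED;
}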

This is a first step along a path to providing alternatives to scatterlist
and solving some of its abuses and design mistakes, for instance in DMABUF's
P2P support.

The ODP and VFIO versions are complete and fully tested; they can be the
first users of the new API when it is merged. The NVMe conversion requires
more work.

[1] https://lore.kernel.org/all/169772852492.5232.17148564580779995849.stgit@klimt.1015granger.net/
[2] https://lore.kernel.org/all/ZD2lMvprVxu23BXZ@ziepe.ca/

Chaitanya Kulkarni (2):
  block: export helper to get segment max size
  nvme-pci: use new dma API

Leon Romanovsky (16):
  dma-mapping: query DMA memory type
  dma-mapping: provide an interface to allocate IOVA
  dma-mapping: check if IOVA can be used
  dma-mapping: implement link range API
  mm/hmm: let users to tag specific PFN with DMA mapped bit
  dma-mapping: provide callbacks to link/unlink HMM PFNs to specific
    IOVA
  iommu/dma: Provide an interface to allow preallocate IOVA
  iommu/dma: Implement link/unlink ranges callbacks
  RDMA/umem: Preallocate and cache IOVA for UMEM ODP
  RDMA/umem: Store ODP access mask information in PFN
  RDMA/core: Separate DMA mapping to caching IOVA and page linkage
  RDMA/umem: Prevent UMEM ODP creation with SWIOTLB
  vfio/mlx5: Explicitly use number of pages instead of allocated length
  vfio/mlx5: Rewrite create mkey flow to allow better code reuse
  vfio/mlx5: Explicitly store page list
  vfio/mlx5: Convert vfio to use DMA link API

 block/blk-merge.c                    |   3 +-
 drivers/infiniband/core/umem_odp.c   | 218 ++++++-----------
 drivers/infiniband/hw/mlx5/mlx5_ib.h |   1 +
 drivers/infiniband/hw/mlx5/odp.c     |  44 ++--
 drivers/iommu/dma-iommu.c            | 142 +++++++++--
 drivers/nvme/host/pci.c              | 283 ++++++++++++++++------
 drivers/pci/p2pdma.c                 |   4 +-
 drivers/vfio/pci/mlx5/cmd.c          | 344 +++++++++++++++------------
 drivers/vfio/pci/mlx5/cmd.h          |  25 +-
 drivers/vfio/pci/mlx5/main.c         |  86 +++----
 include/linux/blk-mq.h               |   3 +
 include/linux/dma-map-ops.h          |  30 +++
 include/linux/dma-mapping.h          |  87 +++++++
 include/linux/hmm.h                  |   4 +
 include/rdma/ib_umem_odp.h           |  23 +-
 kernel/dma/mapping.c                 | 290 ++++++++++++++++++++++
 mm/hmm.c                             |  34 ++-
 17 files changed, 1122 insertions(+), 499 deletions(-)

Comments

Christoph Hellwig July 3, 2024, 5:42 a.m. UTC | #1
I just tried to boot this on my usual qemu test setup with emulated
nvme devices, and it dead-loops with messages like this fairly late
in the boot cycle:

[   43.826627] iommu: unaligned: iova 0xfff7e000 pa 0x000000010be33650 size 0x1000 min_pagesz 0x1000
[   43.826982] dma_mapping_error -12

passing intel_iommu=off instead of intel_iommu=on (expectedly) makes
it go away.
Zhu Yanjun July 3, 2024, 10:42 a.m. UTC | #2
On 2024/7/3 13:42, Christoph Hellwig wrote:
> I just tried to boot this on my usual qemu test setup with emulated
> nvme devices, and it dead-loops with messages like this fairly late
> in the boot cycle:
> 
> [   43.826627] iommu: unaligned: iova 0xfff7e000 pa 0x000000010be33650 size 0x1000 min_pagesz 0x1000
> [   43.826982] dma_mapping_error -12
> 
> passing intel_iommu=off instead of intel_iommu=on (expectedly) makes

I also ran into this problem. IMO, if intel_iommu=off, the drivers under
drivers/iommu cannot be used.

If I remember correctly, some people were able to fix this problem with the
first version. I will check the mails.

For my part, I just reverted this commit, because I do not use the nvme
part of it.

Zhu Yanjun

> it go away.
>
Leon Romanovsky July 3, 2024, 10:52 a.m. UTC | #3
On Wed, Jul 03, 2024 at 07:42:38AM +0200, Christoph Hellwig wrote:
> I just tried to boot this on my usual qemu test setup with emulated
> nvme devices, and it dead-loops with messages like this fairly late
> in the boot cycle:
> 
> [   43.826627] iommu: unaligned: iova 0xfff7e000 pa 0x000000010be33650 size 0x1000 min_pagesz 0x1000
> [   43.826982] dma_mapping_error -12
> 
> passing intel_iommu=off instead of intel_iommu=on (expectedly) makes
> it go away.

Can you please share your kernel command line and qemu?
On my and Chaitanya setups it works fine.

Thanks
Christoph Hellwig July 3, 2024, 2:35 p.m. UTC | #4
On Wed, Jul 03, 2024 at 01:52:53PM +0300, Leon Romanovsky wrote:
> On Wed, Jul 03, 2024 at 07:42:38AM +0200, Christoph Hellwig wrote:
> > I just tried to boot this on my usual qemu test setup with emulated
> > nvme devices, and it dead-loops with messages like this fairly late
> > in the boot cycle:
> > 
> > [   43.826627] iommu: unaligned: iova 0xfff7e000 pa 0x000000010be33650 size 0x1000 min_pagesz 0x1000
> > [   43.826982] dma_mapping_error -12
> > 
> > passing intel_iommu=off instead of intel_iommu=on (expectedly) makes
> > it go away.
> 
> Can you please share your kernel command line and qemu?
> On my and Chaitanya setups it works fine.

qemu-system-x86_64 \
        -nographic \
	-enable-kvm \
	-m 6g \
	-smp 4 \
	-cpu host \
	-M q35,kernel-irqchip=split \
	-kernel arch/x86/boot/bzImage \
	-append "root=/dev/vda console=ttyS0,115200n8 intel_iommu=on" \
        -device intel-iommu,intremap=on \
	-device ioh3420,multifunction=on,bus=pcie.0,id=port9-0,addr=9.0,chassis=0 \	
        -blockdev driver=file,cache.direct=on,node-name=root,filename=/home/hch/images/bookworm.img \
	-blockdev driver=host_device,cache.direct=on,node-name=test,filename=/dev/nvme0n1p4 \
	-device virtio-blk,drive=root \
	-device nvme,drive=test,serial=1234
Leon Romanovsky July 3, 2024, 3:51 p.m. UTC | #5
On Wed, Jul 03, 2024 at 04:35:30PM +0200, Christoph Hellwig wrote:
> On Wed, Jul 03, 2024 at 01:52:53PM +0300, Leon Romanovsky wrote:
> > On Wed, Jul 03, 2024 at 07:42:38AM +0200, Christoph Hellwig wrote:
> > > I just tried to boot this on my usual qemu test setup with emulated
> > > nvme devices, and it dead-loops with messages like this fairly late
> > > in the boot cycle:
> > > 
> > > [   43.826627] iommu: unaligned: iova 0xfff7e000 pa 0x000000010be33650 size 0x1000 min_pagesz 0x1000
> > > [   43.826982] dma_mapping_error -12
> > > 
> > > passing intel_iommu=off instead of intel_iommu=on (expectedly) makes
> > > it go away.
> > 
> > Can you please share your kernel command line and qemu?
> > On my and Chaitanya setups it works fine.
> 
> qemu-system-x86_64 \
>         -nographic \
> 	-enable-kvm \
> 	-m 6g \
> 	-smp 4 \
> 	-cpu host \
> 	-M q35,kernel-irqchip=split \
> 	-kernel arch/x86/boot/bzImage \
> 	-append "root=/dev/vda console=ttyS0,115200n8 intel_iommu=on" \
>         -device intel-iommu,intremap=on \
> 	-device ioh3420,multifunction=on,bus=pcie.0,id=port9-0,addr=9.0,chassis=0 \	
>         -blockdev driver=file,cache.direct=on,node-name=root,filename=/home/hch/images/bookworm.img \
> 	-blockdev driver=host_device,cache.direct=on,node-name=test,filename=/dev/nvme0n1p4 \
> 	-device virtio-blk,drive=root \
> 	-device nvme,drive=test,serial=1234

Thanks, Chaitanya will take a look.

If we put aside this issue, do you think that the proposed API is the right one?

BTW, I have a fancier command line; it is probably the root cause of working/not-working:
/opt/simx/bin/qemu-system-x86_64
        -append root=/dev/root rw ignore_loglevel rootfstype=9p
        rootflags="cache=loose,trans=virtio" earlyprintk=serial,ttyS0,115200
                console=hvc0 noibrs noibpb nopti nospectre_v2 nospectre_v1
                l1tf=off nospec_store_bypass_disable no_stf_barrier mds=off
                mitigations=off panic_on_warn=1
                intel_iommu=on iommu=nopt iommu.forcedac=true
                vfio_iommu_type1.allow_unsafe_interrupts=1
                systemd.hostname=mtl-leonro-l-vm
        -chardev stdio,id=stdio,mux=on,signal=off   
        -cpu host                                  
        -device virtio-rng-pci                    
        -device virtio-balloon-pci               
        -device isa-serial,chardev=stdio        
        -device virtio-serial-pci              
        -device virtconsole,chardev=stdio     
        -device virtio-9p-pci,fsdev=host_fs,mount_tag=/dev/root  
        -device virtio-9p-pci,fsdev=host_bind_fs0,mount_tag=bind0
        -device virtio-9p-pci,fsdev=host_bind_fs1,mount_tag=bind1
        -device virtio-9p-pci,fsdev=host_bind_fs2,mount_tag=bind2
        -device intel-iommu,intremap=on 
        -device connectx7              
        -device nvme,drive=drv0,serial=foo 
        -drive file=/home/leonro/.cache/mellanox/mkt/nvme-1g.raw,if=none,id=drv0,format=raw 
        -enable-kvm                                                                        
        -fsdev local,id=host_bind_fs1,security_model=none,path=/logs          
        -fsdev local,id=host_fs,security_model=none,path=/mnt/self           
        -fsdev local,id=host_bind_fs0,security_model=none,path=/plugins     
        -fsdev local,id=host_bind_fs2,security_model=none,path=/home/leonro
        -fw_cfg etc/sercon-port,string=2  
        -kernel /home/leonro/src/kernel/arch/x86/boot/bzImage 
        -m 5G -machine q35,kernel-irqchip=split              
        -mon chardev=stdio                                  
        -net nic,model=virtio,macaddr=52:54:01:d8:e5:f9    
        -net user,hostfwd=tcp:127.0.0.1:46645-:22  
        -no-reboot -nodefaults -nographic -smp 16 -vga none 

> 
>
Christoph Hellwig July 4, 2024, 7:48 a.m. UTC | #6
On Wed, Jul 03, 2024 at 06:51:14PM +0300, Leon Romanovsky wrote:
> If we put aside this issue, do you think that the proposed API is the right one?

I haven't look at it in detail yet, but from a quick look there is a
few things to note:


1) The amount of code needed in nvme worries me a bit.  Now NVMe is a messy
driver due to the stupid PRPs vs just using SGLs, but needing a fair
amount of extra boilerplate code in drivers is a bit of a warning sign.
I plan to look into this to see if I can help on improving it, but for
that I need a working version first.


2) The amount of seemingly unrelated global headers pulled into other
global headers.  Some of this might just be sloppiness, e.g. I can't
see why dma-mapping.h would actually need iommu.h to start with,
but pci.h in dma-map-ops.h is a no-go.

3) which brings me to real layering violations.  dev_is_untrusted and
dev_use_swiotlb are DMA API internals, no way I'd ever want to expose
them. dma-map-ops.h is a semi-internal header only for implementations
of the dma ops (as very clearly documented at the top of that file),
it must not be included by drivers.  Same for swiotlb.h.

Not quite as concerning, but doing an indirect call for each map
through dma_map_ops in addition to the iommu ops is not very efficient.
We've thought for a while about allowing direct calls to dma-iommu, similar
to how we do direct calls to dma-direct from the core mapping.c code.
This might be a good time to do that as a prep step for this work.
Leon Romanovsky July 4, 2024, 1:18 p.m. UTC | #7
On Thu, Jul 04, 2024 at 09:48:56AM +0200, Christoph Hellwig wrote:
> On Wed, Jul 03, 2024 at 06:51:14PM +0300, Leon Romanovsky wrote:
> > If we put aside this issue, do you think that the proposed API is the right one?
> 
> I haven't look at it in detail yet, but from a quick look there is a
> few things to note:
> 
> 
> 1) The amount of code needed in nvme worries me a bit.  Now NVMe is a messy
> driver due to the stupid PRPs vs just using SGLs, but needing a fair
> amount of extra boilerplate code in drivers is a bit of a warning sign.
> I plan to look into this to see if I can help on improving it, but for
> that I need a working version first.

Chaitanya is working on this and I'll join him to help next Sunday,
after I return to the office from my sick leave.

> 
> 
> 2) The amount of seemingly unrelated global headers pulled into other
> global headers.  Some of this might just be sloppiness, e.g. I can't
> see why dma-mapping.h would actually need iommu.h to start with,
> but pci.h in dma-map-ops.h is a no-go.

pci.h was pulled in because I needed to call pci_p2pdma_map_type()
in dma_can_use_iova().

> 
> 3) which brings me to real layering violations.  dev_is_untrusted and
> dev_use_swiotlb are DMA API internals, no way I'd ever want to expose
> them. dma-map-ops.h is a semi-internal header only for implementations
> of the dma ops (as very clearly documented at the top of that file),
> it must not be included by drivers.  Same for swiotlb.h.

These items shouldn't worry you and will be changed in the final version.
They are an outcome of the patch "RDMA/umem: Prevent UMEM ODP creation with SWIOTLB".
https://lore.kernel.org/all/d18c454636bf3cfdba9b66b7cc794d713eadc4a5.1719909395.git.leon@kernel.org/

All HMM users need such "prevention" so it will be moved to a common place.

> 
> Not quite as concerning, but doing an indirect call for each map
> through dma_map_ops in addition to the iommu ops is not very efficient.
> We've thought for a while about allowing direct calls to dma-iommu, similar
> to how we do direct calls to dma-direct from the core mapping.c code.
> This might be a good time to do that as a prep step for this work.

Sure, no problem, will start in parallel to work on this.

>
Christoph Hellwig July 5, 2024, 6 a.m. UTC | #8
On Thu, Jul 04, 2024 at 04:18:39PM +0300, Leon Romanovsky wrote:
> > 2) The amount of seemingly unrelated global headers pulled into other
> > global headers.  Some of this might just be sloppiness, e.g. I can't
> > see why dma-mapping.h would actually need iommu.h to start with,
> > but pci.h in dma-map-ops.h is a no-go.
> 
> pci.h was pulled in because I needed to call pci_p2pdma_map_type()
> in dma_can_use_iova().

No, that's not the reason.  The reason is actually that whole
dev_use_swiotlb mess which shouldn't exist in this form.
Christoph Hellwig July 5, 2024, 6:39 a.m. UTC | #9
Review from the NVMe driver consumer perspective.  I think if all these
were implemented we'd probably end up with less code than before the
conversion.

The split between dma_iova_attrs, dma_memory_type and dma_iova_state is
odd.  I would have expected them to just be a single object.  While
talking about this I think the domain field in dma_iova_state should
probably be a private pointer instead of being tied to the iommu.

Also do we need the attrs member in the iova_attrs structure?  The
"attrs" really are flags passed to the mapping routines that are
per-operation and not persistent, so I'd expect them to be passed
per-call and not stored in a structure.

I'd also expect that the use_iova field to be in the mapping state
and not separately provided by the driver.

For nvme specific data structures I would have expected a dma_addr/
len pair in struct iod_dma_map, maybe even using a common type.

Also the data structure split seems odd - I'd expect the actual
mapping state and a small number (at least one) dma_addr/len pair
to be inside the nvme_iod structure, and then only do the dynamic
allocation if we need more of them because there are more segments
and we are not using the iommu.
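
Roughly what that layout could look like (a sketch; the names and
fields are hypothetical, not the driver's actual structures):

struct nvme_dma_seg {
        dma_addr_t addr;
        u32 len;
};

struct nvme_iod {
        /* ... existing fields ... */
        struct nvme_dma_seg first;      /* inline: covers the common case */
        struct nvme_dma_seg *extra;     /* allocated only for many segments
                                         * without the iommu */
};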

If we had a common data structure for the dma_addr/len pairs
dma_unlink_range could just take care of the unmap for the non-iommu
case as well, which would be neat.  I'd also expect that
dma_free_iova would be covered by it.

I would have expected dma_link_range to return the dma_addr_t instead
of poking into the iova structure in the callers.
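
What that suggested change might look like (signature illustrative
only, not from the patches):

/* Return the DMA address directly instead of making callers read it
 * out of the iova state.
 */
dma_addr_t dma_link_range(struct dma_iova_state *state, phys_addr_t phys,
                          size_t size);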

In __nvme_rq_dma_map the <= PAGE_SIZE case is pointless.  In the
existing code the reason for it is to avoid allocating and mapping the
sg_table, but that code is still left before we even get to this code.

My suggestion above to only allocate the dma_addr/len pairs when there
is more than one (or a few) of them would allow trivially implementing that
suggestion using the normal API without having to keep that special
case and the dma_len parameter around.

If this added a version of dma_map_page_attrs that directly took
the physical address as a prep patch, the callers would not have to
bother with page pointer manipulations and could just work on physical
addresses for both the iommu and no-iommu cases.  It would also help
a little bit with the eventual switch to store the physical address
instead of page+offset in the bio_vec.  Talking about that, I've
been wanting to add a bvec_phys helper to convert the
page_to_phys(bv.bv_page) + bv.bv_offset calculations.  This is becoming
more urgent with more callers needing to do that, I'll try to get it out
to Jens ASAP so that it can make the 6.11 merge window.
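
Presumably the helper would look something like this (sketch):

static inline phys_addr_t bvec_phys(const struct bio_vec *bvec)
{
        return page_to_phys(bvec->bv_page) + bvec->bv_offset;
}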

Can we make dma_start_range / dma_end_range simple no-ops for the
non-iommu case, to avoid boilerplate code in the callers dealing with
the two cases?
Chaitanya Kulkarni July 5, 2024, 10:53 p.m. UTC | #10
On 7/3/24 07:35, Christoph Hellwig wrote:
> On Wed, Jul 03, 2024 at 01:52:53PM +0300, Leon Romanovsky wrote:
>> On Wed, Jul 03, 2024 at 07:42:38AM +0200, Christoph Hellwig wrote:
>>> I just tried to boot this on my usual qemu test setup with emulated
>>> nvme devices, and it dead-loops with messages like this fairly late
>>> in the boot cycle:
>>>
>>> [   43.826627] iommu: unaligned: iova 0xfff7e000 pa 0x000000010be33650 size 0x1000 min_pagesz 0x1000
>>> [   43.826982] dma_mapping_error -12
>>>
>>> passing intel_iommu=off instead of intel_iommu=on (expectedly) makes
>>> it go away.
>> Can you please share your kernel command line and qemu?
>> On my and Chaitanya setups it works fine.
> qemu-system-x86_64 \
>          -nographic \
> 	-enable-kvm \
> 	-m 6g \
> 	-smp 4 \
> 	-cpu host \
> 	-M q35,kernel-irqchip=split \
> 	-kernel arch/x86/boot/bzImage \
> 	-append "root=/dev/vda console=ttyS0,115200n8 intel_iommu=on" \
>          -device intel-iommu,intremap=on \
> 	-device ioh3420,multifunction=on,bus=pcie.0,id=port9-0,addr=9.0,chassis=0 \	
>          -blockdev driver=file,cache.direct=on,node-name=root,filename=/home/hch/images/bookworm.img \
> 	-blockdev driver=host_device,cache.direct=on,node-name=test,filename=/dev/nvme0n1p4 \
> 	-device virtio-blk,drive=root \
> 	-device nvme,drive=test,serial=1234
>

I tried to reproduce this issue, but somehow it is not reproducible.

I'll try again on Leon's setup on my Saturday night, to fix that
case.

-ck
Christoph Hellwig July 6, 2024, 6:26 a.m. UTC | #11
On Fri, Jul 05, 2024 at 10:53:06PM +0000, Chaitanya Kulkarni wrote:
> I tried to reproduce this issue, but somehow it is not reproducible.
> 
> I'll try again on Leon's setup on my Saturday night, to fix that
> case.

It is passthrough I/O from userspace.  The address is not page aligned
as seen in the printk.  Forcing bounce buffering of all passthrough
I/O makes it go away.

The problem is the first mapped segment does not have to be aligned
and we're missing the code to place it at the aligned offset into
the IOVA space.
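
A sketch of the missing handling (illustrative only, not the RFC's
actual code): IOMMU mappings must be min_pagesz aligned, so an
unaligned first segment has to be mapped from its page-aligned start,
with the device handed the IOVA plus the sub-page offset:

static dma_addr_t link_unaligned_range(struct iommu_domain *domain,
                                       unsigned long iova,
                                       phys_addr_t phys, size_t size)
{
        unsigned long offset = offset_in_page(phys);

        if (iommu_map(domain, iova, phys - offset,
                      PAGE_ALIGN(size + offset),
                      IOMMU_READ | IOMMU_WRITE, GFP_KERNEL))
                return DMA_MAPPING_ERROR;

        /* e.g. pa ...650 from the printk would map at iova + 0x650 */
        return iova + offset;
}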
Leon Romanovsky July 7, 2024, 9:16 a.m. UTC | #12
On Sat, Jul 06, 2024 at 08:26:04AM +0200, Christoph Hellwig wrote:
> On Fri, Jul 05, 2024 at 10:53:06PM +0000, Chaitanya Kulkarni wrote:
> > I tried to reproduce this issue, but somehow it is not reproducible.
> > 
> > I'll try again on Leon's setup on my Saturday night, to fix that
> > case.
> 
> It is passthrough I/O from userspace.  The address is not page aligned
> as seen in the printk.  Forcing bounce buffering of all passthrough
> I/O makes it go away.
> 
> The problem is the first mapped segment does not have to be aligned
> and we're missing the code to place it at the aligned offset into
> the IOVA space.

Thanks for the explanation.
Leon Romanovsky July 7, 2024, 9:45 a.m. UTC | #13
On Fri, Jul 05, 2024 at 08:39:10AM +0200, Christoph Hellwig wrote:
> Review from the NVMe driver consumer perspective.  I think if all these
> were implemented we'd probably end up with less code than before the
> conversion.

Thanks for the review, I will try to address all the comments in the next version.

> 
> The split between dma_iova_attrs, dma_memory_type and dma_iova_state is
> odd.  I would have expected them to just be a single object.  While
> talking about this I think the domain field in dma_iova_state should
> probably be a private pointer instead of being tied to the iommu.
> 
> Also do we need the attrs member in the iova_attrs structure?  The
> "attrs" really are flags passed to the mapping routines that are
> per-operation and not persistent, so I'd expect them to be passed
> per-call and not stored in a structure.

It is a left-over from my unsent version, where I added a new attribute
to indicate that dma_alloc_iova() can't support SWIOTLB, to avoid the
dev_use_swiotlb() mess. I will remove it.

> 
> I'd also expect that the use_iova field to be in the mapping state
> and not separately provided by the driver.
> 
> For nvme specific data structures I would have expected a dma_addr/
> len pair in struct iod_dma_map, maybe even using a common type.
> 
> Also the data structure split seems odd - I'd expect the actual
> mapping state and a small number (at least one) dma_addr/len pair
> to be inside the nvme_iod structure, and then only do the dynamic
> allocation if we need more of them because there are more segments
> and we are not using the iommu.
> 
> If we had a common data structure for the dma_addr/len pairs
> dma_unlink_range could just take care of the unmap for the non-iommu
> case as well, which would be neat.  I'd also expect that
> dma_free_iova would be covered by it.

Internally Jason asked for the same thing, but I didn't want to produce an
asymmetric API where drivers have a call to dma_alloc_iova() but don't
have a call to dma_free_iova(). However, now it is 2 versus 1, so I will
change it.

> 
> I would have expected dma_link_range to return the dma_addr_t instead
> of poking into the iova structure in the callers.
> 
> In __nvme_rq_dma_map the <= PAGE_SIZE case is pointless.  In the
> existing code the reason for it is to avoid allocating and mapping the
> sg_table, but that code is still left before we even get to this code.
> 
> My suggestion above to only allocate the dma_addr/len pairs when there
> is more than one (or a few) of them would allow trivially implementing that
> suggestion using the normal API without having to keep that special
> case and the dma_len parameter around.
> 
> If this added a version of dma_map_page_attrs that directly took
> the physical address as a prep patch, the callers would not have to
> bother with page pointer manipulations and could just work on physical
> addresses for both the iommu and no-iommu cases.  It would also help
> a little bit with the eventual switch to store the physical address
> instead of page+offset in the bio_vec.  Talking about that, I've
> been wanting to add a bvec_phys helper to convert the
> page_to_phys(bv.bv_page) + bv.bv_offset calculations.  This is becoming
> more urgent with more callers needing to do that, I'll try to get it out
> to Jens ASAP so that it can make the 6.11 merge window.
> 
> Can we make dma_start_range / dma_end_range simple no-ops for the
> non-iommu case, to avoid boilerplate code in the callers dealing with
> the two cases?

Yes, sure.

Thanks
Leon Romanovsky July 7, 2024, 12:45 p.m. UTC | #14
On Fri, Jul 05, 2024 at 10:53:06PM +0000, Chaitanya Kulkarni wrote:
> On 7/3/24 07:35, Christoph Hellwig wrote:
> > On Wed, Jul 03, 2024 at 01:52:53PM +0300, Leon Romanovsky wrote:
> >> On Wed, Jul 03, 2024 at 07:42:38AM +0200, Christoph Hellwig wrote:
> >>> I just tried to boot this on my usual qemu test setup with emulated
> >>> nvme devices, and it dead-loops with messages like this fairly late
> >>> in the boot cycle:
> >>>
> >>> [   43.826627] iommu: unaligned: iova 0xfff7e000 pa 0x000000010be33650 size 0x1000 min_pagesz 0x1000
> >>> [   43.826982] dma_mapping_error -12
> >>>
> >>> passing intel_iommu=off instead of intel_iommu=on (expectedly) makes
> >>> it go away.
> >> Can you please share your kernel command line and qemu?
> >> On my and Chaitanya setups it works fine.
> > qemu-system-x86_64 \
> >          -nographic \
> > 	-enable-kvm \
> > 	-m 6g \
> > 	-smp 4 \
> > 	-cpu host \
> > 	-M q35,kernel-irqchip=split \
> > 	-kernel arch/x86/boot/bzImage \
> > 	-append "root=/dev/vda console=ttyS0,115200n8 intel_iommu=on" \
> >          -device intel-iommu,intremap=on \
> > 	-device ioh3420,multifunction=on,bus=pcie.0,id=port9-0,addr=9.0,chassis=0 \	
> >          -blockdev driver=file,cache.direct=on,node-name=root,filename=/home/hch/images/bookworm.img \
> > 	-blockdev driver=host_device,cache.direct=on,node-name=test,filename=/dev/nvme0n1p4 \
> > 	-device virtio-blk,drive=root \
> > 	-device nvme,drive=test,serial=1234
> >
> 
> I tried to reproduce this issue, but somehow it is not reproducible.
> 
> I'll try again on Leon's setup on my Saturday night, to fix that
> case.

Chaitanya,

I added "mem_align=120" line to fio configuration file and the issue reproduced.

Thanks

> 
> -ck
> 
> 
>
Jason Gunthorpe July 8, 2024, 4:52 p.m. UTC | #15
On Thu, Jul 04, 2024 at 09:48:56AM +0200, Christoph Hellwig wrote:

> 1) The amount of code needed in nvme worries me a bit.  Now NVMe is a messy
> driver due to the stupid PRPs vs just using SGLs, but needing a fair
> amount of extra boilerplate code in drivers is a bit of a warning sign.
> I plan to look into this to see if I can help on improving it, but for
> that I need a working version first.

It would be nice to have less.  So much now depends on the caller to
provide both the input and output data structure.

Ideally we'd have some template code that consolidates these loops to
common code with driver provided hooks - there are a few ways to get
that efficiently in C.

I think it will be clearer when we get to RDMA and there we have the
same SGL/PRP kind of split up and we can see what is sharable.

> Not quite as concerning, but doing an indirect call for each map
> through dma_map_ops in addition to the iommu ops is not very
> efficient.

Yeah, there is no reason to support anything other than dma-iommu.c
for the iommu path, so the dma_map_ops indirection for this could just
be removed.

I'm also cooking something that should let us build a way to iommu map
a bio_vec very efficiently, which should transform this into a single
indirect call into the iommu driver per bio_vec, and a single radix
walk/etc.

> We've thought for a while about allowing direct calls to dma-iommu, similar
> to how we do direct calls to dma-direct from the core mapping.c code.
> This might be a good time to do that as a prep step for this work.

I think there is room to benchmark and further improve these
paths. Even the fast direct map path is not compiling down to a single
load/store instruction per bio_vec entry as would be ideal.

Jason
Jason Gunthorpe July 8, 2024, 11:57 p.m. UTC | #16
On Fri, Jul 05, 2024 at 08:39:10AM +0200, Christoph Hellwig wrote:
> Review from the NVMe driver consumer perspective.  I think if all these
> were implemented we'd probably end up with less code than before the
> conversion.

One of the topics that came up at LSF is what to do with the
dma_memory_type data?

The concept here was that the new DMA API would work on same-type
memory only, meaning the caller would have to split up by type.

I understand the block stack already does this using P2P and !P2P, but
that isn't quite enough here as we want to split principally based on
IOMMU or !IOMMU.

Today that boils down to splitting based on a few things, like
grouping encrypted memory, and grouping by P2P provider.

So the idea was that "struct dma_memory_type" would encapsulate
whatever grouping was needed, the block layer would drive its
splitting on this, and ranges with the same struct could be merged.

I didn't notice this being included in this RFC.

The trivial option is to store the dma_memory_type per-BIO and drive
the de-duplication like that. The other could be to derive it from
first entry in the bio_vect every time a new page is added.

This same-type design is one of the key elements of this API
arrangement - for RDMA I expect we will need a data structure storing
a list of types with a list of pages, and we will DMA map type by
type.

Jason
Christoph Hellwig July 9, 2024, 6:17 a.m. UTC | #17
On Mon, Jul 08, 2024 at 01:52:38PM -0300, Jason Gunthorpe wrote:
> Ideally we'd have some template code that consolidates these loops to
> common code with driver provided hooks - there are a few ways to get
> that efficiently in C.
> 
> I think it will be clearer when we get to RDMA and there we have the
> same SGL/PRP kind of split up and we can see what is sharable.

I really would not want to build common code for PRPs - this is a concept
very specific to RDMA and NVMe.  OTOH more common code for SGLs would be
nice.  If you look at e.g. SCSI drivers most of them have a simple loop of
mapping the SG table and then copying the fields into the hardware SGL.
This would be a very common case for a helper.
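
A sketch of that common pattern using the standard SCSI midlayer
helpers (struct hw_sge and the function are hypothetical):

struct hw_sge {
        __le64 addr;
        __le32 len;
};

static int map_to_hw_sgl(struct scsi_cmnd *cmd, struct hw_sge *hw_sgl)
{
        struct scatterlist *sg;
        int i, count;

        count = scsi_dma_map(cmd);      /* DMA-maps the command's sg table */
        if (count <= 0)
                return count;

        scsi_for_each_sg(cmd, sg, count, i) {
                hw_sgl[i].addr = cpu_to_le64(sg_dma_address(sg));
                hw_sgl[i].len  = cpu_to_le32(sg_dma_len(sg));
        }
        return count;
}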

That whole thing of course opens the question if we want a pure
in-memory version of the dma_addr_t/len tuple.  IMHO that is the best
way to migrate and allows us to share code easily.  We can look into ways
to avoid that more for drivers that care, but most drivers are
probably best served with it to keep the code simple and make the
conversion easier.
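
A minimal sketch of such an in-memory tuple (the "dma_vec" name and
layout are hypothetical):

struct dma_vec {
        dma_addr_t addr;
        u32 len;
};

Drivers would then carry an array of these instead of a struct
scatterlist, with no struct page in the DMA output at all.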

> I'm also cooking something that should let us build a way to iommu map
> a bio_vec very efficiently, which should transform this into a single
> indirect call into the iommu driver per bio_vec, and a single radix
> walk/etc.

I assume you mean an array of bio_vecs here.  That would indeed be nice.
We'd still potentially need a few calls for block drivers as
requests can have multiple bios and thus bio_vec arrays, but it would
still be a nice reduction of calls.
Christoph Hellwig July 9, 2024, 6:20 a.m. UTC | #18
On Mon, Jul 08, 2024 at 08:57:21PM -0300, Jason Gunthorpe wrote:
> I understand the block stack already does this using P2P and !P2P, but
> that isn't quite enough here as we want to split principally based on
> IOMMU or !IOMMU.

Except for the powerpc bypass, IOMMU or not is a global decision,
and the bypass is per I/O.  So I'm not sure what else you want there?
Jason Gunthorpe July 9, 2024, 6:53 p.m. UTC | #19
On Tue, Jul 09, 2024 at 08:17:21AM +0200, Christoph Hellwig wrote:
> On Mon, Jul 08, 2024 at 01:52:38PM -0300, Jason Gunthorpe wrote:
> > Ideally we'd have some template code that consolidates these loops to
> > common code with driver provided hooks - there are a few ways to get
> > that efficiently in C.
> > 
> > I think it will be clearer when we get to RDMA and there we have the
> > same SGL/PRP kind of split up and we can see what is sharable.
> 
> I really would not want to build common code for PRPs - this is a concept
> very specific to RDMA and NVMe.  

I think DRM has it too. If you are populating a GPU page table then it
is basically a convoluted PRP. Probably requires different splitting
logic than what RDMA does, but I've never looked.

> OTOH more common code SGLs would be nice.  If you look at e.g. SCSI
> drivers most of them have a simpe loop of mapping the SG table and
> then copying the fields into the hardware SGL.  This would be a very
> common case for a helper.

Yes, I believe this is very common.

> That whole thing of course opens the question if we want a pure
> in-memory version of the dma_addr_t/len tuple.  IMHO that is the best
> > way to migrate and allows us to share code easily.  We can look into ways
> > to avoid that more for drivers that care, but most drivers are
> > probably best served with it to keep the code simple and make the
> conversion easier.

My feeling has been that this RFC is the low level interface and we
can bring our own data structure on top.

It would probably make sense to build a scatterlist v2 on top of this
that has an in-memory dma_addr_t/len list close to today. Yes it costs
a memory allocation, or a larger initial allocation, but many places
may not really care. Block drivers have always allocated a SGL, for
instance.

Then the verbosity of this API is less important as we may only use it
in a few places.

My main takeaway was that we should make the dma_ops interface
simpler and more general so we can have this choice instead of welding
a single data structure through everything.

> > I'm also cooking something that should let us build a way to iommu map
> > a bio_vec very efficiently, which should transform this into a single
> > indirect call into the iommu driver per bio_vec, and a single radix
> > walk/etc.
>
> I assume you mean an array of bio_vecs here.  That would indeed be nice.
> We'd still potentially need a few calls for block drivers as
> requests can have multiple bios and thus bio_vec arrays, but it would
> still be a nice reduction of calls.

Yes. iommufd has performance needs here, not sure what it will turn
into but bio_vec[] direct to optimized radix manipulation is
something I'd be keen to see.

Jason
Jason Gunthorpe July 9, 2024, 7:03 p.m. UTC | #20
On Tue, Jul 09, 2024 at 08:20:15AM +0200, Christoph Hellwig wrote:
> On Mon, Jul 08, 2024 at 08:57:21PM -0300, Jason Gunthorpe wrote:
> > I understand the block stack already does this using P2P and !P2P, but
> > that isn't quite enough here as we want to split principally based on
> > IOMMU or !IOMMU.
> 
> Except for the powerpc bypass, IOMMU or not is a global decision,
> and the bypass is per I/O.  So I'm not sure what else you want there?

For P2P we know if the DMA will go through the IOMMU or not based on
the PCIe fabric path between the initiator (the one doing the DMA) and
the target (the one providing the MMIO memory).

Depending on PCIe topology and ACS flags this path may use the IOMMU
or may skip the IOMMU.

To put it in code, the 'enum pci_p2pdma_map_type' can only be
determined once we know the initiator and target struct device.

PCI_P2PDMA_MAP_BUS_ADDR means we don't use the iommu.

PCI_P2PDMA_MAP_THRU_HOST_BRIDGE means we do.

With this API it is important that a single request always has the
same PCI_P2PDMA_MAP_* outcome, and the simplest way to do that is to
split requests if the MMIO memory changes target struct devices.
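
In code form, the invariant might be checked like this (a sketch:
pci_p2pdma_map_type() is internal to drivers/pci/p2pdma.c today, and
per Leon's note above the series calls it from dma_can_use_iova(), so
treat the exact usage as illustrative):

switch (pci_p2pdma_map_type(pgmap, dma_dev)) {
case PCI_P2PDMA_MAP_BUS_ADDR:
        /* switch-routed: program bus addresses, no IOMMU involved */
        break;
case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
        /* goes through the host bridge: map via the IOMMU */
        break;
default:
        /* a change of type means the request must be split */
        break;
}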

Jason
Christoph Hellwig July 10, 2024, 6:22 a.m. UTC | #21
On Tue, Jul 09, 2024 at 04:03:20PM -0300, Jason Gunthorpe wrote:
> > Except for the powerpc bypass, IOMMU or not is a global decision,
> > and the bypass is per I/O.  So I'm not sure what else you want there?
> 
> For P2P we know if the DMA will go through the IOMMU or not based on
> the PCIe fabric path between the initiator (the one doing the DMA) and
> the target (the one providing the MMIO memory).

Oh, yes.  So effectively you are asking if we can arbitrarily mix
P2P sources in a single map request.  I think the only sane answer
from the iommu/dma subsystem perspective is: hell no.

But that means the upper layer needs to split at such a boundary.
E.g. get_user_pages needs to look at this and stop at the boundary,
leaving the rest to the next call.

For the block layer just having one kind per BIO is fine right now,
although I could see use cases where people would want to combine
them.  We can probably defer that until it is needed, though.
Christoph Hellwig July 10, 2024, 6:27 a.m. UTC | #22
On Tue, Jul 09, 2024 at 03:53:15PM -0300, Jason Gunthorpe wrote:
> > That whole thing of course opens the question if we want a pure
> > in-memory version of the dma_addr_t/len tuple.  IMHO that is the best
> > way to migrate and allows us to share code easily.  We can look into ways
> > to avoid that more for drivers that care, but most drivers are
> > probably best served with it to keep the code simple and make the
> > conversion easier.
> 
> My feeling has been that this RFC is the low level interface and we
> can bring our own data structure on top.
>
> It would probably make sense to build a scatterlist v2 on top of this
> that has an in-memory dma_addr_t/len list close to today

Yes, the usage of the dma_vec would be in a higher layer.  But I'd
really like to see it from the beginning.

> . Yes it costs
> a memory allocation, or a larger initial allocation, but many places
> may not really care. Block drivers have always allocated a SGL, for
> instance.

Except for those optimizing for small transfers of a single segment
(like nvme).

> My main takeaway was that we should make the dma_ops interface
> simpler and more general so we can have this choice instead of welding
> a single data structure through everything.

Yes, I don't think the dma_vec should be the low-level interface.
I think a low-level interface based on physical address is the right
one.  I'll see what I can do to move the single segment map interface
to be physical address based instead of page based so that we can
unify them.
Jason Gunthorpe July 11, 2024, 11:21 p.m. UTC | #23
On Wed, Jul 10, 2024 at 08:27:04AM +0200, Christoph Hellwig wrote:
> On Tue, Jul 09, 2024 at 03:53:15PM -0300, Jason Gunthorpe wrote:
> > > That whole thing of course opens the question if we want a pure
> > > in-memory version of the dma_addr_t/len tuple.  IMHO that is the best
> > > way to migrate and allows to share code easily.  We can look into ways
> > > to avoiding that more for drivers that care, but most drivers are
> > > probably best serve with it to keep the code simple and make the
> > > conversion easier.
> > 
> > My feeling has been that this RFC is the low level interface and we
> > can bring our own data structure on top.
> >
> > It would probably make sense to build a scatterlist v2 on top of this
> > that has an in-memory dma_addr_t/len list close to today
> 
> Yes, the usage of the dma_vec would be in a higher layer.  But I'd
> really like to see it from the beginning.

Well, let's start by agreeing on this layer's API and being confident it
can succeed.

Then I'd say to look at RDMA as it is a logical place to build such a
data structure and we can build something that at least does what RDMA
needs.

I need something anyhow to plumb through to DMABUF and over to iommufd
and VFIO, can't skip out on it :)

> Yes, I don't think the dma_vec should be the low-level interface.
> I think a low-level interface based on physical address is the right
> one.  I'll see what I can do to move the single segment map interface
> to be physical address based instead of page based so that we can
> unify them.

Yeah, I've been talking to Matthew, explaining that starting at the DMA
API makes the most sense and lets us remove the mandatory struct page
entanglements there. Then we can start to examine other layers. Having
a consistent option in the DMA API to be physically based with a
memory type fits with that plan.

Jason
Jason Gunthorpe July 11, 2024, 11:29 p.m. UTC | #24
On Wed, Jul 10, 2024 at 08:22:12AM +0200, Christoph Hellwig wrote:
> On Tue, Jul 09, 2024 at 04:03:20PM -0300, Jason Gunthorpe wrote:
> > > Except for the powerpc bypass IOMMU or not is a global decision,
> > > and the bypass is per I/O.  So I'm not sure what else you want there?
> > 
> > For P2P we know if the DMA will go through the IOMMU or not based on
> > the PCIe fabric path between the initiator (the one doing the DMA) and
> > the target (the one providing the MMIO memory).
> 
> Oh, yes.  So effectively you are asking if we can arbitrarily mix
> P2P sources in a single map request.  I think the only sane answer
> from the iommu/dma subsystem perspective is: hell no.

Well, today we can mix them and dma_map_sg will sort it out. With
this new API we can't anymore.

So this little detail needs to be taken care of somehow as well, and I
didn't see it in this RFC.

> For the block layer just having one kind per BIO is fine right now,
> although I could see use cases where people would want to combine
> them.  We can probably defer that until it is needed, though.

Do you have an application in mind that would want multi-kind per BIO?

Jason
Christoph Hellwig July 12, 2024, 4:54 a.m. UTC | #25
On Thu, Jul 11, 2024 at 08:29:17PM -0300, Jason Gunthorpe wrote:
> So this little detail needs to be taken care of somehow as well, and I
> didn't see it in this RFC.

Yes.  Same about generally not mixing P2P and non-P2P.

> 
> > For the block layer just having one kind per BIO is fine right now,
> > although I could see use cases where people would want to combine
> > them.  We can probably defer that until it is needed, though.
> 
> Do you have an application in mind that would want multi-kind per BIO?

If you are doing, say, garbage collection in a file system, and do a write
that sources data from multiple devices, where some sit directly on the
root port and others behind a switch.

This is all purely hypothetical, and I'm happy to just check for it
and reject it for now.
Jason Gunthorpe July 12, 2024, 12:42 p.m. UTC | #26
On Fri, Jul 12, 2024 at 06:54:22AM +0200, Christoph Hellwig wrote:

> This is all purely hypothetical, and I'm happy to just check for it
> and reject it for now.

I do know a patch set is cooking to allow mixing ZONE_DEVICE P2P and
anon memory in the same VMA ala HMM with transparent migration of
ZONE_DEVICE to anon.

In this situation userspace will be generating IO with no idea about
any P2P/!P2P boundaries.

Jason
Christoph Hellwig July 13, 2024, 5:24 a.m. UTC | #27
On Fri, Jul 12, 2024 at 09:42:37AM -0300, Jason Gunthorpe wrote:
> On Fri, Jul 12, 2024 at 06:54:22AM +0200, Christoph Hellwig wrote:
> 
> > This is all purely hypothetical, and I'm happy to just check for it
> > and reject it for now.
> 
> I do know a patch set is cooking to allow mixing ZONE_DEVICE P2P and
> anon memory in the same VMA ala HMM with transparent migration of
> ZONE_DEVICE to anon.
> 
> In this situation userspace will be generating IO with no idea about
> any P2P/!P2P boundaries.

Yes.  And as said from the beginning of this discussion, I think the
right way is to change the gup code so that a single call to
get/pin_user_pages always returns either only P2P pages or only
non-P2P pages, with FOLL_PCI_P2PDMA just allowing P2P pages at all.
Similarly, they could easily return only one kind of P2P page per
call.