[RFC,v1,00/18] Provide a new two step DMA API mapping API

Message-ID: <cover.1719909395.git.leon@kernel.org>
From: Leon Romanovsky <leon@kernel.org>
To: Jens Axboe <axboe@kernel.dk>, Jason Gunthorpe <jgg@ziepe.ca>, Robin Murphy <robin.murphy@arm.com>, Joerg Roedel <joro@8bytes.org>, Will Deacon <will@kernel.org>, Keith Busch <kbusch@kernel.org>, Christoph Hellwig <hch@lst.de>, "Zeng, Oak" <oak.zeng@intel.com>, Chaitanya Kulkarni <kch@nvidia.com>
Cc: Sagi Grimberg <sagi@grimberg.me>, Bjorn Helgaas <bhelgaas@google.com>, Logan Gunthorpe <logang@deltatee.com>, Yishai Hadas <yishaih@nvidia.com>, Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>, Kevin Tian <kevin.tian@intel.com>, Alex Williamson <alex.williamson@redhat.com>, Marek Szyprowski <m.szyprowski@samsung.com>, Jérôme Glisse <jglisse@redhat.com>, Andrew Morton <akpm@linux-foundation.org>, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org, iommu@lists.linux.dev, linux-nvme@lists.infradead.org, linux-pci@vger.kernel.org, kvm@vger.kernel.org, linux-mm@kvack.org
Subject: [RFC PATCH v1 00/18] Provide a new two step DMA API mapping API
Date: Tue, 2 Jul 2024 12:09:30 +0300

Leon Romanovsky July 2, 2024, 9:09 a.m. UTC

Changelog:
v1:
* Rewrote cover letter
* Changed to the API proposed in
https://lore.kernel.org/linux-rdma/20240322184330.GL66976@ziepe.ca/
* Removed IB DMA wrappers and used the DMA API directly
v0: https://lore.kernel.org/all/cover.1709635535.git.leon@kernel.org
-------------------------------------------------------------------------

Currently the only efficient way to map a complex memory description through
the DMA API is by using the scatterlist APIs. The SG APIs are unique in that
they efficiently combine the two fundamental operations of sizing and allocating
a large IOVA window from the IOMMU and processing all the per-address
swiotlb/flushing/p2p/map details.

This uniqueness has been a long standing pain point as the scatterlist API
is mandatory, but expensive to use. It prevents any kind of optimization or
feature improvement (such as avoiding struct page for P2P) due to the impossibility
of improving the scatterlist.

Several approaches have been explored to expand the DMA API with additional
scatterlist-like structures (BIO[1], rlist[2]). Instead, this series splits up
the DMA API to allow callers to bring their own data structure.

The API is split up into parts:
 - dma_alloc_iova() / dma_free_iova()
   To do any pre-allocation required. This is done based on the caller
   supplying some details about how much IOMMU address space it would need
   in the worst case.
 - dma_link_range() / dma_unlink_range()
   Perform the actual mapping into the pre-allocated IOVA. This is very
   similar to dma_map_page().

A driver will compute the extent of its mapping from its own data structure,
such as a BIO, to request the required IOVA. Then it will iterate directly over
its data structure to DMA map each range. The result can then be stored directly
into the HW specific DMA list. No intermediate scatterlist is required.
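The flow described above can be modelled in plain userspace C. This is only a sketch of the two-step idea, not kernel code: every name prefixed with mock_ is hypothetical, the signatures are not those of dma_alloc_iova() or dma_link_range(), and the "IOVA window" is a simple bump allocator standing in for the IOMMU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_PAGE_SIZE 4096u

/* Hypothetical stand-in for the preallocated IOVA window (step one). */
struct mock_iova {
	uint64_t base;	/* start of the reserved window */
	size_t size;	/* bytes reserved for the worst case */
	size_t used;	/* bytes linked so far */
};

/* One entry of a driver-private, HW specific DMA list. */
struct mock_hw_entry {
	uint64_t dma_addr;
	size_t len;
};

/* Step one: reserve address space once, sized for the worst case. */
int mock_alloc_iova(struct mock_iova *iova, uint64_t base, size_t size)
{
	if (size == 0 || size % MOCK_PAGE_SIZE)
		return -1;
	iova->base = base;
	iova->size = size;
	iova->used = 0;
	return 0;
}

/* Step two: link one range into the window and report its DMA address. */
int mock_link_range(struct mock_iova *iova, size_t len, uint64_t *dma_addr)
{
	assert(len > 0);
	if (iova->used + len > iova->size)
		return -1;	/* window exhausted */
	*dma_addr = iova->base + iova->used;
	iova->used += len;
	return 0;
}

/*
 * Driver flow: walk the caller's own list of segment lengths and fill the
 * HW list directly, with no intermediate scatterlist-like structure.
 */
int mock_map_segments(struct mock_iova *iova, const size_t *seg_len,
		      size_t nsegs, struct mock_hw_entry *hw)
{
	for (size_t i = 0; i < nsegs; i++) {
		if (mock_link_range(iova, seg_len[i], &hw[i].dma_addr))
			return -1;
		hw[i].len = seg_len[i];
	}
	return 0;
}
```

In this model a caller reserves the window once with mock_alloc_iova() and then calls mock_map_segments() with its own array of lengths; the resulting DMA addresses are contiguous inside the window.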

In this series, three example users are converted to the new API to show
the benefits. Each user has a unique flow:
 1. RDMA ODP is an example of "SVA mirroring" using HMM that needs to
    dynamically map/unmap large numbers of single pages. This becomes
    significantly faster in the IOMMU case as the map/unmap is now just
    a page table walk; the IOVA allocation is pre-computed once. Significant
    amounts of memory are saved as there is no longer a need to store the
    dma_addr_t of each page.
 2. VFIO PCI live migration code is building a very large "page list"
    for the device. Instead of allocating a scatter list entry per allocated
    page it can just allocate an array of 'struct page *', saving a large
    amount of memory.
 3. NVMe PCI demonstrates how a BIO can be converted to a HW scatter
    list without having to allocate then populate an intermediate SG table.
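The memory saving claimed for the ODP case can be illustrated with a small userspace calculation. This sketch rests on two assumptions that are not stated in the cover letter: that a stored per-page DMA address takes 8 bytes, and that with a preallocated window the address of page i is simply the window base plus i pages. All mock_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_PAGE_SHIFT 12	/* assumes 4 KiB pages */

/* Old shape: bytes spent on a table holding one 8-byte DMA address per page. */
size_t mock_per_page_table_bytes(size_t npages)
{
	return npages * sizeof(uint64_t);
}

/*
 * New shape (assumed): nothing is stored per page; the DMA address is
 * derived from the window base and the page index on demand.
 */
uint64_t mock_page_dma_addr(uint64_t iova_base, size_t npages, size_t page_index)
{
	assert(page_index < npages);
	return iova_base + ((uint64_t)page_index << MOCK_PAGE_SHIFT);
}
```

Under these assumptions, a 1 GiB region of 4 KiB pages (262144 pages) would otherwise need a 2 MiB address table, while the derived form needs only the base address.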

This is the first step along a path to provide alternatives to scatterlist and
solve some of the abuses and design mistakes, for instance in DMABUF's P2P
support.

The ODP and VFIO versions are complete and fully tested; they can serve as the
users of the new API needed to merge it. The NVMe conversion requires more work.

[1] https://lore.kernel.org/all/169772852492.5232.17148564580779995849.stgit@klimt.1015granger.net/
[2] https://lore.kernel.org/all/ZD2lMvprVxu23BXZ@ziepe.ca/

Chaitanya Kulkarni (2):
  block: export helper to get segment max size
  nvme-pci: use new dma API

Leon Romanovsky (16):
  dma-mapping: query DMA memory type
  dma-mapping: provide an interface to allocate IOVA
  dma-mapping: check if IOVA can be used
  dma-mapping: implement link range API
  mm/hmm: let users to tag specific PFN with DMA mapped bit
  dma-mapping: provide callbacks to link/unlink HMM PFNs to specific
    IOVA
  iommu/dma: Provide an interface to allow preallocate IOVA
  iommu/dma: Implement link/unlink ranges callbacks
  RDMA/umem: Preallocate and cache IOVA for UMEM ODP
  RDMA/umem: Store ODP access mask information in PFN
  RDMA/core: Separate DMA mapping to caching IOVA and page linkage
  RDMA/umem: Prevent UMEM ODP creation with SWIOTLB
  vfio/mlx5: Explicitly use number of pages instead of allocated length
  vfio/mlx5: Rewrite create mkey flow to allow better code reuse
  vfio/mlx5: Explicitly store page list
  vfio/mlx5: Convert vfio to use DMA link API

Christoph Hellwig July 3, 2024, 5:42 a.m. UTC | #1

I just tried to boot this on my usual qemu test setup with emulated
nvme devices, and it dead-loops with messages like this fairly late
in the boot cycle:

[   43.826627] iommu: unaligned: iova 0xfff7e000 pa 0x000000010be33650 size 0x1000 min_pagesz 0x1000
[   43.826982] dma_mapping_error -12

passing intel_iommu=off instead of intel_iommu=on (expectedly) makes
it go away.

Zhu Yanjun July 3, 2024, 10:42 a.m. UTC | #2

On 2024/7/3 13:42, Christoph Hellwig wrote:
> I just tried to boot this on my usual qemu test setup with emulated
> nvme devices, and it dead-loops with messages like this fairly late
> in the boot cycle:
> 
> [   43.826627] iommu: unaligned: iova 0xfff7e000 pa 0x000000010be33650 size 0x1000 min_pagesz 0x1000
> [   43.826982] dma_mapping_error -12
> 
> passing intel_iommu=off instead of intel_iommu=on (expectedly) makes
> it go away.

I also ran into this problem. IMO, with intel_iommu=off the drivers
under drivers/iommu are not used.

If I remember correctly, someone in the thread for the first version
had a fix for this problem. I will check the mails.

For my part, I just reverted this commit because I do not use the
nvme part.

Zhu Yanjun

Leon Romanovsky July 3, 2024, 10:52 a.m. UTC | #3

On Wed, Jul 03, 2024 at 07:42:38AM +0200, Christoph Hellwig wrote:
> I just tried to boot this on my usual qemu test setup with emulated
> nvme devices, and it dead-loops with messages like this fairly late
> in the boot cycle:
> 
> [   43.826627] iommu: unaligned: iova 0xfff7e000 pa 0x000000010be33650 size 0x1000 min_pagesz 0x1000
> [   43.826982] dma_mapping_error -12
> 
> passing intel_iommu=off instead of intel_iommu=on (expectedly) makes
> it go away.

Can you please share your kernel command line and qemu?
On my and Chaitanya's setups it works fine.

Thanks

Christoph Hellwig July 3, 2024, 2:35 p.m. UTC | #4

On Wed, Jul 03, 2024 at 01:52:53PM +0300, Leon Romanovsky wrote:
> On Wed, Jul 03, 2024 at 07:42:38AM +0200, Christoph Hellwig wrote:
> > I just tried to boot this on my usual qemu test setup with emulated
> > nvme devices, and it dead-loops with messages like this fairly late
> > in the boot cycle:
> > 
> > [   43.826627] iommu: unaligned: iova 0xfff7e000 pa 0x000000010be33650 size 0x1000 min_pagesz 0x1000
> > [   43.826982] dma_mapping_error -12
> > 
> > passing intel_iommu=off instead of intel_iommu=on (expectedly) makes
> > it go away.
> 
> Can you please share your kernel command line and qemu?
> On my and Chaitanya's setups it works fine.

qemu-system-x86_64 \
	-nographic \
	-enable-kvm \
	-m 6g \
	-smp 4 \
	-cpu host \
	-M q35,kernel-irqchip=split \
	-kernel arch/x86/boot/bzImage \
	-append "root=/dev/vda console=ttyS0,115200n8 intel_iommu=on" \
	-device intel-iommu,intremap=on \
	-device ioh3420,multifunction=on,bus=pcie.0,id=port9-0,addr=9.0,chassis=0 \
	-blockdev driver=file,cache.direct=on,node-name=root,filename=/home/hch/images/bookworm.img \
	-blockdev driver=host_device,cache.direct=on,node-name=test,filename=/dev/nvme0n1p4 \
	-device virtio-blk,drive=root \
	-device nvme,drive=test,serial=1234

Leon Romanovsky July 3, 2024, 3:51 p.m. UTC | #5

On Wed, Jul 03, 2024 at 04:35:30PM +0200, Christoph Hellwig wrote:
> On Wed, Jul 03, 2024 at 01:52:53PM +0300, Leon Romanovsky wrote:
> > On Wed, Jul 03, 2024 at 07:42:38AM +0200, Christoph Hellwig wrote:
> > > I just tried to boot this on my usual qemu test setup with emulated
> > > nvme devices, and it dead-loops with messages like this fairly late
> > > in the boot cycle:
> > > 
> > > [   43.826627] iommu: unaligned: iova 0xfff7e000 pa 0x000000010be33650 size 0x1000 min_pagesz 0x1000
> > > [   43.826982] dma_mapping_error -12
> > > 
> > > passing intel_iommu=off instead of intel_iommu=on (expectedly) makes
> > > it go away.
> > 
> > Can you please share your kernel command line and qemu?
> > On my and Chaitanya's setups it works fine.
> 
> qemu-system-x86_64 \
>         -nographic \
> 	-enable-kvm \
> 	-m 6g \
> 	-smp 4 \
> 	-cpu host \
> 	-M q35,kernel-irqchip=split \
> 	-kernel arch/x86/boot/bzImage \
> 	-append "root=/dev/vda console=ttyS0,115200n8 intel_iommu=on" \
>         -device intel-iommu,intremap=on \
> 	-device ioh3420,multifunction=on,bus=pcie.0,id=port9-0,addr=9.0,chassis=0 \	
>         -blockdev driver=file,cache.direct=on,node-name=root,filename=/home/hch/images/bookworm.img \
> 	-blockdev driver=host_device,cache.direct=on,node-name=test,filename=/dev/nvme0n1p4 \
> 	-device virtio-blk,drive=root \
> 	-device nvme,drive=test,serial=1234

Thanks, Chaitanya will take a look.

If we put aside this issue, do you think that the proposed API is the right one?

BTW, I have a fancier command line, which is probably why it works for me and not for you:
/opt/simx/bin/qemu-system-x86_64
        -append root=/dev/root rw ignore_loglevel rootfstype=9p
        rootflags="cache=loose,trans=virtio" earlyprintk=serial,ttyS0,115200
                console=hvc0 noibrs noibpb nopti nospectre_v2 nospectre_v1
                l1tf=off nospec_store_bypass_disable no_stf_barrier mds=off
                mitigations=off panic_on_warn=1
                intel_iommu=on iommu=nopt iommu.forcedac=true
                vfio_iommu_type1.allow_unsafe_interrupts=1
                systemd.hostname=mtl-leonro-l-vm
        -chardev stdio,id=stdio,mux=on,signal=off   
        -cpu host                                  
        -device virtio-rng-pci                    
        -device virtio-balloon-pci               
        -device isa-serial,chardev=stdio        
        -device virtio-serial-pci              
        -device virtconsole,chardev=stdio     
        -device virtio-9p-pci,fsdev=host_fs,mount_tag=/dev/root  
        -device virtio-9p-pci,fsdev=host_bind_fs0,mount_tag=bind0
        -device virtio-9p-pci,fsdev=host_bind_fs1,mount_tag=bind1
        -device virtio-9p-pci,fsdev=host_bind_fs2,mount_tag=bind2
        -device intel-iommu,intremap=on 
        -device connectx7              
        -device nvme,drive=drv0,serial=foo 
        -drive file=/home/leonro/.cache/mellanox/mkt/nvme-1g.raw,if=none,id=drv0,format=raw 
        -enable-kvm                                                                        
        -fsdev local,id=host_bind_fs1,security_model=none,path=/logs          
        -fsdev local,id=host_fs,security_model=none,path=/mnt/self           
        -fsdev local,id=host_bind_fs0,security_model=none,path=/plugins     
        -fsdev local,id=host_bind_fs2,security_model=none,path=/home/leonro
        -fw_cfg etc/sercon-port,string=2  
        -kernel /home/leonro/src/kernel/arch/x86/boot/bzImage 
        -m 5G -machine q35,kernel-irqchip=split              
        -mon chardev=stdio                                  
        -net nic,model=virtio,macaddr=52:54:01:d8:e5:f9    
        -net user,hostfwd=tcp:127.0.0.1:46645-:22  
        -no-reboot -nodefaults -nographic -smp 16 -vga none 

> 
>

Christoph Hellwig July 4, 2024, 7:48 a.m. UTC | #6

On Wed, Jul 03, 2024 at 06:51:14PM +0300, Leon Romanovsky wrote:
> If we put aside this issue, do you think that the proposed API is the right one?

I haven't looked at it in detail yet, but from a quick look there are a
few things to note:


1) The amount of code needed in nvme worries me a bit.  Now NVMe is a messy
driver due to the stupid PRPs vs just using SGLs, but needing a fair
amount of extra boilerplate code in drivers is a bit of a warning sign.
I plan to look into this to see if I can help on improving it, but for
that I need a working version first.


2) The amount of seemingly unrelated global headers pulled into other
global headers.  Some of this might just be sloppiness, e.g. I can't
see why dma-mapping.h would actually need iommu.h to start with,
but pci.h in dma-map-ops.h is a no-go.

3) which brings me to real layering violations.  dev_is_untrusted and
dev_use_swiotlb are DMA API internals, no way I'd ever want to expose
them. dma-map-ops.h is a semi-internal header only for implementations
of the dma ops (as very clearly documented at the top of that file),
it must not be included by drivers.  Same for swiotlb.h.

Not quite as concerning, but doing an indirect call for each map
through dma_map_ops in addition to the iommu ops is not very efficient.
We've thought for a while about allowing direct calls to dma-iommu similar
to how we do direct calls to dma-direct from the core mapping.c code.
This might be a good time to do that as a prep step for this work.

Leon Romanovsky July 4, 2024, 1:18 p.m. UTC | #7

On Thu, Jul 04, 2024 at 09:48:56AM +0200, Christoph Hellwig wrote:
> On Wed, Jul 03, 2024 at 06:51:14PM +0300, Leon Romanovsky wrote:
> > If we put aside this issue, do you think that the proposed API is the right one?
> 
> I haven't looked at it in detail yet, but from a quick look there are a
> few things to note:
> 
> 
> 1) The amount of code needed in nvme worries me a bit.  Now NVMe is a messy
> driver due to the stupid PRPs vs just using SGLs, but needing a fair
> amount of extra boilerplate code in drivers is a bit of a warning sign.
> I plan to look into this to see if I can help on improving it, but for
> that I need a working version first.

Chaitanya is working on this and I'll join him to help next Sunday,
after I return to the office from my sick leave.

> 
> 
> 2) The amount of seemingly unrelated global headers pulled into other
> global headers.  Some of this might just be sloppiness, e.g. I can't
> see why dma-mapping.h would actually need iommu.h to start with,
> but pci.h in dma-map-ops.h is a no-go.

pci.h was pulled because I needed to call to pci_p2pdma_map_type()
in dma_can_use_iova().

> 
> 3) which brings me to real layering violations.  dev_is_untrusted and
> dev_use_swiotlb are DMA API internals, no way I'd ever want to expose
> them. dma-map-ops.h is a semi-internal header only for implementations
> of the dma ops (as very clearly documented at the top of that file),
> it must not be included by drivers.  Same for swiotlb.h.

These items shouldn't worry you and will be changed in the final version.
They are an outcome of the patch "RDMA/umem: Prevent UMEM ODP creation with SWIOTLB".
https://lore.kernel.org/all/d18c454636bf3cfdba9b66b7cc794d713eadc4a5.1719909395.git.leon@kernel.org/

All HMM users need such "prevention" so it will be moved to a common place.

> 
> Not quite as concerning, but doing an indirect call for each map
> through dma_map_ops in addition to the iommu ops is not very efficient.
> We've thought for a while about allowing direct calls to dma-iommu similar
> to how we do direct calls to dma-direct from the core mapping.c code.
> This might be a good time to do that as a prep step for this work.

Sure, no problem, will start in parallel to work on this.

>

Christoph Hellwig July 5, 2024, 6 a.m. UTC | #8

On Thu, Jul 04, 2024 at 04:18:39PM +0300, Leon Romanovsky wrote:
> > 2) The amount of seemingly unrelated global headers pulled into other
> > global headers.  Some of this might just be sloppiness, e.g. I can't
> > see why dma-mapping.h would actually need iommu.h to start with,
> > but pci.h in dma-map-ops.h is a no-go.
> 
> pci.h was pulled because I needed to call to pci_p2pdma_map_type()
> in dma_can_use_iova().

No, that's not the reason.  The reason is actually that whole
dev_use_swiotlb mess which shouldn't exist in this form.

Christoph Hellwig July 5, 2024, 6:39 a.m. UTC | #9

Review from the NVMe driver consumer perspective.  I think if all these
were implemented we'd probably end up with less code than before the
conversion.

The split between dma_iova_attrs, dma_memory_type and dma_iova_state is
odd.  I would have expected them to be just a single object.  While
talking about this I think the domain field in dma_iova_state should
probably be a private pointer instead of being tied to the iommu.

Also do we need the attrs member in the iova_attrs structure?  The
"attrs" really are flags passed to the mapping routines that are
per-operation and not persistent, so I'd expect them to be passed
per-call and not stored in a structure.

I'd also expect the use_iova field to be in the mapping state
and not separately provided by the driver.

For nvme specific data structures I would have expected a dma_addr/
len pair in struct iod_dma_map, maybe even using a common type.

Also the data structure split seems odd - I'd expect the actual
mapping state and a small number (at least one) dma_addr/len pair
to be inside the nvme_iod structure, and then only do the dynamic
allocation if we need more of them because there are more segments
and we are not using the iommu.
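The layout suggested in the paragraph above can be sketched in userspace C. This is an illustration under stated assumptions, not the nvme driver's code: struct mock_iod, struct mock_dma_vec and both functions are hypothetical names, and the rule that one pair is enough whenever the iommu is in use reflects the reading that an IOMMU mapping yields a single contiguous DMA range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical common type for one dma_addr/len pair. */
struct mock_dma_vec {
	uint64_t addr;
	size_t len;
};

/* Hypothetical per-request state with one pair embedded inline. */
struct mock_iod {
	struct mock_dma_vec inline_vec;	/* covers the common small case */
	struct mock_dma_vec *vecs;	/* inline_vec or a heap array */
	size_t nr_vecs;
};

/* Allocate dynamically only for multi-segment requests without an iommu. */
int mock_iod_init(struct mock_iod *iod, size_t nr_segs, bool uses_iommu)
{
	assert(nr_segs > 0);
	if (nr_segs == 1 || uses_iommu) {
		iod->vecs = &iod->inline_vec;
		iod->nr_vecs = 1;
		return 0;
	}
	iod->vecs = calloc(nr_segs, sizeof(*iod->vecs));
	if (!iod->vecs)
		return -1;
	iod->nr_vecs = nr_segs;
	return 0;
}

/* Release the heap array if one was allocated. */
void mock_iod_free(struct mock_iod *iod)
{
	if (iod->vecs != &iod->inline_vec)
		free(iod->vecs);
	iod->vecs = NULL;
	iod->nr_vecs = 0;
}
```

With this shape the single-segment and iommu paths never touch the allocator, which is the property the review comment asks for.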

If we had a common data structure for the dma_addr/len pairs
dma_unlink_range could just take care of the unmap for the non-iommu
case as well, which would be neat.  I'd also expect that
dma_free_iova would be covered by it.

I would have expected dma_link_range to return the dma_addr_t instead
of poking into the iova structure in the callers.

In __nvme_rq_dma_map the <= PAGE_SIZE case is pointless.  In the
existing code the reason for it is to avoid allocating and mapping the
sg_table, but that code is still left before we even get to this code.

My suggestion above to only allocate the dma_addr/len pairs when there
is more than one or a few of them would make it trivial to handle that
case using the normal API without having to keep that special
case and the dma_len parameter around.

If this added a version of dma_map_page_attrs that directly took
the physical address as a prep patch, the callers would not have to
bother with page pointer manipulations and could just work on physical
addresses for both the iommu and no-iommu cases.  It would also help
a little bit with the eventual switch to store the physical address
instead of page+offset in the bio_vec.  Talking about that, I've
been wanting to add a bvec_phys helper to replace the
page_to_phys(bv.bv_page) + bv.bv_offset calculations.  This is becoming
more urgent with more callers needing to do that; I'll try to get it out
to Jens ASAP so that it can make the 6.11 merge window.
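The helper mentioned above amounts to a single addition. A userspace sketch, assuming a mock page type that simply records its physical base address; struct mock_page, struct mock_bio_vec, mock_page_to_phys() and mock_bvec_phys() are hypothetical stand-ins, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical page descriptor: only the physical base address is modelled. */
struct mock_page {
	uint64_t phys;
};

/* Hypothetical bio_vec lookalike: page + offset + length. */
struct mock_bio_vec {
	const struct mock_page *bv_page;
	unsigned int bv_len;
	unsigned int bv_offset;
};

/* Stand-in for the page to physical address conversion. */
uint64_t mock_page_to_phys(const struct mock_page *page)
{
	return page->phys;
}

/* Physical address of the first byte described by the vector. */
uint64_t mock_bvec_phys(const struct mock_bio_vec *bv)
{
	assert(bv->bv_page != NULL);
	return mock_page_to_phys(bv->bv_page) + bv->bv_offset;
}
```

Callers that only need a physical address can then avoid open-coding the page and offset arithmetic.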

Can we make dma_start_range / dma_end_range simple no-ops for the
non-iommu code to avoid boilerplate code in the callers to deal with
the two cases?

Chaitanya Kulkarni July 5, 2024, 10:53 p.m. UTC | #10

On 7/3/24 07:35, Christoph Hellwig wrote:
> On Wed, Jul 03, 2024 at 01:52:53PM +0300, Leon Romanovsky wrote:
>> On Wed, Jul 03, 2024 at 07:42:38AM +0200, Christoph Hellwig wrote:
>>> I just tried to boot this on my usual qemu test setup with emulated
>>> nvme devices, and it dead-loops with messages like this fairly late
>>> in the boot cycle:
>>>
>>> [   43.826627] iommu: unaligned: iova 0xfff7e000 pa 0x000000010be33650 size 0x1000 min_pagesz 0x1000
>>> [   43.826982] dma_mapping_error -12
>>>
>>> passing intel_iommu=off instead of intel_iommu=on (expectedly) makes
>>> it go away.
>> Can you please share your kernel command line and qemu?
>> On my and Chaitanya's setups it works fine.
> qemu-system-x86_64 \
>          -nographic \
> 	-enable-kvm \
> 	-m 6g \
> 	-smp 4 \
> 	-cpu host \
> 	-M q35,kernel-irqchip=split \
> 	-kernel arch/x86/boot/bzImage \
> 	-append "root=/dev/vda console=ttyS0,115200n8 intel_iommu=on" \
>          -device intel-iommu,intremap=on \
> 	-device ioh3420,multifunction=on,bus=pcie.0,id=port9-0,addr=9.0,chassis=0 \	
>          -blockdev driver=file,cache.direct=on,node-name=root,filename=/home/hch/images/bookworm.img \
> 	-blockdev driver=host_device,cache.direct=on,node-name=test,filename=/dev/nvme0n1p4 \
> 	-device virtio-blk,drive=root \
> 	-device nvme,drive=test,serial=1234
>

I tried to reproduce this issue, but somehow it is not reproducible.

I'll try again on Leon's setup on my Saturday night, to fix that
case.

-ck

Christoph Hellwig July 6, 2024, 6:26 a.m. UTC | #11

On Fri, Jul 05, 2024 at 10:53:06PM +0000, Chaitanya Kulkarni wrote:
> I tried to reproduce this issue, but somehow it is not reproducible.
> 
> I'll try again on Leon's setup on my Saturday night, to fix that
> case.

It is passthrough I/O from userspace.  The address is not page aligned
as seen in the printk.  Forcing bounce buffering of all passthrough
I/O makes it go away.

The problem is that the first mapped segment does not have to be aligned
and we're missing the code to place it at the aligned offset into
the IOVA space.
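The arithmetic behind this diagnosis can be checked with the values from the log (iova 0xfff7e000, pa 0x000000010be33650, size 0x1000, min_pagesz 0x1000). The sketch below is a userspace illustration with hypothetical mock_ helpers, not the eventual fix: it assumes the remedy is to map the page-aligned physical address and carry the in-page offset over into the returned DMA address.

```c
#include <assert.h>
#include <stdint.h>

#define MOCK_PAGE_SIZE 0x1000ull	/* min_pagesz from the log */

/* Offset of a physical address inside its page. */
uint64_t mock_page_offset(uint64_t pa)
{
	return pa & (MOCK_PAGE_SIZE - 1);
}

/* Page-aligned physical address that an IOMMU mapping can accept. */
uint64_t mock_aligned_pa(uint64_t pa)
{
	return pa & ~(MOCK_PAGE_SIZE - 1);
}

/* IOVA bytes needed to cover [pa, pa + len) at page granularity. */
uint64_t mock_iova_span(uint64_t pa, uint64_t len)
{
	assert(len > 0);
	return (mock_page_offset(pa) + len + MOCK_PAGE_SIZE - 1) &
	       ~(MOCK_PAGE_SIZE - 1);
}

/* DMA address for the segment: aligned IOVA plus the in-page offset. */
uint64_t mock_dma_addr(uint64_t iova, uint64_t pa)
{
	assert((iova & (MOCK_PAGE_SIZE - 1)) == 0);
	return iova + mock_page_offset(pa);
}
```

For the logged values the in-page offset is 0x650, so a 0x1000-byte buffer starting there straddles two pages and needs 0x2000 bytes of IOVA rather than 0x1000.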
