Message ID: cover.1719909395.git.leon@kernel.org (mailing list archive)
Series: Provide a new two step DMA API mapping API
I just tried to boot this on my usual qemu test setup with emulated
nvme devices, and it dead-loops with messages like this fairly late
in the boot cycle:

[ 43.826627] iommu: unaligned: iova 0xfff7e000 pa 0x000000010be33650 size 0x1000 min_pagesz 0x1000
[ 43.826982] dma_mapping_error -12

passing intel_iommu=off instead of intel_iommu=on (expectedly) makes
it go away.
On 2024/7/3 13:42, Christoph Hellwig wrote:
> I just tried to boot this on my usual qemu test setup with emulated
> nvme devices, and it dead-loops with messages like this fairly late
> in the boot cycle:
>
> [ 43.826627] iommu: unaligned: iova 0xfff7e000 pa 0x000000010be33650 size 0x1000 min_pagesz 0x1000
> [ 43.826982] dma_mapping_error -12
>
> passing intel_iommu=off instead of intel_iommu=on (expectedly) makes
> it go away.

I also ran into this problem. IMO, if intel_iommu=off, the drivers under
drivers/iommu cannot be used. If I remember correctly, some folks were
able to fix this problem on the first version; I will check the mails.
As for me, I just revert this commit, since I do not use the nvme changes.

Zhu Yanjun
On Wed, Jul 03, 2024 at 07:42:38AM +0200, Christoph Hellwig wrote:
> I just tried to boot this on my usual qemu test setup with emulated
> nvme devices, and it dead-loops with messages like this fairly late
> in the boot cycle:
>
> [ 43.826627] iommu: unaligned: iova 0xfff7e000 pa 0x000000010be33650 size 0x1000 min_pagesz 0x1000
> [ 43.826982] dma_mapping_error -12
>
> passing intel_iommu=off instead of intel_iommu=on (expectedly) makes
> it go away.

Can you please share your kernel command line and qemu invocation?
On my and Chaitanya's setups it works fine.

Thanks
On Wed, Jul 03, 2024 at 01:52:53PM +0300, Leon Romanovsky wrote:
> On Wed, Jul 03, 2024 at 07:42:38AM +0200, Christoph Hellwig wrote:
> > I just tried to boot this on my usual qemu test setup with emulated
> > nvme devices, and it dead-loops with messages like this fairly late
> > in the boot cycle:
> >
> > [ 43.826627] iommu: unaligned: iova 0xfff7e000 pa 0x000000010be33650 size 0x1000 min_pagesz 0x1000
> > [ 43.826982] dma_mapping_error -12
> >
> > passing intel_iommu=off instead of intel_iommu=on (expectedly) makes
> > it go away.
>
> Can you please share your kernel command line and qemu invocation?
> On my and Chaitanya's setups it works fine.

qemu-system-x86_64 \
-nographic \
-enable-kvm \
-m 6g \
-smp 4 \
-cpu host \
-M q35,kernel-irqchip=split \
-kernel arch/x86/boot/bzImage \
-append "root=/dev/vda console=ttyS0,115200n8 intel_iommu=on" \
-device intel-iommu,intremap=on \
-device ioh3420,multifunction=on,bus=pcie.0,id=port9-0,addr=9.0,chassis=0 \
-blockdev driver=file,cache.direct=on,node-name=root,filename=/home/hch/images/bookworm.img \
-blockdev driver=host_device,cache.direct=on,node-name=test,filename=/dev/nvme0n1p4 \
-device virtio-blk,drive=root \
-device nvme,drive=test,serial=1234
On Wed, Jul 03, 2024 at 04:35:30PM +0200, Christoph Hellwig wrote:
> On Wed, Jul 03, 2024 at 01:52:53PM +0300, Leon Romanovsky wrote:
> > On Wed, Jul 03, 2024 at 07:42:38AM +0200, Christoph Hellwig wrote:
> > > I just tried to boot this on my usual qemu test setup with emulated
> > > nvme devices, and it dead-loops with messages like this fairly late
> > > in the boot cycle:
> > >
> > > [ 43.826627] iommu: unaligned: iova 0xfff7e000 pa 0x000000010be33650 size 0x1000 min_pagesz 0x1000
> > > [ 43.826982] dma_mapping_error -12
> > >
> > > passing intel_iommu=off instead of intel_iommu=on (expectedly) makes
> > > it go away.
> >
> > Can you please share your kernel command line and qemu invocation?
> > On my and Chaitanya's setups it works fine.
>
> qemu-system-x86_64 \
> -nographic \
> -enable-kvm \
> -m 6g \
> -smp 4 \
> -cpu host \
> -M q35,kernel-irqchip=split \
> -kernel arch/x86/boot/bzImage \
> -append "root=/dev/vda console=ttyS0,115200n8 intel_iommu=on" \
> -device intel-iommu,intremap=on \
> -device ioh3420,multifunction=on,bus=pcie.0,id=port9-0,addr=9.0,chassis=0 \
> -blockdev driver=file,cache.direct=on,node-name=root,filename=/home/hch/images/bookworm.img \
> -blockdev driver=host_device,cache.direct=on,node-name=test,filename=/dev/nvme0n1p4 \
> -device virtio-blk,drive=root \
> -device nvme,drive=test,serial=1234

Thanks, Chaitanya will take a look.

If we put aside this issue, do you think that the proposed API is the right one?

BTW, I have a fancier command line, and the difference is probably the root
cause of it working here and not there:

/opt/simx/bin/qemu-system-x86_64 -append root=/dev/root rw ignore_loglevel rootfstype=9p rootflags="cache=loose,trans=virtio" earlyprintk=serial,ttyS0,115200 console=hvc0 noibrs noibpb nopti nospectre_v2 nospectre_v1 l1tf=off nospec_store_bypass_disable no_stf_barrier mds=off mitigations=off panic_on_warn=1 intel_iommu=on iommu=nopt iommu.forcedac=true vfio_iommu_type1.allow_unsafe_interrupts=1 systemd.hostname=mtl-leonro-l-vm -chardev stdio,id=stdio,mux=on,signal=off -cpu host -device virtio-rng-pci -device virtio-balloon-pci -device isa-serial,chardev=stdio -device virtio-serial-pci -device virtconsole,chardev=stdio -device virtio-9p-pci,fsdev=host_fs,mount_tag=/dev/root -device virtio-9p-pci,fsdev=host_bind_fs0,mount_tag=bind0 -device virtio-9p-pci,fsdev=host_bind_fs1,mount_tag=bind1 -device virtio-9p-pci,fsdev=host_bind_fs2,mount_tag=bind2 -device intel-iommu,intremap=on -device connectx7 -device nvme,drive=drv0,serial=foo -drive file=/home/leonro/.cache/mellanox/mkt/nvme-1g.raw,if=none,id=drv0,format=raw -enable-kvm -fsdev local,id=host_bind_fs1,security_model=none,path=/logs -fsdev local,id=host_fs,security_model=none,path=/mnt/self -fsdev local,id=host_bind_fs0,security_model=none,path=/plugins -fsdev local,id=host_bind_fs2,security_model=none,path=/home/leonro -fw_cfg etc/sercon-port,string=2 -kernel /home/leonro/src/kernel/arch/x86/boot/bzImage -m 5G -machine q35,kernel-irqchip=split -mon chardev=stdio -net nic,model=virtio,macaddr=52:54:01:d8:e5:f9 -net user,hostfwd=tcp:127.0.0.1:46645-:22 -no-reboot -nodefaults -nographic -smp 16 -vga none
On Wed, Jul 03, 2024 at 06:51:14PM +0300, Leon Romanovsky wrote:
> If we put aside this issue, do you think that the proposed API is the right one?
I haven't looked at it in detail yet, but from a quick look there are a
few things to note:
1) The amount of code needed in nvme worries me a bit. Now NVMe is a messy
driver due to the stupid PRPs vs just using SGLs, but needing a fair
amount of extra boilerplate code in drivers is a bit of a warning sign.
I plan to look into this to see if I can help on improving it, but for
that I need a working version first.
2) The amount of seemingly unrelated global headers pulled into other
global headers. Some of this might just be sloppiness, e.g. I can't
see why dma-mapping.h would actually need iommu.h to start with,
but pci.h in dma-map-ops.h is a no-go.
3) which brings me to real layering violations. dev_is_untrusted and
dev_use_swiotlb are DMA API internals, no way I'd ever want to expose
them. dma-map-ops.h is a semi-internal header only for implementations
of the dma ops (as very clearly documented at the top of that file),
it must not be included by drivers. Same for swiotlb.h.
Not quite as concerning, but doing an indirect call for each map
through dma_map_ops in addition to the iommu ops is not very efficient.
We've thought for a while about allowing direct calls to dma-iommu, similar
to how we do direct calls to dma-direct from the core mapping.c code.
This might be a good time to do that as a prep step for this work.
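
For context, here is a rough model of the dispatch pattern referred to above:
kernel/dma/mapping.c already branches straight into dma-direct instead of
going through an indirect ops call, and the same could be done for dma-iommu.
This is only a sketch: dma_map_direct() and dma_direct_map_page() are private
to kernel/dma/, and the use_dma_iommu()/dma_iommu_map_page() branch is a
hypothetical illustration of the suggestion, not existing code.

	#include <linux/dma-map-ops.h>	/* get_dma_ops(), struct dma_map_ops */

	static dma_addr_t example_map_page_attrs(struct device *dev,
			struct page *page, unsigned long offset, size_t size,
			enum dma_data_direction dir, unsigned long attrs)
	{
		const struct dma_map_ops *ops = get_dma_ops(dev);

		if (dma_map_direct(dev, ops))
			/* Fast path: call dma-direct directly, no indirect call. */
			return dma_direct_map_page(dev, page, offset, size, dir, attrs);

		if (use_dma_iommu(dev))		/* hypothetical helper */
			/* Same idea for dma-iommu: skip the dma_map_ops indirection. */
			return dma_iommu_map_page(dev, page, offset, size, dir, attrs);

		/* Everything else still goes through the indirect call. */
		return ops->map_page(dev, page, offset, size, dir, attrs);
	}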
On Thu, Jul 04, 2024 at 09:48:56AM +0200, Christoph Hellwig wrote:
> On Wed, Jul 03, 2024 at 06:51:14PM +0300, Leon Romanovsky wrote:
> > If we put aside this issue, do you think that the proposed API is the right one?
>
> I haven't looked at it in detail yet, but from a quick look there are a
> few things to note:
>
> 1) The amount of code needed in nvme worries me a bit. Now NVMe is a messy
> driver due to the stupid PRPs vs just using SGLs, but needing a fair
> amount of extra boilerplate code in drivers is a bit of a warning sign.
> I plan to look into this to see if I can help on improving it, but for
> that I need a working version first.

Chaitanya is working on this and I'll join him to help next Sunday,
after I return to the office from my sick leave.

> 2) The amount of seemingly unrelated global headers pulled into other
> global headers. Some of this might just be sloppiness, e.g. I can't
> see why dma-mapping.h would actually need iommu.h to start with,
> but pci.h in dma-map-ops.h is a no-go.

pci.h was pulled in because I needed to call pci_p2pdma_map_type()
in dma_can_use_iova().

> 3) which brings me to real layering violations. dev_is_untrusted and
> dev_use_swiotlb are DMA API internals, no way I'd ever want to expose
> them. dma-map-ops.h is a semi-internal header only for implementations
> of the dma ops (as very clearly documented at the top of that file),
> it must not be included by drivers. Same for swiotlb.h.

These items shouldn't worry you and will be changed in the final version.
They are an outcome of the patch "RDMA/umem: Prevent UMEM ODP creation
with SWIOTLB":
https://lore.kernel.org/all/d18c454636bf3cfdba9b66b7cc794d713eadc4a5.1719909395.git.leon@kernel.org/
All HMM users need such "prevention", so it will be moved to a common place.

> Not quite as concerning, but doing an indirect call for each map
> through dma_map_ops in addition to the iommu ops is not very efficient.
> We've thought for a while about allowing direct calls to dma-iommu, similar
> to how we do direct calls to dma-direct from the core mapping.c code.
> This might be a good time to do that as a prep step for this work.

Sure, no problem, I will start on that in parallel to this work.
On Thu, Jul 04, 2024 at 04:18:39PM +0300, Leon Romanovsky wrote:
> > 2) The amount of seemingly unrelated global headers pulled into other
> > global headers. Some of this might just be sloppiness, e.g. I can't
> > see why dma-mapping.h would actually need iommu.h to start with,
> > but pci.h in dma-map-ops.h is a no-go.
>
> pci.h was pulled in because I needed to call pci_p2pdma_map_type()
> in dma_can_use_iova().

No, that's not the reason. The reason is actually that whole
dev_use_swiotlb mess, which shouldn't exist in this form.
Review from the NVMe driver consumer perspective. I think if all these
were implemented we'd probably end up with less code than before the
conversion.

The split between dma_iova_attrs, dma_memory_type and dma_iova_state is
odd. I would have expected them to be just a single object. While
talking about this, I think the domain field in dma_iova_state should
probably be a private pointer instead of being tied to the iommu.

Also, do we need the attrs member in the iova_attrs structure? The
"attrs" really are flags passed to the mapping routines that are
per-operation and not persistent, so I'd expect them to be passed
per-call and not stored in a structure.

I'd also expect the use_iova field to be in the mapping state and not
separately provided by the driver.

For the nvme-specific data structures I would have expected a dma_addr/
len pair in struct iod_dma_map, maybe even using a common type.

Also the data structure split seems odd - I'd expect the actual
mapping state and a small number (at least one) of dma_addr/len pairs
to be inside the nvme_iod structure, and then only do the dynamic
allocation if we need more of them because there are more segments
and we are not using the iommu.

If we had a common data structure for the dma_addr/len pairs,
dma_unlink_range could just take care of the unmap for the non-iommu
case as well, which would be neat. I'd also expect dma_free_iova to
be covered by it.

I would have expected dma_link_range to return the dma_addr_t instead
of poking into the iova structure in the callers.

In __nvme_rq_dma_map the <= PAGE_SIZE case is pointless. In the
existing code the reason for it is to avoid allocating and mapping the
sg_table, but that code is still left before we even get to this code.
My suggestion above to only allocate the dma_addr/len pairs when there
is more than one (or a few) of them would allow to trivially implement
that suggestion using the normal API without having to keep that
special case and the dma_len parameter around.

If this added a version of dma_map_page_attrs that directly took the
physical address as a prep patch, the callers would not have to bother
with page pointer manipulations and could just work on physical
addresses for both the iommu and no-iommu cases. It would also help a
little bit with the eventual switch to store the physical address
instead of page + offset in the bio_vec. Talking about that, I've been
wanting to add a bvec_phys helper to convert the
page_to_phys(bv.bv_page) + bv.bv_offset calculations. This is becoming
more urgent with more callers needing to do that, so I'll try to get it
out to Jens ASAP so that it can make the 6.11 merge window.

Can we make dma_start_range / dma_end_range simple no-ops for the
non-iommu case, to avoid boilerplate code in the callers to deal with
the two cases?
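
For reference, a minimal sketch of the bvec_phys helper mentioned above,
assuming it simply folds the page_to_phys() + bv_offset calculation into one
place; the exact name and header placement are assumptions based on this
mail, not the patch that was eventually sent to Jens.

	#include <linux/bvec.h>
	#include <linux/mm.h>

	/*
	 * Resolve the physical address a bio_vec points at, so callers stop
	 * open-coding page_to_phys() plus the offset.
	 */
	static inline phys_addr_t bvec_phys(const struct bio_vec *bvec)
	{
		return page_to_phys(bvec->bv_page) + bvec->bv_offset;
	}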
On 7/3/24 07:35, Christoph Hellwig wrote:
> On Wed, Jul 03, 2024 at 01:52:53PM +0300, Leon Romanovsky wrote:
>> On Wed, Jul 03, 2024 at 07:42:38AM +0200, Christoph Hellwig wrote:
>>> I just tried to boot this on my usual qemu test setup with emulated
>>> nvme devices, and it dead-loops with messages like this fairly late
>>> in the boot cycle:
>>>
>>> [ 43.826627] iommu: unaligned: iova 0xfff7e000 pa 0x000000010be33650 size 0x1000 min_pagesz 0x1000
>>> [ 43.826982] dma_mapping_error -12
>>>
>>> passing intel_iommu=off instead of intel_iommu=on (expectedly) makes
>>> it go away.
>> Can you please share your kernel command line and qemu invocation?
>> On my and Chaitanya's setups it works fine.
> qemu-system-x86_64 \
> -nographic \
> -enable-kvm \
> -m 6g \
> -smp 4 \
> -cpu host \
> -M q35,kernel-irqchip=split \
> -kernel arch/x86/boot/bzImage \
> -append "root=/dev/vda console=ttyS0,115200n8 intel_iommu=on" \
> -device intel-iommu,intremap=on \
> -device ioh3420,multifunction=on,bus=pcie.0,id=port9-0,addr=9.0,chassis=0 \
> -blockdev driver=file,cache.direct=on,node-name=root,filename=/home/hch/images/bookworm.img \
> -blockdev driver=host_device,cache.direct=on,node-name=test,filename=/dev/nvme0n1p4 \
> -device virtio-blk,drive=root \
> -device nvme,drive=test,serial=1234

I tried to reproduce this issue, but somehow it is not reproducible.

I'll try again on Leon's setup on Saturday night to fix that case.

-ck
On Fri, Jul 05, 2024 at 10:53:06PM +0000, Chaitanya Kulkarni wrote:
> I tried to reproduce this issue, but somehow it is not reproducible.
>
> I'll try again on Leon's setup on Saturday night to fix that case.

It is passthrough I/O from userspace. The address is not page aligned,
as seen in the printk. Forcing bounce buffering of all passthrough I/O
makes it go away.

The problem is that the first mapped segment does not have to be page
aligned, and we're missing the code to place it at the proper offset
into the IOVA space.
On Sat, Jul 06, 2024 at 08:26:04AM +0200, Christoph Hellwig wrote:
> On Fri, Jul 05, 2024 at 10:53:06PM +0000, Chaitanya Kulkarni wrote:
> > I tried to reproduce this issue, but somehow it is not reproducible.
> >
> > I'll try again on Leon's setup on Saturday night to fix that case.
>
> It is passthrough I/O from userspace. The address is not page aligned,
> as seen in the printk. Forcing bounce buffering of all passthrough I/O
> makes it go away.
>
> The problem is that the first mapped segment does not have to be page
> aligned, and we're missing the code to place it at the proper offset
> into the IOVA space.

Thanks for the explanation.
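
To illustrate the missing piece described above: the allocated IOVA range is
page aligned, but the first segment's physical address may start at an
arbitrary offset inside a page, so the mapping has to cover the containing
pages and the returned DMA address has to carry the sub-page offset. The
function name and calling convention below are assumptions for illustration,
not the fix that was eventually applied to the series.

	#include <linux/iommu.h>
	#include <linux/dma-mapping.h>
	#include <linux/mm.h>

	static dma_addr_t example_link_unaligned_segment(struct iommu_domain *domain,
			dma_addr_t iova, phys_addr_t phys, size_t len, int prot)
	{
		size_t page_off = offset_in_page(phys);

		/* Map whole pages covering [phys, phys + len) at the aligned IOVA. */
		if (iommu_map(domain, iova, phys - page_off,
			      PAGE_ALIGN(len + page_off), prot, GFP_ATOMIC))
			return DMA_MAPPING_ERROR;

		/* Hand the caller the IOVA plus the original sub-page offset. */
		return iova + page_off;
	}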
On Fri, Jul 05, 2024 at 08:39:10AM +0200, Christoph Hellwig wrote:
> Review from the NVMe driver consumer perspective. I think if all these
> were implemented we'd probably end up with less code than before the
> conversion.

Thanks for the review, I will try to address all the comments in the
next version.

> The split between dma_iova_attrs, dma_memory_type and dma_iova_state is
> odd. I would have expected them to be just a single object. While
> talking about this, I think the domain field in dma_iova_state should
> probably be a private pointer instead of being tied to the iommu.
>
> Also, do we need the attrs member in the iova_attrs structure? The
> "attrs" really are flags passed to the mapping routines that are
> per-operation and not persistent, so I'd expect them to be passed
> per-call and not stored in a structure.

It is a left-over from my not-sent version, where I added a new
attribute to indicate that dma_alloc_iova() can't support SWIOTLB, in
order to avoid the dev_use_swiotlb() mess. I will remove it.

> I'd also expect the use_iova field to be in the mapping state and not
> separately provided by the driver.
>
> For the nvme-specific data structures I would have expected a dma_addr/
> len pair in struct iod_dma_map, maybe even using a common type.
>
> Also the data structure split seems odd - I'd expect the actual
> mapping state and a small number (at least one) of dma_addr/len pairs
> to be inside the nvme_iod structure, and then only do the dynamic
> allocation if we need more of them because there are more segments
> and we are not using the iommu.
>
> If we had a common data structure for the dma_addr/len pairs,
> dma_unlink_range could just take care of the unmap for the non-iommu
> case as well, which would be neat. I'd also expect dma_free_iova to
> be covered by it.

Internally Jason asked for the same thing, but I didn't want to produce
an asymmetric API where drivers have a call to dma_alloc_iova() but no
matching call to dma_free_iova(). However, now it is 2 versus 1, so I
will change it.

> I would have expected dma_link_range to return the dma_addr_t instead
> of poking into the iova structure in the callers.
>
> In __nvme_rq_dma_map the <= PAGE_SIZE case is pointless. In the
> existing code the reason for it is to avoid allocating and mapping the
> sg_table, but that code is still left before we even get to this code.
> My suggestion above to only allocate the dma_addr/len pairs when there
> is more than one (or a few) of them would allow to trivially implement
> that suggestion using the normal API without having to keep that
> special case and the dma_len parameter around.
>
> If this added a version of dma_map_page_attrs that directly took the
> physical address as a prep patch, the callers would not have to bother
> with page pointer manipulations and could just work on physical
> addresses for both the iommu and no-iommu cases. It would also help a
> little bit with the eventual switch to store the physical address
> instead of page + offset in the bio_vec. Talking about that, I've been
> wanting to add a bvec_phys helper to convert the
> page_to_phys(bv.bv_page) + bv.bv_offset calculations. This is becoming
> more urgent with more callers needing to do that, so I'll try to get it
> out to Jens ASAP so that it can make the 6.11 merge window.
>
> Can we make dma_start_range / dma_end_range simple no-ops for the
> non-iommu case, to avoid boilerplate code in the callers to deal with
> the two cases?

Yes, sure.

Thanks
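
To make the shape of the API being converged on above easier to follow, here
is a rough driver-side usage sketch of the symmetric flow: reserve an IOVA
range once, link each segment into it, then unlink and free. Every signature
below is an assumption pieced together from the names mentioned in the thread
(dma_alloc_iova(), dma_link_range(), dma_unlink_range(), dma_free_iova(),
struct dma_iova_state), not the posted patches; dma_link_range() returning the
dma_addr_t follows the review suggestion.

	static int example_map_request(struct device *dev,
			struct dma_iova_state *state, struct bio_vec *bvecs,
			int nr, dma_addr_t *dma_addrs)
	{
		size_t total = 0;
		int i;

		for (i = 0; i < nr; i++)
			total += bvecs[i].bv_len;

		/* Step 1: reserve a contiguous IOVA range for the whole request. */
		if (dma_alloc_iova(dev, state, total))
			return -ENOMEM;

		/* Step 2: link each physical segment into the reserved range. */
		for (i = 0; i < nr; i++) {
			dma_addrs[i] = dma_link_range(state, bvec_phys(&bvecs[i]),
						      bvecs[i].bv_len);
			if (dma_mapping_error(dev, dma_addrs[i]))
				goto err_unlink;
		}
		return 0;

	err_unlink:
		while (--i >= 0)
			dma_unlink_range(state, dma_addrs[i], bvecs[i].bv_len);
		dma_free_iova(dev, state);
		return -EIO;
	}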
On Fri, Jul 05, 2024 at 10:53:06PM +0000, Chaitanya Kulkarni wrote:
> On 7/3/24 07:35, Christoph Hellwig wrote:
> > On Wed, Jul 03, 2024 at 01:52:53PM +0300, Leon Romanovsky wrote:
> >> On Wed, Jul 03, 2024 at 07:42:38AM +0200, Christoph Hellwig wrote:
> >>> I just tried to boot this on my usual qemu test setup with emulated
> >>> nvme devices, and it dead-loops with messages like this fairly late
> >>> in the boot cycle:
> >>>
> >>> [ 43.826627] iommu: unaligned: iova 0xfff7e000 pa 0x000000010be33650 size 0x1000 min_pagesz 0x1000
> >>> [ 43.826982] dma_mapping_error -12
> >>>
> >>> passing intel_iommu=off instead of intel_iommu=on (expectedly) makes
> >>> it go away.
> >> Can you please share your kernel command line and qemu invocation?
> >> On my and Chaitanya's setups it works fine.
> > qemu-system-x86_64 \
> > -nographic \
> > -enable-kvm \
> > -m 6g \
> > -smp 4 \
> > -cpu host \
> > -M q35,kernel-irqchip=split \
> > -kernel arch/x86/boot/bzImage \
> > -append "root=/dev/vda console=ttyS0,115200n8 intel_iommu=on" \
> > -device intel-iommu,intremap=on \
> > -device ioh3420,multifunction=on,bus=pcie.0,id=port9-0,addr=9.0,chassis=0 \
> > -blockdev driver=file,cache.direct=on,node-name=root,filename=/home/hch/images/bookworm.img \
> > -blockdev driver=host_device,cache.direct=on,node-name=test,filename=/dev/nvme0n1p4 \
> > -device virtio-blk,drive=root \
> > -device nvme,drive=test,serial=1234
>
> I tried to reproduce this issue, but somehow it is not reproducible.
>
> I'll try again on Leon's setup on Saturday night to fix that case.

Chaitanya,

I added a "mem_align=120" line to the fio configuration file and the
issue reproduced.

Thanks

> -ck
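
For anyone trying to reproduce this, a hypothetical minimal fio job along
these lines should trigger the unaligned passthrough I/O: only mem_align=120
and targeting the emulated nvme device come from the mail above, every other
value is an assumption.

	; force sub-page-aligned I/O buffers so the first segment is unaligned
	[global]
	ioengine=libaio
	direct=1
	bs=4k
	rw=randread
	mem_align=120

	[repro]
	filename=/dev/nvme0n1
	size=64m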