Message ID | cover.1719909395.git.leon@kernel.org (mailing list archive) |
---|---|
Headers | show |
Series | Provide a new two step DMA API mapping API | expand |
I just tried to boot this on my usual qemu test setup with emulated nvme devices, and it dead-loops with messages like this fairly late in the boot cycle: [ 43.826627] iommu: unaligned: iova 0xfff7e000 pa 0x000000010be33650 size 0x1000 min_pagesz 0x1000 [ 43.826982] dma_mapping_error -12 passing intel_iommu=off instead of intel_iommu=on (expectedly) makes it go away.
在 2024/7/3 13:42, Christoph Hellwig 写道: > I just tried to boot this on my usual qemu test setup with emulated > nvme devices, and it dead-loops with messages like this fairly late > in the boot cycle: > > [ 43.826627] iommu: unaligned: iova 0xfff7e000 pa 0x000000010be33650 size 0x1000 min_pagesz 0x1000 > [ 43.826982] dma_mapping_error -12 > > passing intel_iommu=off instead of intel_iommu=on (expectedly) makes I also confronted this problem.IMO, if intel_iommu=off, the driver of drivers/iommu can not be used. If I remember correctly, some guys in the first version can fix this problem. I will check the mails. To me, I just revert this commit because I do not use this commit about nvme. Zhu Yanjun > it go away. >
On Wed, Jul 03, 2024 at 07:42:38AM +0200, Christoph Hellwig wrote: > I just tried to boot this on my usual qemu test setup with emulated > nvme devices, and it dead-loops with messages like this fairly late > in the boot cycle: > > [ 43.826627] iommu: unaligned: iova 0xfff7e000 pa 0x000000010be33650 size 0x1000 min_pagesz 0x1000 > [ 43.826982] dma_mapping_error -12 > > passing intel_iommu=off instead of intel_iommu=on (expectedly) makes > it go away. Can you please share your kernel command line and qemu? On my and Chaitanya setups it works fine. Thanks
On Wed, Jul 03, 2024 at 01:52:53PM +0300, Leon Romanovsky wrote: > On Wed, Jul 03, 2024 at 07:42:38AM +0200, Christoph Hellwig wrote: > > I just tried to boot this on my usual qemu test setup with emulated > > nvme devices, and it dead-loops with messages like this fairly late > > in the boot cycle: > > > > [ 43.826627] iommu: unaligned: iova 0xfff7e000 pa 0x000000010be33650 size 0x1000 min_pagesz 0x1000 > > [ 43.826982] dma_mapping_error -12 > > > > passing intel_iommu=off instead of intel_iommu=on (expectedly) makes > > it go away. > > Can you please share your kernel command line and qemu? > On my and Chaitanya setups it works fine. qemu-system-x86_64 \ -nographic \ -enable-kvm \ -m 6g \ -smp 4 \ -cpu host \ -M q35,kernel-irqchip=split \ -kernel arch/x86/boot/bzImage \ -append "root=/dev/vda console=ttyS0,115200n8 intel_iommu=on" \ -device intel-iommu,intremap=on \ -device ioh3420,multifunction=on,bus=pcie.0,id=port9-0,addr=9.0,chassis=0 \ -blockdev driver=file,cache.direct=on,node-name=root,filename=/home/hch/images/bookworm.img \ -blockdev driver=host_device,cache.direct=on,node-name=test,filename=/dev/nvme0n1p4 \ -device virtio-blk,drive=root \ -device nvme,drive=test,serial=1234
On Wed, Jul 03, 2024 at 04:35:30PM +0200, Christoph Hellwig wrote: > On Wed, Jul 03, 2024 at 01:52:53PM +0300, Leon Romanovsky wrote: > > On Wed, Jul 03, 2024 at 07:42:38AM +0200, Christoph Hellwig wrote: > > > I just tried to boot this on my usual qemu test setup with emulated > > > nvme devices, and it dead-loops with messages like this fairly late > > > in the boot cycle: > > > > > > [ 43.826627] iommu: unaligned: iova 0xfff7e000 pa 0x000000010be33650 size 0x1000 min_pagesz 0x1000 > > > [ 43.826982] dma_mapping_error -12 > > > > > > passing intel_iommu=off instead of intel_iommu=on (expectedly) makes > > > it go away. > > > > Can you please share your kernel command line and qemu? > > On my and Chaitanya setups it works fine. > > qemu-system-x86_64 \ > -nographic \ > -enable-kvm \ > -m 6g \ > -smp 4 \ > -cpu host \ > -M q35,kernel-irqchip=split \ > -kernel arch/x86/boot/bzImage \ > -append "root=/dev/vda console=ttyS0,115200n8 intel_iommu=on" \ > -device intel-iommu,intremap=on \ > -device ioh3420,multifunction=on,bus=pcie.0,id=port9-0,addr=9.0,chassis=0 \ > -blockdev driver=file,cache.direct=on,node-name=root,filename=/home/hch/images/bookworm.img \ > -blockdev driver=host_device,cache.direct=on,node-name=test,filename=/dev/nvme0n1p4 \ > -device virtio-blk,drive=root \ > -device nvme,drive=test,serial=1234 Thanks, Chaitanya will take a look. If we put aside this issue, do you think that the proposed API is the right one? BTW, I have more fancy command line, it is probably the root cause of working/not-working: /opt/simx/bin/qemu-system-x86_64 -append root=/dev/root rw ignore_loglevel rootfstype=9p rootflags="cache=loose,trans=virtio" earlyprintk=serial,ttyS0,115200 console=hvc0 noibrs noibpb nopti nospectre_v2 nospectre_v1 l1tf=off nospec_store_bypass_disable no_stf_barrier mds=off mitigations=off panic_on_warn=1 intel_iommu=on iommu=nopt iommu.forcedac=true vfio_iommu_type1.allow_unsafe_interrupts=1 systemd.hostname=mtl-leonro-l-vm -chardev stdio,id=stdio,mux=on,signal=off -cpu host -device virtio-rng-pci -device virtio-balloon-pci -device isa-serial,chardev=stdio -device virtio-serial-pci -device virtconsole,chardev=stdio -device virtio-9p-pci,fsdev=host_fs,mount_tag=/dev/root -device virtio-9p-pci,fsdev=host_bind_fs0,mount_tag=bind0 -device virtio-9p-pci,fsdev=host_bind_fs1,mount_tag=bind1 -device virtio-9p-pci,fsdev=host_bind_fs2,mount_tag=bind2 -device intel-iommu,intremap=on -device connectx7 -device nvme,drive=drv0,serial=foo -drive file=/home/leonro/.cache/mellanox/mkt/nvme-1g.raw,if=none,id=drv0,format=raw -enable-kvm -fsdev local,id=host_bind_fs1,security_model=none,path=/logs -fsdev local,id=host_fs,security_model=none,path=/mnt/self -fsdev local,id=host_bind_fs0,security_model=none,path=/plugins -fsdev local,id=host_bind_fs2,security_model=none,path=/home/leonro -fw_cfg etc/sercon-port,string=2 -kernel /home/leonro/src/kernel/arch/x86/boot/bzImage -m 5G -machine q35,kernel-irqchip=split -mon chardev=stdio -net nic,model=virtio,macaddr=52:54:01:d8:e5:f9 -net user,hostfwd=tcp:127.0.0.1:46645-:22 -no-reboot -nodefaults -nographic -smp 16 -vga none > >
On Wed, Jul 03, 2024 at 06:51:14PM +0300, Leon Romanovsky wrote:
> If we put aside this issue, do you think that the proposed API is the right one?
I haven't look at it in detail yet, but from a quick look there is a
few things to note:
1) The amount of code needed in nvme worries me a bit. Now NVMe a messy
driver due to the stupid PRPs vs just using SGLs, but needing a fair
amount of extra boilerplate code in drivers is a bit of a warning sign.
I plan to look into this to see if I can help on improving it, but for
that I need a working version first.
2) The amount of seemingly unrelated global headers pulled into other
global headers. Some of this might just be sloppiness, e.g. I can't
see why dma-mapping.h would actually need iommu.h to start with,
but pci.h in dma-map-ops.h is a no-go.
3) which brings me to real layering violations. dev_is_untrusted and
dev_use_swiotlb are DMA API internals, no way I'd ever want to expose
them. dma-map-ops.h is a semi-internal header only for implementations
of the dma ops (as very clearly documented at the top of that file),
it must not be included by drivers. Same for swiotlb.h.
Not quite as concerning, but doing an indirect call for each map
through dma_map_ops in addition to the iommu ops is not every efficient.
We've through for a while to allow direct calls to dma-iommu similar
how we do direct calls to dma-direct from the core mapping.c code.
This might be a good time to do that as a prep step for this work.
On Thu, Jul 04, 2024 at 09:48:56AM +0200, Christoph Hellwig wrote: > On Wed, Jul 03, 2024 at 06:51:14PM +0300, Leon Romanovsky wrote: > > If we put aside this issue, do you think that the proposed API is the right one? > > I haven't look at it in detail yet, but from a quick look there is a > few things to note: > > > 1) The amount of code needed in nvme worries me a bit. Now NVMe a messy > driver due to the stupid PRPs vs just using SGLs, but needing a fair > amount of extra boilerplate code in drivers is a bit of a warning sign. > I plan to look into this to see if I can help on improving it, but for > that I need a working version first. Chaitanya is working on this and I'll join him to help on next Sunday, after I'll return to the office from my sick leave/ > > > 2) The amount of seemingly unrelated global headers pulled into other > global headers. Some of this might just be sloppiness, e.g. I can't > see why dma-mapping.h would actually need iommu.h to start with, > but pci.h in dma-map-ops.h is a no-go. pci.h was pulled because I needed to call to pci_p2pdma_map_type() in dma_can_use_iova(). > > 3) which brings me to real layering violations. dev_is_untrusted and > dev_use_swiotlb are DMA API internals, no way I'd ever want to expose > them. dma-map-ops.h is a semi-internal header only for implementations > of the dma ops (as very clearly documented at the top of that file), > it must not be included by drivers. Same for swiotlb.h. These item shouldn't worry you and will be changed in the final version. They are outcome of patch "RDMA/umem: Prevent UMEM ODP creation with SWIOTLB". https://lore.kernel.org/all/d18c454636bf3cfdba9b66b7cc794d713eadc4a5.1719909395.git.leon@kernel.org/ All HMM users need such "prevention" so it will be moved to a common place. > > Not quite as concerning, but doing an indirect call for each map > through dma_map_ops in addition to the iommu ops is not every efficient. > We've through for a while to allow direct calls to dma-iommu similar > how we do direct calls to dma-direct from the core mapping.c code. > This might be a good time to do that as a prep step for this work. Sure, no problem, will start in parallel to work on this. >
On Thu, Jul 04, 2024 at 04:18:39PM +0300, Leon Romanovsky wrote: > > 2) The amount of seemingly unrelated global headers pulled into other > > global headers. Some of this might just be sloppiness, e.g. I can't > > see why dma-mapping.h would actually need iommu.h to start with, > > but pci.h in dma-map-ops.h is a no-go. > > pci.h was pulled because I needed to call to pci_p2pdma_map_type() > in dma_can_use_iova(). No, that's not the reason. The reason is actually that whole dev_use_swiotlb mess which shouldn't exist in this form.
Review from the NVMe driver consumer perspective. I think if all these were implement we'd probably end up with less code than before the conversion. The split between dma_iova_attrs, dma_memory_type and dma_iova_state is odd. I would have expected them to just be just a single object. While talking about this I think the domain field in dma_iova_state should probably be a private pointer instead of being tied to the iommu. Also do we need the attrs member in the iova_attrs structure? The "attrs" really are flags passed to the mapping routines that are per-operation and not persistent, so I'd expect them to be passed per-call and not stored in a structure. I'd also expect that the use_iova field to be in the mapping state and not separately provided by the driver. For nvme specific data structures I would have expected a dma_add/ len pair in struct iod_dma_map, maybe even using a common type. Also the data structure split seems odd - I'd expect the actual mapping state and a small number (at least one) dma_addr/len pair to be inside the nvme_iod structure, and then only do the dynamic allocation if we need more of them because there are more segments and we are not using the iommu. If we had a common data structure for the dma_addr/len pairs dma_unlink_range could just take care of the unmap for the non-iommu case as well, which would be neat. I'd also expect that dma_free_iova would be covered by it. I would have expected dma_link_range to return the dma_addr_t instead of poking into the iova structure in the callers. In __nvme_rq_dma_map the <= PAGE_SIZE case is pointless. In the existing code the reason for it is to avoid allocating and mapping the sg_table, but that code is still left before we even get to this code. My suggestion above to only allocate the dma_addr/len pairs when there is more than 1 or a few of it would allow to trivially implement that suggestion using the normal API without having to keep that special case and the dma_len parameter around. If this addes a version of dma_map_page_atttrs that directly took the physical address as a prep patch the callers would not have to bother with page pointer manipulations and just work on physical addresses for both the iommu and no-iommu cases. It would also help a little bit with the eventualy switch to store the physical address instead of page+offset in the bio_vec. Talking about that, I've been wanting to add a bvec_phys helper for to convert the page_phys(bv.bv_page) + bv.bv_offset calculations. This is becoming more urgent with more callers needing to that, I'll try to get it out to Jens ASAP so that it can make the 6.11 merge window. Can we make dma_start_range / dma_end_range simple no-ops for the non-iommu code to avoid boilerplate code in the callers to avoid boilerplate code in the callers to deal with the two cases?
On 7/3/24 07:35, Christoph Hellwig wrote: > On Wed, Jul 03, 2024 at 01:52:53PM +0300, Leon Romanovsky wrote: >> On Wed, Jul 03, 2024 at 07:42:38AM +0200, Christoph Hellwig wrote: >>> I just tried to boot this on my usual qemu test setup with emulated >>> nvme devices, and it dead-loops with messages like this fairly late >>> in the boot cycle: >>> >>> [ 43.826627] iommu: unaligned: iova 0xfff7e000 pa 0x000000010be33650 size 0x1000 min_pagesz 0x1000 >>> [ 43.826982] dma_mapping_error -12 >>> >>> passing intel_iommu=off instead of intel_iommu=on (expectedly) makes >>> it go away. >> Can you please share your kernel command line and qemu? >> On my and Chaitanya setups it works fine. > qemu-system-x86_64 \ > -nographic \ > -enable-kvm \ > -m 6g \ > -smp 4 \ > -cpu host \ > -M q35,kernel-irqchip=split \ > -kernel arch/x86/boot/bzImage \ > -append "root=/dev/vda console=ttyS0,115200n8 intel_iommu=on" \ > -device intel-iommu,intremap=on \ > -device ioh3420,multifunction=on,bus=pcie.0,id=port9-0,addr=9.0,chassis=0 \ > -blockdev driver=file,cache.direct=on,node-name=root,filename=/home/hch/images/bookworm.img \ > -blockdev driver=host_device,cache.direct=on,node-name=test,filename=/dev/nvme0n1p4 \ > -device virtio-blk,drive=root \ > -device nvme,drive=test,serial=1234 > I tried to reproduce this issue somehow it is not reproducible. I'll try again on Leon's setup on my Saturday night, to fix that case. -ck
On Fri, Jul 05, 2024 at 10:53:06PM +0000, Chaitanya Kulkarni wrote: > I tried to reproduce this issue somehow it is not reproducible. > > I'll try again on Leon's setup on my Saturday night, to fix that > case. It is passthrough I/O from userspace. The address is not page aligned as seen in the printk. Forcing bounce buffering of all passthrough I/O makes it go away. The problem is the first mapped segment does not have to be aligned and we're missing the code to places it at the aligned offset into the IOVA space.
On Sat, Jul 06, 2024 at 08:26:04AM +0200, Christoph Hellwig wrote: > On Fri, Jul 05, 2024 at 10:53:06PM +0000, Chaitanya Kulkarni wrote: > > I tried to reproduce this issue somehow it is not reproducible. > > > > I'll try again on Leon's setup on my Saturday night, to fix that > > case. > > It is passthrough I/O from userspace. The address is not page aligned > as seen in the printk. Forcing bounce buffering of all passthrough > I/O makes it go away. > > The problem is the first mapped segment does not have to be aligned > and we're missing the code to places it at the aligned offset into > the IOVA space. Thanks for the explanation.
On Fri, Jul 05, 2024 at 08:39:10AM +0200, Christoph Hellwig wrote: > Review from the NVMe driver consumer perspective. I think if all these > were implement we'd probably end up with less code than before the > conversion. Thanks for the review, I will try to address all the comments in the next version. > > The split between dma_iova_attrs, dma_memory_type and dma_iova_state is > odd. I would have expected them to just be just a single object. While > talking about this I think the domain field in dma_iova_state should > probably be a private pointer instead of being tied to the iommu. > > Also do we need the attrs member in the iova_attrs structure? The > "attrs" really are flags passed to the mapping routines that are > per-operation and not persistent, so I'd expect them to be passed > per-call and not stored in a structure. It is left-over from my not-send version where I added new attribute to indicate that dma_alloc_iova() can't support SWIOTLB to avoid dev_use_swiotlb() mess. I will remove it. > > I'd also expect that the use_iova field to be in the mapping state > and not separately provided by the driver. > > For nvme specific data structures I would have expected a dma_add/ > len pair in struct iod_dma_map, maybe even using a common type. > > Also the data structure split seems odd - I'd expect the actual > mapping state and a small number (at least one) dma_addr/len pair > to be inside the nvme_iod structure, and then only do the dynamic > allocation if we need more of them because there are more segments > and we are not using the iommu. > > If we had a common data structure for the dma_addr/len pairs > dma_unlink_range could just take care of the unmap for the non-iommu > case as well, which would be neat. I'd also expect that > dma_free_iova would be covered by it. Internally Jason asked for the same thing, but I didn't want to produce asymmetric API where drivers have a call to dma_alloc_iova() but don't have a call to dma_free_iova(). However, now, it is 2 versus 1, so I will change it. > > I would have expected dma_link_range to return the dma_addr_t instead > of poking into the iova structure in the callers. > > In __nvme_rq_dma_map the <= PAGE_SIZE case is pointless. In the > existing code the reason for it is to avoid allocating and mapping the > sg_table, but that code is still left before we even get to this code. > > My suggestion above to only allocate the dma_addr/len pairs when there > is more than 1 or a few of it would allow to trivially implement that > suggestion using the normal API without having to keep that special > case and the dma_len parameter around. > > If this addes a version of dma_map_page_atttrs that directly took > the physical address as a prep patch the callers would not have to > bother with page pointer manipulations and just work on physical > addresses for both the iommu and no-iommu cases. It would also help > a little bit with the eventualy switch to store the physical address > instead of page+offset in the bio_vec. Talking about that, I've > been wanting to add a bvec_phys helper for to convert the > page_phys(bv.bv_page) + bv.bv_offset calculations. This is becoming > more urgent with more callers needing to that, I'll try to get it out > to Jens ASAP so that it can make the 6.11 merge window. > > Can we make dma_start_range / dma_end_range simple no-ops for the > non-iommu code to avoid boilerplate code in the callers to avoid > boilerplate code in the callers to deal with the two cases? Yes, sure. Thanks
On Fri, Jul 05, 2024 at 10:53:06PM +0000, Chaitanya Kulkarni wrote: > On 7/3/24 07:35, Christoph Hellwig wrote: > > On Wed, Jul 03, 2024 at 01:52:53PM +0300, Leon Romanovsky wrote: > >> On Wed, Jul 03, 2024 at 07:42:38AM +0200, Christoph Hellwig wrote: > >>> I just tried to boot this on my usual qemu test setup with emulated > >>> nvme devices, and it dead-loops with messages like this fairly late > >>> in the boot cycle: > >>> > >>> [ 43.826627] iommu: unaligned: iova 0xfff7e000 pa 0x000000010be33650 size 0x1000 min_pagesz 0x1000 > >>> [ 43.826982] dma_mapping_error -12 > >>> > >>> passing intel_iommu=off instead of intel_iommu=on (expectedly) makes > >>> it go away. > >> Can you please share your kernel command line and qemu? > >> On my and Chaitanya setups it works fine. > > qemu-system-x86_64 \ > > -nographic \ > > -enable-kvm \ > > -m 6g \ > > -smp 4 \ > > -cpu host \ > > -M q35,kernel-irqchip=split \ > > -kernel arch/x86/boot/bzImage \ > > -append "root=/dev/vda console=ttyS0,115200n8 intel_iommu=on" \ > > -device intel-iommu,intremap=on \ > > -device ioh3420,multifunction=on,bus=pcie.0,id=port9-0,addr=9.0,chassis=0 \ > > -blockdev driver=file,cache.direct=on,node-name=root,filename=/home/hch/images/bookworm.img \ > > -blockdev driver=host_device,cache.direct=on,node-name=test,filename=/dev/nvme0n1p4 \ > > -device virtio-blk,drive=root \ > > -device nvme,drive=test,serial=1234 > > > > I tried to reproduce this issue somehow it is not reproducible. > > I'll try again on Leon's setup on my Saturday night, to fix that > case. Chaitanya, I added "mem_align=120" line to fio configuration file and the issue reproduced. Thanks > > -ck > > >
On Thu, Jul 04, 2024 at 09:48:56AM +0200, Christoph Hellwig wrote: > 1) The amount of code needed in nvme worries me a bit. Now NVMe a messy > driver due to the stupid PRPs vs just using SGLs, but needing a fair > amount of extra boilerplate code in drivers is a bit of a warning sign. > I plan to look into this to see if I can help on improving it, but for > that I need a working version first. It would be nice to have less. So much now depends on the caller to provide both the input and output data structure. Ideally we'd have some template code that consolidates these loops to common code with driver provided hooks - there are a few ways to get that efficiently in C. I think it will be clearer when we get to RDMA and there we have the same SGL/PRP kind of split up and we can see what is sharable. > Not quite as concerning, but doing an indirect call for each map > through dma_map_ops in addition to the iommu ops is not every > efficient. Yeah, there is no reason to support anything other than dma-iommu.c for the iommu path, so the dma_map_op indirection for this could just be removed. I'm also cooking something that should let us build a way to iommu map a bio_vec very efficiently, which should transform this into a single indirect call into the iommu driver per bio_vec, and a single radix walk/etc. > We've through for a while to allow direct calls to dma-iommu similar > how we do direct calls to dma-direct from the core mapping.c code. > This might be a good time to do that as a prep step for this work. I think there is room to benchmark and further improve these paths. Even the fast direct map path is not compiling down to a single load/store instruction per bio_vec entry as would be ideal. Jason
On Fri, Jul 05, 2024 at 08:39:10AM +0200, Christoph Hellwig wrote: > Review from the NVMe driver consumer perspective. I think if all these > were implement we'd probably end up with less code than before the > conversion. One of the topics that came up with at LSF is what to do with the dma_memory_type data? The concept here was that the new DMA API would work on same-type memory only, meaning the caller would have to split up by type. I understand the block stack already does this using P2P and !P2P, but that isn't quite enough here as we want to split principally based on IOMMU or !IOMMU. Today that boils down to splitting based on a few things, like grouping encrypted memory, and grouping by P2P provider. So the idea was the "struct dma_memory_type" would encapsulate whatever grouping was needed and the block layer would drive its splitting on this, same structs can be merged. I didn't notice this got included in this RFC.. The trivial option is to store the dma_memory_type per-BIO and drive the de-duplication like that. The other could be to derive it from first entry in the bio_vect every time a new page is added. This same-type design is one of the key elements of this API arrangement - for RDMA I expect we will need a data structure storing a list of types with a list of pages, and we will DMA map type by type. Jason
On Mon, Jul 08, 2024 at 01:52:38PM -0300, Jason Gunthorpe wrote: > Ideally we'd have some template code that consolidates these loops to > common code with driver provided hooks - there are a few ways to get > that efficiently in C. > > I think it will be clearer when we get to RDMA and there we have the > same SGL/PRP kind of split up and we can see what is sharable. I really would not want to build common code for PRPs - this is a concept very specific to RDMA and NVMe. OTOH more common code SGLs would be nice. If you look at e.g. SCSI drivers most of them have a simpe loop of mapping the SG table and then copying the fields into the hardware SGL. This would be a very common case for a helper. That whole thing of course opens the question if we want a pure in-memory version of the dma_addr_t/len tuple. IMHO that is the best way to migrate and allows to share code easily. We can look into ways to avoiding that more for drivers that care, but most drivers are probably best serve with it to keep the code simple and make the conversion easier. > I'm also cooking something that should let us build a way to iommu map > a bio_vec very efficiently, which should transform this into a single > indirect call into the iommu driver per bio_vec, and a single radix > walk/etc. I assume you mean array of bio_vecs here. That would indeed nice. We'd still potentially need a few calls for block drivers as requests can have multiple bios and thus bio_vec arrays, but it would still be a nice reduction of calls.
On Mon, Jul 08, 2024 at 08:57:21PM -0300, Jason Gunthorpe wrote: > I understand the block stack already does this using P2P and !P2P, but > that isn't quite enough here as we want to split principally based on > IOMMU or !IOMMU. Except for the powerpc bypass IOMMU or not is a global decision, and the bypass is per I/O. So I'm not sure what else you want there?
On Tue, Jul 09, 2024 at 08:17:21AM +0200, Christoph Hellwig wrote: > On Mon, Jul 08, 2024 at 01:52:38PM -0300, Jason Gunthorpe wrote: > > Ideally we'd have some template code that consolidates these loops to > > common code with driver provided hooks - there are a few ways to get > > that efficiently in C. > > > > I think it will be clearer when we get to RDMA and there we have the > > same SGL/PRP kind of split up and we can see what is sharable. > > I really would not want to build common code for PRPs - this is a concept > very specific to RDMA and NVMe. I think DRM has it too. If you are populating a GPU page table then it is basically a convoluted PRP. Probably requires different splitting logic than what RDMA does, but I've never looked. > OTOH more common code SGLs would be nice. If you look at e.g. SCSI > drivers most of them have a simpe loop of mapping the SG table and > then copying the fields into the hardware SGL. This would be a very > common case for a helper. Yes, I belive this is very common. > That whole thing of course opens the question if we want a pure > in-memory version of the dma_addr_t/len tuple. IMHO that is the best > way to migrate and allows to share code easily. We can look into ways > to avoiding that more for drivers that care, but most drivers are > probably best serve with it to keep the code simple and make the > conversion easier. My feeling has been that this RFC is the low level interface and we can bring our own data structure on top. It would probably make sense to build a scatterlist v2 on top of this that has an in-memory dma_addr_t/len list close to today. Yes it costs a memory allocation, or a larger initial allocation, but many places may not really care. Block drivers have always allocated a SGL, for instance. Then the verbosity of this API is less important as we may only use it in a few places. My main take away was that we should make the dma_ops interface simpler and more general so we can have this choice instead of welding a single datastructure through everything. > > I'm also cooking something that should let us build a way to iommu map > > a bio_vec very efficiently, which should transform this into a single > > indirect call into the iommu driver per bio_vec, and a single radix > > walk/etc. > > I assume you mean array of bio_vecs here. That would indeed nice. > We'd still potentially need a few calls for block drivers as > requests can have multiple bios and thus bio_vec arrays, but it would > still be a nice reduction of calls. Yes. iommufd has performance needs here, not sure what it will turn into but bio_vec[] direct to optimized radix manipuilation is something I'd be keen to see. Jason
On Tue, Jul 09, 2024 at 08:20:15AM +0200, Christoph Hellwig wrote: > On Mon, Jul 08, 2024 at 08:57:21PM -0300, Jason Gunthorpe wrote: > > I understand the block stack already does this using P2P and !P2P, but > > that isn't quite enough here as we want to split principally based on > > IOMMU or !IOMMU. > > Except for the powerpc bypass IOMMU or not is a global decision, > and the bypass is per I/O. So I'm not sure what else you want there? For P2P we know if the DMA will go through the IOMMU or not based on the PCIe fabric path between the initiator (the one doing the DMA) and the target (the one providing the MMIO memory). Depending on PCIe topology and ACS flags this path may use the IOMMU or may skip the IOMMU. To put it in code, the 'enum pci_p2pdma_map_type' can only be determined once we know the initator and target struct device. PCI_P2PDMA_MAP_BUS_ADDR means we don't use the iommu. PCI_P2PDMA_MAP_THRU_HOST_BRIDGE means we do. With this API it is important that a single request always has the same PCI_P2PDMA_MAP_* outcome, and the simplest way to do that is to split requests if the MMIO memory changes target struct devices. Jason
On Tue, Jul 09, 2024 at 04:03:20PM -0300, Jason Gunthorpe wrote: > > Except for the powerpc bypass IOMMU or not is a global decision, > > and the bypass is per I/O. So I'm not sure what else you want there? > > For P2P we know if the DMA will go through the IOMMU or not based on > the PCIe fabric path between the initiator (the one doing the DMA) and > the target (the one providing the MMIO memory). Oh, yes. So effectively you are asking if we can arbitrarily mix P2P sources in a single map request. I think the only sane answer from the iommu/dma subsystem perspective is: hell no. But that means the upper layer need to split at such a boundary. E.g. get_user_pages needs to look at this and stop at the boundary, leaving the rest to the next call. For the block layer just having one kind per BIO is fine right now, although I could see use cases where people would want to combine them. We can probably defer that until it is needed, though.
On Tue, Jul 09, 2024 at 03:53:15PM -0300, Jason Gunthorpe wrote: > > That whole thing of course opens the question if we want a pure > > in-memory version of the dma_addr_t/len tuple. IMHO that is the best > > way to migrate and allows to share code easily. We can look into ways > > to avoiding that more for drivers that care, but most drivers are > > probably best serve with it to keep the code simple and make the > > conversion easier. > > My feeling has been that this RFC is the low level interface and we > can bring our own data structure on top. > > It would probably make sense to build a scatterlist v2 on top of this > that has an in-memory dma_addr_t/len list close to today Yes, the usage of the dma_vec would be in a higher layer. But I'd really like to see it from the beginning. > . Yes it costs > a memory allocation, or a larger initial allocation, but many places > may not really care. Block drivers have always allocated a SGL, for > instance. Except for those optimizing for snall transfer of a single segment (like nvme). > My main take away was that we should make the dma_ops interface > simpler and more general so we can have this choice instead of welding > a single datastructure through everything. Yes, I don't think the dma_vec should be the low-level interface. I think a low-level interface based on physical address is the right one. I'll see what I can do to move the single segment map interface to be physical address based instead of page based so that we can unify them.
On Wed, Jul 10, 2024 at 08:27:04AM +0200, Christoph Hellwig wrote: > On Tue, Jul 09, 2024 at 03:53:15PM -0300, Jason Gunthorpe wrote: > > > That whole thing of course opens the question if we want a pure > > > in-memory version of the dma_addr_t/len tuple. IMHO that is the best > > > way to migrate and allows to share code easily. We can look into ways > > > to avoiding that more for drivers that care, but most drivers are > > > probably best serve with it to keep the code simple and make the > > > conversion easier. > > > > My feeling has been that this RFC is the low level interface and we > > can bring our own data structure on top. > > > > It would probably make sense to build a scatterlist v2 on top of this > > that has an in-memory dma_addr_t/len list close to today > > Yes, the usage of the dma_vec would be in a higher layer. But I'd > really like to see it from the beginning. Well, lets start with agreeing on this layer's API and be confident it can succeed. Then I'd say to look at RDMA as it is a logical place to build such a data structure and we can build something that at least does what RDMA needs. I need something anyhow to plumb through to DMABUF and over to iommufd and VFIO, can't skip out on it :) > Yes, I don't think the dma_vec should be the low-level interface. > I think a low-level interface based on physical address is the right > one. I'll see what I can do to move the single segment map interface > to be physical address based instead of page based so that we can > unify them. Yeah, I've been talking to Matthew explaining that starting at the DMA API makes the most sense and lets remove mandatory struct page entanglements there. Then we can start to examine other layers. Having a consistent option in the DMA API to be physically based with a memory type fits with that plan. Jason
On Wed, Jul 10, 2024 at 08:22:12AM +0200, Christoph Hellwig wrote: > On Tue, Jul 09, 2024 at 04:03:20PM -0300, Jason Gunthorpe wrote: > > > Except for the powerpc bypass IOMMU or not is a global decision, > > > and the bypass is per I/O. So I'm not sure what else you want there? > > > > For P2P we know if the DMA will go through the IOMMU or not based on > > the PCIe fabric path between the initiator (the one doing the DMA) and > > the target (the one providing the MMIO memory). > > Oh, yes. So effectively you are asking if we can arbitrarily mix > P2P sources in a single map request. I think the only sane answer > from the iommu/dma subsystem perspective is: hell no. Well, today we can mix them and the dma_map_sg will sort it out. With this new API we can't anymore. So this little detail needs to be taken care of somehow as well, and I didn't see it in this RFC. > For the block layer just having one kind per BIO is fine right now, > although I could see use cases where people would want to combine > them. We can probably defer that until it is needed, though. Do you have an application in mind that would want multi-kind per BIO? Jason
On Thu, Jul 11, 2024 at 08:29:17PM -0300, Jason Gunthorpe wrote: > So this little detail needs to be taken care of somehow as well, and I > didn't see it in this RFC. Yes. Same about generally not mixing P2P and non-P2P. > > > For the block layer just having one kind per BIO is fine right now, > > although I could see use cases where people would want to combine > > them. We can probably defer that until it is needed, though. > > Do you have an application in mind that would want multi-kind per BIO? If you are doing say garbage collection in a file system, and do write that sources data from multiple devices, where some sit directly on the root port and others behind a switch. This is all purely hypothetical, and I'm happy to just check for it and reject it for it now.
On Fri, Jul 12, 2024 at 06:54:22AM +0200, Christoph Hellwig wrote: > This is all purely hypothetical, and I'm happy to just check for it > and reject it for it now. I do know a patch set is cooking to allow mixing ZONE_DEVICE P2P and anon memory in the same VMA ala HMM with transparent migration of ZONE_DEVICE to anon. In this situation userspace will be generating IO with no idea about any P2P/!P2P boundaries. Jason
On Fri, Jul 12, 2024 at 09:42:37AM -0300, Jason Gunthorpe wrote: > On Fri, Jul 12, 2024 at 06:54:22AM +0200, Christoph Hellwig wrote: > > > This is all purely hypothetical, and I'm happy to just check for it > > and reject it for it now. > > I do know a patch set is cooking to allow mixing ZONE_DEVICE P2P and > anon memory in the same VMA ala HMM with transparent migration of > ZONE_DEVICE to anon. > > In this situation userspace will be generating IO with no idea about > any P2P/!P2P boundaries. Yes. And as said from the beginning of these discussion I think the right way is to change the gup code so that for a single call to get/pin_user_pages it always returns either only P2P pages or non-P2P pages only, with the FOLL_PCI_P2PDMA just allowing P2P pages at all. Similarly they could easily just return one kind of P2P pages per call only.