Message ID: 20241209192927.107503-2-mjrosato@linux.ibm.com (mailing list archive)
State: New
Series: s390x/pci: relax I/O address translation requirement
On 09.12.24 20:29, Matthew Rosato wrote:
> When receiving a guest mpcifc(4) or mpcifc(6) instruction without the T
> bit set, treat this as a request to perform direct mapping instead of
> address translation. In order to facilitiate this, pin the entirety of
> guest memory into the host iommu.
>
> Subsequent guest DMA operations are all expected to be of the format
> guest_phys+sdma, allowing them to be used as lookup into the host
> iommu table.
>
> Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
> ---
>  hw/s390x/s390-pci-bus.c         | 23 ++++++++++++++++++
>  hw/s390x/s390-pci-inst.c        | 42 +++++++++++++++++++++++++++++++--
>  include/hw/s390x/s390-pci-bus.h |  2 ++
>  3 files changed, 65 insertions(+), 2 deletions(-)
>
> [...]
>
> +static void s390_pci_setup_stage2_map(S390PCIIOMMU *iommu)
> +{
> +    MachineState *ms = MACHINE(qdev_get_machine());
> +    uint64_t remain = ms->ram_size, start = iommu->pba, mask, size, curr = 0;
> +    uint64_t end = start + remain - 1;
> +    IOMMUTLBEvent event = {
> +        .type = IOMMU_NOTIFIER_MAP,
> +        .entry = {
> +            .target_as = &address_space_memory,
> +            .translated_addr = 0,
> +            .perm = IOMMU_RW,
> +        },
> +    };
> +
> +    while (remain >= TARGET_PAGE_SIZE) {
> +        mask = dma_aligned_pow2_mask(start, end, 64);
> +        size = mask + 1;
> +        event.entry.iova = start;
> +        event.entry.addr_mask = mask;
> +        event.entry.translated_addr = curr;
> +        memory_region_notify_iommu(&iommu->iommu_mr, 0, event);
> +        start += size;
> +        curr += size;
> +        remain -= size;
> +    }
> +}

Hi,

Trying to wrap my head around that ... you mention that "pin the entirety of guest memory".

Do you mean that we will actually end up longterm pinning all guest RAM in the kernel, similar to what vfio ends up doing?

In that case, it would be incompatible with virtio-balloon (and, without modifications, with the upcoming virtio-mem). Is there already a mechanism in place to handle that -- a call to ram_block_discard_disable() -- or even a way to support coordinated discarding of RAM (e.g., virtio-mem + vfio)?
On 12/9/24 4:01 PM, David Hildenbrand wrote:
> On 09.12.24 20:29, Matthew Rosato wrote:
>
> Hi,
>
> Trying to wrap my head around that ... you mention that "pin the entirety of guest memory".
>
> Do you mean that we will actually end up longterm pinning all guest RAM in the kernel, similar to what vfio ends up doing?

Yes. Actually, the usecase here is specifically PCI passthrough via vfio-pci on s390. Unlike other platforms, the default s390 approach only pins on-demand and doesn't longterm pin all guest RAM, which is nice from a memory footprint perspective but pays a price via all those guest-2 RPCIT instructions. The goal here is to now provide an optional alternative that longterm pins like other platforms.

> In that case, it would be incompatible with virtio-balloon (and, without modifications, with the upcoming virtio-mem). Is there already a mechanism in place to handle that -- a call to ram_block_discard_disable() -- or even a way to support coordinated discarding of RAM (e.g., virtio-mem + vfio)?

Good point, we should be calling ram_block_discard_disable(true) at set-register time + a corresponding (false) during deregister... Will add for v2.

As for supporting coordinated discard, I was hoping to subsequently look at virtio-mem for this.

Thanks,
Matt
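[For illustration, the v2 change described above might look roughly like the following in reg_ioat() and pci_dereg_ioat(). This is a sketch only; the exact error handling and return code are assumptions, not the actual v2 code.]

    /* reg_ioat(), sketch: refuse direct mapping if RAM discarding (e.g.
     * virtio-balloon) cannot be disabled, since all of guest RAM is about
     * to be longterm-pinned by vfio. */
    if (!t) {
        if (ram_block_discard_disable(true)) {
            error_report("s390-pci: cannot disable RAM discarding");
            s390_program_interrupt(env, PGM_OPERAND, ra);  /* assumed rc */
            return -EINVAL;
        }
        s390_pci_iommu_dm_enable(iommu);
        s390_pci_setup_stage2_map(iommu);
    } else {
        s390_pci_iommu_enable(iommu);
    }

    /* pci_dereg_ioat(), sketch: lift the ban once the pinning is undone */
    if (iommu->direct_map) {
        s390_pci_batch_unmap(iommu, iommu->pba, ms->ram_size);
        ram_block_discard_disable(false);
    }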
On 09.12.24 22:45, Matthew Rosato wrote:
> On 12/9/24 4:01 PM, David Hildenbrand wrote:
>> Do you mean that we will actually end up longterm pinning all guest RAM in the kernel, similar to what vfio ends up doing?
>
> Yes. Actually, the usecase here is specifically PCI passthrough via vfio-pci on s390. Unlike other platforms, the default s390 approach only pins on-demand and doesn't longterm pin all guest RAM, which is nice from a memory footprint perspective but pays a price via all those guest-2 RPCIT instructions. The goal here is to now provide an optional alternative that longterm pins like other platforms.

Okay, thanks for confirming. One more question: who will trigger this longterm-pinning? Is it vfio?

(the code flow from your code to the pinning code would be nice)

>> In that case, it would be incompatible with virtio-balloon (and, without modifications, with the upcoming virtio-mem). Is there already a mechanism in place to handle that -- a call to ram_block_discard_disable() -- or even a way to support coordinated discarding of RAM (e.g., virtio-mem + vfio)?
>
> Good point, we should be calling ram_block_discard_disable(true) at set-register time + a corresponding (false) during deregister... Will add for v2.
>
> As for supporting coordinated discard, I was hoping to subsequently look at virtio-mem for this.

As long as discarding is blocked for now, we're good. To support it, the RAMDiscardManager would have to be wired up, similar to vfio.

I think the current way of handling it via

+    IOMMUTLBEvent event = {
+        .type = IOMMU_NOTIFIER_MAP,
+        .entry = {
+            .target_as = &address_space_memory,
+            .translated_addr = 0,
+            .perm = IOMMU_RW,
+        },
+    };

is probably not ideal: it cannot cope with memory holes (which virtio-mem would create).

Likely, you'd instead want an address space notifier, and really only map the memory region sections you get notified about.

There, you can test for RAMDiscardManager and handle it like vfio does.
On 12/9/24 5:09 PM, David Hildenbrand wrote:
> Okay, thanks for confirming. One more question: who will trigger this longterm-pinning? Is it vfio?
>
> (the code flow from your code to the pinning code would be nice)

Yes, the vfio IOMMU code triggers it. My s390_pci_setup_stage2_map() added by this patch calls memory_region_notify_iommu() in a loop such that we trigger iommu notifications to map iova X+pba -> GPA X for all GPA, where pba is a "base address" offset that has to be applied when mapping on s390. The notifications are sent in the largest batches possible to minimize vfio ioctls / use the least number of vfio dma mappings.

The iommu notifications get picked up in vfio_iommu_map_notify, where we follow the container DMA path down to vfio_legacy_dma_map; ultimately vfio handles the pinning via the VFIO_IOMMU_MAP_DMA ioctl.

> As long as discarding is blocked for now, we're good. To support it, the RAMDiscardManager would have to be wired up, similar to vfio.
>
> I think the current way of handling it via
>
> +    IOMMUTLBEvent event = {
> +        .type = IOMMU_NOTIFIER_MAP,
> +        .entry = {
> +            .target_as = &address_space_memory,
> +            .translated_addr = 0,
> +            .perm = IOMMU_RW,
> +        },
> +    };
>
> is probably not ideal: it cannot cope with memory holes (which virtio-mem would create).
>
> Likely, you'd instead want an address space notifier, and really only map the memory region sections you get notified about.
>
> There, you can test for RAMDiscardManager and handle it like vfio does.

I'll start looking into this; for the moment I'll plan on blocking discarding in this series, with a follow-on to then enable virtio-mem, but if I get something working sooner I'll add it to this series. Either way I'll put you on CC.
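[To make the batching concrete, here is a standalone toy program mimicking the chunking done by the while loop in s390_pci_setup_stage2_map(). The helper is a simplified re-implementation of QEMU's dma_aligned_pow2_mask() written just for this demo, and the pba/RAM values are made up.]

    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>

    /* Simplified stand-in for dma_aligned_pow2_mask(): mask of the largest
     * power-of-2-sized region that is naturally aligned at start and does
     * not extend past end. */
    static uint64_t pow2_mask(uint64_t start, uint64_t end)
    {
        uint64_t align = start ? (start & -start) - 1 : UINT64_MAX;
        uint64_t size = end - start;

        if (align <= size) {
            return align;       /* limited by the alignment of start */
        }
        size++;                 /* limited by the remaining length */
        while (size & (size - 1)) {
            size &= size - 1;   /* round down to a power of 2 */
        }
        return size - 1;
    }

    int main(void)
    {
        /* made-up values: pba = 256 MiB, 4 GiB of guest RAM */
        uint64_t start = 0x10000000, remain = 4ULL << 30;
        uint64_t end = start + remain - 1;

        while (remain) {
            uint64_t size = pow2_mask(start, end) + 1;
            printf("map iova 0x%09" PRIx64 " size 0x%09" PRIx64 "\n",
                   start, size);
            start += size;
            remain -= size;
        }
        return 0;
    }

[With these values the range collapses into five map notifications (256M + 512M + 1G + 2G + 256M), i.e. five VFIO_IOMMU_MAP_DMA ioctls instead of roughly a million individual 4K mappings.]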
On 10.12.24 00:22, Matthew Rosato wrote:
> On 12/9/24 5:09 PM, David Hildenbrand wrote:
>> Okay, thanks for confirming. One more question: who will trigger this longterm-pinning? Is it vfio?
>>
>> (the code flow from your code to the pinning code would be nice)

Thanks for the details.

> Yes, the vfio IOMMU code triggers it. My s390_pci_setup_stage2_map() added by this patch calls memory_region_notify_iommu() in a loop such that we trigger iommu notifications to map iova X+pba -> GPA X for all GPA, where pba is a "base address" offset that has to be applied when mapping on s390.

Ah, now I see that you use "ms->ram_size", so indeed, this will map initial RAM only.

The more-future-proof approach would indeed be to register a memory listener on &address_space_memory, to then map/unmap whenever notified about map/unmap. See "struct vfio_memory_listener" with its region_add/region_del callbacks that do that, and also implement the RAMDiscardManager plumbing.

And now I wonder if there would be a way to just reuse "struct vfio_memory_listener" here some way? Or at least reuse the map/unmap functions, because ...

> The notifications are sent in the largest batches possible to minimize vfio ioctls / use the least number of vfio dma mappings.
>
> The iommu notifications get picked up in vfio_iommu_map_notify, where we follow the container DMA path down to vfio_legacy_dma_map; ultimately vfio handles the pinning via the VFIO_IOMMU_MAP_DMA ioctl.

... vfio_listener_region_add() will also just call vfio_container_dma_map(). Maybe there is a reason s390x needs to handle this using memory_region_notify_iommu() callbacks instead of doing it similar to "struct vfio_memory_listener" when registered on &address_space_memory without a viommu.

> I'll start looking into this; for the moment I'll plan on blocking discarding in this series, with a follow-on to then enable virtio-mem, but if I get something working sooner I'll add it to this series. Either way I'll put you on CC.

Thanks, exploring whether we can reuse the vfio bits to map/unmap might be very valuable and simplify things.
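[For reference, the listener-based approach suggested above might look something like the following. This is a hypothetical sketch: the dm_listener field and the s390_pci_dm_map()/s390_pci_dm_region_del() helpers do not exist in this series, and vfio's real region_add callback additionally handles RAMDiscardManager sections.]

    static void s390_pci_dm_region_add(MemoryListener *listener,
                                       MemoryRegionSection *section)
    {
        S390PCIIOMMU *iommu = container_of(listener, S390PCIIOMMU,
                                           dm_listener);

        if (!memory_region_is_ram(section->mr)) {
            return;
        }
        /* iova = pba + GPA, mapping just this section rather than
         * unconditionally walking all of ms->ram_size */
        s390_pci_dm_map(iommu,
                        iommu->pba + section->offset_within_address_space,
                        section->offset_within_address_space,
                        int128_get64(section->size));
    }

    /* registered while direct mapping is active, e.g. from reg_ioat() */
    iommu->dm_listener = (MemoryListener) {
        .name = "s390-pci-dm",
        .region_add = s390_pci_dm_region_add,
        .region_del = s390_pci_dm_region_del,
    };
    memory_listener_register(&iommu->dm_listener, &address_space_memory);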
On 09/12/2024 20.29, Matthew Rosato wrote:
> When receiving a guest mpcifc(4) or mpcifc(6) instruction without the T
> bit set, treat this as a request to perform direct mapping instead of
> address translation. In order to facilitiate this, pin the entirety of

s/facilitiate/facilitate/

> guest memory into the host iommu.

...

> +void s390_pci_iommu_dm_enable(S390PCIIOMMU *iommu)
> +{
> +    MachineState *ms = MACHINE(qdev_get_machine());
> +
> +    /*
> +     * For direct-mapping we must map the entire guest address space. Because
> +     * the mappings are contiguous we are not restricted to individual 4K
> +     * mappings via vfio, so let's not worry about the DMA limit when
> +     * calculating the range.
> +     */
> +    char *name = g_strdup_printf("iommu-s390-%04x", iommu->pbdev->uid);

FWIW, you could use g_autofree to get rid of the g_free() at the end of the function.

> +    memory_region_init_iommu(&iommu->iommu_mr, sizeof(iommu->iommu_mr),
> +                             TYPE_S390_IOMMU_MEMORY_REGION, OBJECT(&iommu->mr),
> +                             name, iommu->pba + ms->ram_size);
> +    iommu->enabled = true;
> +    iommu->direct_map = true;
>      memory_region_add_subregion(&iommu->mr, 0, MEMORY_REGION(&iommu->iommu_mr));
>      g_free(name);
> }

 Thomas
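[For reference, with g_autofree the function body would reduce to roughly the following sketch:]

    void s390_pci_iommu_dm_enable(S390PCIIOMMU *iommu)
    {
        MachineState *ms = MACHINE(qdev_get_machine());
        /* freed automatically when name goes out of scope */
        g_autofree char *name = g_strdup_printf("iommu-s390-%04x",
                                                iommu->pbdev->uid);

        memory_region_init_iommu(&iommu->iommu_mr, sizeof(iommu->iommu_mr),
                                 TYPE_S390_IOMMU_MEMORY_REGION,
                                 OBJECT(&iommu->mr),
                                 name, iommu->pba + ms->ram_size);
        iommu->enabled = true;
        iommu->direct_map = true;
        memory_region_add_subregion(&iommu->mr, 0,
                                    MEMORY_REGION(&iommu->iommu_mr));
        /* no trailing g_free(name) needed */
    }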
On 12/9/24 20:29, Matthew Rosato wrote:
> When receiving a guest mpcifc(4) or mpcifc(6) instruction without the T
> bit set, treat this as a request to perform direct mapping instead of
> address translation. In order to facilitiate this, pin the entirety of
> guest memory into the host iommu.
>
> [...]
>
> +void s390_pci_iommu_dm_enable(S390PCIIOMMU *iommu)

This is duplicating s390_pci_iommu_enable(). Maybe we could add a new argument to s390_pci_iommu_enable() instead?

> +{
> +    MachineState *ms = MACHINE(qdev_get_machine());
> +
> +    /*
> +     * For direct-mapping we must map the entire guest address space. Because
> +     * the mappings are contiguous we are not restricted to individual 4K
> +     * mappings via vfio, so let's not worry about the DMA limit when
> +     * calculating the range.
> +     */
> [...]
>
> +static void s390_pci_setup_stage2_map(S390PCIIOMMU *iommu)

This is very much like s390_pci_batch_unmap(). Could we introduce a common helper?

> +{
> +    MachineState *ms = MACHINE(qdev_get_machine());
> +    uint64_t remain = ms->ram_size, start = iommu->pba, mask, size, curr = 0;
> [...]
>
>      /* currently we only support designation type 1 with translation */
> -    if (!(dt == ZPCI_IOTA_RTTO && t)) {
> +    if (t && !(dt == ZPCI_IOTA_RTTO)) {

Is this change part of the patchset? It looks like some other issue. I might be wrong.

>          error_report("unsupported ioat dt %d t %d", dt, t);
>          s390_program_interrupt(env, PGM_OPERAND, ra);
>          return -EINVAL;
> [...]
>
> -    s390_pci_iommu_enable(iommu);
> +    if (t) {
> +        s390_pci_iommu_enable(iommu);
> +    } else {
> +        s390_pci_iommu_dm_enable(iommu);
> +        /* If not translating, map everything now */
> +        s390_pci_setup_stage2_map(iommu);
> +    }

I don't understand how we can enter the "map everything" case. Could you explain a bit more the scenario?

Thanks,

C.

> [...]
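[For illustration, the first suggestion above might look roughly like the following. This is a sketch only, not code from the series; the combined signature is hypothetical.]

    /* Fold both variants into one function taking the direct-map decision
     * as an argument. */
    void s390_pci_iommu_enable(S390PCIIOMMU *iommu, bool direct_map)
    {
        MachineState *ms = MACHINE(qdev_get_machine());
        /* window covers the full guest address space when direct mapping,
         * otherwise just the registered aperture */
        uint64_t size = direct_map ? iommu->pba + ms->ram_size
                                   : iommu->pal + 1;
        g_autofree char *name = g_strdup_printf("iommu-s390-%04x",
                                                iommu->pbdev->uid);

        memory_region_init_iommu(&iommu->iommu_mr, sizeof(iommu->iommu_mr),
                                 TYPE_S390_IOMMU_MEMORY_REGION,
                                 OBJECT(&iommu->mr), name, size);
        iommu->enabled = true;
        iommu->direct_map = direct_map;
        memory_region_add_subregion(&iommu->mr, 0,
                                    MEMORY_REGION(&iommu->iommu_mr));
    }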
On 12/11/24 9:40 AM, Cédric Le Goater wrote:
> On 12/9/24 20:29, Matthew Rosato wrote:
>>      /* currently we only support designation type 1 with translation */
>> -    if (!(dt == ZPCI_IOTA_RTTO && t)) {
>> +    if (t && !(dt == ZPCI_IOTA_RTTO)) {
>
> Is this change part of the patchset? It looks like some other issue. I might be wrong.

Definitely part of this patch. Before, we only allowed a single type of translation request from the guest: the guest needed to be requesting translation (t) and it had to be of a certain type (dt == ZPCI_IOTA_RTTO), else we fail.

Now we allow either that same single translation type (so t && dt == ZPCI_IOTA_RTTO) OR no translation (!t). So it becomes valid to have (!t) but still invalid to have (t) with any other dt value besides dt == ZPCI_IOTA_RTTO, hence the new check.

>>          error_report("unsupported ioat dt %d t %d", dt, t);
>>          s390_program_interrupt(env, PGM_OPERAND, ra);
>>          return -EINVAL;
>> [...]
>> -    s390_pci_iommu_enable(iommu);
>> +    if (t) {
>> +        s390_pci_iommu_enable(iommu);
>> +    } else {
>> +        s390_pci_iommu_dm_enable(iommu);
>> +        /* If not translating, map everything now */
>> +        s390_pci_setup_stage2_map(iommu);
>> +    }
>
> I don't understand how we can enter the "map everything" case. Could you explain a bit more the scenario?

Sure, this actually relates directly to the check I discussed up above...

Before, we would only ever accept a guest MPCIFC instruction of type ZPCI_MOD_FC_REG_IOAT that specified "designation type 1 with translation". All else would be rejected. Now, we accept either the above OR a guest MPCIFC instruction that specifies "no translation" -- that is the case that gets us into the "map everything" scenario.

Patch 2 adds a CLP indication, advertised to the guest, that tells it whether the device group in question supports the "map everything" case, i.e. whether it's safe/allowable to issue an MPCIFC instruction that specifies "no translation" to a device in that group. The referenced kernel series in the cover letter implements the linux guest behavior necessary to exploit this "no translation" case via the optional kernel parameter iommu.passthrough=1.

Hope that helps,
Matt
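[Spelled out, the old and new conditions accept/reject as follows:]

    t  dt               old: !(dt == RTTO && t)   new: t && !(dt == RTTO)
    -  ---------------  ------------------------  ------------------------
    1  ZPCI_IOTA_RTTO   accept                    accept
    1  anything else    reject                    reject
    0  any              reject                    accept (direct mapping)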
On 12/10/24 3:58 AM, David Hildenbrand wrote:
> Maybe there is a reason s390x needs to handle this using memory_region_notify_iommu() callbacks instead of doing it similar to "struct vfio_memory_listener" when registered on &address_space_memory without a viommu.

Hi David,

I think I've sorted out a way to handle this such that, when direct mapping, we use a memory region alias instead, so that vfio can ultimately handle all of the pinning/unpinning in the non-iommu path of vfio_listener_region_add/del, just like it does for other platforms. For s390, the alias is needed to provide the SDMA offset, ensuring that e.g. GPA X maps to iova SDMA+X.

Looks to be working nicely so far with my rework of the associated kernel series -- going to send it as part of v2; would appreciate it if you'd give that a look.

Thanks,
Matt
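[The alias idea described above might look roughly like the following. This is a hypothetical sketch, not the v2 code: the dm_alias field does not exist in this version, and sdma is taken to be iommu->pba here.]

    MachineState *ms = MACHINE(qdev_get_machine());

    /* Expose guest RAM inside the device's DMA window at offset sdma, so
     * that iova sdma+X resolves to GPA X.  vfio's own memory listener then
     * sees ordinary RAM and performs the pinning/unpinning itself. */
    memory_region_init_alias(&iommu->dm_alias, OBJECT(&iommu->mr),
                             "s390-dm-alias", ms->ram, 0, ms->ram_size);
    memory_region_add_subregion(&iommu->mr, iommu->pba, &iommu->dm_alias);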
diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c
index 40b2567aa7..8d4224e032 100644
--- a/hw/s390x/s390-pci-bus.c
+++ b/hw/s390x/s390-pci-bus.c
@@ -18,6 +18,7 @@
 #include "hw/s390x/s390-pci-inst.h"
 #include "hw/s390x/s390-pci-kvm.h"
 #include "hw/s390x/s390-pci-vfio.h"
+#include "hw/boards.h"
 #include "hw/pci/pci_bus.h"
 #include "hw/qdev-properties.h"
 #include "hw/pci/pci_bridge.h"
@@ -720,6 +721,27 @@ void s390_pci_iommu_enable(S390PCIIOMMU *iommu)
                              TYPE_S390_IOMMU_MEMORY_REGION, OBJECT(&iommu->mr),
                              name, iommu->pal + 1);
     iommu->enabled = true;
+    iommu->direct_map = false;
+    memory_region_add_subregion(&iommu->mr, 0, MEMORY_REGION(&iommu->iommu_mr));
+    g_free(name);
+}
+
+void s390_pci_iommu_dm_enable(S390PCIIOMMU *iommu)
+{
+    MachineState *ms = MACHINE(qdev_get_machine());
+
+    /*
+     * For direct-mapping we must map the entire guest address space. Because
+     * the mappings are contiguous we are not restricted to individual 4K
+     * mappings via vfio, so let's not worry about the DMA limit when
+     * calculating the range.
+     */
+    char *name = g_strdup_printf("iommu-s390-%04x", iommu->pbdev->uid);
+    memory_region_init_iommu(&iommu->iommu_mr, sizeof(iommu->iommu_mr),
+                             TYPE_S390_IOMMU_MEMORY_REGION, OBJECT(&iommu->mr),
+                             name, iommu->pba + ms->ram_size);
+    iommu->enabled = true;
+    iommu->direct_map = true;
     memory_region_add_subregion(&iommu->mr, 0, MEMORY_REGION(&iommu->iommu_mr));
     g_free(name);
 }
@@ -727,6 +749,7 @@ void s390_pci_iommu_enable(S390PCIIOMMU *iommu)
 void s390_pci_iommu_disable(S390PCIIOMMU *iommu)
 {
     iommu->enabled = false;
+    iommu->direct_map = false;
     g_hash_table_remove_all(iommu->iotlb);
     memory_region_del_subregion(&iommu->mr, MEMORY_REGION(&iommu->iommu_mr));
     object_unparent(OBJECT(&iommu->iommu_mr));
diff --git a/hw/s390x/s390-pci-inst.c b/hw/s390x/s390-pci-inst.c
index 41655082da..f4d8fe8fe8 100644
--- a/hw/s390x/s390-pci-inst.c
+++ b/hw/s390x/s390-pci-inst.c
@@ -16,6 +16,7 @@
 #include "exec/memory.h"
 #include "qemu/error-report.h"
 #include "sysemu/hw_accel.h"
+#include "hw/boards.h"
 #include "hw/pci/pci_device.h"
 #include "hw/s390x/s390-pci-inst.h"
 #include "hw/s390x/s390-pci-bus.h"
@@ -990,6 +991,33 @@ int pci_dereg_irqs(S390PCIBusDevice *pbdev)
     return 0;
 }
 
+static void s390_pci_setup_stage2_map(S390PCIIOMMU *iommu)
+{
+    MachineState *ms = MACHINE(qdev_get_machine());
+    uint64_t remain = ms->ram_size, start = iommu->pba, mask, size, curr = 0;
+    uint64_t end = start + remain - 1;
+    IOMMUTLBEvent event = {
+        .type = IOMMU_NOTIFIER_MAP,
+        .entry = {
+            .target_as = &address_space_memory,
+            .translated_addr = 0,
+            .perm = IOMMU_RW,
+        },
+    };
+
+    while (remain >= TARGET_PAGE_SIZE) {
+        mask = dma_aligned_pow2_mask(start, end, 64);
+        size = mask + 1;
+        event.entry.iova = start;
+        event.entry.addr_mask = mask;
+        event.entry.translated_addr = curr;
+        memory_region_notify_iommu(&iommu->iommu_mr, 0, event);
+        start += size;
+        curr += size;
+        remain -= size;
+    }
+}
+
 static int reg_ioat(CPUS390XState *env, S390PCIBusDevice *pbdev, ZpciFib fib,
                     uintptr_t ra)
 {
@@ -1008,7 +1036,7 @@ static int reg_ioat(CPUS390XState *env, S390PCIBusDevice *pbdev, ZpciFib fib,
     }
 
     /* currently we only support designation type 1 with translation */
-    if (!(dt == ZPCI_IOTA_RTTO && t)) {
+    if (t && !(dt == ZPCI_IOTA_RTTO)) {
         error_report("unsupported ioat dt %d t %d", dt, t);
         s390_program_interrupt(env, PGM_OPERAND, ra);
         return -EINVAL;
@@ -1018,13 +1046,23 @@ static int reg_ioat(CPUS390XState *env, S390PCIBusDevice *pbdev, ZpciFib fib,
     iommu->pal = pal;
     iommu->g_iota = g_iota;
 
-    s390_pci_iommu_enable(iommu);
+    if (t) {
+        s390_pci_iommu_enable(iommu);
+    } else {
+        s390_pci_iommu_dm_enable(iommu);
+        /* If not translating, map everything now */
+        s390_pci_setup_stage2_map(iommu);
+    }
 
     return 0;
 }
 
 void pci_dereg_ioat(S390PCIIOMMU *iommu)
 {
+    MachineState *ms = MACHINE(qdev_get_machine());
+    if (iommu->direct_map) {
+        s390_pci_batch_unmap(iommu, iommu->pba, ms->ram_size);
+    }
     s390_pci_iommu_disable(iommu);
     iommu->pba = 0;
     iommu->pal = 0;
diff --git a/include/hw/s390x/s390-pci-bus.h b/include/hw/s390x/s390-pci-bus.h
index 2c43ea123f..2aa55e3fd0 100644
--- a/include/hw/s390x/s390-pci-bus.h
+++ b/include/hw/s390x/s390-pci-bus.h
@@ -278,6 +278,7 @@ struct S390PCIIOMMU {
     MemoryRegion mr;
     IOMMUMemoryRegion iommu_mr;
     bool enabled;
+    bool direct_map;
     uint64_t g_iota;
     uint64_t pba;
     uint64_t pal;
@@ -389,6 +390,7 @@ int pci_chsc_sei_nt2_have_event(void);
 void s390_pci_sclp_configure(SCCB *sccb);
 void s390_pci_sclp_deconfigure(SCCB *sccb);
 void s390_pci_iommu_enable(S390PCIIOMMU *iommu);
+void s390_pci_iommu_dm_enable(S390PCIIOMMU *iommu);
 void s390_pci_iommu_disable(S390PCIIOMMU *iommu);
 void s390_pci_generate_error_event(uint16_t pec, uint32_t fh, uint32_t fid,
                                    uint64_t faddr, uint32_t e);
When receiving a guest mpcifc(4) or mpcifc(6) instruction without the T
bit set, treat this as a request to perform direct mapping instead of
address translation. In order to facilitiate this, pin the entirety of
guest memory into the host iommu.

Subsequent guest DMA operations are all expected to be of the format
guest_phys+sdma, allowing them to be used as lookup into the host
iommu table.

Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
---
 hw/s390x/s390-pci-bus.c         | 23 ++++++++++++++++++
 hw/s390x/s390-pci-inst.c        | 42 +++++++++++++++++++++++++++++++--
 include/hw/s390x/s390-pci-bus.h |  2 ++
 3 files changed, 65 insertions(+), 2 deletions(-)