Message ID | 20240117091559.144730-5-peterx@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | intel_iommu: Reset vIOMMU after all the rest of devices | expand |
Hi Peter, On 1/17/24 10:15, peterx@redhat.com wrote: > From: Peter Xu <peterx@redhat.com> > > We got report from Yanghang Liu on an unexpected host DMA error when system > resets with VFIO attached to vIOMMU in the VM context. Alex Williamson > quickly spot that there can be ordering issues on resets. A further test > verified that the issue is indeed caused by such wrong ordering. nit: not sure the commit msg should contain people info (cover letter does it already + credits below) > > vIOMMU is a fundamentally infrustructural device, its reset is currently infrastructural > problematic because no ordering is guaranteed against other PCI devices > which may DMA through the vIOMMU device. > > The reset order is tricky, not only because it's current representation as s/it's/its > a normal "-device" (so it kind of follow the qdev tree depth-first reset, > but at a wrong place in the qtree; ideally it should be the parent > somewhere for all pci buses, or just part of pci host bridge), but also > because customized device reset hooks registered over the system reset > framework, so that the ordering of the vIOMMU reset is not guaranteed. > > For example, VFIO can register its reset hook with vfio_reset_handler() if > some device does not support FLR. That will not so far follow the > depth-first travelsal reset mechanism provided by QEMU reset framework. traversal > > To remedy both of the issues with limited code changes, leverage the newly > introduced reset stage framework to reset vIOMMUs at the last stage of the > rest devices. More information can be found in the comments in the patch, > which I decided to persist even with the code to make the problem even > clearer (with potential TODOs for the future, if possible). > > Buglink: https://issues.redhat.com/browse/RHEL-7188 > Analyzed-by: Alex Williamson <alex.williamson@redhat.com> > Reported-by: YangHang Liu <yanghliu@redhat.com> > Signed-off-by: Peter Xu <peterx@redhat.com> > --- > hw/i386/intel_iommu.c | 54 +++++++++++++++++++++++++++++++++++++++++-- > 1 file changed, 52 insertions(+), 2 deletions(-) > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c > index 8b467cbbd2..5a8fbcad7a 100644 > --- a/hw/i386/intel_iommu.c > +++ b/hw/i386/intel_iommu.c > @@ -35,6 +35,7 @@ > #include "sysemu/kvm.h" > #include "sysemu/dma.h" > #include "sysemu/sysemu.h" > +#include "sysemu/reset.h" > #include "hw/i386/apic_internal.h" > #include "kvm/kvm_i386.h" > #include "migration/vmstate.h" > @@ -4086,7 +4087,7 @@ static void vtd_init(IntelIOMMUState *s) > /* Should not reset address_spaces when reset because devices will still use > * the address space they got at first (won't ask the bus again). > */ > -static void vtd_reset(DeviceState *dev) > +static void vtd_reset(void *dev) > { > IntelIOMMUState *s = INTEL_IOMMU_DEVICE(dev); > > @@ -4242,7 +4243,6 @@ static void vtd_class_init(ObjectClass *klass, void *data) > DeviceClass *dc = DEVICE_CLASS(klass); > X86IOMMUClass *x86_class = X86_IOMMU_DEVICE_CLASS(klass); > > - dc->reset = vtd_reset; > dc->vmsd = &vtd_vmstate; > device_class_set_props(dc, vtd_properties); > dc->hotpluggable = false; > @@ -4254,10 +4254,60 @@ static void vtd_class_init(ObjectClass *klass, void *data) > dc->desc = "Intel IOMMU (VT-d) DMA Remapping device"; > } > > +static void vtd_instance_init(Object *obj) > +{ > + IntelIOMMUState *s = INTEL_IOMMU_DEVICE(obj); > + > + /* > + * vIOMMU reset may require proper ordering with other devices. There > + * are two complexities so that normal DeviceState.reset() may not > + * work properly for vIOMMUs: > + * > + * (1) Device depth-first reset hierachy doesn't yet work for vIOMMUs > + * (reference: resettable_cold_reset_fn()) > + * > + * Currently, vIOMMU devices are created as normal '-device' > + * cmdlines. It means in many ways it has the same attributes with s/with/as > + * most of the rest devices, even if the rest devices should s/rest/other > + * logically be under control of the vIOMMU unit. > + * > + * One side effect of it is vIOMMU devices will be currently put > + * randomly under qdev tree hierarchy, which is the source of > + * device reset ordering in current QEMU (depth-first traversal). > + * It means vIOMMU now can be reset before some devices. For fully > + * emulated devices that's not a problem, because the traversal > + * holds BQL for the whole process. However it is a problem if DMA > + * can happen without BQL, like VFIO, vDPA or remote device process. > + * > + * TODO: one ideal solution can be that we make vIOMMU the parent > + * of the whole pci host bridge. Hence vIOMMU can be reset after > + * all the devices are reset and quiesced. in hw/pci/pci.c there is also the info iommu_bus? When resetting a pci device know whether it is attached to a viommu. I don't know if we could plug the reset priority mechanism at this level. > + * > + * (2) Some devices register its own reset functions > + * > + * Even if above issue solved, if devices register its own reset s/its/their > + * functions for some reason via QEMU reset hooks, vIOMMU can still > + * be reset before the device. One example is vfio_reset_handler() > + * where FLR is not supported on the device. > + * > + * TODO: merge relevant reset functions into the device tree reset > + * framework. > + * > + * Neither of the above TODO may be trivial. To make it work for now, > + * leverage reset stages and reset vIOMMU always at latter stage of the > + * default. It means it needs to be reset after at least: > + * > + * - resettable_cold_reset_fn(): machine qdev tree reset > + * - vfio_reset_handler(): VFIO reset for !FLR > + */ > + qemu_register_reset_one(vtd_reset, s, false, 1); introducing enum values may be better ( just as we have for migration prio) > +} > + > static const TypeInfo vtd_info = { > .name = TYPE_INTEL_IOMMU_DEVICE, > .parent = TYPE_X86_IOMMU_DEVICE, > .instance_size = sizeof(IntelIOMMUState), > + .instance_init = vtd_instance_init, > .class_init = vtd_class_init, > }; > Thanks Eric
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c index 8b467cbbd2..5a8fbcad7a 100644 --- a/hw/i386/intel_iommu.c +++ b/hw/i386/intel_iommu.c @@ -35,6 +35,7 @@ #include "sysemu/kvm.h" #include "sysemu/dma.h" #include "sysemu/sysemu.h" +#include "sysemu/reset.h" #include "hw/i386/apic_internal.h" #include "kvm/kvm_i386.h" #include "migration/vmstate.h" @@ -4086,7 +4087,7 @@ static void vtd_init(IntelIOMMUState *s) /* Should not reset address_spaces when reset because devices will still use * the address space they got at first (won't ask the bus again). */ -static void vtd_reset(DeviceState *dev) +static void vtd_reset(void *dev) { IntelIOMMUState *s = INTEL_IOMMU_DEVICE(dev); @@ -4242,7 +4243,6 @@ static void vtd_class_init(ObjectClass *klass, void *data) DeviceClass *dc = DEVICE_CLASS(klass); X86IOMMUClass *x86_class = X86_IOMMU_DEVICE_CLASS(klass); - dc->reset = vtd_reset; dc->vmsd = &vtd_vmstate; device_class_set_props(dc, vtd_properties); dc->hotpluggable = false; @@ -4254,10 +4254,60 @@ static void vtd_class_init(ObjectClass *klass, void *data) dc->desc = "Intel IOMMU (VT-d) DMA Remapping device"; } +static void vtd_instance_init(Object *obj) +{ + IntelIOMMUState *s = INTEL_IOMMU_DEVICE(obj); + + /* + * vIOMMU reset may require proper ordering with other devices. There + * are two complexities so that normal DeviceState.reset() may not + * work properly for vIOMMUs: + * + * (1) Device depth-first reset hierachy doesn't yet work for vIOMMUs + * (reference: resettable_cold_reset_fn()) + * + * Currently, vIOMMU devices are created as normal '-device' + * cmdlines. It means in many ways it has the same attributes with + * most of the rest devices, even if the rest devices should + * logically be under control of the vIOMMU unit. + * + * One side effect of it is vIOMMU devices will be currently put + * randomly under qdev tree hierarchy, which is the source of + * device reset ordering in current QEMU (depth-first traversal). + * It means vIOMMU now can be reset before some devices. For fully + * emulated devices that's not a problem, because the traversal + * holds BQL for the whole process. However it is a problem if DMA + * can happen without BQL, like VFIO, vDPA or remote device process. + * + * TODO: one ideal solution can be that we make vIOMMU the parent + * of the whole pci host bridge. Hence vIOMMU can be reset after + * all the devices are reset and quiesced. + * + * (2) Some devices register its own reset functions + * + * Even if above issue solved, if devices register its own reset + * functions for some reason via QEMU reset hooks, vIOMMU can still + * be reset before the device. One example is vfio_reset_handler() + * where FLR is not supported on the device. + * + * TODO: merge relevant reset functions into the device tree reset + * framework. + * + * Neither of the above TODO may be trivial. To make it work for now, + * leverage reset stages and reset vIOMMU always at latter stage of the + * default. It means it needs to be reset after at least: + * + * - resettable_cold_reset_fn(): machine qdev tree reset + * - vfio_reset_handler(): VFIO reset for !FLR + */ + qemu_register_reset_one(vtd_reset, s, false, 1); +} + static const TypeInfo vtd_info = { .name = TYPE_INTEL_IOMMU_DEVICE, .parent = TYPE_X86_IOMMU_DEVICE, .instance_size = sizeof(IntelIOMMUState), + .instance_init = vtd_instance_init, .class_init = vtd_class_init, };