Message ID: 1482158486-18597-1-git-send-email-peterx@redhat.com (mailing list archive)
State: New, archived
On Mon, 19 Dec 2016 22:41:26 +0800 Peter Xu <peterx@redhat.com> wrote:

> This is preparation work to finally enable dynamic switching ON/OFF for
> VT-d protection. The old VT-d code uses a static IOMMU region, and
> that won't satisfy vfio-pci device listeners.
>
> Let me explain.
>
> vfio-pci devices depend on the memory region listener and IOMMU replay
> mechanism to make sure the device mapping is coherent with the guest
> even if there are domain switches. And there are two kinds of domain
> switches:
>
> (1) switch from domain A -> B
> (2) switch from domain A -> no domain (e.g., turn DMAR off)
>
> Case (1) is handled by the context entry invalidation handling in the
> VT-d replay logic. What the replay function should do here is to replay
> the existing page mappings in domain B.
>
> However for case (2), we don't want to replay any domain mappings - we
> just need the default GPA->HPA mappings (the address_space_memory
> mapping). This patch handles case (2) by building up the mapping
> automatically, leveraging the vfio-pci memory listeners.
>
> Another important thing this patch does is separate IR (Interrupt
> Remapping) from DMAR (DMA Remapping). The IR region should not depend
> on the DMAR region (as it did before this patch). It should be a
> standalone region, and it should be able to be activated without
> DMAR (which is common behavior for the Linux kernel - by default it
> enables IR while leaving DMAR disabled).

This seems like an improvement, but I will note that there are existing
locked memory accounting issues inherent with VT-d and vfio. With VT-d,
each device has a unique AddressSpace. This requires that each is
managed via a separate vfio container. Each container is accounted for
separately for locked pages. libvirt currently only knows that if any
vfio devices are attached, the locked memory limit for the process
needs to be set sufficient for the VM memory.
When VT-d is involved, we either need to figure out how to associate otherwise independent vfio containers to share locked page accounting or teach libvirt that the locked memory requirement needs to be multiplied by the number of attached vfio devices. The latter seems far less complicated but reduces the containment of QEMU a bit since the process has the ability to lock potentially many multiples of the VM address size. Thanks, Alex > Signed-off-by: Peter Xu <peterx@redhat.com> > --- > hw/i386/intel_iommu.c | 75 ++++++++++++++++++++++++++++++++++++++++--- > hw/i386/trace-events | 3 ++ > include/hw/i386/intel_iommu.h | 2 ++ > 3 files changed, 76 insertions(+), 4 deletions(-) > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c > index 5f3e351..75a3f4e 100644 > --- a/hw/i386/intel_iommu.c > +++ b/hw/i386/intel_iommu.c > @@ -1179,9 +1179,42 @@ static void vtd_handle_gcmd_sirtp(IntelIOMMUState *s) > vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_IRTPS); > } > > +static void vtd_switch_address_space(IntelIOMMUState *s, bool enabled) > +{ > + GHashTableIter iter; > + VTDBus *vtd_bus; > + VTDAddressSpace *as; > + int i; > + > + g_hash_table_iter_init(&iter, s->vtd_as_by_busptr); > + while (g_hash_table_iter_next (&iter, NULL, (void**)&vtd_bus)) { > + for (i = 0; i < X86_IOMMU_PCI_DEVFN_MAX; i++) { > + as = vtd_bus->dev_as[i]; > + if (as == NULL) { > + continue; > + } > + trace_vtd_switch_address_space(pci_bus_num(vtd_bus->bus), > + VTD_PCI_SLOT(i), VTD_PCI_FUNC(i), > + enabled); > + if (enabled) { > + memory_region_add_subregion_overlap(&as->root, 0, > + &as->iommu, 2); > + } else { > + memory_region_del_subregion(&as->root, &as->iommu); > + } > + } > + } > +} > + > /* Handle Translation Enable/Disable */ > static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en) > { > + bool old = s->dmar_enabled; > + > + if (old == en) { > + return; > + } > + > VTD_DPRINTF(CSR, "Translation Enable %s", (en ? 
"on" : "off")); > > if (en) { > @@ -1196,6 +1229,8 @@ static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en) > /* Ok - report back to driver */ > vtd_set_clear_mask_long(s, DMAR_GSTS_REG, VTD_GSTS_TES, 0); > } > + > + vtd_switch_address_space(s, en); > } > > /* Handle Interrupt Remap Enable/Disable */ > @@ -2343,15 +2378,47 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn) > vtd_dev_as->devfn = (uint8_t)devfn; > vtd_dev_as->iommu_state = s; > vtd_dev_as->context_cache_entry.context_cache_gen = 0; > + > + /* > + * When DMAR is disabled, memory region relationships look > + * like: > + * > + * 0000000000000000-ffffffffffffffff (prio 0, RW): vtd_root > + * 0000000000000000-ffffffffffffffff (prio 1, RW): vtd_sys_alias > + * 00000000fee00000-00000000feefffff (prio 64, RW): intel_iommu_ir > + * > + * When DMAR is enabled, it becomes: > + * > + * 0000000000000000-ffffffffffffffff (prio 0, RW): vtd_root > + * 0000000000000000-ffffffffffffffff (prio 2, RW): intel_iommu > + * 0000000000000000-ffffffffffffffff (prio 1, RW): vtd_sys_alias > + * 00000000fee00000-00000000feefffff (prio 64, RW): intel_iommu_ir > + * > + * The intel_iommu region is dynamically added/removed. 
> + */ > memory_region_init_iommu(&vtd_dev_as->iommu, OBJECT(s), > &s->iommu_ops, "intel_iommu", UINT64_MAX); > + memory_region_init_alias(&vtd_dev_as->sys_alias, OBJECT(s), > + "vtd_sys_alias", get_system_memory(), > + 0, memory_region_size(get_system_memory())); > memory_region_init_io(&vtd_dev_as->iommu_ir, OBJECT(s), > &vtd_mem_ir_ops, s, "intel_iommu_ir", > VTD_INTERRUPT_ADDR_SIZE); > - memory_region_add_subregion(&vtd_dev_as->iommu, VTD_INTERRUPT_ADDR_FIRST, > - &vtd_dev_as->iommu_ir); > - address_space_init(&vtd_dev_as->as, > - &vtd_dev_as->iommu, "intel_iommu"); > + memory_region_init(&vtd_dev_as->root, OBJECT(s), > + "vtd_root", UINT64_MAX); > + memory_region_add_subregion_overlap(&vtd_dev_as->root, > + VTD_INTERRUPT_ADDR_FIRST, > + &vtd_dev_as->iommu_ir, 64); > + address_space_init(&vtd_dev_as->as, &vtd_dev_as->root, name); > + memory_region_add_subregion_overlap(&vtd_dev_as->root, 0, > + &vtd_dev_as->sys_alias, 1); > + if (s->dmar_enabled) { > + memory_region_add_subregion_overlap(&vtd_dev_as->root, 0, > + &vtd_dev_as->iommu, 2); > + } > + trace_vtd_switch_address_space(pci_bus_num(vtd_bus->bus), > + VTD_PCI_SLOT(devfn), VTD_PCI_FUNC(devfn), > + s->dmar_enabled); > } > return vtd_dev_as; > } > diff --git a/hw/i386/trace-events b/hw/i386/trace-events > index d2b4973..aee93bb 100644 > --- a/hw/i386/trace-events > +++ b/hw/i386/trace-events > @@ -10,6 +10,9 @@ xen_pv_mmio_write(uint64_t addr) "WARNING: write to Xen PV Device MMIO space (ad > # hw/i386/x86-iommu.c > x86_iommu_iec_notify(bool global, uint32_t index, uint32_t mask) "Notify IEC invalidation: global=%d index=%" PRIu32 " mask=%" PRIu32 > > +# hw/i386/intel_iommu.c > +vtd_switch_address_space(uint8_t bus, uint8_t slot, uint8_t fn, bool on) "Device %02x:%02x.%x switching address space (iommu enabled=%d)" > + > # hw/i386/amd_iommu.c > amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" + offset 0x%"PRIx32 > amdvi_cache_update(uint16_t domid, uint8_t bus, 
uint8_t slot, uint8_t func, uint64_t gpa, uint64_t txaddr) " update iotlb domid 0x%"PRIx16" devid: %02x:%02x.%x gpa 0x%"PRIx64" hpa 0x%"PRIx64 > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h > index 405c9d1..85c1b9b 100644 > --- a/include/hw/i386/intel_iommu.h > +++ b/include/hw/i386/intel_iommu.h > @@ -83,6 +83,8 @@ struct VTDAddressSpace { > uint8_t devfn; > AddressSpace as; > MemoryRegion iommu; > + MemoryRegion root; > + MemoryRegion sys_alias; > MemoryRegion iommu_ir; /* Interrupt region: 0xfeeXXXXX */ > IntelIOMMUState *iommu_state; > VTDContextCacheEntry context_cache_entry;
On Mon, Dec 19, 2016 at 10:41:26PM +0800, Peter Xu wrote:

> This is preparation work to finally enable dynamic switching ON/OFF for
> VT-d protection. The old VT-d code uses a static IOMMU region, and
> that won't satisfy vfio-pci device listeners.
>
> Let me explain.
>
> vfio-pci devices depend on the memory region listener and IOMMU replay
> mechanism to make sure the device mapping is coherent with the guest
> even if there are domain switches. And there are two kinds of domain
> switches:
>
> (1) switch from domain A -> B
> (2) switch from domain A -> no domain (e.g., turn DMAR off)
>
> Case (1) is handled by the context entry invalidation handling in the
> VT-d replay logic. What the replay function should do here is to replay
> the existing page mappings in domain B.
>
> However for case (2), we don't want to replay any domain mappings - we
> just need the default GPA->HPA mappings (the address_space_memory
> mapping). This patch handles case (2) by building up the mapping
> automatically, leveraging the vfio-pci memory listeners.
>
> Another important thing this patch does is separate IR (Interrupt
> Remapping) from DMAR (DMA Remapping). The IR region should not depend
> on the DMAR region (as it did before this patch). It should be a
> standalone region, and it should be able to be activated without
> DMAR (which is common behavior for the Linux kernel - by default it
> enables IR while leaving DMAR disabled).
> > Signed-off-by: Peter Xu <peterx@redhat.com> > --- > hw/i386/intel_iommu.c | 75 ++++++++++++++++++++++++++++++++++++++++--- > hw/i386/trace-events | 3 ++ > include/hw/i386/intel_iommu.h | 2 ++ > 3 files changed, 76 insertions(+), 4 deletions(-) > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c > index 5f3e351..75a3f4e 100644 > --- a/hw/i386/intel_iommu.c > +++ b/hw/i386/intel_iommu.c > @@ -1179,9 +1179,42 @@ static void vtd_handle_gcmd_sirtp(IntelIOMMUState *s) > vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_IRTPS); > } > > +static void vtd_switch_address_space(IntelIOMMUState *s, bool enabled) > +{ > + GHashTableIter iter; > + VTDBus *vtd_bus; > + VTDAddressSpace *as; > + int i; > + > + g_hash_table_iter_init(&iter, s->vtd_as_by_busptr); > + while (g_hash_table_iter_next (&iter, NULL, (void**)&vtd_bus)) { > + for (i = 0; i < X86_IOMMU_PCI_DEVFN_MAX; i++) { > + as = vtd_bus->dev_as[i]; > + if (as == NULL) { > + continue; > + } > + trace_vtd_switch_address_space(pci_bus_num(vtd_bus->bus), > + VTD_PCI_SLOT(i), VTD_PCI_FUNC(i), > + enabled); > + if (enabled) { > + memory_region_add_subregion_overlap(&as->root, 0, > + &as->iommu, 2); > + } else { > + memory_region_del_subregion(&as->root, &as->iommu); Why not use memory_region_set_enabled() rather than actually adding/deleting the subregion? > + } > + } > + } > +} > + > /* Handle Translation Enable/Disable */ > static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en) > { > + bool old = s->dmar_enabled; > + > + if (old == en) { > + return; > + } > + > VTD_DPRINTF(CSR, "Translation Enable %s", (en ? 
"on" : "off")); > > if (en) { > @@ -1196,6 +1229,8 @@ static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en) > /* Ok - report back to driver */ > vtd_set_clear_mask_long(s, DMAR_GSTS_REG, VTD_GSTS_TES, 0); > } > + > + vtd_switch_address_space(s, en); > } > > /* Handle Interrupt Remap Enable/Disable */ > @@ -2343,15 +2378,47 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn) > vtd_dev_as->devfn = (uint8_t)devfn; > vtd_dev_as->iommu_state = s; > vtd_dev_as->context_cache_entry.context_cache_gen = 0; > + > + /* > + * When DMAR is disabled, memory region relationships looks > + * like: > + * > + * 0000000000000000-ffffffffffffffff (prio 0, RW): vtd_root > + * 0000000000000000-ffffffffffffffff (prio 1, RW): vtd_sys_alias > + * 00000000fee00000-00000000feefffff (prio 64, RW): intel_iommu_ir > + * > + * When DMAR is disabled, it becomes: > + * > + * 0000000000000000-ffffffffffffffff (prio 0, RW): vtd_root > + * 0000000000000000-ffffffffffffffff (prio 2, RW): intel_iommu > + * 0000000000000000-ffffffffffffffff (prio 1, RW): vtd_sys_alias > + * 00000000fee00000-00000000feefffff (prio 64, RW): intel_iommu_ir > + * > + * The intel_iommu region is dynamically added/removed. > + */ > memory_region_init_iommu(&vtd_dev_as->iommu, OBJECT(s), > &s->iommu_ops, "intel_iommu", UINT64_MAX); I'm almost certain UINT64_MAX is wrong here. For one thing it would collide with PCI BARs. For another, I can't imagine that the IOMMU page tables can really span an entire 2^64 space. > + memory_region_init_alias(&vtd_dev_as->sys_alias, OBJECT(s), > + "vtd_sys_alias", get_system_memory(), > + 0, memory_region_size(get_system_memory())); I strongly suspect using memory_region_size(get_system_memory()) is also incorrect here. System memory has size UINT64_MAX, but I'll bet you you can't actually access all of that via PCI space (again, it would collide with actual PCI BARs). I also suspect you can't reach CPU MMIO regions via the PCI DMA space. 
So, I think you should find out what this limit actually is and restrict the alias to that window. > memory_region_init_io(&vtd_dev_as->iommu_ir, OBJECT(s), > &vtd_mem_ir_ops, s, "intel_iommu_ir", > VTD_INTERRUPT_ADDR_SIZE); > - memory_region_add_subregion(&vtd_dev_as->iommu, VTD_INTERRUPT_ADDR_FIRST, > - &vtd_dev_as->iommu_ir); > - address_space_init(&vtd_dev_as->as, > - &vtd_dev_as->iommu, "intel_iommu"); > + memory_region_init(&vtd_dev_as->root, OBJECT(s), > + "vtd_root", UINT64_MAX); > + memory_region_add_subregion_overlap(&vtd_dev_as->root, > + VTD_INTERRUPT_ADDR_FIRST, > + &vtd_dev_as->iommu_ir, 64); > + address_space_init(&vtd_dev_as->as, &vtd_dev_as->root, name); > + memory_region_add_subregion_overlap(&vtd_dev_as->root, 0, > + &vtd_dev_as->sys_alias, 1); > + if (s->dmar_enabled) { > + memory_region_add_subregion_overlap(&vtd_dev_as->root, 0, > + &vtd_dev_as->iommu, 2); > + } Hmm. You have the IOMMU translated region overlaying the direct-mapped alias. You enable and disable the IOMMU subregion, but you always leave the direct mapping enabled. You might get away with this because the size of the IOMMU region is UINT64_MAX, which will overlay everything - but as above, I think that's wrong. If that's changed then guest devices may be able to access portions of the raw address space outside the IOMMU mapped region, which could break the guest's expectations of device isolation. I think it would be much safer to disable the system memory alias when the IOMMU is enabled. 
> + trace_vtd_switch_address_space(pci_bus_num(vtd_bus->bus), > + VTD_PCI_SLOT(devfn), VTD_PCI_FUNC(devfn), > + s->dmar_enabled); > } > return vtd_dev_as; > } > diff --git a/hw/i386/trace-events b/hw/i386/trace-events > index d2b4973..aee93bb 100644 > --- a/hw/i386/trace-events > +++ b/hw/i386/trace-events > @@ -10,6 +10,9 @@ xen_pv_mmio_write(uint64_t addr) "WARNING: write to Xen PV Device MMIO space (ad > # hw/i386/x86-iommu.c > x86_iommu_iec_notify(bool global, uint32_t index, uint32_t mask) "Notify IEC invalidation: global=%d index=%" PRIu32 " mask=%" PRIu32 > > +# hw/i386/intel_iommu.c > +vtd_switch_address_space(uint8_t bus, uint8_t slot, uint8_t fn, bool on) "Device %02x:%02x.%x switching address space (iommu enabled=%d)" > + > # hw/i386/amd_iommu.c > amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" + offset 0x%"PRIx32 > amdvi_cache_update(uint16_t domid, uint8_t bus, uint8_t slot, uint8_t func, uint64_t gpa, uint64_t txaddr) " update iotlb domid 0x%"PRIx16" devid: %02x:%02x.%x gpa 0x%"PRIx64" hpa 0x%"PRIx64 > diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h > index 405c9d1..85c1b9b 100644 > --- a/include/hw/i386/intel_iommu.h > +++ b/include/hw/i386/intel_iommu.h > @@ -83,6 +83,8 @@ struct VTDAddressSpace { > uint8_t devfn; > AddressSpace as; > MemoryRegion iommu; > + MemoryRegion root; > + MemoryRegion sys_alias; > MemoryRegion iommu_ir; /* Interrupt region: 0xfeeXXXXX */ > IntelIOMMUState *iommu_state; > VTDContextCacheEntry context_cache_entry;
On Mon, Dec 19, 2016 at 09:56:50AM -0700, Alex Williamson wrote: > On Mon, 19 Dec 2016 22:41:26 +0800 > Peter Xu <peterx@redhat.com> wrote: > > > This is preparation work to finally enabled dynamic switching ON/OFF for > > VT-d protection. The old VT-d codes is using static IOMMU region, and > > that won't satisfy vfio-pci device listeners. > > > > Let me explain. > > > > vfio-pci devices depend on the memory region listener and IOMMU replay > > mechanism to make sure the device mapping is coherent with the guest > > even if there are domain switches. And there are two kinds of domain > > switches: > > > > (1) switch from domain A -> B > > (2) switch from domain A -> no domain (e.g., turn DMAR off) > > > > Case (1) is handled by the context entry invalidation handling by the > > VT-d replay logic. What the replay function should do here is to replay > > the existing page mappings in domain B. > > > > However for case (2), we don't want to replay any domain mappings - we > > just need the default GPA->HPA mappings (the address_space_memory > > mapping). And this patch helps on case (2) to build up the mapping > > automatically by leveraging the vfio-pci memory listeners. > > > > Another important thing that this patch does is to seperate > > IR (Interrupt Remapping) from DMAR (DMA Remapping). IR region should not > > depend on the DMAR region (like before this patch). It should be a > > standalone region, and it should be able to be activated without > > DMAR (which is a common behavior of Linux kernel - by default it enables > > IR while disabled DMAR). > > > This seems like an improvement, but I will note that there are existing > locked memory accounting issues inherent with VT-d and vfio. With > VT-d, each device has a unique AddressSpace. This requires that each > is managed via a separate vfio container. Each container is accounted > for separately for locked pages. 
> libvirt currently only knows that if
> any vfio devices are attached that the locked memory limit for the
> process needs to be set sufficient for the VM memory. When VT-d is
> involved, we either need to figure out how to associate otherwise
> independent vfio containers to share locked page accounting or teach
> libvirt that the locked memory requirement needs to be multiplied by
> the number of attached vfio devices. The latter seems far less
> complicated but reduces the containment of QEMU a bit since the
> process has the ability to lock potentially many multiples of the VM
> address size. Thanks,

Yes, this patch just tried to move VT-d forward a bit, rather than do
it once and for all. I think we can do better than this in the future,
for example, one address space per guest IOMMU domain (as you have
mentioned before). However, I suppose that will need more work (I still
can't estimate how much). So I am considering enabling device
assignment functionally first; then we can further improve based on a
workable version. The same thoughts apply to the IOMMU replay RFC
series.

Regarding the locked memory accounting issue: do we have an existing
way to do the accounting? If so, would you (or anyone) please elaborate
a bit? If not, is that ongoing/planned work?

Thanks,

-- peterx
On Tue, Dec 20, 2016 at 10:30:12AM +1100, David Gibson wrote: [...] > > +static void vtd_switch_address_space(IntelIOMMUState *s, bool enabled) > > +{ > > + GHashTableIter iter; > > + VTDBus *vtd_bus; > > + VTDAddressSpace *as; > > + int i; > > + > > + g_hash_table_iter_init(&iter, s->vtd_as_by_busptr); > > + while (g_hash_table_iter_next (&iter, NULL, (void**)&vtd_bus)) { > > + for (i = 0; i < X86_IOMMU_PCI_DEVFN_MAX; i++) { > > + as = vtd_bus->dev_as[i]; > > + if (as == NULL) { > > + continue; > > + } > > + trace_vtd_switch_address_space(pci_bus_num(vtd_bus->bus), > > + VTD_PCI_SLOT(i), VTD_PCI_FUNC(i), > > + enabled); > > + if (enabled) { > > + memory_region_add_subregion_overlap(&as->root, 0, > > + &as->iommu, 2); > > + } else { > > + memory_region_del_subregion(&as->root, &as->iommu); > > Why not use memory_region_set_enabled() rather than actually > adding/deleting the subregion? Good idea, thanks. :-) [...] > > @@ -2343,15 +2378,47 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn) > > vtd_dev_as->devfn = (uint8_t)devfn; > > vtd_dev_as->iommu_state = s; > > vtd_dev_as->context_cache_entry.context_cache_gen = 0; > > + > > + /* > > + * When DMAR is disabled, memory region relationships looks > > + * like: > > + * > > + * 0000000000000000-ffffffffffffffff (prio 0, RW): vtd_root > > + * 0000000000000000-ffffffffffffffff (prio 1, RW): vtd_sys_alias > > + * 00000000fee00000-00000000feefffff (prio 64, RW): intel_iommu_ir > > + * > > + * When DMAR is disabled, it becomes: > > + * > > + * 0000000000000000-ffffffffffffffff (prio 0, RW): vtd_root > > + * 0000000000000000-ffffffffffffffff (prio 2, RW): intel_iommu > > + * 0000000000000000-ffffffffffffffff (prio 1, RW): vtd_sys_alias > > + * 00000000fee00000-00000000feefffff (prio 64, RW): intel_iommu_ir > > + * > > + * The intel_iommu region is dynamically added/removed. 
> > +     */
> >      memory_region_init_iommu(&vtd_dev_as->iommu, OBJECT(s),
> >                               &s->iommu_ops, "intel_iommu", UINT64_MAX);
>
> I'm almost certain UINT64_MAX is wrong here.  For one thing it would
> collide with PCI BARs.  For another, I can't imagine that the IOMMU
> page tables can really span an entire 2^64 space.

Could you explain why the device address space has anything to do with
PCI BARs? I thought BARs were for the CPU address space only (so that
the CPU can access PCI registers via MMIO), am I wrong?

I think we should have a big enough IOMMU region size here. If a device
writes to an invalid address, IMHO we should trap it and report to the
guest. If we have a size smaller than UINT64_MAX, how can we trap such
accesses and report them for the whole address space (it should cover
[0, 2^64-1])?

> > +    memory_region_init_alias(&vtd_dev_as->sys_alias, OBJECT(s),
> > +                             "vtd_sys_alias", get_system_memory(),
> > +                             0, memory_region_size(get_system_memory()));
>
> I strongly suspect using memory_region_size(get_system_memory()) is
> also incorrect here.  System memory has size UINT64_MAX, but I'll bet
> you you can't actually access all of that via PCI space (again, it
> would collide with actual PCI BARs).  I also suspect you can't reach
> CPU MMIO regions via the PCI DMA space.

Hmm, sounds correct. However, if so, won't we have the same problem
without an IOMMU? See pci_device_iommu_address_space() -
address_space_memory will be the default if we have no IOMMU
protection, and that will cover e.g. CPU MMIO regions as well.

> So, I think you should find out what this limit actually is and
> restrict the alias to that window.

/me needs some more reading to figure this out. Still not quite
familiar with the whole VM memory region model. Hints are welcome...
> > > memory_region_init_io(&vtd_dev_as->iommu_ir, OBJECT(s), > > &vtd_mem_ir_ops, s, "intel_iommu_ir", > > VTD_INTERRUPT_ADDR_SIZE); > > - memory_region_add_subregion(&vtd_dev_as->iommu, VTD_INTERRUPT_ADDR_FIRST, > > - &vtd_dev_as->iommu_ir); > > - address_space_init(&vtd_dev_as->as, > > - &vtd_dev_as->iommu, "intel_iommu"); > > + memory_region_init(&vtd_dev_as->root, OBJECT(s), > > + "vtd_root", UINT64_MAX); > > + memory_region_add_subregion_overlap(&vtd_dev_as->root, > > + VTD_INTERRUPT_ADDR_FIRST, > > + &vtd_dev_as->iommu_ir, 64); > > + address_space_init(&vtd_dev_as->as, &vtd_dev_as->root, name); > > + memory_region_add_subregion_overlap(&vtd_dev_as->root, 0, > > + &vtd_dev_as->sys_alias, 1); > > + if (s->dmar_enabled) { > > + memory_region_add_subregion_overlap(&vtd_dev_as->root, 0, > > + &vtd_dev_as->iommu, 2); > > + } > > Hmm. You have the IOMMU translated region overlaying the > direct-mapped alias. You enable and disable the IOMMU subregion, but > you always leave the direct mapping enabled. You might get away with > this because the size of the IOMMU region is UINT64_MAX, which will > overlay everything - but as above, I think that's wrong. If that's > changed then guest devices may be able to access portions of the raw > address space outside the IOMMU mapped region, which could break the > guest's expectations of device isolation. > > I think it would be much safer to disable the system memory alias when > the IOMMU is enabled. Reasonable. Will adopt. Thanks! -- peterx
On Tue, 20 Dec 2016 11:44:41 +0800 Peter Xu <peterx@redhat.com> wrote: > On Mon, Dec 19, 2016 at 09:56:50AM -0700, Alex Williamson wrote: > > On Mon, 19 Dec 2016 22:41:26 +0800 > > Peter Xu <peterx@redhat.com> wrote: > > > > > This is preparation work to finally enabled dynamic switching ON/OFF for > > > VT-d protection. The old VT-d codes is using static IOMMU region, and > > > that won't satisfy vfio-pci device listeners. > > > > > > Let me explain. > > > > > > vfio-pci devices depend on the memory region listener and IOMMU replay > > > mechanism to make sure the device mapping is coherent with the guest > > > even if there are domain switches. And there are two kinds of domain > > > switches: > > > > > > (1) switch from domain A -> B > > > (2) switch from domain A -> no domain (e.g., turn DMAR off) > > > > > > Case (1) is handled by the context entry invalidation handling by the > > > VT-d replay logic. What the replay function should do here is to replay > > > the existing page mappings in domain B. > > > > > > However for case (2), we don't want to replay any domain mappings - we > > > just need the default GPA->HPA mappings (the address_space_memory > > > mapping). And this patch helps on case (2) to build up the mapping > > > automatically by leveraging the vfio-pci memory listeners. > > > > > > Another important thing that this patch does is to seperate > > > IR (Interrupt Remapping) from DMAR (DMA Remapping). IR region should not > > > depend on the DMAR region (like before this patch). It should be a > > > standalone region, and it should be able to be activated without > > > DMAR (which is a common behavior of Linux kernel - by default it enables > > > IR while disabled DMAR). > > > > > > This seems like an improvement, but I will note that there are existing > > locked memory accounting issues inherent with VT-d and vfio. With > > VT-d, each device has a unique AddressSpace. This requires that each > > is managed via a separate vfio container. 
> > Each container is accounted
> > for separately for locked pages.  libvirt currently only knows that if
> > any vfio devices are attached that the locked memory limit for the
> > process needs to be set sufficient for the VM memory.  When VT-d is
> > involved, we either need to figure out how to associate otherwise
> > independent vfio containers to share locked page accounting or teach
> > libvirt that the locked memory requirement needs to be multiplied by
> > the number of attached vfio devices.  The latter seems far less
> > complicated but reduces the containment of QEMU a bit since the
> > process has the ability to lock potentially many multiples of the VM
> > address size.  Thanks,
>
> Yes, this patch just tried to move VT-d forward a bit, rather than do
> it once and for all. I think we can do better than this in the future,
> for example, one address space per guest IOMMU domain (as you have
> mentioned before). However I suppose that will need more work (which I
> still can't estimate). So I am considering enabling device assignment
> functionally first, then we can further improve based on a workable
> version. Same thoughts apply to the IOMMU replay RFC series.

I'm not arguing against it, I'm just trying to set expectations for
where this gets us. An AddressSpace per guest iommu domain seems like
the right model for QEMU, but it has some fundamental issues with
vfio. We currently tie a QEMU AddressSpace to a vfio container, which
represents the host IOMMU context. The AddressSpace of a device is
currently assumed to be fixed in QEMU; guest IOMMU domains clearly are
not. vfio only lets us have access to a device while it's protected
within a container. Therefore, in order to move a device to a
different AddressSpace based on the guest domain configuration, we'd
need to tear down the vfio configuration, including releasing the
device.
> Regarding to the locked memory accounting issue: do we have existing > way to do the accounting? If so, would you (or anyone) please > elaborate a bit? If not, is that an ongoing/planned work? As I describe above, there's a vfio container per AddressSpace, each container is an IOMMU domain in the host. In the guest, an IOMMU domain can include multiple AddressSpaces, one for each context entry that's part of the domain. When the guest programs a translation for an IOMMU domain, that maps a guest IOVA to a guest physical address, for each AddressSpace. Each AddressSpace is backed by a vfio container, which needs to pin the pages of that translation in order to get a host physical address, which then gets programmed into the host IOMMU domain with the guest-IOVA and host physical address. The pinning process is where page accounting is done. It's done per vfio context. The worst case scenario for accounting is thus when VT-d is present but disabled (or in passthrough mode) as each AddressSpace duplicates address_space_memory and every page of guest memory is pinned and accounted for each vfio container. That's the existing way we do accounting. There is no current development that I'm aware of to change this. As above, the simplest stop-gap solution is that libvirt would need to be aware when VT-d is present for a VM and use a different algorithm to set QEMU locked memory limit, but it's not without its downsides. Alternatively, a new IOMMU model would need to be developed for vfio. The type1 model was only ever intended to be used for relatively static user mappings and I expect it to have horrendous performance when backing a dynamic guest IOMMU domain. Really the only guest IOMMU usage model that makes any sort of sense with type1 is to run the guest with passthrough (iommu=pt) and only pull devices out of passthrough for relatively static mapping cases within the guest userspace (nested assigned devices or dpdk). 
If the expectation is that we just need this one little bit more code to make vfio usable in the guest, that may be true, but it really is just barely usable. It's not going to be fast for any sort of dynamic mapping and it's going to have accounting issues that are not compatible with how libvirt sets locked memory limits for QEMU as soon as you go beyond a single device. Thanks, Alex
On Mon, Dec 19, 2016 at 09:52:52PM -0700, Alex Williamson wrote:

[...]

> > Yes, this patch just tried to move VT-d forward a bit, rather than do
> > it once and for all. I think we can do better than this in the future,
> > for example, one address space per guest IOMMU domain (as you have
> > mentioned before). However I suppose that will need more work (which I
> > still can't estimate on the amount of work). So I am considering to
> > enable the device assignments functionally first, then we can further
> > improve based on a workable version. Same thoughts apply to the IOMMU
> > replay RFC series.
>
> I'm not arguing against it, I'm just trying to set expectations for
> where this gets us.  An AddressSpace per guest iommu domain seems like
> the right model for QEMU, but it has some fundamental issues with
> vfio.  We currently tie a QEMU AddressSpace to a vfio container, which
> represents the host IOMMU context.  The AddressSpace of a device is
> currently assumed to be fixed in QEMU, guest IOMMU domains clearly
> are not.  vfio only lets us have access to a device while it's
> protected within a container.  Therefore in order to move a device to a
> different AddressSpace based on the guest domain configuration, we'd
> need to tear down the vfio configuration, including releasing the
> device.

I assume this is a VT-d specific issue, right? Looks like ppc uses a
totally different way to manage the mapping, and devices can share the
same address space.

> > Regarding the locked memory accounting issue: do we have an existing
> > way to do the accounting? If so, would you (or anyone) please
> > elaborate a bit? If not, is that ongoing/planned work?
>
> As I describe above, there's a vfio container per AddressSpace, each
> container is an IOMMU domain in the host.  In the guest, an IOMMU
> domain can include multiple AddressSpaces, one for each context entry
> that's part of the domain.
> When the guest programs a translation for an IOMMU domain, that maps a
> guest IOVA to a guest physical address, for each AddressSpace. Each
> AddressSpace is backed by a vfio container, which needs to pin the
> pages of that translation in order to get a host physical address,
> which then gets programmed into the host IOMMU domain with the
> guest IOVA and host physical address. The pinning process is where
> page accounting is done. It's done per vfio container. The worst case
> scenario for accounting is thus when VT-d is present but disabled (or
> in passthrough mode), as each AddressSpace duplicates
> address_space_memory and every page of guest memory is pinned and
> accounted for each vfio container.

IIUC this accounting issue will solve itself once we solve the previous
issue. But we don't have that yet, so ...

> That's the existing way we do accounting. There is no current
> development that I'm aware of to change this. As above, the simplest
> stop-gap solution is that libvirt would need to be aware when VT-d is
> present for a VM and use a different algorithm to set QEMU's locked
> memory limit, but it's not without its downsides.

... here I think it's sensible to consider a specific algorithm for the
VT-d use case. I am just curious how we should define this algorithm.

First of all, when the devices are not sharing a domain (or say, one
guest iommu domain per assigned device), everything should be fine; no
special algorithm is needed. IMHO the problem will happen only if there
are assigned devices that share the same address space (either the
system one, or a specific iommu domain). In that case, the accounted
value (or say, current->mm->locked_vm, IIUC) will be bigger than the
real locked memory size. However, whether devices end up in the same
address space depends on guest behavior - the guest can either use
iommu=pt, or manually put devices into the same guest iommu region to
achieve that.
But from the hypervisor's POV, how should we estimate this? Can we
really?

> Alternatively, a new IOMMU model would need to be developed for vfio.
> The type1 model was only ever intended to be used for relatively
> static user mappings and I expect it to have horrendous performance
> when backing a dynamic guest IOMMU domain. Really the only guest IOMMU
> usage model that makes any sort of sense with type1 is to run the
> guest with passthrough (iommu=pt) and only pull devices out of
> passthrough for relatively static mapping cases within the guest
> userspace (nested assigned devices or dpdk). If the expectation is
> that we just need this one little bit more code to make vfio usable in
> the guest, that may be true, but it really is just barely usable. It's
> not going to be fast for any sort of dynamic mapping and it's going to
> have accounting issues that are not compatible with how libvirt sets
> locked memory limits for QEMU as soon as you go beyond a single
> device. Thanks,

I can totally understand that the performance will suck if dynamic
mapping is used. AFAIU this work will only be used with static DMA
mapping, like running DPDK in the guest (besides other trivial goals,
like development).

Regarding "the other" iommu model you mentioned besides type1, are there
any existing discussions out there? Any further learning material/links
would be greatly welcomed. Thanks!

-- peterx
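The static-mapping usage model discussed above (e.g. DPDK in the guest, mappings set up once and left alone) boils down on the host side to a few VFIO_IOMMU_MAP_DMA ioctls on the type1 container. A minimal sketch follows; obtaining the container fd (opening /dev/vfio/vfio, selecting VFIO_TYPE1_IOMMU, attaching the group) is omitted, and the helper name is hypothetical:

```c
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Build a type1 DMA map request: pin the user pages at @vaddr and map
 * them at @iova in the container's host IOMMU domain.  The pages stay
 * pinned - and charged against the locked-memory limit - until a
 * matching VFIO_IOMMU_UNMAP_DMA, which is why type1 suits static
 * mappings rather than a guest that remaps frequently. */
static struct vfio_iommu_type1_dma_map dma_map_req(void *vaddr, __u64 iova,
                                                   __u64 size)
{
    struct vfio_iommu_type1_dma_map map;

    memset(&map, 0, sizeof(map));
    map.argsz = sizeof(map);
    map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
    map.vaddr = (__u64)(unsigned long)vaddr;
    map.iova  = iova;
    map.size  = size;
    return map;
}

/* Usage, given a real type1 container fd:
 *
 *     struct vfio_iommu_type1_dma_map map = dma_map_req(buf, 0, len);
 *     if (ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map) < 0) {
 *         perror("VFIO_IOMMU_MAP_DMA");
 *     }
 */
```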
Hi,

Your series seems to have some coding style problems. See output below for
more information:

Subject: [Qemu-devel] [PATCH] intel_iommu: allow dynamic switch of IOMMU region
Message-id: 1482158486-18597-1-git-send-email-peterx@redhat.com
Type: series

=== TEST SCRIPT BEGIN ===
#!/bin/bash
BASE=base
n=1
total=$(git log --oneline $BASE.. | wc -l)
failed=0

# Useful git options
git config --local diff.renamelimit 0
git config --local diff.renames True

commits="$(git log --format=%H --reverse $BASE..)"
for c in $commits; do
    echo "Checking PATCH $n/$total: $(git log -n 1 --format=%s $c)..."
    if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
        failed=1
        echo
    fi
    n=$((n+1))
done

exit $failed
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Switched to a new branch 'test'
ae738a3 intel_iommu: allow dynamic switch of IOMMU region

=== OUTPUT BEGIN ===
Checking PATCH 1/1: intel_iommu: allow dynamic switch of IOMMU region...
ERROR: "(foo**)" should be "(foo **)"
#55: FILE: hw/i386/intel_iommu.c:1190:
+    while (g_hash_table_iter_next (&iter, NULL, (void**)&vtd_bus)) {

ERROR: space prohibited between function name and open parenthesis '('
#55: FILE: hw/i386/intel_iommu.c:1190:
+    while (g_hash_table_iter_next (&iter, NULL, (void**)&vtd_bus)) {

total: 2 errors, 0 warnings, 118 lines checked

Your patch has style problems, please review. If any of these errors are
false positives report them to the maintainer, see CHECKPATCH in
MAINTAINERS.
=== OUTPUT END ===

Test command exited with code: 1

---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-devel@freelists.org
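Both checkpatch errors flag the same line: QEMU style wants no space between a function name and its opening parenthesis, and a space inside the pointer cast. The sketch below shows the corrected call style compiling in isolation; the iterator type and `hash_iter_next` helper are stand-ins so the example does not need GLib (the real call is `g_hash_table_iter_next`):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for GHashTableIter/g_hash_table_iter_next(), mirroring the
 * (iter, key, value) shape of the GLib API closely enough to show the
 * call style checkpatch wants. */
struct iter {
    int left;     /* remaining entries */
    void *value;  /* value returned for each entry */
};

static bool hash_iter_next(struct iter *it, void **key, void **value)
{
    if (it->left-- <= 0) {
        return false;
    }
    if (value) {
        *value = it->value;
    }
    return true;
}

/* Corrected per checkpatch: no space before '(', and "(void **)" with
 * a space, instead of "g_hash_table_iter_next (... (void**)...)". */
static int count_buses(struct iter *it)
{
    void *vtd_bus;
    int n = 0;

    while (hash_iter_next(it, NULL, (void **)&vtd_bus)) {
        n++;
    }
    return n;
}
```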
Hi,

Your series failed the automatic build test. Please find the testing
commands and their output below. If you have docker installed, you can
probably reproduce it locally.

Type: series
Message-id: 1482158486-18597-1-git-send-email-peterx@redhat.com
Subject: [Qemu-devel] [PATCH] intel_iommu: allow dynamic switch of IOMMU region

=== TEST SCRIPT BEGIN ===
#!/bin/bash
set -e
git submodule update --init dtc
# Let docker tests dump environment info
export SHOW_ENV=1
export J=16
make docker-test-quick@centos6
make docker-test-mingw@fedora
make docker-test-build@min-glib
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Switched to a new branch 'test'
ae738a3 intel_iommu: allow dynamic switch of IOMMU region

=== OUTPUT BEGIN ===
Submodule 'dtc' (git://git.qemu-project.org/dtc.git) registered for path 'dtc'
Cloning into 'dtc'...
Submodule path 'dtc': checked out '65cc4d2748a2c2e6f27f1cf39e07a5dbabd80ebf'
  BUILD centos6
make[1]: Entering directory `/var/tmp/patchew-tester-tmp-0ljnwqpe/src'
  ARCHIVE qemu.tgz
  ARCHIVE dtc.tgz
  COPY RUNNER
  RUN test-quick in qemu:centos6

[... installed package and environment variable listings trimmed ...]
%s FEATURES= dtc DEBUG= G_BROKEN_FILENAMES=1 CCACHE_HASHDIR= _=/usr/bin/env Configure options: --enable-werror --target-list=x86_64-softmmu,aarch64-softmmu --prefix=/var/tmp/qemu-build/install No C++ compiler available; disabling C++ specific optional code Install prefix /var/tmp/qemu-build/install BIOS directory /var/tmp/qemu-build/install/share/qemu binary directory /var/tmp/qemu-build/install/bin library directory /var/tmp/qemu-build/install/lib module directory /var/tmp/qemu-build/install/lib/qemu libexec directory /var/tmp/qemu-build/install/libexec include directory /var/tmp/qemu-build/install/include config directory /var/tmp/qemu-build/install/etc local state directory /var/tmp/qemu-build/install/var Manual directory /var/tmp/qemu-build/install/share/man ELF interp prefix /usr/gnemul/qemu-%M Source path /tmp/qemu-test/src C compiler cc Host C compiler cc C++ compiler Objective-C compiler cc ARFLAGS rv CFLAGS -O2 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -g QEMU_CFLAGS -I/usr/include/pixman-1 -pthread -I/usr/include/glib-2.0 -I/usr/lib64/glib-2.0/include -fPIE -DPIE -m64 -mcx16 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -Wstrict-prototypes -Wredundant-decls -Wall -Wundef -Wwrite-strings -Wmissing-prototypes -fno-strict-aliasing -fno-common -fwrapv -Wendif-labels -Wmissing-include-dirs -Wempty-body -Wnested-externs -Wformat-security -Wformat-y2k -Winit-self -Wignored-qualifiers -Wold-style-declaration -Wold-style-definition -Wtype-limits -fstack-protector-all LDFLAGS -Wl,--warn-common -Wl,-z,relro -Wl,-z,now -pie -m64 -g make make install install python python -B smbd /usr/sbin/smbd module support no host CPU x86_64 host big endian no target list x86_64-softmmu aarch64-softmmu tcg debug enabled no gprof enabled no sparse enabled no strip binaries yes profiler no static build no pixman system SDL support yes (1.2.14) GTK support no GTK GL support no VTE support no TLS priority NORMAL GNUTLS support no GNUTLS rnd no libgcrypt no libgcrypt kdf no 
nettle no nettle kdf no libtasn1 no curses support no virgl support no curl support no mingw32 support no Audio drivers oss Block whitelist (rw) Block whitelist (ro) VirtFS support no VNC support yes VNC SASL support no VNC JPEG support no VNC PNG support no xen support no brlapi support no bluez support no Documentation no PIE yes vde support no netmap support no Linux AIO support no ATTR/XATTR support yes Install blobs yes KVM support yes COLO support yes RDMA support no TCG interpreter no fdt support yes preadv support yes fdatasync yes madvise yes posix_madvise yes libcap-ng support no vhost-net support yes vhost-scsi support yes vhost-vsock support yes Trace backends log spice support no rbd support no xfsctl support no smartcard support no libusb no usb net redir no OpenGL support no OpenGL dmabufs no libiscsi support no libnfs support no build guest agent yes QGA VSS support no QGA w32 disk info no QGA MSI support no seccomp support no coroutine backend ucontext coroutine pool yes debug stack usage no GlusterFS support no Archipelago support no gcov gcov gcov enabled no TPM support yes libssh2 support no TPM passthrough yes QOM debugging yes lzo support no snappy support no bzip2 support no NUMA host support no tcmalloc support no jemalloc support no avx2 optimization no replication support yes GEN x86_64-softmmu/config-devices.mak.tmp GEN aarch64-softmmu/config-devices.mak.tmp GEN config-host.h GEN qemu-options.def GEN qmp-commands.h GEN qapi-types.h GEN qapi-visit.h GEN qapi-event.h GEN qmp-introspect.h GEN x86_64-softmmu/config-devices.mak GEN aarch64-softmmu/config-devices.mak GEN module_block.h GEN tests/test-qapi-types.h GEN tests/test-qapi-visit.h GEN tests/test-qmp-commands.h GEN tests/test-qapi-event.h GEN tests/test-qmp-introspect.h GEN config-all-devices.mak GEN trace/generated-tracers.h GEN trace/generated-tcg-tracers.h GEN trace/generated-helpers-wrappers.h GEN trace/generated-helpers.h CC tests/qemu-iotests/socket_scm_helper.o GEN 
qga/qapi-generated/qga-qmp-commands.h GEN qga/qapi-generated/qga-qapi-types.h GEN qga/qapi-generated/qga-qapi-visit.h GEN qga/qapi-generated/qga-qapi-types.c GEN qga/qapi-generated/qga-qapi-visit.c GEN qga/qapi-generated/qga-qmp-marshal.c GEN qmp-introspect.c GEN qapi-types.c GEN qapi-visit.c GEN qapi-event.c CC qapi/qapi-visit-core.o CC qapi/qapi-dealloc-visitor.o CC qapi/qobject-input-visitor.o CC qapi/qobject-output-visitor.o CC qapi/qmp-registry.o CC qapi/qmp-dispatch.o CC qapi/string-input-visitor.o CC qapi/string-output-visitor.o CC qapi/opts-visitor.o CC qapi/qapi-clone-visitor.o CC qapi/qmp-event.o CC qapi/qapi-util.o CC qobject/qnull.o CC qobject/qint.o CC qobject/qstring.o CC qobject/qdict.o CC qobject/qlist.o CC qobject/qfloat.o CC qobject/qbool.o CC qobject/qjson.o CC qobject/qobject.o CC qobject/json-lexer.o CC qobject/json-streamer.o CC qobject/json-parser.o GEN trace/generated-tracers.c CC trace/control.o CC trace/qmp.o CC util/osdep.o CC util/cutils.o CC util/unicode.o CC util/qemu-timer-common.o CC util/bufferiszero.o CC util/compatfd.o CC util/event_notifier-posix.o CC util/mmap-alloc.o CC util/oslib-posix.o CC util/qemu-openpty.o CC util/qemu-thread-posix.o CC util/memfd.o CC util/envlist.o CC util/path.o CC util/module.o CC util/bitmap.o CC util/hbitmap.o CC util/bitops.o CC util/fifo8.o CC util/acl.o CC util/error.o CC util/qemu-error.o CC util/id.o CC util/iov.o CC util/qemu-config.o CC util/qemu-sockets.o CC util/uri.o CC util/notify.o CC util/qemu-option.o CC util/qemu-progress.o CC util/hexdump.o CC util/crc32c.o CC util/uuid.o CC util/throttle.o CC util/getauxval.o CC util/readline.o CC util/rcu.o CC util/qemu-coroutine.o CC util/qemu-coroutine-lock.o CC util/qemu-coroutine-io.o CC util/qemu-coroutine-sleep.o CC util/coroutine-ucontext.o CC util/buffer.o CC util/timed-average.o CC util/base64.o CC util/log.o CC util/qdist.o CC util/qht.o CC util/range.o CC crypto/pbkdf-stub.o CC stubs/arch-query-cpu-def.o CC 
stubs/arch-query-cpu-model-expansion.o CC stubs/arch-query-cpu-model-comparison.o CC stubs/arch-query-cpu-model-baseline.o CC stubs/bdrv-next-monitor-owned.o CC stubs/blk-commit-all.o CC stubs/blockdev-close-all-bdrv-states.o CC stubs/clock-warp.o CC stubs/cpu-get-clock.o CC stubs/cpu-get-icount.o CC stubs/dump.o CC stubs/error-printf.o CC stubs/fdset-add-fd.o CC stubs/fdset-find-fd.o CC stubs/fdset-get-fd.o CC stubs/fdset-remove-fd.o CC stubs/gdbstub.o CC stubs/get-fd.o CC stubs/get-next-serial.o CC stubs/get-vm-name.o CC stubs/iothread.o CC stubs/iothread-lock.o CC stubs/is-daemonized.o CC stubs/machine-init-done.o CC stubs/migr-blocker.o CC stubs/mon-is-qmp.o CC stubs/monitor-init.o CC stubs/notify-event.o CC stubs/qtest.o CC stubs/replay.o CC stubs/set-fd-handler.o CC stubs/replay-user.o CC stubs/reset.o CC stubs/runstate-check.o CC stubs/sysbus.o CC stubs/slirp.o CC stubs/trace-control.o CC stubs/uuid.o CC stubs/vm-stop.o CC stubs/vmstate.o CC stubs/cpus.o CC stubs/qmp_pc_dimm_device_list.o CC stubs/kvm.o CC stubs/target-get-monitor-def.o CC stubs/target-monitor-defs.o CC stubs/vhost.o CC stubs/iohandler.o CC stubs/smbios_type_38.o CC stubs/ipmi.o CC stubs/pc_madt_cpu_entry.o CC stubs/migration-colo.o CC contrib/ivshmem-client/ivshmem-client.o CC contrib/ivshmem-client/main.o CC contrib/ivshmem-server/ivshmem-server.o CC contrib/ivshmem-server/main.o CC qemu-nbd.o CC async.o CC thread-pool.o CC block.o CC blockjob.o CC iohandler.o CC qemu-timer.o CC main-loop.o CC aio-posix.o CC qemu-io-cmds.o CC replication.o CC block/raw_bsd.o CC block/qcow.o CC block/vdi.o CC block/vmdk.o CC block/cloop.o CC block/bochs.o CC block/vvfat.o CC block/vpc.o CC block/dmg.o CC block/qcow2.o CC block/qcow2-refcount.o CC block/qcow2-cluster.o CC block/qcow2-snapshot.o CC block/qcow2-cache.o CC block/qed.o CC block/qed-l2-cache.o CC block/qed-gencb.o CC block/qed-table.o CC block/qed-cluster.o CC block/qed-check.o CC block/vhdx.o CC block/vhdx-endian.o CC block/vhdx-log.o CC 
block/quorum.o CC block/parallels.o CC block/blkdebug.o CC block/blkverify.o CC block/blkreplay.o CC block/block-backend.o CC block/snapshot.o CC block/qapi.o CC block/raw-posix.o CC block/null.o CC block/mirror.o CC block/commit.o CC block/io.o CC block/throttle-groups.o CC block/nbd.o CC block/nbd-client.o CC block/sheepdog.o CC block/accounting.o CC block/dirty-bitmap.o CC block/write-threshold.o CC block/backup.o CC block/replication.o CC block/crypto.o CC nbd/server.o CC nbd/client.o CC nbd/common.o CC crypto/hash.o CC crypto/init.o CC crypto/hash-glib.o CC crypto/aes.o CC crypto/desrfb.o CC crypto/cipher.o CC crypto/tlscreds.o CC crypto/tlscredsanon.o CC crypto/tlscredsx509.o CC crypto/tlssession.o CC crypto/secret.o CC crypto/random-platform.o CC crypto/pbkdf.o CC crypto/ivgen.o CC crypto/ivgen-essiv.o CC crypto/ivgen-plain.o CC crypto/ivgen-plain64.o CC crypto/afsplit.o CC crypto/xts.o CC crypto/block.o CC crypto/block-qcow.o CC io/channel.o CC crypto/block-luks.o CC io/channel-buffer.o CC io/channel-command.o CC io/channel-file.o CC io/channel-socket.o CC io/channel-tls.o CC io/channel-watch.o CC io/channel-websock.o CC io/channel-util.o CC io/task.o CC qom/object.o CC qom/container.o CC qom/qom-qobject.o CC qom/object_interfaces.o GEN qemu-img-cmds.h CC qemu-io.o CC qemu-bridge-helper.o CC blockdev.o CC blockdev-nbd.o CC iothread.o CC qdev-monitor.o CC device-hotplug.o CC os-posix.o CC qemu-char.o CC page_cache.o CC accel.o CC bt-host.o CC bt-vhci.o CC dma-helpers.o CC vl.o CC tpm.o CC device_tree.o GEN qmp-marshal.c CC qmp.o CC hmp.o CC cpus-common.o CC audio/audio.o CC audio/noaudio.o CC audio/wavaudio.o CC audio/mixeng.o CC audio/sdlaudio.o CC audio/ossaudio.o CC audio/wavcapture.o CC backends/rng.o CC backends/rng-egd.o CC backends/rng-random.o CC backends/msmouse.o CC backends/testdev.o CC backends/tpm.o CC backends/hostmem.o CC backends/hostmem-ram.o CC backends/hostmem-file.o CC backends/cryptodev.o CC backends/cryptodev-builtin.o CC block/stream.o 
CC disas/arm.o CC disas/i386.o CC fsdev/qemu-fsdev-dummy.o CC fsdev/qemu-fsdev-opts.o CC hw/acpi/core.o CC hw/acpi/piix4.o CC hw/acpi/pcihp.o CC hw/acpi/ich9.o CC hw/acpi/tco.o CC hw/acpi/cpu_hotplug.o CC hw/acpi/memory_hotplug.o CC hw/acpi/memory_hotplug_acpi_table.o CC hw/acpi/cpu.o CC hw/acpi/nvdimm.o CC hw/acpi/acpi_interface.o CC hw/acpi/bios-linker-loader.o CC hw/acpi/aml-build.o CC hw/acpi/ipmi.o CC hw/audio/sb16.o CC hw/audio/es1370.o CC hw/audio/ac97.o CC hw/audio/fmopl.o CC hw/audio/adlib.o CC hw/audio/gus.o CC hw/audio/gusemu_hal.o CC hw/audio/gusemu_mixer.o CC hw/audio/cs4231a.o CC hw/audio/intel-hda.o CC hw/audio/hda-codec.o CC hw/audio/pcspk.o CC hw/audio/wm8750.o CC hw/audio/pl041.o CC hw/audio/lm4549.o CC hw/audio/marvell_88w8618.o CC hw/block/block.o CC hw/block/cdrom.o CC hw/block/hd-geometry.o CC hw/block/fdc.o CC hw/block/m25p80.o CC hw/block/nand.o CC hw/block/pflash_cfi01.o CC hw/block/pflash_cfi02.o CC hw/block/ecc.o CC hw/block/nvme.o CC hw/block/onenand.o CC hw/bt/core.o CC hw/bt/l2cap.o CC hw/bt/sdp.o CC hw/bt/hci.o CC hw/bt/hid.o CC hw/bt/hci-csr.o CC hw/char/ipoctal232.o CC hw/char/parallel.o CC hw/char/pl011.o CC hw/char/serial.o CC hw/char/serial-isa.o CC hw/char/serial-pci.o CC hw/char/virtio-console.o CC hw/char/cadence_uart.o CC hw/char/debugcon.o CC hw/char/imx_serial.o CC hw/core/qdev-properties.o CC hw/core/qdev.o CC hw/core/bus.o CC hw/core/fw-path-provider.o CC hw/core/irq.o CC hw/core/hotplug.o CC hw/core/ptimer.o CC hw/core/sysbus.o CC hw/core/machine.o CC hw/core/null-machine.o CC hw/core/loader.o CC hw/core/qdev-properties-system.o CC hw/core/register.o CC hw/core/or-irq.o CC hw/core/platform-bus.o CC hw/display/ads7846.o CC hw/display/cirrus_vga.o CC hw/display/pl110.o CC hw/display/ssd0303.o CC hw/display/ssd0323.o CC hw/display/vga-pci.o CC hw/display/vga-isa.o CC hw/display/vmware_vga.o CC hw/display/blizzard.o CC hw/display/exynos4210_fimd.o CC hw/display/framebuffer.o CC hw/display/tc6393xb.o CC hw/dma/pl080.o CC 
hw/dma/pl330.o CC hw/dma/i8257.o CC hw/dma/xlnx-zynq-devcfg.o CC hw/gpio/max7310.o CC hw/gpio/pl061.o CC hw/gpio/zaurus.o CC hw/gpio/gpio_key.o CC hw/i2c/core.o CC hw/i2c/smbus.o CC hw/i2c/smbus_eeprom.o CC hw/i2c/i2c-ddc.o CC hw/i2c/versatile_i2c.o CC hw/i2c/smbus_ich9.o CC hw/i2c/pm_smbus.o CC hw/i2c/bitbang_i2c.o CC hw/i2c/exynos4210_i2c.o CC hw/i2c/imx_i2c.o CC hw/i2c/aspeed_i2c.o CC hw/ide/core.o CC hw/ide/atapi.o CC hw/ide/qdev.o CC hw/ide/pci.o CC hw/ide/isa.o CC hw/ide/piix.o CC hw/ide/microdrive.o CC hw/ide/ahci.o CC hw/ide/ich.o CC hw/input/hid.o CC hw/input/lm832x.o CC hw/input/pckbd.o CC hw/input/pl050.o CC hw/input/ps2.o CC hw/input/stellaris_input.o CC hw/input/tsc2005.o CC hw/input/vmmouse.o CC hw/input/virtio-input.o CC hw/input/virtio-input-hid.o CC hw/input/virtio-input-host.o CC hw/intc/i8259_common.o CC hw/intc/i8259.o CC hw/intc/pl190.o CC hw/intc/imx_avic.o CC hw/intc/realview_gic.o CC hw/intc/ioapic_common.o CC hw/intc/arm_gic_common.o CC hw/intc/arm_gic.o CC hw/intc/arm_gicv2m.o CC hw/intc/arm_gicv3_common.o CC hw/intc/arm_gicv3.o CC hw/intc/arm_gicv3_dist.o CC hw/intc/arm_gicv3_redist.o CC hw/intc/arm_gicv3_its_common.o CC hw/intc/intc.o CC hw/ipack/ipack.o CC hw/ipack/tpci200.o CC hw/ipmi/ipmi.o CC hw/ipmi/ipmi_bmc_sim.o CC hw/ipmi/ipmi_bmc_extern.o CC hw/ipmi/isa_ipmi_kcs.o CC hw/isa/isa-bus.o CC hw/ipmi/isa_ipmi_bt.o CC hw/isa/apm.o CC hw/mem/pc-dimm.o CC hw/mem/nvdimm.o CC hw/misc/applesmc.o CC hw/misc/max111x.o CC hw/misc/tmp105.o CC hw/misc/debugexit.o CC hw/misc/sga.o CC hw/misc/pc-testdev.o CC hw/misc/pci-testdev.o CC hw/misc/arm_l2x0.o CC hw/misc/arm_integrator_debug.o CC hw/misc/arm11scu.o CC hw/misc/a9scu.o CC hw/net/ne2000.o CC hw/net/eepro100.o CC hw/net/pcnet-pci.o CC hw/net/pcnet.o CC hw/net/e1000.o CC hw/net/e1000x_common.o CC hw/net/net_tx_pkt.o CC hw/net/net_rx_pkt.o CC hw/net/e1000e.o CC hw/net/e1000e_core.o CC hw/net/rtl8139.o CC hw/net/vmxnet3.o CC hw/net/smc91c111.o CC hw/net/lan9118.o CC hw/net/ne2000-isa.o CC 
hw/net/xgmac.o CC hw/net/allwinner_emac.o CC hw/net/imx_fec.o CC hw/net/cadence_gem.o CC hw/net/stellaris_enet.o CC hw/net/rocker/rocker.o CC hw/net/rocker/rocker_fp.o CC hw/net/rocker/rocker_desc.o CC hw/net/rocker/rocker_world.o CC hw/net/rocker/rocker_of_dpa.o CC hw/nvram/eeprom93xx.o CC hw/nvram/fw_cfg.o CC hw/nvram/chrp_nvram.o CC hw/pci-bridge/pci_bridge_dev.o CC hw/pci-bridge/pci_expander_bridge.o CC hw/pci-bridge/xio3130_upstream.o CC hw/pci-bridge/xio3130_downstream.o CC hw/pci-bridge/ioh3420.o CC hw/pci-bridge/i82801b11.o CC hw/pci-host/pam.o CC hw/pci-host/versatile.o CC hw/pci-host/piix.o CC hw/pci-host/q35.o CC hw/pci-host/gpex.o CC hw/pci/pci.o CC hw/pci/pci_bridge.o CC hw/pci/msix.o CC hw/pci/msi.o CC hw/pci/shpc.o CC hw/pci/slotid_cap.o CC hw/pci/pci_host.o CC hw/pci/pcie_host.o CC hw/pci/pcie.o CC hw/pci/pcie_aer.o CC hw/pci/pcie_port.o CC hw/pci/pci-stub.o CC hw/pcmcia/pcmcia.o CC hw/scsi/scsi-disk.o CC hw/scsi/scsi-generic.o CC hw/scsi/lsi53c895a.o CC hw/scsi/scsi-bus.o CC hw/scsi/mptsas.o CC hw/scsi/mptconfig.o CC hw/scsi/mptendian.o CC hw/scsi/megasas.o CC hw/scsi/vmw_pvscsi.o CC hw/scsi/esp.o CC hw/scsi/esp-pci.o CC hw/sd/pl181.o CC hw/sd/ssi-sd.o CC hw/sd/sd.o CC hw/sd/core.o CC hw/sd/sdhci.o CC hw/smbios/smbios.o CC hw/smbios/smbios_type_38.o CC hw/ssi/pl022.o CC hw/ssi/ssi.o CC hw/ssi/xilinx_spips.o CC hw/ssi/aspeed_smc.o CC hw/ssi/stm32f2xx_spi.o CC hw/timer/arm_mptimer.o CC hw/timer/arm_timer.o CC hw/timer/a9gtimer.o CC hw/timer/cadence_ttc.o CC hw/timer/ds1338.o CC hw/timer/hpet.o CC hw/timer/i8254_common.o CC hw/timer/i8254.o CC hw/timer/pl031.o CC hw/timer/twl92230.o CC hw/timer/imx_epit.o CC hw/timer/imx_gpt.o CC hw/timer/stm32f2xx_timer.o CC hw/timer/aspeed_timer.o CC hw/tpm/tpm_tis.o CC hw/tpm/tpm_passthrough.o CC hw/tpm/tpm_util.o CC hw/usb/core.o CC hw/usb/combined-packet.o CC hw/usb/bus.o /tmp/qemu-test/src/hw/nvram/fw_cfg.c: In function ‘fw_cfg_dma_transfer’: /tmp/qemu-test/src/hw/nvram/fw_cfg.c:329: warning: ‘read’ may be used 
uninitialized in this function CC hw/usb/libhw.o CC hw/usb/desc.o CC hw/usb/desc-msos.o CC hw/usb/hcd-uhci.o CC hw/usb/hcd-ohci.o CC hw/usb/hcd-ehci.o CC hw/usb/hcd-ehci-pci.o CC hw/usb/hcd-ehci-sysbus.o CC hw/usb/hcd-xhci.o CC hw/usb/hcd-musb.o CC hw/usb/dev-hub.o CC hw/usb/dev-hid.o CC hw/usb/dev-wacom.o CC hw/usb/dev-storage.o CC hw/usb/dev-uas.o CC hw/usb/dev-audio.o CC hw/usb/dev-serial.o CC hw/usb/dev-network.o CC hw/usb/dev-bluetooth.o CC hw/usb/dev-smartcard-reader.o CC hw/usb/dev-mtp.o CC hw/usb/host-stub.o CC hw/virtio/virtio-rng.o CC hw/virtio/virtio-pci.o CC hw/virtio/virtio-bus.o CC hw/virtio/virtio-mmio.o CC hw/watchdog/watchdog.o CC hw/watchdog/wdt_i6300esb.o CC hw/watchdog/wdt_ib700.o CC migration/migration.o CC migration/socket.o CC migration/fd.o CC migration/exec.o CC migration/tls.o CC migration/colo-comm.o CC migration/colo.o CC migration/colo-failover.o CC migration/vmstate.o CC migration/qemu-file.o CC migration/qemu-file-channel.o CC migration/xbzrle.o CC migration/postcopy-ram.o CC migration/qjson.o CC migration/block.o CC net/net.o CC net/queue.o CC net/checksum.o CC net/util.o CC net/socket.o CC net/hub.o CC net/dump.o CC net/eth.o CC net/l2tpv3.o CC net/vhost-user.o CC net/tap-linux.o CC net/tap.o CC net/slirp.o CC net/filter.o CC net/filter-buffer.o CC net/filter-mirror.o CC net/colo-compare.o CC net/colo.o CC net/filter-rewriter.o CC replay/replay.o CC qom/cpu.o CC replay/replay-internal.o CC replay/replay-events.o /tmp/qemu-test/src/replay/replay-internal.c: In function ‘replay_put_array’: /tmp/qemu-test/src/replay/replay-internal.c:65: warning: ignoring return value of ‘fwrite’, declared with attribute warn_unused_result CC replay/replay-time.o CC replay/replay-input.o CC replay/replay-char.o CC replay/replay-snapshot.o CC slirp/cksum.o CC slirp/if.o CC slirp/ip_icmp.o CC slirp/ip6_icmp.o CC slirp/ip6_input.o CC slirp/ip6_output.o CC slirp/ip_input.o CC slirp/ip_output.o CC slirp/dnssearch.o CC slirp/dhcpv6.o CC slirp/slirp.o CC 
slirp/mbuf.o CC slirp/misc.o CC slirp/sbuf.o CC slirp/socket.o CC slirp/tcp_input.o CC slirp/tcp_subr.o CC slirp/tcp_output.o CC slirp/tcp_timer.o CC slirp/udp.o /tmp/qemu-test/src/slirp/tcp_input.c: In function ‘tcp_input’: /tmp/qemu-test/src/slirp/tcp_input.c:219: warning: ‘save_ip.ip_p’ may be used uninitialized in this function /tmp/qemu-test/src/slirp/tcp_input.c:219: warning: ‘save_ip.ip_len’ may be used uninitialized in this function /tmp/qemu-test/src/slirp/tcp_input.c:219: warning: ‘save_ip.ip_tos’ may be used uninitialized in this function /tmp/qemu-test/src/slirp/tcp_input.c:219: warning: ‘save_ip.ip_id’ may be used uninitialized in this function /tmp/qemu-test/src/slirp/tcp_input.c:219: warning: ‘save_ip.ip_off’ may be used uninitialized in this function /tmp/qemu-test/src/slirp/tcp_input.c:219: warning: ‘save_ip.ip_ttl’ may be used uninitialized in this function /tmp/qemu-test/src/slirp/tcp_input.c:219: warning: ‘save_ip.ip_sum’ may be used uninitialized in this function /tmp/qemu-test/src/slirp/tcp_input.c:219: warning: ‘save_ip.ip_src.s_addr’ may be used uninitialized in this function /tmp/qemu-test/src/slirp/tcp_input.c:219: warning: ‘save_ip.ip_dst.s_addr’ may be used uninitialized in this function /tmp/qemu-test/src/slirp/tcp_input.c:220: warning: ‘save_ip6.ip_nh’ may be used uninitialized in this function CC slirp/udp6.o CC slirp/bootp.o CC slirp/tftp.o CC slirp/arp_table.o CC slirp/ndp_table.o CC ui/keymaps.o CC ui/console.o CC ui/cursor.o CC ui/qemu-pixman.o CC ui/input.o CC ui/input-keymap.o CC ui/input-legacy.o CC ui/input-linux.o CC ui/sdl.o CC ui/sdl_zoom.o CC ui/x_keymap.o CC ui/vnc.o CC ui/vnc-enc-zlib.o CC ui/vnc-enc-hextile.o CC ui/vnc-enc-tight.o CC ui/vnc-palette.o CC ui/vnc-enc-zrle.o CC ui/vnc-auth-vencrypt.o CC ui/vnc-ws.o CC ui/vnc-jobs.o LINK tests/qemu-iotests/socket_scm_helper CC qga/commands.o CC qga/guest-agent-command-state.o CC qga/main.o CC qga/commands-posix.o CC qga/channel-posix.o CC qga/qapi-generated/qga-qapi-types.o 
CC qga/qapi-generated/qga-qapi-visit.o CC qga/qapi-generated/qga-qmp-marshal.o CC qmp-introspect.o CC qapi-types.o CC qapi-visit.o CC qapi-event.o AS optionrom/multiboot.o AS optionrom/linuxboot.o CC optionrom/linuxboot_dma.o AS optionrom/kvmvapic.o cc: unrecognized option '-no-integrated-as' cc: unrecognized option '-no-integrated-as' BUILD optionrom/multiboot.img AR libqemustub.a CC qemu-img.o BUILD optionrom/linuxboot.img CC qmp-marshal.o BUILD optionrom/linuxboot_dma.img CC trace/generated-tracers.o BUILD optionrom/multiboot.raw BUILD optionrom/linuxboot_dma.raw BUILD optionrom/linuxboot.raw BUILD optionrom/kvmvapic.img SIGN optionrom/multiboot.bin SIGN optionrom/linuxboot.bin SIGN optionrom/linuxboot_dma.bin BUILD optionrom/kvmvapic.raw SIGN optionrom/kvmvapic.bin AR libqemuutil.a LINK qemu-ga LINK ivshmem-client LINK ivshmem-server LINK qemu-nbd LINK qemu-img LINK qemu-io LINK qemu-bridge-helper GEN x86_64-softmmu/hmp-commands.h GEN x86_64-softmmu/hmp-commands-info.h GEN x86_64-softmmu/config-target.h GEN aarch64-softmmu/config-target.h GEN aarch64-softmmu/hmp-commands.h GEN aarch64-softmmu/hmp-commands-info.h CC x86_64-softmmu/cpu-exec.o CC x86_64-softmmu/exec.o CC x86_64-softmmu/translate-all.o CC x86_64-softmmu/translate-common.o CC x86_64-softmmu/cpu-exec-common.o CC x86_64-softmmu/tcg/tcg.o CC x86_64-softmmu/tcg/tcg-op.o CC x86_64-softmmu/tcg/optimize.o CC x86_64-softmmu/tcg/tcg-common.o CC x86_64-softmmu/fpu/softfloat.o CC x86_64-softmmu/disas.o CC x86_64-softmmu/tcg-runtime.o CC x86_64-softmmu/arch_init.o CC x86_64-softmmu/cpus.o CC x86_64-softmmu/monitor.o CC x86_64-softmmu/gdbstub.o CC x86_64-softmmu/balloon.o CC x86_64-softmmu/ioport.o CC x86_64-softmmu/numa.o CC x86_64-softmmu/qtest.o CC x86_64-softmmu/bootdevice.o CC aarch64-softmmu/translate-all.o CC aarch64-softmmu/exec.o CC x86_64-softmmu/kvm-all.o CC x86_64-softmmu/memory.o CC x86_64-softmmu/cputlb.o CC aarch64-softmmu/cpu-exec.o CC aarch64-softmmu/translate-common.o CC 
x86_64-softmmu/memory_mapping.o CC x86_64-softmmu/dump.o CC x86_64-softmmu/migration/ram.o CC aarch64-softmmu/cpu-exec-common.o CC aarch64-softmmu/tcg/tcg.o CC aarch64-softmmu/tcg/tcg-op.o CC aarch64-softmmu/tcg/optimize.o CC x86_64-softmmu/migration/savevm.o CC aarch64-softmmu/tcg/tcg-common.o CC aarch64-softmmu/fpu/softfloat.o CC x86_64-softmmu/xen-common-stub.o CC aarch64-softmmu/disas.o CC aarch64-softmmu/tcg-runtime.o CC x86_64-softmmu/xen-hvm-stub.o GEN aarch64-softmmu/gdbstub-xml.c CC aarch64-softmmu/kvm-stub.o CC x86_64-softmmu/hw/block/virtio-blk.o CC aarch64-softmmu/arch_init.o CC aarch64-softmmu/cpus.o CC aarch64-softmmu/monitor.o CC aarch64-softmmu/gdbstub.o CC aarch64-softmmu/balloon.o CC aarch64-softmmu/ioport.o CC x86_64-softmmu/hw/block/dataplane/virtio-blk.o CC aarch64-softmmu/numa.o CC aarch64-softmmu/qtest.o CC aarch64-softmmu/bootdevice.o CC aarch64-softmmu/memory.o CC aarch64-softmmu/cputlb.o CC x86_64-softmmu/hw/char/virtio-serial-bus.o CC x86_64-softmmu/hw/core/nmi.o CC x86_64-softmmu/hw/core/generic-loader.o CC x86_64-softmmu/hw/cpu/core.o CC x86_64-softmmu/hw/display/vga.o CC aarch64-softmmu/memory_mapping.o CC aarch64-softmmu/dump.o CC x86_64-softmmu/hw/display/virtio-gpu.o CC aarch64-softmmu/migration/ram.o CC aarch64-softmmu/migration/savevm.o CC aarch64-softmmu/xen-common-stub.o CC x86_64-softmmu/hw/display/virtio-gpu-3d.o CC x86_64-softmmu/hw/display/virtio-gpu-pci.o CC aarch64-softmmu/xen-hvm-stub.o CC x86_64-softmmu/hw/display/virtio-vga.o CC aarch64-softmmu/hw/adc/stm32f2xx_adc.o CC aarch64-softmmu/hw/block/virtio-blk.o CC aarch64-softmmu/hw/block/dataplane/virtio-blk.o CC x86_64-softmmu/hw/intc/apic.o CC aarch64-softmmu/hw/char/exynos4210_uart.o CC aarch64-softmmu/hw/char/omap_uart.o CC aarch64-softmmu/hw/char/digic-uart.o CC aarch64-softmmu/hw/char/stm32f2xx_usart.o CC aarch64-softmmu/hw/char/bcm2835_aux.o CC aarch64-softmmu/hw/char/virtio-serial-bus.o CC x86_64-softmmu/hw/intc/apic_common.o CC x86_64-softmmu/hw/intc/ioapic.o CC 
x86_64-softmmu/hw/isa/lpc_ich9.o CC aarch64-softmmu/hw/core/nmi.o CC aarch64-softmmu/hw/core/generic-loader.o CC aarch64-softmmu/hw/cpu/arm11mpcore.o CC aarch64-softmmu/hw/cpu/realview_mpcore.o CC x86_64-softmmu/hw/misc/vmport.o CC aarch64-softmmu/hw/cpu/a9mpcore.o CC x86_64-softmmu/hw/misc/ivshmem.o CC aarch64-softmmu/hw/cpu/a15mpcore.o CC x86_64-softmmu/hw/misc/pvpanic.o CC x86_64-softmmu/hw/misc/edu.o CC x86_64-softmmu/hw/misc/hyperv_testdev.o CC aarch64-softmmu/hw/cpu/core.o CC x86_64-softmmu/hw/net/virtio-net.o CC x86_64-softmmu/hw/net/vhost_net.o CC aarch64-softmmu/hw/display/omap_dss.o CC x86_64-softmmu/hw/scsi/virtio-scsi.o CC x86_64-softmmu/hw/scsi/virtio-scsi-dataplane.o CC x86_64-softmmu/hw/scsi/vhost-scsi.o CC x86_64-softmmu/hw/timer/mc146818rtc.o CC aarch64-softmmu/hw/display/omap_lcdc.o CC x86_64-softmmu/hw/vfio/common.o CC x86_64-softmmu/hw/vfio/pci.o CC x86_64-softmmu/hw/vfio/pci-quirks.o CC aarch64-softmmu/hw/display/pxa2xx_lcd.o CC x86_64-softmmu/hw/vfio/platform.o CC aarch64-softmmu/hw/display/bcm2835_fb.o CC aarch64-softmmu/hw/display/vga.o CC aarch64-softmmu/hw/display/virtio-gpu.o CC aarch64-softmmu/hw/display/virtio-gpu-3d.o CC aarch64-softmmu/hw/display/virtio-gpu-pci.o CC aarch64-softmmu/hw/display/dpcd.o CC aarch64-softmmu/hw/display/xlnx_dp.o CC aarch64-softmmu/hw/dma/xlnx_dpdma.o CC aarch64-softmmu/hw/dma/omap_dma.o CC aarch64-softmmu/hw/dma/soc_dma.o CC aarch64-softmmu/hw/dma/pxa2xx_dma.o CC aarch64-softmmu/hw/dma/bcm2835_dma.o CC x86_64-softmmu/hw/vfio/calxeda-xgmac.o CC x86_64-softmmu/hw/vfio/amd-xgbe.o CC aarch64-softmmu/hw/gpio/omap_gpio.o CC aarch64-softmmu/hw/gpio/imx_gpio.o CC aarch64-softmmu/hw/i2c/omap_i2c.o CC aarch64-softmmu/hw/input/pxa2xx_keypad.o CC x86_64-softmmu/hw/vfio/spapr.o CC x86_64-softmmu/hw/virtio/virtio.o CC x86_64-softmmu/hw/virtio/virtio-balloon.o CC x86_64-softmmu/hw/virtio/vhost.o CC aarch64-softmmu/hw/input/tsc210x.o CC x86_64-softmmu/hw/virtio/vhost-backend.o CC aarch64-softmmu/hw/intc/armv7m_nvic.o CC 
x86_64-softmmu/hw/virtio/vhost-user.o CC aarch64-softmmu/hw/intc/exynos4210_gic.o CC aarch64-softmmu/hw/intc/exynos4210_combiner.o CC x86_64-softmmu/hw/virtio/vhost-vsock.o CC aarch64-softmmu/hw/intc/omap_intc.o CC x86_64-softmmu/hw/virtio/virtio-crypto.o CC aarch64-softmmu/hw/intc/bcm2835_ic.o CC aarch64-softmmu/hw/intc/bcm2836_control.o CC x86_64-softmmu/hw/virtio/virtio-crypto-pci.o CC aarch64-softmmu/hw/intc/allwinner-a10-pic.o CC x86_64-softmmu/hw/i386/multiboot.o CC x86_64-softmmu/hw/i386/pc_piix.o CC x86_64-softmmu/hw/i386/pc.o CC aarch64-softmmu/hw/intc/aspeed_vic.o CC x86_64-softmmu/hw/i386/pc_q35.o CC x86_64-softmmu/hw/i386/pc_sysfw.o /tmp/qemu-test/src/hw/i386/pc_piix.c: In function ‘igd_passthrough_isa_bridge_create’: /tmp/qemu-test/src/hw/i386/pc_piix.c:1046: warning: ‘pch_rev_id’ may be used uninitialized in this function CC aarch64-softmmu/hw/intc/arm_gicv3_cpuif.o CC aarch64-softmmu/hw/misc/ivshmem.o CC aarch64-softmmu/hw/misc/arm_sysctl.o CC x86_64-softmmu/hw/i386/x86-iommu.o CC aarch64-softmmu/hw/misc/cbus.o CC x86_64-softmmu/hw/i386/intel_iommu.o CC x86_64-softmmu/hw/i386/amd_iommu.o CC aarch64-softmmu/hw/misc/exynos4210_pmu.o CC x86_64-softmmu/hw/i386/kvmvapic.o CC aarch64-softmmu/hw/misc/imx_ccm.o CC aarch64-softmmu/hw/misc/imx31_ccm.o CC aarch64-softmmu/hw/misc/imx25_ccm.o CC x86_64-softmmu/hw/i386/acpi-build.o CC x86_64-softmmu/hw/i386/pci-assign-load-rom.o CC x86_64-softmmu/hw/i386/kvm/clock.o CC aarch64-softmmu/hw/misc/imx6_ccm.o CC x86_64-softmmu/hw/i386/kvm/apic.o CC aarch64-softmmu/hw/misc/imx6_src.o CC aarch64-softmmu/hw/misc/mst_fpga.o CC x86_64-softmmu/hw/i386/kvm/i8259.o CC aarch64-softmmu/hw/misc/omap_gpmc.o CC aarch64-softmmu/hw/misc/omap_clk.o CC x86_64-softmmu/hw/i386/kvm/ioapic.o CC x86_64-softmmu/hw/i386/kvm/i8254.o CC aarch64-softmmu/hw/misc/omap_l4.o CC aarch64-softmmu/hw/misc/omap_sdrc.o CC aarch64-softmmu/hw/misc/omap_tap.o CC x86_64-softmmu/hw/i386/kvm/pci-assign.o CC x86_64-softmmu/target-i386/translate.o CC 
aarch64-softmmu/hw/misc/bcm2835_mbox.o CC x86_64-softmmu/target-i386/helper.o CC x86_64-softmmu/target-i386/cpu.o CC aarch64-softmmu/hw/misc/bcm2835_property.o CC x86_64-softmmu/target-i386/bpt_helper.o CC aarch64-softmmu/hw/misc/zynq_slcr.o CC aarch64-softmmu/hw/misc/zynq-xadc.o CC x86_64-softmmu/target-i386/excp_helper.o CC x86_64-softmmu/target-i386/fpu_helper.o CC x86_64-softmmu/target-i386/cc_helper.o CC x86_64-softmmu/target-i386/int_helper.o CC x86_64-softmmu/target-i386/svm_helper.o CC x86_64-softmmu/target-i386/smm_helper.o CC aarch64-softmmu/hw/misc/stm32f2xx_syscfg.o CC x86_64-softmmu/target-i386/misc_helper.o CC x86_64-softmmu/target-i386/mem_helper.o CC aarch64-softmmu/hw/misc/edu.o CC x86_64-softmmu/target-i386/seg_helper.o CC x86_64-softmmu/target-i386/mpx_helper.o CC aarch64-softmmu/hw/misc/auxbus.o CC x86_64-softmmu/target-i386/gdbstub.o CC aarch64-softmmu/hw/misc/aspeed_scu.o CC x86_64-softmmu/target-i386/machine.o
/tmp/qemu-test/src/hw/i386/intel_iommu.c: In function ‘vtd_switch_address_space’:
/tmp/qemu-test/src/hw/i386/intel_iommu.c:1196: warning: implicit declaration of function ‘trace_vtd_switch_address_space’
/tmp/qemu-test/src/hw/i386/intel_iommu.c:1196: warning: nested extern declaration of ‘trace_vtd_switch_address_space’
/tmp/qemu-test/src/hw/i386/intel_iommu.c: In function ‘vtd_find_add_as’:
/tmp/qemu-test/src/hw/i386/intel_iommu.c:2412: error: ‘name’ undeclared (first use in this function)
/tmp/qemu-test/src/hw/i386/intel_iommu.c:2412: error: (Each undeclared identifier is reported only once
/tmp/qemu-test/src/hw/i386/intel_iommu.c:2412: error: for each function it appears in.)
make[1]: *** [hw/i386/intel_iommu.o] Error 1
make[1]: *** Waiting for unfinished jobs.... 
CC x86_64-softmmu/target-i386/arch_memory_mapping.o CC aarch64-softmmu/hw/misc/aspeed_sdmc.o CC aarch64-softmmu/hw/net/virtio-net.o CC aarch64-softmmu/hw/net/vhost_net.o CC aarch64-softmmu/hw/pcmcia/pxa2xx.o CC aarch64-softmmu/hw/scsi/virtio-scsi.o CC aarch64-softmmu/hw/scsi/virtio-scsi-dataplane.o CC aarch64-softmmu/hw/sd/omap_mmc.o CC aarch64-softmmu/hw/scsi/vhost-scsi.o CC aarch64-softmmu/hw/sd/pxa2xx_mmci.o CC aarch64-softmmu/hw/ssi/omap_spi.o CC aarch64-softmmu/hw/ssi/imx_spi.o CC aarch64-softmmu/hw/timer/exynos4210_mct.o CC aarch64-softmmu/hw/timer/exynos4210_pwm.o CC aarch64-softmmu/hw/timer/exynos4210_rtc.o CC aarch64-softmmu/hw/timer/omap_gptimer.o CC aarch64-softmmu/hw/timer/omap_synctimer.o CC aarch64-softmmu/hw/timer/pxa2xx_timer.o CC aarch64-softmmu/hw/timer/digic-timer.o CC aarch64-softmmu/hw/timer/allwinner-a10-pit.o CC aarch64-softmmu/hw/usb/tusb6010.o CC aarch64-softmmu/hw/vfio/common.o CC aarch64-softmmu/hw/vfio/pci.o CC aarch64-softmmu/hw/vfio/pci-quirks.o CC aarch64-softmmu/hw/vfio/platform.o CC aarch64-softmmu/hw/vfio/calxeda-xgmac.o CC aarch64-softmmu/hw/vfio/amd-xgbe.o CC aarch64-softmmu/hw/vfio/spapr.o CC aarch64-softmmu/hw/virtio/virtio.o CC aarch64-softmmu/hw/virtio/virtio-balloon.o CC aarch64-softmmu/hw/virtio/vhost.o CC aarch64-softmmu/hw/virtio/vhost-backend.o CC aarch64-softmmu/hw/virtio/vhost-user.o CC aarch64-softmmu/hw/virtio/vhost-vsock.o CC aarch64-softmmu/hw/virtio/virtio-crypto.o CC aarch64-softmmu/hw/virtio/virtio-crypto-pci.o CC aarch64-softmmu/hw/arm/boot.o CC aarch64-softmmu/hw/arm/collie.o CC aarch64-softmmu/hw/arm/exynos4_boards.o CC aarch64-softmmu/hw/arm/gumstix.o CC aarch64-softmmu/hw/arm/highbank.o CC aarch64-softmmu/hw/arm/digic_boards.o CC aarch64-softmmu/hw/arm/integratorcp.o CC aarch64-softmmu/hw/arm/mainstone.o CC aarch64-softmmu/hw/arm/musicpal.o CC aarch64-softmmu/hw/arm/nseries.o CC aarch64-softmmu/hw/arm/omap_sx1.o CC aarch64-softmmu/hw/arm/palm.o CC aarch64-softmmu/hw/arm/realview.o CC 
aarch64-softmmu/hw/arm/spitz.o CC aarch64-softmmu/hw/arm/stellaris.o CC aarch64-softmmu/hw/arm/tosa.o CC aarch64-softmmu/hw/arm/versatilepb.o CC aarch64-softmmu/hw/arm/vexpress.o CC aarch64-softmmu/hw/arm/virt.o /tmp/qemu-test/src/hw/i386/acpi-build.c: In function ‘build_append_pci_bus_devices’: /tmp/qemu-test/src/hw/i386/acpi-build.c:501: warning: ‘notify_method’ may be used uninitialized in this function CC aarch64-softmmu/hw/arm/xilinx_zynq.o CC aarch64-softmmu/hw/arm/z2.o CC aarch64-softmmu/hw/arm/virt-acpi-build.o CC aarch64-softmmu/hw/arm/netduino2.o CC aarch64-softmmu/hw/arm/sysbus-fdt.o CC aarch64-softmmu/hw/arm/exynos4210.o CC aarch64-softmmu/hw/arm/armv7m.o CC aarch64-softmmu/hw/arm/pxa2xx.o CC aarch64-softmmu/hw/arm/pxa2xx_gpio.o CC aarch64-softmmu/hw/arm/pxa2xx_pic.o CC aarch64-softmmu/hw/arm/digic.o CC aarch64-softmmu/hw/arm/omap1.o CC aarch64-softmmu/hw/arm/omap2.o CC aarch64-softmmu/hw/arm/strongarm.o CC aarch64-softmmu/hw/arm/allwinner-a10.o CC aarch64-softmmu/hw/arm/cubieboard.o CC aarch64-softmmu/hw/arm/bcm2835_peripherals.o CC aarch64-softmmu/hw/arm/bcm2836.o CC aarch64-softmmu/hw/arm/raspi.o CC aarch64-softmmu/hw/arm/stm32f205_soc.o CC aarch64-softmmu/hw/arm/xlnx-zynqmp.o CC aarch64-softmmu/hw/arm/xlnx-ep108.o CC aarch64-softmmu/hw/arm/fsl-imx25.o CC aarch64-softmmu/hw/arm/imx25_pdk.o CC aarch64-softmmu/hw/arm/fsl-imx31.o CC aarch64-softmmu/hw/arm/kzm.o CC aarch64-softmmu/hw/arm/fsl-imx6.o CC aarch64-softmmu/hw/arm/sabrelite.o CC aarch64-softmmu/hw/arm/aspeed_soc.o CC aarch64-softmmu/hw/arm/aspeed.o CC aarch64-softmmu/target-arm/arm-semi.o CC aarch64-softmmu/target-arm/machine.o CC aarch64-softmmu/target-arm/psci.o CC aarch64-softmmu/target-arm/arch_dump.o CC aarch64-softmmu/target-arm/monitor.o CC aarch64-softmmu/target-arm/kvm-stub.o CC aarch64-softmmu/target-arm/translate.o CC aarch64-softmmu/target-arm/helper.o CC aarch64-softmmu/target-arm/op_helper.o CC aarch64-softmmu/target-arm/cpu.o CC aarch64-softmmu/target-arm/neon_helper.o CC 
aarch64-softmmu/target-arm/iwmmxt_helper.o CC aarch64-softmmu/target-arm/gdbstub.o CC aarch64-softmmu/target-arm/cpu64.o CC aarch64-softmmu/target-arm/translate-a64.o CC aarch64-softmmu/target-arm/helper-a64.o CC aarch64-softmmu/target-arm/gdbstub64.o CC aarch64-softmmu/target-arm/crypto_helper.o CC aarch64-softmmu/target-arm/arm-powerctl.o GEN trace/generated-helpers.c CC aarch64-softmmu/trace/control-target.o CC aarch64-softmmu/gdbstub-xml.o CC aarch64-softmmu/trace/generated-helpers.o
make: *** [subdir-x86_64-softmmu] Error 2
make: *** Waiting for unfinished jobs....
/tmp/qemu-test/src/target-arm/translate-a64.c: In function ‘handle_shri_with_rndacc’:
/tmp/qemu-test/src/target-arm/translate-a64.c:6391: warning: ‘tcg_src_hi’ may be used uninitialized in this function
/tmp/qemu-test/src/target-arm/translate-a64.c: In function ‘disas_simd_scalar_two_reg_misc’:
/tmp/qemu-test/src/target-arm/translate-a64.c:8118: warning: ‘rmode’ may be used uninitialized in this function
LINK aarch64-softmmu/qemu-system-aarch64
make[1]: *** [docker-run] Error 2
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-0ljnwqpe/src'
make: *** [docker-run-test-quick@centos6] Error 2
=== OUTPUT END ===
Test command exited with code: 2
---
Email generated automatically by Patchew [http://patchew.org/]. Please send your feedback to patchew-devel@freelists.org
On Tue, 20 Dec 2016 14:38:01 +0800 Peter Xu <peterx@redhat.com> wrote: > On Mon, Dec 19, 2016 at 09:52:52PM -0700, Alex Williamson wrote: > > [...] > > > > Yes, this patch just tried to move VT-d forward a bit, rather than do > > > it once and for all. I think we can do better than this in the future, > > > for example, one address space per guest IOMMU domain (as you have > > > mentioned before). However I suppose that will need more work (though I > > > still can't estimate the amount of work). So I am considering > > > enabling the device assignments functionally first, then we can further > > > improve based on a workable version. Same thoughts apply to the IOMMU > > > replay RFC series. > > > > I'm not arguing against it, I'm just trying to set expectations for > > where this gets us. An AddressSpace per guest iommu domain seems like > > the right model for QEMU, but it has some fundamental issues with > > vfio. We currently tie a QEMU AddressSpace to a vfio container, which > > represents the host IOMMU context. The AddressSpace of a device is > > currently assumed to be fixed in QEMU, guest IOMMU domains clearly > > are not. vfio only lets us have access to a device while it's > > protected within a container. Therefore in order to move a device to a > > different AddressSpace based on the guest domain configuration, we'd > > need to tear down the vfio configuration, including releasing the > > device. > > I assume this is a VT-d specific issue, right? Looks like ppc is using a > totally different way to manage the mapping, and devices can share the > same address space. It's only VT-d specific in that VT-d is the only vIOMMU we have for x86. ppc has a much different host IOMMU architecture and their VM architecture requires an IOMMU. The ppc model has a notion of preregistration to help with this, among other things. > > > Regarding the locked memory accounting issue: do we have an existing > > > way to do the accounting? 
If so, would you (or anyone) please > > > elaborate a bit? If not, is that an ongoing/planned work? > > > > As I describe above, there's a vfio container per AddressSpace, each > > container is an IOMMU domain in the host. In the guest, an IOMMU > > domain can include multiple AddressSpaces, one for each context entry > > that's part of the domain. When the guest programs a translation for > > an IOMMU domain, that maps a guest IOVA to a guest physical address, > > for each AddressSpace. Each AddressSpace is backed by a vfio > > container, which needs to pin the pages of that translation in order to > > get a host physical address, which then gets programmed into the host > > IOMMU domain with the guest-IOVA and host physical address. The > > pinning process is where page accounting is done. It's done per vfio > > context. The worst case scenario for accounting is thus when VT-d is > > present but disabled (or in passthrough mode) as each AddressSpace > > duplicates address_space_memory and every page of guest memory is > > pinned and accounted for each vfio container. > > IIUC this accounting issue will solve itself if we can solve the > previous issue. While we don't have it now, so ... Not sure what "previous issue" is referring to here. > > That's the existing way we do accounting. There is no current > > development that I'm aware of to change this. As above, the simplest > > stop-gap solution is that libvirt would need to be aware when VT-d is > > present for a VM and use a different algorithm to set QEMU locked > > memory limit, but it's not without its downsides. > > ... here I think it's sensible to consider a specific algorithm for > vt-d use case. I am just curious about how should we define this > algorithm. > > First of all, when the devices are not sharing domain (or say, one > guest iommu domain per assigned device), everything should be fine. No, each domain could map the entire guest address space. 
If we're talking about a domain per device for use with the Linux DMA API, then it's unlikely that the sum of mapped pages across all the domains will exceed the current libvirt set locked memory limit. However, that's exactly the configuration where we expect to have abysmal performance. As soon as we recommend the guest boot with iommu=pt, then each container will be mapping and pinning the entire VM address space. > No > special algorithm needed. IMHO the problem will happen only if there > are assigned devices that share a same address space (either system, > or specific iommu domain). In that case, the accounted value (or say, > current->mm->locked_vm iiuc) will be bigger than the real locked > memory size. > > However, I think the problem is whether devices will be put into same > address space depends on guest behavior - the guest can either use > iommu=pt, or manually putting devices into the same guest iommu region > to achieve that. But from hypervisor POV, how should we estimate this? > Can we really? The simple answer is that each device needs to be able to map the entire VM address space and therefore when a VM is configured with VT-d, libvirt needs to multiply the current locked memory settings for assigned devices by the number of devices (groups actually) assigned. There are (at least) two problems with this though. The first is that we expect QEMU to use this increased locked memory limits for duplicate accounting of the same pages, but an exploited user process could take advantage of it and cause problems. Not optimal. The second problem relates to the usage of the IOVA address space and the assumption that a given container will map no more than the VM address space. When no vIOMMU is exposed to the VM, QEMU manages the container IOVA space and we know that QEMU is only mapping VM RAM and therefore mappings are bound by the size of the VM. With a vIOMMU, the guest is in control of the IOVA space and can map up to the limits of the vIOMMU. 
The guest can map a single 4KB page to every IOVA up to that limit and we'll account that page each time. So even valid (though perhaps not useful) cases within the guest can hit that locking limit. This suggests that we not only need a vfio IOMMU model that tracks pfns per domain to avoid duplicate accounting, but we need some way to share that tracking between domains. Then we can go back to allowing a locked memory limit up to the VM RAM size as the correct and complete solution (plus some sort of shadow page table based mapping for any hope of bearable performance for dynamic usage). > > Alternatively, a new > > IOMMU model would need to be developed for vfio. The type1 model was > > only ever intended to be used for relatively static user mappings and I > > expect it to have horrendous performance when backing a dynamic guest > > IOMMU domain. Really the only guest IOMMU usage model that makes any > > sort of sense with type1 is to run the guest with passthrough (iommu=pt) > > and only pull devices out of passthrough for relatively static mapping > > cases within the guest userspace (nested assigned devices or dpdk). If > > the expectation is that we just need this one little bit more code to > > make vfio usable in the guest, that may be true, but it really is just > > barely usable. It's not going to be fast for any sort of dynamic > > mapping and it's going to have accounting issues that are not > > compatible with how libvirt sets locked memory limits for QEMU as soon > > as you go beyond a single device. Thanks, > > I can totally understand that the performance will suck if dynamic > mapping is used. AFAIU this work will only be used with static dma > mapping like running DPDK in guest (besides other trivial goals, like, > development purpose). We can't control how a feature is used, which is why I'm trying to make sure this doesn't come as a surprise to anyone. 
> Regarding "the other" iommu model you mentioned besides type1, are > there any existing discussions out there? Any further learning > material/links would be greatly welcomed. Nope. You and Aviv are the only ones doing work that suggests a need for a new IOMMU model. Thanks, Alex
On Tue, Dec 20, 2016 at 12:16:50PM +0800, Peter Xu wrote: > On Tue, Dec 20, 2016 at 10:30:12AM +1100, David Gibson wrote: > > [...] > > > > +static void vtd_switch_address_space(IntelIOMMUState *s, bool enabled) > > > +{ > > > + GHashTableIter iter; > > > + VTDBus *vtd_bus; > > > + VTDAddressSpace *as; > > > + int i; > > > + > > > + g_hash_table_iter_init(&iter, s->vtd_as_by_busptr); > > > + while (g_hash_table_iter_next (&iter, NULL, (void**)&vtd_bus)) { > > > + for (i = 0; i < X86_IOMMU_PCI_DEVFN_MAX; i++) { > > > + as = vtd_bus->dev_as[i]; > > > + if (as == NULL) { > > > + continue; > > > + } > > > + trace_vtd_switch_address_space(pci_bus_num(vtd_bus->bus), > > > + VTD_PCI_SLOT(i), VTD_PCI_FUNC(i), > > > + enabled); > > > + if (enabled) { > > > + memory_region_add_subregion_overlap(&as->root, 0, > > > + &as->iommu, 2); > > > + } else { > > > + memory_region_del_subregion(&as->root, &as->iommu); > > > > Why not use memory_region_set_enabled() rather than actually > > adding/deleting the subregion? > > Good idea, thanks. :-) > > [...] 
> > > > @@ -2343,15 +2378,47 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn) > > > vtd_dev_as->devfn = (uint8_t)devfn; > > > vtd_dev_as->iommu_state = s; > > > vtd_dev_as->context_cache_entry.context_cache_gen = 0; > > > + > > > + /* > > > + * When DMAR is disabled, memory region relationships look > > > + * like: > > > + * > > > + * 0000000000000000-ffffffffffffffff (prio 0, RW): vtd_root > > > + * 0000000000000000-ffffffffffffffff (prio 1, RW): vtd_sys_alias > > > + * 00000000fee00000-00000000feefffff (prio 64, RW): intel_iommu_ir > > > + * > > > + * When DMAR is enabled, it becomes: > > > + * > > > + * 0000000000000000-ffffffffffffffff (prio 0, RW): vtd_root > > > + * 0000000000000000-ffffffffffffffff (prio 2, RW): intel_iommu > > > + * 0000000000000000-ffffffffffffffff (prio 1, RW): vtd_sys_alias > > > + * 00000000fee00000-00000000feefffff (prio 64, RW): intel_iommu_ir > > > + * > > > + * The intel_iommu region is dynamically added/removed. > > > + */ > > > memory_region_init_iommu(&vtd_dev_as->iommu, OBJECT(s), > > > &s->iommu_ops, "intel_iommu", UINT64_MAX); > > > > I'm almost certain UINT64_MAX is wrong here. For one thing it would > > collide with PCI BARs. For another, I can't imagine that the IOMMU > > page tables can really span an entire 2^64 space. > > Could you explain why the device address space here has anything to do with > PCI BARs? I thought BARs are for the CPU address space only (so that the CPU > can access PCI registers in an MMIO manner), am I wrong? In short, yes. So, first think about vanilla PCI - most things are PCI-E these days, but the PCI addressing model which was designed for the old hardware is still mostly the same. With plain PCI, you have a physical bus over which address and data cycles pass. Those cycles don't distinguish between transfers from host to device or device to host. Each address cycle just gives which address space: configuration, IO or memory, and an address. 
Devices respond to addresses within their BARs, typically such cycles will come from the host, but they don't have to - a device is able to send cycles to another device (peer to peer DMA). Meanwhile the host bridge will respond to addresses within certain DMA windows, propagating those accesses onwards to system memory. How many DMA windows there are, their size, location and whether they're mapped directly or via an IOMMU depends on the model of host bridge. On x86, traditionally, PCI addresses 0..<somewhere> were simply mapped directly to memory addresses 0..<somewhere>, identity mapping RAM into PCI space. BARs would be assigned above <somewhere>, so they don't collide. I suspect old enough machines will have <somewhere> == 2G, leaving 2G..4G for the BARs of 32-bit devices. More modern x86 bridges must have provisions for accessing memory above 4G, but I'm not entirely certain how that works. PAPR traditionally also had a DMA window from 0..2G, however instead of being direct mapped to RAM, it is always translated via an IOMMU. More modern PAPR systems have that window by default, but allow the OS to remove it and configure up to 2 DMA windows of variable length and page size. Various other platforms have various other DMA window arrangements. With PCI-E, of course, upstream and downstream cycles are distinct, and peer to peer DMA isn't usually possible (unless a switch is configured specially to allow it by forwarding cycles from one downstream port to another). But the address model remains logically the same: there is just one PCI memory space and both device BARs and host DMA windows live within it. Firmware and/or the OS need to know the details of the platform's host bridge, and configure both the BARs and the DMA windows so that they don't collide. > I think we should have a big enough IOMMU region size here. If a device > writes to invalid addresses, IMHO we should trap it and report to > the guest. 
If we have a smaller size than UINT64_MAX, how can we trap this > behavior and report for the whole address space (it should cover [0, > 2^64-1])? That's not how the IOMMU works. How it traps is dependent on the specific IOMMU model, but generally they'll only even look at cycles which lie within the IOMMU's DMA window. On x86 I'm pretty sure that window will be large, but it won't be 2^64. It's also likely to have a gap between 2..4GiB to allow room for the BARs of 32-bit devices. > > > > > + memory_region_init_alias(&vtd_dev_as->sys_alias, OBJECT(s), > > > + "vtd_sys_alias", get_system_memory(), > > > + 0, memory_region_size(get_system_memory())); > > > > I strongly suspect using memory_region_size(get_system_memory()) is > > also incorrect here. System memory has size UINT64_MAX, but I'll bet > > you you can't actually access all of that via PCI space (again, it > > would collide with actual PCI BARs). I also suspect you can't reach > > CPU MMIO regions via the PCI DMA space. > > Hmm, sounds correct. > > However if so, won't we have the same problem even without an IOMMU? See > pci_device_iommu_address_space() - address_space_memory will be the > default if we have no IOMMU protection, and that will cover e.g. CPU > MMIO regions as well. True. That default is basically assuming that both the host bridge's DMA windows, and its outbound IO and memory windows are identity mapped between the system bus and the PCI address space. I suspect that's rarely 100% true, but it's close enough to work on a fair few platforms. But since you're building a more accurate model of the x86 host bridge's behaviour here, you might as well try to get it as accurate as possible. > > So, I think you should find out what this limit actually is and > > restrict the alias to that window. > > /me needs some more reading to figure this out. Still not quite > familiar with the whole VM memory regions. Hints are welcomed... It's not really a question of VM memory regions. 
Rather it's about how the host bridge translates between cpu side and PCI side addresses. > > > memory_region_init_io(&vtd_dev_as->iommu_ir, OBJECT(s), > > > &vtd_mem_ir_ops, s, "intel_iommu_ir", > > > VTD_INTERRUPT_ADDR_SIZE); > > > - memory_region_add_subregion(&vtd_dev_as->iommu, VTD_INTERRUPT_ADDR_FIRST, > > > - &vtd_dev_as->iommu_ir); > > > - address_space_init(&vtd_dev_as->as, > > > - &vtd_dev_as->iommu, "intel_iommu"); > > > + memory_region_init(&vtd_dev_as->root, OBJECT(s), > > > + "vtd_root", UINT64_MAX); > > > + memory_region_add_subregion_overlap(&vtd_dev_as->root, > > > + VTD_INTERRUPT_ADDR_FIRST, > > > + &vtd_dev_as->iommu_ir, 64); > > > + address_space_init(&vtd_dev_as->as, &vtd_dev_as->root, name); > > > + memory_region_add_subregion_overlap(&vtd_dev_as->root, 0, > > > + &vtd_dev_as->sys_alias, 1); > > > + if (s->dmar_enabled) { > > > + memory_region_add_subregion_overlap(&vtd_dev_as->root, 0, > > > + &vtd_dev_as->iommu, 2); > > > + } > > > > Hmm. You have the IOMMU translated region overlaying the > > direct-mapped alias. You enable and disable the IOMMU subregion, but > > you always leave the direct mapping enabled. You might get away with > > this because the size of the IOMMU region is UINT64_MAX, which will > > overlay everything - but as above, I think that's wrong. If that's > > changed then guest devices may be able to access portions of the raw > > address space outside the IOMMU mapped region, which could break the > > guest's expectations of device isolation. > > > > I think it would be much safer to disable the system memory alias when > > the IOMMU is enabled. > > Reasonable. Will adopt. > > Thanks! > > -- peterx >
On Tue, Dec 20, 2016 at 05:04:33PM -0700, Alex Williamson wrote: > On Tue, 20 Dec 2016 14:38:01 +0800 > Peter Xu <peterx@redhat.com> wrote: > > > On Mon, Dec 19, 2016 at 09:52:52PM -0700, Alex Williamson wrote: > > > > [...] > > > > > > Yes, this patch just tried to move VT-d forward a bit, rather than do > > > > it once and for all. I think we can do better than this in the future, > > > > for example, one address space per guest IOMMU domain (as you have > > > > mentioned before). However I suppose that will need more work (though I > > > > still can't estimate the amount of work). So I am considering > > > > enabling the device assignments functionally first, then we can further > > > > improve based on a workable version. Same thoughts apply to the IOMMU > > > > replay RFC series. > > > > > > I'm not arguing against it, I'm just trying to set expectations for > > > where this gets us. An AddressSpace per guest iommu domain seems like > > > the right model for QEMU, but it has some fundamental issues with > > > vfio. We currently tie a QEMU AddressSpace to a vfio container, which > > > represents the host IOMMU context. The AddressSpace of a device is > > > currently assumed to be fixed in QEMU, guest IOMMU domains clearly > > > are not. vfio only lets us have access to a device while it's > > > protected within a container. Therefore in order to move a device to a > > > different AddressSpace based on the guest domain configuration, we'd > > > need to tear down the vfio configuration, including releasing the > > > device. > > > > I assume this is a VT-d specific issue, right? Looks like ppc is using a > > totally different way to manage the mapping, and devices can share the > > same address space. > > It's only VT-d specific in that VT-d is the only vIOMMU we have for > x86. ppc has a much different host IOMMU architecture and their VM > architecture requires an IOMMU. The ppc model has a notion of > preregistration to help with this, among other things. 
> > > > > Regarding the locked memory accounting issue: do we have an existing > > > > way to do the accounting? If so, would you (or anyone) please > > > > elaborate a bit? If not, is that ongoing/planned work? > > > > > > As I describe above, there's a vfio container per AddressSpace, each > > > container is an IOMMU domain in the host. In the guest, an IOMMU > > > domain can include multiple AddressSpaces, one for each context entry > > > that's part of the domain. When the guest programs a translation for > > > an IOMMU domain, that maps a guest IOVA to a guest physical address, > > > for each AddressSpace. Each AddressSpace is backed by a vfio > > > container, which needs to pin the pages of that translation in order to > > > get a host physical address, which then gets programmed into the host > > > IOMMU domain with the guest-IOVA and host physical address. The > > > pinning process is where page accounting is done. It's done per vfio > > > context. The worst case scenario for accounting is thus when VT-d is > > > present but disabled (or in passthrough mode) as each AddressSpace > > > duplicates address_space_memory and every page of guest memory is > > > pinned and accounted for each vfio container. > > > > IIUC this accounting issue will solve itself if we can solve the > > previous issue. While we don't have it now, so ... > > Not sure what "previous issue" is referring to here. Here I meant if we can let devices share the same VFIOAddressSpace in the future for VT-d emulation (just like ppc), then we won't need to worry about the duplicated accounting issue again. In that case, when N devices are put into the same guest iommu domain, they'll share a single VFIOAddressSpace in QEMU, and the mapping will be counted only once. But I think this does not solve the problem you mentioned below - yes, it looks like a guest user space driver can map the whole 39/48-bit address space. That's something I failed to realize before... 
> > > > That's the existing way we do accounting. There is no current > > > development that I'm aware of to change this. As above, the simplest > > > stop-gap solution is that libvirt would need to be aware when VT-d is > > > present for a VM and use a different algorithm to set QEMU locked > > > memory limit, but it's not without its downsides. > > > > ... here I think it's sensible to consider a specific algorithm for > > vt-d use case. I am just curious about how should we define this > > algorithm. > > > > First of all, when the devices are not sharing domain (or say, one > > guest iommu domain per assigned device), everything should be fine. > > No, each domain could map the entire guest address space. If we're > talking about a domain per device for use with the Linux DMA API, then > it's unlikely that the sum of mapped pages across all the domains will > exceed the current libvirt set locked memory limit. However, that's > exactly the configuration where we expect to have abysmal performance. > As soon as we recommend the guest boot with iommu=pt, then each > container will be mapping and pinning the entire VM address space. > > > No > > special algorithm needed. IMHO the problem will happen only if there > > are assigned devices that share a same address space (either system, > > or specific iommu domain). In that case, the accounted value (or say, > > current->mm->locked_vm iiuc) will be bigger than the real locked > > memory size. > > > > However, I think the problem is whether devices will be put into same > > address space depends on guest behavior - the guest can either use > > iommu=pt, or manually putting devices into the same guest iommu region > > to achieve that. But from hypervisor POV, how should we estimate this? > > Can we really? 
> > The simple answer is that each device needs to be able to map the > entire VM address space and therefore when a VM is configured with > VT-d, libvirt needs to multiply the current locked memory settings for > assigned devices by the number of devices (groups actually) assigned. > There are (at least) two problems with this though. The first is that > we expect QEMU to use this increased locked memory limits for duplicate > accounting of the same pages, but an exploited user process could take > advantage of it and cause problems. Not optimal. The second problem > relates to the usage of the IOVA address space and the assumption that > a given container will map no more than the VM address space. When no > vIOMMU is exposed to the VM, QEMU manages the container IOVA space and > we know that QEMU is only mapping VM RAM and therefore mappings are > bound by the size of the VM. With a vIOMMU, the guest is in control of > the IOVA space and can map up to the limits of the vIOMMU. The guest > can map a single 4KB page to every IOVA up to that limit and we'll > account that page each time. So even valid (though perhaps not useful) > cases within the guest can hit that locking limit. So what happens if a VM exceeds the locked memory limit? Are we only using cgroup so that the DMA_MAP ioctl() will just fail? Or is there anything more for libvirt to do when a VM exceeds this limit? Another question about the vfio-pci module: I see that vfio_listener_region_add() will crash the VM if it fails to do the dma map and we'll get this: hw_error("vfio: DMA mapping failed, unable to continue"); However in vfio_iommu_map_notify(), we don't have such a hard requirement - when the dma map fails, we just error_report() without quitting the VM. Could I ask why we have different behavior here? Any special concerns? 
IMHO if the guest program abuses the vIOMMU mappings and reaches the locked memory limit that libvirt assigned to this VM (I think for now N*VM_RAM_SIZE is a big enough bound), we should just crash the VM to make sure it won't affect others (I assume mapping more than N*VM_RAM_SIZE is a strong clue that the guest is doing something dangerous and illegal). [...] > > I can totally understand that the performance will suck if dynamic > > mapping is used. AFAIU this work will only be used with static dma > > mapping like running DPDK in guest (besides other trivial goals, like, > > development purpose). > > We can't control how a feature is used, which is why I'm trying to make > sure this doesn't come as a surprise to anyone. Yes. I agree we should try our best to let people know this before they start to use it. Thanks, -- peterx
On Mon, Dec 19, 2016 at 09:52:52PM -0700, Alex Williamson wrote: > On Tue, 20 Dec 2016 11:44:41 +0800 > Peter Xu <peterx@redhat.com> wrote: > > > On Mon, Dec 19, 2016 at 09:56:50AM -0700, Alex Williamson wrote: > > > On Mon, 19 Dec 2016 22:41:26 +0800 > > > Peter Xu <peterx@redhat.com> wrote: > > > > > > > This is preparation work to finally enabled dynamic switching ON/OFF for > > > > VT-d protection. The old VT-d codes is using static IOMMU region, and > > > > that won't satisfy vfio-pci device listeners. > > > > > > > > Let me explain. > > > > > > > > vfio-pci devices depend on the memory region listener and IOMMU replay > > > > mechanism to make sure the device mapping is coherent with the guest > > > > even if there are domain switches. And there are two kinds of domain > > > > switches: > > > > > > > > (1) switch from domain A -> B > > > > (2) switch from domain A -> no domain (e.g., turn DMAR off) > > > > > > > > Case (1) is handled by the context entry invalidation handling by the > > > > VT-d replay logic. What the replay function should do here is to replay > > > > the existing page mappings in domain B. > > > > > > > > However for case (2), we don't want to replay any domain mappings - we > > > > just need the default GPA->HPA mappings (the address_space_memory > > > > mapping). And this patch helps on case (2) to build up the mapping > > > > automatically by leveraging the vfio-pci memory listeners. > > > > > > > > Another important thing that this patch does is to seperate > > > > IR (Interrupt Remapping) from DMAR (DMA Remapping). IR region should not > > > > depend on the DMAR region (like before this patch). It should be a > > > > standalone region, and it should be able to be activated without > > > > DMAR (which is a common behavior of Linux kernel - by default it enables > > > > IR while disabled DMAR). 
> > > > > > > > > This seems like an improvement, but I will note that there are existing > > > locked memory accounting issues inherent with VT-d and vfio. With > > > VT-d, each device has a unique AddressSpace. This requires that each > > > is managed via a separate vfio container. Each container is accounted > > > for separately for locked pages. libvirt currently only knows that if > > > any vfio devices are attached that the locked memory limit for the > > > process needs to be set sufficient for the VM memory. When VT-d is > > > involved, we either need to figure out how to associate otherwise > > > independent vfio containers to share locked page accounting or teach > > > libvirt that the locked memory requirement needs to be multiplied by > > > the number of attached vfio devices. The latter seems far less > > > complicated but reduces the containment of QEMU a bit since the > > > process has the ability to lock potentially many multiples of the VM > > > address size. Thanks, > > > > Yes, this patch just tried to move VT-d forward a bit, rather than do > > it once and for all. I think we can do better than this in the future, > > for example, one address space per guest IOMMU domain (as you have > > mentioned before). However I suppose that will need more work (which I > > still can't estimate on the amount of work). So I am considering to > > enable the device assignments functionally first, then we can further > > improve based on a workable version. Same thoughts apply to the IOMMU > > replay RFC series. > > I'm not arguing against it, I'm just trying to set expectations for > where this gets us. An AddressSpace per guest iommu domain seems like > the right model for QEMU, but it has some fundamental issues with > vfio. We currently tie a QEMU AddressSpace to a vfio container, which > represents the host IOMMU context. 
The AddressSpace of a device is > currently assumed to be fixed in QEMU, Actually, I think we can work around this: you could set up a separate AddressSpace for each device which consists of nothing but a big alias into an AddressSpace associated with the current IOMMU domain. As the device is moved between domains you remove/replace the alias region - or even replace it with an alias directly into system memory when the IOMMU is disabled. > guest IOMMU domains clearly > are not. vfio only lets us have access to a device while it's > protected within a container. Therefore in order to move a device to a > different AddressSpace based on the guest domain configuration, we'd > need to tear down the vfio configuration, including releasing the > device. > > > Regarding to the locked memory accounting issue: do we have existing > > way to do the accounting? If so, would you (or anyone) please > > elaborate a bit? If not, is that an ongoing/planned work? > > As I describe above, there's a vfio container per AddressSpace, each > container is an IOMMU domain in the host. In the guest, an IOMMU > domain can include multiple AddressSpaces, one for each context entry > that's part of the domain. When the guest programs a translation for > an IOMMU domain, that maps a guest IOVA to a guest physical address, > for each AddressSpace. Each AddressSpace is backed by a vfio > container, which needs to pin the pages of that translation in order to > get a host physical address, which then gets programmed into the host > IOMMU domain with the guest-IOVA and host physical address. The > pinning process is where page accounting is done. Ah.. and I take it the accounting isn't smart enough to tell that the same page is already pinned elsewhere. I guess that would take rather a lot of extra bookkeeping. > It's done per vfio context.
The worst case scenario for accounting is thus when VT-d is > present but disabled (or in passthrough mode) as each AddressSpace > duplicates address_space_memory and every page of guest memory is > pinned and accounted for each vfio container. Hmm. I imagine you'll need a copy of the current translation tables for a guest domain regardless of VFIO involvement. So, when a domain is unused - i.e. has no devices in it, won't the container have all the groups detached and so give up all the memory. Obviously when a device is assigned to the domain you'll need to replay the current mappings into VFIO. > That's the existing way we do accounting. There is no current > development that I'm aware of to change this. As above, the simplest > stop-gap solution is that libvirt would need to be aware when VT-d is > present for a VM and use a different algorithm to set QEMU locked > memory limit, but it's not without its downsides. Alternatively, a new > IOMMU model would need to be developed for vfio. The type1 model was > only ever intended to be used for relatively static user mappings and I > expect it to have horrendous performance when backing a dynamic guest > IOMMU domain. Really the only guest IOMMU usage model that makes any > sort of sense with type1 is to run the guest with passthrough (iommu=pt) > and only pull devices out of passthrough for relatively static mapping > cases within the guest userspace (nested assigned devices or dpdk). If > the expectation is that we just need this one little bit more code to > make vfio usable in the guest, that may be true, but it really is just > barely usable. It's not going to be fast for any sort of dynamic > mapping and it's going to have accounting issues that are not > compatible with how libvirt sets locked memory limits for QEMU as soon > as you go beyond a single device. Thanks, Maybe we should revisit the idea of a "type2" IOMMU which could handle both guest VT-d and guest PAPR TCEs. 
I'm not excessively fond of the pre-registration model that PAPR uses at the moment, but it might be the best available way to deal with the accounting issue.
On Tue, Dec 20, 2016 at 03:02:54PM -0800, no-reply@patchew.org wrote: [...] > === OUTPUT BEGIN === > Checking PATCH 1/1: intel_iommu: allow dynamic switch of IOMMU region... > ERROR: "(foo**)" should be "(foo **)" > #55: FILE: hw/i386/intel_iommu.c:1190: > + while (g_hash_table_iter_next (&iter, NULL, (void**)&vtd_bus)) { > > ERROR: space prohibited between function name and open parenthesis '(' > #55: FILE: hw/i386/intel_iommu.c:1190: > + while (g_hash_table_iter_next (&iter, NULL, (void**)&vtd_bus)) { > > total: 2 errors, 0 warnings, 118 lines checked Hmm, let me fix. -- peterx
On Tue, Dec 20, 2016 at 03:57:59PM -0800, no-reply@patchew.org wrote: [...] > /tmp/qemu-test/src/hw/i386/intel_iommu.c: In function ‘vtd_switch_address_space’: > /tmp/qemu-test/src/hw/i386/intel_iommu.c:1196: warning: implicit declaration of function ‘trace_vtd_switch_address_space’ > /tmp/qemu-test/src/hw/i386/intel_iommu.c:1196: warning: nested extern declaration of ‘trace_vtd_switch_address_space’ > /tmp/qemu-test/src/hw/i386/intel_iommu.c: In function ‘vtd_find_add_as’: > /tmp/qemu-test/src/hw/i386/intel_iommu.c:2412: error: ‘name’ undeclared (first use in this function) > /tmp/qemu-test/src/hw/i386/intel_iommu.c:2412: error: (Each undeclared identifier is reported only once > /tmp/qemu-test/src/hw/i386/intel_iommu.c:2412: error: for each function it appears in.) I believe I missed something when rebasing on the trace-events file... Will repost a cleaner v2 addressing the existing comments. -- peterx
On Tue, Dec 20, 2016 at 05:04:33PM -0700, Alex Williamson wrote: > On Tue, 20 Dec 2016 14:38:01 +0800 > Peter Xu <peterx@redhat.com> wrote: > > > On Mon, Dec 19, 2016 at 09:52:52PM -0700, Alex Williamson wrote: > > > > [...] > > > > > > Yes, this patch just tried to move VT-d forward a bit, rather than do > > > > it once and for all. I think we can do better than this in the future, > > > > for example, one address space per guest IOMMU domain (as you have > > > > mentioned before). However I suppose that will need more work (which I > > > > still can't estimate on the amount of work). So I am considering to > > > > enable the device assignments functionally first, then we can further > > > > improve based on a workable version. Same thoughts apply to the IOMMU > > > > replay RFC series. > > > > > > I'm not arguing against it, I'm just trying to set expectations for > > > where this gets us. An AddressSpace per guest iommu domain seems like > > > the right model for QEMU, but it has some fundamental issues with > > > vfio. We currently tie a QEMU AddressSpace to a vfio container, which > > > represents the host IOMMU context. The AddressSpace of a device is > > > currently assumed to be fixed in QEMU, guest IOMMU domains clearly > > > are not. vfio only lets us have access to a device while it's > > > protected within a container. Therefore in order to move a device to a > > > different AddressSpace based on the guest domain configuration, we'd > > > need to tear down the vfio configuration, including releasing the > > > device. > > > > I assume this is VT-d specific issue, right? Looks like ppc is using a > > totally different way to manage the mapping, and devices can share the > > same address space. > > It's only VT-d specific in that VT-d is the only vIOMMU we have for > x86. ppc has a much different host IOMMU architecture and their VM > architecture requires an IOMMU. The ppc model has a notion of > preregistration to help with this, among other things. 
Preregistration addresses two different problems: 1) Preregistered pages are accounted as locked only once per host mm, even if registered against multiple containers, so it avoids the problem of double counting locked memory. 2) By moving the accounting to preregistration, it removes one barrier to having decent performance for dynamic mappings. (e.g. for PAPR accounting at map time made it basically impossible to write a fast KVM acceleration for this). > > > > Regarding to the locked memory accounting issue: do we have existing > > > > way to do the accounting? If so, would you (or anyone) please > > > > elaborate a bit? If not, is that an ongoing/planned work? > > > > > > As I describe above, there's a vfio container per AddressSpace, each > > > container is an IOMMU domain in the host. In the guest, an IOMMU > > > domain can include multiple AddressSpaces, one for each context entry > > > that's part of the domain. When the guest programs a translation for > > > an IOMMU domain, that maps a guest IOVA to a guest physical address, > > > for each AddressSpace. Each AddressSpace is backed by a vfio > > > container, which needs to pin the pages of that translation in order to > > > get a host physical address, which then gets programmed into the host > > > IOMMU domain with the guest-IOVA and host physical address. The > > > pinning process is where page accounting is done. It's done per vfio > > > context. The worst case scenario for accounting is thus when VT-d is > > > present but disabled (or in passthrough mode) as each AddressSpace > > > duplicates address_space_memory and every page of guest memory is > > > pinned and accounted for each vfio container. > > > > IIUC this accounting issue will solve itself if we can solve the > > previous issue. While we don't have it now, so ... > > Not sure what "previous issue" is referring to here. > > > > That's the existing way we do accounting. There is no current > > > development that I'm aware of to change this. 
As above, the simplest > > > stop-gap solution is that libvirt would need to be aware when VT-d is > > > present for a VM and use a different algorithm to set QEMU locked > > > memory limit, but it's not without its downsides. > > > > ... here I think it's sensible to consider a specific algorithm for > > vt-d use case. I am just curious about how should we define this > > algorithm. > > > > First of all, when the devices are not sharing domain (or say, one > > guest iommu domain per assigned device), everything should be fine. > > No, each domain could map the entire guest address space. If we're > talking about a domain per device for use with the Linux DMA API, then > it's unlikely that the sum of mapped pages across all the domains will > exceed the current libvirt set locked memory limit. However, that's > exactly the configuration where we expect to have abysmal performance. > As soon as we recommend the guest boot with iommu=pt, then each > container will be mapping and pinning the entire VM address space. > > > No > > special algorithm needed. IMHO the problem will happen only if there > > are assigned devices that share a same address space (either system, > > or specific iommu domain). In that case, the accounted value (or say, > > current->mm->locked_vm iiuc) will be bigger than the real locked > > memory size. > > > > However, I think the problem is whether devices will be put into same > > address space depends on guest behavior - the guest can either use > > iommu=pt, or manually putting devices into the same guest iommu region > > to achieve that. But from hypervisor POV, how should we estimate this? > > Can we really? > > The simple answer is that each device needs to be able to map the > entire VM address space and therefore when a VM is configured with > VT-d, libvirt needs to multiply the current locked memory settings for > assigned devices by the number of devices (groups actually) assigned. > There are (at least) two problems with this though. 
The first is that > we expect QEMU to use this increased locked memory limits for duplicate > accounting of the same pages, but an exploited user process could take > advantage of it and cause problems. Not optimal. The second problem > relates to the usage of the IOVA address space and the assumption that > a given container will map no more than the VM address space. When no > vIOMMU is exposed to the VM, QEMU manages the container IOVA space and > we know that QEMU is only mapping VM RAM and therefore mappings are > bound by the size of the VM. With a vIOMMU, the guest is in control of > the IOVA space and can map up to the limits of the vIOMMU. The guest > can map a single 4KB page to every IOVA up to that limit and we'll > account that page each time. So even valid (though perhaps not useful) > cases within the guest can hit that locking limit. > > This suggests that we not only need a vfio IOMMU model that tracks pfns > per domain to avoid duplicate accounting, but we need some way to share > that tracking between domains. Sounds pretty much identical to PAPR's preregistration. > Then we can go back to allowing a > locked memory limit up to the VM RAM size as the correct and complete > solution (plus some sort of shadow page table based mapping for any > hope of bearable performance for dynamic usage). > > > > Alternatively, a new > > > IOMMU model would need to be developed for vfio. The type1 model was > > > only ever intended to be used for relatively static user mappings and I > > > expect it to have horrendous performance when backing a dynamic guest > > > IOMMU domain. Really the only guest IOMMU usage model that makes any > > > sort of sense with type1 is to run the guest with passthrough (iommu=pt) > > > and only pull devices out of passthrough for relatively static mapping > > > cases within the guest userspace (nested assigned devices or dpdk). 
If > > > the expectation is that we just need this one little bit more code to > > > make vfio usable in the guest, that may be true, but it really is just > > > barely usable. It's not going to be fast for any sort of dynamic > > > mapping and it's going to have accounting issues that are not > > > compatible with how libvirt sets locked memory limits for QEMU as soon > > > as you go beyond a single device. Thanks, > > > > I can totally understand that the performance will suck if dynamic > > mapping is used. AFAIU this work will only be used with static dma > > mapping like running DPDK in guest (besides other trivial goals, like, > > development purpose). > > We can't control how a feature is used, which is why I'm trying to make > sure this doesn't come as a surprise to anyone. > > > Regarding to "the other" iommu model you mentioned besides type1, is > > there any existing discussions out there? Any further learning > > material/links would be greatly welcomed. > > Nope. You and Aviv are the only ones doing work that suggests a need > for a new IOMMU model. Thanks, I've also thought about it idly in the past. Currently we have some gratuitous differences in the spapr IOMMU mode. We do need some things different than Type1, but less than we thought when the interfaces were designed. It would be nice to converge these together. Preregistration, plus some more powerful introspection about what IOVA windows the IOMMU model can support should get us most of the way to being able to support both PAPR and VT-d backends with the same interface.
On Wed, Dec 21, 2016 at 01:53:37PM +1100, David Gibson wrote: [...] > > Could you explain why here device address space has things to do with > > PCI BARs? I thought BARs are for CPU address space only (so that CPU > > can access PCI registers via MMIO manner), am I wrong? > > In short, yes. So, first think about vanilla PCI - most things are > PCI-E these days, but the PCI addressing model which was designed for > the old hardware is still mostly the same. > > With plain PCI, you have a physical bus over which address and data > cycles pass. Those cycles don't distinguish between transfers from > host to device or device to host. Each address cycle just gives which > address space: configuration, IO or memory, and an address. > > Devices respond to addresses within their BARs, typically such cycles > will come from the host, but they don't have to - a device is able to > send cycles to another device (peer to peer DMA). Meanwhile the host > bridge will respond to addresses within certain DMA windows, > propagating those access onwards to system memory. How many DMA > windows there are, their size, location and whether they're mapped > directly or via an IOMMU depends on the model of host bridge. > > On x86, traditionally, PCI addresses 0..<somewhere> were simply mapped > directly to memory addresses 0..<somewhere>, identity mapping RAM into > PCI space. BARs would be assigned above <somewhere>, so they don't > collide. I suspect old enough machines will have <somewhere> == 2G, > leaving 2G..4G for the BARs of 32-bit devices. More modern x86 > bridges must have provisions for accessing memory above 4G, but I'm > not entirely certain how that works. > > PAPR traditionally also had a DMA window from 0..2G, however instead > of being direct mapped to RAM, it is always translated via an IOMMU. > More modern PAPR systems have that window by default, but allow the > OS to remove it and configure up to 2 DMA windows of variable length > and page size. 
Various other platforms have various other DMA window > arrangements. > > With PCI-E, of course, upstream and downstream cycles are distinct, > and peer to peer DMA isn't usually possible (unless a switch is > configured specially to allow it by forwarding cycles from one > downstream port to another). But the address model remains logically > the same: there is just one PCI memory space and both device BARs and > host DMA windows live within it. Firmware and/or the OS need to know > the details of the platform's host bridge, and configure both the BARs > and the DMA windows so that they don't collide. Thanks for the thorough explanation. :) So we should mask out all the MMIO regions (including BAR address ranges) for PCI device address space, right? Since they should not be able to access such addresses, but system ram? > > I think we should have a big enough IOMMU region size here. If device > > writes to invalid addresses, IMHO we should trap it and report to > > guest. If we have a smaller size than UINT64_MAX, how we can trap this > > behavior and report for the whole address space (it should cover [0, > > 2^64-1])? > > That's not how the IOMMU works. How it traps is dependent on the > specific IOMMU model, but generally they'll only even look at cycles > which lie within the IOMMU's DMA window. On x86 I'm pretty sure that > window will be large, but it won't be 2^64. It's also likely to have > a gap between 2..4GiB to allow room for the BARs of 32-bit devices. But for x86 IOMMU region, I don't know anything like "DMA window" - device has its own context entry, which will point to a whole page table. In that sense I think at least all addresses from (0, 2^39-1) should be legal addresses? And that range should depend on how many address space bits the specific Intel IOMMU support, currently the emulated VT-d one supports 39 bits. 
An example would be: one with VT-d should be able to map the address 3G (0xc0000000, here it is an IOVA address) to any physical address he/she wants, as long as he/she setup the page table correctly. Hope I didn't miss anything important.. > > > > > > > > + memory_region_init_alias(&vtd_dev_as->sys_alias, OBJECT(s), > > > > + "vtd_sys_alias", get_system_memory(), > > > > + 0, memory_region_size(get_system_memory())); > > > > > > I strongly suspect using memory_region_size(get_system_memory()) is > > > also incorrect here. System memory has size UINT64_MAX, but I'll bet > > > you you can't actually access all of that via PCI space (again, it > > > would collide with actual PCI BARs). I also suspect you can't reach > > > CPU MMIO regions via the PCI DMA space. > > > > Hmm, sounds correct. > > > > However if so we will have the same problem if without IOMMU? See > > pci_device_iommu_address_space() - address_space_memory will be the > > default if we have no IOMMU protection, and that will cover e.g. CPU > > MMIO regions as well. > > True. That default is basically assuming that both the host bridge's > DMA windows, and its outbound IO and memory windows are identity > mapped between the system bus and the PCI address space. I suspect > that's rarely 100% true, but it's close enough to work on a fair few > platforms. > > But since you're building a more accurate model of the x86 host > bridge's behaviour here, you might as well try to get it as accurate > as possible. Yes, but even if we can fix this problem, it should be for the no-iommu case as well? If so, I think it might be more suitable for another standalone patch. Anyway, I noted this down. Thanks, -- peterx
On Wed, Dec 21, 2016 at 06:05:49PM +0800, Peter Xu wrote: > On Wed, Dec 21, 2016 at 01:53:37PM +1100, David Gibson wrote: > > [...] > > > > Could you explain why here device address space has things to do with > > > PCI BARs? I thought BARs are for CPU address space only (so that CPU > > > can access PCI registers via MMIO manner), am I wrong? > > > > In short, yes. So, first think about vanilla PCI - most things are > > PCI-E these days, but the PCI addressing model which was designed for > > the old hardware is still mostly the same. > > > > With plain PCI, you have a physical bus over which address and data > > cycles pass. Those cycles don't distinguish between transfers from > > host to device or device to host. Each address cycle just gives which > > address space: configuration, IO or memory, and an address. > > > > Devices respond to addresses within their BARs, typically such cycles > > will come from the host, but they don't have to - a device is able to > > send cycles to another device (peer to peer DMA). Meanwhile the host > > bridge will respond to addresses within certain DMA windows, > > propagating those access onwards to system memory. How many DMA > > windows there are, their size, location and whether they're mapped > > directly or via an IOMMU depends on the model of host bridge. > > > > On x86, traditionally, PCI addresses 0..<somewhere> were simply mapped > > directly to memory addresses 0..<somewhere>, identity mapping RAM into > > PCI space. BARs would be assigned above <somewhere>, so they don't > > collide. I suspect old enough machines will have <somewhere> == 2G, > > leaving 2G..4G for the BARs of 32-bit devices. More modern x86 > > bridges must have provisions for accessing memory above 4G, but I'm > > not entirely certain how that works. > > > > PAPR traditionally also had a DMA window from 0..2G, however instead > > of being direct mapped to RAM, it is always translated via an IOMMU. 
> > More modern PAPR systems have that window by default, but allow the > > OS to remove it and configure up to 2 DMA windows of variable length > > and page size. Various other platforms have various other DMA window > > arrangements. > > > > With PCI-E, of course, upstream and downstream cycles are distinct, > > and peer to peer DMA isn't usually possible (unless a switch is > > configured specially to allow it by forwarding cycles from one > > downstream port to another). But the address model remains logically > > the same: there is just one PCI memory space and both device BARs and > > host DMA windows live within it. Firmware and/or the OS need to know > > the details of the platform's host bridge, and configure both the BARs > > and the DMA windows so that they don't collide. > > Thanks for the thorough explanation. :) > > So we should mask out all the MMIO regions (including BAR address > ranges) for PCI device address space, right? Uhh.. I think "masking out" is treating the problem backwards. I think you should allow only a window covering RAM, not take everything then try to remove MMIO. > Since they should not be > able to access such addresses, but system ram? What is "they"? > > > I think we should have a big enough IOMMU region size here. If device > > > writes to invalid addresses, IMHO we should trap it and report to > > > guest. If we have a smaller size than UINT64_MAX, how we can trap this > > > behavior and report for the whole address space (it should cover [0, > > 2^64-1])? > > > > That's not how the IOMMU works. How it traps is dependent on the > > specific IOMMU model, but generally they'll only even look at cycles > > which lie within the IOMMU's DMA window. On x86 I'm pretty sure that > > window will be large, but it won't be 2^64. It's also likely to have > > a gap between 2..4GiB to allow room for the BARs of 32-bit devices. 
> > But for x86 IOMMU region, I don't know anything like "DMA window" - > device has its own context entry, which will point to a whole page > table. In that sense I think at least all addresses from (0, 2^39-1) > should be legal addresses? And that range should depend on how many > address space bits the specific Intel IOMMU support, currently the > emulated VT-d one supports 39 bits. Well, sounds like the DMA window is 0..2^39-1 then. If there's a maximum number of bits in the specification - even if those aren't implemented on any current model - that would also be a reasonable choice. Again, I strongly suspect there's a gap in the range 2..4GiB, to allow for 32-bit BARs. Maybe not the whole range, but some chunk of it. I believe that's what's called the "io hole" on x86. This more or less has to be there. If all of 0..4GiB was mapped to RAM, and you had both a DMA capable device and a 32-bit device hanging off a PCI-E to PCI bridge, the 32-bit device's BARs could pick up cycles from the DMA device that were intended to go to RAM. > An example would be: one with VT-d should be able to map the address > 3G (0xc0000000, here it is an IOVA address) to any physical address > he/she wants, as long as he/she setup the page table correctly. > > Hope I didn't miss anything important.. > > > > > > > > > > > > + memory_region_init_alias(&vtd_dev_as->sys_alias, OBJECT(s), > > > > > + "vtd_sys_alias", get_system_memory(), > > > > > + 0, memory_region_size(get_system_memory())); > > > > > > > > I strongly suspect using memory_region_size(get_system_memory()) is > > > > also incorrect here. System memory has size UINT64_MAX, but I'll bet > > > > you you can't actually access all of that via PCI space (again, it > > > > would collide with actual PCI BARs). I also suspect you can't reach > > > > CPU MMIO regions via the PCI DMA space. > > > > > > Hmm, sounds correct. > > > > > > However if so we will have the same problem if without IOMMU? 
See > > > pci_device_iommu_address_space() - address_space_memory will be the > > > default if we have no IOMMU protection, and that will cover e.g. CPU > > > MMIO regions as well. > > > > True. That default is basically assuming that both the host bridge's > > DMA windows, and its outbound IO and memory windows are identity > > mapped between the system bus and the PCI address space. I suspect > > that's rarely 100% true, but it's close enough to work on a fair few > > platforms. > > > > But since you're building a more accurate model of the x86 host > > bridge's behaviour here, you might as well try to get it as accurate > > as possible. > > Yes, but even if we can fix this problem, it should be for the > no-iommu case as well? If so, I think it might be more suitable for > another standalone patch. Hm, perhaps.
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 5f3e351..75a3f4e 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1179,9 +1179,42 @@ static void vtd_handle_gcmd_sirtp(IntelIOMMUState *s)
     vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_IRTPS);
 }
 
+static void vtd_switch_address_space(IntelIOMMUState *s, bool enabled)
+{
+    GHashTableIter iter;
+    VTDBus *vtd_bus;
+    VTDAddressSpace *as;
+    int i;
+
+    g_hash_table_iter_init(&iter, s->vtd_as_by_busptr);
+    while (g_hash_table_iter_next(&iter, NULL, (void **)&vtd_bus)) {
+        for (i = 0; i < X86_IOMMU_PCI_DEVFN_MAX; i++) {
+            as = vtd_bus->dev_as[i];
+            if (as == NULL) {
+                continue;
+            }
+            trace_vtd_switch_address_space(pci_bus_num(vtd_bus->bus),
+                                           VTD_PCI_SLOT(i), VTD_PCI_FUNC(i),
+                                           enabled);
+            if (enabled) {
+                memory_region_add_subregion_overlap(&as->root, 0,
+                                                    &as->iommu, 2);
+            } else {
+                memory_region_del_subregion(&as->root, &as->iommu);
+            }
+        }
+    }
+}
+
 /* Handle Translation Enable/Disable */
 static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
 {
+    bool old = s->dmar_enabled;
+
+    if (old == en) {
+        return;
+    }
+
     VTD_DPRINTF(CSR, "Translation Enable %s", (en ? "on" : "off"));
 
     if (en) {
@@ -1196,6 +1229,8 @@ static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
         /* Ok - report back to driver */
         vtd_set_clear_mask_long(s, DMAR_GSTS_REG, VTD_GSTS_TES, 0);
     }
+
+    vtd_switch_address_space(s, en);
 }
 
 /* Handle Interrupt Remap Enable/Disable */
@@ -2343,15 +2378,47 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
         vtd_dev_as->devfn = (uint8_t)devfn;
         vtd_dev_as->iommu_state = s;
         vtd_dev_as->context_cache_entry.context_cache_gen = 0;
+
+        /*
+         * When DMAR is disabled, memory region relationships look
+         * like:
+         *
+         * 0000000000000000-ffffffffffffffff (prio 0, RW): vtd_root
+         *  0000000000000000-ffffffffffffffff (prio 1, RW): vtd_sys_alias
+         *  00000000fee00000-00000000feefffff (prio 64, RW): intel_iommu_ir
+         *
+         * When DMAR is enabled, it becomes:
+         *
+         * 0000000000000000-ffffffffffffffff (prio 0, RW): vtd_root
+         *  0000000000000000-ffffffffffffffff (prio 2, RW): intel_iommu
+         *  0000000000000000-ffffffffffffffff (prio 1, RW): vtd_sys_alias
+         *  00000000fee00000-00000000feefffff (prio 64, RW): intel_iommu_ir
+         *
+         * The intel_iommu region is dynamically added/removed.
+         */
         memory_region_init_iommu(&vtd_dev_as->iommu, OBJECT(s),
                                  &s->iommu_ops, "intel_iommu", UINT64_MAX);
+        memory_region_init_alias(&vtd_dev_as->sys_alias, OBJECT(s),
+                                 "vtd_sys_alias", get_system_memory(),
+                                 0, memory_region_size(get_system_memory()));
         memory_region_init_io(&vtd_dev_as->iommu_ir, OBJECT(s),
                               &vtd_mem_ir_ops, s, "intel_iommu_ir",
                               VTD_INTERRUPT_ADDR_SIZE);
-        memory_region_add_subregion(&vtd_dev_as->iommu, VTD_INTERRUPT_ADDR_FIRST,
-                                    &vtd_dev_as->iommu_ir);
-        address_space_init(&vtd_dev_as->as,
-                           &vtd_dev_as->iommu, "intel_iommu");
+        memory_region_init(&vtd_dev_as->root, OBJECT(s),
+                           "vtd_root", UINT64_MAX);
+        memory_region_add_subregion_overlap(&vtd_dev_as->root,
+                                            VTD_INTERRUPT_ADDR_FIRST,
+                                            &vtd_dev_as->iommu_ir, 64);
+        address_space_init(&vtd_dev_as->as, &vtd_dev_as->root, name);
+        memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
+                                            &vtd_dev_as->sys_alias, 1);
+        if (s->dmar_enabled) {
+            memory_region_add_subregion_overlap(&vtd_dev_as->root, 0,
+                                                &vtd_dev_as->iommu, 2);
+        }
+        trace_vtd_switch_address_space(pci_bus_num(vtd_bus->bus),
+                                       VTD_PCI_SLOT(devfn), VTD_PCI_FUNC(devfn),
+                                       s->dmar_enabled);
     }
     return vtd_dev_as;
 }
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index d2b4973..aee93bb 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -10,6 +10,9 @@ xen_pv_mmio_write(uint64_t addr) "WARNING: write to Xen PV Device MMIO space (ad
 # hw/i386/x86-iommu.c
 x86_iommu_iec_notify(bool global, uint32_t index, uint32_t mask) "Notify IEC invalidation: global=%d index=%" PRIu32 " mask=%" PRIu32
 
+# hw/i386/intel_iommu.c
+vtd_switch_address_space(uint8_t bus, uint8_t slot, uint8_t fn, bool on) "Device %02x:%02x.%x switching address space (iommu enabled=%d)"
+
 # hw/i386/amd_iommu.c
 amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" + offset 0x%"PRIx32
 amdvi_cache_update(uint16_t domid, uint8_t bus, uint8_t slot, uint8_t func, uint64_t gpa, uint64_t txaddr) " update iotlb domid 0x%"PRIx16" devid: %02x:%02x.%x gpa 0x%"PRIx64" hpa 0x%"PRIx64
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 405c9d1..85c1b9b 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -83,6 +83,8 @@ struct VTDAddressSpace {
     uint8_t devfn;
     AddressSpace as;
     MemoryRegion iommu;
+    MemoryRegion root;
+    MemoryRegion sys_alias;
     MemoryRegion iommu_ir;      /* Interrupt region: 0xfeeXXXXX */
     IntelIOMMUState *iommu_state;
     VTDContextCacheEntry context_cache_entry;
This is preparation work to finally enable dynamic switching ON/OFF for
VT-d protection. The old VT-d code used a static IOMMU region, and
that won't satisfy vfio-pci device listeners.

Let me explain.

vfio-pci devices depend on the memory region listener and IOMMU replay
mechanism to make sure the device mapping is coherent with the guest
even if there are domain switches. And there are two kinds of domain
switches:

  (1) switch from domain A -> B
  (2) switch from domain A -> no domain (e.g., turn DMAR off)

Case (1) is handled by the context entry invalidation handling in the
VT-d replay logic. What the replay function should do here is to replay
the existing page mappings in domain B.

However for case (2), we don't want to replay any domain mappings - we
just need the default GPA->HPA mappings (the address_space_memory
mapping). And this patch handles case (2) by building up that mapping
automatically, leveraging the vfio-pci memory listeners.

Another important thing that this patch does is to separate
IR (Interrupt Remapping) from DMAR (DMA Remapping). The IR region should
not depend on the DMAR region (as it did before this patch). It should
be a standalone region, and it should be able to be activated without
DMAR (which is a common behavior of the Linux kernel - by default it
enables IR while disabling DMAR).

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c         | 75 ++++++++++++++++++++++++++++++++++++++++---
 hw/i386/trace-events          |  3 ++
 include/hw/i386/intel_iommu.h |  2 ++
 3 files changed, 76 insertions(+), 4 deletions(-)