Message ID | 20201016112933.14856-2-FelixCui-oc@zhaoxin.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Skip flatview_simplify() for cpu vendor zhaoxin | expand |
On 16/10/20 13:29, FelixCuioc wrote: > The issue here is that an assinged EHCI device accesses > an adjacent mapping between the delete and add phases > of the VFIO MemoryListener. > We want to skip flatview_simplify() is to prevent EHCI > device IOVA mappings from being unmapped. Hi, there is indeed a bug, but I have already explained last month (https://mail.gnu.org/archive/html/qemu-devel/2020-09/msg01279.html) that this patch is conceptually wrong: 1) you're adding host_get_vendor conditioned on compiling the x86 emulator, so you are breaking compilation on non-x86 machines. 2) you're adding a check for the host, but the bug applies to all hosts. If there is a bug on x86 hardware emulation, it should be fixed even when emulating x86 from ARM. It should also apply to all CPU vendors. Alex, the issue here is that the delete+add passes are racing against an assigned device's DMA. For KVM we were thinking of changing the whole memory map with a single ioctl, but that's much easier because KVM builds its page tables lazily. It would be possible for the IOMMU too but it would require a relatively complicated comparison of the old and new memory maps in the kernel. Paolo
hi paolo, >2) you're adding a check for the host, but the bug applies to all hosts. >If there is a bug on x86 hardware emulation, it should be fixed even >when emulating x86 from ARM. It should also apply to all CPU vendors. What is the progress of handling this bug ? If the processing is more complicated, can we temporarily remove flatview_simplify()? hi Alex, >the issue here is that the delete+add passes are racing against an >assigned device's DMA Please help comment how to solve this problem. Best regards Felixcui-oc
On Fri, 16 Oct 2020 13:42:29 +0200 Paolo Bonzini <pbonzini@redhat.com> wrote: > On 16/10/20 13:29, FelixCuioc wrote: > > The issue here is that an assinged EHCI device accesses > > an adjacent mapping between the delete and add phases > > of the VFIO MemoryListener. > > We want to skip flatview_simplify() is to prevent EHCI > > device IOVA mappings from being unmapped. > > Hi, > > there is indeed a bug, but I have already explained last month > (https://mail.gnu.org/archive/html/qemu-devel/2020-09/msg01279.html) > that this patch is conceptually wrong: > > 1) you're adding host_get_vendor conditioned on compiling the x86 > emulator, so you are breaking compilation on non-x86 machines. > > 2) you're adding a check for the host, but the bug applies to all hosts. > If there is a bug on x86 hardware emulation, it should be fixed even > when emulating x86 from ARM. It should also apply to all CPU vendors. > > Alex, the issue here is that the delete+add passes are racing against an > assigned device's DMA. For KVM we were thinking of changing the whole > memory map with a single ioctl, but that's much easier because KVM > builds its page tables lazily. It would be possible for the IOMMU too > but it would require a relatively complicated comparison of the old and > new memory maps in the kernel. We can only build IOMMU page tables lazily if we get faults, which we generally don't. We also cannot atomically update IOMMU page tables relative to a device, so "housekeeping" updates of mappings to (I assume) consolidate KVM memory slots doesn't work so well when the device is still running. Stopping the device via something like the bus-master enable bit also sounds like a whole set of problems itself. I assume these simplified mappings also reduce our resolution for later unmaps, which isn't necessarily a win for an assigned device either if it exposes the race again each boot. Maybe the question is why we don't see these errors more regularly, is there something unique about the memory layout of this platform versus others that causes larger memory regions to be coalesced together only to be later unmapped and provide more exposure to this issue? Thanks, Alex
On 19/10/20 21:02, Alex Williamson wrote: >> For KVM we were thinking of changing the whole >> memory map with a single ioctl, but that's much easier because KVM >> builds its page tables lazily. It would be possible for the IOMMU too >> but it would require a relatively complicated comparison of the old and >> new memory maps in the kernel. > > We can only build IOMMU page tables lazily if we get faults, which we > generally don't. We also cannot atomically update IOMMU page tables > relative to a device, Yeah, I didn't mean building IOMMU page tables lazily, rather replacing the whole VFIO memory map with a single ioctl. I don't think that requires atomic updates of the IOMMU page table root, but it would require atomic updates of IOMMU page table entires; VFIO would compare the old and new memory map and modify the page tables when it sees a difference. Is that possible? Paolo > so "housekeeping" updates of mappings to (I > assume) consolidate KVM memory slots doesn't work so well when the > device is still running. Stopping the device via something like the > bus-master enable bit also sounds like a whole set of problems itself. > I assume these simplified mappings also reduce our resolution for later > unmaps, which isn't necessarily a win for an assigned device either if > it exposes the race again each boot. > > Maybe the question is why we don't see these errors more regularly, is > there something unique about the memory layout of this platform versus > others that causes larger memory regions to be coalesced together only > to be later unmapped and provide more exposure to this issue? Thanks,
On Tue, 20 Oct 2020 11:24:58 +0200 Paolo Bonzini <pbonzini@redhat.com> wrote: > On 19/10/20 21:02, Alex Williamson wrote: > >> For KVM we were thinking of changing the whole > >> memory map with a single ioctl, but that's much easier because KVM > >> builds its page tables lazily. It would be possible for the IOMMU too > >> but it would require a relatively complicated comparison of the old and > >> new memory maps in the kernel. > > > > We can only build IOMMU page tables lazily if we get faults, which we > > generally don't. We also cannot atomically update IOMMU page tables > > relative to a device, > > Yeah, I didn't mean building IOMMU page tables lazily, rather replacing > the whole VFIO memory map with a single ioctl. > > I don't think that requires atomic updates of the IOMMU page table root, > but it would require atomic updates of IOMMU page table entires; VFIO > would compare the old and new memory map and modify the page tables when > it sees a difference. Is that possible? Theoretically possible, probably. I imagine any IOMMU worth it's silicon should be able to update ptes atomically, but there's probably a substantial re-engineering of the internal IOMMU API and external vfio IOMMU uAPI to get there. For example, I don't expect we have support at the IOMMU driver level to reshuffle ptes when an IOMMU super page is broken. Instead, for an unmap operation, the IOMMU API allows the driver to return a larger number of unmapped pages than requested. I'd be nervous about an agreed baseline for modifications to pages covered by an IOMMU super page as well. The vfio IOMMU type1 uAPI is largely a reflection of this internal API with further restrictions for tracking and accounting of user mappings. Therefore we don't allow mappings that modify or overlap existing mappings nor do we allow an unmap which bisects any existing mapping. To support a memory map approach (which implicitly negates those sorts of rules) we'd need to know if the IOMMU driver itself can atomically handle arbitrary maps and unmaps, performing any necessary super page fix-ups atomically. The internal mechanics of the vfio IOMMU would need to change quite a bit too for tracking and pinning, I suspect. Do we necessarily need a memory map ioctl for this or could it be the QEMU code that compares the old and new maps to trigger map and unmap ioctls? For example (aiui) our race is that if we have contiguous memory regions A and B and flatview_simplify() tries to expand A and delete B we'll see a series of listener notifications deleting A and B and adding A'. But the vfio QEMU code could parse the memory map to determine that old A + B is functionally equivalent to A' and do nothing. Do you foresee any breakdowns for such an approach? Hotplug concerns me in that a new device only has the current simplified flatview, ex. we only know A' rather than A + B, so we can't get back to A + !B like a device with more history could. Thanks, Alex
On 21/10/20 00:44, Alex Williamson wrote: > Do we necessarily need a memory map ioctl for this or could it be the > QEMU code that compares the old and new maps to trigger map and unmap > ioctls? For example (aiui) our race is that if we have contiguous > memory regions A and B and flatview_simplify() tries to expand A and > delete B we'll see a series of listener notifications deleting A and B > and adding A'. But the vfio QEMU code could parse the memory map to > determine that old A + B is functionally equivalent to A' and do > nothing. I think the issue is a bit different, and in fact there are two sides of the same issue. Say you have A (large) and it is replaced by A' (smaller) + B, then: * the first part of A disappears for a moment before A' appears. This is something that QEMU can work around, by not doing anything * the second part of A disappears for a moment before B appears. This is the root API issue and not something that QEMU can work around; and in fact it is not even fixed by removing flatview_simplify. Felix, did you identify the regions whose simplification causes the bug? Is this RAM (for example due to the PAM registers) or something else? Paolo
hi Paolo, >Felix, did you identify the regions whose simplification causes the bug? >Is this RAM (for example due to the PAM registers) or something else? yes, this bug is caused by write PAM register.The actual situation is that the properties of some ranges are changed from RW to readonly.This situation will cause the old ranges to be unmapped. Best regards Felixcui-oc
On Wed, 21 Oct 2020 09:37:53 +0200 Paolo Bonzini <pbonzini@redhat.com> wrote: > On 21/10/20 00:44, Alex Williamson wrote: > > Do we necessarily need a memory map ioctl for this or could it be the > > QEMU code that compares the old and new maps to trigger map and unmap > > ioctls? For example (aiui) our race is that if we have contiguous > > memory regions A and B and flatview_simplify() tries to expand A and > > delete B we'll see a series of listener notifications deleting A and B > > and adding A'. But the vfio QEMU code could parse the memory map to > > determine that old A + B is functionally equivalent to A' and do > > nothing. > > I think the issue is a bit different, and in fact there are two sides of > the same issue. Say you have A (large) and it is replaced by A' > (smaller) + B, then: > > * the first part of A disappears for a moment before A' appears. This > is something that QEMU can work around, by not doing anything > > * the second part of A disappears for a moment before B appears. This > is the root API issue and not something that QEMU can work around; and > in fact it is not even fixed by removing flatview_simplify. Right, our current uAPI does not support a mechanism to atomically change a mapping, but likewise we're probably not going to have devices performing DMA to regions that are being remapped. We know that removing flatview_simplify() resolves this issue and FelixCui's update suggests we do have a case where the permission changes of an adjacent range is triggering a range consolidation, which we see as the range being removed and added as something else, larger or smaller. I can understand the general benefit of flatview_simplify(), but I wonder if the best short term solution is to skip operating on the x86 PAM range, which I understand to be a small number of memory chunks below 1MB. I might also wonder why the EHCI controller on this platform is choosing that range for DMA. Thanks, Alex
On 21/10/20 20:49, Alex Williamson wrote: > I can understand the general benefit of flatview_simplify(), but I > wonder if the best short term solution is to skip operating on the x86 > PAM range, which I understand to be a small number of memory chunks > below 1MB. I'd rather remove flatview_simplify altogether, it probably triggers relatively rarely. Possibly do not let it operate on RAM/ROM regions, only on I/O regions. > I might also wonder why the EHCI controller on this > platform is choosing that range for DMA. I assume it's the BIOS's driver and it's choosing a range in low memory, but still I'm not sure why its DMA is racing against the PAM update (which is done very early). Felix, do you know the answer? Paolo
hi , >I assume it's the BIOS's driver and it's choosing a range in low memory, >but still I'm not sure why its DMA is racing against the PAM update >(which is done very early). Felix, do you know the answer? This bug is triggered by make_bios_readonly() in seabios. Make_bios_readonly() will write pam register. And ehci_setup() is executed before make_bios_readonly(). After initializing EHCI in seabios, it will continuously send dma cycles. So when write pam register, this bug appeared. In addition, before write pam registers, flatview_simplify() has merged a very large range.For example, this large range is 0xc0000-0xbfffffff. So even if EHCI is configured to not allocate buffers in low memory, this bug will still occur.Thanks. Best regards Felixcui-oc
On 22/10/20 05:02, FelixCui-oc wrote: > In addition, before write pam registers, flatview_simplify() has merged > a very large range.For example, > > this large range is 0xc0000-0xbfffffff. So even if EHCI is configured to > not allocate buffers in low memory, > > this bug will still occur.Thanks. So removing flatview_simplify() works because the higher area (0x10000 and above) remains the same. I guess the simplest thing to do is to apply flatview_simplify() only to I/O areas, though we can also consider removing it completely. I'm not sure in which case it would provide a noticeable improvement. Paolo
hi paolo, >So removing flatview_simplify() works because the higher area (0x10000 >and above) remains the same. I guess the simplest thing to do is to >apply flatview_simplify() only to I/O areas, though we can also consider >removing it completely. I'm not sure in which case it would provide a >noticeable improvement. I agree with you very much. Both of these cases can solve this bug. Thanks. Best regards Felixcui-oc
hi paolo, >So removing flatview_simplify() works because the higher area (0x10000 >and above) remains the same. I guess the simplest thing to do is to >apply flatview_simplify() only to I/O areas, though we can also consider >removing it completely. I'm not sure in which case it would provide a >noticeable improvement. Which one do you think is better? I prefer to remove flatview_simplify() directly. Please help give some suggestions , and then i will resubmit the patch. Best regards FelixCui-oc
diff --git a/softmmu/memory.c b/softmmu/memory.c index 403ff3abc9..a998018d87 100644 --- a/softmmu/memory.c +++ b/softmmu/memory.c @@ -700,6 +700,22 @@ static MemoryRegion *memory_region_get_flatview_root(MemoryRegion *mr) return NULL; } +static bool skip_simplify(void) +{ +#ifdef I386_CPU_H + char vendor[CPUID_VENDOR_SZ + 1] = { 0 }; + host_get_vendor(vendor); + if (!strncmp(vendor, CPUID_VENDOR_VIA, strlen(CPUID_VENDOR_VIA)) + || !strncmp(vendor, CPUID_VENDOR_ZHAOXIN, + strlen(CPUID_VENDOR_ZHAOXIN))) { + return true; + } + return false; +#else + return false; +#endif +} + /* Render a memory topology into a list of disjoint absolute ranges. */ static FlatView *generate_memory_topology(MemoryRegion *mr) { @@ -713,7 +729,9 @@ static FlatView *generate_memory_topology(MemoryRegion *mr) addrrange_make(int128_zero(), int128_2_64()), false, false); } - flatview_simplify(view); + if (!skip_simplify()) { + flatview_simplify(view); + } view->dispatch = address_space_dispatch_new(view); for (i = 0; i < view->nr; i++) { diff --git a/target/i386/cpu.c b/target/i386/cpu.c index 5d713c8528..6bda35a03e 100644 --- a/target/i386/cpu.c +++ b/target/i386/cpu.c @@ -1559,6 +1559,13 @@ void host_cpuid(uint32_t function, uint32_t count, if (edx) *edx = vec[3]; } +void host_get_vendor(char *vendor) +{ + uint32_t eax, ebx, ecx, edx; + + host_cpuid(0x0, 0, &eax, &ebx, &ecx, &edx); + x86_cpu_vendor_words2str(vendor, ebx, edx, ecx); +} void host_vendor_fms(char *vendor, int *family, int *model, int *stepping) { diff --git a/target/i386/cpu.h b/target/i386/cpu.h index 51c1d5f60a..e8e26e6a53 100644 --- a/target/i386/cpu.h +++ b/target/i386/cpu.h @@ -832,6 +832,8 @@ typedef uint64_t FeatureWordArray[FEATURE_WORDS]; #define CPUID_VENDOR_VIA "CentaurHauls" +#define CPUID_VENDOR_ZHAOXIN " Shanghai " + #define CPUID_VENDOR_HYGON "HygonGenuine" #define IS_INTEL_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_INTEL_1 && \ @@ -1918,6 +1920,7 @@ void cpu_clear_apic_feature(CPUX86State *env); void host_cpuid(uint32_t function, uint32_t count, uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx); void host_vendor_fms(char *vendor, int *family, int *model, int *stepping); +void host_get_vendor(char *vendor); /* helper.c */ bool x86_cpu_tlb_fill(CPUState *cs, vaddr address, int size,
The issue here is that an assinged EHCI device accesses an adjacent mapping between the delete and add phases of the VFIO MemoryListener. We want to skip flatview_simplify() is to prevent EHCI device IOVA mappings from being unmapped. Signed-off-by: FelixCuioc <FelixCui-oc@zhaoxin.com> --- softmmu/memory.c | 20 +++++++++++++++++++- target/i386/cpu.c | 7 +++++++ target/i386/cpu.h | 3 +++ 3 files changed, 29 insertions(+), 1 deletion(-)