Message ID | 5d6e8360-7ba8-487e-0ac8-7f517e560520@suse.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | IOMMU: superpage support when not sharing pagetables | expand |
On 24/08/2021 15:21, Jan Beulich wrote: > While already the case for PVH, there's no reason to treat PV > differently here, though of course the addresses get taken from another > source in this case. Except that, to match CPU side mappings, by default > we permit r/o ones. This then also means we now deal consistently with > IO-APICs whose MMIO is or is not covered by E820 reserved regions. > > Signed-off-by: Jan Beulich <jbeulich@suse.com> Why do we give PV dom0 a mapping of the IO-APIC? Having thought about it, it cannot possibly be usable. IO-APICs use a indirect window, and Xen doesn't perform any write-emulation (that I can see), so the guest can't even read the data register and work out which register it represents. It also can't do an atomic 64bit read across the index and data registers, as that is explicitly undefined behaviour in the IO-APIC spec. On the other hand, we do have PHYSDEVOP_apic_{read,write} which, despite the name, is for the IO-APIC not the LAPIC, and I bet these hypercalls where introduced upon discovering that a read-only mapping of the IO-APIC it totally useless. I think we can safely not expose the IO-APICs to PV dom0 at all, which simplifies the memory handling further. ~Andrew
On 26.08.2021 13:57, Andrew Cooper wrote: > On 24/08/2021 15:21, Jan Beulich wrote: >> While already the case for PVH, there's no reason to treat PV >> differently here, though of course the addresses get taken from another >> source in this case. Except that, to match CPU side mappings, by default >> we permit r/o ones. This then also means we now deal consistently with >> IO-APICs whose MMIO is or is not covered by E820 reserved regions. >> >> Signed-off-by: Jan Beulich <jbeulich@suse.com> > > Why do we give PV dom0 a mapping of the IO-APIC? Having thought about > it, it cannot possibly be usable. > > IO-APICs use a indirect window, and Xen doesn't perform any > write-emulation (that I can see), so the guest can't even read the data > register and work out which register it represents. It also can't do an > atomic 64bit read across the index and data registers, as that is > explicitly undefined behaviour in the IO-APIC spec. > > On the other hand, we do have PHYSDEVOP_apic_{read,write} which, despite > the name, is for the IO-APIC not the LAPIC, and I bet these hypercalls > where introduced upon discovering that a read-only mapping of the > IO-APIC it totally useless. > > I think we can safely not expose the IO-APICs to PV dom0 at all, which > simplifies the memory handling further. The reason we do expose it r/o is that some ACPI implementations access (read and write) some RTEs from AML. If we don't allow Dom0 to map the IO-APIC, Dom0 will typically fail to boot on such systems. What we have right now seems to be good enough for those systems, no matter that the data they get to read makes little sense. If we learned of systems where this isn't sufficient, we'd have to implement more complete read emulation (i.e. latching writes to the window register while still discarding writes to the data register). If we produced a not-present PTE instead of a r/o one for such mapping requests, I'm afraid we'd actually further complicate memory handling, because we'd then need to consider for emulation also n/p #PF, not just writes to r/o mappings. This said - I'd also be fine with consistently not mapping the IO-APICs in the IOMMU page tables. It was you to request CPU and IOMMU mappings to match. What I want to do away with is the present mixture, derived from the E820 type covering the IO-APIC space. Jan
On 26/08/2021 13:55, Jan Beulich wrote: > On 26.08.2021 13:57, Andrew Cooper wrote: >> On 24/08/2021 15:21, Jan Beulich wrote: >>> While already the case for PVH, there's no reason to treat PV >>> differently here, though of course the addresses get taken from another >>> source in this case. Except that, to match CPU side mappings, by default >>> we permit r/o ones. This then also means we now deal consistently with >>> IO-APICs whose MMIO is or is not covered by E820 reserved regions. >>> >>> Signed-off-by: Jan Beulich <jbeulich@suse.com> >> Why do we give PV dom0 a mapping of the IO-APIC? Having thought about >> it, it cannot possibly be usable. >> >> IO-APICs use a indirect window, and Xen doesn't perform any >> write-emulation (that I can see), so the guest can't even read the data >> register and work out which register it represents. It also can't do an >> atomic 64bit read across the index and data registers, as that is >> explicitly undefined behaviour in the IO-APIC spec. >> >> On the other hand, we do have PHYSDEVOP_apic_{read,write} which, despite >> the name, is for the IO-APIC not the LAPIC, and I bet these hypercalls >> where introduced upon discovering that a read-only mapping of the >> IO-APIC it totally useless. >> >> I think we can safely not expose the IO-APICs to PV dom0 at all, which >> simplifies the memory handling further. > The reason we do expose it r/o is that some ACPI implementations access > (read and write) some RTEs from AML. If we don't allow Dom0 to map the > IO-APIC, Dom0 will typically fail to boot on such systems. I think more details are needed. If AML is reading the RTEs, it's is also writing to the index register. Irrespective of Xen, doing this behind the back of the OS cannot work safely, because at a minimum the ACPI interpreter would need to take the ioapic lock, and I see no evidence of workarounds like this in Linux. In Xen, we appear to swallow writes to mmio_ro ranges which is rude, and causes the associated reads to read garbage. This is Xen-induced memory corruption inside dom0. I don't think any of this is legitimate behaviour. ACPI needs disabling on such systems, or interlocking suitably, and its not just IO-APICs which are problematic - any windowed I/O (e.g. RTC) as well as any other device with read side effects. I think we should seriously consider not mapping the IO-APIC at all. That said, I would be surprised if Linux is cleanly avoiding the IO-APIC, so a slightly less bad alternative is to redirect to an "MMIO" frame of ~0's which we still write-discard on top of. That at least makes the Xen-induced memory corruption behave deterministically. ~Andrew
On 07.09.2021 19:13, Andrew Cooper wrote: > On 26/08/2021 13:55, Jan Beulich wrote: >> On 26.08.2021 13:57, Andrew Cooper wrote: >>> On 24/08/2021 15:21, Jan Beulich wrote: >>>> While already the case for PVH, there's no reason to treat PV >>>> differently here, though of course the addresses get taken from another >>>> source in this case. Except that, to match CPU side mappings, by default >>>> we permit r/o ones. This then also means we now deal consistently with >>>> IO-APICs whose MMIO is or is not covered by E820 reserved regions. >>>> >>>> Signed-off-by: Jan Beulich <jbeulich@suse.com> >>> Why do we give PV dom0 a mapping of the IO-APIC? Having thought about >>> it, it cannot possibly be usable. >>> >>> IO-APICs use a indirect window, and Xen doesn't perform any >>> write-emulation (that I can see), so the guest can't even read the data >>> register and work out which register it represents. It also can't do an >>> atomic 64bit read across the index and data registers, as that is >>> explicitly undefined behaviour in the IO-APIC spec. >>> >>> On the other hand, we do have PHYSDEVOP_apic_{read,write} which, despite >>> the name, is for the IO-APIC not the LAPIC, and I bet these hypercalls >>> where introduced upon discovering that a read-only mapping of the >>> IO-APIC it totally useless. >>> >>> I think we can safely not expose the IO-APICs to PV dom0 at all, which >>> simplifies the memory handling further. >> The reason we do expose it r/o is that some ACPI implementations access >> (read and write) some RTEs from AML. If we don't allow Dom0 to map the >> IO-APIC, Dom0 will typically fail to boot on such systems. > > I think more details are needed. How do you expect to collect the necessary info without having an affected system to test? I see close to zero chance to locate the old reports (and possible discussion) via web search. > If AML is reading the RTEs, it's is also writing to the index register. Quite likely, yes. Albeit this being broken to a fair degree in the first place, ... > Irrespective of Xen, doing this behind the back of the OS cannot work > safely, because at a minimum the ACPI interpreter would need to take the > ioapic lock, and I see no evidence of workarounds like this in Linux. ... as you indicate you think (as much as I do), leaves room for the actual accesses to also be flawed (and hence meaningless in the first place). I do recall looking at the disassembled AML back at the time, but I do not recall any details for sure. What I seem to vaguely recall is that their whole purpose was to set the mask bit in an RTE (I think to work around the dual routing issue, and I assume in turn to work around missing workarounds in certain OSes). For that the current approach as well as the alternative one you suggest below would be equally "good enough". > In Xen, we appear to swallow writes to mmio_ro ranges which is rude, and > causes the associated reads to read garbage. This is Xen-induced memory > corruption inside dom0. > > > I don't think any of this is legitimate behaviour. ACPI needs disabling > on such systems, or interlocking suitably, and its not just IO-APICs > which are problematic - any windowed I/O (e.g. RTC) as well as any other > device with read side effects. I don't think disabling ACPI on such systems would be a viable option. Things tend to not work very well that way ... Plus iirc these issues weren't exclusively on some random no-name systems, but in at least one of the cases on ones of a pretty large vendor. > I think we should seriously consider not mapping the IO-APIC at all. We can easily do so on the IOMMU side, if you agree to have CPU and IOMMU mappings diverge for this case. Since the behavior is PV- specific anyway, there are also no concerns as to differing behavior with vs without shared page tables (on PVH). > That said, I would be surprised if Linux is cleanly avoiding the > IO-APIC, so a slightly less bad alternative is to redirect to an "MMIO" > frame of ~0's which we still write-discard on top of. > > That at least makes the Xen-induced memory corruption behave > deterministically. It would work more deterministically, yes, but without such a system available to test we wouldn't know whether that approach would actually make things work (sufficiently). Whereas for the current approach we do know (from the testing done back at the time). Jan
--- a/xen/drivers/passthrough/x86/iommu.c +++ b/xen/drivers/passthrough/x86/iommu.c @@ -159,12 +159,12 @@ void arch_iommu_domain_destroy(struct do page_list_empty(&dom_iommu(d)->arch.pgtables.list)); } -static bool __hwdom_init hwdom_iommu_map(const struct domain *d, - unsigned long pfn, - unsigned long max_pfn) +static unsigned int __hwdom_init hwdom_iommu_map(const struct domain *d, + unsigned long pfn, + unsigned long max_pfn) { mfn_t mfn = _mfn(pfn); - unsigned int i, type; + unsigned int i, type, perms = IOMMUF_readable | IOMMUF_writable; /* * Set up 1:1 mapping for dom0. Default to include only conventional RAM @@ -173,44 +173,60 @@ static bool __hwdom_init hwdom_iommu_map * that fall in unusable ranges for PV Dom0. */ if ( (pfn > max_pfn && !mfn_valid(mfn)) || xen_in_range(pfn) ) - return false; + return 0; switch ( type = page_get_ram_type(mfn) ) { case RAM_TYPE_UNUSABLE: - return false; + return 0; case RAM_TYPE_CONVENTIONAL: if ( iommu_hwdom_strict ) - return false; + return 0; break; default: if ( type & RAM_TYPE_RESERVED ) { if ( !iommu_hwdom_inclusive && !iommu_hwdom_reserved ) - return false; + perms = 0; } - else if ( is_hvm_domain(d) || !iommu_hwdom_inclusive || pfn > max_pfn ) - return false; + else if ( is_hvm_domain(d) ) + return 0; + else if ( !iommu_hwdom_inclusive || pfn > max_pfn ) + perms = 0; } /* Check that it doesn't overlap with the Interrupt Address Range. */ if ( pfn >= 0xfee00 && pfn <= 0xfeeff ) - return false; + return 0; /* ... or the IO-APIC */ - for ( i = 0; has_vioapic(d) && i < d->arch.hvm.nr_vioapics; i++ ) - if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) ) - return false; + if ( has_vioapic(d) ) + { + for ( i = 0; i < d->arch.hvm.nr_vioapics; i++ ) + if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) ) + return 0; + } + else if ( is_pv_domain(d) ) + { + /* + * Be consistent with CPU mappings: Dom0 is permitted to establish r/o + * ones there, so it should also have such established for IOMMUs. + */ + for ( i = 0; i < nr_ioapics; i++ ) + if ( pfn == PFN_DOWN(mp_ioapics[i].mpc_apicaddr) ) + return rangeset_contains_singleton(mmio_ro_ranges, pfn) + ? IOMMUF_readable : 0; + } /* * ... or the PCIe MCFG regions. * TODO: runtime added MMCFG regions are not checked to make sure they * don't overlap with already mapped regions, thus preventing trapping. */ if ( has_vpci(d) && vpci_is_mmcfg_address(d, pfn_to_paddr(pfn)) ) - return false; + return 0; - return true; + return perms; } void __hwdom_init arch_iommu_hwdom_init(struct domain *d) @@ -246,15 +262,19 @@ void __hwdom_init arch_iommu_hwdom_init( for ( i = 0; i < top; i++ ) { unsigned long pfn = pdx_to_pfn(i); + unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn); int rc; - if ( !hwdom_iommu_map(d, pfn, max_pfn) ) + if ( !perms ) rc = 0; else if ( paging_mode_translate(d) ) - rc = set_identity_p2m_entry(d, pfn, p2m_access_rw, 0); + rc = set_identity_p2m_entry(d, pfn, + perms & IOMMUF_writable ? p2m_access_rw + : p2m_access_r, + 0); else rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K, - IOMMUF_readable | IOMMUF_writable, &flush_flags); + perms, &flush_flags); if ( rc ) printk(XENLOG_WARNING "%pd: identity %smapping of %lx failed: %d\n",
While already the case for PVH, there's no reason to treat PV differently here, though of course the addresses get taken from another source in this case. Except that, to match CPU side mappings, by default we permit r/o ones. This then also means we now deal consistently with IO-APICs whose MMIO is or is not covered by E820 reserved regions. Signed-off-by: Jan Beulich <jbeulich@suse.com> --- [integrated] v1: Integrate into series. [standalone] v2: Keep IOMMU mappings in sync with CPU ones.