diff mbox series

[07/17] IOMMU/x86: restrict IO-APIC mappings for PV Dom0

Message ID 5d6e8360-7ba8-487e-0ac8-7f517e560520@suse.com (mailing list archive)
State Superseded
Headers show
Series IOMMU: superpage support when not sharing pagetables | expand

Commit Message

Jan Beulich Aug. 24, 2021, 2:21 p.m. UTC
While already the case for PVH, there's no reason to treat PV
differently here, though of course the addresses get taken from another
source in this case. Except that, to match CPU side mappings, by default
we permit r/o ones. This then also means we now deal consistently with
IO-APICs whose MMIO is or is not covered by E820 reserved regions.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
[integrated] v1: Integrate into series.
[standalone] v2: Keep IOMMU mappings in sync with CPU ones.

Comments

Andrew Cooper Aug. 26, 2021, 11:57 a.m. UTC | #1
On 24/08/2021 15:21, Jan Beulich wrote:
> While already the case for PVH, there's no reason to treat PV
> differently here, though of course the addresses get taken from another
> source in this case. Except that, to match CPU side mappings, by default
> we permit r/o ones. This then also means we now deal consistently with
> IO-APICs whose MMIO is or is not covered by E820 reserved regions.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Why do we give PV dom0 a mapping of the IO-APIC?  Having thought about
it, it cannot possibly be usable.

IO-APICs use a indirect window, and Xen doesn't perform any
write-emulation (that I can see), so the guest can't even read the data
register and work out which register it represents.  It also can't do an
atomic 64bit read across the index and data registers, as that is
explicitly undefined behaviour in the IO-APIC spec.

On the other hand, we do have PHYSDEVOP_apic_{read,write} which, despite
the name, is for the IO-APIC not the LAPIC, and I bet these hypercalls
where introduced upon discovering that a read-only mapping of the
IO-APIC it totally useless.

I think we can safely not expose the IO-APICs to PV dom0 at all, which
simplifies the memory handling further.

~Andrew
Jan Beulich Aug. 26, 2021, 12:55 p.m. UTC | #2
On 26.08.2021 13:57, Andrew Cooper wrote:
> On 24/08/2021 15:21, Jan Beulich wrote:
>> While already the case for PVH, there's no reason to treat PV
>> differently here, though of course the addresses get taken from another
>> source in this case. Except that, to match CPU side mappings, by default
>> we permit r/o ones. This then also means we now deal consistently with
>> IO-APICs whose MMIO is or is not covered by E820 reserved regions.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Why do we give PV dom0 a mapping of the IO-APIC?  Having thought about
> it, it cannot possibly be usable.
> 
> IO-APICs use a indirect window, and Xen doesn't perform any
> write-emulation (that I can see), so the guest can't even read the data
> register and work out which register it represents.  It also can't do an
> atomic 64bit read across the index and data registers, as that is
> explicitly undefined behaviour in the IO-APIC spec.
> 
> On the other hand, we do have PHYSDEVOP_apic_{read,write} which, despite
> the name, is for the IO-APIC not the LAPIC, and I bet these hypercalls
> where introduced upon discovering that a read-only mapping of the
> IO-APIC it totally useless.
> 
> I think we can safely not expose the IO-APICs to PV dom0 at all, which
> simplifies the memory handling further.

The reason we do expose it r/o is that some ACPI implementations access
(read and write) some RTEs from AML. If we don't allow Dom0 to map the
IO-APIC, Dom0 will typically fail to boot on such systems. What we have
right now seems to be good enough for those systems, no matter that the
data they get to read makes little sense. If we learned of systems
where this isn't sufficient, we'd have to implement more complete read
emulation (i.e. latching writes to the window register while still
discarding writes to the data register).

If we produced a not-present PTE instead of a r/o one for such mapping
requests, I'm afraid we'd actually further complicate memory handling,
because we'd then need to consider for emulation also n/p #PF, not just
writes to r/o mappings.

This said - I'd also be fine with consistently not mapping the IO-APICs
in the IOMMU page tables. It was you to request CPU and IOMMU mappings
to match. What I want to do away with is the present mixture, derived
from the E820 type covering the IO-APIC space.

Jan
Andrew Cooper Sept. 7, 2021, 5:13 p.m. UTC | #3
On 26/08/2021 13:55, Jan Beulich wrote:
> On 26.08.2021 13:57, Andrew Cooper wrote:
>> On 24/08/2021 15:21, Jan Beulich wrote:
>>> While already the case for PVH, there's no reason to treat PV
>>> differently here, though of course the addresses get taken from another
>>> source in this case. Except that, to match CPU side mappings, by default
>>> we permit r/o ones. This then also means we now deal consistently with
>>> IO-APICs whose MMIO is or is not covered by E820 reserved regions.
>>>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> Why do we give PV dom0 a mapping of the IO-APIC?  Having thought about
>> it, it cannot possibly be usable.
>>
>> IO-APICs use a indirect window, and Xen doesn't perform any
>> write-emulation (that I can see), so the guest can't even read the data
>> register and work out which register it represents.  It also can't do an
>> atomic 64bit read across the index and data registers, as that is
>> explicitly undefined behaviour in the IO-APIC spec.
>>
>> On the other hand, we do have PHYSDEVOP_apic_{read,write} which, despite
>> the name, is for the IO-APIC not the LAPIC, and I bet these hypercalls
>> where introduced upon discovering that a read-only mapping of the
>> IO-APIC it totally useless.
>>
>> I think we can safely not expose the IO-APICs to PV dom0 at all, which
>> simplifies the memory handling further.
> The reason we do expose it r/o is that some ACPI implementations access
> (read and write) some RTEs from AML. If we don't allow Dom0 to map the
> IO-APIC, Dom0 will typically fail to boot on such systems.

I think more details are needed.

If AML is reading the RTEs, it's is also writing to the index register.

Irrespective of Xen, doing this behind the back of the OS cannot work
safely, because at a minimum the ACPI interpreter would need to take the
ioapic lock, and I see no evidence of workarounds like this in Linux.


In Xen, we appear to swallow writes to mmio_ro ranges which is rude, and
causes the associated reads to read garbage.  This is Xen-induced memory
corruption inside dom0.


I don't think any of this is legitimate behaviour.  ACPI needs disabling
on such systems, or interlocking suitably, and its not just IO-APICs
which are problematic - any windowed I/O (e.g. RTC) as well as any other
device with read side effects.

I think we should seriously consider not mapping the IO-APIC at all. 
That said, I would be surprised if Linux is cleanly avoiding the
IO-APIC, so a slightly less bad alternative is to redirect to an "MMIO"
frame of ~0's which we still write-discard on top of.

That at least makes the Xen-induced memory corruption behave
deterministically.

~Andrew
Jan Beulich Sept. 8, 2021, 9:44 a.m. UTC | #4
On 07.09.2021 19:13, Andrew Cooper wrote:
> On 26/08/2021 13:55, Jan Beulich wrote:
>> On 26.08.2021 13:57, Andrew Cooper wrote:
>>> On 24/08/2021 15:21, Jan Beulich wrote:
>>>> While already the case for PVH, there's no reason to treat PV
>>>> differently here, though of course the addresses get taken from another
>>>> source in this case. Except that, to match CPU side mappings, by default
>>>> we permit r/o ones. This then also means we now deal consistently with
>>>> IO-APICs whose MMIO is or is not covered by E820 reserved regions.
>>>>
>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>> Why do we give PV dom0 a mapping of the IO-APIC?  Having thought about
>>> it, it cannot possibly be usable.
>>>
>>> IO-APICs use a indirect window, and Xen doesn't perform any
>>> write-emulation (that I can see), so the guest can't even read the data
>>> register and work out which register it represents.  It also can't do an
>>> atomic 64bit read across the index and data registers, as that is
>>> explicitly undefined behaviour in the IO-APIC spec.
>>>
>>> On the other hand, we do have PHYSDEVOP_apic_{read,write} which, despite
>>> the name, is for the IO-APIC not the LAPIC, and I bet these hypercalls
>>> where introduced upon discovering that a read-only mapping of the
>>> IO-APIC it totally useless.
>>>
>>> I think we can safely not expose the IO-APICs to PV dom0 at all, which
>>> simplifies the memory handling further.
>> The reason we do expose it r/o is that some ACPI implementations access
>> (read and write) some RTEs from AML. If we don't allow Dom0 to map the
>> IO-APIC, Dom0 will typically fail to boot on such systems.
> 
> I think more details are needed.

How do you expect to collect the necessary info without having an affected
system to test? I see close to zero chance to locate the old reports (and
possible discussion) via web search.

> If AML is reading the RTEs, it's is also writing to the index register.

Quite likely, yes. Albeit this being broken to a fair degree in the
first place, ...

> Irrespective of Xen, doing this behind the back of the OS cannot work
> safely, because at a minimum the ACPI interpreter would need to take the
> ioapic lock, and I see no evidence of workarounds like this in Linux.

... as you indicate you think (as much as I do), leaves room for the
actual accesses to also be flawed (and hence meaningless in the first
place). I do recall looking at the disassembled AML back at the time,
but I do not recall any details for sure. What I seem to vaguely recall
is that their whole purpose was to set the mask bit in an RTE (I think
to work around the dual routing issue, and I assume in turn to work
around missing workarounds in certain OSes). For that the current
approach as well as the alternative one you suggest below would be
equally "good enough".

> In Xen, we appear to swallow writes to mmio_ro ranges which is rude, and
> causes the associated reads to read garbage.  This is Xen-induced memory
> corruption inside dom0.
> 
> 
> I don't think any of this is legitimate behaviour.  ACPI needs disabling
> on such systems, or interlocking suitably, and its not just IO-APICs
> which are problematic - any windowed I/O (e.g. RTC) as well as any other
> device with read side effects.

I don't think disabling ACPI on such systems would be a viable option.
Things tend to not work very well that way ... Plus iirc these issues
weren't exclusively on some random no-name systems, but in at least one
of the cases on ones of a pretty large vendor.

> I think we should seriously consider not mapping the IO-APIC at all. 

We can easily do so on the IOMMU side, if you agree to have CPU and
IOMMU mappings diverge for this case. Since the behavior is PV-
specific anyway, there are also no concerns as to differing behavior
with vs without shared page tables (on PVH).

> That said, I would be surprised if Linux is cleanly avoiding the
> IO-APIC, so a slightly less bad alternative is to redirect to an "MMIO"
> frame of ~0's which we still write-discard on top of.
> 
> That at least makes the Xen-induced memory corruption behave
> deterministically.

It would work more deterministically, yes, but without such a system
available to test we wouldn't know whether that approach would actually
make things work (sufficiently). Whereas for the current approach we do
know (from the testing done back at the time).

Jan
diff mbox series

Patch

--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -159,12 +159,12 @@  void arch_iommu_domain_destroy(struct do
            page_list_empty(&dom_iommu(d)->arch.pgtables.list));
 }
 
-static bool __hwdom_init hwdom_iommu_map(const struct domain *d,
-                                         unsigned long pfn,
-                                         unsigned long max_pfn)
+static unsigned int __hwdom_init hwdom_iommu_map(const struct domain *d,
+                                                 unsigned long pfn,
+                                                 unsigned long max_pfn)
 {
     mfn_t mfn = _mfn(pfn);
-    unsigned int i, type;
+    unsigned int i, type, perms = IOMMUF_readable | IOMMUF_writable;
 
     /*
      * Set up 1:1 mapping for dom0. Default to include only conventional RAM
@@ -173,44 +173,60 @@  static bool __hwdom_init hwdom_iommu_map
      * that fall in unusable ranges for PV Dom0.
      */
     if ( (pfn > max_pfn && !mfn_valid(mfn)) || xen_in_range(pfn) )
-        return false;
+        return 0;
 
     switch ( type = page_get_ram_type(mfn) )
     {
     case RAM_TYPE_UNUSABLE:
-        return false;
+        return 0;
 
     case RAM_TYPE_CONVENTIONAL:
         if ( iommu_hwdom_strict )
-            return false;
+            return 0;
         break;
 
     default:
         if ( type & RAM_TYPE_RESERVED )
         {
             if ( !iommu_hwdom_inclusive && !iommu_hwdom_reserved )
-                return false;
+                perms = 0;
         }
-        else if ( is_hvm_domain(d) || !iommu_hwdom_inclusive || pfn > max_pfn )
-            return false;
+        else if ( is_hvm_domain(d) )
+            return 0;
+        else if ( !iommu_hwdom_inclusive || pfn > max_pfn )
+            perms = 0;
     }
 
     /* Check that it doesn't overlap with the Interrupt Address Range. */
     if ( pfn >= 0xfee00 && pfn <= 0xfeeff )
-        return false;
+        return 0;
     /* ... or the IO-APIC */
-    for ( i = 0; has_vioapic(d) && i < d->arch.hvm.nr_vioapics; i++ )
-        if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
-            return false;
+    if ( has_vioapic(d) )
+    {
+        for ( i = 0; i < d->arch.hvm.nr_vioapics; i++ )
+            if ( pfn == PFN_DOWN(domain_vioapic(d, i)->base_address) )
+                return 0;
+    }
+    else if ( is_pv_domain(d) )
+    {
+        /*
+         * Be consistent with CPU mappings: Dom0 is permitted to establish r/o
+         * ones there, so it should also have such established for IOMMUs.
+         */
+        for ( i = 0; i < nr_ioapics; i++ )
+            if ( pfn == PFN_DOWN(mp_ioapics[i].mpc_apicaddr) )
+                return rangeset_contains_singleton(mmio_ro_ranges, pfn)
+                       ? IOMMUF_readable : 0;
+    }
     /*
      * ... or the PCIe MCFG regions.
      * TODO: runtime added MMCFG regions are not checked to make sure they
      * don't overlap with already mapped regions, thus preventing trapping.
      */
     if ( has_vpci(d) && vpci_is_mmcfg_address(d, pfn_to_paddr(pfn)) )
-        return false;
+        return 0;
 
-    return true;
+    return perms;
 }
 
 void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
@@ -246,15 +262,19 @@  void __hwdom_init arch_iommu_hwdom_init(
     for ( i = 0; i < top; i++ )
     {
         unsigned long pfn = pdx_to_pfn(i);
+        unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
         int rc;
 
-        if ( !hwdom_iommu_map(d, pfn, max_pfn) )
+        if ( !perms )
             rc = 0;
         else if ( paging_mode_translate(d) )
-            rc = set_identity_p2m_entry(d, pfn, p2m_access_rw, 0);
+            rc = set_identity_p2m_entry(d, pfn,
+                                        perms & IOMMUF_writable ? p2m_access_rw
+                                                                : p2m_access_r,
+                                        0);
         else
             rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
-                           IOMMUF_readable | IOMMUF_writable, &flush_flags);
+                           perms, &flush_flags);
 
         if ( rc )
             printk(XENLOG_WARNING "%pd: identity %smapping of %lx failed: %d\n",