diff mbox series

[for-4.20?] x86/dom0: be less restrictive with the Interrupt Address Range

Message ID 20250212153800.5159-1-roger.pau@citrix.com (mailing list archive)
State Superseded
Headers show
Series [for-4.20?] x86/dom0: be less restrictive with the Interrupt Address Range | expand

Commit Message

Roger Pau Monne Feb. 12, 2025, 3:38 p.m. UTC
Xen currently prevents dom0 from creating CPU or IOMMU page-table mappings
into the interrupt address range [0xfee00000, 0xfeefffff].  This range has
two different purposes.  For accesses from the CPU is contains the default
position of local APIC page at 0xfee00000.  For accesses from devices
it's the MSI address range, so the address field in the MSI entries
(usually) point to an address on that range to trigger an interrupt.

There are reports of Lenovo Thinkpad devices placing what seems to be the
UCSI shared mailbox at address 0xfeec2000 in the interrupt address range.
Attempting to use that device with a Linux PV dom0 leads to an error when
Linux kernel maps 0xfeec2000:

RIP: e030:xen_mc_flush+0x1e8/0x2b0
 xen_leave_lazy_mmu+0x15/0x60
 vmap_range_noflush+0x408/0x6f0
 __ioremap_caller+0x20d/0x350
 acpi_os_map_iomem+0x1a3/0x1c0
 acpi_ex_system_memory_space_handler+0x229/0x3f0
 acpi_ev_address_space_dispatch+0x17e/0x4c0
 acpi_ex_access_region+0x28a/0x510
 acpi_ex_field_datum_io+0x95/0x5c0
 acpi_ex_extract_from_field+0x36b/0x4e0
 acpi_ex_read_data_from_field+0xcb/0x430
 acpi_ex_resolve_node_to_value+0x2e0/0x530
 acpi_ex_resolve_to_value+0x1e7/0x550
 acpi_ds_evaluate_name_path+0x107/0x170
 acpi_ds_exec_end_op+0x392/0x860
 acpi_ps_parse_loop+0x268/0xa30
 acpi_ps_parse_aml+0x221/0x5e0
 acpi_ps_execute_method+0x171/0x3e0
 acpi_ns_evaluate+0x174/0x5d0
 acpi_evaluate_object+0x167/0x440
 acpi_evaluate_dsm+0xb6/0x130
 ucsi_acpi_dsm+0x53/0x80
 ucsi_acpi_read+0x2e/0x60
 ucsi_register+0x24/0xa0
 ucsi_acpi_probe+0x162/0x1e3
 platform_probe+0x48/0x90
 really_probe+0xde/0x340
 __driver_probe_device+0x78/0x110
 driver_probe_device+0x1f/0x90
 __driver_attach+0xd2/0x1c0
 bus_for_each_dev+0x77/0xc0
 bus_add_driver+0x112/0x1f0
 driver_register+0x72/0xd0
 do_one_initcall+0x48/0x300
 do_init_module+0x60/0x220
 __do_sys_init_module+0x17f/0x1b0
 do_syscall_64+0x82/0x170

Remove the restrictions to create mappings the interrupt address range for
dom0.  Note that the restriction to map the local APIC page is enforced
separately, and that continues to be present.

For PVH dom0 it's important that the restriction is removed from
arch_iommu_hwdom_init(), as the logic in that function creates mappings in
both the CPU and the IOMMU page tables for reserved regions in the memory
map.  The expectation is that the page at 0xfeec2000 will be added to the
host memory map using the EfiACPIMemoryNVS type, so arch_iommu_hwdom_init()
will create a mapping for it.

Note that even if the interrupt address range entries are populated in the
IOMMU page-tables no device access will reach those pages.  Device accesses
to the Interrupt Address Range will always be converted into Interrupt
Messages and are not subject to DMA remapping.

There's also the following restriction noted in Intel VT-d:

> Software must not program paging-structure entries to remap any address to
> the interrupt address range. Untranslated requests and translation requests
> that result in an address in the interrupt range will be blocked with
> condition code LGN.4 or SGN.8. Translated requests with an address in the
> interrupt address range are treated as Unsupported Request (UR).

However this restriction doesn't apply to the identity mappings possibly
created for dom0, since the interrupt address range is never subject to DMA
remapping.

Reported-by: Jürgen Groß <jgross@suse.com>
Link: https://lore.kernel.org/xen-devel/baade0a7-e204-4743-bda1-282df74e5f89@suse.com/
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/dom0_build.c           | 4 ----
 xen/drivers/passthrough/x86/iommu.c | 5 -----
 2 files changed, 9 deletions(-)

Comments

Roger Pau Monne Feb. 13, 2025, 8:46 a.m. UTC | #1
On Wed, Feb 12, 2025 at 04:38:00PM +0100, Roger Pau Monne wrote:
> Xen currently prevents dom0 from creating CPU or IOMMU page-table mappings
> into the interrupt address range [0xfee00000, 0xfeefffff].  This range has
> two different purposes.  For accesses from the CPU is contains the default
> position of local APIC page at 0xfee00000.  For accesses from devices
> it's the MSI address range, so the address field in the MSI entries
> (usually) point to an address on that range to trigger an interrupt.
> 
> There are reports of Lenovo Thinkpad devices placing what seems to be the
> UCSI shared mailbox at address 0xfeec2000 in the interrupt address range.
> Attempting to use that device with a Linux PV dom0 leads to an error when
> Linux kernel maps 0xfeec2000:
> 
> RIP: e030:xen_mc_flush+0x1e8/0x2b0
>  xen_leave_lazy_mmu+0x15/0x60
>  vmap_range_noflush+0x408/0x6f0
>  __ioremap_caller+0x20d/0x350
>  acpi_os_map_iomem+0x1a3/0x1c0
>  acpi_ex_system_memory_space_handler+0x229/0x3f0
>  acpi_ev_address_space_dispatch+0x17e/0x4c0
>  acpi_ex_access_region+0x28a/0x510
>  acpi_ex_field_datum_io+0x95/0x5c0
>  acpi_ex_extract_from_field+0x36b/0x4e0
>  acpi_ex_read_data_from_field+0xcb/0x430
>  acpi_ex_resolve_node_to_value+0x2e0/0x530
>  acpi_ex_resolve_to_value+0x1e7/0x550
>  acpi_ds_evaluate_name_path+0x107/0x170
>  acpi_ds_exec_end_op+0x392/0x860
>  acpi_ps_parse_loop+0x268/0xa30
>  acpi_ps_parse_aml+0x221/0x5e0
>  acpi_ps_execute_method+0x171/0x3e0
>  acpi_ns_evaluate+0x174/0x5d0
>  acpi_evaluate_object+0x167/0x440
>  acpi_evaluate_dsm+0xb6/0x130
>  ucsi_acpi_dsm+0x53/0x80
>  ucsi_acpi_read+0x2e/0x60
>  ucsi_register+0x24/0xa0
>  ucsi_acpi_probe+0x162/0x1e3
>  platform_probe+0x48/0x90
>  really_probe+0xde/0x340
>  __driver_probe_device+0x78/0x110
>  driver_probe_device+0x1f/0x90
>  __driver_attach+0xd2/0x1c0
>  bus_for_each_dev+0x77/0xc0
>  bus_add_driver+0x112/0x1f0
>  driver_register+0x72/0xd0
>  do_one_initcall+0x48/0x300
>  do_init_module+0x60/0x220
>  __do_sys_init_module+0x17f/0x1b0
>  do_syscall_64+0x82/0x170
> 
> Remove the restrictions to create mappings the interrupt address range for
> dom0.  Note that the restriction to map the local APIC page is enforced
> separately, and that continues to be present.
> 
> For PVH dom0 it's important that the restriction is removed from
> arch_iommu_hwdom_init(), as the logic in that function creates mappings in
> both the CPU and the IOMMU page tables for reserved regions in the memory
> map.  The expectation is that the page at 0xfeec2000 will be added to the
> host memory map using the EfiACPIMemoryNVS type, so arch_iommu_hwdom_init()
> will create a mapping for it.
> 
> Note that even if the interrupt address range entries are populated in the
> IOMMU page-tables no device access will reach those pages.  Device accesses
> to the Interrupt Address Range will always be converted into Interrupt
> Messages and are not subject to DMA remapping.
> 
> There's also the following restriction noted in Intel VT-d:
> 
> > Software must not program paging-structure entries to remap any address to
> > the interrupt address range. Untranslated requests and translation requests
> > that result in an address in the interrupt range will be blocked with
> > condition code LGN.4 or SGN.8. Translated requests with an address in the
> > interrupt address range are treated as Unsupported Request (UR).
> 
> However this restriction doesn't apply to the identity mappings possibly
> created for dom0, since the interrupt address range is never subject to DMA
> remapping.
> 
> Reported-by: Jürgen Groß <jgross@suse.com>
> Link: https://lore.kernel.org/xen-devel/baade0a7-e204-4743-bda1-282df74e5f89@suse.com/
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
>  xen/arch/x86/dom0_build.c           | 4 ----
>  xen/drivers/passthrough/x86/iommu.c | 5 -----
>  2 files changed, 9 deletions(-)
> 
> diff --git a/xen/arch/x86/dom0_build.c b/xen/arch/x86/dom0_build.c
> index e8f5bf5447bc..d1b4ef83b2d0 100644
> --- a/xen/arch/x86/dom0_build.c
> +++ b/xen/arch/x86/dom0_build.c
> @@ -555,10 +555,6 @@ int __init dom0_setup_permissions(struct domain *d)
>          if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn) )
>              rc |= iomem_deny_access(d, mfn, mfn);
>      }
> -    /* MSI range. */
> -    rc |= iomem_deny_access(d, paddr_to_pfn(MSI_ADDR_BASE_LO),
> -                            paddr_to_pfn(MSI_ADDR_BASE_LO +
> -                                         MSI_ADDR_DEST_ID_MASK));
>      /* HyperTransport range. */
>      if ( boot_cpu_data.x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON) )
>      {
> diff --git a/xen/drivers/passthrough/x86/iommu.c b/xen/drivers/passthrough/x86/iommu.c
> index 8b1e0596b84a..ec17701c90dd 100644
> --- a/xen/drivers/passthrough/x86/iommu.c
> +++ b/xen/drivers/passthrough/x86/iommu.c
> @@ -475,11 +475,6 @@ void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
>      if ( rc )
>          panic("IOMMU failed to remove Xen ranges: %d\n", rc);
>  
> -    /* Remove any overlap with the Interrupt Address Range. */
> -    rc = rangeset_remove_range(map, 0xfee00, 0xfeeff);
> -    if ( rc )
> -        panic("IOMMU failed to remove Interrupt Address Range: %d\n", rc);
> -

This last chunk is not correct, if the interrupt address range is not
removed, the local APIC page needs to be filtered out, so this should
instead be:

diff --git a/xen/drivers/passthrough/x86/iommu.c b/xen/drivers/passthrough/x86/iommu.c
index 8b1e0596b84a..c53626dfc69d 100644
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -475,10 +475,11 @@ void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
     if ( rc )
         panic("IOMMU failed to remove Xen ranges: %d\n", rc);

-    /* Remove any overlap with the Interrupt Address Range. */
-    rc = rangeset_remove_range(map, 0xfee00, 0xfeeff);
+    /* Remove any overlap with the local APIC page. */
+    rc = rangeset_remove_range(map, paddr_to_pfn(mp_lapic_addr),
+                               paddr_to_pfn(mp_lapic_addr));
     if ( rc )
-        panic("IOMMU failed to remove Interrupt Address Range: %d\n", rc);
+        panic("IOMMU failed to remove local APIC page: %d\n", rc);

     /* If emulating IO-APIC(s) make sure the base address is unmapped. */
     if ( has_vioapic(d) )
Jan Beulich Feb. 13, 2025, 9:06 a.m. UTC | #2
On 12.02.2025 16:38, Roger Pau Monne wrote:
> --- a/xen/arch/x86/dom0_build.c
> +++ b/xen/arch/x86/dom0_build.c
> @@ -555,10 +555,6 @@ int __init dom0_setup_permissions(struct domain *d)
>          if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn) )
>              rc |= iomem_deny_access(d, mfn, mfn);
>      }
> -    /* MSI range. */
> -    rc |= iomem_deny_access(d, paddr_to_pfn(MSI_ADDR_BASE_LO),
> -                            paddr_to_pfn(MSI_ADDR_BASE_LO +
> -                                         MSI_ADDR_DEST_ID_MASK));

For this one, yes, I can see the LAPIC counterpart a few lines up from here
(as the description says). In arch_iommu_hwdom_init(), however, I can't.
Where's that?

> --- a/xen/drivers/passthrough/x86/iommu.c
> +++ b/xen/drivers/passthrough/x86/iommu.c
> @@ -475,11 +475,6 @@ void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
>      if ( rc )
>          panic("IOMMU failed to remove Xen ranges: %d\n", rc);
>  
> -    /* Remove any overlap with the Interrupt Address Range. */
> -    rc = rangeset_remove_range(map, 0xfee00, 0xfeeff);
> -    if ( rc )
> -        panic("IOMMU failed to remove Interrupt Address Range: %d\n", rc);

Besides being puzzled by the use of literal numbers here, why was this
necessary in the first place? Or in other words, why do we not respect the
domain's ->iomem_caps here (irrespective of iommu_hwdom_{inclusive,reserved}),
thus making sure we don't allow access to anything dom0_setup_permissions()
denies? There is iomem_access_permitted() checking in identity_map() for PV,
but no equivalent for PVH that I could spot. If that was checked somewhere, my
question on the earlier hunk would then also be addressed, of course.

Further, with the expectation for the UCSI region to eventually be marked
ACPI_NVS, how's that going to help here? The loop over the E820 map a few
lines up from here doesn't make any mappings for E820_{ACPI,NVS}. (later) Oh,
pvh_setup_acpi() does map them, and it running after iommu_hwdom_init() the
mappings should be made in both page tables (if not shared). Which gets me to
a tangential question though: Am I failing to spot where we avoid, for the
shared page tables case, doing all the work arch_iommu_hwdom_init() does?

Jan
Roger Pau Monne Feb. 13, 2025, 5:07 p.m. UTC | #3
On Thu, Feb 13, 2025 at 10:06:26AM +0100, Jan Beulich wrote:
> On 12.02.2025 16:38, Roger Pau Monne wrote:
> > --- a/xen/arch/x86/dom0_build.c
> > +++ b/xen/arch/x86/dom0_build.c
> > @@ -555,10 +555,6 @@ int __init dom0_setup_permissions(struct domain *d)
> >          if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn) )
> >              rc |= iomem_deny_access(d, mfn, mfn);
> >      }
> > -    /* MSI range. */
> > -    rc |= iomem_deny_access(d, paddr_to_pfn(MSI_ADDR_BASE_LO),
> > -                            paddr_to_pfn(MSI_ADDR_BASE_LO +
> > -                                         MSI_ADDR_DEST_ID_MASK));
> 
> For this one, yes, I can see the LAPIC counterpart a few lines up from here
> (as the description says). In arch_iommu_hwdom_init(), however, I can't.
> Where's that?

We crossed emails, as a bit before this reply from yours I sent:

https://lore.kernel.org/xen-devel/Z62xS26FBClpsol9@macbook.local/

> > --- a/xen/drivers/passthrough/x86/iommu.c
> > +++ b/xen/drivers/passthrough/x86/iommu.c
> > @@ -475,11 +475,6 @@ void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
> >      if ( rc )
> >          panic("IOMMU failed to remove Xen ranges: %d\n", rc);
> >  
> > -    /* Remove any overlap with the Interrupt Address Range. */
> > -    rc = rangeset_remove_range(map, 0xfee00, 0xfeeff);
> > -    if ( rc )
> > -        panic("IOMMU failed to remove Interrupt Address Range: %d\n", rc);
> 
> Besides being puzzled by the use of literal numbers here, why was this
> necessary in the first place? Or in other words, why do we not respect the
> domain's ->iomem_caps here (irrespective of iommu_hwdom_{inclusive,reserved}),
> thus making sure we don't allow access to anything dom0_setup_permissions()
> denies? There is iomem_access_permitted() checking in identity_map() for PV,
> but no equivalent for PVH that I could spot. If that was checked somewhere, my
> question on the earlier hunk would then also be addressed, of course.

Indeed, I wondered the same when adjusting this code, I think I might
go ahead and add a pre-patch that switches the code in
arch_iommu_hwdom_init() to use ->iomem_caps and remove all the
open-coded filtering if feasible.

> Further, with the expectation for the UCSI region to eventually be marked
> ACPI_NVS, how's that going to help here? The loop over the E820 map a few
> lines up from here doesn't make any mappings for E820_{ACPI,NVS}. (later) Oh,
> pvh_setup_acpi() does map them, and it running after iommu_hwdom_init() the
> mappings should be made in both page tables (if not shared).

That code is not very well laid out, we should do the mappings in a
single place preferably.

> Which gets me to
> a tangential question though: Am I failing to spot where we avoid, for the
> shared page tables case, doing all the work arch_iommu_hwdom_init() does?

Even in the shared page-table case Xen needs to perform the work done
by arch_iommu_hwdom_init(), as it maps reserved (E820_RESERVED)
regions into dom0 p2m.  PVH dom0 mandates strict mode, so
arch_iommu_hwdom_init() just maps E820_RESERVED for PVH.

Thanks, Roger.
Jan Beulich Feb. 14, 2025, 1:01 p.m. UTC | #4
On 12.02.2025 16:38, Roger Pau Monne wrote:
> There's also the following restriction noted in Intel VT-d:
> 
>> Software must not program paging-structure entries to remap any address to
>> the interrupt address range. Untranslated requests and translation requests
>> that result in an address in the interrupt range will be blocked with
>> condition code LGN.4 or SGN.8. Translated requests with an address in the
>> interrupt address range are treated as Unsupported Request (UR).
> 
> However this restriction doesn't apply to the identity mappings possibly
> created for dom0, since the interrupt address range is never subject to DMA
> remapping.

Coming back to this also with your on-demand-p2m-populating patch in mind:
I'm having some trouble following your comment on this quotation. The doc
text is quite clear that page table entries must not point at the interrupt
address range. They don't make an exception for identity mappings. And we
don't know how the IOMMUs internally work, e.g. what sanity checks they do
(and what failure thereof would result in).

Instead I'm now wondering whether we don't need to
- prevent ept_set_entry() from propagating to the IOMMU mappings targeting
  the interrupt range,
- demand non-shared page tables if any such mapping is to be established
  in the CPU page tables.

Then again I won't assert that my interpretation of that quoted text makes
sense at all.

Jan
Roger Pau Monne Feb. 14, 2025, 1:57 p.m. UTC | #5
On Fri, Feb 14, 2025 at 02:01:09PM +0100, Jan Beulich wrote:
> On 12.02.2025 16:38, Roger Pau Monne wrote:
> > There's also the following restriction noted in Intel VT-d:
> > 
> >> Software must not program paging-structure entries to remap any address to
> >> the interrupt address range. Untranslated requests and translation requests
> >> that result in an address in the interrupt range will be blocked with
> >> condition code LGN.4 or SGN.8. Translated requests with an address in the
> >> interrupt address range are treated as Unsupported Request (UR).
> > 
> > However this restriction doesn't apply to the identity mappings possibly
> > created for dom0, since the interrupt address range is never subject to DMA
> > remapping.
> 
> Coming back to this also with your on-demand-p2m-populating patch in mind:
> I'm having some trouble following your comment on this quotation. The doc
> text is quite clear that page table entries must not point at the interrupt
> address range. They don't make an exception for identity mappings. And we
> don't know how the IOMMUs internally work, e.g. what sanity checks they do
> (and what failure thereof would result in).

My understanding is that address translation will never happen for the
interrupt address range, so whatever gets mapped on that range will
never be translated by the IOMMU.  Hence for the specific case here,
there will never be untranslated request that result in an address in
the interrupt address range, because translation is not done for input
addresses in the interrupt address range.

Sorry, hope the above makes sense, I'm having a hard time trying to
write down what I understand happens when the IOMMU handles accesses
to the interrupt address range.

Maybe a diagram would be easier.  This is my understanding of how
translation works in the IOMMU:

 input address -> translation -> output address

However input addresses that are in the interrupt address range are
not subject to translation, and hence there's no output address that
can be subject to the quoted VT-d paragraph.

> Instead I'm now wondering whether we don't need to
> - prevent ept_set_entry() from propagating to the IOMMU mappings targeting
>   the interrupt range,
> - demand non-shared page tables if any such mapping is to be established
>   in the CPU page tables.
> 
> Then again I won't assert that my interpretation of that quoted text makes
> sense at all.

See above, *I think* the quoted paragraph only applies to output
addresses, and in the case of mappings created on the interrupt
address range there's simply no output address.

Thanks, Roger.
Jan Beulich Feb. 17, 2025, 8:49 a.m. UTC | #6
On 14.02.2025 14:57, Roger Pau Monné wrote:
> On Fri, Feb 14, 2025 at 02:01:09PM +0100, Jan Beulich wrote:
>> On 12.02.2025 16:38, Roger Pau Monne wrote:
>>> There's also the following restriction noted in Intel VT-d:
>>>
>>>> Software must not program paging-structure entries to remap any address to
>>>> the interrupt address range. Untranslated requests and translation requests
>>>> that result in an address in the interrupt range will be blocked with
>>>> condition code LGN.4 or SGN.8. Translated requests with an address in the
>>>> interrupt address range are treated as Unsupported Request (UR).
>>>
>>> However this restriction doesn't apply to the identity mappings possibly
>>> created for dom0, since the interrupt address range is never subject to DMA
>>> remapping.
>>
>> Coming back to this also with your on-demand-p2m-populating patch in mind:
>> I'm having some trouble following your comment on this quotation. The doc
>> text is quite clear that page table entries must not point at the interrupt
>> address range. They don't make an exception for identity mappings. And we
>> don't know how the IOMMUs internally work, e.g. what sanity checks they do
>> (and what failure thereof would result in).
> 
> My understanding is that address translation will never happen for the
> interrupt address range, so whatever gets mapped on that range will
> never be translated by the IOMMU.  Hence for the specific case here,
> there will never be untranslated request that result in an address in
> the interrupt address range, because translation is not done for input
> addresses in the interrupt address range.
> 
> Sorry, hope the above makes sense, I'm having a hard time trying to
> write down what I understand happens when the IOMMU handles accesses
> to the interrupt address range.
> 
> Maybe a diagram would be easier.  This is my understanding of how
> translation works in the IOMMU:
> 
>  input address -> translation -> output address
> 
> However input addresses that are in the interrupt address range are
> not subject to translation, and hence there's no output address that
> can be subject to the quoted VT-d paragraph.

I agree this is a possible (and plausible) interpretation of that text.
I'm merely unconvinced it's the only possible one.

Jan

>> Instead I'm now wondering whether we don't need to
>> - prevent ept_set_entry() from propagating to the IOMMU mappings targeting
>>   the interrupt range,
>> - demand non-shared page tables if any such mapping is to be established
>>   in the CPU page tables.
>>
>> Then again I won't assert that my interpretation of that quoted text makes
>> sense at all.
> 
> See above, *I think* the quoted paragraph only applies to output
> addresses, and in the case of mappings created on the interrupt
> address range there's simply no output address.
> 
> Thanks, Roger.
Oleksii Kurochko Feb. 17, 2025, 8:52 a.m. UTC | #7
On 2/12/25 4:38 PM, Roger Pau Monne wrote:
> Xen currently prevents dom0 from creating CPU or IOMMU page-table mappings
> into the interrupt address range [0xfee00000, 0xfeefffff].  This range has
> two different purposes.  For accesses from the CPU is contains the default
> position of local APIC page at 0xfee00000.  For accesses from devices
> it's the MSI address range, so the address field in the MSI entries
> (usually) point to an address on that range to trigger an interrupt.
>
> There are reports of Lenovo Thinkpad devices placing what seems to be the
> UCSI shared mailbox at address 0xfeec2000 in the interrupt address range.
> Attempting to use that device with a Linux PV dom0 leads to an error when
> Linux kernel maps 0xfeec2000:
>
> RIP: e030:xen_mc_flush+0x1e8/0x2b0
>   xen_leave_lazy_mmu+0x15/0x60
>   vmap_range_noflush+0x408/0x6f0
>   __ioremap_caller+0x20d/0x350
>   acpi_os_map_iomem+0x1a3/0x1c0
>   acpi_ex_system_memory_space_handler+0x229/0x3f0
>   acpi_ev_address_space_dispatch+0x17e/0x4c0
>   acpi_ex_access_region+0x28a/0x510
>   acpi_ex_field_datum_io+0x95/0x5c0
>   acpi_ex_extract_from_field+0x36b/0x4e0
>   acpi_ex_read_data_from_field+0xcb/0x430
>   acpi_ex_resolve_node_to_value+0x2e0/0x530
>   acpi_ex_resolve_to_value+0x1e7/0x550
>   acpi_ds_evaluate_name_path+0x107/0x170
>   acpi_ds_exec_end_op+0x392/0x860
>   acpi_ps_parse_loop+0x268/0xa30
>   acpi_ps_parse_aml+0x221/0x5e0
>   acpi_ps_execute_method+0x171/0x3e0
>   acpi_ns_evaluate+0x174/0x5d0
>   acpi_evaluate_object+0x167/0x440
>   acpi_evaluate_dsm+0xb6/0x130
>   ucsi_acpi_dsm+0x53/0x80
>   ucsi_acpi_read+0x2e/0x60
>   ucsi_register+0x24/0xa0
>   ucsi_acpi_probe+0x162/0x1e3
>   platform_probe+0x48/0x90
>   really_probe+0xde/0x340
>   __driver_probe_device+0x78/0x110
>   driver_probe_device+0x1f/0x90
>   __driver_attach+0xd2/0x1c0
>   bus_for_each_dev+0x77/0xc0
>   bus_add_driver+0x112/0x1f0
>   driver_register+0x72/0xd0
>   do_one_initcall+0x48/0x300
>   do_init_module+0x60/0x220
>   __do_sys_init_module+0x17f/0x1b0
>   do_syscall_64+0x82/0x170
>
> Remove the restrictions to create mappings the interrupt address range for
> dom0.  Note that the restriction to map the local APIC page is enforced
> separately, and that continues to be present.
>
> For PVH dom0 it's important that the restriction is removed from
> arch_iommu_hwdom_init(), as the logic in that function creates mappings in
> both the CPU and the IOMMU page tables for reserved regions in the memory
> map.  The expectation is that the page at 0xfeec2000 will be added to the
> host memory map using the EfiACPIMemoryNVS type, so arch_iommu_hwdom_init()
> will create a mapping for it.
>
> Note that even if the interrupt address range entries are populated in the
> IOMMU page-tables no device access will reach those pages.  Device accesses
> to the Interrupt Address Range will always be converted into Interrupt
> Messages and are not subject to DMA remapping.
>
> There's also the following restriction noted in Intel VT-d:
>
>> Software must not program paging-structure entries to remap any address to
>> the interrupt address range. Untranslated requests and translation requests
>> that result in an address in the interrupt range will be blocked with
>> condition code LGN.4 or SGN.8. Translated requests with an address in the
>> interrupt address range are treated as Unsupported Request (UR).
> However this restriction doesn't apply to the identity mappings possibly
> created for dom0, since the interrupt address range is never subject to DMA
> remapping.
>
> Reported-by: Jürgen Groß<jgross@suse.com>
> Link:https://lore.kernel.org/xen-devel/baade0a7-e204-4743-bda1-282df74e5f89@suse.com/
> Signed-off-by: Roger Pau Monné<roger.pau@citrix.com>

Considering that the patch hasn't received the required Acks, I prefer to include it in 4.21
since we are too close to the release date and backport it if necessary.

Any objections to including it in next release?

~ Oleksii

> ---
>   xen/arch/x86/dom0_build.c           | 4 ----
>   xen/drivers/passthrough/x86/iommu.c | 5 -----
>   2 files changed, 9 deletions(-)
>
> diff --git a/xen/arch/x86/dom0_build.c b/xen/arch/x86/dom0_build.c
> index e8f5bf5447bc..d1b4ef83b2d0 100644
> --- a/xen/arch/x86/dom0_build.c
> +++ b/xen/arch/x86/dom0_build.c
> @@ -555,10 +555,6 @@ int __init dom0_setup_permissions(struct domain *d)
>           if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn) )
>               rc |= iomem_deny_access(d, mfn, mfn);
>       }
> -    /* MSI range. */
> -    rc |= iomem_deny_access(d, paddr_to_pfn(MSI_ADDR_BASE_LO),
> -                            paddr_to_pfn(MSI_ADDR_BASE_LO +
> -                                         MSI_ADDR_DEST_ID_MASK));
>       /* HyperTransport range. */
>       if ( boot_cpu_data.x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON) )
>       {
> diff --git a/xen/drivers/passthrough/x86/iommu.c b/xen/drivers/passthrough/x86/iommu.c
> index 8b1e0596b84a..ec17701c90dd 100644
> --- a/xen/drivers/passthrough/x86/iommu.c
> +++ b/xen/drivers/passthrough/x86/iommu.c
> @@ -475,11 +475,6 @@ void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
>       if ( rc )
>           panic("IOMMU failed to remove Xen ranges: %d\n", rc);
>   
> -    /* Remove any overlap with the Interrupt Address Range. */
> -    rc = rangeset_remove_range(map, 0xfee00, 0xfeeff);
> -    if ( rc )
> -        panic("IOMMU failed to remove Interrupt Address Range: %d\n", rc);
> -
>       /* If emulating IO-APIC(s) make sure the base address is unmapped. */
>       if ( has_vioapic(d) )
>       {
Roger Pau Monne Feb. 17, 2025, 9:47 a.m. UTC | #8
On Mon, Feb 17, 2025 at 09:49:18AM +0100, Jan Beulich wrote:
> On 14.02.2025 14:57, Roger Pau Monné wrote:
> > On Fri, Feb 14, 2025 at 02:01:09PM +0100, Jan Beulich wrote:
> >> On 12.02.2025 16:38, Roger Pau Monne wrote:
> >>> There's also the following restriction noted in Intel VT-d:
> >>>
> >>>> Software must not program paging-structure entries to remap any address to
> >>>> the interrupt address range. Untranslated requests and translation requests
> >>>> that result in an address in the interrupt range will be blocked with
> >>>> condition code LGN.4 or SGN.8. Translated requests with an address in the
> >>>> interrupt address range are treated as Unsupported Request (UR).
> >>>
> >>> However this restriction doesn't apply to the identity mappings possibly
> >>> created for dom0, since the interrupt address range is never subject to DMA
> >>> remapping.
> >>
> >> Coming back to this also with your on-demand-p2m-populating patch in mind:
> >> I'm having some trouble following your comment on this quotation. The doc
> >> text is quite clear that page table entries must not point at the interrupt
> >> address range. They don't make an exception for identity mappings. And we
> >> don't know how the IOMMUs internally work, e.g. what sanity checks they do
> >> (and what failure thereof would result in).
> > 
> > My understanding is that address translation will never happen for the
> > interrupt address range, so whatever gets mapped on that range will
> > never be translated by the IOMMU.  Hence for the specific case here,
> > there will never be untranslated request that result in an address in
> > the interrupt address range, because translation is not done for input
> > addresses in the interrupt address range.
> > 
> > Sorry, hope the above makes sense, I'm having a hard time trying to
> > write down what I understand happens when the IOMMU handles accesses
> > to the interrupt address range.
> > 
> > Maybe a diagram would be easier.  This is my understanding of how
> > translation works in the IOMMU:
> > 
> >  input address -> translation -> output address
> > 
> > However input addresses that are in the interrupt address range are
> > not subject to translation, and hence there's no output address that
> > can be subject to the quoted VT-d paragraph.
> 
> I agree this is a possible (and plausible) interpretation of that text.
> I'm merely unconvinced it's the only possible one.

The AMD-Vi specification mentions the following regarding the
interrupt address range:

> 2.1.4.2 Interrupt Address Range
>
> Accesses to the interrupt address range (Table 3) are defined to go
> through the interrupt remapping portion of the IOMMU and not through
> address translation processing. Therefore, when a transaction is being
> processed as an interrupt remapping operation, the transaction
> attribute of pretranslated or untranslated is ignored.
>
> Software Note: The IOMMU should not be configured such that an address
> translation results in a special address such as the interrupt address
> range.

Which I interpret in the same way as VT-d: input addresses that belong
to the interrupt address range are not subject to translation.  Output
addresses that belong to the interrupt address range are not allowed,
otherwise the IOMMU will raise a fault.

I've added the following chunk to Xen:

diff --git a/xen/drivers/passthrough/x86/iommu.c b/xen/drivers/passthrough/x86/iommu.c
index 8b1e0596b84a..20aa46e305a3 100644
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -480,6 +480,9 @@ void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
     if ( rc )
         panic("IOMMU failed to remove Interrupt Address Range: %d\n", rc);
 
+    rc = rangeset_add_range(map, 0xfee01, 0xfeeff);
+    BUG_ON(rc);
+
     /* If emulating IO-APIC(s) make sure the base address is unmapped. */
     if ( has_vioapic(d) )
     {

And run a basic test on each server microarchitecture (AMD Naples to
Genoa, Intel Haswell to Emerald Rapids) available on XenRT, everything
seems to be OK, no IOMMU faults, but still running.

Would you agree to allowing mappings to the interrupt address range if
the above test raises no issues?  I know it's not every possible piece
of hardware out there, but it's quite good coverage.

IMO Xen should allow the creation of mappings on the interrupt address
range, otherwise I don't see how we can deal with Thinkpad issue.  And
this issue we known about, but sadly we have no visibility of what
firmware might put in that range.

Thanks, Roger.
Jan Beulich Feb. 17, 2025, 9:54 a.m. UTC | #9
On 17.02.2025 10:47, Roger Pau Monné wrote:
> On Mon, Feb 17, 2025 at 09:49:18AM +0100, Jan Beulich wrote:
>> On 14.02.2025 14:57, Roger Pau Monné wrote:
>>> On Fri, Feb 14, 2025 at 02:01:09PM +0100, Jan Beulich wrote:
>>>> On 12.02.2025 16:38, Roger Pau Monne wrote:
>>>>> There's also the following restriction noted in Intel VT-d:
>>>>>
>>>>>> Software must not program paging-structure entries to remap any address to
>>>>>> the interrupt address range. Untranslated requests and translation requests
>>>>>> that result in an address in the interrupt range will be blocked with
>>>>>> condition code LGN.4 or SGN.8. Translated requests with an address in the
>>>>>> interrupt address range are treated as Unsupported Request (UR).
>>>>>
>>>>> However this restriction doesn't apply to the identity mappings possibly
>>>>> created for dom0, since the interrupt address range is never subject to DMA
>>>>> remapping.
>>>>
>>>> Coming back to this also with your on-demand-p2m-populating patch in mind:
>>>> I'm having some trouble following your comment on this quotation. The doc
>>>> text is quite clear that page table entries must not point at the interrupt
>>>> address range. They don't make an exception for identity mappings. And we
>>>> don't know how the IOMMUs internally work, e.g. what sanity checks they do
>>>> (and what failure thereof would result in).
>>>
>>> My understanding is that address translation will never happen for the
>>> interrupt address range, so whatever gets mapped on that range will
>>> never be translated by the IOMMU.  Hence for the specific case here,
>>> there will never be untranslated request that result in an address in
>>> the interrupt address range, because translation is not done for input
>>> addresses in the interrupt address range.
>>>
>>> Sorry, hope the above makes sense, I'm having a hard time trying to
>>> write down what I understand happens when the IOMMU handles accesses
>>> to the interrupt address range.
>>>
>>> Maybe a diagram would be easier.  This is my understanding of how
>>> translation works in the IOMMU:
>>>
>>>  input address -> translation -> output address
>>>
>>> However input addresses that are in the interrupt address range are
>>> not subject to translation, and hence there's no output address that
>>> can be subject to the quoted VT-d paragraph.
>>
>> I agree this is a possible (and plausible) interpretation of that text.
>> I'm merely unconvinced it's the only possible one.
> 
> The AMD-Vi specification mentions the following regarding the
> interrupt address range:
> 
>> 2.1.4.2 Interrupt Address Range
>>
>> Accesses to the interrupt address range (Table 3) are defined to go
>> through the interrupt remapping portion of the IOMMU and not through
>> address translation processing. Therefore, when a transaction is being
>> processed as an interrupt remapping operation, the transaction
>> attribute of pretranslated or untranslated is ignored.
>>
>> Software Note: The IOMMU should not be configured such that an address
>> translation results in a special address such as the interrupt address
>> range.
> 
> Which I interpret in the same way as VT-d: input addresses that belong
> to the interrupt address range are not subject to translation.  Output
> addresses that belong to the interrupt address range are not allowed,
> otherwise the IOMMU will raise a fault.
> 
> I've added the following chunk to Xen:
> 
> diff --git a/xen/drivers/passthrough/x86/iommu.c b/xen/drivers/passthrough/x86/iommu.c
> index 8b1e0596b84a..20aa46e305a3 100644
> --- a/xen/drivers/passthrough/x86/iommu.c
> +++ b/xen/drivers/passthrough/x86/iommu.c
> @@ -480,6 +480,9 @@ void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
>      if ( rc )
>          panic("IOMMU failed to remove Interrupt Address Range: %d\n", rc);
>  
> +    rc = rangeset_add_range(map, 0xfee01, 0xfeeff);
> +    BUG_ON(rc);
> +
>      /* If emulating IO-APIC(s) make sure the base address is unmapped. */
>      if ( has_vioapic(d) )
>      {
> 
> And run a basic test on each server microarchitecture (AMD Naples to
> Genoa, Intel Haswell to Emerald Rapids) available on XenRT, everything
> seems to be OK, no IOMMU faults, but still running.
> 
> Would you agree to allowing mappings to the interrupt address range if
> the above test raises no issues?  I know it's not every possible piece
> of hardware out there, but it's quite good coverage.

Yeah, I think that ought to be good enough.

Jan
diff mbox series

Patch

diff --git a/xen/arch/x86/dom0_build.c b/xen/arch/x86/dom0_build.c
index e8f5bf5447bc..d1b4ef83b2d0 100644
--- a/xen/arch/x86/dom0_build.c
+++ b/xen/arch/x86/dom0_build.c
@@ -555,10 +555,6 @@  int __init dom0_setup_permissions(struct domain *d)
         if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn) )
             rc |= iomem_deny_access(d, mfn, mfn);
     }
-    /* MSI range. */
-    rc |= iomem_deny_access(d, paddr_to_pfn(MSI_ADDR_BASE_LO),
-                            paddr_to_pfn(MSI_ADDR_BASE_LO +
-                                         MSI_ADDR_DEST_ID_MASK));
     /* HyperTransport range. */
     if ( boot_cpu_data.x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON) )
     {
diff --git a/xen/drivers/passthrough/x86/iommu.c b/xen/drivers/passthrough/x86/iommu.c
index 8b1e0596b84a..ec17701c90dd 100644
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -475,11 +475,6 @@  void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
     if ( rc )
         panic("IOMMU failed to remove Xen ranges: %d\n", rc);
 
-    /* Remove any overlap with the Interrupt Address Range. */
-    rc = rangeset_remove_range(map, 0xfee00, 0xfeeff);
-    if ( rc )
-        panic("IOMMU failed to remove Interrupt Address Range: %d\n", rc);
-
     /* If emulating IO-APIC(s) make sure the base address is unmapped. */
     if ( has_vioapic(d) )
     {