diff mbox series

[v6,3/7] x86/hvm: Allow access to registers on the same page as MSI-X table

Message ID a9b04e2224e97a27a127a003e8ccf5edfd4922c7.1714154036.git-series.marmarek@invisiblethingslab.com (mailing list archive)
State Superseded
Headers show
Series MSI-X support with qemu in stubdomain, and other related changes | expand

Commit Message

Marek Marczykowski-Górecki April 26, 2024, 5:54 p.m. UTC
Some devices (notably Intel Wifi 6 AX210 card) keep auxiliary registers
on the same page as MSI-X table. Device model (especially one in
stubdomain) cannot really handle those, as direct writes to that page is
refused (page is on the mmio_ro_ranges list). Instead, extend
msixtbl_mmio_ops to handle such accesses too.

Doing this, requires correlating read/write location with guest
of MSI-X table address. Since QEMU doesn't map MSI-X table to the guest,
it requires msixtbl_entry->gtable, which is HVM-only. Similar feature
for PV would need to be done separately.

This will be also used to read Pending Bit Array, if it lives on the same
page, making QEMU not needing /dev/mem access at all (especially helpful
with lockdown enabled in dom0). If PBA lives on another page, QEMU will
map it to the guest directly.
If PBA lives on the same page, discard writes and log a message.
Technically, writes outside of PBA could be allowed, but at this moment
the precise location of PBA isn't saved, and also no known device abuses
the spec in this way (at least yet).

To access those registers, msixtbl_mmio_ops need the relevant page
mapped. MSI handling already has infrastructure for that, using fixmap,
so try to map first/last page of the MSI-X table (if necessary) and save
their fixmap indexes. Note that msix_get_fixmap() does reference
counting and reuses existing mapping, so just call it directly, even if
the page was mapped before. Also, it uses a specific range of fixmap
indexes which doesn't include 0, so use 0 as default ("not mapped")
value - which simplifies code a bit.

GCC 12.2.1 gets confused about 'desc' variable:

    arch/x86/hvm/vmsi.c: In function ‘msixtbl_range’:
    arch/x86/hvm/vmsi.c:553:8: error: ‘desc’ may be used uninitialized [-Werror=maybe-uninitialized]
      553 |     if ( desc )
          |        ^
    arch/x86/hvm/vmsi.c:537:28: note: ‘desc’ was declared here
      537 |     const struct msi_desc *desc;
          |                            ^~~~

It's conditional initialization is actually correct (in the case where
it isn't initialized, function returns early), but to avoid
build failure initialize it explicitly to NULL anyway.

Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
---
Changes in v6:
- use MSIX_CHECK_WARN macro
- extend assert on fixmap_idx
- add break in default label, after ASSERT_UNREACHABLE(), and move
  setting default there
- style fixes
Changes in v5:
- style fixes
- include GCC version in the commit message
- warn only once (per domain, per device) about failed adjacent access
Changes in v4:
- drop same_page parameter of msixtbl_find_entry(), distinguish two
  cases in relevant callers
- rename adj_access_table_idx to adj_access_idx
- code style fixes
- drop alignment check in adjacent_{read,write}() - all callers already
  have it earlier
- delay mapping first/last MSI-X pages until preparing device for a
  passthrough
v3:
 - merge handling into msixtbl_mmio_ops
 - extend commit message
v2:
 - adjust commit message
 - pass struct domain to msixtbl_page_handler_get_hwaddr()
 - reduce local variables used only once
 - log a warning if write is forbidden if MSI-X and PBA lives on the same
   page
 - do not passthrough unaligned accesses
 - handle accesses both before and after MSI-X table
---
 xen/arch/x86/hvm/vmsi.c        | 200 ++++++++++++++++++++++++++++++++--
 xen/arch/x86/include/asm/msi.h |   5 +-
 xen/arch/x86/msi.c             |  41 +++++++-
 3 files changed, 236 insertions(+), 10 deletions(-)

Comments

Jan Beulich April 30, 2024, 3:03 p.m. UTC | #1
On 26.04.2024 19:54, Marek Marczykowski-Górecki wrote:
> Some devices (notably Intel Wifi 6 AX210 card) keep auxiliary registers
> on the same page as MSI-X table. Device model (especially one in
> stubdomain) cannot really handle those, as direct writes to that page is
> refused (page is on the mmio_ro_ranges list). Instead, extend
> msixtbl_mmio_ops to handle such accesses too.
> 
> Doing this, requires correlating read/write location with guest
> of MSI-X table address. Since QEMU doesn't map MSI-X table to the guest,
> it requires msixtbl_entry->gtable, which is HVM-only. Similar feature
> for PV would need to be done separately.
> 
> This will be also used to read Pending Bit Array, if it lives on the same
> page, making QEMU not needing /dev/mem access at all (especially helpful
> with lockdown enabled in dom0). If PBA lives on another page, QEMU will
> map it to the guest directly.
> If PBA lives on the same page, discard writes and log a message.
> Technically, writes outside of PBA could be allowed, but at this moment
> the precise location of PBA isn't saved, and also no known device abuses
> the spec in this way (at least yet).
> 
> To access those registers, msixtbl_mmio_ops need the relevant page
> mapped. MSI handling already has infrastructure for that, using fixmap,
> so try to map first/last page of the MSI-X table (if necessary) and save
> their fixmap indexes. Note that msix_get_fixmap() does reference
> counting and reuses existing mapping, so just call it directly, even if
> the page was mapped before. Also, it uses a specific range of fixmap
> indexes which doesn't include 0, so use 0 as default ("not mapped")
> value - which simplifies code a bit.
> 
> GCC 12.2.1 gets confused about 'desc' variable:
> 
>     arch/x86/hvm/vmsi.c: In function ‘msixtbl_range’:
>     arch/x86/hvm/vmsi.c:553:8: error: ‘desc’ may be used uninitialized [-Werror=maybe-uninitialized]
>       553 |     if ( desc )
>           |        ^
>     arch/x86/hvm/vmsi.c:537:28: note: ‘desc’ was declared here
>       537 |     const struct msi_desc *desc;
>           |                            ^~~~
> 
> It's conditional initialization is actually correct (in the case where
> it isn't initialized, function returns early), but to avoid
> build failure initialize it explicitly to NULL anyway.
> 
> Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>

_Without_ the usually implied ack (as indicated before) and with two
small tweaks (which can likely be taken care of while committing):
Reviewed-by: Jan Beulich <jbeulich@suse.com>

> +static int adjacent_read(
> +    unsigned int fixmap_idx,
> +    paddr_t address, unsigned int len, uint64_t *pval)
> +{
> +    const void __iomem *hwaddr;
> +
> +    ASSERT(fixmap_idx <= FIX_MSIX_IO_RESERV_END);
> +
> +    hwaddr = fix_to_virt(fixmap_idx) + PAGE_OFFSET(address);
> +
> +    switch ( len )
> +    {
> +    case 1:
> +        *pval = readb(hwaddr);
> +        break;
> +
> +    case 2:
> +        *pval = readw(hwaddr);
> +        break;
> +
> +    case 4:
> +        *pval = readl(hwaddr);
> +        break;
> +
> +    case 8:
> +        *pval = readq(hwaddr);
> +        break;
> +
> +    default:
> +        ASSERT_UNREACHABLE();
> +        *pval = ~0UL;

Nit: Better ~0ULL here (short of there being UINT64_C()).

> +        break;
> +    }
> +
> +    return X86EMUL_OKAY;
> +}
> +
> +static int adjacent_write(
> +    unsigned int fixmap_idx,
> +    paddr_t address, unsigned int len, uint64_t val)
> +{
> +    void __iomem *hwaddr;
> +
> +    if ( fixmap_idx == ADJACENT_DISCARD_WRITE )
> +        return X86EMUL_OKAY;
> +
> +    ASSERT(fixmap_idx <= FIX_MSIX_IO_RESERV_END);
> +
> +    hwaddr = fix_to_virt(fixmap_idx) + PAGE_OFFSET(address);
> +
> +    switch ( len )
> +    {
> +    case 1:
> +        writeb(val, hwaddr);
> +        break;
> +
> +    case 2:
> +        writew(val, hwaddr);
> +        break;
> +
> +    case 4:
> +        writel(val, hwaddr);
> +        break;
> +
> +    case 8:
> +        writeq(val, hwaddr);
> +        break;
> +
> +    default:
> +        ASSERT_UNREACHABLE();
> +    }

There's still a "break;" missing here.

Jan
Roger Pau Monné May 3, 2024, 8:33 a.m. UTC | #2
On Fri, Apr 26, 2024 at 07:54:00PM +0200, Marek Marczykowski-Górecki wrote:
> Some devices (notably Intel Wifi 6 AX210 card) keep auxiliary registers
> on the same page as MSI-X table. Device model (especially one in
> stubdomain) cannot really handle those, as direct writes to that page is
> refused (page is on the mmio_ro_ranges list). Instead, extend
> msixtbl_mmio_ops to handle such accesses too.
> 
> Doing this, requires correlating read/write location with guest
> of MSI-X table address. Since QEMU doesn't map MSI-X table to the guest,
  ^ extra 'of'?
> it requires msixtbl_entry->gtable, which is HVM-only. Similar feature
> for PV would need to be done separately.
> 
> This will be also used to read Pending Bit Array, if it lives on the same
> page, making QEMU not needing /dev/mem access at all (especially helpful
> with lockdown enabled in dom0). If PBA lives on another page, QEMU will
> map it to the guest directly.
> If PBA lives on the same page, discard writes and log a message.
> Technically, writes outside of PBA could be allowed, but at this moment
> the precise location of PBA isn't saved, and also no known device abuses
> the spec in this way (at least yet).
> 
> To access those registers, msixtbl_mmio_ops need the relevant page
> mapped. MSI handling already has infrastructure for that, using fixmap,
> so try to map first/last page of the MSI-X table (if necessary) and save
> their fixmap indexes. Note that msix_get_fixmap() does reference
> counting and reuses existing mapping, so just call it directly, even if
> the page was mapped before. Also, it uses a specific range of fixmap
> indexes which doesn't include 0, so use 0 as default ("not mapped")
> value - which simplifies code a bit.
> 
> GCC 12.2.1 gets confused about 'desc' variable:
> 
>     arch/x86/hvm/vmsi.c: In function ‘msixtbl_range’:
>     arch/x86/hvm/vmsi.c:553:8: error: ‘desc’ may be used uninitialized [-Werror=maybe-uninitialized]
>       553 |     if ( desc )
>           |        ^
>     arch/x86/hvm/vmsi.c:537:28: note: ‘desc’ was declared here
>       537 |     const struct msi_desc *desc;
>           |                            ^~~~
> 
> It's conditional initialization is actually correct (in the case where
> it isn't initialized, function returns early), but to avoid
> build failure initialize it explicitly to NULL anyway.
> 
> Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> ---
> Changes in v6:
> - use MSIX_CHECK_WARN macro
> - extend assert on fixmap_idx
> - add break in default label, after ASSERT_UNREACHABLE(), and move
>   setting default there
> - style fixes
> Changes in v5:
> - style fixes
> - include GCC version in the commit message
> - warn only once (per domain, per device) about failed adjacent access
> Changes in v4:
> - drop same_page parameter of msixtbl_find_entry(), distinguish two
>   cases in relevant callers
> - rename adj_access_table_idx to adj_access_idx
> - code style fixes
> - drop alignment check in adjacent_{read,write}() - all callers already
>   have it earlier
> - delay mapping first/last MSI-X pages until preparing device for a
>   passthrough
> v3:
>  - merge handling into msixtbl_mmio_ops
>  - extend commit message
> v2:
>  - adjust commit message
>  - pass struct domain to msixtbl_page_handler_get_hwaddr()
>  - reduce local variables used only once
>  - log a warning if write is forbidden if MSI-X and PBA lives on the same
>    page
>  - do not passthrough unaligned accesses
>  - handle accesses both before and after MSI-X table
> ---
>  xen/arch/x86/hvm/vmsi.c        | 200 ++++++++++++++++++++++++++++++++--
>  xen/arch/x86/include/asm/msi.h |   5 +-
>  xen/arch/x86/msi.c             |  41 +++++++-
>  3 files changed, 236 insertions(+), 10 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
> index 999917983789..230e3a5dee3f 100644
> --- a/xen/arch/x86/hvm/vmsi.c
> +++ b/xen/arch/x86/hvm/vmsi.c
> @@ -180,6 +180,10 @@ static bool msixtbl_initialised(const struct domain *d)
>      return d->arch.hvm.msixtbl_list.next;
>  }
>  
> +/*
> + * Lookup an msixtbl_entry on the same page as given addr. It's up to the
> + * caller to check if address is strictly part of the table - if relevant.
> + */
>  static struct msixtbl_entry *msixtbl_find_entry(
>      struct vcpu *v, unsigned long addr)
>  {
> @@ -187,8 +191,8 @@ static struct msixtbl_entry *msixtbl_find_entry(
>      struct domain *d = v->domain;
>  
>      list_for_each_entry( entry, &d->arch.hvm.msixtbl_list, list )
> -        if ( addr >= entry->gtable &&
> -             addr < entry->gtable + entry->table_len )
> +        if ( PFN_DOWN(addr) >= PFN_DOWN(entry->gtable) &&
> +             PFN_DOWN(addr) <= PFN_DOWN(entry->gtable + entry->table_len - 1) )
>              return entry;
>  
>      return NULL;
> @@ -213,6 +217,138 @@ static struct msi_desc *msixtbl_addr_to_desc(
>      return NULL;
>  }
>  
> +/*
> + * Returns:
> + *  - ADJACENT_DONT_HANDLE if no handling should be done
> + *  - ADJACENT_DISCARD_WRITE if write should be discarded
> + *  - a fixmap idx to use for handling
> + */
> +#define ADJACENT_DONT_HANDLE UINT_MAX
> +#define ADJACENT_DISCARD_WRITE (UINT_MAX - 1)

I think this could be simpler, there's no need to signal with so fine
grained detail about the action to be performed.

Any adjacent access to the MSI-X table should be handled by the logic
you are adding, so anything that falls in those ranges should
terminate here.

adjacent_handle() should IMO just return whether the access is
replayed against the hardware, or if it's just dropped.

> +static unsigned int adjacent_handle(
> +    const struct msixtbl_entry *entry, unsigned long addr, bool write)
> +{
> +    unsigned int adj_type;
> +    struct arch_msix *msix;
> +
> +    if ( !entry || !entry->pdev )
> +        return ADJACENT_DONT_HANDLE;
> +
> +    if ( PFN_DOWN(addr) == PFN_DOWN(entry->gtable) && addr < entry->gtable )
> +        adj_type = ADJ_IDX_FIRST;
> +    else if ( PFN_DOWN(addr) == PFN_DOWN(entry->gtable + entry->table_len - 1) &&
> +              addr >= entry->gtable + entry->table_len )
> +        adj_type = ADJ_IDX_LAST;
> +    else
> +        return ADJACENT_DONT_HANDLE;
> +
> +    msix = entry->pdev->msix;
> +    ASSERT(msix);
> +
> +    if ( !msix->adj_access_idx[adj_type] )
> +    {
> +        if ( MSIX_CHECK_WARN(msix, entry->pdev->domain->domain_id,
> +                             adjacent_not_initialized) )
> +            gprintk(XENLOG_WARNING,
> +                    "Page for adjacent(%d) MSI-X table access not initialized for %pp (addr %#lx, gtable %#lx\n",
> +                    adj_type, &entry->pdev->sbdf, addr, entry->gtable);
> +        return ADJACENT_DONT_HANDLE;
> +    }
> +
> +    /* If PBA lives on the same page too, discard writes. */
> +    if ( write &&
> +         ((adj_type == ADJ_IDX_LAST &&
> +           msix->table.last == msix->pba.first) ||
> +          (adj_type == ADJ_IDX_FIRST &&
> +           msix->table.first == msix->pba.last)) )
> +    {
> +        if ( MSIX_CHECK_WARN(msix, entry->pdev->domain->domain_id,
> +                             adjacent_pba) )
> +            gprintk(XENLOG_WARNING,
> +                    "MSI-X table and PBA of %pp live on the same page, "
> +                    "writing to other registers there is not implemented\n",
> +                    &entry->pdev->sbdf);
> +        return ADJACENT_DISCARD_WRITE;
> +    }
> +
> +    return msix->adj_access_idx[adj_type];
> +}
> +
> +static int adjacent_read(
> +    unsigned int fixmap_idx,
> +    paddr_t address, unsigned int len, uint64_t *pval)
> +{
> +    const void __iomem *hwaddr;
> +
> +    ASSERT(fixmap_idx <= FIX_MSIX_IO_RESERV_END);
> +
> +    hwaddr = fix_to_virt(fixmap_idx) + PAGE_OFFSET(address);

IMO adjacent_handle() should be called here (and in adjacent_write()),
and adjacent_{read,write}() called unconditionally from
msixtbl_{read,write}() when an access that doesn't fall in the MSI-X
table is handled.  See comment below in msixtbl_read().

> +
> +    switch ( len )
> +    {
> +    case 1:
> +        *pval = readb(hwaddr);
> +        break;
> +
> +    case 2:
> +        *pval = readw(hwaddr);
> +        break;
> +
> +    case 4:
> +        *pval = readl(hwaddr);
> +        break;
> +
> +    case 8:
> +        *pval = readq(hwaddr);
> +        break;
> +
> +    default:
> +        ASSERT_UNREACHABLE();
> +        *pval = ~0UL;
> +        break;
> +    }
> +
> +    return X86EMUL_OKAY;
> +}
> +
> +static int adjacent_write(
> +    unsigned int fixmap_idx,
> +    paddr_t address, unsigned int len, uint64_t val)
> +{
> +    void __iomem *hwaddr;
> +
> +    if ( fixmap_idx == ADJACENT_DISCARD_WRITE )
> +        return X86EMUL_OKAY;
> +
> +    ASSERT(fixmap_idx <= FIX_MSIX_IO_RESERV_END);

Since you check the idx is sane, shouldn't you also assert idx >=
FIX_MSIX_IO_RESERV_BASE?

> +
> +    hwaddr = fix_to_virt(fixmap_idx) + PAGE_OFFSET(address);
> +
> +    switch ( len )
> +    {
> +    case 1:
> +        writeb(val, hwaddr);
> +        break;
> +
> +    case 2:
> +        writew(val, hwaddr);
> +        break;
> +
> +    case 4:
> +        writel(val, hwaddr);
> +        break;
> +
> +    case 8:
> +        writeq(val, hwaddr);
> +        break;
> +
> +    default:
> +        ASSERT_UNREACHABLE();
> +    }
> +
> +    return X86EMUL_OKAY;
> +}
> +
>  static int cf_check msixtbl_read(
>      const struct hvm_io_handler *handler, uint64_t address, uint32_t len,
>      uint64_t *pval)
> @@ -220,9 +356,10 @@ static int cf_check msixtbl_read(
>      unsigned long offset;
>      struct msixtbl_entry *entry;
>      unsigned int nr_entry, index;
> +    unsigned int adjacent_fixmap;
>      int r = X86EMUL_UNHANDLEABLE;
>  
> -    if ( (len != 4 && len != 8) || (address & (len - 1)) )
> +    if ( !IS_ALIGNED(address, len) )
>          return r;
>  
>      rcu_read_lock(&msixtbl_rcu_lock);
> @@ -230,6 +367,21 @@ static int cf_check msixtbl_read(
>      entry = msixtbl_find_entry(current, address);
>      if ( !entry )
>          goto out;
> +
> +    adjacent_fixmap = adjacent_handle(entry, address, false);

This seems overly complicated, but is possible I'm missing some logic.

IMO it would seem way less convoluted to simply do:

entry = msixtbl_find_entry(current, address);
if ( !entry )
    goto out;

if ( address < entry->gtable ||
     address >= entry->gtable + entry->table_len )
{
    adjacent_read(...);
    goto out;
}

And put all the logic in adjacent_{read,write}() directly rather than
having both adjacent_{read,write}() plus adjacent_handle() calls here?

If the access doesn't fall between the boundaries of the MSI-X table
it's either going to be a handled adjacent access, or it's going to be
discarded.

> +    if ( adjacent_fixmap != ADJACENT_DONT_HANDLE )
> +    {
> +        r = adjacent_read(adjacent_fixmap, address, len, pval);
> +        goto out;
> +    }
> +
> +    if ( address < entry->gtable ||
> +         address >= entry->gtable + entry->table_len )
> +        goto out;
> +
> +    if ( len != 4 && len != 8 )
> +        goto out;
> +
>      offset = address & (PCI_MSIX_ENTRY_SIZE - 1);
>  
>      if ( offset != PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET )
> @@ -282,6 +434,7 @@ static int msixtbl_write(struct vcpu *v, unsigned long address,
>      int r = X86EMUL_UNHANDLEABLE;
>      unsigned long flags;
>      struct irq_desc *desc;
> +    unsigned int adjacent_fixmap;
>  
>      if ( !IS_ALIGNED(address, len) )
>          return X86EMUL_OKAY;
> @@ -291,6 +444,21 @@ static int msixtbl_write(struct vcpu *v, unsigned long address,
>      entry = msixtbl_find_entry(v, address);
>      if ( !entry )
>          goto out;
> +
> +    adjacent_fixmap = adjacent_handle(entry, address, true);
> +    if ( adjacent_fixmap != ADJACENT_DONT_HANDLE )
> +    {
> +        r = adjacent_write(adjacent_fixmap, address, len, val);
> +        goto out;
> +    }
> +
> +    if ( address < entry->gtable ||
> +         address >= entry->gtable + entry->table_len )
> +        goto out;
> +
> +    if ( len != 4 && len != 8 )
> +        goto out;
> +
>      nr_entry = array_index_nospec(((address - entry->gtable) /
>                                     PCI_MSIX_ENTRY_SIZE),
>                                    MAX_MSIX_TABLE_ENTRIES);
> @@ -356,8 +524,8 @@ static int cf_check _msixtbl_write(
>      const struct hvm_io_handler *handler, uint64_t address, uint32_t len,
>      uint64_t val)
>  {
> -    /* Ignore invalid length or unaligned writes. */
> -    if ( (len != 4 && len != 8) || !IS_ALIGNED(address, len) )
> +    /* Ignore unaligned writes. */
> +    if ( !IS_ALIGNED(address, len) )
>          return X86EMUL_OKAY;
>  
>      /*
> @@ -374,14 +542,22 @@ static bool cf_check msixtbl_range(
>  {
>      struct vcpu *curr = current;
>      unsigned long addr = r->addr;
> -    const struct msi_desc *desc;
> +    const struct msixtbl_entry *entry;
> +    const struct msi_desc *desc = NULL;
> +    unsigned int adjacent_fixmap;
>  
>      ASSERT(r->type == IOREQ_TYPE_COPY);
>  
>      rcu_read_lock(&msixtbl_rcu_lock);
> -    desc = msixtbl_addr_to_desc(msixtbl_find_entry(curr, addr), addr);
> +    entry = msixtbl_find_entry(curr, addr);
> +    adjacent_fixmap = adjacent_handle(entry, addr, false);
> +    if ( adjacent_fixmap == ADJACENT_DONT_HANDLE )
> +        desc = msixtbl_addr_to_desc(entry, addr);

I'm not sure you need adjacent_handle() here, I would think that any
address adjacent to the MSI-X table Xen would want to handle
unconditionally, and hence msixtbl_range() should return true:

entry = msixtbl_find_entry(curr, addr);
if ( !entry )
    return 0;

if ( addr < entry->gtable || addr >= entry->gtable + entry->table_len )
    /* Possibly handle adjacent access. */
    return 1;

desc = msixtbl_find_entry(curr, addr);
...

(Missing the _unlock calls in the chunk above)

There's no other processing that can happen for an adjacent access
unless it's are handled here.  Otherwise such accesses will be
discarded anyway?  Hence better short-circuit the MMIO handler search
earlier.

>      rcu_read_unlock(&msixtbl_rcu_lock);
>  
> +    if ( adjacent_fixmap != ADJACENT_DONT_HANDLE )
> +        return 1;
> +
>      if ( desc )
>          return 1;
>  
> @@ -622,12 +798,16 @@ void msix_write_completion(struct vcpu *v)
>           v->arch.hvm.hvm_io.msix_snoop_gpa )
>      {
>          unsigned int token = hvmemul_cache_disable(v);
> -        const struct msi_desc *desc;
> +        const struct msi_desc *desc = NULL;
> +        const struct msixtbl_entry *entry;
>          uint32_t data;
>  
>          rcu_read_lock(&msixtbl_rcu_lock);
> -        desc = msixtbl_addr_to_desc(msixtbl_find_entry(v, snoop_addr),
> -                                    snoop_addr);
> +        entry = msixtbl_find_entry(v, snoop_addr);
> +        if ( entry &&
> +             snoop_addr >= entry->gtable &&
> +             snoop_addr < entry->gtable + entry->table_len )
> +            desc = msixtbl_addr_to_desc(entry, snoop_addr);

This would be easier if you added the MSI-X table boundary checks in
msixtbl_addr_to_desc() itself, rather than open-coding here.  That way
you don't need to modify msix_write_completion() at all.  It also
makes msixtbl_addr_to_desc() more robust in case it gets called with
bogus addresses.

>          rcu_read_unlock(&msixtbl_rcu_lock);
>  
>          if ( desc &&
> diff --git a/xen/arch/x86/include/asm/msi.h b/xen/arch/x86/include/asm/msi.h
> index bcfdfd35345d..923b730d48b8 100644
> --- a/xen/arch/x86/include/asm/msi.h
> +++ b/xen/arch/x86/include/asm/msi.h
> @@ -224,6 +224,9 @@ struct arch_msix {
>      } table, pba;
>      int table_refcnt[MAX_MSIX_TABLE_PAGES];
>      int table_idx[MAX_MSIX_TABLE_PAGES];
> +#define ADJ_IDX_FIRST 0
> +#define ADJ_IDX_LAST  1
> +    unsigned int adj_access_idx[2];
>      spinlock_t table_lock;
>      bool host_maskall, guest_maskall;
>      domid_t warned_domid;
> @@ -231,6 +234,8 @@ struct arch_msix {
>          uint8_t all;
>          struct {
>              bool maskall                   : 1;
> +            bool adjacent_not_initialized  : 1;
> +            bool adjacent_pba              : 1;
>          };
>      } warned_kind;
>  };
> diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
> index 42c793426da3..c77b81896269 100644
> --- a/xen/arch/x86/msi.c
> +++ b/xen/arch/x86/msi.c
> @@ -913,6 +913,36 @@ static int msix_capability_init(struct pci_dev *dev,
>          list_add_tail(&entry->list, &dev->msi_list);
>          *desc = entry;
>      }
> +    else
> +    {
> +        /*
> +         * If the MSI-X table doesn't start at the page boundary, map the first page for
> +         * passthrough accesses.
> +         */
> +        if ( PAGE_OFFSET(table_paddr) )
> +        {
> +            int idx = msix_get_fixmap(msix, table_paddr, table_paddr);
> +
> +            if ( idx > 0 )
> +                msix->adj_access_idx[ADJ_IDX_FIRST] = idx;
> +            else
> +                gprintk(XENLOG_ERR, "Failed to map first MSI-X table page: %d\n", idx);
> +        }
> +        /*
> +         * If the MSI-X table doesn't end on the page boundary, map the last page
> +         * for passthrough accesses.
> +         */
> +        if ( PAGE_OFFSET(table_paddr + msix->nr_entries * PCI_MSIX_ENTRY_SIZE) )

This check is correct ....

> +        {
> +            uint64_t entry_paddr = table_paddr + msix->nr_entries * PCI_MSIX_ENTRY_SIZE;

... however for correctness you want to use msix->nr_entries - 1 here,
otherwise you are requesting an address that's past the last MSI-X
table entry, and hence msix_get_fixmap() could refuse to provide an
idx if it ever does boundary checking.

Thanks, Roger.
Marek Marczykowski-Górecki May 4, 2024, 12:48 a.m. UTC | #3
On Fri, May 03, 2024 at 10:33:38AM +0200, Roger Pau Monné wrote:
> On Fri, Apr 26, 2024 at 07:54:00PM +0200, Marek Marczykowski-Górecki wrote:
> > Some devices (notably Intel Wifi 6 AX210 card) keep auxiliary registers
> > on the same page as MSI-X table. Device model (especially one in
> > stubdomain) cannot really handle those, as direct writes to that page is
> > refused (page is on the mmio_ro_ranges list). Instead, extend
> > msixtbl_mmio_ops to handle such accesses too.
> > 
> > Doing this, requires correlating read/write location with guest
> > of MSI-X table address. Since QEMU doesn't map MSI-X table to the guest,
>   ^ extra 'of'?
> > it requires msixtbl_entry->gtable, which is HVM-only. Similar feature
> > for PV would need to be done separately.
> > 
> > This will be also used to read Pending Bit Array, if it lives on the same
> > page, making QEMU not needing /dev/mem access at all (especially helpful
> > with lockdown enabled in dom0). If PBA lives on another page, QEMU will
> > map it to the guest directly.
> > If PBA lives on the same page, discard writes and log a message.
> > Technically, writes outside of PBA could be allowed, but at this moment
> > the precise location of PBA isn't saved, and also no known device abuses
> > the spec in this way (at least yet).
> > 
> > To access those registers, msixtbl_mmio_ops need the relevant page
> > mapped. MSI handling already has infrastructure for that, using fixmap,
> > so try to map first/last page of the MSI-X table (if necessary) and save
> > their fixmap indexes. Note that msix_get_fixmap() does reference
> > counting and reuses existing mapping, so just call it directly, even if
> > the page was mapped before. Also, it uses a specific range of fixmap
> > indexes which doesn't include 0, so use 0 as default ("not mapped")
> > value - which simplifies code a bit.
> > 
> > GCC 12.2.1 gets confused about 'desc' variable:
> > 
> >     arch/x86/hvm/vmsi.c: In function ‘msixtbl_range’:
> >     arch/x86/hvm/vmsi.c:553:8: error: ‘desc’ may be used uninitialized [-Werror=maybe-uninitialized]
> >       553 |     if ( desc )
> >           |        ^
> >     arch/x86/hvm/vmsi.c:537:28: note: ‘desc’ was declared here
> >       537 |     const struct msi_desc *desc;
> >           |                            ^~~~
> > 
> > It's conditional initialization is actually correct (in the case where
> > it isn't initialized, function returns early), but to avoid
> > build failure initialize it explicitly to NULL anyway.
> > 
> > Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> > ---
> > Changes in v6:
> > - use MSIX_CHECK_WARN macro
> > - extend assert on fixmap_idx
> > - add break in default label, after ASSERT_UNREACHABLE(), and move
> >   setting default there
> > - style fixes
> > Changes in v5:
> > - style fixes
> > - include GCC version in the commit message
> > - warn only once (per domain, per device) about failed adjacent access
> > Changes in v4:
> > - drop same_page parameter of msixtbl_find_entry(), distinguish two
> >   cases in relevant callers
> > - rename adj_access_table_idx to adj_access_idx
> > - code style fixes
> > - drop alignment check in adjacent_{read,write}() - all callers already
> >   have it earlier
> > - delay mapping first/last MSI-X pages until preparing device for a
> >   passthrough
> > v3:
> >  - merge handling into msixtbl_mmio_ops
> >  - extend commit message
> > v2:
> >  - adjust commit message
> >  - pass struct domain to msixtbl_page_handler_get_hwaddr()
> >  - reduce local variables used only once
> >  - log a warning if write is forbidden if MSI-X and PBA lives on the same
> >    page
> >  - do not passthrough unaligned accesses
> >  - handle accesses both before and after MSI-X table
> > ---
> >  xen/arch/x86/hvm/vmsi.c        | 200 ++++++++++++++++++++++++++++++++--
> >  xen/arch/x86/include/asm/msi.h |   5 +-
> >  xen/arch/x86/msi.c             |  41 +++++++-
> >  3 files changed, 236 insertions(+), 10 deletions(-)
> > 
> > diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
> > index 999917983789..230e3a5dee3f 100644
> > --- a/xen/arch/x86/hvm/vmsi.c
> > +++ b/xen/arch/x86/hvm/vmsi.c
> > @@ -180,6 +180,10 @@ static bool msixtbl_initialised(const struct domain *d)
> >      return d->arch.hvm.msixtbl_list.next;
> >  }
> >  
> > +/*
> > + * Lookup an msixtbl_entry on the same page as given addr. It's up to the
> > + * caller to check if address is strictly part of the table - if relevant.
> > + */
> >  static struct msixtbl_entry *msixtbl_find_entry(
> >      struct vcpu *v, unsigned long addr)
> >  {
> > @@ -187,8 +191,8 @@ static struct msixtbl_entry *msixtbl_find_entry(
> >      struct domain *d = v->domain;
> >  
> >      list_for_each_entry( entry, &d->arch.hvm.msixtbl_list, list )
> > -        if ( addr >= entry->gtable &&
> > -             addr < entry->gtable + entry->table_len )
> > +        if ( PFN_DOWN(addr) >= PFN_DOWN(entry->gtable) &&
> > +             PFN_DOWN(addr) <= PFN_DOWN(entry->gtable + entry->table_len - 1) )
> >              return entry;
> >  
> >      return NULL;
> > @@ -213,6 +217,138 @@ static struct msi_desc *msixtbl_addr_to_desc(
> >      return NULL;
> >  }
> >  
> > +/*
> > + * Returns:
> > + *  - ADJACENT_DONT_HANDLE if no handling should be done
> > + *  - ADJACENT_DISCARD_WRITE if write should be discarded
> > + *  - a fixmap idx to use for handling
> > + */
> > +#define ADJACENT_DONT_HANDLE UINT_MAX
> > +#define ADJACENT_DISCARD_WRITE (UINT_MAX - 1)
> 
> I think this could be simpler, there's no need to signal with so fine
> grained detail about the action to be performed.
> 
> Any adjacent access to the MSI-X table should be handled by the logic
> you are adding, so anything that falls in those ranges should
> terminate here.
> 
> adjacent_handle() should IMO just return whether the access is
> replayed against the hardware, or if it's just dropped.

The distinction here is to return X86EMUL_OKAY in case of adjacent write
that is ignored because PBA is somewhere near, but X86EMUL_UNHANDLABLE
for other/error cases (like fixmap indices not initialized).
But maybe this distinction doesn't make sense and X86EMUL_UNHANDLABLE is
okay in either case? 

> > +static unsigned int adjacent_handle(
> > +    const struct msixtbl_entry *entry, unsigned long addr, bool write)
> > +{
> > +    unsigned int adj_type;
> > +    struct arch_msix *msix;
> > +
> > +    if ( !entry || !entry->pdev )
> > +        return ADJACENT_DONT_HANDLE;
> > +
> > +    if ( PFN_DOWN(addr) == PFN_DOWN(entry->gtable) && addr < entry->gtable )
> > +        adj_type = ADJ_IDX_FIRST;
> > +    else if ( PFN_DOWN(addr) == PFN_DOWN(entry->gtable + entry->table_len - 1) &&
> > +              addr >= entry->gtable + entry->table_len )
> > +        adj_type = ADJ_IDX_LAST;
> > +    else
> > +        return ADJACENT_DONT_HANDLE;
> > +
> > +    msix = entry->pdev->msix;
> > +    ASSERT(msix);
> > +
> > +    if ( !msix->adj_access_idx[adj_type] )
> > +    {
> > +        if ( MSIX_CHECK_WARN(msix, entry->pdev->domain->domain_id,
> > +                             adjacent_not_initialized) )
> > +            gprintk(XENLOG_WARNING,
> > +                    "Page for adjacent(%d) MSI-X table access not initialized for %pp (addr %#lx, gtable %#lx\n",
> > +                    adj_type, &entry->pdev->sbdf, addr, entry->gtable);
> > +        return ADJACENT_DONT_HANDLE;
> > +    }
> > +
> > +    /* If PBA lives on the same page too, discard writes. */
> > +    if ( write &&
> > +         ((adj_type == ADJ_IDX_LAST &&
> > +           msix->table.last == msix->pba.first) ||
> > +          (adj_type == ADJ_IDX_FIRST &&
> > +           msix->table.first == msix->pba.last)) )
> > +    {
> > +        if ( MSIX_CHECK_WARN(msix, entry->pdev->domain->domain_id,
> > +                             adjacent_pba) )
> > +            gprintk(XENLOG_WARNING,
> > +                    "MSI-X table and PBA of %pp live on the same page, "
> > +                    "writing to other registers there is not implemented\n",
> > +                    &entry->pdev->sbdf);
> > +        return ADJACENT_DISCARD_WRITE;
> > +    }
> > +
> > +    return msix->adj_access_idx[adj_type];
> > +}
> > +
> > +static int adjacent_read(
> > +    unsigned int fixmap_idx,
> > +    paddr_t address, unsigned int len, uint64_t *pval)
> > +{
> > +    const void __iomem *hwaddr;
> > +
> > +    ASSERT(fixmap_idx <= FIX_MSIX_IO_RESERV_END);
> > +
> > +    hwaddr = fix_to_virt(fixmap_idx) + PAGE_OFFSET(address);
> 
> IMO adjacent_handle() should be called here (and in adjacent_write()),
> and adjacent_{read,write}() called unconditionally from
> msixtbl_{read,write}() when an access that doesn't fall in the MSI-X
> table is handled.  See comment below in msixtbl_read().

Makes sense.

> > +
> > +    switch ( len )
> > +    {
> > +    case 1:
> > +        *pval = readb(hwaddr);
> > +        break;
> > +
> > +    case 2:
> > +        *pval = readw(hwaddr);
> > +        break;
> > +
> > +    case 4:
> > +        *pval = readl(hwaddr);
> > +        break;
> > +
> > +    case 8:
> > +        *pval = readq(hwaddr);
> > +        break;
> > +
> > +    default:
> > +        ASSERT_UNREACHABLE();
> > +        *pval = ~0UL;
> > +        break;
> > +    }
> > +
> > +    return X86EMUL_OKAY;
> > +}
> > +
> > +static int adjacent_write(
> > +    unsigned int fixmap_idx,
> > +    paddr_t address, unsigned int len, uint64_t val)
> > +{
> > +    void __iomem *hwaddr;
> > +
> > +    if ( fixmap_idx == ADJACENT_DISCARD_WRITE )
> > +        return X86EMUL_OKAY;
> > +
> > +    ASSERT(fixmap_idx <= FIX_MSIX_IO_RESERV_END);
> 
> Since you check the idx is sane, shouldn't you also assert idx >=
> FIX_MSIX_IO_RESERV_BASE?

If moving adjacent_handle() here, I'd simply drop this assert.

> > +
> > +    hwaddr = fix_to_virt(fixmap_idx) + PAGE_OFFSET(address);
> > +
> > +    switch ( len )
> > +    {
> > +    case 1:
> > +        writeb(val, hwaddr);
> > +        break;
> > +
> > +    case 2:
> > +        writew(val, hwaddr);
> > +        break;
> > +
> > +    case 4:
> > +        writel(val, hwaddr);
> > +        break;
> > +
> > +    case 8:
> > +        writeq(val, hwaddr);
> > +        break;
> > +
> > +    default:
> > +        ASSERT_UNREACHABLE();
> > +    }
> > +
> > +    return X86EMUL_OKAY;
> > +}
> > +
> >  static int cf_check msixtbl_read(
> >      const struct hvm_io_handler *handler, uint64_t address, uint32_t len,
> >      uint64_t *pval)
> > @@ -220,9 +356,10 @@ static int cf_check msixtbl_read(
> >      unsigned long offset;
> >      struct msixtbl_entry *entry;
> >      unsigned int nr_entry, index;
> > +    unsigned int adjacent_fixmap;
> >      int r = X86EMUL_UNHANDLEABLE;
> >  
> > -    if ( (len != 4 && len != 8) || (address & (len - 1)) )
> > +    if ( !IS_ALIGNED(address, len) )
> >          return r;
> >  
> >      rcu_read_lock(&msixtbl_rcu_lock);
> > @@ -230,6 +367,21 @@ static int cf_check msixtbl_read(
> >      entry = msixtbl_find_entry(current, address);
> >      if ( !entry )
> >          goto out;
> > +
> > +    adjacent_fixmap = adjacent_handle(entry, address, false);
> 
> This seems overly complicated, but is possible I'm missing some logic.
> 
> IMO it would seem way less convoluted to simply do:
> 
> entry = msixtbl_find_entry(current, address);
> if ( !entry )
>     goto out;
> 
> if ( address < entry->gtable ||
>      address >= entry->gtable + entry->table_len )
> {
>     adjacent_read(...);
>     goto out;
> }
> 
> And put all the logic in adjacent_{read,write}() directly rather than
> having both adjacent_{read,write}() plus adjacent_handle() calls here?
> 
> If the access doesn't fall between the boundaries of the MSI-X table
> it's either going to be a handled adjacent access, or it's going to be
> discarded.

Discarded - should it return X86EMUL_OKAY in that case? Currently it
returns X86EMUL_UNHANDLABLE in case adjacent access isn't handled (for
any reason) either.

> > +    if ( adjacent_fixmap != ADJACENT_DONT_HANDLE )
> > +    {
> > +        r = adjacent_read(adjacent_fixmap, address, len, pval);
> > +        goto out;
> > +    }
> > +
> > +    if ( address < entry->gtable ||
> > +         address >= entry->gtable + entry->table_len )
> > +        goto out;
> > +
> > +    if ( len != 4 && len != 8 )
> > +        goto out;
> > +
> >      offset = address & (PCI_MSIX_ENTRY_SIZE - 1);
> >  
> >      if ( offset != PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET )
> > @@ -282,6 +434,7 @@ static int msixtbl_write(struct vcpu *v, unsigned long address,
> >      int r = X86EMUL_UNHANDLEABLE;
> >      unsigned long flags;
> >      struct irq_desc *desc;
> > +    unsigned int adjacent_fixmap;
> >  
> >      if ( !IS_ALIGNED(address, len) )
> >          return X86EMUL_OKAY;
> > @@ -291,6 +444,21 @@ static int msixtbl_write(struct vcpu *v, unsigned long address,
> >      entry = msixtbl_find_entry(v, address);
> >      if ( !entry )
> >          goto out;
> > +
> > +    adjacent_fixmap = adjacent_handle(entry, address, true);
> > +    if ( adjacent_fixmap != ADJACENT_DONT_HANDLE )
> > +    {
> > +        r = adjacent_write(adjacent_fixmap, address, len, val);
> > +        goto out;
> > +    }
> > +
> > +    if ( address < entry->gtable ||
> > +         address >= entry->gtable + entry->table_len )
> > +        goto out;
> > +
> > +    if ( len != 4 && len != 8 )
> > +        goto out;
> > +
> >      nr_entry = array_index_nospec(((address - entry->gtable) /
> >                                     PCI_MSIX_ENTRY_SIZE),
> >                                    MAX_MSIX_TABLE_ENTRIES);
> > @@ -356,8 +524,8 @@ static int cf_check _msixtbl_write(
> >      const struct hvm_io_handler *handler, uint64_t address, uint32_t len,
> >      uint64_t val)
> >  {
> > -    /* Ignore invalid length or unaligned writes. */
> > -    if ( (len != 4 && len != 8) || !IS_ALIGNED(address, len) )
> > +    /* Ignore unaligned writes. */
> > +    if ( !IS_ALIGNED(address, len) )
> >          return X86EMUL_OKAY;
> >  
> >      /*
> > @@ -374,14 +542,22 @@ static bool cf_check msixtbl_range(
> >  {
> >      struct vcpu *curr = current;
> >      unsigned long addr = r->addr;
> > -    const struct msi_desc *desc;
> > +    const struct msixtbl_entry *entry;
> > +    const struct msi_desc *desc = NULL;
> > +    unsigned int adjacent_fixmap;
> >  
> >      ASSERT(r->type == IOREQ_TYPE_COPY);
> >  
> >      rcu_read_lock(&msixtbl_rcu_lock);
> > -    desc = msixtbl_addr_to_desc(msixtbl_find_entry(curr, addr), addr);
> > +    entry = msixtbl_find_entry(curr, addr);
> > +    adjacent_fixmap = adjacent_handle(entry, addr, false);
> > +    if ( adjacent_fixmap == ADJACENT_DONT_HANDLE )
> > +        desc = msixtbl_addr_to_desc(entry, addr);
> 
> I'm not sure you need adjacent_handle() here, I would think that any
> address adjacent to the MSI-X table Xen would want to handle
> unconditionally, and hence msixtbl_range() should return true:

I put it here to not duplicate a set of checks for adjacent access. It
isn't only about the range, but also few other case (like if fixmap
indices are set).

> entry = msixtbl_find_entry(curr, addr);
> if ( !entry )
>     return 0;
> 
> if ( addr < entry->gtable || addr >= entry->gtable + entry->table_len )
>     /* Possibly handle adjacent access. */
>     return 1;
> 
> desc = msixtbl_find_entry(curr, addr);
> ...
> 
> (Missing the _unlock calls in the chunk above)
> 
> There's no other processing that can happen for an adjacent access
> unless it's are handled here.  Otherwise such accesses will be
> discarded anyway?  Hence better short-circuit the MMIO handler search
> earlier.

Can't there be some ioreq server that could theoretically handle them?

> >      rcu_read_unlock(&msixtbl_rcu_lock);
> >  
> > +    if ( adjacent_fixmap != ADJACENT_DONT_HANDLE )
> > +        return 1;
> > +
> >      if ( desc )
> >          return 1;
> >  
> > @@ -622,12 +798,16 @@ void msix_write_completion(struct vcpu *v)
> >           v->arch.hvm.hvm_io.msix_snoop_gpa )
> >      {
> >          unsigned int token = hvmemul_cache_disable(v);
> > -        const struct msi_desc *desc;
> > +        const struct msi_desc *desc = NULL;
> > +        const struct msixtbl_entry *entry;
> >          uint32_t data;
> >  
> >          rcu_read_lock(&msixtbl_rcu_lock);
> > -        desc = msixtbl_addr_to_desc(msixtbl_find_entry(v, snoop_addr),
> > -                                    snoop_addr);
> > +        entry = msixtbl_find_entry(v, snoop_addr);
> > +        if ( entry &&
> > +             snoop_addr >= entry->gtable &&
> > +             snoop_addr < entry->gtable + entry->table_len )
> > +            desc = msixtbl_addr_to_desc(entry, snoop_addr);
> 
> This would be easier if you added the MSI-X table boundary checks in
> msixtbl_addr_to_desc() itself, rather than open-coding here.  That way
> you don't need to modify msix_write_completion() at all.  It also
> makes msixtbl_addr_to_desc() more robust in case it gets called with
> bogus addresses.

Makes sense I think.

> >          rcu_read_unlock(&msixtbl_rcu_lock);
> >  
> >          if ( desc &&
> > diff --git a/xen/arch/x86/include/asm/msi.h b/xen/arch/x86/include/asm/msi.h
> > index bcfdfd35345d..923b730d48b8 100644
> > --- a/xen/arch/x86/include/asm/msi.h
> > +++ b/xen/arch/x86/include/asm/msi.h
> > @@ -224,6 +224,9 @@ struct arch_msix {
> >      } table, pba;
> >      int table_refcnt[MAX_MSIX_TABLE_PAGES];
> >      int table_idx[MAX_MSIX_TABLE_PAGES];
> > +#define ADJ_IDX_FIRST 0
> > +#define ADJ_IDX_LAST  1
> > +    unsigned int adj_access_idx[2];
> >      spinlock_t table_lock;
> >      bool host_maskall, guest_maskall;
> >      domid_t warned_domid;
> > @@ -231,6 +234,8 @@ struct arch_msix {
> >          uint8_t all;
> >          struct {
> >              bool maskall                   : 1;
> > +            bool adjacent_not_initialized  : 1;
> > +            bool adjacent_pba              : 1;
> >          };
> >      } warned_kind;
> >  };
> > diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
> > index 42c793426da3..c77b81896269 100644
> > --- a/xen/arch/x86/msi.c
> > +++ b/xen/arch/x86/msi.c
> > @@ -913,6 +913,36 @@ static int msix_capability_init(struct pci_dev *dev,
> >          list_add_tail(&entry->list, &dev->msi_list);
> >          *desc = entry;
> >      }
> > +    else
> > +    {
> > +        /*
> > +         * If the MSI-X table doesn't start at the page boundary, map the first page for
> > +         * passthrough accesses.
> > +         */
> > +        if ( PAGE_OFFSET(table_paddr) )
> > +        {
> > +            int idx = msix_get_fixmap(msix, table_paddr, table_paddr);
> > +
> > +            if ( idx > 0 )
> > +                msix->adj_access_idx[ADJ_IDX_FIRST] = idx;
> > +            else
> > +                gprintk(XENLOG_ERR, "Failed to map first MSI-X table page: %d\n", idx);
> > +        }
> > +        /*
> > +         * If the MSI-X table doesn't end on the page boundary, map the last page
> > +         * for passthrough accesses.
> > +         */
> > +        if ( PAGE_OFFSET(table_paddr + msix->nr_entries * PCI_MSIX_ENTRY_SIZE) )
> 
> This check is correct ....
> 
> > +        {
> > +            uint64_t entry_paddr = table_paddr + msix->nr_entries * PCI_MSIX_ENTRY_SIZE;
> 
> ... however for correctness you want to use msix->nr_entries - 1 here,
> otherwise you are requesting an address that's past the last MSI-X
> table entry, and hence msix_get_fixmap() could refuse to provide an
> idx if it ever does boundary checking.

Ok.
Roger Pau Monné May 6, 2024, 7:38 a.m. UTC | #4
On Sat, May 04, 2024 at 02:48:12AM +0200, Marek Marczykowski-Górecki wrote:
> On Fri, May 03, 2024 at 10:33:38AM +0200, Roger Pau Monné wrote:
> > On Fri, Apr 26, 2024 at 07:54:00PM +0200, Marek Marczykowski-Górecki wrote:
> > > Some devices (notably Intel Wifi 6 AX210 card) keep auxiliary registers
> > > on the same page as MSI-X table. Device model (especially one in
> > > stubdomain) cannot really handle those, as direct writes to that page is
> > > refused (page is on the mmio_ro_ranges list). Instead, extend
> > > msixtbl_mmio_ops to handle such accesses too.
> > > 
> > > Doing this, requires correlating read/write location with guest
> > > of MSI-X table address. Since QEMU doesn't map MSI-X table to the guest,
> >   ^ extra 'of'?
> > > it requires msixtbl_entry->gtable, which is HVM-only. Similar feature
> > > for PV would need to be done separately.
> > > 
> > > This will be also used to read Pending Bit Array, if it lives on the same
> > > page, making QEMU not needing /dev/mem access at all (especially helpful
> > > with lockdown enabled in dom0). If PBA lives on another page, QEMU will
> > > map it to the guest directly.
> > > If PBA lives on the same page, discard writes and log a message.
> > > Technically, writes outside of PBA could be allowed, but at this moment
> > > the precise location of PBA isn't saved, and also no known device abuses
> > > the spec in this way (at least yet).
> > > 
> > > To access those registers, msixtbl_mmio_ops need the relevant page
> > > mapped. MSI handling already has infrastructure for that, using fixmap,
> > > so try to map first/last page of the MSI-X table (if necessary) and save
> > > their fixmap indexes. Note that msix_get_fixmap() does reference
> > > counting and reuses existing mapping, so just call it directly, even if
> > > the page was mapped before. Also, it uses a specific range of fixmap
> > > indexes which doesn't include 0, so use 0 as default ("not mapped")
> > > value - which simplifies code a bit.
> > > 
> > > GCC 12.2.1 gets confused about 'desc' variable:
> > > 
> > >     arch/x86/hvm/vmsi.c: In function ‘msixtbl_range’:
> > >     arch/x86/hvm/vmsi.c:553:8: error: ‘desc’ may be used uninitialized [-Werror=maybe-uninitialized]
> > >       553 |     if ( desc )
> > >           |        ^
> > >     arch/x86/hvm/vmsi.c:537:28: note: ‘desc’ was declared here
> > >       537 |     const struct msi_desc *desc;
> > >           |                            ^~~~
> > > 
> > > It's conditional initialization is actually correct (in the case where
> > > it isn't initialized, function returns early), but to avoid
> > > build failure initialize it explicitly to NULL anyway.
> > > 
> > > Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> > > ---
> > > Changes in v6:
> > > - use MSIX_CHECK_WARN macro
> > > - extend assert on fixmap_idx
> > > - add break in default label, after ASSERT_UNREACHABLE(), and move
> > >   setting default there
> > > - style fixes
> > > Changes in v5:
> > > - style fixes
> > > - include GCC version in the commit message
> > > - warn only once (per domain, per device) about failed adjacent access
> > > Changes in v4:
> > > - drop same_page parameter of msixtbl_find_entry(), distinguish two
> > >   cases in relevant callers
> > > - rename adj_access_table_idx to adj_access_idx
> > > - code style fixes
> > > - drop alignment check in adjacent_{read,write}() - all callers already
> > >   have it earlier
> > > - delay mapping first/last MSI-X pages until preparing device for a
> > >   passthrough
> > > v3:
> > >  - merge handling into msixtbl_mmio_ops
> > >  - extend commit message
> > > v2:
> > >  - adjust commit message
> > >  - pass struct domain to msixtbl_page_handler_get_hwaddr()
> > >  - reduce local variables used only once
> > >  - log a warning if write is forbidden if MSI-X and PBA lives on the same
> > >    page
> > >  - do not passthrough unaligned accesses
> > >  - handle accesses both before and after MSI-X table
> > > ---
> > >  xen/arch/x86/hvm/vmsi.c        | 200 ++++++++++++++++++++++++++++++++--
> > >  xen/arch/x86/include/asm/msi.h |   5 +-
> > >  xen/arch/x86/msi.c             |  41 +++++++-
> > >  3 files changed, 236 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
> > > index 999917983789..230e3a5dee3f 100644
> > > --- a/xen/arch/x86/hvm/vmsi.c
> > > +++ b/xen/arch/x86/hvm/vmsi.c
> > > @@ -180,6 +180,10 @@ static bool msixtbl_initialised(const struct domain *d)
> > >      return d->arch.hvm.msixtbl_list.next;
> > >  }
> > >  
> > > +/*
> > > + * Lookup an msixtbl_entry on the same page as given addr. It's up to the
> > > + * caller to check if address is strictly part of the table - if relevant.
> > > + */
> > >  static struct msixtbl_entry *msixtbl_find_entry(
> > >      struct vcpu *v, unsigned long addr)
> > >  {
> > > @@ -187,8 +191,8 @@ static struct msixtbl_entry *msixtbl_find_entry(
> > >      struct domain *d = v->domain;
> > >  
> > >      list_for_each_entry( entry, &d->arch.hvm.msixtbl_list, list )
> > > -        if ( addr >= entry->gtable &&
> > > -             addr < entry->gtable + entry->table_len )
> > > +        if ( PFN_DOWN(addr) >= PFN_DOWN(entry->gtable) &&
> > > +             PFN_DOWN(addr) <= PFN_DOWN(entry->gtable + entry->table_len - 1) )
> > >              return entry;
> > >  
> > >      return NULL;
> > > @@ -213,6 +217,138 @@ static struct msi_desc *msixtbl_addr_to_desc(
> > >      return NULL;
> > >  }
> > >  
> > > +/*
> > > + * Returns:
> > > + *  - ADJACENT_DONT_HANDLE if no handling should be done
> > > + *  - ADJACENT_DISCARD_WRITE if write should be discarded
> > > + *  - a fixmap idx to use for handling
> > > + */
> > > +#define ADJACENT_DONT_HANDLE UINT_MAX
> > > +#define ADJACENT_DISCARD_WRITE (UINT_MAX - 1)
> > 
> > I think this could be simpler, there's no need to signal with so fine
> > grained detail about the action to be performed.
> > 
> > Any adjacent access to the MSI-X table should be handled by the logic
> > you are adding, so anything that falls in those ranges should
> > terminate here.
> > 
> > adjacent_handle() should IMO just return whether the access is
> > replayed against the hardware, or if it's just dropped.
> 
> The distinction here is to return X86EMUL_OKAY in case of adjacent write
> that is ignored because PBA is somewhere near, but X86EMUL_UNHANDLABLE
> for other/error cases (like fixmap indices not initialized).
> But maybe this distinction doesn't make sense and X86EMUL_UNHANDLABLE is
> okay in either case? 

That's my suggestion, yes.  I don't think it's expected for ioreqs to
handle those adjacent accesses, the more with the limitation that some
of them might not even have access to such region in the first place.

> > > @@ -220,9 +356,10 @@ static int cf_check msixtbl_read(
> > >      unsigned long offset;
> > >      struct msixtbl_entry *entry;
> > >      unsigned int nr_entry, index;
> > > +    unsigned int adjacent_fixmap;
> > >      int r = X86EMUL_UNHANDLEABLE;
> > >  
> > > -    if ( (len != 4 && len != 8) || (address & (len - 1)) )
> > > +    if ( !IS_ALIGNED(address, len) )
> > >          return r;
> > >  
> > >      rcu_read_lock(&msixtbl_rcu_lock);
> > > @@ -230,6 +367,21 @@ static int cf_check msixtbl_read(
> > >      entry = msixtbl_find_entry(current, address);
> > >      if ( !entry )
> > >          goto out;
> > > +
> > > +    adjacent_fixmap = adjacent_handle(entry, address, false);
> > 
> > This seems overly complicated, but is possible I'm missing some logic.
> > 
> > IMO it would seem way less convoluted to simply do:
> > 
> > entry = msixtbl_find_entry(current, address);
> > if ( !entry )
> >     goto out;
> > 
> > if ( address < entry->gtable ||
> >      address >= entry->gtable + entry->table_len )
> > {
> >     adjacent_read(...);
> >     goto out;
> > }
> > 
> > And put all the logic in adjacent_{read,write}() directly rather than
> > having both adjacent_{read,write}() plus adjacent_handle() calls here?
> > 
> > If the access doesn't fall between the boundaries of the MSI-X table
> > it's either going to be a handled adjacent access, or it's going to be
> > discarded.
> 
> Discarded - should it return X86EMUL_OKAY in that case? Currently it
> returns X86EMUL_UNHANDLABLE in case adjacent access isn't handled (for
> any reason) either.

Yes, I think we want to terminate all adjacent accesses here - there's
so far no need to forward them to any other handler, and that would
simplify the logic.  I don't see an use case for handling those
elsewhere, but if that ever arises we can always add the support them.
There's no need to cater for a non-existent use-case that just makes
the code more complicated.

> > > +    if ( adjacent_fixmap != ADJACENT_DONT_HANDLE )
> > > +    {
> > > +        r = adjacent_read(adjacent_fixmap, address, len, pval);
> > > +        goto out;
> > > +    }
> > > +
> > > +    if ( address < entry->gtable ||
> > > +         address >= entry->gtable + entry->table_len )
> > > +        goto out;
> > > +
> > > +    if ( len != 4 && len != 8 )
> > > +        goto out;
> > > +
> > >      offset = address & (PCI_MSIX_ENTRY_SIZE - 1);
> > >  
> > >      if ( offset != PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET )
> > > @@ -282,6 +434,7 @@ static int msixtbl_write(struct vcpu *v, unsigned long address,
> > >      int r = X86EMUL_UNHANDLEABLE;
> > >      unsigned long flags;
> > >      struct irq_desc *desc;
> > > +    unsigned int adjacent_fixmap;
> > >  
> > >      if ( !IS_ALIGNED(address, len) )
> > >          return X86EMUL_OKAY;
> > > @@ -291,6 +444,21 @@ static int msixtbl_write(struct vcpu *v, unsigned long address,
> > >      entry = msixtbl_find_entry(v, address);
> > >      if ( !entry )
> > >          goto out;
> > > +
> > > +    adjacent_fixmap = adjacent_handle(entry, address, true);
> > > +    if ( adjacent_fixmap != ADJACENT_DONT_HANDLE )
> > > +    {
> > > +        r = adjacent_write(adjacent_fixmap, address, len, val);
> > > +        goto out;
> > > +    }
> > > +
> > > +    if ( address < entry->gtable ||
> > > +         address >= entry->gtable + entry->table_len )
> > > +        goto out;
> > > +
> > > +    if ( len != 4 && len != 8 )
> > > +        goto out;
> > > +
> > >      nr_entry = array_index_nospec(((address - entry->gtable) /
> > >                                     PCI_MSIX_ENTRY_SIZE),
> > >                                    MAX_MSIX_TABLE_ENTRIES);
> > > @@ -356,8 +524,8 @@ static int cf_check _msixtbl_write(
> > >      const struct hvm_io_handler *handler, uint64_t address, uint32_t len,
> > >      uint64_t val)
> > >  {
> > > -    /* Ignore invalid length or unaligned writes. */
> > > -    if ( (len != 4 && len != 8) || !IS_ALIGNED(address, len) )
> > > +    /* Ignore unaligned writes. */
> > > +    if ( !IS_ALIGNED(address, len) )
> > >          return X86EMUL_OKAY;
> > >  
> > >      /*
> > > @@ -374,14 +542,22 @@ static bool cf_check msixtbl_range(
> > >  {
> > >      struct vcpu *curr = current;
> > >      unsigned long addr = r->addr;
> > > -    const struct msi_desc *desc;
> > > +    const struct msixtbl_entry *entry;
> > > +    const struct msi_desc *desc = NULL;
> > > +    unsigned int adjacent_fixmap;
> > >  
> > >      ASSERT(r->type == IOREQ_TYPE_COPY);
> > >  
> > >      rcu_read_lock(&msixtbl_rcu_lock);
> > > -    desc = msixtbl_addr_to_desc(msixtbl_find_entry(curr, addr), addr);
> > > +    entry = msixtbl_find_entry(curr, addr);
> > > +    adjacent_fixmap = adjacent_handle(entry, addr, false);
> > > +    if ( adjacent_fixmap == ADJACENT_DONT_HANDLE )
> > > +        desc = msixtbl_addr_to_desc(entry, addr);
> > 
> > I'm not sure you need adjacent_handle() here, I would think that any
> > address adjacent to the MSI-X table Xen would want to handle
> > unconditionally, and hence msixtbl_range() should return true:
> 
> I put it here to not duplicate a set of checks for adjacent access. It
> isn't only about the range, but also few other case (like if fixmap
> indices are set).

Right, but as discussed above, we likely want to terminate those
accesses here anyway, since there's no other handler that cares about
MSI-X table adjacent regions.

> > entry = msixtbl_find_entry(curr, addr);
> > if ( !entry )
> >     return 0;
> > 
> > if ( addr < entry->gtable || addr >= entry->gtable + entry->table_len )
> >     /* Possibly handle adjacent access. */
> >     return 1;
> > 
> > desc = msixtbl_find_entry(curr, addr);
> > ...
> > 
> > (Missing the _unlock calls in the chunk above)
> > 
> > There's no other processing that can happen for an adjacent access
> > unless it's are handled here.  Otherwise such accesses will be
> > discarded anyway?  Hence better short-circuit the MMIO handler search
> > earlier.
> 
> Can't there be some ioreq server that could theoretically handle them?

I don't think any of the current ioreq implementations care about
those ATM, and with your work to make QEMU not depend on /dev/mem
it's even more unlikely that we might gain such in the future.

As said above, I wouldn't mind allowing such forwarding if it made the
code easier, but seeing how it makes the logic more complicated I
think it's best to terminate all MSI-X table adjacent accesses here.

Thanks, Roger.
diff mbox series

Patch

diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index 999917983789..230e3a5dee3f 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -180,6 +180,10 @@  static bool msixtbl_initialised(const struct domain *d)
     return d->arch.hvm.msixtbl_list.next;
 }
 
+/*
+ * Lookup an msixtbl_entry on the same page as given addr. It's up to the
+ * caller to check if address is strictly part of the table - if relevant.
+ */
 static struct msixtbl_entry *msixtbl_find_entry(
     struct vcpu *v, unsigned long addr)
 {
@@ -187,8 +191,8 @@  static struct msixtbl_entry *msixtbl_find_entry(
     struct domain *d = v->domain;
 
     list_for_each_entry( entry, &d->arch.hvm.msixtbl_list, list )
-        if ( addr >= entry->gtable &&
-             addr < entry->gtable + entry->table_len )
+        if ( PFN_DOWN(addr) >= PFN_DOWN(entry->gtable) &&
+             PFN_DOWN(addr) <= PFN_DOWN(entry->gtable + entry->table_len - 1) )
             return entry;
 
     return NULL;
@@ -213,6 +217,138 @@  static struct msi_desc *msixtbl_addr_to_desc(
     return NULL;
 }
 
+/*
+ * Returns:
+ *  - ADJACENT_DONT_HANDLE if no handling should be done
+ *  - ADJACENT_DISCARD_WRITE if write should be discarded
+ *  - a fixmap idx to use for handling
+ */
+#define ADJACENT_DONT_HANDLE UINT_MAX
+#define ADJACENT_DISCARD_WRITE (UINT_MAX - 1)
+static unsigned int adjacent_handle(
+    const struct msixtbl_entry *entry, unsigned long addr, bool write)
+{
+    unsigned int adj_type;
+    struct arch_msix *msix;
+
+    if ( !entry || !entry->pdev )
+        return ADJACENT_DONT_HANDLE;
+
+    if ( PFN_DOWN(addr) == PFN_DOWN(entry->gtable) && addr < entry->gtable )
+        adj_type = ADJ_IDX_FIRST;
+    else if ( PFN_DOWN(addr) == PFN_DOWN(entry->gtable + entry->table_len - 1) &&
+              addr >= entry->gtable + entry->table_len )
+        adj_type = ADJ_IDX_LAST;
+    else
+        return ADJACENT_DONT_HANDLE;
+
+    msix = entry->pdev->msix;
+    ASSERT(msix);
+
+    if ( !msix->adj_access_idx[adj_type] )
+    {
+        if ( MSIX_CHECK_WARN(msix, entry->pdev->domain->domain_id,
+                             adjacent_not_initialized) )
+            gprintk(XENLOG_WARNING,
+                    "Page for adjacent(%d) MSI-X table access not initialized for %pp (addr %#lx, gtable %#lx\n",
+                    adj_type, &entry->pdev->sbdf, addr, entry->gtable);
+        return ADJACENT_DONT_HANDLE;
+    }
+
+    /* If PBA lives on the same page too, discard writes. */
+    if ( write &&
+         ((adj_type == ADJ_IDX_LAST &&
+           msix->table.last == msix->pba.first) ||
+          (adj_type == ADJ_IDX_FIRST &&
+           msix->table.first == msix->pba.last)) )
+    {
+        if ( MSIX_CHECK_WARN(msix, entry->pdev->domain->domain_id,
+                             adjacent_pba) )
+            gprintk(XENLOG_WARNING,
+                    "MSI-X table and PBA of %pp live on the same page, "
+                    "writing to other registers there is not implemented\n",
+                    &entry->pdev->sbdf);
+        return ADJACENT_DISCARD_WRITE;
+    }
+
+    return msix->adj_access_idx[adj_type];
+}
+
+static int adjacent_read(
+    unsigned int fixmap_idx,
+    paddr_t address, unsigned int len, uint64_t *pval)
+{
+    const void __iomem *hwaddr;
+
+    ASSERT(fixmap_idx <= FIX_MSIX_IO_RESERV_END);
+
+    hwaddr = fix_to_virt(fixmap_idx) + PAGE_OFFSET(address);
+
+    switch ( len )
+    {
+    case 1:
+        *pval = readb(hwaddr);
+        break;
+
+    case 2:
+        *pval = readw(hwaddr);
+        break;
+
+    case 4:
+        *pval = readl(hwaddr);
+        break;
+
+    case 8:
+        *pval = readq(hwaddr);
+        break;
+
+    default:
+        ASSERT_UNREACHABLE();
+        *pval = ~0UL;
+        break;
+    }
+
+    return X86EMUL_OKAY;
+}
+
+static int adjacent_write(
+    unsigned int fixmap_idx,
+    paddr_t address, unsigned int len, uint64_t val)
+{
+    void __iomem *hwaddr;
+
+    if ( fixmap_idx == ADJACENT_DISCARD_WRITE )
+        return X86EMUL_OKAY;
+
+    ASSERT(fixmap_idx <= FIX_MSIX_IO_RESERV_END);
+
+    hwaddr = fix_to_virt(fixmap_idx) + PAGE_OFFSET(address);
+
+    switch ( len )
+    {
+    case 1:
+        writeb(val, hwaddr);
+        break;
+
+    case 2:
+        writew(val, hwaddr);
+        break;
+
+    case 4:
+        writel(val, hwaddr);
+        break;
+
+    case 8:
+        writeq(val, hwaddr);
+        break;
+
+    default:
+        ASSERT_UNREACHABLE();
+    }
+
+    return X86EMUL_OKAY;
+}
+
 static int cf_check msixtbl_read(
     const struct hvm_io_handler *handler, uint64_t address, uint32_t len,
     uint64_t *pval)
@@ -220,9 +356,10 @@  static int cf_check msixtbl_read(
     unsigned long offset;
     struct msixtbl_entry *entry;
     unsigned int nr_entry, index;
+    unsigned int adjacent_fixmap;
     int r = X86EMUL_UNHANDLEABLE;
 
-    if ( (len != 4 && len != 8) || (address & (len - 1)) )
+    if ( !IS_ALIGNED(address, len) )
         return r;
 
     rcu_read_lock(&msixtbl_rcu_lock);
@@ -230,6 +367,21 @@  static int cf_check msixtbl_read(
     entry = msixtbl_find_entry(current, address);
     if ( !entry )
         goto out;
+
+    adjacent_fixmap = adjacent_handle(entry, address, false);
+    if ( adjacent_fixmap != ADJACENT_DONT_HANDLE )
+    {
+        r = adjacent_read(adjacent_fixmap, address, len, pval);
+        goto out;
+    }
+
+    if ( address < entry->gtable ||
+         address >= entry->gtable + entry->table_len )
+        goto out;
+
+    if ( len != 4 && len != 8 )
+        goto out;
+
     offset = address & (PCI_MSIX_ENTRY_SIZE - 1);
 
     if ( offset != PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET )
@@ -282,6 +434,7 @@  static int msixtbl_write(struct vcpu *v, unsigned long address,
     int r = X86EMUL_UNHANDLEABLE;
     unsigned long flags;
     struct irq_desc *desc;
+    unsigned int adjacent_fixmap;
 
     if ( !IS_ALIGNED(address, len) )
         return X86EMUL_OKAY;
@@ -291,6 +444,21 @@  static int msixtbl_write(struct vcpu *v, unsigned long address,
     entry = msixtbl_find_entry(v, address);
     if ( !entry )
         goto out;
+
+    adjacent_fixmap = adjacent_handle(entry, address, true);
+    if ( adjacent_fixmap != ADJACENT_DONT_HANDLE )
+    {
+        r = adjacent_write(adjacent_fixmap, address, len, val);
+        goto out;
+    }
+
+    if ( address < entry->gtable ||
+         address >= entry->gtable + entry->table_len )
+        goto out;
+
+    if ( len != 4 && len != 8 )
+        goto out;
+
     nr_entry = array_index_nospec(((address - entry->gtable) /
                                    PCI_MSIX_ENTRY_SIZE),
                                   MAX_MSIX_TABLE_ENTRIES);
@@ -356,8 +524,8 @@  static int cf_check _msixtbl_write(
     const struct hvm_io_handler *handler, uint64_t address, uint32_t len,
     uint64_t val)
 {
-    /* Ignore invalid length or unaligned writes. */
-    if ( (len != 4 && len != 8) || !IS_ALIGNED(address, len) )
+    /* Ignore unaligned writes. */
+    if ( !IS_ALIGNED(address, len) )
         return X86EMUL_OKAY;
 
     /*
@@ -374,14 +542,22 @@  static bool cf_check msixtbl_range(
 {
     struct vcpu *curr = current;
     unsigned long addr = r->addr;
-    const struct msi_desc *desc;
+    const struct msixtbl_entry *entry;
+    const struct msi_desc *desc = NULL;
+    unsigned int adjacent_fixmap;
 
     ASSERT(r->type == IOREQ_TYPE_COPY);
 
     rcu_read_lock(&msixtbl_rcu_lock);
-    desc = msixtbl_addr_to_desc(msixtbl_find_entry(curr, addr), addr);
+    entry = msixtbl_find_entry(curr, addr);
+    adjacent_fixmap = adjacent_handle(entry, addr, false);
+    if ( adjacent_fixmap == ADJACENT_DONT_HANDLE )
+        desc = msixtbl_addr_to_desc(entry, addr);
     rcu_read_unlock(&msixtbl_rcu_lock);
 
+    if ( adjacent_fixmap != ADJACENT_DONT_HANDLE )
+        return 1;
+
     if ( desc )
         return 1;
 
@@ -622,12 +798,16 @@  void msix_write_completion(struct vcpu *v)
          v->arch.hvm.hvm_io.msix_snoop_gpa )
     {
         unsigned int token = hvmemul_cache_disable(v);
-        const struct msi_desc *desc;
+        const struct msi_desc *desc = NULL;
+        const struct msixtbl_entry *entry;
         uint32_t data;
 
         rcu_read_lock(&msixtbl_rcu_lock);
-        desc = msixtbl_addr_to_desc(msixtbl_find_entry(v, snoop_addr),
-                                    snoop_addr);
+        entry = msixtbl_find_entry(v, snoop_addr);
+        if ( entry &&
+             snoop_addr >= entry->gtable &&
+             snoop_addr < entry->gtable + entry->table_len )
+            desc = msixtbl_addr_to_desc(entry, snoop_addr);
         rcu_read_unlock(&msixtbl_rcu_lock);
 
         if ( desc &&
diff --git a/xen/arch/x86/include/asm/msi.h b/xen/arch/x86/include/asm/msi.h
index bcfdfd35345d..923b730d48b8 100644
--- a/xen/arch/x86/include/asm/msi.h
+++ b/xen/arch/x86/include/asm/msi.h
@@ -224,6 +224,9 @@  struct arch_msix {
     } table, pba;
     int table_refcnt[MAX_MSIX_TABLE_PAGES];
     int table_idx[MAX_MSIX_TABLE_PAGES];
+#define ADJ_IDX_FIRST 0
+#define ADJ_IDX_LAST  1
+    unsigned int adj_access_idx[2];
     spinlock_t table_lock;
     bool host_maskall, guest_maskall;
     domid_t warned_domid;
@@ -231,6 +234,8 @@  struct arch_msix {
         uint8_t all;
         struct {
             bool maskall                   : 1;
+            bool adjacent_not_initialized  : 1;
+            bool adjacent_pba              : 1;
         };
     } warned_kind;
 };
diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index 42c793426da3..c77b81896269 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -913,6 +913,36 @@  static int msix_capability_init(struct pci_dev *dev,
         list_add_tail(&entry->list, &dev->msi_list);
         *desc = entry;
     }
+    else
+    {
+        /*
+         * If the MSI-X table doesn't start at the page boundary, map the first page for
+         * passthrough accesses.
+         */
+        if ( PAGE_OFFSET(table_paddr) )
+        {
+            int idx = msix_get_fixmap(msix, table_paddr, table_paddr);
+
+            if ( idx > 0 )
+                msix->adj_access_idx[ADJ_IDX_FIRST] = idx;
+            else
+                gprintk(XENLOG_ERR, "Failed to map first MSI-X table page: %d\n", idx);
+        }
+        /*
+         * If the MSI-X table doesn't end on the page boundary, map the last page
+         * for passthrough accesses.
+         */
+        if ( PAGE_OFFSET(table_paddr + msix->nr_entries * PCI_MSIX_ENTRY_SIZE) )
+        {
+            uint64_t entry_paddr = table_paddr + msix->nr_entries * PCI_MSIX_ENTRY_SIZE;
+            int idx = msix_get_fixmap(msix, table_paddr, entry_paddr);
+
+            if ( idx > 0 )
+                msix->adj_access_idx[ADJ_IDX_LAST] = idx;
+            else
+                gprintk(XENLOG_ERR, "Failed to map last MSI-X table page: %d\n", idx);
+        }
+    }
 
     if ( !msix->used_entries )
     {
@@ -1079,6 +1109,17 @@  static void _pci_cleanup_msix(struct arch_msix *msix)
         msix->table.first = 0;
         msix->table.last = 0;
 
+        if ( msix->adj_access_idx[ADJ_IDX_FIRST] )
+        {
+            msix_put_fixmap(msix, msix->adj_access_idx[ADJ_IDX_FIRST]);
+            msix->adj_access_idx[ADJ_IDX_FIRST] = 0;
+        }
+        if ( msix->adj_access_idx[ADJ_IDX_LAST] )
+        {
+            msix_put_fixmap(msix, msix->adj_access_idx[ADJ_IDX_LAST]);
+            msix->adj_access_idx[ADJ_IDX_LAST] = 0;
+        }
+
         if ( rangeset_remove_range(mmio_ro_ranges, msix->pba.first,
                                    msix->pba.last) )
             WARN();