
[RFC] xen/amd-iommu: Add interrupt remapping quirk for ath11k

Message ID 20250226211125.43625-1-jason.andryuk@amd.com (mailing list archive)
State New

Commit Message

Jason Andryuk Feb. 26, 2025, 9:11 p.m. UTC
Sometimes we have to quirk the PCI IRTE to use a non-zero remap_index
corresponding to the guest's view of the MSI data register.  The MSI
data guest vector equals interrupt remapping table index.

The ath11k wifi device does unusual things with MSIs.  The driver lets
Linux program the MSI capability.  Linux internally caches the MSI data
it thinks it programmed.  It sets its affinity to CPU0.  The ath11k
driver then reads the MSI address from the PCI configuration space.  The
MSI address and cached data are then passed to other components on the
same card to generate MSI interrupts.

With Xen, vPCI and QEMU PCI passthrough have a guest idea of the MSI
address and data.  But Xen programs the actual hardware with its own
address and data.  With per-device IRT, Xen uses index 0.  When the
ath11k driver passes the guest address and data to the hardware, it
generates faults when there is no IRTE for the guest data (~0x25).

To work around this, we can, for per-device IRTs, program the hardware
to use the guest data & associated IRTE.  The address doesn't matter
since the IRTE handles that, and the Xen address & vector can be used as
expected.

For vPCI, the guest MSI data is available at the time of initial MSI
setup, but that is not the case for HVM.  With HVM, the initial MSI
setup is done when PHYSDEVOP_map_pirq is called, but the guest vector is
only available later when XEN_DOMCTL_bind_pt_irq is called.  In that
case, we need to tear down and create a new IRTE.  This later location
can also handle vPCI.

Add pirq_guest_bind_gvec to plumb down the gvec without modifying all
call sites.  Use msi_desc->gvec to pass through the desired value.

Only tested with AMD-Vi.  Requires per-device IRT.  With AMD-Vi, the
number of MSIs is passed in, but a minimum of a page is allocated for
the table.  The vector is 8 bits, giving indices 0-255.  Even with 128-bit
(16-byte) IRTEs, one 4096-byte page holds 4096 / 16 = 256 entries, so we
don't have to worry about overflow.  With N MSIs, the last vector can be at
most 255, so the guest can't expect N consecutive vectors starting above
255 - N.

Signed-off-by: Xenia Ragiadakou <xenia.ragiadakou@amd.com>
Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>
---
Is something like this feasible for inclusion upstream?  I'm asking
before I look into what it would take to support Intel.

e.g. Replace amd_iommu_perdev_intremap with something generic.

The ath11k device supports and tries to enable 32 MSIs.  Linux in PVH
dom0 and HVM domU fails enabling 32 and falls back to just 1, so that is
all that has been tested.

Using msi_desc->gvec should be okay since, with posted interrupts, the
gvec is expected to match.

hvm_pi_update_irte() changes the IRTE but not the MSI data in the PCI
capability, so that isn't suitable by itself.
---
 xen/arch/x86/include/asm/msi.h           |  3 ++-
 xen/arch/x86/irq.c                       | 13 +++++++++++-
 xen/arch/x86/msi.c                       |  1 +
 xen/drivers/passthrough/amd/iommu_intr.c | 25 ++++++++++++++++++++++++
 xen/drivers/passthrough/pci.c            | 24 +++++++++++++++++++++++
 xen/drivers/passthrough/x86/hvm.c        |  3 ++-
 xen/include/xen/irq.h                    |  2 ++
 xen/include/xen/pci.h                    |  2 ++
 8 files changed, 70 insertions(+), 3 deletions(-)

Comments

Jan Beulich Feb. 27, 2025, 8:54 a.m. UTC | #1
On 26.02.2025 22:11, Jason Andryuk wrote:
> Sometimes we have to quirk the PCI IRTE to use a non-zero remap_index
> corresponding to the guest's view of the MSI data register.  The MSI
> data guest vector equals interrupt remapping table index.
> 
> The ath11k wifi device does unusual things with MSIs.  The driver lets
> Linux program the MSI capability.  Linux internally caches the MSI data
> it thinks it programmed.  It sets its affinity to CPU0.  The ath11k
> driver then reads the MSI address from the PCI configuration space.  The
> MSI address and cached data are then passed to other components on the
> same card to generate MSI interrupts.

I'm curious whether it's known how e.g. KVM deals with this.

> With Xen, vPCI and QEMU PCI passthrough have a guest idea of the MSI
> address and data.  But Xen programs the actual hardware with its own
> address and data.  With per-device IRT, xen uses index 0.  When the
> ath11k driver passes the guest address and data to the hardware, it
> generates faults when there is no IRTE for the guest data (~0x25).
> 
> To work around this, we can, for per-device IRTs, program the hardware
> to use the guest data & associated IRTE.  The address doesn't matter
> since the IRTE handles that, and the Xen address & vector can be used as
> expected.
> 
> For vPCI, the guest MSI data is available at the time of initial MSI
> setup, but that is not the case for HVM.  With HVM, the initial MSI
> setup is done when PHYSDEVOP_map_pirq is called, but the guest vector is
> only available later when XEN_DOMCTL_bind_pt_irq is called.  In that
> case, we need to tear down and create a new IRTE.  This later location
> can also handle vPCI.
> 
> Add pirq_guest_bind_gvec to plumb down the gvec without modifying all
> call sites.  Use msi_desc->gvec to pass through the desired value.
> 
> Only tested with AMD-Vi.  Requires per-device IRT.  With AMD-Vi, the
> number of MSIs is passed in, but a minimum of a page is allocated for
> the table.  The vector is 8 bits giving indices 0-255.  Even with 128bit
> IRTEs, 16 bytes, 1 page 4096 / 16 = 256 entries, so we don't have to
> worry about overflow.  N MSIs can only have the last one at 255, so the
> guest can't expect to have N vectors starting above 255 - N.
> 
> Signed-off-by: Xenia Ragiadakou <xenia.ragiadakou@amd.com>
> Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>

Just to clarify: Who's the original patch author? The common expectation
is that the first S-o-b: matches From:.

> ---
> Is something like this feasible for inclusion upstream?  I'm asking
> before I look into what it would take to support Intel.

Well, I wouldn't outright say "no". It needs to be pretty clear that this
doesn't put at risk the "normal" cases. Which is going to be somewhat
difficult considering how convoluted the involved code (sadly) is. See
also the commentary related remark at the very bottom.

> e.g. Replace amd_iommu_perdev_intremap with something generic.
> 
> The ath11k device supports and tries to enable 32 MSIs.  Linux in PVH
> dom0 and HVM domU fails enabling 32 and falls back to just 1, so that is
> all that has been tested.
> 
> Using msi_desc->gvec should be okay since with posted interrupts - the
> gvec is expected to match.
> 
> hvm_pi_update_irte() changes the IRTE but not the MSI data in the PCI
> capability, so that isn't suitable by itself.

These last two paragraphs look to again be related to the VT-d aspect.
Yet there's the middle one which apparently doesn't, hence I'm uncertain
I read all of this as it's intended.

> --- a/xen/drivers/passthrough/amd/iommu_intr.c
> +++ b/xen/drivers/passthrough/amd/iommu_intr.c
> @@ -543,6 +543,31 @@ int cf_check amd_iommu_msi_msg_update_ire(
>      if ( !msg )
>          return 0;
>  
> +    if ( pdev->gvec_as_irte_idx && amd_iommu_perdev_intremap )
> +    {
> +        int new_remap_index = 0;
> +        if ( msi_desc->gvec )
> +        {
> +            printk("%pp: gvec remap_index %#x -> %#x\n", &pdev->sbdf,
> +                   msi_desc->remap_index, msi_desc->gvec);
> +            new_remap_index = msi_desc->gvec;
> +        }
> +
> +        if ( new_remap_index && new_remap_index != msi_desc->remap_index &&
> +             msi_desc->remap_index != -1 )
> +        {
> +            /* Clear any existing entries */
> +            update_intremap_entry_from_msi_msg(iommu, bdf, nr,
> +                                               &msi_desc->remap_index,
> +                                               NULL, NULL);
> +
> +            for ( i = 0; i < nr; ++i )
> +                msi_desc[i].remap_index = -1;
> +
> +            msi_desc->remap_index = new_remap_index;

You zap nr entries, and then set only 1? Doesn't the zapping loop need to
instead be a setting one? Perhaps with a check up front that the last value
used will still fit in 8 bits? Or else make applying the quirk conditional
upon nr == 1?

> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -306,6 +306,17 @@ static void apply_quirks(struct pci_dev *pdev)
>          { PCI_VENDOR_ID_INTEL, 0x6fa0 },
>          { PCI_VENDOR_ID_INTEL, 0x6fc0 },
>      };
> +    static const struct {
> +        uint16_t vendor, device;
> +    } hide_irt[] = {
> +#define PCI_VENDOR_ID_QCOM		0x17cb

At least this wants to move into xen/pci_ids.h.

> +#define QCA6390_DEVICE_ID		0x1101
> +#define QCN9074_DEVICE_ID		0x1104
> +#define WCN6855_DEVICE_ID		0x1103
> +        { PCI_VENDOR_ID_QCOM, QCA6390_DEVICE_ID },
> +        { PCI_VENDOR_ID_QCOM, QCN9074_DEVICE_ID },
> +        { PCI_VENDOR_ID_QCOM, WCN6855_DEVICE_ID },
> +    };

May I ask what's the source of information on which specific devices are
affected by this anomalous behavior? Just the Linux driver?

I'm also uncertain #define-s are very useful in such a case. Raw hex numbers
in the table with a comment indicating the device name ought to be as fine.

> --- a/xen/include/xen/pci.h
> +++ b/xen/include/xen/pci.h
> @@ -127,6 +127,8 @@ struct pci_dev {
>      /* Device with errata, ignore the BARs. */
>      bool ignore_bars;
>  
> +    bool gvec_as_irte_idx;
> +
>      /* Device misbehaving, prevent assigning it to guests. */
>      bool broken;
>  

Overall more commentary would be needed throughout the patch. This field is
just one example where some minimal explanation is missing.

Jan
Roger Pau Monné Feb. 27, 2025, 10:23 a.m. UTC | #2
On Wed, Feb 26, 2025 at 04:11:25PM -0500, Jason Andryuk wrote:
> Sometimes we have to quirk the PCI IRTE to use a non-zero remap_index
> corresponding to the guest's view of the MSI data register.  The MSI
> data guest vector equals interrupt remapping table index.

I think you need some introduction before making this statement about
remapping indexes and IRTEs.

> The ath11k wifi device does unusual things with MSIs.  The driver lets
> Linux program the MSI capability.  Linux internally caches the MSI data
> it thinks it programmed.  It sets its affinity to CPU0.  The ath11k
> driver then reads the MSI address from the PCI configuration space.  The
> MSI address and cached data are then passed to other components on the
> same card to generate MSI interrupts.
> 
> With Xen, vPCI and QEMU PCI passthrough have a guest idea of the MSI
> address and data.  But Xen programs the actual hardware with its own
> address and data.  With per-device IRT, xen uses index 0.

By "Xen uses index 0" I think you mean that when using per-device
interrupt remapping table indexes start at 0 for every device, instead
of all devices sharing the same index address space.

> When the
> ath11k driver passes the guest address and data to the hardware, it
> generates faults when there is no IRTE for the guest data (~0x25).

What does ~0x25 mean in this context?

> To work around this, we can, for per-device IRTs, program the hardware
> to use the guest data & associated IRTE.  The address doesn't matter
> since the IRTE handles that, and the Xen address & vector can be used as
> expected.

All this work on AMD because when interrupt remapping is enabled all
MSIs are handled by the remapping table, while on Intel there's still
a bit in the MSI address field to signal whether the MSI is using a
remapping entry, or is using the "compatibility" format (iow: no
remapping).

> 
> For vPCI, the guest MSI data is available at the time of initial MSI
> setup, but that is not the case for HVM.  With HVM, the initial MSI
> setup is done when PHYSDEVOP_map_pirq is called, but the guest vector is
> only available later when XEN_DOMCTL_bind_pt_irq is called.  In that
> case, we need to tear down and create a new IRTE.  This later location
> can also handle vPCI.
> 
> Add pirq_guest_bind_gvec to plumb down the gvec without modifying all
> call sites.  Use msi_desc->gvec to pass through the desired value.

So basically the solution is to use the guest selected MSI vector as
the interrupt remapping table index, as then the guest can use the MSI
data and address fields without requiring Xen translation.

What about the guest using the same vector across multiple vCPUs?  So
MSI entries having the same vector field, but different target
destination CPUs?  That won't work correctly as all those MSIs will
attempt to use the same IRTE?

Note that when interrupt remapping support was introduced on AMD-Vi it
was indeed the vector that was used as index into the interrupt
remapping table, this was changed in:

2ca9fbd739b8 AMD IOMMU: allocate IRTE entries instead of using a static mapping

> Only tested with AMD-Vi.  Requires per-device IRT.  With AMD-Vi, the
> number of MSIs is passed in, but a minimum of a page is allocated for
> the table.  The vector is 8 bits giving indices 0-255.  Even with 128bit
> IRTEs, 16 bytes, 1 page 4096 / 16 = 256 entries, so we don't have to
> worry about overflow.  N MSIs can only have the last one at 255, so the
> guest can't expect to have N vectors starting above 255 - N.

While this seems like a possible quirk for AMD, what about Intel?

And what about PV?  I think PV mostly works because the migration of
interrupts across CPUs doesn't cause the IRT index to change, but we
should somehow add checks in this regard if this is now a requirement
for this kind of quirky device.

> 
> Signed-off-by: Xenia Ragiadakou <xenia.ragiadakou@amd.com>
> Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>
> ---
> Is something like this feasible for inclusion upstream?  I'm asking
> before I look into what it would take to support Intel.

Intel would be more complicated, as there the usage of the IRT is
explicitly signaled in the MSI address field.  Otherwise it's
considered a "compatibility" (iow: non-translated) MSI.

> e.g. Replace amd_iommu_perdev_intremap with something generic.
> 
> The ath11k device supports and tries to enable 32 MSIs.  Linux in PVH
> dom0 and HVM domU fails enabling 32 and falls back to just 1, so that is
> all that has been tested.

DYK why it fails to enable 32?

> Using msi_desc->gvec should be okay since with posted interrupts - the
> gvec is expected to match.
> 
> hvm_pi_update_irte() changes the IRTE but not the MSI data in the PCI
> capability, so that isn't suitable by itself.
> ---
>  xen/arch/x86/include/asm/msi.h           |  3 ++-
>  xen/arch/x86/irq.c                       | 13 +++++++++++-
>  xen/arch/x86/msi.c                       |  1 +
>  xen/drivers/passthrough/amd/iommu_intr.c | 25 ++++++++++++++++++++++++
>  xen/drivers/passthrough/pci.c            | 24 +++++++++++++++++++++++
>  xen/drivers/passthrough/x86/hvm.c        |  3 ++-
>  xen/include/xen/irq.h                    |  2 ++
>  xen/include/xen/pci.h                    |  2 ++
>  8 files changed, 70 insertions(+), 3 deletions(-)
> 
> diff --git a/xen/arch/x86/include/asm/msi.h b/xen/arch/x86/include/asm/msi.h
> index 378b85ee94..ea1004af14 100644
> --- a/xen/arch/x86/include/asm/msi.h
> +++ b/xen/arch/x86/include/asm/msi.h
> @@ -107,7 +107,8 @@ struct msi_desc {
>      } msi_attrib;
>  
>      bool irte_initialized;
> -    uint8_t gvec;            /* guest vector. valid when pi_desc isn't NULL */
> +    uint8_t gvec;            /* guest vector. valid when pi_desc isn't NULL or
> +                                when pci_dev gvec_as_irte_idx is true */

Missing capital 'V' after full stop.

Nit: multi-line comments should be:

/*
 * guest vector. Valid when pi_desc isn't NULL or
 * when pci_dev gvec_as_irte_idx is true
 */

I would probably move the whole comment ahead of the field
declaration:

    /*
     * Guest vector.  Valid when pi_desc isn't NULL or when pci_dev
     * gvec_as_irte_idx is true.
     */
    uint8_t gvec;

>      const struct pi_desc *pi_desc; /* pointer to posted descriptor */
>  
>      struct list_head list;
> diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
> index ff3ac832f4..3fc73feaea 100644
> --- a/xen/arch/x86/irq.c
> +++ b/xen/arch/x86/irq.c
> @@ -1600,7 +1600,8 @@ int pirq_shared(struct domain *d, int pirq)
>      return shared;
>  }
>  
> -int pirq_guest_bind(struct vcpu *v, struct pirq *pirq, int will_share)
> +int pirq_guest_bind_gvec(struct vcpu *v, struct pirq *pirq, int will_share,

I think you could take the opportunity to convert will_share to a
boolean.

> +                         uint8_t gvec)
>  {
>      struct irq_desc         *desc;
>      irq_guest_action_t *action, *newaction = NULL;
> @@ -1674,7 +1675,12 @@ int pirq_guest_bind(struct vcpu *v, struct pirq *pirq, int will_share)
>                                            &cpu_online_map) )
>                  affinity = desc->affinity;
>              if ( affinity )
> +            {
> +                if ( gvec && desc->msi_desc )
> +                    desc->msi_desc->gvec = gvec;

Hm, this feels a bit out of place.  Shouldn't the field better be set
by pt_irq_create_bind() when irq_type == PT_IRQ_TYPE_MSI and the
quirk is enabled for the device?

> +
>                  desc->handler->set_affinity(desc, affinity);
> +            }
>          }
>  
>          desc->status &= ~IRQ_DISABLED;
> @@ -1730,6 +1736,11 @@ int pirq_guest_bind(struct vcpu *v, struct pirq *pirq, int will_share)
>      return rc;
>  }
>  
> +int pirq_guest_bind(struct vcpu *v, struct pirq *pirq, int will_share)
> +{
> +    return pirq_guest_bind_gvec(v, pirq, will_share, 0);
> +}

Could this be a static inline in some header?
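Along the lines of (minimal sketch, assuming it sits next to the
pirq_guest_bind_gvec() declaration):

    static inline int pirq_guest_bind(struct vcpu *v, struct pirq *pirq,
                                      int will_share)
    {
        return pirq_guest_bind_gvec(v, pirq, will_share, 0);
    }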

> +
>  static irq_guest_action_t *__pirq_guest_unbind(
>      struct domain *d, struct pirq *pirq, struct irq_desc *desc)
>  {
> diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
> index bf5b71822e..cef2987038 100644
> --- a/xen/arch/x86/msi.c
> +++ b/xen/arch/x86/msi.c
> @@ -487,6 +487,7 @@ static struct msi_desc *alloc_msi_entry(unsigned int nr)
>          entry[nr].remap_index = -1;
>          entry[nr].pi_desc = NULL;
>          entry[nr].irte_initialized = false;
> +        entry[nr].gvec = 0;

We should rather use xzalloc_array() instead of xmalloc_array() here,
as that would avoid all this manual setting to NULL, 0 and false.

It would be good to do this as a pre-patch, so that you can avoid the
change here.
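Roughly (sketch only; the non-zero remap_index = -1 init would of course
still need to stay):

    entry = xzalloc_array(struct msi_desc, nr);
    if ( !entry )
        return NULL;

    /* Everything else is zero/NULL/false already. */
    while ( nr-- )
        entry[nr].remap_index = -1;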

>      }
>  
>      return entry;
> diff --git a/xen/drivers/passthrough/amd/iommu_intr.c b/xen/drivers/passthrough/amd/iommu_intr.c
> index c0273059cb..2e228d2c21 100644
> --- a/xen/drivers/passthrough/amd/iommu_intr.c
> +++ b/xen/drivers/passthrough/amd/iommu_intr.c
> @@ -543,6 +543,31 @@ int cf_check amd_iommu_msi_msg_update_ire(
>      if ( !msg )
>          return 0;
>  
> +    if ( pdev->gvec_as_irte_idx && amd_iommu_perdev_intremap )
> +    {
> +        int new_remap_index = 0;

Newline.  You could make this unsigned also by the looks of it?

> +        if ( msi_desc->gvec )
> +        {
> +            printk("%pp: gvec remap_index %#x -> %#x\n", &pdev->sbdf,
> +                   msi_desc->remap_index, msi_desc->gvec);

gprintk(XENLOG_DEBUG, ...

> +            new_remap_index = msi_desc->gvec;
> +        }
> +
> +        if ( new_remap_index && new_remap_index != msi_desc->remap_index &&
> +             msi_desc->remap_index != -1 )
> +        {
> +            /* Clear any existing entries */
> +            update_intremap_entry_from_msi_msg(iommu, bdf, nr,
> +                                               &msi_desc->remap_index,
> +                                               NULL, NULL);

Why do you need to clear any entries?  This will cause a window where
MSI messages targeting these IRTEs generate faults because the
entries are not set up.

Just re-use them, update_intremap_entry_from_msi_msg() will update the
IRTE atomically so that there's no window where the entries would be
invalid, and thus no faults will be generated.

> +
> +            for ( i = 0; i < nr; ++i )
> +                msi_desc[i].remap_index = -1;
> +
> +            msi_desc->remap_index = new_remap_index;
> +        }
> +    }
> +
>      rc = update_intremap_entry_from_msi_msg(iommu, bdf, nr,
>                                              &msi_desc->remap_index,
>                                              msg, &data);

To be on the safe side, I would add a check here that ensures that
update_intremap_entry_from_msi_msg() doesn't change the IRT index
(unless it's -1).
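Something like (untested sketch, reusing the existing call site):

    int old_idx = msi_desc->remap_index;

    rc = update_intremap_entry_from_msi_msg(iommu, bdf, nr,
                                            &msi_desc->remap_index,
                                            msg, &data);
    if ( !rc && old_idx != -1 && msi_desc->remap_index != old_idx )
    {
        printk(XENLOG_ERR "%pp: IRT index changed %d -> %d\n",
               &pdev->sbdf, old_idx, msi_desc->remap_index);
        rc = -EIO;
    }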

> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index e1a09344df..7031aedb94 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -306,6 +306,17 @@ static void apply_quirks(struct pci_dev *pdev)
>          { PCI_VENDOR_ID_INTEL, 0x6fa0 },
>          { PCI_VENDOR_ID_INTEL, 0x6fc0 },
>      };
> +    static const struct {
> +        uint16_t vendor, device;
> +    } hide_irt[] = {

Nit: hide_irt is not very descriptive, I would rather use
force_gvec_as_irti or something similar.

> +#define PCI_VENDOR_ID_QCOM		0x17cb
> +#define QCA6390_DEVICE_ID		0x1101
> +#define QCN9074_DEVICE_ID		0x1104
> +#define WCN6855_DEVICE_ID		0x1103

There are some hard tabs in the defines above which should instead be
spaces.

> +        { PCI_VENDOR_ID_QCOM, QCA6390_DEVICE_ID },
> +        { PCI_VENDOR_ID_QCOM, QCN9074_DEVICE_ID },
> +        { PCI_VENDOR_ID_QCOM, WCN6855_DEVICE_ID },
> +    };
>      unsigned int i;
>  
>      for ( i = 0; i < ARRAY_SIZE(ignore_bars); i++)
> @@ -316,6 +327,19 @@ static void apply_quirks(struct pci_dev *pdev)
>               * from trying to size the BARs or add handlers to trap accesses.
>               */
>              pdev->ignore_bars = true;
> +
> +    for ( i = 0; i < ARRAY_SIZE(hide_irt); i++)
                                                 ^ missing space.
> +    {
> +        if ( vendor == hide_irt[i].vendor &&
> +             device == hide_irt[i].device )
> +        {
> +            pdev->gvec_as_irte_idx = true;
> +            printk("%pp %04x:%04x quirk gvec as intr remap index\n",
> +                   &pdev->sbdf, hide_irt[i].vendor, hide_irt[i].device);
> +            if ( !amd_iommu_perdev_intremap )
> +                printk("gvec quirk requires per-device intr remap!\n");

I think pdev->gvec_as_irte_idx should not be set if there's no perdev
IRT support.  You should also limit the quirk to AMD-Vi systems, note
that amd_iommu_perdev_intremap is defined as:

bool __ro_after_init amd_iommu_perdev_intremap = true;

And hence would unconditionally be true on Intel systems.

> +        }
> +    }
>  }
>  
>  static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn)
> diff --git a/xen/drivers/passthrough/x86/hvm.c b/xen/drivers/passthrough/x86/hvm.c
> index f5faff7a49..5d17f93b06 100644
> --- a/xen/drivers/passthrough/x86/hvm.c
> +++ b/xen/drivers/passthrough/x86/hvm.c
> @@ -307,7 +307,8 @@ int pt_irq_create_bind(
>               */
>              pirq_dpci->dom = d;
>              /* bind after hvm_irq_dpci is setup to avoid race with irq handler*/
> -            rc = pirq_guest_bind(d->vcpu[0], info, 0);
> +            rc = pirq_guest_bind_gvec(d->vcpu[0], info, 0,
> +                                      pirq_dpci->gmsi.gvec);
>              if ( rc == 0 && pt_irq_bind->u.msi.gtable )
>              {
>                  rc = msixtbl_pt_register(d, info, pt_irq_bind->u.msi.gtable);
> diff --git a/xen/include/xen/irq.h b/xen/include/xen/irq.h
> index 95034c0d6b..96109d6ebe 100644
> --- a/xen/include/xen/irq.h
> +++ b/xen/include/xen/irq.h
> @@ -192,6 +192,8 @@ extern void pirq_guest_eoi(struct pirq *pirq);
>  extern void desc_guest_eoi(struct irq_desc *desc, struct pirq *pirq);
>  extern int pirq_guest_unmask(struct domain *d);
>  extern int pirq_guest_bind(struct vcpu *v, struct pirq *pirq, int will_share);
> +extern int pirq_guest_bind_gvec(struct vcpu *v, struct pirq *pirq,
> +                                int will_share, uint8_t gvec);

Hm, it seems like a layering violation to put a x86 specific function
in a common header.

Did you consider hiding the need to use the guest vector as the IRT
index in struct arch_pirq?

>  extern void pirq_guest_unbind(struct domain *d, struct pirq *pirq);
>  extern void pirq_set_affinity(struct domain *d, int pirq,
>                                const cpumask_t *mask);
> diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
> index 4f12bcf089..14afd78f75 100644
> --- a/xen/include/xen/pci.h
> +++ b/xen/include/xen/pci.h
> @@ -127,6 +127,8 @@ struct pci_dev {
>      /* Device with errata, ignore the BARs. */
>      bool ignore_bars;
>  
> +    bool gvec_as_irte_idx;

A small comment might be helpful here:

/* Quirk: force the use of the MSI vector as the IRT index. */

Overall I'm a little uneasy about allowing domains to control the IRT
index address space.  I haven't looked closely enough to see if a
guest could cause some kind of clashes or the triggering of internal
Xen state checks by for example forcing multiple MSI entries to use
the same vector.

Thanks, Roger.
Andrew Cooper Feb. 27, 2025, 11:14 a.m. UTC | #3
On 26/02/2025 9:11 pm, Jason Andryuk wrote:
> @@ -316,6 +327,19 @@ static void apply_quirks(struct pci_dev *pdev)
>               * from trying to size the BARs or add handlers to trap accesses.
>               */
>              pdev->ignore_bars = true;
> +
> +    for ( i = 0; i < ARRAY_SIZE(hide_irt); i++)
> +    {
> +        if ( vendor == hide_irt[i].vendor &&
> +             device == hide_irt[i].device )
> +        {
> +            pdev->gvec_as_irte_idx = true;
> +            printk("%pp %04x:%04x quirk gvec as intr remap index\n",
> +                   &pdev->sbdf, hide_irt[i].vendor, hide_irt[i].device);
> +            if ( !amd_iommu_perdev_intremap )
> +                printk("gvec quirk requires per-device intr remap!\n");

(Covering only what others haven't.)

amd_iommu_perdev_intremap is the subject of XSA-36.

Sadly it still exists, and only because Xen does not have a viable
IOMMU-groups model, so it can only fix amd_sp5100_erratum28() by
disabling the XSA-36 fix and switching back to one fully-shared
interrupt remapping table.

This is of course horrible.  The proper fix is to put the IDE and SATA
controller into the same IOMMU group (force them to be handled as a
unit) at which point *they* can share a intremap table while the rest of
the system uses unique ones.   (Also, disabling the IOMMUs globally
because perdev remapping was disabled and SATA combined mode is active,
is a truly unacceptable thing to do.)

Unfortunately, the Intel problems are relevant here.

amd_iommu_perdev_intremap exists because it was trying to copy how Intel
works.  Intel IOMMUs have a single interrupt remapping table shared by
all devices behind it.  Then Intel realised this was a giant security
vulnerability, and introduced the concept of Source ID verification, to
fix a problem which only exists because the remapping table was shared
to begin with.

On AMD, because we have per domain tables, we allocate in order simply
because it's easy.  And yes, we can allocate out-of-order to match other
setups.

But on Intel, you can't.  The table, and therefore the indices in it,
are shared.

In principle, if you only have a single ath11k device behind the IOMMU,
you could shuffle around existing entries to make holes where you want
them, but this can't be done atomically and you risk interfering with an
active device.

You might get somewhere with allocating top-down in the table by default
and leaving the bottom for magic like this?  But then you've still got
to fix the remappable-form problem that Roger pointed out.

~Andrew
Jason Andryuk Feb. 27, 2025, 4:49 p.m. UTC | #4
On 2025-02-27 03:54, Jan Beulich wrote:
> On 26.02.2025 22:11, Jason Andryuk wrote:
>> Sometimes we have to quirk the PCI IRTE to use a non-zero remap_index
>> corresponding to the guest's view of the MSI data register.  The MSI
>> data guest vector equals interrupt remapping table index.
>>
>> The ath11k wifi device does unusual things with MSIs.  The driver lets
>> Linux program the MSI capability.  Linux internally caches the MSI data
>> it thinks it programmed.  It sets its affinity to CPU0.  The ath11k
>> driver then reads the MSI address from the PCI configuration space.  The
>> MSI address and cached data are then passed to other components on the
>> same card to generate MSI interrupts.
> 
> I'm curious whether it's known how e.g. KVM deals with this.

There were some vfio patches that did not get merged, FWICT.  A Linux 
patch added a quirk to allow the guest to read the hardware MSI values. 
QEMU intercepted access to a memory region of a BAR and swapped guest 
MSI values for hardware MSI values.

https://lore.kernel.org/ath11k/20240812170014.1583783-1-alex.williamson@redhat.com/

I tried something similar, but abandoned it.  The ath11k driver uses 
Linux's cached value of the guest MSI data and passes that to the 
device.   It doesn't re-read the hardware value out of the configuration 
space.  That made me think using the guest data as an index would be a 
better workaround.


>> Signed-off-by: Xenia Ragiadakou <xenia.ragiadakou@amd.com>
>> Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>
> 
> Just to clarify: Who's the original patch author? The common expectation
> is that the first S-o-b: matches From:.

I took Xenia's changes to xen/drivers/passthrough/pci.c and 
xen/include/xen/pci.h from an earlier patch and re-used them.  I wrote 
the rest, so I put myself in the From: line.

>> ---
>> Is something like this feasible for inclusion upstream?  I'm asking
>> before I look into what it would take to support Intel.
> 
> Well, I wouldn't outright say "no". It needs to be pretty clear that this
> doesn't put at risk the "normal" cases. Which is going to be somewhat
> difficult considering how convoluted the involved code (sadly) is. See
> also the commentary related remark at the very bottom.

Ok

>> e.g. Replace amd_iommu_perdev_intremap with something generic.
>>
>> The ath11k device supports and tries to enable 32 MSIs.  Linux in PVH
>> dom0 and HVM domU fails enabling 32 and falls back to just 1, so that is
>> all that has been tested.
>>
>> Using msi_desc->gvec should be okay since with posted interrupts - the
>> gvec is expected to match.
>>
>> hvm_pi_update_irte() changes the IRTE but not the MSI data in the PCI
>> capability, so that isn't suitable by itself.
> 
> These last two paragraphs look to again be related to the VT-d aspect.
> Yet there's the middle one which apparently doesn't, hence I'm uncertain
> I read all of this as it's intended.

Sorry, I was just putting down thoughts.  Yes, the last two paragraphs
were about VT-d integration.

In terms of the number of MSIs, I wanted to highlight that I only tested 
with 1 MSI since I always worry about code I haven't tested.

>> --- a/xen/drivers/passthrough/amd/iommu_intr.c
>> +++ b/xen/drivers/passthrough/amd/iommu_intr.c
>> @@ -543,6 +543,31 @@ int cf_check amd_iommu_msi_msg_update_ire(
>>       if ( !msg )
>>           return 0;
>>   
>> +    if ( pdev->gvec_as_irte_idx && amd_iommu_perdev_intremap )
>> +    {
>> +        int new_remap_index = 0;
>> +        if ( msi_desc->gvec )
>> +        {
>> +            printk("%pp: gvec remap_index %#x -> %#x\n", &pdev->sbdf,
>> +                   msi_desc->remap_index, msi_desc->gvec);
>> +            new_remap_index = msi_desc->gvec;
>> +        }
>> +
>> +        if ( new_remap_index && new_remap_index != msi_desc->remap_index &&
>> +             msi_desc->remap_index != -1 )
>> +        {
>> +            /* Clear any existing entries */
>> +            update_intremap_entry_from_msi_msg(iommu, bdf, nr,
>> +                                               &msi_desc->remap_index,
>> +                                               NULL, NULL);
>> +
>> +            for ( i = 0; i < nr; ++i )
>> +                msi_desc[i].remap_index = -1;
>> +
>> +            msi_desc->remap_index = new_remap_index;
> 
> You zap nr entries, and then set only 1? Doesn't the zapping loop need to
> instead be a setting one? Perhaps with a check up front that the last value
> used will still fit in 8 bits? Or else make applying the quirk conditional
> upon nr == 1?

The code below here sets all `nr` entries on success:

     rc = update_intremap_entry_from_msi_msg(iommu, bdf, nr,
                                             &msi_desc->remap_index,
                                             msg, &data);
     if ( !rc )
     {
         for ( i = 1; i < nr; ++i )
             msi_desc[i].remap_index = msi_desc->remap_index + i;
         msg->data = data;
     }

     return rc;

The single passed in &msi_desc->remap_index is used as the start value 
(when < INTREMAP_MAX_ENTRIES) or is assigned a value.  Checking 
remap_index + nr fits is a good idea.

Maybe all the remap_index setting should be moved into 
update_intremap_entry_from_msi_msg()?
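For the fit check, I'm thinking of guarding the quirk along these lines
(untested sketch, using gprintk() as suggested elsewhere in the thread):

    /* Only adopt the gvec if all nr consecutive indices fit the table. */
    if ( msi_desc->gvec &&
         msi_desc->gvec + nr - 1 < INTREMAP_MAX_ENTRIES )
    {
        gprintk(XENLOG_DEBUG, "%pp: gvec remap_index %#x -> %#x\n",
                &pdev->sbdf, msi_desc->remap_index, msi_desc->gvec);
        new_remap_index = msi_desc->gvec;
    }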

>> --- a/xen/drivers/passthrough/pci.c
>> +++ b/xen/drivers/passthrough/pci.c
>> @@ -306,6 +306,17 @@ static void apply_quirks(struct pci_dev *pdev)

>> +#define QCA6390_DEVICE_ID		0x1101
>> +#define QCN9074_DEVICE_ID		0x1104
>> +#define WCN6855_DEVICE_ID		0x1103
>> +        { PCI_VENDOR_ID_QCOM, QCA6390_DEVICE_ID },
>> +        { PCI_VENDOR_ID_QCOM, QCN9074_DEVICE_ID },
>> +        { PCI_VENDOR_ID_QCOM, WCN6855_DEVICE_ID },
>> +    };
> 
> May I ask what's the source of information on which specific devices are
> affected by this anomalous behavior? Just the Linux driver?

These are just taken from the Linux driver.  Tested with WCN6855 0x1103.

> I'm also uncertain #define-s are very useful in such a case. Raw hex numbers
> in the table with a comment indicating the device name ought to be as fine.

Ok.

>> --- a/xen/include/xen/pci.h
>> +++ b/xen/include/xen/pci.h
>> @@ -127,6 +127,8 @@ struct pci_dev {
>>       /* Device with errata, ignore the BARs. */
>>       bool ignore_bars;
>>   
>> +    bool gvec_as_irte_idx;
>> +
>>       /* Device misbehaving, prevent assigning it to guests. */
>>       bool broken;
>>   
> 
> Overall more commentary would be needed throughout the patch. This field is
> just one example where some minimal explanation is missing.

Ok.

Thanks for taking a look.

Regards,
Jason
Jason Andryuk Feb. 27, 2025, 6:28 p.m. UTC | #5
On 2025-02-27 05:23, Roger Pau Monné wrote:
> On Wed, Feb 26, 2025 at 04:11:25PM -0500, Jason Andryuk wrote:
>> Sometimes we have to quirk the PCI IRTE to use a non-zero remap_index
>> corresponding to the guest's view of the MSI data register.  The MSI
>> data guest vector equals interrupt remapping table index.
> 
> I think you need some introduction before making this statement about
> remapping indexes and IRTEs.

I can drop or move later.

>> The ath11k wifi device does unusual things with MSIs.  The driver lets
>> Linux program the MSI capability.  Linux internally caches the MSI data
>> it thinks it programmed.  It sets its affinity to CPU0.  The ath11k
>> driver then reads the MSI address from the PCI configuration space.  The
>> MSI address and cached data are then passed to other components on the
>> same card to generate MSI interrupts.
>>
>> With Xen, vPCI and QEMU PCI passthrough have a guest idea of the MSI
>> address and data.  But Xen programs the actual hardware with its own
>> address and data.  With per-device IRT, xen uses index 0.
> 
> By "Xen uses index 0" I think you mean that when using per-device
> interrupt remapping table indexes start at 0 for every device, instead
> of all devices sharing the same index address space.

Yes.

>> When the
>> ath11k driver passes the guest address and data to the hardware, it
>> generates faults when there is no IRTE for the guest data (~0x25).
> 
> What does ~0x25 mean in this context?

It was supposed to be an example of the observed MSI data in the range 
0x25-0x28.  Maybe I should just state non-zero.

>> To work around this, we can, for per-device IRTs, program the hardware
>> to use the guest data & associated IRTE.  The address doesn't matter
>> since the IRTE handles that, and the Xen address & vector can be used as
>> expected.
> 
> All this work on AMD because when interrupt remapping is enabled all
> MSIs are handled by the remapping table, while on Intel there's still
> a bit in the MSI address field to signal whether the MSI is using a
> remapping entry, or is using the "compatibility" format (iow: no
> remapping).

So, on Intel, if the guest hands the device the MSI address, it can
decide to bypass remapping?

Thanks for providing insight into the Intel inner workings.  That's why 
I am asking.

>>
>> For vPCI, the guest MSI data is available at the time of initial MSI
>> setup, but that is not the case for HVM.  With HVM, the initial MSI
>> setup is done when PHYSDEVOP_map_pirq is called, but the guest vector is
>> only available later when XEN_DOMCTL_bind_pt_irq is called.  In that
>> case, we need to tear down and create a new IRTE.  This later location
>> can also handle vPCI.
>>
>> Add pirq_guest_bind_gvec to plumb down the gvec without modifying all
>> call sites.  Use msi_desc->gvec to pass through the desired value.
> 
> So basically the solution is to use the guest selected MSI vector as
> the interrupt remapping table index, as then the guest can use the MSI
> data and address fields without requiring Xen translation.
> 
> What about the guest using the same vector across multiple vCPUs?  So
> MSI entries having the same vector field, but different target
> destination CPUs?  That won't work correctly as all those MSIs will
> attempt to use the same IRTE?
> 
> Note that when interrupt remapping support was introduced on AMD-Vi it
> was indeed the vector that was used as index into the interrupt
> remapping table, this was changed in:
> 
> 2ca9fbd739b8 AMD IOMMU: allocate IRTE entries instead of using a static mapping
> 
>> Only tested with AMD-Vi.  Requires per-device IRT.  With AMD-Vi, the
>> number of MSIs is passed in, but a minimum of a page is allocated for
>> the table.  The vector is 8 bits giving indices 0-255.  Even with 128bit
>> IRTEs, 16 bytes, 1 page 4096 / 16 = 256 entries, so we don't have to
>> worry about overflow.  N MSIs can only have the last one at 255, so the
>> guest can't expect to have N vectors starting above 255 - N.
> 
> While this seems like a possible quirk for AMD, what about Intel?
> 
> And what about PV?  I think PV mostly works because the migration of
> interrupts across CPUs doesn't cause the IRT index to change, but we
> should somehow add checks to this regard if this is now a requirement
> for such kind of quirky devices.

I didn't try, but PV dom0 worked with the device with multiple MSI.

>>
>> Signed-off-by: Xenia Ragiadakou <xenia.ragiadakou@amd.com>
>> Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>
>> ---
>> Is something like this feasible for inclusion upstream?  I'm asking
>> before I look into what it would take to support Intel.
> 
> Intel would be more complicated, as there the usage of the IRT is
> explicitly signaled in the MSI address field.  Otherwise it's
> considered a "compatibility" (iow: non-translated) MSI.

Hmmm, ok.

>> e.g. Replace amd_iommu_perdev_intremap with something generic.
>>
>> The ath11k device supports and tries to enable 32 MSIs.  Linux in PVH
>> dom0 and HVM domU fails enabling 32 and falls back to just 1, so that is
>> all that has been tested.
> 
> DYK why it fails to enable 32?

Not exactly - someone else had the card.  msi_capability_init() failed. 
If it ends up in arch_setup_msi_irqs(), only 1 MSI is supported.  But 
precisely where the multiple nvecs were denied was not tracked down.

>> Using msi_desc->gvec should be okay since with posted interrupts - the
>> gvec is expected to match.
>>
>> hvm_pi_update_irte() changes the IRTE but not the MSI data in the PCI
>> capability, so that isn't suitable by itself.
>> ---
>>   xen/arch/x86/include/asm/msi.h           |  3 ++-
>>   xen/arch/x86/irq.c                       | 13 +++++++++++-
>>   xen/arch/x86/msi.c                       |  1 +
>>   xen/drivers/passthrough/amd/iommu_intr.c | 25 ++++++++++++++++++++++++
>>   xen/drivers/passthrough/pci.c            | 24 +++++++++++++++++++++++
>>   xen/drivers/passthrough/x86/hvm.c        |  3 ++-
>>   xen/include/xen/irq.h                    |  2 ++
>>   xen/include/xen/pci.h                    |  2 ++
>>   8 files changed, 70 insertions(+), 3 deletions(-)
>>
>> diff --git a/xen/arch/x86/include/asm/msi.h b/xen/arch/x86/include/asm/msi.h
>> index 378b85ee94..ea1004af14 100644
>> --- a/xen/arch/x86/include/asm/msi.h
>> +++ b/xen/arch/x86/include/asm/msi.h
>> @@ -107,7 +107,8 @@ struct msi_desc {
>>       } msi_attrib;
>>   
>>       bool irte_initialized;
>> -    uint8_t gvec;            /* guest vector. valid when pi_desc isn't NULL */
>> +    uint8_t gvec;            /* guest vector. valid when pi_desc isn't NULL or
>> +                                when pci_dev gvec_as_irte_idx is true */
> 
> Missing capital 'V' after full stop.
> 
> Nit: multi-line comments should be:
> 
> /*
>   * guest vector. Valid when pi_desc isn't NULL or
>   * when pci_dev gvec_as_irte_idx is true
>   */
> 
> I would probably move the whole comment ahead of the field
> declaration:
> 
>      /*
>       * Guest vector.  Valid when pi_desc isn't NULL or when pci_dev
>       * gvec_as_irte_idx is true.
>       */
>      uint8_t gvec;

Sounds good.

>>       const struct pi_desc *pi_desc; /* pointer to posted descriptor */
>>   
>>       struct list_head list;
>> diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
>> index ff3ac832f4..3fc73feaea 100644
>> --- a/xen/arch/x86/irq.c
>> +++ b/xen/arch/x86/irq.c
>> @@ -1600,7 +1600,8 @@ int pirq_shared(struct domain *d, int pirq)
>>       return shared;
>>   }
>>   
>> -int pirq_guest_bind(struct vcpu *v, struct pirq *pirq, int will_share)
>> +int pirq_guest_bind_gvec(struct vcpu *v, struct pirq *pirq, int will_share,
> 
> I think you could take the opportunity to convert will_share to a
> boolean.

Ok.

>> +                         uint8_t gvec)
>>   {
>>       struct irq_desc         *desc;
>>       irq_guest_action_t *action, *newaction = NULL;
>> @@ -1674,7 +1675,12 @@ int pirq_guest_bind(struct vcpu *v, struct pirq *pirq, int will_share)
>>                                             &cpu_online_map) )
>>                   affinity = desc->affinity;
>>               if ( affinity )
>> +            {
>> +                if ( gvec && desc->msi_desc )
>> +                    desc->msi_desc->gvec = gvec;
> 
> Hm, this feels a bit out of place.  Shouldn't the field better be set
> by pt_irq_create_bind() when irq_type == PT_IRQ_TYPE_MSI and the
> quirk is enabled for the device?

I can look again, but I put it here for simplicity. 
pt_irq_create_bind() has the gvec, but not the irq_desc.  Passing gvec 
into pirq_guest_bind() was the easiest way to get the gvec into the 
msi_desc.

The gvec is in pirq_dpci, so maybe that can just be looked up lower down 
closer to programming the hardware.

>> +
>>                   desc->handler->set_affinity(desc, affinity);
>> +            }
>>           }
>>   
>>           desc->status &= ~IRQ_DISABLED;
>> @@ -1730,6 +1736,11 @@ int pirq_guest_bind(struct vcpu *v, struct pirq *pirq, int will_share)
>>       return rc;
>>   }
>>   
>> +int pirq_guest_bind(struct vcpu *v, struct pirq *pirq, int will_share)
>> +{
>> +    return pirq_guest_bind_gvec(v, pirq, will_share, 0);
>> +}
> 
> Could this be a static inline in some header?

Sure.

>> +
>>   static irq_guest_action_t *__pirq_guest_unbind(
>>       struct domain *d, struct pirq *pirq, struct irq_desc *desc)
>>   {
>> diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
>> index bf5b71822e..cef2987038 100644
>> --- a/xen/arch/x86/msi.c
>> +++ b/xen/arch/x86/msi.c
>> @@ -487,6 +487,7 @@ static struct msi_desc *alloc_msi_entry(unsigned int nr)
>>           entry[nr].remap_index = -1;
>>           entry[nr].pi_desc = NULL;
>>           entry[nr].irte_initialized = false;
>> +        entry[nr].gvec = 0;
> 
> We should rather use xzalloc_array() instead of xmalloc_array() here,
> as that would avoid all this manual setting to NULL, 0 and false.
> 
> It would be good to do this as a pre-patch, so that you can avoid the
> change here.

Sounds good.

>>       }
>>   
>>       return entry;
>> diff --git a/xen/drivers/passthrough/amd/iommu_intr.c b/xen/drivers/passthrough/amd/iommu_intr.c
>> index c0273059cb..2e228d2c21 100644
>> --- a/xen/drivers/passthrough/amd/iommu_intr.c
>> +++ b/xen/drivers/passthrough/amd/iommu_intr.c
>> @@ -543,6 +543,31 @@ int cf_check amd_iommu_msi_msg_update_ire(
>>       if ( !msg )
>>           return 0;
>>   
>> +    if ( pdev->gvec_as_irte_idx && amd_iommu_perdev_intremap )
>> +    {
>> +        int new_remap_index = 0;
> 
> Newline.  You could make this unsigned also by the looks of it?
> 
>> +        if ( msi_desc->gvec )
>> +        {
>> +            printk("%pp: gvec remap_index %#x -> %#x\n", &pdev->sbdf,
>> +                   msi_desc->remap_index, msi_desc->gvec);
> 
> gprintk(XENLOG_DEBUG, ...

>> +            new_remap_index = msi_desc->gvec;
>> +        }
>> +
>> +        if ( new_remap_index && new_remap_index != msi_desc->remap_index &&
>> +             msi_desc->remap_index != -1 )
>> +        {
>> +            /* Clear any existing entries */
>> +            update_intremap_entry_from_msi_msg(iommu, bdf, nr,
>> +                                               &msi_desc->remap_index,
>> +                                               NULL, NULL);
> 
> Why do you need to clear any entries?  This will cause a window where
> MSI entries targeting this IRTEs to generate faults because the
> entries are not setup.
> 
> Just re-use them, update_intremap_entry_from_msi_msg() will update the
> IRTE atomically so that there's no window where the entries would be
> invalid, and thus to faults will be generated.

I see your point about the window.  I was trying to keep it clean as 
different indices get populated.  Initially, IRT[0..n-1] is populated. 
Later, when the gvec is available, we want IRT[gvec..gvec+n-1] 
populated.  I guess the new gvec ones could be added, and then 
0...gvec-1 removed.  Or don't bother?

I considered leaving IRTE[0] and adding IRTE[gvec].  I think that could 
work, but would be more hacky.

I was trying to keep the irte accounting bitmap correct, but it doesn't 
really matter for per-device IRT.

>> +
>> +            for ( i = 0; i < nr; ++i )
>> +                msi_desc[i].remap_index = -1;
>> +
>> +            msi_desc->remap_index = new_remap_index;
>> +        }
>> +    }
>> +
>>       rc = update_intremap_entry_from_msi_msg(iommu, bdf, nr,
>>                                               &msi_desc->remap_index,
>>                                               msg, &data);
> 
> To be on the safe side, I would add a check here that ensures that
> update_intremap_entry_from_msi_msg() doesn't change the IRT index
> (unless it's -1).

Ok

> 
>> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
>> index e1a09344df..7031aedb94 100644
>> --- a/xen/drivers/passthrough/pci.c
>> +++ b/xen/drivers/passthrough/pci.c
>> @@ -306,6 +306,17 @@ static void apply_quirks(struct pci_dev *pdev)
>>           { PCI_VENDOR_ID_INTEL, 0x6fa0 },
>>           { PCI_VENDOR_ID_INTEL, 0x6fc0 },
>>       };
>> +    static const struct {
>> +        uint16_t vendor, device;
>> +    } hide_irt[] = {
> 
> Nit: hide_irt is not very descriptive, I would rather use
> force_gvec_as_irti or something similar.

Ok.

>> +#define PCI_VENDOR_ID_QCOM		0x17cb
>> +#define QCA6390_DEVICE_ID		0x1101
>> +#define QCN9074_DEVICE_ID		0x1104
>> +#define WCN6855_DEVICE_ID		0x1103
> 
> There are some hard tabs in the defines above which should instead be
> spaces.

Ok.  Will probably go away with Jan's suggestion to remove the defines.

>> +        { PCI_VENDOR_ID_QCOM, QCA6390_DEVICE_ID },
>> +        { PCI_VENDOR_ID_QCOM, QCN9074_DEVICE_ID },
>> +        { PCI_VENDOR_ID_QCOM, WCN6855_DEVICE_ID },
>> +    };
>>       unsigned int i;
>>   
>>       for ( i = 0; i < ARRAY_SIZE(ignore_bars); i++)
>> @@ -316,6 +327,19 @@ static void apply_quirks(struct pci_dev *pdev)
>>                * from trying to size the BARs or add handlers to trap accesses.
>>                */
>>               pdev->ignore_bars = true;
>> +
>> +    for ( i = 0; i < ARRAY_SIZE(hide_irt); i++)
>                                                   ^ missing space.

Yes, thanks.

>> +    {
>> +        if ( vendor == hide_irt[i].vendor &&
>> +             device == hide_irt[i].device )
>> +        {
>> +            pdev->gvec_as_irte_idx = true;
>> +            printk("%pp %04x:%04x quirk gvec as intr remap index\n",
>> +                   &pdev->sbdf, hide_irt[i].vendor, hide_irt[i].device);
>> +            if ( !amd_iommu_perdev_intremap )
>> +                printk("gvec quirk requires per-device intr remap!\n");
> 
> I think pdev->gvec_as_irte_idx should not be set if there's no perdev
> IRT support.  You should also limit the quirk to AMD-Vi systems, note
> that amd_iommu_perdev_intremap is defined as:
> 
> bool __ro_after_init amd_iommu_perdev_intremap = true;
> 
> And hence would unconditionally be true on Intel systems.

Thanks.  I didn't immediately see a way to check which iommu 
implementation was in use.

>> +        }
>> +    }
>>   }
>>   
>>   static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn)

>> diff --git a/xen/include/xen/irq.h b/xen/include/xen/irq.h
>> index 95034c0d6b..96109d6ebe 100644
>> --- a/xen/include/xen/irq.h
>> +++ b/xen/include/xen/irq.h
>> @@ -192,6 +192,8 @@ extern void pirq_guest_eoi(struct pirq *pirq);
>>   extern void desc_guest_eoi(struct irq_desc *desc, struct pirq *pirq);
>>   extern int pirq_guest_unmask(struct domain *d);
>>   extern int pirq_guest_bind(struct vcpu *v, struct pirq *pirq, int will_share);
>> +extern int pirq_guest_bind_gvec(struct vcpu *v, struct pirq *pirq,
>> +                                int will_share, uint8_t gvec);
> 
> Hm, it seems like a layering violation to put a x86 specific function
> in a common header.

Oh, yes, this could be internal to x86.

> Did you consider hiding the need to use the guest vector as the IRT
> index in struct arch_pirq?

With sufficient pointer following, the gvec can probably be found. 
Passing gvec to pirq_guest_bind_gvec() was just the easiest way to 
bridge the gap.

>>   extern void pirq_guest_unbind(struct domain *d, struct pirq *pirq);
>>   extern void pirq_set_affinity(struct domain *d, int pirq,
>>                                 const cpumask_t *mask);
>> diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
>> index 4f12bcf089..14afd78f75 100644
>> --- a/xen/include/xen/pci.h
>> +++ b/xen/include/xen/pci.h
>> @@ -127,6 +127,8 @@ struct pci_dev {
>>       /* Device with errata, ignore the BARs. */
>>       bool ignore_bars;
>>   
>> +    bool gvec_as_irte_idx;
> 
> A small comment might be helpful here:
> 
> /* Quirk: force the use of the MSI vector as the IRT index. */

Sounds good.

> Overall I'm a little at unease for allowing domains to control the IRT
> index address space.  I haven't looked closely enough to see if a
> guest could cause some kind of clashes or the triggering of internal
> Xen state checks by for example forcing multiple MSI entries to use
> the same vector.

I was thinking that with per-device intremap, and the fact that there is
only a single MSI capability on the device, the change is fairly 
contained.  It's just changing the indices.  Xen is still controlling 
the contents of the IRTEs, so that seems okay to me.

Thanks for taking a look.

Regards,
Jason
Andrew Cooper Feb. 27, 2025, 7 p.m. UTC | #6
On 27/02/2025 6:28 pm, Jason Andryuk wrote:
> On 2025-02-27 05:23, Roger Pau Monné wrote:
>>> To work around this, we can, for per-device IRTs, program the hardware
>>> to use the guest data & associated IRTE.  The address doesn't matter
>>> since the IRTE handles that, and the Xen address & vector can be
>>> used as
>>> expected.
>>
>> All this work on AMD because when interrupt remapping is enabled all
>> MSIs are handled by the remapping table, while on Intel there's still
>> a bit in the MSI address field to signal whether the MSI is using a
>> remapping entry, or is using the "compatibility" format (iow: no
>> remapping).
>
> So, on Intel, if the guest hands the device the MSI address, it can
> decided to bypass remapping?
>
> Thanks for providing insight into the Intel inner workings.  That's
> why I am asking.

Yes.  In the IOMMU you can choose between blocking or permitting
compatibility-form interrupts, but you can't cause them to become remapped.

~Andrew
Roger Pau Monné Feb. 28, 2025, 9:36 a.m. UTC | #7
On Thu, Feb 27, 2025 at 01:28:11PM -0500, Jason Andryuk wrote:
> On 2025-02-27 05:23, Roger Pau Monné wrote:
> > On Wed, Feb 26, 2025 at 04:11:25PM -0500, Jason Andryuk wrote:
> > > When the
> > > ath11k driver passes the guest address and data to the hardware, it
> > > generates faults when there is no IRTE for the guest data (~0x25).
> > 
> > What does ~0x25 mean in this context?
> 
> It was supposed to be an example of the observed MSI data in the range
> 0x25-0x28.  Maybe I should just state non-zero.

I don't think the data range matters much, I would just drop it.

> > > To work around this, we can, for per-device IRTs, program the hardware
> > > to use the guest data & associated IRTE.  The address doesn't matter
> > > since the IRTE handles that, and the Xen address & vector can be used as
> > > expected.
> > 
> > All this work on AMD because when interrupt remapping is enabled all
> > MSIs are handled by the remapping table, while on Intel there's still
> > a bit in the MSI address field to signal whether the MSI is using a
> > remapping entry, or is using the "compatibility" format (iow: no
> > remapping).
> 
> So, on Intel, if the guest hands the device the MSI address, it can decided
> to bypass remapping?
> 
> Thanks for providing insight into the Intel inner workings.  That's why I am
> asking.

Yes, sorry, I'm afraid I don't have any good solution for Intel, at
least not anything similar to what you propose to do on AMD-Vi.  I
guess we could take a partial solution for AMD-Vi only, but it's
sub-optimal from Xen perspective to have a piece of hardware working
fine on AMD and not on Intel.

> > > 
> > > For vPCI, the guest MSI data is available at the time of initial MSI
> > > setup, but that is not the case for HVM.  With HVM, the initial MSI
> > > setup is done when PHYSDEVOP_map_pirq is called, but the guest vector is
> > > only available later when XEN_DOMCTL_bind_pt_irq is called.  In that
> > > case, we need to tear down and create a new IRTE.  This later location
> > > can also handle vPCI.
> > > 
> > > Add pirq_guest_bind_gvec to plumb down the gvec without modifying all
> > > call sites.  Use msi_desc->gvec to pass through the desired value.
> > 
> > So basically the solution is to use the guest selected MSI vector as
> > the interrupt remapping table index, as then the guest can use the MSI
> > data and address fields without requiring Xen translation.
> > 
> > What about the guest using the same vector across multiple vCPUs?  So
> > MSI entries having the same vector field, but different target
> > destination CPUs?  That won't work correctly as all those MSIs will
> > attempt to use the same IRTE?

I think you will also need to add some extra checks to ensure that
when this quirk is active the guest will always set APIC ID 0 as the
interrupt destination for all MSI entries for the affected device, so
that there cannot be vector overlap between CPUs.  Otherwise the quirk
won't work as expected.

> > Note that when interrupt remapping support was introduced on AMD-Vi it
> > was indeed the vector that was used as index into the interrupt
> > remapping table, this was changed in:
> > 
> > 2ca9fbd739b8 AMD IOMMU: allocate IRTE entries instead of using a static mapping
> > 
> > > Only tested with AMD-Vi.  Requires per-device IRT.  With AMD-Vi, the
> > > number of MSIs is passed in, but a minimum of a page is allocated for
> > > the table.  The vector is 8 bits giving indices 0-255.  Even with 128bit
> > > IRTEs, 16 bytes, 1 page 4096 / 16 = 256 entries, so we don't have to
> > > worry about overflow.  N MSIs can only have the last one at 255, so the
> > > guest can't expect to have N vectors starting above 255 - N.
> > 
> > While this seems like a possible quirk for AMD, what about Intel?
> > 
> > And what about PV?  I think PV mostly works because the migration of
> > interrupts across CPUs doesn't cause the IRT index to change, but we
> > should somehow add checks to this regard if this is now a requirement
> > for such kind of quirky devices.
> 
> I didn't try, but PV dom0 worked with the device with multiple MSI.

Oh, so there's something about HVM/PVH that makes multiple MSI not
work.  I think we should figure out what it is before accepting a
solution.

> > > e.g. Replace amd_iommu_perdev_intremap with something generic.
> > > 
> > > The ath11k device supports and tries to enable 32 MSIs.  Linux in PVH
> > > dom0 and HVM domU fails enabling 32 and falls back to just 1, so that is
> > > all that has been tested.
> > 
> > DYK why it fails to enable 32?
> 
> Not exactly - someone else had the card.  msi_capability_init() failed. If
> it ends up in arch_setup_msi_irqs(), only 1 MSI is supported.  But precisely
> where the multiple nvecs was denied was not tracked down.

Does it also fail on native?  I'm mostly asking because it would be
good to get to the bottom of this, so that we don't come up with a
partial solution that will break if multi-msi is used later in Linux.

> > > +                         uint8_t gvec)
> > >   {
> > >       struct irq_desc         *desc;
> > >       irq_guest_action_t *action, *newaction = NULL;
> > > @@ -1674,7 +1675,12 @@ int pirq_guest_bind(struct vcpu *v, struct pirq *pirq, int will_share)
> > >                                             &cpu_online_map) )
> > >                   affinity = desc->affinity;
> > >               if ( affinity )
> > > +            {
> > > +                if ( gvec && desc->msi_desc )
> > > +                    desc->msi_desc->gvec = gvec;
> > 
> > Hm, this feels a bit out of place.  Shouldn't the field better be set
> > by pt_irq_create_bind() when irq_type == PT_IRQ_TYPE_MSI and the
> > quirk is enabled for the device?
> 
> I can look again, but I put it here for simplicity. pt_irq_create_bind() has
> the gvec, but not the irq_desc.  Passing gvec into pirq_guest_bind() was the
> easiest way to get the gvec into the msi_desc.
> 
> The gvec is in pirq_dpci, so maybe that can just be looked up lower down
> closer to programming the hardware.

TBH it's not a blocker, but I thought it would be more in place to
deal with all MSI related stuff in pt_irq_create_bind(), so that you
could also set the field there.
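
For reference, something like this is what I was picturing - only a
sketch, and pirq_spin_lock_irq_desc() is what I believe is the right
helper to go from the pirq to the irq_desc, but I haven't tried it:

    /* PT_IRQ_TYPE_MSI path of pt_irq_create_bind(), sketch only. */
    if ( quirk_active /* pdev->gvec_as_irte_idx, lookup hand-waved */ )
    {
        unsigned long flags;
        struct irq_desc *desc = pirq_spin_lock_irq_desc(info, &flags);

        if ( desc )
        {
            if ( desc->msi_desc )
                desc->msi_desc->gvec = pirq_dpci->gmsi.gvec;
            spin_unlock_irqrestore(&desc->lock, flags);
        }
    }

That would also keep pirq_guest_bind() untouched.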

> > > +            new_remap_index = msi_desc->gvec;
> > > +        }
> > > +
> > > +        if ( new_remap_index && new_remap_index != msi_desc->remap_index &&
> > > +             msi_desc->remap_index != -1 )
> > > +        {
> > > +            /* Clear any existing entries */
> > > +            update_intremap_entry_from_msi_msg(iommu, bdf, nr,
> > > +                                               &msi_desc->remap_index,
> > > +                                               NULL, NULL);
> > 
> > Why do you need to clear any entries?  This will cause a window where
> > MSI entries targeting this IRTEs to generate faults because the
> > entries are not setup.
> > 
> > Just re-use them, update_intremap_entry_from_msi_msg() will update the
> > IRTE atomically so that there's no window where the entries would be
> > invalid, and thus no faults will be generated.
> 
> I see your point about the window.  I was trying to keep it clean as
> different indices get populated.  Initially, IRT[0..n-1] is populated.

Hm, I see.  For this specific use-case you are changing the IRT index
when the guest updates the MSI vector.  Tearing down of the old
entries would better be done _after_ the MSI entry has been updated,
so that at all times the pointed IRTE is valid.

> Later, when the gvec is available, we want IRT[gvec..gvec+n-1] populated.  I
> guess the new gvec ones could be added, and then 0...gvec-1 removed.  Or
> don't bother?

Indeed, that would be a better approach, as then the IRTE would always
be valid.

In fact you could possibly leave the old IRTE entries as-is, they
would be unhooked from any MSI entry, and if re-used they would be
setup correctly.  For this specific quirk where vector == IRT index
there's never the need to search for a free IRTE, as the guest set
vector will dictate which IRTE to use.

I guess it would be nice to attempt to keep the inuse IRTE bitmap in
sync if possible.

> I considered leaving IRTE[0] and adding IRTE[gvec].  I think that could
> work, but would be more hacky.
> 
> I was trying to keep the irte accounting bitmap correct, but it doesn't
> really matter for per-device IRT.

Yes, that's my thinking too.  If you can move the call to tear down
the old IRTE to after the new one has been set up and the MSI entry
has been updated, that would be the best approach IMO.
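
Something along these lines is what I mean - a sketch only, reusing the
variables from your hunk, and keeping the assumption the patch already
makes that update_intremap_entry_from_msi_msg() re-uses a non-negative
*remap_index instead of allocating a new one.  Deferring the teardown
until after the caller has written the new MSI data would need a bit
more plumbing than shown here:

    if ( pdev->gvec_as_irte_idx && amd_iommu_perdev_intremap &&
         msi_desc->gvec && msi_desc->gvec != msi_desc->remap_index )
    {
        int old_remap_index = msi_desc->remap_index;

        /* Program the IRTE(s) at the guest-vector index first ... */
        msi_desc->remap_index = msi_desc->gvec;
        rc = update_intremap_entry_from_msi_msg(iommu, bdf, nr,
                                                &msi_desc->remap_index,
                                                msg, &data);
        if ( rc )
        {
            msi_desc->remap_index = old_remap_index;
            return rc;
        }

        for ( i = 1; i < nr; ++i )
            msi_desc[i].remap_index = msi_desc->remap_index + i;
        msg->data = data;

        /* ... and only then drop the previously used entries, if any. */
        if ( old_remap_index >= 0 &&
             old_remap_index != msi_desc->remap_index )
            update_intremap_entry_from_msi_msg(iommu, bdf, nr,
                                               &old_remap_index,
                                               NULL, NULL);

        return 0;
    }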

> > > diff --git a/xen/include/xen/irq.h b/xen/include/xen/irq.h
> > > index 95034c0d6b..96109d6ebe 100644
> > > --- a/xen/include/xen/irq.h
> > > +++ b/xen/include/xen/irq.h
> > > @@ -192,6 +192,8 @@ extern void pirq_guest_eoi(struct pirq *pirq);
> > >   extern void desc_guest_eoi(struct irq_desc *desc, struct pirq *pirq);
> > >   extern int pirq_guest_unmask(struct domain *d);
> > >   extern int pirq_guest_bind(struct vcpu *v, struct pirq *pirq, int will_share);
> > > +extern int pirq_guest_bind_gvec(struct vcpu *v, struct pirq *pirq,
> > > +                                int will_share, uint8_t gvec);
> > 
> > Hm, it seems like a layering violation to put a x86 specific function
> > in a common header.
> 
> Oh, yes, this could be internal to x86.
> 
> > Did you consider hiding the need to use the guest vector as the IRT
> > index in struct arch_pirq?
> 
> With sufficient pointer following, the gvec can probably be found. Passing
> gvec to pirq_guest_bind_gvec() was just the easiest way to bridge the gap.

No strong opinion, just wondering whether it was considered and if it
could be easier to implement.

> > >   extern void pirq_guest_unbind(struct domain *d, struct pirq *pirq);
> > >   extern void pirq_set_affinity(struct domain *d, int pirq,
> > >                                 const cpumask_t *mask);
> > > diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
> > > index 4f12bcf089..14afd78f75 100644
> > > --- a/xen/include/xen/pci.h
> > > +++ b/xen/include/xen/pci.h
> > > @@ -127,6 +127,8 @@ struct pci_dev {
> > >       /* Device with errata, ignore the BARs. */
> > >       bool ignore_bars;
> > > +    bool gvec_as_irte_idx;
> > 
> > A small comment might be helpful here:
> > 
> > /* Quirk: force the use of the MSI vector as the IRT index. */
> 
> Sounds good.
> 
> > Overall I'm a little at unease for allowing domains to control the IRT
> > index address space.  I haven't looked closely enough to see if a
> > guest could cause some kind of clashes or the triggering of internal
> > Xen state checks by for example forcing multiple MSI entries to use
> > the same vector.
> 
> I was thinking that with per-device intremap, and the fact that it is only a
> single MSI capability for the device, the change is fairly contained.  It's
> just changing the indices.  Xen is still controlling the contents of the
> IRTEs, so that seems okay to me.

Indeed.  I cannot find any obvious issue.

Thanks, Roger.
Jason Andryuk Feb. 28, 2025, 8:25 p.m. UTC | #8
On 2025-02-28 04:36, Roger Pau Monné wrote:
> On Thu, Feb 27, 2025 at 01:28:11PM -0500, Jason Andryuk wrote:
>> On 2025-02-27 05:23, Roger Pau Monné wrote:
>>> On Wed, Feb 26, 2025 at 04:11:25PM -0500, Jason Andryuk wrote:

>>>> To work around this, we can, for per-device IRTs, program the hardware
>>>> to use the guest data & associated IRTE.  The address doesn't matter
>>>> since the IRTE handles that, and the Xen address & vector can be used as
>>>> expected.
>>>
>>> All this works on AMD because when interrupt remapping is enabled all
>>> MSIs are handled by the remapping table, while on Intel there's still
>>> a bit in the MSI address field to signal whether the MSI is using a
>>> remapping entry, or is using the "compatibility" format (iow: no
>>> remapping).
>>
>> So, on Intel, if the guest hands the device the MSI address, it can decide
>> to bypass remapping?
>>
>> Thanks for providing insight into the Intel inner workings.  That's why I am
>> asking.
> 
> Yes, sorry, I'm afraid I don't have any good solution for Intel, at
> least not anything similar to what you propose to do on AMD-Vi.  I
> guess we could take a partial solution for AMD-Vi only, but it's
> sub-optimal from Xen perspective to have a piece of hardware working
> fine on AMD and not on Intel.

I only need AMD to work ;)

But yeah, I thought I should make an effort to get both working.

>>>>
>>>> For vPCI, the guest MSI data is available at the time of initial MSI
>>>> setup, but that is not the case for HVM.  With HVM, the initial MSI
>>>> setup is done when PHYSDEVOP_map_pirq is called, but the guest vector is
>>>> only available later when XEN_DOMCTL_bind_pt_irq is called.  In that
>>>> case, we need to tear down and create a new IRTE.  This later location
>>>> can also handle vPCI.
>>>>
>>>> Add pirq_guest_bind_gvec to plumb down the gvec without modifying all
>>>> call sites.  Use msi_desc->gvec to pass through the desired value.
>>>
>>> So basically the solution is to use the guest selected MSI vector as
>>> the interrupt remapping table index, as then the guest can use the MSI
>>> data and address fields without requiring Xen translation.
>>>
>>> What about the guest using the same vector across multiple vCPUs?  So
>>> MSI entries having the same vector field, but different target
>>> destination CPUs?  That won't work correctly as all those MSIs will
>>> attempt to use the same IRTE?
> 
> I think you will also need to add some extra checks to ensure that
> when this quirk is active the guest will always set APIC ID 0 as the
> interrupt destination for all MSI entries for the affected device, so
> that there cannot be vector overlap between CPUs.  Otherwise the quirk
> won't work as expected.

Ok.

>>>> e.g. Replace amd_iommu_perdev_intremap with something generic.
>>>>
>>>> The ath11k device supports and tries to enable 32 MSIs.  Linux in PVH
>>>> dom0 and HVM domU fails enabling 32 and falls back to just 1, so that is
>>>> all that has been tested.
>>>
>>> DYK why it fails to enable 32?
>>
>> Not exactly - someone else had the card.  msi_capability_init() failed. If
>> it ends up in arch_setup_msi_irqs(), only 1 MSI is supported.  But precisely
>> where the multiple nvecs was denied was not tracked down.
> 
> Does it also fail on native?  I'm mostly asking because it would be
> good to get to the bottom of this, so that we don't come up with a
> partial solution that will break if multi-msi is used later in Linux.

My understanding is native and PV dom0 work with 32, and it's Linux 
deciding not to use multiple MSI.

It might be this:
static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
{
         int irq, pirq;
         struct msi_desc *msidesc;
         struct msi_msg msg;

         if (type == PCI_CAP_ID_MSI && nvec > 1)
                 return 1;

I'll have to look into this more.


>>>> +            new_remap_index = msi_desc->gvec;
>>>> +        }
>>>> +
>>>> +        if ( new_remap_index && new_remap_index != msi_desc->remap_index &&
>>>> +             msi_desc->remap_index != -1 )
>>>> +        {
>>>> +            /* Clear any existing entries */
>>>> +            update_intremap_entry_from_msi_msg(iommu, bdf, nr,
>>>> +                                               &msi_desc->remap_index,
>>>> +                                               NULL, NULL);
>>>
>>> Why do you need to clear any entries?  This will cause a window where
>>> MSI entries targeting this IRTEs to generate faults because the
>>> entries are not setup.
>>>
>>> Just re-use them, update_intremap_entry_from_msi_msg() will update the
>>> IRTE atomically so that there's no window where the entries would be
>>> invalid, and thus no faults will be generated.
>>
>> I see your point about the window.  I was trying to keep it clean as
>> different indices get populated.  Initially, IRT[0..n-1] is populated.
> 
> Hm, I see.  For this specific use-case you are changing the IRT index
> when the guest updates the MSI vector.  Tearing down of the old
> entries would better be done _after_ the MSI entry has been updated,
> so that at all times the pointed IRTE is valid.
> 
>> Later, when the gvec is available, we want IRT[gvec..gvec+n-1] populated.  I
>> guess the new gvec ones could be added, and then 0...gvec-1 removed.  Or
>> don't bother?
> 
> Indeed, that would be a better approach, as then the IRTE would always
> be valid.
> 
> In fact you could possibly leave the old IRTE entries as-is, they
> would be unhooked from any MSI entry, and if re-used they would be
> setup correctly.  For this specific quirk where vector == IRT index
> there's never the need to search for a free IRTE, as the guest set
> vector will dictate which IRTE to use.
> 
> I guess it would be nice to attempt to keep the inuse IRTE bitmap in
> sync if possible.
> 
>> I considered leaving IRTE[0] and adding IRTE[gvec].  I think that could
>> work, but would be more hacky.
>>
>> I was trying to keep the irte accounting bitmap correct, but it doesn't
>> really matter for per-device IRT.
> 
> Yes, that's my thinking too.  If you can move the call to teardown the
> old IRTE after the new one has been setup and the MSI entry has been
> updated that would be the best approach IMO.

Ok.

Thanks,
Jason
Roger Pau Monné March 4, 2025, 10:23 a.m. UTC | #9
On Fri, Feb 28, 2025 at 03:25:52PM -0500, Jason Andryuk wrote:
> On 2025-02-28 04:36, Roger Pau Monné wrote:
> > On Thu, Feb 27, 2025 at 01:28:11PM -0500, Jason Andryuk wrote:
> > > On 2025-02-27 05:23, Roger Pau Monné wrote:
> > > > On Wed, Feb 26, 2025 at 04:11:25PM -0500, Jason Andryuk wrote:
> > > > > To work around this, we can, for per-device IRTs, program the hardware
> > > > > to use the guest data & associated IRTE.  The address doesn't matter
> > > > > since the IRTE handles that, and the Xen address & vector can be used as
> > > > > expected.
> > > > 
> > > > All this works on AMD because when interrupt remapping is enabled all
> > > > MSIs are handled by the remapping table, while on Intel there's still
> > > > a bit in the MSI address field to signal whether the MSI is using a
> > > > remapping entry, or is using the "compatibility" format (iow: no
> > > > remapping).
> > > 
> > > So, on Intel, if the guest hands the device the MSI address, it can decide
> > > to bypass remapping?
> > > 
> > > Thanks for providing insight into the Intel inner workings.  That's why I am
> > > asking.
> > 
> > Yes, sorry, I'm afraid I don't have any good solution for Intel, at
> > least not anything similar to what you propose to do on AMD-Vi.  I
> > guess we could take a partial solution for AMD-Vi only, but it's
> > sub-optimal from Xen perspective to have a piece of hardware working
> > fine on AMD and not on Intel.
> 
> I only need AMD to work ;)
> 
> But yeah, I thought I should make an effort to get both working.

Kind of tangential to this approach.  Do you know which register(s)
are used to store the non-architectural MSI address and data fields?

I'm wondering if it simply would be easier to introduce a quirk for
this device in vPCI (and possibly QEMU?) that intercepts writes to the
out of band MSI registers.  That should work for both Intel and AMD,
but would have the side effect that Xen would need to intercept
accesses to at least a full page, and possibly forward accesses to
adjacent registers.

> > > > > e.g. Replace amd_iommu_perdev_intremap with something generic.
> > > > > 
> > > > > The ath11k device supports and tries to enable 32 MSIs.  Linux in PVH
> > > > > dom0 and HVM domU fails enabling 32 and falls back to just 1, so that is
> > > > > all that has been tested.
> > > > 
> > > > DYK why it fails to enable 32?
> > > 
> > > Not exactly - someone else had the card.  msi_capability_init() failed. If
> > > it ends up in arch_setup_msi_irqs(), only 1 MSI is supported.  But precisely
> > > where the multiple nvecs was denied was not tracked down.
> > 
> > Does it also fail on native?  I'm mostly asking because it would be
> > good to get to the bottom of this, so that we don't come up with a
> > partial solution that will break if multi-msi is used later in Linux.
> 
> My understanding is native and PV dom0 work with 32, and it's Linux deciding
> not to use multiple MSI.
> 
> It might be this:
> static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
> {
>         int irq, pirq;
>         struct msi_desc *msidesc;
>         struct msi_msg msg;
> 
>         if (type == PCI_CAP_ID_MSI && nvec > 1)
>                 return 1;
> 
> I'll have to look into this more.

That shouldn't apply to PVH because it never exposes
XENFEAT_hvm_pirqs, and I would expect xen_hvm_setup_msi_irqs() to not
get used (otherwise we have a bug somewhere).

Thanks, Roger.
Jason Andryuk March 4, 2025, 3:15 p.m. UTC | #10
On 2025-03-04 05:23, Roger Pau Monné wrote:
> On Fri, Feb 28, 2025 at 03:25:52PM -0500, Jason Andryuk wrote:
>> On 2025-02-28 04:36, Roger Pau Monné wrote:
>>> On Thu, Feb 27, 2025 at 01:28:11PM -0500, Jason Andryuk wrote:
>>>> On 2025-02-27 05:23, Roger Pau Monné wrote:
>>>>> On Wed, Feb 26, 2025 at 04:11:25PM -0500, Jason Andryuk wrote:
>>>>>> To work around this, we can, for per-device IRTs, program the hardware
>>>>>> to use the guest data & associated IRTE.  The address doesn't matter
>>>>>> since the IRTE handles that, and the Xen address & vector can be used as
>>>>>> expected.
>>>>>
>>>>> All this works on AMD because when interrupt remapping is enabled all
>>>>> MSIs are handled by the remapping table, while on Intel there's still
>>>>> a bit in the MSI address field to signal whether the MSI is using a
>>>>> remapping entry, or is using the "compatibility" format (iow: no
>>>>> remapping).
>>>>
>>>> So, on Intel, if the guest hands the device the MSI address, it can decide
>>>> to bypass remapping?
>>>>
>>>> Thanks for providing insight into the Intel inner workings.  That's why I am
>>>> asking.
>>>
>>> Yes, sorry, I'm afraid I don't have any good solution for Intel, at
>>> least not anything similar to what you propose to do on AMD-Vi.  I
>>> guess we could take a partial solution for AMD-Vi only, but it's
>>> sub-optimal from Xen perspective to have a piece of hardware working
>>> fine on AMD and not on Intel.
>>
>> I only need AMD to work ;)
>>
>> But yeah, I thought I should make an effort to get both working.
> 
> Kind of tangential to this approach.  Do you know which register(s)
> are used to store the non-architectural MSI address and data fields?
> 
> I'm wondering if it simply would be easier to introduce a quirk for
> this device in vPCI (and possibly QEMU?) that intercepts writes to the
> out of band MSI registers.  That should work for both Intel and AMD,
> but would have the side effect that Xen would need to intercept
> accesses to at least a full page, and possibly forward accesses to
> adjacent registers.

 From the QEMU part of the vfio hack:
* We therefore come up with a really crude quirk that looks for values
* written to the ATH11K_PCI_WINDOW (defined in Linux driver as starting
* at 0x80000 with an 18-bit mask, ie. 256k) that match the guest MSI
* address.  When found we replace the data with the host physical
* address and set a cookie to match the MSI data write, again replacing
* with the host value and clearing the cookie.

https://lore.kernel.org/ath11k/20240812170045.1584000-1-alex.williamson@redhat.com/

This is inside BAR0, AIUI.  I'm guessing, but I think the driver puts 
them into a command ring, so it's not a fixed location.  The large area, 
and the fact that we won't normally intercept BAR accesses, made me not 
want to pursue this.
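
To illustrate why I shied away from it, the substitution the vfio quirk
does boils down to roughly the following - entirely made-up names, and
nothing here maps onto an existing Xen or QEMU interface; it's just the
shape of the problem:

    /* Hypothetical intercept of 32-bit writes into the 256k window. */
    static void ath11k_window_write(struct win_quirk *q /* made up */,
                                    unsigned int offset, uint32_t val)
    {
        if ( val == q->guest_msi_addr )
        {
            /* Swap in the host MSI address, fix up the data write next. */
            val = q->host_msi_addr;
            q->fixup_data_next = true;
        }
        else if ( q->fixup_data_next && val == q->guest_msi_data )
        {
            val = q->host_msi_data;
            q->fixup_data_next = false;
        }

        writel(val, q->bar0_va + offset);  /* forward to the device */
    }

Doing that from Xen would mean trapping the whole window and forwarding
every access, which is what I wanted to avoid.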

>>>>>> e.g. Replace amd_iommu_perdev_intremap with something generic.
>>>>>>
>>>>>> The ath11k device supports and tries to enable 32 MSIs.  Linux in PVH
>>>>>> dom0 and HVM domU fails enabling 32 and falls back to just 1, so that is
>>>>>> all that has been tested.
>>>>>
>>>>> DYK why it fails to enable 32?
>>>>
>>>> Not exactly - someone else had the card.  msi_capability_init() failed. If
>>>> it ends up in arch_setup_msi_irqs(), only 1 MSI is supported.  But precisely
>>>> where the multiple nvecs was denied was not tracked down.
>>>
>>> Does it also fail on native?  I'm mostly asking because it would be
>>> good to get to the bottom of this, so that we don't come up with a
>>> partial solution that will break if multi-msi is used later in Linux.
>>
>> My understanding is native and PV dom0 work with 32, and it's Linux deciding
>> not to use multiple MSI.
>>
>> It might be this:
>> static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
>> {
>>          int irq, pirq;
>>          struct msi_desc *msidesc;
>>          struct msi_msg msg;
>>
>>          if (type == PCI_CAP_ID_MSI && nvec > 1)
>>                  return 1;
>>
>> I'll have to look into this more.
> 
> That shouldn't apply to PVH because it never exposes
> XENFEAT_hvm_pirqs, and I would expect xen_hvm_setup_msi_irqs() to not
> get used (otherwise we have a bug somewhere).

Okay.  Yeah, this doesn't seem to get called.  I asked internally, and 
no one tracked down precisely why multi-msi is denied.  I still need to 
get around to that.

Thanks,
Jason
Roger Pau Monné March 4, 2025, 4:41 p.m. UTC | #11
On Tue, Mar 04, 2025 at 10:15:21AM -0500, Jason Andryuk wrote:
> On 2025-03-04 05:23, Roger Pau Monné wrote:
> > On Fri, Feb 28, 2025 at 03:25:52PM -0500, Jason Andryuk wrote:
> > > On 2025-02-28 04:36, Roger Pau Monné wrote:
> > > > On Thu, Feb 27, 2025 at 01:28:11PM -0500, Jason Andryuk wrote:
> > > > > On 2025-02-27 05:23, Roger Pau Monné wrote:
> > > > > > On Wed, Feb 26, 2025 at 04:11:25PM -0500, Jason Andryuk wrote:
> > > > > > > To work around this, we can, for per-device IRTs, program the hardware
> > > > > > > to use the guest data & associated IRTE.  The address doesn't matter
> > > > > > > since the IRTE handles that, and the Xen address & vector can be used as
> > > > > > > expected.
> > > > > > 
> > > > > > All this works on AMD because when interrupt remapping is enabled all
> > > > > > MSIs are handled by the remapping table, while on Intel there's still
> > > > > > a bit in the MSI address field to signal whether the MSI is using a
> > > > > > remapping entry, or is using the "compatibility" format (iow: no
> > > > > > remapping).
> > > > > 
> > > > > So, on Intel, if the guest hands the device the MSI address, it can decide
> > > > > to bypass remapping?
> > > > > 
> > > > > Thanks for providing insight into the Intel inner workings.  That's why I am
> > > > > asking.
> > > > 
> > > > Yes, sorry, I'm afraid I don't have any good solution for Intel, at
> > > > least not anything similar to what you propose to do on AMD-Vi.  I
> > > > guess we could take a partial solution for AMD-Vi only, but it's
> > > > sub-optimal from Xen perspective to have a piece of hardware working
> > > > fine on AMD and not on Intel.
> > > 
> > > I only need AMD to work ;)
> > > 
> > > But yeah, I thought I should make an effort to get both working.
> > 
> > Kind of tangential to this approach.  Do you know which register(s)
> > are used to store the non-architectural MSI address and data fields?
> > 
> > I'm wondering if it simply would be easier to introduce a quirk for
> > this device in vPCI (and possibly QEMU?) that intercepts writes to the
> > out of band MSI registers.  That should work for both Intel and AMD,
> > but would have the side effect that Xen would need to intercept
> > accesses to at least a full page, and possibly forward accesses to
> > adjacent registers.
> 
> From the QEMU part of the vfio hack:
> * We therefore come up with a really crude quirk that looks for values
> * written to the ATH11K_PCI_WINDOW (defined in Linux driver as starting
> * at 0x80000 with an 18-bit mask, ie. 256k) that match the guest MSI
> * address.  When found we replace the data with the host physical
> * address and set a cookie to match the MSI data write, again replacing
> * with the host value and clearing the cookie.
> 
> https://lore.kernel.org/ath11k/20240812170045.1584000-1-alex.williamson@redhat.com/
> 
> This is inside BAR0, AIUI.  I'm guessing, but I think the driver puts them
> into a command ring, so it's not a fixed location.  The large area, and
> since we won't normally intercept BAR access, made me not want to pursue
> this.

Oh, I see, it's not a fixed register, but something like a command
queue.  Great, that makes it way worse to deal with.  It would also
imply that Xen would need to possibly map the whole 256k ring in order
to forward requests to the hardware.

I would like for a solution that covers both Intel and AMD, but I
don't have the card myself to test or develop any of those solutions,
so I think having a solution that works on AMD is likely better than
having no solution at all.  Whoever wants to use the card on Intel
will have to come up with a solution there.

Thanks, Roger.
Jan Beulich March 5, 2025, 8:09 a.m. UTC | #12
On 27.02.2025 17:49, Jason Andryuk wrote:
> On 2025-02-27 03:54, Jan Beulich wrote:
>> On 26.02.2025 22:11, Jason Andryuk wrote:
>>> Signed-off-by: Xenia Ragiadakou <xenia.ragiadakou@amd.com>
>>> Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>
>>
>> Just to clarify: Who's the original patch author? The common expectation
>> is that the first S-o-b: matches From:.
> 
> I took Xenia's changes to xen/drivers/passthrough/pci.c and 
> xen/include/xen/pci.h from an earlier patch and re-used them.  I wrote 
> the rest, so I put myself in the From: line.

Unusual arrangements of tags typically call for some clarification in ...

>>> ---

... the post-commit-message area. In the case here the question arises
whether a different tag (Co-Developed-by:?) might not be better.

>>> --- a/xen/drivers/passthrough/amd/iommu_intr.c
>>> +++ b/xen/drivers/passthrough/amd/iommu_intr.c
>>> @@ -543,6 +543,31 @@ int cf_check amd_iommu_msi_msg_update_ire(
>>>       if ( !msg )
>>>           return 0;
>>>   
>>> +    if ( pdev->gvec_as_irte_idx && amd_iommu_perdev_intremap )
>>> +    {
>>> +        int new_remap_index = 0;
>>> +        if ( msi_desc->gvec )
>>> +        {
>>> +            printk("%pp: gvec remap_index %#x -> %#x\n", &pdev->sbdf,
>>> +                   msi_desc->remap_index, msi_desc->gvec);
>>> +            new_remap_index = msi_desc->gvec;
>>> +        }
>>> +
>>> +        if ( new_remap_index && new_remap_index != msi_desc->remap_index &&
>>> +             msi_desc->remap_index != -1 )
>>> +        {
>>> +            /* Clear any existing entries */
>>> +            update_intremap_entry_from_msi_msg(iommu, bdf, nr,
>>> +                                               &msi_desc->remap_index,
>>> +                                               NULL, NULL);
>>> +
>>> +            for ( i = 0; i < nr; ++i )
>>> +                msi_desc[i].remap_index = -1;
>>> +
>>> +            msi_desc->remap_index = new_remap_index;
>>
>> You zap nr entries, and then set only 1? Doesn't the zapping loop need to
>> instead be a setting one? Perhaps with a check up front that the last value
>> used will still fit in 8 bits? Or else make applying the quirk conditional
>> upon nr == 1?
> 
> The code below here sets all `nr` entries on success:
> 
>      rc = update_intremap_entry_from_msi_msg(iommu, bdf, nr,
>                                              &msi_desc->remap_index,
>                                              msg, &data);
>      if ( !rc )
>      {
>          for ( i = 1; i < nr; ++i )
>              msi_desc[i].remap_index = msi_desc->remap_index + i;
>          msg->data = data;
>      }
> 
>      return rc;

Ah, yes, I see now how this matches other behavior in the function.

> Maybe all the remap_index setting should be moved into 
> update_intremap_entry_from_msi_msg()?

That would require passing in msi_desc (or making assumptions on the
passed in "int *remap_index"), neither of which looks very attractive
to me.

Jan
Jan Beulich March 5, 2025, 8:41 a.m. UTC | #13
On 27.02.2025 19:28, Jason Andryuk wrote:
> On 2025-02-27 05:23, Roger Pau Monné wrote:
>> On Wed, Feb 26, 2025 at 04:11:25PM -0500, Jason Andryuk wrote:
>> All this works on AMD because when interrupt remapping is enabled all
>> MSIs are handled by the remapping table, while on Intel there's still
>> a bit in the MSI address field to signal whether the MSI is using a
>> remapping entry, or is using the "compatibility" format (iow: no
>> remapping).
> 
> So, on Intel, if the guest hands the device the MSI address, it can 
> decide to bypass remapping?

While the answer is "yes" here, the result - aiui - would be an insecure
configuration. So you can only do this for fully trusted domains.

Jan
Jason Andryuk March 13, 2025, 3:30 p.m. UTC | #14
On 2025-02-27 05:23, Roger Pau Monné wrote:
> On Wed, Feb 26, 2025 at 04:11:25PM -0500, Jason Andryuk wrote:
>>
>> The ath11k device supports and tries to enable 32 MSIs.  Linux in PVH
>> dom0 and HVM domU fails enabling 32 and falls back to just 1, so that is
>> all that has been tested.
> 
> DYK why it fails to enable 32?

In Linux msi_capability_init()

         /* Reject multi-MSI early on irq domain enabled architectures */
         if (nvec > 1 && !pci_msi_domain_supports(dev, MSI_FLAG_MULTI_PCI_MSI,
                                                   ALLOW_LEGACY))
                 return 1;

MSI_FLAG_MULTI_PCI_MSI is only set for AMD and Intel interrupt 
remapping, and Xen PVH and HVM don't have either of those.  They are 
using "VECTOR", so this check fails.

Regards,
Jason
Roger Pau Monné March 13, 2025, 3:47 p.m. UTC | #15
On Thu, Mar 13, 2025 at 11:30:28AM -0400, Jason Andryuk wrote:
> On 2025-02-27 05:23, Roger Pau Monné wrote:
> > On Wed, Feb 26, 2025 at 04:11:25PM -0500, Jason Andryuk wrote:
> > > 
> > > The ath11k device supports and tries to enable 32 MSIs.  Linux in PVH
> > > dom0 and HVM domU fails enabling 32 and falls back to just 1, so that is
> > > all that has been tested.
> > 
> > DYK why it fails to enable 32?
> 
> In Linux msi_capability_init()
> 
>         /* Reject multi-MSI early on irq domain enabled architectures */
>         if (nvec > 1 && !pci_msi_domain_supports(dev,
> MSI_FLAG_MULTI_PCI_MSI, ALLOW_LEGACY))
>                 return 1;
> 
> MSI_FLAG_MULTI_PCI_MSI is only set for AMD and Intel interrupt remapping,
> and Xen PVH and HVM don't have either of those.  They are using "VECTOR", so
> this check fails.

Oh, interesting.  So classic PV MSI domain supports
MSI_FLAG_MULTI_PCI_MSI, even when no IOMMU is exposed there either.

Thanks, so it's nothing specific to Xen, just how Linux works.

Roger.
Andrew Cooper March 13, 2025, 4:02 p.m. UTC | #16
On 13/03/2025 3:47 pm, Roger Pau Monné wrote:
> On Thu, Mar 13, 2025 at 11:30:28AM -0400, Jason Andryuk wrote:
>> On 2025-02-27 05:23, Roger Pau Monné wrote:
>>> On Wed, Feb 26, 2025 at 04:11:25PM -0500, Jason Andryuk wrote:
>>>> The ath11k device supports and tries to enable 32 MSIs.  Linux in PVH
>>>> dom0 and HVM domU fails enabling 32 and falls back to just 1, so that is
>>>> all that has been tested.
>>> DYK why it fails to enable 32?
>> In Linux msi_capability_init()
>>
>>         /* Reject multi-MSI early on irq domain enabled architectures */
>>         if (nvec > 1 && !pci_msi_domain_supports(dev,
>> MSI_FLAG_MULTI_PCI_MSI, ALLOW_LEGACY))
>>                 return 1;
>>
>> MSI_FLAG_MULTI_PCI_MSI is only set for AMD and Intel interrupt remapping,
>> and Xen PVH and HVM don't have either of those.  They are using "VECTOR", so
>> this check fails.
> Oh, interesting.  So classic PV MSI domain supports
> MSI_FLAG_MULTI_PCI_MSI, even when no IOMMU is exposed there either.
>
> Thanks, so it's nothing specific to Xen, just how Linux works.

This is something which TGLX and I have discussed in the past.  It is a
mistake for any x86 system to do MSI multi-message without an IOMMU.

MSI multi-message gets you a power-of-2, aligned, block of vectors, up
to a maximum of 32, which must always target the same CPU.

The LAPIC prioritisation is on groups of 16, aligned, vectors.

If MSI has 16 or fewer vectors, then any interrupt causes all others to
be blocked owing to LAPIC behaviour.

With 32 vectors, you can get two vectors (one from the first 16, one
from the second 16) where the higher vector can interrupt the lower
one.  And you pay 32 vectors for this.
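
For anyone following along, the priority class is simply bits 7:4 of
the vector, so the arithmetic looks like this (sketch, not tied to any
particular codebase):

    /*
     * LAPIC priority class = vector[7:4].  A vector can only preempt the
     * one currently in service if its class is strictly higher.
     */
    static inline unsigned int prio_class(uint8_t vector)
    {
        return vector >> 4;
    }

    /*
     * An aligned block of <=16 multi-MSI vectors, e.g. 0x40-0x4f, sits in
     * a single class, so no message from the device can interrupt another.
     * A 32-vector block, e.g. 0x40-0x5f, spans two classes, and only the
     * upper half can preempt the lower half.
     */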

With the IOMMU, every message gets a controllable CPU and controllable
priority, because they come from the IRTE, not the device.

Removing Multi-MSI support makes vector allocation much easier because
you never need to allocate/move blocks.

~Andrew
Jan Beulich March 13, 2025, 4:07 p.m. UTC | #17
On 13.03.2025 17:02, Andrew Cooper wrote:
> On 13/03/2025 3:47 pm, Roger Pau Monné wrote:
>> On Thu, Mar 13, 2025 at 11:30:28AM -0400, Jason Andryuk wrote:
>>> On 2025-02-27 05:23, Roger Pau Monné wrote:
>>>> On Wed, Feb 26, 2025 at 04:11:25PM -0500, Jason Andryuk wrote:
>>>>> The ath11k device supports and tries to enable 32 MSIs.  Linux in PVH
>>>>> dom0 and HVM domU fails enabling 32 and falls back to just 1, so that is
>>>>> all that has been tested.
>>>> DYK why it fails to enable 32?
>>> In Linux msi_capability_init()
>>>
>>>         /* Reject multi-MSI early on irq domain enabled architectures */
>>>         if (nvec > 1 && !pci_msi_domain_supports(dev,
>>> MSI_FLAG_MULTI_PCI_MSI, ALLOW_LEGACY))
>>>                 return 1;
>>>
>>> MSI_FLAG_MULTI_PCI_MSI is only set for AMD and Intel interrupt remapping,
>>> and Xen PVH and HVM don't have either of those.  They are using "VECTOR", so
>>> this check fails.
>> Oh, interesting.  So classic PV MSI domain supports
>> MSI_FLAG_MULTI_PCI_MSI, even when no IOMMU is exposed there either.
>>
>> Thanks, so it's nothing specific to Xen, just how Linux works.
> 
> This is something which TGLX and I have discussed in the past.  It is a
> mistake for any x86 system to do MSI multi-message without an IOMMU.

Well, with PVH there always will be an IOMMU, just that Linux can't see
it. Even with PV it should be the hypervisor that determines whether
multi-message MSI is possible. Hence how the classic (non-pvops) kernel
had worked in this regard.

Jan

> MSI multi-message gets you a power-of-2, aligned, block of vectors, up
> to a maximum of 32, which must always target the same CPU.
> 
> The LAPIC prioritisation is on groups of 16, aligned, vectors.
> 
> If MSI has 16 or fewer vectors, then any interrupt causes all others to
> be blocked owing to LAPIC behaviour.
> 
> With 32 vectors, you can get two vectors (one from the first 16, one
> from the second 16) where the higher vector can interrupt the lower
> one.  And you pay 32 vectors for this.
> 
> With the IOMMU, every message gets a controllable CPU and controllable
> priority, because they come from the IRTE, not the device.
> 
> Removing Multi-MSI support makes vector allocation much easier because
> you never need to allocate/move blocks.
> 
> ~Andrew
Andrew Cooper March 13, 2025, 4:24 p.m. UTC | #18
On 13/03/2025 4:07 pm, Jan Beulich wrote:
> On 13.03.2025 17:02, Andrew Cooper wrote:
>> On 13/03/2025 3:47 pm, Roger Pau Monné wrote:
>>> On Thu, Mar 13, 2025 at 11:30:28AM -0400, Jason Andryuk wrote:
>>>> On 2025-02-27 05:23, Roger Pau Monné wrote:
>>>>> On Wed, Feb 26, 2025 at 04:11:25PM -0500, Jason Andryuk wrote:
>>>>>> The ath11k device supports and tries to enable 32 MSIs.  Linux in PVH
>>>>>> dom0 and HVM domU fails enabling 32 and falls back to just 1, so that is
>>>>>> all that has been tested.
>>>>> DYK why it fails to enable 32?
>>>> In Linux msi_capability_init()
>>>>
>>>>         /* Reject multi-MSI early on irq domain enabled architectures */
>>>>         if (nvec > 1 && !pci_msi_domain_supports(dev,
>>>> MSI_FLAG_MULTI_PCI_MSI, ALLOW_LEGACY))
>>>>                 return 1;
>>>>
>>>> MSI_FLAG_MULTI_PCI_MSI is only set for AMD and Intel interrupt remapping,
>>>> and Xen PVH and HVM don't have either of those.  They are using "VECTOR", so
>>>> this check fails.
>>> Oh, interesting.  So classic PV MSI domain supports
>>> MSI_FLAG_MULTI_PCI_MSI, even when no IOMMU is exposed there either.
>>>
>>> Thanks, so it's nothing specific to Xen, just how Linux works.
>> This is something which TGLX and I have discussed in the past.  It is a
>> mistake for any x86 system to do MSI multi-message without an IOMMU.
> Well, with PVH there always will be an IOMMU, just that Linux can't see
> it. Even with PV it should be the hypervisor to determine whether multi-
> message MSI is possible. Hence how the classic (non-pvops) kernel had
> worked in this regard.

Xen should hide (and instruct Qemu to hide) multi-message on non-IOMMU
hardware.  The result of "supporting" them on non-IOMMU hardware is
worse than making the driver run in single MSI mode.

While in theory Xen could expose "I've got an IOMMU so you can do multi
message" to guests, this isn't trivial for the guest to cope with.  Even
if the guest knows multi-message is safe, it still can't express this to
Xen via a multi-message shaped interface.

With Teddy's PV-IOMMU, this problem ought to go away because now the
guest can see the IOMMU.

~Andrew
Jason Andryuk March 13, 2025, 6:03 p.m. UTC | #19
On 2025-03-13 12:24, Andrew Cooper wrote:
> On 13/03/2025 4:07 pm, Jan Beulich wrote:
>> On 13.03.2025 17:02, Andrew Cooper wrote:
>>> On 13/03/2025 3:47 pm, Roger Pau Monné wrote:
>>>> On Thu, Mar 13, 2025 at 11:30:28AM -0400, Jason Andryuk wrote:
>>>>> On 2025-02-27 05:23, Roger Pau Monné wrote:
>>>>>> On Wed, Feb 26, 2025 at 04:11:25PM -0500, Jason Andryuk wrote:
>>>>>>> The ath11k device supports and tries to enable 32 MSIs.  Linux in PVH
>>>>>>> dom0 and HVM domU fails enabling 32 and falls back to just 1, so that is
>>>>>>> all that has been tested.
>>>>>> DYK why it fails to enable 32?
>>>>> In Linux msi_capability_init()
>>>>>
>>>>>          /* Reject multi-MSI early on irq domain enabled architectures */
>>>>>          if (nvec > 1 && !pci_msi_domain_supports(dev,
>>>>> MSI_FLAG_MULTI_PCI_MSI, ALLOW_LEGACY))
>>>>>                  return 1;
>>>>>
>>>>> MSI_FLAG_MULTI_PCI_MSI is only set for AMD and Intel interrupt remapping,
>>>>> and Xen PVH and HVM don't have either of those.  They are using "VECTOR", so
>>>>> this check fails.
>>>> Oh, interesting.  So classic PV MSI domain supports
>>>> MSI_FLAG_MULTI_PCI_MSI, even when no IOMMU is exposed there either.

I was told PV dom0 used 32 MSIs, but I don't readily see 
MSI_FLAG_MULTI_PCI_MSI set anywhere.  I guess it gets to 
xen_initdom_setup_msi_irqs() which supports multiple nvecs.

>>>>
>>>> Thanks, so it's nothing specific to Xen, just how Linux works.
>>> This is something which TGLX and I have discussed in the past.  It is a
>>> mistake for any x86 system to do MSI multi-message without an IOMMU.
>> Well, with PVH there always will be an IOMMU, just that Linux can't see
>> it. Even with PV it should be the hypervisor to determine whether multi-
>> message MSI is possible. Hence how the classic (non-pvops) kernel had
>> worked in this regard.
> 
> Xen should hide (and instruct Qemu to hide) multi-message on non-IOMMU
> hardware.  The result of "supporting" them on non-IOMMU hardware is
> worse than making the driver run in single MSI mode.

FWIW, QEMU MSI support hardcodes 1 MSI.

Regards,
Jason

Patch

diff --git a/xen/arch/x86/include/asm/msi.h b/xen/arch/x86/include/asm/msi.h
index 378b85ee94..ea1004af14 100644
--- a/xen/arch/x86/include/asm/msi.h
+++ b/xen/arch/x86/include/asm/msi.h
@@ -107,7 +107,8 @@  struct msi_desc {
     } msi_attrib;
 
     bool irte_initialized;
-    uint8_t gvec;            /* guest vector. valid when pi_desc isn't NULL */
+    uint8_t gvec;            /* guest vector. valid when pi_desc isn't NULL or
+                                when pci_dev gvec_as_irte_idx is true */
     const struct pi_desc *pi_desc; /* pointer to posted descriptor */
 
     struct list_head list;
diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index ff3ac832f4..3fc73feaea 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -1600,7 +1600,8 @@  int pirq_shared(struct domain *d, int pirq)
     return shared;
 }
 
-int pirq_guest_bind(struct vcpu *v, struct pirq *pirq, int will_share)
+int pirq_guest_bind_gvec(struct vcpu *v, struct pirq *pirq, int will_share,
+                         uint8_t gvec)
 {
     struct irq_desc         *desc;
     irq_guest_action_t *action, *newaction = NULL;
@@ -1674,7 +1675,12 @@  int pirq_guest_bind(struct vcpu *v, struct pirq *pirq, int will_share)
                                           &cpu_online_map) )
                 affinity = desc->affinity;
             if ( affinity )
+            {
+                if ( gvec && desc->msi_desc )
+                    desc->msi_desc->gvec = gvec;
+
                 desc->handler->set_affinity(desc, affinity);
+            }
         }
 
         desc->status &= ~IRQ_DISABLED;
@@ -1730,6 +1736,11 @@  int pirq_guest_bind(struct vcpu *v, struct pirq *pirq, int will_share)
     return rc;
 }
 
+int pirq_guest_bind(struct vcpu *v, struct pirq *pirq, int will_share)
+{
+    return pirq_guest_bind_gvec(v, pirq, will_share, 0);
+}
+
 static irq_guest_action_t *__pirq_guest_unbind(
     struct domain *d, struct pirq *pirq, struct irq_desc *desc)
 {
diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index bf5b71822e..cef2987038 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -487,6 +487,7 @@  static struct msi_desc *alloc_msi_entry(unsigned int nr)
         entry[nr].remap_index = -1;
         entry[nr].pi_desc = NULL;
         entry[nr].irte_initialized = false;
+        entry[nr].gvec = 0;
     }
 
     return entry;
diff --git a/xen/drivers/passthrough/amd/iommu_intr.c b/xen/drivers/passthrough/amd/iommu_intr.c
index c0273059cb..2e228d2c21 100644
--- a/xen/drivers/passthrough/amd/iommu_intr.c
+++ b/xen/drivers/passthrough/amd/iommu_intr.c
@@ -543,6 +543,31 @@  int cf_check amd_iommu_msi_msg_update_ire(
     if ( !msg )
         return 0;
 
+    if ( pdev->gvec_as_irte_idx && amd_iommu_perdev_intremap )
+    {
+        int new_remap_index = 0;
+        if ( msi_desc->gvec )
+        {
+            printk("%pp: gvec remap_index %#x -> %#x\n", &pdev->sbdf,
+                   msi_desc->remap_index, msi_desc->gvec);
+            new_remap_index = msi_desc->gvec;
+        }
+
+        if ( new_remap_index && new_remap_index != msi_desc->remap_index &&
+             msi_desc->remap_index != -1 )
+        {
+            /* Clear any existing entries */
+            update_intremap_entry_from_msi_msg(iommu, bdf, nr,
+                                               &msi_desc->remap_index,
+                                               NULL, NULL);
+
+            for ( i = 0; i < nr; ++i )
+                msi_desc[i].remap_index = -1;
+
+            msi_desc->remap_index = new_remap_index;
+        }
+    }
+
     rc = update_intremap_entry_from_msi_msg(iommu, bdf, nr,
                                             &msi_desc->remap_index,
                                             msg, &data);
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index e1a09344df..7031aedb94 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -306,6 +306,17 @@  static void apply_quirks(struct pci_dev *pdev)
         { PCI_VENDOR_ID_INTEL, 0x6fa0 },
         { PCI_VENDOR_ID_INTEL, 0x6fc0 },
     };
+    static const struct {
+        uint16_t vendor, device;
+    } hide_irt[] = {
+#define PCI_VENDOR_ID_QCOM		0x17cb
+#define QCA6390_DEVICE_ID		0x1101
+#define QCN9074_DEVICE_ID		0x1104
+#define WCN6855_DEVICE_ID		0x1103
+        { PCI_VENDOR_ID_QCOM, QCA6390_DEVICE_ID },
+        { PCI_VENDOR_ID_QCOM, QCN9074_DEVICE_ID },
+        { PCI_VENDOR_ID_QCOM, WCN6855_DEVICE_ID },
+    };
     unsigned int i;
 
     for ( i = 0; i < ARRAY_SIZE(ignore_bars); i++)
@@ -316,6 +327,19 @@  static void apply_quirks(struct pci_dev *pdev)
              * from trying to size the BARs or add handlers to trap accesses.
              */
             pdev->ignore_bars = true;
+
+    for ( i = 0; i < ARRAY_SIZE(hide_irt); i++)
+    {
+        if ( vendor == hide_irt[i].vendor &&
+             device == hide_irt[i].device )
+        {
+            pdev->gvec_as_irte_idx = true;
+            printk("%pp %04x:%04x quirk gvec as intr remap index\n",
+                   &pdev->sbdf, hide_irt[i].vendor, hide_irt[i].device);
+            if ( !amd_iommu_perdev_intremap )
+                printk("gvec quirk requires per-device intr remap!\n");
+        }
+    }
 }
 
 static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 bus, u8 devfn)
diff --git a/xen/drivers/passthrough/x86/hvm.c b/xen/drivers/passthrough/x86/hvm.c
index f5faff7a49..5d17f93b06 100644
--- a/xen/drivers/passthrough/x86/hvm.c
+++ b/xen/drivers/passthrough/x86/hvm.c
@@ -307,7 +307,8 @@  int pt_irq_create_bind(
              */
             pirq_dpci->dom = d;
             /* bind after hvm_irq_dpci is setup to avoid race with irq handler*/
-            rc = pirq_guest_bind(d->vcpu[0], info, 0);
+            rc = pirq_guest_bind_gvec(d->vcpu[0], info, 0,
+                                      pirq_dpci->gmsi.gvec);
             if ( rc == 0 && pt_irq_bind->u.msi.gtable )
             {
                 rc = msixtbl_pt_register(d, info, pt_irq_bind->u.msi.gtable);
diff --git a/xen/include/xen/irq.h b/xen/include/xen/irq.h
index 95034c0d6b..96109d6ebe 100644
--- a/xen/include/xen/irq.h
+++ b/xen/include/xen/irq.h
@@ -192,6 +192,8 @@  extern void pirq_guest_eoi(struct pirq *pirq);
 extern void desc_guest_eoi(struct irq_desc *desc, struct pirq *pirq);
 extern int pirq_guest_unmask(struct domain *d);
 extern int pirq_guest_bind(struct vcpu *v, struct pirq *pirq, int will_share);
+extern int pirq_guest_bind_gvec(struct vcpu *v, struct pirq *pirq,
+                                int will_share, uint8_t gvec);
 extern void pirq_guest_unbind(struct domain *d, struct pirq *pirq);
 extern void pirq_set_affinity(struct domain *d, int pirq,
                               const cpumask_t *mask);
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 4f12bcf089..14afd78f75 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -127,6 +127,8 @@  struct pci_dev {
     /* Device with errata, ignore the BARs. */
     bool ignore_bars;
 
+    bool gvec_as_irte_idx;
+
     /* Device misbehaving, prevent assigning it to guests. */
     bool broken;