diff mbox series

[5/5] x86/kvm: Add KVM_FEATURE_MSI_EXT_DEST_ID

Message ID 20201007122046.1113577-5-dwmw2@infradead.org (mailing list archive)
State New, archived
Headers show
Series Fix x2apic enablement and allow up to 32768 CPUs without IR where supported | expand

Commit Message

David Woodhouse Oct. 7, 2020, 12:20 p.m. UTC
From: David Woodhouse <dwmw@amazon.co.uk>

This allows the host to indicate that IOAPIC and MSI emulation supports
15-bit destination IDs, allowing up to 32768 CPUs without interrupt
remapping.

cf. https://patchwork.kernel.org/patch/11816693/ for qemu

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
---
 Documentation/virt/kvm/cpuid.rst     | 4 ++++
 arch/x86/include/uapi/asm/kvm_para.h | 1 +
 arch/x86/kernel/kvm.c                | 6 ++++++
 3 files changed, 11 insertions(+)

Comments

Thomas Gleixner Oct. 8, 2020, 12:05 p.m. UTC | #1
On Wed, Oct 07 2020 at 13:20, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> This allows the host to indicate that IOAPIC and MSI emulation supports
> 15-bit destination IDs, allowing up to 32768 CPUs without interrupt
> remapping.
>
> cf. https://patchwork.kernel.org/patch/11816693/ for qemu
>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> Acked-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  Documentation/virt/kvm/cpuid.rst     | 4 ++++
>  arch/x86/include/uapi/asm/kvm_para.h | 1 +
>  arch/x86/kernel/kvm.c                | 6 ++++++
>  3 files changed, 11 insertions(+)
>
> diff --git a/Documentation/virt/kvm/cpuid.rst b/Documentation/virt/kvm/cpuid.rst
> index a7dff9186bed..1726b5925d2b 100644
> --- a/Documentation/virt/kvm/cpuid.rst
> +++ b/Documentation/virt/kvm/cpuid.rst
> @@ -92,6 +92,10 @@ KVM_FEATURE_ASYNC_PF_INT          14          guest checks this feature bit
>                                                async pf acknowledgment msr
>                                                0x4b564d07.
>  
> +KVM_FEATURE_MSI_EXT_DEST_ID       15          guest checks this feature bit
> +                                              before using extended destination
> +                                              ID bits in MSI address
> bits 11-5.

Why MSI_EXT_DEST_ID? It's enabling that for MSI and IO/APIC. The
underlying mechanism might be the same, but APIC_EXT_DEST_ID is more
general and then you might also make the explanation of that bit match
the changelog.

Thanks,

        tglx
David Woodhouse Oct. 8, 2020, 12:55 p.m. UTC | #2
On Thu, 2020-10-08 at 14:05 +0200, Thomas Gleixner wrote:
> Why MSI_EXT_DEST_ID? It's enabling that for MSI and IO/APIC. The
> underlying mechanism might be the same, but APIC_EXT_DEST_ID is more
> general and then you might also make the explanation of that bit
> match the changelog.

It's enabling it for *everything* that generates MSI cycles — which
includes IOAPIC, HPET, and MSI-capable PCI devices. Hell, and anything
else which feels like generating a physical address cycle to 0xFEExxxxx
addresses.

Again, the IOAPIC is just a device for turning pin-based interrupts
into MSIs. Bits 19-4 of the address it writes to are taken directly
from bits 63-48 of the IOAPIC RTE. There's complexity elsewhere but for
*those* bits, It just uses the bits it's given, just like a PCI device
or an HPET would. 

When I implemented this in qemu I didn't even *touch* the IOAPIC
support; it doesn't affect the IOAPIC at all, just like it doesn't
affect the HPET or any of the MSI-capable PCI devices that qemu
emulates. They just put the bits on the bus that they're told to, when
they want to generate an interrupt.

This feature is an MSI feature.

Not an HPET feature.

Not a PCI feature.

Not an IOAPIC feature.

The fact that a *Linux* guest has special-case knowledge in its IOAPIC
driver that duplicates what the MSI message composition does, is not a
good justification for naming the feature bit bizarrely.

In fact I'm really tempted to make Linux's io_apic.c just use
irq_chip_compose_msi_msg() and swizzle the bits out of the message
identically for IR and non-IR alike (modulo the pin hack), and delete
the IR_IO_APIC_route_entry struct completely. 

That also completely removes the ext_dest_id trick from visibility in
io_apic.c. And might avoid further confusion.
David Woodhouse Oct. 8, 2020, 4:08 p.m. UTC | #3
On Thu, 2020-10-08 at 13:55 +0100, David Woodhouse wrote:
> 
> In fact I'm really tempted to make Linux's io_apic.c just use
> irq_chip_compose_msi_msg() and swizzle the bits out of the message
> identically for IR and non-IR alike (modulo the pin hack), and delete
> the IR_IO_APIC_route_entry struct completely. 
> 
> That also completely removes the ext_dest_id trick from visibility in
> io_apic.c. And might avoid further confusion.

That looks a bit like this, FWIW. Turns out it doesn't *entirely*
remove the ext_dest_id trick from io_apic.c because it does still need
to put together an entry manually for the ExtINT hackery and for
restoring boot mode. But it does remove a lot of incestuous ioapic
hackery from IOMMU drivers, and deletes more code than it adds.

It has the additional effect of enabling the WARN_ON for unreachable
APIC IDs in __irq_msi_compose_msg(), for IOAPIC interrupts too.

(We'd want the x86_vector_domain to actually have an MSI compose
function in the !CONFIG_PCI_MSI case if we did this, of course.)

From 2fbc79588d4677ee1cc9df661162fcf1a7da57f0 Mon Sep 17 00:00:00 2001
From: David Woodhouse <dwmw@amazon.co.uk>
Date: Thu, 8 Oct 2020 15:44:42 +0100
Subject: [PATCH 6/5] x86/ioapic: Generate RTE directly from parent irqchip's MSI
 message

The IOAPIC generates an MSI cycle with address/data bits taken from its
Redirection Table Entry in some combination which used to make sense,
but now is just a bunch of bits which get passed through in some
seemingly arbitrary order.

Instead of making IRQ remapping drivers directly frob the IOAPIC RTE,
let them just do their job and generate an MSI message. The bit
swizzling to turn that MSI message into the IOAPIC's RTE is the same in
all cases, since it's a function of the IOAPIC hardware. The IRQ
remappers have no real need to get involved with that.

The only slight caveat is that the IOAPIC is interpreting some of
those fields too, and it does want the 'vector' field to be unique
to make EOI work. The AMD IOMMU happens to put its IRTE index in the
bits that the IOAPIC thinks are the vector field, and accommodates
this requirement by reserving the first 32 indices for the IOAPIC.
The Intel IOMMU doesn't actually use the bits that the IOAPIC thinks
are the vector field, so it fills in the 'pin' value there instead.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/include/asm/hw_irq.h       | 11 +++---
 arch/x86/include/asm/msidef.h       |  2 ++
 arch/x86/kernel/apic/io_apic.c      | 55 ++++++++++++++++++-----------
 drivers/iommu/amd/iommu.c           | 14 --------
 drivers/iommu/hyperv-iommu.c        | 31 ----------------
 drivers/iommu/intel/irq_remapping.c | 19 +++-------
 6 files changed, 46 insertions(+), 86 deletions(-)

diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index a4aeeaace040..aabd8f1b6bb0 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -45,12 +45,11 @@ enum irq_alloc_type {
 };
 
 struct ioapic_alloc_info {
-	int				pin;
-	int				node;
-	u32				trigger : 1;
-	u32				polarity : 1;
-	u32				valid : 1;
-	struct IO_APIC_route_entry	*entry;
+	int		pin;
+	int		node;
+	u32		trigger : 1;
+	u32		polarity : 1;
+	u32		valid : 1;
 };
 
 struct uv_alloc_info {
diff --git a/arch/x86/include/asm/msidef.h b/arch/x86/include/asm/msidef.h
index ee2f8ccc32d0..37c3d2d492c9 100644
--- a/arch/x86/include/asm/msidef.h
+++ b/arch/x86/include/asm/msidef.h
@@ -18,6 +18,7 @@
 #define MSI_DATA_DELIVERY_MODE_SHIFT	8
 #define  MSI_DATA_DELIVERY_FIXED	(0 << MSI_DATA_DELIVERY_MODE_SHIFT)
 #define  MSI_DATA_DELIVERY_LOWPRI	(1 << MSI_DATA_DELIVERY_MODE_SHIFT)
+#define  MSI_DATA_DELIVERY_MODE_MASK	(3 << MSI_DATA_DELIVERY_MODE_SHIFT)
 
 #define MSI_DATA_LEVEL_SHIFT		14
 #define	 MSI_DATA_LEVEL_DEASSERT	(0 << MSI_DATA_LEVEL_SHIFT)
@@ -37,6 +38,7 @@
 #define MSI_ADDR_DEST_MODE_SHIFT	2
 #define  MSI_ADDR_DEST_MODE_PHYSICAL	(0 << MSI_ADDR_DEST_MODE_SHIFT)
 #define	 MSI_ADDR_DEST_MODE_LOGICAL	(1 << MSI_ADDR_DEST_MODE_SHIFT)
+#define  MSI_ADDR_DEST_MODE_MASK	(1 << MSI_DATA_DELIVERY_MODE_SHIFT)
 
 #define MSI_ADDR_REDIRECTION_SHIFT	3
 #define  MSI_ADDR_REDIRECTION_CPU	(0 << MSI_ADDR_REDIRECTION_SHIFT)
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 54f6a029b1d1..ca2da19d5c55 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -48,6 +48,7 @@
 #include <linux/jiffies.h>	/* time_after() */
 #include <linux/slab.h>
 #include <linux/memblock.h>
+#include <linux/msi.h>
 
 #include <asm/irqdomain.h>
 #include <asm/io.h>
@@ -63,6 +64,7 @@
 #include <asm/setup.h>
 #include <asm/irq_remapping.h>
 #include <asm/hw_irq.h>
+#include <asm/msidef.h>
 
 #include <asm/apic.h>
 
@@ -1851,22 +1853,36 @@ static void ioapic_ir_ack_level(struct irq_data *irq_data)
 	eoi_ioapic_pin(data->entry.vector, data);
 }
 
+static void mp_swizzle_msi_dest_bits(struct irq_data *irq_data, void *_entry)
+{
+	struct msi_msg msg;
+	u32 *entry = _entry;
+
+	irq_chip_compose_msi_msg(irq_data, &msg);
+
+	/*
+	 * They're in a bit of a random order for historical reasons, but
+	 * the IO/APIC is just a device for turning interrupt lines into
+	 * MSIs, and various bits of the MSI addr/data are just swizzled
+	 * into/from the bits of Redirection Table Entry.
+	 */
+	entry[0] &= 0xfffff000;
+	entry[0] |= (msg.data & (MSI_DATA_DELIVERY_MODE_MASK |
+				 MSI_DATA_VECTOR_MASK));
+	entry[0] |= (msg.address_lo & MSI_ADDR_DEST_MODE_MASK) << 9;
+
+	entry[1] &= 0xffff;
+	entry[1] |= (msg.address_lo & MSI_ADDR_DEST_ID_MASK) << 12;
+}
+
+
 static void ioapic_configure_entry(struct irq_data *irqd)
 {
 	struct mp_chip_data *mpd = irqd->chip_data;
-	struct irq_cfg *cfg = irqd_cfg(irqd);
 	struct irq_pin_list *entry;
 
-	/*
-	 * Only update when the parent is the vector domain, don't touch it
-	 * if the parent is the remapping domain. Check the installed
-	 * ioapic chip to verify that.
-	 */
-	if (irqd->chip == &ioapic_chip) {
-		mpd->entry.dest = cfg->dest_apicid & 0xff;
-		mpd->entry.virt_ext_dest = cfg->dest_apicid >> 8;
-		mpd->entry.vector = cfg->vector;
-	}
+	mp_swizzle_msi_dest_bits(irqd, &mpd->entry);
+
 	for_each_irq_pin(entry, mpd->irq_2_pin)
 		__ioapic_write_entry(entry->apic, entry->pin, mpd->entry);
 }
@@ -2949,15 +2965,14 @@ static void mp_irqdomain_get_attr(u32 gsi, struct mp_chip_data *data,
 	}
 }
 
-static void mp_setup_entry(struct irq_cfg *cfg, struct mp_chip_data *data,
-			   struct IO_APIC_route_entry *entry)
+static void mp_setup_entry(struct irq_data *irq_data, struct mp_chip_data *data)
 {
+	struct IO_APIC_route_entry *entry = &data->entry;
+
 	memset(entry, 0, sizeof(*entry));
-	entry->delivery_mode = apic->irq_delivery_mode;
-	entry->dest_mode     = apic->irq_dest_mode;
-	entry->dest	     = cfg->dest_apicid & 0xff;
-	entry->virt_ext_dest = cfg->dest_apicid >> 8;
-	entry->vector	     = cfg->vector;
+
+	mp_swizzle_msi_dest_bits(irq_data, entry);
+
 	entry->trigger	     = data->trigger;
 	entry->polarity	     = data->polarity;
 	/*
@@ -2995,7 +3010,6 @@ int mp_irqdomain_alloc(struct irq_domain *domain, unsigned int virq,
 	if (!data)
 		return -ENOMEM;
 
-	info->ioapic.entry = &data->entry;
 	ret = irq_domain_alloc_irqs_parent(domain, virq, nr_irqs, info);
 	if (ret < 0) {
 		kfree(data);
@@ -3013,8 +3027,7 @@ int mp_irqdomain_alloc(struct irq_domain *domain, unsigned int virq,
 	add_pin_to_irq_node(data, ioapic_alloc_attr_node(info), ioapic, pin);
 
 	local_irq_save(flags);
-	if (info->ioapic.entry)
-		mp_setup_entry(cfg, data, info->ioapic.entry);
+	mp_setup_entry(irq_data, data);
 	mp_register_handler(virq, data->trigger);
 	if (virq < nr_legacy_irqs())
 		legacy_pic->mask(virq);
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index ef64e01f66d7..13d0a8f42d56 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3597,7 +3597,6 @@ static void irq_remapping_prepare_irte(struct amd_ir_data *data,
 {
 	struct irq_2_irte *irte_info = &data->irq_2_irte;
 	struct msi_msg *msg = &data->msi_entry;
-	struct IO_APIC_route_entry *entry;
 	struct amd_iommu *iommu = amd_iommu_rlookup_table[devid];
 
 	if (!iommu)
@@ -3611,19 +3610,6 @@ static void irq_remapping_prepare_irte(struct amd_ir_data *data,
 
 	switch (info->type) {
 	case X86_IRQ_ALLOC_TYPE_IOAPIC:
-		/* Setup IOAPIC entry */
-		entry = info->ioapic.entry;
-		info->ioapic.entry = NULL;
-		memset(entry, 0, sizeof(*entry));
-		entry->vector        = index;
-		entry->mask          = 0;
-		entry->trigger       = info->ioapic.trigger;
-		entry->polarity      = info->ioapic.polarity;
-		/* Mask level triggered irqs. */
-		if (info->ioapic.trigger)
-			entry->mask = 1;
-		break;
-
 	case X86_IRQ_ALLOC_TYPE_HPET:
 	case X86_IRQ_ALLOC_TYPE_PCI_MSI:
 	case X86_IRQ_ALLOC_TYPE_PCI_MSIX:
diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
index e09e2d734c57..37dd485a5640 100644
--- a/drivers/iommu/hyperv-iommu.c
+++ b/drivers/iommu/hyperv-iommu.c
@@ -40,7 +40,6 @@ static int hyperv_ir_set_affinity(struct irq_data *data,
 {
 	struct irq_data *parent = data->parent_data;
 	struct irq_cfg *cfg = irqd_cfg(data);
-	struct IO_APIC_route_entry *entry;
 	int ret;
 
 	/* Return error If new irq affinity is out of ioapic_max_cpumask. */
@@ -51,9 +50,6 @@ static int hyperv_ir_set_affinity(struct irq_data *data,
 	if (ret < 0 || ret == IRQ_SET_MASK_OK_DONE)
 		return ret;
 
-	entry = data->chip_data;
-	entry->dest = cfg->dest_apicid;
-	entry->vector = cfg->vector;
 	send_cleanup_vector(cfg);
 
 	return 0;
@@ -89,20 +85,6 @@ static int hyperv_irq_remapping_alloc(struct irq_domain *domain,
 
 	irq_data->chip = &hyperv_ir_chip;
 
-	/*
-	 * If there is interrupt remapping function of IOMMU, setting irq
-	 * affinity only needs to change IRTE of IOMMU. But Hyper-V doesn't
-	 * support interrupt remapping function, setting irq affinity of IO-APIC
-	 * interrupts still needs to change IO-APIC registers. But ioapic_
-	 * configure_entry() will ignore value of cfg->vector and cfg->
-	 * dest_apicid when IO-APIC's parent irq domain is not the vector
-	 * domain.(See ioapic_configure_entry()) In order to setting vector
-	 * and dest_apicid to IO-APIC register, IO-APIC entry pointer is saved
-	 * in the chip_data and hyperv_irq_remapping_activate()/hyperv_ir_set_
-	 * affinity() set vector and dest_apicid directly into IO-APIC entry.
-	 */
-	irq_data->chip_data = info->ioapic.entry;
-
 	/*
 	 * Hypver-V IO APIC irq affinity should be in the scope of
 	 * ioapic_max_cpumask because no irq remapping support.
@@ -119,22 +101,9 @@ static void hyperv_irq_remapping_free(struct irq_domain *domain,
 	irq_domain_free_irqs_common(domain, virq, nr_irqs);
 }
 
-static int hyperv_irq_remapping_activate(struct irq_domain *domain,
-			  struct irq_data *irq_data, bool reserve)
-{
-	struct irq_cfg *cfg = irqd_cfg(irq_data);
-	struct IO_APIC_route_entry *entry = irq_data->chip_data;
-
-	entry->dest = cfg->dest_apicid;
-	entry->vector = cfg->vector;
-
-	return 0;
-}
-
 static const struct irq_domain_ops hyperv_ir_domain_ops = {
 	.alloc = hyperv_irq_remapping_alloc,
 	.free = hyperv_irq_remapping_free,
-	.activate = hyperv_irq_remapping_activate,
 };
 
 static int __init hyperv_prepare_irq_remapping(void)
diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c
index 0cfce1d3b7bb..511dfb4884bc 100644
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -1265,7 +1265,6 @@ static void intel_irq_remapping_prepare_irte(struct intel_ir_data *data,
 					     struct irq_alloc_info *info,
 					     int index, int sub_handle)
 {
-	struct IR_IO_APIC_route_entry *entry;
 	struct irte *irte = &data->irte_entry;
 	struct msi_msg *msg = &data->msi_entry;
 
@@ -1281,23 +1280,15 @@ static void intel_irq_remapping_prepare_irte(struct intel_ir_data *data,
 			irte->avail, irte->vector, irte->dest_id,
 			irte->sid, irte->sq, irte->svt);
 
-		entry = (struct IR_IO_APIC_route_entry *)info->ioapic.entry;
-		info->ioapic.entry = NULL;
-		memset(entry, 0, sizeof(*entry));
-		entry->index2	= (index >> 15) & 0x1;
-		entry->zero	= 0;
-		entry->format	= 1;
-		entry->index	= (index & 0x7fff);
 		/*
 		 * IO-APIC RTE will be configured with virtual vector.
 		 * irq handler will do the explicit EOI to the io-apic.
 		 */
-		entry->vector	= info->ioapic.pin;
-		entry->mask	= 0;			/* enable IRQ */
-		entry->trigger	= info->ioapic.trigger;
-		entry->polarity	= info->ioapic.polarity;
-		if (info->ioapic.trigger)
-			entry->mask = 1; /* Mask level triggered irqs. */
+		msg->data = info->ioapic.pin;
+		msg->address_hi = MSI_ADDR_BASE_HI;
+		msg->address_lo = MSI_ADDR_BASE_LO | MSI_ADDR_IR_EXT_INT |
+				  MSI_ADDR_IR_INDEX1(index) |
+				  MSI_ADDR_IR_INDEX2(index);
 		break;
 
 	case X86_IRQ_ALLOC_TYPE_HPET:
Thomas Gleixner Oct. 8, 2020, 9:14 p.m. UTC | #4
On Thu, Oct 08 2020 at 17:08, David Woodhouse wrote:
> On Thu, 2020-10-08 at 13:55 +0100, David Woodhouse wrote:
>
> (We'd want the x86_vector_domain to actually have an MSI compose
> function in the !CONFIG_PCI_MSI case if we did this, of course.)

The compose function and the vector domain wrapper can simply move to vector.c

> From 2fbc79588d4677ee1cc9df661162fcf1a7da57f0 Mon Sep 17 00:00:00 2001
> From: David Woodhouse <dwmw@amazon.co.uk>
> Date: Thu, 8 Oct 2020 15:44:42 +0100
> Subject: [PATCH 6/5] x86/ioapic: Generate RTE directly from parent irqchip's MSI
>  message
>
> The IOAPIC generates an MSI cycle with address/data bits taken from its
> Redirection Table Entry in some combination which used to make sense,
> but now is just a bunch of bits which get passed through in some
> seemingly arbitrary order.
>
> Instead of making IRQ remapping drivers directly frob the IOAPIC RTE,
> let them just do their job and generate an MSI message. The bit
> swizzling to turn that MSI message into the IOAPIC's RTE is the same in
> all cases, since it's a function of the IOAPIC hardware. The IRQ
> remappers have no real need to get involved with that.
>
> The only slight caveat is that the IOAPIC is interpreting some of
> those fields too, and it does want the 'vector' field to be unique
> to make EOI work. The AMD IOMMU happens to put its IRTE index in the
> bits that the IOAPIC thinks are the vector field, and accommodates
> this requirement by reserving the first 32 indices for the IOAPIC.
> The Intel IOMMU doesn't actually use the bits that the IOAPIC thinks
> are the vector field, so it fills in the 'pin' value there instead.
>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  arch/x86/include/asm/hw_irq.h       | 11 +++---
>  arch/x86/include/asm/msidef.h       |  2 ++
>  arch/x86/kernel/apic/io_apic.c      | 55 ++++++++++++++++++-----------
>  drivers/iommu/amd/iommu.c           | 14 --------
>  drivers/iommu/hyperv-iommu.c        | 31 ----------------
>  drivers/iommu/intel/irq_remapping.c | 19 +++-------
>  6 files changed, 46 insertions(+), 86 deletions(-)

Nice :)

> +static void mp_swizzle_msi_dest_bits(struct irq_data *irq_data, void *_entry)
> +{
> +	struct msi_msg msg;
> +	u32 *entry = _entry;
> +
> +	irq_chip_compose_msi_msg(irq_data, &msg);

Duh, for some stupid reason it never occured to me to do it that
way.

Historically the MSI compose function was part of the MSI PCI chip and I
just changed that recently when I reworked the code to make IMS support
possible.

Historical blinders are pretty sticky :(

> +	/*
> +	 * They're in a bit of a random order for historical reasons, but
> +	 * the IO/APIC is just a device for turning interrupt lines into
> +	 * MSIs, and various bits of the MSI addr/data are just swizzled
> +	 * into/from the bits of Redirection Table Entry.
> +	 */
> +	entry[0] &= 0xfffff000;
> +	entry[0] |= (msg.data & (MSI_DATA_DELIVERY_MODE_MASK |
> +				 MSI_DATA_VECTOR_MASK));
> +	entry[0] |= (msg.address_lo & MSI_ADDR_DEST_MODE_MASK) << 9;
> +
> +	entry[1] &= 0xffff;
> +	entry[1] |= (msg.address_lo & MSI_ADDR_DEST_ID_MASK) << 12;
> +}

....

>  	switch (info->type) {
>  	case X86_IRQ_ALLOC_TYPE_IOAPIC:
> -		/* Setup IOAPIC entry */
> -		entry = info->ioapic.entry;
> -		info->ioapic.entry = NULL;
> -		memset(entry, 0, sizeof(*entry));
> -		entry->vector        = index;
> -		entry->mask          = 0;
> -		entry->trigger       = info->ioapic.trigger;
> -		entry->polarity      = info->ioapic.polarity;
> -		/* Mask level triggered irqs. */
> -		if (info->ioapic.trigger)
> -			entry->mask = 1;
> -		break;
> -

Thanks for cleaning this up!

       tglx
David Woodhouse Oct. 8, 2020, 9:39 p.m. UTC | #5
On Thu, 2020-10-08 at 23:14 +0200, Thomas Gleixner wrote:
> On Thu, Oct 08 2020 at 17:08, David Woodhouse wrote:
> > On Thu, 2020-10-08 at 13:55 +0100, David Woodhouse wrote:
> > 
> > (We'd want the x86_vector_domain to actually have an MSI compose
> > function in the !CONFIG_PCI_MSI case if we did this, of course.)
> 
> The compose function and the vector domain wrapper can simply move to
> vector.c

I ended up putting __irq_msi_compose_msg() into apic.c and that way I
can make virt_ext_dest_id static in that file.

And then I can move all the HPET-MSI support into hpet.c too.

https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/ext_dest_id
Thomas Gleixner Oct. 8, 2020, 11:27 p.m. UTC | #6
On Thu, Oct 08 2020 at 22:39, David Woodhouse wrote:
> On Thu, 2020-10-08 at 23:14 +0200, Thomas Gleixner wrote:
>> > 
>> > (We'd want the x86_vector_domain to actually have an MSI compose
>> > function in the !CONFIG_PCI_MSI case if we did this, of course.)
>> 
>> The compose function and the vector domain wrapper can simply move to
>> vector.c
>
> I ended up putting __irq_msi_compose_msg() into apic.c and that way I
> can make virt_ext_dest_id static in that file.
>
> And then I can move all the HPET-MSI support into hpet.c too.

Works for me.

> https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/ext_dest_id

For the next submission, can you please

 - pick up the -ENODEV changes for HPET/IOAPIC which I posted earlier

 - shuffle all that compose/IOAPIC cleanup around

before adding that extended dest id stuff.

Thanks,

        tglx
David Woodhouse Oct. 9, 2020, 6:07 a.m. UTC | #7
On 9 October 2020 00:27:06 BST, Thomas Gleixner <tglx@linutronix.de> wrote:
>On Thu, Oct 08 2020 at 22:39, David Woodhouse wrote:
>> On Thu, 2020-10-08 at 23:14 +0200, Thomas Gleixner wrote:
>>> > 
>>> > (We'd want the x86_vector_domain to actually have an MSI compose
>>> > function in the !CONFIG_PCI_MSI case if we did this, of course.)
>>> 
>>> The compose function and the vector domain wrapper can simply move
>to
>>> vector.c
>>
>> I ended up putting __irq_msi_compose_msg() into apic.c and that way I
>> can make virt_ext_dest_id static in that file.
>>
>> And then I can move all the HPET-MSI support into hpet.c too.
>
>Works for me.
>
>>
>https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/ext_dest_id
>
>For the next submission, can you please
>
> - pick up the -ENODEV changes for HPET/IOAPIC which I posted earlier

Ack.

> - shuffle all that compose/IOAPIC cleanup around

I'd prefer the MSI swizzling bit to stay last in the series. I want to stare hard at the hyperv-iommu part a bit more, and ideally even have it tested. I'd prefer the real functionality that I care about, not to depend on that cleanup.

If it actually let me remove all mention of ext_dest_id from the IOAPIC code and use *only* MSI swizzling, I'd be keener to reorder. But as noted, there are a couple of manual RTE constructions in there still anyway.

I can move __irq_compose_msi_msg() earlier in the series though, and then virt_ext_dest_id can be static in apic.c from its inception.

OK?
David Woodhouse Oct. 10, 2020, 10:06 a.m. UTC | #8
On Fri, 2020-10-09 at 01:27 +0200, Thomas Gleixner wrote:
> On Thu, Oct 08 2020 at 22:39, David Woodhouse wrote:
> > On Thu, 2020-10-08 at 23:14 +0200, Thomas Gleixner wrote:
> > > > 
> > > > (We'd want the x86_vector_domain to actually have an MSI compose
> > > > function in the !CONFIG_PCI_MSI case if we did this, of course.)
> > > 
> > > The compose function and the vector domain wrapper can simply move to
> > > vector.c
> > 
> > I ended up putting __irq_msi_compose_msg() into apic.c and that way I
> > can make virt_ext_dest_id static in that file.
> > 
> > And then I can move all the HPET-MSI support into hpet.c too.
> 
> Works for me.
> 
> > https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/ext_dest_id
> 
> For the next submission, can you please
> 
>  - pick up the -ENODEV changes for HPET/IOAPIC which I posted earlier

I think the world will be a nicer place if HPET and IOAPIC have their
own struct device and their drivers can just use dev_get_msi_domain().

The IRQ remapping drivers already plug into the device-add notifier and
can fill in the appropriate MSI domain just like they do¹ for PCI and
ACPI devices.

Using platform_add_bundle() for HPET looks trivial enough; I'll have a
play with that and then do IOAPIC too if/when the initialisation order
and hotplug handling all works out OK to install the correct
msi_domain.
Thomas Gleixner Oct. 10, 2020, 11:44 a.m. UTC | #9
On Sat, Oct 10 2020 at 11:06, David Woodhouse wrote:
> On Fri, 2020-10-09 at 01:27 +0200, Thomas Gleixner wrote:
>> On Thu, Oct 08 2020 at 22:39, David Woodhouse wrote:
>> For the next submission, can you please
>> 
>>  - pick up the -ENODEV changes for HPET/IOAPIC which I posted earlier
>
> I think the world will be a nicer place if HPET and IOAPIC have their
> own struct device and their drivers can just use dev_get_msi_domain().
>
> The IRQ remapping drivers already plug into the device-add notifier and
> can fill in the appropriate MSI domain just like they do¹ for PCI and
> ACPI devices.
>
> Using platform_add_bundle() for HPET looks trivial enough; I'll have a
> play with that and then do IOAPIC too if/when the initialisation order
> and hotplug handling all works out OK to install the correct
> msi_domain.

Yes, I was wondering about that when I made PCI at least use that
mechanism, but had not had time to actually look at it.

Thanks,

        tglx
David Woodhouse Oct. 10, 2020, 11:58 a.m. UTC | #10
On 10 October 2020 12:44:10 BST, Thomas Gleixner <tglx@linutronix.de> wrote:
>On Sat, Oct 10 2020 at 11:06, David Woodhouse wrote:
>> On Fri, 2020-10-09 at 01:27 +0200, Thomas Gleixner wrote:
>>> On Thu, Oct 08 2020 at 22:39, David Woodhouse wrote:
>>> For the next submission, can you please
>>> 
>>>  - pick up the -ENODEV changes for HPET/IOAPIC which I posted
>earlier
>>
>> I think the world will be a nicer place if HPET and IOAPIC have their
>> own struct device and their drivers can just use
>dev_get_msi_domain().
>>
>> The IRQ remapping drivers already plug into the device-add notifier
>and
>> can fill in the appropriate MSI domain just like they do¹ for PCI and
>> ACPI devices.
>>
>> Using platform_add_bundle() for HPET looks trivial enough; I'll have
>a
>> play with that and then do IOAPIC too if/when the initialisation
>order
>> and hotplug handling all works out OK to install the correct
>> msi_domain.
>
>Yes, I was wondering about that when I made PCI at least use that
>mechanism, but had not had time to actually look at it.

Yeah. There's some muttering to be done for HPET about whether it's *its* MSI domain or whether it's the parent domain. But I'll have a play. I think we'll be able to drop the whole irq_remapping_get_irq_domain() thing.

Either way, it's a separate cleanup and the 15-bit APIC ID series I posted yesterday should be fine as it is.
Thomas Gleixner Oct. 11, 2020, 5:12 p.m. UTC | #11
On Sat, Oct 10 2020 at 12:58, David Woodhouse wrote:
> On 10 October 2020 12:44:10 BST, Thomas Gleixner <tglx@linutronix.de> wrote:
>>On Sat, Oct 10 2020 at 11:06, David Woodhouse wrote:

>>> The IRQ remapping drivers already plug into the device-add notifier
>>> and can fill in the appropriate MSI domain just like they do¹ for
>>> PCI and ACPI devices.
>>> Using platform_add_bundle() for HPET looks trivial enough; I'll have
>>> a play with that and then do IOAPIC too if/when the initialisation
>>> order and hotplug handling all works out OK to install the correct
>>> msi_domain.
>>
>> Yes, I was wondering about that when I made PCI at least use that
>> mechanism, but had not had time to actually look at it.
>
> Yeah. There's some muttering to be done for HPET about whether it's
> *its* MSI domain or whether it's the parent domain. But I'll have a
> play. I think we'll be able to drop the whole
> irq_remapping_get_irq_domain() thing.

That would be really nice.

> Either way, it's a separate cleanup and the 15-bit APIC ID series I
> posted yesterday should be fine as it is.

I go over it in the next days once more and stick it into my devel tree
until rc1. Need to get some conflicts sorted with that Device MSI stuff.

Thanks,

        tglx
David Woodhouse Oct. 11, 2020, 9:15 p.m. UTC | #12
On 11 October 2020 18:12:08 BST, Thomas Gleixner <tglx@linutronix.de> wrote:
>On Sat, Oct 10 2020 at 12:58, David Woodhouse wrote:
>> On 10 October 2020 12:44:10 BST, Thomas Gleixner <tglx@linutronix.de>
>wrote:
>>>On Sat, Oct 10 2020 at 11:06, David Woodhouse wrote:
>
>>>> The IRQ remapping drivers already plug into the device-add notifier
>>>> and can fill in the appropriate MSI domain just like they do¹ for
>>>> PCI and ACPI devices.
>>>> Using platform_add_bundle() for HPET looks trivial enough; I'll
>have
>>>> a play with that and then do IOAPIC too if/when the initialisation
>>>> order and hotplug handling all works out OK to install the correct
>>>> msi_domain.
>>>
>>> Yes, I was wondering about that when I made PCI at least use that
>>> mechanism, but had not had time to actually look at it.
>>
>> Yeah. There's some muttering to be done for HPET about whether it's
>> *its* MSI domain or whether it's the parent domain. But I'll have a
>> play. I think we'll be able to drop the whole
>> irq_remapping_get_irq_domain() thing.
>
>That would be really nice.

I can make it work for HPET if I fix up the point at which the IRQ remapping code registers a notifier on the platform bus. (At IRQ remap setup time is too early; when it registers the PCI bus notifier is too late.)

IOAPIC is harder though as the platform bus doesn't even exist that early. Maybe an early platform bus is possible but it would have to turn out particularly simple to do, or I'd need to find another use case too, to justify it. Will continue to play....

>> Either way, it's a separate cleanup and the 15-bit APIC ID series I
>> posted yesterday should be fine as it is.
>
>I go over it in the next days once more and stick it into my devel tree
>until rc1. Need to get some conflicts sorted with that Device MSI
>stuff.

While playing with HPET I noticed I need s/CONFIG_PCI_MSI/CONFIG_IRQ_GENERIC_MSI/ where the variables are declared at the top of msi.c to match the change I made later on. Can post v3 of the series or you can silently fix it up as you go; please advise.
Thomas Gleixner Oct. 12, 2020, 9:33 a.m. UTC | #13
On Sun, Oct 11 2020 at 22:15, David Woodhouse wrote:
> On 11 October 2020 18:12:08 BST, Thomas Gleixner <tglx@linutronix.de> wrote:
>> On Sat, Oct 10 2020 at 12:58, David Woodhouse wrote:
>>> On 10 October 2020 12:44:10 BST, Thomas Gleixner <tglx@linutronix.de>
>> wrote:
>>> Yeah. There's some muttering to be done for HPET about whether it's
>>> *its* MSI domain or whether it's the parent domain. But I'll have a
>>> play. I think we'll be able to drop the whole
>>> irq_remapping_get_irq_domain() thing.
>>
>> That would be really nice.
>
> I can make it work for HPET if I fix up the point at which the IRQ
> remapping code registers a notifier on the platform bus. (At IRQ remap
> setup time is too early; when it registers the PCI bus notifier is too
> late.)
>
> IOAPIC is harder though as the platform bus doesn't even exist that
> early. Maybe an early platform bus is possible but it would have to
> turn out particularly simple to do, or I'd need to find another use
> case too, to justify it. Will continue to play....

You might want to look into using irq_find_matching_fwspec() instead for
both HPET and IOAPIC. That needs a select() callback implemented in the
remapping domains.

>> I go over it in the next days once more and stick it into my devel tree
>> until rc1. Need to get some conflicts sorted with that Device MSI
>> stuff.
>
> While playing with HPET I noticed I need
> s/CONFIG_PCI_MSI/CONFIG_IRQ_GENERIC_MSI/ where the variables are
> declared at the top of msi.c to match the change I made later on. Can
> post v3 of the series or you can silently fix it up as you go; please
> advise.

I think I might be able to handle that on my own :)

Thanks,

        tglx
David Woodhouse Oct. 12, 2020, 4:06 p.m. UTC | #14
On Mon, 2020-10-12 at 11:33 +0200, Thomas Gleixner wrote:
> On Sun, Oct 11 2020 at 22:15, David Woodhouse wrote:
> > On 11 October 2020 18:12:08 BST, Thomas Gleixner <tglx@linutronix.de> wrote:
> > > On Sat, Oct 10 2020 at 12:58, David Woodhouse wrote:
> > > > On 10 October 2020 12:44:10 BST, Thomas Gleixner <tglx@linutronix.de>
> > > 
> > > wrote:
> > > > Yeah. There's some muttering to be done for HPET about whether it's
> > > > *its* MSI domain or whether it's the parent domain. But I'll have a
> > > > play. I think we'll be able to drop the whole
> > > > irq_remapping_get_irq_domain() thing.
> > > 
> > > That would be really nice.
> > 
> > I can make it work for HPET if I fix up the point at which the IRQ
> > remapping code registers a notifier on the platform bus. (At IRQ remap
> > setup time is too early; when it registers the PCI bus notifier is too
> > late.)
> > 
> > IOAPIC is harder though as the platform bus doesn't even exist that
> > early. Maybe an early platform bus is possible but it would have to
> > turn out particularly simple to do, or I'd need to find another use
> > case too, to justify it. Will continue to play....
> 
> You might want to look into using irq_find_matching_fwspec() instead for
> both HPET and IOAPIC. That needs a select() callback implemented in the
> remapping domains.

That works. Pushed (along with that trivial HPET MSI fixup) to 
https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/ext_dest_id

Briefly tested with I/OAPIC with and without Intel remapping in qemu,
not tested for HPET because I haven't worked out how to get qemu to do
HPET-MSI yet.

David Woodhouse (8):
      genirq/irqdomain: Implement get_name() method on irqchip fwnodes
      x86/apic: Add select() method on vector irqdomain
      iommu/amd: Implement select() method on remapping irqdomain
      iommu/vt-d: Implement select() method on remapping irqdomain
      iommu/hyper-v: Implement select() method on remapping irqdomain
      x86/hpet: Use irq_find_matching_fwspec() to find remapping irqdomain
      x86/ioapic: Use irq_find_matching_fwspec() to find remapping irqdomain
      x86: Kill all traces of irq_remapping_get_irq_domain()

 arch/x86/include/asm/hw_irq.h        |  2 --
 arch/x86/include/asm/irq_remapping.h |  9 ------
 arch/x86/include/asm/irqdomain.h     |  3 ++
 arch/x86/kernel/apic/io_apic.c       | 21 ++++++--------
 arch/x86/kernel/apic/vector.c        | 52 +++++++++++++++++++++++++++++++++++
 arch/x86/kernel/hpet.c               | 23 +++++++++-------
 drivers/iommu/amd/iommu.c            | 53 +++++++++++++-----------------------
 drivers/iommu/hyperv-iommu.c         | 18 ++++++------
 drivers/iommu/intel/irq_remapping.c  | 30 +++++++++-----------
 drivers/iommu/irq_remapping.c        | 14 ----------
 drivers/iommu/irq_remapping.h        |  3 --
 kernel/irq/irqdomain.c               | 11 +++++++-
 12 files changed, 128 insertions(+), 111 deletions(-)

diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index aabd8f1b6bb0..aef795f17478 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -40,8 +40,6 @@ enum irq_alloc_type {
 	X86_IRQ_ALLOC_TYPE_PCI_MSIX,
 	X86_IRQ_ALLOC_TYPE_DMAR,
 	X86_IRQ_ALLOC_TYPE_UV,
-	X86_IRQ_ALLOC_TYPE_IOAPIC_GET_PARENT,
-	X86_IRQ_ALLOC_TYPE_HPET_GET_PARENT,
 };
 
 struct ioapic_alloc_info {
diff --git a/arch/x86/include/asm/irq_remapping.h b/arch/x86/include/asm/irq_remapping.h
index af4a151d70b3..7cc49432187f 100644
--- a/arch/x86/include/asm/irq_remapping.h
+++ b/arch/x86/include/asm/irq_remapping.h
@@ -44,9 +44,6 @@ extern int irq_remapping_reenable(int);
 extern int irq_remap_enable_fault_handling(void);
 extern void panic_if_irq_remap(const char *msg);
 
-extern struct irq_domain *
-irq_remapping_get_irq_domain(struct irq_alloc_info *info);
-
 /* Create PCI MSI/MSIx irqdomain, use @parent as the parent irqdomain. */
 extern struct irq_domain *
 arch_create_remap_msi_irq_domain(struct irq_domain *par, const char *n, int id);
@@ -71,11 +68,5 @@ static inline void panic_if_irq_remap(const char *msg)
 {
 }
 
-static inline struct irq_domain *
-irq_remapping_get_irq_domain(struct irq_alloc_info *info)
-{
-	return NULL;
-}
-
 #endif /* CONFIG_IRQ_REMAP */
 #endif /* __X86_IRQ_REMAPPING_H */
diff --git a/arch/x86/include/asm/irqdomain.h b/arch/x86/include/asm/irqdomain.h
index cd684d45cb5f..2fc85f523ace 100644
--- a/arch/x86/include/asm/irqdomain.h
+++ b/arch/x86/include/asm/irqdomain.h
@@ -12,6 +12,9 @@ enum {
 	X86_IRQ_ALLOC_LEGACY				= 0x2,
 };
 
+int x86_fwspec_is_ioapic(struct irq_fwspec *fwspec);
+int x86_fwspec_is_hpet(struct irq_fwspec *fwspec);
+
 extern struct irq_domain *x86_vector_domain;
 
 extern void init_irq_alloc_info(struct irq_alloc_info *info,
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index ca2da19d5c55..f6a5b9f887aa 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -2305,36 +2305,33 @@ static inline void __init check_timer(void)
 
 static int mp_irqdomain_create(int ioapic)
 {
-	struct irq_alloc_info info;
 	struct irq_domain *parent;
 	int hwirqs = mp_ioapic_pin_count(ioapic);
 	struct ioapic *ip = &ioapics[ioapic];
 	struct ioapic_domain_cfg *cfg = &ip->irqdomain_cfg;
 	struct mp_ioapic_gsi *gsi_cfg = mp_ioapic_gsi_routing(ioapic);
 	struct fwnode_handle *fn;
-	char *name = "IO-APIC";
+	struct irq_fwspec fwspec;
 
 	if (cfg->type == IOAPIC_DOMAIN_INVALID)
 		return 0;
 
-	init_irq_alloc_info(&info, NULL);
-	info.type = X86_IRQ_ALLOC_TYPE_IOAPIC_GET_PARENT;
-	info.devid = mpc_ioapic_id(ioapic);
-	parent = irq_remapping_get_irq_domain(&info);
-	if (!parent)
-		parent = x86_vector_domain;
-	else
-		name = "IO-APIC-IR";
-
 	/* Handle device tree enumerated APICs proper */
 	if (cfg->dev) {
 		fn = of_node_to_fwnode(cfg->dev);
 	} else {
-		fn = irq_domain_alloc_named_id_fwnode(name, ioapic);
+		fn = irq_domain_alloc_named_id_fwnode("IO-APIC", ioapic);
 		if (!fn)
 			return -ENOMEM;
 	}
 
+	fwspec.fwnode = fn;
+	fwspec.param_count = 1;
+	fwspec.param[0] = ioapic;
+	parent = irq_find_matching_fwspec(&fwspec, DOMAIN_BUS_ANY);
+	if (!parent)
+		return -ENODEV;
+
 	ip->irqdomain = irq_domain_create_linear(fn, hwirqs, cfg->ops,
 						 (void *)(long)ioapic);
 
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index bb2e2a2488a5..3f0485c12b13 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -636,7 +636,59 @@ static void x86_vector_debug_show(struct seq_file *m, struct irq_domain *d,
 }
 #endif
 
+
+int x86_fwspec_is_ioapic(struct irq_fwspec *fwspec)
+{
+	if (fwspec->param_count != 1)
+		return 0;
+
+	if (is_fwnode_irqchip(fwspec->fwnode)){
+		const char *fwname = fwnode_get_name(fwspec->fwnode);
+
+		if (!strncmp(fwname, "IO-APIC-", 8) &&
+		    simple_strtol(fwname+8, NULL, 10) == fwspec->param[0])
+			return 1;
+#ifdef CONFIG_OF
+	} else if (to_of_node(fwspec->fwnode) &&
+		   of_device_is_compatible(to_of_node(fwspec->fwnode),
+					   "intel,ce4100-ioapic")) {
+		return 1;
+#endif
+	}
+	return 0;
+}
+
+int x86_fwspec_is_hpet(struct irq_fwspec *fwspec)
+{
+	if (fwspec->param_count != 1)
+		return 0;
+
+	if (is_fwnode_irqchip(fwspec->fwnode)){
+		const char *fwname = fwnode_get_name(fwspec->fwnode);
+
+		if (!strncmp(fwname, "HPET-MSI-", 9) &&
+		    simple_strtol(fwname+9, NULL, 10) == fwspec->param[0])
+			return 1;
+	}
+	return 0;
+}
+
+static int x86_vector_select(struct irq_domain *d, struct irq_fwspec *fwspec,
+			     enum irq_domain_bus_token bus_token)
+{
+	/*
+	 * HPET and I/OAPIC cannot be parented in the vector domain
+	 * if IRQ remapping is enabled. APIC IDs above 15 bits are
+	 * only permitted if IRQ remapping is enabled, so check that.
+	 */
+	if (apic->apic_id_valid(32768))
+		return 0;
+
+	return x86_fwspec_is_ioapic(fwspec) || x86_fwspec_is_hpet(fwspec);
+}
+
 static const struct irq_domain_ops x86_vector_domain_ops = {
+	.select		= x86_vector_select,
 	.alloc		= x86_vector_alloc_irqs,
 	.free		= x86_vector_free_irqs,
 	.activate	= x86_vector_activate,
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 3b8b12769f3b..fb7736ca7b5b 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -543,8 +543,8 @@ static struct irq_domain *hpet_create_irq_domain(int hpet_id)
 {
 	struct msi_domain_info *domain_info;
 	struct irq_domain *parent, *d;
-	struct irq_alloc_info info;
 	struct fwnode_handle *fn;
+	struct irq_fwspec fwspec;
 
 	if (x86_vector_domain == NULL)
 		return NULL;
@@ -556,15 +556,6 @@ static struct irq_domain *hpet_create_irq_domain(int hpet_id)
 	*domain_info = hpet_msi_domain_info;
 	domain_info->data = (void *)(long)hpet_id;
 
-	init_irq_alloc_info(&info, NULL);
-	info.type = X86_IRQ_ALLOC_TYPE_HPET_GET_PARENT;
-	info.devid = hpet_id;
-	parent = irq_remapping_get_irq_domain(&info);
-	if (parent == NULL)
-		parent = x86_vector_domain;
-	else
-		hpet_msi_controller.name = "IR-HPET-MSI";
-
 	fn = irq_domain_alloc_named_id_fwnode(hpet_msi_controller.name,
 					      hpet_id);
 	if (!fn) {
@@ -572,6 +563,18 @@ static struct irq_domain *hpet_create_irq_domain(int hpet_id)
 		return NULL;
 	}
 
+	fwspec.fwnode = fn;
+	fwspec.param_count = 1;
+	fwspec.param[0] = hpet_id;
+	parent = irq_find_matching_fwspec(&fwspec, DOMAIN_BUS_ANY);
+	if (!parent) {
+		irq_domain_free_fwnode(fn);
+		kfree(domain_info);
+		return NULL;
+	}
+	if (parent != x86_vector_domain)
+		hpet_msi_controller.name = "IR-HPET-MSI";
+
 	d = msi_create_irq_domain(fn, domain_info, parent);
 	if (!d) {
 		irq_domain_free_fwnode(fn);
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 13d0a8f42d56..16adbf9d8fbb 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3536,10 +3536,8 @@ static int get_devid(struct irq_alloc_info *info)
 {
 	switch (info->type) {
 	case X86_IRQ_ALLOC_TYPE_IOAPIC:
-	case X86_IRQ_ALLOC_TYPE_IOAPIC_GET_PARENT:
 		return get_ioapic_devid(info->devid);
 	case X86_IRQ_ALLOC_TYPE_HPET:
-	case X86_IRQ_ALLOC_TYPE_HPET_GET_PARENT:
 		return get_hpet_devid(info->devid);
 	case X86_IRQ_ALLOC_TYPE_PCI_MSI:
 	case X86_IRQ_ALLOC_TYPE_PCI_MSIX:
@@ -3550,46 +3548,32 @@ static int get_devid(struct irq_alloc_info *info)
 	}
 }
 
-static struct irq_domain *get_irq_domain_for_devid(struct irq_alloc_info *info,
-						   int devid)
-{
-	struct amd_iommu *iommu = amd_iommu_rlookup_table[devid];
-
-	if (!iommu)
-		return NULL;
-
-	switch (info->type) {
-	case X86_IRQ_ALLOC_TYPE_IOAPIC_GET_PARENT:
-	case X86_IRQ_ALLOC_TYPE_HPET_GET_PARENT:
-		return iommu->ir_domain;
-	default:
-		WARN_ON_ONCE(1);
-		return NULL;
-	}
-}
-
-static struct irq_domain *get_irq_domain(struct irq_alloc_info *info)
-{
-	int devid;
-
-	if (!info)
-		return NULL;
-
-	devid = get_devid(info);
-	if (devid < 0)
-		return NULL;
-	return get_irq_domain_for_devid(info, devid);
-}
-
 struct irq_remap_ops amd_iommu_irq_ops = {
 	.prepare		= amd_iommu_prepare,
 	.enable			= amd_iommu_enable,
 	.disable		= amd_iommu_disable,
 	.reenable		= amd_iommu_reenable,
 	.enable_faulting	= amd_iommu_enable_faulting,
-	.get_irq_domain		= get_irq_domain,
 };
 
+static int irq_remapping_select(struct irq_domain *d, struct irq_fwspec *fwspec,
+				enum irq_domain_bus_token bus_token)
+{
+	struct amd_iommu *iommu;
+	int devid = -1;
+
+	if (x86_fwspec_is_ioapic(fwspec))
+		devid = get_ioapic_devid(fwspec->param[0]);
+	else if (x86_fwspec_is_ioapic(fwspec))
+		devid = get_hpet_devid(fwspec->param[0]);
+
+	if (devid < 0)
+		return 0;
+
+	iommu = amd_iommu_rlookup_table[devid];
+	return iommu && iommu->ir_domain == d;
+}
+
 static void irq_remapping_prepare_irte(struct amd_ir_data *data,
 				       struct irq_cfg *irq_cfg,
 				       struct irq_alloc_info *info,
@@ -3813,6 +3797,7 @@ static void irq_remapping_deactivate(struct irq_domain *domain,
 }
 
 static const struct irq_domain_ops amd_ir_domain_ops = {
+	.select = irq_remapping_select,
 	.alloc = irq_remapping_alloc,
 	.free = irq_remapping_free,
 	.activate = irq_remapping_activate,
diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
index 37dd485a5640..e7ed2bb83358 100644
--- a/drivers/iommu/hyperv-iommu.c
+++ b/drivers/iommu/hyperv-iommu.c
@@ -61,6 +61,14 @@ static struct irq_chip hyperv_ir_chip = {
 	.irq_set_affinity	= hyperv_ir_set_affinity,
 };
 
+static int hyperv_irq_remapping_select(struct irq_domain *d,
+				       struct irq_fwspec *fwspec,
+				       enum irq_domain_bus_token bus_token)
+{
+	/* Claim only the first (and only) I/OAPIC */
+	return x86_fwspec_is_ioapic(fwspec) && fwspec->param[0] == 0;
+}
+
 static int hyperv_irq_remapping_alloc(struct irq_domain *domain,
 				     unsigned int virq, unsigned int nr_irqs,
 				     void *arg)
@@ -102,6 +110,7 @@ static void hyperv_irq_remapping_free(struct irq_domain *domain,
 }
 
 static const struct irq_domain_ops hyperv_ir_domain_ops = {
+	.select = hyperv_irq_remapping_select,
 	.alloc = hyperv_irq_remapping_alloc,
 	.free = hyperv_irq_remapping_free,
 };
@@ -151,18 +160,9 @@ static int __init hyperv_enable_irq_remapping(void)
 	return IRQ_REMAP_X2APIC_MODE;
 }
 
-static struct irq_domain *hyperv_get_irq_domain(struct irq_alloc_info *info)
-{
-	if (info->type == X86_IRQ_ALLOC_TYPE_IOAPIC_GET_PARENT)
-		return ioapic_ir_domain;
-	else
-		return NULL;
-}
-
 struct irq_remap_ops hyperv_irq_remap_ops = {
 	.prepare		= hyperv_prepare_irq_remapping,
 	.enable			= hyperv_enable_irq_remapping,
-	.get_irq_domain		= hyperv_get_irq_domain,
 };
 
 #endif
diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c
index 511dfb4884bc..ccf61cd18f69 100644
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -1128,29 +1128,12 @@ static void prepare_irte(struct irte *irte, int vector, unsigned int dest)
 	irte->redir_hint = 1;
 }
 
-static struct irq_domain *intel_get_irq_domain(struct irq_alloc_info *info)
-{
-	if (!info)
-		return NULL;
-
-	switch (info->type) {
-	case X86_IRQ_ALLOC_TYPE_IOAPIC_GET_PARENT:
-		return map_ioapic_to_ir(info->devid);
-	case X86_IRQ_ALLOC_TYPE_HPET_GET_PARENT:
-		return map_hpet_to_ir(info->devid);
-	default:
-		WARN_ON_ONCE(1);
-		return NULL;
-	}
-}
-
 struct irq_remap_ops intel_irq_remap_ops = {
 	.prepare		= intel_prepare_irq_remapping,
 	.enable			= intel_enable_irq_remapping,
 	.disable		= disable_irq_remapping,
 	.reenable		= reenable_irq_remapping,
 	.enable_faulting	= enable_drhd_fault_handling,
-	.get_irq_domain		= intel_get_irq_domain,
 };
 
 static void intel_ir_reconfigure_irte(struct irq_data *irqd, bool force)
@@ -1435,7 +1418,20 @@ static void intel_irq_remapping_deactivate(struct irq_domain *domain,
 	modify_irte(&data->irq_2_iommu, &entry);
 }
 
+static int intel_irq_remapping_select(struct irq_domain *d,
+				      struct irq_fwspec *fwspec,
+				      enum irq_domain_bus_token bus_token)
+{
+	if (x86_fwspec_is_ioapic(fwspec))
+		return d == map_ioapic_to_ir(fwspec->param[0]);
+	else if (x86_fwspec_is_hpet(fwspec))
+		return d == map_hpet_to_ir(fwspec->param[0]);
+
+	return 0;
+}
+
 static const struct irq_domain_ops intel_ir_domain_ops = {
+	.select = intel_irq_remapping_select,
 	.alloc = intel_irq_remapping_alloc,
 	.free = intel_irq_remapping_free,
 	.activate = intel_irq_remapping_activate,
diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
index 2d84b1ed205e..83314b9d8f38 100644
--- a/drivers/iommu/irq_remapping.c
+++ b/drivers/iommu/irq_remapping.c
@@ -158,17 +158,3 @@ void panic_if_irq_remap(const char *msg)
 	if (irq_remapping_enabled)
 		panic(msg);
 }
-
-/**
- * irq_remapping_get_irq_domain - Get the irqdomain serving the request @info
- * @info: interrupt allocation information, used to identify the IOMMU device
- *
- * Returns pointer to IRQ domain, or NULL on failure.
- */
-struct irq_domain *irq_remapping_get_irq_domain(struct irq_alloc_info *info)
-{
-	if (!remap_ops || !remap_ops->get_irq_domain)
-		return NULL;
-
-	return remap_ops->get_irq_domain(info);
-}
diff --git a/drivers/iommu/irq_remapping.h b/drivers/iommu/irq_remapping.h
index 1661b3d75920..8c89cb947cdb 100644
--- a/drivers/iommu/irq_remapping.h
+++ b/drivers/iommu/irq_remapping.h
@@ -42,9 +42,6 @@ struct irq_remap_ops {
 
 	/* Enable fault handling */
 	int  (*enable_faulting)(void);
-
-	/* Get the irqdomain associated to IOMMU device */
-	struct irq_domain *(*get_irq_domain)(struct irq_alloc_info *);
 };
 
 extern struct irq_remap_ops intel_irq_remap_ops;
diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index 76cd7ebd1178..6440f97c412e 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -42,7 +42,16 @@ static inline void debugfs_add_domain_dir(struct irq_domain *d) { }
 static inline void debugfs_remove_domain_dir(struct irq_domain *d) { }
 #endif
 
-const struct fwnode_operations irqchip_fwnode_ops;
+static const char *irqchip_fwnode_get_name(const struct fwnode_handle *fwnode)
+{
+	struct irqchip_fwid *fwid = container_of(fwnode, struct irqchip_fwid, fwnode);
+
+	return fwid->name;
+}
+
+const struct fwnode_operations irqchip_fwnode_ops = {
+	.get_name = irqchip_fwnode_get_name,
+};
 EXPORT_SYMBOL_GPL(irqchip_fwnode_ops);
 
 /**
Thomas Gleixner Oct. 12, 2020, 6:38 p.m. UTC | #15
On Mon, Oct 12 2020 at 17:06, David Woodhouse wrote:
> On Mon, 2020-10-12 at 11:33 +0200, Thomas Gleixner wrote:
>> You might want to look into using irq_find_matching_fwspec() instead for
>> both HPET and IOAPIC. That needs a select() callback implemented in the
>> remapping domains.
>
> That works.

:)

Nasty, but way better than what we have now. Now I start to wonder
whether we can get rid of a few other things as well especially the
non-standard alloc_info thingy.

Thanks,

        tglx
David Woodhouse Oct. 12, 2020, 8:20 p.m. UTC | #16
On Mon, 2020-10-12 at 20:38 +0200, Thomas Gleixner wrote:
> On Mon, Oct 12 2020 at 17:06, David Woodhouse wrote:
> > On Mon, 2020-10-12 at 11:33 +0200, Thomas Gleixner wrote:
> > > You might want to look into using irq_find_matching_fwspec()
> > > instead for
> > > both HPET and IOAPIC. That needs a select() callback implemented
> > > in the
> > > remapping domains.
> > 
> > That works.
> 
> :)
> 
> Nasty, but way better than what we have now. 

Want me to send that out in email or is the git tree enough for now?

I've cleaned it up a little and fixed a bug in the I/OAPIC error path.

Still not entirely convinced about the apic->apic_id_valid(32768) thing
but it should work well enough, and doesn't require exporting any extra
state from apic.c that way.
Thomas Gleixner Oct. 12, 2020, 10:13 p.m. UTC | #17
On Mon, Oct 12 2020 at 21:20, David Woodhouse wrote:
> On Mon, 2020-10-12 at 20:38 +0200, Thomas Gleixner wrote:
>> Nasty, but way better than what we have now. 
>
> Want me to send that out in email or is the git tree enough for now?
>
> I've cleaned it up a little and fixed a bug in the I/OAPIC error path.

Mail would be nice once you are confident with the pile.

> Still not entirely convinced about the apic->apic_id_valid(32768) thing
> but it should work well enough, and doesn't require exporting any extra
> state from apic.c that way.

Yeah, that part is odd.

I really dislike the way how irq_find_matching_fwspec() works. The 'rc'
value is actually boolean despite being type 'int' and if 'rc' is not 0
then it returns the domain even if 'rc' is an error code.

But that does not allow to return error codes from a domain match() /
select() callback which is what we really want to express that there is
something fishy.

Something like the below perhaps? Needs more thought obviously.

Marc, any opinion ?

Thanks,

        tglx

8<-------------

arch/x86/kernel/apic/vector.c |   18 ++++++++++++++++++
 include/linux/irqdomain.h     |    1 +
 kernel/irq/irqdomain.c        |    2 +-
 3 files changed, 20 insertions(+), 1 deletion(-)

--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -593,6 +593,24 @@ static int x86_vector_alloc_irqs(struct
 	return err;
 }
 
+struct irq_domain *x86_select_parent_domain(struct irq_fwspec *fwspec)
+{
+	struct irq_domain *dom;
+
+	/*
+	 * If Interrupt Remapping is enabled then the match function
+	 * returns either the remapping domain or an error code if the
+	 * device is not registered with the remapping unit.
+	 *
+	 * If remapping is not enabled, then the function returns NULL.
+	 */
+	dom = irq_find_matching_fwspec(fwspec, DOMAIN_BUS_IR);
+	if (dom)
+		return IS_ERR(dom) ? NULL : dom;
+
+	return x86_vector_domain;
+}
+
 #ifdef CONFIG_GENERIC_IRQ_DEBUGFS
 static void x86_vector_debug_show(struct seq_file *m, struct irq_domain *d,
 				  struct irq_data *irqd, int ind)
--- a/include/linux/irqdomain.h
+++ b/include/linux/irqdomain.h
@@ -85,6 +85,7 @@ enum irq_domain_bus_token {
 	DOMAIN_BUS_TI_SCI_INTA_MSI,
 	DOMAIN_BUS_WAKEUP,
 	DOMAIN_BUS_VMD_MSI,
+	DOMAIN_BUS_IR,
 };
 
 /**
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -395,7 +395,7 @@ struct irq_domain *irq_find_matching_fws
 			       (h->bus_token == bus_token)));
 
 		if (rc) {
-			found = h;
+			found = rc < 0 ? ERR_PTR(rc) : h;
 			break;
 		}
 	}
David Woodhouse Oct. 13, 2020, 7:52 a.m. UTC | #18
On Tue, 2020-10-13 at 00:13 +0200, Thomas Gleixner wrote:
> On Mon, Oct 12 2020 at 21:20, David Woodhouse wrote:
> > On Mon, 2020-10-12 at 20:38 +0200, Thomas Gleixner wrote:
> > > Nasty, but way better than what we have now. 
> > 
> > Want me to send that out in email or is the git tree enough for now?
> > 
> > I've cleaned it up a little and fixed a bug in the I/OAPIC error path.
> 
> Mail would be nice once you are confident with the pile.

After the next cup of coffee. Will send it as an incremental series on
top of my previous ext_dest_id series.

> > Still not entirely convinced about the apic->apic_id_valid(32768) thing
> > but it should work well enough, and doesn't require exporting any extra
> > state from apic.c that way.
> 
> Yeah, that part is odd.
> 
> I really dislike the way how irq_find_matching_fwspec() works. The 'rc'
> value is actually boolean despite being type 'int' and if 'rc' is not 0
> then it returns the domain even if 'rc' is an error code.
> 
> But that does not allow to return error codes from a domain match() /
> select() callback which is what we really want to express that there is
> something fishy.

I don't know that we need to return an explicit error. I made
x86_vector_select() refuse to match if IRQ remapping is enabled, which
means that the irq_find_matching_fwspec() call will fail by returning
NULL in precisely the cases it should. (Which should never happen; qv).

That failure means HPET will refuse to set up MSI, or I/OAPIC will
refuse to initialise (later causing a BUG and a failure to boot, which
is probably the correct thing to do).

It's all fine.

+       dom = irq_find_matching_fwspec(fwspec, DOMAIN_BUS_IR);
+       if (dom)
+               return IS_ERR(dom) ? NULL : dom;
+
+       return x86_vector_domain;
+}

Ick. There's no need for that.

Eliminating that awful "if not found then slip the x86_vector_domain in
as a special case" was the whole *point* of using
irq_find_matching_fwspec() in the first place.
Thomas Gleixner Oct. 13, 2020, 9:28 a.m. UTC | #19
On Tue, Oct 13 2020 at 08:52, David Woodhouse wrote:
> On Tue, 2020-10-13 at 00:13 +0200, Thomas Gleixner wrote:
> +       dom = irq_find_matching_fwspec(fwspec, DOMAIN_BUS_IR);
> +       if (dom)
> +               return IS_ERR(dom) ? NULL : dom;
> +
> +       return x86_vector_domain;
> +}
>
> Ick. There's no need for that.
>
> Eliminating that awful "if not found then slip the x86_vector_domain in
> as a special case" was the whole *point* of using
> irq_find_matching_fwspec() in the first place.

The point was to get rid of irq_remapping_get_irq_domain().

And TBH,

        if (apicid_valid(32768))

is just another way to slip the vector domain in. It's just differently
awful.

Having an explicit answer from the search for IR:

    - Here is the domain
    - Your device is not registered properly
    - IR not enabled or not supported

is way more obvious than the above disguised is_remapping_enabled()
check.

Thanks,

        tglx
David Woodhouse Oct. 13, 2020, 10:15 a.m. UTC | #20
On Tue, 2020-10-13 at 11:28 +0200, Thomas Gleixner wrote:
> On Tue, Oct 13 2020 at 08:52, David Woodhouse wrote:
> > On Tue, 2020-10-13 at 00:13 +0200, Thomas Gleixner wrote:
> > +       dom = irq_find_matching_fwspec(fwspec, DOMAIN_BUS_IR);
> > +       if (dom)
> > +               return IS_ERR(dom) ? NULL : dom;
> > +
> > +       return x86_vector_domain;
> > +}
> > 
> > Ick. There's no need for that.
> > 
> > Eliminating that awful "if not found then slip the x86_vector_domain in
> > as a special case" was the whole *point* of using
> > irq_find_matching_fwspec() in the first place.
> 
> The point was to get rid of irq_remapping_get_irq_domain().

My reason for doing it was to get rid of irq_remapping_get_irq_domain()
*because* I hated the special-casing and magical slipping in of
x86_vector_domain when it returned NULL.

> And TBH,
> 
>         if (apicid_valid(32768))
> 
> is just another way to slip the vector domain in. It's just differently
> awful.

For me, that's very much not just "slipping the vector domain in".
That's the vector domain returning true in its *own* ->select()
function, in the circumstances where it wants to be used.

The key difference is that nobody needs an external magic pointer to
the x86_vector_domain. In a true irqdomain hierarchy system, shouldn't
we be trying to eliminate *all* those magic pointers to specific
domains, if we can?

And sure, the apicid_valid(32768) as a proxy for irq_remapping_enabled
is a bit of an ugly trick. I suppose we can explicitly expose
irq_remapping_enabled from drivers/iommu if we wanted to.

> Having an explicit answer from the search for IR:
> 
>     - Here is the domain
>     - Your device is not registered properly
>     - IR not enabled or not supported
> 
> is way more obvious than the above disguised is_remapping_enabled()
> check.

I just don't even like thinking of it as a 'search for IR'.

HPET shouldn't be caring about IR any more than PCI devices do. It just
wants its parent irqdomain, that's all.

For I/OAPIC there's the slight complexity that it does actually ack
level-triggered interrupts differently when it's behind IR. But I don't
think we need a whole separate irq_chip for that; surely it could be
handled internally in ioapic_ack_level() ? 

Either way, even with that slight hack it's nicer to think of
mp_irqdomain_create() just wanting to find its parent domain, without
any special knowledge of IR and falling back to x86_parent_domain. The
hack for IR level-ack is then self-contained.
Thomas Gleixner Oct. 13, 2020, 10:46 a.m. UTC | #21
On Tue, Oct 13 2020 at 11:28, Thomas Gleixner wrote:
> On Tue, Oct 13 2020 at 08:52, David Woodhouse wrote:
>> On Tue, 2020-10-13 at 00:13 +0200, Thomas Gleixner wrote:
>> +       dom = irq_find_matching_fwspec(fwspec, DOMAIN_BUS_IR);
>> +       if (dom)
>> +               return IS_ERR(dom) ? NULL : dom;
>> +
>> +       return x86_vector_domain;
>> +}
>>
>> Ick. There's no need for that.
>>
>> Eliminating that awful "if not found then slip the x86_vector_domain in
>> as a special case" was the whole *point* of using
>> irq_find_matching_fwspec() in the first place.
>
> The point was to get rid of irq_remapping_get_irq_domain().
>
> And TBH,
>
>         if (apicid_valid(32768))
>
> is just another way to slip the vector domain in. It's just differently
> awful.
>
> Having an explicit answer from the search for IR:
>
>     - Here is the domain
>     - Your device is not registered properly
>     - IR not enabled or not supported
>
> is way more obvious than the above disguised is_remapping_enabled()
> check.

And after becoming more awake, that wont work anyway because there is
more than one IR domain, so there is no way to return an error "You
forgot to register" obviously.

But the APIC id (32768) valid check is also broken because IR can be
enabled even without X2APIC.

Thanks,

        tglx
David Woodhouse Oct. 13, 2020, 10:53 a.m. UTC | #22
On Tue, 2020-10-13 at 12:46 +0200, Thomas Gleixner wrote:
> And after becoming more awake, that wont work anyway because there is
> more than one IR domain, so there is no way to return an error "You
> forgot to register" obviously.
> 
> But the APIC id (32768) valid check is also broken because IR can be
> enabled even without X2APIC.

Nope, it's perfectly OK to allow HPET and I/OAPIC to be parented in the
x86_vector_domain in that case, regardless of IR.

The *actual* criterion for x86_vector_select() returning zero to say 
"don't use me" is "could there be CPUs in the system which can't be
reached through x86_vector_msi_compose_msg()?". It's not really about
IR at all.

The apic_id_valid(32768) check is checking precisely the right thing.
David Woodhouse Oct. 13, 2020, 11:51 a.m. UTC | #23
On Tue, 2020-10-13 at 11:53 +0100, David Woodhouse wrote:
> On Tue, 2020-10-13 at 12:46 +0200, Thomas Gleixner wrote:
> > And after becoming more awake, that wont work anyway because there is
> > more than one IR domain, so there is no way to return an error "You
> > forgot to register" obviously.
> > 
> > But the APIC id (32768) valid check is also broken because IR can be
> > enabled even without X2APIC.
> 
> Nope, it's perfectly OK to allow HPET and I/OAPIC to be parented in the
> x86_vector_domain in that case, regardless of IR.
> 
> The *actual* criterion for x86_vector_select() returning zero to say 
> "don't use me" is "could there be CPUs in the system which can't be
> reached through x86_vector_msi_compose_msg()?". It's not really about
> IR at all.
> 
> The apic_id_valid(32768) check is checking precisely the right thing.

With that realisation, I've fixed the comment in my ext_dest_id branch
to remove all mention of IRQ remapping. It now looks like this:

static int x86_vector_select(struct irq_domain *d, struct irq_fwspec *fwspec,
			     enum irq_domain_bus_token bus_token)
{
	/*
	 * HPET and I/OAPIC drivers use irq_find_matching_irqdomain()
	 * to find their parent irqdomain. For x86_vector_domain to be
	 * suitable, all CPUs in the system must be reachable by its
	 * x86_vector_msi_compose_msg() function. Which is only true
	 * in !x2apic mode, or in x2apic physical mode if APIC IDs were
	 * restricted to 8 or 15 bits at boot time. In those cases,
	 * 1<<15 will *not* be a valid APIC ID.
	 */
	if (apic->apic_id_valid(1<<15))
		return 0;

	return x86_fwspec_is_ioapic(fwspec) || x86_fwspec_is_hpet(fwspec);
}

That makes it clearer that this isn't just some incestuous interaction
with IRQ remapping — that APIC ID limit really is the basis on which
this irqdomain, all by itself, makes the decision about whether it's
capable of being the parent irqdomain to the requesting device.

It just so happens that *if* you allow higher APIC IDs in the system
and *if* no other irqdomains exist which take ownership of a given HPET
or I/OAPIC, then that child device will fail to find a suitable parent
irqdomain and will fail to initialise. And that's just how we want it.

Sure, we have that special case in x2apic startup where we ensure that
if we don't have IR then we limit the APIC IDs. But that's there, and
*this* code is perfectly self-contained and isolated.
Thomas Gleixner Oct. 13, 2020, 12:40 p.m. UTC | #24
On Tue, Oct 13 2020 at 12:51, David Woodhouse wrote:
> With that realisation, I've fixed the comment in my ext_dest_id branch
> to remove all mention of IRQ remapping. It now looks like this:
>
> static int x86_vector_select(struct irq_domain *d, struct irq_fwspec *fwspec,
> 			     enum irq_domain_bus_token bus_token)
> {
> 	/*
> 	 * HPET and I/OAPIC drivers use irq_find_matching_irqdomain()
> 	 * to find their parent irqdomain. For x86_vector_domain to be
> 	 * suitable, all CPUs in the system must be reachable by its
> 	 * x86_vector_msi_compose_msg() function. Which is only true
> 	 * in !x2apic mode, or in x2apic physical mode if APIC IDs were
> 	 * restricted to 8 or 15 bits at boot time. In those cases,
> 	 * 1<<15 will *not* be a valid APIC ID.
> 	 */
> 	if (apic->apic_id_valid(1<<15))
> 		return 0;
>
> 	return x86_fwspec_is_ioapic(fwspec) || x86_fwspec_is_hpet(fwspec);
> }
>
> That makes it clearer that this isn't just some incestuous interaction
> with IRQ remapping — that APIC ID limit really is the basis on which
> this irqdomain, all by itself, makes the decision about whether it's
> capable of being the parent irqdomain to the requesting device.

Yes, that makes sense now.

Thanks,

        tglx
diff mbox series

Patch

diff --git a/Documentation/virt/kvm/cpuid.rst b/Documentation/virt/kvm/cpuid.rst
index a7dff9186bed..1726b5925d2b 100644
--- a/Documentation/virt/kvm/cpuid.rst
+++ b/Documentation/virt/kvm/cpuid.rst
@@ -92,6 +92,10 @@  KVM_FEATURE_ASYNC_PF_INT          14          guest checks this feature bit
                                               async pf acknowledgment msr
                                               0x4b564d07.
 
+KVM_FEATURE_MSI_EXT_DEST_ID       15          guest checks this feature bit
+                                              before using extended destination
+                                              ID bits in MSI address bits 11-5.
+
 KVM_FEATURE_CLOCSOURCE_STABLE_BIT 24          host will warn if no guest-side
                                               per-cpu warps are expeced in
                                               kvmclock
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 812e9b4c1114..950afebfba88 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -32,6 +32,7 @@ 
 #define KVM_FEATURE_POLL_CONTROL	12
 #define KVM_FEATURE_PV_SCHED_YIELD	13
 #define KVM_FEATURE_ASYNC_PF_INT	14
+#define KVM_FEATURE_MSI_EXT_DEST_ID	15
 
 #define KVM_HINTS_REALTIME      0
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 1b51b727b140..4986b4399aef 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -743,12 +743,18 @@  static void __init kvm_init_platform(void)
 	x86_platform.apic_post_init = kvm_apic_init;
 }
 
+static bool __init kvm_msi_ext_dest_id(void)
+{
+	return kvm_para_has_feature(KVM_FEATURE_MSI_EXT_DEST_ID);
+}
+
 const __initconst struct hypervisor_x86 x86_hyper_kvm = {
 	.name			= "KVM",
 	.detect			= kvm_detect,
 	.type			= X86_HYPER_KVM,
 	.init.guest_late_init	= kvm_guest_init,
 	.init.x2apic_available	= kvm_para_available,
+	.init.msi_ext_dest_id	= kvm_msi_ext_dest_id,
 	.init.init_platform	= kvm_init_platform,
 };