
[RFC,1/1] platform-msi: Add platform check for subdevice irq domain

Message ID 20201210004624.345282-1-baolu.lu@linux.intel.com (mailing list archive)
State Not Applicable
Series [RFC,1/1] platform-msi: Add platform check for subdevice irq domain

Commit Message

Baolu Lu Dec. 10, 2020, 12:46 a.m. UTC
The pci_subdevice_msi_create_irq_domain() should fail if the underlying
platform is not able to support IMS (Interrupt Message Storage). Otherwise,
interrupt isolation is not guaranteed.

For x86, IMS is only supported on bare metal for now. We could enable it
in virtualization environments in the future if an interrupt HYPERCALL
domain is supported or the hardware has the capability of interrupt
isolation for subdevices.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/linux-pci/87pn4nk7nn.fsf@nanos.tec.linutronix.de/
Link: https://lore.kernel.org/linux-pci/877dqrnzr3.fsf@nanos.tec.linutronix.de/
Link: https://lore.kernel.org/linux-pci/877dqqmc2h.fsf@nanos.tec.linutronix.de/
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 arch/x86/pci/common.c       | 43 +++++++++++++++++++++++++++++++++++++
 drivers/base/platform-msi.c |  8 +++++++
 include/linux/msi.h         |  1 +
 3 files changed, 52 insertions(+)

Note:
Learnt from the discussions in this thread:

https://lore.kernel.org/linux-pci/160408357912.912050.17005584526266191420.stgit@djiang5-desk3.ch.intel.com/

The device IMS (Interrupt Message Storage) should not be enabled in any
virtualization environment unless there is a HYPERCALL domain which lets
the hypervisor manage changes to the message store.

As an initial step, we allow IMS to be enabled only when running on bare
metal. It will be easy to enable IMS in virtualization environments in
the future once the above precondition is met.

We have considered moving probably_on_bare_metal() to a more generic
file so that it could be better maintained and reused, but we need some
suggestions about where to put it. Your comments are appreciated.

This patch is for comment only. Please don't merge it. We will include
it in the Intel IMS implementation later once we reach a consensus.

Best regards,
baolu

Comments

David Woodhouse Dec. 10, 2020, 8:22 a.m. UTC | #1
On Thu, 2020-12-10 at 08:46 +0800, Lu Baolu wrote:
> +/*
> + * We want to figure out which context we are running in. But the hardware
> + * does not introduce a reliable way (instruction, CPUID leaf, MSR, whatever)
> + * which can be manipulated by the VMM to let the OS figure out where it runs.
> + * So we go with the below probably_on_bare_metal() function as a replacement
> + * for definitely_on_bare_metal() to go forward only for the very simple reason
> + * that this is the only option we have.
> + */
> +static const char * const possible_vmm_vendor_name[] = {
> +       "QEMU", "Bochs", "KVM", "Xen", "VMware", "VMW", "VMware Inc.",
> +       "innotek GmbH", "Oracle Corporation", "Parallels", "BHYVE",
> +       "Microsoft Corporation"
> +};

People do use SeaBIOS ("Bochs") on bare metal.

You'll also see "Amazon EC2" on virt instances as well as bare metal
instances. Although in that case I believe the virt instances do have
the 'virtual machine' flag set in bit 4 of the BIOS Characteristics
Extension Byte 2, and the bare metal obviously don't.
Bjorn Helgaas Dec. 10, 2020, 6:57 p.m. UTC | #2
On Thu, Dec 10, 2020 at 08:46:24AM +0800, Lu Baolu wrote:
> The pci_subdevice_msi_create_irq_domain() should fail if the underlying
> platform is not able to support IMS (Interrupt Message Storage). Otherwise,
> the isolation of interrupt is not guaranteed.
> 
> For x86, IMS is only supported on bare metal for now. We could enable it
> in the virtualization environments in the future if interrupt HYPERCALL
> domain is supported or the hardware has the capability of interrupt
> isolation for subdevices.

> + * We want to figure out which context we are running in. But the hardware
> + * does not introduce a reliable way (instruction, CPUID leaf, MSR, whatever)
> + * which can be manipulated by the VMM to let the OS figure out where it runs.
> + * So we go with the below probably_on_bare_metal() function as a replacement
> + * for definitely_on_bare_metal() to go forward only for the very simple reason
> + * that this is the only option we have.
> + */
> +static const char * const possible_vmm_vendor_name[] = {
> +	"QEMU", "Bochs", "KVM", "Xen", "VMware", "VMW", "VMware Inc.",
> +	"innotek GmbH", "Oracle Corporation", "Parallels", "BHYVE",
> +	"Microsoft Corporation"
> +};
> +
> +static bool probably_on_bare_metal(void)

What is the point of a function called probably_on_bare_metal()?
*Probably*?  The caller can't really do anything with the fact that
we're not 100% sure this gives the correct answer.  Just call it
"on_bare_metal()" or something and accept the fact that it might be
wrong sometimes.

This patch goes with IMS support, which somebody else is handling, so
I assume you don't need anything from the PCI side.
David Woodhouse Dec. 10, 2020, 7:15 p.m. UTC | #3
On Thu, 2020-12-10 at 12:57 -0600, Bjorn Helgaas wrote:
> What is the point of a function called probably_on_bare_metal()?
> *Probably*?  The caller can't really do anything with the fact that
> we're not 100% sure this gives the correct answer.  Just call it
> "on_bare_metal()" or something and accept the fact that it might be
> wrong sometimes.

Acknowledging that it's a heuristic is OK, but we should certainly be
clear about whether it has false positives, false negatives, or both.
Baolu Lu Dec. 16, 2020, 1:42 a.m. UTC | #4
Hi David,

On 12/10/20 4:22 PM, David Woodhouse wrote:
> On Thu, 2020-12-10 at 08:46 +0800, Lu Baolu wrote:
>> +/*
>> + * We want to figure out which context we are running in. But the hardware
>> + * does not introduce a reliable way (instruction, CPUID leaf, MSR, whatever)
>> + * which can be manipulated by the VMM to let the OS figure out where it runs.
>> + * So we go with the below probably_on_bare_metal() function as a replacement
>> + * for definitely_on_bare_metal() to go forward only for the very simple reason
>> + * that this is the only option we have.
>> + */
>> +static const char * const possible_vmm_vendor_name[] = {
>> +       "QEMU", "Bochs", "KVM", "Xen", "VMware", "VMW", "VMware Inc.",
>> +       "innotek GmbH", "Oracle Corporation", "Parallels", "BHYVE",
>> +       "Microsoft Corporation"
>> +};
> 
> People do use SeaBIOS ("Bochs") on bare metal.

Is there any reliable way to distinguish between running on bare metal
and in a VM?

> 
> You'll also see "Amazon EC2" on virt instances as well as bare metal
> instances. Although in that case I believe the virt instances do have
> the 'virtual machine' flag set in bit 4 of the BIOS Characteristics
> Extension Byte 2, and the bare metal obviously don't.
> 

So for the Amazon EC2 case, we can use this byte to distinguish them.
Could you please point me to a reference for this Extension Byte
(reference code, spec, or anything else)?

Best regards,
baolu
Baolu Lu Dec. 16, 2020, 1:48 a.m. UTC | #5
Hi Bjorn,

On 12/11/20 2:57 AM, Bjorn Helgaas wrote:
> On Thu, Dec 10, 2020 at 08:46:24AM +0800, Lu Baolu wrote:
>> The pci_subdevice_msi_create_irq_domain() should fail if the underlying
>> platform is not able to support IMS (Interrupt Message Storage). Otherwise,
>> the isolation of interrupt is not guaranteed.
>>
>> For x86, IMS is only supported on bare metal for now. We could enable it
>> in the virtualization environments in the future if interrupt HYPERCALL
>> domain is supported or the hardware has the capability of interrupt
>> isolation for subdevices.
> 
>> + * We want to figure out which context we are running in. But the hardware
>> + * does not introduce a reliable way (instruction, CPUID leaf, MSR, whatever)
>> + * which can be manipulated by the VMM to let the OS figure out where it runs.
>> + * So we go with the below probably_on_bare_metal() function as a replacement
>> + * for definitely_on_bare_metal() to go forward only for the very simple reason
>> + * that this is the only option we have.
>> + */
>> +static const char * const possible_vmm_vendor_name[] = {
>> +	"QEMU", "Bochs", "KVM", "Xen", "VMware", "VMW", "VMware Inc.",
>> +	"innotek GmbH", "Oracle Corporation", "Parallels", "BHYVE",
>> +	"Microsoft Corporation"
>> +};
>> +
>> +static bool probably_on_bare_metal(void)
> 
> What is the point of a function called probably_on_bare_metal()?
> *Probably*?  The caller can't really do anything with the fact that
> we're not 100% sure this gives the correct answer.  Just call it
> "on_bare_metal()" or something and accept the fact that it might be
> wrong sometimes.

Agreed. We can use on_bare_metal() and add comments and kernel messages
to let users and developers know that we're not 100% sure. People can
then help make it more accurate by reporting exceptions.

> 
> This patch goes with IMS support, which somebody else is handling, so
> I assume you don't need anything from the PCI side.

Yes. This is a follow-up to the previous discussion.

Best regards,
baolu
Tian, Kevin Jan. 7, 2021, 7:01 a.m. UTC | #6
> From: David Woodhouse <dwmw2@infradead.org>
> Sent: Thursday, December 10, 2020 4:23 PM
> 
> On Thu, 2020-12-10 at 08:46 +0800, Lu Baolu wrote:
> > +/*
> > + * We want to figure out which context we are running in. But the hardware
> > + * does not introduce a reliable way (instruction, CPUID leaf, MSR, whatever)
> > + * which can be manipulated by the VMM to let the OS figure out where it runs.
> > + * So we go with the below probably_on_bare_metal() function as a replacement
> > + * for definitely_on_bare_metal() to go forward only for the very simple reason
> > + * that this is the only option we have.
> > + */
> > +static const char * const possible_vmm_vendor_name[] = {
> > +       "QEMU", "Bochs", "KVM", "Xen", "VMware", "VMW", "VMware Inc.",
> > +       "innotek GmbH", "Oracle Corporation", "Parallels", "BHYVE",
> > +       "Microsoft Corporation"
> > +};
> 
> People do use SeaBIOS ("Bochs") on bare metal.
> 
> You'll also see "Amazon EC2" on virt instances as well as bare metal
> instances. Although in that case I believe the virt instances do have
> the 'virtual machine' flag set in bit 4 of the BIOS Characteristics
> Extension Byte 2, and the bare metal obviously don't.
> 

Do those virtual instances have the CPUID hypervisor bit set? If so,
they can be differentiated from bare metal instances without checking
the vendor list.

By the way, do you know whether this 'virtual machine' flag is widely
used in virtualization environments? If so, we should probably check
this flag even before checking DMI_SYS_VENDOR. It sounds more
general...

Thanks
Kevin

Patch

diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
index 3507f456fcd0..2fbebb406cac 100644
--- a/arch/x86/pci/common.c
+++ b/arch/x86/pci/common.c
@@ -724,3 +724,46 @@  struct pci_dev *pci_real_dma_dev(struct pci_dev *dev)
 	return dev;
 }
 #endif
+
+/*
+ * We want to figure out which context we are running in. But the hardware
+ * does not introduce a reliable way (instruction, CPUID leaf, MSR, whatever)
+ * which can be manipulated by the VMM to let the OS figure out where it runs.
+ * So we go with the below probably_on_bare_metal() function as a replacement
+ * for definitely_on_bare_metal() to go forward only for the very simple reason
+ * that this is the only option we have.
+ */
+static const char * const possible_vmm_vendor_name[] = {
+	"QEMU", "Bochs", "KVM", "Xen", "VMware", "VMW", "VMware Inc.",
+	"innotek GmbH", "Oracle Corporation", "Parallels", "BHYVE",
+	"Microsoft Corporation"
+};
+
+static bool probably_on_bare_metal(void)
+{
+	int i;
+
+	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
+		return false;
+
+	for (i = 0; i < ARRAY_SIZE(possible_vmm_vendor_name); i++)
+		if (dmi_match(DMI_SYS_VENDOR, possible_vmm_vendor_name[i]))
+			return false;
+
+	pr_info("System running on bare metal, report to bugzilla.kernel.org if not the case.\n");
+
+	return true;
+}
+
+bool arch_support_pci_device_ims(struct pci_dev *pdev)
+{
+	/*
+	 * When we are running in a VMM context, the device IMS could only be
+	 * enabled when the underlying hardware supports interrupt isolation
+	 * of the subdevice, or any mechanism (trap, hypercall) is added so
+	 * that changes in the interrupt message store could be managed by the
+	 * VMM. For now, we only support the device IMS when we are running on
+	 * the bare metal.
+	 */
+	return probably_on_bare_metal();
+}
diff --git a/drivers/base/platform-msi.c b/drivers/base/platform-msi.c
index 8432a1bf4e28..88e5fe4dae67 100644
--- a/drivers/base/platform-msi.c
+++ b/drivers/base/platform-msi.c
@@ -512,6 +512,11 @@  struct irq_domain *device_msi_create_irq_domain(struct fwnode_handle *fn,
 #ifdef CONFIG_PCI
 #include <linux/pci.h>
 
+bool __weak arch_support_pci_device_ims(struct pci_dev *pdev)
+{
+	return false;
+}
+
 /**
  * pci_subdevice_msi_create_irq_domain - Create an irq domain for subdevices
  * @pdev:	Pointer to PCI device for which the subdevice domain is created
@@ -523,6 +528,9 @@  struct irq_domain *pci_subdevice_msi_create_irq_domain(struct pci_dev *pdev,
 	struct irq_domain *domain, *pdev_msi;
 	struct fwnode_handle *fn;
 
+	if (!arch_support_pci_device_ims(pdev))
+		return NULL;
+
 	/*
 	 * Retrieve the MSI domain of the underlying PCI device's MSI
 	 * domain. The PCI device domain's parent domain is also the parent
diff --git a/include/linux/msi.h b/include/linux/msi.h
index f319d7c6a4ef..6fda81c4b859 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -478,6 +478,7 @@  struct irq_domain *device_msi_create_irq_domain(struct fwnode_handle *fn,
 						struct irq_domain *parent);
 
 # ifdef CONFIG_PCI
+bool arch_support_pci_device_ims(struct pci_dev *pdev);
 struct irq_domain *pci_subdevice_msi_create_irq_domain(struct pci_dev *pdev,
 						       struct msi_domain_info *info);
 # endif