[v2] x86/PCI: Prefer MMIO over PIO on VMware hypervisor

Message ID 1662448117-10807-1-git-send-email-akaher@vmware.com (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas

Commit Message

Ajay Kaher Sept. 6, 2022, 7:08 a.m. UTC
During boot there are many PCI config reads. These can be performed using
either port I/O instructions (PIO) or memory-mapped I/O (MMIO).

PIO is less efficient than MMIO: it requires twice as many PCI accesses,
and PIO instructions are serializing. As a result, MMIO should be preferred
over PIO when possible.
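
For illustration only (not part of the patch): a minimal userspace sketch
of why PIO costs two accesses per config read while MMIO costs one. It
assumes the standard type 1 (0xCF8/0xCFC) mechanism for registers below
256 and the standard ECAM layout. The helper names conf1_address() and
ecam_offset() are hypothetical, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

#define CONF1_ADDR_PORT 0xCF8	/* address register, written first */
#define CONF1_DATA_PORT 0xCFC	/* data register, accessed second */

/*
 * PIO (type 1): the value written to CONF1_ADDR_PORT before the data is
 * read from CONF1_DATA_PORT. Two port accesses per config read.
 */
static uint32_t conf1_address(unsigned int bus, unsigned int devfn,
			      unsigned int reg)
{
	assert(bus < 256 && devfn < 256 && reg < 256);
	return 0x80000000u | (bus << 16) | (devfn << 8) | (reg & 0xFCu);
}

/*
 * MMIO (ECAM): byte offset from the MMCONFIG base address. A single load
 * from base + offset returns the data.
 */
static uint64_t ecam_offset(unsigned int bus, unsigned int devfn,
			    unsigned int reg)
{
	assert(bus < 256 && devfn < 256 && reg < 4096);
	return ((uint64_t)bus << 20) | ((uint64_t)devfn << 12) | reg;
}
```

With PIO, a read is a write of conf1_address(...) to port 0xCF8 followed
by a read from port 0xCFC. With MMIO it is one load from the MMCONFIG base
plus ecam_offset(...).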

Virtual machine test results using the VMware hypervisor.
100,000 reads using raw_pci_read() took:
PIO: 12.809 seconds
MMIO: 8.517 seconds (~33.5% faster than PIO)

Currently, when these reads are performed by a virtual machine, they all
cause a VM-exit, and therefore each one of them induces a considerable
overhead.

This overhead can be reduced further by mapping the virtual machine's MMIO
region to a memory area that holds the values the "emulated hardware" is
supposed to return. The memory region is mapped as "read-only" in the
NPT/EPT, so reads from this region are treated as regular memory reads.
Writes are still trapped and emulated by the hypervisor.

Virtual machine test results with the above changes in the VMware hypervisor.
100,000 reads using raw_pci_read() took:
PIO: 12.809 seconds
MMIO: 0.010 seconds

This reduces virtual machine PCI scan and initialization time by ~65%. In
our case it dropped from ~55 ms to ~18 ms.

MMIO is also faster than PIO on bare-metal systems, but due to some bugs
with legacy hardware and the smaller gains on bare-metal, it seems prudent
not to change bare-metal behavior.

Signed-off-by: Ajay Kaher <akaher@vmware.com>
---
v1 -> v2:
Limit changes to apply only to VMs [Matthew W.]
---
 arch/x86/pci/common.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

Comments

Vitaly Kuznetsov Sept. 7, 2022, 3:20 p.m. UTC | #1
Ajay Kaher <akaher@vmware.com> writes:

> During boot-time there are many PCI config reads, these could be performed
> either using Port IO instructions (PIO) or memory mapped I/O (MMIO).
>
> PIO are less efficient than MMIO, they require twice as many PCI accesses
> and PIO instructions are serializing. As a result, MMIO should be preferred
> when possible over PIO.
>
> Virtual Machine test result using VMware hypervisor
> 1 hundred thousand reads using raw_pci_read() took:
> PIO: 12.809 seconds
> MMIO: 8.517 seconds (~33.5% faster than PIO)
>
> Currently, when these reads are performed by a virtual machine, they all
> cause a VM-exit, and therefore each one of them induces a considerable
> overhead.
>
> This overhead can be further improved, by mapping MMIO region of virtual
> machine to memory area that holds the values that the "emulated hardware"
> is supposed to return. The memory region is mapped as "read-only" in the
> NPT/EPT, so reads from these regions would be treated as regular memory
> reads. Writes would still be trapped and emulated by the hypervisor.
>
> Virtual Machine test result with above changes in VMware hypervisor
> 1 hundred thousand read using raw_pci_read() took:
> PIO: 12.809 seconds
> MMIO: 0.010 seconds
>
> This helps to reduce virtual machine PCI scan and initialization time by
> ~65%. In our case it reduced to ~18 mSec from ~55 mSec.
>
> MMIO is also faster than PIO on bare-metal systems, but due to some bugs
> with legacy hardware and the smaller gains on bare-metal, it seems prudent
> not to change bare-metal behavior.

Out of curiosity, are we sure MMIO *always* works for other hypervisors
besides VMware? Various Hyper-V versions can probably be tested (were
they?), but with KVM it's much harder as PCI is emulated in the VMM and
there's certainly more than one in existence...

>
> Signed-off-by: Ajay Kaher <akaher@vmware.com>
> ---
> v1 -> v2:
> Limit changes to apply only to VMs [Matthew W.]
> ---
>  arch/x86/pci/common.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 45 insertions(+)
>
> diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
> index ddb7986..1e5a8f7 100644
> --- a/arch/x86/pci/common.c
> +++ b/arch/x86/pci/common.c
> @@ -20,6 +20,7 @@
>  #include <asm/pci_x86.h>
>  #include <asm/setup.h>
>  #include <asm/irqdomain.h>
> +#include <asm/hypervisor.h>
>  
>  unsigned int pci_probe = PCI_PROBE_BIOS | PCI_PROBE_CONF1 | PCI_PROBE_CONF2 |
>  				PCI_PROBE_MMCONF;
> @@ -57,14 +58,58 @@ int raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
>  	return -EINVAL;
>  }
>  
> +#ifdef CONFIG_HYPERVISOR_GUEST
> +static int vm_raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
> +						int reg, int len, u32 *val)
> +{
> +	if (raw_pci_ext_ops)
> +		return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
> +	if (domain == 0 && reg < 256 && raw_pci_ops)
> +		return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
> +	return -EINVAL;
> +}
> +
> +static int vm_raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
> +						int reg, int len, u32 val)
> +{
> +	if (raw_pci_ext_ops)
> +		return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val);
> +	if (domain == 0 && reg < 256 && raw_pci_ops)
> +		return raw_pci_ops->write(domain, bus, devfn, reg, len, val);
> +	return -EINVAL;
> +}

These look exactly like raw_pci_read()/raw_pci_write() but with inverted
priority. We could've added a parameter but to be more flexible, I'd
suggest we add a 'priority' field to 'struct pci_raw_ops' and make
raw_pci_read()/raw_pci_write() check it before deciding what to use
first. To be on the safe side, you can leave raw_pci_ops's priority
higher than raw_pci_ext_ops's by default and only tweak it in
arch/x86/kernel/cpu/vmware.c.
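
For illustration only: a minimal userspace sketch of what the 'priority'
suggestion above might look like. This is an assumption about one possible
shape, not code from the patch or the kernel. The 'priority' field, the
demo_* backends and the DEMO_* values are hypothetical, and the struct here
is a stand-in for the kernel's 'struct pci_raw_ops'. Only the read path is
shown; the write path would be symmetric.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the kernel struct, plus the hypothetical 'priority'. */
struct pci_raw_ops {
	int (*read)(unsigned int domain, unsigned int bus, unsigned int devfn,
		    int reg, int len, uint32_t *val);
	int priority;	/* higher value is tried first */
};

static struct pci_raw_ops *raw_pci_ops;		/* PIO backend */
static struct pci_raw_ops *raw_pci_ext_ops;	/* MMIO backend */

static int raw_pci_read(unsigned int domain, unsigned int bus,
			unsigned int devfn, int reg, int len, uint32_t *val)
{
	/* PIO only reaches domain 0 and the first 256 bytes. */
	bool pio_ok = raw_pci_ops && domain == 0 && reg < 256;
	bool mmio_ok = raw_pci_ext_ops != NULL;

	/* On a tie PIO wins, which keeps today's default ordering. */
	if (mmio_ok &&
	    (!pio_ok || raw_pci_ext_ops->priority > raw_pci_ops->priority))
		return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
	if (pio_ok)
		return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
	return -EINVAL;
}

/* Demo backends that only record which path was taken. */
enum { DEMO_PIO = 1, DEMO_MMIO = 2 };

static int demo_pio_read(unsigned int domain, unsigned int bus,
			 unsigned int devfn, int reg, int len, uint32_t *val)
{
	(void)domain; (void)bus; (void)devfn; (void)reg; (void)len;
	*val = DEMO_PIO;
	return 0;
}

static int demo_mmio_read(unsigned int domain, unsigned int bus,
			  unsigned int devfn, int reg, int len, uint32_t *val)
{
	(void)domain; (void)bus; (void)devfn; (void)reg; (void)len;
	*val = DEMO_MMIO;
	return 0;
}

static struct pci_raw_ops demo_pio_ops = { demo_pio_read, 0 };
static struct pci_raw_ops demo_mmio_ops = { demo_mmio_read, 0 };
```

In this sketch a hypervisor's platform setup code would raise the MMIO
backend's priority (e.g. demo_mmio_ops.priority = 10), and every other
platform would keep the existing PIO-first order.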

> +#endif /* CONFIG_HYPERVISOR_GUEST */
> +
>  static int pci_read(struct pci_bus *bus, unsigned int devfn, int where, int size, u32 *value)
>  {
> +#ifdef CONFIG_HYPERVISOR_GUEST
> +	/*
> +	 * MMIO is faster than PIO, but due to some bugs with legacy
> +	 * hardware, it seems prudent to prefer MMIO for VMs and PIO
> +	 * for bare-metal.
> +	 */
> +	if (!hypervisor_is_type(X86_HYPER_NATIVE))
> +		return vm_raw_pci_read(pci_domain_nr(bus), bus->number,
> +					 devfn, where, size, value);
> +#endif /* CONFIG_HYPERVISOR_GUEST */
> +
>  	return raw_pci_read(pci_domain_nr(bus), bus->number,
>  				 devfn, where, size, value);
>  }
>  
>  static int pci_write(struct pci_bus *bus, unsigned int devfn, int where, int size, u32 value)
>  {
> +#ifdef CONFIG_HYPERVISOR_GUEST
> +	/*
> +	 * MMIO is faster than PIO, but due to some bugs with legacy
> +	 * hardware, it seems prudent to prefer MMIO for VMs and PIO
> +	 * for bare-metal.
> +	 */
> +	if (!hypervisor_is_type(X86_HYPER_NATIVE))
> +		return vm_raw_pci_write(pci_domain_nr(bus), bus->number,
> +					  devfn, where, size, value);
> +#endif /* CONFIG_HYPERVISOR_GUEST */
> +
>  	return raw_pci_write(pci_domain_nr(bus), bus->number,
>  				  devfn, where, size, value);
>  }

Wei Liu Sept. 12, 2022, 3:17 p.m. UTC | #2
On Tue, Sep 06, 2022 at 12:38:37PM +0530, Ajay Kaher wrote:
> During boot-time there are many PCI config reads, these could be performed
> either using Port IO instructions (PIO) or memory mapped I/O (MMIO).
> 
> PIO are less efficient than MMIO, they require twice as many PCI accesses
> and PIO instructions are serializing. As a result, MMIO should be preferred
> when possible over PIO.
> 
> Virtual Machine test result using VMware hypervisor
> 1 hundred thousand reads using raw_pci_read() took:
> PIO: 12.809 seconds
> MMIO: 8.517 seconds (~33.5% faster than PIO)
> 
> Currently, when these reads are performed by a virtual machine, they all
> cause a VM-exit, and therefore each one of them induces a considerable
> overhead.
> 
> This overhead can be further improved, by mapping MMIO region of virtual
> machine to memory area that holds the values that the "emulated hardware"
> is supposed to return. The memory region is mapped as "read-only" in the
> NPT/EPT, so reads from these regions would be treated as regular memory
> reads. Writes would still be trapped and emulated by the hypervisor.
> 
> Virtual Machine test result with above changes in VMware hypervisor
> 1 hundred thousand read using raw_pci_read() took:
> PIO: 12.809 seconds
> MMIO: 0.010 seconds
> 
> This helps to reduce virtual machine PCI scan and initialization time by
> ~65%. In our case it reduced to ~18 mSec from ~55 mSec.
> 
> MMIO is also faster than PIO on bare-metal systems, but due to some bugs
> with legacy hardware and the smaller gains on bare-metal, it seems prudent
> not to change bare-metal behavior.
> 
> Signed-off-by: Ajay Kaher <akaher@vmware.com>

The subject line should be fixed -- you're changing the behaviour for
all hypervisors, not just VMware. I almost skipped this because of the
subject line.

Thanks,
Wei.

Patch

diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
index ddb7986..1e5a8f7 100644
--- a/arch/x86/pci/common.c
+++ b/arch/x86/pci/common.c
@@ -20,6 +20,7 @@ 
 #include <asm/pci_x86.h>
 #include <asm/setup.h>
 #include <asm/irqdomain.h>
+#include <asm/hypervisor.h>
 
 unsigned int pci_probe = PCI_PROBE_BIOS | PCI_PROBE_CONF1 | PCI_PROBE_CONF2 |
 				PCI_PROBE_MMCONF;
@@ -57,14 +58,58 @@  int raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
 	return -EINVAL;
 }
 
+#ifdef CONFIG_HYPERVISOR_GUEST
+static int vm_raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
+						int reg, int len, u32 *val)
+{
+	if (raw_pci_ext_ops)
+		return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
+	if (domain == 0 && reg < 256 && raw_pci_ops)
+		return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
+	return -EINVAL;
+}
+
+static int vm_raw_pci_write(unsigned int domain, unsigned int bus, unsigned int devfn,
+						int reg, int len, u32 val)
+{
+	if (raw_pci_ext_ops)
+		return raw_pci_ext_ops->write(domain, bus, devfn, reg, len, val);
+	if (domain == 0 && reg < 256 && raw_pci_ops)
+		return raw_pci_ops->write(domain, bus, devfn, reg, len, val);
+	return -EINVAL;
+}
+#endif /* CONFIG_HYPERVISOR_GUEST */
+
 static int pci_read(struct pci_bus *bus, unsigned int devfn, int where, int size, u32 *value)
 {
+#ifdef CONFIG_HYPERVISOR_GUEST
+	/*
+	 * MMIO is faster than PIO, but due to some bugs with legacy
+	 * hardware, it seems prudent to prefer MMIO for VMs and PIO
+	 * for bare-metal.
+	 */
+	if (!hypervisor_is_type(X86_HYPER_NATIVE))
+		return vm_raw_pci_read(pci_domain_nr(bus), bus->number,
+					 devfn, where, size, value);
+#endif /* CONFIG_HYPERVISOR_GUEST */
+
 	return raw_pci_read(pci_domain_nr(bus), bus->number,
 				 devfn, where, size, value);
 }
 
 static int pci_write(struct pci_bus *bus, unsigned int devfn, int where, int size, u32 value)
 {
+#ifdef CONFIG_HYPERVISOR_GUEST
+	/*
+	 * MMIO is faster than PIO, but due to some bugs with legacy
+	 * hardware, it seems prudent to prefer MMIO for VMs and PIO
+	 * for bare-metal.
+	 */
+	if (!hypervisor_is_type(X86_HYPER_NATIVE))
+		return vm_raw_pci_write(pci_domain_nr(bus), bus->number,
+					  devfn, where, size, value);
+#endif /* CONFIG_HYPERVISOR_GUEST */
+
 	return raw_pci_write(pci_domain_nr(bus), bus->number,
 				  devfn, where, size, value);
 }