PCI: hv: Do not set PCI_COMMAND_MEMORY to reduce VM boot time

Message ID 20220419220007.26550-1-decui@microsoft.com (mailing list archive)
State Superseded
Delegated to: Lorenzo Pieralisi
Series: PCI: hv: Do not set PCI_COMMAND_MEMORY to reduce VM boot time

Commit Message

Dexuan Cui April 19, 2022, 10 p.m. UTC
A VM on Azure can have 14 GPUs, and each GPU may have a huge MMIO BAR,
e.g. 128 GB. Currently the boot time of such a VM can be 4+ minutes, and
most of the time is used by the host to unmap/map the vBAR from/to pBAR
when the VM clears and sets the PCI_COMMAND_MEMORY bit: each unmap/map
operation for a 128GB BAR needs about 1.8 seconds, and the pci-hyperv
driver and the Linux PCI subsystem flip the PCI_COMMAND_MEMORY bit
eight times (see pci_setup_device() -> pci_read_bases() and
pci_std_update_resource()), increasing the boot time by 1.8 * 8 = 14.4
seconds per GPU, i.e. 14.4 * 14 = 201.6 seconds in total.
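
For reference, the probing path toggles the decode bits around BAR
sizing roughly as in the following simplified sketch (modeled on
pci_read_bases() in drivers/pci/probe.c; not the exact upstream code):

static void sketch_read_bases(struct pci_dev *dev, unsigned int howmany)
{
	u16 orig_cmd;
	unsigned int pos;

	/* Flip 1: turn off I/O and memory decoding before sizing. */
	pci_read_config_word(dev, PCI_COMMAND, &orig_cmd);
	if (orig_cmd & (PCI_COMMAND_IO | PCI_COMMAND_MEMORY))
		pci_write_config_word(dev, PCI_COMMAND,
			orig_cmd & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY));

	/*
	 * Size each BAR by writing all 1s and reading the value back;
	 * __pci_read_base() returns 1 for a 64-bit BAR, which consumes
	 * two BAR slots.
	 */
	for (pos = 0; pos < howmany; pos++)
		pos += __pci_read_base(dev, pci_bar_unknown,
				       &dev->resource[pos],
				       PCI_BASE_ADDRESS_0 + (pos << 2));

	/* Flip 2: restore the original command word. */
	if (orig_cmd & (PCI_COMMAND_IO | PCI_COMMAND_MEMORY))
		pci_write_config_word(dev, PCI_COMMAND, orig_cmd);
}

If PCI_COMMAND_MEMORY was never set in the first place, neither config
write above changes the bit, so Hyper-V has no vBAR to unmap or remap.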

Fix the slowness by not turning on the PCI_COMMAND_MEMORY bit in pci-hyperv.c,
so the bit stays in the off state before the PCI device driver calls
pci_enable_device(): when the bit is off, pci_read_bases() and
pci_std_update_resource() don't cause Hyper-V to unmap/map the vBARs.
With this change, the boot time of such a VM is reduced by
1.8 * (8-1) * 14 = 176.4 seconds.
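
Leaving the bit clear is safe because the PCI core turns it on when the
device driver calls pci_enable_device(). A simplified sketch of the
pci_enable_resources() pattern (not the exact upstream code):

static int sketch_enable_resources(struct pci_dev *dev, int mask)
{
	u16 cmd, old_cmd;
	int i;

	pci_read_config_word(dev, PCI_COMMAND, &cmd);
	old_cmd = cmd;

	for (i = 0; i < PCI_NUM_RESOURCES; i++) {
		struct resource *r = &dev->resource[i];

		if (!(mask & (1 << i)))
			continue;
		if (r->flags & IORESOURCE_IO)
			cmd |= PCI_COMMAND_IO;
		if (r->flags & IORESOURCE_MEM)
			cmd |= PCI_COMMAND_MEMORY;
	}

	if (cmd != old_cmd)
		/* The one remaining flip: a single vBAR map per device. */
		pci_write_config_word(dev, PCI_COMMAND, cmd);

	return 0;
}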

Tested-by: Boqun Feng (Microsoft) <boqun.feng@gmail.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Cc: Jake Oshins <jakeo@microsoft.com>
---
 drivers/pci/controller/pci-hyperv.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

Comments

Michael Kelley (LINUX) April 20, 2022, 2:47 p.m. UTC | #1
From: Dexuan Cui <decui@microsoft.com> Sent: Tuesday, April 19, 2022 3:00 PM
> 
> A VM on Azure can have 14 GPUs, and each GPU may have a huge MMIO BAR,
> e.g. 128 GB. Currently the boot time of such a VM can be 4+ minutes, and
> most of the time is used by the host to unmap/map the vBAR from/to pBAR
> when the VM clears and sets the PCI_COMMAND_MEMORY bit: each unmap/map
> operation for a 128GB BAR needs about 1.8 seconds, and the pci-hyperv
> driver and the Linux PCI subsystem flip the PCI_COMMAND_MEMORY bit
> eight times (see pci_setup_device() -> pci_read_bases() and
> pci_std_update_resource()), increasing the boot time by 1.8 * 8 = 14.4
> seconds per GPU, i.e. 14.4 * 14 = 201.6 seconds in total.
> 
> Fix the slowness by not turning on the PCI_COMMAND_MEMORY bit in pci-hyperv.c,
> so the bit stays in the off state before the PCI device driver calls
> pci_enable_device(): when the bit is off, pci_read_bases() and
> pci_std_update_resource() don't cause Hyper-V to unmap/map the vBARs.
> With this change, the boot time of such a VM is reduced by
> 1.8 * (8-1) * 14 = 176.4 seconds.
> 
> Tested-by: Boqun Feng (Microsoft) <boqun.feng@gmail.com>
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> Cc: Jake Oshins <jakeo@microsoft.com>
> ---
>  drivers/pci/controller/pci-hyperv.c | 17 +++++++++++------
>  1 file changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index d270a204324e..f9fbbd8d94db 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -2082,12 +2082,17 @@ static void prepopulate_bars(struct hv_pcibus_device *hbus)
>  				}
>  			}
>  			if (high_size <= 1 && low_size <= 1) {
> -				/* Set the memory enable bit. */
> -				_hv_pcifront_read_config(hpdev, PCI_COMMAND, 2,
> -							 &command);
> -				command |= PCI_COMMAND_MEMORY;
> -				_hv_pcifront_write_config(hpdev, PCI_COMMAND, 2,
> -							  command);
> +				/*
> +				 * No need to set the PCI_COMMAND_MEMORY bit as
> +				 * the core PCI driver doesn't require the bit
> +				 * to be pre-set. Actually here we intentionally
> +				 * keep the bit off so that the PCI BAR probing
> +				 * in the core PCI driver doesn't cause Hyper-V
> +				 * to unnecessarily unmap/map the virtual BARs
> +				 * from/to the physical BARs multiple times.
> +				 * This reduces the VM boot time significantly
> +				 * if the BAR sizes are huge.
> +				 */
>  				break;
>  			}
>  		}
> --
> 2.17.1

Reviewed-by: Michael Kelley <mikelley@microsoft.com>

Lorenzo Pieralisi April 29, 2022, 10:15 a.m. UTC | #2
On Tue, Apr 19, 2022 at 03:00:07PM -0700, Dexuan Cui wrote:
> A VM on Azure can have 14 GPUs, and each GPU may have a huge MMIO BAR,
> e.g. 128 GB. Currently the boot time of such a VM can be 4+ minutes, and
> most of the time is used by the host to unmap/map the vBAR from/to pBAR
> when the VM clears and sets the PCI_COMMAND_MEMORY bit: each unmap/map
> operation for a 128GB BAR needs about 1.8 seconds, and the pci-hyperv
> driver and the Linux PCI subsystem flip the PCI_COMMAND_MEMORY bit
> eight times (see pci_setup_device() -> pci_read_bases() and
> pci_std_update_resource()), increasing the boot time by 1.8 * 8 = 14.4
> seconds per GPU, i.e. 14.4 * 14 = 201.6 seconds in total.
> 
> Fix the slowness by not turning on the PCI_COMMAND_MEMORY bit in pci-hyperv.c,
> so the bit stays in the off state before the PCI device driver calls
> pci_enable_device(): when the bit is off, pci_read_bases() and
> pci_std_update_resource() don't cause Hyper-V to unmap/map the vBARs.
> With this change, the boot time of such a VM is reduced by
> 1.8 * (8-1) * 14 = 176.4 seconds.

I believe you need to clarify this commit message. It took me a while
to understand what you are really doing.

What this patch is doing is bootstrapping PCI devices with command
memory clear because there is no need to have it set (?) in the first
place and because, if it is set, the PCI core layer needs to toggle it
on and off in order to, e.g., size BAR regions, which causes the slowdown
you are reporting.
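
For reference, the other path named in the commit message,
pci_std_update_resource(), disables and restores memory decoding around
64-bit BAR updates. A simplified sketch of that pattern (not the exact
drivers/pci/setup-res.c code):

static void sketch_update_bar(struct pci_dev *dev, int reg,
			      struct resource *res,
			      struct pci_bus_region *region)
{
	u16 cmd;
	bool disable = (res->flags & IORESOURCE_MEM_64) &&
		       !dev->mmio_always_on;

	if (disable) {
		/*
		 * Clear memory decoding so the device never decodes a
		 * half-written 64-bit BAR; on Hyper-V this flip alone
		 * triggers a vBAR unmap.
		 */
		pci_read_config_word(dev, PCI_COMMAND, &cmd);
		pci_write_config_word(dev, PCI_COMMAND,
				      cmd & ~PCI_COMMAND_MEMORY);
	}

	pci_write_config_dword(dev, reg, lower_32_bits(region->start));
	if (res->flags & IORESOURCE_MEM_64)
		pci_write_config_dword(dev, reg + 4,
				       upper_32_bits(region->start));

	if (disable)
		/* Restore the command word: the matching vBAR remap. */
		pci_write_config_word(dev, PCI_COMMAND, cmd);
}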

I assume, given the above, that there is strictly no need to have
devices with command memory set at kernel startup handover, and if
there were, it would not be set in the PCI Hyper-V host controller
driver (because that's what you are _removing_).

I think this should not be merged as a fix and I'd be careful
about possible regressions before sending it to stable kernels,
if you wish to do so.

It is fine by me to go via the Hyper-V tree even though I don't
see why that's better, unless you want to send it as a fix and
I think you should not.

You can add my tag but the commit log should be rewritten and
you should add a Link: to the discussion thread.

Acked-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>

> Tested-by: Boqun Feng (Microsoft) <boqun.feng@gmail.com>
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> Cc: Jake Oshins <jakeo@microsoft.com>
> ---
>  drivers/pci/controller/pci-hyperv.c | 17 +++++++++++------
>  1 file changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index d270a204324e..f9fbbd8d94db 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -2082,12 +2082,17 @@ static void prepopulate_bars(struct hv_pcibus_device *hbus)
>  				}
>  			}
>  			if (high_size <= 1 && low_size <= 1) {
> -				/* Set the memory enable bit. */
> -				_hv_pcifront_read_config(hpdev, PCI_COMMAND, 2,
> -							 &command);
> -				command |= PCI_COMMAND_MEMORY;
> -				_hv_pcifront_write_config(hpdev, PCI_COMMAND, 2,
> -							  command);
> +				/*
> +				 * No need to set the PCI_COMMAND_MEMORY bit as
> +				 * the core PCI driver doesn't require the bit
> +				 * to be pre-set. Actually here we intentionally
> +				 * keep the bit off so that the PCI BAR probing
> +				 * in the core PCI driver doesn't cause Hyper-V
> +				 * to unnecessarily unmap/map the virtual BARs
> +				 * from/to the physical BARs multiple times.
> +				 * This reduces the VM boot time significantly
> +				 * if the BAR sizes are huge.
> +				 */
>  				break;
>  			}
>  		}
> -- 
> 2.17.1

Dexuan Cui April 29, 2022, 3:47 p.m. UTC | #3
> From: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
> Sent: Friday, April 29, 2022 3:15 AM
> To: Dexuan Cui <decui@microsoft.com>
> Cc: wei.liu@kernel.org; KY Srinivasan <kys@microsoft.com>;
> Haiyang Zhang <haiyangz@microsoft.com>;
> Stephen Hemminger <sthemmin@microsoft.com>; bhelgaas@google.com;
> linux-hyperv@vger.kernel.org; linux-pci@vger.kernel.org;
> linux-kernel@vger.kernel.org;
> Michael Kelley (LINUX) <mikelley@microsoft.com>; robh@kernel.org;
> kw@linux.com; Jake Oshins <jakeo@microsoft.com>
> Subject: Re: [PATCH] PCI: hv: Do not set PCI_COMMAND_MEMORY to reduce
> VM boot time
> 
> On Tue, Apr 19, 2022 at 03:00:07PM -0700, Dexuan Cui wrote:
> > A VM on Azure can have 14 GPUs, and each GPU may have a huge MMIO BAR,
> > e.g. 128 GB. Currently the boot time of such a VM can be 4+ minutes, and
> > most of the time is used by the host to unmap/map the vBAR from/to pBAR
> > when the VM clears and sets the PCI_COMMAND_MEMORY bit: each unmap/map
> > operation for a 128GB BAR needs about 1.8 seconds, and the pci-hyperv
> > driver and the Linux PCI subsystem flip the PCI_COMMAND_MEMORY bit
> > eight times (see pci_setup_device() -> pci_read_bases() and
> > pci_std_update_resource()), increasing the boot time by 1.8 * 8 = 14.4
> > seconds per GPU, i.e. 14.4 * 14 = 201.6 seconds in total.
> >
> > Fix the slowness by not turning on the PCI_COMMAND_MEMORY bit in
> > pci-hyperv.c, so the bit stays in the off state before the PCI device
> > driver calls pci_enable_device(): when the bit is off, pci_read_bases()
> > and pci_std_update_resource() don't cause Hyper-V to unmap/map the vBARs.
> > With this change, the boot time of such a VM is reduced by
> > 1.8 * (8-1) * 14 = 176.4 seconds.
> 
> I believe you need to clarify this commit message. It took me a while
> to understand what you are really doing.
> 
> What this patch is doing is bootstrapping PCI devices with command
> memory clear because there is no need to have it set (?) in the first
> place and because, if it is set, the PCI core layer needs to toggle it
> on and off in order to, e.g., size BAR regions, which causes the slowdown
> you are reporting.
> 
> I assume, given the above, that there is strictly no need to have
> devices with command memory set at kernel startup handover, and if
> there were, it would not be set in the PCI Hyper-V host controller
> driver (because that's what you are _removing_).
> 
> I think this should not be merged as a fix and I'd be careful
> about possible regressions before sending it to stable kernels,
> if you wish to do so.
> 
> It is fine by me to go via the Hyper-V tree even though I don't
> see why that's better, unless you want to send it as a fix and
> I think you should not.
> 
> You can add my tag but the commit log should be rewritten and
> you should add a Link: to the discussion thread.
> 
> Acked-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>

Thanks, Lorenzo! I'll post v2 with the commit message revised.
It's OK with me to have this patch go through the hyperv-next branch
rather than hyperv-fixes.

Patch

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index d270a204324e..f9fbbd8d94db 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -2082,12 +2082,17 @@ static void prepopulate_bars(struct hv_pcibus_device *hbus)
 				}
 			}
 			if (high_size <= 1 && low_size <= 1) {
-				/* Set the memory enable bit. */
-				_hv_pcifront_read_config(hpdev, PCI_COMMAND, 2,
-							 &command);
-				command |= PCI_COMMAND_MEMORY;
-				_hv_pcifront_write_config(hpdev, PCI_COMMAND, 2,
-							  command);
+				/*
+				 * No need to set the PCI_COMMAND_MEMORY bit as
+				 * the core PCI driver doesn't require the bit
+				 * to be pre-set. Actually here we intentionally
+				 * keep the bit off so that the PCI BAR probing
+				 * in the core PCI driver doesn't cause Hyper-V
+				 * to unnecessarily unmap/map the virtual BARs
+				 * from/to the physical BARs multiple times.
+				 * This reduces the VM boot time significantly
+				 * if the BAR sizes are huge.
+				 */
 				break;
 			}
 		}