diff mbox series

PCI: Batch BAR sizing operations

Message ID 20250111210652.402845-1-alex.williamson@redhat.com (mailing list archive)
State Superseded
Commit 43a30551d392ef7442a2574019db1f4ba74a9e5a
Delegated to: Bjorn Helgaas
Headers show
Series PCI: Batch BAR sizing operations | expand

Commit Message

Alex Williamson Jan. 11, 2025, 9:06 p.m. UTC
Toggling memory enable is free on bare metal, but potentially expensive
in virtualized environments as the device MMIO spaces are added and
removed from the VM address space, including DMA mapping of those spaces
through the IOMMU where peer-to-peer is supported.  Currently memory
decode is disabled around sizing each individual BAR, even for SR-IOV
BARs while VF Enable is cleared.

This can be better optimized for virtual environments by sizing a set
of BARs at once, stashing the resulting mask into an array, while only
toggling memory enable once.  This also naturally improves the SR-IOV
path as the caller becomes responsible for any necessary decode disables
while sizing BARs, therefore SR-IOV BARs are sized relying only on the
VF Enable rather than toggling the PF memory enable in the command
register.

Reported-by: Mitchell Augustin <mitchell.augustin@canonical.com>
Link: https://lore.kernel.org/all/CAHTA-uYp07FgM6T1OZQKqAdSA5JrZo0ReNEyZgQZub4mDRrV5w@mail.gmail.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

This is an alternative to the patch proposed by Mitchell[1] and more
in line with my suggestion in the original report thread[2].  It makes
more sense to me to stash all the BAR sizing values into an array on
the front end of parsing them into resources than it does to pass
multiple arrays on the backend for status printing purposes.  We can
discuss in either of these patches which is the better approach or
if someone has a better yet alternative.

I don't have quite the config Mitchell has for testing, but this
should make effectively the same improvement and does show a
significant improvement in guest boot time even with a single 24GB
GPU attached.  There are of course further improvements to investigate
in the VMM, but disabling memory decode per BAR is a good start to
making Linux be a friendlier guest.  Further testing appreciate.
Thanks,

Alex

[1]https://lore.kernel.org/all/20241218224258.2225210-1-mitchell.augustin@canonical.com/
[2]https://lore.kernel.org/all/20241203150620.15431c5c.alex.williamson@redhat.com/

 drivers/pci/iov.c   |   8 ++-
 drivers/pci/pci.h   |   4 +-
 drivers/pci/probe.c | 132 +++++++++++++++++++++++++++++---------------
 3 files changed, 97 insertions(+), 47 deletions(-)

Comments

Mitchell Augustin Jan. 13, 2025, 5:46 p.m. UTC | #1
Thanks Alex.

Your approach is a bit cleaner than mine, and I verified that it
results in the same performance boost as my patch on my setup without
any noticeable regressions, so from my POV, either patch would be a
good solution.

Reviewed-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>
Tested-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>


On Sat, Jan 11, 2025 at 3:07 PM Alex Williamson
<alex.williamson@redhat.com> wrote:
>
> Toggling memory enable is free on bare metal, but potentially expensive
> in virtualized environments as the device MMIO spaces are added and
> removed from the VM address space, including DMA mapping of those spaces
> through the IOMMU where peer-to-peer is supported.  Currently memory
> decode is disabled around sizing each individual BAR, even for SR-IOV
> BARs while VF Enable is cleared.
>
> This can be better optimized for virtual environments by sizing a set
> of BARs at once, stashing the resulting mask into an array, while only
> toggling memory enable once.  This also naturally improves the SR-IOV
> path as the caller becomes responsible for any necessary decode disables
> while sizing BARs, therefore SR-IOV BARs are sized relying only on the
> VF Enable rather than toggling the PF memory enable in the command
> register.
>
> Reported-by: Mitchell Augustin <mitchell.augustin@canonical.com>
> Link: https://lore.kernel.org/all/CAHTA-uYp07FgM6T1OZQKqAdSA5JrZo0ReNEyZgQZub4mDRrV5w@mail.gmail.com
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>
> This is an alternative to the patch proposed by Mitchell[1] and more
> in line with my suggestion in the original report thread[2].  It makes
> more sense to me to stash all the BAR sizing values into an array on
> the front end of parsing them into resources than it does to pass
> multiple arrays on the backend for status printing purposes.  We can
> discuss in either of these patches which is the better approach or
> if someone has a better yet alternative.
>
> I don't have quite the config Mitchell has for testing, but this
> should make effectively the same improvement and does show a
> significant improvement in guest boot time even with a single 24GB
> GPU attached.  There are of course further improvements to investigate
> in the VMM, but disabling memory decode per BAR is a good start to
> making Linux be a friendlier guest.  Further testing appreciate.
> Thanks,
>
> Alex
>
> [1]https://lore.kernel.org/all/20241218224258.2225210-1-mitchell.augustin@canonical.com/
> [2]https://lore.kernel.org/all/20241203150620.15431c5c.alex.williamson@redhat.com/
>
>  drivers/pci/iov.c   |   8 ++-
>  drivers/pci/pci.h   |   4 +-
>  drivers/pci/probe.c | 132 +++++++++++++++++++++++++++++---------------
>  3 files changed, 97 insertions(+), 47 deletions(-)
>
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index 4be402fe9ab9..9e4770cdd4d5 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -747,6 +747,7 @@ static int sriov_init(struct pci_dev *dev, int pos)
>         struct resource *res;
>         const char *res_name;
>         struct pci_dev *pdev;
> +       u32 sriovbars[PCI_SRIOV_NUM_BARS];
>
>         pci_read_config_word(dev, pos + PCI_SRIOV_CTRL, &ctrl);
>         if (ctrl & PCI_SRIOV_CTRL_VFE) {
> @@ -783,6 +784,10 @@ static int sriov_init(struct pci_dev *dev, int pos)
>         if (!iov)
>                 return -ENOMEM;
>
> +       /* Sizing SR-IOV BARs with VF Enable cleared - no decode */
> +       __pci_size_stdbars(dev, PCI_SRIOV_NUM_BARS,
> +                          pos + PCI_SRIOV_BAR, sriovbars);
> +
>         nres = 0;
>         for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
>                 res = &dev->resource[i + PCI_IOV_RESOURCES];
> @@ -796,7 +801,8 @@ static int sriov_init(struct pci_dev *dev, int pos)
>                         bar64 = (res->flags & IORESOURCE_MEM_64) ? 1 : 0;
>                 else
>                         bar64 = __pci_read_base(dev, pci_bar_unknown, res,
> -                                               pos + PCI_SRIOV_BAR + i * 4);
> +                                               pos + PCI_SRIOV_BAR + i * 4,
> +                                               &sriovbars[i]);
>                 if (!res->flags)
>                         continue;
>                 if (resource_size(res) & (PAGE_SIZE - 1)) {
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 2e40fc63ba31..6f27651c2a70 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -315,8 +315,10 @@ bool pci_bus_generic_read_dev_vendor_id(struct pci_bus *bus, int devfn, u32 *pl,
>  int pci_idt_bus_quirk(struct pci_bus *bus, int devfn, u32 *pl, int rrs_timeout);
>
>  int pci_setup_device(struct pci_dev *dev);
> +void __pci_size_stdbars(struct pci_dev *dev, int count,
> +                       unsigned int pos, u32 *sizes);
>  int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
> -                   struct resource *res, unsigned int reg);
> +                   struct resource *res, unsigned int reg, u32 *sizes);
>  void pci_configure_ari(struct pci_dev *dev);
>  void __pci_bus_size_bridges(struct pci_bus *bus,
>                         struct list_head *realloc_head);
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 2e81ab0f5a25..5ca96280d698 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -164,50 +164,75 @@ static inline unsigned long decode_bar(struct pci_dev *dev, u32 bar)
>
>  #define PCI_COMMAND_DECODE_ENABLE      (PCI_COMMAND_MEMORY | PCI_COMMAND_IO)
>
> +/**
> + * __pci_size_bars - Read the raw BAR mask for a range of PCI BARs
> + * @dev: the PCI device
> + * @count: number of BARs to size
> + * @pos: starting config space position
> + * @sizes: array to store mask values
> + * @rom: indicate whether to use ROM mask, which avoids enabling ROM BARs
> + *
> + * Provided @sizes array must be sufficiently sized to store results for
> + * @count u32 BARs.  Caller is responsible for disabling decode to specified
> + * BAR range around calling this function.  This function is intended to avoid
> + * disabling decode around sizing each BAR individually, which can result in
> + * non-trivial overhead in virtualized environments with very large PCI BARs.
> + */
> +static void __pci_size_bars(struct pci_dev *dev, int count,
> +                           unsigned int pos, u32 *sizes, bool rom)
> +{
> +       u32 orig, mask = rom ? PCI_ROM_ADDRESS_MASK : ~0;
> +       int i;
> +
> +       for (i = 0; i < count; i++, pos += 4, sizes++) {
> +               pci_read_config_dword(dev, pos, &orig);
> +               pci_write_config_dword(dev, pos, mask);
> +               pci_read_config_dword(dev, pos, sizes);
> +               pci_write_config_dword(dev, pos, orig);
> +
> +               /*
> +                * All bits set in size means the device isn't working properly.
> +                * If the BAR isn't implemented, all bits must be 0.  If it's a
> +                * memory BAR or a ROM, bit 0 must be clear; if it's an io BAR,
> +                * bit 1 must be clear.
> +                */
> +               if (PCI_POSSIBLE_ERROR(*sizes))
> +                       *sizes = 0;
> +       }
> +}
> +
> +void __pci_size_stdbars(struct pci_dev *dev, int count,
> +                       unsigned int pos, u32 *sizes)
> +{
> +       __pci_size_bars(dev, count, pos, sizes, false);
> +}
> +
> +static void __pci_size_rom(struct pci_dev *dev, unsigned int pos, u32 *sizes)
> +{
> +       __pci_size_bars(dev, 1, pos, sizes, true);
> +}
> +
>  /**
>   * __pci_read_base - Read a PCI BAR
>   * @dev: the PCI device
>   * @type: type of the BAR
>   * @res: resource buffer to be filled in
>   * @pos: BAR position in the config space
> + * @sizes: array of one or more pre-read BAR masks
>   *
>   * Returns 1 if the BAR is 64-bit, or 0 if 32-bit.
>   */
>  int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
> -                   struct resource *res, unsigned int pos)
> +                   struct resource *res, unsigned int pos, u32 *sizes)
>  {
> -       u32 l = 0, sz = 0, mask;
> +       u32 l = 0;
>         u64 l64, sz64, mask64;
> -       u16 orig_cmd;
>         struct pci_bus_region region, inverted_region;
>         const char *res_name = pci_resource_name(dev, res - dev->resource);
>
> -       mask = type ? PCI_ROM_ADDRESS_MASK : ~0;
> -
> -       /* No printks while decoding is disabled! */
> -       if (!dev->mmio_always_on) {
> -               pci_read_config_word(dev, PCI_COMMAND, &orig_cmd);
> -               if (orig_cmd & PCI_COMMAND_DECODE_ENABLE) {
> -                       pci_write_config_word(dev, PCI_COMMAND,
> -                               orig_cmd & ~PCI_COMMAND_DECODE_ENABLE);
> -               }
> -       }
> -
>         res->name = pci_name(dev);
>
>         pci_read_config_dword(dev, pos, &l);
> -       pci_write_config_dword(dev, pos, l | mask);
> -       pci_read_config_dword(dev, pos, &sz);
> -       pci_write_config_dword(dev, pos, l);
> -
> -       /*
> -        * All bits set in sz means the device isn't working properly.
> -        * If the BAR isn't implemented, all bits must be 0.  If it's a
> -        * memory BAR or a ROM, bit 0 must be clear; if it's an io BAR, bit
> -        * 1 must be clear.
> -        */
> -       if (PCI_POSSIBLE_ERROR(sz))
> -               sz = 0;
>
>         /*
>          * I don't know how l can have all bits set.  Copied from old code.
> @@ -221,35 +246,30 @@ int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
>                 res->flags |= IORESOURCE_SIZEALIGN;
>                 if (res->flags & IORESOURCE_IO) {
>                         l64 = l & PCI_BASE_ADDRESS_IO_MASK;
> -                       sz64 = sz & PCI_BASE_ADDRESS_IO_MASK;
> +                       sz64 = *sizes & PCI_BASE_ADDRESS_IO_MASK;
>                         mask64 = PCI_BASE_ADDRESS_IO_MASK & (u32)IO_SPACE_LIMIT;
>                 } else {
>                         l64 = l & PCI_BASE_ADDRESS_MEM_MASK;
> -                       sz64 = sz & PCI_BASE_ADDRESS_MEM_MASK;
> +                       sz64 = *sizes & PCI_BASE_ADDRESS_MEM_MASK;
>                         mask64 = (u32)PCI_BASE_ADDRESS_MEM_MASK;
> +
> +                       if (res->flags & IORESOURCE_MEM_64) {
> +                               pci_read_config_dword(dev, pos + 4, &l);
> +                               sizes++;
> +
> +                               l64 |= ((u64)l << 32);
> +                               sz64 |= ((u64)*sizes << 32);
> +                               mask64 |= ((u64)~0 << 32);
> +                       }
>                 }
>         } else {
>                 if (l & PCI_ROM_ADDRESS_ENABLE)
>                         res->flags |= IORESOURCE_ROM_ENABLE;
>                 l64 = l & PCI_ROM_ADDRESS_MASK;
> -               sz64 = sz & PCI_ROM_ADDRESS_MASK;
> +               sz64 = *sizes & PCI_ROM_ADDRESS_MASK;
>                 mask64 = PCI_ROM_ADDRESS_MASK;
>         }
>
> -       if (res->flags & IORESOURCE_MEM_64) {
> -               pci_read_config_dword(dev, pos + 4, &l);
> -               pci_write_config_dword(dev, pos + 4, ~0);
> -               pci_read_config_dword(dev, pos + 4, &sz);
> -               pci_write_config_dword(dev, pos + 4, l);
> -
> -               l64 |= ((u64)l << 32);
> -               sz64 |= ((u64)sz << 32);
> -               mask64 |= ((u64)~0 << 32);
> -       }
> -
> -       if (!dev->mmio_always_on && (orig_cmd & PCI_COMMAND_DECODE_ENABLE))
> -               pci_write_config_word(dev, PCI_COMMAND, orig_cmd);
> -
>         if (!sz64)
>                 goto fail;
>
> @@ -320,7 +340,11 @@ int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
>
>  static void pci_read_bases(struct pci_dev *dev, unsigned int howmany, int rom)
>  {
> +       u32 rombar, stdbars[PCI_STD_NUM_BARS];
>         unsigned int pos, reg;
> +       u16 orig_cmd;
> +
> +       BUILD_BUG_ON(howmany > PCI_STD_NUM_BARS);
>
>         if (dev->non_compliant_bars)
>                 return;
> @@ -329,10 +353,28 @@ static void pci_read_bases(struct pci_dev *dev, unsigned int howmany, int rom)
>         if (dev->is_virtfn)
>                 return;
>
> +       /* No printks while decoding is disabled! */
> +       if (!dev->mmio_always_on) {
> +               pci_read_config_word(dev, PCI_COMMAND, &orig_cmd);
> +               if (orig_cmd & PCI_COMMAND_DECODE_ENABLE) {
> +                       pci_write_config_word(dev, PCI_COMMAND,
> +                               orig_cmd & ~PCI_COMMAND_DECODE_ENABLE);
> +               }
> +       }
> +
> +       __pci_size_stdbars(dev, howmany, PCI_BASE_ADDRESS_0, stdbars);
> +       if (rom)
> +               __pci_size_rom(dev, rom, &rombar);
> +
> +       if (!dev->mmio_always_on &&
> +           (orig_cmd & PCI_COMMAND_DECODE_ENABLE))
> +               pci_write_config_word(dev, PCI_COMMAND, orig_cmd);
> +
>         for (pos = 0; pos < howmany; pos++) {
>                 struct resource *res = &dev->resource[pos];
>                 reg = PCI_BASE_ADDRESS_0 + (pos << 2);
> -               pos += __pci_read_base(dev, pci_bar_unknown, res, reg);
> +               pos += __pci_read_base(dev, pci_bar_unknown,
> +                                      res, reg, &stdbars[pos]);
>         }
>
>         if (rom) {
> @@ -340,7 +382,7 @@ static void pci_read_bases(struct pci_dev *dev, unsigned int howmany, int rom)
>                 dev->rom_base_reg = rom;
>                 res->flags = IORESOURCE_MEM | IORESOURCE_PREFETCH |
>                                 IORESOURCE_READONLY | IORESOURCE_SIZEALIGN;
> -               __pci_read_base(dev, pci_bar_mem32, res, rom);
> +               __pci_read_base(dev, pci_bar_mem32, res, rom, &rombar);
>         }
>  }
>
> --
> 2.47.1
>


--
Mitchell Augustin
Software Engineer - Ubuntu Partner Engineering
Bjorn Helgaas Jan. 14, 2025, 9:12 p.m. UTC | #2
On Sat, Jan 11, 2025 at 02:06:51PM -0700, Alex Williamson wrote:
> Toggling memory enable is free on bare metal, but potentially expensive
> in virtualized environments as the device MMIO spaces are added and
> removed from the VM address space, including DMA mapping of those spaces
> through the IOMMU where peer-to-peer is supported.  Currently memory
> decode is disabled around sizing each individual BAR, even for SR-IOV
> BARs while VF Enable is cleared.
> 
> This can be better optimized for virtual environments by sizing a set
> of BARs at once, stashing the resulting mask into an array, while only
> toggling memory enable once.  This also naturally improves the SR-IOV
> path as the caller becomes responsible for any necessary decode disables
> while sizing BARs, therefore SR-IOV BARs are sized relying only on the
> VF Enable rather than toggling the PF memory enable in the command
> register.
> 
> Reported-by: Mitchell Augustin <mitchell.augustin@canonical.com>
> Link: https://lore.kernel.org/all/CAHTA-uYp07FgM6T1OZQKqAdSA5JrZo0ReNEyZgQZub4mDRrV5w@mail.gmail.com
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

Applied to pci/enumeration for v6.14, thanks.

> ---
> 
> This is an alternative to the patch proposed by Mitchell[1] and more
> in line with my suggestion in the original report thread[2].  It makes
> more sense to me to stash all the BAR sizing values into an array on
> the front end of parsing them into resources than it does to pass
> multiple arrays on the backend for status printing purposes.  We can
> discuss in either of these patches which is the better approach or
> if someone has a better yet alternative.
> 
> I don't have quite the config Mitchell has for testing, but this
> should make effectively the same improvement and does show a
> significant improvement in guest boot time even with a single 24GB
> GPU attached.  There are of course further improvements to investigate
> in the VMM, but disabling memory decode per BAR is a good start to
> making Linux be a friendlier guest.  Further testing appreciate.
> Thanks,
> 
> Alex
> 
> [1]https://lore.kernel.org/all/20241218224258.2225210-1-mitchell.augustin@canonical.com/
> [2]https://lore.kernel.org/all/20241203150620.15431c5c.alex.williamson@redhat.com/
> 
>  drivers/pci/iov.c   |   8 ++-
>  drivers/pci/pci.h   |   4 +-
>  drivers/pci/probe.c | 132 +++++++++++++++++++++++++++++---------------
>  3 files changed, 97 insertions(+), 47 deletions(-)
> 
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index 4be402fe9ab9..9e4770cdd4d5 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -747,6 +747,7 @@ static int sriov_init(struct pci_dev *dev, int pos)
>  	struct resource *res;
>  	const char *res_name;
>  	struct pci_dev *pdev;
> +	u32 sriovbars[PCI_SRIOV_NUM_BARS];
>  
>  	pci_read_config_word(dev, pos + PCI_SRIOV_CTRL, &ctrl);
>  	if (ctrl & PCI_SRIOV_CTRL_VFE) {
> @@ -783,6 +784,10 @@ static int sriov_init(struct pci_dev *dev, int pos)
>  	if (!iov)
>  		return -ENOMEM;
>  
> +	/* Sizing SR-IOV BARs with VF Enable cleared - no decode */
> +	__pci_size_stdbars(dev, PCI_SRIOV_NUM_BARS,
> +			   pos + PCI_SRIOV_BAR, sriovbars);
> +
>  	nres = 0;
>  	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
>  		res = &dev->resource[i + PCI_IOV_RESOURCES];
> @@ -796,7 +801,8 @@ static int sriov_init(struct pci_dev *dev, int pos)
>  			bar64 = (res->flags & IORESOURCE_MEM_64) ? 1 : 0;
>  		else
>  			bar64 = __pci_read_base(dev, pci_bar_unknown, res,
> -						pos + PCI_SRIOV_BAR + i * 4);
> +						pos + PCI_SRIOV_BAR + i * 4,
> +						&sriovbars[i]);
>  		if (!res->flags)
>  			continue;
>  		if (resource_size(res) & (PAGE_SIZE - 1)) {
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 2e40fc63ba31..6f27651c2a70 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -315,8 +315,10 @@ bool pci_bus_generic_read_dev_vendor_id(struct pci_bus *bus, int devfn, u32 *pl,
>  int pci_idt_bus_quirk(struct pci_bus *bus, int devfn, u32 *pl, int rrs_timeout);
>  
>  int pci_setup_device(struct pci_dev *dev);
> +void __pci_size_stdbars(struct pci_dev *dev, int count,
> +			unsigned int pos, u32 *sizes);
>  int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
> -		    struct resource *res, unsigned int reg);
> +		    struct resource *res, unsigned int reg, u32 *sizes);
>  void pci_configure_ari(struct pci_dev *dev);
>  void __pci_bus_size_bridges(struct pci_bus *bus,
>  			struct list_head *realloc_head);
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 2e81ab0f5a25..5ca96280d698 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -164,50 +164,75 @@ static inline unsigned long decode_bar(struct pci_dev *dev, u32 bar)
>  
>  #define PCI_COMMAND_DECODE_ENABLE	(PCI_COMMAND_MEMORY | PCI_COMMAND_IO)
>  
> +/**
> + * __pci_size_bars - Read the raw BAR mask for a range of PCI BARs
> + * @dev: the PCI device
> + * @count: number of BARs to size
> + * @pos: starting config space position
> + * @sizes: array to store mask values
> + * @rom: indicate whether to use ROM mask, which avoids enabling ROM BARs
> + *
> + * Provided @sizes array must be sufficiently sized to store results for
> + * @count u32 BARs.  Caller is responsible for disabling decode to specified
> + * BAR range around calling this function.  This function is intended to avoid
> + * disabling decode around sizing each BAR individually, which can result in
> + * non-trivial overhead in virtualized environments with very large PCI BARs.
> + */
> +static void __pci_size_bars(struct pci_dev *dev, int count,
> +			    unsigned int pos, u32 *sizes, bool rom)
> +{
> +	u32 orig, mask = rom ? PCI_ROM_ADDRESS_MASK : ~0;
> +	int i;
> +
> +	for (i = 0; i < count; i++, pos += 4, sizes++) {
> +		pci_read_config_dword(dev, pos, &orig);
> +		pci_write_config_dword(dev, pos, mask);
> +		pci_read_config_dword(dev, pos, sizes);
> +		pci_write_config_dword(dev, pos, orig);
> +
> +		/*
> +		 * All bits set in size means the device isn't working properly.
> +		 * If the BAR isn't implemented, all bits must be 0.  If it's a
> +		 * memory BAR or a ROM, bit 0 must be clear; if it's an io BAR,
> +		 * bit 1 must be clear.
> +		 */
> +		if (PCI_POSSIBLE_ERROR(*sizes))
> +			*sizes = 0;
> +	}
> +}
> +
> +void __pci_size_stdbars(struct pci_dev *dev, int count,
> +			unsigned int pos, u32 *sizes)
> +{
> +	__pci_size_bars(dev, count, pos, sizes, false);
> +}
> +
> +static void __pci_size_rom(struct pci_dev *dev, unsigned int pos, u32 *sizes)
> +{
> +	__pci_size_bars(dev, 1, pos, sizes, true);
> +}
> +
>  /**
>   * __pci_read_base - Read a PCI BAR
>   * @dev: the PCI device
>   * @type: type of the BAR
>   * @res: resource buffer to be filled in
>   * @pos: BAR position in the config space
> + * @sizes: array of one or more pre-read BAR masks
>   *
>   * Returns 1 if the BAR is 64-bit, or 0 if 32-bit.
>   */
>  int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
> -		    struct resource *res, unsigned int pos)
> +		    struct resource *res, unsigned int pos, u32 *sizes)
>  {
> -	u32 l = 0, sz = 0, mask;
> +	u32 l = 0;
>  	u64 l64, sz64, mask64;
> -	u16 orig_cmd;
>  	struct pci_bus_region region, inverted_region;
>  	const char *res_name = pci_resource_name(dev, res - dev->resource);
>  
> -	mask = type ? PCI_ROM_ADDRESS_MASK : ~0;
> -
> -	/* No printks while decoding is disabled! */
> -	if (!dev->mmio_always_on) {
> -		pci_read_config_word(dev, PCI_COMMAND, &orig_cmd);
> -		if (orig_cmd & PCI_COMMAND_DECODE_ENABLE) {
> -			pci_write_config_word(dev, PCI_COMMAND,
> -				orig_cmd & ~PCI_COMMAND_DECODE_ENABLE);
> -		}
> -	}
> -
>  	res->name = pci_name(dev);
>  
>  	pci_read_config_dword(dev, pos, &l);
> -	pci_write_config_dword(dev, pos, l | mask);
> -	pci_read_config_dword(dev, pos, &sz);
> -	pci_write_config_dword(dev, pos, l);
> -
> -	/*
> -	 * All bits set in sz means the device isn't working properly.
> -	 * If the BAR isn't implemented, all bits must be 0.  If it's a
> -	 * memory BAR or a ROM, bit 0 must be clear; if it's an io BAR, bit
> -	 * 1 must be clear.
> -	 */
> -	if (PCI_POSSIBLE_ERROR(sz))
> -		sz = 0;
>  
>  	/*
>  	 * I don't know how l can have all bits set.  Copied from old code.
> @@ -221,35 +246,30 @@ int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
>  		res->flags |= IORESOURCE_SIZEALIGN;
>  		if (res->flags & IORESOURCE_IO) {
>  			l64 = l & PCI_BASE_ADDRESS_IO_MASK;
> -			sz64 = sz & PCI_BASE_ADDRESS_IO_MASK;
> +			sz64 = *sizes & PCI_BASE_ADDRESS_IO_MASK;
>  			mask64 = PCI_BASE_ADDRESS_IO_MASK & (u32)IO_SPACE_LIMIT;
>  		} else {
>  			l64 = l & PCI_BASE_ADDRESS_MEM_MASK;
> -			sz64 = sz & PCI_BASE_ADDRESS_MEM_MASK;
> +			sz64 = *sizes & PCI_BASE_ADDRESS_MEM_MASK;
>  			mask64 = (u32)PCI_BASE_ADDRESS_MEM_MASK;
> +
> +			if (res->flags & IORESOURCE_MEM_64) {
> +				pci_read_config_dword(dev, pos + 4, &l);
> +				sizes++;
> +
> +				l64 |= ((u64)l << 32);
> +				sz64 |= ((u64)*sizes << 32);
> +				mask64 |= ((u64)~0 << 32);
> +			}
>  		}
>  	} else {
>  		if (l & PCI_ROM_ADDRESS_ENABLE)
>  			res->flags |= IORESOURCE_ROM_ENABLE;
>  		l64 = l & PCI_ROM_ADDRESS_MASK;
> -		sz64 = sz & PCI_ROM_ADDRESS_MASK;
> +		sz64 = *sizes & PCI_ROM_ADDRESS_MASK;
>  		mask64 = PCI_ROM_ADDRESS_MASK;
>  	}
>  
> -	if (res->flags & IORESOURCE_MEM_64) {
> -		pci_read_config_dword(dev, pos + 4, &l);
> -		pci_write_config_dword(dev, pos + 4, ~0);
> -		pci_read_config_dword(dev, pos + 4, &sz);
> -		pci_write_config_dword(dev, pos + 4, l);
> -
> -		l64 |= ((u64)l << 32);
> -		sz64 |= ((u64)sz << 32);
> -		mask64 |= ((u64)~0 << 32);
> -	}
> -
> -	if (!dev->mmio_always_on && (orig_cmd & PCI_COMMAND_DECODE_ENABLE))
> -		pci_write_config_word(dev, PCI_COMMAND, orig_cmd);
> -
>  	if (!sz64)
>  		goto fail;
>  
> @@ -320,7 +340,11 @@ int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
>  
>  static void pci_read_bases(struct pci_dev *dev, unsigned int howmany, int rom)
>  {
> +	u32 rombar, stdbars[PCI_STD_NUM_BARS];
>  	unsigned int pos, reg;
> +	u16 orig_cmd;
> +
> +	BUILD_BUG_ON(howmany > PCI_STD_NUM_BARS);
>  
>  	if (dev->non_compliant_bars)
>  		return;
> @@ -329,10 +353,28 @@ static void pci_read_bases(struct pci_dev *dev, unsigned int howmany, int rom)
>  	if (dev->is_virtfn)
>  		return;
>  
> +	/* No printks while decoding is disabled! */
> +	if (!dev->mmio_always_on) {
> +		pci_read_config_word(dev, PCI_COMMAND, &orig_cmd);
> +		if (orig_cmd & PCI_COMMAND_DECODE_ENABLE) {
> +			pci_write_config_word(dev, PCI_COMMAND,
> +				orig_cmd & ~PCI_COMMAND_DECODE_ENABLE);
> +		}
> +	}
> +
> +	__pci_size_stdbars(dev, howmany, PCI_BASE_ADDRESS_0, stdbars);
> +	if (rom)
> +		__pci_size_rom(dev, rom, &rombar);
> +
> +	if (!dev->mmio_always_on &&
> +	    (orig_cmd & PCI_COMMAND_DECODE_ENABLE))
> +		pci_write_config_word(dev, PCI_COMMAND, orig_cmd);
> +
>  	for (pos = 0; pos < howmany; pos++) {
>  		struct resource *res = &dev->resource[pos];
>  		reg = PCI_BASE_ADDRESS_0 + (pos << 2);
> -		pos += __pci_read_base(dev, pci_bar_unknown, res, reg);
> +		pos += __pci_read_base(dev, pci_bar_unknown,
> +				       res, reg, &stdbars[pos]);
>  	}
>  
>  	if (rom) {
> @@ -340,7 +382,7 @@ static void pci_read_bases(struct pci_dev *dev, unsigned int howmany, int rom)
>  		dev->rom_base_reg = rom;
>  		res->flags = IORESOURCE_MEM | IORESOURCE_PREFETCH |
>  				IORESOURCE_READONLY | IORESOURCE_SIZEALIGN;
> -		__pci_read_base(dev, pci_bar_mem32, res, rom);
> +		__pci_read_base(dev, pci_bar_mem32, res, rom, &rombar);
>  	}
>  }
>  
> -- 
> 2.47.1
>
Ilpo Järvinen Jan. 20, 2025, 4:10 p.m. UTC | #3
On Sat, 11 Jan 2025, Alex Williamson wrote:

> Toggling memory enable is free on bare metal, but potentially expensive
> in virtualized environments as the device MMIO spaces are added and
> removed from the VM address space, including DMA mapping of those spaces
> through the IOMMU where peer-to-peer is supported.  Currently memory
> decode is disabled around sizing each individual BAR, even for SR-IOV
> BARs while VF Enable is cleared.
> 
> This can be better optimized for virtual environments by sizing a set
> of BARs at once, stashing the resulting mask into an array, while only
> toggling memory enable once.  This also naturally improves the SR-IOV
> path as the caller becomes responsible for any necessary decode disables
> while sizing BARs, therefore SR-IOV BARs are sized relying only on the
> VF Enable rather than toggling the PF memory enable in the command
> register.
> 
> Reported-by: Mitchell Augustin <mitchell.augustin@canonical.com>
> Link: https://lore.kernel.org/all/CAHTA-uYp07FgM6T1OZQKqAdSA5JrZo0ReNEyZgQZub4mDRrV5w@mail.gmail.com
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
> 
> This is an alternative to the patch proposed by Mitchell[1] and more
> in line with my suggestion in the original report thread[2].  It makes
> more sense to me to stash all the BAR sizing values into an array on
> the front end of parsing them into resources than it does to pass
> multiple arrays on the backend for status printing purposes.  We can
> discuss in either of these patches which is the better approach or
> if someone has a better yet alternative.
> 
> I don't have quite the config Mitchell has for testing, but this
> should make effectively the same improvement and does show a
> significant improvement in guest boot time even with a single 24GB
> GPU attached.  There are of course further improvements to investigate
> in the VMM, but disabling memory decode per BAR is a good start to
> making Linux be a friendlier guest.  Further testing appreciate.
> Thanks,
> 
> Alex
> 
> [1]https://lore.kernel.org/all/20241218224258.2225210-1-mitchell.augustin@canonical.com/
> [2]https://lore.kernel.org/all/20241203150620.15431c5c.alex.williamson@redhat.com/
> 
>  drivers/pci/iov.c   |   8 ++-
>  drivers/pci/pci.h   |   4 +-
>  drivers/pci/probe.c | 132 +++++++++++++++++++++++++++++---------------
>  3 files changed, 97 insertions(+), 47 deletions(-)
> 
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index 4be402fe9ab9..9e4770cdd4d5 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -747,6 +747,7 @@ static int sriov_init(struct pci_dev *dev, int pos)
>  	struct resource *res;
>  	const char *res_name;
>  	struct pci_dev *pdev;
> +	u32 sriovbars[PCI_SRIOV_NUM_BARS];
>  
>  	pci_read_config_word(dev, pos + PCI_SRIOV_CTRL, &ctrl);
>  	if (ctrl & PCI_SRIOV_CTRL_VFE) {
> @@ -783,6 +784,10 @@ static int sriov_init(struct pci_dev *dev, int pos)
>  	if (!iov)
>  		return -ENOMEM;
>  
> +	/* Sizing SR-IOV BARs with VF Enable cleared - no decode */
> +	__pci_size_stdbars(dev, PCI_SRIOV_NUM_BARS,
> +			   pos + PCI_SRIOV_BAR, sriovbars);
> +
>  	nres = 0;
>  	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
>  		res = &dev->resource[i + PCI_IOV_RESOURCES];
> @@ -796,7 +801,8 @@ static int sriov_init(struct pci_dev *dev, int pos)
>  			bar64 = (res->flags & IORESOURCE_MEM_64) ? 1 : 0;
>  		else
>  			bar64 = __pci_read_base(dev, pci_bar_unknown, res,
> -						pos + PCI_SRIOV_BAR + i * 4);
> +						pos + PCI_SRIOV_BAR + i * 4,
> +						&sriovbars[i]);
>  		if (!res->flags)
>  			continue;
>  		if (resource_size(res) & (PAGE_SIZE - 1)) {
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 2e40fc63ba31..6f27651c2a70 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -315,8 +315,10 @@ bool pci_bus_generic_read_dev_vendor_id(struct pci_bus *bus, int devfn, u32 *pl,
>  int pci_idt_bus_quirk(struct pci_bus *bus, int devfn, u32 *pl, int rrs_timeout);
>  
>  int pci_setup_device(struct pci_dev *dev);
> +void __pci_size_stdbars(struct pci_dev *dev, int count,
> +			unsigned int pos, u32 *sizes);
>  int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
> -		    struct resource *res, unsigned int reg);
> +		    struct resource *res, unsigned int reg, u32 *sizes);
>  void pci_configure_ari(struct pci_dev *dev);
>  void __pci_bus_size_bridges(struct pci_bus *bus,
>  			struct list_head *realloc_head);
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 2e81ab0f5a25..5ca96280d698 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -164,50 +164,75 @@ static inline unsigned long decode_bar(struct pci_dev *dev, u32 bar)
>  
>  #define PCI_COMMAND_DECODE_ENABLE	(PCI_COMMAND_MEMORY | PCI_COMMAND_IO)
>  
> +/**
> + * __pci_size_bars - Read the raw BAR mask for a range of PCI BARs
> + * @dev: the PCI device
> + * @count: number of BARs to size
> + * @pos: starting config space position
> + * @sizes: array to store mask values
> + * @rom: indicate whether to use ROM mask, which avoids enabling ROM BARs
> + *
> + * Provided @sizes array must be sufficiently sized to store results for
> + * @count u32 BARs.  Caller is responsible for disabling decode to specified
> + * BAR range around calling this function.  This function is intended to avoid
> + * disabling decode around sizing each BAR individually, which can result in
> + * non-trivial overhead in virtualized environments with very large PCI BARs.
> + */
> +static void __pci_size_bars(struct pci_dev *dev, int count,
> +			    unsigned int pos, u32 *sizes, bool rom)
> +{
> +	u32 orig, mask = rom ? PCI_ROM_ADDRESS_MASK : ~0;
> +	int i;
> +
> +	for (i = 0; i < count; i++, pos += 4, sizes++) {
> +		pci_read_config_dword(dev, pos, &orig);
> +		pci_write_config_dword(dev, pos, mask);
> +		pci_read_config_dword(dev, pos, sizes);
> +		pci_write_config_dword(dev, pos, orig);
> +
> +		/*
> +		 * All bits set in size means the device isn't working properly.
> +		 * If the BAR isn't implemented, all bits must be 0.  If it's a
> +		 * memory BAR or a ROM, bit 0 must be clear; if it's an io BAR,
> +		 * bit 1 must be clear.
> +		 */
> +		if (PCI_POSSIBLE_ERROR(*sizes))
> +			*sizes = 0;
> +	}
> +}

I'm trying to understand how 64-bit BARs are supposed to work here. The 
*sizes is being filled inside this function for both lower and upper 
u32, right? So why can PCI_POSSIBLE_ERROR() be used for the upper part? 
Can't the upper u32 of a 64-bit BAR have all bits as 1?

> +void __pci_size_stdbars(struct pci_dev *dev, int count,
> +			unsigned int pos, u32 *sizes)
> +{
> +	__pci_size_bars(dev, count, pos, sizes, false);
> +}
> +
> +static void __pci_size_rom(struct pci_dev *dev, unsigned int pos, u32 *sizes)
> +{
> +	__pci_size_bars(dev, 1, pos, sizes, true);
> +}
> +
>  /**
>   * __pci_read_base - Read a PCI BAR
>   * @dev: the PCI device
>   * @type: type of the BAR
>   * @res: resource buffer to be filled in
>   * @pos: BAR position in the config space
> + * @sizes: array of one or more pre-read BAR masks
>   *
>   * Returns 1 if the BAR is 64-bit, or 0 if 32-bit.
>   */
>  int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
> -		    struct resource *res, unsigned int pos)
> +		    struct resource *res, unsigned int pos, u32 *sizes)
>  {
> -	u32 l = 0, sz = 0, mask;
> +	u32 l = 0;
>  	u64 l64, sz64, mask64;
> -	u16 orig_cmd;
>  	struct pci_bus_region region, inverted_region;
>  	const char *res_name = pci_resource_name(dev, res - dev->resource);
>  
> -	mask = type ? PCI_ROM_ADDRESS_MASK : ~0;
> -
> -	/* No printks while decoding is disabled! */
> -	if (!dev->mmio_always_on) {
> -		pci_read_config_word(dev, PCI_COMMAND, &orig_cmd);
> -		if (orig_cmd & PCI_COMMAND_DECODE_ENABLE) {
> -			pci_write_config_word(dev, PCI_COMMAND,
> -				orig_cmd & ~PCI_COMMAND_DECODE_ENABLE);
> -		}
> -	}
> -
>  	res->name = pci_name(dev);
>  
>  	pci_read_config_dword(dev, pos, &l);
> -	pci_write_config_dword(dev, pos, l | mask);
> -	pci_read_config_dword(dev, pos, &sz);
> -	pci_write_config_dword(dev, pos, l);
> -
> -	/*
> -	 * All bits set in sz means the device isn't working properly.
> -	 * If the BAR isn't implemented, all bits must be 0.  If it's a
> -	 * memory BAR or a ROM, bit 0 must be clear; if it's an io BAR, bit
> -	 * 1 must be clear.
> -	 */
> -	if (PCI_POSSIBLE_ERROR(sz))
> -		sz = 0;
>  
>  	/*
>  	 * I don't know how l can have all bits set.  Copied from old code.
> @@ -221,35 +246,30 @@ int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
>  		res->flags |= IORESOURCE_SIZEALIGN;
>  		if (res->flags & IORESOURCE_IO) {
>  			l64 = l & PCI_BASE_ADDRESS_IO_MASK;
> -			sz64 = sz & PCI_BASE_ADDRESS_IO_MASK;
> +			sz64 = *sizes & PCI_BASE_ADDRESS_IO_MASK;
>  			mask64 = PCI_BASE_ADDRESS_IO_MASK & (u32)IO_SPACE_LIMIT;
>  		} else {
>  			l64 = l & PCI_BASE_ADDRESS_MEM_MASK;
> -			sz64 = sz & PCI_BASE_ADDRESS_MEM_MASK;
> +			sz64 = *sizes & PCI_BASE_ADDRESS_MEM_MASK;
>  			mask64 = (u32)PCI_BASE_ADDRESS_MEM_MASK;
> +
> +			if (res->flags & IORESOURCE_MEM_64) {
> +				pci_read_config_dword(dev, pos + 4, &l);
> +				sizes++;
> +
> +				l64 |= ((u64)l << 32);
> +				sz64 |= ((u64)*sizes << 32);
> +				mask64 |= ((u64)~0 << 32);
> +			}
>  		}
>  	} else {
>  		if (l & PCI_ROM_ADDRESS_ENABLE)
>  			res->flags |= IORESOURCE_ROM_ENABLE;
>  		l64 = l & PCI_ROM_ADDRESS_MASK;
> -		sz64 = sz & PCI_ROM_ADDRESS_MASK;
> +		sz64 = *sizes & PCI_ROM_ADDRESS_MASK;
>  		mask64 = PCI_ROM_ADDRESS_MASK;
>  	}
>  
> -	if (res->flags & IORESOURCE_MEM_64) {
> -		pci_read_config_dword(dev, pos + 4, &l);
> -		pci_write_config_dword(dev, pos + 4, ~0);
> -		pci_read_config_dword(dev, pos + 4, &sz);
> -		pci_write_config_dword(dev, pos + 4, l);
> -
> -		l64 |= ((u64)l << 32);
> -		sz64 |= ((u64)sz << 32);
> -		mask64 |= ((u64)~0 << 32);
> -	}

No PCI_POSSIBLE_ERROR() check here for the upper u32 in the previous 
code.
Alex Williamson Jan. 20, 2025, 5:10 p.m. UTC | #4
On Mon, 20 Jan 2025 18:10:14 +0200 (EET)
Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> wrote:

> On Sat, 11 Jan 2025, Alex Williamson wrote:
> 
> > Toggling memory enable is free on bare metal, but potentially expensive
> > in virtualized environments as the device MMIO spaces are added and
> > removed from the VM address space, including DMA mapping of those spaces
> > through the IOMMU where peer-to-peer is supported.  Currently memory
> > decode is disabled around sizing each individual BAR, even for SR-IOV
> > BARs while VF Enable is cleared.
> > 
> > This can be better optimized for virtual environments by sizing a set
> > of BARs at once, stashing the resulting mask into an array, while only
> > toggling memory enable once.  This also naturally improves the SR-IOV
> > path as the caller becomes responsible for any necessary decode disables
> > while sizing BARs, therefore SR-IOV BARs are sized relying only on the
> > VF Enable rather than toggling the PF memory enable in the command
> > register.
> > 
> > Reported-by: Mitchell Augustin <mitchell.augustin@canonical.com>
> > Link: https://lore.kernel.org/all/CAHTA-uYp07FgM6T1OZQKqAdSA5JrZo0ReNEyZgQZub4mDRrV5w@mail.gmail.com
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> > 
> > This is an alternative to the patch proposed by Mitchell[1] and more
> > in line with my suggestion in the original report thread[2].  It makes
> > more sense to me to stash all the BAR sizing values into an array on
> > the front end of parsing them into resources than it does to pass
> > multiple arrays on the backend for status printing purposes.  We can
> > discuss in either of these patches which is the better approach or
> > if someone has a better yet alternative.
> > 
> > I don't have quite the config Mitchell has for testing, but this
> > should make effectively the same improvement and does show a
> > significant improvement in guest boot time even with a single 24GB
> > GPU attached.  There are of course further improvements to investigate
> > in the VMM, but disabling memory decode per BAR is a good start to
> > making Linux be a friendlier guest.  Further testing appreciate.
> > Thanks,
> > 
> > Alex
> > 
> > [1]https://lore.kernel.org/all/20241218224258.2225210-1-mitchell.augustin@canonical.com/
> > [2]https://lore.kernel.org/all/20241203150620.15431c5c.alex.williamson@redhat.com/
> > 
> >  drivers/pci/iov.c   |   8 ++-
> >  drivers/pci/pci.h   |   4 +-
> >  drivers/pci/probe.c | 132 +++++++++++++++++++++++++++++---------------
> >  3 files changed, 97 insertions(+), 47 deletions(-)
> > 
> > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> > index 4be402fe9ab9..9e4770cdd4d5 100644
> > --- a/drivers/pci/iov.c
> > +++ b/drivers/pci/iov.c
> > @@ -747,6 +747,7 @@ static int sriov_init(struct pci_dev *dev, int pos)
> >  	struct resource *res;
> >  	const char *res_name;
> >  	struct pci_dev *pdev;
> > +	u32 sriovbars[PCI_SRIOV_NUM_BARS];
> >  
> >  	pci_read_config_word(dev, pos + PCI_SRIOV_CTRL, &ctrl);
> >  	if (ctrl & PCI_SRIOV_CTRL_VFE) {
> > @@ -783,6 +784,10 @@ static int sriov_init(struct pci_dev *dev, int pos)
> >  	if (!iov)
> >  		return -ENOMEM;
> >  
> > +	/* Sizing SR-IOV BARs with VF Enable cleared - no decode */
> > +	__pci_size_stdbars(dev, PCI_SRIOV_NUM_BARS,
> > +			   pos + PCI_SRIOV_BAR, sriovbars);
> > +
> >  	nres = 0;
> >  	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
> >  		res = &dev->resource[i + PCI_IOV_RESOURCES];
> > @@ -796,7 +801,8 @@ static int sriov_init(struct pci_dev *dev, int pos)
> >  			bar64 = (res->flags & IORESOURCE_MEM_64) ? 1 : 0;
> >  		else
> >  			bar64 = __pci_read_base(dev, pci_bar_unknown, res,
> > -						pos + PCI_SRIOV_BAR + i * 4);
> > +						pos + PCI_SRIOV_BAR + i * 4,
> > +						&sriovbars[i]);
> >  		if (!res->flags)
> >  			continue;
> >  		if (resource_size(res) & (PAGE_SIZE - 1)) {
> > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> > index 2e40fc63ba31..6f27651c2a70 100644
> > --- a/drivers/pci/pci.h
> > +++ b/drivers/pci/pci.h
> > @@ -315,8 +315,10 @@ bool pci_bus_generic_read_dev_vendor_id(struct pci_bus *bus, int devfn, u32 *pl,
> >  int pci_idt_bus_quirk(struct pci_bus *bus, int devfn, u32 *pl, int rrs_timeout);
> >  
> >  int pci_setup_device(struct pci_dev *dev);
> > +void __pci_size_stdbars(struct pci_dev *dev, int count,
> > +			unsigned int pos, u32 *sizes);
> >  int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
> > -		    struct resource *res, unsigned int reg);
> > +		    struct resource *res, unsigned int reg, u32 *sizes);
> >  void pci_configure_ari(struct pci_dev *dev);
> >  void __pci_bus_size_bridges(struct pci_bus *bus,
> >  			struct list_head *realloc_head);
> > diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> > index 2e81ab0f5a25..5ca96280d698 100644
> > --- a/drivers/pci/probe.c
> > +++ b/drivers/pci/probe.c
> > @@ -164,50 +164,75 @@ static inline unsigned long decode_bar(struct pci_dev *dev, u32 bar)
> >  
> >  #define PCI_COMMAND_DECODE_ENABLE	(PCI_COMMAND_MEMORY | PCI_COMMAND_IO)
> >  
> > +/**
> > + * __pci_size_bars - Read the raw BAR mask for a range of PCI BARs
> > + * @dev: the PCI device
> > + * @count: number of BARs to size
> > + * @pos: starting config space position
> > + * @sizes: array to store mask values
> > + * @rom: indicate whether to use ROM mask, which avoids enabling ROM BARs
> > + *
> > + * Provided @sizes array must be sufficiently sized to store results for
> > + * @count u32 BARs.  Caller is responsible for disabling decode to specified
> > + * BAR range around calling this function.  This function is intended to avoid
> > + * disabling decode around sizing each BAR individually, which can result in
> > + * non-trivial overhead in virtualized environments with very large PCI BARs.
> > + */
> > +static void __pci_size_bars(struct pci_dev *dev, int count,
> > +			    unsigned int pos, u32 *sizes, bool rom)
> > +{
> > +	u32 orig, mask = rom ? PCI_ROM_ADDRESS_MASK : ~0;
> > +	int i;
> > +
> > +	for (i = 0; i < count; i++, pos += 4, sizes++) {
> > +		pci_read_config_dword(dev, pos, &orig);
> > +		pci_write_config_dword(dev, pos, mask);
> > +		pci_read_config_dword(dev, pos, sizes);
> > +		pci_write_config_dword(dev, pos, orig);
> > +
> > +		/*
> > +		 * All bits set in size means the device isn't working properly.
> > +		 * If the BAR isn't implemented, all bits must be 0.  If it's a
> > +		 * memory BAR or a ROM, bit 0 must be clear; if it's an io BAR,
> > +		 * bit 1 must be clear.
> > +		 */
> > +		if (PCI_POSSIBLE_ERROR(*sizes))
> > +			*sizes = 0;
> > +	}
> > +}  
> 
> I'm trying to understand how 64-bit BARs are supposed to work here. The 
> *sizes is being filled inside this function for both lower and upper 
> u32, right? So why can PCI_POSSIBLE_ERROR() be used for the upper part? 
> Can't the upper u32 of a 64-bit BAR have all bits as 1?

Thanks for catching this, I think this check just needs to be moved
back to its original location below so that we only do this on the
lower u32.  I'll send a v2 with that change.  Thanks!

Alex

> > +void __pci_size_stdbars(struct pci_dev *dev, int count,
> > +			unsigned int pos, u32 *sizes)
> > +{
> > +	__pci_size_bars(dev, count, pos, sizes, false);
> > +}
> > +
> > +static void __pci_size_rom(struct pci_dev *dev, unsigned int pos, u32 *sizes)
> > +{
> > +	__pci_size_bars(dev, 1, pos, sizes, true);
> > +}
> > +
> >  /**
> >   * __pci_read_base - Read a PCI BAR
> >   * @dev: the PCI device
> >   * @type: type of the BAR
> >   * @res: resource buffer to be filled in
> >   * @pos: BAR position in the config space
> > + * @sizes: array of one or more pre-read BAR masks
> >   *
> >   * Returns 1 if the BAR is 64-bit, or 0 if 32-bit.
> >   */
> >  int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
> > -		    struct resource *res, unsigned int pos)
> > +		    struct resource *res, unsigned int pos, u32 *sizes)
> >  {
> > -	u32 l = 0, sz = 0, mask;
> > +	u32 l = 0;
> >  	u64 l64, sz64, mask64;
> > -	u16 orig_cmd;
> >  	struct pci_bus_region region, inverted_region;
> >  	const char *res_name = pci_resource_name(dev, res - dev->resource);
> >  
> > -	mask = type ? PCI_ROM_ADDRESS_MASK : ~0;
> > -
> > -	/* No printks while decoding is disabled! */
> > -	if (!dev->mmio_always_on) {
> > -		pci_read_config_word(dev, PCI_COMMAND, &orig_cmd);
> > -		if (orig_cmd & PCI_COMMAND_DECODE_ENABLE) {
> > -			pci_write_config_word(dev, PCI_COMMAND,
> > -				orig_cmd & ~PCI_COMMAND_DECODE_ENABLE);
> > -		}
> > -	}
> > -
> >  	res->name = pci_name(dev);
> >  
> >  	pci_read_config_dword(dev, pos, &l);
> > -	pci_write_config_dword(dev, pos, l | mask);
> > -	pci_read_config_dword(dev, pos, &sz);
> > -	pci_write_config_dword(dev, pos, l);
> > -
> > -	/*
> > -	 * All bits set in sz means the device isn't working properly.
> > -	 * If the BAR isn't implemented, all bits must be 0.  If it's a
> > -	 * memory BAR or a ROM, bit 0 must be clear; if it's an io BAR, bit
> > -	 * 1 must be clear.
> > -	 */
> > -	if (PCI_POSSIBLE_ERROR(sz))
> > -		sz = 0;
> >  
> >  	/*
> >  	 * I don't know how l can have all bits set.  Copied from old code.
> > @@ -221,35 +246,30 @@ int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
> >  		res->flags |= IORESOURCE_SIZEALIGN;
> >  		if (res->flags & IORESOURCE_IO) {
> >  			l64 = l & PCI_BASE_ADDRESS_IO_MASK;
> > -			sz64 = sz & PCI_BASE_ADDRESS_IO_MASK;
> > +			sz64 = *sizes & PCI_BASE_ADDRESS_IO_MASK;
> >  			mask64 = PCI_BASE_ADDRESS_IO_MASK & (u32)IO_SPACE_LIMIT;
> >  		} else {
> >  			l64 = l & PCI_BASE_ADDRESS_MEM_MASK;
> > -			sz64 = sz & PCI_BASE_ADDRESS_MEM_MASK;
> > +			sz64 = *sizes & PCI_BASE_ADDRESS_MEM_MASK;
> >  			mask64 = (u32)PCI_BASE_ADDRESS_MEM_MASK;
> > +
> > +			if (res->flags & IORESOURCE_MEM_64) {
> > +				pci_read_config_dword(dev, pos + 4, &l);
> > +				sizes++;
> > +
> > +				l64 |= ((u64)l << 32);
> > +				sz64 |= ((u64)*sizes << 32);
> > +				mask64 |= ((u64)~0 << 32);
> > +			}
> >  		}
> >  	} else {
> >  		if (l & PCI_ROM_ADDRESS_ENABLE)
> >  			res->flags |= IORESOURCE_ROM_ENABLE;
> >  		l64 = l & PCI_ROM_ADDRESS_MASK;
> > -		sz64 = sz & PCI_ROM_ADDRESS_MASK;
> > +		sz64 = *sizes & PCI_ROM_ADDRESS_MASK;
> >  		mask64 = PCI_ROM_ADDRESS_MASK;
> >  	}
> >  
> > -	if (res->flags & IORESOURCE_MEM_64) {
> > -		pci_read_config_dword(dev, pos + 4, &l);
> > -		pci_write_config_dword(dev, pos + 4, ~0);
> > -		pci_read_config_dword(dev, pos + 4, &sz);
> > -		pci_write_config_dword(dev, pos + 4, l);
> > -
> > -		l64 |= ((u64)l << 32);
> > -		sz64 |= ((u64)sz << 32);
> > -		mask64 |= ((u64)~0 << 32);
> > -	}  
> 
> No PCI_POSSIBLE_ERROR() check here for the upper u32 in the previous 
> code.
>
diff mbox series

Patch

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 4be402fe9ab9..9e4770cdd4d5 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -747,6 +747,7 @@  static int sriov_init(struct pci_dev *dev, int pos)
 	struct resource *res;
 	const char *res_name;
 	struct pci_dev *pdev;
+	u32 sriovbars[PCI_SRIOV_NUM_BARS];
 
 	pci_read_config_word(dev, pos + PCI_SRIOV_CTRL, &ctrl);
 	if (ctrl & PCI_SRIOV_CTRL_VFE) {
@@ -783,6 +784,10 @@  static int sriov_init(struct pci_dev *dev, int pos)
 	if (!iov)
 		return -ENOMEM;
 
+	/* Sizing SR-IOV BARs with VF Enable cleared - no decode */
+	__pci_size_stdbars(dev, PCI_SRIOV_NUM_BARS,
+			   pos + PCI_SRIOV_BAR, sriovbars);
+
 	nres = 0;
 	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
 		res = &dev->resource[i + PCI_IOV_RESOURCES];
@@ -796,7 +801,8 @@  static int sriov_init(struct pci_dev *dev, int pos)
 			bar64 = (res->flags & IORESOURCE_MEM_64) ? 1 : 0;
 		else
 			bar64 = __pci_read_base(dev, pci_bar_unknown, res,
-						pos + PCI_SRIOV_BAR + i * 4);
+						pos + PCI_SRIOV_BAR + i * 4,
+						&sriovbars[i]);
 		if (!res->flags)
 			continue;
 		if (resource_size(res) & (PAGE_SIZE - 1)) {
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 2e40fc63ba31..6f27651c2a70 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -315,8 +315,10 @@  bool pci_bus_generic_read_dev_vendor_id(struct pci_bus *bus, int devfn, u32 *pl,
 int pci_idt_bus_quirk(struct pci_bus *bus, int devfn, u32 *pl, int rrs_timeout);
 
 int pci_setup_device(struct pci_dev *dev);
+void __pci_size_stdbars(struct pci_dev *dev, int count,
+			unsigned int pos, u32 *sizes);
 int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
-		    struct resource *res, unsigned int reg);
+		    struct resource *res, unsigned int reg, u32 *sizes);
 void pci_configure_ari(struct pci_dev *dev);
 void __pci_bus_size_bridges(struct pci_bus *bus,
 			struct list_head *realloc_head);
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 2e81ab0f5a25..5ca96280d698 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -164,50 +164,75 @@  static inline unsigned long decode_bar(struct pci_dev *dev, u32 bar)
 
 #define PCI_COMMAND_DECODE_ENABLE	(PCI_COMMAND_MEMORY | PCI_COMMAND_IO)
 
+/**
+ * __pci_size_bars - Read the raw BAR mask for a range of PCI BARs
+ * @dev: the PCI device
+ * @count: number of BARs to size
+ * @pos: starting config space position
+ * @sizes: array to store mask values
+ * @rom: indicate whether to use ROM mask, which avoids enabling ROM BARs
+ *
+ * Provided @sizes array must be sufficiently sized to store results for
+ * @count u32 BARs.  Caller is responsible for disabling decode to specified
+ * BAR range around calling this function.  This function is intended to avoid
+ * disabling decode around sizing each BAR individually, which can result in
+ * non-trivial overhead in virtualized environments with very large PCI BARs.
+ */
+static void __pci_size_bars(struct pci_dev *dev, int count,
+			    unsigned int pos, u32 *sizes, bool rom)
+{
+	u32 orig, mask = rom ? PCI_ROM_ADDRESS_MASK : ~0;
+	int i;
+
+	for (i = 0; i < count; i++, pos += 4, sizes++) {
+		pci_read_config_dword(dev, pos, &orig);
+		pci_write_config_dword(dev, pos, mask);
+		pci_read_config_dword(dev, pos, sizes);
+		pci_write_config_dword(dev, pos, orig);
+
+		/*
+		 * All bits set in size means the device isn't working properly.
+		 * If the BAR isn't implemented, all bits must be 0.  If it's a
+		 * memory BAR or a ROM, bit 0 must be clear; if it's an io BAR,
+		 * bit 1 must be clear.
+		 */
+		if (PCI_POSSIBLE_ERROR(*sizes))
+			*sizes = 0;
+	}
+}
+
+void __pci_size_stdbars(struct pci_dev *dev, int count,
+			unsigned int pos, u32 *sizes)
+{
+	__pci_size_bars(dev, count, pos, sizes, false);
+}
+
+static void __pci_size_rom(struct pci_dev *dev, unsigned int pos, u32 *sizes)
+{
+	__pci_size_bars(dev, 1, pos, sizes, true);
+}
+
 /**
  * __pci_read_base - Read a PCI BAR
  * @dev: the PCI device
  * @type: type of the BAR
  * @res: resource buffer to be filled in
  * @pos: BAR position in the config space
+ * @sizes: array of one or more pre-read BAR masks
  *
  * Returns 1 if the BAR is 64-bit, or 0 if 32-bit.
  */
 int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
-		    struct resource *res, unsigned int pos)
+		    struct resource *res, unsigned int pos, u32 *sizes)
 {
-	u32 l = 0, sz = 0, mask;
+	u32 l = 0;
 	u64 l64, sz64, mask64;
-	u16 orig_cmd;
 	struct pci_bus_region region, inverted_region;
 	const char *res_name = pci_resource_name(dev, res - dev->resource);
 
-	mask = type ? PCI_ROM_ADDRESS_MASK : ~0;
-
-	/* No printks while decoding is disabled! */
-	if (!dev->mmio_always_on) {
-		pci_read_config_word(dev, PCI_COMMAND, &orig_cmd);
-		if (orig_cmd & PCI_COMMAND_DECODE_ENABLE) {
-			pci_write_config_word(dev, PCI_COMMAND,
-				orig_cmd & ~PCI_COMMAND_DECODE_ENABLE);
-		}
-	}
-
 	res->name = pci_name(dev);
 
 	pci_read_config_dword(dev, pos, &l);
-	pci_write_config_dword(dev, pos, l | mask);
-	pci_read_config_dword(dev, pos, &sz);
-	pci_write_config_dword(dev, pos, l);
-
-	/*
-	 * All bits set in sz means the device isn't working properly.
-	 * If the BAR isn't implemented, all bits must be 0.  If it's a
-	 * memory BAR or a ROM, bit 0 must be clear; if it's an io BAR, bit
-	 * 1 must be clear.
-	 */
-	if (PCI_POSSIBLE_ERROR(sz))
-		sz = 0;
 
 	/*
 	 * I don't know how l can have all bits set.  Copied from old code.
@@ -221,35 +246,30 @@  int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
 		res->flags |= IORESOURCE_SIZEALIGN;
 		if (res->flags & IORESOURCE_IO) {
 			l64 = l & PCI_BASE_ADDRESS_IO_MASK;
-			sz64 = sz & PCI_BASE_ADDRESS_IO_MASK;
+			sz64 = *sizes & PCI_BASE_ADDRESS_IO_MASK;
 			mask64 = PCI_BASE_ADDRESS_IO_MASK & (u32)IO_SPACE_LIMIT;
 		} else {
 			l64 = l & PCI_BASE_ADDRESS_MEM_MASK;
-			sz64 = sz & PCI_BASE_ADDRESS_MEM_MASK;
+			sz64 = *sizes & PCI_BASE_ADDRESS_MEM_MASK;
 			mask64 = (u32)PCI_BASE_ADDRESS_MEM_MASK;
+
+			if (res->flags & IORESOURCE_MEM_64) {
+				pci_read_config_dword(dev, pos + 4, &l);
+				sizes++;
+
+				l64 |= ((u64)l << 32);
+				sz64 |= ((u64)*sizes << 32);
+				mask64 |= ((u64)~0 << 32);
+			}
 		}
 	} else {
 		if (l & PCI_ROM_ADDRESS_ENABLE)
 			res->flags |= IORESOURCE_ROM_ENABLE;
 		l64 = l & PCI_ROM_ADDRESS_MASK;
-		sz64 = sz & PCI_ROM_ADDRESS_MASK;
+		sz64 = *sizes & PCI_ROM_ADDRESS_MASK;
 		mask64 = PCI_ROM_ADDRESS_MASK;
 	}
 
-	if (res->flags & IORESOURCE_MEM_64) {
-		pci_read_config_dword(dev, pos + 4, &l);
-		pci_write_config_dword(dev, pos + 4, ~0);
-		pci_read_config_dword(dev, pos + 4, &sz);
-		pci_write_config_dword(dev, pos + 4, l);
-
-		l64 |= ((u64)l << 32);
-		sz64 |= ((u64)sz << 32);
-		mask64 |= ((u64)~0 << 32);
-	}
-
-	if (!dev->mmio_always_on && (orig_cmd & PCI_COMMAND_DECODE_ENABLE))
-		pci_write_config_word(dev, PCI_COMMAND, orig_cmd);
-
 	if (!sz64)
 		goto fail;
 
@@ -320,7 +340,11 @@  int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
 
 static void pci_read_bases(struct pci_dev *dev, unsigned int howmany, int rom)
 {
+	u32 rombar, stdbars[PCI_STD_NUM_BARS];
 	unsigned int pos, reg;
+	u16 orig_cmd;
+
+	BUILD_BUG_ON(howmany > PCI_STD_NUM_BARS);
 
 	if (dev->non_compliant_bars)
 		return;
@@ -329,10 +353,28 @@  static void pci_read_bases(struct pci_dev *dev, unsigned int howmany, int rom)
 	if (dev->is_virtfn)
 		return;
 
+	/* No printks while decoding is disabled! */
+	if (!dev->mmio_always_on) {
+		pci_read_config_word(dev, PCI_COMMAND, &orig_cmd);
+		if (orig_cmd & PCI_COMMAND_DECODE_ENABLE) {
+			pci_write_config_word(dev, PCI_COMMAND,
+				orig_cmd & ~PCI_COMMAND_DECODE_ENABLE);
+		}
+	}
+
+	__pci_size_stdbars(dev, howmany, PCI_BASE_ADDRESS_0, stdbars);
+	if (rom)
+		__pci_size_rom(dev, rom, &rombar);
+
+	if (!dev->mmio_always_on &&
+	    (orig_cmd & PCI_COMMAND_DECODE_ENABLE))
+		pci_write_config_word(dev, PCI_COMMAND, orig_cmd);
+
 	for (pos = 0; pos < howmany; pos++) {
 		struct resource *res = &dev->resource[pos];
 		reg = PCI_BASE_ADDRESS_0 + (pos << 2);
-		pos += __pci_read_base(dev, pci_bar_unknown, res, reg);
+		pos += __pci_read_base(dev, pci_bar_unknown,
+				       res, reg, &stdbars[pos]);
 	}
 
 	if (rom) {
@@ -340,7 +382,7 @@  static void pci_read_bases(struct pci_dev *dev, unsigned int howmany, int rom)
 		dev->rom_base_reg = rom;
 		res->flags = IORESOURCE_MEM | IORESOURCE_PREFETCH |
 				IORESOURCE_READONLY | IORESOURCE_SIZEALIGN;
-		__pci_read_base(dev, pci_bar_mem32, res, rom);
+		__pci_read_base(dev, pci_bar_mem32, res, rom, &rombar);
 	}
 }