
[V9,03/18] PCI: Add weak pcibios_iov_resource_size() interface

Message ID 1414942894-17034-4-git-send-email-weiyang@linux.vnet.ibm.com (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas

Commit Message

Wei Yang Nov. 2, 2014, 3:41 p.m. UTC
When virtfn_add() retrieves a VF's IOV BAR, it divides the PF's total IOV
BAR size by the total number of VFs (total_VFs). This works for most cases,
but may not be correct on some specific platforms.

For example, on the PowerNV platform the PF's IOV BAR is expanded to satisfy a
hardware alignment constraint, which means the original calculation no longer
yields the per-VF size.

This patch introduces a weak pcibios_iov_resource_size() interface, which
gives the platform a chance to implement its own method for calculating the
VF BAR resource size.
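
For illustration, a platform override could look roughly like the sketch below
(the per-device platform data is hypothetical; the real PowerNV implementation
is provided separately):

	resource_size_t pcibios_iov_resource_size(struct pci_dev *dev, int resno)
	{
		/* hypothetical platform data recording the per-VF BAR size */
		struct my_platform_pci_data *pdata = dev->sysdata;

		if (resno < PCI_IOV_RESOURCES || resno > PCI_IOV_RESOURCE_END)
			return 0;

		/* returning 0 falls back to the generic size/total_VFs calculation */
		return pdata ? pdata->vf_bar_size[resno - PCI_IOV_RESOURCES] : 0;
	}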

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
---
 drivers/pci/iov.c   |   27 +++++++++++++++++++++++++--
 include/linux/pci.h |    5 +++++
 2 files changed, 30 insertions(+), 2 deletions(-)

Comments

Bjorn Helgaas Nov. 19, 2014, 1:12 a.m. UTC | #1
[+cc Don, Myron]

Hi Wei,

On Sun, Nov 02, 2014 at 11:41:19PM +0800, Wei Yang wrote:
> When retrieving VF IOV BAR in virtfn_add(), it will divide the total PF's IOV
> BAR size with the totalVF number. This is true for most cases, while may not
> be correct on some specific platform.
> 
> For example on PowerNV platform, in order to fix PF's IOV BAR into a hardware
> alignment, the PF's IOV BAR size would be expended. This means the original
> method couldn't work.
> 
> This patch introduces a weak pcibios_iov_resource_size() interface, which
> gives platform a chance to implement specific method to calculate the VF BAR
> resource size.
> 
> Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
> ---
>  drivers/pci/iov.c   |   27 +++++++++++++++++++++++++--
>  include/linux/pci.h |    5 +++++
>  2 files changed, 30 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index 4d1685d..6866830 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -61,6 +61,30 @@ static void virtfn_remove_bus(struct pci_bus *physbus, struct pci_bus *virtbus)
>  		pci_remove_bus(virtbus);
>  }
>  
> +resource_size_t __weak pcibios_iov_resource_size(struct pci_dev *dev, int resno)
> +{
> +	return 0;
> +}
> +
> +resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno)
> +{
> +	resource_size_t size;
> +	struct pci_sriov *iov;
> +
> +	if (!dev->is_physfn)
> +		return 0;
> +
> +	size = pcibios_iov_resource_size(dev, resno);
> +	if (size != 0)
> +		return size;
> +
> +	iov = dev->sriov;
> +	size = resource_size(dev->resource + resno);
> +	do_div(size, iov->total_VFs);
> +
> +	return size;
> +}
> +
>  static int virtfn_add(struct pci_dev *dev, int id, int reset)
>  {
>  	int i;
> @@ -96,8 +120,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset)
>  			continue;
>  		virtfn->resource[i].name = pci_name(virtfn);
>  		virtfn->resource[i].flags = res->flags;
> -		size = resource_size(res);
> -		do_div(size, iov->total_VFs);
> +		size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
>  		virtfn->resource[i].start = res->start + size * id;

Can you help me understand this?

We have previously called sriov_init() on the PF.  There, we sized the VF
BARs, which are in the PF's SR-IOV Capability (SR-IOV spec sec 3.3.14).
The size we discover is the amount of space required by a single VF, so
sriov_init() adjusts PF->resource[PCI_IOV_RESOURCES + i] by multiplying
that size by PCI_SRIOV_TOTAL_VF, so this PF resource is now big enough to 
hold the VF BAR[i] areas for all the possible VFs.

Now we're in virtfn_add(), setting up a new VF.  The usual BARs at config
space 0x10, etc., are read-only zeroes for a VF, so we don't size them the
usual way.  Instead, we carve out a slice of the
PF->resource[PCI_IOV_RESOURCES + i] area.

I thought the starting address of the VF BARn memory aperture was
prescribed by the spec in sec 2.1.1.1 and shown in figure 2-1:

  BARx VFy starting address = VF BARx + (y - 1) * (VF BARx aperture size)

That's basically what the existing code does.  We don't save the VF BARx
aperture size, so we recompute it by undoing the multiplication we did in
sriov_init().

But you're computing the starting address using a different VF BARx
aperture size.  How does that work?  I assumed this calculation was built
into the PCI device, but obviously I'm missing something.

To make it concrete, here's a made-up example:

  PF SR-IOV Capability
    TotalVFs = 4
    NumVFs = 4
    System Page Size = 4KB
    VF BAR0 = [mem 0x00000000-0x00000fff] (4KB at address 0)

  PF  pci_dev->resource[7] = [mem 0x00000000-0x00003fff] (16KB)
  VF1 pci_dev->resource[0] = [mem 0x00000000-0x00000fff]
  VF2 pci_dev->resource[0] = [mem 0x00001000-0x00001fff]
  VF3 pci_dev->resource[0] = [mem 0x00002000-0x00002fff]
  VF4 pci_dev->resource[0] = [mem 0x00003000-0x00003fff]

You're changing this so we might use 16KB as the VF BAR0 aperture size
instead of 4KB, which would result in the following:

  VF1 pci_dev->resource[0] = [mem 0x00000000-0x00003fff]
  VF2 pci_dev->resource[0] = [mem 0x00004000-0x00007fff]
  VF3 pci_dev->resource[0] = [mem 0x00008000-0x0000bfff]
  VF4 pci_dev->resource[0] = [mem 0x0000c000-0x0000ffff]

But you didn't change sriov_init(), so the PF resource[7] is only 16KB, not
the 64KB the VFs need.  And I assume the VF address decoder is wired to
claim the 4KB regions at 0x0000, 0x1000, 0x2000, 0x3000, not the ones at
0x0000, 0x4000, 0x8000, 0xc000.

Bjorn

>  		virtfn->resource[i].end = virtfn->resource[i].start + size - 1;
>  		rc = request_resource(res, &virtfn->resource[i]);
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index bbf8058..2f5b454 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1162,6 +1162,8 @@ resource_size_t pcibios_window_alignment(struct pci_bus *bus,
>  resource_size_t pcibios_iov_resource_alignment(struct pci_dev *dev,
>  						 int resno,
>  						 resource_size_t align);
> +resource_size_t pcibios_iov_resource_size(struct pci_dev *dev,
> +		                            int resno);
>  
>  #define PCI_VGA_STATE_CHANGE_BRIDGE (1 << 0)
>  #define PCI_VGA_STATE_CHANGE_DECODES (1 << 1)
> @@ -1666,6 +1668,7 @@ int pci_num_vf(struct pci_dev *dev);
>  int pci_vfs_assigned(struct pci_dev *dev);
>  int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs);
>  int pci_sriov_get_totalvfs(struct pci_dev *dev);
> +resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno);
>  #else
>  static inline int pci_iov_virtfn_bus(struct pci_dev *dev, int id)
>  {
> @@ -1685,6 +1688,8 @@ static inline int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs)
>  { return 0; }
>  static inline int pci_sriov_get_totalvfs(struct pci_dev *dev)
>  { return 0; }
> +static inline resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno)
> +{ return 0; }
>  #endif
>  
>  #if defined(CONFIG_HOTPLUG_PCI) || defined(CONFIG_HOTPLUG_PCI_MODULE)
> -- 
> 1.7.9.5
> 
Benjamin Herrenschmidt Nov. 19, 2014, 2:15 a.m. UTC | #2
On Tue, 2014-11-18 at 18:12 -0700, Bjorn Helgaas wrote:
> 
> Can you help me understand this?
> 
> We have previously called sriov_init() on the PF.  There, we sized the VF
> BARs, which are in the PF's SR-IOV Capability (SR-IOV spec sec 3.3.14).
> The size we discover is the amount of space required by a single VF, so
> sriov_init() adjusts PF->resource[PCI_IOV_RESOURCES + i] by multiplying
> that size by PCI_SRIOV_TOTAL_VF, so this PF resource is now big enough to 
> hold the VF BAR[i] areas for all the possible VFs.

So I'll let Richard (Wei) answer on the details but I'll just chime in
about the "big picture". This isn't about changing the spacing between VFs
which is handled by the system page size.

This is about the way we create MMIO windows from the CPU to the VF BARs.

Basically, we have a (limited) set of 64-bit windows we can create that
are divided in equal sized segments (256 of them), each segment assigned
in HW to one of our Partitionable Endpoints (aka domain).

So even if we only ever create 16 VFs for a device, we need to use one of
these windows in its entirety, which will consume 256*VF_size and thus allocate
that much space. Also, the window has to be naturally aligned.

We can then assign the VF BAR to a spot inside that window that corresponds
to the range of PEs that we have assigned to that device (which typically
isn't going to be the beginning of the window).
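
To put some made-up numbers on it: with a 1MB per-VF size, one of those windows
has to be 256 * 1MB = 256MB, naturally aligned, and if the device's VFs were
given, say, PEs 40-55, the VF BAR would be positioned so that its 16 slices land
in segments 40-55, i.e. 40MB into the window.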

Cheers,
Ben.


Wei Yang Nov. 19, 2014, 3:21 a.m. UTC | #3
On Wed, Nov 19, 2014 at 01:15:32PM +1100, Benjamin Herrenschmidt wrote:
>On Tue, 2014-11-18 at 18:12 -0700, Bjorn Helgaas wrote:
>> 
>> Can you help me understand this?
>> 
>> We have previously called sriov_init() on the PF.  There, we sized the VF
>> BARs, which are in the PF's SR-IOV Capability (SR-IOV spec sec 3.3.14).
>> The size we discover is the amount of space required by a single VF, so
>> sriov_init() adjusts PF->resource[PCI_IOV_RESOURCES + i] by multiplying
>> that size by PCI_SRIOV_TOTAL_VF, so this PF resource is now big enough to 
>> hold the VF BAR[i] areas for all the possible VFs.
>
>So I'll let Richard (Wei) answer on the details but I'll just chime in
>about the "big picture". This isn't about changing the spacing between VFs
>which is handled by the system page size.
>
>This is about the way we create MMIO windows from the CPU to the VF BARs.
>
>Basically, we have a (limited) set of 64-bit windows we can create that
>are divided in equal sized segments (256 of them), each segment assigned
>in HW to one of our Partitionable Endpoints (aka domain).
>
>So even if we only ever create 16 VFs for a device, we need to use an
>entire of these windows, which will use 256*VF_size and thus allocate
>that much space. Also the window has to be naturally aligned.
>
>We can then assign the VF BAR to a spot inside that window that corresponds
>to the range of PEs that we have assigned to that device (which typically
>isn't going to be the beginning of the window).
>

Bjorn & Ben,

Let me try to explain it. Thanks to Ben for his explanation, which should be
helpful. We are not trying to change the spacing between VFs.

As Ben mentioned, we use some HW to map the MMIO space to a PE. But the HW
must map 256 segments of the same size, which leads to a situation like
this:

   +------+------+        +------+------+------+------+
   |VF#0  |VF#1  |   ...  |      |VF#N-1|PF#A  |PF#B  |
   +------+------+        +------+------+------+------+

Suppose N = 254 and the HW maps these 256 segments to their corresponding PE#.
This introduces a problem: PF#A and PF#B have already been assigned to some
PE#, and we can't map one MMIO range to two different PE#s.

What we have done is to "expand the IOV BAR" to cover all 256 HW segments.
By doing so, the MMIO range will look like this:

   +------+------+        +------+------+------+------+------+------+
   |VF#0  |VF#1  |   ...  |      |VF#N-1|blank |blank |PF#A  |PF#B  |
   +------+------+        +------+------+------+------+------+------+

We do this trick of "expanding" the IOV BAR to make sure there is no
overlap between the VFs' PEs and the PFs' PEs.

This changes the IOV BAR size from:
   
   IOV BAR size = (VF BAR aperture size) * VF_number

 to:
   
   IOV BAR size = (VF BAR aperture size) * 256

This is the reason we need a platform-dependent method to get the VF BAR size;
otherwise the computed VF BAR size would be incorrect.

Now let's take a look at your example again.

  PF SR-IOV Capability
    TotalVFs = 4
    NumVFs = 4
    System Page Size = 4KB
    VF BAR0 = [mem 0x00000000-0x00000fff] (4KB at address 0)

  PF  pci_dev->resource[7] = [mem 0x00000000-0x00003fff] (16KB)
  VF1 pci_dev->resource[0] = [mem 0x00000000-0x00000fff]
  VF2 pci_dev->resource[0] = [mem 0x00001000-0x00001fff]
  VF3 pci_dev->resource[0] = [mem 0x00002000-0x00002fff]
  VF4 pci_dev->resource[0] = [mem 0x00003000-0x00003fff]

The difference after our expansion is that the IOV BAR size is 256*4KB instead
of 16KB, so it will look like this:

  PF  pci_dev->resource[7] = [mem 0x00000000-0x000fffff] (1024KB)
  VF1 pci_dev->resource[0] = [mem 0x00000000-0x00000fff]
  VF2 pci_dev->resource[0] = [mem 0x00001000-0x00001fff]
  VF3 pci_dev->resource[0] = [mem 0x00002000-0x00002fff]
  VF4 pci_dev->resource[0] = [mem 0x00003000-0x00003fff]
  ...
  and 252 4KB segments are left unused.

So the start address and size of each VF will not change, but the PF's IOV BAR
will be expanded.

>Cheers,
>Ben.
>
Bjorn Helgaas Nov. 19, 2014, 4:26 a.m. UTC | #4
On Wed, Nov 19, 2014 at 11:21:00AM +0800, Wei Yang wrote:
> On Wed, Nov 19, 2014 at 01:15:32PM +1100, Benjamin Herrenschmidt wrote:
> >On Tue, 2014-11-18 at 18:12 -0700, Bjorn Helgaas wrote:
> >> 
> >> Can you help me understand this?
> >> 
> >> We have previously called sriov_init() on the PF.  There, we sized the VF
> >> BARs, which are in the PF's SR-IOV Capability (SR-IOV spec sec 3.3.14).
> >> The size we discover is the amount of space required by a single VF, so
> >> sriov_init() adjusts PF->resource[PCI_IOV_RESOURCES + i] by multiplying
> >> that size by PCI_SRIOV_TOTAL_VF, so this PF resource is now big enough to 
> >> hold the VF BAR[i] areas for all the possible VFs.
> >
> >So I'll let Richard (Wei) answer on the details but I'll just chime in
> >about the "big picture". This isn't about changing the spacing between VFs
> >which is handled by the system page size.
> >
> >This is about the way we create MMIO windows from the CPU to the VF BARs.
> >
> >Basically, we have a (limited) set of 64-bit windows we can create that
> >are divided in equal sized segments (256 of them), each segment assigned
> >in HW to one of our Partitionable Endpoints (aka domain).
> >
> >So even if we only ever create 16 VFs for a device, we need to use an
> >entire of these windows, which will use 256*VF_size and thus allocate
> >that much space. Also the window has to be naturally aligned.
> >
> >We can then assign the VF BAR to a spot inside that window that corresponds
> >to the range of PEs that we have assigned to that device (which typically
> >isn't going to be the beginning of the window).
> >
> 
> Bjorn & Ben,
> 
> Let me try to explain it. Thanks for Ben's explanation, it would be helpful. We
> are not trying to change the space between VFs.
> 
> As mentioned by Ben, we use some HW to map the MMIO space to PE. 

We need some documentation with pictures about what a PE is.  I did find
this:

https://events.linuxfoundation.org/images/stories/slides/lfcs2013_yang.pdf

which looks like a good start, although there's not quite enough text for
me to understand, and it doesn't have much about MMIO space.

> But the HW
> must map 256 segments with the same size. This will lead a situation like
> this.
> 
>    +------+------+        +------+------+------+------+
>    |VF#0  |VF#1  |   ...  |      |VF#N-1|PF#A  |PF#B  |
>    +------+------+        +------+------+------+------+
> 
> Suppose N = 254 and the HW map these 256 segments to their corresponding PE#.

I guess these 256 segments are regions of CPU physical address space, and
they are being mapped to bus address space?  Is there some relationship
between a PE and part of the bus address space?

> Then it introduces one problem, the PF#A and PF#B have been already assigned
> to some PE#. We can't map one MMIO range to two different PE#.
> 
> What we have done is to "Expand the IOV BAR" to fit the whole HW 256 segments.
> By doing so, the MMIO range will look like this.
> 
>    +------+------+        +------+------+------+------+------+------+
>    |VF#0  |VF#1  |   ...  |      |VF#N-1|blank |blank |PF#A  |PF#B  |
>    +------+------+        +------+------+------+------+------+------+
> 
> We do some tricky to "Expand" the IOV BAR, so that make sure there would not
> be some overlap between VF's PE and PF's PE.

The language here is tricky.  You're not actually *expanding* the IOV BAR.
The IOV BAR is a hardware thing and its size is determined by normal BAR
sizing and the number of VFs.  What you're doing is reserving additional
space for that BAR, and the additional space will be unused.  That's all
fine; we just need a way to describe it accurately.

> Then this will leads to the IOV BAR size change from:
>    
>    IOV BAR size = (VF BAR aperture size) * VF_number
> 
>  to:
>    
>    IOV BAR size = (VF BAR aperture size) * 256
> 
> This is the reason we need a platform dependent method to get the VF BAR size.
> Otherwise the VF BAR size would be not correct.
> 
> Now let's take a look at your example again.
> 
>   PF SR-IOV Capability
>     TotalVFs = 4
>     NumVFs = 4
>     System Page Size = 4KB
>     VF BAR0 = [mem 0x00000000-0x00000fff] (4KB at address 0)
> 
>   PF  pci_dev->resource[7] = [mem 0x00000000-0x00003fff] (16KB)
>   VF1 pci_dev->resource[0] = [mem 0x00000000-0x00000fff]
>   VF2 pci_dev->resource[0] = [mem 0x00001000-0x00001fff]
>   VF3 pci_dev->resource[0] = [mem 0x00002000-0x00002fff]
>   VF4 pci_dev->resource[0] = [mem 0x00003000-0x00003fff]
> 
> The difference after our expanding is the IOV BAR size is 256*4KB instead of
> 16KB. So it will look like this:
> 
>   PF  pci_dev->resource[7] = [mem 0x00000000-0x000fffff] (1024KB)

Is the idea that you want this resource to be big enough to cover all 256
segments?  I think I'm OK with increasing the size of the PF resources to
prevent overlap.  That part shouldn't be too ugly.

>   VF1 pci_dev->resource[0] = [mem 0x00000000-0x00000fff]
>   VF2 pci_dev->resource[0] = [mem 0x00001000-0x00001fff]
>   VF3 pci_dev->resource[0] = [mem 0x00002000-0x00002fff]
>   VF4 pci_dev->resource[0] = [mem 0x00003000-0x00003fff]
>   ...
>   and 252 4KB space leave not used.
> 
> So the start address and the size of VF will not change, but the PF's IOV BAR
> will be expanded.

I'm really dubious about this change to use pci_iov_resource_size().  I
think you might be doing that because if you increase the PF resource size,
dividing that increased size by total_VFs will give you garbage.  E.g., in
the example above, you would compute "size = 1024KB / 4", which would make
the VF BARs appear to be 256KB instead of 4KB as they should be.

I think it would be better to solve that problem by decoupling the PF
resource size and the VF BAR size.  For example, we could keep track of the
VF BAR size explicitly in struct pci_sriov, instead of computing it from
the PF resource size and total_VFs.  This would keep the VF BAR size
completely platform-independent.
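
Something along these lines is what I have in mind (untested sketch; the field
name and exact placement are just for illustration):

  /* drivers/pci/pci.h */
  struct pci_sriov {
          ...
          resource_size_t barsz[PCI_SRIOV_NUM_BARS];      /* VF BAR size */
  };

  /* sriov_init(): remember the per-VF size before scaling the PF resource
   * (assumes iov is already available at this point; the real code may need
   * a little reordering) */
  iov->barsz[i] = resource_size(res);
  res->end = res->start + resource_size(res) * total - 1;

  /* virtfn_add(): no division and no platform hook needed */
  size = dev->sriov->barsz[i];
  virtfn->resource[i].start = res->start + size * id;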

Bjorn
Wei Yang Nov. 19, 2014, 9:27 a.m. UTC | #5
On Tue, Nov 18, 2014 at 09:26:01PM -0700, Bjorn Helgaas wrote:
>On Wed, Nov 19, 2014 at 11:21:00AM +0800, Wei Yang wrote:
>> On Wed, Nov 19, 2014 at 01:15:32PM +1100, Benjamin Herrenschmidt wrote:
>> >On Tue, 2014-11-18 at 18:12 -0700, Bjorn Helgaas wrote:
>> >> 
>> >> Can you help me understand this?
>> >> 
>> >> We have previously called sriov_init() on the PF.  There, we sized the VF
>> >> BARs, which are in the PF's SR-IOV Capability (SR-IOV spec sec 3.3.14).
>> >> The size we discover is the amount of space required by a single VF, so
>> >> sriov_init() adjusts PF->resource[PCI_IOV_RESOURCES + i] by multiplying
>> >> that size by PCI_SRIOV_TOTAL_VF, so this PF resource is now big enough to 
>> >> hold the VF BAR[i] areas for all the possible VFs.
>> >
>> >So I'll let Richard (Wei) answer on the details but I'll just chime in
>> >about the "big picture". This isn't about changing the spacing between VFs
>> >which is handled by the system page size.
>> >
>> >This is about the way we create MMIO windows from the CPU to the VF BARs.
>> >
>> >Basically, we have a (limited) set of 64-bit windows we can create that
>> >are divided in equal sized segments (256 of them), each segment assigned
>> >in HW to one of our Partitionable Endpoints (aka domain).
>> >
>> >So even if we only ever create 16 VFs for a device, we need to use an
>> >entire of these windows, which will use 256*VF_size and thus allocate
>> >that much space. Also the window has to be naturally aligned.
>> >
>> >We can then assign the VF BAR to a spot inside that window that corresponds
>> >to the range of PEs that we have assigned to that device (which typically
>> >isn't going to be the beginning of the window).
>> >
>> 
>> Bjorn & Ben,
>> 
>> Let me try to explain it. Thanks for Ben's explanation, it would be helpful. We
>> are not trying to change the space between VFs.
>> 
>> As mentioned by Ben, we use some HW to map the MMIO space to PE. 
>
>We need some documentation with pictures about what a PE is.  I did find
>this:
>
>https://events.linuxfoundation.org/images/stories/slides/lfcs2013_yang.pdf
>
>which looks like a good start, although there's not quite enough text for
>me to understand, and it doesn't have much about MMIO space.

Yes, that slide was used two years ago and is for the P7 platform.

The current solution is for P8, where we chose a different mechanism.
In particular, the MMIO manipulation used in the current implementation was not
used back then.

>> But the HW
>> must map 256 segments with the same size. This will lead a situation like
>> this.
>> 
>>    +------+------+        +------+------+------+------+
>>    |VF#0  |VF#1  |   ...  |      |VF#N-1|PF#A  |PF#B  |
>>    +------+------+        +------+------+------+------+
>> 
>> Suppose N = 254 and the HW map these 256 segments to their corresponding PE#.
>
>I guess these 256 segments are regions of CPU physical address space, and
>they are being mapped to bus address space?  Is there some relationship
>between a PE and part of the bus address space?
>

A PE is an entity for EEH, which may include a whole bus or a single PCI device.

When a device hits an error, we need to identify which PE it belongs to,
so we have some HW to map between a PE# and MMIO/DMA/MSI addresses.

The HW mentioned in my previous mail is the one that maps an MMIO address to a
PE#, and this HW must map a range of 256 equal segments. And yes, this is
mapped to bus address space.
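
Conceptually (this is not real code, just to show the relationship), for one of
these segmented windows the HW does something like:

  /* window divided into 256 equal segments; SW fills the per-segment table */
  segment   = (mmio_addr - window_base) / (window_size / 256);
  pe_number = segment_to_pe_table[segment];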

>> Then it introduces one problem, the PF#A and PF#B have been already assigned
>> to some PE#. We can't map one MMIO range to two different PE#.
>> 
>> What we have done is to "Expand the IOV BAR" to fit the whole HW 256 segments.
>> By doing so, the MMIO range will look like this.
>> 
>>    +------+------+        +------+------+------+------+------+------+
>>    |VF#0  |VF#1  |   ...  |      |VF#N-1|blank |blank |PF#A  |PF#B  |
>>    +------+------+        +------+------+------+------+------+------+
>> 
>> We do some tricky to "Expand" the IOV BAR, so that make sure there would not
>> be some overlap between VF's PE and PF's PE.
>
>The language here is tricky.  You're not actually *expanding* the IOV BAR.
>The IOV BAR is a hardware thing and its size is determined by normal BAR
>sizing and the number of VFs.  What you're doing is reserving additional
>space for that BAR, and the additional space will be unused.  That's all
>fine; we just need a way to describe it accurately.
>

Yes, you are right; my wording is not exact.

What I am doing is reserving more space for the IOV BAR. I will make the
changelog more precise in the next version.

>> Then this will leads to the IOV BAR size change from:
>>    
>>    IOV BAR size = (VF BAR aperture size) * VF_number
>> 
>>  to:
>>    
>>    IOV BAR size = (VF BAR aperture size) * 256
>> 
>> This is the reason we need a platform dependent method to get the VF BAR size.
>> Otherwise the VF BAR size would be not correct.
>> 
>> Now let's take a look at your example again.
>> 
>>   PF SR-IOV Capability
>>     TotalVFs = 4
>>     NumVFs = 4
>>     System Page Size = 4KB
>>     VF BAR0 = [mem 0x00000000-0x00000fff] (4KB at address 0)
>> 
>>   PF  pci_dev->resource[7] = [mem 0x00000000-0x00003fff] (16KB)
>>   VF1 pci_dev->resource[0] = [mem 0x00000000-0x00000fff]
>>   VF2 pci_dev->resource[0] = [mem 0x00001000-0x00001fff]
>>   VF3 pci_dev->resource[0] = [mem 0x00002000-0x00002fff]
>>   VF4 pci_dev->resource[0] = [mem 0x00003000-0x00003fff]
>> 
>> The difference after our expanding is the IOV BAR size is 256*4KB instead of
>> 16KB. So it will look like this:
>> 
>>   PF  pci_dev->resource[7] = [mem 0x00000000-0x000fffff] (1024KB)
>
>Is the idea that you want this resource to be big enough to cover all 256
>segments?  I think I'm OK with increasing the size of the PF resources to
>prevent overlap.  That part shouldn't be too ugly.
>

Yes, big enough to cover all 256 segments.

Sorry for making it ugly :-(

>>   VF1 pci_dev->resource[0] = [mem 0x00000000-0x00000fff]
>>   VF2 pci_dev->resource[0] = [mem 0x00001000-0x00001fff]
>>   VF3 pci_dev->resource[0] = [mem 0x00002000-0x00002fff]
>>   VF4 pci_dev->resource[0] = [mem 0x00003000-0x00003fff]
>>   ...
>>   and 252 4KB space leave not used.
>> 
>> So the start address and the size of VF will not change, but the PF's IOV BAR
>> will be expanded.
>
>I'm really dubious about this change to use pci_iov_resource_size().  I
>think you might be doing that because if you increase the PF resource size,
>dividing that increased size by total_VFs will give you garbage.  E.g., in
>the example above, you would compute "size = 1024KB / 4", which would make
>the VF BARs appear to be 256KB instead of 4KB as they should be.
>

Yes, your understanding is correct.

>I think it would be better to solve that problem by decoupling the PF
>resource size and the VF BAR size.  For example, we could keep track of the
>VF BAR size explicitly in struct pci_sriov, instead of computing it from
>the PF resource size and total_VFs.  This would keep the VF BAR size
>completely platform-independent.
>

Hmm... this is another solution.

If you prefer this one, I will make a change accordingly.

Thanks for your comments :-)

>Bjorn
Bjorn Helgaas Nov. 19, 2014, 5:23 p.m. UTC | #6
On Wed, Nov 19, 2014 at 05:27:40PM +0800, Wei Yang wrote:
> On Tue, Nov 18, 2014 at 09:26:01PM -0700, Bjorn Helgaas wrote:
> >On Wed, Nov 19, 2014 at 11:21:00AM +0800, Wei Yang wrote:
> >> On Wed, Nov 19, 2014 at 01:15:32PM +1100, Benjamin Herrenschmidt wrote:
> >> >On Tue, 2014-11-18 at 18:12 -0700, Bjorn Helgaas wrote:

> >> But the HW
> >> must map 256 segments with the same size. This will lead a situation like
> >> this.
> >> 
> >>    +------+------+        +------+------+------+------+
> >>    |VF#0  |VF#1  |   ...  |      |VF#N-1|PF#A  |PF#B  |
> >>    +------+------+        +------+------+------+------+
> >> 
> >> Suppose N = 254 and the HW map these 256 segments to their corresponding PE#.
> >
> >I guess these 256 segments are regions of CPU physical address space, and
> >they are being mapped to bus address space?  Is there some relationship
> >between a PE and part of the bus address space?
> >
> 
> PE is an entity for EEH, which may include a whole bus or one pci device.

Yes, I've read that many times.  What's missing is the connection between a
PE and the things in the PCI specs (buses, devices, functions, MMIO address
space, DMA, MSI, etc.)  Presumably the PE structure imposes constraints on
how the core uses the standard PCI elements, but we don't really have a
clear description of those constraints yet.

> When some device got some error, we need to identify which PE it belongs to.
> So we have some HW to map between PE# and MMIO/DMA/MSI address.
> 
> The HW mentioned in previous letter is the one to map MMIO address to a PE#.
> While this HW must map a range with 256 equal segments. And yes, this is
> mapped to bus address space.
> ...

> >> The difference after our expanding is the IOV BAR size is 256*4KB instead of
> >> 16KB. So it will look like this:
> >> 
> >>   PF  pci_dev->resource[7] = [mem 0x00000000-0x000fffff] (1024KB)
> >
> >Is the idea that you want this resource to be big enough to cover all 256
> >segments?  I think I'm OK with increasing the size of the PF resources to
> >prevent overlap.  That part shouldn't be too ugly.
> >
> 
> Yes, big enough to cover all 256 segments.
> 
> Sorry for making it ugly :-(

I didn't mean that what you did was ugly.  I meant that increasing the size
of the PF resource can be done cleanly.

By the way, when you do this, it would be nice if the dmesg showed the
standard PF IOV BAR sizing, and then a separate line showing the resource
expansion to deal with the PE constraints.  I don't think even the standard
output is very clear -- I think we currently get something like this:

  pci 0000:00:00.0 reg 0x174: [mem 0x00000000-0x00000fff]

But that is only the size of a single VF BAR aperture.  Then sriov_init()
multiplies that by the number of possible VFs, but I don't think we print
the overall size of that PF resource.  I think we should, because it's
misleading to print only the smaller piece.  Maybe something like this:

  pci 0000:00:00.0 VF BAR0: [mem 0x00000000-0x00003fff] (for 4 VFs)

And then you could do something like:

  pci 0000:00:00.0 VF BAR0: [mem 0x00000000-0x000fffff] (expanded for PE alignment)
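
i.e., something like this in sriov_init() (just a sketch; the exact wording and
format don't matter):

  res = &dev->resource[PCI_IOV_RESOURCES + i];
  dev_info(&dev->dev, "VF BAR%d: %pR (contains BAR%d for %d VFs)\n",
           i, res, i, total);

plus a similar dev_info() from the powerpc code when it enlarges the resource
for PE alignment.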

> >>   VF1 pci_dev->resource[0] = [mem 0x00000000-0x00000fff]
> >>   VF2 pci_dev->resource[0] = [mem 0x00001000-0x00001fff]
> >>   VF3 pci_dev->resource[0] = [mem 0x00002000-0x00002fff]
> >>   VF4 pci_dev->resource[0] = [mem 0x00003000-0x00003fff]
> >>   ...
> >>   and 252 4KB space leave not used.
> >> 
> >> So the start address and the size of VF will not change, but the PF's IOV BAR
> >> will be expanded.
> >
> >I'm really dubious about this change to use pci_iov_resource_size().  I
> >think you might be doing that because if you increase the PF resource size,
> >dividing that increased size by total_VFs will give you garbage.  E.g., in
> >the example above, you would compute "size = 1024KB / 4", which would make
> >the VF BARs appear to be 256KB instead of 4KB as they should be.
> 
> Yes, your understanding is correct.
> 
> >I think it would be better to solve that problem by decoupling the PF
> >resource size and the VF BAR size.  For example, we could keep track of the
> >VF BAR size explicitly in struct pci_sriov, instead of computing it from
> >the PF resource size and total_VFs.  This would keep the VF BAR size
> >completely platform-independent.
> 
> Hmm... this is another solution.
> 
> If you prefer this one, I will make a change accordingly.

Yes, I definitely prefer to track the VF BAR size explicitly.  I think that
will make the code much clearer.

Bjorn
Benjamin Herrenschmidt Nov. 19, 2014, 8:51 p.m. UTC | #7
On Wed, 2014-11-19 at 10:23 -0700, Bjorn Helgaas wrote:
> 
> Yes, I've read that many times.  What's missing is the connection between a
> PE and the things in the PCI specs (buses, devices, functions, MMIO address
> space, DMA, MSI, etc.)  Presumably the PE structure imposes constraints on
> how the core uses the standard PCI elements, but we don't really have a
> clear description of those constraints yet.

Right, a "PE" is a HW concept in fact in our bridges, that essentially is
a shared isolation state between DMA, MMIO, MSIs, PCIe error messages,...
for a given "domain" or set of PCI functions.

The techniques for how the HW resources are mapped to PEs, and the associated
constraints, are slightly different from one generation of our chips to
the next. In general, P7 follows an architecture known as "IODA" and P8
"IODA2". I'm trying to get that spec made available via OpenPower but
that hasn't happened yet.

In this case we mostly care about IODA2 (P8), so I'll give a quick
description here. Wei, feel free to copy/paste that into a bit of doco
to throw into Documentation/powerpc/ along with your next spin of the patch.

The concept of "PE" is a way to group the various resources associated
with a device or a set of device to provide isolation between partitions
(ie. filtering of DMA, MSIs etc...) and to provide a mechanism to freeze
a device that is causing errors in order to limit the possibility of
propagation of bad data.

There is thus, in HW, a table of "PE" states that contains a pair of
"frozen" state bits (one for MMIO and one for DMA, they get set together
but can be cleared independently) for each PE.

When a PE is frozen, all stores in any direction are dropped and all loads
return all 1's value. MSIs are also blocked. There's a bit more state that
captures things like the details of the error that caused the freeze etc...
but that's not critical.

The interesting part is how the various type of PCIe transactions (MMIO,
DMA,...) are matched to their corresponding PEs.

I will provide a rough description of what we have on P8 (IODA2). Keep
in mind that this is all per PHB (host bridge). Each PHB is a completely
separate HW entity that replicates the entire logic, so it has its own set
of PEs, etc.

First, P8 has 256 PEs per PHB.

 * Inbound

For DMA, MSIs and inbound PCIe error messages, we have a table (in memory but
accessed in HW by the chip) that provides a direct correspondence between
a PCIe RID (bus/dev/fn) and a PE number. We call this the RTT.

 - For DMA we then provide an entire address space for each PE that can contain
two "windows", depending on the value of PCI bit 59. Each window can then be
configured to be remapped via a "TCE table" (iommu translation table), which has
various configurable characteristics that we can describe another day.

 - For MSIs, we have two windows in the address space (one at the top of the 32-bit
space and one much higher) which, via a combination of the address and MSI value,
will result in one of the 2048 interrupts per bridge being triggered. There's
a PE value in the interrupt controller descriptor table as well which is compared
with the PE obtained from the RTT to "authorize" the device to emit that specific
interrupt.

 - Error messages just use the RTT.

 * Outbound. That's where the tricky part is.

The PHB basically has a concept of "windows" from the CPU address space to the
PCI address space. There is one M32 window and 16 M64 windows. They have different
characteristics. First, what they have in common: they are configured to forward a
configurable portion of the CPU address space to the PCIe bus, and they must be a
naturally aligned, power-of-two size. The rest is different:

  - The M32 window:

    * It is limited to 4G in size

    * It drops the top bits of the address (above the size) and replaces them with
a configurable value. This is typically used to generate 32-bit PCIe accesses. We
configure that window at boot from FW and don't touch it from Linux; it's usually
set to forward a 2G portion of address space from the CPU to PCIe,
0x8000_0000..0xffff_ffff. (Note: the top 64K are actually reserved for MSIs, but
this is not a problem at this point; we just need to ensure Linux doesn't assign
anything there. The M32 logic ignores that, however, and will forward into that
space if we try.)

    * It is divided into 256 segments of equal size. A table in the chip provides
a PE# for each of these 256 segments. That essentially allows us to assign portions
of the MMIO space to PEs at a segment granularity. For a 2G window, a segment is 8M.

Now, this is the "main" window we use in Linux today (excluding SR-IOV). We
basically use the trick of forcing the bridge MMIO windows onto a segment
alignment/granularity so that the space behind a bridge can be assigned to a PE.

Ideally we would like to be able to have individual functions in PE's but that
would mean using a completely different address allocation scheme where individual
function BARs can be "grouped" to fit in one or more segments....

 - The M64 windows.

   * Their smallest size is 1M

   * They do not translate addresses (the address on PCIe is the same as the
address on the PowerBus. There is a way to also set the top 14 bits which are
not conveyed by PowerBus but we don't use this).

   * They can be configured to be segmented or not. When segmented, they have
256 segments; however, they are not remapped. The segment number *is* the PE
number. When not segmented, the PE number can be specified for the entire
window.

   * They support overlaps in which case there is a well defined ordering of
matching (I don't remember off hand which of the lower or higher numbered
window takes priority but basically it's well defined).

We have code (fairly new compared to the M32 stuff) that exploits that for
large BARs in 64-bit space:

We create a single big M64 that covers the entire region of address space that
has been assigned by FW for the PHB (about 64G, ignore the space for the M32,
it comes out of a different "reserve"). We configure that window as segmented.

Then we do the same thing as with M32, using the bridge alignment trick, to
match those giant segments.

Since we cannot remap, we have two additional constraints:

  - We do the PE# allocation *after* the 64-bit space has been assigned, since
the segments used directly determine the PE#; we then "update" the M32 PE# for
the devices that use both 32-bit and 64-bit spaces, or assign the remaining
PE#s to 32-bit-only devices.

  - We cannot "group" segments in HW so if a device ends up using more than
one segment, we end up with more than one PE#. There is a HW mechanism to
make the freeze state cascade to "companion" PEs but that only works for PCIe
error messages (typically used so that if you freeze a switch, it freezes all
its children). So we do it in SW. We lose a bit of effectiveness of EEH in that
case, but that's the best we found. So when any of the PEs freezes, we freeze
the other ones for that "domain". We thus introduce the concept of "master PE"
which is the one used for DMA, MSIs etc... and "secondary PEs" that are used
for the remaining M64 segments.

We would like to investigate using additional M64s in "single PE" mode to
overlay specific BARs to work around some of that; for devices with very large
BARs (some GPUs), for example, it would make sense, but we haven't done it
yet.

Finally, the plan to use M64 for SR-IOV, which we already described a bit,
consists of using those M64s. For a given IOV BAR, we need to effectively reserve
the entire 256 segments (256 * per-VF BAR size) and then "position" the BAR to
start at the beginning of a free range of segments/PEs inside that M64.

The goal is of course to be able to give a separate PE for each VF...
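
To make that concrete with made-up numbers: for a 4KB per-VF BAR we would
reserve 256 * 4KB = 1MB of M64 space, naturally aligned, and if the VFs were
assigned PEs starting at 16, we would program the SR-IOV BAR offset so that
VF0's slice falls in segment 16, i.e. 64KB into the window.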

I hope that helps clarify things a bit...

Cheers,
Ben.


Wei Yang Nov. 20, 2014, 5:39 a.m. UTC | #8
On Wed, Nov 19, 2014 at 10:23:50AM -0700, Bjorn Helgaas wrote:
>On Wed, Nov 19, 2014 at 05:27:40PM +0800, Wei Yang wrote:
>> On Tue, Nov 18, 2014 at 09:26:01PM -0700, Bjorn Helgaas wrote:
>> >On Wed, Nov 19, 2014 at 11:21:00AM +0800, Wei Yang wrote:
>> >> On Wed, Nov 19, 2014 at 01:15:32PM +1100, Benjamin Herrenschmidt wrote:
>> >> >On Tue, 2014-11-18 at 18:12 -0700, Bjorn Helgaas wrote:
>
>> >> But the HW
>> >> must map 256 segments with the same size. This will lead a situation like
>> >> this.
>> >> 
>> >>    +------+------+        +------+------+------+------+
>> >>    |VF#0  |VF#1  |   ...  |      |VF#N-1|PF#A  |PF#B  |
>> >>    +------+------+        +------+------+------+------+
>> >> 
>> >> Suppose N = 254 and the HW map these 256 segments to their corresponding PE#.
>> >
>> >I guess these 256 segments are regions of CPU physical address space, and
>> >they are being mapped to bus address space?  Is there some relationship
>> >between a PE and part of the bus address space?
>> >
>> 
>> PE is an entity for EEH, which may include a whole bus or one pci device.
>
>Yes, I've read that many times.  What's missing is the connection between a
>PE and the things in the PCI specs (buses, devices, functions, MMIO address
>space, DMA, MSI, etc.)  Presumably the PE structure imposes constraints on
>how the core uses the standard PCI elements, but we don't really have a
>clear description of those constraints yet.
>
>> When some device got some error, we need to identify which PE it belongs to.
>> So we have some HW to map between PE# and MMIO/DMA/MSI address.
>> 
>> The HW mentioned in previous letter is the one to map MMIO address to a PE#.
>> While this HW must map a range with 256 equal segments. And yes, this is
>> mapped to bus address space.
>> ...
>
>> >> The difference after our expanding is the IOV BAR size is 256*4KB instead of
>> >> 16KB. So it will look like this:
>> >> 
>> >>   PF  pci_dev->resource[7] = [mem 0x00000000-0x000fffff] (1024KB)
>> >
>> >Is the idea that you want this resource to be big enough to cover all 256
>> >segments?  I think I'm OK with increasing the size of the PF resources to
>> >prevent overlap.  That part shouldn't be too ugly.
>> >
>> 
>> Yes, big enough to cover all 256 segments.
>> 
>> Sorry for making it ugly :-(
>
>I didn't mean that what you did was ugly.  I meant that increasing the size
>of the PF resource can be done cleanly.
>
>By the way, when you do this, it would be nice if the dmesg showed the
>standard PF IOV BAR sizing, and then a separate line showing the resource
>expansion to deal with the PE constraints.  I don't think even the standard
>output is very clear -- I think we currently get something like this:
>
>  pci 0000:00:00.0 reg 0x174: [mem 0x00000000-0x00000fff]
>
>But that is only the size of a single VF BAR aperture.  Then sriov_init()
>multiplies that by the number of possible VFs, but I don't think we print
>the overall size of that PF resource.  I think we should, because it's
>misleading to print only the smaller piece.  Maybe something like this:
>
>  pci 0000:00:00.0 VF BAR0: [mem 0x00000000-0x00003fff] (for 4 VFs)
>
>And then you could do something like:
>
>  pci 0000:00:00.0 VF BAR0: [mem 0x00000000-0x000fffff] (expanded for PE alignment)
>

Got it, I will add messages to reflect this.

>> >>   VF1 pci_dev->resource[0] = [mem 0x00000000-0x00000fff]
>> >>   VF2 pci_dev->resource[0] = [mem 0x00001000-0x00001fff]
>> >>   VF3 pci_dev->resource[0] = [mem 0x00002000-0x00002fff]
>> >>   VF4 pci_dev->resource[0] = [mem 0x00003000-0x00003fff]
>> >>   ...
>> >>   and 252 4KB space leave not used.
>> >> 
>> >> So the start address and the size of VF will not change, but the PF's IOV BAR
>> >> will be expanded.
>> >
>> >I'm really dubious about this change to use pci_iov_resource_size().  I
>> >think you might be doing that because if you increase the PF resource size,
>> >dividing that increased size by total_VFs will give you garbage.  E.g., in
>> >the example above, you would compute "size = 1024KB / 4", which would make
>> >the VF BARs appear to be 256KB instead of 4KB as they should be.
>> 
>> Yes, your understanding is correct.
>> 
>> >I think it would be better to solve that problem by decoupling the PF
>> >resource size and the VF BAR size.  For example, we could keep track of the
>> >VF BAR size explicitly in struct pci_sriov, instead of computing it from
>> >the PF resource size and total_VFs.  This would keep the VF BAR size
>> >completely platform-independent.
>> 
>> Hmm... this is another solution.
>> 
>> If you prefer this one, I will make a change accordingly.
>
>Yes, I definitely prefer to track the VF BAR size explicitly.  I think that
>will make the code much clearer.

Got it.

>
>Bjorn
Wei Yang Nov. 20, 2014, 5:40 a.m. UTC | #9
On Thu, Nov 20, 2014 at 07:51:40AM +1100, Benjamin Herrenschmidt wrote:
>On Wed, 2014-11-19 at 10:23 -0700, Bjorn Helgaas wrote:
>> 
>> Yes, I've read that many times.  What's missing is the connection between a
>> PE and the things in the PCI specs (buses, devices, functions, MMIO address
>> space, DMA, MSI, etc.)  Presumably the PE structure imposes constraints on
>> how the core uses the standard PCI elements, but we don't really have a
>> clear description of those constraints yet.
>
>Right, a "PE" is a HW concept in fact in our bridges, that essentially is
>a shared isolation state between DMA, MMIO, MSIs, PCIe error messages,...
>for a given "domain" or set of PCI functions.
>
>The techniques of how the HW resources are mapped to PE and associated
>constraints are slightly different from one generation of our chips to
>the next. In general, P7 follows an architecture known as "IODA" and P8
>"IODA2". I'm trying to get that spec made available via OpenPower but
>that hasn't happened yet.
>
>In this case we mostly care about IODA2 (P8), so I'll give a quick
>description here. Wei, feel free to copy/paste that into a bit of doco
>to throw into Documentation/powerpc/ along with your next spin of the patch.
>

Got it.

I will add more description under Documentation/powerpc/.

Patch

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 4d1685d..6866830 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -61,6 +61,30 @@  static void virtfn_remove_bus(struct pci_bus *physbus, struct pci_bus *virtbus)
 		pci_remove_bus(virtbus);
 }
 
+resource_size_t __weak pcibios_iov_resource_size(struct pci_dev *dev, int resno)
+{
+	return 0;
+}
+
+resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno)
+{
+	resource_size_t size;
+	struct pci_sriov *iov;
+
+	if (!dev->is_physfn)
+		return 0;
+
+	size = pcibios_iov_resource_size(dev, resno);
+	if (size != 0)
+		return size;
+
+	iov = dev->sriov;
+	size = resource_size(dev->resource + resno);
+	do_div(size, iov->total_VFs);
+
+	return size;
+}
+
 static int virtfn_add(struct pci_dev *dev, int id, int reset)
 {
 	int i;
@@ -96,8 +120,7 @@  static int virtfn_add(struct pci_dev *dev, int id, int reset)
 			continue;
 		virtfn->resource[i].name = pci_name(virtfn);
 		virtfn->resource[i].flags = res->flags;
-		size = resource_size(res);
-		do_div(size, iov->total_VFs);
+		size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
 		virtfn->resource[i].start = res->start + size * id;
 		virtfn->resource[i].end = virtfn->resource[i].start + size - 1;
 		rc = request_resource(res, &virtfn->resource[i]);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index bbf8058..2f5b454 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1162,6 +1162,8 @@  resource_size_t pcibios_window_alignment(struct pci_bus *bus,
 resource_size_t pcibios_iov_resource_alignment(struct pci_dev *dev,
 						 int resno,
 						 resource_size_t align);
+resource_size_t pcibios_iov_resource_size(struct pci_dev *dev,
+		                            int resno);
 
 #define PCI_VGA_STATE_CHANGE_BRIDGE (1 << 0)
 #define PCI_VGA_STATE_CHANGE_DECODES (1 << 1)
@@ -1666,6 +1668,7 @@  int pci_num_vf(struct pci_dev *dev);
 int pci_vfs_assigned(struct pci_dev *dev);
 int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs);
 int pci_sriov_get_totalvfs(struct pci_dev *dev);
+resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno);
 #else
 static inline int pci_iov_virtfn_bus(struct pci_dev *dev, int id)
 {
@@ -1685,6 +1688,8 @@  static inline int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs)
 { return 0; }
 static inline int pci_sriov_get_totalvfs(struct pci_dev *dev)
 { return 0; }
+static inline resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno)
+{ return 0; }
 #endif
 
 #if defined(CONFIG_HOTPLUG_PCI) || defined(CONFIG_HOTPLUG_PCI_MODULE)