Message ID | 20150713184001.19985.64867.stgit@mdrustad-wks.jf.intel.com (mailing list archive)
---|---
State | New, archived
Delegated to | Bjorn Helgaas
On Mon, 2015-07-13 at 11:40 -0700, Mark D Rustad wrote:
> From: Mark Rustad <mark.d.rustad@intel.com>
>
> Add a dev_flags bit, PCI_DEV_FLAGS_VPD_REF_F0, to access VPD through
> function 0 to provide VPD access on other functions. This is for
> hardware devices that provide copies of the same VPD capability
> registers in multiple functions. Because the kernel expects that
> each function has its own registers, both the locking and the state
> tracking are affected by VPD accesses to different functions.
>
> On such devices, for example, if a VPD write is performed on function
> 0, *any* later attempt to read VPD from any other function of that
> device will hang. This has to do with how the kernel tracks the
> expected value of the F bit per function.
>
> Concurrent accesses to different functions of the same device can
> not only hang but also corrupt both read and write VPD data.
>
> When hangs occur, typically the error message:
>
>     vpd r/w failed. This is likely a firmware bug on this device.
>
> will be seen.
>
> Never set this bit on function 0 or there will be an infinite
> recursion.
>
> Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
> ---
> Changes in V2:
> - Corrected spelling in log message
> - Added checks to see that the referenced function 0 is reasonable
> Changes in V3:
> - Don't leak a device reference
> - Check that function 0 has VPD
> - Make a helper for the function 0 checks
> - Do multifunction check in the quirk
> Changes in V4:
> - Provide a much more detailed explanation in the commit log
> ---
>  drivers/pci/access.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  include/linux/pci.h  |  2 ++
>  2 files changed, 62 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/pci/access.c b/drivers/pci/access.c
> index d9b64a175990..b965c12168b7 100644
> --- a/drivers/pci/access.c
> +++ b/drivers/pci/access.c
> @@ -439,6 +439,56 @@ static const struct pci_vpd_ops pci_vpd_pci22_ops = {
>  	.release = pci_vpd_pci22_release,
>  };
>
> +static ssize_t pci_vpd_f0_read(struct pci_dev *dev, loff_t pos, size_t count,
> +			       void *arg)
> +{
> +	struct pci_dev *tdev = pci_get_slot(dev->bus, PCI_SLOT(dev->devfn));
> +	ssize_t ret;
> +
> +	if (!tdev)
> +		return -ENODEV;
> +
> +	ret = pci_read_vpd(tdev, pos, count, arg);
> +	pci_dev_put(tdev);
> +	return ret;
> +}
> +
> +static ssize_t pci_vpd_f0_write(struct pci_dev *dev, loff_t pos, size_t count,
> +				const void *arg)
> +{
> +	struct pci_dev *tdev = pci_get_slot(dev->bus, PCI_SLOT(dev->devfn));
> +	ssize_t ret;
> +
> +	if (!tdev)
> +		return -ENODEV;
> +
> +	ret = pci_write_vpd(tdev, pos, count, arg);
> +	pci_dev_put(tdev);
> +	return ret;
> +}
> +
> +static const struct pci_vpd_ops pci_vpd_f0_ops = {
> +	.read = pci_vpd_f0_read,
> +	.write = pci_vpd_f0_write,
> +	.release = pci_vpd_pci22_release,
> +};
> +
> +static int pci_vpd_f0_dev_check(struct pci_dev *dev)
> +{
> +	struct pci_dev *tdev = pci_get_slot(dev->bus, PCI_SLOT(dev->devfn));
> +	int ret = 0;
> +
> +	if (!tdev)
> +		return -ENODEV;
> +	if (!tdev->vpd || !tdev->multifunction ||
> +	    dev->class != tdev->class || dev->vendor != tdev->vendor ||
> +	    dev->device != tdev->device)
> +		ret = -ENODEV;
> +
> +	pci_dev_put(tdev);
> +	return ret;
> +}
> +
>  int pci_vpd_pci22_init(struct pci_dev *dev)
>  {
>  	struct pci_vpd_pci22 *vpd;
> @@ -447,12 +497,21 @@ int pci_vpd_pci22_init(struct pci_dev *dev)
>  	cap = pci_find_capability(dev, PCI_CAP_ID_VPD);
>  	if (!cap)
>  		return -ENODEV;
> +	if (dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0) {
> +		int ret = pci_vpd_f0_dev_check(dev);
> +
> +		if (ret)
> +			return ret;
> +	}

In addition to the (PCI_SLOT() != devfn) issue, I'm concerned about
topologies like we see on Skylake. IIRC, the integrated NIC appears at
something like 00:1f.6. I don't know if that specific NIC has VPD, nor
am I sure it really matters, because another example or some future
version might. So we'll set PCI_DEV_FLAGS_VPD_REF_F0, because we do so
for all (PCI_FUNC() != 0) Intel NICs; we'll call pci_vpd_f0_dev_check(),
which will error because function 0 has a different class code and
device ID; so we return an error and, if VPD exists on the device, it's
now inaccessible.

I thought there was talk about whitelisting anything on the root bus to
avoid strange root complex integrated devices (and perhaps avoid the
general case for assigned devices within a VM), but I don't see
anything like that here.

Perhaps instead of failing and hiding VPD we should fail, clear the
flag, and allow normal access. Thanks,

Alex

>  	vpd = kzalloc(sizeof(*vpd), GFP_ATOMIC);
>  	if (!vpd)
>  		return -ENOMEM;
>
>  	vpd->base.len = PCI_VPD_PCI22_SIZE;
> -	vpd->base.ops = &pci_vpd_pci22_ops;
> +	if (dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0)
> +		vpd->base.ops = &pci_vpd_f0_ops;
> +	else
> +		vpd->base.ops = &pci_vpd_pci22_ops;
>  	mutex_init(&vpd->lock);
>  	vpd->cap = cap;
>  	vpd->busy = false;
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 8a0321a8fb59..8edb125db13a 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -180,6 +180,8 @@ enum pci_dev_flags {
>  	PCI_DEV_FLAGS_NO_BUS_RESET = (__force pci_dev_flags_t) (1 << 6),
>  	/* Do not use PM reset even if device advertises NoSoftRst- */
>  	PCI_DEV_FLAGS_NO_PM_RESET = (__force pci_dev_flags_t) (1 << 7),
> +	/* Get VPD from function 0 VPD */
> +	PCI_DEV_FLAGS_VPD_REF_F0 = (__force pci_dev_flags_t) (1 << 8),
>  };
>
>  enum pci_irq_reroute_variant {
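The quirk that sets this flag is not shown in the patch above; for
context, a minimal sketch of how it might be applied to every non-zero
function of Intel Ethernet devices, assuming the standard PCI fixup
machinery (the name quirk_f0_vpd_link is illustrative):

static void quirk_f0_vpd_link(struct pci_dev *dev)
{
	/* Never set the flag on function 0 itself: the f0 ops would
	 * recurse into themselves. */
	if (!PCI_FUNC(dev->devfn))
		return;
	dev->dev_flags |= PCI_DEV_FLAGS_VPD_REF_F0;
}
DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, PCI_ANY_ID,
			      PCI_CLASS_NETWORK_ETHERNET, 8, quirk_f0_vpd_link);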
> On Sep 15, 2015, at 11:19 AM, Alex Williamson <alex.williamson@redhat.com> wrote:
>
> In addition to the (PCI_SLOT() != devfn) issue, I'm concerned about
> topologies like we see on Skylake. IIRC, the integrated NIC appears at
> something like 00:1f.6. I don't know if that specific NIC has VPD, nor
> am I sure it really matters, because another example or some future
> version might. So we'll set PCI_DEV_FLAGS_VPD_REF_F0, because we do so
> for all (PCI_FUNC() != 0) Intel NICs; we'll call
> pci_vpd_f0_dev_check(), which will error because function 0 has a
> different class code and device ID; so we return an error and, if VPD
> exists on the device, it's now inaccessible.

Yes, that is exactly what would happen.

> I thought there was talk about whitelisting anything on the root bus to
> avoid strange root complex integrated devices (and perhaps avoid the
> general case for assigned devices within a VM), but I don't see
> anything like that here.

I hadn't heard that talk, but I'm not on the PCI list and I guess I
wasn't copied.

> Perhaps instead of failing and hiding VPD we should fail, clear the
> flag, and allow normal access. Thanks,

Because the purpose of VPD is to hold information about the device, I
would suggest that VPD should never be provided for an embedded network
device, but rather for the device as a whole. So while there may well
be VPD for an SOC, that VPD should not be associated with one of its
embedded devices, but rather something more appropriate for the device
as a whole. And attaching VPD to a whole bunch of internal devices
would just be madness.

So I understand the concern, but I don't think that it should really
happen in real systems. I did think about this case when I was working
on the patches. A networking device should really only have VPD when it
is its own physical device, such as a NIC.

--
Mark Rustad, Networking Division, Intel Corporation
On Tue, 2015-09-15 at 18:39 +0000, Rustad, Mark D wrote:
> > On Sep 15, 2015, at 11:19 AM, Alex Williamson <alex.williamson@redhat.com> wrote:
> >
> > In addition to the (PCI_SLOT() != devfn) issue, I'm concerned about
> > topologies like we see on Skylake. IIRC, the integrated NIC appears
> > at something like 00:1f.6. I don't know if that specific NIC has
> > VPD, nor am I sure it really matters, because another example or
> > some future version might. So we'll set PCI_DEV_FLAGS_VPD_REF_F0,
> > because we do so for all (PCI_FUNC() != 0) Intel NICs; we'll call
> > pci_vpd_f0_dev_check(), which will error because function 0 has a
> > different class code and device ID; so we return an error and, if
> > VPD exists on the device, it's now inaccessible.
>
> Yes, that is exactly what would happen.
>
> > I thought there was talk about whitelisting anything on the root bus
> > to avoid strange root complex integrated devices (and perhaps avoid
> > the general case for assigned devices within a VM), but I don't see
> > anything like that here.
>
> I hadn't heard that talk, but I'm not on the PCI list and I guess I
> wasn't copied.
>
> > Perhaps instead of failing and hiding VPD we should fail, clear the
> > flag, and allow normal access. Thanks,
>
> Because the purpose of VPD is to hold information about the device, I
> would suggest that VPD should never be provided for an embedded
> network device, but rather for the device as a whole. So while there
> may well be VPD for an SOC, that VPD should not be associated with one
> of its embedded devices, but rather something more appropriate for the
> device as a whole. And attaching VPD to a whole bunch of internal
> devices would just be madness.

FRU-type information is only one of the use cases of VPD; the spec also
defines (PCI rev 3.0, 6.4):

    ... a mechanism for storing information such as performance and
    failure data on the device being monitored.

That information could very much be function specific.

> So I understand the concern, but I don't think that it should really
> happen in real systems. I did think about this case when I was working
> on the patches. A networking device should really only have VPD when
> it is its own physical device, such as a NIC.

When I was looking at whether we should provide VPD access of an
assigned device at all, I ran across this interesting statement in the
PCI spec (rev 3.0, I.3.1.1):

    CP Extended Capability

    This field allows a new capability to be identified in the VPD
    area. Since dynamic control/status cannot be placed in VPD, the
    data for this field identifies where, in the device’s memory or
    I/O address space, the control/status registers for the capability
    can be found. Location of the control/status registers is
    identified by providing the index (a value between 0 and 5) of the
    Base Address register that defines the address range that contains
    the registers, and the offset within that Base Address register
    range where the control/status registers reside. The data area for
    this field is four bytes long. The first byte contains the ID of
    the extended capability. The second byte contains the index (zero
    based) of the Base Address register used. The next two bytes
    contain the offset (in little-endian order) within that address
    range where the control/status registers defined for that
    capability reside.

Again, this sounds like function specific data, and both here and
above, blocking access to VPD could affect the functionality of
drivers. It may be the case that Intel would find this use to be
madness, but there's no PCI spec requirement that separate functions
are in any way similar, and we're looking at an interface that may be
used by non-Intel devices as well. Thanks,

Alex
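As a reading aid for the CP field quoted above, a sketch of its
four-byte data area; the struct and helper names are illustrative and
not part of any kernel API:

#include <stdint.h>

/* Layout of the 4-byte VPD "CP" data area per the spec text above. */
struct vpd_cp_field {
	uint8_t cap_id;     /* ID of the extended capability */
	uint8_t bar_index;  /* zero-based BAR index (0-5) */
	uint8_t offset_lo;  /* offset within that BAR's range, */
	uint8_t offset_hi;  /* stored in little-endian order */
};

static inline uint16_t vpd_cp_offset(const struct vpd_cp_field *f)
{
	return (uint16_t)(f->offset_lo | (f->offset_hi << 8));
}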
> On Sep 15, 2015, at 12:04 PM, Alex Williamson <alex.williamson@redhat.com> wrote:
>
> FRU-type information is only one of the use cases of VPD; the spec
> also defines (PCI rev 3.0, 6.4):
>
>     ... a mechanism for storing information such as performance and
>     failure data on the device being monitored.
>
> That information could very much be function specific.

It is open to interpretation. I guess I still see it as the physical
device as a whole.

> When I was looking at whether we should provide VPD access of an
> assigned device at all, I ran across this interesting statement in the
> PCI spec (rev 3.0, I.3.1.1):
>
>     CP Extended Capability
>
>     This field allows a new capability to be identified in the VPD
>     area. Since dynamic control/status cannot be placed in VPD, the
>     data for this field identifies where, in the device’s memory or
>     I/O address space, the control/status registers for the capability
>     can be found. Location of the control/status registers is
>     identified by providing the index (a value between 0 and 5) of the
>     Base Address register that defines the address range that contains
>     the registers, and the offset within that Base Address register
>     range where the control/status registers reside. The data area for
>     this field is four bytes long. The first byte contains the ID of
>     the extended capability. The second byte contains the index (zero
>     based) of the Base Address register used. The next two bytes
>     contain the offset (in little-endian order) within that address
>     range where the control/status registers defined for that
>     capability reside.
>
> Again, this sounds like function specific data, and both here and
> above, blocking access to VPD could affect the functionality of
> drivers. It may be the case that Intel would find this use to be
> madness, but there's no PCI spec requirement that separate functions
> are in any way similar, and we're looking at an interface that may be
> used by non-Intel devices as well. Thanks,

It isn't an interface as such; it is a quirk to address some widespread
design problems with multi-function devices with VPD. And you are right
that functions can be different. In fact, this quirk is needed only
because now they often (usually, in fact) are not different! I do hope
to see some non-Intel devices use the quirk, because I'm pretty sure
there are other devices that have the same issue.

I realize that I covered a pretty wide swath by making the quirk apply
to all Intel Ethernet devices, but that still seems correct. The
Skylake is not an issue because it does not have VPD, so the
pci_find_capability call will fail before any handling of the quirk is
possible. The code that applies the quirk could check specific devices,
but it would make the code a lot bigger, and I see this kind of code as
dead weight for so many systems that I tried to make it as small as
possible. Since all Intel Ethernet seems to be correct now, and as far
as I can see into the future, that is what I did.

Going back to something you mentioned before, I think you are right
that the failure case for pci_vpd_f0_dev_check could be made to simply
clear the quirk and continue, since pci_vpd_f0_dev_check really should
not fail in cases where the quirk is applicable. That does seem like a
reasonable change to me the more I think about it.

I think a whitelist would be unnecessary dead weight.

--
Mark Rustad, Networking Division, Intel Corporation
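The change Mark agrees to here, sketched against the V4
pci_vpd_pci22_init() above (an assumption about how it might look, not
a final patch):

	if (dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0) {
		/* Function 0 doesn't look like a VPD mirror: drop the
		 * quirk and fall back to normal per-function access
		 * instead of hiding VPD entirely. */
		if (pci_vpd_f0_dev_check(dev))
			dev->dev_flags &= ~PCI_DEV_FLAGS_VPD_REF_F0;
	}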
On Tue, 2015-09-15 at 20:47 +0000, Rustad, Mark D wrote:
> > On Sep 15, 2015, at 12:04 PM, Alex Williamson <alex.williamson@redhat.com> wrote:
> >
> > FRU-type information is only one of the use cases of VPD; the spec
> > also defines (PCI rev 3.0, 6.4):
> >
> >     ... a mechanism for storing information such as performance and
> >     failure data on the device being monitored.
> >
> > That information could very much be function specific.
>
> It is open to interpretation. I guess I still see it as the physical
> device as a whole.
>
> > When I was looking at whether we should provide VPD access of an
> > assigned device at all, I ran across this interesting statement in
> > the PCI spec (rev 3.0, I.3.1.1):
> >
> >     CP Extended Capability
> >
> >     This field allows a new capability to be identified in the VPD
> >     area. Since dynamic control/status cannot be placed in VPD, the
> >     data for this field identifies where, in the device’s memory or
> >     I/O address space, the control/status registers for the
> >     capability can be found. Location of the control/status
> >     registers is identified by providing the index (a value between
> >     0 and 5) of the Base Address register that defines the address
> >     range that contains the registers, and the offset within that
> >     Base Address register range where the control/status registers
> >     reside. The data area for this field is four bytes long. The
> >     first byte contains the ID of the extended capability. The
> >     second byte contains the index (zero based) of the Base Address
> >     register used. The next two bytes contain the offset (in
> >     little-endian order) within that address range where the
> >     control/status registers defined for that capability reside.
> >
> > Again, this sounds like function specific data, and both here and
> > above, blocking access to VPD could affect the functionality of
> > drivers. It may be the case that Intel would find this use to be
> > madness, but there's no PCI spec requirement that separate functions
> > are in any way similar, and we're looking at an interface that may
> > be used by non-Intel devices as well. Thanks,
>
> It isn't an interface as such; it is a quirk to address some
> widespread design problems with multi-function devices with VPD. And
> you are right that functions can be different. In fact, this quirk is
> needed only because now they often (usually, in fact) are not
> different! I do hope to see some non-Intel devices use the quirk,
> because I'm pretty sure there are other devices that have the same
> issue.
>
> I realize that I covered a pretty wide swath by making the quirk apply
> to all Intel Ethernet devices, but that still seems correct. The
> Skylake is not an issue because it does not have VPD, so the
> pci_find_capability call will fail before any handling of the quirk is
> possible. The code that applies the quirk could check specific
> devices, but it would make the code a lot bigger, and I see this kind
> of code as dead weight for so many systems that I tried to make it as
> small as possible. Since all Intel Ethernet seems to be correct now,
> and as far as I can see into the future, that is what I did.
>
> Going back to something you mentioned before, I think you are right
> that the failure case for pci_vpd_f0_dev_check could be made to simply
> clear the quirk and continue, since pci_vpd_f0_dev_check really should
> not fail in cases where the quirk is applicable. That does seem like a
> reasonable change to me the more I think about it.
>
> I think a whitelist would be unnecessary dead weight.

Yep, a whitelist is probably not the way to go. AFAICT, you're looking
for plug-in cards where all the functions meet the criteria of having
the same class, vendor, and device ID. If we don't meet that criteria,
then it's not a device we're expecting and we should leave it alone.

Also, rather than clearing the flag, can we move the tests done by
pci_vpd_f0_dev_check() into the quirk setup function? It seems like
function 0 should be sufficiently configured by the time we're probing
non-zero functions that we can be more selective in setting the flag
rather than unsetting it later. Thanks,

Alex
> On Sep 15, 2015, at 2:17 PM, Alex Williamson <alex.williamson@redhat.com> wrote:
>
> Also, rather than clearing the flag, can we move the tests done by
> pci_vpd_f0_dev_check() into the quirk setup function? It seems like
> function 0 should be sufficiently configured by the time we're probing
> non-zero functions that we can be more selective in setting the flag
> rather than unsetting it later.

I guess I was being very conservative in not assuming anything about
the state of other devices at that point. Things seem to be
increasingly parallel all the time, and I am not deeply involved in the
evolution of the PCI subsystem.

If you want to make that assumption, I would suggest that
pci_vpd_f0_dev_check remain a separate function called by the quirk
setup so that it can be used by other quirk setup functions as well.

--
Mark Rustad, Networking Division, Intel Corporation
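Combining the two suggestions, a sketch of checking at quirk-setup time
while keeping pci_vpd_f0_dev_check() as a shared helper
(quirk_intel_vpd_f0 is a hypothetical name):

static void quirk_intel_vpd_f0(struct pci_dev *dev)
{
	/* Only non-zero functions ever reference function 0. */
	if (!PCI_FUNC(dev->devfn))
		return;
	/* Set the flag only when function 0 passes the sanity check,
	 * so there is nothing to unset later. */
	if (!pci_vpd_f0_dev_check(dev))
		dev->dev_flags |= PCI_DEV_FLAGS_VPD_REF_F0;
}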
diff --git a/drivers/pci/access.c b/drivers/pci/access.c
index d9b64a175990..b965c12168b7 100644
--- a/drivers/pci/access.c
+++ b/drivers/pci/access.c
@@ -439,6 +439,56 @@ static const struct pci_vpd_ops pci_vpd_pci22_ops = {
 	.release = pci_vpd_pci22_release,
 };
 
+static ssize_t pci_vpd_f0_read(struct pci_dev *dev, loff_t pos, size_t count,
+			       void *arg)
+{
+	struct pci_dev *tdev = pci_get_slot(dev->bus, PCI_SLOT(dev->devfn));
+	ssize_t ret;
+
+	if (!tdev)
+		return -ENODEV;
+
+	ret = pci_read_vpd(tdev, pos, count, arg);
+	pci_dev_put(tdev);
+	return ret;
+}
+
+static ssize_t pci_vpd_f0_write(struct pci_dev *dev, loff_t pos, size_t count,
+				const void *arg)
+{
+	struct pci_dev *tdev = pci_get_slot(dev->bus, PCI_SLOT(dev->devfn));
+	ssize_t ret;
+
+	if (!tdev)
+		return -ENODEV;
+
+	ret = pci_write_vpd(tdev, pos, count, arg);
+	pci_dev_put(tdev);
+	return ret;
+}
+
+static const struct pci_vpd_ops pci_vpd_f0_ops = {
+	.read = pci_vpd_f0_read,
+	.write = pci_vpd_f0_write,
+	.release = pci_vpd_pci22_release,
+};
+
+static int pci_vpd_f0_dev_check(struct pci_dev *dev)
+{
+	struct pci_dev *tdev = pci_get_slot(dev->bus, PCI_SLOT(dev->devfn));
+	int ret = 0;
+
+	if (!tdev)
+		return -ENODEV;
+	if (!tdev->vpd || !tdev->multifunction ||
+	    dev->class != tdev->class || dev->vendor != tdev->vendor ||
+	    dev->device != tdev->device)
+		ret = -ENODEV;
+
+	pci_dev_put(tdev);
+	return ret;
+}
+
 int pci_vpd_pci22_init(struct pci_dev *dev)
 {
 	struct pci_vpd_pci22 *vpd;
@@ -447,12 +497,21 @@ int pci_vpd_pci22_init(struct pci_dev *dev)
 	cap = pci_find_capability(dev, PCI_CAP_ID_VPD);
 	if (!cap)
 		return -ENODEV;
+	if (dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0) {
+		int ret = pci_vpd_f0_dev_check(dev);
+
+		if (ret)
+			return ret;
+	}
 
 	vpd = kzalloc(sizeof(*vpd), GFP_ATOMIC);
 	if (!vpd)
 		return -ENOMEM;
 
 	vpd->base.len = PCI_VPD_PCI22_SIZE;
-	vpd->base.ops = &pci_vpd_pci22_ops;
+	if (dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0)
+		vpd->base.ops = &pci_vpd_f0_ops;
+	else
+		vpd->base.ops = &pci_vpd_pci22_ops;
 	mutex_init(&vpd->lock);
 	vpd->cap = cap;
 	vpd->busy = false;
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 8a0321a8fb59..8edb125db13a 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -180,6 +180,8 @@ enum pci_dev_flags {
 	PCI_DEV_FLAGS_NO_BUS_RESET = (__force pci_dev_flags_t) (1 << 6),
 	/* Do not use PM reset even if device advertises NoSoftRst- */
 	PCI_DEV_FLAGS_NO_PM_RESET = (__force pci_dev_flags_t) (1 << 7),
+	/* Get VPD from function 0 VPD */
+	PCI_DEV_FLAGS_VPD_REF_F0 = (__force pci_dev_flags_t) (1 << 8),
 };
 
 enum pci_irq_reroute_variant {
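For reference, a sketch of a driver-side VPD access; with the flag set
on a non-zero function, the pci_vpd_f0_ops above transparently redirect
it to function 0 (the function name and buffer size are illustrative):

static void demo_read_vpd(struct pci_dev *pdev)
{
	u8 buf[64];
	ssize_t n = pci_read_vpd(pdev, 0, sizeof(buf), buf);

	if (n < 0)
		dev_err(&pdev->dev, "VPD read failed: %zd\n", n);
	else
		dev_info(&pdev->dev, "read %zd bytes of VPD\n", n);
}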