Message ID | 20180723200339.23943-1-mr.nuke.me@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Delegated to: | Bjorn Helgaas |
Headers | show |
Series | [v5] PCI: Check for PCIe downtraining conditions | expand |
On Mon, 23 Jul 2018 15:03:38 -0500, Alexandru Gagniuc wrote: > PCIe downtraining happens when both the device and PCIe port are > capable of a larger bus width or higher speed than negotiated. > Downtraining might be indicative of other problems in the system, and > identifying this from userspace is neither intuitive, nor > straightforward. > > The easiest way to detect this is with pcie_print_link_status(), > since the bottleneck is usually the link that is downtrained. It's not > a perfect solution, but it works extremely well in most cases. > > Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com> > --- > > For the sake of review, I've created a __pcie_print_link_status() which > takes a 'verbose' argument. If we agree want to go this route, and update > the users of pcie_print_link_status(), I can split this up in two patches. > I prefer just printing this information in the core functions, and letting > drivers not have to worry about this. Though there seems to be strong for > not going that route, so here it goes: FWIW the networking drivers print PCIe BW because sometimes the network bandwidth is simply over-provisioned on multi port cards, e.g. 80Gbps card on a x8 link. Sorry to bike shed, but currently the networking cards print the info during probe. Would it make sense to move your message closer to probe time? Rather than when device is added. If driver structure is available, we could also consider adding a boolean to struct pci_driver to indicate if driver wants the verbose message? This way we avoid duplicated prints. I have no objection to current patch, it LGTM. Just a thought. > drivers/pci/pci.c | 22 ++++++++++++++++++---- > drivers/pci/probe.c | 21 +++++++++++++++++++++ > include/linux/pci.h | 1 + > 3 files changed, 40 insertions(+), 4 deletions(-) > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > index 316496e99da9..414ad7b3abdb 100644 > --- a/drivers/pci/pci.c > +++ b/drivers/pci/pci.c > @@ -5302,14 +5302,15 @@ u32 pcie_bandwidth_capable(struct pci_dev *dev, enum pci_bus_speed *speed, > } > > /** > - * pcie_print_link_status - Report the PCI device's link speed and width > + * __pcie_print_link_status - Report the PCI device's link speed and width > * @dev: PCI device to query > + * @verbose: Be verbose -- print info even when enough bandwidth is available. > * > * Report the available bandwidth at the device. If this is less than the > * device is capable of, report the device's maximum possible bandwidth and > * the upstream link that limits its performance to less than that. > */ > -void pcie_print_link_status(struct pci_dev *dev) > +void __pcie_print_link_status(struct pci_dev *dev, bool verbose) > { > enum pcie_link_width width, width_cap; > enum pci_bus_speed speed, speed_cap; > @@ -5319,11 +5320,11 @@ void pcie_print_link_status(struct pci_dev *dev) > bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap); > bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, &width); > > - if (bw_avail >= bw_cap) > + if (bw_avail >= bw_cap && verbose) > pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth (%s x%d link)\n", > bw_cap / 1000, bw_cap % 1000, > PCIE_SPEED2STR(speed_cap), width_cap); > - else > + else if (bw_avail < bw_cap) > pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth, limited by %s x%d link at %s (capable of %u.%03u Gb/s with %s x%d link)\n", > bw_avail / 1000, bw_avail % 1000, > PCIE_SPEED2STR(speed), width, > @@ -5331,6 +5332,19 @@ void pcie_print_link_status(struct pci_dev *dev) > bw_cap / 1000, bw_cap % 1000, > PCIE_SPEED2STR(speed_cap), width_cap); > } > + > +/** > + * pcie_print_link_status - Report the PCI device's link speed and width > + * @dev: PCI device to query > + * > + * Report the available bandwidth at the device. If this is less than the > + * device is capable of, report the device's maximum possible bandwidth and > + * the upstream link that limits its performance to less than that. > + */ > +void pcie_print_link_status(struct pci_dev *dev) > +{ > + __pcie_print_link_status(dev, true); > +} > EXPORT_SYMBOL(pcie_print_link_status); > > /** > diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c > index ac876e32de4b..1f7336377c3b 100644 > --- a/drivers/pci/probe.c > +++ b/drivers/pci/probe.c > @@ -2205,6 +2205,24 @@ static struct pci_dev *pci_scan_device(struct pci_bus *bus, int devfn) > return dev; > } > > +static void pcie_check_upstream_link(struct pci_dev *dev) > +{ > + if (!pci_is_pcie(dev)) > + return; > + > + /* Look from the device up to avoid downstream ports with no devices. */ > + if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ENDPOINT) && > + (pci_pcie_type(dev) != PCI_EXP_TYPE_LEG_END) && > + (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM)) > + return; > + > + /* Multi-function PCIe share the same link/status. */ > + if (PCI_FUNC(dev->devfn) != 0 || dev->is_virtfn) > + return; > + > + __pcie_print_link_status(dev, false); > +} > + > static void pci_init_capabilities(struct pci_dev *dev) > { > /* Enhanced Allocation */ > @@ -2240,6 +2258,9 @@ static void pci_init_capabilities(struct pci_dev *dev) > /* Advanced Error Reporting */ > pci_aer_init(dev); > > + /* Check link and detect downtrain errors */ > + pcie_check_upstream_link(dev); > + > if (pci_probe_reset_function(dev) == 0) > dev->reset_fn = 1; > } > diff --git a/include/linux/pci.h b/include/linux/pci.h > index abd5d5e17aee..15bfab8f7a1b 100644 > --- a/include/linux/pci.h > +++ b/include/linux/pci.h > @@ -1088,6 +1088,7 @@ int pcie_set_mps(struct pci_dev *dev, int mps); > u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev **limiting_dev, > enum pci_bus_speed *speed, > enum pcie_link_width *width); > +void __pcie_print_link_status(struct pci_dev *dev, bool verbose); > void pcie_print_link_status(struct pci_dev *dev); > int pcie_flr(struct pci_dev *dev); > int __pci_reset_function_locked(struct pci_dev *dev);
On 7/24/2018 12:01 AM, Jakub Kicinski wrote: > On Mon, 23 Jul 2018 15:03:38 -0500, Alexandru Gagniuc wrote: >> PCIe downtraining happens when both the device and PCIe port are >> capable of a larger bus width or higher speed than negotiated. >> Downtraining might be indicative of other problems in the system, and >> identifying this from userspace is neither intuitive, nor >> straightforward. >> >> The easiest way to detect this is with pcie_print_link_status(), >> since the bottleneck is usually the link that is downtrained. It's not >> a perfect solution, but it works extremely well in most cases. >> >> Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com> >> --- >> >> For the sake of review, I've created a __pcie_print_link_status() which >> takes a 'verbose' argument. If we agree want to go this route, and update >> the users of pcie_print_link_status(), I can split this up in two patches. >> I prefer just printing this information in the core functions, and letting >> drivers not have to worry about this. Though there seems to be strong for >> not going that route, so here it goes: > > FWIW the networking drivers print PCIe BW because sometimes the network > bandwidth is simply over-provisioned on multi port cards, e.g. 80Gbps > card on a x8 link. > > Sorry to bike shed, but currently the networking cards print the info > during probe. Would it make sense to move your message closer to probe > time? Rather than when device is added. If driver structure is > available, we could also consider adding a boolean to struct pci_driver > to indicate if driver wants the verbose message? This way we avoid > duplicated prints. > > I have no objection to current patch, it LGTM. Just a thought. I don't see the reason for having two functions. What's the problem with adding the verbose argument to the original function? > >> drivers/pci/pci.c | 22 ++++++++++++++++++---- >> drivers/pci/probe.c | 21 +++++++++++++++++++++ >> include/linux/pci.h | 1 + >> 3 files changed, 40 insertions(+), 4 deletions(-) >> >> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c >> index 316496e99da9..414ad7b3abdb 100644 >> --- a/drivers/pci/pci.c >> +++ b/drivers/pci/pci.c >> @@ -5302,14 +5302,15 @@ u32 pcie_bandwidth_capable(struct pci_dev *dev, enum pci_bus_speed *speed, >> } >> >> /** >> - * pcie_print_link_status - Report the PCI device's link speed and width >> + * __pcie_print_link_status - Report the PCI device's link speed and width >> * @dev: PCI device to query >> + * @verbose: Be verbose -- print info even when enough bandwidth is available. >> * >> * Report the available bandwidth at the device. If this is less than the >> * device is capable of, report the device's maximum possible bandwidth and >> * the upstream link that limits its performance to less than that. >> */ >> -void pcie_print_link_status(struct pci_dev *dev) >> +void __pcie_print_link_status(struct pci_dev *dev, bool verbose) >> { >> enum pcie_link_width width, width_cap; >> enum pci_bus_speed speed, speed_cap; >> @@ -5319,11 +5320,11 @@ void pcie_print_link_status(struct pci_dev *dev) >> bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap); >> bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, &width); >> >> - if (bw_avail >= bw_cap) >> + if (bw_avail >= bw_cap && verbose) >> pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth (%s x%d link)\n", >> bw_cap / 1000, bw_cap % 1000, >> PCIE_SPEED2STR(speed_cap), width_cap); >> - else >> + else if (bw_avail < bw_cap) >> pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth, limited by %s x%d link at %s (capable of %u.%03u Gb/s with %s x%d link)\n", >> bw_avail / 1000, bw_avail % 1000, >> PCIE_SPEED2STR(speed), width, >> @@ -5331,6 +5332,19 @@ void pcie_print_link_status(struct pci_dev *dev) >> bw_cap / 1000, bw_cap % 1000, >> PCIE_SPEED2STR(speed_cap), width_cap); >> } >> + >> +/** >> + * pcie_print_link_status - Report the PCI device's link speed and width >> + * @dev: PCI device to query >> + * >> + * Report the available bandwidth at the device. If this is less than the >> + * device is capable of, report the device's maximum possible bandwidth and >> + * the upstream link that limits its performance to less than that. >> + */ >> +void pcie_print_link_status(struct pci_dev *dev) >> +{ >> + __pcie_print_link_status(dev, true); >> +} >> EXPORT_SYMBOL(pcie_print_link_status); >> >> /** >> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c >> index ac876e32de4b..1f7336377c3b 100644 >> --- a/drivers/pci/probe.c >> +++ b/drivers/pci/probe.c >> @@ -2205,6 +2205,24 @@ static struct pci_dev *pci_scan_device(struct pci_bus *bus, int devfn) >> return dev; >> } >> >> +static void pcie_check_upstream_link(struct pci_dev *dev) >> +{ >> + if (!pci_is_pcie(dev)) >> + return; >> + >> + /* Look from the device up to avoid downstream ports with no devices. */ >> + if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ENDPOINT) && >> + (pci_pcie_type(dev) != PCI_EXP_TYPE_LEG_END) && >> + (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM)) >> + return; >> + >> + /* Multi-function PCIe share the same link/status. */ >> + if (PCI_FUNC(dev->devfn) != 0 || dev->is_virtfn) >> + return; >> + >> + __pcie_print_link_status(dev, false); >> +} >> + >> static void pci_init_capabilities(struct pci_dev *dev) >> { >> /* Enhanced Allocation */ >> @@ -2240,6 +2258,9 @@ static void pci_init_capabilities(struct pci_dev *dev) >> /* Advanced Error Reporting */ >> pci_aer_init(dev); >> >> + /* Check link and detect downtrain errors */ >> + pcie_check_upstream_link(dev); This is called for every PCIe device right? Won't there be a duplicated print in case a device loads with lower PCIe bandwidth than needed? >> + >> if (pci_probe_reset_function(dev) == 0) >> dev->reset_fn = 1; >> } >> diff --git a/include/linux/pci.h b/include/linux/pci.h >> index abd5d5e17aee..15bfab8f7a1b 100644 >> --- a/include/linux/pci.h >> +++ b/include/linux/pci.h >> @@ -1088,6 +1088,7 @@ int pcie_set_mps(struct pci_dev *dev, int mps); >> u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev **limiting_dev, >> enum pci_bus_speed *speed, >> enum pcie_link_width *width); >> +void __pcie_print_link_status(struct pci_dev *dev, bool verbose); >> void pcie_print_link_status(struct pci_dev *dev); >> int pcie_flr(struct pci_dev *dev); >> int __pci_reset_function_locked(struct pci_dev *dev); >
On Tue, 24 Jul 2018 00:52:22 +0300, Tal Gilboa wrote: > On 7/24/2018 12:01 AM, Jakub Kicinski wrote: > > On Mon, 23 Jul 2018 15:03:38 -0500, Alexandru Gagniuc wrote: > >> PCIe downtraining happens when both the device and PCIe port are > >> capable of a larger bus width or higher speed than negotiated. > >> Downtraining might be indicative of other problems in the system, and > >> identifying this from userspace is neither intuitive, nor > >> straightforward. > >> > >> The easiest way to detect this is with pcie_print_link_status(), > >> since the bottleneck is usually the link that is downtrained. It's not > >> a perfect solution, but it works extremely well in most cases. > >> > >> Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com> > >> --- > >> > >> For the sake of review, I've created a __pcie_print_link_status() which > >> takes a 'verbose' argument. If we agree want to go this route, and update > >> the users of pcie_print_link_status(), I can split this up in two patches. > >> I prefer just printing this information in the core functions, and letting > >> drivers not have to worry about this. Though there seems to be strong for > >> not going that route, so here it goes: > > > > FWIW the networking drivers print PCIe BW because sometimes the network > > bandwidth is simply over-provisioned on multi port cards, e.g. 80Gbps > > card on a x8 link. > > > > Sorry to bike shed, but currently the networking cards print the info > > during probe. Would it make sense to move your message closer to probe > > time? Rather than when device is added. If driver structure is > > available, we could also consider adding a boolean to struct pci_driver > > to indicate if driver wants the verbose message? This way we avoid > > duplicated prints. > > > > I have no objection to current patch, it LGTM. Just a thought. > > I don't see the reason for having two functions. What's the problem with > adding the verbose argument to the original function? IMHO it's reasonable to keep the default parameter to what 90% of users want by a means on a wrapper. The non-verbose output is provided by the core already for all devices. What do you think of my proposal above Tal? That would make the extra wrapper unnecessary since the verbose parameter would be part of the driver structure, and it would avoid the duplicated output.
On 07/23/2018 05:14 PM, Jakub Kicinski wrote: > On Tue, 24 Jul 2018 00:52:22 +0300, Tal Gilboa wrote: >> On 7/24/2018 12:01 AM, Jakub Kicinski wrote: >>> On Mon, 23 Jul 2018 15:03:38 -0500, Alexandru Gagniuc wrote: >>>> PCIe downtraining happens when both the device and PCIe port are >>>> capable of a larger bus width or higher speed than negotiated. >>>> Downtraining might be indicative of other problems in the system, and >>>> identifying this from userspace is neither intuitive, nor >>>> straightforward. >>>> >>>> The easiest way to detect this is with pcie_print_link_status(), >>>> since the bottleneck is usually the link that is downtrained. It's not >>>> a perfect solution, but it works extremely well in most cases. >>>> >>>> Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com> >>>> --- >>>> >>>> For the sake of review, I've created a __pcie_print_link_status() which >>>> takes a 'verbose' argument. If we agree want to go this route, and update >>>> the users of pcie_print_link_status(), I can split this up in two patches. >>>> I prefer just printing this information in the core functions, and letting >>>> drivers not have to worry about this. Though there seems to be strong for >>>> not going that route, so here it goes: >>> >>> FWIW the networking drivers print PCIe BW because sometimes the network >>> bandwidth is simply over-provisioned on multi port cards, e.g. 80Gbps >>> card on a x8 link. >>> >>> Sorry to bike shed, but currently the networking cards print the info >>> during probe. Would it make sense to move your message closer to probe >>> time? Rather than when device is added. If driver structure is >>> available, we could also consider adding a boolean to struct pci_driver >>> to indicate if driver wants the verbose message? This way we avoid >>> duplicated prints. >>> >>> I have no objection to current patch, it LGTM. Just a thought. >> >> I don't see the reason for having two functions. What's the problem with >> adding the verbose argument to the original function? > > IMHO it's reasonable to keep the default parameter to what 90% of users > want by a means on a wrapper. The non-verbose output is provided by > the core already for all devices. > > What do you think of my proposal above Tal? That would make the extra > wrapper unnecessary since the verbose parameter would be part of the > driver structure, and it would avoid the duplicated output. I see how it might make sense to add another member to the driver struct, but is it worth the extra learning curve? It seems to be something with the potential to confuse new driver developers, and having a very marginal benefit. Although, if that's what people want... Alex
On 7/24/2018 2:59 AM, Alex G. wrote: > > > On 07/23/2018 05:14 PM, Jakub Kicinski wrote: >> On Tue, 24 Jul 2018 00:52:22 +0300, Tal Gilboa wrote: >>> On 7/24/2018 12:01 AM, Jakub Kicinski wrote: >>>> On Mon, 23 Jul 2018 15:03:38 -0500, Alexandru Gagniuc wrote: >>>>> PCIe downtraining happens when both the device and PCIe port are >>>>> capable of a larger bus width or higher speed than negotiated. >>>>> Downtraining might be indicative of other problems in the system, and >>>>> identifying this from userspace is neither intuitive, nor >>>>> straightforward. >>>>> >>>>> The easiest way to detect this is with pcie_print_link_status(), >>>>> since the bottleneck is usually the link that is downtrained. It's not >>>>> a perfect solution, but it works extremely well in most cases. >>>>> >>>>> Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com> >>>>> --- >>>>> >>>>> For the sake of review, I've created a __pcie_print_link_status() >>>>> which >>>>> takes a 'verbose' argument. If we agree want to go this route, and >>>>> update >>>>> the users of pcie_print_link_status(), I can split this up in two >>>>> patches. >>>>> I prefer just printing this information in the core functions, and >>>>> letting >>>>> drivers not have to worry about this. Though there seems to be >>>>> strong for >>>>> not going that route, so here it goes: >>>> >>>> FWIW the networking drivers print PCIe BW because sometimes the network >>>> bandwidth is simply over-provisioned on multi port cards, e.g. 80Gbps >>>> card on a x8 link. >>>> >>>> Sorry to bike shed, but currently the networking cards print the info >>>> during probe. Would it make sense to move your message closer to probe >>>> time? Rather than when device is added. If driver structure is >>>> available, we could also consider adding a boolean to struct pci_driver >>>> to indicate if driver wants the verbose message? This way we avoid >>>> duplicated prints. >>>> >>>> I have no objection to current patch, it LGTM. Just a thought. >>> >>> I don't see the reason for having two functions. What's the problem with >>> adding the verbose argument to the original function? >> >> IMHO it's reasonable to keep the default parameter to what 90% of users >> want by a means on a wrapper. The non-verbose output is provided by >> the core already for all devices. >> >> What do you think of my proposal above Tal? That would make the extra >> wrapper unnecessary since the verbose parameter would be part of the >> driver structure, and it would avoid the duplicated output. > > I see how it might make sense to add another member to the driver > struct, but is it worth the extra learning curve? It seems to be > something with the potential to confuse new driver developers, and > having a very marginal benefit. > Although, if that's what people want... I prefer the wrapper function. Looking at struct pci_driver it would seem strange for it to hold a field for controlling verbosity (IMO). This is a very (very) specific field in a very general struct. > > Alex
On 07/24/2018 08:40 AM, Tal Gilboa wrote: > On 7/24/2018 2:59 AM, Alex G. wrote: >> >> >> On 07/23/2018 05:14 PM, Jakub Kicinski wrote: >>> On Tue, 24 Jul 2018 00:52:22 +0300, Tal Gilboa wrote: >>>> On 7/24/2018 12:01 AM, Jakub Kicinski wrote: >>>>> On Mon, 23 Jul 2018 15:03:38 -0500, Alexandru Gagniuc wrote: >>>>>> PCIe downtraining happens when both the device and PCIe port are >>>>>> capable of a larger bus width or higher speed than negotiated. >>>>>> Downtraining might be indicative of other problems in the system, and >>>>>> identifying this from userspace is neither intuitive, nor >>>>>> straightforward. >>>>>> >>>>>> The easiest way to detect this is with pcie_print_link_status(), >>>>>> since the bottleneck is usually the link that is downtrained. It's not >>>>>> a perfect solution, but it works extremely well in most cases. >>>>>> >>>>>> Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com> >>>>>> --- >>>>>> >>>>>> For the sake of review, I've created a __pcie_print_link_status() >>>>>> which >>>>>> takes a 'verbose' argument. If we agree want to go this route, and >>>>>> update >>>>>> the users of pcie_print_link_status(), I can split this up in two >>>>>> patches. >>>>>> I prefer just printing this information in the core functions, and >>>>>> letting >>>>>> drivers not have to worry about this. Though there seems to be >>>>>> strong for >>>>>> not going that route, so here it goes: >>>>> >>>>> FWIW the networking drivers print PCIe BW because sometimes the network >>>>> bandwidth is simply over-provisioned on multi port cards, e.g. 80Gbps >>>>> card on a x8 link. >>>>> >>>>> Sorry to bike shed, but currently the networking cards print the info >>>>> during probe. Would it make sense to move your message closer to probe >>>>> time? Rather than when device is added. If driver structure is >>>>> available, we could also consider adding a boolean to struct pci_driver >>>>> to indicate if driver wants the verbose message? This way we avoid >>>>> duplicated prints. >>>>> >>>>> I have no objection to current patch, it LGTM. Just a thought. >>>> >>>> I don't see the reason for having two functions. What's the problem with >>>> adding the verbose argument to the original function? >>> >>> IMHO it's reasonable to keep the default parameter to what 90% of users >>> want by a means on a wrapper. The non-verbose output is provided by >>> the core already for all devices. >>> >>> What do you think of my proposal above Tal? That would make the extra >>> wrapper unnecessary since the verbose parameter would be part of the >>> driver structure, and it would avoid the duplicated output. >> >> I see how it might make sense to add another member to the driver >> struct, but is it worth the extra learning curve? It seems to be >> something with the potential to confuse new driver developers, and >> having a very marginal benefit. >> Although, if that's what people want... > > I prefer the wrapper function. Looking at struct pci_driver it would > seem strange for it to hold a field for controlling verbosity (IMO). > This is a very (very) specific field in a very general struct. If people are okay with the wrapper, then I'm not going to update the patch. Alex
On 7/24/2018 12:52 AM, Tal Gilboa wrote: > On 7/24/2018 12:01 AM, Jakub Kicinski wrote: >> On Mon, 23 Jul 2018 15:03:38 -0500, Alexandru Gagniuc wrote: >>> PCIe downtraining happens when both the device and PCIe port are >>> capable of a larger bus width or higher speed than negotiated. >>> Downtraining might be indicative of other problems in the system, and >>> identifying this from userspace is neither intuitive, nor >>> straightforward. >>> >>> The easiest way to detect this is with pcie_print_link_status(), >>> since the bottleneck is usually the link that is downtrained. It's not >>> a perfect solution, but it works extremely well in most cases. >>> >>> Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com> >>> --- >>> >>> For the sake of review, I've created a __pcie_print_link_status() which >>> takes a 'verbose' argument. If we agree want to go this route, and >>> update >>> the users of pcie_print_link_status(), I can split this up in two >>> patches. >>> I prefer just printing this information in the core functions, and >>> letting >>> drivers not have to worry about this. Though there seems to be strong >>> for >>> not going that route, so here it goes: >> >> FWIW the networking drivers print PCIe BW because sometimes the network >> bandwidth is simply over-provisioned on multi port cards, e.g. 80Gbps >> card on a x8 link. >> >> Sorry to bike shed, but currently the networking cards print the info >> during probe. Would it make sense to move your message closer to probe >> time? Rather than when device is added. If driver structure is >> available, we could also consider adding a boolean to struct pci_driver >> to indicate if driver wants the verbose message? This way we avoid >> duplicated prints. >> >> I have no objection to current patch, it LGTM. Just a thought. > > I don't see the reason for having two functions. What's the problem with > adding the verbose argument to the original function? > >> >>> drivers/pci/pci.c | 22 ++++++++++++++++++---- >>> drivers/pci/probe.c | 21 +++++++++++++++++++++ >>> include/linux/pci.h | 1 + >>> 3 files changed, 40 insertions(+), 4 deletions(-) >> >>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c >>> index 316496e99da9..414ad7b3abdb 100644 >>> --- a/drivers/pci/pci.c >>> +++ b/drivers/pci/pci.c >>> @@ -5302,14 +5302,15 @@ u32 pcie_bandwidth_capable(struct pci_dev >>> *dev, enum pci_bus_speed *speed, >>> } >>> /** >>> - * pcie_print_link_status - Report the PCI device's link speed and >>> width >>> + * __pcie_print_link_status - Report the PCI device's link speed and >>> width >>> * @dev: PCI device to query >>> + * @verbose: Be verbose -- print info even when enough bandwidth is >>> available. >>> * >>> * Report the available bandwidth at the device. If this is less >>> than the >>> * device is capable of, report the device's maximum possible >>> bandwidth and >>> * the upstream link that limits its performance to less than that. >>> */ >>> -void pcie_print_link_status(struct pci_dev *dev) >>> +void __pcie_print_link_status(struct pci_dev *dev, bool verbose) >>> { >>> enum pcie_link_width width, width_cap; >>> enum pci_bus_speed speed, speed_cap; >>> @@ -5319,11 +5320,11 @@ void pcie_print_link_status(struct pci_dev *dev) >>> bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap); >>> bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, >>> &width); >>> - if (bw_avail >= bw_cap) >>> + if (bw_avail >= bw_cap && verbose) >>> pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth (%s >>> x%d link)\n", >>> bw_cap / 1000, bw_cap % 1000, >>> PCIE_SPEED2STR(speed_cap), width_cap); >>> - else >>> + else if (bw_avail < bw_cap) >>> pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth, >>> limited by %s x%d link at %s (capable of %u.%03u Gb/s with %s x%d >>> link)\n", >>> bw_avail / 1000, bw_avail % 1000, >>> PCIE_SPEED2STR(speed), width, >>> @@ -5331,6 +5332,19 @@ void pcie_print_link_status(struct pci_dev *dev) >>> bw_cap / 1000, bw_cap % 1000, >>> PCIE_SPEED2STR(speed_cap), width_cap); >>> } >>> + >>> +/** >>> + * pcie_print_link_status - Report the PCI device's link speed and >>> width >>> + * @dev: PCI device to query >>> + * >>> + * Report the available bandwidth at the device. If this is less >>> than the >>> + * device is capable of, report the device's maximum possible >>> bandwidth and >>> + * the upstream link that limits its performance to less than that. >>> + */ >>> +void pcie_print_link_status(struct pci_dev *dev) >>> +{ >>> + __pcie_print_link_status(dev, true); >>> +} >>> EXPORT_SYMBOL(pcie_print_link_status); >>> /** >>> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c >>> index ac876e32de4b..1f7336377c3b 100644 >>> --- a/drivers/pci/probe.c >>> +++ b/drivers/pci/probe.c >>> @@ -2205,6 +2205,24 @@ static struct pci_dev *pci_scan_device(struct >>> pci_bus *bus, int devfn) >>> return dev; >>> } >>> +static void pcie_check_upstream_link(struct pci_dev *dev) >>> +{ >>> + if (!pci_is_pcie(dev)) >>> + return; >>> + >>> + /* Look from the device up to avoid downstream ports with no >>> devices. */ >>> + if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ENDPOINT) && >>> + (pci_pcie_type(dev) != PCI_EXP_TYPE_LEG_END) && >>> + (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM)) >>> + return; >>> + >>> + /* Multi-function PCIe share the same link/status. */ >>> + if (PCI_FUNC(dev->devfn) != 0 || dev->is_virtfn) >>> + return; >>> + >>> + __pcie_print_link_status(dev, false); >>> +} >>> + >>> static void pci_init_capabilities(struct pci_dev *dev) >>> { >>> /* Enhanced Allocation */ >>> @@ -2240,6 +2258,9 @@ static void pci_init_capabilities(struct >>> pci_dev *dev) >>> /* Advanced Error Reporting */ >>> pci_aer_init(dev); >>> + /* Check link and detect downtrain errors */ >>> + pcie_check_upstream_link(dev); > > This is called for every PCIe device right? Won't there be a duplicated > print in case a device loads with lower PCIe bandwidth than needed? Alex, can you comment on this please? > >>> + >>> if (pci_probe_reset_function(dev) == 0) >>> dev->reset_fn = 1; >>> } >>> diff --git a/include/linux/pci.h b/include/linux/pci.h >>> index abd5d5e17aee..15bfab8f7a1b 100644 >>> --- a/include/linux/pci.h >>> +++ b/include/linux/pci.h >>> @@ -1088,6 +1088,7 @@ int pcie_set_mps(struct pci_dev *dev, int mps); >>> u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev >>> **limiting_dev, >>> enum pci_bus_speed *speed, >>> enum pcie_link_width *width); >>> +void __pcie_print_link_status(struct pci_dev *dev, bool verbose); >>> void pcie_print_link_status(struct pci_dev *dev); >>> int pcie_flr(struct pci_dev *dev); >>> int __pci_reset_function_locked(struct pci_dev *dev); >>
On 07/31/2018 01:40 AM, Tal Gilboa wrote: [snip] >>>> @@ -2240,6 +2258,9 @@ static void pci_init_capabilities(struct >>>> pci_dev *dev) >>>> /* Advanced Error Reporting */ >>>> pci_aer_init(dev); >>>> + /* Check link and detect downtrain errors */ >>>> + pcie_check_upstream_link(dev); >> >> This is called for every PCIe device right? Won't there be a >> duplicated print in case a device loads with lower PCIe bandwidth than >> needed? > > Alex, can you comment on this please? Of course I can. There's one print at probe() time, which happens if bandwidth < max. I would think that's fine. There is a way to duplicate it, and that is if the driver also calls print_link_status(). A few driver maintainers who call it have indicated they'd be fine with removing it from the driver, and leaving it in the core PCI. Should the device come back at low speed, go away, then come back at full speed we'd still only see one print from the first probe. Again, if driver doesn't also call the function. There's the corner case where the device come up at < max, goes away, then comes back faster, but still < max. There will be two prints, but they won't be true duplicates -- each one will indicate different BW. Alex >>>> + >>>> if (pci_probe_reset_function(dev) == 0) >>>> dev->reset_fn = 1; >>>> } >>>> diff --git a/include/linux/pci.h b/include/linux/pci.h >>>> index abd5d5e17aee..15bfab8f7a1b 100644 >>>> --- a/include/linux/pci.h >>>> +++ b/include/linux/pci.h >>>> @@ -1088,6 +1088,7 @@ int pcie_set_mps(struct pci_dev *dev, int mps); >>>> u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev >>>> **limiting_dev, >>>> enum pci_bus_speed *speed, >>>> enum pcie_link_width *width); >>>> +void __pcie_print_link_status(struct pci_dev *dev, bool verbose); >>>> void pcie_print_link_status(struct pci_dev *dev); >>>> int pcie_flr(struct pci_dev *dev); >>>> int __pci_reset_function_locked(struct pci_dev *dev); >>>
On 7/31/2018 6:10 PM, Alex G. wrote: > On 07/31/2018 01:40 AM, Tal Gilboa wrote: > [snip] >>>>> @@ -2240,6 +2258,9 @@ static void pci_init_capabilities(struct >>>>> pci_dev *dev) >>>>> /* Advanced Error Reporting */ >>>>> pci_aer_init(dev); >>>>> + /* Check link and detect downtrain errors */ >>>>> + pcie_check_upstream_link(dev); >>> >>> This is called for every PCIe device right? Won't there be a >>> duplicated print in case a device loads with lower PCIe bandwidth >>> than needed? >> >> Alex, can you comment on this please? > > Of course I can. > > There's one print at probe() time, which happens if bandwidth < max. I > would think that's fine. There is a way to duplicate it, and that is if > the driver also calls print_link_status(). A few driver maintainers who > call it have indicated they'd be fine with removing it from the driver, > and leaving it in the core PCI. We would be fine with that as well. Please include the removal in your patches. > > Should the device come back at low speed, go away, then come back at > full speed we'd still only see one print from the first probe. Again, if > driver doesn't also call the function. > There's the corner case where the device come up at < max, goes away, > then comes back faster, but still < max. There will be two prints, but > they won't be true duplicates -- each one will indicate different BW. This is fine. > > Alex > >>>>> + >>>>> if (pci_probe_reset_function(dev) == 0) >>>>> dev->reset_fn = 1; >>>>> } >>>>> diff --git a/include/linux/pci.h b/include/linux/pci.h >>>>> index abd5d5e17aee..15bfab8f7a1b 100644 >>>>> --- a/include/linux/pci.h >>>>> +++ b/include/linux/pci.h >>>>> @@ -1088,6 +1088,7 @@ int pcie_set_mps(struct pci_dev *dev, int mps); >>>>> u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev >>>>> **limiting_dev, >>>>> enum pci_bus_speed *speed, >>>>> enum pcie_link_width *width); >>>>> +void __pcie_print_link_status(struct pci_dev *dev, bool verbose); >>>>> void pcie_print_link_status(struct pci_dev *dev); >>>>> int pcie_flr(struct pci_dev *dev); >>>>> int __pci_reset_function_locked(struct pci_dev *dev); >>>>
On 08/05/2018 02:06 AM, Tal Gilboa wrote: > On 7/31/2018 6:10 PM, Alex G. wrote: >> On 07/31/2018 01:40 AM, Tal Gilboa wrote: >> [snip] >>>>>> @@ -2240,6 +2258,9 @@ static void pci_init_capabilities(struct >>>>>> pci_dev *dev) >>>>>> /* Advanced Error Reporting */ >>>>>> pci_aer_init(dev); >>>>>> + /* Check link and detect downtrain errors */ >>>>>> + pcie_check_upstream_link(dev); >>>> >>>> This is called for every PCIe device right? Won't there be a >>>> duplicated print in case a device loads with lower PCIe bandwidth >>>> than needed? >>> >>> Alex, can you comment on this please? >> >> Of course I can. >> >> There's one print at probe() time, which happens if bandwidth < max. I >> would think that's fine. There is a way to duplicate it, and that is if >> the driver also calls print_link_status(). A few driver maintainers who >> call it have indicated they'd be fine with removing it from the driver, >> and leaving it in the core PCI. > > We would be fine with that as well. Please include the removal in your > patches. What's the proper procedure? Do I wait for confirmation from Bjorn before knocking on maintainer's doors, or do I William Wallace into their trees and demand they merge the removal (pending Bjorn's approval on the other side) ? Alex >> >> Should the device come back at low speed, go away, then come back at >> full speed we'd still only see one print from the first probe. Again, if >> driver doesn't also call the function. >> There's the corner case where the device come up at < max, goes away, >> then comes back faster, but still < max. There will be two prints, but >> they won't be true duplicates -- each one will indicate different BW. > > This is fine. > >> >> Alex >> >>>>>> + >>>>>> if (pci_probe_reset_function(dev) == 0) >>>>>> dev->reset_fn = 1; >>>>>> } >>>>>> diff --git a/include/linux/pci.h b/include/linux/pci.h >>>>>> index abd5d5e17aee..15bfab8f7a1b 100644 >>>>>> --- a/include/linux/pci.h >>>>>> +++ b/include/linux/pci.h >>>>>> @@ -1088,6 +1088,7 @@ int pcie_set_mps(struct pci_dev *dev, int mps); >>>>>> u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev >>>>>> **limiting_dev, >>>>>> enum pci_bus_speed *speed, >>>>>> enum pcie_link_width *width); >>>>>> +void __pcie_print_link_status(struct pci_dev *dev, bool verbose); >>>>>> void pcie_print_link_status(struct pci_dev *dev); >>>>>> int pcie_flr(struct pci_dev *dev); >>>>>> int __pci_reset_function_locked(struct pci_dev *dev); >>>>> >
On Mon, Aug 6, 2018 at 1:39 PM <Alex_Gagniuc@dellteam.com> wrote: > > On 08/05/2018 02:06 AM, Tal Gilboa wrote: > > On 7/31/2018 6:10 PM, Alex G. wrote: > >> On 07/31/2018 01:40 AM, Tal Gilboa wrote: > >> [snip] > >>>>>> @@ -2240,6 +2258,9 @@ static void pci_init_capabilities(struct > >>>>>> pci_dev *dev) > >>>>>> /* Advanced Error Reporting */ > >>>>>> pci_aer_init(dev); > >>>>>> + /* Check link and detect downtrain errors */ > >>>>>> + pcie_check_upstream_link(dev); > >>>> > >>>> This is called for every PCIe device right? Won't there be a > >>>> duplicated print in case a device loads with lower PCIe bandwidth > >>>> than needed? > >>> > >>> Alex, can you comment on this please? > >> > >> Of course I can. > >> > >> There's one print at probe() time, which happens if bandwidth < max. I > >> would think that's fine. There is a way to duplicate it, and that is if > >> the driver also calls print_link_status(). A few driver maintainers who > >> call it have indicated they'd be fine with removing it from the driver, > >> and leaving it in the core PCI. > > > > We would be fine with that as well. Please include the removal in your > > patches. > > What's the proper procedure? Do I wait for confirmation from Bjorn > before knocking on maintainer's doors, or do I William Wallace into > their trees and demand they merge the removal (pending Bjorn's approval > on the other side) ? Post a v4 series that does the PCI core stuff as well as removing the driver code. Bjorn
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 316496e99da9..414ad7b3abdb 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -5302,14 +5302,15 @@ u32 pcie_bandwidth_capable(struct pci_dev *dev, enum pci_bus_speed *speed, } /** - * pcie_print_link_status - Report the PCI device's link speed and width + * __pcie_print_link_status - Report the PCI device's link speed and width * @dev: PCI device to query + * @verbose: Be verbose -- print info even when enough bandwidth is available. * * Report the available bandwidth at the device. If this is less than the * device is capable of, report the device's maximum possible bandwidth and * the upstream link that limits its performance to less than that. */ -void pcie_print_link_status(struct pci_dev *dev) +void __pcie_print_link_status(struct pci_dev *dev, bool verbose) { enum pcie_link_width width, width_cap; enum pci_bus_speed speed, speed_cap; @@ -5319,11 +5320,11 @@ void pcie_print_link_status(struct pci_dev *dev) bw_cap = pcie_bandwidth_capable(dev, &speed_cap, &width_cap); bw_avail = pcie_bandwidth_available(dev, &limiting_dev, &speed, &width); - if (bw_avail >= bw_cap) + if (bw_avail >= bw_cap && verbose) pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth (%s x%d link)\n", bw_cap / 1000, bw_cap % 1000, PCIE_SPEED2STR(speed_cap), width_cap); - else + else if (bw_avail < bw_cap) pci_info(dev, "%u.%03u Gb/s available PCIe bandwidth, limited by %s x%d link at %s (capable of %u.%03u Gb/s with %s x%d link)\n", bw_avail / 1000, bw_avail % 1000, PCIE_SPEED2STR(speed), width, @@ -5331,6 +5332,19 @@ void pcie_print_link_status(struct pci_dev *dev) bw_cap / 1000, bw_cap % 1000, PCIE_SPEED2STR(speed_cap), width_cap); } + +/** + * pcie_print_link_status - Report the PCI device's link speed and width + * @dev: PCI device to query + * + * Report the available bandwidth at the device. If this is less than the + * device is capable of, report the device's maximum possible bandwidth and + * the upstream link that limits its performance to less than that. + */ +void pcie_print_link_status(struct pci_dev *dev) +{ + __pcie_print_link_status(dev, true); +} EXPORT_SYMBOL(pcie_print_link_status); /** diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index ac876e32de4b..1f7336377c3b 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -2205,6 +2205,24 @@ static struct pci_dev *pci_scan_device(struct pci_bus *bus, int devfn) return dev; } +static void pcie_check_upstream_link(struct pci_dev *dev) +{ + if (!pci_is_pcie(dev)) + return; + + /* Look from the device up to avoid downstream ports with no devices. */ + if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ENDPOINT) && + (pci_pcie_type(dev) != PCI_EXP_TYPE_LEG_END) && + (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM)) + return; + + /* Multi-function PCIe share the same link/status. */ + if (PCI_FUNC(dev->devfn) != 0 || dev->is_virtfn) + return; + + __pcie_print_link_status(dev, false); +} + static void pci_init_capabilities(struct pci_dev *dev) { /* Enhanced Allocation */ @@ -2240,6 +2258,9 @@ static void pci_init_capabilities(struct pci_dev *dev) /* Advanced Error Reporting */ pci_aer_init(dev); + /* Check link and detect downtrain errors */ + pcie_check_upstream_link(dev); + if (pci_probe_reset_function(dev) == 0) dev->reset_fn = 1; } diff --git a/include/linux/pci.h b/include/linux/pci.h index abd5d5e17aee..15bfab8f7a1b 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -1088,6 +1088,7 @@ int pcie_set_mps(struct pci_dev *dev, int mps); u32 pcie_bandwidth_available(struct pci_dev *dev, struct pci_dev **limiting_dev, enum pci_bus_speed *speed, enum pcie_link_width *width); +void __pcie_print_link_status(struct pci_dev *dev, bool verbose); void pcie_print_link_status(struct pci_dev *dev); int pcie_flr(struct pci_dev *dev); int __pci_reset_function_locked(struct pci_dev *dev);
PCIe downtraining happens when both the device and PCIe port are capable of a larger bus width or higher speed than negotiated. Downtraining might be indicative of other problems in the system, and identifying this from userspace is neither intuitive, nor straightforward. The easiest way to detect this is with pcie_print_link_status(), since the bottleneck is usually the link that is downtrained. It's not a perfect solution, but it works extremely well in most cases. Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com> --- For the sake of review, I've created a __pcie_print_link_status() which takes a 'verbose' argument. If we agree want to go this route, and update the users of pcie_print_link_status(), I can split this up in two patches. I prefer just printing this information in the core functions, and letting drivers not have to worry about this. Though there seems to be strong for not going that route, so here it goes: Changes since v4: - Use 'verbose' argumnet to print bandwidth under normal conditions - Without verbose, only downtraining conditions are reported Changes since v3: - Remove extra newline and parentheses. Changes since v2: - Check dev->is_virtfn flag Changes since v1: - Use pcie_print_link_status() instead of reimplementing logic drivers/pci/pci.c | 22 ++++++++++++++++++---- drivers/pci/probe.c | 21 +++++++++++++++++++++ include/linux/pci.h | 1 + 3 files changed, 40 insertions(+), 4 deletions(-)