diff mbox

pciehp is broken from 4.10-rc1

Message ID 20170204185607.GA29957@wunner.de (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas
Headers show

Commit Message

Lukas Wunner Feb. 4, 2017, 6:56 p.m. UTC
On Sat, Feb 04, 2017 at 09:12:54AM +0100, Lukas Wunner wrote:
> On Fri, Feb 03, 2017 at 11:00:19PM -0800, Yinghai Lu wrote:
> > we have extra Link Up event queued, while pm_runtime_get_sync/pm_runtime_put ?
> >   [  143.445483] pcieport 0000:60:03.2: PME# enabled
> >   [  143.992915] pciehp 0000:60:03.2:pcie004: Slot(8): Link Up
> 
> I notice that with 68db9bc81436 applied, PME is repeatedly enabled and
> disabled on the port, presumably whenever it switches from D3 to D0
> and vice-versa.
> 
> Perhaps this port sends an interrupt while PME is enabled and the slot
> is actually occupied, despite it having been disabled via sysfs.

Section 6.7.3.4 of the PCIe Base spec seems to support the theory above,
so here's a tentative patch.

Thanks,

Lukas

-- >8 --
Subject: [PATCH] PCI: pciehp: Don't enable PME on runtime suspend

Since commit 68db9bc81436 ("PCI: pciehp: Add runtime PM support for PCIe
hotplug ports") we runtime suspend a hotplug port to D3hot when all its
children are runtime suspended or none are present.

When runtime suspending the port the PCI core automatically enables PME:
    pci_pm_runtime_suspend()
        pci_finish_runtime_suspend()
            __pci_enable_wake()

According to the PCI Express Base Specification, section 6.7.3.4:
   "Note that PME and Hot-Plug Event interrupts (when both are
    implemented) always share the same MSI or MSI-X vector [...]
    If wake generation is required by the associated form factor
    specification, a hot-plug capable Downstream Port must support
    generation of a wakeup event (using the PME mechanism) on hotplug
    events that occur when the system is in a sleep state or the Port
    is in device state D1, D2, or D3Hot."

Thus, if the port is runtime suspended even though it is still occupied,
it may immediately be woken by a PME interrupt.  One scenario where this
happens is if all children of the hotplug port have runtime suspended.
Another scenario is power control via sysfs:  If a user manually turns
the hotplug port off (e.g. to safely remove the card), PME will signal
an interrupt for the still-occupied slot, which is interpreted by pciehp
as re-insertion of a card.  As a result, power control via sysfs is no
longer possible.  This was observed and reported by Yinghai Lu.

PME is in fact unnecessary on hotplug ports:  Hotplug can be signaled
even in D3hot, and commit 68db9bc81436 ensures that all parents of the
hotplug port are kept awake so that interrupts can be delivered.
PME would allow us to runtime suspend the parent ports as well, but we
do not make use of it because we cannot be sure if PME actually works.
Thunderbolt controllers for instance advertise PME capability, but at
least on Macs the PME pin is not connected.

Since we do not rely on PME for hotplug ports, we may as well not enable
it, thereby avoiding its negative side effects.  However the present
commit deliberately only avoids enabling PME on runtime suspend, the
ability to enable it for system sleep is retained.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=193951
Fixes: 68db9bc81436 ("PCI: pciehp: Add runtime PM support for PCIe
    hotplug ports")
Reported-by: Yinghai Lu <yinghai@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
---
 drivers/pci/pci.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Yinghai Lu Feb. 4, 2017, 9:44 p.m. UTC | #1
On Sat, Feb 4, 2017 at 10:56 AM, Lukas Wunner <lukas@wunner.de> wrote:
> On Sat, Feb 04, 2017 at 09:12:54AM +0100, Lukas Wunner wrote:
>
> Section 6.7.3.4 of the PCIe Base spec seems to support the theory above,
> so here's a tentative patch.
>
>
> -- >8 --
> Subject: [PATCH] PCI: pciehp: Don't enable PME on runtime suspend

it works:
sca05-0a81e0bc:~ # echo 0 > /sys/bus/pci/slots/8/power
[  176.983229] pci_hotplug: power_write_file: power = 0
[  176.988945] pciehp 0000:60:03.2:pcie004: pciehp_get_power_status:
SLOTCTRL a8 value read 11f1
[  176.998482] pciehp 0000:60:03.2:pcie004: pciehp_unconfigure_device:
domain:bus:dev = 0000:65:00
[  177.008215] mlx4_core 0000:65:00.0: PME# disabled
[  178.817209] iommu: Removing device 0000:65:00.0 from group 172
[  178.829869] pcieport 0000:60:03.2:   root_bridge ACPI_HANDLE
ffff9059bb00ce88 : pci0000:60
[  178.839114] pcieport 0000:60:03.2:  pciehp is native
[  178.844717] pci 0000:65:00.0: freeing pci_dev info
[  178.854842] pciehp 0000:60:03.2:pcie004: Timeout on hotplug command
0x11f1 (issued 97032 msec ago)
[  178.865199] pciehp 0000:60:03.2:pcie004: pending interrupts 0x0010
from Slot Status
[  178.873890] pciehp 0000:60:03.2:pcie004: pciehp_power_off_slot:
SLOTCTRL a8 write cmd 400
[  178.937499] pciehp 0000:60:03.2:pcie004: pending interrupts 0x0100
from Slot Status
[  178.946052] pciehp 0000:60:03.2:pcie004: Slot(8): Link Down
[  178.952422] pciehp 0000:60:03.2:pcie004: Slot(8): Link Down event
ignored; already powering off
[  179.894895] pciehp 0000:60:03.2:pcie004: pciehp_green_led_off:
SLOTCTRL a8 write cmd 300
[  179.903941] pciehp 0000:60:03.2:pcie004: pending interrupts 0x0010
from Slot Status

sca05-0a81e0bc:~ #
sca05-0a81e0bc:~ #
sca05-0a81e0bc:~ # echo 1 > /sys/bus/pci/slots/8/power
[  222.136244] pci_hotplug: power_write_file: power = 1
[  222.141818] pciehp 0000:60:03.2:pcie004: pciehp_get_power_status:
SLOTCTRL a8 value read 17f1
[  222.151710] pciehp 0000:60:03.2:pcie004: pending interrupts 0x0010
from Slot Status
[  222.160303] pciehp 0000:60:03.2:pcie004: pciehp_power_on_slot:
SLOTCTRL a8 write cmd 0
[  222.169156] pciehp 0000:60:03.2:pcie004: pciehp_green_led_blink:
SLOTCTRL a8 write cmd 200
[  222.178395] pciehp 0000:60:03.2:pcie004: pending interrupts 0x0010
from Slot Status
[  222.204325] pciehp 0000:60:03.2:pcie004: pending interrupts 0x0010
from Slot Status
[  222.784102] pciehp 0000:60:03.2:pcie004: pciehp_check_link_active:
lnk_status = f083
[  222.792750] pciehp 0000:60:03.2:pcie004: pending interrupts 0x0100
from Slot Status
[  222.801297] pciehp 0000:60:03.2:pcie004: Slot(8): Link Up
[  222.807358] pciehp 0000:60:03.2:pcie004: Slot(8): Link Up event
ignored; already powering on
[  222.807409] pciehp 0000:60:03.2:pcie004: pciehp_check_link_active:
lnk_status = f083
[  222.968062] pciehp 0000:60:03.2:pcie004: pciehp_check_link_status:
lnk_status = f083
[  222.976922] pci 0000:65:00.0: [15b3:1003] type 00 class 0x0c0600
[  222.983864] pci 0000:65:00.0: reg 0x10: [mem 0x00000000-0x000fffff 64bit]
[  222.991598] pci 0000:65:00.0: reg 0x18: [mem 0x00000000-0x07ffffff
64bit pref]
[  222.999979] pci_bus 0000:65:   bridge ACPI_HANDLE ffff9019bb0d0438
: 0000:60:03.2
[  223.010044] pci 0000:65:00.0: reg 0x134: [mem 0x00000000-0x07ffffff
64bit pref]
[  223.018214] pci 0000:65:00.0: VF(n) BAR2 space: [mem
0x00000000-0x1ffffffff 64bit pref] (contains BAR2 for 64 VFs)
[  223.033412] pci 0000:65:00.0: on_all_pcie_path: 1
[  223.040666] iommu: Adding device 0000:65:00.0 to group 172
[  223.047042] pci 0000:65:00.0: BAR 2: assigned [mem
0x387bf8000000-0x387bffffffff 64bit pref]
[  223.056589] pci 0000:65:00.0: BAR 9: assigned [mem
0x3879f8000000-0x387bf7ffffff 64bit pref]
[  223.066068] pci 0000:65:00.0: BAR 0: [mem size 0x00100000 64bit] + pref
[  223.073464] pci 0000:65:00.0: BAR 0: assigned [mem
0x3879f7f00000-0x3879f7ffffff 64bit]
[  223.082527] pcieport 0000:60:03.2: PCI bridge to [bus 65-67]
[  223.088864] pcieport 0000:60:03.2:   bridge window [io  0xa000-0xafff]
[  223.096161] pcieport 0000:60:03.2:   bridge window [mem
0xc0000000-0xc3ffffff]
[  223.104229] pcieport 0000:60:03.2:   bridge window [mem
0x387800000000-0x387bffffffff 64bit pref]
[  223.114152] pcieport 0000:60:03.2: Max Payload Size set to  256/
256 (was  256), Max Read Rq  128
[  223.124152] pci 0000:65:00.0: Max Payload Size set to  256/ 256
(was  128), Max Read Rq  512
[  223.133597] pci 0000:65:00.0: calling
mellanox_check_broken_intx_masking+0x0/0x130
[  223.142055] calling  mellanox_check_broken_intx_masking+0x0/0x130 @
16969 for 0000:65:00.0
[  223.151291] pci fixup mellanox_check_broken_intx_masking+0x0/0x130
returned after 0 usecs for 0000:65:00.0
[  223.162157] pcieport 0000:60:03.2:   root_bridge ACPI_HANDLE
ffff9059bb00ce88 : pci0000:60
[  223.171395] pcieport 0000:60:03.2:  pciehp is native
[  223.177183] mlx4_core: Initializing 0000:65:00.0
[  223.182374] mlx4_core 0000:65:00.0: enabling device (0140 -> 0142)
...
[  231.714671] pciehp 0000:60:03.2:pcie004: pciehp_green_led_on:
SLOTCTRL a8 write cmd 100
[  232.728377] pciehp 0000:60:03.2:pcie004: Timeout on hotplug command
0x11f1 (issued 1016 msec ago)
[  232.738298] pciehp 0000:60:03.2:pcie004:
pciehp_set_attention_status: SLOTCTRL a8 write cmd c0
Lukas Wunner Feb. 4, 2017, 11:34 p.m. UTC | #2
On Sat, Feb 04, 2017 at 01:44:34PM -0800, Yinghai Lu wrote:
> On Sat, Feb 4, 2017 at 10:56 AM, Lukas Wunner <lukas@wunner.de> wrote:
> > On Sat, Feb 04, 2017 at 09:12:54AM +0100, Lukas Wunner wrote:
> > Section 6.7.3.4 of the PCIe Base spec seems to support the theory above,
> > so here's a tentative patch.
> >
> >
> > -- >8 --
> > Subject: [PATCH] PCI: pciehp: Don't enable PME on runtime suspend
> 
> it works:

Thanks a lot for the report and for testing the patch!


@Rafael:
Bjorn usually wants you to ack PM-related patches.  Looking at the
patch again I'm now thinking it might be cleaner to move the check
for is_hotplug_bridge into pci_dev_run_wake().  Thoughts?

Thanks,

Lukas
Yinghai Lu Feb. 5, 2017, 4:22 a.m. UTC | #3
On Sat, Feb 4, 2017 at 3:34 PM, Lukas Wunner <lukas@wunner.de> wrote:
> On Sat, Feb 04, 2017 at 01:44:34PM -0800, Yinghai Lu wrote:
>> On Sat, Feb 4, 2017 at 10:56 AM, Lukas Wunner <lukas@wunner.de> wrote:
>> > On Sat, Feb 04, 2017 at 09:12:54AM +0100, Lukas Wunner wrote:
>> > Section 6.7.3.4 of the PCIe Base spec seems to support the theory above,
>> > so here's a tentative patch.
>> >
>> >
>> > -- >8 --
>> > Subject: [PATCH] PCI: pciehp: Don't enable PME on runtime suspend
>>
>> it works:
>
> Thanks a lot for the report and for testing the patch!

Wait, Commit 68db9bc still has problem with another server (skylake
based), and this patch does not help.


sca05-0a81fd8d:~ # echo 0 > /sys/bus/pci/slots/11/power
[  362.721197] pci_hotplug: power_write_file: power = 0
[  362.726887] pciehp 0000:b3:00.0:pcie004: pciehp_get_power_status:
SLOTCTRL a8 value read 11f1
[  362.736431] pciehp 0000:b3:00.0:pcie004: pciehp_unconfigure_device:
domain:bus:dev = 0000:b4:00
[  362.746160] mlx4_core 0000:b4:00.0: PME# disabled
[  364.494033] pcieport 0000:b3:00.0:   root_bridge ACPI_HANDLE
ffff9e56b8811550 : pci0000:b3
[  364.503274] pcieport 0000:b3:00.0:  pciehp is native
[  364.508863] pci 0000:b4:00.0: freeing pci_dev info
[  364.514718] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010
from Slot Status
[  364.523443] pciehp 0000:b3:00.0:pcie004: pciehp_power_off_slot:
SLOTCTRL a8 write cmd 400
[  364.587047] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0108
from Slot Status
[  364.595592] pciehp 0000:b3:00.0:pcie004: Slot(11): Link Down
[  364.602325] pciehp 0000:b3:00.0:pcie004: Slot(11): Link Down event
ignored; already powering off
[  365.568415] pciehp 0000:b3:00.0:pcie004: pciehp_green_led_off:
SLOTCTRL a8 write cmd 300
[  365.569338] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010
from Slot Status

sca05-0a81fd8d:~ # echo 1 > /sys/bus/pci/slots/11/power
[  375.376609] pci_hotplug: power_write_file: power = 1
[  375.382175] pciehp 0000:b3:00.0:pcie004: pciehp_get_power_status:
SLOTCTRL a8 value read 17f1
[  375.392695] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010
from Slot Status
[  375.401370] pciehp 0000:b3:00.0:pcie004: pciehp_power_on_slot:
SLOTCTRL a8 write cmd 0
[  375.410231] pciehp 0000:b3:00.0:pcie004: pciehp_green_led_blink:
SLOTCTRL a8 write cmd 200
[  375.411071] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010
from Slot Status
[  375.445222] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010
from Slot Status
[  377.444400] pciehp 0000:b3:00.0:pcie004: Data Link Layer Link
Active not set in 1000 msec
[  378.960364] pci 0000:b4:00.0 id reading try 50 times with interval
20 ms to get ffffffff
[  378.969406] pciehp 0000:b3:00.0:pcie004: pciehp_check_link_status:
lnk_status = 5001
[  378.978059] pciehp 0000:b3:00.0:pcie004: link training error: status 0x5001
[  378.985834] pciehp 0000:b3:00.0:pcie004: Failed to check link status
[  378.987185] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010
from Slot Status
[  378.987253] pciehp 0000:b3:00.0:pcie004: pciehp_power_off_slot:
SLOTCTRL a8 write cmd 400
[  380.000409] pciehp 0000:b3:00.0:pcie004: pciehp_green_led_off:
SLOTCTRL a8 write cmd 300
[  380.000674] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010
from Slot Status
[  380.018020] pciehp 0000:b3:00.0:pcie004:
pciehp_set_attention_status: SLOTCTRL a8 write cmd 40
[  380.019053] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010
from Slot Status
-bash: echo: write error: Operation not permitted

revert commit 68db9bc, also make it working again.


Thanks


Yinghai
Yinghai Lu Feb. 5, 2017, 5:20 a.m. UTC | #4
On Sat, Feb 4, 2017 at 8:22 PM, Yinghai Lu <yinghai@kernel.org> wrote:
> On Sat, Feb 4, 2017 at 3:34 PM, Lukas Wunner <lukas@wunner.de> wrote:
>> On Sat, Feb 04, 2017 at 01:44:34PM -0800, Yinghai Lu wrote:
>>> On Sat, Feb 4, 2017 at 10:56 AM, Lukas Wunner <lukas@wunner.de> wrote:
>>> > On Sat, Feb 04, 2017 at 09:12:54AM +0100, Lukas Wunner wrote:
>>> > Section 6.7.3.4 of the PCIe Base spec seems to support the theory above,
>>> > so here's a tentative patch.
>>> >
>>> >
>>> > -- >8 --
>>> > Subject: [PATCH] PCI: pciehp: Don't enable PME on runtime suspend
>>>
>>> it works:
>>
>> Thanks a lot for the report and for testing the patch!
>
> Wait, Commit 68db9bc still has problem with another server (skylake
> based), and this patch does not help.
>
>
> sca05-0a81fd8d:~ # echo 0 > /sys/bus/pci/slots/11/power
> [  362.721197] pci_hotplug: power_write_file: power = 0
> [  362.726887] pciehp 0000:b3:00.0:pcie004: pciehp_get_power_status:
> SLOTCTRL a8 value read 11f1
> [  362.736431] pciehp 0000:b3:00.0:pcie004: pciehp_unconfigure_device:
> domain:bus:dev = 0000:b4:00
> [  362.746160] mlx4_core 0000:b4:00.0: PME# disabled
> [  364.494033] pcieport 0000:b3:00.0:   root_bridge ACPI_HANDLE
> ffff9e56b8811550 : pci0000:b3
> [  364.503274] pcieport 0000:b3:00.0:  pciehp is native
> [  364.508863] pci 0000:b4:00.0: freeing pci_dev info
> [  364.514718] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010
> from Slot Status
> [  364.523443] pciehp 0000:b3:00.0:pcie004: pciehp_power_off_slot:
> SLOTCTRL a8 write cmd 400
> [  364.587047] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0108
> from Slot Status
> [  364.595592] pciehp 0000:b3:00.0:pcie004: Slot(11): Link Down
> [  364.602325] pciehp 0000:b3:00.0:pcie004: Slot(11): Link Down event
> ignored; already powering off
> [  365.568415] pciehp 0000:b3:00.0:pcie004: pciehp_green_led_off:
> SLOTCTRL a8 write cmd 300
> [  365.569338] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010
> from Slot Status
>
> sca05-0a81fd8d:~ # echo 1 > /sys/bus/pci/slots/11/power
> [  375.376609] pci_hotplug: power_write_file: power = 1
> [  375.382175] pciehp 0000:b3:00.0:pcie004: pciehp_get_power_status:
> SLOTCTRL a8 value read 17f1
> [  375.392695] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010
> from Slot Status
> [  375.401370] pciehp 0000:b3:00.0:pcie004: pciehp_power_on_slot:
> SLOTCTRL a8 write cmd 0
> [  375.410231] pciehp 0000:b3:00.0:pcie004: pciehp_green_led_blink:
> SLOTCTRL a8 write cmd 200
> [  375.411071] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010
> from Slot Status
> [  375.445222] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010
> from Slot Status
> [  377.444400] pciehp 0000:b3:00.0:pcie004: Data Link Layer Link
> Active not set in 1000 msec
> [  378.960364] pci 0000:b4:00.0 id reading try 50 times with interval
> 20 ms to get ffffffff
> [  378.969406] pciehp 0000:b3:00.0:pcie004: pciehp_check_link_status:
> lnk_status = 5001
> [  378.978059] pciehp 0000:b3:00.0:pcie004: link training error: status 0x5001
> [  378.985834] pciehp 0000:b3:00.0:pcie004: Failed to check link status
> [  378.987185] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010
> from Slot Status
> [  378.987253] pciehp 0000:b3:00.0:pcie004: pciehp_power_off_slot:
> SLOTCTRL a8 write cmd 400
> [  380.000409] pciehp 0000:b3:00.0:pcie004: pciehp_green_led_off:
> SLOTCTRL a8 write cmd 300
> [  380.000674] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010
> from Slot Status
> [  380.018020] pciehp 0000:b3:00.0:pcie004:
> pciehp_set_attention_status: SLOTCTRL a8 write cmd 40
> [  380.019053] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010
> from Slot Status
> -bash: echo: write error: Operation not permitted
>
> revert commit 68db9bc, also make it working again.

output after reverting 68db9bc

sca05-0a81fd8d:~ # echo 0 > /sys/bus/pci/slots/11/power
[  359.966115] pci_hotplug: power_write_file: power = 0
[  359.971759] pciehp 0000:b3:00.0:pcie004: pciehp_get_power_status:
SLOTCTRL a8 value read 11f1
[  359.981284] pciehp 0000:b3:00.0:pcie004: pciehp_unconfigure_device:
domain:bus:dev = 0000:b4:00
[  359.991017] mlx4_core 0000:b4:00.0: PME# disabled
[  361.579571] pci 0000:b4:00.0: freeing pci_dev info
[  361.585390] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010
from Slot Status
[  361.594116] pciehp 0000:b3:00.0:pcie004: pciehp_power_off_slot:
SLOTCTRL a8 write cmd 400
[  361.657705] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0108
from Slot Status
[  361.666268] pciehp 0000:b3:00.0:pcie004: Slot(11): Link Down
[  361.673076] pciehp 0000:b3:00.0:pcie004: Slot(11): Link Down event
ignored; already powering off
[  362.621894] pciehp 0000:b3:00.0:pcie004: pciehp_green_led_off:
SLOTCTRL a8 write cmd 300
[  362.622499] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010
from Slot Status
sca05-0a81fd8d:~ #
sca05-0a81fd8d:~ # echo 1 > /sys/bus/pci/slots/11/power
[  368.797970] pci_hotplug: power_write_file: power = 1
[  368.803544] pciehp 0000:b3:00.0:pcie004: pciehp_get_power_status:
SLOTCTRL a8 value read 17f1
[  368.813743] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010
from Slot Status
[  368.822410] pciehp 0000:b3:00.0:pcie004: pciehp_power_on_slot:
SLOTCTRL a8 write cmd 0
[  368.831280] pciehp 0000:b3:00.0:pcie004: pciehp_green_led_blink:
SLOTCTRL a8 write cmd 200
[  368.832115] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010
from Slot Status
[  369.455188] pciehp 0000:b3:00.0:pcie004: pciehp_check_link_active:
lnk_status = f083
[  369.463844] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0108
from Slot Status
[  369.465786] pciehp 0000:b3:00.0:pcie004: pciehp_check_link_active:
lnk_status = f083
[  369.481042] pciehp 0000:b3:00.0:pcie004: Slot(11): Link Up
[  369.487219] pciehp 0000:b3:00.0:pcie004: Slot(11): Link Up event
ignored; already powering on
[  369.573787] pciehp 0000:b3:00.0:pcie004: pciehp_check_link_status:
lnk_status = f083
[  369.582664] pci 0000:b4:00.0: [15b3:1003] type 00 class 0x0c0600
[  369.589626] pci 0000:b4:00.0: reg 0x10: [mem 0x00000000-0x000fffff 64bit]
[  369.597359] pci 0000:b4:00.0: reg 0x18: [mem 0x00000000-0x07ffffff
64bit pref]
[  369.605749] pci_bus 0000:b4:   bridge ACPI_HANDLE ffff9c2fb8817780
: 0000:b3:00.0
[  369.615407] pci 0000:b4:00.0: reg 0x134: [mem 0x00000000-0x07ffffff
64bit pref]
[  369.623571] pci 0000:b4:00.0: VF(n) BAR2 space: [mem
0x00000000-0x1ffffffff 64bit pref] (contains BAR2 for 64 VFs)
[  369.638820] pci 0000:b4:00.0: on_all_pcie_path: 1
[  369.644445] pci 0000:b4:00.0: BAR 2: assigned [mem
0x396ff8000000-0x396fffffffff 64bit pref]
[  369.654012] pci 0000:b4:00.0: BAR 9: assigned [mem
0x396df8000000-0x396ff7ffffff 64bit pref]
[  369.663489] pci 0000:b4:00.0: BAR 0: [mem size 0x00100000 64bit] + pref
[  369.670879] pci 0000:b4:00.0: BAR 0: assigned [mem
0xddf00000-0xddffffff 64bit]
[  369.679171] pcieport 0000:b3:00.0: PCI bridge to [bus b4-b7]
[  369.685495] pcieport 0000:b3:00.0:   bridge window [io  0xf000-0xffff]
[  369.692791] pcieport 0000:b3:00.0:   bridge window [mem
0xdd000000-0xddffffff]
[  369.700857] pcieport 0000:b3:00.0:   bridge window [mem
0x396df8000000-0x396fffffffff 64bit pref]
[  369.710778] pcieport 0000:b3:00.0: Max Payload Size set to  256/
256 (was  256), Max Read Rq  128
[  369.720776] pci 0000:b4:00.0: Max Payload Size set to  256/ 256
(was  128), Max Read Rq  512
[  369.730231] pci 0000:b4:00.0: calling
mellanox_check_broken_intx_masking+0x0/0x130
[  369.738691] calling  mellanox_check_broken_intx_masking+0x0/0x130 @
40613 for 0000:b4:00.0
[  369.747913] pci fixup mellanox_check_broken_intx_masking+0x0/0x130
returned after 0 usecs for 0000:b4:00.0
[  369.759192] mlx4_core: Initializing 0000:b4:00.0
[  369.764398] mlx4_core 0000:b4:00.0: enabling device (0000 -> 0002)
[  369.771854]   alloc irq_desc for 71 on node 5
[  369.776904] IOAPIC[31]: Set IRTE entry (P:1 FPD:0 Dst_Mode:1
Redir_hint:1 Trig_Mode:0 Dlvry_Mode:1 Avail:0 Vector:D7 Dest:00143FFF
SID:B32C SQ:0 SVT:1)
[  369.792059] IOAPIC[24]: Set routing entry (31-0 -> 0xd7 -> IRQ 71
Mode:1 Active:1 Dest:1327103)
...

[  377.032574] pciehp 0000:b3:00.0:pcie004: pciehp_green_led_on:
SLOTCTRL a8 write cmd 100
[  377.032802] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010
from Slot Status
[  377.050076] pciehp 0000:b3:00.0:pcie004:
pciehp_set_attention_status: SLOTCTRL a8 write cmd c0
[  377.050328] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010
from Slot Status
Lukas Wunner Feb. 5, 2017, 7:34 a.m. UTC | #5
On Sat, Feb 04, 2017 at 08:22:59PM -0800, Yinghai Lu wrote:
> On Sat, Feb 4, 2017 at 3:34 PM, Lukas Wunner <lukas@wunner.de> wrote:
> > On Sat, Feb 04, 2017 at 01:44:34PM -0800, Yinghai Lu wrote:
> >> On Sat, Feb 4, 2017 at 10:56 AM, Lukas Wunner <lukas@wunner.de> wrote:
> >> > On Sat, Feb 04, 2017 at 09:12:54AM +0100, Lukas Wunner wrote:
> >> > Section 6.7.3.4 of the PCIe Base spec seems to support the theory above,
> >> > so here's a tentative patch.
> >> >
> >> > -- >8 --
> >> > Subject: [PATCH] PCI: pciehp: Don't enable PME on runtime suspend
> >>
> >> it works:
> >
> > Thanks a lot for the report and for testing the patch!
> 
> Wait, Commit 68db9bc still has problem with another server (skylake
> based), and this patch does not help.
[...]
> sca05-0a81fd8d:~ # echo 1 > /sys/bus/pci/slots/11/power
> [  375.376609] pci_hotplug: power_write_file: power = 1
> [  375.382175] pciehp 0000:b3:00.0:pcie004: pciehp_get_power_status: SLOTCTRL a8 value read 17f1
> [  375.392695] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010 from Slot Status
> [  375.401370] pciehp 0000:b3:00.0:pcie004: pciehp_power_on_slot: SLOTCTRL a8 write cmd 0
> [  375.410231] pciehp 0000:b3:00.0:pcie004: pciehp_green_led_blink: SLOTCTRL a8 write cmd 200
> [  375.411071] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010 from Slot Status
> [  375.445222] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010 from Slot Status
> [  377.444400] pciehp 0000:b3:00.0:pcie004: Data Link Layer Link Active not set in 1000 msec
> [  378.960364] pci 0000:b4:00.0 id reading try 50 times with interval 20 ms to get ffffffff
> [  378.969406] pciehp 0000:b3:00.0:pcie004: pciehp_check_link_status: lnk_status = 5001
> [  378.978059] pciehp 0000:b3:00.0:pcie004: link training error: status 0x5001
> [  378.985834] pciehp 0000:b3:00.0:pcie004: Failed to check link status
> [  378.987185] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010 from Slot Status
> [  378.987253] pciehp 0000:b3:00.0:pcie004: pciehp_power_off_slot: SLOTCTRL a8 write cmd 400
> [  380.000409] pciehp 0000:b3:00.0:pcie004: pciehp_green_led_off: SLOTCTRL a8 write cmd 300
> [  380.000674] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010 from Slot Status
> [  380.018020] pciehp 0000:b3:00.0:pcie004: pciehp_set_attention_status: SLOTCTRL a8 write cmd 40
> [  380.019053] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010 from Slot Status

So on this Skylake machine link training fails after resuming from D3hot
to D0.

One thing that's a bit fishy is that normally the Link Disable bit is
cleared when powering on the slot.  This results in a debug message
in dmesg containg the string "lnk_ctrl = ", and that line is missing
from the output you've pasted above, suggesting that the machine is
not running a stock v4.10 kernel after all but something else.  Could
you check why this message is not printed?  Could you check with lspci
if the Link Disable bit is set before you invoke "echo 1"?

This is the call stack:
pciehp_sysfs_enable_slot()
  pciehp_enable_slot()
    board_added()
      pciehp_power_on_slot()
        pciehp_link_enable()
          __pciehp_link_set()

Another theory is that the link is generally unreliable on this machine
since the Link Bandwidth Management Status bit is set in the Link Status
Register ("lnk_status = 5001"), which according to the spec means:

"Hardware has changed Link speed or width to attempt to correct unreliable
Link operation, either through an LTSSM timeout or a higher level process.
This bit must be set if the Physical Layer reports a speed or width change
was initiated by the Downstream component that was not indicated as an
autonomous change."

In this case it would be good to know which hardware exactly we're dealing
with so that we might quirk it to not runtime suspend the port.  To that
end, could you attach a full dmesg log to the bugzilla entry I've created?
https://bugzilla.kernel.org/show_bug.cgi?id=193951

@Mika, Rafael: Are you aware of Skylake machines with unreliable link
training, or perhaps errata of Skylake chips related to link training
on hotplug ports?

Thanks,

Lukas
Mika Westerberg Feb. 6, 2017, 10:37 a.m. UTC | #6
On Sun, Feb 05, 2017 at 08:34:54AM +0100, Lukas Wunner wrote:
> > sca05-0a81fd8d:~ # echo 1 > /sys/bus/pci/slots/11/power
> > [  375.376609] pci_hotplug: power_write_file: power = 1
> > [  375.382175] pciehp 0000:b3:00.0:pcie004: pciehp_get_power_status: SLOTCTRL a8 value read 17f1
> > [  375.392695] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010 from Slot Status
> > [  375.401370] pciehp 0000:b3:00.0:pcie004: pciehp_power_on_slot: SLOTCTRL a8 write cmd 0
> > [  375.410231] pciehp 0000:b3:00.0:pcie004: pciehp_green_led_blink: SLOTCTRL a8 write cmd 200
> > [  375.411071] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010 from Slot Status
> > [  375.445222] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010 from Slot Status
> > [  377.444400] pciehp 0000:b3:00.0:pcie004: Data Link Layer Link Active not set in 1000 msec
> > [  378.960364] pci 0000:b4:00.0 id reading try 50 times with interval 20 ms to get ffffffff
> > [  378.969406] pciehp 0000:b3:00.0:pcie004: pciehp_check_link_status: lnk_status = 5001
> > [  378.978059] pciehp 0000:b3:00.0:pcie004: link training error: status 0x5001
> > [  378.985834] pciehp 0000:b3:00.0:pcie004: Failed to check link status
> > [  378.987185] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010 from Slot Status
> > [  378.987253] pciehp 0000:b3:00.0:pcie004: pciehp_power_off_slot: SLOTCTRL a8 write cmd 400
> > [  380.000409] pciehp 0000:b3:00.0:pcie004: pciehp_green_led_off: SLOTCTRL a8 write cmd 300
> > [  380.000674] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010 from Slot Status
> > [  380.018020] pciehp 0000:b3:00.0:pcie004: pciehp_set_attention_status: SLOTCTRL a8 write cmd 40
> > [  380.019053] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010 from Slot Status

It would be good to see the output when 68db9bc is reverted. Yinghai,
can you attach that to the bugzilla but as well?

> So on this Skylake machine link training fails after resuming from D3hot
> to D0.
> 
> One thing that's a bit fishy is that normally the Link Disable bit is
> cleared when powering on the slot.  This results in a debug message
> in dmesg containg the string "lnk_ctrl = ", and that line is missing
> from the output you've pasted above, suggesting that the machine is
> not running a stock v4.10 kernel after all but something else.  Could
> you check why this message is not printed?  Could you check with lspci
> if the Link Disable bit is set before you invoke "echo 1"?
> 
> This is the call stack:
> pciehp_sysfs_enable_slot()
>   pciehp_enable_slot()
>     board_added()
>       pciehp_power_on_slot()
>         pciehp_link_enable()
>           __pciehp_link_set()
> 
> Another theory is that the link is generally unreliable on this machine
> since the Link Bandwidth Management Status bit is set in the Link Status
> Register ("lnk_status = 5001"), which according to the spec means:
> 
> "Hardware has changed Link speed or width to attempt to correct unreliable
> Link operation, either through an LTSSM timeout or a higher level process.
> This bit must be set if the Physical Layer reports a speed or width change
> was initiated by the Downstream component that was not indicated as an
> autonomous change."
> 
> In this case it would be good to know which hardware exactly we're dealing
> with so that we might quirk it to not runtime suspend the port.  To that
> end, could you attach a full dmesg log to the bugzilla entry I've created?
> https://bugzilla.kernel.org/show_bug.cgi?id=193951
> 
> @Mika, Rafael: Are you aware of Skylake machines with unreliable link
> training, or perhaps errata of Skylake chips related to link training
> on hotplug ports?

According to the 100-series (the chipset used with Skylake) errata
below, I don't see any mentions related to PCIe link training issues.

http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/100-series-chipset-spec-update.pdf
Rafael J. Wysocki Feb. 6, 2017, 11:45 a.m. UTC | #7
On Sunday, February 05, 2017 12:34:43 AM Lukas Wunner wrote:
> On Sat, Feb 04, 2017 at 01:44:34PM -0800, Yinghai Lu wrote:
> > On Sat, Feb 4, 2017 at 10:56 AM, Lukas Wunner <lukas@wunner.de> wrote:
> > > On Sat, Feb 04, 2017 at 09:12:54AM +0100, Lukas Wunner wrote:
> > > Section 6.7.3.4 of the PCIe Base spec seems to support the theory above,
> > > so here's a tentative patch.
> > >
> > >
> > > -- >8 --
> > > Subject: [PATCH] PCI: pciehp: Don't enable PME on runtime suspend
> > 
> > it works:
> 
> Thanks a lot for the report and for testing the patch!
> 
> 
> @Rafael:
> Bjorn usually wants you to ack PM-related patches.  Looking at the
> patch again I'm now thinking it might be cleaner to move the check
> for is_hotplug_bridge into pci_dev_run_wake().  Thoughts?

Yes, pci_dev_run_wake() would be a better place.

Thanks,
Rafael
Rafael J. Wysocki Feb. 6, 2017, 11:49 a.m. UTC | #8
On Monday, February 06, 2017 12:37:06 PM Mika Westerberg wrote:
> On Sun, Feb 05, 2017 at 08:34:54AM +0100, Lukas Wunner wrote:
> > > sca05-0a81fd8d:~ # echo 1 > /sys/bus/pci/slots/11/power
> > > [  375.376609] pci_hotplug: power_write_file: power = 1
> > > [  375.382175] pciehp 0000:b3:00.0:pcie004: pciehp_get_power_status: SLOTCTRL a8 value read 17f1
> > > [  375.392695] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010 from Slot Status
> > > [  375.401370] pciehp 0000:b3:00.0:pcie004: pciehp_power_on_slot: SLOTCTRL a8 write cmd 0
> > > [  375.410231] pciehp 0000:b3:00.0:pcie004: pciehp_green_led_blink: SLOTCTRL a8 write cmd 200
> > > [  375.411071] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010 from Slot Status
> > > [  375.445222] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010 from Slot Status
> > > [  377.444400] pciehp 0000:b3:00.0:pcie004: Data Link Layer Link Active not set in 1000 msec
> > > [  378.960364] pci 0000:b4:00.0 id reading try 50 times with interval 20 ms to get ffffffff
> > > [  378.969406] pciehp 0000:b3:00.0:pcie004: pciehp_check_link_status: lnk_status = 5001
> > > [  378.978059] pciehp 0000:b3:00.0:pcie004: link training error: status 0x5001
> > > [  378.985834] pciehp 0000:b3:00.0:pcie004: Failed to check link status
> > > [  378.987185] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010 from Slot Status
> > > [  378.987253] pciehp 0000:b3:00.0:pcie004: pciehp_power_off_slot: SLOTCTRL a8 write cmd 400
> > > [  380.000409] pciehp 0000:b3:00.0:pcie004: pciehp_green_led_off: SLOTCTRL a8 write cmd 300
> > > [  380.000674] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010 from Slot Status
> > > [  380.018020] pciehp 0000:b3:00.0:pcie004: pciehp_set_attention_status: SLOTCTRL a8 write cmd 40
> > > [  380.019053] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010 from Slot Status
> 
> It would be good to see the output when 68db9bc is reverted. Yinghai,
> can you attach that to the bugzilla but as well?
> 
> > So on this Skylake machine link training fails after resuming from D3hot
> > to D0.
> > 
> > One thing that's a bit fishy is that normally the Link Disable bit is
> > cleared when powering on the slot.  This results in a debug message
> > in dmesg containg the string "lnk_ctrl = ", and that line is missing
> > from the output you've pasted above, suggesting that the machine is
> > not running a stock v4.10 kernel after all but something else.  Could
> > you check why this message is not printed?  Could you check with lspci
> > if the Link Disable bit is set before you invoke "echo 1"?
> > 
> > This is the call stack:
> > pciehp_sysfs_enable_slot()
> >   pciehp_enable_slot()
> >     board_added()
> >       pciehp_power_on_slot()
> >         pciehp_link_enable()
> >           __pciehp_link_set()
> > 
> > Another theory is that the link is generally unreliable on this machine
> > since the Link Bandwidth Management Status bit is set in the Link Status
> > Register ("lnk_status = 5001"), which according to the spec means:
> > 
> > "Hardware has changed Link speed or width to attempt to correct unreliable
> > Link operation, either through an LTSSM timeout or a higher level process.
> > This bit must be set if the Physical Layer reports a speed or width change
> > was initiated by the Downstream component that was not indicated as an
> > autonomous change."
> > 
> > In this case it would be good to know which hardware exactly we're dealing
> > with so that we might quirk it to not runtime suspend the port.  To that
> > end, could you attach a full dmesg log to the bugzilla entry I've created?
> > https://bugzilla.kernel.org/show_bug.cgi?id=193951
> > 
> > @Mika, Rafael: Are you aware of Skylake machines with unreliable link
> > training, or perhaps errata of Skylake chips related to link training
> > on hotplug ports?
> 
> According to the 100-series (the chipset used with Skylake) errata
> below, I don't see any mentions related to PCIe link training issues.
> 
> http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/100-series-chipset-spec-update.pdf

Still, it does look like errata to me.

At least I don't see what can be done on the software side to avoid this from
happening except for leaving the port(s) in question in D0.

Thanks,
Rafael
Bjorn Helgaas Feb. 6, 2017, 6:10 p.m. UTC | #9
On Sun, Feb 05, 2017 at 08:34:54AM +0100, Lukas Wunner wrote:
> On Sat, Feb 04, 2017 at 08:22:59PM -0800, Yinghai Lu wrote:
> > On Sat, Feb 4, 2017 at 3:34 PM, Lukas Wunner <lukas@wunner.de> wrote:
> > > On Sat, Feb 04, 2017 at 01:44:34PM -0800, Yinghai Lu wrote:
> > >> On Sat, Feb 4, 2017 at 10:56 AM, Lukas Wunner <lukas@wunner.de> wrote:
> > >> > On Sat, Feb 04, 2017 at 09:12:54AM +0100, Lukas Wunner wrote:
> > >> > Section 6.7.3.4 of the PCIe Base spec seems to support the theory above,
> > >> > so here's a tentative patch.
> > >> >
> > >> > -- >8 --
> > >> > Subject: [PATCH] PCI: pciehp: Don't enable PME on runtime suspend
> > >>
> > >> it works:
> > >
> > > Thanks a lot for the report and for testing the patch!
> > 
> > Wait, Commit 68db9bc still has problem with another server (skylake
> > based), and this patch does not help.
> [...]
> > sca05-0a81fd8d:~ # echo 1 > /sys/bus/pci/slots/11/power
> > [  375.376609] pci_hotplug: power_write_file: power = 1
> > [  375.382175] pciehp 0000:b3:00.0:pcie004: pciehp_get_power_status: SLOTCTRL a8 value read 17f1
> > [  375.392695] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010 from Slot Status
> > [  375.401370] pciehp 0000:b3:00.0:pcie004: pciehp_power_on_slot: SLOTCTRL a8 write cmd 0
> > [  375.410231] pciehp 0000:b3:00.0:pcie004: pciehp_green_led_blink: SLOTCTRL a8 write cmd 200
> > [  375.411071] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010 from Slot Status
> > [  375.445222] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010 from Slot Status
> > [  377.444400] pciehp 0000:b3:00.0:pcie004: Data Link Layer Link Active not set in 1000 msec
> > [  378.960364] pci 0000:b4:00.0 id reading try 50 times with interval 20 ms to get ffffffff
> > [  378.969406] pciehp 0000:b3:00.0:pcie004: pciehp_check_link_status: lnk_status = 5001
> > [  378.978059] pciehp 0000:b3:00.0:pcie004: link training error: status 0x5001
> > [  378.985834] pciehp 0000:b3:00.0:pcie004: Failed to check link status
> > [  378.987185] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010 from Slot Status
> > [  378.987253] pciehp 0000:b3:00.0:pcie004: pciehp_power_off_slot: SLOTCTRL a8 write cmd 400
> > [  380.000409] pciehp 0000:b3:00.0:pcie004: pciehp_green_led_off: SLOTCTRL a8 write cmd 300
> > [  380.000674] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010 from Slot Status
> > [  380.018020] pciehp 0000:b3:00.0:pcie004: pciehp_set_attention_status: SLOTCTRL a8 write cmd 40
> > [  380.019053] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010 from Slot Status
> 
> So on this Skylake machine link training fails after resuming from D3hot
> to D0.
> 
> One thing that's a bit fishy is that normally the Link Disable bit is
> cleared when powering on the slot.  This results in a debug message
> in dmesg containg the string "lnk_ctrl = ", and that line is missing
> from the output you've pasted above, suggesting that the machine is
> not running a stock v4.10 kernel after all but something else.  Could
> you check why this message is not printed?  Could you check with lspci
> if the Link Disable bit is set before you invoke "echo 1"?
> 
> This is the call stack:
> pciehp_sysfs_enable_slot()
>   pciehp_enable_slot()
>     board_added()
>       pciehp_power_on_slot()
>         pciehp_link_enable()
>           __pciehp_link_set()
> 
> Another theory is that the link is generally unreliable on this machine
> since the Link Bandwidth Management Status bit is set in the Link Status
> Register ("lnk_status = 5001"), which according to the spec means:
> 
> "Hardware has changed Link speed or width to attempt to correct unreliable
> Link operation, either through an LTSSM timeout or a higher level process.
> This bit must be set if the Physical Layer reports a speed or width change
> was initiated by the Downstream component that was not indicated as an
> autonomous change."
> 
> In this case it would be good to know which hardware exactly we're dealing
> with so that we might quirk it to not runtime suspend the port.  To that
> end, could you attach a full dmesg log to the bugzilla entry I've created?
> https://bugzilla.kernel.org/show_bug.cgi?id=193951
> 
> @Mika, Rafael: Are you aware of Skylake machines with unreliable link
> training, or perhaps errata of Skylake chips related to link training
> on hotplug ports?

I think we're prematurely leaping to the conclusion that this is a
hardware erratum.  I don't have nearly the confidence that pciehp is
handling this correctly that you seem to have.

If this is a hardware erratum, we should be able to turn off
CONFIG_HOTPLUG_PCI_PCIE and drive through this scenario manually with
setpci.  That sequence would be immensely helpful to any hardware
engineers who want to investigate this.

I'm hesitant to add a quirk until we have a better understanding of
what's going on.  Yinghai tripped over this one broken case, but I
don't have any reason to believe that's the only one.

Bjorn
Lukas Wunner Feb. 6, 2017, 9:35 p.m. UTC | #10
On Mon, Feb 06, 2017 at 12:37:06PM +0200, Mika Westerberg wrote:
> On Sun, Feb 05, 2017 at 08:34:54AM +0100, Lukas Wunner wrote:
> > @Mika, Rafael: Are you aware of Skylake machines with unreliable link
> > training, or perhaps errata of Skylake chips related to link training
> > on hotplug ports?
> 
> According to the 100-series (the chipset used with Skylake) errata
> below, I don't see any mentions related to PCIe link training issues.
> 
> http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/100-series-chipset-spec-update.pdf

Yinghai Lu responded off-list that the hardware in question is an
unreleased / secret Intel product, so this particular issue cannot
be expected to be documented publicly at this point.

Of course this raises the question whether issues with unreleased
products can at all be considered valid regressions, given that the
final product may not regress.  It seems like a novelty to me that
patches would get reverted for something like this, but we'll see.

Lukas
Lukas Wunner Feb. 7, 2017, 6:08 a.m. UTC | #11
[cc += linux-pci]

On Mon, Feb 06, 2017 at 07:51:08PM -0800, Yinghai Lu wrote:
> On Mon, Feb 6, 2017 at 12:42 PM, Lukas Wunner <lukas@wunner.de> wrote:
> > On Sun, Feb 05, 2017 at 08:34:54AM +0100, Lukas Wunner wrote:
> >> On Sat, Feb 04, 2017 at 08:22:59PM -0800, Yinghai Lu wrote:
> >> > Wait, Commit 68db9bc still has problem with another server (skylake
> >> > based), and this patch does not help.
> >> [...]
> >> > sca05-0a81fd8d:~ # echo 1 > /sys/bus/pci/slots/11/power
> >> > [  375.376609] pci_hotplug: power_write_file: power = 1
> >> > [  375.382175] pciehp 0000:b3:00.0:pcie004: pciehp_get_power_status: SLOTCTRL a8 value read 17f1
> >> > [  375.392695] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010 from Slot Status
> >> > [  375.401370] pciehp 0000:b3:00.0:pcie004: pciehp_power_on_slot: SLOTCTRL a8 write cmd 0
> >> > [  375.410231] pciehp 0000:b3:00.0:pcie004: pciehp_green_led_blink: SLOTCTRL a8 write cmd 200
> >> > [  375.411071] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010 from Slot Status
> >> > [  375.445222] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010 from Slot Status
> >> > [  377.444400] pciehp 0000:b3:00.0:pcie004: Data Link Layer Link Active not set in 1000 msec
> >> > [  378.960364] pci 0000:b4:00.0 id reading try 50 times with interval 20 ms to get ffffffff
> >> > [  378.969406] pciehp 0000:b3:00.0:pcie004: pciehp_check_link_status: lnk_status = 5001
> >> > [  378.978059] pciehp 0000:b3:00.0:pcie004: link training error: status 0x5001
> >> > [  378.985834] pciehp 0000:b3:00.0:pcie004: Failed to check link status
> >> > [  378.987185] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010 from Slot Status
> >> > [  378.987253] pciehp 0000:b3:00.0:pcie004: pciehp_power_off_slot: SLOTCTRL a8 write cmd 400
> >> > [  380.000409] pciehp 0000:b3:00.0:pcie004: pciehp_green_led_off: SLOTCTRL a8 write cmd 300
> >> > [  380.000674] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010 from Slot Status
> >> > [  380.018020] pciehp 0000:b3:00.0:pcie004: pciehp_set_attention_status: SLOTCTRL a8 write cmd 40
> >> > [  380.019053] pciehp 0000:b3:00.0:pcie004: pending interrupts 0x0010 from Slot Status
> >>
> >> So on this Skylake machine link training fails after resuming from D3hot
> >> to D0.
> >>
> >> One thing that's a bit fishy is that normally the Link Disable bit is
> >> cleared when powering on the slot.  This results in a debug message
> >> in dmesg containg the string "lnk_ctrl = ", and that line is missing
> >> from the output you've pasted above, suggesting that the machine is
> >> not running a stock v4.10 kernel after all but something else.  Could
> >> you check why this message is not printed?  Could you check with lspci
> >> if the Link Disable bit is set before you invoke "echo 1"?
> >
> > Could you answer the questions above please?
> 
> link is always enabled, except some BIOS disable the link if the the
> card is not present.
> 
> there is one time we disable link when power off the slot., and that
> line get removed
> to support link event.
> 
> now that enable link is once on BIOS have that disabled during booting.
> 
> also i had one local patch:
> during enabling, the function will check if the old value is the same
> as new value.
> if it the same it will not write again.
> 
> void pcie_link_disable_set(struct pci_dev *dev, int bit)
> {
>         u16 lnk_ctrl, old_lnk_ctrl;
> 
>         if (!pci_is_pcie(dev))
>                 return;
> 
>         pcie_capability_read_word(dev, PCI_EXP_LNKCTL, &lnk_ctrl);
>         old_lnk_ctrl = lnk_ctrl;
> 
>         if (!bit)
>                 lnk_ctrl &= ~PCI_EXP_LNKCTL_LD;
>         else
>                 lnk_ctrl |= PCI_EXP_LNKCTL_LD;
> 
>         if (old_lnk_ctrl == lnk_ctrl)
>                 return;
> 
>         pcie_capability_write_word(dev, PCI_EXP_LNKCTL, lnk_ctrl);
> 
>         dev_printk(KERN_DEBUG, &dev->dev, "%s: lnk_ctrl = %x\n", __func__,
>                          lnk_ctrl);
> }
> EXPORT_SYMBOL(pcie_link_disable_set);

So you're not even running a stock v4.10 kernel?  The function you've
quoted above is not part of the stock kernel.  Where are you calling
this from?

Please retry with a stock v4.10-rc7 kernel and report back if the issue
persists.

Sorry to be blunt, but I think it's unreasonable and unfair to report
an issue with link training and alleging that it's caused by my patch,
and afterwards coming out of the closet that you're using custom patches
that modify link training, the very part that is failing.

Lukas
Yinghai Lu Feb. 7, 2017, 6:08 p.m. UTC | #12
On Mon, Feb 6, 2017 at 10:08 PM, Lukas Wunner <lukas@wunner.de> wrote:
> Please retry with a stock v4.10-rc7 kernel and report back if the issue
> persists.

sca05-0a81fd7f:~ # echo 0 > /sys/bus/pci/slots/7/power
[  454.072658] pciehp 0000:73:00.0:pcie004: pciehp_get_power_status:
SLOTCTRL a8 value read 11f1
[  454.082194] pciehp 0000:73:00.0:pcie004: pciehp_unconfigure_device:
domain:bus:dev = 0000:74:00
[  454.091931] mlx4_core 0000:74:00.0: PME# disabled
[  455.418084] pciehp 0000:73:00.0:pcie004: pending interrupts 0x0010
from Slot Status
[  455.426802] pciehp 0000:73:00.0:pcie004: pciehp_power_off_slot:
SLOTCTRL a8 write cmd 400
[  455.490381] pciehp 0000:73:00.0:pcie004: pending interrupts 0x0108
from Slot Status
[  455.498938] pciehp 0000:73:00.0:pcie004: Slot(7): Link Down
[  455.505426] pciehp 0000:73:00.0:pcie004: Slot(7): Link Down event
ignored; already powering off
[  456.440169] pciehp 0000:73:00.0:pcie004: pciehp_green_led_off:
SLOTCTRL a8 write cmd 300
[  456.441181] pciehp 0000:73:00.0:pcie004: pending interrupts 0x0010
from Slot Status
sca05-0a81fd7f:~ # echo 1 > /sys/bus/pci/slots/7/power
[  470.356464] pciehp 0000:73:00.0:pcie004: pciehp_get_power_status:
SLOTCTRL a8 value read 17f1
[  470.366662] pciehp 0000:73:00.0:pcie004: pending interrupts 0x0010
from Slot Status
[  470.375339] pciehp 0000:73:00.0:pcie004: pciehp_power_on_slot:
SLOTCTRL a8 write cmd 0
[  470.384199] pciehp 0000:73:00.0:pcie004: __pciehp_link_set: lnk_ctrl = 40
[  470.391789] pciehp 0000:73:00.0:pcie004: pciehp_green_led_blink:
SLOTCTRL a8 write cmd 200
[  470.391966] pciehp 0000:73:00.0:pcie004: pending interrupts 0x0010
from Slot Status
[  470.428791] pciehp 0000:73:00.0:pcie004: pending interrupts 0x0010
from Slot Status
[  472.428147] pciehp 0000:73:00.0:pcie004: Data Link Layer Link
Active not set in 1000 msec
[  473.944158] pci 0000:74:00.0 id reading try 50 times with interval
20 ms to get ffffffff
[  473.953209] pciehp 0000:73:00.0:pcie004: pciehp_check_link_status:
lnk_status = 5001
[  473.961861] pciehp 0000:73:00.0:pcie004: link training error: status 0x5001
[  473.969642] pciehp 0000:73:00.0:pcie004: Failed to check link status
[  473.970721] pciehp 0000:73:00.0:pcie004: pending interrupts 0x0010
from Slot Status
[  473.970818] pciehp 0000:73:00.0:pcie004: pciehp_power_off_slot:
SLOTCTRL a8 write cmd 400
[  475.000272] pciehp 0000:73:00.0:pcie004: pciehp_green_led_off:
SLOTCTRL a8 write cmd 300
[  475.000880] pciehp 0000:73:00.0:pcie004: pending interrupts 0x0010
from Slot Status
[  475.017879] pciehp 0000:73:00.0:pcie004:
pciehp_set_attention_status: SLOTCTRL a8 write cmd 40
[  475.018386] pciehp 0000:73:00.0:pcie004: pending interrupts 0x0010
from Slot Status
-bash: echo: write error: Operation not permitted
Erik Veijola Feb. 8, 2017, 1 p.m. UTC | #13
On Mon, Feb 06, 2017 at 10:35:28PM +0100, Lukas Wunner wrote:
> On Mon, Feb 06, 2017 at 12:37:06PM +0200, Mika Westerberg wrote:
> > On Sun, Feb 05, 2017 at 08:34:54AM +0100, Lukas Wunner wrote:
> > > @Mika, Rafael: Are you aware of Skylake machines with unreliable link
> > > training, or perhaps errata of Skylake chips related to link training
> > > on hotplug ports?
> > 
> > According to the 100-series (the chipset used with Skylake) errata
> > below, I don't see any mentions related to PCIe link training issues.
> > 
> > http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/100-series-chipset-spec-update.pdf
> 
> Yinghai Lu responded off-list that the hardware in question is an
> unreleased / secret Intel product, so this particular issue cannot
> be expected to be documented publicly at this point.
> 
> Of course this raises the question whether issues with unreleased
> products can at all be considered valid regressions, given that the
> final product may not regress.  It seems like a novelty to me that
> patches would get reverted for something like this, but we'll see.
> 
> Lukas
> 

Yinghai, we may have a similar system in our lab so we could test this
also. What is your setup for doing the test?

-Erik
Bjorn Helgaas Feb. 8, 2017, 5:25 p.m. UTC | #14
On Mon, Feb 06, 2017 at 10:35:28PM +0100, Lukas Wunner wrote:
> On Mon, Feb 06, 2017 at 12:37:06PM +0200, Mika Westerberg wrote:
> > On Sun, Feb 05, 2017 at 08:34:54AM +0100, Lukas Wunner wrote:
> > > @Mika, Rafael: Are you aware of Skylake machines with unreliable link
> > > training, or perhaps errata of Skylake chips related to link training
> > > on hotplug ports?
> > 
> > According to the 100-series (the chipset used with Skylake) errata
> > below, I don't see any mentions related to PCIe link training issues.
> > 
> > http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/100-series-chipset-spec-update.pdf
> 
> Yinghai Lu responded off-list that the hardware in question is an
> unreleased / secret Intel product, so this particular issue cannot
> be expected to be documented publicly at this point.
> 
> Of course this raises the question whether issues with unreleased
> products can at all be considered valid regressions, given that the
> final product may not regress.  It seems like a novelty to me that
> patches would get reverted for something like this, but we'll see.

I assume the hardware will eventually be released, and I assume the
hardware will not be changed because of this issue.  I would like to
avoid the situation of having v4.9 but not v4.10 work on this
hardware.
diff mbox

Patch

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index a881c0d..daf3c02 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2072,7 +2072,13 @@  int pci_finish_runtime_suspend(struct pci_dev *dev)
 
 	dev->runtime_d3cold = target_state == PCI_D3cold;
 
-	__pci_enable_wake(dev, target_state, true, pci_dev_run_wake(dev));
+	/*
+	 * Enable PME if dev can generate wakeup events at runtime, but not on
+	 * hotplug ports since PME signals unwanted interrupts if the slot is
+	 * occupied (PCIe Base Specification, section 6.7.3.4).
+	 */
+	__pci_enable_wake(dev, target_state, true, pci_dev_run_wake(dev) &&
+						   !dev->is_hotplug_bridge);
 
 	error = pci_set_power_state(dev, target_state);