diff mbox series

[v2,1/2] PCI: Introduce pcie_wait_for_link_delay()

Message ID 20191004123947.11087-2-mika.westerberg@linux.intel.com (mailing list archive)
State Superseded, archived
Headers show
Series PCI: Add missing link delays | expand

Commit Message

Mika Westerberg Oct. 4, 2019, 12:39 p.m. UTC
This is otherwise similar to pcie_wait_for_link() but allows passing
custom activation delay in milliseconds.

Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
---
 drivers/pci/pci.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

Comments

Marc MERLIN Aug. 8, 2020, 8:22 p.m. UTC | #1
On Fri, Oct 04, 2019 at 03:39:46PM +0300, Mika Westerberg wrote:
> This is otherwise similar to pcie_wait_for_link() but allows passing
> custom activation delay in milliseconds.
> 
> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
> ---
>  drivers/pci/pci.c | 21 ++++++++++++++++++---
>  1 file changed, 18 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index e7982af9a5d8..bfd92e018925 100644

Hi Mika,

So, I have a thinkpad P73 with thunderbolt, and while I don't boot
often, my last boots have been unreliable at best (was only able to boot
5.7 once, and 5.8 did not succeed either).

5.6 was working for a while, but couldn't boot it either this morning,
so I had to go back to 5.5. This does not mean 5.5 does not have the
problem, just that it booted this morning, while 5.6 didn't when I
tried.
Once the kernel is booted, the problem does not seem to occur much, or
at all.

Basically, I'm getting the same thing than this person with a P53 (which
is a mostly identical lenovo thinkpad, to mine)
kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
https://bbs.archlinux.org/viewtopic.php?id=250658

The kernel boots eventually, but it takes minutes, and everything is so
super slow, that I just can't reasonably use the machine.

This shows similar issues with 5.3, 5.4.
https://forum.proxmox.com/threads/pme-spurious-native-interrupt-kernel-meldungen.62850/

Another report here with 5.6:
https://bugzilla.redhat.com/show_bug.cgi?id=1831899

My current kernel is running your patch above, and I haven't done a lot
of research yet to confirm whether going back to a kernel before it was
merged, fixes the problem. Unfortunately the problem is not consistent,
so it makes things harder to test/debug, especially on my main laptop
that I do all my work on :)

I noticed this older patch of yours:
http://patchwork.ozlabs.org/project/linux-pci/patch/0113014581dbe2d1f938813f1783905bd81b79db.1560079442.git.lukas@wunner.de/
This patch is not in my kernel, is it worth adding?

Can I get you more info to help debug this?

If that helps:
sauron:/usr/src/linux-5.7.11-amd64-preempt-sysrq-20190816/drivers/pci# lspci
00:00.0 Host bridge: Intel Corporation Device 3e20 (rev 0d)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 0d)
00:02.0 VGA compatible controller: Intel Corporation UHD Graphics 630 (Mobile) (rev 02)
00:04.0 Signal processing controller: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem (rev 0d)
00:08.0 System peripheral: Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th Gen Core Processor Gaussian Mixture Model
00:12.0 Signal processing controller: Intel Corporation Cannon Lake PCH Thermal Controller (rev 10)
00:14.0 USB controller: Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller (rev 10)
00:14.2 RAM memory: Intel Corporation Cannon Lake PCH Shared SRAM (rev 10)
00:15.0 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH Serial IO I2C Controller #0 (rev 10)
00:15.1 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH Serial IO I2C Controller #1 (rev 10)
00:16.0 Communication controller: Intel Corporation Cannon Lake PCH HECI Controller (rev 10)
00:17.0 SATA controller: Intel Corporation Cannon Lake Mobile PCH SATA AHCI Controller (rev 10)
00:1b.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #17 (rev f0)
00:1c.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #1 (rev f0)
00:1c.5 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #6 (rev f0)
00:1c.7 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #8 (rev f0)
00:1e.0 Communication controller: Intel Corporation Cannon Lake PCH Serial IO UART Host Controller (rev 10)
00:1f.0 ISA bridge: Intel Corporation Cannon Lake LPC Controller (rev 10)
00:1f.3 Audio device: Intel Corporation Cannon Lake PCH cAVS (rev 10)
00:1f.4 SMBus: Intel Corporation Cannon Lake PCH SMBus Controller (rev 10)
00:1f.5 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH SPI Controller (rev 10)
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (7) I219-LM (rev 10)
01:00.0 VGA compatible controller: NVIDIA Corporation TU104GLM [Quadro RTX 4000 Mobile / Max-Q] (rev a1)
01:00.1 Audio device: NVIDIA Corporation TU104 HD Audio Controller (rev a1)
01:00.2 USB controller: NVIDIA Corporation TU104 USB 3.1 Host Controller (rev a1)
01:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU104 USB Type-C UCSI Controller (rev a1)
02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
04:00.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
05:00.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
05:01.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
05:02.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
05:04.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
06:00.0 System peripheral: Intel Corporation JHL7540 Thunderbolt 3 NHI [Titan Ridge 4C 2018] (rev 06)
2c:00.0 USB controller: Intel Corporation JHL7540 Thunderbolt 3 USB Controller [Titan Ridge 4C 2018] (rev 06)
52:00.0 Network controller: Intel Corporation Wi-Fi 6 AX200 (rev 1a)
54:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader (rev 01)


sauron:/usr/src/linux-5.7.11-amd64-preempt-sysrq-20190816/drivers/pci# lsusb -t
/:  Bus 06.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/2p, 10000M
/:  Bus 05.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/2p, 480M
/:  Bus 04.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/4p, 10000M
/:  Bus 03.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/2p, 480M
/:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/10p, 10000M
/:  Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/16p, 480M
    |__ Port 2: Dev 2, If 0, Class=Human Interface Device, Driver=usbhid, 12M
    |__ Port 8: Dev 3, If 3, Class=Video, Driver=uvcvideo, 480M
    |__ Port 8: Dev 3, If 1, Class=Video, Driver=uvcvideo, 480M
    |__ Port 8: Dev 3, If 2, Class=Video, Driver=uvcvideo, 480M
    |__ Port 8: Dev 3, If 0, Class=Video, Driver=uvcvideo, 480M
    |__ Port 9: Dev 4, If 0, Class=Vendor Specific Class, Driver=, 12M
    |__ Port 14: Dev 6, If 0, Class=Wireless, Driver=btusb, 12M
    |__ Port 14: Dev 6, If 1, Class=Wireless, Driver=btusb, 12M

Thanks,
Marc
Marc MERLIN Aug. 8, 2020, 8:23 p.m. UTC | #2
I forgot to add that my mostly hanging boots look like this:
https://photos.app.goo.gl/HJvTraYYZbiNTNE39

Marc
Marc MERLIN Aug. 9, 2020, 4:31 p.m. UTC | #3
On Sat, Aug 08, 2020 at 01:22:02PM -0700, Marc MERLIN wrote:
> Basically, I'm getting the same thing than this person with a P53 (which
> is a mostly identical lenovo thinkpad, to mine)
> kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
> kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
> kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
> kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
> kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
> https://bbs.archlinux.org/viewtopic.php?id=250658
 
I had to reboot today and tried my 5.7.11 kernel 6 times.
It never booted and each time got stuck on 
pcieport 0000:00:01.0: PME: Spurious native interrupt!

This is the nvidia driver and claimed by nouveau (I don't use nvidia graphics
but I'm forced to use nouveau to turn the nvidia chip down so that it
doesn't drain my batteries).

I ended up being able to boot the 7th time after removing the yubikey in my USB-C 
port, which is also thunderbolt.
PME messages shown below. Let me know if you'd like further data.

Thanks,
Marc

[    4.142484] acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug PME PCIeCapability LTR DPC]
[    4.151715] pci 0000:00:01.0: PME# supported from D0 D3hot D3cold
[    4.151727] pci 0000:00:01.0: PME# disabled
[    4.165979] pci 0000:00:14.0: PME# supported from D3hot D3cold
[    4.166000] pci 0000:00:14.0: PME# disabled
[    4.177746] pci 0000:00:16.0: PME# supported from D3hot
[    4.177767] pci 0000:00:16.0: PME# disabled
[    4.180850] pci 0000:00:17.0: PME# supported from D3hot
[    4.180862] pci 0000:00:17.0: PME# disabled
[    4.183830] pci 0000:00:1b.0: PME# supported from D0 D3hot D3cold
[    4.183847] pci 0000:00:1b.0: PME# disabled
[    4.189643] pci 0000:00:1c.0: PME# supported from D0 D3hot D3cold
[    4.189660] pci 0000:00:1c.0: PME# disabled
[    4.193085] pci 0000:00:1c.5: PME# supported from D0 D3hot D3cold
[    4.193101] pci 0000:00:1c.5: PME# disabled
[    4.196462] pci 0000:00:1c.7: PME# supported from D0 D3hot D3cold
[    4.196478] pci 0000:00:1c.7: PME# disabled
[    4.206057] pci 0000:00:1f.3: PME# supported from D3hot D3cold
[    4.206079] pci 0000:00:1f.3: PME# disabled
[    4.214993] pci 0000:00:1f.6: PME# supported from D0 D3hot D3cold
[    4.215015] pci 0000:00:1f.6: PME# disabled
[    4.217978] pci 0000:01:00.0: PME# supported from D0 D3hot
[    4.217991] pci 0000:01:00.0: PME# disabled
[    4.219129] pci 0000:01:00.2: PME# supported from D0 D3hot
[    4.219142] pci 0000:01:00.2: PME# disabled
[    4.219578] pci 0000:01:00.3: PME# supported from D0 D3hot
[    4.219591] pci 0000:01:00.3: PME# disabled
[    4.221398] pci 0000:04:00.0: PME# supported from D0 D1 D2 D3hot D3cold
[    4.221433] pci 0000:04:00.0: PME# disabled
[    4.222282] pci 0000:05:00.0: PME# supported from D0 D1 D2 D3hot D3cold
[    4.222297] pci 0000:05:00.0: PME# disabled
[    4.222792] pci 0000:05:01.0: PME# supported from D0 D1 D2 D3hot D3cold
[    4.222806] pci 0000:05:01.0: PME# disabled
[    4.223289] pci 0000:05:02.0: PME# supported from D0 D1 D2 D3hot D3cold
[    4.223304] pci 0000:05:02.0: PME# disabled
[    4.223839] pci 0000:05:04.0: PME# supported from D0 D1 D2 D3hot D3cold
[    4.223854] pci 0000:05:04.0: PME# disabled
[    4.224645] pci 0000:06:00.0: PME# supported from D0 D1 D2 D3hot D3cold
[    4.224661] pci 0000:06:00.0: PME# disabled
[    4.225644] pci 0000:2c:00.0: PME# supported from D0 D1 D2 D3hot D3cold
[    4.225661] pci 0000:2c:00.0: PME# disabled
[    4.227557] pci 0000:52:00.0: PME# supported from D0 D3hot D3cold
[    4.227708] pci 0000:52:00.0: PME# disabled
[    4.229139] pci 0000:54:00.0: PME# supported from D1 D2 D3hot D3cold
[    4.229155] pci 0000:54:00.0: PME# disabled
[    7.238126] pcieport 0000:00:01.0: PME: Signaling with IRQ 122
[    7.239208] pcieport 0000:00:1b.0: PME: Signaling with IRQ 123
[    7.239861] pcieport 0000:00:1c.0: PME: Signaling with IRQ 124
[    7.241522] pcieport 0000:00:1c.5: PME: Signaling with IRQ 125
[    7.242499] pcieport 0000:00:1c.7: PME: Signaling with IRQ 126
[    7.401422] pcieport 0000:05:01.0: PME# enabled
[    7.401868] pcieport 0000:05:04.0: PME# enabled
[    8.985668] xhci_hcd 0000:01:00.2: PME# enabled
[    8.988738] xhci_hcd 0000:2c:00.0: PME# enabled
[    9.008649] pcieport 0000:05:02.0: PME# enabled
[   12.378450] nvidia-gpu 0000:01:00.3: PME# enabled
[   25.610848] thunderbolt 0000:06:00.0: PME# enabled
[   25.628766] pcieport 0000:05:00.0: PME# enabled
[   25.648762] pcieport 0000:04:00.0: PME# enabled
[   25.668889] pcieport 0000:00:1c.0: PME# enabled
[  179.608847] nvidia-gpu 0000:01:00.3: PME# disabled
[  179.608873] pcieport 0000:00:01.0: PME: Spurious native interrupt!
[  183.359454] nvidia-gpu 0000:01:00.3: PME# enabled
[  183.396832] nvidia-gpu 0000:01:00.3: PME# disabled
[  183.396859] pcieport 0000:00:01.0: PME: Spurious native interrupt!
[  187.147398] nvidia-gpu 0000:01:00.3: PME# enabled
[  187.184826] nvidia-gpu 0000:01:00.3: PME# disabled
[  187.184852] pcieport 0000:00:01.0: PME: Spurious native interrupt!
[  190.932329] nvidia-gpu 0000:01:00.3: PME# enabled
[  190.972359] nvidia-gpu 0000:01:00.3: PME# disabled
[  190.972366] pcieport 0000:00:01.0: PME: Spurious native interrupt!
[  192.351330] snd_hda_intel 0000:00:1f.3: PME# enabled
[  192.468365] snd_hda_intel 0000:00:1f.3: PME# disabled
[  192.736342] xhci_hcd 0000:01:00.2: PME# disabled
[  194.296431] pcieport 0000:00:1c.0: PME# disabled
[  194.432427] pcieport 0000:04:00.0: PME# disabled
[  194.432591] pcieport 0000:05:01.0: PME# disabled
[  194.432771] pcieport 0000:05:00.0: PME# disabled
[  194.432943] pcieport 0000:05:02.0: PME# disabled
[  194.433102] pcieport 0000:05:04.0: PME# disabled
[  194.556299] pcieport 0000:05:01.0: PME# enabled
[  194.556417] pcieport 0000:05:04.0: PME# enabled
[  195.560440] thunderbolt 0000:06:00.0: PME# disabled
Marc MERLIN Sept. 6, 2020, 6:18 p.m. UTC | #4
Ok, I have an update to this problem. I added the nouveau list because
I can't quite tell if the issue is:
- the PCIe changes that went in 5.6 I think (or 5.5?), referenced below

- a new issue with thunderbold on thinkpad P73, that seems to be
  triggered if I have a USB-C yubikey in the port. With 5.7, my issues
  went away if I removed the USB key during boot, showing an interaction
  between nouveau and thunderbolt

- changes in the nouveau driver. Mika told me the PCIe regression
  "pcieport 0000:00:01.0: PME: Spurious native interrupt!" is supposed
  to be fixed in 5.8, but I still get a 4mn hang or so during boot and
  with 5.8, removing the USB key, didn't help make the boot faster

I don't otherwise use the nvidia chip I so wish I didn't have, I only
use intel graphics on that laptop, but I must apparently use the nouveau
driver to manage the nouveau chip so that it's turned off and not
burning 60W doing nothing.

lspci is in the quoted message below, I won't copy it here again, but
here's the nvidia bit:
01:00.0 VGA compatible controller: NVIDIA Corporation TU104GLM [Quadro RTX 4000 Mobile / Max-Q] (rev a1)
01:00.1 Audio device: NVIDIA Corporation TU104 HD Audio Controller (rev a1)
01:00.2 USB controller: NVIDIA Corporation TU104 USB 3.1 Host Controller (rev a1)
01:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU104 USB Type-C UCSI Controller (rev a1)

Here are 5 boots, 4 on 5.8.5:

dmesg.1_hang_but_no_warning.txt https://pastebin.com/Y5NaH08n
Boot hung for quite a while, but no clear output

dmesg.2_pme_spurious.txt https://pastebin.com/dX19aCpj
[    8.185808] nvidia-gpu 0000:01:00.3: runtime IRQ mapping not provided by arch
[    8.185989] nvidia-gpu 0000:01:00.3: enabling device (0000 -> 0002)
[    8.188986] nvidia-gpu 0000:01:00.3: enabling bus mastering
[   11.936507] nvidia-gpu 0000:01:00.3: PME# enabled
[   11.975985] nvidia-gpu 0000:01:00.3: PME# disabled
[   11.976011] pcieport 0000:00:01.0: PME: Spurious native interrupt!

dmesg.3_usb_key_yanked.txt https://pastebin.com/m7QLnCZt
I yanked the USB key during boot, that seemed to help unlock things with
5.7, but did not with 5.8. It's hung on a loop of:
[   11.262854] nvidia-gpu 0000:01:00.3: saving config space at offset 0x0 (reading 0x1ad910de)
[   11.262863] nvidia-gpu 0000:01:00.3: saving config space at offset 0x4 (reading 0x100406)
[   11.262869] nvidia-gpu 0000:01:00.3: saving config space at offset 0x8 (reading 0xc8000a1)
[   11.262874] nvidia-gpu 0000:01:00.3: saving config space at offset 0xc (reading 0x800000)
[   11.262880] nvidia-gpu 0000:01:00.3: saving config space at offset 0x10 (reading 0xce054000)
[   11.262885] nvidia-gpu 0000:01:00.3: saving config space at offset 0x14 (reading 0x0)
[   11.262890] nvidia-gpu 0000:01:00.3: saving config space at offset 0x18 (reading 0x0)
[   11.262895] nvidia-gpu 0000:01:00.3: saving config space at offset 0x1c (reading 0x0)
[   11.262900] nvidia-gpu 0000:01:00.3: saving config space at offset 0x20 (reading 0x0)
[   11.262906] nvidia-gpu 0000:01:00.3: saving config space at offset 0x24 (reading 0x0)
[   11.262911] nvidia-gpu 0000:01:00.3: saving config space at offset 0x28 (reading 0x0)
[   11.262916] nvidia-gpu 0000:01:00.3: saving config space at offset 0x2c (reading 0x229b17aa)
[   11.262921] nvidia-gpu 0000:01:00.3: saving config space at offset 0x30 (reading 0x0)
[   11.262926] nvidia-gpu 0000:01:00.3: saving config space at offset 0x34 (reading 0x68)
[   11.262931] nvidia-gpu 0000:01:00.3: saving config space at offset 0x38 (reading 0x0)
[   11.262937] nvidia-gpu 0000:01:00.3: saving config space at offset 0x3c (reading 0x4ff)
[   11.262985] nvidia-gpu 0000:01:00.3: PME# enabled
[   11.303060] nvidia-gpu 0000:01:00.3: PME# disabled

dmesg.4_5.5_boot_fine.txt https://pastebin.com/WXgQTUYP
reference boot with 4.5, it works fine, no issues

dmesg.5_no_key_still_hang.txt https://pastebin.com/kcT8Ras0
unfortunately, booting without the USB-C key in thunderbolt, did not
allow this boot to be faster, it looks different though:
[    6.723454] pcieport 0000:00:01.0: runtime IRQ mapping not provided by arch
[    6.723598] pcieport 0000:00:01.0: PME: Signaling with IRQ 122
[    6.724011] pcieport 0000:00:01.0: saving config space at offset 0x0 (reading 0x19018086)
[    6.724016] pcieport 0000:00:01.0: saving config space at offset 0x4 (reading 0x100407)
[    6.724021] pcieport 0000:00:01.0: saving config space at offset 0x8 (reading 0x604000d)
[    6.724025] pcieport 0000:00:01.0: saving config space at offset 0xc (reading 0x810000)
[    6.724029] pcieport 0000:00:01.0: saving config space at offset 0x10 (reading 0x0)
[    6.724033] pcieport 0000:00:01.0: saving config space at offset 0x14 (reading 0x0)
[    6.724037] pcieport 0000:00:01.0: saving config space at offset 0x18 (reading 0x10100)
[    6.724041] pcieport 0000:00:01.0: saving config space at offset 0x1c (reading 0x20002020)
[    6.724046] pcieport 0000:00:01.0: saving config space at offset 0x20 (reading 0xce00cd00)
[    6.724050] pcieport 0000:00:01.0: saving config space at offset 0x24 (reading 0xb1f1a001)
[    6.724054] pcieport 0000:00:01.0: saving config space at offset 0x28 (reading 0x0)
[    6.724058] pcieport 0000:00:01.0: saving config space at offset 0x2c (reading 0x0)
[    6.724062] pcieport 0000:00:01.0: saving config space at offset 0x30 (reading 0x0)
[    6.724066] pcieport 0000:00:01.0: saving config space at offset 0x34 (reading 0x88)
[    6.724070] pcieport 0000:00:01.0: saving config space at offset 0x38 (reading 0x0)
[    6.724074] pcieport 0000:00:01.0: saving config space at offset 0x3c (reading 0x201ff)
[    6.724129] pcieport 0000:00:1b.0: runtime IRQ mapping not provided by arch
[    6.724650] pcieport 0000:00:1b.0: PME: Signaling with IRQ 123
[    6.725021] pcieport 0000:00:1b.0: saving config space at offset 0x0 (reading 0xa3408086)
[    6.725026] pcieport 0000:00:1b.0: saving config space at offset 0x4 (reading 0x100407)
[    6.725031] pcieport 0000:00:1b.0: saving config space at offset 0x8 (reading 0x60400f0)
[    6.725035] pcieport 0000:00:1b.0: saving config space at offset 0xc (reading 0x810000)
[    6.725040] pcieport 0000:00:1b.0: saving config space at offset 0x10 (reading 0x0)
[    6.725044] pcieport 0000:00:1b.0: saving config space at offset 0x14 (reading 0x0)
[    6.725049] pcieport 0000:00:1b.0: saving config space at offset 0x18 (reading 0x20200)
[    6.725053] pcieport 0000:00:1b.0: saving config space at offset 0x1c (reading 0x200000f0)
[    6.725058] pcieport 0000:00:1b.0: saving config space at offset 0x20 (reading 0xce30ce30)
[    6.725062] pcieport 0000:00:1b.0: saving config space at offset 0x24 (reading 0x1fff1)
[    6.725067] pcieport 0000:00:1b.0: saving config space at offset 0x28 (reading 0x0)
[    6.725071] pcieport 0000:00:1b.0: saving config space at offset 0x2c (reading 0x0)
[    6.725075] pcieport 0000:00:1b.0: saving config space at offset 0x30 (reading 0x0)
[    6.725080] pcieport 0000:00:1b.0: saving config space at offset 0x34 (reading 0x40)
[    6.725084] pcieport 0000:00:1b.0: saving config space at offset 0x38 (reading 0x0)
[    6.725089] pcieport 0000:00:1b.0: saving config space at offset 0x3c (reading 0x201ff)
[    6.725154] pcieport 0000:00:1c.0: runtime IRQ mapping not provided by arch
[    6.725284] pcieport 0000:00:1c.0: PME: Signaling with IRQ 124
[    6.725580] pcieport 0000:00:1c.0: pciehp: Slot #0 AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+ Surprise+ Interlock- NoCompl+ IbPresDis- LLActRep+
[    6.726086] pci_bus 0000:04: dev 00, created physical slot 0

Any idea what's going on?

Thanks,
Marc

On Sat, Aug 08, 2020 at 01:22:02PM -0700, Marc MERLIN wrote:
> On Fri, Oct 04, 2019 at 03:39:46PM +0300, Mika Westerberg wrote:
> > This is otherwise similar to pcie_wait_for_link() but allows passing
> > custom activation delay in milliseconds.
> > 
> > Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
> > ---
> >  drivers/pci/pci.c | 21 ++++++++++++++++++---
> >  1 file changed, 18 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index e7982af9a5d8..bfd92e018925 100644
> 
> Hi Mika,
> 
> So, I have a thinkpad P73 with thunderbolt, and while I don't boot
> often, my last boots have been unreliable at best (was only able to boot
> 5.7 once, and 5.8 did not succeed either).
> 
> 5.6 was working for a while, but couldn't boot it either this morning,
> so I had to go back to 5.5. This does not mean 5.5 does not have the
> problem, just that it booted this morning, while 5.6 didn't when I
> tried.
> Once the kernel is booted, the problem does not seem to occur much, or
> at all.
> 
> Basically, I'm getting the same thing than this person with a P53 (which
> is a mostly identical lenovo thinkpad, to mine)
> kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
> kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
> kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
> kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
> kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
> https://bbs.archlinux.org/viewtopic.php?id=250658
> 
> The kernel boots eventually, but it takes minutes, and everything is so
> super slow, that I just can't reasonably use the machine.
> 
> This shows similar issues with 5.3, 5.4.
> https://forum.proxmox.com/threads/pme-spurious-native-interrupt-kernel-meldungen.62850/
> 
> Another report here with 5.6:
> https://bugzilla.redhat.com/show_bug.cgi?id=1831899
> 
> My current kernel is running your patch above, and I haven't done a lot
> of research yet to confirm whether going back to a kernel before it was
> merged, fixes the problem. Unfortunately the problem is not consistent,
> so it makes things harder to test/debug, especially on my main laptop
> that I do all my work on :)
> 
> I noticed this older patch of yours:
> http://patchwork.ozlabs.org/project/linux-pci/patch/0113014581dbe2d1f938813f1783905bd81b79db.1560079442.git.lukas@wunner.de/
> This patch is not in my kernel, is it worth adding?
> 
> Can I get you more info to help debug this?
> 
> If that helps:
> sauron:/usr/src/linux-5.7.11-amd64-preempt-sysrq-20190816/drivers/pci# lspci
> 00:00.0 Host bridge: Intel Corporation Device 3e20 (rev 0d)
> 00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 0d)
> 00:02.0 VGA compatible controller: Intel Corporation UHD Graphics 630 (Mobile) (rev 02)
> 00:04.0 Signal processing controller: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem (rev 0d)
> 00:08.0 System peripheral: Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th Gen Core Processor Gaussian Mixture Model
> 00:12.0 Signal processing controller: Intel Corporation Cannon Lake PCH Thermal Controller (rev 10)
> 00:14.0 USB controller: Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller (rev 10)
> 00:14.2 RAM memory: Intel Corporation Cannon Lake PCH Shared SRAM (rev 10)
> 00:15.0 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH Serial IO I2C Controller #0 (rev 10)
> 00:15.1 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH Serial IO I2C Controller #1 (rev 10)
> 00:16.0 Communication controller: Intel Corporation Cannon Lake PCH HECI Controller (rev 10)
> 00:17.0 SATA controller: Intel Corporation Cannon Lake Mobile PCH SATA AHCI Controller (rev 10)
> 00:1b.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #17 (rev f0)
> 00:1c.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #1 (rev f0)
> 00:1c.5 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #6 (rev f0)
> 00:1c.7 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #8 (rev f0)
> 00:1e.0 Communication controller: Intel Corporation Cannon Lake PCH Serial IO UART Host Controller (rev 10)
> 00:1f.0 ISA bridge: Intel Corporation Cannon Lake LPC Controller (rev 10)
> 00:1f.3 Audio device: Intel Corporation Cannon Lake PCH cAVS (rev 10)
> 00:1f.4 SMBus: Intel Corporation Cannon Lake PCH SMBus Controller (rev 10)
> 00:1f.5 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH SPI Controller (rev 10)
> 00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (7) I219-LM (rev 10)
> 01:00.0 VGA compatible controller: NVIDIA Corporation TU104GLM [Quadro RTX 4000 Mobile / Max-Q] (rev a1)
> 01:00.1 Audio device: NVIDIA Corporation TU104 HD Audio Controller (rev a1)
> 01:00.2 USB controller: NVIDIA Corporation TU104 USB 3.1 Host Controller (rev a1)
> 01:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU104 USB Type-C UCSI Controller (rev a1)
> 02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
> 04:00.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
> 05:00.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
> 05:01.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
> 05:02.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
> 05:04.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
> 06:00.0 System peripheral: Intel Corporation JHL7540 Thunderbolt 3 NHI [Titan Ridge 4C 2018] (rev 06)
> 2c:00.0 USB controller: Intel Corporation JHL7540 Thunderbolt 3 USB Controller [Titan Ridge 4C 2018] (rev 06)
> 52:00.0 Network controller: Intel Corporation Wi-Fi 6 AX200 (rev 1a)
> 54:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader (rev 01)
> 
> 
> sauron:/usr/src/linux-5.7.11-amd64-preempt-sysrq-20190816/drivers/pci# lsusb -t
> /:  Bus 06.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/2p, 10000M
> /:  Bus 05.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/2p, 480M
> /:  Bus 04.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/4p, 10000M
> /:  Bus 03.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/2p, 480M
> /:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/10p, 10000M
> /:  Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/16p, 480M
>     |__ Port 2: Dev 2, If 0, Class=Human Interface Device, Driver=usbhid, 12M
>     |__ Port 8: Dev 3, If 3, Class=Video, Driver=uvcvideo, 480M
>     |__ Port 8: Dev 3, If 1, Class=Video, Driver=uvcvideo, 480M
>     |__ Port 8: Dev 3, If 2, Class=Video, Driver=uvcvideo, 480M
>     |__ Port 8: Dev 3, If 0, Class=Video, Driver=uvcvideo, 480M
>     |__ Port 9: Dev 4, If 0, Class=Vendor Specific Class, Driver=, 12M
>     |__ Port 14: Dev 6, If 0, Class=Wireless, Driver=btusb, 12M
>     |__ Port 14: Dev 6, If 1, Class=Wireless, Driver=btusb, 12M
> 
> Thanks,
> Marc
> -- 
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
>  
> Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08
Matthias Andree Sept. 6, 2020, 6:26 p.m. UTC | #5
Please everyone stop Cc:ing me on this discussion, I have no interest
and nothing to contribute here.

I have set an invalid Reply-To: just in case...
Karol Herbst Sept. 7, 2020, 7:14 p.m. UTC | #6
On Sun, Sep 6, 2020 at 8:52 PM Marc MERLIN <marc_nouveau@merlins.org> wrote:
>
> Ok, I have an update to this problem. I added the nouveau list because
> I can't quite tell if the issue is:
> - the PCIe changes that went in 5.6 I think (or 5.5?), referenced below
>
> - a new issue with thunderbold on thinkpad P73, that seems to be
>   triggered if I have a USB-C yubikey in the port. With 5.7, my issues
>   went away if I removed the USB key during boot, showing an interaction
>   between nouveau and thunderbolt
>
> - changes in the nouveau driver. Mika told me the PCIe regression
>   "pcieport 0000:00:01.0: PME: Spurious native interrupt!" is supposed
>   to be fixed in 5.8, but I still get a 4mn hang or so during boot and
>   with 5.8, removing the USB key, didn't help make the boot faster
>

that's the root port the GPU is attached to, no? I saw that message on
the Thinkpad P1G2 when runtime resuming the Nvidia GPU, but it does
seem to come from the root port.

> I don't otherwise use the nvidia chip I so wish I didn't have, I only
> use intel graphics on that laptop, but I must apparently use the nouveau
> driver to manage the nouveau chip so that it's turned off and not
> burning 60W doing nothing.
>

Well, you'd also need it when attaching external displays.

> lspci is in the quoted message below, I won't copy it here again, but
> here's the nvidia bit:
> 01:00.0 VGA compatible controller: NVIDIA Corporation TU104GLM [Quadro RTX 4000 Mobile / Max-Q] (rev a1)
> 01:00.1 Audio device: NVIDIA Corporation TU104 HD Audio Controller (rev a1)
> 01:00.2 USB controller: NVIDIA Corporation TU104 USB 3.1 Host Controller (rev a1)
> 01:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU104 USB Type-C UCSI Controller (rev a1)
>
> Here are 5 boots, 4 on 5.8.5:
>
> dmesg.1_hang_but_no_warning.txt https://pastebin.com/Y5NaH08n
> Boot hung for quite a while, but no clear output
>
> dmesg.2_pme_spurious.txt https://pastebin.com/dX19aCpj
> [    8.185808] nvidia-gpu 0000:01:00.3: runtime IRQ mapping not provided by arch
> [    8.185989] nvidia-gpu 0000:01:00.3: enabling device (0000 -> 0002)
> [    8.188986] nvidia-gpu 0000:01:00.3: enabling bus mastering
> [   11.936507] nvidia-gpu 0000:01:00.3: PME# enabled
> [   11.975985] nvidia-gpu 0000:01:00.3: PME# disabled
> [   11.976011] pcieport 0000:00:01.0: PME: Spurious native interrupt!
>
> dmesg.3_usb_key_yanked.txt https://pastebin.com/m7QLnCZt
> I yanked the USB key during boot, that seemed to help unlock things with
> 5.7, but did not with 5.8. It's hung on a loop of:
> [   11.262854] nvidia-gpu 0000:01:00.3: saving config space at offset 0x0 (reading 0x1ad910de)
> [   11.262863] nvidia-gpu 0000:01:00.3: saving config space at offset 0x4 (reading 0x100406)
> [   11.262869] nvidia-gpu 0000:01:00.3: saving config space at offset 0x8 (reading 0xc8000a1)
> [   11.262874] nvidia-gpu 0000:01:00.3: saving config space at offset 0xc (reading 0x800000)
> [   11.262880] nvidia-gpu 0000:01:00.3: saving config space at offset 0x10 (reading 0xce054000)
> [   11.262885] nvidia-gpu 0000:01:00.3: saving config space at offset 0x14 (reading 0x0)
> [   11.262890] nvidia-gpu 0000:01:00.3: saving config space at offset 0x18 (reading 0x0)
> [   11.262895] nvidia-gpu 0000:01:00.3: saving config space at offset 0x1c (reading 0x0)
> [   11.262900] nvidia-gpu 0000:01:00.3: saving config space at offset 0x20 (reading 0x0)
> [   11.262906] nvidia-gpu 0000:01:00.3: saving config space at offset 0x24 (reading 0x0)
> [   11.262911] nvidia-gpu 0000:01:00.3: saving config space at offset 0x28 (reading 0x0)
> [   11.262916] nvidia-gpu 0000:01:00.3: saving config space at offset 0x2c (reading 0x229b17aa)
> [   11.262921] nvidia-gpu 0000:01:00.3: saving config space at offset 0x30 (reading 0x0)
> [   11.262926] nvidia-gpu 0000:01:00.3: saving config space at offset 0x34 (reading 0x68)
> [   11.262931] nvidia-gpu 0000:01:00.3: saving config space at offset 0x38 (reading 0x0)
> [   11.262937] nvidia-gpu 0000:01:00.3: saving config space at offset 0x3c (reading 0x4ff)
> [   11.262985] nvidia-gpu 0000:01:00.3: PME# enabled
> [   11.303060] nvidia-gpu 0000:01:00.3: PME# disabled
>

mhh, interesting. I heard some random comments that the Nvidia
USB-C/UCSI driver is a bit broken and can cause various issues. Mind
blacklisting i2c-nvidia-gpu and typec_nvidia (and verify they don't
get loaded) and see if that helps?

> dmesg.4_5.5_boot_fine.txt https://pastebin.com/WXgQTUYP
> reference boot with 4.5, it works fine, no issues
>
> dmesg.5_no_key_still_hang.txt https://pastebin.com/kcT8Ras0
> unfortunately, booting without the USB-C key in thunderbolt, did not
> allow this boot to be faster, it looks different though:
> [    6.723454] pcieport 0000:00:01.0: runtime IRQ mapping not provided by arch
> [    6.723598] pcieport 0000:00:01.0: PME: Signaling with IRQ 122
> [    6.724011] pcieport 0000:00:01.0: saving config space at offset 0x0 (reading 0x19018086)
> [    6.724016] pcieport 0000:00:01.0: saving config space at offset 0x4 (reading 0x100407)
> [    6.724021] pcieport 0000:00:01.0: saving config space at offset 0x8 (reading 0x604000d)
> [    6.724025] pcieport 0000:00:01.0: saving config space at offset 0xc (reading 0x810000)
> [    6.724029] pcieport 0000:00:01.0: saving config space at offset 0x10 (reading 0x0)
> [    6.724033] pcieport 0000:00:01.0: saving config space at offset 0x14 (reading 0x0)
> [    6.724037] pcieport 0000:00:01.0: saving config space at offset 0x18 (reading 0x10100)
> [    6.724041] pcieport 0000:00:01.0: saving config space at offset 0x1c (reading 0x20002020)
> [    6.724046] pcieport 0000:00:01.0: saving config space at offset 0x20 (reading 0xce00cd00)
> [    6.724050] pcieport 0000:00:01.0: saving config space at offset 0x24 (reading 0xb1f1a001)
> [    6.724054] pcieport 0000:00:01.0: saving config space at offset 0x28 (reading 0x0)
> [    6.724058] pcieport 0000:00:01.0: saving config space at offset 0x2c (reading 0x0)
> [    6.724062] pcieport 0000:00:01.0: saving config space at offset 0x30 (reading 0x0)
> [    6.724066] pcieport 0000:00:01.0: saving config space at offset 0x34 (reading 0x88)
> [    6.724070] pcieport 0000:00:01.0: saving config space at offset 0x38 (reading 0x0)
> [    6.724074] pcieport 0000:00:01.0: saving config space at offset 0x3c (reading 0x201ff)
> [    6.724129] pcieport 0000:00:1b.0: runtime IRQ mapping not provided by arch
> [    6.724650] pcieport 0000:00:1b.0: PME: Signaling with IRQ 123
> [    6.725021] pcieport 0000:00:1b.0: saving config space at offset 0x0 (reading 0xa3408086)
> [    6.725026] pcieport 0000:00:1b.0: saving config space at offset 0x4 (reading 0x100407)
> [    6.725031] pcieport 0000:00:1b.0: saving config space at offset 0x8 (reading 0x60400f0)
> [    6.725035] pcieport 0000:00:1b.0: saving config space at offset 0xc (reading 0x810000)
> [    6.725040] pcieport 0000:00:1b.0: saving config space at offset 0x10 (reading 0x0)
> [    6.725044] pcieport 0000:00:1b.0: saving config space at offset 0x14 (reading 0x0)
> [    6.725049] pcieport 0000:00:1b.0: saving config space at offset 0x18 (reading 0x20200)
> [    6.725053] pcieport 0000:00:1b.0: saving config space at offset 0x1c (reading 0x200000f0)
> [    6.725058] pcieport 0000:00:1b.0: saving config space at offset 0x20 (reading 0xce30ce30)
> [    6.725062] pcieport 0000:00:1b.0: saving config space at offset 0x24 (reading 0x1fff1)
> [    6.725067] pcieport 0000:00:1b.0: saving config space at offset 0x28 (reading 0x0)
> [    6.725071] pcieport 0000:00:1b.0: saving config space at offset 0x2c (reading 0x0)
> [    6.725075] pcieport 0000:00:1b.0: saving config space at offset 0x30 (reading 0x0)
> [    6.725080] pcieport 0000:00:1b.0: saving config space at offset 0x34 (reading 0x40)
> [    6.725084] pcieport 0000:00:1b.0: saving config space at offset 0x38 (reading 0x0)
> [    6.725089] pcieport 0000:00:1b.0: saving config space at offset 0x3c (reading 0x201ff)
> [    6.725154] pcieport 0000:00:1c.0: runtime IRQ mapping not provided by arch
> [    6.725284] pcieport 0000:00:1c.0: PME: Signaling with IRQ 124
> [    6.725580] pcieport 0000:00:1c.0: pciehp: Slot #0 AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug+ Surprise+ Interlock- NoCompl+ IbPresDis- LLActRep+
> [    6.726086] pci_bus 0000:04: dev 00, created physical slot 0
>
> Any idea what's going on?
>
> Thanks,
> Marc
>
> On Sat, Aug 08, 2020 at 01:22:02PM -0700, Marc MERLIN wrote:
> > On Fri, Oct 04, 2019 at 03:39:46PM +0300, Mika Westerberg wrote:
> > > This is otherwise similar to pcie_wait_for_link() but allows passing
> > > custom activation delay in milliseconds.
> > >
> > > Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
> > > ---
> > >  drivers/pci/pci.c | 21 ++++++++++++++++++---
> > >  1 file changed, 18 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > > index e7982af9a5d8..bfd92e018925 100644
> >
> > Hi Mika,
> >
> > So, I have a thinkpad P73 with thunderbolt, and while I don't boot
> > often, my last boots have been unreliable at best (was only able to boot
> > 5.7 once, and 5.8 did not succeed either).
> >
> > 5.6 was working for a while, but couldn't boot it either this morning,
> > so I had to go back to 5.5. This does not mean 5.5 does not have the
> > problem, just that it booted this morning, while 5.6 didn't when I
> > tried.
> > Once the kernel is booted, the problem does not seem to occur much, or
> > at all.
> >
> > Basically, I'm getting the same thing than this person with a P53 (which
> > is a mostly identical lenovo thinkpad, to mine)
> > kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
> > kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
> > kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
> > kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
> > kernel: pcieport 0000:00:01.0: PME: Spurious native interrupt!
> > https://bbs.archlinux.org/viewtopic.php?id=250658
> >
> > The kernel boots eventually, but it takes minutes, and everything is so
> > super slow, that I just can't reasonably use the machine.
> >
> > This shows similar issues with 5.3, 5.4.
> > https://forum.proxmox.com/threads/pme-spurious-native-interrupt-kernel-meldungen.62850/
> >
> > Another report here with 5.6:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1831899
> >
> > My current kernel is running your patch above, and I haven't done a lot
> > of research yet to confirm whether going back to a kernel before it was
> > merged, fixes the problem. Unfortunately the problem is not consistent,
> > so it makes things harder to test/debug, especially on my main laptop
> > that I do all my work on :)
> >
> > I noticed this older patch of yours:
> > http://patchwork.ozlabs.org/project/linux-pci/patch/0113014581dbe2d1f938813f1783905bd81b79db.1560079442.git.lukas@wunner.de/
> > This patch is not in my kernel, is it worth adding?
> >
> > Can I get you more info to help debug this?
> >
> > If that helps:
> > sauron:/usr/src/linux-5.7.11-amd64-preempt-sysrq-20190816/drivers/pci# lspci
> > 00:00.0 Host bridge: Intel Corporation Device 3e20 (rev 0d)
> > 00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 0d)
> > 00:02.0 VGA compatible controller: Intel Corporation UHD Graphics 630 (Mobile) (rev 02)
> > 00:04.0 Signal processing controller: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem (rev 0d)
> > 00:08.0 System peripheral: Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th Gen Core Processor Gaussian Mixture Model
> > 00:12.0 Signal processing controller: Intel Corporation Cannon Lake PCH Thermal Controller (rev 10)
> > 00:14.0 USB controller: Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller (rev 10)
> > 00:14.2 RAM memory: Intel Corporation Cannon Lake PCH Shared SRAM (rev 10)
> > 00:15.0 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH Serial IO I2C Controller #0 (rev 10)
> > 00:15.1 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH Serial IO I2C Controller #1 (rev 10)
> > 00:16.0 Communication controller: Intel Corporation Cannon Lake PCH HECI Controller (rev 10)
> > 00:17.0 SATA controller: Intel Corporation Cannon Lake Mobile PCH SATA AHCI Controller (rev 10)
> > 00:1b.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #17 (rev f0)
> > 00:1c.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #1 (rev f0)
> > 00:1c.5 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #6 (rev f0)
> > 00:1c.7 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #8 (rev f0)
> > 00:1e.0 Communication controller: Intel Corporation Cannon Lake PCH Serial IO UART Host Controller (rev 10)
> > 00:1f.0 ISA bridge: Intel Corporation Cannon Lake LPC Controller (rev 10)
> > 00:1f.3 Audio device: Intel Corporation Cannon Lake PCH cAVS (rev 10)
> > 00:1f.4 SMBus: Intel Corporation Cannon Lake PCH SMBus Controller (rev 10)
> > 00:1f.5 Serial bus controller [0c80]: Intel Corporation Cannon Lake PCH SPI Controller (rev 10)
> > 00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (7) I219-LM (rev 10)
> > 01:00.0 VGA compatible controller: NVIDIA Corporation TU104GLM [Quadro RTX 4000 Mobile / Max-Q] (rev a1)
> > 01:00.1 Audio device: NVIDIA Corporation TU104 HD Audio Controller (rev a1)
> > 01:00.2 USB controller: NVIDIA Corporation TU104 USB 3.1 Host Controller (rev a1)
> > 01:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU104 USB Type-C UCSI Controller (rev a1)
> > 02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
> > 04:00.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
> > 05:00.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
> > 05:01.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
> > 05:02.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
> > 05:04.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06)
> > 06:00.0 System peripheral: Intel Corporation JHL7540 Thunderbolt 3 NHI [Titan Ridge 4C 2018] (rev 06)
> > 2c:00.0 USB controller: Intel Corporation JHL7540 Thunderbolt 3 USB Controller [Titan Ridge 4C 2018] (rev 06)
> > 52:00.0 Network controller: Intel Corporation Wi-Fi 6 AX200 (rev 1a)
> > 54:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader (rev 01)
> >
> >
> > sauron:/usr/src/linux-5.7.11-amd64-preempt-sysrq-20190816/drivers/pci# lsusb -t
> > /:  Bus 06.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/2p, 10000M
> > /:  Bus 05.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/2p, 480M
> > /:  Bus 04.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/4p, 10000M
> > /:  Bus 03.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/2p, 480M
> > /:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/10p, 10000M
> > /:  Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/16p, 480M
> >     |__ Port 2: Dev 2, If 0, Class=Human Interface Device, Driver=usbhid, 12M
> >     |__ Port 8: Dev 3, If 3, Class=Video, Driver=uvcvideo, 480M
> >     |__ Port 8: Dev 3, If 1, Class=Video, Driver=uvcvideo, 480M
> >     |__ Port 8: Dev 3, If 2, Class=Video, Driver=uvcvideo, 480M
> >     |__ Port 8: Dev 3, If 0, Class=Video, Driver=uvcvideo, 480M
> >     |__ Port 9: Dev 4, If 0, Class=Vendor Specific Class, Driver=, 12M
> >     |__ Port 14: Dev 6, If 0, Class=Wireless, Driver=btusb, 12M
> >     |__ Port 14: Dev 6, If 1, Class=Wireless, Driver=btusb, 12M
> >
> > Thanks,
> > Marc
> > --
> > "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
> >
> > Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08
>
> --
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
>
> Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08
> _______________________________________________
> Nouveau mailing list
> Nouveau@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/nouveau
>
Marc MERLIN Sept. 7, 2020, 8:58 p.m. UTC | #7
On Mon, Sep 07, 2020 at 09:14:03PM +0200, Karol Herbst wrote:
> > - changes in the nouveau driver. Mika told me the PCIe regression
> >   "pcieport 0000:00:01.0: PME: Spurious native interrupt!" is supposed
> >   to be fixed in 5.8, but I still get a 4mn hang or so during boot and
> >   with 5.8, removing the USB key, didn't help make the boot faster
> 
> that's the root port the GPU is attached to, no? I saw that message on
> the Thinkpad P1G2 when runtime resuming the Nvidia GPU, but it does
> seem to come from the root port.

Hi Karol, thanks for your answer.
 
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 0d)
01:00.0 VGA compatible controller: NVIDIA Corporation TU104GLM [Quadro RTX 4000 Mobile / Max-Q] (rev a1)

> Well, you'd also need it when attaching external displays.
 
Indeed. I just don't need that on this laptop, but familiar with the not
so seemless procedure to turn on both GPUs, and mirror the intel one into
the nvidia one for external output. 

> > [   11.262985] nvidia-gpu 0000:01:00.3: PME# enabled
> > [   11.303060] nvidia-gpu 0000:01:00.3: PME# disabled
> 
> mhh, interesting. I heard some random comments that the Nvidia
> USB-C/UCSI driver is a bit broken and can cause various issues. Mind
> blacklisting i2c-nvidia-gpu and typec_nvidia (and verify they don't
> get loaded) and see if that helps?

Right, this one:
01:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU104 USB Type-C UCSI Controller (rev a1)
Sure, I'll blacklist it. Ok, just did that, removed from initrd,
rebooted, and it was no better.

From initrd (before root gets mounted), I have this:
nouveau              1961984  0
mxm_wmi                16384  1 nouveau
hwmon                  32768  1 nouveau
ttm                   102400  1 nouveau
wmi                    32768  2 nouveau,mxm_wmi

I still got a 2mn hang. and a nouveau probe error
[  189.124530] nouveau: probe of 0000:01:00.0 failed with error -12


Here's what it looks like:
[    9.693230] hid: raw HID events driver (C) Jiri Kosina
[    9.694988] usbcore: registered new interface driver usbhid
[    9.694989] usbhid: USB HID core driver
[    9.696700] hid-generic 0003:1050:0200.0001: hiddev0,hidraw0: USB HID v1.00 Device [Yubico Yubico Gnubby (gnubby1)] on usb-0000:00:14.0-2/input0
[    9.784456] Console: switching to colour frame buffer device 240x67
[    9.816297] i915 0000:00:02.0: fb0: i915drmfb frame buffer device
[   25.087400] thunderbolt 0000:06:00.0: saving config space at offset 0x0 (reading 0x15eb8086)
[   25.087414] thunderbolt 0000:06:00.0: saving config space at offset 0x4 (reading 0x100406)
[   25.087419] thunderbolt 0000:06:00.0: saving config space at offset 0x8 (reading 0x8800006)
[   25.087424] thunderbolt 0000:06:00.0: saving config space at offset 0xc (reading 0x20)
[   25.087430] thunderbolt 0000:06:00.0: saving config space at offset 0x10 (reading 0xcc100000)
[   25.087435] thunderbolt 0000:06:00.0: saving config space at offset 0x14 (reading 0xcc140000)
[   25.087440] thunderbolt 0000:06:00.0: saving config space at offset 0x18 (reading 0x0)
[   25.087445] thunderbolt 0000:06:00.0: saving config space at offset 0x1c (reading 0x0)
[   25.087450] thunderbolt 0000:06:00.0: saving config space at offset 0x20 (reading 0x0)
[   25.087455] thunderbolt 0000:06:00.0: saving config space at offset 0x24 (reading 0x0)
[   25.087460] thunderbolt 0000:06:00.0: saving config space at offset 0x28 (reading 0x0)
[   25.087466] thunderbolt 0000:06:00.0: saving config space at offset 0x2c (reading 0x229b17aa)
[   25.087471] thunderbolt 0000:06:00.0: saving config space at offset 0x30 (reading 0x0)
[   25.087476] thunderbolt 0000:06:00.0: saving config space at offset 0x34 (reading 0x80)
[   25.087481] thunderbolt 0000:06:00.0: saving config space at offset 0x38 (reading 0x0)
[   25.087486] thunderbolt 0000:06:00.0: saving config space at offset 0x3c (reading 0x1ff)
[   25.087571] thunderbolt 0000:06:00.0: PME# enabled
[   25.105353] pcieport 0000:05:00.0: saving config space at offset 0x0 (reading 0x15ea8086)
[   25.105364] pcieport 0000:05:00.0: saving config space at offset 0x4 (reading 0x100407)
[   25.105370] pcieport 0000:05:00.0: saving config space at offset 0x8 (reading 0x6040006)
[   25.105375] pcieport 0000:05:00.0: saving config space at offset 0xc (reading 0x10020)
[   25.105380] pcieport 0000:05:00.0: saving config space at offset 0x10 (reading 0x0)
[   25.105384] pcieport 0000:05:00.0: saving config space at offset 0x14 (reading 0x0)
[   25.105389] pcieport 0000:05:00.0: saving config space at offset 0x18 (reading 0x60605)
[   25.105394] pcieport 0000:05:00.0: saving config space at offset 0x1c (reading 0x1f1)
[   25.105399] pcieport 0000:05:00.0: saving config space at offset 0x20 (reading 0xcc10cc10)
[   25.105404] pcieport 0000:05:00.0: saving config space at offset 0x24 (reading 0x1fff1)
[   25.105409] pcieport 0000:05:00.0: saving config space at offset 0x28 (reading 0x0)
[   25.105413] pcieport 0000:05:00.0: saving config space at offset 0x2c (reading 0x0)
[   25.105418] pcieport 0000:05:00.0: saving config space at offset 0x30 (reading 0x0)
[   25.105423] pcieport 0000:05:00.0: saving config space at offset 0x34 (reading 0x80)
[   25.105428] pcieport 0000:05:00.0: saving config space at offset 0x38 (reading 0x0)
[   25.105432] pcieport 0000:05:00.0: saving config space at offset 0x3c (reading 0x201ff)
[   25.105517] pcieport 0000:05:00.0: PME# enabled
[   25.125367] pcieport 0000:04:00.0: saving config space at offset 0x0 (reading 0x15ea8086)
[   25.125378] pcieport 0000:04:00.0: saving config space at offset 0x4 (reading 0x100007)
[   25.125383] pcieport 0000:04:00.0: saving config space at offset 0x8 (reading 0x6040006)
[   25.125388] pcieport 0000:04:00.0: saving config space at offset 0xc (reading 0x10020)
[   25.125393] pcieport 0000:04:00.0: saving config space at offset 0x10 (reading 0x0)
[   25.125398] pcieport 0000:04:00.0: saving config space at offset 0x14 (reading 0x0)
[   25.125403] pcieport 0000:04:00.0: saving config space at offset 0x18 (reading 0x510504)
[   25.125407] pcieport 0000:04:00.0: saving config space at offset 0x1c (reading 0x5141)
[   25.125412] pcieport 0000:04:00.0: saving config space at offset 0x20 (reading 0xcc10b400)
[   25.125417] pcieport 0000:04:00.0: saving config space at offset 0x24 (reading 0x3ff10001)
[   25.125422] pcieport 0000:04:00.0: saving config space at offset 0x28 (reading 0x60)
[   25.125427] pcieport 0000:04:00.0: saving config space at offset 0x2c (reading 0x60)
[   25.125431] pcieport 0000:04:00.0: saving config space at offset 0x30 (reading 0x0)
[   25.125436] pcieport 0000:04:00.0: saving config space at offset 0x34 (reading 0x80)
[   25.125441] pcieport 0000:04:00.0: saving config space at offset 0x38 (reading 0x0)
[   25.125446] pcieport 0000:04:00.0: saving config space at offset 0x3c (reading 0x201ff)
[   25.125528] pcieport 0000:04:00.0: PME# enabled
[   25.145423] pcieport 0000:00:1c.0: saving config space at offset 0x0 (reading 0xa3388086)
[   25.145437] pcieport 0000:00:1c.0: saving config space at offset 0x4 (reading 0x100407)
[   25.145445] pcieport 0000:00:1c.0: saving config space at offset 0x8 (reading 0x60400f0)
[   25.145453] pcieport 0000:00:1c.0: saving config space at offset 0xc (reading 0x810000)
[   25.145460] pcieport 0000:00:1c.0: saving config space at offset 0x10 (reading 0x0)
[   25.145464] pcieport 0000:00:1c.0: saving config space at offset 0x14 (reading 0x0)
[   25.145469] pcieport 0000:00:1c.0: saving config space at offset 0x18 (reading 0x510400)
[   25.145476] pcieport 0000:00:1c.0: saving config space at offset 0x1c (reading 0x20006040)
[   25.145484] pcieport 0000:00:1c.0: saving config space at offset 0x20 (reading 0xcc10b400)
[   25.145488] pcieport 0000:00:1c.0: saving config space at offset 0x24 (reading 0x3ff10001)
[   25.145493] pcieport 0000:00:1c.0: saving config space at offset 0x28 (reading 0x60)
[   25.145497] pcieport 0000:00:1c.0: saving config space at offset 0x2c (reading 0x60)
[   25.145502] pcieport 0000:00:1c.0: saving config space at offset 0x30 (reading 0x0)
[   25.145506] pcieport 0000:00:1c.0: saving config space at offset 0x34 (reading 0x40)
[   25.145510] pcieport 0000:00:1c.0: saving config space at offset 0x38 (reading 0x0)
[   25.145515] pcieport 0000:00:1c.0: saving config space at offset 0x3c (reading 0x201ff)
[   25.145604] pcieport 0000:00:1c.0: PME# enabled
[   26.265697] pcieport 0000:00:1c.0: power state changed by ACPI to D3cold
[   45.468365] random: crng init done
[  105.032727] usb 1-2: USB disconnect, device number 2  <= I removed a usb key, didn't help
[  128.495144] async_tx: api initialized (async)
[  128.514820] device-mapper: uevent: version 1.0.3
[  128.518186] device-mapper: ioctl: 4.42.0-ioctl (2020-02-27) initialised: dm-devel@redhat.com
[  144.869445] e1000e 0000:00:1f.6 eth0: NIC Link is Down
[  172.851384] BTRFS: device label btrfs_pool4 devid 1 transid 78270 /dev/sdb4 scanned by btrfs (1293)
[  172.851648] BTRFS: device label btrfs_pool3 devid 1 transid 27410 /dev/sda5 scanned by btrfs (1293)
[  172.852030] BTRFS: device fsid de9694f8-9c0d-4e9d-bd12-57adc4381cd7 devid 1 transid 41 /dev/sda3 scanned by btrfs (1293)
[  172.852224] BTRFS: device fsid 23e1398d-e462-41aa-b85e-f574906ddc03 devid 1 transid 585 /dev/nvme0n1p4 scanned by btrfs (1293)
[  189.124291] nouveau 0000:01:00.0: disp ctor failed, -12
[  189.124530] nouveau: probe of 0000:01:00.0 failed with error -12

The next boot looks similar:
[   25.161759] pcieport 0000:00:1c.0: PME# enabled
[   26.297810] pcieport 0000:00:1c.0: power state changed by ACPI to D3cold
[  128.427270] async_tx: api initialized (async)
[  128.446525] device-mapper: uevent: version 1.0.3
[  128.446691] device-mapper: ioctl: 4.42.0-ioctl (2020-02-27) initialised: dm-devel@redhat.com
[  128.458120] random: cryptsetup: uninitialized urandom read (4 bytes read)
[  138.507373] random: cryptsetup: uninitialized urandom read (4 bytes read)
[  144.793573] e1000e 0000:00:1f.6 eth0: NIC Link is Down
[  159.627780] random: crng init done
[  171.814064] process '/usr/bin/fstype' started with executable stack
[  181.949989] BTRFS: device label btrfs_boot devid 1 transid 394687 /dev/mapper/cryptroot scanned by btrfs (1063)
[  181.953437] BTRFS: device label btrfs_pool4 devid 1 transid 78270 /dev/sdb4 scanned by btrfs (1063)
[  181.956989] BTRFS: device label btrfs_pool3 devid 1 transid 27410 /dev/sda5 scanned by btrfs (1063)
[  181.960473] BTRFS: device fsid de9694f8-9c0d-4e9d-bd12-57adc4381cd7 devid 1 transid 41 /dev/sda3 scanned by btrfs (1063)
[  181.964097] BTRFS: device fsid 23e1398d-e462-41aa-b85e-f574906ddc03 devid 1 transid 585 /dev/nvme0n1p4 scanned by btrfs (1063)
[  188.733645] nouveau 0000:01:00.0: disp ctor failed, -12
[  188.740653] nouveau: probe of 0000:01:00.0 failed with error -12
[  188.901070] PM: Image not found (code -22)

Does that help?

Thanks,
Marc
Karol Herbst Sept. 7, 2020, 11:51 p.m. UTC | #8
On Mon, Sep 7, 2020 at 10:58 PM Marc MERLIN <marc_nouveau@merlins.org> wrote:
>
> On Mon, Sep 07, 2020 at 09:14:03PM +0200, Karol Herbst wrote:
> > > - changes in the nouveau driver. Mika told me the PCIe regression
> > >   "pcieport 0000:00:01.0: PME: Spurious native interrupt!" is supposed
> > >   to be fixed in 5.8, but I still get a 4mn hang or so during boot and
> > >   with 5.8, removing the USB key, didn't help make the boot faster
> >
> > that's the root port the GPU is attached to, no? I saw that message on
> > the Thinkpad P1G2 when runtime resuming the Nvidia GPU, but it does
> > seem to come from the root port.
>
> Hi Karol, thanks for your answer.
>
> 00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 0d)
> 01:00.0 VGA compatible controller: NVIDIA Corporation TU104GLM [Quadro RTX 4000 Mobile / Max-Q] (rev a1)
>
> > Well, you'd also need it when attaching external displays.
>
> Indeed. I just don't need that on this laptop, but familiar with the not
> so seemless procedure to turn on both GPUs, and mirror the intel one into
> the nvidia one for external output.
>
> > > [   11.262985] nvidia-gpu 0000:01:00.3: PME# enabled
> > > [   11.303060] nvidia-gpu 0000:01:00.3: PME# disabled
> >
> > mhh, interesting. I heard some random comments that the Nvidia
> > USB-C/UCSI driver is a bit broken and can cause various issues. Mind
> > blacklisting i2c-nvidia-gpu and typec_nvidia (and verify they don't
> > get loaded) and see if that helps?
>
> Right, this one:
> 01:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU104 USB Type-C UCSI Controller (rev a1)
> Sure, I'll blacklist it. Ok, just did that, removed from initrd,
> rebooted, and it was no better.
>
> From initrd (before root gets mounted), I have this:
> nouveau              1961984  0
> mxm_wmi                16384  1 nouveau
> hwmon                  32768  1 nouveau
> ttm                   102400  1 nouveau
> wmi                    32768  2 nouveau,mxm_wmi
>
> I still got a 2mn hang. and a nouveau probe error
> [  189.124530] nouveau: probe of 0000:01:00.0 failed with error -12
>
>
> Here's what it looks like:
> [    9.693230] hid: raw HID events driver (C) Jiri Kosina
> [    9.694988] usbcore: registered new interface driver usbhid
> [    9.694989] usbhid: USB HID core driver
> [    9.696700] hid-generic 0003:1050:0200.0001: hiddev0,hidraw0: USB HID v1.00 Device [Yubico Yubico Gnubby (gnubby1)] on usb-0000:00:14.0-2/input0
> [    9.784456] Console: switching to colour frame buffer device 240x67
> [    9.816297] i915 0000:00:02.0: fb0: i915drmfb frame buffer device
> [   25.087400] thunderbolt 0000:06:00.0: saving config space at offset 0x0 (reading 0x15eb8086)
> [   25.087414] thunderbolt 0000:06:00.0: saving config space at offset 0x4 (reading 0x100406)
> [   25.087419] thunderbolt 0000:06:00.0: saving config space at offset 0x8 (reading 0x8800006)
> [   25.087424] thunderbolt 0000:06:00.0: saving config space at offset 0xc (reading 0x20)
> [   25.087430] thunderbolt 0000:06:00.0: saving config space at offset 0x10 (reading 0xcc100000)
> [   25.087435] thunderbolt 0000:06:00.0: saving config space at offset 0x14 (reading 0xcc140000)
> [   25.087440] thunderbolt 0000:06:00.0: saving config space at offset 0x18 (reading 0x0)
> [   25.087445] thunderbolt 0000:06:00.0: saving config space at offset 0x1c (reading 0x0)
> [   25.087450] thunderbolt 0000:06:00.0: saving config space at offset 0x20 (reading 0x0)
> [   25.087455] thunderbolt 0000:06:00.0: saving config space at offset 0x24 (reading 0x0)
> [   25.087460] thunderbolt 0000:06:00.0: saving config space at offset 0x28 (reading 0x0)
> [   25.087466] thunderbolt 0000:06:00.0: saving config space at offset 0x2c (reading 0x229b17aa)
> [   25.087471] thunderbolt 0000:06:00.0: saving config space at offset 0x30 (reading 0x0)
> [   25.087476] thunderbolt 0000:06:00.0: saving config space at offset 0x34 (reading 0x80)
> [   25.087481] thunderbolt 0000:06:00.0: saving config space at offset 0x38 (reading 0x0)
> [   25.087486] thunderbolt 0000:06:00.0: saving config space at offset 0x3c (reading 0x1ff)
> [   25.087571] thunderbolt 0000:06:00.0: PME# enabled
> [   25.105353] pcieport 0000:05:00.0: saving config space at offset 0x0 (reading 0x15ea8086)
> [   25.105364] pcieport 0000:05:00.0: saving config space at offset 0x4 (reading 0x100407)
> [   25.105370] pcieport 0000:05:00.0: saving config space at offset 0x8 (reading 0x6040006)
> [   25.105375] pcieport 0000:05:00.0: saving config space at offset 0xc (reading 0x10020)
> [   25.105380] pcieport 0000:05:00.0: saving config space at offset 0x10 (reading 0x0)
> [   25.105384] pcieport 0000:05:00.0: saving config space at offset 0x14 (reading 0x0)
> [   25.105389] pcieport 0000:05:00.0: saving config space at offset 0x18 (reading 0x60605)
> [   25.105394] pcieport 0000:05:00.0: saving config space at offset 0x1c (reading 0x1f1)
> [   25.105399] pcieport 0000:05:00.0: saving config space at offset 0x20 (reading 0xcc10cc10)
> [   25.105404] pcieport 0000:05:00.0: saving config space at offset 0x24 (reading 0x1fff1)
> [   25.105409] pcieport 0000:05:00.0: saving config space at offset 0x28 (reading 0x0)
> [   25.105413] pcieport 0000:05:00.0: saving config space at offset 0x2c (reading 0x0)
> [   25.105418] pcieport 0000:05:00.0: saving config space at offset 0x30 (reading 0x0)
> [   25.105423] pcieport 0000:05:00.0: saving config space at offset 0x34 (reading 0x80)
> [   25.105428] pcieport 0000:05:00.0: saving config space at offset 0x38 (reading 0x0)
> [   25.105432] pcieport 0000:05:00.0: saving config space at offset 0x3c (reading 0x201ff)
> [   25.105517] pcieport 0000:05:00.0: PME# enabled
> [   25.125367] pcieport 0000:04:00.0: saving config space at offset 0x0 (reading 0x15ea8086)
> [   25.125378] pcieport 0000:04:00.0: saving config space at offset 0x4 (reading 0x100007)
> [   25.125383] pcieport 0000:04:00.0: saving config space at offset 0x8 (reading 0x6040006)
> [   25.125388] pcieport 0000:04:00.0: saving config space at offset 0xc (reading 0x10020)
> [   25.125393] pcieport 0000:04:00.0: saving config space at offset 0x10 (reading 0x0)
> [   25.125398] pcieport 0000:04:00.0: saving config space at offset 0x14 (reading 0x0)
> [   25.125403] pcieport 0000:04:00.0: saving config space at offset 0x18 (reading 0x510504)
> [   25.125407] pcieport 0000:04:00.0: saving config space at offset 0x1c (reading 0x5141)
> [   25.125412] pcieport 0000:04:00.0: saving config space at offset 0x20 (reading 0xcc10b400)
> [   25.125417] pcieport 0000:04:00.0: saving config space at offset 0x24 (reading 0x3ff10001)
> [   25.125422] pcieport 0000:04:00.0: saving config space at offset 0x28 (reading 0x60)
> [   25.125427] pcieport 0000:04:00.0: saving config space at offset 0x2c (reading 0x60)
> [   25.125431] pcieport 0000:04:00.0: saving config space at offset 0x30 (reading 0x0)
> [   25.125436] pcieport 0000:04:00.0: saving config space at offset 0x34 (reading 0x80)
> [   25.125441] pcieport 0000:04:00.0: saving config space at offset 0x38 (reading 0x0)
> [   25.125446] pcieport 0000:04:00.0: saving config space at offset 0x3c (reading 0x201ff)
> [   25.125528] pcieport 0000:04:00.0: PME# enabled
> [   25.145423] pcieport 0000:00:1c.0: saving config space at offset 0x0 (reading 0xa3388086)
> [   25.145437] pcieport 0000:00:1c.0: saving config space at offset 0x4 (reading 0x100407)
> [   25.145445] pcieport 0000:00:1c.0: saving config space at offset 0x8 (reading 0x60400f0)
> [   25.145453] pcieport 0000:00:1c.0: saving config space at offset 0xc (reading 0x810000)
> [   25.145460] pcieport 0000:00:1c.0: saving config space at offset 0x10 (reading 0x0)
> [   25.145464] pcieport 0000:00:1c.0: saving config space at offset 0x14 (reading 0x0)
> [   25.145469] pcieport 0000:00:1c.0: saving config space at offset 0x18 (reading 0x510400)
> [   25.145476] pcieport 0000:00:1c.0: saving config space at offset 0x1c (reading 0x20006040)
> [   25.145484] pcieport 0000:00:1c.0: saving config space at offset 0x20 (reading 0xcc10b400)
> [   25.145488] pcieport 0000:00:1c.0: saving config space at offset 0x24 (reading 0x3ff10001)
> [   25.145493] pcieport 0000:00:1c.0: saving config space at offset 0x28 (reading 0x60)
> [   25.145497] pcieport 0000:00:1c.0: saving config space at offset 0x2c (reading 0x60)
> [   25.145502] pcieport 0000:00:1c.0: saving config space at offset 0x30 (reading 0x0)
> [   25.145506] pcieport 0000:00:1c.0: saving config space at offset 0x34 (reading 0x40)
> [   25.145510] pcieport 0000:00:1c.0: saving config space at offset 0x38 (reading 0x0)
> [   25.145515] pcieport 0000:00:1c.0: saving config space at offset 0x3c (reading 0x201ff)
> [   25.145604] pcieport 0000:00:1c.0: PME# enabled
> [   26.265697] pcieport 0000:00:1c.0: power state changed by ACPI to D3cold
> [   45.468365] random: crng init done
> [  105.032727] usb 1-2: USB disconnect, device number 2  <= I removed a usb key, didn't help
> [  128.495144] async_tx: api initialized (async)
> [  128.514820] device-mapper: uevent: version 1.0.3
> [  128.518186] device-mapper: ioctl: 4.42.0-ioctl (2020-02-27) initialised: dm-devel@redhat.com
> [  144.869445] e1000e 0000:00:1f.6 eth0: NIC Link is Down
> [  172.851384] BTRFS: device label btrfs_pool4 devid 1 transid 78270 /dev/sdb4 scanned by btrfs (1293)
> [  172.851648] BTRFS: device label btrfs_pool3 devid 1 transid 27410 /dev/sda5 scanned by btrfs (1293)
> [  172.852030] BTRFS: device fsid de9694f8-9c0d-4e9d-bd12-57adc4381cd7 devid 1 transid 41 /dev/sda3 scanned by btrfs (1293)
> [  172.852224] BTRFS: device fsid 23e1398d-e462-41aa-b85e-f574906ddc03 devid 1 transid 585 /dev/nvme0n1p4 scanned by btrfs (1293)
> [  189.124291] nouveau 0000:01:00.0: disp ctor failed, -12
> [  189.124530] nouveau: probe of 0000:01:00.0 failed with error -12
>
> The next boot looks similar:
> [   25.161759] pcieport 0000:00:1c.0: PME# enabled
> [   26.297810] pcieport 0000:00:1c.0: power state changed by ACPI to D3cold
> [  128.427270] async_tx: api initialized (async)
> [  128.446525] device-mapper: uevent: version 1.0.3
> [  128.446691] device-mapper: ioctl: 4.42.0-ioctl (2020-02-27) initialised: dm-devel@redhat.com
> [  128.458120] random: cryptsetup: uninitialized urandom read (4 bytes read)
> [  138.507373] random: cryptsetup: uninitialized urandom read (4 bytes read)
> [  144.793573] e1000e 0000:00:1f.6 eth0: NIC Link is Down
> [  159.627780] random: crng init done
> [  171.814064] process '/usr/bin/fstype' started with executable stack
> [  181.949989] BTRFS: device label btrfs_boot devid 1 transid 394687 /dev/mapper/cryptroot scanned by btrfs (1063)
> [  181.953437] BTRFS: device label btrfs_pool4 devid 1 transid 78270 /dev/sdb4 scanned by btrfs (1063)
> [  181.956989] BTRFS: device label btrfs_pool3 devid 1 transid 27410 /dev/sda5 scanned by btrfs (1063)
> [  181.960473] BTRFS: device fsid de9694f8-9c0d-4e9d-bd12-57adc4381cd7 devid 1 transid 41 /dev/sda3 scanned by btrfs (1063)
> [  181.964097] BTRFS: device fsid 23e1398d-e462-41aa-b85e-f574906ddc03 devid 1 transid 585 /dev/nvme0n1p4 scanned by btrfs (1063)
> [  188.733645] nouveau 0000:01:00.0: disp ctor failed, -12
> [  188.740653] nouveau: probe of 0000:01:00.0 failed with error -12

oh, I somehow missed that "disp ctor failed" message. I think that
might explain why things are a bit hanging. From the top of my head I
am not sure if that's something known or something new. But just in
case I CCed Lyude and Ben. And I think booting with
nouveau.debug=disp=trace could already show something relevant.

> [  188.901070] PM: Image not found (code -22)
>
> Does that help?
>
> Thanks,
> Marc
> --
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
>
> Home page: http://marc.merlins.org/                       | PGP 7F55D5F27AAF9D08
>
Marc MERLIN Sept. 8, 2020, 12:29 a.m. UTC | #9
On Tue, Sep 08, 2020 at 01:51:19AM +0200, Karol Herbst wrote:
> oh, I somehow missed that "disp ctor failed" message. I think that
> might explain why things are a bit hanging. From the top of my head I
> am not sure if that's something known or something new. But just in
> case I CCed Lyude and Ben. And I think booting with
> nouveau.debug=disp=trace could already show something relevant.

Thanks.
I've added that to my boot for next time I reboot.

I'm moving some folks to Bcc now, and let's remove the lists other than
nouveau on followups (lkml and pci). I'm just putting a warning here
so that it shows up in other list archives and anyone finding this
later knows that they should look in the nouveau archives for further
updates/resolution.

Thanks,
Marc
Ilia Mirkin Dec. 27, 2020, 6:28 p.m. UTC | #10
On Sun, Dec 27, 2020 at 12:03 PM Marc MERLIN <marc_nouveau@merlins.org> wrote:
>
> This started with 5.5 and hasn't gotten better since then, despite some reports
> I tried to send.
>
> As per my previous message:
> I have a Thinkpad P70 with hybrid graphics.
> 01:00.0 VGA compatible controller: NVIDIA Corporation GM107GLM [Quadro M600M] (rev a2)
> that one works fine, I can use i915 for the main screen, and nouveau to
> display on the external ports (external ports are only wired to nvidia
> chip, so it's impossible to use them without turning the nvidia chip
> on).
>
> I now got a newer P73 also with the same hybrid graphics (setup as such
> in the bios). It runs fine with i915, and I don't need to use external
> display with nouveau for now (it almost works, but I only see the mouse
> cursor on the external screen, no window or anything else can get
> displayed, very weird).
> 01:00.0 VGA compatible controller: NVIDIA Corporation TU104GLM [Quadro RTX 4000 Mobile / Max-Q] (rev a1)

Display offload usually requires acceleration -- the copies are done
using the DMA engine. Please make sure that you have firmware
available (and a new enough mesa). The errors suggest that you don't
have firmware available at the time that nouveau loads. Depending on
your setup, that might mean the firmware has to be built into the
kernel, or available in initramfs. (Or just regular filesystem if you
don't use a complicated boot sequence. But many people go with distro
defaults, which do have this complexity.)

>
>
> after boot, when it gets the right trigger (not sure which ones), it
> loops on this evern 2 seconds, mostly forever.

The gpu suspends with runtime pm. And then gets woken up for some
reason (could be something quite silly, like lspci, or could be
something explicitly checking connectors, etc). Repeat.

Cheers,

  -ilia
Marc MERLIN Dec. 29, 2020, 3:51 p.m. UTC | #11
On Sat, Dec 26, 2020 at 03:12:09AM -0800, Ilia Mirkin wrote:
> > after boot, when it gets the right trigger (not sure which ones), it
> > loops on this evern 2 seconds, mostly forever.
> 
> The gpu suspends with runtime pm. And then gets woken up for some
> reason (could be something quite silly, like lspci, or could be
> something explicitly checking connectors, etc). Repeat.

Ah, fair point.  Could it be powertop even?
How would I go towards tracing that?
Sounds like this would be a problem with all chips if userspace is able
to wake them up every second or two with a probe. Now I wonder what
broken userspace I have that could be doing this.
 
> Display offload usually requires acceleration -- the copies are done
> using the DMA engine. Please make sure that you have firmware
> available (and a new enough mesa). The errors suggest that you don't
> have firmware available at the time that nouveau loads. Depending on
> your setup, that might mean the firmware has to be built into the
> kernel, or available in initramfs. (Or just regular filesystem if you
> don't use a complicated boot sequence. But many people go with distro
> defaults, which do have this complexity.)

Hi Ilia, thanks for your answer.

Do you think that could be a reason why the boot would hang for 2 full minutes at every
boot ever since I upgraded to 5.5?

Also, without wanting to sound like a full newbie, where is that
firmware you're talking about? In my kernel source?

Here's what I do have:
sauron:/usr/local/bin# dpkggrep nouveau
libdrm-nouveau2:amd64				install
xserver-xorg-video-nouveau			install

no nouveau-firmware package in debian:
sauron:/usr/local/bin# apt-cache search nouveau
bumblebee - NVIDIA Optimus support for Linux
libdrm-nouveau2 - Userspace interface to nouveau-specific kernel DRM services -- runtime
xfonts-jmk - Jim Knoble's character-cell fonts for X
xserver-xorg-video-nouveau - X.Org X server -- Nouveau display driver

No firmware file on my disk:
sauron:/usr/local/bin# find /lib/modules/5.9.11-amd64-preempt-sysrq-20190817/ /lib/firmware/ |grep nouveau
/lib/modules/5.9.11-amd64-preempt-sysrq-20190817/kernel/drivers/gpu/drm/nouveau
/lib/modules/5.9.11-amd64-preempt-sysrq-20190817/kernel/drivers/gpu/drm/nouveau/nouveau.ko
sauron:/usr/local/bin# 

The kernel module is in my initrd:
sauron:/usr/local/bin# dd if=/boot/initrd.img-5.9.11-amd64-preempt-sysrq-20190817 bs=2966528  skip=1 | gunzip | cpio -tdv | grep nouveau
drwxr-xr-x   1 root     root            0 Nov 30 15:40 usr/lib/modules/5.9.11-amd64-preempt-sysrq-20190817/kernel/drivers/gpu/drm/nouveau
-rw-r--r--   1 root     root      3691385 Nov 30 15:35 usr/lib/modules/5.9.11-amd64-preempt-sysrq-20190817/kernel/drivers/gpu/drm/nouveau/nouveau.ko
17+1 records in
17+1 records out
52566778 bytes (53 MB, 50 MiB) copied, 1.69708 s, 31.0 MB/s

What am I supposed to do/check next?

Note that ultimately I only need nouveau not to hang my boot 2mn and do
PM so that the nvidia chip goes to sleep since I don't use it.

Thanks,
Marc
Ilia Mirkin Dec. 29, 2020, 4:33 p.m. UTC | #12
On Tue, Dec 29, 2020 at 10:52 AM Marc MERLIN <marc_nouveau@merlins.org> wrote:
>
> On Sat, Dec 26, 2020 at 03:12:09AM -0800, Ilia Mirkin wrote:
> > > after boot, when it gets the right trigger (not sure which ones), it
> > > loops on this evern 2 seconds, mostly forever.
> >
> > The gpu suspends with runtime pm. And then gets woken up for some
> > reason (could be something quite silly, like lspci, or could be
> > something explicitly checking connectors, etc). Repeat.
>
> Ah, fair point.  Could it be powertop even?
> How would I go towards tracing that?
> Sounds like this would be a problem with all chips if userspace is able
> to wake them up every second or two with a probe. Now I wonder what
> broken userspace I have that could be doing this.

Well, it's a theory. Some userspace helpfully prevents the GPU from
suspending entirely, unfortunately I don't remember its name though by
messing with the attached audio device. It's very common and meant to
help... oh well.

>
> > Display offload usually requires acceleration -- the copies are done
> > using the DMA engine. Please make sure that you have firmware
> > available (and a new enough mesa). The errors suggest that you don't
> > have firmware available at the time that nouveau loads. Depending on
> > your setup, that might mean the firmware has to be built into the
> > kernel, or available in initramfs. (Or just regular filesystem if you
> > don't use a complicated boot sequence. But many people go with distro
> > defaults, which do have this complexity.)
>
> Hi Ilia, thanks for your answer.
>
> Do you think that could be a reason why the boot would hang for 2 full minutes at every
> boot ever since I upgraded to 5.5?

I'd have to check, but I'm guessing TU104 acceleration became a thing
in 5.5. I would also not be very surprised if the code didn't handle
failure extremely gracefully - there definitely have been problems
with that in the past.

>
> Also, without wanting to sound like a full newbie, where is that
> firmware you're talking about? In my kernel source?
>
> Here's what I do have:
> sauron:/usr/local/bin# dpkggrep nouveau
> libdrm-nouveau2:amd64                           install
> xserver-xorg-video-nouveau                      install
>
> no nouveau-firmware package in debian:
> sauron:/usr/local/bin# apt-cache search nouveau
> bumblebee - NVIDIA Optimus support for Linux
> libdrm-nouveau2 - Userspace interface to nouveau-specific kernel DRM services -- runtime
> xfonts-jmk - Jim Knoble's character-cell fonts for X
> xserver-xorg-video-nouveau - X.Org X server -- Nouveau display driver
>
> No firmware file on my disk:
> sauron:/usr/local/bin# find /lib/modules/5.9.11-amd64-preempt-sysrq-20190817/ /lib/firmware/ |grep nouveau
> /lib/modules/5.9.11-amd64-preempt-sysrq-20190817/kernel/drivers/gpu/drm/nouveau
> /lib/modules/5.9.11-amd64-preempt-sysrq-20190817/kernel/drivers/gpu/drm/nouveau/nouveau.ko
> sauron:/usr/local/bin#
>
> The kernel module is in my initrd:
> sauron:/usr/local/bin# dd if=/boot/initrd.img-5.9.11-amd64-preempt-sysrq-20190817 bs=2966528  skip=1 | gunzip | cpio -tdv | grep nouveau
> drwxr-xr-x   1 root     root            0 Nov 30 15:40 usr/lib/modules/5.9.11-amd64-preempt-sysrq-20190817/kernel/drivers/gpu/drm/nouveau
> -rw-r--r--   1 root     root      3691385 Nov 30 15:35 usr/lib/modules/5.9.11-amd64-preempt-sysrq-20190817/kernel/drivers/gpu/drm/nouveau/nouveau.ko
> 17+1 records in
> 17+1 records out
> 52566778 bytes (53 MB, 50 MiB) copied, 1.69708 s, 31.0 MB/s

I think that gets you out of "full newbie" land...

>
> What am I supposed to do/check next?
>
> Note that ultimately I only need nouveau not to hang my boot 2mn and do
> PM so that the nvidia chip goes to sleep since I don't use it.

I'm not extremely familiar with debian packaging, but the firmware is
provided by NVIDIA and shipped as part of linux-firmware:

https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/nvidia

This needs to be available at /lib/firmware/nvidia when nouveau loads.
Based on your email above, it's most likely that it would load from
the initrd - so make sure it's in there.

Of course now that I read your email a bit more carefully, it seems
your issue is with the "saving config space" messages. I'm not sure
I've seen those before. Perhaps you have some sort of debug enabled.
I'd find where in the kernel they are being produced, and what the
conditions for it are. But the failure to load firmware isn't great --
not 100% sure if it impacts runpm or not.

I just double-checked, TU10x accel came in via
afa3b96b058d87c2c44d1c83dadb2ba6998d03ce, which was first in v5.6.
Initial TU10x support came in v5.0. So that doesn't line up with your
timeline.

Anyways, I'd definitely sort the firmware situation out, but it may
not be the cause of your problem.

Cheers,

  -ilia
diff mbox series

Patch

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index e7982af9a5d8..bfd92e018925 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4607,14 +4607,17 @@  static int pci_pm_reset(struct pci_dev *dev, int probe)
 
 	return pci_dev_wait(dev, "PM D3->D0", PCIE_RESET_READY_POLL_MS);
 }
+
 /**
- * pcie_wait_for_link - Wait until link is active or inactive
+ * pcie_wait_for_link_delay - Wait until link is active or inactive
  * @pdev: Bridge device
  * @active: waiting for active or inactive?
+ * @delay: Delay to wait after link has become active (in ms)
  *
  * Use this to wait till link becomes active or inactive.
  */
-bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
+static bool pcie_wait_for_link_delay(struct pci_dev *pdev, bool active,
+				     int delay)
 {
 	int timeout = 1000;
 	bool ret;
@@ -4651,13 +4654,25 @@  bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
 		timeout -= 10;
 	}
 	if (active && ret)
-		msleep(100);
+		msleep(delay);
 	else if (ret != active)
 		pci_info(pdev, "Data Link Layer Link Active not %s in 1000 msec\n",
 			active ? "set" : "cleared");
 	return ret == active;
 }
 
+/**
+ * pcie_wait_for_link - Wait until link is active or inactive
+ * @pdev: Bridge device
+ * @active: waiting for active or inactive?
+ *
+ * Use this to wait till link becomes active or inactive.
+ */
+bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
+{
+	return pcie_wait_for_link_delay(pdev, active, 100);
+}
+
 void pci_reset_secondary_bus(struct pci_dev *dev)
 {
 	u16 ctrl;