[PATCHv2,0/5] aer handling fixups

Message ID 20210104230300.1277180-1-kbusch@kernel.org

Message

Keith Busch Jan. 4, 2021, 11:02 p.m. UTC
Changes from v1:

  Added received Acks

  Split the kernel print identifying the port type being reset.

  Added a patch for the portdrv to ensure the slot_reset happens without
  relying on a downstream device driver.

Keith Busch (5):
  PCI/ERR: Clear status of the reporting device
  PCI/AER: Actually get the root port
  PCI/ERR: Retain status from error notification
  PCI/AER: Specify the type of port that was reset
  PCI/portdrv: Report reset for frozen channel

 drivers/pci/pcie/aer.c         |  5 +++--
 drivers/pci/pcie/err.c         | 16 +++++++---------
 drivers/pci/pcie/portdrv_pci.c |  3 ++-
 3 files changed, 12 insertions(+), 12 deletions(-)

Comments

Hinko Kocevar Jan. 5, 2021, 2:21 p.m. UTC | #1
On 1/5/21 12:02 AM, Keith Busch wrote:
> Changes from v1:
> 
>    Added received Acks
> 
>    Split the kernel print identifying the port type being reset.
> 
>    Added a patch for the portdrv to ensure the slot_reset happens without
>    relying on a downstream device driver..
> 
> Keith Busch (5):
>    PCI/ERR: Clear status of the reporting device
>    PCI/AER: Actually get the root port
>    PCI/ERR: Retain status from error notification
>    PCI/AER: Specify the type of port that was reset
>    PCI/portdrv: Report reset for frozen channel
> 
>   drivers/pci/pcie/aer.c         |  5 +++--
>   drivers/pci/pcie/err.c         | 16 +++++++---------
>   drivers/pci/pcie/portdrv_pci.c |  3 ++-
>   3 files changed, 12 insertions(+), 12 deletions(-)
> 

Dear Keith,

I've tested this v2 series on the same setup I tested the v1 series on a 
bit earlier (the patches were applied to the same linux-pci git tree).

I'm seeing these messages in dmesg after an error is injected and the 
recovery process starts:

[  397.434293] watchdog: BUG: soft lockup - CPU#2 stuck for 23s! 
[irq/122-aerdrv:128]
[  397.441872] Modules linked in: xdma(O) aer_inject xt_CHECKSUM 
iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack 
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun 
ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter rfkill 
sunrpc dm_mirror dm_region_hash dm_log dm_mod x86_pkg_temp_thermal 
intel_powerclamp coretemp snd_hda_codec_hdmi kvm_intel kvm snd_hda_intel 
snd_intel_dspcfg irqbypass i2c_designware_platform iTCO_wdt 
snd_hda_codec i2c_designware_core crct10dif_pclmul crc32_pclmul 
ghash_clmulni_intel iTCO_vendor_support snd_hda_core snd_hwdep mei_wdt 
intel_wmi_thunderbolt snd_seq snd_seq_device aesni_intel crypto_simd 
snd_pcm cryptd glue_helper rapl i2c_i801 snd_timer snd intel_cstate 
soundcore i2c_smbus pcspkr intel_lpss_pci intel_lpss mei_me idma64 
virt_dma sg joydev input_leds mei mfd_core intel_pch_thermal plx_dma wmi 
pinctrl_sunrisepoint intel_pmc_core acpi_pad ie31200_edac ip_tables xfs 
libcrc32c sd_mod t10_pi i915 cec rc_core
[  397.441925]  drm_kms_helper igb ptp ahci crc32c_intel syscopyarea 
serio_raw sysfillrect pps_core libahci sysimgblt dca fb_sys_fops 
i2c_algo_bit libata drm video
[  397.542794] CPU: 2 PID: 128 Comm: irq/122-aerdrv Tainted: G S 
  O      5.10.0-1.keithv2.el7.x86_64 #1
[  397.552530] Hardware name: AMI AM G6x/msd/AM G6x/msd, BIOS 4.08.01 
02/19/2019
[  397.559663] RIP: 0010:pci_bus_write_config_dword+0x0/0x30
[  397.565069] Code: c2 01 75 1f 55 48 8b 87 c0 00 00 00 44 0f b7 c1 b9 
02 00 00 00 48 8b 40 20 48 89 e5 e8 a9 de 70 00 5d c3 b8 87 00 00 00 c3 
90 <0f> 1f 44 00 00 f6 c2 03 75 1e 55 48 8b 87 c0 00 00 00 41 89 c8 b9
[  397.583819] RSP: 0018:ffffc90000303b88 EFLAGS: 00000297
[  397.589044] RAX: 0000000000000000 RBX: ffff8881011671cc RCX: 
00000000071f1f1f
[  397.596182] RDX: 00000000000001a4 RSI: 0000000000000008 RDI: 
ffff8881013d9000
[  397.603319] RBP: ffffc90000303b90 R08: 0000000000000004 R09: 
00000000071f1f00
[  397.610454] R10: 0000000000000000 R11: ffffc90000303950 R12: 
ffff888101400000
[  397.617588] R13: 0000000000000000 R14: ffff8881011671dc R15: 
00000000fee98fdc
[  397.624722] FS:  0000000000000000(0000) GS:ffff88845dc80000(0000) 
knlGS:0000000000000000
[  397.632814] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  397.638565] CR2: 00007fe94d450000 CR3: 000000000240a001 CR4: 
00000000003706e0
[  397.645702] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[  397.652842] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
0000000000000400
[  397.659978] Call Trace:
[  397.662433]  ? pci_write_config_dword+0x22/0x30
[  397.666974]  pci_vc_save_restore_dwords+0x59/0x70
[  397.671682]  pci_vc_do_save_buffer+0x3da/0x5b0
[  397.676131]  ? pci_read_config_dword+0x23/0x40
[  397.680583]  pci_restore_vc_state+0x4d/0x70
[  397.684776]  pci_restore_state.part.44+0x177/0x420
[  397.689573]  pci_restore_state+0x18/0x20
[  397.693507]  pcie_portdrv_slot_reset+0x12/0x30
[  397.697957]  report_slot_reset+0x46/0x80
[  397.701889]  ? merge_result.part.4+0x40/0x40
[  397.706163]  pci_walk_bus+0x48/0x90
[  397.709660]  pci_walk_bridge+0x1d/0x30
[  397.713415]  pcie_do_recovery+0x1bd/0x270
[  397.717434]  ? aer_dev_correctable_show+0xd0/0xd0
[  397.722147]  ? aer_dev_correctable_show+0xd0/0xd0
[  397.726854]  aer_process_err_devices+0xd0/0xe9
[  397.731308]  aer_isr.cold.25+0x54/0xa3
[  397.735067]  ? irq_finalize_oneshot.part.48+0xf0/0xf0
[  397.740122]  irq_thread_fn+0x24/0x60
[  397.743713]  irq_thread+0xea/0x170
[  397.747121]  ? irq_forced_thread_fn+0x80/0x80
[  397.751481]  ? irq_thread_check_affinity+0xf0/0xf0
[  397.756280]  kthread+0x126/0x140
[  397.759521]  ? kthread_park+0x90/0x90
[  397.763191]  ret_from_fork+0x22/0x30
[  425.434159] watchdog: BUG: soft lockup - CPU#2 stuck for 23s! 
[irq/122-aerdrv:128]
[  425.441727] Modules linked in: xdma(O) aer_inject xt_CHECKSUM 
iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack 
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun 
ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter rfkill 
sunrpc dm_mirror dm_region_hash dm_log dm_mod x86_pkg_temp_thermal 
intel_powerclamp coretemp snd_hda_codec_hdmi kvm_intel kvm snd_hda_intel 
snd_intel_dspcfg irqbypass i2c_designware_platform iTCO_wdt 
snd_hda_codec i2c_designware_core crct10dif_pclmul crc32_pclmul 
ghash_clmulni_intel iTCO_vendor_support snd_hda_core snd_hwdep mei_wdt 
intel_wmi_thunderbolt snd_seq snd_seq_device aesni_intel crypto_simd 
snd_pcm cryptd glue_helper rapl i2c_i801 snd_timer snd intel_cstate 
soundcore i2c_smbus pcspkr intel_lpss_pci intel_lpss mei_me idma64 
virt_dma sg joydev input_leds mei mfd_core intel_pch_thermal plx_dma wmi 
pinctrl_sunrisepoint intel_pmc_core acpi_pad ie31200_edac ip_tables xfs 
libcrc32c sd_mod t10_pi i915 cec rc_core
[  425.441781]  drm_kms_helper igb ptp ahci crc32c_intel syscopyarea 
serio_raw sysfillrect pps_core libahci sysimgblt dca fb_sys_fops 
i2c_algo_bit libata drm video
[  425.542649] CPU: 2 PID: 128 Comm: irq/122-aerdrv Tainted: G S 
  O L    5.10.0-1.keithv2.el7.x86_64 #1
[  425.552383] Hardware name: AMI AM G6x/msd/AM G6x/msd, BIOS 4.08.01 
02/19/2019
[  425.559517] RIP: 0010:clocksource_watchdog+0xb7/0x2b0
[  425.564581] Code: 44 00 00 48 89 df 48 8b 03 e8 05 ab ad 00 48 8b 3d 
e6 cc a6 01 49 89 c5 48 8b 07 e8 f3 aa ad 00 49 89 c7 fb 66 0f 1f 44 00 
00 <48> 8b 43 50 a8 10 0f 84 2b 01 00 00 8b 15 77 cc a6 01 85 d2 0f 85
[  425.583332] RSP: 0018:ffffc90000158e98 EFLAGS: 00000206
[  425.588563] RAX: 000000005b0c2b18 RBX: ffffffff82435c60 RCX: 
0000000000000046
[  425.595702] RDX: 5b0c2b1800000000 RSI: 403a197300000000 RDI: 
0000000000000046
[  425.602839] RBP: ffffc90000158ed8 R08: 0000000000000000 R09: 
000000000000001b
[  425.609978] R10: 0000000000000001 R11: ffff888102db4000 R12: 
0000000000000000
[  425.617120] R13: 000001343b891adb R14: 0000000000000000 R15: 
000000005b0c2b18
[  425.624261] FS:  0000000000000000(0000) GS:ffff88845dc80000(0000) 
knlGS:0000000000000000
[  425.632353] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  425.638100] CR2: 00007fe94d450000 CR3: 000000000240a001 CR4: 
00000000003706e0
[  425.645232] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[  425.652365] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
0000000000000400
[  425.659501] Call Trace:
[  425.661957]  <IRQ>
[  425.663981]  ? find_next_bit+0x14/0x20
[  425.667738]  ? __clocksource_unstable+0x70/0x70
[  425.672273]  call_timer_fn+0x2e/0x100
[  425.675945]  run_timer_softirq+0x1d4/0x400
[  425.680054]  ? execlists_submission_tasklet+0x93/0x180 [i915]
[  425.685856]  ? tasklet_action_common.isra.23+0x69/0x100
[  425.691085]  __do_softirq+0xdb/0x29e
[  425.694669]  asm_call_irq_on_stack+0x12/0x20
[  425.698950]  </IRQ>
[  425.701064]  do_softirq_own_stack+0x3d/0x50
[  425.705262]  irq_exit_rcu+0xd2/0xe0
[  425.708762]  common_interrupt+0x81/0x150
[  425.712701]  ? irq_exit_rcu+0x45/0xe0
[  425.716372]  asm_common_interrupt+0x1e/0x40
[  425.720563] RIP: 0010:pci_mmcfg_read+0x11/0xe0
[  425.725023] Code: 89 e0 5b 41 5c 41 5d 5d c3 45 31 e4 eb e2 cc cc cc 
cc cc cc cc cc cc cc cc 0f 1f 44 00 00 55 81 fa ff 00 00 00 48 89 e5 41 
57 <41> 56 41 89 d6 0f 97 c2 81 f9 ff 0f 00 00 0f 9f c0 41 55 08 c2 41
[  425.743776] RSP: 0018:ffffc90000303b20 EFLAGS: 00000297
[  425.749006] RAX: ffffffff81980df0 RBX: ffffc90000303b96 RCX: 
0000000000000162
[  425.756141] RDX: 0000000000000008 RSI: 0000000000000002 RDI: 
0000000000000000
[  425.763275] RBP: ffffc90000303b28 R08: 0000000000000002 R09: 
ffffc90000303b5c
[  425.770411] R10: 0000000000000008 R11: ffffc90000303950 R12: 
0000000000000001
[  425.777542] R13: 0000000000000162 R14: ffff888101400000 R15: 
0000000000000064
[  425.784678]  ? pcibios_retrieve_fw_addr+0x90/0x90
[  425.789386]  raw_pci_read+0x35/0x40
[  425.792885]  pci_read+0x2c/0x30
[  425.796036]  pci_bus_read_config_word+0x4a/0x70
[  425.800573]  pci_read_config_word+0x23/0x40
[  425.804760]  pci_wait_for_pending+0x49/0xb0
[  425.808946]  ? raw_pci_write+0x36/0x40
[  425.812700]  pci_vc_do_save_buffer+0x453/0x5b0
[  425.817152]  ? pci_read_config_dword+0x23/0x40
[  425.821605]  pci_restore_vc_state+0x4d/0x70
[  425.825799]  pci_restore_state.part.44+0x177/0x420
[  425.830601]  pci_restore_state+0x18/0x20
[  425.834532]  pcie_portdrv_slot_reset+0x12/0x30
[  425.838991]  report_slot_reset+0x46/0x80
[  425.842919]  ? merge_result.part.4+0x40/0x40
[  425.847197]  pci_walk_bus+0x48/0x90
[  425.850696]  pci_walk_bridge+0x1d/0x30
[  425.854455]  pcie_do_recovery+0x1bd/0x270
[  425.858472]  ? aer_dev_correctable_show+0xd0/0xd0
[  425.863181]  ? aer_dev_correctable_show+0xd0/0xd0
[  425.867889]  aer_process_err_devices+0xd0/0xe9
[  425.872334]  aer_isr.cold.25+0x54/0xa3
[  425.876090]  ? irq_finalize_oneshot.part.48+0xf0/0xf0
[  425.881144]  irq_thread_fn+0x24/0x60
[  425.884722]  irq_thread+0xea/0x170
[  425.888132]  ? irq_forced_thread_fn+0x80/0x80
[  425.892498]  ? irq_thread_check_affinity+0xf0/0xf0
[  425.897296]  kthread+0x126/0x140
[  425.900532]  ? kthread_park+0x90/0x90
[  425.904204]  ret_from_fork+0x22/0x30
[  430.737050] rcu: INFO: rcu_sched self-detected stall on CPU
[  430.742631] rcu: 	2-....: (10176 ticks this GP) 
idle=01a/1/0x4000000000000000 softirq=8276/8276 fqs=13750
[  430.752283] 	(t=60000 jiffies g=12873 q=14518)
[  430.756732] NMI backtrace for cpu 2
[  430.760223] CPU: 2 PID: 128 Comm: irq/122-aerdrv Tainted: G S 
  O L    5.10.0-1.keithv2.el7.x86_64 #1
[  430.769954] Hardware name: AMI AM G6x/msd/AM G6x/msd, BIOS 4.08.01 
02/19/2019
[  430.777089] Call Trace:
[  430.779542]  <IRQ>
[  430.781562]  dump_stack+0x74/0x92
[  430.784896]  nmi_cpu_backtrace.cold.8+0x32/0x68
[  430.789435]  ? lapic_can_unplug_cpu+0x80/0x80
[  430.793798]  nmi_trigger_cpumask_backtrace+0xe6/0xf0
[  430.798770]  arch_trigger_cpumask_backtrace+0x19/0x20
[  430.803830]  rcu_dump_cpu_stacks+0xb6/0xe6
[  430.807937]  rcu_sched_clock_irq.cold.92+0x124/0x3a9
[  430.812908]  ? account_system_index_time+0x94/0xa0
[  430.817708]  ? tick_sched_do_timer+0x60/0x60
[  430.821982]  update_process_times+0x93/0xd0
[  430.826173]  tick_sched_handle.isra.23+0x25/0x60
[  430.830798]  tick_sched_timer+0x6b/0x80
[  430.834638]  __hrtimer_run_queues+0x105/0x240
[  430.839002]  hrtimer_interrupt+0x101/0x220
[  430.843112]  __sysvec_apic_timer_interrupt+0x62/0xe0
[  430.848089]  sysvec_apic_timer_interrupt+0x35/0x90
[  430.852885]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[  430.858026] RIP: 0010:clocksource_watchdog+0xb7/0x2b0
[  430.863086] Code: 44 00 00 48 89 df 48 8b 03 e8 05 ab ad 00 48 8b 3d 
e6 cc a6 01 49 89 c5 48 8b 07 e8 f3 aa ad 00 49 89 c7 fb 66 0f 1f 44 00 
00 <48> 8b 43 50 a8 10 0f 84 2b 01 00 00 8b 15 77 cc a6 01 85 d2 0f 85
[  430.881838] RSP: 0018:ffffc90000158e98 EFLAGS: 00000206
[  430.887077] RAX: 000000005b0c2b18 RBX: ffffffff82435c60 RCX: 
0000000000000046
[  430.894216] RDX: 5b0c2b1800000000 RSI: 403a197300000000 RDI: 
0000000000000046
[  430.901355] RBP: ffffc90000158ed8 R08: 0000000000000000 R09: 
000000000000001b
[  430.908491] R10: 0000000000000001 R11: ffff888102db4000 R12: 
0000000000000000
[  430.915631] R13: 000001343b891adb R14: 0000000000000000 R15: 
000000005b0c2b18
[  430.922770]  ? clocksource_watchdog+0xad/0x2b0
[  430.927221]  ? find_next_bit+0x14/0x20
[  430.930984]  ? __clocksource_unstable+0x70/0x70
[  430.935526]  call_timer_fn+0x2e/0x100
[  430.939198]  run_timer_softirq+0x1d4/0x400
[  430.943305]  ? execlists_submission_tasklet+0x93/0x180 [i915]
[  430.949104]  ? tasklet_action_common.isra.23+0x69/0x100
[  430.954337]  __do_softirq+0xdb/0x29e
[  430.957921]  asm_call_irq_on_stack+0x12/0x20
[  430.962196]  </IRQ>
[  430.964307]  do_softirq_own_stack+0x3d/0x50
[  430.968500]  irq_exit_rcu+0xd2/0xe0
[  430.971998]  common_interrupt+0x81/0x150
[  430.975936]  ? irq_exit_rcu+0x45/0xe0
[  430.979606]  asm_common_interrupt+0x1e/0x40
[  430.983795] RIP: 0010:pci_mmcfg_read+0x11/0xe0
[  430.988257] Code: 89 e0 5b 41 5c 41 5d 5d c3 45 31 e4 eb e2 cc cc cc 
cc cc cc cc cc cc cc cc 0f 1f 44 00 00 55 81 fa ff 00 00 00 48 89 e5 41 
57 <41> 56 41 89 d6 0f 97 c2 81 f9 ff 0f 00 00 0f 9f c0 41 55 08 c2 41
[  431.007005] RSP: 0018:ffffc90000303b20 EFLAGS: 00000297
[  431.012232] RAX: ffffffff81980df0 RBX: ffffc90000303b96 RCX: 
0000000000000162
[  431.019366] RDX: 0000000000000008 RSI: 0000000000000002 RDI: 
0000000000000000
[  431.026499] RBP: ffffc90000303b28 R08: 0000000000000002 R09: 
ffffc90000303b5c
[  431.033630] R10: 0000000000000008 R11: ffffc90000303950 R12: 
0000000000000001
[  431.040764] R13: 0000000000000162 R14: ffff888101400000 R15: 
0000000000000064
[  431.047897]  ? pcibios_retrieve_fw_addr+0x90/0x90
[  431.052612]  raw_pci_read+0x35/0x40
[  431.056111]  pci_read+0x2c/0x30
[  431.059261]  pci_bus_read_config_word+0x4a/0x70
[  431.063800]  pci_read_config_word+0x23/0x40
[  431.067991]  pci_wait_for_pending+0x49/0xb0
[  431.072180]  ? raw_pci_write+0x36/0x40
[  431.075939]  pci_vc_do_save_buffer+0x453/0x5b0
[  431.080386]  ? pci_read_config_dword+0x23/0x40
[  431.084832]  pci_restore_vc_state+0x4d/0x70
[  431.089020]  pci_restore_state.part.44+0x177/0x420
[  431.093821]  pci_restore_state+0x18/0x20
[  431.097751]  pcie_portdrv_slot_reset+0x12/0x30
[  431.102205]  report_slot_reset+0x46/0x80
[  431.106138]  ? merge_result.part.4+0x40/0x40
[  431.110413]  pci_walk_bus+0x48/0x90
[  431.113912]  pci_walk_bridge+0x1d/0x30
[  431.117668]  pcie_do_recovery+0x1bd/0x270
[  431.121685]  ? aer_dev_correctable_show+0xd0/0xd0
[  431.126404]  ? aer_dev_correctable_show+0xd0/0xd0
[  431.131118]  aer_process_err_devices+0xd0/0xe9
[  431.135568]  aer_isr.cold.25+0x54/0xa3
[  431.139326]  ? irq_finalize_oneshot.part.48+0xf0/0xf0
[  431.144388]  irq_thread_fn+0x24/0x60
[  431.147972]  irq_thread+0xea/0x170
[  431.151380]  ? irq_forced_thread_fn+0x80/0x80
[  431.155748]  ? irq_thread_check_affinity+0xf0/0xf0
[  431.160548]  kthread+0x126/0x140
[  431.163788]  ? kthread_park+0x90/0x90
[  431.167460]  ret_from_fork+0x22/0x30
[  444.669140] pcieport 0000:02:01.0: restoring config space at offset 
0x3c (was 0x100, writing 0x12010a)
[  462.143968] pcieport 0000:02:01.0: restoring config space at offset 
0x2c (was 0x0, writing 0x0)
[  462.152864] pcieport 0000:02:01.0: restoring config space at offset 
0x28 (was 0x0, writing 0x0)
[  462.161752] pcieport 0000:02:01.0: restoring config space at offset 
0x24 (was 0x1fff1, writing 0x90e19001)


This was not observed with the v1 series of patches.

From what I can tell, this v2 series introduces a change in patch 5/5 
that v1 did not have.

BTW, just to be sure we're aligned, which kernel source tree should I be 
applying this v2 series to?

Let me know if you need more input.


Thanks,
Hinko
Hinko Kocevar Jan. 5, 2021, 3:06 p.m. UTC | #2
On 1/5/21 3:21 PM, Hinko Kocevar wrote:
> 
> 
> On 1/5/21 12:02 AM, Keith Busch wrote:
>> Changes from v1:
>>
>>    Added received Acks
>>
>>    Split the kernel print identifying the port type being reset.
>>
>>    Added a patch for the portdrv to ensure the slot_reset happens without
>>    relying on a downstream device driver..
>>
>> Keith Busch (5):
>>    PCI/ERR: Clear status of the reporting device
>>    PCI/AER: Actually get the root port
>>    PCI/ERR: Retain status from error notification
>>    PCI/AER: Specify the type of port that was reset
>>    PCI/portdrv: Report reset for frozen channel

I removed patch 5/5 from this series, and after testing again, my setup 
recovers from the injected error, the same as observed with the v1 series.

Hope this helps.
//hinko

>>
>>   drivers/pci/pcie/aer.c         |  5 +++--
>>   drivers/pci/pcie/err.c         | 16 +++++++---------
>>   drivers/pci/pcie/portdrv_pci.c |  3 ++-
>>   3 files changed, 12 insertions(+), 12 deletions(-)
>>
> 
> Dear Keith,
> 
> I've tested this v2 series patches on the same setup I have tested the 
> v1 series patches a bit earlier (the same linux-pci GIT tree was used to 
> apply these patches to).
> 
> I'm seeing these messages in the dmesg when after an error is injected 
> and the recovery process starts:
> 
> [  397.434293] watchdog: BUG: soft lockup - CPU#2 stuck for 23s! 
> [irq/122-aerdrv:128]
> [  397.441872] Modules linked in: xdma(O) aer_inject xt_CHECKSUM 
> iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack 
> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun 
> ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter rfkill 
> sunrpc dm_mirror dm_region_hash dm_log dm_mod x86_pkg_temp_thermal 
> intel_powerclamp coretemp snd_hda_codec_hdmi kvm_intel kvm snd_hda_intel 
> snd_intel_dspcfg irqbypass i2c_designware_platform iTCO_wdt 
> snd_hda_codec i2c_designware_core crct10dif_pclmul crc32_pclmul 
> ghash_clmulni_intel iTCO_vendor_support snd_hda_core snd_hwdep mei_wdt 
> intel_wmi_thunderbolt snd_seq snd_seq_device aesni_intel crypto_simd 
> snd_pcm cryptd glue_helper rapl i2c_i801 snd_timer snd intel_cstate 
> soundcore i2c_smbus pcspkr intel_lpss_pci intel_lpss mei_me idma64 
> virt_dma sg joydev input_leds mei mfd_core intel_pch_thermal plx_dma wmi 
> pinctrl_sunrisepoint intel_pmc_core acpi_pad ie31200_edac ip_tables xfs 
> libcrc32c sd_mod t10_pi i915 cec rc_core
> [  397.441925]  drm_kms_helper igb ptp ahci crc32c_intel syscopyarea 
> serio_raw sysfillrect pps_core libahci sysimgblt dca fb_sys_fops 
> i2c_algo_bit libata drm video
> [  397.542794] CPU: 2 PID: 128 Comm: irq/122-aerdrv Tainted: G S  O      
> 5.10.0-1.keithv2.el7.x86_64 #1
> [  397.552530] Hardware name: AMI AM G6x/msd/AM G6x/msd, BIOS 4.08.01 
> 02/19/2019
> [  397.559663] RIP: 0010:pci_bus_write_config_dword+0x0/0x30
> [  397.565069] Code: c2 01 75 1f 55 48 8b 87 c0 00 00 00 44 0f b7 c1 b9 
> 02 00 00 00 48 8b 40 20 48 89 e5 e8 a9 de 70 00 5d c3 b8 87 00 00 00 c3 
> 90 <0f> 1f 44 00 00 f6 c2 03 75 1e 55 48 8b 87 c0 00 00 00 41 89 c8 b9
> [  397.583819] RSP: 0018:ffffc90000303b88 EFLAGS: 00000297
> [  397.589044] RAX: 0000000000000000 RBX: ffff8881011671cc RCX: 
> 00000000071f1f1f
> [  397.596182] RDX: 00000000000001a4 RSI: 0000000000000008 RDI: 
> ffff8881013d9000
> [  397.603319] RBP: ffffc90000303b90 R08: 0000000000000004 R09: 
> 00000000071f1f00
> [  397.610454] R10: 0000000000000000 R11: ffffc90000303950 R12: 
> ffff888101400000
> [  397.617588] R13: 0000000000000000 R14: ffff8881011671dc R15: 
> 00000000fee98fdc
> [  397.624722] FS:  0000000000000000(0000) GS:ffff88845dc80000(0000) 
> knlGS:0000000000000000
> [  397.632814] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  397.638565] CR2: 00007fe94d450000 CR3: 000000000240a001 CR4: 
> 00000000003706e0
> [  397.645702] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
> 0000000000000000
> [  397.652842] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
> 0000000000000400
> [  397.659978] Call Trace:
> [  397.662433]  ? pci_write_config_dword+0x22/0x30
> [  397.666974]  pci_vc_save_restore_dwords+0x59/0x70
> [  397.671682]  pci_vc_do_save_buffer+0x3da/0x5b0
> [  397.676131]  ? pci_read_config_dword+0x23/0x40
> [  397.680583]  pci_restore_vc_state+0x4d/0x70
> [  397.684776]  pci_restore_state.part.44+0x177/0x420
> [  397.689573]  pci_restore_state+0x18/0x20
> [  397.693507]  pcie_portdrv_slot_reset+0x12/0x30
> [  397.697957]  report_slot_reset+0x46/0x80
> [  397.701889]  ? merge_result.part.4+0x40/0x40
> [  397.706163]  pci_walk_bus+0x48/0x90
> [  397.709660]  pci_walk_bridge+0x1d/0x30
> [  397.713415]  pcie_do_recovery+0x1bd/0x270
> [  397.717434]  ? aer_dev_correctable_show+0xd0/0xd0
> [  397.722147]  ? aer_dev_correctable_show+0xd0/0xd0
> [  397.726854]  aer_process_err_devices+0xd0/0xe9
> [  397.731308]  aer_isr.cold.25+0x54/0xa3
> [  397.735067]  ? irq_finalize_oneshot.part.48+0xf0/0xf0
> [  397.740122]  irq_thread_fn+0x24/0x60
> [  397.743713]  irq_thread+0xea/0x170
> [  397.747121]  ? irq_forced_thread_fn+0x80/0x80
> [  397.751481]  ? irq_thread_check_affinity+0xf0/0xf0
> [  397.756280]  kthread+0x126/0x140
> [  397.759521]  ? kthread_park+0x90/0x90
> [  397.763191]  ret_from_fork+0x22/0x30
> [  425.434159] watchdog: BUG: soft lockup - CPU#2 stuck for 23s! 
> [irq/122-aerdrv:128]
> [  425.441727] Modules linked in: xdma(O) aer_inject xt_CHECKSUM 
> iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack 
> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun 
> ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter rfkill 
> sunrpc dm_mirror dm_region_hash dm_log dm_mod x86_pkg_temp_thermal 
> intel_powerclamp coretemp snd_hda_codec_hdmi kvm_intel kvm snd_hda_intel 
> snd_intel_dspcfg irqbypass i2c_designware_platform iTCO_wdt 
> snd_hda_codec i2c_designware_core crct10dif_pclmul crc32_pclmul 
> ghash_clmulni_intel iTCO_vendor_support snd_hda_core snd_hwdep mei_wdt 
> intel_wmi_thunderbolt snd_seq snd_seq_device aesni_intel crypto_simd 
> snd_pcm cryptd glue_helper rapl i2c_i801 snd_timer snd intel_cstate 
> soundcore i2c_smbus pcspkr intel_lpss_pci intel_lpss mei_me idma64 
> virt_dma sg joydev input_leds mei mfd_core intel_pch_thermal plx_dma wmi 
> pinctrl_sunrisepoint intel_pmc_core acpi_pad ie31200_edac ip_tables xfs 
> libcrc32c sd_mod t10_pi i915 cec rc_core
> [  425.441781]  drm_kms_helper igb ptp ahci crc32c_intel syscopyarea 
> serio_raw sysfillrect pps_core libahci sysimgblt dca fb_sys_fops 
> i2c_algo_bit libata drm video
> [  425.542649] CPU: 2 PID: 128 Comm: irq/122-aerdrv Tainted: G S  O L    
> 5.10.0-1.keithv2.el7.x86_64 #1
> [  425.552383] Hardware name: AMI AM G6x/msd/AM G6x/msd, BIOS 4.08.01 
> 02/19/2019
> [  425.559517] RIP: 0010:clocksource_watchdog+0xb7/0x2b0
> [  425.564581] Code: 44 00 00 48 89 df 48 8b 03 e8 05 ab ad 00 48 8b 3d 
> e6 cc a6 01 49 89 c5 48 8b 07 e8 f3 aa ad 00 49 89 c7 fb 66 0f 1f 44 00 
> 00 <48> 8b 43 50 a8 10 0f 84 2b 01 00 00 8b 15 77 cc a6 01 85 d2 0f 85
> [  425.583332] RSP: 0018:ffffc90000158e98 EFLAGS: 00000206
> [  425.588563] RAX: 000000005b0c2b18 RBX: ffffffff82435c60 RCX: 
> 0000000000000046
> [  425.595702] RDX: 5b0c2b1800000000 RSI: 403a197300000000 RDI: 
> 0000000000000046
> [  425.602839] RBP: ffffc90000158ed8 R08: 0000000000000000 R09: 
> 000000000000001b
> [  425.609978] R10: 0000000000000001 R11: ffff888102db4000 R12: 
> 0000000000000000
> [  425.617120] R13: 000001343b891adb R14: 0000000000000000 R15: 
> 000000005b0c2b18
> [  425.624261] FS:  0000000000000000(0000) GS:ffff88845dc80000(0000) 
> knlGS:0000000000000000
> [  425.632353] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  425.638100] CR2: 00007fe94d450000 CR3: 000000000240a001 CR4: 
> 00000000003706e0
> [  425.645232] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
> 0000000000000000
> [  425.652365] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
> 0000000000000400
> [  425.659501] Call Trace:
> [  425.661957]  <IRQ>
> [  425.663981]  ? find_next_bit+0x14/0x20
> [  425.667738]  ? __clocksource_unstable+0x70/0x70
> [  425.672273]  call_timer_fn+0x2e/0x100
> [  425.675945]  run_timer_softirq+0x1d4/0x400
> [  425.680054]  ? execlists_submission_tasklet+0x93/0x180 [i915]
> [  425.685856]  ? tasklet_action_common.isra.23+0x69/0x100
> [  425.691085]  __do_softirq+0xdb/0x29e
> [  425.694669]  asm_call_irq_on_stack+0x12/0x20
> [  425.698950]  </IRQ>
> [  425.701064]  do_softirq_own_stack+0x3d/0x50
> [  425.705262]  irq_exit_rcu+0xd2/0xe0
> [  425.708762]  common_interrupt+0x81/0x150
> [  425.712701]  ? irq_exit_rcu+0x45/0xe0
> [  425.716372]  asm_common_interrupt+0x1e/0x40
> [  425.720563] RIP: 0010:pci_mmcfg_read+0x11/0xe0
> [  425.725023] Code: 89 e0 5b 41 5c 41 5d 5d c3 45 31 e4 eb e2 cc cc cc 
> cc cc cc cc cc cc cc cc 0f 1f 44 00 00 55 81 fa ff 00 00 00 48 89 e5 41 
> 57 <41> 56 41 89 d6 0f 97 c2 81 f9 ff 0f 00 00 0f 9f c0 41 55 08 c2 41
> [  425.743776] RSP: 0018:ffffc90000303b20 EFLAGS: 00000297
> [  425.749006] RAX: ffffffff81980df0 RBX: ffffc90000303b96 RCX: 
> 0000000000000162
> [  425.756141] RDX: 0000000000000008 RSI: 0000000000000002 RDI: 
> 0000000000000000
> [  425.763275] RBP: ffffc90000303b28 R08: 0000000000000002 R09: 
> ffffc90000303b5c
> [  425.770411] R10: 0000000000000008 R11: ffffc90000303950 R12: 
> 0000000000000001
> [  425.777542] R13: 0000000000000162 R14: ffff888101400000 R15: 
> 0000000000000064
> [  425.784678]  ? pcibios_retrieve_fw_addr+0x90/0x90
> [  425.789386]  raw_pci_read+0x35/0x40
> [  425.792885]  pci_read+0x2c/0x30
> [  425.796036]  pci_bus_read_config_word+0x4a/0x70
> [  425.800573]  pci_read_config_word+0x23/0x40
> [  425.804760]  pci_wait_for_pending+0x49/0xb0
> [  425.808946]  ? raw_pci_write+0x36/0x40
> [  425.812700]  pci_vc_do_save_buffer+0x453/0x5b0
> [  425.817152]  ? pci_read_config_dword+0x23/0x40
> [  425.821605]  pci_restore_vc_state+0x4d/0x70
> [  425.825799]  pci_restore_state.part.44+0x177/0x420
> [  425.830601]  pci_restore_state+0x18/0x20
> [  425.834532]  pcie_portdrv_slot_reset+0x12/0x30
> [  425.838991]  report_slot_reset+0x46/0x80
> [  425.842919]  ? merge_result.part.4+0x40/0x40
> [  425.847197]  pci_walk_bus+0x48/0x90
> [  425.850696]  pci_walk_bridge+0x1d/0x30
> [  425.854455]  pcie_do_recovery+0x1bd/0x270
> [  425.858472]  ? aer_dev_correctable_show+0xd0/0xd0
> [  425.863181]  ? aer_dev_correctable_show+0xd0/0xd0
> [  425.867889]  aer_process_err_devices+0xd0/0xe9
> [  425.872334]  aer_isr.cold.25+0x54/0xa3
> [  425.876090]  ? irq_finalize_oneshot.part.48+0xf0/0xf0
> [  425.881144]  irq_thread_fn+0x24/0x60
> [  425.884722]  irq_thread+0xea/0x170
> [  425.888132]  ? irq_forced_thread_fn+0x80/0x80
> [  425.892498]  ? irq_thread_check_affinity+0xf0/0xf0
> [  425.897296]  kthread+0x126/0x140
> [  425.900532]  ? kthread_park+0x90/0x90
> [  425.904204]  ret_from_fork+0x22/0x30
> [  430.737050] rcu: INFO: rcu_sched self-detected stall on CPU
> [  430.742631] rcu:     2-....: (10176 ticks this GP) 
> idle=01a/1/0x4000000000000000 softirq=8276/8276 fqs=13750
> [  430.752283]     (t=60000 jiffies g=12873 q=14518)
> [  430.756732] NMI backtrace for cpu 2
> [  430.760223] CPU: 2 PID: 128 Comm: irq/122-aerdrv Tainted: G S  O L    
> 5.10.0-1.keithv2.el7.x86_64 #1
> [  430.769954] Hardware name: AMI AM G6x/msd/AM G6x/msd, BIOS 4.08.01 
> 02/19/2019
> [  430.777089] Call Trace:
> [  430.779542]  <IRQ>
> [  430.781562]  dump_stack+0x74/0x92
> [  430.784896]  nmi_cpu_backtrace.cold.8+0x32/0x68
> [  430.789435]  ? lapic_can_unplug_cpu+0x80/0x80
> [  430.793798]  nmi_trigger_cpumask_backtrace+0xe6/0xf0
> [  430.798770]  arch_trigger_cpumask_backtrace+0x19/0x20
> [  430.803830]  rcu_dump_cpu_stacks+0xb6/0xe6
> [  430.807937]  rcu_sched_clock_irq.cold.92+0x124/0x3a9
> [  430.812908]  ? account_system_index_time+0x94/0xa0
> [  430.817708]  ? tick_sched_do_timer+0x60/0x60
> [  430.821982]  update_process_times+0x93/0xd0
> [  430.826173]  tick_sched_handle.isra.23+0x25/0x60
> [  430.830798]  tick_sched_timer+0x6b/0x80
> [  430.834638]  __hrtimer_run_queues+0x105/0x240
> [  430.839002]  hrtimer_interrupt+0x101/0x220
> [  430.843112]  __sysvec_apic_timer_interrupt+0x62/0xe0
> [  430.848089]  sysvec_apic_timer_interrupt+0x35/0x90
> [  430.852885]  asm_sysvec_apic_timer_interrupt+0x12/0x20
> [  430.858026] RIP: 0010:clocksource_watchdog+0xb7/0x2b0
> [  430.863086] Code: 44 00 00 48 89 df 48 8b 03 e8 05 ab ad 00 48 8b 3d 
> e6 cc a6 01 49 89 c5 48 8b 07 e8 f3 aa ad 00 49 89 c7 fb 66 0f 1f 44 00 
> 00 <48> 8b 43 50 a8 10 0f 84 2b 01 00 00 8b 15 77 cc a6 01 85 d2 0f 85
> [  430.881838] RSP: 0018:ffffc90000158e98 EFLAGS: 00000206
> [  430.887077] RAX: 000000005b0c2b18 RBX: ffffffff82435c60 RCX: 
> 0000000000000046
> [  430.894216] RDX: 5b0c2b1800000000 RSI: 403a197300000000 RDI: 
> 0000000000000046
> [  430.901355] RBP: ffffc90000158ed8 R08: 0000000000000000 R09: 
> 000000000000001b
> [  430.908491] R10: 0000000000000001 R11: ffff888102db4000 R12: 
> 0000000000000000
> [  430.915631] R13: 000001343b891adb R14: 0000000000000000 R15: 
> 000000005b0c2b18
> [  430.922770]  ? clocksource_watchdog+0xad/0x2b0
> [  430.927221]  ? find_next_bit+0x14/0x20
> [  430.930984]  ? __clocksource_unstable+0x70/0x70
> [  430.935526]  call_timer_fn+0x2e/0x100
> [  430.939198]  run_timer_softirq+0x1d4/0x400
> [  430.943305]  ? execlists_submission_tasklet+0x93/0x180 [i915]
> [  430.949104]  ? tasklet_action_common.isra.23+0x69/0x100
> [  430.954337]  __do_softirq+0xdb/0x29e
> [  430.957921]  asm_call_irq_on_stack+0x12/0x20
> [  430.962196]  </IRQ>
> [  430.964307]  do_softirq_own_stack+0x3d/0x50
> [  430.968500]  irq_exit_rcu+0xd2/0xe0
> [  430.971998]  common_interrupt+0x81/0x150
> [  430.975936]  ? irq_exit_rcu+0x45/0xe0
> [  430.979606]  asm_common_interrupt+0x1e/0x40
> [  430.983795] RIP: 0010:pci_mmcfg_read+0x11/0xe0
> [  430.988257] Code: 89 e0 5b 41 5c 41 5d 5d c3 45 31 e4 eb e2 cc cc cc 
> cc cc cc cc cc cc cc cc 0f 1f 44 00 00 55 81 fa ff 00 00 00 48 89 e5 41 
> 57 <41> 56 41 89 d6 0f 97 c2 81 f9 ff 0f 00 00 0f 9f c0 41 55 08 c2 41
> [  431.007005] RSP: 0018:ffffc90000303b20 EFLAGS: 00000297
> [  431.012232] RAX: ffffffff81980df0 RBX: ffffc90000303b96 RCX: 
> 0000000000000162
> [  431.019366] RDX: 0000000000000008 RSI: 0000000000000002 RDI: 
> 0000000000000000
> [  431.026499] RBP: ffffc90000303b28 R08: 0000000000000002 R09: 
> ffffc90000303b5c
> [  431.033630] R10: 0000000000000008 R11: ffffc90000303950 R12: 
> 0000000000000001
> [  431.040764] R13: 0000000000000162 R14: ffff888101400000 R15: 
> 0000000000000064
> [  431.047897]  ? pcibios_retrieve_fw_addr+0x90/0x90
> [  431.052612]  raw_pci_read+0x35/0x40
> [  431.056111]  pci_read+0x2c/0x30
> [  431.059261]  pci_bus_read_config_word+0x4a/0x70
> [  431.063800]  pci_read_config_word+0x23/0x40
> [  431.067991]  pci_wait_for_pending+0x49/0xb0
> [  431.072180]  ? raw_pci_write+0x36/0x40
> [  431.075939]  pci_vc_do_save_buffer+0x453/0x5b0
> [  431.080386]  ? pci_read_config_dword+0x23/0x40
> [  431.084832]  pci_restore_vc_state+0x4d/0x70
> [  431.089020]  pci_restore_state.part.44+0x177/0x420
> [  431.093821]  pci_restore_state+0x18/0x20
> [  431.097751]  pcie_portdrv_slot_reset+0x12/0x30
> [  431.102205]  report_slot_reset+0x46/0x80
> [  431.106138]  ? merge_result.part.4+0x40/0x40
> [  431.110413]  pci_walk_bus+0x48/0x90
> [  431.113912]  pci_walk_bridge+0x1d/0x30
> [  431.117668]  pcie_do_recovery+0x1bd/0x270
> [  431.121685]  ? aer_dev_correctable_show+0xd0/0xd0
> [  431.126404]  ? aer_dev_correctable_show+0xd0/0xd0
> [  431.131118]  aer_process_err_devices+0xd0/0xe9
> [  431.135568]  aer_isr.cold.25+0x54/0xa3
> [  431.139326]  ? irq_finalize_oneshot.part.48+0xf0/0xf0
> [  431.144388]  irq_thread_fn+0x24/0x60
> [  431.147972]  irq_thread+0xea/0x170
> [  431.151380]  ? irq_forced_thread_fn+0x80/0x80
> [  431.155748]  ? irq_thread_check_affinity+0xf0/0xf0
> [  431.160548]  kthread+0x126/0x140
> [  431.163788]  ? kthread_park+0x90/0x90
> [  431.167460]  ret_from_fork+0x22/0x30
> [  444.669140] pcieport 0000:02:01.0: restoring config space at offset 
> 0x3c (was 0x100, writing 0x12010a)
> [  462.143968] pcieport 0000:02:01.0: restoring config space at offset 
> 0x2c (was 0x0, writing 0x0)
> [  462.152864] pcieport 0000:02:01.0: restoring config space at offset 
> 0x28 (was 0x0, writing 0x0)
> [  462.161752] pcieport 0000:02:01.0: restoring config space at offset 
> 0x24 (was 0x1fff1, writing 0x90e19001)
> 
> 
> This was not observed with the v1 series of patches.
> 
>  From what I can tell, this v2 series introduced a change in the 5/5 
> patch, that the v1 did not.
> 
> BTW, just to be safe and aligned with you, which kernel sources should I 
> be applying this v2 series patches to?
> 
> Let me know if you need more input.
> 
> 
> Thanks,
> Hinko
Keith Busch Jan. 5, 2021, 6:33 p.m. UTC | #3
On Tue, Jan 05, 2021 at 04:06:53PM +0100, Hinko Kocevar wrote:
> On 1/5/21 3:21 PM, Hinko Kocevar wrote:
> > On 1/5/21 12:02 AM, Keith Busch wrote:
> > > Changes from v1:
> > > 
> > >    Added received Acks
> > > 
> > >    Split the kernel print identifying the port type being reset.
> > > 
> > >    Added a patch for the portdrv to ensure the slot_reset happens without
> > >    relying on a downstream device driver..
> > > 
> > > Keith Busch (5):
> > >    PCI/ERR: Clear status of the reporting device
> > >    PCI/AER: Actually get the root port
> > >    PCI/ERR: Retain status from error notification
> > >    PCI/AER: Specify the type of port that was reset
> > >    PCI/portdrv: Report reset for frozen channel
> 
> I removed the patch 5/5 from this patch series, and after testing again, it
> makes my setup recover from the injected error; same as observed with v1
> series.

Thanks for the notice. Unfortunately that seems even more confusing to
me right now. That patch shouldn't do anything to the devices or the
driver's state; it just ensures a recovery path that was supposed to
happen anyway. The stack trace says restoring the config space completed
partially before getting stuck at the virtual channel capability, at
which point it appears to be in an infinite loop. I'll try to look into
it. The emulated devices I test with don't have the VC cap but I might
have real devices that do.
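
A minimal, illustrative sketch of the recovery path visible in the stack
trace above, for readers following the thread. It is not the actual kernel
code and not the literal content of patch 5/5; the sketch_* function names
and the frozen-channel NEED_RESET guess are assumptions based only on the
trace and the cover letter.

#include <linux/pci.h>

/*
 * Call chain seen in the soft-lockup trace: pcie_do_recovery() walks the
 * bridge's bus and invokes each bound driver's ->slot_reset(); the port
 * driver's callback restores the saved config space, and for a device
 * carrying a Virtual Channel capability that ends up in
 * pci_restore_vc_state() -> pci_vc_do_save_buffer(), where the CPU
 * appears to spin.
 */
static pci_ers_result_t sketch_port_slot_reset(struct pci_dev *dev)
{
	pci_restore_state(dev);	/* replays saved caps, including VC */
	return PCI_ERS_RESULT_RECOVERED;
}

/*
 * Hedged guess at the idea behind patch 5/5 ("Report reset for frozen
 * channel"): report a frozen channel as needing a reset so the slot_reset
 * step above runs even when no downstream device driver is bound.  The
 * actual patch may differ.
 */
static pci_ers_result_t sketch_port_error_detected(struct pci_dev *dev,
						   pci_channel_state_t state)
{
	return state == pci_channel_io_frozen ? PCI_ERS_RESULT_NEED_RESET
					      : PCI_ERS_RESULT_CAN_RECOVER;
}
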
Kelley, Sean V Jan. 5, 2021, 11:07 p.m. UTC | #4
> On Jan 5, 2021, at 10:33 AM, Keith Busch <kbusch@kernel.org> wrote:
> 
> On Tue, Jan 05, 2021 at 04:06:53PM +0100, Hinko Kocevar wrote:
>> On 1/5/21 3:21 PM, Hinko Kocevar wrote:
>>> On 1/5/21 12:02 AM, Keith Busch wrote:
>>>> Changes from v1:
>>>> 
>>>>    Added received Acks
>>>> 
>>>>    Split the kernel print identifying the port type being reset.
>>>> 
>>>>    Added a patch for the portdrv to ensure the slot_reset happens without
>>>>    relying on a downstream device driver..
>>>> 
>>>> Keith Busch (5):
>>>>    PCI/ERR: Clear status of the reporting device
>>>>    PCI/AER: Actually get the root port
>>>>    PCI/ERR: Retain status from error notification
>>>>    PCI/AER: Specify the type of port that was reset
>>>>    PCI/portdrv: Report reset for frozen channel
>> 
>> I removed the patch 5/5 from this patch series, and after testing again, it
>> makes my setup recover from the injected error; same as observed with v1
>> series.
> 
> Thanks for the notice. Unfortunately that seems even more confusing to
> me right now. That patch shouldn't do anything to the devices or the
> driver's state; it just ensures a recovery path that was supposed to
> happen anyway. The stack trace says restoring the config space completed
> partially before getting stuck at the virtual channel capability, at
> which point it appears to be in an infinite loop. I'll try to look into
> it. The emulated devices I test with don't have the VC cap but I might
> have real devices that do.

I’m not seeing the error either with V2 when testing with aer-inject using RCECs and an associated RCiEP.

Sean
Keith Busch Jan. 7, 2021, 9:42 p.m. UTC | #5
On Tue, Jan 05, 2021 at 11:07:23PM +0000, Kelley, Sean V wrote:
> > On Jan 5, 2021, at 10:33 AM, Keith Busch <kbusch@kernel.org> wrote:
> > On Tue, Jan 05, 2021 at 04:06:53PM +0100, Hinko Kocevar wrote:
> >> On 1/5/21 3:21 PM, Hinko Kocevar wrote:
> >>> On 1/5/21 12:02 AM, Keith Busch wrote:
> >>>> Changes from v1:
> >>>> 
> >>>>    Added received Acks
> >>>> 
> >>>>    Split the kernel print identifying the port type being reset.
> >>>> 
> >>>>    Added a patch for the portdrv to ensure the slot_reset happens without
> >>>>    relying on a downstream device driver..
> >>>> 
> >>>> Keith Busch (5):
> >>>>    PCI/ERR: Clear status of the reporting device
> >>>>    PCI/AER: Actually get the root port
> >>>>    PCI/ERR: Retain status from error notification
> >>>>    PCI/AER: Specify the type of port that was reset
> >>>>    PCI/portdrv: Report reset for frozen channel
> >> 
> >> I removed the patch 5/5 from this patch series, and after testing again, it
> >> makes my setup recover from the injected error; same as observed with v1
> >> series.
> > 
> > Thanks for the notice. Unfortunately that seems even more confusing to
> > me right now. That patch shouldn't do anything to the devices or the
> > driver's state; it just ensures a recovery path that was supposed to
> > happen anyway. The stack trace says restoring the config space completed
> > partially before getting stuck at the virtual channel capability, at
> > which point it appears to be in an infinite loop. I'll try to look into
> > it. The emulated devices I test with don't have the VC cap but I might
> > have real devices that do.
> 
> I’m not seeing the error either with V2 when testing with are-inject using RCECs and an associated RCiEP.

Thank you, yes, I'm also not seeing a problem on my end. The sighting is
still concerning though, so I'll keep looking. I may need to ask Hinko to
try a debug patch to help narrow down where things have gone wrong, if
that's okay.
Hinko Kocevar Jan. 8, 2021, 9:38 a.m. UTC | #6
On 1/7/21 10:42 PM, Keith Busch wrote:
> On Tue, Jan 05, 2021 at 11:07:23PM +0000, Kelley, Sean V wrote:
>>> On Jan 5, 2021, at 10:33 AM, Keith Busch <kbusch@kernel.org> wrote:
>>> On Tue, Jan 05, 2021 at 04:06:53PM +0100, Hinko Kocevar wrote:
>>>> On 1/5/21 3:21 PM, Hinko Kocevar wrote:
>>>>> On 1/5/21 12:02 AM, Keith Busch wrote:
>>>>>> Changes from v1:
>>>>>>
>>>>>>     Added received Acks
>>>>>>
>>>>>>     Split the kernel print identifying the port type being reset.
>>>>>>
>>>>>>     Added a patch for the portdrv to ensure the slot_reset happens without
>>>>>>     relying on a downstream device driver..
>>>>>>
>>>>>> Keith Busch (5):
>>>>>>     PCI/ERR: Clear status of the reporting device
>>>>>>     PCI/AER: Actually get the root port
>>>>>>     PCI/ERR: Retain status from error notification
>>>>>>     PCI/AER: Specify the type of port that was reset
>>>>>>     PCI/portdrv: Report reset for frozen channel
>>>>
>>>> I removed the patch 5/5 from this patch series, and after testing again, it
>>>> makes my setup recover from the injected error; same as observed with v1
>>>> series.
>>>
>>> Thanks for the notice. Unfortunately that seems even more confusing to
>>> me right now. That patch shouldn't do anything to the devices or the
>>> driver's state; it just ensures a recovery path that was supposed to
>>> happen anyway. The stack trace says restoring the config space completed
>>> partially before getting stuck at the virtual channel capability, at
>>> which point it appears to be in an infinite loop. I'll try to look into
>>> it. The emulated devices I test with don't have the VC cap but I might
>>> have real devices that do.
>>
>> I’m not seeing the error either with V2 when testing with are-inject using RCECs and an associated RCiEP.
> 
> Thank you, yes, I'm also not seeing a problem either on my end. The
> sighting is still concerning though, so I'll keep looking. I may have to
> request Hinko to try a debug patch to help narrow down where things have
> gone wrong if that's okay.
> 

Sure. I'm willing to help out and debug this on my side as well. Let me 
know what you need me to do!
Hinko Kocevar Jan. 11, 2021, 1:39 p.m. UTC | #7
On 1/8/21 10:38 AM, Hinko Kocevar wrote:
> 
> 
> On 1/7/21 10:42 PM, Keith Busch wrote:
>> On Tue, Jan 05, 2021 at 11:07:23PM +0000, Kelley, Sean V wrote:
>>>> On Jan 5, 2021, at 10:33 AM, Keith Busch <kbusch@kernel.org> wrote:
>>>> On Tue, Jan 05, 2021 at 04:06:53PM +0100, Hinko Kocevar wrote:
>>>>> On 1/5/21 3:21 PM, Hinko Kocevar wrote:
>>>>>> On 1/5/21 12:02 AM, Keith Busch wrote:
>>>>>>> Changes from v1:
>>>>>>>
>>>>>>>     Added received Acks
>>>>>>>
>>>>>>>     Split the kernel print identifying the port type being reset.
>>>>>>>
>>>>>>>     Added a patch for the portdrv to ensure the slot_reset 
>>>>>>> happens without
>>>>>>>     relying on a downstream device driver..
>>>>>>>
>>>>>>> Keith Busch (5):
>>>>>>>     PCI/ERR: Clear status of the reporting device
>>>>>>>     PCI/AER: Actually get the root port
>>>>>>>     PCI/ERR: Retain status from error notification
>>>>>>>     PCI/AER: Specify the type of port that was reset
>>>>>>>     PCI/portdrv: Report reset for frozen channel
>>>>>
>>>>> I removed the patch 5/5 from this patch series, and after testing 
>>>>> again, it
>>>>> makes my setup recover from the injected error; same as observed 
>>>>> with v1
>>>>> series.
>>>>
>>>> Thanks for the notice. Unfortunately that seems even more confusing to
>>>> me right now. That patch shouldn't do anything to the devices or the
>>>> driver's state; it just ensures a recovery path that was supposed to
>>>> happen anyway. The stack trace says restoring the config space 
>>>> completed
>>>> partially before getting stuck at the virtual channel capability, at
>>>> which point it appears to be in an infinite loop. I'll try to look into
>>>> it. The emulated devices I test with don't have the VC cap but I might
>>>> have real devices that do.
>>>
>>> I’m not seeing the error either with V2 when testing with are-inject 
>>> using RCECs and an associated RCiEP.
>>
>> Thank you, yes, I'm also not seeing a problem either on my end. The
>> sighting is still concerning though, so I'll keep looking. I may have to
>> request Hinko to try a debug patch to help narrow down where things have
>> gone wrong if that's okay.
>>
> 
> Sure. I'm willing to help out and debug this on my side as well. Let me 
> know what you need me to do!

Testing this series a bit more (without patch 5/5) resulted in the same 
CPU lockup:

watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [irq/122-aerdrv:128]

as I initially reported with patch 5/5 included.

It seems less frequent, though. For example, after a reboot it is not 
observed and the recovery process succeeds, whereas with patch 5/5 applied 
every recovery resulted in a CPU lockup.
Keith Busch Jan. 11, 2021, 4:37 p.m. UTC | #8
On Mon, Jan 11, 2021 at 02:39:20PM +0100, Hinko Kocevar wrote:
> Testing this patch a bit more (without the 5/5) resulted in the same CPU
> lockup:
> 
> watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [irq/122-aerdrv:128]
> 
> as I initially reported with the 5/5 of this patch included.
> 
> It seems more infrequent, though. For example, after reboot this is not
> observed and the recovery process is successful, whereas when 5/5 is also
> used every recovery resulted in CPU lockup.

I am assuming this soft lockup still occurs while restoring the downstream
port's virtual channel capability. Your initial sighting indicates that
it doesn't appear to be a deadlock, but the stack trace never exited
pci_restore_vc_state() either. I did not find any obvious issues here
just from code inspection, so if you could try applying the following
patch and send the resulting kernel message output, that would help.

---
diff --git a/drivers/pci/vc.c b/drivers/pci/vc.c
index 5fc59ac31145..4834af7eb582 100644
--- a/drivers/pci/vc.c
+++ b/drivers/pci/vc.c
@@ -28,6 +28,7 @@ static void pci_vc_save_restore_dwords(struct pci_dev *dev, int pos,
 {
 	int i;
 
+	pci_warn(dev, "%s: pos:%d dwords:%d\n", __func__, pos, dwords);
 	for (i = 0; i < dwords; i++, buf++) {
 		if (save)
 			pci_read_config_dword(dev, pos + (i * 4), buf);
@@ -110,6 +111,8 @@ static void pci_vc_enable(struct pci_dev *dev, int pos, int res)
 	if (!pci_is_pcie(dev) || !pcie_downstream_port(dev))
 		return;
 
+	pci_warn(dev, "%s: pos:%d res:%d\n", __func__, pos, res);
+
 	ctrl_pos = pos + PCI_VC_RES_CTRL + (res * PCI_CAP_VC_PER_VC_SIZEOF);
 	status_pos = pos + PCI_VC_RES_STATUS + (res * PCI_CAP_VC_PER_VC_SIZEOF);
 
@@ -165,6 +168,8 @@ static void pci_vc_enable(struct pci_dev *dev, int pos, int res)
 	if (link && !pci_wait_for_pending(link, status_pos2,
 					  PCI_VC_RES_STATUS_NEGO))
 		pci_err(link, "VC%d negotiation stuck pending\n", id);
+
+	pci_warn(dev, "%s: pos:%d res:%d return\n", __func__, pos, res);
 }
 
 /**
@@ -190,6 +195,7 @@ static int pci_vc_do_save_buffer(struct pci_dev *dev, int pos,
 	int i, len = 0;
 	u8 *buf = save_state ? (u8 *)save_state->cap.data : NULL;
 
+	pci_warn(dev, "%s: buf:%d pos:%d\n", __func__, buf != NULL, pos);
 	/* Sanity check buffer size for save/restore */
 	if (buf && save_state->cap.size !=
 	    pci_vc_do_save_buffer(dev, pos, NULL, save)) {
@@ -278,6 +284,8 @@ static int pci_vc_do_save_buffer(struct pci_dev *dev, int pos,
 		pci_read_config_dword(dev, pos + PCI_VC_RES_CAP +
 				      (i * PCI_CAP_VC_PER_VC_SIZEOF), &cap);
 		parb_offset = ((cap & PCI_VC_RES_CAP_ARB_OFF) >> 24) * 16;
+		pci_warn(dev, "%s: i:%d evcc:%d parb_offset:%d\n", __func__, i,
+			 evcc, parb_offset);
 		if (parb_offset) {
 			int size, parb_phases = 0;
 
@@ -332,6 +340,7 @@ static int pci_vc_do_save_buffer(struct pci_dev *dev, int pos,
 		len += 4;
 	}
 
+	pci_warn(dev, "%s: len:%d\n", __func__, len);
 	return buf ? 0 : len;
 }
 
@@ -399,6 +408,7 @@ void pci_restore_vc_state(struct pci_dev *dev)
 		if (!save_state || !pos)
 			continue;
 
+		pci_warn(dev, "%s: i:%d pos:%d\n", __func__, i, pos);
 		pci_vc_do_save_buffer(dev, pos, save_state, false);
 	}
 }
--
Hinko Kocevar Jan. 11, 2021, 8:02 p.m. UTC | #9
On 1/11/21 5:37 PM, Keith Busch wrote:
> On Mon, Jan 11, 2021 at 02:39:20PM +0100, Hinko Kocevar wrote:
>> Testing this patch a bit more (without the 5/5) resulted in the same CPU
>> lockup:
>>
>> watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [irq/122-aerdrv:128]
>>
>> as I initially reported with the 5/5 of this patch included.
>>
>> It seems more infrequent, though. For example, after reboot this is not
>> observed and the recovery process is successful, whereas when 5/5 is also
>> used every recovery resulted in CPU lockup.
> 
> I am assuming this soft lockup is still when restoring the downstream
> port's virtual channel capability. Your initial sighting indicates that
> it doesn't appear to be a deadlock, but the stack trace never existed
> pci_restore_vc_state() either. I did not find any obvious issues here
> just from code inspection, so if you could try applying the following
> patch and send the kernel messages output, that would help.
> 
> ---
> diff --git a/drivers/pci/vc.c b/drivers/pci/vc.c
> index 5fc59ac31145..4834af7eb582 100644
> --- a/drivers/pci/vc.c
> +++ b/drivers/pci/vc.c
> @@ -28,6 +28,7 @@ static void pci_vc_save_restore_dwords(struct pci_dev *dev, int pos,
>   {
>   	int i;
>   
> +	pci_warn(dev, "%s: pos:%d dwords:%d\n", __func__, pos, dwords);
>   	for (i = 0; i < dwords; i++, buf++) {
>   		if (save)
>   			pci_read_config_dword(dev, pos + (i * 4), buf);
> @@ -110,6 +111,8 @@ static void pci_vc_enable(struct pci_dev *dev, int pos, int res)
>   	if (!pci_is_pcie(dev) || !pcie_downstream_port(dev))
>   		return;
>   
> +	pci_warn(dev, "%s: pos:%d res:%d\n", __func__, pos, res);
> +
>   	ctrl_pos = pos + PCI_VC_RES_CTRL + (res * PCI_CAP_VC_PER_VC_SIZEOF);
>   	status_pos = pos + PCI_VC_RES_STATUS + (res * PCI_CAP_VC_PER_VC_SIZEOF);
>   
> @@ -165,6 +168,8 @@ static void pci_vc_enable(struct pci_dev *dev, int pos, int res)
>   	if (link && !pci_wait_for_pending(link, status_pos2,
>   					  PCI_VC_RES_STATUS_NEGO))
>   		pci_err(link, "VC%d negotiation stuck pending\n", id);
> +
> +	pci_warn(dev, "%s: pos:%d res:%d return\n", __func__, pos, res);
>   }
>   
>   /**
> @@ -190,6 +195,7 @@ static int pci_vc_do_save_buffer(struct pci_dev *dev, int pos,
>   	int i, len = 0;
>   	u8 *buf = save_state ? (u8 *)save_state->cap.data : NULL;
>   
> +	pci_warn(dev, "%s: buf:%d pos:%d\n", __func__, buf != NULL, pos);
>   	/* Sanity check buffer size for save/restore */
>   	if (buf && save_state->cap.size !=
>   	    pci_vc_do_save_buffer(dev, pos, NULL, save)) {
> @@ -278,6 +284,8 @@ static int pci_vc_do_save_buffer(struct pci_dev *dev, int pos,
>   		pci_read_config_dword(dev, pos + PCI_VC_RES_CAP +
>   				      (i * PCI_CAP_VC_PER_VC_SIZEOF), &cap);
>   		parb_offset = ((cap & PCI_VC_RES_CAP_ARB_OFF) >> 24) * 16;
> +		pci_warn(dev, "%s: i:%d evcc:%d parb_offset:%d\n", __func__, i,
> +			 evcc, parb_offset);
>   		if (parb_offset) {
>   			int size, parb_phases = 0;
>   
> @@ -332,6 +340,7 @@ static int pci_vc_do_save_buffer(struct pci_dev *dev, int pos,
>   		len += 4;
>   	}
>   
> +	pci_warn(dev, "%s: len:%d\n", __func__, len);
>   	return buf ? 0 : len;
>   }
>   
> @@ -399,6 +408,7 @@ void pci_restore_vc_state(struct pci_dev *dev)
>   		if (!save_state || !pos)
>   			continue;
>   
> +		pci_warn(dev, "%s: i:%d pos:%d\n", __func__, i, pos);
>   		pci_vc_do_save_buffer(dev, pos, save_state, false);
>   	}
>   }
> --
> 

Attached are the messages.

Thanks!
...
[   64.649399] xdma: loading out-of-tree module taints kernel.
[   64.656255] xdma:xdma_mod_init: Xilinx XDMA Reference Driver xdma v2020.1.8
[   64.663324] xdma:xdma_mod_init: desc_blen_max: 0xfffffff/268435455, timeout: h2c 10 c2h 10 sec.
[   64.672189] xdma:xdma_device_open: xdma device 0000:08:00.0, 0x00000000591acbe0.
[   64.680346] xdma:map_single_bar: BAR0 at 0xde000000 mapped at 0x000000000fdb12a5, length=16777216(/16777216)
[   64.690290] xdma:map_single_bar: BAR1 at 0xdf000000 mapped at 0x000000005fe214c9, length=65536(/65536)
[   64.699702] xdma:map_bars: config bar 1, pos 1.
[   64.704339] xdma:identify_bars: 2 BARs: config 1, user 0, bypass -1.
[   64.711245] xdma:pci_keep_intx_enabled: 0000:08:00.0: clear INTX_DISABLE, 0x407 -> 0x7.
[   64.719369] xdma:irq_msix_channel_setup: engine 0-H2C0-MM, irq#169.
[   64.725751] xdma:irq_msix_channel_setup: engine 0-C2H0-MM, irq#170.
[   64.732134] xdma:irq_msix_user_setup: 0-USR-0, IRQ#171 with 0x0000000016487e14
[   64.739479] xdma:irq_msix_user_setup: 0-USR-1, IRQ#172 with 0x0000000017b744b7
[   64.746813] xdma:irq_msix_user_setup: 0-USR-2, IRQ#173 with 0x000000007491af44
[   64.754149] xdma:irq_msix_user_setup: 0-USR-3, IRQ#174 with 0x000000009e92f5ff
[   64.761484] xdma:irq_msix_user_setup: 0-USR-4, IRQ#175 with 0x0000000056e09b58
[   64.768819] xdma:irq_msix_user_setup: 0-USR-5, IRQ#176 with 0x00000000af6d01f5
[   64.776159] xdma:irq_msix_user_setup: 0-USR-6, IRQ#177 with 0x000000002ef15b69
[   64.783493] xdma:irq_msix_user_setup: 0-USR-7, IRQ#178 with 0x0000000099fd5ab6
[   64.790832] xdma:irq_msix_user_setup: 0-USR-8, IRQ#179 with 0x0000000014654474
[   64.798167] xdma:irq_msix_user_setup: 0-USR-9, IRQ#180 with 0x00000000963ca534
[   64.805504] xdma:irq_msix_user_setup: 0-USR-10, IRQ#181 with 0x0000000086853299
[   64.812928] xdma:irq_msix_user_setup: 0-USR-11, IRQ#182 with 0x00000000d0d53381
[   64.820354] xdma:irq_msix_user_setup: 0-USR-12, IRQ#183 with 0x000000008ce02f44
[   64.827776] xdma:irq_msix_user_setup: 0-USR-13, IRQ#184 with 0x00000000bc4b16b8
[   64.835200] xdma:irq_msix_user_setup: 0-USR-14, IRQ#185 with 0x0000000088677e88
[   64.842623] xdma:irq_msix_user_setup: 0-USR-15, IRQ#186 with 0x000000000d9849f6
[   64.850044] xdma:probe_one: 0000:08:00.0 xdma0, pdev 0x00000000591acbe0, xdev 0x00000000c77a5002, 0x00000000947398f0, usr 16, ch 1,1.
[   64.864139] xdma:cdev_xvc_init: xcdev 0x000000004eda4704, bar 0, offset 0x40000.
[   68.523443] pcieport 0000:00:01.1: power state changed by ACPI to D0
[   68.541024] pcieport 0000:00:01.1: pci_restore_vc_state: i:1 pos:256
[   68.547489] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:1 pos:256
[   68.554208] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:0 pos:256
[   68.560929] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   68.568603] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   68.574632] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   68.582305] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   68.588338] pcieport 0000:00:01.1: restoring config space at offset 0x2c (was 0x0, writing 0x0)
[   68.597139] pcieport 0000:00:01.1: restoring config space at offset 0x28 (was 0x0, writing 0x0)
[   68.605946] pcieport 0000:00:01.1: restoring config space at offset 0x24 (was 0x1fff1, writing 0x1fff1)
[   68.615568] pcieport 0000:00:01.1: PME# disabled
[   68.739042] pcieport 0000:00:01.1: saving config space at offset 0x0 (reading 0x19058086)
[   68.747225] pcieport 0000:00:01.1: saving config space at offset 0x4 (reading 0x100407)
[   68.755222] pcieport 0000:00:01.1: saving config space at offset 0x8 (reading 0x6040005)
[   68.763310] pcieport 0000:00:01.1: saving config space at offset 0xc (reading 0x810010)
[   68.771310] pcieport 0000:00:01.1: saving config space at offset 0x10 (reading 0x0)
[   68.778974] pcieport 0000:00:01.1: saving config space at offset 0x14 (reading 0x0)
[   68.786628] pcieport 0000:00:01.1: saving config space at offset 0x18 (reading 0xe0e00)
[   68.794629] pcieport 0000:00:01.1: saving config space at offset 0x1c (reading 0xf0)
[   68.802379] pcieport 0000:00:01.1: saving config space at offset 0x20 (reading 0xfff0)
[   68.810298] pcieport 0000:00:01.1: saving config space at offset 0x24 (reading 0x1fff1)
[   68.818315] pcieport 0000:00:01.1: saving config space at offset 0x28 (reading 0x0)
[   68.825980] pcieport 0000:00:01.1: saving config space at offset 0x2c (reading 0x0)
[   68.833651] pcieport 0000:00:01.1: saving config space at offset 0x30 (reading 0x0)
[   68.841312] pcieport 0000:00:01.1: saving config space at offset 0x34 (reading 0x88)
[   68.849061] pcieport 0000:00:01.1: saving config space at offset 0x38 (reading 0x0)
[   68.856719] pcieport 0000:00:01.1: saving config space at offset 0x3c (reading 0x12010b)
[   68.864817] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:1 pos:256
[   68.871433] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:0 pos:256
[   68.878049] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   68.885620] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   68.891555] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   68.899128] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   68.905178] pcieport 0000:00:01.1: PME# enabled
[   68.921051] pcieport 0000:00:01.1: power state changed by ACPI to D3cold
[   68.927905] pcieport 0000:00:01.1: power state changed by ACPI to D0
[   68.946078] pcieport 0000:00:01.1: pci_restore_vc_state: i:1 pos:256
[   68.952545] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:1 pos:256
[   68.959265] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:0 pos:256
[   68.965984] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   68.973658] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   68.979682] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   68.987353] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   68.993381] pcieport 0000:00:01.1: restoring config space at offset 0x2c (was 0x0, writing 0x0)
[   69.002178] pcieport 0000:00:01.1: restoring config space at offset 0x28 (was 0x0, writing 0x0)
[   69.010980] pcieport 0000:00:01.1: restoring config space at offset 0x24 (was 0x1fff1, writing 0x1fff1)
[   69.020624] pcieport 0000:00:01.1: PME# disabled
[   69.084588] pcieport 0000:02:02.0: PME# disabled
[   69.101016] pcieport 0000:02:02.0: pci_restore_vc_state: i:1 pos:328
[   69.107376] pcieport 0000:02:02.0: pci_vc_do_save_buffer: buf:1 pos:328
[   69.113998] pcieport 0000:02:02.0: pci_vc_do_save_buffer: buf:0 pos:328
[   69.120646] pcieport 0000:02:02.0: pci_vc_do_save_buffer: i:0 evcc:1 parb_offset:0
[   69.128225] pcieport 0000:02:02.0: pci_vc_do_save_buffer: i:1 evcc:1 parb_offset:0
[   69.135795] pcieport 0000:02:02.0: pci_vc_do_save_buffer: len:12
[   69.141818] pcieport 0000:02:02.0: pci_vc_do_save_buffer: i:0 evcc:1 parb_offset:0
[   69.149407] pcieport 0000:02:02.0: pci_vc_do_save_buffer: i:1 evcc:1 parb_offset:0
[   69.149972] pcieport 0000:00:01.1: saving config space at offset 0x0 (reading 0x19058086)
[   69.156981] pcieport 0000:02:02.0: pci_vc_do_save_buffer: len:12
[   69.157058] pcieport 0000:02:02.0: restoring config space at offset 0x2c (was 0x0, writing 0x0)
[   69.157063] pcieport 0000:02:02.0: restoring config space at offset 0x28 (was 0x0, writing 0x0)
[   69.157067] pcieport 0000:02:02.0: restoring config space at offset 0x24 (was 0x1fff1, writing 0x1fff1)
[   69.157261] pcieport 0000:02:02.0: PME# disabled
[   69.165517] pcieport 0000:00:01.1: saving config space at offset 0x4 (reading 0x100407)
[   69.171639] pci_bus 0000:0a: busn_res: [bus 0a] is released
[   69.180336] pcieport 0000:00:01.1: saving config space at offset 0x8 (reading 0x6040005)
[   69.188880] pcieport 0000:02:08.0: PME# disabled
[   69.203089] pcieport 0000:02:08.0: pci_restore_vc_state: i:1 pos:328
[   69.203091] pcieport 0000:02:08.0: pci_vc_do_save_buffer: buf:1 pos:328
[   69.203093] pcieport 0000:02:08.0: pci_vc_do_save_buffer: buf:0 pos:328
[   69.203097] pcieport 0000:02:08.0: pci_vc_do_save_buffer: i:0 evcc:1 parb_offset:48
[   69.203100] pcieport 0000:02:08.0: pci_vc_do_save_buffer: i:1 evcc:1 parb_offset:0
[   69.203101] pcieport 0000:02:08.0: pci_vc_do_save_buffer: len:76
[   69.203108] pcieport 0000:02:08.0: pci_vc_do_save_buffer: i:0 evcc:1 parb_offset:48
[   69.211361] pcieport 0000:00:01.1: saving config space at offset 0xc (reading 0x810010)
[   69.216693] pcieport 0000:02:08.0: pci_vc_save_restore_dwords: pos:376 dwords:16
[   69.216758] pcieport 0000:02:08.0: pci_vc_do_save_buffer: i:1 evcc:1 parb_offset:0
[   69.216763] pcieport 0000:02:08.0: pci_vc_do_save_buffer: len:76
[   69.216804] pcieport 0000:02:08.0: restoring config space at offset 0x2c (was 0x0, writing 0x0)
[   69.225139] pcieport 0000:00:01.1: saving config space at offset 0x10 (reading 0x0)
[   69.229513] pcieport 0000:02:08.0: restoring config space at offset 0x28 (was 0x0, writing 0x0)
[   69.229520] pcieport 0000:02:08.0: restoring config space at offset 0x24 (was 0x1fff1, writing 0x1fff1)
[   69.229614] pcieport 0000:02:08.0: PME# disabled
[   69.229829] pci_bus 0000:0b: busn_res: [bus 0b] is released
[   69.236231] pcieport 0000:00:01.1: saving config space at offset 0x14 (reading 0x0)
[   69.242730] pcieport 0000:02:09.0: PME# disabled
[   69.249579] pcieport 0000:00:01.1: saving config space at offset 0x18 (reading 0xe0e00)
[   69.257048] pci_bus 0000:0c: dev 00, dec refcount to 0
[   69.257060] pci_bus 0000:0c: dev 00, released physical slot 9
[   69.257223] pci_bus 0000:0c: busn_res: [bus 0c] is released
[   69.264928] pcieport 0000:00:01.1: saving config space at offset 0x1c (reading 0xf0)
[   69.270998] pcieport 0000:02:0a.0: PME# disabled
[   69.278705] pcieport 0000:00:01.1: saving config space at offset 0x20 (reading 0xfff0)
[   69.302025] pcieport 0000:02:0a.0: pci_restore_vc_state: i:1 pos:328
[   69.307986] pcieport 0000:00:01.1: saving config space at offset 0x24 (reading 0x1fff1)
[   69.316417] pcieport 0000:02:0a.0: pci_vc_do_save_buffer: buf:1 pos:328
[   69.316419] pcieport 0000:02:0a.0: pci_vc_do_save_buffer: buf:0 pos:328
[   69.316428] pcieport 0000:02:0a.0: pci_vc_do_save_buffer: i:0 evcc:1 parb_offset:0
[   69.316432] pcieport 0000:02:0a.0: pci_vc_do_save_buffer: i:1 evcc:1 parb_offset:0
[   69.316433] pcieport 0000:02:0a.0: pci_vc_do_save_buffer: len:12
[   69.316446] pcieport 0000:02:0a.0: pci_vc_do_save_buffer: i:0 evcc:1 parb_offset:0
[   69.316453] pcieport 0000:02:0a.0: pci_vc_do_save_buffer: i:1 evcc:1 parb_offset:0
[   69.316460] pcieport 0000:02:0a.0: pci_vc_do_save_buffer: len:12
[   69.316514] pcieport 0000:02:0a.0: restoring config space at offset 0x2c (was 0x0, writing 0x0)
[   69.316521] pcieport 0000:02:0a.0: restoring config space at offset 0x28 (was 0x0, writing 0x0)
[   69.316525] pcieport 0000:02:0a.0: restoring config space at offset 0x24 (was 0x1fff1, writing 0x1fff1)
[   69.316660] pcieport 0000:02:0a.0: PME# disabled
[   69.324439] pcieport 0000:00:01.1: saving config space at offset 0x28 (reading 0x0)
[   69.333201] pci_bus 0000:0d: busn_res: [bus 0d] is released
[   69.342639] pcieport 0000:00:01.1: saving config space at offset 0x2c (reading 0x0)
[   69.347130] plx_dma 0000:01:00.1: PME# disabled
[   69.352951] pcieport 0000:00:01.1: saving config space at offset 0x30 (reading 0x0)
[   69.546955] pcieport 0000:00:01.1: saving config space at offset 0x34 (reading 0x88)
[   69.554803] pcieport 0000:00:01.1: saving config space at offset 0x38 (reading 0x0)
[   69.562563] pcieport 0000:00:01.1: saving config space at offset 0x3c (reading 0x12010b)
[   69.570767] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:1 pos:256
[   69.577487] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:0 pos:256
[   69.584208] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   69.591884] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   69.597910] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   69.605583] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   69.611806] pcieport 0000:00:01.1: PME# enabled
[   69.623179] plx_dma 0000:01:00.2: PME# disabled
[   69.628424] pcieport 0000:00:01.1: power state changed by ACPI to D3cold
[   69.640379] plx_dma 0000:01:00.3: PME# disabled
[   69.656336] plx_dma 0000:01:00.4: PME# disabled
[   72.886830] pcieport 0000:00:01.1: power state changed by ACPI to D0
[   72.904973] pcieport 0000:00:01.1: pci_restore_vc_state: i:1 pos:256
[   72.911333] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:1 pos:256
[   72.917963] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:0 pos:256
[   72.924583] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   72.932152] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   72.938089] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   72.945665] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   72.951596] pcieport 0000:00:01.1: restoring config space at offset 0x2c (was 0x0, writing 0x0)
[   72.960292] pcieport 0000:00:01.1: restoring config space at offset 0x28 (was 0x0, writing 0x0)
[   72.968989] pcieport 0000:00:01.1: restoring config space at offset 0x24 (was 0x1fff1, writing 0x1fff1)
[   72.978623] pcieport 0000:00:01.1: PME# disabled
[   73.099568] pcieport 0000:00:01.1: saving config space at offset 0x0 (reading 0x19058086)
[   73.107774] pcieport 0000:00:01.1: saving config space at offset 0x4 (reading 0x100407)
[   73.115786] pcieport 0000:00:01.1: saving config space at offset 0x8 (reading 0x6040005)
[   73.123878] pcieport 0000:00:01.1: saving config space at offset 0xc (reading 0x810010)
[   73.131892] pcieport 0000:00:01.1: saving config space at offset 0x10 (reading 0x0)
[   73.139559] pcieport 0000:00:01.1: saving config space at offset 0x14 (reading 0x0)
[   73.147222] pcieport 0000:00:01.1: saving config space at offset 0x18 (reading 0xe0e00)
[   73.155220] pcieport 0000:00:01.1: saving config space at offset 0x1c (reading 0xf0)
[   73.162968] pcieport 0000:00:01.1: saving config space at offset 0x20 (reading 0xfff0)
[   73.170892] pcieport 0000:00:01.1: saving config space at offset 0x24 (reading 0x1fff1)
[   73.178913] pcieport 0000:00:01.1: saving config space at offset 0x28 (reading 0x0)
[   73.186579] pcieport 0000:00:01.1: saving config space at offset 0x2c (reading 0x0)
[   73.194234] pcieport 0000:00:01.1: saving config space at offset 0x30 (reading 0x0)
[   73.201890] pcieport 0000:00:01.1: saving config space at offset 0x34 (reading 0x88)
[   73.209637] pcieport 0000:00:01.1: saving config space at offset 0x38 (reading 0x0)
[   73.217296] pcieport 0000:00:01.1: saving config space at offset 0x3c (reading 0x12010b)
[   73.225411] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:1 pos:256
[   73.232025] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:0 pos:256
[   73.238644] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   73.246217] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   73.252153] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   73.259736] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   73.265707] pcieport 0000:00:01.1: PME# enabled
[   73.282253] pcieport 0000:00:01.1: power state changed by ACPI to D3cold
[   73.289128] pcieport 0000:00:01.1: power state changed by ACPI to D0
[   73.306977] pcieport 0000:00:01.1: pci_restore_vc_state: i:1 pos:256
[   73.313335] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:1 pos:256
[   73.319962] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:0 pos:256
[   73.326583] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   73.334159] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   73.340092] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   73.347671] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   73.353617] pcieport 0000:00:01.1: restoring config space at offset 0x2c (was 0x0, writing 0x0)
[   73.362320] pcieport 0000:00:01.1: restoring config space at offset 0x28 (was 0x0, writing 0x0)
[   73.371018] pcieport 0000:00:01.1: restoring config space at offset 0x24 (was 0x1fff1, writing 0x1fff1)
[   73.380445] pcieport 0000:00:01.1: PME# disabled
[   73.430387] pcieport 0000:00:01.0: aer_inject: Injecting errors 00000000/00004000 into device 0000:00:01.0
[   73.440086] pcieport 0000:00:01.0: AER: Uncorrected (Fatal) error received: 0000:00:01.0
[   73.448186] pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
[   73.458969] pcieport 0000:00:01.0:   device [8086:1901] error status/mask=00004000/00000000
[   73.467328] pcieport 0000:00:01.0:    [14] CmpltTO
[   73.473434] xdma_error_detected: >>>
[   73.477012] xdma:xdma_error_detected: dev 0x00000000591acbe0,0x00000000c77a5002, frozen state error, reset controller
[   73.487626] xdma:xdma_device_offline: pdev 0x00000000591acbe0, xdev 0x00000000947398f0.
[   73.495738] xdma:xdma_device_offline: xdev 0x00000000947398f0, done.
[   73.509969] pcieport 0000:00:01.1: saving config space at offset 0x0 (reading 0x19058086)
[   73.518157] pcieport 0000:00:01.1: saving config space at offset 0x4 (reading 0x100407)
[   73.526165] pcieport 0000:00:01.1: saving config space at offset 0x8 (reading 0x6040005)
[   73.534267] pcieport 0000:00:01.1: saving config space at offset 0xc (reading 0x810010)
[   73.542269] pcieport 0000:00:01.1: saving config space at offset 0x10 (reading 0x0)
[   73.549931] pcieport 0000:00:01.1: saving config space at offset 0x14 (reading 0x0)
[   73.557591] pcieport 0000:00:01.1: saving config space at offset 0x18 (reading 0xe0e00)
[   73.565605] pcieport 0000:00:01.1: saving config space at offset 0x1c (reading 0xf0)
[   73.573359] pcieport 0000:00:01.1: saving config space at offset 0x20 (reading 0xfff0)
[   73.581278] pcieport 0000:00:01.1: saving config space at offset 0x24 (reading 0x1fff1)
[   73.589286] pcieport 0000:00:01.1: saving config space at offset 0x28 (reading 0x0)
[   73.596945] pcieport 0000:00:01.1: saving config space at offset 0x2c (reading 0x0)
[   73.604609] pcieport 0000:00:01.1: saving config space at offset 0x30 (reading 0x0)
[   73.612272] pcieport 0000:00:01.1: saving config space at offset 0x34 (reading 0x88)
[   73.620024] pcieport 0000:00:01.1: saving config space at offset 0x38 (reading 0x0)
[   73.627691] pcieport 0000:00:01.1: saving config space at offset 0x3c (reading 0x12010b)
[   73.635804] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:1 pos:256
[   73.642419] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:0 pos:256
[   73.649037] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   73.656602] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   73.662525] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   73.670102] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   73.676118] pcieport 0000:00:01.1: PME# enabled
[   73.692066] pcieport 0000:00:01.1: power state changed by ACPI to D3cold
[   74.509998] pcieport 0000:00:01.0: AER: Root Port link has been reset (0)
[   74.516820] pcieport 0000:01:00.0: pci_restore_vc_state: i:1 pos:328
[   74.523169] pcieport 0000:01:00.0: pci_vc_do_save_buffer: buf:1 pos:328
[   74.529783] pcieport 0000:01:00.0: pci_vc_do_save_buffer: buf:0 pos:328
[   74.536405] pcieport 0000:01:00.0: pci_vc_do_save_buffer: i:0 evcc:1 parb_offset:48
[   74.544059] pcieport 0000:01:00.0: pci_vc_do_save_buffer: i:1 evcc:1 parb_offset:0
[   74.551630] pcieport 0000:01:00.0: pci_vc_do_save_buffer: len:76
[   74.557663] pcieport 0000:01:00.0: pci_vc_do_save_buffer: i:0 evcc:1 parb_offset:48
[   74.565327] pcieport 0000:01:00.0: pci_vc_save_restore_dwords: pos:376 dwords:16
[   74.572741] pcieport 0000:01:00.0: pci_vc_do_save_buffer: i:1 evcc:1 parb_offset:0
[   74.580318] pcieport 0000:01:00.0: pci_vc_do_save_buffer: len:76
[   74.586354] pcieport 0000:01:00.0: restoring config space at offset 0x3c (was 0x100, writing 0x12010b)
[   74.595676] pcieport 0000:01:00.0: restoring config space at offset 0x2c (was 0x0, writing 0x0)
[   74.604377] pcieport 0000:01:00.0: restoring config space at offset 0x28 (was 0x0, writing 0x0)
[   74.613076] pcieport 0000:01:00.0: restoring config space at offset 0x24 (was 0x1fff1, writing 0x91119001)
[   74.622730] pcieport 0000:01:00.0: restoring config space at offset 0x20 (was 0xfff0, writing 0xdf10de00)
[   74.632300] pcieport 0000:01:00.0: restoring config space at offset 0x1c (was 0x1f1, writing 0x8121)
[   74.641427] pcieport 0000:01:00.0: restoring config space at offset 0x18 (was 0x0, writing 0xd0201)
[   74.650479] pcieport 0000:01:00.0: restoring config space at offset 0x10 (was 0x0, writing 0xdf200000)
[   74.659787] pcieport 0000:01:00.0: restoring config space at offset 0xc (was 0x810000, writing 0x810010)
[   74.669275] pcieport 0000:01:00.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100407)
[   74.678805] pcieport 0000:01:00.0: saving config space at offset 0x0 (reading 0x872510b5)
[   74.686984] pcieport 0000:01:00.0: saving config space at offset 0x4 (reading 0x100407)
[   74.694996] pcieport 0000:01:00.0: saving config space at offset 0x8 (reading 0x60400ca)
[   74.703094] pcieport 0000:01:00.0: saving config space at offset 0xc (reading 0x810010)
[   74.711114] pcieport 0000:01:00.0: saving config space at offset 0x10 (reading 0xdf200000)
[   74.719381] pcieport 0000:01:00.0: saving config space at offset 0x14 (reading 0x0)
[   74.727031] pcieport 0000:01:00.0: saving config space at offset 0x18 (reading 0xd0201)
[   74.735027] pcieport 0000:01:00.0: saving config space at offset 0x1c (reading 0x8121)
[   74.742939] pcieport 0000:01:00.0: saving config space at offset 0x20 (reading 0xdf10de00)
[   74.751201] pcieport 0000:01:00.0: saving config space at offset 0x24 (reading 0x91119001)
[   74.759472] pcieport 0000:01:00.0: saving config space at offset 0x28 (reading 0x0)
[   74.767134] pcieport 0000:01:00.0: saving config space at offset 0x2c (reading 0x0)
[   74.774788] pcieport 0000:01:00.0: saving config space at offset 0x30 (reading 0x0)
[   74.782441] pcieport 0000:01:00.0: saving config space at offset 0x34 (reading 0x40)
[   74.790188] pcieport 0000:01:00.0: saving config space at offset 0x38 (reading 0x0)
[   74.797844] pcieport 0000:01:00.0: saving config space at offset 0x3c (reading 0x12010b)
[   74.806027] pcieport 0000:01:00.0: pci_vc_do_save_buffer: buf:1 pos:328
[   74.812641] pcieport 0000:01:00.0: pci_vc_do_save_buffer: buf:0 pos:328
[   74.819255] pcieport 0000:01:00.0: pci_vc_do_save_buffer: i:0 evcc:1 parb_offset:48
[   74.826911] pcieport 0000:01:00.0: pci_vc_do_save_buffer: i:1 evcc:1 parb_offset:0
[   74.834483] pcieport 0000:01:00.0: pci_vc_do_save_buffer: len:76
[   74.840505] pcieport 0000:01:00.0: pci_vc_do_save_buffer: i:0 evcc:1 parb_offset:48
[   74.848160] pcieport 0000:01:00.0: pci_vc_save_restore_dwords: pos:376 dwords:16
[   74.855580] pcieport 0000:01:00.0: pci_vc_do_save_buffer: i:1 evcc:1 parb_offset:0
[   74.863150] pcieport 0000:01:00.0: pci_vc_do_save_buffer: len:76
[   76.390998] sched: RT throttling activated
[   78.616988] pcieport 0000:00:01.1: power state changed by ACPI to D0
[   78.636106] pcieport 0000:00:01.1: pci_restore_vc_state: i:1 pos:256
[   78.642483] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:1 pos:256
[   78.649139] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:0 pos:256
[   78.655790] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   78.663376] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   78.669325] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   78.676919] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   78.682969] pcieport 0000:00:01.1: restoring config space at offset 0x2c (was 0x0, writing 0x0)
[   78.691694] pcieport 0000:00:01.1: restoring config space at offset 0x28 (was 0x0, writing 0x0)
[   78.700420] pcieport 0000:00:01.1: restoring config space at offset 0x24 (was 0x1fff1, writing 0x1fff1)
[   78.710200] pcieport 0000:00:01.1: PME# disabled
[   78.838975] pcieport 0000:00:01.1: saving config space at offset 0x0 (reading 0x19058086)
[   78.847182] pcieport 0000:00:01.1: saving config space at offset 0x4 (reading 0x100407)
[   78.855215] pcieport 0000:00:01.1: saving config space at offset 0x8 (reading 0x6040005)
[   78.863331] pcieport 0000:00:01.1: saving config space at offset 0xc (reading 0x810010)
[   78.871363] pcieport 0000:00:01.1: saving config space at offset 0x10 (reading 0x0)
[   78.879043] pcieport 0000:00:01.1: saving config space at offset 0x14 (reading 0x0)
[   78.886728] pcieport 0000:00:01.1: saving config space at offset 0x18 (reading 0xe0e00)
[   78.894762] pcieport 0000:00:01.1: saving config space at offset 0x1c (reading 0xf0)
[   78.902529] pcieport 0000:00:01.1: saving config space at offset 0x20 (reading 0xfff0)
[   78.910469] pcieport 0000:00:01.1: saving config space at offset 0x24 (reading 0x1fff1)
[   78.918498] pcieport 0000:00:01.1: saving config space at offset 0x28 (reading 0x0)
[   78.926176] pcieport 0000:00:01.1: saving config space at offset 0x2c (reading 0x0)
[   78.933857] pcieport 0000:00:01.1: saving config space at offset 0x30 (reading 0x0)
[   78.941537] pcieport 0000:00:01.1: saving config space at offset 0x34 (reading 0x88)
[   78.949299] pcieport 0000:00:01.1: saving config space at offset 0x38 (reading 0x0)
[   78.956985] pcieport 0000:00:01.1: saving config space at offset 0x3c (reading 0x12010b)
[   78.965256] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:1 pos:256
[   78.971890] pcieport 0000:00:01.1: pci_vc_do_save_buffer: buf:0 pos:256
[   78.978537] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   78.986126] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   78.992093] pcieport 0000:00:01.1: pci_vc_do_save_buffer: i:0 evcc:0 parb_offset:0
[   78.999683] pcieport 0000:00:01.1: pci_vc_do_save_buffer: len:8
[   79.005664] pcieport 0000:00:01.1: PME# enabled
[   79.023036] pcieport 0000:00:01.1: power state changed by ACPI to D3cold


[  101.457576] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [irq/122-aerdrv:129]
[  101.465153] Modules linked in: xdma(O) aer_inject xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun ebtable_filter ebtables ip6tai
[  101.465206]  drm_kms_helper syscopyarea sysfillrect igb sysimgblt ptp fb_sys_fops libahci crc32c_intel pps_core serio_raw dca i2c_algo_bit libata drm video
[  101.566072] CPU: 2 PID: 129 Comm: irq/122-aerdrv Tainted: G S         O      5.10.0-1.keithv3.el7.x86_64 #1
[  101.575814] Hardware name: AMI AM G6x/msd/AM G6x/msd, BIOS 4.08.01 02/19/2019
[  101.582950] RIP: 0010:pci_mmcfg_read+0xb7/0xe0
[  101.587403] Code: 63 c5 48 01 d8 66 8b 00 0f b7 c0 41 89 04 24 eb dc 49 63 c5 48 01 d8 8a 00 0f b6 c0 41 89 04 24 eb cb 49 63 c5 48 01 d8 8b 00 <41> 89 04 24 eb bd e8 be f4 78 ff 5b 41 c7 04 24 ff ff ff ff b8 ea
[  101.606158] RSP: 0018:ffffc90000303b68 EFLAGS: 00000286
[  101.611391] RAX: 00000000b701000d RBX: ffffc90010208000 RCX: 0000000000000f24
[  101.618531] RDX: 00000000000000ff RSI: 0000000000000002 RDI: 0000000000000000
[  101.625673] RBP: ffffc90000303b90 R08: 0000000000000004 R09: ffffc90000303bc4
[  101.632811] R10: 0000000000000008 R11: ffffc900003038e0 R12: ffffc90000303bc4
[  101.639944] R13: 0000000000000f24 R14: 0000000000008000 R15: 0000000000000004
[  101.647082] FS:  0000000000000000(0000) GS:ffff88845dc80000(0000) knlGS:0000000000000000
[  101.655174] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  101.660920] CR2: 00007f9369637b80 CR3: 000000000240a005 CR4: 00000000003706e0
[  101.668057] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  101.675190] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  101.682323] Call Trace:
[  101.684780]  raw_pci_read+0x35/0x40
[  101.688277]  pci_read+0x2c/0x30
[  101.691428]  pci_bus_read_config_dword+0x4a/0x70
[  101.696051]  pci_read_config_dword+0x23/0x40
[  101.700331]  pci_find_next_ext_capability.part.24+0x6b/0xe0
[  101.705912]  pci_find_ext_capability.part.25+0x12/0x20
[  101.711058]  pci_find_ext_capability+0x1a/0x20
[  101.715509]  pci_restore_vc_state+0x32/0xb0
[  101.719695]  ? pcie_capability_write_word+0x32/0x90
[  101.724579]  pci_restore_state.part.44+0x177/0x420
[  101.729377]  pci_restore_state+0x18/0x20
[  101.733310]  pcie_portdrv_slot_reset+0x12/0x30
[  101.737765]  report_slot_reset+0x46/0x80
[  101.741695]  ? merge_result.part.4+0x40/0x40
[  101.745971]  pci_walk_bus+0x48/0x90
[  101.749466]  pci_walk_bridge+0x1d/0x30
[  101.753223]  pcie_do_recovery+0x1bd/0x270
[  101.757241]  ? aer_dev_correctable_show+0xd0/0xd0
[  101.761948]  ? aer_dev_correctable_show+0xd0/0xd0
[  101.766657]  aer_process_err_devices+0xd0/0xe9
[  101.771104]  aer_isr.cold.25+0x54/0xa3
[  101.774862]  ? irq_finalize_oneshot.part.48+0xf0/0xf0
[  101.779924]  irq_thread_fn+0x24/0x60
[  101.783516]  irq_thread+0xea/0x170
[  101.786926]  ? irq_forced_thread_fn+0x80/0x80
[  101.791292]  ? irq_thread_check_affinity+0xf0/0xf0
[  101.796088]  kthread+0x126/0x140
[  101.799330]  ? kthread_park+0x90/0x90
[  101.803004]  ret_from_fork+0x22/0x30
[  111.539063] pcieport 0000:02:01.0: pci_restore_vc_state: i:1 pos:328
[  111.545503] pcieport 0000:02:01.0: pci_vc_do_save_buffer: buf:1 pos:328
[  129.266656] pcieport 0000:02:01.0: pci_vc_do_save_buffer: buf:0 pos:328
[  134.720858] pcieport 0000:02:01.0: pci_vc_do_save_buffer: i:0 evcc:1 parb_offset:48
[  134.728712] pcieport 0000:02:01.0: pci_vc_do_save_buffer: i:1 evcc:1 parb_offset:0
[  136.397775] rcu: INFO: rcu_sched self-detected stall on CPU
[  136.410927] rcu:     2-....: (1 GPs behind) idle=7b2/1/0x4000000000000000 softirq=5916/5917 fqs=14413
[  136.419972]  (t=60000 jiffies g=12137 q=11828)
[  136.424422] NMI backtrace for cpu 2
[  136.427915] CPU: 2 PID: 129 Comm: irq/122-aerdrv Tainted: G S         O L    5.10.0-1.keithv3.el7.x86_64 #1
[  136.437651] Hardware name: AMI AM G6x/msd/AM G6x/msd, BIOS 4.08.01 02/19/2019
[  136.444785] Call Trace:
[  136.447237]  <IRQ>
[  136.449263]  dump_stack+0x74/0x92
[  136.452595]  nmi_cpu_backtrace.cold.8+0x32/0x68
[  136.457134]  ? lapic_can_unplug_cpu+0x80/0x80
[  136.461494]  nmi_trigger_cpumask_backtrace+0xe6/0xf0
[  136.466467]  arch_trigger_cpumask_backtrace+0x19/0x20
[  136.471526]  rcu_dump_cpu_stacks+0xb6/0xe6
[  136.475635]  rcu_sched_clock_irq.cold.92+0x124/0x3a9
[  136.480610]  ? account_system_index_time+0x94/0xa0
[  136.485417]  ? tick_sched_do_timer+0x60/0x60
[  136.489696]  update_process_times+0x93/0xd0
[  136.493892]  tick_sched_handle.isra.23+0x25/0x60
[  136.498517]  tick_sched_timer+0x6b/0x80
[  136.502360]  __hrtimer_run_queues+0x105/0x240
[  136.506726]  hrtimer_interrupt+0x101/0x220
[  136.510836]  __sysvec_apic_timer_interrupt+0x62/0xe0
[  136.515815]  asm_call_irq_on_stack+0x12/0x20
[  136.520092]  </IRQ>
[  136.522202]  sysvec_apic_timer_interrupt+0x7d/0x90
[  136.527003]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[  136.532152] RIP: 0010:vprintk_emit+0x10b/0x250
[  136.536601] Code: 1f 44 00 00 e8 e6 0b 00 00 8b 7d d4 48 8b 55 c8 89 de 4d 89 f8 4c 89 f1 e8 62 fa ff ff 89 c3 e8 eb 0b 00 00 48 8b 7d c0 57 9d <0f> 1f 44 00 00 45 84 e4 74 1f 80 3d 95 0a 6f 01 00 74 05 e8 3d d6
[  136.555356] RSP: 0018:ffffc90000303968 EFLAGS: 00000246
[  136.560583] RAX: 0000000000000046 RBX: 0000000000000046 RCX: 0000000000001d91
[  136.567718] RDX: ffffffff825d9628 RSI: 80000000ffffc9db RDI: 0000000000000246
[  136.574852] RBP: ffffc900003039b0 R08: ffffffff8246a940 R09: 0000000000000000
[  136.581992] R10: ffffffff82b21518 R11: ffffffff82b21516 R12: 0000000000000000
[  136.589126] R13: ffffffff82176221 R14: ffffffff8215574f R15: ffffc90000303a60
[  136.596267]  dev_vprintk_emit+0x13b/0x15f
[  136.600286]  dev_printk_emit+0x4e/0x65
[  136.604048]  ? pci_mmcfg_read+0xb7/0xe0
[  136.607890]  __dev_printk+0x58/0x6b
[  136.611383]  _dev_warn+0x6c/0x83
[  136.614622]  ? pci_bus_read_config_dword+0x4a/0x70
[  136.619418]  pci_vc_do_save_buffer+0x251/0x5ab
[  136.623869]  pci_vc_do_save_buffer+0x87/0x5ab
[  136.628238]  ? _dev_warn+0x6c/0x83
[  136.631642]  pci_restore_vc_state+0x7e/0xb0
[  136.635839]  pci_restore_state.part.44+0x177/0x420
[  136.640636]  pci_restore_state+0x18/0x20
[  136.644569]  pcie_portdrv_slot_reset+0x12/0x30
[  136.649017]  report_slot_reset+0x46/0x80
[  136.652945]  ? merge_result.part.4+0x40/0x40
[  136.657221]  pci_walk_bus+0x48/0x90
[  136.660715]  pci_walk_bridge+0x1d/0x30
[  136.664471]  pcie_do_recovery+0x1bd/0x270
[  136.668483]  ? aer_dev_correctable_show+0xd0/0xd0
[  136.673192]  ? aer_dev_correctable_show+0xd0/0xd0
[  136.677901]  aer_process_err_devices+0xd0/0xe9
[  136.682351]  aer_isr.cold.25+0x54/0xa3
[  136.686108]  ? irq_finalize_oneshot.part.48+0xf0/0xf0
[  136.691167]  irq_thread_fn+0x24/0x60
[  136.694754]  irq_thread+0xea/0x170
[  136.698169]  ? irq_forced_thread_fn+0x80/0x80
[  136.702533]  ? irq_thread_check_affinity+0xf0/0xf0
[  136.707330]  kthread+0x126/0x140
[  136.710572]  ? kthread_park+0x90/0x90
[  136.714247]  ret_from_fork+0x22/0x30
[  136.717927] pcieport 0000:02:01.0: pci_vc_do_save_buffer: len:76
[  161.459414] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [irq/122-aerdrv:129]
[  161.459417] Modules linked in: xdma(O) aer_inject xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun ebtable_filter ebtables ip6tai
[  161.459460]  drm_kms_helper syscopyarea sysfillrect igb sysimgblt ptp fb_sys_fops libahci crc32c_intel pps_core serio_raw dca i2c_algo_bit libata drm video
[  161.459468] CPU: 2 PID: 129 Comm: irq/122-aerdrv Tainted: G S         O L    5.10.0-1.keithv3.el7.x86_64 #1
[  161.459470] Hardware name: AMI AM G6x/msd/AM G6x/msd, BIOS 4.08.01 02/19/2019
[  161.459471] RIP: 0010:console_unlock+0x374/0x510
[  161.459475] Code: ff e8 10 f9 ff ff 85 c0 0f 85 4f ff ff ff e8 c3 f8 ff ff 85 c0 0f 85 18 fd ff ff e9 3d ff ff ff e8 71 2b 00 00 4c 89 ff 57 9d <0f> 1f 44 00 00 8b 85 44 ff ff ff 85 c0 0f 84 3a fd ff ff e8 74 03
[  161.459476] RSP: 0018:ffffc90000303898 EFLAGS: 00000247
[  161.459478] RAX: 0000000000000000 RBX: ffffffff82af03ac RCX: ffffffff815dba40
[  161.459479] RDX: 0000000000000000 RSI: ffffffff82af2420 RDI: 0000000000000247
[  161.459480] RBP: ffffc90000303958 R08: 0000019b07f91685 R09: 0000000000000002
[  161.459481] R10: 32303a3030303020 R11: 74726f7065696370 R12: 0000000000000000
[  161.459482] R13: 0000000000000000 R14: ffffffff827ecfa0 R15: 0000000000000247
[  161.459483] FS:  0000000000000000(0000) GS:ffff88845dc80000(0000) knlGS:0000000000000000
[  161.459485] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  161.459486] CR2: 00007f9369637b80 CR3: 000000000240a005 CR4: 00000000003706e0
[  161.459487] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  161.459488] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  161.459489] Call Trace:
[  161.459491]  vprintk_emit+0x14b/0x250
[  161.459493]  dev_vprintk_emit+0x13b/0x15f
[  161.459497]  dev_printk_emit+0x4e/0x65
[  161.459499]  ? pci_mmcfg_read+0xb7/0xe0
[  161.459503]  __dev_printk+0x58/0x6b
[  161.459505]  _dev_warn+0x6c/0x83
[  161.459508]  ? pci_bus_read_config_dword+0x4a/0x70
[  161.459510]  pci_vc_do_save_buffer+0x576/0x5ab
...
Keith Busch Jan. 11, 2021, 10:09 p.m. UTC | #10
On Mon, Jan 11, 2021 at 09:02:39PM +0100, Hinko Kocevar wrote:
> Attached are the messages.
> 
> Thanks!

Thank you. It kind of looks like the frequent ext capability lookup might
just be really slow for some reason. Could you try the following patch
and let me know if this improves the CPU lockup?

---
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 5c59365092fa..8a61a9365c28 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -106,6 +106,7 @@ void pci_pm_init(struct pci_dev *dev);
 void pci_ea_init(struct pci_dev *dev);
 void pci_msi_init(struct pci_dev *dev);
 void pci_msix_init(struct pci_dev *dev);
+void pci_vc_init(struct pci_dev *dev);
 void pci_allocate_cap_save_buffers(struct pci_dev *dev);
 void pci_free_cap_save_buffers(struct pci_dev *dev);
 bool pci_bridge_d3_possible(struct pci_dev *dev);
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 953f15abc850..56992a42bac6 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2401,6 +2401,7 @@ static void pci_init_capabilities(struct pci_dev *dev)
 	pci_aer_init(dev);		/* Advanced Error Reporting */
 	pci_dpc_init(dev);		/* Downstream Port Containment */
 	pci_rcec_init(dev);		/* Root Complex Event Collector */
+	pci_vc_init(dev);		/* Virtual Channel */
 
 	pcie_report_downtraining(dev);
 
diff --git a/drivers/pci/vc.c b/drivers/pci/vc.c
index 5fc59ac31145..76ac118b9b5d 100644
--- a/drivers/pci/vc.c
+++ b/drivers/pci/vc.c
@@ -357,7 +357,7 @@ int pci_save_vc_state(struct pci_dev *dev)
 		int pos, ret;
 		struct pci_cap_saved_state *save_state;
 
-		pos = pci_find_ext_capability(dev, vc_caps[i].id);
+		pos = dev->vc_caps[i];
 		if (!pos)
 			continue;
 
@@ -394,9 +394,12 @@ void pci_restore_vc_state(struct pci_dev *dev)
 		int pos;
 		struct pci_cap_saved_state *save_state;
 
-		pos = pci_find_ext_capability(dev, vc_caps[i].id);
+		pos = dev->vc_caps[i];
+		if (!pos)
+			continue;
+
 		save_state = pci_find_saved_ext_cap(dev, vc_caps[i].id);
-		if (!save_state || !pos)
+		if (!save_state)
 			continue;
 
 		pci_vc_do_save_buffer(dev, pos, save_state, false);
@@ -415,7 +418,7 @@ void pci_allocate_vc_save_buffers(struct pci_dev *dev)
 	int i;
 
 	for (i = 0; i < ARRAY_SIZE(vc_caps); i++) {
-		int len, pos = pci_find_ext_capability(dev, vc_caps[i].id);
+		int len, pos = dev->vc_caps[i];
 
 		if (!pos)
 			continue;
@@ -426,3 +429,11 @@ void pci_allocate_vc_save_buffers(struct pci_dev *dev)
 				vc_caps[i].name);
 	}
 }
+
+void pci_vc_init(struct pci_dev *dev)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(vc_caps); i++)
+		dev->vc_caps[i] = pci_find_ext_capability(dev, vc_caps[i].id);
+}
diff --git a/include/linux/pci.h b/include/linux/pci.h
index b32126d26997..7a3aa7e4d6f8 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -501,6 +501,7 @@ struct pci_dev {
 	struct pci_p2pdma *p2pdma;
 #endif
 	u16		acs_cap;	/* ACS Capability offset */
+	u16		vc_caps[3];	/* Virtual Channel capability offsets */
 	phys_addr_t	rom;		/* Physical address if not from BAR */
 	size_t		romlen;		/* Length if not from BAR */
 	char		*driver_override; /* Driver name to force a match */
--
Keith Busch Jan. 12, 2021, 7:27 p.m. UTC | #11
On Tue, Jan 12, 2021 at 06:20:55PM +0100, Hinko Kocevar wrote:
> On 1/11/21 11:09 PM, Keith Busch wrote:
>
> Here is the log after applying the patch.
> 
> What sticks out are the numerous "VC buffer not found in pci_save_vc_state"
> messages, AFAICT from vc.c pci_save_vc_state(), which I have not spotted
> before:
> 
> [dev@bd-cpu18 ~]$ dmesg | grep vc_
> [  336.960749] pcieport 0000:00:01.1: VC buffer not found in pci_save_vc_state
> [  338.125683] pcieport 0000:01:00.0: VC buffer not found in pci_save_vc_state
> [  338.342504] pcieport 0000:02:01.0: VC buffer not found in pci_save_vc_state
> [  338.569035] pcieport 0000:03:00.0: VC buffer not found in pci_save_vc_state
> [  338.775696] pcieport 0000:04:01.0: VC buffer not found in pci_save_vc_state
> [  338.982599] pcieport 0000:04:03.0: VC buffer not found in pci_save_vc_state
> [  339.189608] pcieport 0000:04:08.0: VC buffer not found in pci_save_vc_state
> [  339.406232] pcieport 0000:04:0a.0: VC buffer not found in pci_save_vc_state
> [  339.986434] pcieport 0000:04:12.0: VC buffer not found in pci_save_vc_state

Ah, that's happening because I added the cap caching after the cap
buffer allocation. The patch below, on top of the previous one, should
fix those warnings.
 
> I do not see the lockup anymore, and the recovery seems to have successfully
> been performed.

Okay, that kind of indicates the frequent capability lookups are taking
a while. We have cached other capability offsets for similar reasons in
the past, but I don't recall a lookup ever taking so long that it
triggers the CPU lockup watchdog.

---
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 56992a42bac6..a12efa87c7e0 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2385,6 +2385,7 @@ static void pci_init_capabilities(struct pci_dev *dev)
 	pci_ea_init(dev);		/* Enhanced Allocation */
 	pci_msi_init(dev);		/* Disable MSI */
 	pci_msix_init(dev);		/* Disable MSI-X */
+	pci_vc_init(dev);		/* Virtual Channel */
 
 	/* Buffers for saving PCIe and PCI-X capabilities */
 	pci_allocate_cap_save_buffers(dev);
@@ -2401,7 +2402,6 @@ static void pci_init_capabilities(struct pci_dev *dev)
 	pci_aer_init(dev);		/* Advanced Error Reporting */
 	pci_dpc_init(dev);		/* Downstream Port Containment */
 	pci_rcec_init(dev);		/* Root Complex Event Collector */
-	pci_vc_init(dev);		/* Virtual Channel */
 
 	pcie_report_downtraining(dev);
 
--
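
As an aside, if it would help to confirm that the capability walk itself
is the slow part, something along these lines could be dropped next to
the restore path (illustrative only, not part of the series; the helper
and message are made up, and it needs <linux/pci.h> and <linux/ktime.h>):

	/* Illustrative sketch: time a single VC extended capability lookup */
	static void time_vc_cap_lookup(struct pci_dev *dev)
	{
		ktime_t start = ktime_get();
		int pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_VC);
		u64 ns = ktime_to_ns(ktime_sub(ktime_get(), start));

		pci_info(dev, "VC cap lookup took %llu ns (pos %#x)\n", ns, pos);
	}

If a single walk already takes on the order of milliseconds, the
repeated lookups during save/restore would account for the delays the
watchdog is flagging.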
Hinko Kocevar Jan. 12, 2021, 10:19 p.m. UTC | #12
On 1/12/21 8:27 PM, Keith Busch wrote:
> On Tue, Jan 12, 2021 at 06:20:55PM +0100, Hinko Kocevar wrote:
>> On 1/11/21 11:09 PM, Keith Busch wrote:
>>
>> Here is the log after applying the patch.
>>
>> What sticks out are the numerous "VC buffer not found in pci_save_vc_state"
>> messages, AFAICT from vc.c pci_save_vc_state(), which I have not spotted
>> before:
>>
>> [dev@bd-cpu18 ~]$ dmesg | grep vc_
>> [  336.960749] pcieport 0000:00:01.1: VC buffer not found in pci_save_vc_state
>> [  338.125683] pcieport 0000:01:00.0: VC buffer not found in pci_save_vc_state
>> [  338.342504] pcieport 0000:02:01.0: VC buffer not found in pci_save_vc_state
>> [  338.569035] pcieport 0000:03:00.0: VC buffer not found in pci_save_vc_state
>> [  338.775696] pcieport 0000:04:01.0: VC buffer not found in pci_save_vc_state
>> [  338.982599] pcieport 0000:04:03.0: VC buffer not found in pci_save_vc_state
>> [  339.189608] pcieport 0000:04:08.0: VC buffer not found in pci_save_vc_state
>> [  339.406232] pcieport 0000:04:0a.0: VC buffer not found in pci_save_vc_state
>> [  339.986434] pcieport 0000:04:12.0: VC buffer not found in pci_save_vc_state
> 
> Ah, that's happening because I added the cap caching after the cap
> buffer allocation. The patch below on top of the previous should fix
> those warnings.

Right. With this little change applied, the warnings are gone now, and I 
can see the VC-related debug messages you introduced with the first patch.

>   
>> I do not see the lockup anymore, and the recovery seems to have successfully
>> been performed.
> 
> Okay, that kind of indicates the frequent capability lookups are taking
> a while. We cache other capability offsets for similar reasons in the
> past, but I don't recall them ever taking so long that it triggers the
> CPU lockup watchdog.

Now that the VC restore is being invoked again, I see the lockups are back.

> 
> ---
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 56992a42bac6..a12efa87c7e0 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -2385,6 +2385,7 @@ static void pci_init_capabilities(struct pci_dev *dev)
>   	pci_ea_init(dev);		/* Enhanced Allocation */
>   	pci_msi_init(dev);		/* Disable MSI */
>   	pci_msix_init(dev);		/* Disable MSI-X */
> +	pci_vc_init(dev);		/* Virtual Channel */
>   
>   	/* Buffers for saving PCIe and PCI-X capabilities */
>   	pci_allocate_cap_save_buffers(dev);
> @@ -2401,7 +2402,6 @@ static void pci_init_capabilities(struct pci_dev *dev)
>   	pci_aer_init(dev);		/* Advanced Error Reporting */
>   	pci_dpc_init(dev);		/* Downstream Port Containment */
>   	pci_rcec_init(dev);		/* Root Complex Event Collector */
> -	pci_vc_init(dev);		/* Virtual Channel */
>   
>   	pcie_report_downtraining(dev);
>   
> --
> 

I feel inclined to provide a little more info about the system I'm 
running this on, as it is not a regular PC/server/laptop. It is a modular 
MicroTCA system with a single CPU and MCH. The MCH and CPU are separate 
cards, as are the other processing cards (AMCs) that link up to the CPU 
through the MCH's PEX8748 switch. I can power each card individually or 
perform a complete system power cycle. The normal power-up sequence is: 
MCH, AMCs, CPU. The CPU is powered 30 seconds after all other cards so 
that their PCIe links are up and ready for Linux.

All buses below CPU side 02:01.0 are on MCH PEX8748 switch:

[dev@bd-cpu18 ~]$ sudo /usr/local/bin/pcicrawler -t
00:01.0 root_port, "J6B2", slot 1, device present, speed 8GT/s, width x8
  ├─01:00.0 upstream_port, PLX Technology, Inc. (10b5), device 8725
  │  ├─02:01.0 downstream_port, slot 1, device present, power: Off, 
speed 8GT/s, width x4
  │  │  └─03:00.0 upstream_port, PLX Technology, Inc. (10b5) PEX 8748 
48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (8748)
  │  │     ├─04:01.0 downstream_port, slot 4, power: Off
  │  │     ├─04:03.0 downstream_port, slot 3, power: Off
  │  │     ├─04:08.0 downstream_port, slot 5, power: Off
  │  │     ├─04:0a.0 downstream_port, slot 6, device present, power: 
Off, speed 8GT/s, width x4
  │  │     │  └─08:00.0 endpoint, Xilinx Corporation (10ee), device 8034
  │  │     └─04:12.0 downstream_port, slot 1, power: Off
  │  ├─02:02.0 downstream_port, slot 2
  │  ├─02:08.0 downstream_port, slot 8
  │  ├─02:09.0 downstream_port, slot 9, power: Off
  │  └─02:0a.0 downstream_port, slot 10
  ├─01:00.1 endpoint, PLX Technology, Inc. (10b5), device 87d0
  ├─01:00.2 endpoint, PLX Technology, Inc. (10b5), device 87d0
  ├─01:00.3 endpoint, PLX Technology, Inc. (10b5), device 87d0
  └─01:00.4 endpoint, PLX Technology, Inc. (10b5), device 87d0


The lockups most frequently appear after a cold boot of the system. If 
I restart only the CPU card and leave the MCH (where the PEX8748 switch 
resides) powered, the lockups do *not* happen. I'm injecting the same 
error into the root port, and the system card 
configuration/location/count is always the same.

Nevertheless, on rare occasions when booting the same kernel image 
after a complete system power cycle, no lockup is observed.

So far I have observed that the lockups always seem to happen when 
recovery is dealing with the 02:01.0 device/bus.

If the system recovers from the first injected error, I can repeat the 
injection and the system always recovers. If the first recovery fails, I 
have to either reboot the CPU or power cycle the complete system.

To me it looks like this behavior is somehow related to the system/setup 
I have, and for some reason is triggered by VC restoration (VC is not in 
use by my system at all, AFAIK).

Are you able to tell which part of the code the CPU is actually spinning 
in when the lockup is detected? I added many printk()s in 
pci_restore_vc_state(), in the AER IRQ handler, and around them, 
expecting to see something being printed continuously, but nothing 
appeared.

I'll continue testing your patches and try to come up with more info.

Thank you for your time!
Keith Busch Jan. 12, 2021, 11:17 p.m. UTC | #13
On Tue, Jan 12, 2021 at 11:19:37PM +0100, Hinko Kocevar wrote:
> I feel inclined to provide a little bit more info about the system I'm
> running this on as it is not a regular PC/server/laptop. It is a modular
> micro TCA system with a single CPU and MCH. MCH and CPU are separate cards,
> as are the other processing cards (AMCs) that link up to CPU through the MCH
> PEX8748 switch. I can power each card individually, or perform complete
> system power cycle. The normal power up sequence is: MCH, AMCs, CPU. The CPU
> is powered 30 sec after all other cards so that their PCIe links are up and
> ready for Linux.
> 
> All buses below CPU side 02:01.0 are on MCH PEX8748 switch:
> 
> [dev@bd-cpu18 ~]$ sudo /usr/local/bin/pcicrawler -t
> 00:01.0 root_port, "J6B2", slot 1, device present, speed 8GT/s, width x8
>  ├─01:00.0 upstream_port, PLX Technology, Inc. (10b5), device 8725
>  │  ├─02:01.0 downstream_port, slot 1, device present, power: Off, speed 8GT/s, width x4
>  │  │  └─03:00.0 upstream_port, PLX Technology, Inc. (10b5) PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (8748)
>  │  │     ├─04:01.0 downstream_port, slot 4, power: Off
>  │  │     ├─04:03.0 downstream_port, slot 3, power: Off
>  │  │     ├─04:08.0 downstream_port, slot 5, power: Off
>  │  │     ├─04:0a.0 downstream_port, slot 6, device present, power: Off, speed 8GT/s, width x4
>  │  │     │  └─08:00.0 endpoint, Xilinx Corporation (10ee), device 8034
>  │  │     └─04:12.0 downstream_port, slot 1, power: Off
>  │  ├─02:02.0 downstream_port, slot 2
>  │  ├─02:08.0 downstream_port, slot 8
>  │  ├─02:09.0 downstream_port, slot 9, power: Off
>  │  └─02:0a.0 downstream_port, slot 10
>  ├─01:00.1 endpoint, PLX Technology, Inc. (10b5), device 87d0
>  ├─01:00.2 endpoint, PLX Technology, Inc. (10b5), device 87d0
>  ├─01:00.3 endpoint, PLX Technology, Inc. (10b5), device 87d0
>  └─01:00.4 endpoint, PLX Technology, Inc. (10b5), device 87d0
> 
> 
> The lockups most frequently appear after the cold boot of the system. If I
> restart the CPU card only, and leave the MCH (where the PEX8748 switch
> resides) powered, the lockups do *not* happen. I'm injecting the same error
> into the root port and the system card configuration/location/count is
> always the same.
> 
> Nevertheless, in rare occasions while booting the same kernel image after
> complete system power cycle, no lockup is observed.
> 
> So far I observed that the lockups seem to always happen when recovery is
> dealing with the 02:01.0 device/bus.
> 
> If the system recovers from a first injected error, I can repeat the
> injection and the system recovers always. If the first recovery fails I have
> to either reboot the CPU or power cycle the complete system.
> 
> To me it looks like this behavior is somehow related to the system/setup I
> have, and for some reason is triggered by VC restoration (VC is not is use
> by my system at all, AFAIK).
 
> Are you able to tell which part of the code the CPU is actually spinning in
> when the lockup is detected? I added many printk()s in the
> pci_restore_vc_state(), in the AER IRQ handler, and around to see something
> being continuously printed, but nothing appeared..

It sounds like your setup is having difficulty completing config cycles
in a timely manner after a secondary bus reset. I don't see right now
how anything I've provided in this series could be causing that.

All the stack traces you've provided so far are within virtual channel
restoration. Subsequent stack traces are never the same, though, so it
does not appear to be permanently stuck; it's just incredibly slow. This
particular capability restoration happens to require more config cycles
than most other capabilities, so I'm guessing it shows up in your
observation because of that rather than anything specific to VC.

The long delays look like a Completion Timeout (CTO) should have kicked
in, but maybe your hardware isn't doing it right. Your lspci says
Completion Timeout configuration is not supported, so it should default
to a 50 msec maximum, but since it's taking long enough to trigger the
stuck-CPU watchdog, and you appear to be getting valid data back, it
doesn't look like a CTO is happening.
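
For reference, the Completion Timeout support bits live in the Device
Capabilities 2 register, so they can also be dumped straight from config
space if the lspci decode is in doubt. A minimal sketch (illustrative
only; the helper name is made up) could look like:

	/* Illustrative sketch: report Completion Timeout support from DevCap2 */
	static void report_cto_support(struct pci_dev *dev)
	{
		u32 cap = 0;

		pcie_capability_read_dword(dev, PCI_EXP_DEVCAP2, &cap);
		/* bits 3:0 are the supported Completion Timeout ranges */
		pci_info(dev, "CTO ranges %#x, disable %ssupported\n",
			 cap & 0xf,
			 (cap & PCI_EXP_DEVCAP2_COMP_TMOUT_DIS) ? "" : "not ");
	}

When no ranges are advertised, the spec default range applies (50 us to
50 ms), which is where the 50 msec maximum above comes from.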
Hinko Kocevar Jan. 18, 2021, 8 a.m. UTC | #14
On 1/13/21 12:17 AM, Keith Busch wrote:
> On Tue, Jan 12, 2021 at 11:19:37PM +0100, Hinko Kocevar wrote:
>> I feel inclined to provide a little bit more info about the system I'm
>> running this on as it is not a regular PC/server/laptop. It is a modular
>> micro TCA system with a single CPU and MCH. MCH and CPU are separate cards,
>> as are the other processing cards (AMCs) that link up to CPU through the MCH
>> PEX8748 switch. I can power each card individually, or perform complete
>> system power cycle. The normal power up sequence is: MCH, AMCs, CPU. The CPU
>> is powered 30 sec after all other cards so that their PCIe links are up and
>> ready for Linux.
>>
>> All buses below CPU side 02:01.0 are on MCH PEX8748 switch:
>>
>> [dev@bd-cpu18 ~]$ sudo /usr/local/bin/pcicrawler -t
>> 00:01.0 root_port, "J6B2", slot 1, device present, speed 8GT/s, width x8
>>   ├─01:00.0 upstream_port, PLX Technology, Inc. (10b5), device 8725
>>   │  ├─02:01.0 downstream_port, slot 1, device present, power: Off, speed 8GT/s, width x4
>>   │  │  └─03:00.0 upstream_port, PLX Technology, Inc. (10b5) PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (8748)
>>   │  │     ├─04:01.0 downstream_port, slot 4, power: Off
>>   │  │     ├─04:03.0 downstream_port, slot 3, power: Off
>>   │  │     ├─04:08.0 downstream_port, slot 5, power: Off
>>   │  │     ├─04:0a.0 downstream_port, slot 6, device present, power: Off, speed 8GT/s, width x4
>>   │  │     │  └─08:00.0 endpoint, Xilinx Corporation (10ee), device 8034
>>   │  │     └─04:12.0 downstream_port, slot 1, power: Off
>>   │  ├─02:02.0 downstream_port, slot 2
>>   │  ├─02:08.0 downstream_port, slot 8
>>   │  ├─02:09.0 downstream_port, slot 9, power: Off
>>   │  └─02:0a.0 downstream_port, slot 10
>>   ├─01:00.1 endpoint, PLX Technology, Inc. (10b5), device 87d0
>>   ├─01:00.2 endpoint, PLX Technology, Inc. (10b5), device 87d0
>>   ├─01:00.3 endpoint, PLX Technology, Inc. (10b5), device 87d0
>>   └─01:00.4 endpoint, PLX Technology, Inc. (10b5), device 87d0
>>
>>
>> The lockups most frequently appear after the cold boot of the system. If I
>> restart the CPU card only, and leave the MCH (where the PEX8748 switch
>> resides) powered, the lockups do *not* happen. I'm injecting the same error
>> into the root port and the system card configuration/location/count is
>> always the same.
>>
>> Nevertheless, in rare occasions while booting the same kernel image after
>> complete system power cycle, no lockup is observed.
>>
>> So far I observed that the lockups seem to always happen when recovery is
>> dealing with the 02:01.0 device/bus.
>>
>> If the system recovers from a first injected error, I can repeat the
>> injection and the system recovers always. If the first recovery fails I have
>> to either reboot the CPU or power cycle the complete system.
>>
>> To me it looks like this behavior is somehow related to the system/setup I
>> have, and for some reason is triggered by VC restoration (VC is not is use
>> by my system at all, AFAIK).
>   
>> Are you able to tell which part of the code the CPU is actually spinning in
>> when the lockup is detected? I added many printk()s in the
>> pci_restore_vc_state(), in the AER IRQ handler, and around to see something
>> being continuously printed, but nothing appeared..
> 
> It sounds like your setup is having difficulting completing config
> cycles timely after a secondary bus reset. I don't see right now how
> anything I've provided in this series is causing that.
> 
> All the stack traces you've provided so far are all within virtual
> channel restoration. Subsequent stack traces are never the same though,
> so it does not appear to be permanently stuck; it's just incredibly
> slow. This particular capability restoration happens to require more
> config cycles than most other capabilities, so I'm guessing it happens
> to show up in your observation because of that rather than anything
> specific about VC.
> 
> The long delays seem like a CTO should have kicked in, but maybe your
> hardware isn't doing it right. Your lspci says Completion Timeout
> configuration is not supported, so it should default to 50msec maximum,
> but since it's taking long enough to trigger a stuck CPU watchdog, and
> you appear to be getting valid data back, it doesn't look like CTO is
> happening.
> 

Some very good news (and some a little less good).

The good news is that after recompiling the latest PCI git kernel 
source with a heavily stripped-down config (without any extras or 
modules), I do *not* see the lockups any more. Up to now I've been using 
the bloated CentOS 7 stock config from the stock kernel to compile the 
git sources.

The not-so-good news is that I'm not sure what exactly solved the issue 
for me; it seems to be config related. I've also been touching some 
other system settings outside of Linux. Otherwise my HW configuration 
has not changed. I performed the same test steps that previously 
resulted in lockups, and none have been seen so far. I'm still looking 
into what exactly might have been causing the issues.

With that being said, I believe this patch series is working for my 
system and that the issues reported were related to my system's 
misconfiguration.


Thank you again!
Keith Busch Jan. 19, 2021, 6:28 p.m. UTC | #15
On Mon, Jan 18, 2021 at 09:00:34AM +0100, Hinko Kocevar wrote:
> With that being said, I believe that this patch series is working for my
> system and the issues reported were related to my system mis-configuration.

Great, thank you for confirming! I couldn't think of any reason this
series would introduce the behavior you were describing.

Bjorn, do you have any additional concerns?
Keith Busch Feb. 3, 2021, 12:03 a.m. UTC | #16
Hi Bjorn,

Any further concerns? I believe Hinko identified his observations as
being solely due to his platform rather than this implementation, which
should fix some outstanding uncorrectable error handling. Please let me
know if there's anything additional you'd like to see with this series.

Thanks,
Keith
Hinko Kocevar Feb. 4, 2021, 8:35 a.m. UTC | #17
Hi,

On 2/3/21 1:03 AM, Keith Busch wrote:
> Hi Bjorn,
> 
> Any further concern? I beleive Hinko idenfitied his observations as
> being solely due to his platform rather than this implementation, which

True.

Thanks,
//hinko

> should fix some outstanding uncorrectable error handling. Please let me
> know if there's anything additional you'd like to see with this series.
> 
> Thanks,
> Keith
>
Hedi Berriche Feb. 8, 2021, 12:55 p.m. UTC | #18
Acked-by: Hedi Berriche <hedi.berriche@hpe.com>
Tested-by: Hedi Berriche <hedi.berriche@hpe.com>
Bjorn Helgaas Feb. 9, 2021, 11:06 p.m. UTC | #19
[+cc Hinko]

On Mon, Jan 04, 2021 at 03:02:55PM -0800, Keith Busch wrote:
> Changes from v1:
> 
>   Added received Acks
> 
>   Split the kernel print identifying the port type being reset.
> 
>   Added a patch for the portdrv to ensure the slot_reset happens without
>   relying on a downstream device driver..
> 
> Keith Busch (5):
>   PCI/ERR: Clear status of the reporting device
>   PCI/AER: Actually get the root port
>   PCI/ERR: Retain status from error notification
>   PCI/AER: Specify the type of port that was reset
>   PCI/portdrv: Report reset for frozen channel
> 
>  drivers/pci/pcie/aer.c         |  5 +++--
>  drivers/pci/pcie/err.c         | 16 +++++++---------
>  drivers/pci/pcie/portdrv_pci.c |  3 ++-
>  3 files changed, 12 insertions(+), 12 deletions(-)

I applied these to pci/error for v5.12, thanks!

I *am* a little concerned about the issues Hinko saw because it doesn't
look like we found a root cause.  I didn't spend any time looking into
it, but even if it only shows up on his specific platform or with some
weird config combination, it's a problem.  But I guess we'll see if
anybody else trips over it.

Bjorn
Keith Busch Feb. 10, 2021, 4:05 a.m. UTC | #20
On Tue, Feb 09, 2021 at 05:06:14PM -0600, Bjorn Helgaas wrote:
> [+cc Hinko]
> 
> On Mon, Jan 04, 2021 at 03:02:55PM -0800, Keith Busch wrote:
> > Changes from v1:
> > 
> >   Added received Acks
> > 
> >   Split the kernel print identifying the port type being reset.
> > 
> >   Added a patch for the portdrv to ensure the slot_reset happens without
> >   relying on a downstream device driver..
> > 
> > Keith Busch (5):
> >   PCI/ERR: Clear status of the reporting device
> >   PCI/AER: Actually get the root port
> >   PCI/ERR: Retain status from error notification
> >   PCI/AER: Specify the type of port that was reset
> >   PCI/portdrv: Report reset for frozen channel
> > 
> >  drivers/pci/pcie/aer.c         |  5 +++--
> >  drivers/pci/pcie/err.c         | 16 +++++++---------
> >  drivers/pci/pcie/portdrv_pci.c |  3 ++-
> >  3 files changed, 12 insertions(+), 12 deletions(-)
> 
> I applied these to pci/error for v5.12, thanks!

Thanks!
 
> I *am* a little concerned about the issues Hinko saw because it
> doesn't look we found a root cause.  I didn't spend any time looking
> into it, but even if it only shows up on his specific platform or with
> some weird config combination, it's a problem.  But I guess we'll see
> if anybody else trips over it.

Yes, I'm also closely monitoring for AER issues. I think Hinko's
observation was seen without this series, and was just initially noticed
with it. I'm reasonably confident this is a safe improvement, but I want
to see this work well with everyone's hardware, too.
Yicong Yang Feb. 10, 2021, 9:36 a.m. UTC | #21
Hi,

I've tested the series on a Kunpeng 920 arm64 server and it works well.

Thanks.

On 2021/1/5 7:02, Keith Busch wrote:
> Changes from v1:
> 
>   Added received Acks
> 
>   Split the kernel print identifying the port type being reset.
> 
>   Added a patch for the portdrv to ensure the slot_reset happens without
>   relying on a downstream device driver..
> 
> Keith Busch (5):
>   PCI/ERR: Clear status of the reporting device
>   PCI/AER: Actually get the root port
>   PCI/ERR: Retain status from error notification
>   PCI/AER: Specify the type of port that was reset
>   PCI/portdrv: Report reset for frozen channel
> 
>  drivers/pci/pcie/aer.c         |  5 +++--
>  drivers/pci/pcie/err.c         | 16 +++++++---------
>  drivers/pci/pcie/portdrv_pci.c |  3 ++-
>  3 files changed, 12 insertions(+), 12 deletions(-)
>
Bjorn Helgaas Feb. 10, 2021, 9:38 p.m. UTC | #22
On Wed, Feb 10, 2021 at 01:05:04PM +0900, Keith Busch wrote:
> On Tue, Feb 09, 2021 at 05:06:14PM -0600, Bjorn Helgaas wrote:
> > On Mon, Jan 04, 2021 at 03:02:55PM -0800, Keith Busch wrote:
> > > Changes from v1:
> > > 
> > >   Added received Acks
> > > 
> > >   Split the kernel print identifying the port type being reset.
> > > 
> > >   Added a patch for the portdrv to ensure the slot_reset happens without
> > >   relying on a downstream device driver..
> > > 
> > > Keith Busch (5):
> > >   PCI/ERR: Clear status of the reporting device
> > >   PCI/AER: Actually get the root port
> > >   PCI/ERR: Retain status from error notification
> > >   PCI/AER: Specify the type of port that was reset
> > >   PCI/portdrv: Report reset for frozen channel
> > > 
> > >  drivers/pci/pcie/aer.c         |  5 +++--
> > >  drivers/pci/pcie/err.c         | 16 +++++++---------
> > >  drivers/pci/pcie/portdrv_pci.c |  3 ++-
> > >  3 files changed, 12 insertions(+), 12 deletions(-)
> > 
> > I applied these to pci/error for v5.12, thanks!
> 
> Thanks!
>  
> > I *am* a little concerned about the issues Hinko saw because it
> > doesn't look we found a root cause.  I didn't spend any time looking
> > into it, but even if it only shows up on his specific platform or with
> > some weird config combination, it's a problem.  But I guess we'll see
> > if anybody else trips over it.
> 
> Yes, I'm also closely monitoring for AER issues. I think Hinko's
> observation was seen without this series, and was just initially noticed
> with it. I'm reasonably confident this is a safe improvement, but I want
> to see this work well with everyone's hardware, too.

Oh, good.  I missed the fact that it was seen even *without* this
series.