PCI: cadence: Fixed cdns_pcie_host_link_setup return value.

Message ID	20241219081452.32035-1-18255117159@163.com (mailing list archive)
State	Not Applicable
Delegated to:	Krzysztof Wilczyński
Headers	Received: from m16.mail.163.com (m16.mail.163.com [117.135.210.2])
	by smtp.subspace.kernel.org (Postfix) with ESMTP id 808BC217672;
	Thu, 19 Dec 2024 08:15:27 +0000 (UTC)
From: Hans Zhang <18255117159@163.com>
To: lpieralisi@kernel.org
Cc: kw@linux.com, manivannan.sadhasivam@linaro.org, robh@kernel.org,
	bhelgaas@google.com, s-vadapalli@ti.com, thomas.richard@bootlin.com,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
	rockswang7@gmail.com, Hans Zhang <18255117159@163.com>
Subject: [PATCH] PCI: cadence: Fixed cdns_pcie_host_link_setup return value.
Date: Thu, 19 Dec 2024 03:14:52 -0500
Message-Id: <20241219081452.32035-1-18255117159@163.com>
Precedence: bulk

Hans Zhang Dec. 19, 2024, 8:14 a.m. UTC

If the PCIe link never came up, the enumeration process
should not be run.

Signed-off-by: Hans Zhang <18255117159@163.com>
---
 drivers/pci/controller/cadence/pcie-cadence-host.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Siddharth Vadapalli Dec. 19, 2024, 8:33 a.m. UTC | #1

On Thu, Dec 19, 2024 at 03:14:52AM -0500, Hans Zhang wrote:
> If the PCIe link never came up, the enumeration process
> should not be run.

The link could come up at a later point in time. Please refer to the
implementation of:
dw_pcie_host_init() in drivers/pci/controller/dwc/pcie-designware-host.c
wherein we have the following:
	/* Ignore errors, the link may come up later */
	dw_pcie_wait_for_link(pci);

It seems to me that the logic behind ignoring the absence of the link
within cdns_pcie_host_link_setup() instead of erroring out, is similar to
that of dw_pcie_wait_for_link().
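
To illustrate the pattern being described, here is a minimal userspace
sketch, not the kernel code itself. All names prefixed with fake_ and the
retry count are hypothetical stand-ins; only the idea (wait for the link,
then continue regardless of the result) comes from the comment quoted above.

```c
#include <stdbool.h>

#define FAKE_LINK_WAIT_RETRIES 10   /* hypothetical retry budget */
#define FAKE_ETIMEDOUT         110  /* value of ETIMEDOUT on Linux */

/* Hypothetical model of a root complex whose link trains after some polls. */
struct fake_rc {
	int polls_until_up;         /* negative: the link never comes up */
};

static bool fake_link_up(struct fake_rc *rc)
{
	if (rc->polls_until_up < 0)
		return false;
	if (rc->polls_until_up == 0)
		return true;
	rc->polls_until_up--;
	return false;
}

/* Poll for link-up a bounded number of times; time out otherwise. */
static int fake_wait_for_link(struct fake_rc *rc)
{
	for (int i = 0; i < FAKE_LINK_WAIT_RETRIES; i++) {
		if (fake_link_up(rc))
			return 0;
	}
	return -FAKE_ETIMEDOUT;
}

/* Host init that mirrors the "ignore errors" approach. */
static int fake_host_init(struct fake_rc *rc, bool *enumerated)
{
	/* Ignore errors, the link may come up later */
	(void)fake_wait_for_link(rc);

	*enumerated = true;         /* enumeration runs either way */
	return 0;
}
```

With this structure, a port with nothing attached still completes host
initialization, and a device that trains later can be found by a rescan.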

Regards,
Siddharth.

Hans Zhang Dec. 19, 2024, 8:49 a.m. UTC | #2

On 12/19/24 03:33, Siddharth Vadapalli wrote:
> On Thu, Dec 19, 2024 at 03:14:52AM -0500, Hans Zhang wrote:
>> If the PCIe link never came up, the enumeration process
>> should not be run.
> The link could come up at a later point in time. Please refer to the
> implementation of:
> dw_pcie_host_init() in drivers/pci/controller/dwc/pcie-designware-host.c
> wherein we have the following:
> 	/* Ignore errors, the link may come up later */
> 	dw_pcie_wait_for_link(pci);
>
> It seems to me that the logic behind ignoring the absence of the link
> within cdns_pcie_host_link_setup() instead of erroring out, is similar to
> that of dw_pcie_wait_for_link().
>
> Regards,
> Siddharth.
>
>
> If a PCIe port is not connected to a device, the PCIe link does not
> come up. The current code returns success whether a device is connected
> or not. The Cadence IP's ECAM requires the LTSSM to be in L0 before the
> RC's config space registers can be accessed. Otherwise the enumeration
> process will hang.
>
> Regards,
> Hans

Siddharth Vadapalli Dec. 19, 2024, 8:59 a.m. UTC | #3

On Thu, Dec 19, 2024 at 03:49:33AM -0500, Hans Zhang wrote:
> 
> On 12/19/24 03:33, Siddharth Vadapalli wrote:
> > On Thu, Dec 19, 2024 at 03:14:52AM -0500, Hans Zhang wrote:
> > > If the PCIe link never came up, the enumeration process
> > > should not be run.
> > The link could come up at a later point in time. Please refer to the
> > implementation of:
> > dw_pcie_host_init() in drivers/pci/controller/dwc/pcie-designware-host.c
> > wherein we have the following:
> > 	/* Ignore errors, the link may come up later */
> > 	dw_pcie_wait_for_link(pci);
> > 
> > It seems to me that the logic behind ignoring the absence of the link
> > within cdns_pcie_host_link_setup() instead of erroring out, is similar to
> > that of dw_pcie_wait_for_link().
> > 
> > Regards,
> > Siddharth.
> > 
> > 
> > If a PCIe port is not connected to a device. The PCIe link does not
> > go up. The current code returns success whether the device is connected
> > or not. Cadence IP's ECAM requires an LTSSM at L0 to access the RC's
> > config space registers. Otherwise the enumeration process will hang.

The ">" symbols seem to be manually added in your reply and are also
incorrect. If you have added them manually, please don't add them at the
start of the sentences corresponding to your reply.

The issue you are facing seems to be specific to the Cadence IP or the way
in which the IP has been integrated into the device that you are using.
On TI SoCs which have the Cadence PCIe Controller, absence of PCIe devices
doesn't result in a hang. Enumeration should proceed irrespective of the
presence of PCIe devices and should indicate their absence when they aren't
connected.
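
As a rough illustration of what "indicate their absence" means on a
well-behaved controller: a config read to a function that is not there
completes with all ones, and the scan treats that as "no device". The
sketch below is a simplified userspace model with hypothetical names; the
check in the kernel's PCI core is more elaborate than this.

```c
#include <stdbool.h>
#include <stdint.h>

/* All-ones is what a config read of a non-existent function returns. */
#define FAKE_NO_DEVICE 0xFFFFFFFFu

/* True if the Vendor ID/Device ID dword looks like a real device. */
static bool fake_id_indicates_device(uint32_t id)
{
	return id != FAKE_NO_DEVICE;
}

/* Count the slots whose ID dword indicates a device is present. */
static int fake_count_devices(const uint32_t *ids, int n)
{
	int found = 0;

	for (int i = 0; i < n; i++) {
		if (fake_id_indicates_device(ids[i]))
			found++;
	}
	return found;
}
```

On hardware where the read stalls instead of returning all ones, this scan
never gets its answer, which matches the hang reported in this thread.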

While I am not denying the issue being seen, the fix should probably be
done elsewhere.

Regards,
Siddharth.

Manivannan Sadhasivam Dec. 19, 2024, 9:49 a.m. UTC | #4

On Thu, Dec 19, 2024 at 04:38:11AM -0500, Hans Zhang wrote:
> 
> On 12/19/24 03:59, Siddharth Vadapalli wrote:
> > On Thu, Dec 19, 2024 at 03:49:33AM -0500, Hans Zhang wrote:
> > > On 12/19/24 03:33, Siddharth Vadapalli wrote:
> > > > On Thu, Dec 19, 2024 at 03:14:52AM -0500, Hans Zhang wrote:
> > > > > If the PCIe link never came up, the enumeration process
> > > > > should not be run.
> > > > The link could come up at a later point in time. Please refer to the
> > > > implementation of:
> > > > dw_pcie_host_init() in drivers/pci/controller/dwc/pcie-designware-host.c
> > > > wherein we have the following:
> > > > 	/* Ignore errors, the link may come up later */
> > > > 	dw_pcie_wait_for_link(pci);
> > > > 
> > > > It seems to me that the logic behind ignoring the absence of the link
> > > > within cdns_pcie_host_link_setup() instead of erroring out, is similar to
> > > > that of dw_pcie_wait_for_link().
> > > > 
> > > > Regards,
> > > > Siddharth.
> > > > 
> > > > 
> > > > If a PCIe port is not connected to a device. The PCIe link does not
> > > > go up. The current code returns success whether the device is connected
> > > > or not. Cadence IP's ECAM requires an LTSSM at L0 to access the RC's
> > > > config space registers. Otherwise the enumeration process will hang.
> > The ">" symbols seem to be manually added in your reply and are also
> > incorrect. If you have added them manually, please don't add them at the
> > start of the sentences corresponding to your reply.
> > 
> > The issue you are facing seems to be specific to the Cadence IP or the way
> > in which the IP has been integrated into the device that you are using.
> > On TI SoCs which have the Cadence PCIe Controller, absence of PCIe devices
> > doesn't result in a hang. Enumeration should proceed irrespective of the
> > presence of PCIe devices and should indicate their absence when they aren't
> > connected.
> > 
> > While I am not denying the issue being seen, the fix should probably be
> > done elsewhere.
> > 
> > Regards,
> > Siddharth.
> We are the SOC design company and we have confirmed with the designer and
> Cadence. For the Cadence's IP we are using, ECAM must be L0 at LTSSM to be
> used. Cadence will fixed next RTL version.
> 

I don't understand what you mean by 'ECAM must be L0 at LTSSM'. If you do not
connect the device, LTSSM would still be in 'detect' state until the device is
connected. Is that different on your SoC?

> If the cdns_pcie_host_link_setup return value is not modified. The following
> is the
> log of the enumeration process without connected devices. There will be hang
> for
> more than 300 seconds. So I don't think it makes sense to run the
> enumeration
> process without connecting devices. And it will affect the boot time.
> 

We don't know your driver, so cannot comment on the issue without understanding
the problem, sorry.

- Mani

> [ 2.681770] xxx pcie: xxx_pcie_probe starting!
> [ 2.689537] xxx pcie: host bridge /soc@0/pcie@xxx ranges:
> [ 2.698601] xxx pcie: IO 0x0060100000..0x00601fffff -> 0x0060100000
> [ 2.708625] xxx pcie: MEM 0x0060200000..0x007fffffff -> 0x0060200000
> [ 2.718649] xxx pcie: MEM 0x1800000000..0x1bffffffff -> 0x1800000000
> [ 2.744441] xxx pcie: ioremap rcsu, paddr:[mem 0x0a000000-0x0a00ffff],
> vaddr:ffff800089390000
> [ 2.756230] xxx pcie: ioremap msg, paddr:[mem 0x60000000-0x600fffff],
> vaddr:ffff800089800000
> [ 2.769692] xxx pcie: ECAM at [mem 0x2c000000-0x2fffffff] for [bus c0-ff]
> [ 2.780139] xxx.pcie_phy: pcie_phy_common_init end
> [ 2.788900] xxx pcie: waiting PHY is ready! retries = 2
> [ 3.905292] xxx pcie: Link fail, retries 10 times
> [ 3.915054] xxx pcie: ret=-110, rc->quirk_retrain_flag = 0
> [ 3.923848] xxx pcie: ret=-110, rc->quirk_retrain_flag = 0
> [ 3.932669] xxx pcie: PCI host bridge to bus 0000:c0
> [ 3.940847] pci_bus 0000:c0: root bus resource [bus c0-ff]
> [ 3.948322] pci_bus 0000:c0: root bus resource [io 0x0000-0xfffff] (bus
> address [0x60100000-0x601fffff])
> [ 3.959922] pci_bus 0000:c0: root bus resource [mem 0x60200000-0x7fffffff]
> [ 3.968799] pci_bus 0000:c0: root bus resource [mem
> 0x1800000000-0x1bffffffff pref]
> [ 339.667761] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> [ 339.677449] rcu: 5-...0: (20 ticks this GP) idle=4d94/1/0x4000000000000000
> softirq=20/20 fqs=2623
> [ 339.688184] (detected by 2, t=5253 jiffies, g=-1119, q=2 ncpus=12)
> [ 339.696193] Sending NMI from CPU 2 to CPUs 5:
> [ 349.703670] rcu: rcu_preempt kthread timer wakeup didn't happen for 2509
> jiffies! g-1119 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
> [ 349.718710] rcu: Possible timer handling issue on cpu=2 timer-softirq=1208
> [ 349.727418] rcu: rcu_preempt kthread starved for 2515 jiffies! g-1119 f0x0
> RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=2
> [ 349.739642] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM
> is now expected behavior.
> [ 349.750546] rcu: RCU grace-period kthread stack dump:
> [ 349.757319] task:rcu_preempt state:I stack:0 pid:14 ppid:2
> flags:0x00000008
> [ 349.767439] Call trace:
> [ 349.771575] __switch_to+0xdc/0x150
> [ 349.776777] __schedule+0x2dc/0x7d0
> [ 349.781972] schedule+0x5c/0x100
> [ 349.786903] schedule_timeout+0x8c/0x100
> [ 349.792538] rcu_gp_fqs_loop+0x140/0x420
> [ 349.798176] rcu_gp_kthread+0x134/0x164
> [ 349.803725] kthread+0x108/0x10c
> [ 349.808657] ret_from_fork+0x10/0x20
> [ 349.813942] rcu: Stack dump where RCU GP kthread last ran:
> [ 349.821156] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G S xxx-build-generic
> #8
> [ 349.831887] Hardware name: xxx Reference Board, BIOS xxx
> [ 349.843583] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS
> BTYPE=--)
> [ 349.852294] pc : arch_cpu_idle+0x18/0x2c
> [ 349.857928] lr : arch_cpu_idle+0x14/0x2c
> 
> Regards Hans
>

Hans Zhang Dec. 19, 2024, 10:04 a.m. UTC | #5

On 12/19/24 03:59, Siddharth Vadapalli wrote:
> On Thu, Dec 19, 2024 at 03:49:33AM -0500, Hans Zhang wrote:
>>
>> On 12/19/24 03:33, Siddharth Vadapalli wrote:
>>> On Thu, Dec 19, 2024 at 03:14:52AM -0500, Hans Zhang wrote:
>>>> If the PCIe link never came up, the enumeration process
>>>> should not be run.
>>> The link could come up at a later point in time. Please refer to the
>>> implementation of:
>>> dw_pcie_host_init() in drivers/pci/controller/dwc/pcie-designware-host.c
>>> wherein we have the following:
>>> 	/* Ignore errors, the link may come up later */
>>> 	dw_pcie_wait_for_link(pci);
>>>
>>> It seems to me that the logic behind ignoring the absence of the link
>>> within cdns_pcie_host_link_setup() instead of erroring out, is similar to
>>> that of dw_pcie_wait_for_link().
>>>
>>> Regards,
>>> Siddharth.
>>>
>>>
>>> If a PCIe port is not connected to a device. The PCIe link does not
>>> go up. The current code returns success whether the device is connected
>>> or not. Cadence IP's ECAM requires an LTSSM at L0 to access the RC's
>>> config space registers. Otherwise the enumeration process will hang.
> 
> The ">" symbols seem to be manually added in your reply and are also
> incorrect. If you have added them manually, please don't add them at the
> start of the sentences corresponding to your reply.
> 
> The issue you are facing seems to be specific to the Cadence IP or the way
> in which the IP has been integrated into the device that you are using.
> On TI SoCs which have the Cadence PCIe Controller, absence of PCIe devices
> doesn't result in a hang. Enumeration should proceed irrespective of the
> presence of PCIe devices and should indicate their absence when they aren't
> connected.
> 
> While I am not denying the issue being seen, the fix should probably be
> done elsewhere.
> 
> Regards,
> Siddharth.


We are the SoC design company, and we have confirmed this with the
designer and Cadence. For the Cadence IP we are using, ECAM must be L0 at
LTSSM to be used. Cadence will fix this in the next RTL version.

If the cdns_pcie_host_link_setup return value is not modified, the
following is the log of the enumeration process with no device connected.
There is a hang of more than 300 seconds. So I don't think it makes sense
to run the enumeration process when no device is connected, and it also
affects the boot time.

[ 2.681770] xxx pcie: xxx_pcie_probe starting!
[ 2.689537] xxx pcie: host bridge /soc@0/pcie@xxx ranges:
[ 2.698601] xxx pcie: IO 0x0060100000..0x00601fffff -> 0x0060100000
[ 2.708625] xxx pcie: MEM 0x0060200000..0x007fffffff -> 0x0060200000
[ 2.718649] xxx pcie: MEM 0x1800000000..0x1bffffffff -> 0x1800000000
[ 2.744441] xxx pcie: ioremap rcsu, paddr:[mem 0x0a000000-0x0a00ffff], 
vaddr:ffff800089390000
[ 2.756230] xxx pcie: ioremap msg, paddr:[mem 0x60000000-0x600fffff], 
vaddr:ffff800089800000
[ 2.769692] xxx pcie: ECAM at [mem 0x2c000000-0x2fffffff] for [bus c0-ff]
[ 2.780139] xxx.pcie_phy: pcie_phy_common_init end
[ 2.788900] xxx pcie: waiting PHY is ready! retries = 2
[ 3.905292] xxx pcie: Link fail, retries 10 times
[ 3.915054] xxx pcie: ret=-110, rc->quirk_retrain_flag = 0
[ 3.923848] xxx pcie: ret=-110, rc->quirk_retrain_flag = 0
[ 3.932669] xxx pcie: PCI host bridge to bus 0000:c0
[ 3.940847] pci_bus 0000:c0: root bus resource [bus c0-ff]
[ 3.948322] pci_bus 0000:c0: root bus resource [io 0x0000-0xfffff] (bus 
address [0x60100000-0x601fffff])
[ 3.959922] pci_bus 0000:c0: root bus resource [mem 0x60200000-0x7fffffff]
[ 3.968799] pci_bus 0000:c0: root bus resource [mem 
0x1800000000-0x1bffffffff pref]
[ 339.667761] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 339.677449] rcu: 5-...0: (20 ticks this GP) 
idle=4d94/1/0x4000000000000000 softirq=20/20 fqs=2623
[ 339.688184] (detected by 2, t=5253 jiffies, g=-1119, q=2 ncpus=12)
[ 339.696193] Sending NMI from CPU 2 to CPUs 5:
[ 349.703670] rcu: rcu_preempt kthread timer wakeup didn't happen for 
2509 jiffies! g-1119 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[ 349.718710] rcu: Possible timer handling issue on cpu=2 timer-softirq=1208
[ 349.727418] rcu: rcu_preempt kthread starved for 2515 jiffies! g-1119 
f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=2
[ 349.739642] rcu: Unless rcu_preempt kthread gets sufficient CPU time, 
OOM is now expected behavior.
[ 349.750546] rcu: RCU grace-period kthread stack dump:
[ 349.757319] task:rcu_preempt state:I stack:0 pid:14 ppid:2 
flags:0x00000008
[ 349.767439] Call trace:
[ 349.771575] __switch_to+0xdc/0x150
[ 349.776777] __schedule+0x2dc/0x7d0
[ 349.781972] schedule+0x5c/0x100
[ 349.786903] schedule_timeout+0x8c/0x100
[ 349.792538] rcu_gp_fqs_loop+0x140/0x420
[ 349.798176] rcu_gp_kthread+0x134/0x164
[ 349.803725] kthread+0x108/0x10c
[ 349.808657] ret_from_fork+0x10/0x20
[ 349.813942] rcu: Stack dump where RCU GP kthread last ran:
[ 349.821156] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G S 
xxx-build-generic #8
[ 349.831887] Hardware name: xxx Reference Board, BIOS xxx
[ 349.843583] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 349.852294] pc : arch_cpu_idle+0x18/0x2c
[ 349.857928] lr : arch_cpu_idle+0x14/0x2c

Regards
Hans

Hans Zhang Dec. 19, 2024, 10:29 a.m. UTC | #6

On 12/19/24 04:49, Manivannan Sadhasivam wrote:
> On Thu, Dec 19, 2024 at 04:38:11AM -0500, Hans Zhang wrote:
>>
>> On 12/19/24 03:59, Siddharth Vadapalli wrote:
>>> On Thu, Dec 19, 2024 at 03:49:33AM -0500, Hans Zhang wrote:
>>>> On 12/19/24 03:33, Siddharth Vadapalli wrote:
>>>>> On Thu, Dec 19, 2024 at 03:14:52AM -0500, Hans Zhang wrote:
>>>>>> If the PCIe link never came up, the enumeration process
>>>>>> should not be run.
>>>>> The link could come up at a later point in time. Please refer to the
>>>>> implementation of:
>>>>> dw_pcie_host_init() in drivers/pci/controller/dwc/pcie-designware-host.c
>>>>> wherein we have the following:
>>>>> 	/* Ignore errors, the link may come up later */
>>>>> 	dw_pcie_wait_for_link(pci);
>>>>>
>>>>> It seems to me that the logic behind ignoring the absence of the link
>>>>> within cdns_pcie_host_link_setup() instead of erroring out, is similar to
>>>>> that of dw_pcie_wait_for_link().
>>>>>
>>>>> Regards,
>>>>> Siddharth.
>>>>>
>>>>>
>>>>> If a PCIe port is not connected to a device. The PCIe link does not
>>>>> go up. The current code returns success whether the device is connected
>>>>> or not. Cadence IP's ECAM requires an LTSSM at L0 to access the RC's
>>>>> config space registers. Otherwise the enumeration process will hang.
>>> The ">" symbols seem to be manually added in your reply and are also
>>> incorrect. If you have added them manually, please don't add them at the
>>> start of the sentences corresponding to your reply.
>>>
>>> The issue you are facing seems to be specific to the Cadence IP or the way
>>> in which the IP has been integrated into the device that you are using.
>>> On TI SoCs which have the Cadence PCIe Controller, absence of PCIe devices
>>> doesn't result in a hang. Enumeration should proceed irrespective of the
>>> presence of PCIe devices and should indicate their absence when they aren't
>>> connected.
>>>
>>> While I am not denying the issue being seen, the fix should probably be
>>> done elsewhere.
>>>
>>> Regards,
>>> Siddharth.
>> We are the SOC design company and we have confirmed with the designer and
>> Cadence. For the Cadence's IP we are using, ECAM must be L0 at LTSSM to be
>> used. Cadence will fixed next RTL version.
>>
> 
> I don't understand what you mean by 'ECAM must be L0 at LTSSM'. If you do not
> connect the device, LTSSM would still be in 'detect' state until the device is
> connected. Is that different on your SoC?
> 
>> If the cdns_pcie_host_link_setup return value is not modified. The following
>> is the
>> log of the enumeration process without connected devices. There will be hang
>> for
>> more than 300 seconds. So I don't think it makes sense to run the
>> enumeration
>> process without connecting devices. And it will affect the boot time.
>>
> 
> We don't know your driver, so cannot comment on the issue without understanding
> the problem, sorry.
> 
> - Mani
> 
>> [ 2.681770] xxx pcie: xxx_pcie_probe starting!
>> [ 2.689537] xxx pcie: host bridge /soc@0/pcie@xxx ranges:
>> [ 2.698601] xxx pcie: IO 0x0060100000..0x00601fffff -> 0x0060100000
>> [ 2.708625] xxx pcie: MEM 0x0060200000..0x007fffffff -> 0x0060200000
>> [ 2.718649] xxx pcie: MEM 0x1800000000..0x1bffffffff -> 0x1800000000
>> [ 2.744441] xxx pcie: ioremap rcsu, paddr:[mem 0x0a000000-0x0a00ffff],
>> vaddr:ffff800089390000
>> [ 2.756230] xxx pcie: ioremap msg, paddr:[mem 0x60000000-0x600fffff],
>> vaddr:ffff800089800000
>> [ 2.769692] xxx pcie: ECAM at [mem 0x2c000000-0x2fffffff] for [bus c0-ff]
>> [ 2.780139] xxx.pcie_phy: pcie_phy_common_init end
>> [ 2.788900] xxx pcie: waiting PHY is ready! retries = 2
>> [ 3.905292] xxx pcie: Link fail, retries 10 times
>> [ 3.915054] xxx pcie: ret=-110, rc->quirk_retrain_flag = 0
>> [ 3.923848] xxx pcie: ret=-110, rc->quirk_retrain_flag = 0
>> [ 3.932669] xxx pcie: PCI host bridge to bus 0000:c0
>> [ 3.940847] pci_bus 0000:c0: root bus resource [bus c0-ff]
>> [ 3.948322] pci_bus 0000:c0: root bus resource [io 0x0000-0xfffff] (bus
>> address [0x60100000-0x601fffff])
>> [ 3.959922] pci_bus 0000:c0: root bus resource [mem 0x60200000-0x7fffffff]
>> [ 3.968799] pci_bus 0000:c0: root bus resource [mem
>> 0x1800000000-0x1bffffffff pref]
>> [ 339.667761] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
>> [ 339.677449] rcu: 5-...0: (20 ticks this GP) idle=4d94/1/0x4000000000000000
>> softirq=20/20 fqs=2623
>> [ 339.688184] (detected by 2, t=5253 jiffies, g=-1119, q=2 ncpus=12)
>> [ 339.696193] Sending NMI from CPU 2 to CPUs 5:
>> [ 349.703670] rcu: rcu_preempt kthread timer wakeup didn't happen for 2509
>> jiffies! g-1119 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
>> [ 349.718710] rcu: Possible timer handling issue on cpu=2 timer-softirq=1208
>> [ 349.727418] rcu: rcu_preempt kthread starved for 2515 jiffies! g-1119 f0x0
>> RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=2
>> [ 349.739642] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM
>> is now expected behavior.
>> [ 349.750546] rcu: RCU grace-period kthread stack dump:
>> [ 349.757319] task:rcu_preempt state:I stack:0 pid:14 ppid:2
>> flags:0x00000008
>> [ 349.767439] Call trace:
>> [ 349.771575] __switch_to+0xdc/0x150
>> [ 349.776777] __schedule+0x2dc/0x7d0
>> [ 349.781972] schedule+0x5c/0x100
>> [ 349.786903] schedule_timeout+0x8c/0x100
>> [ 349.792538] rcu_gp_fqs_loop+0x140/0x420
>> [ 349.798176] rcu_gp_kthread+0x134/0x164
>> [ 349.803725] kthread+0x108/0x10c
>> [ 349.808657] ret_from_fork+0x10/0x20
>> [ 349.813942] rcu: Stack dump where RCU GP kthread last ran:
>> [ 349.821156] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G S xxx-build-generic
>> #8
>> [ 349.831887] Hardware name: xxx Reference Board, BIOS xxx
>> [ 349.843583] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS
>> BTYPE=--)
>> [ 349.852294] pc : arch_cpu_idle+0x18/0x2c
>> [ 349.857928] lr : arch_cpu_idle+0x14/0x2c
>>
>> Regards Hans
>>
> 

I am very sorry. My previous email was rejected because it contained HTML,
so I have resent it twice.


> I don't understand what you mean by 'ECAM must be L0 at LTSSM'. If you do not
> connect the device, LTSSM would still be in 'detect' state until the device is
> connected. Is that different on your SoC?

If a PCIe port is not connected to a device, pci_host_probe still runs and
performs the enumeration process. During enumeration, VID and PID are
read. If the LTSSM is not in L0, the AXI transaction issued by the CPU is
never sent, that is, the AXI slave hangs. This is the problem with the
Cadence IP we are using.

Regards
Hans

Manivannan Sadhasivam Dec. 19, 2024, 11:20 a.m. UTC | #7

On Thu, Dec 19, 2024 at 05:29:01AM -0500, Hans Zhang wrote:
> 
> 
> On 12/19/24 04:49, Manivannan Sadhasivam wrote:
> > On Thu, Dec 19, 2024 at 04:38:11AM -0500, Hans Zhang wrote:
> > > 
> > > On 12/19/24 03:59, Siddharth Vadapalli wrote:
> > > > On Thu, Dec 19, 2024 at 03:49:33AM -0500, Hans Zhang wrote:
> > > > > On 12/19/24 03:33, Siddharth Vadapalli wrote:
> > > > > > On Thu, Dec 19, 2024 at 03:14:52AM -0500, Hans Zhang wrote:
> > > > > > > If the PCIe link never came up, the enumeration process
> > > > > > > should not be run.
> > > > > > The link could come up at a later point in time. Please refer to the
> > > > > > implementation of:
> > > > > > dw_pcie_host_init() in drivers/pci/controller/dwc/pcie-designware-host.c
> > > > > > wherein we have the following:
> > > > > > 	/* Ignore errors, the link may come up later */
> > > > > > 	dw_pcie_wait_for_link(pci);
> > > > > > 
> > > > > > It seems to me that the logic behind ignoring the absence of the link
> > > > > > within cdns_pcie_host_link_setup() instead of erroring out, is similar to
> > > > > > that of dw_pcie_wait_for_link().
> > > > > > 
> > > > > > Regards,
> > > > > > Siddharth.
> > > > > > 
> > > > > > 
> > > > > > If a PCIe port is not connected to a device. The PCIe link does not
> > > > > > go up. The current code returns success whether the device is connected
> > > > > > or not. Cadence IP's ECAM requires an LTSSM at L0 to access the RC's
> > > > > > config space registers. Otherwise the enumeration process will hang.
> > > > The ">" symbols seem to be manually added in your reply and are also
> > > > incorrect. If you have added them manually, please don't add them at the
> > > > start of the sentences corresponding to your reply.
> > > > 
> > > > The issue you are facing seems to be specific to the Cadence IP or the way
> > > > in which the IP has been integrated into the device that you are using.
> > > > On TI SoCs which have the Cadence PCIe Controller, absence of PCIe devices
> > > > doesn't result in a hang. Enumeration should proceed irrespective of the
> > > > presence of PCIe devices and should indicate their absence when they aren't
> > > > connected.
> > > > 
> > > > While I am not denying the issue being seen, the fix should probably be
> > > > done elsewhere.
> > > > 
> > > > Regards,
> > > > Siddharth.
> > > We are the SOC design company and we have confirmed with the designer and
> > > Cadence. For the Cadence's IP we are using, ECAM must be L0 at LTSSM to be
> > > used. Cadence will fixed next RTL version.
> > > 
> > 
> > I don't understand what you mean by 'ECAM must be L0 at LTSSM'. If you do not
> > connect the device, LTSSM would still be in 'detect' state until the device is
> > connected. Is that different on your SoC?
> > 
> > > If the cdns_pcie_host_link_setup return value is not modified. The following
> > > is the
> > > log of the enumeration process without connected devices. There will be hang
> > > for
> > > more than 300 seconds. So I don't think it makes sense to run the
> > > enumeration
> > > process without connecting devices. And it will affect the boot time.
> > > 
> > 
> > We don't know your driver, so cannot comment on the issue without understanding
> > the problem, sorry.
> > 
> > - Mani
> > 
> > > [ 2.681770] xxx pcie: xxx_pcie_probe starting!
> > > [ 2.689537] xxx pcie: host bridge /soc@0/pcie@xxx ranges:
> > > [ 2.698601] xxx pcie: IO 0x0060100000..0x00601fffff -> 0x0060100000
> > > [ 2.708625] xxx pcie: MEM 0x0060200000..0x007fffffff -> 0x0060200000
> > > [ 2.718649] xxx pcie: MEM 0x1800000000..0x1bffffffff -> 0x1800000000
> > > [ 2.744441] xxx pcie: ioremap rcsu, paddr:[mem 0x0a000000-0x0a00ffff],
> > > vaddr:ffff800089390000
> > > [ 2.756230] xxx pcie: ioremap msg, paddr:[mem 0x60000000-0x600fffff],
> > > vaddr:ffff800089800000
> > > [ 2.769692] xxx pcie: ECAM at [mem 0x2c000000-0x2fffffff] for [bus c0-ff]
> > > [ 2.780139] xxx.pcie_phy: pcie_phy_common_init end
> > > [ 2.788900] xxx pcie: waiting PHY is ready! retries = 2
> > > [ 3.905292] xxx pcie: Link fail, retries 10 times
> > > [ 3.915054] xxx pcie: ret=-110, rc->quirk_retrain_flag = 0
> > > [ 3.923848] xxx pcie: ret=-110, rc->quirk_retrain_flag = 0
> > > [ 3.932669] xxx pcie: PCI host bridge to bus 0000:c0
> > > [ 3.940847] pci_bus 0000:c0: root bus resource [bus c0-ff]
> > > [ 3.948322] pci_bus 0000:c0: root bus resource [io 0x0000-0xfffff] (bus
> > > address [0x60100000-0x601fffff])
> > > [ 3.959922] pci_bus 0000:c0: root bus resource [mem 0x60200000-0x7fffffff]
> > > [ 3.968799] pci_bus 0000:c0: root bus resource [mem
> > > 0x1800000000-0x1bffffffff pref]
> > > [ 339.667761] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > [ 339.677449] rcu: 5-...0: (20 ticks this GP) idle=4d94/1/0x4000000000000000
> > > softirq=20/20 fqs=2623
> > > [ 339.688184] (detected by 2, t=5253 jiffies, g=-1119, q=2 ncpus=12)
> > > [ 339.696193] Sending NMI from CPU 2 to CPUs 5:
> > > [ 349.703670] rcu: rcu_preempt kthread timer wakeup didn't happen for 2509
> > > jiffies! g-1119 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
> > > [ 349.718710] rcu: Possible timer handling issue on cpu=2 timer-softirq=1208
> > > [ 349.727418] rcu: rcu_preempt kthread starved for 2515 jiffies! g-1119 f0x0
> > > RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=2
> > > [ 349.739642] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM
> > > is now expected behavior.
> > > [ 349.750546] rcu: RCU grace-period kthread stack dump:
> > > [ 349.757319] task:rcu_preempt state:I stack:0 pid:14 ppid:2
> > > flags:0x00000008
> > > [ 349.767439] Call trace:
> > > [ 349.771575] __switch_to+0xdc/0x150
> > > [ 349.776777] __schedule+0x2dc/0x7d0
> > > [ 349.781972] schedule+0x5c/0x100
> > > [ 349.786903] schedule_timeout+0x8c/0x100
> > > [ 349.792538] rcu_gp_fqs_loop+0x140/0x420
> > > [ 349.798176] rcu_gp_kthread+0x134/0x164
> > > [ 349.803725] kthread+0x108/0x10c
> > > [ 349.808657] ret_from_fork+0x10/0x20
> > > [ 349.813942] rcu: Stack dump where RCU GP kthread last ran:
> > > [ 349.821156] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G S xxx-build-generic
> > > #8
> > > [ 349.831887] Hardware name: xxx Reference Board, BIOS xxx
> > > [ 349.843583] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS
> > > BTYPE=--)
> > > [ 349.852294] pc : arch_cpu_idle+0x18/0x2c
> > > [ 349.857928] lr : arch_cpu_idle+0x14/0x2c
> > > 
> > > Regards Hans
> > > 
> > 
> 
> I am very sorry that the previous email said that I included HTML format, so
> I resend it twice.
> 
> 
> > I don't understand what you mean by 'ECAM must be L0 at LTSSM'. If you do
> not
> > connect the device, LTSSM would still be in 'detect' state until the
> device is
> > connected. Is that different on your SoC?
> 
> If a PCIe port is not connected to a device. Then run pci_host_probe and
> perform the enumeration process. During the enumeration process, VID and PID
> are read. If the LTSSM is not in L0, the CPU send AXI transmission will not
> be sent, that is, the AXI slave will hang. This is the problem with the
> Cadence IP we are using.
> 

This sounds similar to the issues we have seen with other IP implementations:

15b23906347c ("PCI: dwc: Add link up check in dw_child_pcie_ops.map_bus()")
9e9ec8d8692a ("PCI: keystone: Add link up check to ks_pcie_other_map_bus()")

If the config space access happens for devices that do not exist on the bus,
then an SError gets triggered and it causes the system to hang.

In that case, you need to skip the enumeration in your own
'struct pci_ops::map_bus' callback. Even though it is not the best solution, we
have to live with it.
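
A minimal userspace sketch of that kind of guard follows. It is not the
Cadence or DWC driver code: the fake_ names and the 256-byte buffer are
hypothetical, and whether accesses to the root port itself also need the
guard on this hardware is not settled in this thread. The idea is only that
map_bus returns NULL while the link is down, so the hardware is never
touched and the caller reports "device not found".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: one downstream function's config space. */
struct fake_rc {
	bool link_up;
	uint8_t cfg[256];
};

/*
 * Stand-in for a map_bus callback for devices below the root port.
 * Returning NULL tells the caller there is nothing to access.
 */
static void *fake_child_map_bus(struct fake_rc *rc, int where)
{
	if (!rc->link_up)
		return NULL;    /* link down: do not touch the bus */
	if (where < 0 || where > (int)sizeof(rc->cfg) - 4)
		return NULL;    /* out of range for a 32-bit access */
	return &rc->cfg[where];
}

/* 32-bit config read; all ones stands for "no device". */
static uint32_t fake_config_read32(struct fake_rc *rc, int where)
{
	uint8_t *addr = fake_child_map_bus(rc, where);
	uint32_t val;

	if (!addr)
		return 0xFFFFFFFFu;
	memcpy(&val, addr, sizeof(val));
	return val;
}
```

Because the check sits in the accessor rather than in link setup, probing
still succeeds with an empty slot, and enumeration simply finds no devices.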

- Mani

Hans Zhang Dec. 19, 2024, 11:46 a.m. UTC | #8

On 12/19/24 06:20, Manivannan Sadhasivam wrote:
> On Thu, Dec 19, 2024 at 05:29:01AM -0500, Hans Zhang wrote:
>>
>>
>> On 12/19/24 04:49, Manivannan Sadhasivam wrote:
>>> On Thu, Dec 19, 2024 at 04:38:11AM -0500, Hans Zhang wrote:
>>>>
>>>> On 12/19/24 03:59, Siddharth Vadapalli wrote:
>>>>> On Thu, Dec 19, 2024 at 03:49:33AM -0500, Hans Zhang wrote:
>>>>>> On 12/19/24 03:33, Siddharth Vadapalli wrote:
>>>>>>> On Thu, Dec 19, 2024 at 03:14:52AM -0500, Hans Zhang wrote:
>>>>>>>> If the PCIe link never came up, the enumeration process
>>>>>>>> should not be run.
>>>>>>> The link could come up at a later point in time. Please refer to the
>>>>>>> implementation of:
>>>>>>> dw_pcie_host_init() in drivers/pci/controller/dwc/pcie-designware-host.c
>>>>>>> wherein we have the following:
>>>>>>> 	/* Ignore errors, the link may come up later */
>>>>>>> 	dw_pcie_wait_for_link(pci);
>>>>>>>
>>>>>>> It seems to me that the logic behind ignoring the absence of the link
>>>>>>> within cdns_pcie_host_link_setup() instead of erroring out, is similar to
>>>>>>> that of dw_pcie_wait_for_link().
>>>>>>>
>>>>>>> Regards,
>>>>>>> Siddharth.
>>>>>>>
>>>>>>>
>>>>>>> If a PCIe port is not connected to a device. The PCIe link does not
>>>>>>> go up. The current code returns success whether the device is connected
>>>>>>> or not. Cadence IP's ECAM requires an LTSSM at L0 to access the RC's
>>>>>>> config space registers. Otherwise the enumeration process will hang.
>>>>> The ">" symbols seem to be manually added in your reply and are also
>>>>> incorrect. If you have added them manually, please don't add them at the
>>>>> start of the sentences corresponding to your reply.
>>>>>
>>>>> The issue you are facing seems to be specific to the Cadence IP or the way
>>>>> in which the IP has been integrated into the device that you are using.
>>>>> On TI SoCs which have the Cadence PCIe Controller, absence of PCIe devices
>>>>> doesn't result in a hang. Enumeration should proceed irrespective of the
>>>>> presence of PCIe devices and should indicate their absence when they aren't
>>>>> connected.
>>>>>
>>>>> While I am not denying the issue being seen, the fix should probably be
>>>>> done elsewhere.
>>>>>
>>>>> Regards,
>>>>> Siddharth.
>>>> We are the SOC design company and we have confirmed with the designer and
>>>> Cadence. For the Cadence's IP we are using, ECAM must be L0 at LTSSM to be
>>>> used. Cadence will fixed next RTL version.
>>>>
>>>
>>> I don't understand what you mean by 'ECAM must be L0 at LTSSM'. If you do not
>>> connect the device, LTSSM would still be in 'detect' state until the device is
>>> connected. Is that different on your SoC?
>>>
>>>> If the cdns_pcie_host_link_setup return value is not modified. The following
>>>> is the log of the enumeration process without connected devices. There will
>>>> be hang for more than 300 seconds. So I don't think it makes sense to run the
>>>> enumeration process without connecting devices. And it will affect the boot
>>>> time.
>>>>
>>>
>>> We don't know your driver, so cannot comment on the issue without understanding
>>> the problem, sorry.
>>>
>>> - Mani
>>>
>>>> [ 2.681770] xxx pcie: xxx_pcie_probe starting!
>>>> [ 2.689537] xxx pcie: host bridge /soc@0/pcie@xxx ranges:
>>>> [ 2.698601] xxx pcie: IO 0x0060100000..0x00601fffff -> 0x0060100000
>>>> [ 2.708625] xxx pcie: MEM 0x0060200000..0x007fffffff -> 0x0060200000
>>>> [ 2.718649] xxx pcie: MEM 0x1800000000..0x1bffffffff -> 0x1800000000
>>>> [ 2.744441] xxx pcie: ioremap rcsu, paddr:[mem 0x0a000000-0x0a00ffff],
>>>> vaddr:ffff800089390000
>>>> [ 2.756230] xxx pcie: ioremap msg, paddr:[mem 0x60000000-0x600fffff],
>>>> vaddr:ffff800089800000
>>>> [ 2.769692] xxx pcie: ECAM at [mem 0x2c000000-0x2fffffff] for [bus c0-ff]
>>>> [ 2.780139] xxx.pcie_phy: pcie_phy_common_init end
>>>> [ 2.788900] xxx pcie: waiting PHY is ready! retries = 2
>>>> [ 3.905292] xxx pcie: Link fail, retries 10 times
>>>> [ 3.915054] xxx pcie: ret=-110, rc->quirk_retrain_flag = 0
>>>> [ 3.923848] xxx pcie: ret=-110, rc->quirk_retrain_flag = 0
>>>> [ 3.932669] xxx pcie: PCI host bridge to bus 0000:c0
>>>> [ 3.940847] pci_bus 0000:c0: root bus resource [bus c0-ff]
>>>> [ 3.948322] pci_bus 0000:c0: root bus resource [io 0x0000-0xfffff] (bus
>>>> address [0x60100000-0x601fffff])
>>>> [ 3.959922] pci_bus 0000:c0: root bus resource [mem 0x60200000-0x7fffffff]
>>>> [ 3.968799] pci_bus 0000:c0: root bus resource [mem
>>>> 0x1800000000-0x1bffffffff pref]
>>>> [ 339.667761] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
>>>> [ 339.677449] rcu: 5-...0: (20 ticks this GP) idle=4d94/1/0x4000000000000000
>>>> softirq=20/20 fqs=2623
>>>> [ 339.688184] (detected by 2, t=5253 jiffies, g=-1119, q=2 ncpus=12)
>>>> [ 339.696193] Sending NMI from CPU 2 to CPUs 5:
>>>> [ 349.703670] rcu: rcu_preempt kthread timer wakeup didn't happen for 2509
>>>> jiffies! g-1119 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
>>>> [ 349.718710] rcu: Possible timer handling issue on cpu=2 timer-softirq=1208
>>>> [ 349.727418] rcu: rcu_preempt kthread starved for 2515 jiffies! g-1119 f0x0
>>>> RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=2
>>>> [ 349.739642] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM
>>>> is now expected behavior.
>>>> [ 349.750546] rcu: RCU grace-period kthread stack dump:
>>>> [ 349.757319] task:rcu_preempt state:I stack:0 pid:14 ppid:2
>>>> flags:0x00000008
>>>> [ 349.767439] Call trace:
>>>> [ 349.771575] __switch_to+0xdc/0x150
>>>> [ 349.776777] __schedule+0x2dc/0x7d0
>>>> [ 349.781972] schedule+0x5c/0x100
>>>> [ 349.786903] schedule_timeout+0x8c/0x100
>>>> [ 349.792538] rcu_gp_fqs_loop+0x140/0x420
>>>> [ 349.798176] rcu_gp_kthread+0x134/0x164
>>>> [ 349.803725] kthread+0x108/0x10c
>>>> [ 349.808657] ret_from_fork+0x10/0x20
>>>> [ 349.813942] rcu: Stack dump where RCU GP kthread last ran:
>>>> [ 349.821156] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G S xxx-build-generic
>>>> #8
>>>> [ 349.831887] Hardware name: xxx Reference Board, BIOS xxx
>>>> [ 349.843583] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS
>>>> BTYPE=--)
>>>> [ 349.852294] pc : arch_cpu_idle+0x18/0x2c
>>>> [ 349.857928] lr : arch_cpu_idle+0x14/0x2c
>>>>
>>>> Regards Hans
>>>>
>>>
>>
>> I am very sorry that the previous email said that I included HTML format, so
>> I resend it twice.
>>
>>
>>> I don't understand what you mean by 'ECAM must be L0 at LTSSM'. If you do not
>>> connect the device, LTSSM would still be in 'detect' state until the device is
>>> connected. Is that different on your SoC?
>>
>> If a PCIe port is not connected to a device. Then run pci_host_probe and
>> perform the enumeration process. During the enumeration process, VID and PID
>> are read. If the LTSSM is not in L0, the CPU send AXI transmission will not
>> be sent, that is, the AXI slave will hang. This is the problem with the
>> Cadence IP we are using.
>>
> 
> This sounds similar to the issues we have seen with other IP implementations:
> 
> 15b23906347c ("PCI: dwc: Add link up check in dw_child_pcie_ops.map_bus()")
> 9e9ec8d8692a ("PCI: keystone: Add link up check to ks_pcie_other_map_bus()")
> 
> If the config space access happens for devices that do not exist on the bus,
> then SError gets triggered and it causes the system hang.
> 
> In that case, you need to skip the enumeration in your own
> 'struct pci_ops::map_bus' callback. Even though it is not the best solution, we
> have to live with it.
> 
> - Mani
> 

> In that case, you need to skip the enumeration in your own
> 'struct pci_ops::map_bus' callback. Even though it is not the best solution, we
> have to live with it.

I know how pcie-designware-host.c works, but checking whether the link is up
on every config space register access seems inefficient.
We have 5 PCIe controllers, and if a few of them are not connected to a
device, boot time will be affected.

Regards
Hans
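
For readers following the thread, the map_bus approach Mani describes can be sketched in plain userspace C. Everything named demo_* below is hypothetical and only models the idea; it is not the Cadence or DesignWare driver code. The sketch assumes, as in the dwc commit cited above, that only accesses to buses below the root port are gated on link state. Hans reports that on his hardware even root port accesses need the LTSSM in L0, so a real fix there might have to gate more than this sketch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a controller's private state; not a kernel type. */
struct demo_rc {
	bool link_up;           /* would come from a link status register */
	uint8_t root_busnr;     /* bus number of the root port itself */
	uint8_t root_cfg[4096]; /* fake config space of the root port */
	uint8_t cfg[4096];      /* fake config space of one downstream function */
};

/* Hypothetical link query; a real driver would read hardware here. */
static bool demo_link_up(const struct demo_rc *rc)
{
	return rc->link_up;
}

/*
 * Modelled on the idea of a pci_ops::map_bus callback: return the address
 * to access, or NULL so that the access fails without touching the bus.
 */
static void *demo_map_bus(struct demo_rc *rc, uint8_t busnr, int where)
{
	assert(rc != NULL);

	if (where < 0 || where >= (int)sizeof(rc->cfg))
		return NULL;

	/* Assumption: the root port's own registers are safe with link down. */
	if (busnr == rc->root_busnr)
		return rc->root_cfg + where;

	/* Downstream access: refuse it unless the link is up. */
	if (!demo_link_up(rc))
		return NULL;

	return rc->cfg + where;
}
```

In the kernel's generic config accessors, a NULL return from map_bus makes the access fail with PCIBIOS_DEVICE_NOT_FOUND, so enumeration sees no device instead of issuing the bus transaction. The cost Hans raises is visible here: demo_link_up() runs on every downstream access.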

Manivannan Sadhasivam Dec. 19, 2024, 1:35 p.m. UTC | #9

On Thu, Dec 19, 2024 at 06:46:28AM -0500, Hans Zhang wrote:
> 
> 
> On 12/19/24 06:20, Manivannan Sadhasivam wrote:
> > On Thu, Dec 19, 2024 at 05:29:01AM -0500, Hans Zhang wrote:
> > > 
> > > 
> > > On 12/19/24 04:49, Manivannan Sadhasivam wrote:
> > > > On Thu, Dec 19, 2024 at 04:38:11AM -0500, Hans Zhang wrote:
> > > > > 
> > > > > On 12/19/24 03:59, Siddharth Vadapalli wrote:
> > > > > > On Thu, Dec 19, 2024 at 03:49:33AM -0500, Hans Zhang wrote:
> > > > > > > On 12/19/24 03:33, Siddharth Vadapalli wrote:
> > > > > > > > On Thu, Dec 19, 2024 at 03:14:52AM -0500, Hans Zhang wrote:
> > > > > > > > > If the PCIe link never came up, the enumeration process
> > > > > > > > > should not be run.
> > > > > > > > The link could come up at a later point in time. Please refer to the
> > > > > > > > implementation of:
> > > > > > > > dw_pcie_host_init() in drivers/pci/controller/dwc/pcie-designware-host.c
> > > > > > > > wherein we have the following:
> > > > > > > > 	/* Ignore errors, the link may come up later */
> > > > > > > > 	dw_pcie_wait_for_link(pci);
> > > > > > > > 
> > > > > > > > It seems to me that the logic behind ignoring the absence of the link
> > > > > > > > within cdns_pcie_host_link_setup() instead of erroring out, is similar to
> > > > > > > > that of dw_pcie_wait_for_link().
> > > > > > > > 
> > > > > > > > Regards,
> > > > > > > > Siddharth.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > If a PCIe port is not connected to a device. The PCIe link does not
> > > > > > > > go up. The current code returns success whether the device is connected
> > > > > > > > or not. Cadence IP's ECAM requires an LTSSM at L0 to access the RC's
> > > > > > > > config space registers. Otherwise the enumeration process will hang.
> > > > > > The ">" symbols seem to be manually added in your reply and are also
> > > > > > incorrect. If you have added them manually, please don't add them at the
> > > > > > start of the sentences corresponding to your reply.
> > > > > > 
> > > > > > The issue you are facing seems to be specific to the Cadence IP or the way
> > > > > > in which the IP has been integrated into the device that you are using.
> > > > > > On TI SoCs which have the Cadence PCIe Controller, absence of PCIe devices
> > > > > > doesn't result in a hang. Enumeration should proceed irrespective of the
> > > > > > presence of PCIe devices and should indicate their absence when they aren't
> > > > > > connected.
> > > > > > 
> > > > > > While I am not denying the issue being seen, the fix should probably be
> > > > > > done elsewhere.
> > > > > > 
> > > > > > Regards,
> > > > > > Siddharth.
> > > > > We are the SOC design company and we have confirmed with the designer and
> > > > > Cadence. For the Cadence's IP we are using, ECAM must be L0 at LTSSM to be
> > > > > used. Cadence will fixed next RTL version.
> > > > > 
> > > > 
> > > > I don't understand what you mean by 'ECAM must be L0 at LTSSM'. If you do not
> > > > connect the device, LTSSM would still be in 'detect' state until the device is
> > > > connected. Is that different on your SoC?
> > > > 
> > > > > If the cdns_pcie_host_link_setup return value is not modified. The following
> > > > > is the log of the enumeration process without connected devices. There will
> > > > > be hang for more than 300 seconds. So I don't think it makes sense to run the
> > > > > enumeration process without connecting devices. And it will affect the boot
> > > > > time.
> > > > > 
> > > > 
> > > > We don't know your driver, so cannot comment on the issue without understanding
> > > > the problem, sorry.
> > > > 
> > > > - Mani
> > > > 
> > > > > [ 2.681770] xxx pcie: xxx_pcie_probe starting!
> > > > > [ 2.689537] xxx pcie: host bridge /soc@0/pcie@xxx ranges:
> > > > > [ 2.698601] xxx pcie: IO 0x0060100000..0x00601fffff -> 0x0060100000
> > > > > [ 2.708625] xxx pcie: MEM 0x0060200000..0x007fffffff -> 0x0060200000
> > > > > [ 2.718649] xxx pcie: MEM 0x1800000000..0x1bffffffff -> 0x1800000000
> > > > > [ 2.744441] xxx pcie: ioremap rcsu, paddr:[mem 0x0a000000-0x0a00ffff],
> > > > > vaddr:ffff800089390000
> > > > > [ 2.756230] xxx pcie: ioremap msg, paddr:[mem 0x60000000-0x600fffff],
> > > > > vaddr:ffff800089800000
> > > > > [ 2.769692] xxx pcie: ECAM at [mem 0x2c000000-0x2fffffff] for [bus c0-ff]
> > > > > [ 2.780139] xxx.pcie_phy: pcie_phy_common_init end
> > > > > [ 2.788900] xxx pcie: waiting PHY is ready! retries = 2
> > > > > [ 3.905292] xxx pcie: Link fail, retries 10 times
> > > > > [ 3.915054] xxx pcie: ret=-110, rc->quirk_retrain_flag = 0
> > > > > [ 3.923848] xxx pcie: ret=-110, rc->quirk_retrain_flag = 0
> > > > > [ 3.932669] xxx pcie: PCI host bridge to bus 0000:c0
> > > > > [ 3.940847] pci_bus 0000:c0: root bus resource [bus c0-ff]
> > > > > [ 3.948322] pci_bus 0000:c0: root bus resource [io 0x0000-0xfffff] (bus
> > > > > address [0x60100000-0x601fffff])
> > > > > [ 3.959922] pci_bus 0000:c0: root bus resource [mem 0x60200000-0x7fffffff]
> > > > > [ 3.968799] pci_bus 0000:c0: root bus resource [mem
> > > > > 0x1800000000-0x1bffffffff pref]
> > > > > [ 339.667761] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > > > [ 339.677449] rcu: 5-...0: (20 ticks this GP) idle=4d94/1/0x4000000000000000
> > > > > softirq=20/20 fqs=2623
> > > > > [ 339.688184] (detected by 2, t=5253 jiffies, g=-1119, q=2 ncpus=12)
> > > > > [ 339.696193] Sending NMI from CPU 2 to CPUs 5:
> > > > > [ 349.703670] rcu: rcu_preempt kthread timer wakeup didn't happen for 2509
> > > > > jiffies! g-1119 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
> > > > > [ 349.718710] rcu: Possible timer handling issue on cpu=2 timer-softirq=1208
> > > > > [ 349.727418] rcu: rcu_preempt kthread starved for 2515 jiffies! g-1119 f0x0
> > > > > RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=2
> > > > > [ 349.739642] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM
> > > > > is now expected behavior.
> > > > > [ 349.750546] rcu: RCU grace-period kthread stack dump:
> > > > > [ 349.757319] task:rcu_preempt state:I stack:0 pid:14 ppid:2
> > > > > flags:0x00000008
> > > > > [ 349.767439] Call trace:
> > > > > [ 349.771575] __switch_to+0xdc/0x150
> > > > > [ 349.776777] __schedule+0x2dc/0x7d0
> > > > > [ 349.781972] schedule+0x5c/0x100
> > > > > [ 349.786903] schedule_timeout+0x8c/0x100
> > > > > [ 349.792538] rcu_gp_fqs_loop+0x140/0x420
> > > > > [ 349.798176] rcu_gp_kthread+0x134/0x164
> > > > > [ 349.803725] kthread+0x108/0x10c
> > > > > [ 349.808657] ret_from_fork+0x10/0x20
> > > > > [ 349.813942] rcu: Stack dump where RCU GP kthread last ran:
> > > > > [ 349.821156] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G S xxx-build-generic
> > > > > #8
> > > > > [ 349.831887] Hardware name: xxx Reference Board, BIOS xxx
> > > > > [ 349.843583] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS
> > > > > BTYPE=--)
> > > > > [ 349.852294] pc : arch_cpu_idle+0x18/0x2c
> > > > > [ 349.857928] lr : arch_cpu_idle+0x14/0x2c
> > > > > 
> > > > > Regards Hans
> > > > > 
> > > > 
> > > 
> > > I am very sorry that the previous email said that I included HTML format, so
> > > I resend it twice.
> > > 
> > > 
> > > > I don't understand what you mean by 'ECAM must be L0 at LTSSM'. If you do not
> > > > connect the device, LTSSM would still be in 'detect' state until the device is
> > > > connected. Is that different on your SoC?
> > > 
> > > If a PCIe port is not connected to a device. Then run pci_host_probe and
> > > perform the enumeration process. During the enumeration process, VID and PID
> > > are read. If the LTSSM is not in L0, the CPU send AXI transmission will not
> > > be sent, that is, the AXI slave will hang. This is the problem with the
> > > Cadence IP we are using.
> > > 
> > 
> > This sounds similar to the issues we have seen with other IP implementations:
> > 
> > 15b23906347c ("PCI: dwc: Add link up check in dw_child_pcie_ops.map_bus()")
> > 9e9ec8d8692a ("PCI: keystone: Add link up check to ks_pcie_other_map_bus()")
> > 
> > If the config space access happens for devices that do not exist on the bus,
> > then SError gets triggered and it causes the system hang.
> > 
> > In that case, you need to skip the enumeration in your own
> > 'struct pci_ops::map_bus' callback. Even though it is not the best solution, we
> > have to live with it.
> > 
> > - Mani
> > 
> 
> > In that case, you need to skip the enumeration in your own
> > 'struct pci_ops::map_bus' callback. Even though it is not the best solution, we
> > have to live with it.
> 
> I know how pcie-designware-host.c works, but accessing each config space
> register requires checking if it is a link up, which seems inefficient.

Yes.

> We have 5 PCIe controllers, and if a few of them are not connected to the
> device. And it will affect the boot time.
> 

Why are you enabling all controllers? Can't you just enable the ones you know
the endpoints are going to be connected? I'm just trying to see if we can avoid
having a quirk.

If you do not know, then you need to introduce a quirk for your platform.
But that requires your controller driver to be upstreamed. We cannot provide
hooks for downstream drivers in upstream.

- Mani
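
The kind of quirk discussed here could look roughly like the following userspace sketch. The struct, the quirk_fail_on_link_down flag, and the demo_* helpers are all hypothetical; nothing in this thread shows such a flag existing in pcie-cadence-host.c. The -ETIMEDOUT value corresponds to the ret=-110 seen in the log above on Linux.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical host state; field names are illustrative, not the driver's. */
struct demo_host {
	bool link_up;                 /* outcome of waiting for the link */
	bool quirk_fail_on_link_down; /* hypothetical per-platform quirk */
};

/* Hypothetical wait helper: 0 if the link is up, -ETIMEDOUT otherwise. */
static int demo_wait_for_link(const struct demo_host *host)
{
	return host->link_up ? 0 : -ETIMEDOUT;
}

/* Link setup that only reports a missing link when the quirk asks for it. */
static int demo_link_setup(const struct demo_host *host)
{
	int ret;

	assert(host != NULL);
	ret = demo_wait_for_link(host);

	/* Default: ignore the error, the link may come up later. */
	if (ret && !host->quirk_fail_on_link_down)
		return 0;

	return ret;
}
```

With the quirk unset, the function keeps the behaviour Siddharth describes (success even without a link, since it may come up later). With it set, a link timeout propagates and the caller can skip enumeration. As Mani's hotplug remark in the follow-up suggests, failing the probe this way only makes sense if the port does not need hotplug.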

Hans Zhang Dec. 20, 2024, 7:27 a.m. UTC | #10

On 12/19/24 08:35, Manivannan Sadhasivam wrote:

>> We have 5 PCIe controllers, and if a few of them are not connected to the
>> device. And it will affect the boot time.
>>
> 
> Why are you enabling all controllers? Can't you just enable the ones you know
> the endpoints are going to be connected? I'm just trying to see if we can avoid
> having a quirk.

Our SoC is used in PC products, where the PCB may have PCIe slots with no
device plugged in. So we need to enable all ports.

> If you do not know, then you need to introduce a quirk for your platform.
> But that requires your controller driver to be upstreamed. We cannot provide
> hooks for downstream drivers in upstream.

We currently have no plans to upstream our controller driver; that needs to
wait for a decision from the boss.

Regards
Hans

Manivannan Sadhasivam Dec. 20, 2024, 12:32 p.m. UTC | #11

On Fri, Dec 20, 2024 at 03:27:22PM +0800, Hans Zhang wrote:
> 
> 
> On 12/19/24 08:35, Manivannan Sadhasivam wrote:
> 
> > > We have 5 PCIe controllers, and if a few of them are not connected to the
> > > device. And it will affect the boot time.
> > > 
> > 
> > Why are you enabling all controllers? Can't you just enable the ones you know
> > the endpoints are going to be connected? I'm just trying to see if we can avoid
> > having a quirk.
> 
> Our SOC has a PC product situation, and there may be PCIe slots on the PCB,
> but the device may not be plugged in. So we need to enable all ports.
> 

Since you are trying to fail the probe for the unused slots, I hope your
hardware does not support hotplug either.

> > If you do not know, then you need to introduce a quirk for your platform.
> > But that requires your controller driver to be upstreamed. We cannot provide
> > hooks for downstream drivers in upstream.
> 
> Our controller driver currently has no plans for upstream and needs to wait
> for notification from the boss.
> 

Then the quirk patch has to wait until your driver is submitted upstream.

- Mani

Hans Zhang Dec. 21, 2024, 2:47 p.m. UTC | #12

On 2024/12/20 20:32, Manivannan Sadhasivam wrote:

> 
> Then the quirk patch has to wait until your driver is submitted upstream.
> 

Thank you Mani.

Regards
Hans
