diff mbox series

[v2] PCI: qcom-ep: Enable controller resources like PHY only after refclk is available

Message ID 20240828140108.5562-1-manivannan.sadhasivam@linaro.org (mailing list archive)
State Superseded
Headers show
Series [v2] PCI: qcom-ep: Enable controller resources like PHY only after refclk is available | expand

Commit Message

Manivannan Sadhasivam Aug. 28, 2024, 2:01 p.m. UTC
qcom_pcie_enable_resources() is called by qcom_pcie_ep_probe() and it
enables the controller resources like clocks, regulator, PHY. On one of the
new unreleased Qcom SoC, PHY enablement depends on the active refclk. And
on all of the supported Qcom endpoint SoCs, refclk comes from the host
(RC). So calling qcom_pcie_enable_resources() without refclk causes the
whole SoC crash on the new SoC.

qcom_pcie_enable_resources() is already called by
qcom_pcie_perst_deassert() when PERST# is deasserted, and refclk is
available at that time.

Hence, remove the unnecessary call to qcom_pcie_enable_resources() from
qcom_pcie_ep_probe() to prevent the crash.

Fixes: 869bc5253406 ("PCI: dwc: ep: Fix DBI access failure for drivers requiring refclk from host")
Tested-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
---

Changes in v2:

- Changed the patch description to mention the crash clearly as suggested by
  Bjorn
- Added the Fixes tag

Bjorn: This is not going to be a 6.11 material as it is not fixing a regression
introduced in 6.11 (offending commit got merged in 6.10). So please merge this
patch for 6.12.

 drivers/pci/controller/dwc/pcie-qcom-ep.c | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

Comments

Bjorn Helgaas Aug. 28, 2024, 8:59 p.m. UTC | #1
On Wed, Aug 28, 2024 at 07:31:08PM +0530, Manivannan Sadhasivam wrote:
> qcom_pcie_enable_resources() is called by qcom_pcie_ep_probe() and it
> enables the controller resources like clocks, regulator, PHY. On one of the
> new unreleased Qcom SoC, PHY enablement depends on the active refclk. And
> on all of the supported Qcom endpoint SoCs, refclk comes from the host
> (RC). So calling qcom_pcie_enable_resources() without refclk causes the
> whole SoC crash on the new SoC.
> 
> qcom_pcie_enable_resources() is already called by
> qcom_pcie_perst_deassert() when PERST# is deasserted, and refclk is
> available at that time.
> 
> Hence, remove the unnecessary call to qcom_pcie_enable_resources() from
> qcom_pcie_ep_probe() to prevent the crash.
> 
> Fixes: 869bc5253406 ("PCI: dwc: ep: Fix DBI access failure for drivers requiring refclk from host")
> Tested-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
> Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
> ---
> 
> Changes in v2:
> 
> - Changed the patch description to mention the crash clearly as suggested by
>   Bjorn

Clearly mentioning the crash as rationale for the change is *part* of
what I was looking for.

The rest, just as important, is information about what sort of crash
this is, because I hope and suspect the crash is recoverable, and we
*should* recover from it because PERST# may occur at arbitrary times,
so trying to avoid it is never going to be reliable.

Bjorn
Manivannan Sadhasivam Aug. 29, 2024, 5:37 a.m. UTC | #2
On Wed, Aug 28, 2024 at 03:59:45PM -0500, Bjorn Helgaas wrote:
> On Wed, Aug 28, 2024 at 07:31:08PM +0530, Manivannan Sadhasivam wrote:
> > qcom_pcie_enable_resources() is called by qcom_pcie_ep_probe() and it
> > enables the controller resources like clocks, regulator, PHY. On one of the
> > new unreleased Qcom SoC, PHY enablement depends on the active refclk. And
> > on all of the supported Qcom endpoint SoCs, refclk comes from the host
> > (RC). So calling qcom_pcie_enable_resources() without refclk causes the
> > whole SoC crash on the new SoC.
> > 
> > qcom_pcie_enable_resources() is already called by
> > qcom_pcie_perst_deassert() when PERST# is deasserted, and refclk is
> > available at that time.
> > 
> > Hence, remove the unnecessary call to qcom_pcie_enable_resources() from
> > qcom_pcie_ep_probe() to prevent the crash.
> > 
> > Fixes: 869bc5253406 ("PCI: dwc: ep: Fix DBI access failure for drivers requiring refclk from host")
> > Tested-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
> > Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
> > ---
> > 
> > Changes in v2:
> > 
> > - Changed the patch description to mention the crash clearly as suggested by
> >   Bjorn
> 
> Clearly mentioning the crash as rationale for the change is *part* of
> what I was looking for.
> 
> The rest, just as important, is information about what sort of crash
> this is, because I hope and suspect the crash is recoverable, and we
> *should* recover from it because PERST# may occur at arbitrary times,
> so trying to avoid it is never going to be reliable.
> 

I did mention 'whole SoC crash' which typically means unrecoverable state as
the SoC would crash (not just the driver). On Qcom SoCs, this will also lead the
SoC to boot into EDL (Emergency Download) mode so that the users can collect
dumps on the crash.

As I mentioned in earlier thread, I don't know how to avoid this crash entirely
(host asserting PERST# at random times) and still depend on refclk from host.
The best possible thing we can do is, at the time of PERST# assert, we can
notify the EPF driver to cancel all the work and not touch any registers that
require active refclk which is what the driver currently does.

And I'm also working on SRIS support which will allow the endpoint to generate
its own refclk and planning to make that mode as the default working mode.
Still the users could opt for non-SRIS mode (current mode of requiring refclk
from host) through DT.

- Mani
Bjorn Helgaas Aug. 29, 2024, 12:38 p.m. UTC | #3
On Thu, Aug 29, 2024 at 11:07:20AM +0530, Manivannan Sadhasivam wrote:
> On Wed, Aug 28, 2024 at 03:59:45PM -0500, Bjorn Helgaas wrote:
> > On Wed, Aug 28, 2024 at 07:31:08PM +0530, Manivannan Sadhasivam wrote:
> > > qcom_pcie_enable_resources() is called by qcom_pcie_ep_probe() and it
> > > enables the controller resources like clocks, regulator, PHY. On one of the
> > > new unreleased Qcom SoC, PHY enablement depends on the active refclk. And
> > > on all of the supported Qcom endpoint SoCs, refclk comes from the host
> > > (RC). So calling qcom_pcie_enable_resources() without refclk causes the
> > > whole SoC crash on the new SoC.
> > > 
> > > qcom_pcie_enable_resources() is already called by
> > > qcom_pcie_perst_deassert() when PERST# is deasserted, and refclk is
> > > available at that time.
> > > 
> > > Hence, remove the unnecessary call to qcom_pcie_enable_resources() from
> > > qcom_pcie_ep_probe() to prevent the crash.
> > > 
> > > Fixes: 869bc5253406 ("PCI: dwc: ep: Fix DBI access failure for drivers requiring refclk from host")
> > > Tested-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
> > > Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
> > > ---
> > > 
> > > Changes in v2:
> > > 
> > > - Changed the patch description to mention the crash clearly as suggested by
> > >   Bjorn
> > 
> > Clearly mentioning the crash as rationale for the change is *part* of
> > what I was looking for.
> > 
> > The rest, just as important, is information about what sort of crash
> > this is, because I hope and suspect the crash is recoverable, and we
> > *should* recover from it because PERST# may occur at arbitrary times,
> > so trying to avoid it is never going to be reliable.
> 
> I did mention 'whole SoC crash' which typically means unrecoverable
> state as the SoC would crash (not just the driver). On Qcom SoCs,
> this will also lead the SoC to boot into EDL (Emergency Download)
> mode so that the users can collect dumps on the crash.

IIUC we're talking about an access to a PHY register, and the access
requires Refclk from the host.  I assume the SoC accesses the register
by doing an MMIO load.  If nothing responds, I assume the SoC would
take a machine check or similar because there's no data to complete
the load instruction.  So I assume again that the Linux on the SoC
doesn't know how to recover from such a machine check?  If that's the
scenario, is the machine check unrecoverable in principle, or is it
potentially recoverable but nobody has done the work to do it?  My
guess would be the latter, because the former would mean that it's
impossible to build a robust endpoint around this SoC.  But obviously
this is all complete speculation on my part.

Bjorn
Manivannan Sadhasivam Aug. 29, 2024, 4:44 p.m. UTC | #4
On Thu, Aug 29, 2024 at 07:38:08AM -0500, Bjorn Helgaas wrote:
> On Thu, Aug 29, 2024 at 11:07:20AM +0530, Manivannan Sadhasivam wrote:
> > On Wed, Aug 28, 2024 at 03:59:45PM -0500, Bjorn Helgaas wrote:
> > > On Wed, Aug 28, 2024 at 07:31:08PM +0530, Manivannan Sadhasivam wrote:
> > > > qcom_pcie_enable_resources() is called by qcom_pcie_ep_probe() and it
> > > > enables the controller resources like clocks, regulator, PHY. On one of the
> > > > new unreleased Qcom SoC, PHY enablement depends on the active refclk. And
> > > > on all of the supported Qcom endpoint SoCs, refclk comes from the host
> > > > (RC). So calling qcom_pcie_enable_resources() without refclk causes the
> > > > whole SoC crash on the new SoC.
> > > > 
> > > > qcom_pcie_enable_resources() is already called by
> > > > qcom_pcie_perst_deassert() when PERST# is deasserted, and refclk is
> > > > available at that time.
> > > > 
> > > > Hence, remove the unnecessary call to qcom_pcie_enable_resources() from
> > > > qcom_pcie_ep_probe() to prevent the crash.
> > > > 
> > > > Fixes: 869bc5253406 ("PCI: dwc: ep: Fix DBI access failure for drivers requiring refclk from host")
> > > > Tested-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
> > > > Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
> > > > ---
> > > > 
> > > > Changes in v2:
> > > > 
> > > > - Changed the patch description to mention the crash clearly as suggested by
> > > >   Bjorn
> > > 
> > > Clearly mentioning the crash as rationale for the change is *part* of
> > > what I was looking for.
> > > 
> > > The rest, just as important, is information about what sort of crash
> > > this is, because I hope and suspect the crash is recoverable, and we
> > > *should* recover from it because PERST# may occur at arbitrary times,
> > > so trying to avoid it is never going to be reliable.
> > 
> > I did mention 'whole SoC crash' which typically means unrecoverable
> > state as the SoC would crash (not just the driver). On Qcom SoCs,
> > this will also lead the SoC to boot into EDL (Emergency Download)
> > mode so that the users can collect dumps on the crash.
> 
> IIUC we're talking about an access to a PHY register, and the access
> requires Refclk from the host.  I assume the SoC accesses the register
> by doing an MMIO load.  If nothing responds, I assume the SoC would
> take a machine check or similar because there's no data to complete
> the load instruction.  So I assume again that the Linux on the SoC
> doesn't know how to recover from such a machine check?  If that's the
> scenario, is the machine check unrecoverable in principle, or is it
> potentially recoverable but nobody has done the work to do it?  My
> guess would be the latter, because the former would mean that it's
> impossible to build a robust endpoint around this SoC.  But obviously
> this is all complete speculation on my part.
> 

Atleast on Qcom SoCs, doing a MMIO read without enabling the resources would
result in a NoC (Network On Chip) error, which then end up as an exception to
the Trustzone and Trustzone will finally convert it to a SoC crash so that the
users could take a crash dump and do the analysis on why the crash has happened.

I know that it may sound strange to developers coming from x86 world :)

But this NoC error is something NVidia has also reported before, so I wouldn't
assume that this is a Qcom specific issue but rather for SoCs depending on
refclk from host.

For building a robust endpoint, SoCs should generate refclk by themselves.

- Mani
Bjorn Helgaas Aug. 29, 2024, 5:06 p.m. UTC | #5
On Thu, Aug 29, 2024 at 10:14:55PM +0530, Manivannan Sadhasivam wrote:
> On Thu, Aug 29, 2024 at 07:38:08AM -0500, Bjorn Helgaas wrote:
> > On Thu, Aug 29, 2024 at 11:07:20AM +0530, Manivannan Sadhasivam wrote:
> > > On Wed, Aug 28, 2024 at 03:59:45PM -0500, Bjorn Helgaas wrote:
> > > > On Wed, Aug 28, 2024 at 07:31:08PM +0530, Manivannan Sadhasivam wrote:
> > > > > qcom_pcie_enable_resources() is called by qcom_pcie_ep_probe() and it
> > > > > enables the controller resources like clocks, regulator, PHY. On one of the
> > > > > new unreleased Qcom SoC, PHY enablement depends on the active refclk. And
> > > > > on all of the supported Qcom endpoint SoCs, refclk comes from the host
> > > > > (RC). So calling qcom_pcie_enable_resources() without refclk causes the
> > > > > whole SoC crash on the new SoC.
> > > > > 
> > > > > qcom_pcie_enable_resources() is already called by
> > > > > qcom_pcie_perst_deassert() when PERST# is deasserted, and refclk is
> > > > > available at that time.
> > > > > 
> > > > > Hence, remove the unnecessary call to qcom_pcie_enable_resources() from
> > > > > qcom_pcie_ep_probe() to prevent the crash.
> > > > > 
> > > > > Fixes: 869bc5253406 ("PCI: dwc: ep: Fix DBI access failure for drivers requiring refclk from host")
> > > > > Tested-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
> > > > > Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
> > > > > ---
> > > > > 
> > > > > Changes in v2:
> > > > > 
> > > > > - Changed the patch description to mention the crash clearly as suggested by
> > > > >   Bjorn
> > > > 
> > > > Clearly mentioning the crash as rationale for the change is *part* of
> > > > what I was looking for.
> > > > 
> > > > The rest, just as important, is information about what sort of crash
> > > > this is, because I hope and suspect the crash is recoverable, and we
> > > > *should* recover from it because PERST# may occur at arbitrary times,
> > > > so trying to avoid it is never going to be reliable.
> > > 
> > > I did mention 'whole SoC crash' which typically means unrecoverable
> > > state as the SoC would crash (not just the driver). On Qcom SoCs,
> > > this will also lead the SoC to boot into EDL (Emergency Download)
> > > mode so that the users can collect dumps on the crash.
> > 
> > IIUC we're talking about an access to a PHY register, and the access
> > requires Refclk from the host.  I assume the SoC accesses the register
> > by doing an MMIO load.  If nothing responds, I assume the SoC would
> > take a machine check or similar because there's no data to complete
> > the load instruction.  So I assume again that the Linux on the SoC
> > doesn't know how to recover from such a machine check?  If that's the
> > scenario, is the machine check unrecoverable in principle, or is it
> > potentially recoverable but nobody has done the work to do it?  My
> > guess would be the latter, because the former would mean that it's
> > impossible to build a robust endpoint around this SoC.  But obviously
> > this is all complete speculation on my part.
> 
> Atleast on Qcom SoCs, doing a MMIO read without enabling the
> resources would result in a NoC (Network On Chip) error, which then
> end up as an exception to the Trustzone and Trustzone will finally
> convert it to a SoC crash so that the users could take a crash dump
> and do the analysis on why the crash has happened.
> 
> I know that it may sound strange to developers coming from x86 world
> :)

It's only strange if the system design forces a crash for events that
happen in normal operation.  Sounds like part of the problem here is
the non-SRIS mode that depends on Refclk from the host.  That and the
fact that operating in non-SRIS mode has an unavoidable race where
PERST# from the host at the wrong time can crash the endpoint.

I think users of non-SRIS mode need to be aware of this issue, and
this patch to narrow the race window, but not close it completely, is
one good place to mention it.

> But this NoC error is something NVidia has also reported before, so
> I wouldn't assume that this is a Qcom specific issue but rather for
> SoCs depending on refclk from host.

Are there other drivers that need a similar band-aid?

> For building a robust endpoint, SoCs should generate refclk by
> themselves.
> 
> - Mani
> 
> -- மணிவண்ணன் சதாசிவம்
Manivannan Sadhasivam Aug. 30, 2024, 8:12 a.m. UTC | #6
On Thu, Aug 29, 2024 at 12:06:24PM -0500, Bjorn Helgaas wrote:
> On Thu, Aug 29, 2024 at 10:14:55PM +0530, Manivannan Sadhasivam wrote:
> > On Thu, Aug 29, 2024 at 07:38:08AM -0500, Bjorn Helgaas wrote:
> > > On Thu, Aug 29, 2024 at 11:07:20AM +0530, Manivannan Sadhasivam wrote:
> > > > On Wed, Aug 28, 2024 at 03:59:45PM -0500, Bjorn Helgaas wrote:
> > > > > On Wed, Aug 28, 2024 at 07:31:08PM +0530, Manivannan Sadhasivam wrote:
> > > > > > qcom_pcie_enable_resources() is called by qcom_pcie_ep_probe() and it
> > > > > > enables the controller resources like clocks, regulator, PHY. On one of the
> > > > > > new unreleased Qcom SoC, PHY enablement depends on the active refclk. And
> > > > > > on all of the supported Qcom endpoint SoCs, refclk comes from the host
> > > > > > (RC). So calling qcom_pcie_enable_resources() without refclk causes the
> > > > > > whole SoC crash on the new SoC.
> > > > > > 
> > > > > > qcom_pcie_enable_resources() is already called by
> > > > > > qcom_pcie_perst_deassert() when PERST# is deasserted, and refclk is
> > > > > > available at that time.
> > > > > > 
> > > > > > Hence, remove the unnecessary call to qcom_pcie_enable_resources() from
> > > > > > qcom_pcie_ep_probe() to prevent the crash.
> > > > > > 
> > > > > > Fixes: 869bc5253406 ("PCI: dwc: ep: Fix DBI access failure for drivers requiring refclk from host")
> > > > > > Tested-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
> > > > > > Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
> > > > > > ---
> > > > > > 
> > > > > > Changes in v2:
> > > > > > 
> > > > > > - Changed the patch description to mention the crash clearly as suggested by
> > > > > >   Bjorn
> > > > > 
> > > > > Clearly mentioning the crash as rationale for the change is *part* of
> > > > > what I was looking for.
> > > > > 
> > > > > The rest, just as important, is information about what sort of crash
> > > > > this is, because I hope and suspect the crash is recoverable, and we
> > > > > *should* recover from it because PERST# may occur at arbitrary times,
> > > > > so trying to avoid it is never going to be reliable.
> > > > 
> > > > I did mention 'whole SoC crash' which typically means unrecoverable
> > > > state as the SoC would crash (not just the driver). On Qcom SoCs,
> > > > this will also lead the SoC to boot into EDL (Emergency Download)
> > > > mode so that the users can collect dumps on the crash.
> > > 
> > > IIUC we're talking about an access to a PHY register, and the access
> > > requires Refclk from the host.  I assume the SoC accesses the register
> > > by doing an MMIO load.  If nothing responds, I assume the SoC would
> > > take a machine check or similar because there's no data to complete
> > > the load instruction.  So I assume again that the Linux on the SoC
> > > doesn't know how to recover from such a machine check?  If that's the
> > > scenario, is the machine check unrecoverable in principle, or is it
> > > potentially recoverable but nobody has done the work to do it?  My
> > > guess would be the latter, because the former would mean that it's
> > > impossible to build a robust endpoint around this SoC.  But obviously
> > > this is all complete speculation on my part.
> > 
> > Atleast on Qcom SoCs, doing a MMIO read without enabling the
> > resources would result in a NoC (Network On Chip) error, which then
> > end up as an exception to the Trustzone and Trustzone will finally
> > convert it to a SoC crash so that the users could take a crash dump
> > and do the analysis on why the crash has happened.
> > 
> > I know that it may sound strange to developers coming from x86 world
> > :)
> 
> It's only strange if the system design forces a crash for events that
> happen in normal operation.  Sounds like part of the problem here is
> the non-SRIS mode that depends on Refclk from the host.  That and the
> fact that operating in non-SRIS mode has an unavoidable race where
> PERST# from the host at the wrong time can crash the endpoint.
> 

Precisely.

> I think users of non-SRIS mode need to be aware of this issue, and
> this patch to narrow the race window, but not close it completely, is
> one good place to mention it.
> 

Okay. I'll mention it in patch description.

> > But this NoC error is something NVidia has also reported before, so
> > I wouldn't assume that this is a Qcom specific issue but rather for
> > SoCs depending on refclk from host.
> 
> Are there other drivers that need a similar band-aid?
> 

AFAIK, only tegra194 and qcom-pcie-ep drivers require refclk from host. Rest of
the endpoint drivers seem to have independent clock.

- Mani
diff mbox series

Patch

diff --git a/drivers/pci/controller/dwc/pcie-qcom-ep.c b/drivers/pci/controller/dwc/pcie-qcom-ep.c
index 236229f66c80..2319ff2ae9f6 100644
--- a/drivers/pci/controller/dwc/pcie-qcom-ep.c
+++ b/drivers/pci/controller/dwc/pcie-qcom-ep.c
@@ -846,21 +846,15 @@  static int qcom_pcie_ep_probe(struct platform_device *pdev)
 	if (ret)
 		return ret;
 
-	ret = qcom_pcie_enable_resources(pcie_ep);
-	if (ret) {
-		dev_err(dev, "Failed to enable resources: %d\n", ret);
-		return ret;
-	}
-
 	ret = dw_pcie_ep_init(&pcie_ep->pci.ep);
 	if (ret) {
 		dev_err(dev, "Failed to initialize endpoint: %d\n", ret);
-		goto err_disable_resources;
+		return ret;
 	}
 
 	ret = qcom_pcie_ep_enable_irq_resources(pdev, pcie_ep);
 	if (ret)
-		goto err_disable_resources;
+		goto err_ep_deinit;
 
 	name = devm_kasprintf(dev, GFP_KERNEL, "%pOFP", dev->of_node);
 	if (!name) {
@@ -877,8 +871,8 @@  static int qcom_pcie_ep_probe(struct platform_device *pdev)
 	disable_irq(pcie_ep->global_irq);
 	disable_irq(pcie_ep->perst_irq);
 
-err_disable_resources:
-	qcom_pcie_disable_resources(pcie_ep);
+err_ep_deinit:
+	dw_pcie_ep_deinit(&pcie_ep->pci.ep);
 
 	return ret;
 }