diff mbox series

[RFC,HACK] PCI: dwc: layerscape: Hack around enumeration problems with Honeycomb LX2K

Message ID 20201211121507.28166-1-daniel.thompson@linaro.org (mailing list archive)
State New, archived
Headers show
Series [RFC,HACK] PCI: dwc: layerscape: Hack around enumeration problems with Honeycomb LX2K | expand

Commit Message

Daniel Thompson Dec. 11, 2020, 12:15 p.m. UTC
I have been chasing down a problem enumerating an NVMe drive on a
Honeycomb LX2K (NXP LX2160A). Specifically the drive can only enumerate
successfully if the we are emitting lots of console messages via a UART.
If the system is booted with `quiet` set then enumeration fails.

I guessed this would be due to the timing impact of printk-to-UART and
tried to find out where a delay could be added to provoke a successful
enumeration.

This patch contains the results. The delay length (1ms) was selected by
binary searching downwards until the delay was not effective for these
devices (Honeycomb LX2K and a Western Digital WD Blue SN550).

I have also included the workaround twice (conditionally compiled). The
first change is the *latest* possible code path that we can deploy a
delay whilst the second is the earliest place I could find.

The summary is that the critical window were we are currently relying on
a call to the console UART code can "mend" the driver runs from calling
dw_pcie_setup_rc() in host init to just before we read the state in the
link up callback.

Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
---

Notes:
    This patch is RFC (and HACK) because I don't have much clue *why* this
    patch works... merely that this is the smallest possible change I need
    to replicate the current accidental printk() workaround.  Perhaps one
    could argue that RFC here stands for request-for-clue.  All my
    observations and changes here are empirical and I don't know how best to
    turn them into something that is not a hack!
    
    BTW I noticed many other pcie-designware drivers take advantage
    of a function called dw_pcie_wait_for_link() in their init paths...
    but my naive attempts to add it to the layerscape driver results
    in non-booting systems so I haven't embarrassed myself by including
    that in the patch!

 drivers/pci/controller/dwc/pci-layerscape.c | 35 +++++++++++++++++++++
 1 file changed, 35 insertions(+)


base-commit: 0477e92881850d44910a7e94fc2c46f96faa131f
--
2.29.2

Comments

Rob Herring (Arm) Dec. 11, 2020, 2:37 p.m. UTC | #1
On Fri, Dec 11, 2020 at 6:15 AM Daniel Thompson
<daniel.thompson@linaro.org> wrote:
>
> I have been chasing down a problem enumerating an NVMe drive on a
> Honeycomb LX2K (NXP LX2160A). Specifically the drive can only enumerate
> successfully if the we are emitting lots of console messages via a UART.
> If the system is booted with `quiet` set then enumeration fails.

We really need to get away from work-arounds for device X on host Y. I
suspect they are not limited to that combination only...

How exactly does it fail to enumerate? There's a (racy) linkup check
on config accesses, is it reporting link down and skipping config
accesses?

> I guessed this would be due to the timing impact of printk-to-UART and
> tried to find out where a delay could be added to provoke a successful
> enumeration.
>
> This patch contains the results. The delay length (1ms) was selected by
> binary searching downwards until the delay was not effective for these
> devices (Honeycomb LX2K and a Western Digital WD Blue SN550).
>
> I have also included the workaround twice (conditionally compiled). The
> first change is the *latest* possible code path that we can deploy a
> delay whilst the second is the earliest place I could find.
>
> The summary is that the critical window were we are currently relying on
> a call to the console UART code can "mend" the driver runs from calling
> dw_pcie_setup_rc() in host init to just before we read the state in the
> link up callback.
>
> Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
> ---
>
> Notes:
>     This patch is RFC (and HACK) because I don't have much clue *why* this
>     patch works... merely that this is the smallest possible change I need
>     to replicate the current accidental printk() workaround.  Perhaps one
>     could argue that RFC here stands for request-for-clue.  All my
>     observations and changes here are empirical and I don't know how best to
>     turn them into something that is not a hack!
>
>     BTW I noticed many other pcie-designware drivers take advantage
>     of a function called dw_pcie_wait_for_link() in their init paths...
>     but my naive attempts to add it to the layerscape driver results
>     in non-booting systems so I haven't embarrassed myself by including
>     that in the patch!

You need to look at what's pending for v5.11, because I reworked this
to be more unified. The ordering of init is also possibly changed. The
sequence is now like this:

        dw_pcie_setup_rc(pp);
        dw_pcie_msi_init(pp);

        if (!dw_pcie_link_up(pci) && pci->ops->start_link) {
                ret = pci->ops->start_link(pci);
                if (ret)
                        goto err_free_msi;
        }

        /* Ignore errors, the link may come up later */
        dw_pcie_wait_for_link(pci);


>  drivers/pci/controller/dwc/pci-layerscape.c | 35 +++++++++++++++++++++
>  1 file changed, 35 insertions(+)
>
> diff --git a/drivers/pci/controller/dwc/pci-layerscape.c b/drivers/pci/controller/dwc/pci-layerscape.c
> index f24f79a70d9a..c354904b90ef 100644
> --- a/drivers/pci/controller/dwc/pci-layerscape.c
> +++ b/drivers/pci/controller/dwc/pci-layerscape.c
> @@ -22,6 +22,8 @@
>
>  #include "pcie-designware.h"
>
> +#define WORKAROUND_LATEST_POSSIBLE
> +
>  /* PEX1/2 Misc Ports Status Register */
>  #define SCFG_PEXMSCPORTSR(pex_idx)     (0x94 + (pex_idx) * 4)
>  #define LTSSM_STATE_SHIFT      20
> @@ -113,10 +115,31 @@ static int ls_pcie_link_up(struct dw_pcie *pci)
>         struct ls_pcie *pcie = to_ls_pcie(pci);
>         u32 state;
>
> +       /*
> +        * Strictly speaking *this* (before the ioread32) is the latest
> +        * point a simple delay can be effective. If we move the delay
> +        * after the ioread32 then the NVMe does not enumerate.
> +        *
> +        * However this function appears to be frequently called so an
> +        * unconditional delay here causes noticeable delay at boot
> +        * time. Hence we implement the workaround by retrying the read
> +        * after a short delay if we think we might need to return false.
> +        */
> +
>         state = (ioread32(pcie->lut + pcie->drvdata->lut_dbg) >>
>                  pcie->drvdata->ltssm_shift) &
>                  LTSSM_STATE_MASK;
>
> +#ifdef WORKAROUND_LATEST_POSSIBLE
> +       if (state < LTSSM_PCIE_L0) {
> +               /* see comment above */
> +               mdelay(1);
> +               state = (ioread32(pcie->lut + pcie->drvdata->lut_dbg) >>
> +                        pcie->drvdata->ltssm_shift) &
> +                        LTSSM_STATE_MASK;

Side note, I really wonder about the variety of ways in DWC drivers we
have to check link state. The default is a debug register... Are the
standard PCIe registers for this really broken in these cases?

> +       }
> +#endif
> +
>         if (state < LTSSM_PCIE_L0)
>                 return 0;
>
> @@ -152,6 +175,18 @@ static int ls_pcie_host_init(struct pcie_port *pp)
>
>         dw_pcie_setup_rc(pp);

Might be related to the PORT_LOGIC_SPEED_CHANGE we do in here.

> +#ifdef WORKAROUND_EARLIEST_POSSIBLE
> +       /*
> +        * This is the earliest point the delay is effective.
> +        * If we move it before dw_pcie_setup_rc() then the
> +        * NVMe does not enumerate.
> +        *
> +        * 500us is too short to reliably work around the issue
> +        * hence adopting 1000us here.
> +        */
> +       mdelay(1);
> +#endif
> +
>         return 0;
>  }
>
>
> base-commit: 0477e92881850d44910a7e94fc2c46f96faa131f
> --
> 2.29.2
>
Daniel Thompson Dec. 11, 2020, 5:05 p.m. UTC | #2
On Fri, Dec 11, 2020 at 08:37:40AM -0600, Rob Herring wrote:
> On Fri, Dec 11, 2020 at 6:15 AM Daniel Thompson
> <daniel.thompson@linaro.org> wrote:
> >
> > I have been chasing down a problem enumerating an NVMe drive on a
> > Honeycomb LX2K (NXP LX2160A). Specifically the drive can only enumerate
> > successfully if the we are emitting lots of console messages via a UART.
> > If the system is booted with `quiet` set then enumeration fails.
> 
> We really need to get away from work-arounds for device X on host Y. I
> suspect they are not limited to that combination only...

No objection on that. This patch was essentially sharing the result of
an investigation where I got stuck at the "now fix it properly" stage!


> How exactly does it fail to enumerate? There's a (racy) linkup check
> on config accesses, is it reporting link down and skipping config
> accesses?

In dmesg terms it looked like this:

--- nvme_borked_gpu_working.linted.dmesg	2020-11-18 15:28:35.544118213 +0000
+++ nvme_working_gpu_working.linted.dmesg	2020-11-18 15:28:56.180076946 +0000
@@ -241,11 +241,19 @@
 pci 0000:00:00.0: reg 0x38: [mem 0x9048000000-0x9048ffffff pref]
 pci 0000:00:00.0: supports D1 D2
 pci 0000:00:00.0: PME# supported from D0 D1 D2 D3hot
+pci 0000:01:00.0: [15b7:5009] type 00 class 0x010802
+pci 0000:01:00.0: reg 0x10: [mem 0x9049000000-0x9049003fff 64bit]
+pci 0000:01:00.0: reg 0x20: [mem 0x9049004000-0x90490040ff 64bit]
 pci 0000:00:00.0: BAR 1: assigned [mem 0x9040000000-0x9043ffffff]
 pci 0000:00:00.0: BAR 0: assigned [mem 0x9044000000-0x9044ffffff]
 pci 0000:00:00.0: BAR 6: assigned [mem 0x9045000000-0x9045ffffff pref]
+pci 0000:00:00.0: BAR 14: assigned [mem 0x9046000000-0x90460fffff]
+pci 0000:01:00.0: BAR 0: assigned [mem 0x9046000000-0x9046003fff 64bit]
+pci 0000:01:00.0: BAR 4: assigned [mem 0x9046004000-0x90460040ff 64bit]
 pci 0000:00:00.0: PCI bridge to [bus 01-ff]
+pci 0000:00:00.0:   bridge window [mem 0x9046000000-0x90460fffff]
 pci 0000:00:00.0: Max Payload Size set to  256/ 256 (was  128), Max Read Rq  256
+pci 0000:01:00.0: Max Payload Size set to  256/ 512 (was  128), Max Read Rq  256
 layerscape-pcie 3800000.pcie: host bridge /soc/pcie@3800000 ranges:
 layerscape-pcie 3800000.pcie:      MEM 0xa040000000..0xa07fffffff -> 0x0040000000
 layerscape-pcie 3800000.pcie: PCI host bridge to bus 0001:00

... and be aware that the last three lines here are another PCIe
controller coming up just fine and it is about to detect the GPU
(which like the NVMe is also gen3) just fine whether or not we
add a short delay.


> > I guessed this would be due to the timing impact of printk-to-UART and
> > tried to find out where a delay could be added to provoke a successful
> > enumeration.
> >
> > This patch contains the results. The delay length (1ms) was selected by
> > binary searching downwards until the delay was not effective for these
> > devices (Honeycomb LX2K and a Western Digital WD Blue SN550).
> >
> > I have also included the workaround twice (conditionally compiled). The
> > first change is the *latest* possible code path that we can deploy a
> > delay whilst the second is the earliest place I could find.
> >
> > The summary is that the critical window were we are currently relying on
> > a call to the console UART code can "mend" the driver runs from calling
> > dw_pcie_setup_rc() in host init to just before we read the state in the
> > link up callback.
> >
> > Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
> > ---
> >
> > Notes:
> >     This patch is RFC (and HACK) because I don't have much clue *why* this
> >     patch works... merely that this is the smallest possible change I need
> >     to replicate the current accidental printk() workaround.  Perhaps one
> >     could argue that RFC here stands for request-for-clue.  All my
> >     observations and changes here are empirical and I don't know how best to
> >     turn them into something that is not a hack!
> >
> >     BTW I noticed many other pcie-designware drivers take advantage
> >     of a function called dw_pcie_wait_for_link() in their init paths...
> >     but my naive attempts to add it to the layerscape driver results
> >     in non-booting systems so I haven't embarrassed myself by including
> >     that in the patch!
> 
> You need to look at what's pending for v5.11, because I reworked this
> to be more unified. The ordering of init is also possibly changed. The
> sequence is now like this:
> 
>         dw_pcie_setup_rc(pp);
>         dw_pcie_msi_init(pp);
> 
>         if (!dw_pcie_link_up(pci) && pci->ops->start_link) {
>                 ret = pci->ops->start_link(pci);
>                 if (ret)
>                         goto err_free_msi;
>         }
> 
>         /* Ignore errors, the link may come up later */
>         dw_pcie_wait_for_link(pci);

Thanks. That looks likely to fix it since IIUC dw_pcie_wait_for_link()
will end up waiting somewhat like the double check I added to
ls_pcie_link_up().

I'll take a look at let you know.


Daniel.
Daniel Thompson Dec. 14, 2020, 10:43 a.m. UTC | #3
On Fri, Dec 11, 2020 at 05:05:58PM +0000, Daniel Thompson wrote:
> On Fri, Dec 11, 2020 at 08:37:40AM -0600, Rob Herring wrote:
> > On Fri, Dec 11, 2020 at 6:15 AM Daniel Thompson
> > >     BTW I noticed many other pcie-designware drivers take advantage
> > >     of a function called dw_pcie_wait_for_link() in their init paths...
> > >     but my naive attempts to add it to the layerscape driver results
> > >     in non-booting systems so I haven't embarrassed myself by including
> > >     that in the patch!
> > 
> > You need to look at what's pending for v5.11, because I reworked this
> > to be more unified. The ordering of init is also possibly changed. The
> > sequence is now like this:
> > 
> >         dw_pcie_setup_rc(pp);
> >         dw_pcie_msi_init(pp);
> > 
> >         if (!dw_pcie_link_up(pci) && pci->ops->start_link) {
> >                 ret = pci->ops->start_link(pci);
> >                 if (ret)
> >                         goto err_free_msi;
> >         }
> > 
> >         /* Ignore errors, the link may come up later */
> >         dw_pcie_wait_for_link(pci);
> 
> Thanks. That looks likely to fix it since IIUC dw_pcie_wait_for_link()
> will end up waiting somewhat like the double check I added to
> ls_pcie_link_up().
> 
> I'll take a look at let you know.

Yes. These changes have fixed the enumeration problems for me.

I tested pci/next and I cherry picked your patch series onto v5.10 and
both are working well.

Given this fixes a bug for me, do you think there is any scope for me
to whittle down your series into patches for the stable kernels or am
I likely to find too many extra bits being pulled in?


Daniel.
Rob Herring (Arm) Dec. 14, 2020, 2:57 p.m. UTC | #4
On Mon, Dec 14, 2020 at 4:43 AM Daniel Thompson
<daniel.thompson@linaro.org> wrote:
>
> On Fri, Dec 11, 2020 at 05:05:58PM +0000, Daniel Thompson wrote:
> > On Fri, Dec 11, 2020 at 08:37:40AM -0600, Rob Herring wrote:
> > > On Fri, Dec 11, 2020 at 6:15 AM Daniel Thompson
> > > >     BTW I noticed many other pcie-designware drivers take advantage
> > > >     of a function called dw_pcie_wait_for_link() in their init paths...
> > > >     but my naive attempts to add it to the layerscape driver results
> > > >     in non-booting systems so I haven't embarrassed myself by including
> > > >     that in the patch!
> > >
> > > You need to look at what's pending for v5.11, because I reworked this
> > > to be more unified. The ordering of init is also possibly changed. The
> > > sequence is now like this:
> > >
> > >         dw_pcie_setup_rc(pp);
> > >         dw_pcie_msi_init(pp);
> > >
> > >         if (!dw_pcie_link_up(pci) && pci->ops->start_link) {
> > >                 ret = pci->ops->start_link(pci);
> > >                 if (ret)
> > >                         goto err_free_msi;
> > >         }
> > >
> > >         /* Ignore errors, the link may come up later */
> > >         dw_pcie_wait_for_link(pci);
> >
> > Thanks. That looks likely to fix it since IIUC dw_pcie_wait_for_link()
> > will end up waiting somewhat like the double check I added to
> > ls_pcie_link_up().
> >
> > I'll take a look at let you know.
>
> Yes. These changes have fixed the enumeration problems for me.
>
> I tested pci/next and I cherry picked your patch series onto v5.10 and
> both are working well.
>
> Given this fixes a bug for me, do you think there is any scope for me
> to whittle down your series into patches for the stable kernels or am
> I likely to find too many extra bits being pulled in?

I think I'd just go the adding a delay route. It's a fairly big series
and depends on my other clean-up done in 5.10. And there's at least
some possibility it regresses some platform given the limited testing
linux-next gets.

Rob
diff mbox series

Patch

diff --git a/drivers/pci/controller/dwc/pci-layerscape.c b/drivers/pci/controller/dwc/pci-layerscape.c
index f24f79a70d9a..c354904b90ef 100644
--- a/drivers/pci/controller/dwc/pci-layerscape.c
+++ b/drivers/pci/controller/dwc/pci-layerscape.c
@@ -22,6 +22,8 @@ 

 #include "pcie-designware.h"

+#define WORKAROUND_LATEST_POSSIBLE
+
 /* PEX1/2 Misc Ports Status Register */
 #define SCFG_PEXMSCPORTSR(pex_idx)	(0x94 + (pex_idx) * 4)
 #define LTSSM_STATE_SHIFT	20
@@ -113,10 +115,31 @@  static int ls_pcie_link_up(struct dw_pcie *pci)
 	struct ls_pcie *pcie = to_ls_pcie(pci);
 	u32 state;

+	/*
+	 * Strictly speaking *this* (before the ioread32) is the latest
+	 * point a simple delay can be effective. If we move the delay
+	 * after the ioread32 then the NVMe does not enumerate.
+	 *
+	 * However this function appears to be frequently called so an
+	 * unconditional delay here causes noticeable delay at boot
+	 * time. Hence we implement the workaround by retrying the read
+	 * after a short delay if we think we might need to return false.
+	 */
+
 	state = (ioread32(pcie->lut + pcie->drvdata->lut_dbg) >>
 		 pcie->drvdata->ltssm_shift) &
 		 LTSSM_STATE_MASK;

+#ifdef WORKAROUND_LATEST_POSSIBLE
+	if (state < LTSSM_PCIE_L0) {
+		/* see comment above */
+		mdelay(1);
+		state = (ioread32(pcie->lut + pcie->drvdata->lut_dbg) >>
+			 pcie->drvdata->ltssm_shift) &
+			 LTSSM_STATE_MASK;
+	}
+#endif
+
 	if (state < LTSSM_PCIE_L0)
 		return 0;

@@ -152,6 +175,18 @@  static int ls_pcie_host_init(struct pcie_port *pp)

 	dw_pcie_setup_rc(pp);

+#ifdef WORKAROUND_EARLIEST_POSSIBLE
+	/*
+	 * This is the earliest point the delay is effective.
+	 * If we move it before dw_pcie_setup_rc() then the
+	 * NVMe does not enumerate.
+	 *
+	 * 500us is too short to reliably work around the issue
+	 * hence adopting 1000us here.
+	 */
+	mdelay(1);
+#endif
+
 	return 0;
 }