Message ID | 20240823154023.360234-3-superm1@kernel.org (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | Verify devices transition from D3cold to D0 | expand |
[+cc Duc, Alex] On Fri, Aug 23, 2024 at 10:40:20AM -0500, Mario Limonciello wrote: > If a dock is plugged in at the same time as autosuspend delay then this > can cause malfunctions in the USB4 stack. This happens because the > device is still in D3cold at the time that the PCI core handed > control back to the USB4 stack. I assume the USB device in question is in the dock that was hot-added? This patch suggests that pci_dev_wait() has waited for a read of PCI_COMMAND to respond with something other than ~0, but the device is still in D3cold. I suppose we got to pci_dev_wait() via pci_pm_bridge_power_up_actions() calling pci_bridge_wait_for_secondary_bus(), since I wouldn't expect a reset in the hot-add case. > A device that has gone through a reset may return a value in PCI_COMMAND > but that doesn't mean it's finished transitioning to D0. For evices that > support power management explicitly check PCI_PM_CTRL on everything but > system resume to ensure the transition happened. s/evices/devices/ > Devices that don't support power management and system resume will > continue to use PCI_COMMAND. Is there a bugzilla or other report with more details that we can include here? > Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> > --- > v4->v5: > * Fix misleading indentation > * Amend commit message > --- > drivers/pci/pci.c | 28 ++++++++++++++++++++-------- > 1 file changed, 20 insertions(+), 8 deletions(-) > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > index 1e219057a5069..f032a4aaec268 100644 > --- a/drivers/pci/pci.c > +++ b/drivers/pci/pci.c > @@ -1309,21 +1309,33 @@ static int pci_dev_wait(struct pci_dev *dev, enum pci_reset_type reset_type, int > * the read (except when CRS SV is enabled and the read was for the > * Vendor ID; in that case it synthesizes 0x0001 data). > * > - * Wait for the device to return a non-CRS completion. Read the > - * Command register instead of Vendor ID so we don't have to > - * contend with the CRS SV value. > + * Wait for the device to return a non-CRS completion. On devices > + * that support PM control and on waits that aren't part of system > + * resume read the PM control register to ensure the device has > + * transitioned to D0. On devices that don't support PM control, > + * or during system resume read the command register to instead of > + * Vendor ID so we don't have to contend with the CRS SV value. > */ > for (;;) { > - u32 id; > - > if (pci_dev_is_disconnected(dev)) { > pci_dbg(dev, "disconnected; not waiting\n"); > return -ENOTTY; > } > > - pci_read_config_dword(dev, PCI_COMMAND, &id); > - if (!PCI_POSSIBLE_ERROR(id)) > - break; > + if (dev->pm_cap && reset_type != PCI_DEV_WAIT_RESUME) { > + u16 pmcsr; > + > + pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr); > + if (!PCI_POSSIBLE_ERROR(pmcsr) && > + (pmcsr & PCI_PM_CTRL_STATE_MASK) == PCI_D0) > + break; > + } else { > + u32 id; > + > + pci_read_config_dword(dev, PCI_COMMAND, &id); > + if (!PCI_POSSIBLE_ERROR(id)) > + break; > + } What is the rationale behind using PCI_PM_CTRL in some cases and PCI_COMMAND in others? Is there some spec language we can cite for this? IIUC, pci_dev_wait() waits for a device to be ready after a reset (FLR, D3hot->D0 transition for devices where No_Soft_Reset is clear, DPC) and after power-up from D3cold (pci_pm_bridge_power_up_actions()). I think device readiness in all of these cases is covered by PCIe r6.0, sec 6.6.1. If the Root Port above the device supports Configuration RRS Software Visibility, I think we probably should use that here instead of either PCI_COMMAND or PCI_PM_CTRL. SR-IOV VFs don't implement Vendor ID, which might complicate this a little. But it niggles in my mind that there may be some other problem beyond that. Maybe Alex remembers. Anyway, if we meet the requirements of sec 6.6.1, the device should be ready to respond to config requests, and I assume that also means the device is in D0. > if (delay > timeout) { > pci_warn(dev, "not ready %dms after %s; giving up\n", > -- > 2.43.0 >
On 8/23/2024 14:54, Bjorn Helgaas wrote: > [+cc Duc, Alex] > > On Fri, Aug 23, 2024 at 10:40:20AM -0500, Mario Limonciello wrote: >> If a dock is plugged in at the same time as autosuspend delay then this >> can cause malfunctions in the USB4 stack. This happens because the >> device is still in D3cold at the time that the PCI core handed >> control back to the USB4 stack. > > I assume the USB device in question is in the dock that was hot-added? No; it's actually the USB4 router that is malfunctioning. The CM (thunderbolt.ko) thinks the router is in D0 already, but when it attempts to do a register read it gets back all F's and it trusts that. > This patch suggests that pci_dev_wait() has waited for a read of > PCI_COMMAND to respond with something other than ~0, but the device is > still in D3cold. I suppose we got to pci_dev_wait() via > pci_pm_bridge_power_up_actions() calling > pci_bridge_wait_for_secondary_bus(), since I wouldn't expect a reset > in the hot-add case. As I said it's the router (part of the SoC). The device never disappears. It's the action of plugging in/out the the dock that causes it to change power states. We didn't try it, but I wouldn't be surprised if it could be reproduced with a script that turned on/off runtime PM on very tight timing around the autosuspend delay. > >> A device that has gone through a reset may return a value in PCI_COMMAND >> but that doesn't mean it's finished transitioning to D0. For evices that >> support power management explicitly check PCI_PM_CTRL on everything but >> system resume to ensure the transition happened. > > s/evices/devices/ Thanks. > >> Devices that don't support power management and system resume will >> continue to use PCI_COMMAND. > > Is there a bugzilla or other report with more details that we can > include here? No, unfortunately in this case it was only reported internally at AMD. Gary who is on CC brought it to me though, and if you think there are some other specific details needed but are missing we can see what else can be added to the commit message. > >> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> >> --- >> v4->v5: >> * Fix misleading indentation >> * Amend commit message >> --- >> drivers/pci/pci.c | 28 ++++++++++++++++++++-------- >> 1 file changed, 20 insertions(+), 8 deletions(-) >> >> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c >> index 1e219057a5069..f032a4aaec268 100644 >> --- a/drivers/pci/pci.c >> +++ b/drivers/pci/pci.c >> @@ -1309,21 +1309,33 @@ static int pci_dev_wait(struct pci_dev *dev, enum pci_reset_type reset_type, int >> * the read (except when CRS SV is enabled and the read was for the >> * Vendor ID; in that case it synthesizes 0x0001 data). >> * >> - * Wait for the device to return a non-CRS completion. Read the >> - * Command register instead of Vendor ID so we don't have to >> - * contend with the CRS SV value. >> + * Wait for the device to return a non-CRS completion. On devices >> + * that support PM control and on waits that aren't part of system >> + * resume read the PM control register to ensure the device has >> + * transitioned to D0. On devices that don't support PM control, >> + * or during system resume read the command register to instead of >> + * Vendor ID so we don't have to contend with the CRS SV value. >> */ >> for (;;) { >> - u32 id; >> - >> if (pci_dev_is_disconnected(dev)) { >> pci_dbg(dev, "disconnected; not waiting\n"); >> return -ENOTTY; >> } >> >> - pci_read_config_dword(dev, PCI_COMMAND, &id); >> - if (!PCI_POSSIBLE_ERROR(id)) >> - break; >> + if (dev->pm_cap && reset_type != PCI_DEV_WAIT_RESUME) { >> + u16 pmcsr; >> + >> + pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr); >> + if (!PCI_POSSIBLE_ERROR(pmcsr) && >> + (pmcsr & PCI_PM_CTRL_STATE_MASK) == PCI_D0) >> + break; >> + } else { >> + u32 id; >> + >> + pci_read_config_dword(dev, PCI_COMMAND, &id); >> + if (!PCI_POSSIBLE_ERROR(id)) >> + break; >> + } > > What is the rationale behind using PCI_PM_CTRL in some cases and > PCI_COMMAND in others? We saw a deadlock during resume from suspend when PCI_PM_CTRL was used for all cases that supported dev->pm_cap. > Is there some spec language we can cite for > this? Perhaps it being a "cold reset" during resume? > > IIUC, pci_dev_wait() waits for a device to be ready after a reset > (FLR, D3hot->D0 transition for devices where No_Soft_Reset is clear, > DPC) and after power-up from D3cold (pci_pm_bridge_power_up_actions()). > I think device readiness in all of these cases is covered by PCIe > r6.0, sec 6.6.1. Would it be helpful to you to get a dump_stack() call trace to pci_power_up() the specific call flow that needed this fix? Gary is able to to reproduce this at will, I think he should be able to gather that using an unpatched kernel to help this conversation. > > If the Root Port above the device supports Configuration RRS Software > Visibility, I think we probably should use that here instead of either > PCI_COMMAND or PCI_PM_CTRL. I did check and in this case the root port the USB4 routers are connected to support this. How do you think this should be done instead? > SR-IOV VFs don't implement Vendor ID, > which might complicate this a little. But it niggles in my mind that > there may be some other problem beyond that. Maybe Alex remembers. > > Anyway, if we meet the requirements of sec 6.6.1, the device should be > ready to respond to config requests, and I assume that also means > the device is in D0. > Within that section there is a quote to point out: " To allow components to perform internal initialization, system software must wait a specified minimum period following exit from a Conventional Reset of one or more devices before it is permitted to issue Configuration Requests to those devices " In pci_power_up() I don't really see any hardcoded delays specifically for this case of leaving D3cold. The PCI PM spec specifies that it will take "Full context restore or boot latency". I don't think it's reasonable to have NO delay.
On 8/26/2024 14:16, Mario Limonciello wrote: > On 8/23/2024 14:54, Bjorn Helgaas wrote: >> [+cc Duc, Alex] >> >> On Fri, Aug 23, 2024 at 10:40:20AM -0500, Mario Limonciello wrote: >>> If a dock is plugged in at the same time as autosuspend delay then this >>> can cause malfunctions in the USB4 stack. This happens because the >>> device is still in D3cold at the time that the PCI core handed >>> control back to the USB4 stack. >> >> I assume the USB device in question is in the dock that was hot-added? > > No; it's actually the USB4 router that is malfunctioning. The CM > (thunderbolt.ko) thinks the router is in D0 already, but when it > attempts to do a register read it gets back all F's and it trusts that. > >> This patch suggests that pci_dev_wait() has waited for a read of >> PCI_COMMAND to respond with something other than ~0, but the device is >> still in D3cold. I suppose we got to pci_dev_wait() via >> pci_pm_bridge_power_up_actions() calling >> pci_bridge_wait_for_secondary_bus(), since I wouldn't expect a reset >> in the hot-add case. > > As I said it's the router (part of the SoC). The device never > disappears. It's the action of plugging in/out the the dock that causes > it to change power states. > > We didn't try it, but I wouldn't be surprised if it could be reproduced > with a script that turned on/off runtime PM on very tight timing around > the autosuspend delay. > >> >>> A device that has gone through a reset may return a value in PCI_COMMAND >>> but that doesn't mean it's finished transitioning to D0. For evices >>> that >>> support power management explicitly check PCI_PM_CTRL on everything but >>> system resume to ensure the transition happened. >> >> s/evices/devices/ > > Thanks. > >> >>> Devices that don't support power management and system resume will >>> continue to use PCI_COMMAND. >> >> Is there a bugzilla or other report with more details that we can >> include here? > > No, unfortunately in this case it was only reported internally at AMD. > > Gary who is on CC brought it to me though, and if you think there are > some other specific details needed but are missing we can see what else > can be added to the commit message. > >> >>> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> >>> --- >>> v4->v5: >>> * Fix misleading indentation >>> * Amend commit message >>> --- >>> drivers/pci/pci.c | 28 ++++++++++++++++++++-------- >>> 1 file changed, 20 insertions(+), 8 deletions(-) >>> >>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c >>> index 1e219057a5069..f032a4aaec268 100644 >>> --- a/drivers/pci/pci.c >>> +++ b/drivers/pci/pci.c >>> @@ -1309,21 +1309,33 @@ static int pci_dev_wait(struct pci_dev *dev, >>> enum pci_reset_type reset_type, int >>> * the read (except when CRS SV is enabled and the read was for >>> the >>> * Vendor ID; in that case it synthesizes 0x0001 data). >>> * >>> - * Wait for the device to return a non-CRS completion. Read the >>> - * Command register instead of Vendor ID so we don't have to >>> - * contend with the CRS SV value. >>> + * Wait for the device to return a non-CRS completion. On devices >>> + * that support PM control and on waits that aren't part of system >>> + * resume read the PM control register to ensure the device has >>> + * transitioned to D0. On devices that don't support PM control, >>> + * or during system resume read the command register to instead of >>> + * Vendor ID so we don't have to contend with the CRS SV value. >>> */ >>> for (;;) { >>> - u32 id; >>> - >>> if (pci_dev_is_disconnected(dev)) { >>> pci_dbg(dev, "disconnected; not waiting\n"); >>> return -ENOTTY; >>> } >>> - pci_read_config_dword(dev, PCI_COMMAND, &id); >>> - if (!PCI_POSSIBLE_ERROR(id)) >>> - break; >>> + if (dev->pm_cap && reset_type != PCI_DEV_WAIT_RESUME) { >>> + u16 pmcsr; >>> + >>> + pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, >>> &pmcsr); >>> + if (!PCI_POSSIBLE_ERROR(pmcsr) && >>> + (pmcsr & PCI_PM_CTRL_STATE_MASK) == PCI_D0) >>> + break; >>> + } else { >>> + u32 id; >>> + >>> + pci_read_config_dword(dev, PCI_COMMAND, &id); >>> + if (!PCI_POSSIBLE_ERROR(id)) >>> + break; >>> + } >> >> What is the rationale behind using PCI_PM_CTRL in some cases and >> PCI_COMMAND in others? > > We saw a deadlock during resume from suspend when PCI_PM_CTRL was used > for all cases that supported dev->pm_cap. > >> Is there some spec language we can cite for >> this? > > Perhaps it being a "cold reset" during resume? > >> >> IIUC, pci_dev_wait() waits for a device to be ready after a reset >> (FLR, D3hot->D0 transition for devices where No_Soft_Reset is clear, >> DPC) and after power-up from D3cold (pci_pm_bridge_power_up_actions()). >> I think device readiness in all of these cases is covered by PCIe >> r6.0, sec 6.6.1. > > Would it be helpful to you to get a dump_stack() call trace to > pci_power_up() the specific call flow that needed this fix? > > Gary is able to to reproduce this at will, I think he should be able to > gather that using an unpatched kernel to help this conversation. Here is the kernel trace with a dump_stack() and a printk of current inserted into pci_power_up() right before the failure occurs. https://gist.github.com/superm1/cb407766ab15f42f12a6fe9d1196f6fc Also the failure is visible right after. > >> >> If the Root Port above the device supports Configuration RRS Software >> Visibility, I think we probably should use that here instead of either >> PCI_COMMAND or PCI_PM_CTRL. > > I did check and in this case the root port the USB4 routers are > connected to support this. > > How do you think this should be done instead? > >> SR-IOV VFs don't implement Vendor ID, >> which might complicate this a little. But it niggles in my mind that >> there may be some other problem beyond that. Maybe Alex remembers. > > >> >> Anyway, if we meet the requirements of sec 6.6.1, the device should be >> ready to respond to config requests, and I assume that also means >> the device is in D0. >> > > Within that section there is a quote to point out: > > " > To allow components to perform internal initialization, system software > must wait a specified minimum period > following exit from a Conventional Reset of one or more devices before > it is permitted to issue Configuration > Requests to those devices > " > > In pci_power_up() I don't really see any hardcoded delays specifically > for this case of leaving D3cold. The PCI PM spec specifies that it will > take "Full context restore or boot latency". I don't think it's > reasonable to have NO delay.
[+cc Rafael for ACPI power transition from D3cold to D0 delay] On Mon, Aug 26, 2024 at 02:16:34PM -0500, Mario Limonciello wrote: > On 8/23/2024 14:54, Bjorn Helgaas wrote: > > On Fri, Aug 23, 2024 at 10:40:20AM -0500, Mario Limonciello wrote: > > > If a dock is plugged in at the same time as autosuspend delay > > > then this can cause malfunctions in the USB4 stack. This happens > > > because the device is still in D3cold at the time that the PCI > > > core handed control back to the USB4 stack. > ... > > > A device that has gone through a reset may return a value in > > > PCI_COMMAND but that doesn't mean it's finished transitioning to > > > D0. For devices that support power management explicitly check > > > PCI_PM_CTRL on everything but system resume to ensure the > > > transition happened. > ... > > > Devices that don't support power management and system resume > > > will continue to use PCI_COMMAND. > > > --- a/drivers/pci/pci.c > > > +++ b/drivers/pci/pci.c > > > @@ -1309,21 +1309,33 @@ static int pci_dev_wait(struct pci_dev *dev, enum pci_reset_type reset_type, int > > > * the read (except when CRS SV is enabled and the read was for the > > > * Vendor ID; in that case it synthesizes 0x0001 data). > > > * > > > - * Wait for the device to return a non-CRS completion. Read the > > > - * Command register instead of Vendor ID so we don't have to > > > - * contend with the CRS SV value. > > > + * Wait for the device to return a non-CRS completion. On devices > > > + * that support PM control and on waits that aren't part of system > > > + * resume read the PM control register to ensure the device has > > > + * transitioned to D0. On devices that don't support PM control, > > > + * or during system resume read the command register to instead of > > > + * Vendor ID so we don't have to contend with the CRS SV value. > > > */ > > > for (;;) { > > > - u32 id; > > > - > > > if (pci_dev_is_disconnected(dev)) { > > > pci_dbg(dev, "disconnected; not waiting\n"); > > > return -ENOTTY; > > > } > > > - pci_read_config_dword(dev, PCI_COMMAND, &id); > > > - if (!PCI_POSSIBLE_ERROR(id)) > > > - break; > > > + if (dev->pm_cap && reset_type != PCI_DEV_WAIT_RESUME) { > > > + u16 pmcsr; > > > + > > > + pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr); > > > + if (!PCI_POSSIBLE_ERROR(pmcsr) && > > > + (pmcsr & PCI_PM_CTRL_STATE_MASK) == PCI_D0) > > > + break; > > > + } else { > > > + u32 id; > > > + > > > + pci_read_config_dword(dev, PCI_COMMAND, &id); > > > + if (!PCI_POSSIBLE_ERROR(id)) > > > + break; > > > + } > > > > What is the rationale behind using PCI_PM_CTRL in some cases and > > PCI_COMMAND in others? > > We saw a deadlock during resume from suspend when PCI_PM_CTRL was used for > all cases that supported dev->pm_cap. > > > Is there some spec language we can cite for > > this? > > Perhaps it being a "cold reset" during resume? > > > IIUC, pci_dev_wait() waits for a device to be ready after a reset > > (FLR, D3hot->D0 transition for devices where No_Soft_Reset is clear, > > DPC) and after power-up from D3cold (pci_pm_bridge_power_up_actions()). > > I think device readiness in all of these cases is covered by PCIe > > r6.0, sec 6.6.1. > ... > > If the Root Port above the device supports Configuration RRS > > Software Visibility, I think we probably should use that here > > instead of either PCI_COMMAND or PCI_PM_CTRL. > > I did check and in this case the root port the USB4 routers are > connected to support this. > > How do you think this should be done instead? > > > SR-IOV VFs don't implement Vendor ID, which might complicate this > > a little. But it niggles in my mind that there may be some other > > problem beyond that. Maybe Alex remembers. > > > > Anyway, if we meet the requirements of sec 6.6.1, the device > > should be ready to respond to config requests, and I assume that > > also means the device is in D0. > > Within that section there is a quote to point out: > > " > To allow components to perform internal initialization, system > software must wait a specified minimum period following exit from a > Conventional Reset of one or more devices before it is permitted to > issue Configuration Requests to those devices > " > > In pci_power_up() I don't really see any hardcoded delays > specifically for this case of leaving D3cold. The PCI PM spec > specifies that it will take "Full context restore or boot latency". > I don't think it's reasonable to have NO delay. I agree, leaving D3cold means "applying power to the device", and PCIe r6.0, sec 6.6.1, says that's a Fundamental Reset: A Fundamental Reset must occur following the application of power to the component. This is called a Cold Reset. So we need a delay similar to what we do in pci_bridge_wait_for_secondary_bus(). I don't know whether that is supposed to happen somewhere on the ACPI side or in the PCI side, but my inclination would be the PCI side because the delay isn't really platform-specific, it's specified by the PCI specs, and the OS needs to manage the RRS Software Visibility retries and features like Readiness Notifications. pci_set_low_power_state() contains corresponding delays for putting devices in D1, D2, and D3hot. Maybe the D3cold -> D0 delays should be in platform_pci_set_power_state()? I think pci_power_up() is the only caller that sets the state to D0, and it assumes the device is Configuration Ready when platform_pci_set_power_state() returns: pci_power_up platform_pci_set_power_state pci_read_config_word(PCI_PM_CTRL) So I think we need basically the same delays and pci_dev_wait() stuff in this path. Bjorn
On Fri, Aug 23, 2024 at 10:40:20AM -0500, Mario Limonciello wrote: > From: Mario Limonciello <mario.limonciello@amd.com> > > If a dock is plugged in at the same time as autosuspend delay then > this can cause malfunctions in the USB4 stack. This happens because > the device is still in D3cold at the time that the PCI core handed > control back to the USB4 stack. > > A device that has gone through a reset may return a value in > PCI_COMMAND but that doesn't mean it's finished transitioning to D0. > For devices that support power management explicitly check > PCI_PM_CTRL on everything but system resume to ensure the transition > happened. Still trying to understand what's going on here. I posted a change to pci_dev_wait() to read Vendor ID, look for Config RRS status, and wait for a successful completion (when RRS Software Visibility is enabled) [1]. You tested that and found that it didn't help with *this* issue [2]. I assume you tested something like v6.11-rc plus the patches from [1], i.e., without the PCI_PM_CTRL changes in this series. 1) Initially the device is in D0 2) We put it in D3cold (using some ACPI method) because the autosuspend delay expired (?) 3) Plugging in the dock wakes up the device, so we power up the device (via pci_power_up(), which again uses some ACPI method), and it should transition to D0uninitialized 4) The USB4 stack sees the device but thinks it's in D3cold (?) If your testing only included [1], but did not include the pci_power_up() change from patch 3/5 "Verify functions currently in D3cold have entered D0", I don't think we would call pci_dev_wait(), so I wouldn't expect [1] to make any difference. If you *did* include both [1] and patch 3/5, the implication would be that pci_dev_wait() successfully read the Vendor ID, meaning the device is not in D3cold when pci_power_up() returns. Can you clarify what you see and possibly expand/correct my timeline above? [1] https://lore.kernel.org/linux-pci/20240827234848.4429-1-helgaas@kernel.org/ [2] https://lore.kernel.org/linux-pci/30d9589a-8050-421b-a9a5-ad3422feadad@amd.com/ > Devices that don't support power management and system resume will > continue to use PCI_COMMAND. > > Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> > --- > v4->v5: > * Fix misleading indentation > * Amend commit message > --- > drivers/pci/pci.c | 28 ++++++++++++++++++++-------- > 1 file changed, 20 insertions(+), 8 deletions(-) > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > index 1e219057a5069..f032a4aaec268 100644 > --- a/drivers/pci/pci.c > +++ b/drivers/pci/pci.c > @@ -1309,21 +1309,33 @@ static int pci_dev_wait(struct pci_dev *dev, enum pci_reset_type reset_type, int > * the read (except when CRS SV is enabled and the read was for the > * Vendor ID; in that case it synthesizes 0x0001 data). > * > - * Wait for the device to return a non-CRS completion. Read the > - * Command register instead of Vendor ID so we don't have to > - * contend with the CRS SV value. > + * Wait for the device to return a non-CRS completion. On devices > + * that support PM control and on waits that aren't part of system > + * resume read the PM control register to ensure the device has > + * transitioned to D0. On devices that don't support PM control, > + * or during system resume read the command register to instead of > + * Vendor ID so we don't have to contend with the CRS SV value. > */ > for (;;) { > - u32 id; > - > if (pci_dev_is_disconnected(dev)) { > pci_dbg(dev, "disconnected; not waiting\n"); > return -ENOTTY; > } > > - pci_read_config_dword(dev, PCI_COMMAND, &id); > - if (!PCI_POSSIBLE_ERROR(id)) > - break; > + if (dev->pm_cap && reset_type != PCI_DEV_WAIT_RESUME) { > + u16 pmcsr; > + > + pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr); > + if (!PCI_POSSIBLE_ERROR(pmcsr) && > + (pmcsr & PCI_PM_CTRL_STATE_MASK) == PCI_D0) > + break; > + } else { > + u32 id; > + > + pci_read_config_dword(dev, PCI_COMMAND, &id); > + if (!PCI_POSSIBLE_ERROR(id)) > + break; > + } > > if (delay > timeout) { > pci_warn(dev, "not ready %dms after %s; giving up\n", > -- > 2.43.0 >
On 8/29/2024 19:01, Bjorn Helgaas wrote: > On Fri, Aug 23, 2024 at 10:40:20AM -0500, Mario Limonciello wrote: >> From: Mario Limonciello <mario.limonciello@amd.com> >> >> If a dock is plugged in at the same time as autosuspend delay then >> this can cause malfunctions in the USB4 stack. This happens because >> the device is still in D3cold at the time that the PCI core handed >> control back to the USB4 stack. >> >> A device that has gone through a reset may return a value in >> PCI_COMMAND but that doesn't mean it's finished transitioning to D0. >> For devices that support power management explicitly check >> PCI_PM_CTRL on everything but system resume to ensure the transition >> happened. > > Still trying to understand what's going on here. > > I posted a change to pci_dev_wait() to read Vendor ID, look for Config > RRS status, and wait for a successful completion (when RRS Software > Visibility is enabled) [1]. > > You tested that and found that it didn't help with *this* issue [2]. > I assume you tested something like v6.11-rc plus the patches from [1], > i.e., without the PCI_PM_CTRL changes in this series. > > 1) Initially the device is in D0 > > 2) We put it in D3cold (using some ACPI method) because the > autosuspend delay expired (?) > > 3) Plugging in the dock wakes up the device, so we power up the > device (via pci_power_up(), which again uses some ACPI method), and > it should transition to D0uninitialized > > 4) The USB4 stack sees the device but thinks it's in D3cold (?) > > If your testing only included [1], but did not include the > pci_power_up() change from patch 3/5 "Verify functions currently in > D3cold have entered D0", I don't think we would call pci_dev_wait(), > so I wouldn't expect [1] to make any difference. > > If you *did* include both [1] and patch 3/5, the implication would be > that pci_dev_wait() successfully read the Vendor ID, meaning the > device is not in D3cold when pci_power_up() returns. The testing here was from the LTS 6.6.y kernel with both [1] and patch 3/5 from this series. I can get testing from 6.11-rc6 with a combination of patches if you would like. > > Can you clarify what you see and possibly expand/correct my timeline > above? The timeline you shared is nearly correct. The USB4 stack *thinks* the device is in D0 because of the return of pci_power_up(). As by polling PCI_PM_CTRL we can see it's still in D3, so it hasn't made it to D0uninitialized yet. I guess I reading between the lines you have an assumption that you can't read the vendor ID from D3; which doesn't appear to be the case from our testing. > > [1] https://lore.kernel.org/linux-pci/20240827234848.4429-1-helgaas@kernel.org/ > [2] https://lore.kernel.org/linux-pci/30d9589a-8050-421b-a9a5-ad3422feadad@amd.com/ > >> Devices that don't support power management and system resume will >> continue to use PCI_COMMAND. >> >> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> >> --- >> v4->v5: >> * Fix misleading indentation >> * Amend commit message >> --- >> drivers/pci/pci.c | 28 ++++++++++++++++++++-------- >> 1 file changed, 20 insertions(+), 8 deletions(-) >> >> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c >> index 1e219057a5069..f032a4aaec268 100644 >> --- a/drivers/pci/pci.c >> +++ b/drivers/pci/pci.c >> @@ -1309,21 +1309,33 @@ static int pci_dev_wait(struct pci_dev *dev, enum pci_reset_type reset_type, int >> * the read (except when CRS SV is enabled and the read was for the >> * Vendor ID; in that case it synthesizes 0x0001 data). >> * >> - * Wait for the device to return a non-CRS completion. Read the >> - * Command register instead of Vendor ID so we don't have to >> - * contend with the CRS SV value. >> + * Wait for the device to return a non-CRS completion. On devices >> + * that support PM control and on waits that aren't part of system >> + * resume read the PM control register to ensure the device has >> + * transitioned to D0. On devices that don't support PM control, >> + * or during system resume read the command register to instead of >> + * Vendor ID so we don't have to contend with the CRS SV value. >> */ >> for (;;) { >> - u32 id; >> - >> if (pci_dev_is_disconnected(dev)) { >> pci_dbg(dev, "disconnected; not waiting\n"); >> return -ENOTTY; >> } >> >> - pci_read_config_dword(dev, PCI_COMMAND, &id); >> - if (!PCI_POSSIBLE_ERROR(id)) >> - break; >> + if (dev->pm_cap && reset_type != PCI_DEV_WAIT_RESUME) { >> + u16 pmcsr; >> + >> + pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr); >> + if (!PCI_POSSIBLE_ERROR(pmcsr) && >> + (pmcsr & PCI_PM_CTRL_STATE_MASK) == PCI_D0) >> + break; >> + } else { >> + u32 id; >> + >> + pci_read_config_dword(dev, PCI_COMMAND, &id); >> + if (!PCI_POSSIBLE_ERROR(id)) >> + break; >> + } >> >> if (delay > timeout) { >> pci_warn(dev, "not ready %dms after %s; giving up\n", >> -- >> 2.43.0 >>
On Tue, Sep 03, 2024 at 11:29:23AM -0500, Mario Limonciello wrote: > On 8/29/2024 19:01, Bjorn Helgaas wrote: > > On Fri, Aug 23, 2024 at 10:40:20AM -0500, Mario Limonciello wrote: > > > From: Mario Limonciello <mario.limonciello@amd.com> > > > > > > If a dock is plugged in at the same time as autosuspend delay then > > > this can cause malfunctions in the USB4 stack. This happens because > > > the device is still in D3cold at the time that the PCI core handed > > > control back to the USB4 stack. > > > > > > A device that has gone through a reset may return a value in > > > PCI_COMMAND but that doesn't mean it's finished transitioning to D0. > > > For devices that support power management explicitly check > > > PCI_PM_CTRL on everything but system resume to ensure the transition > > > happened. > > > > Still trying to understand what's going on here. > > > > I posted a change to pci_dev_wait() to read Vendor ID, look for Config > > RRS status, and wait for a successful completion (when RRS Software > > Visibility is enabled) [1]. > > > > You tested that and found that it didn't help with *this* issue [2]. > > I assume you tested something like v6.11-rc plus the patches from [1], > > i.e., without the PCI_PM_CTRL changes in this series. > > > > 1) Initially the device is in D0 > > > > 2) We put it in D3cold (using some ACPI method) because the > > autosuspend delay expired (?) > > > > 3) Plugging in the dock wakes up the device, so we power up the > > device (via pci_power_up(), which again uses some ACPI method), and > > it should transition to D0uninitialized > > > > 4) The USB4 stack sees the device but thinks it's in D3cold (?) > ... > > If you *did* include both [1] and patch 3/5, the implication would be > > that pci_dev_wait() successfully read the Vendor ID, meaning the > > device is not in D3cold when pci_power_up() returns. > > The testing here was from the LTS 6.6.y kernel with both [1] and > patch 3/5 from this series. > > > Can you clarify what you see and possibly expand/correct my > > timeline above? > > The timeline you shared is nearly correct. The USB4 stack *thinks* > the device is in D0 because of the return of pci_power_up(). > > As by polling PCI_PM_CTRL we can see it's still in D3, so it hasn't > made it to D0uninitialized yet. > > I guess I reading between the lines you have an assumption that you > can't read the vendor ID from D3; which doesn't appear to be the > case from our testing. A Vendor ID read of a device in D3hot should definitely work. Obviously if the device were in D3cold, we'd get no response at all, so the requester should log a UR error and fabricate ~0 data. But if the device starts out in D3cold and we power it up, it should not go through D3hot. The only legal transition from D3cold is to D0uninitialized (PCIe r6.0, sec 5.8). OK, so with [1] and patch 3/5: 1) Initially the device is in D0 2) We put it in D3cold (using some ACPI method) because the autosuspend delay expired (?) 3) Plugging in the dock wakes up the device, so we power up the device (via pci_power_up(), which again uses some ACPI method), and it should transition to D0uninitialized 4) With patch 3/5, pci_power_up() calls pci_dev_wait() because dev->current_state == PCI_D3cold 5) I *assume* RRS SV is enabled (lspci -vv of Root Port would confirm this; maybe we should add a pci_dbg message about which register we're polling). If so, patch [1] means we should poll Vendor ID until successful completion. 6) pci_dbg log should confirm the device is ready with a "ready %dms after D3cold->D0" message, which would mean we got a successful completion when reading Vendor ID 7) For debugging purposes, it would be interesting to read and log the PCI_PM_CTRL value here. Per sec 2.3.1, the device is not allowed to return RRS at this point since we already got a successful completion. 8) The USB4 stack sees the device and assumes it is in D0, but it seems to still be in D3cold. What is this based on? Is there a config read that returns ~0 data when it shouldn't? > > [1] https://lore.kernel.org/linux-pci/20240827234848.4429-1-helgaas@kernel.org/ > > [2] https://lore.kernel.org/linux-pci/30d9589a-8050-421b-a9a5-ad3422feadad@amd.com/ > > > > > Devices that don't support power management and system resume will > > > continue to use PCI_COMMAND. > > > > > > Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> > > > --- > > > v4->v5: > > > * Fix misleading indentation > > > * Amend commit message > > > --- > > > drivers/pci/pci.c | 28 ++++++++++++++++++++-------- > > > 1 file changed, 20 insertions(+), 8 deletions(-) > > > > > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > > > index 1e219057a5069..f032a4aaec268 100644 > > > --- a/drivers/pci/pci.c > > > +++ b/drivers/pci/pci.c > > > @@ -1309,21 +1309,33 @@ static int pci_dev_wait(struct pci_dev *dev, enum pci_reset_type reset_type, int > > > * the read (except when CRS SV is enabled and the read was for the > > > * Vendor ID; in that case it synthesizes 0x0001 data). > > > * > > > - * Wait for the device to return a non-CRS completion. Read the > > > - * Command register instead of Vendor ID so we don't have to > > > - * contend with the CRS SV value. > > > + * Wait for the device to return a non-CRS completion. On devices > > > + * that support PM control and on waits that aren't part of system > > > + * resume read the PM control register to ensure the device has > > > + * transitioned to D0. On devices that don't support PM control, > > > + * or during system resume read the command register to instead of > > > + * Vendor ID so we don't have to contend with the CRS SV value. > > > */ > > > for (;;) { > > > - u32 id; > > > - > > > if (pci_dev_is_disconnected(dev)) { > > > pci_dbg(dev, "disconnected; not waiting\n"); > > > return -ENOTTY; > > > } > > > - pci_read_config_dword(dev, PCI_COMMAND, &id); > > > - if (!PCI_POSSIBLE_ERROR(id)) > > > - break; > > > + if (dev->pm_cap && reset_type != PCI_DEV_WAIT_RESUME) { > > > + u16 pmcsr; > > > + > > > + pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr); > > > + if (!PCI_POSSIBLE_ERROR(pmcsr) && > > > + (pmcsr & PCI_PM_CTRL_STATE_MASK) == PCI_D0) > > > + break; > > > + } else { > > > + u32 id; > > > + > > > + pci_read_config_dword(dev, PCI_COMMAND, &id); > > > + if (!PCI_POSSIBLE_ERROR(id)) > > > + break; > > > + } > > > if (delay > timeout) { > > > pci_warn(dev, "not ready %dms after %s; giving up\n", > > > -- > > > 2.43.0 > > > >
On 9/3/2024 12:11, Bjorn Helgaas wrote: <snip> >> >> I guess I reading between the lines you have an assumption that you >> can't read the vendor ID from D3; which doesn't appear to be the >> case from our testing. > > A Vendor ID read of a device in D3hot should definitely work. > Obviously if the device were in D3cold, we'd get no response at all, > so the requester should log a UR error and fabricate ~0 data. > > But if the device starts out in D3cold and we power it up, it should > not go through D3hot. The only legal transition from D3cold is to > D0uninitialized (PCIe r6.0, sec 5.8). Right. The issue is it didn't finish getting into D3 at the time that we attempted to go to D0 though. So all this extra time is basically waiting for the D0->D3 transition to finish followed by D3->D0uninitialized. The best description I could offer is to call it an "aborted" D3. > > OK, so with [1] and patch 3/5: > > 1) Initially the device is in D0 > > 2) We put it in D3cold (using some ACPI method) because the > autosuspend delay expired (?) > > 3) Plugging in the dock wakes up the device, so we power up the > device (via pci_power_up(), which again uses some ACPI method), and > it should transition to D0uninitialized > > 4) With patch 3/5, pci_power_up() calls pci_dev_wait() because > dev->current_state == PCI_D3cold > > 5) I *assume* RRS SV is enabled (lspci -vv of Root Port would > confirm this; maybe we should add a pci_dbg message about which > register we're polling). If so, patch [1] means we should poll > Vendor ID until successful completion. > Yup. RootCap: CRSVisible+ RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible+ > 6) pci_dbg log should confirm the device is ready with a "ready %dms > after D3cold->D0" message, which would mean we got a successful > completion when reading Vendor ID > > 7) For debugging purposes, it would be interesting to read and log > the PCI_PM_CTRL value here. Per sec 2.3.1, the device is not > allowed to return RRS at this point since we already got a > successful completion. > OK let me get a debug log with [1], 3/5, 6.11-rc6 and a message added about this value to share back. > 8) The USB4 stack sees the device and assumes it is in D0, but it > seems to still be in D3cold. What is this based on? Is there a > config read that returns ~0 data when it shouldn't? > Yes there is. From earlier in the thread I have a [log] I shared. The message emitted is from ring_interrupt_active(): "thunderbolt 0000:e5:00.5: interrupt for TX ring 0 is already enabled" [log] https://gist.github.com/superm1/cb407766ab15f42f12a6fe9d1196f6fc
On Tue, Sep 03, 2024 at 12:31:00PM -0500, Mario Limonciello wrote: > On 9/3/2024 12:11, Bjorn Helgaas wrote: > ... > > 8) The USB4 stack sees the device and assumes it is in D0, but it > > seems to still be in D3cold. What is this based on? Is there a > > config read that returns ~0 data when it shouldn't? > > Yes there is. From earlier in the thread I have a [log] I shared. > > The message emitted is from ring_interrupt_active(): > > "thunderbolt 0000:e5:00.5: interrupt for TX ring 0 is already enabled" Right, that's in the cover letter, but I can't tell from this what the ioread32(ring->nhi->iobase + reg) returned. It looks like this is an MMIO read of BAR 0, not a config read. > [log] https://gist.github.com/superm1/cb407766ab15f42f12a6fe9d1196f6fc
On 9/3/2024 13:25, Bjorn Helgaas wrote: > On Tue, Sep 03, 2024 at 12:31:00PM -0500, Mario Limonciello wrote: >> On 9/3/2024 12:11, Bjorn Helgaas wrote: >> ... > >>> 8) The USB4 stack sees the device and assumes it is in D0, but it >>> seems to still be in D3cold. What is this based on? Is there a >>> config read that returns ~0 data when it shouldn't? >> >> Yes there is. From earlier in the thread I have a [log] I shared. >> >> The message emitted is from ring_interrupt_active(): >> >> "thunderbolt 0000:e5:00.5: interrupt for TX ring 0 is already enabled" > > Right, that's in the cover letter, but I can't tell from this what the > ioread32(ring->nhi->iobase + reg) returned. It looks like this is an > MMIO read of BAR 0, not a config read. > Yeah. I suppose another way to approach this problem is to make something else in the call chain poll PCI_PM_CTRL. Polling at the start of nhi_runtime_resume() should also work. For the "normal" scenario it would just be a single read to PCI_PM_CTRL. Mika, thoughts?
On Tue, Sep 03, 2024 at 01:32:30PM -0500, Mario Limonciello wrote: > On 9/3/2024 13:25, Bjorn Helgaas wrote: > > On Tue, Sep 03, 2024 at 12:31:00PM -0500, Mario Limonciello wrote: > > > On 9/3/2024 12:11, Bjorn Helgaas wrote: > > > ... > > > > > > 8) The USB4 stack sees the device and assumes it is in D0, but it > > > > seems to still be in D3cold. What is this based on? Is there a > > > > config read that returns ~0 data when it shouldn't? > > > > > > Yes there is. From earlier in the thread I have a [log] I shared. > > > > > > The message emitted is from ring_interrupt_active(): > > > > > > "thunderbolt 0000:e5:00.5: interrupt for TX ring 0 is already enabled" > > > > Right, that's in the cover letter, but I can't tell from this what the > > ioread32(ring->nhi->iobase + reg) returned. It looks like this is an > > MMIO read of BAR 0, not a config read. > > Yeah. I suppose another way to approach this problem is to make something > else in the call chain poll PCI_PM_CTRL. > > Polling at the start of nhi_runtime_resume() should also work. For the > "normal" scenario it would just be a single read to PCI_PM_CTRL. It doesn't sound like this is anything specific to Thunderbolt, so there should be a generic solution.
Hi, On Tue, Sep 03, 2024 at 01:32:30PM -0500, Mario Limonciello wrote: > On 9/3/2024 13:25, Bjorn Helgaas wrote: > > On Tue, Sep 03, 2024 at 12:31:00PM -0500, Mario Limonciello wrote: > > > On 9/3/2024 12:11, Bjorn Helgaas wrote: > > > ... > > > > > > 8) The USB4 stack sees the device and assumes it is in D0, but it > > > > seems to still be in D3cold. What is this based on? Is there a > > > > config read that returns ~0 data when it shouldn't? > > > > > > Yes there is. From earlier in the thread I have a [log] I shared. > > > > > > The message emitted is from ring_interrupt_active(): > > > > > > "thunderbolt 0000:e5:00.5: interrupt for TX ring 0 is already enabled" > > > > Right, that's in the cover letter, but I can't tell from this what the > > ioread32(ring->nhi->iobase + reg) returned. It looks like this is an > > MMIO read of BAR 0, not a config read. > > > > Yeah. I suppose another way to approach this problem is to make something > else in the call chain poll PCI_PM_CTRL. > > Polling at the start of nhi_runtime_resume() should also work. For the > "normal" scenario it would just be a single read to PCI_PM_CTRL. > > Mika, thoughts? I'm starting to wonder if we are looking at the correct place ;-) This reminds me that our PCIe SV people recently reported a couple of Linux related issues which they recommended to fix, and these are on my list but I'll share them because maybe they are related? First problem, and actually a PCI spec violation, is that Linux does not clear Bus Master, MMIO and IO space enables when it programs the device to D3 on runtime suspend path. It does so on system sleep path though. Something like below (untested) should do that: diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c index f412ef73a6e4..79a566376301 100644 --- a/drivers/pci/pci-driver.c +++ b/drivers/pci/pci-driver.c @@ -1332,6 +1332,7 @@ static int pci_pm_runtime_suspend(struct device *dev) if (!pci_dev->state_saved) { pci_save_state(pci_dev); + pci_pm_default_suspend(pci_dev); pci_finish_runtime_suspend(pci_dev); } The second thing is that Thunderbolt driver, for historical reasons, leaves the MSI enabled when entering D3. This too might be related. I think we can unconditionally disable it so below hack should do that (untested as well). I wonder if you could try if any of these or both can help here? Both of these issues can result unwanted events during D3 entry as far as I understand. diff --git a/drivers/thunderbolt/ctl.c b/drivers/thunderbolt/ctl.c index dc1f456736dc..73b815fbbceb 100644 --- a/drivers/thunderbolt/ctl.c +++ b/drivers/thunderbolt/ctl.c @@ -659,12 +659,11 @@ struct tb_ctl *tb_ctl_alloc(struct tb_nhi *nhi, int index, int timeout_msec, if (!ctl->frame_pool) goto err; - ctl->tx = tb_ring_alloc_tx(nhi, 0, 10, RING_FLAG_NO_SUSPEND); + ctl->tx = tb_ring_alloc_tx(nhi, 0, 10, 0); if (!ctl->tx) goto err; - ctl->rx = tb_ring_alloc_rx(nhi, 0, 10, RING_FLAG_NO_SUSPEND, 0, 0xffff, - 0xffff, NULL, NULL); + ctl->rx = tb_ring_alloc_rx(nhi, 0, 10, 0, 0, 0xffff, 0xffff, NULL, NULL); if (!ctl->rx) goto err;
On 9/4/2024 07:05, Mika Westerberg wrote: > Hi, > > On Tue, Sep 03, 2024 at 01:32:30PM -0500, Mario Limonciello wrote: >> On 9/3/2024 13:25, Bjorn Helgaas wrote: >>> On Tue, Sep 03, 2024 at 12:31:00PM -0500, Mario Limonciello wrote: >>>> On 9/3/2024 12:11, Bjorn Helgaas wrote: >>>> ... >>> >>>>> 8) The USB4 stack sees the device and assumes it is in D0, but it >>>>> seems to still be in D3cold. What is this based on? Is there a >>>>> config read that returns ~0 data when it shouldn't? >>>> >>>> Yes there is. From earlier in the thread I have a [log] I shared. >>>> >>>> The message emitted is from ring_interrupt_active(): >>>> >>>> "thunderbolt 0000:e5:00.5: interrupt for TX ring 0 is already enabled" >>> >>> Right, that's in the cover letter, but I can't tell from this what the >>> ioread32(ring->nhi->iobase + reg) returned. It looks like this is an >>> MMIO read of BAR 0, not a config read. >>> >> >> Yeah. I suppose another way to approach this problem is to make something >> else in the call chain poll PCI_PM_CTRL. >> >> Polling at the start of nhi_runtime_resume() should also work. For the >> "normal" scenario it would just be a single read to PCI_PM_CTRL. >> >> Mika, thoughts? We did this experiment to throw code to poll PCI_PM_CTRL at the start of nhi_runtime_resume() but this also fails. From that I would hypothesize the device transitioned to D0uninitialized sometime in the middle of pci_pm_runtime_resume() before the call to pm->runtime_resume(dev); > > I'm starting to wonder if we are looking at the correct place ;-) This > reminds me that our PCIe SV people recently reported a couple of Linux > related issues which they recommended to fix, and these are on my list > but I'll share them because maybe they are related? Thanks for sharing those. We had a try with them but sorry to say no improvements to the issue at hand. > > First problem, and actually a PCI spec violation, is that Linux does not > clear Bus Master, MMIO and IO space enables when it programs the device > to D3 on runtime suspend path. It does so on system sleep path though. > Something like below (untested) should do that: > > diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c > index f412ef73a6e4..79a566376301 100644 > --- a/drivers/pci/pci-driver.c > +++ b/drivers/pci/pci-driver.c > @@ -1332,6 +1332,7 @@ static int pci_pm_runtime_suspend(struct device *dev) > > if (!pci_dev->state_saved) { > pci_save_state(pci_dev); > + pci_pm_default_suspend(pci_dev); > pci_finish_runtime_suspend(pci_dev); > } > > > The second thing is that Thunderbolt driver, for historical reasons, > leaves the MSI enabled when entering D3. This too might be related. I > think we can unconditionally disable it so below hack should do that > (untested as well). I wonder if you could try if any of these or both > can help here? Both of these issues can result unwanted events during D3 > entry as far as I understand. > > diff --git a/drivers/thunderbolt/ctl.c b/drivers/thunderbolt/ctl.c > index dc1f456736dc..73b815fbbceb 100644 > --- a/drivers/thunderbolt/ctl.c > +++ b/drivers/thunderbolt/ctl.c > @@ -659,12 +659,11 @@ struct tb_ctl *tb_ctl_alloc(struct tb_nhi *nhi, int index, int timeout_msec, > if (!ctl->frame_pool) > goto err; > > - ctl->tx = tb_ring_alloc_tx(nhi, 0, 10, RING_FLAG_NO_SUSPEND); > + ctl->tx = tb_ring_alloc_tx(nhi, 0, 10, 0); > if (!ctl->tx) > goto err; > > - ctl->rx = tb_ring_alloc_rx(nhi, 0, 10, RING_FLAG_NO_SUSPEND, 0, 0xffff, > - 0xffff, NULL, NULL); > + ctl->rx = tb_ring_alloc_rx(nhi, 0, 10, 0, 0, 0xffff, 0xffff, NULL, NULL); > if (!ctl->rx) > goto err; >
Hi, On Wed, Sep 04, 2024 at 10:24:26AM -0500, Mario Limonciello wrote: > On 9/4/2024 07:05, Mika Westerberg wrote: > > Hi, > > > > On Tue, Sep 03, 2024 at 01:32:30PM -0500, Mario Limonciello wrote: > > > On 9/3/2024 13:25, Bjorn Helgaas wrote: > > > > On Tue, Sep 03, 2024 at 12:31:00PM -0500, Mario Limonciello wrote: > > > > > On 9/3/2024 12:11, Bjorn Helgaas wrote: > > > > > ... > > > > > > > > > > 8) The USB4 stack sees the device and assumes it is in D0, but it > > > > > > seems to still be in D3cold. What is this based on? Is there a > > > > > > config read that returns ~0 data when it shouldn't? > > > > > > > > > > Yes there is. From earlier in the thread I have a [log] I shared. > > > > > > > > > > The message emitted is from ring_interrupt_active(): > > > > > > > > > > "thunderbolt 0000:e5:00.5: interrupt for TX ring 0 is already enabled" > > > > > > > > Right, that's in the cover letter, but I can't tell from this what the > > > > ioread32(ring->nhi->iobase + reg) returned. It looks like this is an > > > > MMIO read of BAR 0, not a config read. > > > > > > > > > > Yeah. I suppose another way to approach this problem is to make something > > > else in the call chain poll PCI_PM_CTRL. > > > > > > Polling at the start of nhi_runtime_resume() should also work. For the > > > "normal" scenario it would just be a single read to PCI_PM_CTRL. > > > > > > Mika, thoughts? > > We did this experiment to throw code to poll PCI_PM_CTRL at the start of > nhi_runtime_resume() but this also fails. From that I would hypothesize the > device transitioned to D0uninitialized sometime in the middle of > pci_pm_runtime_resume() before the call to pm->runtime_resume(dev); > > > > > I'm starting to wonder if we are looking at the correct place ;-) This > > reminds me that our PCIe SV people recently reported a couple of Linux > > related issues which they recommended to fix, and these are on my list > > but I'll share them because maybe they are related? > > Thanks for sharing those. We had a try with them but sorry to say no > improvements to the issue at hand. Okay, thanks for checking. Few additional side paths here, though. This is supposed to work so that once the host router sleep bit is set the driver is supposed to allow the domain to enter sleep (e.g it should not be waken up before it is fully transitioned). That's what we do: 1. All tunneled PCIe Root/Downstream ports are in D3. 2. All tunneled USB 3.x ports are in U3. 3. No DisplayPort is tunneled. 4. Thunderbolt driver enables wakes. 5. Thunderbolt driver writes sleep ready bit of the host router. 6. Thunderbolt driver runtime suspend is complete. 7. ACPI method is called (_PS3 or _PR3.OFF) that will trigger the "Sleep Event". If between 5 and 7 there is device connected, it should not "abort" the sequence. Unfortunately this is not explict in the USB4 spec but the connection manager guide has similar note. Even if the connect happens there the "Sleep Event" should happen but after that it can trigger normal wakeup which will then bring everything back. Would it be possible to enable tracing around these steps so that we could see if there is hotplug notification somewhere there that is not expected? Here are instructions how to get pretty accurate trace: https://github.com/intel/tbtools?tab=readme-ov-file#tracing Please also take full dmesg. It is entirely possible that this has nothing to do with the issue but I think it is worth checking. The second thing we could try is to check the wake status bits after this has happened, like: # tbdump -r 0 -a <ADAPTER> -vv -N 1 PORT_CS_18 (where <ADAPTER> is the lane 0 adapter of the USB4 port the device was connected). The third thing to try is to comment out TB_WAKE_ON_CONNECT in tb_switch_suspend(). This should result no wake even if the device is connected. This tells us that it is really the connect on USB4 port that triggered the wake. These could (also) explain why the host router appears to be in D3 even if it should be in D0 already.
On 9/5/2024 04:33, Mika Westerberg wrote: > Hi, > > On Wed, Sep 04, 2024 at 10:24:26AM -0500, Mario Limonciello wrote: >> On 9/4/2024 07:05, Mika Westerberg wrote: >>> Hi, >>> >>> On Tue, Sep 03, 2024 at 01:32:30PM -0500, Mario Limonciello wrote: >>>> On 9/3/2024 13:25, Bjorn Helgaas wrote: >>>>> On Tue, Sep 03, 2024 at 12:31:00PM -0500, Mario Limonciello wrote: >>>>>> On 9/3/2024 12:11, Bjorn Helgaas wrote: >>>>>> ... >>>>> >>>>>>> 8) The USB4 stack sees the device and assumes it is in D0, but it >>>>>>> seems to still be in D3cold. What is this based on? Is there a >>>>>>> config read that returns ~0 data when it shouldn't? >>>>>> >>>>>> Yes there is. From earlier in the thread I have a [log] I shared. >>>>>> >>>>>> The message emitted is from ring_interrupt_active(): >>>>>> >>>>>> "thunderbolt 0000:e5:00.5: interrupt for TX ring 0 is already enabled" >>>>> >>>>> Right, that's in the cover letter, but I can't tell from this what the >>>>> ioread32(ring->nhi->iobase + reg) returned. It looks like this is an >>>>> MMIO read of BAR 0, not a config read. >>>>> >>>> >>>> Yeah. I suppose another way to approach this problem is to make something >>>> else in the call chain poll PCI_PM_CTRL. >>>> >>>> Polling at the start of nhi_runtime_resume() should also work. For the >>>> "normal" scenario it would just be a single read to PCI_PM_CTRL. >>>> >>>> Mika, thoughts? >> >> We did this experiment to throw code to poll PCI_PM_CTRL at the start of >> nhi_runtime_resume() but this also fails. From that I would hypothesize the >> device transitioned to D0uninitialized sometime in the middle of >> pci_pm_runtime_resume() before the call to pm->runtime_resume(dev); >> >>> >>> I'm starting to wonder if we are looking at the correct place ;-) This >>> reminds me that our PCIe SV people recently reported a couple of Linux >>> related issues which they recommended to fix, and these are on my list >>> but I'll share them because maybe they are related? >> >> Thanks for sharing those. We had a try with them but sorry to say no >> improvements to the issue at hand. > > Okay, thanks for checking. > > Few additional side paths here, though. This is supposed to work so that > once the host router sleep bit is set the driver is supposed to allow > the domain to enter sleep (e.g it should not be waken up before it is > fully transitioned). That's what we do: > > 1. All tunneled PCIe Root/Downstream ports are in D3. > 2. All tunneled USB 3.x ports are in U3. > 3. No DisplayPort is tunneled. > 4. Thunderbolt driver enables wakes. > 5. Thunderbolt driver writes sleep ready bit of the host router. > 6. Thunderbolt driver runtime suspend is complete. > 7. ACPI method is called (_PS3 or _PR3.OFF) that will trigger the "Sleep > Event". > > If between 5 and 7 there is device connected, it should not "abort" the > sequence. Unfortunately this is not explict in the USB4 spec but the > connection manager guide has similar note. Even if the connect happens > there the "Sleep Event" should happen but after that it can trigger > normal wakeup which will then bring everything back. > > Would it be possible to enable tracing around these steps so that we > could see if there is hotplug notification somewhere there that is not > expected? Here are instructions how to get pretty accurate trace: > > https://github.com/intel/tbtools?tab=readme-ov-file#tracing > > Please also take full dmesg. Sure, here is the dmesg with tracing enabled: https://gist.github.com/superm1/5186e0023c8a5d2ecd75c50fd2168308 > > It is entirely possible that this has nothing to do with the issue but I > think it is worth checking. > > The second thing we could try is to check the wake status bits after > this has happened, like: > > # tbdump -r 0 -a <ADAPTER> -vv -N 1 PORT_CS_18 > > (where <ADAPTER> is the lane 0 adapter of the USB4 port the device was > connected). > Unfortunately the adapter is in such a bad state at this time that tbdump doesn't work. > The third thing to try is to comment out TB_WAKE_ON_CONNECT in > tb_switch_suspend(). This should result no wake even if the device is > connected. This tells us that it is really the connect on USB4 port that > triggered the wake. Yup that's correct; there is no action on the hotplug with this change. > > These could (also) explain why the host router appears to be in D3 even > if it should be in D0 already.
Hi, On Mon, Sep 09, 2024 at 03:40:54PM -0500, Mario Limonciello wrote: > > Few additional side paths here, though. This is supposed to work so that > > once the host router sleep bit is set the driver is supposed to allow > > the domain to enter sleep (e.g it should not be waken up before it is > > fully transitioned). That's what we do: > > > > 1. All tunneled PCIe Root/Downstream ports are in D3. > > 2. All tunneled USB 3.x ports are in U3. > > 3. No DisplayPort is tunneled. > > 4. Thunderbolt driver enables wakes. > > 5. Thunderbolt driver writes sleep ready bit of the host router. > > 6. Thunderbolt driver runtime suspend is complete. > > 7. ACPI method is called (_PS3 or _PR3.OFF) that will trigger the "Sleep > > Event". > > > > If between 5 and 7 there is device connected, it should not "abort" the > > sequence. Unfortunately this is not explict in the USB4 spec but the > > connection manager guide has similar note. Even if the connect happens > > there the "Sleep Event" should happen but after that it can trigger > > normal wakeup which will then bring everything back. > > > > Would it be possible to enable tracing around these steps so that we > > could see if there is hotplug notification somewhere there that is not > > expected? Here are instructions how to get pretty accurate trace: > > > > https://github.com/intel/tbtools?tab=readme-ov-file#tracing > > > > Please also take full dmesg. > > Sure, here is the dmesg with tracing enabled: > > https://gist.github.com/superm1/5186e0023c8a5d2ecd75c50fd2168308 Thanks! I meant also enable control channel tracing as explained in the above linked page so we could maybe see the hotplug packet coming from the lane adapter too. Anyhow, [ 560.789681] thunderbolt 0-2: device disconnected [ 560.789771] thunderbolt 0000:e5:00.5: bandwidth consumption changed, re-calculating estimated bandwidth [ 560.789775] thunderbolt 0000:e5:00.5: bandwidth re-calculation done [ 560.789778] thunderbolt 0000:e5:00.5: looking for DP IN <-> DP OUT pairs: [ 560.789800] thunderbolt 0000:e5:00.5: 0:6: DP IN available [ 560.789803] thunderbolt 0000:e5:00.5: 0:6: no suitable DP OUT adapter available, not tunneling [ 560.790484] thunderbolt 0000:e5:00.5: 0:7: DP IN available [ 560.790487] thunderbolt 0000:e5:00.5: 0:7: no suitable DP OUT adapter available, not tunneling ... [ 578.677640] thunderbolt 0000:e5:00.5: nhi_runtime_suspend() - enter, pdev->current_state = 0 [ 578.677648] thunderbolt 0000:e5:00.5: 0: suspending switch [ 578.677653] thunderbolt 0000:e5:00.5: 0: enabling wakeup: 0x3f [ 578.678248] thunderbolt 0000:e5:00.5: stopping RX ring 0 [ 578.678256] thunderbolt 0000:e5:00.5: disabling interrupt at register 0x38200 bit 3 (0x9 -> 0x1) [ 578.678272] thunderbolt 0000:e5:00.5: stopping TX ring 0 [ 578.678277] thunderbolt 0000:e5:00.5: disabling interrupt at register 0x38200 bit 0 (0x1 -> 0x0) [ 578.678287] thunderbolt 0000:e5:00.5: control channel stopped ... pci_pm_runtime_resume() pci_pm_default_resume_early(pci_dev); pci_pm_power_up_and_verify_state(pci_dev); pci_power_up(pci_dev); platform_pci_set_power_state(dev, PCI_D0); // here we should see acpi_pci_set_power_state() printing "xxx: power // state changed by ACPI to D0" but we don't. // // also pdev->current_state is set to PCI_D0 pci_update_current_state(pci_dev, PCI_D0); // this one reads it back, it should be 0.. pm->runtime_resume(dev); nhi_runtime_resume(dev) [ 578.773503] thunderbolt 0000:e5:00.5: nhi_runtime_resume() - enter, pdev->current_state = 3 // .. but it is not. It seems to be powered off, D3cold. [ 578.773516] thunderbolt 0000:e5:00.5: control channel starting... [ 578.773518] thunderbolt 0000:e5:00.5: starting TX ring 0 [ 578.773524] thunderbolt 0000:e5:00.5: enabling interrupt at register 0x38200 bit 0 (0xffffffff -> 0xffffffff) // As here too. It would be interesting to see what ACPI does here, I mean enabling dynamic debugging of acpi_device_set_power() and friends. One suspect is that when the runtime suspends happens, the power resource or _PS3() gets called to put the device into D3cold and it has some sort of Sleep() inside. This allows the OS to relese the CPU to do other things too according to ACPI spec (and Linux does this) so now when you plug in the device the GPE triggers wake and Linux starts processing that one. We see the above runtime resume but then the Sleep() in the _PS3() completes and cuts the power from the NHI while we are in the middle of resuming it already (or something along those lines). We saw similar "context switch" happening when PCIe L2 failed, BIOS implemented the timeout using Sleep() that allowed Linux to process other things which resulted unexpected wake (not the same as here though). So one thing to check too is how the ACPI methods get called and especially if they somehow end up messing with each other. > > It is entirely possible that this has nothing to do with the issue but I > > think it is worth checking. > > > > The second thing we could try is to check the wake status bits after > > this has happened, like: > > > > # tbdump -r 0 -a <ADAPTER> -vv -N 1 PORT_CS_18 > > > > (where <ADAPTER> is the lane 0 adapter of the USB4 port the device was > > connected). > > > > Unfortunately the adapter is in such a bad state at this time that tbdump > doesn't work. Oh, indeed it is. > > The third thing to try is to comment out TB_WAKE_ON_CONNECT in > > tb_switch_suspend(). This should result no wake even if the device is > > connected. This tells us that it is really the connect on USB4 port that > > triggered the wake. > > Yup that's correct; there is no action on the hotplug with this change. Okay thanks for checking.
On 9/10/2024 04:13, Mika Westerberg wrote: > Hi, > > On Mon, Sep 09, 2024 at 03:40:54PM -0500, Mario Limonciello wrote: >>> Few additional side paths here, though. This is supposed to work so that >>> once the host router sleep bit is set the driver is supposed to allow >>> the domain to enter sleep (e.g it should not be waken up before it is >>> fully transitioned). That's what we do: >>> >>> 1. All tunneled PCIe Root/Downstream ports are in D3. >>> 2. All tunneled USB 3.x ports are in U3. >>> 3. No DisplayPort is tunneled. >>> 4. Thunderbolt driver enables wakes. >>> 5. Thunderbolt driver writes sleep ready bit of the host router. >>> 6. Thunderbolt driver runtime suspend is complete. >>> 7. ACPI method is called (_PS3 or _PR3.OFF) that will trigger the "Sleep >>> Event". >>> >>> If between 5 and 7 there is device connected, it should not "abort" the >>> sequence. Unfortunately this is not explict in the USB4 spec but the >>> connection manager guide has similar note. Even if the connect happens >>> there the "Sleep Event" should happen but after that it can trigger >>> normal wakeup which will then bring everything back. >>> >>> Would it be possible to enable tracing around these steps so that we >>> could see if there is hotplug notification somewhere there that is not >>> expected? Here are instructions how to get pretty accurate trace: >>> >>> https://github.com/intel/tbtools?tab=readme-ov-file#tracing >>> >>> Please also take full dmesg. >> >> Sure, here is the dmesg with tracing enabled: >> >> https://gist.github.com/superm1/5186e0023c8a5d2ecd75c50fd2168308 > > Thanks! I meant also enable control channel tracing as explained in the > above linked page so we could maybe see the hotplug packet coming from > the lane adapter too. Anyhow, > > [ 560.789681] thunderbolt 0-2: device disconnected > [ 560.789771] thunderbolt 0000:e5:00.5: bandwidth consumption changed, re-calculating estimated bandwidth > [ 560.789775] thunderbolt 0000:e5:00.5: bandwidth re-calculation done > [ 560.789778] thunderbolt 0000:e5:00.5: looking for DP IN <-> DP OUT pairs: > [ 560.789800] thunderbolt 0000:e5:00.5: 0:6: DP IN available > [ 560.789803] thunderbolt 0000:e5:00.5: 0:6: no suitable DP OUT adapter available, not tunneling > [ 560.790484] thunderbolt 0000:e5:00.5: 0:7: DP IN available > [ 560.790487] thunderbolt 0000:e5:00.5: 0:7: no suitable DP OUT adapter available, not tunneling > ... > [ 578.677640] thunderbolt 0000:e5:00.5: nhi_runtime_suspend() - enter, pdev->current_state = 0 > [ 578.677648] thunderbolt 0000:e5:00.5: 0: suspending switch > [ 578.677653] thunderbolt 0000:e5:00.5: 0: enabling wakeup: 0x3f > [ 578.678248] thunderbolt 0000:e5:00.5: stopping RX ring 0 > [ 578.678256] thunderbolt 0000:e5:00.5: disabling interrupt at register 0x38200 bit 3 (0x9 -> 0x1) > [ 578.678272] thunderbolt 0000:e5:00.5: stopping TX ring 0 > [ 578.678277] thunderbolt 0000:e5:00.5: disabling interrupt at register 0x38200 bit 0 (0x1 -> 0x0) > [ 578.678287] thunderbolt 0000:e5:00.5: control channel stopped > ... > pci_pm_runtime_resume() > pci_pm_default_resume_early(pci_dev); > pci_pm_power_up_and_verify_state(pci_dev); > pci_power_up(pci_dev); > platform_pci_set_power_state(dev, PCI_D0); > // here we should see acpi_pci_set_power_state() printing "xxx: power > // state changed by ACPI to D0" but we don't. > // > // also pdev->current_state is set to PCI_D0 > pci_update_current_state(pci_dev, PCI_D0); > // this one reads it back, it should be 0.. > pm->runtime_resume(dev); > nhi_runtime_resume(dev) > [ 578.773503] thunderbolt 0000:e5:00.5: nhi_runtime_resume() - enter, pdev->current_state = 3 > > // .. but it is not. It seems to be powered off, D3cold. > > [ 578.773516] thunderbolt 0000:e5:00.5: control channel starting... > [ 578.773518] thunderbolt 0000:e5:00.5: starting TX ring 0 > [ 578.773524] thunderbolt 0000:e5:00.5: enabling interrupt at register 0x38200 bit 0 (0xffffffff -> 0xffffffff) > > // As here too. > > It would be interesting to see what ACPI does here, I mean enabling > dynamic debugging of acpi_device_set_power() and friends. Here is a log with dyndbg turned up. https://gist.github.com/superm1/2f7ca6123431365d11e9bc3cc9329c14 You can see firmware node \_SB_.PCI0.BUSC.NHI0 is physical device 0000:e5:00.5. > One suspect is > that when the runtime suspends happens, the power resource or _PS3() > gets called to put the device into D3cold and it has some sort of > Sleep() inside. This allows the OS to relese the CPU to do other things > too according to ACPI spec (and Linux does this) so now when you plug in > the device the GPE triggers wake and Linux starts processing that one. > We see the above runtime resume but then the Sleep() in the _PS3() > completes and cuts the power from the NHI while we are in the middle of > resuming it already (or something along those lines). > > We saw similar "context switch" happening when PCIe L2 failed, BIOS > implemented the timeout using Sleep() that allowed Linux to process > other things which resulted unexpected wake (not the same as here > though). > > So one thing to check too is how the ACPI methods get called and > especially if they somehow end up messing with each other. At least to me it looks pretty straightforward that each D0 sequence is going through and nothing looked out of order: acpi_pci_set_power_state() ->acpi_device_set_power() ->->acpi_dev_pm_explicit_set() There is a Sleep() call in M232() which is a SMI used in both _PS0 and _PS3, but as far as I can tell it's not actually called as the case always has zero as an argument. Here's some snippets of ASL from the NHI0 device to see. https://gist.github.com/superm1/32ef1f822a6220a41b19574024717f79 I feel your theory is right about it being an aborted D3cold request, but I don't believe it's from concurrency issue of _PS3. I feel there isn't any guard rails either in Linux or the AML for ensuring that the transition actually finished.
Hi, On Thu, Sep 12, 2024 at 11:12:11PM -0500, Mario Limonciello wrote: > On 9/10/2024 04:13, Mika Westerberg wrote: > > Hi, > > > > On Mon, Sep 09, 2024 at 03:40:54PM -0500, Mario Limonciello wrote: > > > > Few additional side paths here, though. This is supposed to work so that > > > > once the host router sleep bit is set the driver is supposed to allow > > > > the domain to enter sleep (e.g it should not be waken up before it is > > > > fully transitioned). That's what we do: > > > > > > > > 1. All tunneled PCIe Root/Downstream ports are in D3. > > > > 2. All tunneled USB 3.x ports are in U3. > > > > 3. No DisplayPort is tunneled. > > > > 4. Thunderbolt driver enables wakes. > > > > 5. Thunderbolt driver writes sleep ready bit of the host router. > > > > 6. Thunderbolt driver runtime suspend is complete. > > > > 7. ACPI method is called (_PS3 or _PR3.OFF) that will trigger the "Sleep > > > > Event". > > > > > > > > If between 5 and 7 there is device connected, it should not "abort" the > > > > sequence. Unfortunately this is not explict in the USB4 spec but the > > > > connection manager guide has similar note. Even if the connect happens > > > > there the "Sleep Event" should happen but after that it can trigger > > > > normal wakeup which will then bring everything back. > > > > > > > > Would it be possible to enable tracing around these steps so that we > > > > could see if there is hotplug notification somewhere there that is not > > > > expected? Here are instructions how to get pretty accurate trace: > > > > > > > > https://github.com/intel/tbtools?tab=readme-ov-file#tracing > > > > > > > > Please also take full dmesg. > > > > > > Sure, here is the dmesg with tracing enabled: > > > > > > https://gist.github.com/superm1/5186e0023c8a5d2ecd75c50fd2168308 > > > > Thanks! I meant also enable control channel tracing as explained in the > > above linked page so we could maybe see the hotplug packet coming from > > the lane adapter too. Anyhow, > > > > [ 560.789681] thunderbolt 0-2: device disconnected > > [ 560.789771] thunderbolt 0000:e5:00.5: bandwidth consumption changed, re-calculating estimated bandwidth > > [ 560.789775] thunderbolt 0000:e5:00.5: bandwidth re-calculation done > > [ 560.789778] thunderbolt 0000:e5:00.5: looking for DP IN <-> DP OUT pairs: > > [ 560.789800] thunderbolt 0000:e5:00.5: 0:6: DP IN available > > [ 560.789803] thunderbolt 0000:e5:00.5: 0:6: no suitable DP OUT adapter available, not tunneling > > [ 560.790484] thunderbolt 0000:e5:00.5: 0:7: DP IN available > > [ 560.790487] thunderbolt 0000:e5:00.5: 0:7: no suitable DP OUT adapter available, not tunneling > > ... > > [ 578.677640] thunderbolt 0000:e5:00.5: nhi_runtime_suspend() - enter, pdev->current_state = 0 > > [ 578.677648] thunderbolt 0000:e5:00.5: 0: suspending switch > > [ 578.677653] thunderbolt 0000:e5:00.5: 0: enabling wakeup: 0x3f > > [ 578.678248] thunderbolt 0000:e5:00.5: stopping RX ring 0 > > [ 578.678256] thunderbolt 0000:e5:00.5: disabling interrupt at register 0x38200 bit 3 (0x9 -> 0x1) > > [ 578.678272] thunderbolt 0000:e5:00.5: stopping TX ring 0 > > [ 578.678277] thunderbolt 0000:e5:00.5: disabling interrupt at register 0x38200 bit 0 (0x1 -> 0x0) > > [ 578.678287] thunderbolt 0000:e5:00.5: control channel stopped > > ... > > pci_pm_runtime_resume() > > pci_pm_default_resume_early(pci_dev); > > pci_pm_power_up_and_verify_state(pci_dev); > > pci_power_up(pci_dev); > > platform_pci_set_power_state(dev, PCI_D0); > > // here we should see acpi_pci_set_power_state() printing "xxx: power > > // state changed by ACPI to D0" but we don't. > > // > > // also pdev->current_state is set to PCI_D0 > > pci_update_current_state(pci_dev, PCI_D0); > > // this one reads it back, it should be 0.. > > pm->runtime_resume(dev); > > nhi_runtime_resume(dev) > > [ 578.773503] thunderbolt 0000:e5:00.5: nhi_runtime_resume() - enter, pdev->current_state = 3 > > > > // .. but it is not. It seems to be powered off, D3cold. > > > > [ 578.773516] thunderbolt 0000:e5:00.5: control channel starting... > > [ 578.773518] thunderbolt 0000:e5:00.5: starting TX ring 0 > > [ 578.773524] thunderbolt 0000:e5:00.5: enabling interrupt at register 0x38200 bit 0 (0xffffffff -> 0xffffffff) > > > > // As here too. > > > > It would be interesting to see what ACPI does here, I mean enabling > > dynamic debugging of acpi_device_set_power() and friends. > > Here is a log with dyndbg turned up. > https://gist.github.com/superm1/2f7ca6123431365d11e9bc3cc9329c14 > > You can see firmware node \_SB_.PCI0.BUSC.NHI0 is physical device > 0000:e5:00.5. Thanks! > > One suspect is > > that when the runtime suspends happens, the power resource or _PS3() > > gets called to put the device into D3cold and it has some sort of > > Sleep() inside. This allows the OS to relese the CPU to do other things > > too according to ACPI spec (and Linux does this) so now when you plug in > > the device the GPE triggers wake and Linux starts processing that one. > > We see the above runtime resume but then the Sleep() in the _PS3() > > completes and cuts the power from the NHI while we are in the middle of > > resuming it already (or something along those lines). > > > > We saw similar "context switch" happening when PCIe L2 failed, BIOS > > implemented the timeout using Sleep() that allowed Linux to process > > other things which resulted unexpected wake (not the same as here > > though). > > > > So one thing to check too is how the ACPI methods get called and > > especially if they somehow end up messing with each other. > > At least to me it looks pretty straightforward that each D0 sequence is > going through and nothing looked out of order: > > acpi_pci_set_power_state() > ->acpi_device_set_power() > ->->acpi_dev_pm_explicit_set() > > > There is a Sleep() call in M232() which is a SMI used in both _PS0 and _PS3, > but as far as I can tell it's not actually called as the case always has > zero as an argument. > > Here's some snippets of ASL from the NHI0 device to see. > https://gist.github.com/superm1/32ef1f822a6220a41b19574024717f79 > > I feel your theory is right about it being an aborted D3cold request, but I > don't believe it's from concurrency issue of _PS3. I feel there isn't any > guard rails either in Linux or the AML for ensuring that > the transition actually finished. Yeah, I agree now. It does not look like the methods are messing each other here. We don't see the GPE handler being run but I don't think it matters here. For some reason the device just is not yet ready when it is supposed to be in D0. Sorry for wasting your time with these suspects.
Hi again, On Fri, Sep 13, 2024 at 07:58:07AM +0300, Mika Westerberg wrote: > Yeah, I agree now. It does not look like the methods are messing each > other here. We don't see the GPE handler being run but I don't think it > matters here. For some reason the device just is not yet ready when it > is supposed to be in D0. > > Sorry for wasting your time with these suspects. One more suggestion though ;-) I realized that my hack patch to disable I/O and MMIO did not actually do that because it looks like we don't clear those bits ever. I wonder if you could still check if the below changes anything? At least it should now "disable" the device to follow the spec. diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c index f412ef73a6e4..79a566376301 100644 --- a/drivers/pci/pci-driver.c +++ b/drivers/pci/pci-driver.c @@ -1332,6 +1332,7 @@ static int pci_pm_runtime_suspend(struct device *dev) if (!pci_dev->state_saved) { pci_save_state(pci_dev); + pci_pm_default_suspend(pci_dev); pci_finish_runtime_suspend(pci_dev); } diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index ffaaca0978cb..91f4e7a03c94 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -2218,6 +2218,13 @@ static void do_pci_disable_device(struct pci_dev *dev) pci_command &= ~PCI_COMMAND_MASTER; pci_write_config_word(dev, PCI_COMMAND, pci_command); } + /* + * PCI PM 1.2 sec 8.2.2 says that when a function is put into D3 + * the OS needs to disable I/O and MMIO space in addition to bus + * mastering so do that here. + */ + pci_command &= ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY); + pci_write_config_word(dev, PCI_COMMAND, pci_command); pcibios_disable_device(dev); }
On 9/13/2024 02:23, Mika Westerberg wrote: > Hi again, > > On Fri, Sep 13, 2024 at 07:58:07AM +0300, Mika Westerberg wrote: >> Yeah, I agree now. It does not look like the methods are messing each >> other here. We don't see the GPE handler being run but I don't think it >> matters here. For some reason the device just is not yet ready when it >> is supposed to be in D0. >> >> Sorry for wasting your time with these suspects. Don't worry about it. It validates that we're coming to the right conclusions about where the problem is. > > One more suggestion though ;-) I realized that my hack patch to disable > I/O and MMIO did not actually do that because it looks like we don't > clear those bits ever. I wonder if you could still check if the below > changes anything? At least it should now "disable" the device to follow > the spec. This actually causes the system to fail to boot. I guess some deadlock from other callers to pci_disable_device(). We also double checked putting the PCI_COMMAND writes just into the runtime suspend call backs instead (to narrow down if that is part of the issue here). Putting it there fixed the boot hang, but no change to the actual issue behavior. Bjorn, Considering this finding, are you more amenable to the polling approach? > > diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c > index f412ef73a6e4..79a566376301 100644 > --- a/drivers/pci/pci-driver.c > +++ b/drivers/pci/pci-driver.c > @@ -1332,6 +1332,7 @@ static int pci_pm_runtime_suspend(struct device *dev) > > if (!pci_dev->state_saved) { > pci_save_state(pci_dev); > + pci_pm_default_suspend(pci_dev); > pci_finish_runtime_suspend(pci_dev); > } > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > index ffaaca0978cb..91f4e7a03c94 100644 > --- a/drivers/pci/pci.c > +++ b/drivers/pci/pci.c > @@ -2218,6 +2218,13 @@ static void do_pci_disable_device(struct pci_dev *dev) > pci_command &= ~PCI_COMMAND_MASTER; > pci_write_config_word(dev, PCI_COMMAND, pci_command); > } > + /* > + * PCI PM 1.2 sec 8.2.2 says that when a function is put into D3 > + * the OS needs to disable I/O and MMIO space in addition to bus > + * mastering so do that here. > + */ > + pci_command &= ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY); > + pci_write_config_word(dev, PCI_COMMAND, pci_command); > > pcibios_disable_device(dev); > }
Hi, On Fri, Sep 13, 2024 at 03:56:46PM -0500, Mario Limonciello wrote: > > One more suggestion though ;-) I realized that my hack patch to disable > > I/O and MMIO did not actually do that because it looks like we don't > > clear those bits ever. I wonder if you could still check if the below > > changes anything? At least it should now "disable" the device to follow > > the spec. > > This actually causes the system to fail to boot. I guess some deadlock from > other callers to pci_disable_device(). > > We also double checked putting the PCI_COMMAND writes just into the runtime > suspend call backs instead (to narrow down if that is part of the issue > here). Putting it there fixed the boot hang, but no change to the actual > issue behavior. Okay, thanks again for checking!
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 1e219057a5069..f032a4aaec268 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -1309,21 +1309,33 @@ static int pci_dev_wait(struct pci_dev *dev, enum pci_reset_type reset_type, int * the read (except when CRS SV is enabled and the read was for the * Vendor ID; in that case it synthesizes 0x0001 data). * - * Wait for the device to return a non-CRS completion. Read the - * Command register instead of Vendor ID so we don't have to - * contend with the CRS SV value. + * Wait for the device to return a non-CRS completion. On devices + * that support PM control and on waits that aren't part of system + * resume read the PM control register to ensure the device has + * transitioned to D0. On devices that don't support PM control, + * or during system resume read the command register to instead of + * Vendor ID so we don't have to contend with the CRS SV value. */ for (;;) { - u32 id; - if (pci_dev_is_disconnected(dev)) { pci_dbg(dev, "disconnected; not waiting\n"); return -ENOTTY; } - pci_read_config_dword(dev, PCI_COMMAND, &id); - if (!PCI_POSSIBLE_ERROR(id)) - break; + if (dev->pm_cap && reset_type != PCI_DEV_WAIT_RESUME) { + u16 pmcsr; + + pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr); + if (!PCI_POSSIBLE_ERROR(pmcsr) && + (pmcsr & PCI_PM_CTRL_STATE_MASK) == PCI_D0) + break; + } else { + u32 id; + + pci_read_config_dword(dev, PCI_COMMAND, &id); + if (!PCI_POSSIBLE_ERROR(id)) + break; + } if (delay > timeout) { pci_warn(dev, "not ready %dms after %s; giving up\n",