PCI: PM: Quirk bridge D3 on Elo i2

Message ID 11980172.O9o76ZdvQC@kreacher (mailing list archive)
State Accepted
Commit 92597f97a40bf661bebceb92e26ff87c76d562d4

Commit Message

Rafael J. Wysocki March 31, 2022, 5:38 p.m. UTC
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

If one of the PCIe root ports on Elo i2 is put into D3cold and then
back into D0, the downstream device becomes permanently inaccessible,
so add a bridge D3 DMI quirk for that system.

This was exposed by commit 14858dcc3b35 ("PCI: Use
pci_update_current_state() in pci_enable_device_flags()"), but before
that commit the root port in question had never been put into D3cold
for real due to a mismatch between its power state retrieved from the
PCI_PM_CTRL register (which was accessible even though the platform
firmware indicated that the port was in D3cold) and the state of an
ACPI power resource involved in its power management.

BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=215715
Reported-by: Stefan Gottwald <gottwald@igel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/pci/pci.c |   10 ++++++++++
 1 file changed, 10 insertions(+)
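
The mismatch described in the commit message can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not kernel code: the power state is taken from the low two bits of the PCI_PM_CTRL register value, and the enum, the mask macro name, and the helper names below are hypothetical, chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative power states; D3COLD cannot be encoded in the register. */
enum power_state { D0 = 0, D1 = 1, D2 = 2, D3HOT = 3, D3COLD = 4 };

/* The low two bits of the PM control/status register hold the state. */
#define PM_CTRL_STATE_MASK 0x0003u

/* Decode the power state from a PCI_PM_CTRL register value. */
static enum power_state state_from_pmcsr(uint16_t pmcsr)
{
	return (enum power_state)(pmcsr & PM_CTRL_STATE_MASK);
}

/*
 * Report whether the register view and the platform (firmware) view of
 * the power state differ.  A readable register combined with a platform
 * report of D3COLD always differs, because the register can only encode
 * D0 through D3HOT.
 */
static bool states_disagree(uint16_t pmcsr, enum power_state platform_state)
{
	return state_from_pmcsr(pmcsr) != platform_state;
}
```

With a register value decoding to D0 and a platform report of D3COLD, states_disagree() returns true, which corresponds to the situation described above for the root port in question.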

Comments

Bjorn Helgaas March 31, 2022, 9:57 p.m. UTC | #1
Hi Rafael,

On Thu, Mar 31, 2022 at 07:38:51PM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> If one of the PCIe root ports on Elo i2 is put into D3cold and then
> back into D0, the downstream device becomes permanently inaccessible,
> so add a bridge D3 DMI quirk for that system.
> 
> This was exposed by commit 14858dcc3b35 ("PCI: Use
> pci_update_current_state() in pci_enable_device_flags()"), but before
> that commit the root port in question had never been put into D3cold
> for real due to a mismatch between its power state retrieved from the
> PCI_PM_CTRL register (which was accessible even though the platform
> firmware indicated that the port was in D3cold) and the state of an
> ACPI power resource involved in its power management.

In the bug report you suspect a firmware issue.  Any idea what that
might be?  It looks like a Gemini Lake Root Port, so I wouldn't think
it would be a hardware issue.

Weird how things come in clumps.  Was just looking at Mario's patch,
which also has to do with bridges and D3.

Do we need a Fixes line?  E.g.,

  Fixes: 14858dcc3b35 ("PCI: Use pci_update_current_state() in pci_enable_device_flags()")

> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=215715
> Reported-by: Stefan Gottwald <gottwald@igel.com>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  drivers/pci/pci.c |   10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> Index: linux-pm/drivers/pci/pci.c
> ===================================================================
> --- linux-pm.orig/drivers/pci/pci.c
> +++ linux-pm/drivers/pci/pci.c
> @@ -2920,6 +2920,16 @@ static const struct dmi_system_id bridge
>  			DMI_MATCH(DMI_BOARD_VENDOR, "Gigabyte Technology Co., Ltd."),
>  			DMI_MATCH(DMI_BOARD_NAME, "X299 DESIGNARE EX-CF"),
>  		},
> +		/*
> +		 * Downstream device is not accessible after putting a root port
> +		 * into D3cold and back into D0 on Elo i2.
> +		 */
> +		.ident = "Elo i2",
> +		.matches = {
> +			DMI_MATCH(DMI_SYS_VENDOR, "Elo Touch Solutions"),
> +			DMI_MATCH(DMI_PRODUCT_NAME, "Elo i2"),
> +			DMI_MATCH(DMI_PRODUCT_VERSION, "RevB"),
> +		},

Is this bridge_d3_blacklist[] similar to the PCI_DEV_FLAGS_NO_D3 bit?
Could they be folded together?  We have a lot of bits that seem
similar but maybe not exactly the same (dev->bridge_d3,
dev->no_d3cold, dev->d3cold_allowed, dev->runtime_d3cold,
PCI_DEV_FLAGS_NO_D3, pci_bridge_d3_force, etc.)  Ugh.

bridge_d3_blacklist[] itself was added by 85b0cae89d52 ("PCI:
Blacklist power management of Gigabyte X299 DESIGNARE EX PCIe ports"),
which honestly looks kind of random, i.e., it doesn't seem to be
working around a hardware or even a firmware defect.

Apparently the X299 issue is that 00:1c.4 is connected to a
Thunderbolt controller, and the BIOS keeps the Thunderbolt controller
powered off unless something is attached to it?  At least, 00:1c.4
leads to bus 05, and the dmesg log attached to [1] shows no devices
on bus 05.

It also says the platform doesn't support PCIe native hotplug, which
matches what Mika said about it using ACPI hotplug.  If a system is
using ACPI hotplug, it seems like maybe *that* should prevent us from
putting things in D3cold?  How can we know whether ACPI hotplug
depends on a certain power state?

Bjorn

[1] https://bugzilla.kernel.org/show_bug.cgi?id=202031

>  	},
>  #endif
>  	{ }
> 
> 
>
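
A note on the table layout in the hunk quoted above: the added lines follow the closing brace of the Gigabyte entry's .matches but come before the brace that closes that entry, so no separate { ... } element is opened for Elo i2. In C, a later designated initializer for the same member overrides an earlier one, so a table entry normally needs its own braces. The following is a minimal userspace sketch of such a table with one brace-enclosed element per system; the struct, enum, macro, and function names are hypothetical stand-ins, and the substring comparison is an assumption made to mirror how DMI_MATCH() compares strings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the DMI field identifiers. */
enum dmi_slot { SLOT_NONE = 0, SYS_VENDOR, PRODUCT_NAME, PRODUCT_VERSION,
		BOARD_VENDOR, BOARD_NAME, SLOT_COUNT };

#define MAX_MATCHES 4

struct match { enum dmi_slot slot; const char *substr; };
struct system_id { const char *ident; struct match matches[MAX_MATCHES]; };

#define MATCH(f, s) { .slot = (f), .substr = (s) }

/* Each system is a separate brace-enclosed element of the array. */
static const struct system_id blacklist[] = {
	{
		.ident = "X299 DESIGNARE EX-CF",
		.matches = {
			MATCH(BOARD_VENDOR, "Gigabyte Technology Co., Ltd."),
			MATCH(BOARD_NAME, "X299 DESIGNARE EX-CF"),
		},
	},
	{
		.ident = "Elo i2",
		.matches = {
			MATCH(SYS_VENDOR, "Elo Touch Solutions"),
			MATCH(PRODUCT_NAME, "Elo i2"),
			MATCH(PRODUCT_VERSION, "RevB"),
		},
	},
	{ 0 }	/* terminator: ident == NULL */
};

/* Return the first entry whose matches are all satisfied, or NULL. */
static const struct system_id *check_system(const char *const *fields)
{
	for (const struct system_id *id = blacklist; id->ident; id++) {
		int ok = 1;

		for (size_t i = 0; i < MAX_MATCHES &&
				   id->matches[i].slot != SLOT_NONE; i++) {
			const char *val = fields[id->matches[i].slot];

			if (!val || !strstr(val, id->matches[i].substr)) {
				ok = 0;
				break;
			}
		}
		if (ok)
			return id;
	}
	return NULL;
}
```

In this layout both systems remain matchable independently: a lookup with the Gigabyte board strings returns the first entry, and a lookup with the Elo i2 RevB strings returns the second.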
Rafael J. Wysocki April 1, 2022, 11:34 a.m. UTC | #2
On Thu, Mar 31, 2022 at 11:57 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> Hi Rafael,
>
> On Thu, Mar 31, 2022 at 07:38:51PM +0200, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > If one of the PCIe root ports on Elo i2 is put into D3cold and then
> > back into D0, the downstream device becomes permanently inaccessible,
> > so add a bridge D3 DMI quirk for that system.
> >
> > This was exposed by commit 14858dcc3b35 ("PCI: Use
> > pci_update_current_state() in pci_enable_device_flags()"), but before
> > that commit the root port in question had never been put into D3cold
> > for real due to a mismatch between its power state retrieved from the
> > PCI_PM_CTRL register (which was accessible even though the platform
> > firmware indicated that the port was in D3cold) and the state of an
> > ACPI power resource involved in its power management.
>
> In the bug report you suspect a firmware issue.  Any idea what that
> might be?  It looks like a Gemini Lake Root Port, so I wouldn't think
> it would be a hardware issue.

The _ON method of the ACPI power resource associated with the root
port doesn't work correctly.

> Weird how things come in clumps.  Was just looking at Mario's patch,
> which also has to do with bridges and D3.
>
> Do we need a Fixes line?  E.g.,
>
>   Fixes: 14858dcc3b35 ("PCI: Use pci_update_current_state() in pci_enable_device_flags()")

Strictly speaking, it is not a fix for the above commit.

It is a workaround for a firmware issue uncovered by it which wasn't
visible, because power management was not used correctly on the
affected system because of another firmware problem addressed by
14858dcc3b35.  It wouldn't have worked anyway had it been attempted
AFAICS.

I was thinking about CCing this change to -stable instead.

> > BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=215715
> > Reported-by: Stefan Gottwald <gottwald@igel.com>
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> >  drivers/pci/pci.c |   10 ++++++++++
> >  1 file changed, 10 insertions(+)
> >
> > Index: linux-pm/drivers/pci/pci.c
> > ===================================================================
> > --- linux-pm.orig/drivers/pci/pci.c
> > +++ linux-pm/drivers/pci/pci.c
> > @@ -2920,6 +2920,16 @@ static const struct dmi_system_id bridge
> >                       DMI_MATCH(DMI_BOARD_VENDOR, "Gigabyte Technology Co., Ltd."),
> >                       DMI_MATCH(DMI_BOARD_NAME, "X299 DESIGNARE EX-CF"),
> >               },
> > +             /*
> > +              * Downstream device is not accessible after putting a root port
> > +              * into D3cold and back into D0 on Elo i2.
> > +              */
> > +             .ident = "Elo i2",
> > +             .matches = {
> > +                     DMI_MATCH(DMI_SYS_VENDOR, "Elo Touch Solutions"),
> > +                     DMI_MATCH(DMI_PRODUCT_NAME, "Elo i2"),
> > +                     DMI_MATCH(DMI_PRODUCT_VERSION, "RevB"),
> > +             },
>
> Is this bridge_d3_blacklist[] similar to the PCI_DEV_FLAGS_NO_D3 bit?

Not really.  The former applies to the entire platform and not to an
individual device.

> Could they be folded together?  We have a lot of bits that seem
> similar but maybe not exactly the same (dev->bridge_d3,
> dev->no_d3cold, dev->d3cold_allowed, dev->runtime_d3cold,
> PCI_DEV_FLAGS_NO_D3, pci_bridge_d3_force, etc.)  Ugh.

Yes, I agree that this needs to be cleaned up.

> bridge_d3_blacklist[] itself was added by 85b0cae89d52 ("PCI:
> Blacklist power management of Gigabyte X299 DESIGNARE EX PCIe ports"),
> which honestly looks kind of random, i.e., it doesn't seem to be
> working around a hardware or even a firmware defect.
>
> Apparently the X299 issue is that 00:1c.4 is connected to a
> Thunderbolt controller, and the BIOS keeps the Thunderbolt controller
> powered off unless something is attached to it?  At least, 00:1c.4
> leads to bus 05, and the dmesg log attached to [1] shows no devices
> on bus 05.
>
> It also says the platform doesn't support PCIe native hotplug, which
> matches what Mika said about it using ACPI hotplug.  If a system is
> using ACPI hotplug, it seems like maybe *that* should prevent us from
> putting things in D3cold?  How can we know whether ACPI hotplug
> depends on a certain power state?

We have this check in pci_bridge_d3_possible():

if (bridge->is_hotplug_bridge && !pciehp_is_native(bridge))
            return false;

but this only applies to the case when the particular bridge itself is
a hotplug one using ACPI hotplug.

If ACPI hotplug is used, it generally is unsafe to put PCIe ports into
D3cold, because in that case it is unclear what the platform
firmware's assumptions regarding control of the config space are.

However, I'm not sure how this is related to the patch at hand.
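
To illustrate the scope of the check quoted above, here is a deliberately reduced userspace model. It is a sketch, not the kernel function: the real pci_bridge_d3_possible() performs further checks that are not shown, and the struct, its fields, and the platform_blacklisted parameter are hypothetical names standing in for bridge->is_hotplug_bridge, pciehp_is_native(), and a DMI table match.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified view of a PCIe bridge. */
struct bridge_model {
	bool is_hotplug_bridge;   /* the bridge itself is hotplug-capable */
	bool native_pcie_hotplug; /* stands in for pciehp_is_native() */
};

/*
 * Reduced decision: refuse D3 for a hotplug bridge that relies on ACPI
 * hotplug, and for any bridge on a platform listed in the quirk table.
 */
static bool bridge_d3_possible(const struct bridge_model *bridge,
			       bool platform_blacklisted)
{
	if (bridge->is_hotplug_bridge && !bridge->native_pcie_hotplug)
		return false;

	if (platform_blacklisted)
		return false;

	return true;
}
```

In this model, a bridge that is not itself a hotplug bridge passes the first check even when the platform uses ACPI hotplug elsewhere, which is the limitation pointed out above; only the platform-wide quirk rejects it.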
Rafael J. Wysocki April 4, 2022, 2:46 p.m. UTC | #3
On Fri, Apr 1, 2022 at 1:34 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Thu, Mar 31, 2022 at 11:57 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> >
> > Hi Rafael,
> >
> > On Thu, Mar 31, 2022 at 07:38:51PM +0200, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > >
> > > If one of the PCIe root ports on Elo i2 is put into D3cold and then
> > > back into D0, the downstream device becomes permanently inaccessible,
> > > so add a bridge D3 DMI quirk for that system.
> > >
> > > This was exposed by commit 14858dcc3b35 ("PCI: Use
> > > pci_update_current_state() in pci_enable_device_flags()"), but before
> > > that commit the root port in question had never been put into D3cold
> > > for real due to a mismatch between its power state retrieved from the
> > > PCI_PM_CTRL register (which was accessible even though the platform
> > > firmware indicated that the port was in D3cold) and the state of an
> > > ACPI power resource involved in its power management.
> >
> > In the bug report you suspect a firmware issue.  Any idea what that
> > might be?  It looks like a Gemini Lake Root Port, so I wouldn't think
> > it would be a hardware issue.
>
> The _ON method of the ACPI power resource associated with the root
> port doesn't work correctly.
>
> > Weird how things come in clumps.  Was just looking at Mario's patch,
> > which also has to do with bridges and D3.
> >
> > Do we need a Fixes line?  E.g.,
> >
> >   Fixes: 14858dcc3b35 ("PCI: Use pci_update_current_state() in pci_enable_device_flags()")
>
> Strictly speaking, it is not a fix for the above commit.
>
> It is a workaround for a firmware issue uncovered by it which wasn't
> visible, because power management was not used correctly on the
> affected system because of another firmware problem addressed by
> 14858dcc3b35.  It wouldn't have worked anyway had it been attempted
> AFAICS.
>
> I was thinking about CCing this change to -stable instead.
>
> > > BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=215715
> > > Reported-by: Stefan Gottwald <gottwald@igel.com>
> > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > ---
> > >  drivers/pci/pci.c |   10 ++++++++++
> > >  1 file changed, 10 insertions(+)
> > >
> > > Index: linux-pm/drivers/pci/pci.c
> > > ===================================================================
> > > --- linux-pm.orig/drivers/pci/pci.c
> > > +++ linux-pm/drivers/pci/pci.c
> > > @@ -2920,6 +2920,16 @@ static const struct dmi_system_id bridge
> > >                       DMI_MATCH(DMI_BOARD_VENDOR, "Gigabyte Technology Co., Ltd."),
> > >                       DMI_MATCH(DMI_BOARD_NAME, "X299 DESIGNARE EX-CF"),
> > >               },
> > > +             /*
> > > +              * Downstream device is not accessible after putting a root port
> > > +              * into D3cold and back into D0 on Elo i2.
> > > +              */
> > > +             .ident = "Elo i2",
> > > +             .matches = {
> > > +                     DMI_MATCH(DMI_SYS_VENDOR, "Elo Touch Solutions"),
> > > +                     DMI_MATCH(DMI_PRODUCT_NAME, "Elo i2"),
> > > +                     DMI_MATCH(DMI_PRODUCT_VERSION, "RevB"),
> > > +             },
> >
> > Is this bridge_d3_blacklist[] similar to the PCI_DEV_FLAGS_NO_D3 bit?
>
> Not really.  The former applies to the entire platform and not to an
> individual device.
>
> > Could they be folded together?  We have a lot of bits that seem
> > similar but maybe not exactly the same (dev->bridge_d3,
> > dev->no_d3cold, dev->d3cold_allowed, dev->runtime_d3cold,
> > PCI_DEV_FLAGS_NO_D3, pci_bridge_d3_force, etc.)  Ugh.
>
> Yes, I agree that this needs to be cleaned up.
>
> > bridge_d3_blacklist[] itself was added by 85b0cae89d52 ("PCI:
> > Blacklist power management of Gigabyte X299 DESIGNARE EX PCIe ports"),
> > which honestly looks kind of random, i.e., it doesn't seem to be
> > working around a hardware or even a firmware defect.
> >
> > Apparently the X299 issue is that 00:1c.4 is connected to a
> > Thunderbolt controller, and the BIOS keeps the Thunderbolt controller
> > powered off unless something is attached to it?  At least, 00:1c.4
> > leads to bus 05, and the dmesg log attached to [1] shows no devices
> > on bus 05.
> >
> > It also says the platform doesn't support PCIe native hotplug, which
> > matches what Mika said about it using ACPI hotplug.  If a system is
> > using ACPI hotplug, it seems like maybe *that* should prevent us from
> > putting things in D3cold?  How can we know whether ACPI hotplug
> > depends on a certain power state?
>
> We have this check in pci_bridge_d3_possible():
>
> if (bridge->is_hotplug_bridge && !pciehp_is_native(bridge))
>             return false;
>
> but this only applies to the case when the particular bridge itself is
> a hotplug one using ACPI hotplug.
>
> If ACPI hotplug is used, it generally is unsafe to put PCIe ports into
> D3cold, because in that case it is unclear what the platform
> firmware's assumptions regarding control of the config space are.
>
> However, I'm not sure how this is related to the patch at hand.

So I'm not sure how you want to proceed here.

The platform is quirky, so the quirk for it will need to be added one
way or another.  The $subject patch adds it using the existing
mechanism, which is the least intrusive way.

You seem to be thinking that the existing mechanism may not be
adequate, but I'm not sure for what reason and anyway I think that it
can be adjusted after adding the quirk.

Please let me know what you think.
Bjorn Helgaas April 8, 2022, 7:53 p.m. UTC | #4
On Mon, Apr 04, 2022 at 04:46:14PM +0200, Rafael J. Wysocki wrote:
> On Fri, Apr 1, 2022 at 1:34 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > On Thu, Mar 31, 2022 at 11:57 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Thu, Mar 31, 2022 at 07:38:51PM +0200, Rafael J. Wysocki wrote:
> > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > >
> > > > If one of the PCIe root ports on Elo i2 is put into D3cold and then
> > > > back into D0, the downstream device becomes permanently inaccessible,
> > > > so add a bridge D3 DMI quirk for that system.
> > > >
> > > > This was exposed by commit 14858dcc3b35 ("PCI: Use
> > > > pci_update_current_state() in pci_enable_device_flags()"), but before
> > > > that commit the root port in question had never been put into D3cold
> > > > for real due to a mismatch between its power state retrieved from the
> > > > PCI_PM_CTRL register (which was accessible even though the platform
> > > > firmware indicated that the port was in D3cold) and the state of an
> > > > ACPI power resource involved in its power management.
> > >
> > > In the bug report you suspect a firmware issue.  Any idea what that
> > > might be?  It looks like a Gemini Lake Root Port, so I wouldn't think
> > > it would be a hardware issue.
> >
> > The _ON method of the ACPI power resource associated with the root
> > port doesn't work correctly.
> >
> > > Weird how things come in clumps.  Was just looking at Mario's patch,
> > > which also has to do with bridges and D3.
> > >
> > > Do we need a Fixes line?  E.g.,
> > >
> > >   Fixes: 14858dcc3b35 ("PCI: Use pci_update_current_state() in pci_enable_device_flags()")
> >
> > Strictly speaking, it is not a fix for the above commit.
> >
> > It is a workaround for a firmware issue uncovered by it which wasn't
> > visible, because power management was not used correctly on the
> > affected system because of another firmware problem addressed by
> > 14858dcc3b35.  It wouldn't have worked anyway had it been attempted
> > AFAICS.
> >
> > I was thinking about CCing this change to -stable instead.

Makes sense, thanks.

> > > > BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=215715
> > > > Reported-by: Stefan Gottwald <gottwald@igel.com>
> > > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > ---
> > > >  drivers/pci/pci.c |   10 ++++++++++
> > > >  1 file changed, 10 insertions(+)
> > > >
> > > > Index: linux-pm/drivers/pci/pci.c
> > > > ===================================================================
> > > > --- linux-pm.orig/drivers/pci/pci.c
> > > > +++ linux-pm/drivers/pci/pci.c
> > > > @@ -2920,6 +2920,16 @@ static const struct dmi_system_id bridge
> > > >                       DMI_MATCH(DMI_BOARD_VENDOR, "Gigabyte Technology Co., Ltd."),
> > > >                       DMI_MATCH(DMI_BOARD_NAME, "X299 DESIGNARE EX-CF"),
> > > >               },
> > > > +             /*
> > > > +              * Downstream device is not accessible after putting a root port
> > > > +              * into D3cold and back into D0 on Elo i2.
> > > > +              */
> > > > +             .ident = "Elo i2",
> > > > +             .matches = {
> > > > +                     DMI_MATCH(DMI_SYS_VENDOR, "Elo Touch Solutions"),
> > > > +                     DMI_MATCH(DMI_PRODUCT_NAME, "Elo i2"),
> > > > +                     DMI_MATCH(DMI_PRODUCT_VERSION, "RevB"),
> > > > +             },
> > >
> > > Is this bridge_d3_blacklist[] similar to the PCI_DEV_FLAGS_NO_D3 bit?
> >
> > Not really.  The former applies to the entire platform and not to an
> > individual device.
> >
> > > Could they be folded together?  We have a lot of bits that seem
> > > similar but maybe not exactly the same (dev->bridge_d3,
> > > dev->no_d3cold, dev->d3cold_allowed, dev->runtime_d3cold,
> > > PCI_DEV_FLAGS_NO_D3, pci_bridge_d3_force, etc.)  Ugh.
> >
> > Yes, I agree that this needs to be cleaned up.
> >
> > > bridge_d3_blacklist[] itself was added by 85b0cae89d52 ("PCI:
> > > Blacklist power management of Gigabyte X299 DESIGNARE EX PCIe ports"),
> > > which honestly looks kind of random, i.e., it doesn't seem to be
> > > working around a hardware or even a firmware defect.
> > >
> > > Apparently the X299 issue is that 00:1c.4 is connected to a
> > > Thunderbolt controller, and the BIOS keeps the Thunderbolt controller
> > > powered off unless something is attached to it?  At least, 00:1c.4
> > > leads to bus 05, and the dmesg log attached to [1] shows no devices
> > > on bus 05.
> > >
> > > It also says the platform doesn't support PCIe native hotplug, which
> > > matches what Mika said about it using ACPI hotplug.  If a system is
> > > using ACPI hotplug, it seems like maybe *that* should prevent us from
> > > putting things in D3cold?  How can we know whether ACPI hotplug
> > > depends on a certain power state?
> >
> > We have this check in pci_bridge_d3_possible():
> >
> > if (bridge->is_hotplug_bridge && !pciehp_is_native(bridge))
> >             return false;
> >
> > but this only applies to the case when the particular bridge itself is
> > a hotplug one using ACPI hotplug.
> >
> > If ACPI hotplug is used, it generally is unsafe to put PCIe ports into
> > D3cold, because in that case it is unclear what the platform
> > firmware's assumptions regarding control of the config space are.
> >
> > However, I'm not sure how this is related to the patch at hand.
> 
> So I'm not sure how you want to proceed here.
> 
> The platform is quirky, so the quirk for it will need to be added one
> way or another.  The $subject patch adds it using the existing
> mechanism, which is the least intrusive way.
> 
> You seem to be thinking that the existing mechanism may not be
> adequate, but I'm not sure for what reason and anyway I think that it
> can be adjusted after adding the quirk.
> 
> Please let me know what you think.

I don't understand all that's going on here, but I applied it to
pci/pm for v5.19, thanks!

Bjorn
Rafael J. Wysocki April 9, 2022, 1:35 p.m. UTC | #5
On Fri, Apr 8, 2022 at 9:53 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Mon, Apr 04, 2022 at 04:46:14PM +0200, Rafael J. Wysocki wrote:
> > On Fri, Apr 1, 2022 at 1:34 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > On Thu, Mar 31, 2022 at 11:57 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > On Thu, Mar 31, 2022 at 07:38:51PM +0200, Rafael J. Wysocki wrote:
> > > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > >
> > > > > If one of the PCIe root ports on Elo i2 is put into D3cold and then
> > > > > back into D0, the downstream device becomes permanently inaccessible,
> > > > > so add a bridge D3 DMI quirk for that system.
> > > > >
> > > > > This was exposed by commit 14858dcc3b35 ("PCI: Use
> > > > > pci_update_current_state() in pci_enable_device_flags()"), but before
> > > > > that commit the root port in question had never been put into D3cold
> > > > > for real due to a mismatch between its power state retrieved from the
> > > > > PCI_PM_CTRL register (which was accessible even though the platform
> > > > > firmware indicated that the port was in D3cold) and the state of an
> > > > > ACPI power resource involved in its power management.
> > > >
> > > > In the bug report you suspect a firmware issue.  Any idea what that
> > > > might be?  It looks like a Gemini Lake Root Port, so I wouldn't think
> > > > it would be a hardware issue.
> > >
> > > The _ON method of the ACPI power resource associated with the root
> > > port doesn't work correctly.
> > >
> > > > Weird how things come in clumps.  Was just looking at Mario's patch,
> > > > which also has to do with bridges and D3.
> > > >
> > > > Do we need a Fixes line?  E.g.,
> > > >
> > > >   Fixes: 14858dcc3b35 ("PCI: Use pci_update_current_state() in pci_enable_device_flags()")
> > >
> > > Strictly speaking, it is not a fix for the above commit.
> > >
> > > It is a workaround for a firmware issue uncovered by it which wasn't
> > > visible, because power management was not used correctly on the
> > > affected system because of another firmware problem addressed by
> > > 14858dcc3b35.  It wouldn't have worked anyway had it been attempted
> > > AFAICS.
> > >
> > > I was thinking about CCing this change to -stable instead.
>
> Makes sense, thanks.
>
> > > > > BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=215715
> > > > > Reported-by: Stefan Gottwald <gottwald@igel.com>
> > > > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > > ---
> > > > >  drivers/pci/pci.c |   10 ++++++++++
> > > > >  1 file changed, 10 insertions(+)
> > > > >
> > > > > Index: linux-pm/drivers/pci/pci.c
> > > > > ===================================================================
> > > > > --- linux-pm.orig/drivers/pci/pci.c
> > > > > +++ linux-pm/drivers/pci/pci.c
> > > > > @@ -2920,6 +2920,16 @@ static const struct dmi_system_id bridge
> > > > >                       DMI_MATCH(DMI_BOARD_VENDOR, "Gigabyte Technology Co., Ltd."),
> > > > >                       DMI_MATCH(DMI_BOARD_NAME, "X299 DESIGNARE EX-CF"),
> > > > >               },
> > > > > +             /*
> > > > > +              * Downstream device is not accessible after putting a root port
> > > > > +              * into D3cold and back into D0 on Elo i2.
> > > > > +              */
> > > > > +             .ident = "Elo i2",
> > > > > +             .matches = {
> > > > > +                     DMI_MATCH(DMI_SYS_VENDOR, "Elo Touch Solutions"),
> > > > > +                     DMI_MATCH(DMI_PRODUCT_NAME, "Elo i2"),
> > > > > +                     DMI_MATCH(DMI_PRODUCT_VERSION, "RevB"),
> > > > > +             },
> > > >
> > > > Is this bridge_d3_blacklist[] similar to the PCI_DEV_FLAGS_NO_D3 bit?
> > >
> > > Not really.  The former applies to the entire platform and not to an
> > > individual device.
> > >
> > > > Could they be folded together?  We have a lot of bits that seem
> > > > similar but maybe not exactly the same (dev->bridge_d3,
> > > > dev->no_d3cold, dev->d3cold_allowed, dev->runtime_d3cold,
> > > > PCI_DEV_FLAGS_NO_D3, pci_bridge_d3_force, etc.)  Ugh.
> > >
> > > Yes, I agree that this needs to be cleaned up.
> > >
> > > > bridge_d3_blacklist[] itself was added by 85b0cae89d52 ("PCI:
> > > > Blacklist power management of Gigabyte X299 DESIGNARE EX PCIe ports"),
> > > > which honestly looks kind of random, i.e., it doesn't seem to be
> > > > working around a hardware or even a firmware defect.
> > > >
> > > > Apparently the X299 issue is that 00:1c.4 is connected to a
> > > > Thunderbolt controller, and the BIOS keeps the Thunderbolt controller
> > > > powered off unless something is attached to it?  At least, 00:1c.4
> > > > leads to bus 05, and the dmesg log attached to [1] shows no devices
> > > > on bus 05.
> > > >
> > > > It also says the platform doesn't support PCIe native hotplug, which
> > > > matches what Mika said about it using ACPI hotplug.  If a system is
> > > > using ACPI hotplug, it seems like maybe *that* should prevent us from
> > > > putting things in D3cold?  How can we know whether ACPI hotplug
> > > > depends on a certain power state?
> > >
> > > We have this check in pci_bridge_d3_possible():
> > >
> > > if (bridge->is_hotplug_bridge && !pciehp_is_native(bridge))
> > >             return false;
> > >
> > > but this only applies to the case when the particular bridge itself is
> > > a hotplug one using ACPI hotplug.
> > >
> > > If ACPI hotplug is used, it generally is unsafe to put PCIe ports into
> > > D3cold, because in that case it is unclear what the platform
> > > firmware's assumptions regarding control of the config space are.
> > >
> > > However, I'm not sure how this is related to the patch at hand.
> >
> > So I'm not sure how you want to proceed here.
> >
> > The platform is quirky, so the quirk for it will need to be added one
> > way or another.  The $subject patch adds it using the existing
> > mechanism, which is the least intrusive way.
> >
> > You seem to be thinking that the existing mechanism may not be
> > adequate, but I'm not sure for what reason and anyway I think that it
> > can be adjusted after adding the quirk.
> >
> > Please let me know what you think.
>
> I don't understand all that's going on here, but I applied it to
> pci/pm for v5.19, thanks!

Thank you!

I've started to work on cleaning up the D3cold-related code.
Linux regression tracking (Thorsten Leemhuis) April 10, 2022, 9:16 a.m. UTC | #6
On 09.04.22 15:35, Rafael J. Wysocki wrote:
> On Fri, Apr 8, 2022 at 9:53 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>>
>> On Mon, Apr 04, 2022 at 04:46:14PM +0200, Rafael J. Wysocki wrote:
>>> On Fri, Apr 1, 2022 at 1:34 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>>>> On Thu, Mar 31, 2022 at 11:57 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>>>>> On Thu, Mar 31, 2022 at 07:38:51PM +0200, Rafael J. Wysocki wrote:
>>>>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>>>>
>>>>>> If one of the PCIe root ports on Elo i2 is put into D3cold and then
>>>>>> back into D0, the downstream device becomes permanently inaccessible,
>>>>>> so add a bridge D3 DMI quirk for that system.
>>>>>>
>>>>>> This was exposed by commit 14858dcc3b35 ("PCI: Use
>>>>>> pci_update_current_state() in pci_enable_device_flags()"), but before
>>>>>> that commit the root port in question had never been put into D3cold
>>>>>> for real due to a mismatch between its power state retrieved from the
>>>>>> PCI_PM_CTRL register (which was accessible even though the platform
>>>>>> firmware indicated that the port was in D3cold) and the state of an
>>>>>> ACPI power resource involved in its power management.
>>>>>
>>>>> In the bug report you suspect a firmware issue.  Any idea what that
>>>>> might be?  It looks like a Gemini Lake Root Port, so I wouldn't think
>>>>> it would be a hardware issue.
>>>>
>>>> The _ON method of the ACPI power resource associated with the root
>>>> port doesn't work correctly.
>>>>
>>>>> Weird how things come in clumps.  Was just looking at Mario's patch,
>>>>> which also has to do with bridges and D3.
>>>>>
>>>>> Do we need a Fixes line?  E.g.,
>>>>>
>>>>>   Fixes: 14858dcc3b35 ("PCI: Use pci_update_current_state() in pci_enable_device_flags()")
>>>>
>>>> Strictly speaking, it is not a fix for the above commit.
>>>>
>>>> It is a workaround for a firmware issue uncovered by it which wasn't
>>>> visible, because power management was not used correctly on the
>>>> affected system because of another firmware problem addressed by
>>>> 14858dcc3b35.  It wouldn't have worked anyway had it been attempted
>>>> AFAICS.
>>>>
>>>> I was thinking about CCing this change to -stable instead.
>>
>> Makes sense, thanks.
>>
>>>>>> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=215715
>>>>>> Reported-by: Stefan Gottwald <gottwald@igel.com>
>>>>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>>>> ---
>>>>>>  drivers/pci/pci.c |   10 ++++++++++
>>>>>>  1 file changed, 10 insertions(+)
>>>>>>
>>>>>> Index: linux-pm/drivers/pci/pci.c
>>>>>> ===================================================================
>>>>>> --- linux-pm.orig/drivers/pci/pci.c
>>>>>> +++ linux-pm/drivers/pci/pci.c
>>>>>> @@ -2920,6 +2920,16 @@ static const struct dmi_system_id bridge
>>>>>>                       DMI_MATCH(DMI_BOARD_VENDOR, "Gigabyte Technology Co., Ltd."),
>>>>>>                       DMI_MATCH(DMI_BOARD_NAME, "X299 DESIGNARE EX-CF"),
>>>>>>               },
>>>>>> +             /*
>>>>>> +              * Downstream device is not accessible after putting a root port
>>>>>> +              * into D3cold and back into D0 on Elo i2.
>>>>>> +              */
>>>>>> +             .ident = "Elo i2",
>>>>>> +             .matches = {
>>>>>> +                     DMI_MATCH(DMI_SYS_VENDOR, "Elo Touch Solutions"),
>>>>>> +                     DMI_MATCH(DMI_PRODUCT_NAME, "Elo i2"),
>>>>>> +                     DMI_MATCH(DMI_PRODUCT_VERSION, "RevB"),
>>>>>> +             },
>>>>>
>>>>> Is this bridge_d3_blacklist[] similar to the PCI_DEV_FLAGS_NO_D3 bit?
>>>>
>>>> Not really.  The former applies to the entire platform and not to an
>>>> individual device.
>>>>
>>>>> Could they be folded together?  We have a lot of bits that seem
>>>>> similar but maybe not exactly the same (dev->bridge_d3,
>>>>> dev->no_d3cold, dev->d3cold_allowed, dev->runtime_d3cold,
>>>>> PCI_DEV_FLAGS_NO_D3, pci_bridge_d3_force, etc.)  Ugh.
>>>>
>>>> Yes, I agree that this needs to be cleaned up.
>>>>
>>>>> bridge_d3_blacklist[] itself was added by 85b0cae89d52 ("PCI:
>>>>> Blacklist power management of Gigabyte X299 DESIGNARE EX PCIe ports"),
>>>>> which honestly looks kind of random, i.e., it doesn't seem to be
>>>>> working around a hardware or even a firmware defect.
>>>>>
>>>>> Apparently the X299 issue is that 00:1c.4 is connected to a
>>>>> Thunderbolt controller, and the BIOS keeps the Thunderbolt controller
>>>>> powered off unless something is attached to it?  At least, 00:1c.4
>>>>> leads to bus 05, and the dmesg log attached to [1] shows no devices
>>>>> on bus 05.
>>>>>
>>>>> It also says the platform doesn't support PCIe native hotplug, which
>>>>> matches what Mika said about it using ACPI hotplug.  If a system is
>>>>> using ACPI hotplug, it seems like maybe *that* should prevent us from
>>>>> putting things in D3cold?  How can we know whether ACPI hotplug
>>>>> depends on a certain power state?
>>>>
>>>> We have this check in pci_bridge_d3_possible():
>>>>
>>>> if (bridge->is_hotplug_bridge && !pciehp_is_native(bridge))
>>>>             return false;
>>>>
>>>> but this only applies to the case when the particular bridge itself is
>>>> a hotplug one using ACPI hotplug.
>>>>
>>>> If ACPI hotplug is used, it generally is unsafe to put PCIe ports into
>>>> D3cold, because in that case it is unclear what the platform
>>>> firmware's assumptions regarding control of the config space are.
>>>>
>>>> However, I'm not sure how this is related to the patch at hand.
>>>
>>> So I'm not sure how you want to proceed here.
>>>
>>> The platform is quirky, so the quirk for it will need to be added this
>>> way or another.  The $subject patch adds it using the existing
>>> mechanism, which is the least intrusive way.
>>>
>>> You seem to be thinking that the existing mechanism may not be
>>> adequate, but I'm not sure for what reason and anyway I think that it
>>> can be adjusted after adding the quirk.
>>>
>>> Please let me know what you think.
>>
>> I don't understand all that's going on here, but I applied it to
>> pci/pm for v5.19, thanks!
> Thank you!

Sorry, but this made me wonder: why v5.19? It's a regression exposed in
v5.15, so afaics it would be good to get this in this cycle -- and also
backported to v5.15.y, but it seems a tag to take care of that is
missing. :-/

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I'm getting a lot of
reports on my table. I can only look briefly into most of them and lack
knowledge about most of the areas they concern. I thus unfortunately
will sometimes get things wrong or miss something important. I hope
that's not the case here; if you think it is, don't hesitate to tell me
in a public reply, it's in everyone's interest to set the public record
straight.

Rafael J. Wysocki April 11, 2022, 11:35 a.m. UTC | #7
On Sun, Apr 10, 2022 at 11:16 AM Thorsten Leemhuis
<regressions@leemhuis.info> wrote:
>
> On 09.04.22 15:35, Rafael J. Wysocki wrote:
> > On Fri, Apr 8, 2022 at 9:53 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> >>
> >> On Mon, Apr 04, 2022 at 04:46:14PM +0200, Rafael J. Wysocki wrote:
> >>> On Fri, Apr 1, 2022 at 1:34 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> >>>> On Thu, Mar 31, 2022 at 11:57 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> >>>>> On Thu, Mar 31, 2022 at 07:38:51PM +0200, Rafael J. Wysocki wrote:
> >>>>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>>>>>
> >>>>>> If one of the PCIe root ports on Elo i2 is put into D3cold and then
> >>>>>> back into D0, the downstream device becomes permanently inaccessible,
> >>>>>> so add a bridge D3 DMI quirk for that system.
> >>>>>>
> >>>>>> This was exposed by commit 14858dcc3b35 ("PCI: Use
> >>>>>> pci_update_current_state() in pci_enable_device_flags()"), but before
> >>>>>> that commit the root port in question had never been put into D3cold
> >>>>>> for real due to a mismatch between its power state retrieved from the
> >>>>>> PCI_PM_CTRL register (which was accessible even though the platform
> >>>>>> firmware indicated that the port was in D3cold) and the state of an
> >>>>>> ACPI power resource involved in its power management.
> >>>>>
> >>>>> In the bug report you suspect a firmware issue.  Any idea what that
> >>>>> might be?  It looks like a Gemini Lake Root Port, so I wouldn't think
> >>>>> it would be a hardware issue.
> >>>>
> >>>> The _ON method of the ACPI power resource associated with the root
> >>>> port doesn't work correctly.
> >>>>
> >>>>> Weird how things come in clumps.  Was just looking at Mario's patch,
> >>>>> which also has to do with bridges and D3.
> >>>>>
> >>>>> Do we need a Fixes line?  E.g.,
> >>>>>
> >>>>>   Fixes: 14858dcc3b35 ("PCI: Use pci_update_current_state() in pci_enable_device_flags()")
> >>>>
> >>>> Strictly speaking, it is not a fix for the above commit.
> >>>>
> >>>> It is a workaround for a firmware issue uncovered by it which wasn't
> >>>> visible, because power management was not used correctly on the
> >>>> affected system because of another firmware problem addressed by
> >>>> 14858dcc3b35.  It wouldn't have worked anyway had it been attempted
> >>>> AFAICS.
> >>>>
> >>>> I was thinking about CCing this change to -stable instead.
> >>
> >> Makes sense, thanks.
> >>
> >>>>>> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=215715
> >>>>>> Reported-by: Stefan Gottwald <gottwald@igel.com>
> >>>>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>>>>> ---
> >>>>>>  drivers/pci/pci.c |   10 ++++++++++
> >>>>>>  1 file changed, 10 insertions(+)
> >>>>>>
> >>>>>> Index: linux-pm/drivers/pci/pci.c
> >>>>>> ===================================================================
> >>>>>> --- linux-pm.orig/drivers/pci/pci.c
> >>>>>> +++ linux-pm/drivers/pci/pci.c
> >>>>>> @@ -2920,6 +2920,16 @@ static const struct dmi_system_id bridge
> >>>>>>                       DMI_MATCH(DMI_BOARD_VENDOR, "Gigabyte Technology Co., Ltd."),
> >>>>>>                       DMI_MATCH(DMI_BOARD_NAME, "X299 DESIGNARE EX-CF"),
> >>>>>>               },
> >>>>>> +             /*
> >>>>>> +              * Downstream device is not accessible after putting a root port
> >>>>>> +              * into D3cold and back into D0 on Elo i2.
> >>>>>> +              */
> >>>>>> +             .ident = "Elo i2",
> >>>>>> +             .matches = {
> >>>>>> +                     DMI_MATCH(DMI_SYS_VENDOR, "Elo Touch Solutions"),
> >>>>>> +                     DMI_MATCH(DMI_PRODUCT_NAME, "Elo i2"),
> >>>>>> +                     DMI_MATCH(DMI_PRODUCT_VERSION, "RevB"),
> >>>>>> +             },
> >>>>>
> >>>>> Is this bridge_d3_blacklist[] similar to the PCI_DEV_FLAGS_NO_D3 bit?
> >>>>
> >>>> Not really.  The former applies to the entire platform and not to an
> >>>> individual device.
> >>>>
> >>>>> Could they be folded together?  We have a lot of bits that seem
> >>>>> similar but maybe not exactly the same (dev->bridge_d3,
> >>>>> dev->no_d3cold, dev->d3cold_allowed, dev->runtime_d3cold,
> >>>>> PCI_DEV_FLAGS_NO_D3, pci_bridge_d3_force, etc.)  Ugh.
> >>>>
> >>>> Yes, I agree that this needs to be cleaned up.
> >>>>
> >>>>> bridge_d3_blacklist[] itself was added by 85b0cae89d52 ("PCI:
> >>>>> Blacklist power management of Gigabyte X299 DESIGNARE EX PCIe ports"),
> >>>>> which honestly looks kind of random, i.e., it doesn't seem to be
> >>>>> working around a hardware or even a firmware defect.
> >>>>>
> >>>>> Apparently the X299 issue is that 00:1c.4 is connected to a
> >>>>> Thunderbolt controller, and the BIOS keeps the Thunderbolt controller
> >>>>> powered off unless something is attached to it?  At least, 00:1c.4
> >>>>> leads to bus 05, and the dmesg log attached to [1] shows no devices
> >>>>> on bus 05.
> >>>>>
> >>>>> It also says the platform doesn't support PCIe native hotplug, which
> >>>>> matches what Mika said about it using ACPI hotplug.  If a system is
> >>>>> using ACPI hotplug, it seems like maybe *that* should prevent us from
> >>>>> putting things in D3cold?  How can we know whether ACPI hotplug
> >>>>> depends on a certain power state?
> >>>>
> >>>> We have this check in pci_bridge_d3_possible():
> >>>>
> >>>> if (bridge->is_hotplug_bridge && !pciehp_is_native(bridge))
> >>>>             return false;
> >>>>
> >>>> but this only applies to the case when the particular bridge itself is
> >>>> a hotplug one using ACPI hotplug.
> >>>>
> >>>> If ACPI hotplug is used, it generally is unsafe to put PCIe ports into
> >>>> D3cold, because in that case it is unclear what the platform
> >>>> firmware's assumptions regarding control of the config space are.
> >>>>
> >>>> However, I'm not sure how this is related to the patch at hand.
> >>>
> >>> So I'm not sure how you want to proceed here.
> >>>
> >>> The platform is quirky, so the quirk for it will need to be added this
> >>> way or another.  The $subject patch adds it using the existing
> >>> mechanism, which is the least intrusive way.
> >>>
> >>> You seem to be thinking that the existing mechanism may not be
> >>> adequate, but I'm not sure for what reason and anyway I think that it
> >>> can be adjusted after adding the quirk.
> >>>
> >>> Please let me know what you think.
> >>
> >> I don't understand all that's going on here, but I applied it to
> >> pci/pm for v5.19, thanks!
> > Thank you!
>
> Sorry, but this made me wonder: why v5.19? It's a regression exposed in
> v5.15, so afaics it would be good to get this in this cycle -- and also
> backported to v5.15.y, but it seems a tag to take care of that is
> missing. :-/

Well, the patch is out there for everyone needing it.  The question is
how urgent it is to get it into the mainline and -stable, which boils
down to the question of how many systems out there can be affected by it.
Since it is a firmware defect exposed, hopefully not too many.

Linux regression tracking (Thorsten Leemhuis) April 11, 2022, 12:10 p.m. UTC | #8
On 11.04.22 13:35, Rafael J. Wysocki wrote:
> On Sun, Apr 10, 2022 at 11:16 AM Thorsten Leemhuis
> <regressions@leemhuis.info> wrote:
>>
>> On 09.04.22 15:35, Rafael J. Wysocki wrote:
>>> On Fri, Apr 8, 2022 at 9:53 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>>>>
>>>> On Mon, Apr 04, 2022 at 04:46:14PM +0200, Rafael J. Wysocki wrote:
>>>>> On Fri, Apr 1, 2022 at 1:34 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>>>>>> On Thu, Mar 31, 2022 at 11:57 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>>>>>>> On Thu, Mar 31, 2022 at 07:38:51PM +0200, Rafael J. Wysocki wrote:
>>>>>>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>>>>>>
>>>>>>>> If one of the PCIe root ports on Elo i2 is put into D3cold and then
>>>>>>>> back into D0, the downstream device becomes permanently inaccessible,
>>>>>>>> so add a bridge D3 DMI quirk for that system.
>>>>>>>>
>>>>>>>> This was exposed by commit 14858dcc3b35 ("PCI: Use
>>>>>>>> pci_update_current_state() in pci_enable_device_flags()"), but before
>>>>>>>> that commit the root port in question had never been put into D3cold
>>>>>>>> for real due to a mismatch between its power state retrieved from the
>>>>>>>> PCI_PM_CTRL register (which was accessible even though the platform
>>>>>>>> firmware indicated that the port was in D3cold) and the state of an
>>>>>>>> ACPI power resource involved in its power management.
>>>>>>>
>>>>>>> In the bug report you suspect a firmware issue.  Any idea what that
>>>>>>> might be?  It looks like a Gemini Lake Root Port, so I wouldn't think
>>>>>>> it would be a hardware issue.
>>>>>>
>>>>>> The _ON method of the ACPI power resource associated with the root
>>>>>> port doesn't work correctly.
>>>>>>
>>>>>>> Weird how things come in clumps.  Was just looking at Mario's patch,
>>>>>>> which also has to do with bridges and D3.
>>>>>>>
>>>>>>> Do we need a Fixes line?  E.g.,
>>>>>>>
>>>>>>>   Fixes: 14858dcc3b35 ("PCI: Use pci_update_current_state() in pci_enable_device_flags()")
>>>>>>
>>>>>> Strictly speaking, it is not a fix for the above commit.
>>>>>>
>>>>>> It is a workaround for a firmware issue uncovered by it which wasn't
>>>>>> visible, because power management was not used correctly on the
>>>>>> affected system because of another firmware problem addressed by
>>>>>> 14858dcc3b35.  It wouldn't have worked anyway had it been attempted
>>>>>> AFAICS.
>>>>>>
>>>>>> I was thinking about CCing this change to -stable instead.
>>>>
>>>> Makes sense, thanks.
>>>>
>>>>>>>> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=215715
>>>>>>>> Reported-by: Stefan Gottwald <gottwald@igel.com>
>>>>>>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>>>>>> ---
>>>>>>>>  drivers/pci/pci.c |   10 ++++++++++
>>>>>>>>  1 file changed, 10 insertions(+)
>>>>>>>>
>>>>>>>> Index: linux-pm/drivers/pci/pci.c
>>>>>>>> ===================================================================
>>>>>>>> --- linux-pm.orig/drivers/pci/pci.c
>>>>>>>> +++ linux-pm/drivers/pci/pci.c
>>>>>>>> @@ -2920,6 +2920,16 @@ static const struct dmi_system_id bridge
>>>>>>>>                       DMI_MATCH(DMI_BOARD_VENDOR, "Gigabyte Technology Co., Ltd."),
>>>>>>>>                       DMI_MATCH(DMI_BOARD_NAME, "X299 DESIGNARE EX-CF"),
>>>>>>>>               },
>>>>>>>> +             /*
>>>>>>>> +              * Downstream device is not accessible after putting a root port
>>>>>>>> +              * into D3cold and back into D0 on Elo i2.
>>>>>>>> +              */
>>>>>>>> +             .ident = "Elo i2",
>>>>>>>> +             .matches = {
>>>>>>>> +                     DMI_MATCH(DMI_SYS_VENDOR, "Elo Touch Solutions"),
>>>>>>>> +                     DMI_MATCH(DMI_PRODUCT_NAME, "Elo i2"),
>>>>>>>> +                     DMI_MATCH(DMI_PRODUCT_VERSION, "RevB"),
>>>>>>>> +             },
>>>>>>>
>>>>>>> Is this bridge_d3_blacklist[] similar to the PCI_DEV_FLAGS_NO_D3 bit?
>>>>>>
>>>>>> Not really.  The former applies to the entire platform and not to an
>>>>>> individual device.
>>>>>>
>>>>>>> Could they be folded together?  We have a lot of bits that seem
>>>>>>> similar but maybe not exactly the same (dev->bridge_d3,
>>>>>>> dev->no_d3cold, dev->d3cold_allowed, dev->runtime_d3cold,
>>>>>>> PCI_DEV_FLAGS_NO_D3, pci_bridge_d3_force, etc.)  Ugh.
>>>>>>
>>>>>> Yes, I agree that this needs to be cleaned up.
>>>>>>
>>>>>>> bridge_d3_blacklist[] itself was added by 85b0cae89d52 ("PCI:
>>>>>>> Blacklist power management of Gigabyte X299 DESIGNARE EX PCIe ports"),
>>>>>>> which honestly looks kind of random, i.e., it doesn't seem to be
>>>>>>> working around a hardware or even a firmware defect.
>>>>>>>
>>>>>>> Apparently the X299 issue is that 00:1c.4 is connected to a
>>>>>>> Thunderbolt controller, and the BIOS keeps the Thunderbolt controller
>>>>>>> powered off unless something is attached to it?  At least, 00:1c.4
>>>>>>> leads to bus 05, and the dmesg log attached to [1] shows no devices
>>>>>>> on bus 05.
>>>>>>>
>>>>>>> It also says the platform doesn't support PCIe native hotplug, which
>>>>>>> matches what Mika said about it using ACPI hotplug.  If a system is
>>>>>>> using ACPI hotplug, it seems like maybe *that* should prevent us from
>>>>>>> putting things in D3cold?  How can we know whether ACPI hotplug
>>>>>>> depends on a certain power state?
>>>>>>
>>>>>> We have this check in pci_bridge_d3_possible():
>>>>>>
>>>>>> if (bridge->is_hotplug_bridge && !pciehp_is_native(bridge))
>>>>>>             return false;
>>>>>>
>>>>>> but this only applies to the case when the particular bridge itself is
>>>>>> a hotplug one using ACPI hotplug.
>>>>>>
>>>>>> If ACPI hotplug is used, it generally is unsafe to put PCIe ports into
>>>>>> D3cold, because in that case it is unclear what the platform
>>>>>> firmware's assumptions regarding control of the config space are.
>>>>>>
>>>>>> However, I'm not sure how this is related to the patch at hand.
>>>>>
>>>>> So I'm not sure how you want to proceed here.
>>>>>
>>>>> The platform is quirky, so the quirk for it will need to be added this
>>>>> way or another.  The $subject patch adds it using the existing
>>>>> mechanism, which is the least intrusive way.
>>>>>
>>>>> You seem to be thinking that the existing mechanism may not be
>>>>> adequate, but I'm not sure for what reason and anyway I think that it
>>>>> can be adjusted after adding the quirk.
>>>>>
>>>>> Please let me know what you think.
>>>>
>>>> I don't understand all that's going on here, but I applied it to
>>>> pci/pm for v5.19, thanks!
>>> Thank you!
>>
>> Sorry, but this made me wonder: why v5.19? It's a regression exposed in
>> v5.15, so afaics it would be good to get this in this cycle -- and also
>> backported to v5.15.y, but it seems a tag to take care of that is
>> missing. :-/
> 
> Well, the patch is out there for everyone needing it.

IOW: only for those who are able to debug the issue and find the patch,
and who are capable & willing to patch & compile a kernel.

>  The question is
> how urgent it is to get it into the mainline and -stable, which boils
> down to the question of how many systems out there can be affected by it.

If it was a risky patch I'd agree, but for such a simple quirk? What's
the benefit of waiting? Sure, every change bears risks, but waiting can
make life harder for people.

Thing is: I noticed a lot of maintainers wait way too long with applying
regression fixes (even for regressions that affect multiple users),
which contributes to the huge pile of changes that go into early stable
kernels (like 5.17.2 recently with 1000+ changes). That's why the new
document on handling regressions (disclaimer: written by yours truly)
has a section encouraging maintainers to merge things more quickly:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/handling-regressions.rst#n131

> Since it is a firmware defect exposed, hopefully not too many.
I guess so, but it's always hard to tell.

Ciao, Thorsten

Bjorn Helgaas April 11, 2022, 4:22 p.m. UTC | #9
On Mon, Apr 11, 2022 at 02:10:30PM +0200, Thorsten Leemhuis wrote:
> 
> 
> On 11.04.22 13:35, Rafael J. Wysocki wrote:
> > On Sun, Apr 10, 2022 at 11:16 AM Thorsten Leemhuis
> > <regressions@leemhuis.info> wrote:
> >>
> >> On 09.04.22 15:35, Rafael J. Wysocki wrote:
> >>> On Fri, Apr 8, 2022 at 9:53 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> >>>>
> >>>> On Mon, Apr 04, 2022 at 04:46:14PM +0200, Rafael J. Wysocki wrote:
> >>>>> On Fri, Apr 1, 2022 at 1:34 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> >>>>>> On Thu, Mar 31, 2022 at 11:57 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> >>>>>>> On Thu, Mar 31, 2022 at 07:38:51PM +0200, Rafael J. Wysocki wrote:
> >>>>>>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>>>>>>>
> >>>>>>>> If one of the PCIe root ports on Elo i2 is put into D3cold and then
> >>>>>>>> back into D0, the downstream device becomes permanently inaccessible,
> >>>>>>>> so add a bridge D3 DMI quirk for that system.
> >>>>>>>>
> >>>>>>>> This was exposed by commit 14858dcc3b35 ("PCI: Use
> >>>>>>>> pci_update_current_state() in pci_enable_device_flags()"), but before
> >>>>>>>> that commit the root port in question had never been put into D3cold
> >>>>>>>> for real due to a mismatch between its power state retrieved from the
> >>>>>>>> PCI_PM_CTRL register (which was accessible even though the platform
> >>>>>>>> firmware indicated that the port was in D3cold) and the state of an
> >>>>>>>> ACPI power resource involved in its power management.
> >>>>>>>> ...

> >>>> I don't understand all that's going on here, but I applied it to
> >>>> pci/pm for v5.19, thanks!
> >>> Thank you!
> >>
> >> Sorry, but this made me wonder: why v5.19? It's a regression exposed in
> >> v5.15, so afaics it would be good to get this in this cycle -- and also
> >> backported to v5.15.y, but it seems a tag to take care of that is
> >> missing. :-/

I moved it to for-linus for v5.18 and added a stable tag for v5.15+.

Bjorn Helgaas May 26, 2022, 10:12 p.m. UTC | #10
On Thu, Mar 31, 2022 at 07:38:51PM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> If one of the PCIe root ports on Elo i2 is put into D3cold and then
> back into D0, the downstream device becomes permanently inaccessible,
> so add a bridge D3 DMI quirk for that system.
> 
> This was exposed by commit 14858dcc3b35 ("PCI: Use
> pci_update_current_state() in pci_enable_device_flags()"), but before
> that commit the root port in question had never been put into D3cold
> for real due to a mismatch between its power state retrieved from the
> PCI_PM_CTRL register (which was accessible even though the platform
> firmware indicated that the port was in D3cold) and the state of an
> ACPI power resource involved in its power management.
> 
> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=215715
> Reported-by: Stefan Gottwald <gottwald@igel.com>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  drivers/pci/pci.c |   10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> Index: linux-pm/drivers/pci/pci.c
> ===================================================================
> --- linux-pm.orig/drivers/pci/pci.c
> +++ linux-pm/drivers/pci/pci.c
> @@ -2920,6 +2920,16 @@ static const struct dmi_system_id bridge
>  			DMI_MATCH(DMI_BOARD_VENDOR, "Gigabyte Technology Co., Ltd."),
>  			DMI_MATCH(DMI_BOARD_NAME, "X299 DESIGNARE EX-CF"),
>  		},
> +		/*
> +		 * Downstream device is not accessible after putting a root port
> +		 * into D3cold and back into D0 on Elo i2.
> +		 */
> +		.ident = "Elo i2",
> +		.matches = {
> +			DMI_MATCH(DMI_SYS_VENDOR, "Elo Touch Solutions"),
> +			DMI_MATCH(DMI_PRODUCT_NAME, "Elo i2"),
> +			DMI_MATCH(DMI_PRODUCT_VERSION, "RevB"),
> +		},
>  	},

This has already made it to Linus' and some stable trees, but I think
we need the following touchup.  I plan to send it right after my v5.19
pull request.

commit a99f6bb133df ("PCI/PM: Fix bridge_d3_blacklist[] Elo i2 overwrite of Gigabyte X299")
Author: Bjorn Helgaas <bhelgaas@google.com>
Date:   Thu May 26 16:52:23 2022 -0500

    PCI/PM: Fix bridge_d3_blacklist[] Elo i2 overwrite of Gigabyte X299
    
    92597f97a40b ("PCI/PM: Avoid putting Elo i2 PCIe Ports in D3cold") omitted
    braces around the new Elo i2 entry, so it overwrote the existing Gigabyte
    X299 entry.
    
    Found by:
    
      $ make W=1 drivers/pci/pci.o
        CC      drivers/pci/pci.o
      drivers/pci/pci.c:2974:12: error: initialized field overwritten [-Werror=override-init]
       2974 |   .ident = "Elo i2",
            |            ^~~~~~~~
    
    Fixes: 92597f97a40b ("PCI/PM: Avoid putting Elo i2 PCIe Ports in D3cold")
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
    Cc: stable@vger.kernel.org  # v5.15+

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index d25122fbe98a..5b400a742621 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2920,6 +2920,8 @@ static const struct dmi_system_id bridge_d3_blacklist[] = {
 			DMI_MATCH(DMI_BOARD_VENDOR, "Gigabyte Technology Co., Ltd."),
 			DMI_MATCH(DMI_BOARD_NAME, "X299 DESIGNARE EX-CF"),
 		},
+	},
+	{
 		/*
 		 * Downstream device is not accessible after putting a root port
 		 * into D3cold and back into D0 on Elo i2.

Rafael J. Wysocki May 27, 2022, 6:55 p.m. UTC | #11
On Fri, May 27, 2022 at 12:13 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Thu, Mar 31, 2022 at 07:38:51PM +0200, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > If one of the PCIe root ports on Elo i2 is put into D3cold and then
> > back into D0, the downstream device becomes permanently inaccessible,
> > so add a bridge D3 DMI quirk for that system.
> >
> > This was exposed by commit 14858dcc3b35 ("PCI: Use
> > pci_update_current_state() in pci_enable_device_flags()"), but before
> > that commit the root port in question had never been put into D3cold
> > for real due to a mismatch between its power state retrieved from the
> > PCI_PM_CTRL register (which was accessible even though the platform
> > firmware indicated that the port was in D3cold) and the state of an
> > ACPI power resource involved in its power management.
> >
> > BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=215715
> > Reported-by: Stefan Gottwald <gottwald@igel.com>
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> >  drivers/pci/pci.c |   10 ++++++++++
> >  1 file changed, 10 insertions(+)
> >
> > Index: linux-pm/drivers/pci/pci.c
> > ===================================================================
> > --- linux-pm.orig/drivers/pci/pci.c
> > +++ linux-pm/drivers/pci/pci.c
> > @@ -2920,6 +2920,16 @@ static const struct dmi_system_id bridge
> >                       DMI_MATCH(DMI_BOARD_VENDOR, "Gigabyte Technology Co., Ltd."),
> >                       DMI_MATCH(DMI_BOARD_NAME, "X299 DESIGNARE EX-CF"),
> >               },
> > +             /*
> > +              * Downstream device is not accessible after putting a root port
> > +              * into D3cold and back into D0 on Elo i2.
> > +              */
> > +             .ident = "Elo i2",
> > +             .matches = {
> > +                     DMI_MATCH(DMI_SYS_VENDOR, "Elo Touch Solutions"),
> > +                     DMI_MATCH(DMI_PRODUCT_NAME, "Elo i2"),
> > +                     DMI_MATCH(DMI_PRODUCT_VERSION, "RevB"),
> > +             },
> >       },
>
> This has already made it to Linus' and some stable trees, but I think
> we need the following touchup.  I plan to send it right after my v5.19
> pull request.

Ouch, sorry.

> commit a99f6bb133df ("PCI/PM: Fix bridge_d3_blacklist[] Elo i2 overwrite of Gigabyte X299")
> Author: Bjorn Helgaas <bhelgaas@google.com>
> Date:   Thu May 26 16:52:23 2022 -0500
>
>     PCI/PM: Fix bridge_d3_blacklist[] Elo i2 overwrite of Gigabyte X299
>
>     92597f97a40b ("PCI/PM: Avoid putting Elo i2 PCIe Ports in D3cold") omitted
>     braces around the new Elo i2 entry, so it overwrote the existing Gigabyte
>     X299 entry.
>
>     Found by:
>
>       $ make W=1 drivers/pci/pci.o
>         CC      drivers/pci/pci.o
>       drivers/pci/pci.c:2974:12: error: initialized field overwritten [-Werror=override-init]
>        2974 |   .ident = "Elo i2",
>             |            ^~~~~~~~
>
>     Fixes: 92597f97a40b ("PCI/PM: Avoid putting Elo i2 PCIe Ports in D3cold")
>     Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
>     Cc: stable@vger.kernel.org  # v5.15+
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index d25122fbe98a..5b400a742621 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -2920,6 +2920,8 @@ static const struct dmi_system_id bridge_d3_blacklist[] = {
>                         DMI_MATCH(DMI_BOARD_VENDOR, "Gigabyte Technology Co., Ltd."),
>                         DMI_MATCH(DMI_BOARD_NAME, "X299 DESIGNARE EX-CF"),
>                 },
> +       },
> +       {
>                 /*
>                  * Downstream device is not accessible after putting a root port
>                  * into D3cold and back into D0 on Elo i2.

Patch

Index: linux-pm/drivers/pci/pci.c
===================================================================
--- linux-pm.orig/drivers/pci/pci.c
+++ linux-pm/drivers/pci/pci.c
@@ -2920,6 +2920,16 @@  static const struct dmi_system_id bridge
 			DMI_MATCH(DMI_BOARD_VENDOR, "Gigabyte Technology Co., Ltd."),
 			DMI_MATCH(DMI_BOARD_NAME, "X299 DESIGNARE EX-CF"),
 		},
+		/*
+		 * Downstream device is not accessible after putting a root port
+		 * into D3cold and back into D0 on Elo i2.
+		 */
+		.ident = "Elo i2",
+		.matches = {
+			DMI_MATCH(DMI_SYS_VENDOR, "Elo Touch Solutions"),
+			DMI_MATCH(DMI_PRODUCT_NAME, "Elo i2"),
+			DMI_MATCH(DMI_PRODUCT_VERSION, "RevB"),
+		},
 	},
 #endif
 	{ }