diff mbox series

[RFC] pci: prevent putting pcie devices into lower device states on certain intel bridges

Message ID 20190927144421.22608-1-kherbst@redhat.com (mailing list archive)
State New, archived
Headers show
Series [RFC] pci: prevent putting pcie devices into lower device states on certain intel bridges | expand

Commit Message

Karol Herbst Sept. 27, 2019, 2:44 p.m. UTC
Fixes runpm breakage mainly on Nvidia GPUs as they are not able to resume.

Works perfectly with this workaround applied.

RFC comment:
We are quite sure that there is a higher amount of bridges affected by this,
but I was only testing it on my own machine for now.

I've stresstested runpm by doing 5000 runpm cycles with that patch applied
and never saw it fail.

I mainly wanted to get a discussion going on if that's a feasable workaround
indeed or if we need something better.

I am also sure, that the nouveau driver itself isn't at fault as I am able
to reproduce the same issue by poking into some PCI registers on the PCIe
bridge to put the GPU into D3cold as it's done in ACPI code.

I've written a little python script to reproduce this issue without the need
of loading nouveau:
https://raw.githubusercontent.com/karolherbst/pci-stub-runpm/master/nv_runpm_bug_test.py

Signed-off-by: Karol Herbst <kherbst@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: linux-pci@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: nouveau@lists.freedesktop.org
---
 drivers/pci/pci.c | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

Comments

Bjorn Helgaas Sept. 27, 2019, 9:42 p.m. UTC | #1
[+cc Rafael, Mika, linux-pm]

On Fri, Sep 27, 2019 at 04:44:21PM +0200, Karol Herbst wrote:
> Fixes runpm breakage mainly on Nvidia GPUs as they are not able to resume.

I don't know what runpm is.  Some userspace utility?  Module
parameter?

> Works perfectly with this workaround applied.
> 
> RFC comment:
> We are quite sure that there is a higher amount of bridges affected by this,
> but I was only testing it on my own machine for now.
> 
> I've stresstested runpm by doing 5000 runpm cycles with that patch applied
> and never saw it fail.
> 
> I mainly wanted to get a discussion going on if that's a feasable workaround
> indeed or if we need something better.
> 
> I am also sure, that the nouveau driver itself isn't at fault as I am able
> to reproduce the same issue by poking into some PCI registers on the PCIe
> bridge to put the GPU into D3cold as it's done in ACPI code.
> 
> I've written a little python script to reproduce this issue without the need
> of loading nouveau:
> https://raw.githubusercontent.com/karolherbst/pci-stub-runpm/master/nv_runpm_bug_test.py

Nice script, thanks for sharing it :)  I could learn a lot of useful
python by studying it.

> Signed-off-by: Karol Herbst <kherbst@redhat.com>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: linux-pci@vger.kernel.org
> Cc: dri-devel@lists.freedesktop.org
> Cc: nouveau@lists.freedesktop.org
> ---
>  drivers/pci/pci.c | 39 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 39 insertions(+)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 088fcdc8d2b4..9dbd29ced1ac 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -799,6 +799,42 @@ static inline bool platform_pci_bridge_d3(struct pci_dev *dev)
>  	return pci_platform_pm ? pci_platform_pm->bridge_d3(dev) : false;
>  }
>  
> +/*
> + * some intel bridges cause serious issues with runpm if the client device
> + * is put into D1/D2/D3hot before putting the client into D3cold via
> + * platform means (generally ACPI).

You mention Nvidia GPUs above, but I guess the same issue may affect
other devices?  I would really like to chase this down to a more
specific issue, e.g., a hardware defect with erratum, an ACPI defect,
or a Linux defect.  Without the specifics, this is just a band-aid.

I don't see any relevant requirements in the _OFF description, but I
don't know much about ACPI power control.

Your script allows several scenarios; I *guess* the one that causes
the problem is:

  - write 3 (D3hot) to GPU PowerState (PCIE_PM_REG == 0x64, I assume
    PM Capability Control Register)
  - write 3 (D3hot) to bridge PowerState (0x84, I assume PM Capability
    Control Register)
  - run _OFF on the power resource for the bridge

From your script I assume you do:

  - run _ON on the power resource for the bridge
  - write 0 (D0) to the bridge PowerState

You do *not* write the GPU PowerState (which we can't do if the GPU is
in D3cold).  Is there some assumption that it comes out of D3cold via
some other mechanism, e.g., is the _ON supposed to wake up the GPU?

What exactly is the serious issue?  I guess it's that the rescan
doesn't detect the GPU, which means it's not responding to config
accesses?  Is there any timing component here, e.g., maybe we're
missing some delay like the ones Mika is adding to the reset paths?

> + *
> + * skipping this makes runpm work perfectly fine on such devices.
> + *
> + * As far as we know only skylake and kaby lake SoCs are affected.
> + */
> +static unsigned short intel_broken_d3_bridges[] = {
> +	/* kbl */
> +	0x1901,
> +};
> +
> +static inline bool intel_broken_pci_pm(struct pci_bus *bus)
> +{
> +	struct pci_dev *bridge;
> +	int i;
> +
> +	if (!bus || !bus->self)
> +		return false;
> +
> +	bridge = bus->self;
> +	if (bridge->vendor != PCI_VENDOR_ID_INTEL)
> +		return false;
> +
> +	for (i = 0; i < ARRAY_SIZE(intel_broken_d3_bridges); i++) {
> +		if (bridge->device == intel_broken_d3_bridges[i]) {
> +			pci_err(bridge, "found broken intel bridge\n");

If this ends up being a hardware defect, we should use a quirk to set
a bit in the pci_dev once, as we do for broken_intx_masking and
similar bits.

> +			return true;
> +		}
> +	}
> +
> +	return false;
> +}
> +
>  /**
>   * pci_raw_set_power_state - Use PCI PM registers to set the power state of
>   *			     given PCI device
> @@ -827,6 +863,9 @@ static int pci_raw_set_power_state(struct pci_dev *dev, pci_power_t state)
>  	if (state < PCI_D0 || state > PCI_D3hot)
>  		return -EINVAL;
>  
> +	if (state != PCI_D0 && intel_broken_pci_pm(dev->bus))
> +		return 0;
> +
>  	/*
>  	 * Validate current state:
>  	 * Can enter D0 from any state, but if we can only go deeper
> -- 
> 2.21.0
>
Karol Herbst Sept. 27, 2019, 9:53 p.m. UTC | #2
On Fri, Sep 27, 2019 at 11:42 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> [+cc Rafael, Mika, linux-pm]
>
> On Fri, Sep 27, 2019 at 04:44:21PM +0200, Karol Herbst wrote:
> > Fixes runpm breakage mainly on Nvidia GPUs as they are not able to resume.
>
> I don't know what runpm is.  Some userspace utility?  Module
> parameter?
>

runpm aka runtime powermanagement aka runtime_resume and runtime_suspend

> > Works perfectly with this workaround applied.
> >
> > RFC comment:
> > We are quite sure that there is a higher amount of bridges affected by this,
> > but I was only testing it on my own machine for now.
> >
> > I've stresstested runpm by doing 5000 runpm cycles with that patch applied
> > and never saw it fail.
> >
> > I mainly wanted to get a discussion going on if that's a feasable workaround
> > indeed or if we need something better.
> >
> > I am also sure, that the nouveau driver itself isn't at fault as I am able
> > to reproduce the same issue by poking into some PCI registers on the PCIe
> > bridge to put the GPU into D3cold as it's done in ACPI code.
> >
> > I've written a little python script to reproduce this issue without the need
> > of loading nouveau:
> > https://raw.githubusercontent.com/karolherbst/pci-stub-runpm/master/nv_runpm_bug_test.py
>
> Nice script, thanks for sharing it :)  I could learn a lot of useful
> python by studying it.
>
> > Signed-off-by: Karol Herbst <kherbst@redhat.com>
> > Cc: Bjorn Helgaas <bhelgaas@google.com>
> > Cc: Lyude Paul <lyude@redhat.com>
> > Cc: linux-pci@vger.kernel.org
> > Cc: dri-devel@lists.freedesktop.org
> > Cc: nouveau@lists.freedesktop.org
> > ---
> >  drivers/pci/pci.c | 39 +++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 39 insertions(+)
> >
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index 088fcdc8d2b4..9dbd29ced1ac 100644
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -799,6 +799,42 @@ static inline bool platform_pci_bridge_d3(struct pci_dev *dev)
> >       return pci_platform_pm ? pci_platform_pm->bridge_d3(dev) : false;
> >  }
> >
> > +/*
> > + * some intel bridges cause serious issues with runpm if the client device
> > + * is put into D1/D2/D3hot before putting the client into D3cold via
> > + * platform means (generally ACPI).
>
> You mention Nvidia GPUs above, but I guess the same issue may affect
> other devices?  I would really like to chase this down to a more
> specific issue, e.g., a hardware defect with erratum, an ACPI defect,
> or a Linux defect.  Without the specifics, this is just a band-aid.
>

yep.. but we were trying to talk to Nvidia and Intel about it and had
no luck getting anything out of them so far. We gave up on Nvidia,
Intel is still pending.

> I don't see any relevant requirements in the _OFF description, but I
> don't know much about ACPI power control.
>
> Your script allows several scenarios; I *guess* the one that causes
> the problem is:
>
>   - write 3 (D3hot) to GPU PowerState (PCIE_PM_REG == 0x64, I assume
>     PM Capability Control Register)

correct

>   - write 3 (D3hot) to bridge PowerState (0x84, I assume PM Capability
>     Control Register)

correct, but this seems to be fine and doesn't fix the issue if that
part is skipped

>   - run _OFF on the power resource for the bridge
>
> From your script I assume you do:
>
>   - run _ON on the power resource for the bridge
>   - write 0 (D0) to the bridge PowerState
>
> You do *not* write the GPU PowerState (which we can't do if the GPU is
> in D3cold).  Is there some assumption that it comes out of D3cold via
> some other mechanism, e.g., is the _ON supposed to wake up the GPU?

if the "link" is powered up again (via 0x248, 0xbc or 0xb0 on the GPU
bridge or the ACPI _ON method) the GPU comes up in the D0 state and is
fully operational, well besides the little issue we've got with the D3
write. It can also be worked around by putting the PCIe link into 5.0
and 8.0 mode, but that's not reliable enough as it fails it around 10%
of all tries.

>
> What exactly is the serious issue?  I guess it's that the rescan
> doesn't detect the GPU, which means it's not responding to config
> accesses?  Is there any timing component here, e.g., maybe we're
> missing some delay like the ones Mika is adding to the reset paths?
>

When I was checking up on some of the PCI registers of the bridge
controller, the slot detection told me that there is no device
recognized anymore. I don't know which register it was anymore, though
I guess one could read it up in the SoC spec document by Intel.

My guess is, that the bridge controller fails to detect the GPU being
here or actively threw it of the bus or something. But a normal system
suspend/resume cycle brings the GPU back online (doing a rescan via
sysfs gets the device detected again)

> > + *
> > + * skipping this makes runpm work perfectly fine on such devices.
> > + *
> > + * As far as we know only skylake and kaby lake SoCs are affected.
> > + */
> > +static unsigned short intel_broken_d3_bridges[] = {
> > +     /* kbl */
> > +     0x1901,
> > +};
> > +
> > +static inline bool intel_broken_pci_pm(struct pci_bus *bus)
> > +{
> > +     struct pci_dev *bridge;
> > +     int i;
> > +
> > +     if (!bus || !bus->self)
> > +             return false;
> > +
> > +     bridge = bus->self;
> > +     if (bridge->vendor != PCI_VENDOR_ID_INTEL)
> > +             return false;
> > +
> > +     for (i = 0; i < ARRAY_SIZE(intel_broken_d3_bridges); i++) {
> > +             if (bridge->device == intel_broken_d3_bridges[i]) {
> > +                     pci_err(bridge, "found broken intel bridge\n");
>
> If this ends up being a hardware defect, we should use a quirk to set
> a bit in the pci_dev once, as we do for broken_intx_masking and
> similar bits.

okay, if you think this is the preferred way then I can change the
patch accordingly.

>
> > +                     return true;
> > +             }
> > +     }
> > +
> > +     return false;
> > +}
> > +
> >  /**
> >   * pci_raw_set_power_state - Use PCI PM registers to set the power state of
> >   *                        given PCI device
> > @@ -827,6 +863,9 @@ static int pci_raw_set_power_state(struct pci_dev *dev, pci_power_t state)
> >       if (state < PCI_D0 || state > PCI_D3hot)
> >               return -EINVAL;
> >
> > +     if (state != PCI_D0 && intel_broken_pci_pm(dev->bus))
> > +             return 0;
> > +
> >       /*
> >        * Validate current state:
> >        * Can enter D0 from any state, but if we can only go deeper
> > --
> > 2.21.0
> >
Mika Westerberg Sept. 30, 2019, 8:05 a.m. UTC | #3
Hi Karol,

On Fri, Sep 27, 2019 at 11:53:48PM +0200, Karol Herbst wrote:
> > What exactly is the serious issue?  I guess it's that the rescan
> > doesn't detect the GPU, which means it's not responding to config
> > accesses?  Is there any timing component here, e.g., maybe we're
> > missing some delay like the ones Mika is adding to the reset paths?
> 
> When I was checking up on some of the PCI registers of the bridge
> controller, the slot detection told me that there is no device
> recognized anymore. I don't know which register it was anymore, though
> I guess one could read it up in the SoC spec document by Intel.
> 
> My guess is, that the bridge controller fails to detect the GPU being
> here or actively threw it of the bus or something. But a normal system
> suspend/resume cycle brings the GPU back online (doing a rescan via
> sysfs gets the device detected again)

Can you elaborate a bit what kind of scenario the issue happens (e.g
steps how it reproduces)? It was not 100% clear from the changelog. Also
what the result when the failure happens?

I see there is a script that does something but unfortunately I'm not
fluent in Python so can't extract the steps how the issue can be
reproduced ;-)

One thing that I'm working on is that Linux PCI subsystem misses certain
delays that are needed after D3cold -> D0 transition, otherwise the
device and/or link may not be ready before we access it. What you are
experiencing sounds similar. I wonder if you could try the following
patch and see if it makes any difference?

https://patchwork.kernel.org/patch/11106611/
Karol Herbst Sept. 30, 2019, 9:15 a.m. UTC | #4
On Mon, Sep 30, 2019 at 10:05 AM Mika Westerberg
<mika.westerberg@linux.intel.com> wrote:
>
> Hi Karol,
>
> On Fri, Sep 27, 2019 at 11:53:48PM +0200, Karol Herbst wrote:
> > > What exactly is the serious issue?  I guess it's that the rescan
> > > doesn't detect the GPU, which means it's not responding to config
> > > accesses?  Is there any timing component here, e.g., maybe we're
> > > missing some delay like the ones Mika is adding to the reset paths?
> >
> > When I was checking up on some of the PCI registers of the bridge
> > controller, the slot detection told me that there is no device
> > recognized anymore. I don't know which register it was anymore, though
> > I guess one could read it up in the SoC spec document by Intel.
> >
> > My guess is, that the bridge controller fails to detect the GPU being
> > here or actively threw it of the bus or something. But a normal system
> > suspend/resume cycle brings the GPU back online (doing a rescan via
> > sysfs gets the device detected again)
>
> Can you elaborate a bit what kind of scenario the issue happens (e.g
> steps how it reproduces)? It was not 100% clear from the changelog. Also
> what the result when the failure happens?
>

yeah, I already have an updated patch in the works which also does the
rework Bjorn suggested. Had no time yet to test if I didn't mess it
up.

I am also thinking of adding a kernel parameter to enable this
workaround on demand, but not quite sure on that one yet.

> I see there is a script that does something but unfortunately I'm not
> fluent in Python so can't extract the steps how the issue can be
> reproduced ;-)
>
> One thing that I'm working on is that Linux PCI subsystem misses certain
> delays that are needed after D3cold -> D0 transition, otherwise the
> device and/or link may not be ready before we access it. What you are
> experiencing sounds similar. I wonder if you could try the following
> patch and see if it makes any difference?
>
> https://patchwork.kernel.org/patch/11106611/

I think I already tried this path. The problem isn't that the device
isn't accessible too late, but that it seems that the device
completely falls off the bus. But I can retest again just to be sure.
Mika Westerberg Sept. 30, 2019, 9:29 a.m. UTC | #5
On Mon, Sep 30, 2019 at 11:15:48AM +0200, Karol Herbst wrote:
> On Mon, Sep 30, 2019 at 10:05 AM Mika Westerberg
> <mika.westerberg@linux.intel.com> wrote:
> >
> > Hi Karol,
> >
> > On Fri, Sep 27, 2019 at 11:53:48PM +0200, Karol Herbst wrote:
> > > > What exactly is the serious issue?  I guess it's that the rescan
> > > > doesn't detect the GPU, which means it's not responding to config
> > > > accesses?  Is there any timing component here, e.g., maybe we're
> > > > missing some delay like the ones Mika is adding to the reset paths?
> > >
> > > When I was checking up on some of the PCI registers of the bridge
> > > controller, the slot detection told me that there is no device
> > > recognized anymore. I don't know which register it was anymore, though
> > > I guess one could read it up in the SoC spec document by Intel.
> > >
> > > My guess is, that the bridge controller fails to detect the GPU being
> > > here or actively threw it of the bus or something. But a normal system
> > > suspend/resume cycle brings the GPU back online (doing a rescan via
> > > sysfs gets the device detected again)
> >
> > Can you elaborate a bit what kind of scenario the issue happens (e.g
> > steps how it reproduces)? It was not 100% clear from the changelog. Also
> > what the result when the failure happens?
> >
> 
> yeah, I already have an updated patch in the works which also does the
> rework Bjorn suggested. Had no time yet to test if I didn't mess it
> up.
> 
> I am also thinking of adding a kernel parameter to enable this
> workaround on demand, but not quite sure on that one yet.

Right, I think it would be good to figure out the root cause before
adding any workarounds ;-) It might very well be that we are just
missing something the PCIe spec requires but not implemented in Linux.

> > I see there is a script that does something but unfortunately I'm not
> > fluent in Python so can't extract the steps how the issue can be
> > reproduced ;-)
> >
> > One thing that I'm working on is that Linux PCI subsystem misses certain
> > delays that are needed after D3cold -> D0 transition, otherwise the
> > device and/or link may not be ready before we access it. What you are
> > experiencing sounds similar. I wonder if you could try the following
> > patch and see if it makes any difference?
> >
> > https://patchwork.kernel.org/patch/11106611/
> 
> I think I already tried this path. The problem isn't that the device
> isn't accessible too late, but that it seems that the device
> completely falls off the bus. But I can retest again just to be sure.

Yes, please try it and share full dmesg if/when the failure still happens.
Karol Herbst Sept. 30, 2019, 4:05 p.m. UTC | #6
still happens with your patch applied. The machine simply gets shut down.

dmesg can be found here:
https://gist.githubusercontent.com/karolherbst/40eb091c7b7b33ef993525de660f1a3b/raw/2380e31f566e93e5ba7c87ef545420965d4c492c/gistfile1.txt

If there are no other things to try out, I will post the updated patch shortly.

On Mon, Sep 30, 2019 at 11:29 AM Mika Westerberg
<mika.westerberg@linux.intel.com> wrote:
>
> On Mon, Sep 30, 2019 at 11:15:48AM +0200, Karol Herbst wrote:
> > On Mon, Sep 30, 2019 at 10:05 AM Mika Westerberg
> > <mika.westerberg@linux.intel.com> wrote:
> > >
> > > Hi Karol,
> > >
> > > On Fri, Sep 27, 2019 at 11:53:48PM +0200, Karol Herbst wrote:
> > > > > What exactly is the serious issue?  I guess it's that the rescan
> > > > > doesn't detect the GPU, which means it's not responding to config
> > > > > accesses?  Is there any timing component here, e.g., maybe we're
> > > > > missing some delay like the ones Mika is adding to the reset paths?
> > > >
> > > > When I was checking up on some of the PCI registers of the bridge
> > > > controller, the slot detection told me that there is no device
> > > > recognized anymore. I don't know which register it was anymore, though
> > > > I guess one could read it up in the SoC spec document by Intel.
> > > >
> > > > My guess is, that the bridge controller fails to detect the GPU being
> > > > here or actively threw it of the bus or something. But a normal system
> > > > suspend/resume cycle brings the GPU back online (doing a rescan via
> > > > sysfs gets the device detected again)
> > >
> > > Can you elaborate a bit what kind of scenario the issue happens (e.g
> > > steps how it reproduces)? It was not 100% clear from the changelog. Also
> > > what the result when the failure happens?
> > >
> >
> > yeah, I already have an updated patch in the works which also does the
> > rework Bjorn suggested. Had no time yet to test if I didn't mess it
> > up.
> >
> > I am also thinking of adding a kernel parameter to enable this
> > workaround on demand, but not quite sure on that one yet.
>
> Right, I think it would be good to figure out the root cause before
> adding any workarounds ;-) It might very well be that we are just
> missing something the PCIe spec requires but not implemented in Linux.
>
> > > I see there is a script that does something but unfortunately I'm not
> > > fluent in Python so can't extract the steps how the issue can be
> > > reproduced ;-)
> > >
> > > One thing that I'm working on is that Linux PCI subsystem misses certain
> > > delays that are needed after D3cold -> D0 transition, otherwise the
> > > device and/or link may not be ready before we access it. What you are
> > > experiencing sounds similar. I wonder if you could try the following
> > > patch and see if it makes any difference?
> > >
> > > https://patchwork.kernel.org/patch/11106611/
> >
> > I think I already tried this path. The problem isn't that the device
> > isn't accessible too late, but that it seems that the device
> > completely falls off the bus. But I can retest again just to be sure.
>
> Yes, please try it and share full dmesg if/when the failure still happens.
Mika Westerberg Sept. 30, 2019, 4:30 p.m. UTC | #7
On Mon, Sep 30, 2019 at 06:05:14PM +0200, Karol Herbst wrote:
> still happens with your patch applied. The machine simply gets shut down.
> 
> dmesg can be found here:
> https://gist.githubusercontent.com/karolherbst/40eb091c7b7b33ef993525de660f1a3b/raw/2380e31f566e93e5ba7c87ef545420965d4c492c/gistfile1.txt

Looking your dmesg:

Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: DCB version 4.1
Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: MM: using COPY for buffer copies
Sep 30 17:24:27 kernel: [drm] Initialized nouveau 1.3.1 20120801 for 0000:01:00.0 on minor 1

I would assume it runtime suspends here. Then it wakes up because of PCI
access from userspace:

Sep 30 17:24:42 kernel: pci_raw_set_power_state: 56 callbacks suppressed
 
and for some reason it does not get resumed properly. There are also few
warnings from ACPI that might be relevant:

Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)
Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)

This seems to be Dell XPS 9560 which I think has been around some time
already so I wonder why we only see issues now. Has it ever worked for
you or maybe there is a regression that causes it to happen now?
Karol Herbst Sept. 30, 2019, 4:36 p.m. UTC | #8
On Mon, Sep 30, 2019 at 6:30 PM Mika Westerberg
<mika.westerberg@linux.intel.com> wrote:
>
> On Mon, Sep 30, 2019 at 06:05:14PM +0200, Karol Herbst wrote:
> > still happens with your patch applied. The machine simply gets shut down.
> >
> > dmesg can be found here:
> > https://gist.githubusercontent.com/karolherbst/40eb091c7b7b33ef993525de660f1a3b/raw/2380e31f566e93e5ba7c87ef545420965d4c492c/gistfile1.txt
>
> Looking your dmesg:
>
> Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: DCB version 4.1
> Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: MM: using COPY for buffer copies
> Sep 30 17:24:27 kernel: [drm] Initialized nouveau 1.3.1 20120801 for 0000:01:00.0 on minor 1
>
> I would assume it runtime suspends here. Then it wakes up because of PCI
> access from userspace:
>
> Sep 30 17:24:42 kernel: pci_raw_set_power_state: 56 callbacks suppressed
>
> and for some reason it does not get resumed properly. There are also few
> warnings from ACPI that might be relevant:
>
> Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)
> Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)
>

afaik this is the case for essentially every laptop out there.

> This seems to be Dell XPS 9560 which I think has been around some time
> already so I wonder why we only see issues now. Has it ever worked for
> you or maybe there is a regression that causes it to happen now?

oh, it's broken since forever, we just tried to get more information
from Nvidia if they know what this is all about, but we got nothing
useful.

We were also hoping to find a reliable fix or workaround we could have
inside nouveau to fix that as I think nouveau is the only driver
actually hit by this issue, but nothing turned out to be reliable
enough.
Mika Westerberg Oct. 1, 2019, 8:46 a.m. UTC | #9
On Mon, Sep 30, 2019 at 06:36:12PM +0200, Karol Herbst wrote:
> On Mon, Sep 30, 2019 at 6:30 PM Mika Westerberg
> <mika.westerberg@linux.intel.com> wrote:
> >
> > On Mon, Sep 30, 2019 at 06:05:14PM +0200, Karol Herbst wrote:
> > > still happens with your patch applied. The machine simply gets shut down.
> > >
> > > dmesg can be found here:
> > > https://gist.githubusercontent.com/karolherbst/40eb091c7b7b33ef993525de660f1a3b/raw/2380e31f566e93e5ba7c87ef545420965d4c492c/gistfile1.txt
> >
> > Looking your dmesg:
> >
> > Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: DCB version 4.1
> > Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: MM: using COPY for buffer copies
> > Sep 30 17:24:27 kernel: [drm] Initialized nouveau 1.3.1 20120801 for 0000:01:00.0 on minor 1
> >
> > I would assume it runtime suspends here. Then it wakes up because of PCI
> > access from userspace:
> >
> > Sep 30 17:24:42 kernel: pci_raw_set_power_state: 56 callbacks suppressed
> >
> > and for some reason it does not get resumed properly. There are also few
> > warnings from ACPI that might be relevant:
> >
> > Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)
> > Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)
> >
> 
> afaik this is the case for essentially every laptop out there.

OK, so they are harmless?

> > This seems to be Dell XPS 9560 which I think has been around some time
> > already so I wonder why we only see issues now. Has it ever worked for
> > you or maybe there is a regression that causes it to happen now?
> 
> oh, it's broken since forever, we just tried to get more information
> from Nvidia if they know what this is all about, but we got nothing
> useful.
> 
> We were also hoping to find a reliable fix or workaround we could have
> inside nouveau to fix that as I think nouveau is the only driver
> actually hit by this issue, but nothing turned out to be reliable
> enough.

Can't you just block runtime PM from the nouveau driver until this is
understood better? That can be done by calling pm_runtime_forbid() (or
not calling pm_runtime_allow() in the driver). Or in case of PCI driver
you just don't decrease the reference count when probe() ends.

I think that would be much better than blocking any devices behind
Kabylake PCIe root ports from entering D3 (I don't really think the
problem is in the root ports itself but there is something we are
missing when the NVIDIA GPU is put into D3cold or back from there).
Karol Herbst Oct. 1, 2019, 8:56 a.m. UTC | #10
On Tue, Oct 1, 2019 at 10:47 AM Mika Westerberg
<mika.westerberg@linux.intel.com> wrote:
>
> On Mon, Sep 30, 2019 at 06:36:12PM +0200, Karol Herbst wrote:
> > On Mon, Sep 30, 2019 at 6:30 PM Mika Westerberg
> > <mika.westerberg@linux.intel.com> wrote:
> > >
> > > On Mon, Sep 30, 2019 at 06:05:14PM +0200, Karol Herbst wrote:
> > > > still happens with your patch applied. The machine simply gets shut down.
> > > >
> > > > dmesg can be found here:
> > > > https://gist.githubusercontent.com/karolherbst/40eb091c7b7b33ef993525de660f1a3b/raw/2380e31f566e93e5ba7c87ef545420965d4c492c/gistfile1.txt
> > >
> > > Looking your dmesg:
> > >
> > > Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: DCB version 4.1
> > > Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: MM: using COPY for buffer copies
> > > Sep 30 17:24:27 kernel: [drm] Initialized nouveau 1.3.1 20120801 for 0000:01:00.0 on minor 1
> > >
> > > I would assume it runtime suspends here. Then it wakes up because of PCI
> > > access from userspace:
> > >
> > > Sep 30 17:24:42 kernel: pci_raw_set_power_state: 56 callbacks suppressed
> > >
> > > and for some reason it does not get resumed properly. There are also few
> > > warnings from ACPI that might be relevant:
> > >
> > > Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)
> > > Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)
> > >
> >
> > afaik this is the case for essentially every laptop out there.
>
> OK, so they are harmless?
>

yes

> > > This seems to be Dell XPS 9560 which I think has been around some time
> > > already so I wonder why we only see issues now. Has it ever worked for
> > > you or maybe there is a regression that causes it to happen now?
> >
> > oh, it's broken since forever, we just tried to get more information
> > from Nvidia if they know what this is all about, but we got nothing
> > useful.
> >
> > We were also hoping to find a reliable fix or workaround we could have
> > inside nouveau to fix that as I think nouveau is the only driver
> > actually hit by this issue, but nothing turned out to be reliable
> > enough.
>
> Can't you just block runtime PM from the nouveau driver until this is
> understood better? That can be done by calling pm_runtime_forbid() (or
> not calling pm_runtime_allow() in the driver). Or in case of PCI driver
> you just don't decrease the reference count when probe() ends.
>

the thing is, it does work for a lot of laptops. We could only observe
this on kaby lake and skylake ones. Even on Cannon Lakes it seems to
work just fine.

> I think that would be much better than blocking any devices behind
> Kabylake PCIe root ports from entering D3 (I don't really think the
> problem is in the root ports itself but there is something we are
> missing when the NVIDIA GPU is put into D3cold or back from there).

I highly doubt there is anything wrong with the GPU alone as we have
too many indications which tell us otherwise.

Anyway, at this point I don't know where to look further for what's
actually wrong. And apparently it works on Windows, but I don't know
why and I have no idea what Windows does on such systems to make it
work reliably.
Mika Westerberg Oct. 1, 2019, 9:11 a.m. UTC | #11
On Tue, Oct 01, 2019 at 10:56:39AM +0200, Karol Herbst wrote:
> On Tue, Oct 1, 2019 at 10:47 AM Mika Westerberg
> <mika.westerberg@linux.intel.com> wrote:
> >
> > On Mon, Sep 30, 2019 at 06:36:12PM +0200, Karol Herbst wrote:
> > > On Mon, Sep 30, 2019 at 6:30 PM Mika Westerberg
> > > <mika.westerberg@linux.intel.com> wrote:
> > > >
> > > > On Mon, Sep 30, 2019 at 06:05:14PM +0200, Karol Herbst wrote:
> > > > > still happens with your patch applied. The machine simply gets shut down.
> > > > >
> > > > > dmesg can be found here:
> > > > > https://gist.githubusercontent.com/karolherbst/40eb091c7b7b33ef993525de660f1a3b/raw/2380e31f566e93e5ba7c87ef545420965d4c492c/gistfile1.txt
> > > >
> > > > Looking your dmesg:
> > > >
> > > > Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: DCB version 4.1
> > > > Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: MM: using COPY for buffer copies
> > > > Sep 30 17:24:27 kernel: [drm] Initialized nouveau 1.3.1 20120801 for 0000:01:00.0 on minor 1
> > > >
> > > > I would assume it runtime suspends here. Then it wakes up because of PCI
> > > > access from userspace:
> > > >
> > > > Sep 30 17:24:42 kernel: pci_raw_set_power_state: 56 callbacks suppressed
> > > >
> > > > and for some reason it does not get resumed properly. There are also few
> > > > warnings from ACPI that might be relevant:
> > > >
> > > > Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)
> > > > Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)
> > > >
> > >
> > > afaik this is the case for essentially every laptop out there.
> >
> > OK, so they are harmless?
> >
> 
> yes
> 
> > > > This seems to be Dell XPS 9560 which I think has been around some time
> > > > already so I wonder why we only see issues now. Has it ever worked for
> > > > you or maybe there is a regression that causes it to happen now?
> > >
> > > oh, it's broken since forever, we just tried to get more information
> > > from Nvidia if they know what this is all about, but we got nothing
> > > useful.
> > >
> > > We were also hoping to find a reliable fix or workaround we could have
> > > inside nouveau to fix that as I think nouveau is the only driver
> > > actually hit by this issue, but nothing turned out to be reliable
> > > enough.
> >
> > Can't you just block runtime PM from the nouveau driver until this is
> > understood better? That can be done by calling pm_runtime_forbid() (or
> > not calling pm_runtime_allow() in the driver). Or in case of PCI driver
> > you just don't decrease the reference count when probe() ends.
> >
> 
> the thing is, it does work for a lot of laptops. We could only observe
> this on kaby lake and skylake ones. Even on Cannon Lakes it seems to
> work just fine.

Can't you then limit it to those?

I've experienced that Kabylake root ports can enter and exit in D3cold
just fine because we do that for Thunderbolt for example. But that
always requires help from ACPI. If the system is using non-standard ACPI
methods for example that may require some tricks in the driver side.

> > I think that would be much better than blocking any devices behind
> > Kabylake PCIe root ports from entering D3 (I don't really think the
> > problem is in the root ports itself but there is something we are
> > missing when the NVIDIA GPU is put into D3cold or back from there).
> 
> I highly doubt there is anything wrong with the GPU alone as we have
> too many indications which tell us otherwise.
> 
> Anyway, at this point I don't know where to look further for what's
> actually wrong. And apparently it works on Windows, but I don't know
> why and I have no idea what Windows does on such systems to make it
> work reliably.

By works you mean that Windows is able to put it into D3cold and back?
If that's the case it may be that there is some ACPI magic that the
Windows driver does and we of course are missing in Linux.
Karol Herbst Oct. 1, 2019, 10 a.m. UTC | #12
On Tue, Oct 1, 2019 at 11:11 AM Mika Westerberg
<mika.westerberg@linux.intel.com> wrote:
>
> On Tue, Oct 01, 2019 at 10:56:39AM +0200, Karol Herbst wrote:
> > On Tue, Oct 1, 2019 at 10:47 AM Mika Westerberg
> > <mika.westerberg@linux.intel.com> wrote:
> > >
> > > On Mon, Sep 30, 2019 at 06:36:12PM +0200, Karol Herbst wrote:
> > > > On Mon, Sep 30, 2019 at 6:30 PM Mika Westerberg
> > > > <mika.westerberg@linux.intel.com> wrote:
> > > > >
> > > > > On Mon, Sep 30, 2019 at 06:05:14PM +0200, Karol Herbst wrote:
> > > > > > still happens with your patch applied. The machine simply gets shut down.
> > > > > >
> > > > > > dmesg can be found here:
> > > > > > https://gist.githubusercontent.com/karolherbst/40eb091c7b7b33ef993525de660f1a3b/raw/2380e31f566e93e5ba7c87ef545420965d4c492c/gistfile1.txt
> > > > >
> > > > > Looking your dmesg:
> > > > >
> > > > > Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: DCB version 4.1
> > > > > Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: MM: using COPY for buffer copies
> > > > > Sep 30 17:24:27 kernel: [drm] Initialized nouveau 1.3.1 20120801 for 0000:01:00.0 on minor 1
> > > > >
> > > > > I would assume it runtime suspends here. Then it wakes up because of PCI
> > > > > access from userspace:
> > > > >
> > > > > Sep 30 17:24:42 kernel: pci_raw_set_power_state: 56 callbacks suppressed
> > > > >
> > > > > and for some reason it does not get resumed properly. There are also few
> > > > > warnings from ACPI that might be relevant:
> > > > >
> > > > > Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)
> > > > > Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)
> > > > >
> > > >
> > > > afaik this is the case for essentially every laptop out there.
> > >
> > > OK, so they are harmless?
> > >
> >
> > yes
> >
> > > > > This seems to be Dell XPS 9560 which I think has been around some time
> > > > > already so I wonder why we only see issues now. Has it ever worked for
> > > > > you or maybe there is a regression that causes it to happen now?
> > > >
> > > > oh, it's broken since forever, we just tried to get more information
> > > > from Nvidia if they know what this is all about, but we got nothing
> > > > useful.
> > > >
> > > > We were also hoping to find a reliable fix or workaround we could have
> > > > inside nouveau to fix that as I think nouveau is the only driver
> > > > actually hit by this issue, but nothing turned out to be reliable
> > > > enough.
> > >
> > > Can't you just block runtime PM from the nouveau driver until this is
> > > understood better? That can be done by calling pm_runtime_forbid() (or
> > > not calling pm_runtime_allow() in the driver). Or in case of PCI driver
> > > you just don't decrease the reference count when probe() ends.
> > >
> >
> > the thing is, it does work for a lot of laptops. We could only observe
> > this on kaby lake and skylake ones. Even on Cannon Lakes it seems to
> > work just fine.
>
> Can't you then limit it to those?
>
> I've experienced that Kabylake root ports can enter and exit in D3cold
> just fine because we do that for Thunderbolt for example. But that
> always requires help from ACPI. If the system is using non-standard ACPI
> methods for example that may require some tricks in the driver side.
>

yeah.. I am not quite sure what's actually the root cause. I was also
trying to use the same PCI registers ACPI is using to trigger this
issue on a normal desktop, no luck. Using the same registers does
trigger the issue (hence the script).

The script is essentially just doing what ACPI does, just skipping a lot.

> > > I think that would be much better than blocking any devices behind
> > > Kabylake PCIe root ports from entering D3 (I don't really think the
> > > problem is in the root ports itself but there is something we are
> > > missing when the NVIDIA GPU is put into D3cold or back from there).
> >
> > I highly doubt there is anything wrong with the GPU alone as we have
> > too many indications which tell us otherwise.
> >
> > Anyway, at this point I don't know where to look further for what's
> > actually wrong. And apparently it works on Windows, but I don't know
> > why and I have no idea what Windows does on such systems to make it
> > work reliably.
>
> By works you mean that Windows is able to put it into D3cold and back?
> If that's the case it may be that there is some ACPI magic that the
> Windows driver does and we of course are missing in Linux.

Afaik that's the case. We were talking with Nvidia about it, but they
are not aware of any issues generally. (on Windows, nor the hardware).
No idea if we can trust their statements though.

But yeah, it might be that on Windows they still do _DSM calls or
something... but until today, Nvidia didn't provide any documentation
to us for that.
Bjorn Helgaas Oct. 1, 2019, 1:27 p.m. UTC | #13
On Mon, Sep 30, 2019 at 06:36:12PM +0200, Karol Herbst wrote:
> On Mon, Sep 30, 2019 at 6:30 PM Mika Westerberg
> <mika.westerberg@linux.intel.com> wrote:
> >
> > On Mon, Sep 30, 2019 at 06:05:14PM +0200, Karol Herbst wrote:
> > > still happens with your patch applied. The machine simply gets shut down.
> > >
> > > dmesg can be found here:
> > > https://gist.githubusercontent.com/karolherbst/40eb091c7b7b33ef993525de660f1a3b/raw/2380e31f566e93e5ba7c87ef545420965d4c492c/gistfile1.txt
> >
> > Looking your dmesg:
> >
> > Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: DCB version 4.1
> > Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: MM: using COPY for buffer copies
> > Sep 30 17:24:27 kernel: [drm] Initialized nouveau 1.3.1 20120801 for 0000:01:00.0 on minor 1
> >
> > I would assume it runtime suspends here. Then it wakes up because of PCI
> > access from userspace:
> >
> > Sep 30 17:24:42 kernel: pci_raw_set_power_state: 56 callbacks suppressed
> >
> > and for some reason it does not get resumed properly. There are also few
> > warnings from ACPI that might be relevant:
> >
> > Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)
> > Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)
> 
> afaik this is the case for essentially every laptop out there.

I think we should look into this a little bit.
acpi_ns_check_argument_types() checks the argument type and prints
this message, but AFAICT it doesn't actually fix anything or prevent
execution of the method, so I have no idea what happens when we
actually execute the _DSM.

If we execute this _DSM as part of power management, and the _DSM
doesn't work right, it would be no surprise that we have problems.

Maybe we could learn something by turning on ACPI_DB_PARSE output (see
Documentation/firmware-guide/acpi/debug.rst).

You must have an acpidump already from all your investigation.  Can
you put it somewhere, e.g., bugzilla.kernel.org, and include a URL?
Karol Herbst Oct. 1, 2019, 4:21 p.m. UTC | #14
On Tue, Oct 1, 2019 at 3:27 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Mon, Sep 30, 2019 at 06:36:12PM +0200, Karol Herbst wrote:
> > On Mon, Sep 30, 2019 at 6:30 PM Mika Westerberg
> > <mika.westerberg@linux.intel.com> wrote:
> > >
> > > On Mon, Sep 30, 2019 at 06:05:14PM +0200, Karol Herbst wrote:
> > > > still happens with your patch applied. The machine simply gets shut down.
> > > >
> > > > dmesg can be found here:
> > > > https://gist.githubusercontent.com/karolherbst/40eb091c7b7b33ef993525de660f1a3b/raw/2380e31f566e93e5ba7c87ef545420965d4c492c/gistfile1.txt
> > >
> > > Looking your dmesg:
> > >
> > > Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: DCB version 4.1
> > > Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: MM: using COPY for buffer copies
> > > Sep 30 17:24:27 kernel: [drm] Initialized nouveau 1.3.1 20120801 for 0000:01:00.0 on minor 1
> > >
> > > I would assume it runtime suspends here. Then it wakes up because of PCI
> > > access from userspace:
> > >
> > > Sep 30 17:24:42 kernel: pci_raw_set_power_state: 56 callbacks suppressed
> > >
> > > and for some reason it does not get resumed properly. There are also few
> > > warnings from ACPI that might be relevant:
> > >
> > > Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)
> > > Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)
> >
> > afaik this is the case for essentially every laptop out there.
>
> I think we should look into this a little bit.
> acpi_ns_check_argument_types() checks the argument type and prints
> this message, but AFAICT it doesn't actually fix anything or prevent
> execution of the method, so I have no idea what happens when we
> actually execute the _DSM.
>

I can assure you that this warning happens on every single laptop out
there with dual Nvidia graphics and it's essentially just a firmware
bug. And it never caused any issues on any of the older laptops (or
newest one for that matter).

> If we execute this _DSM as part of power management, and the _DSM
> doesn't work right, it would be no surprise that we have problems.
>
> Maybe we could learn something by turning on ACPI_DB_PARSE output (see
> Documentation/firmware-guide/acpi/debug.rst).
>
> You must have an acpidump already from all your investigation.  Can
> you put it somewhere, e.g., bugzilla.kernel.org, and include a URL?

Will do so later, right now I am traveling to XDC and will have more
time for that then.
Bjorn Helgaas Oct. 1, 2019, 7:34 p.m. UTC | #15
On Tue, Oct 01, 2019 at 06:21:28PM +0200, Karol Herbst wrote:
> On Tue, Oct 1, 2019 at 3:27 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Mon, Sep 30, 2019 at 06:36:12PM +0200, Karol Herbst wrote:
> > > On Mon, Sep 30, 2019 at 6:30 PM Mika Westerberg
> > > <mika.westerberg@linux.intel.com> wrote:
> > > >
> > > > On Mon, Sep 30, 2019 at 06:05:14PM +0200, Karol Herbst wrote:
> > > > > still happens with your patch applied. The machine simply gets shut down.
> > > > >
> > > > > dmesg can be found here:
> > > > > https://gist.githubusercontent.com/karolherbst/40eb091c7b7b33ef993525de660f1a3b/raw/2380e31f566e93e5ba7c87ef545420965d4c492c/gistfile1.txt
> > > >
> > > > Looking your dmesg:
> > > >
> > > > Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: DCB version 4.1
> > > > Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: MM: using COPY for buffer copies
> > > > Sep 30 17:24:27 kernel: [drm] Initialized nouveau 1.3.1 20120801 for 0000:01:00.0 on minor 1
> > > >
> > > > I would assume it runtime suspends here. Then it wakes up because of PCI
> > > > access from userspace:
> > > >
> > > > Sep 30 17:24:42 kernel: pci_raw_set_power_state: 56 callbacks suppressed
> > > >
> > > > and for some reason it does not get resumed properly. There are also few
> > > > warnings from ACPI that might be relevant:
> > > >
> > > > Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)
> > > > Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)
> > >
> > > afaik this is the case for essentially every laptop out there.
> >
> > I think we should look into this a little bit.
> > acpi_ns_check_argument_types() checks the argument type and prints
> > this message, but AFAICT it doesn't actually fix anything or prevent
> > execution of the method, so I have no idea what happens when we
> > actually execute the _DSM.
> 
> I can assure you that this warning happens on every single laptop out
> there with dual Nvidia graphics and it's essentially just a firmware
> bug. And it never caused any issues on any of the older laptops (or
> newest one for that matter).

Rafael, do you know anything about this?  If ACPI has some sort of
workaround so it can execute the method correctly anyway, maybe we
should remove or reword the warning?

Or if this does prevent execution of the method, maybe we need to add
a workaround since the problem is so prevalent in the field?

> > If we execute this _DSM as part of power management, and the _DSM
> > doesn't work right, it would be no surprise that we have problems.
> >
> > Maybe we could learn something by turning on ACPI_DB_PARSE output (see
> > Documentation/firmware-guide/acpi/debug.rst).
> >
> > You must have an acpidump already from all your investigation.  Can
> > you put it somewhere, e.g., bugzilla.kernel.org, and include a URL?
> 
> Will do so later, right now I am traveling to XDC and will have more
> time for that then.
Rafael J. Wysocki Oct. 2, 2019, 7:51 a.m. UTC | #16
On Tue, Oct 1, 2019 at 9:34 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Tue, Oct 01, 2019 at 06:21:28PM +0200, Karol Herbst wrote:
> > On Tue, Oct 1, 2019 at 3:27 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Mon, Sep 30, 2019 at 06:36:12PM +0200, Karol Herbst wrote:
> > > > On Mon, Sep 30, 2019 at 6:30 PM Mika Westerberg
> > > > <mika.westerberg@linux.intel.com> wrote:
> > > > >
> > > > > On Mon, Sep 30, 2019 at 06:05:14PM +0200, Karol Herbst wrote:
> > > > > > still happens with your patch applied. The machine simply gets shut down.
> > > > > >
> > > > > > dmesg can be found here:
> > > > > > https://gist.githubusercontent.com/karolherbst/40eb091c7b7b33ef993525de660f1a3b/raw/2380e31f566e93e5ba7c87ef545420965d4c492c/gistfile1.txt
> > > > >
> > > > > Looking your dmesg:
> > > > >
> > > > > Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: DCB version 4.1
> > > > > Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: MM: using COPY for buffer copies
> > > > > Sep 30 17:24:27 kernel: [drm] Initialized nouveau 1.3.1 20120801 for 0000:01:00.0 on minor 1
> > > > >
> > > > > I would assume it runtime suspends here. Then it wakes up because of PCI
> > > > > access from userspace:
> > > > >
> > > > > Sep 30 17:24:42 kernel: pci_raw_set_power_state: 56 callbacks suppressed
> > > > >
> > > > > and for some reason it does not get resumed properly. There are also few
> > > > > warnings from ACPI that might be relevant:
> > > > >
> > > > > Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)
> > > > > Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)
> > > >
> > > > afaik this is the case for essentially every laptop out there.
> > >
> > > I think we should look into this a little bit.
> > > acpi_ns_check_argument_types() checks the argument type and prints
> > > this message, but AFAICT it doesn't actually fix anything or prevent
> > > execution of the method, so I have no idea what happens when we
> > > actually execute the _DSM.
> >
> > I can assure you that this warning happens on every single laptop out
> > there with dual Nvidia graphics and it's essentially just a firmware
> > bug. And it never caused any issues on any of the older laptops (or
> > newest one for that matter).
>
> Rafael, do you know anything about this?

IIRC ACPICA will simply run the method with the assumption that the
AML in there will deal with the arguments properly anyway.

> If ACPI has some sort of workaround so it can execute the method correctly
> anyway, maybe we should remove or reword the warning?

I can agree that printing these warnings on a user system by default
is not very useful, at least as long as no visible functional issues
are present, but if there are such issues, it is good to know that
something fishy is going on.  For instance, while the method may
execute successfully, the result of that may not be as expected.

So maybe they should be debug level or similar.

> Or if this does prevent execution of the method, maybe we need to add
> a workaround since the problem is so prevalent in the field?

As par the above, no workarounds should be needed, but I'll let Bob
and Erik (CCed now) confirm or deny this.

A side note: please CC all discussions regarding general ACPI issues
to linux-acpi, so they can get the attention of all of the right
people (who may not subscribe linux-pci, for example).
Rafael J. Wysocki Oct. 3, 2019, 9:47 a.m. UTC | #17
On Tuesday, October 1, 2019 12:00:50 PM CEST Karol Herbst wrote:
> On Tue, Oct 1, 2019 at 11:11 AM Mika Westerberg
> <mika.westerberg@linux.intel.com> wrote:
> >
> > On Tue, Oct 01, 2019 at 10:56:39AM +0200, Karol Herbst wrote:
> > > On Tue, Oct 1, 2019 at 10:47 AM Mika Westerberg
> > > <mika.westerberg@linux.intel.com> wrote:
> > > >
> > > > On Mon, Sep 30, 2019 at 06:36:12PM +0200, Karol Herbst wrote:
> > > > > On Mon, Sep 30, 2019 at 6:30 PM Mika Westerberg
> > > > > <mika.westerberg@linux.intel.com> wrote:
> > > > > >
> > > > > > On Mon, Sep 30, 2019 at 06:05:14PM +0200, Karol Herbst wrote:
> > > > > > > still happens with your patch applied. The machine simply gets shut down.
> > > > > > >
> > > > > > > dmesg can be found here:
> > > > > > > https://gist.githubusercontent.com/karolherbst/40eb091c7b7b33ef993525de660f1a3b/raw/2380e31f566e93e5ba7c87ef545420965d4c492c/gistfile1.txt
> > > > > >
> > > > > > Looking your dmesg:
> > > > > >
> > > > > > Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: DCB version 4.1
> > > > > > Sep 30 17:24:27 kernel: nouveau 0000:01:00.0: DRM: MM: using COPY for buffer copies
> > > > > > Sep 30 17:24:27 kernel: [drm] Initialized nouveau 1.3.1 20120801 for 0000:01:00.0 on minor 1
> > > > > >
> > > > > > I would assume it runtime suspends here. Then it wakes up because of PCI
> > > > > > access from userspace:
> > > > > >
> > > > > > Sep 30 17:24:42 kernel: pci_raw_set_power_state: 56 callbacks suppressed
> > > > > >
> > > > > > and for some reason it does not get resumed properly. There are also few
> > > > > > warnings from ACPI that might be relevant:
> > > > > >
> > > > > > Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)
> > > > > > Sep 30 17:24:27 kernel: ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190509/nsarguments-59)
> > > > > >
> > > > >
> > > > > afaik this is the case for essentially every laptop out there.
> > > >
> > > > OK, so they are harmless?
> > > >
> > >
> > > yes
> > >
> > > > > > This seems to be Dell XPS 9560 which I think has been around some time
> > > > > > already so I wonder why we only see issues now. Has it ever worked for
> > > > > > you or maybe there is a regression that causes it to happen now?
> > > > >
> > > > > oh, it's broken since forever, we just tried to get more information
> > > > > from Nvidia if they know what this is all about, but we got nothing
> > > > > useful.
> > > > >
> > > > > We were also hoping to find a reliable fix or workaround we could have
> > > > > inside nouveau to fix that as I think nouveau is the only driver
> > > > > actually hit by this issue, but nothing turned out to be reliable
> > > > > enough.
> > > >
> > > > Can't you just block runtime PM from the nouveau driver until this is
> > > > understood better? That can be done by calling pm_runtime_forbid() (or
> > > > not calling pm_runtime_allow() in the driver). Or in case of PCI driver
> > > > you just don't decrease the reference count when probe() ends.
> > > >
> > >
> > > the thing is, it does work for a lot of laptops. We could only observe
> > > this on kaby lake and skylake ones. Even on Cannon Lakes it seems to
> > > work just fine.
> >
> > Can't you then limit it to those?
> >
> > I've experienced that Kabylake root ports can enter and exit in D3cold
> > just fine because we do that for Thunderbolt for example. But that
> > always requires help from ACPI. If the system is using non-standard ACPI
> > methods for example that may require some tricks in the driver side.
> >
> 
> yeah.. I am not quite sure what's actually the root cause. I was also
> trying to use the same PCI registers ACPI is using to trigger this
> issue on a normal desktop, no luck. Using the same registers does
> trigger the issue (hence the script).
> 
> The script is essentially just doing what ACPI does, just skipping a lot.
> 
> > > > I think that would be much better than blocking any devices behind
> > > > Kabylake PCIe root ports from entering D3 (I don't really think the
> > > > problem is in the root ports itself but there is something we are
> > > > missing when the NVIDIA GPU is put into D3cold or back from there).
> > >
> > > I highly doubt there is anything wrong with the GPU alone as we have
> > > too many indications which tell us otherwise.
> > >
> > > Anyway, at this point I don't know where to look further for what's
> > > actually wrong. And apparently it works on Windows, but I don't know
> > > why and I have no idea what Windows does on such systems to make it
> > > work reliably.
> >
> > By works you mean that Windows is able to put it into D3cold and back?
> > If that's the case it may be that there is some ACPI magic that the
> > Windows driver does and we of course are missing in Linux.
> 
> Afaik that's the case. We were talking with Nvidia about it, but they
> are not aware of any issues generally. (on Windows, nor the hardware).
> No idea if we can trust their statements though.
> 
> But yeah, it might be that on Windows they still do _DSM calls or
> something... but until today, Nvidia didn't provide any documentation
> to us for that.

So IMO in that case the right approach is to quirk the combinations of
GPU/root complex that are known problematic.

Quirking the root complex alone is likely to affect working configurations
which generally should be avoided.
diff mbox series

Patch

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 088fcdc8d2b4..9dbd29ced1ac 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -799,6 +799,42 @@  static inline bool platform_pci_bridge_d3(struct pci_dev *dev)
 	return pci_platform_pm ? pci_platform_pm->bridge_d3(dev) : false;
 }
 
+/*
+ * some intel bridges cause serious issues with runpm if the client device
+ * is put into D1/D2/D3hot before putting the client into D3cold via
+ * platform means (generally ACPI).
+ *
+ * skipping this makes runpm work perfectly fine on such devices.
+ *
+ * As far as we know only skylake and kaby lake SoCs are affected.
+ */
+static unsigned short intel_broken_d3_bridges[] = {
+	/* kbl */
+	0x1901,
+};
+
+static inline bool intel_broken_pci_pm(struct pci_bus *bus)
+{
+	struct pci_dev *bridge;
+	int i;
+
+	if (!bus || !bus->self)
+		return false;
+
+	bridge = bus->self;
+	if (bridge->vendor != PCI_VENDOR_ID_INTEL)
+		return false;
+
+	for (i = 0; i < ARRAY_SIZE(intel_broken_d3_bridges); i++) {
+		if (bridge->device == intel_broken_d3_bridges[i]) {
+			pci_err(bridge, "found broken intel bridge\n");
+			return true;
+		}
+	}
+
+	return false;
+}
+
 /**
  * pci_raw_set_power_state - Use PCI PM registers to set the power state of
  *			     given PCI device
@@ -827,6 +863,9 @@  static int pci_raw_set_power_state(struct pci_dev *dev, pci_power_t state)
 	if (state < PCI_D0 || state > PCI_D3hot)
 		return -EINVAL;
 
+	if (state != PCI_D0 && intel_broken_pci_pm(dev->bus))
+		return 0;
+
 	/*
 	 * Validate current state:
 	 * Can enter D0 from any state, but if we can only go deeper