diff mbox

ACPI: don't show an error when we're not in charge of PCIe hotplug.

Message ID 1466028904-28328-1-git-send-email-pjones@redhat.com (mailing list archive)
State Rejected, archived
Headers show

Commit Message

Peter Jones June 15, 2016, 10:15 p.m. UTC
Right now when booting, on many laptops the firmware manages the PCIe
bus.  As a result, when we call the _OSC ACPI method, it returns an
error code.  Unfortunately the errors are not very articulate.  As a
result, we show:

ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-fe])
acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
\_SB_.PCI0 (33DB4D5B-1FF7-401C-9657-7441C03DD766): _OSC invalid UUID
_OSC request data: 1 1f 0
acpi PNP0A08:00: _OSC failed (AE_ERROR); disabling ASPM

But we did get the capabilities mask back; the firmware is merely
managing this itself.  So we really should not be showing the user a
message that looks like the firmware is broken, since it is working just
fine.

This patch supresses the error message when we're calling _OSC with the
PCIe host bridge UUID, and replaces it with a relatively innocuous
looking message that you can find if you're looking for it.
---
 drivers/acpi/bus.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

Comments

Rafael J. Wysocki June 16, 2016, 12:12 a.m. UTC | #1
On Thu, Jun 16, 2016 at 12:15 AM, Peter Jones <pjones@redhat.com> wrote:
> Right now when booting, on many laptops the firmware manages the PCIe
> bus.  As a result, when we call the _OSC ACPI method, it returns an
> error code.  Unfortunately the errors are not very articulate.

What exactly do you mean here?

>  As a result, we show:
>
> ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-fe])
> acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
> \_SB_.PCI0 (33DB4D5B-1FF7-401C-9657-7441C03DD766): _OSC invalid UUID
> _OSC request data: 1 1f 0

So _OSC told us that the UUID was invalid, didn't it?

According to the spec:

"Bit [2] – Unrecognized UUID. This bit is set to indicate that the
platform firmware does not
recognize the UUID passed in via Arg0. Capabilities bits are preserved."

> acpi PNP0A08:00: _OSC failed (AE_ERROR); disabling ASPM
>
> But we did get the capabilities mask back;

Right.  As per the spec quoted above.

> the firmware is merely managing this itself.

How can we know that?

> So we really should not be showing the user a
> message that looks like the firmware is broken, since it is working just
> fine.

Or is it?

> This patch supresses the error message when we're calling _OSC with the
> PCIe host bridge UUID, and replaces it with a relatively innocuous
> looking message that you can find if you're looking for it.

So what happens is that the firmware manages the PCIe stuff itself,
but instead of telling that to us upfront by clearing the mask, it
simply says that the UUID is unknown, because ASL writers can't be
bothered with handling that case correctly, right?

Thanks,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki June 16, 2016, 2:56 p.m. UTC | #2
On Thu, Jun 16, 2016 at 2:12 AM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Thu, Jun 16, 2016 at 12:15 AM, Peter Jones <pjones@redhat.com> wrote:
>> Right now when booting, on many laptops the firmware manages the PCIe
>> bus.  As a result, when we call the _OSC ACPI method, it returns an
>> error code.  Unfortunately the errors are not very articulate.
>
> What exactly do you mean here?
>
>>  As a result, we show:
>>
>> ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-fe])
>> acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
>> \_SB_.PCI0 (33DB4D5B-1FF7-401C-9657-7441C03DD766): _OSC invalid UUID
>> _OSC request data: 1 1f 0
>
> So _OSC told us that the UUID was invalid, didn't it?

BTW, the above messages are KERN_DEBUG, so at least in theory they
shouldn't be visible in production runs.

Maybe the bug to fix is that they show up when they aren't supposed to?
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Jones June 21, 2016, 3:18 p.m. UTC | #3
(Sorry for the slow response - it's deadline time over here.)

On Thu, Jun 16, 2016 at 04:56:57PM +0200, Rafael J. Wysocki wrote:
> On Thu, Jun 16, 2016 at 2:12 AM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > On Thu, Jun 16, 2016 at 12:15 AM, Peter Jones <pjones@redhat.com> wrote:
> >> Right now when booting, on many laptops the firmware manages the PCIe
> >> bus.  As a result, when we call the _OSC ACPI method, it returns an
> >> error code.  Unfortunately the errors are not very articulate.
> >
> > What exactly do you mean here?
> >
> >>  As a result, we show:
> >>
> >> ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-fe])
> >> acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
> >> \_SB_.PCI0 (33DB4D5B-1FF7-401C-9657-7441C03DD766): _OSC invalid UUID
> >> _OSC request data: 1 1f 0
> >
> > So _OSC told us that the UUID was invalid, didn't it?
> 
> BTW, the above messages are KERN_DEBUG, so at least in theory they
> shouldn't be visible in production runs.
> 
> Maybe the bug to fix is that they show up when they aren't supposed to?

No - the workflow that I am really trying to remedy is this:

 1) S3 resume sometimes isn't working on some laptop you've got.
 2) start looking at debug messages
 3) this shows an error, so it looks like it's probably the problem
 4) go fishing for red herring
 5) if you happen to know who maintains the DSDT for the platform in
    question, eventually work out that this is working as intended and
    the bug is someplace else.
5b) if you don't know that person, eventually work out that it /might/
     be someplace else...

So the idea was to make it look more like an indication of status, and
less like an error that's causing unrelated problems.

When I talked to Mario at Dell (Cc'd), it wasn't clear to us that
there's a way to distinguish the between the UUID being
invalid/malformed, being merely unsupported, or being supported in some
configurations but not the current one.  In this particular DSDT, the
machine doesn't support the OS controlling any of this if USB-C /
thunderbolt are enabled.  The DSDT is clearly written with the belief
that you have to completely disable the handling for that UUID in this
case, and googling for this looks like it's not the only one written
with that belief.

Reading the spec (v6.1, sections 6.2.11.3 and 6.2.11.4), it seems
plausible that you can express this instead by handling the UUID but
choosing each individual query/status bit in the way that accomplishes
the OS doing nothing with the response.  So it may well be that that's
just more code that vendors have thought wasn't necessary (or wasn't
correct for some reason.)

Mario, want to jump in on your thinking here?
Mario Limonciello June 21, 2016, 6:01 p.m. UTC | #4
> -----Original Message-----

> From: Peter Jones [mailto:pjones@redhat.com]

> Sent: Tuesday, June 21, 2016 10:19 AM

> To: Rafael J. Wysocki <rafael@kernel.org>

> Cc: ACPI Devel Maling List <linux-acpi@vger.kernel.org>; Limonciello, Mario

> <Mario_Limonciello@Dell.com>; Linux Kernel Mailing List <linux-

> kernel@vger.kernel.org>; Len Brown <lenb@kernel.org>; Rafael J . Wysocki

> <rjw@rjwysocki.net>; Andy Lutomirski <luto@amacapital.net>

> Subject: Re: [PATCH] ACPI: don't show an error when we're not in charge of

> PCIe hotplug.

> 

> (Sorry for the slow response - it's deadline time over here.)

> 

> On Thu, Jun 16, 2016 at 04:56:57PM +0200, Rafael J. Wysocki wrote:

> > On Thu, Jun 16, 2016 at 2:12 AM, Rafael J. Wysocki <rafael@kernel.org>

> wrote:

> > > On Thu, Jun 16, 2016 at 12:15 AM, Peter Jones <pjones@redhat.com>

> wrote:

> > >> Right now when booting, on many laptops the firmware manages the

> PCIe

> > >> bus.  As a result, when we call the _OSC ACPI method, it returns an

> > >> error code.  Unfortunately the errors are not very articulate.

> > >

> > > What exactly do you mean here?

> > >

> > >>  As a result, we show:

> > >>

> > >> ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-fe])

> > >> acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM

> Segments MSI]

> > >> \_SB_.PCI0 (33DB4D5B-1FF7-401C-9657-7441C03DD766): _OSC invalid

> UUID

> > >> _OSC request data: 1 1f 0

> > >

> > > So _OSC told us that the UUID was invalid, didn't it?

> >

> > BTW, the above messages are KERN_DEBUG, so at least in theory they

> > shouldn't be visible in production runs.

> >

> > Maybe the bug to fix is that they show up when they aren't supposed to?

> 

> No - the workflow that I am really trying to remedy is this:

> 

>  1) S3 resume sometimes isn't working on some laptop you've got.

>  2) start looking at debug messages

>  3) this shows an error, so it looks like it's probably the problem

>  4) go fishing for red herring

>  5) if you happen to know who maintains the DSDT for the platform in

>     question, eventually work out that this is working as intended and

>     the bug is someplace else.

> 5b) if you don't know that person, eventually work out that it /might/

>      be someplace else...

> 

> So the idea was to make it look more like an indication of status, and

> less like an error that's causing unrelated problems.

> 

> When I talked to Mario at Dell (Cc'd), it wasn't clear to us that

> there's a way to distinguish the between the UUID being

> invalid/malformed, being merely unsupported, or being supported in some

> configurations but not the current one.  In this particular DSDT, the

> machine doesn't support the OS controlling any of this if USB-C /

> thunderbolt are enabled.  The DSDT is clearly written with the belief

> that you have to completely disable the handling for that UUID in this

> case, and googling for this looks like it's not the only one written

> with that belief.

> 

> Reading the spec (v6.1, sections 6.2.11.3 and 6.2.11.4), it seems

> plausible that you can express this instead by handling the UUID but

> choosing each individual query/status bit in the way that accomplishes

> the OS doing nothing with the response.  So it may well be that that's

> just more code that vendors have thought wasn't necessary (or wasn't

> correct for some reason.)

> 

> Mario, want to jump in on your thinking here?

> 

> --

>   Peter


After talking to the team, I was told this particular implementation to not let
OS take control when acting on that specific UUID based upon a variable 
(NEXP in this case) came from Intel RC code.  

That's probably why this is all across a lot of platforms, including non-Dell.

At least in the context of the laptop Peter noticed this on (Dell XPS 13 9350)
NEXP is set in GNVS based upon Thunderbolt capability.

As for why they return unrecognized UUID instead of just masking all the
capabilities bits?  It's the same net functional result.  If the vendor provided
RC code doesn't caused WCHK problems or functional problems it's hard to 
make a case for why it needs to be changed by the OEM.

I think that Peter's patch is appropriate to message this is specifically
what's going on.
Andy Lutomirski June 21, 2016, 6:07 p.m. UTC | #5
On Tue, Jun 21, 2016 at 11:01 AM,  <Mario_Limonciello@dell.com> wrote:
>> -----Original Message-----
>> From: Peter Jones [mailto:pjones@redhat.com]
>> Sent: Tuesday, June 21, 2016 10:19 AM
>> To: Rafael J. Wysocki <rafael@kernel.org>
>> Cc: ACPI Devel Maling List <linux-acpi@vger.kernel.org>; Limonciello, Mario
>> <Mario_Limonciello@Dell.com>; Linux Kernel Mailing List <linux-
>> kernel@vger.kernel.org>; Len Brown <lenb@kernel.org>; Rafael J . Wysocki
>> <rjw@rjwysocki.net>; Andy Lutomirski <luto@amacapital.net>
>> Subject: Re: [PATCH] ACPI: don't show an error when we're not in charge of
>> PCIe hotplug.
>>
>> (Sorry for the slow response - it's deadline time over here.)
>>
>> On Thu, Jun 16, 2016 at 04:56:57PM +0200, Rafael J. Wysocki wrote:
>> > On Thu, Jun 16, 2016 at 2:12 AM, Rafael J. Wysocki <rafael@kernel.org>
>> wrote:
>> > > On Thu, Jun 16, 2016 at 12:15 AM, Peter Jones <pjones@redhat.com>
>> wrote:
>> > >> Right now when booting, on many laptops the firmware manages the
>> PCIe
>> > >> bus.  As a result, when we call the _OSC ACPI method, it returns an
>> > >> error code.  Unfortunately the errors are not very articulate.
>> > >
>> > > What exactly do you mean here?
>> > >
>> > >>  As a result, we show:
>> > >>
>> > >> ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-fe])
>> > >> acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM
>> Segments MSI]
>> > >> \_SB_.PCI0 (33DB4D5B-1FF7-401C-9657-7441C03DD766): _OSC invalid
>> UUID
>> > >> _OSC request data: 1 1f 0
>> > >
>> > > So _OSC told us that the UUID was invalid, didn't it?
>> >
>> > BTW, the above messages are KERN_DEBUG, so at least in theory they
>> > shouldn't be visible in production runs.
>> >
>> > Maybe the bug to fix is that they show up when they aren't supposed to?
>>
>> No - the workflow that I am really trying to remedy is this:
>>
>>  1) S3 resume sometimes isn't working on some laptop you've got.
>>  2) start looking at debug messages
>>  3) this shows an error, so it looks like it's probably the problem
>>  4) go fishing for red herring
>>  5) if you happen to know who maintains the DSDT for the platform in
>>     question, eventually work out that this is working as intended and
>>     the bug is someplace else.
>> 5b) if you don't know that person, eventually work out that it /might/
>>      be someplace else...
>>
>> So the idea was to make it look more like an indication of status, and
>> less like an error that's causing unrelated problems.
>>
>> When I talked to Mario at Dell (Cc'd), it wasn't clear to us that
>> there's a way to distinguish the between the UUID being
>> invalid/malformed, being merely unsupported, or being supported in some
>> configurations but not the current one.  In this particular DSDT, the
>> machine doesn't support the OS controlling any of this if USB-C /
>> thunderbolt are enabled.  The DSDT is clearly written with the belief
>> that you have to completely disable the handling for that UUID in this
>> case, and googling for this looks like it's not the only one written
>> with that belief.
>>
>> Reading the spec (v6.1, sections 6.2.11.3 and 6.2.11.4), it seems
>> plausible that you can express this instead by handling the UUID but
>> choosing each individual query/status bit in the way that accomplishes
>> the OS doing nothing with the response.  So it may well be that that's
>> just more code that vendors have thought wasn't necessary (or wasn't
>> correct for some reason.)
>>
>> Mario, want to jump in on your thinking here?
>>
>> --
>>   Peter
>
> After talking to the team, I was told this particular implementation to not let
> OS take control when acting on that specific UUID based upon a variable
> (NEXP in this case) came from Intel RC code.
>
> That's probably why this is all across a lot of platforms, including non-Dell.
>
> At least in the context of the laptop Peter noticed this on (Dell XPS 13 9350)
> NEXP is set in GNVS based upon Thunderbolt capability.
>
> As for why they return unrecognized UUID instead of just masking all the
> capabilities bits?  It's the same net functional result.  If the vendor provided
> RC code doesn't caused WCHK problems or functional problems it's hard to
> make a case for why it needs to be changed by the OEM.
>
> I think that Peter's patch is appropriate to message this is specifically
> what's going on.
>

In general, it seems that Windows is considerably less inclined to
throw out errors when the DSDT contains spec violations.  It would be
great if Microsoft were to tighten up their requirements :)

I wonder if Intel could be persuaded to add an ACPI mechanism to grant
PCIe native control per device or per bus instead of more-or-less
globally as it is now.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki June 21, 2016, 10:51 p.m. UTC | #6
On Tue, Jun 21, 2016 at 8:01 PM,  <Mario_Limonciello@dell.com> wrote:
>> -----Original Message-----
>> From: Peter Jones [mailto:pjones@redhat.com]
>> Sent: Tuesday, June 21, 2016 10:19 AM
>> To: Rafael J. Wysocki <rafael@kernel.org>
>> Cc: ACPI Devel Maling List <linux-acpi@vger.kernel.org>; Limonciello, Mario
>> <Mario_Limonciello@Dell.com>; Linux Kernel Mailing List <linux-
>> kernel@vger.kernel.org>; Len Brown <lenb@kernel.org>; Rafael J . Wysocki
>> <rjw@rjwysocki.net>; Andy Lutomirski <luto@amacapital.net>
>> Subject: Re: [PATCH] ACPI: don't show an error when we're not in charge of
>> PCIe hotplug.
>>
>> (Sorry for the slow response - it's deadline time over here.)
>>
>> On Thu, Jun 16, 2016 at 04:56:57PM +0200, Rafael J. Wysocki wrote:
>> > On Thu, Jun 16, 2016 at 2:12 AM, Rafael J. Wysocki <rafael@kernel.org>
>> wrote:
>> > > On Thu, Jun 16, 2016 at 12:15 AM, Peter Jones <pjones@redhat.com>
>> wrote:
>> > >> Right now when booting, on many laptops the firmware manages the
>> PCIe
>> > >> bus.  As a result, when we call the _OSC ACPI method, it returns an
>> > >> error code.  Unfortunately the errors are not very articulate.
>> > >
>> > > What exactly do you mean here?
>> > >
>> > >>  As a result, we show:
>> > >>
>> > >> ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-fe])
>> > >> acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM
>> Segments MSI]
>> > >> \_SB_.PCI0 (33DB4D5B-1FF7-401C-9657-7441C03DD766): _OSC invalid
>> UUID
>> > >> _OSC request data: 1 1f 0
>> > >
>> > > So _OSC told us that the UUID was invalid, didn't it?
>> >
>> > BTW, the above messages are KERN_DEBUG, so at least in theory they
>> > shouldn't be visible in production runs.
>> >
>> > Maybe the bug to fix is that they show up when they aren't supposed to?
>>
>> No - the workflow that I am really trying to remedy is this:
>>
>>  1) S3 resume sometimes isn't working on some laptop you've got.
>>  2) start looking at debug messages
>>  3) this shows an error, so it looks like it's probably the problem
>>  4) go fishing for red herring
>>  5) if you happen to know who maintains the DSDT for the platform in
>>     question, eventually work out that this is working as intended and
>>     the bug is someplace else.
>> 5b) if you don't know that person, eventually work out that it /might/
>>      be someplace else...
>>
>> So the idea was to make it look more like an indication of status, and
>> less like an error that's causing unrelated problems.
>>
>> When I talked to Mario at Dell (Cc'd), it wasn't clear to us that
>> there's a way to distinguish the between the UUID being
>> invalid/malformed, being merely unsupported, or being supported in some
>> configurations but not the current one.  In this particular DSDT, the
>> machine doesn't support the OS controlling any of this if USB-C /
>> thunderbolt are enabled.  The DSDT is clearly written with the belief
>> that you have to completely disable the handling for that UUID in this
>> case, and googling for this looks like it's not the only one written
>> with that belief.
>>
>> Reading the spec (v6.1, sections 6.2.11.3 and 6.2.11.4), it seems
>> plausible that you can express this instead by handling the UUID but
>> choosing each individual query/status bit in the way that accomplishes
>> the OS doing nothing with the response.  So it may well be that that's
>> just more code that vendors have thought wasn't necessary (or wasn't
>> correct for some reason.)
>>
>> Mario, want to jump in on your thinking here?
>>
>> --
>>   Peter
>
> After talking to the team, I was told this particular implementation to not let
> OS take control when acting on that specific UUID based upon a variable
> (NEXP in this case) came from Intel RC code.
>
> That's probably why this is all across a lot of platforms, including non-Dell.
>
> At least in the context of the laptop Peter noticed this on (Dell XPS 13 9350)
> NEXP is set in GNVS based upon Thunderbolt capability.
>
> As for why they return unrecognized UUID instead of just masking all the
> capabilities bits?  It's the same net functional result.  If the vendor provided
> RC code doesn't caused WCHK problems or functional problems it's hard to
> make a case for why it needs to be changed by the OEM.
>
> I think that Peter's patch is appropriate to message this is specifically
> what's going on.

No, it may hide real (ie. non-intentional) bugs in _OSC, so it is not
appropriate.

Debug-level messages really should not hurt anyone (and should never
show up in production anyway).

We can slightly tone down the "_OSC failed (%s); disabling ASPM\n"
message in negotiate_os_control() in drivers/acpi/pci_root.c if you
think it's too strong and that's it.

Thanks,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki June 21, 2016, 10:54 p.m. UTC | #7
On Tue, Jun 21, 2016 at 8:07 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Tue, Jun 21, 2016 at 11:01 AM,  <Mario_Limonciello@dell.com> wrote:
>>> -----Original Message-----
>>> From: Peter Jones [mailto:pjones@redhat.com]
>>> Sent: Tuesday, June 21, 2016 10:19 AM
>>> To: Rafael J. Wysocki <rafael@kernel.org>
>>> Cc: ACPI Devel Maling List <linux-acpi@vger.kernel.org>; Limonciello, Mario
>>> <Mario_Limonciello@Dell.com>; Linux Kernel Mailing List <linux-
>>> kernel@vger.kernel.org>; Len Brown <lenb@kernel.org>; Rafael J . Wysocki
>>> <rjw@rjwysocki.net>; Andy Lutomirski <luto@amacapital.net>
>>> Subject: Re: [PATCH] ACPI: don't show an error when we're not in charge of
>>> PCIe hotplug.
>>>
>>> (Sorry for the slow response - it's deadline time over here.)
>>>
>>> On Thu, Jun 16, 2016 at 04:56:57PM +0200, Rafael J. Wysocki wrote:
>>> > On Thu, Jun 16, 2016 at 2:12 AM, Rafael J. Wysocki <rafael@kernel.org>
>>> wrote:
>>> > > On Thu, Jun 16, 2016 at 12:15 AM, Peter Jones <pjones@redhat.com>
>>> wrote:
>>> > >> Right now when booting, on many laptops the firmware manages the
>>> PCIe
>>> > >> bus.  As a result, when we call the _OSC ACPI method, it returns an
>>> > >> error code.  Unfortunately the errors are not very articulate.
>>> > >
>>> > > What exactly do you mean here?
>>> > >
>>> > >>  As a result, we show:
>>> > >>
>>> > >> ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-fe])
>>> > >> acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM
>>> Segments MSI]
>>> > >> \_SB_.PCI0 (33DB4D5B-1FF7-401C-9657-7441C03DD766): _OSC invalid
>>> UUID
>>> > >> _OSC request data: 1 1f 0
>>> > >
>>> > > So _OSC told us that the UUID was invalid, didn't it?
>>> >
>>> > BTW, the above messages are KERN_DEBUG, so at least in theory they
>>> > shouldn't be visible in production runs.
>>> >
>>> > Maybe the bug to fix is that they show up when they aren't supposed to?
>>>
>>> No - the workflow that I am really trying to remedy is this:
>>>
>>>  1) S3 resume sometimes isn't working on some laptop you've got.
>>>  2) start looking at debug messages
>>>  3) this shows an error, so it looks like it's probably the problem
>>>  4) go fishing for red herring
>>>  5) if you happen to know who maintains the DSDT for the platform in
>>>     question, eventually work out that this is working as intended and
>>>     the bug is someplace else.
>>> 5b) if you don't know that person, eventually work out that it /might/
>>>      be someplace else...
>>>
>>> So the idea was to make it look more like an indication of status, and
>>> less like an error that's causing unrelated problems.
>>>
>>> When I talked to Mario at Dell (Cc'd), it wasn't clear to us that
>>> there's a way to distinguish the between the UUID being
>>> invalid/malformed, being merely unsupported, or being supported in some
>>> configurations but not the current one.  In this particular DSDT, the
>>> machine doesn't support the OS controlling any of this if USB-C /
>>> thunderbolt are enabled.  The DSDT is clearly written with the belief
>>> that you have to completely disable the handling for that UUID in this
>>> case, and googling for this looks like it's not the only one written
>>> with that belief.
>>>
>>> Reading the spec (v6.1, sections 6.2.11.3 and 6.2.11.4), it seems
>>> plausible that you can express this instead by handling the UUID but
>>> choosing each individual query/status bit in the way that accomplishes
>>> the OS doing nothing with the response.  So it may well be that that's
>>> just more code that vendors have thought wasn't necessary (or wasn't
>>> correct for some reason.)
>>>
>>> Mario, want to jump in on your thinking here?
>>>
>>> --
>>>   Peter
>>
>> After talking to the team, I was told this particular implementation to not let
>> OS take control when acting on that specific UUID based upon a variable
>> (NEXP in this case) came from Intel RC code.
>>
>> That's probably why this is all across a lot of platforms, including non-Dell.
>>
>> At least in the context of the laptop Peter noticed this on (Dell XPS 13 9350)
>> NEXP is set in GNVS based upon Thunderbolt capability.
>>
>> As for why they return unrecognized UUID instead of just masking all the
>> capabilities bits?  It's the same net functional result.  If the vendor provided
>> RC code doesn't caused WCHK problems or functional problems it's hard to
>> make a case for why it needs to be changed by the OEM.
>>
>> I think that Peter's patch is appropriate to message this is specifically
>> what's going on.
>>
>
> In general, it seems that Windows is considerably less inclined to
> throw out errors when the DSDT contains spec violations.  It would be
> great if Microsoft were to tighten up their requirements :)

Agreed (but I speak for myself only as usual).

> I wonder if Intel could be persuaded to add an ACPI mechanism to grant
> PCIe native control per device or per bus instead of more-or-less
> globally as it is now.

I guess at least in some cases it just has to be done that way.

Thanks,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mario Limonciello June 22, 2016, 7:43 p.m. UTC | #8
> -----Original Message-----

> From: rjwysocki@gmail.com [mailto:rjwysocki@gmail.com] On Behalf Of

> Rafael J. Wysocki

> Sent: Tuesday, June 21, 2016 5:51 PM

> To: Limonciello, Mario <Mario_Limonciello@Dell.com>

> Cc: Peter Jones <pjones@redhat.com>; Rafael J. Wysocki

> <rafael@kernel.org>; ACPI Devel Maling List <linux-acpi@vger.kernel.org>;

> Linux Kernel Mailing List <linux-kernel@vger.kernel.org>; Len Brown

> <lenb@kernel.org>; Rafael J. Wysocki <rjw@rjwysocki.net>; Andy Lutomirski

> <luto@amacapital.net>

> Subject: Re: [PATCH] ACPI: don't show an error when we're not in charge of

> PCIe hotplug.

> 

> On Tue, Jun 21, 2016 at 8:01 PM,  <Mario_Limonciello@dell.com> wrote:

> >> -----Original Message-----

> >> From: Peter Jones [mailto:pjones@redhat.com]

> >> Sent: Tuesday, June 21, 2016 10:19 AM

> >> To: Rafael J. Wysocki <rafael@kernel.org>

> >> Cc: ACPI Devel Maling List <linux-acpi@vger.kernel.org>; Limonciello,

> Mario

> >> <Mario_Limonciello@Dell.com>; Linux Kernel Mailing List <linux-

> >> kernel@vger.kernel.org>; Len Brown <lenb@kernel.org>; Rafael J .

> Wysocki

> >> <rjw@rjwysocki.net>; Andy Lutomirski <luto@amacapital.net>

> >> Subject: Re: [PATCH] ACPI: don't show an error when we're not in charge

> of

> >> PCIe hotplug.

> >>

> >> (Sorry for the slow response - it's deadline time over here.)

> >>

> >> On Thu, Jun 16, 2016 at 04:56:57PM +0200, Rafael J. Wysocki wrote:

> >> > On Thu, Jun 16, 2016 at 2:12 AM, Rafael J. Wysocki <rafael@kernel.org>

> >> wrote:

> >> > > On Thu, Jun 16, 2016 at 12:15 AM, Peter Jones <pjones@redhat.com>

> >> wrote:

> >> > >> Right now when booting, on many laptops the firmware manages

> the

> >> PCIe

> >> > >> bus.  As a result, when we call the _OSC ACPI method, it returns an

> >> > >> error code.  Unfortunately the errors are not very articulate.

> >> > >

> >> > > What exactly do you mean here?

> >> > >

> >> > >>  As a result, we show:

> >> > >>

> >> > >> ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-fe])

> >> > >> acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM

> ClockPM

> >> Segments MSI]

> >> > >> \_SB_.PCI0 (33DB4D5B-1FF7-401C-9657-7441C03DD766): _OSC invalid

> >> UUID

> >> > >> _OSC request data: 1 1f 0

> >> > >

> >> > > So _OSC told us that the UUID was invalid, didn't it?

> >> >

> >> > BTW, the above messages are KERN_DEBUG, so at least in theory they

> >> > shouldn't be visible in production runs.

> >> >

> >> > Maybe the bug to fix is that they show up when they aren't supposed

> to?

> >>

> >> No - the workflow that I am really trying to remedy is this:

> >>

> >>  1) S3 resume sometimes isn't working on some laptop you've got.

> >>  2) start looking at debug messages

> >>  3) this shows an error, so it looks like it's probably the problem

> >>  4) go fishing for red herring

> >>  5) if you happen to know who maintains the DSDT for the platform in

> >>     question, eventually work out that this is working as intended and

> >>     the bug is someplace else.

> >> 5b) if you don't know that person, eventually work out that it /might/

> >>      be someplace else...

> >>

> >> So the idea was to make it look more like an indication of status, and

> >> less like an error that's causing unrelated problems.

> >>

> >> When I talked to Mario at Dell (Cc'd), it wasn't clear to us that

> >> there's a way to distinguish the between the UUID being

> >> invalid/malformed, being merely unsupported, or being supported in

> some

> >> configurations but not the current one.  In this particular DSDT, the

> >> machine doesn't support the OS controlling any of this if USB-C /

> >> thunderbolt are enabled.  The DSDT is clearly written with the belief

> >> that you have to completely disable the handling for that UUID in this

> >> case, and googling for this looks like it's not the only one written

> >> with that belief.

> >>

> >> Reading the spec (v6.1, sections 6.2.11.3 and 6.2.11.4), it seems

> >> plausible that you can express this instead by handling the UUID but

> >> choosing each individual query/status bit in the way that accomplishes

> >> the OS doing nothing with the response.  So it may well be that that's

> >> just more code that vendors have thought wasn't necessary (or wasn't

> >> correct for some reason.)

> >>

> >> Mario, want to jump in on your thinking here?

> >>

> >> --

> >>   Peter

> >

> > After talking to the team, I was told this particular implementation to not

> let

> > OS take control when acting on that specific UUID based upon a variable

> > (NEXP in this case) came from Intel RC code.

> >

> > That's probably why this is all across a lot of platforms, including non-Dell.

> >

> > At least in the context of the laptop Peter noticed this on (Dell XPS 13 9350)

> > NEXP is set in GNVS based upon Thunderbolt capability.

> >

> > As for why they return unrecognized UUID instead of just masking all the

> > capabilities bits?  It's the same net functional result.  If the vendor provided

> > RC code doesn't caused WCHK problems or functional problems it's hard to

> > make a case for why it needs to be changed by the OEM.

> >

> > I think that Peter's patch is appropriate to message this is specifically

> > what's going on.

> 

> No, it may hide real (ie. non-intentional) bugs in _OSC, so it is not

> appropriate.

> 

> Debug-level messages really should not hurt anyone (and should never

> show up in production anyway).

> 

> We can slightly tone down the "_OSC failed (%s); disabling ASPM\n"

> message in negotiate_os_control() in drivers/acpi/pci_root.c if you

> think it's too strong and that's it.

> 

> Thanks,

> Rafael


I think changing that would help communicate what's going on here and at
least let the user know the result will be that the firmware is still controlling
ASPM due to the _OSC failure.

Something else that I think Andy recommended a while back was at that
time try to evaluate NEXP and display its value and an associated message
in debug logs when _OSC fails.  Would you be amenable to a change like that?
Andy Lutomirski June 22, 2016, 8:53 p.m. UTC | #9
On Wed, Jun 22, 2016 at 12:43 PM,  <Mario_Limonciello@dell.com> wrote:
>> -----Original Message-----
>> From: rjwysocki@gmail.com [mailto:rjwysocki@gmail.com] On Behalf Of
>> Rafael J. Wysocki
>> Sent: Tuesday, June 21, 2016 5:51 PM
>> To: Limonciello, Mario <Mario_Limonciello@Dell.com>
>> Cc: Peter Jones <pjones@redhat.com>; Rafael J. Wysocki
>> <rafael@kernel.org>; ACPI Devel Maling List <linux-acpi@vger.kernel.org>;
>> Linux Kernel Mailing List <linux-kernel@vger.kernel.org>; Len Brown
>> <lenb@kernel.org>; Rafael J. Wysocki <rjw@rjwysocki.net>; Andy Lutomirski
>> <luto@amacapital.net>
>> Subject: Re: [PATCH] ACPI: don't show an error when we're not in charge of
>> PCIe hotplug.
>>
>> On Tue, Jun 21, 2016 at 8:01 PM,  <Mario_Limonciello@dell.com> wrote:
>> >> -----Original Message-----
>> >> From: Peter Jones [mailto:pjones@redhat.com]
>> >> Sent: Tuesday, June 21, 2016 10:19 AM
>> >> To: Rafael J. Wysocki <rafael@kernel.org>
>> >> Cc: ACPI Devel Maling List <linux-acpi@vger.kernel.org>; Limonciello,
>> Mario
>> >> <Mario_Limonciello@Dell.com>; Linux Kernel Mailing List <linux-
>> >> kernel@vger.kernel.org>; Len Brown <lenb@kernel.org>; Rafael J .
>> Wysocki
>> >> <rjw@rjwysocki.net>; Andy Lutomirski <luto@amacapital.net>
>> >> Subject: Re: [PATCH] ACPI: don't show an error when we're not in charge
>> of
>> >> PCIe hotplug.
>> >>
>> >> (Sorry for the slow response - it's deadline time over here.)
>> >>
>> >> On Thu, Jun 16, 2016 at 04:56:57PM +0200, Rafael J. Wysocki wrote:
>> >> > On Thu, Jun 16, 2016 at 2:12 AM, Rafael J. Wysocki <rafael@kernel.org>
>> >> wrote:
>> >> > > On Thu, Jun 16, 2016 at 12:15 AM, Peter Jones <pjones@redhat.com>
>> >> wrote:
>> >> > >> Right now when booting, on many laptops the firmware manages
>> the
>> >> PCIe
>> >> > >> bus.  As a result, when we call the _OSC ACPI method, it returns an
>> >> > >> error code.  Unfortunately the errors are not very articulate.
>> >> > >
>> >> > > What exactly do you mean here?
>> >> > >
>> >> > >>  As a result, we show:
>> >> > >>
>> >> > >> ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-fe])
>> >> > >> acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM
>> ClockPM
>> >> Segments MSI]
>> >> > >> \_SB_.PCI0 (33DB4D5B-1FF7-401C-9657-7441C03DD766): _OSC invalid
>> >> UUID
>> >> > >> _OSC request data: 1 1f 0
>> >> > >
>> >> > > So _OSC told us that the UUID was invalid, didn't it?
>> >> >
>> >> > BTW, the above messages are KERN_DEBUG, so at least in theory they
>> >> > shouldn't be visible in production runs.
>> >> >
>> >> > Maybe the bug to fix is that they show up when they aren't supposed
>> to?
>> >>
>> >> No - the workflow that I am really trying to remedy is this:
>> >>
>> >>  1) S3 resume sometimes isn't working on some laptop you've got.
>> >>  2) start looking at debug messages
>> >>  3) this shows an error, so it looks like it's probably the problem
>> >>  4) go fishing for red herring
>> >>  5) if you happen to know who maintains the DSDT for the platform in
>> >>     question, eventually work out that this is working as intended and
>> >>     the bug is someplace else.
>> >> 5b) if you don't know that person, eventually work out that it /might/
>> >>      be someplace else...
>> >>
>> >> So the idea was to make it look more like an indication of status, and
>> >> less like an error that's causing unrelated problems.
>> >>
>> >> When I talked to Mario at Dell (Cc'd), it wasn't clear to us that
>> >> there's a way to distinguish the between the UUID being
>> >> invalid/malformed, being merely unsupported, or being supported in
>> some
>> >> configurations but not the current one.  In this particular DSDT, the
>> >> machine doesn't support the OS controlling any of this if USB-C /
>> >> thunderbolt are enabled.  The DSDT is clearly written with the belief
>> >> that you have to completely disable the handling for that UUID in this
>> >> case, and googling for this looks like it's not the only one written
>> >> with that belief.
>> >>
>> >> Reading the spec (v6.1, sections 6.2.11.3 and 6.2.11.4), it seems
>> >> plausible that you can express this instead by handling the UUID but
>> >> choosing each individual query/status bit in the way that accomplishes
>> >> the OS doing nothing with the response.  So it may well be that that's
>> >> just more code that vendors have thought wasn't necessary (or wasn't
>> >> correct for some reason.)
>> >>
>> >> Mario, want to jump in on your thinking here?
>> >>
>> >> --
>> >>   Peter
>> >
>> > After talking to the team, I was told this particular implementation to not
>> let
>> > OS take control when acting on that specific UUID based upon a variable
>> > (NEXP in this case) came from Intel RC code.
>> >
>> > That's probably why this is all across a lot of platforms, including non-Dell.
>> >
>> > At least in the context of the laptop Peter noticed this on (Dell XPS 13 9350)
>> > NEXP is set in GNVS based upon Thunderbolt capability.
>> >
>> > As for why they return unrecognized UUID instead of just masking all the
>> > capabilities bits?  It's the same net functional result.  If the vendor provided
>> > RC code doesn't caused WCHK problems or functional problems it's hard to
>> > make a case for why it needs to be changed by the OEM.
>> >
>> > I think that Peter's patch is appropriate to message this is specifically
>> > what's going on.
>>
>> No, it may hide real (ie. non-intentional) bugs in _OSC, so it is not
>> appropriate.
>>
>> Debug-level messages really should not hurt anyone (and should never
>> show up in production anyway).
>>
>> We can slightly tone down the "_OSC failed (%s); disabling ASPM\n"
>> message in negotiate_os_control() in drivers/acpi/pci_root.c if you
>> think it's too strong and that's it.
>>
>> Thanks,
>> Rafael
>
> I think changing that would help communicate what's going on here and at
> least let the user know the result will be that the firmware is still controlling
> ASPM due to the _OSC failure.
>
> Something else that I think Andy recommended a while back was at that
> time try to evaluate NEXP and display its value and an associated message
> in debug logs when _OSC fails.  Would you be amenable to a change like that?

That seems dangerous if NEXP is anything other than a SystemMemory
variable.  I don't know if there's a clean way to check that before
evaluating it.  (i.e. we don't want to hit some other thing called
NEXP that has side effects.)
Rafael J. Wysocki June 22, 2016, 10:47 p.m. UTC | #10
On Wed, Jun 22, 2016 at 10:53 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Wed, Jun 22, 2016 at 12:43 PM,  <Mario_Limonciello@dell.com> wrote:
>>> -----Original Message-----
>>> From: rjwysocki@gmail.com [mailto:rjwysocki@gmail.com] On Behalf Of

[cut]

>> I think changing that would help communicate what's going on here and at
>> least let the user know the result will be that the firmware is still controlling
>> ASPM due to the _OSC failure.

You seem to be assuming that all systems returning "unsupported UUID"
from the PCI host bridge _OSC will always fall into the same category,
but what if they don't?  What if at least some of them are really
broken?

>> Something else that I think Andy recommended a while back was at that
>> time try to evaluate NEXP and display its value and an associated message
>> in debug logs when _OSC fails.  Would you be amenable to a change like that?
>
> That seems dangerous if NEXP is anything other than a SystemMemory
> variable.  I don't know if there's a clean way to check that before
> evaluating it.  (i.e. we don't want to hit some other thing called
> NEXP that has side effects.)

Well, that's generic code and NEXP is not generic really, so agreed.

Thanks,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mario Limonciello June 23, 2016, 3:38 p.m. UTC | #11
> -----Original Message-----

> From: rjwysocki@gmail.com [mailto:rjwysocki@gmail.com] On Behalf Of

> Rafael J. Wysocki

> Sent: Wednesday, June 22, 2016 5:48 PM

> To: Andy Lutomirski <luto@amacapital.net>; Limonciello, Mario

> <Mario_Limonciello@Dell.com>

> Cc: Rafael Wysocki <rafael@kernel.org>; Peter Jones <pjones@redhat.com>;

> Linux ACPI <linux-acpi@vger.kernel.org>; linux-kernel@vger.kernel.org; Len

> Brown <lenb@kernel.org>; Rafael J. Wysocki <rjw@rjwysocki.net>

> Subject: Re: [PATCH] ACPI: don't show an error when we're not in charge of

> PCIe hotplug.

> 

> On Wed, Jun 22, 2016 at 10:53 PM, Andy Lutomirski <luto@amacapital.net>

> wrote:

> > On Wed, Jun 22, 2016 at 12:43 PM,  <Mario_Limonciello@dell.com> wrote:

> >>> -----Original Message-----

> >>> From: rjwysocki@gmail.com [mailto:rjwysocki@gmail.com] On Behalf Of

> 

> [cut]

> 

> >> I think changing that would help communicate what's going on here and at

> >> least let the user know the result will be that the firmware is still

> controlling

> >> ASPM due to the _OSC failure.

> 

> You seem to be assuming that all systems returning "unsupported UUID"

> from the PCI host bridge _OSC will always fall into the same category,

> but what if they don't?  What if at least some of them are really

> broken?

> 


Even if they are broken, the net result will be the firmware is in control
of ASPM, won't it?  At least outside of the group on this mailing list that
might not be very apparent from the current set of errors output into
logs.

> >> Something else that I think Andy recommended a while back was at that

> >> time try to evaluate NEXP and display its value and an associated message

> >> in debug logs when _OSC fails.  Would you be amenable to a change like

> that?

> >

> > That seems dangerous if NEXP is anything other than a SystemMemory

> > variable.  I don't know if there's a clean way to check that before

> > evaluating it.  (i.e. we don't want to hit some other thing called

> > NEXP that has side effects.)

> 

> Well, that's generic code and NEXP is not generic really, so agreed.


OK.
diff mbox

Patch

diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c
index 262ca31..be8a91b 100644
--- a/drivers/acpi/bus.c
+++ b/drivers/acpi/bus.c
@@ -215,6 +215,8 @@  acpi_status acpi_str_to_uuid(char *str, u8 *uuid)
 }
 EXPORT_SYMBOL_GPL(acpi_str_to_uuid);
 
+static u8 pci_osc_uuid_str[] = "33DB4D5B-1FF7-401C-9657-7441C03DD766";
+
 acpi_status acpi_run_osc(acpi_handle handle, struct acpi_osc_context *context)
 {
 	acpi_status status;
@@ -267,9 +269,13 @@  acpi_status acpi_run_osc(acpi_handle handle, struct acpi_osc_context *context)
 		if (errors & OSC_REQUEST_ERROR)
 			acpi_print_osc_error(handle, context,
 				"_OSC request failed");
-		if (errors & OSC_INVALID_UUID_ERROR)
-			acpi_print_osc_error(handle, context,
-				"_OSC invalid UUID");
+		if (errors & OSC_INVALID_UUID_ERROR) {
+			if (!strcasecmp(context->uuid_str, pci_osc_uuid_str))
+				pr_info("PCI PME managed by ACPI.\n");
+			else
+				acpi_print_osc_error(handle, context,
+						     "_OSC invalid UUID");
+		}
 		if (errors & OSC_INVALID_REVISION_ERROR)
 			acpi_print_osc_error(handle, context,
 				"_OSC invalid revision");