diff mbox series

[RESEND,1/1] PCI: Add ATS-disable quirk for AMD Radeon R7 GPUs

Message ID 20190408103725.30426-2-nickel@altlinux.org (mailing list archive)
State Accepted, archived
Commit d28ca864c493637f3c957f4ed9348a94fca6de60
Series PCI: Add ATS-disable quirk for AMD Radeon R7 GPUs

Commit Message

Nikolai Kostrigin April 8, 2019, 10:37 a.m. UTC
ATS is broken on this hardware (at least for Stoney Ridge
based laptop) and causes IOMMU stalls and
system failure. Disable ATS on these devices to make them
usable again with IOMMU enabled.
Thanks to Joerg Roedel <jroedel@suse.de> for help.

https://bugzilla.kernel.org/show_bug.cgi?id=194521

Signed-off-by: Nikolai Kostrigin <nickel@altlinux.org>
---
 drivers/pci/quirks.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Bjorn Helgaas April 9, 2019, 9:59 p.m. UTC | #1
[+cc Alex]

This claims to be a resend, but I don't see a previous posting.

There *was* discussion when the quirk was added two years ago for a
different device.  As part of that, Alex thought only that device
would be affected and ATS was validated on other GPUs:

  https://lore.kernel.org/lkml/BN6PR12MB165278346BE8A76B1E4412AFF7EA0@BN6PR12MB1652.namprd12.prod.outlook.com/

On Mon, Apr 08, 2019 at 01:37:25PM +0300, Nikolai Kostrigin wrote:
> ATS is broken on this hardware (at least for Stoney Ridge
> based laptop) and causes IOMMU stalls and
> system failure. Disable ATS on these devices to make them
> usable again with IOMMU enabled
> Thanks to Joerg Roedel <jroedel@suse.de> for help.
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=194521
> 
> Signed-off-by: Nikolai Kostrigin <nickel@altlinux.org>

Joerg, I'm happy to merge this if you would review or ack it.  I don't
know enough to conclude that this is the root cause.  It'd be nice to
have an actual AMD erratum.  Maybe it would even have a list of
affected devices so we could get them all at once so people wouldn't
have to trip over them one by one.

> ---
>  drivers/pci/quirks.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 4700d24e5d55..abb2532e16bf 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -4876,6 +4876,7 @@ static void quirk_no_ats(struct pci_dev *pdev)
>  
>  /* AMD Stoney platform GPU */
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x98e4, quirk_no_ats);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x6900, quirk_no_ats);
>  #endif /* CONFIG_PCI_ATS */
>  
>  /* Freescale PCIe doesn't support MSI in RC mode */
> -- 
> 2.21.0
>
Nikolai Kostrigin April 10, 2019, 8:03 a.m. UTC | #2
[+cc] linux-kernel@, linux-pci@

10.04.2019 10:26, Nikolai Kostrigin wrote:
> Hello!
>
> 10.04.2019 00:59, Bjorn Helgaas wrote:
>> [+cc Alex]
>>
>> This claims to be a resend, but I don't see a previous posting.
> For some reason unknown to me, my previous letter appeared in
> neither the linux-kernel@ nor the linux-pci@ mailing list.
I'm sorry. Now it's obvious I didn't manage to turn HTML formatting off
before.
>
>> There *was* discussion when the quirk was added two years ago for a
>> different device.  As part of that, Alex thought only that device
>> would be affected and ATS was validated on other GPUs:
>>
>>   https://lore.kernel.org/lkml/BN6PR12MB165278346BE8A76B1E4412AFF7EA0@BN6PR12MB1652.namprd12.prod.outlook.com/
> I'm aware of that and it was Joerg who pointed me to the above thread
> when I asked him for help.
> In my case both devices mentioned in the quirks are present in one laptop.
> Unfortunately I have no hardware with a standalone AMD R7 GPU
> (1002:6900) to check whether an unpatched kernel would fail
> the same way with both ATS and IOMMU enabled.
>
>> On Mon, Apr 08, 2019 at 01:37:25PM +0300, Nikolai Kostrigin wrote:
>>> ATS is broken on this hardware (at least for Stoney Ridge
>>> based laptop) and causes IOMMU stalls and
>>> system failure. Disable ATS on these devices to make them
>>> usable again with IOMMU enabled
>>> Thanks to Joerg Roedel <jroedel@suse.de> for help.
>>>
>>> https://bugzilla.kernel.org/show_bug.cgi?id=194521
>>>
>>> Signed-off-by: Nikolai Kostrigin <nickel@altlinux.org>
>> Joerg, I'm happy to merge this if you would review or ack it.  I don't
>> know enough to conclude that this is the root cause.  It'd be nice to
>> have an actual AMD erratum.  Maybe it would even have a list of
>> affected devices so we could get them all at once so people wouldn't
>> have to trip over them one by one.
>>
>>> ---
>>>  drivers/pci/quirks.c | 1 +
>>>  1 file changed, 1 insertion(+)
>>>
>>> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
>>> index 4700d24e5d55..abb2532e16bf 100644
>>> --- a/drivers/pci/quirks.c
>>> +++ b/drivers/pci/quirks.c
>>> @@ -4876,6 +4876,7 @@ static void quirk_no_ats(struct pci_dev *pdev)
>>>  
>>>  /* AMD Stoney platform GPU */
>>>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x98e4, quirk_no_ats);
>>> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x6900, quirk_no_ats);
>>>  #endif /* CONFIG_PCI_ATS */
>>>  
>>>  /* Freescale PCIe doesn't support MSI in RC mode */
>>> -- 
>>> 2.21.0
>>>
>
> -- 
> Best regards,
> Nikolai Kostrigin
Alex Deucher April 10, 2019, 2:46 p.m. UTC | #3
> -----Original Message-----
> From: Bjorn Helgaas <helgaas@kernel.org>
> Sent: Tuesday, April 9, 2019 5:59 PM
> To: Nikolai Kostrigin <nickel@altlinux.org>
> Cc: linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org;
> jroedel@suse.de; Deucher, Alexander <Alexander.Deucher@amd.com>
> Subject: Re: [PATCH RESEND 1/1] PCI: Add ATS-disable quirk for AMD Radeon
> R7 GPUs
> 
> [+cc Alex]
> 
> This claims to be a resend, but I don't see a previous posting.
> 
> There *was* discussion when the quirk was added two years ago for a
> different device.  As part of that, Alex thought only that device would be
> affected and ATS was validated on other GPUs:
> 
> 
> https://lore.kernel.org/lkml/BN6PR12MB165278346BE8A76B1E4412AFF7EA0@BN6PR12MB1652.namprd12.prod.outlook.com/
> 
> On Mon, Apr 08, 2019 at 01:37:25PM +0300, Nikolai Kostrigin wrote:
> > ATS is broken on this hardware (at least for Stoney Ridge based
> > laptop) and causes IOMMU stalls and system failure. Disable ATS on
> > these devices to make them usable again with IOMMU enabled
> > Thanks to Joerg Roedel <jroedel@suse.de> for help.
> >
> > https://bugzilla.kernel.org/show_bug.cgi?id=194521
> >

+ a few AMD people

Seeing this bug makes it clearer.  I don't think this is a problem with the GPU.  I think it's a problem with either the sbios or iommu.  I think the original quirk added for stoney (0x98e4) is probably wrong as well.  I suspect we need a quirk for a particular laptop or sbios versions.  We validated ATS extensively with Carrizo based systems (the system in the bug report above is Carrizo based) since it is the basis of our ROCm support on APUs.  We have also been involved in tons of Linux OEM preloads with both Carrizo and Stoney based APUs in combination with TOPAZ dGPUs (0x6900) and haven't seen this issue in those programs.  We also have TOPAZ dGPUs used in OEM programs with Intel chipsets and haven't seen the issue.  I suspect that since Windows does not use the IOMMU by default, the sbios settings may not be well validated on certain Windows-only SKUs.  I'd rather make these DMI matches or something like that for the platform, or at the very least match the SSIDs as well.

Alex

> > Signed-off-by: Nikolai Kostrigin <nickel@altlinux.org>
> 
> Joerg, I'm happy to merge this if you would review or ack it.  I don't know
> enough to conclude that this is the root cause.  It'd be nice to have an actual
> AMD erratum.  Maybe it would even have a list of affected devices so we
> could get them all at once so people wouldn't have to trip over them one by
> one.
> 
> > ---
> >  drivers/pci/quirks.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > index 4700d24e5d55..abb2532e16bf 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -4876,6 +4876,7 @@ static void quirk_no_ats(struct pci_dev *pdev)
> >
> >  /* AMD Stoney platform GPU */
> >  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x98e4, quirk_no_ats);
> > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x6900, quirk_no_ats);
> >  #endif /* CONFIG_PCI_ATS */
> >
> >  /* Freescale PCIe doesn't support MSI in RC mode */
> > --
> > 2.21.0
> >
Alex Deucher April 10, 2019, 3:59 p.m. UTC | #4
> -----Original Message-----
> From: Deucher, Alexander
> Sent: Wednesday, April 10, 2019 10:47 AM
> To: Bjorn Helgaas <helgaas@kernel.org>; Nikolai Kostrigin
> <nickel@altlinux.org>; Suthikulpanit, Suravee
> (Suravee.Suthikulpanit@amd.com) <Suravee.Suthikulpanit@amd.com>;
> Lendacky, Thomas <Thomas.Lendacky@amd.com>; Kuehling, Felix
> (Felix.Kuehling@amd.com) <Felix.Kuehling@amd.com>; Koenig, Christian
> (Christian.Koenig@amd.com) <Christian.Koenig@amd.com>
> Cc: linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org;
> jroedel@suse.de
> Subject: RE: [PATCH RESEND 1/1] PCI: Add ATS-disable quirk for AMD Radeon
> R7 GPUs
> 
> > -----Original Message-----
> > From: Bjorn Helgaas <helgaas@kernel.org>
> > Sent: Tuesday, April 9, 2019 5:59 PM
> > To: Nikolai Kostrigin <nickel@altlinux.org>
> > Cc: linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org;
> > jroedel@suse.de; Deucher, Alexander <Alexander.Deucher@amd.com>
> > Subject: Re: [PATCH RESEND 1/1] PCI: Add ATS-disable quirk for AMD Radeon
> > R7 GPUs
> >
> > [+cc Alex]
> >
> > This claims to be a resend, but I don't see a previous posting.
> >
> > There *was* discussion when the quirk was added two years ago for a
> > different device.  As part of that, Alex thought only that device
> > would be affected and ATS was validated on other GPUs:
> >
> >
> >
> > https://lore.kernel.org/lkml/BN6PR12MB165278346BE8A76B1E4412AFF7EA0@BN6PR12MB1652.namprd12.prod.outlook.com/
> >
> > On Mon, Apr 08, 2019 at 01:37:25PM +0300, Nikolai Kostrigin wrote:
> > > ATS is broken on this hardware (at least for Stoney Ridge based
> > > laptop) and causes IOMMU stalls and system failure. Disable ATS on
> > > these devices to make them usable again with IOMMU enabled
> > > Thanks to Joerg Roedel <jroedel@suse.de> for help.
> > >
> > > https://bugzilla.kernel.org/show_bug.cgi?id=194521
> > >
> 
> + a few AMD people
> 
> Seeing this bug makes it more clear.  I don't think this is a problem with the
> GPU.  I think it's a problem with either the sbios or iommu.  I think the original
> quirk added for stoney (0x98e4) is probably wrong as well.  I suspect we
> need a quirk for a particular laptop or sbios versions.  We validated ATS
> extensively with Carrizo based systems (the system in the bug report above
> is Carrizo based) since it is the basis of our ROCm support on APUs.  We have
> also been involved in tons of Linux OEM preloads with both Carrizo and
> Stoney based APUs in combination with TOPAZ dGPUs (0x6900) and haven't
> seen this issue in those programs.  We also have TOPAZ dGPUs used in OEM
> programs with Intel chipsets and haven't seen the issue.  I suspect since
> Windows does not use the IOMMU by default, the sbios settings may not be
> well validated on certain Windows-only SKUs.  I'd rather make these DMI
> matches or something like that for the platform or at the very least match
> the SSIDs as well.

Reading through these bugs again it seems to be an issue with Stoney APUs, not the dGPU specifically.  I think it would be better to disable ATS in general if a Stoney based platform was detected rather than adding ATS quirks for devices that someone may put in a Stoney based platform.  It also seems to be related to runtime pm on the dGPU.  Disabling runtime pm also seems to fix the issue.  On these systems runtime pm for the dGPU is controlled via ACPI (either ATPX or _PR3 depending on the platform).  Maybe something doesn't get restored properly on runtime resume which causes the ATS issues?

Alex

> 
> Alex
> 
> > > Signed-off-by: Nikolai Kostrigin <nickel@altlinux.org>
> >
> > Joerg, I'm happy to merge this if you would review or ack it.  I don't
> > know enough to conclude that this is the root cause.  It'd be nice to
> > have an actual AMD erratum.  Maybe it would even have a list of
> > affected devices so we could get them all at once so people wouldn't
> > have to trip over them one by one.
> >
> > > ---
> > >  drivers/pci/quirks.c | 1 +
> > >  1 file changed, 1 insertion(+)
> > >
> > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > > index 4700d24e5d55..abb2532e16bf 100644
> > > --- a/drivers/pci/quirks.c
> > > +++ b/drivers/pci/quirks.c
> > > @@ -4876,6 +4876,7 @@ static void quirk_no_ats(struct pci_dev *pdev)
> > >
> > >  /* AMD Stoney platform GPU */
> > >  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x98e4, quirk_no_ats);
> > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x6900, quirk_no_ats);
> > >  #endif /* CONFIG_PCI_ATS */
> > >
> > >  /* Freescale PCIe doesn't support MSI in RC mode */
> > > --
> > > 2.21.0
> > >
Bjorn Helgaas April 10, 2019, 9:26 p.m. UTC | #5
On Wed, Apr 10, 2019 at 11:03:20AM +0300, Nikolai Kostrigin wrote:
> 10.04.2019 10:26, Nikolai Kostrigin wrote:
> > Hello!
> >
> > 10.04.2019 00:59, Bjorn Helgaas wrote:
> >> [+cc Alex]
> >>
> >> This claims to be a resend, but I don't see a previous posting.
> > For some reason unknown to me, my previous letter appeared in
> > neither the linux-kernel@ nor the linux-pci@ mailing list.
> I'm sorry. Now it's obvious I didn't manage to turn HTML formatting off
> before.

No worries, happens to me all the time since gmail is so unfriendly about
text-only email!

Bjorn
Joerg Roedel April 11, 2019, 12:36 p.m. UTC | #6
On Wed, Apr 10, 2019 at 03:59:57PM +0000, Deucher, Alexander wrote:
> > + a few AMD people
> > 
> > Seeing this bug makes it more clear.  I don't think this is a problem with the
> > GPU.  I think it's a problem with either the sbios or iommu.  I think the original
> > quirk added for stoney (0x98e4) is probably wrong as well.  I suspect we
> > need a quirk for a particular laptop or sbios versions.  We validated ATS
> > extensively with Carrizo based systems (the system in the bug report above
> > is Carrizo based) since it is the basis of our ROCm support on APUs.  We have
> > also been involved in tons of Linux OEM preloads with both Carrizo and
> > Stoney based APUs in combination with TOPAZ dGPUs (0x6900) and haven't
> > seen this issue in those programs.  We also have TOPAZ dGPUs used in OEM
> > programs with Intel chipsets and haven't seen the issue.  I suspect since
> > Windows does not use the IOMMU by default, the sbios settings may not be
> > well validated on certain Windows-only SKUs.  I'd rather make these DMI
> > matches or something like that for the platform or at the very least match
> > the SSIDs as well.
> 
> Reading through these bugs again it seems to be an issue with Stoney
> APUs, not the dGPU specifically.  I think it would be better to
> disable ATS in general if a stoney based platform was detected rather
> than adding ATS quirks for devices that someone may put in a Stoney
> based platform.  It also seems to be related to runtime pm on the
> dGPU.  Disabling runtime pm also seems to fix the issue.  On these
> systems runtime pm for the dGPU is controlled via ACPI (either ATPX or
> _PR3 depending on the platform).  Maybe something doesn't get restored
> properly on runtime resume which causes the ATS issues?

This all seems quite possible, but we lack the ability to debug
this further on our side. So until we have a real root cause with a more
specific quirk that only targets systems with a broken sbios or
whatever, we need the catch-all approach.

We can remove these quirks again when AMD sends more specific quirks
upstream.


Regards,

	Joerg
Joerg Roedel April 11, 2019, 12:38 p.m. UTC | #7
On Mon, Apr 08, 2019 at 01:37:25PM +0300, Nikolai Kostrigin wrote:
> ATS is broken on this hardware (at least for Stoney Ridge
> based laptop) and causes IOMMU stalls and
> system failure. Disable ATS on these devices to make them
> usable again with IOMMU enabled
> Thanks to Joerg Roedel <jroedel@suse.de> for help.
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=194521
> 
> Signed-off-by: Nikolai Kostrigin <nickel@altlinux.org>

Acked-by: Joerg Roedel <jroedel@suse.de>

I helped Nikolai with tracking this issue down to ATS handling and
suggested adding the quirk.
Bjorn Helgaas April 11, 2019, 1:59 p.m. UTC | #8
[+cc Alex, Suravee, Thomas, Felix, Christian]

On Mon, Apr 08, 2019 at 01:37:25PM +0300, Nikolai Kostrigin wrote:
> ATS is broken on this hardware (at least for Stoney Ridge
> based laptop) and causes IOMMU stalls and
> system failure. Disable ATS on these devices to make them
> usable again with IOMMU enabled
> Thanks to Joerg Roedel <jroedel@suse.de> for help.
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=194521
> 
> Signed-off-by: Nikolai Kostrigin <nickel@altlinux.org>

I applied this to pci/virtualization for v5.2 with Joerg's ack, a stable
tag, and a note about Alex's suspicion that this may be system-specific and
may also affect other devices.

> ---
>  drivers/pci/quirks.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 4700d24e5d55..abb2532e16bf 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -4876,6 +4876,7 @@ static void quirk_no_ats(struct pci_dev *pdev)
>  
>  /* AMD Stoney platform GPU */
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x98e4, quirk_no_ats);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x6900, quirk_no_ats);
>  #endif /* CONFIG_PCI_ATS */
>  
>  /* Freescale PCIe doesn't support MSI in RC mode */
> -- 
> 2.21.0
>

Patch

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 4700d24e5d55..abb2532e16bf 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -4876,6 +4876,7 @@  static void quirk_no_ats(struct pci_dev *pdev)
 
 /* AMD Stoney platform GPU */
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x98e4, quirk_no_ats);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x6900, quirk_no_ats);
 #endif /* CONFIG_PCI_ATS */
 
 /* Freescale PCIe doesn't support MSI in RC mode */
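
As a rough illustration of what the one-line change does, the user-space C model below mimics a fixup table keyed on vendor and device ID. It is a sketch under stated assumptions, not the kernel implementation: the `mock_` names are hypothetical, the ATS capability offset 0x100 used in the usage note is arbitrary, and `quirk_no_ats()` is assumed to work by clearing the device's cached ATS capability offset so that ATS is never enabled.

```c
/* User-space model of a PCI fixup table; not kernel code. */
#include <stddef.h>
#include <stdint.h>

#define MOCK_VENDOR_ID_ATI 0x1002 /* numeric value of PCI_VENDOR_ID_ATI */

/* Hypothetical stand-in for the fields of struct pci_dev used here. */
struct mock_pci_dev {
	uint16_t vendor;
	uint16_t device;
	uint16_t ats_cap; /* assumed: nonzero offset means ATS is usable */
};

typedef void (*mock_fixup_hook)(struct mock_pci_dev *pdev);

struct mock_fixup {
	uint16_t vendor;
	uint16_t device;
	mock_fixup_hook hook;
};

/* Model of quirk_no_ats(): forget the ATS capability. */
static void mock_quirk_no_ats(struct mock_pci_dev *pdev)
{
	pdev->ats_cap = 0;
}

/* One table row per DECLARE_PCI_FIXUP_FINAL() line in the patch. */
static const struct mock_fixup mock_fixups[] = {
	{ MOCK_VENDOR_ID_ATI, 0x98e4, mock_quirk_no_ats }, /* existing entry */
	{ MOCK_VENDOR_ID_ATI, 0x6900, mock_quirk_no_ats }, /* added by this patch */
};

/* Run every hook whose vendor and device IDs match the device. */
static void mock_apply_fixups(struct mock_pci_dev *pdev)
{
	size_t n = sizeof(mock_fixups) / sizeof(mock_fixups[0]);

	for (size_t i = 0; i < n; i++) {
		if (mock_fixups[i].vendor == pdev->vendor &&
		    mock_fixups[i].device == pdev->device)
			mock_fixups[i].hook(pdev);
	}
}
```

In this model, a device initialized as `{ MOCK_VENDOR_ID_ATI, 0x6900, 0x100 }` ends up with `ats_cap == 0` after `mock_apply_fixups()`, while any other device ID is left unchanged. Because the match is on device ID alone, it applies on every platform, which is the catch-all behavior discussed in the comments.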