Message ID | 20190408103725.30426-2-nickel@altlinux.org (mailing list archive) |
---|---|
State | Accepted, archived |
Commit | d28ca864c493637f3c957f4ed9348a94fca6de60 |
Series | PCI: Add ATS-disable quirk for AMD Radeon R7 GPUs |
[+cc Alex]

This claims to be a resend, but I don't see a previous posting.

There *was* discussion when the quirk was added two years ago for a
different device.  As part of that, Alex thought only that device
would be affected and ATS was validated on other GPUs:

  https://lore.kernel.org/lkml/BN6PR12MB165278346BE8A76B1E4412AFF7EA0@BN6PR12MB1652.namprd12.prod.outlook.com/

On Mon, Apr 08, 2019 at 01:37:25PM +0300, Nikolai Kostrigin wrote:
> ATS is broken on this hardware (at least for Stoney Ridge
> based laptop) and causes IOMMU stalls and
> system failure. Disable ATS on these devices to make them
> usable again with IOMMU enabled
> Thanks to Joerg Roedel <jroedel@suse.de> for help.
>
> https://bugzilla.kernel.org/show_bug.cgi?id=194521
>
> Signed-off-by: Nikolai Kostrigin <nickel@altlinux.org>

Joerg, I'm happy to merge this if you would review or ack it.  I don't
know enough to conclude that this is the root cause.  It'd be nice to
have an actual AMD erratum.  Maybe it would even have a list of
affected devices so we could get them all at once so people wouldn't
have to trip over them one by one.

> ---
>  drivers/pci/quirks.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 4700d24e5d55..abb2532e16bf 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -4876,6 +4876,7 @@ static void quirk_no_ats(struct pci_dev *pdev)
>
>  /* AMD Stoney platform GPU */
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x98e4, quirk_no_ats);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x6900, quirk_no_ats);
>  #endif /* CONFIG_PCI_ATS */
>
>  /* Freescale PCIe doesn't support MSI in RC mode */
> --
> 2.21.0
>
[+cc] linux-kernel@, linux-pci@

10.04.2019 10:26, Nikolai Kostrigin wrote:
> Hello!
>
> 10.04.2019 00:59, Bjorn Helgaas wrote:
>> [+cc Alex]
>>
>> This claims to be a resend, but I don't see a previous posting.
> For some reason, unknown to me, my previous letter appeared
> neither in linux-kernel@ nor in linux-pci@ mailing lists.

I'm sorry. Now it's obvious I didn't manage to turn HTML formatting off
before.

>
>> There *was* discussion when the quirk was added two years ago for a
>> different device.  As part of that, Alex thought only that device
>> would be affected and ATS was validated on other GPUs:
>>
>> https://lore.kernel.org/lkml/BN6PR12MB165278346BE8A76B1E4412AFF7EA0@BN6PR12MB1652.namprd12.prod.outlook.com/
> I'm aware of that and it was Joerg who pointed me to the above thread
> when I asked him for help.
> In my case both devices mentioned in quirks are present in a laptop.
> Unfortunately I have no hardware with standalone AMD R7 GPU
> (1002:6900) to check whether unpatched kernel would fail
> the same way with both ATS and IOMMU enabled.
>
>> On Mon, Apr 08, 2019 at 01:37:25PM +0300, Nikolai Kostrigin wrote:
>>> ATS is broken on this hardware (at least for Stoney Ridge
>>> based laptop) and causes IOMMU stalls and
>>> system failure. Disable ATS on these devices to make them
>>> usable again with IOMMU enabled
>>> Thanks to Joerg Roedel <jroedel@suse.de> for help.
>>>
>>> https://bugzilla.kernel.org/show_bug.cgi?id=194521
>>>
>>> Signed-off-by: Nikolai Kostrigin <nickel@altlinux.org>
>> Joerg, I'm happy to merge this if you would review or ack it.  I don't
>> know enough to conclude that this is the root cause.  It'd be nice to
>> have an actual AMD erratum.  Maybe it would even have a list of
>> affected devices so we could get them all at once so people wouldn't
>> have to trip over them one by one.
>>
>>> ---
>>>  drivers/pci/quirks.c | 1 +
>>>  1 file changed, 1 insertion(+)
>>>
>>> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
>>> index 4700d24e5d55..abb2532e16bf 100644
>>> --- a/drivers/pci/quirks.c
>>> +++ b/drivers/pci/quirks.c
>>> @@ -4876,6 +4876,7 @@ static void quirk_no_ats(struct pci_dev *pdev)
>>>
>>>  /* AMD Stoney platform GPU */
>>>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x98e4, quirk_no_ats);
>>> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x6900, quirk_no_ats);
>>>  #endif /* CONFIG_PCI_ATS */
>>>
>>>  /* Freescale PCIe doesn't support MSI in RC mode */
>>> --
>>> 2.21.0
>>>
>
> --
> Best regards,
> Nikolai Kostrigin
> -----Original Message-----
> From: Bjorn Helgaas <helgaas@kernel.org>
> Sent: Tuesday, April 9, 2019 5:59 PM
> To: Nikolai Kostrigin <nickel@altlinux.org>
> Cc: linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org;
> jroedel@suse.de; Deucher, Alexander <Alexander.Deucher@amd.com>
> Subject: Re: [PATCH RESEND 1/1] PCI: Add ATS-disable quirk for AMD Radeon
> R7 GPUs
>
> [+cc Alex]
>
> This claims to be a resend, but I don't see a previous posting.
>
> There *was* discussion when the quirk was added two years ago for a
> different device.  As part of that, Alex thought only that device would be
> affected and ATS was validated on other GPUs:
>
> https://lore.kernel.org/lkml/BN6PR12MB165278346BE8A76B1E4412AFF7EA0@BN6PR12MB1652.namprd12.prod.outlook.com/
>
> On Mon, Apr 08, 2019 at 01:37:25PM +0300, Nikolai Kostrigin wrote:
> > ATS is broken on this hardware (at least for Stoney Ridge based
> > laptop) and causes IOMMU stalls and system failure. Disable ATS on
> > these devices to make them usable again with IOMMU enabled Thanks to
> > Joerg Roedel <jroedel@suse.de> for help.
> >
> > https://bugzilla.kernel.org/show_bug.cgi?id=194521

+ a few AMD people

Seeing this bug makes it more clear.  I don't think this is a problem
with the GPU.  I think it's a problem with either the sbios or iommu.  I
think the original quirk added for stoney (0x98e4) is probably wrong as
well.  I suspect we need a quirk for a particular laptop or sbios
versions.  We validated ATS extensively with Carrizo based systems (the
system in the bug report above is Carrizo based) since it is the basis
of our ROCm support on APUs.  We have also been involved in tons of
Linux OEM preloads with both Carrizo and Stoney based APUs in
combination with TOPAZ dGPUs (0x6900) and haven't seen this issue in
those programs.  We also have TOPAZ dGPUs used in OEM programs with
Intel chipsets and haven't seen the issue.  I suspect since windows does
not use the IOMMU by default, the sbios settings may not be well
validated on certain windows only skus.  I'd rather make these DMI
matches or something like that for the platform or at the very least
match the SSIDs as well.

Alex

> > Signed-off-by: Nikolai Kostrigin <nickel@altlinux.org>
>
> Joerg, I'm happy to merge this if you would review or ack it.  I don't know
> enough to conclude that this is the root cause.  It'd be nice to have an actual
> AMD erratum.  Maybe it would even have a list of affected devices so we
> could get them all at once so people wouldn't have to trip over them one by
> one.
>
> > ---
> >  drivers/pci/quirks.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > index 4700d24e5d55..abb2532e16bf 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -4876,6 +4876,7 @@ static void quirk_no_ats(struct pci_dev *pdev)
> >
> >  /* AMD Stoney platform GPU */
> >  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x98e4, quirk_no_ats);
> > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x6900, quirk_no_ats);
> >  #endif /* CONFIG_PCI_ATS */
> >
> >  /* Freescale PCIe doesn't support MSI in RC mode */
> > --
> > 2.21.0
> >
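Alex's suggestion to "at the very least match the SSIDs as well" can be illustrated with a small user-space sketch. This is not kernel code: the struct, the function names and the subsystem IDs below are all hypothetical placeholders, since the thread does not name any affected SSIDs. It only shows how narrowing the match from (vendor, device) to (vendor, device, subsystem vendor, subsystem device) would leave other boards with the same GPU untouched.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the fields of struct pci_dev used here. */
struct sketch_dev {
    uint16_t vendor;
    uint16_t device;
    uint16_t subsystem_vendor;
    uint16_t subsystem_device;
    uint16_t ats_cap;              /* 0 means "ATS not usable" */
};

struct sketch_ssid_match {
    uint16_t vendor, device, subsystem_vendor, subsystem_device;
};

/* Placeholder entry: 0xabcd:0x0001 is an invented SSID, not a real board. */
static const struct sketch_ssid_match sketch_broken_ats[] = {
    { 0x1002, 0x6900, 0xabcd, 0x0001 },
};

/* True only when device ID and subsystem ID both match a table entry. */
bool sketch_ats_broken(const struct sketch_dev *d)
{
    size_t n = sizeof(sketch_broken_ats) / sizeof(sketch_broken_ats[0]);

    for (size_t i = 0; i < n; i++) {
        const struct sketch_ssid_match *m = &sketch_broken_ats[i];

        if (m->vendor == d->vendor && m->device == d->device &&
            m->subsystem_vendor == d->subsystem_vendor &&
            m->subsystem_device == d->subsystem_device)
            return true;
    }
    return false;
}

/* Clear the ATS capability only on boards listed in the table. */
void sketch_quirk_no_ats_ssid(struct sketch_dev *d)
{
    if (sketch_ats_broken(d))
        d->ats_cap = 0;
}
```

With a table like this, a TOPAZ dGPU on a board whose SSID is not listed would keep ATS enabled, which matches the concern that the device itself may not be at fault.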
> -----Original Message-----
> From: Deucher, Alexander
> Sent: Wednesday, April 10, 2019 10:47 AM
> To: Bjorn Helgaas <helgaas@kernel.org>; Nikolai Kostrigin
> <nickel@altlinux.org>; Suthikulpanit, Suravee
> (Suravee.Suthikulpanit@amd.com) <Suravee.Suthikulpanit@amd.com>;
> Lendacky, Thomas <Thomas.Lendacky@amd.com>; Kuehling, Felix
> (Felix.Kuehling@amd.com) <Felix.Kuehling@amd.com>; Koenig, Christian
> (Christian.Koenig@amd.com) <Christian.Koenig@amd.com>
> Cc: linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org;
> jroedel@suse.de
> Subject: RE: [PATCH RESEND 1/1] PCI: Add ATS-disable quirk for AMD Radeon
> R7 GPUs
>
> > -----Original Message-----
> > From: Bjorn Helgaas <helgaas@kernel.org>
> > Sent: Tuesday, April 9, 2019 5:59 PM
> > To: Nikolai Kostrigin <nickel@altlinux.org>
> > Cc: linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org;
> > jroedel@suse.de; Deucher, Alexander <Alexander.Deucher@amd.com>
> > Subject: Re: [PATCH RESEND 1/1] PCI: Add ATS-disable quirk for AMD
> > Radeon R7 GPUs
> >
> > [+cc Alex]
> >
> > This claims to be a resend, but I don't see a previous posting.
> >
> > There *was* discussion when the quirk was added two years ago for a
> > different device.  As part of that, Alex thought only that device
> > would be affected and ATS was validated on other GPUs:
> >
> > https://lore.kernel.org/lkml/BN6PR12MB165278346BE8A76B1E4412AFF7EA0@BN6PR12MB1652.namprd12.prod.outlook.com/
> >
> > On Mon, Apr 08, 2019 at 01:37:25PM +0300, Nikolai Kostrigin wrote:
> > > ATS is broken on this hardware (at least for Stoney Ridge based
> > > laptop) and causes IOMMU stalls and system failure. Disable ATS on
> > > these devices to make them usable again with IOMMU enabled Thanks to
> > > Joerg Roedel <jroedel@suse.de> for help.
> > >
> > > https://bugzilla.kernel.org/show_bug.cgi?id=194521
> >
> + a few AMD people
>
> Seeing this bug makes it more clear.  I don't think this is a problem with the
> GPU.  I think it's a problem with either the sbios or iommu.  I think the original
> quirk added for stoney (0x98e4) is probably wrong as well.  I suspect we
> need a quirk for a particular laptop or sbios versions.  We validated ATS
> extensively with Carrizo based systems (the system in the bug report above
> is Carrizo based) since it is the basis of our ROCm support on APUs.  We have
> also been involved in tons of Linux OEM preloads with both Carrizo and
> Stoney based APUs in combination with TOPAZ dGPUs (0x6900) and haven't
> seen this issue in those programs.  We also have TOPAZ dGPUs used in OEM
> programs with Intel chipsets and haven't seen the issue.  I suspect since
> windows does not use the IOMMU by default, the sbios settings may not be
> well validated on certain windows only skus.  I'd rather make these DMI
> matches or something like that for the platform or at the very least match
> the SSIDs as well.

Reading through these bugs again it seems to be an issue with Stoney
APUs, not the dGPU specifically.  I think it would be better to disable
ATS in general if a stoney based platform was detected rather than
adding ATS quirks for devices that someone may put in a Stoney based
platform.  It also seems to be related to runtime pm on the dGPU.
Disabling runtime pm also seems to fix the issue.  On these systems
runtime pm for the dGPU is controlled via ACPI (either ATPX or _PR3
depending on the platform).  Maybe something doesn't get restored
properly on runtime resume which causes the ATS issues?

Alex

>
> Alex
>
> > > Signed-off-by: Nikolai Kostrigin <nickel@altlinux.org>
> >
> > Joerg, I'm happy to merge this if you would review or ack it.  I don't
> > know enough to conclude that this is the root cause.  It'd be nice to
> > have an actual AMD erratum.  Maybe it would even have a list of
> > affected devices so we could get them all at once so people wouldn't
> > have to trip over them one by one.
> >
> > > ---
> > >  drivers/pci/quirks.c | 1 +
> > >  1 file changed, 1 insertion(+)
> > >
> > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > > index 4700d24e5d55..abb2532e16bf 100644
> > > --- a/drivers/pci/quirks.c
> > > +++ b/drivers/pci/quirks.c
> > > @@ -4876,6 +4876,7 @@ static void quirk_no_ats(struct pci_dev *pdev)
> > >
> > >  /* AMD Stoney platform GPU */
> > >  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x98e4, quirk_no_ats);
> > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x6900, quirk_no_ats);
> > >  #endif /* CONFIG_PCI_ATS */
> > >
> > >  /* Freescale PCIe doesn't support MSI in RC mode */
> > > --
> > > 2.21.0
> > >
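The alternative Alex raises here, disabling ATS for everything once a Stoney based platform is detected, might look roughly like the user-space sketch below. It assumes, purely for illustration, that the platform can be recognised by the presence of the integrated GPU 1002:98e4 (the ID the existing quirk labels "AMD Stoney platform GPU"); the thread does not say how detection would actually be done, and all names here are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_VENDOR_ATI     0x1002
#define SKETCH_STONEY_GPU_ID  0x98e4   /* from the existing quirk's comment */

/* Hypothetical, simplified stand-in for struct pci_dev. */
struct sketch_pdev {
    uint16_t vendor;
    uint16_t device;
    uint16_t ats_cap;                  /* 0 means "ATS not usable" */
};

/* Assumption: a system is "Stoney based" if the Stoney iGPU is present. */
bool sketch_platform_is_stoney(const struct sketch_pdev *devs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (devs[i].vendor == SKETCH_VENDOR_ATI &&
            devs[i].device == SKETCH_STONEY_GPU_ID)
            return true;
    }
    return false;
}

/*
 * On a Stoney platform, clear ATS on every device that advertises it.
 * Returns how many devices were changed; 0 on any other platform.
 */
size_t sketch_disable_ats_if_stoney(struct sketch_pdev *devs, size_t n)
{
    size_t changed = 0;

    if (!sketch_platform_is_stoney(devs, n))
        return 0;

    for (size_t i = 0; i < n; i++) {
        if (devs[i].ats_cap != 0) {
            devs[i].ats_cap = 0;
            changed++;
        }
    }
    return changed;
}
```

The design difference from a per-device quirk is that the same dGPU keeps ATS when it sits in a non-Stoney system, and any other ATS-capable device plugged into a Stoney system is covered without a new table entry.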
On Wed, Apr 10, 2019 at 11:03:20AM +0300, Nikolai Kostrigin wrote:
> 10.04.2019 10:26, Nikolai Kostrigin wrote:
> > Hello!
> >
> > 10.04.2019 00:59, Bjorn Helgaas wrote:
> >> [+cc Alex]
> >>
> >> This claims to be a resend, but I don't see a previous posting.
> > For some reason, unknown to me, my previous letter appeared
> > neither in linux-kernel@ nor in linux-pci@ mailing lists.
> I'm sorry. Now it's obvious I didn't manage to turn HTML formatting off
> before.

No worries, happens to me all the time since gmail is so unfriendly
about text-only email!

Bjorn
On Wed, Apr 10, 2019 at 03:59:57PM +0000, Deucher, Alexander wrote:
> > + a few AMD people
> >
> > Seeing this bug makes it more clear.  I don't think this is a problem with the
> > GPU.  I think it's a problem with either the sbios or iommu.  I think the original
> > quirk added for stoney (0x98e4) is probably wrong as well.  I suspect we
> > need a quirk for a particular laptop or sbios versions.  We validated ATS
> > extensively with Carrizo based systems (the system in the bug report above
> > is Carrizo based) since it is the basis of our ROCm support on APUs.  We have
> > also been involved in tons of Linux OEM preloads with both Carrizo and
> > Stoney based APUs in combination with TOPAZ dGPUs (0x6900) and haven't
> > seen this issue in those programs.  We also have TOPAZ dGPUs used in OEM
> > programs with Intel chipsets and haven't seen the issue.  I suspect since
> > windows does not use the IOMMU by default, the sbios settings may not be
> > well validated on certain windows only skus.  I'd rather make these DMI
> > matches or something like that for the platform or at the very least match
> > the SSIDs as well.
>
> Reading through these bugs again it seems to be an issue with Stoney
> APUs, not the dGPU specifically.  I think it would be better to
> disable ATS in general if a stoney based platform was detected rather
> than adding ATS quirks for devices that someone may put in a Stoney
> based platform.  It also seems to be related to runtime pm on the
> dGPU.  Disabling runtime pm also seems to fix the issue.  On these
> systems runtime pm for the dGPU is controlled via ACPI (either ATPX or
> _PR3 depending on the platform).  Maybe something doesn't get restored
> properly on runtime resume which causes the ATS issues?

This seems all pretty much possible, but we lack the ability to debug
this further on our side. So until we have a real root-cause with a more
specific quirk that only targets systems with a broken sbios or
whatever, we need the catch-all approach.

We can remove these quirks again when AMD sends more specific quirks
upstream.

Regards,

	Joerg
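For comparison with the catch-all quirk, a "more specific quirk that only targets systems with a broken sbios" could be keyed on firmware identification strings, in the spirit of the DMI matches Alex mentioned. The sketch below is a user-space illustration only: the struct, the function and the vendor and product strings are invented placeholders, and it does not use the kernel's DMI API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical subset of the identification strings a platform reports. */
struct sketch_dmi_id {
    const char *sys_vendor;
    const char *product_name;
};

/* Placeholder entry: no affected model is named in the thread. */
static const struct sketch_dmi_id sketch_no_ats_platforms[] = {
    { "ExampleVendor", "Example Stoney Laptop" },
};

/* True when the running system matches an entry in the table exactly. */
bool sketch_platform_needs_no_ats(const struct sketch_dmi_id *sys)
{
    size_t n = sizeof(sketch_no_ats_platforms) /
               sizeof(sketch_no_ats_platforms[0]);

    for (size_t i = 0; i < n; i++) {
        if (strcmp(sketch_no_ats_platforms[i].sys_vendor,
                   sys->sys_vendor) == 0 &&
            strcmp(sketch_no_ats_platforms[i].product_name,
                   sys->product_name) == 0)
            return true;
    }
    return false;
}
```

A table like this has the maintenance cost Bjorn worried about, since each affected model has to be discovered and listed, which is part of why the device-ID quirk was accepted as an interim measure.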
On Mon, Apr 08, 2019 at 01:37:25PM +0300, Nikolai Kostrigin wrote:
> ATS is broken on this hardware (at least for Stoney Ridge
> based laptop) and causes IOMMU stalls and
> system failure. Disable ATS on these devices to make them
> usable again with IOMMU enabled
> Thanks to Joerg Roedel <jroedel@suse.de> for help.
>
> https://bugzilla.kernel.org/show_bug.cgi?id=194521
>
> Signed-off-by: Nikolai Kostrigin <nickel@altlinux.org>

Acked-by: Joerg Roedel <jroedel@suse.de>

I helped Nikolai with tracking this issue down to ATS handling and
suggested adding the quirk.
[+cc Alex, Suravee, Thomas, Felix, Christian]

On Mon, Apr 08, 2019 at 01:37:25PM +0300, Nikolai Kostrigin wrote:
> ATS is broken on this hardware (at least for Stoney Ridge
> based laptop) and causes IOMMU stalls and
> system failure. Disable ATS on these devices to make them
> usable again with IOMMU enabled
> Thanks to Joerg Roedel <jroedel@suse.de> for help.
>
> https://bugzilla.kernel.org/show_bug.cgi?id=194521
>
> Signed-off-by: Nikolai Kostrigin <nickel@altlinux.org>

I applied this to pci/virtualization for v5.2 with Joerg's ack, a
stable tag, and a note about Alex's suspicion that this may be
system-specific and may also affect other devices.

> ---
>  drivers/pci/quirks.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 4700d24e5d55..abb2532e16bf 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -4876,6 +4876,7 @@ static void quirk_no_ats(struct pci_dev *pdev)
>
>  /* AMD Stoney platform GPU */
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x98e4, quirk_no_ats);
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x6900, quirk_no_ats);
>  #endif /* CONFIG_PCI_ATS */
>
>  /* Freescale PCIe doesn't support MSI in RC mode */
> --
> 2.21.0
>
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 4700d24e5d55..abb2532e16bf 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -4876,6 +4876,7 @@ static void quirk_no_ats(struct pci_dev *pdev)
 
 /* AMD Stoney platform GPU */
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x98e4, quirk_no_ats);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x6900, quirk_no_ats);
 #endif /* CONFIG_PCI_ATS */
 
 /* Freescale PCIe doesn't support MSI in RC mode */
ATS is broken on this hardware (at least for Stoney Ridge
based laptop) and causes IOMMU stalls and
system failure. Disable ATS on these devices to make them
usable again with IOMMU enabled
Thanks to Joerg Roedel <jroedel@suse.de> for help.

https://bugzilla.kernel.org/show_bug.cgi?id=194521

Signed-off-by: Nikolai Kostrigin <nickel@altlinux.org>
---
 drivers/pci/quirks.c | 1 +
 1 file changed, 1 insertion(+)
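To make the effect of the one-line patch concrete, here is a minimal user-space model of what the accepted change does. It is a sketch under stated assumptions, not the kernel implementation: the struct, the table and `sketch_run_final_fixups()` are hypothetical stand-ins for `struct pci_dev` and the fixup machinery behind `DECLARE_PCI_FIXUP_FINAL`, and the hook only models the idea that the quirk leaves the device with no usable ATS capability. The value 0x1002 for `PCI_VENDOR_ID_ATI` and the two device IDs come from the kernel headers and the patch.

```c
#include <stddef.h>
#include <stdint.h>

#define PCI_VENDOR_ID_ATI 0x1002

/* Hypothetical, simplified stand-in for struct pci_dev. */
struct sketch_pci_dev {
    uint16_t vendor;
    uint16_t device;
    uint16_t ats_cap;                 /* 0 means "ATS not usable" */
};

/* One (vendor, device, hook) entry, as a fixup declaration would register. */
struct sketch_fixup {
    uint16_t vendor;
    uint16_t device;
    void (*hook)(struct sketch_pci_dev *);
};

/* Models the quirk: forget the ATS capability so it is never enabled. */
void quirk_no_ats(struct sketch_pci_dev *pdev)
{
    pdev->ats_cap = 0;
}

static const struct sketch_fixup sketch_final_fixups[] = {
    { PCI_VENDOR_ID_ATI, 0x98e4, quirk_no_ats },  /* existing Stoney entry */
    { PCI_VENDOR_ID_ATI, 0x6900, quirk_no_ats },  /* entry added by this patch */
};

/* Run every hook whose vendor and device ID match the given device. */
void sketch_run_final_fixups(struct sketch_pci_dev *pdev)
{
    size_t n = sizeof(sketch_final_fixups) / sizeof(sketch_final_fixups[0]);

    for (size_t i = 0; i < n; i++) {
        if (sketch_final_fixups[i].vendor == pdev->vendor &&
            sketch_final_fixups[i].device == pdev->device)
            sketch_final_fixups[i].hook(pdev);
    }
}
```

Because the match is on vendor and device ID alone, every 1002:6900 device loses ATS regardless of the system it is installed in, which is the catch-all behaviour discussed in the thread.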