Message ID | 1491575538-22694-1-git-send-email-joro@8bytes.org (mailing list archive) |
---|---|
State | New, archived |
Delegated to: | Bjorn Helgaas |
Headers | show |
> -----Original Message----- > From: Joerg Roedel [mailto:joro@8bytes.org] > Sent: Friday, April 07, 2017 10:32 AM > To: Bjorn Helgaas > Cc: linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org; Daniel Drake; > Deucher, Alexander; Samuel Sieb; David Woodhouse; Joerg Roedel > Subject: [PATCH v2] PCI: Add ATS-disable quirk for AMD Stoney GPUs > > From: Joerg Roedel <jroedel@suse.de> > > ATS is broken on this hardware and causes IOMMU stalls and > system failure. Disable ATS on these devices to make them > usable again with IOMMU enabled. > > Note that the commit in the Fixes-tag is not buggy, it > just uncovers the problem in the hardware by increasing > the ATS-flush rate. > > Fixes: b1516a14657a ('iommu/amd: Implement flush queue') > Signed-off-by: Joerg Roedel <jroedel@suse.de> Acked-by: Alex Deucher <alexander.deucher@amd.com> > --- > drivers/pci/quirks.c | 19 +++++++++++++++++++ > 1 file changed, 19 insertions(+) > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c > index 6736836..7cbe316 100644 > --- a/drivers/pci/quirks.c > +++ b/drivers/pci/quirks.c > @@ -4634,3 +4634,22 @@ static void quirk_no_aersid(struct pci_dev *pdev) > DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2031, > quirk_no_aersid); > DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2032, > quirk_no_aersid); > DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2033, > quirk_no_aersid); > + > +#ifdef CONFIG_PCI_ATS > +/* > + * Some devices have a broken ATS implementation causing IOMMU stalls. > + * Don't use ATS for those devices. > + */ > +static void quirk_disable_ats(struct pci_dev *pdev) > +{ > + /* > + * Set pdev->ats_cap = 0 to make pci_enable_ats() bail out > + * early. > + */ > + dev_info(&pdev->dev, "QUIRK: Disabling ATS"); > + pdev->ats_cap = 0; > +} > + > +/* AMD Stoney platform GPU */ > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x98e4, > quirk_disable_ats); > +#endif /* CONFIG_PCI_ATS */ > -- > 1.9.1
On Fri, Apr 07, 2017 at 04:32:18PM +0200, Joerg Roedel wrote: > From: Joerg Roedel <jroedel@suse.de> > > ATS is broken on this hardware and causes IOMMU stalls and > system failure. Disable ATS on these devices to make them > usable again with IOMMU enabled. AMD Stoney Ridge is an x86 CPU + GPU combo and this quirk pertains to the GPU, right? In that case the quirk should go to arch/x86. Paul Menzel (+cc) has just complained on linux-pci@ that final fixups are taking half a second, and I think that could be reduced if more efforts were spent to move arch-specific quirks out of the catch-all in drivers/pci/quirks.c. Thanks, Lukas > > Note that the commit in the Fixes-tag is not buggy, it > just uncovers the problem in the hardware by increasing > the ATS-flush rate. > > Fixes: b1516a14657a ('iommu/amd: Implement flush queue') > Signed-off-by: Joerg Roedel <jroedel@suse.de> > --- > drivers/pci/quirks.c | 19 +++++++++++++++++++ > 1 file changed, 19 insertions(+) > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c > index 6736836..7cbe316 100644 > --- a/drivers/pci/quirks.c > +++ b/drivers/pci/quirks.c > @@ -4634,3 +4634,22 @@ static void quirk_no_aersid(struct pci_dev *pdev) > DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2031, quirk_no_aersid); > DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2032, quirk_no_aersid); > DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2033, quirk_no_aersid); > + > +#ifdef CONFIG_PCI_ATS > +/* > + * Some devices have a broken ATS implementation causing IOMMU stalls. > + * Don't use ATS for those devices. > + */ > +static void quirk_disable_ats(struct pci_dev *pdev) > +{ > + /* > + * Set pdev->ats_cap = 0 to make pci_enable_ats() bail out > + * early. > + */ > + dev_info(&pdev->dev, "QUIRK: Disabling ATS"); > + pdev->ats_cap = 0; > +} > + > +/* AMD Stoney platform GPU */ > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x98e4, quirk_disable_ats); > +#endif /* CONFIG_PCI_ATS */ > -- > 1.9.1 >
On Sat, Apr 08, 2017 at 09:41:07AM +0200, Lukas Wunner wrote: > On Fri, Apr 07, 2017 at 04:32:18PM +0200, Joerg Roedel wrote: > > From: Joerg Roedel <jroedel@suse.de> > > > > ATS is broken on this hardware and causes IOMMU stalls and > > system failure. Disable ATS on these devices to make them > > usable again with IOMMU enabled. > > AMD Stoney Ridge is an x86 CPU + GPU combo and this quirk pertains > to the GPU, right? > > In that case the quirk should go to arch/x86. Paul Menzel (+cc) > has just complained on linux-pci@ that final fixups are taking half > a second, and I think that could be reduced if more efforts were > spent to move arch-specific quirks out of the catch-all in > drivers/pci/quirks.c. The affected hardware here might be x86-only, but ATS is not. If a broken ATS-capable plug-in card appears, we need this in generic code anyway. Also has anyone profiled why the fixups take so long (and on what hardware)? Maybe the fixup-device matching can be improved instead of cluttering arch-code with pci-fixups. Joerg
On Fri, 2017-04-07 at 16:46 +0000, Deucher, Alexander wrote: > > > > -----Original Message----- > > From: Joerg Roedel [mailto:joro@8bytes.org] > > Sent: Friday, April 07, 2017 10:32 AM > > To: Bjorn Helgaas > > Cc: linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org; Daniel Drake; > > Deucher, Alexander; Samuel Sieb; David Woodhouse; Joerg Roedel > > Subject: [PATCH v2] PCI: Add ATS-disable quirk for AMD Stoney GPUs > > > > From: Joerg Roedel <jroedel@suse.de> > > > > ATS is broken on this hardware and causes IOMMU stalls and > > system failure. Disable ATS on these devices to make them > > usable again with IOMMU enabled. > > > > Note that the commit in the Fixes-tag is not buggy, it > > just uncovers the problem in the hardware by increasing > > the ATS-flush rate. > > > > Fixes: b1516a14657a ('iommu/amd: Implement flush queue') > > Signed-off-by: Joerg Roedel <jroedel@suse.de> > Acked-by: Alex Deucher <alexander.deucher@amd.com> Alex, are you able to confirm that it is *only* the device with PCI ID 0x98e4 which has this problem, or (more likely) come up with an exhaustive list? Thanks. We'll want the same blacklist in Xen too, won't we? > > > > --- > > drivers/pci/quirks.c | 19 +++++++++++++++++++ > > 1 file changed, 19 insertions(+) > > > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c > > index 6736836..7cbe316 100644 > > --- a/drivers/pci/quirks.c > > +++ b/drivers/pci/quirks.c > > @@ -4634,3 +4634,22 @@ static void quirk_no_aersid(struct pci_dev *pdev) > > DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2031, > > quirk_no_aersid); > > DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2032, > > quirk_no_aersid); > > DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2033, > > quirk_no_aersid); > > + > > +#ifdef CONFIG_PCI_ATS > > +/* > > + * Some devices have a broken ATS implementation causing IOMMU stalls. > > + * Don't use ATS for those devices. > > + */ > > +static void quirk_disable_ats(struct pci_dev *pdev) > > +{ > > + /* > > + * Set pdev->ats_cap = 0 to make pci_enable_ats() bail out > > + * early. > > + */ > > + dev_info(&pdev->dev, "QUIRK: Disabling ATS"); > > + pdev->ats_cap = 0; > > +} > > + > > +/* AMD Stoney platform GPU */ > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x98e4, quirk_disable_ats); > > +#endif /* CONFIG_PCI_ATS */ > > -- > > 1.9.1
> -----Original Message----- > From: David Woodhouse [mailto:dwmw2@infradead.org] > Sent: Thursday, May 04, 2017 6:22 AM > To: Deucher, Alexander; 'Joerg Roedel'; Bjorn Helgaas > Cc: linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org; Daniel Drake; > Samuel Sieb; Joerg Roedel > Subject: Re: [PATCH v2] PCI: Add ATS-disable quirk for AMD Stoney GPUs > > On Fri, 2017-04-07 at 16:46 +0000, Deucher, Alexander wrote: > > > > > > -----Original Message----- > > > From: Joerg Roedel [mailto:joro@8bytes.org] > > > Sent: Friday, April 07, 2017 10:32 AM > > > To: Bjorn Helgaas > > > Cc: linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org; Daniel > Drake; > > > Deucher, Alexander; Samuel Sieb; David Woodhouse; Joerg Roedel > > > Subject: [PATCH v2] PCI: Add ATS-disable quirk for AMD Stoney GPUs > > > > > > From: Joerg Roedel <jroedel@suse.de> > > > > > > ATS is broken on this hardware and causes IOMMU stalls and > > > system failure. Disable ATS on these devices to make them > > > usable again with IOMMU enabled. > > > > > > Note that the commit in the Fixes-tag is not buggy, it > > > just uncovers the problem in the hardware by increasing > > > the ATS-flush rate. > > > > > > Fixes: b1516a14657a ('iommu/amd: Implement flush queue') > > > Signed-off-by: Joerg Roedel <jroedel@suse.de> > > Acked-by: Alex Deucher <alexander.deucher@amd.com> > > Alex, are you able to confirm that it is *only* the device with PCI ID > 0x98e4 which has this problem, or (more likely) come up with an > exhaustive list? Thanks. It's just Stoney, that is the only ID. ATS is validated on the other GPU parts. Alex > > We'll want the same blacklist in Xen too, won't we? > > > > > > > --- > > > drivers/pci/quirks.c | 19 +++++++++++++++++++ > > > 1 file changed, 19 insertions(+) > > > > > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c > > > index 6736836..7cbe316 100644 > > > --- a/drivers/pci/quirks.c > > > +++ b/drivers/pci/quirks.c > > > @@ -4634,3 +4634,22 @@ static void quirk_no_aersid(struct pci_dev > *pdev) > > > DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2031, > > > quirk_no_aersid); > > > DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2032, > > > quirk_no_aersid); > > > DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2033, > > > quirk_no_aersid); > > > + > > > +#ifdef CONFIG_PCI_ATS > > > +/* > > > + * Some devices have a broken ATS implementation causing IOMMU > stalls. > > > + * Don't use ATS for those devices. > > > + */ > > > +static void quirk_disable_ats(struct pci_dev *pdev) > > > +{ > > > + /* > > > + * Set pdev->ats_cap = 0 to make pci_enable_ats() bail out > > > + * early. > > > + */ > > > + dev_info(&pdev->dev, "QUIRK: Disabling ATS"); > > > + pdev->ats_cap = 0; > > > +} > > > + > > > +/* AMD Stoney platform GPU */ > > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x98e4, > quirk_disable_ats); > > > +#endif /* CONFIG_PCI_ATS */ > > > -- > > > 1.9.1
> -----Original Message----- > From: David Woodhouse [mailto:dwmw2@infradead.org] > Sent: Thursday, May 04, 2017 6:22 AM > To: Deucher, Alexander; 'Joerg Roedel'; Bjorn Helgaas > Cc: linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org; Daniel Drake; > Samuel Sieb; Joerg Roedel > Subject: Re: [PATCH v2] PCI: Add ATS-disable quirk for AMD Stoney GPUs > > On Fri, 2017-04-07 at 16:46 +0000, Deucher, Alexander wrote: > > > > > > -----Original Message----- > > > From: Joerg Roedel [mailto:joro@8bytes.org] > > > Sent: Friday, April 07, 2017 10:32 AM > > > To: Bjorn Helgaas > > > Cc: linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org; Daniel > Drake; > > > Deucher, Alexander; Samuel Sieb; David Woodhouse; Joerg Roedel > > > Subject: [PATCH v2] PCI: Add ATS-disable quirk for AMD Stoney GPUs > > > > > > From: Joerg Roedel <jroedel@suse.de> > > > > > > ATS is broken on this hardware and causes IOMMU stalls and > > > system failure. Disable ATS on these devices to make them > > > usable again with IOMMU enabled. > > > > > > Note that the commit in the Fixes-tag is not buggy, it > > > just uncovers the problem in the hardware by increasing > > > the ATS-flush rate. > > > > > > Fixes: b1516a14657a ('iommu/amd: Implement flush queue') > > > Signed-off-by: Joerg Roedel <jroedel@suse.de> > > Acked-by: Alex Deucher <alexander.deucher@amd.com> > > Alex, are you able to confirm that it is *only* the device with PCI ID > 0x98e4 which has this problem, or (more likely) come up with an > exhaustive list? Thanks. > > We'll want the same blacklist in Xen too, won't we? I finally got an answer from the hw team and we validated ATS on stoney as well so in theory this patch shouldn’t actually be needed. I think we may actually be papering over some other issue. The following patch seems to also fix this issue (and other issues): https://www.spinics.net/lists/stable/msg172631.html Alex > > > > > > > --- > > > drivers/pci/quirks.c | 19 +++++++++++++++++++ > > > 1 file changed, 19 insertions(+) > > > > > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c > > > index 6736836..7cbe316 100644 > > > --- a/drivers/pci/quirks.c > > > +++ b/drivers/pci/quirks.c > > > @@ -4634,3 +4634,22 @@ static void quirk_no_aersid(struct pci_dev > *pdev) > > > DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2031, > > > quirk_no_aersid); > > > DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2032, > > > quirk_no_aersid); > > > DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2033, > > > quirk_no_aersid); > > > + > > > +#ifdef CONFIG_PCI_ATS > > > +/* > > > + * Some devices have a broken ATS implementation causing IOMMU > stalls. > > > + * Don't use ATS for those devices. > > > + */ > > > +static void quirk_disable_ats(struct pci_dev *pdev) > > > +{ > > > + /* > > > + * Set pdev->ats_cap = 0 to make pci_enable_ats() bail out > > > + * early. > > > + */ > > > + dev_info(&pdev->dev, "QUIRK: Disabling ATS"); > > > + pdev->ats_cap = 0; > > > +} > > > + > > > +/* AMD Stoney platform GPU */ > > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x98e4, > quirk_disable_ats); > > > +#endif /* CONFIG_PCI_ATS */ > > > -- > > > 1.9.1
Hi Alexander, On Tue, May 23, 2017 at 07:54:12PM +0000, Deucher, Alexander wrote: > I finally got an answer from the hw team and we validated ATS on > stoney as well so in theory this patch shouldn’t actually be needed. > I think we may actually be papering over some other issue. The > following patch seems to also fix this issue (and other issues): > https://www.spinics.net/lists/stable/msg172631.html Yeah, but it still looks to me like that the hardware got into some weird state with the storm of ATS invalidations sent to it. The Completion-Wait loop timeouts seen in the original bug report indicate that the IOMMU is waiting for a response that never comes. And this is probably the ATS flush completion response from the GPU, as disabling ATS on the GPU makes the issue disappear. Regards, Joerg
On Wed, 2017-05-24 at 10:44 +0200, Joerg Roedel wrote: > Hi Alexander, > > On Tue, May 23, 2017 at 07:54:12PM +0000, Deucher, Alexander wrote: > > > > I finally got an answer from the hw team and we validated ATS on > > stoney as well so in theory this patch shouldn’t actually be needed. > > I think we may actually be papering over some other issue. The > > following patch seems to also fix this issue (and other issues): > > https://www.spinics.net/lists/stable/msg172631.html > > Yeah, but it still looks to me like that the hardware got into some > weird state with the storm of ATS invalidations sent to it. > > The Completion-Wait loop timeouts seen in the original bug report > indicate that the IOMMU is waiting for a response that never comes. And > this is probably the ATS flush completion response from the GPU, as > disabling ATS on the GPU makes the issue disappear. The above patch doesn't actually fix any spec violation which could the GPU an *excuse* to crash and stop responding to invalidations, does it? It just seems to reduce the invalidation load a little, and thus paper over the problem that the card tends to crash under load. Absent a more coherent explanation, it still seems like the correct answer is to blacklist these devices for ATS because they're broken.
> -----Original Message----- > From: Joerg Roedel [mailto:jroedel@suse.de] > Sent: Wednesday, May 24, 2017 4:45 AM > To: Deucher, Alexander > Cc: 'David Woodhouse'; 'Joerg Roedel'; Bjorn Helgaas; linux- > pci@vger.kernel.org; linux-kernel@vger.kernel.org; Daniel Drake; Samuel > Sieb > Subject: Re: [PATCH v2] PCI: Add ATS-disable quirk for AMD Stoney GPUs > > Hi Alexander, > > On Tue, May 23, 2017 at 07:54:12PM +0000, Deucher, Alexander wrote: > > I finally got an answer from the hw team and we validated ATS on > > stoney as well so in theory this patch shouldn’t actually be needed. > > I think we may actually be papering over some other issue. The > > following patch seems to also fix this issue (and other issues): > > https://www.spinics.net/lists/stable/msg172631.html > > Yeah, but it still looks to me like that the hardware got into some > weird state with the storm of ATS invalidations sent to it. > > The Completion-Wait loop timeouts seen in the original bug report > indicate that the IOMMU is waiting for a response that never comes. And > this is probably the ATS flush completion response from the GPU, as > disabling ATS on the GPU makes the issue disappear. Yeah, it's weird. My ack on the patch still stands. Just adding some additional data. Alex
On 05/24/2017 05:56 AM, Deucher, Alexander wrote: >> -----Original Message----- >> From: Joerg Roedel [mailto:jroedel@suse.de] >> Sent: Wednesday, May 24, 2017 4:45 AM >> To: Deucher, Alexander >> Cc: 'David Woodhouse'; 'Joerg Roedel'; Bjorn Helgaas; linux- >> pci@vger.kernel.org; linux-kernel@vger.kernel.org; Daniel Drake; Samuel >> Sieb >> Subject: Re: [PATCH v2] PCI: Add ATS-disable quirk for AMD Stoney GPUs >> >> Hi Alexander, >> >> On Tue, May 23, 2017 at 07:54:12PM +0000, Deucher, Alexander wrote: >>> I finally got an answer from the hw team and we validated ATS on >>> stoney as well so in theory this patch shouldn’t actually be needed. >>> I think we may actually be papering over some other issue. The >>> following patch seems to also fix this issue (and other issues): >>> https://www.spinics.net/lists/stable/msg172631.html >> >> Yeah, but it still looks to me like that the hardware got into some >> weird state with the storm of ATS invalidations sent to it. >> >> The Completion-Wait loop timeouts seen in the original bug report >> indicate that the IOMMU is waiting for a response that never comes. And >> this is probably the ATS flush completion response from the GPU, as >> disabling ATS on the GPU makes the issue disappear. > > Yeah, it's weird. My ack on the patch still stands. Just adding some additional data. > I just tested this patch without the previous ATS disabling patch (I verified that ATS was enabled). Doing a stress-test kernel build while running a 3D graphical application caused no disk corruption. That was running for several hours. If it's not the solution, it sure hides the problem really well.
> -----Original Message----- > From: Joerg Roedel [mailto:jroedel@suse.de] > Sent: Wednesday, May 24, 2017 4:45 AM > To: Deucher, Alexander > Cc: 'David Woodhouse'; 'Joerg Roedel'; Bjorn Helgaas; linux- > pci@vger.kernel.org; linux-kernel@vger.kernel.org; Daniel Drake; Samuel > Sieb > Subject: Re: [PATCH v2] PCI: Add ATS-disable quirk for AMD Stoney GPUs > > Hi Alexander, > > On Tue, May 23, 2017 at 07:54:12PM +0000, Deucher, Alexander wrote: > > I finally got an answer from the hw team and we validated ATS on > > stoney as well so in theory this patch shouldn’t actually be needed. > > I think we may actually be papering over some other issue. The > > following patch seems to also fix this issue (and other issues): > > https://www.spinics.net/lists/stable/msg172631.html > > Yeah, but it still looks to me like that the hardware got into some > weird state with the storm of ATS invalidations sent to it. > > The Completion-Wait loop timeouts seen in the original bug report > indicate that the IOMMU is waiting for a response that never comes. And > this is probably the ATS flush completion response from the GPU, as > disabling ATS on the GPU makes the issue disappear. FWIW, the GPU driver does not actually use ATS at the moment so I don't think we should see any ATS transactions. Alex
On Fri, 2017-05-26 at 11:57 +0000, Deucher, Alexander wrote: > > FWIW, the GPU driver does not actually use ATS at the moment so I > don't think we should see any ATS transactions. That's a confusing sentence. The "GPU driver", if you mean software running in the OS, wouldn't be expected to have anything to do with ATS. ATS is something that the CPU itself (or its DMA engine) would do. Instead of just performing a DMA transaction to a given bus address, and letting the IOMMU do the translation, the hardware might choose to first perform an IOTLB lookup, and then later do the actual DMA transaction to the pre-translated, raw physical address. Which kind of makes a mockery of any kind of protection the IOMMU is supposed to give you, but does shave a cycle or two of latency off the DMA when it finally happens, since the translation can be done in advance.
> -----Original Message----- > From: David Woodhouse [mailto:dwmw2@infradead.org] > Sent: Friday, May 26, 2017 8:55 AM > To: Deucher, Alexander; 'Joerg Roedel' > Cc: 'Joerg Roedel'; Bjorn Helgaas; linux-pci@vger.kernel.org; linux- > kernel@vger.kernel.org; Daniel Drake; Samuel Sieb > Subject: Re: [PATCH v2] PCI: Add ATS-disable quirk for AMD Stoney GPUs > > On Fri, 2017-05-26 at 11:57 +0000, Deucher, Alexander wrote: > > > > FWIW, the GPU driver does not actually use ATS at the moment so I > > don't think we should see any ATS transactions. > > That's a confusing sentence. The "GPU driver", if you mean software > running in the OS, wouldn't be expected to have anything to do with > ATS. > > ATS is something that the CPU itself (or its DMA engine) would do. > Instead of just performing a DMA transaction to a given bus address, > and letting the IOMMU do the translation, the hardware might choose to > first perform an IOTLB lookup, and then later do the actual DMA > transaction to the pre-translated, raw physical address. Which kind of > makes a mockery of any kind of protection the IOMMU is supposed to give > you, but does shave a cycle or two of latency off the DMA when it > finally happens, since the translation can be done in advance. + John, Suravee Full disclosure, I'm not by any means an expert with ATS. I guess I'm thinking of PRI support rather than ATS per se. On the GPU side the GPU's memory controller has multiple paths to system memory, the non-ATS/PRI path and the ATS/PRI path. The GPU has its own integrated MMU to virtualize the GPU's internal address space per GPU client. The non-ATS/PRI path uses the GPU's MMU and is just "regular" dma to addresses potentially translated by the IOMMU just like any other device that may not have ATS support. The system memory has to be resident because if the GPU faults, it can't retry the transaction. For the ATS/PRI path, the GPU's MMU is bypassed and PASIDs need to be setup on the IOMMU for each client, but once done, transactions that use that interface support retries on GPU page faults (after the OS had paged the memory in and the IOMMU tables been updated) and other features. I think only the ATS/PRI case uses the ATC on the end point. John, Suravee, correct me if I'm wrong. Alex
Hi Bjorn, On Fri, Apr 07, 2017 at 04:32:18PM +0200, Joerg Roedel wrote: > From: Joerg Roedel <jroedel@suse.de> > > ATS is broken on this hardware and causes IOMMU stalls and > system failure. Disable ATS on these devices to make them > usable again with IOMMU enabled. > > Note that the commit in the Fixes-tag is not buggy, it > just uncovers the problem in the hardware by increasing > the ATS-flush rate. > > Fixes: b1516a14657a ('iommu/amd: Implement flush queue') > Signed-off-by: Joerg Roedel <jroedel@suse.de> > --- > drivers/pci/quirks.c | 19 +++++++++++++++++++ > 1 file changed, 19 insertions(+) Any more objections on this patch? Please let me know if you want to have something changed. Regards, Joerg
On 06/15/2017 07:04 AM, Joerg Roedel wrote: > Hi Bjorn, > > On Fri, Apr 07, 2017 at 04:32:18PM +0200, Joerg Roedel wrote: >> From: Joerg Roedel <jroedel@suse.de> >> >> ATS is broken on this hardware and causes IOMMU stalls and >> system failure. Disable ATS on these devices to make them >> usable again with IOMMU enabled. >> >> Note that the commit in the Fixes-tag is not buggy, it >> just uncovers the problem in the hardware by increasing >> the ATS-flush rate. >> >> Fixes: b1516a14657a ('iommu/amd: Implement flush queue') >> Signed-off-by: Joerg Roedel <jroedel@suse.de> >> --- >> drivers/pci/quirks.c | 19 +++++++++++++++++++ >> 1 file changed, 19 insertions(+) > > Any more objections on this patch? Please let me know if you want to > have something changed. The other patch seems to fix this issue without disabling ATS. Isn't that better?
On Thu, Apr 20, 2017 at 02:11:42PM +0200, Joerg Roedel wrote: > On Sat, Apr 08, 2017 at 09:41:07AM +0200, Lukas Wunner wrote: > > On Fri, Apr 07, 2017 at 04:32:18PM +0200, Joerg Roedel wrote: > > > From: Joerg Roedel <jroedel@suse.de> > > > > > > ATS is broken on this hardware and causes IOMMU stalls and > > > system failure. Disable ATS on these devices to make them > > > usable again with IOMMU enabled. > > > > AMD Stoney Ridge is an x86 CPU + GPU combo and this quirk pertains > > to the GPU, right? > > > > In that case the quirk should go to arch/x86. Paul Menzel (+cc) > > has just complained on linux-pci@ that final fixups are taking half > > a second, and I think that could be reduced if more efforts were > > spent to move arch-specific quirks out of the catch-all in > > drivers/pci/quirks.c. > > The affected hardware here might be x86-only, but ATS is not. If a > broken ATS-capable plug-in card appears, we need this in generic code > anyway. It could go in either arch/x86/pci/fixup.c or drivers/pci/quirks.c. It's not clear to me exactly what the hardware defect is or where it is. If it's in the CPU or in a GPU that can only be found on x86, I think arch/x86/pci/fixup.c is the appropriate place. If it's in a GPU that could be found on other arches, drivers/pci/quirks.c would be the appropriate place. I don't personally think the possibility of a plugin card with broken ATS is a real reason to put this quirk in drivers/pci/quirks.c. It's a trivial patch and easy to copy or move later if we need to. Bjorn
> -----Original Message----- > From: Samuel Sieb [mailto:samuel@sieb.net] > Sent: Thursday, June 15, 2017 1:02 PM > To: Joerg Roedel; Bjorn Helgaas > Cc: linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org; Daniel Drake; > Deucher, Alexander; David Woodhouse > Subject: Re: [PATCH v2] PCI: Add ATS-disable quirk for AMD Stoney GPUs > > On 06/15/2017 07:04 AM, Joerg Roedel wrote: > > Hi Bjorn, > > > > On Fri, Apr 07, 2017 at 04:32:18PM +0200, Joerg Roedel wrote: > >> From: Joerg Roedel <jroedel@suse.de> > >> > >> ATS is broken on this hardware and causes IOMMU stalls and > >> system failure. Disable ATS on these devices to make them > >> usable again with IOMMU enabled. > >> > >> Note that the commit in the Fixes-tag is not buggy, it > >> just uncovers the problem in the hardware by increasing > >> the ATS-flush rate. > >> > >> Fixes: b1516a14657a ('iommu/amd: Implement flush queue') > >> Signed-off-by: Joerg Roedel <jroedel@suse.de> > >> --- > >> drivers/pci/quirks.c | 19 +++++++++++++++++++ > >> 1 file changed, 19 insertions(+) > > > > Any more objections on this patch? Please let me know if you want to > > have something changed. > > The other patch seems to fix this issue without disabling ATS. Isn't > that better? I talked to our validation team and ATS was validated on Stoney, so this patch is just working around something else. The other patch fixes it and is a valid optimization (it should be applied eventually), but apparently the current behavior is allowed even if it's now optimal. I'm not really an ATS expert. Alex
On Thu, Jun 15, 2017 at 04:04:21PM +0200, Joerg Roedel wrote: > Hi Bjorn, > > On Fri, Apr 07, 2017 at 04:32:18PM +0200, Joerg Roedel wrote: > > From: Joerg Roedel <jroedel@suse.de> > > > > ATS is broken on this hardware and causes IOMMU stalls and > > system failure. Disable ATS on these devices to make them > > usable again with IOMMU enabled. > > > > Note that the commit in the Fixes-tag is not buggy, it > > just uncovers the problem in the hardware by increasing > > the ATS-flush rate. > > > > Fixes: b1516a14657a ('iommu/amd: Implement flush queue') > > Signed-off-by: Joerg Roedel <jroedel@suse.de> > > --- > > drivers/pci/quirks.c | 19 +++++++++++++++++++ > > 1 file changed, 19 insertions(+) > > Any more objections on this patch? Please let me know if you want to > have something changed. It was marked "superseded" in patchwork and thus off my radar. I don't remember if I did that or why. I changed it back to "New" so I won't forget about it. You mention (May 24) the original bug report. Can you include the URL for that? I admit I just don't have warm fuzzies that the problem is well understood. Bjorn
Hi Bjorn, On Thu, Jun 15, 2017 at 02:15:45PM -0500, Bjorn Helgaas wrote: > It was marked "superseded" in patchwork and thus off my radar. I > don't remember if I did that or why. I changed it back to "New" so I > won't forget about it. Great! > You mention (May 24) the original bug report. Can you include the URL > for that? I think there were multiple reports, here is one I could still find: https://lists.linuxfoundation.org/pipermail/iommu/2017-March/020836.html > I admit I just don't have warm fuzzies that the problem is well > understood. The current understanding (without my ability to debug the hardware involved) is that the GPU in the Stoney systems gets into a weird state when ATS invalidations are sent too fast and stops responding to the iommu. The iommu then can't complete the invalidation commands and the driver throws completion-wait loop timeout messages out. Joerg
On Fri, Jun 16, 2017 at 06:29:23PM +0200, Joerg Roedel wrote: > Hi Bjorn, > > On Thu, Jun 15, 2017 at 02:15:45PM -0500, Bjorn Helgaas wrote: > > It was marked "superseded" in patchwork and thus off my radar. I > > don't remember if I did that or why. I changed it back to "New" so I > > won't forget about it. > > Great! > > > You mention (May 24) the original bug report. Can you include the URL > > for that? > > I think there were multiple reports, here is one I could still find: > > https://lists.linuxfoundation.org/pipermail/iommu/2017-March/020836.html > > > I admit I just don't have warm fuzzies that the problem is well > > understood. > > The current understanding (without my ability to debug the hardware > involved) is that the GPU in the Stoney systems gets into a weird state > when ATS invalidations are sent too fast and stops responding to the > iommu. > > The iommu then can't complete the invalidation commands and the driver > throws completion-wait loop timeout messages out. I'm still confused. Per Samuel (6dd9dbac-9b65-bc7c-bb08-413a05d09fc8@sieb.net): Samuel> The other patch seems to fix this issue without disabling ATS. Samuel> Isn't that better? and Alex (BN6PR12MB1652DF4130FC792B71DD9974F7C00@BN6PR12MB1652.namprd12.prod.outlook.com): Alex> I talked to our validation team and ATS was validated on Stoney, Alex> so this patch is just working around something else. The other Alex> patch fixes it and is a valid optimization ... I'm confused about what this "other patch" is and whether we want that one, this one, or both. Bjorn
Hi Bjorn, On Mon, Jul 10, 2017 at 11:53:58AM -0500, Bjorn Helgaas wrote: > I'm still confused. Per Samuel > (6dd9dbac-9b65-bc7c-bb08-413a05d09fc8@sieb.net): > > Samuel> The other patch seems to fix this issue without disabling ATS. > Samuel> Isn't that better? > > and Alex > (BN6PR12MB1652DF4130FC792B71DD9974F7C00@BN6PR12MB1652.namprd12.prod.outlook.com): > > Alex> I talked to our validation team and ATS was validated on Stoney, > Alex> so this patch is just working around something else. The other > Alex> patch fixes it and is a valid optimization ... > > I'm confused about what this "other patch" is and whether we want that > one, this one, or both. The other patches floating around lowered the ATS flush-rate from the AMD IOMMU driver, which makes the issue disappear as well. But the issue only disappeared, it is not solved and could probably still be reproduced with a GPU usage pattern that increases the ATS flush-rate. So blacklisting the device for ATS is still the safest thing we could do here. Regards, Joerg
> -----Original Message----- > From: Joerg Roedel [mailto:jroedel@suse.de] > Sent: Tuesday, July 11, 2017 7:50 AM > To: Bjorn Helgaas > Cc: Bjorn Helgaas; linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org; > Daniel Drake; Deucher, Alexander; Samuel Sieb; David Woodhouse > Subject: Re: [PATCH v2] PCI: Add ATS-disable quirk for AMD Stoney GPUs > > Hi Bjorn, > > On Mon, Jul 10, 2017 at 11:53:58AM -0500, Bjorn Helgaas wrote: > > I'm still confused. Per Samuel > > (6dd9dbac-9b65-bc7c-bb08-413a05d09fc8@sieb.net): > > > > Samuel> The other patch seems to fix this issue without disabling ATS. > > Samuel> Isn't that better? > > > > and Alex > > > (BN6PR12MB1652DF4130FC792B71DD9974F7C00@BN6PR12MB1652.namprd1 > 2.prod.outlook.com): > > > > Alex> I talked to our validation team and ATS was validated on Stoney, > > Alex> so this patch is just working around something else. The other > > Alex> patch fixes it and is a valid optimization ... > > > > I'm confused about what this "other patch" is and whether we want that > > one, this one, or both. Here's the other patch: https://lists.freedesktop.org/archives/amd-gfx/2017-May/009421.html > > The other patches floating around lowered the ATS flush-rate from the > AMD IOMMU driver, which makes the issue disappear as well. But the issue > only disappeared, it is not solved and could probably still be > reproduced with a GPU usage pattern that increases the ATS flush-rate. > > So blacklisting the device for ATS is still the safest thing we could do > here. I don't have any objection per se, but I'd hate to add a quirk to disable it only to remove it again in the future if we needed ATS related functionality later. We are in the process of upstreaming KFD support for Carrizo (which is a bigger version of Stoney) and that utilizes ATS related functionality to provide GPU access to pageable memory. There are no immediate requirements for Stoney, but that may change. Alex
On Fri, Apr 07, 2017 at 04:32:18PM +0200, Joerg Roedel wrote: > From: Joerg Roedel <jroedel@suse.de> > > ATS is broken on this hardware and causes IOMMU stalls and > system failure. Disable ATS on these devices to make them > usable again with IOMMU enabled. > > Note that the commit in the Fixes-tag is not buggy, it > just uncovers the problem in the hardware by increasing > the ATS-flush rate. > > Fixes: b1516a14657a ('iommu/amd: Implement flush queue') > Signed-off-by: Joerg Roedel <jroedel@suse.de> Applied with Alex's ack to pci/virtualization for v4.14, thanks! Alex, you seemed a little ambivalent later. If you want to rescind your ack, let me know and I'll remove it. > --- > drivers/pci/quirks.c | 19 +++++++++++++++++++ > 1 file changed, 19 insertions(+) > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c > index 6736836..7cbe316 100644 > --- a/drivers/pci/quirks.c > +++ b/drivers/pci/quirks.c > @@ -4634,3 +4634,22 @@ static void quirk_no_aersid(struct pci_dev *pdev) > DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2031, quirk_no_aersid); > DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2032, quirk_no_aersid); > DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2033, quirk_no_aersid); > + > +#ifdef CONFIG_PCI_ATS > +/* > + * Some devices have a broken ATS implementation causing IOMMU stalls. > + * Don't use ATS for those devices. > + */ > +static void quirk_disable_ats(struct pci_dev *pdev) > +{ > + /* > + * Set pdev->ats_cap = 0 to make pci_enable_ats() bail out > + * early. > + */ > + dev_info(&pdev->dev, "QUIRK: Disabling ATS"); > + pdev->ats_cap = 0; > +} > + > +/* AMD Stoney platform GPU */ > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x98e4, quirk_disable_ats); > +#endif /* CONFIG_PCI_ATS */ > -- > 1.9.1 >
On 07/12/2017 07:56 PM, Bjorn Helgaas wrote: > On Fri, Apr 07, 2017 at 04:32:18PM +0200, Joerg Roedel wrote: >> From: Joerg Roedel <jroedel@suse.de> >> >> ATS is broken on this hardware and causes IOMMU stalls and >> system failure. Disable ATS on these devices to make them >> usable again with IOMMU enabled. >> >> Note that the commit in the Fixes-tag is not buggy, it >> just uncovers the problem in the hardware by increasing >> the ATS-flush rate. >> >> Fixes: b1516a14657a ('iommu/amd: Implement flush queue') >> Signed-off-by: Joerg Roedel <jroedel@suse.de> > > Applied with Alex's ack to pci/virtualization for v4.14, thanks! Is there any chance of getting this into an earlier kernel? This is a pretty devastating bug for users! I'm currently providing patched kernels, but if they run upgrades and get a new kernel without noticing it, any filesystem they access will get completely mangled.
On Tue, Aug 29, 2017 at 01:02:41PM -0700, Samuel Sieb wrote: > On 07/12/2017 07:56 PM, Bjorn Helgaas wrote: > >On Fri, Apr 07, 2017 at 04:32:18PM +0200, Joerg Roedel wrote: > >>From: Joerg Roedel <jroedel@suse.de> > >> > >>ATS is broken on this hardware and causes IOMMU stalls and > >>system failure. Disable ATS on these devices to make them > >>usable again with IOMMU enabled. > >> > >>Note that the commit in the Fixes-tag is not buggy, it > >>just uncovers the problem in the hardware by increasing > >>the ATS-flush rate. > >> > >>Fixes: b1516a14657a ('iommu/amd: Implement flush queue') > >>Signed-off-by: Joerg Roedel <jroedel@suse.de> > > > >Applied with Alex's ack to pci/virtualization for v4.14, thanks! > > Is there any chance of getting this into an earlier kernel? This is > a pretty devastating bug for users! I'm currently providing patched > kernels, but if they run upgrades and get a new kernel without > noticing it, any filesystem they access will get completely mangled. I assume you're looking to get this into stable kernels or distro update kernels. I don't personally deal with either of those, but for stable kernels, see https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/stable-kernel-rules.rst For distro update kernels, you'd have to talk to the distro folks, and I don't have contacts for those. Bjorn
On Tue, Aug 29, 2017 at 01:02:41PM -0700, Samuel Sieb wrote: > On 07/12/2017 07:56 PM, Bjorn Helgaas wrote: > >Applied with Alex's ack to pci/virtualization for v4.14, thanks! > > Is there any chance of getting this into an earlier kernel? This is a > pretty devastating bug for users! I'm currently providing patched kernels, > but if they run upgrades and get a new kernel without noticing it, any > filesystem they access will get completely mangled. The patch is queued for v4.14, so it will probably be picked up by distributions when v4.14-rc1 is released. At least this will be the case for the affected (Open)SUSE distributions. Joerg
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c index 6736836..7cbe316 100644 --- a/drivers/pci/quirks.c +++ b/drivers/pci/quirks.c @@ -4634,3 +4634,22 @@ static void quirk_no_aersid(struct pci_dev *pdev) DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2031, quirk_no_aersid); DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2032, quirk_no_aersid); DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2033, quirk_no_aersid); + +#ifdef CONFIG_PCI_ATS +/* + * Some devices have a broken ATS implementation causing IOMMU stalls. + * Don't use ATS for those devices. + */ +static void quirk_disable_ats(struct pci_dev *pdev) +{ + /* + * Set pdev->ats_cap = 0 to make pci_enable_ats() bail out + * early. + */ + dev_info(&pdev->dev, "QUIRK: Disabling ATS"); + pdev->ats_cap = 0; +} + +/* AMD Stoney platform GPU */ +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x98e4, quirk_disable_ats); +#endif /* CONFIG_PCI_ATS */