[v2] PCI: Add ATS-disable quirk for AMD Stoney GPUs
diff mbox

Message ID 1491575538-22694-1-git-send-email-joro@8bytes.org
State New, archived
Headers show

Commit Message

Joerg Roedel April 7, 2017, 2:32 p.m. UTC
From: Joerg Roedel <jroedel@suse.de>

ATS is broken on this hardware and causes IOMMU stalls and
system failure. Disable ATS on these devices to make them
usable again with IOMMU enabled.

Note that the commit in the Fixes-tag is not buggy, it
just uncovers the problem in the hardware by increasing
the ATS-flush rate.

Fixes: b1516a14657a ('iommu/amd: Implement flush queue')
Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 drivers/pci/quirks.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

Comments

Deucher, Alexander April 7, 2017, 4:46 p.m. UTC | #1
> -----Original Message-----
> From: Joerg Roedel [mailto:joro@8bytes.org]
> Sent: Friday, April 07, 2017 10:32 AM
> To: Bjorn Helgaas
> Cc: linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org; Daniel Drake;
> Deucher, Alexander; Samuel Sieb; David Woodhouse; Joerg Roedel
> Subject: [PATCH v2] PCI: Add ATS-disable quirk for AMD Stoney GPUs
> 
> From: Joerg Roedel <jroedel@suse.de>
> 
> ATS is broken on this hardware and causes IOMMU stalls and
> system failure. Disable ATS on these devices to make them
> usable again with IOMMU enabled.
> 
> Note that the commit in the Fixes-tag is not buggy, it
> just uncovers the problem in the hardware by increasing
> the ATS-flush rate.
> 
> Fixes: b1516a14657a ('iommu/amd: Implement flush queue')
> Signed-off-by: Joerg Roedel <jroedel@suse.de>

Acked-by: Alex Deucher <alexander.deucher@amd.com>

> ---
>  drivers/pci/quirks.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 6736836..7cbe316 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -4634,3 +4634,22 @@ static void quirk_no_aersid(struct pci_dev *pdev)
>  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2031,
> quirk_no_aersid);
>  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2032,
> quirk_no_aersid);
>  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2033,
> quirk_no_aersid);
> +
> +#ifdef CONFIG_PCI_ATS
> +/*
> + * Some devices have a broken ATS implementation causing IOMMU stalls.
> + * Don't use ATS for those devices.
> + */
> +static void quirk_disable_ats(struct pci_dev *pdev)
> +{
> +	/*
> +	 * Set pdev->ats_cap = 0 to make pci_enable_ats() bail out
> +	 * early.
> +	 */
> +	dev_info(&pdev->dev, "QUIRK: Disabling ATS");
> +	pdev->ats_cap = 0;
> +}
> +
> +/* AMD Stoney platform GPU */
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x98e4,
> quirk_disable_ats);
> +#endif /* CONFIG_PCI_ATS */
> --
> 1.9.1
Lukas Wunner April 8, 2017, 7:41 a.m. UTC | #2
On Fri, Apr 07, 2017 at 04:32:18PM +0200, Joerg Roedel wrote:
> From: Joerg Roedel <jroedel@suse.de>
> 
> ATS is broken on this hardware and causes IOMMU stalls and
> system failure. Disable ATS on these devices to make them
> usable again with IOMMU enabled.

AMD Stoney Ridge is an x86 CPU + GPU combo and this quirk pertains
to the GPU, right?

In that case the quirk should go to arch/x86.  Paul Menzel (+cc)
has just complained on linux-pci@ that final fixups are taking half
a second, and I think that could be reduced if more efforts were
spent to move arch-specific quirks out of the catch-all in
drivers/pci/quirks.c.

Thanks,

Lukas

> 
> Note that the commit in the Fixes-tag is not buggy, it
> just uncovers the problem in the hardware by increasing
> the ATS-flush rate.
> 
> Fixes: b1516a14657a ('iommu/amd: Implement flush queue')
> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
>  drivers/pci/quirks.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 6736836..7cbe316 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -4634,3 +4634,22 @@ static void quirk_no_aersid(struct pci_dev *pdev)
>  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2031, quirk_no_aersid);
>  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2032, quirk_no_aersid);
>  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2033, quirk_no_aersid);
> +
> +#ifdef CONFIG_PCI_ATS
> +/*
> + * Some devices have a broken ATS implementation causing IOMMU stalls.
> + * Don't use ATS for those devices.
> + */
> +static void quirk_disable_ats(struct pci_dev *pdev)
> +{
> +	/*
> +	 * Set pdev->ats_cap = 0 to make pci_enable_ats() bail out
> +	 * early.
> +	 */
> +	dev_info(&pdev->dev, "QUIRK: Disabling ATS");
> +	pdev->ats_cap = 0;
> +}
> +
> +/* AMD Stoney platform GPU */
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x98e4, quirk_disable_ats);
> +#endif /* CONFIG_PCI_ATS */
> -- 
> 1.9.1
>
Joerg Roedel April 20, 2017, 12:11 p.m. UTC | #3
On Sat, Apr 08, 2017 at 09:41:07AM +0200, Lukas Wunner wrote:
> On Fri, Apr 07, 2017 at 04:32:18PM +0200, Joerg Roedel wrote:
> > From: Joerg Roedel <jroedel@suse.de>
> > 
> > ATS is broken on this hardware and causes IOMMU stalls and
> > system failure. Disable ATS on these devices to make them
> > usable again with IOMMU enabled.
> 
> AMD Stoney Ridge is an x86 CPU + GPU combo and this quirk pertains
> to the GPU, right?
> 
> In that case the quirk should go to arch/x86.  Paul Menzel (+cc)
> has just complained on linux-pci@ that final fixups are taking half
> a second, and I think that could be reduced if more efforts were
> spent to move arch-specific quirks out of the catch-all in
> drivers/pci/quirks.c.

The affected hardware here might be x86-only, but ATS is not. If a
broken ATS-capable plug-in card appears, we need this in generic code
anyway.

Also has anyone profiled why the fixups take so long (and on what
hardware)? Maybe the fixup-device matching can be improved instead of
cluttering arch-code with pci-fixups.


	Joerg
David Woodhouse May 4, 2017, 10:21 a.m. UTC | #4
On Fri, 2017-04-07 at 16:46 +0000, Deucher, Alexander wrote:
> > 
> > -----Original Message-----
> > From: Joerg Roedel [mailto:joro@8bytes.org]
> > Sent: Friday, April 07, 2017 10:32 AM
> > To: Bjorn Helgaas
> > Cc: linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org; Daniel Drake;
> > Deucher, Alexander; Samuel Sieb; David Woodhouse; Joerg Roedel
> > Subject: [PATCH v2] PCI: Add ATS-disable quirk for AMD Stoney GPUs
> > 
> > From: Joerg Roedel <jroedel@suse.de>
> > 
> > ATS is broken on this hardware and causes IOMMU stalls and
> > system failure. Disable ATS on these devices to make them
> > usable again with IOMMU enabled.
> > 
> > Note that the commit in the Fixes-tag is not buggy, it
> > just uncovers the problem in the hardware by increasing
> > the ATS-flush rate.
> > 
> > Fixes: b1516a14657a ('iommu/amd: Implement flush queue')
> > Signed-off-by: Joerg Roedel <jroedel@suse.de>
> Acked-by: Alex Deucher <alexander.deucher@amd.com>

Alex, are you able to confirm that it is *only* the device with PCI ID
0x98e4 which has this problem, or (more likely) come up with an
exhaustive list? Thanks.

We'll want the same blacklist in Xen too, won't we?

> > 
> > ---
> >  drivers/pci/quirks.c | 19 +++++++++++++++++++
> >  1 file changed, 19 insertions(+)
> > 
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > index 6736836..7cbe316 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -4634,3 +4634,22 @@ static void quirk_no_aersid(struct pci_dev *pdev)
> >  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2031,
> > quirk_no_aersid);
> >  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2032,
> > quirk_no_aersid);
> >  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2033,
> > quirk_no_aersid);
> > +
> > +#ifdef CONFIG_PCI_ATS
> > +/*
> > + * Some devices have a broken ATS implementation causing IOMMU stalls.
> > + * Don't use ATS for those devices.
> > + */
> > +static void quirk_disable_ats(struct pci_dev *pdev)
> > +{
> > +	/*
> > +	 * Set pdev->ats_cap = 0 to make pci_enable_ats() bail out
> > +	 * early.
> > +	 */
> > +	dev_info(&pdev->dev, "QUIRK: Disabling ATS");
> > +	pdev->ats_cap = 0;
> > +}
> > +
> > +/* AMD Stoney platform GPU */
> > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x98e4, quirk_disable_ats);
> > +#endif /* CONFIG_PCI_ATS */
> > --
> > 1.9.1
Deucher, Alexander May 4, 2017, 2:41 p.m. UTC | #5
> -----Original Message-----

> From: David Woodhouse [mailto:dwmw2@infradead.org]

> Sent: Thursday, May 04, 2017 6:22 AM

> To: Deucher, Alexander; 'Joerg Roedel'; Bjorn Helgaas

> Cc: linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org; Daniel Drake;

> Samuel Sieb; Joerg Roedel

> Subject: Re: [PATCH v2] PCI: Add ATS-disable quirk for AMD Stoney GPUs

> 

> On Fri, 2017-04-07 at 16:46 +0000, Deucher, Alexander wrote:

> > >

> > > -----Original Message-----

> > > From: Joerg Roedel [mailto:joro@8bytes.org]

> > > Sent: Friday, April 07, 2017 10:32 AM

> > > To: Bjorn Helgaas

> > > Cc: linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org; Daniel

> Drake;

> > > Deucher, Alexander; Samuel Sieb; David Woodhouse; Joerg Roedel

> > > Subject: [PATCH v2] PCI: Add ATS-disable quirk for AMD Stoney GPUs

> > >

> > > From: Joerg Roedel <jroedel@suse.de>

> > >

> > > ATS is broken on this hardware and causes IOMMU stalls and

> > > system failure. Disable ATS on these devices to make them

> > > usable again with IOMMU enabled.

> > >

> > > Note that the commit in the Fixes-tag is not buggy, it

> > > just uncovers the problem in the hardware by increasing

> > > the ATS-flush rate.

> > >

> > > Fixes: b1516a14657a ('iommu/amd: Implement flush queue')

> > > Signed-off-by: Joerg Roedel <jroedel@suse.de>

> > Acked-by: Alex Deucher <alexander.deucher@amd.com>

> 

> Alex, are you able to confirm that it is *only* the device with PCI ID

> 0x98e4 which has this problem, or (more likely) come up with an

> exhaustive list? Thanks.


It's just Stoney, that is the only ID.  ATS is validated on the other GPU parts.

Alex

> 

> We'll want the same blacklist in Xen too, won't we?

> 

> > >

> > > ---

> > >  drivers/pci/quirks.c | 19 +++++++++++++++++++

> > >  1 file changed, 19 insertions(+)

> > >

> > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c

> > > index 6736836..7cbe316 100644

> > > --- a/drivers/pci/quirks.c

> > > +++ b/drivers/pci/quirks.c

> > > @@ -4634,3 +4634,22 @@ static void quirk_no_aersid(struct pci_dev

> *pdev)

> > >  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2031,

> > > quirk_no_aersid);

> > >  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2032,

> > > quirk_no_aersid);

> > >  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2033,

> > > quirk_no_aersid);

> > > +

> > > +#ifdef CONFIG_PCI_ATS

> > > +/*

> > > + * Some devices have a broken ATS implementation causing IOMMU

> stalls.

> > > + * Don't use ATS for those devices.

> > > + */

> > > +static void quirk_disable_ats(struct pci_dev *pdev)

> > > +{

> > > +	/*

> > > +	 * Set pdev->ats_cap = 0 to make pci_enable_ats() bail out

> > > +	 * early.

> > > +	 */

> > > +	dev_info(&pdev->dev, "QUIRK: Disabling ATS");

> > > +	pdev->ats_cap = 0;

> > > +}

> > > +

> > > +/* AMD Stoney platform GPU */

> > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x98e4,

> quirk_disable_ats);

> > > +#endif /* CONFIG_PCI_ATS */

> > > --

> > > 1.9.1
Deucher, Alexander May 23, 2017, 7:54 p.m. UTC | #6
> -----Original Message-----

> From: David Woodhouse [mailto:dwmw2@infradead.org]

> Sent: Thursday, May 04, 2017 6:22 AM

> To: Deucher, Alexander; 'Joerg Roedel'; Bjorn Helgaas

> Cc: linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org; Daniel Drake;

> Samuel Sieb; Joerg Roedel

> Subject: Re: [PATCH v2] PCI: Add ATS-disable quirk for AMD Stoney GPUs

> 

> On Fri, 2017-04-07 at 16:46 +0000, Deucher, Alexander wrote:

> > >

> > > -----Original Message-----

> > > From: Joerg Roedel [mailto:joro@8bytes.org]

> > > Sent: Friday, April 07, 2017 10:32 AM

> > > To: Bjorn Helgaas

> > > Cc: linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org; Daniel

> Drake;

> > > Deucher, Alexander; Samuel Sieb; David Woodhouse; Joerg Roedel

> > > Subject: [PATCH v2] PCI: Add ATS-disable quirk for AMD Stoney GPUs

> > >

> > > From: Joerg Roedel <jroedel@suse.de>

> > >

> > > ATS is broken on this hardware and causes IOMMU stalls and

> > > system failure. Disable ATS on these devices to make them

> > > usable again with IOMMU enabled.

> > >

> > > Note that the commit in the Fixes-tag is not buggy, it

> > > just uncovers the problem in the hardware by increasing

> > > the ATS-flush rate.

> > >

> > > Fixes: b1516a14657a ('iommu/amd: Implement flush queue')

> > > Signed-off-by: Joerg Roedel <jroedel@suse.de>

> > Acked-by: Alex Deucher <alexander.deucher@amd.com>

> 

> Alex, are you able to confirm that it is *only* the device with PCI ID

> 0x98e4 which has this problem, or (more likely) come up with an

> exhaustive list? Thanks.

> 

> We'll want the same blacklist in Xen too, won't we?


I finally got an answer from the hw team and we validated ATS on stoney as well so in theory this patch shouldn’t actually be needed.  I think we may actually be papering over some other issue.  The following patch seems to also fix this issue (and other issues):
https://www.spinics.net/lists/stable/msg172631.html

Alex

> 

> > >

> > > ---

> > >  drivers/pci/quirks.c | 19 +++++++++++++++++++

> > >  1 file changed, 19 insertions(+)

> > >

> > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c

> > > index 6736836..7cbe316 100644

> > > --- a/drivers/pci/quirks.c

> > > +++ b/drivers/pci/quirks.c

> > > @@ -4634,3 +4634,22 @@ static void quirk_no_aersid(struct pci_dev

> *pdev)

> > >  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2031,

> > > quirk_no_aersid);

> > >  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2032,

> > > quirk_no_aersid);

> > >  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2033,

> > > quirk_no_aersid);

> > > +

> > > +#ifdef CONFIG_PCI_ATS

> > > +/*

> > > + * Some devices have a broken ATS implementation causing IOMMU

> stalls.

> > > + * Don't use ATS for those devices.

> > > + */

> > > +static void quirk_disable_ats(struct pci_dev *pdev)

> > > +{

> > > +	/*

> > > +	 * Set pdev->ats_cap = 0 to make pci_enable_ats() bail out

> > > +	 * early.

> > > +	 */

> > > +	dev_info(&pdev->dev, "QUIRK: Disabling ATS");

> > > +	pdev->ats_cap = 0;

> > > +}

> > > +

> > > +/* AMD Stoney platform GPU */

> > > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x98e4,

> quirk_disable_ats);

> > > +#endif /* CONFIG_PCI_ATS */

> > > --

> > > 1.9.1
Joerg Roedel May 24, 2017, 8:44 a.m. UTC | #7
Hi Alexander,

On Tue, May 23, 2017 at 07:54:12PM +0000, Deucher, Alexander wrote:
> I finally got an answer from the hw team and we validated ATS on
> stoney as well so in theory this patch shouldn’t actually be needed.
> I think we may actually be papering over some other issue.  The
> following patch seems to also fix this issue (and other issues):
> https://www.spinics.net/lists/stable/msg172631.html

Yeah, but it still looks to me like that the hardware got into some
weird state with the storm of ATS invalidations sent to it.

The Completion-Wait loop timeouts seen in the original bug report
indicate that the IOMMU is waiting for a response that never comes. And
this is probably the ATS flush completion response from the GPU, as
disabling ATS on the GPU makes the issue disappear.

Regards,

	Joerg
David Woodhouse May 24, 2017, 10:38 a.m. UTC | #8
On Wed, 2017-05-24 at 10:44 +0200, Joerg Roedel wrote:
> Hi Alexander,
> 
> On Tue, May 23, 2017 at 07:54:12PM +0000, Deucher, Alexander wrote:
> > 
> > I finally got an answer from the hw team and we validated ATS on
> > stoney as well so in theory this patch shouldn’t actually be needed.
> > I think we may actually be papering over some other issue.  The
> > following patch seems to also fix this issue (and other issues):
> > https://www.spinics.net/lists/stable/msg172631.html
>
> Yeah, but it still looks to me like that the hardware got into some
> weird state with the storm of ATS invalidations sent to it.
> 
> The Completion-Wait loop timeouts seen in the original bug report
> indicate that the IOMMU is waiting for a response that never comes. And
> this is probably the ATS flush completion response from the GPU, as
> disabling ATS on the GPU makes the issue disappear.

The above patch doesn't actually fix any spec violation which could the
GPU an *excuse* to crash and stop responding to invalidations, does it?
It just seems to reduce the invalidation load a little, and thus paper
over the problem that the card tends to crash under load. Absent a more
coherent explanation, it still seems like the correct answer is to
blacklist these devices for ATS because they're broken.
Deucher, Alexander May 24, 2017, 12:56 p.m. UTC | #9
> -----Original Message-----

> From: Joerg Roedel [mailto:jroedel@suse.de]

> Sent: Wednesday, May 24, 2017 4:45 AM

> To: Deucher, Alexander

> Cc: 'David Woodhouse'; 'Joerg Roedel'; Bjorn Helgaas; linux-

> pci@vger.kernel.org; linux-kernel@vger.kernel.org; Daniel Drake; Samuel

> Sieb

> Subject: Re: [PATCH v2] PCI: Add ATS-disable quirk for AMD Stoney GPUs

> 

> Hi Alexander,

> 

> On Tue, May 23, 2017 at 07:54:12PM +0000, Deucher, Alexander wrote:

> > I finally got an answer from the hw team and we validated ATS on

> > stoney as well so in theory this patch shouldn’t actually be needed.

> > I think we may actually be papering over some other issue.  The

> > following patch seems to also fix this issue (and other issues):

> > https://www.spinics.net/lists/stable/msg172631.html

> 

> Yeah, but it still looks to me like that the hardware got into some

> weird state with the storm of ATS invalidations sent to it.

> 

> The Completion-Wait loop timeouts seen in the original bug report

> indicate that the IOMMU is waiting for a response that never comes. And

> this is probably the ATS flush completion response from the GPU, as

> disabling ATS on the GPU makes the issue disappear.


Yeah, it's weird.  My ack on the patch still stands.  Just adding some additional data.

Alex
Samuel Sieb May 26, 2017, 6:48 a.m. UTC | #10
On 05/24/2017 05:56 AM, Deucher, Alexander wrote:
>> -----Original Message-----
>> From: Joerg Roedel [mailto:jroedel@suse.de]
>> Sent: Wednesday, May 24, 2017 4:45 AM
>> To: Deucher, Alexander
>> Cc: 'David Woodhouse'; 'Joerg Roedel'; Bjorn Helgaas; linux-
>> pci@vger.kernel.org; linux-kernel@vger.kernel.org; Daniel Drake; Samuel
>> Sieb
>> Subject: Re: [PATCH v2] PCI: Add ATS-disable quirk for AMD Stoney GPUs
>>
>> Hi Alexander,
>>
>> On Tue, May 23, 2017 at 07:54:12PM +0000, Deucher, Alexander wrote:
>>> I finally got an answer from the hw team and we validated ATS on
>>> stoney as well so in theory this patch shouldn’t actually be needed.
>>> I think we may actually be papering over some other issue.  The
>>> following patch seems to also fix this issue (and other issues):
>>> https://www.spinics.net/lists/stable/msg172631.html
>>
>> Yeah, but it still looks to me like that the hardware got into some
>> weird state with the storm of ATS invalidations sent to it.
>>
>> The Completion-Wait loop timeouts seen in the original bug report
>> indicate that the IOMMU is waiting for a response that never comes. And
>> this is probably the ATS flush completion response from the GPU, as
>> disabling ATS on the GPU makes the issue disappear.
> 
> Yeah, it's weird.  My ack on the patch still stands.  Just adding some additional data.
> 

I just tested this patch without the previous ATS disabling patch (I 
verified that ATS was enabled).  Doing a stress-test kernel build while 
running a 3D graphical application caused no disk corruption.  That was 
running for several hours.  If it's not the solution, it sure hides the 
problem really well.
Deucher, Alexander May 26, 2017, 11:57 a.m. UTC | #11
> -----Original Message-----

> From: Joerg Roedel [mailto:jroedel@suse.de]

> Sent: Wednesday, May 24, 2017 4:45 AM

> To: Deucher, Alexander

> Cc: 'David Woodhouse'; 'Joerg Roedel'; Bjorn Helgaas; linux-

> pci@vger.kernel.org; linux-kernel@vger.kernel.org; Daniel Drake; Samuel

> Sieb

> Subject: Re: [PATCH v2] PCI: Add ATS-disable quirk for AMD Stoney GPUs

> 

> Hi Alexander,

> 

> On Tue, May 23, 2017 at 07:54:12PM +0000, Deucher, Alexander wrote:

> > I finally got an answer from the hw team and we validated ATS on

> > stoney as well so in theory this patch shouldn’t actually be needed.

> > I think we may actually be papering over some other issue.  The

> > following patch seems to also fix this issue (and other issues):

> > https://www.spinics.net/lists/stable/msg172631.html

> 

> Yeah, but it still looks to me like that the hardware got into some

> weird state with the storm of ATS invalidations sent to it.

> 

> The Completion-Wait loop timeouts seen in the original bug report

> indicate that the IOMMU is waiting for a response that never comes. And

> this is probably the ATS flush completion response from the GPU, as

> disabling ATS on the GPU makes the issue disappear.


FWIW, the GPU driver does not actually use ATS at the moment so I don't think we should see any ATS transactions.

Alex
David Woodhouse May 26, 2017, 12:54 p.m. UTC | #12
On Fri, 2017-05-26 at 11:57 +0000, Deucher, Alexander wrote:
> 
> FWIW, the GPU driver does not actually use ATS at the moment so I
> don't think we should see any ATS transactions.

That's a confusing sentence. The "GPU driver", if you mean software
running in the OS, wouldn't be expected to have anything to do with
ATS.

ATS is something that the CPU itself (or its DMA engine) would do.
Instead of just performing a DMA transaction to a given bus address,
and letting the IOMMU do the translation, the hardware might choose to
first perform an IOTLB lookup, and then later do the actual DMA
transaction to the pre-translated, raw physical address. Which kind of
makes a mockery of any kind of protection the IOMMU is supposed to give
you, but does shave a cycle or two of latency off the DMA when it
finally happens, since the translation can be done in advance.
Deucher, Alexander May 26, 2017, 3:59 p.m. UTC | #13
> -----Original Message-----

> From: David Woodhouse [mailto:dwmw2@infradead.org]

> Sent: Friday, May 26, 2017 8:55 AM

> To: Deucher, Alexander; 'Joerg Roedel'

> Cc: 'Joerg Roedel'; Bjorn Helgaas; linux-pci@vger.kernel.org; linux-

> kernel@vger.kernel.org; Daniel Drake; Samuel Sieb

> Subject: Re: [PATCH v2] PCI: Add ATS-disable quirk for AMD Stoney GPUs

> 

> On Fri, 2017-05-26 at 11:57 +0000, Deucher, Alexander wrote:

> >

> > FWIW, the GPU driver does not actually use ATS at the moment so I

> > don't think we should see any ATS transactions.

> 

> That's a confusing sentence. The "GPU driver", if you mean software

> running in the OS, wouldn't be expected to have anything to do with

> ATS.

> 

> ATS is something that the CPU itself (or its DMA engine) would do.

> Instead of just performing a DMA transaction to a given bus address,

> and letting the IOMMU do the translation, the hardware might choose to

> first perform an IOTLB lookup, and then later do the actual DMA

> transaction to the pre-translated, raw physical address. Which kind of

> makes a mockery of any kind of protection the IOMMU is supposed to give

> you, but does shave a cycle or two of latency off the DMA when it

> finally happens, since the translation can be done in advance.


+ John, Suravee

Full disclosure, I'm not by any means an expert with ATS.  I guess I'm thinking of PRI support rather than ATS per se.  On the GPU side the GPU's memory controller has multiple paths to system memory, the non-ATS/PRI path and the ATS/PRI path.  The GPU has its own integrated MMU to virtualize the GPU's internal address space per GPU client.  The non-ATS/PRI path uses the GPU's MMU and is just "regular" dma to addresses potentially translated by the IOMMU just like any other device that may not have ATS support.  The system memory has to be resident because if the GPU faults, it can't retry the transaction.  For the ATS/PRI path, the GPU's MMU is bypassed and PASIDs need to be setup on the IOMMU for each client, but once done, transactions that use that interface support retries on GPU page faults (after the OS had paged the memory in and the IOMMU tables been updated) and other features.  I think only the ATS/PRI case uses the ATC on the end point.  John, Suravee, correct me if I'm wrong.

Alex
Joerg Roedel June 15, 2017, 2:04 p.m. UTC | #14
Hi Bjorn,

On Fri, Apr 07, 2017 at 04:32:18PM +0200, Joerg Roedel wrote:
> From: Joerg Roedel <jroedel@suse.de>
> 
> ATS is broken on this hardware and causes IOMMU stalls and
> system failure. Disable ATS on these devices to make them
> usable again with IOMMU enabled.
> 
> Note that the commit in the Fixes-tag is not buggy, it
> just uncovers the problem in the hardware by increasing
> the ATS-flush rate.
> 
> Fixes: b1516a14657a ('iommu/amd: Implement flush queue')
> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> ---
>  drivers/pci/quirks.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)

Any more objections on this patch? Please let me know if you want to
have something changed.


Regards,

	Joerg
Samuel Sieb June 15, 2017, 5:01 p.m. UTC | #15
On 06/15/2017 07:04 AM, Joerg Roedel wrote:
> Hi Bjorn,
> 
> On Fri, Apr 07, 2017 at 04:32:18PM +0200, Joerg Roedel wrote:
>> From: Joerg Roedel <jroedel@suse.de>
>>
>> ATS is broken on this hardware and causes IOMMU stalls and
>> system failure. Disable ATS on these devices to make them
>> usable again with IOMMU enabled.
>>
>> Note that the commit in the Fixes-tag is not buggy, it
>> just uncovers the problem in the hardware by increasing
>> the ATS-flush rate.
>>
>> Fixes: b1516a14657a ('iommu/amd: Implement flush queue')
>> Signed-off-by: Joerg Roedel <jroedel@suse.de>
>> ---
>>   drivers/pci/quirks.c | 19 +++++++++++++++++++
>>   1 file changed, 19 insertions(+)
> 
> Any more objections on this patch? Please let me know if you want to
> have something changed.

The other patch seems to fix this issue without disabling ATS.  Isn't 
that better?
Bjorn Helgaas June 15, 2017, 5:12 p.m. UTC | #16
On Thu, Apr 20, 2017 at 02:11:42PM +0200, Joerg Roedel wrote:
> On Sat, Apr 08, 2017 at 09:41:07AM +0200, Lukas Wunner wrote:
> > On Fri, Apr 07, 2017 at 04:32:18PM +0200, Joerg Roedel wrote:
> > > From: Joerg Roedel <jroedel@suse.de>
> > > 
> > > ATS is broken on this hardware and causes IOMMU stalls and
> > > system failure. Disable ATS on these devices to make them
> > > usable again with IOMMU enabled.
> > 
> > AMD Stoney Ridge is an x86 CPU + GPU combo and this quirk pertains
> > to the GPU, right?
> > 
> > In that case the quirk should go to arch/x86.  Paul Menzel (+cc)
> > has just complained on linux-pci@ that final fixups are taking half
> > a second, and I think that could be reduced if more efforts were
> > spent to move arch-specific quirks out of the catch-all in
> > drivers/pci/quirks.c.
> 
> The affected hardware here might be x86-only, but ATS is not. If a
> broken ATS-capable plug-in card appears, we need this in generic code
> anyway.

It could go in either arch/x86/pci/fixup.c or drivers/pci/quirks.c.

It's not clear to me exactly what the hardware defect is or where it
is.  If it's in the CPU or in a GPU that can only be found on x86, I
think arch/x86/pci/fixup.c is the appropriate place.

If it's in a GPU that could be found on other arches,
drivers/pci/quirks.c would be the appropriate place.

I don't personally think the possibility of a plugin card with broken
ATS is a real reason to put this quirk in drivers/pci/quirks.c.  It's
a trivial patch and easy to copy or move later if we need to.

Bjorn
Deucher, Alexander June 15, 2017, 6:13 p.m. UTC | #17
> -----Original Message-----

> From: Samuel Sieb [mailto:samuel@sieb.net]

> Sent: Thursday, June 15, 2017 1:02 PM

> To: Joerg Roedel; Bjorn Helgaas

> Cc: linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org; Daniel Drake;

> Deucher, Alexander; David Woodhouse

> Subject: Re: [PATCH v2] PCI: Add ATS-disable quirk for AMD Stoney GPUs

> 

> On 06/15/2017 07:04 AM, Joerg Roedel wrote:

> > Hi Bjorn,

> >

> > On Fri, Apr 07, 2017 at 04:32:18PM +0200, Joerg Roedel wrote:

> >> From: Joerg Roedel <jroedel@suse.de>

> >>

> >> ATS is broken on this hardware and causes IOMMU stalls and

> >> system failure. Disable ATS on these devices to make them

> >> usable again with IOMMU enabled.

> >>

> >> Note that the commit in the Fixes-tag is not buggy, it

> >> just uncovers the problem in the hardware by increasing

> >> the ATS-flush rate.

> >>

> >> Fixes: b1516a14657a ('iommu/amd: Implement flush queue')

> >> Signed-off-by: Joerg Roedel <jroedel@suse.de>

> >> ---

> >>   drivers/pci/quirks.c | 19 +++++++++++++++++++

> >>   1 file changed, 19 insertions(+)

> >

> > Any more objections on this patch? Please let me know if you want to

> > have something changed.

> 

> The other patch seems to fix this issue without disabling ATS.  Isn't

> that better?


I talked to our validation team and ATS was validated on Stoney, so this patch is just working around something else.  The other patch fixes it and is a valid optimization (it should be applied eventually), but apparently the current behavior is allowed even if it's now optimal.  I'm not really an ATS expert.

Alex
Bjorn Helgaas June 15, 2017, 7:15 p.m. UTC | #18
On Thu, Jun 15, 2017 at 04:04:21PM +0200, Joerg Roedel wrote:
> Hi Bjorn,
> 
> On Fri, Apr 07, 2017 at 04:32:18PM +0200, Joerg Roedel wrote:
> > From: Joerg Roedel <jroedel@suse.de>
> > 
> > ATS is broken on this hardware and causes IOMMU stalls and
> > system failure. Disable ATS on these devices to make them
> > usable again with IOMMU enabled.
> > 
> > Note that the commit in the Fixes-tag is not buggy, it
> > just uncovers the problem in the hardware by increasing
> > the ATS-flush rate.
> > 
> > Fixes: b1516a14657a ('iommu/amd: Implement flush queue')
> > Signed-off-by: Joerg Roedel <jroedel@suse.de>
> > ---
> >  drivers/pci/quirks.c | 19 +++++++++++++++++++
> >  1 file changed, 19 insertions(+)
> 
> Any more objections on this patch? Please let me know if you want to
> have something changed.

It was marked "superseded" in patchwork and thus off my radar.  I
don't remember if I did that or why.  I changed it back to "New" so I
won't forget about it.

You mention (May 24) the original bug report.  Can you include the URL
for that?

I admit I just don't have warm fuzzies that the problem is well
understood.

Bjorn
Joerg Roedel June 16, 2017, 4:29 p.m. UTC | #19
Hi Bjorn,

On Thu, Jun 15, 2017 at 02:15:45PM -0500, Bjorn Helgaas wrote:
> It was marked "superseded" in patchwork and thus off my radar.  I
> don't remember if I did that or why.  I changed it back to "New" so I
> won't forget about it.

Great!

> You mention (May 24) the original bug report.  Can you include the URL
> for that?

I think there were multiple reports, here is one I could still find:

	https://lists.linuxfoundation.org/pipermail/iommu/2017-March/020836.html

> I admit I just don't have warm fuzzies that the problem is well
> understood.

The current understanding (without my ability to debug the hardware
involved) is that the GPU in the Stoney systems gets into a weird state
when ATS invalidations are sent too fast and stops responding to the
iommu.

The iommu then can't complete the invalidation commands and the driver
throws completion-wait loop timeout messages out.



	Joerg
Bjorn Helgaas July 10, 2017, 4:53 p.m. UTC | #20
On Fri, Jun 16, 2017 at 06:29:23PM +0200, Joerg Roedel wrote:
> Hi Bjorn,
> 
> On Thu, Jun 15, 2017 at 02:15:45PM -0500, Bjorn Helgaas wrote:
> > It was marked "superseded" in patchwork and thus off my radar.  I
> > don't remember if I did that or why.  I changed it back to "New" so I
> > won't forget about it.
> 
> Great!
> 
> > You mention (May 24) the original bug report.  Can you include the URL
> > for that?
> 
> I think there were multiple reports, here is one I could still find:
> 
> 	https://lists.linuxfoundation.org/pipermail/iommu/2017-March/020836.html
> 
> > I admit I just don't have warm fuzzies that the problem is well
> > understood.
> 
> The current understanding (without my ability to debug the hardware
> involved) is that the GPU in the Stoney systems gets into a weird state
> when ATS invalidations are sent too fast and stops responding to the
> iommu.
> 
> The iommu then can't complete the invalidation commands and the driver
> throws completion-wait loop timeout messages out.

I'm still confused.  Per Samuel
(6dd9dbac-9b65-bc7c-bb08-413a05d09fc8@sieb.net):

Samuel> The other patch seems to fix this issue without disabling ATS.
Samuel> Isn't that better?

and Alex
(BN6PR12MB1652DF4130FC792B71DD9974F7C00@BN6PR12MB1652.namprd12.prod.outlook.com):

Alex> I talked to our validation team and ATS was validated on Stoney,
Alex> so this patch is just working around something else.  The other
Alex> patch fixes it and is a valid optimization ...

I'm confused about what this "other patch" is and whether we want that
one, this one, or both.

Bjorn
Joerg Roedel July 11, 2017, 11:49 a.m. UTC | #21
Hi Bjorn,

On Mon, Jul 10, 2017 at 11:53:58AM -0500, Bjorn Helgaas wrote:
> I'm still confused.  Per Samuel
> (6dd9dbac-9b65-bc7c-bb08-413a05d09fc8@sieb.net):
> 
> Samuel> The other patch seems to fix this issue without disabling ATS.
> Samuel> Isn't that better?
> 
> and Alex
> (BN6PR12MB1652DF4130FC792B71DD9974F7C00@BN6PR12MB1652.namprd12.prod.outlook.com):
> 
> Alex> I talked to our validation team and ATS was validated on Stoney,
> Alex> so this patch is just working around something else.  The other
> Alex> patch fixes it and is a valid optimization ...
> 
> I'm confused about what this "other patch" is and whether we want that
> one, this one, or both.

The other patches floating around lowered the ATS flush-rate from the
AMD IOMMU driver, which makes the issue disappear as well. But the issue
only disappeared, it is not solved and could probably still be
reproduced with a GPU usage pattern that increases the ATS flush-rate.

So blacklisting the device for ATS is still the safest thing we could do
here.


Regards,

	Joerg
Deucher, Alexander July 11, 2017, 7:08 p.m. UTC | #22
> -----Original Message-----
> From: Joerg Roedel [mailto:jroedel@suse.de]
> Sent: Tuesday, July 11, 2017 7:50 AM
> To: Bjorn Helgaas
> Cc: Bjorn Helgaas; linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org;
> Daniel Drake; Deucher, Alexander; Samuel Sieb; David Woodhouse
> Subject: Re: [PATCH v2] PCI: Add ATS-disable quirk for AMD Stoney GPUs
> 
> Hi Bjorn,
> 
> On Mon, Jul 10, 2017 at 11:53:58AM -0500, Bjorn Helgaas wrote:
> > I'm still confused.  Per Samuel
> > (6dd9dbac-9b65-bc7c-bb08-413a05d09fc8@sieb.net):
> >
> > Samuel> The other patch seems to fix this issue without disabling ATS.
> > Samuel> Isn't that better?
> >
> > and Alex
> >
> (BN6PR12MB1652DF4130FC792B71DD9974F7C00@BN6PR12MB1652.namprd1
> 2.prod.outlook.com):
> >
> > Alex> I talked to our validation team and ATS was validated on Stoney,
> > Alex> so this patch is just working around something else.  The other
> > Alex> patch fixes it and is a valid optimization ...
> >
> > I'm confused about what this "other patch" is and whether we want that
> > one, this one, or both.

Here's the other patch:
https://lists.freedesktop.org/archives/amd-gfx/2017-May/009421.html

> 
> The other patches floating around lowered the ATS flush-rate from the
> AMD IOMMU driver, which makes the issue disappear as well. But the issue
> only disappeared, it is not solved and could probably still be
> reproduced with a GPU usage pattern that increases the ATS flush-rate.
> 
> So blacklisting the device for ATS is still the safest thing we could do
> here.

I don't have any objection per se, but I'd hate to add a quirk to disable it only to remove it again in the future if we needed ATS related functionality later.  We are in the process of upstreaming KFD support for Carrizo (which is a bigger version of Stoney) and that utilizes ATS related functionality to provide GPU access to pageable memory.  There are no immediate requirements for Stoney, but that may change.

Alex
Bjorn Helgaas July 13, 2017, 2:56 a.m. UTC | #23
On Fri, Apr 07, 2017 at 04:32:18PM +0200, Joerg Roedel wrote:
> From: Joerg Roedel <jroedel@suse.de>
> 
> ATS is broken on this hardware and causes IOMMU stalls and
> system failure. Disable ATS on these devices to make them
> usable again with IOMMU enabled.
> 
> Note that the commit in the Fixes-tag is not buggy, it
> just uncovers the problem in the hardware by increasing
> the ATS-flush rate.
> 
> Fixes: b1516a14657a ('iommu/amd: Implement flush queue')
> Signed-off-by: Joerg Roedel <jroedel@suse.de>

Applied with Alex's ack to pci/virtualization for v4.14, thanks!

Alex, you seemed a little ambivalent later.  If you want to rescind
your ack, let me know and I'll remove it.

> ---
>  drivers/pci/quirks.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 6736836..7cbe316 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -4634,3 +4634,22 @@ static void quirk_no_aersid(struct pci_dev *pdev)
>  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2031, quirk_no_aersid);
>  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2032, quirk_no_aersid);
>  DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2033, quirk_no_aersid);
> +
> +#ifdef CONFIG_PCI_ATS
> +/*
> + * Some devices have a broken ATS implementation causing IOMMU stalls.
> + * Don't use ATS for those devices.
> + */
> +static void quirk_disable_ats(struct pci_dev *pdev)
> +{
> +	/*
> +	 * Set pdev->ats_cap = 0 to make pci_enable_ats() bail out
> +	 * early.
> +	 */
> +	dev_info(&pdev->dev, "QUIRK: Disabling ATS");
> +	pdev->ats_cap = 0;
> +}
> +
> +/* AMD Stoney platform GPU */
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x98e4, quirk_disable_ats);
> +#endif /* CONFIG_PCI_ATS */
> -- 
> 1.9.1
>
Samuel Sieb Aug. 29, 2017, 8:02 p.m. UTC | #24
On 07/12/2017 07:56 PM, Bjorn Helgaas wrote:
> On Fri, Apr 07, 2017 at 04:32:18PM +0200, Joerg Roedel wrote:
>> From: Joerg Roedel <jroedel@suse.de>
>>
>> ATS is broken on this hardware and causes IOMMU stalls and
>> system failure. Disable ATS on these devices to make them
>> usable again with IOMMU enabled.
>>
>> Note that the commit in the Fixes-tag is not buggy, it
>> just uncovers the problem in the hardware by increasing
>> the ATS-flush rate.
>>
>> Fixes: b1516a14657a ('iommu/amd: Implement flush queue')
>> Signed-off-by: Joerg Roedel <jroedel@suse.de>
> 
> Applied with Alex's ack to pci/virtualization for v4.14, thanks!

Is there any chance of getting this into an earlier kernel?  This is a 
pretty devastating bug for users!  I'm currently providing patched 
kernels, but if they run upgrades and get a new kernel without noticing 
it, any filesystem they access will get completely mangled.
Bjorn Helgaas Aug. 29, 2017, 8:49 p.m. UTC | #25
On Tue, Aug 29, 2017 at 01:02:41PM -0700, Samuel Sieb wrote:
> On 07/12/2017 07:56 PM, Bjorn Helgaas wrote:
> >On Fri, Apr 07, 2017 at 04:32:18PM +0200, Joerg Roedel wrote:
> >>From: Joerg Roedel <jroedel@suse.de>
> >>
> >>ATS is broken on this hardware and causes IOMMU stalls and
> >>system failure. Disable ATS on these devices to make them
> >>usable again with IOMMU enabled.
> >>
> >>Note that the commit in the Fixes-tag is not buggy, it
> >>just uncovers the problem in the hardware by increasing
> >>the ATS-flush rate.
> >>
> >>Fixes: b1516a14657a ('iommu/amd: Implement flush queue')
> >>Signed-off-by: Joerg Roedel <jroedel@suse.de>
> >
> >Applied with Alex's ack to pci/virtualization for v4.14, thanks!
> 
> Is there any chance of getting this into an earlier kernel?  This is
> a pretty devastating bug for users!  I'm currently providing patched
> kernels, but if they run upgrades and get a new kernel without
> noticing it, any filesystem they access will get completely mangled.

I assume you're looking to get this into stable kernels or distro
update kernels.  I don't personally deal with either of those, but for
stable kernels, see
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/stable-kernel-rules.rst

For distro update kernels, you'd have to talk to the distro folks, and
I don't have contacts for those.

Bjorn
Joerg Roedel Aug. 30, 2017, 11:44 a.m. UTC | #26
On Tue, Aug 29, 2017 at 01:02:41PM -0700, Samuel Sieb wrote:
> On 07/12/2017 07:56 PM, Bjorn Helgaas wrote:
> >Applied with Alex's ack to pci/virtualization for v4.14, thanks!
> 
> Is there any chance of getting this into an earlier kernel?  This is a
> pretty devastating bug for users!  I'm currently providing patched kernels,
> but if they run upgrades and get a new kernel without noticing it, any
> filesystem they access will get completely mangled.

The patch is queued for v4.14, so it will probably be picked up by
distributions when v4.14-rc1 is released. At least this will be the case
for the affected (Open)SUSE distributions.


	Joerg

Patch
diff mbox

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 6736836..7cbe316 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -4634,3 +4634,22 @@  static void quirk_no_aersid(struct pci_dev *pdev)
 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2031, quirk_no_aersid);
 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2032, quirk_no_aersid);
 DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2033, quirk_no_aersid);
+
+#ifdef CONFIG_PCI_ATS
+/*
+ * Some devices have a broken ATS implementation causing IOMMU stalls.
+ * Don't use ATS for those devices.
+ */
+static void quirk_disable_ats(struct pci_dev *pdev)
+{
+	/*
+	 * Set pdev->ats_cap = 0 to make pci_enable_ats() bail out
+	 * early.
+	 */
+	dev_info(&pdev->dev, "QUIRK: Disabling ATS");
+	pdev->ats_cap = 0;
+}
+
+/* AMD Stoney platform GPU */
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x98e4, quirk_disable_ats);
+#endif /* CONFIG_PCI_ATS */