diff mbox series

iommu/intel: quirk to disable DMAR for QM57 igfx

Message ID 20190118121705.a4usvhnskyblooja@dcvr (mailing list archive)
State New, archived
Headers show
Series iommu/intel: quirk to disable DMAR for QM57 igfx | expand

Commit Message

Eric Wong Jan. 18, 2019, 12:17 p.m. UTC
Joonas Lahtinen <joonas.lahtinen@linux.intel.com> wrote:
> Quoting Eric Wong (2019-01-04 03:06:26)
> > Yeah, so the Debian bpo 4.17(.17) kernel did not set
> > CONFIG_INTEL_IOMMU_DEFAULT_ON, so I didn't encounter problems.
> > My self-built kernels all set CONFIG_INTEL_IOMMU_DEFAULT_ON.
> 
> So it's the case that IOMMU never worked on your machine.
> 
> My recommendation would be to simply use intel_iommu=igfx_off if you
> need IOMMU.
> 
> Old hardware is known to have issues with IOMMU, and retroactively
> enabling IOMMU on those machines just brings them up :/

How about we use a quirk in case distros make IOMMU the default
one day?

--------8<--------
Subject: [PATCH] iommu/intel: quirk to disable DMAR for QM57 igfx

Like the GM45, it seems the integrated graphics on QM57 seems
broken and hanging graphics with "intel_iommu=on".  So allow
future users to unconditionally enable DMAR support and not have
to remember or know to specify "intel_iommu=igfx_off"

cf. https://lore.kernel.org/lkml/20181227114948.ev4b3jte3ubsc5us@dcvr/
cf. https://lore.kernel.org/lkml/154659116310.4596.13613897418163029789@jlahtine-desk.ger.corp.intel.com/

Signed-off-by: Eric Wong <e@80x24.org>
---
 drivers/iommu/intel-iommu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Joerg Roedel Jan. 22, 2019, 10:39 a.m. UTC | #1
On Fri, Jan 18, 2019 at 12:17:05PM +0000, Eric Wong wrote:
> @@ -5411,6 +5411,7 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2e20, quirk_iommu_g4x_gfx);
>  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2e30, quirk_iommu_g4x_gfx);
>  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2e40, quirk_iommu_g4x_gfx);
>  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2e90, quirk_iommu_g4x_gfx);
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x0044, quirk_iommu_g4x_gfx);
>  
>  static void quirk_iommu_rwbf(struct pci_dev *dev)
>  {
> @@ -5457,7 +5458,6 @@ static void quirk_calpella_no_shadow_gtt(struct pci_dev *dev)
>         }
>  }
>  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x0040, quirk_calpella_no_shadow_gtt);
> -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x0044, quirk_calpella_no_shadow_gtt);
>  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x0062, quirk_calpella_no_shadow_gtt);
>  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x006a, quirk_calpella_no_shadow_gtt);

This seems to make sense to me. Joonas, any comments or objections?

Regards,

	Joerg
Daniel Vetter Jan. 22, 2019, 10:46 a.m. UTC | #2
On Tue, Jan 22, 2019 at 11:39 AM Joerg Roedel <joro@8bytes.org> wrote:
>
> On Fri, Jan 18, 2019 at 12:17:05PM +0000, Eric Wong wrote:
> > @@ -5411,6 +5411,7 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2e20, quirk_iommu_g4x_gfx);
> >  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2e30, quirk_iommu_g4x_gfx);
> >  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2e40, quirk_iommu_g4x_gfx);
> >  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2e90, quirk_iommu_g4x_gfx);
> > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x0044, quirk_iommu_g4x_gfx);
> >
> >  static void quirk_iommu_rwbf(struct pci_dev *dev)
> >  {
> > @@ -5457,7 +5458,6 @@ static void quirk_calpella_no_shadow_gtt(struct pci_dev *dev)
> >         }
> >  }
> >  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x0040, quirk_calpella_no_shadow_gtt);
> > -DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x0044, quirk_calpella_no_shadow_gtt);
> >  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x0062, quirk_calpella_no_shadow_gtt);
> >  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x006a, quirk_calpella_no_shadow_gtt);
>
> This seems to make sense to me. Joonas, any comments or objections?

This is ironlake, which has a huge iommu hack in the gpu driver to
work around hard hangs, which:
- causes massive stalls and kills performance
- isn't well tested (it's the only one that needs this), so tends to break

So if we do this then imo we should:
- probably nuke that w/a too (check for needs_idle_maps and all the
related stuff in i915_gem_gtt.c)
- roll it out for all affected chips (i.e. need to include 0x0040).

Note that the string of platforms which have various issues with iommu
and igfx is very long, thus far we only disabled it where there's no
workaround to stop it from hanging the box, but otherwise left it
enabled. So if we make a policy change to also disable it anywhere
where it doesn't work well (instead of not at all), there's a pile
more platforms to switch.
-Daniel
Joerg Roedel Jan. 22, 2019, 11:01 a.m. UTC | #3
Hi Daniel,

On Tue, Jan 22, 2019 at 11:46:39AM +0100, Daniel Vetter wrote:
> Note that the string of platforms which have various issues with iommu
> and igfx is very long, thus far we only disabled it where there's no
> workaround to stop it from hanging the box, but otherwise left it
> enabled. So if we make a policy change to also disable it anywhere
> where it doesn't work well (instead of not at all), there's a pile
> more platforms to switch.

I think its best to just disable iommu for the igfx devices on these
platforms. Can you pick up Eric's patch and extend it with the list of
affected platforms?

Thanks,

	Joerg
Joonas Lahtinen Jan. 22, 2019, 2:48 p.m. UTC | #4
Quoting Joerg Roedel (2019-01-22 13:01:09)
> Hi Daniel,
> 
> On Tue, Jan 22, 2019 at 11:46:39AM +0100, Daniel Vetter wrote:
> > Note that the string of platforms which have various issues with iommu
> > and igfx is very long, thus far we only disabled it where there's no
> > workaround to stop it from hanging the box, but otherwise left it
> > enabled. So if we make a policy change to also disable it anywhere
> > where it doesn't work well (instead of not at all), there's a pile
> > more platforms to switch.
> 
> I think its best to just disable iommu for the igfx devices on these
> platforms. Can you pick up Eric's patch and extend it with the list of
> affected platforms?

We've been discussing this again more actively since a few months ago,
and the discussion is still ongoing internally.

According to our IOMMU folks there exists some desire to be able to assign
the iGFX device aka have intel_iommu=on instead of intel_iommu=igfx_off
due to how the devices might be grouped in IOMMU groups. Even when you
would not be using the iGFX device.

So for some uses, the fact that the device (group) is assignable seems
to be more important than the iGFX device to be working. I'm afraid
that retroactively disabling the assignment for such an old platform
might break those usage scenarios. By my quick reading of the code,
there's no way for user to turn the iGFX DMAR on once the quirk
disables it.

I guess one solution would be to default to intel_iommu=igfx_off for
platforms that are older than certain threshold. But still allow
user to enable. But that then requires duplicating the PCI ID database
into iommu code.

I don't really have winning moves to present, but I'm open to hearing
how we can avoid more damage than starting to default to intel_iommu=on
did already.

Regards, Joonas

> 
> Thanks,
> 
>         Joerg
Joerg Roedel Jan. 22, 2019, 4:51 p.m. UTC | #5
On Tue, Jan 22, 2019 at 04:48:26PM +0200, Joonas Lahtinen wrote:
> According to our IOMMU folks there exists some desire to be able to assign
> the iGFX device aka have intel_iommu=on instead of intel_iommu=igfx_off
> due to how the devices might be grouped in IOMMU groups. Even when you
> would not be using the iGFX device.

You can force the igfx device into a SI domain, or does that also
trigger the iommu issues on the chipset?

In any case, if iommu=on breaks these systems I want to make them work
again with opt-out, even at the cost of breaking assignability.

Regards,

	Joerg
Joonas Lahtinen Jan. 23, 2019, 3:02 p.m. UTC | #6
Quoting Joerg Roedel (2019-01-22 18:51:35)
> On Tue, Jan 22, 2019 at 04:48:26PM +0200, Joonas Lahtinen wrote:
> > According to our IOMMU folks there exists some desire to be able to assign
> > the iGFX device aka have intel_iommu=on instead of intel_iommu=igfx_off
> > due to how the devices might be grouped in IOMMU groups. Even when you
> > would not be using the iGFX device.
> 
> You can force the igfx device into a SI domain, or does that also
> trigger the iommu issues on the chipset?

To be honest, we've had a mixture different issues on different SKUs
that have not been hit in the past when intel_iommu was just disabled by
default.

I know that in one group of the problems, the issue has been debugged
into the GPU having its own set of virtualization mapping translation
hardware with caching and it fails to track changes to the mapping. So
if a identity mapping was established and never changed, I'd assume that
to fix at least that class of problems.

Would just passing intel_iommu=on already cause a non-identity mapping to
possibly be used for the integrated GPU? If it did, then it would
explain quite few of the issues.

We have many reports where just having intel_iommu=on (and using the
system normally, without any virtualization stuff going on) will cause
unexplained GPU hangs. For those users, simply switching to
intel_iommu=igfx_off solves the problems, and the debug often ends
there.

Regards, Joonas

> In any case, if iommu=on breaks these systems I want to make them work
> again with opt-out, even at the cost of breaking assignability.
> 
> Regards,
> 
>         Joerg
Joerg Roedel Jan. 23, 2019, 4:49 p.m. UTC | #7
On Wed, Jan 23, 2019 at 05:02:38PM +0200, Joonas Lahtinen wrote:
> We have many reports where just having intel_iommu=on (and using the
> system normally, without any virtualization stuff going on) will cause
> unexplained GPU hangs. For those users, simply switching to
> intel_iommu=igfx_off solves the problems, and the debug often ends
> there.

If you can reproduce problems on your side, then you can try to enable
CONFIG_INTEL_IOMMU_BROKEN_GFX_WA to force the GFX devices into the
identity mapping. We can also add a boot-parameter and workarounds if it
turns out that this is sufficient to make the GFX devices work with
IOMMU enabled.


Regards,

	Joerg
diff mbox series

Patch

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 048b5ab36a02..dc2507a01580 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -5399,7 +5399,7 @@  const struct iommu_ops intel_iommu_ops = {
 
 static void quirk_iommu_g4x_gfx(struct pci_dev *dev)
 {
-	/* G4x/GM45 integrated gfx dmar support is totally busted. */
+	/* G4x/GM45/QM57 integrated gfx dmar support is totally busted. */
 	pr_info("Disabling IOMMU for graphics on this chipset\n");
 	dmar_map_gfx = 0;
 }
@@ -5411,6 +5411,7 @@  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2e20, quirk_iommu_g4x_gfx);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2e30, quirk_iommu_g4x_gfx);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2e40, quirk_iommu_g4x_gfx);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2e90, quirk_iommu_g4x_gfx);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x0044, quirk_iommu_g4x_gfx);
 
 static void quirk_iommu_rwbf(struct pci_dev *dev)
 {
@@ -5457,7 +5458,6 @@  static void quirk_calpella_no_shadow_gtt(struct pci_dev *dev)
        }
 }
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x0040, quirk_calpella_no_shadow_gtt);
-DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x0044, quirk_calpella_no_shadow_gtt);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x0062, quirk_calpella_no_shadow_gtt);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x006a, quirk_calpella_no_shadow_gtt);