Message ID | 20180511135602.13071-1-chris@chris-wilson.co.uk (mailing list archive) |
---|---|
State | New, archived |
My understanding of the virtual memory addressing from the GPU is limited...
But how can the GPU poke at the kernel's allocated data?
I thought we mapped into the GPU's address space only what is allocated
through gem.

- Lionel

On 11/05/18 14:56, Chris Wilson wrote:
> We observe that the OA architecture is clobbering random memory. Disable
> it until this can be resolved.
>
> References: https://bugs.freedesktop.org/show_bug.cgi?id=106379
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
> Cc: Matthew Auld <matthew.auld@intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Cc: Jani Nikula <jani.nikula@intel.com>
> Cc: stable@vger.kernel.org
> ---
>  drivers/gpu/drm/i915/i915_perf.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
> index 019bd2d073ad..20187f3bf350 100644
> --- a/drivers/gpu/drm/i915/i915_perf.c
> +++ b/drivers/gpu/drm/i915/i915_perf.c
> @@ -3425,7 +3425,7 @@ static struct ctl_table dev_root[] = {
>   */
>  void i915_perf_init(struct drm_i915_private *dev_priv)
>  {
> -	if (IS_HASWELL(dev_priv)) {
> +	if (IS_HASWELL(dev_priv) && 0) {
>  		dev_priv->perf.oa.ops.is_valid_b_counter_reg =
>  			gen7_is_valid_b_counter_addr;
>  		dev_priv->perf.oa.ops.is_valid_mux_reg =
Quoting Lionel Landwerlin (2018-05-11 15:14:13)
> My understanding of the virtual memory addressing from the GPU is limited...
> But how can the GPU poke at the kernel's allocated data?
> I thought we mapped into the GPU's address space only what is allocated
> through gem.

Correct. The HW should only be accessing the pages through the GTT and
the GTT should only contain known pages (or a pointer to the scratch
page). There is maybe a hole where we are freeing the memory before
the HW has finished using it (still writing through stale TLB and
whatnot even though the system has reallocated the pages), but other
than that quite, quite scary. Hence this awooga.
-Chris
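The ordering Chris describes can be made concrete. Below is a minimal C sketch of the teardown invariant; every helper name is a hypothetical stand-in, not a real i915 function (the real driver spreads this across request tracking, vma unbinding and the shmem backing store). The hole discussed in this thread lives in step 1: a unit such as OA that writes outside the request tracking is invisible to the wait.

struct gpu_object;

/* Illustrative stand-ins only, not real i915 helpers. */
void wait_for_tracked_activity(struct gpu_object *obj); /* flush known HW users */
void rewrite_ptes_to_scratch(struct gpu_object *obj);   /* unbind from the GTT */
void invalidate_gtt_tlbs(struct gpu_object *obj);       /* drop stale translations */
void return_pages_to_system(struct gpu_object *obj);    /* release backing pages */

static void release_object_pages(struct gpu_object *obj)
{
	/* 1. Wait for all *tracked* HW activity to complete. An untracked
	 *    writer (e.g. the OA unit) slips through this wait and keeps
	 *    scribbling on whatever the pages are reallocated for. */
	wait_for_tracked_activity(obj);

	/* 2. Unbind: point the PTEs at the scratch page and invalidate the
	 *    TLBs, so the HW can no longer reach the pages via the GTT. */
	rewrite_ptes_to_scratch(obj);
	invalidate_gtt_tlbs(obj);

	/* 3. Only now is it safe to hand the pages back to the kernel. */
	return_pages_to_system(obj);
}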
On 11/05/18 15:18, Chris Wilson wrote:
> Quoting Lionel Landwerlin (2018-05-11 15:14:13)
>> My understanding of the virtual memory addressing from the GPU is limited...
>> But how can the GPU poke at the kernel's allocated data?
>> I thought we mapped into the GPU's address space only what is allocated
>> through gem.
> Correct. The HW should only be accessing the pages through the GTT and
> the GTT should only contain known pages (or a pointer to the scratch
> page). There is maybe a hole where we are freeing the memory before
> the HW has finished using it (still writing through stale TLB and
> whatnot even though the system has reallocated the pages), but other
> than that quite, quite scary. Hence this awooga.
> -Chris
>
Oh right...

So this patch is a backup if your previous one won't fix the issue we see
on CI?

- Lionel
Quoting Lionel Landwerlin (2018-05-11 15:28:24)
> On 11/05/18 15:18, Chris Wilson wrote:
> > Quoting Lionel Landwerlin (2018-05-11 15:14:13)
> >> My understanding of the virtual memory addressing from the GPU is limited...
> >> But how can the GPU poke at the kernel's allocated data?
> >> I thought we mapped into the GPU's address space only what is allocated
> >> through gem.
> > Correct. The HW should only be accessing the pages through the GTT and
> > the GTT should only contain known pages (or a pointer to the scratch
> > page). There is maybe a hole where we are freeing the memory before
> > the HW has finished using it (still writing through stale TLB and
> > whatnot even though the system has reallocated the pages), but other
> > than that quite, quite scary. Hence this awooga.
>
> Oh right...
>
> So this patch is a backup if your previous one won't fix the issue we see
> on CI?

Yes. Try to fix; if we can't, we disable until we can. It may also be
purely coincidental that we've seen this bug a few times around the same
test only on this machine... ;)
-Chris
On 11/05/18 15:34, Chris Wilson wrote:
> Quoting Lionel Landwerlin (2018-05-11 15:28:24)
>> On 11/05/18 15:18, Chris Wilson wrote:
>>> Quoting Lionel Landwerlin (2018-05-11 15:14:13)
>>>> My understanding of the virtual memory addressing from the GPU is limited...
>>>> But how can the GPU poke at the kernel's allocated data?
>>>> I thought we mapped into the GPU's address space only what is allocated
>>>> through gem.
>>> Correct. The HW should only be accessing the pages through the GTT and
>>> the GTT should only contain known pages (or a pointer to the scratch
>>> page). There is maybe a hole where we are freeing the memory before
>>> the HW has finished using it (still writing through stale TLB and
>>> whatnot even though the system has reallocated the pages), but other
>>> than that quite, quite scary. Hence this awooga.
>>>
>> Oh right...
>>
>> So this patch is a backup if your previous one won't fix the issue we see
>> on CI?
> Yes. Try to fix; if we can't, we disable until we can. It may also be
> purely coincidental that we've seen this bug a few times around the same
> test only on this machine... ;)
> -Chris
>
Trying on a laptop.
On 11/05/18 15:18, Chris Wilson wrote:
> Quoting Lionel Landwerlin (2018-05-11 15:14:13)
>> My understanding of the virtual memory addressing from the GPU is limited...
>> But how can the GPU poke at the kernel's allocated data?
>> I thought we mapped into the GPU's address space only what is allocated
>> through gem.
> Correct. The HW should only be accessing the pages through the GTT and
> the GTT should only contain known pages (or a pointer to the scratch
> page). There is maybe a hole where we are freeing the memory before
> the HW has finished using it (still writing through stale TLB and
> whatnot even though the system has reallocated the pages), but other
> than that quite, quite scary. Hence this awooga.
> -Chris
>
I managed to reproduce a kasan backtrace on the same test.
So it's not just the CI machine.

But I can't even start up gdm on that machine with drm-tip. So maybe
there is something much more broken...

i915/perf unpins the object correctly before freeing (at which point it
could be reused).
Should we ensure idleness in i915_vma_destroy() for i915/perf maybe?

It almost seems like this is an issue that could arise in other parts of
the driver too.

- Lionel
Quoting Lionel Landwerlin (2018-05-11 16:43:02)
> On 11/05/18 15:18, Chris Wilson wrote:
> > Quoting Lionel Landwerlin (2018-05-11 15:14:13)
> >> My understanding of the virtual memory addressing from the GPU is limited...
> >> But how can the GPU poke at the kernel's allocated data?
> >> I thought we mapped into the GPU's address space only what is allocated
> >> through gem.
> > Correct. The HW should only be accessing the pages through the GTT and
> > the GTT should only contain known pages (or a pointer to the scratch
> > page). There is maybe a hole where we are freeing the memory before
> > the HW has finished using it (still writing through stale TLB and
> > whatnot even though the system has reallocated the pages), but other
> > than that quite, quite scary. Hence this awooga.
> > -Chris
>
> I managed to reproduce a kasan backtrace on the same test.
> So it's not just the CI machine.
>
> But I can't even start up gdm on that machine with drm-tip. So maybe
> there is something much more broken...

Don't leave us in suspense...

> i915/perf unpins the object correctly before freeing (at which point it
> could be reused).

Sure, but does perf know that the OA unit has stopped writing at that
point... That's not so clear (from my pov).

> Should we ensure idleness in i915_vma_destroy() for i915/perf maybe?
>
> It almost seems like this is an issue that could arise in other parts of
> the driver too.

The problem of the HW continuing to access the pages after unbinding is
inherent to the system (and what actually happens if we change PTE in
flight is usually undefined), hence the great care we take to track HW
activity and try not to release pages while it is still using them.
-Chris
On 11/05/18 16:51, Chris Wilson wrote:
> Quoting Lionel Landwerlin (2018-05-11 16:43:02)
>> On 11/05/18 15:18, Chris Wilson wrote:
>>> Quoting Lionel Landwerlin (2018-05-11 15:14:13)
>>>> My understanding of the virtual memory addressing from the GPU is limited...
>>>> But how can the GPU poke at the kernel's allocated data?
>>>> I thought we mapped into the GPU's address space only what is allocated
>>>> through gem.
>>> Correct. The HW should only be accessing the pages through the GTT and
>>> the GTT should only contain known pages (or a pointer to the scratch
>>> page). There is maybe a hole where we are freeing the memory before
>>> the HW has finished using it (still writing through stale TLB and
>>> whatnot even though the system has reallocated the pages), but other
>>> than that quite, quite scary. Hence this awooga.
>>> -Chris
>>>
>> I managed to reproduce a kasan backtrace on the same test.
>> So it's not just the CI machine.
>>
>> But I can't even start up gdm on that machine with drm-tip. So maybe
>> there is something much more broken...
> Don't leave us in suspense...

Your first patch (check that OA is actually disabled) seems to get rid
of the issue on my machine.
Thanks a lot for finding that!

Trying to find when HSW went wrong now. Same kernel works just fine on
my SKL.

>
>> i915/perf unpins the object correctly before freeing (at which point it
>> could be reused).
> Sure, but does perf know that the OA unit has stopped writing at that
> point... That's not so clear (from my pov).

Clearly it wasn't :(

>
>> Should we ensure idleness in i915_vma_destroy() for i915/perf maybe?
>>
>> It almost seems like this is an issue that could arise in other parts of
>> the driver too.
> The problem of the HW continuing to access the pages after unbinding is
> inherent to the system (and what actually happens if we change PTE in
> flight is usually undefined), hence the great care we take to track HW
> activity and try not to release pages while it is still using them.
> -Chris
>
Quoting Lionel Landwerlin (2018-05-11 16:58:27)
> On 11/05/18 16:51, Chris Wilson wrote:
> > Quoting Lionel Landwerlin (2018-05-11 16:43:02)
> >> On 11/05/18 15:18, Chris Wilson wrote:
> >>> Quoting Lionel Landwerlin (2018-05-11 15:14:13)
> >>>> My understanding of the virtual memory addressing from the GPU is limited...
> >>>> But how can the GPU poke at the kernel's allocated data?
> >>>> I thought we mapped into the GPU's address space only what is allocated
> >>>> through gem.
> >>> Correct. The HW should only be accessing the pages through the GTT and
> >>> the GTT should only contain known pages (or a pointer to the scratch
> >>> page). There is maybe a hole where we are freeing the memory before
> >>> the HW has finished using it (still writing through stale TLB and
> >>> whatnot even though the system has reallocated the pages), but other
> >>> than that quite, quite scary. Hence this awooga.
> >>> -Chris
> >>>
> >> I managed to reproduce a kasan backtrace on the same test.
> >> So it's not just the CI machine.
> >>
> >> But I can't even start up gdm on that machine with drm-tip. So maybe
> >> there is something much more broken...
> > Don't leave us in suspense...
>
> Your first patch (check that OA is actually disabled) seems to get rid
> of the issue on my machine.
> Thanks a lot for finding that!

Care to add a t-b and we'll close the bug?
-Chris
On 11/05/18 16:51, Chris Wilson wrote:
>> But I can't even start up gdm on that machine with drm-tip. So maybe
>> there is something much more broken...
> Don't leave us in suspense...
>
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=890614

Not our bug :)
Quoting Lionel Landwerlin (2018-05-11 18:41:28)
> On 11/05/18 16:51, Chris Wilson wrote:
> >> But I can't even start up gdm on that machine with drm-tip. So maybe
> >> there is something much more broken...
> > Don't leave us in suspense...
>
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=890614
>
> Not our bug :)

You would not believe how relieved I am that someone else has bugs in
their code.
-Chris
Quoting Lionel Landwerlin (2018-05-11 16:43:02)
> On 11/05/18 15:18, Chris Wilson wrote:
> > Quoting Lionel Landwerlin (2018-05-11 15:14:13)
> >> My understanding of the virtual memory addressing from the GPU is limited...
> >> But how can the GPU poke at the kernel's allocated data?
> >> I thought we mapped into the GPU's address space only what is allocated
> >> through gem.
> > Correct. The HW should only be accessing the pages through the GTT and
> > the GTT should only contain known pages (or a pointer to the scratch
> > page). There is maybe a hole where we are freeing the memory before
> > the HW has finished using it (still writing through stale TLB and
> > whatnot even though the system has reallocated the pages), but other
> > than that quite, quite scary. Hence this awooga.
>
> I managed to reproduce a kasan backtrace on the same test.
> So it's not just the CI machine.

For the record, CI also seems much happier with the wait for OACONTROL
before unpinning, so no need to panic!
-Chris
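For reference, the "wait for OACONTROL" fix Chris mentions has roughly the following shape. This is a sketch of the idea, not the literal upstream change; it assumes the gen7 OA disable path and the i915 register/macro conventions of the time (GEN7_OACONTROL, GEN7_OACONTROL_ENABLE, I915_READ/I915_WRITE, wait_for()). The point is to poll the register until the enable bit reads back as zero before the OA buffer may be unpinned.

static void gen7_oa_disable(struct drm_i915_private *dev_priv)
{
	/* Ask the OA unit to stop writing into its buffer... */
	I915_WRITE(GEN7_OACONTROL, 0);

	/* ...and wait for the HW to confirm it has stopped, so the buffer
	 * is not unpinned (and its pages reused) while OA is still
	 * writing through the old GTT binding. */
	if (wait_for((I915_READ(GEN7_OACONTROL) & GEN7_OACONTROL_ENABLE) == 0,
		     50))
		DRM_ERROR("wait for OA to be disabled timed out\n");
}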
diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 019bd2d073ad..20187f3bf350 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -3425,7 +3425,7 @@ static struct ctl_table dev_root[] = {
  */
 void i915_perf_init(struct drm_i915_private *dev_priv)
 {
-	if (IS_HASWELL(dev_priv)) {
+	if (IS_HASWELL(dev_priv) && 0) {
 		dev_priv->perf.oa.ops.is_valid_b_counter_reg =
 			gen7_is_valid_b_counter_addr;
 		dev_priv->perf.oa.ops.is_valid_mux_reg =
We observe that the OA architecture is clobbering random memory. Disable
it until this can be resolved.

References: https://bugs.freedesktop.org/show_bug.cgi?id=106379
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: stable@vger.kernel.org
---
 drivers/gpu/drm/i915/i915_perf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
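A note on the disable mechanism in the patch above: gating the branch with a constant-false condition ("&& 0") rather than deleting it or hiding it behind #if 0 keeps the disabled code visible to the compiler, so it continues to be parsed and type-checked while the optimizer removes the always-false path as dead code. A self-contained toy illustrating the idiom (nothing here is i915 code):

#include <stdio.h>

int main(void)
{
	int feature_supported = 1;

	/* Still compiled and type-checked, but never executed: the
	 * constant-false conjunct makes the whole branch dead code. */
	if (feature_supported && 0)
		printf("feature enabled\n");
	else
		printf("feature disabled\n");

	return 0;
}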