Message ID | 1432314314-23530-16-git-send-email-mika.kuoppala@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On 22/05/2015 18:05, Mika Kuoppala wrote: > During review of dynamic page tables series, I was able > to hit a lite restore bug with execlists. I assume that > due to incorrect pd, the batch run out of legit address space > and into the scratch page area. The ACTHD was increasing > due to scratch being all zeroes (MI_NOOPs). And as gen8 > address space is quite large, the hangcheck happily waited > for a long long time, keeping the process effectively stuck. > > According to Chris Wilson any modern gpu will grind to halt > if it encounters commands of all ones. This seemed to do the > trick and hang was declared promptly when the gpu wandered into > the scratch land. > > v2: Use 0xffff00ff pattern (Chris) Just for my own benefit: 1. Is there any particular reason for this pattern rather than 0xffffffff? 2. Someone please correct me if I'm wrong here but at least based on my own experiences with gen9 submitting batch buffers filled with bad instructions (0xffffffff) to the GPU does not hang it. I'm guessing that is because there's allegedly a hardware security parser that MI_NOOPs out invalid instructions during execution. If that's the case here then I guess we might have to come up with something else for gen9+ if we want to induce engine hangs once the execution reaches the scratch page? On the other hand, on gen9+ page faulting is supposedly not broken anymore so maybe we don't need the scratch page to begin with there so maybe it's all moot at that point? Again, if I'm making no sense here feel free to set things straight, I'm very curious about how all of this is supposed to work. Thanks, Tomas > > Cc: Chris Wilson <chris@chris-wilson.co.uk> > Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com> > --- > drivers/gpu/drm/i915/i915_gem_gtt.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c > index 43fa543..a2a0c88 100644 > --- a/drivers/gpu/drm/i915/i915_gem_gtt.c > +++ b/drivers/gpu/drm/i915/i915_gem_gtt.c > @@ -2168,6 +2168,8 @@ void i915_global_gtt_cleanup(struct drm_device *dev) > vm->cleanup(vm); > } > > +#define SCRATCH_PAGE_MAGIC 0xffff00ffffff00ffULL > + > static int alloc_scratch_page(struct i915_address_space *vm) > { > struct i915_page_scratch *sp; > @@ -2185,6 +2187,7 @@ static int alloc_scratch_page(struct i915_address_space *vm) > return ret; > } > > + fill_px(vm->dev, sp, SCRATCH_PAGE_MAGIC); > set_pages_uc(px_page(sp), 1); > > vm->scratch_page = sp; >
On Wed, May 27, 2015 at 07:12:02PM +0100, Tomas Elf wrote: > On 22/05/2015 18:05, Mika Kuoppala wrote: > >During review of dynamic page tables series, I was able > >to hit a lite restore bug with execlists. I assume that > >due to incorrect pd, the batch run out of legit address space > >and into the scratch page area. The ACTHD was increasing > >due to scratch being all zeroes (MI_NOOPs). And as gen8 > >address space is quite large, the hangcheck happily waited > >for a long long time, keeping the process effectively stuck. > > > >According to Chris Wilson any modern gpu will grind to halt > >if it encounters commands of all ones. This seemed to do the > >trick and hang was declared promptly when the gpu wandered into > >the scratch land. > > > >v2: Use 0xffff00ff pattern (Chris) > > Just for my own benefit: > > 1. Is there any particular reason for this pattern rather than 0xffffffff? It is more obvious when userspace reads from the page and copies it into its own data structures or surfaces. See below, if this does impact userspace we should probably revert this patch anyway. > 2. Someone please correct me if I'm wrong here but at least based on > my own experiences with gen9 submitting batch buffers filled with > bad instructions (0xffffffff) to the GPU does not hang it. I'm > guessing that is because there's allegedly a hardware security > parser that MI_NOOPs out invalid instructions during execution. If > that's the case here then I guess we might have to come up with > something else for gen9+ if we want to induce engine hangs once the > execution reaches the scratch page? It's not a problem, there will be a GPU hang eventually (in theory at least). Mika is just trying to shortcircuit that by causing an immediate hang. > On the other hand, on gen9+ page faulting is supposedly not broken > anymore so maybe we don't need the scratch page to begin with there > so maybe it's all moot at that point? Again, if I'm making no sense > here feel free to set things straight, I'm very curious about how > all of this is supposed to work. Generating a pagefault for invalid access is an ABI change and requires opt-in (we have discussed context flags in the past). The most obvious example is the CS prefetch, which we have to prevent generating faults by providing guard pages (on older chipsets at least). But as we have been historically lax on allowing userspace to access invalid pages, we have to assume that userspace has been taking advantage of that. -Chris
On 5/22/2015 6:05 PM, Mika Kuoppala wrote: > During review of dynamic page tables series, I was able > to hit a lite restore bug with execlists. I assume that > due to incorrect pd, the batch run out of legit address space > and into the scratch page area. The ACTHD was increasing > due to scratch being all zeroes (MI_NOOPs). And as gen8 > address space is quite large, the hangcheck happily waited > for a long long time, keeping the process effectively stuck. FYI, it is probably safe to assume that the only thing updated in a lite-restore is the ring tail. I didn't realize that until recently. This issue is more frequent in GuC submission mode, and we are currently evaluation 2 alternatives to trigger the re-reading of the page tables. > > According to Chris Wilson any modern gpu will grind to halt > if it encounters commands of all ones. This seemed to do the > trick and hang was declared promptly when the gpu wandered into > the scratch land. > > v2: Use 0xffff00ff pattern (Chris) > > Cc: Chris Wilson <chris@chris-wilson.co.uk> > Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com> > --- > drivers/gpu/drm/i915/i915_gem_gtt.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c > index 43fa543..a2a0c88 100644 > --- a/drivers/gpu/drm/i915/i915_gem_gtt.c > +++ b/drivers/gpu/drm/i915/i915_gem_gtt.c > @@ -2168,6 +2168,8 @@ void i915_global_gtt_cleanup(struct drm_device *dev) > vm->cleanup(vm); > } > > +#define SCRATCH_PAGE_MAGIC 0xffff00ffffff00ffULL > + > static int alloc_scratch_page(struct i915_address_space *vm) > { > struct i915_page_scratch *sp; > @@ -2185,6 +2187,7 @@ static int alloc_scratch_page(struct i915_address_space *vm) > return ret; > } > > + fill_px(vm->dev, sp, SCRATCH_PAGE_MAGIC); > set_pages_uc(px_page(sp), 1); > > vm->scratch_page = sp; > -- > 1.9.1 > > _______________________________________________ > Intel-gfx mailing list > Intel-gfx@lists.freedesktop.org > http://lists.freedesktop.org/mailman/listinfo/intel-gfx >
On 01/06/2015 16:53, Chris Wilson wrote: > On Wed, May 27, 2015 at 07:12:02PM +0100, Tomas Elf wrote: >> On 22/05/2015 18:05, Mika Kuoppala wrote: >>> During review of dynamic page tables series, I was able >>> to hit a lite restore bug with execlists. I assume that >>> due to incorrect pd, the batch run out of legit address space >>> and into the scratch page area. The ACTHD was increasing >>> due to scratch being all zeroes (MI_NOOPs). And as gen8 >>> address space is quite large, the hangcheck happily waited >>> for a long long time, keeping the process effectively stuck. >>> >>> According to Chris Wilson any modern gpu will grind to halt >>> if it encounters commands of all ones. This seemed to do the >>> trick and hang was declared promptly when the gpu wandered into >>> the scratch land. >>> >>> v2: Use 0xffff00ff pattern (Chris) >> >> Just for my own benefit: >> >> 1. Is there any particular reason for this pattern rather than 0xffffffff? > > It is more obvious when userspace reads from the page and copies it into > its own data structures or surfaces. See below, if this does impact > userspace we should probably revert this patch anyway. > >> 2. Someone please correct me if I'm wrong here but at least based on >> my own experiences with gen9 submitting batch buffers filled with >> bad instructions (0xffffffff) to the GPU does not hang it. I'm >> guessing that is because there's allegedly a hardware security >> parser that MI_NOOPs out invalid instructions during execution. If >> that's the case here then I guess we might have to come up with >> something else for gen9+ if we want to induce engine hangs once the >> execution reaches the scratch page? > > It's not a problem, there will be a GPU hang eventually (in theory at > least). Mika is just trying to shortcircuit that by causing an immediate > hang. Interesting! Why do you think the execution would hang eventually? Simply because it would end up at an invalid opcode somewhere in memory at some point, which would trig a hang? If we're talking gen9 then I'm assuming that the security parser would MI_NOOP out all invalid opcodes before they get a chance to hang the GPU (that would certainly be the case with 0xffffffff and 0xffff00ff, based on my own experiments). Or are we assuming that at some point the execution would end up at a valid opcode that would still put the GPU in an arbitrary execution state that would take a practically infinite amount of time to complete? Such as executing a semaphore instruction that no thread would ever think of signalling? Thanks, Tomas > >> On the other hand, on gen9+ page faulting is supposedly not broken >> anymore so maybe we don't need the scratch page to begin with there >> so maybe it's all moot at that point? Again, if I'm making no sense >> here feel free to set things straight, I'm very curious about how >> all of this is supposed to work. > > Generating a pagefault for invalid access is an ABI change and requires > opt-in (we have discussed context flags in the past). The most obvious > example is the CS prefetch, which we have to prevent generating faults by > providing guard pages (on older chipsets at least). But as we have been > historically lax on allowing userspace to access invalid pages, we have > to assume that userspace has been taking advantage of that. > -Chris >
On Thu, Jun 04, 2015 at 12:08:17PM +0100, Tomas Elf wrote: > On 01/06/2015 16:53, Chris Wilson wrote: > >On Wed, May 27, 2015 at 07:12:02PM +0100, Tomas Elf wrote: > >>On 22/05/2015 18:05, Mika Kuoppala wrote: > >>>During review of dynamic page tables series, I was able > >>>to hit a lite restore bug with execlists. I assume that > >>>due to incorrect pd, the batch run out of legit address space > >>>and into the scratch page area. The ACTHD was increasing > >>>due to scratch being all zeroes (MI_NOOPs). And as gen8 > >>>address space is quite large, the hangcheck happily waited > >>>for a long long time, keeping the process effectively stuck. > >>> > >>>According to Chris Wilson any modern gpu will grind to halt > >>>if it encounters commands of all ones. This seemed to do the > >>>trick and hang was declared promptly when the gpu wandered into > >>>the scratch land. > >>> > >>>v2: Use 0xffff00ff pattern (Chris) > >> > >>Just for my own benefit: > >> > >>1. Is there any particular reason for this pattern rather than 0xffffffff? > > > >It is more obvious when userspace reads from the page and copies it into > >its own data structures or surfaces. See below, if this does impact > >userspace we should probably revert this patch anyway. > > > >>2. Someone please correct me if I'm wrong here but at least based on > >>my own experiences with gen9 submitting batch buffers filled with > >>bad instructions (0xffffffff) to the GPU does not hang it. I'm > >>guessing that is because there's allegedly a hardware security > >>parser that MI_NOOPs out invalid instructions during execution. If > >>that's the case here then I guess we might have to come up with > >>something else for gen9+ if we want to induce engine hangs once the > >>execution reaches the scratch page? > > > >It's not a problem, there will be a GPU hang eventually (in theory at > >least). Mika is just trying to shortcircuit that by causing an immediate > >hang. > > Interesting! Why do you think the execution would hang eventually? > Simply because it would end up at an invalid opcode somewhere in > memory at some point, which would trig a hang? If we're talking gen9 > then I'm assuming that the security parser would MI_NOOP out all > invalid opcodes before they get a chance to hang the GPU (that would > certainly be the case with 0xffffffff and 0xffff00ff, based on my > own experiments). Or are we assuming that at some point the > execution would end up at a valid opcode that would still put the > GPU in an arbitrary execution state that would take a practically > infinite amount of time to complete? Such as executing a semaphore > instruction that no thread would ever think of signalling? Because even if it has to execute the full 48bit address space, we will eventually realise that no progress is being made (due to looping) and declare a hung GPU. A real problem is when we wrap and find a valid batch and so end up advancing even though the GPU's view of memory is incoherent. -Chris
Tomas Elf <tomas.elf@intel.com> writes: > On 22/05/2015 18:05, Mika Kuoppala wrote: >> During review of dynamic page tables series, I was able >> to hit a lite restore bug with execlists. I assume that >> due to incorrect pd, the batch run out of legit address space >> and into the scratch page area. The ACTHD was increasing >> due to scratch being all zeroes (MI_NOOPs). And as gen8 >> address space is quite large, the hangcheck happily waited >> for a long long time, keeping the process effectively stuck. >> >> According to Chris Wilson any modern gpu will grind to halt >> if it encounters commands of all ones. This seemed to do the >> trick and hang was declared promptly when the gpu wandered into >> the scratch land. >> >> v2: Use 0xffff00ff pattern (Chris) > > Just for my own benefit: > > 1. Is there any particular reason for this pattern rather than 0xffffffff? > > 2. Someone please correct me if I'm wrong here but at least based on my > own experiences with gen9 submitting batch buffers filled with bad > instructions (0xffffffff) to the GPU does not hang it. I'm guessing that > is because there's allegedly a hardware security parser that MI_NOOPs > out invalid instructions during execution. If that's the case here then > I guess we might have to come up with something else for gen9+ if we > want to induce engine hangs once the execution reaches the scratch page? > If that is the case with gen9, then we need more ducttape. Like that we always increase busyness in hangcheck (a little) to finally declare a hang even tho no loops are detected. But with this and gen < 9, the execution grinds to a halt and I get hang in a 5 second window. -Mika > On the other hand, on gen9+ page faulting is supposedly not broken > anymore so maybe we don't need the scratch page to begin with there so > maybe it's all moot at that point? Again, if I'm making no sense here > feel free to set things straight, I'm very curious about how all of this > is supposed to work. > > Thanks, > Tomas > >> >> Cc: Chris Wilson <chris@chris-wilson.co.uk> >> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com> >> --- >> drivers/gpu/drm/i915/i915_gem_gtt.c | 3 +++ >> 1 file changed, 3 insertions(+) >> >> diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c >> index 43fa543..a2a0c88 100644 >> --- a/drivers/gpu/drm/i915/i915_gem_gtt.c >> +++ b/drivers/gpu/drm/i915/i915_gem_gtt.c >> @@ -2168,6 +2168,8 @@ void i915_global_gtt_cleanup(struct drm_device *dev) >> vm->cleanup(vm); >> } >> >> +#define SCRATCH_PAGE_MAGIC 0xffff00ffffff00ffULL >> + >> static int alloc_scratch_page(struct i915_address_space *vm) >> { >> struct i915_page_scratch *sp; >> @@ -2185,6 +2187,7 @@ static int alloc_scratch_page(struct i915_address_space *vm) >> return ret; >> } >> >> + fill_px(vm->dev, sp, SCRATCH_PAGE_MAGIC); >> set_pages_uc(px_page(sp), 1); >> >> vm->scratch_page = sp; >>
diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c index 43fa543..a2a0c88 100644 --- a/drivers/gpu/drm/i915/i915_gem_gtt.c +++ b/drivers/gpu/drm/i915/i915_gem_gtt.c @@ -2168,6 +2168,8 @@ void i915_global_gtt_cleanup(struct drm_device *dev) vm->cleanup(vm); } +#define SCRATCH_PAGE_MAGIC 0xffff00ffffff00ffULL + static int alloc_scratch_page(struct i915_address_space *vm) { struct i915_page_scratch *sp; @@ -2185,6 +2187,7 @@ static int alloc_scratch_page(struct i915_address_space *vm) return ret; } + fill_px(vm->dev, sp, SCRATCH_PAGE_MAGIC); set_pages_uc(px_page(sp), 1); vm->scratch_page = sp;
During review of dynamic page tables series, I was able to hit a lite restore bug with execlists. I assume that due to incorrect pd, the batch run out of legit address space and into the scratch page area. The ACTHD was increasing due to scratch being all zeroes (MI_NOOPs). And as gen8 address space is quite large, the hangcheck happily waited for a long long time, keeping the process effectively stuck. According to Chris Wilson any modern gpu will grind to halt if it encounters commands of all ones. This seemed to do the trick and hang was declared promptly when the gpu wandered into the scratch land. v2: Use 0xffff00ff pattern (Chris) Cc: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com> --- drivers/gpu/drm/i915/i915_gem_gtt.c | 3 +++ 1 file changed, 3 insertions(+)