Message ID | 20170913140117.11072-1-mika.kuoppala@intel.com (mailing list archive) |
---|---|
State | New, archived |
Quoting Mika Kuoppala (2017-09-13 15:01:17)
> Evidence indicates that even if the hardware happily
> tells us to proceed with reset, it really isn't ready.
> Resetting a freely running batchbuffer after we have
> got ack for readiness, still can cause a system hang.

Hmm, so we see it on early gen and late gen. I suggest we do it
universally (except gen2 which is lacking the mechanism). It's unlikely
that the requirement disappeared just for a couple of gen, more likely
that we simply haven't triggered the pathological behaviour.

Other than,
Acked-by: Chris Wilson <chris@chris-wilson.co.uk>
for the find.
-Chris
On Wed, Sep 13, 2017 at 03:08:06PM +0100, Chris Wilson wrote:
> Quoting Mika Kuoppala (2017-09-13 15:01:17)
> > Evidence indicates that even if the hardware happily
> > tells us to proceed with reset, it really isn't ready.
> > Resetting a freely running batchbuffer after we have
> > got ack for readiness, still can cause a system hang.
>
> Hmm, so we see it on early gen and late gen. I suggest we do it
> universally (except gen2 which is lacking the mechanism). It's unlikely
> that the requirement disappeared just for a couple of gen, more likely
> that we simply haven't triggered the pathological behaviour.

Could just try setting ring enable=false on gen2 maybe? But we don't
have GPU reset for gen2 anyway so I guess it doesn't matter.
Chris Wilson <chris@chris-wilson.co.uk> writes:

> Quoting Mika Kuoppala (2017-09-13 15:01:17)
>> Evidence indicates that even if the hardware happily
>> tells us to proceed with reset, it really isn't ready.
>> Resetting a freely running batchbuffer after we have
>> got ack for readiness, still can cause a system hang.
>
> Hmm, so we see it on early gen and late gen. I suggest we do it
> universally (except gen2 which is lacking the mechanism). It's unlikely
> that the requirement disappeared just for a couple of gen, more likely
> that we simply haven't triggered the pathological behaviour.
>

Agreed that we should do a blanket approach. I was in a hurry
to post a proposed fix as I heard the prime_* are not yet
blacklisted on shards. So lets hope this helps.

> Other than,
> Acked-by: Chris Wilson <chris@chris-wilson.co.uk>
> for the find.

Ta.
-Mika
Quoting Patchwork (2017-09-14 01:07:40)
> == Series Details ==
>
> Series: drm/i915: Stop ring before doing readiness check
> URL   : https://patchwork.freedesktop.org/series/30298/
> State : failure
>
> == Summary ==
>
> Test kms_cursor_legacy:
>         Subgroup cursorA-vs-flipA-atomic-transitions:
>                 pass -> FAIL (shard-hsw)
> Test drv_missed_irq:
>                 pass -> FAIL (shard-hsw)
> Test kms_setmode:
>         Subgroup basic:
>                 pass -> FAIL (shard-hsw) fdo#99912
>
> fdo#99912 https://bugs.freedesktop.org/show_bug.cgi?id=99912
>
> shard-hsw total:2313 pass:1242 dwarn:0 dfail:0 fail:16 skip:1055 time:9618s
>
> == Logs ==
>
> For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_5685/shards.html

Of course, it decided not to run the prime_busy hang tests!!!
-Chris
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 1b38eb94d461..f9ef1931516c 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1361,33 +1361,38 @@ int i915_reg_read_ioctl(struct drm_device *dev,
 	return ret;
 }
 
+static void gen3_stop_ring(struct intel_engine_cs *engine)
+{
+	struct drm_i915_private *dev_priv = engine->i915;
+	const u32 base = engine->mmio_base;
+	const i915_reg_t mode = RING_MI_MODE(base);
+
+	I915_WRITE_FW(mode, _MASKED_BIT_ENABLE(STOP_RING));
+	if (intel_wait_for_register_fw(dev_priv,
+				       mode,
+				       MODE_IDLE,
+				       MODE_IDLE,
+				       500))
+		DRM_DEBUG_DRIVER("%s: timed out on STOP_RING\n",
+				 engine->name);
+
+	I915_WRITE_FW(RING_CTL(base), 0);
+	I915_WRITE_FW(RING_HEAD(base), 0);
+	I915_WRITE_FW(RING_TAIL(base), 0);
+
+	/* Check acts as a post */
+	if (I915_READ_FW(RING_HEAD(base)) != 0)
+		DRM_DEBUG_DRIVER("%s: ring head not parked\n",
+				 engine->name);
+}
+
 static void gen3_stop_rings(struct drm_i915_private *dev_priv)
 {
 	struct intel_engine_cs *engine;
 	enum intel_engine_id id;
 
-	for_each_engine(engine, dev_priv, id) {
-		const u32 base = engine->mmio_base;
-		const i915_reg_t mode = RING_MI_MODE(base);
-
-		I915_WRITE_FW(mode, _MASKED_BIT_ENABLE(STOP_RING));
-		if (intel_wait_for_register_fw(dev_priv,
-					       mode,
-					       MODE_IDLE,
-					       MODE_IDLE,
-					       500))
-			DRM_DEBUG_DRIVER("%s: timed out on STOP_RING\n",
-					 engine->name);
-
-		I915_WRITE_FW(RING_CTL(base), 0);
-		I915_WRITE_FW(RING_HEAD(base), 0);
-		I915_WRITE_FW(RING_TAIL(base), 0);
-
-		/* Check acts as a post */
-		if (I915_READ_FW(RING_HEAD(base)) != 0)
-			DRM_DEBUG_DRIVER("%s: ring head not parked\n",
-					 engine->name);
-	}
+	for_each_engine(engine, dev_priv, id)
+		gen3_stop_ring(engine);
 }
 
 static bool i915_reset_complete(struct pci_dev *pdev)
@@ -1668,6 +1673,11 @@ static int gen8_reset_engine_start(struct intel_engine_cs *engine)
 	struct drm_i915_private *dev_priv = engine->i915;
 	int ret;
 
+	/* If the bb is still running at this stage, forcing a
+	 * reset risks a system hang.
+	 */
+	gen3_stop_ring(engine);
+
 	I915_WRITE_FW(RING_RESET_CTL(engine->mmio_base),
 		      _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
Evidence indicates that even if the hardware happily
tells us to proceed with reset, it really isn't ready.
Resetting a freely running batchbuffer after we have
got ack for readiness, still can cause a system hang.

Attempt to stop ring before proceeding for ready
check and reset to avoid losing the machine.

Testcase: igt/prime_busy/hang-* # kbl
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/intel_uncore.c | 54 ++++++++++++++++++++++---------------
 1 file changed, 32 insertions(+), 22 deletions(-)
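
For readers unfamiliar with the register idiom, the ordering this patch is after (stop the ring, park head and tail, and only then request the engine reset) can be sketched as a small user-space model. This is not driver code. The `model_*` names and `struct model_engine` are hypothetical, the bit positions for `STOP_RING` and `MODE_IDLE` are assumed rather than taken from the patch, and the hardware is modelled as going idle immediately where the driver polls with a timeout. The masked-write helper assumes the usual convention that the upper 16 bits of the written value select which of the lower 16 bits are updated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions, for illustration only. */
#define STOP_RING (1u << 8)
#define MODE_IDLE (1u << 9)

/* Build a masked write that sets 'bit': the mask goes in the high half. */
static inline uint32_t masked_bit_enable(uint32_t bit)
{
	assert(bit != 0 && bit <= 0xffffu);
	return (bit << 16) | bit;
}

/* Model of how a masked register applies a write to its current value. */
static inline uint32_t masked_reg_apply(uint32_t current, uint32_t write)
{
	uint32_t mask = write >> 16;
	uint32_t value = write & 0xffffu;

	return (current & ~mask) | (value & mask);
}

/* Hypothetical stand-in for one engine's ring registers. */
struct model_engine {
	uint32_t mi_mode;
	uint32_t ctl;
	uint32_t head;
	uint32_t tail;
	bool reset_requested;
	bool parked_at_reset; /* was the ring stopped when reset was requested? */
};

/* Mirrors the shape of gen3_stop_ring(): stop, then zero CTL/HEAD/TAIL. */
static void model_stop_ring(struct model_engine *e)
{
	e->mi_mode = masked_reg_apply(e->mi_mode, masked_bit_enable(STOP_RING));
	/* Simplification: the model reports idle at once; real code polls. */
	e->mi_mode |= MODE_IDLE;

	e->ctl = 0;
	e->head = 0;
	e->tail = 0;
}

/* Mirrors the ordering added to gen8_reset_engine_start(). */
static void model_reset_engine_start(struct model_engine *e)
{
	model_stop_ring(e);

	e->parked_at_reset = (e->mi_mode & STOP_RING) != 0 &&
			     e->head == 0 && e->tail == 0;
	e->reset_requested = true;
}
```

Calling `model_reset_engine_start()` on an engine whose head and tail are non-zero leaves it with `parked_at_reset` set, which is the invariant the patch tries to establish before the reset request is written.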