
[2/2] drm/i915: Use rcu instead of stop_machine in set_wedged

Message ID 20171009164401.16035-2-daniel.vetter@ffwll.ch (mailing list archive)
State New, archived

Commit Message

Daniel Vetter Oct. 9, 2017, 4:44 p.m. UTC
stop_machine is not really a locking primitive we should use, except
when the hw folks tell us the hw is broken and that's the only way to
work around it.

This patch tries to address the locking abuse of stop_machine() from

commit 20e4933c478a1ca694b38fa4ac44d99e659941f5
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Nov 22 14:41:21 2016 +0000

    drm/i915: Stop the machine as we install the wedged submit_request handler

Chris said part of the reason for going with stop_machine() was that
it has no overhead for the fast-path. But these callbacks use irqsave
spinlocks and do a bunch of MMIO, and rcu_read_lock is _real_ fast.

To stay as close as possible to the stop_machine semantics we first
update all the submit function pointers to the nop handler, then call
synchronize_rcu() to make sure no new requests can be submitted. This
should give us exactly the huge barrier we want.

I pondered whether we should annotate engine->submit_request as __rcu
and use rcu_assign_pointer and rcu_dereference on it. But the reason
behind those is to make sure the compiler/cpu barriers are there for
when you have an actual data structure you point at, to make sure all
the writes are seen correctly on the read side. But we just have a
function pointer, and .text isn't changed, so no need for these
barriers and hence no need for annotations.
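
To spell out the pattern, here is a minimal sketch (made-up types and
names, not the actual driver code): the read side wraps the indirect
call in rcu_read_lock()/rcu_read_unlock(), and the write side does a
plain store followed by synchronize_rcu().

#include <linux/rcupdate.h>

struct request;

struct engine {
	void (*submit_request)(struct request *rq);
};

static void nop_submit_request(struct request *rq)
{
	/* drop the request, never touch the hardware */
}

/* read side: every caller of the hook runs inside an rcu read section */
static void do_submit(struct engine *engine, struct request *rq)
{
	rcu_read_lock();
	engine->submit_request(rq);	/* plain load, no rcu_dereference() */
	rcu_read_unlock();
}

/* write side: swap the pointer, then wait out all old readers */
static void wedge_engine(struct engine *engine)
{
	engine->submit_request = nop_submit_request; /* plain store, .text is fixed */
	synchronize_rcu(); /* nobody can still be running the old handler now */
}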

Unfortunately there's a complication with the call to
intel_engine_init_global_seqno:

- Without stop_machine we must hold the corresponding spinlock.

- Without stop_machine we must ensure that all requests are marked as
  having failed with dma_fence_set_error() before we call it. That
  means we need to split the nop request submission into two phases,
  both synchronized with rcu:

  1. Only stop submitting the requests to hw and mark them as failed.

  2. Once all pending requests in the scheduler/ring are suitably
  marked up as failed, force-complete them all by calling
  intel_engine_init_global_seqno() (see the condensed sketch after
  this list).
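
Condensed (with the timeline locking, comments and error markup
trimmed), the resulting i915_gem_set_wedged() looks roughly like this;
see the full patch below for the real thing:

void i915_gem_set_wedged(struct drm_i915_private *i915)
{
	struct intel_engine_cs *engine;
	enum intel_engine_id id;

	/* phase 1: stop submission to hw, requests only get marked -EIO */
	for_each_engine(engine, i915, id)
		engine->submit_request = nop_submit_request;
	synchronize_rcu();

	/* phase 2: cancel in-flight requests, then allow force-completion */
	for_each_engine(engine, i915, id) {
		engine->cancel_requests(engine);
		engine->submit_request = nop_complete_submit_request;
	}
	synchronize_rcu();

	/* finally force-complete whatever is still pending on each engine */
	for_each_engine(engine, i915, id)
		intel_engine_init_global_seqno(engine,
					       intel_engine_last_submit(engine));

	set_bit(I915_WEDGED, &i915->gpu_error.flags);
	wake_up_all(&i915->gpu_error.reset_queue);
}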

This should fix the following lockdep splat:

Comments

Chris Wilson Oct. 10, 2017, 9:21 a.m. UTC | #1
Quoting Daniel Vetter (2017-10-09 17:44:01)
> stop_machine is not really a locking primitive we should use, except
> when the hw folks tell us the hw is broken and that's the only way to
> work around it.
> 
> This patch tries to address the locking abuse of stop_machine() from
> 
> commit 20e4933c478a1ca694b38fa4ac44d99e659941f5
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Tue Nov 22 14:41:21 2016 +0000
> 
>     drm/i915: Stop the machine as we install the wedged submit_request handler
> 
> Chris said part of the reason for going with stop_machine() was that
> it has no overhead for the fast-path. But these callbacks use irqsave
> spinlocks and do a bunch of MMIO, and rcu_read_lock is _real_ fast.
> 
> To stay as close as possible to the stop_machine semantics we first
> update all the submit function pointers to the nop handler, then call
> synchronize_rcu() to make sure no new requests can be submitted. This
> should give us exactly the huge barrier we want.
> 
> I pondered whether we should annotate engine->submit_request as __rcu
> and use rcu_assign_pointer and rcu_dereference on it. But the reason
> behind those is to make sure the compiler/cpu barriers are there for
> when you have an actual data structure you point at, to make sure all
> the writes are seen correctly on the read side. But we just have a
> function pointer, and .text isn't changed, so no need for these
> barriers and hence no need for annotations.
> 
> Unfortunately there's a complication with the call to
> intel_engine_init_global_seqno:

This is still broken in the same way, as nop_submit_request may execute
while you sleep, breaking cancel_requests.
-Chris
Mika Kuoppala Oct. 10, 2017, 9:50 a.m. UTC | #2
Daniel Vetter <daniel.vetter@ffwll.ch> writes:

> stop_machine is not really a locking primitive we should use, except
> when the hw folks tell us the hw is broken and that's the only way to
> work around it.
>
> This patch tries to address the locking abuse of stop_machine() from
>
> commit 20e4933c478a1ca694b38fa4ac44d99e659941f5
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Tue Nov 22 14:41:21 2016 +0000
>
>     drm/i915: Stop the machine as we install the wedged submit_request handler
>
> Chris said part of the reason for going with stop_machine() was that
> it has no overhead for the fast-path. But these callbacks use irqsave
> spinlocks and do a bunch of MMIO, and rcu_read_lock is _real_ fast.
>
> To stay as close as possible to the stop_machine semantics we first
> update all the submit function pointers to the nop handler, then call
> synchronize_rcu() to make sure no new requests can be submitted. This
> should give us exactly the huge barrier we want.
>
> I pondered whether we should annotate engine->submit_request as __rcu
> and use rcu_assign_pointer and rcu_dereference on it. But the reason
> behind those is to make sure the compiler/cpu barriers are there for
> when you have an actual data structure you point at, to make sure all
> the writes are seen correctly on the read side. But we just have a
> function pointer, and .text isn't changed, so no need for these
> barriers and hence no need for annotations.
>
> Unfortunately there's a complication with the call to
> intel_engine_init_global_seqno:
>
> - Without stop_machine we must hold the corresponding spinlock.
>
> - Without stop_machine we must ensure that all requests are marked as
>   having failed with dma_fence_set_error() before we call it. That
>   means we need to split the nop request submission into two phases,
>   both synchronized with rcu:
>
>   1. Only stop submitting the requests to hw and mark them as failed.
>
>   2. After all pending requests in the scheduler/ring are suitably
>   marked up as failed and we can force complete them all, also force
>   complete by calling intel_engine_init_global_seqno().
>
> This should fix the following lockdep splat:
>
> ======================================================
> WARNING: possible circular locking dependency detected
> 4.14.0-rc3-CI-CI_DRM_3179+ #1 Tainted: G     U
> ------------------------------------------------------
> kworker/3:4/562 is trying to acquire lock:
>  (cpu_hotplug_lock.rw_sem){++++}, at: [<ffffffff8113d4bc>] stop_machine+0x1c/0x40
>
> but task is already holding lock:
>  (&dev->struct_mutex){+.+.}, at: [<ffffffffa0136588>] i915_reset_device+0x1e8/0x260 [i915]
>
> which lock already depends on the new lock.
>
> the existing dependency chain (in reverse order) is:
>
> -> #6 (&dev->struct_mutex){+.+.}:
>        __lock_acquire+0x1420/0x15e0
>        lock_acquire+0xb0/0x200
>        __mutex_lock+0x86/0x9b0
>        mutex_lock_interruptible_nested+0x1b/0x20
>        i915_mutex_lock_interruptible+0x51/0x130 [i915]
>        i915_gem_fault+0x209/0x650 [i915]
>        __do_fault+0x1e/0x80
>        __handle_mm_fault+0xa08/0xed0
>        handle_mm_fault+0x156/0x300
>        __do_page_fault+0x2c5/0x570
>        do_page_fault+0x28/0x250
>        page_fault+0x22/0x30
>
> -> #5 (&mm->mmap_sem){++++}:
>        __lock_acquire+0x1420/0x15e0
>        lock_acquire+0xb0/0x200
>        __might_fault+0x68/0x90
>        _copy_to_user+0x23/0x70
>        filldir+0xa5/0x120
>        dcache_readdir+0xf9/0x170
>        iterate_dir+0x69/0x1a0
>        SyS_getdents+0xa5/0x140
>        entry_SYSCALL_64_fastpath+0x1c/0xb1
>
> -> #4 (&sb->s_type->i_mutex_key#5){++++}:
>        down_write+0x3b/0x70
>        handle_create+0xcb/0x1e0
>        devtmpfsd+0x139/0x180
>        kthread+0x152/0x190
>        ret_from_fork+0x27/0x40
>
> -> #3 ((complete)&req.done){+.+.}:
>        __lock_acquire+0x1420/0x15e0
>        lock_acquire+0xb0/0x200
>        wait_for_common+0x58/0x210
>        wait_for_completion+0x1d/0x20
>        devtmpfs_create_node+0x13d/0x160
>        device_add+0x5eb/0x620
>        device_create_groups_vargs+0xe0/0xf0
>        device_create+0x3a/0x40
>        msr_device_create+0x2b/0x40
>        cpuhp_invoke_callback+0xc9/0xbf0
>        cpuhp_thread_fun+0x17b/0x240
>        smpboot_thread_fn+0x18a/0x280
>        kthread+0x152/0x190
>        ret_from_fork+0x27/0x40
>
> -> #2 (cpuhp_state-up){+.+.}:
>        __lock_acquire+0x1420/0x15e0
>        lock_acquire+0xb0/0x200
>        cpuhp_issue_call+0x133/0x1c0
>        __cpuhp_setup_state_cpuslocked+0x139/0x2a0
>        __cpuhp_setup_state+0x46/0x60
>        page_writeback_init+0x43/0x67
>        pagecache_init+0x3d/0x42
>        start_kernel+0x3a8/0x3fc
>        x86_64_start_reservations+0x2a/0x2c
>        x86_64_start_kernel+0x6d/0x70
>        verify_cpu+0x0/0xfb
>
> -> #1 (cpuhp_state_mutex){+.+.}:
>        __lock_acquire+0x1420/0x15e0
>        lock_acquire+0xb0/0x200
>        __mutex_lock+0x86/0x9b0
>        mutex_lock_nested+0x1b/0x20
>        __cpuhp_setup_state_cpuslocked+0x53/0x2a0
>        __cpuhp_setup_state+0x46/0x60
>        page_alloc_init+0x28/0x30
>        start_kernel+0x145/0x3fc
>        x86_64_start_reservations+0x2a/0x2c
>        x86_64_start_kernel+0x6d/0x70
>        verify_cpu+0x0/0xfb
>
> -> #0 (cpu_hotplug_lock.rw_sem){++++}:
>        check_prev_add+0x430/0x840
>        __lock_acquire+0x1420/0x15e0
>        lock_acquire+0xb0/0x200
>        cpus_read_lock+0x3d/0xb0
>        stop_machine+0x1c/0x40
>        i915_gem_set_wedged+0x1a/0x20 [i915]
>        i915_reset+0xb9/0x230 [i915]
>        i915_reset_device+0x1f6/0x260 [i915]
>        i915_handle_error+0x2d8/0x430 [i915]
>        hangcheck_declare_hang+0xd3/0xf0 [i915]
>        i915_hangcheck_elapsed+0x262/0x2d0 [i915]
>        process_one_work+0x233/0x660
>        worker_thread+0x4e/0x3b0
>        kthread+0x152/0x190
>        ret_from_fork+0x27/0x40
>
> other info that might help us debug this:
>
> Chain exists of:
>   cpu_hotplug_lock.rw_sem --> &mm->mmap_sem --> &dev->struct_mutex
>
>  Possible unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(&dev->struct_mutex);
>                                lock(&mm->mmap_sem);
>                                lock(&dev->struct_mutex);
>   lock(cpu_hotplug_lock.rw_sem);
>
>  *** DEADLOCK ***
>
> 3 locks held by kworker/3:4/562:
>  #0:  ("events_long"){+.+.}, at: [<ffffffff8109c64a>] process_one_work+0x1aa/0x660
>  #1:  ((&(&i915->gpu_error.hangcheck_work)->work)){+.+.}, at: [<ffffffff8109c64a>] process_one_work+0x1aa/0x660
>  #2:  (&dev->struct_mutex){+.+.}, at: [<ffffffffa0136588>] i915_reset_device+0x1e8/0x260 [i915]
>
> stack backtrace:
> CPU: 3 PID: 562 Comm: kworker/3:4 Tainted: G     U          4.14.0-rc3-CI-CI_DRM_3179+ #1
> Hardware name:                  /NUC7i5BNB, BIOS BNKBL357.86A.0048.2017.0704.1415 07/04/2017
> Workqueue: events_long i915_hangcheck_elapsed [i915]
> Call Trace:
>  dump_stack+0x68/0x9f
>  print_circular_bug+0x235/0x3c0
>  ? lockdep_init_map_crosslock+0x20/0x20
>  check_prev_add+0x430/0x840
>  ? irq_work_queue+0x86/0xe0
>  ? wake_up_klogd+0x53/0x70
>  __lock_acquire+0x1420/0x15e0
>  ? __lock_acquire+0x1420/0x15e0
>  ? lockdep_init_map_crosslock+0x20/0x20
>  lock_acquire+0xb0/0x200
>  ? stop_machine+0x1c/0x40
>  ? i915_gem_object_truncate+0x50/0x50 [i915]
>  cpus_read_lock+0x3d/0xb0
>  ? stop_machine+0x1c/0x40
>  stop_machine+0x1c/0x40
>  i915_gem_set_wedged+0x1a/0x20 [i915]
>  i915_reset+0xb9/0x230 [i915]
>  i915_reset_device+0x1f6/0x260 [i915]
>  ? gen8_gt_irq_ack+0x170/0x170 [i915]
>  ? work_on_cpu_safe+0x60/0x60
>  i915_handle_error+0x2d8/0x430 [i915]
>  ? vsnprintf+0xd1/0x4b0
>  ? scnprintf+0x3a/0x70
>  hangcheck_declare_hang+0xd3/0xf0 [i915]
>  ? intel_runtime_pm_put+0x56/0xa0 [i915]
>  i915_hangcheck_elapsed+0x262/0x2d0 [i915]
>  process_one_work+0x233/0x660
>  worker_thread+0x4e/0x3b0
>  kthread+0x152/0x190
>  ? process_one_work+0x660/0x660
>  ? kthread_create_on_node+0x40/0x40
>  ret_from_fork+0x27/0x40
> Setting dangerous option reset - tainting kernel
> i915 0000:00:02.0: Resetting chip after gpu hang
> Setting dangerous option reset - tainting kernel
> i915 0000:00:02.0: Resetting chip after gpu hang
>
> v2: Have 1 global synchronize_rcu() barrier across all engines, and
> improve commit message.
>
> v3: We need to protect the seqno update with the timeline spinlock (in
> set_wedged) to avoid racing with other updates of the seqno, like we
> already do in nop_submit_request (Chris).
>
> v4: Use two-phase sequence to plug the race Chris spotted where we can
> complete requests before they're marked up with -EIO.
>
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=102886
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103096
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Marta Lofstedt <marta.lofstedt@intel.com>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_gem.c                   | 76 +++++++++++++++--------
>  drivers/gpu/drm/i915/i915_gem_request.c           |  2 +
>  drivers/gpu/drm/i915/selftests/i915_gem_request.c |  2 +
>  3 files changed, 53 insertions(+), 27 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 82a10036fb38..8d7d8d1f78db 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -3057,49 +3057,71 @@ static void nop_submit_request(struct drm_i915_gem_request *request)
>  
>  	spin_lock_irqsave(&request->engine->timeline->lock, flags);
>  	__i915_gem_request_submit(request);
> -	intel_engine_init_global_seqno(request->engine, request->global_seqno);
>  	spin_unlock_irqrestore(&request->engine->timeline->lock, flags);
>  }
>  
> -static void engine_set_wedged(struct intel_engine_cs *engine)
> +static void nop_complete_submit_request(struct drm_i915_gem_request *request)
>  {
> -	/* We need to be sure that no thread is running the old callback as
> -	 * we install the nop handler (otherwise we would submit a request
> -	 * to hardware that will never complete). In order to prevent this
> -	 * race, we wait until the machine is idle before making the swap
> -	 * (using stop_machine()).
> -	 */
> -	engine->submit_request = nop_submit_request;
> +	unsigned long flags;
>  
> -	/* Mark all executing requests as skipped */
> -	engine->cancel_requests(engine);
> +	GEM_BUG_ON(!i915_terminally_wedged(&request->i915->gpu_error));
> +	dma_fence_set_error(&request->fence, -EIO);
>  
> -	/* Mark all pending requests as complete so that any concurrent
> -	 * (lockless) lookup doesn't try and wait upon the request as we
> -	 * reset it.
> -	 */
> -	intel_engine_init_global_seqno(engine,
> -				       intel_engine_last_submit(engine));
> +	spin_lock_irqsave(&request->engine->timeline->lock, flags);
> +	__i915_gem_request_submit(request);
> +	intel_engine_init_global_seqno(request->engine, request->global_seqno);
> +	spin_unlock_irqrestore(&request->engine->timeline->lock, flags);
>  }
>  
> -static int __i915_gem_set_wedged_BKL(void *data)
> +void i915_gem_set_wedged(struct drm_i915_private *i915)
>  {
> -	struct drm_i915_private *i915 = data;
>  	struct intel_engine_cs *engine;
>  	enum intel_engine_id id;
>  
> +	/* First, stop submission to hw, but do not yet complete requests by
> +	 * rolling the global seqno forward (since this would complete requests
> +	 * for which we haven't set the fence error to EIO yet).
> +	 */
>  	for_each_engine(engine, i915, id)
> -		engine_set_wedged(engine);
> +		engine->submit_request = nop_submit_request;
>  
> -	set_bit(I915_WEDGED, &i915->gpu_error.flags);
> -	wake_up_all(&i915->gpu_error.reset_queue);
> +	/* Make sure no one is running the old callback before we proceed with
> +	 * cancelling requests and resetting the completion tracking. Otherwise
> +	 * we might submit a request to the hardware which never completes.
> +	 */
> +	synchronize_rcu();
>  
> -	return 0;
> -}
> +	for_each_engine(engine, i915, id) {
> +		/* Mark all executing requests as skipped */
> +		engine->cancel_requests(engine);
>  
> -void i915_gem_set_wedged(struct drm_i915_private *dev_priv)
> -{
> -	stop_machine(__i915_gem_set_wedged_BKL, dev_priv, NULL);
> +		/* Only once we've force-cancelled all in-flight requests can we
> +		 * start to complete all requests.
> +		 */
> +		engine->submit_request = nop_complete_submit_request;
> +	}
> +
> +	/* Make sure no request can slip through without getting completed by
> +	 * either this call here to intel_engine_init_global_seqno, or the one
> +	 * in nop_complete_submit_request.
> +	 */
> +	synchronize_rcu();
> +
> +	for_each_engine(engine, i915, id) {
> +		unsigned long flags;
> +
> +		/* Mark all pending requests as complete so that any concurrent
> +		 * (lockless) lookup doesn't try and wait upon the request as we
> +		 * reset it.
> +		 */
> +		spin_lock_irqsave(&engine->timeline->lock, flags);
> +		intel_engine_init_global_seqno(engine,
> +					       intel_engine_last_submit(engine));
> +		spin_unlock_irqrestore(&engine->timeline->lock, flags);
> +	}
> +
> +	set_bit(I915_WEDGED, &i915->gpu_error.flags);

Can we move this set_bit above the loop, right after synchronize_rcu()?
The thinking is that we then stop accepting requests we can't complete
as early as possible.

-Mika

> +	wake_up_all(&i915->gpu_error.reset_queue);
>  }
>  
>  bool i915_gem_unset_wedged(struct drm_i915_private *i915)
> diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
> index b100b38f1dd2..ef78a85cb845 100644
> --- a/drivers/gpu/drm/i915/i915_gem_request.c
> +++ b/drivers/gpu/drm/i915/i915_gem_request.c
> @@ -556,7 +556,9 @@ submit_notify(struct i915_sw_fence *fence, enum i915_sw_fence_notify state)
>  	switch (state) {
>  	case FENCE_COMPLETE:
>  		trace_i915_gem_request_submit(request);
> +		rcu_read_lock();
>  		request->engine->submit_request(request);
> +		rcu_read_unlock();
>  		break;
>  
>  	case FENCE_FREE:
> diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_request.c b/drivers/gpu/drm/i915/selftests/i915_gem_request.c
> index 78b9f811707f..a999161e8db1 100644
> --- a/drivers/gpu/drm/i915/selftests/i915_gem_request.c
> +++ b/drivers/gpu/drm/i915/selftests/i915_gem_request.c
> @@ -215,7 +215,9 @@ static int igt_request_rewind(void *arg)
>  	}
>  	i915_gem_request_get(vip);
>  	i915_add_request(vip);
> +	rcu_read_lock();
>  	request->engine->submit_request(request);
> +	rcu_read_unlock();
>  
>  	mutex_unlock(&i915->drm.struct_mutex);
>  
> -- 
> 2.14.1
Daniel Vetter Oct. 10, 2017, 12:30 p.m. UTC | #3
On Tue, Oct 10, 2017 at 10:21:45AM +0100, Chris Wilson wrote:
> Quoting Daniel Vetter (2017-10-09 17:44:01)
> > stop_machine is not really a locking primitive we should use, except
> > when the hw folks tell us the hw is broken and that's the only way to
> > work around it.
> > 
> > This patch tries to address the locking abuse of stop_machine() from
> > 
> > commit 20e4933c478a1ca694b38fa4ac44d99e659941f5
> > Author: Chris Wilson <chris@chris-wilson.co.uk>
> > Date:   Tue Nov 22 14:41:21 2016 +0000
> > 
> >     drm/i915: Stop the machine as we install the wedged submit_request handler
> > 
> > Chris said part of the reason for going with stop_machine() was that
> > it has no overhead for the fast-path. But these callbacks use irqsave
> > spinlocks and do a bunch of MMIO, and rcu_read_lock is _real_ fast.
> > 
> > To stay as close as possible to the stop_machine semantics we first
> > update all the submit function pointers to the nop handler, then call
> > synchronize_rcu() to make sure no new requests can be submitted. This
> > should give us exactly the huge barrier we want.
> > 
> > I pondered whether we should annotate engine->submit_request as __rcu
> > and use rcu_assign_pointer and rcu_dereference on it. But the reason
> > behind those is to make sure the compiler/cpu barriers are there for
> > when you have an actual data structure you point at, to make sure all
> > the writes are seen correctly on the read side. But we just have a
> > function pointer, and .text isn't changed, so no need for these
> > barriers and hence no need for annotations.
> > 
> > Unfortunately there's a complication with the call to
> > intel_engine_init_global_seqno:
> 
> This is still broken in the same way as nop_submit_request may execute
> while you sleep, breaking cancel_requests.

I guess then I didn't understand which race you mean, since I think the
one I've found should be fixed now. Can you pls explain in more detail
what exactly is racing against what else?

From what I can tell, legacy cancel_requests is entirely under the spinlock,
so it's hard to race; same for lrc. And with the global seqno update untangled,
they shouldn't complete too early anymore, which I thought was the race
you pointed out on the previous thread. I did reply to that to check my
understanding, but didn't get a reply.

Thanks, Daniel
Daniel Vetter Oct. 10, 2017, 12:37 p.m. UTC | #4
On Tue, Oct 10, 2017 at 12:50:45PM +0300, Mika Kuoppala wrote:
> Daniel Vetter <daniel.vetter@ffwll.ch> writes:
> 
> > stop_machine is not really a locking primitive we should use, except
> > when the hw folks tell us the hw is broken and that's the only way to
> > work around it.
> >
> > This patch tries to address the locking abuse of stop_machine() from
> >
> > commit 20e4933c478a1ca694b38fa4ac44d99e659941f5
> > Author: Chris Wilson <chris@chris-wilson.co.uk>
> > Date:   Tue Nov 22 14:41:21 2016 +0000
> >
> >     drm/i915: Stop the machine as we install the wedged submit_request handler
> >
> > Chris said part of the reason for going with stop_machine() was that
> > it has no overhead for the fast-path. But these callbacks use irqsave
> > spinlocks and do a bunch of MMIO, and rcu_read_lock is _real_ fast.
> >
> > To stay as close as possible to the stop_machine semantics we first
> > update all the submit function pointers to the nop handler, then call
> > synchronize_rcu() to make sure no new requests can be submitted. This
> > should give us exactly the huge barrier we want.
> >
> > I pondered whether we should annotate engine->submit_request as __rcu
> > and use rcu_assign_pointer and rcu_dereference on it. But the reason
> > behind those is to make sure the compiler/cpu barriers are there for
> > when you have an actual data structure you point at, to make sure all
> > the writes are seen correctly on the read side. But we just have a
> > function pointer, and .text isn't changed, so no need for these
> > barriers and hence no need for annotations.
> >
> > Unfortunately there's a complication with the call to
> > intel_engine_init_global_seqno:
> >
> > - Without stop_machine we must hold the corresponding spinlock.
> >
> > - Without stop_machine we must ensure that all requests are marked as
> >   having failed with dma_fence_set_error() before we call it. That
> >   means we need to split the nop request submission into two phases,
> >   both synchronized with rcu:
> >
> >   1. Only stop submitting the requests to hw and mark them as failed.
> >
> >   2. After all pending requests in the scheduler/ring are suitably
> >   marked up as failed and we can force complete them all, also force
> >   complete by calling intel_engine_init_global_seqno().
> >
> > This should fix the following lockdep splat:
> >
> > ======================================================
> > WARNING: possible circular locking dependency detected
> > 4.14.0-rc3-CI-CI_DRM_3179+ #1 Tainted: G     U
> > ------------------------------------------------------
> > kworker/3:4/562 is trying to acquire lock:
> >  (cpu_hotplug_lock.rw_sem){++++}, at: [<ffffffff8113d4bc>] stop_machine+0x1c/0x40
> >
> > but task is already holding lock:
> >  (&dev->struct_mutex){+.+.}, at: [<ffffffffa0136588>] i915_reset_device+0x1e8/0x260 [i915]
> >
> > which lock already depends on the new lock.
> >
> > the existing dependency chain (in reverse order) is:
> >
> > -> #6 (&dev->struct_mutex){+.+.}:
> >        __lock_acquire+0x1420/0x15e0
> >        lock_acquire+0xb0/0x200
> >        __mutex_lock+0x86/0x9b0
> >        mutex_lock_interruptible_nested+0x1b/0x20
> >        i915_mutex_lock_interruptible+0x51/0x130 [i915]
> >        i915_gem_fault+0x209/0x650 [i915]
> >        __do_fault+0x1e/0x80
> >        __handle_mm_fault+0xa08/0xed0
> >        handle_mm_fault+0x156/0x300
> >        __do_page_fault+0x2c5/0x570
> >        do_page_fault+0x28/0x250
> >        page_fault+0x22/0x30
> >
> > -> #5 (&mm->mmap_sem){++++}:
> >        __lock_acquire+0x1420/0x15e0
> >        lock_acquire+0xb0/0x200
> >        __might_fault+0x68/0x90
> >        _copy_to_user+0x23/0x70
> >        filldir+0xa5/0x120
> >        dcache_readdir+0xf9/0x170
> >        iterate_dir+0x69/0x1a0
> >        SyS_getdents+0xa5/0x140
> >        entry_SYSCALL_64_fastpath+0x1c/0xb1
> >
> > -> #4 (&sb->s_type->i_mutex_key#5){++++}:
> >        down_write+0x3b/0x70
> >        handle_create+0xcb/0x1e0
> >        devtmpfsd+0x139/0x180
> >        kthread+0x152/0x190
> >        ret_from_fork+0x27/0x40
> >
> > -> #3 ((complete)&req.done){+.+.}:
> >        __lock_acquire+0x1420/0x15e0
> >        lock_acquire+0xb0/0x200
> >        wait_for_common+0x58/0x210
> >        wait_for_completion+0x1d/0x20
> >        devtmpfs_create_node+0x13d/0x160
> >        device_add+0x5eb/0x620
> >        device_create_groups_vargs+0xe0/0xf0
> >        device_create+0x3a/0x40
> >        msr_device_create+0x2b/0x40
> >        cpuhp_invoke_callback+0xc9/0xbf0
> >        cpuhp_thread_fun+0x17b/0x240
> >        smpboot_thread_fn+0x18a/0x280
> >        kthread+0x152/0x190
> >        ret_from_fork+0x27/0x40
> >
> > -> #2 (cpuhp_state-up){+.+.}:
> >        __lock_acquire+0x1420/0x15e0
> >        lock_acquire+0xb0/0x200
> >        cpuhp_issue_call+0x133/0x1c0
> >        __cpuhp_setup_state_cpuslocked+0x139/0x2a0
> >        __cpuhp_setup_state+0x46/0x60
> >        page_writeback_init+0x43/0x67
> >        pagecache_init+0x3d/0x42
> >        start_kernel+0x3a8/0x3fc
> >        x86_64_start_reservations+0x2a/0x2c
> >        x86_64_start_kernel+0x6d/0x70
> >        verify_cpu+0x0/0xfb
> >
> > -> #1 (cpuhp_state_mutex){+.+.}:
> >        __lock_acquire+0x1420/0x15e0
> >        lock_acquire+0xb0/0x200
> >        __mutex_lock+0x86/0x9b0
> >        mutex_lock_nested+0x1b/0x20
> >        __cpuhp_setup_state_cpuslocked+0x53/0x2a0
> >        __cpuhp_setup_state+0x46/0x60
> >        page_alloc_init+0x28/0x30
> >        start_kernel+0x145/0x3fc
> >        x86_64_start_reservations+0x2a/0x2c
> >        x86_64_start_kernel+0x6d/0x70
> >        verify_cpu+0x0/0xfb
> >
> > -> #0 (cpu_hotplug_lock.rw_sem){++++}:
> >        check_prev_add+0x430/0x840
> >        __lock_acquire+0x1420/0x15e0
> >        lock_acquire+0xb0/0x200
> >        cpus_read_lock+0x3d/0xb0
> >        stop_machine+0x1c/0x40
> >        i915_gem_set_wedged+0x1a/0x20 [i915]
> >        i915_reset+0xb9/0x230 [i915]
> >        i915_reset_device+0x1f6/0x260 [i915]
> >        i915_handle_error+0x2d8/0x430 [i915]
> >        hangcheck_declare_hang+0xd3/0xf0 [i915]
> >        i915_hangcheck_elapsed+0x262/0x2d0 [i915]
> >        process_one_work+0x233/0x660
> >        worker_thread+0x4e/0x3b0
> >        kthread+0x152/0x190
> >        ret_from_fork+0x27/0x40
> >
> > other info that might help us debug this:
> >
> > Chain exists of:
> >   cpu_hotplug_lock.rw_sem --> &mm->mmap_sem --> &dev->struct_mutex
> >
> >  Possible unsafe locking scenario:
> >
> >        CPU0                    CPU1
> >        ----                    ----
> >   lock(&dev->struct_mutex);
> >                                lock(&mm->mmap_sem);
> >                                lock(&dev->struct_mutex);
> >   lock(cpu_hotplug_lock.rw_sem);
> >
> >  *** DEADLOCK ***
> >
> > 3 locks held by kworker/3:4/562:
> >  #0:  ("events_long"){+.+.}, at: [<ffffffff8109c64a>] process_one_work+0x1aa/0x660
> >  #1:  ((&(&i915->gpu_error.hangcheck_work)->work)){+.+.}, at: [<ffffffff8109c64a>] process_one_work+0x1aa/0x660
> >  #2:  (&dev->struct_mutex){+.+.}, at: [<ffffffffa0136588>] i915_reset_device+0x1e8/0x260 [i915]
> >
> > stack backtrace:
> > CPU: 3 PID: 562 Comm: kworker/3:4 Tainted: G     U          4.14.0-rc3-CI-CI_DRM_3179+ #1
> > Hardware name:                  /NUC7i5BNB, BIOS BNKBL357.86A.0048.2017.0704.1415 07/04/2017
> > Workqueue: events_long i915_hangcheck_elapsed [i915]
> > Call Trace:
> >  dump_stack+0x68/0x9f
> >  print_circular_bug+0x235/0x3c0
> >  ? lockdep_init_map_crosslock+0x20/0x20
> >  check_prev_add+0x430/0x840
> >  ? irq_work_queue+0x86/0xe0
> >  ? wake_up_klogd+0x53/0x70
> >  __lock_acquire+0x1420/0x15e0
> >  ? __lock_acquire+0x1420/0x15e0
> >  ? lockdep_init_map_crosslock+0x20/0x20
> >  lock_acquire+0xb0/0x200
> >  ? stop_machine+0x1c/0x40
> >  ? i915_gem_object_truncate+0x50/0x50 [i915]
> >  cpus_read_lock+0x3d/0xb0
> >  ? stop_machine+0x1c/0x40
> >  stop_machine+0x1c/0x40
> >  i915_gem_set_wedged+0x1a/0x20 [i915]
> >  i915_reset+0xb9/0x230 [i915]
> >  i915_reset_device+0x1f6/0x260 [i915]
> >  ? gen8_gt_irq_ack+0x170/0x170 [i915]
> >  ? work_on_cpu_safe+0x60/0x60
> >  i915_handle_error+0x2d8/0x430 [i915]
> >  ? vsnprintf+0xd1/0x4b0
> >  ? scnprintf+0x3a/0x70
> >  hangcheck_declare_hang+0xd3/0xf0 [i915]
> >  ? intel_runtime_pm_put+0x56/0xa0 [i915]
> >  i915_hangcheck_elapsed+0x262/0x2d0 [i915]
> >  process_one_work+0x233/0x660
> >  worker_thread+0x4e/0x3b0
> >  kthread+0x152/0x190
> >  ? process_one_work+0x660/0x660
> >  ? kthread_create_on_node+0x40/0x40
> >  ret_from_fork+0x27/0x40
> > Setting dangerous option reset - tainting kernel
> > i915 0000:00:02.0: Resetting chip after gpu hang
> > Setting dangerous option reset - tainting kernel
> > i915 0000:00:02.0: Resetting chip after gpu hang
> >
> > v2: Have 1 global synchronize_rcu() barrier across all engines, and
> > improve commit message.
> >
> > v3: We need to protect the seqno update with the timeline spinlock (in
> > set_wedged) to avoid racing with other updates of the seqno, like we
> > already do in nop_submit_request (Chris).
> >
> > v4: Use two-phase sequence to plug the race Chris spotted where we can
> > complete requests before they're marked up with -EIO.
> >
> > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=102886
> > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103096
> > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Marta Lofstedt <marta.lofstedt@intel.com>
> > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > ---
> >  drivers/gpu/drm/i915/i915_gem.c                   | 76 +++++++++++++++--------
> >  drivers/gpu/drm/i915/i915_gem_request.c           |  2 +
> >  drivers/gpu/drm/i915/selftests/i915_gem_request.c |  2 +
> >  3 files changed, 53 insertions(+), 27 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> > index 82a10036fb38..8d7d8d1f78db 100644
> > --- a/drivers/gpu/drm/i915/i915_gem.c
> > +++ b/drivers/gpu/drm/i915/i915_gem.c
> > @@ -3057,49 +3057,71 @@ static void nop_submit_request(struct drm_i915_gem_request *request)
> >  
> >  	spin_lock_irqsave(&request->engine->timeline->lock, flags);
> >  	__i915_gem_request_submit(request);
> > -	intel_engine_init_global_seqno(request->engine, request->global_seqno);
> >  	spin_unlock_irqrestore(&request->engine->timeline->lock, flags);
> >  }
> >  
> > -static void engine_set_wedged(struct intel_engine_cs *engine)
> > +static void nop_complete_submit_request(struct drm_i915_gem_request *request)
> >  {
> > -	/* We need to be sure that no thread is running the old callback as
> > -	 * we install the nop handler (otherwise we would submit a request
> > -	 * to hardware that will never complete). In order to prevent this
> > -	 * race, we wait until the machine is idle before making the swap
> > -	 * (using stop_machine()).
> > -	 */
> > -	engine->submit_request = nop_submit_request;
> > +	unsigned long flags;
> >  
> > -	/* Mark all executing requests as skipped */
> > -	engine->cancel_requests(engine);
> > +	GEM_BUG_ON(!i915_terminally_wedged(&request->i915->gpu_error));
> > +	dma_fence_set_error(&request->fence, -EIO);
> >  
> > -	/* Mark all pending requests as complete so that any concurrent
> > -	 * (lockless) lookup doesn't try and wait upon the request as we
> > -	 * reset it.
> > -	 */
> > -	intel_engine_init_global_seqno(engine,
> > -				       intel_engine_last_submit(engine));
> > +	spin_lock_irqsave(&request->engine->timeline->lock, flags);
> > +	__i915_gem_request_submit(request);
> > +	intel_engine_init_global_seqno(request->engine, request->global_seqno);
> > +	spin_unlock_irqrestore(&request->engine->timeline->lock, flags);
> >  }
> >  
> > -static int __i915_gem_set_wedged_BKL(void *data)
> > +void i915_gem_set_wedged(struct drm_i915_private *i915)
> >  {
> > -	struct drm_i915_private *i915 = data;
> >  	struct intel_engine_cs *engine;
> >  	enum intel_engine_id id;
> >  
> > +	/* First, stop submission to hw, but do not yet complete requests by
> > +	 * rolling the global seqno forward (since this would complete requests
> > +	 * for which we haven't set the fence error to EIO yet).
> > +	 */
> >  	for_each_engine(engine, i915, id)
> > -		engine_set_wedged(engine);
> > +		engine->submit_request = nop_submit_request;
> >  
> > -	set_bit(I915_WEDGED, &i915->gpu_error.flags);
> > -	wake_up_all(&i915->gpu_error.reset_queue);
> > +	/* Make sure no one is running the old callback before we proceed with
> > +	 * cancelling requests and resetting the completion tracking. Otherwise
> > +	 * we might submit a request to the hardware which never completes.
> > +	 */
> > +	synchronize_rcu();
> >  
> > -	return 0;
> > -}
> > +	for_each_engine(engine, i915, id) {
> > +		/* Mark all executing requests as skipped */
> > +		engine->cancel_requests(engine);
> >  
> > -void i915_gem_set_wedged(struct drm_i915_private *dev_priv)
> > -{
> > -	stop_machine(__i915_gem_set_wedged_BKL, dev_priv, NULL);
> > +		/* Only once we've force-cancelled all in-flight requests can we
> > +		 * start to complete all requests.
> > +		 */
> > +		engine->submit_request = nop_complete_submit_request;
> > +	}
> > +
> > +	/* Make sure no request can slip through without getting completed by
> > +	 * either this call here to intel_engine_init_global_seqno, or the one
> > +	 * in nop_complete_submit_request.
> > +	 */
> > +	synchronize_rcu();
> > +
> > +	for_each_engine(engine, i915, id) {
> > +		unsigned long flags;
> > +
> > +		/* Mark all pending requests as complete so that any concurrent
> > +		 * (lockless) lookup doesn't try and wait upon the request as we
> > +		 * reset it.
> > +		 */
> > +		spin_lock_irqsave(&engine->timeline->lock, flags);
> > +		intel_engine_init_global_seqno(engine,
> > +					       intel_engine_last_submit(engine));
> > +		spin_unlock_irqrestore(&engine->timeline->lock, flags);
> > +	}
> > +
> > +	set_bit(I915_WEDGED, &i915->gpu_error.flags);
> 
> Can we move this set_bit above the loop, right after synchronize_rcu()?
> The thinking is that we then stop accepting requests we can't complete
> as early as possible.

tbh I have no idea what this thing does there anymore. It looks like this
will terminally wedge the gpu and refuse all subsequent requests (see
i915_gem_request_alloc), except that i915_gem_set_wedged seems to be the
new way to recover a fully hung gpu (and then later on requests are ok
again).

In short, I have no idea what this does nor where it should be put. Can
you pls explain more?

Thanks, Daniel

> 
> -Mika
> 
> > +	wake_up_all(&i915->gpu_error.reset_queue);
> >  }
> >  
> >  bool i915_gem_unset_wedged(struct drm_i915_private *i915)
> > diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
> > index b100b38f1dd2..ef78a85cb845 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_request.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_request.c
> > @@ -556,7 +556,9 @@ submit_notify(struct i915_sw_fence *fence, enum i915_sw_fence_notify state)
> >  	switch (state) {
> >  	case FENCE_COMPLETE:
> >  		trace_i915_gem_request_submit(request);
> > +		rcu_read_lock();
> >  		request->engine->submit_request(request);
> > +		rcu_read_unlock();
> >  		break;
> >  
> >  	case FENCE_FREE:
> > diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_request.c b/drivers/gpu/drm/i915/selftests/i915_gem_request.c
> > index 78b9f811707f..a999161e8db1 100644
> > --- a/drivers/gpu/drm/i915/selftests/i915_gem_request.c
> > +++ b/drivers/gpu/drm/i915/selftests/i915_gem_request.c
> > @@ -215,7 +215,9 @@ static int igt_request_rewind(void *arg)
> >  	}
> >  	i915_gem_request_get(vip);
> >  	i915_add_request(vip);
> > +	rcu_read_lock();
> >  	request->engine->submit_request(request);
> > +	rcu_read_unlock();
> >  
> >  	mutex_unlock(&i915->drm.struct_mutex);
> >  
> > -- 
> > 2.14.1
Chris Wilson Oct. 10, 2017, 1:29 p.m. UTC | #5
Quoting Chris Wilson (2017-10-10 10:21:45)
> Quoting Daniel Vetter (2017-10-09 17:44:01)
> > stop_machine is not really a locking primitive we should use, except
> > when the hw folks tell us the hw is broken and that's the only way to
> > work around it.
> > 
> > This patch tries to address the locking abuse of stop_machine() from
> > 
> > commit 20e4933c478a1ca694b38fa4ac44d99e659941f5
> > Author: Chris Wilson <chris@chris-wilson.co.uk>
> > Date:   Tue Nov 22 14:41:21 2016 +0000
> > 
> >     drm/i915: Stop the machine as we install the wedged submit_request handler
> > 
> > Chris said part of the reason for going with stop_machine() was that
> > it has no overhead for the fast-path. But these callbacks use irqsave
> > spinlocks and do a bunch of MMIO, and rcu_read_lock is _real_ fast.
> > 
> > To stay as close as possible to the stop_machine semantics we first
> > update all the submit function pointers to the nop handler, then call
> > synchronize_rcu() to make sure no new requests can be submitted. This
> > should give us exactly the huge barrier we want.
> > 
> > I pondered whether we should annotate engine->submit_request as __rcu
> > and use rcu_assign_pointer and rcu_dereference on it. But the reason
> > behind those is to make sure the compiler/cpu barriers are there for
> > when you have an actual data structure you point at, to make sure all
> > the writes are seen correctly on the read side. But we just have a
> > function pointer, and .text isn't changed, so no need for these
> > barriers and hence no need for annotations.
> > 
> > Unfortunately there's a complication with the call to
> > intel_engine_init_global_seqno:
> 
> This is still broken in the same way as nop_submit_request may execute
> while you sleep, breaking cancel_requests.

My apologies, I misread the diff and didn't catch the removal of
init_seqno. From that pov, the execlists should be intact and I can't
see a way for -EIO to go unreported.

(That we have them is a topic for another day, as set-wedged is not
meant to be regarded as a normal operation, but user data loss.)

The residual panic I have is that we had to throw set-wedged vs unwedge
ordering out of the window; it used to be struct_mutex. Making
set-wedged prolonged only makes the window wider; it didn't create that
window, and at the moment it's carefully ordered inside the
i915_handle_error fallback. (We blissfully ignore debugfs.)

Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
-Chris
Chris Wilson Oct. 10, 2017, 1:39 p.m. UTC | #6
Style nits:

Quoting Daniel Vetter (2017-10-09 17:44:01)
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 82a10036fb38..8d7d8d1f78db 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -3057,49 +3057,71 @@ static void nop_submit_request(struct drm_i915_gem_request *request)
>  
>         spin_lock_irqsave(&request->engine->timeline->lock, flags);
>         __i915_gem_request_submit(request);
> -       intel_engine_init_global_seqno(request->engine, request->global_seqno);
>         spin_unlock_irqrestore(&request->engine->timeline->lock, flags);

This reduces to i915_gem_request_submit().

Hah, you can just set engine->submit_request = i915_gem_request_submit, and
keep the nop_submit_request for the second phase! That may make the
diff neater?

> -static int __i915_gem_set_wedged_BKL(void *data)
> +void i915_gem_set_wedged(struct drm_i915_private *i915)
>  {
> -       struct drm_i915_private *i915 = data;
>         struct intel_engine_cs *engine;
>         enum intel_engine_id id;
>  
> +       /* First, stop submission to hw, but do not yet complete requests by
> +        * rolling the global seqno forward (since this would complete requests
> +        * for which we haven't set the fence error to EIO yet).
> +        */

I'm doing a quiet fixup of all my comments to follow
/*
 * The core convention, which you normally use anyway.
 */

>         for_each_engine(engine, i915, id)
> -               engine_set_wedged(engine);
> +               engine->submit_request = nop_submit_request;
>  
> -       set_bit(I915_WEDGED, &i915->gpu_error.flags);
> -       wake_up_all(&i915->gpu_error.reset_queue);
> +       /* Make sure no one is running the old callback before we proceed with
> +        * cancelling requests and resetting the completion tracking. Otherwise
> +        * we might submit a request to the hardware which never completes.
> +        */
> +       synchronize_rcu();
>  
> -       return 0;
> -}
> +       for_each_engine(engine, i915, id) {
> +               /* Mark all executing requests as skipped */
> +               engine->cancel_requests(engine);
>  
> -void i915_gem_set_wedged(struct drm_i915_private *dev_priv)
> -{
> -       stop_machine(__i915_gem_set_wedged_BKL, dev_priv, NULL);
> +               /* Only once we've force-cancelled all in-flight requests can we
> +                * start to complete all requests.
> +                */
> +               engine->submit_request = nop_complete_submit_request;
> +       }
> +
> +       /* Make sure no request can slip through without getting completed by
> +        * either this call here to intel_engine_init_global_seqno, or the one
> +        * in nop_complete_submit_request.
> +        */
> +       synchronize_rcu();
> +
> +       for_each_engine(engine, i915, id) {
> +               unsigned long flags;
> +
> +               /* Mark all pending requests as complete so that any concurrent
> +                * (lockless) lookup doesn't try and wait upon the request as we
> +                * reset it.
> +                */
> +               spin_lock_irqsave(&engine->timeline->lock, flags);
> +               intel_engine_init_global_seqno(engine,
> +                                              intel_engine_last_submit(engine));
> +               spin_unlock_irqrestore(&engine->timeline->lock, flags);
> +       }
> +
> +       set_bit(I915_WEDGED, &i915->gpu_error.flags);
> +       wake_up_all(&i915->gpu_error.reset_queue);
>  }
>  
>  bool i915_gem_unset_wedged(struct drm_i915_private *i915)
> diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
> index b100b38f1dd2..ef78a85cb845 100644
> --- a/drivers/gpu/drm/i915/i915_gem_request.c
> +++ b/drivers/gpu/drm/i915/i915_gem_request.c
> @@ -556,7 +556,9 @@ submit_notify(struct i915_sw_fence *fence, enum i915_sw_fence_notify state)
>         switch (state) {
>         case FENCE_COMPLETE:
>                 trace_i915_gem_request_submit(request);
> +               rcu_read_lock();

This trick needs a comment.

/*
 * We need to serialize use of the submit_request() callback with its
 * hotplugging performed during an emergency i915_gem_set_wedged().
 * We use the RCU mechanism to mark the critical section in order to
 * force i915_gem_set_wedged() to wait until the submit_request() is
 * completed before proceeding.
 */
-Chris
Daniel Vetter Oct. 10, 2017, 2:09 p.m. UTC | #7
On Tue, Oct 10, 2017 at 3:39 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> Style nits:
>
> Quoting Daniel Vetter (2017-10-09 17:44:01)
>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>> index 82a10036fb38..8d7d8d1f78db 100644
>> --- a/drivers/gpu/drm/i915/i915_gem.c
>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>> @@ -3057,49 +3057,71 @@ static void nop_submit_request(struct drm_i915_gem_request *request)
>>
>>         spin_lock_irqsave(&request->engine->timeline->lock, flags);
>>         __i915_gem_request_submit(request);
>> -       intel_engine_init_global_seqno(request->engine, request->global_seqno);
>>         spin_unlock_irqrestore(&request->engine->timeline->lock, flags);
>
> This reduced to i915_gem_request_submit().
>
> Hah, you can just engine->submit_request = i915_gem_request_submit, and
> keep the nop_submit_request for the second phase! That may make the
> diff neater?

We still need to call dma_fence_set_error, and we need to start doing
that before we start calling ->cancel_requests, or we might miss some
requests.
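
I.e. the phase-one handler has to keep looking roughly like this
(sketch only, exact submit helper aside; I'm assuming the unchanged
context lines of nop_submit_request in the hunk above keep the
existing GEM_BUG_ON/dma_fence_set_error calls):

static void nop_submit_request(struct drm_i915_gem_request *request)
{
	GEM_BUG_ON(!i915_terminally_wedged(&request->i915->gpu_error));
	dma_fence_set_error(&request->fence, -EIO); /* must land before cancel_requests */
	i915_gem_request_submit(request);	/* never reaches the hardware */
}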

I'll do all the other suggestions and resubmit.
-Daniel

>> -static int __i915_gem_set_wedged_BKL(void *data)
>> +void i915_gem_set_wedged(struct drm_i915_private *i915)
>>  {
>> -       struct drm_i915_private *i915 = data;
>>         struct intel_engine_cs *engine;
>>         enum intel_engine_id id;
>>
>> +       /* First, stop submission to hw, but do not yet complete requests by
>> +        * rolling the global seqno forward (since this would complete requests
>> +        * for which we haven't set the fence error to EIO yet).
>> +        */
>
> I'm doing a quiet fixup of all my comments to follow
> /*
>  * The core convention, which you normally use anyway.
>  */
>
>>         for_each_engine(engine, i915, id)
>> -               engine_set_wedged(engine);
>> +               engine->submit_request = nop_submit_request;
>>
>> -       set_bit(I915_WEDGED, &i915->gpu_error.flags);
>> -       wake_up_all(&i915->gpu_error.reset_queue);
>> +       /* Make sure no one is running the old callback before we proceed with
>> +        * cancelling requests and resetting the completion tracking. Otherwise
>> +        * we might submit a request to the hardware which never completes.
>> +        */
>> +       synchronize_rcu();
>>
>> -       return 0;
>> -}
>> +       for_each_engine(engine, i915, id) {
>> +               /* Mark all executing requests as skipped */
>> +               engine->cancel_requests(engine);
>>
>> -void i915_gem_set_wedged(struct drm_i915_private *dev_priv)
>> -{
>> -       stop_machine(__i915_gem_set_wedged_BKL, dev_priv, NULL);
>> +               /* Only once we've force-cancelled all in-flight requests can we
>> +                * start to complete all requests.
>> +                */
>> +               engine->submit_request = nop_complete_submit_request;
>> +       }
>> +
>> +       /* Make sure no request can slip through without getting completed by
>> +        * either this call here to intel_engine_init_global_seqno, or the one
>> +        * in nop_complete_submit_request.
>> +        */
>> +       synchronize_rcu();
>> +
>> +       for_each_engine(engine, i915, id) {
>> +               unsigned long flags;
>> +
>> +               /* Mark all pending requests as complete so that any concurrent
>> +                * (lockless) lookup doesn't try and wait upon the request as we
>> +                * reset it.
>> +                */
>> +               spin_lock_irqsave(&engine->timeline->lock, flags);
>> +               intel_engine_init_global_seqno(engine,
>> +                                              intel_engine_last_submit(engine));
>> +               spin_unlock_irqrestore(&engine->timeline->lock, flags);
>> +       }
>> +
>> +       set_bit(I915_WEDGED, &i915->gpu_error.flags);
>> +       wake_up_all(&i915->gpu_error.reset_queue);
>>  }
>>
>>  bool i915_gem_unset_wedged(struct drm_i915_private *i915)
>> diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
>> index b100b38f1dd2..ef78a85cb845 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_request.c
>> +++ b/drivers/gpu/drm/i915/i915_gem_request.c
>> @@ -556,7 +556,9 @@ submit_notify(struct i915_sw_fence *fence, enum i915_sw_fence_notify state)
>>         switch (state) {
>>         case FENCE_COMPLETE:
>>                 trace_i915_gem_request_submit(request);
>> +               rcu_read_lock();
>
> This trick needs a comment.
>
> /*
>  * We need to serialize use of the submit_request() callback with its
>  * hotplugging performed during an emergency i915_gem_set_wedged().
>  * We use the RCU mechanism to mark the critical section in order to
>  * force i915_gem_set_wedged() to wait until the submit_request() is
>  * completed before proceeding.
>  */
> -Chris

Patch

======================================================
WARNING: possible circular locking dependency detected
4.14.0-rc3-CI-CI_DRM_3179+ #1 Tainted: G     U
------------------------------------------------------
kworker/3:4/562 is trying to acquire lock:
 (cpu_hotplug_lock.rw_sem){++++}, at: [<ffffffff8113d4bc>] stop_machine+0x1c/0x40

but task is already holding lock:
 (&dev->struct_mutex){+.+.}, at: [<ffffffffa0136588>] i915_reset_device+0x1e8/0x260 [i915]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #6 (&dev->struct_mutex){+.+.}:
       __lock_acquire+0x1420/0x15e0
       lock_acquire+0xb0/0x200
       __mutex_lock+0x86/0x9b0
       mutex_lock_interruptible_nested+0x1b/0x20
       i915_mutex_lock_interruptible+0x51/0x130 [i915]
       i915_gem_fault+0x209/0x650 [i915]
       __do_fault+0x1e/0x80
       __handle_mm_fault+0xa08/0xed0
       handle_mm_fault+0x156/0x300
       __do_page_fault+0x2c5/0x570
       do_page_fault+0x28/0x250
       page_fault+0x22/0x30

-> #5 (&mm->mmap_sem){++++}:
       __lock_acquire+0x1420/0x15e0
       lock_acquire+0xb0/0x200
       __might_fault+0x68/0x90
       _copy_to_user+0x23/0x70
       filldir+0xa5/0x120
       dcache_readdir+0xf9/0x170
       iterate_dir+0x69/0x1a0
       SyS_getdents+0xa5/0x140
       entry_SYSCALL_64_fastpath+0x1c/0xb1

-> #4 (&sb->s_type->i_mutex_key#5){++++}:
       down_write+0x3b/0x70
       handle_create+0xcb/0x1e0
       devtmpfsd+0x139/0x180
       kthread+0x152/0x190
       ret_from_fork+0x27/0x40

-> #3 ((complete)&req.done){+.+.}:
       __lock_acquire+0x1420/0x15e0
       lock_acquire+0xb0/0x200
       wait_for_common+0x58/0x210
       wait_for_completion+0x1d/0x20
       devtmpfs_create_node+0x13d/0x160
       device_add+0x5eb/0x620
       device_create_groups_vargs+0xe0/0xf0
       device_create+0x3a/0x40
       msr_device_create+0x2b/0x40
       cpuhp_invoke_callback+0xc9/0xbf0
       cpuhp_thread_fun+0x17b/0x240
       smpboot_thread_fn+0x18a/0x280
       kthread+0x152/0x190
       ret_from_fork+0x27/0x40

-> #2 (cpuhp_state-up){+.+.}:
       __lock_acquire+0x1420/0x15e0
       lock_acquire+0xb0/0x200
       cpuhp_issue_call+0x133/0x1c0
       __cpuhp_setup_state_cpuslocked+0x139/0x2a0
       __cpuhp_setup_state+0x46/0x60
       page_writeback_init+0x43/0x67
       pagecache_init+0x3d/0x42
       start_kernel+0x3a8/0x3fc
       x86_64_start_reservations+0x2a/0x2c
       x86_64_start_kernel+0x6d/0x70
       verify_cpu+0x0/0xfb

-> #1 (cpuhp_state_mutex){+.+.}:
       __lock_acquire+0x1420/0x15e0
       lock_acquire+0xb0/0x200
       __mutex_lock+0x86/0x9b0
       mutex_lock_nested+0x1b/0x20
       __cpuhp_setup_state_cpuslocked+0x53/0x2a0
       __cpuhp_setup_state+0x46/0x60
       page_alloc_init+0x28/0x30
       start_kernel+0x145/0x3fc
       x86_64_start_reservations+0x2a/0x2c
       x86_64_start_kernel+0x6d/0x70
       verify_cpu+0x0/0xfb

-> #0 (cpu_hotplug_lock.rw_sem){++++}:
       check_prev_add+0x430/0x840
       __lock_acquire+0x1420/0x15e0
       lock_acquire+0xb0/0x200
       cpus_read_lock+0x3d/0xb0
       stop_machine+0x1c/0x40
       i915_gem_set_wedged+0x1a/0x20 [i915]
       i915_reset+0xb9/0x230 [i915]
       i915_reset_device+0x1f6/0x260 [i915]
       i915_handle_error+0x2d8/0x430 [i915]
       hangcheck_declare_hang+0xd3/0xf0 [i915]
       i915_hangcheck_elapsed+0x262/0x2d0 [i915]
       process_one_work+0x233/0x660
       worker_thread+0x4e/0x3b0
       kthread+0x152/0x190
       ret_from_fork+0x27/0x40

other info that might help us debug this:

Chain exists of:
  cpu_hotplug_lock.rw_sem --> &mm->mmap_sem --> &dev->struct_mutex

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&dev->struct_mutex);
                               lock(&mm->mmap_sem);
                               lock(&dev->struct_mutex);
  lock(cpu_hotplug_lock.rw_sem);

 *** DEADLOCK ***

3 locks held by kworker/3:4/562:
 #0:  ("events_long"){+.+.}, at: [<ffffffff8109c64a>] process_one_work+0x1aa/0x660
 #1:  ((&(&i915->gpu_error.hangcheck_work)->work)){+.+.}, at: [<ffffffff8109c64a>] process_one_work+0x1aa/0x660
 #2:  (&dev->struct_mutex){+.+.}, at: [<ffffffffa0136588>] i915_reset_device+0x1e8/0x260 [i915]

stack backtrace:
CPU: 3 PID: 562 Comm: kworker/3:4 Tainted: G     U          4.14.0-rc3-CI-CI_DRM_3179+ #1
Hardware name:                  /NUC7i5BNB, BIOS BNKBL357.86A.0048.2017.0704.1415 07/04/2017
Workqueue: events_long i915_hangcheck_elapsed [i915]
Call Trace:
 dump_stack+0x68/0x9f
 print_circular_bug+0x235/0x3c0
 ? lockdep_init_map_crosslock+0x20/0x20
 check_prev_add+0x430/0x840
 ? irq_work_queue+0x86/0xe0
 ? wake_up_klogd+0x53/0x70
 __lock_acquire+0x1420/0x15e0
 ? __lock_acquire+0x1420/0x15e0
 ? lockdep_init_map_crosslock+0x20/0x20
 lock_acquire+0xb0/0x200
 ? stop_machine+0x1c/0x40
 ? i915_gem_object_truncate+0x50/0x50 [i915]
 cpus_read_lock+0x3d/0xb0
 ? stop_machine+0x1c/0x40
 stop_machine+0x1c/0x40
 i915_gem_set_wedged+0x1a/0x20 [i915]
 i915_reset+0xb9/0x230 [i915]
 i915_reset_device+0x1f6/0x260 [i915]
 ? gen8_gt_irq_ack+0x170/0x170 [i915]
 ? work_on_cpu_safe+0x60/0x60
 i915_handle_error+0x2d8/0x430 [i915]
 ? vsnprintf+0xd1/0x4b0
 ? scnprintf+0x3a/0x70
 hangcheck_declare_hang+0xd3/0xf0 [i915]
 ? intel_runtime_pm_put+0x56/0xa0 [i915]
 i915_hangcheck_elapsed+0x262/0x2d0 [i915]
 process_one_work+0x233/0x660
 worker_thread+0x4e/0x3b0
 kthread+0x152/0x190
 ? process_one_work+0x660/0x660
 ? kthread_create_on_node+0x40/0x40
 ret_from_fork+0x27/0x40
Setting dangerous option reset - tainting kernel
i915 0000:00:02.0: Resetting chip after gpu hang
Setting dangerous option reset - tainting kernel
i915 0000:00:02.0: Resetting chip after gpu hang

v2: Have 1 global synchronize_rcu() barrier across all engines, and
improve commit message.

v3: We need to protect the seqno update with the timeline spinlock (in
set_wedged) to avoid racing with other updates of the seqno, like we
already do in nop_submit_request (Chris).

v4: Use two-phase sequence to plug the race Chris spotted where we can
complete requests before they're marked up with -EIO.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=102886
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103096
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Marta Lofstedt <marta.lofstedt@intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c                   | 76 +++++++++++++++--------
 drivers/gpu/drm/i915/i915_gem_request.c           |  2 +
 drivers/gpu/drm/i915/selftests/i915_gem_request.c |  2 +
 3 files changed, 53 insertions(+), 27 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 82a10036fb38..8d7d8d1f78db 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -3057,49 +3057,71 @@  static void nop_submit_request(struct drm_i915_gem_request *request)
 
 	spin_lock_irqsave(&request->engine->timeline->lock, flags);
 	__i915_gem_request_submit(request);
-	intel_engine_init_global_seqno(request->engine, request->global_seqno);
 	spin_unlock_irqrestore(&request->engine->timeline->lock, flags);
 }
 
-static void engine_set_wedged(struct intel_engine_cs *engine)
+static void nop_complete_submit_request(struct drm_i915_gem_request *request)
 {
-	/* We need to be sure that no thread is running the old callback as
-	 * we install the nop handler (otherwise we would submit a request
-	 * to hardware that will never complete). In order to prevent this
-	 * race, we wait until the machine is idle before making the swap
-	 * (using stop_machine()).
-	 */
-	engine->submit_request = nop_submit_request;
+	unsigned long flags;
 
-	/* Mark all executing requests as skipped */
-	engine->cancel_requests(engine);
+	GEM_BUG_ON(!i915_terminally_wedged(&request->i915->gpu_error));
+	dma_fence_set_error(&request->fence, -EIO);
 
-	/* Mark all pending requests as complete so that any concurrent
-	 * (lockless) lookup doesn't try and wait upon the request as we
-	 * reset it.
-	 */
-	intel_engine_init_global_seqno(engine,
-				       intel_engine_last_submit(engine));
+	spin_lock_irqsave(&request->engine->timeline->lock, flags);
+	__i915_gem_request_submit(request);
+	intel_engine_init_global_seqno(request->engine, request->global_seqno);
+	spin_unlock_irqrestore(&request->engine->timeline->lock, flags);
 }
 
-static int __i915_gem_set_wedged_BKL(void *data)
+void i915_gem_set_wedged(struct drm_i915_private *i915)
 {
-	struct drm_i915_private *i915 = data;
 	struct intel_engine_cs *engine;
 	enum intel_engine_id id;
 
+	/* First, stop submission to hw, but do not yet complete requests by
+	 * rolling the global seqno forward (since this would complete requests
+	 * for which we haven't set the fence error to EIO yet).
+	 */
 	for_each_engine(engine, i915, id)
-		engine_set_wedged(engine);
+		engine->submit_request = nop_submit_request;
 
-	set_bit(I915_WEDGED, &i915->gpu_error.flags);
-	wake_up_all(&i915->gpu_error.reset_queue);
+	/* Make sure no one is running the old callback before we proceed with
+	 * cancelling requests and resetting the completion tracking. Otherwise
+	 * we might submit a request to the hardware which never completes.
+	 */
+	synchronize_rcu();
 
-	return 0;
-}
+	for_each_engine(engine, i915, id) {
+		/* Mark all executing requests as skipped */
+		engine->cancel_requests(engine);
 
-void i915_gem_set_wedged(struct drm_i915_private *dev_priv)
-{
-	stop_machine(__i915_gem_set_wedged_BKL, dev_priv, NULL);
+		/* Only once we've force-cancelled all in-flight requests can we
+		 * start to complete all requests.
+		 */
+		engine->submit_request = nop_complete_submit_request;
+	}
+
+	/* Make sure no request can slip through without getting completed by
+	 * either this call here to intel_engine_init_global_seqno, or the one
+	 * in nop_complete_submit_request.
+	 */
+	synchronize_rcu();
+
+	for_each_engine(engine, i915, id) {
+		unsigned long flags;
+
+		/* Mark all pending requests as complete so that any concurrent
+		 * (lockless) lookup doesn't try and wait upon the request as we
+		 * reset it.
+		 */
+		spin_lock_irqsave(&engine->timeline->lock, flags);
+		intel_engine_init_global_seqno(engine,
+					       intel_engine_last_submit(engine));
+		spin_unlock_irqrestore(&engine->timeline->lock, flags);
+	}
+
+	set_bit(I915_WEDGED, &i915->gpu_error.flags);
+	wake_up_all(&i915->gpu_error.reset_queue);
 }
 
 bool i915_gem_unset_wedged(struct drm_i915_private *i915)
diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
index b100b38f1dd2..ef78a85cb845 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.c
+++ b/drivers/gpu/drm/i915/i915_gem_request.c
@@ -556,7 +556,9 @@  submit_notify(struct i915_sw_fence *fence, enum i915_sw_fence_notify state)
 	switch (state) {
 	case FENCE_COMPLETE:
 		trace_i915_gem_request_submit(request);
+		rcu_read_lock();
 		request->engine->submit_request(request);
+		rcu_read_unlock();
 		break;
 
 	case FENCE_FREE:
diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_request.c b/drivers/gpu/drm/i915/selftests/i915_gem_request.c
index 78b9f811707f..a999161e8db1 100644
--- a/drivers/gpu/drm/i915/selftests/i915_gem_request.c
+++ b/drivers/gpu/drm/i915/selftests/i915_gem_request.c
@@ -215,7 +215,9 @@  static int igt_request_rewind(void *arg)
 	}
 	i915_gem_request_get(vip);
 	i915_add_request(vip);
+	rcu_read_lock();
 	request->engine->submit_request(request);
+	rcu_read_unlock();
 
 	mutex_unlock(&i915->drm.struct_mutex);