From patchwork Tue Sep 14 05:09:53 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Brost X-Patchwork-Id: 12491713 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9AF02C433F5 for ; Tue, 14 Sep 2021 05:15:11 +0000 (UTC) Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 5E73660F70 for ; Tue, 14 Sep 2021 05:15:11 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 5E73660F70 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=intel.com Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=lists.freedesktop.org Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id C64106E3F7; Tue, 14 Sep 2021 05:15:02 +0000 (UTC) Received: from mga18.intel.com (mga18.intel.com [134.134.136.126]) by gabe.freedesktop.org (Postfix) with ESMTPS id E794A6E3F0; Tue, 14 Sep 2021 05:15:00 +0000 (UTC) X-IronPort-AV: E=McAfee;i="6200,9189,10106"; a="208976887" X-IronPort-AV: E=Sophos;i="5.85,291,1624345200"; d="scan'208";a="208976887" Received: from fmsmga006.fm.intel.com ([10.253.24.20]) by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 13 Sep 2021 22:14:59 -0700 X-IronPort-AV: E=Sophos;i="5.85,291,1624345200"; d="scan'208";a="697404480" Received: from jons-linux-dev-box.fm.intel.com ([10.1.27.20]) by fmsmga006-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 13 Sep 2021 22:14:59 -0700 From: Matthew Brost To: , Cc: , Date: Mon, 13 Sep 2021 22:09:53 -0700 Message-Id: <20210914050956.30685-2-matthew.brost@intel.com> X-Mailer: git-send-email 2.32.0 In-Reply-To: <20210914050956.30685-1-matthew.brost@intel.com> References: <20210914050956.30685-1-matthew.brost@intel.com> MIME-Version: 1.0 Subject: [Intel-gfx] [PATCH 1/4] drm/i915/guc: Move guc_ids under submission_state sub-structure X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" Move guc_ids under submission_state sub-structure as a future patch will use contexts_lock (global GuC submission lock) to protect more data. Introducing the sub-structure makes ownership of the locking / fields clear. v2: (Docs) - Fix Docs warning by adding comment to submission_state structure Signed-off-by: Matthew Brost --- drivers/gpu/drm/i915/gt/intel_context_types.h | 9 ++-- drivers/gpu/drm/i915/gt/uc/intel_guc.h | 26 +++++++---- .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 44 ++++++++++--------- 3 files changed, 45 insertions(+), 34 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h index 930569a1a01f..af43b3c83339 100644 --- a/drivers/gpu/drm/i915/gt/intel_context_types.h +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h @@ -189,18 +189,19 @@ struct intel_context { struct { /** * @id: handle which is used to uniquely identify this context - * with the GuC, protected by guc->contexts_lock + * with the GuC, protected by guc->submission_state.lock */ u16 id; /** * @ref: the number of references to the guc_id, when * transitioning in and out of zero protected by - * guc->contexts_lock + * guc->submission_state.lock */ atomic_t ref; /** - * @link: in guc->guc_id_list when the guc_id has no refs but is - * still valid, protected by guc->contexts_lock + * @link: in guc->submission_state.guc_id_list when the guc_id + * has no refs but is still valid, protected by + * guc->submission_state.lock */ struct list_head link; } guc_id; diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h index 5dd174babf7a..0e28f490c12d 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h @@ -71,16 +71,24 @@ struct intel_guc { } interrupts; /** - * @contexts_lock: protects guc_ids, guc_id_list, ce->guc_id.id, and - * ce->guc_id.ref when transitioning in and out of zero + * @submission_state: all state related to GuC submission that needs to + * be protected by a common lock */ - spinlock_t contexts_lock; - /** @guc_ids: used to allocate unique ce->guc_id.id values */ - struct ida guc_ids; - /** - * @guc_id_list: list of intel_context with valid guc_ids but no refs - */ - struct list_head guc_id_list; + struct { + /** + * @lock: protects everything in submission_state and ce->guc_id + */ + spinlock_t lock; + /** + * @guc_ids: used to allocate new guc_ids + */ + struct ida guc_ids; + /** + * @guc_id_list: list of intel_context with valid guc_ids but no + * refs + */ + struct list_head guc_id_list; + } submission_state; /** * @submission_supported: tracks whether we support GuC submission on diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c index c7a41802b448..678da915eb9d 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c @@ -89,7 +89,7 @@ * sched_engine can be submitting at a time. Currently only one sched_engine is * used for all of GuC submission but that could change in the future. * - * guc->contexts_lock + * guc->submission_state.lock * Protects guc_id allocation for the given GuC, i.e. only one context can be * doing guc_id allocation operations at a time for each GuC in the system. * @@ -103,7 +103,7 @@ * * Lock ordering rules: * sched_engine->lock -> ce->guc_state.lock - * guc->contexts_lock -> ce->guc_state.lock + * guc->submission_state.lock -> ce->guc_state.lock * * Reset races: * When a full GT reset is triggered it is assumed that some G2H responses to @@ -1148,9 +1148,9 @@ int intel_guc_submission_init(struct intel_guc *guc) xa_init_flags(&guc->context_lookup, XA_FLAGS_LOCK_IRQ); - spin_lock_init(&guc->contexts_lock); - INIT_LIST_HEAD(&guc->guc_id_list); - ida_init(&guc->guc_ids); + spin_lock_init(&guc->submission_state.lock); + INIT_LIST_HEAD(&guc->submission_state.guc_id_list); + ida_init(&guc->submission_state.guc_ids); return 0; } @@ -1215,7 +1215,7 @@ static void guc_submit_request(struct i915_request *rq) static int new_guc_id(struct intel_guc *guc) { - return ida_simple_get(&guc->guc_ids, 0, + return ida_simple_get(&guc->submission_state.guc_ids, 0, GUC_MAX_LRC_DESCRIPTORS, GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN); } @@ -1223,7 +1223,8 @@ static int new_guc_id(struct intel_guc *guc) static void __release_guc_id(struct intel_guc *guc, struct intel_context *ce) { if (!context_guc_id_invalid(ce)) { - ida_simple_remove(&guc->guc_ids, ce->guc_id.id); + ida_simple_remove(&guc->submission_state.guc_ids, + ce->guc_id.id); reset_lrc_desc(guc, ce->guc_id.id); set_context_guc_id_invalid(ce); } @@ -1235,9 +1236,9 @@ static void release_guc_id(struct intel_guc *guc, struct intel_context *ce) { unsigned long flags; - spin_lock_irqsave(&guc->contexts_lock, flags); + spin_lock_irqsave(&guc->submission_state.lock, flags); __release_guc_id(guc, ce); - spin_unlock_irqrestore(&guc->contexts_lock, flags); + spin_unlock_irqrestore(&guc->submission_state.lock, flags); } static int steal_guc_id(struct intel_guc *guc) @@ -1245,10 +1246,10 @@ static int steal_guc_id(struct intel_guc *guc) struct intel_context *ce; int guc_id; - lockdep_assert_held(&guc->contexts_lock); + lockdep_assert_held(&guc->submission_state.lock); - if (!list_empty(&guc->guc_id_list)) { - ce = list_first_entry(&guc->guc_id_list, + if (!list_empty(&guc->submission_state.guc_id_list)) { + ce = list_first_entry(&guc->submission_state.guc_id_list, struct intel_context, guc_id.link); @@ -1273,7 +1274,7 @@ static int assign_guc_id(struct intel_guc *guc, u16 *out) { int ret; - lockdep_assert_held(&guc->contexts_lock); + lockdep_assert_held(&guc->submission_state.lock); ret = new_guc_id(guc); if (unlikely(ret < 0)) { @@ -1295,7 +1296,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce) GEM_BUG_ON(atomic_read(&ce->guc_id.ref)); try_again: - spin_lock_irqsave(&guc->contexts_lock, flags); + spin_lock_irqsave(&guc->submission_state.lock, flags); might_lock(&ce->guc_state.lock); @@ -1310,7 +1311,7 @@ static int pin_guc_id(struct intel_guc *guc, struct intel_context *ce) atomic_inc(&ce->guc_id.ref); out_unlock: - spin_unlock_irqrestore(&guc->contexts_lock, flags); + spin_unlock_irqrestore(&guc->submission_state.lock, flags); /* * -EAGAIN indicates no guc_id are available, let's retire any @@ -1346,11 +1347,12 @@ static void unpin_guc_id(struct intel_guc *guc, struct intel_context *ce) if (unlikely(context_guc_id_invalid(ce))) return; - spin_lock_irqsave(&guc->contexts_lock, flags); + spin_lock_irqsave(&guc->submission_state.lock, flags); if (!context_guc_id_invalid(ce) && list_empty(&ce->guc_id.link) && !atomic_read(&ce->guc_id.ref)) - list_add_tail(&ce->guc_id.link, &guc->guc_id_list); - spin_unlock_irqrestore(&guc->contexts_lock, flags); + list_add_tail(&ce->guc_id.link, + &guc->submission_state.guc_id_list); + spin_unlock_irqrestore(&guc->submission_state.lock, flags); } static int __guc_action_register_context(struct intel_guc *guc, @@ -1921,16 +1923,16 @@ static void guc_context_destroy(struct kref *kref) * returns indicating this context has been deregistered the guc_id is * returned to the pool of available guc_id. */ - spin_lock_irqsave(&guc->contexts_lock, flags); + spin_lock_irqsave(&guc->submission_state.lock, flags); if (context_guc_id_invalid(ce)) { - spin_unlock_irqrestore(&guc->contexts_lock, flags); + spin_unlock_irqrestore(&guc->submission_state.lock, flags); __guc_context_destroy(ce); return; } if (!list_empty(&ce->guc_id.link)) list_del_init(&ce->guc_id.link); - spin_unlock_irqrestore(&guc->contexts_lock, flags); + spin_unlock_irqrestore(&guc->submission_state.lock, flags); /* Seal race with Reset */ spin_lock_irqsave(&ce->guc_state.lock, flags); From patchwork Tue Sep 14 05:09:54 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Brost X-Patchwork-Id: 12491711 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6F211C433EF for ; Tue, 14 Sep 2021 05:15:05 +0000 (UTC) Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 3BADF60F70 for ; Tue, 14 Sep 2021 05:15:05 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 3BADF60F70 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=intel.com Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=lists.freedesktop.org Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 7374D6E3F0; Tue, 14 Sep 2021 05:15:02 +0000 (UTC) Received: from mga18.intel.com (mga18.intel.com [134.134.136.126]) by gabe.freedesktop.org (Postfix) with ESMTPS id 2CE336E3F2; Tue, 14 Sep 2021 05:15:01 +0000 (UTC) X-IronPort-AV: E=McAfee;i="6200,9189,10106"; a="208976888" X-IronPort-AV: E=Sophos;i="5.85,291,1624345200"; d="scan'208";a="208976888" Received: from fmsmga006.fm.intel.com ([10.253.24.20]) by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 13 Sep 2021 22:14:59 -0700 X-IronPort-AV: E=Sophos;i="5.85,291,1624345200"; d="scan'208";a="697404483" Received: from jons-linux-dev-box.fm.intel.com ([10.1.27.20]) by fmsmga006-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 13 Sep 2021 22:14:59 -0700 From: Matthew Brost To: , Cc: , Date: Mon, 13 Sep 2021 22:09:54 -0700 Message-Id: <20210914050956.30685-3-matthew.brost@intel.com> X-Mailer: git-send-email 2.32.0 In-Reply-To: <20210914050956.30685-1-matthew.brost@intel.com> References: <20210914050956.30685-1-matthew.brost@intel.com> MIME-Version: 1.0 Subject: [Intel-gfx] [PATCH 2/4] drm/i915/guc: Do error capture asynchronously X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" An error capture allocates memory, memory allocations depend on resets, and resets need to flush the G2H handlers to seal several races. If the error capture is done from the G2H handler this creates a circular dependency. To work around this, do a error capture in a work queue asynchronously from the G2H handler. This should be fine as (eventually) all register state is put into a buffer by the GuC so it is safe to restart the context before the error capture is complete. Example of lockdep splat below: [ 154.625989] ====================================================== [ 154.632195] WARNING: possible circular locking dependency detected [ 154.638393] 5.14.0-rc5-guc+ #50 Tainted: G U [ 154.643991] ------------------------------------------------------ [ 154.650196] i915_selftest/1673 is trying to acquire lock: [ 154.655621] ffff8881079cb918 ((work_completion)(&ct->requests.worker)){+.+.}-{0:0}, at: __flush_work+0x350/0x4d0 [ 154.665826] but task is already holding lock: [ 154.671682] ffff8881079cbfb8 (>->reset.mutex){+.+.}-{3:3}, at: intel_gt_reset+0xf0/0x300 [i915] [ 154.680659] which lock already depends on the new lock. [ 154.688857] the existing dependency chain (in reverse order) is: [ 154.696365] -> #2 (>->reset.mutex){+.+.}-{3:3}: [ 154.702571] lock_acquire+0xd2/0x300 [ 154.706695] i915_gem_shrinker_taints_mutex+0x2d/0x50 [i915] [ 154.712959] intel_gt_init_reset+0x61/0x80 [i915] [ 154.718258] intel_gt_init_early+0xe6/0x120 [i915] [ 154.723648] i915_driver_probe+0x592/0xdc0 [i915] [ 154.728942] i915_pci_probe+0x43/0x1c0 [i915] [ 154.733891] pci_device_probe+0x9b/0x110 [ 154.738362] really_probe+0x1a6/0x3a0 [ 154.742568] __driver_probe_device+0xf9/0x170 [ 154.747468] driver_probe_device+0x19/0x90 [ 154.752114] __driver_attach+0x99/0x170 [ 154.756492] bus_for_each_dev+0x73/0xc0 [ 154.760870] bus_add_driver+0x14b/0x1f0 [ 154.765248] driver_register+0x67/0xb0 [ 154.769542] i915_init+0x18/0x8c [i915] [ 154.773964] do_one_initcall+0x53/0x2e0 [ 154.778343] do_init_module+0x56/0x210 [ 154.782639] load_module+0x25fc/0x29f0 [ 154.786934] __do_sys_finit_module+0xae/0x110 [ 154.791835] do_syscall_64+0x38/0xc0 [ 154.795958] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 154.801558] -> #1 (fs_reclaim){+.+.}-{0:0}: [ 154.807241] lock_acquire+0xd2/0x300 [ 154.811361] fs_reclaim_acquire+0x9e/0xd0 [ 154.815914] kmem_cache_alloc_trace+0x30/0x790 [ 154.820899] i915_gpu_coredump_alloc+0x53/0x1a0 [i915] [ 154.826649] i915_gpu_coredump+0x39/0x560 [i915] [ 154.831866] i915_capture_error_state+0xa/0x70 [i915] [ 154.837513] intel_guc_context_reset_process_msg+0x174/0x1f0 [i915] [ 154.844383] ct_incoming_request_worker_func+0x130/0x1b0 [i915] [ 154.850898] process_one_work+0x264/0x590 [ 154.855451] worker_thread+0x4b/0x3a0 [ 154.859655] kthread+0x147/0x170 [ 154.863428] ret_from_fork+0x1f/0x30 [ 154.867548] -> #0 ((work_completion)(&ct->requests.worker)){+.+.}-{0:0}: [ 154.875747] check_prev_add+0x90/0xc30 [ 154.880042] __lock_acquire+0x1643/0x2110 [ 154.884595] lock_acquire+0xd2/0x300 [ 154.888715] __flush_work+0x373/0x4d0 [ 154.892920] intel_guc_submission_reset_prepare+0xf3/0x340 [i915] [ 154.899606] intel_uc_reset_prepare+0x40/0x50 [i915] [ 154.905166] reset_prepare+0x55/0x60 [i915] [ 154.909946] intel_gt_reset+0x11c/0x300 [i915] [ 154.914984] do_device_reset+0x13/0x20 [i915] [ 154.919936] check_whitelist_across_reset+0x166/0x250 [i915] [ 154.926212] live_reset_whitelist.cold+0x6a/0x7a [i915] [ 154.932037] __i915_subtests.cold+0x20/0x74 [i915] [ 154.937428] __run_selftests.cold+0x96/0xee [i915] [ 154.942816] i915_live_selftests+0x2c/0x60 [i915] [ 154.948125] i915_pci_probe+0x93/0x1c0 [i915] [ 154.953076] pci_device_probe+0x9b/0x110 [ 154.957545] really_probe+0x1a6/0x3a0 [ 154.961749] __driver_probe_device+0xf9/0x170 [ 154.966653] driver_probe_device+0x19/0x90 [ 154.971290] __driver_attach+0x99/0x170 [ 154.975671] bus_for_each_dev+0x73/0xc0 [ 154.980053] bus_add_driver+0x14b/0x1f0 [ 154.984431] driver_register+0x67/0xb0 [ 154.988725] i915_init+0x18/0x8c [i915] [ 154.993149] do_one_initcall+0x53/0x2e0 [ 154.997527] do_init_module+0x56/0x210 [ 155.001822] load_module+0x25fc/0x29f0 [ 155.006118] __do_sys_finit_module+0xae/0x110 [ 155.011019] do_syscall_64+0x38/0xc0 [ 155.015139] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 155.020729] other info that might help us debug this: [ 155.028752] Chain exists of: (work_completion)(&ct->requests.worker) --> fs_reclaim --> >->reset.mutex [ 155.041294] Possible unsafe locking scenario: [ 155.047240] CPU0 CPU1 [ 155.051791] ---- ---- [ 155.056344] lock(>->reset.mutex); [ 155.060026] lock(fs_reclaim); [ 155.065706] lock(>->reset.mutex); [ 155.071912] lock((work_completion)(&ct->requests.worker)); [ 155.077595] *** DEADLOCK *** Signed-off-by: Matthew Brost --- drivers/gpu/drm/i915/gt/intel_context.c | 2 + drivers/gpu/drm/i915/gt/intel_context_types.h | 7 +++ drivers/gpu/drm/i915/gt/uc/intel_guc.h | 10 ++++ .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 47 +++++++++++++++++-- 4 files changed, 62 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c index ff637147b1a9..72ad60e9d908 100644 --- a/drivers/gpu/drm/i915/gt/intel_context.c +++ b/drivers/gpu/drm/i915/gt/intel_context.c @@ -399,6 +399,8 @@ intel_context_init(struct intel_context *ce, struct intel_engine_cs *engine) ce->guc_id.id = GUC_INVALID_LRC_ID; INIT_LIST_HEAD(&ce->guc_id.link); + INIT_LIST_HEAD(&ce->guc_capture_link); + /* * Initialize fence to be complete as this is expected to be complete * unless there is a pending schedule disable outstanding. diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h b/drivers/gpu/drm/i915/gt/intel_context_types.h index af43b3c83339..925a06162e8e 100644 --- a/drivers/gpu/drm/i915/gt/intel_context_types.h +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h @@ -206,6 +206,13 @@ struct intel_context { struct list_head link; } guc_id; + /** + * @guc_capture_link: in guc->submission_state.capture_list when an + * error capture is pending on this context, protected by + * guc->submission_state.lock + */ + struct list_head guc_capture_link; + #ifdef CONFIG_DRM_I915_SELFTEST /** * @drop_schedule_enable: Force drop of schedule enable G2H for selftest diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.h b/drivers/gpu/drm/i915/gt/uc/intel_guc.h index 0e28f490c12d..87ee792eafd4 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.h +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.h @@ -88,6 +88,16 @@ struct intel_guc { * refs */ struct list_head guc_id_list; + /** + * @capture_list: list of intel_context that need to capture + * error state + */ + struct list_head capture_list; + /** + * @capture_worker: worker to do error capture when the GuC + * signals a context has been reset + */ + struct work_struct capture_worker; } submission_state; /** diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c index 678da915eb9d..ba6838a35a69 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c @@ -91,7 +91,8 @@ * * guc->submission_state.lock * Protects guc_id allocation for the given GuC, i.e. only one context can be - * doing guc_id allocation operations at a time for each GuC in the system. + * doing guc_id allocation operations at a time for each GuC in the system. Also + * protects everything else under the guc->submission_state sub-structure. * * ce->guc_state.lock * Protects everything under ce->guc_state. Ensures that a context is in the @@ -1126,6 +1127,8 @@ void intel_guc_submission_reset_finish(struct intel_guc *guc) intel_gt_unpark_heartbeats(guc_to_gt(guc)); } +static void capture_worker_func(struct work_struct *w); + /* * Set up the memory resources to be shared with the GuC (via the GGTT) * at firmware loading time. @@ -1151,6 +1154,8 @@ int intel_guc_submission_init(struct intel_guc *guc) spin_lock_init(&guc->submission_state.lock); INIT_LIST_HEAD(&guc->submission_state.guc_id_list); ida_init(&guc->submission_state.guc_ids); + INIT_LIST_HEAD(&guc->submission_state.capture_list); + INIT_WORK(&guc->submission_state.capture_worker, capture_worker_func); return 0; } @@ -2879,17 +2884,51 @@ int intel_guc_sched_done_process_msg(struct intel_guc *guc, return 0; } -static void capture_error_state(struct intel_guc *guc, - struct intel_context *ce) +static void capture_worker_func(struct work_struct *w) { + struct intel_guc *guc = container_of(w, struct intel_guc, + submission_state.capture_worker); struct intel_gt *gt = guc_to_gt(guc); struct drm_i915_private *i915 = gt->i915; + struct intel_context *ce = + list_first_entry(&guc->submission_state.capture_list, + struct intel_context, guc_capture_link); struct intel_engine_cs *engine = __context_to_physical_engine(ce); + unsigned long flags; intel_wakeref_t wakeref; + spin_lock_irqsave(&guc->submission_state.lock, flags); + list_del_init(&ce->guc_capture_link); + spin_unlock_irqrestore(&guc->submission_state.lock, flags); + intel_engine_set_hung_context(engine, ce); with_intel_runtime_pm(&i915->runtime_pm, wakeref) - i915_capture_error_state(gt, engine->mask); + i915_capture_error_state(gt, ce->engine->mask); +} + +static void capture_error_state(struct intel_guc *guc, + struct intel_context *ce) +{ + struct intel_gt *gt = guc_to_gt(guc); + struct drm_i915_private *i915 = gt->i915; + struct intel_engine_cs *engine = __context_to_physical_engine(ce); + unsigned long flags; + + spin_lock_irqsave(&guc->submission_state.lock, flags); + list_add_tail(&guc->submission_state.capture_list, + &ce->guc_capture_link); + spin_unlock_irqrestore(&guc->submission_state.lock, flags); + + /* + * We do the error capture async as an error capture can allocate + * memory, the reset path must flush the G2H handler in order to seal + * several races, and the memory allocations depend on the reset path + * (circular dependecy if error capture done here in the G2H handler). + * Doing the error capture async should be ok, as (eventually) all + * register state is captured in buffer by the GuC (i.e. it ok to + * restart the context before the error capture is complete). + */ + queue_work(system_unbound_wq, &guc->submission_state.capture_worker); atomic_inc(&i915->gpu_error.reset_engine_count[engine->uabi_class]); } From patchwork Tue Sep 14 05:09:55 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Brost X-Patchwork-Id: 12491719 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 68D55C433F5 for ; Tue, 14 Sep 2021 05:15:28 +0000 (UTC) Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 3840260F21 for ; Tue, 14 Sep 2021 05:15:28 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 3840260F21 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=intel.com Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=lists.freedesktop.org Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id EA4C96E40C; Tue, 14 Sep 2021 05:15:10 +0000 (UTC) Received: from mga18.intel.com (mga18.intel.com [134.134.136.126]) by gabe.freedesktop.org (Postfix) with ESMTPS id 500916E3F0; Tue, 14 Sep 2021 05:15:01 +0000 (UTC) X-IronPort-AV: E=McAfee;i="6200,9189,10106"; a="208976889" X-IronPort-AV: E=Sophos;i="5.85,291,1624345200"; d="scan'208";a="208976889" Received: from fmsmga006.fm.intel.com ([10.253.24.20]) by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 13 Sep 2021 22:15:00 -0700 X-IronPort-AV: E=Sophos;i="5.85,291,1624345200"; d="scan'208";a="697404485" Received: from jons-linux-dev-box.fm.intel.com ([10.1.27.20]) by fmsmga006-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 13 Sep 2021 22:14:59 -0700 From: Matthew Brost To: , Cc: , Date: Mon, 13 Sep 2021 22:09:55 -0700 Message-Id: <20210914050956.30685-4-matthew.brost@intel.com> X-Mailer: git-send-email 2.32.0 In-Reply-To: <20210914050956.30685-1-matthew.brost@intel.com> References: <20210914050956.30685-1-matthew.brost@intel.com> MIME-Version: 1.0 Subject: [Intel-gfx] [PATCH 3/4] drm/i915/guc: Flush G2H work queue during reset X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" It isn't safe to scrub for missing G2H or continue with the reset until all G2H processing is complete. Flush the G2H work queue during reset to ensure it is done running. No need to call the IRQ handler directly either as the scrubbing code can deal with any missing G2H. Signed-off-by: Matthew Brost Reviewed-by: Daniele Ceraolo Spurio --- .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 18 +----------------- 1 file changed, 1 insertion(+), 17 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c index ba6838a35a69..1986a57b52cc 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c @@ -800,8 +800,6 @@ static void guc_flush_submissions(struct intel_guc *guc) void intel_guc_submission_reset_prepare(struct intel_guc *guc) { - int i; - if (unlikely(!guc_submission_initialized(guc))) { /* Reset called during driver load? GuC not yet initialised! */ return; @@ -816,21 +814,7 @@ void intel_guc_submission_reset_prepare(struct intel_guc *guc) spin_unlock_irq(&guc_to_gt(guc)->irq_lock); guc_flush_submissions(guc); - - /* - * Handle any outstanding G2Hs before reset. Call IRQ handler directly - * each pass as interrupt have been disabled. We always scrub for - * outstanding G2H as it is possible for outstanding_submission_g2h to - * be incremented after the context state update. - */ - for (i = 0; i < 4 && atomic_read(&guc->outstanding_submission_g2h); ++i) { - intel_guc_to_host_event_handler(guc); -#define wait_for_reset(guc, wait_var) \ - intel_guc_wait_for_pending_msg(guc, wait_var, false, (HZ / 20)) - do { - wait_for_reset(guc, &guc->outstanding_submission_g2h); - } while (!list_empty(&guc->ct.requests.incoming)); - } + flush_work(&guc->ct.requests.worker); scrub_guc_desc_for_outstanding_g2h(guc); } From patchwork Tue Sep 14 05:09:56 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthew Brost X-Patchwork-Id: 12491715 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id C5738C433EF for ; Tue, 14 Sep 2021 05:15:13 +0000 (UTC) Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 92F0060F70 for ; Tue, 14 Sep 2021 05:15:13 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 92F0060F70 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=intel.com Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=lists.freedesktop.org Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 529CA6E402; Tue, 14 Sep 2021 05:15:03 +0000 (UTC) Received: from mga18.intel.com (mga18.intel.com [134.134.136.126]) by gabe.freedesktop.org (Postfix) with ESMTPS id 624276E3F7; Tue, 14 Sep 2021 05:15:01 +0000 (UTC) X-IronPort-AV: E=McAfee;i="6200,9189,10106"; a="208976891" X-IronPort-AV: E=Sophos;i="5.85,291,1624345200"; d="scan'208";a="208976891" Received: from fmsmga006.fm.intel.com ([10.253.24.20]) by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 13 Sep 2021 22:15:00 -0700 X-IronPort-AV: E=Sophos;i="5.85,291,1624345200"; d="scan'208";a="697404488" Received: from jons-linux-dev-box.fm.intel.com ([10.1.27.20]) by fmsmga006-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 13 Sep 2021 22:14:59 -0700 From: Matthew Brost To: , Cc: , Date: Mon, 13 Sep 2021 22:09:56 -0700 Message-Id: <20210914050956.30685-5-matthew.brost@intel.com> X-Mailer: git-send-email 2.32.0 In-Reply-To: <20210914050956.30685-1-matthew.brost@intel.com> References: <20210914050956.30685-1-matthew.brost@intel.com> MIME-Version: 1.0 Subject: [Intel-gfx] [PATCH 4/4] drm/i915/guc: Refcount context during error capture X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" From: John Harrison When i915 receives a context reset notification from GuC, it triggers an error capture before resetting any outstanding requsts of that context. Unfortunately, the error capture is not a time bound operation. In certain situations it can take a long time, particularly when multiple large LMEM buffers must be read back and eoncoded. If this delay is longer than other timeouts (heartbeat, test recovery, etc.) then a full GT reset can be triggered in the middle. That can result in the context being reset by GuC actually being destroyed before the error capture completes and the GuC submission code resumes. Thus, the GuC side can start dereferencing stale pointers and Bad Things ensue. So add a refcount get of the context during the entire reset operation. That way, the context can't be destroyed part way through no matter what other resets or user interactions occur. v2: (Matthew Brost) - Update patch to work with async error capture Signed-off-by: John Harrison Signed-off-by: Matthew Brost --- .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 24 +++++++++++++++++-- 1 file changed, 22 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c index 1986a57b52cc..02917fc4d4a8 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c @@ -2888,6 +2888,8 @@ static void capture_worker_func(struct work_struct *w) intel_engine_set_hung_context(engine, ce); with_intel_runtime_pm(&i915->runtime_pm, wakeref) i915_capture_error_state(gt, ce->engine->mask); + + intel_context_put(ce); } static void capture_error_state(struct intel_guc *guc, @@ -2924,7 +2926,7 @@ static void guc_context_replay(struct intel_context *ce) tasklet_hi_schedule(&sched_engine->tasklet); } -static void guc_handle_context_reset(struct intel_guc *guc, +static bool guc_handle_context_reset(struct intel_guc *guc, struct intel_context *ce) { trace_intel_context_reset(ce); @@ -2937,7 +2939,11 @@ static void guc_handle_context_reset(struct intel_guc *guc, !context_blocked(ce))) { capture_error_state(guc, ce); guc_context_replay(ce); + + return false; } + + return true; } int intel_guc_context_reset_process_msg(struct intel_guc *guc, @@ -2945,6 +2951,7 @@ int intel_guc_context_reset_process_msg(struct intel_guc *guc, { struct intel_context *ce; int desc_idx; + unsigned long flags; if (unlikely(len != 1)) { drm_err(&guc_to_gt(guc)->i915->drm, "Invalid length %u", len); @@ -2952,11 +2959,24 @@ int intel_guc_context_reset_process_msg(struct intel_guc *guc, } desc_idx = msg[0]; + + /* + * The context lookup uses the xarray but lookups only require an RCU lock + * not the full spinlock. So take the lock explicitly and keep it until the + * context has been reference count locked to ensure it can't be destroyed + * asynchronously until the reset is done. + */ + xa_lock_irqsave(&guc->context_lookup, flags); ce = g2h_context_lookup(guc, desc_idx); + if (ce) + intel_context_get(ce); + xa_unlock_irqrestore(&guc->context_lookup, flags); + if (unlikely(!ce)) return -EPROTO; - guc_handle_context_reset(guc, ce); + if (guc_handle_context_reset(guc, ce)) + intel_context_put(ce); return 0; }