[v2,05/21] drm/i915/gt: Skip TLB invalidations once wedged


Message ID

f20bd21c94610dae59824b8040e5a9400de6f963.1657800199.git.mchehab@kernel.org (mailing list archive)

State

New, archived

Headers

From: Mauro Carvalho Chehab <mchehab@kernel.org>
To: 
Date: Thu, 14 Jul 2022 13:06:10 +0100
Message-Id: 
 <f20bd21c94610dae59824b8040e5a9400de6f963.1657800199.git.mchehab@kernel.org>
In-Reply-To: <cover.1657800199.git.mchehab@kernel.org>
References: <cover.1657800199.git.mchehab@kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Subject: [Intel-gfx] [PATCH v2 05/21] drm/i915/gt: Skip TLB invalidations
 once wedged
Precedence: list
Cc: =?utf-8?q?Thomas_Hellstr=C3=B6m?= <thomas.hellstrom@linux.intel.com>,
 David Airlie <airlied@linux.ie>, dri-devel@lists.freedesktop.org,
 Lucas De Marchi <lucas.demarchi@intel.com>, linux-kernel@vger.kernel.org,
 Chris Wilson <chris.p.wilson@intel.com>,
 Rodrigo Vivi <rodrigo.vivi@intel.com>, Dave Airlie <airlied@redhat.com>,
 stable@vger.kernel.org, Mauro Carvalho Chehab <mchehab@kernel.org>,
 intel-gfx@lists.freedesktop.org
Errors-To: intel-gfx-bounces@lists.freedesktop.org
Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>

Series

Fix performance regressions with TLB and add GuC support

Commit Message

Mauro Carvalho Chehab July 14, 2022, 12:06 p.m. UTC

From: Chris Wilson <chris.p.wilson@intel.com>

Skip all further TLB invalidations once the device is wedged and
has been reset, as, in such cases, it can no longer process instructions
on the GPU and the user no longer has access to the TLBs in each engine.

That helps to reduce the performance regression introduced by the TLB
invalidation logic.

Cc: stable@vger.kernel.org
Fixes: 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store")
Signed-off-by: Chris Wilson <chris.p.wilson@intel.com>
Cc: Fei Yang <fei.yang@intel.com>
Cc: Andi Shyti <andi.shyti@linux.intel.com>
Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
---

To avoid mailbombing a large number of people, only the mailing lists were Cc'd on the cover letter.
See [PATCH v2 00/21] at: https://lore.kernel.org/all/cover.1657800199.git.mchehab@kernel.org/

 drivers/gpu/drm/i915/gt/intel_gt.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Tvrtko Ursulin July 18, 2022, 1:45 p.m. UTC | #1

On 14/07/2022 13:06, Mauro Carvalho Chehab wrote:
> From: Chris Wilson <chris.p.wilson@intel.com>
> 
> Skip all further TLB invalidations once the device is wedged and
> had been reset, as, on such cases, it can no longer process instructions
> on the GPU and the user no longer has access to the TLB's in each engine.
> 
> That helps to reduce the performance regression introduced by TLB
> invalidate logic.
> 
> Cc: stable@vger.kernel.org
> Fixes: 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store")

Is the claim that this solves a performance regression based on a wedged
GPU which no longer works, to the extent that MMIO TLB
invalidation requests keep timing out? If so, please clarify in the
commit text and then it looks good to me, even if it is IMO a very
borderline situation to declare something a fix.

Regards,

Tvrtko

> Signed-off-by: Chris Wilson <chris.p.wilson@intel.com>
> Cc: Fei Yang <fei.yang@intel.com>
> Cc: Andi Shyti <andi.shyti@linux.intel.com>
> Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
> ---
> 
> To avoid mailbombing on a large number of people, only mailing lists were C/C on the cover.
> See [PATCH v2 00/21] at: https://lore.kernel.org/all/cover.1657800199.git.mchehab@kernel.org/
> 
>   drivers/gpu/drm/i915/gt/intel_gt.c | 3 +++
>   1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c
> index 1d84418e8676..5c55a90672f4 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt.c
> +++ b/drivers/gpu/drm/i915/gt/intel_gt.c
> @@ -934,6 +934,9 @@ void intel_gt_invalidate_tlbs(struct intel_gt *gt)
>   	if (I915_SELFTEST_ONLY(gt->awake == -ENODEV))
>   		return;
>   
> +	if (intel_gt_is_wedged(gt))
> +		return;
> +
>   	if (GRAPHICS_VER(i915) == 12) {
>   		regs = gen12_regs;
>   		num = ARRAY_SIZE(gen12_regs);

Mauro Carvalho Chehab July 18, 2022, 4:06 p.m. UTC | #2

On Mon, 18 Jul 2022 14:45:22 +0100
Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> wrote:

> On 14/07/2022 13:06, Mauro Carvalho Chehab wrote:
> > From: Chris Wilson <chris.p.wilson@intel.com>
> > 
> > Skip all further TLB invalidations once the device is wedged and
> > had been reset, as, on such cases, it can no longer process instructions
> > on the GPU and the user no longer has access to the TLB's in each engine.
> > 
> > That helps to reduce the performance regression introduced by TLB
> > invalidate logic.
> > 
> > Cc: stable@vger.kernel.org
> > Fixes: 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store")  
> 
> Is the claim that this solves a performance regression based on a wedged
> GPU which no longer works, to the extent that MMIO TLB
> invalidation requests keep timing out? If so, please clarify in the
> commit text and then it looks good to me, even if it is IMO a very
> borderline situation to declare something a fix.

Indeed, this helps in a borderline situation: if the GT is wedged, TLB
invalidation will time out, so it makes sense to keep the patch with a
commit message like:

    drm/i915/gt: Skip TLB invalidations once wedged
    
    Skip all further TLB invalidations once the device is wedged and
    had been reset, as, on such cases, it can no longer process instructions
    on the GPU and the user no longer has access to the TLB's in each engine.
    
    So, an attempt to do a TLB cache invalidation will produce a timeout.
    
    That helps to reduce the performance regression introduced by TLB
    invalidate logic.

Regards,
Mauro

Tvrtko Ursulin July 19, 2022, 7:19 a.m. UTC | #3

On 18/07/2022 17:06, Mauro Carvalho Chehab wrote:
> On Mon, 18 Jul 2022 14:45:22 +0100
> Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> wrote:
> 
>> On 14/07/2022 13:06, Mauro Carvalho Chehab wrote:
>>> From: Chris Wilson <chris.p.wilson@intel.com>
>>>
>>> Skip all further TLB invalidations once the device is wedged and
>>> had been reset, as, on such cases, it can no longer process instructions
>>> on the GPU and the user no longer has access to the TLB's in each engine.
>>>
>>> That helps to reduce the performance regression introduced by TLB
>>> invalidate logic.
>>>
>>> Cc: stable@vger.kernel.org
>>> Fixes: 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store")
>>
>> Is the claim that this solves a performance regression based on a wedged
>> GPU which no longer works, to the extent that MMIO TLB
>> invalidation requests keep timing out? If so, please clarify in the
>> commit text and then it looks good to me, even if it is IMO a very
>> borderline situation to declare something a fix.
> 
> Indeed this helps on a borderline situation: if GT is wedged, TLB
> invalidation will timeout, so it makes sense to keep the patch with a
> comment like:
> 
>      drm/i915/gt: Skip TLB invalidations once wedged
>      
>      Skip all further TLB invalidations once the device is wedged and
>      had been reset, as, on such cases, it can no longer process instructions
>      on the GPU and the user no longer has access to the TLB's in each engine.
>      
>      So, an attempt to do a TLB cache invalidation will produce a timeout.
>      
>      That helps to reduce the performance regression introduced by TLB
>      invalidate logic.

Yeah, that is better, but the question is whether it is worth bothering
stable with it. A wedged GPU means constant, endless -EIO to userspace, so
it is very hard to imagine that after a TLB invalidation timeout or two
there would be further ones. But okay, it's tiny, so fine I guess.

Regards,

Tvrtko

Andi Shyti July 22, 2022, noon UTC | #4

Hi Mauro,

On Thu, Jul 14, 2022 at 01:06:10PM +0100, Mauro Carvalho Chehab wrote:
> From: Chris Wilson <chris.p.wilson@intel.com>
> 
> Skip all further TLB invalidations once the device is wedged and
> had been reset, as, on such cases, it can no longer process instructions
> on the GPU and the user no longer has access to the TLB's in each engine.
> 
> That helps to reduce the performance regression introduced by TLB
> invalidate logic.
> 
> Cc: stable@vger.kernel.org
> Fixes: 7938d61591d3 ("drm/i915: Flush TLBs before releasing backing store")
> Signed-off-by: Chris Wilson <chris.p.wilson@intel.com>
> Cc: Fei Yang <fei.yang@intel.com>
> Cc: Andi Shyti <andi.shyti@linux.intel.com>
> Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>

I haven't read any concern from Tvrtko here, in any case:

Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>

thanks,
Andi

diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c
index 1d84418e8676..5c55a90672f4 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt.c
@@ -934,6 +934,9 @@  void intel_gt_invalidate_tlbs(struct intel_gt *gt)
 	if (I915_SELFTEST_ONLY(gt->awake == -ENODEV))
 		return;
 
+	if (intel_gt_is_wedged(gt))
+		return;
+
 	if (GRAPHICS_VER(i915) == 12) {
 		regs = gen12_regs;
 		num = ARRAY_SIZE(gen12_regs);
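
For readers who want to see the effect of the early return outside the kernel, here is a minimal userspace sketch. It is not the i915 code: struct mock_gt, MOCK_POLL_BUDGET, mock_wait_for_ack() and mock_invalidate_tlbs() are hypothetical names invented for illustration, and the skip_if_wedged parameter exists only so the behaviour before and after the patch can be compared. It assumes, as discussed in the thread, that a wedged GT never acknowledges an invalidation request, so every wait runs to its timeout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct intel_gt; not the kernel structure. */
struct mock_gt {
	bool wedged; /* models what intel_gt_is_wedged() would report */
};

/* Hypothetical number of register polls before a wait gives up. */
#define MOCK_POLL_BUDGET 1000u

/*
 * Models waiting for one engine to acknowledge a TLB invalidation.
 * Returns the number of polls spent. A healthy GT acknowledges on the
 * first poll; a wedged GT never does, so the whole budget is consumed.
 */
static unsigned int mock_wait_for_ack(const struct mock_gt *gt)
{
	unsigned int polls = 0;

	while (polls < MOCK_POLL_BUDGET) {
		polls++;
		if (!gt->wedged)
			return polls;
	}
	return polls; /* timed out */
}

/*
 * Models the invalidation loop over num_engines engines and returns the
 * total number of polls spent. With skip_if_wedged set, a wedged GT
 * returns before touching any engine, mirroring the check in the patch.
 */
static unsigned int mock_invalidate_tlbs(const struct mock_gt *gt,
					 unsigned int num_engines,
					 bool skip_if_wedged)
{
	unsigned int total = 0;

	assert(gt != NULL);

	if (skip_if_wedged && gt->wedged)
		return 0;

	for (unsigned int i = 0; i < num_engines; i++)
		total += mock_wait_for_ack(gt);

	return total;
}
```

In this model, a wedged GT with four engines spends 4 * MOCK_POLL_BUDGET polls per call without the check and none with it, while a healthy GT is unaffected. The real cost per timeout in the driver depends on its actual wait parameters, which this sketch does not attempt to reproduce.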
