iommu/arm-smmu-qcom: Only enable stall on smmu-v2

Message ID	20241216171017.4881-1-robdclark@gmail.com (mailing list archive)
State	New
Headers	show Received: from mail-pl1-f171.google.com (mail-pl1-f171.google.com [209.85.214.171]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 177DA1C54A6; Mon, 16 Dec 2024 17:10:22 +0000 (UTC) From: Rob Clark <robdclark@gmail.com> To: iommu@lists.linux.dev Cc: linux-arm-msm@vger.kernel.org, freedreno@lists.freedesktop.org, Akhil P Oommen <quic_akhilpo@quicinc.com>, Rob Clark <robdclark@chromium.org>, Rob Clark <robdclark@gmail.com>, Will Deacon <will@kernel.org>, Robin Murphy <robin.murphy@arm.com>, Joerg Roedel <joro@8bytes.org>, linux-arm-kernel@lists.infradead.org (moderated list:ARM SMMU DRIVERS), linux-kernel@vger.kernel.org (open list) Subject: [PATCH] iommu/arm-smmu-qcom: Only enable stall on smmu-v2 Date: Mon, 16 Dec 2024 09:10:17 -0800 Message-ID: <20241216171017.4881-1-robdclark@gmail.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	iommu/arm-smmu-qcom: Only enable stall on smmu-v2 \| expand iommu/arm-smmu-qcom: Only enable stall on smmu-v2

Rob Clark Dec. 16, 2024, 5:10 p.m. UTC

From: Rob Clark <robdclark@chromium.org>

On mmu-500, stall-on-fault seems to stall all context banks, causing the
GMU to misbehave.  So limit this feature to smmu-v2 for now.

This fixes an issue with an older mesa bug taking outo the system
because of GMU going off into the year.

Signed-off-by: Rob Clark <robdclark@chromium.org>
---
 drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Akhil P Oommen Dec. 16, 2024, 8:28 p.m. UTC | #1

On 12/16/2024 10:40 PM, Rob Clark wrote:
> From: Rob Clark <robdclark@chromium.org>
> 
> On mmu-500, stall-on-fault seems to stall all context banks, causing the
> GMU to misbehave.  So limit this feature to smmu-v2 for now.
> 
> This fixes an issue with an older mesa bug taking outo the system
> because of GMU going off into the year.
> 
> Signed-off-by: Rob Clark <robdclark@chromium.org>
> ---
>  drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> index c4c52f7bd09a..1c881e88fc4d 100644
> --- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> +++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> @@ -331,8 +331,10 @@ static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain,
>  	priv->get_ttbr1_cfg = qcom_adreno_smmu_get_ttbr1_cfg;
>  	priv->set_ttbr0_cfg = qcom_adreno_smmu_set_ttbr0_cfg;
>  	priv->get_fault_info = qcom_adreno_smmu_get_fault_info;
> -	priv->set_stall = qcom_adreno_smmu_set_stall;
> -	priv->resume_translation = qcom_adreno_smmu_resume_translation;
> +	if (of_device_is_compatible(np, "qcom,smmu-v2")) {
> +		priv->set_stall = qcom_adreno_smmu_set_stall;
> +		priv->resume_translation = qcom_adreno_smmu_resume_translation;
> +	}

Shall we disable this from the driver instead? A debugfs knob to trigger
coredump after a pagefault is very convenient.

-Akhil

>  	priv->set_prr_bit = NULL;
>  	priv->set_prr_addr = NULL;
>

Rob Clark Dec. 16, 2024, 8:54 p.m. UTC | #2

On Mon, Dec 16, 2024 at 12:28 PM Akhil P Oommen
<quic_akhilpo@quicinc.com> wrote:
>
> On 12/16/2024 10:40 PM, Rob Clark wrote:
> > From: Rob Clark <robdclark@chromium.org>
> >
> > On mmu-500, stall-on-fault seems to stall all context banks, causing the
> > GMU to misbehave.  So limit this feature to smmu-v2 for now.
> >
> > This fixes an issue with an older mesa bug taking outo the system
> > because of GMU going off into the year.
> >
> > Signed-off-by: Rob Clark <robdclark@chromium.org>
> > ---
> >  drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> > index c4c52f7bd09a..1c881e88fc4d 100644
> > --- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> > +++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> > @@ -331,8 +331,10 @@ static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain,
> >       priv->get_ttbr1_cfg = qcom_adreno_smmu_get_ttbr1_cfg;
> >       priv->set_ttbr0_cfg = qcom_adreno_smmu_set_ttbr0_cfg;
> >       priv->get_fault_info = qcom_adreno_smmu_get_fault_info;
> > -     priv->set_stall = qcom_adreno_smmu_set_stall;
> > -     priv->resume_translation = qcom_adreno_smmu_resume_translation;
> > +     if (of_device_is_compatible(np, "qcom,smmu-v2")) {
> > +             priv->set_stall = qcom_adreno_smmu_set_stall;
> > +             priv->resume_translation = qcom_adreno_smmu_resume_translation;
> > +     }
>
> Shall we disable this from the driver instead? A debugfs knob to trigger
> coredump after a pagefault is very convenient.

It would require the driver to find the compatible for the smmu, so it
could differentiate btwn smmu-v2 and mmu-500, which seemed like it
might be a bit uglier.

Ideally/hopefully we could figure out how to make GMU a bit more
resilient in the face of stalled translations, because it is useful to
get accurate devcore dumps on smmu faults.  At that point we could
revert this change.

BR,
-R

>
> -Akhil
>
> >       priv->set_prr_bit = NULL;
> >       priv->set_prr_addr = NULL;
> >
>

Will Deacon Dec. 19, 2024, 11:30 a.m. UTC | #3

On Mon, Dec 16, 2024 at 09:10:17AM -0800, Rob Clark wrote:
> From: Rob Clark <robdclark@chromium.org>
> 
> On mmu-500, stall-on-fault seems to stall all context banks, causing the
> GMU to misbehave.  So limit this feature to smmu-v2 for now.

MMU-500 has public documentation so please can you dig up what the
actual behaviour is rather than guess?

> This fixes an issue with an older mesa bug taking outo the system
> because of GMU going off into the year.

Sorry, but I don't understand this sentence.

Will

Robin Murphy Dec. 19, 2024, 12:08 p.m. UTC | #4

On 2024-12-19 11:30 am, Will Deacon wrote:
> On Mon, Dec 16, 2024 at 09:10:17AM -0800, Rob Clark wrote:
>> From: Rob Clark <robdclark@chromium.org>
>>
>> On mmu-500, stall-on-fault seems to stall all context banks, causing the
>> GMU to misbehave.  So limit this feature to smmu-v2 for now.
> 
> MMU-500 has public documentation so please can you dig up what the
> actual behaviour is rather than guess?

Yeah, I'm pretty sure that's not true as stated, especially with 
SCTLR.HUPCF set as qcom_adreno_smmu_write_sctlr() does. However it is 
plausible that at the system interconnect level, a sufficient number of 
stalled transactions might backpressure other transactions from entering 
the same TBU, even if they are destined for a different context. That's 
more about the configuration and integration of individual SoCs than the 
SMMU IP used.

Robin.

>> This fixes an issue with an older mesa bug taking outo the system
>> because of GMU going off into the year.
> 
> Sorry, but I don't understand this sentence.
> 
> Will

Rob Clark Dec. 19, 2024, 3 p.m. UTC | #5

On Thu, Dec 19, 2024 at 4:08 AM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 2024-12-19 11:30 am, Will Deacon wrote:
> > On Mon, Dec 16, 2024 at 09:10:17AM -0800, Rob Clark wrote:
> >> From: Rob Clark <robdclark@chromium.org>
> >>
> >> On mmu-500, stall-on-fault seems to stall all context banks, causing the
> >> GMU to misbehave.  So limit this feature to smmu-v2 for now.
> >
> > MMU-500 has public documentation so please can you dig up what the
> > actual behaviour is rather than guess?
>
> Yeah, I'm pretty sure that's not true as stated, especially with
> SCTLR.HUPCF set as qcom_adreno_smmu_write_sctlr() does. However it is
> plausible that at the system interconnect level, a sufficient number of
> stalled transactions might backpressure other transactions from entering
> the same TBU, even if they are destined for a different context. That's
> more about the configuration and integration of individual SoCs than the
> SMMU IP used.

I am aware of the docs and I've spent most of the last couple days
going thru them, as well as the errata, since it would be unfortunate
for debugging to disable this ;-)

The scenario where things lock up involves at least a few thousand
faults in rapid succession.  Disabling CFIE in the irq handler and
re-enabling when I resume translation does stop the flood of irq's but
not the lockup.  It might well be something about how the smmu is
integrated with the interconnect.

BR,
-R

> Robin.
>
> >> This fixes an issue with an older mesa bug taking outo the system
> >> because of GMU going off into the year.
> >
> > Sorry, but I don't understand this sentence.
> >
> > Will
>

iommu/arm-smmu-qcom: Only enable stall on smmu-v2

Commit Message

Comments

Patch