Message ID | 20241216171017.4881-1-robdclark@gmail.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | iommu/arm-smmu-qcom: Only enable stall on smmu-v2 | expand |
On 12/16/2024 10:40 PM, Rob Clark wrote: > From: Rob Clark <robdclark@chromium.org> > > On mmu-500, stall-on-fault seems to stall all context banks, causing the > GMU to misbehave. So limit this feature to smmu-v2 for now. > > This fixes an issue with an older mesa bug taking outo the system > because of GMU going off into the year. > > Signed-off-by: Rob Clark <robdclark@chromium.org> > --- > drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 6 ++++-- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c > index c4c52f7bd09a..1c881e88fc4d 100644 > --- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c > +++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c > @@ -331,8 +331,10 @@ static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain, > priv->get_ttbr1_cfg = qcom_adreno_smmu_get_ttbr1_cfg; > priv->set_ttbr0_cfg = qcom_adreno_smmu_set_ttbr0_cfg; > priv->get_fault_info = qcom_adreno_smmu_get_fault_info; > - priv->set_stall = qcom_adreno_smmu_set_stall; > - priv->resume_translation = qcom_adreno_smmu_resume_translation; > + if (of_device_is_compatible(np, "qcom,smmu-v2")) { > + priv->set_stall = qcom_adreno_smmu_set_stall; > + priv->resume_translation = qcom_adreno_smmu_resume_translation; > + } Shall we disable this from the driver instead? A debugfs knob to trigger coredump after a pagefault is very convenient. -Akhil > priv->set_prr_bit = NULL; > priv->set_prr_addr = NULL; >
On Mon, Dec 16, 2024 at 12:28 PM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote: > > On 12/16/2024 10:40 PM, Rob Clark wrote: > > From: Rob Clark <robdclark@chromium.org> > > > > On mmu-500, stall-on-fault seems to stall all context banks, causing the > > GMU to misbehave. So limit this feature to smmu-v2 for now. > > > > This fixes an issue with an older mesa bug taking outo the system > > because of GMU going off into the year. > > > > Signed-off-by: Rob Clark <robdclark@chromium.org> > > --- > > drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 6 ++++-- > > 1 file changed, 4 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c > > index c4c52f7bd09a..1c881e88fc4d 100644 > > --- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c > > +++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c > > @@ -331,8 +331,10 @@ static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain, > > priv->get_ttbr1_cfg = qcom_adreno_smmu_get_ttbr1_cfg; > > priv->set_ttbr0_cfg = qcom_adreno_smmu_set_ttbr0_cfg; > > priv->get_fault_info = qcom_adreno_smmu_get_fault_info; > > - priv->set_stall = qcom_adreno_smmu_set_stall; > > - priv->resume_translation = qcom_adreno_smmu_resume_translation; > > + if (of_device_is_compatible(np, "qcom,smmu-v2")) { > > + priv->set_stall = qcom_adreno_smmu_set_stall; > > + priv->resume_translation = qcom_adreno_smmu_resume_translation; > > + } > > Shall we disable this from the driver instead? A debugfs knob to trigger > coredump after a pagefault is very convenient. It would require the driver to find the compatible for the smmu, so it could differentiate btwn smmu-v2 and mmu-500, which seemed like it might be a bit uglier. Ideally/hopefully we could figure out how to make GMU a bit more resilient in the face of stalled translations, because it is useful to get accurate devcore dumps on smmu faults. At that point we could revert this change. BR, -R > > -Akhil > > > priv->set_prr_bit = NULL; > > priv->set_prr_addr = NULL; > > >
On Mon, Dec 16, 2024 at 09:10:17AM -0800, Rob Clark wrote: > From: Rob Clark <robdclark@chromium.org> > > On mmu-500, stall-on-fault seems to stall all context banks, causing the > GMU to misbehave. So limit this feature to smmu-v2 for now. MMU-500 has public documentation so please can you dig up what the actual behaviour is rather than guess? > This fixes an issue with an older mesa bug taking outo the system > because of GMU going off into the year. Sorry, but I don't understand this sentence. Will
On 2024-12-19 11:30 am, Will Deacon wrote: > On Mon, Dec 16, 2024 at 09:10:17AM -0800, Rob Clark wrote: >> From: Rob Clark <robdclark@chromium.org> >> >> On mmu-500, stall-on-fault seems to stall all context banks, causing the >> GMU to misbehave. So limit this feature to smmu-v2 for now. > > MMU-500 has public documentation so please can you dig up what the > actual behaviour is rather than guess? Yeah, I'm pretty sure that's not true as stated, especially with SCTLR.HUPCF set as qcom_adreno_smmu_write_sctlr() does. However it is plausible that at the system interconnect level, a sufficient number of stalled transactions might backpressure other transactions from entering the same TBU, even if they are destined for a different context. That's more about the configuration and integration of individual SoCs than the SMMU IP used. Robin. >> This fixes an issue with an older mesa bug taking outo the system >> because of GMU going off into the year. > > Sorry, but I don't understand this sentence. > > Will
On Thu, Dec 19, 2024 at 4:08 AM Robin Murphy <robin.murphy@arm.com> wrote: > > On 2024-12-19 11:30 am, Will Deacon wrote: > > On Mon, Dec 16, 2024 at 09:10:17AM -0800, Rob Clark wrote: > >> From: Rob Clark <robdclark@chromium.org> > >> > >> On mmu-500, stall-on-fault seems to stall all context banks, causing the > >> GMU to misbehave. So limit this feature to smmu-v2 for now. > > > > MMU-500 has public documentation so please can you dig up what the > > actual behaviour is rather than guess? > > Yeah, I'm pretty sure that's not true as stated, especially with > SCTLR.HUPCF set as qcom_adreno_smmu_write_sctlr() does. However it is > plausible that at the system interconnect level, a sufficient number of > stalled transactions might backpressure other transactions from entering > the same TBU, even if they are destined for a different context. That's > more about the configuration and integration of individual SoCs than the > SMMU IP used. I am aware of the docs and I've spent most of the last couple days going thru them, as well as the errata, since it would be unfortunate for debugging to disable this ;-) The scenario where things lock up involves at least a few thousand faults in rapid succession. Disabling CFIE in the irq handler and re-enabling when I resume translation does stop the flood of irq's but not the lockup. It might well be something about how the smmu is integrated with the interconnect. BR, -R > Robin. > > >> This fixes an issue with an older mesa bug taking outo the system > >> because of GMU going off into the year. > > > > Sorry, but I don't understand this sentence. > > > > Will >
diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c index c4c52f7bd09a..1c881e88fc4d 100644 --- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c +++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c @@ -331,8 +331,10 @@ static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain, priv->get_ttbr1_cfg = qcom_adreno_smmu_get_ttbr1_cfg; priv->set_ttbr0_cfg = qcom_adreno_smmu_set_ttbr0_cfg; priv->get_fault_info = qcom_adreno_smmu_get_fault_info; - priv->set_stall = qcom_adreno_smmu_set_stall; - priv->resume_translation = qcom_adreno_smmu_resume_translation; + if (of_device_is_compatible(np, "qcom,smmu-v2")) { + priv->set_stall = qcom_adreno_smmu_set_stall; + priv->resume_translation = qcom_adreno_smmu_resume_translation; + } priv->set_prr_bit = NULL; priv->set_prr_addr = NULL;