diff mbox series

iommu: arm-smmu: Set SCTLR.HUPCF bit

Message ID 20180927224609.19515-1-robdclark@gmail.com (mailing list archive)
State New, archived
Headers show
Series iommu: arm-smmu: Set SCTLR.HUPCF bit | expand

Commit Message

Rob Clark Sept. 27, 2018, 10:46 p.m. UTC
We seem to need to set either this or CFCFG (stall), otherwise gpu
faults trigger problems with other in-flight transactions from the
GPU causing CP errors, etc.

In the ARM SMMU spec, the 'Hit under previous context fault' bit is
described as:

 '0' - Stall or terminate subsequent transactions in the presence
       of an outstanding context fault
 '1' - Process all subsequent transactions independently of any
       outstanding context fault.

Since we don't enable CFCFG (stall) the behavior of terminating
other transactions makes sense.  And is probably not what we want
(and definately not what we want for GPU).

Signed-off-by: Rob Clark <robdclark@gmail.com>
---
So I hit this issue a long time back on 820 (msm8996) and at the
time I solved it with a patch that enabled CFCFG.  And it resurfaced
more recently on sdm845.  But at the time CFCFG was rejected, iirc
because of concern that it would cause problems on other non-qcom
arm smmu implementations.  And I think I forgot to send this version
of the solution.

If enabling HUPCF is anticipated to cause problems on other ARM
SMMU implementations, I think I can come up with a variant of this
patch which conditionally enables it for snapdragon.

Either way, I'd really like to get some variant of this fix merged
(and probably it would be a good idea for stable kernel branches
too), since current behaviour with the GPU means faults turn into
a fantastic cascade of fail.

 drivers/iommu/arm-smmu-regs.h | 1 +
 drivers/iommu/arm-smmu.c      | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

Comments

Will Deacon Oct. 29, 2018, 7:10 p.m. UTC | #1
On Thu, Sep 27, 2018 at 06:46:07PM -0400, Rob Clark wrote:
> We seem to need to set either this or CFCFG (stall), otherwise gpu
> faults trigger problems with other in-flight transactions from the
> GPU causing CP errors, etc.
> 
> In the ARM SMMU spec, the 'Hit under previous context fault' bit is
> described as:
> 
>  '0' - Stall or terminate subsequent transactions in the presence
>        of an outstanding context fault
>  '1' - Process all subsequent transactions independently of any
>        outstanding context fault.
> 
> Since we don't enable CFCFG (stall) the behavior of terminating
> other transactions makes sense.  And is probably not what we want
> (and definately not what we want for GPU).
> 
> Signed-off-by: Rob Clark <robdclark@gmail.com>
> ---
> So I hit this issue a long time back on 820 (msm8996) and at the
> time I solved it with a patch that enabled CFCFG.  And it resurfaced
> more recently on sdm845.  But at the time CFCFG was rejected, iirc
> because of concern that it would cause problems on other non-qcom
> arm smmu implementations.  And I think I forgot to send this version
> of the solution.
> 
> If enabling HUPCF is anticipated to cause problems on other ARM
> SMMU implementations, I think I can come up with a variant of this
> patch which conditionally enables it for snapdragon.
> 
> Either way, I'd really like to get some variant of this fix merged
> (and probably it would be a good idea for stable kernel branches
> too), since current behaviour with the GPU means faults turn into
> a fantastic cascade of fail.

Can you describe how this fantastic cascade of fail improves with this
patch, please? If you're getting context faults then something has already
gone horribly wrong, so I'm trying to work out how this improves things.

> diff --git a/drivers/iommu/arm-smmu-regs.h b/drivers/iommu/arm-smmu-regs.h
> index a1226e4ab5f8..2291925eb800 100644
> --- a/drivers/iommu/arm-smmu-regs.h
> +++ b/drivers/iommu/arm-smmu-regs.h
> @@ -178,6 +178,7 @@ enum arm_smmu_s2cr_privcfg {
>  #define ARM_SMMU_CB_ATSR		0x8f0
>  
>  #define SCTLR_S1_ASIDPNE		(1 << 12)
> +#define SCTLR_HUPCF			(1 << 8)
>  #define SCTLR_CFCFG			(1 << 7)
>  #define SCTLR_CFIE			(1 << 6)
>  #define SCTLR_CFRE			(1 << 5)
> diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
> index f7a96bcf94a6..47ffc9aade72 100644
> --- a/drivers/iommu/arm-smmu.c
> +++ b/drivers/iommu/arm-smmu.c
> @@ -713,9 +713,9 @@ static void arm_smmu_write_context_bank(struct arm_smmu_device *smmu, int idx)
>  	reg = SCTLR_CFIE | SCTLR_CFRE | SCTLR_AFE | SCTLR_TRE | SCTLR_M;
>  	if (stage1)
>  		reg |= SCTLR_S1_ASIDPNE;
> +	reg |= SCTLR_HUPCF;

Maybe just throw this into the assignment a couple of lines above?

>  	if (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN))
>  		reg |= SCTLR_E;
> -

Random whitespace change.

Will
Rob Clark Nov. 9, 2018, 6:01 p.m. UTC | #2
On Mon, Oct 29, 2018 at 3:09 PM Will Deacon <will.deacon@arm.com> wrote:
>
> On Thu, Sep 27, 2018 at 06:46:07PM -0400, Rob Clark wrote:
> > We seem to need to set either this or CFCFG (stall), otherwise gpu
> > faults trigger problems with other in-flight transactions from the
> > GPU causing CP errors, etc.
> >
> > In the ARM SMMU spec, the 'Hit under previous context fault' bit is
> > described as:
> >
> >  '0' - Stall or terminate subsequent transactions in the presence
> >        of an outstanding context fault
> >  '1' - Process all subsequent transactions independently of any
> >        outstanding context fault.
> >
> > Since we don't enable CFCFG (stall) the behavior of terminating
> > other transactions makes sense.  And is probably not what we want
> > (and definately not what we want for GPU).
> >
> > Signed-off-by: Rob Clark <robdclark@gmail.com>
> > ---
> > So I hit this issue a long time back on 820 (msm8996) and at the
> > time I solved it with a patch that enabled CFCFG.  And it resurfaced
> > more recently on sdm845.  But at the time CFCFG was rejected, iirc
> > because of concern that it would cause problems on other non-qcom
> > arm smmu implementations.  And I think I forgot to send this version
> > of the solution.
> >
> > If enabling HUPCF is anticipated to cause problems on other ARM
> > SMMU implementations, I think I can come up with a variant of this
> > patch which conditionally enables it for snapdragon.
> >
> > Either way, I'd really like to get some variant of this fix merged
> > (and probably it would be a good idea for stable kernel branches
> > too), since current behaviour with the GPU means faults turn into
> > a fantastic cascade of fail.
>
> Can you describe how this fantastic cascade of fail improves with this
> patch, please? If you're getting context faults then something has already
> gone horribly wrong, so I'm trying to work out how this improves things.
>

There are plenty of cases where getting iommu faults with a GPU is
"normal", or at least not something the kernel or even GL driver can
control.

With this patch, you still get the iommu fault, but it doesn't cause
the gpu to crash.  But without it, other memory accesses in flight
while the fault occurs, like the GPU command-processor reading further
ahead in the cmdstream to setup next draw, would return zero's,
causing the GPU to crash or get into a bad state.

BR,
-R

> > diff --git a/drivers/iommu/arm-smmu-regs.h b/drivers/iommu/arm-smmu-regs.h
> > index a1226e4ab5f8..2291925eb800 100644
> > --- a/drivers/iommu/arm-smmu-regs.h
> > +++ b/drivers/iommu/arm-smmu-regs.h
> > @@ -178,6 +178,7 @@ enum arm_smmu_s2cr_privcfg {
> >  #define ARM_SMMU_CB_ATSR             0x8f0
> >
> >  #define SCTLR_S1_ASIDPNE             (1 << 12)
> > +#define SCTLR_HUPCF                  (1 << 8)
> >  #define SCTLR_CFCFG                  (1 << 7)
> >  #define SCTLR_CFIE                   (1 << 6)
> >  #define SCTLR_CFRE                   (1 << 5)
> > diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
> > index f7a96bcf94a6..47ffc9aade72 100644
> > --- a/drivers/iommu/arm-smmu.c
> > +++ b/drivers/iommu/arm-smmu.c
> > @@ -713,9 +713,9 @@ static void arm_smmu_write_context_bank(struct arm_smmu_device *smmu, int idx)
> >       reg = SCTLR_CFIE | SCTLR_CFRE | SCTLR_AFE | SCTLR_TRE | SCTLR_M;
> >       if (stage1)
> >               reg |= SCTLR_S1_ASIDPNE;
> > +     reg |= SCTLR_HUPCF;
>
> Maybe just throw this into the assignment a couple of lines above?
>
> >       if (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN))
> >               reg |= SCTLR_E;
> > -
>
> Random whitespace change.
>
> Will
Will Deacon Nov. 13, 2018, 6:32 a.m. UTC | #3
On Fri, Nov 09, 2018 at 01:01:55PM -0500, Rob Clark wrote:
> On Mon, Oct 29, 2018 at 3:09 PM Will Deacon <will.deacon@arm.com> wrote:
> > On Thu, Sep 27, 2018 at 06:46:07PM -0400, Rob Clark wrote:
> > > We seem to need to set either this or CFCFG (stall), otherwise gpu
> > > faults trigger problems with other in-flight transactions from the
> > > GPU causing CP errors, etc.
> > >
> > > In the ARM SMMU spec, the 'Hit under previous context fault' bit is
> > > described as:
> > >
> > >  '0' - Stall or terminate subsequent transactions in the presence
> > >        of an outstanding context fault
> > >  '1' - Process all subsequent transactions independently of any
> > >        outstanding context fault.
> > >
> > > Since we don't enable CFCFG (stall) the behavior of terminating
> > > other transactions makes sense.  And is probably not what we want
> > > (and definately not what we want for GPU).
> > >
> > > Signed-off-by: Rob Clark <robdclark@gmail.com>
> > > ---
> > > So I hit this issue a long time back on 820 (msm8996) and at the
> > > time I solved it with a patch that enabled CFCFG.  And it resurfaced
> > > more recently on sdm845.  But at the time CFCFG was rejected, iirc
> > > because of concern that it would cause problems on other non-qcom
> > > arm smmu implementations.  And I think I forgot to send this version
> > > of the solution.
> > >
> > > If enabling HUPCF is anticipated to cause problems on other ARM
> > > SMMU implementations, I think I can come up with a variant of this
> > > patch which conditionally enables it for snapdragon.
> > >
> > > Either way, I'd really like to get some variant of this fix merged
> > > (and probably it would be a good idea for stable kernel branches
> > > too), since current behaviour with the GPU means faults turn into
> > > a fantastic cascade of fail.
> >
> > Can you describe how this fantastic cascade of fail improves with this
> > patch, please? If you're getting context faults then something has already
> > gone horribly wrong, so I'm trying to work out how this improves things.
> >
> 
> There are plenty of cases where getting iommu faults with a GPU is
> "normal", or at least not something the kernel or even GL driver can
> control.

Such as? All the mainline driver does is print a diagnostic and clear the
fault, which doesn't seem generally useful.

> With this patch, you still get the iommu fault, but it doesn't cause
> the gpu to crash.  But without it, other memory accesses in flight
> while the fault occurs, like the GPU command-processor reading further
> ahead in the cmdstream to setup next draw, would return zero's,
> causing the GPU to crash or get into a bad state.

I get that part, but I don't understand why we're seeing faults in the first
place and I worry that this patch is just the tip of the iceberg. It's also
not clear that processing subsequent transactions is always the right thing
to do in a world where we actually want to report (and handle) synchronous
faults from devices.

Will
Rob Clark Nov. 13, 2018, 1:12 p.m. UTC | #4
On Tue, Nov 13, 2018 at 1:32 AM Will Deacon <will.deacon@arm.com> wrote:
>
> On Fri, Nov 09, 2018 at 01:01:55PM -0500, Rob Clark wrote:
> > On Mon, Oct 29, 2018 at 3:09 PM Will Deacon <will.deacon@arm.com> wrote:
> > > On Thu, Sep 27, 2018 at 06:46:07PM -0400, Rob Clark wrote:
> > > > We seem to need to set either this or CFCFG (stall), otherwise gpu
> > > > faults trigger problems with other in-flight transactions from the
> > > > GPU causing CP errors, etc.
> > > >
> > > > In the ARM SMMU spec, the 'Hit under previous context fault' bit is
> > > > described as:
> > > >
> > > >  '0' - Stall or terminate subsequent transactions in the presence
> > > >        of an outstanding context fault
> > > >  '1' - Process all subsequent transactions independently of any
> > > >        outstanding context fault.
> > > >
> > > > Since we don't enable CFCFG (stall) the behavior of terminating
> > > > other transactions makes sense.  And is probably not what we want
> > > > (and definately not what we want for GPU).
> > > >
> > > > Signed-off-by: Rob Clark <robdclark@gmail.com>
> > > > ---
> > > > So I hit this issue a long time back on 820 (msm8996) and at the
> > > > time I solved it with a patch that enabled CFCFG.  And it resurfaced
> > > > more recently on sdm845.  But at the time CFCFG was rejected, iirc
> > > > because of concern that it would cause problems on other non-qcom
> > > > arm smmu implementations.  And I think I forgot to send this version
> > > > of the solution.
> > > >
> > > > If enabling HUPCF is anticipated to cause problems on other ARM
> > > > SMMU implementations, I think I can come up with a variant of this
> > > > patch which conditionally enables it for snapdragon.
> > > >
> > > > Either way, I'd really like to get some variant of this fix merged
> > > > (and probably it would be a good idea for stable kernel branches
> > > > too), since current behaviour with the GPU means faults turn into
> > > > a fantastic cascade of fail.
> > >
> > > Can you describe how this fantastic cascade of fail improves with this
> > > patch, please? If you're getting context faults then something has already
> > > gone horribly wrong, so I'm trying to work out how this improves things.
> > >
> >
> > There are plenty of cases where getting iommu faults with a GPU is
> > "normal", or at least not something the kernel or even GL driver can
> > control.
>
> Such as? All the mainline driver does is print a diagnostic and clear the
> fault, which doesn't seem generally useful.

it is useful to debug the fault ;-)

Although eventually we'll want to be able to do more than that, like
have the fault trigger bringing in pages of a mmap'd file and that
sort of thing.

> > With this patch, you still get the iommu fault, but it doesn't cause
> > the gpu to crash.  But without it, other memory accesses in flight
> > while the fault occurs, like the GPU command-processor reading further
> > ahead in the cmdstream to setup next draw, would return zero's,
> > causing the GPU to crash or get into a bad state.
>
> I get that part, but I don't understand why we're seeing faults in the first
> place and I worry that this patch is just the tip of the iceberg. It's also
> not clear that processing subsequent transactions is always the right thing
> to do in a world where we actually want to report (and handle) synchronous
> faults from devices.

Sure, it is a bug.. but it can be an application bug that is not
something the userspace GL driver or kernel could do anything about.
We shouldn't let this kill the GPU.  If the application didn't have
this much control, we wouldn't need an IOMMU in the first place[1].
With opencl compute, the userspace controlled shader has full blown
pointers to GPU memory.

And even in cases where it is a userspace GL driver bug, having some
robustness to not completely kill the GPU makes debugging much easier.
Something I do a lot when bringing up support for a new generation of
GPU.

I'm having a hard time understanding your objection to this.
Returning zero's for non-faulting transactions is a *really bad idea*.

BR,
-R


[1] yes, vc4 did without an IOMMU.. but it also had to do some fairly
complex shader analysis in the kernel, and it was restricted to gles2
feature set, so no real flow control, and no memory access beyond
texture fetch.
Will Deacon Nov. 26, 2018, 7:31 p.m. UTC | #5
Hi Rob,

On Tue, Nov 13, 2018 at 08:12:35AM -0500, Rob Clark wrote:
> On Tue, Nov 13, 2018 at 1:32 AM Will Deacon <will.deacon@arm.com> wrote:
> > On Fri, Nov 09, 2018 at 01:01:55PM -0500, Rob Clark wrote:
> > > On Mon, Oct 29, 2018 at 3:09 PM Will Deacon <will.deacon@arm.com> wrote:
> > > > On Thu, Sep 27, 2018 at 06:46:07PM -0400, Rob Clark wrote:
> > > > > We seem to need to set either this or CFCFG (stall), otherwise gpu
> > > > > faults trigger problems with other in-flight transactions from the
> > > > > GPU causing CP errors, etc.
> > > > >
> > > > > In the ARM SMMU spec, the 'Hit under previous context fault' bit is
> > > > > described as:
> > > > >
> > > > >  '0' - Stall or terminate subsequent transactions in the presence
> > > > >        of an outstanding context fault
> > > > >  '1' - Process all subsequent transactions independently of any
> > > > >        outstanding context fault.
> > > > >
> > > > > Since we don't enable CFCFG (stall) the behavior of terminating
> > > > > other transactions makes sense.  And is probably not what we want
> > > > > (and definately not what we want for GPU).
> > > > >
> > > > > Signed-off-by: Rob Clark <robdclark@gmail.com>
> > > > > ---
> > > > > So I hit this issue a long time back on 820 (msm8996) and at the
> > > > > time I solved it with a patch that enabled CFCFG.  And it resurfaced
> > > > > more recently on sdm845.  But at the time CFCFG was rejected, iirc
> > > > > because of concern that it would cause problems on other non-qcom
> > > > > arm smmu implementations.  And I think I forgot to send this version
> > > > > of the solution.
> > > > >
> > > > > If enabling HUPCF is anticipated to cause problems on other ARM
> > > > > SMMU implementations, I think I can come up with a variant of this
> > > > > patch which conditionally enables it for snapdragon.
> > > > >
> > > > > Either way, I'd really like to get some variant of this fix merged
> > > > > (and probably it would be a good idea for stable kernel branches
> > > > > too), since current behaviour with the GPU means faults turn into
> > > > > a fantastic cascade of fail.
> > > >
> > > > Can you describe how this fantastic cascade of fail improves with this
> > > > patch, please? If you're getting context faults then something has already
> > > > gone horribly wrong, so I'm trying to work out how this improves things.
> > > >
> > >
> > > There are plenty of cases where getting iommu faults with a GPU is
> > > "normal", or at least not something the kernel or even GL driver can
> > > control.
> >
> > Such as? All the mainline driver does is print a diagnostic and clear the
> > fault, which doesn't seem generally useful.
> 
> it is useful to debug the fault ;-)
> 
> Although eventually we'll want to be able to do more than that, like
> have the fault trigger bringing in pages of a mmap'd file and that
> sort of thing.

Right, and feels very strange to me if we have this bit set because it
means that your fault is no longer synchronous and therefore diverges
from the fault model that you get from the CPU, where you certainly
wouldn't expect stores appearing in the program after a faulting load to
be visible in memory. However, thinking harder about it, I suppose we're
already in a situation where the translations are handled out of order
in the absence of barriers, so maybe it's not the end of the world.

Could you dump the FSR value that you see in the fault handler, please?
From my reading of the architecture spec, as long as clear all of the
fault bits in the irq handler, then your machine shouldn't die like it
does with HUPCFG=CFCFG=0..

> > > With this patch, you still get the iommu fault, but it doesn't cause
> > > the gpu to crash.  But without it, other memory accesses in flight
> > > while the fault occurs, like the GPU command-processor reading further
> > > ahead in the cmdstream to setup next draw, would return zero's,
> > > causing the GPU to crash or get into a bad state.
> >
> > I get that part, but I don't understand why we're seeing faults in the first
> > place and I worry that this patch is just the tip of the iceberg. It's also
> > not clear that processing subsequent transactions is always the right thing
> > to do in a world where we actually want to report (and handle) synchronous
> > faults from devices.
> 
> Sure, it is a bug.. but it can be an application bug that is not
> something the userspace GL driver or kernel could do anything about.
> We shouldn't let this kill the GPU.  If the application didn't have
> this much control, we wouldn't need an IOMMU in the first place[1].
> With opencl compute, the userspace controlled shader has full blown
> pointers to GPU memory.
> 
> And even in cases where it is a userspace GL driver bug, having some
> robustness to not completely kill the GPU makes debugging much easier.
> Something I do a lot when bringing up support for a new generation of
> GPU.
> 
> I'm having a hard time understanding your objection to this.
> Returning zero's for non-faulting transactions is a *really bad idea*.

Wait -- who said anything about returning zeroes? Where does that behaviour
appear in the architecture?

Will
Jordan Crouse Nov. 26, 2018, 8:38 p.m. UTC | #6
On Mon, Nov 26, 2018 at 07:31:48PM +0000, Will Deacon wrote:
> Hi Rob,
> 
> On Tue, Nov 13, 2018 at 08:12:35AM -0500, Rob Clark wrote:
> > On Tue, Nov 13, 2018 at 1:32 AM Will Deacon <will.deacon@arm.com> wrote:
> > > On Fri, Nov 09, 2018 at 01:01:55PM -0500, Rob Clark wrote:
> > > > On Mon, Oct 29, 2018 at 3:09 PM Will Deacon <will.deacon@arm.com> wrote:
> > > > > On Thu, Sep 27, 2018 at 06:46:07PM -0400, Rob Clark wrote:
> > > > > > We seem to need to set either this or CFCFG (stall), otherwise gpu
> > > > > > faults trigger problems with other in-flight transactions from the
> > > > > > GPU causing CP errors, etc.
> > > > > >
> > > > > > In the ARM SMMU spec, the 'Hit under previous context fault' bit is
> > > > > > described as:
> > > > > >
> > > > > >  '0' - Stall or terminate subsequent transactions in the presence
> > > > > >        of an outstanding context fault
> > > > > >  '1' - Process all subsequent transactions independently of any
> > > > > >        outstanding context fault.
> > > > > >
> > > > > > Since we don't enable CFCFG (stall) the behavior of terminating
> > > > > > other transactions makes sense.  And is probably not what we want
> > > > > > (and definately not what we want for GPU).
> > > > > >
> > > > > > Signed-off-by: Rob Clark <robdclark@gmail.com>
> > > > > > ---
> > > > > > So I hit this issue a long time back on 820 (msm8996) and at the
> > > > > > time I solved it with a patch that enabled CFCFG.  And it resurfaced
> > > > > > more recently on sdm845.  But at the time CFCFG was rejected, iirc
> > > > > > because of concern that it would cause problems on other non-qcom
> > > > > > arm smmu implementations.  And I think I forgot to send this version
> > > > > > of the solution.
> > > > > >
> > > > > > If enabling HUPCF is anticipated to cause problems on other ARM
> > > > > > SMMU implementations, I think I can come up with a variant of this
> > > > > > patch which conditionally enables it for snapdragon.
> > > > > >
> > > > > > Either way, I'd really like to get some variant of this fix merged
> > > > > > (and probably it would be a good idea for stable kernel branches
> > > > > > too), since current behaviour with the GPU means faults turn into
> > > > > > a fantastic cascade of fail.
> > > > >
> > > > > Can you describe how this fantastic cascade of fail improves with this
> > > > > patch, please? If you're getting context faults then something has already
> > > > > gone horribly wrong, so I'm trying to work out how this improves things.
> > > > >
> > > >
> > > > There are plenty of cases where getting iommu faults with a GPU is
> > > > "normal", or at least not something the kernel or even GL driver can
> > > > control.
> > >
> > > Such as? All the mainline driver does is print a diagnostic and clear the
> > > fault, which doesn't seem generally useful.
> > 
> > it is useful to debug the fault ;-)
> > 
> > Although eventually we'll want to be able to do more than that, like
> > have the fault trigger bringing in pages of a mmap'd file and that
> > sort of thing.
> 
> Right, and feels very strange to me if we have this bit set because it
> means that your fault is no longer synchronous and therefore diverges
> from the fault model that you get from the CPU, where you certainly
> wouldn't expect stores appearing in the program after a faulting load to
> be visible in memory. However, thinking harder about it, I suppose we're
> already in a situation where the translations are handled out of order
> in the absence of barriers, so maybe it's not the end of the world.
> 
> Could you dump the FSR value that you see in the fault handler, please?
> From my reading of the architecture spec, as long as clear all of the
> fault bits in the irq handler, then your machine shouldn't die like it
> does with HUPCFG=CFCFG=0..
> 
> > > > With this patch, you still get the iommu fault, but it doesn't cause
> > > > the gpu to crash.  But without it, other memory accesses in flight
> > > > while the fault occurs, like the GPU command-processor reading further
> > > > ahead in the cmdstream to setup next draw, would return zero's,
> > > > causing the GPU to crash or get into a bad state.
> > >
> > > I get that part, but I don't understand why we're seeing faults in the first
> > > place and I worry that this patch is just the tip of the iceberg. It's also
> > > not clear that processing subsequent transactions is always the right thing
> > > to do in a world where we actually want to report (and handle) synchronous
> > > faults from devices.
> > 
> > Sure, it is a bug.. but it can be an application bug that is not
> > something the userspace GL driver or kernel could do anything about.
> > We shouldn't let this kill the GPU.  If the application didn't have
> > this much control, we wouldn't need an IOMMU in the first place[1].
> > With opencl compute, the userspace controlled shader has full blown
> > pointers to GPU memory.
> > 
> > And even in cases where it is a userspace GL driver bug, having some
> > robustness to not completely kill the GPU makes debugging much easier.
> > Something I do a lot when bringing up support for a new generation of
> > GPU.
> > 
> > I'm having a hard time understanding your objection to this.
> > Returning zero's for non-faulting transactions is a *really bad idea*.
> 
> Wait -- who said anything about returning zeroes? Where does that behaviour
> appear in the architecture?

I _think_ it is the bus implementation that returns zero on a terminated
transaction but the effect on the GPU is the same.

Jordan
Rob Clark Nov. 26, 2018, 8:56 p.m. UTC | #7
On Mon, Nov 26, 2018 at 2:31 PM Will Deacon <will.deacon@arm.com> wrote:
>
> Hi Rob,
>
> On Tue, Nov 13, 2018 at 08:12:35AM -0500, Rob Clark wrote:
> > On Tue, Nov 13, 2018 at 1:32 AM Will Deacon <will.deacon@arm.com> wrote:
> > > On Fri, Nov 09, 2018 at 01:01:55PM -0500, Rob Clark wrote:
> > > > On Mon, Oct 29, 2018 at 3:09 PM Will Deacon <will.deacon@arm.com> wrote:
> > > > > On Thu, Sep 27, 2018 at 06:46:07PM -0400, Rob Clark wrote:
> > > > > > We seem to need to set either this or CFCFG (stall), otherwise gpu
> > > > > > faults trigger problems with other in-flight transactions from the
> > > > > > GPU causing CP errors, etc.
> > > > > >
> > > > > > In the ARM SMMU spec, the 'Hit under previous context fault' bit is
> > > > > > described as:
> > > > > >
> > > > > >  '0' - Stall or terminate subsequent transactions in the presence
> > > > > >        of an outstanding context fault
> > > > > >  '1' - Process all subsequent transactions independently of any
> > > > > >        outstanding context fault.
> > > > > >
> > > > > > Since we don't enable CFCFG (stall) the behavior of terminating
> > > > > > other transactions makes sense.  And is probably not what we want
> > > > > > (and definately not what we want for GPU).
> > > > > >
> > > > > > Signed-off-by: Rob Clark <robdclark@gmail.com>
> > > > > > ---
> > > > > > So I hit this issue a long time back on 820 (msm8996) and at the
> > > > > > time I solved it with a patch that enabled CFCFG.  And it resurfaced
> > > > > > more recently on sdm845.  But at the time CFCFG was rejected, iirc
> > > > > > because of concern that it would cause problems on other non-qcom
> > > > > > arm smmu implementations.  And I think I forgot to send this version
> > > > > > of the solution.
> > > > > >
> > > > > > If enabling HUPCF is anticipated to cause problems on other ARM
> > > > > > SMMU implementations, I think I can come up with a variant of this
> > > > > > patch which conditionally enables it for snapdragon.
> > > > > >
> > > > > > Either way, I'd really like to get some variant of this fix merged
> > > > > > (and probably it would be a good idea for stable kernel branches
> > > > > > too), since current behaviour with the GPU means faults turn into
> > > > > > a fantastic cascade of fail.
> > > > >
> > > > > Can you describe how this fantastic cascade of fail improves with this
> > > > > patch, please? If you're getting context faults then something has already
> > > > > gone horribly wrong, so I'm trying to work out how this improves things.
> > > > >
> > > >
> > > > There are plenty of cases where getting iommu faults with a GPU is
> > > > "normal", or at least not something the kernel or even GL driver can
> > > > control.
> > >
> > > Such as? All the mainline driver does is print a diagnostic and clear the
> > > fault, which doesn't seem generally useful.
> >
> > it is useful to debug the fault ;-)
> >
> > Although eventually we'll want to be able to do more than that, like
> > have the fault trigger bringing in pages of a mmap'd file and that
> > sort of thing.
>
> Right, and feels very strange to me if we have this bit set because it
> means that your fault is no longer synchronous and therefore diverges
> from the fault model that you get from the CPU, where you certainly
> wouldn't expect stores appearing in the program after a faulting load to
> be visible in memory. However, thinking harder about it, I suppose we're
> already in a situation where the translations are handled out of order
> in the absence of barriers, so maybe it's not the end of the world.

I guess I wouldn't have expected synchronous without CFCFG=1

> Could you dump the FSR value that you see in the fault handler, please?
> From my reading of the architecture spec, as long as clear all of the
> fault bits in the irq handler, then your machine shouldn't die like it
> does with HUPCFG=CFCFG=0..

I expect it dies before the irq handler returns..

possibly the behavior of terminated translations returning zero's
might be some detail of qcom's implementation (or how the gpu reacts
to terminated memory transactions, etc), rather than something the
spec expects/specifies.

I'll try and get you a dump of FSR in next couple days.. (need to
switch kernels and write up some test code to trigger faults)

BR,
-R

>
> > > > With this patch, you still get the iommu fault, but it doesn't cause
> > > > the gpu to crash.  But without it, other memory accesses in flight
> > > > while the fault occurs, like the GPU command-processor reading further
> > > > ahead in the cmdstream to setup next draw, would return zero's,
> > > > causing the GPU to crash or get into a bad state.
> > >
> > > I get that part, but I don't understand why we're seeing faults in the first
> > > place and I worry that this patch is just the tip of the iceberg. It's also
> > > not clear that processing subsequent transactions is always the right thing
> > > to do in a world where we actually want to report (and handle) synchronous
> > > faults from devices.
> >
> > Sure, it is a bug.. but it can be an application bug that is not
> > something the userspace GL driver or kernel could do anything about.
> > We shouldn't let this kill the GPU.  If the application didn't have
> > this much control, we wouldn't need an IOMMU in the first place[1].
> > With opencl compute, the userspace controlled shader has full blown
> > pointers to GPU memory.
> >
> > And even in cases where it is a userspace GL driver bug, having some
> > robustness to not completely kill the GPU makes debugging much easier.
> > Something I do a lot when bringing up support for a new generation of
> > GPU.
> >
> > I'm having a hard time understanding your objection to this.
> > Returning zero's for non-faulting transactions is a *really bad idea*.
>
> Wait -- who said anything about returning zeroes? Where does that behaviour
> appear in the architecture?
>
> Will
Rob Clark May 24, 2019, 6:38 p.m. UTC | #8
On Mon, Nov 26, 2018 at 11:31 AM Will Deacon <will.deacon@arm.com> wrote:
>
> Hi Rob,
>
> On Tue, Nov 13, 2018 at 08:12:35AM -0500, Rob Clark wrote:
> > On Tue, Nov 13, 2018 at 1:32 AM Will Deacon <will.deacon@arm.com> wrote:
> > > On Fri, Nov 09, 2018 at 01:01:55PM -0500, Rob Clark wrote:
> > > > On Mon, Oct 29, 2018 at 3:09 PM Will Deacon <will.deacon@arm.com> wrote:
> > > > > On Thu, Sep 27, 2018 at 06:46:07PM -0400, Rob Clark wrote:
> > > > > > We seem to need to set either this or CFCFG (stall), otherwise gpu
> > > > > > faults trigger problems with other in-flight transactions from the
> > > > > > GPU causing CP errors, etc.
> > > > > >
> > > > > > In the ARM SMMU spec, the 'Hit under previous context fault' bit is
> > > > > > described as:
> > > > > >
> > > > > >  '0' - Stall or terminate subsequent transactions in the presence
> > > > > >        of an outstanding context fault
> > > > > >  '1' - Process all subsequent transactions independently of any
> > > > > >        outstanding context fault.
> > > > > >
> > > > > > Since we don't enable CFCFG (stall) the behavior of terminating
> > > > > > other transactions makes sense.  And is probably not what we want
> > > > > > (and definately not what we want for GPU).
> > > > > >
> > > > > > Signed-off-by: Rob Clark <robdclark@gmail.com>
> > > > > > ---
> > > > > > So I hit this issue a long time back on 820 (msm8996) and at the
> > > > > > time I solved it with a patch that enabled CFCFG.  And it resurfaced
> > > > > > more recently on sdm845.  But at the time CFCFG was rejected, iirc
> > > > > > because of concern that it would cause problems on other non-qcom
> > > > > > arm smmu implementations.  And I think I forgot to send this version
> > > > > > of the solution.
> > > > > >
> > > > > > If enabling HUPCF is anticipated to cause problems on other ARM
> > > > > > SMMU implementations, I think I can come up with a variant of this
> > > > > > patch which conditionally enables it for snapdragon.
> > > > > >
> > > > > > Either way, I'd really like to get some variant of this fix merged
> > > > > > (and probably it would be a good idea for stable kernel branches
> > > > > > too), since current behaviour with the GPU means faults turn into
> > > > > > a fantastic cascade of fail.
> > > > >
> > > > > Can you describe how this fantastic cascade of fail improves with this
> > > > > patch, please? If you're getting context faults then something has already
> > > > > gone horribly wrong, so I'm trying to work out how this improves things.
> > > > >
> > > >
> > > > There are plenty of cases where getting iommu faults with a GPU is
> > > > "normal", or at least not something the kernel or even GL driver can
> > > > control.
> > >
> > > Such as? All the mainline driver does is print a diagnostic and clear the
> > > fault, which doesn't seem generally useful.
> >
> > it is useful to debug the fault ;-)
> >
> > Although eventually we'll want to be able to do more than that, like
> > have the fault trigger bringing in pages of a mmap'd file and that
> > sort of thing.
>
> Right, and feels very strange to me if we have this bit set because it
> means that your fault is no longer synchronous and therefore diverges
> from the fault model that you get from the CPU, where you certainly
> wouldn't expect stores appearing in the program after a faulting load to
> be visible in memory. However, thinking harder about it, I suppose we're
> already in a situation where the translations are handled out of order
> in the absence of barriers, so maybe it's not the end of the world.
>
> Could you dump the FSR value that you see in the fault handler, please?
> From my reading of the architecture spec, as long as clear all of the
> fault bits in the irq handler, then your machine shouldn't die like it
> does with HUPCFG=CFCFG=0..


Getting back to this after realizing I lost SCTLR.HUPCF patch that I
was carrying locally when rebasing.  Here is an example dump (w/ FSR)
of what happens:



[May24 14:33] arm-smmu 5040000.iommu: Unhandled context fault:
fsr=0x402, iova=0x7ffe35c0, fsynr=0x1, cb=1
[  +0.000034] adreno 5000000.gpu: CP illegal instruction error
[  +0.000036] adreno 5000000.gpu: CP illegal instruction error
[  +0.000017] adreno 5000000.gpu: CP illegal instruction error
[  +0.000017] adreno 5000000.gpu: CP illegal instruction error
[  +0.000015] adreno 5000000.gpu: CP illegal instruction error
[  +0.000016] adreno 5000000.gpu: CP illegal instruction error
[  +0.000076] adreno 5000000.gpu: CP illegal instruction error
[  +0.000015] adreno 5000000.gpu: CP illegal instruction error
[  +0.000016] adreno 5000000.gpu: CP illegal instruction error
[  +0.000015] adreno 5000000.gpu: CP illegal instruction error
[  +0.047100] adreno 5000000.gpu: [drm:a6xx_irq] *ERROR* gpu fault
ring 0 fence 8 status 00800005 rb 0047/0047 ib1 0000000001CC7000/0000
ib2 0000000001CC5000/0000
[  +0.000106] msm ae00000.mdss: [drm:recover_worker] *ERROR* A630:
hangcheck recover!
[  +0.000380] msm ae00000.mdss: [drm:recover_worker] *ERROR* A630:
offending task: d:flush_queue0 (./deqp-gles31 --deqp-visibility=hidden
--deqp-caselist-file=regressions
--deqp-log-filename=results/FD630/regressions.qpa
--deqp-surface-type=fbo --deqp-surface-width=256
--deqp-surface-height=256)
[  +0.501297] adreno 5000000.gpu: [drm:a6xx_irq] *ERROR* gpu fault
ring 0 fence 8 status 00800005 rb 0047/0047 ib1 0000000 ib2
0000/001CC500000
[  +0.051307] revision: 630 (6.3.0.2)
[  +0.000006] rb 0: fence:    6/8
[  +0.000003] rptr:     24
[  +0.000002] rb wptr:  71


The 'CP illegal instruction error's and following gpu fault are the
result of the faulting read from the GPU causing other non-faulting
reads to return zero.


Note that enabling STALL (CFCFG) also avoids this problem.  I suppose
maybe setting CFCFG for implementation==QCOM_SMMUV2 would be a better
idea?

BR,
-R

> > > > With this patch, you still get the iommu fault, but it doesn't cause
> > > > the gpu to crash.  But without it, other memory accesses in flight
> > > > while the fault occurs, like the GPU command-processor reading further
> > > > ahead in the cmdstream to setup next draw, would return zero's,
> > > > causing the GPU to crash or get into a bad state.
> > >
> > > I get that part, but I don't understand why we're seeing faults in the first
> > > place and I worry that this patch is just the tip of the iceberg. It's also
> > > not clear that processing subsequent transactions is always the right thing
> > > to do in a world where we actually want to report (and handle) synchronous
> > > faults from devices.
> >
> > Sure, it is a bug.. but it can be an application bug that is not
> > something the userspace GL driver or kernel could do anything about.
> > We shouldn't let this kill the GPU.  If the application didn't have
> > this much control, we wouldn't need an IOMMU in the first place[1].
> > With opencl compute, the userspace controlled shader has full blown
> > pointers to GPU memory.
> >
> > And even in cases where it is a userspace GL driver bug, having some
> > robustness to not completely kill the GPU makes debugging much easier.
> > Something I do a lot when bringing up support for a new generation of
> > GPU.
> >
> > I'm having a hard time understanding your objection to this.
> > Returning zero's for non-faulting transactions is a *really bad idea*.
>
> Wait -- who said anything about returning zeroes? Where does that behaviour
> appear in the architecture?
>
> Will
diff mbox series

Patch

diff --git a/drivers/iommu/arm-smmu-regs.h b/drivers/iommu/arm-smmu-regs.h
index a1226e4ab5f8..2291925eb800 100644
--- a/drivers/iommu/arm-smmu-regs.h
+++ b/drivers/iommu/arm-smmu-regs.h
@@ -178,6 +178,7 @@  enum arm_smmu_s2cr_privcfg {
 #define ARM_SMMU_CB_ATSR		0x8f0
 
 #define SCTLR_S1_ASIDPNE		(1 << 12)
+#define SCTLR_HUPCF			(1 << 8)
 #define SCTLR_CFCFG			(1 << 7)
 #define SCTLR_CFIE			(1 << 6)
 #define SCTLR_CFRE			(1 << 5)
diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index f7a96bcf94a6..47ffc9aade72 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -713,9 +713,9 @@  static void arm_smmu_write_context_bank(struct arm_smmu_device *smmu, int idx)
 	reg = SCTLR_CFIE | SCTLR_CFRE | SCTLR_AFE | SCTLR_TRE | SCTLR_M;
 	if (stage1)
 		reg |= SCTLR_S1_ASIDPNE;
+	reg |= SCTLR_HUPCF;
 	if (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN))
 		reg |= SCTLR_E;
-
 	writel_relaxed(reg, cb_base + ARM_SMMU_CB_SCTLR);
 }