Message ID | 20240913195132.8282-1-robdclark@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | drm/msm/a6xx+: Insert a fence wait before SMMU table update | expand |
On 13.09.2024 9:51 PM, Rob Clark wrote: > From: Rob Clark <robdclark@chromium.org> > > The CP_SMMU_TABLE_UPDATE _should_ be waiting for idle, but on some > devices (x1-85, possibly others), it seems to pass that barrier while > there are still things in the event completion FIFO waiting to be > written back to memory. Can we try to force-fault around here on other GPUs and perhaps limit this workaround? Akhil, do we have any insight on this? Konrad
On Tue, Sep 17, 2024 at 03:47:09PM +0200, Konrad Dybcio wrote: > On 13.09.2024 9:51 PM, Rob Clark wrote: > > From: Rob Clark <robdclark@chromium.org> > > > > The CP_SMMU_TABLE_UPDATE _should_ be waiting for idle, but on some > > devices (x1-85, possibly others), it seems to pass that barrier while > > there are still things in the event completion FIFO waiting to be > > written back to memory. > > Can we try to force-fault around here on other GPUs and perhaps > limit this workaround? > > Akhil, do we have any insight on this? Nothing at the moment. I will check this further. -Akhil. > > Konrad
On Tue, Sep 17, 2024 at 6:47 AM Konrad Dybcio <konradybcio@kernel.org> wrote: > > On 13.09.2024 9:51 PM, Rob Clark wrote: > > From: Rob Clark <robdclark@chromium.org> > > > > The CP_SMMU_TABLE_UPDATE _should_ be waiting for idle, but on some > > devices (x1-85, possibly others), it seems to pass that barrier while > > there are still things in the event completion FIFO waiting to be > > written back to memory. > > Can we try to force-fault around here on other GPUs and perhaps > limit this workaround? not sure what you mean by "force-fault"... we could probably limit this to certain GPUs, the only reason I didn't is (a) it should be harmless when it is not needed, and (b) I have no real good way to get an exhaustive list of where it is needed. Maybe/hopefully it is only x1-85, but idk. It does bring up an interesting question about preemption, though BR, -R > Akhil, do we have any insight on this? > > Konrad
On 17.09.2024 5:30 PM, Rob Clark wrote: > On Tue, Sep 17, 2024 at 6:47 AM Konrad Dybcio <konradybcio@kernel.org> wrote: >> >> On 13.09.2024 9:51 PM, Rob Clark wrote: >>> From: Rob Clark <robdclark@chromium.org> >>> >>> The CP_SMMU_TABLE_UPDATE _should_ be waiting for idle, but on some >>> devices (x1-85, possibly others), it seems to pass that barrier while >>> there are still things in the event completion FIFO waiting to be >>> written back to memory. >> >> Can we try to force-fault around here on other GPUs and perhaps >> limit this workaround? > > not sure what you mean by "force-fault"... I suppose 'reproduce' is what I meant > we could probably limit > this to certain GPUs, the only reason I didn't is (a) it should be > harmless when it is not needed, Do we have any realistic perf hits here? > and (b) I have no real good way to get > an exhaustive list of where it is needed. Maybe/hopefully it is only > x1-85, but idk. > > It does bring up an interesting question about preemption, though Yeah.. Do we know what windows does here? Konrad
On Tue, Sep 17, 2024 at 4:37 PM Konrad Dybcio <konradybcio@kernel.org> wrote: > > On 17.09.2024 5:30 PM, Rob Clark wrote: > > On Tue, Sep 17, 2024 at 6:47 AM Konrad Dybcio <konradybcio@kernel.org> wrote: > >> > >> On 13.09.2024 9:51 PM, Rob Clark wrote: > >>> From: Rob Clark <robdclark@chromium.org> > >>> > >>> The CP_SMMU_TABLE_UPDATE _should_ be waiting for idle, but on some > >>> devices (x1-85, possibly others), it seems to pass that barrier while > >>> there are still things in the event completion FIFO waiting to be > >>> written back to memory. > >> > >> Can we try to force-fault around here on other GPUs and perhaps > >> limit this workaround? > > > > not sure what you mean by "force-fault"... > > I suppose 'reproduce' is what I meant I haven't _noticed_ it yet.. if you want to try on devices you have, glmark2 seems to be good at reproducing.. I think the reason is combo of high fps (on x1-85 most scenes are north of 8k fps) so you get a lot of context switches btwn compositor and glmark2. Most scenes are just a clear plus single draw, and I guess the compositor is just doing a single draw/blit. A6xx can be two draws/blits deep in it's pipeline, a7xx can be four, which maybe exacerbates this. > > we could probably limit > > this to certain GPUs, the only reason I didn't is (a) it should be > > harmless when it is not needed, > > Do we have any realistic perf hits here? I don't think so, we can't switch ttbr0 while the gpu is still busy so what the sqe does for CP_SMMU_TABLE_UPDATE _should_ be equivalent. Maybe it amounts to some extra CP cycles and memory read, but I think that should be negligible given that the expensive thing is that we are stalling the gpu until it is idle. > > and (b) I have no real good way to get > > an exhaustive list of where it is needed. Maybe/hopefully it is only > > x1-85, but idk. > > > > It does bring up an interesting question about preemption, though > > Yeah.. The KMD does setup an xAMBLE to clear the perfcntrs on context switch. We could maybe piggy back on that, but I guess we'd have to patch in the fence value to wait for? > Do we know what windows does here? not sure, maybe akhil has some way to check. Whether a similar scenario comes up with windows probably depends on how the winsys works. If it dropped frames when rendering >vblank rate, you'd get fewer context switches. BR, -R > Konrad
On Fri, Sep 13, 2024 at 8:51 PM Rob Clark <robdclark@gmail.com> wrote: > > From: Rob Clark <robdclark@chromium.org> > > The CP_SMMU_TABLE_UPDATE _should_ be waiting for idle, but on some > devices (x1-85, possibly others), it seems to pass that barrier while > there are still things in the event completion FIFO waiting to be > written back to memory. > > Work around that by adding a fence wait before context switch. The > CP_EVENT_WRITE that writes the fence is the last write from a submit, > so seeing this value hit memory is a reliable indication that it is > safe to proceed with the context switch. > > Closes: https://gitlab.freedesktop.org/drm/msm/-/issues/63 > Signed-off-by: Rob Clark <robdclark@chromium.org> > --- > drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 14 +++++++++++--- > 1 file changed, 11 insertions(+), 3 deletions(-) > > diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c > index bcaec86ac67a..ba5b35502e6d 100644 > --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c > +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c > @@ -101,9 +101,10 @@ static void get_stats_counter(struct msm_ringbuffer *ring, u32 counter, > } > > static void a6xx_set_pagetable(struct a6xx_gpu *a6xx_gpu, > - struct msm_ringbuffer *ring, struct msm_file_private *ctx) > + struct msm_ringbuffer *ring, struct msm_gem_submit *submit) > { > bool sysprof = refcount_read(&a6xx_gpu->base.base.sysprof_active) > 1; > + struct msm_file_private *ctx = submit->queue->ctx; > struct adreno_gpu *adreno_gpu = &a6xx_gpu->base; > phys_addr_t ttbr; > u32 asid; > @@ -115,6 +116,13 @@ static void a6xx_set_pagetable(struct a6xx_gpu *a6xx_gpu, > if (msm_iommu_pagetable_params(ctx->aspace->mmu, &ttbr, &asid)) > return; > > + /* Wait for previous submit to complete before continuing: */ > + OUT_PKT7(ring, CP_WAIT_TIMESTAMP, 4); CP_WAIT_TIMESTAMP doesn't exist on a6xx, so this won't work there. I don't know if the bug exists on a6xx, but I'd be inclined to say it has always existed and we just never hit it because it requires some very specific timing conditions. We can make it work on a6xx by using CP_WAIT_REG_MEM and waiting for it to equal the exact value. Connor > + OUT_RING(ring, 0); > + OUT_RING(ring, lower_32_bits(rbmemptr(ring, fence))); > + OUT_RING(ring, upper_32_bits(rbmemptr(ring, fence))); > + OUT_RING(ring, submit->seqno - 1); > + > if (!sysprof) { > if (!adreno_is_a7xx(adreno_gpu)) { > /* Turn off protected mode to write to special registers */ > @@ -193,7 +201,7 @@ static void a6xx_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit) > struct msm_ringbuffer *ring = submit->ring; > unsigned int i, ibs = 0; > > - a6xx_set_pagetable(a6xx_gpu, ring, submit->queue->ctx); > + a6xx_set_pagetable(a6xx_gpu, ring, submit); > > get_stats_counter(ring, REG_A6XX_RBBM_PERFCTR_CP(0), > rbmemptr_stats(ring, index, cpcycles_start)); > @@ -283,7 +291,7 @@ static void a7xx_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit) > OUT_PKT7(ring, CP_THREAD_CONTROL, 1); > OUT_RING(ring, CP_THREAD_CONTROL_0_SYNC_THREADS | CP_SET_THREAD_BR); > > - a6xx_set_pagetable(a6xx_gpu, ring, submit->queue->ctx); > + a6xx_set_pagetable(a6xx_gpu, ring, submit); > > get_stats_counter(ring, REG_A7XX_RBBM_PERFCTR_CP(0), > rbmemptr_stats(ring, index, cpcycles_start)); > -- > 2.46.0 >
On Wed, Sep 18, 2024 at 9:51 AM Connor Abbott <cwabbott0@gmail.com> wrote: > > On Fri, Sep 13, 2024 at 8:51 PM Rob Clark <robdclark@gmail.com> wrote: > > > > From: Rob Clark <robdclark@chromium.org> > > > > The CP_SMMU_TABLE_UPDATE _should_ be waiting for idle, but on some > > devices (x1-85, possibly others), it seems to pass that barrier while > > there are still things in the event completion FIFO waiting to be > > written back to memory. > > > > Work around that by adding a fence wait before context switch. The > > CP_EVENT_WRITE that writes the fence is the last write from a submit, > > so seeing this value hit memory is a reliable indication that it is > > safe to proceed with the context switch. > > > > Closes: https://gitlab.freedesktop.org/drm/msm/-/issues/63 > > Signed-off-by: Rob Clark <robdclark@chromium.org> > > --- > > drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 14 +++++++++++--- > > 1 file changed, 11 insertions(+), 3 deletions(-) > > > > diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c > > index bcaec86ac67a..ba5b35502e6d 100644 > > --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c > > +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c > > @@ -101,9 +101,10 @@ static void get_stats_counter(struct msm_ringbuffer *ring, u32 counter, > > } > > > > static void a6xx_set_pagetable(struct a6xx_gpu *a6xx_gpu, > > - struct msm_ringbuffer *ring, struct msm_file_private *ctx) > > + struct msm_ringbuffer *ring, struct msm_gem_submit *submit) > > { > > bool sysprof = refcount_read(&a6xx_gpu->base.base.sysprof_active) > 1; > > + struct msm_file_private *ctx = submit->queue->ctx; > > struct adreno_gpu *adreno_gpu = &a6xx_gpu->base; > > phys_addr_t ttbr; > > u32 asid; > > @@ -115,6 +116,13 @@ static void a6xx_set_pagetable(struct a6xx_gpu *a6xx_gpu, > > if (msm_iommu_pagetable_params(ctx->aspace->mmu, &ttbr, &asid)) > > return; > > > > + /* Wait for previous submit to complete before continuing: */ > > + OUT_PKT7(ring, CP_WAIT_TIMESTAMP, 4); > > CP_WAIT_TIMESTAMP doesn't exist on a6xx, so this won't work there. I > don't know if the bug exists on a6xx, but I'd be inclined to say it > has always existed and we just never hit it because it requires some > very specific timing conditions. We can make it work on a6xx by using > CP_WAIT_REG_MEM and waiting for it to equal the exact value. I've been unable to reproduce this on a690, despite running at a similar fps (so similar rate of CP_SMMU_TABLE_UPDATEs). I guess I can't rule out that this is just _harder_ to hit on a6xx due to the shallower pipeline. It would be nice to get some data points on other a7xx, but I only have the one. I did attempt to come up with an igt stand-alone reproducer by just ping-ponging between contexts (with fences to force context switches) with 1000's of CP_EVENT_WRITE's, to no avail. I guess I'd need to actually setup a blit or draw to make the event-write asynchronous, but that would be a lot harder to do in igt. I guess for now I'll re-work this patch to only do the workaround on a7xx. And wire up the gallium preemption support so we can confirm whether this is also an issue for preemption. BR, -R > Connor > > > + OUT_RING(ring, 0); > > + OUT_RING(ring, lower_32_bits(rbmemptr(ring, fence))); > > + OUT_RING(ring, upper_32_bits(rbmemptr(ring, fence))); > > + OUT_RING(ring, submit->seqno - 1); > > + > > if (!sysprof) { > > if (!adreno_is_a7xx(adreno_gpu)) { > > /* Turn off protected mode to write to special registers */ > > @@ -193,7 +201,7 @@ static void a6xx_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit) > > struct msm_ringbuffer *ring = submit->ring; > > unsigned int i, ibs = 0; > > > > - a6xx_set_pagetable(a6xx_gpu, ring, submit->queue->ctx); > > + a6xx_set_pagetable(a6xx_gpu, ring, submit); > > > > get_stats_counter(ring, REG_A6XX_RBBM_PERFCTR_CP(0), > > rbmemptr_stats(ring, index, cpcycles_start)); > > @@ -283,7 +291,7 @@ static void a7xx_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit) > > OUT_PKT7(ring, CP_THREAD_CONTROL, 1); > > OUT_RING(ring, CP_THREAD_CONTROL_0_SYNC_THREADS | CP_SET_THREAD_BR); > > > > - a6xx_set_pagetable(a6xx_gpu, ring, submit->queue->ctx); > > + a6xx_set_pagetable(a6xx_gpu, ring, submit); > > > > get_stats_counter(ring, REG_A7XX_RBBM_PERFCTR_CP(0), > > rbmemptr_stats(ring, index, cpcycles_start)); > > -- > > 2.46.0 > >
On Fri, Sep 13, 2024 at 12:51:31PM -0700, Rob Clark wrote: > From: Rob Clark <robdclark@chromium.org> > > The CP_SMMU_TABLE_UPDATE _should_ be waiting for idle, but on some > devices (x1-85, possibly others), it seems to pass that barrier while > there are still things in the event completion FIFO waiting to be > written back to memory. > > Work around that by adding a fence wait before context switch. The > CP_EVENT_WRITE that writes the fence is the last write from a submit, > so seeing this value hit memory is a reliable indication that it is > safe to proceed with the context switch. > > Closes: https://gitlab.freedesktop.org/drm/msm/-/issues/63 > Signed-off-by: Rob Clark <robdclark@chromium.org> Reviewed-by: Akhil P Oommen <quic_akhilpo@quicinc.com> -Akhil > --- > drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 14 +++++++++++--- > 1 file changed, 11 insertions(+), 3 deletions(-) > > diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c > index bcaec86ac67a..ba5b35502e6d 100644 > --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c > +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c > @@ -101,9 +101,10 @@ static void get_stats_counter(struct msm_ringbuffer *ring, u32 counter, > } > > static void a6xx_set_pagetable(struct a6xx_gpu *a6xx_gpu, > - struct msm_ringbuffer *ring, struct msm_file_private *ctx) > + struct msm_ringbuffer *ring, struct msm_gem_submit *submit) > { > bool sysprof = refcount_read(&a6xx_gpu->base.base.sysprof_active) > 1; > + struct msm_file_private *ctx = submit->queue->ctx; > struct adreno_gpu *adreno_gpu = &a6xx_gpu->base; > phys_addr_t ttbr; > u32 asid; > @@ -115,6 +116,13 @@ static void a6xx_set_pagetable(struct a6xx_gpu *a6xx_gpu, > if (msm_iommu_pagetable_params(ctx->aspace->mmu, &ttbr, &asid)) > return; > > + /* Wait for previous submit to complete before continuing: */ > + OUT_PKT7(ring, CP_WAIT_TIMESTAMP, 4); > + OUT_RING(ring, 0); > + OUT_RING(ring, lower_32_bits(rbmemptr(ring, fence))); > + OUT_RING(ring, upper_32_bits(rbmemptr(ring, fence))); > + OUT_RING(ring, submit->seqno - 1); > + > if (!sysprof) { > if (!adreno_is_a7xx(adreno_gpu)) { > /* Turn off protected mode to write to special registers */ > @@ -193,7 +201,7 @@ static void a6xx_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit) > struct msm_ringbuffer *ring = submit->ring; > unsigned int i, ibs = 0; > > - a6xx_set_pagetable(a6xx_gpu, ring, submit->queue->ctx); > + a6xx_set_pagetable(a6xx_gpu, ring, submit); > > get_stats_counter(ring, REG_A6XX_RBBM_PERFCTR_CP(0), > rbmemptr_stats(ring, index, cpcycles_start)); > @@ -283,7 +291,7 @@ static void a7xx_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit) > OUT_PKT7(ring, CP_THREAD_CONTROL, 1); > OUT_RING(ring, CP_THREAD_CONTROL_0_SYNC_THREADS | CP_SET_THREAD_BR); > > - a6xx_set_pagetable(a6xx_gpu, ring, submit->queue->ctx); > + a6xx_set_pagetable(a6xx_gpu, ring, submit); > > get_stats_counter(ring, REG_A7XX_RBBM_PERFCTR_CP(0), > rbmemptr_stats(ring, index, cpcycles_start)); > -- > 2.46.0 >
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c index bcaec86ac67a..ba5b35502e6d 100644 --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c @@ -101,9 +101,10 @@ static void get_stats_counter(struct msm_ringbuffer *ring, u32 counter, } static void a6xx_set_pagetable(struct a6xx_gpu *a6xx_gpu, - struct msm_ringbuffer *ring, struct msm_file_private *ctx) + struct msm_ringbuffer *ring, struct msm_gem_submit *submit) { bool sysprof = refcount_read(&a6xx_gpu->base.base.sysprof_active) > 1; + struct msm_file_private *ctx = submit->queue->ctx; struct adreno_gpu *adreno_gpu = &a6xx_gpu->base; phys_addr_t ttbr; u32 asid; @@ -115,6 +116,13 @@ static void a6xx_set_pagetable(struct a6xx_gpu *a6xx_gpu, if (msm_iommu_pagetable_params(ctx->aspace->mmu, &ttbr, &asid)) return; + /* Wait for previous submit to complete before continuing: */ + OUT_PKT7(ring, CP_WAIT_TIMESTAMP, 4); + OUT_RING(ring, 0); + OUT_RING(ring, lower_32_bits(rbmemptr(ring, fence))); + OUT_RING(ring, upper_32_bits(rbmemptr(ring, fence))); + OUT_RING(ring, submit->seqno - 1); + if (!sysprof) { if (!adreno_is_a7xx(adreno_gpu)) { /* Turn off protected mode to write to special registers */ @@ -193,7 +201,7 @@ static void a6xx_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit) struct msm_ringbuffer *ring = submit->ring; unsigned int i, ibs = 0; - a6xx_set_pagetable(a6xx_gpu, ring, submit->queue->ctx); + a6xx_set_pagetable(a6xx_gpu, ring, submit); get_stats_counter(ring, REG_A6XX_RBBM_PERFCTR_CP(0), rbmemptr_stats(ring, index, cpcycles_start)); @@ -283,7 +291,7 @@ static void a7xx_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit) OUT_PKT7(ring, CP_THREAD_CONTROL, 1); OUT_RING(ring, CP_THREAD_CONTROL_0_SYNC_THREADS | CP_SET_THREAD_BR); - a6xx_set_pagetable(a6xx_gpu, ring, submit->queue->ctx); + a6xx_set_pagetable(a6xx_gpu, ring, submit); get_stats_counter(ring, REG_A7XX_RBBM_PERFCTR_CP(0), rbmemptr_stats(ring, index, cpcycles_start));