
[02/10] drm/etnaviv: mmuv2: don't map zero page

Message ID 20181219144546.28224-3-l.stach@pengutronix.de (mailing list archive)
State New, archived
Series per-process address spaces for MMUv2

Commit Message

Lucas Stach Dec. 19, 2018, 2:45 p.m. UTC
Keep the page at address 0 as faulting to catch any potential state
setup issues early.

Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
---
 drivers/gpu/drm/etnaviv/etnaviv_iommu_v2.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Guido Günther Dec. 30, 2018, 3:49 p.m. UTC | #1
Hi Lucas,
On Wed, Dec 19, 2018 at 03:45:38PM +0100, Lucas Stach wrote:
> Keep the page at address 0 as faulting to catch any potential state
> setup issues early.

This is a nice idea! But applying this and making mesa hit that page
leads to the process hanging in D state over here on GC7000:

# [  242.726192] INFO: task kworker/u8:2:37 blocked for more than 120 seconds.
[  242.733010]       Not tainted 4.18.0-00129-gce2b21074b41 #504
[  242.738795] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  242.746638] kworker/u8:2    D    0    37      2 0x00000028
[  242.752144] Workqueue: events_unbound commit_work
[  242.756860] Call trace:
[  242.759318]  __switch_to+0x94/0xd0
[  242.762741]  __schedule+0x1c0/0x6b8
[  242.766239]  schedule+0x40/0xa8
[  242.769380]  schedule_timeout+0x2f0/0x428
[  242.773410]  dma_fence_default_wait+0x1cc/0x2b8
[  242.777951]  dma_fence_wait_timeout+0x44/0x1b0
[  242.782403]  drm_atomic_helper_wait_for_fences+0x48/0x108
[  242.787819]  commit_tail+0x30/0x80
[  242.791229]  commit_work+0x20/0x30
[  242.794642]  process_one_work+0x1ec/0x458
[  242.798659]  worker_thread+0x48/0x430
[  242.802331]  kthread+0x130/0x138
[  242.805557]  ret_from_fork+0x10/0x1c

This is in dmesg showing that we hit the first page:

    [   65.907388] etnaviv-gpu 38000000.gpu: MMU fault status 0x00000002
    [   65.913497] etnaviv-gpu 38000000.gpu: MMU 0 fault addr 0x00000e40

Without that patch it's sampling random data from that page but does not hang.

Cheers,
 -- Guido

> 
> Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
> ---
>  drivers/gpu/drm/etnaviv/etnaviv_iommu_v2.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_iommu_v2.c b/drivers/gpu/drm/etnaviv/etnaviv_iommu_v2.c
> index f1c88d8ad5ba..f794e04be9e6 100644
> --- a/drivers/gpu/drm/etnaviv/etnaviv_iommu_v2.c
> +++ b/drivers/gpu/drm/etnaviv/etnaviv_iommu_v2.c
> @@ -320,8 +320,8 @@ etnaviv_iommuv2_domain_alloc(struct etnaviv_gpu *gpu)
>  	domain = &etnaviv_domain->base;
>  
>  	domain->dev = gpu->dev;
> -	domain->base = 0;
> -	domain->size = (u64)SZ_1G * 4;
> +	domain->base = SZ_4K;
> +	domain->size = (u64)SZ_1G * 4 - SZ_4K;
>  	domain->ops = &etnaviv_iommuv2_ops;
>  
>  	ret = etnaviv_iommuv2_init(etnaviv_domain);
> -- 
> 2.19.1
> 
> _______________________________________________
> etnaviv mailing list
> etnaviv@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/etnaviv
Lucas Stach Jan. 7, 2019, 8:50 a.m. UTC | #2
Hi Guido,

On Sunday, 30.12.2018, at 16:49 +0100, Guido Günther wrote:
> Hi Lucas,
> On Wed, Dec 19, 2018 at 03:45:38PM +0100, Lucas Stach wrote:
> > Keep the page at address 0 as faulting to catch any potential state
> > setup issues early.
> 
> This is a nice idea! But applying this and making mesa hit that page
> leads to the process hanging in D state over here on GC7000:
> 
> # [  242.726192] INFO: task kworker/u8:2:37 blocked for more than 120 seconds.
> [  242.733010]       Not tainted 4.18.0-00129-gce2b21074b41 #504
> [  242.738795] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  242.746638] kworker/u8:2    D    0    37      2 0x00000028
> [  242.752144] Workqueue: events_unbound commit_work
> [  242.756860] Call trace:
> [  242.759318]  __switch_to+0x94/0xd0
> [  242.762741]  __schedule+0x1c0/0x6b8
> [  242.766239]  schedule+0x40/0xa8
> [  242.769380]  schedule_timeout+0x2f0/0x428
> [  242.773410]  dma_fence_default_wait+0x1cc/0x2b8
> [  242.777951]  dma_fence_wait_timeout+0x44/0x1b0
> [  242.782403]  drm_atomic_helper_wait_for_fences+0x48/0x108
> [  242.787819]  commit_tail+0x30/0x80
> [  242.791229]  commit_work+0x20/0x30
> [  242.794642]  process_one_work+0x1ec/0x458
> [  242.798659]  worker_thread+0x48/0x430
> [  242.802331]  kthread+0x130/0x138
> [  242.805557]  ret_from_fork+0x10/0x1c
> 
> This is in dmesg showing that we hit the first page:
> 
>     [   65.907388] etnaviv-gpu 38000000.gpu: MMU fault status 0x00000002
>     [   65.913497] etnaviv-gpu 38000000.gpu: MMU 0 fault addr 0x00000e40
> 
> Without that patch it's sampling random data from that page but does not hang.

GPU hangs after an MMU fault are expected. More accurately, we
actively request the GPU to stop by setting the exception bit in the
page table.

A hanging GPU should trigger the scheduler timeout handler, which then
makes sure to get the GPU back into a working state. So if things don't
progress after the fault for you, either the timeout handler is buggy on
GC7000 or the fence signaling is broken somehow. I'll take a look at
this.

Regards,
Lucas

> Cheers,
>  -- Guido
> 
> > 
> > Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
> > ---
> >  drivers/gpu/drm/etnaviv/etnaviv_iommu_v2.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/etnaviv/etnaviv_iommu_v2.c b/drivers/gpu/drm/etnaviv/etnaviv_iommu_v2.c
> > index f1c88d8ad5ba..f794e04be9e6 100644
> > --- a/drivers/gpu/drm/etnaviv/etnaviv_iommu_v2.c
> > +++ b/drivers/gpu/drm/etnaviv/etnaviv_iommu_v2.c
> > @@ -320,8 +320,8 @@ etnaviv_iommuv2_domain_alloc(struct etnaviv_gpu *gpu)
> >  	domain = &etnaviv_domain->base;
> >  
> >  	domain->dev = gpu->dev;
> > -	domain->base = 0;
> > -	domain->size = (u64)SZ_1G * 4;
> > +	domain->base = SZ_4K;
> > +	domain->size = (u64)SZ_1G * 4 - SZ_4K;
> >  	domain->ops = &etnaviv_iommuv2_ops;
> >  
> >  	ret = etnaviv_iommuv2_init(etnaviv_domain);
> > -- 
> > 2.19.1
> > 
Guido Günther Jan. 7, 2019, 9:13 a.m. UTC | #3
Hi,
On Mon, Jan 07, 2019 at 09:50:52AM +0100, Lucas Stach wrote:
> Hi Guido,
> 
> On Sunday, 30.12.2018, at 16:49 +0100, Guido Günther wrote:
> > Hi Lucas,
> > On Wed, Dec 19, 2018 at 03:45:38PM +0100, Lucas Stach wrote:
> > > Keep the page at address 0 as faulting to catch any potential state
> > > setup issues early.
> > 
> > This is a nice idea! But applying this and making mesa hit that page
> > leads to the process hanging in D state over here on GC7000:
> > 
> > # [  242.726192] INFO: task kworker/u8:2:37 blocked for more than 120 seconds.
> > [  242.733010]       Not tainted 4.18.0-00129-gce2b21074b41 #504
> > [  242.738795] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [  242.746638] kworker/u8:2    D    0    37      2 0x00000028
> > [  242.752144] Workqueue: events_unbound commit_work
> > [  242.756860] Call trace:
> > [  242.759318]  __switch_to+0x94/0xd0
> > [  242.762741]  __schedule+0x1c0/0x6b8
> > [  242.766239]  schedule+0x40/0xa8
> > [  242.769380]  schedule_timeout+0x2f0/0x428
> > [  242.773410]  dma_fence_default_wait+0x1cc/0x2b8
> > [  242.777951]  dma_fence_wait_timeout+0x44/0x1b0
> > [  242.782403]  drm_atomic_helper_wait_for_fences+0x48/0x108
> > [  242.787819]  commit_tail+0x30/0x80
> > [  242.791229]  commit_work+0x20/0x30
> > [  242.794642]  process_one_work+0x1ec/0x458
> > [  242.798659]  worker_thread+0x48/0x430
> > [  242.802331]  kthread+0x130/0x138
> > [  242.805557]  ret_from_fork+0x10/0x1c
> > 
> > This is in dmesg showing that we hit the first page:
> > 
> >     [   65.907388] etnaviv-gpu 38000000.gpu: MMU fault status 0x00000002
> >     [   65.913497] etnaviv-gpu 38000000.gpu: MMU 0 fault addr 0x00000e40
> > 
> > Without that patch it's sampling random data from that page but does not hang.
> 
> GPU hangs after a MMU fault are expected or more accurately, we
> actively request the GPU to stop by setting the exception bit in the
> page table.

Yeah. I put that in to show that this is the cause of the trouble above.

> 
> A hanging GPU should trigger the scheduler timeout handler, which then
> makes sure to get the GPU back into a working state. So if things don't
> progress after the fault for you either the timeout handler is buggy on
> GC7000, or the fence signaling is broken somehow. I'll take a look at
> this.

This isn't based on a current linux-next tree yet, so if you're not seeing
this, let me forward-port our stuff to that and report back again.

Cheers,
 -- Guido
Lucas Stach Jan. 7, 2019, 3:02 p.m. UTC | #4
On Monday, 07.01.2019, at 10:13 +0100, Guido Günther wrote:
> Hi,
> On Mon, Jan 07, 2019 at 09:50:52AM +0100, Lucas Stach wrote:
> > Hi Guido,
> > 
> > On Sunday, 30.12.2018, at 16:49 +0100, Guido Günther wrote:
> > > Hi Lucas,
> > > On Wed, Dec 19, 2018 at 03:45:38PM +0100, Lucas Stach wrote:
> > > > Keep the page at address 0 as faulting to catch any potential state
> > > > setup issues early.
> > > 
> > > This is a nice idea! But applying this and making mesa hit that page
> > > leads to the process hanging in D state over here on GC7000:
> > > 
> > > # [  242.726192] INFO: task kworker/u8:2:37 blocked for more than 120 seconds.
> > > [  242.733010]       Not tainted 4.18.0-00129-gce2b21074b41 #504
> > > [  242.738795] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > [  242.746638] kworker/u8:2    D    0    37      2 0x00000028
> > > [  242.752144] Workqueue: events_unbound commit_work
> > > [  242.756860] Call trace:
> > > [  242.759318]  __switch_to+0x94/0xd0
> > > [  242.762741]  __schedule+0x1c0/0x6b8
> > > [  242.766239]  schedule+0x40/0xa8
> > > [  242.769380]  schedule_timeout+0x2f0/0x428
> > > [  242.773410]  dma_fence_default_wait+0x1cc/0x2b8
> > > [  242.777951]  dma_fence_wait_timeout+0x44/0x1b0
> > > [  242.782403]  drm_atomic_helper_wait_for_fences+0x48/0x108
> > > [  242.787819]  commit_tail+0x30/0x80
> > > [  242.791229]  commit_work+0x20/0x30
> > > [  242.794642]  process_one_work+0x1ec/0x458
> > > [  242.798659]  worker_thread+0x48/0x430
> > > [  242.802331]  kthread+0x130/0x138
> > > [  242.805557]  ret_from_fork+0x10/0x1c
> > > 
> > > This is in dmesg showing that we hit the first page:
> > > 
> > >     [   65.907388] etnaviv-gpu 38000000.gpu: MMU fault status 0x00000002
> > >     [   65.913497] etnaviv-gpu 38000000.gpu: MMU 0 fault addr 0x00000e40
> > > 
> > > Without that patch it's sampling random data from that page but does not hang.
> > 
> > GPU hangs after a MMU fault are expected or more accurately, we
> > actively request the GPU to stop by setting the exception bit in the
> > page table.
> 
> Yeah. I put that in to show that this the cause for the trouble above.
> 
> > 
> > A hanging GPU should trigger the scheduler timeout handler, which then
> > makes sure to get the GPU back into a working state. So if things don't
> > progress after the fault for you either the timeout handler is buggy on
> > GC7000, or the fence signaling is broken somehow. I'll take a look at
> > this.
> 
> This isn't a top notch linux-next based tree yet so if you're not seeing this
> let me forward port our stuff to that and report back again.

I've certainly seen the timeout handler working on GC7000, but with the
GC7000 support being relatively lightly tested right now, I wouldn't
bet on us handling all corner cases correctly.

If this is an issue on a recent kernel, I would certainly love to learn
what's going wrong.

Regards,
Lucas
Christian Gmeiner Feb. 1, 2019, 7:57 a.m. UTC | #5
On Wed, Dec 19, 2018 at 15:45, Lucas Stach <l.stach@pengutronix.de> wrote:
>
> Keep the page at address 0 as faulting to catch any potential state
> setup issues early.
>
> Signed-off-by: Lucas Stach <l.stach@pengutronix.de>

I like this idea, but I am unsure about Guido's GC7000 problem.

Reviewed-by: Christian Gmeiner <christian.gmeiner@gmail.com>

> ---
>  drivers/gpu/drm/etnaviv/etnaviv_iommu_v2.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_iommu_v2.c b/drivers/gpu/drm/etnaviv/etnaviv_iommu_v2.c
> index f1c88d8ad5ba..f794e04be9e6 100644
> --- a/drivers/gpu/drm/etnaviv/etnaviv_iommu_v2.c
> +++ b/drivers/gpu/drm/etnaviv/etnaviv_iommu_v2.c
> @@ -320,8 +320,8 @@ etnaviv_iommuv2_domain_alloc(struct etnaviv_gpu *gpu)
>         domain = &etnaviv_domain->base;
>
>         domain->dev = gpu->dev;
> -       domain->base = 0;
> -       domain->size = (u64)SZ_1G * 4;
> +       domain->base = SZ_4K;
> +       domain->size = (u64)SZ_1G * 4 - SZ_4K;
>         domain->ops = &etnaviv_iommuv2_ops;
>
>         ret = etnaviv_iommuv2_init(etnaviv_domain);
> --
> 2.19.1
>
Guido Günther April 2, 2019, 11:38 a.m. UTC | #6
Hi,
On Mon, Jan 07, 2019 at 04:02:33PM +0100, Lucas Stach wrote:
[..snip..]
> I've certainly seen the timeout handler working on GC7000, but with the
> GC7000 support being relatively lightly tested right now, I wouldn't
> bet on us handling all corner cases correctly.
> 
> If this is an issue on a recent kernel, I would certainly love to learn
> what's going wrong.

I've brought my drm more in line with 5.x and it doesn't seem to hang
anymore.
Cheers,
 -- Guido

> 
> Regards,
> Lucas
Guido Günther April 2, 2019, 11:39 a.m. UTC | #7
Hi,
On Wed, Dec 19, 2018 at 03:45:38PM +0100, Lucas Stach wrote:
> Keep the page at address 0 as faulting to catch any potential state
> setup issues early.
> 
> Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
> ---
>  drivers/gpu/drm/etnaviv/etnaviv_iommu_v2.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_iommu_v2.c b/drivers/gpu/drm/etnaviv/etnaviv_iommu_v2.c
> index f1c88d8ad5ba..f794e04be9e6 100644
> --- a/drivers/gpu/drm/etnaviv/etnaviv_iommu_v2.c
> +++ b/drivers/gpu/drm/etnaviv/etnaviv_iommu_v2.c
> @@ -320,8 +320,8 @@ etnaviv_iommuv2_domain_alloc(struct etnaviv_gpu *gpu)
>  	domain = &etnaviv_domain->base;
>  
>  	domain->dev = gpu->dev;
> -	domain->base = 0;
> -	domain->size = (u64)SZ_1G * 4;
> +	domain->base = SZ_4K;
> +	domain->size = (u64)SZ_1G * 4 - SZ_4K;
>  	domain->ops = &etnaviv_iommuv2_ops;
>  
>  	ret = etnaviv_iommuv2_init(etnaviv_domain);
> -- 

Reviewed-by: Guido Günther <agx@sigxcpu.org>

Cheers and sorry for the extreme delay,
 -- Guido

Patch

diff --git a/drivers/gpu/drm/etnaviv/etnaviv_iommu_v2.c b/drivers/gpu/drm/etnaviv/etnaviv_iommu_v2.c
index f1c88d8ad5ba..f794e04be9e6 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_iommu_v2.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_iommu_v2.c
@@ -320,8 +320,8 @@  etnaviv_iommuv2_domain_alloc(struct etnaviv_gpu *gpu)
 	domain = &etnaviv_domain->base;
 
 	domain->dev = gpu->dev;
-	domain->base = 0;
-	domain->size = (u64)SZ_1G * 4;
+	domain->base = SZ_4K;
+	domain->size = (u64)SZ_1G * 4 - SZ_4K;
 	domain->ops = &etnaviv_iommuv2_ops;
 
 	ret = etnaviv_iommuv2_init(etnaviv_domain);