
[1/3] drm/panthor: Fix tiler OOM handling to allow incremental rendering

Message ID: 20240425071837.529039-2-boris.brezillon@collabora.com
State: New, archived
Series: drm/panthor: Collection of tiler heap related fixes

Commit Message

Boris Brezillon April 25, 2024, 7:18 a.m. UTC
From: Antonino Maniscalco <antonino.maniscalco@collabora.com>

If the kernel couldn't allocate memory because we reached the maximum
number of chunks but no render passes are in flight
(panthor_heap_grow() returning -ENOMEM), we should defer the OOM
handling to the FW by returning a NULL chunk. The FW will then call
the tiler OOM exception handler, which is supposed to implement
incremental rendering (execute an intermediate fragment job to flush
the pending primitives, release the tiler memory that was used to
store those primitives, and start over from where it stopped).

Fixes: de8548813824 ("drm/panthor: Add the scheduler logical block")
Signed-off-by: Antonino Maniscalco <antonino.maniscalco@collabora.com>
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panthor/panthor_sched.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)
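
The behaviour described in the commit message can be modelled outside the
kernel. The following is a minimal userspace sketch, not the driver code:
the enum, the function names and the three outcomes are hypothetical labels
introduced for illustration. Only the two conditions being compared
(tolerating -EBUSY alone versus tolerating -EBUSY and -ENOMEM) come from
the patch itself.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical outcome of handling one tiler OOM event; not a panthor type. */
enum oom_outcome {
	OOM_CHUNK_PROVIDED,	/* heap grew, the FW receives a new chunk */
	OOM_DEFER_TO_FW,	/* NULL chunk: the FW waits or runs its OOM handler */
	OOM_QUEUE_FATAL,	/* unexpected error: the queue is flagged as fatal */
};

/* Condition used before the patch: only -EBUSY is tolerated. */
static enum oom_outcome classify_before(int ret)
{
	if (!ret)
		return OOM_CHUNK_PROVIDED;
	return ret == -EBUSY ? OOM_DEFER_TO_FW : OOM_QUEUE_FATAL;
}

/* Condition used after the patch: -ENOMEM is tolerated as well. */
static enum oom_outcome classify_after(int ret)
{
	if (!ret)
		return OOM_CHUNK_PROVIDED;
	if (ret == -EBUSY || ret == -ENOMEM)
		return OOM_DEFER_TO_FW;
	return OOM_QUEUE_FATAL;
}

/* Small self-check showing the one case whose outcome changes. */
static void check_classification(void)
{
	assert(classify_before(-ENOMEM) == OOM_QUEUE_FATAL);
	assert(classify_after(-ENOMEM) == OOM_DEFER_TO_FW);
	assert(classify_after(-EINVAL) == OOM_QUEUE_FATAL);
}
```

In this model the only input whose outcome differs between the two
functions is -ENOMEM, which matches the intent stated above: a heap that
has hit its chunk limit with no render passes in flight should lead to
incremental rendering rather than a fatal queue.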

Comments

Steven Price April 25, 2024, 9:28 a.m. UTC | #1
On 25/04/2024 08:18, Boris Brezillon wrote:
> From: Antonino Maniscalco <antonino.maniscalco@collabora.com>
> 
> If the kernel couldn't allocate memory because we reached the maximum
> number of chunks but no render passes are in flight
> (panthor_heap_grow() returning -ENOMEM), we should defer the OOM
> handling to the FW by returning a NULL chunk. The FW will then call
> the tiler OOM exception handler, which is supposed to implement
> incremental rendering (execute an intermediate fragment job to flush
> the pending primitives, release the tiler memory that was used to
> store those primitives, and start over from where it stopped).
> 
> Fixes: de8548813824 ("drm/panthor: Add the scheduler logical block")
> Signed-off-by: Antonino Maniscalco <antonino.maniscalco@collabora.com>
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>

Reviewed-by: Steven Price <steven.price@arm.com>

Although I think the real issue here is that we haven't clearly defined
the return values from panthor_heap_grow - it's a bit weird to have two
different error codes for the same "try again later after incremental
rendering" result. But as a fix this seems most clear.

Steve

> ---
>  drivers/gpu/drm/panthor/panthor_sched.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> index b3a51a6de523..6de8c0c702cb 100644
> --- a/drivers/gpu/drm/panthor/panthor_sched.c
> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> @@ -1354,7 +1354,13 @@ static int group_process_tiler_oom(struct panthor_group *group, u32 cs_id)
>  					pending_frag_count, &new_chunk_va);
>  	}
>  
> -	if (ret && ret != -EBUSY) {
> +	/* If the kernel couldn't allocate memory because we reached the maximum
> +	 * number of chunks (EBUSY if we have render passes in flight, ENOMEM
> +	 * otherwise), we want to let the FW try to reclaim memory by waiting
> +	 * for fragment jobs to land or by executing the tiler OOM exception
> +	 * handler, which is supposed to implement incremental rendering.
> +	 */
> +	if (ret && ret != -EBUSY && ret != -ENOMEM) {
>  		drm_warn(&ptdev->base, "Failed to extend the tiler heap\n");
>  		group->fatal_queues |= BIT(cs_id);
>  		sched_queue_delayed_work(sched, tick, 0);

Boris Brezillon April 25, 2024, 9:45 a.m. UTC | #2
On Thu, 25 Apr 2024 10:28:49 +0100
Steven Price <steven.price@arm.com> wrote:

> On 25/04/2024 08:18, Boris Brezillon wrote:
> > From: Antonino Maniscalco <antonino.maniscalco@collabora.com>
> > 
> > If the kernel couldn't allocate memory because we reached the maximum
> > number of chunks but no render passes are in flight
> > (panthor_heap_grow() returning -ENOMEM), we should defer the OOM
> > handling to the FW by returning a NULL chunk. The FW will then call
> > the tiler OOM exception handler, which is supposed to implement
> > incremental rendering (execute an intermediate fragment job to flush
> > the pending primitives, release the tiler memory that was used to
> > store those primitives, and start over from where it stopped).
> > 
> > Fixes: de8548813824 ("drm/panthor: Add the scheduler logical block")
> > Signed-off-by: Antonino Maniscalco <antonino.maniscalco@collabora.com>
> > Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>  
> 
> Reviewed-by: Steven Price <steven.price@arm.com>
> 
> Although I think the real issue here is that we haven't clearly defined
> the return values from panthor_heap_grow - it's a bit weird to have two
> different error codes for the same "try again later after incremental
> rendering" result. But as a fix this seems most clear.

Yeah, I actually considered returning -EBUSY for the 'max_chunks
reached' situation, but then realized we would also want to trigger
incremental rendering for actual mem allocation failures (when
chunk_count < max_chunks) once the fail-able/non-blocking allocation
logic is implemented, and for this kind of failure it makes more sense
to return -ENOMEM, even though this implies checking against two values
instead of one.

I guess returning -ENOMEM instead of -EBUSY for the case where we have
render passes in-flight wouldn't be too awkward, as this can be seen as
the kernel refusing to allocate more memory.
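
To make the alternative floated in the last paragraph concrete, here is a
rough userspace sketch of a grow function that reports every "kernel will
not or cannot allocate" case as -ENOMEM, so the caller checks a single
value. This is an illustration of the idea only and is not what this patch
does: the struct, its fields and both function names are hypothetical, and
the real panthor_heap_grow() takes different arguments.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical inputs; the names are illustrative, not panthor fields. */
struct grow_request {
	unsigned int chunk_count;	/* chunks currently in the heap */
	unsigned int max_chunks;	/* upper bound on the number of chunks */
	bool passes_in_flight;		/* the case reported as -EBUSY today */
	bool alloc_failed;		/* a non-blocking allocation attempt failed */
};

/* Assumed model: one error code for every refusal or failure to allocate. */
static int grow_model_single_code(const struct grow_request *req)
{
	if (req->passes_in_flight)
		return -ENOMEM;	/* kernel refuses to allocate more memory */
	if (req->chunk_count >= req->max_chunks)
		return -ENOMEM;	/* chunk limit reached */
	if (req->alloc_failed)
		return -ENOMEM;	/* genuine allocation failure below the limit */
	return 0;
}

/* With a single code, the caller's "defer to the FW" test is one comparison. */
static bool caller_defers_to_fw(int ret)
{
	return ret == -ENOMEM;
}

/* Small self-check covering the success case and one refusal case. */
static void check_single_code_model(void)
{
	struct grow_request ok = { .chunk_count = 1, .max_chunks = 4 };
	struct grow_request full = { .chunk_count = 4, .max_chunks = 4 };

	assert(grow_model_single_code(&ok) == 0);
	assert(caller_defers_to_fw(grow_model_single_code(&full)));
}
```

The trade-off in this model is that the caller can no longer tell "wait
for fragment jobs to land" apart from "run the OOM exception handler" by
looking at the return value alone.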

> 
> Steve
> 
> > ---
> >  drivers/gpu/drm/panthor/panthor_sched.c | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> > index b3a51a6de523..6de8c0c702cb 100644
> > --- a/drivers/gpu/drm/panthor/panthor_sched.c
> > +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> > @@ -1354,7 +1354,13 @@ static int group_process_tiler_oom(struct panthor_group *group, u32 cs_id)
> >  					pending_frag_count, &new_chunk_va);
> >  	}
> >  
> > -	if (ret && ret != -EBUSY) {
> > +	/* If the kernel couldn't allocate memory because we reached the maximum
> > +	 * number of chunks (EBUSY if we have render passes in flight, ENOMEM
> > +	 * otherwise), we want to let the FW try to reclaim memory by waiting
> > +	 * for fragment jobs to land or by executing the tiler OOM exception
> > +	 * handler, which is supposed to implement incremental rendering.
> > +	 */
> > +	if (ret && ret != -EBUSY && ret != -ENOMEM) {
> >  		drm_warn(&ptdev->base, "Failed to extend the tiler heap\n");
> >  		group->fatal_queues |= BIT(cs_id);
> >  		sched_queue_delayed_work(sched, tick, 0);  
>

Patch

diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
index b3a51a6de523..6de8c0c702cb 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -1354,7 +1354,13 @@ static int group_process_tiler_oom(struct panthor_group *group, u32 cs_id)
 					pending_frag_count, &new_chunk_va);
 	}
 
-	if (ret && ret != -EBUSY) {
+	/* If the kernel couldn't allocate memory because we reached the maximum
+	 * number of chunks (EBUSY if we have render passes in flight, ENOMEM
+	 * otherwise), we want to let the FW try to reclaim memory by waiting
+	 * for fragment jobs to land or by executing the tiler OOM exception
+	 * handler, which is supposed to implement incremental rendering.
+	 */
+	if (ret && ret != -EBUSY && ret != -ENOMEM) {
 		drm_warn(&ptdev->base, "Failed to extend the tiler heap\n");
 		group->fatal_queues |= BIT(cs_id);
 		sched_queue_delayed_work(sched, tick, 0);