diff mbox series

[RFC,3/4] drm/ttm: Handle -EAGAIN in ttm_resource_alloc as -ENOSPC.

Message ID 20230503083500.645848-4-maarten.lankhorst@linux.intel.com (mailing list archive)
State New, archived
Headers show
Series Add support for DRM cgroup memory accounting. | expand

Commit Message

Maarten Lankhorst May 3, 2023, 8:34 a.m. UTC
This allows the drm cgroup controller to return no space is available..

XXX: This is a hopeless simplification that changes behavior, and
returns -ENOSPC even if we could evict ourselves from the current
cgroup.

Ideally, the eviction code becomes cgroup aware, and will force eviction
from the current cgroup or its parents.

Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
---
 drivers/gpu/drm/ttm/ttm_bo.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Thomas Hellström May 3, 2023, 9:11 a.m. UTC | #1
Hi, Maarten

On 5/3/23 10:34, Maarten Lankhorst wrote:
> This allows the drm cgroup controller to return no space is available..
>
> XXX: This is a hopeless simplification that changes behavior, and
> returns -ENOSPC even if we could evict ourselves from the current
> cgroup.
>
> Ideally, the eviction code becomes cgroup aware, and will force eviction
> from the current cgroup or its parents.
>
> Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>

Thinking of the shrinker analogy, do non-cgroup aware shrinkers just 
shrink blindly or do they reject shrinking like this patch when a cgroup 
limit is reached?

/Thomas


> ---
>   drivers/gpu/drm/ttm/ttm_bo.c | 4 +++-
>   1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> index bd5dae4d1624..e057d5d8f09a 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> @@ -731,6 +731,8 @@ static int ttm_bo_mem_force_space(struct ttm_buffer_object *bo,
>   		ret = ttm_resource_alloc(bo, place, mem);
>   		if (likely(!ret))
>   			break;
> +		if (ret == -EAGAIN)
> +			return -ENOSPC;
>   		if (unlikely(ret != -ENOSPC))
>   			return ret;
>   		ret = ttm_mem_evict_first(bdev, man, place, ctx,
> @@ -783,7 +785,7 @@ int ttm_bo_mem_space(struct ttm_buffer_object *bo,
>   
>   		type_found = true;
>   		ret = ttm_resource_alloc(bo, place, mem);
> -		if (ret == -ENOSPC)
> +		if (ret == -ENOSPC || ret == -EAGAIN)
>   			continue;
>   		if (unlikely(ret))
>   			goto error;
Maarten Lankhorst May 3, 2023, 9:36 a.m. UTC | #2
On 2023-05-03 11:11, Thomas Hellström wrote:
> Hi, Maarten
>
> On 5/3/23 10:34, Maarten Lankhorst wrote:
>> This allows the drm cgroup controller to return no space is available..
>>
>> XXX: This is a hopeless simplification that changes behavior, and
>> returns -ENOSPC even if we could evict ourselves from the current
>> cgroup.
>>
>> Ideally, the eviction code becomes cgroup aware, and will force eviction
>> from the current cgroup or its parents.
>>
>> Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>
> Thinking of the shrinker analogy, do non-cgroup aware shrinkers just 
> shrink blindly or do they reject shrinking like this patch when a 
> cgroup limit is reached?

When I made the cgroup controller return -ENOSPC I just hit an infinite 
loop since it sees enough memory is free and tries to allocate memory 
again. Hence the -EAGAIN handling here. It returns -ENOSPC, without the 
infinite looping.

I think there should be 2 code paths:

- OOM, generic case: Handle like we do now. No need for special cgroup 
handling needed right now.

Might change if we implement cgroup memory semantics. See the memory 
section of Documentation/admin-guide/cgroup-v2.rst

It could be useful regardless.

- OOM, cgroup limit reached: Check for each BO if it's valuable to evict to unblock the relevant limit.


              / cg1.0
root - cg1 --  cg1.1
    \         \ cg1.2
     \  cg2

If we hit the cg limit in cg1.0 for only cg1.0, it doesn't make sense to evict from any other cgroup.
If we hit the limit in cg1.0 for the entirety of cg1, it makes sense to evict from any of the cg1 nodes, but not from cg2.

This should be relatively straightforward to implement. We identify which cgroup hit a limit, and then let the shrinker
run only on that cgroup and its childs.

This could be simplified to the OOM generic case, for root/NULL cg.


~Maarten
diff mbox series

Patch

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index bd5dae4d1624..e057d5d8f09a 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -731,6 +731,8 @@  static int ttm_bo_mem_force_space(struct ttm_buffer_object *bo,
 		ret = ttm_resource_alloc(bo, place, mem);
 		if (likely(!ret))
 			break;
+		if (ret == -EAGAIN)
+			return -ENOSPC;
 		if (unlikely(ret != -ENOSPC))
 			return ret;
 		ret = ttm_mem_evict_first(bdev, man, place, ctx,
@@ -783,7 +785,7 @@  int ttm_bo_mem_space(struct ttm_buffer_object *bo,
 
 		type_found = true;
 		ret = ttm_resource_alloc(bo, place, mem);
-		if (ret == -ENOSPC)
+		if (ret == -ENOSPC || ret == -EAGAIN)
 			continue;
 		if (unlikely(ret))
 			goto error;