
[RFC] drm/ttm: fix scheduling balance

Message ID 20180125095932.30860-1-david1.zhou@amd.com (mailing list archive)
State New, archived

Commit Message

Chunming Zhou Jan. 25, 2018, 9:59 a.m. UTC
There is a scheduling balance issue around get_node, for example:
a. Process A allocates all of the memory and uses it for a submission.
b. Process B tries to allocate memory and has to wait in eviction for a process A BO to become idle.
c. Process A completes its job and process B's eviction frees process A's BO node,
but in the meantime process C comes in to allocate a BO, gets a node directly, and submits its work,
so process B now waits for the process C BO to become idle.
d. Repeating the above steps, process B can be delayed much longer.

Add a mutex to guarantee the allocation order within the same domain. There is, however, a possibility that
visible VRAM gets evicted to invisible VRAM; the tricky part is that both are handled by the same domain manager, so that case needs special handling.

Change-Id: I260e8eb704f7b4788b071d3f641f21b242912256
Signed-off-by: Chunming Zhou <david1.zhou@amd.com>
---
 drivers/gpu/drm/ttm/ttm_bo.c    | 16 +++++++++++++++-
 include/drm/ttm/ttm_bo_api.h    |  2 ++
 include/drm/ttm/ttm_bo_driver.h |  1 +
 3 files changed, 18 insertions(+), 1 deletion(-)
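
For reference, the "special handling" mentioned above is what the new
ctx->man field in the diff below implements: when the busy path has to
force eviction, busy_ctx.man records which manager's node_mutex is
already held, so that an eviction re-entering ttm_bo_mem_space() for the
same manager (e.g. visible VRAM being evicted to invisible VRAM, which
share one manager) skips the lock instead of deadlocking on it. The
guard pattern around get_node, taken from the diff below, is:

	/* Only take the per-domain lock if this context does not
	 * already hold it (recorded in ctx->man by the busy path). */
	if (ctx->man != man)
		mutex_lock(&man->node_mutex);
	ret = (*man->func->get_node)(man, bo, place, mem);
	if (ctx->man != man)
		mutex_unlock(&man->node_mutex);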

Comments

Christian König Jan. 25, 2018, 10:07 a.m. UTC | #1
On 01/25/2018 10:59 AM, Chunming Zhou wrote:
> There is a scheduling balance issue around get_node, for example:
> a. Process A allocates all of the memory and uses it for a submission.
> b. Process B tries to allocate memory and has to wait in eviction for a process A BO to become idle.
> c. Process A completes its job and process B's eviction frees process A's BO node,
> but in the meantime process C comes in to allocate a BO, gets a node directly, and submits its work,
> so process B now waits for the process C BO to become idle.
> d. Repeating the above steps, process B can be delayed much longer.

NAK, this is intended behavior, please don't change it. If you start to
use a per-domain mutex like this, no concurrent allocation is possible
any more.

In other words, when process B needs a lot of memory and blocks waiting
for that memory to be evicted, it is intended that process C, which
doesn't need as much memory, can jump in and execute before B.

> Add a mutex to guarantee the allocation order within the same domain. There is, however, a possibility that
> visible VRAM gets evicted to invisible VRAM; the tricky part is that both are handled by the same domain manager, so that case needs special handling.

To avoid lock inversion you would need a global mutex here; otherwise
you can run into the following issue:

Eviction: the VRAM domain mutex is locked first, then the GTT mutex.
Swapping in: the GTT mutex is locked first, then the VRAM mutex.
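
As a rough sketch only (the helper below is hypothetical, it exists
neither in TTM nor in this patch), a single global mutex makes both
paths agree on the lock order by construction:

static DEFINE_MUTEX(ttm_global_node_mutex);

static int ttm_get_node_serialized(struct ttm_mem_type_manager *man,
				   struct ttm_buffer_object *bo,
				   const struct ttm_place *place,
				   struct ttm_mem_reg *mem)
{
	int ret;

	/* One lock shared by all domains: eviction (VRAM then GTT) and
	 * swap-in (GTT then VRAM) can no longer take two per-domain
	 * mutexes in opposite order. */
	mutex_lock(&ttm_global_node_mutex);
	ret = (*man->func->get_node)(man, bo, place, mem);
	mutex_unlock(&ttm_global_node_mutex);

	return ret;
}

The obvious downside is that this serializes allocations across all
domains, which is exactly the loss of concurrency objected to above.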

Regards,
Christian.

Thomas Hellström (VMware) Jan. 25, 2018, 2:57 p.m. UTC | #2
On 01/25/2018 10:59 AM, Chunming Zhou wrote:
> There is a scheduling balance issue around get_node, for example:
> a. Process A allocates all of the memory and uses it for a submission.
> b. Process B tries to allocate memory and has to wait in eviction for a process A BO to become idle.
> c. Process A completes its job and process B's eviction frees process A's BO node,
> but in the meantime process C comes in to allocate a BO, gets a node directly, and submits its work,
> so process B now waits for the process C BO to become idle.
> d. Repeating the above steps, process B can be delayed much longer.
>
> Add a mutex to guarantee the allocation order within the same domain. There is, however, a possibility that
> visible VRAM gets evicted to invisible VRAM; the tricky part is that both are handled by the same domain manager, so that case needs special handling.
>
> Change-Id: I260e8eb704f7b4788b071d3f641f21b242912256
> Signed-off-by: Chunming Zhou <david1.zhou@amd.com>

I think this is a good approach, however there are two things that IMO
need fixing.

1) You must take the mutex interruptibly if that is requested. It's
important for fast delivery of Xserver mouse signals, and also to avoid
unkillable processes on GPU hang. (A rough sketch follows after this
list.)
2) How do we resolve contradictory eviction order between buffers? You'd
probably get lockdep warnings if that should happen. Now I'm not sure
there are drivers that do this, and it would probably be a bug if they
did. In principle you'd either have to establish a manager priority and
document that drivers have to honor it, or use ww_mutexes.
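
As an illustration of point 1), here is a minimal, purely hypothetical
sketch of an interruptible lock helper (the function name is made up;
ctx->interruptible is the existing TTM flag):

static int ttm_man_lock_node_mutex(struct ttm_mem_type_manager *man,
				   struct ttm_operation_ctx *ctx)
{
	/* Honor ctx->interruptible so a pending signal (e.g. from the
	 * Xserver, or a kill on GPU hang) can abort the wait. */
	if (ctx->interruptible)
		return mutex_lock_interruptible(&man->node_mutex);

	mutex_lock(&man->node_mutex);
	return 0;
}

Callers would then have to propagate the error return (-EINTR) instead
of assuming the lock was taken.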

/Thomas

Thomas Hellström (VMware) Jan. 25, 2018, 4:47 p.m. UTC | #3
On 01/25/2018 03:57 PM, Thomas Hellstrom wrote:
> On 01/25/2018 10:59 AM, Chunming Zhou wrote:
>> There is a scheduling balance issue around get_node, for example:
>> a. Process A allocates all of the memory and uses it for a submission.
>> b. Process B tries to allocate memory and has to wait in eviction for a
>> process A BO to become idle.
>> c. Process A completes its job and process B's eviction frees process
>> A's BO node,
>> but in the meantime process C comes in to allocate a BO, gets a node
>> directly, and submits its work,
>> so process B now waits for the process C BO to become idle.
>> d. Repeating the above steps, process B can be delayed much longer.
>>
>> Add a mutex to guarantee the allocation order within the same domain.
>> There is, however, a possibility that
>> visible VRAM gets evicted to invisible VRAM; the tricky part is that
>> both are handled by the same domain manager, so that case needs special
>> handling.
>>
>> Change-Id: I260e8eb704f7b4788b071d3f641f21b242912256
>> Signed-off-by: Chunming Zhou <david1.zhou@amd.com>
>
> I think this is a good approach, however there are two things that IMO 
> need fixing. [...]

Thinking a bit more about this, the end result would be that typical "C" 
processes would get an unfair amount of GPU scheduling.
Isn't it actually a scheduler's task outside of TTM to mitigate this?

Further, TTM has had a design principle of avoiding holding locks while
waiting for the GPU, with the exception of buffer object reservations.
I think this would be the first violator, but a fairly harmless one.

I can see the use for it though. It would also allow scanning the LRU 
lists for a suitable set of buffer objects to evict, rather than 
evicting in strict LRU order...
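
Purely as an illustration of that idea (the helpers below are made-up
names, nothing like this exists in TTM, and only the highest-priority
LRU list is shown), such a scan could look roughly like:

static void pick_eviction_set(struct ttm_mem_type_manager *man,
			      unsigned long pages_needed)
{
	struct ttm_buffer_object *bo;

	/* Walk the LRU and collect a set of evictable BOs whose combined
	 * size covers the request, rather than evicting strictly from
	 * the head of the list. */
	list_for_each_entry(bo, &man->lru[0], lru) {
		if (!bo_is_evictable(bo))	/* hypothetical check */
			continue;

		mark_for_eviction(bo);		/* hypothetical */
		if (bo->mem.num_pages >= pages_needed)
			break;
		pages_needed -= bo->mem.num_pages;
	}
}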

/Thomas
Christian König Jan. 25, 2018, 5:30 p.m. UTC | #4
On 01/25/2018 05:47 PM, Thomas Hellstrom wrote:
> On 01/25/2018 03:57 PM, Thomas Hellstrom wrote:
>> On 01/25/2018 10:59 AM, Chunming Zhou wrote:
>>> There is a scheduling balance issue around get_node, for example:
>>> a. Process A allocates all of the memory and uses it for a submission.
>>> b. Process B tries to allocate memory and has to wait in eviction for
>>> a process A BO to become idle.
>>> c. Process A completes its job and process B's eviction frees process
>>> A's BO node,
>>> but in the meantime process C comes in to allocate a BO, gets a node
>>> directly, and submits its work,
>>> so process B now waits for the process C BO to become idle.
>>> d. Repeating the above steps, process B can be delayed much longer.
>>>
>>> Add a mutex to guarantee the allocation order within the same domain.
>>> There is, however, a possibility that
>>> visible VRAM gets evicted to invisible VRAM; the tricky part is that
>>> both are handled by the same domain manager, so that case needs
>>> special handling.
>>>
>>> Change-Id: I260e8eb704f7b4788b071d3f641f21b242912256
>>> Signed-off-by: Chunming Zhou <david1.zhou@amd.com>
>>
>> I think this is a good approach, however there are two things that 
>> IMO need fixing. [...]
>
> Thinking a bit more about this, the end result would be that typical 
> "C" processes would get an unfair amount of GPU scheduling.
> Isn't it actually a scheduler's task outside of TTM to mitigate this?

Yes, exactly the reason why I rejected this. I actually considered
moving the whole eviction to a background work item.
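
A very rough sketch of what that could look like (entirely
hypothetical, nothing like this was implemented):

struct ttm_evict_work {
	struct work_struct work;
	struct ttm_mem_type_manager *man;
};

static void ttm_evict_worker(struct work_struct *work)
{
	struct ttm_evict_work *ew =
		container_of(work, struct ttm_evict_work, work);

	/* Evict BOs from ew->man in the background until the queued
	 * allocation requests can be satisfied, then wake the waiters.
	 * Allocators would queue this work and sleep on a completion
	 * instead of evicting synchronously themselves. */
	(void)ew;
}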

> Further, TTM has had a design principle of avoiding holding locks while
> waiting for the GPU, with the exception of buffer object reservations.
> I think this would be the first violator, but a fairly harmless one.

At least amdgpu normally doesn't block for any GPU operation to finish
(with a few exceptions), but yes, I see the problem as well.

>
> I can see the use for it though. It would also allow scanning the LRU 
> lists for a suitable set of buffer objects to evict, rather than 
> evicting in strict LRU order...

At least for amdgpu that won't be possible even then, because we don't
tell TTM everything about buffer placement, e.g. BOs are not necessarily
composed of contiguous allocations.

Christian.

>
> /Thomas
>
Thomas Hellström (VMware) Jan. 25, 2018, 5:39 p.m. UTC | #5
Hi,

On 01/25/2018 06:30 PM, Christian König wrote:
> On 01/25/2018 05:47 PM, Thomas Hellstrom wrote:
>> On 01/25/2018 03:57 PM, Thomas Hellstrom wrote:
>>> On 01/25/2018 10:59 AM, Chunming Zhou wrote:
>>>> There is a scheduling balance issue around get_node, for example:
>>>> a. Process A allocates all of the memory and uses it for a submission.
>>>> b. Process B tries to allocate memory and has to wait in eviction for
>>>> a process A BO to become idle.
>>>> c. Process A completes its job and process B's eviction frees process
>>>> A's BO node,
>>>> but in the meantime process C comes in to allocate a BO, gets a node
>>>> directly, and submits its work,
>>>> so process B now waits for the process C BO to become idle.
>>>> d. Repeating the above steps, process B can be delayed much longer.
>>>>
>>>> Add a mutex to guarantee the allocation order within the same domain.
>>>> There is, however, a possibility that
>>>> visible VRAM gets evicted to invisible VRAM; the tricky part is that
>>>> both are handled by the same domain manager, so that case needs
>>>> special handling.
>>>>
>>>> Change-Id: I260e8eb704f7b4788b071d3f641f21b242912256
>>>> Signed-off-by: Chunming Zhou <david1.zhou@amd.com>
>>>
>>> I think this is a good approach, however there are two things that 
>>> IMO need fixing. [...]
>>
>> Thinking a bit more about this, the end result would be that typical 
>> "C" processes would get an unfair amount of GPU scheduling.
>> Isn't it actually a scheduler's task outside of TTM to mitigate this?
>
> Yes, exactly the reason why I rejected this. I actually considered 
> moving the whole eviction to a background work item.

Ah, I should've read the following emails first; sorry about that.

Thanks,
Thomas

Patch

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index d33a6bb742a1..f465f1d38711 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -969,7 +969,11 @@  int ttm_bo_mem_space(struct ttm_buffer_object *bo,
 		if (mem_type == TTM_PL_SYSTEM)
 			break;
 
+		if (ctx->man != man)
+			mutex_lock(&man->node_mutex);
 		ret = (*man->func->get_node)(man, bo, place, mem);
+		if (ctx->man != man)
+			mutex_unlock(&man->node_mutex);
 		if (unlikely(ret))
 			return ret;
 
@@ -991,6 +995,7 @@  int ttm_bo_mem_space(struct ttm_buffer_object *bo,
 
 	for (i = 0; i < placement->num_busy_placement; ++i) {
 		const struct ttm_place *place = &placement->busy_placement[i];
+		struct ttm_operation_ctx busy_ctx;
 
 		ret = ttm_mem_type_from_place(place, &mem_type);
 		if (ret)
@@ -1018,7 +1023,15 @@  int ttm_bo_mem_space(struct ttm_buffer_object *bo,
 			return 0;
 		}
 
-		ret = ttm_bo_mem_force_space(bo, mem_type, place, mem, ctx);
+		memcpy(&busy_ctx, ctx, sizeof(busy_ctx));
+		if (ctx->man != man) {
+			mutex_lock(&man->node_mutex);
+			busy_ctx.man = man;
+		}
+		ret = ttm_bo_mem_force_space(bo, mem_type, place, mem,
+					     &busy_ctx);
+		if (ctx->man != man)
+			mutex_unlock(&man->node_mutex);
 		if (ret == 0 && mem->mm_node) {
 			mem->placement = cur_flags;
 			return 0;
@@ -1449,6 +1462,7 @@  int ttm_bo_init_mm(struct ttm_bo_device *bdev, unsigned type,
 	man->io_reserve_fastpath = true;
 	man->use_io_reserve_lru = false;
 	mutex_init(&man->io_reserve_mutex);
+	mutex_init(&man->node_mutex);
 	spin_lock_init(&man->move_lock);
 	INIT_LIST_HEAD(&man->io_reserve_lru);
 
diff --git a/include/drm/ttm/ttm_bo_api.h b/include/drm/ttm/ttm_bo_api.h
index 2cd025c2abe7..b08629050215 100644
--- a/include/drm/ttm/ttm_bo_api.h
+++ b/include/drm/ttm/ttm_bo_api.h
@@ -265,6 +265,7 @@  struct ttm_bo_kmap_obj {
  * @no_wait_gpu: Return immediately if the GPU is busy.
  * @allow_reserved_eviction: Allow eviction of reserved BOs.
  * @resv: Reservation object to allow reserved evictions with.
+ * @man: record ctx is under man->node_mutex.
  *
  * Context for TTM operations like changing buffer placement or general memory
  * allocation.
@@ -275,6 +276,7 @@  struct ttm_operation_ctx {
 	bool allow_reserved_eviction;
 	struct reservation_object *resv;
 	uint64_t bytes_moved;
+	struct ttm_mem_type_manager *man;
 };
 
 /**
diff --git a/include/drm/ttm/ttm_bo_driver.h b/include/drm/ttm/ttm_bo_driver.h
index 9b417eb2df20..d85d6ee4f54d 100644
--- a/include/drm/ttm/ttm_bo_driver.h
+++ b/include/drm/ttm/ttm_bo_driver.h
@@ -288,6 +288,7 @@  struct ttm_mem_type_manager {
 	uint32_t default_caching;
 	const struct ttm_mem_type_manager_func *func;
 	void *priv;
+	struct mutex node_mutex;
 	struct mutex io_reserve_mutex;
 	bool use_io_reserve_lru;
 	bool io_reserve_fastpath;