Message ID: 20240508180946.96863-1-tursulin@igalia.com (mailing list archive)
Series: Discussion around eviction improvements
On 08/05/2024 19:09, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>
> Last few days I was looking at the situation with VRAM oversubscription, what happens versus what perhaps should happen. Browsing through the driver and running some simple experiments.
>
> I ended up with this patch series which, as a disclaimer, may be completely wrong, but as I found some suspicious things, to me at least, I thought it was a good point to stop and request some comments.
>
> To summarise the main issues I think I found:
>
> * Migration rate limiting does not check whether a migration actually happened and so can over-account and unfairly penalise.
>
> * Migration rate limiting does not even work, at least not for the common case where userspace configures VRAM+GTT. It thinks it can stop migration attempts by playing with bo->allowed_domains vs bo->preferred_domains but, both from the code and from empirical experiments, I see that not working at all. Both masks are identical, so fiddling with them achieves nothing.
>
> * The idea of the fallback placement only works while VRAM has free space. As soon as it does not, ttm_resource_compatible is happy to leave the buffers in the secondary placement forever.
>
> * The driver thinks it will be re-validating evicted buffers on the next submission, but it does not for the very common case of VRAM+GTT, because it only checks whether the current placement is in *none* of the preferred placements.
>
> All those problems are addressed in individual patches.
>
> The end result of this series appears to be a driver which will try harder to move buffers back into VRAM, but will be (more) correctly throttled in doing so by the existing rate limiting logic.
>
> I have run a quick benchmark of Cyberpunk 2077 and cannot say that I saw a change, but that could be a good thing too. At least I did not break anything, perhaps. On one occasion I did see the rate limiting logic get confused, where for a period of a few minutes it went into a mode where it was constantly giving a high migration budget. But that recovered itself when I switched clients and did not come back, so I don't know. If there is something wrong there, I don't think it would be caused by any patches in this series.

Since yesterday I also briefly tested with Far Cry New Dawn. One run each, so possibly it doesn't mean anything apart from that there isn't a regression, i.e. migration throttling is keeping things at bay even with increased requests to migrate things back to VRAM:

                    before     after
  min/avg/max fps   36/44/54   37/45/55

Cyberpunk 2077 from yesterday was similarly close:

  26.96/29.59/30.40   29.70/30.00/30.32

I guess the real story is a proper dGPU, where misplaced buffers have a real cost.

Regards,

Tvrtko

> The series is probably rough but should be good enough for discussion. I am curious to hear whether I identified at least something correctly as a real problem.
>
> It would also be good to hear which games are suggested to check and see whether there is any improvement.
>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Friedrich Vock <friedrich.vock@gmx.de>
>
> Tvrtko Ursulin (5):
>   drm/amdgpu: Fix migration rate limiting accounting
>   drm/amdgpu: Actually respect buffer migration budget
>   drm/ttm: Add preferred placement flag
>   drm/amdgpu: Use preferred placement for VRAM+GTT
>   drm/amdgpu: Re-validate evicted buffers
>
>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c     | 38 +++++++++++++++++-----
>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |  8 +++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c     | 21 ++++++++++--
>  drivers/gpu/drm/ttm/ttm_resource.c         | 13 +++++---
>  include/drm/ttm/ttm_placement.h            |  3 ++
>  5 files changed, 65 insertions(+), 18 deletions(-)
Just FYI, I've been on sick leave for a while and am now trying to catch up. It will probably be at least a week until I can look into this again.

Sorry,
Christian.

Am 08.05.24 um 20:09 schrieb Tvrtko Ursulin:
> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>
> Last few days I was looking at the situation with VRAM oversubscription, what happens versus what perhaps should happen. Browsing through the driver and running some simple experiments.
>
> [full cover letter and diffstat trimmed]
On 09/05/2024 13:40, Tvrtko Ursulin wrote:
>
> On 08/05/2024 19:09, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>
>> [cover letter trimmed]
>>
>> I have run a quick benchmark of Cyberpunk 2077 and cannot say that I saw a change, but that could be a good thing too. At least I did not break anything, perhaps. [...]
>
> Since yesterday I also briefly tested with Far Cry New Dawn. One run each, so possibly it doesn't mean anything apart from that there isn't a regression, i.e. migration throttling is keeping things at bay even with increased requests to migrate things back to VRAM:
>
>                     before     after
>   min/avg/max fps   36/44/54   37/45/55
>
> Cyberpunk 2077 from yesterday was similarly close:
>
>   26.96/29.59/30.40   29.70/30.00/30.32
>
> I guess the real story is a proper dGPU, where misplaced buffers have a real cost.

I found one game which regresses spectacularly badly with this series - Assassin's Creed Valhalla. The built-in benchmark at least. The game appears to have a working set much larger than the other games I tested, around 5GiB total during the benchmark. And for some reason migration throttling totally fails to put it in check. I will be investigating this shortly.

Regards,

Tvrtko

>> [remainder of cover letter and diffstat trimmed]
On 13/05/2024 14:49, Tvrtko Ursulin wrote:
>
> On 09/05/2024 13:40, Tvrtko Ursulin wrote:
>>
>> On 08/05/2024 19:09, Tvrtko Ursulin wrote:
>>> [cover letter trimmed]
>>
>> [benchmark results trimmed]
>
> I found one game which regresses spectacularly badly with this series - Assassin's Creed Valhalla. The built-in benchmark at least. The game appears to have a working set much larger than the other games I tested, around 5GiB total during the benchmark. And for some reason migration throttling totally fails to put it in check. I will be investigating this shortly.

I think the conclusion is that everything I attempted to add relating to TTM_PL_PREFERRED does not really work as I initially thought it did. Therefore please imagine this series as only containing patches 1, 2 and 5.

(And FWIW it was quite annoying to get to the bottom of, since for some reason the system exhibits some sort of latching behaviour, where on some boots and/or some minutes of runtime things were fine, and then it would latch onto a mode where the TTM_PL_PREFERRED induced breakage would show. And sometimes this breakage would appear straight away. Odd.)

I still need to test whether the subset of patches manages to achieve some positive improvement on its own. It is possible, as patch 5 marks more buffers for re-validation, so once overcommit subsides they would get promoted to the preferred placement straight away. And 1 & 2 are notionally fixes for migration throttling, so at least in a broad sense they should still be valid as discussion points.

Regards,

Tvrtko

>>> [remainder of cover letter and diffstat trimmed]
Am 14.05.24 um 17:14 schrieb Tvrtko Ursulin:
>
> On 13/05/2024 14:49, Tvrtko Ursulin wrote:
>>
>> [earlier thread and cover letter trimmed]
>>
>> I found one game which regresses spectacularly badly with this series - Assassin's Creed Valhalla. [...]
>
> I think the conclusion is that everything I attempted to add relating to TTM_PL_PREFERRED does not really work as I initially thought it did. Therefore please imagine this series as only containing patches 1, 2 and 5.

Noted (and I had just started to wrap my head around that idea).

> (And FWIW it was quite annoying to get to the bottom of, since for some reason the system exhibits some sort of latching behaviour, where on some boots and/or some minutes of runtime things were fine, and then it would latch onto a mode where the TTM_PL_PREFERRED induced breakage would show. And sometimes this breakage would appear straight away. Odd.)

Welcome to my world. You improve one use case and four others get a penalty. Even when you know the code and the potential use cases inside out, it's really hard to predict how some applications and the core memory management will behave sometimes.

> I still need to test whether the subset of patches manages to achieve some positive improvement on its own. It is possible, as patch 5 marks more buffers for re-validation, so once overcommit subsides they would get promoted to the preferred placement straight away. And 1 & 2 are notionally fixes for migration throttling, so at least in a broad sense they should still be valid as discussion points.

Yeah, especially 5 kind of makes sense, but it could potentially lead to higher overhead. Need to see how we can better handle that.

Regards,
Christian.

> Regards,
>
> Tvrtko
>
>> [remainder of cover letter and diffstat trimmed]
From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

Last few days I was looking at the situation with VRAM oversubscription, what happens versus what perhaps should happen. Browsing through the driver and running some simple experiments.

I ended up with this patch series which, as a disclaimer, may be completely wrong, but as I found some suspicious things, to me at least, I thought it was a good point to stop and request some comments.

To summarise the main issues I think I found:

* Migration rate limiting does not check whether a migration actually happened and so can over-account and unfairly penalise.

* Migration rate limiting does not even work, at least not for the common case where userspace configures VRAM+GTT. It thinks it can stop migration attempts by playing with bo->allowed_domains vs bo->preferred_domains but, both from the code and from empirical experiments, I see that not working at all. Both masks are identical, so fiddling with them achieves nothing.

* The idea of the fallback placement only works while VRAM has free space. As soon as it does not, ttm_resource_compatible is happy to leave the buffers in the secondary placement forever.

* The driver thinks it will be re-validating evicted buffers on the next submission, but it does not for the very common case of VRAM+GTT, because it only checks whether the current placement is in *none* of the preferred placements.

All those problems are addressed in individual patches.

The end result of this series appears to be a driver which will try harder to move buffers back into VRAM, but will be (more) correctly throttled in doing so by the existing rate limiting logic.

I have run a quick benchmark of Cyberpunk 2077 and cannot say that I saw a change, but that could be a good thing too. At least I did not break anything, perhaps. On one occasion I did see the rate limiting logic get confused, where for a period of a few minutes it went into a mode where it was constantly giving a high migration budget. But that recovered itself when I switched clients and did not come back, so I don't know. If there is something wrong there, I don't think it would be caused by any patches in this series.

The series is probably rough but should be good enough for discussion. I am curious to hear whether I identified at least something correctly as a real problem.

It would also be good to hear which games are suggested to check and see whether there is any improvement.

Cc: Christian König <christian.koenig@amd.com>
Cc: Friedrich Vock <friedrich.vock@gmx.de>

Tvrtko Ursulin (5):
  drm/amdgpu: Fix migration rate limiting accounting
  drm/amdgpu: Actually respect buffer migration budget
  drm/ttm: Add preferred placement flag
  drm/amdgpu: Use preferred placement for VRAM+GTT
  drm/amdgpu: Re-validate evicted buffers

 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c     | 38 +++++++++++++++++-----
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |  8 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c     | 21 ++++++++++--
 drivers/gpu/drm/ttm/ttm_resource.c         | 13 +++++---
 include/drm/ttm/ttm_placement.h            |  3 ++
 5 files changed, 65 insertions(+), 18 deletions(-)