[RFC,00/18] TTM interface for managing VRAM oversubscription

Message ID 20240424165937.54759-1-friedrich.vock@gmx.de (mailing list archive)

Message

Friedrich Vock April 24, 2024, 4:56 p.m. UTC
Hi everyone,

recently I've been looking into remedies for apps (in particular, newer
games) that experience significant performance loss when they start to
hit VRAM limits, especially on older or lower-end cards that struggle
to fit both desktop apps and all the game data into VRAM at once.

The root of the problem lies in the fact that from userspace's POV,
buffer eviction is very opaque: Userspace applications/drivers cannot
tell how oversubscribed VRAM is, nor do they have fine-grained control
over which buffers get evicted.  At the same time, with GPU APIs becoming
increasingly lower-level and GPU-driven, only the application itself
can know which buffers are used within a particular submission, and
how important each buffer is. For this, GPU APIs include interfaces
to query oversubscription and specify memory priorities: In Vulkan,
oversubscription can be queried through the VK_EXT_memory_budget
extension. Different buffers can also be assigned priorities via the
VK_EXT_pageable_device_local_memory extension. Modern games, especially
D3D12 games via vkd3d-proton, rely on oversubscription being reported and
priorities being respected in order to perform their memory management.
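
For reference, the userspace-facing side of these extensions looks
roughly like this (standard Vulkan API usage, shown only for context;
a heap counts as oversubscribed once its usage exceeds its budget):

  VkPhysicalDeviceMemoryBudgetPropertiesEXT budget = {
      .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MEMORY_BUDGET_PROPERTIES_EXT,
  };
  VkPhysicalDeviceMemoryProperties2 props = {
      .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MEMORY_PROPERTIES_2,
      .pNext = &budget,
  };
  vkGetPhysicalDeviceMemoryProperties2(physical_device, &props);
  /* budget.heapUsage[i] > budget.heapBudget[i] means heap i is
   * oversubscribed. */

  /* VK_EXT_pageable_device_local_memory: priority in [0.0, 1.0],
   * higher values are less likely to be paged out. */
  vkSetDeviceMemoryPriorityEXT(device, memory, 1.0f);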

However, relaying this information to the kernel via the current KMD uAPIs
is not possible. On AMDGPU for example, all work submissions include a
"bo list" that contains any buffer object that is accessed during the
course of the submission. If VRAM is oversubscribed and a buffer in the
list was evicted to system memory, that buffer is moved back to VRAM
(potentially evicting other unused buffers).

Since the usermode driver doesn't know what buffers are used by the
application, its only choice is to submit a bo list that contains every
buffer the application has allocated. In case of VRAM oversubscription,
it is highly likely that some of the application's buffers were evicted,
which almost guarantees that some buffers will get moved around. Since
the bo list is only known at submit time, this also means the buffers
will get moved right before submitting application work, which is the
worst possible time to move buffers from a latency perspective. Another
consequence of the large bo list is that nearly all memory from other
applications will be evicted, too. When different applications (e.g. game
and compositor) submit work one after the other, this causes a ping-pong
effect where each app's submission evicts the other app's memory,
resulting in a large amount of unnecessary moves.

This overly aggressive eviction behavior led to RADV adopting a change
that effectively allows all VRAM allocations to reside in system memory
[1]. This worked around the ping-ponging/excessive buffer moving problem,
but also meant that any memory evicted to system memory would stay there
forever, regardless of how VRAM is used.

My proposal aims to provide a middle ground between these extremes.
The goals I want to meet are:
- Userspace is accurately informed about VRAM oversubscription/how much
  VRAM has been evicted
- Buffer eviction respects priorities set by userspace
- Wasteful ping-ponging is avoided to the extent possible

I have been testing out some prototypes, and came up with this rough
sketch of an API:

- For each ttm_resource_manager, the amount of evicted memory is tracked
  (similarly to how "usage" tracks the memory usage). When memory is
  evicted via ttm_bo_evict, the size of the evicted memory is added, when
  memory is un-evicted (see below), its size is subtracted. The amount of
  evicted memory for e.g. VRAM can be queried by userspace via an ioctl.

- Each ttm_resource_manager maintains a list of evicted buffer objects.

- ttm_mem_unevict walks the list of evicted bos for a given
  ttm_resource_manager and tries moving evicted resources back. When a
  buffer is freed, this function is called to immediately restore some
  evicted memory.

- Each ttm_buffer_object independently tracks the mem_type it wants
  to reside in.

- ttm_bo_try_unevict is added as a helper function which attempts to
  move the buffer to its preferred mem_type. If no space is available
  there, it fails with -ENOSPC/-ENOMEM.

- Similar to how ttm_bo_evict works, each driver can implement
  uneviction_valuable/unevict_flags callbacks to control buffer
  un-eviction.
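
As a rough illustration of how these pieces could fit together (this is
only a sketch following the names above, not the actual patch contents):

  struct ttm_resource_manager {
          /* ... existing members, including "usage" ... */
          uint64_t evicted;              /* bytes evicted from this manager */
          struct list_head evicted_list; /* BOs waiting to move back */
  };

  struct ttm_device_funcs {
          /* ... existing callbacks ... */
          /* Mirrors eviction_valuable/evict_flags for un-eviction. */
          bool (*uneviction_valuable)(struct ttm_buffer_object *bo);
          void (*unevict_flags)(struct ttm_buffer_object *bo,
                                struct ttm_placement *placement);
  };

  /* Attempt to move bo back to its preferred mem_type; returns
   * -ENOSPC/-ENOMEM if no space is available there. */
  int ttm_bo_try_unevict(struct ttm_buffer_object *bo,
                         struct ttm_operation_ctx *ctx);

  /* Walk man->evicted_list and try moving BOs back, e.g. right
   * after a buffer was freed. */
  void ttm_mem_unevict(struct ttm_device *bdev,
                       struct ttm_resource_manager *man);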

This is what patches 1-10 accomplish (together with an amdgpu
implementation utilizing the new API).

Userspace priorities could then be implemented as follows:

- TTM already manages priorities for each buffer object. These priorities
  can be updated by userspace via a GEM_OP ioctl to inform the kernel
  which buffers should be evicted before others. If an ioctl increases
  the priority of a buffer, ttm_bo_try_unevict is called on that buffer to
  try and move it back (potentially evicting buffers with a lower
  priority)

- Buffers should never be evicted by other buffers with equal/lower
  priority, but if there is a buffer with lower priority occupying VRAM,
  it should be evicted in favor of the higher-priority one. This prevents
  ping-ponging between buffers that try evicting each other and is
  trivially implementable with an early-exit in ttm_mem_evict_first.
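
  Since TTM already keeps per-priority LRU lists, the early-exit could
  amount to stopping the walk once it reaches the evicting buffer's own
  priority level (schematic fragment only; the body is the existing
  eviction walk, and "evicting_prio" is an assumed extra parameter):

  for (prio = 0; prio < TTM_MAX_BO_PRIORITY; ++prio) {
          /* Only strictly lower-priority BOs are fair game. */
          if (prio >= evicting_prio)
                  return -ENOSPC;
          list_for_each_entry(res, &man->lru[prio], lru) {
                  /* ... existing eviction-candidate checks ... */
          }
  }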

This is covered in patches 11-15, with the new features exposed to
userspace in patches 16-18.
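
On the userspace side, usage could look roughly like the following
(the AMDGPU_GEM_OP_SET_PRIORITY and AMDGPU_INFO_EVICTED_VRAM names are
hypothetical stand-ins for the defines patches 16-18 introduce; the
surrounding ioctl plumbing is the existing amdgpu uAPI):

  #include <stdint.h>
  #include <xf86drm.h>
  #include <amdgpu_drm.h>

  /* Raise a buffer's eviction priority. */
  struct drm_amdgpu_gem_op gem_op = {
          .handle = bo_handle,
          .op     = AMDGPU_GEM_OP_SET_PRIORITY, /* hypothetical name */
          .value  = priority,
  };
  drmCommandWriteRead(fd, DRM_AMDGPU_GEM_OP, &gem_op, sizeof(gem_op));

  /* Query how much VRAM is currently evicted. */
  uint64_t evicted_vram = 0;
  struct drm_amdgpu_info info = {
          .return_pointer = (uintptr_t)&evicted_vram,
          .return_size    = sizeof(evicted_vram),
          .query          = AMDGPU_INFO_EVICTED_VRAM, /* hypothetical name */
  };
  drmCommandWrite(fd, DRM_AMDGPU_INFO, &info, sizeof(info));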

I also have a RADV branch utilizing this API at [2], which I use for
testing.

This implementation is still very much WIP, although the D3D12 games I
tested already seemed to benefit from it. Nevertheless, there are still
quite a few TODOs and unresolved questions/problems.

Some kernel drivers (e.g. i915) already use TTM priorities for
kernel-internal purposes. Of course, some of the highest priorities
should stay reserved for these purposes (with userspace being able to
use the lower priorities).

Another problem with priorities is the possibility of apps starving other
apps by occupying all of VRAM with high-priority allocations. A possible
solution could be to restrict the highest priority/priorities
to important apps like compositors.

Tying into this problem, only apps that are actively cooperating
to reduce memory pressure can benefit from the current memory priority
implementation. Eventually the priority system could also be utilized
to benefit all applications, for example with the desktop environment
boosting the priority of the currently-focused app/its cgroup (to
provide the best QoS to the apps the user is actively using). A full
implementation of this is probably out-of-scope for this initial proposal,
but it's probably a good idea to consider this as a possible future use
of the priority API.

I'm primarily looking to integrate this into amdgpu to solve the
issues I've seen there, but I'm also interested in feedback from
other drivers. Is this something you'd be interested in? Do you
have any objections/comments/questions about my proposed design?

Thanks,
Friedrich

[1] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6833
[2] https://gitlab.freedesktop.org/pixelcluster/mesa/-/tree/spilling

Friedrich Vock (18):
  drm/ttm: Add tracking for evicted memory
  drm/ttm: Add per-BO eviction tracking
  drm/ttm: Implement BO eviction tracking
  drm/ttm: Add driver funcs for uneviction control
  drm/ttm: Add option to evict no BOs in operation
  drm/ttm: Add public buffer eviction/uneviction functions
  drm/amdgpu: Add TTM uneviction control functions
  drm/amdgpu: Don't try moving BOs to preferred domain before submit
  drm/amdgpu: Don't mark VRAM as a busy placement for VRAM|GTT resources
  drm/amdgpu: Don't add GTT to initial domains after failing to allocate
    VRAM
  drm/ttm: Bump BO priority count
  drm/ttm: Do not evict BOs with higher priority
  drm/ttm: Implement ttm_bo_update_priority
  drm/ttm: Consider BOs placed in non-favorite locations evicted
  drm/amdgpu: Set a default priority for user/kernel BOs
  drm/amdgpu: Implement SET_PRIORITY GEM op
  drm/amdgpu: Implement EVICTED_VRAM query
  drm/amdgpu: Bump minor version

 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   2 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c     | 191 +---------------
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.h     |   4 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |   3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c    |  25 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c    |   3 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |  26 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h |   4 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c    |  50 ++++
 drivers/gpu/drm/ttm/ttm_bo.c               | 253 ++++++++++++++++++++-
 drivers/gpu/drm/ttm/ttm_bo_util.c          |   3 +
 drivers/gpu/drm/ttm/ttm_device.c           |   1 +
 drivers/gpu/drm/ttm/ttm_resource.c         |  19 +-
 include/drm/ttm/ttm_bo.h                   |  22 ++
 include/drm/ttm/ttm_device.h               |  28 +++
 include/drm/ttm/ttm_resource.h             |  11 +-
 include/uapi/drm/amdgpu_drm.h              |   3 +
 17 files changed, 430 insertions(+), 218 deletions(-)

--
2.44.0

Comments

Christian König April 25, 2024, 6:54 a.m. UTC | #1
In general: Yes please :)

But you are exercising a lot of ideas we have already thrown overboard
over the years.

The general idea Marek and I have been working on for a while now is 
rather to make TTM aware of userspace "clients".

In other words we should start with having a TTM structure in the fpriv 
of the drivers and then track there how much VRAM was evicted for each 
client.

This should then be balanced so that each client gets its equal share 
of VRAM and we pretty much end up with a static situation which only 
changes when applications become inactive/active (based on their GPU 
activity).
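
(Schematically, the per-client state this implies might look like the
following; names and fields are purely illustrative, not from any
existing patch:)

  /* Per-client TTM state, embedded in the driver's fpriv. */
  struct ttm_client {
          uint64_t evicted_vram; /* bytes evicted on behalf of this client */
          bool     active;       /* derived from recent GPU activity */
  };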

I will mail you some of the stuff we already came up with later on.

Regards,
Christian.

Am 24.04.24 um 18:56 schrieb Friedrich Vock:
> [snip]

Marek Olšák April 25, 2024, 1:22 p.m. UTC | #2
The most extreme ping-ponging is mitigated by throttling buffer moves
in the kernel, but it only works without VM_ALWAYS_VALID and you can
set BO priorities in the BO list. A better approach that works with
VM_ALWAYS_VALID would be nice.

Marek

On Wed, Apr 24, 2024 at 1:12 PM Friedrich Vock <friedrich.vock@gmx.de> wrote:
> [snip]

Christian König April 25, 2024, 1:33 p.m. UTC | #3
Yeah, and this patch set here is removing that functionality.

Which is a major concern from my side as well.

Instead of removing it, my long-term plan was to move this into TTM (the 
recent flags rework is going in that direction), so that both amdgpu 
and radeon can use the same code again *and* we can also apply it to 
VM_ALWAYS_VALID BOs.

Christian.

Am 25.04.24 um 15:22 schrieb Marek Olšák:
> The most extreme ping-ponging is mitigated by throttling buffer moves
> in the kernel, but it only works without VM_ALWAYS_VALID and you can
> set BO priorities in the BO list. A better approach that works with
> VM_ALWAYS_VALID would be nice.
>
> Marek
>
> On Wed, Apr 24, 2024 at 1:12 PM Friedrich Vock <friedrich.vock@gmx.de> wrote:
>> [snip]

Maarten Lankhorst May 2, 2024, 2:23 p.m. UTC | #4
Hey,

Den 2024-04-24 kl. 18:56, skrev Friedrich Vock:
> [snip]

For Xe, I've been looking at using cgroups. A small prototype is available at

https://cgit.freedesktop.org/~mlankhorst/linux/log/?h=dumpcg

To stimulate discussion, I've added amdgpu support as well.
This should make it possible to isolate the compositor allocations
from the target program.

This support is still incomplete and covers vram only, but I need help 
from userspace and consensus from other drivers on how to move forward.

I'm thinking of making 3 cgroup limits:
1. Physical memory, each time a buffer is allocated, it counts towards 
it, regardless of where it resides.
2. Mappable memory, all buffers allocated in sysmem or vram count 
towards this limit.
3. VRAM, only buffers residing in VRAM count here.

This ensures that VRAM can always be evicted to sysmem, by having a 
mappable memory quota, and having a sysmem reservation.

The main trouble is that when evicting, you want to charge the original 
process the changes in allocation limits, but it should be solvable.

I've been looking for someone else needing the usecase in a different 
context, so let me know what you think of the idea.

This can be generalized towards all uses of the GPU, but the compositor 
vs game thrashing is a good example of why it is useful to have.

I should still have my cgroup testcase somewhere, this is only a rebase 
of my previous proposal, but I think it fits the usecase.

Cheers,
Maarten
Friedrich Vock May 13, 2024, 1:44 p.m. UTC | #5
Hi,

On 02.05.24 16:23, Maarten Lankhorst wrote:
> Hey,
>
> [snip]
>
> For Xe, I've been looking at using cgroups. A small prototype is
> available at
>
> https://cgit.freedesktop.org/~mlankhorst/linux/log/?h=dumpcg
>
> To stimulate discussion, I've added amdgpu support as well.
> This should make it possible to isolate the compositor allocations
> from the target program.
>
> This support is still incomplete and covers vram only, but I need help
> from userspace and consensus from other drivers on how to move forward.
>
> I'm thinking of making 3 cgroup limits:
> 1. Physical memory, each time a buffer is allocated, it counts towards
> it, regardless of where it resides.
> 2. Mappable memory, all buffers allocated in sysmem or vram count
> towards this limit.
> 3. VRAM, only buffers residing in VRAM count here.
>
> This ensures that VRAM can always be evicted to sysmem, by having a
> mappable memory quota, and having a sysmem reservation.
>
> The main trouble is that when evicting, you want to charge the
> original process the changes in allocation limits, but it should be
> solvable.
>
> I've been looking for someone else needing the usecase in a different
> context, so let me know what you think of the idea.
>
Sorry for the late reply. The idea sounds really good! I think cgroups
are a great fit for what we'd need to prioritize game+compositor over
other potential non-foreground apps.

 From what I can tell looking through the code, the current cgroup
properties are absolute memory sizes that userspace asks the kernel to
restrict the cgroup usage to?
While that sounds useful for some usecases too, I'm not sure just these
limits are a good solution for making sure that your compositor's and
foreground app's resources stay in memory (in favor of background apps)
when there is pressure.

> This can be generalized towards all uses of the GPU, but the
> compositor vs game thrashing is a good example of why it is useful to
> have.
>
IIRC Tvrtko's original proposal was about per-cgroup DRM scheduling
priorities providing lower submission latency for prioritized cgroups,
right?

I think what we need here would be pretty much exactly such a priority
system, but for memory: The cgroup containing the foreground app/game
and the compositor should have some hint telling TTM to try its hardest
to avoid evicting its buffers (i.e. a high memory priority).
Your existing drm_cgroup work looks like a great base for this, and I'd
be happy to help/participate with the implementation for amdgpu.

Thanks,
Friedrich

> I should still have my cgroup testcase somewhere, this is only a
> rebase of my previous proposal, but I think it fits the usecase.
>
> Cheers,
> Maarten