[v3,0/8] drm: Introduce sparse GEM shmem

Message ID	20250404092634.2968115-1-boris.brezillon@collabora.com (mailing list archive)
Headers	show Return-Path: <dri-devel-bounces@lists.freedesktop.org> sender: bbrezillon) by bali.collaboradmins.com (Postfix) with ESMTPSA id 84AEA17E07FD; Fri, 4 Apr 2025 11:26:35 +0200 (CEST) From: Boris Brezillon <boris.brezillon@collabora.com> To: Boris Brezillon <boris.brezillon@collabora.com>, Steven Price <steven.price@arm.com>, Liviu Dudau <liviu.dudau@arm.com>, =?utf-8?q?Adri=C3=A1n_Larumbe?= <adrian.larumbe@collabora.com>, lima@lists.freedesktop.org, Qiang Yu <yuq825@gmail.com> Cc: David Airlie <airlied@gmail.com>, Simona Vetter <simona@ffwll.ch>, Maarten Lankhorst <maarten.lankhorst@linux.intel.com>, Maxime Ripard <mripard@kernel.org>, Thomas Zimmermann <tzimmermann@suse.de>, dri-devel@lists.freedesktop.org, Dmitry Osipenko <dmitry.osipenko@collabora.com>, kernel@collabora.com Subject: [PATCH v3 0/8] drm: Introduce sparse GEM shmem Date: Fri, 4 Apr 2025 11:26:26 +0200 Message-ID: <20250404092634.2968115-1-boris.brezillon@collabora.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Precedence: list Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>
Series	drm: Introduce sparse GEM shmem \| expand [v3,0/8] drm: Introduce sparse GEM shmem [v3,1/8] drm/gem: Add helpers to request a range of pages on a GEM [v3,2/8] drm/gem-shmem: Support sparse backing [v3,3/8] drm/panfrost: Switch to sparse gem shmem to implement our alloc-on-fault [v3,4/8] drm/panthor: Add support for alloc-on-fault buffers [v3,5/8] drm/panthor: Allow kernel BOs to pass DRM_PANTHOR_BO_ALLOC_ON_FAULT [v3,6/8] drm/panthor: Add a panthor_vm_pre_fault_range() helper [v3,7/8] drm/panthor: Make heap chunk allocation non-blocking [v3,8/8] drm/lima: Use drm_gem_shmem_sparse_backing for heap buffers

Boris Brezillon April 4, 2025, 9:26 a.m. UTC

This patch series is a proposal for implementing sparse page allocations
for shmem objects. It was initially motivated by a kind of BO managed by
the Panfrost/Panthor and Lima drivers, the tiler heap, which grows on
demand every time the GPU faults on a virtual address within the heap BO
range.

Because keeping a struct page pointer array that can describe the entire
virtual range is wasteful when only a few backing pages have been
allocated, we thought a sparse allocation approach with xarrays was a
more efficient choice.

This sparse GEM shmem features takes the form of a new optional
drm_gem_shmem_sparse_backing object that can be attached to a
drm_gem_shmem_object at creation time if the driver wants. When this
sparse GEM shmem backing mode is enabled, the driver is allow to
partially populate the GEM pages, either at use time, or at fault
time. The page population can be done progressively as the need for
more memory arises. The new APIs takes explicit gfp_t flags to solve
a long standing issue reported by Sima a while ago: all allocations
happening in the fault handler path shouldn't block.

We also introduce two new helpers at the drm_gem.{c,h} level to
automate the partial xarray population which the GEM-SHMEM logic
relies on to populate its sparse page array.

A few details about the implementation now:

- Sparse GEM backing locking is deferred to the driver, because
  we can't take the resv locks in the GPU fault handler path, and
  since the page population can come from there, we have to let
  the driver decide how to lock.
- The pin/get semantics for sparse GEM shmem objects is different,
  in that it doesn't populate the pages, but just flag them as
  being used/required by some GPU component. The page population
  will happen explicitly at GEM creation time or when a GPU fault
  happens. Once pages have been populated, they won't disappear
  until the GEM object is destroyed, purged or evicted. This means
  you can grow, but not shrink. If we ever need to discard
  BO ranges, the API can be extended to allow it, but it's not
  something we needed for the Lima/Panthor/Panfrost case.
- Panthor and Panfrost changes have been tested, but not extensively.
  Lima changes have only been compile-tested.

Changes from v2:
- We decided to focus more on the DRM aspects and forget about the
  sgt building optimizations (sgt helpers taking an xarray instead of
  a plain array). If the xarray -> array copies happening in that
  path ever become the bottleneck, we can resurrect those changes.
- We decided to make the sparse GEM changes less invasive by making
  them an extra layer on top of drm_gem_shmem_object. What this means
  is that sparse GEM shmem can now be used as regular GEM shmem
  objects (vmap, pin, export, ... all work as they would on a regular
  GEM). When that happens, we just force all pages to be populated,
  so we kinda lose the benefit of sparse GEMs, but that's fine, because
  in practice, such objects shouldn't be manipulated as regular GEMs.
  It just serves as a safety guard to limit the risk of regressions
  introduced by these sparse GEM shmem changes.

Discussion of previus revision can be found here[2][3].

For those used to review gitlab MRs, here's a link [1] to a Draft
MR grouping those changes, but I'm in no way asking that the review
happens on gitlab.

Regards,

Boris

[1]https://gitlab.freedesktop.org/panfrost/linux/-/merge_requests/16
[2]https://lore.kernel.org/lkml/20250326021433.772196-1-adrian.larumbe@collabora.com/T/
[3]https://lore.kernel.org/dri-devel/20250218232552.3450939-1-adrian.larumbe@collabora.com/

Adrián Larumbe (1):
  drm/gem: Add helpers to request a range of pages on a GEM

Boris Brezillon (7):
  drm/gem-shmem: Support sparse backing
  drm/panfrost: Switch to sparse gem shmem to implement our
    alloc-on-fault
  drm/panthor: Add support for alloc-on-fault buffers
  drm/panthor: Allow kernel BOs to pass DRM_PANTHOR_BO_ALLOC_ON_FAULT
  drm/panthor: Add a panthor_vm_pre_fault_range() helper
  drm/panthor: Make heap chunk allocation non-blocking
  drm/lima: Use drm_gem_shmem_sparse_backing for heap buffers

 drivers/gpu/drm/drm_gem.c               | 134 +++++++
 drivers/gpu/drm/drm_gem_shmem_helper.c  | 404 +++++++++++++++++++-
 drivers/gpu/drm/lima/lima_gem.c         |  89 ++---
 drivers/gpu/drm/lima/lima_gem.h         |   1 +
 drivers/gpu/drm/lima/lima_vm.c          |  48 ++-
 drivers/gpu/drm/panfrost/panfrost_drv.c |   2 +-
 drivers/gpu/drm/panfrost/panfrost_gem.c |  37 +-
 drivers/gpu/drm/panfrost/panfrost_gem.h |   8 +-
 drivers/gpu/drm/panfrost/panfrost_mmu.c |  98 ++---
 drivers/gpu/drm/panfrost/panfrost_mmu.h |   2 +
 drivers/gpu/drm/panthor/panthor_drv.c   |  20 +-
 drivers/gpu/drm/panthor/panthor_fw.c    |   6 +-
 drivers/gpu/drm/panthor/panthor_gem.c   |  18 +-
 drivers/gpu/drm/panthor/panthor_gem.h   |  12 +-
 drivers/gpu/drm/panthor/panthor_heap.c  | 222 ++++++-----
 drivers/gpu/drm/panthor/panthor_mmu.c   | 481 ++++++++++++++++++------
 drivers/gpu/drm/panthor/panthor_mmu.h   |   2 +
 drivers/gpu/drm/panthor/panthor_sched.c |   6 +-
 include/drm/drm_gem.h                   |  14 +
 include/drm/drm_gem_shmem_helper.h      | 285 +++++++++++++-
 include/uapi/drm/panthor_drm.h          |  19 +-
 21 files changed, 1492 insertions(+), 416 deletions(-)

Boris Brezillon April 10, 2025, 2:48 p.m. UTC | #1

+Christian, Alyssa and Faith, as suggested by Sima. I'll make sure to
Cc you on v4, but before that, I'd like to get your opinion on the
drm-gem/drm-gem-shmem changes to see if sending a v4 is actually
desirable or if I should go back to the drawing board.

On Fri,  4 Apr 2025 11:26:26 +0200
Boris Brezillon <boris.brezillon@collabora.com> wrote:

> This patch series is a proposal for implementing sparse page allocations
> for shmem objects. It was initially motivated by a kind of BO managed by
> the Panfrost/Panthor and Lima drivers, the tiler heap, which grows on
> demand every time the GPU faults on a virtual address within the heap BO
> range.
> 
> Because keeping a struct page pointer array that can describe the entire
> virtual range is wasteful when only a few backing pages have been
> allocated, we thought a sparse allocation approach with xarrays was a
> more efficient choice.
> 
> This sparse GEM shmem features takes the form of a new optional
> drm_gem_shmem_sparse_backing object that can be attached to a
> drm_gem_shmem_object at creation time if the driver wants. When this
> sparse GEM shmem backing mode is enabled, the driver is allow to
> partially populate the GEM pages, either at use time, or at fault
> time. The page population can be done progressively as the need for
> more memory arises. The new APIs takes explicit gfp_t flags to solve
> a long standing issue reported by Sima a while ago: all allocations
> happening in the fault handler path shouldn't block.
> 
> We also introduce two new helpers at the drm_gem.{c,h} level to
> automate the partial xarray population which the GEM-SHMEM logic
> relies on to populate its sparse page array.
> 
> A few details about the implementation now:
> 
> - Sparse GEM backing locking is deferred to the driver, because
>   we can't take the resv locks in the GPU fault handler path, and
>   since the page population can come from there, we have to let
>   the driver decide how to lock.
> - The pin/get semantics for sparse GEM shmem objects is different,
>   in that it doesn't populate the pages, but just flag them as
>   being used/required by some GPU component. The page population
>   will happen explicitly at GEM creation time or when a GPU fault
>   happens. Once pages have been populated, they won't disappear
>   until the GEM object is destroyed, purged or evicted. This means
>   you can grow, but not shrink. If we ever need to discard
>   BO ranges, the API can be extended to allow it, but it's not
>   something we needed for the Lima/Panthor/Panfrost case.
> - Panthor and Panfrost changes have been tested, but not extensively.
>   Lima changes have only been compile-tested.
> 
> Changes from v2:
> - We decided to focus more on the DRM aspects and forget about the
>   sgt building optimizations (sgt helpers taking an xarray instead of
>   a plain array). If the xarray -> array copies happening in that
>   path ever become the bottleneck, we can resurrect those changes.
> - We decided to make the sparse GEM changes less invasive by making
>   them an extra layer on top of drm_gem_shmem_object. What this means
>   is that sparse GEM shmem can now be used as regular GEM shmem
>   objects (vmap, pin, export, ... all work as they would on a regular
>   GEM). When that happens, we just force all pages to be populated,
>   so we kinda lose the benefit of sparse GEMs, but that's fine, because
>   in practice, such objects shouldn't be manipulated as regular GEMs.
>   It just serves as a safety guard to limit the risk of regressions
>   introduced by these sparse GEM shmem changes.
> 
> Discussion of previus revision can be found here[2][3].
> 
> For those used to review gitlab MRs, here's a link [1] to a Draft
> MR grouping those changes, but I'm in no way asking that the review
> happens on gitlab.
> 
> Regards,
> 
> Boris
> 
> [1]https://gitlab.freedesktop.org/panfrost/linux/-/merge_requests/16
> [2]https://lore.kernel.org/lkml/20250326021433.772196-1-adrian.larumbe@collabora.com/T/
> [3]https://lore.kernel.org/dri-devel/20250218232552.3450939-1-adrian.larumbe@collabora.com/
> 
> Adrián Larumbe (1):
>   drm/gem: Add helpers to request a range of pages on a GEM
> 
> Boris Brezillon (7):
>   drm/gem-shmem: Support sparse backing
>   drm/panfrost: Switch to sparse gem shmem to implement our
>     alloc-on-fault
>   drm/panthor: Add support for alloc-on-fault buffers
>   drm/panthor: Allow kernel BOs to pass DRM_PANTHOR_BO_ALLOC_ON_FAULT
>   drm/panthor: Add a panthor_vm_pre_fault_range() helper
>   drm/panthor: Make heap chunk allocation non-blocking
>   drm/lima: Use drm_gem_shmem_sparse_backing for heap buffers
> 
>  drivers/gpu/drm/drm_gem.c               | 134 +++++++
>  drivers/gpu/drm/drm_gem_shmem_helper.c  | 404 +++++++++++++++++++-
>  drivers/gpu/drm/lima/lima_gem.c         |  89 ++---
>  drivers/gpu/drm/lima/lima_gem.h         |   1 +
>  drivers/gpu/drm/lima/lima_vm.c          |  48 ++-
>  drivers/gpu/drm/panfrost/panfrost_drv.c |   2 +-
>  drivers/gpu/drm/panfrost/panfrost_gem.c |  37 +-
>  drivers/gpu/drm/panfrost/panfrost_gem.h |   8 +-
>  drivers/gpu/drm/panfrost/panfrost_mmu.c |  98 ++---
>  drivers/gpu/drm/panfrost/panfrost_mmu.h |   2 +
>  drivers/gpu/drm/panthor/panthor_drv.c   |  20 +-
>  drivers/gpu/drm/panthor/panthor_fw.c    |   6 +-
>  drivers/gpu/drm/panthor/panthor_gem.c   |  18 +-
>  drivers/gpu/drm/panthor/panthor_gem.h   |  12 +-
>  drivers/gpu/drm/panthor/panthor_heap.c  | 222 ++++++-----
>  drivers/gpu/drm/panthor/panthor_mmu.c   | 481 ++++++++++++++++++------
>  drivers/gpu/drm/panthor/panthor_mmu.h   |   2 +
>  drivers/gpu/drm/panthor/panthor_sched.c |   6 +-
>  include/drm/drm_gem.h                   |  14 +
>  include/drm/drm_gem_shmem_helper.h      | 285 +++++++++++++-
>  include/uapi/drm/panthor_drm.h          |  19 +-
>  21 files changed, 1492 insertions(+), 416 deletions(-)
>

Christian König April 10, 2025, 3:05 p.m. UTC | #2

Hi Boris,

thanks for looping me in. Can you also send the full patch set to me since I don't see that on the mailing list (yet maybe).

Am 10.04.25 um 16:48 schrieb Boris Brezillon:
> +Christian, Alyssa and Faith, as suggested by Sima. I'll make sure to
> Cc you on v4, but before that, I'd like to get your opinion on the
> drm-gem/drm-gem-shmem changes to see if sending a v4 is actually
> desirable or if I should go back to the drawing board.
>
> On Fri,  4 Apr 2025 11:26:26 +0200
> Boris Brezillon <boris.brezillon@collabora.com> wrote:
>
>> This patch series is a proposal for implementing sparse page allocations
>> for shmem objects. It was initially motivated by a kind of BO managed by
>> the Panfrost/Panthor and Lima drivers, the tiler heap, which grows on
>> demand every time the GPU faults on a virtual address within the heap BO
>> range.

Oh, wait a second! GPU faults and DMA fences are usually fundamentally incompatible.

So stuff like filling in GEM objects on demand like you suggest here is usually seen as illegal. All resources must be pre-pinned before the submission.

Faulting is only legal when you have something like HMM, SVM or whatever you call it. And then you can just use a plain shmem object to provide you with backing pages.

I mean we could in theory allow faulting on GEM objects as well, but we would need to take very strict precautions on that we currently don't have as far as I know.

So could you explain how this works in the first place?

Regards,
Christian.


>>
>> Because keeping a struct page pointer array that can describe the entire
>> virtual range is wasteful when only a few backing pages have been
>> allocated, we thought a sparse allocation approach with xarrays was a
>> more efficient choice.
>>
>> This sparse GEM shmem features takes the form of a new optional
>> drm_gem_shmem_sparse_backing object that can be attached to a
>> drm_gem_shmem_object at creation time if the driver wants. When this
>> sparse GEM shmem backing mode is enabled, the driver is allow to
>> partially populate the GEM pages, either at use time, or at fault
>> time. The page population can be done progressively as the need for
>> more memory arises. The new APIs takes explicit gfp_t flags to solve
>> a long standing issue reported by Sima a while ago: all allocations
>> happening in the fault handler path shouldn't block.
>>
>> We also introduce two new helpers at the drm_gem.{c,h} level to
>> automate the partial xarray population which the GEM-SHMEM logic
>> relies on to populate its sparse page array.
>>
>> A few details about the implementation now:
>>
>> - Sparse GEM backing locking is deferred to the driver, because
>>   we can't take the resv locks in the GPU fault handler path, and
>>   since the page population can come from there, we have to let
>>   the driver decide how to lock.
>> - The pin/get semantics for sparse GEM shmem objects is different,
>>   in that it doesn't populate the pages, but just flag them as
>>   being used/required by some GPU component. The page population
>>   will happen explicitly at GEM creation time or when a GPU fault
>>   happens. Once pages have been populated, they won't disappear
>>   until the GEM object is destroyed, purged or evicted. This means
>>   you can grow, but not shrink. If we ever need to discard
>>   BO ranges, the API can be extended to allow it, but it's not
>>   something we needed for the Lima/Panthor/Panfrost case.
>> - Panthor and Panfrost changes have been tested, but not extensively.
>>   Lima changes have only been compile-tested.
>>
>> Changes from v2:
>> - We decided to focus more on the DRM aspects and forget about the
>>   sgt building optimizations (sgt helpers taking an xarray instead of
>>   a plain array). If the xarray -> array copies happening in that
>>   path ever become the bottleneck, we can resurrect those changes.
>> - We decided to make the sparse GEM changes less invasive by making
>>   them an extra layer on top of drm_gem_shmem_object. What this means
>>   is that sparse GEM shmem can now be used as regular GEM shmem
>>   objects (vmap, pin, export, ... all work as they would on a regular
>>   GEM). When that happens, we just force all pages to be populated,
>>   so we kinda lose the benefit of sparse GEMs, but that's fine, because
>>   in practice, such objects shouldn't be manipulated as regular GEMs.
>>   It just serves as a safety guard to limit the risk of regressions
>>   introduced by these sparse GEM shmem changes.
>>
>> Discussion of previus revision can be found here[2][3].
>>
>> For those used to review gitlab MRs, here's a link [1] to a Draft
>> MR grouping those changes, but I'm in no way asking that the review
>> happens on gitlab.
>>
>> Regards,
>>
>> Boris
>>
>> [1]https://gitlab.freedesktop.org/panfrost/linux/-/merge_requests/16
>> [2]https://lore.kernel.org/lkml/20250326021433.772196-1-adrian.larumbe@collabora.com/T/
>> [3]https://lore.kernel.org/dri-devel/20250218232552.3450939-1-adrian.larumbe@collabora.com/
>>
>> Adrián Larumbe (1):
>>   drm/gem: Add helpers to request a range of pages on a GEM
>>
>> Boris Brezillon (7):
>>   drm/gem-shmem: Support sparse backing
>>   drm/panfrost: Switch to sparse gem shmem to implement our
>>     alloc-on-fault
>>   drm/panthor: Add support for alloc-on-fault buffers
>>   drm/panthor: Allow kernel BOs to pass DRM_PANTHOR_BO_ALLOC_ON_FAULT
>>   drm/panthor: Add a panthor_vm_pre_fault_range() helper
>>   drm/panthor: Make heap chunk allocation non-blocking
>>   drm/lima: Use drm_gem_shmem_sparse_backing for heap buffers
>>
>>  drivers/gpu/drm/drm_gem.c               | 134 +++++++
>>  drivers/gpu/drm/drm_gem_shmem_helper.c  | 404 +++++++++++++++++++-
>>  drivers/gpu/drm/lima/lima_gem.c         |  89 ++---
>>  drivers/gpu/drm/lima/lima_gem.h         |   1 +
>>  drivers/gpu/drm/lima/lima_vm.c          |  48 ++-
>>  drivers/gpu/drm/panfrost/panfrost_drv.c |   2 +-
>>  drivers/gpu/drm/panfrost/panfrost_gem.c |  37 +-
>>  drivers/gpu/drm/panfrost/panfrost_gem.h |   8 +-
>>  drivers/gpu/drm/panfrost/panfrost_mmu.c |  98 ++---
>>  drivers/gpu/drm/panfrost/panfrost_mmu.h |   2 +
>>  drivers/gpu/drm/panthor/panthor_drv.c   |  20 +-
>>  drivers/gpu/drm/panthor/panthor_fw.c    |   6 +-
>>  drivers/gpu/drm/panthor/panthor_gem.c   |  18 +-
>>  drivers/gpu/drm/panthor/panthor_gem.h   |  12 +-
>>  drivers/gpu/drm/panthor/panthor_heap.c  | 222 ++++++-----
>>  drivers/gpu/drm/panthor/panthor_mmu.c   | 481 ++++++++++++++++++------
>>  drivers/gpu/drm/panthor/panthor_mmu.h   |   2 +
>>  drivers/gpu/drm/panthor/panthor_sched.c |   6 +-
>>  include/drm/drm_gem.h                   |  14 +
>>  include/drm/drm_gem_shmem_helper.h      | 285 +++++++++++++-
>>  include/uapi/drm/panthor_drm.h          |  19 +-
>>  21 files changed, 1492 insertions(+), 416 deletions(-)
>>

Boris Brezillon April 10, 2025, 3:53 p.m. UTC | #3

Hi Christian,

On Thu, 10 Apr 2025 17:05:07 +0200
Christian König <christian.koenig@amd.com> wrote:

> Hi Boris,
> 
> thanks for looping me in. Can you also send the full patch set to me since I don't see that on the mailing list (yet maybe).
> 
> Am 10.04.25 um 16:48 schrieb Boris Brezillon:
> > +Christian, Alyssa and Faith, as suggested by Sima. I'll make sure to
> > Cc you on v4, but before that, I'd like to get your opinion on the
> > drm-gem/drm-gem-shmem changes to see if sending a v4 is actually
> > desirable or if I should go back to the drawing board.
> >
> > On Fri,  4 Apr 2025 11:26:26 +0200
> > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> >  
> >> This patch series is a proposal for implementing sparse page allocations
> >> for shmem objects. It was initially motivated by a kind of BO managed by
> >> the Panfrost/Panthor and Lima drivers, the tiler heap, which grows on
> >> demand every time the GPU faults on a virtual address within the heap BO
> >> range.  
> 
> Oh, wait a second! GPU faults and DMA fences are usually fundamentally incompatible.
> 
> So stuff like filling in GEM objects on demand like you suggest here is usually seen as illegal. All resources must be pre-pinned before the submission.

Unfortunately, that's already how it's done in lima, panfrost and
panthor.

> 
> Faulting is only legal when you have something like HMM, SVM or whatever you call it. And then you can just use a plain shmem object to provide you with backing pages.
> 
> I mean we could in theory allow faulting on GEM objects as well, but we would need to take very strict precautions on that we currently don't have as far as I know.

We only use this mechanism for very specific allocations: tiler memory
whose maximum size can't be guessed upfront because tile binning is by
nature unpredictible.

> 
> So could you explain how this works in the first place?

I can explain you how this works in Panthor, yes. You get an initial
amount of memory that the tiler can use, when it runs out of memory, it
will first ask the system for more memory, if the allocation fails, it
will fallback to what they call "incremental rendering", where the
already binned primitives are flushed to the FB in order to free memory,
and the rendering starts over from there, with the memory that has been
freed.

In Panthor, this on-demand allocation scheme is something that allows
us to speed-up rendering when there's memory available, but we can make
progress when that's not the case, hence the failable allocation scheme
I'm proposing here.

In Panfrost and Lima, we don't have this concept of "incremental
rendering", so when we fail the allocation, we just fail the GPU job
with an unhandled GPU fault. And that's how it is today, the
alloc-on-fault mechanism is being used in at least 3 drivers, and all
I'm trying to do here is standardize it and try to document the
constraints that comes with this model, constraint that are currently
being ignored. Like the fact allocations in the fault handler path
shouldn't block so we're guaranteed to signal the job fence in finite
time, and we don't risk a deadlock between the driver shrinker and the
job triggering the fault.

I'm well aware of the implications of what I'm proposing here, but
ignoring the fact some drivers already violate the rules don't make them
disappear. So I'd rather work with you and others to clearly state what
the alloc-in-fault-path rules are, and enforce them in some helper
functions, than pretend these ugly things don't exist. :D

Regards,

Boris

Christian König April 10, 2025, 4:43 p.m. UTC | #4

Hi Boris,

Am 10.04.25 um 17:53 schrieb Boris Brezillon:
> Hi Christian,
>
> On Thu, 10 Apr 2025 17:05:07 +0200
> Christian König <christian.koenig@amd.com> wrote:
>
>> Hi Boris,
>>
>> thanks for looping me in. Can you also send the full patch set to me since I don't see that on the mailing list (yet maybe).
>>
>> Am 10.04.25 um 16:48 schrieb Boris Brezillon:
>>> +Christian, Alyssa and Faith, as suggested by Sima. I'll make sure to
>>> Cc you on v4, but before that, I'd like to get your opinion on the
>>> drm-gem/drm-gem-shmem changes to see if sending a v4 is actually
>>> desirable or if I should go back to the drawing board.
>>>
>>> On Fri,  4 Apr 2025 11:26:26 +0200
>>> Boris Brezillon <boris.brezillon@collabora.com> wrote:
>>>  
>>>> This patch series is a proposal for implementing sparse page allocations
>>>> for shmem objects. It was initially motivated by a kind of BO managed by
>>>> the Panfrost/Panthor and Lima drivers, the tiler heap, which grows on
>>>> demand every time the GPU faults on a virtual address within the heap BO
>>>> range.  
>> Oh, wait a second! GPU faults and DMA fences are usually fundamentally incompatible.
>>
>> So stuff like filling in GEM objects on demand like you suggest here is usually seen as illegal. All resources must be pre-pinned before the submission.
> Unfortunately, that's already how it's done in lima, panfrost and
> panthor.
>
>> Faulting is only legal when you have something like HMM, SVM or whatever you call it. And then you can just use a plain shmem object to provide you with backing pages.
>>
>> I mean we could in theory allow faulting on GEM objects as well, but we would need to take very strict precautions on that we currently don't have as far as I know.
> We only use this mechanism for very specific allocations: tiler memory
> whose maximum size can't be guessed upfront because tile binning is by
> nature unpredictible.
>
>> So could you explain how this works in the first place?
> I can explain you how this works in Panthor, yes. You get an initial
> amount of memory that the tiler can use, when it runs out of memory, it
> will first ask the system for more memory, if the allocation fails, it
> will fallback to what they call "incremental rendering", where the
> already binned primitives are flushed to the FB in order to free memory,
> and the rendering starts over from there, with the memory that has been
> freed.
>
> In Panthor, this on-demand allocation scheme is something that allows
> us to speed-up rendering when there's memory available, but we can make
> progress when that's not the case, hence the failable allocation scheme
> I'm proposing here.

Puh, that sounds like it can potentially work but is also very very fragile.

You must have a code comment when you select the GFP flags how and why that works.

> In Panfrost and Lima, we don't have this concept of "incremental
> rendering", so when we fail the allocation, we just fail the GPU job
> with an unhandled GPU fault.

To be honest I think that this is enough to mark those two drivers as broken.  It's documented that this approach is a no-go for upstream drivers.

How widely is that used?

> And that's how it is today, the
> alloc-on-fault mechanism is being used in at least 3 drivers, and all
> I'm trying to do here is standardize it and try to document the
> constraints that comes with this model, constraint that are currently
> being ignored. Like the fact allocations in the fault handler path
> shouldn't block so we're guaranteed to signal the job fence in finite
> time, and we don't risk a deadlock between the driver shrinker and the
> job triggering the fault.

Well on one hand it's good to that we document the pitfalls, but I clearly think that we should *not* spread that into common code.

That would give the impression that this actually works and to be honest I should start to charge money to rejecting stuff like that. It would probably get me rich.

> I'm well aware of the implications of what I'm proposing here, but
> ignoring the fact some drivers already violate the rules don't make them
> disappear. So I'd rather work with you and others to clearly state what
> the alloc-in-fault-path rules are, and enforce them in some helper
> functions, than pretend these ugly things don't exist. :D

Yeah, but this kind of says to others it's ok to do this which clearly isn't as far as I can see.

What we should do instead is to add this as not ok approaches to the documentation and move on.

Regards,
Christian.

>
> Regards,
>
> Boris

Boris Brezillon April 10, 2025, 5:20 p.m. UTC | #5

Hi Christian,

On Thu, 10 Apr 2025 18:43:29 +0200
Christian König <christian.koenig@amd.com> wrote:

> Hi Boris,
> 
> Am 10.04.25 um 17:53 schrieb Boris Brezillon:
> > Hi Christian,
> >
> > On Thu, 10 Apr 2025 17:05:07 +0200
> > Christian König <christian.koenig@amd.com> wrote:
> >  
> >> Hi Boris,
> >>
> >> thanks for looping me in. Can you also send the full patch set to
> >> me since I don't see that on the mailing list (yet maybe).
> >>
> >> Am 10.04.25 um 16:48 schrieb Boris Brezillon:  
> >>> +Christian, Alyssa and Faith, as suggested by Sima. I'll make
> >>> sure to Cc you on v4, but before that, I'd like to get your
> >>> opinion on the drm-gem/drm-gem-shmem changes to see if sending a
> >>> v4 is actually desirable or if I should go back to the drawing
> >>> board.
> >>>
> >>> On Fri,  4 Apr 2025 11:26:26 +0200
> >>> Boris Brezillon <boris.brezillon@collabora.com> wrote:
> >>>    
> >>>> This patch series is a proposal for implementing sparse page
> >>>> allocations for shmem objects. It was initially motivated by a
> >>>> kind of BO managed by the Panfrost/Panthor and Lima drivers, the
> >>>> tiler heap, which grows on demand every time the GPU faults on a
> >>>> virtual address within the heap BO range.    
> >> Oh, wait a second! GPU faults and DMA fences are usually
> >> fundamentally incompatible.
> >>
> >> So stuff like filling in GEM objects on demand like you suggest
> >> here is usually seen as illegal. All resources must be pre-pinned
> >> before the submission.  
> > Unfortunately, that's already how it's done in lima, panfrost and
> > panthor.
> >  
> >> Faulting is only legal when you have something like HMM, SVM or
> >> whatever you call it. And then you can just use a plain shmem
> >> object to provide you with backing pages.
> >>
> >> I mean we could in theory allow faulting on GEM objects as well,
> >> but we would need to take very strict precautions on that we
> >> currently don't have as far as I know.  
> > We only use this mechanism for very specific allocations: tiler
> > memory whose maximum size can't be guessed upfront because tile
> > binning is by nature unpredictible.
> >  
> >> So could you explain how this works in the first place?  
> > I can explain you how this works in Panthor, yes. You get an initial
> > amount of memory that the tiler can use, when it runs out of
> > memory, it will first ask the system for more memory, if the
> > allocation fails, it will fallback to what they call "incremental
> > rendering", where the already binned primitives are flushed to the
> > FB in order to free memory, and the rendering starts over from
> > there, with the memory that has been freed.
> >
> > In Panthor, this on-demand allocation scheme is something that
> > allows us to speed-up rendering when there's memory available, but
> > we can make progress when that's not the case, hence the failable
> > allocation scheme I'm proposing here.  
> 
> Puh, that sounds like it can potentially work but is also very very
> fragile.
> 
> You must have a code comment when you select the GFP flags how and
> why that works.

+	/* We want non-blocking allocations, if we're OOM, we just fail the job
+	 * to unblock things.
+	 */
+	ret = drm_gem_shmem_sparse_populate_range(&bo->base, page_offset,
+						  NUM_FAULT_PAGES, page_gfp,
+						  __GFP_NORETRY | __GFP_NOWARN);

That's what I have right now in the failable allocation path. The
tiler chunk allocation uses the same flags, but doesn't have a
comment explaining that a fallback exists when the allocation fails.

> 
> > In Panfrost and Lima, we don't have this concept of "incremental
> > rendering", so when we fail the allocation, we just fail the GPU job
> > with an unhandled GPU fault.  
> 
> To be honest I think that this is enough to mark those two drivers as
> broken.  It's documented that this approach is a no-go for upstream
> drivers.
> 
> How widely is that used?

It exists in lima and panfrost, and I wouldn't be surprised if a similar
mechanism was used in other drivers for tiler-based GPUs (etnaviv,
freedreno, powervr, ...), because ultimately that's how tilers work:
the amount of memory needed to store per-tile primitives (and metadata)
depends on what the geometry pipeline feeds the tiler with, and that
can't be predicted. If you over-provision, that's memory the system won't
be able to use while rendering takes place, even though only a small
portion might actually be used by the GPU. If your allocation is too
small, it will either trigger a GPU fault (for HW not supporting an
"incremental rendering" mode) or under-perform (because flushing
primitives has a huge cost on tilers).

Calling these drivers broken without having a plan to fix them is
also not option.

> 
> > And that's how it is today, the
> > alloc-on-fault mechanism is being used in at least 3 drivers, and
> > all I'm trying to do here is standardize it and try to document the
> > constraints that comes with this model, constraint that are
> > currently being ignored. Like the fact allocations in the fault
> > handler path shouldn't block so we're guaranteed to signal the job
> > fence in finite time, and we don't risk a deadlock between the
> > driver shrinker and the job triggering the fault.  
> 
> Well on one hand it's good to that we document the pitfalls, but I
> clearly think that we should *not* spread that into common code.

Ack on not encouraging people to use that; but having a clear path
describing how this should be done for HW that don't have other
options, with helpers and extensive doc is IMHO better than letting
new drivers reproduce the mistake that were done in the past.
Because, if you tell people "don't do that", but don't have an
alternative to propose, they'll end up doing it anyway.

> 
> That would give the impression that this actually works and to be
> honest I should start to charge money to rejecting stuff like that.
> It would probably get me rich.
> 
> > I'm well aware of the implications of what I'm proposing here, but
> > ignoring the fact some drivers already violate the rules don't make
> > them disappear. So I'd rather work with you and others to clearly
> > state what the alloc-in-fault-path rules are, and enforce them in
> > some helper functions, than pretend these ugly things don't exist.
> > :D  
> 
> Yeah, but this kind of says to others it's ok to do this which
> clearly isn't as far as I can see.

Not really. At least not if we properly review any driver that use
these APIs, and clearly state in the doc that this is provided as a
last resort for HW that can't do without on-fault-allocation, because
they are designed to work this way.

> 
> What we should do instead is to add this as not ok approaches to the
> documentation and move on.

Well, that's what happened with panfrost, lima and panthor, and see
where we are now: 3 drivers that you consider broken (probably
rightfully), and more to come if we don't come up with a plan for HW
that have the same requirements (as I said, I wouldn't be surprised
if most tilers were relying on the same kind of on-demand-allocation).

As always, I appreciate your valuable inputs, and would be happy to
try anything you think might be more adapted than what is proposed
here, but saying "this is broken HW/driver, so let's ignore it" is
not really an option, at least if you don't want the bad design
pattern to spread.

Regards,

Boris

Alyssa Rosenzweig April 10, 2025, 6:01 p.m. UTC | #6

> > > In Panfrost and Lima, we don't have this concept of "incremental
> > > rendering", so when we fail the allocation, we just fail the GPU job
> > > with an unhandled GPU fault.  
> > 
> > To be honest I think that this is enough to mark those two drivers as
> > broken.  It's documented that this approach is a no-go for upstream
> > drivers.
> > 
> > How widely is that used?
> 
> It exists in lima and panfrost, and I wouldn't be surprised if a similar
> mechanism was used in other drivers for tiler-based GPUs (etnaviv,
> freedreno, powervr, ...), because ultimately that's how tilers work:
> the amount of memory needed to store per-tile primitives (and metadata)
> depends on what the geometry pipeline feeds the tiler with, and that
> can't be predicted. If you over-provision, that's memory the system won't
> be able to use while rendering takes place, even though only a small
> portion might actually be used by the GPU. If your allocation is too
> small, it will either trigger a GPU fault (for HW not supporting an
> "incremental rendering" mode) or under-perform (because flushing
> primitives has a huge cost on tilers).

Yes and no.

Although we can't allocate more memory for /this/ frame, we know the
required size is probably constant across its lifetime. That gives a
simple heuristic to manage the tiler heap efficiently without
allocations - even fallible ones - in the fence signal path:

* Start with a small fixed size tiler heap
* Try to render, let incremental rendering kick in when it's too small.
* When cleaning up the job, check if we used incremental rendering.
* If we did - double the size of the heap the next time we submit work.

The tiler heap still grows dynamically - it just does so over the span
of a couple frames. In practice that means a tiny hit to startup time as
we dynamically figure out the right size, incurring extra flushing at
the start, without needing any "grow-on-page-fault" heroics.

This should solve the problem completely for CSF/panthor. So it's only
hardware that architecturally cannot do incremental rendering (older
Mali: panfrost/lima) where we need this mess.

Boris Brezillon April 10, 2025, 6:41 p.m. UTC | #7

On Thu, 10 Apr 2025 14:01:03 -0400
Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote:

> > > > In Panfrost and Lima, we don't have this concept of "incremental
> > > > rendering", so when we fail the allocation, we just fail the GPU job
> > > > with an unhandled GPU fault.    
> > > 
> > > To be honest I think that this is enough to mark those two drivers as
> > > broken.  It's documented that this approach is a no-go for upstream
> > > drivers.
> > > 
> > > How widely is that used?  
> > 
> > It exists in lima and panfrost, and I wouldn't be surprised if a similar
> > mechanism was used in other drivers for tiler-based GPUs (etnaviv,
> > freedreno, powervr, ...), because ultimately that's how tilers work:
> > the amount of memory needed to store per-tile primitives (and metadata)
> > depends on what the geometry pipeline feeds the tiler with, and that
> > can't be predicted. If you over-provision, that's memory the system won't
> > be able to use while rendering takes place, even though only a small
> > portion might actually be used by the GPU. If your allocation is too
> > small, it will either trigger a GPU fault (for HW not supporting an
> > "incremental rendering" mode) or under-perform (because flushing
> > primitives has a huge cost on tilers).  
> 
> Yes and no.
> 
> Although we can't allocate more memory for /this/ frame, we know the
> required size is probably constant across its lifetime. That gives a
> simple heuristic to manage the tiler heap efficiently without
> allocations - even fallible ones - in the fence signal path:
> 
> * Start with a small fixed size tiler heap
> * Try to render, let incremental rendering kick in when it's too small.
> * When cleaning up the job, check if we used incremental rendering.
> * If we did - double the size of the heap the next time we submit work.
> 
> The tiler heap still grows dynamically - it just does so over the span
> of a couple frames. In practice that means a tiny hit to startup time as
> we dynamically figure out the right size, incurring extra flushing at
> the start, without needing any "grow-on-page-fault" heroics.
> 
> This should solve the problem completely for CSF/panthor. So it's only
> hardware that architecturally cannot do incremental rendering (older
> Mali: panfrost/lima) where we need this mess.

OTOH, if we need something
for Utgard(Lima)/Midgard/Bifrost/Valhall(Panfrost), why not use the same
thing for CSF, since CSF is arguably the sanest of all the HW
architectures listed above: allocation can fail/be non-blocking,
because there's a fallback to incremental rendering when it fails.

Christian König April 10, 2025, 6:52 p.m. UTC | #8

Am 10.04.25 um 19:20 schrieb Boris Brezillon:
> [SNIP]
>>>> Faulting is only legal when you have something like HMM, SVM or
>>>> whatever you call it. And then you can just use a plain shmem
>>>> object to provide you with backing pages.
>>>>
>>>> I mean we could in theory allow faulting on GEM objects as well,
>>>> but we would need to take very strict precautions on that we
>>>> currently don't have as far as I know.  
>>> We only use this mechanism for very specific allocations: tiler
>>> memory whose maximum size can't be guessed upfront because tile
>>> binning is by nature unpredictible.
>>>  
>>>> So could you explain how this works in the first place?  
>>> I can explain you how this works in Panthor, yes. You get an initial
>>> amount of memory that the tiler can use, when it runs out of
>>> memory, it will first ask the system for more memory, if the
>>> allocation fails, it will fallback to what they call "incremental
>>> rendering", where the already binned primitives are flushed to the
>>> FB in order to free memory, and the rendering starts over from
>>> there, with the memory that has been freed.
>>>
>>> In Panthor, this on-demand allocation scheme is something that
>>> allows us to speed-up rendering when there's memory available, but
>>> we can make progress when that's not the case, hence the failable
>>> allocation scheme I'm proposing here.  
>> Puh, that sounds like it can potentially work but is also very very
>> fragile.
>>
>> You must have a code comment when you select the GFP flags how and
>> why that works.
> +	/* We want non-blocking allocations, if we're OOM, we just fail the job
> +	 * to unblock things.
> +	 */
> +	ret = drm_gem_shmem_sparse_populate_range(&bo->base, page_offset,
> +						  NUM_FAULT_PAGES, page_gfp,
> +						  __GFP_NORETRY | __GFP_NOWARN);
>
> That's what I have right now in the failable allocation path. The
> tiler chunk allocation uses the same flags, but doesn't have a
> comment explaining that a fallback exists when the allocation fails.

We agreed to use GFP_NOWAIT for such types of allocations to at least wake up kswapd on the low water mark.

IIRC even using __GFP_NORETRY here was illegal, but I need to double check the discussions again.

From the comment this at minimum needs an explanation what influence this has on the submission and that you can still guarantee fence forward progress.

>>> In Panfrost and Lima, we don't have this concept of "incremental
>>> rendering", so when we fail the allocation, we just fail the GPU job
>>> with an unhandled GPU fault.  
>> To be honest I think that this is enough to mark those two drivers as
>> broken.  It's documented that this approach is a no-go for upstream
>> drivers.
>>
>> How widely is that used?
> It exists in lima and panfrost, and I wouldn't be surprised if a similar
> mechanism was used in other drivers for tiler-based GPUs (etnaviv,
> freedreno, powervr, ...), because ultimately that's how tilers work:
> the amount of memory needed to store per-tile primitives (and metadata)
> depends on what the geometry pipeline feeds the tiler with, and that
> can't be predicted. If you over-provision, that's memory the system won't
> be able to use while rendering takes place, even though only a small
> portion might actually be used by the GPU. If your allocation is too
> small, it will either trigger a GPU fault (for HW not supporting an
> "incremental rendering" mode) or under-perform (because flushing
> primitives has a huge cost on tilers).
>
> Calling these drivers broken without having a plan to fix them is
> also not option.

Yeah, agree we need to have some kind of alternative. It's just that at the moment I can't think of any.

>>> And that's how it is today, the
>>> alloc-on-fault mechanism is being used in at least 3 drivers, and
>>> all I'm trying to do here is standardize it and try to document the
>>> constraints that comes with this model, constraint that are
>>> currently being ignored. Like the fact allocations in the fault
>>> handler path shouldn't block so we're guaranteed to signal the job
>>> fence in finite time, and we don't risk a deadlock between the
>>> driver shrinker and the job triggering the fault.  
>> Well on one hand it's good to that we document the pitfalls, but I
>> clearly think that we should *not* spread that into common code.
> Ack on not encouraging people to use that; but having a clear path
> describing how this should be done for HW that don't have other
> options, with helpers and extensive doc is IMHO better than letting
> new drivers reproduce the mistake that were done in the past.
> Because, if you tell people "don't do that", but don't have an
> alternative to propose, they'll end up doing it anyway.

Just to be clear: We have already completely rejected code from going upstream because of that!

And we are not talking about anything small, but rather a full blown framework and every developed by a major player.

Additional to that both i915 and amdgpu had code which used this approach as well and we reverted back because it simply doesn't work reliable.

>> That would give the impression that this actually works and to be
>> honest I should start to charge money to rejecting stuff like that.
>> It would probably get me rich.
>>
>>> I'm well aware of the implications of what I'm proposing here, but
>>> ignoring the fact some drivers already violate the rules don't make
>>> them disappear. So I'd rather work with you and others to clearly
>>> state what the alloc-in-fault-path rules are, and enforce them in
>>> some helper functions, than pretend these ugly things don't exist.
>>> :D  
>> Yeah, but this kind of says to others it's ok to do this which
>> clearly isn't as far as I can see.
> Not really. At least not if we properly review any driver that use
> these APIs, and clearly state in the doc that this is provided as a
> last resort for HW that can't do without on-fault-allocation, because
> they are designed to work this way.
>
>> What we should do instead is to add this as not ok approaches to the
>> documentation and move on.
> Well, that's what happened with panfrost, lima and panthor, and see
> where we are now: 3 drivers that you consider broken (probably
> rightfully), and more to come if we don't come up with a plan for HW
> that have the same requirements (as I said, I wouldn't be surprised
> if most tilers were relying on the same kind of on-demand-allocation).

Well we have HW features in major drivers which we simply don't use because of that.

> As always, I appreciate your valuable inputs, and would be happy to
> try anything you think might be more adapted than what is proposed
> here, but saying "this is broken HW/driver, so let's ignore it" is
> not really an option, at least if you don't want the bad design
> pattern to spread.

Yeah, I'm not sure what to do either. We *know* from experience that this approach will fail.

We have already tried that.

Regards,
Christian.


>
> Regards,
>
> Boris

Christian König April 11, 2025, 8:04 a.m. UTC | #9

Am 10.04.25 um 20:41 schrieb Boris Brezillon:
> On Thu, 10 Apr 2025 14:01:03 -0400
> Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote:
>
>>>>> In Panfrost and Lima, we don't have this concept of "incremental
>>>>> rendering", so when we fail the allocation, we just fail the GPU job
>>>>> with an unhandled GPU fault.    
>>>> To be honest I think that this is enough to mark those two drivers as
>>>> broken.  It's documented that this approach is a no-go for upstream
>>>> drivers.
>>>>
>>>> How widely is that used?  
>>> It exists in lima and panfrost, and I wouldn't be surprised if a similar
>>> mechanism was used in other drivers for tiler-based GPUs (etnaviv,
>>> freedreno, powervr, ...), because ultimately that's how tilers work:
>>> the amount of memory needed to store per-tile primitives (and metadata)
>>> depends on what the geometry pipeline feeds the tiler with, and that
>>> can't be predicted. If you over-provision, that's memory the system won't
>>> be able to use while rendering takes place, even though only a small
>>> portion might actually be used by the GPU. If your allocation is too
>>> small, it will either trigger a GPU fault (for HW not supporting an
>>> "incremental rendering" mode) or under-perform (because flushing
>>> primitives has a huge cost on tilers).  
>> Yes and no.
>>
>> Although we can't allocate more memory for /this/ frame, we know the
>> required size is probably constant across its lifetime. That gives a
>> simple heuristic to manage the tiler heap efficiently without
>> allocations - even fallible ones - in the fence signal path:
>>
>> * Start with a small fixed size tiler heap
>> * Try to render, let incremental rendering kick in when it's too small.
>> * When cleaning up the job, check if we used incremental rendering.
>> * If we did - double the size of the heap the next time we submit work.
>>
>> The tiler heap still grows dynamically - it just does so over the span
>> of a couple frames. In practice that means a tiny hit to startup time as
>> we dynamically figure out the right size, incurring extra flushing at
>> the start, without needing any "grow-on-page-fault" heroics.
>>
>> This should solve the problem completely for CSF/panthor. So it's only
>> hardware that architecturally cannot do incremental rendering (older
>> Mali: panfrost/lima) where we need this mess.
> OTOH, if we need something
> for Utgard(Lima)/Midgard/Bifrost/Valhall(Panfrost), why not use the same
> thing for CSF, since CSF is arguably the sanest of all the HW
> architectures listed above: allocation can fail/be non-blocking,
> because there's a fallback to incremental rendering when it fails.

Yeah that is a rather interesting point Alyssa noted here.

So basically you could as well implement it like this:
1. Userspace makes a submission.
2. HW finds buffer is not large enough, sets and error code and completes submission.
3. Userspace detects error, re-allocates buffer with increased size.
4. Userspace re-submits to incremental complete the submission.
5. Eventually repeat until fully completed.

That would work but is likely just not the most performant solution. So faulting in memory on demand is basically just an optimization and that is ok as far as I can see.

That is then a rather good justification for your work Boris. Because a common component makes it possible to implement a common fault injection functionality to make sure that the fallback path is properly exercised in testing.

Regards,
Christian.

Boris Brezillon April 11, 2025, 8:08 a.m. UTC | #10

Hi Christian,

On Thu, 10 Apr 2025 20:52:27 +0200
Christian König <christian.koenig@amd.com> wrote:

> Am 10.04.25 um 19:20 schrieb Boris Brezillon:
> > [SNIP]  
> >>>> Faulting is only legal when you have something like HMM, SVM or
> >>>> whatever you call it. And then you can just use a plain shmem
> >>>> object to provide you with backing pages.
> >>>>
> >>>> I mean we could in theory allow faulting on GEM objects as well,
> >>>> but we would need to take very strict precautions on that we
> >>>> currently don't have as far as I know.    
> >>> We only use this mechanism for very specific allocations: tiler
> >>> memory whose maximum size can't be guessed upfront because tile
> >>> binning is by nature unpredictible.
> >>>    
> >>>> So could you explain how this works in the first place?    
> >>> I can explain you how this works in Panthor, yes. You get an
> >>> initial amount of memory that the tiler can use, when it runs out
> >>> of memory, it will first ask the system for more memory, if the
> >>> allocation fails, it will fallback to what they call "incremental
> >>> rendering", where the already binned primitives are flushed to the
> >>> FB in order to free memory, and the rendering starts over from
> >>> there, with the memory that has been freed.
> >>>
> >>> In Panthor, this on-demand allocation scheme is something that
> >>> allows us to speed-up rendering when there's memory available, but
> >>> we can make progress when that's not the case, hence the failable
> >>> allocation scheme I'm proposing here.    
> >> Puh, that sounds like it can potentially work but is also very very
> >> fragile.
> >>
> >> You must have a code comment when you select the GFP flags how and
> >> why that works.  
> > +	/* We want non-blocking allocations, if we're OOM, we just
> > fail the job
> > +	 * to unblock things.
> > +	 */
> > +	ret = drm_gem_shmem_sparse_populate_range(&bo->base,
> > page_offset,
> > +						  NUM_FAULT_PAGES,
> > page_gfp,
> > +						  __GFP_NORETRY |
> > __GFP_NOWARN);
> >
> > That's what I have right now in the failable allocation path. The
> > tiler chunk allocation uses the same flags, but doesn't have a
> > comment explaining that a fallback exists when the allocation
> > fails.  
> 
> We agreed to use GFP_NOWAIT for such types of allocations to at least
> wake up kswapd on the low water mark.

Yeah, I wasn't sure about the flags to be honest. Hence the ping and
the very reason Sima asked me to Cc a few more people to look at those
changes.

> 
> IIRC even using __GFP_NORETRY here was illegal, but I need to double
> check the discussions again.

FWIW, it's been taken from i915 [1]

> 
> From the comment this at minimum needs an explanation what influence
> this has on the submission and that you can still guarantee fence
> forward progress.

Agreed.

> >>> And that's how it is today, the
> >>> alloc-on-fault mechanism is being used in at least 3 drivers, and
> >>> all I'm trying to do here is standardize it and try to document
> >>> the constraints that comes with this model, constraint that are
> >>> currently being ignored. Like the fact allocations in the fault
> >>> handler path shouldn't block so we're guaranteed to signal the job
> >>> fence in finite time, and we don't risk a deadlock between the
> >>> driver shrinker and the job triggering the fault.    
> >> Well on one hand it's good to that we document the pitfalls, but I
> >> clearly think that we should *not* spread that into common code.  
> > Ack on not encouraging people to use that; but having a clear path
> > describing how this should be done for HW that don't have other
> > options, with helpers and extensive doc is IMHO better than letting
> > new drivers reproduce the mistake that were done in the past.
> > Because, if you tell people "don't do that", but don't have an
> > alternative to propose, they'll end up doing it anyway.  
> 
> Just to be clear: We have already completely rejected code from going
> upstream because of that!

Apparently panfrost/panthor/lima didn't get enough scrutiny...

> 
> And we are not talking about anything small, but rather a full blown
> framework and every developed by a major player.
> 
> Additional to that both i915 and amdgpu had code which used this
> approach as well and we reverted back because it simply doesn't work
> reliable.

BTW, the sparse GEM-SHMEM code in itself doesn't necessarily imply
non-blocking allocation in the fault handler, though admittedly it's
been designed to allow it.

> 
> >> That would give the impression that this actually works and to be
> >> honest I should start to charge money to rejecting stuff like that.
> >> It would probably get me rich.
> >>  
> >>> I'm well aware of the implications of what I'm proposing here, but
> >>> ignoring the fact some drivers already violate the rules don't
> >>> make them disappear. So I'd rather work with you and others to
> >>> clearly state what the alloc-in-fault-path rules are, and enforce
> >>> them in some helper functions, than pretend these ugly things
> >>> don't exist. :D    
> >> Yeah, but this kind of says to others it's ok to do this which
> >> clearly isn't as far as I can see.  
> > Not really. At least not if we properly review any driver that use
> > these APIs, and clearly state in the doc that this is provided as a
> > last resort for HW that can't do without on-fault-allocation,
> > because they are designed to work this way.
> >  
> >> What we should do instead is to add this as not ok approaches to
> >> the documentation and move on.  
> > Well, that's what happened with panfrost, lima and panthor, and see
> > where we are now: 3 drivers that you consider broken (probably
> > rightfully), and more to come if we don't come up with a plan for HW
> > that have the same requirements (as I said, I wouldn't be surprised
> > if most tilers were relying on the same kind of
> > on-demand-allocation).  
> 
> Well we have HW features in major drivers which we simply don't use
> because of that.

Yeah, but in that case that's not really a HW feature we can switch
off/avoid using. It's at the core of how Mali GPUs work (at least
anything prior to gen10). And forcing all apps to fully allocate this
128MB buffer upfront (which might seem small for discrete GPUs, but
really isn't for systems where the memory can be as low as 1GB, and
shared with the rest of the system) isn't really an option. The only
alternative I see is finding a way to emulate incremental rendering on
older Mali GPUs, assuming that's even possible.

> 
> > As always, I appreciate your valuable inputs, and would be happy to
> > try anything you think might be more adapted than what is proposed
> > here, but saying "this is broken HW/driver, so let's ignore it" is
> > not really an option, at least if you don't want the bad design
> > pattern to spread.  
> 
> Yeah, I'm not sure what to do either. We *know* from experience that
> this approach will fail.

Can you, or someone who was involved in those attempts, sum-up the
problem that were attempted to be solved, and the walls you hit with
this non-blocking/failable allocation scheme (fell free to link me to
previous discussion so you don't have to type it again :-)).

FWIW, the problem I see with the current panfrost/panthor
on-fault-allocation scheme is the fact allocating GEMs with GFP_KERNEL
leads to unbounded waits, which might cause a job timeout. The dealock
between the panfrost shrinker and the job requesting memory doesn't
exist AFAICT, because panfrost can only reclaim objects that were
flagged as unused (MADVISE(DONTNEED)), and the allocation is done
without any of the locks that are taken by the panfrost shrinker held.
So yeah, jobs can fail because the system ran out of memory, the job
fence will be signalled with an error, and the system will just move
on. It's already what happens today, and userspace either ignores it
(because on older Mali GPUs, a fault on a job is just transient, and
doesn't prevent other jobs from executing) or knows how to recover from
it (on newer GPUs, the FW scheduling-context needs to be destroyed
re-created).

Just to be clear, I'm not saying this is good, I'm just describing what
is done today, so you get the whole picture.

Now, with panthor being in the process of adopting a transparent
reclaim mechanism (swapout on reclaim if the GPU context is idle,
swapin on next job), the deadlock between the panthor/gem-shmem
shrinker and the allocation in the fault path will probably surface, so
I do get why no-alloc-in-fault path, or at the very least,
non-blocking-alloc-in-fault-path is needed. And I really thought
non-blocking allocations would be allowed in that case. So please bare
with me, and tell me where my mistake is, because every time I think I
got it, I look at it again, and don't see the problem.

Regards,

Boris

[1]https://elixir.bootlin.com/linux/v6.13.7/source/drivers/gpu/drm/i915/gem/i915_gem_shmem.c#L96

Boris Brezillon April 11, 2025, 8:38 a.m. UTC | #11

On Fri, 11 Apr 2025 10:04:07 +0200
Christian König <christian.koenig@amd.com> wrote:

> Am 10.04.25 um 20:41 schrieb Boris Brezillon:
> > On Thu, 10 Apr 2025 14:01:03 -0400
> > Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote:
> >  
> >>>>> In Panfrost and Lima, we don't have this concept of "incremental
> >>>>> rendering", so when we fail the allocation, we just fail the
> >>>>> GPU job with an unhandled GPU fault.      
> >>>> To be honest I think that this is enough to mark those two
> >>>> drivers as broken.  It's documented that this approach is a
> >>>> no-go for upstream drivers.
> >>>>
> >>>> How widely is that used?    
> >>> It exists in lima and panfrost, and I wouldn't be surprised if a
> >>> similar mechanism was used in other drivers for tiler-based GPUs
> >>> (etnaviv, freedreno, powervr, ...), because ultimately that's how
> >>> tilers work: the amount of memory needed to store per-tile
> >>> primitives (and metadata) depends on what the geometry pipeline
> >>> feeds the tiler with, and that can't be predicted. If you
> >>> over-provision, that's memory the system won't be able to use
> >>> while rendering takes place, even though only a small portion
> >>> might actually be used by the GPU. If your allocation is too
> >>> small, it will either trigger a GPU fault (for HW not supporting
> >>> an "incremental rendering" mode) or under-perform (because
> >>> flushing primitives has a huge cost on tilers).    
> >> Yes and no.
> >>
> >> Although we can't allocate more memory for /this/ frame, we know
> >> the required size is probably constant across its lifetime. That
> >> gives a simple heuristic to manage the tiler heap efficiently
> >> without allocations - even fallible ones - in the fence signal
> >> path:
> >>
> >> * Start with a small fixed size tiler heap
> >> * Try to render, let incremental rendering kick in when it's too
> >> small.
> >> * When cleaning up the job, check if we used incremental rendering.
> >> * If we did - double the size of the heap the next time we submit
> >> work.
> >>
> >> The tiler heap still grows dynamically - it just does so over the
> >> span of a couple frames. In practice that means a tiny hit to
> >> startup time as we dynamically figure out the right size,
> >> incurring extra flushing at the start, without needing any
> >> "grow-on-page-fault" heroics.
> >>
> >> This should solve the problem completely for CSF/panthor. So it's
> >> only hardware that architecturally cannot do incremental rendering
> >> (older Mali: panfrost/lima) where we need this mess.  
> > OTOH, if we need something
> > for Utgard(Lima)/Midgard/Bifrost/Valhall(Panfrost), why not use the
> > same thing for CSF, since CSF is arguably the sanest of all the HW
> > architectures listed above: allocation can fail/be non-blocking,
> > because there's a fallback to incremental rendering when it fails.  
> 
> Yeah that is a rather interesting point Alyssa noted here.
> 
> So basically you could as well implement it like this:
> 1. Userspace makes a submission.
> 2. HW finds buffer is not large enough, sets and error code and
> completes submission. 3. Userspace detects error, re-allocates buffer
> with increased size. 4. Userspace re-submits to incremental complete
> the submission. 5. Eventually repeat until fully completed.
> 
> That would work but is likely just not the most performant solution.
> So faulting in memory on demand is basically just an optimization and
> that is ok as far as I can see.

Yeah, Alyssa's suggestion got me thinking too, and I think I can come
up with a plan where we try non-blocking allocation first, and if it
fails, we trigger incremental rendering, and queue a blocking
heap-chunk allocation on separate workqueue, such that next time the
tiler heap hits an OOM, it has a chunk (or multiple chunks) readily
available if the blocking allocation completed in the meantime. That's
basically what Alyssa suggested, with an optimization if the system is
not under memory pressure, and without userspace being involved (so no
uAPI changes).

I guess this leaves older GPUs that don't support incremental rendering
in a bad place though.

> 
> That is then a rather good justification for your work Boris. Because
> a common component makes it possible to implement a common fault
> injection functionality to make sure that the fallback path is
> properly exercised in testing.

I can also add an fault injection mechanism to validate that, yep.

Thanks,

Boris

Christian König April 11, 2025, 10:55 a.m. UTC | #12

Am 11.04.25 um 10:38 schrieb Boris Brezillon:
> On Fri, 11 Apr 2025 10:04:07 +0200
> Christian König <christian.koenig@amd.com> wrote:
>
>> Am 10.04.25 um 20:41 schrieb Boris Brezillon:
>>> On Thu, 10 Apr 2025 14:01:03 -0400
>>> Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote:
>>>  
>>>>>>> In Panfrost and Lima, we don't have this concept of "incremental
>>>>>>> rendering", so when we fail the allocation, we just fail the
>>>>>>> GPU job with an unhandled GPU fault.      
>>>>>> To be honest I think that this is enough to mark those two
>>>>>> drivers as broken.  It's documented that this approach is a
>>>>>> no-go for upstream drivers.
>>>>>>
>>>>>> How widely is that used?    
>>>>> It exists in lima and panfrost, and I wouldn't be surprised if a
>>>>> similar mechanism was used in other drivers for tiler-based GPUs
>>>>> (etnaviv, freedreno, powervr, ...), because ultimately that's how
>>>>> tilers work: the amount of memory needed to store per-tile
>>>>> primitives (and metadata) depends on what the geometry pipeline
>>>>> feeds the tiler with, and that can't be predicted. If you
>>>>> over-provision, that's memory the system won't be able to use
>>>>> while rendering takes place, even though only a small portion
>>>>> might actually be used by the GPU. If your allocation is too
>>>>> small, it will either trigger a GPU fault (for HW not supporting
>>>>> an "incremental rendering" mode) or under-perform (because
>>>>> flushing primitives has a huge cost on tilers).    
>>>> Yes and no.
>>>>
>>>> Although we can't allocate more memory for /this/ frame, we know
>>>> the required size is probably constant across its lifetime. That
>>>> gives a simple heuristic to manage the tiler heap efficiently
>>>> without allocations - even fallible ones - in the fence signal
>>>> path:
>>>>
>>>> * Start with a small fixed size tiler heap
>>>> * Try to render, let incremental rendering kick in when it's too
>>>> small.
>>>> * When cleaning up the job, check if we used incremental rendering.
>>>> * If we did - double the size of the heap the next time we submit
>>>> work.
>>>>
>>>> The tiler heap still grows dynamically - it just does so over the
>>>> span of a couple frames. In practice that means a tiny hit to
>>>> startup time as we dynamically figure out the right size,
>>>> incurring extra flushing at the start, without needing any
>>>> "grow-on-page-fault" heroics.
>>>>
>>>> This should solve the problem completely for CSF/panthor. So it's
>>>> only hardware that architecturally cannot do incremental rendering
>>>> (older Mali: panfrost/lima) where we need this mess.  
>>> OTOH, if we need something
>>> for Utgard(Lima)/Midgard/Bifrost/Valhall(Panfrost), why not use the
>>> same thing for CSF, since CSF is arguably the sanest of all the HW
>>> architectures listed above: allocation can fail/be non-blocking,
>>> because there's a fallback to incremental rendering when it fails.  
>> Yeah that is a rather interesting point Alyssa noted here.
>>
>> So basically you could as well implement it like this:
>> 1. Userspace makes a submission.
>> 2. HW finds buffer is not large enough, sets and error code and
>> completes submission. 3. Userspace detects error, re-allocates buffer
>> with increased size. 4. Userspace re-submits to incremental complete
>> the submission. 5. Eventually repeat until fully completed.
>>
>> That would work but is likely just not the most performant solution.
>> So faulting in memory on demand is basically just an optimization and
>> that is ok as far as I can see.
> Yeah, Alyssa's suggestion got me thinking too, and I think I can come
> up with a plan where we try non-blocking allocation first, and if it
> fails, we trigger incremental rendering, and queue a blocking
> heap-chunk allocation on separate workqueue, such that next time the
> tiler heap hits an OOM, it has a chunk (or multiple chunks) readily
> available if the blocking allocation completed in the meantime. That's
> basically what Alyssa suggested, with an optimization if the system is
> not under memory pressure, and without userspace being involved (so no
> uAPI changes).

That sounds like it most likely won't work. In an OOM situation the blocking allocation would just cause more pressure to complete your rendering to free up memory.

> I guess this leaves older GPUs that don't support incremental rendering
> in a bad place though.

Well what's the handling there currently? Just crash when you're OOM?

Regards,
Christian.

>
>> That is then a rather good justification for your work Boris. Because
>> a common component makes it possible to implement a common fault
>> injection functionality to make sure that the fallback path is
>> properly exercised in testing.
> I can also add an fault injection mechanism to validate that, yep.
>
> Thanks,
>
> Boris

Simona Vetter April 11, 2025, 12:01 p.m. UTC | #13

On Thu, Apr 10, 2025 at 08:41:55PM +0200, Boris Brezillon wrote:
> On Thu, 10 Apr 2025 14:01:03 -0400
> Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote:
> 
> > > > > In Panfrost and Lima, we don't have this concept of "incremental
> > > > > rendering", so when we fail the allocation, we just fail the GPU job
> > > > > with an unhandled GPU fault.    
> > > > 
> > > > To be honest I think that this is enough to mark those two drivers as
> > > > broken.  It's documented that this approach is a no-go for upstream
> > > > drivers.
> > > > 
> > > > How widely is that used?  
> > > 
> > > It exists in lima and panfrost, and I wouldn't be surprised if a similar
> > > mechanism was used in other drivers for tiler-based GPUs (etnaviv,
> > > freedreno, powervr, ...), because ultimately that's how tilers work:
> > > the amount of memory needed to store per-tile primitives (and metadata)
> > > depends on what the geometry pipeline feeds the tiler with, and that
> > > can't be predicted. If you over-provision, that's memory the system won't
> > > be able to use while rendering takes place, even though only a small
> > > portion might actually be used by the GPU. If your allocation is too
> > > small, it will either trigger a GPU fault (for HW not supporting an
> > > "incremental rendering" mode) or under-perform (because flushing
> > > primitives has a huge cost on tilers).  
> > 
> > Yes and no.
> > 
> > Although we can't allocate more memory for /this/ frame, we know the
> > required size is probably constant across its lifetime. That gives a
> > simple heuristic to manage the tiler heap efficiently without
> > allocations - even fallible ones - in the fence signal path:
> > 
> > * Start with a small fixed size tiler heap
> > * Try to render, let incremental rendering kick in when it's too small.
> > * When cleaning up the job, check if we used incremental rendering.
> > * If we did - double the size of the heap the next time we submit work.
> > 
> > The tiler heap still grows dynamically - it just does so over the span
> > of a couple frames. In practice that means a tiny hit to startup time as
> > we dynamically figure out the right size, incurring extra flushing at
> > the start, without needing any "grow-on-page-fault" heroics.
> > 
> > This should solve the problem completely for CSF/panthor. So it's only
> > hardware that architecturally cannot do incremental rendering (older
> > Mali: panfrost/lima) where we need this mess.
> 
> OTOH, if we need something
> for Utgard(Lima)/Midgard/Bifrost/Valhall(Panfrost), why not use the same
> thing for CSF, since CSF is arguably the sanest of all the HW
> architectures listed above: allocation can fail/be non-blocking,
> because there's a fallback to incremental rendering when it fails.

So this is a really horrible idea to sort this out for panfrost hardware,
which doesn't have a tiler cache flush as a fallback. It's roughly three
stages:

1. A pile of clever tricks to make the chances of running out of memory
really low. Most of these also make sense for panthor platforms, just as a
performance optimization.

2. I terrible way to handle the unavoidable VK_DEVICE_LOST, but in a way
such that the impact should be minimal. This is nasty, and we really want
to avoid that for panthor.

3. Mesa quirks so that 2 doesn't actually ever happen in practice.

1. Clever tricks
----------------

This is a cascade of tricks we can pull in the gpu fault handler:

1a. Allocate with GFP_NORECLAIM. We want this first because that triggers
  background reclaim, and that might be enough to get us through and free
  some easy caches (like clean fs cache and stuff like that which can just
  be dropped).

1b Userspace needs to guesstimate a good guess for how much we'll need. I'm
  hoping that between render target size and maybe counting the total
  amounts of vertices we can do a decent guesstimate for many workloads.
  Note that goal here is not to ensure success, but just to get the rough
  ballpark. The actual starting number here should aim fairly low, so that
  we avoid wasting memory since this is memory wasted on every context
  (that uses a feature which needs dynamic memory allocation, which I
  guess for pan* is everything, but for apple it would be more limited).

1c The kernel then keeps an additional global memory pool. Note this would
  not have the same semantics as mempool.h, which is aimed GFP_NOIO
  forward progress guarantees, but more as a preallocation pool. In every
  CS ioctl we'll make sure the pool is filled, and we probably want to
  size the pool relative to the context with the biggest dynamic memory
  usage. So probably this thing needs a shrinker, so we can reclaim it
  when you don't run an app with a huge buffer need on the gpu anymore.

  Note that we're still not sizing this to guarantee success, but together
  with the userspace heuristics it should be big enough to almost always
  work out. And since it's global reserve we can afford to waste a bit
  more memory on this one. We might also want to scale this pool by the
  total memory available, like the watermarks core mm computes. We'll only
  hang onto this memory when the gpu is in active usage, so this should be
  fine.

  Also the preallocation would need to happen without holding the memory
  pool look, so that we can use GFP_KERNEL.

Up to this point I think it's all tricks that panthor also wants to
employ.

1d Next up is scratch dynamic memory. If we can assume that the memory does
  not need to survive a batchbuffer (hopefully the case with vulkan render
  pass) we could steal such memory from other contexts. We could even do
  that for contexts which are queued but not yet running on the hardware
  (might need unloading them to be doable with fw renderers like
  panthor/CSF) as long as we keep such stolen dynamic memory on a separate
  free list. Because if we'd entirely free this, or release it into the
  memory pool we'll make things worse for these other contexts, we need to
  be able to guarantee that any context can always get all the stolen
  dynamic pages back before we start running it on the gpu.

Since the tracking for the memory pool in 1c and the stolen memory in 1d
has a bunch of tricky corner cases we probably want that as a drm helper
which globally does that book-keeping for all drm/sched instances. That
might even help if some NN accel thing also needs this on the same SoC.

1e Finally, if all else is lost we can try GFP_ATOMIC. This will eat into
  reserve memory pools the core mm maintains. And hopefully we've spent
  enough time between this step and the initial GFP_NORECLAIM that
  background reclaim had a chance to free some memory, hence why all the
  memory pool and memory stealing tricks should be in between.

The above will need quite some tuning to make sure it works as often as
possible, without wasting undue amounts of memory. It's classic memory
overcommit tuning.

2. Device Lost
--------------

At this point we're left with no other choice than to kill the context.
And userspace should be able to cope with VK_DEVICE_LOST (hopefully zink
does), but it will probably not cope well with an entire strom of these
just to get the first frame out.

Here comes the horrible trick:

We'll keep rendering the entire frame by just smashing one single reserve
page (per context) into the pte every time there's a fault. It will result
in total garbage, and we probably want to shot the context the moment the
VS stages have finished, but it allows us to collect an accurate estimate
of how much memory we'd have needed. We need to pass that to the vulkan
driver as part of the device lost processing, so that it can keep that as
the starting point for the userspace dynamic memory requirement
guesstimate as a lower bound. Together with the (scaled to that
requirement) gpu driver memory pool and the core mm watermarks, that
should allow us to not hit a device lost again hopefully.

I think if the CS ioctl both takes the current guesstimate from userspace,
and passes back whatever the current maximum known to the kernel is (we
need that anyway for the steps in stage 1 to make sense I think), then
that should also work for dead context where the CS ioctl returns -EIO or
something like that.

3. Mesa quirks
--------------

A database of the fixed dynamic memory requirements we get from step 2
(through bug reports), so that we can avoid that mess going forward.

If there's too many of those, we'll probably need to improve the
guesstimation 1b if it's systematically off by too much. We might even
need to store that on-disk, like shader caches, so that you get a crash
once and then it should work even without an explicit mesa quirk.

Documentation
-------------

Assuming this is not too terrible an idea and we reach some consensus, I
think it'd be good to bake this into a doc patch somewhere in the
dma_fence documentation. Also including Alyssa recipe for when you have
hardware support for flushing partial rendering. And maybe generalized so
it makes sense for the GS/streamout/... fun on apple hardware.

And perhaps with a section added that points to actual code to help make
this happen like the sparse bo support in this series.

Cheers, Sima

Boris Brezillon April 11, 2025, 12:02 p.m. UTC | #14

On Fri, 11 Apr 2025 12:55:57 +0200
Christian König <christian.koenig@amd.com> wrote:

> Am 11.04.25 um 10:38 schrieb Boris Brezillon:
> > On Fri, 11 Apr 2025 10:04:07 +0200
> > Christian König <christian.koenig@amd.com> wrote:
> >  
> >> Am 10.04.25 um 20:41 schrieb Boris Brezillon:  
> >>> On Thu, 10 Apr 2025 14:01:03 -0400
> >>> Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote:
> >>>    
> >>>>>>> In Panfrost and Lima, we don't have this concept of
> >>>>>>> "incremental rendering", so when we fail the allocation, we
> >>>>>>> just fail the GPU job with an unhandled GPU fault.        
> >>>>>> To be honest I think that this is enough to mark those two
> >>>>>> drivers as broken.  It's documented that this approach is a
> >>>>>> no-go for upstream drivers.
> >>>>>>
> >>>>>> How widely is that used?      
> >>>>> It exists in lima and panfrost, and I wouldn't be surprised if a
> >>>>> similar mechanism was used in other drivers for tiler-based GPUs
> >>>>> (etnaviv, freedreno, powervr, ...), because ultimately that's
> >>>>> how tilers work: the amount of memory needed to store per-tile
> >>>>> primitives (and metadata) depends on what the geometry pipeline
> >>>>> feeds the tiler with, and that can't be predicted. If you
> >>>>> over-provision, that's memory the system won't be able to use
> >>>>> while rendering takes place, even though only a small portion
> >>>>> might actually be used by the GPU. If your allocation is too
> >>>>> small, it will either trigger a GPU fault (for HW not supporting
> >>>>> an "incremental rendering" mode) or under-perform (because
> >>>>> flushing primitives has a huge cost on tilers).      
> >>>> Yes and no.
> >>>>
> >>>> Although we can't allocate more memory for /this/ frame, we know
> >>>> the required size is probably constant across its lifetime. That
> >>>> gives a simple heuristic to manage the tiler heap efficiently
> >>>> without allocations - even fallible ones - in the fence signal
> >>>> path:
> >>>>
> >>>> * Start with a small fixed size tiler heap
> >>>> * Try to render, let incremental rendering kick in when it's too
> >>>> small.
> >>>> * When cleaning up the job, check if we used incremental
> >>>> rendering.
> >>>> * If we did - double the size of the heap the next time we submit
> >>>> work.
> >>>>
> >>>> The tiler heap still grows dynamically - it just does so over the
> >>>> span of a couple frames. In practice that means a tiny hit to
> >>>> startup time as we dynamically figure out the right size,
> >>>> incurring extra flushing at the start, without needing any
> >>>> "grow-on-page-fault" heroics.
> >>>>
> >>>> This should solve the problem completely for CSF/panthor. So it's
> >>>> only hardware that architecturally cannot do incremental
> >>>> rendering (older Mali: panfrost/lima) where we need this mess.
> >>>>  
> >>> OTOH, if we need something
> >>> for Utgard(Lima)/Midgard/Bifrost/Valhall(Panfrost), why not use
> >>> the same thing for CSF, since CSF is arguably the sanest of all
> >>> the HW architectures listed above: allocation can fail/be
> >>> non-blocking, because there's a fallback to incremental rendering
> >>> when it fails.    
> >> Yeah that is a rather interesting point Alyssa noted here.
> >>
> >> So basically you could as well implement it like this:
> >> 1. Userspace makes a submission.
> >> 2. HW finds buffer is not large enough, sets and error code and
> >> completes submission. 3. Userspace detects error, re-allocates
> >> buffer with increased size. 4. Userspace re-submits to incremental
> >> complete the submission. 5. Eventually repeat until fully
> >> completed.
> >>
> >> That would work but is likely just not the most performant
> >> solution. So faulting in memory on demand is basically just an
> >> optimization and that is ok as far as I can see.  
> > Yeah, Alyssa's suggestion got me thinking too, and I think I can
> > come up with a plan where we try non-blocking allocation first, and
> > if it fails, we trigger incremental rendering, and queue a blocking
> > heap-chunk allocation on separate workqueue, such that next time the
> > tiler heap hits an OOM, it has a chunk (or multiple chunks) readily
> > available if the blocking allocation completed in the meantime.
> > That's basically what Alyssa suggested, with an optimization if the
> > system is not under memory pressure, and without userspace being
> > involved (so no uAPI changes).  
> 
> That sounds like it most likely won't work. In an OOM situation the
> blocking allocation would just cause more pressure to complete your
> rendering to free up memory.

Right. It could be deferred to the next job submission instead of being
queued immediately. My point being, userspace doesn't have to know
about it, because the kernel knows when a tiler OOM happened, and can
flag the buffer as "probably needs more space when you have an
opportunity to alloc more". It wouldn't be different from reporting it
to userspace and letting userspace explicitly grow the buffer, and it
avoids introducing a gap between old and new mesa.

> 
> > I guess this leaves older GPUs that don't support incremental
> > rendering in a bad place though.  
> 
> Well what's the handling there currently? Just crash when you're OOM?

It's "alloc(GFP_KERNEL) and crash if it fails or times out", yes.

Christian König April 11, 2025, 12:45 p.m. UTC | #15

Am 11.04.25 um 14:02 schrieb Boris Brezillon:
>>> I guess this leaves older GPUs that don't support incremental
>>> rendering in a bad place though.  
>> Well what's the handling there currently? Just crash when you're OOM?
> It's "alloc(GFP_KERNEL) and crash if it fails or times out", yes.

Oh, please absolutely don't! Using GFP_KERNEL here is as evil as it can be.

Background is that you don't get a crash, nor error message, nor anything indicating what is happening.

You just get a deadlocked system with the user wondering why the heck the system doesn't response any more.

Christian.

Christian König April 11, 2025, 12:50 p.m. UTC | #16

Am 11.04.25 um 14:01 schrieb Simona Vetter:
> On Thu, Apr 10, 2025 at 08:41:55PM +0200, Boris Brezillon wrote:
>> On Thu, 10 Apr 2025 14:01:03 -0400
>> Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote:
>>
>>>>>> In Panfrost and Lima, we don't have this concept of "incremental
>>>>>> rendering", so when we fail the allocation, we just fail the GPU job
>>>>>> with an unhandled GPU fault.    
>>>>> To be honest I think that this is enough to mark those two drivers as
>>>>> broken.  It's documented that this approach is a no-go for upstream
>>>>> drivers.
>>>>>
>>>>> How widely is that used?  
>>>> It exists in lima and panfrost, and I wouldn't be surprised if a similar
>>>> mechanism was used in other drivers for tiler-based GPUs (etnaviv,
>>>> freedreno, powervr, ...), because ultimately that's how tilers work:
>>>> the amount of memory needed to store per-tile primitives (and metadata)
>>>> depends on what the geometry pipeline feeds the tiler with, and that
>>>> can't be predicted. If you over-provision, that's memory the system won't
>>>> be able to use while rendering takes place, even though only a small
>>>> portion might actually be used by the GPU. If your allocation is too
>>>> small, it will either trigger a GPU fault (for HW not supporting an
>>>> "incremental rendering" mode) or under-perform (because flushing
>>>> primitives has a huge cost on tilers).  
>>> Yes and no.
>>>
>>> Although we can't allocate more memory for /this/ frame, we know the
>>> required size is probably constant across its lifetime. That gives a
>>> simple heuristic to manage the tiler heap efficiently without
>>> allocations - even fallible ones - in the fence signal path:
>>>
>>> * Start with a small fixed size tiler heap
>>> * Try to render, let incremental rendering kick in when it's too small.
>>> * When cleaning up the job, check if we used incremental rendering.
>>> * If we did - double the size of the heap the next time we submit work.
>>>
>>> The tiler heap still grows dynamically - it just does so over the span
>>> of a couple frames. In practice that means a tiny hit to startup time as
>>> we dynamically figure out the right size, incurring extra flushing at
>>> the start, without needing any "grow-on-page-fault" heroics.
>>>
>>> This should solve the problem completely for CSF/panthor. So it's only
>>> hardware that architecturally cannot do incremental rendering (older
>>> Mali: panfrost/lima) where we need this mess.
>> OTOH, if we need something
>> for Utgard(Lima)/Midgard/Bifrost/Valhall(Panfrost), why not use the same
>> thing for CSF, since CSF is arguably the sanest of all the HW
>> architectures listed above: allocation can fail/be non-blocking,
>> because there's a fallback to incremental rendering when it fails.
> So this is a really horrible idea to sort this out for panfrost hardware,
> which doesn't have a tiler cache flush as a fallback. It's roughly three
> stages:
>
> 1. A pile of clever tricks to make the chances of running out of memory
> really low. Most of these also make sense for panthor platforms, just as a
> performance optimization.
>
> 2. I terrible way to handle the unavoidable VK_DEVICE_LOST, but in a way
> such that the impact should be minimal. This is nasty, and we really want
> to avoid that for panthor.
>
> 3. Mesa quirks so that 2 doesn't actually ever happen in practice.
>
> 1. Clever tricks
> ----------------
>
> This is a cascade of tricks we can pull in the gpu fault handler:
>
> 1a. Allocate with GFP_NORECLAIM. We want this first because that triggers
>   background reclaim, and that might be enough to get us through and free
>   some easy caches (like clean fs cache and stuff like that which can just
>   be dropped).
>
> 1b Userspace needs to guesstimate a good guess for how much we'll need. I'm
>   hoping that between render target size and maybe counting the total
>   amounts of vertices we can do a decent guesstimate for many workloads.
>   Note that goal here is not to ensure success, but just to get the rough
>   ballpark. The actual starting number here should aim fairly low, so that
>   we avoid wasting memory since this is memory wasted on every context
>   (that uses a feature which needs dynamic memory allocation, which I
>   guess for pan* is everything, but for apple it would be more limited).
>
> 1c The kernel then keeps an additional global memory pool. Note this would
>   not have the same semantics as mempool.h, which is aimed GFP_NOIO
>   forward progress guarantees, but more as a preallocation pool. In every
>   CS ioctl we'll make sure the pool is filled, and we probably want to
>   size the pool relative to the context with the biggest dynamic memory
>   usage. So probably this thing needs a shrinker, so we can reclaim it
>   when you don't run an app with a huge buffer need on the gpu anymore.
>
>   Note that we're still not sizing this to guarantee success, but together
>   with the userspace heuristics it should be big enough to almost always
>   work out. And since it's global reserve we can afford to waste a bit
>   more memory on this one. We might also want to scale this pool by the
>   total memory available, like the watermarks core mm computes. We'll only
>   hang onto this memory when the gpu is in active usage, so this should be
>   fine.
>
>   Also the preallocation would need to happen without holding the memory
>   pool look, so that we can use GFP_KERNEL.
>
> Up to this point I think it's all tricks that panthor also wants to
> employ.
>
> 1d Next up is scratch dynamic memory. If we can assume that the memory does
>   not need to survive a batchbuffer (hopefully the case with vulkan render
>   pass) we could steal such memory from other contexts. We could even do
>   that for contexts which are queued but not yet running on the hardware
>   (might need unloading them to be doable with fw renderers like
>   panthor/CSF) as long as we keep such stolen dynamic memory on a separate
>   free list. Because if we'd entirely free this, or release it into the
>   memory pool we'll make things worse for these other contexts, we need to
>   be able to guarantee that any context can always get all the stolen
>   dynamic pages back before we start running it on the gpu.
>
> Since the tracking for the memory pool in 1c and the stolen memory in 1d
> has a bunch of tricky corner cases we probably want that as a drm helper
> which globally does that book-keeping for all drm/sched instances. That
> might even help if some NN accel thing also needs this on the same SoC.
>
> 1e Finally, if all else is lost we can try GFP_ATOMIC. This will eat into
>   reserve memory pools the core mm maintains. And hopefully we've spent
>   enough time between this step and the initial GFP_NORECLAIM that
>   background reclaim had a chance to free some memory, hence why all the
>   memory pool and memory stealing tricks should be in between.
>
> The above will need quite some tuning to make sure it works as often as
> possible, without wasting undue amounts of memory. It's classic memory
> overcommit tuning.
>
> 2. Device Lost
> --------------
>
> At this point we're left with no other choice than to kill the context.
> And userspace should be able to cope with VK_DEVICE_LOST (hopefully zink
> does), but it will probably not cope well with an entire strom of these
> just to get the first frame out.
>
> Here comes the horrible trick:
>
> We'll keep rendering the entire frame by just smashing one single reserve
> page (per context) into the pte every time there's a fault. It will result
> in total garbage, and we probably want to shot the context the moment the
> VS stages have finished, but it allows us to collect an accurate estimate
> of how much memory we'd have needed. We need to pass that to the vulkan
> driver as part of the device lost processing, so that it can keep that as
> the starting point for the userspace dynamic memory requirement
> guesstimate as a lower bound. Together with the (scaled to that
> requirement) gpu driver memory pool and the core mm watermarks, that
> should allow us to not hit a device lost again hopefully.
>
> I think if the CS ioctl both takes the current guesstimate from userspace,
> and passes back whatever the current maximum known to the kernel is (we
> need that anyway for the steps in stage 1 to make sense I think), then
> that should also work for dead context where the CS ioctl returns -EIO or
> something like that.

+1

It is a bit radical but as far as I can see that is the way to go.

Additionally I think it's a must have to have a module parameter or similar to exercise this situation even without getting the whole kernel into an OOM situation.

Fault injection to exercise fallback paths are is just mandatory for stuff like that.

Regards,
Christian.


>
> 3. Mesa quirks
> --------------
>
> A database of the fixed dynamic memory requirements we get from step 2
> (through bug reports), so that we can avoid that mess going forward.
>
> If there's too many of those, we'll probably need to improve the
> guesstimation 1b if it's systematically off by too much. We might even
> need to store that on-disk, like shader caches, so that you get a crash
> once and then it should work even without an explicit mesa quirk.
>
> Documentation
> -------------
>
> Assuming this is not too terrible an idea and we reach some consensus, I
> think it'd be good to bake this into a doc patch somewhere in the
> dma_fence documentation. Also including Alyssa recipe for when you have
> hardware support for flushing partial rendering. And maybe generalized so
> it makes sense for the GS/streamout/... fun on apple hardware.
>
> And perhaps with a section added that points to actual code to help make
> this happen like the sparse bo support in this series.
>
> Cheers, Sima

Boris Brezillon April 11, 2025, 1 p.m. UTC | #17

On Fri, 11 Apr 2025 14:45:49 +0200
Christian König <christian.koenig@amd.com> wrote:

> Am 11.04.25 um 14:02 schrieb Boris Brezillon:
> >>> I guess this leaves older GPUs that don't support incremental
> >>> rendering in a bad place though.    
> >> Well what's the handling there currently? Just crash when you're
> >> OOM?  
> > It's "alloc(GFP_KERNEL) and crash if it fails or times out", yes.  
> 
> Oh, please absolutely don't! Using GFP_KERNEL here is as evil as it
> can be.

I'm not saying that's what we should do, I'm just telling you what's
done at the moment. The whole point of this series is to address some
that mess :P.

> 
> Background is that you don't get a crash, nor error message, nor
> anything indicating what is happening.

The job times out at some point, but we might get stuck in the fault
handler waiting for memory, which is pretty close to a deadlock, I
suspect.

> 
> You just get a deadlocked system with the user wondering why the heck
> the system doesn't response any more.

Not sure that's a true deadlock, for the reason explained before (no
shrinker in panthor, and panfrost only reclaims idle BOs, so no waits
on fences there either), but that doesn't make things great either.

Christian König April 11, 2025, 1:13 p.m. UTC | #18

Am 11.04.25 um 15:00 schrieb Boris Brezillon:
> On Fri, 11 Apr 2025 14:45:49 +0200
> Christian König <christian.koenig@amd.com> wrote:
>
>> Am 11.04.25 um 14:02 schrieb Boris Brezillon:
>>>>> I guess this leaves older GPUs that don't support incremental
>>>>> rendering in a bad place though.    
>>>> Well what's the handling there currently? Just crash when you're
>>>> OOM?  
>>> It's "alloc(GFP_KERNEL) and crash if it fails or times out", yes.  
>> Oh, please absolutely don't! Using GFP_KERNEL here is as evil as it
>> can be.
> I'm not saying that's what we should do, I'm just telling you what's
> done at the moment. The whole point of this series is to address some
> that mess :P.

Then it is absolutely welcomed that you take a look at this :D

Oh my, how have we missed that previously?

>
>> Background is that you don't get a crash, nor error message, nor
>> anything indicating what is happening.
> The job times out at some point, but we might get stuck in the fault
> handler waiting for memory, which is pretty close to a deadlock, I
> suspect.

I don't know those drivers that well, but at least for amdgpu the problem would be that the timeout handling would need to grab some of the locks the memory management is holding waiting for the timeout handling to do something....

So that basically perfectly closes the circle. With a bit of lock you get a message after some time that the kernel is stuck, but since that are all sleeping locks I strongly doubt so.

As immediately action please provide patches which changes those GFP_KERNEL into GFP_NOWAIT.

Thanks in advance,
Christian.

>
>> You just get a deadlocked system with the user wondering why the heck
>> the system doesn't response any more.
> Not sure that's a true deadlock, for the reason explained before (no
> shrinker in panthor, and panfrost only reclaims idle BOs, so no waits
> on fences there either), but that doesn't make things great either.

Alyssa Rosenzweig April 11, 2025, 1:52 p.m. UTC | #19

> 2. Device Lost
> --------------
> 
> At this point we're left with no other choice than to kill the context.
> And userspace should be able to cope with VK_DEVICE_LOST (hopefully zink
> does), but it will probably not cope well with an entire strom of these
> just to get the first frame out.
> 
> Here comes the horrible trick:
> 
> We'll keep rendering the entire frame by just smashing one single reserve
> page (per context) into the pte every time there's a fault. It will result
> in total garbage, and we probably want to shot the context the moment the
> VS stages have finished, but it allows us to collect an accurate estimate
> of how much memory we'd have needed. We need to pass that to the vulkan
> driver as part of the device lost processing, so that it can keep that as
> the starting point for the userspace dynamic memory requirement
> guesstimate as a lower bound. Together with the (scaled to that
> requirement) gpu driver memory pool and the core mm watermarks, that
> should allow us to not hit a device lost again hopefully.

This doesn't work if vertex stages are allowed to have side effects
(which is required for adult-level APIs and can effectively get hit with
streamout on panfrost). Once you have anything involving side effects,
you can't replay work, there's no way to cope with that. No magic Zink
can do either.

Boris Brezillon April 11, 2025, 2:39 p.m. UTC | #20

On Fri, 11 Apr 2025 15:13:26 +0200
Christian König <christian.koenig@amd.com> wrote:

> >  
> >> Background is that you don't get a crash, nor error message, nor
> >> anything indicating what is happening.  
> > The job times out at some point, but we might get stuck in the fault
> > handler waiting for memory, which is pretty close to a deadlock, I
> > suspect.  
> 
> I don't know those drivers that well, but at least for amdgpu the
> problem would be that the timeout handling would need to grab some of
> the locks the memory management is holding waiting for the timeout
> handling to do something....
> 
> So that basically perfectly closes the circle. With a bit of lock you
> get a message after some time that the kernel is stuck, but since
> that are all sleeping locks I strongly doubt so.
> 
> As immediately action please provide patches which changes those
> GFP_KERNEL into GFP_NOWAIT.

Sure, I can do that.

Simona Vetter April 11, 2025, 6:16 p.m. UTC | #21

On Fri, Apr 11, 2025 at 09:52:30AM -0400, Alyssa Rosenzweig wrote:
> > 2. Device Lost
> > --------------
> > 
> > At this point we're left with no other choice than to kill the context.
> > And userspace should be able to cope with VK_DEVICE_LOST (hopefully zink
> > does), but it will probably not cope well with an entire strom of these
> > just to get the first frame out.
> > 
> > Here comes the horrible trick:
> > 
> > We'll keep rendering the entire frame by just smashing one single reserve
> > page (per context) into the pte every time there's a fault. It will result
> > in total garbage, and we probably want to shot the context the moment the
> > VS stages have finished, but it allows us to collect an accurate estimate
> > of how much memory we'd have needed. We need to pass that to the vulkan
> > driver as part of the device lost processing, so that it can keep that as
> > the starting point for the userspace dynamic memory requirement
> > guesstimate as a lower bound. Together with the (scaled to that
> > requirement) gpu driver memory pool and the core mm watermarks, that
> > should allow us to not hit a device lost again hopefully.
> 
> This doesn't work if vertex stages are allowed to have side effects
> (which is required for adult-level APIs and can effectively get hit with
> streamout on panfrost). Once you have anything involving side effects,
> you can't replay work, there's no way to cope with that. No magic Zink
> can do either.

Yeah no attempts at reply, it's just standard gl error handling. So either
tossing the context and reporting that through arb_robustness. Or tossing
the context, "transparently" creating a new one and a mix of recreating
some internal driver objects and thoughts&prayers to give the new context
the best chances possible.

You really want all the tricks in step 1 and the quirks in 3 to make sure
this doesn't ever happen. Or at most once per app, hopefully.

I promised terrible after all :-P

Cheers, Sima

Simona Vetter April 11, 2025, 6:18 p.m. UTC | #22

On Fri, Apr 11, 2025 at 02:50:19PM +0200, Christian König wrote:
> Am 11.04.25 um 14:01 schrieb Simona Vetter:
> > On Thu, Apr 10, 2025 at 08:41:55PM +0200, Boris Brezillon wrote:
> >> On Thu, 10 Apr 2025 14:01:03 -0400
> >> Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote:
> >>
> >>>>>> In Panfrost and Lima, we don't have this concept of "incremental
> >>>>>> rendering", so when we fail the allocation, we just fail the GPU job
> >>>>>> with an unhandled GPU fault.    
> >>>>> To be honest I think that this is enough to mark those two drivers as
> >>>>> broken.  It's documented that this approach is a no-go for upstream
> >>>>> drivers.
> >>>>>
> >>>>> How widely is that used?  
> >>>> It exists in lima and panfrost, and I wouldn't be surprised if a similar
> >>>> mechanism was used in other drivers for tiler-based GPUs (etnaviv,
> >>>> freedreno, powervr, ...), because ultimately that's how tilers work:
> >>>> the amount of memory needed to store per-tile primitives (and metadata)
> >>>> depends on what the geometry pipeline feeds the tiler with, and that
> >>>> can't be predicted. If you over-provision, that's memory the system won't
> >>>> be able to use while rendering takes place, even though only a small
> >>>> portion might actually be used by the GPU. If your allocation is too
> >>>> small, it will either trigger a GPU fault (for HW not supporting an
> >>>> "incremental rendering" mode) or under-perform (because flushing
> >>>> primitives has a huge cost on tilers).  
> >>> Yes and no.
> >>>
> >>> Although we can't allocate more memory for /this/ frame, we know the
> >>> required size is probably constant across its lifetime. That gives a
> >>> simple heuristic to manage the tiler heap efficiently without
> >>> allocations - even fallible ones - in the fence signal path:
> >>>
> >>> * Start with a small fixed size tiler heap
> >>> * Try to render, let incremental rendering kick in when it's too small.
> >>> * When cleaning up the job, check if we used incremental rendering.
> >>> * If we did - double the size of the heap the next time we submit work.
> >>>
> >>> The tiler heap still grows dynamically - it just does so over the span
> >>> of a couple frames. In practice that means a tiny hit to startup time as
> >>> we dynamically figure out the right size, incurring extra flushing at
> >>> the start, without needing any "grow-on-page-fault" heroics.
> >>>
> >>> This should solve the problem completely for CSF/panthor. So it's only
> >>> hardware that architecturally cannot do incremental rendering (older
> >>> Mali: panfrost/lima) where we need this mess.
> >> OTOH, if we need something
> >> for Utgard(Lima)/Midgard/Bifrost/Valhall(Panfrost), why not use the same
> >> thing for CSF, since CSF is arguably the sanest of all the HW
> >> architectures listed above: allocation can fail/be non-blocking,
> >> because there's a fallback to incremental rendering when it fails.
> > So this is a really horrible idea to sort this out for panfrost hardware,
> > which doesn't have a tiler cache flush as a fallback. It's roughly three
> > stages:
> >
> > 1. A pile of clever tricks to make the chances of running out of memory
> > really low. Most of these also make sense for panthor platforms, just as a
> > performance optimization.
> >
> > 2. I terrible way to handle the unavoidable VK_DEVICE_LOST, but in a way
> > such that the impact should be minimal. This is nasty, and we really want
> > to avoid that for panthor.
> >
> > 3. Mesa quirks so that 2 doesn't actually ever happen in practice.
> >
> > 1. Clever tricks
> > ----------------
> >
> > This is a cascade of tricks we can pull in the gpu fault handler:
> >
> > 1a. Allocate with GFP_NORECLAIM. We want this first because that triggers
> >   background reclaim, and that might be enough to get us through and free
> >   some easy caches (like clean fs cache and stuff like that which can just
> >   be dropped).
> >
> > 1b Userspace needs to guesstimate a good guess for how much we'll need. I'm
> >   hoping that between render target size and maybe counting the total
> >   amounts of vertices we can do a decent guesstimate for many workloads.
> >   Note that goal here is not to ensure success, but just to get the rough
> >   ballpark. The actual starting number here should aim fairly low, so that
> >   we avoid wasting memory since this is memory wasted on every context
> >   (that uses a feature which needs dynamic memory allocation, which I
> >   guess for pan* is everything, but for apple it would be more limited).
> >
> > 1c The kernel then keeps an additional global memory pool. Note this would
> >   not have the same semantics as mempool.h, which is aimed GFP_NOIO
> >   forward progress guarantees, but more as a preallocation pool. In every
> >   CS ioctl we'll make sure the pool is filled, and we probably want to
> >   size the pool relative to the context with the biggest dynamic memory
> >   usage. So probably this thing needs a shrinker, so we can reclaim it
> >   when you don't run an app with a huge buffer need on the gpu anymore.
> >
> >   Note that we're still not sizing this to guarantee success, but together
> >   with the userspace heuristics it should be big enough to almost always
> >   work out. And since it's global reserve we can afford to waste a bit
> >   more memory on this one. We might also want to scale this pool by the
> >   total memory available, like the watermarks core mm computes. We'll only
> >   hang onto this memory when the gpu is in active usage, so this should be
> >   fine.
> >
> >   Also the preallocation would need to happen without holding the memory
> >   pool look, so that we can use GFP_KERNEL.
> >
> > Up to this point I think it's all tricks that panthor also wants to
> > employ.
> >
> > 1d Next up is scratch dynamic memory. If we can assume that the memory does
> >   not need to survive a batchbuffer (hopefully the case with vulkan render
> >   pass) we could steal such memory from other contexts. We could even do
> >   that for contexts which are queued but not yet running on the hardware
> >   (might need unloading them to be doable with fw renderers like
> >   panthor/CSF) as long as we keep such stolen dynamic memory on a separate
> >   free list. Because if we'd entirely free this, or release it into the
> >   memory pool we'll make things worse for these other contexts, we need to
> >   be able to guarantee that any context can always get all the stolen
> >   dynamic pages back before we start running it on the gpu.
> >
> > Since the tracking for the memory pool in 1c and the stolen memory in 1d
> > has a bunch of tricky corner cases we probably want that as a drm helper
> > which globally does that book-keeping for all drm/sched instances. That
> > might even help if some NN accel thing also needs this on the same SoC.
> >
> > 1e Finally, if all else is lost we can try GFP_ATOMIC. This will eat into
> >   reserve memory pools the core mm maintains. And hopefully we've spent
> >   enough time between this step and the initial GFP_NORECLAIM that
> >   background reclaim had a chance to free some memory, hence why all the
> >   memory pool and memory stealing tricks should be in between.
> >
> > The above will need quite some tuning to make sure it works as often as
> > possible, without wasting undue amounts of memory. It's classic memory
> > overcommit tuning.
> >
> > 2. Device Lost
> > --------------
> >
> > At this point we're left with no other choice than to kill the context.
> > And userspace should be able to cope with VK_DEVICE_LOST (hopefully zink
> > does), but it will probably not cope well with an entire strom of these
> > just to get the first frame out.
> >
> > Here comes the horrible trick:
> >
> > We'll keep rendering the entire frame by just smashing one single reserve
> > page (per context) into the pte every time there's a fault. It will result
> > in total garbage, and we probably want to shot the context the moment the
> > VS stages have finished, but it allows us to collect an accurate estimate
> > of how much memory we'd have needed. We need to pass that to the vulkan
> > driver as part of the device lost processing, so that it can keep that as
> > the starting point for the userspace dynamic memory requirement
> > guesstimate as a lower bound. Together with the (scaled to that
> > requirement) gpu driver memory pool and the core mm watermarks, that
> > should allow us to not hit a device lost again hopefully.
> >
> > I think if the CS ioctl both takes the current guesstimate from userspace,
> > and passes back whatever the current maximum known to the kernel is (we
> > need that anyway for the steps in stage 1 to make sense I think), then
> > that should also work for dead context where the CS ioctl returns -EIO or
> > something like that.
> 
> +1
> 
> It is a bit radical but as far as I can see that is the way to go.
> 
> Additionally I think it's a must have to have a module parameter or
> similar to exercise this situation even without getting the whole kernel
> into an OOM situation.
> 
> Fault injection to exercise fallback paths are is just mandatory for
> stuff like that.

Yeah, although I'd lean towards debugfs so that an igt could nicely
control it all and let this blow up in a very controlled way. Probably
some bitflag to make each of the options in step 1 fail to get anything
useful, or at least the GFP_NORECLAIM and the GFP_ATOMIC ones.
-Sima

> 
> Regards,
> Christian.
> 
> 
> >
> > 3. Mesa quirks
> > --------------
> >
> > A database of the fixed dynamic memory requirements we get from step 2
> > (through bug reports), so that we can avoid that mess going forward.
> >
> > If there's too many of those, we'll probably need to improve the
> > guesstimation 1b if it's systematically off by too much. We might even
> > need to store that on-disk, like shader caches, so that you get a crash
> > once and then it should work even without an explicit mesa quirk.
> >
> > Documentation
> > -------------
> >
> > Assuming this is not too terrible an idea and we reach some consensus, I
> > think it'd be good to bake this into a doc patch somewhere in the
> > dma_fence documentation. Also including Alyssa recipe for when you have
> > hardware support for flushing partial rendering. And maybe generalized so
> > it makes sense for the GS/streamout/... fun on apple hardware.
> >
> > And perhaps with a section added that points to actual code to help make
> > this happen like the sparse bo support in this series.
> >
> > Cheers, Sima
>

Simona Vetter April 11, 2025, 6:24 p.m. UTC | #23

On Fri, Apr 11, 2025 at 12:55:57PM +0200, Christian König wrote:
> Am 11.04.25 um 10:38 schrieb Boris Brezillon:
> > On Fri, 11 Apr 2025 10:04:07 +0200
> > Christian König <christian.koenig@amd.com> wrote:
> >
> >> Am 10.04.25 um 20:41 schrieb Boris Brezillon:
> >>> On Thu, 10 Apr 2025 14:01:03 -0400
> >>> Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote:
> >>>  
> >>>>>>> In Panfrost and Lima, we don't have this concept of "incremental
> >>>>>>> rendering", so when we fail the allocation, we just fail the
> >>>>>>> GPU job with an unhandled GPU fault.      
> >>>>>> To be honest I think that this is enough to mark those two
> >>>>>> drivers as broken.  It's documented that this approach is a
> >>>>>> no-go for upstream drivers.
> >>>>>>
> >>>>>> How widely is that used?    
> >>>>> It exists in lima and panfrost, and I wouldn't be surprised if a
> >>>>> similar mechanism was used in other drivers for tiler-based GPUs
> >>>>> (etnaviv, freedreno, powervr, ...), because ultimately that's how
> >>>>> tilers work: the amount of memory needed to store per-tile
> >>>>> primitives (and metadata) depends on what the geometry pipeline
> >>>>> feeds the tiler with, and that can't be predicted. If you
> >>>>> over-provision, that's memory the system won't be able to use
> >>>>> while rendering takes place, even though only a small portion
> >>>>> might actually be used by the GPU. If your allocation is too
> >>>>> small, it will either trigger a GPU fault (for HW not supporting
> >>>>> an "incremental rendering" mode) or under-perform (because
> >>>>> flushing primitives has a huge cost on tilers).    
> >>>> Yes and no.
> >>>>
> >>>> Although we can't allocate more memory for /this/ frame, we know
> >>>> the required size is probably constant across its lifetime. That
> >>>> gives a simple heuristic to manage the tiler heap efficiently
> >>>> without allocations - even fallible ones - in the fence signal
> >>>> path:
> >>>>
> >>>> * Start with a small fixed size tiler heap
> >>>> * Try to render, let incremental rendering kick in when it's too
> >>>> small.
> >>>> * When cleaning up the job, check if we used incremental rendering.
> >>>> * If we did - double the size of the heap the next time we submit
> >>>> work.
> >>>>
> >>>> The tiler heap still grows dynamically - it just does so over the
> >>>> span of a couple frames. In practice that means a tiny hit to
> >>>> startup time as we dynamically figure out the right size,
> >>>> incurring extra flushing at the start, without needing any
> >>>> "grow-on-page-fault" heroics.
> >>>>
> >>>> This should solve the problem completely for CSF/panthor. So it's
> >>>> only hardware that architecturally cannot do incremental rendering
> >>>> (older Mali: panfrost/lima) where we need this mess.  
> >>> OTOH, if we need something
> >>> for Utgard(Lima)/Midgard/Bifrost/Valhall(Panfrost), why not use the
> >>> same thing for CSF, since CSF is arguably the sanest of all the HW
> >>> architectures listed above: allocation can fail/be non-blocking,
> >>> because there's a fallback to incremental rendering when it fails.  
> >> Yeah that is a rather interesting point Alyssa noted here.
> >>
> >> So basically you could as well implement it like this:
> >> 1. Userspace makes a submission.
> >> 2. HW finds buffer is not large enough, sets and error code and
> >> completes submission. 3. Userspace detects error, re-allocates buffer
> >> with increased size. 4. Userspace re-submits to incremental complete
> >> the submission. 5. Eventually repeat until fully completed.
> >>
> >> That would work but is likely just not the most performant solution.
> >> So faulting in memory on demand is basically just an optimization and
> >> that is ok as far as I can see.
> > Yeah, Alyssa's suggestion got me thinking too, and I think I can come
> > up with a plan where we try non-blocking allocation first, and if it
> > fails, we trigger incremental rendering, and queue a blocking
> > heap-chunk allocation on separate workqueue, such that next time the
> > tiler heap hits an OOM, it has a chunk (or multiple chunks) readily
> > available if the blocking allocation completed in the meantime. That's
> > basically what Alyssa suggested, with an optimization if the system is
> > not under memory pressure, and without userspace being involved (so no
> > uAPI changes).

Please no background task that tries to find memory, you're just
reinventing the background kswapd shrinking. Which even a GFP_NORECLAIM
should kick off.

Instead just rely on kswapd to hopefully get you through the current job
without undue amounts of tiler flushing.

Then just grow dynamic memory synchronously with some heuristics in the
next CS ioctl, that gives you appropriate amounts of throttling, no issues
with error reporting, and you can just use GFP_KERNEL.

> That sounds like it most likely won't work. In an OOM situation the
> blocking allocation would just cause more pressure to complete your
> rendering to free up memory.

The issue isn't oom, the issue is that GFP_RECLAIM can't even throw out
clean caches and so your limited to watermarks and what kswapd manages to
clean out in parallel.

Real OOM is much, much nastier, and there you should just get an ENONMEM
from the CS ioctl. Ideally at least, because that gives throttling and all
that nice stuff for free (for some value of "nice", lots of folks really
despise the stalls that introduces).

Cheers, Sima

> > I guess this leaves older GPUs that don't support incremental rendering
> > in a bad place though.
> 
> Well what's the handling there currently? Just crash when you're OOM?
> 
> Regards,
> Christian.
> 
> >
> >> That is then a rather good justification for your work Boris. Because
> >> a common component makes it possible to implement a common fault
> >> injection functionality to make sure that the fallback path is
> >> properly exercised in testing.
> > I can also add an fault injection mechanism to validate that, yep.
> >
> > Thanks,
> >
> > Boris
>

Boris Brezillon April 14, 2025, 11:22 a.m. UTC | #24

Hi Sima,

On Fri, 11 Apr 2025 14:01:16 +0200
Simona Vetter <simona.vetter@ffwll.ch> wrote:

> On Thu, Apr 10, 2025 at 08:41:55PM +0200, Boris Brezillon wrote:
> > On Thu, 10 Apr 2025 14:01:03 -0400
> > Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote:
> >   
> > > > > > In Panfrost and Lima, we don't have this concept of "incremental
> > > > > > rendering", so when we fail the allocation, we just fail the GPU job
> > > > > > with an unhandled GPU fault.      
> > > > > 
> > > > > To be honest I think that this is enough to mark those two drivers as
> > > > > broken.  It's documented that this approach is a no-go for upstream
> > > > > drivers.
> > > > > 
> > > > > How widely is that used?    
> > > > 
> > > > It exists in lima and panfrost, and I wouldn't be surprised if a similar
> > > > mechanism was used in other drivers for tiler-based GPUs (etnaviv,
> > > > freedreno, powervr, ...), because ultimately that's how tilers work:
> > > > the amount of memory needed to store per-tile primitives (and metadata)
> > > > depends on what the geometry pipeline feeds the tiler with, and that
> > > > can't be predicted. If you over-provision, that's memory the system won't
> > > > be able to use while rendering takes place, even though only a small
> > > > portion might actually be used by the GPU. If your allocation is too
> > > > small, it will either trigger a GPU fault (for HW not supporting an
> > > > "incremental rendering" mode) or under-perform (because flushing
> > > > primitives has a huge cost on tilers).    
> > > 
> > > Yes and no.
> > > 
> > > Although we can't allocate more memory for /this/ frame, we know the
> > > required size is probably constant across its lifetime. That gives a
> > > simple heuristic to manage the tiler heap efficiently without
> > > allocations - even fallible ones - in the fence signal path:
> > > 
> > > * Start with a small fixed size tiler heap
> > > * Try to render, let incremental rendering kick in when it's too small.
> > > * When cleaning up the job, check if we used incremental rendering.
> > > * If we did - double the size of the heap the next time we submit work.
> > > 
> > > The tiler heap still grows dynamically - it just does so over the span
> > > of a couple frames. In practice that means a tiny hit to startup time as
> > > we dynamically figure out the right size, incurring extra flushing at
> > > the start, without needing any "grow-on-page-fault" heroics.
> > > 
> > > This should solve the problem completely for CSF/panthor. So it's only
> > > hardware that architecturally cannot do incremental rendering (older
> > > Mali: panfrost/lima) where we need this mess.  
> > 
> > OTOH, if we need something
> > for Utgard(Lima)/Midgard/Bifrost/Valhall(Panfrost), why not use the same
> > thing for CSF, since CSF is arguably the sanest of all the HW
> > architectures listed above: allocation can fail/be non-blocking,
> > because there's a fallback to incremental rendering when it fails.  
> 
> So this is a really horrible idea to sort this out for panfrost hardware,
> which doesn't have a tiler cache flush as a fallback. It's roughly three
> stages:
> 
> 1. A pile of clever tricks to make the chances of running out of memory
> really low. Most of these also make sense for panthor platforms, just as a
> performance optimization.
> 
> 2. I terrible way to handle the unavoidable VK_DEVICE_LOST, but in a way
> such that the impact should be minimal. This is nasty, and we really want
> to avoid that for panthor.
> 
> 3. Mesa quirks so that 2 doesn't actually ever happen in practice.
> 
> 1. Clever tricks
> ----------------
> 
> This is a cascade of tricks we can pull in the gpu fault handler:
> 
> 1a. Allocate with GFP_NORECLAIM. We want this first because that triggers
>   background reclaim, and that might be enough to get us through and free
>   some easy caches (like clean fs cache and stuff like that which can just
>   be dropped).

There's no GFP_NORECLAIM, and given the discussions we had before, I
guess you meant GFP_NOWAIT. Otherwise it's the __GFP_NOWARN |
__GFP_NORETRY I used in this series, and it probably doesn't try hard
enough as pointed out by you and Christian.

> 
> 1b Userspace needs to guesstimate a good guess for how much we'll need. I'm
>   hoping that between render target size and maybe counting the total
>   amounts of vertices we can do a decent guesstimate for many workloads.

There are extra parameters to take into account, like the tile
hierarchy mask (number of binning lists instantiated) and probably
other things I forget, but for simple vertex+fragment pipelines and
direct draws, guessing the worst memory usage case is probably doable.
Throw indirect draws into the mix, and it suddenly becomes a lot more
complicated. Not even talking about GEOM/TESS stages, which makes the
guessing even harder AFAICT.

>   Note that goal here is not to ensure success, but just to get the rough
>   ballpark. The actual starting number here should aim fairly low, so that
>   we avoid wasting memory since this is memory wasted on every context
>   (that uses a feature which needs dynamic memory allocation, which I
>   guess for pan* is everything, but for apple it would be more limited).

Ack.

> 
> 1c The kernel then keeps an additional global memory pool. Note this would
>   not have the same semantics as mempool.h, which is aimed GFP_NOIO
>   forward progress guarantees, but more as a preallocation pool. In every
>   CS ioctl we'll make sure the pool is filled, and we probably want to
>   size the pool relative to the context with the biggest dynamic memory
>   usage. So probably this thing needs a shrinker, so we can reclaim it
>   when you don't run an app with a huge buffer need on the gpu anymore.

Okay, that's a technique Arm has been using in their downstream driver
(it named JIT-allocation there).

> 
>   Note that we're still not sizing this to guarantee success, but together
>   with the userspace heuristics it should be big enough to almost always
>   work out. And since it's global reserve we can afford to waste a bit
>   more memory on this one. We might also want to scale this pool by the
>   total memory available, like the watermarks core mm computes. We'll only
>   hang onto this memory when the gpu is in active usage, so this should be
>   fine.

Sounds like a good idea.

> 
>   Also the preallocation would need to happen without holding the memory
>   pool look, so that we can use GFP_KERNEL.
> 
> Up to this point I think it's all tricks that panthor also wants to
> employ.
> 
> 1d Next up is scratch dynamic memory. If we can assume that the memory does
>   not need to survive a batchbuffer (hopefully the case with vulkan render
>   pass) we could steal such memory from other contexts. We could even do
>   that for contexts which are queued but not yet running on the hardware
>   (might need unloading them to be doable with fw renderers like
>   panthor/CSF) as long as we keep such stolen dynamic memory on a separate
>   free list. Because if we'd entirely free this, or release it into the
>   memory pool we'll make things worse for these other contexts, we need to
>   be able to guarantee that any context can always get all the stolen
>   dynamic pages back before we start running it on the gpu.

Actually, CSF stands in the way of re-allocating memory to other
contexts, because once we've allocated memory to a tiler heap, the FW
manages this pool of chunks, and recycles them. Mesa can intercept
the "returned chunks" and collect those chunks instead of re-assiging
then to the tiler heap through a CS instruction (which goes thought
the FW internallu), but that involves extra collaboration between the
UMD, KMD and FW which we don't have at the moment. Not saying never,
but I'd rather fix things gradually (first the blocking alloc in the
fence-signalling path, then the optimization to share the extra mem
reservation cost among contexts by returning the chunks to the global
kernel pool rather than directly to the heap).

This approach should work fine with JM GPUs where the tiler heap is
entirely managed by the KMD though.

> 
> Since the tracking for the memory pool in 1c and the stolen memory in 1d
> has a bunch of tricky corner cases we probably want that as a drm helper
> which globally does that book-keeping for all drm/sched instances. That
> might even help if some NN accel thing also needs this on the same SoC.

Yeah, I think whatever we go for here should be exposed as generic
helpers with proper documentation to explain when and how to use those,
because otherwise we're back to the original situation where each
driver copies the pattern of another driver and adjust it to its needs,
even when the original design was unsound.

> 
> 1e Finally, if all else is lost we can try GFP_ATOMIC. This will eat into
>   reserve memory pools the core mm maintains. And hopefully we've spent
>   enough time between this step and the initial GFP_NORECLAIM that
>   background reclaim had a chance to free some memory, hence why all the
>   memory pool and memory stealing tricks should be in between.

Any reason not to use GFP_NOWAIT on this last attempt?
I'm not sure I like the idea of stealing memory from the reserved
amount of atomic mem. I mean, if we get there and there's still not
enough memory to satisfy a "normal" allocation, it's probably a good
time to fail with a actual OOM.

> 
> The above will need quite some tuning to make sure it works as often as
> possible, without wasting undue amounts of memory. It's classic memory
> overcommit tuning.
> 
> 2. Device Lost
> --------------
> 
> At this point we're left with no other choice than to kill the context.
> And userspace should be able to cope with VK_DEVICE_LOST (hopefully zink
> does), but it will probably not cope well with an entire strom of these
> just to get the first frame out.
> 
> Here comes the horrible trick:
> 
> We'll keep rendering the entire frame by just smashing one single reserve
> page (per context) into the pte every time there's a fault. It will result
> in total garbage, and we probably want to shot the context the moment the
> VS stages have finished, but it allows us to collect an accurate estimate
> of how much memory we'd have needed. We need to pass that to the vulkan
> driver as part of the device lost processing, so that it can keep that as
> the starting point for the userspace dynamic memory requirement
> guesstimate as a lower bound. Together with the (scaled to that
> requirement) gpu driver memory pool and the core mm watermarks, that
> should allow us to not hit a device lost again hopefully.

If the intent is to progressively increase the size of this
global memory pool, and the min-resident-size of the growable buffer
that triggered the OOM, I guess we can just go for something dumb like
double-the-size or increment-it-in-steps and flag the buffer on which
the fault happened such that userspace can query that at DEVICE_LOST
time and properly resize this internal buffer next time a
CreateDevice() coming from the same process/thread is created. I'm not
sure we need to actually execute the shader after the OOM to accurately
guess the size.

> 
> I think if the CS ioctl both takes the current guesstimate from userspace,
> and passes back whatever the current maximum known to the kernel is (we
> need that anyway for the steps in stage 1 to make sense I think), then
> that should also work for dead context where the CS ioctl returns -EIO or
> something like that.
> 
> 3. Mesa quirks
> --------------
> 
> A database of the fixed dynamic memory requirements we get from step 2
> (through bug reports), so that we can avoid that mess going forward.

Yeah, we already have a dri-conf to force the initial size of the tiler
heap on CSF hardware, so we could easily ship a DB of per-app initial
size.

> 
> If there's too many of those, we'll probably need to improve the
> guesstimation 1b if it's systematically off by too much. We might even
> need to store that on-disk, like shader caches, so that you get a crash
> once and then it should work even without an explicit mesa quirk.

Makes sense. Maybe some sort of implicit dri-conf DB that overloads the
explicit one?

> 
> Documentation
> -------------
> 
> Assuming this is not too terrible an idea and we reach some consensus, I
> think it'd be good to bake this into a doc patch somewhere in the
> dma_fence documentation. Also including Alyssa recipe for when you have
> hardware support for flushing partial rendering. And maybe generalized so
> it makes sense for the GS/streamout/... fun on apple hardware.

I think the tricky part is properly guessing the amount of memory
needed for a draw. For simple vertex+fragment pipelines with direct
draws, that's doable, but for more complex stuff like GEOM/TESS and
indirect draws, that gets tricky, because some of the parameters are
not known at CS recording time. So my gut feeling is that, rather than
trying to find a smart way to guesstimate those stuff, we should rely
on simple heuristics like the
double-the-resident-size-on-the-faulty-sparse-GEM thing suggested by
Alyssa.

Many thanks to all the people who chimed in. I'll wait until the
discussion settles on something that works for everyone and try to
sum-up the outcome of this discussion in some
drm-tile-based-renderer.rst doc file (or any other file name you think
this should be put in) so we have a clear plan of what we want to
do/how we want it done.

Regards,

Boris

Boris Brezillon April 14, 2025, 12:47 p.m. UTC | #25

On Fri, 11 Apr 2025 16:39:02 +0200
Boris Brezillon <boris.brezillon@collabora.com> wrote:

> On Fri, 11 Apr 2025 15:13:26 +0200
> Christian König <christian.koenig@amd.com> wrote:
> 
> > >    
> > >> Background is that you don't get a crash, nor error message, nor
> > >> anything indicating what is happening.    
> > > The job times out at some point, but we might get stuck in the fault
> > > handler waiting for memory, which is pretty close to a deadlock, I
> > > suspect.    
> > 
> > I don't know those drivers that well, but at least for amdgpu the
> > problem would be that the timeout handling would need to grab some of
> > the locks the memory management is holding waiting for the timeout
> > handling to do something....
> > 
> > So that basically perfectly closes the circle. With a bit of lock you
> > get a message after some time that the kernel is stuck, but since
> > that are all sleeping locks I strongly doubt so.
> > 
> > As immediately action please provide patches which changes those
> > GFP_KERNEL into GFP_NOWAIT.  
> 
> Sure, I can do that.

Hm, I might have been too prompt at claiming this was doable. In
practice, doing that might regress Lima and Panfrost in situations
where trying harder than GFP_NOWAIT would free up some memory. Not
saying this was right to use GFP_KERNEL in the first place, but some
expectations were set by this original mistake, so I'll probably need
Lima developers to vouch in for this change after they've done some
testing on a system under high memory pressure, and I'd need to do the
same kind of testing for Panfrost and ask Steve if he's okay with that
too.

For Panthor, I'm less worried, because we have the incremental rendering
fallback, and assuming GFP_NOWAIT tries hard enough to reclaim
low-hanging fruits, the perfs shouldn't suffer much more than they
would today with GFP_KERNEL allocations potentially delaying tiling
operations longer than would have been with a primitive flush.

Alyssa Rosenzweig April 14, 2025, 1:03 p.m. UTC | #26

> Actually, CSF stands in the way of re-allocating memory to other
> contexts, because once we've allocated memory to a tiler heap, the FW
> manages this pool of chunks, and recycles them. Mesa can intercept
> the "returned chunks" and collect those chunks instead of re-assiging
> then to the tiler heap through a CS instruction (which goes thought
> the FW internallu), but that involves extra collaboration between the
> UMD, KMD and FW which we don't have at the moment. Not saying never,
> but I'd rather fix things gradually (first the blocking alloc in the
> fence-signalling path, then the optimization to share the extra mem
> reservation cost among contexts by returning the chunks to the global
> kernel pool rather than directly to the heap).
> 
> This approach should work fine with JM GPUs where the tiler heap is
> entirely managed by the KMD though.

I really think CSF should be relying on the simple heuristics with
incremental-rendering, unless you can prove that's actually a
performance issue in practice. (On Imagination/Apple parts, it almost
never is and we rely entirely on this approach. It's ok - it really is.
For simple 2D workloads, the initial heap allocation is fine. For 3D
scenes, we need very few frames to get the right size. this doesn't
cause stutters in practice.)

For JM .. yes, this discussion remains relevant of course.

Liviu Dudau April 14, 2025, 1:08 p.m. UTC | #27

On Mon, Apr 14, 2025 at 01:22:06PM +0200, Boris Brezillon wrote:
> Hi Sima,
> 
> On Fri, 11 Apr 2025 14:01:16 +0200
> Simona Vetter <simona.vetter@ffwll.ch> wrote:
> 
> > On Thu, Apr 10, 2025 at 08:41:55PM +0200, Boris Brezillon wrote:
> > > On Thu, 10 Apr 2025 14:01:03 -0400
> > > Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote:
> > >   
> > > > > > > In Panfrost and Lima, we don't have this concept of "incremental
> > > > > > > rendering", so when we fail the allocation, we just fail the GPU job
> > > > > > > with an unhandled GPU fault.      
> > > > > > 
> > > > > > To be honest I think that this is enough to mark those two drivers as
> > > > > > broken.  It's documented that this approach is a no-go for upstream
> > > > > > drivers.
> > > > > > 
> > > > > > How widely is that used?    
> > > > > 
> > > > > It exists in lima and panfrost, and I wouldn't be surprised if a similar
> > > > > mechanism was used in other drivers for tiler-based GPUs (etnaviv,
> > > > > freedreno, powervr, ...), because ultimately that's how tilers work:
> > > > > the amount of memory needed to store per-tile primitives (and metadata)
> > > > > depends on what the geometry pipeline feeds the tiler with, and that
> > > > > can't be predicted. If you over-provision, that's memory the system won't
> > > > > be able to use while rendering takes place, even though only a small
> > > > > portion might actually be used by the GPU. If your allocation is too
> > > > > small, it will either trigger a GPU fault (for HW not supporting an
> > > > > "incremental rendering" mode) or under-perform (because flushing
> > > > > primitives has a huge cost on tilers).    
> > > > 
> > > > Yes and no.
> > > > 
> > > > Although we can't allocate more memory for /this/ frame, we know the
> > > > required size is probably constant across its lifetime. That gives a
> > > > simple heuristic to manage the tiler heap efficiently without
> > > > allocations - even fallible ones - in the fence signal path:
> > > > 
> > > > * Start with a small fixed size tiler heap
> > > > * Try to render, let incremental rendering kick in when it's too small.
> > > > * When cleaning up the job, check if we used incremental rendering.
> > > > * If we did - double the size of the heap the next time we submit work.
> > > > 
> > > > The tiler heap still grows dynamically - it just does so over the span
> > > > of a couple frames. In practice that means a tiny hit to startup time as
> > > > we dynamically figure out the right size, incurring extra flushing at
> > > > the start, without needing any "grow-on-page-fault" heroics.
> > > > 
> > > > This should solve the problem completely for CSF/panthor. So it's only
> > > > hardware that architecturally cannot do incremental rendering (older
> > > > Mali: panfrost/lima) where we need this mess.  
> > > 
> > > OTOH, if we need something
> > > for Utgard(Lima)/Midgard/Bifrost/Valhall(Panfrost), why not use the same
> > > thing for CSF, since CSF is arguably the sanest of all the HW
> > > architectures listed above: allocation can fail/be non-blocking,
> > > because there's a fallback to incremental rendering when it fails.  
> > 
> > So this is a really horrible idea to sort this out for panfrost hardware,
> > which doesn't have a tiler cache flush as a fallback. It's roughly three
> > stages:
> > 
> > 1. A pile of clever tricks to make the chances of running out of memory
> > really low. Most of these also make sense for panthor platforms, just as a
> > performance optimization.
> > 
> > 2. I terrible way to handle the unavoidable VK_DEVICE_LOST, but in a way
> > such that the impact should be minimal. This is nasty, and we really want
> > to avoid that for panthor.
> > 
> > 3. Mesa quirks so that 2 doesn't actually ever happen in practice.
> > 
> > 1. Clever tricks
> > ----------------
> > 
> > This is a cascade of tricks we can pull in the gpu fault handler:
> > 
> > 1a. Allocate with GFP_NORECLAIM. We want this first because that triggers
> >   background reclaim, and that might be enough to get us through and free
> >   some easy caches (like clean fs cache and stuff like that which can just
> >   be dropped).
> 
> There's no GFP_NORECLAIM, and given the discussions we had before, I
> guess you meant GFP_NOWAIT. Otherwise it's the __GFP_NOWARN |
> __GFP_NORETRY I used in this series, and it probably doesn't try hard
> enough as pointed out by you and Christian.
> 
> > 
> > 1b Userspace needs to guesstimate a good guess for how much we'll need. I'm
> >   hoping that between render target size and maybe counting the total
> >   amounts of vertices we can do a decent guesstimate for many workloads.
> 
> There are extra parameters to take into account, like the tile
> hierarchy mask (number of binning lists instantiated) and probably
> other things I forget, but for simple vertex+fragment pipelines and
> direct draws, guessing the worst memory usage case is probably doable.
> Throw indirect draws into the mix, and it suddenly becomes a lot more
> complicated. Not even talking about GEOM/TESS stages, which makes the
> guessing even harder AFAICT.
> 
> >   Note that goal here is not to ensure success, but just to get the rough
> >   ballpark. The actual starting number here should aim fairly low, so that
> >   we avoid wasting memory since this is memory wasted on every context
> >   (that uses a feature which needs dynamic memory allocation, which I
> >   guess for pan* is everything, but for apple it would be more limited).
> 
> Ack.
> 
> > 
> > 1c The kernel then keeps an additional global memory pool. Note this would
> >   not have the same semantics as mempool.h, which is aimed GFP_NOIO
> >   forward progress guarantees, but more as a preallocation pool. In every
> >   CS ioctl we'll make sure the pool is filled, and we probably want to
> >   size the pool relative to the context with the biggest dynamic memory
> >   usage. So probably this thing needs a shrinker, so we can reclaim it
> >   when you don't run an app with a huge buffer need on the gpu anymore.
> 
> Okay, that's a technique Arm has been using in their downstream driver
> (it named JIT-allocation there).
> 
> > 
> >   Note that we're still not sizing this to guarantee success, but together
> >   with the userspace heuristics it should be big enough to almost always
> >   work out. And since it's global reserve we can afford to waste a bit
> >   more memory on this one. We might also want to scale this pool by the
> >   total memory available, like the watermarks core mm computes. We'll only
> >   hang onto this memory when the gpu is in active usage, so this should be
> >   fine.
> 
> Sounds like a good idea.
> 
> > 
> >   Also the preallocation would need to happen without holding the memory
> >   pool look, so that we can use GFP_KERNEL.
> > 
> > Up to this point I think it's all tricks that panthor also wants to
> > employ.
> > 
> > 1d Next up is scratch dynamic memory. If we can assume that the memory does
> >   not need to survive a batchbuffer (hopefully the case with vulkan render
> >   pass) we could steal such memory from other contexts. We could even do
> >   that for contexts which are queued but not yet running on the hardware
> >   (might need unloading them to be doable with fw renderers like
> >   panthor/CSF) as long as we keep such stolen dynamic memory on a separate
> >   free list. Because if we'd entirely free this, or release it into the
> >   memory pool we'll make things worse for these other contexts, we need to
> >   be able to guarantee that any context can always get all the stolen
> >   dynamic pages back before we start running it on the gpu.
> 
> Actually, CSF stands in the way of re-allocating memory to other
> contexts, because once we've allocated memory to a tiler heap, the FW
> manages this pool of chunks, and recycles them. Mesa can intercept
> the "returned chunks" and collect those chunks instead of re-assiging
> then to the tiler heap through a CS instruction (which goes thought
> the FW internallu), but that involves extra collaboration between the
> UMD, KMD and FW which we don't have at the moment. Not saying never,
> but I'd rather fix things gradually (first the blocking alloc in the
> fence-signalling path, then the optimization to share the extra mem
> reservation cost among contexts by returning the chunks to the global
> kernel pool rather than directly to the heap).

The additional issue with borrowing memory from idle contexts is that it will
involve MMU operations, as we will have to move the memory into the active
context address space. CSF GPUs have a limitation that they can only work with
one address space for the active job when it comes to memory used internally
by the job, so we either have to map the scratch dynamic memory in all the
jobs before we submit them, or we will have to do MMU maintainance operations
in the OOM path in order to borrow memory from other contexts.

Overall I think the design could work and as Boris pointed out Arm already
has something similar in the Just In Time memory pool design, it's just that
we need to play with different rules in the upstream as the downstream bypasses
the DRM framework so it has fewer rules to abide by.

Best regards,
Liviu

> 
> This approach should work fine with JM GPUs where the tiler heap is
> entirely managed by the KMD though.
> 
> > 
> > Since the tracking for the memory pool in 1c and the stolen memory in 1d
> > has a bunch of tricky corner cases we probably want that as a drm helper
> > which globally does that book-keeping for all drm/sched instances. That
> > might even help if some NN accel thing also needs this on the same SoC.
> 
> Yeah, I think whatever we go for here should be exposed as generic
> helpers with proper documentation to explain when and how to use those,
> because otherwise we're back to the original situation where each
> driver copies the pattern of another driver and adjust it to its needs,
> even when the original design was unsound.
> 
> > 
> > 1e Finally, if all else is lost we can try GFP_ATOMIC. This will eat into
> >   reserve memory pools the core mm maintains. And hopefully we've spent
> >   enough time between this step and the initial GFP_NORECLAIM that
> >   background reclaim had a chance to free some memory, hence why all the
> >   memory pool and memory stealing tricks should be in between.
> 
> Any reason not to use GFP_NOWAIT on this last attempt?
> I'm not sure I like the idea of stealing memory from the reserved
> amount of atomic mem. I mean, if we get there and there's still not
> enough memory to satisfy a "normal" allocation, it's probably a good
> time to fail with a actual OOM.
> 
> > 
> > The above will need quite some tuning to make sure it works as often as
> > possible, without wasting undue amounts of memory. It's classic memory
> > overcommit tuning.
> > 
> > 2. Device Lost
> > --------------
> > 
> > At this point we're left with no other choice than to kill the context.
> > And userspace should be able to cope with VK_DEVICE_LOST (hopefully zink
> > does), but it will probably not cope well with an entire strom of these
> > just to get the first frame out.
> > 
> > Here comes the horrible trick:
> > 
> > We'll keep rendering the entire frame by just smashing one single reserve
> > page (per context) into the pte every time there's a fault. It will result
> > in total garbage, and we probably want to shot the context the moment the
> > VS stages have finished, but it allows us to collect an accurate estimate
> > of how much memory we'd have needed. We need to pass that to the vulkan
> > driver as part of the device lost processing, so that it can keep that as
> > the starting point for the userspace dynamic memory requirement
> > guesstimate as a lower bound. Together with the (scaled to that
> > requirement) gpu driver memory pool and the core mm watermarks, that
> > should allow us to not hit a device lost again hopefully.
> 
> If the intent is to progressively increase the size of this
> global memory pool, and the min-resident-size of the growable buffer
> that triggered the OOM, I guess we can just go for something dumb like
> double-the-size or increment-it-in-steps and flag the buffer on which
> the fault happened such that userspace can query that at DEVICE_LOST
> time and properly resize this internal buffer next time a
> CreateDevice() coming from the same process/thread is created. I'm not
> sure we need to actually execute the shader after the OOM to accurately
> guess the size.
> 
> > 
> > I think if the CS ioctl both takes the current guesstimate from userspace,
> > and passes back whatever the current maximum known to the kernel is (we
> > need that anyway for the steps in stage 1 to make sense I think), then
> > that should also work for dead context where the CS ioctl returns -EIO or
> > something like that.
> > 
> > 3. Mesa quirks
> > --------------
> > 
> > A database of the fixed dynamic memory requirements we get from step 2
> > (through bug reports), so that we can avoid that mess going forward.
> 
> Yeah, we already have a dri-conf to force the initial size of the tiler
> heap on CSF hardware, so we could easily ship a DB of per-app initial
> size.
> 
> > 
> > If there's too many of those, we'll probably need to improve the
> > guesstimation 1b if it's systematically off by too much. We might even
> > need to store that on-disk, like shader caches, so that you get a crash
> > once and then it should work even without an explicit mesa quirk.
> 
> Makes sense. Maybe some sort of implicit dri-conf DB that overloads the
> explicit one?
> 
> > 
> > Documentation
> > -------------
> > 
> > Assuming this is not too terrible an idea and we reach some consensus, I
> > think it'd be good to bake this into a doc patch somewhere in the
> > dma_fence documentation. Also including Alyssa recipe for when you have
> > hardware support for flushing partial rendering. And maybe generalized so
> > it makes sense for the GS/streamout/... fun on apple hardware.
> 
> I think the tricky part is properly guessing the amount of memory
> needed for a draw. For simple vertex+fragment pipelines with direct
> draws, that's doable, but for more complex stuff like GEOM/TESS and
> indirect draws, that gets tricky, because some of the parameters are
> not known at CS recording time. So my gut feeling is that, rather than
> trying to find a smart way to guesstimate those stuff, we should rely
> on simple heuristics like the
> double-the-resident-size-on-the-faulty-sparse-GEM thing suggested by
> Alyssa.
> 
> Many thanks to all the people who chimed in. I'll wait until the
> discussion settles on something that works for everyone and try to
> sum-up the outcome of this discussion in some
> drm-tile-based-renderer.rst doc file (or any other file name you think
> this should be put in) so we have a clear plan of what we want to
> do/how we want it done.
> 
> Regards,
> 
> Boris

Boris Brezillon April 14, 2025, 1:31 p.m. UTC | #28

On Mon, 14 Apr 2025 09:03:55 -0400
Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote:

> > Actually, CSF stands in the way of re-allocating memory to other
> > contexts, because once we've allocated memory to a tiler heap, the FW
> > manages this pool of chunks, and recycles them. Mesa can intercept
> > the "returned chunks" and collect those chunks instead of re-assiging
> > then to the tiler heap through a CS instruction (which goes thought
> > the FW internallu), but that involves extra collaboration between the
> > UMD, KMD and FW which we don't have at the moment. Not saying never,
> > but I'd rather fix things gradually (first the blocking alloc in the
> > fence-signalling path, then the optimization to share the extra mem
> > reservation cost among contexts by returning the chunks to the global
> > kernel pool rather than directly to the heap).
> > 
> > This approach should work fine with JM GPUs where the tiler heap is
> > entirely managed by the KMD though.  
> 
> I really think CSF should be relying on the simple heuristics with
> incremental-rendering, unless you can prove that's actually a
> performance issue in practice. (On Imagination/Apple parts, it almost
> never is and we rely entirely on this approach. It's ok - it really is.
> For simple 2D workloads, the initial heap allocation is fine. For 3D
> scenes, we need very few frames to get the right size. this doesn't
> cause stutters in practice.)

Yep I agree, hence the "let's try the simple thing first and let's see
if we actually need the more complex stuff later". My hope is that we'll
never need it, but I hate to make definitive statements, because it
usually bites me back when I do :P.

> 
> For JM .. yes, this discussion remains relevant of course.

I'm still trying to see if we can emulate/have incremental-rendering on
JM hardware, so it really becomes a Lima-only issue. According to Erik,
predicting how much heap is needed is much more predictible on Utgard
(no indirect draws, simpler binning hierarchy, and other details he
mentioned which I forgot).

Alyssa Rosenzweig April 14, 2025, 1:42 p.m. UTC | #29

> I'm still trying to see if we can emulate/have incremental-rendering on
> JM hardware

I guess since we don't advertise vertex shader side effects... Lol,
maybe that could work...

Simona Vetter April 14, 2025, 2:34 p.m. UTC | #30

On Mon, Apr 14, 2025 at 02:08:25PM +0100, Liviu Dudau wrote:
> On Mon, Apr 14, 2025 at 01:22:06PM +0200, Boris Brezillon wrote:
> > Hi Sima,
> > 
> > On Fri, 11 Apr 2025 14:01:16 +0200
> > Simona Vetter <simona.vetter@ffwll.ch> wrote:
> > 
> > > On Thu, Apr 10, 2025 at 08:41:55PM +0200, Boris Brezillon wrote:
> > > > On Thu, 10 Apr 2025 14:01:03 -0400
> > > > Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote:
> > > >   
> > > > > > > > In Panfrost and Lima, we don't have this concept of "incremental
> > > > > > > > rendering", so when we fail the allocation, we just fail the GPU job
> > > > > > > > with an unhandled GPU fault.      
> > > > > > > 
> > > > > > > To be honest I think that this is enough to mark those two drivers as
> > > > > > > broken.  It's documented that this approach is a no-go for upstream
> > > > > > > drivers.
> > > > > > > 
> > > > > > > How widely is that used?    
> > > > > > 
> > > > > > It exists in lima and panfrost, and I wouldn't be surprised if a similar
> > > > > > mechanism was used in other drivers for tiler-based GPUs (etnaviv,
> > > > > > freedreno, powervr, ...), because ultimately that's how tilers work:
> > > > > > the amount of memory needed to store per-tile primitives (and metadata)
> > > > > > depends on what the geometry pipeline feeds the tiler with, and that
> > > > > > can't be predicted. If you over-provision, that's memory the system won't
> > > > > > be able to use while rendering takes place, even though only a small
> > > > > > portion might actually be used by the GPU. If your allocation is too
> > > > > > small, it will either trigger a GPU fault (for HW not supporting an
> > > > > > "incremental rendering" mode) or under-perform (because flushing
> > > > > > primitives has a huge cost on tilers).    
> > > > > 
> > > > > Yes and no.
> > > > > 
> > > > > Although we can't allocate more memory for /this/ frame, we know the
> > > > > required size is probably constant across its lifetime. That gives a
> > > > > simple heuristic to manage the tiler heap efficiently without
> > > > > allocations - even fallible ones - in the fence signal path:
> > > > > 
> > > > > * Start with a small fixed size tiler heap
> > > > > * Try to render, let incremental rendering kick in when it's too small.
> > > > > * When cleaning up the job, check if we used incremental rendering.
> > > > > * If we did - double the size of the heap the next time we submit work.
> > > > > 
> > > > > The tiler heap still grows dynamically - it just does so over the span
> > > > > of a couple frames. In practice that means a tiny hit to startup time as
> > > > > we dynamically figure out the right size, incurring extra flushing at
> > > > > the start, without needing any "grow-on-page-fault" heroics.
> > > > > 
> > > > > This should solve the problem completely for CSF/panthor. So it's only
> > > > > hardware that architecturally cannot do incremental rendering (older
> > > > > Mali: panfrost/lima) where we need this mess.  
> > > > 
> > > > OTOH, if we need something
> > > > for Utgard(Lima)/Midgard/Bifrost/Valhall(Panfrost), why not use the same
> > > > thing for CSF, since CSF is arguably the sanest of all the HW
> > > > architectures listed above: allocation can fail/be non-blocking,
> > > > because there's a fallback to incremental rendering when it fails.  
> > > 
> > > So this is a really horrible idea to sort this out for panfrost hardware,
> > > which doesn't have a tiler cache flush as a fallback. It's roughly three
> > > stages:
> > > 
> > > 1. A pile of clever tricks to make the chances of running out of memory
> > > really low. Most of these also make sense for panthor platforms, just as a
> > > performance optimization.
> > > 
> > > 2. I terrible way to handle the unavoidable VK_DEVICE_LOST, but in a way
> > > such that the impact should be minimal. This is nasty, and we really want
> > > to avoid that for panthor.
> > > 
> > > 3. Mesa quirks so that 2 doesn't actually ever happen in practice.
> > > 
> > > 1. Clever tricks
> > > ----------------
> > > 
> > > This is a cascade of tricks we can pull in the gpu fault handler:
> > > 
> > > 1a. Allocate with GFP_NORECLAIM. We want this first because that triggers
> > >   background reclaim, and that might be enough to get us through and free
> > >   some easy caches (like clean fs cache and stuff like that which can just
> > >   be dropped).
> > 
> > There's no GFP_NORECLAIM, and given the discussions we had before, I
> > guess you meant GFP_NOWAIT. Otherwise it's the __GFP_NOWARN |
> > __GFP_NORETRY I used in this series, and it probably doesn't try hard
> > enough as pointed out by you and Christian.
> > 
> > > 
> > > 1b Userspace needs to guesstimate a good guess for how much we'll need. I'm
> > >   hoping that between render target size and maybe counting the total
> > >   amounts of vertices we can do a decent guesstimate for many workloads.
> > 
> > There are extra parameters to take into account, like the tile
> > hierarchy mask (number of binning lists instantiated) and probably
> > other things I forget, but for simple vertex+fragment pipelines and
> > direct draws, guessing the worst memory usage case is probably doable.
> > Throw indirect draws into the mix, and it suddenly becomes a lot more
> > complicated. Not even talking about GEOM/TESS stages, which makes the
> > guessing even harder AFAICT.
> > 
> > >   Note that goal here is not to ensure success, but just to get the rough
> > >   ballpark. The actual starting number here should aim fairly low, so that
> > >   we avoid wasting memory since this is memory wasted on every context
> > >   (that uses a feature which needs dynamic memory allocation, which I
> > >   guess for pan* is everything, but for apple it would be more limited).
> > 
> > Ack.
> > 
> > > 
> > > 1c The kernel then keeps an additional global memory pool. Note this would
> > >   not have the same semantics as mempool.h, which is aimed GFP_NOIO
> > >   forward progress guarantees, but more as a preallocation pool. In every
> > >   CS ioctl we'll make sure the pool is filled, and we probably want to
> > >   size the pool relative to the context with the biggest dynamic memory
> > >   usage. So probably this thing needs a shrinker, so we can reclaim it
> > >   when you don't run an app with a huge buffer need on the gpu anymore.
> > 
> > Okay, that's a technique Arm has been using in their downstream driver
> > (it named JIT-allocation there).
> > 
> > > 
> > >   Note that we're still not sizing this to guarantee success, but together
> > >   with the userspace heuristics it should be big enough to almost always
> > >   work out. And since it's global reserve we can afford to waste a bit
> > >   more memory on this one. We might also want to scale this pool by the
> > >   total memory available, like the watermarks core mm computes. We'll only
> > >   hang onto this memory when the gpu is in active usage, so this should be
> > >   fine.
> > 
> > Sounds like a good idea.
> > 
> > > 
> > >   Also the preallocation would need to happen without holding the memory
> > >   pool look, so that we can use GFP_KERNEL.
> > > 
> > > Up to this point I think it's all tricks that panthor also wants to
> > > employ.
> > > 
> > > 1d Next up is scratch dynamic memory. If we can assume that the memory does
> > >   not need to survive a batchbuffer (hopefully the case with vulkan render
> > >   pass) we could steal such memory from other contexts. We could even do
> > >   that for contexts which are queued but not yet running on the hardware
> > >   (might need unloading them to be doable with fw renderers like
> > >   panthor/CSF) as long as we keep such stolen dynamic memory on a separate
> > >   free list. Because if we'd entirely free this, or release it into the
> > >   memory pool we'll make things worse for these other contexts, we need to
> > >   be able to guarantee that any context can always get all the stolen
> > >   dynamic pages back before we start running it on the gpu.
> > 
> > Actually, CSF stands in the way of re-allocating memory to other
> > contexts, because once we've allocated memory to a tiler heap, the FW
> > manages this pool of chunks, and recycles them. Mesa can intercept
> > the "returned chunks" and collect those chunks instead of re-assiging
> > then to the tiler heap through a CS instruction (which goes thought
> > the FW internallu), but that involves extra collaboration between the
> > UMD, KMD and FW which we don't have at the moment. Not saying never,
> > but I'd rather fix things gradually (first the blocking alloc in the
> > fence-signalling path, then the optimization to share the extra mem
> > reservation cost among contexts by returning the chunks to the global
> > kernel pool rather than directly to the heap).
> 
> The additional issue with borrowing memory from idle contexts is that it will
> involve MMU operations, as we will have to move the memory into the active
> context address space. CSF GPUs have a limitation that they can only work with
> one address space for the active job when it comes to memory used internally
> by the job, so we either have to map the scratch dynamic memory in all the
> jobs before we submit them, or we will have to do MMU maintainance operations
> in the OOM path in order to borrow memory from other contexts.

Hm, this could be tricky. So mmu operations shouldn't be an issue because
they must work for GFP_NOFS contexts for i/o writeback. You might need to
much more carefully manage this and make sure the iommu has big enough
range of pagetables preallocated. This also holds for the other games
we're playing here, at least for gpu pagetables. But since pagetables are
really small overhead it might be good to somewhat aggressively
preallocate them.

But yeah this is possible an issue if you you need iommu wrangling, I
have honestly not looked at the exact rules in there.
-Sima

> Overall I think the design could work and as Boris pointed out Arm already
> has something similar in the Just In Time memory pool design, it's just that
> we need to play with different rules in the upstream as the downstream bypasses
> the DRM framework so it has fewer rules to abide by.
> 
> Best regards,
> Liviu
> 
> > 
> > This approach should work fine with JM GPUs where the tiler heap is
> > entirely managed by the KMD though.
> > 
> > > 
> > > Since the tracking for the memory pool in 1c and the stolen memory in 1d
> > > has a bunch of tricky corner cases we probably want that as a drm helper
> > > which globally does that book-keeping for all drm/sched instances. That
> > > might even help if some NN accel thing also needs this on the same SoC.
> > 
> > Yeah, I think whatever we go for here should be exposed as generic
> > helpers with proper documentation to explain when and how to use those,
> > because otherwise we're back to the original situation where each
> > driver copies the pattern of another driver and adjust it to its needs,
> > even when the original design was unsound.
> > 
> > > 
> > > 1e Finally, if all else is lost we can try GFP_ATOMIC. This will eat into
> > >   reserve memory pools the core mm maintains. And hopefully we've spent
> > >   enough time between this step and the initial GFP_NORECLAIM that
> > >   background reclaim had a chance to free some memory, hence why all the
> > >   memory pool and memory stealing tricks should be in between.
> > 
> > Any reason not to use GFP_NOWAIT on this last attempt?
> > I'm not sure I like the idea of stealing memory from the reserved
> > amount of atomic mem. I mean, if we get there and there's still not
> > enough memory to satisfy a "normal" allocation, it's probably a good
> > time to fail with a actual OOM.
> > 
> > > 
> > > The above will need quite some tuning to make sure it works as often as
> > > possible, without wasting undue amounts of memory. It's classic memory
> > > overcommit tuning.
> > > 
> > > 2. Device Lost
> > > --------------
> > > 
> > > At this point we're left with no other choice than to kill the context.
> > > And userspace should be able to cope with VK_DEVICE_LOST (hopefully zink
> > > does), but it will probably not cope well with an entire strom of these
> > > just to get the first frame out.
> > > 
> > > Here comes the horrible trick:
> > > 
> > > We'll keep rendering the entire frame by just smashing one single reserve
> > > page (per context) into the pte every time there's a fault. It will result
> > > in total garbage, and we probably want to shot the context the moment the
> > > VS stages have finished, but it allows us to collect an accurate estimate
> > > of how much memory we'd have needed. We need to pass that to the vulkan
> > > driver as part of the device lost processing, so that it can keep that as
> > > the starting point for the userspace dynamic memory requirement
> > > guesstimate as a lower bound. Together with the (scaled to that
> > > requirement) gpu driver memory pool and the core mm watermarks, that
> > > should allow us to not hit a device lost again hopefully.
> > 
> > If the intent is to progressively increase the size of this
> > global memory pool, and the min-resident-size of the growable buffer
> > that triggered the OOM, I guess we can just go for something dumb like
> > double-the-size or increment-it-in-steps and flag the buffer on which
> > the fault happened such that userspace can query that at DEVICE_LOST
> > time and properly resize this internal buffer next time a
> > CreateDevice() coming from the same process/thread is created. I'm not
> > sure we need to actually execute the shader after the OOM to accurately
> > guess the size.
> > 
> > > 
> > > I think if the CS ioctl both takes the current guesstimate from userspace,
> > > and passes back whatever the current maximum known to the kernel is (we
> > > need that anyway for the steps in stage 1 to make sense I think), then
> > > that should also work for dead context where the CS ioctl returns -EIO or
> > > something like that.
> > > 
> > > 3. Mesa quirks
> > > --------------
> > > 
> > > A database of the fixed dynamic memory requirements we get from step 2
> > > (through bug reports), so that we can avoid that mess going forward.
> > 
> > Yeah, we already have a dri-conf to force the initial size of the tiler
> > heap on CSF hardware, so we could easily ship a DB of per-app initial
> > size.
> > 
> > > 
> > > If there's too many of those, we'll probably need to improve the
> > > guesstimation 1b if it's systematically off by too much. We might even
> > > need to store that on-disk, like shader caches, so that you get a crash
> > > once and then it should work even without an explicit mesa quirk.
> > 
> > Makes sense. Maybe some sort of implicit dri-conf DB that overloads the
> > explicit one?
> > 
> > > 
> > > Documentation
> > > -------------
> > > 
> > > Assuming this is not too terrible an idea and we reach some consensus, I
> > > think it'd be good to bake this into a doc patch somewhere in the
> > > dma_fence documentation. Also including Alyssa recipe for when you have
> > > hardware support for flushing partial rendering. And maybe generalized so
> > > it makes sense for the GS/streamout/... fun on apple hardware.
> > 
> > I think the tricky part is properly guessing the amount of memory
> > needed for a draw. For simple vertex+fragment pipelines with direct
> > draws, that's doable, but for more complex stuff like GEOM/TESS and
> > indirect draws, that gets tricky, because some of the parameters are
> > not known at CS recording time. So my gut feeling is that, rather than
> > trying to find a smart way to guesstimate those stuff, we should rely
> > on simple heuristics like the
> > double-the-resident-size-on-the-faulty-sparse-GEM thing suggested by
> > Alyssa.
> > 
> > Many thanks to all the people who chimed in. I'll wait until the
> > discussion settles on something that works for everyone and try to
> > sum-up the outcome of this discussion in some
> > drm-tile-based-renderer.rst doc file (or any other file name you think
> > this should be put in) so we have a clear plan of what we want to
> > do/how we want it done.
> > 
> > Regards,
> > 
> > Boris
> 
> -- 
> ====================
> | I would like to |
> | fix the world,  |
> | but they're not |
> | giving me the   |
>  \ source code!  /
>   ---------------
>     ¯\_(ツ)_/¯

Simona Vetter April 14, 2025, 2:46 p.m. UTC | #31

On Mon, Apr 14, 2025 at 01:22:06PM +0200, Boris Brezillon wrote:
> Hi Sima,
> 
> On Fri, 11 Apr 2025 14:01:16 +0200
> Simona Vetter <simona.vetter@ffwll.ch> wrote:
> 
> > On Thu, Apr 10, 2025 at 08:41:55PM +0200, Boris Brezillon wrote:
> > > On Thu, 10 Apr 2025 14:01:03 -0400
> > > Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote:
> > >   
> > > > > > > In Panfrost and Lima, we don't have this concept of "incremental
> > > > > > > rendering", so when we fail the allocation, we just fail the GPU job
> > > > > > > with an unhandled GPU fault.      
> > > > > > 
> > > > > > To be honest I think that this is enough to mark those two drivers as
> > > > > > broken.  It's documented that this approach is a no-go for upstream
> > > > > > drivers.
> > > > > > 
> > > > > > How widely is that used?    
> > > > > 
> > > > > It exists in lima and panfrost, and I wouldn't be surprised if a similar
> > > > > mechanism was used in other drivers for tiler-based GPUs (etnaviv,
> > > > > freedreno, powervr, ...), because ultimately that's how tilers work:
> > > > > the amount of memory needed to store per-tile primitives (and metadata)
> > > > > depends on what the geometry pipeline feeds the tiler with, and that
> > > > > can't be predicted. If you over-provision, that's memory the system won't
> > > > > be able to use while rendering takes place, even though only a small
> > > > > portion might actually be used by the GPU. If your allocation is too
> > > > > small, it will either trigger a GPU fault (for HW not supporting an
> > > > > "incremental rendering" mode) or under-perform (because flushing
> > > > > primitives has a huge cost on tilers).    
> > > > 
> > > > Yes and no.
> > > > 
> > > > Although we can't allocate more memory for /this/ frame, we know the
> > > > required size is probably constant across its lifetime. That gives a
> > > > simple heuristic to manage the tiler heap efficiently without
> > > > allocations - even fallible ones - in the fence signal path:
> > > > 
> > > > * Start with a small fixed size tiler heap
> > > > * Try to render, let incremental rendering kick in when it's too small.
> > > > * When cleaning up the job, check if we used incremental rendering.
> > > > * If we did - double the size of the heap the next time we submit work.
> > > > 
> > > > The tiler heap still grows dynamically - it just does so over the span
> > > > of a couple frames. In practice that means a tiny hit to startup time as
> > > > we dynamically figure out the right size, incurring extra flushing at
> > > > the start, without needing any "grow-on-page-fault" heroics.
> > > > 
> > > > This should solve the problem completely for CSF/panthor. So it's only
> > > > hardware that architecturally cannot do incremental rendering (older
> > > > Mali: panfrost/lima) where we need this mess.  
> > > 
> > > OTOH, if we need something
> > > for Utgard(Lima)/Midgard/Bifrost/Valhall(Panfrost), why not use the same
> > > thing for CSF, since CSF is arguably the sanest of all the HW
> > > architectures listed above: allocation can fail/be non-blocking,
> > > because there's a fallback to incremental rendering when it fails.  
> > 
> > So this is a really horrible idea to sort this out for panfrost hardware,
> > which doesn't have a tiler cache flush as a fallback. It's roughly three
> > stages:
> > 
> > 1. A pile of clever tricks to make the chances of running out of memory
> > really low. Most of these also make sense for panthor platforms, just as a
> > performance optimization.
> > 
> > 2. I terrible way to handle the unavoidable VK_DEVICE_LOST, but in a way
> > such that the impact should be minimal. This is nasty, and we really want
> > to avoid that for panthor.
> > 
> > 3. Mesa quirks so that 2 doesn't actually ever happen in practice.
> > 
> > 1. Clever tricks
> > ----------------
> > 
> > This is a cascade of tricks we can pull in the gpu fault handler:
> > 
> > 1a. Allocate with GFP_NORECLAIM. We want this first because that triggers
> >   background reclaim, and that might be enough to get us through and free
> >   some easy caches (like clean fs cache and stuff like that which can just
> >   be dropped).
> 
> There's no GFP_NORECLAIM, and given the discussions we had before, I
> guess you meant GFP_NOWAIT. Otherwise it's the __GFP_NOWARN |
> __GFP_NORETRY I used in this series, and it probably doesn't try hard
> enough as pointed out by you and Christian.

Yeah it's GFP_NOWAIT. What I mean with GFP_NORECLAIM is
!GFP_DIRECT_RECLAIM, which is the nasty one that's not allowed in
dma_fence signalling critical paths.

> > 1b Userspace needs to guesstimate a good guess for how much we'll need. I'm
> >   hoping that between render target size and maybe counting the total
> >   amounts of vertices we can do a decent guesstimate for many workloads.
> 
> There are extra parameters to take into account, like the tile
> hierarchy mask (number of binning lists instantiated) and probably
> other things I forget, but for simple vertex+fragment pipelines and
> direct draws, guessing the worst memory usage case is probably doable.
> Throw indirect draws into the mix, and it suddenly becomes a lot more
> complicated. Not even talking about GEOM/TESS stages, which makes the
> guessing even harder AFAICT.

The point here is not to make a pessimistic guess, but one that's roughly
a good ballpark for most games/workloads, so that we don't eat too much
into the various emergency watermarks.

> >   Note that goal here is not to ensure success, but just to get the rough
> >   ballpark. The actual starting number here should aim fairly low, so that
> >   we avoid wasting memory since this is memory wasted on every context
> >   (that uses a feature which needs dynamic memory allocation, which I
> >   guess for pan* is everything, but for apple it would be more limited).
> 
> Ack.
> 
> > 
> > 1c The kernel then keeps an additional global memory pool. Note this would
> >   not have the same semantics as mempool.h, which is aimed GFP_NOIO
> >   forward progress guarantees, but more as a preallocation pool. In every
> >   CS ioctl we'll make sure the pool is filled, and we probably want to
> >   size the pool relative to the context with the biggest dynamic memory
> >   usage. So probably this thing needs a shrinker, so we can reclaim it
> >   when you don't run an app with a huge buffer need on the gpu anymore.
> 
> Okay, that's a technique Arm has been using in their downstream driver
> (it named JIT-allocation there).
> 
> > 
> >   Note that we're still not sizing this to guarantee success, but together
> >   with the userspace heuristics it should be big enough to almost always
> >   work out. And since it's global reserve we can afford to waste a bit
> >   more memory on this one. We might also want to scale this pool by the
> >   total memory available, like the watermarks core mm computes. We'll only
> >   hang onto this memory when the gpu is in active usage, so this should be
> >   fine.
> 
> Sounds like a good idea.
> 
> > 
> >   Also the preallocation would need to happen without holding the memory
> >   pool look, so that we can use GFP_KERNEL.
> > 
> > Up to this point I think it's all tricks that panthor also wants to
> > employ.
> > 
> > 1d Next up is scratch dynamic memory. If we can assume that the memory does
> >   not need to survive a batchbuffer (hopefully the case with vulkan render
> >   pass) we could steal such memory from other contexts. We could even do
> >   that for contexts which are queued but not yet running on the hardware
> >   (might need unloading them to be doable with fw renderers like
> >   panthor/CSF) as long as we keep such stolen dynamic memory on a separate
> >   free list. Because if we'd entirely free this, or release it into the
> >   memory pool we'll make things worse for these other contexts, we need to
> >   be able to guarantee that any context can always get all the stolen
> >   dynamic pages back before we start running it on the gpu.
> 
> Actually, CSF stands in the way of re-allocating memory to other
> contexts, because once we've allocated memory to a tiler heap, the FW
> manages this pool of chunks, and recycles them. Mesa can intercept
> the "returned chunks" and collect those chunks instead of re-assiging
> then to the tiler heap through a CS instruction (which goes thought
> the FW internallu), but that involves extra collaboration between the
> UMD, KMD and FW which we don't have at the moment. Not saying never,
> but I'd rather fix things gradually (first the blocking alloc in the
> fence-signalling path, then the optimization to share the extra mem
> reservation cost among contexts by returning the chunks to the global
> kernel pool rather than directly to the heap).
> 
> This approach should work fine with JM GPUs where the tiler heap is
> entirely managed by the KMD though.

Yeah stealing scratch memory is probably way too expensive if you can just
flush. And if the fw already does that, there's no point. But for hw where
you cannot flush it might just be enough to get you out of a bind.

> > Since the tracking for the memory pool in 1c and the stolen memory in 1d
> > has a bunch of tricky corner cases we probably want that as a drm helper
> > which globally does that book-keeping for all drm/sched instances. That
> > might even help if some NN accel thing also needs this on the same SoC.
> 
> Yeah, I think whatever we go for here should be exposed as generic
> helpers with proper documentation to explain when and how to use those,
> because otherwise we're back to the original situation where each
> driver copies the pattern of another driver and adjust it to its needs,
> even when the original design was unsound.
> 
> > 
> > 1e Finally, if all else is lost we can try GFP_ATOMIC. This will eat into
> >   reserve memory pools the core mm maintains. And hopefully we've spent
> >   enough time between this step and the initial GFP_NORECLAIM that
> >   background reclaim had a chance to free some memory, hence why all the
> >   memory pool and memory stealing tricks should be in between.
> 
> Any reason not to use GFP_NOWAIT on this last attempt?
> I'm not sure I like the idea of stealing memory from the reserved
> amount of atomic mem. I mean, if we get there and there's still not
> enough memory to satisfy a "normal" allocation, it's probably a good
> time to fail with a actual OOM.

GFP_ATOMIC includes __GFP_HIGH, which allows you to dip into emergency
watermarks (meaning more memory reserves before this fails). We might
actually want GFP_ATOMIC | __GPF_NOWARN though.

If you just try once more with GFP_NOWAIT then there's not really much of
a point.

> > The above will need quite some tuning to make sure it works as often as
> > possible, without wasting undue amounts of memory. It's classic memory
> > overcommit tuning.
> > 
> > 2. Device Lost
> > --------------
> > 
> > At this point we're left with no other choice than to kill the context.
> > And userspace should be able to cope with VK_DEVICE_LOST (hopefully zink
> > does), but it will probably not cope well with an entire strom of these
> > just to get the first frame out.
> > 
> > Here comes the horrible trick:
> > 
> > We'll keep rendering the entire frame by just smashing one single reserve
> > page (per context) into the pte every time there's a fault. It will result
> > in total garbage, and we probably want to shot the context the moment the
> > VS stages have finished, but it allows us to collect an accurate estimate
> > of how much memory we'd have needed. We need to pass that to the vulkan
> > driver as part of the device lost processing, so that it can keep that as
> > the starting point for the userspace dynamic memory requirement
> > guesstimate as a lower bound. Together with the (scaled to that
> > requirement) gpu driver memory pool and the core mm watermarks, that
> > should allow us to not hit a device lost again hopefully.
> 
> If the intent is to progressively increase the size of this
> global memory pool, and the min-resident-size of the growable buffer
> that triggered the OOM, I guess we can just go for something dumb like
> double-the-size or increment-it-in-steps and flag the buffer on which
> the fault happened such that userspace can query that at DEVICE_LOST
> time and properly resize this internal buffer next time a
> CreateDevice() coming from the same process/thread is created. I'm not
> sure we need to actually execute the shader after the OOM to accurately
> guess the size.

The point is to not incrementally increase, but get it right with just one
crash. I think incrementally increasing through a DEVICE_LOST is a no-go
design. Or at least I think we should aim better.

> > I think if the CS ioctl both takes the current guesstimate from userspace,
> > and passes back whatever the current maximum known to the kernel is (we
> > need that anyway for the steps in stage 1 to make sense I think), then
> > that should also work for dead context where the CS ioctl returns -EIO or
> > something like that.
> > 
> > 3. Mesa quirks
> > --------------
> > 
> > A database of the fixed dynamic memory requirements we get from step 2
> > (through bug reports), so that we can avoid that mess going forward.
> 
> Yeah, we already have a dri-conf to force the initial size of the tiler
> heap on CSF hardware, so we could easily ship a DB of per-app initial
> size.
> 
> > 
> > If there's too many of those, we'll probably need to improve the
> > guesstimation 1b if it's systematically off by too much. We might even
> > need to store that on-disk, like shader caches, so that you get a crash
> > once and then it should work even without an explicit mesa quirk.
> 
> Makes sense. Maybe some sort of implicit dri-conf DB that overloads the
> explicit one?

Yeah, maybe something like that. And then take the maximum of both.

> > Documentation
> > -------------
> > 
> > Assuming this is not too terrible an idea and we reach some consensus, I
> > think it'd be good to bake this into a doc patch somewhere in the
> > dma_fence documentation. Also including Alyssa recipe for when you have
> > hardware support for flushing partial rendering. And maybe generalized so
> > it makes sense for the GS/streamout/... fun on apple hardware.
> 
> I think the tricky part is properly guessing the amount of memory
> needed for a draw. For simple vertex+fragment pipelines with direct
> draws, that's doable, but for more complex stuff like GEOM/TESS and
> indirect draws, that gets tricky, because some of the parameters are
> not known at CS recording time. So my gut feeling is that, rather than
> trying to find a smart way to guesstimate those stuff, we should rely
> on simple heuristics like the
> double-the-resident-size-on-the-faulty-sparse-GEM thing suggested by
> Alyssa.

The point is to not correctly guess, but guess at 80/99% confidence
intervals. Memory overcommit is a statistical game, not a correctness
game. So for the guesstimates userspace does you essentially take some
interesting things like:

- tiler layout or whatever really has a big impact, which probably
  includes framebuffer size
- draw sizes
- does it use GS/streamout/whatever else has a big impact

Then gather a _lot_ of data from different games and make a linear
regression model that captures like 80% (wild guess, no idea whether
that's good) of games for the 1b estimate, 99% of everything you have for
the kernel pool (maybe a bit less, depends how much the emergency pools
from core mm) provides.

And the probably a different estimate to add a notch more when you have a
data from step 2.

It's not about the perfect estimates here at all, just heuristics.

> Many thanks to all the people who chimed in. I'll wait until the
> discussion settles on something that works for everyone and try to
> sum-up the outcome of this discussion in some
> drm-tile-based-renderer.rst doc file (or any other file name you think
> this should be put in) so we have a clear plan of what we want to
> do/how we want it done.

Yeah sounds like we have a plan.

Cheers, Sima

Boris Brezillon April 14, 2025, 3:15 p.m. UTC | #32

On Mon, 14 Apr 2025 16:34:47 +0200
Simona Vetter <simona.vetter@ffwll.ch> wrote:

> On Mon, Apr 14, 2025 at 02:08:25PM +0100, Liviu Dudau wrote:
> > On Mon, Apr 14, 2025 at 01:22:06PM +0200, Boris Brezillon wrote:  
> > > Hi Sima,
> > > 
> > > On Fri, 11 Apr 2025 14:01:16 +0200
> > > Simona Vetter <simona.vetter@ffwll.ch> wrote:
> > >   
> > > > On Thu, Apr 10, 2025 at 08:41:55PM +0200, Boris Brezillon wrote:  
> > > > > On Thu, 10 Apr 2025 14:01:03 -0400
> > > > > Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote:
> > > > >     
> > > > > > > > > In Panfrost and Lima, we don't have this concept of "incremental
> > > > > > > > > rendering", so when we fail the allocation, we just fail the GPU job
> > > > > > > > > with an unhandled GPU fault.        
> > > > > > > > 
> > > > > > > > To be honest I think that this is enough to mark those two drivers as
> > > > > > > > broken.  It's documented that this approach is a no-go for upstream
> > > > > > > > drivers.
> > > > > > > > 
> > > > > > > > How widely is that used?      
> > > > > > > 
> > > > > > > It exists in lima and panfrost, and I wouldn't be surprised if a similar
> > > > > > > mechanism was used in other drivers for tiler-based GPUs (etnaviv,
> > > > > > > freedreno, powervr, ...), because ultimately that's how tilers work:
> > > > > > > the amount of memory needed to store per-tile primitives (and metadata)
> > > > > > > depends on what the geometry pipeline feeds the tiler with, and that
> > > > > > > can't be predicted. If you over-provision, that's memory the system won't
> > > > > > > be able to use while rendering takes place, even though only a small
> > > > > > > portion might actually be used by the GPU. If your allocation is too
> > > > > > > small, it will either trigger a GPU fault (for HW not supporting an
> > > > > > > "incremental rendering" mode) or under-perform (because flushing
> > > > > > > primitives has a huge cost on tilers).      
> > > > > > 
> > > > > > Yes and no.
> > > > > > 
> > > > > > Although we can't allocate more memory for /this/ frame, we know the
> > > > > > required size is probably constant across its lifetime. That gives a
> > > > > > simple heuristic to manage the tiler heap efficiently without
> > > > > > allocations - even fallible ones - in the fence signal path:
> > > > > > 
> > > > > > * Start with a small fixed size tiler heap
> > > > > > * Try to render, let incremental rendering kick in when it's too small.
> > > > > > * When cleaning up the job, check if we used incremental rendering.
> > > > > > * If we did - double the size of the heap the next time we submit work.
> > > > > > 
> > > > > > The tiler heap still grows dynamically - it just does so over the span
> > > > > > of a couple frames. In practice that means a tiny hit to startup time as
> > > > > > we dynamically figure out the right size, incurring extra flushing at
> > > > > > the start, without needing any "grow-on-page-fault" heroics.
> > > > > > 
> > > > > > This should solve the problem completely for CSF/panthor. So it's only
> > > > > > hardware that architecturally cannot do incremental rendering (older
> > > > > > Mali: panfrost/lima) where we need this mess.    
> > > > > 
> > > > > OTOH, if we need something
> > > > > for Utgard(Lima)/Midgard/Bifrost/Valhall(Panfrost), why not use the same
> > > > > thing for CSF, since CSF is arguably the sanest of all the HW
> > > > > architectures listed above: allocation can fail/be non-blocking,
> > > > > because there's a fallback to incremental rendering when it fails.    
> > > > 
> > > > So this is a really horrible idea to sort this out for panfrost hardware,
> > > > which doesn't have a tiler cache flush as a fallback. It's roughly three
> > > > stages:
> > > > 
> > > > 1. A pile of clever tricks to make the chances of running out of memory
> > > > really low. Most of these also make sense for panthor platforms, just as a
> > > > performance optimization.
> > > > 
> > > > 2. I terrible way to handle the unavoidable VK_DEVICE_LOST, but in a way
> > > > such that the impact should be minimal. This is nasty, and we really want
> > > > to avoid that for panthor.
> > > > 
> > > > 3. Mesa quirks so that 2 doesn't actually ever happen in practice.
> > > > 
> > > > 1. Clever tricks
> > > > ----------------
> > > > 
> > > > This is a cascade of tricks we can pull in the gpu fault handler:
> > > > 
> > > > 1a. Allocate with GFP_NORECLAIM. We want this first because that triggers
> > > >   background reclaim, and that might be enough to get us through and free
> > > >   some easy caches (like clean fs cache and stuff like that which can just
> > > >   be dropped).  
> > > 
> > > There's no GFP_NORECLAIM, and given the discussions we had before, I
> > > guess you meant GFP_NOWAIT. Otherwise it's the __GFP_NOWARN |
> > > __GFP_NORETRY I used in this series, and it probably doesn't try hard
> > > enough as pointed out by you and Christian.
> > >   
> > > > 
> > > > 1b Userspace needs to guesstimate a good guess for how much we'll need. I'm
> > > >   hoping that between render target size and maybe counting the total
> > > >   amounts of vertices we can do a decent guesstimate for many workloads.  
> > > 
> > > There are extra parameters to take into account, like the tile
> > > hierarchy mask (number of binning lists instantiated) and probably
> > > other things I forget, but for simple vertex+fragment pipelines and
> > > direct draws, guessing the worst memory usage case is probably doable.
> > > Throw indirect draws into the mix, and it suddenly becomes a lot more
> > > complicated. Not even talking about GEOM/TESS stages, which makes the
> > > guessing even harder AFAICT.
> > >   
> > > >   Note that goal here is not to ensure success, but just to get the rough
> > > >   ballpark. The actual starting number here should aim fairly low, so that
> > > >   we avoid wasting memory since this is memory wasted on every context
> > > >   (that uses a feature which needs dynamic memory allocation, which I
> > > >   guess for pan* is everything, but for apple it would be more limited).  
> > > 
> > > Ack.
> > >   
> > > > 
> > > > 1c The kernel then keeps an additional global memory pool. Note this would
> > > >   not have the same semantics as mempool.h, which is aimed GFP_NOIO
> > > >   forward progress guarantees, but more as a preallocation pool. In every
> > > >   CS ioctl we'll make sure the pool is filled, and we probably want to
> > > >   size the pool relative to the context with the biggest dynamic memory
> > > >   usage. So probably this thing needs a shrinker, so we can reclaim it
> > > >   when you don't run an app with a huge buffer need on the gpu anymore.  
> > > 
> > > Okay, that's a technique Arm has been using in their downstream driver
> > > (it named JIT-allocation there).
> > >   
> > > > 
> > > >   Note that we're still not sizing this to guarantee success, but together
> > > >   with the userspace heuristics it should be big enough to almost always
> > > >   work out. And since it's global reserve we can afford to waste a bit
> > > >   more memory on this one. We might also want to scale this pool by the
> > > >   total memory available, like the watermarks core mm computes. We'll only
> > > >   hang onto this memory when the gpu is in active usage, so this should be
> > > >   fine.  
> > > 
> > > Sounds like a good idea.
> > >   
> > > > 
> > > >   Also the preallocation would need to happen without holding the memory
> > > >   pool look, so that we can use GFP_KERNEL.
> > > > 
> > > > Up to this point I think it's all tricks that panthor also wants to
> > > > employ.
> > > > 
> > > > 1d Next up is scratch dynamic memory. If we can assume that the memory does
> > > >   not need to survive a batchbuffer (hopefully the case with vulkan render
> > > >   pass) we could steal such memory from other contexts. We could even do
> > > >   that for contexts which are queued but not yet running on the hardware
> > > >   (might need unloading them to be doable with fw renderers like
> > > >   panthor/CSF) as long as we keep such stolen dynamic memory on a separate
> > > >   free list. Because if we'd entirely free this, or release it into the
> > > >   memory pool we'll make things worse for these other contexts, we need to
> > > >   be able to guarantee that any context can always get all the stolen
> > > >   dynamic pages back before we start running it on the gpu.  
> > > 
> > > Actually, CSF stands in the way of re-allocating memory to other
> > > contexts, because once we've allocated memory to a tiler heap, the FW
> > > manages this pool of chunks, and recycles them. Mesa can intercept
> > > the "returned chunks" and collect those chunks instead of re-assiging
> > > then to the tiler heap through a CS instruction (which goes thought
> > > the FW internallu), but that involves extra collaboration between the
> > > UMD, KMD and FW which we don't have at the moment. Not saying never,
> > > but I'd rather fix things gradually (first the blocking alloc in the
> > > fence-signalling path, then the optimization to share the extra mem
> > > reservation cost among contexts by returning the chunks to the global
> > > kernel pool rather than directly to the heap).  
> > 
> > The additional issue with borrowing memory from idle contexts is that it will
> > involve MMU operations, as we will have to move the memory into the active
> > context address space. CSF GPUs have a limitation that they can only work with
> > one address space for the active job when it comes to memory used internally
> > by the job, so we either have to map the scratch dynamic memory in all the
> > jobs before we submit them, or we will have to do MMU maintainance operations
> > in the OOM path in order to borrow memory from other contexts.  
> 
> Hm, this could be tricky. So mmu operations shouldn't be an issue because
> they must work for GFP_NOFS contexts for i/o writeback. You might need to
> much more carefully manage this and make sure the iommu has big enough
> range of pagetables preallocated. This also holds for the other games
> we're playing here, at least for gpu pagetables. But since pagetables are
> really small overhead it might be good to somewhat aggressively
> preallocate them.
> 
> But yeah this is possible an issue if you you need iommu wrangling, I
> have honestly not looked at the exact rules in there.

We already have a mechanism to pre-allocate page tables in panthor (for
aync VM_BIND requests where we're not allowed to allocate in the
run_job() path), but as said before, I probably won't try this global
mem pool thing on Panthor, since Panthor can do without it for now.
The page table pre-allocation mechanism is something we can easily
transpose to panfrost though.

Steven Price April 14, 2025, 3:34 p.m. UTC | #33

On 14/04/2025 13:47, Boris Brezillon wrote:
> On Fri, 11 Apr 2025 16:39:02 +0200
> Boris Brezillon <boris.brezillon@collabora.com> wrote:
> 
>> On Fri, 11 Apr 2025 15:13:26 +0200
>> Christian König <christian.koenig@amd.com> wrote:
>>
>>>>    
>>>>> Background is that you don't get a crash, nor error message, nor
>>>>> anything indicating what is happening.    
>>>> The job times out at some point, but we might get stuck in the fault
>>>> handler waiting for memory, which is pretty close to a deadlock, I
>>>> suspect.    
>>>
>>> I don't know those drivers that well, but at least for amdgpu the
>>> problem would be that the timeout handling would need to grab some of
>>> the locks the memory management is holding waiting for the timeout
>>> handling to do something....
>>>
>>> So that basically perfectly closes the circle. With a bit of lock you
>>> get a message after some time that the kernel is stuck, but since
>>> that are all sleeping locks I strongly doubt so.
>>>
>>> As immediately action please provide patches which changes those
>>> GFP_KERNEL into GFP_NOWAIT.  
>>
>> Sure, I can do that.
> 
> Hm, I might have been too prompt at claiming this was doable. In
> practice, doing that might regress Lima and Panfrost in situations
> where trying harder than GFP_NOWAIT would free up some memory. Not
> saying this was right to use GFP_KERNEL in the first place, but some
> expectations were set by this original mistake, so I'll probably need
> Lima developers to vouch in for this change after they've done some
> testing on a system under high memory pressure, and I'd need to do the
> same kind of testing for Panfrost and ask Steve if he's okay with that
> too.

It's a tricky one. The ideal would be to teach user space how to recover
from the memory allocation failure (even on older GPUs it's still in
theory possible to break up the work and do incremental rendering). But
that ship sailed long ago so this would be a regression.

I did consider GFP_ATOMIC as an option, but really we shouldn't be
accessing "memory reserves" in such a situation.

For background: In the "kbase" driver (Linux kernel driver for the
closed source user space 'DDK' driver for Midgard/Bifrost GPUs) we had a
"JIT memory allocator". The idea here was that if you have multiple
renderings in flight you can attempt to share the tiler heap memory
between them. So in this case when we can't allocate more memory and we
know there's another tiler heap which is going to be freed by a fragment
job that's already running, we can block knowing the memory is going to
become available.

It was designed to do the same thing as CSF's incremental rendering -
allow us to opportunistically allocate memory but not fail the rendering
if it wasn't available.

But it was a nightmare to have any confidence of it being deadlock free
and the implementation was frankly terrible - which is ultimately why
CSF GPU's have this ability in hardware to perform incremental rendering
without failing the job. But effectively this approach requires
allocating just enough memory for one complete tiler heap while ensuring
forward progress and opportunistically allowing extra memory to give a
performance boost.

TLDR; I think we should try switching to GFP_NOWAIT in Panfrost and do
some testing with memory pressure. It might be acceptable (and an
occasional job failure is better than an occasional lock up). If it
turns out it's too easy to trigger job failures like this then we'll
need to rethink.

> For Panthor, I'm less worried, because we have the incremental rendering
> fallback, and assuming GFP_NOWAIT tries hard enough to reclaim
> low-hanging fruits, the perfs shouldn't suffer much more than they
> would today with GFP_KERNEL allocations potentially delaying tiling
> operations longer than would have been with a primitive flush.

Yes for Panthor I think the approach is obvious - I believe GFP_NOWAIT
should trigger background reclaim, so over the course of a few frames it
should make the memory available (assuming there is sufficient 'free'
memory). Indeed it might even give a performance boost if it stops us
getting blocked in direct reclaim.

Thanks,
Steve

Boris Brezillon April 15, 2025, 9:47 a.m. UTC | #34

On Mon, 14 Apr 2025 16:34:35 +0100
Steven Price <steven.price@arm.com> wrote:

> On 14/04/2025 13:47, Boris Brezillon wrote:
> > On Fri, 11 Apr 2025 16:39:02 +0200
> > Boris Brezillon <boris.brezillon@collabora.com> wrote:
> >   
> >> On Fri, 11 Apr 2025 15:13:26 +0200
> >> Christian König <christian.koenig@amd.com> wrote:
> >>  
> >>>>      
> >>>>> Background is that you don't get a crash, nor error message, nor
> >>>>> anything indicating what is happening.      
> >>>> The job times out at some point, but we might get stuck in the fault
> >>>> handler waiting for memory, which is pretty close to a deadlock, I
> >>>> suspect.      
> >>>
> >>> I don't know those drivers that well, but at least for amdgpu the
> >>> problem would be that the timeout handling would need to grab some of
> >>> the locks the memory management is holding waiting for the timeout
> >>> handling to do something....
> >>>
> >>> So that basically perfectly closes the circle. With a bit of lock you
> >>> get a message after some time that the kernel is stuck, but since
> >>> that are all sleeping locks I strongly doubt so.
> >>>
> >>> As immediately action please provide patches which changes those
> >>> GFP_KERNEL into GFP_NOWAIT.    
> >>
> >> Sure, I can do that.  
> > 
> > Hm, I might have been too prompt at claiming this was doable. In
> > practice, doing that might regress Lima and Panfrost in situations
> > where trying harder than GFP_NOWAIT would free up some memory. Not
> > saying this was right to use GFP_KERNEL in the first place, but some
> > expectations were set by this original mistake, so I'll probably need
> > Lima developers to vouch in for this change after they've done some
> > testing on a system under high memory pressure, and I'd need to do the
> > same kind of testing for Panfrost and ask Steve if he's okay with that
> > too.  
> 
> It's a tricky one. The ideal would be to teach user space how to recover
> from the memory allocation failure (even on older GPUs it's still in
> theory possible to break up the work and do incremental rendering). But
> that ship sailed long ago so this would be a regression.
> 
> I did consider GFP_ATOMIC as an option, but really we shouldn't be
> accessing "memory reserves" in such a situation.
> 
> For background: In the "kbase" driver (Linux kernel driver for the
> closed source user space 'DDK' driver for Midgard/Bifrost GPUs) we had a
> "JIT memory allocator". The idea here was that if you have multiple
> renderings in flight you can attempt to share the tiler heap memory
> between them. So in this case when we can't allocate more memory and we
> know there's another tiler heap which is going to be freed by a fragment
> job that's already running, we can block knowing the memory is going to
> become available.
> 
> It was designed to do the same thing as CSF's incremental rendering -
> allow us to opportunistically allocate memory but not fail the rendering
> if it wasn't available.
> 
> But it was a nightmare to have any confidence of it being deadlock free
> and the implementation was frankly terrible - which is ultimately why
> CSF GPU's have this ability in hardware to perform incremental rendering
> without failing the job. But effectively this approach requires
> allocating just enough memory for one complete tiler heap while ensuring
> forward progress and opportunistically allowing extra memory to give a
> performance boost.
> 
> TLDR; I think we should try switching to GFP_NOWAIT in Panfrost and do
> some testing with memory pressure. It might be acceptable (and an
> occasional job failure is better than an occasional lock up). If it
> turns out it's too easy to trigger job failures like this then we'll
> need to rethink.

I thought about this incremental-rendering-on-JM thing during the past
few days, and I'd like to run one idea through you if you don't mind.
What if we tried to implement incremental rendering the way it's done
for CSF, that is, when we get a fault on a tiler heap object we:

1. flush caches on the tiler heap range (with a FLUSH_MEM command on the
   faulty AS), such that any primitive data that have been queued so far
   get pushed to the memory

2. use a different AS for an IR (Increment Rendering) fragment job that
   have been provided by the UMD at submission time, just like the
   CSF backend of the UMD registers an exception handler for tiler OOMs
   at the moment.
   Make it so this IR fragment job chain is queued immediately after the
   currently running fragment job (if any). This IR fragment job should
   consume all the primitives written so far and push the result to a
   framebuffer. There's a bit of patching to do on the final
   fragment job chain, because the FB preload setup will differ if
   IR took place, but that should be doable with simple WRITE_VALUE jobs
   turning NULL jobs into FRAGMENT jobs (or the other way around)
   such that they get skipped/activated depending on whether IR took
   place or not.

3. collect pages covering consumed primitives and make the heap range
   covering those consumed pages fault on further accesses (basically
   iommu->unmap() the section we collected pages from). We probably
   need to keep a couple pages around the top/bottom tiler heap
   pointers, because some partially written primitives might be crossing
   a heap chunk boundary, but assuming our heap has more than 4 chunks
   available (which we can ensure at tiler heap allocation time), we
   should be good.

4. use one of the collected pages to satisfy the growing request, and
   acknowledge the fault on the faulty AS. We can pick from the pool
   of collected pages to satisfy new growing requests until it's
   depleted again.

5. repeat steps 1-4 until all primitives are flushed.

6. reset the tiler heap mapping by doing an iommu->unmap() on the whole
   heap BO range, and collect all the pages that were previously
   allocated to the heap such that the next allocation can be satisfied
   immediately

Of course, this only works if primitives are added to the list only
when they are fully written (that is, all data for the primitive has
been written, and the primitive can be consumed by the fragment job
without hazards).

Just to be clear, this is a long term plan to try and fix this on JM
HW, and given the changes it involves (UMD needs to be taught about IR
on JM), we'll still need the GFP_NOWAIT fix for the time being.

Daniel Stone April 15, 2025, 12:39 p.m. UTC | #35

Hi,

On Mon, 14 Apr 2025 at 16:34, Steven Price <steven.price@arm.com> wrote:
> On 14/04/2025 13:47, Boris Brezillon wrote:
> > Hm, I might have been too prompt at claiming this was doable. In
> > practice, doing that might regress Lima and Panfrost in situations
> > where trying harder than GFP_NOWAIT would free up some memory. Not
> > saying this was right to use GFP_KERNEL in the first place, but some
> > expectations were set by this original mistake, so I'll probably need
> > Lima developers to vouch in for this change after they've done some
> > testing on a system under high memory pressure, and I'd need to do the
> > same kind of testing for Panfrost and ask Steve if he's okay with that
> > too.
>
> It's a tricky one. The ideal would be to teach user space how to recover
> from the memory allocation failure (even on older GPUs it's still in
> theory possible to break up the work and do incremental rendering). But
> that ship sailed long ago so this would be a regression.

Yeah. Vulkan clients are a little better in this regard, but GL
clients are pretty useless at dealing with device loss across the
board.

Cheers,
Daniel

Steven Price April 16, 2025, 3:16 p.m. UTC | #36

On 15/04/2025 10:47, Boris Brezillon wrote:
> On Mon, 14 Apr 2025 16:34:35 +0100
> Steven Price <steven.price@arm.com> wrote:
> 
>> On 14/04/2025 13:47, Boris Brezillon wrote:
>>> On Fri, 11 Apr 2025 16:39:02 +0200
>>> Boris Brezillon <boris.brezillon@collabora.com> wrote:
>>>   
>>>> On Fri, 11 Apr 2025 15:13:26 +0200
>>>> Christian König <christian.koenig@amd.com> wrote:
>>>>  
>>>>>>      
>>>>>>> Background is that you don't get a crash, nor error message, nor
>>>>>>> anything indicating what is happening.      
>>>>>> The job times out at some point, but we might get stuck in the fault
>>>>>> handler waiting for memory, which is pretty close to a deadlock, I
>>>>>> suspect.      
>>>>>
>>>>> I don't know those drivers that well, but at least for amdgpu the
>>>>> problem would be that the timeout handling would need to grab some of
>>>>> the locks the memory management is holding waiting for the timeout
>>>>> handling to do something....
>>>>>
>>>>> So that basically perfectly closes the circle. With a bit of lock you
>>>>> get a message after some time that the kernel is stuck, but since
>>>>> that are all sleeping locks I strongly doubt so.
>>>>>
>>>>> As immediately action please provide patches which changes those
>>>>> GFP_KERNEL into GFP_NOWAIT.    
>>>>
>>>> Sure, I can do that.  
>>>
>>> Hm, I might have been too prompt at claiming this was doable. In
>>> practice, doing that might regress Lima and Panfrost in situations
>>> where trying harder than GFP_NOWAIT would free up some memory. Not
>>> saying this was right to use GFP_KERNEL in the first place, but some
>>> expectations were set by this original mistake, so I'll probably need
>>> Lima developers to vouch in for this change after they've done some
>>> testing on a system under high memory pressure, and I'd need to do the
>>> same kind of testing for Panfrost and ask Steve if he's okay with that
>>> too.  
>>
>> It's a tricky one. The ideal would be to teach user space how to recover
>> from the memory allocation failure (even on older GPUs it's still in
>> theory possible to break up the work and do incremental rendering). But
>> that ship sailed long ago so this would be a regression.
>>
>> I did consider GFP_ATOMIC as an option, but really we shouldn't be
>> accessing "memory reserves" in such a situation.
>>
>> For background: In the "kbase" driver (Linux kernel driver for the
>> closed source user space 'DDK' driver for Midgard/Bifrost GPUs) we had a
>> "JIT memory allocator". The idea here was that if you have multiple
>> renderings in flight you can attempt to share the tiler heap memory
>> between them. So in this case when we can't allocate more memory and we
>> know there's another tiler heap which is going to be freed by a fragment
>> job that's already running, we can block knowing the memory is going to
>> become available.
>>
>> It was designed to do the same thing as CSF's incremental rendering -
>> allow us to opportunistically allocate memory but not fail the rendering
>> if it wasn't available.
>>
>> But it was a nightmare to have any confidence of it being deadlock free
>> and the implementation was frankly terrible - which is ultimately why
>> CSF GPU's have this ability in hardware to perform incremental rendering
>> without failing the job. But effectively this approach requires
>> allocating just enough memory for one complete tiler heap while ensuring
>> forward progress and opportunistically allowing extra memory to give a
>> performance boost.
>>
>> TLDR; I think we should try switching to GFP_NOWAIT in Panfrost and do
>> some testing with memory pressure. It might be acceptable (and an
>> occasional job failure is better than an occasional lock up). If it
>> turns out it's too easy to trigger job failures like this then we'll
>> need to rethink.
> 
> I thought about this incremental-rendering-on-JM thing during the past
> few days, and I'd like to run one idea through you if you don't mind.
> What if we tried to implement incremental rendering the way it's done
> for CSF, that is, when we get a fault on a tiler heap object we:

CSF adds the ability for the command stream to make decisions between
'jobs'. kbase has the concept of 'soft jobs' allowing the kernel to do
actions between jobs. I fear to make this work we'd need something
similar - see below.

> 1. flush caches on the tiler heap range (with a FLUSH_MEM command on the
>    faulty AS), such that any primitive data that have been queued so far
>    get pushed to the memory

So I'm not entirely sure whether you are proposing doing this on the
back of a MMU fault or not. Doing a FLUSH_MEM while the tiler is waiting
for a memory fault to be resolved is unlikely to work (it won't flush
the tiler's internal caches). The tiler is a very complex beast and has
some elaborate caches. We don't support soft-stopping tiler jobs (they
just run to completion) because the H/W guys could never figure out how
to safely stop the tiler - so I very much doubt we can deal with the
half-completed state of the tiler.

> 2. use a different AS for an IR (Increment Rendering) fragment job that
>    have been provided by the UMD at submission time, just like the
>    CSF backend of the UMD registers an exception handler for tiler OOMs
>    at the moment.

It's an interesting idea - I worry about internal deadlocks within the
GPU. There's some magic which ties vertex processing and the tiler
together, and I'm not sure whether a hung tiler job could lead to hung
vertex tasks on the shader cores. But it's possible my worries are
unfounded.

>    Make it so this IR fragment job chain is queued immediately after the
>    currently running fragment job (if any). This IR fragment job should
>    consume all the primitives written so far and push the result to a
>    framebuffer. There's a bit of patching to do on the final
>    fragment job chain, because the FB preload setup will differ if
>    IR took place, but that should be doable with simple WRITE_VALUE jobs
>    turning NULL jobs into FRAGMENT jobs (or the other way around)
>    such that they get skipped/activated depending on whether IR took
>    place or not.

You're probably more of an expert than me on that part - it certainly
sounds plausible.

> 3. collect pages covering consumed primitives and make the heap range
>    covering those consumed pages fault on further accesses (basically
>    iommu->unmap() the section we collected pages from). We probably
>    need to keep a couple pages around the top/bottom tiler heap
>    pointers, because some partially written primitives might be crossing
>    a heap chunk boundary, but assuming our heap has more than 4 chunks
>    available (which we can ensure at tiler heap allocation time), we
>    should be good.

I'm not sure if it's sufficient to keep the last 'n' chunks. But at
least in theory it should be possible to identify chunks that are still
active.

> 4. use one of the collected pages to satisfy the growing request, and
>    acknowledge the fault on the faulty AS. We can pick from the pool
>    of collected pages to satisfy new growing requests until it's
>    depleted again.
> 
> 5. repeat steps 1-4 until all primitives are flushed.

Running the IR fragment jobs from part way through the tiler heap on
subsequent renderings could be tricky - I'm not sure if we have a way of
doing that?

> 6. reset the tiler heap mapping by doing an iommu->unmap() on the whole
>    heap BO range, and collect all the pages that were previously
>    allocated to the heap such that the next allocation can be satisfied
>    immediately
> 
> Of course, this only works if primitives are added to the list only
> when they are fully written (that is, all data for the primitive has
> been written, and the primitive can be consumed by the fragment job
> without hazards).

Yeah, the hardware certainly wasn't designed for this ;)

The H/W designers approach is simple:

 1. Submit vertex/tiler job with small tiler heap

 2. Job fails with OUT_OF_MEMORY, capture index of last completed vertex
    which is written to the job descriptor

 3. Submit fragment job to incrementally render.

 4. Submit new vertex/tiler job using the index captured in step 2.

 5. Loop back to 2.

The problem from an implementation point of view is either you give up
on fences (bounce back to user space between each step), or you need
start writing job descriptors in the kernel.

Clearly one option is to go the whole hog and have something looking
like the CSF firmware running in the kernel. It's not a popular option
though! ;)

> Just to be clear, this is a long term plan to try and fix this on JM
> HW, and given the changes it involves (UMD needs to be taught about IR
> on JM), we'll still need the GFP_NOWAIT fix for the time being.

Agreed, and I like the ingenuity of it - but having seen the H/W team
give up on soft-stopping the tiler I'm not convinced it will work
leaving the tiler waiting on an MMU fault.

Thanks,
Steve

Boris Brezillon April 16, 2025, 3:53 p.m. UTC | #37

On Wed, 16 Apr 2025 16:16:05 +0100
Steven Price <steven.price@arm.com> wrote:

> On 15/04/2025 10:47, Boris Brezillon wrote:
> > On Mon, 14 Apr 2025 16:34:35 +0100
> > Steven Price <steven.price@arm.com> wrote:
> >   
> >> On 14/04/2025 13:47, Boris Brezillon wrote:  
> >>> On Fri, 11 Apr 2025 16:39:02 +0200
> >>> Boris Brezillon <boris.brezillon@collabora.com> wrote:
> >>>     
> >>>> On Fri, 11 Apr 2025 15:13:26 +0200
> >>>> Christian König <christian.koenig@amd.com> wrote:
> >>>>    
> >>>>>>        
> >>>>>>> Background is that you don't get a crash, nor error message, nor
> >>>>>>> anything indicating what is happening.        
> >>>>>> The job times out at some point, but we might get stuck in the fault
> >>>>>> handler waiting for memory, which is pretty close to a deadlock, I
> >>>>>> suspect.        
> >>>>>
> >>>>> I don't know those drivers that well, but at least for amdgpu the
> >>>>> problem would be that the timeout handling would need to grab some of
> >>>>> the locks the memory management is holding waiting for the timeout
> >>>>> handling to do something....
> >>>>>
> >>>>> So that basically perfectly closes the circle. With a bit of lock you
> >>>>> get a message after some time that the kernel is stuck, but since
> >>>>> that are all sleeping locks I strongly doubt so.
> >>>>>
> >>>>> As immediately action please provide patches which changes those
> >>>>> GFP_KERNEL into GFP_NOWAIT.      
> >>>>
> >>>> Sure, I can do that.    
> >>>
> >>> Hm, I might have been too prompt at claiming this was doable. In
> >>> practice, doing that might regress Lima and Panfrost in situations
> >>> where trying harder than GFP_NOWAIT would free up some memory. Not
> >>> saying this was right to use GFP_KERNEL in the first place, but some
> >>> expectations were set by this original mistake, so I'll probably need
> >>> Lima developers to vouch in for this change after they've done some
> >>> testing on a system under high memory pressure, and I'd need to do the
> >>> same kind of testing for Panfrost and ask Steve if he's okay with that
> >>> too.    
> >>
> >> It's a tricky one. The ideal would be to teach user space how to recover
> >> from the memory allocation failure (even on older GPUs it's still in
> >> theory possible to break up the work and do incremental rendering). But
> >> that ship sailed long ago so this would be a regression.
> >>
> >> I did consider GFP_ATOMIC as an option, but really we shouldn't be
> >> accessing "memory reserves" in such a situation.
> >>
> >> For background: In the "kbase" driver (Linux kernel driver for the
> >> closed source user space 'DDK' driver for Midgard/Bifrost GPUs) we had a
> >> "JIT memory allocator". The idea here was that if you have multiple
> >> renderings in flight you can attempt to share the tiler heap memory
> >> between them. So in this case when we can't allocate more memory and we
> >> know there's another tiler heap which is going to be freed by a fragment
> >> job that's already running, we can block knowing the memory is going to
> >> become available.
> >>
> >> It was designed to do the same thing as CSF's incremental rendering -
> >> allow us to opportunistically allocate memory but not fail the rendering
> >> if it wasn't available.
> >>
> >> But it was a nightmare to have any confidence of it being deadlock free
> >> and the implementation was frankly terrible - which is ultimately why
> >> CSF GPU's have this ability in hardware to perform incremental rendering
> >> without failing the job. But effectively this approach requires
> >> allocating just enough memory for one complete tiler heap while ensuring
> >> forward progress and opportunistically allowing extra memory to give a
> >> performance boost.
> >>
> >> TLDR; I think we should try switching to GFP_NOWAIT in Panfrost and do
> >> some testing with memory pressure. It might be acceptable (and an
> >> occasional job failure is better than an occasional lock up). If it
> >> turns out it's too easy to trigger job failures like this then we'll
> >> need to rethink.  
> > 
> > I thought about this incremental-rendering-on-JM thing during the past
> > few days, and I'd like to run one idea through you if you don't mind.
> > What if we tried to implement incremental rendering the way it's done
> > for CSF, that is, when we get a fault on a tiler heap object we:  
> 
> CSF adds the ability for the command stream to make decisions between
> 'jobs'. kbase has the concept of 'soft jobs' allowing the kernel to do
> actions between jobs. I fear to make this work we'd need something
> similar - see below.
> 
> > 1. flush caches on the tiler heap range (with a FLUSH_MEM command on the
> >    faulty AS), such that any primitive data that have been queued so far
> >    get pushed to the memory  
> 
> So I'm not entirely sure whether you are proposing doing this on the
> back of a MMU fault or not. Doing a FLUSH_MEM while the tiler is waiting
> for a memory fault to be resolved is unlikely to work (it won't flush
> the tiler's internal caches). The tiler is a very complex beast and has
> some elaborate caches. We don't support soft-stopping tiler jobs (they
> just run to completion) because the H/W guys could never figure out how
> to safely stop the tiler - so I very much doubt we can deal with the
> half-completed state of the tiler.
> 
> > 2. use a different AS for an IR (Increment Rendering) fragment job that
> >    have been provided by the UMD at submission time, just like the
> >    CSF backend of the UMD registers an exception handler for tiler OOMs
> >    at the moment.  
> 
> It's an interesting idea - I worry about internal deadlocks within the
> GPU. There's some magic which ties vertex processing and the tiler
> together, and I'm not sure whether a hung tiler job could lead to hung
> vertex tasks on the shader cores. But it's possible my worries are
> unfounded.
> 
> >    Make it so this IR fragment job chain is queued immediately after the
> >    currently running fragment job (if any). This IR fragment job should
> >    consume all the primitives written so far and push the result to a
> >    framebuffer. There's a bit of patching to do on the final
> >    fragment job chain, because the FB preload setup will differ if
> >    IR took place, but that should be doable with simple WRITE_VALUE jobs
> >    turning NULL jobs into FRAGMENT jobs (or the other way around)
> >    such that they get skipped/activated depending on whether IR took
> >    place or not.  
> 
> You're probably more of an expert than me on that part - it certainly
> sounds plausible.
> 
> > 3. collect pages covering consumed primitives and make the heap range
> >    covering those consumed pages fault on further accesses (basically
> >    iommu->unmap() the section we collected pages from). We probably
> >    need to keep a couple pages around the top/bottom tiler heap
> >    pointers, because some partially written primitives might be crossing
> >    a heap chunk boundary, but assuming our heap has more than 4 chunks
> >    available (which we can ensure at tiler heap allocation time), we
> >    should be good.  
> 
> I'm not sure if it's sufficient to keep the last 'n' chunks. But at
> least in theory it should be possible to identify chunks that are still
> active.
> 
> > 4. use one of the collected pages to satisfy the growing request, and
> >    acknowledge the fault on the faulty AS. We can pick from the pool
> >    of collected pages to satisfy new growing requests until it's
> >    depleted again.
> > 
> > 5. repeat steps 1-4 until all primitives are flushed.  
> 
> Running the IR fragment jobs from part way through the tiler heap on
> subsequent renderings could be tricky - I'm not sure if we have a way of
> doing that?
> 
> > 6. reset the tiler heap mapping by doing an iommu->unmap() on the whole
> >    heap BO range, and collect all the pages that were previously
> >    allocated to the heap such that the next allocation can be satisfied
> >    immediately
> > 
> > Of course, this only works if primitives are added to the list only
> > when they are fully written (that is, all data for the primitive has
> > been written, and the primitive can be consumed by the fragment job
> > without hazards).  
> 
> Yeah, the hardware certainly wasn't designed for this ;)
> 
> The H/W designers approach is simple:
> 
>  1. Submit vertex/tiler job with small tiler heap
> 
>  2. Job fails with OUT_OF_MEMORY, capture index of last completed vertex
>     which is written to the job descriptor
> 
>  3. Submit fragment job to incrementally render.
> 
>  4. Submit new vertex/tiler job using the index captured in step 2.
> 
>  5. Loop back to 2.
> 
> The problem from an implementation point of view is either you give up
> on fences (bounce back to user space between each step), or you need
> start writing job descriptors in the kernel.
> 
> Clearly one option is to go the whole hog and have something looking
> like the CSF firmware running in the kernel. It's not a popular option
> though! ;)
> 
> > Just to be clear, this is a long term plan to try and fix this on JM
> > HW, and given the changes it involves (UMD needs to be taught about IR
> > on JM), we'll still need the GFP_NOWAIT fix for the time being.  
> 
> Agreed, and I like the ingenuity of it - but having seen the H/W team
> give up on soft-stopping the tiler I'm not convinced it will work
> leaving the tiler waiting on an MMU fault.

Yeah, looking at the answers you gave, it doesn't sounds like this is
going to work, but thanks for taking the time to look at my crazy idea
:-).

Boris

[v3,0/8] drm: Introduce sparse GEM shmem

Message

Comments