Message ID: 20250404092634.2968115-1-boris.brezillon@collabora.com
Series: drm: Introduce sparse GEM shmem
+Christian, Alyssa and Faith, as suggested by Sima. I'll make sure to Cc you on v4, but before that, I'd like to get your opinion on the drm-gem/drm-gem-shmem changes to see if sending a v4 is actually desirable or if I should go back to the drawing board. On Fri, 4 Apr 2025 11:26:26 +0200 Boris Brezillon <boris.brezillon@collabora.com> wrote: > This patch series is a proposal for implementing sparse page allocations > for shmem objects. It was initially motivated by a kind of BO managed by > the Panfrost/Panthor and Lima drivers, the tiler heap, which grows on > demand every time the GPU faults on a virtual address within the heap BO > range. > > Because keeping a struct page pointer array that can describe the entire > virtual range is wasteful when only a few backing pages have been > allocated, we thought a sparse allocation approach with xarrays was a > more efficient choice. > > This sparse GEM shmem feature takes the form of a new optional > drm_gem_shmem_sparse_backing object that can be attached to a > drm_gem_shmem_object at creation time if the driver wants. When this > sparse GEM shmem backing mode is enabled, the driver is allowed to > partially populate the GEM pages, either at use time, or at fault > time. The page population can be done progressively as the need for > more memory arises. The new APIs take explicit gfp_t flags to solve > a long-standing issue reported by Sima a while ago: all allocations > happening in the fault handler path shouldn't block. > > We also introduce two new helpers at the drm_gem.{c,h} level to > automate the partial xarray population which the GEM-SHMEM logic > relies on to populate its sparse page array. > > A few details about the implementation now: > > - Sparse GEM backing locking is deferred to the driver, because > we can't take the resv locks in the GPU fault handler path, and > since the page population can come from there, we have to let > the driver decide how to lock. > - The pin/get semantics for sparse GEM shmem objects are different, > in that they don't populate the pages, but just flag them as > being used/required by some GPU component. The page population > will happen explicitly at GEM creation time or when a GPU fault > happens. Once pages have been populated, they won't disappear > until the GEM object is destroyed, purged or evicted. This means > you can grow, but not shrink. If we ever need to discard > BO ranges, the API can be extended to allow it, but it's not > something we needed for the Lima/Panthor/Panfrost case. > - Panthor and Panfrost changes have been tested, but not extensively. > Lima changes have only been compile-tested. > > Changes from v2: > - We decided to focus more on the DRM aspects and forget about the > sgt building optimizations (sgt helpers taking an xarray instead of > a plain array). If the xarray -> array copies happening in that > path ever become the bottleneck, we can resurrect those changes. > - We decided to make the sparse GEM changes less invasive by making > them an extra layer on top of drm_gem_shmem_object. What this means > is that sparse GEM shmem can now be used as regular GEM shmem > objects (vmap, pin, export, ... all work as they would on a regular > GEM). When that happens, we just force all pages to be populated, > so we kinda lose the benefit of sparse GEMs, but that's fine, because > in practice, such objects shouldn't be manipulated as regular GEMs. > It just serves as a safety guard to limit the risk of regressions > introduced by these sparse GEM shmem changes.
> > Discussion of previous revision can be found here[2][3]. > > For those used to reviewing gitlab MRs, here's a link [1] to a Draft > MR grouping those changes, but I'm in no way asking that the review > happens on gitlab. > > Regards, > > Boris > > [1]https://gitlab.freedesktop.org/panfrost/linux/-/merge_requests/16 > [2]https://lore.kernel.org/lkml/20250326021433.772196-1-adrian.larumbe@collabora.com/T/ > [3]https://lore.kernel.org/dri-devel/20250218232552.3450939-1-adrian.larumbe@collabora.com/ > > Adrián Larumbe (1): > drm/gem: Add helpers to request a range of pages on a GEM > > Boris Brezillon (7): > drm/gem-shmem: Support sparse backing > drm/panfrost: Switch to sparse gem shmem to implement our > alloc-on-fault > drm/panthor: Add support for alloc-on-fault buffers > drm/panthor: Allow kernel BOs to pass DRM_PANTHOR_BO_ALLOC_ON_FAULT > drm/panthor: Add a panthor_vm_pre_fault_range() helper > drm/panthor: Make heap chunk allocation non-blocking > drm/lima: Use drm_gem_shmem_sparse_backing for heap buffers > > drivers/gpu/drm/drm_gem.c | 134 +++++++ > drivers/gpu/drm/drm_gem_shmem_helper.c | 404 +++++++++++++++++++- > drivers/gpu/drm/lima/lima_gem.c | 89 ++--- > drivers/gpu/drm/lima/lima_gem.h | 1 + > drivers/gpu/drm/lima/lima_vm.c | 48 ++- > drivers/gpu/drm/panfrost/panfrost_drv.c | 2 +- > drivers/gpu/drm/panfrost/panfrost_gem.c | 37 +- > drivers/gpu/drm/panfrost/panfrost_gem.h | 8 +- > drivers/gpu/drm/panfrost/panfrost_mmu.c | 98 ++--- > drivers/gpu/drm/panfrost/panfrost_mmu.h | 2 + > drivers/gpu/drm/panthor/panthor_drv.c | 20 +- > drivers/gpu/drm/panthor/panthor_fw.c | 6 +- > drivers/gpu/drm/panthor/panthor_gem.c | 18 +- > drivers/gpu/drm/panthor/panthor_gem.h | 12 +- > drivers/gpu/drm/panthor/panthor_heap.c | 222 ++++++----- > drivers/gpu/drm/panthor/panthor_mmu.c | 481 ++++++++++++++++++------ > drivers/gpu/drm/panthor/panthor_mmu.h | 2 + > drivers/gpu/drm/panthor/panthor_sched.c | 6 +- > include/drm/drm_gem.h | 14 + > include/drm/drm_gem_shmem_helper.h | 285 +++++++++++++- > include/uapi/drm/panthor_drm.h | 19 +- > 21 files changed, 1492 insertions(+), 416 deletions(-) >
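For readers skimming the cover letter, here is a rough sketch of how a driver's GPU fault handler might use the proposed sparse helpers. The drm_gem_shmem_sparse_populate_range() name and argument order are assumed from a snippet quoted later in this thread; my_heap_bo, my_vm_map_range and NUM_FAULT_PAGES are invented driver-side names, so treat this as an illustration of the intended usage rather than the series' actual API.

/* Illustration only: the sparse helper comes from this series (signature
 * assumed from the snippet quoted later in the thread); my_heap_bo,
 * my_vm_map_range and NUM_FAULT_PAGES are made up for the example.
 */
#include <drm/drm_gem_shmem_helper.h>
#include <linux/gfp.h>

#define NUM_FAULT_PAGES	64	/* growth granularity on a GPU fault */

struct my_heap_bo {
	struct drm_gem_shmem_object base;
	/* driver-specific fields ... */
};

/* Hypothetical driver helper mapping the pages in the GPU VM. */
int my_vm_map_range(struct my_heap_bo *bo, pgoff_t first_page,
		    unsigned int npages);

static int my_heap_grow_on_gpu_fault(struct my_heap_bo *bo, pgoff_t page_offset)
{
	int ret;

	/* We are in the fault/fence-signalling path: the allocation must
	 * not wait for reclaim, otherwise the job fence is not guaranteed
	 * to signal in finite time. On failure the driver falls back
	 * (incremental rendering) or fails the job.
	 */
	ret = drm_gem_shmem_sparse_populate_range(&bo->base, page_offset,
						  NUM_FAULT_PAGES, GFP_NOWAIT,
						  __GFP_NOWARN);
	if (ret)
		return ret;

	/* Map the freshly populated pages into the GPU VM (driver code). */
	return my_vm_map_range(bo, page_offset, NUM_FAULT_PAGES);
}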
Hi Boris, thanks for looping me in. Can you also send the full patch set to me since I don't see that on the mailing list (yet maybe). Am 10.04.25 um 16:48 schrieb Boris Brezillon: > +Christian, Alyssa and Faith, as suggested by Sima. I'll make sure to > Cc you on v4, but before that, I'd like to get your opinion on the > drm-gem/drm-gem-shmem changes to see if sending a v4 is actually > desirable or if I should go back to the drawing board. > > On Fri, 4 Apr 2025 11:26:26 +0200 > Boris Brezillon <boris.brezillon@collabora.com> wrote: > >> This patch series is a proposal for implementing sparse page allocations >> for shmem objects. It was initially motivated by a kind of BO managed by >> the Panfrost/Panthor and Lima drivers, the tiler heap, which grows on >> demand every time the GPU faults on a virtual address within the heap BO >> range. Oh, wait a second! GPU faults and DMA fences are usually fundamentally incompatible. So stuff like filling in GEM objects on demand like you suggest here is usually seen as illegal. All resources must be pre-pinned before the submission. Faulting is only legal when you have something like HMM, SVM or whatever you call it. And then you can just use a plain shmem object to provide you with backing pages. I mean we could in theory allow faulting on GEM objects as well, but we would need to take very strict precautions on that we currently don't have as far as I know. So could you explain how this works in the first place? Regards, Christian. >> >> Because keeping a struct page pointer array that can describe the entire >> virtual range is wasteful when only a few backing pages have been >> allocated, we thought a sparse allocation approach with xarrays was a >> more efficient choice. >> >> This sparse GEM shmem features takes the form of a new optional >> drm_gem_shmem_sparse_backing object that can be attached to a >> drm_gem_shmem_object at creation time if the driver wants. When this >> sparse GEM shmem backing mode is enabled, the driver is allow to >> partially populate the GEM pages, either at use time, or at fault >> time. The page population can be done progressively as the need for >> more memory arises. The new APIs takes explicit gfp_t flags to solve >> a long standing issue reported by Sima a while ago: all allocations >> happening in the fault handler path shouldn't block. >> >> We also introduce two new helpers at the drm_gem.{c,h} level to >> automate the partial xarray population which the GEM-SHMEM logic >> relies on to populate its sparse page array. >> >> A few details about the implementation now: >> >> - Sparse GEM backing locking is deferred to the driver, because >> we can't take the resv locks in the GPU fault handler path, and >> since the page population can come from there, we have to let >> the driver decide how to lock. >> - The pin/get semantics for sparse GEM shmem objects is different, >> in that it doesn't populate the pages, but just flag them as >> being used/required by some GPU component. The page population >> will happen explicitly at GEM creation time or when a GPU fault >> happens. Once pages have been populated, they won't disappear >> until the GEM object is destroyed, purged or evicted. This means >> you can grow, but not shrink. If we ever need to discard >> BO ranges, the API can be extended to allow it, but it's not >> something we needed for the Lima/Panthor/Panfrost case. >> - Panthor and Panfrost changes have been tested, but not extensively. >> Lima changes have only been compile-tested. 
>> >> Changes from v2: >> - We decided to focus more on the DRM aspects and forget about the >> sgt building optimizations (sgt helpers taking an xarray instead of >> a plain array). If the xarray -> array copies happening in that >> path ever become the bottleneck, we can resurrect those changes. >> - We decided to make the sparse GEM changes less invasive by making >> them an extra layer on top of drm_gem_shmem_object. What this means >> is that sparse GEM shmem can now be used as regular GEM shmem >> objects (vmap, pin, export, ... all work as they would on a regular >> GEM). When that happens, we just force all pages to be populated, >> so we kinda lose the benefit of sparse GEMs, but that's fine, because >> in practice, such objects shouldn't be manipulated as regular GEMs. >> It just serves as a safety guard to limit the risk of regressions >> introduced by these sparse GEM shmem changes. >> >> Discussion of previus revision can be found here[2][3]. >> >> For those used to review gitlab MRs, here's a link [1] to a Draft >> MR grouping those changes, but I'm in no way asking that the review >> happens on gitlab. >> >> Regards, >> >> Boris >> >> [1]https://gitlab.freedesktop.org/panfrost/linux/-/merge_requests/16 >> [2]https://lore.kernel.org/lkml/20250326021433.772196-1-adrian.larumbe@collabora.com/T/ >> [3]https://lore.kernel.org/dri-devel/20250218232552.3450939-1-adrian.larumbe@collabora.com/ >> >> Adrián Larumbe (1): >> drm/gem: Add helpers to request a range of pages on a GEM >> >> Boris Brezillon (7): >> drm/gem-shmem: Support sparse backing >> drm/panfrost: Switch to sparse gem shmem to implement our >> alloc-on-fault >> drm/panthor: Add support for alloc-on-fault buffers >> drm/panthor: Allow kernel BOs to pass DRM_PANTHOR_BO_ALLOC_ON_FAULT >> drm/panthor: Add a panthor_vm_pre_fault_range() helper >> drm/panthor: Make heap chunk allocation non-blocking >> drm/lima: Use drm_gem_shmem_sparse_backing for heap buffers >> >> drivers/gpu/drm/drm_gem.c | 134 +++++++ >> drivers/gpu/drm/drm_gem_shmem_helper.c | 404 +++++++++++++++++++- >> drivers/gpu/drm/lima/lima_gem.c | 89 ++--- >> drivers/gpu/drm/lima/lima_gem.h | 1 + >> drivers/gpu/drm/lima/lima_vm.c | 48 ++- >> drivers/gpu/drm/panfrost/panfrost_drv.c | 2 +- >> drivers/gpu/drm/panfrost/panfrost_gem.c | 37 +- >> drivers/gpu/drm/panfrost/panfrost_gem.h | 8 +- >> drivers/gpu/drm/panfrost/panfrost_mmu.c | 98 ++--- >> drivers/gpu/drm/panfrost/panfrost_mmu.h | 2 + >> drivers/gpu/drm/panthor/panthor_drv.c | 20 +- >> drivers/gpu/drm/panthor/panthor_fw.c | 6 +- >> drivers/gpu/drm/panthor/panthor_gem.c | 18 +- >> drivers/gpu/drm/panthor/panthor_gem.h | 12 +- >> drivers/gpu/drm/panthor/panthor_heap.c | 222 ++++++----- >> drivers/gpu/drm/panthor/panthor_mmu.c | 481 ++++++++++++++++++------ >> drivers/gpu/drm/panthor/panthor_mmu.h | 2 + >> drivers/gpu/drm/panthor/panthor_sched.c | 6 +- >> include/drm/drm_gem.h | 14 + >> include/drm/drm_gem_shmem_helper.h | 285 +++++++++++++- >> include/uapi/drm/panthor_drm.h | 19 +- >> 21 files changed, 1492 insertions(+), 416 deletions(-) >>
Hi Christian, On Thu, 10 Apr 2025 17:05:07 +0200 Christian König <christian.koenig@amd.com> wrote: > Hi Boris, > > thanks for looping me in. Can you also send the full patch set to me since I don't see that on the mailing list (yet maybe). > > Am 10.04.25 um 16:48 schrieb Boris Brezillon: > > +Christian, Alyssa and Faith, as suggested by Sima. I'll make sure to > > Cc you on v4, but before that, I'd like to get your opinion on the > > drm-gem/drm-gem-shmem changes to see if sending a v4 is actually > > desirable or if I should go back to the drawing board. > > > > On Fri, 4 Apr 2025 11:26:26 +0200 > > Boris Brezillon <boris.brezillon@collabora.com> wrote: > > > >> This patch series is a proposal for implementing sparse page allocations > >> for shmem objects. It was initially motivated by a kind of BO managed by > >> the Panfrost/Panthor and Lima drivers, the tiler heap, which grows on > >> demand every time the GPU faults on a virtual address within the heap BO > >> range. > > Oh, wait a second! GPU faults and DMA fences are usually fundamentally incompatible. > > So stuff like filling in GEM objects on demand like you suggest here is usually seen as illegal. All resources must be pre-pinned before the submission. Unfortunately, that's already how it's done in lima, panfrost and panthor. > > Faulting is only legal when you have something like HMM, SVM or whatever you call it. And then you can just use a plain shmem object to provide you with backing pages. > > I mean we could in theory allow faulting on GEM objects as well, but we would need to take very strict precautions on that we currently don't have as far as I know. We only use this mechanism for very specific allocations: tiler memory whose maximum size can't be guessed upfront because tile binning is by nature unpredictable. > > So could you explain how this works in the first place? I can explain how this works in Panthor, yes. You get an initial amount of memory that the tiler can use; when it runs out of memory, it will first ask the system for more memory, and if the allocation fails, it will fall back to what they call "incremental rendering", where the already binned primitives are flushed to the FB in order to free memory, and the rendering starts over from there, with the memory that has been freed. In Panthor, this on-demand allocation scheme is something that allows us to speed up rendering when there's memory available, but we can still make progress when that's not the case, hence the failable allocation scheme I'm proposing here. In Panfrost and Lima, we don't have this concept of "incremental rendering", so when we fail the allocation, we just fail the GPU job with an unhandled GPU fault. And that's how it is today: the alloc-on-fault mechanism is being used in at least 3 drivers, and all I'm trying to do here is standardize it and try to document the constraints that come with this model, constraints that are currently being ignored. Like the fact that allocations in the fault handler path shouldn't block, so we're guaranteed to signal the job fence in finite time, and we don't risk a deadlock between the driver shrinker and the job triggering the fault. I'm well aware of the implications of what I'm proposing here, but ignoring the fact that some drivers already violate the rules doesn't make them disappear. So I'd rather work with you and others to clearly state what the alloc-in-fault-path rules are, and enforce them in some helper functions, than pretend these ugly things don't exist. :D Regards, Boris
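To make the flow described above easier to follow, here is illustrative pseudocode of the tiler OOM path; none of these functions or types are real panthor/panfrost symbols, only the ordering (try to grow, fall back to incremental rendering where the hardware supports it, otherwise fail the job) comes from the explanation above.

/* Illustrative pseudocode of the tiler OOM handling described above;
 * all types and helpers are hypothetical.
 */
#include <linux/errno.h>
#include <linux/types.h>

struct tiler_heap;
struct gpu_device;
struct gpu_job { struct tiler_heap *heap; struct gpu_device *gpu; };

int grow_tiler_heap_nonblocking(struct tiler_heap *heap);	/* 0 on success */
bool hw_supports_incremental_rendering(struct gpu_device *gpu);
void resume_job(struct gpu_job *job);
void trigger_incremental_rendering(struct gpu_job *job);
void fail_job(struct gpu_job *job, int err);

static void handle_tiler_oom(struct gpu_job *job)
{
	/* 1. Try to grow the heap. This must not block, since we are on
	 *    the path that leads to the job fence being signalled.
	 */
	if (!grow_tiler_heap_nonblocking(job->heap)) {
		resume_job(job);
		return;
	}

	/* 2. CSF hardware (panthor): flush the primitives binned so far
	 *    to the FB, free the heap memory they used, and continue
	 *    rendering incrementally from there.
	 */
	if (hw_supports_incremental_rendering(job->gpu)) {
		trigger_incremental_rendering(job);
		return;
	}

	/* 3. Older hardware (panfrost/lima): no fallback exists, the job
	 *    terminates with an unhandled GPU fault and its fence signals
	 *    with an error.
	 */
	fail_job(job, -ENOMEM);
}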
Hi Boris, Am 10.04.25 um 17:53 schrieb Boris Brezillon: > Hi Christian, > > On Thu, 10 Apr 2025 17:05:07 +0200 > Christian König <christian.koenig@amd.com> wrote: > >> Hi Boris, >> >> thanks for looping me in. Can you also send the full patch set to me since I don't see that on the mailing list (yet maybe). >> >> Am 10.04.25 um 16:48 schrieb Boris Brezillon: >>> +Christian, Alyssa and Faith, as suggested by Sima. I'll make sure to >>> Cc you on v4, but before that, I'd like to get your opinion on the >>> drm-gem/drm-gem-shmem changes to see if sending a v4 is actually >>> desirable or if I should go back to the drawing board. >>> >>> On Fri, 4 Apr 2025 11:26:26 +0200 >>> Boris Brezillon <boris.brezillon@collabora.com> wrote: >>> >>>> This patch series is a proposal for implementing sparse page allocations >>>> for shmem objects. It was initially motivated by a kind of BO managed by >>>> the Panfrost/Panthor and Lima drivers, the tiler heap, which grows on >>>> demand every time the GPU faults on a virtual address within the heap BO >>>> range. >> Oh, wait a second! GPU faults and DMA fences are usually fundamentally incompatible. >> >> So stuff like filling in GEM objects on demand like you suggest here is usually seen as illegal. All resources must be pre-pinned before the submission. > Unfortunately, that's already how it's done in lima, panfrost and > panthor. > >> Faulting is only legal when you have something like HMM, SVM or whatever you call it. And then you can just use a plain shmem object to provide you with backing pages. >> >> I mean we could in theory allow faulting on GEM objects as well, but we would need to take very strict precautions on that we currently don't have as far as I know. > We only use this mechanism for very specific allocations: tiler memory > whose maximum size can't be guessed upfront because tile binning is by > nature unpredictible. > >> So could you explain how this works in the first place? > I can explain you how this works in Panthor, yes. You get an initial > amount of memory that the tiler can use, when it runs out of memory, it > will first ask the system for more memory, if the allocation fails, it > will fallback to what they call "incremental rendering", where the > already binned primitives are flushed to the FB in order to free memory, > and the rendering starts over from there, with the memory that has been > freed. > > In Panthor, this on-demand allocation scheme is something that allows > us to speed-up rendering when there's memory available, but we can make > progress when that's not the case, hence the failable allocation scheme > I'm proposing here. Puh, that sounds like it can potentially work but is also very very fragile. You must have a code comment when you select the GFP flags how and why that works. > In Panfrost and Lima, we don't have this concept of "incremental > rendering", so when we fail the allocation, we just fail the GPU job > with an unhandled GPU fault. To be honest I think that this is enough to mark those two drivers as broken. It's documented that this approach is a no-go for upstream drivers. How widely is that used? > And that's how it is today, the > alloc-on-fault mechanism is being used in at least 3 drivers, and all > I'm trying to do here is standardize it and try to document the > constraints that comes with this model, constraint that are currently > being ignored. 
Like the fact allocations in the fault handler path > shouldn't block so we're guaranteed to signal the job fence in finite > time, and we don't risk a deadlock between the driver shrinker and the > job triggering the fault. Well on one hand it's good to that we document the pitfalls, but I clearly think that we should *not* spread that into common code. That would give the impression that this actually works and to be honest I should start to charge money to rejecting stuff like that. It would probably get me rich. > I'm well aware of the implications of what I'm proposing here, but > ignoring the fact some drivers already violate the rules don't make them > disappear. So I'd rather work with you and others to clearly state what > the alloc-in-fault-path rules are, and enforce them in some helper > functions, than pretend these ugly things don't exist. :D Yeah, but this kind of says to others it's ok to do this which clearly isn't as far as I can see. What we should do instead is to add this as not ok approaches to the documentation and move on. Regards, Christian. > > Regards, > > Boris
Hi Christian, On Thu, 10 Apr 2025 18:43:29 +0200 Christian König <christian.koenig@amd.com> wrote: > Hi Boris, > > Am 10.04.25 um 17:53 schrieb Boris Brezillon: > > Hi Christian, > > > > On Thu, 10 Apr 2025 17:05:07 +0200 > > Christian König <christian.koenig@amd.com> wrote: > > > >> Hi Boris, > >> > >> thanks for looping me in. Can you also send the full patch set to > >> me since I don't see that on the mailing list (yet maybe). > >> > >> Am 10.04.25 um 16:48 schrieb Boris Brezillon: > >>> +Christian, Alyssa and Faith, as suggested by Sima. I'll make > >>> sure to Cc you on v4, but before that, I'd like to get your > >>> opinion on the drm-gem/drm-gem-shmem changes to see if sending a > >>> v4 is actually desirable or if I should go back to the drawing > >>> board. > >>> > >>> On Fri, 4 Apr 2025 11:26:26 +0200 > >>> Boris Brezillon <boris.brezillon@collabora.com> wrote: > >>> > >>>> This patch series is a proposal for implementing sparse page > >>>> allocations for shmem objects. It was initially motivated by a > >>>> kind of BO managed by the Panfrost/Panthor and Lima drivers, the > >>>> tiler heap, which grows on demand every time the GPU faults on a > >>>> virtual address within the heap BO range. > >> Oh, wait a second! GPU faults and DMA fences are usually > >> fundamentally incompatible. > >> > >> So stuff like filling in GEM objects on demand like you suggest > >> here is usually seen as illegal. All resources must be pre-pinned > >> before the submission. > > Unfortunately, that's already how it's done in lima, panfrost and > > panthor. > > > >> Faulting is only legal when you have something like HMM, SVM or > >> whatever you call it. And then you can just use a plain shmem > >> object to provide you with backing pages. > >> > >> I mean we could in theory allow faulting on GEM objects as well, > >> but we would need to take very strict precautions on that we > >> currently don't have as far as I know. > > We only use this mechanism for very specific allocations: tiler > > memory whose maximum size can't be guessed upfront because tile > > binning is by nature unpredictible. > > > >> So could you explain how this works in the first place? > > I can explain you how this works in Panthor, yes. You get an initial > > amount of memory that the tiler can use, when it runs out of > > memory, it will first ask the system for more memory, if the > > allocation fails, it will fallback to what they call "incremental > > rendering", where the already binned primitives are flushed to the > > FB in order to free memory, and the rendering starts over from > > there, with the memory that has been freed. > > > > In Panthor, this on-demand allocation scheme is something that > > allows us to speed-up rendering when there's memory available, but > > we can make progress when that's not the case, hence the failable > > allocation scheme I'm proposing here. > > Puh, that sounds like it can potentially work but is also very very > fragile. > > You must have a code comment when you select the GFP flags how and > why that works. + /* We want non-blocking allocations, if we're OOM, we just fail the job + * to unblock things. + */ + ret = drm_gem_shmem_sparse_populate_range(&bo->base, page_offset, + NUM_FAULT_PAGES, page_gfp, + __GFP_NORETRY | __GFP_NOWARN); That's what I have right now in the failable allocation path. The tiler chunk allocation uses the same flags, but doesn't have a comment explaining that a fallback exists when the allocation fails. 
> > > In Panfrost and Lima, we don't have this concept of "incremental > > rendering", so when we fail the allocation, we just fail the GPU job > > with an unhandled GPU fault. > > To be honest I think that this is enough to mark those two drivers as > broken. It's documented that this approach is a no-go for upstream > drivers. > > How widely is that used? It exists in lima and panfrost, and I wouldn't be surprised if a similar mechanism was used in other drivers for tiler-based GPUs (etnaviv, freedreno, powervr, ...), because ultimately that's how tilers work: the amount of memory needed to store per-tile primitives (and metadata) depends on what the geometry pipeline feeds the tiler with, and that can't be predicted. If you over-provision, that's memory the system won't be able to use while rendering takes place, even though only a small portion might actually be used by the GPU. If your allocation is too small, it will either trigger a GPU fault (for HW not supporting an "incremental rendering" mode) or under-perform (because flushing primitives has a huge cost on tilers). Calling these drivers broken without having a plan to fix them is also not an option. > > > And that's how it is today, the > > alloc-on-fault mechanism is being used in at least 3 drivers, and > > all I'm trying to do here is standardize it and try to document the > > constraints that comes with this model, constraint that are > > currently being ignored. Like the fact allocations in the fault > > handler path shouldn't block so we're guaranteed to signal the job > > fence in finite time, and we don't risk a deadlock between the > > driver shrinker and the job triggering the fault. > > Well on one hand it's good to that we document the pitfalls, but I > clearly think that we should *not* spread that into common code. Ack on not encouraging people to use that; but having a clear path describing how this should be done for HW that doesn't have other options, with helpers and extensive doc is IMHO better than letting new drivers reproduce the mistakes that were made in the past. Because, if you tell people "don't do that", but don't have an alternative to propose, they'll end up doing it anyway. > > That would give the impression that this actually works and to be > honest I should start to charge money to rejecting stuff like that. > It would probably get me rich. > > > I'm well aware of the implications of what I'm proposing here, but > > ignoring the fact some drivers already violate the rules don't make > > them disappear. So I'd rather work with you and others to clearly > > state what the alloc-in-fault-path rules are, and enforce them in > > some helper functions, than pretend these ugly things don't exist. > > :D > > Yeah, but this kind of says to others it's ok to do this which > clearly isn't as far as I can see. Not really. At least not if we properly review any driver that uses these APIs, and clearly state in the doc that this is provided as a last resort for HW that can't do without on-fault-allocation, because they are designed to work this way. > > What we should do instead is to add this as not ok approaches to the > documentation and move on. Well, that's what happened with panfrost, lima and panthor, and see where we are now: 3 drivers that you consider broken (probably rightfully), and more to come if we don't come up with a plan for HW that has the same requirements (as I said, I wouldn't be surprised if most tilers were relying on the same kind of on-demand-allocation).
As always, I appreciate your valuable input, and would be happy to try anything you think might be better suited than what is proposed here, but saying "this is broken HW/driver, so let's ignore it" is not really an option, at least if you don't want the bad design pattern to spread. Regards, Boris
> > > In Panfrost and Lima, we don't have this concept of "incremental > > > rendering", so when we fail the allocation, we just fail the GPU job > > > with an unhandled GPU fault. > > > > To be honest I think that this is enough to mark those two drivers as > > broken. It's documented that this approach is a no-go for upstream > > drivers. > > > > How widely is that used? > > It exists in lima and panfrost, and I wouldn't be surprised if a similar > mechanism was used in other drivers for tiler-based GPUs (etnaviv, > freedreno, powervr, ...), because ultimately that's how tilers work: > the amount of memory needed to store per-tile primitives (and metadata) > depends on what the geometry pipeline feeds the tiler with, and that > can't be predicted. If you over-provision, that's memory the system won't > be able to use while rendering takes place, even though only a small > portion might actually be used by the GPU. If your allocation is too > small, it will either trigger a GPU fault (for HW not supporting an > "incremental rendering" mode) or under-perform (because flushing > primitives has a huge cost on tilers). Yes and no. Although we can't allocate more memory for /this/ frame, we know the required size is probably constant across its lifetime. That gives a simple heuristic to manage the tiler heap efficiently without allocations - even fallible ones - in the fence signal path: * Start with a small fixed size tiler heap * Try to render, let incremental rendering kick in when it's too small. * When cleaning up the job, check if we used incremental rendering. * If we did - double the size of the heap the next time we submit work. The tiler heap still grows dynamically - it just does so over the span of a couple frames. In practice that means a tiny hit to startup time as we dynamically figure out the right size, incurring extra flushing at the start, without needing any "grow-on-page-fault" heroics. This should solve the problem completely for CSF/panthor. So it's only hardware that architecturally cannot do incremental rendering (older Mali: panfrost/lima) where we need this mess.
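A minimal sketch of the heuristic Alyssa describes, with invented names; only the start-small / double-after-incremental-rendering logic comes from her list above.

/* Sketch of the heap-sizing heuristic described above; every name here
 * is invented for illustration.
 */
#include <stdbool.h>

#define INITIAL_TILER_HEAP_SIZE	(512u * 1024)
#define MAX_TILER_HEAP_SIZE	(128u * 1024 * 1024)

struct render_ctx {
	unsigned int tiler_heap_size;	/* starts at INITIAL_TILER_HEAP_SIZE */
};

struct render_job;
struct render_job *build_job(struct render_ctx *ctx, unsigned int heap_size);
void submit_and_wait(struct render_job *job);
bool job_used_incremental_rendering(struct render_job *job);
void cleanup_job(struct render_job *job);

static void submit_frame(struct render_ctx *ctx)
{
	struct render_job *job = build_job(ctx, ctx->tiler_heap_size);

	submit_and_wait(job);

	/* The fixed-size heap was too small and incremental rendering
	 * kicked in: double the heap for the next submission so the size
	 * converges after a few frames, at the cost of some extra
	 * flushing early on.
	 */
	if (job_used_incremental_rendering(job) &&
	    ctx->tiler_heap_size < MAX_TILER_HEAP_SIZE)
		ctx->tiler_heap_size *= 2;

	cleanup_job(job);
}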
On Thu, 10 Apr 2025 14:01:03 -0400 Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote: > > > > In Panfrost and Lima, we don't have this concept of "incremental > > > > rendering", so when we fail the allocation, we just fail the GPU job > > > > with an unhandled GPU fault. > > > > > > To be honest I think that this is enough to mark those two drivers as > > > broken. It's documented that this approach is a no-go for upstream > > > drivers. > > > > > > How widely is that used? > > > > It exists in lima and panfrost, and I wouldn't be surprised if a similar > > mechanism was used in other drivers for tiler-based GPUs (etnaviv, > > freedreno, powervr, ...), because ultimately that's how tilers work: > > the amount of memory needed to store per-tile primitives (and metadata) > > depends on what the geometry pipeline feeds the tiler with, and that > > can't be predicted. If you over-provision, that's memory the system won't > > be able to use while rendering takes place, even though only a small > > portion might actually be used by the GPU. If your allocation is too > > small, it will either trigger a GPU fault (for HW not supporting an > > "incremental rendering" mode) or under-perform (because flushing > > primitives has a huge cost on tilers). > > Yes and no. > > Although we can't allocate more memory for /this/ frame, we know the > required size is probably constant across its lifetime. That gives a > simple heuristic to manage the tiler heap efficiently without > allocations - even fallible ones - in the fence signal path: > > * Start with a small fixed size tiler heap > * Try to render, let incremental rendering kick in when it's too small. > * When cleaning up the job, check if we used incremental rendering. > * If we did - double the size of the heap the next time we submit work. > > The tiler heap still grows dynamically - it just does so over the span > of a couple frames. In practice that means a tiny hit to startup time as > we dynamically figure out the right size, incurring extra flushing at > the start, without needing any "grow-on-page-fault" heroics. > > This should solve the problem completely for CSF/panthor. So it's only > hardware that architecturally cannot do incremental rendering (older > Mali: panfrost/lima) where we need this mess. OTOH, if we need something for Utgard(Lima)/Midgard/Bifrost/Valhall(Panfrost), why not use the same thing for CSF, since CSF is arguably the sanest of all the HW architectures listed above: allocation can fail/be non-blocking, because there's a fallback to incremental rendering when it fails.
Am 10.04.25 um 19:20 schrieb Boris Brezillon: > [SNIP] >>>> Faulting is only legal when you have something like HMM, SVM or >>>> whatever you call it. And then you can just use a plain shmem >>>> object to provide you with backing pages. >>>> >>>> I mean we could in theory allow faulting on GEM objects as well, >>>> but we would need to take very strict precautions on that we >>>> currently don't have as far as I know. >>> We only use this mechanism for very specific allocations: tiler >>> memory whose maximum size can't be guessed upfront because tile >>> binning is by nature unpredictible. >>> >>>> So could you explain how this works in the first place? >>> I can explain you how this works in Panthor, yes. You get an initial >>> amount of memory that the tiler can use, when it runs out of >>> memory, it will first ask the system for more memory, if the >>> allocation fails, it will fallback to what they call "incremental >>> rendering", where the already binned primitives are flushed to the >>> FB in order to free memory, and the rendering starts over from >>> there, with the memory that has been freed. >>> >>> In Panthor, this on-demand allocation scheme is something that >>> allows us to speed-up rendering when there's memory available, but >>> we can make progress when that's not the case, hence the failable >>> allocation scheme I'm proposing here. >> Puh, that sounds like it can potentially work but is also very very >> fragile. >> >> You must have a code comment when you select the GFP flags how and >> why that works. > + /* We want non-blocking allocations, if we're OOM, we just fail the job > + * to unblock things. > + */ > + ret = drm_gem_shmem_sparse_populate_range(&bo->base, page_offset, > + NUM_FAULT_PAGES, page_gfp, > + __GFP_NORETRY | __GFP_NOWARN); > > That's what I have right now in the failable allocation path. The > tiler chunk allocation uses the same flags, but doesn't have a > comment explaining that a fallback exists when the allocation fails. We agreed to use GFP_NOWAIT for such types of allocations to at least wake up kswapd on the low water mark. IIRC even using __GFP_NORETRY here was illegal, but I need to double check the discussions again. From the comment this at minimum needs an explanation what influence this has on the submission and that you can still guarantee fence forward progress. >>> In Panfrost and Lima, we don't have this concept of "incremental >>> rendering", so when we fail the allocation, we just fail the GPU job >>> with an unhandled GPU fault. >> To be honest I think that this is enough to mark those two drivers as >> broken. It's documented that this approach is a no-go for upstream >> drivers. >> >> How widely is that used? > It exists in lima and panfrost, and I wouldn't be surprised if a similar > mechanism was used in other drivers for tiler-based GPUs (etnaviv, > freedreno, powervr, ...), because ultimately that's how tilers work: > the amount of memory needed to store per-tile primitives (and metadata) > depends on what the geometry pipeline feeds the tiler with, and that > can't be predicted. If you over-provision, that's memory the system won't > be able to use while rendering takes place, even though only a small > portion might actually be used by the GPU. If your allocation is too > small, it will either trigger a GPU fault (for HW not supporting an > "incremental rendering" mode) or under-perform (because flushing > primitives has a huge cost on tilers). 
> > Calling these drivers broken without having a plan to fix them is > also not option. Yeah, agree we need to have some kind of alternative. It's just that at the moment I can't think of any. >>> And that's how it is today, the >>> alloc-on-fault mechanism is being used in at least 3 drivers, and >>> all I'm trying to do here is standardize it and try to document the >>> constraints that comes with this model, constraint that are >>> currently being ignored. Like the fact allocations in the fault >>> handler path shouldn't block so we're guaranteed to signal the job >>> fence in finite time, and we don't risk a deadlock between the >>> driver shrinker and the job triggering the fault. >> Well on one hand it's good to that we document the pitfalls, but I >> clearly think that we should *not* spread that into common code. > Ack on not encouraging people to use that; but having a clear path > describing how this should be done for HW that don't have other > options, with helpers and extensive doc is IMHO better than letting > new drivers reproduce the mistake that were done in the past. > Because, if you tell people "don't do that", but don't have an > alternative to propose, they'll end up doing it anyway. Just to be clear: We have already completely rejected code from going upstream because of that! And we are not talking about anything small, but rather a full blown framework and every developed by a major player. Additional to that both i915 and amdgpu had code which used this approach as well and we reverted back because it simply doesn't work reliable. >> That would give the impression that this actually works and to be >> honest I should start to charge money to rejecting stuff like that. >> It would probably get me rich. >> >>> I'm well aware of the implications of what I'm proposing here, but >>> ignoring the fact some drivers already violate the rules don't make >>> them disappear. So I'd rather work with you and others to clearly >>> state what the alloc-in-fault-path rules are, and enforce them in >>> some helper functions, than pretend these ugly things don't exist. >>> :D >> Yeah, but this kind of says to others it's ok to do this which >> clearly isn't as far as I can see. > Not really. At least not if we properly review any driver that use > these APIs, and clearly state in the doc that this is provided as a > last resort for HW that can't do without on-fault-allocation, because > they are designed to work this way. > >> What we should do instead is to add this as not ok approaches to the >> documentation and move on. > Well, that's what happened with panfrost, lima and panthor, and see > where we are now: 3 drivers that you consider broken (probably > rightfully), and more to come if we don't come up with a plan for HW > that have the same requirements (as I said, I wouldn't be surprised > if most tilers were relying on the same kind of on-demand-allocation). Well we have HW features in major drivers which we simply don't use because of that. > As always, I appreciate your valuable inputs, and would be happy to > try anything you think might be more adapted than what is proposed > here, but saying "this is broken HW/driver, so let's ignore it" is > not really an option, at least if you don't want the bad design > pattern to spread. Yeah, I'm not sure what to do either. We *know* from experience that this approach will fail. We have already tried that. Regards, Christian. > > Regards, > > Boris
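For what it's worth, if the GFP_NOWAIT suggestion is followed, the allocation site quoted earlier in the thread might end up looking something like the sketch below. The helper shape is copied from that snippet (how the two gfp arguments are combined inside the helper is an assumption), and the comment is only an illustration of the kind of justification being asked for, not the actual patch.

	/* Non-blocking allocation in the fault path:
	 * - the job fence must signal in finite time, so we must not enter
	 *   direct reclaim, which could recurse into our own shrinker or
	 *   wait on fences;
	 * - GFP_NOWAIT (rather than a bare __GFP_NORETRY) still wakes
	 *   kswapd at the low watermark, so background reclaim kicks in;
	 * - if the allocation fails we don't retry: the job falls back to
	 *   incremental rendering (CSF) or is terminated with a fault
	 *   (older hardware), so forward progress is preserved either way.
	 */
	ret = drm_gem_shmem_sparse_populate_range(&bo->base, page_offset,
						  NUM_FAULT_PAGES, GFP_NOWAIT,
						  __GFP_NOWARN);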
Am 10.04.25 um 20:41 schrieb Boris Brezillon: > On Thu, 10 Apr 2025 14:01:03 -0400 > Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote: > >>>>> In Panfrost and Lima, we don't have this concept of "incremental >>>>> rendering", so when we fail the allocation, we just fail the GPU job >>>>> with an unhandled GPU fault. >>>> To be honest I think that this is enough to mark those two drivers as >>>> broken. It's documented that this approach is a no-go for upstream >>>> drivers. >>>> >>>> How widely is that used? >>> It exists in lima and panfrost, and I wouldn't be surprised if a similar >>> mechanism was used in other drivers for tiler-based GPUs (etnaviv, >>> freedreno, powervr, ...), because ultimately that's how tilers work: >>> the amount of memory needed to store per-tile primitives (and metadata) >>> depends on what the geometry pipeline feeds the tiler with, and that >>> can't be predicted. If you over-provision, that's memory the system won't >>> be able to use while rendering takes place, even though only a small >>> portion might actually be used by the GPU. If your allocation is too >>> small, it will either trigger a GPU fault (for HW not supporting an >>> "incremental rendering" mode) or under-perform (because flushing >>> primitives has a huge cost on tilers). >> Yes and no. >> >> Although we can't allocate more memory for /this/ frame, we know the >> required size is probably constant across its lifetime. That gives a >> simple heuristic to manage the tiler heap efficiently without >> allocations - even fallible ones - in the fence signal path: >> >> * Start with a small fixed size tiler heap >> * Try to render, let incremental rendering kick in when it's too small. >> * When cleaning up the job, check if we used incremental rendering. >> * If we did - double the size of the heap the next time we submit work. >> >> The tiler heap still grows dynamically - it just does so over the span >> of a couple frames. In practice that means a tiny hit to startup time as >> we dynamically figure out the right size, incurring extra flushing at >> the start, without needing any "grow-on-page-fault" heroics. >> >> This should solve the problem completely for CSF/panthor. So it's only >> hardware that architecturally cannot do incremental rendering (older >> Mali: panfrost/lima) where we need this mess. > OTOH, if we need something > for Utgard(Lima)/Midgard/Bifrost/Valhall(Panfrost), why not use the same > thing for CSF, since CSF is arguably the sanest of all the HW > architectures listed above: allocation can fail/be non-blocking, > because there's a fallback to incremental rendering when it fails. Yeah that is a rather interesting point Alyssa noted here. So basically you could as well implement it like this: 1. Userspace makes a submission. 2. HW finds buffer is not large enough, sets and error code and completes submission. 3. Userspace detects error, re-allocates buffer with increased size. 4. Userspace re-submits to incremental complete the submission. 5. Eventually repeat until fully completed. That would work but is likely just not the most performant solution. So faulting in memory on demand is basically just an optimization and that is ok as far as I can see. That is then a rather good justification for your work Boris. Because a common component makes it possible to implement a common fault injection functionality to make sure that the fallback path is properly exercised in testing. Regards, Christian.
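On the fault-injection idea: the kernel already has a generic mechanism in <linux/fault-inject.h> that a common helper could hook into, so CI can force the non-blocking allocation to fail and exercise the incremental-rendering / job-failure fallbacks. A rough sketch, assuming the populate-range helper from this series (its exact signature may differ); the wrapper name is hypothetical:

#include <linux/fault-inject.h>
#include <linux/mm.h>
#include <drm/drm_gem_shmem_helper.h>

/* One knob shared by all drivers using the sparse helpers; exposed via
 * debugfs when CONFIG_FAULT_INJECTION_DEBUG_FS is enabled.
 */
static DECLARE_FAULT_ATTR(fail_gem_sparse_populate);

static int sparse_populate_range_checked(struct drm_gem_shmem_object *shmem,
					 pgoff_t first_page, unsigned int npages,
					 gfp_t page_gfp, gfp_t extra_gfp)
{
	/* Pretend the non-blocking allocation failed, so the driver's
	 * fallback path (incremental rendering, or failing the job) gets
	 * exercised in testing.
	 */
	if (should_fail(&fail_gem_sparse_populate, npages << PAGE_SHIFT))
		return -ENOMEM;

	return drm_gem_shmem_sparse_populate_range(shmem, first_page, npages,
						   page_gfp, extra_gfp);
}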
Hi Christian, On Thu, 10 Apr 2025 20:52:27 +0200 Christian König <christian.koenig@amd.com> wrote: > Am 10.04.25 um 19:20 schrieb Boris Brezillon: > > [SNIP] > >>>> Faulting is only legal when you have something like HMM, SVM or > >>>> whatever you call it. And then you can just use a plain shmem > >>>> object to provide you with backing pages. > >>>> > >>>> I mean we could in theory allow faulting on GEM objects as well, > >>>> but we would need to take very strict precautions on that we > >>>> currently don't have as far as I know. > >>> We only use this mechanism for very specific allocations: tiler > >>> memory whose maximum size can't be guessed upfront because tile > >>> binning is by nature unpredictible. > >>> > >>>> So could you explain how this works in the first place? > >>> I can explain you how this works in Panthor, yes. You get an > >>> initial amount of memory that the tiler can use, when it runs out > >>> of memory, it will first ask the system for more memory, if the > >>> allocation fails, it will fallback to what they call "incremental > >>> rendering", where the already binned primitives are flushed to the > >>> FB in order to free memory, and the rendering starts over from > >>> there, with the memory that has been freed. > >>> > >>> In Panthor, this on-demand allocation scheme is something that > >>> allows us to speed-up rendering when there's memory available, but > >>> we can make progress when that's not the case, hence the failable > >>> allocation scheme I'm proposing here. > >> Puh, that sounds like it can potentially work but is also very very > >> fragile. > >> > >> You must have a code comment when you select the GFP flags how and > >> why that works. > > + /* We want non-blocking allocations, if we're OOM, we just > > fail the job > > + * to unblock things. > > + */ > > + ret = drm_gem_shmem_sparse_populate_range(&bo->base, > > page_offset, > > + NUM_FAULT_PAGES, > > page_gfp, > > + __GFP_NORETRY | > > __GFP_NOWARN); > > > > That's what I have right now in the failable allocation path. The > > tiler chunk allocation uses the same flags, but doesn't have a > > comment explaining that a fallback exists when the allocation > > fails. > > We agreed to use GFP_NOWAIT for such types of allocations to at least > wake up kswapd on the low water mark. Yeah, I wasn't sure about the flags to be honest. Hence the ping and the very reason Sima asked me to Cc a few more people to look at those changes. > > IIRC even using __GFP_NORETRY here was illegal, but I need to double > check the discussions again. FWIW, it's been taken from i915 [1] > > From the comment this at minimum needs an explanation what influence > this has on the submission and that you can still guarantee fence > forward progress. Agreed. > >>> And that's how it is today, the > >>> alloc-on-fault mechanism is being used in at least 3 drivers, and > >>> all I'm trying to do here is standardize it and try to document > >>> the constraints that comes with this model, constraint that are > >>> currently being ignored. Like the fact allocations in the fault > >>> handler path shouldn't block so we're guaranteed to signal the job > >>> fence in finite time, and we don't risk a deadlock between the > >>> driver shrinker and the job triggering the fault. > >> Well on one hand it's good to that we document the pitfalls, but I > >> clearly think that we should *not* spread that into common code. 
> > Ack on not encouraging people to use that; but having a clear path > > describing how this should be done for HW that don't have other > > options, with helpers and extensive doc is IMHO better than letting > > new drivers reproduce the mistake that were done in the past. > > Because, if you tell people "don't do that", but don't have an > > alternative to propose, they'll end up doing it anyway. > > Just to be clear: We have already completely rejected code from going > upstream because of that! Apparently panfrost/panthor/lima didn't get enough scrutiny... > > And we are not talking about anything small, but rather a full blown > framework and every developed by a major player. > > Additional to that both i915 and amdgpu had code which used this > approach as well and we reverted back because it simply doesn't work > reliable. BTW, the sparse GEM-SHMEM code in itself doesn't necessarily imply non-blocking allocation in the fault handler, though admittedly it's been designed to allow it. > > >> That would give the impression that this actually works and to be > >> honest I should start to charge money to rejecting stuff like that. > >> It would probably get me rich. > >> > >>> I'm well aware of the implications of what I'm proposing here, but > >>> ignoring the fact some drivers already violate the rules don't > >>> make them disappear. So I'd rather work with you and others to > >>> clearly state what the alloc-in-fault-path rules are, and enforce > >>> them in some helper functions, than pretend these ugly things > >>> don't exist. :D > >> Yeah, but this kind of says to others it's ok to do this which > >> clearly isn't as far as I can see. > > Not really. At least not if we properly review any driver that use > > these APIs, and clearly state in the doc that this is provided as a > > last resort for HW that can't do without on-fault-allocation, > > because they are designed to work this way. > > > >> What we should do instead is to add this as not ok approaches to > >> the documentation and move on. > > Well, that's what happened with panfrost, lima and panthor, and see > > where we are now: 3 drivers that you consider broken (probably > > rightfully), and more to come if we don't come up with a plan for HW > > that have the same requirements (as I said, I wouldn't be surprised > > if most tilers were relying on the same kind of > > on-demand-allocation). > > Well we have HW features in major drivers which we simply don't use > because of that. Yeah, but in that case that's not really a HW feature we can switch off/avoid using. It's at the core of how Mali GPUs work (at least anything prior to gen10). And forcing all apps to fully allocate this 128MB buffer upfront (which might seem small for discrete GPUs, but really isn't for systems where the memory can be as low as 1GB, and shared with the rest of the system) isn't really an option. The only alternative I see is finding a way to emulate incremental rendering on older Mali GPUs, assuming that's even possible. > > > As always, I appreciate your valuable inputs, and would be happy to > > try anything you think might be more adapted than what is proposed > > here, but saying "this is broken HW/driver, so let's ignore it" is > > not really an option, at least if you don't want the bad design > > pattern to spread. > > Yeah, I'm not sure what to do either. We *know* from experience that > this approach will fail. 
Can you, or someone who was involved in those attempts, sum up the problems you were trying to solve, and the walls you hit with this non-blocking/failable allocation scheme (feel free to link me to previous discussions so you don't have to type it again :-)). FWIW, the problem I see with the current panfrost/panthor on-fault-allocation scheme is the fact that allocating GEMs with GFP_KERNEL leads to unbounded waits, which might cause a job timeout. The deadlock between the panfrost shrinker and the job requesting memory doesn't exist AFAICT, because panfrost can only reclaim objects that were flagged as unused (MADVISE(DONTNEED)), and the allocation is done without any of the locks that are taken by the panfrost shrinker held. So yeah, jobs can fail because the system ran out of memory, the job fence will be signalled with an error, and the system will just move on. It's already what happens today, and userspace either ignores it (because on older Mali GPUs, a fault on a job is just transient, and doesn't prevent other jobs from executing) or knows how to recover from it (on newer GPUs, the FW scheduling-context needs to be destroyed and re-created). Just to be clear, I'm not saying this is good, I'm just describing what is done today, so you get the whole picture. Now, with panthor being in the process of adopting a transparent reclaim mechanism (swapout on reclaim if the GPU context is idle, swapin on next job), the deadlock between the panthor/gem-shmem shrinker and the allocation in the fault path will probably surface, so I do get why no-alloc-in-fault-path, or at the very least, non-blocking-alloc-in-fault-path is needed. And I really thought non-blocking allocations would be allowed in that case. So please bear with me, and tell me where my mistake is, because every time I think I got it, I look at it again, and don't see the problem. Regards, Boris [1]https://elixir.bootlin.com/linux/v6.13.7/source/drivers/gpu/drm/i915/gem/i915_gem_shmem.c#L96
On Fri, 11 Apr 2025 10:04:07 +0200 Christian König <christian.koenig@amd.com> wrote: > Am 10.04.25 um 20:41 schrieb Boris Brezillon: > > On Thu, 10 Apr 2025 14:01:03 -0400 > > Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote: > > > >>>>> In Panfrost and Lima, we don't have this concept of "incremental > >>>>> rendering", so when we fail the allocation, we just fail the > >>>>> GPU job with an unhandled GPU fault. > >>>> To be honest I think that this is enough to mark those two > >>>> drivers as broken. It's documented that this approach is a > >>>> no-go for upstream drivers. > >>>> > >>>> How widely is that used? > >>> It exists in lima and panfrost, and I wouldn't be surprised if a > >>> similar mechanism was used in other drivers for tiler-based GPUs > >>> (etnaviv, freedreno, powervr, ...), because ultimately that's how > >>> tilers work: the amount of memory needed to store per-tile > >>> primitives (and metadata) depends on what the geometry pipeline > >>> feeds the tiler with, and that can't be predicted. If you > >>> over-provision, that's memory the system won't be able to use > >>> while rendering takes place, even though only a small portion > >>> might actually be used by the GPU. If your allocation is too > >>> small, it will either trigger a GPU fault (for HW not supporting > >>> an "incremental rendering" mode) or under-perform (because > >>> flushing primitives has a huge cost on tilers). > >> Yes and no. > >> > >> Although we can't allocate more memory for /this/ frame, we know > >> the required size is probably constant across its lifetime. That > >> gives a simple heuristic to manage the tiler heap efficiently > >> without allocations - even fallible ones - in the fence signal > >> path: > >> > >> * Start with a small fixed size tiler heap > >> * Try to render, let incremental rendering kick in when it's too > >> small. > >> * When cleaning up the job, check if we used incremental rendering. > >> * If we did - double the size of the heap the next time we submit > >> work. > >> > >> The tiler heap still grows dynamically - it just does so over the > >> span of a couple frames. In practice that means a tiny hit to > >> startup time as we dynamically figure out the right size, > >> incurring extra flushing at the start, without needing any > >> "grow-on-page-fault" heroics. > >> > >> This should solve the problem completely for CSF/panthor. So it's > >> only hardware that architecturally cannot do incremental rendering > >> (older Mali: panfrost/lima) where we need this mess. > > OTOH, if we need something > > for Utgard(Lima)/Midgard/Bifrost/Valhall(Panfrost), why not use the > > same thing for CSF, since CSF is arguably the sanest of all the HW > > architectures listed above: allocation can fail/be non-blocking, > > because there's a fallback to incremental rendering when it fails. > > Yeah that is a rather interesting point Alyssa noted here. > > So basically you could as well implement it like this: > 1. Userspace makes a submission. > 2. HW finds buffer is not large enough, sets and error code and > completes submission. 3. Userspace detects error, re-allocates buffer > with increased size. 4. Userspace re-submits to incremental complete > the submission. 5. Eventually repeat until fully completed. > > That would work but is likely just not the most performant solution. > So faulting in memory on demand is basically just an optimization and > that is ok as far as I can see. 
Yeah, Alyssa's suggestion got me thinking too, and I think I can come up with a plan where we try a non-blocking allocation first, and if it fails, we trigger incremental rendering and queue a blocking heap-chunk allocation on a separate workqueue, such that the next time the tiler heap hits an OOM, it has a chunk (or multiple chunks) readily available if the blocking allocation completed in the meantime. That's basically what Alyssa suggested, with an optimization if the system is not under memory pressure, and without userspace being involved (so no uAPI changes). I guess this leaves older GPUs that don't support incremental rendering in a bad place though. > > That is then a rather good justification for your work Boris. Because > a common component makes it possible to implement a common fault > injection functionality to make sure that the fallback path is > properly exercised in testing. I can also add a fault injection mechanism to validate that, yep. Thanks, Boris
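A minimal sketch of that plan, under the assumption that the heap code keeps a list of spare chunks; every name below is hypothetical, only the ordering (spare chunk, then non-blocking allocation, then incremental rendering plus a deferred blocking allocation) reflects the description above.

/* Illustration of the scheme described above; all names are made up. */
#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

struct heap_chunk {
	struct list_head node;
	/* backing pages, GPU mapping, ... */
};

struct tiler_heap {
	struct list_head spare_chunks;	/* filled by refill_work */
	spinlock_t lock;
	struct work_struct refill_work;
};

/* Hypothetical helpers. */
struct heap_chunk *heap_chunk_alloc(gfp_t gfp);
struct heap_chunk *heap_pop_spare_chunk(struct tiler_heap *heap);
int heap_map_chunk(struct tiler_heap *heap, struct heap_chunk *chunk);

static void heap_refill_work(struct work_struct *work)
{
	struct tiler_heap *heap = container_of(work, struct tiler_heap, refill_work);
	struct heap_chunk *chunk;

	/* Blocking allocation is fine here: we run on a regular workqueue,
	 * outside the fence-signalling path.
	 */
	chunk = heap_chunk_alloc(GFP_KERNEL);
	if (!chunk)
		return;

	spin_lock(&heap->lock);
	list_add_tail(&chunk->node, &heap->spare_chunks);
	spin_unlock(&heap->lock);
}

static int heap_handle_oom(struct tiler_heap *heap)
{
	struct heap_chunk *chunk;

	/* 1. Use a chunk pre-allocated by the worker after a previous OOM. */
	chunk = heap_pop_spare_chunk(heap);

	/* 2. Otherwise, try a non-blocking allocation right now. */
	if (!chunk)
		chunk = heap_chunk_alloc(GFP_NOWAIT | __GFP_NOWARN);

	if (chunk)
		return heap_map_chunk(heap, chunk);

	/* 3. No memory available without blocking: let the caller trigger
	 *    incremental rendering, and ask the worker to allocate a chunk
	 *    in the background for the next OOM event.
	 */
	queue_work(system_unbound_wq, &heap->refill_work);
	return -ENOMEM;
}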
Am 11.04.25 um 10:38 schrieb Boris Brezillon: > On Fri, 11 Apr 2025 10:04:07 +0200 > Christian König <christian.koenig@amd.com> wrote: > >> Am 10.04.25 um 20:41 schrieb Boris Brezillon: >>> On Thu, 10 Apr 2025 14:01:03 -0400 >>> Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote: >>> >>>>>>> In Panfrost and Lima, we don't have this concept of "incremental >>>>>>> rendering", so when we fail the allocation, we just fail the >>>>>>> GPU job with an unhandled GPU fault. >>>>>> To be honest I think that this is enough to mark those two >>>>>> drivers as broken. It's documented that this approach is a >>>>>> no-go for upstream drivers. >>>>>> >>>>>> How widely is that used? >>>>> It exists in lima and panfrost, and I wouldn't be surprised if a >>>>> similar mechanism was used in other drivers for tiler-based GPUs >>>>> (etnaviv, freedreno, powervr, ...), because ultimately that's how >>>>> tilers work: the amount of memory needed to store per-tile >>>>> primitives (and metadata) depends on what the geometry pipeline >>>>> feeds the tiler with, and that can't be predicted. If you >>>>> over-provision, that's memory the system won't be able to use >>>>> while rendering takes place, even though only a small portion >>>>> might actually be used by the GPU. If your allocation is too >>>>> small, it will either trigger a GPU fault (for HW not supporting >>>>> an "incremental rendering" mode) or under-perform (because >>>>> flushing primitives has a huge cost on tilers). >>>> Yes and no. >>>> >>>> Although we can't allocate more memory for /this/ frame, we know >>>> the required size is probably constant across its lifetime. That >>>> gives a simple heuristic to manage the tiler heap efficiently >>>> without allocations - even fallible ones - in the fence signal >>>> path: >>>> >>>> * Start with a small fixed size tiler heap >>>> * Try to render, let incremental rendering kick in when it's too >>>> small. >>>> * When cleaning up the job, check if we used incremental rendering. >>>> * If we did - double the size of the heap the next time we submit >>>> work. >>>> >>>> The tiler heap still grows dynamically - it just does so over the >>>> span of a couple frames. In practice that means a tiny hit to >>>> startup time as we dynamically figure out the right size, >>>> incurring extra flushing at the start, without needing any >>>> "grow-on-page-fault" heroics. >>>> >>>> This should solve the problem completely for CSF/panthor. So it's >>>> only hardware that architecturally cannot do incremental rendering >>>> (older Mali: panfrost/lima) where we need this mess. >>> OTOH, if we need something >>> for Utgard(Lima)/Midgard/Bifrost/Valhall(Panfrost), why not use the >>> same thing for CSF, since CSF is arguably the sanest of all the HW >>> architectures listed above: allocation can fail/be non-blocking, >>> because there's a fallback to incremental rendering when it fails. >> Yeah that is a rather interesting point Alyssa noted here. >> >> So basically you could as well implement it like this: >> 1. Userspace makes a submission. >> 2. HW finds buffer is not large enough, sets and error code and >> completes submission. 3. Userspace detects error, re-allocates buffer >> with increased size. 4. Userspace re-submits to incremental complete >> the submission. 5. Eventually repeat until fully completed. >> >> That would work but is likely just not the most performant solution. >> So faulting in memory on demand is basically just an optimization and >> that is ok as far as I can see. 
> Yeah, Alyssa's suggestion got me thinking too, and I think I can come > up with a plan where we try non-blocking allocation first, and if it > fails, we trigger incremental rendering, and queue a blocking > heap-chunk allocation on separate workqueue, such that next time the > tiler heap hits an OOM, it has a chunk (or multiple chunks) readily > available if the blocking allocation completed in the meantime. That's > basically what Alyssa suggested, with an optimization if the system is > not under memory pressure, and without userspace being involved (so no > uAPI changes). That sounds like it most likely won't work. In an OOM situation the blocking allocation would just cause more pressure to complete your rendering to free up memory. > I guess this leaves older GPUs that don't support incremental rendering > in a bad place though. Well what's the handling there currently? Just crash when you're OOM? Regards, Christian. > >> That is then a rather good justification for your work Boris. Because >> a common component makes it possible to implement a common fault >> injection functionality to make sure that the fallback path is >> properly exercised in testing. > I can also add an fault injection mechanism to validate that, yep. > > Thanks, > > Boris
On Thu, Apr 10, 2025 at 08:41:55PM +0200, Boris Brezillon wrote: > On Thu, 10 Apr 2025 14:01:03 -0400 > Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote: > > > > > > In Panfrost and Lima, we don't have this concept of "incremental > > > > > rendering", so when we fail the allocation, we just fail the GPU job > > > > > with an unhandled GPU fault. > > > > > > > > To be honest I think that this is enough to mark those two drivers as > > > > broken. It's documented that this approach is a no-go for upstream > > > > drivers. > > > > > > > > How widely is that used? > > > > > > It exists in lima and panfrost, and I wouldn't be surprised if a similar > > > mechanism was used in other drivers for tiler-based GPUs (etnaviv, > > > freedreno, powervr, ...), because ultimately that's how tilers work: > > > the amount of memory needed to store per-tile primitives (and metadata) > > > depends on what the geometry pipeline feeds the tiler with, and that > > > can't be predicted. If you over-provision, that's memory the system won't > > > be able to use while rendering takes place, even though only a small > > > portion might actually be used by the GPU. If your allocation is too > > > small, it will either trigger a GPU fault (for HW not supporting an > > > "incremental rendering" mode) or under-perform (because flushing > > > primitives has a huge cost on tilers). > > > > Yes and no. > > > > Although we can't allocate more memory for /this/ frame, we know the > > required size is probably constant across its lifetime. That gives a > > simple heuristic to manage the tiler heap efficiently without > > allocations - even fallible ones - in the fence signal path: > > > > * Start with a small fixed size tiler heap > > * Try to render, let incremental rendering kick in when it's too small. > > * When cleaning up the job, check if we used incremental rendering. > > * If we did - double the size of the heap the next time we submit work. > > > > The tiler heap still grows dynamically - it just does so over the span > > of a couple frames. In practice that means a tiny hit to startup time as > > we dynamically figure out the right size, incurring extra flushing at > > the start, without needing any "grow-on-page-fault" heroics. > > > > This should solve the problem completely for CSF/panthor. So it's only > > hardware that architecturally cannot do incremental rendering (older > > Mali: panfrost/lima) where we need this mess. > > OTOH, if we need something > for Utgard(Lima)/Midgard/Bifrost/Valhall(Panfrost), why not use the same > thing for CSF, since CSF is arguably the sanest of all the HW > architectures listed above: allocation can fail/be non-blocking, > because there's a fallback to incremental rendering when it fails. So this is a really horrible idea to sort this out for panfrost hardware, which doesn't have a tiler cache flush as a fallback. It's roughly three stages: 1. A pile of clever tricks to make the chances of running out of memory really low. Most of these also make sense for panthor platforms, just as a performance optimization. 2. I terrible way to handle the unavoidable VK_DEVICE_LOST, but in a way such that the impact should be minimal. This is nasty, and we really want to avoid that for panthor. 3. Mesa quirks so that 2 doesn't actually ever happen in practice. 1. Clever tricks ---------------- This is a cascade of tricks we can pull in the gpu fault handler: 1a. Allocate with GFP_NORECLAIM. 
We want this first because that triggers background reclaim, and that might be enough to get us through and free some easy caches (like clean fs cache and stuff like that which can just be dropped). 1b Userspace needs to come up with a good guess for how much we'll need. I'm hoping that between render target size and maybe counting the total amount of vertices we can do a decent guesstimate for many workloads. Note that the goal here is not to ensure success, but just to get the rough ballpark. The actual starting number here should aim fairly low, so that we avoid wasting memory since this is memory wasted on every context (that uses a feature which needs dynamic memory allocation, which I guess for pan* is everything, but for apple it would be more limited). 1c The kernel then keeps an additional global memory pool. Note this would not have the same semantics as mempool.h, which is aimed at GFP_NOIO forward progress guarantees, but more as a preallocation pool. In every CS ioctl we'll make sure the pool is filled, and we probably want to size the pool relative to the context with the biggest dynamic memory usage. So probably this thing needs a shrinker, so we can reclaim it when you don't run an app with a huge buffer need on the gpu anymore. Note that we're still not sizing this to guarantee success, but together with the userspace heuristics it should be big enough to almost always work out. And since it's a global reserve we can afford to waste a bit more memory on this one. We might also want to scale this pool by the total memory available, like the watermarks core mm computes. We'll only hang onto this memory when the gpu is in active usage, so this should be fine. Also the preallocation would need to happen without holding the memory pool lock, so that we can use GFP_KERNEL. Up to this point I think it's all tricks that panthor also wants to employ. 1d Next up is scratch dynamic memory. If we can assume that the memory does not need to survive a batchbuffer (hopefully the case with vulkan render passes) we could steal such memory from other contexts. We could even do that for contexts which are queued but not yet running on the hardware (might need unloading them to be doable with fw renderers like panthor/CSF) as long as we keep such stolen dynamic memory on a separate free list. Because if we'd entirely free this, or release it into the memory pool, we'd make things worse for these other contexts; we need to be able to guarantee that any context can always get all the stolen dynamic pages back before we start running it on the gpu. Since the tracking for the memory pool in 1c and the stolen memory in 1d has a bunch of tricky corner cases, we probably want that as a drm helper which globally does that book-keeping for all drm/sched instances. That might even help if some NN accel thing also needs this on the same SoC. 1e Finally, if all else is lost we can try GFP_ATOMIC. This will eat into the reserve memory pools the core mm maintains. And hopefully we've spent enough time between this step and the initial GFP_NORECLAIM that background reclaim had a chance to free some memory, hence why all the memory pool and memory stealing tricks should be in between. The above will need quite some tuning to make sure it works as often as possible, without wasting undue amounts of memory. It's classic memory overcommit tuning. 2. Device Lost -------------- At this point we're left with no other choice than to kill the context. 
And userspace should be able to cope with VK_DEVICE_LOST (hopefully zink does), but it will probably not cope well with an entire storm of these just to get the first frame out. Here comes the horrible trick: We'll keep rendering the entire frame by just smashing one single reserve page (per context) into the pte every time there's a fault. It will result in total garbage, and we probably want to shoot the context the moment the VS stages have finished, but it allows us to collect an accurate estimate of how much memory we'd have needed. We need to pass that to the vulkan driver as part of the device lost processing, so that it can keep that as the starting point for the userspace dynamic memory requirement guesstimate as a lower bound. Together with the (scaled to that requirement) gpu driver memory pool and the core mm watermarks, that should allow us to not hit a device lost again hopefully. I think if the CS ioctl both takes the current guesstimate from userspace, and passes back whatever the current maximum known to the kernel is (we need that anyway for the steps in stage 1 to make sense I think), then that should also work for dead contexts where the CS ioctl returns -EIO or something like that. 3. Mesa quirks -------------- A database of the fixed dynamic memory requirements we get from step 2 (through bug reports), so that we can avoid that mess going forward. If there are too many of those, we'll probably need to improve the guesstimation in 1b if it's systematically off by too much. We might even need to store that on-disk, like shader caches, so that you get a crash once and then it should work even without an explicit mesa quirk. Documentation ------------- Assuming this is not too terrible an idea and we reach some consensus, I think it'd be good to bake this into a doc patch somewhere in the dma_fence documentation. Also including Alyssa's recipe for when you have hardware support for flushing partial rendering. And maybe generalized so it makes sense for the GS/streamout/... fun on apple hardware. And perhaps with a section added that points to actual code to help make this happen, like the sparse bo support in this series. Cheers, Sima
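Condensed into code, the 1a-1e cascade above would look something like this. 1b lives in userspace so it doesn't appear, and drm_dynmem_pool_get()/gpu_ctx_steal_scratch() are placeholders for the proposed drm helper and the driver-specific stealing logic; they don't exist today.

#include <linux/gfp.h>
#include <linux/mm.h>

struct drm_dynmem_pool;
struct gpu_ctx;

/* Hypothetical helpers standing in for 1c and 1d. */
struct page *drm_dynmem_pool_get(struct drm_dynmem_pool *pool);
struct page *gpu_ctx_steal_scratch(struct gpu_ctx *ctx);

static struct page *gpu_fault_alloc_page(struct drm_dynmem_pool *pool,
					 struct gpu_ctx *ctx)
{
	struct page *page;

	/* 1a: non-blocking attempt, only kicks background (kswapd) reclaim. */
	page = alloc_page(GFP_NOWAIT | __GFP_NOWARN);
	if (page)
		return page;

	/* 1c: dip into the global preallocation pool refilled at CS ioctl time. */
	page = drm_dynmem_pool_get(pool);
	if (page)
		return page;

	/* 1d: steal scratch memory from contexts that aren't running. */
	page = gpu_ctx_steal_scratch(ctx);
	if (page)
		return page;

	/* 1e: last resort, eat into the core mm atomic reserves. */
	return alloc_page(GFP_ATOMIC | __GFP_NOWARN);
}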
On Fri, 11 Apr 2025 12:55:57 +0200 Christian König <christian.koenig@amd.com> wrote: > Am 11.04.25 um 10:38 schrieb Boris Brezillon: > > On Fri, 11 Apr 2025 10:04:07 +0200 > > Christian König <christian.koenig@amd.com> wrote: > > > >> Am 10.04.25 um 20:41 schrieb Boris Brezillon: > >>> On Thu, 10 Apr 2025 14:01:03 -0400 > >>> Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote: > >>> > >>>>>>> In Panfrost and Lima, we don't have this concept of > >>>>>>> "incremental rendering", so when we fail the allocation, we > >>>>>>> just fail the GPU job with an unhandled GPU fault. > >>>>>> To be honest I think that this is enough to mark those two > >>>>>> drivers as broken. It's documented that this approach is a > >>>>>> no-go for upstream drivers. > >>>>>> > >>>>>> How widely is that used? > >>>>> It exists in lima and panfrost, and I wouldn't be surprised if a > >>>>> similar mechanism was used in other drivers for tiler-based GPUs > >>>>> (etnaviv, freedreno, powervr, ...), because ultimately that's > >>>>> how tilers work: the amount of memory needed to store per-tile > >>>>> primitives (and metadata) depends on what the geometry pipeline > >>>>> feeds the tiler with, and that can't be predicted. If you > >>>>> over-provision, that's memory the system won't be able to use > >>>>> while rendering takes place, even though only a small portion > >>>>> might actually be used by the GPU. If your allocation is too > >>>>> small, it will either trigger a GPU fault (for HW not supporting > >>>>> an "incremental rendering" mode) or under-perform (because > >>>>> flushing primitives has a huge cost on tilers). > >>>> Yes and no. > >>>> > >>>> Although we can't allocate more memory for /this/ frame, we know > >>>> the required size is probably constant across its lifetime. That > >>>> gives a simple heuristic to manage the tiler heap efficiently > >>>> without allocations - even fallible ones - in the fence signal > >>>> path: > >>>> > >>>> * Start with a small fixed size tiler heap > >>>> * Try to render, let incremental rendering kick in when it's too > >>>> small. > >>>> * When cleaning up the job, check if we used incremental > >>>> rendering. > >>>> * If we did - double the size of the heap the next time we submit > >>>> work. > >>>> > >>>> The tiler heap still grows dynamically - it just does so over the > >>>> span of a couple frames. In practice that means a tiny hit to > >>>> startup time as we dynamically figure out the right size, > >>>> incurring extra flushing at the start, without needing any > >>>> "grow-on-page-fault" heroics. > >>>> > >>>> This should solve the problem completely for CSF/panthor. So it's > >>>> only hardware that architecturally cannot do incremental > >>>> rendering (older Mali: panfrost/lima) where we need this mess. > >>>> > >>> OTOH, if we need something > >>> for Utgard(Lima)/Midgard/Bifrost/Valhall(Panfrost), why not use > >>> the same thing for CSF, since CSF is arguably the sanest of all > >>> the HW architectures listed above: allocation can fail/be > >>> non-blocking, because there's a fallback to incremental rendering > >>> when it fails. > >> Yeah that is a rather interesting point Alyssa noted here. > >> > >> So basically you could as well implement it like this: > >> 1. Userspace makes a submission. > >> 2. HW finds buffer is not large enough, sets and error code and > >> completes submission. 3. Userspace detects error, re-allocates > >> buffer with increased size. 4. Userspace re-submits to incremental > >> complete the submission. 5. 
Eventually repeat until fully > >> completed. > >> > >> That would work but is likely just not the most performant > >> solution. So faulting in memory on demand is basically just an > >> optimization and that is ok as far as I can see. > > Yeah, Alyssa's suggestion got me thinking too, and I think I can > > come up with a plan where we try non-blocking allocation first, and > > if it fails, we trigger incremental rendering, and queue a blocking > > heap-chunk allocation on separate workqueue, such that next time the > > tiler heap hits an OOM, it has a chunk (or multiple chunks) readily > > available if the blocking allocation completed in the meantime. > > That's basically what Alyssa suggested, with an optimization if the > > system is not under memory pressure, and without userspace being > > involved (so no uAPI changes). > > That sounds like it most likely won't work. In an OOM situation the > blocking allocation would just cause more pressure to complete your > rendering to free up memory. Right. It could be deferred to the next job submission instead of being queued immediately. My point being, userspace doesn't have to know about it, because the kernel knows when a tiler OOM happened, and can flag the buffer as "probably needs more space when you have an opportunity to alloc more". It wouldn't be different from reporting it to userspace and letting userspace explicitly grow the buffer, and it avoids introducing a gap between old and new mesa. > > > I guess this leaves older GPUs that don't support incremental > > rendering in a bad place though. > > Well what's the handling there currently? Just crash when you're OOM? It's "alloc(GFP_KERNEL) and crash if it fails or times out", yes.
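For what it's worth, the kernel-side part of that idea is tiny; a sketch with made-up names, no uAPI involved. The flag would then be consumed outside the fence-signalling path, for example on the next submission, where a blocking allocation is allowed.

#include <linux/atomic.h>

/* Hypothetical heap state: the flag is purely kernel-internal. */
struct tiler_heap {
	atomic_t needs_grow;
};

/* Tiler OOM path: don't allocate here, just remember that we ran short. */
static void tiler_heap_note_oom(struct tiler_heap *heap)
{
	atomic_set(&heap->needs_grow, 1);
}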
Am 11.04.25 um 14:02 schrieb Boris Brezillon: >>> I guess this leaves older GPUs that don't support incremental >>> rendering in a bad place though. >> Well what's the handling there currently? Just crash when you're OOM? > It's "alloc(GFP_KERNEL) and crash if it fails or times out", yes. Oh, please absolutely don't! Using GFP_KERNEL here is as evil as it can be. Background is that you don't get a crash, nor an error message, nor anything indicating what is happening. You just get a deadlocked system with the user wondering why the heck the system doesn't respond any more. Christian.
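One way to turn this silent hang into something debuggable is the dma-fence signalling annotation that already exists for this purpose. The surrounding function and struct below are hypothetical; dma_fence_begin_signalling()/dma_fence_end_signalling() are the real API.

#include <linux/dma-fence.h>
#include <linux/gfp.h>
#include <linux/mm.h>

struct my_heap;	/* hypothetical driver state */

static int handle_gpu_fault(struct my_heap *heap)
{
	bool cookie = dma_fence_begin_signalling();
	struct page *page;
	int ret = 0;

	/*
	 * Inside the annotated section, lockdep (CONFIG_PROVE_LOCKING) can
	 * flag allocations that may block on reclaim, so a stray GFP_KERNEL
	 * here is reported instead of deadlocking under memory pressure.
	 */
	page = alloc_page(GFP_NOWAIT | __GFP_NOWARN);
	if (page) {
		/* map 'page' into the faulting VA range here */
	} else {
		ret = -ENOMEM;
	}

	dma_fence_end_signalling(cookie);
	return ret;
}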
Am 11.04.25 um 14:01 schrieb Simona Vetter: > On Thu, Apr 10, 2025 at 08:41:55PM +0200, Boris Brezillon wrote: >> On Thu, 10 Apr 2025 14:01:03 -0400 >> Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote: >> >>>>>> In Panfrost and Lima, we don't have this concept of "incremental >>>>>> rendering", so when we fail the allocation, we just fail the GPU job >>>>>> with an unhandled GPU fault. >>>>> To be honest I think that this is enough to mark those two drivers as >>>>> broken. It's documented that this approach is a no-go for upstream >>>>> drivers. >>>>> >>>>> How widely is that used? >>>> It exists in lima and panfrost, and I wouldn't be surprised if a similar >>>> mechanism was used in other drivers for tiler-based GPUs (etnaviv, >>>> freedreno, powervr, ...), because ultimately that's how tilers work: >>>> the amount of memory needed to store per-tile primitives (and metadata) >>>> depends on what the geometry pipeline feeds the tiler with, and that >>>> can't be predicted. If you over-provision, that's memory the system won't >>>> be able to use while rendering takes place, even though only a small >>>> portion might actually be used by the GPU. If your allocation is too >>>> small, it will either trigger a GPU fault (for HW not supporting an >>>> "incremental rendering" mode) or under-perform (because flushing >>>> primitives has a huge cost on tilers). >>> Yes and no. >>> >>> Although we can't allocate more memory for /this/ frame, we know the >>> required size is probably constant across its lifetime. That gives a >>> simple heuristic to manage the tiler heap efficiently without >>> allocations - even fallible ones - in the fence signal path: >>> >>> * Start with a small fixed size tiler heap >>> * Try to render, let incremental rendering kick in when it's too small. >>> * When cleaning up the job, check if we used incremental rendering. >>> * If we did - double the size of the heap the next time we submit work. >>> >>> The tiler heap still grows dynamically - it just does so over the span >>> of a couple frames. In practice that means a tiny hit to startup time as >>> we dynamically figure out the right size, incurring extra flushing at >>> the start, without needing any "grow-on-page-fault" heroics. >>> >>> This should solve the problem completely for CSF/panthor. So it's only >>> hardware that architecturally cannot do incremental rendering (older >>> Mali: panfrost/lima) where we need this mess. >> OTOH, if we need something >> for Utgard(Lima)/Midgard/Bifrost/Valhall(Panfrost), why not use the same >> thing for CSF, since CSF is arguably the sanest of all the HW >> architectures listed above: allocation can fail/be non-blocking, >> because there's a fallback to incremental rendering when it fails. > So this is a really horrible idea to sort this out for panfrost hardware, > which doesn't have a tiler cache flush as a fallback. It's roughly three > stages: > > 1. A pile of clever tricks to make the chances of running out of memory > really low. Most of these also make sense for panthor platforms, just as a > performance optimization. > > 2. I terrible way to handle the unavoidable VK_DEVICE_LOST, but in a way > such that the impact should be minimal. This is nasty, and we really want > to avoid that for panthor. > > 3. Mesa quirks so that 2 doesn't actually ever happen in practice. > > 1. Clever tricks > ---------------- > > This is a cascade of tricks we can pull in the gpu fault handler: > > 1a. Allocate with GFP_NORECLAIM. 
We want this first because that triggers > background reclaim, and that might be enough to get us through and free > some easy caches (like clean fs cache and stuff like that which can just > be dropped). > > 1b Userspace needs to guesstimate a good guess for how much we'll need. I'm > hoping that between render target size and maybe counting the total > amounts of vertices we can do a decent guesstimate for many workloads. > Note that goal here is not to ensure success, but just to get the rough > ballpark. The actual starting number here should aim fairly low, so that > we avoid wasting memory since this is memory wasted on every context > (that uses a feature which needs dynamic memory allocation, which I > guess for pan* is everything, but for apple it would be more limited). > > 1c The kernel then keeps an additional global memory pool. Note this would > not have the same semantics as mempool.h, which is aimed GFP_NOIO > forward progress guarantees, but more as a preallocation pool. In every > CS ioctl we'll make sure the pool is filled, and we probably want to > size the pool relative to the context with the biggest dynamic memory > usage. So probably this thing needs a shrinker, so we can reclaim it > when you don't run an app with a huge buffer need on the gpu anymore. > > Note that we're still not sizing this to guarantee success, but together > with the userspace heuristics it should be big enough to almost always > work out. And since it's global reserve we can afford to waste a bit > more memory on this one. We might also want to scale this pool by the > total memory available, like the watermarks core mm computes. We'll only > hang onto this memory when the gpu is in active usage, so this should be > fine. > > Also the preallocation would need to happen without holding the memory > pool look, so that we can use GFP_KERNEL. > > Up to this point I think it's all tricks that panthor also wants to > employ. > > 1d Next up is scratch dynamic memory. If we can assume that the memory does > not need to survive a batchbuffer (hopefully the case with vulkan render > pass) we could steal such memory from other contexts. We could even do > that for contexts which are queued but not yet running on the hardware > (might need unloading them to be doable with fw renderers like > panthor/CSF) as long as we keep such stolen dynamic memory on a separate > free list. Because if we'd entirely free this, or release it into the > memory pool we'll make things worse for these other contexts, we need to > be able to guarantee that any context can always get all the stolen > dynamic pages back before we start running it on the gpu. > > Since the tracking for the memory pool in 1c and the stolen memory in 1d > has a bunch of tricky corner cases we probably want that as a drm helper > which globally does that book-keeping for all drm/sched instances. That > might even help if some NN accel thing also needs this on the same SoC. > > 1e Finally, if all else is lost we can try GFP_ATOMIC. This will eat into > reserve memory pools the core mm maintains. And hopefully we've spent > enough time between this step and the initial GFP_NORECLAIM that > background reclaim had a chance to free some memory, hence why all the > memory pool and memory stealing tricks should be in between. > > The above will need quite some tuning to make sure it works as often as > possible, without wasting undue amounts of memory. It's classic memory > overcommit tuning. > > 2. 
Device Lost > -------------- > > At this point we're left with no other choice than to kill the context. > And userspace should be able to cope with VK_DEVICE_LOST (hopefully zink > does), but it will probably not cope well with an entire strom of these > just to get the first frame out. > > Here comes the horrible trick: > > We'll keep rendering the entire frame by just smashing one single reserve > page (per context) into the pte every time there's a fault. It will result > in total garbage, and we probably want to shot the context the moment the > VS stages have finished, but it allows us to collect an accurate estimate > of how much memory we'd have needed. We need to pass that to the vulkan > driver as part of the device lost processing, so that it can keep that as > the starting point for the userspace dynamic memory requirement > guesstimate as a lower bound. Together with the (scaled to that > requirement) gpu driver memory pool and the core mm watermarks, that > should allow us to not hit a device lost again hopefully. > > I think if the CS ioctl both takes the current guesstimate from userspace, > and passes back whatever the current maximum known to the kernel is (we > need that anyway for the steps in stage 1 to make sense I think), then > that should also work for dead context where the CS ioctl returns -EIO or > something like that. +1 It is a bit radical but as far as I can see that is the way to go. Additionally I think it's a must have to have a module parameter or similar to exercise this situation even without getting the whole kernel into an OOM situation. Fault injection to exercise fallback paths are is just mandatory for stuff like that. Regards, Christian. > > 3. Mesa quirks > -------------- > > A database of the fixed dynamic memory requirements we get from step 2 > (through bug reports), so that we can avoid that mess going forward. > > If there's too many of those, we'll probably need to improve the > guesstimation 1b if it's systematically off by too much. We might even > need to store that on-disk, like shader caches, so that you get a crash > once and then it should work even without an explicit mesa quirk. > > Documentation > ------------- > > Assuming this is not too terrible an idea and we reach some consensus, I > think it'd be good to bake this into a doc patch somewhere in the > dma_fence documentation. Also including Alyssa recipe for when you have > hardware support for flushing partial rendering. And maybe generalized so > it makes sense for the GS/streamout/... fun on apple hardware. > > And perhaps with a section added that points to actual code to help make > this happen like the sparse bo support in this series. > > Cheers, Sima
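For reference, the "single reserve page" trick being acked here could look roughly like this; every name below is hypothetical, including my_mmu_map_page(), which stands in for the driver's page-table mapping call. The collected missing_bytes value is what would be reported back through the CS ioctl as the new lower bound.

#include <linux/atomic.h>
#include <linux/io.h>
#include <linux/mm.h>
#include <linux/types.h>

struct ctx_oom_state {
	struct page *reserve_page;	/* allocated at context creation */
	atomic64_t missing_bytes;	/* what we would have needed */
	bool dead;			/* next CS ioctl returns -EIO */
};

/* Hypothetical driver hook that writes a PTE for this context. */
int my_mmu_map_page(struct ctx_oom_state *s, u64 iova, phys_addr_t pa,
		    size_t granule);

/* GPU fault handler, once the context is considered lost. */
static int ctx_fault_map_reserve_page(struct ctx_oom_state *s, u64 iova,
				      size_t granule)
{
	s->dead = true;
	atomic64_add(granule, &s->missing_bytes);

	/*
	 * Map the same physical page at every faulting address: the job
	 * runs to completion (producing garbage), and missing_bytes ends
	 * up being the accurate estimate of what the frame needed.
	 */
	return my_mmu_map_page(s, iova, page_to_phys(s->reserve_page), granule);
}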
On Fri, 11 Apr 2025 14:45:49 +0200 Christian König <christian.koenig@amd.com> wrote: > Am 11.04.25 um 14:02 schrieb Boris Brezillon: > >>> I guess this leaves older GPUs that don't support incremental > >>> rendering in a bad place though. > >> Well what's the handling there currently? Just crash when you're > >> OOM? > > It's "alloc(GFP_KERNEL) and crash if it fails or times out", yes. > > Oh, please absolutely don't! Using GFP_KERNEL here is as evil as it > can be. I'm not saying that's what we should do, I'm just telling you what's done at the moment. The whole point of this series is to address some of that mess :P. > > Background is that you don't get a crash, nor error message, nor > anything indicating what is happening. The job times out at some point, but we might get stuck in the fault handler waiting for memory, which is pretty close to a deadlock, I suspect. > > You just get a deadlocked system with the user wondering why the heck > the system doesn't response any more. Not sure that's a true deadlock, for the reason explained before (no shrinker in panthor, and panfrost only reclaims idle BOs, so no waits on fences there either), but that doesn't make things great either.
Am 11.04.25 um 15:00 schrieb Boris Brezillon: > On Fri, 11 Apr 2025 14:45:49 +0200 > Christian König <christian.koenig@amd.com> wrote: > >> Am 11.04.25 um 14:02 schrieb Boris Brezillon: >>>>> I guess this leaves older GPUs that don't support incremental >>>>> rendering in a bad place though. >>>> Well what's the handling there currently? Just crash when you're >>>> OOM? >>> It's "alloc(GFP_KERNEL) and crash if it fails or times out", yes. >> Oh, please absolutely don't! Using GFP_KERNEL here is as evil as it >> can be. > I'm not saying that's what we should do, I'm just telling you what's > done at the moment. The whole point of this series is to address some > that mess :P. Then it is absolutely welcome that you take a look at this :D Oh my, how have we missed that previously? > >> Background is that you don't get a crash, nor error message, nor >> anything indicating what is happening. > The job times out at some point, but we might get stuck in the fault > handler waiting for memory, which is pretty close to a deadlock, I > suspect. I don't know those drivers that well, but at least for amdgpu the problem would be that the timeout handling would need to grab some of the locks the memory management is holding while waiting for the timeout handling to do something.... So that basically perfectly closes the circle. With a bit of luck you get a message after some time that the kernel is stuck, but since those are all sleeping locks I strongly doubt it. As an immediate action, please provide patches which change those GFP_KERNEL allocations into GFP_NOWAIT. Thanks in advance, Christian. > >> You just get a deadlocked system with the user wondering why the heck >> the system doesn't response any more. > Not sure that's a true deadlock, for the reason explained before (no > shrinker in panthor, and panfrost only reclaims idle BOs, so no waits > on fences there either), but that doesn't make things great either.
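The change being asked for is mechanical; on a made-up snippet (not the actual panfrost/lima code) it amounts to the following. The price is that the fault handler can now see allocation failures under memory pressure, which is exactly the case the rest of this thread is about handling gracefully.

	/* was: page = alloc_page(GFP_KERNEL); */
	page = alloc_page(GFP_NOWAIT | __GFP_NOWARN);
	if (!page)
		return -ENOMEM;	/* job faults/times out instead of stalling in reclaim */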
> 2. Device Lost > -------------- > > At this point we're left with no other choice than to kill the context. > And userspace should be able to cope with VK_DEVICE_LOST (hopefully zink > does), but it will probably not cope well with an entire strom of these > just to get the first frame out. > > Here comes the horrible trick: > > We'll keep rendering the entire frame by just smashing one single reserve > page (per context) into the pte every time there's a fault. It will result > in total garbage, and we probably want to shot the context the moment the > VS stages have finished, but it allows us to collect an accurate estimate > of how much memory we'd have needed. We need to pass that to the vulkan > driver as part of the device lost processing, so that it can keep that as > the starting point for the userspace dynamic memory requirement > guesstimate as a lower bound. Together with the (scaled to that > requirement) gpu driver memory pool and the core mm watermarks, that > should allow us to not hit a device lost again hopefully. This doesn't work if vertex stages are allowed to have side effects (which is required for adult-level APIs and can effectively get hit with streamout on panfrost). Once you have anything involving side effects, you can't replay work, there's no way to cope with that. No magic Zink can do either.
On Fri, 11 Apr 2025 15:13:26 +0200 Christian König <christian.koenig@amd.com> wrote: > > > >> Background is that you don't get a crash, nor error message, nor > >> anything indicating what is happening. > > The job times out at some point, but we might get stuck in the fault > > handler waiting for memory, which is pretty close to a deadlock, I > > suspect. > > I don't know those drivers that well, but at least for amdgpu the > problem would be that the timeout handling would need to grab some of > the locks the memory management is holding waiting for the timeout > handling to do something.... > > So that basically perfectly closes the circle. With a bit of lock you > get a message after some time that the kernel is stuck, but since > that are all sleeping locks I strongly doubt so. > > As immediately action please provide patches which changes those > GFP_KERNEL into GFP_NOWAIT. Sure, I can do that.
On Fri, Apr 11, 2025 at 09:52:30AM -0400, Alyssa Rosenzweig wrote: > > 2. Device Lost > > -------------- > > > > At this point we're left with no other choice than to kill the context. > > And userspace should be able to cope with VK_DEVICE_LOST (hopefully zink > > does), but it will probably not cope well with an entire strom of these > > just to get the first frame out. > > > > Here comes the horrible trick: > > > > We'll keep rendering the entire frame by just smashing one single reserve > > page (per context) into the pte every time there's a fault. It will result > > in total garbage, and we probably want to shot the context the moment the > > VS stages have finished, but it allows us to collect an accurate estimate > > of how much memory we'd have needed. We need to pass that to the vulkan > > driver as part of the device lost processing, so that it can keep that as > > the starting point for the userspace dynamic memory requirement > > guesstimate as a lower bound. Together with the (scaled to that > > requirement) gpu driver memory pool and the core mm watermarks, that > > should allow us to not hit a device lost again hopefully. > > This doesn't work if vertex stages are allowed to have side effects > (which is required for adult-level APIs and can effectively get hit with > streamout on panfrost). Once you have anything involving side effects, > you can't replay work, there's no way to cope with that. No magic Zink > can do either. Yeah, no attempt at replay, it's just standard gl error handling. So it's either tossing the context and reporting that through arb_robustness, or tossing the context, "transparently" creating a new one, and a mix of recreating some internal driver objects and thoughts&prayers to give the new context the best chances possible. You really want all the tricks in step 1 and the quirks in 3 to make sure this doesn't ever happen. Or at most once per app, hopefully. I promised terrible after all :-P Cheers, Sima
On Fri, Apr 11, 2025 at 02:50:19PM +0200, Christian König wrote: > Am 11.04.25 um 14:01 schrieb Simona Vetter: > > On Thu, Apr 10, 2025 at 08:41:55PM +0200, Boris Brezillon wrote: > >> On Thu, 10 Apr 2025 14:01:03 -0400 > >> Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote: > >> > >>>>>> In Panfrost and Lima, we don't have this concept of "incremental > >>>>>> rendering", so when we fail the allocation, we just fail the GPU job > >>>>>> with an unhandled GPU fault. > >>>>> To be honest I think that this is enough to mark those two drivers as > >>>>> broken. It's documented that this approach is a no-go for upstream > >>>>> drivers. > >>>>> > >>>>> How widely is that used? > >>>> It exists in lima and panfrost, and I wouldn't be surprised if a similar > >>>> mechanism was used in other drivers for tiler-based GPUs (etnaviv, > >>>> freedreno, powervr, ...), because ultimately that's how tilers work: > >>>> the amount of memory needed to store per-tile primitives (and metadata) > >>>> depends on what the geometry pipeline feeds the tiler with, and that > >>>> can't be predicted. If you over-provision, that's memory the system won't > >>>> be able to use while rendering takes place, even though only a small > >>>> portion might actually be used by the GPU. If your allocation is too > >>>> small, it will either trigger a GPU fault (for HW not supporting an > >>>> "incremental rendering" mode) or under-perform (because flushing > >>>> primitives has a huge cost on tilers). > >>> Yes and no. > >>> > >>> Although we can't allocate more memory for /this/ frame, we know the > >>> required size is probably constant across its lifetime. That gives a > >>> simple heuristic to manage the tiler heap efficiently without > >>> allocations - even fallible ones - in the fence signal path: > >>> > >>> * Start with a small fixed size tiler heap > >>> * Try to render, let incremental rendering kick in when it's too small. > >>> * When cleaning up the job, check if we used incremental rendering. > >>> * If we did - double the size of the heap the next time we submit work. > >>> > >>> The tiler heap still grows dynamically - it just does so over the span > >>> of a couple frames. In practice that means a tiny hit to startup time as > >>> we dynamically figure out the right size, incurring extra flushing at > >>> the start, without needing any "grow-on-page-fault" heroics. > >>> > >>> This should solve the problem completely for CSF/panthor. So it's only > >>> hardware that architecturally cannot do incremental rendering (older > >>> Mali: panfrost/lima) where we need this mess. > >> OTOH, if we need something > >> for Utgard(Lima)/Midgard/Bifrost/Valhall(Panfrost), why not use the same > >> thing for CSF, since CSF is arguably the sanest of all the HW > >> architectures listed above: allocation can fail/be non-blocking, > >> because there's a fallback to incremental rendering when it fails. > > So this is a really horrible idea to sort this out for panfrost hardware, > > which doesn't have a tiler cache flush as a fallback. It's roughly three > > stages: > > > > 1. A pile of clever tricks to make the chances of running out of memory > > really low. Most of these also make sense for panthor platforms, just as a > > performance optimization. > > > > 2. I terrible way to handle the unavoidable VK_DEVICE_LOST, but in a way > > such that the impact should be minimal. This is nasty, and we really want > > to avoid that for panthor. > > > > 3. Mesa quirks so that 2 doesn't actually ever happen in practice. > > > > 1. 
Clever tricks > > ---------------- > > > > This is a cascade of tricks we can pull in the gpu fault handler: > > > > 1a. Allocate with GFP_NORECLAIM. We want this first because that triggers > > background reclaim, and that might be enough to get us through and free > > some easy caches (like clean fs cache and stuff like that which can just > > be dropped). > > > > 1b Userspace needs to guesstimate a good guess for how much we'll need. I'm > > hoping that between render target size and maybe counting the total > > amounts of vertices we can do a decent guesstimate for many workloads. > > Note that goal here is not to ensure success, but just to get the rough > > ballpark. The actual starting number here should aim fairly low, so that > > we avoid wasting memory since this is memory wasted on every context > > (that uses a feature which needs dynamic memory allocation, which I > > guess for pan* is everything, but for apple it would be more limited). > > > > 1c The kernel then keeps an additional global memory pool. Note this would > > not have the same semantics as mempool.h, which is aimed GFP_NOIO > > forward progress guarantees, but more as a preallocation pool. In every > > CS ioctl we'll make sure the pool is filled, and we probably want to > > size the pool relative to the context with the biggest dynamic memory > > usage. So probably this thing needs a shrinker, so we can reclaim it > > when you don't run an app with a huge buffer need on the gpu anymore. > > > > Note that we're still not sizing this to guarantee success, but together > > with the userspace heuristics it should be big enough to almost always > > work out. And since it's global reserve we can afford to waste a bit > > more memory on this one. We might also want to scale this pool by the > > total memory available, like the watermarks core mm computes. We'll only > > hang onto this memory when the gpu is in active usage, so this should be > > fine. > > > > Also the preallocation would need to happen without holding the memory > > pool look, so that we can use GFP_KERNEL. > > > > Up to this point I think it's all tricks that panthor also wants to > > employ. > > > > 1d Next up is scratch dynamic memory. If we can assume that the memory does > > not need to survive a batchbuffer (hopefully the case with vulkan render > > pass) we could steal such memory from other contexts. We could even do > > that for contexts which are queued but not yet running on the hardware > > (might need unloading them to be doable with fw renderers like > > panthor/CSF) as long as we keep such stolen dynamic memory on a separate > > free list. Because if we'd entirely free this, or release it into the > > memory pool we'll make things worse for these other contexts, we need to > > be able to guarantee that any context can always get all the stolen > > dynamic pages back before we start running it on the gpu. > > > > Since the tracking for the memory pool in 1c and the stolen memory in 1d > > has a bunch of tricky corner cases we probably want that as a drm helper > > which globally does that book-keeping for all drm/sched instances. That > > might even help if some NN accel thing also needs this on the same SoC. > > > > 1e Finally, if all else is lost we can try GFP_ATOMIC. This will eat into > > reserve memory pools the core mm maintains. 
And hopefully we've spent > > enough time between this step and the initial GFP_NORECLAIM that > > background reclaim had a chance to free some memory, hence why all the > > memory pool and memory stealing tricks should be in between. > > > > The above will need quite some tuning to make sure it works as often as > > possible, without wasting undue amounts of memory. It's classic memory > > overcommit tuning. > > > > 2. Device Lost > > -------------- > > > > At this point we're left with no other choice than to kill the context. > > And userspace should be able to cope with VK_DEVICE_LOST (hopefully zink > > does), but it will probably not cope well with an entire strom of these > > just to get the first frame out. > > > > Here comes the horrible trick: > > > > We'll keep rendering the entire frame by just smashing one single reserve > > page (per context) into the pte every time there's a fault. It will result > > in total garbage, and we probably want to shot the context the moment the > > VS stages have finished, but it allows us to collect an accurate estimate > > of how much memory we'd have needed. We need to pass that to the vulkan > > driver as part of the device lost processing, so that it can keep that as > > the starting point for the userspace dynamic memory requirement > > guesstimate as a lower bound. Together with the (scaled to that > > requirement) gpu driver memory pool and the core mm watermarks, that > > should allow us to not hit a device lost again hopefully. > > > > I think if the CS ioctl both takes the current guesstimate from userspace, > > and passes back whatever the current maximum known to the kernel is (we > > need that anyway for the steps in stage 1 to make sense I think), then > > that should also work for dead context where the CS ioctl returns -EIO or > > something like that. > > +1 > > It is a bit radical but as far as I can see that is the way to go. > > Additionally I think it's a must have to have a module parameter or > similar to exercise this situation even without getting the whole kernel > into an OOM situation. > > Fault injection to exercise fallback paths are is just mandatory for > stuff like that. Yeah, although I'd lean towards debugfs so that an igt could nicely control it all and let this blow up in a very controlled way. Probably some bitflag to make each of the options in step 1 fail to get anything useful, or at least the GFP_NORECLAIM and the GFP_ATOMIC ones. -Sima > > Regards, > Christian. > > > > > > 3. Mesa quirks > > -------------- > > > > A database of the fixed dynamic memory requirements we get from step 2 > > (through bug reports), so that we can avoid that mess going forward. > > > > If there's too many of those, we'll probably need to improve the > > guesstimation 1b if it's systematically off by too much. We might even > > need to store that on-disk, like shader caches, so that you get a crash > > once and then it should work even without an explicit mesa quirk. > > > > Documentation > > ------------- > > > > Assuming this is not too terrible an idea and we reach some consensus, I > > think it'd be good to bake this into a doc patch somewhere in the > > dma_fence documentation. Also including Alyssa recipe for when you have > > hardware support for flushing partial rendering. And maybe generalized so > > it makes sense for the GS/streamout/... fun on apple hardware. > > > > And perhaps with a section added that points to actual code to help make > > this happen like the sparse bo support in this series. 
> > > > Cheers, Sima >
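A minimal version of that debugfs knob could look like this; the mydrv_* names and bit assignments are invented, only debugfs_create_u32() is an existing API. Each allocation step in the cascade would then check mydrv_should_fail() before attempting its allocation, so an igt test can force any of the fallbacks deterministically.

#include <linux/bits.h>
#include <linux/compiler.h>
#include <linux/debugfs.h>
#include <linux/types.h>

/* Each bit forces one step of the allocation cascade to fail. */
#define MYDRV_FAIL_NOWAIT_ALLOC	BIT(0)
#define MYDRV_FAIL_POOL_ALLOC	BIT(1)
#define MYDRV_FAIL_ATOMIC_ALLOC	BIT(2)

static u32 mydrv_alloc_fail_mask;

static void mydrv_debugfs_init(struct dentry *root)
{
	debugfs_create_u32("alloc_fail_mask", 0600, root,
			   &mydrv_alloc_fail_mask);
}

static bool mydrv_should_fail(u32 step)
{
	return READ_ONCE(mydrv_alloc_fail_mask) & step;
}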
On Fri, Apr 11, 2025 at 12:55:57PM +0200, Christian König wrote: > Am 11.04.25 um 10:38 schrieb Boris Brezillon: > > On Fri, 11 Apr 2025 10:04:07 +0200 > > Christian König <christian.koenig@amd.com> wrote: > > > >> Am 10.04.25 um 20:41 schrieb Boris Brezillon: > >>> On Thu, 10 Apr 2025 14:01:03 -0400 > >>> Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote: > >>> > >>>>>>> In Panfrost and Lima, we don't have this concept of "incremental > >>>>>>> rendering", so when we fail the allocation, we just fail the > >>>>>>> GPU job with an unhandled GPU fault. > >>>>>> To be honest I think that this is enough to mark those two > >>>>>> drivers as broken. It's documented that this approach is a > >>>>>> no-go for upstream drivers. > >>>>>> > >>>>>> How widely is that used? > >>>>> It exists in lima and panfrost, and I wouldn't be surprised if a > >>>>> similar mechanism was used in other drivers for tiler-based GPUs > >>>>> (etnaviv, freedreno, powervr, ...), because ultimately that's how > >>>>> tilers work: the amount of memory needed to store per-tile > >>>>> primitives (and metadata) depends on what the geometry pipeline > >>>>> feeds the tiler with, and that can't be predicted. If you > >>>>> over-provision, that's memory the system won't be able to use > >>>>> while rendering takes place, even though only a small portion > >>>>> might actually be used by the GPU. If your allocation is too > >>>>> small, it will either trigger a GPU fault (for HW not supporting > >>>>> an "incremental rendering" mode) or under-perform (because > >>>>> flushing primitives has a huge cost on tilers). > >>>> Yes and no. > >>>> > >>>> Although we can't allocate more memory for /this/ frame, we know > >>>> the required size is probably constant across its lifetime. That > >>>> gives a simple heuristic to manage the tiler heap efficiently > >>>> without allocations - even fallible ones - in the fence signal > >>>> path: > >>>> > >>>> * Start with a small fixed size tiler heap > >>>> * Try to render, let incremental rendering kick in when it's too > >>>> small. > >>>> * When cleaning up the job, check if we used incremental rendering. > >>>> * If we did - double the size of the heap the next time we submit > >>>> work. > >>>> > >>>> The tiler heap still grows dynamically - it just does so over the > >>>> span of a couple frames. In practice that means a tiny hit to > >>>> startup time as we dynamically figure out the right size, > >>>> incurring extra flushing at the start, without needing any > >>>> "grow-on-page-fault" heroics. > >>>> > >>>> This should solve the problem completely for CSF/panthor. So it's > >>>> only hardware that architecturally cannot do incremental rendering > >>>> (older Mali: panfrost/lima) where we need this mess. > >>> OTOH, if we need something > >>> for Utgard(Lima)/Midgard/Bifrost/Valhall(Panfrost), why not use the > >>> same thing for CSF, since CSF is arguably the sanest of all the HW > >>> architectures listed above: allocation can fail/be non-blocking, > >>> because there's a fallback to incremental rendering when it fails. > >> Yeah that is a rather interesting point Alyssa noted here. > >> > >> So basically you could as well implement it like this: > >> 1. Userspace makes a submission. > >> 2. HW finds buffer is not large enough, sets and error code and > >> completes submission. 3. Userspace detects error, re-allocates buffer > >> with increased size. 4. Userspace re-submits to incremental complete > >> the submission. 5. Eventually repeat until fully completed. 
> >> > >> That would work but is likely just not the most performant solution. > >> So faulting in memory on demand is basically just an optimization and > >> that is ok as far as I can see. > > Yeah, Alyssa's suggestion got me thinking too, and I think I can come > > up with a plan where we try non-blocking allocation first, and if it > > fails, we trigger incremental rendering, and queue a blocking > > heap-chunk allocation on separate workqueue, such that next time the > > tiler heap hits an OOM, it has a chunk (or multiple chunks) readily > > available if the blocking allocation completed in the meantime. That's > > basically what Alyssa suggested, with an optimization if the system is > > not under memory pressure, and without userspace being involved (so no > > uAPI changes). Please, no background task that tries to find memory; you're just reinventing the background kswapd shrinking, which even a GFP_NORECLAIM allocation should kick off. Instead just rely on kswapd to hopefully get you through the current job without undue amounts of tiler flushing. Then just grow dynamic memory synchronously with some heuristics in the next CS ioctl; that gives you appropriate amounts of throttling, no issues with error reporting, and you can just use GFP_KERNEL. > That sounds like it most likely won't work. In an OOM situation the > blocking allocation would just cause more pressure to complete your > rendering to free up memory. The issue isn't OOM, the issue is that a GFP_NORECLAIM allocation can't even throw out clean caches, so you're limited to the watermarks and what kswapd manages to clean out in parallel. Real OOM is much, much nastier, and there you should just get an ENOMEM from the CS ioctl. Ideally at least, because that gives throttling and all that nice stuff for free (for some value of "nice", lots of folks really despise the stalls that introduces). Cheers, Sima > > I guess this leaves older GPUs that don't support incremental rendering > > in a bad place though. > > Well what's the handling there currently? Just crash when you're OOM? > > Regards, > Christian. > > > > >> That is then a rather good justification for your work Boris. Because > >> a common component makes it possible to implement a common fault > >> injection functionality to make sure that the fallback path is > >> properly exercised in testing. > > I can also add an fault injection mechanism to validate that, yep. > > > > Thanks, > > > > Boris >
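Sketched out, the synchronous variant described here is small; tiler_heap_add_chunk() and the needs_grow flag are hypothetical, the point is just where the blocking allocation lives.

#include <linux/atomic.h>
#include <linux/gfp.h>

struct tiler_heap {
	atomic_t needs_grow;	/* set by the fault handler on tiler OOM */
};

/* Hypothetical helper that allocates and maps one more heap chunk. */
int tiler_heap_add_chunk(struct tiler_heap *heap, gfp_t gfp);

/* Called from the CS/submit ioctl, i.e. outside the fence-signalling path. */
static int tiler_heap_grow_if_needed(struct tiler_heap *heap)
{
	if (!atomic_xchg(&heap->needs_grow, 0))
		return 0;

	/*
	 * GFP_KERNEL is fine here and naturally throttles the submitter;
	 * on real OOM the ioctl simply returns -ENOMEM to userspace.
	 */
	return tiler_heap_add_chunk(heap, GFP_KERNEL);
}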
Hi Sima, On Fri, 11 Apr 2025 14:01:16 +0200 Simona Vetter <simona.vetter@ffwll.ch> wrote: > On Thu, Apr 10, 2025 at 08:41:55PM +0200, Boris Brezillon wrote: > > On Thu, 10 Apr 2025 14:01:03 -0400 > > Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote: > > > > > > > > In Panfrost and Lima, we don't have this concept of "incremental > > > > > > rendering", so when we fail the allocation, we just fail the GPU job > > > > > > with an unhandled GPU fault. > > > > > > > > > > To be honest I think that this is enough to mark those two drivers as > > > > > broken. It's documented that this approach is a no-go for upstream > > > > > drivers. > > > > > > > > > > How widely is that used? > > > > > > > > It exists in lima and panfrost, and I wouldn't be surprised if a similar > > > > mechanism was used in other drivers for tiler-based GPUs (etnaviv, > > > > freedreno, powervr, ...), because ultimately that's how tilers work: > > > > the amount of memory needed to store per-tile primitives (and metadata) > > > > depends on what the geometry pipeline feeds the tiler with, and that > > > > can't be predicted. If you over-provision, that's memory the system won't > > > > be able to use while rendering takes place, even though only a small > > > > portion might actually be used by the GPU. If your allocation is too > > > > small, it will either trigger a GPU fault (for HW not supporting an > > > > "incremental rendering" mode) or under-perform (because flushing > > > > primitives has a huge cost on tilers). > > > > > > Yes and no. > > > > > > Although we can't allocate more memory for /this/ frame, we know the > > > required size is probably constant across its lifetime. That gives a > > > simple heuristic to manage the tiler heap efficiently without > > > allocations - even fallible ones - in the fence signal path: > > > > > > * Start with a small fixed size tiler heap > > > * Try to render, let incremental rendering kick in when it's too small. > > > * When cleaning up the job, check if we used incremental rendering. > > > * If we did - double the size of the heap the next time we submit work. > > > > > > The tiler heap still grows dynamically - it just does so over the span > > > of a couple frames. In practice that means a tiny hit to startup time as > > > we dynamically figure out the right size, incurring extra flushing at > > > the start, without needing any "grow-on-page-fault" heroics. > > > > > > This should solve the problem completely for CSF/panthor. So it's only > > > hardware that architecturally cannot do incremental rendering (older > > > Mali: panfrost/lima) where we need this mess. > > > > OTOH, if we need something > > for Utgard(Lima)/Midgard/Bifrost/Valhall(Panfrost), why not use the same > > thing for CSF, since CSF is arguably the sanest of all the HW > > architectures listed above: allocation can fail/be non-blocking, > > because there's a fallback to incremental rendering when it fails. > > So this is a really horrible idea to sort this out for panfrost hardware, > which doesn't have a tiler cache flush as a fallback. It's roughly three > stages: > > 1. A pile of clever tricks to make the chances of running out of memory > really low. Most of these also make sense for panthor platforms, just as a > performance optimization. > > 2. I terrible way to handle the unavoidable VK_DEVICE_LOST, but in a way > such that the impact should be minimal. This is nasty, and we really want > to avoid that for panthor. > > 3. Mesa quirks so that 2 doesn't actually ever happen in practice. > > 1. 
Clever tricks > ---------------- > > This is a cascade of tricks we can pull in the gpu fault handler: > > 1a. Allocate with GFP_NORECLAIM. We want this first because that triggers > background reclaim, and that might be enough to get us through and free > some easy caches (like clean fs cache and stuff like that which can just > be dropped). There's no GFP_NORECLAIM, and given the discussions we had before, I guess you meant GFP_NOWAIT. Otherwise it's the __GFP_NOWARN | __GFP_NORETRY I used in this series, and it probably doesn't try hard enough as pointed out by you and Christian. > > 1b Userspace needs to guesstimate a good guess for how much we'll need. I'm > hoping that between render target size and maybe counting the total > amounts of vertices we can do a decent guesstimate for many workloads. There are extra parameters to take into account, like the tile hierarchy mask (number of binning lists instantiated) and probably other things I forget, but for simple vertex+fragment pipelines and direct draws, guessing the worst memory usage case is probably doable. Throw indirect draws into the mix, and it suddenly becomes a lot more complicated. Not even talking about GEOM/TESS stages, which makes the guessing even harder AFAICT. > Note that goal here is not to ensure success, but just to get the rough > ballpark. The actual starting number here should aim fairly low, so that > we avoid wasting memory since this is memory wasted on every context > (that uses a feature which needs dynamic memory allocation, which I > guess for pan* is everything, but for apple it would be more limited). Ack. > > 1c The kernel then keeps an additional global memory pool. Note this would > not have the same semantics as mempool.h, which is aimed GFP_NOIO > forward progress guarantees, but more as a preallocation pool. In every > CS ioctl we'll make sure the pool is filled, and we probably want to > size the pool relative to the context with the biggest dynamic memory > usage. So probably this thing needs a shrinker, so we can reclaim it > when you don't run an app with a huge buffer need on the gpu anymore. Okay, that's a technique Arm has been using in their downstream driver (it named JIT-allocation there). > > Note that we're still not sizing this to guarantee success, but together > with the userspace heuristics it should be big enough to almost always > work out. And since it's global reserve we can afford to waste a bit > more memory on this one. We might also want to scale this pool by the > total memory available, like the watermarks core mm computes. We'll only > hang onto this memory when the gpu is in active usage, so this should be > fine. Sounds like a good idea. > > Also the preallocation would need to happen without holding the memory > pool look, so that we can use GFP_KERNEL. > > Up to this point I think it's all tricks that panthor also wants to > employ. > > 1d Next up is scratch dynamic memory. If we can assume that the memory does > not need to survive a batchbuffer (hopefully the case with vulkan render > pass) we could steal such memory from other contexts. We could even do > that for contexts which are queued but not yet running on the hardware > (might need unloading them to be doable with fw renderers like > panthor/CSF) as long as we keep such stolen dynamic memory on a separate > free list. 
Because if we'd entirely free this, or release it into the > memory pool we'll make things worse for these other contexts, we need to > be able to guarantee that any context can always get all the stolen > dynamic pages back before we start running it on the gpu. Actually, CSF stands in the way of re-allocating memory to other contexts, because once we've allocated memory to a tiler heap, the FW manages this pool of chunks, and recycles them. Mesa can intercept the "returned chunks" and collect those chunks instead of re-assigning them to the tiler heap through a CS instruction (which goes through the FW internally), but that involves extra collaboration between the UMD, KMD and FW which we don't have at the moment. Not saying never, but I'd rather fix things gradually (first the blocking alloc in the fence-signalling path, then the optimization to share the extra mem reservation cost among contexts by returning the chunks to the global kernel pool rather than directly to the heap). This approach should work fine with JM GPUs where the tiler heap is entirely managed by the KMD though. > > Since the tracking for the memory pool in 1c and the stolen memory in 1d > has a bunch of tricky corner cases we probably want that as a drm helper > which globally does that book-keeping for all drm/sched instances. That > might even help if some NN accel thing also needs this on the same SoC. Yeah, I think whatever we go for here should be exposed as generic helpers with proper documentation to explain when and how to use those, because otherwise we're back to the original situation where each driver copies the pattern of another driver and adjusts it to its needs, even when the original design was unsound. > > 1e Finally, if all else is lost we can try GFP_ATOMIC. This will eat into > reserve memory pools the core mm maintains. And hopefully we've spent > enough time between this step and the initial GFP_NORECLAIM that > background reclaim had a chance to free some memory, hence why all the > memory pool and memory stealing tricks should be in between. Any reason not to use GFP_NOWAIT on this last attempt? I'm not sure I like the idea of stealing memory from the reserved amount of atomic mem. I mean, if we get there and there's still not enough memory to satisfy a "normal" allocation, it's probably a good time to fail with an actual OOM. > > The above will need quite some tuning to make sure it works as often as > possible, without wasting undue amounts of memory. It's classic memory > overcommit tuning. > > 2. Device Lost > -------------- > > At this point we're left with no other choice than to kill the context. > And userspace should be able to cope with VK_DEVICE_LOST (hopefully zink > does), but it will probably not cope well with an entire strom of these > just to get the first frame out. > > Here comes the horrible trick: > > We'll keep rendering the entire frame by just smashing one single reserve > page (per context) into the pte every time there's a fault. It will result > in total garbage, and we probably want to shot the context the moment the > VS stages have finished, but it allows us to collect an accurate estimate > of how much memory we'd have needed. We need to pass that to the vulkan > driver as part of the device lost processing, so that it can keep that as > the starting point for the userspace dynamic memory requirement > guesstimate as a lower bound. 
Together with the (scaled to that > requirement) gpu driver memory pool and the core mm watermarks, that > should allow us to not hit a device lost again hopefully. If the intent is to progressively increase the size of this global memory pool, and the min-resident-size of the growable buffer that triggered the OOM, I guess we can just go for something dumb like double-the-size or increment-it-in-steps and flag the buffer on which the fault happened such that userspace can query that at DEVICE_LOST time and properly resize this internal buffer the next time a CreateDevice() call comes from the same process/thread. I'm not sure we need to actually execute the shader after the OOM to accurately guess the size. > > I think if the CS ioctl both takes the current guesstimate from userspace, > and passes back whatever the current maximum known to the kernel is (we > need that anyway for the steps in stage 1 to make sense I think), then > that should also work for dead context where the CS ioctl returns -EIO or > something like that. > > 3. Mesa quirks > -------------- > > A database of the fixed dynamic memory requirements we get from step 2 > (through bug reports), so that we can avoid that mess going forward. Yeah, we already have a dri-conf option to force the initial size of the tiler heap on CSF hardware, so we could easily ship a DB of per-app initial sizes. > > If there's too many of those, we'll probably need to improve the > guesstimation 1b if it's systematically off by too much. We might even > need to store that on-disk, like shader caches, so that you get a crash > once and then it should work even without an explicit mesa quirk. Makes sense. Maybe some sort of implicit dri-conf DB that overloads the explicit one? > > Documentation > ------------- > > Assuming this is not too terrible an idea and we reach some consensus, I > think it'd be good to bake this into a doc patch somewhere in the > dma_fence documentation. Also including Alyssa recipe for when you have > hardware support for flushing partial rendering. And maybe generalized so > it makes sense for the GS/streamout/... fun on apple hardware. I think the tricky part is properly guessing the amount of memory needed for a draw. For simple vertex+fragment pipelines with direct draws, that's doable, but for more complex stuff like GEOM/TESS and indirect draws, that gets tricky, because some of the parameters are not known at CS recording time. So my gut feeling is that, rather than trying to find a smart way to guesstimate that stuff, we should rely on simple heuristics like the double-the-resident-size-on-the-faulty-sparse-GEM thing suggested by Alyssa. Many thanks to all the people who chimed in. I'll wait until the discussion settles on something that works for everyone and try to sum up the outcome of this discussion in some drm-tile-based-renderer.rst doc file (or any other file name you think this should be put in) so we have a clear plan of what we want to do/how we want it done. Regards, Boris
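To make the 1a/1c/1e cascade concrete, here is a minimal C sketch of what the fault-handler allocation path could look like. The pool structure and helper names are hypothetical, invented for illustration; the only real pieces are the gfp flags under discussion, and the exact last-resort flags are still being debated later in this thread.

```c
#include <linux/gfp.h>
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/spinlock.h>

/* Hypothetical per-device reserve pool, refilled with GFP_KERNEL at CS ioctl time. */
struct heap_page_pool {
	spinlock_t lock;
	struct list_head pages;	/* free pages, linked through page->lru */
	unsigned int count;
};

static struct page *heap_pool_take(struct heap_page_pool *pool)
{
	struct page *page = NULL;

	spin_lock(&pool->lock);
	if (pool->count) {
		page = list_first_entry(&pool->pages, struct page, lru);
		list_del(&page->lru);
		pool->count--;
	}
	spin_unlock(&pool->lock);
	return page;
}

/*
 * Called from the GPU fault handler: must never block on direct reclaim,
 * because we are in the dma_fence signalling path.
 */
static struct page *heap_alloc_page_atomic(struct heap_page_pool *pool)
{
	struct page *page;

	/* 1a: cheap attempt, may wake background reclaim but never sleeps. */
	page = alloc_page(GFP_NOWAIT | __GFP_NOWARN);
	if (page)
		return page;

	/* 1c: fall back to the preallocated reserve pool. */
	page = heap_pool_take(pool);
	if (page)
		return page;

	/* 1e: last resort, dip into the core mm emergency reserves. */
	return alloc_page(GFP_ATOMIC | __GFP_NOWARN);
}
```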
On Fri, 11 Apr 2025 16:39:02 +0200 Boris Brezillon <boris.brezillon@collabora.com> wrote: > On Fri, 11 Apr 2025 15:13:26 +0200 > Christian König <christian.koenig@amd.com> wrote: > > > > > > >> Background is that you don't get a crash, nor error message, nor > > >> anything indicating what is happening. > > > The job times out at some point, but we might get stuck in the fault > > > handler waiting for memory, which is pretty close to a deadlock, I > > > suspect. > > > > I don't know those drivers that well, but at least for amdgpu the > > problem would be that the timeout handling would need to grab some of > > the locks the memory management is holding waiting for the timeout > > handling to do something.... > > > > So that basically perfectly closes the circle. With a bit of lock you > > get a message after some time that the kernel is stuck, but since > > that are all sleeping locks I strongly doubt so. > > > > As immediately action please provide patches which changes those > > GFP_KERNEL into GFP_NOWAIT. > > Sure, I can do that. Hm, I might have been too quick to claim this was doable. In practice, doing that might regress Lima and Panfrost in situations where trying harder than GFP_NOWAIT would free up some memory. Not saying it was right to use GFP_KERNEL in the first place, but some expectations were set by this original mistake, so I'll probably need Lima developers to vouch for this change after they've done some testing on a system under high memory pressure, and I'd need to do the same kind of testing for Panfrost and ask Steve if he's okay with that too. For Panthor, I'm less worried, because we have the incremental rendering fallback, and assuming GFP_NOWAIT tries hard enough to reclaim the low-hanging fruit, performance shouldn't suffer much more than it does today, where GFP_KERNEL allocations can potentially delay tiling operations longer than a primitive flush would have.
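For context, the change being requested is essentially the one sketched below: when heap pages are populated from the GPU fault handler, pass a non-blocking gfp mask down to shmem instead of GFP_KERNEL. shmem_read_mapping_page_gfp() is an existing kernel helper; the wrapper around it is only an illustration, not the actual Lima/Panfrost code, and whether plain GFP_NOWAIT reclaims hard enough is exactly the open question above.

```c
#include <linux/pagemap.h>
#include <linux/shmem_fs.h>

/*
 * Illustrative only: populate one page of a heap BO from the GPU fault
 * handler. The mapping comes from the shmem file backing the GEM object.
 */
static struct page *heap_populate_page(struct address_space *mapping,
				       pgoff_t pgoff)
{
	/*
	 * GFP_KERNEL here could block on direct reclaim while fences are
	 * pending; GFP_NOWAIT | __GFP_NOWARN only wakes kswapd and fails
	 * fast, letting the caller fall back to a reserve pool instead.
	 */
	return shmem_read_mapping_page_gfp(mapping, pgoff,
					   GFP_NOWAIT | __GFP_NOWARN);
}
```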
> Actually, CSF stands in the way of re-allocating memory to other > contexts, because once we've allocated memory to a tiler heap, the FW > manages this pool of chunks, and recycles them. Mesa can intercept > the "returned chunks" and collect those chunks instead of re-assiging > then to the tiler heap through a CS instruction (which goes thought > the FW internallu), but that involves extra collaboration between the > UMD, KMD and FW which we don't have at the moment. Not saying never, > but I'd rather fix things gradually (first the blocking alloc in the > fence-signalling path, then the optimization to share the extra mem > reservation cost among contexts by returning the chunks to the global > kernel pool rather than directly to the heap). > > This approach should work fine with JM GPUs where the tiler heap is > entirely managed by the KMD though. I really think CSF should be relying on the simple heuristics with incremental-rendering, unless you can prove that's actually a performance issue in practice. (On Imagination/Apple parts, it almost never is and we rely entirely on this approach. It's ok - it really is. For simple 2D workloads, the initial heap allocation is fine. For 3D scenes, we need very few frames to get the right size. This doesn't cause stutters in practice.) For JM... yes, this discussion remains relevant of course.
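A userspace-side sketch of the heuristic described above, to make the growth policy explicit; all names and sizes are made up for illustration. The key property is that the decision happens at job-cleanup time, entirely outside the fence-signalling path.

```c
#include <stdbool.h>
#include <stdint.h>

#define INITIAL_TILER_HEAP_SIZE (2u * 1024 * 1024)   /* small initial guess */
#define MAX_TILER_HEAP_SIZE     (128u * 1024 * 1024) /* arbitrary safety cap */

struct tiler_heap_state {
	uint64_t next_size; /* heap size to use for the next submission */
};

static void tiler_heap_init(struct tiler_heap_state *heap)
{
	heap->next_size = INITIAL_TILER_HEAP_SIZE;
}

/* Called when a finished job is reaped, with a flag reported by the KMD/FW. */
static void tiler_heap_update(struct tiler_heap_state *heap,
			      bool used_incremental_rendering)
{
	if (!used_incremental_rendering)
		return;

	/* The heap was too small for this frame: grow it for the next one. */
	heap->next_size *= 2;
	if (heap->next_size > MAX_TILER_HEAP_SIZE)
		heap->next_size = MAX_TILER_HEAP_SIZE;
}
```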
On Mon, Apr 14, 2025 at 01:22:06PM +0200, Boris Brezillon wrote: > Hi Sima, > > On Fri, 11 Apr 2025 14:01:16 +0200 > Simona Vetter <simona.vetter@ffwll.ch> wrote: > > > On Thu, Apr 10, 2025 at 08:41:55PM +0200, Boris Brezillon wrote: > > > On Thu, 10 Apr 2025 14:01:03 -0400 > > > Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote: > > > > > > > > > > In Panfrost and Lima, we don't have this concept of "incremental > > > > > > > rendering", so when we fail the allocation, we just fail the GPU job > > > > > > > with an unhandled GPU fault. > > > > > > > > > > > > To be honest I think that this is enough to mark those two drivers as > > > > > > broken. It's documented that this approach is a no-go for upstream > > > > > > drivers. > > > > > > > > > > > > How widely is that used? > > > > > > > > > > It exists in lima and panfrost, and I wouldn't be surprised if a similar > > > > > mechanism was used in other drivers for tiler-based GPUs (etnaviv, > > > > > freedreno, powervr, ...), because ultimately that's how tilers work: > > > > > the amount of memory needed to store per-tile primitives (and metadata) > > > > > depends on what the geometry pipeline feeds the tiler with, and that > > > > > can't be predicted. If you over-provision, that's memory the system won't > > > > > be able to use while rendering takes place, even though only a small > > > > > portion might actually be used by the GPU. If your allocation is too > > > > > small, it will either trigger a GPU fault (for HW not supporting an > > > > > "incremental rendering" mode) or under-perform (because flushing > > > > > primitives has a huge cost on tilers). > > > > > > > > Yes and no. > > > > > > > > Although we can't allocate more memory for /this/ frame, we know the > > > > required size is probably constant across its lifetime. That gives a > > > > simple heuristic to manage the tiler heap efficiently without > > > > allocations - even fallible ones - in the fence signal path: > > > > > > > > * Start with a small fixed size tiler heap > > > > * Try to render, let incremental rendering kick in when it's too small. > > > > * When cleaning up the job, check if we used incremental rendering. > > > > * If we did - double the size of the heap the next time we submit work. > > > > > > > > The tiler heap still grows dynamically - it just does so over the span > > > > of a couple frames. In practice that means a tiny hit to startup time as > > > > we dynamically figure out the right size, incurring extra flushing at > > > > the start, without needing any "grow-on-page-fault" heroics. > > > > > > > > This should solve the problem completely for CSF/panthor. So it's only > > > > hardware that architecturally cannot do incremental rendering (older > > > > Mali: panfrost/lima) where we need this mess. > > > > > > OTOH, if we need something > > > for Utgard(Lima)/Midgard/Bifrost/Valhall(Panfrost), why not use the same > > > thing for CSF, since CSF is arguably the sanest of all the HW > > > architectures listed above: allocation can fail/be non-blocking, > > > because there's a fallback to incremental rendering when it fails. > > > > So this is a really horrible idea to sort this out for panfrost hardware, > > which doesn't have a tiler cache flush as a fallback. It's roughly three > > stages: > > > > 1. A pile of clever tricks to make the chances of running out of memory > > really low. Most of these also make sense for panthor platforms, just as a > > performance optimization. > > > > 2. 
I terrible way to handle the unavoidable VK_DEVICE_LOST, but in a way > > such that the impact should be minimal. This is nasty, and we really want > > to avoid that for panthor. > > > > 3. Mesa quirks so that 2 doesn't actually ever happen in practice. > > > > 1. Clever tricks > > ---------------- > > > > This is a cascade of tricks we can pull in the gpu fault handler: > > > > 1a. Allocate with GFP_NORECLAIM. We want this first because that triggers > > background reclaim, and that might be enough to get us through and free > > some easy caches (like clean fs cache and stuff like that which can just > > be dropped). > > There's no GFP_NORECLAIM, and given the discussions we had before, I > guess you meant GFP_NOWAIT. Otherwise it's the __GFP_NOWARN | > __GFP_NORETRY I used in this series, and it probably doesn't try hard > enough as pointed out by you and Christian. > > > > > 1b Userspace needs to guesstimate a good guess for how much we'll need. I'm > > hoping that between render target size and maybe counting the total > > amounts of vertices we can do a decent guesstimate for many workloads. > > There are extra parameters to take into account, like the tile > hierarchy mask (number of binning lists instantiated) and probably > other things I forget, but for simple vertex+fragment pipelines and > direct draws, guessing the worst memory usage case is probably doable. > Throw indirect draws into the mix, and it suddenly becomes a lot more > complicated. Not even talking about GEOM/TESS stages, which makes the > guessing even harder AFAICT. > > > Note that goal here is not to ensure success, but just to get the rough > > ballpark. The actual starting number here should aim fairly low, so that > > we avoid wasting memory since this is memory wasted on every context > > (that uses a feature which needs dynamic memory allocation, which I > > guess for pan* is everything, but for apple it would be more limited). > > Ack. > > > > > 1c The kernel then keeps an additional global memory pool. Note this would > > not have the same semantics as mempool.h, which is aimed GFP_NOIO > > forward progress guarantees, but more as a preallocation pool. In every > > CS ioctl we'll make sure the pool is filled, and we probably want to > > size the pool relative to the context with the biggest dynamic memory > > usage. So probably this thing needs a shrinker, so we can reclaim it > > when you don't run an app with a huge buffer need on the gpu anymore. > > Okay, that's a technique Arm has been using in their downstream driver > (it named JIT-allocation there). > > > > > Note that we're still not sizing this to guarantee success, but together > > with the userspace heuristics it should be big enough to almost always > > work out. And since it's global reserve we can afford to waste a bit > > more memory on this one. We might also want to scale this pool by the > > total memory available, like the watermarks core mm computes. We'll only > > hang onto this memory when the gpu is in active usage, so this should be > > fine. > > Sounds like a good idea. > > > > > Also the preallocation would need to happen without holding the memory > > pool look, so that we can use GFP_KERNEL. > > > > Up to this point I think it's all tricks that panthor also wants to > > employ. > > > > 1d Next up is scratch dynamic memory. If we can assume that the memory does > > not need to survive a batchbuffer (hopefully the case with vulkan render > > pass) we could steal such memory from other contexts. 
We could even do > > that for contexts which are queued but not yet running on the hardware > > (might need unloading them to be doable with fw renderers like > > panthor/CSF) as long as we keep such stolen dynamic memory on a separate > > free list. Because if we'd entirely free this, or release it into the > > memory pool we'll make things worse for these other contexts, we need to > > be able to guarantee that any context can always get all the stolen > > dynamic pages back before we start running it on the gpu. > > Actually, CSF stands in the way of re-allocating memory to other > contexts, because once we've allocated memory to a tiler heap, the FW > manages this pool of chunks, and recycles them. Mesa can intercept > the "returned chunks" and collect those chunks instead of re-assiging > then to the tiler heap through a CS instruction (which goes thought > the FW internallu), but that involves extra collaboration between the > UMD, KMD and FW which we don't have at the moment. Not saying never, > but I'd rather fix things gradually (first the blocking alloc in the > fence-signalling path, then the optimization to share the extra mem > reservation cost among contexts by returning the chunks to the global > kernel pool rather than directly to the heap). The additional issue with borrowing memory from idle contexts is that it will involve MMU operations, as we will have to move the memory into the active context address space. CSF GPUs have a limitation that they can only work with one address space for the active job when it comes to memory used internally by the job, so we either have to map the scratch dynamic memory in all the jobs before we submit them, or we will have to do MMU maintenance operations in the OOM path in order to borrow memory from other contexts. Overall I think the design could work, and as Boris pointed out, Arm already has something similar in the Just In Time memory pool design; it's just that we need to play by different rules upstream, as the downstream driver bypasses the DRM framework and so has fewer rules to abide by. Best regards, Liviu > > This approach should work fine with JM GPUs where the tiler heap is > entirely managed by the KMD though. > > > > > Since the tracking for the memory pool in 1c and the stolen memory in 1d > > has a bunch of tricky corner cases we probably want that as a drm helper > > which globally does that book-keeping for all drm/sched instances. That > > might even help if some NN accel thing also needs this on the same SoC.
> > > > > The above will need quite some tuning to make sure it works as often as > > possible, without wasting undue amounts of memory. It's classic memory > > overcommit tuning. > > > > 2. Device Lost > > -------------- > > > > At this point we're left with no other choice than to kill the context. > > And userspace should be able to cope with VK_DEVICE_LOST (hopefully zink > > does), but it will probably not cope well with an entire strom of these > > just to get the first frame out. > > > > Here comes the horrible trick: > > > > We'll keep rendering the entire frame by just smashing one single reserve > > page (per context) into the pte every time there's a fault. It will result > > in total garbage, and we probably want to shot the context the moment the > > VS stages have finished, but it allows us to collect an accurate estimate > > of how much memory we'd have needed. We need to pass that to the vulkan > > driver as part of the device lost processing, so that it can keep that as > > the starting point for the userspace dynamic memory requirement > > guesstimate as a lower bound. Together with the (scaled to that > > requirement) gpu driver memory pool and the core mm watermarks, that > > should allow us to not hit a device lost again hopefully. > > If the intent is to progressively increase the size of this > global memory pool, and the min-resident-size of the growable buffer > that triggered the OOM, I guess we can just go for something dumb like > double-the-size or increment-it-in-steps and flag the buffer on which > the fault happened such that userspace can query that at DEVICE_LOST > time and properly resize this internal buffer next time a > CreateDevice() coming from the same process/thread is created. I'm not > sure we need to actually execute the shader after the OOM to accurately > guess the size. > > > > > I think if the CS ioctl both takes the current guesstimate from userspace, > > and passes back whatever the current maximum known to the kernel is (we > > need that anyway for the steps in stage 1 to make sense I think), then > > that should also work for dead context where the CS ioctl returns -EIO or > > something like that. > > > > 3. Mesa quirks > > -------------- > > > > A database of the fixed dynamic memory requirements we get from step 2 > > (through bug reports), so that we can avoid that mess going forward. > > Yeah, we already have a dri-conf to force the initial size of the tiler > heap on CSF hardware, so we could easily ship a DB of per-app initial > size. > > > > > If there's too many of those, we'll probably need to improve the > > guesstimation 1b if it's systematically off by too much. We might even > > need to store that on-disk, like shader caches, so that you get a crash > > once and then it should work even without an explicit mesa quirk. > > Makes sense. Maybe some sort of implicit dri-conf DB that overloads the > explicit one? > > > > > Documentation > > ------------- > > > > Assuming this is not too terrible an idea and we reach some consensus, I > > think it'd be good to bake this into a doc patch somewhere in the > > dma_fence documentation. Also including Alyssa recipe for when you have > > hardware support for flushing partial rendering. And maybe generalized so > > it makes sense for the GS/streamout/... fun on apple hardware. > > I think the tricky part is properly guessing the amount of memory > needed for a draw. 
For simple vertex+fragment pipelines with direct > draws, that's doable, but for more complex stuff like GEOM/TESS and > indirect draws, that gets tricky, because some of the parameters are > not known at CS recording time. So my gut feeling is that, rather than > trying to find a smart way to guesstimate those stuff, we should rely > on simple heuristics like the > double-the-resident-size-on-the-faulty-sparse-GEM thing suggested by > Alyssa. > > Many thanks to all the people who chimed in. I'll wait until the > discussion settles on something that works for everyone and try to > sum-up the outcome of this discussion in some > drm-tile-based-renderer.rst doc file (or any other file name you think > this should be put in) so we have a clear plan of what we want to > do/how we want it done. > > Regards, > > Boris
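As a rough illustration of how the 1c reserve pool (and the downstream JIT-style pool Liviu and Boris mention) could translate upstream, here is a sketch of the submit-time refill side, with hypothetical names. Refilling happens in the CS ioctl where GFP_KERNEL is fine and the pool lock is not held across allocations; the fault handler only ever takes pages out, and a shrinker could walk the same list to give the memory back when the GPU goes idle.

```c
#include <linux/gfp.h>
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/spinlock.h>

struct heap_page_pool {
	spinlock_t lock;
	struct list_head pages;
	unsigned int count;
	unsigned int target;	/* sized after the biggest dynamic-memory user */
};

/* Called from the CS/submit ioctl, where sleeping allocations are allowed. */
static int heap_pool_fill(struct heap_page_pool *pool)
{
	for (;;) {
		struct page *page;
		bool full;

		spin_lock(&pool->lock);
		full = pool->count >= pool->target;
		spin_unlock(&pool->lock);
		if (full)
			return 0;

		/* Allocate outside the lock so GFP_KERNEL can block safely. */
		page = alloc_page(GFP_KERNEL);
		if (!page)
			return -ENOMEM;

		spin_lock(&pool->lock);
		list_add(&page->lru, &pool->pages);
		pool->count++;
		spin_unlock(&pool->lock);
	}
}
```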
On Mon, 14 Apr 2025 09:03:55 -0400 Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote: > > Actually, CSF stands in the way of re-allocating memory to other > > contexts, because once we've allocated memory to a tiler heap, the FW > > manages this pool of chunks, and recycles them. Mesa can intercept > > the "returned chunks" and collect those chunks instead of re-assiging > > then to the tiler heap through a CS instruction (which goes thought > > the FW internallu), but that involves extra collaboration between the > > UMD, KMD and FW which we don't have at the moment. Not saying never, > > but I'd rather fix things gradually (first the blocking alloc in the > > fence-signalling path, then the optimization to share the extra mem > > reservation cost among contexts by returning the chunks to the global > > kernel pool rather than directly to the heap). > > > > This approach should work fine with JM GPUs where the tiler heap is > > entirely managed by the KMD though. > > I really think CSF should be relying on the simple heuristics with > incremental-rendering, unless you can prove that's actually a > performance issue in practice. (On Imagination/Apple parts, it almost > never is and we rely entirely on this approach. It's ok - it really is. > For simple 2D workloads, the initial heap allocation is fine. For 3D > scenes, we need very few frames to get the right size. this doesn't > cause stutters in practice.) Yep, I agree, hence the "let's try the simple thing first and let's see if we actually need the more complex stuff later". My hope is that we'll never need it, but I hate to make definitive statements, because it usually bites me back when I do :P. > > For JM .. yes, this discussion remains relevant of course. I'm still trying to see if we can emulate/have incremental-rendering on JM hardware, so it really becomes a Lima-only issue. According to Erik, predicting how much heap is needed is much more predictable on Utgard (no indirect draws, simpler binning hierarchy, and other details he mentioned which I forgot).
> I'm still trying to see if we can emulate/have incremental-rendering on > JM hardware I guess since we don't advertise vertex shader side effects... Lol, maybe that could work...
On Mon, Apr 14, 2025 at 02:08:25PM +0100, Liviu Dudau wrote: > On Mon, Apr 14, 2025 at 01:22:06PM +0200, Boris Brezillon wrote: > > Hi Sima, > > > > On Fri, 11 Apr 2025 14:01:16 +0200 > > Simona Vetter <simona.vetter@ffwll.ch> wrote: > > > > > On Thu, Apr 10, 2025 at 08:41:55PM +0200, Boris Brezillon wrote: > > > > On Thu, 10 Apr 2025 14:01:03 -0400 > > > > Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote: > > > > > > > > > > > > In Panfrost and Lima, we don't have this concept of "incremental > > > > > > > > rendering", so when we fail the allocation, we just fail the GPU job > > > > > > > > with an unhandled GPU fault. > > > > > > > > > > > > > > To be honest I think that this is enough to mark those two drivers as > > > > > > > broken. It's documented that this approach is a no-go for upstream > > > > > > > drivers. > > > > > > > > > > > > > > How widely is that used? > > > > > > > > > > > > It exists in lima and panfrost, and I wouldn't be surprised if a similar > > > > > > mechanism was used in other drivers for tiler-based GPUs (etnaviv, > > > > > > freedreno, powervr, ...), because ultimately that's how tilers work: > > > > > > the amount of memory needed to store per-tile primitives (and metadata) > > > > > > depends on what the geometry pipeline feeds the tiler with, and that > > > > > > can't be predicted. If you over-provision, that's memory the system won't > > > > > > be able to use while rendering takes place, even though only a small > > > > > > portion might actually be used by the GPU. If your allocation is too > > > > > > small, it will either trigger a GPU fault (for HW not supporting an > > > > > > "incremental rendering" mode) or under-perform (because flushing > > > > > > primitives has a huge cost on tilers). > > > > > > > > > > Yes and no. > > > > > > > > > > Although we can't allocate more memory for /this/ frame, we know the > > > > > required size is probably constant across its lifetime. That gives a > > > > > simple heuristic to manage the tiler heap efficiently without > > > > > allocations - even fallible ones - in the fence signal path: > > > > > > > > > > * Start with a small fixed size tiler heap > > > > > * Try to render, let incremental rendering kick in when it's too small. > > > > > * When cleaning up the job, check if we used incremental rendering. > > > > > * If we did - double the size of the heap the next time we submit work. > > > > > > > > > > The tiler heap still grows dynamically - it just does so over the span > > > > > of a couple frames. In practice that means a tiny hit to startup time as > > > > > we dynamically figure out the right size, incurring extra flushing at > > > > > the start, without needing any "grow-on-page-fault" heroics. > > > > > > > > > > This should solve the problem completely for CSF/panthor. So it's only > > > > > hardware that architecturally cannot do incremental rendering (older > > > > > Mali: panfrost/lima) where we need this mess. > > > > > > > > OTOH, if we need something > > > > for Utgard(Lima)/Midgard/Bifrost/Valhall(Panfrost), why not use the same > > > > thing for CSF, since CSF is arguably the sanest of all the HW > > > > architectures listed above: allocation can fail/be non-blocking, > > > > because there's a fallback to incremental rendering when it fails. > > > > > > So this is a really horrible idea to sort this out for panfrost hardware, > > > which doesn't have a tiler cache flush as a fallback. It's roughly three > > > stages: > > > > > > 1. 
A pile of clever tricks to make the chances of running out of memory > > > really low. Most of these also make sense for panthor platforms, just as a > > > performance optimization. > > > > > > 2. I terrible way to handle the unavoidable VK_DEVICE_LOST, but in a way > > > such that the impact should be minimal. This is nasty, and we really want > > > to avoid that for panthor. > > > > > > 3. Mesa quirks so that 2 doesn't actually ever happen in practice. > > > > > > 1. Clever tricks > > > ---------------- > > > > > > This is a cascade of tricks we can pull in the gpu fault handler: > > > > > > 1a. Allocate with GFP_NORECLAIM. We want this first because that triggers > > > background reclaim, and that might be enough to get us through and free > > > some easy caches (like clean fs cache and stuff like that which can just > > > be dropped). > > > > There's no GFP_NORECLAIM, and given the discussions we had before, I > > guess you meant GFP_NOWAIT. Otherwise it's the __GFP_NOWARN | > > __GFP_NORETRY I used in this series, and it probably doesn't try hard > > enough as pointed out by you and Christian. > > > > > > > > 1b Userspace needs to guesstimate a good guess for how much we'll need. I'm > > > hoping that between render target size and maybe counting the total > > > amounts of vertices we can do a decent guesstimate for many workloads. > > > > There are extra parameters to take into account, like the tile > > hierarchy mask (number of binning lists instantiated) and probably > > other things I forget, but for simple vertex+fragment pipelines and > > direct draws, guessing the worst memory usage case is probably doable. > > Throw indirect draws into the mix, and it suddenly becomes a lot more > > complicated. Not even talking about GEOM/TESS stages, which makes the > > guessing even harder AFAICT. > > > > > Note that goal here is not to ensure success, but just to get the rough > > > ballpark. The actual starting number here should aim fairly low, so that > > > we avoid wasting memory since this is memory wasted on every context > > > (that uses a feature which needs dynamic memory allocation, which I > > > guess for pan* is everything, but for apple it would be more limited). > > > > Ack. > > > > > > > > 1c The kernel then keeps an additional global memory pool. Note this would > > > not have the same semantics as mempool.h, which is aimed GFP_NOIO > > > forward progress guarantees, but more as a preallocation pool. In every > > > CS ioctl we'll make sure the pool is filled, and we probably want to > > > size the pool relative to the context with the biggest dynamic memory > > > usage. So probably this thing needs a shrinker, so we can reclaim it > > > when you don't run an app with a huge buffer need on the gpu anymore. > > > > Okay, that's a technique Arm has been using in their downstream driver > > (it named JIT-allocation there). > > > > > > > > Note that we're still not sizing this to guarantee success, but together > > > with the userspace heuristics it should be big enough to almost always > > > work out. And since it's global reserve we can afford to waste a bit > > > more memory on this one. We might also want to scale this pool by the > > > total memory available, like the watermarks core mm computes. We'll only > > > hang onto this memory when the gpu is in active usage, so this should be > > > fine. > > > > Sounds like a good idea. > > > > > > > > Also the preallocation would need to happen without holding the memory > > > pool look, so that we can use GFP_KERNEL. 
> > > > > > Up to this point I think it's all tricks that panthor also wants to > > > employ. > > > > > > 1d Next up is scratch dynamic memory. If we can assume that the memory does > > > not need to survive a batchbuffer (hopefully the case with vulkan render > > > pass) we could steal such memory from other contexts. We could even do > > > that for contexts which are queued but not yet running on the hardware > > > (might need unloading them to be doable with fw renderers like > > > panthor/CSF) as long as we keep such stolen dynamic memory on a separate > > > free list. Because if we'd entirely free this, or release it into the > > > memory pool we'll make things worse for these other contexts, we need to > > > be able to guarantee that any context can always get all the stolen > > > dynamic pages back before we start running it on the gpu. > > > > Actually, CSF stands in the way of re-allocating memory to other > > contexts, because once we've allocated memory to a tiler heap, the FW > > manages this pool of chunks, and recycles them. Mesa can intercept > > the "returned chunks" and collect those chunks instead of re-assiging > > then to the tiler heap through a CS instruction (which goes thought > > the FW internallu), but that involves extra collaboration between the > > UMD, KMD and FW which we don't have at the moment. Not saying never, > > but I'd rather fix things gradually (first the blocking alloc in the > > fence-signalling path, then the optimization to share the extra mem > > reservation cost among contexts by returning the chunks to the global > > kernel pool rather than directly to the heap). > > The additional issue with borrowing memory from idle contexts is that it will > involve MMU operations, as we will have to move the memory into the active > context address space. CSF GPUs have a limitation that they can only work with > one address space for the active job when it comes to memory used internally > by the job, so we either have to map the scratch dynamic memory in all the > jobs before we submit them, or we will have to do MMU maintainance operations > in the OOM path in order to borrow memory from other contexts. Hm, this could be tricky. So mmu operations shouldn't be an issue because they must work for GFP_NOFS contexts for i/o writeback. You might need to manage this much more carefully and make sure the iommu has a big enough range of pagetables preallocated. This also holds for the other games we're playing here, at least for gpu pagetables. But since pagetables are really small overhead, it might be good to preallocate them somewhat aggressively. But yeah, this is possibly an issue if you need iommu wrangling; I have honestly not looked at the exact rules there. -Sima > Overall I think the design could work and as Boris pointed out Arm already > has something similar in the Just In Time memory pool design, it's just that > we need to play with different rules in the upstream as the downstream bypasses > the DRM framework so it has fewer rules to abide by. > > Best regards, > Liviu > > > > > > This approach should work fine with JM GPUs where the tiler heap is > > entirely managed by the KMD though. > > > > > > > > Since the tracking for the memory pool in 1c and the stolen memory in 1d > > > has a bunch of tricky corner cases we probably want that as a drm helper > > > which globally does that book-keeping for all drm/sched instances. That > > > might even help if some NN accel thing also needs this on the same SoC.
> > > > Yeah, I think whatever we go for here should be exposed as generic > > helpers with proper documentation to explain when and how to use those, > > because otherwise we're back to the original situation where each > > driver copies the pattern of another driver and adjust it to its needs, > > even when the original design was unsound. > > > > > > > > 1e Finally, if all else is lost we can try GFP_ATOMIC. This will eat into > > > reserve memory pools the core mm maintains. And hopefully we've spent > > > enough time between this step and the initial GFP_NORECLAIM that > > > background reclaim had a chance to free some memory, hence why all the > > > memory pool and memory stealing tricks should be in between. > > > > Any reason not to use GFP_NOWAIT on this last attempt? > > I'm not sure I like the idea of stealing memory from the reserved > > amount of atomic mem. I mean, if we get there and there's still not > > enough memory to satisfy a "normal" allocation, it's probably a good > > time to fail with a actual OOM. > > > > > > > > The above will need quite some tuning to make sure it works as often as > > > possible, without wasting undue amounts of memory. It's classic memory > > > overcommit tuning. > > > > > > 2. Device Lost > > > -------------- > > > > > > At this point we're left with no other choice than to kill the context. > > > And userspace should be able to cope with VK_DEVICE_LOST (hopefully zink > > > does), but it will probably not cope well with an entire strom of these > > > just to get the first frame out. > > > > > > Here comes the horrible trick: > > > > > > We'll keep rendering the entire frame by just smashing one single reserve > > > page (per context) into the pte every time there's a fault. It will result > > > in total garbage, and we probably want to shot the context the moment the > > > VS stages have finished, but it allows us to collect an accurate estimate > > > of how much memory we'd have needed. We need to pass that to the vulkan > > > driver as part of the device lost processing, so that it can keep that as > > > the starting point for the userspace dynamic memory requirement > > > guesstimate as a lower bound. Together with the (scaled to that > > > requirement) gpu driver memory pool and the core mm watermarks, that > > > should allow us to not hit a device lost again hopefully. > > > > If the intent is to progressively increase the size of this > > global memory pool, and the min-resident-size of the growable buffer > > that triggered the OOM, I guess we can just go for something dumb like > > double-the-size or increment-it-in-steps and flag the buffer on which > > the fault happened such that userspace can query that at DEVICE_LOST > > time and properly resize this internal buffer next time a > > CreateDevice() coming from the same process/thread is created. I'm not > > sure we need to actually execute the shader after the OOM to accurately > > guess the size. > > > > > > > > I think if the CS ioctl both takes the current guesstimate from userspace, > > > and passes back whatever the current maximum known to the kernel is (we > > > need that anyway for the steps in stage 1 to make sense I think), then > > > that should also work for dead context where the CS ioctl returns -EIO or > > > something like that. > > > > > > 3. Mesa quirks > > > -------------- > > > > > > A database of the fixed dynamic memory requirements we get from step 2 > > > (through bug reports), so that we can avoid that mess going forward. 
> > > > Yeah, we already have a dri-conf to force the initial size of the tiler > > heap on CSF hardware, so we could easily ship a DB of per-app initial > > size. > > > > > > > > If there's too many of those, we'll probably need to improve the > > > guesstimation 1b if it's systematically off by too much. We might even > > > need to store that on-disk, like shader caches, so that you get a crash > > > once and then it should work even without an explicit mesa quirk. > > > > Makes sense. Maybe some sort of implicit dri-conf DB that overloads the > > explicit one? > > > > > > > > Documentation > > > ------------- > > > > > > Assuming this is not too terrible an idea and we reach some consensus, I > > > think it'd be good to bake this into a doc patch somewhere in the > > > dma_fence documentation. Also including Alyssa recipe for when you have > > > hardware support for flushing partial rendering. And maybe generalized so > > > it makes sense for the GS/streamout/... fun on apple hardware. > > > > I think the tricky part is properly guessing the amount of memory > > needed for a draw. For simple vertex+fragment pipelines with direct > > draws, that's doable, but for more complex stuff like GEOM/TESS and > > indirect draws, that gets tricky, because some of the parameters are > > not known at CS recording time. So my gut feeling is that, rather than > > trying to find a smart way to guesstimate those stuff, we should rely > > on simple heuristics like the > > double-the-resident-size-on-the-faulty-sparse-GEM thing suggested by > > Alyssa. > > > > Many thanks to all the people who chimed in. I'll wait until the > > discussion settles on something that works for everyone and try to > > sum-up the outcome of this discussion in some > > drm-tile-based-renderer.rst doc file (or any other file name you think > > this should be put in) so we have a clear plan of what we want to > > do/how we want it done. > > > > Regards, > > > > Boris > > -- > ==================== > | I would like to | > | fix the world, | > | but they're not | > | giving me the | > \ source code! / > --------------- > ¯\_(ツ)_/¯
On Mon, Apr 14, 2025 at 01:22:06PM +0200, Boris Brezillon wrote: > Hi Sima, > > On Fri, 11 Apr 2025 14:01:16 +0200 > Simona Vetter <simona.vetter@ffwll.ch> wrote: > > > On Thu, Apr 10, 2025 at 08:41:55PM +0200, Boris Brezillon wrote: > > > On Thu, 10 Apr 2025 14:01:03 -0400 > > > Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote: > > > > > > > > > > In Panfrost and Lima, we don't have this concept of "incremental > > > > > > > rendering", so when we fail the allocation, we just fail the GPU job > > > > > > > with an unhandled GPU fault. > > > > > > > > > > > > To be honest I think that this is enough to mark those two drivers as > > > > > > broken. It's documented that this approach is a no-go for upstream > > > > > > drivers. > > > > > > > > > > > > How widely is that used? > > > > > > > > > > It exists in lima and panfrost, and I wouldn't be surprised if a similar > > > > > mechanism was used in other drivers for tiler-based GPUs (etnaviv, > > > > > freedreno, powervr, ...), because ultimately that's how tilers work: > > > > > the amount of memory needed to store per-tile primitives (and metadata) > > > > > depends on what the geometry pipeline feeds the tiler with, and that > > > > > can't be predicted. If you over-provision, that's memory the system won't > > > > > be able to use while rendering takes place, even though only a small > > > > > portion might actually be used by the GPU. If your allocation is too > > > > > small, it will either trigger a GPU fault (for HW not supporting an > > > > > "incremental rendering" mode) or under-perform (because flushing > > > > > primitives has a huge cost on tilers). > > > > > > > > Yes and no. > > > > > > > > Although we can't allocate more memory for /this/ frame, we know the > > > > required size is probably constant across its lifetime. That gives a > > > > simple heuristic to manage the tiler heap efficiently without > > > > allocations - even fallible ones - in the fence signal path: > > > > > > > > * Start with a small fixed size tiler heap > > > > * Try to render, let incremental rendering kick in when it's too small. > > > > * When cleaning up the job, check if we used incremental rendering. > > > > * If we did - double the size of the heap the next time we submit work. > > > > > > > > The tiler heap still grows dynamically - it just does so over the span > > > > of a couple frames. In practice that means a tiny hit to startup time as > > > > we dynamically figure out the right size, incurring extra flushing at > > > > the start, without needing any "grow-on-page-fault" heroics. > > > > > > > > This should solve the problem completely for CSF/panthor. So it's only > > > > hardware that architecturally cannot do incremental rendering (older > > > > Mali: panfrost/lima) where we need this mess. > > > > > > OTOH, if we need something > > > for Utgard(Lima)/Midgard/Bifrost/Valhall(Panfrost), why not use the same > > > thing for CSF, since CSF is arguably the sanest of all the HW > > > architectures listed above: allocation can fail/be non-blocking, > > > because there's a fallback to incremental rendering when it fails. > > > > So this is a really horrible idea to sort this out for panfrost hardware, > > which doesn't have a tiler cache flush as a fallback. It's roughly three > > stages: > > > > 1. A pile of clever tricks to make the chances of running out of memory > > really low. Most of these also make sense for panthor platforms, just as a > > performance optimization. > > > > 2. 
I terrible way to handle the unavoidable VK_DEVICE_LOST, but in a way > > such that the impact should be minimal. This is nasty, and we really want > > to avoid that for panthor. > > > > 3. Mesa quirks so that 2 doesn't actually ever happen in practice. > > > > 1. Clever tricks > > ---------------- > > > > This is a cascade of tricks we can pull in the gpu fault handler: > > > > 1a. Allocate with GFP_NORECLAIM. We want this first because that triggers > > background reclaim, and that might be enough to get us through and free > > some easy caches (like clean fs cache and stuff like that which can just > > be dropped). > > There's no GFP_NORECLAIM, and given the discussions we had before, I > guess you meant GFP_NOWAIT. Otherwise it's the __GFP_NOWARN | > __GFP_NORETRY I used in this series, and it probably doesn't try hard > enough as pointed out by you and Christian. Yeah it's GFP_NOWAIT. What I mean with GFP_NORECLAIM is !GFP_DIRECT_RECLAIM, which is the nasty one that's not allowed in dma_fence signalling critical paths. > > 1b Userspace needs to guesstimate a good guess for how much we'll need. I'm > > hoping that between render target size and maybe counting the total > > amounts of vertices we can do a decent guesstimate for many workloads. > > There are extra parameters to take into account, like the tile > hierarchy mask (number of binning lists instantiated) and probably > other things I forget, but for simple vertex+fragment pipelines and > direct draws, guessing the worst memory usage case is probably doable. > Throw indirect draws into the mix, and it suddenly becomes a lot more > complicated. Not even talking about GEOM/TESS stages, which makes the > guessing even harder AFAICT. The point here is not to make a pessimistic guess, but one that's roughly a good ballpark for most games/workloads, so that we don't eat too much into the various emergency watermarks. > > Note that goal here is not to ensure success, but just to get the rough > > ballpark. The actual starting number here should aim fairly low, so that > > we avoid wasting memory since this is memory wasted on every context > > (that uses a feature which needs dynamic memory allocation, which I > > guess for pan* is everything, but for apple it would be more limited). > > Ack. > > > > > 1c The kernel then keeps an additional global memory pool. Note this would > > not have the same semantics as mempool.h, which is aimed GFP_NOIO > > forward progress guarantees, but more as a preallocation pool. In every > > CS ioctl we'll make sure the pool is filled, and we probably want to > > size the pool relative to the context with the biggest dynamic memory > > usage. So probably this thing needs a shrinker, so we can reclaim it > > when you don't run an app with a huge buffer need on the gpu anymore. > > Okay, that's a technique Arm has been using in their downstream driver > (it named JIT-allocation there). > > > > > Note that we're still not sizing this to guarantee success, but together > > with the userspace heuristics it should be big enough to almost always > > work out. And since it's global reserve we can afford to waste a bit > > more memory on this one. We might also want to scale this pool by the > > total memory available, like the watermarks core mm computes. We'll only > > hang onto this memory when the gpu is in active usage, so this should be > > fine. > > Sounds like a good idea. > > > > > Also the preallocation would need to happen without holding the memory > > pool look, so that we can use GFP_KERNEL. 
> > > > Up to this point I think it's all tricks that panthor also wants to > > employ. > > > > 1d Next up is scratch dynamic memory. If we can assume that the memory does > > not need to survive a batchbuffer (hopefully the case with vulkan render > > pass) we could steal such memory from other contexts. We could even do > > that for contexts which are queued but not yet running on the hardware > > (might need unloading them to be doable with fw renderers like > > panthor/CSF) as long as we keep such stolen dynamic memory on a separate > > free list. Because if we'd entirely free this, or release it into the > > memory pool we'll make things worse for these other contexts, we need to > > be able to guarantee that any context can always get all the stolen > > dynamic pages back before we start running it on the gpu. > > Actually, CSF stands in the way of re-allocating memory to other > contexts, because once we've allocated memory to a tiler heap, the FW > manages this pool of chunks, and recycles them. Mesa can intercept > the "returned chunks" and collect those chunks instead of re-assiging > then to the tiler heap through a CS instruction (which goes thought > the FW internallu), but that involves extra collaboration between the > UMD, KMD and FW which we don't have at the moment. Not saying never, > but I'd rather fix things gradually (first the blocking alloc in the > fence-signalling path, then the optimization to share the extra mem > reservation cost among contexts by returning the chunks to the global > kernel pool rather than directly to the heap). > > This approach should work fine with JM GPUs where the tiler heap is > entirely managed by the KMD though. Yeah, stealing scratch memory is probably way too expensive if you can just flush. And if the fw already does that, there's no point. But for hw where you cannot flush it might just be enough to get you out of a bind. > > Since the tracking for the memory pool in 1c and the stolen memory in 1d > > has a bunch of tricky corner cases we probably want that as a drm helper > > which globally does that book-keeping for all drm/sched instances. That > > might even help if some NN accel thing also needs this on the same SoC. > > Yeah, I think whatever we go for here should be exposed as generic > helpers with proper documentation to explain when and how to use those, > because otherwise we're back to the original situation where each > driver copies the pattern of another driver and adjust it to its needs, > even when the original design was unsound. > > > > 1e Finally, if all else is lost we can try GFP_ATOMIC. This will eat into > > reserve memory pools the core mm maintains. And hopefully we've spent > > enough time between this step and the initial GFP_NORECLAIM that > > background reclaim had a chance to free some memory, hence why all the > > memory pool and memory stealing tricks should be in between. > > Any reason not to use GFP_NOWAIT on this last attempt? > I'm not sure I like the idea of stealing memory from the reserved > amount of atomic mem. I mean, if we get there and there's still not > enough memory to satisfy a "normal" allocation, it's probably a good > time to fail with a actual OOM. GFP_ATOMIC includes __GFP_HIGH, which allows you to dip into emergency watermarks (meaning more memory reserves before this fails). We might actually want GFP_ATOMIC | __GFP_NOWARN though. If you just try once more with GFP_NOWAIT then there's not really much of a point.
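Spelled out, the difference between the first and last rung of the cascade is just the gfp mask; a small sketch assuming the flags suggested here (the macro names are made up):

```c
#include <linux/gfp.h>

/* Step 1a: never sleep, wake kswapd, fail silently so we can fall back. */
#define HEAP_GFP_FIRST_TRY	(GFP_NOWAIT | __GFP_NOWARN)

/*
 * Step 1e: still non-blocking, but __GFP_HIGH (implied by GFP_ATOMIC) lets
 * the allocation dip below the normal watermarks before it fails.
 */
#define HEAP_GFP_LAST_RESORT	(GFP_ATOMIC | __GFP_NOWARN)
```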
> > The above will need quite some tuning to make sure it works as often as > > possible, without wasting undue amounts of memory. It's classic memory > > overcommit tuning. > > > > 2. Device Lost > > -------------- > > > > At this point we're left with no other choice than to kill the context. > > And userspace should be able to cope with VK_DEVICE_LOST (hopefully zink > > does), but it will probably not cope well with an entire strom of these > > just to get the first frame out. > > > > Here comes the horrible trick: > > > > We'll keep rendering the entire frame by just smashing one single reserve > > page (per context) into the pte every time there's a fault. It will result > > in total garbage, and we probably want to shot the context the moment the > > VS stages have finished, but it allows us to collect an accurate estimate > > of how much memory we'd have needed. We need to pass that to the vulkan > > driver as part of the device lost processing, so that it can keep that as > > the starting point for the userspace dynamic memory requirement > > guesstimate as a lower bound. Together with the (scaled to that > > requirement) gpu driver memory pool and the core mm watermarks, that > > should allow us to not hit a device lost again hopefully. > > If the intent is to progressively increase the size of this > global memory pool, and the min-resident-size of the growable buffer > that triggered the OOM, I guess we can just go for something dumb like > double-the-size or increment-it-in-steps and flag the buffer on which > the fault happened such that userspace can query that at DEVICE_LOST > time and properly resize this internal buffer next time a > CreateDevice() coming from the same process/thread is created. I'm not > sure we need to actually execute the shader after the OOM to accurately > guess the size. The point is to not incrementally increase, but get it right with just one crash. I think incrementally increasing through a DEVICE_LOST is a no-go design. Or at least I think we should aim better. > > I think if the CS ioctl both takes the current guesstimate from userspace, > > and passes back whatever the current maximum known to the kernel is (we > > need that anyway for the steps in stage 1 to make sense I think), then > > that should also work for dead context where the CS ioctl returns -EIO or > > something like that. > > > > 3. Mesa quirks > > -------------- > > > > A database of the fixed dynamic memory requirements we get from step 2 > > (through bug reports), so that we can avoid that mess going forward. > > Yeah, we already have a dri-conf to force the initial size of the tiler > heap on CSF hardware, so we could easily ship a DB of per-app initial > size. > > > > > If there's too many of those, we'll probably need to improve the > > guesstimation 1b if it's systematically off by too much. We might even > > need to store that on-disk, like shader caches, so that you get a crash > > once and then it should work even without an explicit mesa quirk. > > Makes sense. Maybe some sort of implicit dri-conf DB that overloads the > explicit one? Yeah, maybe something like that. And then take the maximum of both. > > Documentation > > ------------- > > > > Assuming this is not too terrible an idea and we reach some consensus, I > > think it'd be good to bake this into a doc patch somewhere in the > > dma_fence documentation. Also including Alyssa recipe for when you have > > hardware support for flushing partial rendering. 
And maybe generalized so > > it makes sense for the GS/streamout/... fun on apple hardware. > > I think the tricky part is properly guessing the amount of memory > needed for a draw. For simple vertex+fragment pipelines with direct > draws, that's doable, but for more complex stuff like GEOM/TESS and > indirect draws, that gets tricky, because some of the parameters are > not known at CS recording time. So my gut feeling is that, rather than > trying to find a smart way to guesstimate those stuff, we should rely > on simple heuristics like the > double-the-resident-size-on-the-faulty-sparse-GEM thing suggested by > Alyssa. The point is not to guess correctly, but to guess at an 80/99% confidence interval. Memory overcommit is a statistical game, not a correctness game. So for the guesstimates userspace does, you essentially take a few big-impact inputs: the tiler layout or whatever really has a big impact (which probably includes framebuffer size), draw sizes, and whether GS/streamout/whatever else with a big impact is used. Then gather a _lot_ of data from different games and make a linear regression model that captures like 80% (wild guess, no idea whether that's good) of games for the 1b estimate, and 99% of everything you have for the kernel pool (maybe a bit less, depending on how much the emergency pools from core mm provide). And then probably a different estimate to add a notch more when you have data from step 2. It's not about perfect estimates here at all, just heuristics. > Many thanks to all the people who chimed in. I'll wait until the > discussion settles on something that works for everyone and try to > sum-up the outcome of this discussion in some > drm-tile-based-renderer.rst doc file (or any other file name you think > this should be put in) so we have a clear plan of what we want to > do/how we want it done. Yeah, sounds like we have a plan. Cheers, Sima
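A sketch of what the fitted userspace estimator could eventually look like. Nothing here is a real model: the inputs and coefficients are placeholders standing in for whatever a regression over real traces would produce; only the overall shape (a few big-impact features feeding a cheap linear estimate) is the point.

```c
#include <stdbool.h>
#include <stdint.h>

struct heap_estimate_inputs {
	uint32_t fb_width, fb_height;	/* render target size */
	uint32_t hierarchy_levels;	/* binning hierarchy levels enabled */
	uint64_t vertex_count;		/* rough per-frame vertex count */
	bool uses_geom_or_tess;		/* big-impact feature flag */
};

/* Placeholder coefficients: these would come from fitting real traces. */
#define EST_BASE_BYTES		(1ull << 20)
#define EST_BYTES_PER_PIXEL	2
#define EST_BYTES_PER_VERTEX	32
#define EST_GEOM_TESS_FACTOR	4

static uint64_t tiler_heap_estimate(const struct heap_estimate_inputs *in)
{
	uint64_t pixels = (uint64_t)in->fb_width * in->fb_height;
	uint64_t est = EST_BASE_BYTES;

	est += pixels * EST_BYTES_PER_PIXEL * in->hierarchy_levels;
	est += in->vertex_count * EST_BYTES_PER_VERTEX;
	if (in->uses_geom_or_tess)
		est *= EST_GEOM_TESS_FACTOR;

	return est;
}
```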
On Mon, 14 Apr 2025 16:34:47 +0200 Simona Vetter <simona.vetter@ffwll.ch> wrote: > On Mon, Apr 14, 2025 at 02:08:25PM +0100, Liviu Dudau wrote: > > On Mon, Apr 14, 2025 at 01:22:06PM +0200, Boris Brezillon wrote: > > > Hi Sima, > > > > > > On Fri, 11 Apr 2025 14:01:16 +0200 > > > Simona Vetter <simona.vetter@ffwll.ch> wrote: > > > > > > > On Thu, Apr 10, 2025 at 08:41:55PM +0200, Boris Brezillon wrote: > > > > > On Thu, 10 Apr 2025 14:01:03 -0400 > > > > > Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote: > > > > > > > > > > > > > > In Panfrost and Lima, we don't have this concept of "incremental > > > > > > > > > rendering", so when we fail the allocation, we just fail the GPU job > > > > > > > > > with an unhandled GPU fault. > > > > > > > > > > > > > > > > To be honest I think that this is enough to mark those two drivers as > > > > > > > > broken. It's documented that this approach is a no-go for upstream > > > > > > > > drivers. > > > > > > > > > > > > > > > > How widely is that used? > > > > > > > > > > > > > > It exists in lima and panfrost, and I wouldn't be surprised if a similar > > > > > > > mechanism was used in other drivers for tiler-based GPUs (etnaviv, > > > > > > > freedreno, powervr, ...), because ultimately that's how tilers work: > > > > > > > the amount of memory needed to store per-tile primitives (and metadata) > > > > > > > depends on what the geometry pipeline feeds the tiler with, and that > > > > > > > can't be predicted. If you over-provision, that's memory the system won't > > > > > > > be able to use while rendering takes place, even though only a small > > > > > > > portion might actually be used by the GPU. If your allocation is too > > > > > > > small, it will either trigger a GPU fault (for HW not supporting an > > > > > > > "incremental rendering" mode) or under-perform (because flushing > > > > > > > primitives has a huge cost on tilers). > > > > > > > > > > > > Yes and no. > > > > > > > > > > > > Although we can't allocate more memory for /this/ frame, we know the > > > > > > required size is probably constant across its lifetime. That gives a > > > > > > simple heuristic to manage the tiler heap efficiently without > > > > > > allocations - even fallible ones - in the fence signal path: > > > > > > > > > > > > * Start with a small fixed size tiler heap > > > > > > * Try to render, let incremental rendering kick in when it's too small. > > > > > > * When cleaning up the job, check if we used incremental rendering. > > > > > > * If we did - double the size of the heap the next time we submit work. > > > > > > > > > > > > The tiler heap still grows dynamically - it just does so over the span > > > > > > of a couple frames. In practice that means a tiny hit to startup time as > > > > > > we dynamically figure out the right size, incurring extra flushing at > > > > > > the start, without needing any "grow-on-page-fault" heroics. > > > > > > > > > > > > This should solve the problem completely for CSF/panthor. So it's only > > > > > > hardware that architecturally cannot do incremental rendering (older > > > > > > Mali: panfrost/lima) where we need this mess. > > > > > > > > > > OTOH, if we need something > > > > > for Utgard(Lima)/Midgard/Bifrost/Valhall(Panfrost), why not use the same > > > > > thing for CSF, since CSF is arguably the sanest of all the HW > > > > > architectures listed above: allocation can fail/be non-blocking, > > > > > because there's a fallback to incremental rendering when it fails. 
> > > > > > > > So this is a really horrible idea to sort this out for panfrost hardware, > > > > which doesn't have a tiler cache flush as a fallback. It's roughly three > > > > stages: > > > > > > > > 1. A pile of clever tricks to make the chances of running out of memory > > > > really low. Most of these also make sense for panthor platforms, just as a > > > > performance optimization. > > > > > > > > 2. A terrible way to handle the unavoidable VK_DEVICE_LOST, but in a way > > > > such that the impact should be minimal. This is nasty, and we really want > > > > to avoid that for panthor. > > > > > > > > 3. Mesa quirks so that 2 doesn't actually ever happen in practice. > > > > > > > > 1. Clever tricks > > > > ---------------- > > > > > > > > This is a cascade of tricks we can pull in the gpu fault handler: > > > > > > > > 1a. Allocate with GFP_NORECLAIM. We want this first because that triggers > > > > background reclaim, and that might be enough to get us through and free > > > > some easy caches (like clean fs cache and stuff like that which can just > > > > be dropped). > > > > > > There's no GFP_NORECLAIM, and given the discussions we had before, I > > > guess you meant GFP_NOWAIT. Otherwise it's the __GFP_NOWARN | > > > __GFP_NORETRY I used in this series, and it probably doesn't try hard > > > enough as pointed out by you and Christian. > > > > > > > > > > > 1b Userspace needs to come up with a good guess for how much we'll need. I'm > > > > hoping that between render target size and maybe counting the total > > > > amount of vertices we can do a decent guesstimate for many workloads. > > > > > > There are extra parameters to take into account, like the tile > > > hierarchy mask (number of binning lists instantiated) and probably > > > other things I forget, but for simple vertex+fragment pipelines and > > > direct draws, guessing the worst memory usage case is probably doable. > > > Throw indirect draws into the mix, and it suddenly becomes a lot more > > > complicated. Not even talking about GEOM/TESS stages, which makes the > > > guessing even harder AFAICT. > > > > > > > Note that the goal here is not to ensure success, but just to get the rough > > > > ballpark. The actual starting number here should aim fairly low, so that > > > > we avoid wasting memory since this is memory wasted on every context > > > > (that uses a feature which needs dynamic memory allocation, which I > > > > guess for pan* is everything, but for apple it would be more limited). > > > > > > Ack. > > > > > > > > > > > 1c The kernel then keeps an additional global memory pool. Note this would > > > > not have the same semantics as mempool.h, which is aimed at GFP_NOIO > > > > forward progress guarantees, but more as a preallocation pool. In every > > > > CS ioctl we'll make sure the pool is filled, and we probably want to > > > > size the pool relative to the context with the biggest dynamic memory > > > > usage. So probably this thing needs a shrinker, so we can reclaim it > > > > when you don't run an app with a huge buffer need on the gpu anymore. > > > > > > Okay, that's a technique Arm has been using in their downstream driver > > > (it's named JIT allocation there). > > > > > > > > > > > Note that we're still not sizing this to guarantee success, but together > > > > with the userspace heuristics it should be big enough to almost always > > > > work out. And since it's a global reserve we can afford to waste a bit > > > > more memory on this one. 
We might also want to scale this pool by the > > > > total memory available, like the watermarks core mm computes. We'll only > > > > hang onto this memory when the gpu is in active usage, so this should be > > > > fine. > > > > > > Sounds like a good idea. > > > > > > > > > > > Also the preallocation would need to happen without holding the memory > > > > pool lock, so that we can use GFP_KERNEL. > > > > > > > > Up to this point I think it's all tricks that panthor also wants to > > > > employ. > > > > > > > > 1d Next up is scratch dynamic memory. If we can assume that the memory does > > > > not need to survive a batchbuffer (hopefully the case with vulkan render > > > > pass) we could steal such memory from other contexts. We could even do > > > > that for contexts which are queued but not yet running on the hardware > > > > (might need unloading them to be doable with fw renderers like > > > > panthor/CSF) as long as we keep such stolen dynamic memory on a separate > > > > free list. Because if we'd entirely free this, or release it into the > > > > memory pool we'll make things worse for these other contexts, we need to > > > > be able to guarantee that any context can always get all the stolen > > > > dynamic pages back before we start running it on the gpu. > > > > > > Actually, CSF stands in the way of re-allocating memory to other > > > contexts, because once we've allocated memory to a tiler heap, the FW > > > manages this pool of chunks, and recycles them. Mesa can intercept > > > the "returned chunks" and collect those chunks instead of re-assigning > > > them to the tiler heap through a CS instruction (which goes through > > > the FW internally), but that involves extra collaboration between the > > > UMD, KMD and FW which we don't have at the moment. Not saying never, > > > but I'd rather fix things gradually (first the blocking alloc in the > > > fence-signalling path, then the optimization to share the extra mem > > > reservation cost among contexts by returning the chunks to the global > > > kernel pool rather than directly to the heap). > > > > The additional issue with borrowing memory from idle contexts is that it will > > involve MMU operations, as we will have to move the memory into the active > > context address space. CSF GPUs have a limitation that they can only work with > > one address space for the active job when it comes to memory used internally > > by the job, so we either have to map the scratch dynamic memory in all the > > jobs before we submit them, or we will have to do MMU maintenance operations > > in the OOM path in order to borrow memory from other contexts. > > Hm, this could be tricky. So mmu operations shouldn't be an issue because > they must work for GFP_NOFS contexts for i/o writeback. You might need to > manage this much more carefully and make sure the iommu has a big enough > range of pagetables preallocated. This also holds for the other games > we're playing here, at least for gpu pagetables. But since pagetables are > really small overhead it might be good to somewhat aggressively > preallocate them. > > But yeah this is possibly an issue if you need iommu wrangling, I > have honestly not looked at the exact rules in there. We already have a mechanism to pre-allocate page tables in panthor (for async VM_BIND requests where we're not allowed to allocate in the run_job() path), but as said before, I probably won't try this global mem pool thing on Panthor, since Panthor can do without it for now. 
The page table pre-allocation mechanism is something we can easily transpose to panfrost though.
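To illustrate how tricks 1a and 1c above could fit together on the kernel side, here is a hedged sketch. None of this is an existing DRM or panfrost/panthor API: the structure and function names are invented, and pool sizing and shrinker integration are left out.

    #include <linux/gfp.h>
    #include <linux/list.h>
    #include <linux/mm.h>
    #include <linux/spinlock.h>

    /* Hypothetical per-device reserve, refilled outside the fault path. */
    struct gpu_page_pool {
            spinlock_t lock;
            struct list_head pages; /* free pages, linked through page->lru */
            unsigned int count;
            unsigned int target;    /* sized after the biggest heap user */
    };

    /* Called from the submit ioctl: sleeping is allowed, GFP_KERNEL is fine. */
    static int gpu_page_pool_refill(struct gpu_page_pool *pool)
    {
            while (pool->count < pool->target) {
                    struct page *p = alloc_page(GFP_KERNEL);

                    if (!p)
                            return -ENOMEM;

                    spin_lock(&pool->lock);
                    list_add(&p->lru, &pool->pages);
                    pool->count++;
                    spin_unlock(&pool->lock);
            }
            return 0;
    }

    /* Called from the GPU fault handler: must never block. */
    static struct page *gpu_page_pool_get(struct gpu_page_pool *pool)
    {
            /* Trick 1a: wake background reclaim, but never sleep. */
            struct page *p = alloc_page(GFP_NOWAIT | __GFP_NOWARN);

            if (p)
                    return p;

            /* Trick 1c: fall back to the preallocated reserve. */
            spin_lock(&pool->lock);
            p = list_first_entry_or_null(&pool->pages, struct page, lru);
            if (p) {
                    list_del(&p->lru);
                    pool->count--;
            }
            spin_unlock(&pool->lock);
            return p;
    }

The property that matters is that the fault path only ever takes a spinlock and a non-blocking allocation, while every sleeping allocation happens at submission time where blocking is acceptable.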
On 14/04/2025 13:47, Boris Brezillon wrote: > On Fri, 11 Apr 2025 16:39:02 +0200 > Boris Brezillon <boris.brezillon@collabora.com> wrote: > >> On Fri, 11 Apr 2025 15:13:26 +0200 >> Christian König <christian.koenig@amd.com> wrote: >> >>>> >>>>> Background is that you don't get a crash, nor error message, nor >>>>> anything indicating what is happening. >>>> The job times out at some point, but we might get stuck in the fault >>>> handler waiting for memory, which is pretty close to a deadlock, I >>>> suspect. >>> >>> I don't know those drivers that well, but at least for amdgpu the >>> problem would be that the timeout handling would need to grab some of >>> the locks the memory management is holding waiting for the timeout >>> handling to do something.... >>> >>> So that basically perfectly closes the circle. With a bit of lock you >>> get a message after some time that the kernel is stuck, but since >>> that are all sleeping locks I strongly doubt so. >>> >>> As immediately action please provide patches which changes those >>> GFP_KERNEL into GFP_NOWAIT. >> >> Sure, I can do that. > > Hm, I might have been too prompt at claiming this was doable. In > practice, doing that might regress Lima and Panfrost in situations > where trying harder than GFP_NOWAIT would free up some memory. Not > saying this was right to use GFP_KERNEL in the first place, but some > expectations were set by this original mistake, so I'll probably need > Lima developers to vouch in for this change after they've done some > testing on a system under high memory pressure, and I'd need to do the > same kind of testing for Panfrost and ask Steve if he's okay with that > too. It's a tricky one. The ideal would be to teach user space how to recover from the memory allocation failure (even on older GPUs it's still in theory possible to break up the work and do incremental rendering). But that ship sailed long ago so this would be a regression. I did consider GFP_ATOMIC as an option, but really we shouldn't be accessing "memory reserves" in such a situation. For background: In the "kbase" driver (Linux kernel driver for the closed source user space 'DDK' driver for Midgard/Bifrost GPUs) we had a "JIT memory allocator". The idea here was that if you have multiple renderings in flight you can attempt to share the tiler heap memory between them. So in this case when we can't allocate more memory and we know there's another tiler heap which is going to be freed by a fragment job that's already running, we can block knowing the memory is going to become available. It was designed to do the same thing as CSF's incremental rendering - allow us to opportunistically allocate memory but not fail the rendering if it wasn't available. But it was a nightmare to have any confidence of it being deadlock free and the implementation was frankly terrible - which is ultimately why CSF GPU's have this ability in hardware to perform incremental rendering without failing the job. But effectively this approach requires allocating just enough memory for one complete tiler heap while ensuring forward progress and opportunistically allowing extra memory to give a performance boost. TLDR; I think we should try switching to GFP_NOWAIT in Panfrost and do some testing with memory pressure. It might be acceptable (and an occasional job failure is better than an occasional lock up). If it turns out it's too easy to trigger job failures like this then we'll need to rethink. 
> For Panthor, I'm less worried, because we have the incremental rendering > fallback, and assuming GFP_NOWAIT tries hard enough to reclaim > low-hanging fruits, the perfs shouldn't suffer much more than they > would today with GFP_KERNEL allocations potentially delaying tiling > operations longer than would have been with a primitive flush. Yes for Panthor I think the approach is obvious - I believe GFP_NOWAIT should trigger background reclaim, so over the course of a few frames it should make the memory available (assuming there is sufficient 'free' memory). Indeed it might even give a performance boost if it stops us getting blocked in direct reclaim. Thanks, Steve
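For reference, the change under discussion boils down to swapping the gfp flags in the heap-growth fault path and failing the job instead of blocking. A simplified, hypothetical sketch follows; this is not the actual Panfrost code, and the handler and helper names are stand-ins.

    #include <linux/gfp.h>
    #include <linux/mm.h>
    #include <drm/drm_gem.h>

    /* Hypothetical placeholder for inserting the page in the GPU page table. */
    static int map_heap_page(struct drm_gem_object *obj, u64 offset,
                             struct page *page)
    {
            return 0;
    }

    /* Hypothetical heap-growth handler called from the GPU fault worker. */
    static int heap_grow_on_fault(struct drm_gem_object *obj, u64 offset)
    {
            /*
             * Before: alloc_page(GFP_KERNEL), which can end up in direct
             * reclaim, and reclaim may in turn wait on fences that cannot
             * signal until this fault is handled.
             *
             * After: GFP_NOWAIT wakes kswapd but never sleeps.
             */
            struct page *page = alloc_page(GFP_NOWAIT | __GFP_NOWARN);

            if (!page) {
                    /*
                     * Nothing available right now: fail the job (unhandled
                     * fault) rather than block; an occasional job failure
                     * beats an occasional lock-up.
                     */
                    return -ENOMEM;
            }

            return map_heap_page(obj, offset, page);
    }

Whether the failure rate under real memory pressure is acceptable is exactly what the proposed testing is meant to find out.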
On Mon, 14 Apr 2025 16:34:35 +0100 Steven Price <steven.price@arm.com> wrote: > On 14/04/2025 13:47, Boris Brezillon wrote: > > On Fri, 11 Apr 2025 16:39:02 +0200 > > Boris Brezillon <boris.brezillon@collabora.com> wrote: > > > >> On Fri, 11 Apr 2025 15:13:26 +0200 > >> Christian König <christian.koenig@amd.com> wrote: > >> > >>>> > >>>>> Background is that you don't get a crash, nor error message, nor > >>>>> anything indicating what is happening. > >>>> The job times out at some point, but we might get stuck in the fault > >>>> handler waiting for memory, which is pretty close to a deadlock, I > >>>> suspect. > >>> > >>> I don't know those drivers that well, but at least for amdgpu the > >>> problem would be that the timeout handling would need to grab some of > >>> the locks the memory management is holding waiting for the timeout > >>> handling to do something.... > >>> > >>> So that basically perfectly closes the circle. With a bit of lock you > >>> get a message after some time that the kernel is stuck, but since > >>> that are all sleeping locks I strongly doubt so. > >>> > >>> As immediately action please provide patches which changes those > >>> GFP_KERNEL into GFP_NOWAIT. > >> > >> Sure, I can do that. > > > > Hm, I might have been too prompt at claiming this was doable. In > > practice, doing that might regress Lima and Panfrost in situations > > where trying harder than GFP_NOWAIT would free up some memory. Not > > saying this was right to use GFP_KERNEL in the first place, but some > > expectations were set by this original mistake, so I'll probably need > > Lima developers to vouch in for this change after they've done some > > testing on a system under high memory pressure, and I'd need to do the > > same kind of testing for Panfrost and ask Steve if he's okay with that > > too. > > It's a tricky one. The ideal would be to teach user space how to recover > from the memory allocation failure (even on older GPUs it's still in > theory possible to break up the work and do incremental rendering). But > that ship sailed long ago so this would be a regression. > > I did consider GFP_ATOMIC as an option, but really we shouldn't be > accessing "memory reserves" in such a situation. > > For background: In the "kbase" driver (Linux kernel driver for the > closed source user space 'DDK' driver for Midgard/Bifrost GPUs) we had a > "JIT memory allocator". The idea here was that if you have multiple > renderings in flight you can attempt to share the tiler heap memory > between them. So in this case when we can't allocate more memory and we > know there's another tiler heap which is going to be freed by a fragment > job that's already running, we can block knowing the memory is going to > become available. > > It was designed to do the same thing as CSF's incremental rendering - > allow us to opportunistically allocate memory but not fail the rendering > if it wasn't available. > > But it was a nightmare to have any confidence of it being deadlock free > and the implementation was frankly terrible - which is ultimately why > CSF GPU's have this ability in hardware to perform incremental rendering > without failing the job. But effectively this approach requires > allocating just enough memory for one complete tiler heap while ensuring > forward progress and opportunistically allowing extra memory to give a > performance boost. > > TLDR; I think we should try switching to GFP_NOWAIT in Panfrost and do > some testing with memory pressure. 
It might be acceptable (and an > occasional job failure is better than an occasional lock up). If it > turns out it's too easy to trigger job failures like this then we'll > need to rethink. I thought about this incremental-rendering-on-JM thing during the past few days, and I'd like to run one idea through you if you don't mind. What if we tried to implement incremental rendering the way it's done for CSF, that is, when we get a fault on a tiler heap object we: 1. flush caches on the tiler heap range (with a FLUSH_MEM command on the faulty AS), such that any primitive data that have been queued so far get pushed to the memory 2. use a different AS for an IR (Increment Rendering) fragment job that have been provided by the UMD at submission time, just like the CSF backend of the UMD registers an exception handler for tiler OOMs at the moment. Make it so this IR fragment job chain is queued immediately after the currently running fragment job (if any). This IR fragment job should consume all the primitives written so far and push the result to a framebuffer. There's a bit of patching to do on the final fragment job chain, because the FB preload setup will differ if IR took place, but that should be doable with simple WRITE_VALUE jobs turning NULL jobs into FRAGMENT jobs (or the other way around) such that they get skipped/activated depending on whether IR took place or not. 3. collect pages covering consumed primitives and make the heap range covering those consumed pages fault on further accesses (basically iommu->unmap() the section we collected pages from). We probably need to keep a couple pages around the top/bottom tiler heap pointers, because some partially written primitives might be crossing a heap chunk boundary, but assuming our heap has more than 4 chunks available (which we can ensure at tiler heap allocation time), we should be good. 4. use one of the collected pages to satisfy the growing request, and acknowledge the fault on the faulty AS. We can pick from the pool of collected pages to satisfy new growing requests until it's depleted again. 5. repeat steps 1-4 until all primitives are flushed. 6. reset the tiler heap mapping by doing an iommu->unmap() on the whole heap BO range, and collect all the pages that were previously allocated to the heap such that the next allocation can be satisfied immediately Of course, this only works if primitives are added to the list only when they are fully written (that is, all data for the primitive has been written, and the primitive can be consumed by the fragment job without hazards). Just to be clear, this is a long term plan to try and fix this on JM HW, and given the changes it involves (UMD needs to be taught about IR on JM), we'll still need the GFP_NOWAIT fix for the time being.
Hi, On Mon, 14 Apr 2025 at 16:34, Steven Price <steven.price@arm.com> wrote: > On 14/04/2025 13:47, Boris Brezillon wrote: > > Hm, I might have been too prompt at claiming this was doable. In > > practice, doing that might regress Lima and Panfrost in situations > > where trying harder than GFP_NOWAIT would free up some memory. Not > > saying this was right to use GFP_KERNEL in the first place, but some > > expectations were set by this original mistake, so I'll probably need > > Lima developers to vouch in for this change after they've done some > > testing on a system under high memory pressure, and I'd need to do the > > same kind of testing for Panfrost and ask Steve if he's okay with that > > too. > > It's a tricky one. The ideal would be to teach user space how to recover > from the memory allocation failure (even on older GPUs it's still in > theory possible to break up the work and do incremental rendering). But > that ship sailed long ago so this would be a regression. Yeah. Vulkan clients are a little better in this regard, but GL clients are pretty useless at dealing with device loss across the board. Cheers, Daniel
On 15/04/2025 10:47, Boris Brezillon wrote: > On Mon, 14 Apr 2025 16:34:35 +0100 > Steven Price <steven.price@arm.com> wrote: > >> On 14/04/2025 13:47, Boris Brezillon wrote: >>> On Fri, 11 Apr 2025 16:39:02 +0200 >>> Boris Brezillon <boris.brezillon@collabora.com> wrote: >>> >>>> On Fri, 11 Apr 2025 15:13:26 +0200 >>>> Christian König <christian.koenig@amd.com> wrote: >>>> >>>>>> >>>>>>> Background is that you don't get a crash, nor error message, nor >>>>>>> anything indicating what is happening. >>>>>> The job times out at some point, but we might get stuck in the fault >>>>>> handler waiting for memory, which is pretty close to a deadlock, I >>>>>> suspect. >>>>> >>>>> I don't know those drivers that well, but at least for amdgpu the >>>>> problem would be that the timeout handling would need to grab some of >>>>> the locks the memory management is holding waiting for the timeout >>>>> handling to do something.... >>>>> >>>>> So that basically perfectly closes the circle. With a bit of lock you >>>>> get a message after some time that the kernel is stuck, but since >>>>> that are all sleeping locks I strongly doubt so. >>>>> >>>>> As immediately action please provide patches which changes those >>>>> GFP_KERNEL into GFP_NOWAIT. >>>> >>>> Sure, I can do that. >>> >>> Hm, I might have been too prompt at claiming this was doable. In >>> practice, doing that might regress Lima and Panfrost in situations >>> where trying harder than GFP_NOWAIT would free up some memory. Not >>> saying this was right to use GFP_KERNEL in the first place, but some >>> expectations were set by this original mistake, so I'll probably need >>> Lima developers to vouch in for this change after they've done some >>> testing on a system under high memory pressure, and I'd need to do the >>> same kind of testing for Panfrost and ask Steve if he's okay with that >>> too. >> >> It's a tricky one. The ideal would be to teach user space how to recover >> from the memory allocation failure (even on older GPUs it's still in >> theory possible to break up the work and do incremental rendering). But >> that ship sailed long ago so this would be a regression. >> >> I did consider GFP_ATOMIC as an option, but really we shouldn't be >> accessing "memory reserves" in such a situation. >> >> For background: In the "kbase" driver (Linux kernel driver for the >> closed source user space 'DDK' driver for Midgard/Bifrost GPUs) we had a >> "JIT memory allocator". The idea here was that if you have multiple >> renderings in flight you can attempt to share the tiler heap memory >> between them. So in this case when we can't allocate more memory and we >> know there's another tiler heap which is going to be freed by a fragment >> job that's already running, we can block knowing the memory is going to >> become available. >> >> It was designed to do the same thing as CSF's incremental rendering - >> allow us to opportunistically allocate memory but not fail the rendering >> if it wasn't available. >> >> But it was a nightmare to have any confidence of it being deadlock free >> and the implementation was frankly terrible - which is ultimately why >> CSF GPU's have this ability in hardware to perform incremental rendering >> without failing the job. But effectively this approach requires >> allocating just enough memory for one complete tiler heap while ensuring >> forward progress and opportunistically allowing extra memory to give a >> performance boost. 
>> >> TLDR; I think we should try switching to GFP_NOWAIT in Panfrost and do >> some testing with memory pressure. It might be acceptable (and an >> occasional job failure is better than an occasional lock up). If it >> turns out it's too easy to trigger job failures like this then we'll >> need to rethink. > > I thought about this incremental-rendering-on-JM thing during the past > few days, and I'd like to run one idea through you if you don't mind. > What if we tried to implement incremental rendering the way it's done > for CSF, that is, when we get a fault on a tiler heap object we: CSF adds the ability for the command stream to make decisions between 'jobs'. kbase has the concept of 'soft jobs' allowing the kernel to do actions between jobs. I fear to make this work we'd need something similar - see below. > 1. flush caches on the tiler heap range (with a FLUSH_MEM command on the > faulty AS), such that any primitive data that have been queued so far > get pushed to the memory So I'm not entirely sure whether you are proposing doing this on the back of a MMU fault or not. Doing a FLUSH_MEM while the tiler is waiting for a memory fault to be resolved is unlikely to work (it won't flush the tiler's internal caches). The tiler is a very complex beast and has some elaborate caches. We don't support soft-stopping tiler jobs (they just run to completion) because the H/W guys could never figure out how to safely stop the tiler - so I very much doubt we can deal with the half-completed state of the tiler. > 2. use a different AS for an IR (Increment Rendering) fragment job that > have been provided by the UMD at submission time, just like the > CSF backend of the UMD registers an exception handler for tiler OOMs > at the moment. It's an interesting idea - I worry about internal deadlocks within the GPU. There's some magic which ties vertex processing and the tiler together, and I'm not sure whether a hung tiler job could lead to hung vertex tasks on the shader cores. But it's possible my worries are unfounded. > Make it so this IR fragment job chain is queued immediately after the > currently running fragment job (if any). This IR fragment job should > consume all the primitives written so far and push the result to a > framebuffer. There's a bit of patching to do on the final > fragment job chain, because the FB preload setup will differ if > IR took place, but that should be doable with simple WRITE_VALUE jobs > turning NULL jobs into FRAGMENT jobs (or the other way around) > such that they get skipped/activated depending on whether IR took > place or not. You're probably more of an expert than me on that part - it certainly sounds plausible. > 3. collect pages covering consumed primitives and make the heap range > covering those consumed pages fault on further accesses (basically > iommu->unmap() the section we collected pages from). We probably > need to keep a couple pages around the top/bottom tiler heap > pointers, because some partially written primitives might be crossing > a heap chunk boundary, but assuming our heap has more than 4 chunks > available (which we can ensure at tiler heap allocation time), we > should be good. I'm not sure if it's sufficient to keep the last 'n' chunks. But at least in theory it should be possible to identify chunks that are still active. > 4. use one of the collected pages to satisfy the growing request, and > acknowledge the fault on the faulty AS. 
We can pick from the pool > of collected pages to satisfy new growing requests until it's > depleted again. > > 5. repeat steps 1-4 until all primitives are flushed. Running the IR fragment jobs from part way through the tiler heap on subsequent renderings could be tricky - I'm not sure if we have a way of doing that? > 6. reset the tiler heap mapping by doing an iommu->unmap() on the whole > heap BO range, and collect all the pages that were previously > allocated to the heap such that the next allocation can be satisfied > immediately > > Of course, this only works if primitives are added to the list only > when they are fully written (that is, all data for the primitive has > been written, and the primitive can be consumed by the fragment job > without hazards). Yeah, the hardware certainly wasn't designed for this ;) The H/W designers approach is simple: 1. Submit vertex/tiler job with small tiler heap 2. Job fails with OUT_OF_MEMORY, capture index of last completed vertex which is written to the job descriptor 3. Submit fragment job to incrementally render. 4. Submit new vertex/tiler job using the index captured in step 2. 5. Loop back to 2. The problem from an implementation point of view is either you give up on fences (bounce back to user space between each step), or you need start writing job descriptors in the kernel. Clearly one option is to go the whole hog and have something looking like the CSF firmware running in the kernel. It's not a popular option though! ;) > Just to be clear, this is a long term plan to try and fix this on JM > HW, and given the changes it involves (UMD needs to be taught about IR > on JM), we'll still need the GFP_NOWAIT fix for the time being. Agreed, and I like the ingenuity of it - but having seen the H/W team give up on soft-stopping the tiler I'm not convinced it will work leaving the tiler waiting on an MMU fault. Thanks, Steve
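Sketched as code, the loop the hardware designers had in mind looks roughly like this. Everything below is a hypothetical stand-in (the helpers, types and status plumbing are invented for illustration); as noted above, the real difficulty is that each iteration either bounces through userspace, breaking the fence model, or requires the kernel to start writing job descriptors.

    #include <stdint.h>

    /* All of the types and helpers below are hypothetical stand-ins. */
    struct ctx;
    struct draw_state;

    enum job_result {
            JOB_COMPLETE,
            JOB_OUT_OF_MEMORY,
    };

    struct job_status {
            enum job_result result;
            uint32_t last_completed_vertex; /* captured in the job descriptor */
    };

    void submit_vertex_tiler(struct ctx *ctx, struct draw_state *ds,
                             uint32_t first_vertex);
    void submit_fragment(struct ctx *ctx, struct draw_state *ds, int final);
    struct job_status wait_vertex_tiler(struct ctx *ctx);

    /* Steps 1-5 as described above, driven from userspace. */
    int render_with_incremental_fallback(struct ctx *ctx, struct draw_state *ds)
    {
            uint32_t first_vertex = 0;

            /* 1. Submit vertex/tiler work with a small tiler heap. */
            submit_vertex_tiler(ctx, ds, first_vertex);

            for (;;) {
                    /* 2. On OUT_OF_MEMORY, the index of the last completed
                     *    vertex is written back to the job descriptor. */
                    struct job_status st = wait_vertex_tiler(ctx);

                    if (st.result != JOB_OUT_OF_MEMORY) {
                            /* Tiling finished: submit the final fragment job. */
                            submit_fragment(ctx, ds, 1);
                            return 0;
                    }

                    /* 3. Incrementally render what has been binned so far. */
                    submit_fragment(ctx, ds, 0);

                    /* 4./5. Resubmit vertex/tiler work from the captured index
                     *       and loop until everything has been consumed. */
                    first_vertex = st.last_completed_vertex;
                    submit_vertex_tiler(ctx, ds, first_vertex);
            }
    }
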
On Wed, 16 Apr 2025 16:16:05 +0100 Steven Price <steven.price@arm.com> wrote: > On 15/04/2025 10:47, Boris Brezillon wrote: > > On Mon, 14 Apr 2025 16:34:35 +0100 > > Steven Price <steven.price@arm.com> wrote: > > > >> On 14/04/2025 13:47, Boris Brezillon wrote: > >>> On Fri, 11 Apr 2025 16:39:02 +0200 > >>> Boris Brezillon <boris.brezillon@collabora.com> wrote: > >>> > >>>> On Fri, 11 Apr 2025 15:13:26 +0200 > >>>> Christian König <christian.koenig@amd.com> wrote: > >>>> > >>>>>> > >>>>>>> Background is that you don't get a crash, nor error message, nor > >>>>>>> anything indicating what is happening. > >>>>>> The job times out at some point, but we might get stuck in the fault > >>>>>> handler waiting for memory, which is pretty close to a deadlock, I > >>>>>> suspect. > >>>>> > >>>>> I don't know those drivers that well, but at least for amdgpu the > >>>>> problem would be that the timeout handling would need to grab some of > >>>>> the locks the memory management is holding waiting for the timeout > >>>>> handling to do something.... > >>>>> > >>>>> So that basically perfectly closes the circle. With a bit of lock you > >>>>> get a message after some time that the kernel is stuck, but since > >>>>> that are all sleeping locks I strongly doubt so. > >>>>> > >>>>> As immediately action please provide patches which changes those > >>>>> GFP_KERNEL into GFP_NOWAIT. > >>>> > >>>> Sure, I can do that. > >>> > >>> Hm, I might have been too prompt at claiming this was doable. In > >>> practice, doing that might regress Lima and Panfrost in situations > >>> where trying harder than GFP_NOWAIT would free up some memory. Not > >>> saying this was right to use GFP_KERNEL in the first place, but some > >>> expectations were set by this original mistake, so I'll probably need > >>> Lima developers to vouch in for this change after they've done some > >>> testing on a system under high memory pressure, and I'd need to do the > >>> same kind of testing for Panfrost and ask Steve if he's okay with that > >>> too. > >> > >> It's a tricky one. The ideal would be to teach user space how to recover > >> from the memory allocation failure (even on older GPUs it's still in > >> theory possible to break up the work and do incremental rendering). But > >> that ship sailed long ago so this would be a regression. > >> > >> I did consider GFP_ATOMIC as an option, but really we shouldn't be > >> accessing "memory reserves" in such a situation. > >> > >> For background: In the "kbase" driver (Linux kernel driver for the > >> closed source user space 'DDK' driver for Midgard/Bifrost GPUs) we had a > >> "JIT memory allocator". The idea here was that if you have multiple > >> renderings in flight you can attempt to share the tiler heap memory > >> between them. So in this case when we can't allocate more memory and we > >> know there's another tiler heap which is going to be freed by a fragment > >> job that's already running, we can block knowing the memory is going to > >> become available. > >> > >> It was designed to do the same thing as CSF's incremental rendering - > >> allow us to opportunistically allocate memory but not fail the rendering > >> if it wasn't available. > >> > >> But it was a nightmare to have any confidence of it being deadlock free > >> and the implementation was frankly terrible - which is ultimately why > >> CSF GPU's have this ability in hardware to perform incremental rendering > >> without failing the job. 
But effectively this approach requires > >> allocating just enough memory for one complete tiler heap while ensuring > >> forward progress and opportunistically allowing extra memory to give a > >> performance boost. > >> > >> TLDR; I think we should try switching to GFP_NOWAIT in Panfrost and do > >> some testing with memory pressure. It might be acceptable (and an > >> occasional job failure is better than an occasional lock up). If it > >> turns out it's too easy to trigger job failures like this then we'll > >> need to rethink. > > > > I thought about this incremental-rendering-on-JM thing during the past > > few days, and I'd like to run one idea through you if you don't mind. > > What if we tried to implement incremental rendering the way it's done > > for CSF, that is, when we get a fault on a tiler heap object we: > > CSF adds the ability for the command stream to make decisions between > 'jobs'. kbase has the concept of 'soft jobs' allowing the kernel to do > actions between jobs. I fear to make this work we'd need something > similar - see below. > > > 1. flush caches on the tiler heap range (with a FLUSH_MEM command on the > > faulty AS), such that any primitive data that have been queued so far > > get pushed to the memory > > So I'm not entirely sure whether you are proposing doing this on the > back of a MMU fault or not. Doing a FLUSH_MEM while the tiler is waiting > for a memory fault to be resolved is unlikely to work (it won't flush > the tiler's internal caches). The tiler is a very complex beast and has > some elaborate caches. We don't support soft-stopping tiler jobs (they > just run to completion) because the H/W guys could never figure out how > to safely stop the tiler - so I very much doubt we can deal with the > half-completed state of the tiler. > > > 2. use a different AS for an IR (Increment Rendering) fragment job that > > have been provided by the UMD at submission time, just like the > > CSF backend of the UMD registers an exception handler for tiler OOMs > > at the moment. > > It's an interesting idea - I worry about internal deadlocks within the > GPU. There's some magic which ties vertex processing and the tiler > together, and I'm not sure whether a hung tiler job could lead to hung > vertex tasks on the shader cores. But it's possible my worries are > unfounded. > > > Make it so this IR fragment job chain is queued immediately after the > > currently running fragment job (if any). This IR fragment job should > > consume all the primitives written so far and push the result to a > > framebuffer. There's a bit of patching to do on the final > > fragment job chain, because the FB preload setup will differ if > > IR took place, but that should be doable with simple WRITE_VALUE jobs > > turning NULL jobs into FRAGMENT jobs (or the other way around) > > such that they get skipped/activated depending on whether IR took > > place or not. > > You're probably more of an expert than me on that part - it certainly > sounds plausible. > > > 3. collect pages covering consumed primitives and make the heap range > > covering those consumed pages fault on further accesses (basically > > iommu->unmap() the section we collected pages from). We probably > > need to keep a couple pages around the top/bottom tiler heap > > pointers, because some partially written primitives might be crossing > > a heap chunk boundary, but assuming our heap has more than 4 chunks > > available (which we can ensure at tiler heap allocation time), we > > should be good. 
> > I'm not sure if it's sufficient to keep the last 'n' chunks. But at > least in theory it should be possible to identify chunks that are still > active. > > > 4. use one of the collected pages to satisfy the growing request, and > > acknowledge the fault on the faulty AS. We can pick from the pool > > of collected pages to satisfy new growing requests until it's > > depleted again. > > > > 5. repeat steps 1-4 until all primitives are flushed. > > Running the IR fragment jobs from part way through the tiler heap on > subsequent renderings could be tricky - I'm not sure if we have a way of > doing that? > > > 6. reset the tiler heap mapping by doing an iommu->unmap() on the whole > > heap BO range, and collect all the pages that were previously > > allocated to the heap such that the next allocation can be satisfied > > immediately > > > > Of course, this only works if primitives are added to the list only > > when they are fully written (that is, all data for the primitive has > > been written, and the primitive can be consumed by the fragment job > > without hazards). > > Yeah, the hardware certainly wasn't designed for this ;) > > The H/W designers approach is simple: > > 1. Submit vertex/tiler job with small tiler heap > > 2. Job fails with OUT_OF_MEMORY, capture index of last completed vertex > which is written to the job descriptor > > 3. Submit fragment job to incrementally render. > > 4. Submit new vertex/tiler job using the index captured in step 2. > > 5. Loop back to 2. > > The problem from an implementation point of view is either you give up > on fences (bounce back to user space between each step), or you need > start writing job descriptors in the kernel. > > Clearly one option is to go the whole hog and have something looking > like the CSF firmware running in the kernel. It's not a popular option > though! ;) > > > Just to be clear, this is a long term plan to try and fix this on JM > > HW, and given the changes it involves (UMD needs to be taught about IR > > on JM), we'll still need the GFP_NOWAIT fix for the time being. > > Agreed, and I like the ingenuity of it - but having seen the H/W team > give up on soft-stopping the tiler I'm not convinced it will work > leaving the tiler waiting on an MMU fault. Yeah, looking at the answers you gave, it doesn't sounds like this is going to work, but thanks for taking the time to look at my crazy idea :-). Boris