Message ID: 20240828024901.2582335-1-matthew.brost@intel.com (mailing list archive)
Series: Introduce GPU SVM and Xe SVM implementation
On Tue, Aug 27, 2024 at 07:48:33PM -0700, Matthew Brost wrote:
> Continuation of SVM work by Oak Zeng [1][2] based on community feedback.
> Introduces GPU SVM layer and new Xe uAPI. Supports GPU page faults for
> system allocations (e.g., malloc), runtime allocations (e.g., binding a
> BO), migration to and from VRAM, and unified eviction (BO and SVM VRAM
> allocations can evict each other). Fully tested; more on this below.
>
> The patch breakdown is as follows:
>
> 1. Preparation patches already on the list [3].
>    - Patches 1-3.
>    - Please refrain from reviewing these here.
> 2. New migrate layer functionality.
>    - Patch 4.
>    - Required for eviction to avoid locking inversion between
>      dma-resv and mmap lock.
> 3. GPU SVM.
>    - Patch 5.
>    - This is what needs community review.
>    - Inspired by GPUVM.
>    - Kernel doc should explain design principles.
>    - There is certainly room for optimization of the implementation
>      and improvements with existing core MM interaction. Pulling in
>      pending DMA mapping work [4] and additional core MM support
>      for SVM is also likely desired. However, this serves as a good
>      starting point for any SVM discussions and could be used as a
>      stepping stone to future core MM work.
> 4. Basic SVM support in Xe (i.e., SRAM backing only).
>    - Patches 6-15.
>    - The uAPI in the patch could benefit from community input.
> 5. SVM VRAM migration support in Xe.
>    - Patches 16-23.
>    - Using TTM BOs for SVM VRAM allocations could use community
>      input. Patch 23 has a detailed explanation of this design
>      choice in the commit message.
> 6. SVM eviction support in Xe.
>    - Patch 24.
>    - Should work with exhaustive eviction [5] when it merges.
> 7. Xe SVM debug / tuning.
>    - Patches 25-28.
>
> Kernel documentation and commit messages are relatively light, aside
> from the GPU SVM and uAPI patches, as this is an RFC.
>
> Testing has been conducted quite thoroughly with a new IGT [6]. Various
> system allocation types (malloc, mmap, mmap flags, huge pages, different
> sizes, different alignments), mixing runtime allocations, unmapping
> corners, invalid faults, and eviction have been tested. Testing scales
> from a single thread to multiple threads and multiple processes. Tests
> pass on LNL, BMG, PVC 1 tile, and PVC 2 tile.
>
> Not included in this series:
>
> 1. Multiple GPU support.
>    - This is likely to follow or occur in parallel to this work.
> 2. Userptr unification with GPU SVM.
>    - This is essentially designed in my head (likely involving a
>      few new GPU SVM layer functions) but would require some fairly
>      invasive changes to Xe KMD to test out. Therefore, I would
>      like GPU SVM to be reviewed first before proceeding with these
>      changes.
> 3. Madvise and prefetch IOCTLs.
>    - This is likely to follow or occur in parallel to this work.
>
> Given the size of the series, I have pushed a GitLab branch for
> reference [7].
>
> Matt
>
> [1] https://patchwork.freedesktop.org/series/128910/
> [2] https://patchwork.freedesktop.org/series/132229/
> [3] https://patchwork.freedesktop.org/series/137805/
> [4] https://lore.kernel.org/linux-rdma/cover.1709635535.git.leon@kernel.org/
> [5] https://patchwork.freedesktop.org/series/133643/
> [6] https://patchwork.freedesktop.org/patch/610942/?series=137545&rev=2
> [7] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svm-post-8-27-24/-/tree/post?ref_type=heads

Ok, rather late -- I wanted to type this up 2 weeks ago or so, but alas here
it is finally.
I think with all the experiments and discussions I have a fairly clear
understanding of the really tricky parts of svm now (thanks a lot to Matt for
all the work done). From my side the key points, sorted roughly by how
important I think they are:

1. migrate_to_ram path

I think this is the absolute center piece of making sure we're aligned with
core mm and don't suffer from deadlocks or livelocks fundamental to the
gpusvm library design. So this part imo needs to be solid for the first
version we merge. But of course any core mm changes Matt prototyped shouldn't
gate merging the drm side since they're nicely decoupled, we only need those
to validate the design in testing.

I think the key points are:

- we rely on the migration pte, temporary page references and page lock
  only, which with the core mm changes Matt worked on gives us guaranteed
  reliable migration back to system memory. And we need that, or svm
  essentially becomes unusable as a concept.

- we need to support partial migration, including the worst case fallback of
  only migrating that single page core mm managed to trylock for us while
  holding the pagetable lock (a rough sketch of this shape follows at the
  end of this mail).

  Since we have guaranteed migration back to system memory we can make the
  assumption on the gpu fault handling side that we will only ever handle
  ranges that are entirely in vram (by throwing any partial migrations out).
  Needs a retry loop for that on the gpu fault side, but I no longer see an
  issue with that assumption on the gpu fault side otherwise, so not needed
  for merging or even later until we have a driver that requires partial
  vram support.

- no other driver locks related to that memory range in any way are allowed,
  and ideally we also test with the forced fallback to mmap_read lock in
  do_swap_page removed, i.e. calling migrate_to_ram with only holding the
  read vma lock. Of course driver locks for blitting are allowed, it's only
  locks related to managing physical memory which are problematic and could
  result in deadlocks.

- the drm side must uphold the guarantee of not having elevated page
  references without holding the page lock. Otherwise there's a race and we
  cannot guarantee migration to sram.

- also through the try_to_migrate maze we'll hit our own gpu pte invalidate
  paths, so there's some requirements there too. But I've put the discussion
  for that at the very bottom, since most of the gpu pte locking questions
  are imo not that important, and definitely not important for the first
  version we merge.

Everything else below I think we can sort out post merge and just need rough
alignment on the design.

2. eviction

Requirements much like migrate_to_ram, because otherwise we break the
migration guarantee:

- Only looking at physical memory datastructures and locks, no looking at
  mm/vma structs or relying on those being locked. We rely entirely on
  reverse maps from try_to_migrate to find all the mappings on both cpu and
  gpu side (cpu only zone device swap or migration pte entries ofc).

- Partial migration needs to work to make sure we can get out of any low
  memory bind.

3. gpu fault side

- We can only rely on mmap_read for hmm_range_fault. And ideally we should
  assume that's not held anywhere else, since with per-vma locking I'd
  expect the mm/vma locking will move into hmm_range_fault. This also means
  no looking at the vma beyond just passing it around as currently needed by
  core mm functions.

- Big retry loop to handle all races with the mmu notifier under the gpu
  pagetable locks/mmu notifier range lock/whatever we end up calling those.
  Races (especially against concurrent eviction/migrate_to_ram) should _not_
  be handled on the fault side by trying to hold locks instead.

- Long term I think we need to be able to handle concurrent faults, even on
  hw where there's only one gpu fault handling queue. For performance we
  absolutely want to prefault aggressively, and that likely happens through
  cpu ioctls that are entirely independent from the gpu fault handling.

Short term, enough (driver-side) locking to make sure this doesn't go boom
is enough; I think just some design goal documentation here on how to
achieve that is all we need.

4. physical memory to virtual backpointer

No. Doesn't work :-) Also it's only used in the eviction/migrate_to_ram path
and I think Matt already fixed this all anyway.

5. gpu pagetable locking

Or mmu notifier range locking or whatever you want to call it. Like on the
cpu side this should _only_ protect the pagetable entries and, additionally
for us, the mmu notifier seqno tracking, nothing else.

Any races due to concurrent eviction/migrate_to_ram/gpu fault/prefault need
to be handled by retrying outside of holding the pagetable locks. If we try
to impose additional consistency guarantees we'll fall over and have a
livelock/deadlock fight with core mm in migrate_to_ram. This part is
required I think for the first version, but we already need that anyway to
make migrate_to_ram work properly.

For the actual data structure/locking design I think anything on the design
line between a single global lock and the radix tree over-the-top scalable
per-pagetable (spin)lock design of the core mm is fine.

The design here with 3 levels (mmu notifier, range, struct page) wouldn't be
my first choice, but clearly fits on that line so imo is fine for initial
merging. We might want to make sure that the range locking (I guess mostly
relevant for the invalidate side, drivers don't see much else) is somewhat
abstracted so we can easily change that post-merge, but not required imo at
all.

For consensus documentation I'd recommend a todo or design documentation
patch, where we put down both the current design and why it's like that, and
some of the longer term goals. Then get that acked (imo needs at least one
other driver that's seriously interested in this, plus I think an ack from
Danilo for gpuvm interactions), then merge that. SVM is tricky enough that I
think this would be really useful to make sure we're not unnecessarily stuck
in limbo.

From my side again I think the only part we really have to get right from
the start is migrate_to_ram. And I'm confident we've got that now really
solid.
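To make the partial-migration fallback concrete, here is a minimal sketch of
a migrate_to_ram handler in that shape, following the migrate_vma pattern
from lib/test_hmm.c. It requests only the faulting page, which is exactly
the worst case core mm can force on us; my_pgmap_owner and the memcpy
standing in for the blitter are assumptions for illustration, not the
series' code.

#include <linux/migrate.h>
#include <linux/highmem.h>

static void *my_pgmap_owner;	/* hypothetical driver owner cookie */

static vm_fault_t sketch_migrate_to_ram(struct vm_fault *vmf)
{
	unsigned long src = 0, dst = 0;
	struct page *spage, *dpage;
	struct migrate_vma args = {
		.vma		= vmf->vma,
		.start		= vmf->address,
		.end		= vmf->address + PAGE_SIZE,
		.src		= &src,
		.dst		= &dst,
		.pgmap_owner	= my_pgmap_owner,
		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
		.fault_page	= vmf->page,
	};

	if (migrate_vma_setup(&args))
		return VM_FAULT_SIGBUS;

	/*
	 * Partial migration is fine by construction: any page core mm
	 * could not trylock comes back without MIGRATE_PFN_MIGRATE set,
	 * and we copy only what was actually collected.
	 */
	if (src & MIGRATE_PFN_MIGRATE) {
		spage = migrate_pfn_to_page(src);
		dpage = alloc_page(GFP_HIGHUSER);
		if (dpage) {
			lock_page(dpage);	/* dst pages must be locked */
			copy_highpage(dpage, spage); /* blitter in real life */
			dst = migrate_pfn(page_to_pfn(dpage));
		}
	}

	migrate_vma_pages(&args);
	migrate_vma_finalize(&args);
	return 0;
}

A real handler would widen the window around vmf->address and batch the
copies, but it must still succeed when the arrays above collapse to that
single faulting page.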
Oh also you need userspace ofc :-)

Cheers, Sima

> Matthew Brost (28):
>   dma-buf: Split out dma fence array create into alloc and arm functions
>   drm/xe: Invalidate media_gt TLBs in PT code
>   drm/xe: Retry BO allocation
>   mm/migrate: Add migrate_device_vma_range
>   drm/gpusvm: Add support for GPU Shared Virtual Memory
>   drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATON flag
>   drm/xe: Add SVM init / fini to faulting VMs
>   drm/xe: Add dma_addr res cursor
>   drm/xe: Add SVM range invalidation
>   drm/gpuvm: Add DRM_GPUVA_OP_USER
>   drm/xe: Add (re)bind to SVM page fault handler
>   drm/xe: Add SVM garbage collector
>   drm/xe: Add unbind to SVM garbage collector
>   drm/xe: Do not allow system allocator VMA unbind if the GPU has
>     bindings
>   drm/xe: Enable system allocator uAPI
>   drm/xe: Add migrate layer functions for SVM support
>   drm/xe: Add SVM device memory mirroring
>   drm/xe: Add GPUSVM copy SRAM / VRAM vfunc functions
>   drm/xe: Update PT layer to understand ranges in VRAM
>   drm/xe: Add Xe SVM populate_vram_pfn vfunc
>   drm/xe: Add Xe SVM vram_release vfunc
>   drm/xe: Add BO flags required for SVM
>   drm/xe: Add SVM VRAM migration
>   drm/xe: Basic SVM BO eviction
>   drm/xe: Add SVM debug
>   drm/xe: Add modparam for SVM notifier size
>   drm/xe: Add modparam for SVM prefault
>   drm/gpusvm: Ensure all pages migrated upon eviction
>
>  drivers/dma-buf/dma-fence-array.c    |   78 +-
>  drivers/gpu/drm/xe/Makefile          |    4 +-
>  drivers/gpu/drm/xe/drm_gpusvm.c      | 2213 ++++++++++++++++++++++++++
>  drivers/gpu/drm/xe/drm_gpusvm.h      |  415 +++++
>  drivers/gpu/drm/xe/xe_bo.c           |   54 +-
>  drivers/gpu/drm/xe/xe_bo.h           |    2 +
>  drivers/gpu/drm/xe/xe_bo_types.h     |    3 +
>  drivers/gpu/drm/xe/xe_device_types.h |    8 +
>  drivers/gpu/drm/xe/xe_gt_pagefault.c |   17 +-
>  drivers/gpu/drm/xe/xe_migrate.c      |  150 ++
>  drivers/gpu/drm/xe/xe_migrate.h      |   10 +
>  drivers/gpu/drm/xe/xe_module.c       |    7 +
>  drivers/gpu/drm/xe/xe_module.h       |    2 +
>  drivers/gpu/drm/xe/xe_pt.c           |  456 +++++-
>  drivers/gpu/drm/xe/xe_pt.h           |    3 +
>  drivers/gpu/drm/xe/xe_pt_types.h     |    2 +
>  drivers/gpu/drm/xe/xe_res_cursor.h   |   50 +-
>  drivers/gpu/drm/xe/xe_svm.c          |  775 +++++++++
>  drivers/gpu/drm/xe/xe_svm.h          |   70 +
>  drivers/gpu/drm/xe/xe_tile.c         |    5 +
>  drivers/gpu/drm/xe/xe_vm.c           |  286 +++-
>  drivers/gpu/drm/xe/xe_vm.h           |   15 +-
>  drivers/gpu/drm/xe/xe_vm_types.h     |   44 +
>  include/drm/drm_gpuvm.h              |    5 +
>  include/linux/dma-fence-array.h      |    6 +
>  include/linux/migrate.h              |    3 +
>  include/uapi/drm/xe_drm.h            |   19 +-
>  mm/migrate_device.c                  |   53 +
>  28 files changed, 4615 insertions(+), 140 deletions(-)
>  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.c
>  create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.h
>  create mode 100644 drivers/gpu/drm/xe/xe_svm.c
>  create mode 100644 drivers/gpu/drm/xe/xe_svm.h
>
> --
> 2.34.1
On Tue, Sep 24, 2024 at 11:16:01AM +0200, Simona Vetter wrote:
> On Tue, Aug 27, 2024 at 07:48:33PM -0700, Matthew Brost wrote:
> > [snip: cover letter quoted in full above]
>
> Ok, rather late -- I wanted to type this up 2 weeks ago or so, but alas
> here it is finally. I think with all the experiments and discussions I have

Thanks for putting this together and all the initial reviews.

> a fairly clear understanding of the really tricky parts of svm now (thanks
> a lot to Matt for all the work done). From my side the key points, sorted
> roughly by how important I think they are:

I've read this through and pretty much agree with everything here, so I
won't have a detailed response to everything as there isn't much to say
aside from "I agree". A few minor comments below.

> 1. migrate_to_ram path
>
> I think this is the absolute center piece of making sure we're aligned
> with core mm and don't suffer from deadlocks or livelocks fundamental to
> the gpusvm library design. So this part imo needs to be solid for the
> first version we merge. But of course any core mm changes Matt prototyped
> shouldn't gate merging the drm side since they're nicely decoupled, we
> only need those to validate the design in testing.
>
> I think the key points are:
>
> - we rely on the migration pte, temporary page references and page lock
>   only, which with the core mm changes Matt worked on gives us guaranteed
>   reliable migration back to system memory. And we need that, or svm
>   essentially becomes unusable as a concept.
>
> - we need to support partial migration, including the worst case fallback
>   of only migrating that single page core mm managed to trylock for us
>   while holding the pagetable lock.
>
>   Since we have guaranteed migration back to system memory we can make the
>   assumption on the gpu fault handling side that we will only ever handle
>   ranges that are entirely in vram (by throwing any partial migrations
>   out). Needs a retry loop for that on the gpu fault side, but I no longer
>   see an issue with that assumption on the gpu fault side otherwise, so
>   not needed for merging or even later until we have a driver that
>   requires partial vram support.

I think pretty quickly we will add partial vram support / mixed mappings,
but it likely will not be in the initial merge.

> - no other driver locks related to that memory range in any way are
>   allowed, and ideally we also test with the forced fallback to mmap_read
>   lock in do_swap_page removed, i.e. calling migrate_to_ram with only
>   holding the read vma lock. Of course driver locks for blitting are
>   allowed, it's only locks related to managing physical memory which are
>   problematic and could result in deadlocks.
>
> - the drm side must uphold the guarantee of not having elevated page
>   references without holding the page lock. Otherwise there's a race and
>   we cannot guarantee migration to sram.
>
> - also through the try_to_migrate maze we'll hit our own gpu pte
>   invalidate paths, so there's some requirements there too. But I've put
>   the discussion for that at the very bottom, since most of the gpu pte
>   locking questions are imo not that important, and definitely not
>   important for the first version we merge.
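To make the elevated-reference rule quoted above concrete, a tiny hedged
sketch of the ordering it implies; the helper and the access inside it are
hypothetical, the point is only that an extra reference on a device page is
held strictly inside the folio lock, so migration never sees "unexplained"
references it cannot resolve.

#include <linux/mm.h>
#include <linux/pagemap.h>

static bool sketch_access_vram_page(struct folio *folio)
{
	if (!folio_trylock(folio))
		return false;	/* caller retries; never block here */

	folio_get(folio);	/* ref is covered by the folio lock */
	/* ... short driver access to the page, hypothetical ... */
	folio_put(folio);	/* drop before unlocking */

	folio_unlock(folio);
	return true;
}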
> Everything else below I think we can sort out post merge and just need
> rough alignment on the design.
>
> 2. eviction
>
> Requirements much like migrate_to_ram, because otherwise we break the
> migration guarantee:
>
> - Only looking at physical memory datastructures and locks, no looking at
>   mm/vma structs or relying on those being locked. We rely entirely on
>   reverse maps from try_to_migrate to find all the mappings on both cpu
>   and gpu side (cpu only zone device swap or migration pte entries ofc).
>
> - Partial migration needs to work to make sure we can get out of any low
>   memory bind.
>
> 3. gpu fault side
>
> - We can only rely on mmap_read for hmm_range_fault. And ideally we should
>   assume that's not held anywhere else, since with per-vma locking I'd
>   expect the mm/vma locking will move into hmm_range_fault. This also
>   means no looking at the vma beyond just passing it around as currently
>   needed by core mm functions.
>
> - Big retry loop to handle all races with the mmu notifier under the gpu
>   pagetable locks/mmu notifier range lock/whatever we end up calling
>   those. Races (especially against concurrent eviction/migrate_to_ram)
>   should _not_ be handled on the fault side by trying to hold locks
>   instead.
>
> - Long term I think we need to be able to handle concurrent faults, even
>   on hw where there's only one gpu fault handling queue. For performance
>   we absolutely want to prefault aggressively, and that likely happens
>   through cpu ioctls that are entirely independent from the gpu fault
>   handling.

I agree the long term goal is to handle concurrent GPU faults, and with a
bit of finer grained locking in Xe I have already made this work. The
biggest part which needs to be parallel is the migration code, which takes
up roughly 98% of the time in the GPU fault handler for a 2M fault, with a
split of 90% in the migrate_vma_* functions and 8% in the copy job. I've
seen tests which mirror multiple EUs from the same process taking
concurrent GPU faults show large gains in perf. Also, the CPU fault handler
is concurrent, so it makes a bit of sense that GPU faults are concurrent
too.

GPU faults being concurrent should also enable concurrent prefetches from
CPU IOCTLs, which is also likely desired.

I'm not going to include this or any of the other optimizations I have
worked on in what I try to merge initially, though, as I want to keep this
as simple as possible and also don't want to throw more code at the list
until a working baseline is in.
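For reference, a minimal sketch of the "big retry loop" shape point 3
describes, following the mmu_interval_notifier pattern from
Documentation/mm/hmm.rst; the driver_* helpers are hypothetical stand-ins
for the gpusvm/Xe specifics, and the caller is assumed to have filled in
range->notifier, range->start/end and range->hmm_pfns.

#include <linux/hmm.h>
#include <linux/mmu_notifier.h>
#include <linux/mmap_lock.h>

/* Hypothetical gpu pagetable/notifier lock and bind step. */
static void driver_lock(void) { }
static void driver_unlock(void) { }
static void driver_commit_gpu_ptes(struct hmm_range *range) { }

static int sketch_gpu_fault(struct mm_struct *mm, struct hmm_range *range)
{
	int ret;

again:
	range->notifier_seq = mmu_interval_read_begin(range->notifier);

	mmap_read_lock(mm);
	ret = hmm_range_fault(range);
	mmap_read_unlock(mm);
	if (ret) {
		if (ret == -EBUSY)
			goto again;	/* collided with an invalidation */
		return ret;
	}

	driver_lock();
	if (mmu_interval_read_retry(range->notifier, range->notifier_seq)) {
		/*
		 * A notifier fired meanwhile: drop everything and retry,
		 * never wait for the race to resolve while holding locks.
		 */
		driver_unlock();
		goto again;
	}
	driver_commit_gpu_ptes(range);	/* bind + remember notifier_seq */
	driver_unlock();
	return 0;
}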
> > For the actual data structure/locking design I think anything on the > design line between a single global lock and the radix tree over-the-top > scalable per-pagetable (spin)lock design of the core mm is fine. > I've seen the bind step in servicing GPU faults take barely any amount of time so having the GPU page tables protected by the VM's dma-resv lock seems fine in Xe. This really is up to each driver how it wants to handle this too. > The design here with 3 levels (mmu notifer, range, struct page) wouldn't > be my first choice, but clearly fits on that line so imo is fine for > initial merging. We might want to make sure that the range locking (I > guess mostly relevant for the invalidate side, drivers don't see much > else) is somewhat abstracted so we can easily change that post-merge, but > not required imo at all. > > For consensus documentation I'd recommend a todo or design documentation > patch, where we put down both the current design and why it's like that, > and some of the longer term goals. Then get that acked (imo needs at least > one other driver that's seriously interested in this, plus I think an ack > from Danilo for gpuvm interactions), then merge that. SVM is tricky enough > that I think this would be really useful to make sure we're not > unecessarily stuck in limbo. > I'll include a TODO or design documentation in the next rev. > From my side again I think the only part we really have to get right from > the start is migrate_to_ram. And I'm confident we've got that now really > solid. > I think most of all 5 points will be addressed in my next rev. Anything that isn't falls into an 'optimization we can do later' category which the design should be coded in a way these optimizations can easily be added. Matt > Oh also you need userspace ofc :-) > > Cheers, Sima > > > Matthew Brost (28): > > dma-buf: Split out dma fence array create into alloc and arm functions > > drm/xe: Invalidate media_gt TLBs in PT code > > drm/xe: Retry BO allocation > > mm/migrate: Add migrate_device_vma_range > > drm/gpusvm: Add support for GPU Shared Virtual Memory > > drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATON flag > > drm/xe: Add SVM init / fini to faulting VMs > > drm/xe: Add dma_addr res cursor > > drm/xe: Add SVM range invalidation > > drm/gpuvm: Add DRM_GPUVA_OP_USER > > drm/xe: Add (re)bind to SVM page fault handler > > drm/xe: Add SVM garbage collector > > drm/xe: Add unbind to SVM garbage collector > > drm/xe: Do not allow system allocator VMA unbind if the GPU has > > bindings > > drm/xe: Enable system allocator uAPI > > drm/xe: Add migrate layer functions for SVM support > > drm/xe: Add SVM device memory mirroring > > drm/xe: Add GPUSVM copy SRAM / VRAM vfunc functions > > drm/xe: Update PT layer to understand ranges in VRAM > > drm/xe: Add Xe SVM populate_vram_pfn vfunc > > drm/xe: Add Xe SVM vram_release vfunc > > drm/xe: Add BO flags required for SVM > > drm/xe: Add SVM VRAM migration > > drm/xe: Basic SVM BO eviction > > drm/xe: Add SVM debug > > drm/xe: Add modparam for SVM notifier size > > drm/xe: Add modparam for SVM prefault > > drm/gpusvm: Ensure all pages migrated upon eviction > > > > drivers/dma-buf/dma-fence-array.c | 78 +- > > drivers/gpu/drm/xe/Makefile | 4 +- > > drivers/gpu/drm/xe/drm_gpusvm.c | 2213 ++++++++++++++++++++++++++ > > drivers/gpu/drm/xe/drm_gpusvm.h | 415 +++++ > > drivers/gpu/drm/xe/xe_bo.c | 54 +- > > drivers/gpu/drm/xe/xe_bo.h | 2 + > > drivers/gpu/drm/xe/xe_bo_types.h | 3 + > > drivers/gpu/drm/xe/xe_device_types.h | 8 + > > 
drivers/gpu/drm/xe/xe_gt_pagefault.c | 17 +- > > drivers/gpu/drm/xe/xe_migrate.c | 150 ++ > > drivers/gpu/drm/xe/xe_migrate.h | 10 + > > drivers/gpu/drm/xe/xe_module.c | 7 + > > drivers/gpu/drm/xe/xe_module.h | 2 + > > drivers/gpu/drm/xe/xe_pt.c | 456 +++++- > > drivers/gpu/drm/xe/xe_pt.h | 3 + > > drivers/gpu/drm/xe/xe_pt_types.h | 2 + > > drivers/gpu/drm/xe/xe_res_cursor.h | 50 +- > > drivers/gpu/drm/xe/xe_svm.c | 775 +++++++++ > > drivers/gpu/drm/xe/xe_svm.h | 70 + > > drivers/gpu/drm/xe/xe_tile.c | 5 + > > drivers/gpu/drm/xe/xe_vm.c | 286 +++- > > drivers/gpu/drm/xe/xe_vm.h | 15 +- > > drivers/gpu/drm/xe/xe_vm_types.h | 44 + > > include/drm/drm_gpuvm.h | 5 + > > include/linux/dma-fence-array.h | 6 + > > include/linux/migrate.h | 3 + > > include/uapi/drm/xe_drm.h | 19 +- > > mm/migrate_device.c | 53 + > > 28 files changed, 4615 insertions(+), 140 deletions(-) > > create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.c > > create mode 100644 drivers/gpu/drm/xe/drm_gpusvm.h > > create mode 100644 drivers/gpu/drm/xe/xe_svm.c > > create mode 100644 drivers/gpu/drm/xe/xe_svm.h > > > > -- > > 2.34.1 > > > > -- > Simona Vetter > Software Engineer, Intel Corporation > http://blog.ffwll.ch
On Tue, Sep 24, 2024 at 07:36:21PM +0000, Matthew Brost wrote:
> On Tue, Sep 24, 2024 at 11:16:01AM +0200, Simona Vetter wrote:
>
> [snip]
>
> > 5. gpu pagetable locking
> >
> > Or mmu notifier range locking or whatever you want to call it. Like on
> > the cpu side this should _only_ protect the pagetable entries and,
> > additionally for us, the mmu notifier seqno tracking, nothing else.
> > Any races due to concurrent eviction/migrate_to_ram/gpu fault/prefault
> > need to be handled by retrying outside of holding the pagetable locks.
> > If we try to impose additional consistency guarantees we'll fall over
> > and have a livelock/deadlock fight with core mm in migrate_to_ram. This
> > part is required I think for the first version, but we already need that
> > anyway to make migrate_to_ram work properly.
> >
> > For the actual data structure/locking design I think anything on the
> > design line between a single global lock and the radix tree over-the-top
> > scalable per-pagetable (spin)lock design of the core mm is fine.
>
> I've seen the bind step in servicing GPU faults take barely any amount of
> time, so having the GPU page tables protected by the VM's dma-resv lock
> seems fine in Xe. This really is up to each driver how it wants to handle
> this, too.

I think if we go to a more fine-grained approach it'd make sense to have
this common, especially if we go with something more radix-tree based. It's
quite tricky to get right in the more extreme cases (*shudders looking at
cpu pgtable walking code), and with mmu notifier seqno handling there's
some additional complexity. And for drivers which don't need that
fine-grained approach it shouldn't hurt.

Also, with pagetable lock here I meant the one you're also taking from the
mmu notifier invalidate callback, so unless I'm completely lost that's not
the dma_resv lock of the vm. That is iirc what you call the "driver lock",
and I think for initial merging your current approach is entirely fine. It
would also need some changes for concurrent gpu pte (pre-)fault handling,
but that's not what I'm talking about here directly. A sketch of the
invalidate-side contract follows below.

> > The design here with 3 levels (mmu notifier, range, struct page)
> > wouldn't be my first choice, but clearly fits on that line so imo is
> > fine for initial merging. We might want to make sure that the range
> > locking (I guess mostly relevant for the invalidate side, drivers don't
> > see much else) is somewhat abstracted so we can easily change that
> > post-merge, but not required imo at all.
> >
> > For consensus documentation I'd recommend a todo or design documentation
> > patch, where we put down both the current design and why it's like that,
> > and some of the longer term goals. Then get that acked (imo needs at
> > least one other driver that's seriously interested in this, plus I think
> > an ack from Danilo for gpuvm interactions), then merge that. SVM is
> > tricky enough that I think this would be really useful to make sure
> > we're not unnecessarily stuck in limbo.
>
> I'll include a TODO or design documentation in the next rev.
>
> > From my side again I think the only part we really have to get right
> > from the start is migrate_to_ram. And I'm confident we've got that now
> > really solid.
>
> I think most of all 5 points will be addressed in my next rev. Anything
> that isn't falls into an 'optimization we can do later' category, and the
> design should be coded in a way that these optimizations can easily be
> added.

Just quickly read through your comments, and aside from the pagetable
locking one (which I think is just a bit of unclarity, not disagreement) I
agree on all of them.
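To pin down that contract, a minimal sketch of an invalidate callback that
takes only the driver pagetable lock, bumps the seqno, and zaps PTEs,
nothing else; the range struct, the lock choice and the zap helper are
hypothetical, only the mmu_interval_notifier calls are the real core mm API.

#include <linux/mmu_notifier.h>
#include <linux/spinlock.h>

struct sketch_range {
	struct mmu_interval_notifier notifier;
	spinlock_t pt_lock;	/* protects gpu ptes + stored seqno only */
};

/* Hypothetical zap of the gpu ptes covering [start, end). */
static void sketch_zap_gpu_ptes(struct sketch_range *r,
				unsigned long start, unsigned long end) { }

static bool sketch_invalidate(struct mmu_interval_notifier *mni,
			      const struct mmu_notifier_range *range,
			      unsigned long cur_seq)
{
	struct sketch_range *r = container_of(mni, struct sketch_range,
					      notifier);

	if (!mmu_notifier_range_blockable(range))
		return false;

	spin_lock(&r->pt_lock);
	mmu_interval_set_seq(mni, cur_seq);	/* must happen under the lock */
	sketch_zap_gpu_ptes(r, range->start, range->end);
	spin_unlock(&r->pt_lock);

	/* TLB flushes / fence waits go here, outside the pagetable lock. */
	return true;
}

static const struct mmu_interval_notifier_ops sketch_notifier_ops = {
	.invalidate = sketch_invalidate,
};

Any fault or prefault path that loses a seqno check against this callback
retries from the top rather than waiting under its locks, which is what
keeps the livelock/deadlock fight with migrate_to_ram off the table.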
-Sima

> Matt
>
> > Oh also you need userspace ofc :-)
> >
> > Cheers, Sima
> >
> > > [snip: quoted patch list and diffstat]
> >
> > --
> > Simona Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch