
[v5,00/32] Introduce GPU SVM and Xe SVM implementation

Message ID 20250213021112.1228481-1-matthew.brost@intel.com (mailing list archive)

Message

Matthew Brost Feb. 13, 2025, 2:10 a.m. UTC
Version 5 of GPU SVM. Thanks to everyone (especially Sima, Thomas,
Alistair, Himal) for their numerous reviews on revisions 1, 2, and 3, and
for helping to address many design issues.

This version has been tested with IGT [1] on PVC, BMG, and LNL. Also
tested with level0 (UMD) PR [2].

Major changes in v2:
- Dropped mmap write abuse
- core MM locking and retry loops instead of driver locking to avoid races
- Removed physical to virtual references
- Embedded structure/ops for drm_gpusvm_devmem
- Fixed mremap and fork issues
- Added DRM pagemap
- Included RFC documentation in the kernel doc

Major changes in v3:
- Move GPU SVM and DRM pagemap to DRM level
- Mostly addressed Thomas's feedback; lots of small changes documented
  in each individual patch change log

Major changes in v4:
- Pull documentation patch in
- Fix Kconfig / VRAM migration issue
- Address feedback which came out of internal multi-GPU implementation

Major changes in v5:
- Rebase on s/xe_mem_region/xe_vram_region
- The uAPI bit has changed now that PXP has landed

Known issues in v5:
- The check-pages workaround still exists; it was changed to a threshold
  in this version, which is better, but the cross-process page finding on
  small user allocations still needs to be root-caused.

Matt

[1] https://patchwork.freedesktop.org/series/137545/#rev3
[2] https://github.com/intel/compute-runtime/pull/782

Matthew Brost (28):
  drm/xe: Retry BO allocation
  mm/migrate: Add migrate_device_pfns
  mm/migrate: Trylock device page in do_swap_page
  drm/gpusvm: Add support for GPU Shared Virtual Memory
  drm/xe: Select DRM_GPUSVM Kconfig
  drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_CPU_ADDR_MIRROR
  drm/xe: Add SVM init / close / fini to faulting VMs
  drm/xe: Nuke VM's mapping upon close
  drm/xe: Add SVM range invalidation and page fault
  drm/gpuvm: Add DRM_GPUVA_OP_DRIVER
  drm/xe: Add (re)bind to SVM page fault handler
  drm/xe: Add SVM garbage collector
  drm/xe: Add unbind to SVM garbage collector
  drm/xe: Do not allow CPU address mirror VMA unbind if the GPU has
    bindings
  drm/xe: Enable CPU address mirror uAPI
  drm/xe/uapi: Add DRM_XE_QUERY_CONFIG_FLAG_HAS_CPU_ADDR_MIRROR
  drm/xe: Add migrate layer functions for SVM support
  drm/xe: Add SVM device memory mirroring
  drm/xe: Add drm_gpusvm_devmem to xe_bo
  drm/xe: Add GPUSVM device memory copy vfunc functions
  drm/xe: Add Xe SVM populate_devmem_pfn GPU SVM vfunc
  drm/xe: Add Xe SVM devmem_release GPU SVM vfunc
  drm/xe: Add SVM VRAM migration
  drm/xe: Basic SVM BO eviction
  drm/xe: Add SVM debug
  drm/xe: Add modparam for SVM notifier size
  drm/xe: Add always_migrate_to_vram modparam
  drm/doc: gpusvm: Add GPU SVM documentation

Thomas Hellström (4):
  drm/pagemap: Add DRM pagemap
  drm/xe/bo: Introduce xe_bo_put_async
  drm/xe: Add dma_addr res cursor
  drm/xe: Add drm_pagemap ops to SVM

 Documentation/gpu/rfc/gpusvm.rst            |   84 +
 Documentation/gpu/rfc/index.rst             |    4 +
 drivers/gpu/drm/Kconfig                     |    9 +
 drivers/gpu/drm/Makefile                    |    1 +
 drivers/gpu/drm/drm_gpusvm.c                | 2230 +++++++++++++++++++
 drivers/gpu/drm/xe/Kconfig                  |   10 +
 drivers/gpu/drm/xe/Makefile                 |    1 +
 drivers/gpu/drm/xe/xe_bo.c                  |   54 +
 drivers/gpu/drm/xe/xe_bo.h                  |   20 +
 drivers/gpu/drm/xe/xe_bo_types.h            |    4 +
 drivers/gpu/drm/xe/xe_device.c              |    3 +
 drivers/gpu/drm/xe/xe_device_types.h        |   22 +
 drivers/gpu/drm/xe/xe_gt_pagefault.c        |   18 +-
 drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c |   22 +
 drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h |    2 +
 drivers/gpu/drm/xe/xe_migrate.c             |  175 ++
 drivers/gpu/drm/xe/xe_migrate.h             |   10 +
 drivers/gpu/drm/xe/xe_module.c              |    7 +
 drivers/gpu/drm/xe/xe_module.h              |    2 +
 drivers/gpu/drm/xe/xe_pt.c                  |  393 +++-
 drivers/gpu/drm/xe/xe_pt.h                  |    5 +
 drivers/gpu/drm/xe/xe_pt_types.h            |    2 +
 drivers/gpu/drm/xe/xe_query.c               |    5 +-
 drivers/gpu/drm/xe/xe_res_cursor.h          |  116 +-
 drivers/gpu/drm/xe/xe_svm.c                 |  964 ++++++++
 drivers/gpu/drm/xe/xe_svm.h                 |   96 +
 drivers/gpu/drm/xe/xe_tile.c                |    5 +
 drivers/gpu/drm/xe/xe_vm.c                  |  374 +++-
 drivers/gpu/drm/xe/xe_vm.h                  |   15 +-
 drivers/gpu/drm/xe/xe_vm_types.h            |   57 +
 include/drm/drm_gpusvm.h                    |  507 +++++
 include/drm/drm_gpuvm.h                     |    5 +
 include/drm/drm_pagemap.h                   |  105 +
 include/linux/migrate.h                     |    1 +
 include/uapi/drm/xe_drm.h                   |   22 +-
 mm/memory.c                                 |   13 +-
 mm/migrate_device.c                         |  116 +-
 37 files changed, 5326 insertions(+), 153 deletions(-)
 create mode 100644 Documentation/gpu/rfc/gpusvm.rst
 create mode 100644 drivers/gpu/drm/drm_gpusvm.c
 create mode 100644 drivers/gpu/drm/xe/xe_svm.c
 create mode 100644 drivers/gpu/drm/xe/xe_svm.h
 create mode 100644 include/drm/drm_gpusvm.h
 create mode 100644 include/drm/drm_pagemap.h

Comments

Demi Marie Obenour Feb. 13, 2025, 9:23 p.m. UTC | #1
On Wed, Feb 12, 2025 at 06:10:40PM -0800, Matthew Brost wrote:
> Version 5 of GPU SVM. Thanks to everyone (especially Sima, Thomas,
> Alistair, Himal) for their numerous reviews on revision 1, 2, 3  and for
> helping to address many design issues.
> 
> This version has been tested with IGT [1] on PVC, BMG, and LNL. Also
> tested with level0 (UMD) PR [2].

What is the plan to deal with not being able to preempt while a page
fault is pending?  This seems like an easy DoS vector.  My understanding
is that SVM is mostly used by compute workloads on headless systems.
Recent AMD client GPUs don't support SVM, so programs that want to run
on client systems should not require SVM if they wish to be portable.

Given the potential for abuse, I think it would be best to require
explicit administrator opt-in to enable SVM, along with possibly having
a timeout to resolve a page fault (after which the context is killed).
Since I expect most uses of SVM to be in the datacenter space (for the
reasons mentioned above), I don't believe this will be a major
limitation in practice.  Programs that wish to run on client systems
already need to use explicit memory transfer or pinned userptr, and
administrators of compute clusters should be willing to enable this
feature because only one workload will be using a GPU at a time.
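
For illustration, a minimal sketch of what such an administrator opt-in
could look like, assuming a driver-level module parameter (the name
"enable_svm" and the helper below are invented here, not something this
series adds):

        #include <linux/module.h>
        #include <linux/moduleparam.h>
        #include <linux/errno.h>

        /* Default off: the administrator must explicitly opt in. */
        static bool enable_svm;
        module_param(enable_svm, bool, 0444);
        MODULE_PARM_DESC(enable_svm,
                         "Allow userspace to create CPU-address-mirror (SVM) VMs (default: false)");

        /* Somewhere in the VM-bind path: reject SVM binds unless opted in. */
        static int check_svm_allowed(void)
        {
                if (!enable_svm)
                        return -EOPNOTSUPP;
                return 0;
        }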

Thomas Hellström Feb. 14, 2025, 8:47 a.m. UTC | #2
Hi

On Thu, 2025-02-13 at 16:23 -0500, Demi Marie Obenour wrote:
> On Wed, Feb 12, 2025 at 06:10:40PM -0800, Matthew Brost wrote:
> > Version 5 of GPU SVM. Thanks to everyone (especially Sima, Thomas,
> > Alistair, Himal) for their numerous reviews on revision 1, 2, 3 
> > and for
> > helping to address many design issues.
> > 
> > This version has been tested with IGT [1] on PVC, BMG, and LNL.
> > Also
> > tested with level0 (UMD) PR [2].
> 
> What is the plan to deal with not being able to preempt while a page
> fault is pending?  This seems like an easy DoS vector.  My
> understanding
> is that SVM is mostly used by compute workloads on headless systems.
> Recent AMD client GPUs don't support SVM, so programs that want to
> run
> on client systems should not require SVM if they wish to be portable.
> 
> Given the potential for abuse, I think it would be best to require
> explicit administrator opt-in to enable SVM, along with possibly
> having
> a timeout to resolve a page fault (after which the context is
> killed).
> Since I expect most uses of SVM to be in the datacenter space (for
> the
> reasons mentioned above), I don't believe this will be a major
> limitation in practice.  Programs that wish to run on client systems
> already need to use explicit memory transfer or pinned userptr, and
> administrators of compute clusters should be willing to enable this
> feature because only one workload will be using a GPU at a time.

While this doesn't directly address the potential DoS issue you
mention, there is an associated deadlock possibility due to not being
able to preempt a pending pagefault: a dma-fence job may require the
same resources held up by the pending pagefault, while servicing the
pagefault in turn depends on that dma-fence being signaled in one way
or another.

That deadlock is handled by only allowing either page-faulting jobs or
dma-fence jobs on a resource (hw engine or hw engine group) that can be
used by both at a time, blocking synchronously in the exec IOCTL until
the resource is available for the job type. That means LR jobs wait for
all dma-fence jobs to complete, and dma-fence jobs wait for all LR jobs
to preempt. So a dma-fence job wait could easily mean "wait for all
outstanding pagefaults to be serviced".
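
A very rough sketch of that admission logic, with made-up names (this
is not the actual xe hw engine group code, just an illustration of
"block in exec until the resource is free of the other job type"):

        #include <linux/spinlock.h>
        #include <linux/wait.h>

        enum exec_mode {
                EXEC_MODE_NONE,         /* group idle, either type may enter */
                EXEC_MODE_DMA_FENCE,    /* dma-fence jobs own the group */
                EXEC_MODE_LR,           /* long-running / page-faulting jobs own it */
        };

        struct hw_engine_group {
                spinlock_t lock;
                enum exec_mode mode;
                unsigned int active;    /* jobs of the current mode */
                wait_queue_head_t idle_wq;
        };

        /* Non-blocking attempt to enter the group in the wanted mode. */
        static bool group_try_enter(struct hw_engine_group *grp, enum exec_mode want)
        {
                bool ok = false;

                spin_lock(&grp->lock);
                if (grp->mode == EXEC_MODE_NONE || grp->mode == want) {
                        grp->mode = want;
                        grp->active++;
                        ok = true;
                }
                spin_unlock(&grp->lock);
                return ok;
        }

        /*
         * Called from the exec IOCTL: block (interruptibly) until no jobs of
         * the other type remain. In the real design dma-fence jobs leave by
         * completing and LR jobs leave by being preempted; the sketch treats
         * both simply as "leaving the group".
         */
        static int group_admit_job(struct hw_engine_group *grp, enum exec_mode want)
        {
                return wait_event_interruptible(grp->idle_wq,
                                                group_try_enter(grp, want));
        }

        static void group_leave(struct hw_engine_group *grp)
        {
                spin_lock(&grp->lock);
                if (--grp->active == 0) {
                        grp->mode = EXEC_MODE_NONE;
                        wake_up_all(&grp->idle_wq);
                }
                spin_unlock(&grp->lock);
        }

If an LR (page-faulting) job never leaves the group, group_admit_job()
for a dma-fence job blocks indefinitely, which is the blocking behavior
discussed above.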

Whether, on the other hand, that is a real DoS we need to care about is
probably a topic for debate. The direction we've had so far is that it's
not: nothing is held up indefinitely, what's held up can be Ctrl-C'd by
the user, and core mm memory management is not blocked, since
mmu_notifiers can execute to completion and shrinkers / eviction can run
while a pagefault is pending.

Thanks,
Thomas
Demi Marie Obenour Feb. 14, 2025, 4:14 p.m. UTC | #4
On Fri, Feb 14, 2025 at 09:47:13AM +0100, Thomas Hellström wrote:
> Hi
> 
> 
> While not directly having addressed the potential DoS issue you
> mention, there is an associated deadlock possibility that may happen
> due to not being able to preempt a pending pagefault. That is if a dma-
> fence job is requiring the same resources held up by the pending page-
> fault, and then the pagefault servicing is dependent on that dma-fence
> to be signaled in one way or another.
> 
> That deadlock is handled by only allowing either page-faulting jobs or
> dma-fence jobs on a resource (hw engine or hw engine group) that can be
> used by both at a time, blocking synchronously in the exec IOCTL until
> the resource is available for the job type. That means LR jobs waits
> for all dma-fence jobs to complete, and dma-fence jobs wait for all LR
> jobs to preempt. So a dma-fence job wait could easily mean "wait for
> all outstanding pagefaults to be serviced".
> 
> Whether, on the other hand, that is a real DoS we need to care about,
> is probably a topic for debate. The directions we've had so far are
> that it's not. Nothing is held up indefinitely, what's held up can be
> Ctrl-C'd by the user and core mm memory management is not blocked since
> mmu_notifiers can execute to completion and shrinkers / eviction can
> execute while a page-fault is pending.

The problem is that a program that uses a page-faulting job can lock out
all other programs on the system from using the GPU for an indefinite
period of time.  In a GUI session, this means a frozen UI, which makes
recovery basically impossible without drastic measures (like rebooting
or logging in over SSH).  That counts as a quite effective denial of
service from an end-user perspective, and unless I am mistaken it would
be very easy to trigger by accident: just start a page-faulting job that
loops forever.

The simplest way to prevent this would be to require DRM master
privileges to spawn page-faulting jobs.  Only the Wayland compositor or
X server will normally have these, and they will never submit a
page-faulting job.  My understanding is that other IOCTLs that can mess
up a compositor also require DRM master privileges, and submitting a
page-faulting job seems to qualify.
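
A minimal sketch of what such a gate could look like, reusing the
existing drm_is_current_master() helper (the surrounding function and
where it would be called from are hypothetical, not something this
series implements):

        #include <linux/types.h>
        #include <linux/errno.h>
        #include <drm/drm_auth.h>
        #include <drm/drm_file.h>
        #include <uapi/drm/xe_drm.h>

        /*
         * Hypothetical check in the VM-bind path: only allow the
         * CPU-address-mirror (page-faulting SVM) flag for the current DRM
         * master, i.e. normally the compositor / X server.
         */
        static int check_cpu_addr_mirror_allowed(struct drm_file *file_priv,
                                                 u32 flags)
        {
                if (!(flags & DRM_XE_VM_BIND_FLAG_CPU_ADDR_MIRROR))
                        return 0;

                if (!drm_is_current_master(file_priv))
                        return -EPERM;

                return 0;
        }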

There is still a legitimate use-case for running long-running workloads
on a GPU used for an interactive session.  However, dma-fence compute
jobs can be long-running as long as they are preemptable, and they are
preemptable as long as they don't need page faults.  Sima, Faith, and
Christian have already come up with a solution for long-running Vulkan
compute.
Thomas Hellström Feb. 14, 2025, 4:26 p.m. UTC | #5
Hi!

On Fri, 2025-02-14 at 11:14 -0500, Demi Marie Obenour wrote:
> The problem is that a program that uses a page-faulting job can lock
> out
> all other programs on the system from using the GPU for an indefinite
> period of time.  In a GUI session, this means a frozen UI, which
> makes
> recovery basically impossible without drastic measures (like
> rebooting
> or logging in over SSH).  That counts as a quite effective denial of
> service from an end-user perspective, and unless I am mistaken it
> would
> be very easy to trigger by accident: just start a page-faulting job
> that
> loops forever.

I think the easiest remedy for this is that if a page-faulting job is
crafted, either on purpose or by mistake, in such a way that it holds up
preemption when preemption is needed (like in the case I described,
where a dma-fence job is submitted), the driver will hit a preemption
timeout and kill the page-faulting job. (I think that is already handled
in all cases in the xe driver, but I would need to double check.) So
this would then boil down to the system administrator configuring the
preemption timeout.


Thanks,
Thomas
Demi Marie Obenour Feb. 14, 2025, 6:36 p.m. UTC | #6
On Fri, Feb 14, 2025 at 05:26:48PM +0100, Thomas Hellström wrote:
> Hi!
> 
> I think the easiest remedy for this is that if a page-faulting job is
> either by purpose or mistake crafted in such a way that it holds up
> preemption when preemption is needed (like in the case I described, a
> dma-fence job is submitted) the driver will hit a preemption timeout
> and kill the pagefaulting job. (I think that is already handled in all
> cases in the xe driver but I would need to double check). So this would
> then boil down to the system administrator configuring the preemption
> timeout.

That makes sense!  That turns a DoS into "don't submit page-faulting jobs
on an interactive system; they won't be reliable."