Message ID | 20241209133318.1806472-15-mika.kuoppala@linux.intel.com (mailing list archive)
---|---
State | New, archived
Series | Intel Xe GPU debug support (eudebug) v3
Am 09.12.24 um 14:33 schrieb Mika Kuoppala: > From: Andrzej Hajda <andrzej.hajda@intel.com> > > Debugger needs to read/write program's vmas including userptr_vma. > Since hmm_range_fault is used to pin userptr vmas, it is possible > to map those vmas from debugger context. Oh, this implementation is extremely questionable as well. Adding the LKML and the MM list as well. First of all hmm_range_fault() does *not* pin anything! In other words you don't have a page reference when the function returns, but rather just a sequence number you can check for modifications. > v2: pin pages vs notifier, move to vm.c (Matthew) > v3: - iterate over system pages instead of DMA, fixes iommu enabled > - s/xe_uvma_access/xe_vm_uvma_access/ (Matt) > > Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com> > Signed-off-by: Maciej Patelczyk <maciej.patelczyk@intel.com> > Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> > Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> #v1 > --- > drivers/gpu/drm/xe/xe_eudebug.c | 3 ++- > drivers/gpu/drm/xe/xe_vm.c | 47 +++++++++++++++++++++++++++++++++ > drivers/gpu/drm/xe/xe_vm.h | 3 +++ > 3 files changed, 52 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/xe/xe_eudebug.c b/drivers/gpu/drm/xe/xe_eudebug.c > index 9d87df75348b..e5949e4dcad8 100644 > --- a/drivers/gpu/drm/xe/xe_eudebug.c > +++ b/drivers/gpu/drm/xe/xe_eudebug.c > @@ -3076,7 +3076,8 @@ static int xe_eudebug_vma_access(struct xe_vma *vma, u64 offset_in_vma, > return ret; > } > > - return -EINVAL; > + return xe_vm_userptr_access(to_userptr_vma(vma), offset_in_vma, > + buf, bytes, write); > } > > static int xe_eudebug_vm_access(struct xe_vm *vm, u64 offset, > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c > index 0f17bc8b627b..224ff9e16941 100644 > --- a/drivers/gpu/drm/xe/xe_vm.c > +++ b/drivers/gpu/drm/xe/xe_vm.c > @@ -3414,3 +3414,50 @@ void xe_vm_snapshot_free(struct xe_vm_snapshot *snap) > } > kvfree(snap); > } > + > +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 offset, > + void *buf, u64 len, bool write) > +{ > + struct xe_vm *vm = xe_vma_vm(&uvma->vma); > + struct xe_userptr *up = &uvma->userptr; > + struct xe_res_cursor cur = {}; > + int cur_len, ret = 0; > + > + while (true) { > + down_read(&vm->userptr.notifier_lock); > + if (!xe_vma_userptr_check_repin(uvma)) > + break; > + > + spin_lock(&vm->userptr.invalidated_lock); > + list_del_init(&uvma->userptr.invalidate_link); > + spin_unlock(&vm->userptr.invalidated_lock); > + > + up_read(&vm->userptr.notifier_lock); > + ret = xe_vma_userptr_pin_pages(uvma); > + if (ret) > + return ret; > + } > + > + if (!up->sg) { > + ret = -EINVAL; > + goto out_unlock_notifier; > + } > + > + for (xe_res_first_sg_system(up->sg, offset, len, &cur); cur.remaining; > + xe_res_next(&cur, cur_len)) { > + void *ptr = kmap_local_page(sg_page(cur.sgl)) + cur.start; The interface basically creates a side channel to access userptrs in the way an userspace application would do without actually going through userspace. That is generally not something a device driver should ever do as far as I can see. > + > + cur_len = min(cur.size, cur.remaining); > + if (write) > + memcpy(ptr, buf, cur_len); > + else > + memcpy(buf, ptr, cur_len); > + kunmap_local(ptr); > + buf += cur_len; > + } > + ret = len; > + > +out_unlock_notifier: > + up_read(&vm->userptr.notifier_lock); I just strongly hope that this will prevent the mapping from changing. Regards, Christian. 
> + return ret; > +} > diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h > index 23adb7442881..372ad40ad67f 100644 > --- a/drivers/gpu/drm/xe/xe_vm.h > +++ b/drivers/gpu/drm/xe/xe_vm.h > @@ -280,3 +280,6 @@ struct xe_vm_snapshot *xe_vm_snapshot_capture(struct xe_vm *vm); > void xe_vm_snapshot_capture_delayed(struct xe_vm_snapshot *snap); > void xe_vm_snapshot_print(struct xe_vm_snapshot *snap, struct drm_printer *p); > void xe_vm_snapshot_free(struct xe_vm_snapshot *snap); > + > +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 offset, > + void *buf, u64 len, bool write);
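For reference, the point Christian makes is that hmm_range_fault() takes no page references; the returned pfns may only be used while the associated mmu_interval_notifier sequence is unchanged, which the caller must recheck under its own lock. A minimal sketch of that canonical pattern, following Documentation/mm/hmm.rst (function and lock names are illustrative, not Xe code):

```c
/*
 * Sketch of the documented hmm_range_fault() usage pattern: no page
 * references are taken; the pfns may only be consumed while
 * mmu_interval_read_retry() confirms no invalidation happened since
 * mmu_interval_read_begin().
 */
#include <linux/hmm.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>
#include <linux/mutex.h>

static int example_fault_and_use(struct mmu_interval_notifier *notifier,
				 struct mutex *pt_lock,
				 unsigned long start, unsigned long end,
				 unsigned long *pfns)
{
	struct hmm_range range = {
		.notifier = notifier,
		.start = start,
		.end = end,
		.hmm_pfns = pfns,
		.default_flags = HMM_PFN_REQ_FAULT,
	};
	struct mm_struct *mm = notifier->mm;
	int ret;

	do {
		range.notifier_seq = mmu_interval_read_begin(notifier);

		mmap_read_lock(mm);
		ret = hmm_range_fault(&range);
		mmap_read_unlock(mm);
		if (ret) {
			if (ret == -EBUSY)
				continue;	/* raced with invalidation, retry */
			return ret;
		}

		mutex_lock(pt_lock);
		if (mmu_interval_read_retry(notifier, range.notifier_seq)) {
			mutex_unlock(pt_lock);	/* invalidated meanwhile, retry */
			continue;
		}
		break;
	} while (true);

	/* pfns are only stable while pt_lock is held and the retry check passed */
	mutex_unlock(pt_lock);
	return 0;
}
```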
(+ Thomas and Matt B who this was reviewed with as a concept) Quoting Christian König (2024-12-09 16:03:04) > Am 09.12.24 um 14:33 schrieb Mika Kuoppala: <SNIP> > > +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 offset, > > + void *buf, u64 len, bool write) > > +{ > > + struct xe_vm *vm = xe_vma_vm(&uvma->vma); > > + struct xe_userptr *up = &uvma->userptr; > > + struct xe_res_cursor cur = {}; > > + int cur_len, ret = 0; > > + > > + while (true) { > > + down_read(&vm->userptr.notifier_lock); > > + if (!xe_vma_userptr_check_repin(uvma)) > > + break; > > + > > + spin_lock(&vm->userptr.invalidated_lock); > > + list_del_init(&uvma->userptr.invalidate_link); > > + spin_unlock(&vm->userptr.invalidated_lock); > > + > > + up_read(&vm->userptr.notifier_lock); > > + ret = xe_vma_userptr_pin_pages(uvma); > > + if (ret) > > + return ret; > > + } > > + > > + if (!up->sg) { > > + ret = -EINVAL; > > + goto out_unlock_notifier; > > + } > > + > > + for (xe_res_first_sg_system(up->sg, offset, len, &cur); cur.remaining; > > + xe_res_next(&cur, cur_len)) { > > + void *ptr = kmap_local_page(sg_page(cur.sgl)) + cur.start; > > The interface basically creates a side channel to access userptrs in the > way an userspace application would do without actually going through > userspace. That's not quite the case here. The whole point of the debugger ability to access memory is to access any VMA in the GPU VM emulating as much as possible like the EUs themselves would do the access. As mentioned in the other threads, that also involves special cache flushes to make sure the contents has been flushed in and out of the EU caches in case we're modifying instruction code for breakpoints as an example. What the memory access function should do is to establish a temporary pinning situation where the memory would be accessible just like it would be for a short batchbuffer, but without interfering with the command streamers. If this should be established in a different way from this patch, then we should adapt of course. > That is generally not something a device driver should ever do as far as > I can see. Given above explanation, does it make more sense? For debugging purposes, we try to emulate the EU threads themselves accessing the memory, not the userspace at all. Regards, Joonas
On Mon, Dec 09, 2024 at 03:03:04PM +0100, Christian König wrote: > Am 09.12.24 um 14:33 schrieb Mika Kuoppala: > > From: Andrzej Hajda <andrzej.hajda@intel.com> > > > > Debugger needs to read/write program's vmas including userptr_vma. > > Since hmm_range_fault is used to pin userptr vmas, it is possible > > to map those vmas from debugger context. > > Oh, this implementation is extremely questionable as well. Adding the LKML > and the MM list as well. > > First of all hmm_range_fault() does *not* pin anything! > > In other words you don't have a page reference when the function returns, > but rather just a sequence number you can check for modifications. I think it's all there, holds the invalidation lock during the critical access/section, drops it when reacquiring pages, retries until it works. I think the issue is more that everyone hand-rolls userptr. Probably time we standardize that and put it into gpuvm as an optional part, with consistent locking, naming (like not calling it _pin_pages when it's unpinnged userptr), kerneldoc and all the nice things so that we stop consistently getting confused by other driver's userptr code. I think that was on the plan originally as an eventual step, I guess time to pump that up. Matt/Thomas, thoughts? -Sima > > > v2: pin pages vs notifier, move to vm.c (Matthew) > > v3: - iterate over system pages instead of DMA, fixes iommu enabled > > - s/xe_uvma_access/xe_vm_uvma_access/ (Matt) > > > > Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com> > > Signed-off-by: Maciej Patelczyk <maciej.patelczyk@intel.com> > > Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> > > Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> #v1 > > --- > > drivers/gpu/drm/xe/xe_eudebug.c | 3 ++- > > drivers/gpu/drm/xe/xe_vm.c | 47 +++++++++++++++++++++++++++++++++ > > drivers/gpu/drm/xe/xe_vm.h | 3 +++ > > 3 files changed, 52 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/gpu/drm/xe/xe_eudebug.c b/drivers/gpu/drm/xe/xe_eudebug.c > > index 9d87df75348b..e5949e4dcad8 100644 > > --- a/drivers/gpu/drm/xe/xe_eudebug.c > > +++ b/drivers/gpu/drm/xe/xe_eudebug.c > > @@ -3076,7 +3076,8 @@ static int xe_eudebug_vma_access(struct xe_vma *vma, u64 offset_in_vma, > > return ret; > > } > > - return -EINVAL; > > + return xe_vm_userptr_access(to_userptr_vma(vma), offset_in_vma, > > + buf, bytes, write); > > } > > static int xe_eudebug_vm_access(struct xe_vm *vm, u64 offset, > > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c > > index 0f17bc8b627b..224ff9e16941 100644 > > --- a/drivers/gpu/drm/xe/xe_vm.c > > +++ b/drivers/gpu/drm/xe/xe_vm.c > > @@ -3414,3 +3414,50 @@ void xe_vm_snapshot_free(struct xe_vm_snapshot *snap) > > } > > kvfree(snap); > > } > > + > > +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 offset, > > + void *buf, u64 len, bool write) > > +{ > > + struct xe_vm *vm = xe_vma_vm(&uvma->vma); > > + struct xe_userptr *up = &uvma->userptr; > > + struct xe_res_cursor cur = {}; > > + int cur_len, ret = 0; > > + > > + while (true) { > > + down_read(&vm->userptr.notifier_lock); > > + if (!xe_vma_userptr_check_repin(uvma)) > > + break; > > + > > + spin_lock(&vm->userptr.invalidated_lock); > > + list_del_init(&uvma->userptr.invalidate_link); > > + spin_unlock(&vm->userptr.invalidated_lock); > > + > > + up_read(&vm->userptr.notifier_lock); > > + ret = xe_vma_userptr_pin_pages(uvma); > > + if (ret) > > + return ret; > > + } > > + > > + if (!up->sg) { > > + ret = -EINVAL; > > + goto out_unlock_notifier; > > + } > > + > > + for 
(xe_res_first_sg_system(up->sg, offset, len, &cur); cur.remaining; > > + xe_res_next(&cur, cur_len)) { > > + void *ptr = kmap_local_page(sg_page(cur.sgl)) + cur.start; > > The interface basically creates a side channel to access userptrs in the way > an userspace application would do without actually going through userspace. > > That is generally not something a device driver should ever do as far as I > can see. > > > + > > + cur_len = min(cur.size, cur.remaining); > > + if (write) > > + memcpy(ptr, buf, cur_len); > > + else > > + memcpy(buf, ptr, cur_len); > > + kunmap_local(ptr); > > + buf += cur_len; > > + } > > + ret = len; > > + > > +out_unlock_notifier: > > + up_read(&vm->userptr.notifier_lock); > > I just strongly hope that this will prevent the mapping from changing. > > Regards, > Christian. > > > + return ret; > > +} > > diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h > > index 23adb7442881..372ad40ad67f 100644 > > --- a/drivers/gpu/drm/xe/xe_vm.h > > +++ b/drivers/gpu/drm/xe/xe_vm.h > > @@ -280,3 +280,6 @@ struct xe_vm_snapshot *xe_vm_snapshot_capture(struct xe_vm *vm); > > void xe_vm_snapshot_capture_delayed(struct xe_vm_snapshot *snap); > > void xe_vm_snapshot_print(struct xe_vm_snapshot *snap, struct drm_printer *p); > > void xe_vm_snapshot_free(struct xe_vm_snapshot *snap); > > + > > +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 offset, > > + void *buf, u64 len, bool write); >
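As a condensed restatement of the loop Sima is describing (for readers following the locking argument; the invalidated-list bookkeeping from the patch is elided, and this is not a replacement for the posted code):

```c
/*
 * Essence of the retry structure in xe_vm_userptr_access() above: take
 * the notifier lock, check whether the pages were invalidated since the
 * last fault, and if so drop the lock, re-fault and try again.  The sg
 * list may only be dereferenced while the lock is held and the check
 * has passed.
 */
static int userptr_access_locked(struct xe_vm *vm, struct xe_userptr_vma *uvma)
{
	int ret;

	for (;;) {
		down_read(&vm->userptr.notifier_lock);
		if (!xe_vma_userptr_check_repin(uvma))
			break;			/* pages still valid: keep the lock */
		up_read(&vm->userptr.notifier_lock);

		ret = xe_vma_userptr_pin_pages(uvma);	/* hmm_range_fault() under the hood */
		if (ret)
			return ret;
	}

	/* ... kmap_local_page() + memcpy over uvma->userptr.sg here ... */

	up_read(&vm->userptr.notifier_lock);
	return 0;
}
```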
Am 09.12.24 um 16:31 schrieb Simona Vetter: > On Mon, Dec 09, 2024 at 03:03:04PM +0100, Christian König wrote: >> Am 09.12.24 um 14:33 schrieb Mika Kuoppala: >>> From: Andrzej Hajda <andrzej.hajda@intel.com> >>> >>> Debugger needs to read/write program's vmas including userptr_vma. >>> Since hmm_range_fault is used to pin userptr vmas, it is possible >>> to map those vmas from debugger context. >> Oh, this implementation is extremely questionable as well. Adding the LKML >> and the MM list as well. >> >> First of all hmm_range_fault() does *not* pin anything! >> >> In other words you don't have a page reference when the function returns, >> but rather just a sequence number you can check for modifications. > I think it's all there, holds the invalidation lock during the critical > access/section, drops it when reacquiring pages, retries until it works. > > I think the issue is more that everyone hand-rolls userptr. Well that is part of the issue. The general problem here is that the eudebug interface tries to simulate the memory accesses as they would have happened by the hardware. What the debugger should probably do is to cleanly attach to the application, get the information which CPU address is mapped to which GPU address and then use the standard ptrace interfaces. The whole interface re-invents a lot of functionality which is already there just because you don't like the idea to attach to the debugged application in userspace. As far as I can see this whole idea is extremely questionable. This looks like re-inventing the wheel in a different color. Regards, Christian. > Probably time > we standardize that and put it into gpuvm as an optional part, with > consistent locking, naming (like not calling it _pin_pages when it's > unpinnged userptr), kerneldoc and all the nice things so that we > stop consistently getting confused by other driver's userptr code. > > I think that was on the plan originally as an eventual step, I guess time > to pump that up. Matt/Thomas, thoughts? 
> -Sima > >>> v2: pin pages vs notifier, move to vm.c (Matthew) >>> v3: - iterate over system pages instead of DMA, fixes iommu enabled >>> - s/xe_uvma_access/xe_vm_uvma_access/ (Matt) >>> >>> Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com> >>> Signed-off-by: Maciej Patelczyk <maciej.patelczyk@intel.com> >>> Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> >>> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> #v1 >>> --- >>> drivers/gpu/drm/xe/xe_eudebug.c | 3 ++- >>> drivers/gpu/drm/xe/xe_vm.c | 47 +++++++++++++++++++++++++++++++++ >>> drivers/gpu/drm/xe/xe_vm.h | 3 +++ >>> 3 files changed, 52 insertions(+), 1 deletion(-) >>> >>> diff --git a/drivers/gpu/drm/xe/xe_eudebug.c b/drivers/gpu/drm/xe/xe_eudebug.c >>> index 9d87df75348b..e5949e4dcad8 100644 >>> --- a/drivers/gpu/drm/xe/xe_eudebug.c >>> +++ b/drivers/gpu/drm/xe/xe_eudebug.c >>> @@ -3076,7 +3076,8 @@ static int xe_eudebug_vma_access(struct xe_vma *vma, u64 offset_in_vma, >>> return ret; >>> } >>> - return -EINVAL; >>> + return xe_vm_userptr_access(to_userptr_vma(vma), offset_in_vma, >>> + buf, bytes, write); >>> } >>> static int xe_eudebug_vm_access(struct xe_vm *vm, u64 offset, >>> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c >>> index 0f17bc8b627b..224ff9e16941 100644 >>> --- a/drivers/gpu/drm/xe/xe_vm.c >>> +++ b/drivers/gpu/drm/xe/xe_vm.c >>> @@ -3414,3 +3414,50 @@ void xe_vm_snapshot_free(struct xe_vm_snapshot *snap) >>> } >>> kvfree(snap); >>> } >>> + >>> +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 offset, >>> + void *buf, u64 len, bool write) >>> +{ >>> + struct xe_vm *vm = xe_vma_vm(&uvma->vma); >>> + struct xe_userptr *up = &uvma->userptr; >>> + struct xe_res_cursor cur = {}; >>> + int cur_len, ret = 0; >>> + >>> + while (true) { >>> + down_read(&vm->userptr.notifier_lock); >>> + if (!xe_vma_userptr_check_repin(uvma)) >>> + break; >>> + >>> + spin_lock(&vm->userptr.invalidated_lock); >>> + list_del_init(&uvma->userptr.invalidate_link); >>> + spin_unlock(&vm->userptr.invalidated_lock); >>> + >>> + up_read(&vm->userptr.notifier_lock); >>> + ret = xe_vma_userptr_pin_pages(uvma); >>> + if (ret) >>> + return ret; >>> + } >>> + >>> + if (!up->sg) { >>> + ret = -EINVAL; >>> + goto out_unlock_notifier; >>> + } >>> + >>> + for (xe_res_first_sg_system(up->sg, offset, len, &cur); cur.remaining; >>> + xe_res_next(&cur, cur_len)) { >>> + void *ptr = kmap_local_page(sg_page(cur.sgl)) + cur.start; >> The interface basically creates a side channel to access userptrs in the way >> an userspace application would do without actually going through userspace. >> >> That is generally not something a device driver should ever do as far as I >> can see. >> >>> + >>> + cur_len = min(cur.size, cur.remaining); >>> + if (write) >>> + memcpy(ptr, buf, cur_len); >>> + else >>> + memcpy(buf, ptr, cur_len); >>> + kunmap_local(ptr); >>> + buf += cur_len; >>> + } >>> + ret = len; >>> + >>> +out_unlock_notifier: >>> + up_read(&vm->userptr.notifier_lock); >> I just strongly hope that this will prevent the mapping from changing. >> >> Regards, >> Christian. 
>> >>> + return ret; >>> +} >>> diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h >>> index 23adb7442881..372ad40ad67f 100644 >>> --- a/drivers/gpu/drm/xe/xe_vm.h >>> +++ b/drivers/gpu/drm/xe/xe_vm.h >>> @@ -280,3 +280,6 @@ struct xe_vm_snapshot *xe_vm_snapshot_capture(struct xe_vm *vm); >>> void xe_vm_snapshot_capture_delayed(struct xe_vm_snapshot *snap); >>> void xe_vm_snapshot_print(struct xe_vm_snapshot *snap, struct drm_printer *p); >>> void xe_vm_snapshot_free(struct xe_vm_snapshot *snap); >>> + >>> +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 offset, >>> + void *buf, u64 len, bool write);
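A rough userspace sketch of the alternative Christian proposes, where the debugger attaches to the CPU process and reuses the existing ptrace machinery; gpu_va_to_cpu_va() stands in for whatever GPU-to-CPU address translation the debugger would need and is purely hypothetical:

```c
#define _GNU_SOURCE
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical: maps a GPU VA back to a CPU VA in the target process. */
extern uintptr_t gpu_va_to_cpu_va(pid_t pid, uint64_t gpu_va);

static ssize_t debugger_read(pid_t pid, uint64_t gpu_va, void *buf, size_t len)
{
	struct iovec local = { .iov_base = buf, .iov_len = len };
	struct iovec remote;
	ssize_t ret;

	if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == -1)
		return -1;
	waitpid(pid, NULL, 0);			/* wait for the target to stop */

	remote.iov_base = (void *)gpu_va_to_cpu_va(pid, gpu_va);
	remote.iov_len = len;
	ret = process_vm_readv(pid, &local, 1, &remote, 1, 0);

	ptrace(PTRACE_DETACH, pid, NULL, NULL);
	return ret;
}
```

This is the flow that requires the memory to be CPU-mapped and the target's CPU threads to be stopped while the debugger works, which is one of the drawbacks Joonas raises below.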
Am 09.12.24 um 16:42 schrieb Christian König: > Am 09.12.24 um 16:31 schrieb Simona Vetter: >> On Mon, Dec 09, 2024 at 03:03:04PM +0100, Christian König wrote: >>> Am 09.12.24 um 14:33 schrieb Mika Kuoppala: >>>> From: Andrzej Hajda <andrzej.hajda@intel.com> >>>> >>>> Debugger needs to read/write program's vmas including userptr_vma. >>>> Since hmm_range_fault is used to pin userptr vmas, it is possible >>>> to map those vmas from debugger context. >>> Oh, this implementation is extremely questionable as well. Adding >>> the LKML >>> and the MM list as well. >>> >>> First of all hmm_range_fault() does *not* pin anything! >>> >>> In other words you don't have a page reference when the function >>> returns, >>> but rather just a sequence number you can check for modifications. >> I think it's all there, holds the invalidation lock during the critical >> access/section, drops it when reacquiring pages, retries until it works. One thing I'm missing: Is it possible that mappings are created read-only? E.g. that you have a read-only mapping of libc and then write through this interface to it? Of hand I don't see anything preventing this (well could be that you don't allow creating read-only mappings). Regards, Christian. >> >> I think the issue is more that everyone hand-rolls userptr. > > Well that is part of the issue. > > The general problem here is that the eudebug interface tries to > simulate the memory accesses as they would have happened by the hardware. > > What the debugger should probably do is to cleanly attach to the > application, get the information which CPU address is mapped to which > GPU address and then use the standard ptrace interfaces. > > The whole interface re-invents a lot of functionality which is already > there just because you don't like the idea to attach to the debugged > application in userspace. > > As far as I can see this whole idea is extremely questionable. This > looks like re-inventing the wheel in a different color. > > Regards, > Christian. > >> Probably time >> we standardize that and put it into gpuvm as an optional part, with >> consistent locking, naming (like not calling it _pin_pages when it's >> unpinnged userptr), kerneldoc and all the nice things so that we >> stop consistently getting confused by other driver's userptr code. >> >> I think that was on the plan originally as an eventual step, I guess >> time >> to pump that up. Matt/Thomas, thoughts? 
>> -Sima >> >>>> v2: pin pages vs notifier, move to vm.c (Matthew) >>>> v3: - iterate over system pages instead of DMA, fixes iommu enabled >>>> - s/xe_uvma_access/xe_vm_uvma_access/ (Matt) >>>> >>>> Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com> >>>> Signed-off-by: Maciej Patelczyk <maciej.patelczyk@intel.com> >>>> Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> >>>> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> #v1 >>>> --- >>>> drivers/gpu/drm/xe/xe_eudebug.c | 3 ++- >>>> drivers/gpu/drm/xe/xe_vm.c | 47 >>>> +++++++++++++++++++++++++++++++++ >>>> drivers/gpu/drm/xe/xe_vm.h | 3 +++ >>>> 3 files changed, 52 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/drivers/gpu/drm/xe/xe_eudebug.c >>>> b/drivers/gpu/drm/xe/xe_eudebug.c >>>> index 9d87df75348b..e5949e4dcad8 100644 >>>> --- a/drivers/gpu/drm/xe/xe_eudebug.c >>>> +++ b/drivers/gpu/drm/xe/xe_eudebug.c >>>> @@ -3076,7 +3076,8 @@ static int xe_eudebug_vma_access(struct >>>> xe_vma *vma, u64 offset_in_vma, >>>> return ret; >>>> } >>>> - return -EINVAL; >>>> + return xe_vm_userptr_access(to_userptr_vma(vma), offset_in_vma, >>>> + buf, bytes, write); >>>> } >>>> static int xe_eudebug_vm_access(struct xe_vm *vm, u64 offset, >>>> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c >>>> index 0f17bc8b627b..224ff9e16941 100644 >>>> --- a/drivers/gpu/drm/xe/xe_vm.c >>>> +++ b/drivers/gpu/drm/xe/xe_vm.c >>>> @@ -3414,3 +3414,50 @@ void xe_vm_snapshot_free(struct >>>> xe_vm_snapshot *snap) >>>> } >>>> kvfree(snap); >>>> } >>>> + >>>> +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 offset, >>>> + void *buf, u64 len, bool write) >>>> +{ >>>> + struct xe_vm *vm = xe_vma_vm(&uvma->vma); >>>> + struct xe_userptr *up = &uvma->userptr; >>>> + struct xe_res_cursor cur = {}; >>>> + int cur_len, ret = 0; >>>> + >>>> + while (true) { >>>> + down_read(&vm->userptr.notifier_lock); >>>> + if (!xe_vma_userptr_check_repin(uvma)) >>>> + break; >>>> + >>>> + spin_lock(&vm->userptr.invalidated_lock); >>>> + list_del_init(&uvma->userptr.invalidate_link); >>>> + spin_unlock(&vm->userptr.invalidated_lock); >>>> + >>>> + up_read(&vm->userptr.notifier_lock); >>>> + ret = xe_vma_userptr_pin_pages(uvma); >>>> + if (ret) >>>> + return ret; >>>> + } >>>> + >>>> + if (!up->sg) { >>>> + ret = -EINVAL; >>>> + goto out_unlock_notifier; >>>> + } >>>> + >>>> + for (xe_res_first_sg_system(up->sg, offset, len, &cur); >>>> cur.remaining; >>>> + xe_res_next(&cur, cur_len)) { >>>> + void *ptr = kmap_local_page(sg_page(cur.sgl)) + cur.start; >>> The interface basically creates a side channel to access userptrs in >>> the way >>> an userspace application would do without actually going through >>> userspace. >>> >>> That is generally not something a device driver should ever do as >>> far as I >>> can see. >>> >>>> + >>>> + cur_len = min(cur.size, cur.remaining); >>>> + if (write) >>>> + memcpy(ptr, buf, cur_len); >>>> + else >>>> + memcpy(buf, ptr, cur_len); >>>> + kunmap_local(ptr); >>>> + buf += cur_len; >>>> + } >>>> + ret = len; >>>> + >>>> +out_unlock_notifier: >>>> + up_read(&vm->userptr.notifier_lock); >>> I just strongly hope that this will prevent the mapping from changing. >>> >>> Regards, >>> Christian. 
>>> >>>> + return ret; >>>> +} >>>> diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h >>>> index 23adb7442881..372ad40ad67f 100644 >>>> --- a/drivers/gpu/drm/xe/xe_vm.h >>>> +++ b/drivers/gpu/drm/xe/xe_vm.h >>>> @@ -280,3 +280,6 @@ struct xe_vm_snapshot >>>> *xe_vm_snapshot_capture(struct xe_vm *vm); >>>> void xe_vm_snapshot_capture_delayed(struct xe_vm_snapshot *snap); >>>> void xe_vm_snapshot_print(struct xe_vm_snapshot *snap, struct >>>> drm_printer *p); >>>> void xe_vm_snapshot_free(struct xe_vm_snapshot *snap); >>>> + >>>> +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 offset, >>>> + void *buf, u64 len, bool write); >
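One way the read-only concern could be addressed on the fault side (a sketch of a possible mitigation, not what the posted patch does): request write faults when the debugger writes, in which case hmm_range_fault() returns -EPERM for a VMA that is not writable instead of handing back pages the kernel would then scribble over.

```c
#include <linux/hmm.h>
#include <linux/types.h>

/*
 * Sketch: choose fault flags based on the access direction so a debugger
 * write can never land in a read-only user mapping (e.g. libc text);
 * hmm_range_fault() fails with -EPERM when HMM_PFN_REQ_WRITE cannot be
 * satisfied by the underlying VMA.
 */
static void debug_access_set_flags(struct hmm_range *range, bool write)
{
	range->default_flags = HMM_PFN_REQ_FAULT;
	if (write)
		range->default_flags |= HMM_PFN_REQ_WRITE;
}
```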
Quoting Christian König (2024-12-09 17:42:32) > Am 09.12.24 um 16:31 schrieb Simona Vetter: > > On Mon, Dec 09, 2024 at 03:03:04PM +0100, Christian König wrote: > >> Am 09.12.24 um 14:33 schrieb Mika Kuoppala: > >>> From: Andrzej Hajda <andrzej.hajda@intel.com> > >>> > >>> Debugger needs to read/write program's vmas including userptr_vma. > >>> Since hmm_range_fault is used to pin userptr vmas, it is possible > >>> to map those vmas from debugger context. > >> Oh, this implementation is extremely questionable as well. Adding the LKML > >> and the MM list as well. > >> > >> First of all hmm_range_fault() does *not* pin anything! > >> > >> In other words you don't have a page reference when the function returns, > >> but rather just a sequence number you can check for modifications. > > I think it's all there, holds the invalidation lock during the critical > > access/section, drops it when reacquiring pages, retries until it works. > > > > I think the issue is more that everyone hand-rolls userptr. > > Well that is part of the issue. > > The general problem here is that the eudebug interface tries to simulate > the memory accesses as they would have happened by the hardware. Could you elaborate, what is that a problem in that, exactly? It's pretty much the equivalent of ptrace() poke/peek but for GPU memory. And it is exactly the kind of interface that makes sense for debugger as GPU memory != CPU memory, and they don't need to align at all. > What the debugger should probably do is to cleanly attach to the > application, get the information which CPU address is mapped to which > GPU address and then use the standard ptrace interfaces. I don't quite agree here -- at all. "Which CPU address is mapped to which GPU address" makes no sense when the GPU address space and CPU address space is completely controlled by the userspace driver/application. Please try to consider things outside of the ROCm architecture. Something like a register scratch region or EU instructions should not even be mapped to CPU address space as CPU has no business accessing it during normal operation. And backing of such region will vary per context/LRC on the same virtual address per EU thread. You seem to be suggesting to rewrite even our userspace driver to behave the same way as ROCm driver does just so that we could implement debug memory accesses via ptrace() to the CPU address space. That seems bit of a radical suggestion, especially given the drawbacks pointed out in your suggested design. > The whole interface re-invents a lot of functionality which is already > there I'm not really sure I would call adding a single interface for memory reading and writing to be "re-inventing a lot of functionality". All the functionality behind this interface will be needed by GPU core dumping, anyway. Just like for the other patch series. > just because you don't like the idea to attach to the debugged > application in userspace. A few points that have been brought up as drawback to the GPU debug through ptrace(), but to recap a few relevant ones for this discussion: - You can only really support GDB stop-all mode or at least have to stop all the CPU threads while you control the GPU threads to avoid interference. Elaborated on this on the other threads more. - Controlling the GPU threads will always interfere with CPU threads. Doesn't seem feasible to single-step an EU thread while CPU threads continue to run freely? 
- You are very much restricted by the CPU VA ~ GPU VA alignment requirement, which is not true for OpenGL or Vulkan etc. Seems like one of the reasons why ROCm debugging is not easily extendable outside compute? - You have to expose extra memory to CPU process just for GPU debugger access and keep track of GPU VA for each. Makes the GPU more prone to OOB writes from CPU. Exactly what not mapping the memory to CPU tried to protect the GPU from to begin with. > As far as I can see this whole idea is extremely questionable. This > looks like re-inventing the wheel in a different color. I see it like reinventing a round wheel compared to octagonal wheel. Could you elaborate with facts much more on your position why the ROCm debugger design is an absolute must for others to adopt? Otherwise it just looks like you are trying to prevent others from implementing a more flexible debugging interface through vague comments about "questionable design" without going into details. Not listing much concrete benefits nor addressing the very concretely expressed drawbacks of your suggested design, makes it seem like a very biased non-technical discussion. So while review interest and any comments are very much appreciated, please also work on providing bit more reasoning and facts instead of just claiming things. That'll help make the discussion much more fruitful. Regards, Joonas
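For context, the debugger-side view of the interface under discussion is plain peek/poke keyed on GPU virtual addresses. A hypothetical usage sketch, assuming (as elsewhere in this series) that the eudebug VM handle is an fd supporting pread()/pwrite() at GPU VAs; this is an assumption for illustration, not a statement of the final uAPI:

```c
#define _FILE_OFFSET_BITS 64
#include <unistd.h>
#include <stdint.h>

/*
 * Hypothetical debugger-side helper: write one instruction word (e.g. a
 * breakpoint opcode) at a GPU virtual address through the eudebug VM fd,
 * without any CPU mapping of the backing object.
 */
static int poke_gpu_u32(int vm_fd, uint64_t gpu_va, uint32_t value)
{
	return pwrite(vm_fd, &value, sizeof(value), (off_t)gpu_va) == sizeof(value) ? 0 : -1;
}
```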
Am 10.12.24 um 10:33 schrieb Joonas Lahtinen: > Quoting Christian König (2024-12-09 17:42:32) >> Am 09.12.24 um 16:31 schrieb Simona Vetter: >>> On Mon, Dec 09, 2024 at 03:03:04PM +0100, Christian König wrote: >>>> Am 09.12.24 um 14:33 schrieb Mika Kuoppala: >>>>> From: Andrzej Hajda<andrzej.hajda@intel.com> >>>>> >>>>> Debugger needs to read/write program's vmas including userptr_vma. >>>>> Since hmm_range_fault is used to pin userptr vmas, it is possible >>>>> to map those vmas from debugger context. >>>> Oh, this implementation is extremely questionable as well. Adding the LKML >>>> and the MM list as well. >>>> >>>> First of all hmm_range_fault() does *not* pin anything! >>>> >>>> In other words you don't have a page reference when the function returns, >>>> but rather just a sequence number you can check for modifications. >>> I think it's all there, holds the invalidation lock during the critical >>> access/section, drops it when reacquiring pages, retries until it works. >>> >>> I think the issue is more that everyone hand-rolls userptr. >> Well that is part of the issue. >> >> The general problem here is that the eudebug interface tries to simulate >> the memory accesses as they would have happened by the hardware. > Could you elaborate, what is that a problem in that, exactly? > > It's pretty much the equivalent of ptrace() poke/peek but for GPU memory. Exactly that here. You try to debug the GPU without taking control of the CPU process. This means that you have to re-implement all debug functionalities which where previously invested for the CPU process for the GPU once more. And that in turn creates a massive attack surface for security related problems, especially when you start messing with things like userptrs which have a very low level interaction with core memory management. > And it is exactly the kind of interface that makes sense for debugger as > GPU memory != CPU memory, and they don't need to align at all. And that is what I strongly disagree on. When you debug the GPU it is mandatory to gain control of the CPU process as well. The CPU process is basically the overseer of the GPU activity, so it should know everything about the GPU operation, for example what a mapping actually means. The kernel driver and the hardware only have the information necessary to execute the work prepared by the CPU process. So the information available is limited to begin with. >> What the debugger should probably do is to cleanly attach to the >> application, get the information which CPU address is mapped to which >> GPU address and then use the standard ptrace interfaces. > I don't quite agree here -- at all. "Which CPU address is mapped to > which GPU address" makes no sense when the GPU address space and CPU > address space is completely controlled by the userspace driver/application. Yeah, that's the reason why you should ask the userspace driver/application for the necessary information and not go over the kernel to debug things. > Please try to consider things outside of the ROCm architecture. Well I consider a good part of the ROCm architecture rather broken exactly because we haven't pushed back hard enough on bad ideas. > Something like a register scratch region or EU instructions should not > even be mapped to CPU address space as CPU has no business accessing it > during normal operation. And backing of such region will vary per > context/LRC on the same virtual address per EU thread. 
> > You seem to be suggesting to rewrite even our userspace driver to behave > the same way as ROCm driver does just so that we could implement debug memory > accesses via ptrace() to the CPU address space. Oh, well certainly not. That ROCm has an 1 to 1 mapping between CPU and GPU is one thing I've pushed back massively on and has now proven to be problematic. > That seems bit of a radical suggestion, especially given the drawbacks > pointed out in your suggested design. > >> The whole interface re-invents a lot of functionality which is already >> there > I'm not really sure I would call adding a single interface for memory > reading and writing to be "re-inventing a lot of functionality". > > All the functionality behind this interface will be needed by GPU core > dumping, anyway. Just like for the other patch series. As far as I can see exactly that's an absolutely no-go. Device core dumping should *never ever* touch memory imported by userptrs. That's what process core dumping is good for. >> just because you don't like the idea to attach to the debugged >> application in userspace. > A few points that have been brought up as drawback to the > GPU debug through ptrace(), but to recap a few relevant ones for this > discussion: > > - You can only really support GDB stop-all mode or at least have to > stop all the CPU threads while you control the GPU threads to > avoid interference. Elaborated on this on the other threads more. > - Controlling the GPU threads will always interfere with CPU threads. > Doesn't seem feasible to single-step an EU thread while CPU threads > continue to run freely? I would say no. > - You are very much restricted by the CPU VA ~ GPU VA alignment > requirement, which is not true for OpenGL or Vulkan etc. Seems > like one of the reasons why ROCm debugging is not easily extendable > outside compute? Well as long as you can't take debugged threads from the hardware you can pretty much forget any OpenGL or Vulkan debugging with this interface since it violates the dma_fence restrictions in the kernel. > - You have to expose extra memory to CPU process just for GPU > debugger access and keep track of GPU VA for each. Makes the GPU more > prone to OOB writes from CPU. Exactly what not mapping the memory > to CPU tried to protect the GPU from to begin with. > >> As far as I can see this whole idea is extremely questionable. This >> looks like re-inventing the wheel in a different color. > I see it like reinventing a round wheel compared to octagonal wheel. > > Could you elaborate with facts much more on your position why the ROCm > debugger design is an absolute must for others to adopt? Well I'm trying to prevent some of the mistakes we did with the ROCm design. And trying to re-invent well proven kernel interfaces is one of the big mistakes made in the ROCm design. If you really want to expose an interface to userspace which walks the process page table, installs an MMU notifier, kmaps the resulting page and then memcpy to/from it then you absolutely *must* run that by guys like Christoph Hellwig, Andrew and even Linus. I'm pretty sure that those guys will note that a device driver should absolutely not mess with such stuff. Regards, Christian. > Otherwise it just looks like you are trying to prevent others from > implementing a more flexible debugging interface through vague comments about > "questionable design" without going into details. 
Not listing much concrete > benefits nor addressing the very concretely expressed drawbacks of your > suggested design, makes it seem like a very biased non-technical discussion. > > So while review interest and any comments are very much appreciated, please > also work on providing bit more reasoning and facts instead of just claiming > things. That'll help make the discussion much more fruitful. > > Regards, Joonas
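For comparison, the in-kernel primitive behind ptrace()'s peek/poke already does the walk-page-table/kmap/memcpy sequence Christian describes: access_remote_vm(), built on get_user_pages_remote(). A sketch of what using that existing interface against the userptr's originating mm could look like, assuming the caller holds a valid mm reference; illustrative only, not what the patch does:

```c
#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/sched/mm.h>

/*
 * Illustration: access_remote_vm() pins the pages with
 * get_user_pages_remote(), kmaps them and memcpys, and with plain
 * FOLL_WRITE (no FOLL_FORCE) it refuses to write into read-only VMAs.
 * "mm" is assumed to be the mm the userptr was created from.
 */
static int example_remote_access(struct mm_struct *mm, unsigned long user_addr,
				 void *buf, int len, bool write)
{
	unsigned int gup_flags = write ? FOLL_WRITE : 0;

	if (!mmget_not_zero(mm))
		return -ESRCH;
	len = access_remote_vm(mm, user_addr, buf, len, gup_flags);
	mmput(mm);
	return len;	/* number of bytes transferred */
}
```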
On Mon, Dec 09, 2024 at 04:42:32PM +0100, Christian König wrote: > Am 09.12.24 um 16:31 schrieb Simona Vetter: > > On Mon, Dec 09, 2024 at 03:03:04PM +0100, Christian König wrote: > > > Am 09.12.24 um 14:33 schrieb Mika Kuoppala: > > > > From: Andrzej Hajda <andrzej.hajda@intel.com> > > > > > > > > Debugger needs to read/write program's vmas including userptr_vma. > > > > Since hmm_range_fault is used to pin userptr vmas, it is possible > > > > to map those vmas from debugger context. > > > Oh, this implementation is extremely questionable as well. Adding the LKML > > > and the MM list as well. > > > > > > First of all hmm_range_fault() does *not* pin anything! > > > > > > In other words you don't have a page reference when the function returns, > > > but rather just a sequence number you can check for modifications. > > I think it's all there, holds the invalidation lock during the critical > > access/section, drops it when reacquiring pages, retries until it works. > > > > I think the issue is more that everyone hand-rolls userptr. > > Well that is part of the issue. Yeah I ignored the other part, because that didn't seem super interesting really. Like for compute you can make the architectural assumption that gpu addresses match cpu addresses, and this all becomes easy. Or at least easier, there's still the issue of having to call into the driver for gpu side flushes. But for 3d userptr that's not the case, and you get two options: - Have some tracking structure that umd and debugger agree on, so stable abi fun and all that, so you can find the mapping. And I think in some cases this would need to be added first. - Just ask the kernel, which already knows. Like for cpu mmaps we also don't inject tracking code into mmap/munmap, we just ask the kernel through /proc/pid/maps. This sounds like the same question, probably should have a similar answer. I guess you can do a bit a bikeshed about whether the kernel should only do the address translation and then you do a ptrace poke/peek for the actual access. But again you need some flushes, so this might be a bit silly. But fundamentally this makes sense to me to ask the entity that actually knows how userptr areas map to memory. -Sima > > The general problem here is that the eudebug interface tries to simulate the > memory accesses as they would have happened by the hardware. > > What the debugger should probably do is to cleanly attach to the > application, get the information which CPU address is mapped to which GPU > address and then use the standard ptrace interfaces. > > The whole interface re-invents a lot of functionality which is already there > just because you don't like the idea to attach to the debugged application > in userspace. > > As far as I can see this whole idea is extremely questionable. This looks > like re-inventing the wheel in a different color. > > Regards, > Christian. > > > Probably time > > we standardize that and put it into gpuvm as an optional part, with > > consistent locking, naming (like not calling it _pin_pages when it's > > unpinnged userptr), kerneldoc and all the nice things so that we > > stop consistently getting confused by other driver's userptr code. > > > > I think that was on the plan originally as an eventual step, I guess time > > to pump that up. Matt/Thomas, thoughts? 
> > -Sima > > > > > > v2: pin pages vs notifier, move to vm.c (Matthew) > > > > v3: - iterate over system pages instead of DMA, fixes iommu enabled > > > > - s/xe_uvma_access/xe_vm_uvma_access/ (Matt) > > > > > > > > Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com> > > > > Signed-off-by: Maciej Patelczyk <maciej.patelczyk@intel.com> > > > > Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> > > > > Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> #v1 > > > > --- > > > > drivers/gpu/drm/xe/xe_eudebug.c | 3 ++- > > > > drivers/gpu/drm/xe/xe_vm.c | 47 +++++++++++++++++++++++++++++++++ > > > > drivers/gpu/drm/xe/xe_vm.h | 3 +++ > > > > 3 files changed, 52 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_eudebug.c b/drivers/gpu/drm/xe/xe_eudebug.c > > > > index 9d87df75348b..e5949e4dcad8 100644 > > > > --- a/drivers/gpu/drm/xe/xe_eudebug.c > > > > +++ b/drivers/gpu/drm/xe/xe_eudebug.c > > > > @@ -3076,7 +3076,8 @@ static int xe_eudebug_vma_access(struct xe_vma *vma, u64 offset_in_vma, > > > > return ret; > > > > } > > > > - return -EINVAL; > > > > + return xe_vm_userptr_access(to_userptr_vma(vma), offset_in_vma, > > > > + buf, bytes, write); > > > > } > > > > static int xe_eudebug_vm_access(struct xe_vm *vm, u64 offset, > > > > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c > > > > index 0f17bc8b627b..224ff9e16941 100644 > > > > --- a/drivers/gpu/drm/xe/xe_vm.c > > > > +++ b/drivers/gpu/drm/xe/xe_vm.c > > > > @@ -3414,3 +3414,50 @@ void xe_vm_snapshot_free(struct xe_vm_snapshot *snap) > > > > } > > > > kvfree(snap); > > > > } > > > > + > > > > +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 offset, > > > > + void *buf, u64 len, bool write) > > > > +{ > > > > + struct xe_vm *vm = xe_vma_vm(&uvma->vma); > > > > + struct xe_userptr *up = &uvma->userptr; > > > > + struct xe_res_cursor cur = {}; > > > > + int cur_len, ret = 0; > > > > + > > > > + while (true) { > > > > + down_read(&vm->userptr.notifier_lock); > > > > + if (!xe_vma_userptr_check_repin(uvma)) > > > > + break; > > > > + > > > > + spin_lock(&vm->userptr.invalidated_lock); > > > > + list_del_init(&uvma->userptr.invalidate_link); > > > > + spin_unlock(&vm->userptr.invalidated_lock); > > > > + > > > > + up_read(&vm->userptr.notifier_lock); > > > > + ret = xe_vma_userptr_pin_pages(uvma); > > > > + if (ret) > > > > + return ret; > > > > + } > > > > + > > > > + if (!up->sg) { > > > > + ret = -EINVAL; > > > > + goto out_unlock_notifier; > > > > + } > > > > + > > > > + for (xe_res_first_sg_system(up->sg, offset, len, &cur); cur.remaining; > > > > + xe_res_next(&cur, cur_len)) { > > > > + void *ptr = kmap_local_page(sg_page(cur.sgl)) + cur.start; > > > The interface basically creates a side channel to access userptrs in the way > > > an userspace application would do without actually going through userspace. > > > > > > That is generally not something a device driver should ever do as far as I > > > can see. > > > > > > > + > > > > + cur_len = min(cur.size, cur.remaining); > > > > + if (write) > > > > + memcpy(ptr, buf, cur_len); > > > > + else > > > > + memcpy(buf, ptr, cur_len); > > > > + kunmap_local(ptr); > > > > + buf += cur_len; > > > > + } > > > > + ret = len; > > > > + > > > > +out_unlock_notifier: > > > > + up_read(&vm->userptr.notifier_lock); > > > I just strongly hope that this will prevent the mapping from changing. > > > > > > Regards, > > > Christian. 
> > > > > > > + return ret; > > > > +} > > > > diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h > > > > index 23adb7442881..372ad40ad67f 100644 > > > > --- a/drivers/gpu/drm/xe/xe_vm.h > > > > +++ b/drivers/gpu/drm/xe/xe_vm.h > > > > @@ -280,3 +280,6 @@ struct xe_vm_snapshot *xe_vm_snapshot_capture(struct xe_vm *vm); > > > > void xe_vm_snapshot_capture_delayed(struct xe_vm_snapshot *snap); > > > > void xe_vm_snapshot_print(struct xe_vm_snapshot *snap, struct drm_printer *p); > > > > void xe_vm_snapshot_free(struct xe_vm_snapshot *snap); > > > > + > > > > +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 offset, > > > > + void *buf, u64 len, bool write); >
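The CPU-side precedent Sima points to looks like this from a debugger's perspective: the mapping information comes from the kernel, not from hooking mmap()/munmap() in the target. A minimal sketch of the /proc/<pid>/maps query:

```c
#include <stdio.h>

/* Dump the target's CPU mappings as reported by the kernel. */
static void dump_cpu_mappings(int pid)
{
	char path[64], line[512];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/maps", pid);
	f = fopen(path, "r");
	if (!f)
		return;
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);	/* start-end perms offset dev inode path */
	fclose(f);
}
```

The analogous GPU-side question is whether the kernel, which already knows the ppGTT layout, should answer the same query for GPU VMs.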
Quoting Christian König (2024-12-10 12:00:48) > Am 10.12.24 um 10:33 schrieb Joonas Lahtinen: > > Quoting Christian König (2024-12-09 17:42:32) > > Am 09.12.24 um 16:31 schrieb Simona Vetter: > > On Mon, Dec 09, 2024 at 03:03:04PM +0100, Christian König wrote: > > Am 09.12.24 um 14:33 schrieb Mika Kuoppala: > > From: Andrzej Hajda <andrzej.hajda@intel.com> > > Debugger needs to read/write program's vmas including userptr_vma. > Since hmm_range_fault is used to pin userptr vmas, it is possible > to map those vmas from debugger context. > > Oh, this implementation is extremely questionable as well. Adding the LKML > and the MM list as well. > > First of all hmm_range_fault() does *not* pin anything! > > In other words you don't have a page reference when the function returns, > but rather just a sequence number you can check for modifications. > > I think it's all there, holds the invalidation lock during the critical > access/section, drops it when reacquiring pages, retries until it works. > > I think the issue is more that everyone hand-rolls userptr. > > Well that is part of the issue. > > The general problem here is that the eudebug interface tries to simulate > the memory accesses as they would have happened by the hardware. > > Could you elaborate, what is that a problem in that, exactly? > > It's pretty much the equivalent of ptrace() poke/peek but for GPU memory. > > > Exactly that here. You try to debug the GPU without taking control of the CPU > process. You seem to have a built-in expectation that the CPU threads and memory space must be interfered with in order to debug a completely different set of threads and memory space elsewhere that executes independently. I don't quite see why? In debugging massively parallel workloads, it's a huge drawback to be limited to stop all mode in GDB. If ROCm folks are fine with such limitation, I have nothing against them keeping that limitation. Just it was a starting design principle for this design to avoid such a limitation. > This means that you have to re-implement all debug functionalities which where > previously invested for the CPU process for the GPU once more. Seems like a strawman argument. Can you list the "all interfaces" being added that would be possible via indirection via ptrace() beyond peek/poke? > And that in turn creates a massive attack surface for security related > problems, especially when you start messing with things like userptrs which > have a very low level interaction with core memory management. Again, just seems like a strawman argument. You seem to generalize to some massive attack surface of hypothetical interfaces which you don't list. We're talking about memory peek/poke here. Can you explain the high-level difference from security perspective for temporarily pinning userptr pages to write them to page tables for GPU to execute a dma-fence workload with and temporarily pinning pages for peek/poke? > And it is exactly the kind of interface that makes sense for debugger as > GPU memory != CPU memory, and they don't need to align at all. > > > And that is what I strongly disagree on. When you debug the GPU it is mandatory > to gain control of the CPU process as well. You are free to disagree on that. I simply don't agree and have in this and previous email presented multiple reasons as to why not. We can agree to disagree on the topic. > The CPU process is basically the overseer of the GPU activity, so it should > know everything about the GPU operation, for example what a mapping actually > means. 
How does that relate to what is being discussed here? You just seem to explain how you think userspace driver should work: Maintain a shadow tree of each ppGTT VM layout? I don't agree on that, but I think it is slightly irrelevant here. > The kernel driver and the hardware only have the information necessary to > execute the work prepared by the CPU process. So the information available is > limited to begin with. And the point here is? Are you saying kernel does not know the actual mappings maintained in the GPU page tables? > What the debugger should probably do is to cleanly attach to the > application, get the information which CPU address is mapped to which > GPU address and then use the standard ptrace interfaces. > > I don't quite agree here -- at all. "Which CPU address is mapped to > which GPU address" makes no sense when the GPU address space and CPU > address space is completely controlled by the userspace driver/application. > > > Yeah, that's the reason why you should ask the userspace driver/application for > the necessary information and not go over the kernel to debug things. What hypothetical necessary information are you referring to exactly? I already explained there are good reasons not to map all the GPU memory into the CPU address space. > Please try to consider things outside of the ROCm architecture. > > > Well I consider a good part of the ROCm architecture rather broken exactly > because we haven't pushed back hard enough on bad ideas. > > > Something like a register scratch region or EU instructions should not > even be mapped to CPU address space as CPU has no business accessing it > during normal operation. And backing of such region will vary per > context/LRC on the same virtual address per EU thread. > > You seem to be suggesting to rewrite even our userspace driver to behave > the same way as ROCm driver does just so that we could implement debug memory > accesses via ptrace() to the CPU address space. > > > Oh, well certainly not. That ROCm has an 1 to 1 mapping between CPU and GPU is > one thing I've pushed back massively on and has now proven to be problematic. Right, so is your claim then that instead of being 1:1 the CPU address space should be a superset of all GPU address spaces instead to make sure ptrace() can modify all memory? Cause I'm slightly lost here as you don't give much reasoning, just claim things to be certain way. > That seems bit of a radical suggestion, especially given the drawbacks > pointed out in your suggested design. > > > The whole interface re-invents a lot of functionality which is already > there > > I'm not really sure I would call adding a single interface for memory > reading and writing to be "re-inventing a lot of functionality". > > All the functionality behind this interface will be needed by GPU core > dumping, anyway. Just like for the other patch series. > > > As far as I can see exactly that's an absolutely no-go. Device core dumping > should *never ever* touch memory imported by userptrs. Could you again elaborate on what the great difference is to short term pinning to use in dma-fence workloads? Just the kmap? > That's what process core dumping is good for. Not really sure I agree. If you do not dump the memory as seen by the GPU, then you need to go parsing the CPU address space in order to make sense which buffers were mapped where and that CPU memory contents containing metadata could be corrupt as we're dealing with a crashing app to begin with. 
Big point of relying to the information from GPU VM for the GPU memory layout is that it won't be corrupted by rogue memory accesses in CPU process. > just because you don't like the idea to attach to the debugged > application in userspace. > > A few points that have been brought up as drawback to the > GPU debug through ptrace(), but to recap a few relevant ones for this > discussion: > > - You can only really support GDB stop-all mode or at least have to > stop all the CPU threads while you control the GPU threads to > avoid interference. Elaborated on this on the other threads more. > - Controlling the GPU threads will always interfere with CPU threads. > Doesn't seem feasible to single-step an EU thread while CPU threads > continue to run freely? > > > I would say no. Should this be understood that you agree these are limitations of the ROCm debug architecture? > - You are very much restricted by the CPU VA ~ GPU VA alignment > requirement, which is not true for OpenGL or Vulkan etc. Seems > like one of the reasons why ROCm debugging is not easily extendable > outside compute? > > > Well as long as you can't take debugged threads from the hardware you can > pretty much forget any OpenGL or Vulkan debugging with this interface since it > violates the dma_fence restrictions in the kernel. Agreed. However doesn't mean because you can't do it right now, you you should design an architecture that actively prevents you from doing that in the future. > - You have to expose extra memory to CPU process just for GPU > debugger access and keep track of GPU VA for each. Makes the GPU more > prone to OOB writes from CPU. Exactly what not mapping the memory > to CPU tried to protect the GPU from to begin with. > > > As far as I can see this whole idea is extremely questionable. This > looks like re-inventing the wheel in a different color. > > I see it like reinventing a round wheel compared to octagonal wheel. > > Could you elaborate with facts much more on your position why the ROCm > debugger design is an absolute must for others to adopt? > > > Well I'm trying to prevent some of the mistakes we did with the ROCm design. Well, I would say that the above limitations are direct results of the ROCm debugging design. So while we're eager to learn about how you perceive GPU debugging should work, would you mind addressing the above shortcomings? > And trying to re-invent well proven kernel interfaces is one of the big > mistakes made in the ROCm design. Appreciate the feedback. Please work on the representation a bit as it currently doesn't seem very helpful but appears just as an attempt to try to throw a spanner in the works. > If you really want to expose an interface to userspace To a debugger process, enabled only behind a flag. > which walks the process > page table, installs an MMU notifier This part is already done to put an userptr to the GPU page tables to begin with. So hopefully not too controversial. > kmaps the resulting page In addition to having it in the page tables where GPU can access it. > and then memcpy > to/from it then you absolutely *must* run that by guys like Christoph Hellwig, > Andrew and even Linus. Surely, that is why we're seeking out for review. We could also in theory use an in-kernel GPU context on the GPU hardware for doing the peek/poke operations on userptr. 
But that seems like a high-overhead thing to do due to the overhead of setting up a transfer per data word and going over the PCI bus twice compared to accessing the memory directly by CPU when it trivially can. So this is the current proposal. Regards, Joonas > > I'm pretty sure that those guys will note that a device driver should > absolutely not mess with such stuff. > > Regards, > Christian. > > > Otherwise it just looks like you are trying to prevent others from > implementing a more flexible debugging interface through vague comments about > "questionable design" without going into details. Not listing much concrete > benefits nor addressing the very concretely expressed drawbacks of your > suggested design, makes it seem like a very biased non-technical discussion. > > So while review interest and any comments are very much appreciated, please > also work on providing bit more reasoning and facts instead of just claiming > things. That'll help make the discussion much more fruitful. > > Regards, Joonas > >
Am 10.12.24 um 12:57 schrieb Joonas Lahtinen: > Quoting Christian König (2024-12-10 12:00:48) >> Am 10.12.24 um 10:33 schrieb Joonas Lahtinen: >> >> Quoting Christian König (2024-12-09 17:42:32) >> >> Am 09.12.24 um 16:31 schrieb Simona Vetter: >> >> On Mon, Dec 09, 2024 at 03:03:04PM +0100, Christian König wrote: >> >> Am 09.12.24 um 14:33 schrieb Mika Kuoppala: >> >> From: Andrzej Hajda<andrzej.hajda@intel.com> >> >> Debugger needs to read/write program's vmas including userptr_vma. >> Since hmm_range_fault is used to pin userptr vmas, it is possible >> to map those vmas from debugger context. >> >> Oh, this implementation is extremely questionable as well. Adding the LKML >> and the MM list as well. >> >> First of all hmm_range_fault() does *not* pin anything! >> >> In other words you don't have a page reference when the function returns, >> but rather just a sequence number you can check for modifications. >> >> I think it's all there, holds the invalidation lock during the critical >> access/section, drops it when reacquiring pages, retries until it works. >> >> I think the issue is more that everyone hand-rolls userptr. >> >> Well that is part of the issue. >> >> The general problem here is that the eudebug interface tries to simulate >> the memory accesses as they would have happened by the hardware. >> >> Could you elaborate, what is that a problem in that, exactly? >> >> It's pretty much the equivalent of ptrace() poke/peek but for GPU memory. >> >> >> Exactly that here. You try to debug the GPU without taking control of the CPU >> process. > You seem to have a built-in expectation that the CPU threads and memory space > must be interfered with in order to debug a completely different set of threads > and memory space elsewhere that executes independently. I don't quite see why? Because the GPU only gets the information it needs to execute the commands. A simple example would be to single step through the high level shader code. That is usually not available to the GPU, but only to the application who has submitted the work. The GPU only sees the result of the compiler from high level into low level assembler. > In debugging massively parallel workloads, it's a huge drawback to be limited to > stop all mode in GDB. If ROCm folks are fine with such limitation, I have nothing > against them keeping that limitation. Just it was a starting design principle for > this design to avoid such a limitation. Well, that's the part I don't understand. Why is that a drawback? >> This means that you have to re-implement all debug functionalities which where >> previously invested for the CPU process for the GPU once more. > Seems like a strawman argument. Can you list the "all interfaces" being added > that would be possible via indirection via ptrace() beyond peek/poke? > >> And that in turn creates a massive attack surface for security related >> problems, especially when you start messing with things like userptrs which >> have a very low level interaction with core memory management. > Again, just seems like a strawman argument. You seem to generalize to some massive > attack surface of hypothetical interfaces which you don't list. We're talking > about memory peek/poke here. That peek/poke interface is more than enough to cause problems. > Can you explain the high-level difference from security perspective for > temporarily pinning userptr pages to write them to page tables for GPU to > execute a dma-fence workload with and temporarily pinning pages for > peek/poke? 
If you want to access userptr imported pages from the GPU, going through the hoops of using hmm_range_fault()/get_user_pages() plus an MMU notifier is a must have. For a CPU based debugging interface that isn't necessary, you can just look directly into the application address space with existing interfaces. >> And it is exactly the kind of interface that makes sense for debugger as >> GPU memory != CPU memory, and they don't need to align at all. >> >> >> And that is what I strongly disagree on. When you debug the GPU it is mandatory >> to gain control of the CPU process as well. > You are free to disagree on that. I simply don't agree and have in this > and previous email presented multiple reasons as to why not. We can > agree to disagree on the topic. Yeah, that's ok. I also think we can agree on that this doesn't matter for the discussion. The question is rather should the userptr functionality be used for debugging or not. >> The CPU process is basically the overseer of the GPU activity, so it should >> know everything about the GPU operation, for example what a mapping actually >> means. > How does that relate to what is being discussed here? You just seem to > explain how you think userspace driver should work: Maintain a shadow > tree of each ppGTT VM layout? I don't agree on that, but I think it is > slightly irrelevant here. I'm trying to understand why you want to debug only the GPU without also attaching to the CPU process. >> The kernel driver and the hardware only have the information necessary to >> execute the work prepared by the CPU process. So the information available is >> limited to begin with. > And the point here is? Are you saying kernel does not know the actual mappings > maintained in the GPU page tables? The kernel knows that, the question is why doesn't userspace know that? On the other hand I have to agree that this isn't much of a problem. If userspace really doesn't know what is mapped where in the GPU's VM address space then an IOCTL to query that is probably not an issue. >> What the debugger should probably do is to cleanly attach to the >> application, get the information which CPU address is mapped to which >> GPU address and then use the standard ptrace interfaces. >> >> I don't quite agree here -- at all. "Which CPU address is mapped to >> which GPU address" makes no sense when the GPU address space and CPU >> address space is completely controlled by the userspace driver/application. >> >> >> Yeah, that's the reason why you should ask the userspace driver/application for >> the necessary information and not go over the kernel to debug things. > What hypothetical necessary information are you referring to exactly? What you said before: "the GPU address space and CPU address space is completely controlled by the userspace driver/application". When that's the case, then why ask the kernel for help? The driver/application is in control. > I already explained there are good reasons not to map all the GPU memory > into the CPU address space. Well I still don't fully agree to that argumentation, but compared to using userptr the peek/poke on a GEM handle is basically harmless. >> Please try to consider things outside of the ROCm architecture. >> >> >> Well I consider a good part of the ROCm architecture rather broken exactly >> because we haven't pushed back hard enough on bad ideas. 
>> >> >> Something like a register scratch region or EU instructions should not >> even be mapped to CPU address space as CPU has no business accessing it >> during normal operation. And backing of such region will vary per >> context/LRC on the same virtual address per EU thread. >> >> You seem to be suggesting to rewrite even our userspace driver to behave >> the same way as ROCm driver does just so that we could implement debug memory >> accesses via ptrace() to the CPU address space. >> >> >> Oh, well certainly not. That ROCm has an 1 to 1 mapping between CPU and GPU is >> one thing I've pushed back massively on and has now proven to be problematic. > Right, so is your claim then that instead of being 1:1 the CPU address space > should be a superset of all GPU address spaces instead to make sure > ptrace() can modify all memory? Well why not? Mapping a BO and not accessing it has only minimal overhead. We already considered to making that mandatory for TTM drivers for better OOM killer handling. That approach was discontinued, but certainly not for the overhead. > Cause I'm slightly lost here as you don't give much reasoning, just > claim things to be certain way. Ok, that's certainly not what I'm trying to express. Things don't need to be in a certain way, especially not in the way ROCm does things. But you should not try to re-create GPU accesses with the CPU, especially when that isn't memory you have control over in the sense that it was allocated through your driver stack. >> That seems bit of a radical suggestion, especially given the drawbacks >> pointed out in your suggested design. >> >> >> The whole interface re-invents a lot of functionality which is already >> there >> >> I'm not really sure I would call adding a single interface for memory >> reading and writing to be "re-inventing a lot of functionality". >> >> All the functionality behind this interface will be needed by GPU core >> dumping, anyway. Just like for the other patch series. >> >> >> As far as I can see exactly that's an absolutely no-go. Device core dumping >> should *never ever* touch memory imported by userptrs. > Could you again elaborate on what the great difference is to short term > pinning to use in dma-fence workloads? Just the kmap? The big difference is that the memory doesn't belong to the driver who is core dumping. That is just something you have imported from the MM subsystem, e.g. anonymous memory and file backed mappings. We also don't allow to mmap() dma-bufs on importing devices for similar reasons. >> That's what process core dumping is good for. > Not really sure I agree. If you do not dump the memory as seen by the > GPU, then you need to go parsing the CPU address space in order to make > sense which buffers were mapped where and that CPU memory contents containing > metadata could be corrupt as we're dealing with a crashing app to begin with. > > Big point of relying to the information from GPU VM for the GPU memory layout > is that it won't be corrupted by rogue memory accesses in CPU process. Well that you don't want to use potentially corrupted information is a good argument, but why just not dump an information like "range 0xabcd-0xbcde came as userptr from process 1 VMA 0x1234-0x5678" ? A process address space is not really something a device driver should be messing with. > >> just because you don't like the idea to attach to the debugged >> application in userspace. 
>> >> A few points that have been brought up as drawback to the >> GPU debug through ptrace(), but to recap a few relevant ones for this >> discussion: >> >> - You can only really support GDB stop-all mode or at least have to >> stop all the CPU threads while you control the GPU threads to >> avoid interference. Elaborated on this on the other threads more. >> - Controlling the GPU threads will always interfere with CPU threads. >> Doesn't seem feasible to single-step an EU thread while CPU threads >> continue to run freely? >> >> >> I would say no. > Should this be understood that you agree these are limitations of the ROCm > debug architecture? ROCm has a bunch of design decisions I would say we should never ever repeat: 1. Forcing a 1 to 1 model between GPU address space and CPU address space. 2. Using a separate file descriptor additional to the DRM render node. 3. Attaching information and context to the CPU process instead of the DRM render node. .... But stopping the world, e.g. both CPU and GPU threads if you want to debug something is not one of the problematic decisions. That's why I'm really surprised that you insist so much on that. >> - You are very much restricted by the CPU VA ~ GPU VA alignment >> requirement, which is not true for OpenGL or Vulkan etc. Seems >> like one of the reasons why ROCm debugging is not easily extendable >> outside compute? >> >> >> Well as long as you can't take debugged threads from the hardware you can >> pretty much forget any OpenGL or Vulkan debugging with this interface since it >> violates the dma_fence restrictions in the kernel. > Agreed. However doesn't mean because you can't do it right now, you you should > design an architecture that actively prevents you from doing that in the future. Good point. That's what I can totally agree on as well. >> - You have to expose extra memory to CPU process just for GPU >> debugger access and keep track of GPU VA for each. Makes the GPU more >> prone to OOB writes from CPU. Exactly what not mapping the memory >> to CPU tried to protect the GPU from to begin with. >> >> >> As far as I can see this whole idea is extremely questionable. This >> looks like re-inventing the wheel in a different color. >> >> I see it like reinventing a round wheel compared to octagonal wheel. >> >> Could you elaborate with facts much more on your position why the ROCm >> debugger design is an absolute must for others to adopt? >> >> >> Well I'm trying to prevent some of the mistakes we did with the ROCm design. > Well, I would say that the above limitations are direct results of the ROCm > debugging design. So while we're eager to learn about how you perceive > GPU debugging should work, would you mind addressing the above > shortcomings? Yeah, absolutely. That you don't have a 1 to 1 mapping on the GPU is a step in the right direction if you ask me. >> And trying to re-invent well proven kernel interfaces is one of the big >> mistakes made in the ROCm design. > Appreciate the feedback. Please work on the representation a bit as it currently > doesn't seem very helpful but appears just as an attempt to try to throw a spanner > in the works. > >> If you really want to expose an interface to userspace > To a debugger process, enabled only behind a flag. > >> which walks the process >> page table, installs an MMU notifier > This part is already done to put an userptr to the GPU page tables to > begin with. So hopefully not too controversial. 
> >> kmaps the resulting page > In addition to having it in the page tables where GPU can access it. > >> and then memcpy >> to/from it then you absolutely *must* run that by guys like Christoph Hellwig, >> Andrew and even Linus. > Surely, that is why we're seeking out for review. > > We could also in theory use an in-kernel GPU context on the GPU hardware for > doing the peek/poke operations on userptr. Yeah, I mean that should clearly work out. We have something similar. > But that seems like a high-overhead thing to do due to the overhead of > setting up a transfer per data word and going over the PCI bus twice > compared to accessing the memory directly by CPU when it trivially can. Understandable, but that will create another way of accessing process memory. Regards, Christian. > > So this is the current proposal. > > Regards, Joonas > >> I'm pretty sure that those guys will note that a device driver should >> absolutely not mess with such stuff. >> >> Regards, >> Christian. >> >> >> Otherwise it just looks like you are trying to prevent others from >> implementing a more flexible debugging interface through vague comments about >> "questionable design" without going into details. Not listing much concrete >> benefits nor addressing the very concretely expressed drawbacks of your >> suggested design, makes it seem like a very biased non-technical discussion. >> >> So while review interest and any comments are very much appreciated, please >> also work on providing bit more reasoning and facts instead of just claiming >> things. That'll help make the discussion much more fruitful. >> >> Regards, Joonas >> >>
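For context on the hmm_range_fault() point being argued above: the canonical usage is a collect-and-recheck loop around an mmu_interval_notifier, with no page references held at any point. A minimal sketch under that assumption follows; the function name snapshot_user_range() and the caller-registered notifier are illustrative only, not an existing Xe or core API.

#include <linux/hmm.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>

/*
 * Collect pfns for [start, end) of the notifier's mm. No page references
 * are taken; the result is only usable while the notifier sequence number
 * collected here is still current, which any user of the pfns has to
 * recheck under its own notifier lock.
 */
static int snapshot_user_range(struct mmu_interval_notifier *notifier,
                               unsigned long start, unsigned long end,
                               unsigned long *pfns)
{
        struct hmm_range range = {
                .notifier      = notifier,
                .start         = start,
                .end           = end,
                .hmm_pfns      = pfns,
                .default_flags = HMM_PFN_REQ_FAULT,
        };
        int ret;

again:
        range.notifier_seq = mmu_interval_read_begin(notifier);

        mmap_read_lock(notifier->mm);
        ret = hmm_range_fault(&range);
        mmap_read_unlock(notifier->mm);
        if (ret == -EBUSY)
                goto again;     /* raced with an invalidation, start over */
        if (ret)
                return ret;

        /* Anything that now touches the pfns must hold the driver's
         * notifier lock and restart if the interval was invalidated.
         */
        if (mmu_interval_read_retry(notifier, range.notifier_seq))
                goto again;

        return 0;
}

This is roughly what the repin loop in the patch open-codes with Xe's own locks; the disagreement above is not about that loop itself but about what the collected pages may then be used for.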
First of all, do appreciate taking the time to explain your positions much more verbosely this time. Quoting Christian König (2024-12-10 16:03:14) > Am 10.12.24 um 12:57 schrieb Joonas Lahtinen: > > Quoting Christian König (2024-12-10 12:00:48) > > Am 10.12.24 um 10:33 schrieb Joonas Lahtinen: > > Quoting Christian König (2024-12-09 17:42:32) > > Am 09.12.24 um 16:31 schrieb Simona Vetter: > > On Mon, Dec 09, 2024 at 03:03:04PM +0100, Christian König wrote: > > Am 09.12.24 um 14:33 schrieb Mika Kuoppala: > > From: Andrzej Hajda <andrzej.hajda@intel.com> > > Debugger needs to read/write program's vmas including userptr_vma. > Since hmm_range_fault is used to pin userptr vmas, it is possible > to map those vmas from debugger context. > > Oh, this implementation is extremely questionable as well. Adding the LKML > and the MM list as well. > > First of all hmm_range_fault() does *not* pin anything! > > In other words you don't have a page reference when the function returns, > but rather just a sequence number you can check for modifications. > > I think it's all there, holds the invalidation lock during the critical > access/section, drops it when reacquiring pages, retries until it works. > > I think the issue is more that everyone hand-rolls userptr. > > Well that is part of the issue. > > The general problem here is that the eudebug interface tries to simulate > the memory accesses as they would have happened by the hardware. > > Could you elaborate, what is that a problem in that, exactly? > > It's pretty much the equivalent of ptrace() poke/peek but for GPU memory. > > > Exactly that here. You try to debug the GPU without taking control of the CPU > process. > > You seem to have a built-in expectation that the CPU threads and memory space > must be interfered with in order to debug a completely different set of threads > and memory space elsewhere that executes independently. I don't quite see why? > > > Because the GPU only gets the information it needs to execute the commands. Right, but even for the CPU process, the debug symbols are not part of the execution address space either. There similarly are only the instructions generated by the compiler and the debug symbols are separate. They may be obtainable by parsing /proc/<PID>/exe but can also be in a completely different file in a different host machine. > A simple example would be to single step through the high level shader code. > That is usually not available to the GPU, but only to the application who has > submitted the work. > > The GPU only sees the result of the compiler from high level into low level > assembler. If we were to have unified executable format where both the GPU and CPU instructions were to be part of the single executable file, so could the DWARF information for both CPU and GPU. Then GDB, by loading the executable file, would have all the debug information it needed. No need to introspect to the CPU process in order to debug the GPU, similarly as there is no need to introspect CPU process to debug CPU process. While we don't currently have that and GPU instructions are often JIT generated, we tried to make life easier by userspace driver providing the DWARF information it just generated for the code it JITed as VM_BIND metadata for a VMA and we make copy to store safely to avoid corruption by rogue CPU process writes. In the history it was exported to a file and then loaded by GDB from that separate file, making user experience quite bad. 
So to recap, while for JIT scenarios and for lack of unified carrier format for GPU and CPU instructions, there is some information that is convenient to have in CPU address space, I don't think that is a necessity at all. I guess we could equally export /sys/class/drm/.../clients/<ID>/{load_map,dwarf_symbols} or whatever, similar to /proc/<PID>/{maps,exe}. TL;DR While getting the information from CPU process for JIT scenarios is convenient for now, I don't think it is a must or explicitly required. > In debugging massively parallel workloads, it's a huge drawback to be limited to > stop all mode in GDB. If ROCm folks are fine with such limitation, I have nothing > against them keeping that limitation. Just it was a starting design principle for > this design to avoid such a limitation. > > > Well, that's the part I don't understand. Why is that a drawback? Hmm, same as for not supporting stop-all mode for CPU threads during CPU debugging? You will not be able to stop and observe a single thread while letting the other threads run. If the CPU threads are for example supposed to react to memory semaphores/fences written by GPU thread and you want to debug by doing those memory writes from GPU thread from the GDB command line? Again not being limited to stop-all mode being an input to the design phase from the folks doing in-field debugging, I'm probably not going to be able to give out all the good reasons for it. And as the CPU side supports it, even if you did not support it for the GPU debugging, if adding GPU to the equation would prevent from using the existing feature for CPU debugging that feels like a regression in user experience. I think those both are major drawbacks, but we can of course seek out further opinions if it's highly relevant. I'm pretty sure myself at this point that if a feature is desireable for CPU threaded debugging, it'll be very shortly asked to be available for GPU. That seems to be the trend for any CPU debug feature, even if some are less feasible than others due to the differences of GPUs and CPUs. > This means that you have to re-implement all debug functionalities which where > previously invested for the CPU process for the GPU once more. > > Seems like a strawman argument. Can you list the "all interfaces" being added > that would be possible via indirection via ptrace() beyond peek/poke? > > > And that in turn creates a massive attack surface for security related > problems, especially when you start messing with things like userptrs which > have a very low level interaction with core memory management. > > Again, just seems like a strawman argument. You seem to generalize to some massive > attack surface of hypothetical interfaces which you don't list. We're talking > about memory peek/poke here. > > > That peek/poke interface is more than enough to cause problems. Ok, just wanted to make sure we're talking about concrete things. Happy to discuss any other problems too, but for now let's focus on the peek/poke then, and not get sidetracked. > Can you explain the high-level difference from security perspective for > temporarily pinning userptr pages to write them to page tables for GPU to > execute a dma-fence workload with and temporarily pinning pages for > peek/poke? > > > If you want to access userptr imported pages from the GPU going through the > hops of using hhm_range_fault()/get_user_pages() plus an MMU notifier is a must > have. 
Right, the intent was always to make this as close to EU thread memory access as possible from both locking and Linux core MM memory point of view so if we need to improve on that front, we should look into it. > For a CPU based debugging interface that isn't necessary, you can just look > directly into the application address space with existing interfaces. First, this is only even possible when you have mapped everything the GPUs have access also to the CPU address space, and maintain a load map for each individual <DRM client, GPU VM, GPU VMA> => CPU address. I don't see the need to do that tracking in the userspace just for debugging, because kernel already has to do all the work. Second, mapping every GPU VMA to CPU address space will exhaust the vm.max_map_count [1] quite a bit faster. This problem can already be hit if a naive userspace application tries to create too many aligned_alloc blocks for userptr instead of pooling memory. Third of all when GPU VM size == CPU VM size for modern hardware, you will run out of VA space in the CPU VM. When you consider you have to add VA blocks of (num DRM clients) * (num VM) * (VM size) where (num VM) roughly equals number of engines * 3. And ultimately, I'm pretty sure there are processes like 32-bit games and emulators, and even demanding compute applications actually expect to be able to use most of the CPU address space :) So don't think we should have a design where we expect to be able to consume significant portions of the CPU VA space (especially if it is just for debug time functionality). [1] Documentation/admin-guide/sysctl/vm.rst#max_map_count > And it is exactly the kind of interface that makes sense for debugger as > GPU memory != CPU memory, and they don't need to align at all. > > > And that is what I strongly disagree on. When you debug the GPU it is mandatory > to gain control of the CPU process as well. > > You are free to disagree on that. I simply don't agree and have in this > and previous email presented multiple reasons as to why not. We can > agree to disagree on the topic. > > > Yeah, that's ok. I also think we can agree on that this doesn't matter for the > discussion. > > The question is rather should the userptr functionality be used for debugging > or not. > > > The CPU process is basically the overseer of the GPU activity, so it should > know everything about the GPU operation, for example what a mapping actually > means. > > How does that relate to what is being discussed here? You just seem to > explain how you think userspace driver should work: Maintain a shadow > tree of each ppGTT VM layout? I don't agree on that, but I think it is > slightly irrelevant here. > > > I'm trying to understand why you want to debug only the GPU without also > attaching to the CPU process. Mostly to ensure we're not limited to stop-all mode as described above and to have a clean independent implementation for the thread run-control between the "inferiors" in GDB. Say you have CPU threads and 2 sets of GPU threads (3 inferiors in total). We don't want the CPU inferior to be impacted by the user requesting to control the GPU inferiors. I know the ROCm GDB implementation takes a different approach, and I'm not quite sure how you folks plan on supporting multi-GPU debugging. I would spin the question the opposite direction, if you don't need anything from the CPU process why would you make them dependent and interfering? 
(Reminder, the peek/poke target page has been made available to the GPU page tables, so we don't want anything from the CPU process per se, we want to know which page the GPU IOMMU unit would get for its access. For all practical matters, the CPU process could have already exited and should not matter if an EU is executing on the GPU still.) > The kernel driver and the hardware only have the information necessary to > execute the work prepared by the CPU process. So the information available is > limited to begin with. > > And the point here is? Are you saying kernel does not know the actual mappings > maintained in the GPU page tables? > > > The kernel knows that, the question is why does userspace don't know that? > > On the other hand I have to agree that this isn't much of a problem. > > If userspace really doesn't know what is mapped where in the GPU's VM address > space then an IOCTL to query that is probably not an issue. > > What the debugger should probably do is to cleanly attach to the > application, get the information which CPU address is mapped to which > GPU address and then use the standard ptrace interfaces. > > I don't quite agree here -- at all. "Which CPU address is mapped to > which GPU address" makes no sense when the GPU address space and CPU > address space is completely controlled by the userspace driver/application. > > > Yeah, that's the reason why you should ask the userspace driver/application for > the necessary information and not go over the kernel to debug things. > > What hypothetical necessary information are you referring to exactly? > > > What you said before: "the GPU address space and CPU address space is > completely controlled by the userspace driver/application". When that's the > case, then why as the kernel for help? The driver/application is in control. I guess the emphasis should have been on the application part. Debugger can agree with userspace driver on conventions to facilitate debugging, but not with the application code. However, agree that query IOCTL could be avoided maintaining a shadow address space tracking in case ptrace() approach to debugging was otherwise favorable. > I already explained there are good reasons not to map all the GPU memory > into the CPU address space. > > > Well I still don't fully agree to that argumentation, but compared to using > userptr the peek/pook on a GEM handle is basically harmless. (Sidenote: We don't expose BO handles at all via debugger interface. The debugger interface fully relies on GPU addresses for everything.) But sounds like we're coming towards a conclusion that the focus of the discussion is only really on the controversy of touching userptr with the debugger peek/poke interface or not. > Please try to consider things outside of the ROCm architecture. > > > Well I consider a good part of the ROCm architecture rather broken exactly > because we haven't pushed back hard enough on bad ideas. > > > Something like a register scratch region or EU instructions should not > even be mapped to CPU address space as CPU has no business accessing it > during normal operation. And backing of such region will vary per > context/LRC on the same virtual address per EU thread. > > You seem to be suggesting to rewrite even our userspace driver to behave > the same way as ROCm driver does just so that we could implement debug memory > accesses via ptrace() to the CPU address space. > > > Oh, well certainly not. 
That ROCm has an 1 to 1 mapping between CPU and GPU is > one thing I've pushed back massively on and has now proven to be problematic. > > Right, so is your claim then that instead of being 1:1 the CPU address space > should be a superset of all GPU address spaces instead to make sure > ptrace() can modify all memory? > > > Well why not? Mapping a BO and not accessing it has only minimal overhead. > > We already considered to making that mandatory for TTM drivers for better OOM > killer handling. That approach was discontinued, but certainly not for the > overhead. I listed the reasons earlier in this message. > Cause I'm slightly lost here as you don't give much reasoning, just > claim things to be certain way. > > > Ok, that's certainly not what I'm trying to express. > > Things don't need to be in a certain way, especially not in the way ROCm does > things. > > But you should not try to re-create GPU accesses with the CPU, especially when > that isn't memory you have control over in the sense that it was allocated > through your driver stack. I guess that's what I don't quite follow. It's memory pages that are temporarily pinned and made available via GPU PTE to the GPU IOMMU, and it will inherently be able to read/write them outside of the CPU caching domain. Not sure why replacing "Submit GPU workload to peek/poke such page pinned behind PTE" with "Use CPU to peek/poke because userptr is system memory anyway" seems so controversial and could cause much more complexity than userptr in general? > That seems bit of a radical suggestion, especially given the drawbacks > pointed out in your suggested design. > > > The whole interface re-invents a lot of functionality which is already > there > > I'm not really sure I would call adding a single interface for memory > reading and writing to be "re-inventing a lot of functionality". > > All the functionality behind this interface will be needed by GPU core > dumping, anyway. Just like for the other patch series. > > > As far as I can see exactly that's an absolutely no-go. Device core dumping > should *never ever* touch memory imported by userptrs. > > Could you again elaborate on what the great difference is to short term > pinning to use in dma-fence workloads? Just the kmap? > > > The big difference is that the memory doesn't belong to the driver who is core > dumping. But the driver who is core dumping is holding a temporary pin on that memory anyway, and has it in the GPU page tables. The CPU side of the memory dump would only reflect what the CPU side memory contents were at dump time. It may have different contents than the GPU side depending on cache flush timing. Maybe this will not be true when CXL or some other coherency protocol is everywhere, but for now it is. So those two memory dumps may actually have different contents, and that might actually be the bug we're trying to debug. For GPU debugging, we're specifically interested in what the GPU threads' view of the memory was. So I think it is more complex than that. > That is just something you have imported from the MM subsystem, e.g. anonymous > memory and file backed mappings. > > We also don't allow to mmap() dma-bufs on importing devices for similar > reasons. That is a reasonable limitation for userspace applications. And at no point has there been a suggestion to expose such an API for normal userspace to shoot itself in the foot. However, a debugger is not an average userspace consumer. 
For example, it needs to be able to modify read-only memory (like the EU instructions) and then do special cache flushes to magically change those instructions. I wouldn't want to expose such functionality as a regular IOCTL for an application. > That's what process core dumping is good for. > > Not really sure I agree. If you do not dump the memory as seen by the > GPU, then you need to go parsing the CPU address space in order to make > sense which buffers were mapped where and that CPU memory contents containing > metadata could be corrupt as we're dealing with a crashing app to begin with. > > Big point of relying to the information from GPU VM for the GPU memory layout > is that it won't be corrupted by rogue memory accesses in CPU process. > > > Well that you don't want to use potentially corrupted information is a good > argument, but why just not dump an information like "range 0xabcd-0xbcde came > as userptr from process 1 VMA 0x1234-0x5678" ? I guess that could be done for interactive debugging (but would again add the ptrace() dependency). In theory you could probably also come up with such a convention for ELF to support core dumps I guess, but I would have to refer to some folks more knowledgeable on the topic. Feels like that would make things more complex via indirection compared to existing memory maps. > A process address space is not really something a device driver should be > messing with. > > > > > just because you don't like the idea to attach to the debugged > application in userspace. > > A few points that have been brought up as drawback to the > GPU debug through ptrace(), but to recap a few relevant ones for this > discussion: > > - You can only really support GDB stop-all mode or at least have to > stop all the CPU threads while you control the GPU threads to > avoid interference. Elaborated on this on the other threads more. > - Controlling the GPU threads will always interfere with CPU threads. > Doesn't seem feasible to single-step an EU thread while CPU threads > continue to run freely? > > > I would say no. > > Should this be understood that you agree these are limitations of the ROCm > debug architecture? > > > ROCm has a bunch of design decisions I would say we should never ever repeat: > > 1. Forcing a 1 to 1 model between GPU address space and CPU address space. > > 2. Using a separate file descriptor additional to the DRM render node. > > 3. Attaching information and context to the CPU process instead of the DRM > render node. > .... > > But stopping the world, e.g. both CPU and GPU threads if you want to debug > something is not one of the problematic decisions. > > That's why I'm really surprised that you insist so much on that. I'm hoping the above explanations clarify my position further. Again, I would ask myself: "Why add a dependency that is not needed?" > - You are very much restricted by the CPU VA ~ GPU VA alignment > requirement, which is not true for OpenGL or Vulkan etc. Seems > like one of the reasons why ROCm debugging is not easily extendable > outside compute? > > > Well as long as you can't take debugged threads from the hardware you can > pretty much forget any OpenGL or Vulkan debugging with this interface since it > violates the dma_fence restrictions in the kernel. > > Agreed. However doesn't mean because you can't do it right now, you you should > design an architecture that actively prevents you from doing that in the future. > > > Good point. That's what I can totally agree on as well. 
> > > - You have to expose extra memory to CPU process just for GPU > debugger access and keep track of GPU VA for each. Makes the GPU more > prone to OOB writes from CPU. Exactly what not mapping the memory > to CPU tried to protect the GPU from to begin with. > > > As far as I can see this whole idea is extremely questionable. This > looks like re-inventing the wheel in a different color. > > I see it like reinventing a round wheel compared to octagonal wheel. > > Could you elaborate with facts much more on your position why the ROCm > debugger design is an absolute must for others to adopt? > > > Well I'm trying to prevent some of the mistakes we did with the ROCm design. > > Well, I would say that the above limitations are direct results of the ROCm > debugging design. So while we're eager to learn about how you perceive > GPU debugging should work, would you mind addressing the above > shortcomings? > > > Yeah, absolutely. That you don't have a 1 to 1 mapping on the GPU is a step in > the right direction if you ask me. Right, that is to have a possibility of extending to OpenGL/Vulkan etc. > And trying to re-invent well proven kernel interfaces is one of the big > mistakes made in the ROCm design. > > Appreciate the feedback. Please work on the representation a bit as it currently > doesn't seem very helpful but appears just as an attempt to try to throw a spanner > in the works. > > > If you really want to expose an interface to userspace > > To a debugger process, enabled only behind a flag. > > > which walks the process > page table, installs an MMU notifier > > This part is already done to put an userptr to the GPU page tables to > begin with. So hopefully not too controversial. > > > kmaps the resulting page > > In addition to having it in the page tables where GPU can access it. > > > and then memcpy > to/from it then you absolutely *must* run that by guys like Christoph Hellwig, > Andrew and even Linus. > > Surely, that is why we're seeking out for review. > > We could also in theory use an in-kernel GPU context on the GPU hardware for > doing the peek/poke operations on userptr. > > > Yeah, I mean that should clearly work out. We have something similar. Right, and that might actually be desireable for the more special GPU VMA like interconnect addresses. However userptr is one of the items where it makes least sense, given we'd have to set up the transfer over bus, the transfer would read system memory over bus and write the result back to system memory over bus. And this is just to avoid kmap'ing a page that would otherwise be already temporarily pinned for being in the PTEs. I'm not saying it can't be done, but I just don't feel like it's a technically sound solution. > But that seems like a high-overhead thing to do due to the overhead of > setting up a transfer per data word and going over the PCI bus twice > compared to accessing the memory directly by CPU when it trivially can. > > > Understandable, but that will create another way of accessing process memory. Well, we hopefully should be able to align with the regular temporary pinning and making page available to PTEs, but instead of making available to PTEs, do a peek/poke and then release the page already. I'm kind of hoping to build the case for it making a lot of sense for peek/poke performance (which is important for single-stepping), and should not be a burden due to new locking chains. And thanks once again for taking the time to share the details behind the thinking and bearing with all my questions. 
It seems like the peek/poke access to userptr is the big remaining open where opinions differ, so maybe we should first focus on aligning on it. It impacts both core dumping and interactive debugging. Regards, Joonas > > Regards, > Christian. > > > > So this is the current proposal. > > Regards, Joonas > > > I'm pretty sure that those guys will note that a device driver should > absolutely not mess with such stuff. > > Regards, > Christian. > > > Otherwise it just looks like you are trying to prevent others from > implementing a more flexible debugging interface through vague comments about > "questionable design" without going into details. Not listing much concrete > benefits nor addressing the very concretely expressed drawbacks of your > suggested design, makes it seem like a very biased non-technical discussion. > > So while review interest and any comments are very much appreciated, please > also work on providing bit more reasoning and facts instead of just claiming > things. That'll help make the discussion much more fruitful. > > Regards, Joonas > > > >
On Mon, 2024-12-09 at 16:31 +0100, Simona Vetter wrote: > On Mon, Dec 09, 2024 at 03:03:04PM +0100, Christian König wrote: > > Am 09.12.24 um 14:33 schrieb Mika Kuoppala: > > > From: Andrzej Hajda <andrzej.hajda@intel.com> > > > > > > Debugger needs to read/write program's vmas including > > > userptr_vma. > > > Since hmm_range_fault is used to pin userptr vmas, it is possible > > > to map those vmas from debugger context. > > > > Oh, this implementation is extremely questionable as well. Adding > > the LKML > > and the MM list as well. > > > > First of all hmm_range_fault() does *not* pin anything! > > > > In other words you don't have a page reference when the function > > returns, > > but rather just a sequence number you can check for modifications. > > I think it's all there, holds the invalidation lock during the > critical > access/section, drops it when reacquiring pages, retries until it > works. > > I think the issue is more that everyone hand-rolls userptr. Probably > time > we standardize that and put it into gpuvm as an optional part, with > consistent locking, naming (like not calling it _pin_pages when it's > unpinnged userptr), kerneldoc and all the nice things so that we > stop consistently getting confused by other driver's userptr code. > > I think that was on the plan originally as an eventual step, I guess > time > to pump that up. Matt/Thomas, thoughts? It looks like we have this planned and ongoing but there are some complications and thoughts. 1) A drm_gpuvm implementation would be based on vma userptrs, and would be pretty straightforward based on xe's current implementation and, as you say, renaming. 2) Current Intel work to land this on the drm level is based on drm_gpusvm (minus migration to VRAM). I'm not fully sure yet how this will integrate with drm_gpuvm. 3) Christian mentioned a plan to have a common userptr implementation based off drm_exec. I figure that would be bo-based like the amdgpu implemeentation still is. Possibly i915 would be interested in this but I think any VM_BIND based driver would want to use drm_gpuvm / drm_gpusvm implementation, which is also typically O(1), since userptrs are considered vm-local. 
Ideas / suggestions welcome > -Sima > > > > > > v2: pin pages vs notifier, move to vm.c (Matthew) > > > v3: - iterate over system pages instead of DMA, fixes iommu > > > enabled > > > - s/xe_uvma_access/xe_vm_uvma_access/ (Matt) > > > > > > Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com> > > > Signed-off-by: Maciej Patelczyk <maciej.patelczyk@intel.com> > > > Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> > > > Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> #v1 > > > --- > > > drivers/gpu/drm/xe/xe_eudebug.c | 3 ++- > > > drivers/gpu/drm/xe/xe_vm.c | 47 > > > +++++++++++++++++++++++++++++++++ > > > drivers/gpu/drm/xe/xe_vm.h | 3 +++ > > > 3 files changed, 52 insertions(+), 1 deletion(-) > > > > > > diff --git a/drivers/gpu/drm/xe/xe_eudebug.c > > > b/drivers/gpu/drm/xe/xe_eudebug.c > > > index 9d87df75348b..e5949e4dcad8 100644 > > > --- a/drivers/gpu/drm/xe/xe_eudebug.c > > > +++ b/drivers/gpu/drm/xe/xe_eudebug.c > > > @@ -3076,7 +3076,8 @@ static int xe_eudebug_vma_access(struct > > > xe_vma *vma, u64 offset_in_vma, > > > return ret; > > > } > > > - return -EINVAL; > > > + return xe_vm_userptr_access(to_userptr_vma(vma), > > > offset_in_vma, > > > + buf, bytes, write); > > > } > > > static int xe_eudebug_vm_access(struct xe_vm *vm, u64 offset, > > > diff --git a/drivers/gpu/drm/xe/xe_vm.c > > > b/drivers/gpu/drm/xe/xe_vm.c > > > index 0f17bc8b627b..224ff9e16941 100644 > > > --- a/drivers/gpu/drm/xe/xe_vm.c > > > +++ b/drivers/gpu/drm/xe/xe_vm.c > > > @@ -3414,3 +3414,50 @@ void xe_vm_snapshot_free(struct > > > xe_vm_snapshot *snap) > > > } > > > kvfree(snap); > > > } > > > + > > > +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 > > > offset, > > > + void *buf, u64 len, bool write) > > > +{ > > > + struct xe_vm *vm = xe_vma_vm(&uvma->vma); > > > + struct xe_userptr *up = &uvma->userptr; > > > + struct xe_res_cursor cur = {}; > > > + int cur_len, ret = 0; > > > + > > > + while (true) { > > > + down_read(&vm->userptr.notifier_lock); > > > + if (!xe_vma_userptr_check_repin(uvma)) > > > + break; > > > + > > > + spin_lock(&vm->userptr.invalidated_lock); > > > + list_del_init(&uvma->userptr.invalidate_link); > > > + spin_unlock(&vm->userptr.invalidated_lock); > > > + > > > + up_read(&vm->userptr.notifier_lock); > > > + ret = xe_vma_userptr_pin_pages(uvma); > > > + if (ret) > > > + return ret; > > > + } > > > + > > > + if (!up->sg) { > > > + ret = -EINVAL; > > > + goto out_unlock_notifier; > > > + } > > > + > > > + for (xe_res_first_sg_system(up->sg, offset, len, &cur); > > > cur.remaining; > > > + xe_res_next(&cur, cur_len)) { > > > + void *ptr = kmap_local_page(sg_page(cur.sgl)) + > > > cur.start; > > > > The interface basically creates a side channel to access userptrs > > in the way > > an userspace application would do without actually going through > > userspace. > > > > That is generally not something a device driver should ever do as > > far as I > > can see. > > > > > + > > > + cur_len = min(cur.size, cur.remaining); > > > + if (write) > > > + memcpy(ptr, buf, cur_len); > > > + else > > > + memcpy(buf, ptr, cur_len); > > > + kunmap_local(ptr); > > > + buf += cur_len; > > > + } > > > + ret = len; > > > + > > > +out_unlock_notifier: > > > + up_read(&vm->userptr.notifier_lock); > > > > I just strongly hope that this will prevent the mapping from > > changing. > > > > Regards, > > Christian. 
> > > > > + return ret; > > > +} > > > diff --git a/drivers/gpu/drm/xe/xe_vm.h > > > b/drivers/gpu/drm/xe/xe_vm.h > > > index 23adb7442881..372ad40ad67f 100644 > > > --- a/drivers/gpu/drm/xe/xe_vm.h > > > +++ b/drivers/gpu/drm/xe/xe_vm.h > > > @@ -280,3 +280,6 @@ struct xe_vm_snapshot > > > *xe_vm_snapshot_capture(struct xe_vm *vm); > > > void xe_vm_snapshot_capture_delayed(struct xe_vm_snapshot > > > *snap); > > > void xe_vm_snapshot_print(struct xe_vm_snapshot *snap, struct > > > drm_printer *p); > > > void xe_vm_snapshot_free(struct xe_vm_snapshot *snap); > > > + > > > +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 > > > offset, > > > + void *buf, u64 len, bool write); > > >
On Thu, Dec 12, 2024 at 09:49:24AM +0100, Thomas Hellström wrote: > On Mon, 2024-12-09 at 16:31 +0100, Simona Vetter wrote: > > On Mon, Dec 09, 2024 at 03:03:04PM +0100, Christian König wrote: > > > Am 09.12.24 um 14:33 schrieb Mika Kuoppala: > > > > From: Andrzej Hajda <andrzej.hajda@intel.com> > > > > > > > > Debugger needs to read/write program's vmas including > > > > userptr_vma. > > > > Since hmm_range_fault is used to pin userptr vmas, it is possible > > > > to map those vmas from debugger context. > > > > > > Oh, this implementation is extremely questionable as well. Adding > > > the LKML > > > and the MM list as well. > > > > > > First of all hmm_range_fault() does *not* pin anything! > > > > > > In other words you don't have a page reference when the function > > > returns, > > > but rather just a sequence number you can check for modifications. > > > > I think it's all there, holds the invalidation lock during the > > critical > > access/section, drops it when reacquiring pages, retries until it > > works. > > > > I think the issue is more that everyone hand-rolls userptr. Probably > > time > > we standardize that and put it into gpuvm as an optional part, with > > consistent locking, naming (like not calling it _pin_pages when it's > > unpinnged userptr), kerneldoc and all the nice things so that we > > stop consistently getting confused by other driver's userptr code. > > > > I think that was on the plan originally as an eventual step, I guess > > time > > to pump that up. Matt/Thomas, thoughts? > > It looks like we have this planned and ongoing but there are some > complications and thoughts. > > 1) A drm_gpuvm implementation would be based on vma userptrs, and would > be pretty straightforward based on xe's current implementation and, as > you say, renaming. > > 2) Current Intel work to land this on the drm level is based on > drm_gpusvm (minus migration to VRAM). I'm not fully sure yet how this > will integrate with drm_gpuvm. > > 3) Christian mentioned a plan to have a common userptr implementation > based off drm_exec. I figure that would be bo-based like the amdgpu > implemeentation still is. Possibly i915 would be interested in this but > I think any VM_BIND based driver would want to use drm_gpuvm / > drm_gpusvm implementation, which is also typically O(1), since userptrs > are considered vm-local. > > Ideas / suggestions welcome So just discussed this a bit with Joonas, and if we use access_remote_vm for the userptr access instead of hand-rolling then we really only need bare-bones data structure changes in gpuvm, and nothing more. So - add the mm pointer to struct drm_gpuvm - add a flag indicating that it's a userptr + userspace address to struct drm_gpuva - since we already have userptr in drivers I guess there shouldn't be any need to adjust the actual drm_gpuvm code to cope with these Then with this you can write the access helper using access_remote_vm since that does the entire remote mm walking internally, and so there's no need to also have all the mmu notifier and locking lifted to gpuvm. But it does already give us some great places to put relevant kerneldocs (not just for debugging architecture, but userptr stuff in general), which is already a solid step forward. Plus I think it'd also be a solid first step that we need no matter what for figuring out the questions/options you have above. Thoughts? 
-Sima > > > -Sima > > > > > > > > > v2: pin pages vs notifier, move to vm.c (Matthew) > > > > v3: - iterate over system pages instead of DMA, fixes iommu > > > > enabled > > > > - s/xe_uvma_access/xe_vm_uvma_access/ (Matt) > > > > > > > > Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com> > > > > Signed-off-by: Maciej Patelczyk <maciej.patelczyk@intel.com> > > > > Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> > > > > Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> #v1 > > > > --- > > > > drivers/gpu/drm/xe/xe_eudebug.c | 3 ++- > > > > drivers/gpu/drm/xe/xe_vm.c | 47 > > > > +++++++++++++++++++++++++++++++++ > > > > drivers/gpu/drm/xe/xe_vm.h | 3 +++ > > > > 3 files changed, 52 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_eudebug.c > > > > b/drivers/gpu/drm/xe/xe_eudebug.c > > > > index 9d87df75348b..e5949e4dcad8 100644 > > > > --- a/drivers/gpu/drm/xe/xe_eudebug.c > > > > +++ b/drivers/gpu/drm/xe/xe_eudebug.c > > > > @@ -3076,7 +3076,8 @@ static int xe_eudebug_vma_access(struct > > > > xe_vma *vma, u64 offset_in_vma, > > > > return ret; > > > > } > > > > - return -EINVAL; > > > > + return xe_vm_userptr_access(to_userptr_vma(vma), > > > > offset_in_vma, > > > > + buf, bytes, write); > > > > } > > > > static int xe_eudebug_vm_access(struct xe_vm *vm, u64 offset, > > > > diff --git a/drivers/gpu/drm/xe/xe_vm.c > > > > b/drivers/gpu/drm/xe/xe_vm.c > > > > index 0f17bc8b627b..224ff9e16941 100644 > > > > --- a/drivers/gpu/drm/xe/xe_vm.c > > > > +++ b/drivers/gpu/drm/xe/xe_vm.c > > > > @@ -3414,3 +3414,50 @@ void xe_vm_snapshot_free(struct > > > > xe_vm_snapshot *snap) > > > > } > > > > kvfree(snap); > > > > } > > > > + > > > > +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 > > > > offset, > > > > + void *buf, u64 len, bool write) > > > > +{ > > > > + struct xe_vm *vm = xe_vma_vm(&uvma->vma); > > > > + struct xe_userptr *up = &uvma->userptr; > > > > + struct xe_res_cursor cur = {}; > > > > + int cur_len, ret = 0; > > > > + > > > > + while (true) { > > > > + down_read(&vm->userptr.notifier_lock); > > > > + if (!xe_vma_userptr_check_repin(uvma)) > > > > + break; > > > > + > > > > + spin_lock(&vm->userptr.invalidated_lock); > > > > + list_del_init(&uvma->userptr.invalidate_link); > > > > + spin_unlock(&vm->userptr.invalidated_lock); > > > > + > > > > + up_read(&vm->userptr.notifier_lock); > > > > + ret = xe_vma_userptr_pin_pages(uvma); > > > > + if (ret) > > > > + return ret; > > > > + } > > > > + > > > > + if (!up->sg) { > > > > + ret = -EINVAL; > > > > + goto out_unlock_notifier; > > > > + } > > > > + > > > > + for (xe_res_first_sg_system(up->sg, offset, len, &cur); > > > > cur.remaining; > > > > + xe_res_next(&cur, cur_len)) { > > > > + void *ptr = kmap_local_page(sg_page(cur.sgl)) + > > > > cur.start; > > > > > > The interface basically creates a side channel to access userptrs > > > in the way > > > an userspace application would do without actually going through > > > userspace. > > > > > > That is generally not something a device driver should ever do as > > > far as I > > > can see. > > > > > > > + > > > > + cur_len = min(cur.size, cur.remaining); > > > > + if (write) > > > > + memcpy(ptr, buf, cur_len); > > > > + else > > > > + memcpy(buf, ptr, cur_len); > > > > + kunmap_local(ptr); > > > > + buf += cur_len; > > > > + } > > > > + ret = len; > > > > + > > > > +out_unlock_notifier: > > > > + up_read(&vm->userptr.notifier_lock); > > > > > > I just strongly hope that this will prevent the mapping from > > > changing. 
> > > > > > Regards, > > > Christian. > > > > > > > + return ret; > > > > +} > > > > diff --git a/drivers/gpu/drm/xe/xe_vm.h > > > > b/drivers/gpu/drm/xe/xe_vm.h > > > > index 23adb7442881..372ad40ad67f 100644 > > > > --- a/drivers/gpu/drm/xe/xe_vm.h > > > > +++ b/drivers/gpu/drm/xe/xe_vm.h > > > > @@ -280,3 +280,6 @@ struct xe_vm_snapshot > > > > *xe_vm_snapshot_capture(struct xe_vm *vm); > > > > void xe_vm_snapshot_capture_delayed(struct xe_vm_snapshot > > > > *snap); > > > > void xe_vm_snapshot_print(struct xe_vm_snapshot *snap, struct > > > > drm_printer *p); > > > > void xe_vm_snapshot_free(struct xe_vm_snapshot *snap); > > > > + > > > > +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 > > > > offset, > > > > + void *buf, u64 len, bool write); > > > > > >
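To make the access_remote_vm() idea above concrete, here is a rough sketch of the helper, assuming the two data-structure additions Sima lists (an mm pointer in struct drm_gpuvm, plus a userptr flag and userspace address in struct drm_gpuva). The field names is_userptr/userptr_addr and the function name drm_gpuva_userptr_access() are hypothetical, not existing drm_gpuvm API.

#include <linux/mm.h>
#include <drm/drm_gpuvm.h>

/*
 * Assumed (hypothetical) additions:
 *   struct drm_gpuvm { ...; struct mm_struct *mm; };
 *   struct drm_gpuva { ...; bool is_userptr; unsigned long userptr_addr; };
 */
static int drm_gpuva_userptr_access(struct drm_gpuvm *gpuvm,
                                    struct drm_gpuva *va,
                                    unsigned long offset,
                                    void *buf, int len, bool write)
{
        unsigned int gup_flags = write ? FOLL_WRITE : 0;

        if (!va->is_userptr)
                return -EINVAL;

        /*
         * access_remote_vm() does the remote mm walk, faulting and kmap
         * internally (the same path ptrace peek/poke ends up in), so no
         * mmu notifier or sg handling needs to be lifted into gpuvm here.
         */
        return access_remote_vm(gpuvm->mm, va->userptr_addr + offset,
                                buf, len, gup_flags);
}

The caller would still have to guarantee the mm stays alive (mmget()/mmput()) for the duration of the access, which presumably falls out of the debugger attach lifetime.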
On Thu, Dec 12, 2024 at 11:12:39AM +0100, Simona Vetter wrote: > On Thu, Dec 12, 2024 at 09:49:24AM +0100, Thomas Hellström wrote: > > On Mon, 2024-12-09 at 16:31 +0100, Simona Vetter wrote: > > > On Mon, Dec 09, 2024 at 03:03:04PM +0100, Christian König wrote: > > > > Am 09.12.24 um 14:33 schrieb Mika Kuoppala: > > > > > From: Andrzej Hajda <andrzej.hajda@intel.com> > > > > > > > > > > Debugger needs to read/write program's vmas including > > > > > userptr_vma. > > > > > Since hmm_range_fault is used to pin userptr vmas, it is possible > > > > > to map those vmas from debugger context. > > > > > > > > Oh, this implementation is extremely questionable as well. Adding > > > > the LKML > > > > and the MM list as well. > > > > > > > > First of all hmm_range_fault() does *not* pin anything! > > > > > > > > In other words you don't have a page reference when the function > > > > returns, > > > > but rather just a sequence number you can check for modifications. > > > > > > I think it's all there, holds the invalidation lock during the > > > critical > > > access/section, drops it when reacquiring pages, retries until it > > > works. > > > > > > I think the issue is more that everyone hand-rolls userptr. Probably > > > time > > > we standardize that and put it into gpuvm as an optional part, with > > > consistent locking, naming (like not calling it _pin_pages when it's > > > unpinnged userptr), kerneldoc and all the nice things so that we > > > stop consistently getting confused by other driver's userptr code. > > > > > > I think that was on the plan originally as an eventual step, I guess > > > time > > > to pump that up. Matt/Thomas, thoughts? > > > > It looks like we have this planned and ongoing but there are some > > complications and thoughts. > > > > 1) A drm_gpuvm implementation would be based on vma userptrs, and would > > be pretty straightforward based on xe's current implementation and, as > > you say, renaming. > > My thoughts... Standardize gpuvm userptr gpuvmas a bit. In Xe I think we basically set the BO to NULL in the gpuvmas then have some helpers in Xe to determine if gpuvma is a userptr. I think some of this code could be moved into gpuvm so drivers are doing this in a standard way. I think NULL bindings also set the BO to NULL too, perhaps we standardize that too in gpuvm. > > 2) Current Intel work to land this on the drm level is based on > > drm_gpusvm (minus migration to VRAM). I'm not fully sure yet how this > > will integrate with drm_gpuvm. > > Implement the userptr locking / page collection (i.e. hmm_range_fault call) on top of gpusvm. Perhaps decouple the current page collection from drm_gpusvm_range into an embedded struct like drm_gpusvm_devmem. The plan was to more or less land gpusvm, which is on the list addressing Thomas's feedback, before doing the userptr rework on top. As of now, a different engineer will own this rework. Ofc, with Thomas and myself providing guidance and welcoming community input. Xe will likely be the first user of this, so if we have to tweak this as more drivers start to use it, ofc that is fine and we will be open to any changes. > > 3) Christian mentioned a plan to have a common userptr implementation > > based off drm_exec. I figure that would be bo-based like the amdgpu > > implemeentation still is. Possibly i915 would be interested in this but > > I think any VM_BIND based driver would want to use drm_gpuvm / > > drm_gpusvm implementation, which is also typically O(1), since userptrs > > are considered vm-local. 
I don't think any new driver would want a userptr implementation based on drm_exec because of having to use BOs, and this isn't required if drm_gpuvm / drm_gpusvm is used, which I suspect all new drivers will use. Sure, it could be useful for amdgpu / i915, but for Xe we certainly wouldn't want this, nor would a VM bind only driver. > > > > Ideas / suggestions welcome > > So just discussed this a bit with Joonas, and if we use access_remote_vm > for the userptr access instead of hand-rolling then we really only need > bare-bones data structure changes in gpuvm, and nothing more. So > > - add the mm pointer to struct drm_gpuvm > - add a flag indicating that it's a userptr + userspace address to struct > drm_gpuva > - since we already have userptr in drivers I guess there should be any > need to adjust the actual drm_gpuvm code to cope with these > > Then with this you can write the access helper using access_remote_vm > since that does the entire remote mm walking internally, and so there's > no need to also have all the mmu notifier and locking lifted to gpuvm. But > it does already give us some great places to put relevant kerneldocs (not > just for debugging architecture, but userptr stuff in general), which is > already a solid step forward. > > Plus I think it'd would also be a solid first step that we need no matter > what for figuring out the questions/options you have above. > > Thoughts? This seems like it could work with everything I've written above. Maybe this lives in gpusvm though so we have a clear divide where gpuvm is GPU address space, and gpusvm is CPU address space. Kinda a bikeshed, but I agree in general that if we need to access / modify userptrs this lives in common code. Do we view this userptr rework as a blocker for EuDebug? My thinking was we don't, as we (Intel) have fully committed to a common userptr implementation. FWIW, I really don't think the implementation in this patch is the right one, and I stated this many times, but that feedback seems to have been ignored yet again. I'd prefer an open-coded hmm_range_fault loop for now rather than a new xe_res_cursor concept that will get thrown away. 
Matt > -Sima > > > > > > -Sima > > > > > > > > > > > > v2: pin pages vs notifier, move to vm.c (Matthew) > > > > > v3: - iterate over system pages instead of DMA, fixes iommu > > > > > enabled > > > > > - s/xe_uvma_access/xe_vm_uvma_access/ (Matt) > > > > > > > > > > Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com> > > > > > Signed-off-by: Maciej Patelczyk <maciej.patelczyk@intel.com> > > > > > Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> > > > > > Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> #v1 > > > > > --- > > > > > drivers/gpu/drm/xe/xe_eudebug.c | 3 ++- > > > > > drivers/gpu/drm/xe/xe_vm.c | 47 > > > > > +++++++++++++++++++++++++++++++++ > > > > > drivers/gpu/drm/xe/xe_vm.h | 3 +++ > > > > > 3 files changed, 52 insertions(+), 1 deletion(-) > > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_eudebug.c > > > > > b/drivers/gpu/drm/xe/xe_eudebug.c > > > > > index 9d87df75348b..e5949e4dcad8 100644 > > > > > --- a/drivers/gpu/drm/xe/xe_eudebug.c > > > > > +++ b/drivers/gpu/drm/xe/xe_eudebug.c > > > > > @@ -3076,7 +3076,8 @@ static int xe_eudebug_vma_access(struct > > > > > xe_vma *vma, u64 offset_in_vma, > > > > > return ret; > > > > > } > > > > > - return -EINVAL; > > > > > + return xe_vm_userptr_access(to_userptr_vma(vma), > > > > > offset_in_vma, > > > > > + buf, bytes, write); > > > > > } > > > > > static int xe_eudebug_vm_access(struct xe_vm *vm, u64 offset, > > > > > diff --git a/drivers/gpu/drm/xe/xe_vm.c > > > > > b/drivers/gpu/drm/xe/xe_vm.c > > > > > index 0f17bc8b627b..224ff9e16941 100644 > > > > > --- a/drivers/gpu/drm/xe/xe_vm.c > > > > > +++ b/drivers/gpu/drm/xe/xe_vm.c > > > > > @@ -3414,3 +3414,50 @@ void xe_vm_snapshot_free(struct > > > > > xe_vm_snapshot *snap) > > > > > } > > > > > kvfree(snap); > > > > > } > > > > > + > > > > > +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 > > > > > offset, > > > > > + void *buf, u64 len, bool write) > > > > > +{ > > > > > + struct xe_vm *vm = xe_vma_vm(&uvma->vma); > > > > > + struct xe_userptr *up = &uvma->userptr; > > > > > + struct xe_res_cursor cur = {}; > > > > > + int cur_len, ret = 0; > > > > > + > > > > > + while (true) { > > > > > + down_read(&vm->userptr.notifier_lock); > > > > > + if (!xe_vma_userptr_check_repin(uvma)) > > > > > + break; > > > > > + > > > > > + spin_lock(&vm->userptr.invalidated_lock); > > > > > + list_del_init(&uvma->userptr.invalidate_link); > > > > > + spin_unlock(&vm->userptr.invalidated_lock); > > > > > + > > > > > + up_read(&vm->userptr.notifier_lock); > > > > > + ret = xe_vma_userptr_pin_pages(uvma); > > > > > + if (ret) > > > > > + return ret; > > > > > + } > > > > > + > > > > > + if (!up->sg) { > > > > > + ret = -EINVAL; > > > > > + goto out_unlock_notifier; > > > > > + } > > > > > + > > > > > + for (xe_res_first_sg_system(up->sg, offset, len, &cur); > > > > > cur.remaining; > > > > > + xe_res_next(&cur, cur_len)) { > > > > > + void *ptr = kmap_local_page(sg_page(cur.sgl)) + > > > > > cur.start; > > > > > > > > The interface basically creates a side channel to access userptrs > > > > in the way > > > > an userspace application would do without actually going through > > > > userspace. > > > > > > > > That is generally not something a device driver should ever do as > > > > far as I > > > > can see. 
> > > > > > > > > + > > > > > + cur_len = min(cur.size, cur.remaining); > > > > > + if (write) > > > > > + memcpy(ptr, buf, cur_len); > > > > > + else > > > > > + memcpy(buf, ptr, cur_len); > > > > > + kunmap_local(ptr); > > > > > + buf += cur_len; > > > > > + } > > > > > + ret = len; > > > > > + > > > > > +out_unlock_notifier: > > > > > + up_read(&vm->userptr.notifier_lock); > > > > > > > > I just strongly hope that this will prevent the mapping from > > > > changing. > > > > > > > > Regards, > > > > Christian. > > > > > > > > > + return ret; > > > > > +} > > > > > diff --git a/drivers/gpu/drm/xe/xe_vm.h > > > > > b/drivers/gpu/drm/xe/xe_vm.h > > > > > index 23adb7442881..372ad40ad67f 100644 > > > > > --- a/drivers/gpu/drm/xe/xe_vm.h > > > > > +++ b/drivers/gpu/drm/xe/xe_vm.h > > > > > @@ -280,3 +280,6 @@ struct xe_vm_snapshot > > > > > *xe_vm_snapshot_capture(struct xe_vm *vm); > > > > > void xe_vm_snapshot_capture_delayed(struct xe_vm_snapshot > > > > > *snap); > > > > > void xe_vm_snapshot_print(struct xe_vm_snapshot *snap, struct > > > > > drm_printer *p); > > > > > void xe_vm_snapshot_free(struct xe_vm_snapshot *snap); > > > > > + > > > > > +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 > > > > > offset, > > > > > + void *buf, u64 len, bool write); > > > > > > > > > > > -- > Simona Vetter > Software Engineer, Intel Corporation > http://blog.ffwll.ch
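For illustration, a minimal sketch of the access helper Simona suggests above, built on access_remote_vm(). The proposed mm pointer in struct drm_gpuvm and the userptr address/flag in struct drm_gpuva do not exist yet, so the sketch takes them as plain parameters; note also that access_remote_vm() is not currently exported to modules, which is the symbol-export question that comes up later in the thread.

#include <linux/mm.h>
#include <linux/sched/mm.h>

/* Sketch only: mm would come from the proposed drm_gpuvm mm pointer and
 * uaddr from a userptr-flagged drm_gpuva; neither field exists today. */
static int gpuvm_userptr_access(struct mm_struct *mm, unsigned long uaddr,
				void *buf, int len, bool write)
{
	unsigned int gup_flags = write ? FOLL_WRITE : 0;
	int ret;

	if (!mmget_not_zero(mm))
		return -EFAULT;

	/* access_remote_vm() walks the remote mm, faults pages in and does
	 * the copy itself, so no driver-side kmap or mmu-notifier handling
	 * is needed on the access path. */
	ret = access_remote_vm(mm, uaddr, buf, len, gup_flags);

	mmput(mm);
	return ret;
}

Like access_remote_vm() itself, this returns the number of bytes actually copied, which may be fewer than len.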
Quoting Joonas Lahtinen (2024-12-11 14:59:33) > Quoting Christian König (2024-12-10 16:03:14) <SNIP> > > If you really want to expose an interface to userspace which walks the process > > page table, installs an MMU notifier, kmaps the resulting page and then memcpy > > to/from it then you absolutely *must* run that by guys like Christoph Hellwig, > > Andrew and even Linus. > > I'm pretty sure that those guys will note that a device driver should > > absolutely not mess with such stuff. <SNIP> > > But that seems like a high-overhead thing to do due to the overhead of > > setting up a transfer per data word and going over the PCI bus twice > > compared to accessing the memory directly by CPU when it trivially can. > > > > > > Understandable, but that will create another way of accessing process memory. Based on this feedback and some further discussion, we now have an alternative implementation of this interface via the access_process_vm function, posted by Mika: https://lore.kernel.org/dri-devel/20241216141721.2051279-1-mika.kuoppala@linux.intel.com/ It's a couple of dozen lines, doesn't need to do any open-coded kmapping, and only utilizes the pre-existing memory access functions. Hopefully that addresses the above concerns? Regards, Joonas PS. It could still be optimized further to directly use the struct mm from within the mm notifier, and go with access_remote_vm using that, but that would require a new symbol export. For demonstration, it is implemented by grabbing the task_struct and using the already exported access_process_vm function.
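For reference, a rough sketch of the access_process_vm() based variant described above; the function name is made up and how the debugger obtains the task_struct reference is deliberately left as a parameter, so this illustrates the shape rather than the posted patch.

#include <linux/mm.h>
#include <linux/sched.h>

/* Illustrative only; the name and how the task reference reaches the
 * debugger are assumptions, not taken from the posted patch. */
static int eudebug_userptr_access(struct task_struct *task, unsigned long uaddr,
				  void *buf, int len, bool write)
{
	unsigned int gup_flags = write ? FOLL_WRITE : 0;

	/* access_process_vm() takes its own reference on the task's mm and
	 * does the page walking, faulting, mapping and copying internally,
	 * which is what removes the open-coded kmap loop. */
	return access_process_vm(task, uaddr, buf, len, gup_flags);
}

As the PS above notes, calling access_remote_vm() directly on the struct mm already held via the mmu notifier would avoid the task lookup, at the cost of exporting that symbol.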
Joonas Lahtinen <joonas.lahtinen@linux.intel.com> writes: > Quoting Joonas Lahtinen (2024-12-11 14:59:33) >> Quoting Christian König (2024-12-10 16:03:14) > > <SNIP> > >> > If you really want to expose an interface to userspace which walks the process >> > page table, installs an MMU notifier, kmaps the resulting page and then memcpy >> > to/from it then you absolutely *must* run that by guys like Christoph Hellwig, >> > Andrew and even Linus. > >> > I'm pretty sure that those guys will note that a device driver should >> > absolutely not mess with such stuff. > > <SNIP> > >> > But that seems like a high-overhead thing to do due to the overhead of >> > setting up a transfer per data word and going over the PCI bus twice >> > compared to accessing the memory directly by CPU when it trivially can. >> > >> > >> > Understandable, but that will create another way of accessing process memory. > > Based on this feedback and some further discussion, we now have an alternative > implementation for this interface via access_process_vm function posted by Mika: > > https://lore.kernel.org/dri-devel/20241216141721.2051279-1-mika.kuoppala@linux.intel.com/ v2: https://lore.kernel.org/dri-devel/20241220113108.2386842-1-mika.kuoppala@linux.intel.com/ -Mika > > It's a couple of dozen lines don't need to do any open-coded kmapping, only utilizing > the pre-existing memory access functions. > > Hopefully that would address the above concerns? > > Regards, Joonas > > PS. It could still be optimized further to directly use the struct mm > from within the mm notifier, and go with access_remote_vm using that, > but would require new symbol export. > > For demonstration it is implemented by grabbing the task_struct and using > the already exported access_process_vm function.
diff --git a/drivers/gpu/drm/xe/xe_eudebug.c b/drivers/gpu/drm/xe/xe_eudebug.c index 9d87df75348b..e5949e4dcad8 100644 --- a/drivers/gpu/drm/xe/xe_eudebug.c +++ b/drivers/gpu/drm/xe/xe_eudebug.c @@ -3076,7 +3076,8 @@ static int xe_eudebug_vma_access(struct xe_vma *vma, u64 offset_in_vma, return ret; } - return -EINVAL; + return xe_vm_userptr_access(to_userptr_vma(vma), offset_in_vma, + buf, bytes, write); } static int xe_eudebug_vm_access(struct xe_vm *vm, u64 offset, diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c index 0f17bc8b627b..224ff9e16941 100644 --- a/drivers/gpu/drm/xe/xe_vm.c +++ b/drivers/gpu/drm/xe/xe_vm.c @@ -3414,3 +3414,50 @@ void xe_vm_snapshot_free(struct xe_vm_snapshot *snap) } kvfree(snap); } + +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 offset, + void *buf, u64 len, bool write) +{ + struct xe_vm *vm = xe_vma_vm(&uvma->vma); + struct xe_userptr *up = &uvma->userptr; + struct xe_res_cursor cur = {}; + int cur_len, ret = 0; + + while (true) { + down_read(&vm->userptr.notifier_lock); + if (!xe_vma_userptr_check_repin(uvma)) + break; + + spin_lock(&vm->userptr.invalidated_lock); + list_del_init(&uvma->userptr.invalidate_link); + spin_unlock(&vm->userptr.invalidated_lock); + + up_read(&vm->userptr.notifier_lock); + ret = xe_vma_userptr_pin_pages(uvma); + if (ret) + return ret; + } + + if (!up->sg) { + ret = -EINVAL; + goto out_unlock_notifier; + } + + for (xe_res_first_sg_system(up->sg, offset, len, &cur); cur.remaining; + xe_res_next(&cur, cur_len)) { + void *ptr = kmap_local_page(sg_page(cur.sgl)) + cur.start; + + cur_len = min(cur.size, cur.remaining); + if (write) + memcpy(ptr, buf, cur_len); + else + memcpy(buf, ptr, cur_len); + kunmap_local(ptr); + buf += cur_len; + } + ret = len; + +out_unlock_notifier: + up_read(&vm->userptr.notifier_lock); + return ret; +} diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h index 23adb7442881..372ad40ad67f 100644 --- a/drivers/gpu/drm/xe/xe_vm.h +++ b/drivers/gpu/drm/xe/xe_vm.h @@ -280,3 +280,6 @@ struct xe_vm_snapshot *xe_vm_snapshot_capture(struct xe_vm *vm); void xe_vm_snapshot_capture_delayed(struct xe_vm_snapshot *snap); void xe_vm_snapshot_print(struct xe_vm_snapshot *snap, struct drm_printer *p); void xe_vm_snapshot_free(struct xe_vm_snapshot *snap); + +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 offset, + void *buf, u64 len, bool write);