Message ID | 20241209133318.1806472-15-mika.kuoppala@linux.intel.com (mailing list archive)
---|---
State | New, archived
Series | Intel Xe GPU debug support (eudebug) v3
Am 09.12.24 um 14:33 schrieb Mika Kuoppala: > From: Andrzej Hajda <andrzej.hajda@intel.com> > > Debugger needs to read/write program's vmas including userptr_vma. > Since hmm_range_fault is used to pin userptr vmas, it is possible > to map those vmas from debugger context. Oh, this implementation is extremely questionable as well. Adding the LKML and the MM list as well. First of all hmm_range_fault() does *not* pin anything! In other words you don't have a page reference when the function returns, but rather just a sequence number you can check for modifications. > v2: pin pages vs notifier, move to vm.c (Matthew) > v3: - iterate over system pages instead of DMA, fixes iommu enabled > - s/xe_uvma_access/xe_vm_uvma_access/ (Matt) > > Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com> > Signed-off-by: Maciej Patelczyk <maciej.patelczyk@intel.com> > Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> > Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> #v1 > --- > drivers/gpu/drm/xe/xe_eudebug.c | 3 ++- > drivers/gpu/drm/xe/xe_vm.c | 47 +++++++++++++++++++++++++++++++++ > drivers/gpu/drm/xe/xe_vm.h | 3 +++ > 3 files changed, 52 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/xe/xe_eudebug.c b/drivers/gpu/drm/xe/xe_eudebug.c > index 9d87df75348b..e5949e4dcad8 100644 > --- a/drivers/gpu/drm/xe/xe_eudebug.c > +++ b/drivers/gpu/drm/xe/xe_eudebug.c > @@ -3076,7 +3076,8 @@ static int xe_eudebug_vma_access(struct xe_vma *vma, u64 offset_in_vma, > return ret; > } > > - return -EINVAL; > + return xe_vm_userptr_access(to_userptr_vma(vma), offset_in_vma, > + buf, bytes, write); > } > > static int xe_eudebug_vm_access(struct xe_vm *vm, u64 offset, > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c > index 0f17bc8b627b..224ff9e16941 100644 > --- a/drivers/gpu/drm/xe/xe_vm.c > +++ b/drivers/gpu/drm/xe/xe_vm.c > @@ -3414,3 +3414,50 @@ void xe_vm_snapshot_free(struct xe_vm_snapshot *snap) > } > kvfree(snap); > } > + > +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 offset, > + void *buf, u64 len, bool write) > +{ > + struct xe_vm *vm = xe_vma_vm(&uvma->vma); > + struct xe_userptr *up = &uvma->userptr; > + struct xe_res_cursor cur = {}; > + int cur_len, ret = 0; > + > + while (true) { > + down_read(&vm->userptr.notifier_lock); > + if (!xe_vma_userptr_check_repin(uvma)) > + break; > + > + spin_lock(&vm->userptr.invalidated_lock); > + list_del_init(&uvma->userptr.invalidate_link); > + spin_unlock(&vm->userptr.invalidated_lock); > + > + up_read(&vm->userptr.notifier_lock); > + ret = xe_vma_userptr_pin_pages(uvma); > + if (ret) > + return ret; > + } > + > + if (!up->sg) { > + ret = -EINVAL; > + goto out_unlock_notifier; > + } > + > + for (xe_res_first_sg_system(up->sg, offset, len, &cur); cur.remaining; > + xe_res_next(&cur, cur_len)) { > + void *ptr = kmap_local_page(sg_page(cur.sgl)) + cur.start; The interface basically creates a side channel to access userptrs in the way an userspace application would do without actually going through userspace. That is generally not something a device driver should ever do as far as I can see. > + > + cur_len = min(cur.size, cur.remaining); > + if (write) > + memcpy(ptr, buf, cur_len); > + else > + memcpy(buf, ptr, cur_len); > + kunmap_local(ptr); > + buf += cur_len; > + } > + ret = len; > + > +out_unlock_notifier: > + up_read(&vm->userptr.notifier_lock); I just strongly hope that this will prevent the mapping from changing. Regards, Christian. 
> + return ret; > +} > diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h > index 23adb7442881..372ad40ad67f 100644 > --- a/drivers/gpu/drm/xe/xe_vm.h > +++ b/drivers/gpu/drm/xe/xe_vm.h > @@ -280,3 +280,6 @@ struct xe_vm_snapshot *xe_vm_snapshot_capture(struct xe_vm *vm); > void xe_vm_snapshot_capture_delayed(struct xe_vm_snapshot *snap); > void xe_vm_snapshot_print(struct xe_vm_snapshot *snap, struct drm_printer *p); > void xe_vm_snapshot_free(struct xe_vm_snapshot *snap); > + > +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 offset, > + void *buf, u64 len, bool write);
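For reference, the point Christian makes is that hmm_range_fault() takes no page references; the returned pfns may only be used while the associated mmu_interval_notifier sequence is unchanged, which the caller must recheck under its own lock. A minimal sketch of that canonical pattern, following Documentation/mm/hmm.rst (function and lock names are illustrative, not Xe code):

```c
/*
 * Sketch of the documented hmm_range_fault() usage pattern: no page
 * references are taken; the pfns may only be consumed while
 * mmu_interval_read_retry() confirms no invalidation happened since
 * mmu_interval_read_begin().
 */
#include <linux/hmm.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>
#include <linux/mutex.h>

static int example_fault_and_use(struct mmu_interval_notifier *notifier,
				 struct mutex *pt_lock,
				 unsigned long start, unsigned long end,
				 unsigned long *pfns)
{
	struct hmm_range range = {
		.notifier = notifier,
		.start = start,
		.end = end,
		.hmm_pfns = pfns,
		.default_flags = HMM_PFN_REQ_FAULT,
	};
	struct mm_struct *mm = notifier->mm;
	int ret;

	do {
		range.notifier_seq = mmu_interval_read_begin(notifier);

		mmap_read_lock(mm);
		ret = hmm_range_fault(&range);
		mmap_read_unlock(mm);
		if (ret) {
			if (ret == -EBUSY)
				continue;	/* raced with invalidation, retry */
			return ret;
		}

		mutex_lock(pt_lock);
		if (mmu_interval_read_retry(notifier, range.notifier_seq)) {
			mutex_unlock(pt_lock);	/* invalidated meanwhile, retry */
			continue;
		}
		break;
	} while (true);

	/* pfns are only stable while pt_lock is held and the retry check passed */
	mutex_unlock(pt_lock);
	return 0;
}
```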
(+ Thomas and Matt B who this was reviewed with as a concept) Quoting Christian König (2024-12-09 16:03:04) > Am 09.12.24 um 14:33 schrieb Mika Kuoppala: <SNIP> > > +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 offset, > > + void *buf, u64 len, bool write) > > +{ > > + struct xe_vm *vm = xe_vma_vm(&uvma->vma); > > + struct xe_userptr *up = &uvma->userptr; > > + struct xe_res_cursor cur = {}; > > + int cur_len, ret = 0; > > + > > + while (true) { > > + down_read(&vm->userptr.notifier_lock); > > + if (!xe_vma_userptr_check_repin(uvma)) > > + break; > > + > > + spin_lock(&vm->userptr.invalidated_lock); > > + list_del_init(&uvma->userptr.invalidate_link); > > + spin_unlock(&vm->userptr.invalidated_lock); > > + > > + up_read(&vm->userptr.notifier_lock); > > + ret = xe_vma_userptr_pin_pages(uvma); > > + if (ret) > > + return ret; > > + } > > + > > + if (!up->sg) { > > + ret = -EINVAL; > > + goto out_unlock_notifier; > > + } > > + > > + for (xe_res_first_sg_system(up->sg, offset, len, &cur); cur.remaining; > > + xe_res_next(&cur, cur_len)) { > > + void *ptr = kmap_local_page(sg_page(cur.sgl)) + cur.start; > > The interface basically creates a side channel to access userptrs in the > way an userspace application would do without actually going through > userspace. That's not quite the case here. The whole point of the debugger ability to access memory is to access any VMA in the GPU VM emulating as much as possible like the EUs themselves would do the access. As mentioned in the other threads, that also involves special cache flushes to make sure the contents has been flushed in and out of the EU caches in case we're modifying instruction code for breakpoints as an example. What the memory access function should do is to establish a temporary pinning situation where the memory would be accessible just like it would be for a short batchbuffer, but without interfering with the command streamers. If this should be established in a different way from this patch, then we should adapt of course. > That is generally not something a device driver should ever do as far as > I can see. Given above explanation, does it make more sense? For debugging purposes, we try to emulate the EU threads themselves accessing the memory, not the userspace at all. Regards, Joonas
On Mon, Dec 09, 2024 at 03:03:04PM +0100, Christian König wrote: > Am 09.12.24 um 14:33 schrieb Mika Kuoppala: > > From: Andrzej Hajda <andrzej.hajda@intel.com> > > > > Debugger needs to read/write program's vmas including userptr_vma. > > Since hmm_range_fault is used to pin userptr vmas, it is possible > > to map those vmas from debugger context. > > Oh, this implementation is extremely questionable as well. Adding the LKML > and the MM list as well. > > First of all hmm_range_fault() does *not* pin anything! > > In other words you don't have a page reference when the function returns, > but rather just a sequence number you can check for modifications. I think it's all there, holds the invalidation lock during the critical access/section, drops it when reacquiring pages, retries until it works. I think the issue is more that everyone hand-rolls userptr. Probably time we standardize that and put it into gpuvm as an optional part, with consistent locking, naming (like not calling it _pin_pages when it's unpinnged userptr), kerneldoc and all the nice things so that we stop consistently getting confused by other driver's userptr code. I think that was on the plan originally as an eventual step, I guess time to pump that up. Matt/Thomas, thoughts? -Sima > > > v2: pin pages vs notifier, move to vm.c (Matthew) > > v3: - iterate over system pages instead of DMA, fixes iommu enabled > > - s/xe_uvma_access/xe_vm_uvma_access/ (Matt) > > > > Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com> > > Signed-off-by: Maciej Patelczyk <maciej.patelczyk@intel.com> > > Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> > > Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> #v1 > > --- > > drivers/gpu/drm/xe/xe_eudebug.c | 3 ++- > > drivers/gpu/drm/xe/xe_vm.c | 47 +++++++++++++++++++++++++++++++++ > > drivers/gpu/drm/xe/xe_vm.h | 3 +++ > > 3 files changed, 52 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/gpu/drm/xe/xe_eudebug.c b/drivers/gpu/drm/xe/xe_eudebug.c > > index 9d87df75348b..e5949e4dcad8 100644 > > --- a/drivers/gpu/drm/xe/xe_eudebug.c > > +++ b/drivers/gpu/drm/xe/xe_eudebug.c > > @@ -3076,7 +3076,8 @@ static int xe_eudebug_vma_access(struct xe_vma *vma, u64 offset_in_vma, > > return ret; > > } > > - return -EINVAL; > > + return xe_vm_userptr_access(to_userptr_vma(vma), offset_in_vma, > > + buf, bytes, write); > > } > > static int xe_eudebug_vm_access(struct xe_vm *vm, u64 offset, > > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c > > index 0f17bc8b627b..224ff9e16941 100644 > > --- a/drivers/gpu/drm/xe/xe_vm.c > > +++ b/drivers/gpu/drm/xe/xe_vm.c > > @@ -3414,3 +3414,50 @@ void xe_vm_snapshot_free(struct xe_vm_snapshot *snap) > > } > > kvfree(snap); > > } > > + > > +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 offset, > > + void *buf, u64 len, bool write) > > +{ > > + struct xe_vm *vm = xe_vma_vm(&uvma->vma); > > + struct xe_userptr *up = &uvma->userptr; > > + struct xe_res_cursor cur = {}; > > + int cur_len, ret = 0; > > + > > + while (true) { > > + down_read(&vm->userptr.notifier_lock); > > + if (!xe_vma_userptr_check_repin(uvma)) > > + break; > > + > > + spin_lock(&vm->userptr.invalidated_lock); > > + list_del_init(&uvma->userptr.invalidate_link); > > + spin_unlock(&vm->userptr.invalidated_lock); > > + > > + up_read(&vm->userptr.notifier_lock); > > + ret = xe_vma_userptr_pin_pages(uvma); > > + if (ret) > > + return ret; > > + } > > + > > + if (!up->sg) { > > + ret = -EINVAL; > > + goto out_unlock_notifier; > > + } > > + > > + for 
(xe_res_first_sg_system(up->sg, offset, len, &cur); cur.remaining; > > + xe_res_next(&cur, cur_len)) { > > + void *ptr = kmap_local_page(sg_page(cur.sgl)) + cur.start; > > The interface basically creates a side channel to access userptrs in the way > an userspace application would do without actually going through userspace. > > That is generally not something a device driver should ever do as far as I > can see. > > > + > > + cur_len = min(cur.size, cur.remaining); > > + if (write) > > + memcpy(ptr, buf, cur_len); > > + else > > + memcpy(buf, ptr, cur_len); > > + kunmap_local(ptr); > > + buf += cur_len; > > + } > > + ret = len; > > + > > +out_unlock_notifier: > > + up_read(&vm->userptr.notifier_lock); > > I just strongly hope that this will prevent the mapping from changing. > > Regards, > Christian. > > > + return ret; > > +} > > diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h > > index 23adb7442881..372ad40ad67f 100644 > > --- a/drivers/gpu/drm/xe/xe_vm.h > > +++ b/drivers/gpu/drm/xe/xe_vm.h > > @@ -280,3 +280,6 @@ struct xe_vm_snapshot *xe_vm_snapshot_capture(struct xe_vm *vm); > > void xe_vm_snapshot_capture_delayed(struct xe_vm_snapshot *snap); > > void xe_vm_snapshot_print(struct xe_vm_snapshot *snap, struct drm_printer *p); > > void xe_vm_snapshot_free(struct xe_vm_snapshot *snap); > > + > > +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 offset, > > + void *buf, u64 len, bool write); >
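As a condensed restatement of the loop Sima is describing (for readers following the locking argument; the invalidated-list bookkeeping from the patch is elided, and this is not a replacement for the posted code):

```c
/*
 * Essence of the retry structure in xe_vm_userptr_access() above: take
 * the notifier lock, check whether the pages were invalidated since the
 * last fault, and if so drop the lock, re-fault and try again.  The sg
 * list may only be dereferenced while the lock is held and the check
 * has passed.
 */
static int userptr_access_locked(struct xe_vm *vm, struct xe_userptr_vma *uvma)
{
	int ret;

	for (;;) {
		down_read(&vm->userptr.notifier_lock);
		if (!xe_vma_userptr_check_repin(uvma))
			break;			/* pages still valid: keep the lock */
		up_read(&vm->userptr.notifier_lock);

		ret = xe_vma_userptr_pin_pages(uvma);	/* hmm_range_fault() under the hood */
		if (ret)
			return ret;
	}

	/* ... kmap_local_page() + memcpy over uvma->userptr.sg here ... */

	up_read(&vm->userptr.notifier_lock);
	return 0;
}
```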
Am 09.12.24 um 16:31 schrieb Simona Vetter: > On Mon, Dec 09, 2024 at 03:03:04PM +0100, Christian König wrote: >> Am 09.12.24 um 14:33 schrieb Mika Kuoppala: >>> From: Andrzej Hajda <andrzej.hajda@intel.com> >>> >>> Debugger needs to read/write program's vmas including userptr_vma. >>> Since hmm_range_fault is used to pin userptr vmas, it is possible >>> to map those vmas from debugger context. >> Oh, this implementation is extremely questionable as well. Adding the LKML >> and the MM list as well. >> >> First of all hmm_range_fault() does *not* pin anything! >> >> In other words you don't have a page reference when the function returns, >> but rather just a sequence number you can check for modifications. > I think it's all there, holds the invalidation lock during the critical > access/section, drops it when reacquiring pages, retries until it works. > > I think the issue is more that everyone hand-rolls userptr. Well that is part of the issue. The general problem here is that the eudebug interface tries to simulate the memory accesses as they would have happened by the hardware. What the debugger should probably do is to cleanly attach to the application, get the information which CPU address is mapped to which GPU address and then use the standard ptrace interfaces. The whole interface re-invents a lot of functionality which is already there just because you don't like the idea to attach to the debugged application in userspace. As far as I can see this whole idea is extremely questionable. This looks like re-inventing the wheel in a different color. Regards, Christian. > Probably time > we standardize that and put it into gpuvm as an optional part, with > consistent locking, naming (like not calling it _pin_pages when it's > unpinnged userptr), kerneldoc and all the nice things so that we > stop consistently getting confused by other driver's userptr code. > > I think that was on the plan originally as an eventual step, I guess time > to pump that up. Matt/Thomas, thoughts? 
> -Sima > >>> v2: pin pages vs notifier, move to vm.c (Matthew) >>> v3: - iterate over system pages instead of DMA, fixes iommu enabled >>> - s/xe_uvma_access/xe_vm_uvma_access/ (Matt) >>> >>> Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com> >>> Signed-off-by: Maciej Patelczyk <maciej.patelczyk@intel.com> >>> Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> >>> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> #v1 >>> --- >>> drivers/gpu/drm/xe/xe_eudebug.c | 3 ++- >>> drivers/gpu/drm/xe/xe_vm.c | 47 +++++++++++++++++++++++++++++++++ >>> drivers/gpu/drm/xe/xe_vm.h | 3 +++ >>> 3 files changed, 52 insertions(+), 1 deletion(-) >>> >>> diff --git a/drivers/gpu/drm/xe/xe_eudebug.c b/drivers/gpu/drm/xe/xe_eudebug.c >>> index 9d87df75348b..e5949e4dcad8 100644 >>> --- a/drivers/gpu/drm/xe/xe_eudebug.c >>> +++ b/drivers/gpu/drm/xe/xe_eudebug.c >>> @@ -3076,7 +3076,8 @@ static int xe_eudebug_vma_access(struct xe_vma *vma, u64 offset_in_vma, >>> return ret; >>> } >>> - return -EINVAL; >>> + return xe_vm_userptr_access(to_userptr_vma(vma), offset_in_vma, >>> + buf, bytes, write); >>> } >>> static int xe_eudebug_vm_access(struct xe_vm *vm, u64 offset, >>> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c >>> index 0f17bc8b627b..224ff9e16941 100644 >>> --- a/drivers/gpu/drm/xe/xe_vm.c >>> +++ b/drivers/gpu/drm/xe/xe_vm.c >>> @@ -3414,3 +3414,50 @@ void xe_vm_snapshot_free(struct xe_vm_snapshot *snap) >>> } >>> kvfree(snap); >>> } >>> + >>> +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 offset, >>> + void *buf, u64 len, bool write) >>> +{ >>> + struct xe_vm *vm = xe_vma_vm(&uvma->vma); >>> + struct xe_userptr *up = &uvma->userptr; >>> + struct xe_res_cursor cur = {}; >>> + int cur_len, ret = 0; >>> + >>> + while (true) { >>> + down_read(&vm->userptr.notifier_lock); >>> + if (!xe_vma_userptr_check_repin(uvma)) >>> + break; >>> + >>> + spin_lock(&vm->userptr.invalidated_lock); >>> + list_del_init(&uvma->userptr.invalidate_link); >>> + spin_unlock(&vm->userptr.invalidated_lock); >>> + >>> + up_read(&vm->userptr.notifier_lock); >>> + ret = xe_vma_userptr_pin_pages(uvma); >>> + if (ret) >>> + return ret; >>> + } >>> + >>> + if (!up->sg) { >>> + ret = -EINVAL; >>> + goto out_unlock_notifier; >>> + } >>> + >>> + for (xe_res_first_sg_system(up->sg, offset, len, &cur); cur.remaining; >>> + xe_res_next(&cur, cur_len)) { >>> + void *ptr = kmap_local_page(sg_page(cur.sgl)) + cur.start; >> The interface basically creates a side channel to access userptrs in the way >> an userspace application would do without actually going through userspace. >> >> That is generally not something a device driver should ever do as far as I >> can see. >> >>> + >>> + cur_len = min(cur.size, cur.remaining); >>> + if (write) >>> + memcpy(ptr, buf, cur_len); >>> + else >>> + memcpy(buf, ptr, cur_len); >>> + kunmap_local(ptr); >>> + buf += cur_len; >>> + } >>> + ret = len; >>> + >>> +out_unlock_notifier: >>> + up_read(&vm->userptr.notifier_lock); >> I just strongly hope that this will prevent the mapping from changing. >> >> Regards, >> Christian. 
>> >>> + return ret; >>> +} >>> diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h >>> index 23adb7442881..372ad40ad67f 100644 >>> --- a/drivers/gpu/drm/xe/xe_vm.h >>> +++ b/drivers/gpu/drm/xe/xe_vm.h >>> @@ -280,3 +280,6 @@ struct xe_vm_snapshot *xe_vm_snapshot_capture(struct xe_vm *vm); >>> void xe_vm_snapshot_capture_delayed(struct xe_vm_snapshot *snap); >>> void xe_vm_snapshot_print(struct xe_vm_snapshot *snap, struct drm_printer *p); >>> void xe_vm_snapshot_free(struct xe_vm_snapshot *snap); >>> + >>> +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 offset, >>> + void *buf, u64 len, bool write);
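A rough userspace sketch of the alternative Christian proposes, where the debugger attaches to the CPU process and reuses the existing ptrace machinery; gpu_va_to_cpu_va() stands in for whatever GPU-to-CPU address translation the debugger would need and is purely hypothetical:

```c
#define _GNU_SOURCE
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical: maps a GPU VA back to a CPU VA in the target process. */
extern uintptr_t gpu_va_to_cpu_va(pid_t pid, uint64_t gpu_va);

static ssize_t debugger_read(pid_t pid, uint64_t gpu_va, void *buf, size_t len)
{
	struct iovec local = { .iov_base = buf, .iov_len = len };
	struct iovec remote;
	ssize_t ret;

	if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == -1)
		return -1;
	waitpid(pid, NULL, 0);			/* wait for the target to stop */

	remote.iov_base = (void *)gpu_va_to_cpu_va(pid, gpu_va);
	remote.iov_len = len;
	ret = process_vm_readv(pid, &local, 1, &remote, 1, 0);

	ptrace(PTRACE_DETACH, pid, NULL, NULL);
	return ret;
}
```

This is the flow that requires the memory to be CPU-mapped and the target's CPU threads to be stopped while the debugger works, which is one of the drawbacks Joonas raises below.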
Am 09.12.24 um 16:42 schrieb Christian König: > Am 09.12.24 um 16:31 schrieb Simona Vetter: >> On Mon, Dec 09, 2024 at 03:03:04PM +0100, Christian König wrote: >>> Am 09.12.24 um 14:33 schrieb Mika Kuoppala: >>>> From: Andrzej Hajda <andrzej.hajda@intel.com> >>>> >>>> Debugger needs to read/write program's vmas including userptr_vma. >>>> Since hmm_range_fault is used to pin userptr vmas, it is possible >>>> to map those vmas from debugger context. >>> Oh, this implementation is extremely questionable as well. Adding >>> the LKML >>> and the MM list as well. >>> >>> First of all hmm_range_fault() does *not* pin anything! >>> >>> In other words you don't have a page reference when the function >>> returns, >>> but rather just a sequence number you can check for modifications. >> I think it's all there, holds the invalidation lock during the critical >> access/section, drops it when reacquiring pages, retries until it works. One thing I'm missing: Is it possible that mappings are created read-only? E.g. that you have a read-only mapping of libc and then write through this interface to it? Of hand I don't see anything preventing this (well could be that you don't allow creating read-only mappings). Regards, Christian. >> >> I think the issue is more that everyone hand-rolls userptr. > > Well that is part of the issue. > > The general problem here is that the eudebug interface tries to > simulate the memory accesses as they would have happened by the hardware. > > What the debugger should probably do is to cleanly attach to the > application, get the information which CPU address is mapped to which > GPU address and then use the standard ptrace interfaces. > > The whole interface re-invents a lot of functionality which is already > there just because you don't like the idea to attach to the debugged > application in userspace. > > As far as I can see this whole idea is extremely questionable. This > looks like re-inventing the wheel in a different color. > > Regards, > Christian. > >> Probably time >> we standardize that and put it into gpuvm as an optional part, with >> consistent locking, naming (like not calling it _pin_pages when it's >> unpinnged userptr), kerneldoc and all the nice things so that we >> stop consistently getting confused by other driver's userptr code. >> >> I think that was on the plan originally as an eventual step, I guess >> time >> to pump that up. Matt/Thomas, thoughts? 
>> -Sima >> >>>> v2: pin pages vs notifier, move to vm.c (Matthew) >>>> v3: - iterate over system pages instead of DMA, fixes iommu enabled >>>> - s/xe_uvma_access/xe_vm_uvma_access/ (Matt) >>>> >>>> Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com> >>>> Signed-off-by: Maciej Patelczyk <maciej.patelczyk@intel.com> >>>> Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> >>>> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> #v1 >>>> --- >>>> drivers/gpu/drm/xe/xe_eudebug.c | 3 ++- >>>> drivers/gpu/drm/xe/xe_vm.c | 47 >>>> +++++++++++++++++++++++++++++++++ >>>> drivers/gpu/drm/xe/xe_vm.h | 3 +++ >>>> 3 files changed, 52 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/drivers/gpu/drm/xe/xe_eudebug.c >>>> b/drivers/gpu/drm/xe/xe_eudebug.c >>>> index 9d87df75348b..e5949e4dcad8 100644 >>>> --- a/drivers/gpu/drm/xe/xe_eudebug.c >>>> +++ b/drivers/gpu/drm/xe/xe_eudebug.c >>>> @@ -3076,7 +3076,8 @@ static int xe_eudebug_vma_access(struct >>>> xe_vma *vma, u64 offset_in_vma, >>>> return ret; >>>> } >>>> - return -EINVAL; >>>> + return xe_vm_userptr_access(to_userptr_vma(vma), offset_in_vma, >>>> + buf, bytes, write); >>>> } >>>> static int xe_eudebug_vm_access(struct xe_vm *vm, u64 offset, >>>> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c >>>> index 0f17bc8b627b..224ff9e16941 100644 >>>> --- a/drivers/gpu/drm/xe/xe_vm.c >>>> +++ b/drivers/gpu/drm/xe/xe_vm.c >>>> @@ -3414,3 +3414,50 @@ void xe_vm_snapshot_free(struct >>>> xe_vm_snapshot *snap) >>>> } >>>> kvfree(snap); >>>> } >>>> + >>>> +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 offset, >>>> + void *buf, u64 len, bool write) >>>> +{ >>>> + struct xe_vm *vm = xe_vma_vm(&uvma->vma); >>>> + struct xe_userptr *up = &uvma->userptr; >>>> + struct xe_res_cursor cur = {}; >>>> + int cur_len, ret = 0; >>>> + >>>> + while (true) { >>>> + down_read(&vm->userptr.notifier_lock); >>>> + if (!xe_vma_userptr_check_repin(uvma)) >>>> + break; >>>> + >>>> + spin_lock(&vm->userptr.invalidated_lock); >>>> + list_del_init(&uvma->userptr.invalidate_link); >>>> + spin_unlock(&vm->userptr.invalidated_lock); >>>> + >>>> + up_read(&vm->userptr.notifier_lock); >>>> + ret = xe_vma_userptr_pin_pages(uvma); >>>> + if (ret) >>>> + return ret; >>>> + } >>>> + >>>> + if (!up->sg) { >>>> + ret = -EINVAL; >>>> + goto out_unlock_notifier; >>>> + } >>>> + >>>> + for (xe_res_first_sg_system(up->sg, offset, len, &cur); >>>> cur.remaining; >>>> + xe_res_next(&cur, cur_len)) { >>>> + void *ptr = kmap_local_page(sg_page(cur.sgl)) + cur.start; >>> The interface basically creates a side channel to access userptrs in >>> the way >>> an userspace application would do without actually going through >>> userspace. >>> >>> That is generally not something a device driver should ever do as >>> far as I >>> can see. >>> >>>> + >>>> + cur_len = min(cur.size, cur.remaining); >>>> + if (write) >>>> + memcpy(ptr, buf, cur_len); >>>> + else >>>> + memcpy(buf, ptr, cur_len); >>>> + kunmap_local(ptr); >>>> + buf += cur_len; >>>> + } >>>> + ret = len; >>>> + >>>> +out_unlock_notifier: >>>> + up_read(&vm->userptr.notifier_lock); >>> I just strongly hope that this will prevent the mapping from changing. >>> >>> Regards, >>> Christian. 
>>> >>>> + return ret; >>>> +} >>>> diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h >>>> index 23adb7442881..372ad40ad67f 100644 >>>> --- a/drivers/gpu/drm/xe/xe_vm.h >>>> +++ b/drivers/gpu/drm/xe/xe_vm.h >>>> @@ -280,3 +280,6 @@ struct xe_vm_snapshot >>>> *xe_vm_snapshot_capture(struct xe_vm *vm); >>>> void xe_vm_snapshot_capture_delayed(struct xe_vm_snapshot *snap); >>>> void xe_vm_snapshot_print(struct xe_vm_snapshot *snap, struct >>>> drm_printer *p); >>>> void xe_vm_snapshot_free(struct xe_vm_snapshot *snap); >>>> + >>>> +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 offset, >>>> + void *buf, u64 len, bool write); >
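One way the read-only concern could be addressed on the fault side (a sketch of a possible mitigation, not what the posted patch does): request write faults when the debugger writes, in which case hmm_range_fault() returns -EPERM for a VMA that is not writable instead of handing back pages the kernel would then scribble over.

```c
#include <linux/hmm.h>
#include <linux/types.h>

/*
 * Sketch: choose fault flags based on the access direction so a debugger
 * write can never land in a read-only user mapping (e.g. libc text);
 * hmm_range_fault() fails with -EPERM when HMM_PFN_REQ_WRITE cannot be
 * satisfied by the underlying VMA.
 */
static void debug_access_set_flags(struct hmm_range *range, bool write)
{
	range->default_flags = HMM_PFN_REQ_FAULT;
	if (write)
		range->default_flags |= HMM_PFN_REQ_WRITE;
}
```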
Quoting Christian König (2024-12-09 17:42:32) > Am 09.12.24 um 16:31 schrieb Simona Vetter: > > On Mon, Dec 09, 2024 at 03:03:04PM +0100, Christian König wrote: > >> Am 09.12.24 um 14:33 schrieb Mika Kuoppala: > >>> From: Andrzej Hajda <andrzej.hajda@intel.com> > >>> > >>> Debugger needs to read/write program's vmas including userptr_vma. > >>> Since hmm_range_fault is used to pin userptr vmas, it is possible > >>> to map those vmas from debugger context. > >> Oh, this implementation is extremely questionable as well. Adding the LKML > >> and the MM list as well. > >> > >> First of all hmm_range_fault() does *not* pin anything! > >> > >> In other words you don't have a page reference when the function returns, > >> but rather just a sequence number you can check for modifications. > > I think it's all there, holds the invalidation lock during the critical > > access/section, drops it when reacquiring pages, retries until it works. > > > > I think the issue is more that everyone hand-rolls userptr. > > Well that is part of the issue. > > The general problem here is that the eudebug interface tries to simulate > the memory accesses as they would have happened by the hardware. Could you elaborate, what is that a problem in that, exactly? It's pretty much the equivalent of ptrace() poke/peek but for GPU memory. And it is exactly the kind of interface that makes sense for debugger as GPU memory != CPU memory, and they don't need to align at all. > What the debugger should probably do is to cleanly attach to the > application, get the information which CPU address is mapped to which > GPU address and then use the standard ptrace interfaces. I don't quite agree here -- at all. "Which CPU address is mapped to which GPU address" makes no sense when the GPU address space and CPU address space is completely controlled by the userspace driver/application. Please try to consider things outside of the ROCm architecture. Something like a register scratch region or EU instructions should not even be mapped to CPU address space as CPU has no business accessing it during normal operation. And backing of such region will vary per context/LRC on the same virtual address per EU thread. You seem to be suggesting to rewrite even our userspace driver to behave the same way as ROCm driver does just so that we could implement debug memory accesses via ptrace() to the CPU address space. That seems bit of a radical suggestion, especially given the drawbacks pointed out in your suggested design. > The whole interface re-invents a lot of functionality which is already > there I'm not really sure I would call adding a single interface for memory reading and writing to be "re-inventing a lot of functionality". All the functionality behind this interface will be needed by GPU core dumping, anyway. Just like for the other patch series. > just because you don't like the idea to attach to the debugged > application in userspace. A few points that have been brought up as drawback to the GPU debug through ptrace(), but to recap a few relevant ones for this discussion: - You can only really support GDB stop-all mode or at least have to stop all the CPU threads while you control the GPU threads to avoid interference. Elaborated on this on the other threads more. - Controlling the GPU threads will always interfere with CPU threads. Doesn't seem feasible to single-step an EU thread while CPU threads continue to run freely? 
- You are very much restricted by the CPU VA ~ GPU VA alignment requirement, which is not true for OpenGL or Vulkan etc. Seems like one of the reasons why ROCm debugging is not easily extendable outside compute? - You have to expose extra memory to CPU process just for GPU debugger access and keep track of GPU VA for each. Makes the GPU more prone to OOB writes from CPU. Exactly what not mapping the memory to CPU tried to protect the GPU from to begin with. > As far as I can see this whole idea is extremely questionable. This > looks like re-inventing the wheel in a different color. I see it like reinventing a round wheel compared to octagonal wheel. Could you elaborate with facts much more on your position why the ROCm debugger design is an absolute must for others to adopt? Otherwise it just looks like you are trying to prevent others from implementing a more flexible debugging interface through vague comments about "questionable design" without going into details. Not listing much concrete benefits nor addressing the very concretely expressed drawbacks of your suggested design, makes it seem like a very biased non-technical discussion. So while review interest and any comments are very much appreciated, please also work on providing bit more reasoning and facts instead of just claiming things. That'll help make the discussion much more fruitful. Regards, Joonas
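For context, the debugger-side view of the interface under discussion is plain peek/poke keyed on GPU virtual addresses. A hypothetical usage sketch, assuming (as elsewhere in this series) that the eudebug VM handle is an fd supporting pread()/pwrite() at GPU VAs; this is an assumption for illustration, not a statement of the final uAPI:

```c
#define _FILE_OFFSET_BITS 64
#include <unistd.h>
#include <stdint.h>

/*
 * Hypothetical debugger-side helper: write one instruction word (e.g. a
 * breakpoint opcode) at a GPU virtual address through the eudebug VM fd,
 * without any CPU mapping of the backing object.
 */
static int poke_gpu_u32(int vm_fd, uint64_t gpu_va, uint32_t value)
{
	return pwrite(vm_fd, &value, sizeof(value), (off_t)gpu_va) == sizeof(value) ? 0 : -1;
}
```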
Am 10.12.24 um 10:33 schrieb Joonas Lahtinen: > Quoting Christian König (2024-12-09 17:42:32) >> Am 09.12.24 um 16:31 schrieb Simona Vetter: >>> On Mon, Dec 09, 2024 at 03:03:04PM +0100, Christian König wrote: >>>> Am 09.12.24 um 14:33 schrieb Mika Kuoppala: >>>>> From: Andrzej Hajda<andrzej.hajda@intel.com> >>>>> >>>>> Debugger needs to read/write program's vmas including userptr_vma. >>>>> Since hmm_range_fault is used to pin userptr vmas, it is possible >>>>> to map those vmas from debugger context. >>>> Oh, this implementation is extremely questionable as well. Adding the LKML >>>> and the MM list as well. >>>> >>>> First of all hmm_range_fault() does *not* pin anything! >>>> >>>> In other words you don't have a page reference when the function returns, >>>> but rather just a sequence number you can check for modifications. >>> I think it's all there, holds the invalidation lock during the critical >>> access/section, drops it when reacquiring pages, retries until it works. >>> >>> I think the issue is more that everyone hand-rolls userptr. >> Well that is part of the issue. >> >> The general problem here is that the eudebug interface tries to simulate >> the memory accesses as they would have happened by the hardware. > Could you elaborate, what is that a problem in that, exactly? > > It's pretty much the equivalent of ptrace() poke/peek but for GPU memory. Exactly that here. You try to debug the GPU without taking control of the CPU process. This means that you have to re-implement all debug functionalities which where previously invested for the CPU process for the GPU once more. And that in turn creates a massive attack surface for security related problems, especially when you start messing with things like userptrs which have a very low level interaction with core memory management. > And it is exactly the kind of interface that makes sense for debugger as > GPU memory != CPU memory, and they don't need to align at all. And that is what I strongly disagree on. When you debug the GPU it is mandatory to gain control of the CPU process as well. The CPU process is basically the overseer of the GPU activity, so it should know everything about the GPU operation, for example what a mapping actually means. The kernel driver and the hardware only have the information necessary to execute the work prepared by the CPU process. So the information available is limited to begin with. >> What the debugger should probably do is to cleanly attach to the >> application, get the information which CPU address is mapped to which >> GPU address and then use the standard ptrace interfaces. > I don't quite agree here -- at all. "Which CPU address is mapped to > which GPU address" makes no sense when the GPU address space and CPU > address space is completely controlled by the userspace driver/application. Yeah, that's the reason why you should ask the userspace driver/application for the necessary information and not go over the kernel to debug things. > Please try to consider things outside of the ROCm architecture. Well I consider a good part of the ROCm architecture rather broken exactly because we haven't pushed back hard enough on bad ideas. > Something like a register scratch region or EU instructions should not > even be mapped to CPU address space as CPU has no business accessing it > during normal operation. And backing of such region will vary per > context/LRC on the same virtual address per EU thread. 
> > You seem to be suggesting to rewrite even our userspace driver to behave > the same way as ROCm driver does just so that we could implement debug memory > accesses via ptrace() to the CPU address space. Oh, well certainly not. That ROCm has an 1 to 1 mapping between CPU and GPU is one thing I've pushed back massively on and has now proven to be problematic. > That seems bit of a radical suggestion, especially given the drawbacks > pointed out in your suggested design. > >> The whole interface re-invents a lot of functionality which is already >> there > I'm not really sure I would call adding a single interface for memory > reading and writing to be "re-inventing a lot of functionality". > > All the functionality behind this interface will be needed by GPU core > dumping, anyway. Just like for the other patch series. As far as I can see exactly that's an absolutely no-go. Device core dumping should *never ever* touch memory imported by userptrs. That's what process core dumping is good for. >> just because you don't like the idea to attach to the debugged >> application in userspace. > A few points that have been brought up as drawback to the > GPU debug through ptrace(), but to recap a few relevant ones for this > discussion: > > - You can only really support GDB stop-all mode or at least have to > stop all the CPU threads while you control the GPU threads to > avoid interference. Elaborated on this on the other threads more. > - Controlling the GPU threads will always interfere with CPU threads. > Doesn't seem feasible to single-step an EU thread while CPU threads > continue to run freely? I would say no. > - You are very much restricted by the CPU VA ~ GPU VA alignment > requirement, which is not true for OpenGL or Vulkan etc. Seems > like one of the reasons why ROCm debugging is not easily extendable > outside compute? Well as long as you can't take debugged threads from the hardware you can pretty much forget any OpenGL or Vulkan debugging with this interface since it violates the dma_fence restrictions in the kernel. > - You have to expose extra memory to CPU process just for GPU > debugger access and keep track of GPU VA for each. Makes the GPU more > prone to OOB writes from CPU. Exactly what not mapping the memory > to CPU tried to protect the GPU from to begin with. > >> As far as I can see this whole idea is extremely questionable. This >> looks like re-inventing the wheel in a different color. > I see it like reinventing a round wheel compared to octagonal wheel. > > Could you elaborate with facts much more on your position why the ROCm > debugger design is an absolute must for others to adopt? Well I'm trying to prevent some of the mistakes we did with the ROCm design. And trying to re-invent well proven kernel interfaces is one of the big mistakes made in the ROCm design. If you really want to expose an interface to userspace which walks the process page table, installs an MMU notifier, kmaps the resulting page and then memcpy to/from it then you absolutely *must* run that by guys like Christoph Hellwig, Andrew and even Linus. I'm pretty sure that those guys will note that a device driver should absolutely not mess with such stuff. Regards, Christian. > Otherwise it just looks like you are trying to prevent others from > implementing a more flexible debugging interface through vague comments about > "questionable design" without going into details. 
Not listing much concrete > benefits nor addressing the very concretely expressed drawbacks of your > suggested design, makes it seem like a very biased non-technical discussion. > > So while review interest and any comments are very much appreciated, please > also work on providing bit more reasoning and facts instead of just claiming > things. That'll help make the discussion much more fruitful. > > Regards, Joonas
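For comparison, the in-kernel primitive behind ptrace()'s peek/poke already does the walk-page-table/kmap/memcpy sequence Christian describes: access_remote_vm(), built on get_user_pages_remote(). A sketch of what using that existing interface against the userptr's originating mm could look like, assuming the caller holds a valid mm reference; illustrative only, not what the patch does:

```c
#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/sched/mm.h>

/*
 * Illustration: access_remote_vm() pins the pages with
 * get_user_pages_remote(), kmaps them and memcpys, and with plain
 * FOLL_WRITE (no FOLL_FORCE) it refuses to write into read-only VMAs.
 * "mm" is assumed to be the mm the userptr was created from.
 */
static int example_remote_access(struct mm_struct *mm, unsigned long user_addr,
				 void *buf, int len, bool write)
{
	unsigned int gup_flags = write ? FOLL_WRITE : 0;

	if (!mmget_not_zero(mm))
		return -ESRCH;
	len = access_remote_vm(mm, user_addr, buf, len, gup_flags);
	mmput(mm);
	return len;	/* number of bytes transferred */
}
```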
On Mon, Dec 09, 2024 at 04:42:32PM +0100, Christian König wrote: > Am 09.12.24 um 16:31 schrieb Simona Vetter: > > On Mon, Dec 09, 2024 at 03:03:04PM +0100, Christian König wrote: > > > Am 09.12.24 um 14:33 schrieb Mika Kuoppala: > > > > From: Andrzej Hajda <andrzej.hajda@intel.com> > > > > > > > > Debugger needs to read/write program's vmas including userptr_vma. > > > > Since hmm_range_fault is used to pin userptr vmas, it is possible > > > > to map those vmas from debugger context. > > > Oh, this implementation is extremely questionable as well. Adding the LKML > > > and the MM list as well. > > > > > > First of all hmm_range_fault() does *not* pin anything! > > > > > > In other words you don't have a page reference when the function returns, > > > but rather just a sequence number you can check for modifications. > > I think it's all there, holds the invalidation lock during the critical > > access/section, drops it when reacquiring pages, retries until it works. > > > > I think the issue is more that everyone hand-rolls userptr. > > Well that is part of the issue. Yeah I ignored the other part, because that didn't seem super interesting really. Like for compute you can make the architectural assumption that gpu addresses match cpu addresses, and this all becomes easy. Or at least easier, there's still the issue of having to call into the driver for gpu side flushes. But for 3d userptr that's not the case, and you get two options: - Have some tracking structure that umd and debugger agree on, so stable abi fun and all that, so you can find the mapping. And I think in some cases this would need to be added first. - Just ask the kernel, which already knows. Like for cpu mmaps we also don't inject tracking code into mmap/munmap, we just ask the kernel through /proc/pid/maps. This sounds like the same question, probably should have a similar answer. I guess you can do a bit a bikeshed about whether the kernel should only do the address translation and then you do a ptrace poke/peek for the actual access. But again you need some flushes, so this might be a bit silly. But fundamentally this makes sense to me to ask the entity that actually knows how userptr areas map to memory. -Sima > > The general problem here is that the eudebug interface tries to simulate the > memory accesses as they would have happened by the hardware. > > What the debugger should probably do is to cleanly attach to the > application, get the information which CPU address is mapped to which GPU > address and then use the standard ptrace interfaces. > > The whole interface re-invents a lot of functionality which is already there > just because you don't like the idea to attach to the debugged application > in userspace. > > As far as I can see this whole idea is extremely questionable. This looks > like re-inventing the wheel in a different color. > > Regards, > Christian. > > > Probably time > > we standardize that and put it into gpuvm as an optional part, with > > consistent locking, naming (like not calling it _pin_pages when it's > > unpinnged userptr), kerneldoc and all the nice things so that we > > stop consistently getting confused by other driver's userptr code. > > > > I think that was on the plan originally as an eventual step, I guess time > > to pump that up. Matt/Thomas, thoughts? 
> > -Sima > > > > > > v2: pin pages vs notifier, move to vm.c (Matthew) > > > > v3: - iterate over system pages instead of DMA, fixes iommu enabled > > > > - s/xe_uvma_access/xe_vm_uvma_access/ (Matt) > > > > > > > > Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com> > > > > Signed-off-by: Maciej Patelczyk <maciej.patelczyk@intel.com> > > > > Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> > > > > Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> #v1 > > > > --- > > > > drivers/gpu/drm/xe/xe_eudebug.c | 3 ++- > > > > drivers/gpu/drm/xe/xe_vm.c | 47 +++++++++++++++++++++++++++++++++ > > > > drivers/gpu/drm/xe/xe_vm.h | 3 +++ > > > > 3 files changed, 52 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_eudebug.c b/drivers/gpu/drm/xe/xe_eudebug.c > > > > index 9d87df75348b..e5949e4dcad8 100644 > > > > --- a/drivers/gpu/drm/xe/xe_eudebug.c > > > > +++ b/drivers/gpu/drm/xe/xe_eudebug.c > > > > @@ -3076,7 +3076,8 @@ static int xe_eudebug_vma_access(struct xe_vma *vma, u64 offset_in_vma, > > > > return ret; > > > > } > > > > - return -EINVAL; > > > > + return xe_vm_userptr_access(to_userptr_vma(vma), offset_in_vma, > > > > + buf, bytes, write); > > > > } > > > > static int xe_eudebug_vm_access(struct xe_vm *vm, u64 offset, > > > > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c > > > > index 0f17bc8b627b..224ff9e16941 100644 > > > > --- a/drivers/gpu/drm/xe/xe_vm.c > > > > +++ b/drivers/gpu/drm/xe/xe_vm.c > > > > @@ -3414,3 +3414,50 @@ void xe_vm_snapshot_free(struct xe_vm_snapshot *snap) > > > > } > > > > kvfree(snap); > > > > } > > > > + > > > > +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 offset, > > > > + void *buf, u64 len, bool write) > > > > +{ > > > > + struct xe_vm *vm = xe_vma_vm(&uvma->vma); > > > > + struct xe_userptr *up = &uvma->userptr; > > > > + struct xe_res_cursor cur = {}; > > > > + int cur_len, ret = 0; > > > > + > > > > + while (true) { > > > > + down_read(&vm->userptr.notifier_lock); > > > > + if (!xe_vma_userptr_check_repin(uvma)) > > > > + break; > > > > + > > > > + spin_lock(&vm->userptr.invalidated_lock); > > > > + list_del_init(&uvma->userptr.invalidate_link); > > > > + spin_unlock(&vm->userptr.invalidated_lock); > > > > + > > > > + up_read(&vm->userptr.notifier_lock); > > > > + ret = xe_vma_userptr_pin_pages(uvma); > > > > + if (ret) > > > > + return ret; > > > > + } > > > > + > > > > + if (!up->sg) { > > > > + ret = -EINVAL; > > > > + goto out_unlock_notifier; > > > > + } > > > > + > > > > + for (xe_res_first_sg_system(up->sg, offset, len, &cur); cur.remaining; > > > > + xe_res_next(&cur, cur_len)) { > > > > + void *ptr = kmap_local_page(sg_page(cur.sgl)) + cur.start; > > > The interface basically creates a side channel to access userptrs in the way > > > an userspace application would do without actually going through userspace. > > > > > > That is generally not something a device driver should ever do as far as I > > > can see. > > > > > > > + > > > > + cur_len = min(cur.size, cur.remaining); > > > > + if (write) > > > > + memcpy(ptr, buf, cur_len); > > > > + else > > > > + memcpy(buf, ptr, cur_len); > > > > + kunmap_local(ptr); > > > > + buf += cur_len; > > > > + } > > > > + ret = len; > > > > + > > > > +out_unlock_notifier: > > > > + up_read(&vm->userptr.notifier_lock); > > > I just strongly hope that this will prevent the mapping from changing. > > > > > > Regards, > > > Christian. 
> > > > > > > + return ret; > > > > +} > > > > diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h > > > > index 23adb7442881..372ad40ad67f 100644 > > > > --- a/drivers/gpu/drm/xe/xe_vm.h > > > > +++ b/drivers/gpu/drm/xe/xe_vm.h > > > > @@ -280,3 +280,6 @@ struct xe_vm_snapshot *xe_vm_snapshot_capture(struct xe_vm *vm); > > > > void xe_vm_snapshot_capture_delayed(struct xe_vm_snapshot *snap); > > > > void xe_vm_snapshot_print(struct xe_vm_snapshot *snap, struct drm_printer *p); > > > > void xe_vm_snapshot_free(struct xe_vm_snapshot *snap); > > > > + > > > > +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 offset, > > > > + void *buf, u64 len, bool write); >
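The CPU-side precedent Sima points to looks like this from a debugger's perspective: the mapping information comes from the kernel, not from hooking mmap()/munmap() in the target. A minimal sketch of the /proc/<pid>/maps query:

```c
#include <stdio.h>

/* Dump the target's CPU mappings as reported by the kernel. */
static void dump_cpu_mappings(int pid)
{
	char path[64], line[512];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/maps", pid);
	f = fopen(path, "r");
	if (!f)
		return;
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);	/* start-end perms offset dev inode path */
	fclose(f);
}
```

The analogous GPU-side question is whether the kernel, which already knows the ppGTT layout, should answer the same query for GPU VMs.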
Quoting Christian König (2024-12-10 12:00:48) > Am 10.12.24 um 10:33 schrieb Joonas Lahtinen: > > Quoting Christian König (2024-12-09 17:42:32) > > Am 09.12.24 um 16:31 schrieb Simona Vetter: > > On Mon, Dec 09, 2024 at 03:03:04PM +0100, Christian König wrote: > > Am 09.12.24 um 14:33 schrieb Mika Kuoppala: > > From: Andrzej Hajda <andrzej.hajda@intel.com> > > Debugger needs to read/write program's vmas including userptr_vma. > Since hmm_range_fault is used to pin userptr vmas, it is possible > to map those vmas from debugger context. > > Oh, this implementation is extremely questionable as well. Adding the LKML > and the MM list as well. > > First of all hmm_range_fault() does *not* pin anything! > > In other words you don't have a page reference when the function returns, > but rather just a sequence number you can check for modifications. > > I think it's all there, holds the invalidation lock during the critical > access/section, drops it when reacquiring pages, retries until it works. > > I think the issue is more that everyone hand-rolls userptr. > > Well that is part of the issue. > > The general problem here is that the eudebug interface tries to simulate > the memory accesses as they would have happened by the hardware. > > Could you elaborate, what is that a problem in that, exactly? > > It's pretty much the equivalent of ptrace() poke/peek but for GPU memory. > > > Exactly that here. You try to debug the GPU without taking control of the CPU > process. You seem to have a built-in expectation that the CPU threads and memory space must be interfered with in order to debug a completely different set of threads and memory space elsewhere that executes independently. I don't quite see why? In debugging massively parallel workloads, it's a huge drawback to be limited to stop all mode in GDB. If ROCm folks are fine with such limitation, I have nothing against them keeping that limitation. Just it was a starting design principle for this design to avoid such a limitation. > This means that you have to re-implement all debug functionalities which where > previously invested for the CPU process for the GPU once more. Seems like a strawman argument. Can you list the "all interfaces" being added that would be possible via indirection via ptrace() beyond peek/poke? > And that in turn creates a massive attack surface for security related > problems, especially when you start messing with things like userptrs which > have a very low level interaction with core memory management. Again, just seems like a strawman argument. You seem to generalize to some massive attack surface of hypothetical interfaces which you don't list. We're talking about memory peek/poke here. Can you explain the high-level difference from security perspective for temporarily pinning userptr pages to write them to page tables for GPU to execute a dma-fence workload with and temporarily pinning pages for peek/poke? > And it is exactly the kind of interface that makes sense for debugger as > GPU memory != CPU memory, and they don't need to align at all. > > > And that is what I strongly disagree on. When you debug the GPU it is mandatory > to gain control of the CPU process as well. You are free to disagree on that. I simply don't agree and have in this and previous email presented multiple reasons as to why not. We can agree to disagree on the topic. > The CPU process is basically the overseer of the GPU activity, so it should > know everything about the GPU operation, for example what a mapping actually > means. 
How does that relate to what is being discussed here? You just seem to explain how you think userspace driver should work: Maintain a shadow tree of each ppGTT VM layout? I don't agree on that, but I think it is slightly irrelevant here. > The kernel driver and the hardware only have the information necessary to > execute the work prepared by the CPU process. So the information available is > limited to begin with. And the point here is? Are you saying kernel does not know the actual mappings maintained in the GPU page tables? > What the debugger should probably do is to cleanly attach to the > application, get the information which CPU address is mapped to which > GPU address and then use the standard ptrace interfaces. > > I don't quite agree here -- at all. "Which CPU address is mapped to > which GPU address" makes no sense when the GPU address space and CPU > address space is completely controlled by the userspace driver/application. > > > Yeah, that's the reason why you should ask the userspace driver/application for > the necessary information and not go over the kernel to debug things. What hypothetical necessary information are you referring to exactly? I already explained there are good reasons not to map all the GPU memory into the CPU address space. > Please try to consider things outside of the ROCm architecture. > > > Well I consider a good part of the ROCm architecture rather broken exactly > because we haven't pushed back hard enough on bad ideas. > > > Something like a register scratch region or EU instructions should not > even be mapped to CPU address space as CPU has no business accessing it > during normal operation. And backing of such region will vary per > context/LRC on the same virtual address per EU thread. > > You seem to be suggesting to rewrite even our userspace driver to behave > the same way as ROCm driver does just so that we could implement debug memory > accesses via ptrace() to the CPU address space. > > > Oh, well certainly not. That ROCm has an 1 to 1 mapping between CPU and GPU is > one thing I've pushed back massively on and has now proven to be problematic. Right, so is your claim then that instead of being 1:1 the CPU address space should be a superset of all GPU address spaces instead to make sure ptrace() can modify all memory? Cause I'm slightly lost here as you don't give much reasoning, just claim things to be certain way. > That seems bit of a radical suggestion, especially given the drawbacks > pointed out in your suggested design. > > > The whole interface re-invents a lot of functionality which is already > there > > I'm not really sure I would call adding a single interface for memory > reading and writing to be "re-inventing a lot of functionality". > > All the functionality behind this interface will be needed by GPU core > dumping, anyway. Just like for the other patch series. > > > As far as I can see exactly that's an absolutely no-go. Device core dumping > should *never ever* touch memory imported by userptrs. Could you again elaborate on what the great difference is to short term pinning to use in dma-fence workloads? Just the kmap? > That's what process core dumping is good for. Not really sure I agree. If you do not dump the memory as seen by the GPU, then you need to go parsing the CPU address space in order to make sense which buffers were mapped where and that CPU memory contents containing metadata could be corrupt as we're dealing with a crashing app to begin with. 
Big point of relying to the information from GPU VM for the GPU memory layout is that it won't be corrupted by rogue memory accesses in CPU process. > just because you don't like the idea to attach to the debugged > application in userspace. > > A few points that have been brought up as drawback to the > GPU debug through ptrace(), but to recap a few relevant ones for this > discussion: > > - You can only really support GDB stop-all mode or at least have to > stop all the CPU threads while you control the GPU threads to > avoid interference. Elaborated on this on the other threads more. > - Controlling the GPU threads will always interfere with CPU threads. > Doesn't seem feasible to single-step an EU thread while CPU threads > continue to run freely? > > > I would say no. Should this be understood that you agree these are limitations of the ROCm debug architecture? > - You are very much restricted by the CPU VA ~ GPU VA alignment > requirement, which is not true for OpenGL or Vulkan etc. Seems > like one of the reasons why ROCm debugging is not easily extendable > outside compute? > > > Well as long as you can't take debugged threads from the hardware you can > pretty much forget any OpenGL or Vulkan debugging with this interface since it > violates the dma_fence restrictions in the kernel. Agreed. However doesn't mean because you can't do it right now, you you should design an architecture that actively prevents you from doing that in the future. > - You have to expose extra memory to CPU process just for GPU > debugger access and keep track of GPU VA for each. Makes the GPU more > prone to OOB writes from CPU. Exactly what not mapping the memory > to CPU tried to protect the GPU from to begin with. > > > As far as I can see this whole idea is extremely questionable. This > looks like re-inventing the wheel in a different color. > > I see it like reinventing a round wheel compared to octagonal wheel. > > Could you elaborate with facts much more on your position why the ROCm > debugger design is an absolute must for others to adopt? > > > Well I'm trying to prevent some of the mistakes we did with the ROCm design. Well, I would say that the above limitations are direct results of the ROCm debugging design. So while we're eager to learn about how you perceive GPU debugging should work, would you mind addressing the above shortcomings? > And trying to re-invent well proven kernel interfaces is one of the big > mistakes made in the ROCm design. Appreciate the feedback. Please work on the representation a bit as it currently doesn't seem very helpful but appears just as an attempt to try to throw a spanner in the works. > If you really want to expose an interface to userspace To a debugger process, enabled only behind a flag. > which walks the process > page table, installs an MMU notifier This part is already done to put an userptr to the GPU page tables to begin with. So hopefully not too controversial. > kmaps the resulting page In addition to having it in the page tables where GPU can access it. > and then memcpy > to/from it then you absolutely *must* run that by guys like Christoph Hellwig, > Andrew and even Linus. Surely, that is why we're seeking out for review. We could also in theory use an in-kernel GPU context on the GPU hardware for doing the peek/poke operations on userptr. 
But that seems like a high-overhead thing to do due to the overhead of setting up a transfer per data word and going over the PCI bus twice compared to accessing the memory directly by CPU when it trivially can. So this is the current proposal. Regards, Joonas > > I'm pretty sure that those guys will note that a device driver should > absolutely not mess with such stuff. > > Regards, > Christian. > > > Otherwise it just looks like you are trying to prevent others from > implementing a more flexible debugging interface through vague comments about > "questionable design" without going into details. Not listing much concrete > benefits nor addressing the very concretely expressed drawbacks of your > suggested design, makes it seem like a very biased non-technical discussion. > > So while review interest and any comments are very much appreciated, please > also work on providing bit more reasoning and facts instead of just claiming > things. That'll help make the discussion much more fruitful. > > Regards, Joonas > >
Am 10.12.24 um 12:57 schrieb Joonas Lahtinen: > Quoting Christian König (2024-12-10 12:00:48) >> Am 10.12.24 um 10:33 schrieb Joonas Lahtinen: >> >> Quoting Christian König (2024-12-09 17:42:32) >> >> Am 09.12.24 um 16:31 schrieb Simona Vetter: >> >> On Mon, Dec 09, 2024 at 03:03:04PM +0100, Christian König wrote: >> >> Am 09.12.24 um 14:33 schrieb Mika Kuoppala: >> >> From: Andrzej Hajda<andrzej.hajda@intel.com> >> >> Debugger needs to read/write program's vmas including userptr_vma. >> Since hmm_range_fault is used to pin userptr vmas, it is possible >> to map those vmas from debugger context. >> >> Oh, this implementation is extremely questionable as well. Adding the LKML >> and the MM list as well. >> >> First of all hmm_range_fault() does *not* pin anything! >> >> In other words you don't have a page reference when the function returns, >> but rather just a sequence number you can check for modifications. >> >> I think it's all there, holds the invalidation lock during the critical >> access/section, drops it when reacquiring pages, retries until it works. >> >> I think the issue is more that everyone hand-rolls userptr. >> >> Well that is part of the issue. >> >> The general problem here is that the eudebug interface tries to simulate >> the memory accesses as they would have happened by the hardware. >> >> Could you elaborate, what is that a problem in that, exactly? >> >> It's pretty much the equivalent of ptrace() poke/peek but for GPU memory. >> >> >> Exactly that here. You try to debug the GPU without taking control of the CPU >> process. > You seem to have a built-in expectation that the CPU threads and memory space > must be interfered with in order to debug a completely different set of threads > and memory space elsewhere that executes independently. I don't quite see why? Because the GPU only gets the information it needs to execute the commands. A simple example would be to single step through the high level shader code. That is usually not available to the GPU, but only to the application who has submitted the work. The GPU only sees the result of the compiler from high level into low level assembler. > In debugging massively parallel workloads, it's a huge drawback to be limited to > stop all mode in GDB. If ROCm folks are fine with such limitation, I have nothing > against them keeping that limitation. Just it was a starting design principle for > this design to avoid such a limitation. Well, that's the part I don't understand. Why is that a drawback? >> This means that you have to re-implement all debug functionalities which where >> previously invested for the CPU process for the GPU once more. > Seems like a strawman argument. Can you list the "all interfaces" being added > that would be possible via indirection via ptrace() beyond peek/poke? > >> And that in turn creates a massive attack surface for security related >> problems, especially when you start messing with things like userptrs which >> have a very low level interaction with core memory management. > Again, just seems like a strawman argument. You seem to generalize to some massive > attack surface of hypothetical interfaces which you don't list. We're talking > about memory peek/poke here. That peek/poke interface is more than enough to cause problems. > Can you explain the high-level difference from security perspective for > temporarily pinning userptr pages to write them to page tables for GPU to > execute a dma-fence workload with and temporarily pinning pages for > peek/poke? 
If you want to access userptr imported pages from the GPU, going through the hoops of using hmm_range_fault()/get_user_pages() plus an MMU notifier is a must have. For a CPU based debugging interface that isn't necessary, you can just look directly into the application address space with existing interfaces. >> And it is exactly the kind of interface that makes sense for debugger as >> GPU memory != CPU memory, and they don't need to align at all. >> >> >> And that is what I strongly disagree on. When you debug the GPU it is mandatory >> to gain control of the CPU process as well. > You are free to disagree on that. I simply don't agree and have in this > and previous email presented multiple reasons as to why not. We can > agree to disagree on the topic. Yeah, that's ok. I also think we can agree on that this doesn't matter for the discussion. The question is rather should the userptr functionality be used for debugging or not. >> The CPU process is basically the overseer of the GPU activity, so it should >> know everything about the GPU operation, for example what a mapping actually >> means. > How does that relate to what is being discussed here? You just seem to > explain how you think userspace driver should work: Maintain a shadow > tree of each ppGTT VM layout? I don't agree on that, but I think it is > slightly irrelevant here. I'm trying to understand why you want to debug only the GPU without also attaching to the CPU process. >> The kernel driver and the hardware only have the information necessary to >> execute the work prepared by the CPU process. So the information available is >> limited to begin with. > And the point here is? Are you saying kernel does not know the actual mappings > maintained in the GPU page tables? The kernel knows that, the question is why doesn't userspace know that? On the other hand I have to agree that this isn't much of a problem. If userspace really doesn't know what is mapped where in the GPU's VM address space then an IOCTL to query that is probably not an issue. >> What the debugger should probably do is to cleanly attach to the >> application, get the information which CPU address is mapped to which >> GPU address and then use the standard ptrace interfaces. >> >> I don't quite agree here -- at all. "Which CPU address is mapped to >> which GPU address" makes no sense when the GPU address space and CPU >> address space is completely controlled by the userspace driver/application. >> >> >> Yeah, that's the reason why you should ask the userspace driver/application for >> the necessary information and not go over the kernel to debug things. > What hypothetical necessary information are you referring to exactly? What you said before: "the GPU address space and CPU address space is completely controlled by the userspace driver/application". When that's the case, then why ask the kernel for help? The driver/application is in control. > I already explained there are good reasons not to map all the GPU memory > into the CPU address space. Well I still don't fully agree to that argumentation, but compared to using userptr the peek/poke on a GEM handle is basically harmless. >> Please try to consider things outside of the ROCm architecture. >> >> >> Well I consider a good part of the ROCm architecture rather broken exactly >> because we haven't pushed back hard enough on bad ideas. 
>> >> >> Something like a register scratch region or EU instructions should not >> even be mapped to CPU address space as CPU has no business accessing it >> during normal operation. And backing of such region will vary per >> context/LRC on the same virtual address per EU thread. >> >> You seem to be suggesting to rewrite even our userspace driver to behave >> the same way as ROCm driver does just so that we could implement debug memory >> accesses via ptrace() to the CPU address space. >> >> >> Oh, well certainly not. That ROCm has an 1 to 1 mapping between CPU and GPU is >> one thing I've pushed back massively on and has now proven to be problematic. > Right, so is your claim then that instead of being 1:1 the CPU address space > should be a superset of all GPU address spaces instead to make sure > ptrace() can modify all memory? Well why not? Mapping a BO and not accessing it has only minimal overhead. We already considered to making that mandatory for TTM drivers for better OOM killer handling. That approach was discontinued, but certainly not for the overhead. > Cause I'm slightly lost here as you don't give much reasoning, just > claim things to be certain way. Ok, that's certainly not what I'm trying to express. Things don't need to be in a certain way, especially not in the way ROCm does things. But you should not try to re-create GPU accesses with the CPU, especially when that isn't memory you have control over in the sense that it was allocated through your driver stack. >> That seems bit of a radical suggestion, especially given the drawbacks >> pointed out in your suggested design. >> >> >> The whole interface re-invents a lot of functionality which is already >> there >> >> I'm not really sure I would call adding a single interface for memory >> reading and writing to be "re-inventing a lot of functionality". >> >> All the functionality behind this interface will be needed by GPU core >> dumping, anyway. Just like for the other patch series. >> >> >> As far as I can see exactly that's an absolutely no-go. Device core dumping >> should *never ever* touch memory imported by userptrs. > Could you again elaborate on what the great difference is to short term > pinning to use in dma-fence workloads? Just the kmap? The big difference is that the memory doesn't belong to the driver who is core dumping. That is just something you have imported from the MM subsystem, e.g. anonymous memory and file backed mappings. We also don't allow to mmap() dma-bufs on importing devices for similar reasons. >> That's what process core dumping is good for. > Not really sure I agree. If you do not dump the memory as seen by the > GPU, then you need to go parsing the CPU address space in order to make > sense which buffers were mapped where and that CPU memory contents containing > metadata could be corrupt as we're dealing with a crashing app to begin with. > > Big point of relying to the information from GPU VM for the GPU memory layout > is that it won't be corrupted by rogue memory accesses in CPU process. Well that you don't want to use potentially corrupted information is a good argument, but why just not dump an information like "range 0xabcd-0xbcde came as userptr from process 1 VMA 0x1234-0x5678" ? A process address space is not really something a device driver should be messing with. > >> just because you don't like the idea to attach to the debugged >> application in userspace. 
>> >> A few points that have been brought up as drawback to the >> GPU debug through ptrace(), but to recap a few relevant ones for this >> discussion: >> >> - You can only really support GDB stop-all mode or at least have to >> stop all the CPU threads while you control the GPU threads to >> avoid interference. Elaborated on this on the other threads more. >> - Controlling the GPU threads will always interfere with CPU threads. >> Doesn't seem feasible to single-step an EU thread while CPU threads >> continue to run freely? >> >> >> I would say no. > Should this be understood that you agree these are limitations of the ROCm > debug architecture? ROCm has a bunch of design decisions I would say we should never ever repeat: 1. Forcing a 1 to 1 model between GPU address space and CPU address space. 2. Using a separate file descriptor additional to the DRM render node. 3. Attaching information and context to the CPU process instead of the DRM render node. .... But stopping the world, e.g. both CPU and GPU threads if you want to debug something is not one of the problematic decisions. That's why I'm really surprised that you insist so much on that. >> - You are very much restricted by the CPU VA ~ GPU VA alignment >> requirement, which is not true for OpenGL or Vulkan etc. Seems >> like one of the reasons why ROCm debugging is not easily extendable >> outside compute? >> >> >> Well as long as you can't take debugged threads from the hardware you can >> pretty much forget any OpenGL or Vulkan debugging with this interface since it >> violates the dma_fence restrictions in the kernel. > Agreed. However doesn't mean because you can't do it right now, you you should > design an architecture that actively prevents you from doing that in the future. Good point. That's what I can totally agree on as well. >> - You have to expose extra memory to CPU process just for GPU >> debugger access and keep track of GPU VA for each. Makes the GPU more >> prone to OOB writes from CPU. Exactly what not mapping the memory >> to CPU tried to protect the GPU from to begin with. >> >> >> As far as I can see this whole idea is extremely questionable. This >> looks like re-inventing the wheel in a different color. >> >> I see it like reinventing a round wheel compared to octagonal wheel. >> >> Could you elaborate with facts much more on your position why the ROCm >> debugger design is an absolute must for others to adopt? >> >> >> Well I'm trying to prevent some of the mistakes we did with the ROCm design. > Well, I would say that the above limitations are direct results of the ROCm > debugging design. So while we're eager to learn about how you perceive > GPU debugging should work, would you mind addressing the above > shortcomings? Yeah, absolutely. That you don't have a 1 to 1 mapping on the GPU is a step in the right direction if you ask me. >> And trying to re-invent well proven kernel interfaces is one of the big >> mistakes made in the ROCm design. > Appreciate the feedback. Please work on the representation a bit as it currently > doesn't seem very helpful but appears just as an attempt to try to throw a spanner > in the works. > >> If you really want to expose an interface to userspace > To a debugger process, enabled only behind a flag. > >> which walks the process >> page table, installs an MMU notifier > This part is already done to put an userptr to the GPU page tables to > begin with. So hopefully not too controversial. 
> >> kmaps the resulting page > In addition to having it in the page tables where GPU can access it. > >> and then memcpy >> to/from it then you absolutely *must* run that by guys like Christoph Hellwig, >> Andrew and even Linus. > Surely, that is why we're seeking out for review. > > We could also in theory use an in-kernel GPU context on the GPU hardware for > doing the peek/poke operations on userptr. Yeah, I mean that should clearly work out. We have something similar. > But that seems like a high-overhead thing to do due to the overhead of > setting up a transfer per data word and going over the PCI bus twice > compared to accessing the memory directly by CPU when it trivially can. Understandable, but that will create another way of accessing process memory. Regards, Christian. > > So this is the current proposal. > > Regards, Joonas > >> I'm pretty sure that those guys will note that a device driver should >> absolutely not mess with such stuff. >> >> Regards, >> Christian. >> >> >> Otherwise it just looks like you are trying to prevent others from >> implementing a more flexible debugging interface through vague comments about >> "questionable design" without going into details. Not listing much concrete >> benefits nor addressing the very concretely expressed drawbacks of your >> suggested design, makes it seem like a very biased non-technical discussion. >> >> So while review interest and any comments are very much appreciated, please >> also work on providing bit more reasoning and facts instead of just claiming >> things. That'll help make the discussion much more fruitful. >> >> Regards, Joonas >> >>
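For context on the hmm_range_fault() point being argued above: the canonical usage is a collect-and-recheck loop around an mmu_interval_notifier, with no page references held at any point. A minimal sketch under that assumption follows; the function name snapshot_user_range() and the caller-registered notifier are illustrative only, not an existing Xe or core API.

#include <linux/hmm.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>

/*
 * Collect pfns for [start, end) of the notifier's mm. No page references
 * are taken; the result is only usable while the notifier sequence number
 * collected here is still current, which any user of the pfns has to
 * recheck under its own notifier lock.
 */
static int snapshot_user_range(struct mmu_interval_notifier *notifier,
                               unsigned long start, unsigned long end,
                               unsigned long *pfns)
{
        struct hmm_range range = {
                .notifier      = notifier,
                .start         = start,
                .end           = end,
                .hmm_pfns      = pfns,
                .default_flags = HMM_PFN_REQ_FAULT,
        };
        int ret;

again:
        range.notifier_seq = mmu_interval_read_begin(notifier);

        mmap_read_lock(notifier->mm);
        ret = hmm_range_fault(&range);
        mmap_read_unlock(notifier->mm);
        if (ret == -EBUSY)
                goto again;     /* raced with an invalidation, start over */
        if (ret)
                return ret;

        /* Anything that now touches the pfns must hold the driver's
         * notifier lock and restart if the interval was invalidated.
         */
        if (mmu_interval_read_retry(notifier, range.notifier_seq))
                goto again;

        return 0;
}

This is roughly what the repin loop in the patch open-codes with Xe's own locks; the disagreement above is not about that loop itself but about what the collected pages may then be used for.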
First of all, do appreciate taking the time to explain your positions much more verbosely this time. Quoting Christian König (2024-12-10 16:03:14) > Am 10.12.24 um 12:57 schrieb Joonas Lahtinen: > > Quoting Christian König (2024-12-10 12:00:48) > > Am 10.12.24 um 10:33 schrieb Joonas Lahtinen: > > Quoting Christian König (2024-12-09 17:42:32) > > Am 09.12.24 um 16:31 schrieb Simona Vetter: > > On Mon, Dec 09, 2024 at 03:03:04PM +0100, Christian König wrote: > > Am 09.12.24 um 14:33 schrieb Mika Kuoppala: > > From: Andrzej Hajda <andrzej.hajda@intel.com> > > Debugger needs to read/write program's vmas including userptr_vma. > Since hmm_range_fault is used to pin userptr vmas, it is possible > to map those vmas from debugger context. > > Oh, this implementation is extremely questionable as well. Adding the LKML > and the MM list as well. > > First of all hmm_range_fault() does *not* pin anything! > > In other words you don't have a page reference when the function returns, > but rather just a sequence number you can check for modifications. > > I think it's all there, holds the invalidation lock during the critical > access/section, drops it when reacquiring pages, retries until it works. > > I think the issue is more that everyone hand-rolls userptr. > > Well that is part of the issue. > > The general problem here is that the eudebug interface tries to simulate > the memory accesses as they would have happened by the hardware. > > Could you elaborate, what is that a problem in that, exactly? > > It's pretty much the equivalent of ptrace() poke/peek but for GPU memory. > > > Exactly that here. You try to debug the GPU without taking control of the CPU > process. > > You seem to have a built-in expectation that the CPU threads and memory space > must be interfered with in order to debug a completely different set of threads > and memory space elsewhere that executes independently. I don't quite see why? > > > Because the GPU only gets the information it needs to execute the commands. Right, but even for the CPU process, the debug symbols are not part of the execution address space either. There similarly are only the instructions generated by the compiler and the debug symbols are separate. They may be obtainable by parsing /proc/<PID>/exe but can also be in a completely different file in a different host machine. > A simple example would be to single step through the high level shader code. > That is usually not available to the GPU, but only to the application who has > submitted the work. > > The GPU only sees the result of the compiler from high level into low level > assembler. If we were to have unified executable format where both the GPU and CPU instructions were to be part of the single executable file, so could the DWARF information for both CPU and GPU. Then GDB, by loading the executable file, would have all the debug information it needed. No need to introspect to the CPU process in order to debug the GPU, similarly as there is no need to introspect CPU process to debug CPU process. While we don't currently have that and GPU instructions are often JIT generated, we tried to make life easier by userspace driver providing the DWARF information it just generated for the code it JITed as VM_BIND metadata for a VMA and we make copy to store safely to avoid corruption by rogue CPU process writes. In the history it was exported to a file and then loaded by GDB from that separate file, making user experience quite bad. 
So to recap, while for JIT scenarios and for lack of unified carrier format for GPU and CPU instructions, there is some information that is convenient to have in CPU address space, I don't think that is a necessity at all. I guess we could equally export /sys/class/drm/.../clients/<ID>/{load_map,dwarf_symbols} or whatever, similar to /proc/<PID>/{maps,exe}. TL;DR While getting the information from CPU process for JIT scenarios is convenient for now, I don't think it is a must or explicitly required. > In debugging massively parallel workloads, it's a huge drawback to be limited to > stop all mode in GDB. If ROCm folks are fine with such limitation, I have nothing > against them keeping that limitation. Just it was a starting design principle for > this design to avoid such a limitation. > > > Well, that's the part I don't understand. Why is that a drawback? Hmm, same as for not supporting stop-all mode for CPU threads during CPU debugging? You will not be able to stop and observe a single thread while letting the other threads run. If the CPU threads are for example supposed to react to memory semaphores/fences written by GPU thread and you want to debug by doing those memory writes from GPU thread from the GDB command line? Again not being limited to stop-all mode being an input to the design phase from the folks doing in-field debugging, I'm probably not going to be able to give out all the good reasons for it. And as the CPU side supports it, even if you did not support it for the GPU debugging, if adding GPU to the equation would prevent from using the existing feature for CPU debugging that feels like a regression in user experience. I think those both are major drawbacks, but we can of course seek out further opinions if it's highly relevant. I'm pretty sure myself at this point that if a feature is desireable for CPU threaded debugging, it'll be very shortly asked to be available for GPU. That seems to be the trend for any CPU debug feature, even if some are less feasible than others due to the differences of GPUs and CPUs. > This means that you have to re-implement all debug functionalities which where > previously invested for the CPU process for the GPU once more. > > Seems like a strawman argument. Can you list the "all interfaces" being added > that would be possible via indirection via ptrace() beyond peek/poke? > > > And that in turn creates a massive attack surface for security related > problems, especially when you start messing with things like userptrs which > have a very low level interaction with core memory management. > > Again, just seems like a strawman argument. You seem to generalize to some massive > attack surface of hypothetical interfaces which you don't list. We're talking > about memory peek/poke here. > > > That peek/poke interface is more than enough to cause problems. Ok, just wanted to make sure we're talking about concrete things. Happy to discuss any other problems too, but for now let's focus on the peek/poke then, and not get sidetracked. > Can you explain the high-level difference from security perspective for > temporarily pinning userptr pages to write them to page tables for GPU to > execute a dma-fence workload with and temporarily pinning pages for > peek/poke? > > > If you want to access userptr imported pages from the GPU going through the > hops of using hhm_range_fault()/get_user_pages() plus an MMU notifier is a must > have. 
Right, the intent was always to make this as close to EU thread memory access as possible from both locking and Linux core MM memory point of view so if we need to improve on that front, we should look into it. > For a CPU based debugging interface that isn't necessary, you can just look > directly into the application address space with existing interfaces. First, this is only even possible when you have mapped everything the GPUs have access also to the CPU address space, and maintain a load map for each individual <DRM client, GPU VM, GPU VMA> => CPU address. I don't see the need to do that tracking in the userspace just for debugging, because kernel already has to do all the work. Second, mapping every GPU VMA to CPU address space will exhaust the vm.max_map_count [1] quite a bit faster. This problem can already be hit if a naive userspace application tries to create too many aligned_alloc blocks for userptr instead of pooling memory. Third of all when GPU VM size == CPU VM size for modern hardware, you will run out of VA space in the CPU VM. When you consider you have to add VA blocks of (num DRM clients) * (num VM) * (VM size) where (num VM) roughly equals number of engines * 3. And ultimately, I'm pretty sure there are processes like 32-bit games and emulators, and even demanding compute applications actually expect to be able to use most of the CPU address space :) So don't think we should have a design where we expect to be able to consume significant portions of the CPU VA space (especially if it is just for debug time functionality). [1] Documentation/admin-guide/sysctl/vm.rst#max_map_count > And it is exactly the kind of interface that makes sense for debugger as > GPU memory != CPU memory, and they don't need to align at all. > > > And that is what I strongly disagree on. When you debug the GPU it is mandatory > to gain control of the CPU process as well. > > You are free to disagree on that. I simply don't agree and have in this > and previous email presented multiple reasons as to why not. We can > agree to disagree on the topic. > > > Yeah, that's ok. I also think we can agree on that this doesn't matter for the > discussion. > > The question is rather should the userptr functionality be used for debugging > or not. > > > The CPU process is basically the overseer of the GPU activity, so it should > know everything about the GPU operation, for example what a mapping actually > means. > > How does that relate to what is being discussed here? You just seem to > explain how you think userspace driver should work: Maintain a shadow > tree of each ppGTT VM layout? I don't agree on that, but I think it is > slightly irrelevant here. > > > I'm trying to understand why you want to debug only the GPU without also > attaching to the CPU process. Mostly to ensure we're not limited to stop-all mode as described above and to have a clean independent implementation for the thread run-control between the "inferiors" in GDB. Say you have CPU threads and 2 sets of GPU threads (3 inferiors in total). We don't want the CPU inferior to be impacted by the user requesting to control the GPU inferiors. I know the ROCm GDB implementation takes a different approach, and I'm not quite sure how you folks plan on supporting multi-GPU debugging. I would spin the question the opposite direction, if you don't need anything from the CPU process why would you make them dependent and interfering? 
(Reminder, the peek/poke target page has been made available to the GPU page tables, so we don't want anything from the CPU process per se, we want to know which page the GPU IOMMU unit would get for its access. For all practical matters, the CPU process could have already exited and should not matter if an EU is executing on the GPU still.) > The kernel driver and the hardware only have the information necessary to > execute the work prepared by the CPU process. So the information available is > limited to begin with. > > And the point here is? Are you saying kernel does not know the actual mappings > maintained in the GPU page tables? > > > The kernel knows that, the question is why does userspace don't know that? > > On the other hand I have to agree that this isn't much of a problem. > > If userspace really doesn't know what is mapped where in the GPU's VM address > space then an IOCTL to query that is probably not an issue. > > What the debugger should probably do is to cleanly attach to the > application, get the information which CPU address is mapped to which > GPU address and then use the standard ptrace interfaces. > > I don't quite agree here -- at all. "Which CPU address is mapped to > which GPU address" makes no sense when the GPU address space and CPU > address space is completely controlled by the userspace driver/application. > > > Yeah, that's the reason why you should ask the userspace driver/application for > the necessary information and not go over the kernel to debug things. > > What hypothetical necessary information are you referring to exactly? > > > What you said before: "the GPU address space and CPU address space is > completely controlled by the userspace driver/application". When that's the > case, then why as the kernel for help? The driver/application is in control. I guess the emphasis should have been on the application part. Debugger can agree with userspace driver on conventions to facilitate debugging, but not with the application code. However, agree that query IOCTL could be avoided maintaining a shadow address space tracking in case ptrace() approach to debugging was otherwise favorable. > I already explained there are good reasons not to map all the GPU memory > into the CPU address space. > > > Well I still don't fully agree to that argumentation, but compared to using > userptr the peek/pook on a GEM handle is basically harmless. (Sidenote: We don't expose BO handles at all via debugger interface. The debugger interface fully relies on GPU addresses for everything.) But sounds like we're coming towards a conclusion that the focus of the discussion is only really on the controversy of touching userptr with the debugger peek/poke interface or not. > Please try to consider things outside of the ROCm architecture. > > > Well I consider a good part of the ROCm architecture rather broken exactly > because we haven't pushed back hard enough on bad ideas. > > > Something like a register scratch region or EU instructions should not > even be mapped to CPU address space as CPU has no business accessing it > during normal operation. And backing of such region will vary per > context/LRC on the same virtual address per EU thread. > > You seem to be suggesting to rewrite even our userspace driver to behave > the same way as ROCm driver does just so that we could implement debug memory > accesses via ptrace() to the CPU address space. > > > Oh, well certainly not. 
That ROCm has an 1 to 1 mapping between CPU and GPU is > one thing I've pushed back massively on and has now proven to be problematic. > > Right, so is your claim then that instead of being 1:1 the CPU address space > should be a superset of all GPU address spaces instead to make sure > ptrace() can modify all memory? > > > Well why not? Mapping a BO and not accessing it has only minimal overhead. > > We already considered to making that mandatory for TTM drivers for better OOM > killer handling. That approach was discontinued, but certainly not for the > overhead. I listed the reasons earlier in this message. > Cause I'm slightly lost here as you don't give much reasoning, just > claim things to be certain way. > > > Ok, that's certainly not what I'm trying to express. > > Things don't need to be in a certain way, especially not in the way ROCm does > things. > > But you should not try to re-create GPU accesses with the CPU, especially when > that isn't memory you have control over in the sense that it was allocated > through your driver stack. I guess that's what I don't quite follow. It's memory pages that are temporarily pinned and made available via GPU PTE to the GPU IOMMU, and it will inherently be able to read/write them outside of the CPU caching domain. Not sure why replacing "Submit GPU workload to peek/poke such page pinned behind PTE" with "Use CPU to peek/poke because userptr is system memory anyway" seems so controversial and could cause much more complexity than userptr in general? > That seems bit of a radical suggestion, especially given the drawbacks > pointed out in your suggested design. > > > The whole interface re-invents a lot of functionality which is already > there > > I'm not really sure I would call adding a single interface for memory > reading and writing to be "re-inventing a lot of functionality". > > All the functionality behind this interface will be needed by GPU core > dumping, anyway. Just like for the other patch series. > > > As far as I can see exactly that's an absolutely no-go. Device core dumping > should *never ever* touch memory imported by userptrs. > > Could you again elaborate on what the great difference is to short term > pinning to use in dma-fence workloads? Just the kmap? > > > The big difference is that the memory doesn't belong to the driver who is core > dumping. But the driver who is core dumping is holding a temporary pin on that memory anyway, and has it in the GPU page tables. The CPU side of the memory dump would only reflect what the CPU side memory contents were at dump time. It may have different contents than the GPU side depending on cache flush timing. Maybe this will not be true when CXL or some other coherency protocol is everywhere, but for now it is. So those two memory dumps may actually have different contents, and that might actually be the bug we're trying to debug. For GPU debugging, we're specifically interested in what the GPU threads' view of the memory was. So I think it is more complex than that. > That is just something you have imported from the MM subsystem, e.g. anonymous > memory and file backed mappings. > > We also don't allow to mmap() dma-bufs on importing devices for similar > reasons. That is a reasonable limitation for userspace applications. And at no point has there been a suggestion to expose such an API for normal userspace to shoot itself in the foot. However, a debugger is not an average userspace consumer. 
For example, it needs to be able to modify read-only memory (like the EU instructions) and then do special cache flushes to magically change those instructions. I wouldn't want to expose such functionality as a regular IOCTL for an application. > That's what process core dumping is good for. > > Not really sure I agree. If you do not dump the memory as seen by the > GPU, then you need to go parsing the CPU address space in order to make > sense which buffers were mapped where and that CPU memory contents containing > metadata could be corrupt as we're dealing with a crashing app to begin with. > > Big point of relying to the information from GPU VM for the GPU memory layout > is that it won't be corrupted by rogue memory accesses in CPU process. > > > Well that you don't want to use potentially corrupted information is a good > argument, but why just not dump an information like "range 0xabcd-0xbcde came > as userptr from process 1 VMA 0x1234-0x5678" ? I guess that could be done for interactive debugging (but would again add the ptrace() dependency). In theory you could probably also come up with such a convention for ELF to support core dumps I guess, but I would have to refer to some folks more knowledgeable on the topic. Feels like that would make things more complex via indirection compared to existing memory maps. > A process address space is not really something a device driver should be > messing with. > > > > > just because you don't like the idea to attach to the debugged > application in userspace. > > A few points that have been brought up as drawback to the > GPU debug through ptrace(), but to recap a few relevant ones for this > discussion: > > - You can only really support GDB stop-all mode or at least have to > stop all the CPU threads while you control the GPU threads to > avoid interference. Elaborated on this on the other threads more. > - Controlling the GPU threads will always interfere with CPU threads. > Doesn't seem feasible to single-step an EU thread while CPU threads > continue to run freely? > > > I would say no. > > Should this be understood that you agree these are limitations of the ROCm > debug architecture? > > > ROCm has a bunch of design decisions I would say we should never ever repeat: > > 1. Forcing a 1 to 1 model between GPU address space and CPU address space. > > 2. Using a separate file descriptor additional to the DRM render node. > > 3. Attaching information and context to the CPU process instead of the DRM > render node. > .... > > But stopping the world, e.g. both CPU and GPU threads if you want to debug > something is not one of the problematic decisions. > > That's why I'm really surprised that you insist so much on that. I'm hoping the above explanations clarify my position further. Again, I would ask myself: "Why add a dependency that is not needed?" > - You are very much restricted by the CPU VA ~ GPU VA alignment > requirement, which is not true for OpenGL or Vulkan etc. Seems > like one of the reasons why ROCm debugging is not easily extendable > outside compute? > > > Well as long as you can't take debugged threads from the hardware you can > pretty much forget any OpenGL or Vulkan debugging with this interface since it > violates the dma_fence restrictions in the kernel. > > Agreed. However doesn't mean because you can't do it right now, you you should > design an architecture that actively prevents you from doing that in the future. > > > Good point. That's what I can totally agree on as well. 
> > > - You have to expose extra memory to CPU process just for GPU > debugger access and keep track of GPU VA for each. Makes the GPU more > prone to OOB writes from CPU. Exactly what not mapping the memory > to CPU tried to protect the GPU from to begin with. > > > As far as I can see this whole idea is extremely questionable. This > looks like re-inventing the wheel in a different color. > > I see it like reinventing a round wheel compared to octagonal wheel. > > Could you elaborate with facts much more on your position why the ROCm > debugger design is an absolute must for others to adopt? > > > Well I'm trying to prevent some of the mistakes we did with the ROCm design. > > Well, I would say that the above limitations are direct results of the ROCm > debugging design. So while we're eager to learn about how you perceive > GPU debugging should work, would you mind addressing the above > shortcomings? > > > Yeah, absolutely. That you don't have a 1 to 1 mapping on the GPU is a step in > the right direction if you ask me. Right, that is to have a possibility of extending to OpenGL/Vulkan etc. > And trying to re-invent well proven kernel interfaces is one of the big > mistakes made in the ROCm design. > > Appreciate the feedback. Please work on the representation a bit as it currently > doesn't seem very helpful but appears just as an attempt to try to throw a spanner > in the works. > > > If you really want to expose an interface to userspace > > To a debugger process, enabled only behind a flag. > > > which walks the process > page table, installs an MMU notifier > > This part is already done to put an userptr to the GPU page tables to > begin with. So hopefully not too controversial. > > > kmaps the resulting page > > In addition to having it in the page tables where GPU can access it. > > > and then memcpy > to/from it then you absolutely *must* run that by guys like Christoph Hellwig, > Andrew and even Linus. > > Surely, that is why we're seeking out for review. > > We could also in theory use an in-kernel GPU context on the GPU hardware for > doing the peek/poke operations on userptr. > > > Yeah, I mean that should clearly work out. We have something similar. Right, and that might actually be desireable for the more special GPU VMA like interconnect addresses. However userptr is one of the items where it makes least sense, given we'd have to set up the transfer over bus, the transfer would read system memory over bus and write the result back to system memory over bus. And this is just to avoid kmap'ing a page that would otherwise be already temporarily pinned for being in the PTEs. I'm not saying it can't be done, but I just don't feel like it's a technically sound solution. > But that seems like a high-overhead thing to do due to the overhead of > setting up a transfer per data word and going over the PCI bus twice > compared to accessing the memory directly by CPU when it trivially can. > > > Understandable, but that will create another way of accessing process memory. Well, we hopefully should be able to align with the regular temporary pinning and making page available to PTEs, but instead of making available to PTEs, do a peek/poke and then release the page already. I'm kind of hoping to build the case for it making a lot of sense for peek/poke performance (which is important for single-stepping), and should not be a burden due to new locking chains. And thanks once again for taking the time to share the details behind the thinking and bearing with all my questions. 
It seems like the peek/poke access to userptr is the big remaining open where opinions differ, so maybe we should first focus on aligning on it. It impacts both core dumping and interactive debugging. Regards, Joonas > > Regards, > Christian. > > > > So this is the current proposal. > > Regards, Joonas > > > I'm pretty sure that those guys will note that a device driver should > absolutely not mess with such stuff. > > Regards, > Christian. > > > Otherwise it just looks like you are trying to prevent others from > implementing a more flexible debugging interface through vague comments about > "questionable design" without going into details. Not listing much concrete > benefits nor addressing the very concretely expressed drawbacks of your > suggested design, makes it seem like a very biased non-technical discussion. > > So while review interest and any comments are very much appreciated, please > also work on providing bit more reasoning and facts instead of just claiming > things. That'll help make the discussion much more fruitful. > > Regards, Joonas > > > >
On Mon, 2024-12-09 at 16:31 +0100, Simona Vetter wrote: > On Mon, Dec 09, 2024 at 03:03:04PM +0100, Christian König wrote: > > Am 09.12.24 um 14:33 schrieb Mika Kuoppala: > > > From: Andrzej Hajda <andrzej.hajda@intel.com> > > > > > > Debugger needs to read/write program's vmas including > > > userptr_vma. > > > Since hmm_range_fault is used to pin userptr vmas, it is possible > > > to map those vmas from debugger context. > > > > Oh, this implementation is extremely questionable as well. Adding > > the LKML > > and the MM list as well. > > > > First of all hmm_range_fault() does *not* pin anything! > > > > In other words you don't have a page reference when the function > > returns, > > but rather just a sequence number you can check for modifications. > > I think it's all there, holds the invalidation lock during the > critical > access/section, drops it when reacquiring pages, retries until it > works. > > I think the issue is more that everyone hand-rolls userptr. Probably > time > we standardize that and put it into gpuvm as an optional part, with > consistent locking, naming (like not calling it _pin_pages when it's > unpinnged userptr), kerneldoc and all the nice things so that we > stop consistently getting confused by other driver's userptr code. > > I think that was on the plan originally as an eventual step, I guess > time > to pump that up. Matt/Thomas, thoughts? It looks like we have this planned and ongoing but there are some complications and thoughts. 1) A drm_gpuvm implementation would be based on vma userptrs, and would be pretty straightforward based on xe's current implementation and, as you say, renaming. 2) Current Intel work to land this on the drm level is based on drm_gpusvm (minus migration to VRAM). I'm not fully sure yet how this will integrate with drm_gpuvm. 3) Christian mentioned a plan to have a common userptr implementation based off drm_exec. I figure that would be bo-based like the amdgpu implemeentation still is. Possibly i915 would be interested in this but I think any VM_BIND based driver would want to use drm_gpuvm / drm_gpusvm implementation, which is also typically O(1), since userptrs are considered vm-local. 
Ideas / suggestions welcome > -Sima > > > > > > v2: pin pages vs notifier, move to vm.c (Matthew) > > > v3: - iterate over system pages instead of DMA, fixes iommu > > > enabled > > > - s/xe_uvma_access/xe_vm_uvma_access/ (Matt) > > > > > > Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com> > > > Signed-off-by: Maciej Patelczyk <maciej.patelczyk@intel.com> > > > Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> > > > Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> #v1 > > > --- > > > drivers/gpu/drm/xe/xe_eudebug.c | 3 ++- > > > drivers/gpu/drm/xe/xe_vm.c | 47 > > > +++++++++++++++++++++++++++++++++ > > > drivers/gpu/drm/xe/xe_vm.h | 3 +++ > > > 3 files changed, 52 insertions(+), 1 deletion(-) > > > > > > diff --git a/drivers/gpu/drm/xe/xe_eudebug.c > > > b/drivers/gpu/drm/xe/xe_eudebug.c > > > index 9d87df75348b..e5949e4dcad8 100644 > > > --- a/drivers/gpu/drm/xe/xe_eudebug.c > > > +++ b/drivers/gpu/drm/xe/xe_eudebug.c > > > @@ -3076,7 +3076,8 @@ static int xe_eudebug_vma_access(struct > > > xe_vma *vma, u64 offset_in_vma, > > > return ret; > > > } > > > - return -EINVAL; > > > + return xe_vm_userptr_access(to_userptr_vma(vma), > > > offset_in_vma, > > > + buf, bytes, write); > > > } > > > static int xe_eudebug_vm_access(struct xe_vm *vm, u64 offset, > > > diff --git a/drivers/gpu/drm/xe/xe_vm.c > > > b/drivers/gpu/drm/xe/xe_vm.c > > > index 0f17bc8b627b..224ff9e16941 100644 > > > --- a/drivers/gpu/drm/xe/xe_vm.c > > > +++ b/drivers/gpu/drm/xe/xe_vm.c > > > @@ -3414,3 +3414,50 @@ void xe_vm_snapshot_free(struct > > > xe_vm_snapshot *snap) > > > } > > > kvfree(snap); > > > } > > > + > > > +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 > > > offset, > > > + void *buf, u64 len, bool write) > > > +{ > > > + struct xe_vm *vm = xe_vma_vm(&uvma->vma); > > > + struct xe_userptr *up = &uvma->userptr; > > > + struct xe_res_cursor cur = {}; > > > + int cur_len, ret = 0; > > > + > > > + while (true) { > > > + down_read(&vm->userptr.notifier_lock); > > > + if (!xe_vma_userptr_check_repin(uvma)) > > > + break; > > > + > > > + spin_lock(&vm->userptr.invalidated_lock); > > > + list_del_init(&uvma->userptr.invalidate_link); > > > + spin_unlock(&vm->userptr.invalidated_lock); > > > + > > > + up_read(&vm->userptr.notifier_lock); > > > + ret = xe_vma_userptr_pin_pages(uvma); > > > + if (ret) > > > + return ret; > > > + } > > > + > > > + if (!up->sg) { > > > + ret = -EINVAL; > > > + goto out_unlock_notifier; > > > + } > > > + > > > + for (xe_res_first_sg_system(up->sg, offset, len, &cur); > > > cur.remaining; > > > + xe_res_next(&cur, cur_len)) { > > > + void *ptr = kmap_local_page(sg_page(cur.sgl)) + > > > cur.start; > > > > The interface basically creates a side channel to access userptrs > > in the way > > an userspace application would do without actually going through > > userspace. > > > > That is generally not something a device driver should ever do as > > far as I > > can see. > > > > > + > > > + cur_len = min(cur.size, cur.remaining); > > > + if (write) > > > + memcpy(ptr, buf, cur_len); > > > + else > > > + memcpy(buf, ptr, cur_len); > > > + kunmap_local(ptr); > > > + buf += cur_len; > > > + } > > > + ret = len; > > > + > > > +out_unlock_notifier: > > > + up_read(&vm->userptr.notifier_lock); > > > > I just strongly hope that this will prevent the mapping from > > changing. > > > > Regards, > > Christian. 
> > > > > + return ret; > > > +} > > > diff --git a/drivers/gpu/drm/xe/xe_vm.h > > > b/drivers/gpu/drm/xe/xe_vm.h > > > index 23adb7442881..372ad40ad67f 100644 > > > --- a/drivers/gpu/drm/xe/xe_vm.h > > > +++ b/drivers/gpu/drm/xe/xe_vm.h > > > @@ -280,3 +280,6 @@ struct xe_vm_snapshot > > > *xe_vm_snapshot_capture(struct xe_vm *vm); > > > void xe_vm_snapshot_capture_delayed(struct xe_vm_snapshot > > > *snap); > > > void xe_vm_snapshot_print(struct xe_vm_snapshot *snap, struct > > > drm_printer *p); > > > void xe_vm_snapshot_free(struct xe_vm_snapshot *snap); > > > + > > > +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 > > > offset, > > > + void *buf, u64 len, bool write); > > >
On Thu, Dec 12, 2024 at 09:49:24AM +0100, Thomas Hellström wrote: > On Mon, 2024-12-09 at 16:31 +0100, Simona Vetter wrote: > > On Mon, Dec 09, 2024 at 03:03:04PM +0100, Christian König wrote: > > > Am 09.12.24 um 14:33 schrieb Mika Kuoppala: > > > > From: Andrzej Hajda <andrzej.hajda@intel.com> > > > > > > > > Debugger needs to read/write program's vmas including > > > > userptr_vma. > > > > Since hmm_range_fault is used to pin userptr vmas, it is possible > > > > to map those vmas from debugger context. > > > > > > Oh, this implementation is extremely questionable as well. Adding > > > the LKML > > > and the MM list as well. > > > > > > First of all hmm_range_fault() does *not* pin anything! > > > > > > In other words you don't have a page reference when the function > > > returns, > > > but rather just a sequence number you can check for modifications. > > > > I think it's all there, holds the invalidation lock during the > > critical > > access/section, drops it when reacquiring pages, retries until it > > works. > > > > I think the issue is more that everyone hand-rolls userptr. Probably > > time > > we standardize that and put it into gpuvm as an optional part, with > > consistent locking, naming (like not calling it _pin_pages when it's > > unpinnged userptr), kerneldoc and all the nice things so that we > > stop consistently getting confused by other driver's userptr code. > > > > I think that was on the plan originally as an eventual step, I guess > > time > > to pump that up. Matt/Thomas, thoughts? > > It looks like we have this planned and ongoing but there are some > complications and thoughts. > > 1) A drm_gpuvm implementation would be based on vma userptrs, and would > be pretty straightforward based on xe's current implementation and, as > you say, renaming. > > 2) Current Intel work to land this on the drm level is based on > drm_gpusvm (minus migration to VRAM). I'm not fully sure yet how this > will integrate with drm_gpuvm. > > 3) Christian mentioned a plan to have a common userptr implementation > based off drm_exec. I figure that would be bo-based like the amdgpu > implemeentation still is. Possibly i915 would be interested in this but > I think any VM_BIND based driver would want to use drm_gpuvm / > drm_gpusvm implementation, which is also typically O(1), since userptrs > are considered vm-local. > > Ideas / suggestions welcome So just discussed this a bit with Joonas, and if we use access_remote_vm for the userptr access instead of hand-rolling then we really only need bare-bones data structure changes in gpuvm, and nothing more. So - add the mm pointer to struct drm_gpuvm - add a flag indicating that it's a userptr + userspace address to struct drm_gpuva - since we already have userptr in drivers I guess there shouldn't be any need to adjust the actual drm_gpuvm code to cope with these Then with this you can write the access helper using access_remote_vm since that does the entire remote mm walking internally, and so there's no need to also have all the mmu notifier and locking lifted to gpuvm. But it does already give us some great places to put relevant kerneldocs (not just for debugging architecture, but userptr stuff in general), which is already a solid step forward. Plus I think it'd also be a solid first step that we need no matter what for figuring out the questions/options you have above. Thoughts? 
-Sima > > > -Sima > > > > > > > > > v2: pin pages vs notifier, move to vm.c (Matthew) > > > > v3: - iterate over system pages instead of DMA, fixes iommu > > > > enabled > > > > - s/xe_uvma_access/xe_vm_uvma_access/ (Matt) > > > > > > > > Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com> > > > > Signed-off-by: Maciej Patelczyk <maciej.patelczyk@intel.com> > > > > Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> > > > > Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> #v1 > > > > --- > > > > drivers/gpu/drm/xe/xe_eudebug.c | 3 ++- > > > > drivers/gpu/drm/xe/xe_vm.c | 47 > > > > +++++++++++++++++++++++++++++++++ > > > > drivers/gpu/drm/xe/xe_vm.h | 3 +++ > > > > 3 files changed, 52 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_eudebug.c > > > > b/drivers/gpu/drm/xe/xe_eudebug.c > > > > index 9d87df75348b..e5949e4dcad8 100644 > > > > --- a/drivers/gpu/drm/xe/xe_eudebug.c > > > > +++ b/drivers/gpu/drm/xe/xe_eudebug.c > > > > @@ -3076,7 +3076,8 @@ static int xe_eudebug_vma_access(struct > > > > xe_vma *vma, u64 offset_in_vma, > > > > return ret; > > > > } > > > > - return -EINVAL; > > > > + return xe_vm_userptr_access(to_userptr_vma(vma), > > > > offset_in_vma, > > > > + buf, bytes, write); > > > > } > > > > static int xe_eudebug_vm_access(struct xe_vm *vm, u64 offset, > > > > diff --git a/drivers/gpu/drm/xe/xe_vm.c > > > > b/drivers/gpu/drm/xe/xe_vm.c > > > > index 0f17bc8b627b..224ff9e16941 100644 > > > > --- a/drivers/gpu/drm/xe/xe_vm.c > > > > +++ b/drivers/gpu/drm/xe/xe_vm.c > > > > @@ -3414,3 +3414,50 @@ void xe_vm_snapshot_free(struct > > > > xe_vm_snapshot *snap) > > > > } > > > > kvfree(snap); > > > > } > > > > + > > > > +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 > > > > offset, > > > > + void *buf, u64 len, bool write) > > > > +{ > > > > + struct xe_vm *vm = xe_vma_vm(&uvma->vma); > > > > + struct xe_userptr *up = &uvma->userptr; > > > > + struct xe_res_cursor cur = {}; > > > > + int cur_len, ret = 0; > > > > + > > > > + while (true) { > > > > + down_read(&vm->userptr.notifier_lock); > > > > + if (!xe_vma_userptr_check_repin(uvma)) > > > > + break; > > > > + > > > > + spin_lock(&vm->userptr.invalidated_lock); > > > > + list_del_init(&uvma->userptr.invalidate_link); > > > > + spin_unlock(&vm->userptr.invalidated_lock); > > > > + > > > > + up_read(&vm->userptr.notifier_lock); > > > > + ret = xe_vma_userptr_pin_pages(uvma); > > > > + if (ret) > > > > + return ret; > > > > + } > > > > + > > > > + if (!up->sg) { > > > > + ret = -EINVAL; > > > > + goto out_unlock_notifier; > > > > + } > > > > + > > > > + for (xe_res_first_sg_system(up->sg, offset, len, &cur); > > > > cur.remaining; > > > > + xe_res_next(&cur, cur_len)) { > > > > + void *ptr = kmap_local_page(sg_page(cur.sgl)) + > > > > cur.start; > > > > > > The interface basically creates a side channel to access userptrs > > > in the way > > > an userspace application would do without actually going through > > > userspace. > > > > > > That is generally not something a device driver should ever do as > > > far as I > > > can see. > > > > > > > + > > > > + cur_len = min(cur.size, cur.remaining); > > > > + if (write) > > > > + memcpy(ptr, buf, cur_len); > > > > + else > > > > + memcpy(buf, ptr, cur_len); > > > > + kunmap_local(ptr); > > > > + buf += cur_len; > > > > + } > > > > + ret = len; > > > > + > > > > +out_unlock_notifier: > > > > + up_read(&vm->userptr.notifier_lock); > > > > > > I just strongly hope that this will prevent the mapping from > > > changing. 
> > > > > > Regards, > > > Christian. > > > > > > > + return ret; > > > > +} > > > > diff --git a/drivers/gpu/drm/xe/xe_vm.h > > > > b/drivers/gpu/drm/xe/xe_vm.h > > > > index 23adb7442881..372ad40ad67f 100644 > > > > --- a/drivers/gpu/drm/xe/xe_vm.h > > > > +++ b/drivers/gpu/drm/xe/xe_vm.h > > > > @@ -280,3 +280,6 @@ struct xe_vm_snapshot > > > > *xe_vm_snapshot_capture(struct xe_vm *vm); > > > > void xe_vm_snapshot_capture_delayed(struct xe_vm_snapshot > > > > *snap); > > > > void xe_vm_snapshot_print(struct xe_vm_snapshot *snap, struct > > > > drm_printer *p); > > > > void xe_vm_snapshot_free(struct xe_vm_snapshot *snap); > > > > + > > > > +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 > > > > offset, > > > > + void *buf, u64 len, bool write); > > > > > >
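To make the access_remote_vm() idea above concrete, here is a rough sketch of the helper, assuming the two data-structure additions Sima lists (an mm pointer in struct drm_gpuvm, plus a userptr flag and userspace address in struct drm_gpuva). The field names is_userptr/userptr_addr and the function name drm_gpuva_userptr_access() are hypothetical, not existing drm_gpuvm API.

#include <linux/mm.h>
#include <drm/drm_gpuvm.h>

/*
 * Assumed (hypothetical) additions:
 *   struct drm_gpuvm { ...; struct mm_struct *mm; };
 *   struct drm_gpuva { ...; bool is_userptr; unsigned long userptr_addr; };
 */
static int drm_gpuva_userptr_access(struct drm_gpuvm *gpuvm,
                                    struct drm_gpuva *va,
                                    unsigned long offset,
                                    void *buf, int len, bool write)
{
        unsigned int gup_flags = write ? FOLL_WRITE : 0;

        if (!va->is_userptr)
                return -EINVAL;

        /*
         * access_remote_vm() does the remote mm walk, faulting and kmap
         * internally (the same path ptrace peek/poke ends up in), so no
         * mmu notifier or sg handling needs to be lifted into gpuvm here.
         */
        return access_remote_vm(gpuvm->mm, va->userptr_addr + offset,
                                buf, len, gup_flags);
}

The caller would still have to guarantee the mm stays alive (mmget()/mmput()) for the duration of the access, which presumably falls out of the debugger attach lifetime.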
On Thu, Dec 12, 2024 at 11:12:39AM +0100, Simona Vetter wrote: > On Thu, Dec 12, 2024 at 09:49:24AM +0100, Thomas Hellström wrote: > > On Mon, 2024-12-09 at 16:31 +0100, Simona Vetter wrote: > > > On Mon, Dec 09, 2024 at 03:03:04PM +0100, Christian König wrote: > > > > Am 09.12.24 um 14:33 schrieb Mika Kuoppala: > > > > > From: Andrzej Hajda <andrzej.hajda@intel.com> > > > > > > > > > > Debugger needs to read/write program's vmas including > > > > > userptr_vma. > > > > > Since hmm_range_fault is used to pin userptr vmas, it is possible > > > > > to map those vmas from debugger context. > > > > > > > > Oh, this implementation is extremely questionable as well. Adding > > > > the LKML > > > > and the MM list as well. > > > > > > > > First of all hmm_range_fault() does *not* pin anything! > > > > > > > > In other words you don't have a page reference when the function > > > > returns, > > > > but rather just a sequence number you can check for modifications. > > > > > > I think it's all there, holds the invalidation lock during the > > > critical > > > access/section, drops it when reacquiring pages, retries until it > > > works. > > > > > > I think the issue is more that everyone hand-rolls userptr. Probably > > > time > > > we standardize that and put it into gpuvm as an optional part, with > > > consistent locking, naming (like not calling it _pin_pages when it's > > > unpinnged userptr), kerneldoc and all the nice things so that we > > > stop consistently getting confused by other driver's userptr code. > > > > > > I think that was on the plan originally as an eventual step, I guess > > > time > > > to pump that up. Matt/Thomas, thoughts? > > > > It looks like we have this planned and ongoing but there are some > > complications and thoughts. > > > > 1) A drm_gpuvm implementation would be based on vma userptrs, and would > > be pretty straightforward based on xe's current implementation and, as > > you say, renaming. > > My thoughts... Standardize gpuvm userptr gpuvmas a bit. In Xe I think we basically set the BO to NULL in the gpuvmas then have some helpers in Xe to determine if gpuvma is a userptr. I think some of this code could be moved into gpuvm so drivers are doing this in a standard way. I think NULL bindings also set the BO to NULL too, perhaps we standardize that too in gpuvm. > > 2) Current Intel work to land this on the drm level is based on > > drm_gpusvm (minus migration to VRAM). I'm not fully sure yet how this > > will integrate with drm_gpuvm. > > Implement the userptr locking / page collection (i.e. hmm_range_fault call) on top of gpusvm. Perhaps decouple the current page collection from drm_gpusvm_range into an embedded struct like drm_gpusvm_devmem. The plan was to more or less land gpusvm, which is on the list addressing Thomas's feedback, before doing the userptr rework on top. As of now, a different engineer will own this rework. Ofc, with Thomas and myself providing guidance and welcoming community input. Xe will likely be the first user of this, so if we have to tweak this as more drivers start to use it, ofc that is fine and we will be open to any changes. > > 3) Christian mentioned a plan to have a common userptr implementation > > based off drm_exec. I figure that would be bo-based like the amdgpu > > implemeentation still is. Possibly i915 would be interested in this but > > I think any VM_BIND based driver would want to use drm_gpuvm / > > drm_gpusvm implementation, which is also typically O(1), since userptrs > > are considered vm-local. 
I don't think any new driver would want a userptr implementation based on drm_exec because of having to use BOs, and this isn't required if drm_gpuvm / drm_gpusvm is used, which I suspect all new drivers will use. Sure, it could be useful for amdgpu / i915, but for Xe we certainly wouldn't want this, nor would a VM bind only driver. > > > > Ideas / suggestions welcome > > So just discussed this a bit with Joonas, and if we use access_remote_vm > for the userptr access instead of hand-rolling then we really only need > bare-bones data structure changes in gpuvm, and nothing more. So > > - add the mm pointer to struct drm_gpuvm > - add a flag indicating that it's a userptr + userspace address to struct > drm_gpuva > - since we already have userptr in drivers I guess there should be any > need to adjust the actual drm_gpuvm code to cope with these > > Then with this you can write the access helper using access_remote_vm > since that does the entire remote mm walking internally, and so there's > no need to also have all the mmu notifier and locking lifted to gpuvm. But > it does already give us some great places to put relevant kerneldocs (not > just for debugging architecture, but userptr stuff in general), which is > already a solid step forward. > > Plus I think it'd would also be a solid first step that we need no matter > what for figuring out the questions/options you have above. > > Thoughts? This seems like it could work with everything I've written above. Maybe this lives in gpusvm though so we have a clear divide where gpuvm is GPU address space, and gpusvm is CPU address space. Kinda a bikeshed, but I agree in general that if we need to access / modify userptrs this lives in common code. Do we view this userptr rework as a blocker for EuDebug? My thinking was we don't, as we (Intel) have fully committed to a common userptr implementation. FWIW, I really don't think the implementation in this patch is the right one, and I stated this many times, but that feedback seems to have been ignored yet again. I'd prefer an open-coded hmm_range_fault loop for now rather than a new xe_res_cursor concept that will get thrown away. 
Matt > -Sima > > > > > > -Sima > > > > > > > > > > > > v2: pin pages vs notifier, move to vm.c (Matthew) > > > > > v3: - iterate over system pages instead of DMA, fixes iommu > > > > > enabled > > > > > - s/xe_uvma_access/xe_vm_uvma_access/ (Matt) > > > > > > > > > > Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com> > > > > > Signed-off-by: Maciej Patelczyk <maciej.patelczyk@intel.com> > > > > > Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> > > > > > Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> #v1 > > > > > --- > > > > > drivers/gpu/drm/xe/xe_eudebug.c | 3 ++- > > > > > drivers/gpu/drm/xe/xe_vm.c | 47 > > > > > +++++++++++++++++++++++++++++++++ > > > > > drivers/gpu/drm/xe/xe_vm.h | 3 +++ > > > > > 3 files changed, 52 insertions(+), 1 deletion(-) > > > > > > > > > > diff --git a/drivers/gpu/drm/xe/xe_eudebug.c > > > > > b/drivers/gpu/drm/xe/xe_eudebug.c > > > > > index 9d87df75348b..e5949e4dcad8 100644 > > > > > --- a/drivers/gpu/drm/xe/xe_eudebug.c > > > > > +++ b/drivers/gpu/drm/xe/xe_eudebug.c > > > > > @@ -3076,7 +3076,8 @@ static int xe_eudebug_vma_access(struct > > > > > xe_vma *vma, u64 offset_in_vma, > > > > > return ret; > > > > > } > > > > > - return -EINVAL; > > > > > + return xe_vm_userptr_access(to_userptr_vma(vma), > > > > > offset_in_vma, > > > > > + buf, bytes, write); > > > > > } > > > > > static int xe_eudebug_vm_access(struct xe_vm *vm, u64 offset, > > > > > diff --git a/drivers/gpu/drm/xe/xe_vm.c > > > > > b/drivers/gpu/drm/xe/xe_vm.c > > > > > index 0f17bc8b627b..224ff9e16941 100644 > > > > > --- a/drivers/gpu/drm/xe/xe_vm.c > > > > > +++ b/drivers/gpu/drm/xe/xe_vm.c > > > > > @@ -3414,3 +3414,50 @@ void xe_vm_snapshot_free(struct > > > > > xe_vm_snapshot *snap) > > > > > } > > > > > kvfree(snap); > > > > > } > > > > > + > > > > > +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 > > > > > offset, > > > > > + void *buf, u64 len, bool write) > > > > > +{ > > > > > + struct xe_vm *vm = xe_vma_vm(&uvma->vma); > > > > > + struct xe_userptr *up = &uvma->userptr; > > > > > + struct xe_res_cursor cur = {}; > > > > > + int cur_len, ret = 0; > > > > > + > > > > > + while (true) { > > > > > + down_read(&vm->userptr.notifier_lock); > > > > > + if (!xe_vma_userptr_check_repin(uvma)) > > > > > + break; > > > > > + > > > > > + spin_lock(&vm->userptr.invalidated_lock); > > > > > + list_del_init(&uvma->userptr.invalidate_link); > > > > > + spin_unlock(&vm->userptr.invalidated_lock); > > > > > + > > > > > + up_read(&vm->userptr.notifier_lock); > > > > > + ret = xe_vma_userptr_pin_pages(uvma); > > > > > + if (ret) > > > > > + return ret; > > > > > + } > > > > > + > > > > > + if (!up->sg) { > > > > > + ret = -EINVAL; > > > > > + goto out_unlock_notifier; > > > > > + } > > > > > + > > > > > + for (xe_res_first_sg_system(up->sg, offset, len, &cur); > > > > > cur.remaining; > > > > > + xe_res_next(&cur, cur_len)) { > > > > > + void *ptr = kmap_local_page(sg_page(cur.sgl)) + > > > > > cur.start; > > > > > > > > The interface basically creates a side channel to access userptrs > > > > in the way > > > > an userspace application would do without actually going through > > > > userspace. > > > > > > > > That is generally not something a device driver should ever do as > > > > far as I > > > > can see. 
> > > > > > > > > + > > > > > + cur_len = min(cur.size, cur.remaining); > > > > > + if (write) > > > > > + memcpy(ptr, buf, cur_len); > > > > > + else > > > > > + memcpy(buf, ptr, cur_len); > > > > > + kunmap_local(ptr); > > > > > + buf += cur_len; > > > > > + } > > > > > + ret = len; > > > > > + > > > > > +out_unlock_notifier: > > > > > + up_read(&vm->userptr.notifier_lock); > > > > > > > > I just strongly hope that this will prevent the mapping from > > > > changing. > > > > > > > > Regards, > > > > Christian. > > > > > > > > > + return ret; > > > > > +} > > > > > diff --git a/drivers/gpu/drm/xe/xe_vm.h > > > > > b/drivers/gpu/drm/xe/xe_vm.h > > > > > index 23adb7442881..372ad40ad67f 100644 > > > > > --- a/drivers/gpu/drm/xe/xe_vm.h > > > > > +++ b/drivers/gpu/drm/xe/xe_vm.h > > > > > @@ -280,3 +280,6 @@ struct xe_vm_snapshot > > > > > *xe_vm_snapshot_capture(struct xe_vm *vm); > > > > > void xe_vm_snapshot_capture_delayed(struct xe_vm_snapshot > > > > > *snap); > > > > > void xe_vm_snapshot_print(struct xe_vm_snapshot *snap, struct > > > > > drm_printer *p); > > > > > void xe_vm_snapshot_free(struct xe_vm_snapshot *snap); > > > > > + > > > > > +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 > > > > > offset, > > > > > + void *buf, u64 len, bool write); > > > > > > > > > > > -- > Simona Vetter > Software Engineer, Intel Corporation > http://blog.ffwll.ch
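For illustration, a minimal sketch of the access helper Simona suggests above, built on access_remote_vm(). The proposed mm pointer in struct drm_gpuvm and the userptr address/flag in struct drm_gpuva do not exist yet, so the sketch takes them as plain parameters; note also that access_remote_vm() is not currently exported to modules, which is the symbol-export question that comes up later in the thread.

#include <linux/mm.h>
#include <linux/sched/mm.h>

/* Sketch only: mm would come from the proposed drm_gpuvm mm pointer and
 * uaddr from a userptr-flagged drm_gpuva; neither field exists today. */
static int gpuvm_userptr_access(struct mm_struct *mm, unsigned long uaddr,
				void *buf, int len, bool write)
{
	unsigned int gup_flags = write ? FOLL_WRITE : 0;
	int ret;

	if (!mmget_not_zero(mm))
		return -EFAULT;

	/* access_remote_vm() walks the remote mm, faults pages in and does
	 * the copy itself, so no driver-side kmap or mmu-notifier handling
	 * is needed on the access path. */
	ret = access_remote_vm(mm, uaddr, buf, len, gup_flags);

	mmput(mm);
	return ret;
}

Like access_remote_vm() itself, this returns the number of bytes actually copied, which may be fewer than len.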
Quoting Joonas Lahtinen (2024-12-11 14:59:33) > Quoting Christian König (2024-12-10 16:03:14) <SNIP> > > If you really want to expose an interface to userspace which walks the process > > page table, installs an MMU notifier, kmaps the resulting page and then memcpy > > to/from it then you absolutely *must* run that by guys like Christoph Hellwig, > > Andrew and even Linus. > > I'm pretty sure that those guys will note that a device driver should > > absolutely not mess with such stuff. <SNIP> > > But that seems like a high-overhead thing to do due to the overhead of > > setting up a transfer per data word and going over the PCI bus twice > > compared to accessing the memory directly by CPU when it trivially can. > > > > > > Understandable, but that will create another way of accessing process memory. Based on this feedback and some further discussion, we now have an alternative implementation of this interface via the access_process_vm function, posted by Mika: https://lore.kernel.org/dri-devel/20241216141721.2051279-1-mika.kuoppala@linux.intel.com/ It's a couple of dozen lines, doesn't need to do any open-coded kmapping, and only utilizes the pre-existing memory access functions. Hopefully that addresses the above concerns? Regards, Joonas PS. It could still be optimized further to directly use the struct mm from within the mm notifier, and go with access_remote_vm using that, but that would require a new symbol export. For demonstration, it is implemented by grabbing the task_struct and using the already exported access_process_vm function.
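For reference, a rough sketch of the access_process_vm() based variant described above; the function name is made up and how the debugger obtains the task_struct reference is deliberately left as a parameter, so this illustrates the shape rather than the posted patch.

#include <linux/mm.h>
#include <linux/sched.h>

/* Illustrative only; the name and how the task reference reaches the
 * debugger are assumptions, not taken from the posted patch. */
static int eudebug_userptr_access(struct task_struct *task, unsigned long uaddr,
				  void *buf, int len, bool write)
{
	unsigned int gup_flags = write ? FOLL_WRITE : 0;

	/* access_process_vm() takes its own reference on the task's mm and
	 * does the page walking, faulting, mapping and copying internally,
	 * which is what removes the open-coded kmap loop. */
	return access_process_vm(task, uaddr, buf, len, gup_flags);
}

As the PS above notes, calling access_remote_vm() directly on the struct mm already held via the mmu notifier would avoid the task lookup, at the cost of exporting that symbol.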
Joonas Lahtinen <joonas.lahtinen@linux.intel.com> writes: > Quoting Joonas Lahtinen (2024-12-11 14:59:33) >> Quoting Christian König (2024-12-10 16:03:14) > > <SNIP> > >> > If you really want to expose an interface to userspace which walks the process >> > page table, installs an MMU notifier, kmaps the resulting page and then memcpy >> > to/from it then you absolutely *must* run that by guys like Christoph Hellwig, >> > Andrew and even Linus. > >> > I'm pretty sure that those guys will note that a device driver should >> > absolutely not mess with such stuff. > > <SNIP> > >> > But that seems like a high-overhead thing to do due to the overhead of >> > setting up a transfer per data word and going over the PCI bus twice >> > compared to accessing the memory directly by CPU when it trivially can. >> > >> > >> > Understandable, but that will create another way of accessing process memory. > > Based on this feedback and some further discussion, we now have an alternative > implementation for this interface via access_process_vm function posted by Mika: > > https://lore.kernel.org/dri-devel/20241216141721.2051279-1-mika.kuoppala@linux.intel.com/ v2: https://lore.kernel.org/dri-devel/20241220113108.2386842-1-mika.kuoppala@linux.intel.com/ -Mika > > It's a couple of dozen lines don't need to do any open-coded kmapping, only utilizing > the pre-existing memory access functions. > > Hopefully that would address the above concerns? > > Regards, Joonas > > PS. It could still be optimized further to directly use the struct mm > from within the mm notifier, and go with access_remote_vm using that, > but would require new symbol export. > > For demonstration it is implemented by grabbing the task_struct and using > the already exported access_process_vm function.
diff --git a/drivers/gpu/drm/xe/xe_eudebug.c b/drivers/gpu/drm/xe/xe_eudebug.c index 9d87df75348b..e5949e4dcad8 100644 --- a/drivers/gpu/drm/xe/xe_eudebug.c +++ b/drivers/gpu/drm/xe/xe_eudebug.c @@ -3076,7 +3076,8 @@ static int xe_eudebug_vma_access(struct xe_vma *vma, u64 offset_in_vma, return ret; } - return -EINVAL; + return xe_vm_userptr_access(to_userptr_vma(vma), offset_in_vma, + buf, bytes, write); } static int xe_eudebug_vm_access(struct xe_vm *vm, u64 offset, diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c index 0f17bc8b627b..224ff9e16941 100644 --- a/drivers/gpu/drm/xe/xe_vm.c +++ b/drivers/gpu/drm/xe/xe_vm.c @@ -3414,3 +3414,50 @@ void xe_vm_snapshot_free(struct xe_vm_snapshot *snap) } kvfree(snap); } + +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 offset, + void *buf, u64 len, bool write) +{ + struct xe_vm *vm = xe_vma_vm(&uvma->vma); + struct xe_userptr *up = &uvma->userptr; + struct xe_res_cursor cur = {}; + int cur_len, ret = 0; + + while (true) { + down_read(&vm->userptr.notifier_lock); + if (!xe_vma_userptr_check_repin(uvma)) + break; + + spin_lock(&vm->userptr.invalidated_lock); + list_del_init(&uvma->userptr.invalidate_link); + spin_unlock(&vm->userptr.invalidated_lock); + + up_read(&vm->userptr.notifier_lock); + ret = xe_vma_userptr_pin_pages(uvma); + if (ret) + return ret; + } + + if (!up->sg) { + ret = -EINVAL; + goto out_unlock_notifier; + } + + for (xe_res_first_sg_system(up->sg, offset, len, &cur); cur.remaining; + xe_res_next(&cur, cur_len)) { + void *ptr = kmap_local_page(sg_page(cur.sgl)) + cur.start; + + cur_len = min(cur.size, cur.remaining); + if (write) + memcpy(ptr, buf, cur_len); + else + memcpy(buf, ptr, cur_len); + kunmap_local(ptr); + buf += cur_len; + } + ret = len; + +out_unlock_notifier: + up_read(&vm->userptr.notifier_lock); + return ret; +} diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h index 23adb7442881..372ad40ad67f 100644 --- a/drivers/gpu/drm/xe/xe_vm.h +++ b/drivers/gpu/drm/xe/xe_vm.h @@ -280,3 +280,6 @@ struct xe_vm_snapshot *xe_vm_snapshot_capture(struct xe_vm *vm); void xe_vm_snapshot_capture_delayed(struct xe_vm_snapshot *snap); void xe_vm_snapshot_print(struct xe_vm_snapshot *snap, struct drm_printer *p); void xe_vm_snapshot_free(struct xe_vm_snapshot *snap); + +int xe_vm_userptr_access(struct xe_userptr_vma *uvma, u64 offset, + void *buf, u64 len, bool write);