[v3] drm/i915: Ensure associated VMAs are inactive when contexts are destroyed
diff mbox

Message ID 1447777632-32099-1-git-send-email-tvrtko.ursulin@linux.intel.com
State New
Headers show

Commit Message

Tvrtko Ursulin Nov. 17, 2015, 4:27 p.m. UTC
From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

In the following commit:

    commit e9f24d5fb7cf3628b195b18ff3ac4e37937ceeae
    Author: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Date:   Mon Oct 5 13:26:36 2015 +0100

        drm/i915: Clean up associated VMAs on context destruction

I added a WARN_ON assertion that VM's active list must be empty
at the time of owning context is getting freed, but that turned
out to be a wrong assumption.

Due ordering of operations in i915_gem_object_retire__read, where
contexts are unreferenced before VMAs are moved to the inactive
list, the described situation can in fact happen.

It feels wrong to do things in such order so this fix makes sure
a reference to context is held until the move to inactive list
is completed.

v2: Rather than hold a temporary context reference move the
    request unreference to be the last operation. (Daniel Vetter)

v3: Fix use after free. (Chris Wilson)

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=92638
Cc: Michel Thierry <michel.thierry@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/gpu/drm/i915/i915_gem.c | 33 ++++++++++++++++++---------------
 1 file changed, 18 insertions(+), 15 deletions(-)

Comments

Daniel Vetter Nov. 17, 2015, 4:39 p.m. UTC | #1
On Tue, Nov 17, 2015 at 04:27:12PM +0000, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> 
> In the following commit:
> 
>     commit e9f24d5fb7cf3628b195b18ff3ac4e37937ceeae
>     Author: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>     Date:   Mon Oct 5 13:26:36 2015 +0100
> 
>         drm/i915: Clean up associated VMAs on context destruction
> 
> I added a WARN_ON assertion that VM's active list must be empty
> at the time of owning context is getting freed, but that turned
> out to be a wrong assumption.
> 
> Due ordering of operations in i915_gem_object_retire__read, where
> contexts are unreferenced before VMAs are moved to the inactive
> list, the described situation can in fact happen.
> 
> It feels wrong to do things in such order so this fix makes sure
> a reference to context is held until the move to inactive list
> is completed.
> 
> v2: Rather than hold a temporary context reference move the
>     request unreference to be the last operation. (Daniel Vetter)
> 
> v3: Fix use after free. (Chris Wilson)
> 
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=92638
> Cc: Michel Thierry <michel.thierry@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> ---
>  drivers/gpu/drm/i915/i915_gem.c | 33 ++++++++++++++++++---------------
>  1 file changed, 18 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 98c83286ab68..094ac17a712d 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2404,29 +2404,32 @@ i915_gem_object_retire__read(struct drm_i915_gem_object *obj, int ring)
>  	RQ_BUG_ON(!(obj->active & (1 << ring)));
>  
>  	list_del_init(&obj->ring_list[ring]);
> -	i915_gem_request_assign(&obj->last_read_req[ring], NULL);
>  
>  	if (obj->last_write_req && obj->last_write_req->ring->id == ring)
>  		i915_gem_object_retire__write(obj);
>  
>  	obj->active &= ~(1 << ring);
> -	if (obj->active)
> -		return;

	if (obj->active) {
		i915_gem_request_assign(&obj->last_read_req[ring], NULL);
		return;
	}

Would result in less churn in the code and drop the unecessary indent
level. Also comment is missing as to why we need to do things in a
specific order.
-Daniel

>  
> -	/* Bump our place on the bound list to keep it roughly in LRU order
> -	 * so that we don't steal from recently used but inactive objects
> -	 * (unless we are forced to ofc!)
> -	 */
> -	list_move_tail(&obj->global_list,
> -		       &to_i915(obj->base.dev)->mm.bound_list);
> +	if (!obj->active) {
> +		/* Bump our place on the bound list to keep it roughly in LRU order
> +		* so that we don't steal from recently used but inactive objects
> +		* (unless we are forced to ofc!)
> +		*/
> +		list_move_tail(&obj->global_list,
> +			&to_i915(obj->base.dev)->mm.bound_list);
>  
> -	list_for_each_entry(vma, &obj->vma_list, vma_link) {
> -		if (!list_empty(&vma->mm_list))
> -			list_move_tail(&vma->mm_list, &vma->vm->inactive_list);
> -	}
> +		list_for_each_entry(vma, &obj->vma_list, vma_link) {
> +			if (!list_empty(&vma->mm_list))
> +				list_move_tail(&vma->mm_list,
> +					       &vma->vm->inactive_list);
> +		}
>  
> -	i915_gem_request_assign(&obj->last_fenced_req, NULL);
> -	drm_gem_object_unreference(&obj->base);
> +		i915_gem_request_assign(&obj->last_fenced_req, NULL);
> +		i915_gem_request_assign(&obj->last_read_req[ring], NULL);
> +		drm_gem_object_unreference(&obj->base);
> +	} else {
> +		i915_gem_request_assign(&obj->last_read_req[ring], NULL);
> +	}
>  }
>  
>  static int
> -- 
> 1.9.1
>
Tvrtko Ursulin Nov. 17, 2015, 4:54 p.m. UTC | #2
On 17/11/15 16:39, Daniel Vetter wrote:
> On Tue, Nov 17, 2015 at 04:27:12PM +0000, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>
>> In the following commit:
>>
>>      commit e9f24d5fb7cf3628b195b18ff3ac4e37937ceeae
>>      Author: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>      Date:   Mon Oct 5 13:26:36 2015 +0100
>>
>>          drm/i915: Clean up associated VMAs on context destruction
>>
>> I added a WARN_ON assertion that VM's active list must be empty
>> at the time of owning context is getting freed, but that turned
>> out to be a wrong assumption.
>>
>> Due ordering of operations in i915_gem_object_retire__read, where
>> contexts are unreferenced before VMAs are moved to the inactive
>> list, the described situation can in fact happen.
>>
>> It feels wrong to do things in such order so this fix makes sure
>> a reference to context is held until the move to inactive list
>> is completed.
>>
>> v2: Rather than hold a temporary context reference move the
>>      request unreference to be the last operation. (Daniel Vetter)
>>
>> v3: Fix use after free. (Chris Wilson)
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=92638
>> Cc: Michel Thierry <michel.thierry@intel.com>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
>> ---
>>   drivers/gpu/drm/i915/i915_gem.c | 33 ++++++++++++++++++---------------
>>   1 file changed, 18 insertions(+), 15 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>> index 98c83286ab68..094ac17a712d 100644
>> --- a/drivers/gpu/drm/i915/i915_gem.c
>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>> @@ -2404,29 +2404,32 @@ i915_gem_object_retire__read(struct drm_i915_gem_object *obj, int ring)
>>   	RQ_BUG_ON(!(obj->active & (1 << ring)));
>>
>>   	list_del_init(&obj->ring_list[ring]);
>> -	i915_gem_request_assign(&obj->last_read_req[ring], NULL);
>>
>>   	if (obj->last_write_req && obj->last_write_req->ring->id == ring)
>>   		i915_gem_object_retire__write(obj);
>>
>>   	obj->active &= ~(1 << ring);
>> -	if (obj->active)
>> -		return;
>
> 	if (obj->active) {
> 		i915_gem_request_assign(&obj->last_read_req[ring], NULL);
> 		return;
> 	}
>
> Would result in less churn in the code and drop the unecessary indent
> level. Also comment is missing as to why we need to do things in a
> specific order.

Actually I think I changed my mind and that v1 is the way to go.

Just re-ordering the code here still makes it possible for the context 
destructor to run with VMAs on the active list I think.

If we hold the context then it is 100% clear it is not possible.

Regards,

Tvrtko
Daniel Vetter Nov. 17, 2015, 5:08 p.m. UTC | #3
On Tue, Nov 17, 2015 at 04:54:50PM +0000, Tvrtko Ursulin wrote:
> 
> On 17/11/15 16:39, Daniel Vetter wrote:
> >On Tue, Nov 17, 2015 at 04:27:12PM +0000, Tvrtko Ursulin wrote:
> >>From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>
> >>In the following commit:
> >>
> >>     commit e9f24d5fb7cf3628b195b18ff3ac4e37937ceeae
> >>     Author: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>     Date:   Mon Oct 5 13:26:36 2015 +0100
> >>
> >>         drm/i915: Clean up associated VMAs on context destruction
> >>
> >>I added a WARN_ON assertion that VM's active list must be empty
> >>at the time of owning context is getting freed, but that turned
> >>out to be a wrong assumption.
> >>
> >>Due ordering of operations in i915_gem_object_retire__read, where
> >>contexts are unreferenced before VMAs are moved to the inactive
> >>list, the described situation can in fact happen.
> >>
> >>It feels wrong to do things in such order so this fix makes sure
> >>a reference to context is held until the move to inactive list
> >>is completed.
> >>
> >>v2: Rather than hold a temporary context reference move the
> >>     request unreference to be the last operation. (Daniel Vetter)
> >>
> >>v3: Fix use after free. (Chris Wilson)
> >>
> >>Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=92638
> >>Cc: Michel Thierry <michel.thierry@intel.com>
> >>Cc: Chris Wilson <chris@chris-wilson.co.uk>
> >>Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> >>---
> >>  drivers/gpu/drm/i915/i915_gem.c | 33 ++++++++++++++++++---------------
> >>  1 file changed, 18 insertions(+), 15 deletions(-)
> >>
> >>diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> >>index 98c83286ab68..094ac17a712d 100644
> >>--- a/drivers/gpu/drm/i915/i915_gem.c
> >>+++ b/drivers/gpu/drm/i915/i915_gem.c
> >>@@ -2404,29 +2404,32 @@ i915_gem_object_retire__read(struct drm_i915_gem_object *obj, int ring)
> >>  	RQ_BUG_ON(!(obj->active & (1 << ring)));
> >>
> >>  	list_del_init(&obj->ring_list[ring]);
> >>-	i915_gem_request_assign(&obj->last_read_req[ring], NULL);
> >>
> >>  	if (obj->last_write_req && obj->last_write_req->ring->id == ring)
> >>  		i915_gem_object_retire__write(obj);
> >>
> >>  	obj->active &= ~(1 << ring);
> >>-	if (obj->active)
> >>-		return;
> >
> >	if (obj->active) {
> >		i915_gem_request_assign(&obj->last_read_req[ring], NULL);
> >		return;
> >	}
> >
> >Would result in less churn in the code and drop the unecessary indent
> >level. Also comment is missing as to why we need to do things in a
> >specific order.
> 
> Actually I think I changed my mind and that v1 is the way to go.
> 
> Just re-ordering the code here still makes it possible for the context
> destructor to run with VMAs on the active list I think.
> 
> If we hold the context then it is 100% clear it is not possible.

request_assign _is_ the function which adjust the refcounts for us, which
means if we drop that reference too early then grabbing a temp reference
is just papering over the real bug.

Written out your patch looks something like

	a_reference(a);
	a_unreference(a);

	/* more cleanup code that should get run before a_unreference but isn't */

	a_unrefernce(a); /* for real this time */

Unfortunately foo_assign is a new pattern and not well-established, so
that connection isn't clear. Maybe we should rename it to
foo_reference_assign to make it more obvious. Or just drop the pretense
and open-code it since we unconditionally assign NULL as the new pointer
value, and we know the current value of the pointer is non-NULL. So
there's really no benefit to the helper here, it only obfuscates. And
since that obfuscation tripped you up it's time to remove it ;-)

Cheers, Daniel
Tvrtko Ursulin Nov. 17, 2015, 5:24 p.m. UTC | #4
On 17/11/15 17:08, Daniel Vetter wrote:
> On Tue, Nov 17, 2015 at 04:54:50PM +0000, Tvrtko Ursulin wrote:
>>
>> On 17/11/15 16:39, Daniel Vetter wrote:
>>> On Tue, Nov 17, 2015 at 04:27:12PM +0000, Tvrtko Ursulin wrote:
>>>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>
>>>> In the following commit:
>>>>
>>>>      commit e9f24d5fb7cf3628b195b18ff3ac4e37937ceeae
>>>>      Author: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>      Date:   Mon Oct 5 13:26:36 2015 +0100
>>>>
>>>>          drm/i915: Clean up associated VMAs on context destruction
>>>>
>>>> I added a WARN_ON assertion that VM's active list must be empty
>>>> at the time of owning context is getting freed, but that turned
>>>> out to be a wrong assumption.
>>>>
>>>> Due ordering of operations in i915_gem_object_retire__read, where
>>>> contexts are unreferenced before VMAs are moved to the inactive
>>>> list, the described situation can in fact happen.
>>>>
>>>> It feels wrong to do things in such order so this fix makes sure
>>>> a reference to context is held until the move to inactive list
>>>> is completed.
>>>>
>>>> v2: Rather than hold a temporary context reference move the
>>>>      request unreference to be the last operation. (Daniel Vetter)
>>>>
>>>> v3: Fix use after free. (Chris Wilson)
>>>>
>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=92638
>>>> Cc: Michel Thierry <michel.thierry@intel.com>
>>>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>>>> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
>>>> ---
>>>>   drivers/gpu/drm/i915/i915_gem.c | 33 ++++++++++++++++++---------------
>>>>   1 file changed, 18 insertions(+), 15 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>>> index 98c83286ab68..094ac17a712d 100644
>>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>>>> @@ -2404,29 +2404,32 @@ i915_gem_object_retire__read(struct drm_i915_gem_object *obj, int ring)
>>>>   	RQ_BUG_ON(!(obj->active & (1 << ring)));
>>>>
>>>>   	list_del_init(&obj->ring_list[ring]);
>>>> -	i915_gem_request_assign(&obj->last_read_req[ring], NULL);
>>>>
>>>>   	if (obj->last_write_req && obj->last_write_req->ring->id == ring)
>>>>   		i915_gem_object_retire__write(obj);
>>>>
>>>>   	obj->active &= ~(1 << ring);
>>>> -	if (obj->active)
>>>> -		return;
>>>
>>> 	if (obj->active) {
>>> 		i915_gem_request_assign(&obj->last_read_req[ring], NULL);
>>> 		return;
>>> 	}
>>>
>>> Would result in less churn in the code and drop the unecessary indent
>>> level. Also comment is missing as to why we need to do things in a
>>> specific order.
>>
>> Actually I think I changed my mind and that v1 is the way to go.
>>
>> Just re-ordering the code here still makes it possible for the context
>> destructor to run with VMAs on the active list I think.
>>
>> If we hold the context then it is 100% clear it is not possible.
>
> request_assign _is_ the function which adjust the refcounts for us, which
> means if we drop that reference too early then grabbing a temp reference
> is just papering over the real bug.
>
> Written out your patch looks something like
>
> 	a_reference(a);
> 	a_unreference(a);
>
> 	/* more cleanup code that should get run before a_unreference but isn't */
>
> 	a_unrefernce(a); /* for real this time */
>
> Unfortunately foo_assign is a new pattern and not well-established, so
> that connection isn't clear. Maybe we should rename it to
> foo_reference_assign to make it more obvious. Or just drop the pretense
> and open-code it since we unconditionally assign NULL as the new pointer
> value, and we know the current value of the pointer is non-NULL. So
> there's really no benefit to the helper here, it only obfuscates. And
> since that obfuscation tripped you up it's time to remove it ;-)

Then foo_reference_unreference_assign. :)

But seriously, I think it is more complicated that..

The thing it trips over is that moving VMAs to inactive does not 
correspond in time to request retirement. But in fact VMAs are moved to 
inactive only when all requests associated with an object are done.

This is the unintuitive thing I was working around. To make sure when 
context destructor runs there are not active VMAs for that VM.

I don't know how to guarantee that with what you propose. Perhaps I am 
missing something?

Regards,

Tvrtko
Tvrtko Ursulin Nov. 17, 2015, 5:32 p.m. UTC | #5
On 17/11/15 17:24, Tvrtko Ursulin wrote:
>
> On 17/11/15 17:08, Daniel Vetter wrote:
>> On Tue, Nov 17, 2015 at 04:54:50PM +0000, Tvrtko Ursulin wrote:
>>>
>>> On 17/11/15 16:39, Daniel Vetter wrote:
>>>> On Tue, Nov 17, 2015 at 04:27:12PM +0000, Tvrtko Ursulin wrote:
>>>>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>>
>>>>> In the following commit:
>>>>>
>>>>>      commit e9f24d5fb7cf3628b195b18ff3ac4e37937ceeae
>>>>>      Author: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>>      Date:   Mon Oct 5 13:26:36 2015 +0100
>>>>>
>>>>>          drm/i915: Clean up associated VMAs on context destruction
>>>>>
>>>>> I added a WARN_ON assertion that VM's active list must be empty
>>>>> at the time of owning context is getting freed, but that turned
>>>>> out to be a wrong assumption.
>>>>>
>>>>> Due ordering of operations in i915_gem_object_retire__read, where
>>>>> contexts are unreferenced before VMAs are moved to the inactive
>>>>> list, the described situation can in fact happen.
>>>>>
>>>>> It feels wrong to do things in such order so this fix makes sure
>>>>> a reference to context is held until the move to inactive list
>>>>> is completed.
>>>>>
>>>>> v2: Rather than hold a temporary context reference move the
>>>>>      request unreference to be the last operation. (Daniel Vetter)
>>>>>
>>>>> v3: Fix use after free. (Chris Wilson)
>>>>>
>>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=92638
>>>>> Cc: Michel Thierry <michel.thierry@intel.com>
>>>>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>>>>> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
>>>>> ---
>>>>>   drivers/gpu/drm/i915/i915_gem.c | 33
>>>>> ++++++++++++++++++---------------
>>>>>   1 file changed, 18 insertions(+), 15 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c
>>>>> b/drivers/gpu/drm/i915/i915_gem.c
>>>>> index 98c83286ab68..094ac17a712d 100644
>>>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>>>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>>>>> @@ -2404,29 +2404,32 @@ i915_gem_object_retire__read(struct
>>>>> drm_i915_gem_object *obj, int ring)
>>>>>       RQ_BUG_ON(!(obj->active & (1 << ring)));
>>>>>
>>>>>       list_del_init(&obj->ring_list[ring]);
>>>>> -    i915_gem_request_assign(&obj->last_read_req[ring], NULL);
>>>>>
>>>>>       if (obj->last_write_req && obj->last_write_req->ring->id ==
>>>>> ring)
>>>>>           i915_gem_object_retire__write(obj);
>>>>>
>>>>>       obj->active &= ~(1 << ring);
>>>>> -    if (obj->active)
>>>>> -        return;
>>>>
>>>>     if (obj->active) {
>>>>         i915_gem_request_assign(&obj->last_read_req[ring], NULL);
>>>>         return;
>>>>     }
>>>>
>>>> Would result in less churn in the code and drop the unecessary indent
>>>> level. Also comment is missing as to why we need to do things in a
>>>> specific order.
>>>
>>> Actually I think I changed my mind and that v1 is the way to go.
>>>
>>> Just re-ordering the code here still makes it possible for the context
>>> destructor to run with VMAs on the active list I think.
>>>
>>> If we hold the context then it is 100% clear it is not possible.
>>
>> request_assign _is_ the function which adjust the refcounts for us, which
>> means if we drop that reference too early then grabbing a temp reference
>> is just papering over the real bug.
>>
>> Written out your patch looks something like
>>
>>     a_reference(a);
>>     a_unreference(a);
>>
>>     /* more cleanup code that should get run before a_unreference but
>> isn't */
>>
>>     a_unrefernce(a); /* for real this time */
>>
>> Unfortunately foo_assign is a new pattern and not well-established, so
>> that connection isn't clear. Maybe we should rename it to
>> foo_reference_assign to make it more obvious. Or just drop the pretense
>> and open-code it since we unconditionally assign NULL as the new pointer
>> value, and we know the current value of the pointer is non-NULL. So
>> there's really no benefit to the helper here, it only obfuscates. And
>> since that obfuscation tripped you up it's time to remove it ;-)
>
> Then foo_reference_unreference_assign. :)
>
> But seriously, I think it is more complicated that..
>
> The thing it trips over is that moving VMAs to inactive does not
> correspond in time to request retirement. But in fact VMAs are moved to
> inactive only when all requests associated with an object are done.
>
> This is the unintuitive thing I was working around. To make sure when
> context destructor runs there are not active VMAs for that VM.
>
> I don't know how to guarantee that with what you propose. Perhaps I am
> missing something?

Maybe completely different approach would be to find the VMA belonging 
to req->ctx->vm in i915_gem_request_free and move it to the inactive 
list before context unreference there?

Should would I think, if request is going away means VMA definitely 
needs to go to inactive..

Regards,

Tvrtko
Tvrtko Ursulin Nov. 17, 2015, 5:34 p.m. UTC | #6
On 17/11/15 17:32, Tvrtko Ursulin wrote:
>
> On 17/11/15 17:24, Tvrtko Ursulin wrote:
>>
>> On 17/11/15 17:08, Daniel Vetter wrote:
>>> On Tue, Nov 17, 2015 at 04:54:50PM +0000, Tvrtko Ursulin wrote:
>>>>
>>>> On 17/11/15 16:39, Daniel Vetter wrote:
>>>>> On Tue, Nov 17, 2015 at 04:27:12PM +0000, Tvrtko Ursulin wrote:
>>>>>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>>>
>>>>>> In the following commit:
>>>>>>
>>>>>>      commit e9f24d5fb7cf3628b195b18ff3ac4e37937ceeae
>>>>>>      Author: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>>>      Date:   Mon Oct 5 13:26:36 2015 +0100
>>>>>>
>>>>>>          drm/i915: Clean up associated VMAs on context destruction
>>>>>>
>>>>>> I added a WARN_ON assertion that VM's active list must be empty
>>>>>> at the time of owning context is getting freed, but that turned
>>>>>> out to be a wrong assumption.
>>>>>>
>>>>>> Due ordering of operations in i915_gem_object_retire__read, where
>>>>>> contexts are unreferenced before VMAs are moved to the inactive
>>>>>> list, the described situation can in fact happen.
>>>>>>
>>>>>> It feels wrong to do things in such order so this fix makes sure
>>>>>> a reference to context is held until the move to inactive list
>>>>>> is completed.
>>>>>>
>>>>>> v2: Rather than hold a temporary context reference move the
>>>>>>      request unreference to be the last operation. (Daniel Vetter)
>>>>>>
>>>>>> v3: Fix use after free. (Chris Wilson)
>>>>>>
>>>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=92638
>>>>>> Cc: Michel Thierry <michel.thierry@intel.com>
>>>>>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>>>>>> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
>>>>>> ---
>>>>>>   drivers/gpu/drm/i915/i915_gem.c | 33
>>>>>> ++++++++++++++++++---------------
>>>>>>   1 file changed, 18 insertions(+), 15 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c
>>>>>> b/drivers/gpu/drm/i915/i915_gem.c
>>>>>> index 98c83286ab68..094ac17a712d 100644
>>>>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>>>>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>>>>>> @@ -2404,29 +2404,32 @@ i915_gem_object_retire__read(struct
>>>>>> drm_i915_gem_object *obj, int ring)
>>>>>>       RQ_BUG_ON(!(obj->active & (1 << ring)));
>>>>>>
>>>>>>       list_del_init(&obj->ring_list[ring]);
>>>>>> -    i915_gem_request_assign(&obj->last_read_req[ring], NULL);
>>>>>>
>>>>>>       if (obj->last_write_req && obj->last_write_req->ring->id ==
>>>>>> ring)
>>>>>>           i915_gem_object_retire__write(obj);
>>>>>>
>>>>>>       obj->active &= ~(1 << ring);
>>>>>> -    if (obj->active)
>>>>>> -        return;
>>>>>
>>>>>     if (obj->active) {
>>>>>         i915_gem_request_assign(&obj->last_read_req[ring], NULL);
>>>>>         return;
>>>>>     }
>>>>>
>>>>> Would result in less churn in the code and drop the unecessary indent
>>>>> level. Also comment is missing as to why we need to do things in a
>>>>> specific order.
>>>>
>>>> Actually I think I changed my mind and that v1 is the way to go.
>>>>
>>>> Just re-ordering the code here still makes it possible for the context
>>>> destructor to run with VMAs on the active list I think.
>>>>
>>>> If we hold the context then it is 100% clear it is not possible.
>>>
>>> request_assign _is_ the function which adjust the refcounts for us,
>>> which
>>> means if we drop that reference too early then grabbing a temp reference
>>> is just papering over the real bug.
>>>
>>> Written out your patch looks something like
>>>
>>>     a_reference(a);
>>>     a_unreference(a);
>>>
>>>     /* more cleanup code that should get run before a_unreference but
>>> isn't */
>>>
>>>     a_unrefernce(a); /* for real this time */
>>>
>>> Unfortunately foo_assign is a new pattern and not well-established, so
>>> that connection isn't clear. Maybe we should rename it to
>>> foo_reference_assign to make it more obvious. Or just drop the pretense
>>> and open-code it since we unconditionally assign NULL as the new pointer
>>> value, and we know the current value of the pointer is non-NULL. So
>>> there's really no benefit to the helper here, it only obfuscates. And
>>> since that obfuscation tripped you up it's time to remove it ;-)
>>
>> Then foo_reference_unreference_assign. :)
>>
>> But seriously, I think it is more complicated that..
>>
>> The thing it trips over is that moving VMAs to inactive does not
>> correspond in time to request retirement. But in fact VMAs are moved to
>> inactive only when all requests associated with an object are done.
>>
>> This is the unintuitive thing I was working around. To make sure when
>> context destructor runs there are not active VMAs for that VM.
>>
>> I don't know how to guarantee that with what you propose. Perhaps I am
>> missing something?
>
> Maybe completely different approach would be to find the VMA belonging
> to req->ctx->vm in i915_gem_request_free and move it to the inactive
> list before context unreference there?
>
> Should would I think, if request is going away means VMA definitely
> needs to go to inactive..

Or not.. no link to the correct object from there. Ok calling it a day..

Tvrtko
Daniel Vetter Nov. 17, 2015, 5:56 p.m. UTC | #7
On Tue, Nov 17, 2015 at 05:24:01PM +0000, Tvrtko Ursulin wrote:
> 
> On 17/11/15 17:08, Daniel Vetter wrote:
> >On Tue, Nov 17, 2015 at 04:54:50PM +0000, Tvrtko Ursulin wrote:
> >>
> >>On 17/11/15 16:39, Daniel Vetter wrote:
> >>>On Tue, Nov 17, 2015 at 04:27:12PM +0000, Tvrtko Ursulin wrote:
> >>>>From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>>>
> >>>>In the following commit:
> >>>>
> >>>>     commit e9f24d5fb7cf3628b195b18ff3ac4e37937ceeae
> >>>>     Author: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>>>     Date:   Mon Oct 5 13:26:36 2015 +0100
> >>>>
> >>>>         drm/i915: Clean up associated VMAs on context destruction
> >>>>
> >>>>I added a WARN_ON assertion that VM's active list must be empty
> >>>>at the time of owning context is getting freed, but that turned
> >>>>out to be a wrong assumption.
> >>>>
> >>>>Due ordering of operations in i915_gem_object_retire__read, where
> >>>>contexts are unreferenced before VMAs are moved to the inactive
> >>>>list, the described situation can in fact happen.
> >>>>
> >>>>It feels wrong to do things in such order so this fix makes sure
> >>>>a reference to context is held until the move to inactive list
> >>>>is completed.
> >>>>
> >>>>v2: Rather than hold a temporary context reference move the
> >>>>     request unreference to be the last operation. (Daniel Vetter)
> >>>>
> >>>>v3: Fix use after free. (Chris Wilson)
> >>>>
> >>>>Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>>>Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=92638
> >>>>Cc: Michel Thierry <michel.thierry@intel.com>
> >>>>Cc: Chris Wilson <chris@chris-wilson.co.uk>
> >>>>Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> >>>>---
> >>>>  drivers/gpu/drm/i915/i915_gem.c | 33 ++++++++++++++++++---------------
> >>>>  1 file changed, 18 insertions(+), 15 deletions(-)
> >>>>
> >>>>diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> >>>>index 98c83286ab68..094ac17a712d 100644
> >>>>--- a/drivers/gpu/drm/i915/i915_gem.c
> >>>>+++ b/drivers/gpu/drm/i915/i915_gem.c
> >>>>@@ -2404,29 +2404,32 @@ i915_gem_object_retire__read(struct drm_i915_gem_object *obj, int ring)
> >>>>  	RQ_BUG_ON(!(obj->active & (1 << ring)));
> >>>>
> >>>>  	list_del_init(&obj->ring_list[ring]);
> >>>>-	i915_gem_request_assign(&obj->last_read_req[ring], NULL);
> >>>>
> >>>>  	if (obj->last_write_req && obj->last_write_req->ring->id == ring)
> >>>>  		i915_gem_object_retire__write(obj);
> >>>>
> >>>>  	obj->active &= ~(1 << ring);
> >>>>-	if (obj->active)
> >>>>-		return;
> >>>
> >>>	if (obj->active) {
> >>>		i915_gem_request_assign(&obj->last_read_req[ring], NULL);
> >>>		return;
> >>>	}
> >>>
> >>>Would result in less churn in the code and drop the unecessary indent
> >>>level. Also comment is missing as to why we need to do things in a
> >>>specific order.
> >>
> >>Actually I think I changed my mind and that v1 is the way to go.
> >>
> >>Just re-ordering the code here still makes it possible for the context
> >>destructor to run with VMAs on the active list I think.
> >>
> >>If we hold the context then it is 100% clear it is not possible.
> >
> >request_assign _is_ the function which adjust the refcounts for us, which
> >means if we drop that reference too early then grabbing a temp reference
> >is just papering over the real bug.
> >
> >Written out your patch looks something like
> >
> >	a_reference(a);
> >	a_unreference(a);
> >
> >	/* more cleanup code that should get run before a_unreference but isn't */
> >
> >	a_unrefernce(a); /* for real this time */
> >
> >Unfortunately foo_assign is a new pattern and not well-established, so
> >that connection isn't clear. Maybe we should rename it to
> >foo_reference_assign to make it more obvious. Or just drop the pretense
> >and open-code it since we unconditionally assign NULL as the new pointer
> >value, and we know the current value of the pointer is non-NULL. So
> >there's really no benefit to the helper here, it only obfuscates. And
> >since that obfuscation tripped you up it's time to remove it ;-)
> 
> Then foo_reference_unreference_assign. :)
> 
> But seriously, I think it is more complicated that..
> 
> The thing it trips over is that moving VMAs to inactive does not correspond
> in time to request retirement. But in fact VMAs are moved to inactive only
> when all requests associated with an object are done.
> 
> This is the unintuitive thing I was working around. To make sure when
> context destructor runs there are not active VMAs for that VM.
> 
> I don't know how to guarantee that with what you propose. Perhaps I am
> missing something?

Ok, my example was slightly off, since we have 2 objects:

	b_reference(a->b);
	a_unreference(a); /* might unref a->b if it's the last reference */

	/* more cleanup code that should get run before a_unreference but isn't */

	b_unrefernce(a->b); /* for real this time */

Holding the ref to a makes sure that b doesn't disappear. We rely on that
in a fundamental way (a request really needs the ctx to stick around), and
the bug really is that we drop the ref to a too early. That it's the
releasing of a->b which is eventually blowing things up doesn't really
matter.

Btw would it be possible to have an igt for this? I should be possible to
hit this with some varian of gem_unref_active_buffers.
-Daniel

Patch
diff mbox

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 98c83286ab68..094ac17a712d 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2404,29 +2404,32 @@  i915_gem_object_retire__read(struct drm_i915_gem_object *obj, int ring)
 	RQ_BUG_ON(!(obj->active & (1 << ring)));
 
 	list_del_init(&obj->ring_list[ring]);
-	i915_gem_request_assign(&obj->last_read_req[ring], NULL);
 
 	if (obj->last_write_req && obj->last_write_req->ring->id == ring)
 		i915_gem_object_retire__write(obj);
 
 	obj->active &= ~(1 << ring);
-	if (obj->active)
-		return;
 
-	/* Bump our place on the bound list to keep it roughly in LRU order
-	 * so that we don't steal from recently used but inactive objects
-	 * (unless we are forced to ofc!)
-	 */
-	list_move_tail(&obj->global_list,
-		       &to_i915(obj->base.dev)->mm.bound_list);
+	if (!obj->active) {
+		/* Bump our place on the bound list to keep it roughly in LRU order
+		* so that we don't steal from recently used but inactive objects
+		* (unless we are forced to ofc!)
+		*/
+		list_move_tail(&obj->global_list,
+			&to_i915(obj->base.dev)->mm.bound_list);
 
-	list_for_each_entry(vma, &obj->vma_list, vma_link) {
-		if (!list_empty(&vma->mm_list))
-			list_move_tail(&vma->mm_list, &vma->vm->inactive_list);
-	}
+		list_for_each_entry(vma, &obj->vma_list, vma_link) {
+			if (!list_empty(&vma->mm_list))
+				list_move_tail(&vma->mm_list,
+					       &vma->vm->inactive_list);
+		}
 
-	i915_gem_request_assign(&obj->last_fenced_req, NULL);
-	drm_gem_object_unreference(&obj->base);
+		i915_gem_request_assign(&obj->last_fenced_req, NULL);
+		i915_gem_request_assign(&obj->last_read_req[ring], NULL);
+		drm_gem_object_unreference(&obj->base);
+	} else {
+		i915_gem_request_assign(&obj->last_read_req[ring], NULL);
+	}
 }
 
 static int