diff mbox series

drm: Get ref on CRTC commit object when waiting for flip_done

Message ID 1539304063-9518-1-git-send-email-sunpeng.li@amd.com (mailing list archive)
State New, archived
Headers show
Series drm: Get ref on CRTC commit object when waiting for flip_done | expand

Commit Message

Leo Li Oct. 12, 2018, 12:27 a.m. UTC
From: Leo Li <sunpeng.li@amd.com>

This fixes a general protection fault, caused by accessing the contents
of a flip_done completion object that has already been freed. It occurs
due to the preemption of a non-blocking commit worker thread W by
another commit thread X. X continues to clear its atomic state at the
end, destroying the CRTC commit object that W still needs. Switching
back to W and accessing the commit objects then leads to bad results.

Worker W becomes preemptable when waiting for flip_done to complete. At
this point, a frequently occurring commit thread X can take over. Here's
an example where W is a worker thread that flips on both CRTCs, and X
does a legacy cursor update on both CRTCs:

        ...
     1. W does flip work
     2. W runs commit_hw_done()
     3. W waits for flip_done on CRTC 1
     4. > flip_done for CRTC 1 completes
     5. W finishes waiting for CRTC 1
     6. W waits for flip_done on CRTC 2

     7. > Preempted by X
     8. > flip_done for CRTC 2 completes
     9. X atomic_check: hw_done and flip_done are complete on all CRTCs
    10. X updates cursor on both CRTCs
    11. X destroys atomic state
    12. X done

    13. > Switch back to W
    14. W waits for flip_done on CRTC 2
    15. W raises general protection fault

The error looks like so:

    general protection fault: 0000 [#1] PREEMPT SMP PTI
    **snip**
    Call Trace:
     lock_acquire+0xa2/0x1b0
     _raw_spin_lock_irq+0x39/0x70
     wait_for_completion_timeout+0x31/0x130
     drm_atomic_helper_wait_for_flip_done+0x64/0x90 [drm_kms_helper]
     amdgpu_dm_atomic_commit_tail+0xcae/0xdd0 [amdgpu]
     commit_tail+0x3d/0x70 [drm_kms_helper]
     process_one_work+0x212/0x650
     worker_thread+0x49/0x420
     kthread+0xfb/0x130
     ret_from_fork+0x3a/0x50
    Modules linked in: x86_pkg_temp_thermal amdgpu(O) chash(O)
    gpu_sched(O) drm_kms_helper(O) syscopyarea sysfillrect sysimgblt
    fb_sys_fops ttm(O) drm(O)

Note that i915 has this issue masked, since hw_done is signaled after
waiting for flip_done. Doing so will block the cursor update from
happening until hw_done is signaled, preventing the cursor commit from
destroying the state.

Signed-off-by: Leo Li <sunpeng.li@amd.com>
---
 drivers/gpu/drm/drm_atomic_helper.c | 30 ++++++++++++++++++++++++++----
 1 file changed, 26 insertions(+), 4 deletions(-)

Comments

Daniel Vetter Oct. 12, 2018, 7:34 a.m. UTC | #1
On Thu, Oct 11, 2018 at 08:27:43PM -0400, sunpeng.li@amd.com wrote:
> From: Leo Li <sunpeng.li@amd.com>
> 
> This fixes a general protection fault, caused by accessing the contents
> of a flip_done completion object that has already been freed. It occurs
> due to the preemption of a non-blocking commit worker thread W by
> another commit thread X. X continues to clear its atomic state at the
> end, destroying the CRTC commit object that W still needs. Switching
> back to W and accessing the commit objects then leads to bad results.
> 
> Worker W becomes preemptable when waiting for flip_done to complete. At
> this point, a frequently occurring commit thread X can take over. Here's
> an example where W is a worker thread that flips on both CRTCs, and X
> does a legacy cursor update on both CRTCs:
> 
>         ...
>      1. W does flip work
>      2. W runs commit_hw_done()
>      3. W waits for flip_done on CRTC 1
>      4. > flip_done for CRTC 1 completes
>      5. W finishes waiting for CRTC 1
>      6. W waits for flip_done on CRTC 2
> 
>      7. > Preempted by X
>      8. > flip_done for CRTC 2 completes
>      9. X atomic_check: hw_done and flip_done are complete on all CRTCs
>     10. X updates cursor on both CRTCs
>     11. X destroys atomic state
>     12. X done
> 
>     13. > Switch back to W
>     14. W waits for flip_done on CRTC 2
>     15. W raises general protection fault
> 
> The error looks like so:
> 
>     general protection fault: 0000 [#1] PREEMPT SMP PTI
>     **snip**
>     Call Trace:
>      lock_acquire+0xa2/0x1b0
>      _raw_spin_lock_irq+0x39/0x70
>      wait_for_completion_timeout+0x31/0x130
>      drm_atomic_helper_wait_for_flip_done+0x64/0x90 [drm_kms_helper]
>      amdgpu_dm_atomic_commit_tail+0xcae/0xdd0 [amdgpu]
>      commit_tail+0x3d/0x70 [drm_kms_helper]
>      process_one_work+0x212/0x650
>      worker_thread+0x49/0x420
>      kthread+0xfb/0x130
>      ret_from_fork+0x3a/0x50
>     Modules linked in: x86_pkg_temp_thermal amdgpu(O) chash(O)
>     gpu_sched(O) drm_kms_helper(O) syscopyarea sysfillrect sysimgblt
>     fb_sys_fops ttm(O) drm(O)
> 
> Note that i915 has this issue masked, since hw_done is signaled after
> waiting for flip_done. Doing so will block the cursor update from
> happening until hw_done is signaled, preventing the cursor commit from
> destroying the state.
> 
> Signed-off-by: Leo Li <sunpeng.li@amd.com>

Great analysis!

Bugfix itself needs a bit more work though, see below.

> ---
>  drivers/gpu/drm/drm_atomic_helper.c | 30 ++++++++++++++++++++++++++----
>  1 file changed, 26 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c
> index 80be74d..efdf043 100644
> --- a/drivers/gpu/drm/drm_atomic_helper.c
> +++ b/drivers/gpu/drm/drm_atomic_helper.c
> @@ -1410,20 +1410,42 @@ void drm_atomic_helper_wait_for_flip_done(struct drm_device *dev,
>  {
>  	struct drm_crtc_state *new_crtc_state;
>  	struct drm_crtc *crtc;
> +	struct drm_crtc_commit **commits;
>  	int i;
> +	int num_crtc = dev->mode_config.num_crtc;
>  
> +	commits = kcalloc(num_crtc, sizeof(*commits), GFP_KERNEL);
> +
> +	/*
> +	 * Because we open ourselves to preemption by calling
> +	 * wait_for_completion_timeout(), we need to get our own references to
> +	 * the commit objects. This is so that a preempting commit won't free
> +	 * them.
> +	 */

The scheduler can preempt you at any point before here too, if you enable
CONFIG_PREEMPT. It definitely can preempt/block in the kcalloc above, even
if you have CONFIG_PREEMPT disabled. The scheduling point you identified
below is simply the most likely.

The underlying bug is that after drm_atomic_helper_commit_hw_done() we
unblock the next commit worker (thread X in your example), and cannot
assume anymore that the new state will stay around (since new state is
released by the subsequent commit work run in thread X).

Therefore your drm_crtc_commit_get here already can access freed memory.
The correct fix here is to store the temporary reference before we call
drm_atomic_helper_commit_hw_done(), and then use that in this function
here. The problem is that this is somewhat tricky to pull off, since some
drivers call wait_for_flip_done() before hw_done() (like i915).

Here's what I'd do:
- Add a new drm_crtc_commit pointer to struct __drm_crtcs_state.

- Fill that out in drm_atomic_helper_setup_commit, and make sure you
  release the reference for it in drm_atomic_state_default_clear().

- Use that pointer instead in drm_atomic_helper_wait_for_flip_done() here.

Cheers, Daniel
>  	for_each_new_crtc_in_state(old_state, crtc, new_crtc_state, i) {
> -		struct drm_crtc_commit *commit = new_crtc_state->commit;
> +		commits[i] = new_crtc_state->commit;
> +		if (commits[i])
> +			drm_crtc_commit_get(commits[i]);
> +	}
> +
> +	for (i = 0; i < num_crtc; i++) {
>  		int ret;
>  
> -		if (!commit)
> +		if (!commits[i])
>  			continue;
>  
> -		ret = wait_for_completion_timeout(&commit->flip_done, 10 * HZ);
> +		ret = wait_for_completion_timeout(&commits[i]->flip_done,
> +						  10 * HZ);
>  		if (ret == 0)
>  			DRM_ERROR("[CRTC:%d:%s] flip_done timed out\n",
> -				  crtc->base.id, crtc->name);
> +				  commits[i]->crtc->base.id,
> +				  commits[i]->crtc->name);
>  	}
> +
> +	for (i = 0; i < num_crtc; i++)
> +		if (commits[i])
> +			drm_crtc_commit_put(commits[i]);
> +	kfree(commits);
>  }
>  EXPORT_SYMBOL(drm_atomic_helper_wait_for_flip_done);
>  
> -- 
> 2.7.4
> 
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
Leo Li Oct. 12, 2018, 9:28 p.m. UTC | #2
On 2018-10-12 03:34 AM, Daniel Vetter wrote:
> On Thu, Oct 11, 2018 at 08:27:43PM -0400, sunpeng.li@amd.com wrote:
>> From: Leo Li <sunpeng.li@amd.com>
>>
>> This fixes a general protection fault, caused by accessing the contents
>> of a flip_done completion object that has already been freed. It occurs
>> due to the preemption of a non-blocking commit worker thread W by
>> another commit thread X. X continues to clear its atomic state at the
>> end, destroying the CRTC commit object that W still needs. Switching
>> back to W and accessing the commit objects then leads to bad results.
>>
>> Worker W becomes preemptable when waiting for flip_done to complete. At
>> this point, a frequently occurring commit thread X can take over. Here's
>> an example where W is a worker thread that flips on both CRTCs, and X
>> does a legacy cursor update on both CRTCs:
>>
>>          ...
>>       1. W does flip work
>>       2. W runs commit_hw_done()
>>       3. W waits for flip_done on CRTC 1
>>       4. > flip_done for CRTC 1 completes
>>       5. W finishes waiting for CRTC 1
>>       6. W waits for flip_done on CRTC 2
>>
>>       7. > Preempted by X
>>       8. > flip_done for CRTC 2 completes
>>       9. X atomic_check: hw_done and flip_done are complete on all CRTCs
>>      10. X updates cursor on both CRTCs
>>      11. X destroys atomic state
>>      12. X done
>>
>>      13. > Switch back to W
>>      14. W waits for flip_done on CRTC 2
>>      15. W raises general protection fault
>>
>> The error looks like so:
>>
>>      general protection fault: 0000 [#1] PREEMPT SMP PTI
>>      **snip**
>>      Call Trace:
>>       lock_acquire+0xa2/0x1b0
>>       _raw_spin_lock_irq+0x39/0x70
>>       wait_for_completion_timeout+0x31/0x130
>>       drm_atomic_helper_wait_for_flip_done+0x64/0x90 [drm_kms_helper]
>>       amdgpu_dm_atomic_commit_tail+0xcae/0xdd0 [amdgpu]
>>       commit_tail+0x3d/0x70 [drm_kms_helper]
>>       process_one_work+0x212/0x650
>>       worker_thread+0x49/0x420
>>       kthread+0xfb/0x130
>>       ret_from_fork+0x3a/0x50
>>      Modules linked in: x86_pkg_temp_thermal amdgpu(O) chash(O)
>>      gpu_sched(O) drm_kms_helper(O) syscopyarea sysfillrect sysimgblt
>>      fb_sys_fops ttm(O) drm(O)
>>
>> Note that i915 has this issue masked, since hw_done is signaled after
>> waiting for flip_done. Doing so will block the cursor update from
>> happening until hw_done is signaled, preventing the cursor commit from
>> destroying the state.
>>
>> Signed-off-by: Leo Li <sunpeng.li@amd.com>
> 
> Great analysis!
> 
> Bugfix itself needs a bit more work though, see below.
> 
>> ---
>>   drivers/gpu/drm/drm_atomic_helper.c | 30 ++++++++++++++++++++++++++----
>>   1 file changed, 26 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c
>> index 80be74d..efdf043 100644
>> --- a/drivers/gpu/drm/drm_atomic_helper.c
>> +++ b/drivers/gpu/drm/drm_atomic_helper.c
>> @@ -1410,20 +1410,42 @@ void drm_atomic_helper_wait_for_flip_done(struct drm_device *dev,
>>   {
>>   	struct drm_crtc_state *new_crtc_state;
>>   	struct drm_crtc *crtc;
>> +	struct drm_crtc_commit **commits;
>>   	int i;
>> +	int num_crtc = dev->mode_config.num_crtc;
>>   
>> +	commits = kcalloc(num_crtc, sizeof(*commits), GFP_KERNEL);
>> +
>> +	/*
>> +	 * Because we open ourselves to preemption by calling
>> +	 * wait_for_completion_timeout(), we need to get our own references to
>> +	 * the commit objects. This is so that a preempting commit won't free
>> +	 * them.
>> +	 */
> 
> The scheduler can preempt you at any point before here too, if you enable
> CONFIG_PREEMPT. It definitely can preempt/block in the kcalloc above, even
> if you have CONFIG_PREEMPT disabled. The scheduling point you identified
> below is simply the most likely.
> 

Ah, didn't think of this.

> The underlying bug is that after drm_atomic_helper_commit_hw_done() we
> unblock the next commit worker (thread X in your example), and cannot
> assume anymore that the new state will stay around (since new state is
> released by the subsequent commit work run in thread X).
> 
> Therefore your drm_crtc_commit_get here already can access freed memory.
> The correct fix here is to store the temporary reference before we call
> drm_atomic_helper_commit_hw_done(), and then use that in this function
> here. The problem is that this is somewhat tricky to pull off, since some
> drivers call wait_for_flip_done() before hw_done() (like i915).
> 
> Here's what I'd do:
> - Add a new drm_crtc_commit pointer to struct __drm_crtcs_state.
> 
> - Fill that out in drm_atomic_helper_setup_commit, and make sure you
>    release the reference for it in drm_atomic_state_default_clear().
> 
> - Use that pointer instead in drm_atomic_helper_wait_for_flip_done() here.

Thanks for the pointers, patch incoming :)

Leo

> 
> Cheers, Daniel
>>   	for_each_new_crtc_in_state(old_state, crtc, new_crtc_state, i) {
>> -		struct drm_crtc_commit *commit = new_crtc_state->commit;
>> +		commits[i] = new_crtc_state->commit;
>> +		if (commits[i])
>> +			drm_crtc_commit_get(commits[i]);
>> +	}
>> +
>> +	for (i = 0; i < num_crtc; i++) {
>>   		int ret;
>>   
>> -		if (!commit)
>> +		if (!commits[i])
>>   			continue;
>>   
>> -		ret = wait_for_completion_timeout(&commit->flip_done, 10 * HZ);
>> +		ret = wait_for_completion_timeout(&commits[i]->flip_done,
>> +						  10 * HZ);
>>   		if (ret == 0)
>>   			DRM_ERROR("[CRTC:%d:%s] flip_done timed out\n",
>> -				  crtc->base.id, crtc->name);
>> +				  commits[i]->crtc->base.id,
>> +				  commits[i]->crtc->name);
>>   	}
>> +
>> +	for (i = 0; i < num_crtc; i++)
>> +		if (commits[i])
>> +			drm_crtc_commit_put(commits[i]);
>> +	kfree(commits);
>>   }
>>   EXPORT_SYMBOL(drm_atomic_helper_wait_for_flip_done);
>>   
>> -- 
>> 2.7.4
>>
>> _______________________________________________
>> dri-devel mailing list
>> dri-devel@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>
diff mbox series

Patch

diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c
index 80be74d..efdf043 100644
--- a/drivers/gpu/drm/drm_atomic_helper.c
+++ b/drivers/gpu/drm/drm_atomic_helper.c
@@ -1410,20 +1410,42 @@  void drm_atomic_helper_wait_for_flip_done(struct drm_device *dev,
 {
 	struct drm_crtc_state *new_crtc_state;
 	struct drm_crtc *crtc;
+	struct drm_crtc_commit **commits;
 	int i;
+	int num_crtc = dev->mode_config.num_crtc;
 
+	commits = kcalloc(num_crtc, sizeof(*commits), GFP_KERNEL);
+
+	/*
+	 * Because we open ourselves to preemption by calling
+	 * wait_for_completion_timeout(), we need to get our own references to
+	 * the commit objects. This is so that a preempting commit won't free
+	 * them.
+	 */
 	for_each_new_crtc_in_state(old_state, crtc, new_crtc_state, i) {
-		struct drm_crtc_commit *commit = new_crtc_state->commit;
+		commits[i] = new_crtc_state->commit;
+		if (commits[i])
+			drm_crtc_commit_get(commits[i]);
+	}
+
+	for (i = 0; i < num_crtc; i++) {
 		int ret;
 
-		if (!commit)
+		if (!commits[i])
 			continue;
 
-		ret = wait_for_completion_timeout(&commit->flip_done, 10 * HZ);
+		ret = wait_for_completion_timeout(&commits[i]->flip_done,
+						  10 * HZ);
 		if (ret == 0)
 			DRM_ERROR("[CRTC:%d:%s] flip_done timed out\n",
-				  crtc->base.id, crtc->name);
+				  commits[i]->crtc->base.id,
+				  commits[i]->crtc->name);
 	}
+
+	for (i = 0; i < num_crtc; i++)
+		if (commits[i])
+			drm_crtc_commit_put(commits[i]);
+	kfree(commits);
 }
 EXPORT_SYMBOL(drm_atomic_helper_wait_for_flip_done);