diff mbox series

[1/5] drm/amd/display: Remove migrate_en/dis from dc_fpu_begin().

Message ID 20230921141516.520471-2-bigeasy@linutronix.de (mailing list archive)
State New, archived
Headers show
Series drm/amd/display: Remove migrate-disable and move memory allocation. | expand

Commit Message

Sebastian Andrzej Siewior Sept. 21, 2023, 2:15 p.m. UTC
This is a revert of the commit mentioned below while it is not wrong, as
in the kernel will explode, having migrate_disable() here it is
complete waste of resources.

Additionally commit message is plain wrong the review tag does not make
it any better. The migrate_disable() interface has a fat comment
describing it and it includes the word "undesired" in the headline which
should tickle people to read it before using it.
Initially I assumed it is worded too harsh but now I beg to differ.

The reviewer of the original commit, even not understanding what
migrate_disable() does should ask the following:

- migrate_disable() is added only to the CONFIG_X86 block and it claims
  to protect fpu_recursion_depth. Why are the other the architectures
  excluded?

- migrate_disable() is added after fpu_recursion_depth was modified.
  Shouldn't it be added before the modification or referencing takes
  place?

Moving on.
Disabling preemption DOES prevent CPU migration. A task, that can not be
pushed away from the CPU by the scheduler (due to disabled preemption)
can not be pushed or migrated to another CPU.

Disabling migration DOES NOT ensure consistency of per-CPU variables. It
only ensures that the task acts always on the same per-CPU variable. The
task remains preemptible meaning multiple tasks can access the same
per-CPU variable. This in turn leads to inconsistency for the statement

                  *pcpu -= 1;

with two tasks on one CPU and a preemption point during the RMW
operation:

     Task A                           Task B
     read pcpu to reg  # 0
     inc reg           # 0 -> 1
                                      read pcpu to reg  # 0
                                      inc reg           # 0 -> 1
                                      write reg to pcpu # 1
     write reg to pcpu # 1

At the end pcpu reads 1 but should read 2 instead. Boom.

get_cpu_ptr() already contains a preempt_disable() statement. That means
that the per-CPU variable can only be referenced by a single task which
is currently running. The only inconsistency that can occur if the
variable is additionally accessed from an interrupt.

Remove migrate_disable/enable() from dc_fpu_begin/end().

Cc: Tianci Yin <tianci.yin@amd.com>
Cc: Aurabindo Pillai <aurabindo.pillai@amd.com>
Fixes: 0c316556d1249 ("drm/amd/display: Disable migration to ensure consistency of per-CPU variable")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 drivers/gpu/drm/amd/display/amdgpu_dm/dc_fpu.c | 2 --
 1 file changed, 2 deletions(-)

Comments

Harry Wentland Oct. 3, 2023, 7:53 p.m. UTC | #1
On 2023-09-21 10:15, Sebastian Andrzej Siewior wrote:
> This is a revert of the commit mentioned below while it is not wrong, as
> in the kernel will explode, having migrate_disable() here it is
> complete waste of resources.
> 
> Additionally commit message is plain wrong the review tag does not make

Not sure I follow what's unhelpful about the review tag with
0c316556d1249 ("drm/amd/display: Disable migration to ensure consistency of per-CPU variable")

I do wish the original patch showed the splat it's attempting
to fix. It apparently made a difference for something, whether
inadvertently or not. I wish I knew what that "something" was.

Harry

> it any better. The migrate_disable() interface has a fat comment
> describing it and it includes the word "undesired" in the headline which
> should tickle people to read it before using it.
> Initially I assumed it is worded too harsh but now I beg to differ.
> 
> The reviewer of the original commit, even not understanding what
> migrate_disable() does should ask the following:
> 
> - migrate_disable() is added only to the CONFIG_X86 block and it claims
>   to protect fpu_recursion_depth. Why are the other the architectures
>   excluded?
> 
> - migrate_disable() is added after fpu_recursion_depth was modified.
>   Shouldn't it be added before the modification or referencing takes
>   place?
> 
> Moving on.
> Disabling preemption DOES prevent CPU migration. A task, that can not be
> pushed away from the CPU by the scheduler (due to disabled preemption)
> can not be pushed or migrated to another CPU.
> 
> Disabling migration DOES NOT ensure consistency of per-CPU variables. It
> only ensures that the task acts always on the same per-CPU variable. The
> task remains preemptible meaning multiple tasks can access the same
> per-CPU variable. This in turn leads to inconsistency for the statement
> 
>                   *pcpu -= 1;
> 
> with two tasks on one CPU and a preemption point during the RMW
> operation:
> 
>      Task A                           Task B
>      read pcpu to reg  # 0
>      inc reg           # 0 -> 1
>                                       read pcpu to reg  # 0
>                                       inc reg           # 0 -> 1
>                                       write reg to pcpu # 1
>      write reg to pcpu # 1
> 
> At the end pcpu reads 1 but should read 2 instead. Boom.
> 
> get_cpu_ptr() already contains a preempt_disable() statement. That means
> that the per-CPU variable can only be referenced by a single task which
> is currently running. The only inconsistency that can occur if the
> variable is additionally accessed from an interrupt.
> 
> Remove migrate_disable/enable() from dc_fpu_begin/end().
> 
> Cc: Tianci Yin <tianci.yin@amd.com>
> Cc: Aurabindo Pillai <aurabindo.pillai@amd.com>
> Fixes: 0c316556d1249 ("drm/amd/display: Disable migration to ensure consistency of per-CPU variable")
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  drivers/gpu/drm/amd/display/amdgpu_dm/dc_fpu.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/dc_fpu.c b/drivers/gpu/drm/amd/display/amdgpu_dm/dc_fpu.c
> index 172aa10a8800f..86f4c0e046548 100644
> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/dc_fpu.c
> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/dc_fpu.c
> @@ -91,7 +91,6 @@ void dc_fpu_begin(const char *function_name, const int line)
>  
>  	if (*pcpu == 1) {
>  #if defined(CONFIG_X86) || defined(CONFIG_LOONGARCH)
> -		migrate_disable();
>  		kernel_fpu_begin();
>  #elif defined(CONFIG_PPC64)
>  		if (cpu_has_feature(CPU_FTR_VSX_COMP)) {
> @@ -132,7 +131,6 @@ void dc_fpu_end(const char *function_name, const int line)
>  	if (*pcpu <= 0) {
>  #if defined(CONFIG_X86) || defined(CONFIG_LOONGARCH)
>  		kernel_fpu_end();
> -		migrate_enable();
>  #elif defined(CONFIG_PPC64)
>  		if (cpu_has_feature(CPU_FTR_VSX_COMP)) {
>  			disable_kernel_vsx();
Sebastian Andrzej Siewior Oct. 4, 2023, 8:53 a.m. UTC | #2
On 2023-10-03 15:53:41 [-0400], Harry Wentland wrote:
> On 2023-09-21 10:15, Sebastian Andrzej Siewior wrote:
> > This is a revert of the commit mentioned below while it is not wrong, as
> > in the kernel will explode, having migrate_disable() here it is
> > complete waste of resources.
> > 
> > Additionally commit message is plain wrong the review tag does not make
> 
> Not sure I follow what's unhelpful about the review tag with
> 0c316556d1249 ("drm/amd/display: Disable migration to ensure consistency of per-CPU variable")

I explained it below with two points what the reviewer should have
noticed why reading the commit message even if he does not know what
migrate_disable() itself does.

> I do wish the original patch showed the splat it's attempting
> to fix. It apparently made a difference for something, whether
> inadvertently or not. I wish I knew what that "something" was.

As far as I can tell the patch does make a difference.

> Harry

Sebastian
Hamza Mahfooz Oct. 4, 2023, 12:10 p.m. UTC | #3
On 10/3/23 15:53, Harry Wentland wrote:
> On 2023-09-21 10:15, Sebastian Andrzej Siewior wrote:
>> This is a revert of the commit mentioned below while it is not wrong, as
>> in the kernel will explode, having migrate_disable() here it is
>> complete waste of resources.
>>
>> Additionally commit message is plain wrong the review tag does not make
> 
> Not sure I follow what's unhelpful about the review tag with
> 0c316556d1249 ("drm/amd/display: Disable migration to ensure consistency of per-CPU variable")
> 
> I do wish the original patch showed the splat it's attempting
> to fix. It apparently made a difference for something, whether
> inadvertently or not. I wish I knew what that "something" was.

I did some digging, and it seems like the intention of that patch was to
fix the following splat:

WARNING: CPU: 5 PID: 1062 at 
drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/dc_fpu.c:71 
dc_assert_fp_enabled+0x1a/0x30 [amdgpu]
[...]
CPU: 5 PID: 1062 Comm: Xorg Tainted: G           OE 
5.15.0-56-generic #62-Ubuntu
Hardware name: ASUS System Product Name/ROG STRIX Z590-F GAMING WIFI, 
BIOS 1202 10/27/2021
RIP: 0010:dc_assert_fp_enabled+0x1a/0x30 [amdgpu]
Code: ff 45 31 f6 0f 0b e9 ca fe ff ff e8 90 1c 1f f7 48 c7 c0 00 30 03 
00 65 48 03 05 b1 aa 86 3f 8b 00 85 c0 7e 05 c3 cc cc cc cc <0f> 0b c3 
cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00
RSP: 0000:ffffb89b82a8f118 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8c271cd00000 RCX: 0000000000000000
RDX: ffff8c2708025000 RSI: ffff8c270e8c0000 RDI: ffff8c271cd00000
RBP: ffffb89b82a8f1d0 R08: 0000000000000000 R09: 7fffff6affffffff
R10: ffffb89b82a8f240 R11: 0000000000000000 R12: 0000000000000002
R13: ffff8c271cd00000 R14: ffff8c270e8c0000 R15: ffff8c2708025000
FS:  00007f0570019a80(0000) GS:ffff8c2e3fb40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005594858a0058 CR3: 000000010e204001 CR4: 0000000000770ee0
PKRU: 55555554
Call Trace:
  <TASK>
  ? dcn20_populate_dml_pipes_from_context+0x47/0x1730 [amdgpu]
  ? __kmalloc_node+0x2cb/0x3a0
  dcn32_populate_dml_pipes_from_context+0x2b/0x450 [amdgpu]
  dcn32_internal_validate_bw+0x15f9/0x2670 [amdgpu]
  dcn32_find_dummy_latency_index_for_fw_based_mclk_switch+0xd0/0x310 
[amdgpu]
  dcn32_calculate_wm_and_dlg_fpu+0xe6/0x1e50 [amdgpu]
  dcn32_calculate_wm_and_dlg+0x46/0x70 [amdgpu]
  dcn32_validate_bandwidth+0x1b7/0x3e0 [amdgpu]
  dc_validate_global_state+0x32c/0x560 [amdgpu]
  dc_validate_with_context+0x6e6/0xd80 [amdgpu]
  dc_commit_streams+0x21b/0x500 [amdgpu]
  dc_commit_state+0xf3/0x150 [amdgpu]
  amdgpu_dm_atomic_commit_tail+0x60d/0x3120 [amdgpu]
  ? dcn32_internal_validate_bw+0xcf8/0x2670 [amdgpu]
  ? fill_plane_buffer_attributes+0x1e5/0x560 [amdgpu]
  ? dcn32_validate_bandwidth+0x1e0/0x3e0 [amdgpu]
  ? kfree+0x1f7/0x250
  ? dcn32_validate_bandwidth+0x1e0/0x3e0 [amdgpu]
  ? dc_validate_global_state+0x32c/0x560 [amdgpu]
  ? __cond_resched+0x1a/0x50
  ? __wait_for_common+0x3e/0x150
  ? fill_plane_buffer_attributes+0x1e5/0x560 [amdgpu]
  ? usleep_range_state+0x90/0x90
  ? wait_for_completion_timeout+0x1d/0x30
  commit_tail+0xc2/0x170 [drm_kms_helper]
  ? drm_atomic_helper_swap_state+0x20f/0x370 [drm_kms_helper]
  drm_atomic_helper_commit+0x12b/0x150 [drm_kms_helper]
  amdgpu_dm_atomic_commit+0x11/0x20 [amdgpu]
  drm_atomic_commit+0x47/0x60 [drm]
  drm_mode_obj_set_property_ioctl+0x16b/0x420 [drm]
  ? mutex_lock+0x13/0x50
  ? drm_mode_createblob_ioctl+0xf6/0x130 [drm]
  ? drm_mode_obj_find_prop_id+0x90/0x90 [drm]
  drm_ioctl_kernel+0xb0/0x100 [drm]
  drm_ioctl+0x268/0x4b0 [drm]
  ? drm_mode_obj_find_prop_id+0x90/0x90 [drm]
  ? ktime_get_mono_fast_ns+0x4a/0xa0
  amdgpu_drm_ioctl+0x4e/0x90 [amdgpu]
  __x64_sys_ioctl+0x92/0xd0
  do_syscall_64+0x59/0xc0
  ? do_user_addr_fault+0x1e7/0x670
  ? do_syscall_64+0x69/0xc0
  ? exit_to_user_mode_prepare+0x37/0xb0
  ? irqentry_exit_to_user_mode+0x9/0x20
  ? irqentry_exit+0x1d/0x30
  ? exc_page_fault+0x89/0x170
  entry_SYSCALL_64_after_hwframe+0x61/0xcb
RIP: 0033:0x7f05704a2aff
Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 
44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <41> 89 c0 
3d 00 f0 ff ff 77 1f 48 8b 44 24 18 64 48 2b 04 25 28 00
RSP: 002b:00007ffc8c45a3f0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffc8c45a480 RCX: 00007f05704a2aff
RDX: 00007ffc8c45a480 RSI: 00000000c01864ba RDI: 000000000000000e
RBP: 00000000c01864ba R08: 0000000000000077 R09: 00000000cccccccc
R10: 00007f05705a22f0 R11: 0000000000000246 R12: 0000000000000004
R13: 000000000000000e R14: 000000000000000f R15: 00007ffc8c45a8a8
  </TASK>
---[ end trace 4deab30bb69df00f ]---

> 
> Harry
> 
>> it any better. The migrate_disable() interface has a fat comment
>> describing it and it includes the word "undesired" in the headline which
>> should tickle people to read it before using it.
>> Initially I assumed it is worded too harsh but now I beg to differ.
>>
>> The reviewer of the original commit, even not understanding what
>> migrate_disable() does should ask the following:
>>
>> - migrate_disable() is added only to the CONFIG_X86 block and it claims
>>    to protect fpu_recursion_depth. Why are the other the architectures
>>    excluded?
>>
>> - migrate_disable() is added after fpu_recursion_depth was modified.
>>    Shouldn't it be added before the modification or referencing takes
>>    place?
>>
>> Moving on.
>> Disabling preemption DOES prevent CPU migration. A task, that can not be
>> pushed away from the CPU by the scheduler (due to disabled preemption)
>> can not be pushed or migrated to another CPU.
>>
>> Disabling migration DOES NOT ensure consistency of per-CPU variables. It
>> only ensures that the task acts always on the same per-CPU variable. The
>> task remains preemptible meaning multiple tasks can access the same
>> per-CPU variable. This in turn leads to inconsistency for the statement
>>
>>                    *pcpu -= 1;
>>
>> with two tasks on one CPU and a preemption point during the RMW
>> operation:
>>
>>       Task A                           Task B
>>       read pcpu to reg  # 0
>>       inc reg           # 0 -> 1
>>                                        read pcpu to reg  # 0
>>                                        inc reg           # 0 -> 1
>>                                        write reg to pcpu # 1
>>       write reg to pcpu # 1
>>
>> At the end pcpu reads 1 but should read 2 instead. Boom.
>>
>> get_cpu_ptr() already contains a preempt_disable() statement. That means
>> that the per-CPU variable can only be referenced by a single task which
>> is currently running. The only inconsistency that can occur if the
>> variable is additionally accessed from an interrupt.
>>
>> Remove migrate_disable/enable() from dc_fpu_begin/end().
>>
>> Cc: Tianci Yin <tianci.yin@amd.com>
>> Cc: Aurabindo Pillai <aurabindo.pillai@amd.com>
>> Fixes: 0c316556d1249 ("drm/amd/display: Disable migration to ensure consistency of per-CPU variable")
>> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>> ---
>>   drivers/gpu/drm/amd/display/amdgpu_dm/dc_fpu.c | 2 --
>>   1 file changed, 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/dc_fpu.c b/drivers/gpu/drm/amd/display/amdgpu_dm/dc_fpu.c
>> index 172aa10a8800f..86f4c0e046548 100644
>> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/dc_fpu.c
>> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/dc_fpu.c
>> @@ -91,7 +91,6 @@ void dc_fpu_begin(const char *function_name, const int line)
>>   
>>   	if (*pcpu == 1) {
>>   #if defined(CONFIG_X86) || defined(CONFIG_LOONGARCH)
>> -		migrate_disable();
>>   		kernel_fpu_begin();
>>   #elif defined(CONFIG_PPC64)
>>   		if (cpu_has_feature(CPU_FTR_VSX_COMP)) {
>> @@ -132,7 +131,6 @@ void dc_fpu_end(const char *function_name, const int line)
>>   	if (*pcpu <= 0) {
>>   #if defined(CONFIG_X86) || defined(CONFIG_LOONGARCH)
>>   		kernel_fpu_end();
>> -		migrate_enable();
>>   #elif defined(CONFIG_PPC64)
>>   		if (cpu_has_feature(CPU_FTR_VSX_COMP)) {
>>   			disable_kernel_vsx();
>
Sebastian Andrzej Siewior Oct. 5, 2023, 12:10 p.m. UTC | #4
On 2023-10-04 08:10:35 [-0400], Hamza Mahfooz wrote:
> I did some digging, and it seems like the intention of that patch was to
> fix the following splat:
> 
> WARNING: CPU: 5 PID: 1062 at
> drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/dc_fpu.c:71
> dc_assert_fp_enabled+0x1a/0x30 [amdgpu]
> [...]
> CPU: 5 PID: 1062 Comm: Xorg Tainted: G           OE 5.15.0-56-generic

So it has hard to look this up with a upstream v5.15 kernel since the
dcn32_populate_dml_pipes_from_context() was introduced in v6.0-rc1.
Judging from v6.0-rc1 I don't see how that warning could occur other
than using dc_assert_fp_enabled() without invoking DC_FP_START first.

> Hamza

Sebastian
diff mbox series

Patch

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/dc_fpu.c b/drivers/gpu/drm/amd/display/amdgpu_dm/dc_fpu.c
index 172aa10a8800f..86f4c0e046548 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/dc_fpu.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/dc_fpu.c
@@ -91,7 +91,6 @@  void dc_fpu_begin(const char *function_name, const int line)
 
 	if (*pcpu == 1) {
 #if defined(CONFIG_X86) || defined(CONFIG_LOONGARCH)
-		migrate_disable();
 		kernel_fpu_begin();
 #elif defined(CONFIG_PPC64)
 		if (cpu_has_feature(CPU_FTR_VSX_COMP)) {
@@ -132,7 +131,6 @@  void dc_fpu_end(const char *function_name, const int line)
 	if (*pcpu <= 0) {
 #if defined(CONFIG_X86) || defined(CONFIG_LOONGARCH)
 		kernel_fpu_end();
-		migrate_enable();
 #elif defined(CONFIG_PPC64)
 		if (cpu_has_feature(CPU_FTR_VSX_COMP)) {
 			disable_kernel_vsx();