[v2,01/13] mailbox: pcc: Fix the possible race in updation of chan_in_use flag

Message ID 20250305-pcc_fixes_updates-v2-1-1b1822bc8746@arm.com (mailing list archive)
State Handled Elsewhere, archived
Series mailbox: pcc: Fixes and cleanup/refactoring

Commit Message

Sudeep Holla March 5, 2025, 4:38 p.m. UTC
From: Huisong Li <lihuisong@huawei.com>

The function mbox_chan_received_data() calls the Rx callback of the
mailbox client driver. The callback might call pcc_send_data(), which
sets the chan_in_use flag. This flag indicates whether the PCC channel
is in use.

However, there is a potential race condition where chan_in_use is
updated incorrectly due to concurrency between the interrupt handler
(pcc_mbox_irq()) and the command sender (pcc_send_data()).

The 'chan_in_use' flag of a channel is set to true after sending a
command. The flag set for the new command may then be erroneously
cleared by the interrupt handler after mbox_chan_received_data() returns.

As a result, the level-triggered interrupt can't be cleared in
pcc_mbox_irq(), and it will be disabled once the number of unhandled
interrupts exceeds the specified value. The error log is as follows:

  |  kunpeng_hccs HISI04B2:00: PCC command executed timeout!
  |  kunpeng_hccs HISI04B2:00: get port link status info failed, ret = -110
  |  irq 13: nobody cared (try booting with the "irqpoll" option)
  |  Call trace:
  |   dump_backtrace+0x0/0x210
  |   show_stack+0x1c/0x2c
  |   dump_stack+0xec/0x130
  |   __report_bad_irq+0x50/0x190
  |   note_interrupt+0x1e4/0x260
  |   handle_irq_event+0x144/0x17c
  |   handle_fasteoi_irq+0xd0/0x240
  |   __handle_domain_irq+0x80/0xf0
  |   gic_handle_irq+0x74/0x2d0
  |   el1_irq+0xbc/0x140
  |   mnt_clone_write+0x0/0x70
  |   file_update_time+0xcc/0x160
  |   fault_dirty_shared_page+0xe8/0x150
  |   do_shared_fault+0x80/0x1d0
  |   do_fault+0x118/0x1a4
  |   handle_pte_fault+0x154/0x230
  |   __handle_mm_fault+0x1ac/0x390
  |   handle_mm_fault+0xf0/0x250
  |   do_page_fault+0x184/0x454
  |   do_translation_fault+0xac/0xd4
  |   do_mem_abort+0x44/0xb4
  |   el0_da+0x40/0x74
  |   el0_sync_handler+0x60/0xb4
  |   el0_sync+0x168/0x180
  |  handlers:
  |   pcc_mbox_irq
  |  Disabling IRQ #13

To solve this issue, pcc_mbox_irq() must clear the 'chan_in_use' flag
before the call to mbox_chan_received_data().
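
The ordering problem can be illustrated with a small user-space model.
This is only a sketch, not driver code: struct chan_model and the model_*
helpers are hypothetical names, and the model covers just the case where
the Rx callback itself sends the next command. It also assumes that the
handler ignores an interrupt on a channel that is not marked in use,
which is not part of the hunk shown in this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-channel state; not the real struct. */
struct chan_model {
	bool chan_in_use;	/* plays the role of pchan->chan_in_use */
	bool clear_before_cb;	/* true: ordering introduced by this patch */
	int resends_left;	/* commands the Rx callback will still send */
};

/* Models pcc_send_data(): sending a command marks the channel busy. */
static void model_send_data(struct chan_model *c)
{
	c->chan_in_use = true;
}

/* Models the client Rx callback reached via mbox_chan_received_data(). */
static void model_received_data(struct chan_model *c)
{
	if (c->resends_left > 0) {
		c->resends_left--;
		model_send_data(c);
	}
}

/*
 * Models pcc_mbox_irq(). Returns true when the interrupt is handled.
 * Assumption: an interrupt on a channel not marked in use is ignored.
 */
static bool model_irq(struct chan_model *c)
{
	if (!c->chan_in_use)
		return false;

	if (c->clear_before_cb) {
		c->chan_in_use = false;
		model_received_data(c);
	} else {
		model_received_data(c);
		c->chan_in_use = false;	/* wipes the flag set by the callback */
	}
	return true;
}

/*
 * Send one command, let the callback send a second one from the first
 * completion interrupt, then deliver the second completion interrupt.
 * Returns whether that second interrupt was handled.
 */
static bool second_irq_handled(bool clear_before_cb)
{
	struct chan_model c = {
		.chan_in_use = false,
		.clear_before_cb = clear_before_cb,
		.resends_left = 1,
	};
	bool first;

	model_send_data(&c);
	first = model_irq(&c);
	assert(first);		/* the first completion is always handled */
	(void)first;

	return model_irq(&c);
}
```

In this model second_irq_handled(false) returns false: the flag set by
the callback is wiped, so the completion of the second command is not
handled. second_irq_handled(true) returns true, because the flag is
cleared before the callback runs and the callback's update survives.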

Signed-off-by: Huisong Li <lihuisong@huawei.com>
(sudeep.holla: Minor updates to the subject and commit message)
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
---
 drivers/mailbox/pcc.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Comments

lihuisong (C) March 11, 2025, 11:40 a.m. UTC | #1
On 2025/3/6 0:38, Sudeep Holla wrote:
> diff --git a/drivers/mailbox/pcc.c b/drivers/mailbox/pcc.c
> index 82102a4c5d68839170238540a6fed61afa5185a0..f2e4087281c70eeb5b9b33371596613a371dff4f 100644
> --- a/drivers/mailbox/pcc.c
> +++ b/drivers/mailbox/pcc.c
> @@ -333,10 +333,15 @@ static irqreturn_t pcc_mbox_irq(int irq, void *p)
>   	if (pcc_chan_reg_read_modify_write(&pchan->plat_irq_ack))
>   		return IRQ_NONE;
>   
> +	/*
> +	 * Clear this flag immediately after updating interrupt ack register
> +	 * to avoid possible race in updatation of the flag from
> +	 * pcc_send_data() that could execute from mbox_chan_received_data()
This comment may no longer be accurate because patch 2/13 moves the
clearing of the interrupt ack register.
I suggest fixing it either in this patch or in patch 2/13.
> +	 */
> +	pchan->chan_in_use = false;
>   	mbox_chan_received_data(chan, NULL);
>   
>   	check_and_ack(pchan, chan);
> -	pchan->chan_in_use = false;
>   
>   	return IRQ_HANDLED;
>   }
>
Sudeep Holla March 11, 2025, 12:02 p.m. UTC | #2
On Tue, Mar 11, 2025 at 07:40:53PM +0800, lihuisong (C) wrote:
> 
> On 2025/3/6 0:38, Sudeep Holla wrote:
> > diff --git a/drivers/mailbox/pcc.c b/drivers/mailbox/pcc.c
> > index 82102a4c5d68839170238540a6fed61afa5185a0..f2e4087281c70eeb5b9b33371596613a371dff4f 100644
> > --- a/drivers/mailbox/pcc.c
> > +++ b/drivers/mailbox/pcc.c
> > @@ -333,10 +333,15 @@ static irqreturn_t pcc_mbox_irq(int irq, void *p)
> >   	if (pcc_chan_reg_read_modify_write(&pchan->plat_irq_ack))
> >   		return IRQ_NONE;
> > +	/*
> > +	 * Clear this flag immediately after updating interrupt ack register
> > +	 * to avoid possible race in updatation of the flag from
> > +	 * pcc_send_data() that could execute from mbox_chan_received_data()
> This comment may no longer be accurate because patch 2/13 moves the
> clearing of the interrupt ack register.
> I suggest fixing it either in this patch or in patch 2/13.

Right, I either meant to update it, or did update it and forgot to commit.
I wanted to change it to something like:
"
Clear this flag after updating interrupt ack register and just before
mbox_chan_received_data() which might call pcc_send_data() where the
flag is set again to start new transfer
"

I hope that fits the current situation better.
lihuisong (C) March 11, 2025, 12:15 p.m. UTC | #3
On 2025/3/11 20:02, Sudeep Holla wrote:
> On Tue, Mar 11, 2025 at 07:40:53PM +0800, lihuisong (C) wrote:
>> On 2025/3/6 0:38, Sudeep Holla wrote:
>>> diff --git a/drivers/mailbox/pcc.c b/drivers/mailbox/pcc.c
>>> index 82102a4c5d68839170238540a6fed61afa5185a0..f2e4087281c70eeb5b9b33371596613a371dff4f 100644
>>> --- a/drivers/mailbox/pcc.c
>>> +++ b/drivers/mailbox/pcc.c
>>> @@ -333,10 +333,15 @@ static irqreturn_t pcc_mbox_irq(int irq, void *p)
>>>    	if (pcc_chan_reg_read_modify_write(&pchan->plat_irq_ack))
>>>    		return IRQ_NONE;
>>> +	/*
>>> +	 * Clear this flag immediately after updating interrupt ack register
>>> +	 * to avoid possible race in updatation of the flag from
>>> +	 * pcc_send_data() that could execute from mbox_chan_received_data()
>> This comment may no longer be accurate because patch 2/13 moves the
>> clearing of the interrupt ack register.
>> I suggest fixing it either in this patch or in patch 2/13.
> Right, I either meant to update it, or did update it and forgot to commit.
> I wanted to change it to something like:
> "
> Clear this flag after updating interrupt ack register and just before
> mbox_chan_received_data() which might call pcc_send_data() where the
> flag is set again to start new transfer
> "
Ack
>
> I hope that fits the current situation better.
>
Robbie King March 13, 2025, 3:07 p.m. UTC | #4
On 3/5/2025 11:38 AM, Sudeep Holla wrote:
> Signed-off-by: Huisong Li <lihuisong@huawei.com>
> (sudeep.holla: Minor updates to the subject and commit message)
> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>

Tested-by: Robbie King <robbiek@xsightlabs.com>

Patch

diff --git a/drivers/mailbox/pcc.c b/drivers/mailbox/pcc.c
index 82102a4c5d68839170238540a6fed61afa5185a0..f2e4087281c70eeb5b9b33371596613a371dff4f 100644
--- a/drivers/mailbox/pcc.c
+++ b/drivers/mailbox/pcc.c
@@ -333,10 +333,15 @@  static irqreturn_t pcc_mbox_irq(int irq, void *p)
 	if (pcc_chan_reg_read_modify_write(&pchan->plat_irq_ack))
 		return IRQ_NONE;
 
+	/*
+	 * Clear this flag immediately after updating interrupt ack register
+	 * to avoid possible race in updatation of the flag from
+	 * pcc_send_data() that could execute from mbox_chan_received_data()
+	 */
+	pchan->chan_in_use = false;
 	mbox_chan_received_data(chan, NULL);
 
 	check_and_ack(pchan, chan);
-	pchan->chan_in_use = false;
 
 	return IRQ_HANDLED;
 }