diff mbox series

[net,v2] net: wwan: t7xx: Fix FSM command timeout issue

Message ID 20241213064720.122615-1-jinjian.song@fibocom.com (mailing list archive)
State New
Headers show
Series [net,v2] net: wwan: t7xx: Fix FSM command timeout issue | expand

Commit Message

Jinjian Song Dec. 13, 2024, 6:47 a.m. UTC
When driver processes the internal state change command, it use an
asynchronous thread to process the command operation. If the main
thread detects that the task has timed out, the asynchronous thread
will panic when executing the completion notification because the
main thread completion object has been released.

BUG: unable to handle page fault for address: fffffffffffffff8
PGD 1f283a067 P4D 1f283a067 PUD 1f283c067 PMD 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
RIP: 0010:complete_all+0x3e/0xa0
[...]
Call Trace:
 <TASK>
 ? __die_body+0x68/0xb0
 ? page_fault_oops+0x379/0x3e0
 ? exc_page_fault+0x69/0xa0
 ? asm_exc_page_fault+0x22/0x30
 ? complete_all+0x3e/0xa0
 fsm_main_thread+0xa3/0x9c0 [mtk_t7xx (HASH:1400 5)]
 ? __pfx_autoremove_wake_function+0x10/0x10
 kthread+0xd8/0x110
 ? __pfx_fsm_main_thread+0x10/0x10 [mtk_t7xx (HASH:1400 5)]
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x38/0x50
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1b/0x30
 </TASK>
[...]
CR2: fffffffffffffff8
---[ end trace 0000000000000000 ]---

After the main thread determines that the task has timed out, mark
the completion invalid, and add judgment in the asynchronous task.

Fixes: d785ed945de6 ("net: wwan: t7xx: PCIe reset rescan")
Signed-off-by: Jinjian Song <jinjian.song@fibocom.com>
---
 drivers/net/wwan/t7xx/t7xx_state_monitor.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

Comments

Sergey Ryazanov Dec. 15, 2024, 2:17 a.m. UTC | #1
Hello Jinjian,

Nice catch! Suddenly, the proposed fix is not complete, we still have 
issues with memory use after free. Please find details below.

On 13.12.2024 08:47, Jinjian Song wrote:
> When driver processes the internal state change command, it use an
> asynchronous thread to process the command operation. If the main
> thread detects that the task has timed out, the asynchronous thread
> will panic when executing the completion notification because the
> main thread completion object has been released.
> 
> BUG: unable to handle page fault for address: fffffffffffffff8
> PGD 1f283a067 P4D 1f283a067 PUD 1f283c067 PMD 0
> Oops: 0000 [#1] PREEMPT SMP NOPTI
> RIP: 0010:complete_all+0x3e/0xa0
> [...]
> Call Trace:
>   <TASK>
>   ? __die_body+0x68/0xb0
>   ? page_fault_oops+0x379/0x3e0
>   ? exc_page_fault+0x69/0xa0
>   ? asm_exc_page_fault+0x22/0x30
>   ? complete_all+0x3e/0xa0
>   fsm_main_thread+0xa3/0x9c0 [mtk_t7xx (HASH:1400 5)]
>   ? __pfx_autoremove_wake_function+0x10/0x10
>   kthread+0xd8/0x110
>   ? __pfx_fsm_main_thread+0x10/0x10 [mtk_t7xx (HASH:1400 5)]
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork+0x38/0x50
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork_asm+0x1b/0x30
>   </TASK>
> [...]
> CR2: fffffffffffffff8
> ---[ end trace 0000000000000000 ]---
> 
> After the main thread determines that the task has timed out, mark
> the completion invalid, and add judgment in the asynchronous task.
> 
> Fixes: d785ed945de6 ("net: wwan: t7xx: PCIe reset rescan")

The completion waiting was introduced in a different commit. I believe, 
the fix tag should be 13e920d93e37 ("net: wwan: t7xx: Add core components")

> Signed-off-by: Jinjian Song <jinjian.song@fibocom.com>
> ---
>   drivers/net/wwan/t7xx/t7xx_state_monitor.c | 7 +++++--
>   1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/wwan/t7xx/t7xx_state_monitor.c b/drivers/net/wwan/t7xx/t7xx_state_monitor.c
> index 3931c7a13f5a..57f1a7730fff 100644
> --- a/drivers/net/wwan/t7xx/t7xx_state_monitor.c
> +++ b/drivers/net/wwan/t7xx/t7xx_state_monitor.c
> @@ -108,7 +108,8 @@ static void fsm_finish_command(struct t7xx_fsm_ctl *ctl, struct t7xx_fsm_command
>   {
>   	if (cmd->flag & FSM_CMD_FLAG_WAIT_FOR_COMPLETION) {
>   		*cmd->ret = result;

The memory for the result storage is allocated on the stack as well. And 
writing it unconditionally can cause unexpected consequences.

> -		complete_all(cmd->done);
> +		if (cmd->done)
> +			complete_all(cmd->done);
>   	}
>   
>   	kfree(cmd);
> @@ -503,8 +504,10 @@ int t7xx_fsm_append_cmd(struct t7xx_fsm_ctl *ctl, enum t7xx_fsm_cmd_state cmd_id
>   
>   		wait_ret = wait_for_completion_timeout(&done,
>   						       msecs_to_jiffies(FSM_CMD_TIMEOUT_MS));
> -		if (!wait_ret)
> +		if (!wait_ret) {
> +			cmd->done = NULL;

We cannot access the command memory here, since fsm_finish_command() 
could release it already.

>   			return -ETIMEDOUT;
> +		}
>   
>   		return ret;
>   	}

Here we have an ownership transfer problem and a driver author has tried 
to solve it, but as noticed, we are still experiencing issues in case of 
timeout.

The command completion routine should not release the command memory 
unconditionally. Looks like the references counting approach should help 
us here. E.g.
1. grab a reference before we put a command into the queue
1.1. grab an extra reference if we are going to wait the completion
2. release the reference as soon as we are done with the command execution
3. in case of completion waiting release the reference as soon as we are 
done with waiting due to completion or timeout

Could you try the following patch? Please note, besides the reference 
counter introduction it also moves completion and result storage inside 
the command structure as advised by the completion documentation.


--- a/drivers/net/wwan/t7xx/t7xx_state_monitor.h
+++ b/drivers/net/wwan/t7xx/t7xx_state_monitor.h
@@ -109,8 +109,9 @@ struct t7xx_fsm_command {
  	struct list_head	entry;
  	enum t7xx_fsm_cmd_state	cmd_id;
  	unsigned int		flag;
-	struct completion	*done;
-	int			*ret;
+	struct completion	done;
+	int			result;
+	struct struct kref	refcnt;
  };

  struct t7xx_fsm_notifier {
--- a/drivers/net/wwan/t7xx/t7xx_state_monitor.c
+++ b/drivers/net/wwan/t7xx/t7xx_state_monitor.c
@@ -97,14 +97,20 @@ void t7xx_fsm_broadcast_state(struct t7xx_fsm_ctl 
*ctl, enum md_state state)
  	fsm_state_notify(ctl->md, state);
  }

+static void fsm_release_command(struct kref *ref)
+{
+	struct t7xx_fsm_command *cmd = container_of(ref, typeof(*cmd), refcnt);
+	kfree(cmd);
+}
+
  static void fsm_finish_command(struct t7xx_fsm_ctl *ctl, struct 
t7xx_fsm_command *cmd, int result)
  {
  	if (cmd->flag & FSM_CMD_FLAG_WAIT_FOR_COMPLETION) {
-		*cmd->ret = result;
+		cmd->result = result;
  		complete_all(cmd->done);
  	}

-	kfree(cmd);
+	kref_put(&cmd->refcnt, fsm_release_command);
  }

  static void fsm_del_kf_event(struct t7xx_fsm_event *event)
@@ -396,7 +402,6 @@ static int fsm_main_thread(void *data)

  int t7xx_fsm_append_cmd(struct t7xx_fsm_ctl *ctl, enum 
t7xx_fsm_cmd_state cmd_id, unsigned int flag)
  {
-	DECLARE_COMPLETION_ONSTACK(done);
  	struct t7xx_fsm_command *cmd;
  	unsigned long flags;
  	int ret;
@@ -408,11 +413,13 @@ int t7xx_fsm_append_cmd(struct t7xx_fsm_ctl *ctl, 
enum t7xx_fsm_cmd_state cmd_id
  	INIT_LIST_HEAD(&cmd->entry);
  	cmd->cmd_id = cmd_id;
  	cmd->flag = flag;
+	kref_init(&cmd->refcnt);
  	if (flag & FSM_CMD_FLAG_WAIT_FOR_COMPLETION) {
-		cmd->done = &done;
-		cmd->ret = &ret;
+		init_completion(&cmd->done);
+		kref_get(&cmd->refcnt);
  	}

+	kref_get(&cmd->refcnt);
  	spin_lock_irqsave(&ctl->command_lock, flags);
  	list_add_tail(&cmd->entry, &ctl->command_queue);
  	spin_unlock_irqrestore(&ctl->command_lock, flags);
@@ -422,10 +429,10 @@ int t7xx_fsm_append_cmd(struct t7xx_fsm_ctl *ctl, 
enum t7xx_fsm_cmd_state cmd_id
  	if (flag & FSM_CMD_FLAG_WAIT_FOR_COMPLETION) {
  		unsigned long wait_ret;

-		wait_ret = wait_for_completion_timeout(&done,
+		wait_ret = wait_for_completion_timeout(&cmd->done,
  						       msecs_to_jiffies(FSM_CMD_TIMEOUT_MS));
-		if (!wait_ret)
-			return -ETIMEDOUT;
+		ret = wait_ret ? cmd->result : -ETIMEDOUT;
+		kref_put(&cmd->refcnt, fsm_release_command);

  		return ret;
  	}

--
Sergey
Jinjian Song Dec. 16, 2024, 1:53 p.m. UTC | #2
From: Sergey Ryazanov <ryazanov.s.a@gmail.com>

>> Fixes: d785ed945de6 ("net: wwan: t7xx: PCIe reset rescan")
>
>The completion waiting was introduced in a different commit. I believe, 
>the fix tag should be 13e920d93e37 ("net: wwan: t7xx: Add core components")
>

Got it.

[...]
>>   	if (cmd->flag & FSM_CMD_FLAG_WAIT_FOR_COMPLETION) {
>>   		*cmd->ret = result;
>
>The memory for the result storage is allocated on the stack as well. And 
>writing it unconditionally can cause unexpected consequences.
>

Got it.

[...]
>>   		wait_ret = wait_for_completion_timeout(&done,
>>   						       msecs_to_jiffies(FSM_CMD_TIMEOUT_MS));
>> -		if (!wait_ret)
>> +		if (!wait_ret) {
>> +			cmd->done = NULL;
>
>We cannot access the command memory here, since fsm_finish_command() 
>could release it already.
>

Got it.

[...]
>Here we have an ownership transfer problem and a driver author has tried 
>to solve it, but as noticed, we are still experiencing issues in case of 
>timeout.
>
>The command completion routine should not release the command memory 
>unconditionally. Looks like the references counting approach should help 
>us here. E.g.
>1. grab a reference before we put a command into the queue
>1.1. grab an extra reference if we are going to wait the completion
>2. release the reference as soon as we are done with the command execution
>3. in case of completion waiting release the reference as soon as we are 
>done with waiting due to completion or timeout
>
>Could you try the following patch? Please note, besides the reference 
>counter introduction it also moves completion and result storage inside 
>the command structure as advised by the completion documentation.
>

Yes, please let me try the following patch.
[...]

Thanks.

Jinjian,
Best Regards.
Jinjian Song Dec. 20, 2024, 8:50 a.m. UTC | #3
From: Sergey Ryazanov <ryazanov.s.a@gmail.com>

>> Fixes: d785ed945de6 ("net: wwan: t7xx: PCIe reset rescan")
>
>The completion waiting was introduced in a different commit. I believe, 
>the fix tag should be 13e920d93e37 ("net: wwan: t7xx: Add core components")
>

Got it.

[...]
>>   	if (cmd->flag & FSM_CMD_FLAG_WAIT_FOR_COMPLETION) {
>>   		*cmd->ret = result;
>
>The memory for the result storage is allocated on the stack as well. And 
>writing it unconditionally can cause unexpected consequences.
>

Got it.

[...]
>>   		wait_ret = wait_for_completion_timeout(&done,
>>   						       msecs_to_jiffies(FSM_CMD_TIMEOUT_MS));
>> -		if (!wait_ret)
>> +		if (!wait_ret) {
>> +			cmd->done = NULL;
>
>We cannot access the command memory here, since fsm_finish_command() 
>could release it already.
>

Got it.

[...]
>Here we have an ownership transfer problem and a driver author has tried 
>to solve it, but as noticed, we are still experiencing issues in case of 
>timeout.
>
>The command completion routine should not release the command memory 
>unconditionally. Looks like the references counting approach should help 
>us here. E.g.
>1. grab a reference before we put a command into the queue
>1.1. grab an extra reference if we are going to wait the completion
>2. release the reference as soon as we are done with the command execution
>3. in case of completion waiting release the reference as soon as we are 
>done with waiting due to completion or timeout
>
>Could you try the following patch? Please note, besides the reference 
>counter introduction it also moves completion and result storage inside 
>the command structure as advised by the completion documentation.
>

Hi Sergey,

Yes, the patch works fine, needs some minor modifications, could we 
feedback to the driver author to merge these changes.

diff --git a/drivers/net/wwan/t7xx/t7xx_state_monitor.c b/drivers/net/wwan/t7xx/t7xx_state_monitor.c
index 3931c7a13f5a..265c40b29f56 100644
--- a/drivers/net/wwan/t7xx/t7xx_state_monitor.c
+++ b/drivers/net/wwan/t7xx/t7xx_state_monitor.c
@@ -104,14 +104,20 @@ void t7xx_fsm_broadcast_state(struct t7xx_fsm_ctl *ctl, enum md_state state)
 	fsm_state_notify(ctl->md, state);
 }
 
+static void fsm_release_command(struct kref *ref)
+{
+	struct t7xx_fsm_command *cmd = container_of(ref, typeof(*cmd), refcnt);
+	kfree(cmd);
+}
+
 static void fsm_finish_command(struct t7xx_fsm_ctl *ctl, struct t7xx_fsm_command *cmd, int result)
 {
 	if (cmd->flag & FSM_CMD_FLAG_WAIT_FOR_COMPLETION) {
-		*cmd->ret = result;
-		complete_all(cmd->done);
+		cmd->result = result;
+		complete_all(&cmd->done);
 	}
 
-	kfree(cmd);
+	kref_put(&cmd->refcnt, fsm_release_command);
 }
 
 static void fsm_del_kf_event(struct t7xx_fsm_event *event)
@@ -475,7 +481,6 @@ static int fsm_main_thread(void *data)
 
 int t7xx_fsm_append_cmd(struct t7xx_fsm_ctl *ctl, enum t7xx_fsm_cmd_state cmd_id, unsigned int flag)
 {
-	DECLARE_COMPLETION_ONSTACK(done);
 	struct t7xx_fsm_command *cmd;
 	unsigned long flags;
 	int ret;
@@ -487,11 +492,13 @@ int t7xx_fsm_append_cmd(struct t7xx_fsm_ctl *ctl, enum t7xx_fsm_cmd_state cmd_id
 	INIT_LIST_HEAD(&cmd->entry);
 	cmd->cmd_id = cmd_id;
 	cmd->flag = flag;
+	kref_init(&cmd->refcnt);
 	if (flag & FSM_CMD_FLAG_WAIT_FOR_COMPLETION) {
-		cmd->done = &done;
-		cmd->ret = &ret;
+		init_completion(&cmd->done);
+		kref_get(&cmd->refcnt);
 	}
 
+	kref_get(&cmd->refcnt);
 	spin_lock_irqsave(&ctl->command_lock, flags);
 	list_add_tail(&cmd->entry, &ctl->command_queue);
 	spin_unlock_irqrestore(&ctl->command_lock, flags);
@@ -501,11 +508,11 @@ int t7xx_fsm_append_cmd(struct t7xx_fsm_ctl *ctl, enum t7xx_fsm_cmd_state cmd_id
 	if (flag & FSM_CMD_FLAG_WAIT_FOR_COMPLETION) {
 		unsigned long wait_ret;
 
-		wait_ret = wait_for_completion_timeout(&done,
+		wait_ret = wait_for_completion_timeout(&cmd->done,
 						       msecs_to_jiffies(FSM_CMD_TIMEOUT_MS));
-		if (!wait_ret)
-			return -ETIMEDOUT;
 
+		ret = wait_ret ? cmd->result : -ETIMEDOUT;
+		kref_put(&cmd->refcnt, fsm_release_command);
 		return ret;
 	}
 
diff --git a/drivers/net/wwan/t7xx/t7xx_state_monitor.h b/drivers/net/wwan/t7xx/t7xx_state_monitor.h
index 7b0a9baf488c..6e0601bb752e 100644
--- a/drivers/net/wwan/t7xx/t7xx_state_monitor.h
+++ b/drivers/net/wwan/t7xx/t7xx_state_monitor.h
@@ -110,8 +110,9 @@ struct t7xx_fsm_command {
 	struct list_head	entry;
 	enum t7xx_fsm_cmd_state	cmd_id;
 	unsigned int		flag;
-	struct completion	*done;
-	int			*ret;
+	struct completion	done;
+	int			result;
+	struct kref		refcnt;
 };
 
 struct t7xx_fsm_notifier {

Thanks.

Jinjian,
Best Regards.
Sergey Ryazanov Dec. 22, 2024, 9:04 p.m. UTC | #4
Hi Jinjian,

On 20.12.2024 10:50, Jinjian Song wrote:
> From: Sergey Ryazanov <ryazanov.s.a@gmail.com>
> 
>>> Fixes: d785ed945de6 ("net: wwan: t7xx: PCIe reset rescan")
>>
>> The completion waiting was introduced in a different commit. I 
>> believe, the fix tag should be 13e920d93e37 ("net: wwan: t7xx: Add 
>> core components")
>>
> 
> Got it.

[...]

> Yes, the patch works fine, needs some minor modifications, could we 
> feedback to the driver author to merge these changes.

Looks like the drivers authors haven't enough time to maintain it. You 
are the most active developer at the moment. Could formally resend the 
updated patch with refcounter as V3? And, I believe, we can apply it.

--
Sergey
Jinjian Song Dec. 23, 2024, 2:24 a.m. UTC | #5
From: Sergey Ryazanov <ryazanov.s.a@gmail.com>

>Hi Jinjian,
>
>On 20.12.2024 10:50, Jinjian Song wrote:
>> From: Sergey Ryazanov <ryazanov.s.a@gmail.com>
>> 
>>>> Fixes: d785ed945de6 ("net: wwan: t7xx: PCIe reset rescan")
>>>
>>> The completion waiting was introduced in a different commit. I 
>>> believe, the fix tag should be 13e920d93e37 ("net: wwan: t7xx: Add 
>>> core components")
>>>
>> 
>> Got it.
>
>[...]
>
>> Yes, the patch works fine, needs some minor modifications, could we 
>> feedback to the driver author to merge these changes.
>
>Looks like the drivers authors haven't enough time to maintain it. You 
>are the most active developer at the moment. Could formally resend the 
>updated patch with refcounter as V3? And, I believe, we can apply it.

Hi Sergey,

Yes, please let me resend the updated patch as suggested.

Thanks.

Jinjian,
Best Regards.
diff mbox series

Patch

diff --git a/drivers/net/wwan/t7xx/t7xx_state_monitor.c b/drivers/net/wwan/t7xx/t7xx_state_monitor.c
index 3931c7a13f5a..57f1a7730fff 100644
--- a/drivers/net/wwan/t7xx/t7xx_state_monitor.c
+++ b/drivers/net/wwan/t7xx/t7xx_state_monitor.c
@@ -108,7 +108,8 @@  static void fsm_finish_command(struct t7xx_fsm_ctl *ctl, struct t7xx_fsm_command
 {
 	if (cmd->flag & FSM_CMD_FLAG_WAIT_FOR_COMPLETION) {
 		*cmd->ret = result;
-		complete_all(cmd->done);
+		if (cmd->done)
+			complete_all(cmd->done);
 	}
 
 	kfree(cmd);
@@ -503,8 +504,10 @@  int t7xx_fsm_append_cmd(struct t7xx_fsm_ctl *ctl, enum t7xx_fsm_cmd_state cmd_id
 
 		wait_ret = wait_for_completion_timeout(&done,
 						       msecs_to_jiffies(FSM_CMD_TIMEOUT_MS));
-		if (!wait_ret)
+		if (!wait_ret) {
+			cmd->done = NULL;
 			return -ETIMEDOUT;
+		}
 
 		return ret;
 	}