diff mbox series

[v1] misc: fastrpc: Trigger a panic using BUG_ON in device release

Message ID 20240730070945.4174823-1-quic_abhishes@quicinc.com (mailing list archive)
State Not Applicable
Headers show
Series [v1] misc: fastrpc: Trigger a panic using BUG_ON in device release | expand

Commit Message

Abhishek Singh July 30, 2024, 7:09 a.m. UTC
The user process on ARM closes the device node while closing the
session, triggers a remote call to terminate the PD running on the
DSP. If the DSP is in an unstable state and cannot process the remote
request from the HLOS, glink fails to deliver the kill request to the
DSP, resulting in a timeout error. Currently, this error is ignored,
and the session is closed, causing all the SMMU mappings associated
with that specific PD to be removed. However, since the PD is still
operational on the DSP, any attempt to access these SMMU mappings
results in an SMMU fault, leading to a panic.  As the SMMU mappings
have already been removed, there is no available information on the
DSP to determine the root cause of its unresponsiveness to remote
calls. As the DSP is unresponsive to all process remote calls, use
BUG_ON to prevent the removal of SMMU mappings and to properly
identify the root cause of the DSP’s unresponsiveness to the remote
calls.

Signed-off-by: Abhishek Singh <quic_abhishes@quicinc.com>
---
 drivers/misc/fastrpc.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Greg Kroah-Hartman July 30, 2024, 7:16 a.m. UTC | #1
On Tue, Jul 30, 2024 at 12:39:45PM +0530, Abhishek Singh wrote:
> The user process on ARM closes the device node while closing the
> session, triggers a remote call to terminate the PD running on the
> DSP. If the DSP is in an unstable state and cannot process the remote
> request from the HLOS, glink fails to deliver the kill request to the
> DSP, resulting in a timeout error. Currently, this error is ignored,
> and the session is closed, causing all the SMMU mappings associated
> with that specific PD to be removed. However, since the PD is still
> operational on the DSP, any attempt to access these SMMU mappings
> results in an SMMU fault, leading to a panic.  As the SMMU mappings
> have already been removed, there is no available information on the
> DSP to determine the root cause of its unresponsiveness to remote
> calls. As the DSP is unresponsive to all process remote calls, use
> BUG_ON to prevent the removal of SMMU mappings and to properly
> identify the root cause of the DSP’s unresponsiveness to the remote
> calls.
> 
> Signed-off-by: Abhishek Singh <quic_abhishes@quicinc.com>
> ---
>  drivers/misc/fastrpc.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/misc/fastrpc.c b/drivers/misc/fastrpc.c
> index 5204fda51da3..bac9c749564c 100644
> --- a/drivers/misc/fastrpc.c
> +++ b/drivers/misc/fastrpc.c
> @@ -97,6 +97,7 @@
>  #define FASTRPC_RMID_INIT_CREATE_STATIC	8
>  #define FASTRPC_RMID_INIT_MEM_MAP      10
>  #define FASTRPC_RMID_INIT_MEM_UNMAP    11
> +#define PROCESS_KILL_SC 0x01010000
>  
>  /* Protection Domain(PD) ids */
>  #define ROOT_PD		(0)
> @@ -1128,6 +1129,9 @@ static int fastrpc_invoke_send(struct fastrpc_session_ctx *sctx,
>  	fastrpc_context_get(ctx);
>  
>  	ret = rpmsg_send(cctx->rpdev->ept, (void *)msg, sizeof(*msg));
> +	/* trigger panic if glink communication is broken and the message is for PD kill */
> +	BUG_ON((ret == -ETIMEDOUT) && (handle == FASTRPC_INIT_HANDLE) &&
> +			(ctx->sc == PROCESS_KILL_SC));

You just crashed the machine completely, sorry, but no, properly handle
the issue and clean up if you can detect it, do not break systems.

greg k-h
Abhishek Singh Aug. 5, 2024, 11:06 a.m. UTC | #2
On 7/30/2024 12:46 PM, Greg KH wrote:
> On Tue, Jul 30, 2024 at 12:39:45PM +0530, Abhishek Singh wrote:
>> The user process on ARM closes the device node while closing the
>> session, triggers a remote call to terminate the PD running on the
>> DSP. If the DSP is in an unstable state and cannot process the remote
>> request from the HLOS, glink fails to deliver the kill request to the
>> DSP, resulting in a timeout error. Currently, this error is ignored,
>> and the session is closed, causing all the SMMU mappings associated
>> with that specific PD to be removed. However, since the PD is still
>> operational on the DSP, any attempt to access these SMMU mappings
>> results in an SMMU fault, leading to a panic.  As the SMMU mappings
>> have already been removed, there is no available information on the
>> DSP to determine the root cause of its unresponsiveness to remote
>> calls. As the DSP is unresponsive to all process remote calls, use
>> BUG_ON to prevent the removal of SMMU mappings and to properly
>> identify the root cause of the DSP’s unresponsiveness to the remote
>> calls.
>>
>> Signed-off-by: Abhishek Singh <quic_abhishes@quicinc.com>
>> ---
>>  drivers/misc/fastrpc.c | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/drivers/misc/fastrpc.c b/drivers/misc/fastrpc.c
>> index 5204fda51da3..bac9c749564c 100644
>> --- a/drivers/misc/fastrpc.c
>> +++ b/drivers/misc/fastrpc.c
>> @@ -97,6 +97,7 @@
>>  #define FASTRPC_RMID_INIT_CREATE_STATIC	8
>>  #define FASTRPC_RMID_INIT_MEM_MAP      10
>>  #define FASTRPC_RMID_INIT_MEM_UNMAP    11
>> +#define PROCESS_KILL_SC 0x01010000
>>  
>>  /* Protection Domain(PD) ids */
>>  #define ROOT_PD		(0)
>> @@ -1128,6 +1129,9 @@ static int fastrpc_invoke_send(struct fastrpc_session_ctx *sctx,
>>  	fastrpc_context_get(ctx);
>>  
>>  	ret = rpmsg_send(cctx->rpdev->ept, (void *)msg, sizeof(*msg));
>> +	/* trigger panic if glink communication is broken and the message is for PD kill */
>> +	BUG_ON((ret == -ETIMEDOUT) && (handle == FASTRPC_INIT_HANDLE) &&
>> +			(ctx->sc == PROCESS_KILL_SC));
> 
> You just crashed the machine completely, sorry, but no, properly handle
> the issue and clean up if you can detect it, do not break systems.
> 
But the Glink communication with DSP is already broken; we cannot communicate with the DSP.
The system will crash if we proceed with cleanup on the ARM side. If we don’t do cleanup,
a resource leak will occur. Eventually, the system will become dead. That’s why I am
crashing the device.
> greg k-h
Greg Kroah-Hartman Aug. 13, 2024, 9:37 a.m. UTC | #3
On Mon, Aug 05, 2024 at 04:36:28PM +0530, Abhishek Singh wrote:
> 
> On 7/30/2024 12:46 PM, Greg KH wrote:
> > On Tue, Jul 30, 2024 at 12:39:45PM +0530, Abhishek Singh wrote:
> >> The user process on ARM closes the device node while closing the
> >> session, triggers a remote call to terminate the PD running on the
> >> DSP. If the DSP is in an unstable state and cannot process the remote
> >> request from the HLOS, glink fails to deliver the kill request to the
> >> DSP, resulting in a timeout error. Currently, this error is ignored,
> >> and the session is closed, causing all the SMMU mappings associated
> >> with that specific PD to be removed. However, since the PD is still
> >> operational on the DSP, any attempt to access these SMMU mappings
> >> results in an SMMU fault, leading to a panic.  As the SMMU mappings
> >> have already been removed, there is no available information on the
> >> DSP to determine the root cause of its unresponsiveness to remote
> >> calls. As the DSP is unresponsive to all process remote calls, use
> >> BUG_ON to prevent the removal of SMMU mappings and to properly
> >> identify the root cause of the DSP’s unresponsiveness to the remote
> >> calls.
> >>
> >> Signed-off-by: Abhishek Singh <quic_abhishes@quicinc.com>
> >> ---
> >>  drivers/misc/fastrpc.c | 4 ++++
> >>  1 file changed, 4 insertions(+)
> >>
> >> diff --git a/drivers/misc/fastrpc.c b/drivers/misc/fastrpc.c
> >> index 5204fda51da3..bac9c749564c 100644
> >> --- a/drivers/misc/fastrpc.c
> >> +++ b/drivers/misc/fastrpc.c
> >> @@ -97,6 +97,7 @@
> >>  #define FASTRPC_RMID_INIT_CREATE_STATIC	8
> >>  #define FASTRPC_RMID_INIT_MEM_MAP      10
> >>  #define FASTRPC_RMID_INIT_MEM_UNMAP    11
> >> +#define PROCESS_KILL_SC 0x01010000
> >>  
> >>  /* Protection Domain(PD) ids */
> >>  #define ROOT_PD		(0)
> >> @@ -1128,6 +1129,9 @@ static int fastrpc_invoke_send(struct fastrpc_session_ctx *sctx,
> >>  	fastrpc_context_get(ctx);
> >>  
> >>  	ret = rpmsg_send(cctx->rpdev->ept, (void *)msg, sizeof(*msg));
> >> +	/* trigger panic if glink communication is broken and the message is for PD kill */
> >> +	BUG_ON((ret == -ETIMEDOUT) && (handle == FASTRPC_INIT_HANDLE) &&
> >> +			(ctx->sc == PROCESS_KILL_SC));
> > 
> > You just crashed the machine completely, sorry, but no, properly handle
> > the issue and clean up if you can detect it, do not break systems.
> > 
> But the Glink communication with DSP is already broken; we cannot communicate with the DSP.
> The system will crash if we proceed with cleanup on the ARM side. If we don’t do cleanup,
> a resource leak will occur. Eventually, the system will become dead. That’s why I am
> crashing the device.

Then explicitly call panic() if you think you really want to shut the
system down.

greg k-h
Caleb Connolly Aug. 19, 2024, 11 a.m. UTC | #4
Hi Abishek,

On 30/07/2024 09:09, Abhishek Singh wrote:
> The user process on ARM closes the device node while closing the
> session, triggers a remote call to terminate the PD running on the
> DSP. If the DSP is in an unstable state and cannot process the remote
> request from the HLOS, glink fails to deliver the kill request to the
> DSP, resulting in a timeout error. Currently, this error is ignored,
> and the session is closed, causing all the SMMU mappings associated
> with that specific PD to be removed. However, since the PD is still
> operational on the DSP, any attempt to access these SMMU mappings
> results in an SMMU fault, leading to a panic.  As the SMMU mappings
> have already been removed, there is no available information on the
> DSP to determine the root cause of its unresponsiveness to remote
> calls. As the DSP is unresponsive to all process remote calls, use
> BUG_ON to prevent the removal of SMMU mappings and to properly
> identify the root cause of the DSP’s unresponsiveness to the remote
> calls.

Could you elaborate a little about what contexts this can happen? What 
SoC/board are you hitting this on? Is it running pre-production firmware?

Since fastrpc is not at all required for basic functionality of the 
device, maybe it would be better to print an error here and then prevent 
trying to open new connections to the DSP?

Kind regards,
> 
> Signed-off-by: Abhishek Singh <quic_abhishes@quicinc.com>
> ---
>   drivers/misc/fastrpc.c | 4 ++++
>   1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/misc/fastrpc.c b/drivers/misc/fastrpc.c
> index 5204fda51da3..bac9c749564c 100644
> --- a/drivers/misc/fastrpc.c
> +++ b/drivers/misc/fastrpc.c
> @@ -97,6 +97,7 @@
>   #define FASTRPC_RMID_INIT_CREATE_STATIC	8
>   #define FASTRPC_RMID_INIT_MEM_MAP      10
>   #define FASTRPC_RMID_INIT_MEM_UNMAP    11
> +#define PROCESS_KILL_SC 0x01010000
>   
>   /* Protection Domain(PD) ids */
>   #define ROOT_PD		(0)
> @@ -1128,6 +1129,9 @@ static int fastrpc_invoke_send(struct fastrpc_session_ctx *sctx,
>   	fastrpc_context_get(ctx);
>   
>   	ret = rpmsg_send(cctx->rpdev->ept, (void *)msg, sizeof(*msg));
> +	/* trigger panic if glink communication is broken and the message is for PD kill */
> +	BUG_ON((ret == -ETIMEDOUT) && (handle == FASTRPC_INIT_HANDLE) &&
> +			(ctx->sc == PROCESS_KILL_SC));
>   
>   	if (ret)
>   		fastrpc_context_put(ctx);
Abhishek Singh Aug. 30, 2024, 8:44 a.m. UTC | #5
On 8/13/2024 3:07 PM, Greg KH wrote:
> On Mon, Aug 05, 2024 at 04:36:28PM +0530, Abhishek Singh wrote:
>>
>> On 7/30/2024 12:46 PM, Greg KH wrote:
>>> On Tue, Jul 30, 2024 at 12:39:45PM +0530, Abhishek Singh wrote:
>>>> The user process on ARM closes the device node while closing the
>>>> session, triggers a remote call to terminate the PD running on the
>>>> DSP. If the DSP is in an unstable state and cannot process the remote
>>>> request from the HLOS, glink fails to deliver the kill request to the
>>>> DSP, resulting in a timeout error. Currently, this error is ignored,
>>>> and the session is closed, causing all the SMMU mappings associated
>>>> with that specific PD to be removed. However, since the PD is still
>>>> operational on the DSP, any attempt to access these SMMU mappings
>>>> results in an SMMU fault, leading to a panic.  As the SMMU mappings
>>>> have already been removed, there is no available information on the
>>>> DSP to determine the root cause of its unresponsiveness to remote
>>>> calls. As the DSP is unresponsive to all process remote calls, use
>>>> BUG_ON to prevent the removal of SMMU mappings and to properly
>>>> identify the root cause of the DSP’s unresponsiveness to the remote
>>>> calls.
>>>>
>>>> Signed-off-by: Abhishek Singh <quic_abhishes@quicinc.com>
>>>> ---
>>>>  drivers/misc/fastrpc.c | 4 ++++
>>>>  1 file changed, 4 insertions(+)
>>>>
>>>> diff --git a/drivers/misc/fastrpc.c b/drivers/misc/fastrpc.c
>>>> index 5204fda51da3..bac9c749564c 100644
>>>> --- a/drivers/misc/fastrpc.c
>>>> +++ b/drivers/misc/fastrpc.c
>>>> @@ -97,6 +97,7 @@
>>>>  #define FASTRPC_RMID_INIT_CREATE_STATIC	8
>>>>  #define FASTRPC_RMID_INIT_MEM_MAP      10
>>>>  #define FASTRPC_RMID_INIT_MEM_UNMAP    11
>>>> +#define PROCESS_KILL_SC 0x01010000
>>>>  
>>>>  /* Protection Domain(PD) ids */
>>>>  #define ROOT_PD		(0)
>>>> @@ -1128,6 +1129,9 @@ static int fastrpc_invoke_send(struct fastrpc_session_ctx *sctx,
>>>>  	fastrpc_context_get(ctx);
>>>>  
>>>>  	ret = rpmsg_send(cctx->rpdev->ept, (void *)msg, sizeof(*msg));
>>>> +	/* trigger panic if glink communication is broken and the message is for PD kill */
>>>> +	BUG_ON((ret == -ETIMEDOUT) && (handle == FASTRPC_INIT_HANDLE) &&
>>>> +			(ctx->sc == PROCESS_KILL_SC));
>>>
>>> You just crashed the machine completely, sorry, but no, properly handle
>>> the issue and clean up if you can detect it, do not break systems.
>>>
>> But the Glink communication with DSP is already broken; we cannot communicate with the DSP.
>> The system will crash if we proceed with cleanup on the ARM side. If we don’t do cleanup,
>> a resource leak will occur. Eventually, the system will become dead. That’s why I am
>> crashing the device.
> 
> Then explicitly call panic() if you think you really want to shut the
> system down.
>
>> What does it mean to explicitly call panic()? Are you trying to say we should use panic() instead of BUG_ON()?
> 
> greg k-h
diff mbox series

Patch

diff --git a/drivers/misc/fastrpc.c b/drivers/misc/fastrpc.c
index 5204fda51da3..bac9c749564c 100644
--- a/drivers/misc/fastrpc.c
+++ b/drivers/misc/fastrpc.c
@@ -97,6 +97,7 @@ 
 #define FASTRPC_RMID_INIT_CREATE_STATIC	8
 #define FASTRPC_RMID_INIT_MEM_MAP      10
 #define FASTRPC_RMID_INIT_MEM_UNMAP    11
+#define PROCESS_KILL_SC 0x01010000
 
 /* Protection Domain(PD) ids */
 #define ROOT_PD		(0)
@@ -1128,6 +1129,9 @@  static int fastrpc_invoke_send(struct fastrpc_session_ctx *sctx,
 	fastrpc_context_get(ctx);
 
 	ret = rpmsg_send(cctx->rpdev->ept, (void *)msg, sizeof(*msg));
+	/* trigger panic if glink communication is broken and the message is for PD kill */
+	BUG_ON((ret == -ETIMEDOUT) && (handle == FASTRPC_INIT_HANDLE) &&
+			(ctx->sc == PROCESS_KILL_SC));
 
 	if (ret)
 		fastrpc_context_put(ctx);