Message ID | 20240730070945.4174823-1-quic_abhishes@quicinc.com (mailing list archive) |
---|---|
State | Not Applicable |
Headers | show |
Series | [v1] misc: fastrpc: Trigger a panic using BUG_ON in device release | expand |
On Tue, Jul 30, 2024 at 12:39:45PM +0530, Abhishek Singh wrote: > The user process on ARM closes the device node while closing the > session, triggers a remote call to terminate the PD running on the > DSP. If the DSP is in an unstable state and cannot process the remote > request from the HLOS, glink fails to deliver the kill request to the > DSP, resulting in a timeout error. Currently, this error is ignored, > and the session is closed, causing all the SMMU mappings associated > with that specific PD to be removed. However, since the PD is still > operational on the DSP, any attempt to access these SMMU mappings > results in an SMMU fault, leading to a panic. As the SMMU mappings > have already been removed, there is no available information on the > DSP to determine the root cause of its unresponsiveness to remote > calls. As the DSP is unresponsive to all process remote calls, use > BUG_ON to prevent the removal of SMMU mappings and to properly > identify the root cause of the DSP’s unresponsiveness to the remote > calls. > > Signed-off-by: Abhishek Singh <quic_abhishes@quicinc.com> > --- > drivers/misc/fastrpc.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/drivers/misc/fastrpc.c b/drivers/misc/fastrpc.c > index 5204fda51da3..bac9c749564c 100644 > --- a/drivers/misc/fastrpc.c > +++ b/drivers/misc/fastrpc.c > @@ -97,6 +97,7 @@ > #define FASTRPC_RMID_INIT_CREATE_STATIC 8 > #define FASTRPC_RMID_INIT_MEM_MAP 10 > #define FASTRPC_RMID_INIT_MEM_UNMAP 11 > +#define PROCESS_KILL_SC 0x01010000 > > /* Protection Domain(PD) ids */ > #define ROOT_PD (0) > @@ -1128,6 +1129,9 @@ static int fastrpc_invoke_send(struct fastrpc_session_ctx *sctx, > fastrpc_context_get(ctx); > > ret = rpmsg_send(cctx->rpdev->ept, (void *)msg, sizeof(*msg)); > + /* trigger panic if glink communication is broken and the message is for PD kill */ > + BUG_ON((ret == -ETIMEDOUT) && (handle == FASTRPC_INIT_HANDLE) && > + (ctx->sc == PROCESS_KILL_SC)); You just crashed the machine completely, sorry, but no, properly handle the issue and clean up if you can detect it, do not break systems. greg k-h
On 7/30/2024 12:46 PM, Greg KH wrote: > On Tue, Jul 30, 2024 at 12:39:45PM +0530, Abhishek Singh wrote: >> The user process on ARM closes the device node while closing the >> session, triggers a remote call to terminate the PD running on the >> DSP. If the DSP is in an unstable state and cannot process the remote >> request from the HLOS, glink fails to deliver the kill request to the >> DSP, resulting in a timeout error. Currently, this error is ignored, >> and the session is closed, causing all the SMMU mappings associated >> with that specific PD to be removed. However, since the PD is still >> operational on the DSP, any attempt to access these SMMU mappings >> results in an SMMU fault, leading to a panic. As the SMMU mappings >> have already been removed, there is no available information on the >> DSP to determine the root cause of its unresponsiveness to remote >> calls. As the DSP is unresponsive to all process remote calls, use >> BUG_ON to prevent the removal of SMMU mappings and to properly >> identify the root cause of the DSP’s unresponsiveness to the remote >> calls. >> >> Signed-off-by: Abhishek Singh <quic_abhishes@quicinc.com> >> --- >> drivers/misc/fastrpc.c | 4 ++++ >> 1 file changed, 4 insertions(+) >> >> diff --git a/drivers/misc/fastrpc.c b/drivers/misc/fastrpc.c >> index 5204fda51da3..bac9c749564c 100644 >> --- a/drivers/misc/fastrpc.c >> +++ b/drivers/misc/fastrpc.c >> @@ -97,6 +97,7 @@ >> #define FASTRPC_RMID_INIT_CREATE_STATIC 8 >> #define FASTRPC_RMID_INIT_MEM_MAP 10 >> #define FASTRPC_RMID_INIT_MEM_UNMAP 11 >> +#define PROCESS_KILL_SC 0x01010000 >> >> /* Protection Domain(PD) ids */ >> #define ROOT_PD (0) >> @@ -1128,6 +1129,9 @@ static int fastrpc_invoke_send(struct fastrpc_session_ctx *sctx, >> fastrpc_context_get(ctx); >> >> ret = rpmsg_send(cctx->rpdev->ept, (void *)msg, sizeof(*msg)); >> + /* trigger panic if glink communication is broken and the message is for PD kill */ >> + BUG_ON((ret == -ETIMEDOUT) && (handle == FASTRPC_INIT_HANDLE) && >> + (ctx->sc == PROCESS_KILL_SC)); > > You just crashed the machine completely, sorry, but no, properly handle > the issue and clean up if you can detect it, do not break systems. > But the Glink communication with DSP is already broken; we cannot communicate with the DSP. The system will crash if we proceed with cleanup on the ARM side. If we don’t do cleanup, a resource leak will occur. Eventually, the system will become dead. That’s why I am crashing the device. > greg k-h
On Mon, Aug 05, 2024 at 04:36:28PM +0530, Abhishek Singh wrote: > > On 7/30/2024 12:46 PM, Greg KH wrote: > > On Tue, Jul 30, 2024 at 12:39:45PM +0530, Abhishek Singh wrote: > >> The user process on ARM closes the device node while closing the > >> session, triggers a remote call to terminate the PD running on the > >> DSP. If the DSP is in an unstable state and cannot process the remote > >> request from the HLOS, glink fails to deliver the kill request to the > >> DSP, resulting in a timeout error. Currently, this error is ignored, > >> and the session is closed, causing all the SMMU mappings associated > >> with that specific PD to be removed. However, since the PD is still > >> operational on the DSP, any attempt to access these SMMU mappings > >> results in an SMMU fault, leading to a panic. As the SMMU mappings > >> have already been removed, there is no available information on the > >> DSP to determine the root cause of its unresponsiveness to remote > >> calls. As the DSP is unresponsive to all process remote calls, use > >> BUG_ON to prevent the removal of SMMU mappings and to properly > >> identify the root cause of the DSP’s unresponsiveness to the remote > >> calls. > >> > >> Signed-off-by: Abhishek Singh <quic_abhishes@quicinc.com> > >> --- > >> drivers/misc/fastrpc.c | 4 ++++ > >> 1 file changed, 4 insertions(+) > >> > >> diff --git a/drivers/misc/fastrpc.c b/drivers/misc/fastrpc.c > >> index 5204fda51da3..bac9c749564c 100644 > >> --- a/drivers/misc/fastrpc.c > >> +++ b/drivers/misc/fastrpc.c > >> @@ -97,6 +97,7 @@ > >> #define FASTRPC_RMID_INIT_CREATE_STATIC 8 > >> #define FASTRPC_RMID_INIT_MEM_MAP 10 > >> #define FASTRPC_RMID_INIT_MEM_UNMAP 11 > >> +#define PROCESS_KILL_SC 0x01010000 > >> > >> /* Protection Domain(PD) ids */ > >> #define ROOT_PD (0) > >> @@ -1128,6 +1129,9 @@ static int fastrpc_invoke_send(struct fastrpc_session_ctx *sctx, > >> fastrpc_context_get(ctx); > >> > >> ret = rpmsg_send(cctx->rpdev->ept, (void *)msg, sizeof(*msg)); > >> + /* trigger panic if glink communication is broken and the message is for PD kill */ > >> + BUG_ON((ret == -ETIMEDOUT) && (handle == FASTRPC_INIT_HANDLE) && > >> + (ctx->sc == PROCESS_KILL_SC)); > > > > You just crashed the machine completely, sorry, but no, properly handle > > the issue and clean up if you can detect it, do not break systems. > > > But the Glink communication with DSP is already broken; we cannot communicate with the DSP. > The system will crash if we proceed with cleanup on the ARM side. If we don’t do cleanup, > a resource leak will occur. Eventually, the system will become dead. That’s why I am > crashing the device. Then explicitly call panic() if you think you really want to shut the system down. greg k-h
Hi Abishek, On 30/07/2024 09:09, Abhishek Singh wrote: > The user process on ARM closes the device node while closing the > session, triggers a remote call to terminate the PD running on the > DSP. If the DSP is in an unstable state and cannot process the remote > request from the HLOS, glink fails to deliver the kill request to the > DSP, resulting in a timeout error. Currently, this error is ignored, > and the session is closed, causing all the SMMU mappings associated > with that specific PD to be removed. However, since the PD is still > operational on the DSP, any attempt to access these SMMU mappings > results in an SMMU fault, leading to a panic. As the SMMU mappings > have already been removed, there is no available information on the > DSP to determine the root cause of its unresponsiveness to remote > calls. As the DSP is unresponsive to all process remote calls, use > BUG_ON to prevent the removal of SMMU mappings and to properly > identify the root cause of the DSP’s unresponsiveness to the remote > calls. Could you elaborate a little about what contexts this can happen? What SoC/board are you hitting this on? Is it running pre-production firmware? Since fastrpc is not at all required for basic functionality of the device, maybe it would be better to print an error here and then prevent trying to open new connections to the DSP? Kind regards, > > Signed-off-by: Abhishek Singh <quic_abhishes@quicinc.com> > --- > drivers/misc/fastrpc.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/drivers/misc/fastrpc.c b/drivers/misc/fastrpc.c > index 5204fda51da3..bac9c749564c 100644 > --- a/drivers/misc/fastrpc.c > +++ b/drivers/misc/fastrpc.c > @@ -97,6 +97,7 @@ > #define FASTRPC_RMID_INIT_CREATE_STATIC 8 > #define FASTRPC_RMID_INIT_MEM_MAP 10 > #define FASTRPC_RMID_INIT_MEM_UNMAP 11 > +#define PROCESS_KILL_SC 0x01010000 > > /* Protection Domain(PD) ids */ > #define ROOT_PD (0) > @@ -1128,6 +1129,9 @@ static int fastrpc_invoke_send(struct fastrpc_session_ctx *sctx, > fastrpc_context_get(ctx); > > ret = rpmsg_send(cctx->rpdev->ept, (void *)msg, sizeof(*msg)); > + /* trigger panic if glink communication is broken and the message is for PD kill */ > + BUG_ON((ret == -ETIMEDOUT) && (handle == FASTRPC_INIT_HANDLE) && > + (ctx->sc == PROCESS_KILL_SC)); > > if (ret) > fastrpc_context_put(ctx);
On 8/13/2024 3:07 PM, Greg KH wrote: > On Mon, Aug 05, 2024 at 04:36:28PM +0530, Abhishek Singh wrote: >> >> On 7/30/2024 12:46 PM, Greg KH wrote: >>> On Tue, Jul 30, 2024 at 12:39:45PM +0530, Abhishek Singh wrote: >>>> The user process on ARM closes the device node while closing the >>>> session, triggers a remote call to terminate the PD running on the >>>> DSP. If the DSP is in an unstable state and cannot process the remote >>>> request from the HLOS, glink fails to deliver the kill request to the >>>> DSP, resulting in a timeout error. Currently, this error is ignored, >>>> and the session is closed, causing all the SMMU mappings associated >>>> with that specific PD to be removed. However, since the PD is still >>>> operational on the DSP, any attempt to access these SMMU mappings >>>> results in an SMMU fault, leading to a panic. As the SMMU mappings >>>> have already been removed, there is no available information on the >>>> DSP to determine the root cause of its unresponsiveness to remote >>>> calls. As the DSP is unresponsive to all process remote calls, use >>>> BUG_ON to prevent the removal of SMMU mappings and to properly >>>> identify the root cause of the DSP’s unresponsiveness to the remote >>>> calls. >>>> >>>> Signed-off-by: Abhishek Singh <quic_abhishes@quicinc.com> >>>> --- >>>> drivers/misc/fastrpc.c | 4 ++++ >>>> 1 file changed, 4 insertions(+) >>>> >>>> diff --git a/drivers/misc/fastrpc.c b/drivers/misc/fastrpc.c >>>> index 5204fda51da3..bac9c749564c 100644 >>>> --- a/drivers/misc/fastrpc.c >>>> +++ b/drivers/misc/fastrpc.c >>>> @@ -97,6 +97,7 @@ >>>> #define FASTRPC_RMID_INIT_CREATE_STATIC 8 >>>> #define FASTRPC_RMID_INIT_MEM_MAP 10 >>>> #define FASTRPC_RMID_INIT_MEM_UNMAP 11 >>>> +#define PROCESS_KILL_SC 0x01010000 >>>> >>>> /* Protection Domain(PD) ids */ >>>> #define ROOT_PD (0) >>>> @@ -1128,6 +1129,9 @@ static int fastrpc_invoke_send(struct fastrpc_session_ctx *sctx, >>>> fastrpc_context_get(ctx); >>>> >>>> ret = rpmsg_send(cctx->rpdev->ept, (void *)msg, sizeof(*msg)); >>>> + /* trigger panic if glink communication is broken and the message is for PD kill */ >>>> + BUG_ON((ret == -ETIMEDOUT) && (handle == FASTRPC_INIT_HANDLE) && >>>> + (ctx->sc == PROCESS_KILL_SC)); >>> >>> You just crashed the machine completely, sorry, but no, properly handle >>> the issue and clean up if you can detect it, do not break systems. >>> >> But the Glink communication with DSP is already broken; we cannot communicate with the DSP. >> The system will crash if we proceed with cleanup on the ARM side. If we don’t do cleanup, >> a resource leak will occur. Eventually, the system will become dead. That’s why I am >> crashing the device. > > Then explicitly call panic() if you think you really want to shut the > system down. > >> What does it mean to explicitly call panic()? Are you trying to say we should use panic() instead of BUG_ON()? > > greg k-h
diff --git a/drivers/misc/fastrpc.c b/drivers/misc/fastrpc.c index 5204fda51da3..bac9c749564c 100644 --- a/drivers/misc/fastrpc.c +++ b/drivers/misc/fastrpc.c @@ -97,6 +97,7 @@ #define FASTRPC_RMID_INIT_CREATE_STATIC 8 #define FASTRPC_RMID_INIT_MEM_MAP 10 #define FASTRPC_RMID_INIT_MEM_UNMAP 11 +#define PROCESS_KILL_SC 0x01010000 /* Protection Domain(PD) ids */ #define ROOT_PD (0) @@ -1128,6 +1129,9 @@ static int fastrpc_invoke_send(struct fastrpc_session_ctx *sctx, fastrpc_context_get(ctx); ret = rpmsg_send(cctx->rpdev->ept, (void *)msg, sizeof(*msg)); + /* trigger panic if glink communication is broken and the message is for PD kill */ + BUG_ON((ret == -ETIMEDOUT) && (handle == FASTRPC_INIT_HANDLE) && + (ctx->sc == PROCESS_KILL_SC)); if (ret) fastrpc_context_put(ctx);
The user process on ARM closes the device node while closing the session, triggers a remote call to terminate the PD running on the DSP. If the DSP is in an unstable state and cannot process the remote request from the HLOS, glink fails to deliver the kill request to the DSP, resulting in a timeout error. Currently, this error is ignored, and the session is closed, causing all the SMMU mappings associated with that specific PD to be removed. However, since the PD is still operational on the DSP, any attempt to access these SMMU mappings results in an SMMU fault, leading to a panic. As the SMMU mappings have already been removed, there is no available information on the DSP to determine the root cause of its unresponsiveness to remote calls. As the DSP is unresponsive to all process remote calls, use BUG_ON to prevent the removal of SMMU mappings and to properly identify the root cause of the DSP’s unresponsiveness to the remote calls. Signed-off-by: Abhishek Singh <quic_abhishes@quicinc.com> --- drivers/misc/fastrpc.c | 4 ++++ 1 file changed, 4 insertions(+)