Message ID | 20240925103351.1628788-1-quic_mojha@quicinc.com (mailing list archive) |
---|---|
State | Not Applicable |
Headers | show |
Series | remoteproc: qcom: Fix NULL pointer in glink_subdev_stop() | expand |
On Wed, Sep 25, 2024 at 04:03:51PM +0530, Mukesh Ojha wrote: > Multiple call to glink_subdev_stop() for the same remoteproc can happen > if rproc_stop() fails from Process-A that leaves the rproc state to > RPROC_CRASHED state later a call to recovery_store from user space in > Process B triggers rproc_trigger_recovery() of the same remoteproc to > recover it results in NULL pointer dereference issue in > qcom_glink_smem_unregister(). > > Fix it by having a NULL check in glink_subdev_stop(). > > Process-A Process-B > > fatal error interrupt happens > > rproc_crash_handler_work() > mutex_lock_interruptible(&rproc->lock); > ... > > rproc->state = RPROC_CRASHED; > ... > mutex_unlock(&rproc->lock); > > rproc_trigger_recovery() > mutex_lock_interruptible(&rproc->lock); > > adsp_stop() > qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22 > remoteproc remoteproc3: can't stop rproc: -22 I presume that at this point this remoteproc is in some undefined state and the only way to recover is for the user to reboot the machine? The check for glink->edge avoids one pitfall following this, but I'd prefer to see a solution that avoids issues in this scenario in the remoteproc core - rather than working around side effects of this in different places. Regards, Bjorn > mutex_unlock(&rproc->lock); > > echo enabled > /sys/class/remoteproc/remoteprocX/recovery > recovery_store() > rproc_trigger_recovery() > mutex_lock_interruptible(&rproc->lock); > rproc_stop() > glink_subdev_stop() > qcom_glink_smem_unregister() ==| > | > V > Unable to handle kernel NULL pointer dereference > at virtual address 0000000000000358 > > Signed-off-by: Mukesh Ojha <quic_mojha@quicinc.com> > --- > - We can do this NULL check in qcom_glink_smem_unregister() as it is > exported function however, there is only one user of this. So, doing > it with current approach should also be fine. > > drivers/remoteproc/qcom_common.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/drivers/remoteproc/qcom_common.c b/drivers/remoteproc/qcom_common.c > index 8c8688f99f0a..52d6c9b99fdb 100644 > --- a/drivers/remoteproc/qcom_common.c > +++ b/drivers/remoteproc/qcom_common.c > @@ -209,6 +209,9 @@ static void glink_subdev_stop(struct rproc_subdev *subdev, bool crashed) > { > struct qcom_rproc_glink *glink = to_glink_subdev(subdev); > > + if (!glink->edge) > + return; > + > qcom_glink_smem_unregister(glink->edge); > glink->edge = NULL; > } > -- > 2.34.1 > >
On Wed, Sep 25, 2024 at 08:41:55PM -0700, Bjorn Andersson wrote: > On Wed, Sep 25, 2024 at 04:03:51PM +0530, Mukesh Ojha wrote: > > Multiple call to glink_subdev_stop() for the same remoteproc can happen > > if rproc_stop() fails from Process-A that leaves the rproc state to > > RPROC_CRASHED state later a call to recovery_store from user space in > > Process B triggers rproc_trigger_recovery() of the same remoteproc to > > recover it results in NULL pointer dereference issue in > > qcom_glink_smem_unregister(). > > > > Fix it by having a NULL check in glink_subdev_stop(). > > > > Process-A Process-B > > > > fatal error interrupt happens > > > > rproc_crash_handler_work() > > mutex_lock_interruptible(&rproc->lock); > > ... > > > > rproc->state = RPROC_CRASHED; > > ... > > mutex_unlock(&rproc->lock); > > > > rproc_trigger_recovery() > > mutex_lock_interruptible(&rproc->lock); > > > > adsp_stop() > > qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22 > > remoteproc remoteproc3: can't stop rproc: -22 > > I presume that at this point this remoteproc is in some undefined state > and the only way to recover is for the user to reboot the machine? Here, 50+ (5s) retry of scm shutdown is failing during decryption of remote processor memory region, and i don't think, it is anyway to do with remote processor state here, as a best effort more number of retries can be tried instead of 50 or wait for some other recovery command like recovery_store() to let it do the retry again from beginning. > > > The check for glink->edge avoids one pitfall following this, but I'd > prefer to see a solution that avoids issues in this scenario in the > remoteproc core - rather than working around side effects of this in > different places. Handling in a remoteproc core means we may need another state something like "RPROC_UNKNOWN" which can be kept after one attempt of recovery failure and checking the same during another try return immediately with some log message. -Mukesh
On Sat, Sep 28, 2024 at 01:07:43AM +0530, Mukesh Ojha wrote: > On Wed, Sep 25, 2024 at 08:41:55PM -0700, Bjorn Andersson wrote: > > On Wed, Sep 25, 2024 at 04:03:51PM +0530, Mukesh Ojha wrote: > > > Multiple call to glink_subdev_stop() for the same remoteproc can happen > > > if rproc_stop() fails from Process-A that leaves the rproc state to > > > RPROC_CRASHED state later a call to recovery_store from user space in > > > Process B triggers rproc_trigger_recovery() of the same remoteproc to > > > recover it results in NULL pointer dereference issue in > > > qcom_glink_smem_unregister(). > > > > > > Fix it by having a NULL check in glink_subdev_stop(). > > > > > > Process-A Process-B > > > > > > fatal error interrupt happens > > > > > > rproc_crash_handler_work() > > > mutex_lock_interruptible(&rproc->lock); > > > ... > > > > > > rproc->state = RPROC_CRASHED; > > > ... > > > mutex_unlock(&rproc->lock); > > > > > > rproc_trigger_recovery() > > > mutex_lock_interruptible(&rproc->lock); > > > > > > adsp_stop() > > > qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22 > > > remoteproc remoteproc3: can't stop rproc: -22 > > > > I presume that at this point this remoteproc is in some undefined state > > and the only way to recover is for the user to reboot the machine? > > Here, 50+ (5s) retry of scm shutdown is failing during decryption of > remote processor memory region, and i don't think, it is anyway to do > with remote processor state here, as a best effort more number of > retries can be tried instead of 50 or wait for some other recovery > command like recovery_store() to let it do the retry again from > beginning. > But are you saying that retrying a bit later would allow us to get out of this problem? If we just didn't hit the NULL pointer(s)? How long are we talking about here? Is the timeout too short? > > > > > > The check for glink->edge avoids one pitfall following this, but I'd > > prefer to see a solution that avoids issues in this scenario in the > > remoteproc core - rather than working around side effects of this in > > different places. > > Handling in a remoteproc core means we may need another state something > like "RPROC_UNKNOWN" which can be kept after one attempt of recovery > failure and checking the same during another try return immediately with > some log message. > Yes, if we are failing to shut down the remoteproc and there's no way for us to reliably recover from that (e.g. we are not able to reclaim the memory), it seems reasonable that we have to mark it using a new state. If that is the case, I'd call it RPROC_DEFUNCT (or something like that instead), because while in some "unknown" state, from a remoteproc framework's point of view, it's in a well known (broken) state. Regards, Bjorn
On Fri, Sep 27, 2024 at 02:59:09PM -0700, Bjorn Andersson wrote: > On Sat, Sep 28, 2024 at 01:07:43AM +0530, Mukesh Ojha wrote: > > On Wed, Sep 25, 2024 at 08:41:55PM -0700, Bjorn Andersson wrote: > > > On Wed, Sep 25, 2024 at 04:03:51PM +0530, Mukesh Ojha wrote: > > > > Multiple call to glink_subdev_stop() for the same remoteproc can happen > > > > if rproc_stop() fails from Process-A that leaves the rproc state to > > > > RPROC_CRASHED state later a call to recovery_store from user space in > > > > Process B triggers rproc_trigger_recovery() of the same remoteproc to > > > > recover it results in NULL pointer dereference issue in > > > > qcom_glink_smem_unregister(). > > > > > > > > Fix it by having a NULL check in glink_subdev_stop(). > > > > > > > > Process-A Process-B > > > > > > > > fatal error interrupt happens > > > > > > > > rproc_crash_handler_work() > > > > mutex_lock_interruptible(&rproc->lock); > > > > ... > > > > > > > > rproc->state = RPROC_CRASHED; > > > > ... > > > > mutex_unlock(&rproc->lock); > > > > > > > > rproc_trigger_recovery() > > > > mutex_lock_interruptible(&rproc->lock); > > > > > > > > adsp_stop() > > > > qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22 > > > > remoteproc remoteproc3: can't stop rproc: -22 > > > > > > I presume that at this point this remoteproc is in some undefined state > > > and the only way to recover is for the user to reboot the machine? > > > > Here, 50+ (5s) retry of scm shutdown is failing during decryption of > > remote processor memory region, and i don't think, it is anyway to do > > with remote processor state here, as a best effort more number of > > retries can be tried instead of 50 or wait for some other recovery > > command like recovery_store() to let it do the retry again from > > beginning. > > > > But are you saying that retrying a bit later would allow us to get out > of this problem? If we just didn't hit the NULL pointer(s)? I am not sure whether adding more number of retries will solve the issue and initially thinking from perspective that, it is better to retry than to leave the remoteproc in some unknown state however, I do get that letting it retry could give unnecessary patching every code e.g., ssr notifier callbacks, which is not expecting to be called twice as a side-effect of remoteproc unknown state. > > How long are we talking about here? Is the timeout too short? 5sec is very significant amount of time in wait for remote processor to get recovered, we should not change this. > > > > > > > > > > The check for glink->edge avoids one pitfall following this, but I'd > > > prefer to see a solution that avoids issues in this scenario in the > > > remoteproc core - rather than working around side effects of this in > > > different places. > > > > Handling in a remoteproc core means we may need another state something > > like "RPROC_UNKNOWN" which can be kept after one attempt of recovery > > failure and checking the same during another try return immediately with > > some log message. > > > > Yes, if we are failing to shut down the remoteproc and there's no way > for us to reliably recover from that (e.g. we are not able to reclaim > the memory), it seems reasonable that we have to mark it using a new > state. > > If that is the case, I'd call it RPROC_DEFUNCT (or something like that > instead), because while in some "unknown" state, from a remoteproc > framework's point of view, it's in a well known (broken) state. Ack. -Mukesh > > Regards, > Bjorn
On Tue, Oct 01, 2024 at 12:06:17PM +0530, Mukesh Ojha wrote: > On Fri, Sep 27, 2024 at 02:59:09PM -0700, Bjorn Andersson wrote: > > On Sat, Sep 28, 2024 at 01:07:43AM +0530, Mukesh Ojha wrote: > > > On Wed, Sep 25, 2024 at 08:41:55PM -0700, Bjorn Andersson wrote: > > > > On Wed, Sep 25, 2024 at 04:03:51PM +0530, Mukesh Ojha wrote: > > > > > Multiple call to glink_subdev_stop() for the same remoteproc can happen > > > > > if rproc_stop() fails from Process-A that leaves the rproc state to > > > > > RPROC_CRASHED state later a call to recovery_store from user space in > > > > > Process B triggers rproc_trigger_recovery() of the same remoteproc to > > > > > recover it results in NULL pointer dereference issue in > > > > > qcom_glink_smem_unregister(). > > > > > > > > > > Fix it by having a NULL check in glink_subdev_stop(). > > > > > > > > > > Process-A Process-B > > > > > > > > > > fatal error interrupt happens > > > > > > > > > > rproc_crash_handler_work() > > > > > mutex_lock_interruptible(&rproc->lock); > > > > > ... > > > > > > > > > > rproc->state = RPROC_CRASHED; > > > > > ... > > > > > mutex_unlock(&rproc->lock); > > > > > > > > > > rproc_trigger_recovery() > > > > > mutex_lock_interruptible(&rproc->lock); > > > > > > > > > > adsp_stop() > > > > > qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22 > > > > > remoteproc remoteproc3: can't stop rproc: -22 > > > > > > > > I presume that at this point this remoteproc is in some undefined state > > > > and the only way to recover is for the user to reboot the machine? > > > > > > Here, 50+ (5s) retry of scm shutdown is failing during decryption of > > > remote processor memory region, and i don't think, it is anyway to do > > > with remote processor state here, as a best effort more number of > > > retries can be tried instead of 50 or wait for some other recovery > > > command like recovery_store() to let it do the retry again from > > > beginning. > > > > > > > But are you saying that retrying a bit later would allow us to get out > > of this problem? If we just didn't hit the NULL pointer(s)? > > I am not sure whether adding more number of retries will solve the issue > and initially thinking from perspective that, it is better to retry than > to leave the remoteproc in some unknown state however, I do get that > letting it retry could give unnecessary patching every code e.g., ssr > notifier callbacks, which is not expecting to be called twice as a > side-effect of remoteproc unknown state. That's not what I'm asking you. When the remote processor fails to stop, can you recover the system by just trying a bit later, or is the remoteproc dead until reboot? > > > > How long are we talking about here? Is the timeout too short? > > 5sec is very significant amount of time in wait for remote processor to > get recovered, we should not change this. Okay, I'm just trying to understand the problem you're trying to solve. Regards, Bjorn > > > > > > > > > > > > > > The check for glink->edge avoids one pitfall following this, but I'd > > > > prefer to see a solution that avoids issues in this scenario in the > > > > remoteproc core - rather than working around side effects of this in > > > > different places. > > > > > > Handling in a remoteproc core means we may need another state something > > > like "RPROC_UNKNOWN" which can be kept after one attempt of recovery > > > failure and checking the same during another try return immediately with > > > some log message. > > > > > > > Yes, if we are failing to shut down the remoteproc and there's no way > > for us to reliably recover from that (e.g. we are not able to reclaim > > the memory), it seems reasonable that we have to mark it using a new > > state. > > > > If that is the case, I'd call it RPROC_DEFUNCT (or something like that > > instead), because while in some "unknown" state, from a remoteproc > > framework's point of view, it's in a well known (broken) state. > > Ack. > > -Mukesh > > > > Regards, > > Bjorn
On Tue, Oct 01, 2024 at 02:15:52PM -0700, Bjorn Andersson wrote: > On Tue, Oct 01, 2024 at 12:06:17PM +0530, Mukesh Ojha wrote: > > On Fri, Sep 27, 2024 at 02:59:09PM -0700, Bjorn Andersson wrote: > > > On Sat, Sep 28, 2024 at 01:07:43AM +0530, Mukesh Ojha wrote: > > > > On Wed, Sep 25, 2024 at 08:41:55PM -0700, Bjorn Andersson wrote: > > > > > On Wed, Sep 25, 2024 at 04:03:51PM +0530, Mukesh Ojha wrote: > > > > > > Multiple call to glink_subdev_stop() for the same remoteproc can happen > > > > > > if rproc_stop() fails from Process-A that leaves the rproc state to > > > > > > RPROC_CRASHED state later a call to recovery_store from user space in > > > > > > Process B triggers rproc_trigger_recovery() of the same remoteproc to > > > > > > recover it results in NULL pointer dereference issue in > > > > > > qcom_glink_smem_unregister(). > > > > > > > > > > > > Fix it by having a NULL check in glink_subdev_stop(). > > > > > > > > > > > > Process-A Process-B > > > > > > > > > > > > fatal error interrupt happens > > > > > > > > > > > > rproc_crash_handler_work() > > > > > > mutex_lock_interruptible(&rproc->lock); > > > > > > ... > > > > > > > > > > > > rproc->state = RPROC_CRASHED; > > > > > > ... > > > > > > mutex_unlock(&rproc->lock); > > > > > > > > > > > > rproc_trigger_recovery() > > > > > > mutex_lock_interruptible(&rproc->lock); > > > > > > > > > > > > adsp_stop() > > > > > > qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22 > > > > > > remoteproc remoteproc3: can't stop rproc: -22 > > > > > > > > > > I presume that at this point this remoteproc is in some undefined state > > > > > and the only way to recover is for the user to reboot the machine? > > > > > > > > Here, 50+ (5s) retry of scm shutdown is failing during decryption of > > > > remote processor memory region, and i don't think, it is anyway to do > > > > with remote processor state here, as a best effort more number of > > > > retries can be tried instead of 50 or wait for some other recovery > > > > command like recovery_store() to let it do the retry again from > > > > beginning. > > > > > > > > > > But are you saying that retrying a bit later would allow us to get out > > > of this problem? If we just didn't hit the NULL pointer(s)? > > > > I am not sure whether adding more number of retries will solve the issue > > and initially thinking from perspective that, it is better to retry than > > to leave the remoteproc in some unknown state however, I do get that > > letting it retry could give unnecessary patching every code e.g., ssr > > notifier callbacks, which is not expecting to be called twice as a > > side-effect of remoteproc unknown state. > > That's not what I'm asking you. When the remote processor fails to stop, > can you recover the system by just trying a bit later, or is the > remoteproc dead until reboot? I cannot say this with certainty. For the current issue, the remoteproc fails to stop as it is running out of heap memory. -Mukesh > > > > > > > How long are we talking about here? Is the timeout too short? > > > > 5sec is very significant amount of time in wait for remote processor to > > get recovered, we should not change this. > > Okay, I'm just trying to understand the problem you're trying to solve. > > Regards, > Bjorn > > > > > > > > > > > > > > > > > > > The check for glink->edge avoids one pitfall following this, but I'd > > > > > prefer to see a solution that avoids issues in this scenario in the > > > > > remoteproc core - rather than working around side effects of this in > > > > > different places. > > > > > > > > Handling in a remoteproc core means we may need another state something > > > > like "RPROC_UNKNOWN" which can be kept after one attempt of recovery > > > > failure and checking the same during another try return immediately with > > > > some log message. > > > > > > > > > > Yes, if we are failing to shut down the remoteproc and there's no way > > > for us to reliably recover from that (e.g. we are not able to reclaim > > > the memory), it seems reasonable that we have to mark it using a new > > > state. > > > > > > If that is the case, I'd call it RPROC_DEFUNCT (or something like that > > > instead), because while in some "unknown" state, from a remoteproc > > > framework's point of view, it's in a well known (broken) state. > > > > Ack. > > > > -Mukesh > > > > > > Regards, > > > Bjorn
diff --git a/drivers/remoteproc/qcom_common.c b/drivers/remoteproc/qcom_common.c index 8c8688f99f0a..52d6c9b99fdb 100644 --- a/drivers/remoteproc/qcom_common.c +++ b/drivers/remoteproc/qcom_common.c @@ -209,6 +209,9 @@ static void glink_subdev_stop(struct rproc_subdev *subdev, bool crashed) { struct qcom_rproc_glink *glink = to_glink_subdev(subdev); + if (!glink->edge) + return; + qcom_glink_smem_unregister(glink->edge); glink->edge = NULL; }
Multiple call to glink_subdev_stop() for the same remoteproc can happen if rproc_stop() fails from Process-A that leaves the rproc state to RPROC_CRASHED state later a call to recovery_store from user space in Process B triggers rproc_trigger_recovery() of the same remoteproc to recover it results in NULL pointer dereference issue in qcom_glink_smem_unregister(). Fix it by having a NULL check in glink_subdev_stop(). Process-A Process-B fatal error interrupt happens rproc_crash_handler_work() mutex_lock_interruptible(&rproc->lock); ... rproc->state = RPROC_CRASHED; ... mutex_unlock(&rproc->lock); rproc_trigger_recovery() mutex_lock_interruptible(&rproc->lock); adsp_stop() qcom_q6v5_pas 20c00000.remoteproc: failed to shutdown: -22 remoteproc remoteproc3: can't stop rproc: -22 mutex_unlock(&rproc->lock); echo enabled > /sys/class/remoteproc/remoteprocX/recovery recovery_store() rproc_trigger_recovery() mutex_lock_interruptible(&rproc->lock); rproc_stop() glink_subdev_stop() qcom_glink_smem_unregister() ==| | V Unable to handle kernel NULL pointer dereference at virtual address 0000000000000358 Signed-off-by: Mukesh Ojha <quic_mojha@quicinc.com> --- - We can do this NULL check in qcom_glink_smem_unregister() as it is exported function however, there is only one user of this. So, doing it with current approach should also be fine. drivers/remoteproc/qcom_common.c | 3 +++ 1 file changed, 3 insertions(+)