Message ID | 1607806087-27244-1-git-send-email-rishabhb@codeaurora.org (mailing list archive) |
---|---|
State | New, archived |
Series | remoteproc: Create a separate workqueue for recovery tasks |
On Sat 12 Dec 14:48 CST 2020, Rishabh Bhatnagar wrote:

> Create an unbound high priority workqueue for recovery tasks.

This simply repeats $subject

> Recovery time is an important parameter for a subsystem and there
> might be situations where multiple subsystems crash around the same
> time. Scheduling into an unbound workqueue increases parallelization
> and avoids time impact.

You should be able to write this more succinctly. The important part is
that you want an unbound work queue to allow recovery to happen in
parallel - which naturally implies that you care about recovery latency.

> Also creating a high priority workqueue
> will utilize separate worker threads with higher nice values than
> normal ones.
>

This doesn't describe why you need the higher priority.

I believe, and certainly with the in-line coredump, that we're running
our recovery work for way too long to be queued on the system_wq. As
such the content of the patch looks good!

Regards,
Bjorn

> Signed-off-by: Rishabh Bhatnagar <rishabhb@codeaurora.org>
> ---
>  drivers/remoteproc/remoteproc_core.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> index 46c2937..8fd8166 100644
> --- a/drivers/remoteproc/remoteproc_core.c
> +++ b/drivers/remoteproc/remoteproc_core.c
> @@ -48,6 +48,8 @@ static DEFINE_MUTEX(rproc_list_mutex);
>  static LIST_HEAD(rproc_list);
>  static struct notifier_block rproc_panic_nb;
>
> +static struct workqueue_struct *rproc_wq;
> +
>  typedef int (*rproc_handle_resource_t)(struct rproc *rproc,
>  				 void *, int offset, int avail);
>
> @@ -2475,7 +2477,7 @@ void rproc_report_crash(struct rproc *rproc, enum rproc_crash_type type)
>  		rproc->name, rproc_crash_to_string(type));
>
>  	/* create a new task to handle the error */
> -	schedule_work(&rproc->crash_handler);
> +	queue_work(rproc_wq, &rproc->crash_handler);
>  }
>  EXPORT_SYMBOL(rproc_report_crash);
>
> @@ -2520,6 +2522,10 @@ static void __exit rproc_exit_panic(void)
>
>  static int __init remoteproc_init(void)
>  {
> +	rproc_wq = alloc_workqueue("rproc_wq", WQ_UNBOUND | WQ_HIGHPRI, 0);
> +	if (!rproc_wq)
> +		return -ENOMEM;
> +
>  	rproc_init_sysfs();
>  	rproc_init_debugfs();
>  	rproc_init_cdev();
> @@ -2536,6 +2542,7 @@ static void __exit remoteproc_exit(void)
>  	rproc_exit_panic();
>  	rproc_exit_debugfs();
>  	rproc_exit_sysfs();
> +	destroy_workqueue(rproc_wq);
>  }
>  module_exit(remoteproc_exit);
>
> --
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
> a Linux Foundation Collaborative Project
>
On 12/15/20 4:55 PM, Bjorn Andersson wrote:
> On Sat 12 Dec 14:48 CST 2020, Rishabh Bhatnagar wrote:
>
>> Create an unbound high priority workqueue for recovery tasks.

I have been looking at a different issue that is caused by
crash notification.

What happened was that the modem crashed while the AP was
in system suspend (or possibly even resuming) state. And
there is no guarantee that the system will have called a
driver's ->resume callback when the crash notification is
delivered.

In my case (in the IPA driver), handling a modem crash
cannot be done while the driver is suspended; i.e. the
activities in its ->resume callback must be completed
before we can recover from the crash.

For this reason I might like to change the way the
crash notification is handled, but what I'd rather see
is to have the work queue not run until user space
is unfrozen, which would guarantee that all drivers
that have registered for a crash notification will
be resumed when the notification arrives.

I'm not sure how that interacts with what you are
looking for here. I think the workqueue could still
be unbound, but its work would be delayed longer before
any notification (and recovery) started.

-Alex

> This simply repeats $subject
>
>> Recovery time is an important parameter for a subsystem and there
>> might be situations where multiple subsystems crash around the same
>> time. Scheduling into an unbound workqueue increases parallelization
>> and avoids time impact.
>
> You should be able to write this more succinctly. The important part is
> that you want an unbound work queue to allow recovery to happen in
> parallel - which naturally implies that you care about recovery latency.
>
>> Also creating a high priority workqueue
>> will utilize separate worker threads with higher nice values than
>> normal ones.
>>
>
> This doesn't describe why you need the higher priority.
On 2020-12-17 08:12, Alex Elder wrote:
> On 12/15/20 4:55 PM, Bjorn Andersson wrote:
>> On Sat 12 Dec 14:48 CST 2020, Rishabh Bhatnagar wrote:
>>
>>> Create an unbound high priority workqueue for recovery tasks.
>
> I have been looking at a different issue that is caused by
> crash notification.
>
> What happened was that the modem crashed while the AP was
> in system suspend (or possibly even resuming) state. And
> there is no guarantee that the system will have called a
> driver's ->resume callback when the crash notification is
> delivered.
>
> In my case (in the IPA driver), handling a modem crash
> cannot be done while the driver is suspended; i.e. the
> activities in its ->resume callback must be completed
> before we can recover from the crash.
>
> For this reason I might like to change the way the
> crash notification is handled, but what I'd rather see
> is to have the work queue not run until user space
> is unfrozen, which would guarantee that all drivers
> that have registered for a crash notification will
> be resumed when the notification arrives.
>
> I'm not sure how that interacts with what you are
> looking for here. I think the workqueue could still
> be unbound, but its work would be delayed longer before
> any notification (and recovery) started.
>
> -Alex
>
>
In that case, maybe adding a "WQ_FREEZABLE" flag might help?
>
>> This simply repeats $subject
>>
>>> Recovery time is an important parameter for a subsystem and there
>>> might be situations where multiple subsystems crash around the same
>>> time. Scheduling into an unbound workqueue increases parallelization
>>> and avoids time impact.
>>
>> You should be able to write this more succinctly. The important part is
>> that you want an unbound work queue to allow recovery to happen in
>> parallel - which naturally implies that you care about recovery latency.
On 12/17/20 12:21 PM, rishabhb@codeaurora.org wrote:
> On 2020-12-17 08:12, Alex Elder wrote:
>> On 12/15/20 4:55 PM, Bjorn Andersson wrote:
>>> On Sat 12 Dec 14:48 CST 2020, Rishabh Bhatnagar wrote:
>>>
>>>> Create an unbound high priority workqueue for recovery tasks.
>>
>> I have been looking at a different issue that is caused by
>> crash notification.
>>
>> What happened was that the modem crashed while the AP was
>> in system suspend (or possibly even resuming) state. And
>> there is no guarantee that the system will have called a
>> driver's ->resume callback when the crash notification is
>> delivered.
>>
>> In my case (in the IPA driver), handling a modem crash
>> cannot be done while the driver is suspended; i.e. the
>> activities in its ->resume callback must be completed
>> before we can recover from the crash.
>>
>> For this reason I might like to change the way the
>> crash notification is handled, but what I'd rather see
>> is to have the work queue not run until user space
>> is unfrozen, which would guarantee that all drivers
>> that have registered for a crash notification will
>> be resumed when the notification arrives.
>>
>> I'm not sure how that interacts with what you are
>> looking for here. I think the workqueue could still
>> be unbound, but its work would be delayed longer before
>> any notification (and recovery) started.
>>
>> -Alex
>>
>>
> In that case, maybe adding a "WQ_FREEZABLE" flag might help?

Yes, exactly. But how does that affect whatever you were
trying to do with your patch?

-Alex

. . .
On Thu 17 Dec 12:49 CST 2020, Alex Elder wrote:

> On 12/17/20 12:21 PM, rishabhb@codeaurora.org wrote:
> > On 2020-12-17 08:12, Alex Elder wrote:
> > > On 12/15/20 4:55 PM, Bjorn Andersson wrote:
> > > > On Sat 12 Dec 14:48 CST 2020, Rishabh Bhatnagar wrote:
> > > >
> > > > > Create an unbound high priority workqueue for recovery tasks.
> > >
> > > I have been looking at a different issue that is caused by
> > > crash notification.
> > >
> > > What happened was that the modem crashed while the AP was
> > > in system suspend (or possibly even resuming) state. And
> > > there is no guarantee that the system will have called a
> > > driver's ->resume callback when the crash notification is
> > > delivered.
> > >
> > > In my case (in the IPA driver), handling a modem crash
> > > cannot be done while the driver is suspended; i.e. the
> > > activities in its ->resume callback must be completed
> > > before we can recover from the crash.
> > >
> > > For this reason I might like to change the way the
> > > crash notification is handled, but what I'd rather see
> > > is to have the work queue not run until user space
> > > is unfrozen, which would guarantee that all drivers
> > > that have registered for a crash notification will
> > > be resumed when the notification arrives.
> > >
> > > I'm not sure how that interacts with what you are
> > > looking for here. I think the workqueue could still
> > > be unbound, but its work would be delayed longer before
> > > any notification (and recovery) started.
> > >
> > > -Alex
> > >
> > >
> > In that case, maybe adding a "WQ_FREEZABLE" flag might help?
>
> Yes, exactly. But how does that affect whatever you were
> trying to do with your patch?
>

I don't see any impact on Rishabh's change in particular, syntactically
it would just be a matter of adding another flag and the impact would be
separate from his patch.

In other words, creating a separate work queue to get the long running
work off the system_wq and making sure that these doesn't run during
suspend & resume seems very reasonable to me.

The one piece that I'm still contemplating is the HIPRIO, I would like
to better understand the actual impact - or perhaps is this a result of
everyone downstream moving all their work to HIPRIO work queues,
starving the recovery?

Regards,
Bjorn
On 2020-12-21 16:35, Bjorn Andersson wrote:
> On Thu 17 Dec 12:49 CST 2020, Alex Elder wrote:
>
>> On 12/17/20 12:21 PM, rishabhb@codeaurora.org wrote:
>> > On 2020-12-17 08:12, Alex Elder wrote:
>> > > On 12/15/20 4:55 PM, Bjorn Andersson wrote:
>> > > > On Sat 12 Dec 14:48 CST 2020, Rishabh Bhatnagar wrote:
>> > > >
>> > > > > Create an unbound high priority workqueue for recovery tasks.
>> > >
>> > > I have been looking at a different issue that is caused by
>> > > crash notification.
>> > >
>> > > What happened was that the modem crashed while the AP was
>> > > in system suspend (or possibly even resuming) state. And
>> > > there is no guarantee that the system will have called a
>> > > driver's ->resume callback when the crash notification is
>> > > delivered.
>> > >
>> > > In my case (in the IPA driver), handling a modem crash
>> > > cannot be done while the driver is suspended; i.e. the
>> > > activities in its ->resume callback must be completed
>> > > before we can recover from the crash.
>> > >
>> > > For this reason I might like to change the way the
>> > > crash notification is handled, but what I'd rather see
>> > > is to have the work queue not run until user space
>> > > is unfrozen, which would guarantee that all drivers
>> > > that have registered for a crash notification will
>> > > be resumed when the notification arrives.
>> > >
>> > > I'm not sure how that interacts with what you are
>> > > looking for here. I think the workqueue could still
>> > > be unbound, but its work would be delayed longer before
>> > > any notification (and recovery) started.
>> > >
>> > > -Alex
>> > >
>> > >
>> > In that case, maybe adding a "WQ_FREEZABLE" flag might help?
>>
>> Yes, exactly. But how does that affect whatever you were
>> trying to do with your patch?
>>
>
> I don't see any impact on Rishabh's change in particular, syntactically
> it would just be a matter of adding another flag and the impact would be
> separate from his patch.
>
> In other words, creating a separate work queue to get the long running
> work off the system_wq and making sure that these doesn't run during
> suspend & resume seems very reasonable to me.
>
> The one piece that I'm still contemplating is the HIPRIO, I would like
> to better understand the actual impact - or perhaps is this a result of
> everyone downstream moving all their work to HIPRIO work queues,
> starving the recovery?
>
Hi Bjorn,
You are right, this is a result of downstream having HIPRIO workqueues
therefore starving recovery. I don't have actual data to support the flag
as of now. If needed for now we can skip this flag and add it later with
sufficient data?

> Regards,
> Bjorn
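For reference, the direction the thread seems to converge on (a dedicated unbound workqueue, WQ_FREEZABLE for the suspend/resume case Alex describes, and WQ_HIGHPRI left out until there is data) might look roughly like the stand-alone module below. This is only a sketch under those assumptions, not the patch that was posted: the example_* names, the single work item and the log message are made up for illustration, while alloc_workqueue(), INIT_WORK(), queue_work() and destroy_workqueue() are the regular kernel workqueue API.

```c
/* Sketch only: every example_* identifier is hypothetical. */
#include <linux/init.h>
#include <linux/module.h>
#include <linux/workqueue.h>

static struct workqueue_struct *example_wq;
static struct work_struct example_work;

/* Stand-in for a long-running recovery handler. */
static void example_recover(struct work_struct *work)
{
	pr_info("example: recovery work running\n");
}

static int __init example_init(void)
{
	/*
	 * WQ_UNBOUND:   work items are not tied to the submitting CPU.
	 * WQ_FREEZABLE: the queue takes part in the freeze phase of system
	 *               suspend, so no new work item starts until thawed.
	 * max_active 0: use the default concurrency limit.
	 */
	example_wq = alloc_workqueue("example_wq",
				     WQ_UNBOUND | WQ_FREEZABLE, 0);
	if (!example_wq)
		return -ENOMEM;

	INIT_WORK(&example_work, example_recover);
	queue_work(example_wq, &example_work);
	return 0;
}

static void __exit example_exit(void)
{
	/* Pending work is completed before the queue is freed. */
	destroy_workqueue(example_wq);
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Sketch: unbound, freezable workqueue");
```

Whether WQ_HIGHPRI belongs in the flag mask is exactly the open question in the messages above; adding it back would only change the second argument to alloc_workqueue().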
diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
index 46c2937..8fd8166 100644
--- a/drivers/remoteproc/remoteproc_core.c
+++ b/drivers/remoteproc/remoteproc_core.c
@@ -48,6 +48,8 @@ static DEFINE_MUTEX(rproc_list_mutex);
 static LIST_HEAD(rproc_list);
 static struct notifier_block rproc_panic_nb;
 
+static struct workqueue_struct *rproc_wq;
+
 typedef int (*rproc_handle_resource_t)(struct rproc *rproc,
 				 void *, int offset, int avail);
 
@@ -2475,7 +2477,7 @@ void rproc_report_crash(struct rproc *rproc, enum rproc_crash_type type)
 		rproc->name, rproc_crash_to_string(type));
 
 	/* create a new task to handle the error */
-	schedule_work(&rproc->crash_handler);
+	queue_work(rproc_wq, &rproc->crash_handler);
 }
 EXPORT_SYMBOL(rproc_report_crash);
 
@@ -2520,6 +2522,10 @@ static void __exit rproc_exit_panic(void)
 
 static int __init remoteproc_init(void)
 {
+	rproc_wq = alloc_workqueue("rproc_wq", WQ_UNBOUND | WQ_HIGHPRI, 0);
+	if (!rproc_wq)
+		return -ENOMEM;
+
 	rproc_init_sysfs();
 	rproc_init_debugfs();
 	rproc_init_cdev();
@@ -2536,6 +2542,7 @@ static void __exit remoteproc_exit(void)
 	rproc_exit_panic();
 	rproc_exit_debugfs();
 	rproc_exit_sysfs();
+	destroy_workqueue(rproc_wq);
 }
 module_exit(remoteproc_exit);
Create an unbound high priority workqueue for recovery tasks.
Recovery time is an important parameter for a subsystem and there
might be situations where multiple subsystems crash around the same
time. Scheduling into an unbound workqueue increases parallelization
and avoids time impact. Also creating a high priority workqueue
will utilize separate worker threads with higher nice values than
normal ones.

Signed-off-by: Rishabh Bhatnagar <rishabhb@codeaurora.org>
---
 drivers/remoteproc/remoteproc_core.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)