remoteproc: Create a separate workqueue for recovery tasks

Message ID: 1607806087-27244-1-git-send-email-rishabhb@codeaurora.org (mailing list archive)
State: New, archived
From: Rishabh Bhatnagar <rishabhb@codeaurora.org>
To: linux-remoteproc@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: tsoni@codeaurora.org, bjorn.andersson@linaro.org, psodagud@codeaurora.org, sidgup@codeaurora.org, Rishabh Bhatnagar <rishabhb@codeaurora.org>
Subject: [PATCH] remoteproc: Create a separate workqueue for recovery tasks
Date: Sat, 12 Dec 2020 12:48:07 -0800

Rishabh Bhatnagar Dec. 12, 2020, 8:48 p.m. UTC

Create an unbound high priority workqueue for recovery tasks.
Recovery time is an important parameter for a subsystem and there
might be situations where multiple subsystems crash around the same
time. Scheduling into an unbound workqueue increases parallelization
and avoids time impact. Also creating a high priority workqueue
will utilize separate worker threads with higher priority (lower nice
values) than normal ones.

Signed-off-by: Rishabh Bhatnagar <rishabhb@codeaurora.org>
---
 drivers/remoteproc/remoteproc_core.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)
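
To make the latency argument concrete, here is a small user-space model (not kernel code) of several crash handlers that all become runnable at the same time. It compares the worst-case completion time when the handlers run back to back on a single worker with the case where each handler gets its own worker, which is the behaviour an unbound workqueue permits. The function names and the per-handler costs are hypothetical, and the serial case is a deliberate simplification of how a shared per-CPU worker pool behaves.

```c
#include <stddef.h>

/*
 * Hypothetical model, not kernel code: each entry in cost_ms[] is the time
 * one subsystem's crash handler needs, and all crashes arrive at t = 0.
 */

/* One worker runs the handlers back to back: the last finishes at the sum. */
unsigned long serial_worst_latency_ms(const unsigned long *cost_ms, size_t n)
{
	unsigned long total = 0;

	for (size_t i = 0; i < n; i++)
		total += cost_ms[i];
	return total;
}

/* One worker per handler: the last finishes when the slowest handler does. */
unsigned long parallel_worst_latency_ms(const unsigned long *cost_ms, size_t n)
{
	unsigned long worst = 0;

	for (size_t i = 0; i < n; i++)
		if (cost_ms[i] > worst)
			worst = cost_ms[i];
	return worst;
}
```

With assumed costs of 300 ms and 500 ms, the serial model finishes the last recovery at 800 ms, while the parallel model finishes it at 500 ms.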

Bjorn Andersson Dec. 15, 2020, 10:55 p.m. UTC | #1

On Sat 12 Dec 14:48 CST 2020, Rishabh Bhatnagar wrote:

> Create an unbound high priority workqueue for recovery tasks.

This simply repeats $subject

> Recovery time is an important parameter for a subsystem and there
> might be situations where multiple subsystems crash around the same
> time.  Scheduling into an unbound workqueue increases parallelization
> and avoids time impact.

You should be able to write this more succinctly. The important part is
that you want an unbound work queue to allow recovery to happen in
parallel - which naturally implies that you care about recovery latency.

> Also creating a high priority workqueue
> will utilize separate worker threads with higher priority (lower nice
> values) than normal ones.
> 

This doesn't describe why you need the higher priority.


I believe, and certainly with the in-line coredump, that we're running
our recovery work for way too long to be queued on the system_wq. As
such the content of the patch looks good!

Regards,
Bjorn

> Signed-off-by: Rishabh Bhatnagar <rishabhb@codeaurora.org>
> ---
>  drivers/remoteproc/remoteproc_core.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> index 46c2937..8fd8166 100644
> --- a/drivers/remoteproc/remoteproc_core.c
> +++ b/drivers/remoteproc/remoteproc_core.c
> @@ -48,6 +48,8 @@ static DEFINE_MUTEX(rproc_list_mutex);
>  static LIST_HEAD(rproc_list);
>  static struct notifier_block rproc_panic_nb;
>  
> +static struct workqueue_struct *rproc_wq;
> +
>  typedef int (*rproc_handle_resource_t)(struct rproc *rproc,
>  				 void *, int offset, int avail);
>  
> @@ -2475,7 +2477,7 @@ void rproc_report_crash(struct rproc *rproc, enum rproc_crash_type type)
>  		rproc->name, rproc_crash_to_string(type));
>  
>  	/* create a new task to handle the error */
> -	schedule_work(&rproc->crash_handler);
> +	queue_work(rproc_wq, &rproc->crash_handler);
>  }
>  EXPORT_SYMBOL(rproc_report_crash);
>  
> @@ -2520,6 +2522,10 @@ static void __exit rproc_exit_panic(void)
>  
>  static int __init remoteproc_init(void)
>  {
> +	rproc_wq = alloc_workqueue("rproc_wq", WQ_UNBOUND | WQ_HIGHPRI, 0);
> +	if (!rproc_wq)
> +		return -ENOMEM;
> +
>  	rproc_init_sysfs();
>  	rproc_init_debugfs();
>  	rproc_init_cdev();
> @@ -2536,6 +2542,7 @@ static void __exit remoteproc_exit(void)
>  	rproc_exit_panic();
>  	rproc_exit_debugfs();
>  	rproc_exit_sysfs();
> +	destroy_workqueue(rproc_wq);
>  }
>  module_exit(remoteproc_exit);
>  
> -- 
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
> a Linux Foundation Collaborative Project
>

Alex Elder Dec. 17, 2020, 4:12 p.m. UTC | #2

On 12/15/20 4:55 PM, Bjorn Andersson wrote:
> On Sat 12 Dec 14:48 CST 2020, Rishabh Bhatnagar wrote:
> 
>> Create an unbound high priority workqueue for recovery tasks.

I have been looking at a different issue that is caused by
crash notification.

What happened was that the modem crashed while the AP was
in system suspend (or possibly even resuming) state.  And
there is no guarantee that the system will have called a
driver's ->resume callback when the crash notification is
delivered.

In my case (in the IPA driver), handling a modem crash
cannot be done while the driver is suspended; i.e. the
activities in its ->resume callback must be completed
before we can recover from the crash.

For this reason I might like to change the way the
crash notification is handled, but what I'd rather see
is to have the work queue not run until user space
is unfrozen, which would guarantee that all drivers
that have registered for a crash notification will
be resumed when the notification arrives.

I'm not sure how that interacts with what you are
looking for here.  I think the workqueue could still
be unbound, but its work would be delayed longer before
any notification (and recovery) started.

					-Alex




Rishabh Bhatnagar Dec. 17, 2020, 6:21 p.m. UTC | #3

On 2020-12-17 08:12, Alex Elder wrote:
> On 12/15/20 4:55 PM, Bjorn Andersson wrote:
>> On Sat 12 Dec 14:48 CST 2020, Rishabh Bhatnagar wrote:
>> 
>>> Create an unbound high priority workqueue for recovery tasks.
> 
> I have been looking at a different issue that is caused by
> crash notification.
> 
> What happened was that the modem crashed while the AP was
> in system suspend (or possibly even resuming) state.  And
> there is no guarantee that the system will have called a
> driver's ->resume callback when the crash notification is
> delivered.
> 
> In my case (in the IPA driver), handling a modem crash
> cannot be done while the driver is suspended; i.e. the
> activities in its ->resume callback must be completed
> before we can recover from the crash.
> 
> For this reason I might like to change the way the
> crash notification is handled, but what I'd rather see
> is to have the work queue not run until user space
> is unfrozen, which would guarantee that all drivers
> that have registered for a crash notification will
> be resumed when the notification arrives.
> 
> I'm not sure how that interacts with what you are
> looking for here.  I think the workqueue could still
> be unbound, but its work would be delayed longer before
> any notification (and recovery) started.
> 
> 					-Alex
> 
> 
In that case, maybe adding a "WQ_FREEZABLE" flag might help?

Alex Elder Dec. 17, 2020, 6:49 p.m. UTC | #4

On 12/17/20 12:21 PM, rishabhb@codeaurora.org wrote:
> On 2020-12-17 08:12, Alex Elder wrote:
>> On 12/15/20 4:55 PM, Bjorn Andersson wrote:
>>> On Sat 12 Dec 14:48 CST 2020, Rishabh Bhatnagar wrote:
>>>
>>>> Create an unbound high priority workqueue for recovery tasks.
>>
>> I have been looking at a different issue that is caused by
>> crash notification.
>>
>> What happened was that the modem crashed while the AP was
>> in system suspend (or possibly even resuming) state.  And
>> there is no guarantee that the system will have called a
>> driver's ->resume callback when the crash notification is
>> delivered.
>>
>> In my case (in the IPA driver), handling a modem crash
>> cannot be done while the driver is suspended; i.e. the
>> activities in its ->resume callback must be completed
>> before we can recover from the crash.
>>
>> For this reason I might like to change the way the
>> crash notification is handled, but what I'd rather see
>> is to have the work queue not run until user space
>> is unfrozen, which would guarantee that all drivers
>> that have registered for a crash notification will
>> be resumed when the notification arrives.
>>
>> I'm not sure how that interacts with what you are
>> looking for here.  I think the workqueue could still
>> be unbound, but its work would be delayed longer before
>> any notification (and recovery) started.
>>
>>                     -Alex
>>
>>
> In that case, maybe adding a "WQ_FREEZABLE" flag might help?

Yes, exactly.  But how does that affect whatever you were
trying to do with your patch?

					-Alex

. . .

Bjorn Andersson Dec. 22, 2020, 12:35 a.m. UTC | #5

On Thu 17 Dec 12:49 CST 2020, Alex Elder wrote:

> On 12/17/20 12:21 PM, rishabhb@codeaurora.org wrote:
> > On 2020-12-17 08:12, Alex Elder wrote:
> > > On 12/15/20 4:55 PM, Bjorn Andersson wrote:
> > > > On Sat 12 Dec 14:48 CST 2020, Rishabh Bhatnagar wrote:
> > > > 
> > > > > Create an unbound high priority workqueue for recovery tasks.
> > > 
> > > I have been looking at a different issue that is caused by
> > > crash notification.
> > > 
> > > What happened was that the modem crashed while the AP was
> > > in system suspend (or possibly even resuming) state.  And
> > > there is no guarantee that the system will have called a
> > > driver's ->resume callback when the crash notification is
> > > delivered.
> > > 
> > > In my case (in the IPA driver), handling a modem crash
> > > cannot be done while the driver is suspended; i.e. the
> > > activities in its ->resume callback must be completed
> > > before we can recover from the crash.
> > > 
> > > For this reason I might like to change the way the
> > > crash notification is handled, but what I'd rather see
> > > is to have the work queue not run until user space
> > > is unfrozen, which would guarantee that all drivers
> > > that have registered for a crash notification will
> > > be resumed when the notification arrives.
> > > 
> > > I'm not sure how that interacts with what you are
> > > looking for here.  I think the workqueue could still
> > > be unbound, but its work would be delayed longer before
> > > any notification (and recovery) started.
> > > 
> > >                     -Alex
> > > 
> > > 
> > In that case, maybe adding a "WQ_FREEZABLE" flag might help?
> 
> Yes, exactly.  But how does that affect whatever you were
> trying to do with your patch?
> 

I don't see any impact on Rishabh's change in particular, syntactically
it would just be a matter of adding another flag and the impact would be
separate from his patch.

In other words, creating a separate work queue to get the long running
work off the system_wq and making sure that it doesn't run during
suspend & resume seems very reasonable to me.

The one piece that I'm still contemplating is the HIPRIO, I would like
to better understand the actual impact - or perhaps is this a result of
everyone downstream moving all their work to HIPRIO work queues,
starving the recovery?

Regards,
Bjorn
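
The deferral that WQ_FREEZABLE is expected to provide can be illustrated with a small user-space model. This is not kernel code: the `model_` names, the fixed-size pending array, and the run-immediately behaviour when not frozen are all assumptions made for illustration. In the kernel, the change discussed above would only be an extra flag in the alloc_workqueue() call. The point of the sketch is that a crash handler queued while the queue is frozen does not run until the queue is thawed.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_PENDING 8

typedef void (*model_work_fn)(void *data);

/* Hypothetical stand-in for a freezable workqueue. */
struct model_wq {
	bool frozen;
	size_t npending;
	model_work_fn fn[MODEL_MAX_PENDING];
	void *data[MODEL_MAX_PENDING];
};

/*
 * Run the work at once unless the queue is frozen; while frozen, park it.
 * Returns false only if the pending array is full.
 */
bool model_queue_work(struct model_wq *wq, model_work_fn fn, void *data)
{
	if (!wq->frozen) {
		fn(data);
		return true;
	}
	if (wq->npending == MODEL_MAX_PENDING)
		return false;
	wq->fn[wq->npending] = fn;
	wq->data[wq->npending] = data;
	wq->npending++;
	return true;
}

/* Models the freeze step of system suspend. */
void model_freeze(struct model_wq *wq)
{
	wq->frozen = true;
}

/* Models the thaw step of resume: run parked items in queue order. */
void model_thaw(struct model_wq *wq)
{
	wq->frozen = false;
	for (size_t i = 0; i < wq->npending; i++)
		wq->fn[i](wq->data[i]);
	wq->npending = 0;
}

/* Example handler: counts how many recoveries have run. */
void model_crash_handler(void *data)
{
	int *recoveries = data;

	(*recoveries)++;
}
```

In this model a crash reported between model_freeze() and model_thaw() leaves the recovery counter untouched until the thaw, which mirrors the ordering Alex asks for: drivers get their ->resume callbacks before any crash notification is delivered.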

Rishabh Bhatnagar Jan. 8, 2021, 9:03 p.m. UTC | #6

On 2020-12-21 16:35, Bjorn Andersson wrote:
> On Thu 17 Dec 12:49 CST 2020, Alex Elder wrote:
> 
>> On 12/17/20 12:21 PM, rishabhb@codeaurora.org wrote:
>> > On 2020-12-17 08:12, Alex Elder wrote:
>> > > On 12/15/20 4:55 PM, Bjorn Andersson wrote:
>> > > > On Sat 12 Dec 14:48 CST 2020, Rishabh Bhatnagar wrote:
>> > > >
>> > > > > Create an unbound high priority workqueue for recovery tasks.
>> > >
>> > > I have been looking at a different issue that is caused by
>> > > crash notification.
>> > >
>> > > What happened was that the modem crashed while the AP was
>> > > in system suspend (or possibly even resuming) state.  And
>> > > there is no guarantee that the system will have called a
>> > > driver's ->resume callback when the crash notification is
>> > > delivered.
>> > >
>> > > In my case (in the IPA driver), handling a modem crash
>> > > cannot be done while the driver is suspended; i.e. the
>> > > activities in its ->resume callback must be completed
>> > > before we can recover from the crash.
>> > >
>> > > For this reason I might like to change the way the
>> > > crash notification is handled, but what I'd rather see
>> > > is to have the work queue not run until user space
>> > > is unfrozen, which would guarantee that all drivers
>> > > that have registered for a crash notification will
>> > > be resumed when the notification arrives.
>> > >
>> > > I'm not sure how that interacts with what you are
>> > > looking for here.  I think the workqueue could still
>> > > be unbound, but its work would be delayed longer before
>> > > any notification (and recovery) started.
>> > >
>> > >                     -Alex
>> > >
>> > >
>> > In that case, maybe adding a "WQ_FREEZABLE" flag might help?
>> 
>> Yes, exactly.  But how does that affect whatever you were
>> trying to do with your patch?
>> 
> 
> I don't see any impact on Rishabh's change in particular, syntactically
> it would just be a matter of adding another flag and the impact would be
> separate from his patch.
> 
> In other words, creating a separate work queue to get the long running
> work off the system_wq and making sure that it doesn't run during
> suspend & resume seems very reasonable to me.
> 
> The one piece that I'm still contemplating is the HIPRIO, I would like
> to better understand the actual impact - or perhaps is this a result of
> everyone downstream moving all their work to HIPRIO work queues,
> starving the recovery?
> 
Hi Bjorn,
You are right, this is a result of downstream having HIPRIO workqueues,
which starve recovery. I don't have actual data to support the flag
as of now. If needed, we can skip this flag for now and add it later with
sufficient data?
> Regards,
> Bjorn
